In [1]:
!conda install -yc conda-forge ipython-sql

Collecting package metadata (current_repodata.json): ...working... done
Solving environment: ...working... done

# All requested packages already installed.



# Sequel to Data Analyst career:
## Exploring global territorial dataset with SQL

![banner](https://www.infoplease.com/sites/infoplease.com/files/styles/webp_image/public/inline-images/Global%20Population.jpg.webp)

## Abstract

In this first contact with SQL, we examine general territorial statistics from the CIA World Factbook and look for outliers across several variables. Among other findings, India adds the most people per year in absolute terms, Macau is the most densely populated entry, and the British Indian Ocean Territory is about 99.9% water.

## Table of Contents

1. [**Introduction**](#section1)
2. [**Goals and limitations**](#section2)
3. [**Methodology**](#section3)
4. [**Metadata**](#section4)
5. [**Data Analysis**](#section5)      
6. [**Conclusion**](#section6)

## 1. Introduction<a name="section1"></a>


This project uses the [CIA World Factbook](https://www.cia.gov/the-world-factbook/), which compiles general statistics about every country on Earth. We'll use SQL in Jupyter to summarize and compare the countries.

## 2. Goals and limitations<a name="section2"></a>

In this project, we investigate:

* (1) Which countries have the largest and smallest populations in the dataset?
* (2) Which countries have the lowest population growth rate?
* (3) Which are the most densely populated countries?
* (4) Which countries have the highest ratio of water to land, and which have more water than land?
* (5) Which countries will add the most people to their populations next year?
* (6) Which countries have a higher death rate than birth rate?
* (7) Which countries have the highest population/area ratio, and how does that list compare with the one from question (3)?

Limitations:
* as this is our first project with SQL, the data manipulations are deliberately simple.

## 3. Methodology<a name="section3"></a>

We'll explore the dataset with basic SQLite constructs: SELECT ... FROM, WHERE, AND, NOT IN, ORDER BY, LIMIT, CAST, the aggregates MIN, MAX, and AVG, and subqueries.

## 4. Metadata<a name="section4"></a>

Here are the descriptions for some of the columns:

* name: the name of the country.
* area: the country's total area (both land and water) in square kilometers.
* area_land: the country's land area in square kilometers.
* area_water: the country's water area in square kilometers.
* population: the country's population.
* population_growth: the country's population growth as a percentage.
* birth_rate: the country's birth rate, or the number of births per year per 1,000 people.
* death_rate: the country's death rate, or the number of deaths per year per 1,000 people.

## 5. Data Analysis<a name="section5"></a>

Because this project is meant to familiarize us with SQL, we assume the data contains no mistakes.

In [2]:
#%%capture
%load_ext sql
%sql sqlite:///factbook.db
    


'Connected: @factbook.db'

In [4]:
%%sql
SELECT *
FROM sqlite_master
WHERE type='table'

 * sqlite:///factbook.db
Done.


type,name,tbl_name,rootpage,sql
table,sqlite_sequence,sqlite_sequence,3,"CREATE TABLE sqlite_sequence(name,seq)"
table,facts,facts,47,"CREATE TABLE ""facts"" (""id"" INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL, ""code"" varchar(255) NOT NULL, ""name"" varchar(255) NOT NULL, ""area"" integer, ""area_land"" integer, ""area_water"" integer, ""population"" integer, ""population_growth"" float, ""birth_rate"" float, ""death_rate"" float, ""migration_rate"" float)"


In [5]:
%%sql
SELECT *
    FROM facts
LIMIT 5;

 * sqlite:///factbook.db
Done.


id,code,name,area,area_land,area_water,population,population_growth,birth_rate,death_rate,migration_rate
1,af,Afghanistan,652230,652230,0,32564342,2.32,38.57,13.89,1.51
2,al,Albania,28748,27398,1350,3029278,0.3,12.92,6.58,3.3
3,ag,Algeria,2381741,2381741,0,39542166,1.84,23.67,4.31,0.92
4,an,Andorra,468,468,0,85580,0.12,8.13,6.96,0.0
5,ao,Angola,1246700,1246700,0,19625353,2.78,38.78,11.49,0.46


First, a single query returns the minimum and maximum population and the minimum and maximum population growth.

In [6]:
%%sql
SELECT MIN(population) AS min_pop, MAX(population) AS max_pop, MIN(population_growth) AS min_pop_growth, MAX(population_growth) AS max_population_growth
    FROM facts;

 * sqlite:///factbook.db
Done.


min_pop,max_pop,min_pop_growth,max_population_growth
0,7256490011,0.0,4.02


* It is preposterous to think a country would have a population of 0.
* How could a single country hold the population of the whole planet, about 7.26 billion people?

In [10]:
%%sql
SELECT *
    FROM facts
    WHERE population == (SELECT MIN(population)
                           FROM facts);

 * sqlite:///factbook.db
Done.


id,code,name,area,area_land,area_water,population,population_growth,birth_rate,death_rate,migration_rate
250,ay,Antarctica,,280000,,0,,,,


So the 'country' with 0 people registered in it is Antarctica, which is not really a country. This agrees with the CIA Factbook [page for Antarctica](https://www.cia.gov/library/publications/the-world-factbook/geos/ay.html).

![banner](https://s3.amazonaws.com/dq-content/257/fb_antarctica.png)

In [9]:
%%sql
SELECT *
    FROM facts
    WHERE population==(SELECT MAX(population)
                         FROM facts);

 * sqlite:///factbook.db
Done.


id,code,name,area,area_land,area_water,population,population_growth,birth_rate,death_rate,migration_rate
261,xx,World,,,,7256490011,1.08,18.6,7.8,


As expected, there is a 'World' entry that holds the entire global population. Since neither the minimum nor the maximum belongs to an actual country, we recalculate with both rows excluded.

In [12]:
%%sql
SELECT MIN(population) AS min_pop, MAX(population) AS max_pop, MIN(population_growth) AS min_pop_growth, MAX(population_growth) AS max_population_growth
    FROM facts
    WHERE name NOT IN ('World', 'Antarctica');

 * sqlite:///factbook.db
Done.


min_pop,max_pop,min_pop_growth,max_population_growth
48,1367485388,0.0,4.02


In [15]:
%%sql
SELECT AVG(population) AS avg_pop, AVG(area) AS avg_area
    FROM facts
    WHERE name NOT IN ('World', 'Antarctica');

 * sqlite:///factbook.db
Done.


avg_pop,avg_area
32377011.0125,555093.546184739


So the average entry in the dataset has about 32.4 million inhabitants and an area of about 555 thousand square kilometers.

* (1) which countries have the largest and smallest populations in the dataset?

In [25]:
%%sql
SELECT *
    FROM facts
    WHERE name NOT IN ('Antarctica', 'World')
    AND population == (SELECT MAX(population)
                          FROM facts
                          WHERE name NOT IN ('Antarctica', 'World'))

 * sqlite:///factbook.db
Done.


id,code,name,area,area_land,area_water,population,population_growth,birth_rate,death_rate,migration_rate
37,ch,China,9596960,9326410,270550,1367485388,0.45,12.49,7.53,0.44


In [26]:
%%sql
SELECT *
    FROM facts
    WHERE name NOT IN ('Antarctica', 'World')
    AND population == (SELECT MIN(population)
                          FROM facts
                          WHERE name NOT IN ('Antarctica', 'World'))

 * sqlite:///factbook.db
Done.


id,code,name,area,area_land,area_water,population,population_growth,birth_rate,death_rate,migration_rate
238,pc,Pitcairn Islands,47,47,0,48,0.0,,,


The countries with the largest and smallest populations are China and Pitcairn Islands, respectively.

* (2) the lowest population growth rate?

In [32]:
%%sql
SELECT *
    FROM facts
    WHERE (name NOT IN ('Antarctica','World'))
    AND (population_growth == (SELECT MIN(population_growth)
            FROM facts));

 * sqlite:///factbook.db
Done.


id,code,name,area,area_land,area_water,population,population_growth,birth_rate,death_rate,migration_rate
190,vt,Holy See (Vatican City),0,0,0.0,842,0.0,,,
200,ck,Cocos (Keeling) Islands,14,14,0.0,596,0.0,,,
207,gl,Greenland,2166086,2166086,,57733,0.0,14.48,8.49,5.98
238,pc,Pitcairn Islands,47,47,0.0,48,0.0,,,


The entries with the lowest growth rate (0.0) are Greenland, the Holy See (Vatican City), the Cocos (Keeling) Islands, and the Pitcairn Islands.

* (3) the largest densely populated countries?

In [49]:
%%sql
SELECT name AS country
    FROM facts
    WHERE name NOT IN ('World', 'Antarctica') AND (population > (SELECT AVG(population)
                           FROM facts
                           WHERE name NOT IN ('World', 'Antarctica'))) AND (area < (SELECT AVG(area)
                           FROM facts
                           WHERE name NOT IN ('World', 'Antarctica')))
    ORDER BY area DESC;

 * sqlite:///factbook.db
Done.


country
Thailand
Spain
Morocco
Iraq
Japan
Germany
Vietnam
Poland
Italy
Philippines


Among countries with an above-average population and a below-average area, the largest by area are Thailand, Spain, Morocco, Iraq, and Japan.
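The exclusion filters in these queries depend on exact string matches, because SQLite compares text case-sensitively by default. The sketch below illustrates this with Python's built-in `sqlite3` module and a three-row in-memory table that stands in for `factbook.db` (the table is an assumption made for illustration, not the real database). `COLLATE NOCASE` is one possible way to make the filter tolerant of letter case.

```python
import sqlite3

# In-memory stand-in for the facts table (illustrative, not factbook.db).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE facts (name TEXT)")
conn.executemany(
    "INSERT INTO facts VALUES (?)",
    [("World",), ("Antarctica",), ("China",)],
)

# Upper-case literals do not match the stored names, so nothing is excluded.
upper = conn.execute(
    "SELECT name FROM facts WHERE name NOT IN ('WORLD', 'ANTARCTICA') ORDER BY name"
).fetchall()

# Literals that match the stored spelling exclude both aggregate rows.
exact = conn.execute(
    "SELECT name FROM facts WHERE name NOT IN ('World', 'Antarctica') ORDER BY name"
).fetchall()

# COLLATE NOCASE on the left operand makes the IN comparison case-insensitive.
nocase = conn.execute(
    "SELECT name FROM facts "
    "WHERE name COLLATE NOCASE NOT IN ('WORLD', 'ANTARCTICA') ORDER BY name"
).fetchall()

print(upper)
print(exact)
print(nocase)
```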

* (4) which countries have the highest ratio of water to land, and more water than land?

In [42]:
%%sql
SELECT *, CAST(area_water AS FLOAT)/area_land AS water_land_ratio
    FROM facts
    ORDER BY water_land_ratio DESC
    LIMIT 5;

 * sqlite:///factbook.db
Done.


id,code,name,area,area_land,area_water,population,population_growth,birth_rate,death_rate,migration_rate,water_land_ratio
228,io,British Indian Ocean Territory,54400,60,54340,,,,,,905.6666666666666
247,vq,Virgin Islands,1910,346,1564,103574.0,0.59,10.31,8.54,7.67,4.520231213872832
246,rq,Puerto Rico,13791,8870,4921,3598357.0,0.6,10.86,8.67,8.15,0.5547914317925592
12,bf,"Bahamas, The",13880,10010,3870,324597.0,0.85,15.5,7.05,0.0,0.3866133866133866
71,pu,Guinea-Bissau,36125,28120,8005,1726170.0,1.91,33.38,14.33,0.0,0.2846728307254623


Only two entries have more water than land: the British Indian Ocean Territory, with about 906 times as much water as land, and the Virgin Islands, with 4.52 times as much.
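The `CAST` in the query matters because `area_water` and `area_land` are integer columns, and SQLite truncates the result when both operands of `/` are integers. A minimal sketch of the difference, using Python's built-in `sqlite3` module and an in-memory table with two rows copied from the output above (the table is an illustrative stand-in for `factbook.db`):

```python
import sqlite3

# In-memory stand-in for the facts table, with two rows taken from the output above.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE facts (name TEXT, area_land INTEGER, area_water INTEGER)")
conn.executemany(
    "INSERT INTO facts VALUES (?, ?, ?)",
    [("Virgin Islands", 346, 1564), ("Puerto Rico", 8870, 4921)],
)

rows = conn.execute(
    """
    SELECT name,
           area_water / area_land                AS int_ratio,   -- integer division
           CAST(area_water AS FLOAT) / area_land AS float_ratio  -- floating-point division
      FROM facts
     ORDER BY float_ratio DESC
    """
).fetchall()

for name, int_ratio, float_ratio in rows:
    print(name, int_ratio, round(float_ratio, 2))
```

Without the cast, Puerto Rico's ratio would collapse to 0 and the Virgin Islands' to 4, which would blur the ranking.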

* (5) Which countries will add the most people to their populations next year?

In [44]:
%%sql
SELECT name, population*population_growth AS new_folks
    FROM facts
    ORDER BY new_folks DESC;

 * sqlite:///factbook.db
Done.


name,new_folks
World,7837009211.88
India,1527068612.48
China,615368424.6
Nigeria,444827037.2000001
Pakistan,290665336.62
Ethiopia,287456216.91
Bangladesh,270332392.0
United States,250667713.92
Indonesia,235514180.08
"Congo, Democratic Republic of the",194469083.2


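Because population_growth is a percentage, new_folks as computed above is 100 times the actual increase, and the World row is an aggregate rather than a country. After dividing by 100, India is expected to add about 15.3 million people next year and China about 6.2 million. These figures are net population growth, not births alone.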
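Since `population_growth` is stored as a percentage, the yearly increase in people is `population * population_growth / 100`. The sketch below shows one way to compute it while leaving out the aggregate rows. It uses Python's built-in `sqlite3` module and an in-memory table with three rows copied from earlier outputs; the table is an illustrative stand-in for `factbook.db`, and rounding to whole people is a choice made here, not something the dataset prescribes.

```python
import sqlite3

# In-memory stand-in for the facts table, with rows taken from earlier outputs.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE facts (name TEXT, population INTEGER, population_growth REAL)")
conn.executemany(
    "INSERT INTO facts VALUES (?, ?, ?)",
    [
        ("World", 7256490011, 1.08),
        ("China", 1367485388, 0.45),
        ("Afghanistan", 32564342, 2.32),
    ],
)

added = conn.execute(
    """
    SELECT name,
           -- growth is a percentage, so divide by 100 to get people per year
           CAST(ROUND(population * population_growth / 100.0) AS INTEGER) AS new_folks
      FROM facts
     WHERE name NOT IN ('World', 'Antarctica')
     ORDER BY new_folks DESC
    """
).fetchall()

print(added)
```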

* (6) Which countries have higher death rate than birth rate?

In [45]:
%%sql
SELECT *, birth_rate - death_rate AS net_rate
    FROM facts
    WHERE net_rate<0
    ORDER BY net_rate
    ;

 * sqlite:///factbook.db
Done.


id,code,name,area,area_land,area_water,population,population_growth,birth_rate,death_rate,migration_rate,net_rate
26,bu,Bulgaria,110879,108489,2390,7186893,0.58,8.92,14.44,0.29,-5.52
153,ri,Serbia,77474,77474,0,7176794,0.46,9.08,13.66,0.0,-4.58
96,lg,Latvia,64589,62249,2340,1986705,1.06,10.0,14.31,6.26,-4.3100000000000005
102,lh,Lithuania,65300,62680,2620,2884433,1.04,10.1,14.27,6.27,-4.17
183,up,Ukraine,603550,579330,24220,44429471,0.6,10.72,14.46,2.25,-3.74
75,hu,Hungary,93028,89608,3420,9897541,0.22,9.16,12.73,1.33,-3.5700000000000003
65,gm,Germany,357022,348672,8350,80854408,0.17,8.47,11.42,1.24,-2.9499999999999997
158,si,Slovenia,20273,20151,122,1983412,0.26,8.42,11.37,0.37,-2.9499999999999997
142,ro,Romania,238391,229891,8500,21666350,0.3,9.14,11.9,0.24,-2.76
44,hr,Croatia,56594,55974,620,4464844,0.13,9.45,12.18,1.39,-2.7300000000000004


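There is a long list of countries where deaths outnumber births, headed by Bulgaria (-5.52 per 1,000 people, or about -0.55% per year), followed by Serbia, Latvia, and Lithuania, each with a net rate below -4 per 1,000. Migration is not included in this figure.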
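Birth and death rates are expressed per 1,000 people, so dividing their difference by 10 gives the natural change as a percentage. A minimal sketch of that conversion, using Python's built-in `sqlite3` module and an in-memory table with three rows copied from earlier outputs (an illustrative stand-in for `factbook.db`; the column name `natural_change_pct` is introduced here):

```python
import sqlite3

# In-memory stand-in for the facts table, with rows taken from earlier outputs.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE facts (name TEXT, birth_rate REAL, death_rate REAL)")
conn.executemany(
    "INSERT INTO facts VALUES (?, ?, ?)",
    [
        ("Bulgaria", 8.92, 14.44),
        ("Serbia", 9.08, 13.66),
        ("Albania", 12.92, 6.58),
    ],
)

natural = conn.execute(
    """
    SELECT name,
           -- rates are per 1,000 people; dividing by 10 converts to percent
           ROUND((birth_rate - death_rate) / 10.0, 2) AS natural_change_pct
      FROM facts
     WHERE birth_rate < death_rate
     ORDER BY natural_change_pct
    """
).fetchall()

print(natural)
```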

* (7) Which countries have the highest population/area ratio, and how does that list compare with the one from question (3)?

In [47]:
%%sql
SELECT *, population/area AS pop_density
    FROM facts
    ORDER BY pop_density DESC;    

 * sqlite:///factbook.db
Done.


id,code,name,area,area_land,area_water,population,population_growth,birth_rate,death_rate,migration_rate,pop_density
205,mc,Macau,28.0,28.0,0.0,592731.0,0.8,8.88,4.22,3.37,21168.0
117,mn,Monaco,2.0,2.0,0.0,30535.0,0.12,6.65,9.24,3.83,15267.0
156,sn,Singapore,697.0,687.0,10.0,5674472.0,1.89,8.27,3.43,14.05,8141.0
204,hk,Hong Kong,1108.0,1073.0,35.0,7141106.0,0.38,9.23,7.07,1.68,6445.0
251,gz,Gaza Strip,360.0,360.0,0.0,1869055.0,2.81,31.11,3.04,0.0,5191.0
233,gi,Gibraltar,6.0,6.0,0.0,29258.0,0.24,14.08,8.37,3.28,4876.0
13,ba,Bahrain,760.0,760.0,0.0,1346613.0,2.41,13.66,2.69,13.09,1771.0
108,mv,Maldives,298.0,298.0,0.0,393253.0,0.08,15.75,3.89,12.68,1319.0
110,mt,Malta,316.0,316.0,0.0,413965.0,0.31,10.18,9.09,1.98,1310.0
227,bd,Bermuda,54.0,54.0,0.0,70196.0,0.5,11.33,8.23,1.88,1299.0


We can see that the densest entries are Macau, Monaco, Singapore, Hong Kong, and the Gaza Strip. This list is very different from the one in question (3), where we only combined above-average populations with below-average areas.
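Both `population` and `area` are declared as integers in the `facts` schema, so `population/area` is an integer division in SQLite and the fractional part is dropped. The sketch below compares it with a cast version, using Python's built-in `sqlite3` module and an in-memory table with three rows copied from earlier outputs (an illustrative stand-in for `factbook.db`). The `WHERE area > 0` filter is an assumption added here to skip rows with a missing or zero area.

```python
import sqlite3

# In-memory stand-in for the facts table, with rows taken from earlier outputs.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE facts (name TEXT, area INTEGER, population INTEGER)")
conn.executemany(
    "INSERT INTO facts VALUES (?, ?, ?)",
    [
        ("Macau", 28, 592731),
        ("Monaco", 2, 30535),
        ("Andorra", 468, 85580),
    ],
)

density = conn.execute(
    """
    SELECT name,
           population / area                AS int_density,    -- truncated
           CAST(population AS FLOAT) / area AS float_density   -- keeps the fraction
      FROM facts
     WHERE area > 0
     ORDER BY float_density DESC
    """
).fetchall()

for name, int_density, float_density in density:
    print(name, int_density, round(float_density, 2))
```

The ranking of the top entries is the same either way here, but the cast avoids ties and zeros among sparsely populated entries.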

## 6. Conclusion<a name="section6"></a>

In this first contact with SQL, we obtained for our queries:

* (1) countries with the largest and smallest populations: China and the Pitcairn Islands;
* (2) the lowest population growth rate: Greenland, Vatican City, the Cocos (Keeling) Islands, and the Pitcairn Islands, all at 0;
* (3) the largest densely populated countries: Thailand, Spain, Morocco, Iraq, Japan;
* (4) countries with more water than land: the British Indian Ocean Territory and the Virgin Islands;
* (5) countries that will add the most people next year: India and China, at about 15.3 million and 6.2 million (population times growth percentage, divided by 100);
* (6) countries with the largest excess of deaths over births: Bulgaria (-5.52 per 1,000 people), then Serbia, Latvia, and Lithuania (all below -4 per 1,000);
* (7) most densely populated countries overall: Macau, Monaco, Singapore, Hong Kong, and the Gaza Strip.