# Analyzing CIA Data with SQL

Let's use some simple exploratory SQL to investigate a table from the CIA World Factbook database.

We will start by loading the SQL extension into Jupyter and inspecting the sqlite_master table, which lists the tables in the database.

In [1]:
%%capture
%load_ext sql
%sql sqlite:///factbook.db

'Connected: None@factbook.db'

In [2]:
%%sql
SELECT *
  FROM sqlite_master
 WHERE type='table';

Done.


type,name,tbl_name,rootpage,sql
table,sqlite_sequence,sqlite_sequence,3,"CREATE TABLE sqlite_sequence(name,seq)"
table,facts,facts,47,"CREATE TABLE ""facts"" (""id"" INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL, ""code"" varchar(255) NOT NULL, ""name"" varchar(255) NOT NULL, ""area"" integer, ""area_land"" integer, ""area_water"" integer, ""population"" integer, ""population_growth"" float, ""birth_rate"" float, ""death_rate"" float, ""migration_rate"" float)"
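
Outside the notebook magic, the same catalog lookup can be sketched with Python's built-in `sqlite3` module. The snippet below is only an illustration: it builds a small in-memory stand-in for factbook.db (the abbreviated table definition is an assumption made for the demo) and lists its tables. It also suggests why `sqlite_sequence` shows up above, since SQLite creates that table automatically for `AUTOINCREMENT` columns.

```python
import sqlite3

# In-memory stand-in for factbook.db; the schema is abbreviated for this demo.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE facts ("
    "id INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL, "
    "code varchar(255) NOT NULL, "
    "name varchar(255) NOT NULL, "
    "population integer)"
)

# sqlite_master holds one row per schema object (tables, indexes, views, triggers).
tables = [
    row[0]
    for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    )
]
print(tables)
```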


It seems the bulk of our informative data is in a table called 'facts'. Let's display the first five rows of it to get a sense of the data.

In [3]:
%%sql
SELECT *
    FROM facts
    LIMIT 5;

Done.


id,code,name,area,area_land,area_water,population,population_growth,birth_rate,death_rate,migration_rate
1,af,Afghanistan,652230,652230,0,32564342,2.32,38.57,13.89,1.51
2,al,Albania,28748,27398,1350,3029278,0.3,12.92,6.58,3.3
3,ag,Algeria,2381741,2381741,0,39542166,1.84,23.67,4.31,0.92
4,an,Andorra,468,468,0,85580,0.12,8.13,6.96,0.0
5,ao,Angola,1246700,1246700,0,19625353,2.78,38.78,11.49,0.46


The columns can be described as follows:
* name: the name of the country.
* area: the country's total area (both land and water) in square kilometers.
* area_land: the country's land area in square kilometers.
* area_water: the country's water area in square kilometers.
* population: the country's population.
* population_growth: the country's population growth as a percentage.
* birth_rate: the country's birth rate, or the number of births per year per 1,000 people.
* death_rate: the country's death rate, or the number of deaths per year per 1,000 people.
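
The column names and declared types can also be read from the database itself. As a rough sketch, assuming the `CREATE TABLE` statement shown in the sqlite_master output above, the following uses `PRAGMA table_info` on an empty in-memory copy of the table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Table definition copied from the sqlite_master output shown earlier.
conn.execute(
    'CREATE TABLE "facts" ('
    '"id" INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL, '
    '"code" varchar(255) NOT NULL, "name" varchar(255) NOT NULL, '
    '"area" integer, "area_land" integer, "area_water" integer, '
    '"population" integer, "population_growth" float, '
    '"birth_rate" float, "death_rate" float, "migration_rate" float)'
)

# PRAGMA table_info yields (cid, name, type, notnull, dflt_value, pk) per column.
columns = [
    (name, col_type)
    for _cid, name, col_type, _notnull, _default, _pk in conn.execute(
        "PRAGMA table_info(facts)"
    )
]
for name, col_type in columns:
    print(f"{name}: {col_type}")
```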

Let's isolate the extremes in terms of population and population growth.

In [4]:
%%sql
SELECT MIN(population), MAX(population), MIN(population_growth), MAX(population_growth)
    FROM facts;

Done.


MIN(population),MAX(population),MIN(population_growth),MAX(population_growth)
0,7256490011,0.0,4.02


A minimum population of zero seems wrong, and the maximum, at over 7 billion, also looks too high. Let's find out which rows these values belong to.

In [5]:
%%sql
SELECT name, population
    FROM facts
    WHERE population = (SELECT MAX(population) FROM facts)
       OR population = (SELECT MIN(population) FROM facts)
    ORDER BY population;

Done.


name,population
Antarctica,0
World,7256490011


We can see the outliers are Antarctica and the whole World. Both values make sense, but we want to exclude these rows from the analysis.

In [6]:
%%sql
SELECT MIN(population), MAX(population), MIN(population_growth), MAX(population_growth)
    FROM facts
    WHERE population > (SELECT MIN(population) FROM facts)
      AND population < (SELECT MAX(population) FROM facts);

Done.


MIN(population),MAX(population),MIN(population_growth),MAX(population_growth)
48,1367485388,0.0,4.02
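
To see what this filter keeps and drops, here is a small sketch using Python's `sqlite3` module on a toy table. The rows are copied from outputs in this notebook; the table itself is a stand-in, not the real factbook.db. Note that a row with a NULL population is dropped as well, because a comparison with NULL is never true.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE facts (name TEXT, population INTEGER)")
conn.executemany(
    "INSERT INTO facts VALUES (?, ?)",
    [
        ("Antarctica", 0),
        ("World", 7256490011),
        ("China", 1367485388),
        ("Afghanistan", 32564342),
        ("Andorra", 85580),
        ("Ashmore and Cartier Islands", None),  # population is NULL
    ],
)

# Keep only rows strictly between the minimum and maximum population.
kept = [
    row[0]
    for row in conn.execute(
        """
        SELECT name
          FROM facts
         WHERE population > (SELECT MIN(population) FROM facts)
           AND population < (SELECT MAX(population) FROM facts)
         ORDER BY population
        """
    )
]
print(kept)
```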


## Densely Populated Countries

Let's isolate the countries with above-average populations but below-average areas, which suggests a high population density.

In [7]:
%%sql
SELECT CAST(AVG(population) AS int) AS average_pop, CAST(AVG(area) AS int) AS average_area
    FROM facts
    WHERE population > (SELECT MIN(population) FROM facts)
      AND population < (SELECT MAX(population) FROM facts);

Done.


average_pop,average_area
32377011,582949


In [8]:
%%sql
SELECT name, population, area
    FROM facts
    WHERE population > (SELECT AVG(population) FROM facts)
      AND area < (SELECT AVG(area) FROM facts);

Done.


name,population,area
Bangladesh,168957745,148460
Germany,80854408,357022
Japan,126919659,377915
Philippines,100998376,300000
Thailand,67976405,513120
United Kingdom,64088222,243610
Vietnam,94348835,331210


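These seven countries all satisfy both conditions, which are expressed as subqueries in the WHERE clause. Note that these subqueries average over every row in facts, including World and Antarctica, so the population threshold is higher than the average_pop computed in the previous cell. Rather than defining density relative to the averages, let's look at the countries with the highest population-to-area ratios.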
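
The effect of leaving the outlier rows inside an `AVG` subquery can be illustrated on a toy table. This is only a sketch: the handful of rows below are copied from outputs in this notebook, so the averages are not those of the real facts table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE facts (name TEXT, population INTEGER)")
conn.executemany(
    "INSERT INTO facts VALUES (?, ?)",
    [
        ("World", 7256490011),
        ("Antarctica", 0),
        ("China", 1367485388),
        ("Bangladesh", 168957745),
        ("Afghanistan", 32564342),
        ("Andorra", 85580),
    ],
)

# Average over every row, as an unfiltered subquery would compute it.
avg_all = conn.execute("SELECT AVG(population) FROM facts").fetchone()[0]

# Average with the minimum and maximum rows excluded.
avg_filtered = conn.execute(
    """
    SELECT AVG(population)
      FROM facts
     WHERE population > (SELECT MIN(population) FROM facts)
       AND population < (SELECT MAX(population) FROM facts)
    """
).fetchone()[0]

print(f"all rows: {avg_all:,.0f}  filtered: {avg_filtered:,.0f}")
```

In this toy data the unfiltered average is far larger, which means fewer rows clear the "above average" bar.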

In [30]:
%%sql
SELECT name, population, area, population/area AS density
    FROM facts
    ORDER BY density
    LIMIT 5;

Done.


name,population,area,density
Chad,11631456.0,,
Niger,18045729.0,,
Holy See (Vatican City),842.0,0.0,
Ashmore and Cartier Islands,,5.0,
Coral Sea Islands,,3.0,


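The result is dominated by null or zero values, and the default ascending sort puts the lowest ratios first. Let's fix this by requiring at least one inhabitant and at least one square kilometer of area, and by sorting in descending order. Casting population to float also avoids integer division.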

In [31]:
%%sql
SELECT name, population, area, ROUND(CAST(population AS float)/area, 2) AS density
    FROM facts
    WHERE population > 0 AND area > 0
    ORDER BY density DESC
    LIMIT 10;

Done.


name,population,area,density
Macau,592731,28,21168.96
Monaco,30535,2,15267.5
Singapore,5674472,697,8141.28
Hong Kong,7141106,1108,6445.04
Gaza Strip,1869055,360,5191.82
Gibraltar,29258,6,4876.33
Bahrain,1346613,760,1771.86
Maldives,393253,298,1319.64
Malta,413965,316,1310.02
Bermuda,70196,54,1299.93


None of the seven countries from the earlier query appear on this list. Instead, these are much smaller territories whose populations are below average but still very large relative to their area.
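
The difference between the two density expressions can be sketched with Python's `sqlite3` module. The toy rows below are taken from the outputs above. In SQLite, dividing two integers truncates the result, while dividing by zero or by NULL yields NULL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE facts (name TEXT, population INTEGER, area INTEGER)")
conn.executemany(
    "INSERT INTO facts VALUES (?, ?, ?)",
    [
        ("Macau", 592731, 28),
        ("Holy See (Vatican City)", 842, 0),  # zero area
        ("Coral Sea Islands", None, 3),       # NULL population
    ],
)

rows = conn.execute(
    """
    SELECT name,
           population / area AS int_density,
           ROUND(CAST(population AS float) / area, 2) AS density
      FROM facts
    """
).fetchall()

# Map each name to its (integer-division density, float density) pair.
by_name = {name: (int_density, density) for name, int_density, density in rows}
for name, values in by_name.items():
    print(name, values)
```

Filtering on `population > 0 AND area > 0`, as the query above does, removes the rows whose density would otherwise be NULL.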

## Exploring Further: People and Growth Rate

Let's find the country with the most people and the country with the highest growth rate. Then we can calculate which countries are set to add the most people to their population next year.

In [13]:
%%sql
SELECT name, population, population_growth
    FROM facts
    WHERE population = (SELECT MAX(population)
                          FROM facts
                         WHERE population < (SELECT MAX(population) FROM facts))
       OR population_growth = (SELECT MAX(population_growth) FROM facts);

Done.


name,population,population_growth
China,1367485388,0.45
South Sudan,12042910,4.02
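
The nested `MAX` subquery above is a way of asking for the second-largest population, which skips the World row. A minimal sketch of that pattern on a toy table, with rows copied from this notebook's outputs:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE facts (name TEXT, population INTEGER)")
conn.executemany(
    "INSERT INTO facts VALUES (?, ?)",
    [
        ("World", 7256490011),
        ("China", 1367485388),
        ("India", 1251695584),
    ],
)

# The inner MAX finds the largest value; the outer MAX finds the largest value below it.
runner_up = conn.execute(
    """
    SELECT name
      FROM facts
     WHERE population = (SELECT MAX(population)
                           FROM facts
                          WHERE population < (SELECT MAX(population) FROM facts))
    """
).fetchone()[0]
print(runner_up)
```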


To calculate how much will be added to the population, we multiply the current population by the growth rate expressed as a fraction. So if the growth rate is 1% and the current population is x, we need x\*0.01.

In [20]:
%%sql
SELECT name, population, population_growth, ROUND(population*(population_growth/100), 2) AS pop_increase
    FROM facts
    WHERE population < (SELECT MAX(population) FROM facts)
    ORDER BY pop_increase DESC
    LIMIT 5;

Done.


name,population,population_growth,pop_increase
India,1251695584,1.22,15270686.12
China,1367485388,0.45,6153684.25
Nigeria,181562056,2.45,4448270.37
Pakistan,199085847,1.46,2906653.37
Ethiopia,99465819,2.89,2874562.17


Here we can see the countries adding the most to their population. Even though India and China don't have the highest growth rates, their populations are so large that they still add the most people in absolute terms.
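
The arithmetic behind this comparison can be checked with a short Python sketch. The helper name `projected_increase` is hypothetical, and the figures are the ones shown in the outputs above.

```python
def projected_increase(population: int, growth_percent: float) -> float:
    """Return the expected one-year increase for a growth rate given in percent."""
    return round(population * (growth_percent / 100), 2)


# (population, population_growth) pairs copied from the query results above.
countries = {
    "India": (1251695584, 1.22),
    "China": (1367485388, 0.45),
    "South Sudan": (12042910, 4.02),
}

increases = {
    name: projected_increase(population, growth)
    for name, (population, growth) in countries.items()
}

# List the countries from largest to smallest absolute increase.
for name, increase in sorted(increases.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {increase:,.2f}")
```

South Sudan has the highest growth rate of the three, but its smaller population gives it a far smaller absolute increase.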

## Conclusion

In this project we successfully used subqueries to investigate the country data. Investigating the extremes first matters, because rows such as World and Antarctica would otherwise skew the results.