## Analysing CIA Factbook Data Using SQL

In this project, I'll be working with data from the CIA World Factbook, a compendium of statistics about all of the countries on Earth. The Factbook contains demographic information like the following:

* **population** — each country's population.
* **population_growth** — the annual population growth rate, as a percentage.
* **area** — the total land and water area.

I'll use SQLite to analyse data from this database. This will allow me to get up and running quickly, without setting up a larger database system like MySQL or Postgres. For simplicity, I'll also use the `pandas` function `read_sql_query` to run each query and return the result as a DataFrame.

The purpose of this project is not to run an in-depth analysis. Rather, it's intended as a way to practise basic SQL query skills. Please do not take any conclusions drawn within this project as fact, as none of them have been fact-checked or examined in any great depth.

### Loading in our database

In [89]:
import sqlite3

import pandas as pd

Let's first connect to our database using sqlite.

In [94]:
# Connect to our database
con = sqlite3.connect("factbook.db")

The syntax for running a query via the `read_sql_query` method is a bit long, so let's simplify it with our own function.

In [91]:
# Wrap pd.read_sql_query so that each call only needs the SQL string
def run(query, connection=con):
    return pd.read_sql_query(query, connection)

### Exploring our Data

Let's get a bit more detail about what we're working with.

In [92]:
run("SELECT * FROM sqlite_master WHERE type='table';")

| | type | name | tbl_name | rootpage | sql |
|---|---|---|---|---|---|
| 0 | table | sqlite_sequence | sqlite_sequence | 3 | CREATE TABLE sqlite_sequence(name,seq) |
| 1 | table | facts | facts | 47 | CREATE TABLE "facts" ("id" INTEGER PRIMARY KEY... |
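
The `sql` column above is truncated, so the full column list is not visible. One possible way to list a table's columns is SQLite's `PRAGMA table_info`. The sketch below runs against an in-memory database with a made-up three-column `facts` table, not the real `factbook.db`, and `demo_con` is a hypothetical name chosen so it does not clash with `con`.

```python
import sqlite3

# In-memory stand-in for factbook.db; this reduced schema is an assumption.
demo_con = sqlite3.connect(":memory:")
demo_con.execute(
    "CREATE TABLE facts (id INTEGER PRIMARY KEY, name TEXT, population INTEGER)"
)

# Each PRAGMA table_info row is (cid, name, type, notnull, dflt_value, pk).
columns = [(row[1], row[2]) for row in demo_con.execute("PRAGMA table_info(facts);")]
print(columns)

demo_con.close()
```

The same statement should also work through the `run` helper, for example `run("PRAGMA table_info(facts);")`, which would return one row per column.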


Let's now look at the first 5 rows of the `facts` table.

In [57]:
run("SELECT * FROM facts LIMIT 5;")

| | id | code | name | area | area_land | area_water | population | population_growth | birth_rate | death_rate | migration_rate |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | af | Afghanistan | 652230 | 652230 | 0 | 32564342 | 2.32 | 38.57 | 13.89 | 1.51 |
| 1 | 2 | al | Albania | 28748 | 27398 | 1350 | 3029278 | 0.3 | 12.92 | 6.58 | 3.3 |
| 2 | 3 | ag | Algeria | 2381741 | 2381741 | 0 | 39542166 | 1.84 | 23.67 | 4.31 | 0.92 |
| 3 | 4 | an | Andorra | 468 | 468 | 0 | 85580 | 0.12 | 8.13 | 6.96 | 0.0 |
| 4 | 5 | ao | Angola | 1246700 | 1246700 | 0 | 19625353 | 2.78 | 38.78 | 11.49 | 0.46 |


Here's a description of each of these columns:

* **name** — the name of the country.
* **area** — the country's total area (both land and water).
* **area_land** — the country's land area in square kilometers.
* **area_water** — the country's water area in square kilometers.
* **population** — the country's population.
* **population_growth** — the country's population growth as a percentage.
* **birth_rate** — the country's birth rate, or the number of births per year per 1,000 people.
* **death_rate** — the country's death rate, or the number of deaths per year per 1,000 people.

### Summary Stats

Let's try a few queries. We'll start by returning the maximum and minimum of both the `population` and `population_growth` columns.

In [58]:
run("""SELECT MAX(population) 'Max Population',
              MIN(population) 'Min Population',
              MAX(population_growth) 'Max Growth',
              MIN(population_growth) 'Min Growth'
    FROM facts;""")

| | Max Population | Min Population | Max Growth | Min Growth |
|---|---|---|---|---|
| 0 | 7256490011 | 0 | 4.02 | 0.0 |


### Exploring Outliers

This is odd. One country appears to have a population of over 7 billion! And another has a population of zero. Let's investigate this a bit further.

In [61]:
run("""SELECT name, population
FROM facts 
WHERE population = (SELECT MAX(population) FROM facts);""")

| | name | population |
|---|---|---|
| 0 | World | 7256490011 |


In [62]:
run("""SELECT name, population
FROM facts 
WHERE population = (SELECT MIN(population) FROM facts);""")

| | name | population |
|---|---|---|
| 0 | Antarctica | 0 |


Now this makes a bit more sense. The zero population belongs to Antarctica, while the 7 billion figure is the population of the whole world, which the table stores as a row of its own.

### Exploring Density

Let's repeat the above, but this time we will exclude **World** and **Antarctica**.

In [66]:
run("""SELECT 
    MAX(population) 'Max Population',
    MIN(population) 'Min Population',
    MAX(population_growth) 'Max Growth',
    MIN(population_growth) 'Min Growth'
FROM facts
WHERE name != 'World' AND name != 'Antarctica';""")

| | Max Population | Min Population | Max Growth | Min Growth |
|---|---|---|---|---|
| 0 | 1367485388 | 48 | 4.02 | 0.0 |


Let's also check for rows where `area_land` or `population` is `NULL` or zero, since these can't be used to calculate a density.

In [185]:
nulls = run("""SELECT name
FROM facts
WHERE area_land IS NULL OR population IS NULL
   OR area_land = 0 OR population = 0;""")
nulls = tuple(nulls['name'])
nulls

('Ethiopia',
 'South Sudan',
 'Sudan',
 'Holy See (Vatican City)',
 'European Union',
 'Ashmore and Cartier Islands',
 'Coral Sea Islands',
 'Heard Island and McDonald Islands',
 'Clipperton Island',
 'French Southern and Antarctic Lands',
 'Saint Barthelemy',
 'Bouvet Island',
 'Jan Mayen',
 'Akrotiri',
 'British Indian Ocean Territory',
 'Dhekelia',
 'South Georgia and South Sandwich Islands',
 'Navassa Island',
 'Wake Island',
 'United States Pacific Island Wildlife Refuges',
 'Antarctica',
 'Paracel Islands',
 'Spratly Islands',
 'Arctic Ocean',
 'Atlantic Ocean',
 'Indian Ocean',
 'Pacific Ocean',
 'Southern Ocean',
 'World')

Let's exclude these entries from the density queries that follow.

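We will now calculate the average of the `population` and `area_land` columns, then divide one by the other. Because this is a ratio of two averages over the same rows, it equals the total population divided by the total land area of the included entries, which gives an overall population density (people per square kilometer) to use as a benchmark.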
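
Dividing one average by another is not the same as averaging each country's own density. The sketch below illustrates the difference on a made-up two-row table in an in-memory database. The names `demo_con`, `Smallville` and `Bigland` are hypothetical and the figures are invented for the example.

```python
import sqlite3

# Made-up data: two places with the same population but very different areas.
demo_con = sqlite3.connect(":memory:")
demo_con.execute("CREATE TABLE facts (name TEXT, population INTEGER, area_land INTEGER)")
demo_con.executemany(
    "INSERT INTO facts VALUES (?, ?, ?)",
    [("Smallville", 1000, 10), ("Bigland", 1000, 1000)],
)

# First column: ratio of averages (the approach used in this notebook).
# Second column: average of the per-row densities.
ratio_of_avgs, avg_of_ratios = demo_con.execute("""
    SELECT AVG(population) / AVG(area_land),
           AVG(CAST(population AS REAL) / area_land)
    FROM facts;""").fetchone()

print(ratio_of_avgs)  # total population / total land area, just under 2
print(avg_of_ratios)  # mean of 100.0 and 1.0

demo_con.close()
```

The ratio of averages is dominated by the larger areas, while the average of ratios gives every row equal weight, so the two can differ widely.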

In [187]:
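# Benchmark density: average population divided by average land area.
ratio = run("""SELECT
    AVG(population) / AVG(area_land) 'ratio'
    FROM facts WHERE name NOT IN {};""".format(nulls))

# Pull the single value out of the one-row, one-column result
ratio = float(ratio.loc[0, "ratio"])
ratio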

55.408233504543524
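
The query above builds its `NOT IN` list with `str.format`, which relies on the text form of a Python tuple happening to be valid SQL. That would not hold for a one-element tuple, which renders with a trailing comma. A possible alternative is to generate `?` placeholders and pass the names as parameters. This is a minimal sketch against a made-up in-memory table; `demo_con`, `excluded` and the rows are assumptions for illustration only.

```python
import sqlite3

demo_con = sqlite3.connect(":memory:")
demo_con.execute("CREATE TABLE facts (name TEXT, population INTEGER)")
demo_con.executemany(
    "INSERT INTO facts VALUES (?, ?)",
    [("World", 100), ("Cote d'Ivoire", 20), ("Andorra", 5)],
)

# Names to leave out; the apostrophe needs no manual escaping with placeholders.
excluded = ("World", "Cote d'Ivoire")

# One "?" per excluded name, e.g. "?, ?" for two names.
placeholders = ", ".join("?" for _ in excluded)
query = f"SELECT name FROM facts WHERE name NOT IN ({placeholders});"

remaining = [row[0] for row in demo_con.execute(query, excluded)]
print(remaining)

demo_con.close()
```

`pd.read_sql_query` also accepts a `params` argument, so the `run` helper could be extended to forward one if this approach is preferred.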

We'll now use this to identify countries that have a density ratio above average. We'll only show the top 10.

In [186]:
density = run("""SELECT
name, population, area_land, ROUND(CAST(population AS float) / CAST(area_land AS float), 2) 'density'
FROM facts
WHERE 
density > {} AND 
name not in {}""".format(ratio, nulls))

density.sort_values(by = "density", ascending = False).head(10)

| | name | population | area_land | density |
|---|---|---|---|---|
| 125 | Macau | 592731 | 28 | 21168.96 |
| 75 | Monaco | 30535 | 2 | 15267.5 |
| 98 | Singapore | 5674472 | 687 | 8259.78 |
| 124 | Hong Kong | 7141106 | 1073 | 6655.27 |
| 147 | Gaza Strip | 1869055 | 360 | 5191.82 |
| 137 | Gibraltar | 29258 | 6 | 4876.33 |
| 6 | Bahrain | 1346613 | 760 | 1771.86 |
| 68 | Maldives | 393253 | 298 | 1319.64 |
| 69 | Malta | 413965 | 316 | 1310.02 |
| 134 | Bermuda | 70196 | 54 | 1299.93 |


Here we can see that Macau is the most densely populated, followed by Monaco and Singapore.
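
The top 10 above is selected in pandas after the query has returned every matching row. As an alternative sketch, the sorting and truncation can be pushed into SQL with `ORDER BY ... DESC` and `LIMIT`. The table below is invented for the example, and `demo_con` is a hypothetical connection to an in-memory database.

```python
import sqlite3

demo_con = sqlite3.connect(":memory:")
demo_con.execute("CREATE TABLE facts (name TEXT, population INTEGER, area_land INTEGER)")
demo_con.executemany(
    "INSERT INTO facts VALUES (?, ?, ?)",
    [("A", 100, 10), ("B", 900, 10), ("C", 500, 10), ("D", 10, 10)],
)

# Sort by density inside SQLite and keep only the two densest rows.
top = demo_con.execute("""
    SELECT name, ROUND(CAST(population AS REAL) / area_land, 2) AS density
    FROM facts
    WHERE area_land > 0
    ORDER BY density DESC
    LIMIT 2;""").fetchall()

print(top)

demo_con.close()
```

One trade-off is that the DataFrame returned this way would be indexed from 0 rather than keeping the original row positions shown in the table above.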

Let's see if we can now answer the following:

1. Which country has the most people? Which country has the highest growth rate?
1. Which countries have the highest ratios of water to land? Which countries have more water than land?
1. Which countries will add the most people to their populations next year?
1. Which countries have a higher death rate than birth rate?

### Most populated country, and country with highest growth rate

In [198]:
run("""SELECT name, population_growth
FROM facts 
WHERE population_growth = (SELECT MAX(population_growth) FROM facts WHERE name != 'World') AND name != 'World'
""")

| | name | population_growth |
|---|---|---|
| 0 | South Sudan | 4.02 |


In [199]:
run("""SELECT name, population 
FROM facts 
WHERE population = (SELECT MAX(population) FROM facts WHERE name != 'World') AND name != 'World'
""")

| | name | population |
|---|---|---|
| 0 | China | 1367485388 |


We can see that the country with the highest population is China, while the country with the highest population growth is South Sudan.

### Highest ratio of water to land

In [222]:
run("""SELECT name, area_water, area_land, CAST(area_water AS float) / cast(area_land AS float) ratio
    FROM facts 
    WHERE area_land is not NULL AND area_water is not NULL AND area_water != 0 AND area_land != 0
    ORDER BY ratio DESC
    LIMIT 10;""")

| | name | area_water | area_land | ratio |
|---|---|---|---|---|
| 0 | British Indian Ocean Territory | 54340 | 60 | 905.666667 |
| 1 | Virgin Islands | 1564 | 346 | 4.520231 |
| 2 | Puerto Rico | 4921 | 8870 | 0.554791 |
| 3 | Bahamas, The | 3870 | 10010 | 0.386613 |
| 4 | Guinea-Bissau | 8005 | 28120 | 0.284673 |
| 5 | Malawi | 24404 | 94080 | 0.259396 |
| 6 | Netherlands | 7650 | 33893 | 0.22571 |
| 7 | Uganda | 43938 | 197100 | 0.222922 |
| 8 | Eritrea | 16600 | 101000 | 0.164356 |
| 9 | Liberia | 15049 | 96320 | 0.15624 |


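Here we can see the 10 entries with the highest water-to-land ratio. Only the first two, the British Indian Ocean Territory and the Virgin Islands, have a ratio above 1, meaning they have more water than land. Assessing the accuracy of these figures would require more analysis, but we will leave it for now.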
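
The "more water than land" question can also be asked directly with a `WHERE area_water > area_land` filter. The sketch below is one way to do it, using an in-memory table loaded with three rows copied from the output above. `demo_con` is a hypothetical name, and the real `facts` table has many more rows and columns.

```python
import sqlite3

demo_con = sqlite3.connect(":memory:")
demo_con.execute("CREATE TABLE facts (name TEXT, area_water INTEGER, area_land INTEGER)")
demo_con.executemany(
    "INSERT INTO facts VALUES (?, ?, ?)",
    [
        ("British Indian Ocean Territory", 54340, 60),
        ("Virgin Islands", 1564, 346),
        ("Puerto Rico", 4921, 8870),
    ],
)

# Keep only rows where water area exceeds land area, largest ratio first.
more_water = [
    row[0]
    for row in demo_con.execute("""
        SELECT name
        FROM facts
        WHERE area_water > area_land
        ORDER BY CAST(area_water AS REAL) / area_land DESC;""")
]
print(more_water)

demo_con.close()
```

Rows with a `NULL` area drop out of this filter automatically, because a comparison with `NULL` is never true in SQL.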

### Countries that will add the most people to their populations next year

In [236]:
run("""SELECT name, population, population_growth, ROUND(((population * population_growth) / 100),0) increase
FROM facts
WHERE population is not NULL AND 
      population_growth is not NULL AND 
      population != 0 AND population_growth != 0
ORDER BY increase DESC
LIMIT 10;
      """)

| | name | population | population_growth | increase |
|---|---|---|---|---|
| 0 | World | 7256490011 | 1.08 | 78370092.0 |
| 1 | India | 1251695584 | 1.22 | 15270686.0 |
| 2 | China | 1367485388 | 0.45 | 6153684.0 |
| 3 | Nigeria | 181562056 | 2.45 | 4448270.0 |
| 4 | Pakistan | 199085847 | 1.46 | 2906653.0 |
| 5 | Ethiopia | 99465819 | 2.89 | 2874562.0 |
| 6 | Bangladesh | 168957745 | 1.6 | 2703324.0 |
| 7 | United States | 321368864 | 0.78 | 2506677.0 |
| 8 | Indonesia | 255993674 | 0.92 | 2355142.0 |
| 9 | Congo, Democratic Republic of the | 79375136 | 2.45 | 1944691.0 |


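The first row is the **World** aggregate, which this query does not filter out. Setting it aside, India is projected to add the most people next year (about 15.3 million), followed by China and Nigeria.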
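
As a quick check of the `increase` column, the same arithmetic (population multiplied by the growth percentage, divided by 100, then rounded) can be reproduced in plain Python. The figures are copied from the table above; `rows` and `increases` are hypothetical names used only in this sketch.

```python
# (population, population_growth in percent), copied from the query output.
rows = {
    "India": (1251695584, 1.22),
    "China": (1367485388, 0.45),
    "Nigeria": (181562056, 2.45),
}

# Projected one-year increase: population * growth / 100, rounded to whole people.
increases = {
    name: round(population * growth / 100)
    for name, (population, growth) in rows.items()
}

for name, increase in increases.items():
    print(name, increase)
```

This treats the growth rate as constant for the year, which is the same simplification the SQL query makes.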

### Countries with a higher death rate than birth rate

In [241]:
run("""SELECT name, birth_rate, death_rate, ROUND(death_rate / birth_rate, 2) ratio
FROM facts
WHERE death_rate > birth_rate
ORDER BY ratio DESC
LIMIT 10;""")

| | name | birth_rate | death_rate | ratio |
|---|---|---|---|---|
| 0 | Bulgaria | 8.92 | 14.44 | 1.62 |
| 1 | Serbia | 9.08 | 13.66 | 1.5 |
| 2 | Latvia | 10.0 | 14.31 | 1.43 |
| 3 | Lithuania | 10.1 | 14.27 | 1.41 |
| 4 | Hungary | 9.16 | 12.73 | 1.39 |
| 5 | Monaco | 6.65 | 9.24 | 1.39 |
| 6 | Germany | 8.47 | 11.42 | 1.35 |
| 7 | Slovenia | 8.42 | 11.37 | 1.35 |
| 8 | Ukraine | 10.72 | 14.46 | 1.35 |
| 9 | Saint Pierre and Miquelon | 7.42 | 9.72 | 1.31 |


Here we can see the 10 countries with the highest death-to-birth ratio. Bulgaria tops the list with 1.62 deaths for every birth, followed by Serbia and Latvia.
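
To make the ratio concrete, here is the top row worked through in plain Python using the Bulgaria figures from the table above. The difference between the two rates is included purely as an extra illustration; `net_per_1000` is a hypothetical name and this quantity is not computed anywhere in the notebook.

```python
# Bulgaria's rates per 1,000 people per year, copied from the query output.
birth_rate = 8.92
death_rate = 14.44

# Same calculation as the SQL: ROUND(death_rate / birth_rate, 2)
ratio = round(death_rate / birth_rate, 2)

# Births minus deaths per 1,000 people (negative means more deaths than births).
net_per_1000 = round(birth_rate - death_rate, 2)

print(ratio)
print(net_per_1000)
```

Both rates ignore migration, so neither number alone says whether the population actually shrinks.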

In [85]:
# Close our connection
con.close()
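
Closing the connection by hand works, but the call is skipped if an earlier cell raises an error. One possible alternative is `contextlib.closing`, sketched here against an in-memory database rather than `factbook.db`. Note that using a `sqlite3` connection directly in a `with` statement only manages transactions and does not close it. `demo_con` and the one-row table are assumptions for the example.

```python
import sqlite3
from contextlib import closing

# closing() calls demo_con.close() when the block exits, even on an exception.
with closing(sqlite3.connect(":memory:")) as demo_con:
    demo_con.execute("CREATE TABLE facts (name TEXT)")
    demo_con.execute("INSERT INTO facts VALUES ('Andorra')")
    names = [row[0] for row in demo_con.execute("SELECT name FROM facts;")]

print(names)

# Using the connection after the block fails because it is already closed.
try:
    demo_con.execute("SELECT 1;")
    still_open = True
except sqlite3.ProgrammingError:
    still_open = False

print(still_open)
```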