## Analysing CIA Factbook Data Using SQL

In this project, I'll be working with data from the CIA World Factbook, a compendium of statistics about all of the countries on Earth. The Factbook contains demographic information like the following:

* **population** — the population of each country.
* **population_growth** — the annual population growth rate, as a percentage.
* **area** — the total land and water area.

I'll use SQLite to analyse this data, which lets me get up and running quickly without setting up a larger database server such as MySQL or PostgreSQL. For simplicity, I'll run queries through the `pandas` function `read_sql_query`, which returns each result as a DataFrame.

### Loading in our database

In [89]:
import sqlite3

import pandas as pd

Let's first connect to our database using Python's built-in `sqlite3` module.

In [90]:
# Connect to our database
con = sqlite3.connect("factbook.db")

The syntax for running a query via `read_sql_query` is a bit long, so let's wrap it in our own function.

In [91]:
# Helper that runs a query on the given connection and returns a DataFrame
def run(query, connection=con):
    return pd.read_sql_query(query, connection)

### Exploring our Data

Let's get a bit more detail about what we're working with.

In [92]:
run("SELECT * FROM sqlite_master WHERE type='table';")

|   | type | name | tbl_name | rootpage | sql |
|---|------|------|----------|----------|-----|
| 0 | table | sqlite_sequence | sqlite_sequence | 3 | `CREATE TABLE sqlite_sequence(name,seq)` |
| 1 | table | facts | facts | 47 | `CREATE TABLE "facts" ("id" INTEGER PRIMARY KEY...` |
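The `sql` column above is cut off in the output, so it does not show every column of `facts`. One way to list a table's columns is SQLite's `PRAGMA table_info`. The sketch below runs it against a small in-memory stand-in rather than `factbook.db`; the `demo` connection and its shortened column list are assumptions made only for illustration.

```python
import sqlite3

# In-memory stand-in for factbook.db (hypothetical, shortened schema).
demo = sqlite3.connect(":memory:")
demo.execute(
    """CREATE TABLE facts (
           id INTEGER PRIMARY KEY,
           code TEXT,
           name TEXT,
           area INTEGER,
           population INTEGER
       )"""
)

# PRAGMA table_info returns one row per column:
# (cid, name, type, notnull, dflt_value, pk)
columns = [(row[1], row[2]) for row in demo.execute("PRAGMA table_info(facts)")]
print(columns)
demo.close()
```

With the notebook's own helper, the equivalent call would be `run("PRAGMA table_info(facts);")`, which returns the same information as a DataFrame.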


Let's now look at the first 5 rows of the `facts` table.

In [57]:
run("SELECT * FROM facts LIMIT 5;")

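|   | id | code | name | area | area_land | area_water | population | population_growth | birth_rate | death_rate | migration_rate |
|---|----|------|------|------|-----------|------------|------------|-------------------|------------|------------|----------------|
| 0 | 1 | af | Afghanistan | 652230 | 652230 | 0 | 32564342 | 2.32 | 38.57 | 13.89 | 1.51 |
| 1 | 2 | al | Albania | 28748 | 27398 | 1350 | 3029278 | 0.3 | 12.92 | 6.58 | 3.3 |
| 2 | 3 | ag | Algeria | 2381741 | 2381741 | 0 | 39542166 | 1.84 | 23.67 | 4.31 | 0.92 |
| 3 | 4 | an | Andorra | 468 | 468 | 0 | 85580 | 0.12 | 8.13 | 6.96 | 0.0 |
| 4 | 5 | ao | Angola | 1246700 | 1246700 | 0 | 19625353 | 2.78 | 38.78 | 11.49 | 0.46 |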


Here's a description of each of these columns:

* **name** — the name of the country.
* **area** — the country's total area (both land and water).
* **area_land** — the country's land area in square kilometers.
* **area_water** — the country's water area in square kilometers.
* **population** — the country's population.
* **population_growth** — the country's population growth as a percentage.
* **birth_rate** — the country's birth rate, or the number of births per year per 1,000 people.
* **death_rate** — the country's death rate, or the number of deaths per year per 1,000 people.
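Since `area` is described as land plus water, a quick sanity check is to look for rows where the three columns disagree. This is a minimal sketch on an in-memory table (the `demo` connection is hypothetical) loaded with the first three rows shown above; it is not run against the full database.

```python
import sqlite3

demo = sqlite3.connect(":memory:")
demo.execute(
    "CREATE TABLE facts (name TEXT, area INTEGER, area_land INTEGER, area_water INTEGER)"
)
# Values copied from the first rows of the facts table shown earlier.
demo.executemany(
    "INSERT INTO facts VALUES (?, ?, ?, ?)",
    [
        ("Afghanistan", 652230, 652230, 0),
        ("Albania", 28748, 27398, 1350),
        ("Algeria", 2381741, 2381741, 0),
    ],
)

# Rows where the total differs from land + water.
# Rows with a NULL in any of the three columns are not flagged by this comparison.
mismatches = demo.execute(
    "SELECT name FROM facts WHERE area <> area_land + area_water"
).fetchall()
print(mismatches)
demo.close()
```

An empty result means the sampled rows are internally consistent; the same `WHERE` clause could be passed to `run` to check the whole table.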

### Summary Stats

Let's try a few queries. We'll start by returning the maximum and minimum of both the `population` and `population_growth` columns.

In [58]:
run("""SELECT MAX(population) AS "Max Population",
              MIN(population) AS "Min Population",
              MAX(population_growth) AS "Max Growth",
              MIN(population_growth) AS "Min Growth"
       FROM facts;""")

|   | Max Population | Min Population | Max Growth | Min Growth |
|---|----------------|----------------|------------|------------|
| 0 | 7256490011 | 0 | 4.02 | 0.0 |


### Exploring Outliers

This is odd. One country appears to have a population of over 7 billion! And another has a population of zero. Let's investigate this a bit further.

In [61]:
run("""SELECT name, population
FROM facts 
WHERE population = (SELECT MAX(population) FROM facts);""")

|   | name | population |
|---|------|------------|
| 0 | World | 7256490011 |


In [62]:
run("""SELECT name, population
FROM facts 
WHERE population = (SELECT MIN(population) FROM facts);""")

|   | name | population |
|---|------|------------|
| 0 | Antarctica | 0 |


Now this makes a bit more sense. The zero population belongs to Antarctica, while the 7 billion figure is the population of the whole world, which the table stores as a row alongside the individual countries.

### Exploring Average Population and Area

Let's repeat the above, but this time we will exclude **World** and **Antarctica**.

In [66]:
run("""SELECT
    MAX(population) AS "Max Population",
    MIN(population) AS "Min Population",
    MAX(population_growth) AS "Max Growth",
    MIN(population_growth) AS "Min Growth"
FROM facts
WHERE name != 'World' AND name != 'Antarctica';""")

|   | Max Population | Min Population | Max Growth | Min Growth |
|---|----------------|----------------|------------|------------|
| 0 | 1367485388 | 48 | 4.02 | 0.0 |
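The two inequality checks in the query above can also be written as a single `NOT IN` filter, which is easier to extend if more non-country rows turn up. The sketch below is one way to do that with bound parameters; the `demo` connection, the `excluded` tuple, and the four sample rows (populations taken from the outputs above) are illustrative assumptions, not the full dataset.

```python
import sqlite3

demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE facts (name TEXT, population INTEGER)")
demo.executemany(
    "INSERT INTO facts VALUES (?, ?)",
    [
        ("World", 7256490011),
        ("Antarctica", 0),
        ("Afghanistan", 32564342),
        ("Andorra", 85580),
    ],
)

# Names to leave out of the aggregates (hypothetical helper variable).
excluded = ("World", "Antarctica")
# Build one "?" placeholder per excluded name so the values are bound safely.
placeholders = ", ".join("?" for _ in excluded)
query = f"""SELECT MAX(population), MIN(population)
            FROM facts
            WHERE name NOT IN ({placeholders})"""

max_pop, min_pop = demo.execute(query, excluded).fetchone()
print(max_pop, min_pop)
demo.close()
```

On this four-row sample the maximum and minimum come from Afghanistan and Andorra, so the figures differ from the full-table result shown above.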


That's better. Let's now calculate the average value of the `population` and `area` columns.

In [77]:
run("""SELECT
    AVG(population) AS "Avg Population",
    AVG(area) AS "Avg Area"
FROM facts
WHERE name <> 'World' AND name <> 'Antarctica';""")

|   | Avg Population | Avg Area |
|---|----------------|----------|
| 0 | 32377010.0 | 555093.546185 |


We'll now use this to identify countries that have:

* Above-average values for population.
* Below-average values for area.

This will show us which countries are densely populated.

In [93]:
run("""SELECT name, population, area,
       ROUND(CAST(population AS float) / CAST(area AS float), 2) AS "Ratio"
FROM facts
WHERE population > (SELECT AVG(population)
                    FROM facts
                    WHERE name <> 'World' AND name <> 'Antarctica')
  AND area < (SELECT AVG(area)
              FROM facts
              WHERE name <> 'World' AND name <> 'Antarctica');""")

|   | name | population | area | Ratio |
|---|------|------------|------|-------|
| 0 | Bangladesh | 168957745 | 148460 | 1138.07 |
| 1 | Germany | 80854408 | 357022 | 226.47 |
| 2 | Iraq | 37056169 | 438317 | 84.54 |
| 3 | Italy | 61855120 | 301340 | 205.27 |
| 4 | Japan | 126919659 | 377915 | 335.84 |
| 5 | Korea, South | 49115196 | 99720 | 492.53 |
| 6 | Morocco | 33322699 | 446550 | 74.62 |
| 7 | Philippines | 100998376 | 300000 | 336.66 |
| 8 | Poland | 38562189 | 312685 | 123.33 |
| 9 | Spain | 48146134 | 505370 | 95.27 |
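The query above computes each average in its own subquery. An alternative, sketched here as one possible refactoring rather than the approach used in this notebook, is to compute both averages once in a common table expression and join against it. The `demo` connection, the `averages` CTE name, and the `density` alias are hypothetical, and the four sample rows are copied from the outputs above, so the thresholds are those of the sample and not of the full table.

```python
import sqlite3

demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE facts (name TEXT, population INTEGER, area INTEGER)")
demo.executemany(
    "INSERT INTO facts VALUES (?, ?, ?)",
    [
        ("Bangladesh", 168957745, 148460),
        ("Germany", 80854408, 357022),
        ("Algeria", 39542166, 2381741),
        ("Andorra", 85580, 468),
    ],
)

# The CTE yields a single row holding both averages; the comma join
# attaches that row to every country so it can be used in the WHERE clause.
query = """
WITH averages AS (
    SELECT AVG(population) AS avg_population, AVG(area) AS avg_area
    FROM facts
    WHERE name NOT IN ('World', 'Antarctica')
)
SELECT name, ROUND(CAST(population AS REAL) / area, 2) AS density
FROM facts, averages
WHERE population > avg_population AND area < avg_area
ORDER BY density DESC
"""
dense = demo.execute(query).fetchall()
print(dense)
demo.close()
```

Sorting by the ratio puts the most densely populated countries first, which the original query leaves in table order.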


In [85]:
# Close our connection
con.close()
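Closing the connection by hand works, but it is easy to forget if an earlier cell raises an error. A small sketch of an alternative, assuming the standard-library `contextlib.closing` wrapper and a throwaway in-memory database (the `demo` name is hypothetical):

```python
import sqlite3
from contextlib import closing

# closing() calls demo.close() when the block exits, even after an exception.
with closing(sqlite3.connect(":memory:")) as demo:
    demo.execute("CREATE TABLE facts (name TEXT, population INTEGER)")
    demo.execute("INSERT INTO facts VALUES ('Andorra', 85580)")
    population = demo.execute(
        "SELECT population FROM facts WHERE name = 'Andorra'"
    ).fetchone()[0]

# The connection is closed at this point, so a further query fails.
try:
    demo.execute("SELECT 1")
    still_open = True
except sqlite3.ProgrammingError:
    still_open = False

print(population, still_open)
```

In a notebook the explicit `con.close()` cell is often more convenient, because the connection has to stay open across many cells.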