# Introduction

This project analyzes data from the CIA World Factbook, a compendium of statistics about all of the countries on Earth.

# Data Exploration

In [2]:
import pandas as pd
import sqlite3

First, list the tables in the database:

In [4]:
conn = sqlite3.connect("factbook.db")
q = "SELECT * FROM sqlite_master WHERE type='table';"
pd.read_sql_query(q, conn)

Unnamed: 0,type,name,tbl_name,rootpage,sql
0,table,sqlite_sequence,sqlite_sequence,3,"CREATE TABLE sqlite_sequence(name,seq)"
1,table,facts,facts,47,"CREATE TABLE ""facts"" (""id"" INTEGER PRIMARY KEY..."
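
The sql column above is truncated, so the full schema of facts is not visible. One way to inspect the columns is SQLite's `PRAGMA table_info`. The sketch below runs against an in-memory stand-in rather than factbook.db: `demo_conn` is a hypothetical name, the column names are taken from the facts output in this notebook, and the column types are assumptions.

```python
import sqlite3

# In-memory stand-in for factbook.db. Column names come from this notebook's
# facts output; the column types are assumptions, because the real
# CREATE TABLE statement is truncated in the output above.
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute(
    "CREATE TABLE facts ("
    "id INTEGER PRIMARY KEY, code TEXT, name TEXT, "
    "area INTEGER, area_land INTEGER, area_water INTEGER, "
    "population INTEGER, population_growth REAL, "
    "birth_rate REAL, death_rate REAL, migration_rate REAL)"
)

# PRAGMA table_info returns one row per column:
# (cid, name, type, notnull, dflt_value, pk)
columns = demo_conn.execute("PRAGMA table_info(facts)").fetchall()
column_names = [col[1] for col in columns]
print(column_names)
```

Against the real database, the same PRAGMA statement could be passed to pd.read_sql_query with the existing conn to get the column list as a DataFrame.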


Next, preview the first five rows of the facts table:

In [6]:
facts_table = "SELECT * FROM facts LIMIT 5;"
pd.read_sql_query(facts_table, conn)

Unnamed: 0,id,code,name,area,area_land,area_water,population,population_growth,birth_rate,death_rate,migration_rate
0,1,af,Afghanistan,652230,652230,0,32564342,2.32,38.57,13.89,1.51
1,2,al,Albania,28748,27398,1350,3029278,0.3,12.92,6.58,3.3
2,3,ag,Algeria,2381741,2381741,0,39542166,1.84,23.67,4.31,0.92
3,4,an,Andorra,468,468,0,85580,0.12,8.13,6.96,0.0
4,5,ao,Angola,1246700,1246700,0,19625353,2.78,38.78,11.49,0.46


Here are the descriptions for some of the columns:

- name - The name of the country.
- area - The country's total area (both land and water).
- area_land - The country's land area in square kilometers.
- area_water - The country's water area in square kilometers.
- population - The country's population.
- population_growth - The country's population growth as a percentage.
- birth_rate - The country's birth rate, or the number of births a year per 1,000 people.
- death_rate - The country's death rate, or the number of deaths a year per 1,000 people.
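
Since area is described as the total of land and water, we can sanity-check that relationship with a query. This is a minimal sketch using only the five rows previewed above, loaded into an in-memory table. `demo_conn` and the reduced column set are assumptions for illustration, not part of the original notebook.

```python
import sqlite3

# In-memory stand-in holding only the five rows from the preview above.
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute(
    "CREATE TABLE facts (name TEXT, area INTEGER, area_land INTEGER, area_water INTEGER)"
)
preview_rows = [
    ("Afghanistan", 652230, 652230, 0),
    ("Albania", 28748, 27398, 1350),
    ("Algeria", 2381741, 2381741, 0),
    ("Andorra", 468, 468, 0),
    ("Angola", 1246700, 1246700, 0),
]
demo_conn.executemany("INSERT INTO facts VALUES (?, ?, ?, ?)", preview_rows)

# Select rows where the total area differs from land area plus water area.
mismatch_query = "SELECT name FROM facts WHERE area != area_land + area_water"
mismatches = demo_conn.execute(mismatch_query).fetchall()
print(mismatches)
```

For these five rows the list is empty. Whether the relationship holds for every row of the full table is not checked here.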

Let's start by calculating some summary statistics and see what they tell us.

In [9]:
mini_population = "SELECT name, MIN(population) FROM facts"
pd.read_sql_query(mini_population, conn)

Unnamed: 0,name,MIN(population)
0,Antarctica,0


In [11]:
max_population = "SELECT name, MAX(population) FROM facts"
pd.read_sql_query(max_population, conn)

Unnamed: 0,name,MAX(population)
0,World,7256490011


In [12]:
mini_population_growth = "SELECT name, MIN(population_growth) FROM facts"
pd.read_sql_query(mini_population_growth, conn)

Unnamed: 0,name,MIN(population_growth)
0,Holy See (Vatican City),0.0


In [13]:
max_population_growth = "SELECT name, MAX(population_growth) FROM facts"
pd.read_sql_query(max_population_growth, conn)

Unnamed: 0,name,MAX(population_growth)
0,South Sudan,4.02


The queries above reveal two outliers in the dataset:
- Antarctica, with a population of 0
- World, with a population of 7,256,490,011

According to Wikipedia (https://en.wikipedia.org/wiki/Antarctica), Antarctica is, on average, the coldest, driest, and windiest continent, and has the highest average elevation of all the continents. The temperature there has dropped to −89.2 °C (−128.6 °F), or even −94.7 °C (−135.8 °F) as measured from space, and the average for the third quarter (the coldest part of the year) is −63 °C (−81 °F). Given these conditions, a population of 0 for Antarctica is reasonable.
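
The queries above select name next to MIN() or MAX() without a GROUP BY. SQLite documents that, in this case, the bare column takes its value from the row holding the minimum or maximum, but many other databases reject such a query. A more portable alternative is to sort and take the first row. The sketch below uses an in-memory table with four rows whose values appear in this notebook. `demo_conn` is a hypothetical name, and the `IS NOT NULL` filter is an added precaution, because SQLite sorts NULLs first in ascending order.

```python
import sqlite3

# In-memory stand-in with population figures taken from the outputs above.
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE facts (name TEXT, population INTEGER)")
demo_conn.executemany(
    "INSERT INTO facts VALUES (?, ?)",
    [
        ("Afghanistan", 32564342),
        ("Antarctica", 0),
        ("China", 1367485388),
        ("World", 7256490011),
    ],
)

# Sort by population and keep the first row instead of relying on
# SQLite's bare-column behavior with MIN()/MAX().
smallest_query = (
    "SELECT name, population FROM facts "
    "WHERE population IS NOT NULL ORDER BY population ASC LIMIT 1"
)
largest_query = (
    "SELECT name, population FROM facts "
    "WHERE population IS NOT NULL ORDER BY population DESC LIMIT 1"
)
smallest = demo_conn.execute(smallest_query).fetchone()
largest = demo_conn.execute(largest_query).fetchone()
print(smallest, largest)
```

On this sample the two rows are Antarctica and World, matching the outliers found above.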

# Excluding the outlier

We exclude the row for the whole world and recalculate the summary statistics.

In [20]:
exclude_world = "SELECT MIN(population), MAX(population), MIN(population_growth), MAX(population_growth) FROM facts WHERE name != 'World'"
pd.read_sql_query(exclude_world, conn)

Unnamed: 0,MIN(population),MAX(population),MIN(population_growth),MAX(population_growth)
0,0,1367485388,0.0,4.02


After excluding the World row, the maximum population is 1,367,485,388.

In [21]:
clean_max_population = "SELECT name, MAX(population) FROM facts WHERE name != 'World'"
pd.read_sql_query(clean_max_population, conn)

Unnamed: 0,name,MAX(population)
0,China,1367485388


It is reasonable that China has the largest population in the world. Next, we calculate the average population and area.

In [22]:
average_value = "SELECT AVG(population), AVG(area) FROM facts WHERE name != 'World'"
pd.read_sql_query(average_value, conn)

Unnamed: 0,AVG(population),AVG(area)
0,32242670.0,555093.546185


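We will find all countries meeting both of the following criteria:
- The population is above average.
- The area is below average.

Note that the subqueries in the following query average over every row in facts, including the World row, so the population threshold is higher than the 32,242,670 computed above.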

In [26]:
pop_above_average = "SELECT * FROM facts WHERE population > (SELECT avg(population) FROM facts) AND area < (SELECT avg(area) FROM facts)"
pd.read_sql_query(pop_above_average, conn)

Unnamed: 0,id,code,name,area,area_land,area_water,population,population_growth,birth_rate,death_rate,migration_rate
0,14,bg,Bangladesh,148460,130170,18290,168957745,1.6,21.14,5.61,0.46
1,65,gm,Germany,357022,348672,8350,80854408,0.17,8.47,11.42,1.24
2,85,ja,Japan,377915,364485,13430,126919659,0.16,7.93,9.51,0.0
3,138,rp,Philippines,300000,298170,1830,100998376,1.61,24.27,6.11,2.09
4,173,th,Thailand,513120,510890,2230,67976405,0.34,11.19,7.8,0.0
5,185,uk,United Kingdom,243610,241930,1680,64088222,0.54,12.17,9.35,2.54
6,192,vm,Vietnam,331210,310070,21140,94348835,0.97,15.96,5.93,0.3


Some of these countries are well known for being densely populated, especially developing countries in Asia such as Bangladesh, the Philippines, and Vietnam. Their birth rates are 21.14, 24.27, and 15.96 births per 1,000 people, respectively.
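
The subqueries in In [26] compute their averages over all rows, World included, whereas the averages in In [22] exclude it. The sketch below shows how the filter could be applied consistently, using a small in-memory sample built from rows shown in this notebook. `demo_conn` is a hypothetical name, and World's area is a NULL placeholder because its real value does not appear above. The results describe this sample only, not the full table.

```python
import sqlite3

# In-memory sample: (name, area, population) values copied from the outputs above.
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE facts (name TEXT, area INTEGER, population INTEGER)")
demo_conn.executemany(
    "INSERT INTO facts VALUES (?, ?, ?)",
    [
        ("Afghanistan", 652230, 32564342),
        ("Albania", 28748, 3029278),
        ("Algeria", 2381741, 39542166),
        ("Andorra", 468, 85580),
        ("Angola", 1246700, 19625353),
        ("Bangladesh", 148460, 168957745),
        ("Germany", 357022, 80854408),
        ("Japan", 377915, 126919659),
        ("World", None, 7256490011),  # area is a placeholder (NULL), not the real value
    ],
)

# Averages taken over every row, World included (the shape of the In [26] query).
with_world = """
    SELECT name FROM facts
    WHERE population > (SELECT AVG(population) FROM facts)
      AND area < (SELECT AVG(area) FROM facts)
    ORDER BY name
"""

# Averages and candidate rows both exclude the World row.
without_world = """
    SELECT name FROM facts
    WHERE name != 'World'
      AND population > (SELECT AVG(population) FROM facts WHERE name != 'World')
      AND area < (SELECT AVG(area) FROM facts WHERE name != 'World')
    ORDER BY name
"""

names_with_world = [row[0] for row in demo_conn.execute(with_world)]
names_without_world = [row[0] for row in demo_conn.execute(without_world)]
print(names_with_world)
print(names_without_world)
```

On this sample, the first query returns no rows, because World's population pulls the average above every individual country, while the second returns Bangladesh, Germany, and Japan. Running the second query against factbook.db would likely return more countries than the table above, since the population threshold would be lower.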