I found the CIA World Factbook data set published as a single-file SQLite database on GitHub and enabled the IPython SQL magic functions (also on GitHub) so that I can run SQL directly in a Jupyter environment. The goal is to check data quality and do some exploratory analysis in SQL, while keeping the flexibility of Python libraries for simple visuals.
https://github.com/factbook/factbook.sql/releases
 

In [71]:
%%capture
%load_ext sql
%sql sqlite:///factbook.db


In [67]:
%%sql

SELECT *
  FROM sqlite_master
 WHERE type='table'; 


 * sqlite:///factbook.db
Done.


type,name,tbl_name,rootpage,sql
table,sqlite_sequence,sqlite_sequence,3,"CREATE TABLE sqlite_sequence(name,seq)"
table,facts,facts,47,"CREATE TABLE ""facts"" (""id"" INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL, ""code"" varchar(255) NOT NULL, ""name"" varchar(255) NOT NULL, ""area"" integer, ""area_land"" integer, ""area_water"" integer, ""population"" integer, ""population_growth"" float, ""birth_rate"" float, ""death_rate"" float, ""migration_rate"" float)"
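The same schema inspection can be done outside the notebook magics with Python's built-in sqlite3 module. This is only a minimal sketch: it uses a hypothetical in-memory table with a cut-down set of columns as a stand-in for factbook.db, so it runs without the real file.

```python
import sqlite3

# Hypothetical in-memory stand-in for factbook.db (cut-down "facts" table).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE facts (id INTEGER PRIMARY KEY, code TEXT NOT NULL, "
    "name TEXT NOT NULL, area INTEGER, population INTEGER)"
)

# Same lookup as the %%sql cell: tables registered in sqlite_master.
tables = [
    row[0]
    for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    )
]

# PRAGMA table_info returns one row per column:
# (cid, name, type, notnull, dflt_value, pk).
columns = [(row[1], row[2]) for row in conn.execute("PRAGMA table_info(facts)")]

print(tables)
print(columns)
```

PRAGMA table_info gives the column names and declared types as rows, which is easier to read than the raw CREATE TABLE statement stored in sqlite_master.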


Exploratory analysis in SQL & Pandas:

In [70]:
%%sql 
SELECT *
FROM facts


 * sqlite:///factbook.db
Done.


id,code,name,area,area_land,area_water,population,population_growth,birth_rate,death_rate,migration_rate
1,af,Afghanistan,652230.0,652230.0,0.0,32564342.0,2.32,38.57,13.89,1.51
2,al,Albania,28748.0,27398.0,1350.0,3029278.0,0.3,12.92,6.58,3.3
3,ag,Algeria,2381741.0,2381741.0,0.0,39542166.0,1.84,23.67,4.31,0.92
4,an,Andorra,468.0,468.0,0.0,85580.0,0.12,8.13,6.96,0.0
5,ao,Angola,1246700.0,1246700.0,0.0,19625353.0,2.78,38.78,11.49,0.46
6,ac,Antigua and Barbuda,442.0,442.0,0.0,92436.0,1.24,15.85,5.69,2.21
7,ar,Argentina,2780400.0,2736690.0,43710.0,43431886.0,0.93,16.64,7.33,0.0
8,am,Armenia,29743.0,28203.0,1540.0,3056382.0,0.15,13.61,9.34,5.8
9,as,Australia,7741220.0,7682300.0,58920.0,22751014.0,1.07,12.15,7.14,5.65
10,au,Austria,83871.0,82445.0,1426.0,8665550.0,0.55,9.41,9.42,5.56


Inspecting the total number of rows

In [44]:
%%sql
SELECT COUNT (*)
  FROM facts

 * sqlite:///factbook.db
Done.


COUNT (*)
261


Inspecting the number of unique country names. If it equals the total number of rows, there are no duplicate names.

In [49]:
%%sql
SELECT COUNT (DISTINCT name)
  FROM facts

 * sqlite:///factbook.db
Done.


COUNT (DISTINCT name)
261
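To show what the two counts look like when a duplicate is actually present, here is a small sketch with Python's sqlite3 module. The three rows are made up for illustration (one name is inserted twice on purpose) and are not taken from the Factbook file.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE facts (code TEXT, name TEXT)")

# Made-up rows; "Albania" is deliberately inserted twice.
conn.executemany(
    "INSERT INTO facts VALUES (?, ?)",
    [("af", "Afghanistan"), ("al", "Albania"), ("al", "Albania")],
)

# Total rows versus distinct names: a mismatch signals duplicates.
total, distinct_names = conn.execute(
    "SELECT COUNT(*), COUNT(DISTINCT name) FROM facts"
).fetchone()

# Grouping pinpoints which (code, name) pairs are repeated.
duplicates = conn.execute(
    "SELECT code, name, COUNT(*) FROM facts "
    "GROUP BY code, name HAVING COUNT(*) > 1"
).fetchall()

print(total, distinct_names, duplicates)
```

When the two counts match, as they do for the real table (261 and 261), the grouped query returns no rows.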


Checking again for duplicates in a different way

In [50]:
%%sql
SELECT code, name, COUNT (*)
  FROM facts
 GROUP BY code, name
HAVING COUNT(*) > 1

 * sqlite:///factbook.db
Done.


code,name,COUNT (*)


Checking for rows where every numeric column is NULL (COALESCE returns NULL only when all of its arguments are NULL)

In [53]:
%%sql
SELECT * 
FROM facts 
WHERE coalesce (area, area_land, area_water, population, population_growth, birth_rate, death_rate , migration_rate)
IS NULL


 * sqlite:///factbook.db
Done.


id,code,name,area,area_land,area_water,population,population_growth,birth_rate,death_rate,migration_rate
210,fs,French Southern and Antarctic Lands,,,,,,,,
249,um,United States Pacific Island Wildlife Refuges,,,,,,,,
256,xq,Arctic Ocean,,,,,,,,
257,zh,Atlantic Ocean,,,,,,,,
258,xo,Indian Ocean,,,,,,,,
259,zn,Pacific Ocean,,,,,,,,
260,oo,Southern Ocean,,,,,,,,
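Note that the COALESCE filter only catches rows where all listed columns are NULL. A row with a single missing value needs an OR over the individual columns. The sketch below contrasts the two on a made-up three-row table; "Hypothetical Island" is an invented row, not a Factbook entry.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE facts (name TEXT, area INTEGER, population INTEGER)")
conn.executemany(
    "INSERT INTO facts VALUES (?, ?, ?)",
    [
        ("Albania", 28748, 3029278),         # no NULLs
        ("Hypothetical Island", 100, None),  # invented row: one NULL
        ("Arctic Ocean", None, None),        # every value NULL
    ],
)

# COALESCE returns its first non-NULL argument, so the result is NULL
# only when every column in the list is NULL.
all_null = [
    row[0]
    for row in conn.execute(
        "SELECT name FROM facts WHERE COALESCE(area, population) IS NULL "
        "ORDER BY name"
    )
]

# OR-ing the individual checks finds rows with at least one NULL.
any_null = [
    row[0]
    for row in conn.execute(
        "SELECT name FROM facts WHERE area IS NULL OR population IS NULL "
        "ORDER BY name"
    )
]

print(all_null)
print(any_null)
```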


Leaving out the rows from above that have NULLs across all the columns (e.g. the oceans). However, some remaining rows still show "None" in particular cells, for example territories that are not populated.

In [155]:
%%sql
SELECT * 
FROM facts 
WHERE coalesce (area, area_land, area_water, population_growth, birth_rate, death_rate , migration_rate)
IS NOT NULL


 * sqlite:///factbook.db
Done.


id,code,name,area,area_land,area_water,population,population_growth,birth_rate,death_rate,migration_rate
1,af,Afghanistan,652230.0,652230.0,0.0,32564342.0,2.32,38.57,13.89,1.51
2,al,Albania,28748.0,27398.0,1350.0,3029278.0,0.3,12.92,6.58,3.3
3,ag,Algeria,2381741.0,2381741.0,0.0,39542166.0,1.84,23.67,4.31,0.92
4,an,Andorra,468.0,468.0,0.0,85580.0,0.12,8.13,6.96,0.0
5,ao,Angola,1246700.0,1246700.0,0.0,19625353.0,2.78,38.78,11.49,0.46
6,ac,Antigua and Barbuda,442.0,442.0,0.0,92436.0,1.24,15.85,5.69,2.21
7,ar,Argentina,2780400.0,2736690.0,43710.0,43431886.0,0.93,16.64,7.33,0.0
8,am,Armenia,29743.0,28203.0,1540.0,3056382.0,0.15,13.61,9.34,5.8
9,as,Australia,7741220.0,7682300.0,58920.0,22751014.0,1.07,12.15,7.14,5.65
10,au,Austria,83871.0,82445.0,1426.0,8665550.0,0.55,9.41,9.42,5.56


Inspecting for outliers, I decided to use the z-score. The issue below is that stdev is not in the list of built-in aggregate functions in SQLite (https://www.sqlite.org/lang_aggfunc.html). That's why I'll switch to Pandas to deal with outliers.

In [143]:
%%sql
SELECT name, (area - avg(area)) / stdev(area)
   from (SELECT * 
FROM facts 
WHERE coalesce (area, area_land, area_water, population, population_growth, birth_rate, death_rate , migration_rate)
IS NOT NULL)
    

 * sqlite:///factbook.db
(sqlite3.OperationalError) no such function: stdev
[SQL: SELECT name, (area - avg(area)) / stdev(area)
   from (SELECT * 
FROM facts 
WHERE coalesce (area, area_land, area_water, population, population_growth, birth_rate, death_rate , migration_rate)
IS NOT NULL)]
(Background on this error at: http://sqlalche.me/e/14/e3q8)
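For reference, staying in SQLite is possible when the database is opened from Python, because sqlite3 lets you register your own aggregate with Connection.create_aggregate. The sketch below is an assumption-laden illustration, not what this notebook does: the PopulationStdev class, the name "stdev", and the three-row table are all made up here, and the aggregate uses the population standard deviation (divide by n).

```python
import math
import sqlite3


class PopulationStdev:
    """User-defined aggregate: population standard deviation, ignoring NULLs."""

    def __init__(self):
        self.values = []

    def step(self, value):
        # Called once per row; skip NULLs like the built-in aggregates do.
        if value is not None:
            self.values.append(value)

    def finalize(self):
        if not self.values:
            return None
        mean = sum(self.values) / len(self.values)
        variance = sum((v - mean) ** 2 for v in self.values) / len(self.values)
        return math.sqrt(variance)


conn = sqlite3.connect(":memory:")
conn.create_aggregate("stdev", 1, PopulationStdev)  # register under the name "stdev"

conn.execute("CREATE TABLE facts (name TEXT, area REAL)")
conn.executemany(
    "INSERT INTO facts VALUES (?, ?)", [("A", 2.0), ("B", 4.0), ("C", 6.0)]
)

# The mean and stdev are computed once in a subquery and joined to every
# row, so each row gets its own z-score instead of one collapsed result.
zscores = conn.execute(
    """
    SELECT f.name, (f.area - s.mean_area) / s.sd_area AS z
      FROM facts AS f
     CROSS JOIN (SELECT AVG(area) AS mean_area, stdev(area) AS sd_area
                   FROM facts) AS s
     ORDER BY f.name
    """
).fetchall()

print(zscores)
```

The subquery matters: mixing avg(area) with a bare area column in one SELECT would collapse the result to a single row rather than producing one z-score per country.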


Working with a data frame in Pandas

In [193]:
# Working with data frames in Pandas in order to calculate the z-score (could be done in Tableau as well, but I don't have it at the moment)

import pandas as pd
import numpy as np
import scipy.stats as stats
df = pd.read_excel('factbook.xlsx')
df['population'] = df['population'].fillna(0)
print(df.head(5))

   id code         name       area  area_land  area_water  population  \
0   1   af  Afghanistan   652230.0   652230.0         0.0  32564342.0   
1   2   al      Albania    28748.0    27398.0      1350.0   3029278.0   
2   3   ag      Algeria  2381741.0  2381741.0         0.0  39542166.0   
3   4   an      Andorra      468.0      468.0         0.0     85580.0   
4   5   ao       Angola  1246700.0  1246700.0         0.0  19625353.0   

   population_growth  birth_rate  death_rate  migration_rate  
0               2.32       38.57       13.89            1.51  
1               0.30       12.92        6.58            3.30  
2               1.84       23.67        4.31            0.92  
3               0.12        8.13        6.96            0.00  
4               2.78       38.78       11.49            0.46  
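The z-score used in the next cell is (value - mean) / standard deviation, and scipy.stats.zscore uses the population standard deviation (ddof=0) by default. The plain-Python sketch below shows the same formula and the +/- 3 cut-off on made-up numbers. The names and values are invented for illustration: 19 similar entries plus one much larger "aggregate_row".

```python
from statistics import fmean, pstdev

# Made-up populations: 19 similar values plus one far larger entry.
populations = {f"country_{i}": 10.0 for i in range(19)}
populations["aggregate_row"] = 1000.0

values = list(populations.values())
mean = fmean(values)
sd = pstdev(values)  # population standard deviation, i.e. ddof=0

# z-score of each entry: distance from the mean in standard deviations.
zscores = {name: (value - mean) / sd for name, value in populations.items()}

# Flag anything more than 3 standard deviations from the mean.
outliers = [name for name, z in zscores.items() if abs(z) > 3]

print(outliers)
```

One caveat with small tables: with n rows, no z-score computed this way can exceed (n - 1) / sqrt(n), so a +/- 3 threshold can only trigger when there are at least 11 rows.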


In [192]:
# Since it's a small data set, going through the list of z-scores to check for outliers beyond +/- 3 standard deviations.
# The last row, with id=253 and z-score=15, is obviously an outlier and distorts the average of the whole "population" column.

df['population_zscore'] = stats.zscore(df['population'])

with pd.option_context('display.max_rows', None, 'display.max_columns', None):
    print(df)

      id code                                           name        area  \
0      1   af                                    Afghanistan    652230.0   
1      2   al                                        Albania     28748.0   
2      3   ag                                        Algeria   2381741.0   
3      4   an                                        Andorra       468.0   
4      5   ao                                         Angola   1246700.0   
5      6   ac                            Antigua and Barbuda       442.0   
6      7   ar                                      Argentina   2780400.0   
7      8   am                                        Armenia     29743.0   
8      9   as                                      Australia   7741220.0   
9     10   au                                        Austria     83871.0   
10    11   aj                                     Azerbaijan     86600.0   
11    12   bf                                   Bahamas, The     13880.0   
12    13   b