## Exploratory Data Analysis

This workflow explores the underlying world development data and provides commentary and observations.

The workflow is intended for exploration and documentation. A separate visualization workflow is intended for presenting final findings and recommendations.


In [2]:
import pandas as pd

life_expectancy = pd.read_csv('../data/life_expectancy_cleaned.csv', index_col = 'country')
population = pd.read_csv('../data/population_cleaned.csv', index_col = 'country')
gni = pd.read_csv('../data/gni_per_capita_cleaned.csv', index_col = 'country')

gni_title = "Per Capita GNI (Gross National Income)"
life_expectancy_title = "Life Expectancy"
population_title = "Population"

datasets = {
    gni_title: gni,
    life_expectancy_title: life_expectancy,
    population_title: population
}

print('successfully loaded cleaned datasets.')

successfully loaded cleaned datasets.


In [5]:
all_data = pd.read_csv('../data/world_development_data.csv')
years = sorted(list(all_data.year.unique()))
countries = sorted(list(all_data.country.unique()))
print(f'{len(countries)} countries from {min(years)} to {max(years)} ({1 + max(years) - min(years)} years)')

190 countries from 1923 to 2023 (101 years)


In [23]:
# 1. Summary statistics
print("Summary Statistics :")
for index, (name, data) in enumerate(datasets.items()):
    print(name, ':')
    print(data.T.describe())
    if index != len(datasets) - 1:
        print('\n---\n')

Summary Statistics :
Per Capita GNI (Gross National Income) :
country  Afghanistan       Angola      Albania  United Arab Emirates  \
count     101.000000   101.000000   101.000000            101.000000   
mean      416.732673  1730.534653  1648.287129          34416.336634   
std       106.017818  1089.116289  1668.142244          16895.415220   
min       151.000000   502.000000   447.000000           9550.000000   
25%       380.000000  1020.000000   503.000000          16900.000000   
50%       407.000000  1370.000000   998.000000          38500.000000   
75%       456.000000  2140.000000  1400.000000          47400.000000   
max       728.000000  5530.000000  5950.000000          67000.000000   

country     Argentina      Armenia  Antigua and Barbuda     Australia  \
count      101.000000   101.000000           101.000000    101.000000   
mean      6863.168317  1531.881188          7398.118812  24853.267327   
std       3307.840060  1337.753664          5666.655224  17689.612438 

In [24]:
# 2. Standard deviation using dictionary comprehension
data = datasets['Population']
print('Population standard deviations by year:')
sd = { year: f'{int(values.std()):,}' for year, values in data.items() }  # `values` avoids shadowing the `countries` list defined above
sd

Population standard deviations by year:


{'1923': '42,279,462',
 '1924': '42,539,915',
 '1925': '42,814,093',
 '1926': '43,019,267',
 '1927': '43,295,231',
 '1928': '43,571,125',
 '1929': '43,835,664',
 '1930': '44,152,732',
 '1931': '44,458,035',
 '1932': '44,858,976',
 '1933': '45,222,433',
 '1934': '45,625,768',
 '1935': '46,016,356',
 '1936': '46,422,160',
 '1937': '46,827,756',
 '1938': '47,180,438',
 '1939': '47,547,151',
 '1940': '47,834,632',
 '1941': '48,055,708',
 '1942': '48,234,818',
 '1943': '48,323,382',
 '1944': '48,369,159',
 '1945': '48,512,635',
 '1946': '48,621,493',
 '1947': '48,735,934',
 '1948': '48,896,602',
 '1949': '49,141,471',
 '1950': '49,596,223',
 '1951': '50,541,785',
 '1952': '51,542,095',
 '1953': '52,591,935',
 '1954': '53,759,977',
 '1955': '54,930,073',
 '1956': '56,102,595',
 '1957': '57,333,679',
 '1958': '58,557,004',
 '1959': '59,489,823',
 '1960': '60,089,221',
 '1961': '60,629,228',
 '1962': '61,724,272',
 '1963': '63,313,544',
 '1964': '65,005,921',
 '1965': '66,599,342',
 '1966': '6

In [25]:
# 3. Investigation

# For each data set, look at:
# - Top 10 of the century (by highest avg)
# - Most improved of the century (by highest max:min ratio)
n = 10

def top_countries(data: pd.DataFrame) -> pd.DataFrame:
    top = data.agg(['mean', 'max', 'min'], axis = 1)
    return top.sort_values(by = 'mean', ascending = False).head(n)

def bottom_countries(data: pd.DataFrame) -> pd.DataFrame:
    bottom = data.agg(['mean', 'min', 'max'], axis = 1)
    return bottom.sort_values(by = 'mean', ascending = True).head(n)
    
def most_changed_countries(data: pd.DataFrame) -> pd.DataFrame:
    minmax = data.agg(['max', 'min'], axis = 1)
    minmax['max:min'] = minmax['max'] / minmax['min']
    minmax = minmax.iloc[:, [2,0,1]]
    return minmax.sort_values(by = 'max:min', ascending = False).head(n)

# Utility functions for outputting data:
def format_growth_rate(data: pd.DataFrame) -> pd.DataFrame:
    # Given a data frame from most_changed_countries, format its max:min column as a multiplier.
    # E.g.: 16.289456 -> '16.29x'
    # Note that this converts the column to strings, preventing further analysis.
    # Perform this operation only as a final step before presenting data.
    data = data.copy()  # work on a copy so the caller's frame is not modified
    data['max:min'] = data['max:min'].map(describe_float).map(lambda v: v + 'x')
    return data

def describe_int(value) -> str:
    # Given a numeric value, output the number in a way intended for large integers, using K/M/B suffixes for thousands/millions/billions.
    # E.g.: 1_234_567.0 -> '1.2M'
    # Strings (such as an already formatted max:min column) pass through unchanged.
    if isinstance(value, str):
        return value
    if value >= 1_000_000_000:
        return f'{value / 1_000_000_000:,.1f}B'
    if value >= 1_000_000:
        return f'{value / 1_000_000:,.1f}M'
    if value >= 1_000:
        return f'{value / 1_000:,.1f}K'
    return f'{int(value):,}'

def describe_float(value) -> str:
    # Outputs fractional numbers in abbreviated format for display, limiting them to two decimal places.
    # Strings pass through unchanged.
    if isinstance(value, str):
        return value
    return f'{float(value):,.2f}'

print(f'Top {n} Population (by mean):')
print(top_countries(population).map(describe_int))

print('\n\n')
print(f'Top {n} Population Growth Rate:')
print(most_changed_countries(population).pipe(format_growth_rate).map(describe_int))

Top 10 Population (by mean):
                 mean     max     min
country                              
China          900.5M    1.4B  470.0M
India          712.5M    1.4B  308.0M
United States  212.9M  340.0M  109.0M
Indonesia      140.6M  278.0M   49.0M
Russia         123.8M  149.0M   79.9M
Brazil         112.1M  216.0M   29.3M
Japan          103.0M  128.0M   58.7M
Pakistan        93.3M  240.0M   22.9M
Bangladesh      83.4M  173.0M   29.0M
Nigeria         82.1M  224.0M   23.9M



Top 10 Population Growth Rate:
                      max:min     max     min
country                                      
United Arab Emirates  163.29x    9.5M   58.3K
Qatar                 150.27x    2.8M   18.7K
Jordan                 55.67x   11.3M  203.0K
Kuwait                 37.63x    4.4M  118.0K
Djibouti               24.89x    1.1M   45.8K
Israel                 18.83x    9.2M  487.0K
Papua New Guinea       18.80x   10.3M  548.0K
Cote d'Ivoire          16.24x   28.9M    1.8M
Brunei               

### Observations - Population

#### India + China Are Most Populous
**China** and **India** unsurprisingly dominate the list of most populous countries, with the **United States** a distant third. Most of these countries show fairly steady growth, with the minimum and maximum populations typically coinciding with the oldest and most recent measurements. **India has recently overtaken China as the most populous country**, so China takes the #1 spot only because we rank by mean population rather than by current value.
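
Whether a country's minimum and maximum fall on the oldest and most recent measurements can be checked directly rather than read off the table. The following is a minimal sketch on hypothetical toy data; `toy_population` and `extremes_at_endpoints` are illustrative names that are not part of the notebook, and the toy frame only mimics the wide layout (one row per country, one column per year) of the cleaned datasets.

```python
import pandas as pd

# Hypothetical toy data in the same wide layout as the cleaned datasets:
# one row per country, one column per year (year labels are strings).
toy_population = pd.DataFrame(
    {
        "1923": [100.0, 50.0, 80.0],
        "1973": [150.0, 90.0, 120.0],
        "2023": [200.0, 140.0, 110.0],
    },
    index=pd.Index(["A", "B", "C"], name="country"),
)

def extremes_at_endpoints(data: pd.DataFrame) -> pd.DataFrame:
    # For each country, find the years of its minimum and maximum values and
    # flag whether they fall on the first and last columns respectively.
    first_year, last_year = data.columns[0], data.columns[-1]
    result = pd.DataFrame(
        {
            "min_year": data.idxmin(axis=1),
            "max_year": data.idxmax(axis=1),
        }
    )
    result["steady"] = (result["min_year"] == first_year) & (result["max_year"] == last_year)
    return result

endpoint_check = extremes_at_endpoints(toy_population)
print(endpoint_check)
```

In the toy data, country `C` peaks in the middle column, so it is flagged as not steady. Applied to `population`, the same check would list any country whose peak precedes the final year.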

#### United Arab Emirates, Qatar, and Middle Eastern Countries Have Grown Most Rapidly
Ranking countries by how much they have grown over the last century yields a far more surprising and interesting result. The neighboring **United Arab Emirates** and **Qatar** show by far the largest growth, with maximum populations more than 150 times their minimums. Also surprising is that the list of fastest-growing countries is dominated by Middle Eastern countries. All countries that feature prominently on the growth rate list have quite low minimum populations, which may suggest either population expansion from an initially small territory or a substantial population reduction due to conflict or disaster. Establishing the reasons for these population explosions, though interesting, is outside the scope of this project.

In [26]:
print(f'Life Expectancy - Top {n} (by mean):')
print(top_countries(life_expectancy).map(describe_float))

print('\n\n')
print(f'Life Expectancy - Top {n} (by growth rate):')
print(most_changed_countries(life_expectancy).pipe(format_growth_rate).map(describe_float))

print('\n\n')
print(f'Life Expectancy - Bottom {n} (by mean):')
print(bottom_countries(life_expectancy).map(describe_float))

Life Expectancy - Top 10 (by mean):
                 mean    max    min
country                            
Sweden          74.42  83.40  61.70
Norway          74.21  83.50  62.00
Iceland         74.07  84.70  54.30
Netherlands     73.73  82.30  55.50
Switzerland     73.52  84.50  59.50
Australia       73.49  83.50  62.00
Denmark         72.83  81.70  61.40
Canada          72.69  82.80  57.10
New Zealand     72.37  82.40  57.80
United Kingdom  71.94  81.60  57.80



Life Expectancy - Top 10 (by growth rate):
                max:min    max    min
country                              
Kazakhstan       17.28x  72.40   4.19
Ukraine           7.69x  74.40   9.68
Rwanda            7.35x  69.80   9.50
Lithuania         6.10x  76.80  12.60
Turkmenistan      5.17x  71.30  13.80
Poland            5.11x  78.70  15.40
Belarus           5.10x  74.50  14.60
Pakistan          4.89x  66.50  13.60
Kyrgyz Republic   4.84x  74.00  15.30
Russia            4.59x  73.50  16.00



Life Expectancy - Bottom 10

### Observations - Life Expectancy

#### Life Expectancy is On The Rise
We can rejoice that **worldwide life expectancy has increased significantly over the past century**. Apart from limited periods of very low life expectancy, which suggest historic atrocities, the upward trend is nearly universal and steady.
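
One way to quantify the upward trend is to compare each country's first and last measurements and to look at the cross-country mean per year. The sketch below uses hypothetical toy data; `toy_life_expectancy` and `trend_summary` are illustrative names, not part of the notebook.

```python
import pandas as pd

# Hypothetical toy data: one row per country, one column per year.
toy_life_expectancy = pd.DataFrame(
    {
        "1923": [35.0, 50.0, 60.0],
        "1973": [20.0, 65.0, 72.0],  # country A has a temporary collapse
        "2023": [70.0, 78.0, 83.0],
    },
    index=pd.Index(["A", "B", "C"], name="country"),
)

def trend_summary(data: pd.DataFrame) -> dict:
    # Compare the first and last year per country, and the
    # mean across countries for the first and last year.
    first, last = data.iloc[:, 0], data.iloc[:, -1]
    yearly_mean = data.mean()  # one mean per year (column)
    return {
        "share_improved": float((last > first).mean()),
        "mean_first_year": float(yearly_mean.iloc[0]),
        "mean_last_year": float(yearly_mean.iloc[-1]),
    }

summary = trend_summary(toy_life_expectancy)
print(summary)
```

Comparing only the endpoints deliberately ignores temporary dips such as the one for country `A`, which matches the distinction drawn above between the long-run trend and short periods of very low life expectancy.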

#### The Life Expectancy Gap Has Narrowed
Countries with the lowest life expectancies have shown the greatest gains: 6 of the 10 countries with the lowest life expectancies have more than doubled them over the last century. Moreover, while life expectancies around the world ranged from about 30 to 60 years in 1923, they currently range from about 63 to 84 years, a much narrower gap.
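
The narrowing gap can be expressed as the difference between the highest and lowest value in each year. This is a minimal sketch on hypothetical toy data whose values only echo the approximate ranges quoted above; `toy_life_expectancy` and `yearly_spread` are illustrative names, not part of the notebook.

```python
import pandas as pd

# Hypothetical toy data: one row per country, one column per year.
toy_life_expectancy = pd.DataFrame(
    {
        "1923": [30.0, 45.0, 60.0],
        "2023": [63.0, 75.0, 84.0],
    },
    index=pd.Index(["A", "B", "C"], name="country"),
)

def yearly_spread(data: pd.DataFrame) -> pd.DataFrame:
    # For each year (column), compute the lowest value, the highest value,
    # and the gap between them across all countries.
    spread = data.agg(["min", "max"]).T
    spread["gap"] = spread["max"] - spread["min"]
    return spread

spread = yearly_spread(toy_life_expectancy)
print(spread)
```

Because the minimum and maximum are sensitive to single outlier countries, a percentile-based spread may be a more robust alternative on the real data.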

#### Some Years Show Alarmingly Low Life Expectancy
When we look at the countries that have improved life expectancy the most, we find countries with exceptionally low minimum life expectancies. In the most extreme case, **Kazakhstan's average life expectancy was 4.2 years in 1933**. Such alarmingly low values suggest that searching world data for exceptionally low life expectancies could be an effective method for discovering historic atrocities.
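
Searching for exceptionally low values amounts to listing every country-year pair below some cutoff. The sketch below uses hypothetical toy data; `toy_life_expectancy`, `low_points`, and the cutoff of 20 years are illustrative assumptions, not values taken from the notebook.

```python
import pandas as pd

# Hypothetical toy data: one row per country, one column per year.
toy_life_expectancy = pd.DataFrame(
    {
        "1932": [35.0, 40.0],
        "1933": [4.2, 41.0],
        "1934": [30.0, 42.0],
    },
    index=pd.Index(["A", "B"], name="country"),
)

def low_points(data: pd.DataFrame, threshold: float) -> pd.DataFrame:
    # Reshape from wide (country x year) to long (country, year, value),
    # then keep only the rows below the threshold, lowest first.
    long = data.reset_index().melt(
        id_vars="country", var_name="year", value_name="life_expectancy"
    )
    flagged = long[long["life_expectancy"] < threshold]
    return flagged.sort_values("life_expectancy").reset_index(drop=True)

flagged = low_points(toy_life_expectancy, threshold=20.0)  # cutoff is an arbitrary assumption
print(flagged)
```

The result keeps the year alongside the country, which is the information needed to connect a low value to a specific historical event.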

#### Life Expectancy Lists Resemble GNI Lists
The top of the life expectancy list is dominated by small first-world European countries such as **Sweden, Norway, and Iceland** that also fare well in per-capita wealth. Meanwhile, the bottom of the list is dominated by third-world, predominantly African countries such as **Ethiopia**. This suggests that wealth and life expectancy are linked.

Note that the United States does not appear among the top 10 countries by life expectancy.
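
The suggested link between wealth and life expectancy could be tested with a correlation between each country's mean GNI and its mean life expectancy. This is a minimal sketch on hypothetical toy data using the Pearson correlation from `Series.corr`; `toy_gni`, `toy_life_expectancy`, and `mean_correlation` are illustrative names, not part of the notebook.

```python
import pandas as pd

countries_index = pd.Index(["A", "B", "C", "D"], name="country")

# Hypothetical toy data: one row per country, one column per year.
toy_gni = pd.DataFrame(
    {"1923": [500.0, 2000.0, 8000.0, 15000.0], "2023": [1500.0, 9000.0, 40000.0, 70000.0]},
    index=countries_index,
)
toy_life_expectancy = pd.DataFrame(
    {"1923": [30.0, 40.0, 52.0, 60.0], "2023": [63.0, 72.0, 80.0, 84.0]},
    index=countries_index,
)

def mean_correlation(gni: pd.DataFrame, life: pd.DataFrame) -> float:
    # Align the per-country means of both datasets on the country index,
    # drop countries missing from either, and compute the Pearson correlation.
    means = pd.DataFrame(
        {"gni": gni.mean(axis=1), "life_expectancy": life.mean(axis=1)}
    ).dropna()
    return float(means["gni"].corr(means["life_expectancy"]))

r = mean_correlation(toy_gni, toy_life_expectancy)
print(f"Pearson correlation of per-country means: {r:.2f}")
```

A positive coefficient would be consistent with the observation, though a correlation alone says nothing about which way any causal effect runs.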


In [27]:
print(f'GNI - Top {n} (by mean):')
print(top_countries(gni).map(describe_int))

print('\n\n')
print(f'GNI - Top {n} (by growth rate):')
print(most_changed_countries(gni).pipe(format_growth_rate).map(describe_float))

print('\n\n')
print(f'GNI - Bottom {n} (by mean):')
print(bottom_countries(gni).map(describe_int))

GNI - Top 10 (by mean):
                       mean     max    min
country                                   
Switzerland           56.9K  103.0K  17.3K
Luxembourg            38.6K  106.0K   5.5K
Norway                37.5K  117.0K   5.8K
United States         34.8K   70.9K   9.9K
United Arab Emirates  34.4K   67.0K   9.6K
Sweden                33.3K   69.3K   7.5K
Denmark               32.7K   73.2K   8.3K
Qatar                 31.7K   89.3K    542
Japan                 28.2K   67.6K   4.7K
Netherlands           28.1K   64.5K   4.8K



GNI - Top 10 (by growth rate):
                   max:min        max       min
country                                        
Iraq               232.84x  15,600.00     67.00
Azerbaijan         224.74x   8,540.00     38.00
Romania            170.59x  14,500.00     85.00
Qatar              164.76x  89,300.00    542.00
Equatorial Guinea  104.85x  17,300.00    165.00
Brunei              69.40x  50,800.00    732.00
Kuwait              62.00x  77,500.00  1,2

### Observations - GNI

#### GNI Lists Resemble Life Expectancy Lists
The lowest- and highest-ranking countries by GNI resemble the lowest- and highest-ranking countries by life expectancy. See the life expectancy observations above for more on this.

#### Qatar and Singapore are Outliers
The data for **Qatar** and **Singapore** suggest that these countries are historic economic success stories worth investigating in more detail. Most countries with the highest economic growth rates started from extreme poverty but have not risen to prominence on the world stage. Qatar and Singapore, however, show recent GNIs well above those of the other countries with the highest GNI growth rates, on par with first-world countries.
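
Outliers of this kind can be isolated by requiring both a high max:min ratio and a high most-recent value. The sketch below uses hypothetical toy data; `toy_gni`, `high_growth_high_income`, and both cutoffs are illustrative assumptions rather than values from the notebook.

```python
import pandas as pd

# Hypothetical toy data: one row per country, one column per year.
toy_gni = pd.DataFrame(
    {
        "1923": [100.0, 500.0, 5000.0, 80.0],
        "2023": [9000.0, 60000.0, 70000.0, 400.0],
    },
    index=pd.Index(["A", "B", "C", "D"], name="country"),
)

def high_growth_high_income(
    data: pd.DataFrame, min_ratio: float, min_recent: float
) -> pd.DataFrame:
    # Combine the max:min growth ratio with the most recent value (last column)
    # and keep only the countries that clear both cutoffs.
    summary = pd.DataFrame(
        {
            "max:min": data.max(axis=1) / data.min(axis=1),
            "recent": data.iloc[:, -1],
        }
    )
    mask = (summary["max:min"] >= min_ratio) & (summary["recent"] >= min_recent)
    return summary[mask].sort_values(by="max:min", ascending=False)

# Both cutoffs are arbitrary assumptions chosen for the toy data.
outliers = high_growth_high_income(toy_gni, min_ratio=50.0, min_recent=40_000.0)
print(outliers)
```

In the toy data only country `B` clears both cutoffs: `A` grew quickly but remains poor, while `C` is wealthy but was never poor.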