## Data Exploration

This notebook aims to
* better understand the datasets at hand,
* identify the main characteristics of each dataset,
* gain insight for preprocessing (e.g. missing data, columns of interest).

The datasets to be analyzed are
* the Happy Planet Index for 2016 (see https://happyplanetindex.org/),
* the World Development Indicators (1960-2019) by the World Bank (see https://datacatalog.worldbank.org/dataset/world-development-indicators).

In [1]:
# Import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
# Import HPI dataset (Happy Planet Index)
df_hpi = pd.read_excel('../data/raw_data/hpi-data-2016.xlsx',
                       sheet_name='Complete HPI data', header=5, usecols='B:O')
df_hpi.head()

Unnamed: 0,HPI Rank,Country,Region,Average Life Expectancy,Average Wellbeing (0-10),Happy Life Years,Footprint (gha/capita),Inequality of Outcomes,Inequality-adjusted Life Expectancy,Inequality-adjusted Wellbeing,Happy Planet Index,GDP/capita ($PPP),Population,GINI index
0,110.0,Afghanistan,Middle East and North Africa,59.668,3.8,12.396024,0.79,0.426557,38.348818,3.390494,20.22535,690.842629,29726803.0,Data unavailable
1,13.0,Albania,Post-communist,77.347,5.5,34.414736,2.21,0.165134,69.671159,5.09765,36.766874,4247.485437,2900489.0,28.96
2,30.0,Algeria,Middle East and North Africa,74.313,5.6,30.469461,2.12,0.244862,60.474545,5.196449,33.300543,5583.61616,37439427.0,Data unavailable
3,19.0,Argentina,Americas,75.927,6.5,40.166674,3.14,0.164238,68.349583,6.034707,35.190244,14357.411589,42095224.0,42.49
4,73.0,Armenia,Post-communist,74.446,4.3,24.01876,2.23,0.216648,66.921682,3.74714,25.666417,3565.517575,2978339.0,30.48
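
The preview shows that `GINI index` mixes numbers with the string `Data unavailable`, so pandas will most likely read the column as `object` rather than as a numeric type. A minimal sketch of one way to handle this during preprocessing, using a toy frame (`toy_hpi` is a hypothetical stand-in for `df_hpi`, with values copied from the preview):

```python
import pandas as pd

# Toy stand-in for df_hpi; the three rows are copied from the preview above.
toy_hpi = pd.DataFrame({
    'Country': ['Afghanistan', 'Albania', 'Algeria'],
    'GINI index': ['Data unavailable', 28.96, 'Data unavailable'],
})

# errors='coerce' replaces every value that cannot be parsed as a number with NaN.
toy_hpi['GINI index'] = pd.to_numeric(toy_hpi['GINI index'], errors='coerce')

gini_missing = int(toy_hpi['GINI index'].isna().sum())
print(toy_hpi.dtypes)
print('Missing GINI values:', gini_missing)
```

An alternative would be to pass `na_values='Data unavailable'` to `pd.read_excel`, which marks those cells as missing at load time.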


In [3]:
# Import WDI indicator descriptions (World Development Indicators)
df_wdi_inds = pd.read_csv('../data/raw_data/WDISeries.csv')
df_wdi_inds.head()
#df_wdi_inds.shape

Unnamed: 0,Series Code,Topic,Indicator Name,Short definition,Long definition,Unit of measure,Periodicity,Base Period,Other notes,Aggregation method,...,Notes from original source,General comments,Source,Statistical concept and methodology,Development relevance,Related source links,Other web links,Related indicators,License Type,Unnamed: 20
0,AG.AGR.TRAC.NO,Environment: Agricultural production,"Agricultural machinery, tractors",,Agricultural machinery refers to the number of...,,Annual,,,Sum,...,,,"Food and Agriculture Organization, electronic ...",A tractor provides the power and traction to m...,Agricultural land covers more than one-third o...,,,,CC BY-4.0,
1,AG.CON.FERT.PT.ZS,Environment: Agricultural production,Fertilizer consumption (% of fertilizer produc...,,Fertilizer consumption measures the quantity o...,,Annual,,,Weighted average,...,,,"Food and Agriculture Organization, electronic ...",Fertilizer consumption measures the quantity o...,"Factors such as the green revolution, has led ...",,,,CC BY-4.0,
2,AG.CON.FERT.ZS,Environment: Agricultural production,Fertilizer consumption (kilograms per hectare ...,,Fertilizer consumption measures the quantity o...,,Annual,,,Weighted average,...,,,"Food and Agriculture Organization, electronic ...",Fertilizer consumption measures the quantity o...,"Factors such as the green revolution, has led ...",,,,CC BY-4.0,
3,AG.LND.AGRI.K2,Environment: Land use,Agricultural land (sq. km),,Agricultural land refers to the share of land ...,,Annual,,,Sum,...,,,"Food and Agriculture Organization, electronic ...",Agricultural land constitutes only a part of a...,Agricultural land covers more than one-third o...,,,,CC BY-4.0,
4,AG.LND.AGRI.ZS,Environment: Land use,Agricultural land (% of land area),,Agricultural land refers to the share of land ...,,Annual,,,Weighted average,...,,,"Food and Agriculture Organization, electronic ...",Agriculture is still a major sector in many ec...,Agricultural land covers more than one-third o...,,,,CC BY-4.0,


In [4]:
# Explore topics & indicators
topic_count = df_wdi_inds['Topic'].unique().shape[0]
indicator_count = df_wdi_inds['Indicator Name'].unique().shape[0]
topics = np.sort(df_wdi_inds['Topic'].unique())

# Count indicators per topic once and reuse the result
indicators_per_topic = df_wdi_inds.groupby('Topic')['Indicator Name'].count()
min_per_topic = indicators_per_topic.min()
max_per_topic = indicators_per_topic.max()

print('There are', topic_count, 'topics covering', indicator_count, 'indicators. \n'
      'Per topic there is a minimum of', min_per_topic,
      'indicators and a maximum of', max_per_topic, 'indicators.')

indicators_per_topic

There are 88 topics covering 1429 indicators. 
Per topic there is a minimum of 1 indicators and a maximum of 77 indicators.


Topic
Economic Policy & Debt: Balance of payments: Capital & financial account                              11
Economic Policy & Debt: Balance of payments: Current account: Balances                                 4
Economic Policy & Debt: Balance of payments: Current account: Goods, services & income                22
Economic Policy & Debt: Balance of payments: Current account: Transfers                                7
Economic Policy & Debt: Balance of payments: Reserves & other items                                    6
Economic Policy & Debt: External debt: Debt outstanding                                               10
Economic Policy & Debt: External debt: Debt ratios & other items                                      11
Economic Policy & Debt: External debt: Debt service                                                    4
Economic Policy & Debt: External debt: Net flows                                                      20
Economic Policy & Debt: National accounts: Adjust
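
The topic names in the output above appear to be hierarchical, with levels separated by colons. If a coarser grouping is useful, the top-level category could be extracted as sketched below. The `topics` series here is a toy sample built from names visible in this notebook, not the full column.

```python
import pandas as pd

# Toy sample of topic names taken from the outputs in this notebook.
topics = pd.Series([
    'Economic Policy & Debt: Balance of payments: Capital & financial account',
    'Economic Policy & Debt: External debt: Debt service',
    'Environment: Land use',
    'Environment: Agricultural production',
], name='Topic')

# Keep only the part before the first colon and trim surrounding whitespace.
top_level = topics.str.split(':').str[0].str.strip()

# Number of topics per top-level category.
counts = top_level.value_counts()
print(counts)
```

Applied to the real data, the same expression would be run on `df_wdi_inds['Topic']`; missing topic values, if any, would simply stay missing.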

In [5]:
# Import WDI data (World Development Indicators)
df_wdi = pd.read_csv('../data/raw_data/WDIData.csv')

# Look at the dataset
print(df_wdi.head())    # show first five rows
print('\nDataset shape (rows, columns):', df_wdi.shape)

  Country Name Country Code  \
0   Arab World          ARB   
1   Arab World          ARB   
2   Arab World          ARB   
3   Arab World          ARB   
4   Arab World          ARB   

                                      Indicator Name     Indicator Code  1960  \
0  2005 PPP conversion factor, GDP (LCU per inter...      PA.NUS.PPP.05   NaN   
1  2005 PPP conversion factor, private consumptio...  PA.NUS.PRVT.PP.05   NaN   
2  Access to clean fuels and technologies for coo...     EG.CFT.ACCS.ZS   NaN   
3            Access to electricity (% of population)     EG.ELC.ACCS.ZS   NaN   
4  Access to electricity, rural (% of rural popul...  EG.ELC.ACCS.RU.ZS   NaN   

   1961  1962  1963  1964  1965     ...            2011       2012       2013  \
0   NaN   NaN   NaN   NaN   NaN     ...             NaN        NaN        NaN   
1   NaN   NaN   NaN   NaN   NaN     ...             NaN        NaN        NaN   
2   NaN   NaN   NaN   NaN   NaN     ...       82.783289  83.120303  83.533457   
3 

In [6]:
# Show columns of the WDI dataset
for col in df_wdi.columns:
    print(col)

Country Name
Country Code
Indicator Name
Indicator Code
1960
1961
1962
1963
1964
1965
1966
1967
1968
1969
1970
1971
1972
1973
1974
1975
1976
1977
1978
1979
1980
1981
1982
1983
1984
1985
1986
1987
1988
1989
1990
1991
1992
1993
1994
1995
1996
1997
1998
1999
2000
2001
2002
2003
2004
2005
2006
2007
2008
2009
2010
2011
2012
2013
2014
2015
2016
2017
2018
2019
Unnamed: 64
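
The column list shows a wide layout: one column per year plus a trailing `Unnamed: 64` column that is not a year. For later steps such as joining with the 2016 HPI data, a long layout (one row per country, indicator and year) may be easier to work with. A minimal sketch on a toy frame, assuming that every year column has a purely numeric name (`toy_wdi` and its values are made up for illustration; the indicator names and codes come from the preview above):

```python
import pandas as pd

# Toy stand-in for df_wdi with two year columns and the stray unnamed column.
toy_wdi = pd.DataFrame({
    'Country Name': ['Arab World', 'Arab World'],
    'Country Code': ['ARB', 'ARB'],
    'Indicator Name': ['Access to electricity (% of population)',
                       'Access to clean fuels and technologies for cooking'],
    'Indicator Code': ['EG.ELC.ACCS.ZS', 'EG.CFT.ACCS.ZS'],
    '2015': [1.0, None],          # made-up values
    '2016': [2.0, 3.0],           # made-up values
    'Unnamed: 64': [None, None],
})

id_cols = ['Country Name', 'Country Code', 'Indicator Name', 'Indicator Code']
# Year columns are the ones whose name consists of digits only,
# which leaves out 'Unnamed: 64'.
year_cols = [c for c in toy_wdi.columns if c.isdigit()]

wdi_long = (
    toy_wdi.melt(id_vars=id_cols, value_vars=year_cols,
                 var_name='Year', value_name='Value')
           .dropna(subset=['Value'])      # drop year/indicator pairs without data
           .astype({'Year': int})
           .reset_index(drop=True)
)
print(wdi_long)
```

Dropping the empty rows after melting is a design choice; keeping them would preserve the information about which values are missing.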


In [7]:
# Show columns without missing data
columns_no_nulls = set(df_wdi.columns[df_wdi.isnull().mean() == 0])
columns_no_nulls

{'Country Code', 'Country Name', 'Indicator Code', 'Indicator Name'}

In [8]:
# Show columns with most missing data
columns_most_nulls = set(df_wdi.columns[df_wdi.isnull().mean() > 0.5])    # more than 50% of data is missing
# Note: the result also contains 'Unnamed: 64', which is not a year column
print('More than 50% of data is missing in {} columns.'.format(len(columns_most_nulls)))
columns_most_nulls

More than 50% of data is missing in 49 columns.


{'1960',
 '1961',
 '1962',
 '1963',
 '1964',
 '1965',
 '1966',
 '1967',
 '1968',
 '1969',
 '1970',
 '1971',
 '1972',
 '1973',
 '1974',
 '1975',
 '1976',
 '1977',
 '1978',
 '1979',
 '1980',
 '1981',
 '1982',
 '1983',
 '1984',
 '1985',
 '1986',
 '1987',
 '1988',
 '1989',
 '1990',
 '1991',
 '1992',
 '1993',
 '1994',
 '1995',
 '1996',
 '1997',
 '1998',
 '1999',
 '2000',
 '2001',
 '2002',
 '2003',
 '2004',
 '2017',
 '2018',
 '2019',
 'Unnamed: 64'}
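
The set above only says which columns cross the 50% threshold, not how sparse each one is. Ranking the columns by their share of missing values gives a more detailed picture, sketched here on a toy frame (`toy` and its values are made up for illustration):

```python
import numpy as np
import pandas as pd

# Toy frame with one identifier column and three year columns.
toy = pd.DataFrame({
    'Country Code': ['ARB', 'ARB', 'ARB', 'ARB'],
    '2004': [np.nan, np.nan, np.nan, 1.0],   # 75% missing
    '2005': [np.nan, 2.0, 3.0, 4.0],         # 25% missing
    '2016': [5.0, 6.0, 7.0, 8.0],            # complete
})

# isnull().mean() gives the fraction of missing values per column.
missing_share = toy.isnull().mean().sort_values(ascending=False)

threshold = 0.5   # same cut-off as in the cell above
sparse_cols = missing_share[missing_share > threshold].index.tolist()

print(missing_share)
print('Columns above threshold:', sparse_cols)
```

On the real data, `df_wdi.isnull().mean()` could be plotted per year to see how coverage changes over time.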

In [9]:
# Count unique entries in 'Country Name' (these include aggregates such as 'Arab World', not only countries)
country_count = df_wdi['Country Name'].unique().shape[0]
print('There are', country_count, 'unique countries.')

There are 264 unique countries.
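
Since the WDI `Country Name` column also holds aggregates such as `Arab World`, and the two datasets may spell some country names differently, it can help to check which names match before combining them. A minimal sketch using an outer merge with `indicator=True` on toy name lists (`hpi_names` and `wdi_names` are hypothetical samples, not the full columns):

```python
import pandas as pd

# Toy samples of the name columns from both datasets.
hpi_names = pd.DataFrame({'Country': ['Afghanistan', 'Albania', 'Algeria']})
wdi_names = pd.DataFrame({'Country Name': ['Afghanistan', 'Albania', 'Arab World']})

# indicator=True adds a '_merge' column that tells where each row came from.
matched = hpi_names.merge(wdi_names, how='outer',
                          left_on='Country', right_on='Country Name',
                          indicator=True)

only_hpi = matched.loc[matched['_merge'] == 'left_only', 'Country'].tolist()
only_wdi = matched.loc[matched['_merge'] == 'right_only', 'Country Name'].tolist()

print('Only in HPI:', only_hpi)
print('Only in WDI:', only_wdi)
```

Names that appear on only one side are candidates for manual mapping or, in the case of aggregates, for removal.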
