## Exploratory Data Analysis

#### Dataset:
Life Expectancy Data

#### Source: 
Kaggle

#### Dataset Description: 
It contains data on life expectancy, health factors, and economic indicators such as GDP for 193 countries over the years 2000 to 2015.

#### Aim:
This case study aims to identify the factors that contribute to an increase in life expectancy (the target attribute) in each country.

#### Approach:
1. Researched which factors influence a person's life expectancy.
2. Reviewed the data provided in the dataset.
3. Applied data pre-processing, followed by univariate and bivariate analysis of the attributes.
4. Treated the missing values in the data with a method chosen for each attribute.

##### Note: This notebook covers the steps for handling missing data.

#### Attributes:
'Country': 193 countries

'Year': 2000-2015

'Status': Developed or Developing status

'Life expectancy ': Life expectancy in years

'Adult Mortality': Adult Mortality Rates of both sexes (probability of dying between 15 and 60 years per 1000 population)

'infant deaths': Number of Infant Deaths per 1000 population

'Alcohol': Alcohol, recorded per capita (15+) consumption (in litres of pure alcohol)

'percentage expenditure': Expenditure on health as a percentage of Gross Domestic Product per capita(%)

'Hepatitis B': Hepatitis B  immunization coverage among 1-year-olds (%)

'Measles ': number of reported cases per 1000 population

'BMI': Average Body Mass Index of entire population

'under-five deaths ': Number of under-five deaths per 1000 population

'Polio': Polio (Pol3) immunization coverage among 1-year-olds (%)

'Total expenditure': General government expenditure on health as a percentage of total government expenditure (%)

'Diphtheria ': Diphtheria tetanus toxoid and pertussis (DTP3) immunization coverage among 1-year-olds (%)

'HIV/AIDS': Deaths per 1000 live births due to HIV/AIDS (0-4 years)

'GDP': Gross Domestic Product per capita (in USD)

'Population': Population of the country

'thinness 1-19 years': Prevalence of thinness among children and adolescents aged 10 to 19 (%)

'thinness 5-9 years': Prevalence of thinness among children aged 5 to 9 (%)

'Income composition of resources': Human Development Index in terms of income composition of resources (index ranging from 0 to 1)

'Schooling': Number of years of Schooling(years)


### 1. Importing the libraries

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import warnings
from sklearn.exceptions import DataConversionWarning
warnings.filterwarnings(action='ignore',category= DataConversionWarning)
warnings.simplefilter(action='ignore',category=FutureWarning)

### 2. Reading the Dataset

In [2]:
df = pd.read_csv('Life Expectancy Data.csv')
df.head()

Unnamed: 0,Country,Year,Status,Life expectancy,Adult Mortality,infant deaths,Alcohol,percentage expenditure,Hepatitis B,Measles,...,Polio,Total expenditure,Diphtheria,HIV/AIDS,GDP,Population,thinness 1-19 years,thinness 5-9 years,Income composition of resources,Schooling
0,Afghanistan,2015,Developing,65.0,263.0,62,0.01,71.279624,65.0,1154,...,6.0,8.16,65.0,0.1,584.25921,33736494.0,17.2,17.3,0.479,10.1
1,Afghanistan,2014,Developing,59.9,271.0,64,0.01,73.523582,62.0,492,...,58.0,8.18,62.0,0.1,612.696514,327582.0,17.5,17.5,0.476,10.0
2,Afghanistan,2013,Developing,59.9,268.0,66,0.01,73.219243,64.0,430,...,62.0,8.13,64.0,0.1,631.744976,31731688.0,17.7,17.7,0.47,9.9
3,Afghanistan,2012,Developing,59.5,272.0,69,0.01,78.184215,67.0,2787,...,67.0,8.52,67.0,0.1,669.959,3696958.0,17.9,18.0,0.463,9.8
4,Afghanistan,2011,Developing,59.2,275.0,71,0.01,7.097109,68.0,3013,...,68.0,7.87,68.0,0.1,63.537231,2978599.0,18.2,18.2,0.454,9.5


In [3]:
## get columns of the dataset

df.columns

Index(['Country', 'Year', 'Status', 'Life expectancy ', 'Adult Mortality',
       'infant deaths', 'Alcohol', 'percentage expenditure', 'Hepatitis B',
       'Measles ', ' BMI ', 'under-five deaths ', 'Polio', 'Total expenditure',
       'Diphtheria ', ' HIV/AIDS', 'GDP', 'Population',
       ' thinness  1-19 years', ' thinness 5-9 years',
       'Income composition of resources', 'Schooling'],
      dtype='object')

In [4]:
## get information about the datatypes of each attribute

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2938 entries, 0 to 2937
Data columns (total 22 columns):
Country                            2938 non-null object
Year                               2938 non-null int64
Status                             2938 non-null object
Life expectancy                    2928 non-null float64
Adult Mortality                    2928 non-null float64
infant deaths                      2938 non-null int64
Alcohol                            2744 non-null float64
percentage expenditure             2938 non-null float64
Hepatitis B                        2385 non-null float64
Measles                            2938 non-null int64
 BMI                               2904 non-null float64
under-five deaths                  2938 non-null int64
Polio                              2919 non-null float64
Total expenditure                  2712 non-null float64
Diphtheria                         2919 non-null float64
 HIV/AIDS                          2938 non-null

### 3. Data Pre-processing

In [5]:
## removing the leading and trailing spaces from attribute names

df.columns = [i.strip() for i in df.columns.tolist()]

In [6]:
## renaming the attribute as given in description

df.rename(columns = {'thinness  1-19 years':'thinness 10-19 years'},inplace = True)

In [7]:
## making the attribute names of similar format

df.columns = [i.replace(' ','_').lower() for i in df.columns.tolist()]

In [8]:
df.columns

Index(['country', 'year', 'status', 'life_expectancy', 'adult_mortality',
       'infant_deaths', 'alcohol', 'percentage_expenditure', 'hepatitis_b',
       'measles', 'bmi', 'under-five_deaths', 'polio', 'total_expenditure',
       'diphtheria', 'hiv/aids', 'gdp', 'population', 'thinness_10-19_years',
       'thinness_5-9_years', 'income_composition_of_resources', 'schooling'],
      dtype='object')

In [9]:
## renaming the attribute under-five_deaths 

df.rename(columns = {'under-five_deaths':'under_five_deaths'},inplace = True)

In [10]:
df.columns

Index(['country', 'year', 'status', 'life_expectancy', 'adult_mortality',
       'infant_deaths', 'alcohol', 'percentage_expenditure', 'hepatitis_b',
       'measles', 'bmi', 'under_five_deaths', 'polio', 'total_expenditure',
       'diphtheria', 'hiv/aids', 'gdp', 'population', 'thinness_10-19_years',
       'thinness_5-9_years', 'income_composition_of_resources', 'schooling'],
      dtype='object')
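
The renaming steps in In [5] to In [9] can also be collected into one helper. The following is a minimal sketch, not the notebook's own code: `clean_column_names` and its `renames` argument are hypothetical names, and the toy column list only mimics a few of the raw headers shown in In [3].

```python
import pandas as pd

def clean_column_names(columns, renames=None):
    """Trim names, collapse runs of spaces into one underscore, lowercase,
    then apply explicit renames (hypothetical helper, for illustration)."""
    renames = renames or {}
    cleaned = []
    for name in columns:
        # split() drops leading/trailing whitespace and collapses inner runs
        name = "_".join(name.split()).lower()
        cleaned.append(renames.get(name, name))
    return cleaned

# A few headers copied from the raw column listing, including stray spaces
raw = pd.DataFrame(columns=["Life expectancy ", " BMI ", "under-five deaths ",
                            " thinness  1-19 years", " HIV/AIDS"])
raw.columns = clean_column_names(
    raw.columns,
    renames={
        "thinness_1-19_years": "thinness_10-19_years",
        "under-five_deaths": "under_five_deaths",
    },
)
print(raw.columns.tolist())
```

Keeping the two special cases in an explicit mapping makes it easy to see which names were changed by hand rather than by the general rule.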

In [11]:
## get the numerical attributes

num_cols = df.select_dtypes(exclude = 'object').columns.tolist()
print(num_cols)

['year', 'life_expectancy', 'adult_mortality', 'infant_deaths', 'alcohol', 'percentage_expenditure', 'hepatitis_b', 'measles', 'bmi', 'under_five_deaths', 'polio', 'total_expenditure', 'diphtheria', 'hiv/aids', 'gdp', 'population', 'thinness_10-19_years', 'thinness_5-9_years', 'income_composition_of_resources', 'schooling']


In [12]:
## get categorical attributes

cat_cols = df.select_dtypes(include = 'object').columns.tolist()
print(cat_cols)

['country', 'status']


### 4. Descriptive Statistics

In [13]:
## for numerical columns

df[num_cols].describe()

Unnamed: 0,year,life_expectancy,adult_mortality,infant_deaths,alcohol,percentage_expenditure,hepatitis_b,measles,bmi,under_five_deaths,polio,total_expenditure,diphtheria,hiv/aids,gdp,population,thinness_10-19_years,thinness_5-9_years,income_composition_of_resources,schooling
count,2938.0,2928.0,2928.0,2938.0,2744.0,2938.0,2385.0,2938.0,2904.0,2938.0,2919.0,2712.0,2919.0,2938.0,2490.0,2286.0,2904.0,2904.0,2771.0,2775.0
mean,2007.51872,69.224932,164.796448,30.303948,4.602861,738.251295,80.940461,2419.59224,38.321247,42.035739,82.550188,5.93819,82.324084,1.742103,7483.158469,12753380.0,4.839704,4.870317,0.627551,11.992793
std,4.613841,9.523867,124.292079,117.926501,4.052413,1987.914858,25.070016,11467.272489,20.044034,160.445548,23.428046,2.49832,23.716912,5.077785,14270.169342,61012100.0,4.420195,4.508882,0.210904,3.35892
min,2000.0,36.3,1.0,0.0,0.01,0.0,1.0,0.0,1.0,0.0,3.0,0.37,2.0,0.1,1.68135,34.0,0.1,0.1,0.0,0.0
25%,2004.0,63.1,74.0,0.0,0.8775,4.685343,77.0,0.0,19.3,0.0,78.0,4.26,78.0,0.1,463.935626,195793.2,1.6,1.5,0.493,10.1
50%,2008.0,72.1,144.0,3.0,3.755,64.912906,92.0,17.0,43.5,4.0,93.0,5.755,93.0,0.1,1766.947595,1386542.0,3.3,3.3,0.677,12.3
75%,2012.0,75.7,228.0,22.0,7.7025,441.534144,97.0,360.25,56.2,28.0,97.0,7.4925,97.0,0.8,5910.806335,7420359.0,7.2,7.2,0.779,14.3
max,2015.0,89.0,723.0,1800.0,17.87,19479.91161,99.0,212183.0,87.3,2500.0,99.0,17.6,99.0,50.6,119172.7418,1293859000.0,27.7,28.6,0.948,20.7


In [14]:
## for categorical columns

df[cat_cols].describe()

Unnamed: 0,country,status
count,2938,2938
unique,193,2
top,Qatar,Developing
freq,16,2426


In [15]:
developed = df[df['status']=='Developed']

In [16]:
developing = df[df['status']=='Developing']

### 5. Detecting and Treating Missing Values

In [17]:
df.isnull().sum()[df.isnull().sum() > 0]

life_expectancy                     10
adult_mortality                     10
alcohol                            194
hepatitis_b                        553
bmi                                 34
polio                               19
total_expenditure                  226
diphtheria                          19
gdp                                448
population                         652
thinness_10-19_years                34
thinness_5-9_years                  34
income_composition_of_resources    167
schooling                          163
dtype: int64
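
Raw counts are easier to judge next to the share of rows they represent. The sketch below adds a percentage column to the same kind of summary; `missing_summary` is a hypothetical helper and the toy frame stands in for `df`.

```python
import numpy as np
import pandas as pd

def missing_summary(frame):
    """Count and percentage of missing values, only for columns that have any."""
    counts = frame.isnull().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(frame) * 100).round(2),
    })
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)

# Toy data standing in for the real dataset
toy = pd.DataFrame({
    "gdp": [1.0, np.nan, np.nan, 4.0],
    "polio": [90.0, 95.0, np.nan, 97.0],
    "year": [2000, 2001, 2002, 2003],
})
report = missing_summary(toy)
print(report)
```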

In [18]:
## missing values for life expectancy

df[df['life_expectancy'].isnull()]

Unnamed: 0,country,year,status,life_expectancy,adult_mortality,infant_deaths,alcohol,percentage_expenditure,hepatitis_b,measles,...,polio,total_expenditure,diphtheria,hiv/aids,gdp,population,thinness_10-19_years,thinness_5-9_years,income_composition_of_resources,schooling
624,Cook Islands,2013,Developing,,,0,0.01,0.0,98.0,0,...,98.0,3.58,98.0,0.1,,,0.1,0.1,,
769,Dominica,2013,Developing,,,0,0.01,11.419555,96.0,0,...,96.0,5.58,96.0,0.1,722.75665,,2.7,2.6,0.721,12.7
1650,Marshall Islands,2013,Developing,,,0,0.01,871.878317,8.0,0,...,79.0,17.24,79.0,0.1,3617.752354,,0.1,0.1,,0.0
1715,Monaco,2013,Developing,,,0,0.01,0.0,99.0,0,...,99.0,4.3,99.0,0.1,,,,,,
1812,Nauru,2013,Developing,,,0,0.01,15.606596,87.0,0,...,87.0,4.65,87.0,0.1,136.18321,,0.1,0.1,,9.6
1909,Niue,2013,Developing,,,0,0.01,0.0,99.0,0,...,99.0,7.2,99.0,0.1,,,0.1,0.1,,
1958,Palau,2013,Developing,,,0,,344.690631,99.0,0,...,99.0,9.27,99.0,0.1,1932.12237,292.0,0.1,0.1,0.779,14.2
2167,Saint Kitts and Nevis,2013,Developing,,,0,8.54,0.0,97.0,0,...,96.0,6.14,96.0,0.1,,,3.7,3.6,0.749,13.4
2216,San Marino,2013,Developing,,,0,0.01,0.0,69.0,0,...,69.0,6.5,69.0,0.1,,,,,,15.1
2713,Tuvalu,2013,Developing,,,0,0.01,78.281203,9.0,0,...,9.0,16.61,9.0,0.1,3542.13589,1819.0,0.2,0.1,,0.0


##### Since the data covers the years 2000 to 2015, missing values can be imputed by grouping the rows by country and taking the mean across that country's years
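
As a rough illustration of this idea, the sketch below fills gaps with a per-country mean using `groupby(...).transform("mean")`. The toy frame and the `life_expectancy_filled` column are made up for the example. A country whose only row is missing has no values to average, so its gap stays `NaN`.

```python
import numpy as np
import pandas as pd

# Country "A" has three years with one gap; country "B" has a single, empty row
toy = pd.DataFrame({
    "country": ["A", "A", "A", "B"],
    "year": [2013, 2014, 2015, 2013],
    "life_expectancy": [70.0, np.nan, 72.0, np.nan],
})

# transform("mean") returns one value per row: the mean of that row's country
country_mean = toy.groupby("country")["life_expectancy"].transform("mean")
toy["life_expectancy_filled"] = toy["life_expectancy"].fillna(country_mean)
print(toy)
```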

In [19]:
## check data for a single country
df[df['country'] == 'Cook Islands']

Unnamed: 0,country,year,status,life_expectancy,adult_mortality,infant_deaths,alcohol,percentage_expenditure,hepatitis_b,measles,...,polio,total_expenditure,diphtheria,hiv/aids,gdp,population,thinness_10-19_years,thinness_5-9_years,income_composition_of_resources,schooling
624,Cook Islands,2013,Developing,,,0,0.01,0.0,98.0,0,...,98.0,3.58,98.0,0.1,,,0.1,0.1,,


##### Here, Cook Islands has data for only one year, so no imputation based on other years is possible

In [20]:
## check for other countries also
le = df[df['life_expectancy'].isnull()]
for i in le['country'].unique():
    print(i, le[le['country'] == i]['year'].unique())

Cook Islands [2013]
Dominica [2013]
Marshall Islands [2013]
Monaco [2013]
Nauru [2013]
Niue [2013]
Palau [2013]
Saint Kitts and Nevis [2013]
San Marino [2013]
Tuvalu [2013]


##### From the results above, each of these countries has data for a single year only, so the missing values cannot be imputed. These rows are therefore dropped

In [21]:
## drop the above 10 rows using the indexes

df.drop(le.index, axis = 0, inplace = True)

In [22]:
df.shape

(2928, 22)

In [23]:
## check for remaining missing values

df.isnull().sum()[df.isnull().sum() > 0]

alcohol                            193
hepatitis_b                        553
bmi                                 32
polio                               19
total_expenditure                  226
diphtheria                          19
gdp                                443
population                         644
thinness_10-19_years                32
thinness_5-9_years                  32
income_composition_of_resources    160
schooling                          160
dtype: int64

In [24]:
## Now finding the countries which have data for only one year

for i in df['country'].unique():
    if len(df[df['country'] == i]['year'].unique()) == 1:
        print(i,df[df['country'] == i]['year'].unique()) 
        
## there are no countries with data for one year only
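
The same check can be written without a Python loop. This is a sketch on toy data, assuming the same `country` and `year` columns: `nunique()` counts the distinct years per country, and a boolean filter keeps the countries with exactly one.

```python
import pandas as pd

toy = pd.DataFrame({
    "country": ["A", "A", "B", "C", "C", "C"],
    "year": [2014, 2015, 2013, 2000, 2001, 2001],
})

# Number of distinct years recorded for each country
years_per_country = toy.groupby("country")["year"].nunique()

# Countries that appear for exactly one year
single_year = years_per_country[years_per_country == 1].index.tolist()
print(single_year)
```

An empty list here corresponds to the loop above printing nothing.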

In [25]:
## for bmi

bmi = df[df['bmi'].isnull()]
bmi.head()

Unnamed: 0,country,year,status,life_expectancy,adult_mortality,infant_deaths,alcohol,percentage_expenditure,hepatitis_b,measles,...,polio,total_expenditure,diphtheria,hiv/aids,gdp,population,thinness_10-19_years,thinness_5-9_years,income_composition_of_resources,schooling
2409,South Sudan,2015,Developing,57.3,332.0,26,,0.0,31.0,878,...,41.0,,31.0,3.4,758.725782,11882136.0,,,0.421,4.9
2410,South Sudan,2014,Developing,56.6,343.0,26,,46.074469,,441,...,44.0,2.74,39.0,3.5,1151.861715,1153971.0,,,0.421,4.9
2411,South Sudan,2013,Developing,56.4,345.0,26,,47.44453,,525,...,5.0,2.62,45.0,3.6,1186.11325,1117749.0,,,0.417,4.9
2412,South Sudan,2012,Developing,56.0,347.0,26,,38.338232,,1952,...,64.0,2.77,59.0,3.8,958.45581,1818258.0,,,0.419,4.9
2413,South Sudan,2011,Developing,55.4,355.0,27,,0.0,,1256,...,66.0,,61.0,3.9,176.9713,1448857.0,,,0.429,4.9


In [26]:
## check status of countries having missing values for bmi

bmi['status'].unique()

array(['Developing'], dtype=object)

In [27]:
## check countries having missing values for bmi

bmi['country'].unique()

array(['South Sudan', 'Sudan'], dtype=object)

In [28]:
developing['bmi'].describe()

count    2392.000000
mean       35.435326
std        19.425091
min         1.000000
25%        18.300000
50%        35.200000
75%        53.200000
max        87.300000
Name: bmi, dtype: float64

In [29]:
## now imputing the values of South Sudan with mean bmi of developing countries
df.loc[df['country']=='South Sudan','bmi'] = developing['bmi'].mean()

In [30]:
## now imputing the values of Sudan with mean bmi of developing countries
df.loc[df['country']=='Sudan','bmi'] = developing['bmi'].mean()
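
Note that `developing` was created in In [16], before the ten rows were dropped in In [21], so its mean still includes those rows. One way to avoid depending on that snapshot is to compute the group mean from the current frame. The sketch below does this on toy data; it is an alternative formulation, not the notebook's method.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "country": ["Sudan", "Sudan", "X", "Y", "Z"],
    "status": ["Developing", "Developing", "Developing", "Developing", "Developed"],
    "bmi": [np.nan, np.nan, 20.0, 40.0, 60.0],
})

# Mean bmi of each row's status group, computed from the frame as it is now
status_mean = toy.groupby("status")["bmi"].transform("mean")

# Only the missing entries are replaced; observed values are left untouched
toy["bmi"] = toy["bmi"].fillna(status_mean)
print(toy)
```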

In [31]:
df[df['bmi'].isnull()]

Unnamed: 0,country,year,status,life_expectancy,adult_mortality,infant_deaths,alcohol,percentage_expenditure,hepatitis_b,measles,...,polio,total_expenditure,diphtheria,hiv/aids,gdp,population,thinness_10-19_years,thinness_5-9_years,income_composition_of_resources,schooling


In [32]:
## check for remaining missing values

df.isnull().sum()[df.isnull().sum() > 0]

alcohol                            193
hepatitis_b                        553
polio                               19
total_expenditure                  226
diphtheria                          19
gdp                                443
population                         644
thinness_10-19_years                32
thinness_5-9_years                  32
income_composition_of_resources    160
schooling                          160
dtype: int64

In [33]:
## for diphtheria
dip = df[df['diphtheria'].isnull()]
dip.head()

Unnamed: 0,country,year,status,life_expectancy,adult_mortality,infant_deaths,alcohol,percentage_expenditure,hepatitis_b,measles,...,polio,total_expenditure,diphtheria,hiv/aids,gdp,population,thinness_10-19_years,thinness_5-9_years,income_composition_of_resources,schooling
1742,Montenegro,2005,Developing,73.6,133.0,0,,527.307672,,0,...,,8.46,,0.1,3674.617924,614261.0,2.3,2.3,0.746,12.8
1743,Montenegro,2004,Developing,73.5,134.0,0,0.01,57.121901,,0,...,,8.45,,0.1,338.199535,613353.0,2.3,2.4,0.74,12.6
1744,Montenegro,2003,Developing,73.5,134.0,0,0.01,495.078296,,0,...,,8.91,,0.1,2789.1735,612267.0,2.4,2.4,0.0,0.0
1745,Montenegro,2002,Developing,73.4,136.0,0,0.01,36.48024,,0,...,,8.33,,0.1,216.243274,69828.0,2.5,2.5,0.0,0.0
1746,Montenegro,2001,Developing,73.3,136.0,0,0.01,33.669814,,0,...,,8.23,,0.1,199.583957,67389.0,2.5,2.6,0.0,0.0


In [34]:
dip['country'].unique()

array(['Montenegro', 'South Sudan', 'Timor-Leste'], dtype=object)

In [35]:
dip['year'].unique()

array([2005, 2004, 2003, 2002, 2001, 2000, 2010, 2009, 2008, 2007, 2006],
      dtype=int64)

In [36]:
dip['status'].unique()

array(['Developing'], dtype=object)

In [37]:
## years for which the countries have missing values
for i in dip['country'].unique():
    print(i, dip[dip['country'] == i]['year'].unique())

Montenegro [2005 2004 2003 2002 2001 2000]
South Sudan [2010 2009 2008 2007 2006 2005 2004 2003 2002 2001 2000]
Timor-Leste [2001 2000]
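
The per-country listing above can also be condensed into a small table. The following sketch, on toy data, reports the first and last missing year and the number of missing rows for each country; `gap_summary` is a name chosen for the example.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "country": ["M", "M", "M", "T", "T"],
    "year": [2006, 2005, 2004, 2001, 2000],
    "diphtheria": [90.0, np.nan, np.nan, np.nan, np.nan],
})

# Keep only the rows where the value is missing, then summarise years per country
missing_rows = toy[toy["diphtheria"].isnull()]
gap_summary = missing_rows.groupby("country")["year"].agg(["min", "max", "count"])
print(gap_summary)
```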


In [38]:
developing['diphtheria'].describe()

count    2407.000000
mean       79.951807
std        24.834300
min         2.000000
25%        75.000000
50%        91.000000
75%        96.500000
max        99.000000
Name: diphtheria, dtype: float64

##### It has been observed that developing countries initially had low immunization coverage (%), which gradually increased over the years

##### The missing values fall in the initial years of these countries: 'Montenegro', 'South Sudan', 'Timor-Leste'

##### So, they can be imputed with each country's own minimum value, since these are developing countries

In [39]:
## imputing the data for the above countries - South Sudan
    
dip_id = dip[dip['country']== 'South Sudan'].index
print(dip_id)
    
for i in dip_id:
    df.loc[i,'diphtheria'] = df[df['country'] == 'South Sudan']['diphtheria'].min()
    #print(df.loc[i,'diphtheria'])

Int64Index([2414, 2415, 2416, 2417, 2418, 2419, 2420, 2421, 2422, 2423, 2424], dtype='int64')


In [40]:
## imputing the data for the above countries - Montenegro

dip_id = dip[dip['country']== 'Montenegro'].index
#print(dip_id)
    
for i in dip_id:
    df.loc[i,'diphtheria'] = df[df['country'] == 'Montenegro']['diphtheria'].min()
    #print(df.loc[i,'diphtheria'])

In [41]:
## imputing the data for the above countries - Timor-Leste

dip_id = dip[dip['country']== 'Timor-Leste'].index
#print(dip_id)
    
for i in dip_id:
    df.loc[i,'diphtheria'] = df[df['country'] == 'Timor-Leste']['diphtheria'].min()
    #print(df.loc[i,'diphtheria'])
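
The three loops in In [39] to In [41] repeat the same per-country minimum fill. A vectorized sketch of that pattern, on toy data with made-up country codes, is shown here; it should give the same result as the loops for any country that has at least one observed value.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "country": ["M", "M", "M", "T", "T", "T"],
    "year": [2007, 2006, 2005, 2002, 2001, 2000],
    "diphtheria": [92.0, 90.0, np.nan, 70.0, np.nan, np.nan],
})

# Minimum observed coverage of each row's country (NaN values are skipped)
country_min = toy.groupby("country")["diphtheria"].transform("min")

# Fill every gap with its own country's minimum in one step
toy["diphtheria"] = toy["diphtheria"].fillna(country_min)
print(toy)
```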

In [42]:
## check if all the missing values are imputed

df[df['diphtheria'].isnull()]

Unnamed: 0,country,year,status,life_expectancy,adult_mortality,infant_deaths,alcohol,percentage_expenditure,hepatitis_b,measles,...,polio,total_expenditure,diphtheria,hiv/aids,gdp,population,thinness_10-19_years,thinness_5-9_years,income_composition_of_resources,schooling


In [43]:
## check for remaining missing values

df.isnull().sum()[df.isnull().sum() > 0]

alcohol                            193
hepatitis_b                        553
polio                               19
total_expenditure                  226
gdp                                443
population                         644
thinness_10-19_years                32
thinness_5-9_years                  32
income_composition_of_resources    160
schooling                          160
dtype: int64

In [44]:
## for thinness_10-19_years

thin1 = df[df['thinness_10-19_years'].isnull()]
thin1.head()

Unnamed: 0,country,year,status,life_expectancy,adult_mortality,infant_deaths,alcohol,percentage_expenditure,hepatitis_b,measles,...,polio,total_expenditure,diphtheria,hiv/aids,gdp,population,thinness_10-19_years,thinness_5-9_years,income_composition_of_resources,schooling
2409,South Sudan,2015,Developing,57.3,332.0,26,,0.0,31.0,878,...,41.0,,31.0,3.4,758.725782,11882136.0,,,0.421,4.9
2410,South Sudan,2014,Developing,56.6,343.0,26,,46.074469,,441,...,44.0,2.74,39.0,3.5,1151.861715,1153971.0,,,0.421,4.9
2411,South Sudan,2013,Developing,56.4,345.0,26,,47.44453,,525,...,5.0,2.62,45.0,3.6,1186.11325,1117749.0,,,0.417,4.9
2412,South Sudan,2012,Developing,56.0,347.0,26,,38.338232,,1952,...,64.0,2.77,59.0,3.8,958.45581,1818258.0,,,0.419,4.9
2413,South Sudan,2011,Developing,55.4,355.0,27,,0.0,,1256,...,66.0,,61.0,3.9,176.9713,1448857.0,,,0.429,4.9


In [45]:
thin1['country'].unique()

array(['South Sudan', 'Sudan'], dtype=object)

In [46]:
thin1['year'].unique()

array([2015, 2014, 2013, 2012, 2011, 2010, 2009, 2008, 2007, 2006, 2005,
       2004, 2003, 2002, 2001, 2000], dtype=int64)

In [47]:
thin1['status'].unique()

array(['Developing'], dtype=object)

In [48]:
developing['thinness_10-19_years'].describe()

count    2392.000000
mean        5.592935
std         4.514453
min         0.100000
25%         2.100000
50%         4.500000
75%         7.725000
max        27.700000
Name: thinness_10-19_years, dtype: float64

In [49]:
## years for which the countries have missing values
for i in thin1['country'].unique():
    print(i, thin1[thin1['country'] == i]['year'].unique())

South Sudan [2015 2014 2013 2012 2011 2010 2009 2008 2007 2006 2005 2004 2003 2002
 2001 2000]
Sudan [2015 2014 2013 2012 2011 2010 2009 2008 2007 2006 2005 2004 2003 2002
 2001 2000]


In [50]:
## now imputing the values of South Sudan with median thinness_10-19_years of developing countries
df.loc[df['country']=='South Sudan','thinness_10-19_years'] = developing['thinness_10-19_years'].median()


In [51]:
## now imputing the values of Sudan with median thinness_10-19_years of developing countries
df.loc[df['country']=='Sudan','thinness_10-19_years'] = developing['thinness_10-19_years'].median()
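
The notebook uses the mean for bmi but the median for the thinness columns, without stating why. One plausible reason, offered here as an assumption, is that thinness is right-skewed: In [48] reports a mean of about 5.59 against a median of 4.5 and a maximum of 27.7. The sketch below uses made-up numbers to show how a single large value pulls the mean up while the median stays put.

```python
import pandas as pd

# Made-up values with one large outlier, loosely mimicking a right-skewed column
values = pd.Series([1.0, 2.0, 2.0, 3.0, 3.0, 4.0, 27.0])

mean_value = values.mean()
median_value = values.median()
skewness = values.skew()  # positive for a right-skewed distribution

print("mean:", mean_value)
print("median:", median_value)
print("skew:", skewness)
```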

In [52]:
## check if all missing values are imputed or not 
df[df['thinness_10-19_years'].isnull()]

Unnamed: 0,country,year,status,life_expectancy,adult_mortality,infant_deaths,alcohol,percentage_expenditure,hepatitis_b,measles,...,polio,total_expenditure,diphtheria,hiv/aids,gdp,population,thinness_10-19_years,thinness_5-9_years,income_composition_of_resources,schooling


In [53]:
## for thinness_5-9_years

thin2 = df[df['thinness_5-9_years'].isnull()]
thin2.head()

Unnamed: 0,country,year,status,life_expectancy,adult_mortality,infant_deaths,alcohol,percentage_expenditure,hepatitis_b,measles,...,polio,total_expenditure,diphtheria,hiv/aids,gdp,population,thinness_10-19_years,thinness_5-9_years,income_composition_of_resources,schooling
2409,South Sudan,2015,Developing,57.3,332.0,26,,0.0,31.0,878,...,41.0,,31.0,3.4,758.725782,11882136.0,4.5,,0.421,4.9
2410,South Sudan,2014,Developing,56.6,343.0,26,,46.074469,,441,...,44.0,2.74,39.0,3.5,1151.861715,1153971.0,4.5,,0.421,4.9
2411,South Sudan,2013,Developing,56.4,345.0,26,,47.44453,,525,...,5.0,2.62,45.0,3.6,1186.11325,1117749.0,4.5,,0.417,4.9
2412,South Sudan,2012,Developing,56.0,347.0,26,,38.338232,,1952,...,64.0,2.77,59.0,3.8,958.45581,1818258.0,4.5,,0.419,4.9
2413,South Sudan,2011,Developing,55.4,355.0,27,,0.0,,1256,...,66.0,,61.0,3.9,176.9713,1448857.0,4.5,,0.429,4.9


In [54]:
thin2['country'].unique()

array(['South Sudan', 'Sudan'], dtype=object)

In [55]:
thin2['year'].unique()

array([2015, 2014, 2013, 2012, 2011, 2010, 2009, 2008, 2007, 2006, 2005,
       2004, 2003, 2002, 2001, 2000], dtype=int64)

In [56]:
thin2['status'].unique()

array(['Developing'], dtype=object)

In [57]:
developing['thinness_5-9_years'].describe()

count    2392.000000
mean        5.635242
std         4.606130
min         0.100000
25%         2.100000
50%         4.600000
75%         7.800000
max        28.600000
Name: thinness_5-9_years, dtype: float64

In [58]:
## years for which the countries have missing values
for i in thin2['country'].unique():
    print(i, thin2[thin2['country'] == i]['year'].unique())

South Sudan [2015 2014 2013 2012 2011 2010 2009 2008 2007 2006 2005 2004 2003 2002
 2001 2000]
Sudan [2015 2014 2013 2012 2011 2010 2009 2008 2007 2006 2005 2004 2003 2002
 2001 2000]


In [59]:
## now imputing the values of South Sudan with median thinness_5-9_years of developing countries
df.loc[df['country']=='South Sudan','thinness_5-9_years'] = developing['thinness_5-9_years'].median()

In [60]:
## now imputing the values of Sudan with median thinness_5-9_years of developing countries
df.loc[df['country']=='Sudan','thinness_5-9_years'] = developing['thinness_5-9_years'].median()


In [61]:
## check if all missing values are imputed or not 
df[df['thinness_5-9_years'].isnull()]

Unnamed: 0,country,year,status,life_expectancy,adult_mortality,infant_deaths,alcohol,percentage_expenditure,hepatitis_b,measles,...,polio,total_expenditure,diphtheria,hiv/aids,gdp,population,thinness_10-19_years,thinness_5-9_years,income_composition_of_resources,schooling


In [62]:
## check for remaining missing values

df.isnull().sum()[df.isnull().sum() > 0]

alcohol                            193
hepatitis_b                        553
polio                               19
total_expenditure                  226
gdp                                443
population                         644
income_composition_of_resources    160
schooling                          160
dtype: int64

In [63]:
## for polio
pdf = df[df['polio'].isnull()]
pdf.head()

Unnamed: 0,country,year,status,life_expectancy,adult_mortality,infant_deaths,alcohol,percentage_expenditure,hepatitis_b,measles,...,polio,total_expenditure,diphtheria,hiv/aids,gdp,population,thinness_10-19_years,thinness_5-9_years,income_composition_of_resources,schooling
1742,Montenegro,2005,Developing,73.6,133.0,0,,527.307672,,0,...,,8.46,9.0,0.1,3674.617924,614261.0,2.3,2.3,0.746,12.8
1743,Montenegro,2004,Developing,73.5,134.0,0,0.01,57.121901,,0,...,,8.45,9.0,0.1,338.199535,613353.0,2.3,2.4,0.74,12.6
1744,Montenegro,2003,Developing,73.5,134.0,0,0.01,495.078296,,0,...,,8.91,9.0,0.1,2789.1735,612267.0,2.4,2.4,0.0,0.0
1745,Montenegro,2002,Developing,73.4,136.0,0,0.01,36.48024,,0,...,,8.33,9.0,0.1,216.243274,69828.0,2.5,2.5,0.0,0.0
1746,Montenegro,2001,Developing,73.3,136.0,0,0.01,33.669814,,0,...,,8.23,9.0,0.1,199.583957,67389.0,2.5,2.6,0.0,0.0


In [64]:
pdf['country'].unique()

array(['Montenegro', 'South Sudan', 'Timor-Leste'], dtype=object)

In [65]:
pdf['year'].unique()

array([2005, 2004, 2003, 2002, 2001, 2000, 2010, 2009, 2008, 2007, 2006],
      dtype=int64)

In [66]:
pdf['status'].unique()

array(['Developing'], dtype=object)

In [67]:
## years for which the countries have missing values
for i in pdf['country'].unique():
    print(i, pdf[pdf['country'] == i]['year'].unique())


Montenegro [2005 2004 2003 2002 2001 2000]
South Sudan [2010 2009 2008 2007 2006 2005 2004 2003 2002 2001 2000]
Timor-Leste [2001 2000]


In [68]:
developing['polio'].describe()

count    2407.000000
mean       80.170752
std        24.671531
min         3.000000
25%        74.000000
50%        91.000000
75%        97.000000
max        99.000000
Name: polio, dtype: float64

In [69]:
## imputing the data for the above countries - South Sudan
    
pid = pdf[pdf['country']== 'South Sudan'].index
print(pid)
    
for i in pid:
    df.loc[i,'polio'] = df[df['country'] == 'South Sudan']['polio'].min()
    #print(df.loc[i,'polio'])



Int64Index([2414, 2415, 2416, 2417, 2418, 2419, 2420, 2421, 2422, 2423, 2424], dtype='int64')


In [70]:
## imputing the data for the above countries - Montenegro

pid = pdf[pdf['country']== 'Montenegro'].index
#print(pid)
    
for i in pid:
    df.loc[i,'polio'] = df[df['country'] == 'Montenegro']['polio'].min()
    #print(df.loc[i,'polio'])



In [71]:
## imputing the data for the above countries - Timor-Leste

pid = pdf[pdf['country']== 'Timor-Leste'].index
#print(pid)
    
for i in pid:
    df.loc[i,'polio'] = df[df['country'] == 'Timor-Leste']['polio'].min()
    #print(df.loc[i,'polio'])



In [72]:
## check if all the missing values are imputed

df[df['polio'].isnull()]

Unnamed: 0,country,year,status,life_expectancy,adult_mortality,infant_deaths,alcohol,percentage_expenditure,hepatitis_b,measles,...,polio,total_expenditure,diphtheria,hiv/aids,gdp,population,thinness_10-19_years,thinness_5-9_years,income_composition_of_resources,schooling


In [73]:
## check for remaining missing values

df.isnull().sum()[df.isnull().sum() > 0]

alcohol                            193
hepatitis_b                        553
total_expenditure                  226
gdp                                443
population                         644
income_composition_of_resources    160
schooling                          160
dtype: int64

In [74]:
## for hepatitis_b
hb = df[df['hepatitis_b'].isnull()]
hb.head()



Unnamed: 0,country,year,status,life_expectancy,adult_mortality,infant_deaths,alcohol,percentage_expenditure,hepatitis_b,measles,...,polio,total_expenditure,diphtheria,hiv/aids,gdp,population,thinness_10-19_years,thinness_5-9_years,income_composition_of_resources,schooling
44,Algeria,2003,Developing,71.7,146.0,20,0.34,25.018523,,15374,...,87.0,3.6,87.0,0.1,294.33556,3243514.0,6.3,6.1,0.663,11.5
45,Algeria,2002,Developing,71.6,145.0,20,0.36,148.511984,,5862,...,86.0,3.73,86.0,0.1,1774.33673,3199546.0,6.3,6.2,0.653,11.1
46,Algeria,2001,Developing,71.4,145.0,20,0.23,147.986071,,2686,...,89.0,3.84,89.0,0.1,1732.857979,31592153.0,6.4,6.3,0.644,10.9
47,Algeria,2000,Developing,71.3,145.0,21,0.25,154.455944,,0,...,86.0,3.49,86.0,0.1,1757.17797,3118366.0,6.5,6.4,0.636,10.7
57,Angola,2006,Developing,47.7,381.0,90,5.84,25.086888,,765,...,36.0,4.54,34.0,2.5,262.415149,2262399.0,9.8,9.7,0.439,7.2


In [75]:
hb['country'].unique()



array(['Algeria', 'Angola', 'Antigua and Barbuda', 'Argentina',
       'Australia', 'Azerbaijan', 'Bahamas', 'Bangladesh', 'Barbados',
       'Benin', 'Bosnia and Herzegovina', 'Burkina Faso', 'Burundi',
       "Côte d'Ivoire", 'Cabo Verde', 'Cambodia', 'Cameroon', 'Canada',
       'Central African Republic', 'Chad', 'Chile', 'Comoros', 'Congo',
       'Croatia', 'Czechia', "Democratic People's Republic of Korea",
       'Democratic Republic of the Congo', 'Denmark', 'Djibouti',
       'Equatorial Guinea', 'Eritrea', 'Estonia', 'Ethiopia', 'Finland',
       'Gabon', 'Ghana', 'Grenada', 'Guatemala', 'Guinea',
       'Guinea-Bissau', 'Guyana', 'Haiti', 'Hungary', 'Iceland', 'India',
       'Ireland', 'Jamaica', 'Japan', 'Kenya',
       "Lao People's Democratic Republic", 'Lesotho', 'Liberia',
       'Madagascar', 'Malawi', 'Mali', 'Malta', 'Mauritania',
       'Montenegro', 'Mozambique', 'Myanmar', 'Namibia', 'Nepal',
       'Netherlands', 'Niger', 'Nigeria', 'Norway', 'Pakistan', 'Panam

In [76]:
hb['year'].unique()


array([2003, 2002, 2001, 2000, 2006, 2005, 2004, 2008, 2007, 2015, 2014,
       2013, 2012, 2011, 2010, 2009], dtype=int64)

In [77]:

hb['status'].unique()


array(['Developing', 'Developed'], dtype=object)

In [78]:
## years for which the countries have missing values
for i in hb['country'].unique():
    print(i, hb[hb['country'] == i]['year'].unique())

Algeria [2003 2002 2001 2000]
Angola [2006 2005 2004 2003 2002 2001 2000]
Antigua and Barbuda [2000]
Argentina [2001 2000]
Australia [2000]
Azerbaijan [2001 2000]
Bahamas [2000]
Bangladesh [2002 2001 2000]
Barbados [2000]
Benin [2001 2000]
Bosnia and Herzegovina [2003 2002 2001 2000]
Burkina Faso [2005 2004 2003 2002 2001 2000]
Burundi [2003 2002 2001 2000]
Côte d'Ivoire [2000]
Cabo Verde [2001 2000]
Cambodia [2005 2004 2003 2002 2001 2000]
Cameroon [2004 2003 2002 2001 2000]
Canada [2002 2001 2000]
Central African Republic [2008 2007 2006 2005 2004 2003 2002 2001 2000]
Chad [2007 2006 2005 2004 2003 2002 2001 2000]
Chile [2005 2004 2003 2002 2001 2000]
Comoros [2002 2001 2000]
Congo [2006 2005 2004 2003 2002 2001 2000]
Croatia [2006 2005 2004 2003 2002 2001 2000]
Czechia [2001 2000]
Democratic People's Republic of Korea [2002 2001 2000]
Democratic Republic of the Congo [2006 2005 2004 2003 2002 2001 2000]
Denmark [2015 2014 2013 2012 2011 2010 2009 2008 2007 2006 2005 2004 2003 2002
 

In [79]:
developing['hepatitis_b'].describe()

count    2046.000000
mean       79.763930
std        25.564884
min         1.000000
25%        75.000000
50%        91.000000
75%        97.000000
max        99.000000
Name: hepatitis_b, dtype: float64

In [80]:
developed['hepatitis_b'].describe()

count    339.000000
mean      88.041298
std       20.489240
min        2.000000
25%       89.000000
50%       95.000000
75%       97.000000
max       99.000000
Name: hepatitis_b, dtype: float64

In [81]:
hb[hb['status']=='Developed']['country'].unique()

array(['Australia', 'Croatia', 'Czechia', 'Denmark', 'Hungary', 'Iceland',
       'Ireland', 'Japan', 'Malta', 'Netherlands', 'Norway', 'Slovenia',
       'Sweden', 'Switzerland',
       'United Kingdom of Great Britain and Northern Ireland'],
      dtype=object)

In [82]:
hb[hb['status']=='Developing']['country'].unique()

array(['Algeria', 'Angola', 'Antigua and Barbuda', 'Argentina',
       'Azerbaijan', 'Bahamas', 'Bangladesh', 'Barbados', 'Benin',
       'Bosnia and Herzegovina', 'Burkina Faso', 'Burundi',
       "Côte d'Ivoire", 'Cabo Verde', 'Cambodia', 'Cameroon', 'Canada',
       'Central African Republic', 'Chad', 'Chile', 'Comoros', 'Congo',
       "Democratic People's Republic of Korea",
       'Democratic Republic of the Congo', 'Djibouti',
       'Equatorial Guinea', 'Eritrea', 'Estonia', 'Ethiopia', 'Finland',
       'Gabon', 'Ghana', 'Grenada', 'Guatemala', 'Guinea',
       'Guinea-Bissau', 'Guyana', 'Haiti', 'India', 'Jamaica', 'Kenya',
       "Lao People's Democratic Republic", 'Lesotho', 'Liberia',
       'Madagascar', 'Malawi', 'Mali', 'Mauritania', 'Montenegro',
       'Mozambique', 'Myanmar', 'Namibia', 'Nepal', 'Niger', 'Nigeria',
       'Pakistan', 'Panama', 'Paraguay', 'Peru', 'Russian Federation',
       'Rwanda', 'Saint Lucia', 'Saint Vincent and the Grenadines',
       'Sao Tom

In [83]:
## imputing the data for the developed countries with each country's own median

for i in hb[hb['status']=='Developed']['country'].unique():

    # rows of this country where hepatitis_b is missing
    pid = hb[hb['country']== i].index

    # median of the country's observed values (NaN if the country has no data at all)
    df.loc[pid,'hepatitis_b'] = df[df['country'] == i]['hepatitis_b'].median()

In [84]:
## imputing the data for the developing countries with each country's own median

for i in hb[hb['status']=='Developing']['country'].unique():

    # rows of this country where hepatitis_b is missing
    pid = hb[hb['country']== i].index

    # median of the country's observed values (NaN if the country has no data at all)
    df.loc[pid,'hepatitis_b'] = df[df['country'] == i]['hepatitis_b'].median()
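The two loops above fill each missing value with the median of the same country. As a minimal sketch, the same idea can be written without explicit loops using `groupby(...).transform`. The `toy` frame and its values are made up for illustration and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for df; values are illustrative only.
toy = pd.DataFrame({
    "country": ["A", "A", "A", "B", "B", "C", "C"],
    "hepatitis_b": [90.0, np.nan, 94.0, np.nan, 80.0, np.nan, np.nan],
})

# Median of the observed values within each country, broadcast back to every row.
country_median = toy.groupby("country")["hepatitis_b"].transform("median")

# Fill only the missing entries; a country with no observations at all stays NaN.
toy["hepatitis_b"] = toy["hepatitis_b"].fillna(country_median)
print(toy)
```

Country `C` has no observed value, so its rows remain missing, which mirrors what happens to countries with no Hepatitis B data for any year.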

In [85]:
## check if all the missing values are imputed

hb1 = df[df['hepatitis_b'].isnull()]
hb1.head()

Unnamed: 0,country,year,status,life_expectancy,adult_mortality,infant_deaths,alcohol,percentage_expenditure,hepatitis_b,measles,...,polio,total_expenditure,diphtheria,hiv/aids,gdp,population,thinness_10-19_years,thinness_5-9_years,income_composition_of_resources,schooling
737,Denmark,2015,Developed,86.0,71.0,0,,0.0,,9,...,93.0,,93.0,0.1,5314.64416,5683483.0,1.1,0.9,0.923,19.2
738,Denmark,2014,Developed,84.0,73.0,0,9.64,10468.76292,,27,...,94.0,1.8,94.0,0.1,62425.5392,5643475.0,1.1,0.9,0.926,19.2
739,Denmark,2013,Developed,81.0,75.0,0,9.5,10261.763,,17,...,94.0,11.25,94.0,0.1,61191.19263,5614932.0,1.1,0.9,0.924,18.7
740,Denmark,2012,Developed,80.0,76.0,0,9.26,928.417079,,2,...,94.0,1.98,94.0,0.1,5857.521,5591572.0,1.1,0.9,0.922,18.4
741,Denmark,2011,Developed,79.7,79.0,0,10.47,10251.10872,,84,...,91.0,1.87,91.0,0.1,61753.667,557572.0,1.1,0.9,0.91,16.9


In [86]:
hb1['country'].unique()

array(['Denmark', 'Finland', 'Hungary', 'Iceland', 'Japan', 'Norway',
       'Slovenia', 'Switzerland',
       'United Kingdom of Great Britain and Northern Ireland'],
      dtype=object)

In [87]:
hb1['status'].unique()

array(['Developed', 'Developing'], dtype=object)

In [88]:
hb1['year'].unique()

array([2015, 2014, 2013, 2012, 2011, 2010, 2009, 2008, 2007, 2006, 2005,
       2004, 2003, 2002, 2001, 2000], dtype=int64)

##### The remaining rows belong to developed and developing countries that have no Hepatitis B data for any of the years, so no country-level median exists for them

##### These rows can be imputed with the median of the developed or developing countries, respectively

In [89]:
## imputing the data for the developed countries with the median of all developed countries

for i in hb1[hb1['status']=='Developed']['country'].unique():

    # rows of this country that are still missing hepatitis_b
    pid = hb1[hb1['country']== i].index

    df.loc[pid,'hepatitis_b'] = developed['hepatitis_b'].median()

In [90]:
## imputing the data for the developing countries with the median of all developing countries

for i in hb1[hb1['status'] == 'Developing']['country'].unique():

    # rows of this country that are still missing hepatitis_b
    pid = hb1[hb1['country'] == i].index

    df.loc[pid,'hepatitis_b'] = developing['hepatitis_b'].median()
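The fallback to the group median can also be sketched with a `groupby` on `status`. This is an illustrative variant, not the notebook's exact method: it computes the medians from the current values in the frame, whereas the cells above use the separate `developed` and `developing` frames. The `toy` data is hypothetical.

```python
import numpy as np
import pandas as pd

# Toy frame: country B has no observations at all, country C has one gap.
toy = pd.DataFrame({
    "country": ["A", "A", "B", "B", "C", "C"],
    "status": ["Developed", "Developed", "Developed", "Developed",
               "Developing", "Developing"],
    "hepatitis_b": [95.0, 97.0, np.nan, np.nan, 80.0, np.nan],
})

# Median of the observed values within each status group, aligned to the rows.
status_median = toy.groupby("status")["hepatitis_b"].transform("median")

# Fill whatever is still missing with the median of the row's status group.
toy["hepatitis_b"] = toy["hepatitis_b"].fillna(status_median)
print(toy)
```

In practice this step would run after the per-country imputation, so that the group median is used only where no country-level value is available.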

In [91]:
## check for remaining missing values

df.isnull().sum()[df.isnull().sum() > 0]

alcohol                            193
total_expenditure                  226
gdp                                443
population                         644
income_composition_of_resources    160
schooling                          160
dtype: int64

In [93]:
## for schooling
sch = df[df['schooling'].isnull()]
sch.head()

Unnamed: 0,country,year,status,life_expectancy,adult_mortality,infant_deaths,alcohol,percentage_expenditure,hepatitis_b,measles,...,polio,total_expenditure,diphtheria,hiv/aids,gdp,population,thinness_10-19_years,thinness_5-9_years,income_composition_of_resources,schooling
432,Côte d'Ivoire,2015,Developing,53.3,397.0,57,,0.0,83.0,65,...,81.0,,83.0,1.9,,,5.5,5.5,,
433,Côte d'Ivoire,2014,Developing,52.8,47.0,58,0.01,0.0,76.0,50,...,76.0,5.72,76.0,2.0,,,5.6,5.6,,
434,Côte d'Ivoire,2013,Developing,52.3,412.0,59,3.15,0.0,8.0,48,...,79.0,5.81,8.0,2.4,,,5.8,5.7,,
435,Côte d'Ivoire,2012,Developing,52.0,415.0,59,3.24,0.0,82.0,137,...,83.0,6.14,82.0,2.9,,,5.9,5.9,,
436,Côte d'Ivoire,2011,Developing,51.7,419.0,60,3.13,0.0,62.0,628,...,58.0,6.42,62.0,3.3,,,6.1,6.0,,


In [94]:
sch['country'].unique()

array(["Côte d'Ivoire", 'Czechia',
       "Democratic People's Republic of Korea",
       'Democratic Republic of the Congo', 'Republic of Korea',
       'Republic of Moldova', 'Somalia',
       'United Kingdom of Great Britain and Northern Ireland',
       'United Republic of Tanzania', 'United States of America'],
      dtype=object)

In [95]:
sch['status'].unique()

array(['Developing', 'Developed'], dtype=object)

In [96]:
sch['year'].unique()

array([2015, 2014, 2013, 2012, 2011, 2010, 2009, 2008, 2007, 2006, 2005,
       2004, 2003, 2002, 2001, 2000], dtype=int64)

In [97]:
## years for which the countries have missing values
for i in sch['country'].unique():
    print(i, sch[sch['country'] == i]['year'].unique())

Côte d'Ivoire [2015 2014 2013 2012 2011 2010 2009 2008 2007 2006 2005 2004 2003 2002
 2001 2000]
Czechia [2015 2014 2013 2012 2011 2010 2009 2008 2007 2006 2005 2004 2003 2002
 2001 2000]
Democratic People's Republic of Korea [2015 2014 2013 2012 2011 2010 2009 2008 2007 2006 2005 2004 2003 2002
 2001 2000]
Democratic Republic of the Congo [2015 2014 2013 2012 2011 2010 2009 2008 2007 2006 2005 2004 2003 2002
 2001 2000]
Republic of Korea [2015 2014 2013 2012 2011 2010 2009 2008 2007 2006 2005 2004 2003 2002
 2001 2000]
Republic of Moldova [2015 2014 2013 2012 2011 2010 2009 2008 2007 2006 2005 2004 2003 2002
 2001 2000]
Somalia [2015 2014 2013 2012 2011 2010 2009 2008 2007 2006 2005 2004 2003 2002
 2001 2000]
United Kingdom of Great Britain and Northern Ireland [2015 2014 2013 2012 2011 2010 2009 2008 2007 2006 2005 2004 2003 2002
 2001 2000]
United Republic of Tanzania [2015 2014 2013 2012 2011 2010 2009 2008 2007 2006 2005 2004 2003 2002
 2001 2000]
United States of America [2015 20

In [99]:
sch[sch['status'] == 'Developed']['country'].unique()

array(['Czechia', 'United Kingdom of Great Britain and Northern Ireland',
       'United States of America'], dtype=object)

In [100]:
sch[sch['status'] == 'Developing']['country'].unique()

array(["Côte d'Ivoire", "Democratic People's Republic of Korea",
       'Democratic Republic of the Congo', 'Republic of Korea',
       'Republic of Moldova', 'Somalia', 'United Republic of Tanzania'],
      dtype=object)

In [101]:
developed['schooling'].describe()

count    464.000000
mean      15.845474
std        1.766799
min       11.500000
25%       14.700000
50%       15.800000
75%       16.800000
max       20.700000
Name: schooling, dtype: float64

In [102]:
developing['schooling'].describe()

count    2311.000000
mean       11.219256
std         3.056601
min         0.000000
25%         9.600000
50%        11.700000
75%        13.200000
max        18.300000
Name: schooling, dtype: float64

##### These countries have no schooling data for any of the years, so the values are imputed with the median of the developed or developing countries, respectively

In [105]:
## imputing the data for the developed countries

for i in sch[sch['status']=='Developed']['country'].unique():

    pid = sch[sch['country']== i].index
    ##print(pid)
    
    for j in pid:
        df.loc[j,'schooling'] = developed['schooling'].median()
        #print(df.loc[i,'schooling'])


In [106]:

## imputing the data for the developing countries

for i in sch[sch['status']=='Developing']['country'].unique():

    pid = sch[sch['country']== i].index
    ##print(pid)
    
    for j in pid:
        df.loc[j,'schooling'] = developing['schooling'].median()
        #print(df.loc[i,'schooling'])


In [107]:
df[df['schooling'].isnull()]

Unnamed: 0,country,year,status,life_expectancy,adult_mortality,infant_deaths,alcohol,percentage_expenditure,hepatitis_b,measles,...,polio,total_expenditure,diphtheria,hiv/aids,gdp,population,thinness_10-19_years,thinness_5-9_years,income_composition_of_resources,schooling


In [118]:
df.isnull().sum()[df.isnull().sum()>0]

alcohol                            193
total_expenditure                  226
gdp                                443
population                         644
income_composition_of_resources    160
dtype: int64

In [120]:
## for income_composition_of_resources
hdi = df[df['income_composition_of_resources'].isnull()]
hdi.head()

Unnamed: 0,country,year,status,life_expectancy,adult_mortality,infant_deaths,alcohol,percentage_expenditure,hepatitis_b,measles,...,polio,total_expenditure,diphtheria,hiv/aids,gdp,population,thinness_10-19_years,thinness_5-9_years,income_composition_of_resources,schooling
432,Côte d'Ivoire,2015,Developing,53.3,397.0,57,,0.0,83.0,65,...,81.0,,83.0,1.9,,,5.5,5.5,,11.7
433,Côte d'Ivoire,2014,Developing,52.8,47.0,58,0.01,0.0,76.0,50,...,76.0,5.72,76.0,2.0,,,5.6,5.6,,11.7
434,Côte d'Ivoire,2013,Developing,52.3,412.0,59,3.15,0.0,8.0,48,...,79.0,5.81,8.0,2.4,,,5.8,5.7,,11.7
435,Côte d'Ivoire,2012,Developing,52.0,415.0,59,3.24,0.0,82.0,137,...,83.0,6.14,82.0,2.9,,,5.9,5.9,,11.7
436,Côte d'Ivoire,2011,Developing,51.7,419.0,60,3.13,0.0,62.0,628,...,58.0,6.42,62.0,3.3,,,6.1,6.0,,11.7


In [121]:
hdi['status'].unique()

array(['Developing', 'Developed'], dtype=object)

In [122]:
## years for which the countries have missing values
for i in hdi['country'].unique():
    print(i, hdi[hdi['country'] == i]['year'].unique())

Côte d'Ivoire [2015 2014 2013 2012 2011 2010 2009 2008 2007 2006 2005 2004 2003 2002
 2001 2000]
Czechia [2015 2014 2013 2012 2011 2010 2009 2008 2007 2006 2005 2004 2003 2002
 2001 2000]
Democratic People's Republic of Korea [2015 2014 2013 2012 2011 2010 2009 2008 2007 2006 2005 2004 2003 2002
 2001 2000]
Democratic Republic of the Congo [2015 2014 2013 2012 2011 2010 2009 2008 2007 2006 2005 2004 2003 2002
 2001 2000]
Republic of Korea [2015 2014 2013 2012 2011 2010 2009 2008 2007 2006 2005 2004 2003 2002
 2001 2000]
Republic of Moldova [2015 2014 2013 2012 2011 2010 2009 2008 2007 2006 2005 2004 2003 2002
 2001 2000]
Somalia [2015 2014 2013 2012 2011 2010 2009 2008 2007 2006 2005 2004 2003 2002
 2001 2000]
United Kingdom of Great Britain and Northern Ireland [2015 2014 2013 2012 2011 2010 2009 2008 2007 2006 2005 2004 2003 2002
 2001 2000]
United Republic of Tanzania [2015 2014 2013 2012 2011 2010 2009 2008 2007 2006 2005 2004 2003 2002
 2001 2000]
United States of America [2015 20

##### These countries have missing data for all the years, so there is no country-level value to impute from. Imputing an arbitrary value would introduce wrong information, so the remaining missing values are not imputed
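Whether a country is missing a column for every year can be checked directly instead of reading the printed year arrays. The sketch below uses a hypothetical `toy` frame; `count()` counts only non-null values, so a count of zero means the country has no observation at all.

```python
import numpy as np
import pandas as pd

col = "income_composition_of_resources"

# Toy frame: A is partially missing, B is missing for every row, C is complete.
toy = pd.DataFrame({
    "country": ["A", "A", "B", "B", "C", "C"],
    col: [0.7, np.nan, np.nan, np.nan, 0.5, 0.6],
})

# Number of observed (non-null) values and of missing values per country.
observed = toy.groupby("country")[col].count()
missing = toy[col].isna().groupby(toy["country"]).sum()

# Countries with no data at all versus countries with only some gaps.
no_data = observed[observed == 0].index.tolist()
partial = observed[(observed > 0) & (missing > 0)].index.tolist()
print("no data:", no_data)
print("partially missing:", partial)
```

Countries in `no_data` have nothing to derive a country-level median from, which is the situation described above.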

### 6. Check Statistical Significance

#### Tests that can be performed are:
1. Numerical vs Numerical - paired ttest
2. Categorical vs Numerical - ANOVA (more than 2 categories) | independent ttest (2 categories)
3. Categorical vs Categorical - chi-square


#### Hypothesis Statements

##### For ttest
H0: The mean values of the two numerical variables are the same.

H1: The mean values of the two numerical variables are different.

##### For independent ttest
H0: The mean value of both categories is the same.

H1: The mean value of the two categories is different.

##### For ANOVA
H0: The mean value of all categories is the same.

H1: The mean value of at least one category is different.
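
As an illustration of the ANOVA case, here is a small sketch with `scipy.stats.f_oneway` on three synthetic groups. The group names and the numbers are made up, since the dataset itself has no categorical column with more than two categories other than `country`.

```python
import numpy as np
from scipy.stats import f_oneway

# Synthetic life-expectancy-like samples for three hypothetical income groups.
rng = np.random.default_rng(0)
low = rng.normal(60, 5, 50)
middle = rng.normal(70, 5, 50)
high = rng.normal(80, 5, 50)

# One-way ANOVA: H0 says all three group means are equal.
fstat, pval = f_oneway(low, middle, high)
print(f"F = {fstat:.2f}, pvalue = {pval:.3g}")
```

A small pvalue here only says that at least one group mean differs; it does not say which one.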

##### For Chi-Square
H0: There is no relation between the categorical variables (they are independent).

H1: There is a relation between the categorical variables.
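
`chi2_contingency` tests the null hypothesis that two categorical variables are independent. A minimal sketch follows; `income_band` is a hypothetical column invented for the example and the counts are made up.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Toy data: 40 developed and 60 developing rows with a made-up income band.
toy = pd.DataFrame({
    "status": ["Developed"] * 40 + ["Developing"] * 60,
    "income_band": ["high"] * 35 + ["low"] * 5 + ["high"] * 15 + ["low"] * 45,
})

# Contingency table of observed counts for the two categorical columns.
table = pd.crosstab(toy["status"], toy["income_band"])

# Returns the statistic, the pvalue, the degrees of freedom and the expected counts.
chi2, pval, dof, expected = chi2_contingency(table)
print(table)
print(f"chi2 = {chi2:.2f}, pvalue = {pval:.3g}, dof = {dof}")
```

A pvalue below 0.05 leads to rejecting independence, that is, concluding that the two variables are related.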


##### All these tests return a pvalue, which helps to decide which variables are significant w.r.t. the target variable
- If pvalue < 0.05, reject the null hypothesis and conclude that the variable is significant
- If pvalue >= 0.05, fail to reject the null hypothesis and conclude that the variable is insignificant
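
The decision rule can be wrapped in a small helper so that every test is read the same way. `decide` and `ALPHA` are hypothetical names introduced for this sketch.

```python
ALPHA = 0.05  # significance level used throughout this section


def decide(pval, alpha=ALPHA):
    """Return the verdict for a pvalue at the given significance level."""
    if pval < alpha:
        return "reject H0"
    return "fail to reject H0"


print(decide(2.7e-273))  # a very small pvalue
print(decide(0.31))      # a large pvalue
```

A pvalue exactly equal to the significance level falls on the "fail to reject" side under this convention.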

In [127]:
from scipy.stats import ttest_rel,ttest_ind,f_oneway,chi2_contingency

In [136]:
print(df.columns)

Index(['country', 'year', 'status', 'life_expectancy', 'adult_mortality',
       'infant_deaths', 'alcohol', 'percentage_expenditure', 'hepatitis_b',
       'measles', 'bmi', 'under_five_deaths', 'polio', 'total_expenditure',
       'diphtheria', 'hiv/aids', 'gdp', 'population', 'thinness_10-19_years',
       'thinness_5-9_years', 'income_composition_of_resources', 'schooling'],
      dtype='object')


In [129]:
## function for ttest
def perform_ttest_rel(v1,v2):
    tstat, pval = ttest_rel(v1,v2)
    return pval

In [131]:
# life_expectancy,bmi 
perform_ttest_rel(df['life_expectancy'],df['bmi'] )

0.0

In [135]:
# life_expectancy,schooling 
perform_ttest_rel(df['life_expectancy'],df['schooling'] )

0.0

In [137]:
# life_expectancy,adult_mortality
perform_ttest_rel(df['life_expectancy'],df['adult_mortality'] )

2.744737418411219e-273

In [138]:
# life_expectancy,infant_deaths
perform_ttest_rel(df['life_expectancy'],df['infant_deaths'] )

5.699714765035447e-65

In [140]:
# life_expectancy, percentage_expenditure
perform_ttest_rel(df['life_expectancy'],df['percentage_expenditure'] )

1.0087273336885506e-70

In [141]:
# life_expectancy, hepatitis_b
perform_ttest_rel(df['life_expectancy'],df['hepatitis_b'] )

9.928207278502644e-130

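##### All the pvalues above are less than 0.05, so the null hypothesis is rejected for bmi, schooling, adult_mortality, infant_deaths, percentage_expenditure and hepatitis_b, and these are treated as significant variables.

##### Note that the paired ttest only shows that the mean of each variable differs from the mean of life_expectancy, which is expected for columns measured on different scales. It does not by itself measure how strongly a variable is associated with life_expectancy.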
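
A paired ttest compares the means of two columns. To check whether a numerical variable moves together with life expectancy, a correlation test is a common complement. The sketch below uses `scipy.stats.pearsonr` on synthetic data; the relationship and the numbers are invented, not taken from the dataset. Rows with a missing value in either column are dropped first, because `pearsonr` does not skip NaN on its own.

```python
import numpy as np
import pandas as pd
from scipy.stats import pearsonr

# Synthetic data with a built-in positive relationship plus noise.
rng = np.random.default_rng(1)
schooling = rng.uniform(5, 18, 200)
life = 45 + 2 * schooling + rng.normal(0, 3, 200)
toy = pd.DataFrame({"schooling": schooling, "life_expectancy": life})

# Simulate a few gaps, as in the real columns.
toy.loc[::25, "schooling"] = np.nan

# Keep only the rows where both values are present.
pair = toy[["life_expectancy", "schooling"]].dropna()

# H0: the two variables are uncorrelated (r = 0).
r, pval = pearsonr(pair["life_expectancy"], pair["schooling"])
print(f"r = {r:.3f}, pvalue = {pval:.3g}, rows used = {len(pair)}")
```

The coefficient `r` gives the direction and strength of the linear relationship, which the pvalue of a paired ttest cannot provide.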

In [188]:
## function for independent ttest
pvalues = []

def peform_ind_ttest(col,arr):
    # life expectancy of the rows in each of the two categories
    v1 = df[df[col] == arr[0] ]['life_expectancy']
    v2 = df[df[col] == arr[1] ]['life_expectancy']
    tstat, pval = ttest_ind(v1,v2)
    pvalues.append(pval)

In [189]:
unique_dict = {}
for i in df.select_dtypes(include='object').columns.tolist():
    unique_dict.update({i: df[i].unique()})

In [190]:
for col,arr in unique_dict.items():
    if len(arr) == 2:
        print(col,arr)
        peform_ind_ttest(col,arr)
print(pvalues)

status ['Developing' 'Developed']
[2.4650861700062064e-170]


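##### The pvalue calculated for the status variable is far below 0.05, so the null hypothesis is rejected: the mean life expectancy of developed and developing countries differs, and status is a significant variable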
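
`ttest_ind` assumes equal variances in both groups by default. Since the two groups can differ a lot in size and spread, a variant worth considering is Welch's ttest (`equal_var=False`). The sketch below runs it on synthetic data; the group sizes, means and spreads are assumptions made for the example.

```python
import numpy as np
import pandas as pd
from scipy.stats import ttest_ind

# Synthetic groups with different sizes and spreads.
rng = np.random.default_rng(2)
toy = pd.DataFrame({
    "status": ["Developed"] * 40 + ["Developing"] * 160,
    "life_expectancy": np.concatenate([
        rng.normal(79, 4, 40),
        rng.normal(67, 9, 160),
    ]),
})

# Split the target by category and drop missing values before testing.
developed_le = toy.loc[toy["status"] == "Developed", "life_expectancy"].dropna()
developing_le = toy.loc[toy["status"] == "Developing", "life_expectancy"].dropna()

# Welch's ttest: does not assume that both groups share the same variance.
tstat, pval = ttest_ind(developed_le, developing_le, equal_var=False)
print(f"mean difference = {developed_le.mean() - developing_le.mean():.2f}")
print(f"t = {tstat:.2f}, pvalue = {pval:.3g}")
```

Printing the difference in means next to the pvalue shows the size of the gap as well as its significance.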