#### Introduction

Many past studies have examined the factors affecting life expectancy, considering demographic variables, income composition and mortality rates. This study focuses on immunization, mortality, economic, social and other health-related factors that relate to life expectancy. Since the observations in this dataset are organized by country, it becomes easier to identify which predictors contribute to a lower life expectancy in a given country. This helps suggest which areas a country should prioritize in order to improve the life expectancy of its population efficiently.

#### Importing the required libraries

In [132]:
import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt

#### Reading the dataset into a pandas DataFrame object


In [133]:
data = pd.read_csv("Life_Expectancy_Data.csv")

#### Taking a quick look at the data

In [134]:
data.head()

Unnamed: 0,Country,Year,Status,Life expectancy,Adult Mortality,infant deaths,Alcohol,percentage expenditure,Hepatitis B,Measles,...,Polio,Total expenditure,Diphtheria,HIV/AIDS,GDP,Population,thinness 1-19 years,thinness 5-9 years,Income composition of resources,Schooling
0,Afghanistan,2015,Developing,65.0,263.0,62,0.01,71.279624,65.0,1154,...,6.0,8.16,65.0,0.1,584.25921,33736494.0,17.2,17.3,0.479,10.1
1,Afghanistan,2014,Developing,59.9,271.0,64,0.01,73.523582,62.0,492,...,58.0,8.18,62.0,0.1,612.696514,327582.0,17.5,17.5,0.476,10.0
2,Afghanistan,2013,Developing,59.9,268.0,66,0.01,73.219243,64.0,430,...,62.0,8.13,64.0,0.1,631.744976,31731688.0,17.7,17.7,0.47,9.9
3,Afghanistan,2012,Developing,59.5,272.0,69,0.01,78.184215,67.0,2787,...,67.0,8.52,67.0,0.1,669.959,3696958.0,17.9,18.0,0.463,9.8
4,Afghanistan,2011,Developing,59.2,275.0,71,0.01,7.097109,68.0,3013,...,68.0,7.87,68.0,0.1,63.537231,2978599.0,18.2,18.2,0.454,9.5


#### Since Status is a binary categorical variable, we can replace its two labels with distinct numerical values.
We use `DataFrame.replace()` for that.

In [135]:
data.replace(['Developing', 'Developed'], [0, 1], inplace=True)
data.head()

Unnamed: 0,Country,Year,Status,Life expectancy,Adult Mortality,infant deaths,Alcohol,percentage expenditure,Hepatitis B,Measles,...,Polio,Total expenditure,Diphtheria,HIV/AIDS,GDP,Population,thinness 1-19 years,thinness 5-9 years,Income composition of resources,Schooling
0,Afghanistan,2015,0,65.0,263.0,62,0.01,71.279624,65.0,1154,...,6.0,8.16,65.0,0.1,584.25921,33736494.0,17.2,17.3,0.479,10.1
1,Afghanistan,2014,0,59.9,271.0,64,0.01,73.523582,62.0,492,...,58.0,8.18,62.0,0.1,612.696514,327582.0,17.5,17.5,0.476,10.0
2,Afghanistan,2013,0,59.9,268.0,66,0.01,73.219243,64.0,430,...,62.0,8.13,64.0,0.1,631.744976,31731688.0,17.7,17.7,0.47,9.9
3,Afghanistan,2012,0,59.5,272.0,69,0.01,78.184215,67.0,2787,...,67.0,8.52,67.0,0.1,669.959,3696958.0,17.9,18.0,0.463,9.8
4,Afghanistan,2011,0,59.2,275.0,71,0.01,7.097109,68.0,3013,...,68.0,7.87,68.0,0.1,63.537231,2978599.0,18.2,18.2,0.454,9.5
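
The same encoding can also be restricted to a single column. The following is a minimal sketch on a small made-up frame (the `toy` frame, its values and the `status_codes` name are illustrative assumptions, not part of the dataset). Using `Series.map` with an explicit dictionary keeps the replacement from touching any other column.

```python
import pandas as pd

# Hypothetical toy frame; the values are made up for illustration
toy = pd.DataFrame({
    "Country": ["A", "B", "C"],
    "Status": ["Developing", "Developed", "Developing"],
})

# Map each label to an integer code; labels missing from the dict would become NaN
status_codes = {"Developing": 0, "Developed": 1}
toy["Status"] = toy["Status"].map(status_codes)

print(toy["Status"].tolist())
```

Because unmapped labels turn into `NaN`, a misspelled category shows up in a later null check instead of passing through silently.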


#### Performing data cleaning


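First, we remove outliers from the dataset. To do that, we compute z-scores with the SciPy library and keep only the values that lie within 3 standard deviations of the column mean.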

In [137]:
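from scipy import stats
# Drop the non-numeric 'Country' column before computing z-scores
data2 = data.drop('Country', axis=1)
# Absolute z-score per column, ignoring missing values
z = np.abs(stats.zscore(data2.to_numpy(dtype=float), nan_policy='omit'))
# Keep rows whose non-missing values all lie within 3 standard deviations
data2 = data2[((z < 3) | np.isnan(z)).all(axis=1)]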

In [139]:

data2.describe()


Unnamed: 0,Year,Status,Life expectancy,Adult Mortality,infant deaths,Alcohol,percentage expenditure,Hepatitis B,Measles,BMI,...,Polio,Total expenditure,Diphtheria,HIV/AIDS,GDP,Population,thinness 1-19 years,thinness 5-9 years,Income composition of resources,Schooling
count,20294.0,20294.0,20224.0,20224.0,20294.0,18942.0,20294.0,16518.0,20294.0,20056.0,...,20161.0,18718.0,20161.0,20294.0,17165.0,15741.0,20056.0,20056.0,19132.0,19160.0
mean,2007.531832,0.17281,69.292034,164.077729,26.208929,4.580637,705.379583,81.143783,2087.202178,38.417656,...,82.650017,5.937276,82.42979,1.652582,7299.632565,11525450.0,4.78035,4.808531,0.627741,11.992046
std,4.615833,0.378092,9.425133,122.489036,96.703042,4.050341,1894.996613,24.910565,10021.27646,20.042946,...,23.404726,2.495498,23.682591,4.815516,13855.634824,50213620.0,4.292681,4.373611,0.210551,3.356424
min,2000.0,0.0,36.3,1.0,0.0,0.01,0.0,1.0,0.0,1.0,...,3.0,0.37,2.0,0.1,1.68135,34.0,0.1,0.1,0.0,0.0
25%,2004.0,0.0,63.3,74.0,0.0,0.85,4.395886,77.0,0.0,19.4,...,78.0,4.26,78.0,0.1,464.229758,193759.0,1.6,1.5,0.493,10.1
50%,2008.0,0.0,72.1,144.0,3.0,3.71,64.737149,92.0,16.0,43.8,...,93.0,5.76,93.0,0.1,1771.58662,1367566.0,3.3,3.3,0.678,12.4
75%,2012.0,0.0,75.6,227.0,21.0,7.67,437.080244,97.0,342.75,56.2,...,97.0,7.4875,97.0,0.8,5878.76127,7295394.0,7.1,7.1,0.779,14.2
max,2015.0,1.0,89.0,723.0,1800.0,17.87,19479.91161,99.0,212183.0,87.3,...,99.0,17.6,99.0,50.6,119172.7418,1293859000.0,27.7,28.6,0.948,20.7
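
To make the z-score rule concrete, here is a small sketch on made-up data (the `toy` frame and its values are assumptions for illustration only). It drops every row that contains at least one value more than 3 standard deviations away from its column mean.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical toy data: 19 ordinary values and one extreme value in column 'x'
toy = pd.DataFrame({
    "x": [10.0] * 10 + [12.0] * 9 + [500.0],
    "y": np.arange(20, dtype=float),
})

# Absolute z-score of every cell, computed column by column
z = np.abs(stats.zscore(toy.to_numpy(), axis=0))

# Keep a row only if all of its values are within 3 standard deviations
filtered = toy[(z < 3).all(axis=1)]

print(len(toy), len(filtered))
```

Note that a single extreme value inflates the standard deviation it is measured against, so this rule is most reliable when outliers are rare.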


Next, we look at the number of null values in the dataset. We use `isnull().sum()`, which counts the null values in each column.

In [None]:
data.isnull().sum()


Country                              0
Year                                 0
Status                               0
Life expectancy                     10
Adult Mortality                     10
infant deaths                        0
Alcohol                            194
percentage expenditure               0
Hepatitis B                        553
Measles                              0
 BMI                                34
under-five deaths                    0
Polio                               19
Total expenditure                  226
Diphtheria                          19
 HIV/AIDS                            0
GDP                                448
Population                         652
 thinness  1-19 years               34
 thinness 5-9 years                 34
Income composition of resources    167
Schooling                          163
dtype: int64
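
Raw counts are easier to judge relative to the number of rows. As a hedged sketch on made-up data (the `toy` frame is an assumption, not the dataset), taking the mean of the boolean mask gives the share of missing values per column:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with two missing values in column 'a'
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan],
    "b": [1.0, 2.0, 3.0, 4.0],
})

# isnull() yields True/False per cell; the column mean of that mask is the missing fraction
missing_share = toy.isnull().mean() * 100

print(missing_share)
```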

Replacing null values with the mean. We calculate the mean of each numeric column and use the `fillna()` function to replace all null occurrences in that column with its mean.

In [None]:
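def replace_with_mean(s):
    # Skip the only non-numeric column
    if s != 'Country':
        # Assign the result back instead of calling fillna(inplace=True) on a column selection
        data[s] = data[s].fillna(data[s].mean())

# Impute every numeric column with its own mean
for column in data.columns:
    replace_with_mean(column)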


Now our data no longer contains any null values.

In [None]:
data.isnull().sum()

Country                            0
Year                               0
Status                             0
Life expectancy                    0
Adult Mortality                    0
infant deaths                      0
Alcohol                            0
percentage expenditure             0
Hepatitis B                        0
Measles                            0
 BMI                               0
under-five deaths                  0
Polio                              0
Total expenditure                  0
Diphtheria                         0
 HIV/AIDS                          0
GDP                                0
Population                         0
 thinness  1-19 years              0
 thinness 5-9 years                0
Income composition of resources    0
Schooling                          0
dtype: int64
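
The column-by-column loop can also be written as a single vectorized call. This is a minimal sketch on made-up data (the `toy` frame and its values are assumptions), relying on `fillna` accepting a Series of per-column fill values:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one text column and two numeric columns
toy = pd.DataFrame({
    "Country": ["A", "B", "C"],
    "GDP": [100.0, np.nan, 300.0],
    "Schooling": [10.0, 12.0, np.nan],
})

# Means of the numeric columns only; 'Country' is skipped
means = toy.mean(numeric_only=True)

# Each NaN is filled with the mean of its own column
filled = toy.fillna(means)

print(filled)
```

Mean imputation keeps each column's mean unchanged but reduces its variance, which is worth keeping in mind when reading the summary statistics.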

#### Getting general information about the dataset

In [None]:
data.describe()

Unnamed: 0,Year,Status,Life expectancy,Adult Mortality,infant deaths,Alcohol,percentage expenditure,Hepatitis B,Measles,BMI,...,Polio,Total expenditure,Diphtheria,HIV/AIDS,GDP,Population,thinness 1-19 years,thinness 5-9 years,Income composition of resources,Schooling
count,2938.0,2938.0,2938.0,2938.0,2938.0,2938.0,2938.0,2938.0,2938.0,2938.0,...,2938.0,2938.0,2938.0,2938.0,2938.0,2938.0,2938.0,2938.0,2938.0,2938.0
mean,2007.51872,0.174268,69.224932,164.796448,30.303948,4.602861,738.251295,80.940461,2419.59224,38.321247,...,82.550188,5.93819,82.324084,1.742103,7483.158469,12753380.0,4.839704,4.870317,0.627551,11.992793
std,4.613841,0.379405,9.50764,124.080302,117.926501,3.916288,1987.914858,22.586855,11467.272489,19.927677,...,23.352143,2.400274,23.640073,5.077785,13136.800417,53815460.0,4.394535,4.482708,0.20482,3.264381
min,2000.0,0.0,36.3,1.0,0.0,0.01,0.0,1.0,0.0,1.0,...,3.0,0.37,2.0,0.1,1.68135,34.0,0.1,0.1,0.0,0.0
25%,2004.0,0.0,63.2,74.0,0.0,1.0925,4.685343,80.940461,0.0,19.4,...,78.0,4.37,78.0,0.1,580.486996,418917.2,1.6,1.6,0.50425,10.3
50%,2008.0,0.0,72.0,144.0,3.0,4.16,64.912906,87.0,17.0,43.0,...,93.0,5.93819,93.0,0.1,3116.561755,3675929.0,3.4,3.4,0.662,12.1
75%,2012.0,0.0,75.6,227.0,22.0,7.39,441.534144,96.0,360.25,56.1,...,97.0,7.33,97.0,0.8,7483.158469,12753380.0,7.1,7.2,0.772,14.1
max,2015.0,1.0,89.0,723.0,1800.0,17.87,19479.91161,99.0,212183.0,87.3,...,99.0,17.6,99.0,50.6,119172.7418,1293859000.0,27.7,28.6,0.948,20.7


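Dropping any remaining null values. Since the mean imputation already filled every column, this step is only a safeguard and removes no rows.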

In [None]:
data = data.dropna()

In [None]:
# Restrict to numeric columns; 'Country' is still a text column at this point
data.corr(numeric_only=True)

Unnamed: 0,Year,Status,Life expectancy,Adult Mortality,infant deaths,Alcohol,percentage expenditure,Hepatitis B,Measles,BMI,...,Polio,Total expenditure,Diphtheria,HIV/AIDS,GDP,Population,thinness 1-19 years,thinness 5-9 years,Income composition of resources,Schooling
Year,1.0,-0.001864,0.169623,-0.078861,-0.037415,-0.048168,0.0314,0.089398,-0.082493,0.108327,...,0.09382,0.08186,0.133853,-0.139741,0.093351,0.014951,-0.047592,-0.050627,0.236333,0.203471
Status,-0.001864,1.0,0.481962,-0.315171,-0.112252,0.579371,0.454261,0.095642,-0.076955,0.310873,...,0.220098,0.289985,0.216763,-0.14859,0.445911,-0.041091,-0.367934,-0.366297,0.457302,0.491444
Life expectancy,0.169623,0.481962,1.0,-0.696359,-0.196535,0.391598,0.381791,0.203771,-0.157574,0.559255,...,0.461574,0.207981,0.475418,-0.556457,0.430493,-0.019638,-0.472162,-0.466629,0.692483,0.715066
Adult Mortality,-0.078861,-0.315171,-0.696359,1.0,0.078747,-0.190408,-0.242814,-0.138591,0.031174,-0.381449,...,-0.272694,-0.110875,-0.273014,0.523727,-0.277053,-0.012501,0.299863,0.305366,-0.440062,-0.435108
infant deaths,-0.037415,-0.112252,-0.196535,0.078747,1.0,-0.113812,-0.085612,-0.178783,0.501128,-0.22722,...,-0.170674,-0.126564,-0.175156,0.025231,-0.107109,0.548522,0.46559,0.471228,-0.143663,-0.191757
Alcohol,-0.048168,0.579371,0.391598,-0.190408,-0.113812,1.0,0.339634,0.075447,-0.051055,0.31807,...,0.213744,0.294898,0.215242,-0.04865,0.318591,-0.030765,-0.416946,-0.405881,0.416099,0.497546
percentage expenditure,0.0314,0.454261,0.381791,-0.242814,-0.085612,0.339634,1.0,0.011679,-0.056596,0.228537,...,0.147203,0.173414,0.14357,-0.097857,0.88814,-0.024648,-0.25119,-0.252725,0.380374,0.388105
Hepatitis B,0.089398,0.095642,0.203771,-0.138591,-0.178783,0.075447,0.011679,1.0,-0.090317,0.134929,...,0.408519,0.050084,0.499958,-0.102405,0.062318,-0.109811,-0.105144,-0.108334,0.150992,0.171755
Measles,-0.082493,-0.076955,-0.157574,0.031174,0.501128,-0.051055,-0.056596,-0.090317,1.0,-0.175925,...,-0.136146,-0.104569,-0.141861,0.030899,-0.06806,0.23625,0.224742,0.221007,-0.115764,-0.122609
BMI,0.108327,0.310873,0.559255,-0.381449,-0.22722,0.31807,0.228537,0.134929,-0.175925,1.0,...,0.282156,0.231814,0.281059,-0.243548,0.276645,-0.063238,-0.532025,-0.538911,0.479837,0.508105
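
A full correlation matrix is hard to scan, so it can help to rank the predictors by the strength of their correlation with the target. The sketch below uses made-up numbers (the `toy` frame, including the `Noise` column, is an assumption). In the real dataset the target column name carries a trailing space, so the lookup key would need to match it exactly.

```python
import pandas as pd

# Hypothetical toy frame; the values are made up for illustration
toy = pd.DataFrame({
    "Life expectancy": [60.0, 65.0, 70.0, 75.0, 80.0],
    "Schooling": [8.0, 9.5, 11.0, 12.0, 14.0],
    "Adult Mortality": [300.0, 250.0, 210.0, 150.0, 90.0],
    "Noise": [3.0, 1.0, 4.0, 1.0, 5.0],
})

corr = toy.corr(numeric_only=True)

# Correlation of each predictor with the target, strongest (by absolute value) first
ranked = (
    corr["Life expectancy"]
    .drop("Life expectancy")
    .sort_values(key=abs, ascending=False)
)

print(ranked)
```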


In [None]:
list(data.columns)


['Country',
 'Year',
 'Status',
 'Life expectancy ',
 'Adult Mortality',
 'infant deaths',
 'Alcohol',
 'percentage expenditure',
 'Hepatitis B',
 'Measles ',
 ' BMI ',
 'under-five deaths ',
 'Polio',
 'Total expenditure',
 'Diphtheria ',
 ' HIV/AIDS',
 'GDP',
 'Population',
 ' thinness  1-19 years',
 ' thinness 5-9 years',
 'Income composition of resources',
 'Schooling']
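
The listing above shows that several column names carry leading, trailing or doubled spaces, which makes them easy to mistype. One possible cleanup, sketched here on an empty made-up frame (the `toy` frame is an assumption and only reuses a few of the names shown above), strips the outer whitespace and collapses inner runs of spaces:

```python
import pandas as pd

# Hypothetical empty frame that reuses a few of the column names listed above
toy = pd.DataFrame(columns=["Life expectancy ", " BMI ", " thinness  1-19 years", "GDP"])

# Strip outer whitespace, then collapse any run of inner whitespace to a single space
toy.columns = toy.columns.str.strip().str.replace(r"\s+", " ", regex=True)

print(list(toy.columns))
```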

In [None]:
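# Returns a filled copy; `data` itself only changes if the result is assigned back
data.fillna(data.mean(numeric_only=True))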

Unnamed: 0,Country,Year,Status,Life expectancy,Adult Mortality,infant deaths,Alcohol,percentage expenditure,Hepatitis B,Measles,...,Polio,Total expenditure,Diphtheria,HIV/AIDS,GDP,Population,thinness 1-19 years,thinness 5-9 years,Income composition of resources,Schooling
0,Afghanistan,2015,0,65.0,263.0,62,0.01,71.279624,65.0,1154,...,6.0,8.16,65.0,0.1,584.259210,33736494.0,17.2,17.3,0.479,10.1
1,Afghanistan,2014,0,59.9,271.0,64,0.01,73.523582,62.0,492,...,58.0,8.18,62.0,0.1,612.696514,327582.0,17.5,17.5,0.476,10.0
2,Afghanistan,2013,0,59.9,268.0,66,0.01,73.219243,64.0,430,...,62.0,8.13,64.0,0.1,631.744976,31731688.0,17.7,17.7,0.470,9.9
3,Afghanistan,2012,0,59.5,272.0,69,0.01,78.184215,67.0,2787,...,67.0,8.52,67.0,0.1,669.959000,3696958.0,17.9,18.0,0.463,9.8
4,Afghanistan,2011,0,59.2,275.0,71,0.01,7.097109,68.0,3013,...,68.0,7.87,68.0,0.1,63.537231,2978599.0,18.2,18.2,0.454,9.5
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2933,Zimbabwe,2004,0,44.3,723.0,27,4.36,0.000000,68.0,31,...,67.0,7.13,65.0,33.6,454.366654,12777511.0,9.4,9.4,0.407,9.2
2934,Zimbabwe,2003,0,44.5,715.0,26,4.06,0.000000,7.0,998,...,7.0,6.52,68.0,36.7,453.351155,12633897.0,9.8,9.9,0.418,9.5
2935,Zimbabwe,2002,0,44.8,73.0,25,4.43,0.000000,73.0,304,...,73.0,6.53,71.0,39.8,57.348340,125525.0,1.2,1.3,0.427,10.0
2936,Zimbabwe,2001,0,45.3,686.0,25,1.72,0.000000,76.0,529,...,76.0,6.16,75.0,42.1,548.587312,12366165.0,1.6,1.7,0.427,9.8


In [None]:
data.dtypes

Country                             object
Year                                 int64
Status                               int64
Life expectancy                    float64
Adult Mortality                    float64
infant deaths                        int64
Alcohol                            float64
percentage expenditure             float64
Hepatitis B                        float64
Measles                              int64
 BMI                               float64
under-five deaths                    int64
Polio                              float64
Total expenditure                  float64
Diphtheria                         float64
 HIV/AIDS                          float64
GDP                                float64
Population                         float64
 thinness  1-19 years              float64
 thinness 5-9 years                float64
Income composition of resources    float64
Schooling                          float64
dtype: object

In [None]:
data.Status.unique()

array([0, 1], dtype=int64)

In [None]:
data.drop('Country', axis=1, inplace=True)

In [None]:
data.dtypes


Year                                 int64
Status                               int64
Life expectancy                    float64
Adult Mortality                    float64
infant deaths                        int64
Alcohol                            float64
percentage expenditure             float64
Hepatitis B                        float64
Measles                              int64
 BMI                               float64
under-five deaths                    int64
Polio                              float64
Total expenditure                  float64
Diphtheria                         float64
 HIV/AIDS                          float64
GDP                                float64
Population                         float64
 thinness  1-19 years              float64
 thinness 5-9 years                float64
Income composition of resources    float64
Schooling                          float64
dtype: object