# Life Expectancy Prediction Project

I will use the **Life Expectancy dataset**, collected by the WHO and the United Nations, assembled and available at: https://www.kaggle.com/datasets/kumarajarshi/life-expectancy-who. The columns of the dataset are:

* **Life expectancy:** measured in years (target variable)
* **Country**
* **Year**
* **Status:** developing or developed country
* **Population:** population of the country
* **Adult mortality:** probability of dying between 15 and 60 years per 1000 population, for both sexes
* **Infant deaths:** number of infant deaths per 1000 population
* **Under-five deaths:** number of under-five deaths per 1000 population
* **GDP:** Gross Domestic Product per capita (in USD)
* **Percentage expenditure:** expenditure on health as a percentage of Gross Domestic Product per capita (%)
* **Total expenditure:** general government expenditure on health as a percentage of total government expenditure (%)
* **Income composition:** Human Development Index in terms of income composition of resources (index ranging from 0 to 1)
* **Hepatitis B:** hepatitis B (HepB) immunization coverage among 1-year-olds (%)
* **Polio:** polio (Pol3) immunization coverage among 1-year-olds (%)
* **Diphtheria:** diphtheria, tetanus toxoid and pertussis (DTP3) immunization coverage among 1-year-olds (%)
* **Measles:** number of reported cases per 1000 population
* **HIV/AIDS:** deaths from HIV/AIDS per 1000 live births (0-4 years)
* **Thinness 5-9:** prevalence of thinness among children and adolescents for age 5 to 9 (%)
* **Thinness 10-19:** prevalence of thinness among children and adolescents for age 10 to 19 (%)
* **BMI:** average Body Mass Index of the entire population
* **Alcohol:** recorded per capita (15+) consumption (in litres of pure alcohol)
* **Schooling:** number of years of schooling

In [1]:
import pandas as pd
pd.set_option('display.max_rows', 500)
import numpy as np
np.set_printoptions(legacy='1.25')
import plotly.io as pio
import plotly.express as px
pio.templates.default = "simple_white"
from sklearn.model_selection import train_test_split

In [2]:
df = pd.read_csv("life_expectancy_data.csv")
df

Unnamed: 0,Country,Year,Status,Life expectancy,Adult Mortality,infant deaths,Alcohol,percentage expenditure,Hepatitis B,Measles,...,Polio,Total expenditure,Diphtheria,HIV/AIDS,GDP,Population,thinness 1-19 years,thinness 5-9 years,Income composition of resources,Schooling
0,Afghanistan,2015,Developing,65.0,263.0,62,0.01,71.279624,65.0,1154,...,6.0,8.16,65.0,0.1,584.259210,33736494.0,17.2,17.3,0.479,10.1
1,Afghanistan,2014,Developing,59.9,271.0,64,0.01,73.523582,62.0,492,...,58.0,8.18,62.0,0.1,612.696514,327582.0,17.5,17.5,0.476,10.0
2,Afghanistan,2013,Developing,59.9,268.0,66,0.01,73.219243,64.0,430,...,62.0,8.13,64.0,0.1,631.744976,31731688.0,17.7,17.7,0.470,9.9
3,Afghanistan,2012,Developing,59.5,272.0,69,0.01,78.184215,67.0,2787,...,67.0,8.52,67.0,0.1,669.959000,3696958.0,17.9,18.0,0.463,9.8
4,Afghanistan,2011,Developing,59.2,275.0,71,0.01,7.097109,68.0,3013,...,68.0,7.87,68.0,0.1,63.537231,2978599.0,18.2,18.2,0.454,9.5
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2933,Zimbabwe,2004,Developing,44.3,723.0,27,4.36,0.000000,68.0,31,...,67.0,7.13,65.0,33.6,454.366654,12777511.0,9.4,9.4,0.407,9.2
2934,Zimbabwe,2003,Developing,44.5,715.0,26,4.06,0.000000,7.0,998,...,7.0,6.52,68.0,36.7,453.351155,12633897.0,9.8,9.9,0.418,9.5
2935,Zimbabwe,2002,Developing,44.8,73.0,25,4.43,0.000000,73.0,304,...,73.0,6.53,71.0,39.8,57.348340,125525.0,1.2,1.3,0.427,10.0
2936,Zimbabwe,2001,Developing,45.3,686.0,25,1.72,0.000000,76.0,529,...,76.0,6.16,75.0,42.1,548.587312,12366165.0,1.6,1.7,0.427,9.8


# Exploratory Data Analysis

In [3]:
# Formatting column names:
df.columns = df.columns.str.lower().str.strip()
df.columns = df.columns.str.replace(' ', '_').str.replace('__', '_').str.replace('/', '_')
df = df.rename(columns={'thinness_1-19_years': 'thinness_10-19_years'})

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2938 entries, 0 to 2937
Data columns (total 22 columns):
 #   Column                           Non-Null Count  Dtype  
---  ------                           --------------  -----  
 0   country                          2938 non-null   object 
 1   year                             2938 non-null   int64  
 2   status                           2938 non-null   object 
 3   life_expectancy                  2928 non-null   float64
 4   adult_mortality                  2928 non-null   float64
 5   infant_deaths                    2938 non-null   int64  
 6   alcohol                          2744 non-null   float64
 7   percentage_expenditure           2938 non-null   float64
 8   hepatitis_b                      2385 non-null   float64
 9   measles                          2938 non-null   int64  
 10  bmi                              2904 non-null   float64
 11  under-five_deaths                2938 non-null   int64  
 12  polio               

In [5]:
# Check if there are duplicated data:
df.duplicated().any()

False
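
`df.duplicated()` only flags rows that are identical in every column. Since each row is meant to describe one country in one year, a complementary check is to look for repeated `(country, year)` pairs. The following is a minimal sketch on a made-up frame (`toy` and its values are illustrative, not taken from the dataset):

```python
import pandas as pd

# Toy stand-in for df; rows are illustrative only.
toy = pd.DataFrame({
    "country": ["A", "A", "B", "B"],
    "year": [2014, 2015, 2015, 2015],
    "life_expectancy": [70.1, 70.4, 65.0, 65.2],
})

# Rows identical in every column.
has_full_duplicates = toy.duplicated().any()

# Rows that repeat the same (country, year) key, even if other values differ.
has_key_duplicates = toy.duplicated(subset=["country", "year"]).any()

print(has_full_duplicates, has_key_duplicates)
```

In the toy frame, country B appears twice for 2015 with different values, so only the key-based check flags it.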

In [6]:
cols_numerical = df.select_dtypes(include=['int64', 'float64']).columns
cols_categorical = df.select_dtypes(include=['object']).columns

In [7]:
# Value counts of categorical data:
for col in cols_categorical:
    print('Number of unique values of', col, ':', df[col].nunique())
    value_counts = df[col].value_counts()
    print(value_counts.value_counts())
    print()

Number of unique values of country : 193
count
16    183
1      10
Name: count, dtype: int64

Number of unique values of status : 2
count
2426    1
512     1
Name: count, dtype: int64
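
The count-of-counts output for `country` shows that 10 countries have a single row. One possible way to list such countries and the years they cover is sketched below on a small made-up frame (`toy` is a hypothetical stand-in for `df`; its rows are not taken from the dataset):

```python
import pandas as pd

# Toy stand-in for df; rows are illustrative only.
toy = pd.DataFrame({
    "country": ["A", "A", "A", "B", "C", "C"],
    "year": [2013, 2014, 2015, 2013, 2014, 2015],
})

# Number of rows per country.
rows_per_country = toy["country"].value_counts()

# Countries that appear exactly once.
single_row_countries = rows_per_country[rows_per_country == 1].index

# The rows (and years) belonging to those countries.
single_rows = toy[toy["country"].isin(single_row_countries)]
print(single_rows[["country", "year"]])
print(single_rows["year"].unique())
```

Applied to `df`, the same pattern would show which year or years the single-row countries cover.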



In [8]:
# Summary stats of numerical data:
df.describe().T.apply(lambda s: s.apply('{0:.2f}'.format))

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
year,2938.0,2007.52,4.61,2000.0,2004.0,2008.0,2012.0,2015.0
life_expectancy,2928.0,69.22,9.52,36.3,63.1,72.1,75.7,89.0
adult_mortality,2928.0,164.8,124.29,1.0,74.0,144.0,228.0,723.0
infant_deaths,2938.0,30.3,117.93,0.0,0.0,3.0,22.0,1800.0
alcohol,2744.0,4.6,4.05,0.01,0.88,3.75,7.7,17.87
percentage_expenditure,2938.0,738.25,1987.91,0.0,4.69,64.91,441.53,19479.91
hepatitis_b,2385.0,80.94,25.07,1.0,77.0,92.0,97.0,99.0
measles,2938.0,2419.59,11467.27,0.0,0.0,17.0,360.25,212183.0
bmi,2904.0,38.32,20.04,1.0,19.3,43.5,56.2,87.3
under-five_deaths,2938.0,42.04,160.45,0.0,0.0,4.0,28.0,2500.0


In [None]:
# Histograms for all columns:
for col in df.columns:
    fig = px.histogram(df, x=col)
    fig.show()

Inspecting the summary statistics and histograms reveals several columns with suspicious values:

* Most countries have data for all 16 years (2000-2015), but 10 countries only have data for 2013.
* `infant_deaths`, `under-five_deaths` and `measles` contain values >1000 (maxima of 1800, 2500 and 212183, respectively). These may represent total counts rather than cases per 1000 population.
* `population` ranges from as low as 34 to as high as 1.3 billion.
* `percentage_expenditure` is not bounded to 0-1 or 0-100 as a percentage would be (its maximum is 19479.91), so it may not have been normalized by GDP. In contrast, `total_expenditure` seems to have plausible values and gives similar information.
* `income_composition_of_resources` contains many 0 values, which could represent missing values, since there is a gap between 0 and 0.25 with no observations.
* `bmi` ranges from 1.0 to 87.3 with a median of 43.5, so many values are unrealistically low (<18.5) or high (>39.9) for a population average.
* The distributions of the prevalence of thinness at ages 5-9 and 10-19 are very similar.
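
Some of the observations above can be quantified rather than read off the histograms. The sketch below uses a made-up frame (`toy` is hypothetical and its values are illustrative only) to count rates above 1000 in the per-1000 columns, count exact zeros in the income index, and measure how closely the two thinness columns track each other:

```python
import pandas as pd

# Toy stand-in for df; values are illustrative only.
toy = pd.DataFrame({
    "infant_deaths": [62, 0, 1800, 3],
    "under-five_deaths": [83, 0, 2500, 4],
    "measles": [1154, 0, 90387, 17],
    "income_composition_of_resources": [0.479, 0.0, 0.0, 0.85],
    "thinness_5-9_years": [17.3, 1.3, 27.0, 0.6],
    "thinness_10-19_years": [17.2, 1.2, 26.7, 0.6],
})

# A rate expressed per 1000 population should not exceed 1000.
per_1000_cols = ["infant_deaths", "under-five_deaths", "measles"]
n_impossible = (toy[per_1000_cols] > 1000).sum()
print(n_impossible)

# Exact zeros in the income index are candidate missing values.
n_zero_income = (toy["income_composition_of_resources"] == 0).sum()
print(n_zero_income)

# Pearson correlation between the two thinness columns.
thinness_corr = toy["thinness_5-9_years"].corr(toy["thinness_10-19_years"])
print(round(thinness_corr, 3))
```

Run against `df`, these counts would indicate how many rows each issue affects before deciding how to handle them.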

We'll take a closer look at some of these columns in the following steps:

* Fill missing values after filtering rows
* Data transformation
* Bivariate/Multivariate analysis

## Data cleaning