
# <center> GLOBAL LIFE EXPECTANCY - A DATA STORY </center>

<a href="https://www.tehrantimes.com/news/407556/Life-expectancy-increased-in-Iran-WHO">
  <img align="right" width="500" height="200" src="https://media.mehrnews.com/d/2016/10/22/4/2250060.jpg?ts=1486462047399">
</a>

>**Project Owner:** Swintabel Agyei<br>
>**Email:** swintabel95@gmail.com<br>
>[**GitHub Profile**](https://github.com/Swintabel) | [**LinkedIn Profile**](https://www.linkedin.com/in/swintabelagyei/)<br>
>**Data Source:** [Kaggle](https://www.kaggle.com/datasets/augustus0498/life-expectancy-who)<br>
>Presentation on Tableau

# **INTRODUCTION**

### BACKGROUND
Life expectancy is a hypothetical measure of the average number of years a person is expected to live, given the conditions prevailing in a particular year or period. The most common form of this measure is 'Life Expectancy at birth': the average number of years that a newborn could expect to live if he or she were to pass through life exposed to the sex- and age-specific death rates prevailing at the time of his or her birth, for a specific year, in a given country, territory, or geographic area. [[1]](#ref1)[[2]](#ref2) Generally, past studies suggest that developed countries tend to have a higher life expectancy than less developed countries.[[3]](#ref3) However, with the introduction of immunization policies and evolving socio-economic factors over the years, it is important to find empirical evidence on the country-specific determinants of life expectancy.
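
To make the definition concrete, here is a minimal sketch of how a period life expectancy at birth can be derived from age-specific death probabilities. The probabilities are hypothetical and deliberately unrealistic (five one-year intervals only), and deaths are assumed to happen halfway through an interval on average. It illustrates the idea rather than the full WHO methodology.

```python
import numpy as np

# Hypothetical probabilities of dying within each one-year age interval
# (illustrative only, not taken from the dataset). Everyone is assumed
# to die by the end of the last interval.
q = np.array([0.05, 0.01, 0.01, 0.02, 1.0])

# Share of the birth cohort still alive at the start of each interval.
survivors = np.concatenate(([1.0], np.cumprod(1 - q)[:-1]))

# Deaths in each interval, as a share of the birth cohort.
deaths = survivors * q

# Person-years lived in each interval: those who survive it live a full
# year, those who die are assumed to live half a year on average.
person_years = (survivors - deaths) + 0.5 * deaths

# Life expectancy at birth is the total person-years per newborn.
life_expectancy_at_birth = person_years.sum()
print(round(life_expectancy_at_birth, 3))
```

Lower death probabilities at any age raise the total, which is why the measure responds to both child and adult mortality.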
    
    

### OBJECTIVE
Living longer has been one of humanity’s greatest ambitions, and living more than 80 years is currently a realistic expectation in many countries. [[5]](#ref5) Several factors can lead to death, but good health policies are expected to be associated with higher life expectancy in any population. Health policies are perhaps not the only or the most relevant driver of life expectancy; other factors such as socio-economic status, lifestyle and unforeseen disease outbreaks, amongst others, may also influence a population's life expectancy on various levels.[[4]](#ref4) In 2019, life expectancy at birth reached 73.3 years globally, but with a difference of around 16 years between high-income and low-income countries.[[6]](#ref6) Given this background, the project seeks to investigate the following questions:

* To what level does a country's economic status affect life expectancy of its population?
* Do lifestyle trends in a country affect life expectancy?
* Do Health policies have a strong impact on life expectancy?
* Does the prevalence of diseases have a strong effect on life expectancy?

# **EXPLORATORY DATA ANALYSIS**


In [None]:
# loading libraries
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sn
import seaborn.objects as so
from plotly.subplots import make_subplots
import plotly.graph_objects as go
import plotly.express as px
from scipy.stats.mstats import winsorize
import os
from plotly.offline import init_notebook_mode, iplot
init_notebook_mode(connected=True)

from sklearn.model_selection import train_test_split

#### **ABOUT DATA**

The Global Health Observatory (GHO) data repository under the World Health Organization (WHO) keeps track of health status and many other related factors for all countries. The datasets are made available to the public for the purpose of health data analysis. The dataset on life expectancy, health factors and socio-economic status for 193 countries was collected from the WHO data repository website, and the corresponding economic data was collected from the United Nations website. The data spans 2000 to 2015.

It is useful to start with the general overview the data provides in light of the objectives of the analysis, and then probe further into the interesting trends.

**DATA STRUCTURE**

In [None]:
filepath=("/kaggle/input/ledcsv/led_clean.xlsx")
# loading and inspecting data 
led=pd.read_excel(filepath)
led.drop(columns=["Unnamed: 0"], inplace=True)
print(led.info())
led.head(2)

**DATA INFORMATION TABLES**

Data is mostly presented in tables. It is important to know the information carried in every level of the data for correct evaluations and interpretation.

**The table below gives a description of the various variables in the dataset.** [[7]](#ref7)

|Field|Description|
|---:|:---|
|Country|Country|
|Year|Year|
|Status|Developed or Developing status|
|Life expectancy|Life expectancy in years|
|Adult Mortality|Adult Mortality Rates of both sexes (probability of dying between 15 and 60 years per 1000 population)|
|infant deaths|Number of Infant Deaths per 1000 population|
|Alcohol|Alcohol, recorded per capita (15+) consumption (in litres of pure alcohol)|
|percentage expenditure|Expenditure on health as a percent of Gross Domestic Product per capita(%)|
|Hepatitis B|Hepatitis B (HepB) immunization coverage among 1-year-olds (%)|
|Measles|Measles - number of reported cases per 1000 population|
|BMI|Average Body Mass Index of entire population|
|under-five deaths|Number of under-five deaths per 1000 population|
|Polio|Polio (Pol3) immunization coverage among 1-year-olds (%)|
|Total expenditure|General government expenditure on health as a percent of total government expenditure (%)|
|Diphtheria|Diphtheria tetanus toxoid and pertussis (DTP3) immunization coverage among 1-year-olds (%)|
|HIV/AIDS|Deaths per 1 000 live births HIV/AIDS (0-4 years)|
|GDP|Gross Domestic Product per capita (in USD)|
|Population|Population of the country|
|thinness 1-19 years|Prevalence of thinness among children and adolescents for Age 10 to 19 (%)|
|thinness 5-9 years|Prevalence of thinness among children for Age 5 to 9(%)|
|Income composition of resources|Income composition of resources|
|Schooling|Number of years of Schooling(years)|
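
The code cells in this notebook refer to columns without spaces (for example `Lifeexpectancy`, `under-fivedeaths` and `thinness1-19years`), while the table lists the original field names. The cleaning step that produced those names is not shown, so the following is only a sketch of one way to get there, assuming whitespace was simply removed from the headers. The frame `raw` and its values are made up.

```python
import pandas as pd

# A few of the original field names from the table, with toy values.
raw = pd.DataFrame({
    "Life expectancy ": [65.0],
    "under-five deaths ": [83],
    " thinness  1-19 years": [17.2],
})

# Remove all whitespace from the column names (assumed cleaning rule).
clean = raw.rename(columns=lambda name: "".join(name.split()))
print(list(clean.columns))
```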

# GEOGRAPHICAL DISTRIBUTION OF LIFE EXPECTANCY

##### `FIG.1` <font color='blue'>**World Map**</font>

In [None]:
fig = px.choropleth(led, locations="Match",
                    color="Lifeexpectancy", # life expectancy drives the fill color
                    hover_name="Country", # column to add to hover information
                    color_continuous_scale=px.colors.diverging.RdBu)
fig.update_layout(
        autosize=False,
        margin = dict(
                l=0,
                r=0,
                b=0,
                t=0,
                pad=4,
                autoexpand=True
            ),
            width=1000,
            height=400,
            paper_bgcolor= "rgba(0, 0, 0, 0)",
            plot_bgcolor= "rgba(0, 0, 0, 0)"
    )
        
fig.show()

`From the map, areas colored in blue have very high life expectancy and areas in red have very low life expectancy, with the light-colored areas in between. Low life expectancy is predominant in African countries, while countries in Australia & Oceania and North America have high life expectancy.`

# LIFE EXPECTANCY AND ECONOMIC STATUS (DEVELOPED VS. DEVELOPING COUNTRIES)

In [None]:
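# average only the numeric columns; text columns such as Country cannot be averaged
led_group = led.groupby(by = ['Status', 'Year']).mean(numeric_only=True).reset_index()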
#led_group.head()

##### `FIG.2` <font color='blue'>**AVERAGE TREND**</font>

A visual look at how life expectancy has changed over the years for both developed and developing countries helps provide a good understanding of the data.

In [None]:
Developed = led_group.loc[led_group['Status'] == 'Developed',:]
Developing = led_group.loc[led_group['Status'] == 'Developing',:]
plt.figure(figsize=(30, 10))

xs=Developed['Year']
ys=Developed['Lifeexpectancy']
plt.plot(xs, ys, color='darkblue', linestyle='--', marker='o')

    
xs1= Developing['Year']
ys1= Developing['Lifeexpectancy']
plt.plot(xs1, ys1, color='coral', linestyle=':', marker='*')

plt.rcParams['axes.facecolor'] = 'White'
plt.title('AVERAGE LIFE EXPECTANCY FOR DEVELOPED VS DEVELOPING COUNTRIES', fontsize = 15)
plt.legend(['Developed','Developing'])
plt.ylabel("Life Expectancy", fontsize = 12)
plt.xlabel("Year", fontsize = 12)

plt.show()

From the figure above, there is a continuous increase in Life Expectancy from 2000 to 2014 for both developed and developing countries. However, a gap separates the two trends consistently throughout the years. This is consistent with the claims by past studies that people in developed countries on average tend to live longer than people in developing countries.[[3]](#ref3)

##### `FIG.3` <font color='blue'>**MEDIAN DIFFERENCE**</font>

In [None]:
difference=Developed["Lifeexpectancy"].median()-Developing["Lifeexpectancy"].median()

fig = go.Figure(go.Funnel(
    y =["Developed", "Developing", "Difference"],
    x = [round(Developed["Lifeexpectancy"].median()), round(Developing["Lifeexpectancy"].median()),round(difference)],
    textposition = "inside",
    textinfo = "value",marker = {"color": ["darkblue", "salmon", "grey"],"line": {"width": [4, 2, 2, 3, 1, 1], "color": ["blue", "salmon", "red"]}},
    connector = {"line": {"color": "grey", "width": 5}})
    )
fig.update_layout(
        autosize=False,
        margin = dict(
                l=0,
                r=0,
                b=0,
                t=0,
                pad=4,
                autoexpand=True
            ),
            width=800,
            height=200,
            paper_bgcolor= "rgba(0, 0, 0, 0)",
            plot_bgcolor= "rgba(0, 0, 0, 0)"
    )

fig.show()

From the plot, we observe that there is a 12-year gap between the median life expectancy for Developed and Developing Countries.
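
A single median hides how the gap moves from year to year. The sketch below shows one way to compute a yearly gap from a frame shaped like `led_group`. The frame `toy_group` and its values are made up for illustration and are not results from the dataset.

```python
import pandas as pd

# Toy stand-in for led_group (values are invented for illustration).
toy_group = pd.DataFrame({
    "Status": ["Developed", "Developed", "Developing", "Developing"],
    "Year": [2000, 2001, 2000, 2001],
    "Lifeexpectancy": [76.0, 77.0, 64.0, 66.0],
})

# One row per year, one column per status, then the difference.
wide = toy_group.pivot(index="Year", columns="Status", values="Lifeexpectancy")
wide["Gap"] = wide["Developed"] - wide["Developing"]
print(wide)
```

Applied to the real `led_group`, the `Gap` column would show whether developing countries are catching up over the period.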

##### `FIG.4` <font color='blue'>**LIFESTYLE & DISEASES**</font>

In [None]:

y = led_group["Status"]
fig = make_subplots(rows=3, cols=2)
fig.add_trace(go.Box(x=led_group["Alcohol"], y=y, name='Alcohol', marker_color='#515A5A'), row=1, col=1)

fig.add_trace(go.Box(x=led_group['BMI'], y=y, name='BMI', marker_color='#85929E'),row=2, col=1)

fig.add_trace(go.Box(x=led_group['HIV/AIDS'],y=y,name='HIV/AIDS',marker_color='#A93226'),row=1, col=2)

fig.add_trace(go.Box(x=led_group['Measles'],y=y,name='Measles',marker_color='#E74C3C'),row=2, col=2)

fig.add_trace(go.Box(x=led_group['Schooling'],y=y,name='Schooling',marker_color='darkblue'),row=3, col=1)                        
fig.update_layout(
        autosize=False,
        margin = dict(
                l=0,
                r=0,
                b=0,
                t=0,
                pad=4,
                autoexpand=True
            ),
            width=1000,
            height=400
    )

fig.update_traces(orientation='h') # horizontal box plots
fig.show()

The Alcohol and BMI plots show that, on average, developing countries consume less alcohol and have a lower body mass index than developed countries. This is seen in the sizeable gap between the maximum values for developing countries (Alcohol: 3.9 litres, BMI: 39.6) and the minimum values for developed countries (Alcohol: 9.6 litres, BMI: 45.6).

However, developing countries have more disease cases than developed countries, as shown by the maximum values (HIV/AIDS: 0.1, Measles: 160.3) for developed countries and the minimum values (HIV/AIDS: 0.4, Measles: 149) for developing countries.

Lastly, years of Schooling are higher in developed countries (14 years minimum) than in developing countries (12 years maximum).

##### `FIG.5` <font color='blue'>**MORTALITY RATE & CHILD HEALTH**</font> 

In [None]:
y = led_group["Status"]
fig = make_subplots(rows=3, cols=2)
fig.add_trace(go.Box(x=led_group['infantdeaths'], y=y, name='infantdeaths', marker_color='#F1948A'), row=1, col=1)

fig.add_trace(go.Box(x=led_group['under-fivedeaths'], y=y, name='under-fivedeaths', marker_color='#FF0099'),row=1, col=2)

fig.add_trace(go.Box(x=led_group['thinness1-19years'],y=y,name='thinness1-19years',marker_color='#424949'),row=2, col=1)

fig.add_trace(go.Box(x=led_group['thinness5-9years'],y=y,name='thinness5-9years',marker_color='#7F8C8D'),row=2, col=2)

fig.add_trace(go.Box(x=led_group['Population'],y=y,name='Population',marker_color='#52BE80'),row=3, col=1)

fig.add_trace(go.Box(x=led_group['AdultMortality'],y=y,name='Adult Mortality',marker_color='#78281F'),row=3, col=2)

fig.update_traces(orientation='h') # horizontal box plots

fig.update_layout(
        autosize=False,
        margin = dict(
                l=0,
                r=0,
                b=0,
                t=0,
                pad=4,
                autoexpand=True
            ),
            width=1000,
            height=400
    )
fig.show()

From the above plots, deaths amongst infants and adults are more prevalent in developing countries than in developed countries. Also, thinness among children is more common in developing countries than in developed countries. Population in developing countries appears to fall within a narrow range of 4.5M to 6.1M, compared with the relatively wide spread amongst developed countries, from 2.6M to 7.1M.

##### `FIG.6` <font color='blue'>HEALTH POLICIES</font>

In [None]:
y = led_group["Status"]
fig = make_subplots(rows=3, cols=2)

fig.add_trace(go.Box(x=led_group['HepatitisB'],y=y,name='HepatitisB Immunization',marker_color='yellowgreen'),row=1, col=1)
fig.add_trace(go.Box(x=led_group['Polio'],y=y,name='Polio Immunization',marker_color='#C9CC3F'),row=2, col=1)
fig.add_trace(go.Box(x=led_group['Diphtheria'],y=y,name='Diphtheria Immunization',marker_color='#DFFF00'),row=3, col=1)
fig.add_trace(go.Box(x=led_group['Totalexpenditure'], y=y, name='Total Expenditure', marker_color='lightslategray'), row=1, col=2)
fig.add_trace(go.Box(x=led_group['Incomecompositionofresources'], y=y, name='Income composition of resources', marker_color='grey'),row=2, col=2)
fig.add_trace(go.Box(x=led_group['GDP'], y=y, name='GDP', marker_color='black'),row=3, col=2)

fig.update_traces(orientation='h') # horizontal box plots

fig.update_layout(
        autosize=False,
        margin = dict(
                l=0,
                r=0,
                b=0,
                t=0,
                pad=4,
                autoexpand=True
            ),
            width=1100,
            height=400
    )
fig.show()

It can be observed that immunization coverage for the three diseases is generally higher in developed countries (Hepatitis B: 86%, Polio: 96%, Diphtheria: 96% max values) than in developing countries (Hepatitis B: 84%, Polio: 86%, Diphtheria: 96% max values).

Also, the highest percentage of government expenditure on health for developing countries is 6.1%, compared to 8.4% for developed countries. Income composition of resources stands at a maximum of 0.65 for developing countries and 0.87 for developed countries. Gross domestic product is also lower for developing countries than for developed countries on average.

##### `FIG.7` <font color='blue'>LIFESTYLE AND DISEASES' RELATIONSHIP WITH LIFE EXPECTANCY</font>

In [None]:
p=so.Plot(led, y='Lifeexpectancy', color="Status").pair(x=['BMI', 'Alcohol','HIV/AIDS','Measles','Schooling']).add(so.Dots()).layout(size=(20, 8))
p.add(so.Line(color=".5", linewidth=2), so.PolyFit())

While there appears to be a weak relationship between life expectancy and BMI for developed countries, there is a strong positive relationship for developing countries. On the other hand, the relationship with Alcohol seems to be weakly negative for developed countries but weakly positive for developing countries. HIV/AIDS and Measles appear to have negligible records in developed countries, so no relevant relationship is seen. For developing countries, however, there is a very noticeable negative relationship.
Lastly, Schooling appears to have a positive relationship with life expectancy for both developed and developing nations.

##### `FIG.8` <font color='blue'>IMMUNIZATION POLICIES' RELATIONSHIP WITH LIFE EXPECTANCY</font>

In [None]:
p=so.Plot(led, y='Lifeexpectancy', color="Status").pair(x=['HepatitisB','Polio','Diphtheria','Totalexpenditure']).add(so.Dots()).layout(size=(15, 6))
p.add(so.Line(color=".5", linewidth=2), so.PolyFit())

It is clear from the plots that health policies in favor of immunization and government expenditure on health have positive relationships with life expectancy for developing countries. On the other hand, the relationship between these health policies and life expectancy is not as strong for developed countries as it is for developing countries.

# BUILD MODEL

# STEP 1: DATA PREPROCESSING

This is an iterative process where the data is inspected and anomalies are treated before modelling. At this stage, variables that can lead to data leakage, multicollinearity, high or low cardinality, or missing values are treated before modelling begins. The final variables are shown with heatmaps.

Leaky variables are variables that are realized after the target variable is created or updated. In this case, there are no such variables in the data. Multicollinearity occurs when two or more independent variables are correlated with each other. To treat it, independent variables whose correlation with another independent variable is greater than 0.5 are removed iteratively, based on their relevance to the dependent variable and on how many other variables they are correlated with.
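
The 0.5 rule described above can be screened for mechanically before deciding which variable of each pair to drop. The following is a minimal sketch on synthetic data. The helper `correlated_pairs` and the columns `a`, `b`, `c` are hypothetical names, and the choice of which variable to remove still relies on judgement about relevance to the target.

```python
import numpy as np
import pandas as pd

def correlated_pairs(frame, threshold=0.5):
    """Return predictor pairs whose absolute Pearson correlation exceeds threshold."""
    corr = frame.corr().abs()
    # Keep only the upper triangle so each pair is listed once.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    return pairs[pairs > threshold].sort_values(ascending=False)

# Synthetic predictors: b is almost a copy of a, c is independent noise.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "b": a + 0.1 * rng.normal(size=200),
    "c": rng.normal(size=200),
})

high = correlated_pairs(toy, threshold=0.5)
print(high)
```

On the real data, the same helper would be applied to the numeric predictors only, with `Lifeexpectancy` excluded.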

In addition, countries represented by only one row are also taken out. Lastly, outliers are winsorized depending on whether or not they make sense for the variable in question. This is because some outliers are true values and have to be captured by the model.
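
Both steps can be illustrated on a toy frame. The sketch below derives the single-row countries from the data rather than from fixed row numbers, which is an alternative to the hard-coded indices used in this notebook and not the author's method. It then caps the upper tail with `winsorize` using the same `limits=[0, 0.2]`. All values are made up.

```python
import numpy as np
import pandas as pd
from scipy.stats.mstats import winsorize

toy = pd.DataFrame({
    "Country": ["A", "A", "A", "B", "C", "C"],
    "percentageexpenditure": [10.0, 12.0, 11.0, 9.0, 13.0, 500.0],
})

# Keep only countries that appear in more than one row.
counts = toy["Country"].value_counts()
kept = toy[toy["Country"].isin(counts[counts > 1].index)].copy()

# limits=[0, 0.2] leaves the lower tail alone and replaces the top 20%
# of values with the largest value below that cut.
kept["percentageexpenditure"] = np.asarray(
    winsorize(kept["percentageexpenditure"].to_numpy(), limits=[0, 0.2])
)
print(kept)
```

Here country `B` is dropped and the extreme value 500 is pulled down to 13, the next largest observation.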

In [None]:
filepath=("/kaggle/input/life-expectancy-who/led.csv")
def wrangle(path):
    
     # loading data 
    led=pd.read_csv(path) 
    
     #dropping columns with high missing data
    led.drop(columns= ['HepatitisB', 'Population', 'GDP'], inplace=True) 

     # dropping multicollinear columns 
    led.drop(columns=['BMI','thinness5-9years', 'Schooling', "Polio","under-fivedeaths", "Measles", "infantdeaths", "HIV/AIDS", "Incomecompositionofresources" ], inplace=True)
    
    #dropping rows with only one representation for country
    led.drop([624, 769,1650,1715, 1812, 1909, 1958,2167, 2216, 2713], axis=0, inplace=True)
    
    #dealing with outliers
    led['percentageexpenditure'] = winsorize(led['percentageexpenditure'] , limits=[0, 0.2])

    #led['Incomecompositionofresources'] = winsorize(led['Incomecompositionofresources'] , limits=[0.05, 0])
    
    return led

df=wrangle(filepath)
df.head()

##### `FIG.9` <font color='blue'>CORRELATION WITH TARGET VARIABLE AND INSPECTING ANOMALIES</font>

In [None]:
plt.figure(figsize=(50,20))
sn.set(font_scale=2)

led_num = df.select_dtypes(include="number")  # numeric columns only (public API)
corrMatrix1 = led_num.corr()

plt.subplot(1,2,1)
sn.heatmap(corrMatrix1[['Lifeexpectancy']].sort_values(by='Lifeexpectancy', ascending=False), vmin=-1, vmax=1, annot=True, cmap='RdBu', cbar=True)

corrMatrix2 = led_num.drop(columns='Lifeexpectancy').corr()
plt.subplot(1,2,2)
sn.heatmap(corrMatrix2, vmin=-1, vmax=1, annot=True, cmap='RdBu',cbar=False)
plt.show()

In [None]:
led_num.describe()

In [None]:
for column in led_num:
        plt.figure(figsize=(12,1))
        sn.boxplot(data=led_num, x=column)

In [None]:
def missing_values(data):
    """Function that checks for null values and computes the percentage of null values
    Args:
        data: loaded dataframe
    Return:
        dataframe: dataframe of total null values with corresponding percentages
    """
    total = data.isnull().sum().sort_values(ascending=False)   # null count per column, largest first
    percentage = round((total / data.shape[0]) * 100, 2)
    
    return pd.concat([total, percentage], axis=1, keys=['Total','Percentage'])


missing_values(df)

#### <font color='blue'>LINEARITY BETWEEN DEPENDENT AND INDEPENDENT VARIABLES</font>

From the pair plot, it can be seen that the linear relationship between Life Expectancy and the predictors is not very strong for all variables. As such, linear regression may not perform very well. Also, the target variable appears to be negatively skewed, and therefore the assumption of a normally distributed dependent variable does not hold.
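
Skewness can also be checked numerically rather than by eye. The sketch below uses `scipy.stats.skew` on a synthetic left-skewed sample standing in for the target. The sample is not the real data, and the bound of 90 is an arbitrary choice. Note that for ordinary least squares the normality assumption strictly concerns the residuals, so a skewed target is a warning sign rather than a proof of violation.

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(0)

# Synthetic left-skewed sample: an upper bound of 90 minus an
# exponential tail that stretches towards lower values.
sample = 90 - rng.exponential(scale=8, size=1000)

# A negative value indicates a longer tail towards low values.
sample_skew = float(skew(sample))
print(round(sample_skew, 2))
```

On the notebook's data the equivalent call would be `skew(df["Lifeexpectancy"].dropna())`.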

In [None]:
sn.set(font_scale=1)
sn.pairplot(df, dropna=True, hue="Status");

**Structure of Final Data**

In [None]:

#inspect data
print(df.info())

print("------------------------------------------------------------")
print("count of categories in the categorical Variables")
print(df.select_dtypes("object").nunique())


# STEP 2: DATA SPLITTING

Here, the data is split into target and predictors for modelling. The target variable is "Life Expectancy" and there are 9 predictors after preprocessing. The structure of the split data is as follows:

In [None]:
y=df["Lifeexpectancy"] #Target variable
X=df.drop(columns="Lifeexpectancy")  #predictors
print("The shape of the target variable:", y.shape)
print(y.head(3))

print("------------------------------------------------------------")
print("The shape of the predictors is:", X.shape, "and they are;")
print(X.columns)

**Below are the libraries I will be using here:**

In [None]:
from sklearn.model_selection import train_test_split
from category_encoders import OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.metrics import mean_absolute_error
from sklearn.pipeline import make_pipeline
from sklearn.metrics import r2_score
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet
from sklearn.tree import DecisionTreeRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor, AdaBoostRegressor
from xgboost import XGBRegressor
from sklearn.model_selection import GridSearchCV

**This is the structure after splitting the data into Training, Validation and Testing sets:**

In [None]:
X_split, X_val, y_split, y_val= train_test_split(X,y, test_size=0.2, random_state= 42)
X_train, X_test, y_train, y_test= train_test_split(X_split,y_split, test_size=0.2, random_state= 42)

print("Structure of the training set")
print(X_train.info())
print(y_train.shape)
print("------------------------------")
print("Description of the Validation Set")
print(X_val.info())
print(y_val.shape)
print("------------------------------")
print("Structure of the test set")
print(X_test.info())
print(y_test.shape)
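
Because the second split is applied to what remains after the first, the resulting shares are 64% training, 20% validation and 16% test, not 60/20/20. A quick sketch on a toy array of 1000 rows (the `_toy` names are placeholders) makes the arithmetic visible:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X_toy = np.arange(1000).reshape(-1, 1)
y_toy = np.arange(1000)

# First split: 20% of all rows are held out as the validation set.
X_rest, X_val_toy, y_rest, y_val_toy = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42
)
# Second split: 20% of the remaining 80% (16% overall) becomes the test set.
X_train_toy, X_test_toy, y_train_toy, y_test_toy = train_test_split(
    X_rest, y_rest, test_size=0.2, random_state=42
)

print(len(X_train_toy), len(X_val_toy), len(X_test_toy))  # 640 200 160
```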

# STEP 3: BASELINE MODEL

**The baseline model, which will be the reference point for model performance, gave the following results:**

In [None]:
#define model and fit to training set
base_model=make_pipeline(OneHotEncoder(use_cat_names=True),
                        SimpleImputer(strategy="mean"),
                        LinearRegression()).fit(X_train, y_train)

#make prediction with validation set
y_val_pred=pd.Series(base_model.predict(X_val))

#evaluate base model
base_mae= mean_absolute_error(y_val,y_val_pred)

base_Rsquared= r2_score(y_val, y_val_pred)

#print results
print("The mean absolute error for the base model is:", base_mae)
print("The R-squared score for the base model is:", base_Rsquared)
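
For readers less familiar with the two metrics, the sketch below computes them by hand on four made-up values and checks the result against scikit-learn. MAE is the average absolute error in years, and R-squared is one minus the ratio of the residual sum of squares to the total sum of squares around the mean.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, r2_score

# Made-up true and predicted life expectancies (years).
y_true = np.array([70.0, 65.0, 80.0, 75.0])
y_pred = np.array([68.0, 66.0, 79.0, 78.0])

# Mean absolute error: average size of the miss.
mae_manual = np.mean(np.abs(y_true - y_pred))

# R-squared: share of the variance around the mean that is explained.
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(mae_manual, r2_manual)
print(mean_absolute_error(y_true, y_pred), r2_score(y_true, y_pred))
```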

# STEP 4: ITERATE

**MODEL SELECTION**

In [None]:
#define regression models and group them into a list
lin_models=[LinearRegression(), Ridge(), Lasso(), ElasticNet()]
ens_models=[RandomForestRegressor(), ExtraTreesRegressor(), AdaBoostRegressor()]
other_models=[ DecisionTreeRegressor(), KNeighborsRegressor(),XGBRegressor() ]

In [None]:
#define a function to build the models
def build_model(model, X_train, y_train, X_val,y_val):
    
    #create model pipeline
    model= make_pipeline(OneHotEncoder(use_cat_names=True),
                        SimpleImputer(strategy="mean"),
                        model).fit(X_train, y_train)
    #make prediction
    prediction=pd.Series(model.predict(X_val))
    #evaluate the predictions
    mae=mean_absolute_error(y_val, prediction)
    Rsquared= r2_score(y_val, prediction)
    
    #return the results
    return mae, Rsquared


**The performance of the candidate models is as follows:**

In [None]:
for model in lin_models:
    results=list(build_model(model, X_train, y_train, X_val,y_val))
    print(f"'{model}' Rsquared score is {results[1]} and MAE is {results[0]}")
print("_____________________________________________________________________________")
    
for model in ens_models:
    results=list(build_model(model, X_train, y_train, X_val,y_val))
    print(f"'{model}' Rsquared score is {results[1]} and MAE is {results[0]}")
print("_____________________________________________________________________________") 

for model in other_models :
    results=list(build_model(model, X_train, y_train, X_val,y_val))
    print(f"'{model}' Rsquared score is {results[1]} and MAE is {results[0]}")
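
Printed lines are hard to compare across ten models. One option, sketched here on synthetic data with three scikit-learn regressors, is to collect the scores in a DataFrame and sort it. The data comes from `make_regression`, the names `rows` and `summary` are illustrative, and the encoder and imputer steps of the notebook's pipeline are left out because the synthetic features are already numeric and complete.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic regression problem standing in for the real predictors.
X_syn, y_syn = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=42)
X_tr, X_va, y_tr, y_va = train_test_split(X_syn, y_syn, test_size=0.2, random_state=42)

rows = []
for candidate in [LinearRegression(), Ridge(), ExtraTreesRegressor(n_estimators=50, random_state=42)]:
    pred = candidate.fit(X_tr, y_tr).predict(X_va)
    rows.append({
        "model": type(candidate).__name__,
        "MAE": mean_absolute_error(y_va, pred),
        "R2": r2_score(y_va, pred),
    })

# Lowest validation MAE first.
summary = pd.DataFrame(rows).sort_values("MAE").reset_index(drop=True)
print(summary)
```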
    

**PARAMETER TUNING**

**After iterating through several models, the best performing model was the Extra Trees regressor, which beats the base model's mean absolute error and R-squared scores. Here, the hyperparameters of the best model are tuned to find the most appropriate value. The hyperparameter of interest is max_depth. The results show that a max depth of 50 improves on the initial results. It is also not computationally expensive.**

In [None]:
max_depth = [25,50,75,100]

for d in max_depth:
    best_model= make_pipeline(OneHotEncoder(use_cat_names=True),
                        SimpleImputer(strategy="mean"),
                        ExtraTreesRegressor(random_state=42, max_depth=d)).fit(X_train, y_train)
    #make prediction
    prediction=pd.Series(best_model.predict(X_val))
    #evaluate the predictions
    mae=mean_absolute_error(y_val, prediction)
    Rsquared= r2_score(y_val, prediction)
    print(f"mae: {mae}, Rsquared: {Rsquared}, max_depth: {d}")
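
`GridSearchCV` is imported earlier in the notebook but not used. As a hedged alternative to the manual loop, the sketch below runs a cross-validated search over `max_depth` on synthetic data. This swaps in cross-validation for the single validation set used in this notebook, and the grid values and fold count are arbitrary choices for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic data standing in for the training set.
X_syn, y_syn = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)

search = GridSearchCV(
    ExtraTreesRegressor(n_estimators=50, random_state=42),
    param_grid={"max_depth": [5, 10, 25]},
    scoring="neg_mean_absolute_error",  # scikit-learn maximizes, so MAE is negated
    cv=3,
)
search.fit(X_syn, y_syn)

best_depth = search.best_params_["max_depth"]
best_mae = -search.best_score_
print(best_depth, best_mae)
```

With the notebook's pipeline, the parameter would be addressed as `extratreesregressor__max_depth` in the grid.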

**The best value found for the model is a max depth of 50; as such, the final model is defined with this parameter.**

In [None]:
best_model=make_pipeline(OneHotEncoder(use_cat_names=True),
                        SimpleImputer(strategy="mean"),
                        ExtraTreesRegressor(random_state=42, max_depth=50)).fit(X_train, y_train)

In [None]:
prediction=pd.Series(best_model.predict(X_val))
#evaluate the predictions
mae=mean_absolute_error(y_val, prediction)
Rsquared= r2_score(y_val, prediction)
print(f"mae: {mae}, Rsquared: {Rsquared}")

In [None]:
#plot to visually see model performance
plt.scatter(y_val, prediction)
plt.xlabel("y_value")
plt.ylabel("predicted");

# STEP 5: EVALUATE

In [None]:
prediction=pd.Series(best_model.predict(X_test))
#evaluate the predictions
mae=mean_absolute_error(y_test, prediction)
Rsquared= r2_score(y_test, prediction)
print(f"mae: {mae}, Rsquared: {Rsquared}")

In [None]:
plt.scatter(y_test, prediction)
plt.xlabel("y_value")
plt.ylabel("predicted");

# STEP 6: COMMUNICATE

**Feature Importances**

As shown below, the most influential features are Adult Mortality, Status (Developed), Diphtheria immunization, thinness among people aged 19 years and below, and percentage expenditure on health.

In [None]:
# feature importances of the fitted ExtraTreesRegressor (these are not regression coefficients)
importances =  best_model.named_steps['extratreesregressor'].feature_importances_
features = best_model.named_steps["onehotencoder"].get_feature_names_out()
feat_imp = pd.Series(importances, index=features)
feat_imp.sort_values().tail(15).plot(kind="barh")
plt.title("Importance of Top 15 Features")
plt.xlabel("Importance")
plt.ylabel("Predictor");

In [None]:
# feat_imp must be defined (cell above) before it is inspected
feat_imp.sort_values().tail(15)
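
Impurity-based importances from tree ensembles can favour high-cardinality features, so a cross-check is sometimes useful. The sketch below uses scikit-learn's `permutation_importance` on synthetic data. It is a different technique from the one used in this notebook, and the feature names are placeholders.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: only two of the four features carry signal.
X_syn, y_syn = make_regression(
    n_samples=300, n_features=4, n_informative=2, noise=5.0, random_state=42
)
names = [f"feature_{i}" for i in range(4)]
X_tr, X_va, y_tr, y_va = train_test_split(X_syn, y_syn, test_size=0.2, random_state=42)

model = ExtraTreesRegressor(n_estimators=50, random_state=42).fit(X_tr, y_tr)

# Shuffle each feature on held-out data and record the drop in score.
result = permutation_importance(model, X_va, y_va, n_repeats=5, random_state=42)
perm_imp = pd.Series(result.importances_mean, index=names).sort_values()
print(perm_imp)
```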

# **REFERENCES**

`1` [Bezy, Judith Marie. "life expectancy". Encyclopedia Britannica, 11 Nov. 2021, https://www.britannica.com/science/life-expectancy. Accessed 18 May 2022.]<a id='ref1'></a>

`2` [WHO. "Life expectancy at birth (years)". Indicator Metadata Registry List, https://www.who.int/data/gho/indicator-metadata-registry/imr-details/65. Accessed 18 May 2022.]<a id='ref2'></a>

`3` [Meyer, A.C., Drefahl, S., Ahlbom, A. et al. Trends in life expectancy: did the gap between the healthy and the ill widen or close?. BMC Med 18, 41 (2020). https://doi.org/10.1186/s12916-020-01514-z]<a id='ref3'></a>

`4` [Hao, L., Xu, X., Dupre, M.E. et al. Adequate access to healthcare and added life expectancy among older adults in China. BMC Geriatr 20, 129 (2020). https://doi.org/10.1186/s12877-020-01524-9]<a id='ref4'></a>

`5` [United Nations Department of Economic and Social Affairs. World population ageing 2019. New York: United Nations; 2020.]<a id='ref5'></a>

`6` [Wang H, Abbas KM, Abbasifard M, Abbasi-Kangevari M, Abbastabar H, Abd-Allah F, et al. Global age-sex-specific fertility, mortality, healthy life expectancy (HALE), and population estimates in 204 countries and territories, 1950–2019: a comprehensive demographic analysis for the Global Burden of Disease Study 2019. Lancet. 2020;396(10258):1160–203.]<a id='ref6'></a>

`7` [Ziring, Steve. "Markdown table for descriptions". Kaggle, 2020, https://www.kaggle.com/datasets/kumarajarshi/life-expectancy-who/discussion/161876. Accessed 07 June 2022.]<a id='ref7'></a>
