# Introduction/Project Overview:

In this notebook, I will go over the World Health Organization Life Expectancy dataset. I found a version on Kaggle that had been updated and had a higher usability score, so I chose to use that. It is linked here if you want to read more about it: [dataset](https://www.kaggle.com/datasets/lashagoch/life-expectancy-who-updated). We are given data on each country's population, health issues, life expectancy, etc. Our goal is to create a model that can accurately predict life expectancy given some characteristics of a population. Throughout this notebook I will visualize the data, explain some data preprocessing techniques, construct and evaluate models, and analyze the results.

### Data Exploration & Preprocessing:
I will go over the dataset, analyzing its various features, checking for missing values, and gaining insights into the distribution of variables. Prior to building the models, I will preprocess the data by handling missing values, encoding categorical variables, and scaling numerical features to ensure good model performance.
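
As a rough sketch of the three preprocessing steps named above, here is a toy example. The frame `toy` and its values are made up for illustration and do not come from the dataset, and the median fill and z-score scaling are just one reasonable choice, not necessarily what is used later in the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame (not the real dataset): one categorical column
# and one numeric column with a missing value.
toy = pd.DataFrame({
    "region": ["Asia", "Africa", "Asia", "Africa"],
    "gdp_per_capita": [1000.0, np.nan, 3000.0, 2000.0],
})

# 1) Handle missing values: fill numeric gaps with the column median.
toy["gdp_per_capita"] = toy["gdp_per_capita"].fillna(toy["gdp_per_capita"].median())

# 2) Encode categorical variables: one-hot encode the region column.
encoded = pd.get_dummies(toy, columns=["region"])

# 3) Scale numerical features: standardize to zero mean and unit variance.
col = encoded["gdp_per_capita"]
encoded["gdp_per_capita"] = (col - col.mean()) / col.std(ddof=0)

print(encoded)
```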

### Model Building & Evaluation:
In this notebook I will implement a couple of models to see which ones are more accurate at predicting life expectancy. This is a supervised learning task since we are given the life expectancy of these populations. Additionally, this is a regression task because we are trying to predict a numerical value rather than put something into a category. For this notebook the models I chose are multiple linear regression, polynomial regression and a neural network. Lastly, I will interpret the results of each model.
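
To make the difference between the linear and polynomial regression models concrete, here is a minimal sketch on synthetic data. The data, the helper `fit_poly`, and the use of plain NumPy least squares are assumptions for illustration only. The error is measured on the training points for brevity, whereas a real comparison should use a held-out test set.

```python
import numpy as np

# Synthetic data (an assumption, not the WHO dataset): a curved relationship
# plus a little noise.
rng = np.random.default_rng(0)
x = np.linspace(0, 4, 50)
y = 60 + 2 * x + 1.5 * x**2 + rng.normal(0, 0.5, size=x.size)

def fit_poly(x, y, degree):
    """Least-squares fit of a polynomial of the given degree; returns (coef, rmse)."""
    # Design matrix with columns 1, x, x^2, ..., x^degree.
    X = np.vander(x, degree + 1, increasing=True)
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    pred = X @ coef
    rmse = np.sqrt(np.mean((y - pred) ** 2))
    return coef, rmse

_, rmse_linear = fit_poly(x, y, degree=1)      # straight line
_, rmse_quadratic = fit_poly(x, y, degree=2)   # adds a squared term

print(f"linear RMSE:    {rmse_linear:.3f}")
print(f"quadratic RMSE: {rmse_quadratic:.3f}")
```

On data with real curvature, the polynomial fit should show a lower error than the straight line, which is the motivation for trying both.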


### Conclusion:
Finally, I will discuss potential areas for model improvement, what stood out to me, and what some of the challenges were. The conclusion serves more as a reflection on my time working on this notebook. This will be a good way for me to keep learning and testing my skills. Let's get to it!

In [27]:
import pandas as pd
import numpy as np
import seaborn as sns 
import matplotlib.pyplot as plt

# Data Exploration & Preprocessing:
As mentioned earlier, I got the dataset from Kaggle. The link can be found above in the project overview. The download came with a CSV file. Since I have it locally on my computer, I can easily access the data as shown below. One of the first steps before creating a model is to see what our data looks like.

In [28]:
data = pd.read_csv('./data.csv')

Let's take a look at the first few entries in our dataset and what columns are in it.

In [29]:
data.head(5)

Unnamed: 0,Country,Region,Year,Infant_deaths,Under_five_deaths,Adult_mortality,Alcohol_consumption,Hepatitis_B,Measles,BMI,...,Diphtheria,Incidents_HIV,GDP_per_capita,Population_mln,Thinness_ten_nineteen_years,Thinness_five_nine_years,Schooling,Economy_status_Developed,Economy_status_Developing,Life_expectancy
0,Turkiye,Middle East,2015,11.1,13.0,105.824,1.32,97,65,27.8,...,97,0.08,11006,78.53,4.9,4.8,7.8,0,1,76.5
1,Spain,European Union,2015,2.7,3.3,57.9025,10.35,97,94,26.0,...,97,0.09,25742,46.44,0.6,0.5,9.7,1,0,82.8
2,India,Asia,2007,51.5,67.9,201.0765,1.57,60,35,21.2,...,64,0.13,1076,1183.21,27.1,28.0,5.0,0,1,65.4
3,Guyana,South America,2006,32.8,40.5,222.1965,5.68,93,74,25.3,...,93,0.79,4146,0.75,5.7,5.5,7.9,0,1,67.0
4,Costa Rica,Central America and Caribbean,2006,9.8,11.2,95.22,4.19,88,86,26.4,...,89,0.16,9110,4.35,2.0,1.9,7.9,0,1,78.2


In [30]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2848 entries, 0 to 2847
Data columns (total 21 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   Country                      2848 non-null   object 
 1   Region                       2848 non-null   object 
 2   Year                         2848 non-null   int64  
 3   Infant_deaths                2848 non-null   float64
 4   Under_five_deaths            2848 non-null   float64
 5   Adult_mortality              2848 non-null   float64
 6   Alcohol_consumption          2848 non-null   float64
 7   Hepatitis_B                  2848 non-null   int64  
 8   Measles                      2848 non-null   int64  
 9   BMI                          2848 non-null   float64
 10  Polio                        2848 non-null   int64  
 11  Diphtheria                   2848 non-null   int64  
 12  Incidents_HIV                2848 non-null   float64
 13  GDP_per_capita    

So we have 21 columns in total: two categorical columns (Country and Region) and 19 numeric columns. Every column listed in the output above reports 2848 non-null values, so none of them has missing entries; we will still check for missing values explicitly later. The column names are pretty straightforward, but if you would like to read more on them there are explanations on Kaggle.
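
Counts like these are easy to get wrong by eye, so here is a small sketch of checking them programmatically. The frame `toy` is a hypothetical stand-in with made-up values; the same calls could be run on `data`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for `data` (values are made up).
toy = pd.DataFrame({
    "Country": ["A", "B", "C"],
    "Region": ["X", "X", "Y"],
    "Year": [2000, 2001, 2002],
    "Life_expectancy": [70.1, np.nan, 65.3],
})

n_columns = toy.shape[1]

# Split columns by kind: numeric versus everything else.
numeric_cols = toy.select_dtypes(include=["number"]).columns.tolist()
categorical_cols = toy.select_dtypes(exclude=["number"]).columns.tolist()

# Number of missing values in each column.
missing_per_column = toy.isnull().sum()

print(n_columns, numeric_cols, categorical_cols)
print(missing_per_column)
```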

Let's see what other information we can get at a glance from our data.

In [31]:
data.describe()

Unnamed: 0,Year,Infant_deaths,Under_five_deaths,Adult_mortality,Alcohol_consumption,Hepatitis_B,Measles,BMI,Polio,Diphtheria,Incidents_HIV,GDP_per_capita,Population_mln,Thinness_ten_nineteen_years,Thinness_five_nine_years,Schooling,Economy_status_Developed,Economy_status_Developing,Life_expectancy
count,2848.0,2848.0,2848.0,2848.0,2848.0,2848.0,2848.0,2848.0,2848.0,2848.0,2848.0,2848.0,2848.0,2848.0,2848.0,2848.0,2848.0,2848.0,2848.0
mean,2007.5,30.510955,43.150456,192.962955,4.833032,84.21875,77.250351,25.023455,86.45611,86.227879,0.898862,11430.386938,36.840976,4.886763,4.921173,7.604951,0.202247,0.797753,68.78901
std,4.610582,27.545025,44.604702,114.837818,3.98979,16.00973,18.667562,2.19626,15.111162,15.566384,2.387287,16916.391089,136.85102,4.441881,4.528915,3.159514,0.401746,0.401746,9.388885
min,2000.0,1.8,2.3,49.384,0.0,12.0,10.0,19.8,8.0,16.0,0.01,148.0,0.08,0.1,0.1,1.1,0.0,0.0,39.4
25%,2003.75,8.2,9.8,108.346,1.1975,78.0,64.0,23.2,81.0,81.0,0.08,1404.25,2.07,1.6,1.6,5.0,0.0,1.0,62.675
50%,2007.5,19.8,23.2,164.42225,4.055,89.0,83.0,25.5,93.0,93.0,0.16,4169.0,7.925,3.4,3.4,7.8,0.0,1.0,71.3
75%,2011.25,47.725,66.325,248.09975,7.81,96.0,93.0,26.4,97.0,97.0,0.47,12255.75,23.85,7.2,7.3,10.3,0.0,1.0,75.4
max,2015.0,138.1,224.9,719.3605,17.87,99.0,99.0,32.1,99.0,99.0,21.68,112418.0,1379.86,27.7,28.6,14.1,1.0,1.0,83.8


From the table we can see that the minimum life expectancy is 39.4 while the max is 83.8. The max adult mortality is about 719 per thousand population, while the min is about 49. The average alcohol consumption in liters per capita (about 4.8) is a lot lower than I expected, and the min of 0.0 makes sense, as some countries might have religious restrictions. It is also very disturbing to see the low minimum immunization coverage for some of the diseases (as low as 8 for polio and 10 for measles), especially when the averages are fairly high (roughly 77 to 86). Overall this table gives us some interesting information on the distribution of the data.

Below I am changing the column names to lower case because the original formatting gave me issues.

In [32]:
data.columns = data.columns.str.lower()

Now we are selecting the numerical and categorical columns so we can do some light data exploration.

In [33]:
numerical = data.select_dtypes(include=['float64', 'int64']).columns

In [34]:
categorical = data.select_dtypes(include=['object']).columns

Let's start by checking the differences between each region.

In [45]:
pd.pivot_table(data, index='region', values=numerical)

region,adult_mortality,alcohol_consumption,bmi,diphtheria,economy_status_developed,economy_status_developing,gdp_per_capita,hepatitis_b,incidents_hiv,infant_deaths,life_expectancy,measles,polio,population_mln,schooling,thinness_five_nine_years,thinness_ten_nineteen_years,under_five_deaths,year
Africa,319.073434,2.980694,23.249265,75.992647,0.0,1.0,2314.009804,78.025735,2.702132,60.604044,57.847304,67.625,76.205882,18.438395,4.59326,7.199142,7.270343,94.025,2007.5
Asia,171.166565,2.4017,23.412269,87.94213,0.037037,0.962963,6560.958333,84.402778,0.139884,32.536111,69.454861,73.314815,88.194444,136.131782,7.375231,9.283796,8.833333,40.965278,2007.5
Central America and Caribbean,163.581169,5.623224,26.274671,90.878289,0.0,1.0,8387.8125,89.980263,0.419375,20.965461,72.4375,82.644737,91.049342,4.052237,7.808553,3.269737,3.304605,25.932237,2007.5
European Union,104.395392,10.791319,26.021528,95.678241,1.0,0.0,28140.368056,89.208333,0.073912,4.872222,77.715046,88.534722,95.530093,16.210046,11.064815,1.412037,1.408102,5.858333,2007.5
Middle East,114.658666,0.747702,26.94375,89.274038,0.0,1.0,18014.25,87.721154,0.08976,17.598558,73.450962,82.956731,89.697115,20.288077,7.348558,6.109135,6.232692,21.633173,2007.5
North America,100.693844,7.053125,27.429167,93.729167,0.666667,0.333333,33783.854167,73.291667,0.127917,9.925,77.841667,87.833333,92.854167,148.473958,11.035417,0.904167,0.989583,11.647917,2007.5
Oceania,162.886801,2.880239,27.203977,80.556818,0.181818,0.818182,10146.971591,80.664773,0.155284,26.305114,69.517614,70.971591,81.829545,3.230568,7.876705,1.922159,1.8625,32.49375,2007.5
Rest of Europe,138.582856,7.580083,25.99125,92.341667,0.266667,0.733333,19090.15,87.366667,0.107292,9.627917,74.525417,89.954167,92.954167,20.065833,10.775417,1.99,1.960833,11.179583,2007.5
South America,155.275326,5.745677,26.071875,89.625,0.0,1.0,7232.03125,88.036458,0.298333,19.953646,72.780729,74.130208,89.427083,31.78526,8.022396,2.064583,2.154688,23.938021,2007.5


Above we can see that there are very big differences between regions, and these largely follow the split between developed and developing countries. Regions made up of developing countries have higher mortality rates and more health issues overall. While the difference in immunization coverage between developed and developing regions is not huge, developing regions still have less of it, which is important to note. Now let's take a look at the same data by country.

In [44]:
pd.pivot_table(data, index='country', values=numerical)

country,adult_mortality,alcohol_consumption,bmi,diphtheria,economy_status_developed,economy_status_developing,gdp_per_capita,hepatitis_b,incidents_hiv,infant_deaths,life_expectancy,measles,polio,population_mln,schooling,thinness_five_nine_years,thinness_ten_nineteen_years,under_five_deaths,year
Afghanistan,265.804969,0.016125,22.46250,55.1250,0.0,1.0,408.5625,64.5625,0.022500,71.08125,59.65625,24.3750,55.3750,27.450625,2.90000,15.58125,16.58125,98.61250,2007.5
Albania,83.132969,4.696875,25.85625,98.0625,0.0,1.0,3071.1250,98.0000,0.025625,15.25625,75.95000,95.9375,98.1250,2.969375,9.24375,1.70000,1.61875,17.14375,2007.5
Algeria,113.439281,0.400625,24.86875,91.8750,0.0,1.0,3745.1250,88.3125,0.021875,26.75625,73.78750,93.2500,91.7500,34.820625,6.99375,5.97500,6.09375,31.19375,2007.5
Angola,297.844063,4.935625,22.51875,55.5625,0.0,1.0,2647.8125,68.8125,1.303750,88.76875,52.82500,64.0000,35.7500,21.623750,4.60625,6.66875,6.19375,144.16250,2007.5
Antigua and Barbuda,142.478813,7.755000,25.85000,98.3125,0.0,1.0,14678.7500,98.2500,0.125000,9.47500,75.35000,75.4375,96.9375,0.085000,9.01875,3.37500,3.42500,11.51875,2007.5
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
"Venezuela, RB",151.056156,7.458125,26.49375,75.2500,0.0,1.0,9294.8750,72.4375,0.400000,15.85000,72.79375,83.0000,79.1875,27.392500,8.23750,1.56250,1.65000,18.50000,2007.5
Vietnam,133.054406,1.742500,20.76250,91.7500,0.0,1.0,1807.8750,87.4375,0.196875,19.50000,74.33125,65.0000,94.9375,86.032500,6.90000,15.62500,14.92500,24.47500,2007.5
"Yemen, Rep.",239.653937,0.047000,23.02500,72.6250,0.0,1.0,2406.8125,60.1875,0.025000,52.44375,64.16250,95.0000,71.6250,21.730625,2.22500,13.75000,13.83125,69.57500,2007.5
Zambia,452.761219,2.903125,22.00000,83.8125,0.0,1.0,1059.8750,82.8750,6.868125,60.58125,52.33125,39.1875,82.9375,12.862500,6.43750,6.76250,6.88125,99.20625,2007.5


I realized that there are way too many countries to analyze one by one. I did not know that there were this many in the dataset. Regardless, we can see that all these countries have distinct values. It would be interesting to see if we could plot this.
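
One way to keep a per-country view manageable is to look only at the extremes. The sketch below is an assumption about how that could be done, using a made-up frame `toy` in place of `data`: it averages life expectancy per country and keeps the highest and lowest few.

```python
import pandas as pd

# Hypothetical stand-in for `data` (two years for each of four made-up countries).
toy = pd.DataFrame({
    "country": ["A", "A", "B", "B", "C", "C", "D", "D"],
    "life_expectancy": [50.0, 52.0, 80.0, 82.0, 70.0, 72.0, 60.0, 62.0],
})

# Average life expectancy per country over all years.
country_means = toy.groupby("country")["life_expectancy"].mean()

# Keep only the extremes so a table or bar chart stays readable.
top = country_means.nlargest(2)
bottom = country_means.nsmallest(2)

print(top)
print(bottom)
```

The resulting series are small enough to plot directly, for example with `top.plot(kind="bar")`.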

Now we are going to plot life expectancy against all the other numerical columns to see if there is a relationship. We are using life expectancy because this is the column of interest that we are trying to predict later.

In [None]:
fig, axes = plt.subplots(nrows=5, ncols=4, figsize=(20, 15))
axes = axes.flatten()
for i, col in enumerate(numerical):
    axes[i].scatter(data[col], data['life_expectancy'])  
    axes[i].set_title(f'{col} vs life_expectancy')
    axes[i].set_xlabel(col)
    axes[i].set_ylabel('life_expectancy')
plt.tight_layout()
plt.show()

From the graphs above, most columns do not show a very noticeable relationship with life expectancy. The columns that seem to have some relationship are adult mortality (makes sense), infant deaths, percentage expenditure, under five deaths, hiv/aids, and income composition of resources. Of these, adult mortality and income composition of resources seem to be the most correlated. It is also interesting to see that infant deaths and under five deaths seem very similar. The same goes for the thinness columns.
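
Reading relationships off scatter plots is subjective, so a correlation table can back up the visual impression. The sketch below uses a made-up frame `toy` with a few of the dataset's column names; the values are invented, and Pearson correlation (the pandas default) only captures linear relationships.

```python
import pandas as pd

# Hypothetical stand-in for the numerical columns of `data` (values are made up).
toy = pd.DataFrame({
    "adult_mortality": [300.0, 250.0, 200.0, 150.0, 100.0],
    "schooling": [3.0, 5.0, 6.0, 9.0, 11.0],
    "infant_deaths": [60.0, 45.0, 30.0, 20.0, 5.0],
    "under_five_deaths": [90.0, 66.0, 45.0, 28.0, 7.0],
    "life_expectancy": [52.0, 60.0, 66.0, 72.0, 80.0],
})

# Pearson correlation of every feature with the target, sorted from most
# negative to most positive.
corr_with_target = (
    toy.corr()["life_expectancy"].drop("life_expectancy").sort_values()
)

# Correlation between two features that look near-duplicate in the plots.
pair_corr = toy["infant_deaths"].corr(toy["under_five_deaths"])

print(corr_with_target)
print(pair_corr)
```

A pair correlation close to 1 would support the observation that two columns carry almost the same information.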

Lastly there seems to be a group of populations that are constant in their life expectancy. For example if we look at income composition of resources we see that even at 0.0 the life expectancy changes up and down. This pattern can be seen through most graphs above. It might be some inconsistent or false reporting. 

Now we move on to graphing the distribution of the columns. 

In [None]:
fig, axes = plt.subplots(nrows=5, ncols=4, figsize=(20, 15))  
axes = axes.flatten()
for i, col in enumerate(numerical):
    axes[i].hist(data[col])
    axes[i].set_title(col)
plt.tight_layout()
plt.show()
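
To complement the histograms with a numeric check, here is a sketch of counting outliers per column with the common 1.5 × IQR rule. The helper `iqr_outlier_counts` and the frame `toy` are hypothetical, and the 1.5 multiplier is a convention rather than something taken from this notebook.

```python
import pandas as pd

def iqr_outlier_counts(frame, k=1.5):
    """Count values per column lying more than k * IQR outside the quartiles."""
    q1 = frame.quantile(0.25)
    q3 = frame.quantile(0.75)
    iqr = q3 - q1
    # Comparisons against a Series align on the frame's column names.
    outside = (frame < q1 - k * iqr) | (frame > q3 + k * iqr)
    return outside.sum()

# Hypothetical stand-in for data[numerical] (values are made up).
toy = pd.DataFrame({
    "bmi": [22.0, 23.0, 24.0, 25.0, 26.0, 27.0],
    "population_mln": [1.0, 2.0, 3.0, 4.0, 5.0, 1000.0],
})

counts = iqr_outlier_counts(toy)
print(counts)
```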

Everything seems alright; there are no big outliers. We will normalize and scale the data later. Let's now count the number of developed and developing countries.

In [None]:
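# This dataset has no 'status' column; the flag is 1 for developed, 0 for developing.
data['economy_status_developed'].value_counts()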

## Filling Missing Values
Now that we have taken a look at our data and checked for patterns, let's see what we can do about the missing values. First we must count how many values we are missing and where.

In [None]:
data.isnull().sum()

We have a lot of missing values throughout the columns. The columns with the most missing values are population, hepatitis B and GDP. We can look for patterns in which countries have missing values and try to see why. We can then start looking at similar countries to fill in the values. We'll start by separating out the rows that have missing values.

In [None]:
missing_data = data[data.isnull().any(axis=1)]

In [None]:
missing_data.head(5)

In [None]:
missing_data.country.value_counts()

The cell above tells us which countries appear most often in the missing-values dataframe. We can use this to see whether it's a recurring issue or a one-time thing. If a country appears often, or its values are missing often, then we could look into it.

In [None]:
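# This dataset has no 'status' column; the flag is 1 for developed, 0 for developing.
missing_data['economy_status_developed'].value_counts()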

For now let's look at other interesting details. There seem to be more developing countries with missing values, which makes sense because they might not have the infrastructure to keep track of all the data.
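
The earlier idea of filling gaps from similar countries could look like the sketch below. It treats countries in the same region as "similar" and fills with the regional median; both choices are assumptions for illustration, and the frame `toy` is made up.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for `data` with some missing GDP values.
toy = pd.DataFrame({
    "region": ["Africa", "Africa", "Africa", "Asia", "Asia", "Asia"],
    "gdp_per_capita": [1000.0, np.nan, 3000.0, 5000.0, 7000.0, np.nan],
})

# Median GDP of each row's region, aligned to the original row index.
region_median = toy.groupby("region")["gdp_per_capita"].transform("median")

# Fill only the missing entries; observed values are left untouched.
toy["gdp_filled"] = toy["gdp_per_capita"].fillna(region_median)

print(toy)
```

Grouping by country instead of region would borrow from the same country's other years, which may be a better fit when only a few years are missing.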

Now we want to check the rows and see the ones that only have 1 missing value.

In [None]:
missing_counts = missing_data.isnull().sum(axis=1)
df_one_missing = missing_data[missing_counts == 1]

In [None]:
len(df_one_missing)

In [None]:
df_one_missing.isnull().sum()
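
For rows with exactly one missing value, it can also help to see which column is the missing one in each row. The sketch below is one possible way to do that on a made-up frame `toy`; it relies on `idxmax` returning the first column where the null mask is `True`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for `data` (values are made up).
toy = pd.DataFrame({
    "country": ["A", "B", "C", "D"],
    "gdp_per_capita": [1000.0, np.nan, 3000.0, np.nan],
    "population_mln": [5.0, 2.0, np.nan, np.nan],
})

# Keep only the rows that have exactly one missing value.
missing_counts = toy.isnull().sum(axis=1)
one_missing = toy[missing_counts == 1]

# For each such row, the name of the single column that is missing.
missing_column = one_missing.isnull().idxmax(axis=1)

print(missing_column)
```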