## Name - Anchal Garg
## Roll no - D21004
## Name - Preethi Vijayaraghavalu
## Roll no - D21026

#### A typical regression machine learning project uses historical data to predict future outcomes. This problem statement aims to predict the life expectancy of a country from various features.

#### Life expectancy is a statistical measure of the average time a human being is expected to live. It depends on various factors: regional variations, economic circumstances, sex differences, mental illnesses, physical illnesses, education, year of birth and other demographic factors. This problem statement provides a way to predict the average life expectancy of people living in a country, given factors such as year, GDP, education, alcohol intake, expenditure on the healthcare system and deaths from specific diseases in that country.

## Dataset Columns Explanation:

1. Life expectancy = Average life expectancy in years
2. Infant deaths = Number of Infant Deaths per 1000 population
3. Alcohol = Alcohol, recorded per capita (15+) consumption (in litres of pure alcohol)
4. percentage expenditure = Expenditure on health as a percentage of Gross Domestic Product per capita(%)
5. HepatitisB = Hepatitis B (HepB) immunization coverage among 1-year-olds (%)
6. Measles = Number of reported cases
7. BMI = Prevalence of overweight among adults, BMI ≥ 25, crude
8. under-five deaths = Number of under-five deaths per 1000 population
9. Total expenditure = General government expenditure on health as a percentage of total government expenditure (%)
10. HIV/AIDS = Deaths per 1000 live births HIV/AIDS (0-4 years)
11. GDP = Gross Domestic Product in USD
12. Population = Population of the country
13. thinness1-19years = Prevalence of thinness among children and adolescents for Age 10 to 19 (%)
14. thinness5-9years = Prevalence of thinness among children for Age 5 to 9 (%)
15. Income composition of resources = Relative share of each income source or group of sources, expressed as a percentage of the aggregate total income of that group or area.
16. Schooling = Number of years of Schooling
17. Adult Mortality = Adult Mortality Rates of both sexes (probability of dying between 15 and 60 years per 1000 population)
18. Polio = Polio (Pol3) immunization coverage among 1-year-olds (%)
19. Diphtheria = Diphtheria tetanus toxoid and pertussis (DTP3) immunization coverage among 1-year-olds (%)

## Importing required libraries

In [1]:

import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn import linear_model
from sklearn.linear_model import LinearRegression
import seaborn


## Getting the directory

In [34]:

os.getcwd()

os.chdir("/Users/preetvilu/Downloads")  # absolute path, as in the read_csv call

## Reading Dataset

In [2]:

data1= pd.read_csv("/Users/preetvilu/Downloads/after_eda_data.csv")
data1.shape

(2938, 22)

## Renaming the columns

In [3]:
data1.rename(columns = {'Life.expectancy':'Life_expectancy',"Adult.Mortality":"Adult_Mortality",
                       "infant.deaths":"infant_deaths","percentage.expenditure":'percentage_expenditure',
                       "Hepatitis.B":"Hepatitis_B","under.five.deaths":"under_5_deaths","Total.expenditure":"Total_expenditure",
                    "HIV.AIDS":"HIV_AIDS","thinness..1.19.years":"thinness1_19","thinness.5.9.years":"thinness5_9",
                      "Income.composition.of.resources":"Inc_com_resouce" }, inplace = True)
data1.columns

Index(['Country', 'Year', 'Status', 'Life_expectancy', 'Adult_Mortality',
       'infant_deaths', 'Alcohol', 'percentage_expenditure', 'Hepatitis_B',
       'Measles', 'BMI', 'under_5_deaths', 'Polio', 'Total_expenditure',
       'Diphtheria', 'HIV_AIDS', 'GDP', 'Population', 'thinness1_19',
       'thinness5_9', 'Inc_com_resouce', 'Schooling'],
      dtype='object')

### Checking the number of missing values in each feature. This file is the post-EDA version of the dataset, and the output below shows that no missing values remain in any column. The Country and Status features are objects, so they need to be encoded or excluded before the dataset is fed to a model.

In [4]:
data1.isnull().sum()

Country                   0
Year                      0
Status                    0
Life_expectancy           0
Adult_Mortality           0
infant_deaths             0
Alcohol                   0
percentage_expenditure    0
Hepatitis_B               0
Measles                   0
BMI                       0
under_5_deaths            0
Polio                     0
Total_expenditure         0
Diphtheria                0
HIV_AIDS                  0
GDP                       0
Population                0
thinness1_19              0
thinness5_9               0
Inc_com_resouce           0
Schooling                 0
dtype: int64

In [39]:

data1.head()

Unnamed: 0,Country,Year,Status,Life_expectancy,Adult_Mortality,infant_deaths,Alcohol,percentage_expenditure,Hepatitis_B,Measles,...,Polio,Total_expenditure,Diphtheria,HIV_AIDS,GDP,Population,thinness1_19,thinness5_9,Inc_com_resouce,Schooling
0,Afghanistan,2015,Developing,65.0,263.0,62,0.01,71.279624,65,1154,...,6,8.16,65,0.1,584.25921,33736494.0,17.2,17.3,0.479,10.1
1,Afghanistan,2014,Developing,59.9,271.0,64,0.01,73.523582,62,492,...,58,8.18,62,0.1,612.696514,327582.0,17.5,17.5,0.476,10.0
2,Afghanistan,2013,Developing,59.9,268.0,66,0.01,73.219243,64,430,...,62,8.13,64,0.1,631.744976,31731688.0,17.7,17.7,0.47,9.9
3,Afghanistan,2012,Developing,59.5,272.0,69,0.01,78.184215,67,2787,...,67,8.52,67,0.1,669.959,3696958.0,17.9,18.0,0.463,9.8
4,Afghanistan,2011,Developing,59.2,275.0,71,0.01,7.097109,68,3013,...,68,7.87,68,0.1,63.537231,2978599.0,18.2,18.2,0.454,9.5


### Describing the dataset. The dataset consists of examples from 193 different countries across several years. The minimum value for Year is 2000 and the maximum is 2015, so it covers 16 calendar years. The Status column is binary and can be converted to 1s and 0s. The Population column contains very large values, which is expected; they can be shown in millions.

In [33]:


data1.describe()

Unnamed: 0,Year,Life_expectancy,Adult_Mortality,infant_deaths,Alcohol,percentage_expenditure,Hepatitis_B,Measles,BMI,under_5_deaths,Polio,Total_expenditure,Diphtheria,HIV_AIDS,GDP,Population,thinness1_19,thinness5_9,Inc_com_resouce,Schooling
count,2938.0,2938.0,2938.0,2938.0,2938.0,2938.0,2938.0,2938.0,2938.0,2938.0,2938.0,2938.0,2938.0,2938.0,2938.0,2938.0,2938.0,2938.0,2938.0,2938.0
mean,2007.51872,69.224932,164.796448,30.303948,4.602861,738.251295,83.022124,2419.59224,38.321247,42.035739,82.617767,5.93819,82.393125,1.742103,7833.522681,11907910.0,4.839704,4.870317,0.627551,11.992793
std,4.613841,9.50764,124.080302,117.926501,3.916288,1987.914858,22.996984,11467.272489,19.927677,160.445548,23.367166,2.400274,23.655562,5.077785,13162.751788,52927380.0,4.394535,4.482708,0.20482,3.264381
min,2000.0,36.3,1.0,0.0,0.01,0.0,1.0,0.0,1.0,0.0,3.0,0.37,2.0,0.1,1.68135,34.0,0.1,0.1,0.0,0.0
25%,2004.0,63.2,74.0,0.0,1.0925,4.685343,82.0,0.0,19.4,0.0,78.0,4.37,78.0,0.1,580.486996,618577.5,1.6,1.6,0.50425,10.3
50%,2008.0,72.0,144.0,3.0,4.16,64.912906,92.0,17.0,43.0,4.0,93.0,5.93819,93.0,0.1,3116.561755,8222362.0,3.4,3.4,0.662,12.1
75%,2012.0,75.6,227.0,22.0,7.39,441.534144,96.0,360.25,56.1,28.0,97.0,7.33,97.0,0.8,9780.859485,10616520.0,7.1,7.2,0.772,14.1
max,2015.0,89.0,723.0,1800.0,17.87,19479.91161,99.0,212183.0,87.3,2500.0,99.0,17.6,99.0,50.6,119172.7418,1293859000.0,27.7,28.6,0.948,20.7
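
The two transformations mentioned in the description (a binary Status column and Population in millions) could look like the sketch below. The toy rows, the new column names `Status_bin` and `Population_millions`, and the assumption that Status takes exactly the values "Developing" and "Developed" are illustrative only and are not part of the original notebook.

```python
import pandas as pd

# Toy rows standing in for the real dataset (values are illustrative only)
toy = pd.DataFrame({
    "Status": ["Developing", "Developed", "Developing"],
    "Population": [33736494.0, 8222362.0, 618577.5],
})

# Assumed encoding: Developed -> 1, Developing -> 0
toy["Status_bin"] = toy["Status"].map({"Developed": 1, "Developing": 0})

# Population expressed in millions for readability
toy["Population_millions"] = toy["Population"] / 1e6

print(toy)
```

Any value of Status outside the mapping would become NaN, which makes unexpected categories easy to spot.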


## Splitting the data into train and test


In [5]:
# splitting the data by position into train and test: 2056 of 2938 rows, roughly a 70-30 ratio

data = data1[:2056] 
data_test = data1[2056:]
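
The slice above is positional, and 2056 of 2938 rows is roughly 70% of the data. Because the rows appear to be grouped by country, a positional slice places whole countries on one side of the split. A shuffled split is one alternative; the sketch below uses `train_test_split` on a toy frame, with `toy`, `train_df` and `test_df` as hypothetical names.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame standing in for the real dataset
toy = pd.DataFrame({"x": np.arange(100), "y": np.arange(100) * 2.0})

# Shuffled 70-30 split; random_state fixes the shuffle for reproducibility
train_df, test_df = train_test_split(toy, test_size=0.3, random_state=0)

print(train_df.shape, test_df.shape)
```

Whether a shuffled or a by-country split is more appropriate depends on whether the model is meant to generalise to unseen countries or to unseen years of known countries.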

## shape before dropping missing columns

In [6]:


data.shape

(2056, 22)

In [7]:
data.columns

Index(['Country', 'Year', 'Status', 'Life_expectancy', 'Adult_Mortality',
       'infant_deaths', 'Alcohol', 'percentage_expenditure', 'Hepatitis_B',
       'Measles', 'BMI', 'under_5_deaths', 'Polio', 'Total_expenditure',
       'Diphtheria', 'HIV_AIDS', 'GDP', 'Population', 'thinness1_19',
       'thinness5_9', 'Inc_com_resouce', 'Schooling'],
      dtype='object')

## Simple Linear Regression

### Linear regression is used as the first model. Each predictor is evaluated on its own with 10-fold cross-validation on the training set, and R2, RMSE and MAE are used as the evaluation metrics.

In [8]:
SLR_list = []                                 # rows of [predictor, R-square, RMSE, MAE]
regr = linear_model.LinearRegression()          # initializing the sklearn linear regression model
for i in range(4,np.shape(data.columns)[0]):    # numeric predictors only: skips Country, Year, Status and the target Life_expectancy
    scores1 = cross_val_score(regr,pd.DataFrame(data.iloc[:,i]),pd.DataFrame(data.Life_expectancy),cv=10,scoring='r2')                          # cv score with R-square metric
    scores2 = cross_val_score(regr,pd.DataFrame(data.iloc[:,i]),pd.DataFrame(data.Life_expectancy),cv=10,scoring='neg_root_mean_squared_error') # cv score with RMSE metric
    scores3 = cross_val_score(regr,pd.DataFrame(data.iloc[:,i]),pd.DataFrame(data.Life_expectancy),cv=10,scoring='neg_mean_absolute_error')     # cv score with MAE metric
    SLR_list.append([data.columns[i],scores1.mean(),np.abs(scores2.mean()),np.abs(scores3.mean())])   # one row per predictor
SLR_Result = pd.DataFrame(SLR_list, columns = ["Predictor","R2","RMSE","MAE"])  # converting the list to a pandas dataframe
SLR_Result = SLR_Result.sort_values("R2")       # sorting the dataframe by the R2 column (ascending)
SLR_Result


Unnamed: 0,Predictor,R2,RMSE,MAE
1,infant_deaths,-0.371321,10.125749,8.051379
7,under_5_deaths,-0.353185,10.046908,7.978132
13,Population,-0.080852,9.438539,7.881712
5,Measles,-0.052608,9.32535,7.773569
4,Hepatitis_B,-0.045223,9.291684,7.623226
9,Total_expenditure,0.004648,9.063879,7.372646
3,percentage_expenditure,0.098657,8.619344,7.153721
2,Alcohol,0.111429,8.537448,6.901496
12,GDP,0.12796,8.471936,6.979411
15,thinness5_9,0.14844,8.344766,6.76159
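
Several predictors in the table have a negative cross-validated R2. R2 is zero for a model that always predicts the mean of the target and drops below zero when the predictions are worse than that baseline. A small illustration with made-up numbers:

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([1.0, 2.0, 3.0])

# Always predicting the mean gives R2 = 0
r2_mean = r2_score(y_true, np.full(3, y_true.mean()))

# Predictions that are worse than the mean give a negative R2
r2_bad = r2_score(y_true, np.array([3.0, 2.0, 1.0]))

print(r2_mean, r2_bad)
```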


# Multiple linear regression

### The calculations for multiple regression are almost identical to those for simple linear regression, except that the test statistic (MSR)/(MSE) has an F(k, n – k – 1) distribution. 
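
As a rough sketch of the F statistic mentioned above, the following computes MSR/MSE for an ordinary least squares fit on synthetic data. The data, the coefficient values and all variable names are assumptions made for illustration; none of them come from the life expectancy dataset.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 200, 3                                   # observations, predictors
X = rng.normal(size=(n, k))
y = 2.0 + X @ np.array([1.5, -0.7, 0.0]) + rng.normal(size=n)

# Ordinary least squares with an intercept column
A = np.column_stack([np.ones(n), X])
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
y_hat = A @ beta

ssr = np.sum((y_hat - y.mean()) ** 2)           # regression sum of squares
sse = np.sum((y - y_hat) ** 2)                  # residual sum of squares
sst = np.sum((y - y.mean()) ** 2)               # total sum of squares

msr = ssr / k                                   # mean square regression
mse = sse / (n - k - 1)                         # mean square error
f_stat = msr / mse                              # compared against F(k, n - k - 1)

print(f_stat)
```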

In [9]:
def model_house(X,y,model_number=None):
    model=LinearRegression()
    RMSE = -cross_val_score(model, X, y, cv=10, scoring='neg_root_mean_squared_error').mean()   # mean 10-fold RMSE
    r2 = cross_val_score(model, X, y, cv=10, scoring='r2').mean()   # mean 10-fold R-square
    n, p = len(X), X.shape[1]                                        # number of observations and predictors
    adj_r2 = np.round(1-(1-r2)*(n-1)/(n-p-1),2)                      # adjusted R-square
    print('Model',model_number,':\nRMSE:',np.round(RMSE,4),'\nr_square:',np.round(r2,2),'\nAdjusted r2:',adj_r2)
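
For reference, the standard adjusted R2 for a model with an intercept and p predictors fitted on n observations is 1 - (1 - R2)(n - 1)/(n - p - 1). The helper below is a minimal sketch of that formula; the name `adjusted_r2` and its arguments are hypothetical and not part of the notebook.

```python
def adjusted_r2(r2: float, n_samples: int, n_predictors: int) -> float:
    """Adjusted R-squared for a model with an intercept and n_predictors features."""
    return 1.0 - (1.0 - r2) * (n_samples - 1) / (n_samples - n_predictors - 1)


# Example with the training-set size used here and 18 predictors
print(round(adjusted_r2(0.78, 2056, 18), 4))
```

With about two thousand rows and fewer than twenty predictors the penalty is small, which is why R2 and adjusted R2 agree to two decimals in the outputs.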

In [10]:
X_train=data[['Country', 'Year', 'Status', 'Adult_Mortality',
       'infant_deaths', 'Alcohol', 'percentage_expenditure', 'Hepatitis_B',
       'Measles', 'BMI', 'under_5_deaths', 'Polio', 'Total_expenditure',
       'Diphtheria', 'HIV_AIDS', 'GDP', 'Population', 'thinness1_19',
       'thinness5_9', 'Schooling',"Inc_com_resouce"]]

y_train = data[['Life_expectancy']]

# Model 1 

### Using the method outlined above, the first model was created:

In [11]:
X=X_train[[ 'Adult_Mortality',
       'infant_deaths', 'Alcohol', 'percentage_expenditure', 'Hepatitis_B',
       'Measles', 'BMI', 'under_5_deaths', 'Polio', 'Total_expenditure',
       'Diphtheria', 'HIV_AIDS', 'GDP', 'Population', 'thinness1_19',
       'thinness5_9', 'Schooling',"Inc_com_resouce"]]    #Predictor
y=np.array(y_train) #Target
model_house(X,y,1)

Model 1 :
RMSE: 4.1742 
r_square: 0.78 
Adjusted r2: 0.78


# Model 2

In [12]:
X=X_train[['Adult_Mortality', 'Alcohol', 'percentage_expenditure', 'Hepatitis_B', 'BMI', 'Polio',
       'Diphtheria', 'HIV_AIDS', 'GDP', 'Schooling',"Inc_com_resouce"]]    #Predictor
y=np.array(y_train) #Target
model_house(X,y,2)

Model 2 :
RMSE: 4.1684 
r_square: 0.78 
Adjusted r2: 0.78


# Model 3

In [13]:
X=X_train[["Schooling","Inc_com_resouce","Adult_Mortality","BMI","HIV_AIDS","Diphtheria","Polio","thinness1_19","thinness5_9","GDP","percentage_expenditure"]]    #Predictor
y=np.array(y_train) #Target
model_house(X,y,3)

Model 3 :
RMSE: 4.1547 
r_square: 0.78 
Adjusted r2: 0.78


# Model 4

In [14]:

X=X_train[["Schooling","Inc_com_resouce","Adult_Mortality","BMI","HIV_AIDS","Diphtheria","Polio","thinness1_19","thinness5_9","GDP","percentage_expenditure","Alcohol","Hepatitis_B","Total_expenditure","Measles","Population"]]    #Predictor
y=np.array(y_train) #Target
model_house(X,y,4)

Model 4 :
RMSE: 4.1945 
r_square: 0.78 
Adjusted r2: 0.78


# Decision Tree Model Fitting

# Model 1

In [15]:
X = data[['Adult_Mortality','infant_deaths', 'Alcohol', 'percentage_expenditure', 'Hepatitis_B', 'Measles', 'BMI', 'under_5_deaths', 'Polio', 'Total_expenditure',
        'Diphtheria', 'HIV_AIDS', 'GDP', 'Population', 'thinness1_19',
       'thinness5_9', 'Schooling',"Inc_com_resouce"]]    # all 18 numeric predictors
y = data['Life_expectancy']
Dtree = DecisionTreeRegressor()
Dtree.fit(X,y)    # fit in place; assigning the result to Dtree.fit would overwrite the method

In [16]:
from sklearn.tree import DecisionTreeRegressor # importing the DecisionTreeRegressor class from sklearn

Dtree = DecisionTreeRegressor(max_depth = 4, min_samples_leaf=5)  # decision tree parameters: maximum depth 4, at least 5 samples per leaf

R_square = cross_val_score(Dtree,X,y,cv = 10,scoring = "r2")

scores = cross_val_score(Dtree,X,y,cv = 10,scoring = "neg_mean_squared_error")

RMSE = np.sqrt(abs(scores.mean()))

RSquare = R_square.mean()

print("RMSE is: ",RMSE)
print("Rsquare is: ",RSquare)

RMSE is:  3.8723907494947185
Rsquare is:  0.8115127944186391
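
The tree above uses fixed values for max_depth and min_samples_leaf. One way to choose them from data is a cross-validated grid search. The sketch below runs `GridSearchCV` on synthetic data; the grid values, `X_demo`, `y_demo` and `search` are assumptions made for illustration and are not tuned for this dataset.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the real predictors and target
X_demo, y_demo = make_regression(n_samples=300, n_features=8, noise=5.0, random_state=0)

# Candidate values (illustrative, not tuned for the life expectancy data)
param_grid = {"max_depth": [2, 4, 6], "min_samples_leaf": [1, 5, 10]}

search = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    param_grid,
    cv=5,
    scoring="neg_root_mean_squared_error",
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print("CV RMSE:", -search.best_score_)
```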


# Model 2

In [17]:
X = data[['Adult_Mortality', 'Alcohol', 'percentage_expenditure', 'Hepatitis_B', 'BMI', 'Polio',
       'Diphtheria', 'HIV_AIDS', 'GDP', 'Schooling',"Inc_com_resouce"]]    # reduced set of 11 predictors
y = data['Life_expectancy']
Dtree = DecisionTreeRegressor()
Dtree.fit(X,y)    # fit in place; assigning the result to Dtree.fit would overwrite the method

In [18]:

Dtree = DecisionTreeRegressor(max_depth = 4, min_samples_leaf=5)  # decision tree parameters: maximum depth 4, at least 5 samples per leaf

R_square = cross_val_score(Dtree,X,y,cv = 10,scoring = "r2")

scores = cross_val_score(Dtree,X,y,cv = 10,scoring = "neg_mean_squared_error")

RMSE = np.sqrt(abs(scores.mean()))

RSquare = R_square.mean()

print("RMSE is: ",RMSE)
print("Rsquare is: ",RSquare)

RMSE is:  3.7088170809426515
Rsquare is:  0.8260301375680648


# Random Forest

# Model 1

In [19]:

X = X_train[['Adult_Mortality','infant_deaths', 'Alcohol', 'percentage_expenditure', 'Hepatitis_B', 'Measles', 'BMI', 'under_5_deaths', 'Polio', 'Total_expenditure',
        'Diphtheria', 'HIV_AIDS', 'GDP', 'Population', 'thinness1_19',
       'thinness5_9', 'Schooling',"Inc_com_resouce"]]
y = y_train['Life_expectancy']
regressor = RandomForestRegressor(n_estimators=20, random_state=0)    # 20 trees, fixed seed for reproducibility
regressor.fit(X,y)

RandomForestRegressor(n_estimators=20, random_state=0)

In [20]:
from sklearn.ensemble import RandomForestRegressor
from sklearn import metrics

# Predicting on the training rows themselves, so the errors below are in-sample
y_pred = regressor.predict(X)
print('Mean Absolute Error:', metrics.mean_absolute_error(y_train, y_pred))
print('Mean Squared Error:', metrics.mean_squared_error(y_train, y_pred))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_train, y_pred)))

Mean Absolute Error: 0.44926377847855675
Mean Squared Error: 0.5902969826356581
Root Mean Squared Error: 0.7683078696952531
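
The errors above are computed on the same rows the forest was fitted on, so they tend to be optimistic. A sketch of the difference between in-sample and cross-validated RMSE, on synthetic data with hypothetical names (`X_demo`, `y_demo`, `rf`):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the real predictors and target
X_demo, y_demo = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)
rf = RandomForestRegressor(n_estimators=20, random_state=0)

# In-sample error: fit and predict on the same rows
rf.fit(X_demo, y_demo)
train_rmse = np.sqrt(mean_squared_error(y_demo, rf.predict(X_demo)))

# Cross-validated error: each fold is predicted by a model that never saw it
cv_rmse = -cross_val_score(
    rf, X_demo, y_demo, cv=5, scoring="neg_root_mean_squared_error"
).mean()

print("train RMSE:", train_rmse)
print("CV RMSE:", cv_rmse)
```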


# Model 2

In [21]:

X = X_train[['Adult_Mortality', 'Alcohol', 'percentage_expenditure', 'Hepatitis_B', 'BMI', 'Polio',
       'Diphtheria', 'HIV_AIDS', 'GDP', 'Schooling',"Inc_com_resouce"]]
y = y_train['Life_expectancy']
regressor = RandomForestRegressor(n_estimators=20, random_state=0)
regressor.fit(X, y)
# Predicting on the training rows themselves, so the errors below are in-sample
y_pred = regressor.predict(X)
print('Mean Absolute Error:', metrics.mean_absolute_error(y_train, y_pred))
print('Mean Squared Error:', metrics.mean_squared_error(y_train, y_pred))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_train, y_pred)))

Mean Absolute Error: 0.4932703479486939
Mean Squared Error: 0.6948275470672189
Root Mean Squared Error: 0.8335631632139336


## Bagging and Boosting

## Model 1

In [26]:
# test regression dataset
from sklearn.datasets import make_regression
# define dataset
X, y = make_regression(n_samples=2056, n_features=22, n_informative=15, noise=0.1, random_state=5)
# summarize the dataset
print(X.shape, y.shape)

(2056, 22) (2056,)
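
This section only generates a synthetic dataset. As a sketch of how such data could be used to compare a bagging and a boosting regressor, the block below cross-validates scikit-learn's `BaggingRegressor` and `GradientBoostingRegressor`. The choice of estimators, their settings and the smaller sample size are assumptions made here, not the authors' method.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor, GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

# Smaller synthetic dataset than above so the sketch runs quickly
X_syn, y_syn = make_regression(n_samples=500, n_features=22, n_informative=15,
                               noise=0.1, random_state=5)

models = {
    "bagging": BaggingRegressor(n_estimators=20, random_state=0),
    "boosting": GradientBoostingRegressor(n_estimators=100, random_state=0),
}

results = {}
for name, model in models.items():
    # Mean 5-fold cross-validated R2 for each ensemble
    results[name] = cross_val_score(model, X_syn, y_syn, cv=5, scoring="r2").mean()
    print(name, round(results[name], 3))
```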


# Model Testing on Test Data

In [27]:
X_test=data_test[['Country', 'Year', 'Status', 'Adult_Mortality',
       'infant_deaths', 'Alcohol', 'percentage_expenditure', 'Hepatitis_B',
       'Measles', 'BMI', 'under_5_deaths', 'Polio', 'Total_expenditure',
       'Diphtheria', 'HIV_AIDS', 'GDP', 'Population', 'thinness1_19',
       'thinness5_9', 'Schooling',"Inc_com_resouce"]]

y_test = data_test[['Life_expectancy']]

# Model 2 fitting on test data

# Linear regression

In [28]:
X=X_test[['Adult_Mortality', 'Alcohol', 'percentage_expenditure', 'Hepatitis_B', 'BMI', 'Polio',
       'Diphtheria', 'HIV_AIDS', 'GDP', 'Schooling',"Inc_com_resouce"]]    #Predictor
y=np.array(y_test) #Target
model_house(X,y,2) 

Model 2 :
RMSE: 4.6768 
r_square: 0.64 
Adjusted r2: 0.64
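
The cell above passes the test rows to model_house, which runs 10-fold cross-validation and therefore refits the model inside the test rows. A held-out estimate instead fits once on the training rows and scores on the test rows. A minimal sketch on synthetic data, with all names (`X_tr`, `X_te`, `lin`, and so on) hypothetical:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Synthetic stand-in, split by position into train and test rows
X_all, y_all = make_regression(n_samples=400, n_features=5, noise=10.0, random_state=0)
X_tr, X_te = X_all[:280], X_all[280:]
y_tr, y_te = y_all[:280], y_all[280:]

# Fit on the training rows only
lin = LinearRegression().fit(X_tr, y_tr)

# Score on rows the model has never seen
pred = lin.predict(X_te)
test_rmse = np.sqrt(mean_squared_error(y_te, pred))
test_r2 = r2_score(y_te, pred)

print("test RMSE:", test_rmse)
print("test R2:", test_r2)
```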


# Decision Tree

In [29]:
X = data_test[['Adult_Mortality', 'Alcohol', 'percentage_expenditure', 'Hepatitis_B', 'BMI', 'Polio',
       'Diphtheria', 'HIV_AIDS', 'GDP', 'Schooling',"Inc_com_resouce"]]
y = data_test['Life_expectancy']
Dtree = DecisionTreeRegressor()
Dtree.fit(X,y)    # fit in place; assigning the result to Dtree.fit would overwrite the method

In [30]:

Dtree = DecisionTreeRegressor(max_depth = 4, min_samples_leaf=5)  # decision tree parameters: maximum depth 4, at least 5 samples per leaf

R_square = cross_val_score(Dtree,X,y,cv = 10,scoring = "r2")

scores = cross_val_score(Dtree,X,y,cv = 10,scoring = "neg_mean_squared_error")

RMSE = np.sqrt(abs(scores.mean()))

RSquare = R_square.mean()

print("RMSE is: ",RMSE)
print("Rsquare is: ",RSquare)

RMSE is:  5.373043640444159
Rsquare is:  0.5942711427026749


# Random Forest

In [32]:

X = X_test[['Adult_Mortality', 'Alcohol', 'percentage_expenditure', 'Hepatitis_B', 'BMI', 'Polio',
       'Diphtheria', 'HIV_AIDS', 'GDP', 'Schooling',"Inc_com_resouce"]]
y = y_test['Life_expectancy']

y_pred = regressor.predict(X)
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred))
print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))

Mean Absolute Error: 2.358598297925727
Mean Squared Error: 11.960469305194302
Root Mean Squared Error: 3.4583911440428916
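
The cell above reports MAE, MSE and RMSE but not R2. Should an R2 for held-out predictions be needed, `metrics.r2_score` can be called in the same way. The arrays below are made-up values used only to show the calls.

```python
import numpy as np
from sklearn import metrics

# Made-up true values and predictions (illustrative only)
y_true = np.array([65.0, 59.9, 72.0, 75.6])
y_hat = np.array([63.5, 61.0, 71.0, 76.0])

mae = metrics.mean_absolute_error(y_true, y_hat)
rmse = np.sqrt(metrics.mean_squared_error(y_true, y_hat))
r2 = metrics.r2_score(y_true, y_hat)

print("MAE:", mae)
print("RMSE:", rmse)
print("R2:", r2)
```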


### The results from the three algorithms (linear regression, decision tree and random forest regressor) suggest that living on any continent other than Africa has a strong positive influence on life expectancy. The year also has a positive effect, while HIV/AIDS has a negative effect whose magnitude is greater than that of living outside Africa.

### Surprisingly, Alcohol also has a positive effect on life expectancy.



## CONCLUSION

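### The best regression model for predicting life expectancy in this notebook is the random forest regressor (n_estimators=20, random_state=0). Random forest Model 1 included all 18 numeric features, which is reasonable because random forest models are comparatively robust to multicollinearity; the model evaluated on the test rows is Model 2, with 11 features. Its test RMSE (about 3.46) is lower than the RMSE obtained on the test rows for linear regression (about 4.68) and the decision tree (about 5.37), although those two figures come from cross-validation within the test rows. The gap between the random forest's training RMSE (about 0.83) and its test RMSE indicates some overfitting. No test R2 was computed for the random forest, so the comparison rests on RMSE.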