# Life Expectancy (WHO)
#### Regression models for predicting life expectancy from the WHO dataset.

### Context
Many past studies have examined factors affecting life expectancy, considering demographic variables, income composition and mortality rates. However, the effects of immunization and the human development index were not taken into account. In addition, some past research used multiple linear regression on a data set covering a single year for all countries. This motivates addressing both gaps by formulating a regression model, based on a mixed effects model and multiple linear regression, using data from 2000 to 2015 for all countries. Important immunizations such as Hepatitis B, Polio and Diphtheria are also considered. In a nutshell, this study focuses on immunization factors, mortality factors, economic factors, social factors and other health-related factors. Since the observations in this dataset are organized by country, it will be easier for a country to determine which predicting factor is contributing to a lower life expectancy. This will help suggest which areas a country should prioritize in order to efficiently improve the life expectancy of its population.

### Content
The project relies on the accuracy of the data. The Global Health Observatory (GHO) data repository under the World Health Organization (WHO) keeps track of the health status as well as many other related factors for all countries. The data-sets are made available to the public for the purpose of health data analysis. The data-set related to life expectancy and health factors for 193 countries was collected from the WHO data repository website, and the corresponding economic data was collected from the United Nations website. Among all categories of health-related factors, only the most representative critical factors were chosen. It has been observed that in the past 15 years there has been huge development in the health sector, resulting in improved human mortality rates, especially in developing nations, in comparison to the past 30 years. Therefore, this project considers data from 2000 to 2015 for 193 countries for further analysis. The individual data files were merged into a single data-set. Initial visual inspection of the data showed some missing values. As the data-sets were from WHO, we found no evident errors. Missing data was handled in R using the Missmap command. The result indicated that most of the missing data was for population, Hepatitis B and GDP. The missing data came from lesser-known countries such as Vanuatu, Tonga, Togo and Cabo Verde. Finding all the data for these countries was difficult, so it was decided to exclude them from the final model data-set. The final merged file (final dataset) consists of 22 columns and 2938 rows, which means 20 predicting variables. All predicting variables were then divided into several broad categories: immunization-related factors, mortality factors, economic factors and social factors.

### Acknowledgements
The data was collected from the WHO and United Nations websites with the help of Deeksha Russell and Duan Wang.

### Inspiration
The data-set aims to answer the following key questions:

- Do the various predicting factors chosen initially really affect life expectancy? Which predicting variables actually affect life expectancy?

- Should a country having a lower life expectancy value (<65) increase its healthcare expenditure in order to improve its average lifespan?

- How do infant and adult mortality rates affect life expectancy?

- Does life expectancy have a positive or negative correlation with eating habits, lifestyle, exercise, smoking, drinking alcohol, etc.?

- What is the impact of schooling on the lifespan of humans?

- Does life expectancy have a positive or negative relationship with drinking alcohol?

- Do densely populated countries tend to have lower life expectancy?

- What is the impact of immunization coverage on life expectancy?

# 1. Importing ML Libraries & Dataset

In [None]:
# basic libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import warnings 
import datetime
import math

# libraries for data preprocessing
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler

# libraries for ML Models
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, Ridge, RidgeCV,Lasso,LassoCV
from sklearn.svm import SVR
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, AdaBoostRegressor, GradientBoostingRegressor
from xgboost import XGBRegressor
from mlxtend.regressor import StackingCVRegressor

# libraries for model evaluation
from sklearn.metrics import mean_absolute_percentage_error, mean_squared_error, mean_absolute_error

# libraries for hyperparameter tuning
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

# visualization
from yellowbrick.regressor import residuals_plot,prediction_error

# to ignore warnings
warnings.filterwarnings("ignore")
%matplotlib inline
sns.set_style("whitegrid")

### Importing data

In [None]:
df = pd.read_csv('../input/life-expectancy-who/Life Expectancy Data.csv')
df

In [None]:
df.shape

In [None]:
df.info()

# 2. Data Cleaning
## 2.1 Null Values

In [None]:
# Function to find out missing values
def check_na(data):
    missing_values= data.isna().sum().reset_index()
    missing_values.columns= ["Features", "Missing_Values"]
    missing_values["Missing_Percent"]= round(missing_values.Missing_Values/len(data)*100,2)
    missing_values = missing_values[missing_values.Missing_Values > 0 ]

    return missing_values.sort_values("Missing_Percent", ascending=False).reset_index(drop=True)

In [None]:
# List of Null value features
missing_values = check_na(df)
missing_values

In [None]:
# Function to impute null/missing values with Mean, Median, Mode
def missing_value_imputer(data, feature, method):
    if method == "mode":
        data[feature] = data[feature].fillna(data[feature].mode()[0])
    elif method == "median":
        data[feature] = data[feature].fillna(data[feature].median())
    else:
        data[feature] = data[feature].fillna(data[feature].mean())
    return data

In [None]:
# Imputing the missing values for each column having missing values
for feature in missing_values["Features"]:
    missing_value_imputer(data= df, feature=feature, method="median")

In [None]:
# Check missing values
missing_values = check_na(df)
missing_values

#### We have imputed the missing values with median of respective features with the help of missing_value_imputer() function 
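
To see what median imputation does on a small scale, here is a minimal sketch on a made-up frame. The `toy` data and its values are assumptions for illustration only and are not taken from the WHO dataset. The median is less sensitive to extreme values than the mean, which is visible in the `GDP` column below.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; values are invented for illustration only
toy = pd.DataFrame({"GDP": [100.0, np.nan, 300.0, 10000.0],
                    "Population": [1.0, 2.0, np.nan, 4.0]})

# Fill each column's gaps with that column's own median
toy_imputed = toy.fillna(toy.median())
print(toy_imputed)
```

The missing `GDP` entry becomes 300.0 (the median of 100, 300 and 10000), whereas a mean fill would have been pulled far upward by the 10000 outlier.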

## 2.2 Categorical and Numerical Data

In [None]:
# Finding out which features have categorical values and which one of them have numerical values.
categorical = df.select_dtypes(include="O")
numerical = df.select_dtypes(exclude="O")

In [None]:
categorical

In [None]:
# Label encoding the categorical features
def label_encoder(data, columns):
    for feature in columns:
        le = LabelEncoder()
        # fit_transform returns integer codes; cast explicitly and assign the result back
        data[feature] = le.fit_transform(data[feature]).astype("int64")
    return data

In [None]:
df = label_encoder(df, categorical.columns)
df
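
Label encoding replaces each category with an integer assigned in sorted order, so the numbers carry no real ranking. The sketch below, on an invented toy column, shows the mapping and how keeping the fitted encoder allows the codes to be turned back into labels. The `status` series is an assumption for illustration.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy column; the labels are invented for illustration
status = pd.Series(["Developing", "Developed", "Developing"])

le = LabelEncoder()
codes = le.fit_transform(status)       # classes are sorted alphabetically before coding
decoded = le.inverse_transform(codes)  # map the integer codes back to the labels
print(list(codes), list(le.classes_))
```

Because the notebook creates a new encoder per column inside the loop and does not keep it, decoding later would require storing each fitted encoder, for example in a dictionary keyed by column name.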

# 3. Feature Engineering

### Correlation matrix
Plot the feature correlations as a heatmap to see which pairs of features are highly correlated, so that redundant features can be removed.

In [None]:
plt.figure(figsize=(20,20))
sns.heatmap(df.corr(), annot=True)
plt.show()

In [None]:
# Function to find and remove correlated features
def correlation(dataset, threshold):
    col_corr = set()  # Set of all the names of correlated columns
    corr_matrix = dataset.corr()
    for i in range(len(corr_matrix.columns)):
        for j in range(i):
            if abs(corr_matrix.iloc[i, j]) > threshold: # we are interested in absolute coeff value
                colname = corr_matrix.columns[i]  # getting the name of column
                col_corr.add(colname)
    return col_corr

In [None]:
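# Exclude the target column so it can never be flagged for removal
droppable_features = list(correlation(df.drop(columns=["Life expectancy "]), 0.8))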
droppable_features

In [None]:
df.drop(droppable_features, axis=1, inplace=True)

Here we dropped features that are highly correlated with another feature. They add little additional information, so removing them reduces dimensionality.

The correlation() function returns the columns that can be dropped because their absolute correlation with an earlier column exceeds the given threshold.
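
A small worked example may make the rule concrete. The sketch below re-implements the same lower-triangle scan under the hypothetical name `correlated_columns` and runs it on invented toy data: of each highly correlated pair, only the later column is flagged, so the earlier one is kept.

```python
import pandas as pd

def correlated_columns(dataset, threshold):
    """Return later columns whose |corr| with an earlier column exceeds threshold."""
    flagged = set()
    corr_matrix = dataset.corr()
    for i in range(len(corr_matrix.columns)):
        for j in range(i):  # only compare with columns that come earlier
            if abs(corr_matrix.iloc[i, j]) > threshold:
                flagged.add(corr_matrix.columns[i])
    return flagged

# Hypothetical toy data; values are invented for illustration only
toy = pd.DataFrame({"a": [1, 2, 3, 4, 5],
                    "b": [2, 4, 6, 8, 10],   # exact multiple of "a", correlation 1.0
                    "c": [5, 1, 4, 2, 3]})   # weakly related to both, correlation -0.3
print(correlated_columns(toy, 0.8))  # → {'b'}
```

Note that the result depends on column order: had `b` come before `a`, then `a` would have been flagged instead.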

### Feature Selection

In [None]:
df

In [None]:
from sklearn.feature_selection import mutual_info_regression

x = df.drop(["Life expectancy "], axis=1)
y = df["Life expectancy "]

# Function to get the features and their mutual information gain for regression
def select_features_mutual_info_regression(x, y):
    mutual_info = mutual_info_regression(x,y)
    mutual_data=pd.Series(mutual_info,index = x.columns)
    return mutual_data.sort_values(ascending=False)

top_features = select_features_mutual_info_regression(x, y)
top_features

In [None]:
# Selecting top 15 features with highest mutual information gain
top_features = top_features.head(15)
top_features

### Splitting Data into Features and Target

In [None]:
# Splitting data into Features
x = df[top_features.index]
x

In [None]:
# Splitting data into Target
y = df["Life expectancy "]
y

### Feature Transformation

In [None]:
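def scaler_transform(data):
    # Work on a copy so the frame passed in is not modified in place
    data = data.copy()
    sc = StandardScaler()
    # Standardize the columns of the argument itself (not the global df)
    data[data.columns] = sc.fit_transform(data[data.columns])

    return data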

In [None]:
x = scaler_transform(x)
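
The cell above standardizes every row before the train/test split, so the test rows influence the mean and standard deviation used for scaling. A common alternative, sketched here on synthetic data and not the approach this notebook takes, is to fit the scaler on the training rows only and reuse it for the test rows. All names and values below are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the feature matrix and target; values are random
rng = np.random.default_rng(0)
features = pd.DataFrame(rng.normal(50, 10, size=(100, 3)), columns=["f1", "f2", "f3"])
target = pd.Series(rng.normal(70, 5, size=100))

f_train, f_test, t_train, t_test = train_test_split(
    features, target, test_size=0.3, random_state=0)

# Statistics come from the training rows only
scaler = StandardScaler().fit(f_train)
f_train_scaled = pd.DataFrame(scaler.transform(f_train),
                              columns=features.columns, index=f_train.index)
# The test rows are transformed with the training statistics
f_test_scaled = pd.DataFrame(scaler.transform(f_test),
                             columns=features.columns, index=f_test.index)
```

Keeping the fitted `scaler` object around also makes it possible to transform new inputs at prediction time in exactly the same way.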

# 4. Predictive Model

### Splitting data into Training and Testing Datasets

In [None]:
x_train, x_test, y_train, y_test= train_test_split(x, y, test_size= 0.3, random_state=0)

In [None]:
# Function to apply Regression algorithms and return the results of models
def predictive_models():
    algorithms = [
        LinearRegression(), Ridge(alpha=0.1), RidgeCV(alphas=[0.1, 0.01, 0.001, 1], cv=10),
        Lasso(alpha=0.1), LassoCV(alphas=[0.1, 0.01, 0.001, 1], cv=10), SVR(),
        KNeighborsRegressor(), DecisionTreeRegressor(), RandomForestRegressor(),
        GradientBoostingRegressor(), AdaBoostRegressor(), XGBRegressor(),
        # Stacked ensemble of the same base models, with Ridge as the meta-regressor
        StackingCVRegressor(
            regressors=(LinearRegression(), Ridge(alpha=0.1),
                        RidgeCV(alphas=[0.1, 0.01, 0.001, 1], cv=10), Lasso(alpha=0.1),
                        LassoCV(alphas=[0.1, 0.01, 0.001, 1], cv=10), SVR(),
                        KNeighborsRegressor(), DecisionTreeRegressor(), RandomForestRegressor(),
                        GradientBoostingRegressor(), AdaBoostRegressor(), XGBRegressor()),
            meta_regressor=Ridge(), use_features_in_secondary=True, cv=30),
    ]
    algorithm_names = ["Linear Regression", "Ridge", "RidgeCV", "Lasso", "LassoCV", "SVR",
                       "KNeighbors Regressor", "Decision-Tree Regressor", "Random-Forest Regressor",
                       "Gradient Boosting Regressor", "Ada-Boost Regressor", "XGB-Regressor",
                       "Stacked Regressor"]
    
    # Errors for training data
    Mean_Squared_Error_Training = []
    Mean_Absolute_Error_Training = []
    Accuracy_Training = []
    
    # Errors for testing data
    Mean_Squared_Error_Testing = []
    Mean_Absolute_Error_Testing = []
    Accuracy_Testing = []
    
    # Regression models
    for i in algorithms:
        model = i
        model.fit(x_train,y_train)
    
        y_test_predict = model.predict(x_test)
        y_train_predict = model.predict(x_train)
            
        mse_1 = round(mean_squared_error(y_train, y_train_predict),4)
        mae_1 = round(mean_absolute_error(y_train, y_train_predict),4)
        acc_1 = round((1-mean_absolute_percentage_error(y_train, y_train_predict))*100,4)
        
        mse_2 = round(mean_squared_error(y_test, y_test_predict),4)
        mae_2 = round(mean_absolute_error(y_test, y_test_predict),4)
        acc_2 = round((1-mean_absolute_percentage_error(y_test, y_test_predict))*100,4)
        
        # Appending the Errors into the list for training data
        Mean_Squared_Error_Training.append(mse_1)
        Mean_Absolute_Error_Training.append(mae_1)
        Accuracy_Training.append(acc_1)
                
        # Appending the Errors into the list for testing data
        Mean_Squared_Error_Testing.append(mse_2)
        Mean_Absolute_Error_Testing.append(mae_2)
        Accuracy_Testing.append(acc_2)
        
    # Creating DataFrame for Logs of Models and their errors    
    results = pd.DataFrame({"Models":algorithm_names,
                            "Mean Squared Error Training":Mean_Squared_Error_Training,
                            "Mean Absolute Error Training":Mean_Absolute_Error_Training,
                            "Accuracy_Training %":Accuracy_Training,                          
                            "Mean Squared Error Testing":Mean_Squared_Error_Testing,
                            "Mean Absolute Error Testing":Mean_Absolute_Error_Testing,
                            "Accuracy Testing %":Accuracy_Testing})

    return results.sort_values("Accuracy Testing %", ascending=False).reset_index(drop=True)

In [None]:
results = predictive_models()
results
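
The "Accuracy" columns in the results table are not classification accuracy. They are computed as 100 × (1 − MAPE), where MAPE is the mean absolute percentage error. A minimal worked example with invented numbers shows the calculation:

```python
import numpy as np
from sklearn.metrics import mean_absolute_percentage_error

# Invented values for illustration only
y_true = np.array([50.0, 80.0])
y_hat = np.array([55.0, 72.0])   # each prediction is off by 10 % of the true value

mape = mean_absolute_percentage_error(y_true, y_hat)
accuracy_pct = round((1 - mape) * 100, 4)
print(accuracy_pct)  # → 90.0
```

Because life expectancy values are large relative to typical errors, this score tends to sit close to 100 even for weak models, so the MSE and MAE columns are worth reading alongside it.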

# XGBRegressor
### Hyperparameter Tuning 

### RandomizedSearchCV for hyperparameter tuning

In [None]:
xgb_model = XGBRegressor()

# Parameter dictionary for RandomizedSearchCV
parameters = {'learning_rate': [.03, 0.05, .07], 
              'max_depth': [4, 5, 6, 7, 8, 9, 10],
              'min_child_weight': [3, 4, 5, 6, 7, 8],
              'subsample': [0.6,0.7,0.8],
              'colsample_bytree': [0.6,0.7,0.8],
              'n_estimators': [100,200,300,400,500]
             }
# Using RandomizedSearchCV()
xgb_random_cv = RandomizedSearchCV(estimator=xgb_model, param_distributions=parameters, n_iter=100, cv=2, verbose=2)
xgb_random_cv.fit(x_train, y_train)

In [None]:
# Best Parameters for XGBregressor by RandomizedSearchCV
best_param = xgb_random_cv.best_params_
best_param

### GridSearchCV for a thorough search of the best hyperparameters
We use GridSearchCV to search exhaustively over a small grid centred on the best parameters found by RandomizedSearchCV for XGBRegressor.

In [None]:
# Parameter grid for GridSearchCV
parameters = {'learning_rate': [best_param["learning_rate"]-0.01, best_param["learning_rate"], best_param["learning_rate"]+0.01], 
              'max_depth': [best_param["max_depth"]-1, best_param["max_depth"], best_param["max_depth"]+1],
              'min_child_weight': [best_param["min_child_weight"]-1, best_param["min_child_weight"], best_param["min_child_weight"]+1],
              'subsample': [best_param["subsample"]-0.05, best_param["subsample"], best_param["subsample"]+0.05],
              'colsample_bytree': [best_param["colsample_bytree"]-0.1, best_param["colsample_bytree"], best_param["colsample_bytree"]+0.1],
              'n_estimators': [best_param["n_estimators"],best_param["n_estimators"]+50,best_param["n_estimators"]+100,best_param["n_estimators"]+150]
             }

# Using GridSearchCV()
xgb_grid = GridSearchCV(xgb_model, parameters, cv= 2, n_jobs=-1, verbose=3)
xgb_grid.fit(x_train, y_train)

In [None]:
# Best Hyperparameters for XGBRegressor by GridSearchCV
best_param_gridCV = xgb_grid.best_params_
best_param_gridCV
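
The grid above steps one increment either side of each value found by the randomized search. With the ranges used here every candidate stays valid, but if the search ranges were widened (for example `subsample` up to 1.0), a step could leave the allowed interval. One way to guard against that is a small helper that clamps candidates to given bounds. The name `neighbourhood` and its bounds arguments are hypothetical and not part of the notebook.

```python
def neighbourhood(value, step, low, high, ndigits=4):
    """Return value-step, value, value+step, keeping only candidates in [low, high]."""
    # Rounding removes floating-point noise such as 0.7000000000000001
    candidates = [round(value + k * step, ndigits) for k in (-1, 0, 1)]
    return sorted({c for c in candidates if low <= c <= high})

print(neighbourhood(0.8, 0.1, 0.0, 1.0))  # → [0.7, 0.8, 0.9]
print(neighbourhood(1.0, 0.1, 0.0, 1.0))  # → [0.9, 1.0]
print(neighbourhood(4, 1, 1, 20))         # → [3, 4, 5]
```

Each list could then be used as one entry of the parameter grid, for instance `'subsample': neighbourhood(best_param["subsample"], 0.05, 0.0, 1.0)`.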

In [None]:
xgb_regressor = XGBRegressor(colsample_bytree = best_param_gridCV["colsample_bytree"],
                             learning_rate = best_param_gridCV["learning_rate"], 
                             max_depth = best_param_gridCV["max_depth"], 
                             min_child_weight = best_param_gridCV["min_child_weight"], 
                             n_estimators = best_param_gridCV["n_estimators"], 
                             subsample = best_param_gridCV["subsample"])

In [None]:
xgb_regressor.fit(x_train, y_train)

In [None]:
y_pred = xgb_regressor.predict(x_test)

### Evaluation of Model 

In [None]:
mse = round(mean_squared_error(y_test, y_pred),4)
mae = round(mean_absolute_error(y_test, y_pred),4)
acc = round((1-mean_absolute_percentage_error(y_test, y_pred))*100,4)

In [None]:
print(" Mean Squared Error = ",mse)
print("Mean Absolute Error = ",mae)
print("           Accuracy = ",acc)


In [None]:
print("Prediction v/s Actual values")
plt.scatter(y_pred, y_test)
plt.xlabel("Predicted life expectancy")
plt.ylabel("Actual life expectancy")
plt.title("Prediction v/s Actual values")
plt.show()

In [None]:
print("Prediction Error Plot")
print(prediction_error(xgb_regressor, x_train, y_train, x_test, y_test))

In [None]:
plt.figure(figsize=(15,5))

plt.subplot(1,2,1)
sns.histplot(y_test)

plt.subplot(1,2,2)
sns.histplot(y_pred, color='r')

plt.show()

### Predictions using XGBoost Regression

#### - Feature List to give as input

In [None]:
feature_list = list(x.columns)
print("Number of Features = ", len(feature_list))
feature_list

#### - Input to model

In [None]:
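def input_data():
    # Fit the scaler on the same unscaled columns that were standardized for training.
    # Fitting on the single input row would turn every value into 0.
    sc = StandardScaler().fit(df[feature_list])
    input_dictionary = {}

    for i in feature_list:
        print("Enter ", i)
        # float() instead of eval(): never evaluate raw user input
        input_dictionary[i] = float(input())

    # Build the single-row frame only after every feature has been entered
    data = pd.DataFrame(input_dictionary, columns=feature_list, index=[0])
    data = pd.DataFrame(sc.transform(data), columns=feature_list)

    prediction = xgb_regressor.predict(data)
    return prediction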