# <p><center style="font-family:newtimeroman;font-size:180%;">Forecasting Life Expectancy By Regression </center></p>
### Table of contents:

* [Introduction](#1)
* [Import Libraries](#2)
* [Import Dataset](#3)
* [Visualization](#4)
* [Preprocessing and Feature Engineering](#5)
* [Using GridSearch to find optimal hyperparameters](#6)
* [Train and Evaluate Models](#7)

<a id="1"></a>
# <p style="background-image: url(https://i.postimg.cc/K87ByXmr/stage5.jpg);font-family:camtasia;font-size:120%;color:white;text-align:center;border-radius:15px 50px; padding:7px"> Introduction</p>

## Forecasting Life Expectancy By Regression 
This notebook forecasts the life expectancy of countries from health, economic, and demographic factors. Life expectancy is a statistical measure of the average time a person in a given population is expected to live.

## Dataset
The models in this notebook are trained on a country-level life expectancy dataset. Data analysis and regression models are used to explore which factors are associated with differences in longevity among countries, in particular the link between healthcare, socioeconomic factors, and life expectancy. The features of the dataset are listed below:

1. Country: Country under study

2. Year: Year of the observation

3. Status: Development status of the country

4. Population: Population of the country

5. Hepatitis B: Percentage of 1-year-olds immunized against hepatitis B

6. Measles: Number of reported measles cases per 1000 people

7. Polio: Percentage of 1-year-olds immunized against polio

8. Diphtheria: Percentage of 1-year-olds immunized against diphtheria

9. HIV/AIDS: Number of deaths caused by HIV/AIDS among children aged 0-4 per 1000 live births

10. infant deaths: Number of infant deaths per 1000 people

11. under-five deaths: Number of deaths of children under 5 years old per 1000 people

12. Total expenditure: Government health expenditure as a percentage of total government expenditure

13. GDP: Gross domestic product

14. BMI: Average body mass index of the country's entire population

15. thinness 1-19 years: Prevalence of thinness among people aged 1 to 19, in percent

16. Alcohol: Alcohol consumption in liters among people over 15 years old

17. Schooling: Number of years of schooling

18. Life expectancy: Life expectancy of the country (the prediction target)

<a id="2"></a>
# <p style="background-image: url(https://i.postimg.cc/K87ByXmr/stage5.jpg);font-family:camtasia;font-size:120%;color:white;text-align:center;border-radius:15px 50px; padding:7px">Import Libraries </p>

In [None]:
import numpy as np
import pandas as pd 
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_theme(style="darkgrid")
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor, ExtraTreesRegressor


import warnings
warnings.filterwarnings('ignore')


<a id="3"></a>
# <p style="background-image: url(https://i.postimg.cc/K87ByXmr/stage5.jpg);font-family:camtasia;font-size:120%;color:white;text-align:center;border-radius:15px 50px; padding:7px">Import Dataset </p>
<a class="btn" href="#home">Table of Contents</a>

In [None]:
# Import dataset
train_data = pd.read_csv("/kaggle/input/countries-life-expectancy/life_expectancy.csv")
train_data.head()

In [None]:
# More details about dataset
train_data.info()

<a id="4"></a>
# <p style="background-image: url(https://i.postimg.cc/K87ByXmr/stage5.jpg);font-family:camtasia;font-size:120%;color:white;text-align:center;border-radius:15px 50px; padding:7px">Visualization </p>
<a class="btn" href="#home">Table of Contents</a>

In [None]:
# Exploring Relationships between Variables and Life Expectancy

class RelationshipPlotter:
    
    def __init__(self, data):
        self.data = data
        self.fig, self.axes = plt.subplots(nrows=4, ncols=4, figsize=(12, 12))  # Create a 4x4 subplot grid
    
    def plot(self, row, col, x, y, plot_type):
        ax = self.axes[row, col]  # Select the appropriate subplot
        if plot_type == 'scatter':
            sns.scatterplot(data=self.data, x=x, y=y, ax=ax)
        elif plot_type == 'box':
            sns.boxplot(data=self.data, x=x, y=y, ax=ax)
        ax.set_title(f"{x.capitalize()} vs. {y.capitalize()}")

    def show_plots(self):
        plt.tight_layout()  # Adjust spacing between subplots
        plt.show()

columns_of_box = ['Year', 'Status']       
columns_of_scat = ['Population', 'Hepatitis B', 'Measles',
       'Polio', 'Diphtheria', 'HIV/AIDS', 'infant deaths', 'under-five deaths',
       'Total expenditure', 'GDP', 'BMI', 'thinness  1-19 years', 'Alcohol',
       'Schooling']         
        
plotter = RelationshipPlotter(data=train_data)

# Plot the relationships
for i, col in enumerate(columns_of_box):
    plotter.plot(i // 4, i % 4, x=col, y='Life expectancy', plot_type='box')
    
for i, col in enumerate(columns_of_scat):
    plotter.plot((i + len(columns_of_box)) // 4, (i + len(columns_of_box)) % 4, x=col, y='Life expectancy', plot_type='scatter')

plotter.show_plots()

<a id="5"></a>
# <p style="background-image: url(https://i.postimg.cc/K87ByXmr/stage5.jpg);font-family:camtasia;font-size:120%;color:white;text-align:center;border-radius:15px 50px; padding:7px">Preprocessing and Feature Engineering </p>
<a class="btn" href="#home">Table of Contents</a>

In [None]:
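# Split the data into training and validation sets
Y = train_data['Life expectancy']
X = train_data.drop(columns=['Life expectancy'])
# random_state fixes the shuffle so that the split and the reported scores are reproducible
X_train, X_val, y_train, y_val = train_test_split(X, Y, test_size=0.2, shuffle=True, random_state=42)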

In [None]:
# Impute missing values in the training and validation sets
class DataImputer:
    # Columns to impute
    COLUMNS = ['Population', 'Hepatitis B', 'Polio', 'Diphtheria', 'Total expenditure',
               'thinness  1-19 years', 'GDP', 'BMI', 'Alcohol', 'Schooling']

    def __init__(self, data):
        self.data = data

    def fill_nan_with_group_mean(self, column):
        # Replace NaNs with the mean of the same country
        means = self.data.groupby('Country')[column].mean()
        self.data[column] = self.data[column].fillna(self.data['Country'].map(means))

    def fill_nan_with_median(self, column):
        # Fallback for countries with no observed value: use the overall median
        median = self.data[column].median()
        # Assign the result back instead of calling fillna(inplace=True) on a column
        # selection, which may not modify the original DataFrame in recent pandas versions
        self.data[column] = self.data[column].fillna(median)

    def fill_nan_values(self):
        # Apply the group mean first, then the median fallback, to every column
        for column in self.COLUMNS:
            self.fill_nan_with_group_mean(column)
            self.fill_nan_with_median(column)


# Impute X_train in place
imputer_train = DataImputer(X_train)
imputer_train.fill_nan_values()

# Impute X_val in place (statistics are computed on X_val itself)
imputer_val = DataImputer(X_val)
imputer_val.fill_nan_values()

In [None]:
# Check the datasets for remaining missing values
X_train.info()
X_val.info()

In [None]:
def drop_col(df):
    return df.drop(labels=['Country', 'Year'], axis=1)

X_train = drop_col(X_train)
X_val = drop_col(X_val)

In [None]:
def preprocess_data(data):
    # Create dummy variables for 'Status' column
    dummies = pd.get_dummies(data['Status'], dtype=int)
    data = pd.concat([data, dummies], axis=1)

    # Drop 'Status' column
    data = data.drop(labels='Status', axis=1)
    
    return data

# Preprocess the training data
X_train = preprocess_data(X_train)

# Preprocess the validation data
X_val = preprocess_data(X_val)

In [None]:
# Scale the features using StandardScaler
scaler = StandardScaler()

# Fit on the training set only, then apply the same scaling to the validation set
scaled_x_train = scaler.fit_transform(X_train)
scaled_x_val = scaler.transform(X_val)

<a id="6"></a>
# <p style="background-image: url(https://i.postimg.cc/K87ByXmr/stage5.jpg);font-family:camtasia;font-size:120%;color:white;text-align:center;border-radius:15px 50px; padding:7px"> Using GridSearch to find optimal hyperparameters </p>
<a class="btn" href="#home">Table of Contents</a>

GridSearchCV is a technique used in machine learning for hyperparameter tuning, which is the process of finding the best combination of hyperparameters for a given model. Hyperparameters are adjustable parameters that are not learned from the data but are set prior to the model training. GridSearchCV systematically searches through a specified grid of hyperparameter values and evaluates the model's performance using cross-validation to determine the optimal set of hyperparameters.
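
To get a feel for the cost of an exhaustive search, the sketch below counts the candidate combinations in a grid before running it. It relies on scikit-learn's `ParameterGrid`; the grid itself, the name `example_grid`, and the fold count of 5 are assumptions made only for this illustration.

```python
from sklearn.model_selection import ParameterGrid

# Illustrative grid (hypothetical values, chosen only for this sketch)
example_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [None, 5, 10],
    'min_samples_split': [2, 5, 10],
}

# ParameterGrid enumerates every combination: 3 * 3 * 3 candidates
n_candidates = len(ParameterGrid(example_grid))

cv_folds = 5  # assumed number of cross-validation folds

# Number of models trained during the search (excluding the final refit)
n_fits = n_candidates * cv_folds

print(f"{n_candidates} candidates x {cv_folds} folds = {n_fits} fits")
```

Each additional hyperparameter multiplies the number of fits, so large grids on ensemble models can take a long time. Shortening the value lists is one way to keep the search tractable.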

In [None]:
# Define a list of regression models
models = [
    LinearRegression(),
    DecisionTreeRegressor(),
    RandomForestRegressor(),
    GradientBoostingRegressor(),
    ExtraTreesRegressor(),
]

# Train and evaluate each model
for model in models:
    model.fit(scaled_x_train, y_train)
    y_pred = model.predict(scaled_x_val)
    mse = mean_squared_error(y_val, y_pred)
    mae = mean_absolute_error(y_val, y_pred)
    r2 = r2_score(y_val, y_pred)
    #print(f"Model: {type(model).__name__}, mse: {mse}")
    #print(f"Model: {type(model).__name__}, mae: {mae}")
    print(f"Model: {type(model).__name__}, r2: {r2}")


In [None]:
# Define the hyperparameters to tune
param_grid = {
    'max_depth': [None, 5, 10],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'max_features': [None, 'sqrt', 'log2']  # None uses all features; 'auto' was removed in scikit-learn 1.3
}

# Create the Decision Tree Regressor model
model = DecisionTreeRegressor()

# Perform grid search
grid_search = GridSearchCV(model, param_grid, scoring='r2', cv=5)
grid_search.fit(scaled_x_train, y_train)

# Print the best parameters and best R2 score
print("Best Parameters:", grid_search.best_params_)
print("Best R2 Score:", grid_search.best_score_)

In [None]:
# Define the hyperparameters to tune
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [5, 10],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'max_features': [None, 'sqrt', 'log2']  # None uses all features; 'auto' was removed in scikit-learn 1.3
}

# Create the RandomForestRegressor model
model = RandomForestRegressor()

# Perform grid search
grid_search = GridSearchCV(model, param_grid, scoring='r2', cv=5)
grid_search.fit(scaled_x_train, y_train)

# Print the best parameters and best R2 score
print("Best Parameters:", grid_search.best_params_)
print("Best R2 Score:", grid_search.best_score_)

In [None]:
# Define the hyperparameters to tune
param_grid = {
    'n_estimators': [100, 200, 300],
    'learning_rate': [0.01, 0.1, 1.0],
    'max_depth': [3, 5, 7],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'max_features': [None, 'sqrt', 'log2']  # None uses all features; 'auto' was removed in scikit-learn 1.3
}

# Create the GradientBoostingRegressor model
model = GradientBoostingRegressor()

# Perform grid search
grid_search = GridSearchCV(model, param_grid, scoring='r2', cv=5)
grid_search.fit(scaled_x_train, y_train)

# Print the best parameters and best R2 score
print("Best Parameters:", grid_search.best_params_)
print("Best R2 Score:", grid_search.best_score_)

In [None]:
# Define the hyperparameters to tune
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [None, 5, 10],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'max_features': [None, 'sqrt', 'log2']  # None uses all features; 'auto' was removed in scikit-learn 1.3
}

# Create the ExtraTreesRegressor model
model = ExtraTreesRegressor()

# Perform grid search
grid_search = GridSearchCV(model, param_grid, scoring='r2', cv=5)
grid_search.fit(scaled_x_train, y_train)

# Print the best parameters and best R2 score
print("Best Parameters:", grid_search.best_params_)
print("Best R2 Score:", grid_search.best_score_)

<a id="7"></a>
# <p style="background-image: url(https://i.postimg.cc/K87ByXmr/stage5.jpg);font-family:camtasia;font-size:120%;color:white;text-align:center;border-radius:15px 50px; padding:7px"> Train and Evaluate Models  </p>
<a class="btn" href="#home">Table of Contents</a>

In [None]:
# Comparison of Predicted vs True Values and Residual Plot
class ModelComparisonPlot:
    def __init__(self, model_name):
        self.model_name = model_name
    
    def plot_comparison(self, y_val, y_pred, mse, mae, r2):
        # Create a figure with two subplots
        fig, axes = plt.subplots(1, 2, figsize=(11, 5))

        # Plot the predicted vs true values
        sns.regplot(x=y_val, y=y_pred, color='blue', scatter_kws={'alpha':0.5}, ax=axes[0])
        axes[0].plot([y_val.min(), y_val.max()], [y_val.min(), y_val.max()], 'k--', lw=2)
        axes[0].set_xlabel('True values', fontsize=12)
        axes[0].set_ylabel('Predicted values', fontsize=12)
        axes[0].set_title('Predicted vs true values')
        axes[0].grid(color='lightgray', linestyle='--', linewidth=0.5)

        # Plot the residuals vs predicted values
        residuals = y_val - y_pred
        # Plain scatter: sns.residplot would fit another regression and plot its residuals instead
        axes[1].scatter(y_pred, residuals, color='blue', alpha=0.5)
        axes[1].axhline(0, color='k', linestyle='--', lw=2)
        axes[1].set_xlabel('Predicted values', fontsize=12)
        axes[1].set_ylabel('Residuals', fontsize=12)
        axes[1].set_title('Residual plot', fontsize=15)
        axes[1].grid(color='lightgray', linestyle='--', linewidth=0.5)

        # Add a title to the figure
        fig.suptitle('Comparison of Predicted vs True Values and Residual Plot\n{}'.format(self.model_name), fontsize=15)

        # Adjust the spacing between subplots
        plt.subplots_adjust(wspace=0.4)

        # Display the figure with the title
        plt.show()

# Model 1: LinearRegression

In [None]:
# Polynomial Regression Model Evaluation
poly_transformer = PolynomialFeatures(degree=2, include_bias=True)
x_train_trans = poly_transformer.fit_transform(scaled_x_train)
x_val_trans = poly_transformer.transform(scaled_x_val)
model_LR = LinearRegression()
model_LR.fit(x_train_trans, y_train)
y_pred = model_LR.predict(x_val_trans)
mse = mean_squared_error(y_val, y_pred)
mae = mean_absolute_error(y_val, y_pred)
r2 = r2_score(y_val, y_pred)
print(f"Model: {type(model_LR).__name__}, mse: {mse}")
print(f"Model: {type(model_LR).__name__}, mae: {mae}")
print(f"Model: {type(model_LR).__name__}, r2: {r2}")

In [None]:
# Create instances of ModelComparisonPlot for each model
model_LR_plot = ModelComparisonPlot('LinearRegression')
# Plot comparison for each model
model_LR_plot.plot_comparison(y_val, y_pred, mse, mae, r2)

# Model 2: DecisionTreeRegressor

In [None]:
# max_features=None uses all features (the former 'auto' option was removed in scikit-learn 1.3)
model_DTR = DecisionTreeRegressor(max_depth=None, max_features=None, min_samples_leaf=2, min_samples_split=10)
model_DTR.fit(scaled_x_train, y_train)
y_pred = model_DTR.predict(scaled_x_val)
mse = mean_squared_error(y_val, y_pred)
mae = mean_absolute_error(y_val, y_pred)
r2 = r2_score(y_val, y_pred)
print(f"Model: {type(model_DTR).__name__}, mse: {mse}")
print(f"Model: {type(model_DTR).__name__}, mae: {mae}")
print(f"Model: {type(model_DTR).__name__}, r2: {r2}")

In [None]:
# Create instances of ModelComparisonPlot for each model
model_DTR_plot = ModelComparisonPlot('DecisionTreeRegressor')
# Plot comparison for each model
model_DTR_plot.plot_comparison(y_val, y_pred, mse, mae, r2)

# Model 3: RandomForestRegressor

In [None]:
# max_features=None uses all features (the former 'auto' option was removed in scikit-learn 1.3)
model_RFR = RandomForestRegressor(max_depth=10, max_features=None, min_samples_leaf=1, min_samples_split=2, n_estimators=300)
model_RFR.fit(scaled_x_train, y_train)
y_pred = model_RFR.predict(scaled_x_val)
mse = mean_squared_error(y_val, y_pred)
mae = mean_absolute_error(y_val, y_pred)
r2 = r2_score(y_val, y_pred)
print(f"Model: {type(model_RFR).__name__}, mse: {mse}")
print(f"Model: {type(model_RFR).__name__}, mae: {mae}")
print(f"Model: {type(model_RFR).__name__}, r2: {r2}")

In [None]:
# Create instances of ModelComparisonPlot for each model
model_RFR_plot = ModelComparisonPlot('RandomForestRegressor')
# Plot comparison for each model
model_RFR_plot.plot_comparison(y_val, y_pred, mse, mae, r2)

# Model 4: GradientBoostingRegressor

In [None]:
model_GBR = GradientBoostingRegressor(learning_rate= 0.1, max_depth= 7, max_features= 'sqrt', min_samples_leaf= 2, min_samples_split= 2, n_estimators= 200)
model_GBR.fit(scaled_x_train, y_train)
y_pred = model_GBR.predict(scaled_x_val)
mse = mean_squared_error(y_val, y_pred)
mae = mean_absolute_error(y_val, y_pred)
r2 = r2_score(y_val, y_pred)
print(f"Model: {type(model_GBR).__name__}, mse: {mse}")
print(f"Model: {type(model_GBR).__name__}, mae: {mae}")
print(f"Model: {type(model_GBR).__name__}, r2: {r2}")

In [None]:
# Create instances of ModelComparisonPlot for each model
model_GBR_plot = ModelComparisonPlot('GradientBoostingRegressor')
# Plot comparison for each model
model_GBR_plot.plot_comparison(y_val, y_pred, mse, mae, r2)

# Model 5: ExtraTreesRegressor

In [None]:
# max_features=None uses all features (the former 'auto' option was removed in scikit-learn 1.3)
model_ETR = ExtraTreesRegressor(max_depth=None, max_features=None, min_samples_leaf=1, min_samples_split=2, n_estimators=300)
model_ETR.fit(scaled_x_train, y_train)
y_pred = model_ETR.predict(scaled_x_val)
mse = mean_squared_error(y_val, y_pred)
mae = mean_absolute_error(y_val, y_pred)
r2 = r2_score(y_val, y_pred)
print(f"Model: {type(model_ETR).__name__}, mse: {mse}")
print(f"Model: {type(model_ETR).__name__}, mae: {mae}")
print(f"Model: {type(model_ETR).__name__}, r2: {r2}")

In [None]:
# Create instances of ModelComparisonPlot for each model
model_ETR_plot = ModelComparisonPlot('ExtraTreesRegressor')
# Plot comparison for each model
model_ETR_plot.plot_comparison(y_val, y_pred, mse, mae, r2)