<div style="width: 100%; height: 100px; background-color: #11009E; border: 3px solid #8696FE; text-align: center; line-height: 100px; color: #C4B0FF; font-size: 24px; font-weight: bold; border-radius:6px;">
    👵🏻 Countries Life Expectancy Prediction 🧓🏻| Polynomial Linear Regression
</div>

![1.jpg](attachment:4970124e-79b0-4268-adb9-ed61484aae78.jpg)

<div style="width: 100%; background-color: #11009E; color: white; padding: 20px; border: 3px solid #8696FE; margin-bottom: 20px;border-radius:10px;">
    <h3 style="color: #C4B0FF;">Introduction</h3>
    <span >The research on life expectancy in countries takes the spotlight in the notebook's machine learning model. Substantial data analysis and predictive algorithms are used to uncover the reasons causing differences in longevity among countries. With the aid of strong statistical tools, valuable insights into the complex link between healthcare, socioeconomic factors, and life expectancy are sought. Each line of code in this project strives to unravel the riddles behind human lifespan, ultimately contributing to a deeper knowledge of global health inequities.
    </span> 
    <h3 style="color: #C4B0FF;">Tasks in this notebook</h3>
    <ul style="list-style-type: none; padding-left: 0;">
        <li><span style="margin-left: -10px;">&#8226;</span> Dataset overview</li>
        <li><span style="margin-left: -10px;">&#8226;</span> Import libraries</li>
        <li><span style="margin-left: -10px;">&#8226;</span> Read dataset and get information from data</li>
        <li><span style="margin-left: -10px;">&#8226;</span> Cleaning Data</li>
        <li><span style="margin-left: -10px;">&#8226;</span> Data visualization</li>
        <li><span style="margin-left: -10px;">&#8226;</span> Features</li>
        <li>
            <span style="margin-left: -10px;">&#8226;</span> Modeling
            <ul style="list-style-type: none; padding-left: 20px;">
                <li><span style="margin-left: -10px;">&#8226;</span> Linear Regression</li>
                <li><span style="margin-left: -10px;">&#8226;</span> KNeighbors Regressor</li>
                <li><span style="margin-left: -10px;">&#8226;</span> Decision Tree Regressor</li>
            </ul>
        </li>
        <li><span style="margin-left: -10px;">&#8226;</span>Predictions visualization</li>
    </ul>
</div>


<h2 style="position: relative;">
    <span style="color: #11009E;">Dataset Overview</span> 
    <br/>
    <br/>
    <br/>
    <hr style="position: absolute; bottom: -8px; border: none; height: 4px; width: 100%; background-color: #8696FE;">
</h2>

<center>
<table style="direction: rtl; line-height: 200%; font-family: vazir; font-size: medium;">
  <tr>
    <th>Description</th>
    <th>Column</th>
  </tr>
  <tr>
    <td>Country under study</td>
    <td><code>Country</code></td>
  </tr>
  <tr>
    <td>Year</td>
    <td><code>Year</code></td>
  </tr>
  <tr>
    <td>Status of the country's development</td>
    <td><code>Status</code></td>
  </tr>
  <tr>
    <td>Population of country</td>
    <td><code>Population</code></td>
  </tr>
  <tr>
    <td>Percentage of people finally one year old who were immunized against hepatitis B</td>
    <td><code>Hepatitis B</code></td>
  </tr>
  <tr>
    <td>The number of reported measles cases per 1000 people</td>
    <td><code>Measles</code></td>
  </tr>
  <tr>
    <td>Percentage of 1-year-olds immunized against polio</td>
    <td><code>Polio</code></td>
  </tr>
  <tr>
    <td>Percentage of people finally one year old who were immunized against diphtheria</td>
    <td><code>Diphtheria</code></td>
  </tr>
  <tr>
    <td>The number of deaths caused by AIDS of the last 4-year-olds who were born alive per 1000 people</td>
    <td><code>HIV/AIDS</code></td>
  </tr>
  <tr>
    <td>The number of infant deaths per 1000 people</td>
    <td><code>infant deaths</code></td>
  </tr>
  <tr>
    <td>The number of deaths of people under 5 years old per 1000 people</td>
    <td><code>under-five deaths</code></td>
  </tr>
  <tr>
    <td>The ratio of government medical-health expenses to total government expenses in percentage</td>
    <td><code>Total expenditure</code></td>
  </tr>
  <tr>
    <td>Gross domestic product</td>
    <td><code>GDP</code></td>
  </tr>
  <tr>
    <td>The average body mass index of the entire population of the country</td>
    <td><code>BMI</code></td>
  </tr>
  <tr>
    <td>Prevalence of thinness among people 19 years old in percentage</td>
    <td><code>thinness 1-19 years</code></td>
  </tr>
  <tr>
    <td>Liters of alcohol consumption among people over 15 years old</td>
    <td><code>Alcohol</code></td>
  </tr>
  <tr>
    <td>The number of years that people study</td>
    <td><code>Schooling</code></td>
  </tr>
  <tr>
    <td>Country life expectancy</td>
    <td><code><b>Life expectancy [target variable]</b></code></td>
  </tr>
</table>
</center>

<h2 style="position: relative;">
    <span style="color: #11009E;">Import libraries</span> 
    <br/>
    <br/>
    <br/>
    <hr style="position: absolute; bottom: -8px; border: none; height: 4px; width: 100%; background-color: #8696FE;">
</h2>


In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.graph_objects as go
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split , GridSearchCV, cross_val_score
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

In [2]:
import warnings
warnings.filterwarnings("ignore")

In [3]:
from plotly.offline import init_notebook_mode, iplot
init_notebook_mode(connected=True)

<h2 style="position: relative;">
    <span style="color: #11009E;">Read dataset and get information from data</span> 
    <br/>
    <br/>
    <br/>
    <hr style="position: absolute; bottom: -8px; border: none; height: 4px; width: 100%; background-color: #8696FE;">
</h2>


In [4]:
df = pd.read_csv('data/life_expectancy.csv')
df.head(3)

Unnamed: 0,Country,Year,Status,Population,Hepatitis B,Measles,Polio,Diphtheria,HIV/AIDS,infant deaths,under-five deaths,Total expenditure,GDP,BMI,thinness 1-19 years,Alcohol,Schooling,Life expectancy
0,Afghanistan,2015,Developing,33736494.0,65.0,1154,6.0,65.0,0.1,62,83,8.16,584.25921,19.1,17.2,0.01,10.1,65.0
1,Afghanistan,2014,Developing,327582.0,62.0,492,58.0,62.0,0.1,64,86,8.18,612.696514,18.6,17.5,0.01,10.0,59.9
2,Afghanistan,2013,Developing,31731688.0,64.0,430,62.0,64.0,0.1,66,89,8.13,631.744976,18.1,17.7,0.01,9.9,59.9


In [5]:
row, col = df.shape
print("This Dataset have",row,"rows and",col,"columns.")

This Dataset have 2848 rows and 18 columns.


In [6]:
print("Number of duplicate data : ",df.duplicated().sum())

Number of duplicate data :  0


In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2848 entries, 0 to 2847
Data columns (total 18 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   Country               2848 non-null   object 
 1   Year                  2848 non-null   int64  
 2   Status                2848 non-null   object 
 3   Population            2204 non-null   float64
 4   Hepatitis B           2306 non-null   float64
 5   Measles               2848 non-null   int64  
 6   Polio                 2829 non-null   float64
 7   Diphtheria            2829 non-null   float64
 8   HIV/AIDS              2848 non-null   float64
 9   infant deaths         2848 non-null   int64  
 10  under-five deaths     2848 non-null   int64  
 11  Total expenditure     2627 non-null   float64
 12  GDP                   2406 non-null   float64
 13  BMI                   2816 non-null   float64
 14  thinness  1-19 years  2816 non-null   float64
 15  Alcohol              

In [8]:
df.describe()

Unnamed: 0,Year,Population,Hepatitis B,Measles,Polio,Diphtheria,HIV/AIDS,infant deaths,under-five deaths,Total expenditure,GDP,BMI,thinness 1-19 years,Alcohol,Schooling,Life expectancy
count,2848.0,2204.0,2306.0,2848.0,2829.0,2829.0,2848.0,2848.0,2848.0,2627.0,2406.0,2816.0,2816.0,2660.0,2688.0,2848.0
mean,2007.5,12834570.0,81.076756,2083.082163,82.68222,82.451396,1.756461,28.359902,39.5,5.935577,7664.398813,38.503374,4.84723,4.638932,12.060156,69.347402
std,4.610582,61960940.0,25.019068,10249.107207,23.434954,23.693936,5.148935,117.188032,159.800866,2.504439,14466.241793,19.955485,4.443695,4.064721,3.32016,9.528332
min,2000.0,34.0,1.0,0.0,3.0,2.0,0.1,0.0,0.0,0.37,1.68135,1.0,0.1,0.01,0.0,36.3
25%,2003.75,196758.5,77.0,0.0,78.0,78.0,0.1,0.0,0.0,4.24,477.541713,19.5,1.6,0.93,10.2,63.5
50%,2007.5,1391756.0,92.0,16.0,93.0,93.0,0.1,3.0,4.0,5.76,1841.08683,43.9,3.3,3.785,12.4,72.2
75%,2011.25,7438947.0,97.0,336.75,97.0,97.0,0.7,20.0,25.0,7.53,6265.658907,56.2,7.125,7.81,14.3,75.8
max,2015.0,1293859000.0,99.0,212183.0,99.0,99.0,50.6,1800.0,2500.0,17.6,119172.7418,77.6,27.7,17.87,20.7,89.0


In [9]:
df.describe(include = "object")

Unnamed: 0,Country,Status
count,2848,2848
unique,178,2
top,Afghanistan,Developing
freq,16,2352


<h2 style="position: relative;">
    <span style="color: #11009E;">Cleaning data</span> 
    <br/>
    <br/>
    <br/>
    <hr style="position: absolute; bottom: -8px; border: none; height: 4px; width: 100%; background-color: #8696FE;">
</h2>

In [10]:
df.isna().sum()

Country                   0
Year                      0
Status                    0
Population              644
Hepatitis B             542
Measles                   0
Polio                    19
Diphtheria               19
HIV/AIDS                  0
infant deaths             0
under-five deaths         0
Total expenditure       221
GDP                     442
BMI                      32
thinness  1-19 years     32
Alcohol                 188
Schooling               160
Life expectancy           0
dtype: int64

In [11]:
class Preprocessing(): 
    def __init__(self):
        self.col_means = {}
        self.col_medians = {}
    
    def fit(self, data):
        cols_with_na = data.isna().sum()[data.isna().sum()>0].index.tolist()
        for col in cols_with_na:
            self.col_means[col] = data.groupby('Country')[col].transform('mean')
            self.col_medians[col] = data[col].median()

    def transform(self, data):
        cols_with_na = data.isna().sum()[data.isna().sum()>0].index.tolist()
        for col in cols_with_na:
            mean_value = self.col_means[col]
            data.loc[:, col].fillna(mean_value, inplace=True)

        cols_with_na = data.isna().sum()[data.isna().sum()>0].index.tolist()
        for col in cols_with_na:
            median_value = self.col_medians[col]
            data.loc[:, col].fillna(median_value, inplace=True)
        
        data['Status'].replace({'Developing' : 0, 'Developed' : 1,}, inplace=True)
        return data
    
    def fit_transform(self, data):
        self.fit(data)
        return self.transform(data)

In [12]:
preprocesser = Preprocessing()

df = preprocesser.fit_transform(df)

<span style="color: #11009E;">
Missing values in the dataset were handled throughout the data cleaning process by replacing them with the mean value of the related characteristic across every country. The goal of this technique was to ensure consistency and data integrity. The dataset became more complete by imputing missing values in this manner, allowing for more accurate analysis and modeling. In this case, the mean-based replacement approach proved to be an excellent solution for dealing with missing numbers.
</span> 

<h2 style="position: relative;">
    <span style="color: #11009E;">Data visualization</span> 
    <br/>
    <br/>
    <br/>
    <hr style="position: absolute; bottom: -8px; border: none; height: 4px; width: 100%; background-color: #8696FE;">
</h2>

In [13]:
fig = px.pie(df, names='Status')
fig

In [14]:
go.Figure(
    data=[go.Histogram(x=df["Life expectancy"], xbins={"start": 36.0, "end": 90.0, "size": 1.0})],
    layout=go.Layout(title="Histogram of Life expectancy", yaxis={"title": "Count"}, bargap=0.05),
)

In [15]:
df_corr = df.corr()

fig = px.imshow(df_corr,
                labels=dict(x="Features", y="Features"),
                x=df_corr.columns,
                y=df_corr.columns,
                color_continuous_scale="Blues",
                color_continuous_midpoint=0)

fig.update_layout(
    title="Correlation Heatmap",
    width=800,
    height=500,
    xaxis_showgrid=False,
    yaxis_showgrid=False,
    yaxis_autorange='reversed')

fig.show()

<h2 style="position: relative;">
    <span style="color: #11009E;">Features</span> 
    <br/>
    <br/>
    <br/>
    <hr style="position: absolute; bottom: -8px; border: none; height: 4px; width: 100%; background-color: #8696FE;">
</h2>

In [16]:
df.drop(['Country'], axis=1, inplace=True)

In [17]:
X = df.drop('Life expectancy', axis=1)
y = df['Life expectancy']

<h2 style="position: relative;">
    <span style="color: #11009E;">Modeling</span> 
    <br/>
    <br/>
    <br/>
    <hr style="position: absolute; bottom: -8px; border: none; height: 4px; width: 100%; background-color: #8696FE;">
</h2>


<h4 style="position: relative;">
    <span style="color: #11009E;">LinearRegression</span> 
    <br/>
    <br/>
    <hr style="position: absolute; bottom: -8px; border: none; height: 4px; width: 100%; background-color: #8696FE;">
</h4>

In [18]:
model = LinearRegression()

test_sizes = [0.1, 0.15, 0.2, 0.3]
random_states = [0, 1, 42, 43, 100, 313]

best_test_size = None
best_random_state = None
best_r2_score = -float('inf')

std_scaler = StandardScaler()
poly_transformer = PolynomialFeatures(degree=2)


for test_size in test_sizes:
    for random_state in random_states:
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=random_state)
        X_train = std_scaler.fit_transform(X_train)
        X_test = std_scaler.transform(X_test)
        poly_transformer.fit(X_train)
        poly_features = poly_transformer.transform(X_train)
        model.fit(poly_features, y_train)
        valid_poly_features = poly_transformer.transform(X_test)
        y_pred = model.predict(valid_poly_features)
        r2 = r2_score(y_test, y_pred)
        if r2 > best_r2_score:
            best_r2_score = r2
            best_test_size = test_size
            best_random_state = random_state

print(f"Best test size: {best_test_size}")
print(f"Best random state: {best_random_state}")
print(f"Best R2 score: {best_r2_score}")

Best test size: 0.1
Best random state: 42
Best R2 score: 0.895767062644041


<h4 style="position: relative;">
    <span style="color: #11009E;">KNN regressor</span> 
    <br/>
    <br/>
    <hr style="position: absolute; bottom: -8px; border: none; height: 4px; width: 100%; background-color: #8696FE;">
</h4>

In [19]:
model = KNeighborsRegressor()

test_sizes = [0.1, 0.15, 0.2, 0.3]
random_states = [0, 1, 42, 43, 100, 313]

best_test_size = None
best_random_state = None
best_r2_score = -float('inf')

for test_size in test_sizes:
    for random_state in random_states:
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=random_state)
        X_train = std_scaler.fit_transform(X_train)
        X_test = std_scaler.transform(X_test)
        poly_transformer.fit(X_train)
        poly_features = poly_transformer.transform(X_train)
        model.fit(poly_features, y_train)
        valid_poly_features = poly_transformer.transform(X_test)
        y_pred = model.predict(valid_poly_features)
        r2 = r2_score(y_test, y_pred)
        if r2 > best_r2_score:
            best_r2_score = r2
            best_test_size = test_size
            best_random_state = random_state

print(f"Best test size: {best_test_size}")
print(f"Best random state: {best_random_state}")
print(f"Best R2 score: {best_r2_score}")

Best test size: 0.1
Best random state: 42
Best R2 score: 0.926595171136862


In [20]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=42)

KNNmodel = KNeighborsRegressor()

param_grid = {
    'n_neighbors': range(3,11,2),
    'weights': ['uniform', 'distance'],
    'p': [1, 2]
}

X_train = std_scaler.fit_transform(X_train)
X_test = std_scaler.transform(X_test)

poly_transformer.fit(X_train)
poly_features = poly_transformer.transform(X_train)

grid_search = GridSearchCV(estimator=KNNmodel, param_grid=param_grid, scoring='r2', cv=5)
grid_search.fit(poly_features, y_train)

best_r2_score = grid_search.best_score_
best_params = grid_search.best_params_
print(f"Best R2 score: {best_r2_score}")
print(f"Best hyperparameters: {best_params}")

best_model = grid_search.best_estimator_
valid_poly_features = poly_transformer.transform(X_test)
y_pred = model.predict(valid_poly_features)
test_r2 = r2_score(y_test, y_pred)
print(f"R2 score on test set: {test_r2}")

Best R2 score: 0.9270451763868264
Best hyperparameters: {'n_neighbors': 3, 'p': 1, 'weights': 'distance'}
R2 score on test set: 0.9313019019782571


In [21]:
KNNmodel = KNeighborsRegressor(**best_params)
KNNmodel.fit(poly_features, y_train)

In [22]:
y_pred = KNNmodel.predict(valid_poly_features)
r2_score(y_test, y_pred)

0.957159546055142

<h4 style="position: relative;">
    <span style="color: #11009E;">Decision Tree regressor</span> 
    <br/>
    <br/>
    <hr style="position: absolute; bottom: -8px; border: none; height: 4px; width: 100%; background-color: #8696FE;">
</h4>

In [23]:
model = DecisionTreeRegressor()

test_sizes = [0.1, 0.15, 0.2, 0.3]
random_states = [0, 1, 42, 43, 100, 313]

best_test_size = None
best_random_state = None
best_r2_score = -float('inf')

for test_size in test_sizes:
    for random_state in random_states:
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=random_state)
        X_train = std_scaler.fit_transform(X_train)
        X_test = std_scaler.transform(X_test)
        poly_transformer.fit(X_train)
        poly_features = poly_transformer.transform(X_train)
        model.fit(poly_features, y_train)
        valid_poly_features = poly_transformer.transform(X_test)
        y_pred = model.predict(valid_poly_features)
        r2 = r2_score(y_test, y_pred)
        if r2 > best_r2_score:
            best_r2_score = r2
            best_test_size = test_size
            best_random_state = random_state

print(f"Best test size: {best_test_size}")
print(f"Best random state: {best_random_state}")
print(f"Best R2 score: {best_r2_score}")

Best test size: 0.1
Best random state: 42
Best R2 score: 0.9185730421448025


In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=100)

DTRmodel = DecisionTreeRegressor()
cv_scores = cross_val_score(model, X_train, y_train, cv=5, scoring='r2')

param_grid = {
    'max_depth': [None, 5, 10, 15, 20],
    'min_samples_split': [15, 20, 25, 30, 35],
    'min_samples_leaf': [1, 2, 4, 6, 8, 10]
}

X_train = std_scaler.fit_transform(X_train)
X_test = std_scaler.transform(X_test)

poly_transformer.fit(X_train)
poly_features = poly_transformer.transform(X_train)

grid_search = GridSearchCV(estimator=DTRmodel, param_grid=param_grid, scoring='r2', cv=5)
grid_search.fit(poly_features, y_train)

best_r2_score = grid_search.best_score_
best_params = grid_search.best_params_
print(f"Best R2 score: {best_r2_score}")
print(f"Best hyperparameters: {best_params}")

best_model = grid_search.best_estimator_
valid_poly_features = poly_transformer.transform(X_test)
y_pred = model.predict(valid_poly_features)
test_r2 = r2_score(y_test, y_pred)
print(f"R2 score on test set: {test_r2}")

In [None]:
DTRmodel = DecisionTreeRegressor(**best_params)
DTRmodel.fit(poly_features, y_train)

In [None]:
y_pred = DTRmodel.predict(valid_poly_features)
r2_score(y_test, y_pred)

<div style="width: 100%; height: 100px; background-color: #11009E; border: 3px solid #8696FE; text-align: center; line-height: 100px; color: #C4B0FF; font-size: 24px; font-weight: bold; border-radius:6px;">
    By improving our model we reached 95.7% accuracy 🏆
</div>

<span style="color: #11009E;">
The accuracy of the KNN regressor model was 95.7%, showing how useful it is in predicting life expectancy. The model effectively caught the underlying patterns and trends in the dataset by analyzing and using surrounding data points. Its excellent accuracy makes it a trustworthy tool for calculating life expectancy in various nations. The model's performance demonstrates its ability to provide useful insights into the field of global health study.
</span> 

In [None]:
X_final = std_scaler.fit_transform(X)
poly_transformer.fit(X_final)
final_poly_features = poly_transformer.transform(X_final)
y_final = KNNmodel.predict(final_poly_features)

<h2 style="position: relative;">
    <span style="color: #11009E;">Prediction visualization</span> 
    <br/>
    <br/>
    <br/>
    <hr style="position: absolute; bottom: -8px; border: none; height: 4px; width: 100%; background-color: #8696FE;">
</h2>

In [None]:
comparison_df = pd.DataFrame({'Actual': y, 'Predicted': y_final})

fig = px.scatter(comparison_df, x='Actual', y='Predicted', color='Actual')
fig.update_layout(
    title='Comparison of Actual vs. Predicted',
    xaxis_title='Actual',
    yaxis_title='Predicted'
)
fig.show()

In [None]:
plt.figure(dpi=300)
sns.histplot(df['Life expectancy'], kde=True, label='Real Values')
sns.histplot(y_final, kde=True, label='Predicted Values')
plt.xlabel('Life Expectancy')
plt.ylabel('Frequency')
plt.title('Distribution')
plt.legend()
plt.show();

In [None]:
plt.figure(figsize=(8, 6),dpi=300)
plt.scatter(range(len(y)), y, color='#185ADB', label='Actual Values')
plt.scatter(range(len(y_final)), y_final, color='#FC5C9C', label='Predicted Values')
plt.xlabel('Index')
plt.ylabel('Life Expectancy')
plt.title('Actual Values vs. Predicted Values')
plt.legend()
plt.show()

In [None]:
residuals = y - y_final

plt.figure(figsize=(8, 6), dpi=300)
plt.scatter(y_final, residuals, color='#2B3467')
plt.axhline(y=0, color='orange', linestyle='--')
plt.xlabel('Predicted Values')
plt.ylabel('Residuals')
plt.title('Residual Plot')
plt.show();

In [None]:
plt.figure(figsize=(8, 6),dpi=300)
plt.plot(y, y, color='#98DFD6', label='Ideal Line')
plt.scatter(y, y_final, color='#00235B', label='Predicted Values')
plt.plot(np.unique(y), np.poly1d(np.polyfit(y, y_final, 1))(np.unique(y)), color='#FFDD83', label='Regression Line')
plt.xlabel('True Values')
plt.ylabel('Predicted Values')
plt.title('Predicted vs. True Line Plot')
plt.legend()
plt.show()


<div style="width: 100%; height: 100px; background-color: #11009E; border: 3px solid #8696FE; text-align: center; line-height: 100px; color: #C4B0FF; font-size: 24px; font-weight: bold; border-radius:6px;">
    Thanks for paying attention to this notebook ♥
</div>