# Predicting Life Expectancy with Neural Networks

This project explores the factors that influence life expectancy across countries. It uses the World Health Organization's (WHO) Global Health Observatory (GHO) dataset, which spans 2000 to 2015 and covers immunization, mortality, economic, social, and other health-related indicators.

### Project Context

The project designs, trains, and evaluates a neural network regression model that predicts life expectancy. It considers commonly studied variables such as demographic factors, income composition, and mortality rates, alongside often-overlooked ones such as immunization and the human development index.

### Dataset Overview

- **Temporal Coverage:** 2000 to 2015
- **Indicators:** Immunization, Mortality, Economic, Social, Health-related Factors
- **Target Variable:** Life Expectancy (Years)

### Initial Data Exploration

The dataset is loaded and explored first. Column names are stripped of stray whitespace and capitalized consistently, and an initial inspection gives an overview of the data's structure and summary statistics.

### Data Preprocessing

The 'Country' column is dropped so that the model learns general patterns rather than country-specific ones. The data is then split into labels and features, the categorical 'Status' column is one-hot encoded, and the data is divided into training and test sets. Finally, the numerical features are standardized with a scaler fitted on the training set only.

### Building the Neural Network Model

With TensorFlow and Keras, a neural network model is crafted. The architecture includes an input layer, a hidden layer with 64 neurons and a ReLU activation function, and an output layer with a single neuron for regression predictions.
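
Written as an equation, this architecture amounts to the expression below. The input width of 21 is inferred from the 1,408 hidden-layer parameters reported in the model summary, and the symbols $W_1$, $\mathbf{b}_1$, $\mathbf{w}_2$, and $b_2$ are notation introduced here for illustration only; ReLU is applied elementwise as $\max(0, z)$.

```latex
\hat{y} = \mathbf{w}_2^{\top}\,\operatorname{ReLU}\!\left(W_1 \mathbf{x} + \mathbf{b}_1\right) + b_2,
\qquad
W_1 \in \mathbb{R}^{64 \times 21},\quad
\mathbf{b}_1 \in \mathbb{R}^{64},\quad
\mathbf{w}_2 \in \mathbb{R}^{64},\quad
b_2 \in \mathbb{R}
```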

### Model Compilation

The model is compiled using the Mean Squared Error (MSE) loss function and Mean Absolute Error (MAE) as a metric. The Adam optimizer with a learning rate of 0.01 is employed.
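
For reference, the two quantities can be written as follows, where $y_i$ is the observed life expectancy for sample $i$, $\hat{y}_i$ is the model's prediction, and $n$ is the number of samples (notation introduced here for illustration):

```latex
\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2,
\qquad
\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right|
```

MSE penalizes large errors more heavily because of the square, while MAE stays in the same unit as the target (years), which makes it easier to interpret.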

### Model Training and Evaluation

The model is trained for 50 epochs with a batch size of 1, holding out 20% of the training data for validation during training. It is then evaluated on the test set, and the MSE and MAE on the test data are reported.

### Key Findings

The project aims to uncover patterns that contribute significantly to life expectancy. On the held-out test set, the trained model reaches a mean squared error of 9.94 and a mean absolute error of 2.47, so its predictions are off by roughly 2.5 years on average.



------------

## Data Loading & Exploration

In [61]:
import pandas as pd

# Load the data
dataset = pd.read_csv('data/life_expectancy.csv', delimiter=';', header=0)
dataset

Unnamed: 0,Country,Year,Status,Adult Mortality,infant deaths,Alcohol,percentage expenditure,Hepatitis B,Measles,BMI,...,Total expenditure,Diphtheria,HIV/AIDS,GDP,Population,thinness 1-19 years,thinness 5-9 years,Income composition of resources,Schooling,Life expectancy
0,Afghanistan,2015,Developing,263,62,0.01,71.279624,65,1154,19.1,...,8.16,65,0.1,584.259210,33736494.0,17.2,17.3,0.479,10.1,65.0
1,Afghanistan,2014,Developing,271,64,0.01,73.523582,62,492,18.6,...,8.18,62,0.1,612.696514,327582.0,17.5,17.5,0.476,10.0,59.9
2,Afghanistan,2013,Developing,268,66,0.01,73.219243,64,430,18.1,...,8.13,64,0.1,631.744976,31731688.0,17.7,17.7,0.470,9.9,59.9
3,Afghanistan,2012,Developing,272,69,0.01,78.184215,67,2787,17.6,...,8.52,67,0.1,669.959000,3696958.0,17.9,18.0,0.463,9.8,59.5
4,Afghanistan,2011,Developing,275,71,0.01,7.097109,68,3013,17.2,...,7.87,68,0.1,63.537231,2978599.0,18.2,18.2,0.454,9.5,59.2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2933,Zimbabwe,2004,Developing,723,27,4.36,0.000000,68,31,27.1,...,7.13,65,33.6,454.366654,12777511.0,9.4,9.4,0.407,9.2,44.3
2934,Zimbabwe,2003,Developing,715,26,4.06,0.000000,7,998,26.7,...,6.52,68,36.7,453.351155,12633897.0,9.8,9.9,0.418,9.5,44.5
2935,Zimbabwe,2002,Developing,73,25,4.43,0.000000,73,304,26.3,...,6.53,71,39.8,57.348340,125525.0,1.2,1.3,0.427,10.0,44.8
2936,Zimbabwe,2001,Developing,686,25,1.72,0.000000,76,529,25.9,...,6.16,75,42.1,548.587312,12366165.0,1.6,1.7,0.427,9.8,45.3


In [62]:
# Remove leading spaces & capitalize initial letter of each column name
dataset.columns = dataset.columns.str.strip().str.capitalize()
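
A side effect of `capitalize()` is that it lowercases everything after the first character, which is why acronyms show up as `Bmi` and `Hiv/aids` in the outputs below. The sketch here uses plain Python strings, which behave the same way as the pandas `.str` methods for this purpose; the list `raw_names` and its stray spaces are hypothetical examples, and the acronym-preserving variant is just one possible alternative.

```python
# Hypothetical raw column names, some with stray whitespace
raw_names = [" BMI ", "HIV/AIDS", "infant deaths", "Life expectancy "]

# Same logic as .str.strip().str.capitalize(): trims, then lowercases all but the first letter
cleaned = [name.strip().capitalize() for name in raw_names]

# Alternative: uppercase only the first character and leave the rest untouched
keep_acronyms = [name.strip()[:1].upper() + name.strip()[1:] for name in raw_names]

print(cleaned)
print(keep_acronyms)
```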

In [63]:
# Inspect the data
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2938 entries, 0 to 2937
Data columns (total 22 columns):
 #   Column                           Non-Null Count  Dtype  
---  ------                           --------------  -----  
 0   Country                          2938 non-null   object 
 1   Year                             2938 non-null   int64  
 2   Status                           2938 non-null   object 
 3   Adult mortality                  2938 non-null   int64  
 4   Infant deaths                    2938 non-null   int64  
 5   Alcohol                          2938 non-null   float64
 6   Percentage expenditure           2938 non-null   float64
 7   Hepatitis b                      2938 non-null   int64  
 8   Measles                          2938 non-null   int64  
 9   Bmi                              2938 non-null   float64
 10  Under-five deaths                2938 non-null   int64  
 11  Polio                            2938 non-null   int64  
 12  Total expenditure   

In [64]:
# Get the summary statistics
dataset.describe()

Unnamed: 0,Year,Adult mortality,Infant deaths,Alcohol,Percentage expenditure,Hepatitis b,Measles,Bmi,Under-five deaths,Polio,Total expenditure,Diphtheria,Hiv/aids,Gdp,Population,Thinness 1-19 years,Thinness 5-9 years,Income composition of resources,Schooling,Life expectancy
count,2938.0,2938.0,2938.0,2938.0,2938.0,2938.0,2938.0,2938.0,2938.0,2938.0,2938.0,2938.0,2938.0,2938.0,2938.0,2938.0,2938.0,2938.0,2938.0,2938.0
mean,2007.51872,164.725664,30.303948,4.546875,738.251295,83.022124,2419.59224,38.381178,42.035739,82.617767,5.924098,82.393125,1.742103,6611.523863,10230850.0,4.821886,4.852144,0.630362,12.009837,69.234717
std,4.613841,124.086215,117.926501,3.921946,1987.914858,22.996984,11467.272489,19.935375,160.445548,23.367166,2.40077,23.655562,5.077785,13296.603449,54022420.0,4.397621,4.485854,0.20514,3.265139,9.509115
min,2000.0,1.0,0.0,0.01,0.0,1.0,0.0,1.0,0.0,3.0,0.37,2.0,0.1,1.68135,34.0,0.1,0.1,0.0,0.0,36.3
25%,2004.0,74.0,0.0,1.0925,4.685343,82.0,0.0,19.4,0.0,78.0,4.37,78.0,0.1,580.486996,418917.2,1.6,1.6,0.50425,10.3,63.2
50%,2008.0,144.0,3.0,3.755,64.912906,92.0,17.0,43.5,4.0,93.0,5.755,93.0,0.1,1766.947595,1386542.0,3.3,3.3,0.677,12.3,72.1
75%,2012.0,227.0,22.0,7.39,441.534144,96.0,360.25,56.1,28.0,97.0,7.33,97.0,0.8,4779.40519,4584371.0,7.1,7.2,0.772,14.1,75.6
max,2015.0,723.0,1800.0,17.87,19479.91161,99.0,212183.0,87.3,2500.0,99.0,17.6,99.0,50.6,119172.7418,1293859000.0,27.7,28.6,0.948,20.7,89.0


Next I drop the country column, because I want the model to learn a general pattern across all countries rather than patterns tied to specific ones.

In [65]:
# Drop the 'Country' column
dataset.drop(['Country'], inplace=True, axis=1) 
dataset

Unnamed: 0,Year,Status,Adult mortality,Infant deaths,Alcohol,Percentage expenditure,Hepatitis b,Measles,Bmi,Under-five deaths,...,Total expenditure,Diphtheria,Hiv/aids,Gdp,Population,Thinness 1-19 years,Thinness 5-9 years,Income composition of resources,Schooling,Life expectancy
0,2015,Developing,263,62,0.01,71.279624,65,1154,19.1,83,...,8.16,65,0.1,584.259210,33736494.0,17.2,17.3,0.479,10.1,65.0
1,2014,Developing,271,64,0.01,73.523582,62,492,18.6,86,...,8.18,62,0.1,612.696514,327582.0,17.5,17.5,0.476,10.0,59.9
2,2013,Developing,268,66,0.01,73.219243,64,430,18.1,89,...,8.13,64,0.1,631.744976,31731688.0,17.7,17.7,0.470,9.9,59.9
3,2012,Developing,272,69,0.01,78.184215,67,2787,17.6,93,...,8.52,67,0.1,669.959000,3696958.0,17.9,18.0,0.463,9.8,59.5
4,2011,Developing,275,71,0.01,7.097109,68,3013,17.2,97,...,7.87,68,0.1,63.537231,2978599.0,18.2,18.2,0.454,9.5,59.2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2933,2004,Developing,723,27,4.36,0.000000,68,31,27.1,42,...,7.13,65,33.6,454.366654,12777511.0,9.4,9.4,0.407,9.2,44.3
2934,2003,Developing,715,26,4.06,0.000000,7,998,26.7,41,...,6.52,68,36.7,453.351155,12633897.0,9.8,9.9,0.418,9.5,44.5
2935,2002,Developing,73,25,4.43,0.000000,73,304,26.3,40,...,6.53,71,39.8,57.348340,125525.0,1.2,1.3,0.427,10.0,44.8
2936,2001,Developing,686,25,1.72,0.000000,76,529,25.9,39,...,6.16,75,42.1,548.587312,12366165.0,1.6,1.7,0.427,9.8,45.3


Next I split the data into labels and features.

In [66]:
# Assign final column to labels
labels = dataset.iloc[:, -1]

# Assign all but the final column to features
features = dataset.iloc[:, :-1]

## Data Preprocessing

In [67]:
# Apply one-hot encoder to features
features = pd.get_dummies(features)
features

Unnamed: 0,Year,Adult mortality,Infant deaths,Alcohol,Percentage expenditure,Hepatitis b,Measles,Bmi,Under-five deaths,Polio,...,Diphtheria,Hiv/aids,Gdp,Population,Thinness 1-19 years,Thinness 5-9 years,Income composition of resources,Schooling,Status_Developed,Status_Developing
0,2015,263,62,0.01,71.279624,65,1154,19.1,83,6,...,65,0.1,584.259210,33736494.0,17.2,17.3,0.479,10.1,False,True
1,2014,271,64,0.01,73.523582,62,492,18.6,86,58,...,62,0.1,612.696514,327582.0,17.5,17.5,0.476,10.0,False,True
2,2013,268,66,0.01,73.219243,64,430,18.1,89,62,...,64,0.1,631.744976,31731688.0,17.7,17.7,0.470,9.9,False,True
3,2012,272,69,0.01,78.184215,67,2787,17.6,93,67,...,67,0.1,669.959000,3696958.0,17.9,18.0,0.463,9.8,False,True
4,2011,275,71,0.01,7.097109,68,3013,17.2,97,68,...,68,0.1,63.537231,2978599.0,18.2,18.2,0.454,9.5,False,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2933,2004,723,27,4.36,0.000000,68,31,27.1,42,67,...,65,33.6,454.366654,12777511.0,9.4,9.4,0.407,9.2,False,True
2934,2003,715,26,4.06,0.000000,7,998,26.7,41,7,...,68,36.7,453.351155,12633897.0,9.8,9.9,0.418,9.5,False,True
2935,2002,73,25,4.43,0.000000,73,304,26.3,40,73,...,71,39.8,57.348340,125525.0,1.2,1.3,0.427,10.0,False,True
2936,2001,686,25,1.72,0.000000,76,529,25.9,39,76,...,75,42.1,548.587312,12366165.0,1.6,1.7,0.427,9.8,False,True
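
The two `Status_*` columns above carry the same information, since one is always the opposite of the other. As a possible variation (not what this notebook does), `pd.get_dummies` accepts `drop_first=True` to keep a single indicator column. The sketch below uses a small hypothetical frame named `toy` rather than the real dataset.

```python
import pandas as pd

# Hypothetical miniature of the features table
toy = pd.DataFrame({
    "Year": [2015, 2014, 2015],
    "Status": ["Developing", "Developing", "Developed"],
})

# Only the non-numeric 'Status' column is encoded; drop_first removes the
# first category ('Developed'), leaving one indicator column
encoded = pd.get_dummies(toy, drop_first=True)

print(encoded)
```

With a single binary column the model still sees whether a country is developing, and the input layer has one fewer feature.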


In [68]:
from sklearn.model_selection import train_test_split

# Split data into training and test sets
features_train, features_test, labels_train, labels_test = train_test_split(features, labels, test_size=0.2, random_state=42)

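Next I standardize the numerical features. The scaler is fitted on the training set only and then applied to the test set, so no test-set statistics leak into training.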

In [69]:
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer

# Identify numeric features
numerical_features = features.select_dtypes(include=['float64', 'int64'])
numerical_columns = numerical_features.columns

# Instantiate ColumnTransformer using StandardScaler on numeric columns
ct = ColumnTransformer([("only numeric", StandardScaler(), numerical_columns)], remainder='passthrough')

# Apply scaler to training and test data
features_train_scaled = ct.fit_transform(features_train)
features_test_scaled = ct.transform(features_test)
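
To make the fit-on-train, transform-on-test pattern concrete, here is a minimal hand-rolled sketch of what standardization does to a single column. It uses only the standard library; the toy values are borrowed from the first few 'Adult mortality' entries shown earlier, and the split into `train_values` and `test_values` is purely illustrative.

```python
from statistics import fmean, pstdev

train_values = [263.0, 271.0, 268.0, 272.0]  # toy "training" column
test_values = [275.0]                        # toy "test" column

# Statistics come from the training data only
mu = fmean(train_values)
sigma = pstdev(train_values)  # population standard deviation, as StandardScaler uses

# The same mu and sigma are reused for the test data
train_scaled = [(v - mu) / sigma for v in train_values]
test_scaled = [(v - mu) / sigma for v in test_values]

print(mu, sigma)
print(train_scaled)
print(test_scaled)
```

After scaling, the training column has mean 0 and standard deviation 1, while the test value is expressed relative to the training distribution rather than its own.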

## Building the Model

In [73]:
import tensorflow as tf

# Define the model
model = tf.keras.models.Sequential()

# Create input layer with shape corresponding to the number of features in dataset
# (named input_layer to avoid shadowing Python's built-in input())
input_layer = tf.keras.Input(shape=(features.shape[1],))

# Add layers to network model
model.add(input_layer) # Add input layer to model
model.add(tf.keras.layers.Dense(64, activation='relu')) # Add hidden layer to model
model.add(tf.keras.layers.Dense(1)) # Add output layer to model - one neuron since we need a single output for a regression prediction

# Print model summary (summary() prints directly and returns None, so no print() is needed)
model.summary()

Model: "sequential_2"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 dense (Dense)               (None, 64)                1408      
                                                                 
 dense_1 (Dense)             (None, 1)                 65        
                                                                 
Total params: 1473 (5.75 KB)
Trainable params: 1473 (5.75 KB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
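
The parameter counts in the summary can be checked by hand: a dense layer has one weight per input-unit pair plus one bias per unit. The helper `dense_params` below is a hypothetical function written for this check, and the feature count of 21 is an assumption derived from the tables above (19 numeric columns plus the two one-hot `Status` columns).

```python
def dense_params(n_inputs: int, n_units: int) -> int:
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units

n_features = 21  # assumption: 19 numeric columns + 2 one-hot Status columns

hidden = dense_params(n_features, 64)  # hidden layer with 64 neurons
output = dense_params(64, 1)           # single-neuron output layer
total = hidden + output

print(hidden, output, total)
```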


## Initializing the optimizer and compiling the model

In [74]:
# Create an instance of the Adam optimizer with the learning rate equal to 0.01
opt = tf.keras.optimizers.Adam(learning_rate=0.01)

# Compile model
model.compile(loss='mse', metrics=['mae'], optimizer=opt)

## Fit and evaluate the model

In [75]:
# Train model with the Sequential.fit() method
model.fit(features_train_scaled, labels_train, epochs=50, batch_size=1, verbose=1, validation_split=0.2)

Epoch 1/50
...
Epoch 50/50


<keras.src.callbacks.History at 0x22f21a458d0>

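With `batch_size=1` the weights are updated once per training sample, which makes each epoch slow. The rough count below estimates how many updates that is. It assumes that `train_test_split` rounds the test share up and that Keras reserves the last 20% of the training data for validation by rounding the training share down; the variable names are introduced here for illustration.

```python
import math

n_rows = 2938           # from dataset.info()
test_size = 0.2         # passed to train_test_split
validation_split = 0.2  # passed to model.fit
batch_size = 1
epochs = 50

# Assumed rounding: test share rounded up, the rest is training data
n_test = math.ceil(n_rows * test_size)
n_train_full = n_rows - n_test

# Assumed rounding: samples used for fitting rounded down, the rest is validation
n_fit = math.floor(n_train_full * (1 - validation_split))
n_val = n_train_full - n_fit

steps_per_epoch = math.ceil(n_fit / batch_size)
total_updates = steps_per_epoch * epochs

print(n_test, n_fit, n_val)
print(steps_per_epoch, total_updates)
```

A larger batch size, such as 32, would cut the number of updates per epoch considerably, at the cost of noisier comparisons with the results reported here.
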
In [77]:
# Evaluate model with the Sequential.evaluate() method
test_mse, test_mae = model.evaluate(features_test_scaled, labels_test, verbose=0)
print('Mean squared error on test data: {:.2f}'.format(test_mse))
print('Mean absolute error on test data: {:.2f}'.format(test_mae))

Mean squared error on test data: 9.94
Mean absolute error on test data: 2.47
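
To put these numbers in context, the MSE can be converted back to years by taking its square root, and compared against the spread of the target itself. The sketch below is a rough back-of-the-envelope check, not part of the original evaluation: it reuses the standard deviation of 'Life expectancy' from `dataset.describe()`, which was computed on the full dataset rather than the test set, so the variance share is only approximate.

```python
import math

test_mse = 9.94       # reported above
test_mae = 2.47       # reported above
label_std = 9.509115  # std of 'Life expectancy' from dataset.describe()

# Root mean squared error, in years
rmse = math.sqrt(test_mse)

# Rough share of target variance captured by the model (assumes the
# test-set variance is close to the full-dataset variance)
explained = 1 - test_mse / label_std ** 2

print(f"RMSE: {rmse:.2f} years")
print(f"MAE: {test_mae:.2f} years")
print(f"Approximate share of variance explained: {explained:.2f}")
```

An RMSE noticeably above the MAE suggests that a minority of predictions have larger errors, since squaring weights those more heavily.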
