# Final Project: Predicting Life Expectancy Using Regression Analysis: A Data-Driven Approach

Author: Data-Git-Hub <br>
GitHub Project Repository Link: https://github.com/Data-Git-Hub/applied-ml-data-git-hub <br>
5 April 2025 <br>

### Introduction and Background

Life expectancy is a central metric used to assess the health and development status of populations across the globe. It reflects the average number of years a person can expect to live, assuming that current mortality rates persist. Given its sensitivity to a wide range of health, social, and economic conditions, life expectancy serves as a comprehensive indicator for public health monitoring and policy evaluation (Marmot, 2005). Accurately predicting life expectancy using relevant predictors can enable governments, health organizations, and researchers to better understand the drivers of longevity and identify intervention points to improve population health outcomes. <br>

This project utilizes the Life Expectancy (WHO) dataset, which is publicly available on Kaggle (Kumar, 2017). The dataset spans from the year 2000 to 2015 and includes 193 countries, capturing both developing and developed regions. Variables in the dataset represent a diverse set of indicators including adult mortality, alcohol consumption, hepatitis B immunization coverage, GDP, BMI, HIV/AIDS prevalence, schooling, and health expenditure. <br>

The goal of this project is to apply regression modeling techniques to predict life expectancy using various socioeconomic and health-related features. This involves exploring the structure and quality of the data, selecting relevant features, applying and comparing multiple regression models, and interpreting the results to draw actionable insights. <br>

### Analytical Framework and Methodology

#### Exploratory Data Analysis (EDA)

The first step in any data-driven investigation is an in-depth exploratory data analysis (EDA). Through visualizations such as histograms, scatterplots, boxplots, and heatmaps, patterns and distributions can be observed, while anomalies, missing values, and potential outliers can be identified. EDA serves to inform subsequent modeling decisions and aids in understanding the relationships between predictors and the target variable. <br>

According to Tukey (1977), exploratory data analysis plays a critical role in revealing structure in data that may not be apparent through formal statistical modeling. It promotes the generation of insights and hypotheses by relying on graphical techniques and descriptive summaries. <br>
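As a minimal sketch of the descriptive side of EDA, the snippet below ranks features by their Pearson correlation with the target. The DataFrame `toy` and its values are made up for illustration and are not taken from the WHO dataset; only the column names mirror it.

```python
import pandas as pd

# Hypothetical toy data; values are illustrative only, not from the WHO dataset
toy = pd.DataFrame({
    "Life expectancy": [55.0, 60.0, 65.0, 70.0, 75.0, 80.0],
    "Schooling":       [6.0, 8.0, 9.5, 11.0, 13.0, 15.0],
    "Adult Mortality": [400.0, 330.0, 280.0, 200.0, 150.0, 80.0],
})

# Pearson correlation of each feature with the target, strongest first
corr_with_target = (
    toy.corr()["Life expectancy"]
    .drop("Life expectancy")
    .sort_values(key=abs, ascending=False)
)
print(corr_with_target)
```

The same correlation matrix can be passed to a heatmap function such as `seaborn.heatmap` for a graphical view. <br>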

#### Data Preparation and Preprocessing

Data preprocessing includes: <br>

- Handling missing values using statistical imputation or row-wise deletion where necessary.

- Encoding categorical variables, such as country development status, into numerical formats using label or one-hot encoding.

- Outlier detection and treatment based on inter-quartile ranges or z-scores.

- Feature scaling, particularly through standardization, so that features measured on very different scales are comparable during model training.

These preprocessing steps limit the influence of incomplete, inconsistent, or differently scaled inputs on the fitted models, which supports both predictive performance and interpretability (Han, Pei, & Kamber, 2011). <br>
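The four preprocessing steps listed above can be sketched as follows. This is an illustration under stated assumptions, not the notebook's final implementation: the DataFrame `toy` and its values are hypothetical, median imputation and 1.5 × IQR clipping are just one reasonable choice for each step, and the statistics are computed on the full toy data for brevity.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data; values are illustrative only
toy = pd.DataFrame({
    "Status": ["Developing", "Developed", "Developing", "Developing", "Developed"],
    "GDP": [500.0, 40000.0, np.nan, 1200.0, 35000.0],
    "Measles": [10.0, 0.0, 25.0, 5000.0, 2.0],
})

# 1. Impute missing numeric values with the column median
toy["GDP"] = toy["GDP"].fillna(toy["GDP"].median())

# 2. One-hot encode the categorical column (drop_first avoids a redundant column)
toy = pd.get_dummies(toy, columns=["Status"], drop_first=True, dtype=int)

# 3. Clip outliers to the 1.5 * IQR fences
q1, q3 = toy["Measles"].quantile([0.25, 0.75])
iqr = q3 - q1
toy["Measles"] = toy["Measles"].clip(q1 - 1.5 * iqr, q3 + 1.5 * iqr)

# 4. Standardize numeric features to zero mean and unit variance
num_cols = ["GDP", "Measles"]
toy[num_cols] = StandardScaler().fit_transform(toy[num_cols])

print(toy)
```

In a real workflow the imputation, clipping, and scaling statistics would be estimated on the training split only and then applied to the test split. <br>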

#### Feature Selection and Engineering

Feature selection involves identifying the most influential variables to include in the model. This step can be guided by domain knowledge, correlation analysis, or automated methods such as recursive feature elimination (RFE). In the context of life expectancy, prior research suggests that variables such as income level, education, health spending, and disease prevalence are strong predictors (Preston, Heuveline, & Guillot, 2001). Feature engineering may involve creating new variables by combining existing ones (e.g., healthcare expenditure per capita) or transforming variables to better capture non-linear relationships. <br>
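A minimal sketch of recursive feature elimination, assuming synthetic data in which only the first two of five features influence the target. The arrays `X` and `y` are generated for illustration; the choice of `LinearRegression` as the estimator and of two retained features is an assumption, not a decision made in this project.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

# Synthetic data: only the first two of five features drive the target
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

# Repeatedly refit and drop the weakest feature until two remain
selector = RFE(LinearRegression(), n_features_to_select=2)
selector.fit(X, y)

print(selector.support_)   # boolean mask of retained features
print(selector.ranking_)   # rank 1 marks a selected feature
```

Because RFE with a linear estimator compares coefficient magnitudes, features should be on comparable scales before it is applied. <br>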

#### Regression Modeling Techniques

The primary modeling technique used in this project is linear regression, which serves as a baseline model. Linear regression is widely used for its simplicity and interpretability and is suitable for quantifying linear relationships between the target variable and the predictors. <br>

To improve accuracy and accommodate non-linear relationships, polynomial regression will also be implemented. This allows the model to capture curvature in the data that linear models may fail to represent. <br>

Further, pipelines will be created to streamline the modeling process, combining preprocessing steps with model training in a reproducible workflow. Because a pipeline fits its preprocessing steps on the training data only, it reduces the risk of data leakage and improves consistency in cross-validation and testing (Pedregosa et al., 2011). <br>
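The following sketch shows how a scikit-learn `Pipeline` keeps preprocessing inside each cross-validation fold. The data are synthetic, and the step names (`imputer`, `scaler`, `model`), the median imputation strategy, and the five-fold setting are illustrative assumptions.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data with roughly 5% of entries set to missing
rng = np.random.default_rng(42)
X = rng.normal(size=(150, 3))
y = 2.0 * X[:, 0] + X[:, 1] - 0.5 * X[:, 2] + rng.normal(scale=0.2, size=150)
X[rng.random(X.shape) < 0.05] = np.nan

pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
    ("model", LinearRegression()),
])

# Each fold refits the imputer and scaler on that fold's training portion only
scores = cross_val_score(pipe, X, y, cv=5, scoring="r2")
print(scores.mean())
```

Fitting the imputer or scaler on the full dataset before splitting would let information from the held-out rows influence training, which is the leakage the pipeline avoids. <br>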

Model performance will be evaluated using metrics such as: <br>

- R^2 (coefficient of determination) to measure the proportion of variance explained,

- MAE (mean absolute error) to assess average prediction error,

- RMSE (root mean squared error) to penalize large errors more heavily.

These metrics will be compared across models to determine which configuration yields the most accurate and generalizable predictions. <br>
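For reference, the three metrics can be computed directly from their definitions. The sketch below uses four made-up true and predicted values; the arrays are illustrative and do not come from the dataset.

```python
import numpy as np

# Hypothetical true and predicted life expectancies (illustrative values)
y_true = np.array([60.0, 70.0, 80.0, 65.0])
y_pred = np.array([62.0, 69.0, 77.0, 65.0])

residuals = y_true - y_pred

# MAE: mean of absolute residuals
mae = np.mean(np.abs(residuals))

# RMSE: square root of the mean squared residual
rmse = np.sqrt(np.mean(residuals ** 2))

# R^2: one minus residual sum of squares over total sum of squares
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

print(mae, rmse, r2)
```

The same quantities are available in `sklearn.metrics` as `mean_absolute_error`, `mean_squared_error` (take the square root for RMSE), and `r2_score`. <br>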

#### Expected Outcomes and Significance

By the end of this analysis, the most influential factors affecting life expectancy across countries will be identified. This will not only offer insight into global health trends but also provide evidence-based recommendations for policy interventions. The results may support the allocation of resources to areas such as education, disease prevention, and economic development, all of which have been shown to affect longevity in prior research (Cutler, Deaton, & Lleras-Muney, 2006). <br>

This project also serves as a practical demonstration of applying machine learning techniques within the public health domain. The modeling strategies, visualization techniques, and interpretive frameworks developed herein can be extended to other datasets and problem domains involving continuous outcomes. <br>

### Section 1. Import and Inspect the Data

Python libraries are collections of pre-written code that provide specific functionalities, making programming more efficient and reducing the need to write code from scratch. These libraries cover a wide range of applications, including data analysis, machine learning, web development, and automation. Some libraries, such as os, sys, math, json, and datetime, come built-in with Python as part of its standard library, providing essential functions for file handling, system operations, mathematical computations, and data serialization. Other popular third-party libraries, like pandas, numpy, matplotlib, seaborn, and scikit-learn, must be installed separately and are widely used in data science and machine learning. The extensive availability of libraries in Python's ecosystem makes it a versatile and powerful programming language for various domains. <br>

Pandas is a powerful data manipulation and analysis library that provides flexible data structures, such as DataFrames and Series. It is widely used for handling structured datasets, enabling easy data cleaning, transformation, and aggregation. Pandas is essential for data preprocessing in machine learning and statistical analysis. <br>
https://pandas.pydata.org/docs/ <br>

NumPy (Numerical Python) is a foundational library for numerical computing in Python. It provides support for large, multi-dimensional arrays and matrices, along with a comprehensive collection of mathematical functions to operate on these arrays efficiently. NumPy is a key component in scientific computing and machine learning. <br>
https://numpy.org/doc/stable/ <br>

Matplotlib is a widely used data visualization library that allows users to create static, animated, and interactive plots. It provides extensive tools for generating various chart types, including line plots, scatter plots, histograms, and bar charts, making it a critical library for exploratory data analysis. <br>
https://matplotlib.org/stable/contents.html <br>

Seaborn is a statistical data visualization library built on top of Matplotlib, designed for creating visually appealing and informative plots. It simplifies complex visualizations, such as heatmaps, violin plots, and pair plots, making it easier to identify patterns and relationships in datasets. <br>
https://seaborn.pydata.org/ <br>

Scikit-learn provides a variety of tools for machine learning, including data preprocessing, model selection, and evaluation. It contains essential functions for building predictive models and analyzing datasets. <br>
sklearn.metrics: This module provides various performance metrics for evaluating machine learning models. <br>
https://scikit-learn.org/stable/modules/model_evaluation.html<br>

IPython.core.display is a module from the IPython library that provides tools for displaying rich output in Jupyter Notebooks, including formatted text, images, HTML, and interactive widgets. It enhances visualization and interaction within Jupyter environments. <br>
https://ipython.readthedocs.io/en/stable/api/generated/IPython.core.display.html <br>



In [1]:
# Data handling
import pandas as pd
import numpy as np

# Machine learning imports
from sklearn.model_selection import train_test_split, StratifiedShuffleSplit, cross_val_score
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler, PolynomialFeatures, MinMaxScaler, RobustScaler, LabelEncoder
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score, mean_absolute_error, mean_squared_error, r2_score, precision_score, recall_score, f1_score, silhouette_score
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.feature_selection import RFECV, RFE, SelectKBest, f_classif, mutual_info_classif

# Visualization libraries
import matplotlib.pyplot as plt
import seaborn as sns

# Notebook display utilities (Jupyter / VS Code)
from IPython.core.interactiveshell import InteractiveShell
from IPython.display import display

#### 1.1. Load the Dataset and Display the First 10 Rows.

In this step, I begin by loading the Life Expectancy dataset into a pandas DataFrame. I then display the first 10 rows to get a quick overview of the data structure, feature names, and the types of values included in the dataset. <br> 

In [None]:
# Define the file path to the dataset
file_path = r'C:\Projects\ml_regression_data-git-hub\data\life_expectancy_data.csv'

# Load the dataset into a DataFrame
df = pd.read_csv(file_path)

# Display the first 10 rows of the dataset
df.head(10)

| | Country | Year | Status | Life expectancy | Adult Mortality | infant deaths | Alcohol | percentage expenditure | Hepatitis B | Measles | ... | Polio | Total expenditure | Diphtheria | HIV/AIDS | GDP | Population | thinness 1-19 years | thinness 5-9 years | Income composition of resources | Schooling |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | Afghanistan | 2015 | Developing | 65.0 | 263.0 | 62 | 0.01 | 71.279624 | 65.0 | 1154 | ... | 6.0 | 8.16 | 65.0 | 0.1 | 584.25921 | 33736494.0 | 17.2 | 17.3 | 0.479 | 10.1 |
| 1 | Afghanistan | 2014 | Developing | 59.9 | 271.0 | 64 | 0.01 | 73.523582 | 62.0 | 492 | ... | 58.0 | 8.18 | 62.0 | 0.1 | 612.696514 | 327582.0 | 17.5 | 17.5 | 0.476 | 10.0 |
| 2 | Afghanistan | 2013 | Developing | 59.9 | 268.0 | 66 | 0.01 | 73.219243 | 64.0 | 430 | ... | 62.0 | 8.13 | 64.0 | 0.1 | 631.744976 | 31731688.0 | 17.7 | 17.7 | 0.47 | 9.9 |
| 3 | Afghanistan | 2012 | Developing | 59.5 | 272.0 | 69 | 0.01 | 78.184215 | 67.0 | 2787 | ... | 67.0 | 8.52 | 67.0 | 0.1 | 669.959 | 3696958.0 | 17.9 | 18.0 | 0.463 | 9.8 |
| 4 | Afghanistan | 2011 | Developing | 59.2 | 275.0 | 71 | 0.01 | 7.097109 | 68.0 | 3013 | ... | 68.0 | 7.87 | 68.0 | 0.1 | 63.537231 | 2978599.0 | 18.2 | 18.2 | 0.454 | 9.5 |
| 5 | Afghanistan | 2010 | Developing | 58.8 | 279.0 | 74 | 0.01 | 79.679367 | 66.0 | 1989 | ... | 66.0 | 9.2 | 66.0 | 0.1 | 553.32894 | 2883167.0 | 18.4 | 18.4 | 0.448 | 9.2 |
| 6 | Afghanistan | 2009 | Developing | 58.6 | 281.0 | 77 | 0.01 | 56.762217 | 63.0 | 2861 | ... | 63.0 | 9.42 | 63.0 | 0.1 | 445.893298 | 284331.0 | 18.6 | 18.7 | 0.434 | 8.9 |
| 7 | Afghanistan | 2008 | Developing | 58.1 | 287.0 | 80 | 0.03 | 25.873925 | 64.0 | 1599 | ... | 64.0 | 8.33 | 64.0 | 0.1 | 373.361116 | 2729431.0 | 18.8 | 18.9 | 0.433 | 8.7 |
| 8 | Afghanistan | 2007 | Developing | 57.5 | 295.0 | 82 | 0.02 | 10.910156 | 63.0 | 1141 | ... | 63.0 | 6.73 | 63.0 | 0.1 | 369.835796 | 26616792.0 | 19.0 | 19.1 | 0.415 | 8.4 |
| 9 | Afghanistan | 2006 | Developing | 57.3 | 295.0 | 84 | 0.03 | 17.171518 | 64.0 | 1990 | ... | 58.0 | 7.43 | 58.0 | 0.1 | 272.56377 | 2589345.0 | 19.2 | 19.3 | 0.405 | 8.1 |


#### 1.2. Check for Missing Values and Display Summary Statistics.

##### 1.2.1. Missing Values and Percentage

In this step, I check for missing values in the dataset to identify any columns with incomplete data that may require cleaning or imputation. I also display summary statistics to understand the central tendency, spread, and distribution of numerical features. This helps guide decisions on data preprocessing and modeling strategies. <br>

In [7]:
# Calculate missing values in total and as percentage
missing_data = df.isnull().sum().to_frame(name='Missing Count')
missing_data['Missing Percentage'] = (missing_data['Missing Count'] / len(df)) * 100

# Display columns with missing values only
missing_data[missing_data['Missing Count'] > 0]

| | Missing Count | Missing Percentage |
| --- | --- | --- |
| Life expectancy | 10 | 0.340368 |
| Adult Mortality | 10 | 0.340368 |
| Alcohol | 194 | 6.603131 |
| Hepatitis B | 553 | 18.822328 |
| BMI | 34 | 1.15725 |
| Polio | 19 | 0.646698 |
| Total expenditure | 226 | 7.692308 |
| Diphtheria | 19 | 0.646698 |
| GDP | 448 | 15.248468 |
| Population | 652 | 22.191967 |


##### 1.2.2. Display Summary Statistics

In [6]:
# Display summary statistics for numerical columns
df.describe()

| | Year | Life expectancy | Adult Mortality | infant deaths | Alcohol | percentage expenditure | Hepatitis B | Measles | BMI | under-five deaths | Polio | Total expenditure | Diphtheria | HIV/AIDS | GDP | Population | thinness 1-19 years | thinness 5-9 years | Income composition of resources | Schooling |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| count | 2938.0 | 2928.0 | 2928.0 | 2938.0 | 2744.0 | 2938.0 | 2385.0 | 2938.0 | 2904.0 | 2938.0 | 2919.0 | 2712.0 | 2919.0 | 2938.0 | 2490.0 | 2286.0 | 2904.0 | 2904.0 | 2771.0 | 2775.0 |
| mean | 2007.51872 | 69.224932 | 164.796448 | 30.303948 | 4.602861 | 738.251295 | 80.940461 | 2419.59224 | 38.321247 | 42.035739 | 82.550188 | 5.93819 | 82.324084 | 1.742103 | 7483.158469 | 12753380.0 | 4.839704 | 4.870317 | 0.627551 | 11.992793 |
| std | 4.613841 | 9.523867 | 124.292079 | 117.926501 | 4.052413 | 1987.914858 | 25.070016 | 11467.272489 | 20.044034 | 160.445548 | 23.428046 | 2.49832 | 23.716912 | 5.077785 | 14270.169342 | 61012100.0 | 4.420195 | 4.508882 | 0.210904 | 3.35892 |
| min | 2000.0 | 36.3 | 1.0 | 0.0 | 0.01 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 3.0 | 0.37 | 2.0 | 0.1 | 1.68135 | 34.0 | 0.1 | 0.1 | 0.0 | 0.0 |
| 25% | 2004.0 | 63.1 | 74.0 | 0.0 | 0.8775 | 4.685343 | 77.0 | 0.0 | 19.3 | 0.0 | 78.0 | 4.26 | 78.0 | 0.1 | 463.935626 | 195793.2 | 1.6 | 1.5 | 0.493 | 10.1 |
| 50% | 2008.0 | 72.1 | 144.0 | 3.0 | 3.755 | 64.912906 | 92.0 | 17.0 | 43.5 | 4.0 | 93.0 | 5.755 | 93.0 | 0.1 | 1766.947595 | 1386542.0 | 3.3 | 3.3 | 0.677 | 12.3 |
| 75% | 2012.0 | 75.7 | 228.0 | 22.0 | 7.7025 | 441.534144 | 97.0 | 360.25 | 56.2 | 28.0 | 97.0 | 7.4925 | 97.0 | 0.8 | 5910.806335 | 7420359.0 | 7.2 | 7.2 | 0.779 | 14.3 |
| max | 2015.0 | 89.0 | 723.0 | 1800.0 | 17.87 | 19479.91161 | 99.0 | 212183.0 | 87.3 | 2500.0 | 99.0 | 17.6 | 99.0 | 50.6 | 119172.7418 | 1293859000.0 | 27.7 | 28.6 | 0.948 | 20.7 |


### Reflection 1: What do you notice about the dataset? Are there any data issues?

Upon reviewing the summary statistics and missing value analysis of the Life Expectancy dataset, several important observations emerge. First, the dataset spans a substantial time period from 2000 to 2015 and includes a broad set of features covering public health, demographic, and economic indicators. Most variables are numerical and display reasonable ranges, though some have wide standard deviations indicating high variability across countries and years. <br>

In terms of data completeness, there are several columns with missing values. For instance, Population has the highest proportion of missing data at approximately 22.2%, followed by Hepatitis B (18.8%) and GDP (15.2%). Additional variables such as Total expenditure, Alcohol, and Income composition of resources also contain notable gaps. These missing values will need to be addressed either through imputation or removal depending on their impact on modeling. Fortunately, the target variable, Life expectancy, has a relatively low missing percentage (0.34%), which should not severely impact the regression analysis. <br>

Additionally, a few variables such as Measles and under-five deaths display extreme maximum values, which may indicate the presence of outliers. The percentage expenditure variable also has a very high maximum value (over 19,000), suggesting potential data entry anomalies or extreme variation that warrants further investigation. These issues should be explored in detail during the data cleaning and preparation phase. <br>

Overall, while the dataset is rich and well-structured for regression modeling, some preprocessing steps including missing value imputation, outlier handling, and potential normalization or transformation will be necessary to ensure robust and interpretable model performance. <br>

---

### Section 2. Data Exploration and Preparation

#### 2.1. Explore Data Patterns and Distribution

##### 2.1.1 Create Histograms, Box Plots, and Count Plots for Categorical Variables (as Applicable)

##### 2.1.2. Identify Patterns, Outliers, and Anomalies in Feature Distributions

##### 2.1.3. Check for Class Imbalance in the Target Variable (as Applicable)

#### 2.2. Handle Missing Values and Clean Data

##### 2.2.1. Impute or Drop Missing Values (as Applicable)

##### 2.2.2. Remove or Transform Outliers (as Applicable)

##### 2.2.3. Convert Categorical Data to Numerical Format Using Encoding (as Applicable)

#### 2.3. Feature Selection and Engineering

##### 2.3.1. Create New Features (as Applicable)

##### 2.3.2. Transform or Combine Existing Features to Improve Model Performance (as Applicable)

##### 2.3.3. Scale or Normalize Data (as Applicable)

### Reflection 2: What patterns or anomalies do you see? Do any features stand out? What preprocessing steps were necessary to clean and improve the data? Did you create or modify any features to improve performance?

---

### Section 3. Feature Selection and Justification

#### 3.1 Choose Features and Target

#### 3.2 Define X and y

### Reflection 3: Why did you choose these features? How might they impact predictions or accuracy?

---

### Section 4. Train a Model (Linear Regression)

#### 4.1. Split the Data into Training and Test Sets Using `train_test_split` (or `StratifiedShuffleSplit` if Class Imbalance is an Issue)
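A minimal sketch of the split, assuming a small synthetic feature matrix `X` and target `y`. The 80/20 ratio and `random_state=42` are illustrative choices. Note that `StratifiedShuffleSplit` stratifies on class labels, so it does not apply directly to a continuous target such as life expectancy unless the target is first binned.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic features and a continuous target (illustrative values)
X = np.arange(40, dtype=float).reshape(20, 2)
y = 0.5 * X[:, 0] + 1.0

# Hold out 20% of the rows for testing; fix the seed for reproducibility
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(X_train.shape, X_test.shape)
```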

#### 4.2. Train Model Using Scikit-Learn `model.fit()` Method

#### 4.3. Evaluate Performance, for Example: Regression: R^2, MAE, RMSE; Classification: Accuracy, Precision, Recall, F1-score, Confusion Matrix; Clustering: Inertia, Silhouette Score

### Reflection 4: How well did the model perform? Any surprises in the results?

---

### Section 5. Improve the Model or Try Alternates (Implement Pipelines)

#### 5.1. Implement Pipeline 1: Imputer → StandardScaler → Linear Regression.

#### 5.2. Implement Pipeline 2: Imputer → Polynomial Features (degree=3) → StandardScaler → Linear Regression.
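One way to assemble the polynomial pipeline described in this heading is sketched below on synthetic data. The step names, the median imputation strategy, and `include_bias=False` are assumptions made for illustration; the target is deliberately generated from cubic and interaction terms so that a degree-3 expansion can fit it.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic non-linear target built from a cubic term and an interaction
rng = np.random.default_rng(1)
X = rng.uniform(-2, 2, size=(200, 2))
y = X[:, 0] ** 3 - X[:, 0] * X[:, 1] + rng.normal(scale=0.1, size=200)

poly_pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("poly", PolynomialFeatures(degree=3, include_bias=False)),
    ("scaler", StandardScaler()),
    ("model", LinearRegression()),
])
poly_pipe.fit(X, y)

# Two input features expand to nine polynomial terms at degree 3
n_terms = poly_pipe.named_steps["poly"].n_output_features_
print(n_terms)
print(poly_pipe.score(X, y))  # R^2 on the training data
```

The number of generated terms grows quickly with the number of input features, so with many predictors a degree-3 expansion can overfit; comparing training and test scores is a simple check. <br>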

#### 5.3 Compare Performance of All Models Across the Same Performance Metrics

### Reflection 5: Which models performed better? How does scaling impact results?

---

### Section 6. Final Thoughts & Insights

#### 6.1. Summarize Findings

#### 6.2. Discuss Challenges Faced

#### 6.3. If You Had More Time, What Would You Try Next?

### Reflection 6: What did you learn from this project?

---

### References:
Cutler, D. M., Deaton, A. S., & Lleras-Muney, A. (2006). The determinants of mortality. Journal of Economic Perspectives, 20(3), 97–120. https://doi.org/10.1257/jep.20.3.97 <br>

Han, J., Pei, J., & Kamber, M. (2011). Data mining: Concepts and techniques (3rd ed.). Morgan Kaufmann. https://hanj.cs.illinois.edu/bk3/bk3_slidesindex.htm <br>

Kumar, A. (2017). Life expectancy (WHO). Kaggle. https://www.kaggle.com/datasets/kumarajarshi/life-expectancy-who <br>

Marmot, M. (2005). Social determinants of health inequalities. The Lancet, 365(9464), 1099–1104. https://doi.org/10.1016/S0140-6736(05)71146-6 <br>

Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., ... & Duchesnay, E. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12, 2825–2830. <br>

Preston, S. H., Heuveline, P., & Guillot, M. (2001). Demography: Measuring and modeling population processes. Blackwell Publishing. <br>

Tukey, J. W. (1977). Exploratory data analysis. Addison-Wesley. <br>
