# Final Group Project: Predict Life Expectancy

**Project Info:**
- The dataset comes from __[Kaggle](https://www.kaggle.com/datasets/kumarajarshi/life-expectancy-who/data)__
- Contributors
    - Priyanka
    - Rohit
    - Grant

## Data Pre-Processing

### Load Data
Load the data from the project's **GitHub repository** so that each team member can run the notebook directly.<br>
__[Tutorial: How to read a CSV file from GitHub on Jupyter Notebook](https://www.youtube.com/watch?v=4xXBDXDSFts)__

In [5]:
# Import library
import pandas as pd

In [6]:
# Load data: load file from github repository
data = pd.read_csv('https://raw.githubusercontent.com/GrantCa24/DA_Group6-Final_Project/main/data_raw/Life%20Expectancy%20Data.csv')
data.head()

Unnamed: 0,Country,Year,Status,Life expectancy,Adult Mortality,infant deaths,Alcohol,percentage expenditure,Hepatitis B,Measles,...,Polio,Total expenditure,Diphtheria,HIV/AIDS,GDP,Population,thinness 1-19 years,thinness 5-9 years,Income composition of resources,Schooling
0,Afghanistan,2015,Developing,65.0,263.0,62,0.01,71.279624,65.0,1154,...,6.0,8.16,65.0,0.1,584.25921,33736494.0,17.2,17.3,0.479,10.1
1,Afghanistan,2014,Developing,59.9,271.0,64,0.01,73.523582,62.0,492,...,58.0,8.18,62.0,0.1,612.696514,327582.0,17.5,17.5,0.476,10.0
2,Afghanistan,2013,Developing,59.9,268.0,66,0.01,73.219243,64.0,430,...,62.0,8.13,64.0,0.1,631.744976,31731688.0,17.7,17.7,0.47,9.9
3,Afghanistan,2012,Developing,59.5,272.0,69,0.01,78.184215,67.0,2787,...,67.0,8.52,67.0,0.1,669.959,3696958.0,17.9,18.0,0.463,9.8
4,Afghanistan,2011,Developing,59.2,275.0,71,0.01,7.097109,68.0,3013,...,68.0,7.87,68.0,0.1,63.537231,2978599.0,18.2,18.2,0.454,9.5


### Data Assessment

**Highlights:**
- There are 2938 rows and 22 columns.
- Remove leading and trailing whitespace from the column names.
    - __[`Series.str.strip()`](https://pandas.pydata.org/docs/reference/api/pandas.Series.str.strip.html)__
- Rename a column.
    - Based on the discussion __[1-19 years: typo in the column header](https://www.kaggle.com/datasets/kumarajarshi/life-expectancy-who/discussion/276334)__, we decided to rename the header `thinness  1-19 years` to `thinness 10-19 years`.
- No duplicates.
- Data integrity (outliers): Several columns have maximum values that do not make sense.

|Field|Description|
|---:|:---|
|Country|Country|
|Year|Year|
|Status|Developed or Developing status|
|Life expectancy|Life expectancy in years|
|Adult Mortality|Adult Mortality Rates of both sexes (probability of dying between 15 and 60 years per 1000 population)|
|infant deaths|Number of Infant Deaths per 1000 population|
|Alcohol|Alcohol, recorded per capita (15+) consumption (in litres of pure alcohol)|
|percentage expenditure|Expenditure on health as a percentage of Gross Domestic Product per capita (%)|
|Hepatitis B|Hepatitis B (HepB) immunization coverage among 1-year-olds (%)|
|Measles|Measles - number of reported cases per 1000 population|
|BMI|Average Body Mass Index of entire population|
|under-five deaths|Number of under-five deaths per 1000 population|
|Polio|Polio (Pol3) immunization coverage among 1-year-olds (%)|
|Total expenditure|General government expenditure on health as a percentage of total government expenditure (%)|
|Diphtheria|Diphtheria tetanus toxoid and pertussis (DTP3) immunization coverage among 1-year-olds (%)|
|HIV/AIDS|Deaths per 1000 live births HIV/AIDS (0-4 years)|
|GDP|Gross Domestic Product per capita (in USD)|
|Population|Population of the country|
|thinness 10-19 years|Prevalence of thinness among children and adolescents for Age 10 to 19 (%)|
|thinness 5-9 years|Prevalence of thinness among children for Age 5 to 9(%)|
|Income composition of resources|Income composition of resources|
|Schooling|Number of years of Schooling(years)|

In [10]:
data.columns

Index(['Country', 'Year', 'Status', 'Life expectancy ', 'Adult Mortality',
       'infant deaths', 'Alcohol', 'percentage expenditure', 'Hepatitis B',
       'Measles ', ' BMI ', 'under-five deaths ', 'Polio', 'Total expenditure',
       'Diphtheria ', ' HIV/AIDS', 'GDP', 'Population',
       ' thinness  1-19 years', ' thinness 5-9 years',
       'Income composition of resources', 'Schooling'],
      dtype='object')

In [11]:
# Remove spaces at the beginning and at the end of the headers(string)
data.columns = data.columns.str.strip()
print(data.columns)

Index(['Country', 'Year', 'Status', 'Life expectancy', 'Adult Mortality',
       'infant deaths', 'Alcohol', 'percentage expenditure', 'Hepatitis B',
       'Measles', 'BMI', 'under-five deaths', 'Polio', 'Total expenditure',
       'Diphtheria', 'HIV/AIDS', 'GDP', 'Population', 'thinness  1-19 years',
       'thinness 5-9 years', 'Income composition of resources', 'Schooling'],
      dtype='object')


In [12]:
# Rename column 1-19 years to 10-19 years
data.rename(columns={'thinness  1-19 years': 'thinness 10-19 years'}, inplace=True) # modify the DataFrame

In [13]:
# Final check after renaming column
data.columns

Index(['Country', 'Year', 'Status', 'Life expectancy', 'Adult Mortality',
       'infant deaths', 'Alcohol', 'percentage expenditure', 'Hepatitis B',
       'Measles', 'BMI', 'under-five deaths', 'Polio', 'Total expenditure',
       'Diphtheria', 'HIV/AIDS', 'GDP', 'Population', 'thinness 10-19 years',
       'thinness 5-9 years', 'Income composition of resources', 'Schooling'],
      dtype='object')

In [14]:
# Check the total number of rows and columns
rows, columns = data.shape
print(f"Rows: {rows}, Columns: {columns}")

Rows: 2938, Columns: 22


In [15]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2938 entries, 0 to 2937
Data columns (total 22 columns):
 #   Column                           Non-Null Count  Dtype  
---  ------                           --------------  -----  
 0   Country                          2938 non-null   object 
 1   Year                             2938 non-null   int64  
 2   Status                           2938 non-null   object 
 3   Life expectancy                  2928 non-null   float64
 4   Adult Mortality                  2928 non-null   float64
 5   infant deaths                    2938 non-null   int64  
 6   Alcohol                          2744 non-null   float64
 7   percentage expenditure           2938 non-null   float64
 8   Hepatitis B                      2385 non-null   float64
 9   Measles                          2938 non-null   int64  
 10  BMI                              2904 non-null   float64
 11  under-five deaths                2938 non-null   int64  
 12  Polio               

### Check Duplicates

There are **no duplicate rows** to handle.

In [18]:
# Check for duplicate rows
duplicate_rows = data.duplicated()

# Count of duplicate rows
print(f"Number of duplicate rows: {duplicate_rows.sum()}")

Number of duplicate rows: 0


### Check and Remove Null values in all the columns and rows

**Imputation**: We decided to impute the columns with 10% \~ 20% null values, using the mean value by `Status`.
- Hepatitis B
- GDP

Reason: The amount of missing data is large, but not large enough for imputation to distort the overall dataset. We also consider `Status` a reasonable categorical indicator to impute by, given the time and effort available.

Notes: We attempted to impute these two columns per country using a moving average. After examining the data in detail, we found this approach too complicated and time-consuming.
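
For reference, a per-country fill of the kind mentioned in the note could look roughly like the sketch below. It is not the code we tried; the toy values, the `GDP_filled` column, and the window size of 3 are assumptions chosen only for illustration.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values) with gaps in GDP for two countries
toy = pd.DataFrame({
    "Country": ["A", "A", "A", "A", "B", "B", "B"],
    "Year":    [2000, 2001, 2002, 2003, 2000, 2001, 2002],
    "GDP":     [100.0, np.nan, 300.0, np.nan, 50.0, np.nan, 70.0],
})
toy = toy.sort_values(["Country", "Year"])

def fill_with_moving_average(s, window=3):
    """Fill gaps with a centered moving average of the neighbouring years."""
    moving_avg = s.rolling(window=window, min_periods=1, center=True).mean()
    return s.fillna(moving_avg)

# transform() applies the fill within each country, never across countries
toy["GDP_filled"] = toy.groupby("Country")["GDP"].transform(fill_with_moving_average)
print(toy)
```

A country whose neighbouring years are all missing would still be left with nulls, which is one reason this route needs more per-country inspection than the `Status` mean.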

**Delete column**: We decided to delete the columns with more than 20% null values.
- Population: 22.19% null values

Reason: The amount of missing data is too large; imputing it would affect a large part of the dataset.

**Dropna**: We decided to drop the rows with null values in the columns that have less than 10% null values.
- The remaining columns

Reason: These null values are only a small portion of the whole dataset, so dropping them should not affect the analysis much.

**Strategy:**
1. Drop the rows that contain null values in the columns with less than 10% null values
2. Impute the remaining null values with the mean value by `Status`
3. Delete the column `Population`

**Notes:**
- Year: 2000~2015
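
The 10% and 20% thresholds above can be turned into a small helper that sorts columns into the three treatments. This is a minimal sketch on a toy frame; the function name `plan_missing_value_handling` and the toy columns are hypothetical and not part of the notebook.

```python
import numpy as np
import pandas as pd

def plan_missing_value_handling(df, low=10.0, high=20.0):
    """Group columns by their share of missing values (in percent)."""
    pct = df.isnull().mean() * 100
    return {
        "drop_rows": pct[(pct > 0) & (pct < low)].index.tolist(),
        "impute": pct[(pct >= low) & (pct <= high)].index.tolist(),
        "drop_column": pct[pct > high].index.tolist(),
    }

# Toy frame (hypothetical): 5%, 15%, 25% and 0% missing values
toy = pd.DataFrame({
    "a": [np.nan] * 1 + [1.0] * 19,
    "b": [np.nan] * 3 + [1.0] * 17,
    "c": [np.nan] * 5 + [1.0] * 15,
    "d": [1.0] * 20,
})
plan = plan_missing_value_handling(toy)
print(plan)
```

Applied to the real `data`, the same helper should place `Hepatitis B` and `GDP` under `impute` and `Population` under `drop_column`, given the percentages listed in the next section.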

#### Null value count & percentage

In [22]:
# Checking for missing values in each column
missing_values = data.isnull().sum()
print(missing_values)

Country                              0
Year                                 0
Status                               0
Life expectancy                     10
Adult Mortality                     10
infant deaths                        0
Alcohol                            194
percentage expenditure               0
Hepatitis B                        553
Measles                              0
BMI                                 34
under-five deaths                    0
Polio                               19
Total expenditure                  226
Diphtheria                          19
HIV/AIDS                             0
GDP                                448
Population                         652
thinness 10-19 years                34
thinness 5-9 years                  34
Income composition of resources    167
Schooling                          163
dtype: int64


In [23]:
missing_percentage = missing_values * 100 / len(data)
print(missing_percentage)

Country                             0.000000
Year                                0.000000
Status                              0.000000
Life expectancy                     0.340368
Adult Mortality                     0.340368
infant deaths                       0.000000
Alcohol                             6.603131
percentage expenditure              0.000000
Hepatitis B                        18.822328
Measles                             0.000000
BMI                                 1.157250
under-five deaths                   0.000000
Polio                               0.646698
Total expenditure                   7.692308
Diphtheria                          0.646698
HIV/AIDS                            0.000000
GDP                                15.248468
Population                         22.191967
thinness 10-19 years                1.157250
thinness 5-9 years                  1.157250
Income composition of resources     5.684139
Schooling                           5.547992
dtype: float64

In [24]:
import matplotlib.pyplot as plt
import seaborn as sns

ModuleNotFoundError: No module named 'matplotlib.pyplot'

In [None]:
# Visual representation of missing values in the dataset
plt.figure(figsize=(15,10))
sns.heatmap(data.isnull(), cmap = 'crest')
plt.show()

#### Strategy Step1:
**Drop the rows that contain null values in the columns with less than 10% null values.**

In [None]:
# Strategy Step1: Drop the rows with null values in the columns that have less than 10% null values
data.dropna(
    subset=['Life expectancy', 'Adult Mortality', 'Alcohol', 'BMI', 'Polio', 'Total expenditure', 'Diphtheria', 'thinness 10-19 years', 'thinness 5-9 years', 'Income composition of resources', 'Schooling'],
    inplace=True)
# Show the remaining columns that have null values
data.isnull().sum()

In [None]:
data['Year'].unique()

#### Strategy Step2. Impute the remaining null values with the mean value by `Status`

##### Hepatitis B

In [None]:
null_hep_b = data[data['Hepatitis B'].isnull()]
null_hep_b_country = null_hep_b['Country'].unique()

In [None]:
for country in null_hep_b_country:
    # Use a separate name so the list being iterated over is not overwritten
    null_hep_b_rows = null_hep_b[null_hep_b['Country'] == country]
    print(country, ":")
    print(null_hep_b_rows['Year'].unique())

##### GDP

In [None]:
null_gdp = data[data['GDP'].isnull()]
null_gdp_country = null_gdp['Country'].unique()

In [None]:
for country in null_gdp_country:
    # Use a separate name so the list being iterated over is not overwritten
    null_gdp_rows = null_gdp[null_gdp['Country'] == country]
    print(country, ":")
    print(null_gdp_rows['Year'].unique())

##### Mean value of `Hepatitis B` & `GDP` by `Status`
- __[`mean()` excludes null values by default](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.mean.html)__
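
A quick way to confirm this behaviour is a toy frame with a few gaps. The sketch below uses hypothetical values and a hypothetical `GDP_filled` column; it also shows `transform("mean")` as one compact way to fill by group, offered as an alternative rather than the approach used in the cells that follow.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values) with one gap per status group
toy = pd.DataFrame({
    "Status": ["Developed", "Developed", "Developed", "Developing", "Developing"],
    "GDP": [40000.0, np.nan, 60000.0, 1000.0, np.nan],
})

# The group mean ignores the NaN entries instead of returning NaN
group_means = toy.groupby("Status")["GDP"].mean()
print(group_means)

# transform("mean") broadcasts each group's mean back to every row of that group,
# so fillna() can pick the matching value for each missing entry
toy["GDP_filled"] = toy["GDP"].fillna(toy.groupby("Status")["GDP"].transform("mean"))
print(toy)
```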

In [None]:
# Impute with the mean value of Developed / Developing countries

#Create a groupby object
data_group = data.groupby('Status')

#Select only required columns
data_columns = data_group[['Hepatitis B', 'GDP']]

#Apply aggregate function
hep_B_gdp_by_status = data_columns.mean()

hep_B_gdp_by_status

##### Fill null values (Impute with Mean)

In [None]:
# Fill missing values for 'Hepatitis B' based on 'Status'
data.loc[data['Status'] == 'Developed', 'Hepatitis B'] = data.loc[data['Status'] == 'Developed', 'Hepatitis B'].fillna(
    hep_B_gdp_by_status.loc['Developed','Hepatitis B'])
data.loc[data['Status'] == 'Developing', 'Hepatitis B'] = data.loc[data['Status'] == 'Developing', 'Hepatitis B'].fillna(
    hep_B_gdp_by_status.loc['Developing','Hepatitis B'])

In [None]:
# Fill missing values for 'GDP' based on 'Status'
data.loc[data['Status'] == 'Developed', 'GDP'] = data.loc[data['Status'] == 'Developed', 'GDP'].fillna(
    hep_B_gdp_by_status.loc['Developed','GDP'])
data.loc[data['Status'] == 'Developing', 'GDP'] = data.loc[data['Status'] == 'Developing', 'GDP'].fillna(
    hep_B_gdp_by_status.loc['Developing','GDP'])

#### Strategy Step3. Delete column: Population

In [None]:
null_population = data[data['Population'].isnull()]
null_population_country = null_population['Country'].unique()

In [None]:
# Drop Population
data.drop(columns=['Population'], inplace=True)

# Show the null value across columns
data.isnull().sum()

### Remove Dirty Data

__[Warning from the discussion](https://www.kaggle.com/datasets/kumarajarshi/life-expectancy-who/discussion/161872)__
- Filter out the observations where any of the three columns measured `per 1000 population` has a value > 1000:
    - `infant deaths`
    - `Measles`
    - `under-five deaths`
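
Before filtering, it can help to see how many values break the rule. The sketch below counts violations on a toy frame with hypothetical values; the names `violations_per_column` and `rows_affected` are illustrative only.

```python
import pandas as pd

# Toy data (hypothetical values) for the three "per 1000 population" columns
toy = pd.DataFrame({
    "infant deaths": [62, 1800, 3],
    "Measles": [1154, 492, 0],
    "under-five deaths": [83, 2500, 4],
})
cols = ["infant deaths", "Measles", "under-five deaths"]

# Number of values above 1000 in each column
violations_per_column = (toy[cols] > 1000).sum()

# Number of rows with at least one violation (the rows the filter would remove)
rows_affected = int((toy[cols] > 1000).any(axis=1).sum())

print(violations_per_column)
print(f"Rows with at least one value > 1000: {rows_affected}")
```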

In [None]:
data.describe()

#### Filter out the observations with a value > 1000 in the columns measured `per 1000 population`

In [None]:
# value of infant deaths, Measles, and under-five deaths should be <= 1000
cols = ["infant deaths", "Measles", "under-five deaths"]

# Filter out rows where any of the specified columns have values > 1,000
data = data[(data[cols] <= 1000).all(axis=1)] # Keep only rows where all three values are <= 1000

In [None]:
data.describe()

In [None]:
# Show the remaining columns that have null values
data.isnull().sum()

In [None]:
data.shape

## Exploratory Data Analysis (EDA)

**Highlights:**
- `Life expectancy`: The histogram peaks in the **70-80 year range**. The boxplot shows a median just above 70 years, so **more than 50% of the observations exceed 70 years**, with the maximum nearing 90 years. However, 25% of the observations fall below 65 years, with **several outliers below 50 years**. Overall life expectancy is high, but there are notable disparities: a portion of the observations shows significantly lower life expectancy.
- **Right skew ( > 1)**: `under-five deaths`, `infant deaths` , `HIV/AIDS`, `percentage expenditure`, `GDP`, `Measles`, `Adult Mortality`, `thinness 5-9 years`, `thinness 10-19 years`
- **Left skew ( < -1)**: `Income composition of resources`, `Hepatitis B`, `Polio`, `Diphtheria`

<br>

**Methods:**
- `pandas.DataFrame.hist` : Only **numerical columns** will be plotted. __[Here for more info](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.hist.html)__
- `subplot(nrows, ncols, index)` __[Here for more info](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.subplot.html)__
- `kdeplot` : Only **numerical columns** will be plotted. __[Here for more info](https://seaborn.pydata.org/generated/seaborn.kdeplot.html)__
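
Because `subplot(nrows, ncols, index)` needs a grid large enough for every column, the grid size can be derived from the number of plots instead of being hard-coded. A minimal sketch; `grid_shape` is a hypothetical helper and the count of 19 plots is only an example value.

```python
import math

def grid_shape(n_plots, ncols=4):
    """Return (nrows, ncols) with enough cells for n_plots subplots."""
    nrows = math.ceil(n_plots / ncols)
    return nrows, ncols

n_plots = 19  # example value, e.g. len(num_cols)
nrows, ncols = grid_shape(n_plots)

# subplot indices are 1-based and run row by row
positions = list(range(1, n_plots + 1))
print(nrows, ncols, positions[0], positions[-1])
```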

In [None]:
# Check the histograms
data.hist(bins=35, figsize=(18, 12))
plt.show()

In [None]:
num_cols = data.select_dtypes("number").columns # select all numeric types
print(f"There are {len(num_cols)} numeric columns: \n {num_cols}")

non_num_cols = data.select_dtypes(exclude="number").columns # select all non-numeric types
print(f"There are {len(non_num_cols)} non-numeric columns: \n {non_num_cols}")

In [None]:
fig = plt.figure(figsize=(25,18))

graph_index = 1 # Set the position of the subplot to 1
for col in num_cols:
    plt.subplot(5, 4, graph_index) # subplot(nrows, ncols, index)
    graph = sns.kdeplot(data = data, x = col, fill = True)
    graph_index += 1 # Set the position to the next one

In [None]:
fig = plt.figure(figsize=(25,18))

boxplot_index = 1 # Set the position of the subplot to 1
for col in num_cols:
    plt.subplot(5, 4, boxplot_index) # subplot(nrows, ncols, index)
    # The higher the better (Life expectancy & immunization coverage)
    if col in ['Life expectancy', 'Hepatitis B', 'Polio', 'Diphtheria']:
        boxplot = sns.boxplot(data=data, x=col, boxprops=dict(alpha=1))  # Set alpha for transparency
    else:
        boxplot = sns.boxplot(data=data, x=col, boxprops=dict(alpha=0.4))  # Set alpha for transparency
    boxplot_index += 1 # Set the position to the next one

In [None]:
data.skew(axis = 0, skipna=True, numeric_only=True).sort_values(ascending=False) # Skewness of each numeric column in descending order

- **Positive** value: The distribution is skewed to the **right**.
- **Negative** value: The distribution is skewed to the **left**.
- **0**: The distribution is **symmetric** (a normal distribution is one example, but not the only one).
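
The thresholds used in the highlights (skewness > 1 and < -1) can be applied directly to the result of `skew()`. A minimal sketch on toy columns with hypothetical values:

```python
import pandas as pd

# Toy columns (hypothetical values): one long right tail, one long left tail, one symmetric
toy = pd.DataFrame({
    "right_tail": [1, 1, 1, 2, 2, 3, 50],
    "left_tail": [50, 49, 49, 48, 48, 47, 1],
    "symmetric": [1, 2, 3, 4, 5, 6, 7],
})

skewness = toy.skew(numeric_only=True)

# Split the columns using the same cut-offs as in the highlights
right_skewed = skewness[skewness > 1].index.tolist()
left_skewed = skewness[skewness < -1].index.tolist()

print(skewness)
print("Right skew:", right_skewed)
print("Left skew:", left_skewed)
```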

### 🔓Synthetic data
Adding synthetic data to reduce the bias caused by imbalanced data <br>
https://www.techtarget.com/searchcio/definition/synthetic-data <br>
https://broutonlab.com/blog/ai-bias-solved-with-synthetic-data-generation/

On further analysis, we found that among a total of 168 countries:
- Developed: **29** (with 406 observations)
- Developing: **139** (with 1710 observations)

This imbalance may explain why the box plots have so many outliers (⚠️why?)
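
To make the idea of synthetic data concrete: SMOTE-style oversampling creates a new minority-class sample at a random position on the line segment between an existing sample and one of its neighbours. The sketch below is a simplified illustration of that interpolation step with hypothetical feature vectors; it is not the `imblearn` implementation, which also handles the neighbour search.

```python
import numpy as np

rng = np.random.default_rng(42)

# Two minority-class samples (hypothetical feature vectors)
x_i = np.array([1.0, 10.0])
x_neighbor = np.array([3.0, 14.0])

# Pick a random position between the two points (0 = x_i, 1 = x_neighbor)
lam = rng.uniform(0.0, 1.0)

# The synthetic sample lies on the segment between the two real samples
x_new = x_i + lam * (x_neighbor - x_i)
print(x_new)
```

Since every feature is interpolated, this only makes sense for numeric columns.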

In [None]:
country_status = data.groupby('Country')['Status'].value_counts()
country_status # Type: Series

In [None]:
# Count the number of unique countries
num_countries = country_status.index.get_level_values('Country').nunique()
print(f'Number of unique countries: {num_countries}')

In [None]:
# Count the total number of countries in each status group
countries_per_status = country_status.groupby('Status').size()

# Group by 'Status' and then count the number of unique countries in each group
#countries_per_status = country_status.groupby('Status').apply(lambda x: x.index.get_level_values('Country').nunique())

print(countries_per_status)

In [None]:
# Count the total number of observations in each status group
observations_per_status = country_status.groupby('Status').sum()
observations_per_status

In [None]:
# Combine the results into a DataFrame
status_summary = pd.DataFrame({'# of Countries': countries_per_status, '# of Observations': observations_per_status})

print(status_summary)

In [None]:
from imblearn.over_sampling import SMOTE
import pandas as pd

# Separate features and target variable
# SMOTE interpolates between samples, so only numeric features can be used
# (this excludes the text column 'Country' as well as 'Status')
X = data.select_dtypes("number")  # Features
y = data['Status']                # Target

# Apply SMOTE to balance the dataset
smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X, y)

# Combine resampled data into a DataFrame
data_resampled = pd.DataFrame(X_resampled, columns=X.columns)
data_resampled['Status'] = y_resampled

# Display the count of each unique value in the 'Status' column after resampling
print(data_resampled['Status'].value_counts())

### Correlation

## Data Visualization

## Predictive Model

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
import numpy as np



# List of numerical features and target
numerical_features = [
    'Adult Mortality', 'infant deaths', 'Alcohol', 'percentage expenditure', 
    'Hepatitis B', 'Measles', 'BMI', 'under-five deaths', 'Polio', 
    'Total expenditure', 'Diphtheria', 'HIV/AIDS', 'GDP', 
    'thinness 10-19 years', 'thinness 5-9 years', 'Income composition of resources', 
    'Schooling'
]
target = 'Life expectancy'

# Prepare the features and target (missing values were already handled during pre-processing)
X = data[numerical_features]
y = data[target]

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit the linear regression model
model = LinearRegression()
model.fit(X_train, y_train)

# Predict and evaluate the model
y_pred = model.predict(X_test)
r_squared = model.score(X_test, y_test)
mse = mean_squared_error(y_test, y_pred)

print(f'R-squared: {r_squared:.2f}, Mean Squared Error: {mse:.2f}')
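
For reference, the two metrics printed above can be computed by hand. The sketch below uses hypothetical values, not results from this dataset, and follows the usual definitions: MSE is the mean of the squared errors, and R-squared is one minus the ratio of the residual sum of squares to the total sum of squares.

```python
import numpy as np

# Hypothetical actual and predicted life expectancy values
y_true = np.array([60.0, 70.0, 80.0])
y_hat = np.array([62.0, 69.0, 79.0])

# Mean squared error: average of the squared differences
mse_manual = np.mean((y_true - y_hat) ** 2)

# R-squared: 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(f"R-squared: {r2_manual:.2f}, Mean Squared Error: {mse_manual:.2f}")
```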


In [None]:
# Scatter plot for Actual vs Predicted values
plt.figure(figsize=(10, 6))
plt.scatter(y_test, y_pred, alpha=0.5, color='blue')
plt.plot([min(y_test), max(y_test)], [min(y_test), max(y_test)], color='red')  # Diagonal line
plt.xlabel('Actual Life Expectancy')
plt.ylabel('Predicted Life Expectancy')
plt.title('Actual vs Predicted Life Expectancy')
plt.show()

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
import numpy as np


# Filter the dataset for the desired years (here 2012-2015)
years_of_interest = [2012, 2013, 2014, 2015]
filtered_data = data[data['Year'].isin(years_of_interest)]

# List of numerical features and target
numerical_features = [
    'Adult Mortality', 'infant deaths', 'Alcohol', 'percentage expenditure', 
    'Hepatitis B', 'Measles', 'BMI', 'under-five deaths', 'Polio', 
    'Total expenditure', 'Diphtheria', 'HIV/AIDS', 'GDP', 
    'thinness 10-19 years', 'thinness 5-9 years', 'Income composition of resources', 
    'Schooling'
]
target = 'Life expectancy'

# Prepare the features and target (missing values were already handled during pre-processing)
Xf = filtered_data[numerical_features]
yf = filtered_data[target]


# Split the data into training and testing sets
Xf_train, Xf_test, yf_train, yf_test = train_test_split(Xf, yf, test_size=0.2, random_state=42)

# Fit the linear regression model
model = LinearRegression()
model.fit(Xf_train, yf_train)

# Predict and evaluate the model
yf_pred = model.predict(Xf_test)
r_squared = model.score(Xf_test, yf_test)
mse = mean_squared_error(yf_test, yf_pred)

print(f'R-squared: {r_squared:.2f}, Mean Squared Error: {mse:.2f}')


In [None]:
# Scatter plot for Actual vs Predicted values
plt.figure(figsize=(10, 6))
plt.scatter(yf_test, yf_pred, alpha=0.5, color='blue')
plt.plot([min(yf_test), max(yf_test)], [min(yf_test), max(yf_test)], color='red')  # Diagonal line
plt.xlabel('Actual Life Expectancy')
plt.ylabel('Predicted Life Expectancy')
plt.title('Actual vs Predicted Life Expectancy')
plt.show()

In [None]:
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import r2_score

# Create a Ridge regression model
ridge_model = make_pipeline(StandardScaler(), Ridge(alpha=1.0))

# Train the model
ridge_model.fit(X_train, y_train)

# Predict on the test set
y_pred_ridge = ridge_model.predict(X_test)

# Calculate mean squared error and R-squared
mse_ridge = mean_squared_error(y_test, y_pred_ridge)
r2_ridge = r2_score(y_test, y_pred_ridge)

print(f'Mean Squared Error (Ridge): {mse_ridge}')
print(f'R-squared (Ridge): {r2_ridge}')


In [None]:
from sklearn.linear_model import Lasso

# Create a Lasso regression model
lasso_model = make_pipeline(StandardScaler(), Lasso(alpha=0.1))

# Train the model
lasso_model.fit(X_train, y_train)

# Predict on the test set
y_pred_lasso = lasso_model.predict(X_test)

# Calculate mean squared error and R-squared
mse_lasso = mean_squared_error(y_test, y_pred_lasso)
r2_lasso = r2_score(y_test, y_pred_lasso)

print(f'Mean Squared Error (Lasso): {mse_lasso}')
print(f'R-squared (Lasso): {r2_lasso}')


In [None]:
from sklearn.ensemble import RandomForestRegressor

# Create a Random Forest regression model
rf_model = RandomForestRegressor(n_estimators=100, random_state=42)

# Train the model
rf_model.fit(X_train, y_train)

# Predict on the test set
y_pred_rf = rf_model.predict(X_test)

# Calculate mean squared error and R-squared
mse_rf = mean_squared_error(y_test, y_pred_rf)
r2_rf = r2_score(y_test, y_pred_rf)

print(f'Mean Squared Error (Random Forest): {mse_rf}')
print(f'R-squared (Random Forest): {r2_rf}')



In [None]:
import matplotlib.pyplot as plt

def plot_results(y_test, y_pred_ridge, y_pred_lasso, y_pred_rf):
    plt.figure(figsize=(15, 5))

    # Ridge Regression
    plt.subplot(1, 3, 1)
    plt.scatter(y_test, y_pred_ridge, alpha=0.5, color='blue')
    plt.plot([min(y_test), max(y_test)], [min(y_test), max(y_test)], color='red')  # Diagonal line
    plt.xlabel('Actual Life Expectancy')
    plt.ylabel('Predicted Life Expectancy')
    plt.title('Ridge Regression')

    # Lasso Regression
    plt.subplot(1, 3, 2)
    plt.scatter(y_test, y_pred_lasso, alpha=0.5, color='green')
    plt.plot([min(y_test), max(y_test)], [min(y_test), max(y_test)], color='red')  # Diagonal line
    plt.xlabel('Actual Life Expectancy')
    plt.ylabel('Predicted Life Expectancy')
    plt.title('Lasso Regression')

    # Random Forest Regression
    plt.subplot(1, 3, 3)
    plt.scatter(y_test, y_pred_rf, alpha=0.5, color='purple')
    plt.plot([min(y_test), max(y_test)], [min(y_test), max(y_test)], color='red')  # Diagonal line
    plt.xlabel('Actual Life Expectancy')
    plt.ylabel('Predicted Life Expectancy')
    plt.title('Random Forest Regression')

    plt.tight_layout()
    plt.show()

# Plot results for each model side by side
plot_results(y_test, y_pred_ridge, y_pred_lasso, y_pred_rf)
