# Predicting Life Expectancy with Polynomial Regression & Lasso Regularization
This notebook predicts the life expectancy of countries from a set of health and economic indicators.

Because the relationship between some variables (like GDP or BMI) and life expectancy might not be perfectly linear, we're going to generate **Polynomial Features**. Then, to prevent the model from overfitting on all these new features, we'll apply **Lasso Regression (L1 Regularization)** to automatically select only the most useful ones.

In [1]:
# First, let's bring in the tools we need.
# Pandas and numpy for data wrangling, and sklearn for the machine learning pipeline.
import pandas as pd
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LassoCV
from sklearn.metrics import mean_squared_error, r2_score


### 1. Data Loading
We have our dataset split into training and testing files. It's crucial to evaluate our final model on the testing set to make sure it generalizes well to unseen data.

In [2]:
# Let's load up the CSV files into pandas DataFrames.
train_df = pd.read_csv("life_expectancy_train_master.csv")
test_df = pd.read_csv("life_expectancy_test_master.csv")


### 2. Feature Selection
We want to predict `LifeExpectancy`. Everything else in the dataset can be used as a predictor (feature). So we separate our target variable (`y`) from our feature matrix (`X`).

In [3]:
# Define what we are trying to predict
target = "LifeExpectancy"

# Split the training data into features (X) and target (y)
y_train = train_df[target]
X_train = train_df.drop(columns=[target])

# Do the exact same for the testing data
y_test = test_df[target]
X_test = test_df.drop(columns=[target])
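
Both `PolynomialFeatures` and `LassoCV` expect a purely numeric matrix with no missing values, so it can be worth checking the split before moving on. The sketch below is one way to do that. It runs on a tiny made-up frame rather than the real CSV files, and the helper name `check_features` and the toy columns are hypothetical.

```python
import numpy as np
import pandas as pd


def check_features(X_a: pd.DataFrame, X_b: pd.DataFrame) -> dict:
    """Summarise issues that would break PolynomialFeatures or LassoCV."""
    return {
        # Train and test must expose the same columns in the same order
        "same_columns": list(X_a.columns) == list(X_b.columns),
        # Any non-numeric column needs encoding before the polynomial step
        "non_numeric": X_a.select_dtypes(exclude="number").columns.tolist(),
        # Missing values need imputing or dropping before fitting
        "missing_train": int(X_a.isna().sum().sum()),
        "missing_test": int(X_b.isna().sum().sum()),
    }


# Toy stand-ins for X_train / X_test (made-up values, not the real data)
toy_train = pd.DataFrame(
    {"BMI": [21.0, 25.5, np.nan], "Status": ["Developed", "Developing", "Developing"]}
)
toy_test = pd.DataFrame({"BMI": [23.1], "Status": ["Developed"]})

report = check_features(toy_train, toy_test)
print(report)
```

In the notebook itself the call would be `check_features(X_train, X_test)`; an empty `non_numeric` list and zero missing counts mean the data is ready for the next step.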


### 3. Model Training & Evaluation
Here is where the magic happens:
1.  **Polynomial Features**: We use `degree=2` to create squared terms (e.g., $BMI^2$) and interaction terms (e.g., $BMI \times GDP$) for all our features. This expands our feature count from 27 to 405!
2.  **LassoCV**: With 405 features, an unregularized linear model is likely to overfit, so we use Lasso Regression. The `CV` stands for Cross-Validation: the model tries a range of regularization strengths (`alphas`), scores each one across the folds, and picks the best one for us. Lasso is a good fit here because it can shrink the coefficients of unhelpful features to exactly 0, acting as a built-in feature selector.
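
To see where the expanded feature count comes from, here is a small sketch. With `degree=2` and `include_bias=False`, each of the $n$ original columns contributes itself and its square, and every pair of columns contributes one interaction term, for $2n + \binom{n}{2}$ columns in total. The value `n_original = 27` is an assumption inferred from the 405 columns the notebook reports, and `n_poly_features` is a hypothetical helper.

```python
from math import comb

import numpy as np
from sklearn.preprocessing import PolynomialFeatures


def n_poly_features(n_features: int) -> int:
    """Degree-2 expansion without bias: originals + squares + pairwise products."""
    return n_features + n_features + comb(n_features, 2)


n_original = 27  # assumption: inferred from the 405 columns reported in the output

# The values are irrelevant here; only the shape of the expanded matrix matters
dummy = np.zeros((1, n_original))
n_expanded = (
    PolynomialFeatures(degree=2, include_bias=False).fit_transform(dummy).shape[1]
)

print(n_poly_features(n_original), n_expanded)  # prints: 405 405
```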

In [4]:
# Generate polynomial and interaction features
poly = PolynomialFeatures(degree=2, include_bias=False)
X_train_poly = poly.fit_transform(X_train)
X_test_poly = poly.transform(X_test) # Crucial: only use .transform() on test data to prevent data leakage!

# Set up the Lasso model with 5-fold cross-validation to find the best alpha
model = LassoCV(cv=5, random_state=42, n_alphas=50, max_iter=10000)
model.fit(X_train_poly, y_train)

# Let's see what the model figured out
print(f"Best alpha found by LassoCV: {model.alpha_:.4f}")
print(f"Number of features kept: {np.sum(model.coef_ != 0)} out of {X_train_poly.shape[1]}")

# --- Evaluation ---

# How well does it fit the data it was trained on?
y_train_pred = model.predict(X_train_poly)
print("\n--- Training Performance ---")
print(f"R^2:  {r2_score(y_train, y_train_pred):.4f}")
print(f"RMSE: {np.sqrt(mean_squared_error(y_train, y_train_pred)):.4f}")

# How well does it generalize to new data? (This is the real test)
y_test_pred = model.predict(X_test_poly)
print("\n--- Testing Performance ---")
print(f"R^2:  {r2_score(y_test, y_test_pred):.4f}")
print(f"RMSE: {np.sqrt(mean_squared_error(y_test, y_test_pred)):.4f}")

# Finally, let's look at which features were actually useful enough to keep
print("\n--- Features kept by Lasso ---")
feature_names = poly.get_feature_names_out(X_train.columns)
kept_features = feature_names[model.coef_ != 0]
for feature in kept_features:
    print(f"- {feature}")


Best alpha found by LassoCV: 31.0359
Number of features kept: 23 out of 405

--- Training Performance ---
R^2:  0.8240
RMSE: 4.0092

--- Testing Performance ---
R^2:  0.8263
RMSE: 3.8805

--- Features kept by Lasso ---
- Year^2
- Year AdultMortality
- Year InfantMortality
- Year Alcohol
- Year percentage expenditure
- Year Hepatitis B
- Year Measles
- Year BMI
- Year under-five deaths
- Year Polio
- Year HealthExpenditure
- Year Diphtheria
- Year HIV_AIDS
- Year Population
- Year thinness  1-19 years
- Year IncomeLevel
- Year Schooling
- Year Status_Encoded
- Year health_efficiency_ratio
- Year health_gdp_ratio
- Year log_GDP
- Year mortality_health_interaction
- Year education_income_interaction
