# 3.0 - Regularized Regression Analysis

## Objective

This notebook explores the effect of regularization on linear models. Unlike the other notebooks, this one treats the income prediction task as a **regression problem** (fitting a continuous value to the income label, nominally between 0 and 1) rather than a classification problem. Note that a linear model's raw predictions are not constrained to that range.

This allows us to analyze how `Ridge (L2)`, `Lasso (L1)`, and `ElasticNet` regularization techniques affect the model coefficients and feature selection. The goal here is less about achieving the highest performance and more about **interpreting model behavior**.
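
For reference, the objectives that scikit-learn minimizes for these three estimators can be written as below. The symbols are notation introduced here, not taken from the notebook: $w$ is the coefficient vector, $n$ the number of training samples, $\alpha$ the `alpha` parameter, and $\rho$ the `l1_ratio` parameter; the intercept is omitted for brevity.

$$
\begin{aligned}
\text{Ridge:} \quad & \min_{w} \; \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_2^2 \\
\text{Lasso:} \quad & \min_{w} \; \frac{1}{2n} \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_1 \\
\text{ElasticNet:} \quad & \min_{w} \; \frac{1}{2n} \lVert y - Xw \rVert_2^2 + \alpha \rho \lVert w \rVert_1 + \frac{\alpha (1 - \rho)}{2} \lVert w \rVert_2^2
\end{aligned}
$$

Because the Ridge data term is not scaled by $\frac{1}{2n}$, an `alpha` value for `Ridge` is not directly comparable to the same value for `Lasso` or `ElasticNet`.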


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import sys
import os

# Add src to path
sys.path.append(os.path.join(os.path.abspath(''), '..', 'src'))

from data.make_dataset import load_data
from features.build_features import split_features_target, one_hot_encode_features, split_data

# Models for this notebook
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet
from sklearn.metrics import mean_squared_error

## 1. Load and Prepare Data

Load the data and apply `One-Hot Encoding` to prepare it for the linear regression models.
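
As a rough illustration of what one-hot encoding does, here is a minimal sketch on a toy frame using `pd.get_dummies`. The toy data and column names are made up, and the project's `one_hot_encode_features` helper may be implemented differently (for example, it may drop a reference level).

```python
import pandas as pd

# Hypothetical toy data: one numeric and one categorical column
toy = pd.DataFrame({
    "age": [39, 50, 38],
    "workclass": ["State-gov", "Private", "Private"],
})

# Each category becomes its own 0/1 indicator column; numeric columns pass through
toy_ohe = pd.get_dummies(toy, columns=["workclass"], dtype=float)
print(toy_ohe)
```

Each row has exactly one indicator set to 1 per encoded column, which is why the coefficient table later has one entry per category rather than one per original feature.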


In [None]:
df = load_data('../data/processed/adult_cleaned.csv')
X, y = split_features_target(df)

X_ohe = one_hot_encode_features(X)
X_train, X_test, y_train, y_test = split_data(X_ohe, y)

print("Data prepared for linear regression.")
print("X_train shape:", X_train.shape)
print("X_test shape:", X_test.shape)

## 2. Train and Evaluate Regularized Models

We will train three regularized linear models (Ridge, Lasso, and ElasticNet) alongside an unregularized `LinearRegression` baseline, and compare their performance using Mean Squared Error (MSE) on the test set.
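
An MSE value is easier to interpret next to a reference point. The sketch below uses synthetic data (not the income dataset) with a hypothetical 0/1 target to show that `mean_squared_error` is the mean of squared residuals, and to compare it against a baseline that always predicts the training mean. All variable names here are illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(500, 3))
# Hypothetical 0/1 target driven mostly by the first feature
y_toy = (X_toy[:, 0] + 0.5 * rng.normal(size=500) > 0).astype(float)

X_tr, X_te = X_toy[:400], X_toy[400:]
y_tr, y_te = y_toy[:400], y_toy[400:]

pred = LinearRegression().fit(X_tr, y_tr).predict(X_te)

mse_model = mean_squared_error(y_te, pred)
mse_manual = np.mean((y_te - pred) ** 2)  # same quantity, computed by hand
mse_baseline = mean_squared_error(y_te, np.full_like(y_te, y_tr.mean()))

print(f"Model MSE:    {mse_model:.4f}")
print(f"Manual MSE:   {mse_manual:.4f}")
print(f"Baseline MSE: {mse_baseline:.4f}")
# Linear regression on a 0/1 target can predict values outside [0, 1]
print("Predictions outside [0, 1]:", int(((pred < 0) | (pred > 1)).sum()))
```

A model whose MSE is not clearly below the mean-prediction baseline has learned little, whatever the absolute number looks like.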


In [None]:
# Initialize models
lin_reg = LinearRegression()
ridge = Ridge(alpha=1.0, random_state=42)
lasso = Lasso(alpha=0.001, random_state=42) # Using a smaller alpha for Lasso to prevent it from zeroing out all coefficients
elastic = ElasticNet(alpha=0.001, l1_ratio=0.5, random_state=42)

# Train models
lin_reg.fit(X_train, y_train)
ridge.fit(X_train, y_train)
lasso.fit(X_train, y_train)
elastic.fit(X_train, y_train)

# Evaluate
lin_reg_pred = lin_reg.predict(X_test)
ridge_pred = ridge.predict(X_test)
lasso_pred = lasso.predict(X_test)
elastic_pred = elastic.predict(X_test)

# Calculate MSE
mse_lin_reg = mean_squared_error(y_test, lin_reg_pred)
mse_ridge = mean_squared_error(y_test, ridge_pred)
mse_lasso = mean_squared_error(y_test, lasso_pred)
mse_elastic = mean_squared_error(y_test, elastic_pred)

# Display results
print("MSE on Test Set:")
print(f"Linear Regression (Unregularized): {mse_lin_reg:.4f}")
print(f"Ridge Regression (L2):             {mse_ridge:.4f}")
print(f"Lasso Regression (L1):             {mse_lasso:.4f}")
print(f"Elastic Net (L1+L2):               {mse_elastic:.4f}")

## 3. Coefficient Analysis

The primary goal of this notebook is to see how regularization affects the model coefficients. We will extract the coefficients from each model and compare them.

- **Lasso (L1)** is expected to drive many coefficients to exactly zero, performing feature selection.
- **Ridge (L2)** is expected to shrink coefficients towards zero, but not completely eliminate them.
- **ElasticNet** should provide a balance between the two.
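
These expectations can be checked on a small synthetic problem before looking at the real data. The sketch below assumes a dataset from `make_regression` in which only 5 of 20 features carry signal; the `alpha` values are chosen for illustration only and are unrelated to those used in this notebook.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet

# Synthetic data: 20 features, only 5 of which influence the target
X_syn, y_syn = make_regression(
    n_samples=200, n_features=20, n_informative=5, noise=10.0, random_state=0
)

models = {
    "LinearRegression": LinearRegression(),
    "Ridge": Ridge(alpha=10.0),
    "Lasso": Lasso(alpha=5.0),
    "ElasticNet": ElasticNet(alpha=5.0, l1_ratio=0.5),
}

zero_counts = {}
l2_norms = {}
for name, model in models.items():
    model.fit(X_syn, y_syn)
    zero_counts[name] = int((model.coef_ == 0).sum())    # exact zeros
    l2_norms[name] = float(np.linalg.norm(model.coef_))  # overall coefficient size
    print(f"{name:<17} zeros: {zero_counts[name]:>2}  L2 norm: {l2_norms[name]:.2f}")
```

Ridge reduces the overall size of the coefficient vector without producing exact zeros, while the L1 term in Lasso removes uninformative features entirely.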


In [None]:
# Get feature names from the one-hot encoded DataFrame
feature_names = X_ohe.columns

# Create a DataFrame to hold the coefficients
coef_df = pd.DataFrame({
    "Feature": feature_names,
    "LinearRegression": lin_reg.coef_,
    "Ridge": ridge.coef_,
    "Lasso": lasso.coef_,
    "ElasticNet": elastic.coef_
})

# How many coefficients are zero?
print("Number of coefficients set to zero:")
print(f"Lasso: {(coef_df['Lasso'] == 0).sum()} / {len(coef_df)}")
print(f"ElasticNet: {(coef_df['ElasticNet'] == 0).sum()} / {len(coef_df)}")

# Display the top 10 largest coefficients by magnitude for Linear Regression
print("\nTop 10 Linear Regression Coefficients (by magnitude):")
coef_df.reindex(coef_df.LinearRegression.abs().sort_values(ascending=False).index).head(10)

In [None]:
# Plot the coefficients of the 20 features with the largest unregularized magnitude
top_features = coef_df.reindex(
    coef_df.LinearRegression.abs().sort_values(ascending=False).index
).head(20)

# Reshape to long format so each model gets its own bar per feature.
# Overlaying separate barplots on the same axes lets bars drawn later cover earlier ones.
top_long = top_features.melt(id_vars='Feature', var_name='Model', value_name='Coefficient')

plt.figure(figsize=(10, 12))
sns.barplot(
    y='Feature', x='Coefficient', hue='Model', data=top_long,
    palette={'LinearRegression': 'gray', 'Ridge': 'blue', 'ElasticNet': 'green', 'Lasso': 'red'},
)
plt.title('Comparison of Top 20 Coefficients')
plt.xlabel('Coefficient Value')
plt.ylabel('Feature')
plt.legend()
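# bbox_inches='tight' keeps long one-hot feature labels from being clipped in the saved file
plt.savefig('../reports/figures/3.0_coefficient_comparison.png', bbox_inches='tight')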
plt.show()