# 🧠 Ridge and Lasso Regularization in Linear Regression

This notebook covers end-to-end understanding of regularization in linear regression, including:
- Linear Regression without regularization
- Overfitting and high variance
- L2 Regularization (Ridge) and L1 Regularization (Lasso)
- Concept and effect of `alpha`
- Coordinate descent, z_j and soft-thresholding
- Finding best alpha using cross-validation
- Visuals and code implementations

## ❓ What is Regularization?

Regularization helps prevent **overfitting** in linear regression by penalizing large coefficients.
Two types:
- **L2 (Ridge)**: adds \( \lambda \sum \beta_j^2 \) to the loss
- **L1 (Lasso)**: adds \( \lambda \sum |\beta_j| \) to the loss

The loss functions become:
- **Ridge**: \( \text{Loss} = \sum (y_i - \hat{y}_i)^2 + \lambda \sum \beta_j^2 \)
- **Lasso**: \( \text{Loss} = \sum (y_i - \hat{y}_i)^2 + \lambda \sum |\beta_j| \)

`alpha` (or λ) controls the strength of regularization. Higher alpha means more penalty.
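
To make the effect of `alpha` concrete, here is a minimal sketch that solves the Ridge loss above in closed form with NumPy and prints the size of the coefficient vector for several alpha values. The data is synthetic and the helper name `ridge_closed_form` is hypothetical; it is an illustration under those assumptions, not the solver scikit-learn uses.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic, illustrative data: 100 samples, 5 features
X = rng.normal(size=(100, 5))
true_beta = np.array([4.0, -3.0, 2.0, 0.0, 0.0])
y = X @ true_beta + rng.normal(scale=1.0, size=100)

# Center features and target so that no intercept is needed
X = X - X.mean(axis=0)
y = y - y.mean()

def ridge_closed_form(X, y, alpha):
    """Minimize sum (y - X b)^2 + alpha * sum b^2 via the normal equations."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

alphas = [0.0, 1.0, 10.0, 100.0, 1000.0]
coef_norms = []
for alpha in alphas:
    beta = ridge_closed_form(X, y, alpha)
    coef_norms.append(float(np.linalg.norm(beta)))
    print(f"alpha={alpha:>7}: ||beta|| = {coef_norms[-1]:.3f}")
```

With `alpha=0` this reduces to ordinary least squares, and the coefficient norm shrinks as alpha grows.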

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, Ridge, Lasso, LassoCV, RidgeCV
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error
import matplotlib.pyplot as plt

df = pd.read_csv('insurance.csv')
df_encoded = pd.get_dummies(df, drop_first=True)
X = df_encoded.drop('charges', axis=1)
y = df_encoded['charges']

In [2]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Fit the scaler on the training split only, then reuse it on the test split to avoid leakage
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

In [3]:
lr = LinearRegression()
lr.fit(X_train_scaled, y_train)
mse_lr = mean_squared_error(y_test, lr.predict(X_test_scaled))
print(f"Linear Regression MSE: {mse_lr:.2f}")

Linear Regression MSE: 33596915.85


In [4]:
# Ridge and Lasso with fixed alpha (before CV tuning)
ridge_fixed = Ridge(alpha=100)
ridge_fixed.fit(X_train_scaled, y_train)
mse_ridge_fixed = mean_squared_error(y_test, ridge_fixed.predict(X_test_scaled))

lasso_fixed = Lasso(alpha=100, max_iter=10000)
lasso_fixed.fit(X_train_scaled, y_train)
mse_lasso_fixed = mean_squared_error(y_test, lasso_fixed.predict(X_test_scaled))

print(f"Ridge (fixed alpha=100) MSE: {mse_ridge_fixed:.2f}")
print(f"Lasso (fixed alpha=100) MSE: {mse_lasso_fixed:.2f}")

Ridge (fixed alpha=100) MSE: 35176302.32
Lasso (fixed alpha=100) MSE: 34056599.87


## 🔍 Why Are Ridge and Lasso Still Important?

You might ask:  
**"If Linear Regression has the lowest test MSE (about 33.6 million, versus 35.2 million for Ridge and 34.1 million for Lasso at alpha=100), why bother using Ridge or Lasso?"**

Here’s why:

### 1. Linear Regression can overfit when you have many features

- More features → higher risk of fitting noise, not the true signal
- Linear Regression has **no built-in protection** against overfitting
- With many features relative to the number of samples, it is more likely to perform poorly on unseen or future data

### 2. Ridge & Lasso apply penalties (regularization)

- **Ridge (L2)**: Penalizes large coefficients → keeps model stable
- **Lasso (L1)**: Penalizes absolute value → can shrink some coefficients to **exactly zero**
    - This acts as **automatic feature selection**
- These are very helpful when working with **dozens or hundreds of features**

### 3. Regularization improves generalization to new data

- Regularized models are typically more **robust** when applied to new, unseen datasets
- Linear Regression might look great now, but its performance can **drop during deployment**
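
The gap between training and test performance can be sketched on synthetic data with few samples and many features. Everything here is an assumption made for illustration (the data, the sizes, `alpha=10.0`, and the helper names `fit_ridge` and `mse`); the exact numbers depend on the random seed.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical setup: few training samples, many features, only 5 of them informative
n_train, n_test, n_features = 60, 500, 50
true_beta = np.zeros(n_features)
true_beta[:5] = [3.0, -2.0, 1.5, 1.0, -1.0]

X_train = rng.normal(size=(n_train, n_features))
X_test = rng.normal(size=(n_test, n_features))
y_train = X_train @ true_beta + rng.normal(scale=2.0, size=n_train)
y_test = X_test @ true_beta + rng.normal(scale=2.0, size=n_test)

def fit_ridge(X, y, alpha):
    """Closed-form ridge without intercept; alpha=0 gives ordinary least squares."""
    return np.linalg.solve(X.T @ X + alpha * np.eye(X.shape[1]), X.T @ y)

def mse(X, y, beta):
    return float(np.mean((y - X @ beta) ** 2))

beta_ols = fit_ridge(X_train, y_train, alpha=0.0)
beta_ridge = fit_ridge(X_train, y_train, alpha=10.0)

for name, beta in [("OLS", beta_ols), ("Ridge", beta_ridge)]:
    print(f"{name:>5}: train MSE = {mse(X_train, y_train, beta):.2f}, "
          f"test MSE = {mse(X_test, y_test, beta):.2f}, "
          f"||beta|| = {np.linalg.norm(beta):.2f}")
```

Ordinary least squares always has the lowest training MSE by construction, so the test MSE column is the one to compare. In settings like this one, the regularized model typically generalizes better.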


## 🔍 RidgeCV and LassoCV - Finding the best alpha

`RidgeCV` and `LassoCV` use cross-validation to pick, from a list of candidate values, the `alpha` with the lowest validation error.

**Why use CV?**
Manual alpha tuning is hard. Cross-validation tests multiple alpha values and picks the best.
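
As a rough sketch of what cross-validated alpha selection does, the loop below runs 5-fold cross-validation by hand for a closed-form ridge model on synthetic data. The helpers `fit_ridge` and `cv_mse` are hypothetical, and the sketch skips the intercept and per-fold scaling, so it illustrates the idea rather than reproducing `RidgeCV`.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic, illustrative data
X = rng.normal(size=(120, 8))
y = X @ rng.normal(size=8) + rng.normal(scale=1.0, size=120)

def fit_ridge(X, y, alpha):
    """Closed-form ridge without intercept."""
    return np.linalg.solve(X.T @ X + alpha * np.eye(X.shape[1]), X.T @ y)

def cv_mse(X, y, alpha, n_folds=5):
    """Average validation MSE of ridge over n_folds contiguous folds."""
    folds = np.array_split(np.arange(len(y)), n_folds)
    fold_errors = []
    for val_idx in folds:
        # Train on everything except the current validation fold
        train_mask = np.ones(len(y), dtype=bool)
        train_mask[val_idx] = False
        beta = fit_ridge(X[train_mask], y[train_mask], alpha)
        fold_errors.append(np.mean((y[val_idx] - X[val_idx] @ beta) ** 2))
    return float(np.mean(fold_errors))

alphas = [0.1, 1, 10, 100, 500, 1000]
cv_errors = {alpha: cv_mse(X, y, alpha) for alpha in alphas}
best_alpha = min(cv_errors, key=cv_errors.get)

for alpha, err in cv_errors.items():
    print(f"alpha={alpha:>6}: CV MSE = {err:.3f}")
print(f"Best alpha: {best_alpha}")
```

The selected alpha is only the best value within the candidate list, so the grid should span several orders of magnitude.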


In [5]:
alphas = [0.1, 1, 10, 100, 500, 1000]

ridge_cv = RidgeCV(alphas=alphas, scoring='neg_mean_squared_error', cv=5)
ridge_cv.fit(X_train_scaled, y_train)

lasso_cv = LassoCV(alphas=alphas, max_iter=10000, cv=5)
lasso_cv.fit(X_train_scaled, y_train)

print(f"Best alpha for Ridge: {ridge_cv.alpha_}")
print(f"Best alpha for Lasso: {lasso_cv.alpha_}")

Best alpha for Ridge: 10.0
Best alpha for Lasso: 100.0


In [6]:
mse_ridge = mean_squared_error(y_test, ridge_cv.predict(X_test_scaled))
mse_lasso = mean_squared_error(y_test, lasso_cv.predict(X_test_scaled))

print(f"RidgeCV MSE: {mse_ridge:.2f}")
print(f"LassoCV MSE: {mse_lasso:.2f}")

RidgeCV MSE: 33685862.86
LassoCV MSE: 34056599.87


## 🧮 Understanding z_j in Coordinate Descent

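- \( z_j = \sum_i x_{ij} r_i \) is the (unnormalized) correlation between feature j and the residuals. During a coordinate descent update, the residuals leave out feature j's own contribution.
- Lasso soft-thresholds this value: if \( |z_j| \) is below the threshold, \( \beta_j \) is set to exactly 0. scikit-learn's `Lasso` minimizes \( \frac{1}{2n} \sum (y_i - \hat{y}_i)^2 + \alpha \sum |\beta_j| \), so for the unnormalized sum the threshold is \( n\alpha \), not `alpha` itself.

This is how Lasso performs automatic feature selection. The notebook's `z_j` column is computed from the full residuals of the fitted model, so features with nonzero coefficients sit at \( |z_j| \approx n\alpha \) (about 107,000 in the output) while the zeroed features fall below it.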
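
The update rule can be sketched as a small cyclic coordinate descent in NumPy. This is a minimal illustration on synthetic data, assuming the scaled objective \( \frac{1}{2n} \sum (y_i - \hat{y}_i)^2 + \alpha \sum |\beta_j| \) with \( z_j \) divided by n, so that the threshold is `alpha`. The function names are hypothetical and this is not scikit-learn's implementation.

```python
import numpy as np

def soft_threshold(z, threshold):
    """Shrink z toward zero by threshold; return 0 if |z| <= threshold."""
    return np.sign(z) * max(abs(z) - threshold, 0.0)

def lasso_coordinate_descent(X, y, alpha, n_iters=200):
    """Minimize (1/(2n)) * ||y - X b||^2 + alpha * ||b||_1 by cyclic updates."""
    n_samples, n_features = X.shape
    beta = np.zeros(n_features)
    col_scale = (X ** 2).sum(axis=0) / n_samples
    for _ in range(n_iters):
        for j in range(n_features):
            # Partial residual: leave out feature j's current contribution
            partial_residual = y - X @ beta + X[:, j] * beta[j]
            z_j = X[:, j] @ partial_residual / n_samples
            beta[j] = soft_threshold(z_j, alpha) / col_scale[j]
    return beta

# Synthetic data: only the first two of five features carry signal
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
X = (X - X.mean(axis=0)) / X.std(axis=0)  # standardize columns
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=200)
y = y - y.mean()

alpha = 0.5
beta = lasso_coordinate_descent(X, y, alpha)

# Optimality check on the full residuals of the fitted model
gradient = X.T @ (y - X @ beta) / len(y)
print("coefficients: ", np.round(beta, 3))
print("|x_j . r| / n:", np.round(np.abs(gradient), 3))
```

After convergence, features with nonzero coefficients have \( |x_j \cdot r| / n \) equal to `alpha`, and features whose value stays below `alpha` keep a coefficient of exactly zero.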

In [7]:
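# Residuals of the fitted Lasso model on the training data
residuals = y_train - lasso_cv.predict(X_train_scaled)
X_df = pd.DataFrame(X_train_scaled, columns=X.columns)
# z_j = sum_i x_ij * r_i: unnormalized inner product of each feature with the residuals
z_j = X_df.mul(residuals.values, axis=0).sum()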

coef_df = pd.DataFrame({
    'Feature': X.columns,
    'z_j': z_j.values,
    'Lasso Coefficient': lasso_cv.coef_
}).sort_values(by='z_j', key=abs, ascending=False)
display(coef_df)

| | Feature | z_j | Lasso Coefficient |
|---|---|---|---|
| 1 | bmi | 107069.446788 | 1892.789994 |
| 6 | region_southeast | -107031.607675 | -15.589796 |
| 4 | smoker_yes | 107011.13859 | 9453.06803 |
| 7 | region_southwest | -107000.0 | -104.379764 |
| 2 | children | 106992.118403 | 424.978753 |
| 0 | age | 106953.876733 | 3528.802468 |
| 3 | sex_male | 4473.40817 | 0.0 |
| 5 | region_northwest | -2797.320238 | -0.0 |
