- Let's dive into **Topic 7: Regularized Linear Models**.
- These techniques address some limitations of standard Linear Regression, particularly its tendency to overfit when there are many features or when the features are multicollinear.

**1. Why Do We Need Regularization?**

Standard Linear Regression (Ordinary Least Squares - OLS) aims to find the coefficient values ($b_j$) that minimize the Mean Squared Error (MSE). This works well when you have a good number of samples, few features, and the model assumptions are reasonably met.

However, OLS can run into problems:

* **Overfitting:** If you have many features (high dimensionality), especially if the number of features is close to or exceeds the number of samples, the model can start fitting the *noise* in the training data rather than the underlying signal. This results in a model that performs exceptionally well on the training data but poorly on new, unseen data (poor generalization). Overfit models often have very large coefficient values.
* **Multicollinearity:** When features are highly correlated, the coefficient estimates in OLS can become unstable and have high variance. Small changes in the training data can lead to large swings in the estimated coefficients, making them hard to interpret.
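
The instability caused by multicollinearity can be illustrated with a small synthetic sketch. The data, noise levels, and the `ols_coefs` helper are made up for illustration and are not part of the housing example. Two nearly identical features are refit on bootstrap resamples: the individual coefficients swing widely, while their sum (the part the data can actually pin down) stays stable.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Two almost perfectly correlated features (x2 is x1 plus tiny noise).
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.01, size=n)
X = np.column_stack([x1, x2])
y = 3.0 * x1 + rng.normal(scale=0.5, size=n)

def ols_coefs(X, y):
    """Hypothetical helper: OLS slopes via least squares, intercept dropped."""
    X1 = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return beta[1:]

# Refit on bootstrap resamples to mimic "small changes in the training data".
coefs = np.array([
    ols_coefs(X[idx], y[idx])
    for idx in (rng.integers(0, n, size=n) for _ in range(200))
])

print("correlation(x1, x2):     ", np.corrcoef(x1, x2)[0, 1])
print("std of each coefficient: ", coefs.std(axis=0))
print("std of coefficient sum:  ", coefs.sum(axis=1).std())
```

The large spread in each coefficient next to the small spread in their sum is the high-variance behaviour described above.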


**Regularization** is a technique used to combat these issues. It works by adding a **penalty term** to the cost function. This penalty discourages the model from learning overly complex patterns or assigning excessively large weights (coefficients) to features.

**The Core Idea: Modifying the Cost Function**

* Standard Linear Regression minimizes:
    $$J_{OLS}(b) = MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$
* Regularized Linear Regression minimizes a modified cost function:
    $$J_{Regularized}(b) = MSE + \text{Penalty Term}$$
    The penalty term is a function of the magnitude of the coefficients. By penalizing large coefficients, regularization forces the model to be "simpler" and often improves its ability to generalize to new data.
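
As a rough sketch of the modified cost function, the snippet below computes MSE plus a penalty for a fixed set of coefficients. It assumes the sum of squared coefficients as the penalty (the form Ridge uses in section 2). The function name `regularized_cost` and the toy numbers are hypothetical.

```python
import numpy as np

def regularized_cost(X, y, b0, b, alpha):
    """MSE plus alpha * sum(b_j^2); the intercept b0 is not penalized."""
    residuals = y - (b0 + X @ b)
    mse = np.mean(residuals ** 2)
    penalty = alpha * np.sum(b ** 2)
    return mse + penalty

# Toy data: three samples, two features.
X = np.array([[1.0, 2.0], [2.0, 0.0], [3.0, 1.0]])
y = np.array([5.0, 4.0, 7.0])
b0, b = 0.0, np.array([2.0, 1.0])

print("alpha = 0.0:", regularized_cost(X, y, b0, b, alpha=0.0))  # plain MSE
print("alpha = 0.1:", regularized_cost(X, y, b0, b, alpha=0.1))  # MSE + penalty
```

With `alpha=0` the cost reduces to the plain MSE. Any positive `alpha` makes the same coefficients more expensive, which is what pushes the optimizer toward smaller ones.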

**Important Note on Feature Scaling:**
For regularized models, it's **essential to scale your features** (e.g., using `StandardScaler` from Scikit-learn). The penalty is applied to the coefficients, and a coefficient's size depends on its feature's units: a feature measured on a large scale needs only a small coefficient and is barely penalized, while one on a small scale needs a large coefficient and is penalized heavily. Scaling puts all features on a comparable footing so the penalty treats them evenly. The target variable `y` is generally not scaled.
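
To see why scaling matters, here is a minimal sketch on synthetic data. The helpers `ridge_fit_predict` and `standardize` are illustrative, and the penalty is assumed to be the L2 form on the sum of squared errors. Re-expressing one feature in a unit 1000 times larger changes the predictions of a penalized fit on raw features, but not of a fit on standardized features.

```python
import numpy as np

def ridge_fit_predict(X_train, y_train, X_new, alpha):
    """Closed-form L2-penalized fit on centered data (intercept unpenalized).

    Minimizes sum of squared errors + alpha * sum(b_j^2).
    """
    x_mean, y_mean = X_train.mean(axis=0), y_train.mean()
    Xc, yc = X_train - x_mean, y_train - y_mean
    p = X_train.shape[1]
    b = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(p), Xc.T @ yc)
    return y_mean + (X_new - x_mean) @ b

def standardize(X):
    """Zero mean, unit variance per column."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 2))
y = X[:, 0] + X[:, 1] + rng.normal(scale=0.1, size=50)

# Same information, but the second feature is now in a 1000x larger unit.
X_rescaled = X.copy()
X_rescaled[:, 1] /= 1000.0

alpha = 10.0
raw_a = ridge_fit_predict(X, y, X, alpha)
raw_b = ridge_fit_predict(X_rescaled, y, X_rescaled, alpha)
std_a = ridge_fit_predict(standardize(X), y, standardize(X), alpha)
std_b = ridge_fit_predict(standardize(X_rescaled), y, standardize(X_rescaled), alpha)

print("max prediction change, raw features: ", np.abs(raw_a - raw_b).max())
print("max prediction change, standardized: ", np.abs(std_a - std_b).max())
```

On raw features the rescaled column would need a coefficient about 1000 times larger, so the penalty effectively removes it from the model. After standardization the choice of unit no longer matters.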


Let's explore the three main types of regularized linear models. We'll use the **California Housing dataset** for our examples, as it has more features than the Advertising dataset and is readily available in Scikit-learn.

In [2]:
import numpy as np
import pandas as pd
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split, cross_val_score, KFold
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet, RidgeCV, LassoCV, ElasticNetCV
from sklearn.metrics import mean_squared_error
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore', category=FutureWarning) # To suppress some sklearn warnings

In [3]:
# Load California Housing dataset
housing = fetch_california_housing()
X, y = housing.data, housing.target

In [4]:
# For clarity, let's put X into a DataFrame with feature names
X_df = pd.DataFrame(X, columns=housing.feature_names)

In [5]:
# Split data
X_train_raw, X_test_raw, y_train, y_test = train_test_split(X_df, y, test_size=0.2, random_state=42)

In [6]:
# Feature Scaling
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train_raw) # Fit on training and transform
X_test = scaler.transform(X_test_raw)       # Transform test using training fit

In [7]:
# Convert scaled arrays back to DataFrames for easier inspection of coefficients later
X_train_scaled_df = pd.DataFrame(X_train, columns=X_train_raw.columns)
X_test_scaled_df = pd.DataFrame(X_test, columns=X_test_raw.columns)

In [8]:
print("California Housing Dataset Loaded and Scaled.")
print(f"Training features shape: {X_train.shape}")
print(f"Test features shape: {X_test.shape}")
print(f"First 5 rows of X_train_raw (unscaled):\n{X_train_raw.head()}")
print(f"\nFirst 5 rows of X_train_scaled_df (scaled):\n{X_train_scaled_df.head()}")

California Housing Dataset Loaded and Scaled.
Training features shape: (16512, 8)
Test features shape: (4128, 8)
First 5 rows of X_train_raw (unscaled):
       MedInc  HouseAge  AveRooms  AveBedrms  Population  AveOccup  Latitude  \
14196  3.2596      33.0  5.017657   1.006421      2300.0  3.691814     32.71   
8267   3.8125      49.0  4.473545   1.041005      1314.0  1.738095     33.77   
17445  4.1563       4.0  5.645833   0.985119       915.0  2.723214     34.66   
14265  1.9425      36.0  4.002817   1.033803      1418.0  3.994366     32.69   
2271   3.5542      43.0  6.268421   1.134211       874.0  2.300000     36.78   

       Longitude  
14196    -117.03  
8267     -118.16  
17445    -120.48  
14265    -117.11  
2271     -119.80  

First 5 rows of X_train_scaled_df (scaled):
     MedInc  HouseAge  AveRooms  AveBedrms  Population  AveOccup  Latitude  \
0 -0.326196  0.348490 -0.174916  -0.208365    0.768276  0.051376 -1.372811   
1 -0.035843  1.618118 -0.402835  -0.128530   -0.098

---
**2. Ridge Regression (L2 Regularization)**

* **Penalty Term:** Ridge Regression adds a penalty proportional to the **sum of the squared coefficients** (the squared L2 norm of the coefficient vector). The intercept ($b_0$) is typically not regularized.
    $$\text{Penalty}_{L2} = \alpha \sum_{j=1}^{p} b_j^2$$
* **Cost Function for Ridge:**
    $$J_{Ridge}(b) = MSE + \alpha \sum_{j=1}^{p} b_j^2$$
* **Hyperparameter $\alpha$ (alpha):**
    * Controls the strength of the regularization. It's a non-negative value.
    * If $\alpha = 0$: Ridge Regression becomes identical to OLS Linear Regression.
    * As $\alpha \rightarrow \infty$: The penalty becomes dominant, forcing all coefficients $b_j$ (for $j>0$) closer and closer to zero.
    * The optimal value of $\alpha$ is usually found using cross-validation.
* **Effect of Ridge Regression:**
    * It **shrinks** the coefficients towards zero but **rarely makes them exactly zero**. Thus, it keeps all features in the model but reduces their influence.
    * Reduces model variance, which helps to prevent overfitting.
    * Particularly effective when dealing with **multicollinearity** (highly correlated features), as it tends to distribute the coefficient weights more evenly among correlated features.
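* **Scikit-learn Implementation:** `sklearn.linear_model.Ridge` and `sklearn.linear_model.RidgeCV` (for built-in cross-validation to find alpha). Note that `Ridge` minimizes the *sum* of squared errors plus $\alpha \sum_j b_j^2$ rather than the MSE plus the penalty, so its `alpha` corresponds to $n\alpha$ in the cost function above.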
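
The effect of $\alpha$ described above can be sketched with the closed-form Ridge solution on synthetic data. The helper `ridge_coefs` and the data are illustrative assumptions. The sketch uses the sum-of-squared-errors scaling, so the solution is $(X^\top X + \alpha I)^{-1} X^\top y$ on centered data.

```python
import numpy as np

def ridge_coefs(X, y, alpha):
    """Closed-form Ridge slopes on centered data (intercept left unpenalized)."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    return np.linalg.solve(Xc.T @ Xc + alpha * np.eye(X.shape[1]), Xc.T @ yc)

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.3, size=100)

alphas = [0.0, 1.0, 10.0, 100.0, 1000.0]
norms = []
for a in alphas:
    b = ridge_coefs(X, y, a)
    norms.append(np.linalg.norm(b))
    print(f"alpha={a:>7}: coefs={np.round(b, 4)}  L2 norm={norms[-1]:.4f}")

# alpha = 0 reproduces the OLS slopes.
ols, *_ = np.linalg.lstsq(X - X.mean(axis=0), y - y.mean(), rcond=None)
print("alpha=0 matches OLS:", np.allclose(ridge_coefs(X, y, 0.0), ols))
```

The coefficient norm shrinks steadily as `alpha` grows, yet even at the largest value the coefficients stay nonzero. This is the "shrinks but rarely zeroes" behaviour noted in the list above.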


In [12]:
print("\n--- Ridge Regression (L2 Regularization) ---")

# --- Plain Ridge with a chosen alpha ---
alpha_ridge = 1.0 # Example alpha value
ridge_model = Ridge(alpha=alpha_ridge)
ridge_model.fit(X_train, y_train)

ridge_predictions = ridge_model.predict(X_test)
ridge_mse = mean_squared_error(y_test, ridge_predictions)
print(f"Ridge MSE with alpha={alpha_ridge}: {ridge_mse:.4f}")
print(f"Ridge R-squared with alpha={alpha_ridge}: {ridge_model.score(X_test, y_test):.4f}")
# print(f"Ridge Coefficients (alpha={alpha_ridge}): {ridge_model.coef_}")


--- Ridge Regression (L2 Regularization) ---
Ridge MSE with alpha=1.0: 0.5559
Ridge R-squared with alpha=1.0: 0.5758


In [13]:
# --- RidgeCV to find the best alpha ---
# Define a range of alphas to test
alphas_to_test = [0.001, 0.01, 0.1, 1, 10, 100, 1000]
# RidgeCV's 'scoring' parameter selects the metric used to choose alpha,
# e.g. 'neg_mean_squared_error' (higher is better) or 'r2'. With the defaults
# (scoring=None, cv=None) it uses efficient leave-one-out CV scored by negative MSE.
# Per-alpha CV errors can be kept with store_cv_results=True
# (named store_cv_values before scikit-learn 1.5); they are not needed here.
ridge_cv_model = RidgeCV(alphas=alphas_to_test)
ridge_cv_model.fit(X_train, y_train)

best_alpha_ridge = ridge_cv_model.alpha_
print(f"\nBest alpha found by RidgeCV: {best_alpha_ridge:.4f}")


Best alpha found by RidgeCV: 1.0000


In [14]:
ridge_cv_predictions = ridge_cv_model.predict(X_test)
ridge_cv_mse = mean_squared_error(y_test, ridge_cv_predictions)
print(f"RidgeCV MSE with best alpha: {ridge_cv_mse:.4f}")
print(f"RidgeCV R-squared with best alpha: {ridge_cv_model.score(X_test, y_test):.4f}")

RidgeCV MSE with best alpha: 0.5559
RidgeCV R-squared with best alpha: 0.5758


In [15]:
print("\nCoefficients from RidgeCV model:")
ridge_coefs = pd.Series(ridge_cv_model.coef_, index=X_train_scaled_df.columns)
print(ridge_coefs.sort_values(ascending=False))


Coefficients from RidgeCV model:
MedInc        0.854327
AveBedrms     0.339008
HouseAge      0.122624
Population   -0.002282
AveOccup     -0.040833
AveRooms     -0.294210
Longitude    -0.869071
Latitude     -0.896168
dtype: float64


In [16]:
# Let's compare with OLS Linear Regression
ols_model = LinearRegression()
ols_model.fit(X_train, y_train)
ols_predictions = ols_model.predict(X_test)
ols_mse = mean_squared_error(y_test, ols_predictions)
print(f"\nFor comparison: OLS Linear Regression MSE: {ols_mse:.4f}")
print(f"OLS Linear Regression R-squared: {ols_model.score(X_test, y_test):.4f}")


For comparison: OLS Linear Regression MSE: 0.5559
OLS Linear Regression R-squared: 0.5758


In [17]:
print("\nOLS Coefficients:")
ols_coefs = pd.Series(ols_model.coef_, index=X_train_scaled_df.columns)
print(ols_coefs.sort_values(ascending=False))


OLS Coefficients:
MedInc        0.854383
AveBedrms     0.339259
HouseAge      0.122546
Population   -0.002308
AveOccup     -0.040829
AveRooms     -0.294410
Longitude    -0.869842
Latitude     -0.896929
dtype: float64


**Observations from Ridge:**
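In this run, Ridge and OLS are nearly indistinguishable: the test MSE (0.5559) and R-squared (0.5758) match to four decimal places, and most Ridge coefficients are only marginally smaller in magnitude (for example, Latitude is -0.896168 versus -0.896929 under OLS). That is expected here: with 16,512 training samples and only 8 features, OLS is not overfitting, so a small penalty changes little. In general, Ridge shrinks the overall size of the coefficient vector relative to OLS, especially where OLS produced very large coefficients, although individual coefficients can move either way. The test MSE may come out slightly higher or lower than for OLS, depending on whether OLS was overfitting. The main benefit is usually greater coefficient stability and better generalization when features are many or highly correlated.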
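
For contrast, the sketch below uses a synthetic setting where OLS is prone to overfitting: 30 training samples and 25 features. The data-generating process, `alpha=10.0`, and the number of repeats are arbitrary choices for illustration, not tuned values. It averages test MSE over several random draws to compare `LinearRegression` with `Ridge`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
n_train, n_test, p = 30, 1000, 25   # few samples relative to features
true_coefs = np.full(p, 0.3)        # assumed "true" signal for the simulation

ols_mses, ridge_mses = [], []
for _ in range(20):
    X_tr = rng.normal(size=(n_train, p))
    X_te = rng.normal(size=(n_test, p))
    y_tr = X_tr @ true_coefs + rng.normal(size=n_train)
    y_te = X_te @ true_coefs + rng.normal(size=n_test)

    ols = LinearRegression().fit(X_tr, y_tr)
    ridge = Ridge(alpha=10.0).fit(X_tr, y_tr)

    ols_mses.append(mean_squared_error(y_te, ols.predict(X_te)))
    ridge_mses.append(mean_squared_error(y_te, ridge.predict(X_te)))

ols_mean, ridge_mean = np.mean(ols_mses), np.mean(ridge_mses)
print(f"Mean test MSE, OLS:   {ols_mean:.3f}")
print(f"Mean test MSE, Ridge: {ridge_mean:.3f}")
print(f"Coefficient L2 norm (last draw): OLS={np.linalg.norm(ols.coef_):.3f}, "
      f"Ridge={np.linalg.norm(ridge.coef_):.3f}")
```

In a regime like this one, the penalty should make a visible difference to test error, unlike in the California Housing run, where the sample is large relative to the number of features.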