# 3. Linear Regression with Regularization

**Purpose:** Learn and revise **Ridge**, **Lasso**, and **ElasticNet** regression in scikit-learn.

---

## Why Regularization?

Standard linear regression can **overfit** when there are many features or correlated features. **Regularization** adds a penalty on the size of coefficients to keep the model simpler and more stable.

**Three main types:**

1. **Ridge (L2):** Penalizes \( \sum \beta_j^2 \). Shrinks coefficients toward zero; rarely makes them exactly zero.
2. **Lasso (L1):** Penalizes \( \sum |\beta_j| \). Can set some coefficients to **exactly zero** → feature selection.
3. **ElasticNet:** Combines L1 + L2. Good when you have many correlated features.

Loss = **MSE** + **penalty**. The strength of the penalty is controlled by \( \alpha \) (`alpha` in scikit-learn); for ElasticNet, `l1_ratio` additionally sets the balance between the L1 and L2 parts.
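
For reference, the objectives minimized by scikit-learn's `Ridge`, `Lasso`, and `ElasticNet` can be written as below. The notation is an assumption made here for illustration: \( n \) is the number of samples, \( \rho \) stands for `l1_ratio`, and the intercept is omitted for brevity.

\[
\begin{aligned}
\text{Ridge:} \quad & \min_{\beta} \; \lVert y - X\beta \rVert_2^2 + \alpha \sum_j \beta_j^2 \\
\text{Lasso:} \quad & \min_{\beta} \; \frac{1}{2n} \lVert y - X\beta \rVert_2^2 + \alpha \sum_j |\beta_j| \\
\text{ElasticNet:} \quad & \min_{\beta} \; \frac{1}{2n} \lVert y - X\beta \rVert_2^2 + \alpha \rho \sum_j |\beta_j| + \frac{\alpha (1 - \rho)}{2} \sum_j \beta_j^2
\end{aligned}
\]

The Ridge error term is a sum rather than a mean, so the same numeric \( \alpha \) is typically a much weaker penalty for Ridge than for Lasso or ElasticNet on the same data.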


## Concepts to Remember

| Concept        | Description                                                                           |
| -------------- | ------------------------------------------------------------------------------------- |
| **Ridge**      | L2 penalty; stable with correlated features; all predictors stay in the model.        |
| **Lasso**      | L1 penalty; can zero out coefficients → automatic feature selection.                  |
| **ElasticNet** | Mix of L1 and L2; `l1_ratio` controls the mix (1 = Lasso, 0 = Ridge).                 |
| **Alpha (α)**  | Larger α = stronger regularization = smaller coefficients. Tune via cross-validation. |


In [1]:
import numpy as np
from sklearn.linear_model import Ridge, Lasso, ElasticNet
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error, r2_score

In [2]:
np.random.seed(42)
n = 100
X = np.random.randn(n, 5)  # 5 independent standard-normal features
X[:, 1] = X[:, 0] + 0.1 * np.random.randn(n)  # make feature 1 nearly collinear with feature 0
beta_true = np.array([3, 0, 1, 0, 2])  # true coefficients; features 1 and 3 have no effect
y = X @ beta_true + np.random.randn(n) * 0.5

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)

In [3]:
alpha = 0.5  # regularization strength; same value for all three, but not an equivalent penalty (Ridge's error term is not divided by n)

ridge = Ridge(alpha=alpha).fit(X_train_s, y_train)
lasso = Lasso(alpha=alpha).fit(X_train_s, y_train)
elastic = ElasticNet(alpha=alpha, l1_ratio=0.5).fit(X_train_s, y_train)

for name, model in [("Ridge", ridge), ("Lasso", lasso), ("ElasticNet", elastic)]:
    pred = model.predict(X_test_s)
    print(f"{name} - MSE: {mean_squared_error(y_test, pred):.4f}, R²: {r2_score(y_test, pred):.4f}")
    print(f"  Coefficients: {model.coef_}")

Ridge - MSE: 0.2462, R²: 0.9793
  Coefficients: [2.17838152 0.42735987 1.00945282 0.02641522 2.19124853]
Lasso - MSE: 1.1590, R²: 0.9028
  Coefficients: [2.0353725  0.         0.55195666 0.         1.66773427]
ElasticNet - MSE: 1.2033, R²: 0.8991
  Coefficients: [1.03994021 0.97245248 0.64748553 0.         1.52395823]


## Key Takeaways

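- **Scale features** (e.g. StandardScaler) before Ridge/Lasso/ElasticNet: the penalty acts on coefficient magnitudes, which depend on each feature's units, so unscaled features are penalized unevenly. Fit the scaler on the training data only.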
- **Ridge**: Use when you want to keep all features but shrink coefficients.
- **Lasso**: Use when you suspect many coefficients are zero (sparse model).
- **ElasticNet**: Use when features are correlated and you want both shrinkage and some selection.
- Tune **alpha** (and **l1_ratio** for ElasticNet) by cross-validation, using **GridSearchCV** or the built-in **RidgeCV**, **LassoCV**, and **ElasticNetCV**.
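
The scaling and tuning advice above can be combined in one pipeline. The following is a minimal sketch, assuming synthetic data and a small hand-picked parameter grid (both are illustrative choices, not recommendations). Placing the scaler inside the pipeline means it is refit on each training fold during cross-validation.

```python
import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data for illustration only; feature 1 is nearly collinear with feature 0.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5))
X[:, 1] = X[:, 0] + 0.1 * rng.standard_normal(200)
y = X @ np.array([3.0, 0.0, 1.0, 0.0, 2.0]) + 0.5 * rng.standard_normal(200)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# make_pipeline names each step after its lowercased class name ("elasticnet").
pipe = make_pipeline(StandardScaler(), ElasticNet(max_iter=10_000))
param_grid = {
    "elasticnet__alpha": [0.001, 0.01, 0.1, 1.0],  # hypothetical grid
    "elasticnet__l1_ratio": [0.2, 0.5, 0.8],       # hypothetical grid
}

search = GridSearchCV(pipe, param_grid, cv=5, scoring="neg_mean_squared_error")
search.fit(X_train, y_train)

best_params = search.best_params_
test_r2 = search.best_estimator_.score(X_test, y_test)  # R² of the refit pipeline
print("Best parameters:", best_params)
print(f"Test R²: {test_r2:.4f}")
```

For a single model family, `RidgeCV`, `LassoCV`, or `ElasticNetCV` can search over alphas more efficiently than a generic grid search.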
