# Elastic Net Regression

## Overview
Elastic Net Regression is a **regularized linear regression** technique that combines both the **L1 penalty (Lasso)** and the **L2 penalty (Ridge)** in the loss function. It is particularly useful when there are **correlated features** and when you want both **coefficient shrinkage** and **automatic feature selection**.

---

## Loss Function

The Elastic Net loss function is:

$$L = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2 + a\|w\|_1 + b\|w\|_2^2$$

Where:
- $a\|w\|_1 = a \sum_{j=1}^{p} |w_j|$ is the **L1 (Lasso)** penalty term
- $b\|w\|_2^2 = b \sum_{j=1}^{p} w_j^2$ is the **L2 (Ridge)** penalty term

### Relationship to scikit-learn parameters (`alpha` and `l1_ratio`):

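scikit-learn's `ElasticNet` scales the terms slightly differently: it minimizes $\frac{1}{2n}\sum_{i}(y_i - \hat{y}_i)^2 + a\|w\|_1 + \frac{b}{2}\|w\|_2^2$. With that scaling, the L1 weight $a$ and the L2 weight $b$ map to its two parameters as follows:

$$\lambda = a + b \qquad \text{(total regularization strength = alpha)}$$

$$\text{l1\_ratio} = \frac{a}{a + b} = \frac{a}{\lambda}$$

$$a = \text{l1\_ratio} \times \lambda \qquad b = (1 - \text{l1\_ratio}) \times \lambda$$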

| Parameter | Meaning | Example ($\lambda=1$, $\text{l1\_ratio}=0.5$) |
|-----------|---------|------------------------------------------------|
| $\lambda$ (`alpha`) | Total regularization strength | $1$ |
| $\text{l1\_ratio}$ | Fraction of L1 (Lasso) | $0.5$ |
| $a$ | L1 penalty coefficient | $0.5$ |
| $b$ | L2 penalty coefficient | $0.5$ |

> **Example:** `l1_ratio=0.9` → $a = 0.9\lambda$ (90% Lasso), $b = 0.1\lambda$ (10% Ridge)
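
The mapping between $(a, b)$ and (`alpha`, `l1_ratio`) can be sketched as two small helpers. The function names `to_sklearn_params` and `from_sklearn_params` are hypothetical and not part of scikit-learn; they only restate the formulas above.

```python
def to_sklearn_params(a, b):
    """Convert an L1 weight `a` and an L2 weight `b` into (alpha, l1_ratio)."""
    alpha = a + b
    if alpha == 0:
        # With no penalty at all, l1_ratio is undefined (the model is plain OLS).
        raise ValueError("a + b must be positive")
    return alpha, a / alpha


def from_sklearn_params(alpha, l1_ratio):
    """Convert (alpha, l1_ratio) back into the L1 weight and the L2 weight."""
    return l1_ratio * alpha, (1.0 - l1_ratio) * alpha


l1_weight, l2_weight = from_sklearn_params(alpha=1.0, l1_ratio=0.9)
print("L1 weight:", l1_weight, "L2 weight:", round(l2_weight, 4))
print("Round trip:", to_sklearn_params(0.5, 0.5))
```

The round trip for $a = b = 0.5$ reproduces the example column of the table: `alpha = 1`, `l1_ratio = 0.5`.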

### Effect of `l1_ratio` on model type:

| `l1_ratio` | $a$ (L1) | $b$ (L2) | Equivalent Model |
|------------|----------|----------|-----------------|
| `0` | $0$ | $\lambda$ | **Pure Ridge** (only L2 penalty) |
| `0 < r < 1` | $r\lambda$ | $(1-r)\lambda$ | **Elastic Net** (blend of both) |
| `1` | $\lambda$ | $0$ | **Pure Lasso** (only L1 penalty) |

The intercept (bias term $w_0$) is **not penalized**; only the slope coefficients are shrunk.

---

## Closed-Form Solution

Elastic Net does **not have a closed-form solution** due to the non-differentiability of the L1 component. It is solved iteratively using:
- **Coordinate Descent** (default in scikit-learn)
- **Pathwise Coordinate Descent** (efficient for a sequence of alpha values)
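
To make the iterative solver concrete, here is a minimal NumPy sketch of cyclic coordinate descent. It assumes centered data with no intercept and the objective $\frac{1}{2n}\sum_i (y_i - \hat{y}_i)^2 + a\|w\|_1 + \frac{b}{2}\|w\|_2^2$. The function names and the synthetic data are illustrative; this is not scikit-learn's actual implementation, which adds convergence checks and other optimizations.

```python
import numpy as np


def soft_threshold(z, t):
    """Shrink z toward zero by t, returning exactly 0 when |z| <= t."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)


def elastic_net_cd(X, y, a, b, n_sweeps=200):
    """Cyclic coordinate descent for (1/2n)*RSS + a*|w|_1 + (b/2)*|w|_2^2."""
    n, p = X.shape
    w = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n      # (1/n) * ||X_j||^2 for each column
    residual = y - X @ w
    for _ in range(n_sweeps):
        for j in range(p):
            residual += X[:, j] * w[j]     # remove feature j's current contribution
            z = X[:, j] @ residual / n     # correlation of feature j with the partial residual
            w[j] = soft_threshold(z, a) / (col_sq[j] + b)
            residual -= X[:, j] * w[j]     # add the updated contribution back
    return w


rng = np.random.default_rng(0)
X_cd = rng.normal(size=(200, 3))
y_cd = X_cd @ np.array([2.0, 0.0, -1.0]) + 0.1 * rng.normal(size=200)

w_no_penalty = elastic_net_cd(X_cd, y_cd, a=0.0, b=0.0)   # should approach the OLS fit
w_sparse = elastic_net_cd(X_cd, y_cd, a=0.5, b=0.1)       # L1 term can zero out weak features
print("No penalty:", np.round(w_no_penalty, 3))
print("a=0.5, b=0.1:", np.round(w_sparse, 3))
```

The soft-thresholding step is what produces exact zeros: whenever a feature's correlation with the partial residual is at most $a$ in absolute value, its coefficient is set to 0 rather than merely made small.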

---

## Key Properties

- Elastic Net **shrinks** coefficients toward zero and can set some coefficients **exactly to zero** (like Lasso)
- The L2 component groups correlated features together (unlike Lasso which arbitrarily picks one)
- When $\text{l1\_ratio} = 1$ — Elastic Net reduces to **Lasso** ($b = 0$, pure L1)
- When $\text{l1\_ratio} = 0$ — Elastic Net reduces to **Ridge** ($a = 0$, pure L2)
- When $\lambda = 0$ — Elastic Net reduces to **Ordinary Least Squares (OLS)**
- Larger $\lambda$ = more regularization = sparser model = higher bias, lower variance
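
The Lasso special case can be checked directly in scikit-learn. This is a small sketch on synthetic data from `make_regression`; the variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet, Lasso

X_eq, y_eq = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)

lasso = Lasso(alpha=0.5).fit(X_eq, y_eq)
enet_as_lasso = ElasticNet(alpha=0.5, l1_ratio=1.0).fit(X_eq, y_eq)

# With l1_ratio=1 the L2 term vanishes, so both models solve the same problem.
print("Same coefficients:", np.allclose(lasso.coef_, enet_as_lasso.coef_))
```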

---

## Impact on Bias and Variance

As $\lambda$ increases:

| Metric | Effect |
|--------|--------|
| Bias | Increases |
| Variance | Decreases |
| Non-zero Coefficients | Decreases (more sparsity as $\text{l1\_ratio} \to 1$) |
| Test error | Typically decreases at first, then increases once the model starts to underfit |

The optimal $\lambda$ and $\text{l1\_ratio}$ sit at the **sweet spot** where validation error is lowest, which is typically found via cross-validation.
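
The effect of $\lambda$ on sparsity can be observed with a small sweep. This is a sketch on the Diabetes dataset with an arbitrary list of `alpha` values and `l1_ratio=0.5`; the exact test errors depend on the split, so none are quoted here.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import ElasticNet
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X_d, y_d = load_diabetes(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_d, y_d, test_size=0.2, random_state=2)

sweep_alphas = [0.001, 0.01, 0.1, 1.0, 100.0]   # illustrative grid, weak to very strong
nonzero_counts = []
for alpha in sweep_alphas:
    model = ElasticNet(alpha=alpha, l1_ratio=0.5, max_iter=10000).fit(X_tr, y_tr)
    nonzero_counts.append(int(np.sum(model.coef_ != 0)))
    test_mse = mean_squared_error(y_te, model.predict(X_te))
    print(f"alpha={alpha:<7} non-zero coefficients={nonzero_counts[-1]:>2}  test MSE={test_mse:.1f}")
```

At the largest `alpha` every coefficient is driven to zero and the model predicts only the intercept, which is the high-bias end of the tradeoff.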

---

## Geometric Interpretation (Constraint Region)

In coefficient space ($\beta_1$, $\beta_2$):
- OLS minimizes MSE — represented by **elliptical contours**
- Elastic Net adds a constraint region that is a **blend of a circle (Ridge) and a diamond (Lasso)**
- The constraint region keeps the diamond's **corners on the axes** (so coefficients can still be zeroed out), but its edges **bulge outward** toward the circle, which makes the solution more stable when features are correlated
- The solution is where the **ellipse first touches the blended region**
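
The shape of the blended region can be checked numerically. The sketch below uses a hypothetical helper, `boundary_radius`, that measures how far the curve $a\|w\|_1 + b\|w\|_2^2 = c$ lies from the origin along a given direction in the two-coefficient case. It assumes at least one of $a$, $b$ is positive.

```python
import numpy as np


def boundary_radius(theta, a, b, c=1.0):
    """Distance from the origin to the curve a*|w|_1 + b*|w|_2^2 = c along angle theta."""
    s = np.abs(np.cos(theta)) + np.abs(np.sin(theta))   # L1 norm of the unit direction
    if b == 0:
        return c / (a * s)                               # pure Lasso: straight diamond edges
    # Positive root of b*t^2 + a*s*t - c = 0 (the unit direction has L2 norm 1)
    return (-a * s + np.sqrt((a * s) ** 2 + 4 * b * c)) / (2 * b)


shapes = {"lasso": (1.0, 0.0), "elastic net": (0.5, 0.5), "ridge": (0.0, 1.0)}
for name, (a_w, b_w) in shapes.items():
    ratio = boundary_radius(np.pi / 4, a_w, b_w) / boundary_radius(0.0, a_w, b_w)
    print(f"{name:12s} diagonal/axis radius ratio = {ratio:.3f}")
```

A ratio of 1 means the region is a circle, and $1/\sqrt{2}$ means a straight-edged diamond. The Elastic Net ratio falls between the two, which is the outward bulge of the edges.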

---

## Elastic Net vs Ridge vs Lasso

| Property | Ridge (L2) | Lasso (L1) | Elastic Net (L1 + L2) |
|----------|-----------|-----------|----------------------|
| Penalty | $b\|w\|_2^2$ | $a\|w\|_1$ | $a\|w\|_1 + b\|w\|_2^2$ |
| Coefficients reach zero | Never | Yes | Yes |
| Feature Selection | No | Yes | Yes |
| Handles correlated features | Yes | Picks one | Yes (groups them) |
| Constraint Region | Circle | Diamond | Rounded Diamond |
| Best for | Multicollinearity | Sparse features | Correlated + sparse features |

---

## Syntax (Scikit-learn)

```python
from sklearn.linear_model import ElasticNet

en = ElasticNet(alpha=1.0, l1_ratio=0.5)   # alpha = λ = a+b,  l1_ratio = a/(a+b)
en.fit(X_train, y_train)
y_pred = en.predict(X_test)
```

---

## When to Use Elastic Net Regression

| Condition | Use Elastic Net? |
|-----------|-----------------|
| Correlated features present | Yes |
| Need feature selection | Yes |
| All features are likely relevant | No (use Ridge) |
| Sparse features with no correlation | No (use Lasso) |
| Model is overfitting | Yes |
| Unsure between Ridge and Lasso | Yes |

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet, ElasticNetCV
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error
```

```python
# Load the Diabetes dataset (442 samples, 10 features)
data = load_diabetes()
X, y = data.data, data.target

print("Feature shape:", X.shape)
print("Target shape:", y.shape)
print("\nFeature names:", data.feature_names)
```

```python
# Hold out 20% of the data for testing; fixed seed for reproducibility
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=2)
```

## Model Comparison: Linear Regression, Ridge, Lasso, and Elastic Net

We compare all four models on the same **Diabetes dataset** to observe the impact of different regularization strategies.

```python
# Linear Regression (baseline, no regularization)
reg = LinearRegression()
reg.fit(X_train, y_train)
y_pred = reg.predict(X_test)

print("=== Linear Regression ===")
print("R² Score:", round(r2_score(y_test, y_pred), 4))
print("MSE:     ", round(mean_squared_error(y_test, y_pred), 4))
```

Recorded R² score: `0.4399387660024645`

```python
# Ridge Regression (L2 penalty)
reg = Ridge(alpha=0.1)
reg.fit(X_train, y_train)
y_pred = reg.predict(X_test)

print("=== Ridge Regression (alpha=0.1) ===")
print("R² Score:", round(r2_score(y_test, y_pred), 4))
print("MSE:     ", round(mean_squared_error(y_test, y_pred), 4))
```

Recorded R² score: `0.4519973816947852`

```python
# Lasso Regression (L1 penalty)
reg = Lasso(alpha=0.01)
reg.fit(X_train, y_train)
y_pred = reg.predict(X_test)

print("=== Lasso Regression (alpha=0.01) ===")
print("R² Score:", round(r2_score(y_test, y_pred), 4))
print("MSE:     ", round(mean_squared_error(y_test, y_pred), 4))
print("Zero coefficients:", (reg.coef_ == 0).sum(), "out of", len(reg.coef_))
```

Recorded R² score: `0.4411227990495632`

```python
# Elastic Net Regression (L1 + L2 penalty)
reg = ElasticNet(alpha=0.005, l1_ratio=0.9)
reg.fit(X_train, y_train)
y_pred = reg.predict(X_test)

print("=== Elastic Net Regression (alpha=0.005, l1_ratio=0.9) ===")
print("R² Score:", round(r2_score(y_test, y_pred), 4))
print("MSE:     ", round(mean_squared_error(y_test, y_pred), 4))
print("Zero coefficients:", (reg.coef_ == 0).sum(), "out of", len(reg.coef_))
print("Coefficients:", np.round(reg.coef_, 4))
```

Recorded R² score: `0.4531493801165679`

## Observations

Comparing all four models on the Diabetes dataset reveals how regularization affects performance and sparsity:

- Linear Regression is the unregularized baseline (R² ≈ 0.4399); with no penalty it can overfit noisy data
- Ridge (R² ≈ 0.4520) shrinks coefficients slightly but keeps all features, which is useful when all features matter
- Lasso (R² ≈ 0.4411) may zero out some coefficients, which is useful for feature selection
- Elastic Net with `l1_ratio=0.9` (R² ≈ 0.4531) leans heavily toward Lasso but uses the Ridge term to stabilize correlated features

### Why Similar Performance Here?

| Factor | Impact |
|--------|--------|
| Small test set (about 89 of 442 samples) | R² from a single split is noisy, so small gaps between models are hard to separate from chance |
| Only 10 features | Limited room for feature selection to show large gains |
| Low $\lambda$ values | Weak regularization, so all models behave close to OLS |
| Few strongly correlated features | Elastic Net's grouping benefit has little to act on |

**Takeaway:** Elastic Net shines on **high-dimensional datasets** with **correlated features** — it combines Ridge's ability to handle collinearity with Lasso's feature elimination, giving the best of both worlds.
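
The grouping behaviour is easiest to see in an extreme case. The sketch below builds synthetic data in which two columns are exact copies of each other (an assumption made purely for illustration), then fits Lasso and Elastic Net with the same L1 weight of 0.5. Tight `tol` and large `max_iter` values are used so both solvers converge fully.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso

rng = np.random.default_rng(0)
n = 200
z = rng.normal(size=n)
X_dup = np.column_stack([z, z, rng.normal(size=n)])   # columns 0 and 1 are identical
y_dup = 3.0 * z + 0.1 * rng.normal(size=n)

# Lasso: L1 weight 0.5.  Elastic Net: alpha=1.0, l1_ratio=0.5 gives L1 weight 0.5 plus an L2 term.
lasso_dup = Lasso(alpha=0.5, tol=1e-8, max_iter=100000).fit(X_dup, y_dup)
enet_dup = ElasticNet(alpha=1.0, l1_ratio=0.5, tol=1e-8, max_iter=100000).fit(X_dup, y_dup)

print("Lasso coefficients:      ", np.round(lasso_dup.coef_, 3))
print("Elastic Net coefficients:", np.round(enet_dup.coef_, 3))
```

For Lasso, any split of the weight between the two copies gives the same penalty, so the solver's choice is arbitrary; with cyclic coordinate descent it tends to load the weight on whichever copy is updated first. The L2 term makes the Elastic Net objective strictly convex, so the two copies receive equal coefficients.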

## Finding the Optimal Parameters ($\lambda$ and `l1_ratio`)

**Using ElasticNetCV** (Recommended)

`ElasticNetCV` cross-validates over a grid of `alpha` and `l1_ratio` values and then refits on the best combination. Because it fits each `l1_ratio` along a regularization path with warm starts, it is usually faster than a generic manual grid search.

```python
# Candidate values for the two hyperparameters
alphas = [0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1.0]
l1_ratios = [0.1, 0.3, 0.5, 0.7, 0.9]

en_cv = ElasticNetCV(
    alphas=alphas, l1_ratio=l1_ratios, cv=5,
    random_state=42, max_iter=10000, n_jobs=-1,
)
en_cv.fit(X_train, y_train)

print("Optimal alpha (lambda):", en_cv.alpha_)
print("Optimal l1_ratio (rho):", en_cv.l1_ratio_)

y_pred_cv = en_cv.predict(X_test)
print("\nElasticNetCV R² Score:", round(r2_score(y_test, y_pred_cv), 4))
print("ElasticNetCV MSE:     ", round(mean_squared_error(y_test, y_pred_cv), 4))
print("Zero coefficients:", (en_cv.coef_ == 0).sum(), "out of", len(en_cv.coef_))
```

### Alternative: `SGDRegressor`

Stochastic Gradient Descent (SGD) can be used to fit an Elastic Net model by setting `penalty='elasticnet'` and tuning the `l1_ratio` parameter. This is particularly useful for very large datasets where coordinate descent may be slow.

- Set `penalty='l2'` for Ridge behavior
- Set `penalty='l1'` for Lasso behavior
- Set `penalty='elasticnet'` with `l1_ratio` for Elastic Net behavior
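
A minimal sketch of the `SGDRegressor` route, assuming synthetic data from `make_regression` and illustrative hyperparameter values. The features are standardized first because SGD is sensitive to feature scale; that preprocessing step is an addition here, not something the text above prescribes.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import SGDRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = make_regression(
    n_samples=500, n_features=20, n_informative=5, noise=5.0, random_state=0
)

sgd_en = make_pipeline(
    StandardScaler(),
    SGDRegressor(penalty="elasticnet", alpha=0.01, l1_ratio=0.9,
                 max_iter=2000, tol=1e-4, random_state=0),
)
sgd_en.fit(X_demo, y_demo)

print("Training R²:", round(sgd_en.score(X_demo, y_demo), 4))
```

Switching `penalty` to `'l2'` or `'l1'` in the same pipeline gives the Ridge-like and Lasso-like variants listed above.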

---

## Key Understandings

**1. What Elastic Net does**
Combines the L1 ($\lambda \rho \sum |w_j|$) and L2 ($\lambda \frac{1-\rho}{2} \sum w_j^2$) penalties in a single loss function. The `l1_ratio` ($\rho$) controls the blend: closer to 1 means more Lasso-like, closer to 0 means more Ridge-like.

**2. Coefficients can reach exactly zero**
Like Lasso, Elastic Net can drive coefficients to exactly zero due to the L1 component — performing built-in feature selection. But unlike Lasso, the L2 component prevents indiscriminate elimination of correlated features.

**3. Handles correlated features better than Lasso**
Lasso tends to arbitrarily pick one feature among a correlated group and zero out the rest. Elastic Net groups correlated features together and retains or shrinks them proportionally — a more stable and interpretable result.

**4. Two hyperparameters to tune**
Unlike Ridge and Lasso, Elastic Net has two parameters:
- `alpha` ($\lambda$) — overall regularization strength
- `l1_ratio` ($\rho$) — the mix between L1 and L2

Use `ElasticNetCV` to find the optimal combination automatically via cross-validation.

**5. Bias-Variance tradeoff**
As $\lambda$ increases, bias rises while variance drops (less overfitting), and the model becomes sparser when $\rho$ is high. The `l1_ratio` controls how aggressively coefficients are zeroed out versus merely shrunk.

**6. No closed-form solution**
Because of the absolute value (L1) component, Elastic Net is solved iteratively using **Coordinate Descent**, just like Lasso.

**7. Geometric view**
Elastic Net's constraint region is a **rounded diamond**: it keeps corners on the axes like Lasso's diamond, but its edges curve outward toward Ridge's circle. Some coefficients can therefore still reach exactly zero (like Lasso), while correlated groups are not cut arbitrarily (like Ridge).

**8. When to use Elastic Net**
Use Elastic Net when:
- You have many features and some are correlated
- You want feature selection but Lasso is too unstable (e.g., picks different features on similar datasets)
- You're unsure whether to use Ridge or Lasso — Elastic Net is a robust default
- High-dimensional data (e.g., genomics, text classification) with both sparsity and collinearity