#### Day 57 – Elastic Net Regression

Elastic Net combines the strengths of Ridge and Lasso regression.

- Ridge → Shrinkage (L2)
- Lasso → Sparsity (L1)
- Elastic Net → Shrinkage + Sparsity

---

#### Why Do We Need Elastic Net?

Lasso has a limitation:

- If features are highly correlated, Lasso tends to select one of them and set the others to zero.

Ridge handles correlated features better:

- It distributes weights among them,
- But it does not create sparsity.

Elastic Net solves this by combining both penalties.

---

#### Elastic Net Loss Function

Elastic Net minimizes:

$$
L =
\sum_{i=1}^{n} (y_i - \hat{y}_i)^2
+ \lambda_1 \sum_{j=1}^{p} |w_j|
+ \lambda_2 \sum_{j=1}^{p} w_j^2
$$

More commonly written as:

$$
L =
\mathrm{RSS}
+ \lambda \left(
\alpha \sum_{j=1}^{p} |w_j|
+
(1 - \alpha) \sum_{j=1}^{p} w_j^2
\right)
$$

Where:

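- $\mathrm{RSS} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$ → residual sum of squares  
- $\lambda$ → overall regularization strength (`alpha` in scikit-learn)  
- $\alpha \in [0,1]$ → mixing parameter (`l1_ratio` in scikit-learn)  

scikit-learn's objective uses the same two parameters but scales the RSS by $1/(2n)$ and the L2 term by $1/2$.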
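To make the loss concrete, here is a minimal sketch that evaluates the objective documented for scikit-learn's `ElasticNet` by hand. The helper name `elastic_net_objective` and the synthetic data are assumptions made for illustration. The check at the end relies only on convexity: moving away from the fitted coefficients should not lower the objective.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet


def elastic_net_objective(X, y, coef, intercept, alpha, l1_ratio):
    """Objective documented for sklearn.linear_model.ElasticNet (illustrative helper)."""
    n_samples = X.shape[0]
    residual = y - (X @ coef + intercept)
    data_fit = residual @ residual / (2 * n_samples)        # RSS / (2n)
    l1_term = alpha * l1_ratio * np.abs(coef).sum()          # L1 part
    l2_term = 0.5 * alpha * (1 - l1_ratio) * (coef @ coef)   # L2 part
    return data_fit + l1_term + l2_term


X, y = make_regression(n_samples=100, n_features=10, n_informative=5,
                       noise=20, random_state=0)

# A tight tolerance so the fitted coefficients sit very close to the minimizer.
model = ElasticNet(alpha=1.0, l1_ratio=0.5, tol=1e-8, max_iter=10000).fit(X, y)

fitted_value = elastic_net_objective(X, y, model.coef_, model.intercept_,
                                     alpha=1.0, l1_ratio=0.5)

# Randomly nudged coefficients should give a larger objective value.
rng = np.random.default_rng(0)
perturbed_values = [
    elastic_net_objective(X, y, model.coef_ + rng.normal(scale=0.5, size=10),
                          model.intercept_, alpha=1.0, l1_ratio=0.5)
    for _ in range(5)
]
print("Objective at fitted coefficients:", fitted_value)
print("Smallest objective after perturbation:", min(perturbed_values))
```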

---

#### Special Cases

- If $\alpha = 1$ → Elastic Net becomes Lasso  
- If $\alpha = 0$ → Elastic Net becomes Ridge  
- If $0 < \alpha < 1$ → Combination of both  
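The special cases can be checked numerically. The sketch below assumes scikit-learn's parameterization, where `alpha` plays the role of $\lambda$ and `l1_ratio` the role of $\alpha$. Because scikit-learn divides the RSS by $2n$, matching `Ridge` requires multiplying its `alpha` by the number of samples. The data set and variable names are illustrative.

```python
import warnings

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet, Lasso, Ridge

X, y = make_regression(n_samples=100, n_features=10, n_informative=5,
                       noise=20, random_state=0)
n_samples = X.shape[0]
strength = 1.0  # scikit-learn's `alpha`, i.e. lambda in the notation above

# l1_ratio=1: the L2 term vanishes, so the fit matches Lasso with the same alpha.
enet_as_lasso = ElasticNet(alpha=strength, l1_ratio=1.0, max_iter=10000).fit(X, y)
lasso = Lasso(alpha=strength, max_iter=10000).fit(X, y)

# l1_ratio=0: only the L2 term remains. scikit-learn warns that Ridge is the
# better tool for this case, so the warning is silenced for the comparison.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    enet_as_ridge = ElasticNet(alpha=strength, l1_ratio=0.0,
                               tol=1e-10, max_iter=10000).fit(X, y)

# Ridge minimizes RSS + alpha * ||w||^2, hence the factor n_samples.
ridge = Ridge(alpha=strength * n_samples).fit(X, y)

lasso_match = np.allclose(enet_as_lasso.coef_, lasso.coef_)
ridge_match = np.allclose(enet_as_ridge.coef_, ridge.coef_, atol=1e-3)
print("Matches Lasso:", lasso_match)
print("Matches Ridge (alpha scaled by n):", ridge_match)
```

One practical consequence: `ElasticNet(alpha=1.0, l1_ratio=0)` on 80 training samples corresponds to `Ridge(alpha=80)`, which is much heavier shrinkage than `Ridge(alpha=1.0)`.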

---

#### Why Elastic Net Works Better for Correlated Features

When features are correlated:

- Lasso selects one and removes others.
- Ridge keeps all but does not perform feature selection.
- Elastic Net encourages grouped selection.

It keeps correlated features together while still allowing sparsity.
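The grouping effect is easy to see on synthetic data. This sketch builds two nearly identical features plus one independent feature; the data-generating setup and the penalty values are assumptions chosen for illustration. With Lasso the split of weight between the near-duplicates is essentially arbitrary, while Elastic Net tends to give them almost equal coefficients.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso

rng = np.random.default_rng(0)
n_samples = 200

z = rng.normal(size=n_samples)
x1 = z
x2 = z + 0.01 * rng.normal(size=n_samples)  # near-duplicate of x1
x3 = rng.normal(size=n_samples)             # independent feature
X = np.column_stack([x1, x2, x3])
y = 3.0 * z + 2.0 * x3 + 0.5 * rng.normal(size=n_samples)

lasso = Lasso(alpha=0.1).fit(X, y)

# A tight tolerance so coordinate descent fully converges on the collinear pair.
enet = ElasticNet(alpha=0.1, l1_ratio=0.5, tol=1e-8, max_iter=10000).fit(X, y)

print("Lasso coefficients:      ", np.round(lasso.coef_, 3))
print("Elastic Net coefficients:", np.round(enet.coef_, 3))
```

Compare the first two coefficients of each model: under Elastic Net they should be close to each other, because the L2 term makes an even split cheaper than concentrating the weight on one feature.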

---

#### Geometric Intuition

- Ridge → Circular constraint region  
- Lasso → Diamond constraint region  
- Elastic Net → Rounded diamond  

It combines:

- Sharp corners (from L1 → sparsity)
- Smooth edges (from L2 → stability)
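The "rounded diamond" can be checked with a little arithmetic. The sketch below measures how far the unit level set of the penalty $\alpha (|w_1| + |w_2|) + (1 - \alpha)(w_1^2 + w_2^2)$ extends from the origin along a given direction in two dimensions. The helper `boundary_radius` is hypothetical and exists only for this illustration.

```python
import numpy as np


def boundary_radius(theta, mix):
    """Distance from the origin to the level set
    mix * (|w1| + |w2|) + (1 - mix) * (w1**2 + w2**2) = 1 along angle theta."""
    l1_per_unit = abs(np.cos(theta)) + abs(np.sin(theta))  # L1 norm of the unit direction
    a, b = 1.0 - mix, mix * l1_per_unit
    if a == 0.0:  # pure L1: b * r = 1
        return 1.0 / b
    # Positive root of a * r**2 + b * r - 1 = 0
    return (-b + np.sqrt(b**2 + 4.0 * a)) / (2.0 * a)


for name, theta in [("axis", 0.0), ("diagonal", np.pi / 4)]:
    radii = {label: boundary_radius(theta, mix)
             for label, mix in [("ridge", 0.0), ("elastic net", 0.5), ("lasso", 1.0)]}
    print(name, {label: round(float(r), 3) for label, r in radii.items()})
```

All three regions touch the axes at the same point, which is where the corners sit. Along the diagonal the Elastic Net boundary (about 0.874) lies between the Lasso diamond (about 0.707) and the Ridge circle (1.0), so its edges bulge outward from the diamond.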

---

#### When Should You Use Elastic Net?

- When features are highly correlated
- When $p > n$ (more features than samples), where Lasso can select at most $n$ features
- When you want both stability and feature selection

---



#### Summary

Elastic Net = L1 + L2 regularization.

It provides:

- Feature selection (like Lasso)
- Stability with correlated features (like Ridge)

### Implementing Elastic Net with sklearn.linear_model.ElasticNet

In [1]:
import numpy as np
import matplotlib.pyplot as plt

from sklearn.linear_model import ElasticNet
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_regression
from sklearn.preprocessing import StandardScaler

| l1_ratio | Meaning                |
| -------- | ---------------------- |
| 0        | Pure Ridge (only L2)   |
| 1        | Pure Lasso (only L1)   |
| 0.5      | Equal mix of L1 and L2 |


In [2]:
X, y = make_regression(
    n_samples=100,
    n_features=10,
    n_informative=5,
    noise=20,
    random_state=42
)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

In [3]:
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

In [4]:
model = ElasticNet(
    alpha=1.0,      # overall regularization strength (λ)
    l1_ratio=0.5,   # mixing parameter (α)
    max_iter=10000
)

model.fit(X_train, y_train)



In [5]:
print("Coefficients:")
print(model.coef_)

print("\nNumber of Non-Zero Coefficients:",
      np.sum(model.coef_ != 0))

Coefficients:
[10.5404096  -2.88651419  2.48542508 37.45569997  0.55306254 41.87177344
  3.34659799  6.19789635  2.65673132  0.69111596]

Number of Non-Zero Coefficients: 10


In [7]:
# Pure Ridge
ElasticNet(alpha=1.0, l1_ratio=0)



In [8]:
# Pure Lasso
ElasticNet(alpha=1.0, l1_ratio=1)



In [11]:
from sklearn.metrics import r2_score

In [12]:
models = {
    "Pure Ridge (ElasticNet l1_ratio=0)": ElasticNet(alpha=1.0, l1_ratio=0, max_iter=10000),
    "ElasticNet (l1_ratio=0.5)": ElasticNet(alpha=1.0, l1_ratio=0.5, max_iter=10000),
    "Pure Lasso (ElasticNet l1_ratio=1)": ElasticNet(alpha=1.0, l1_ratio=1, max_iter=10000)
}

for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    
    r2 = r2_score(y_test, y_pred)
    non_zero = np.sum(model.coef_ != 0)
    
    print(name)
    print(f"R2 Score: {r2:.4f}")
    print(f"Non-zero Coefficients: {non_zero}")
    print("-"*50)

Pure Ridge (ElasticNet l1_ratio=0)
R2 Score: 0.6786
Non-zero Coefficients: 10
--------------------------------------------------
ElasticNet (l1_ratio=0.5)
R2 Score: 0.8215
Non-zero Coefficients: 10
--------------------------------------------------
Pure Lasso (ElasticNet l1_ratio=1)
R2 Score: 0.9679
Non-zero Coefficients: 8
--------------------------------------------------


The `l1_ratio=0` fit also triggers this scikit-learn warning:

Linear regression models with a zero l1 penalization strength are more efficiently fitted using one of the solvers implemented in sklearn.linear_model.Ridge/RidgeCV instead.
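The comparison above fixes `alpha=1.0` for every model, so the scores mostly reflect that one choice of strength. A common alternative is to pick both hyperparameters by cross-validation. The sketch below uses `ElasticNetCV` on the same synthetic data; the grid of `l1_ratio` candidates and the 5-fold setting are arbitrary assumptions, not tuned recommendations.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNetCV
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=100, n_features=10, n_informative=5,
                       noise=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Candidate mixes are an illustrative grid; the alpha path is chosen automatically.
search = make_pipeline(
    StandardScaler(),
    ElasticNetCV(l1_ratio=[0.1, 0.5, 0.9, 1.0], cv=5, max_iter=10000),
)
search.fit(X_train, y_train)

best = search.named_steps["elasticnetcv"]
test_r2 = search.score(X_test, y_test)
print("Selected alpha:", best.alpha_)
print("Selected l1_ratio:", best.l1_ratio_)
print("Test R2:", test_r2)
```

Putting the scaler inside the pipeline means each cross-validation fold is standardized using only its own training portion.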


In [13]:
##### You can also use SGDRegressor to implement these regressions

In [None]:
from sklearn.linear_model import SGDRegressor

# L2 penalty: Ridge-style shrinkage
SGDRegressor(
    penalty='l2',
    alpha=0.01,
    max_iter=1000
)

In [None]:
# L1 penalty: Lasso-style sparsity
SGDRegressor(
    penalty='l1',
    alpha=0.01,
    max_iter=1000
)

In [None]:
SGDRegressor(
    penalty='elasticnet',
    alpha=0.01,
    l1_ratio=0.5,
    max_iter=1000
)

| penalty      | Behavior                 |
| ------------ | ------------------------ |
| 'l2'         | Ridge (shrinkage only)   |
| 'l1'         | Lasso (sparsity)         |
| 'elasticnet' | Combination of L1 and L2 |
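The cells above only construct the estimators. As a rough sketch of how they might be fitted and compared, the loop below trains one `SGDRegressor` per penalty on the same synthetic data used earlier. The `random_state` values and the choice of R2 as the metric are assumptions made for the illustration.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=100, n_features=10, n_informative=5,
                       noise=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# SGD is sensitive to feature scale, so standardize first.
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

scores = {}
for penalty in ["l2", "l1", "elasticnet"]:
    # l1_ratio only has an effect when penalty="elasticnet".
    sgd = SGDRegressor(penalty=penalty, alpha=0.01, l1_ratio=0.5,
                       max_iter=1000, random_state=42)
    sgd.fit(X_train, y_train)
    scores[penalty] = r2_score(y_test, sgd.predict(X_test))
    print(f"{penalty:>10}: R2 = {scores[penalty]:.4f}")
```

The `alpha` here is not directly comparable to the `alpha` passed to `ElasticNet` earlier, since the two estimators use different solvers and the values in this notebook differ (0.01 versus 1.0).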


#### When Should You Use SGDRegressor with a Penalty Term?

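`SGDRegressor` fits the model with stochastic gradient descent instead of the direct (closed-form) solution available for Ridge or the coordinate descent used by `Lasso` and `ElasticNet`.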

You should use it in the following situations:

---

#### 1) Large Datasets

If you have:

- A very large number of samples
- A high-dimensional feature space
- Memory constraints

Direct and full-batch solvers become computationally expensive.

SGD is efficient because it updates the weights incrementally from single samples or small batches (scikit-learn's `SGDRegressor` updates one sample at a time).

---

#### 2) Online or Streaming Learning

If data arrives continuously:

- You want to update the model without retraining from scratch
- The dataset is dynamic

`SGDRegressor` supports `partial_fit()`, which makes it suitable for online learning.
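A minimal sketch of the online pattern, assuming the stream can be consumed in fixed-size chunks. The synthetic data, the chunk size of 100, and the penalty settings are hypothetical stand-ins for a real data source.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import r2_score

# Synthetic stand-in for a data stream; the features are already roughly unit-scale.
X, y = make_regression(n_samples=6000, n_features=20, n_informative=10,
                       noise=10, random_state=0)
X_stream, y_stream = X[:5000], y[:5000]
X_holdout, y_holdout = X[5000:], y[5000:]

model = SGDRegressor(penalty="elasticnet", alpha=0.001, l1_ratio=0.5,
                     random_state=0)

batch_size = 100  # hypothetical chunk size
for start in range(0, len(X_stream), batch_size):
    stop = start + batch_size
    # Each call runs one pass of SGD over the new batch only,
    # continuing from the current weights.
    model.partial_fit(X_stream[start:stop], y_stream[start:stop])

holdout_r2 = r2_score(y_holdout, model.predict(X_holdout))
print(f"Hold-out R2 after streaming: {holdout_r2:.4f}")
```

With real streams the features usually need scaling as well; `StandardScaler` also offers `partial_fit()`, so its statistics can be updated batch by batch in the same loop.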

---

#### 3) High-Dimensional Data (p >> n)

When the number of features is very large:

- Regularization becomes important
- Coefficients can grow excessively without control

Using a penalty term:

- Reduces overfitting
- Improves generalization
- Can produce sparsity (with L1)

---

#### 4) When You Need Regularization

Choose penalty based on your objective:

- `penalty='l2'` → Ridge (shrinkage)
- `penalty='l1'` → Lasso (sparsity)
- `penalty='elasticnet'` → Combination of L1 and L2

Regularization helps control variance and stabilize the model.

---

#### When Not to Use SGD

Avoid SGD when:

- The dataset is small
- An exact, deterministic solver is preferred
- Fast and reliable convergence is needed

In such cases, use:

- `Ridge`
- `Lasso`
- `ElasticNet`

---

#### Practical Rule of Thumb

Small to medium dataset → Use Ridge, Lasso, or ElasticNet  
Very large dataset → Use SGDRegressor  

---


#### Summary: SGD with a Penalty Term

Use SGD with a penalty term when:

- The dataset is large
- Scalability is important
- Online learning is required
- Regularization is necessary

SGD trades the precision of exact solvers for scalability and computational efficiency.