## 1. Introduction to Regularization

In machine learning, a major challenge is overfitting: a model fits the training data too closely (including its noise) and fails to generalize to unseen data. **Regularization** counters this by adding a penalty term to the loss function. The penalty discourages overly complex models (those with very large coefficient values), which typically improves generalization.

---

## 2. Regularized Linear Models  | [Link](https://github.com/AdilShamim8/50-Days-of-Machine-Learning/tree/main/Day%2036%20Regularized%20Linear%20Models)

Consider the standard linear regression model where we predict \( y \) from features \( x \) using coefficients \(\theta\):  

$$  
\hat{y}_i = \theta^T x_i  
$$  

and the cost (loss) function is the mean squared error (MSE):  

$$  
J(\theta) = \frac{1}{n} \sum_{i=1}^n \left(y_i - \theta^T x_i\right)^2.  
$$  

**Regularization** adds a penalty term R(θ) to this loss:  

$$  
J_{\text{reg}}(\theta) = \frac{1}{n} \sum_{i=1}^n \left(y_i - \theta^T x_i\right)^2 + \lambda \, R(\theta),  
$$  

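where λ ≥ 0 controls the strength of the penalty. Setting λ = 0 recovers the unregularized loss, and larger values of λ shrink the coefficients more strongly.  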
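
As a minimal sketch of the formula above, the snippet below computes the MSE and the regularized loss with NumPy. The function names and the tiny dataset are illustrative assumptions, and the sum of squared coefficients is used as just one possible choice of R(θ).

```python
import numpy as np

def mse(theta, X, y):
    """Mean squared error J(theta) = (1/n) * sum((y_i - theta^T x_i)^2)."""
    residuals = y - X @ theta
    return np.mean(residuals ** 2)

def regularized_loss(theta, X, y, lam, penalty):
    """J_reg(theta) = MSE + lam * R(theta), where `penalty` computes R(theta)."""
    return mse(theta, X, y) + lam * penalty(theta)

def l2_penalty(theta):
    """One possible R(theta): the sum of squared coefficients."""
    return np.sum(theta ** 2)

# Tiny hand-made dataset (illustrative values only).
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = np.array([1.0, 2.0, 3.0])
theta = np.array([1.0, 2.0])  # fits this data exactly, so the MSE is 0

loss_plain = regularized_loss(theta, X, y, lam=0.0, penalty=l2_penalty)
loss_reg = regularized_loss(theta, X, y, lam=0.1, penalty=l2_penalty)
print("lambda = 0.0:", loss_plain)
print("lambda = 0.1:", loss_reg)
```

Even though these coefficients fit the data perfectly, the regularized loss is larger than the plain MSE because the penalty charges for coefficient size.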

---  

## 3. Types of Regularization  

### Ridge Regression (L2 Regularization)  

- **Penalty term:** $$ \|\theta\|_2^2 = \sum_{j=1}^p \theta_j^2 $$  
- **Cost function:**  

$$  
J_{\text{ridge}}(\theta) = \frac{1}{n}\sum_{i=1}^n \left(y_i - \theta^T x_i\right)^2 + \lambda \sum_{j=1}^p \theta_j^2.  
$$  

- **Effect:** Shrinks coefficients toward zero but rarely exactly zero. It’s especially useful when predictors are highly correlated.  
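
The shrinkage effect can be seen directly from the closed-form minimizer of the ridge cost above. Setting the gradient to zero gives (XᵀX + nλI)θ = Xᵀy. The sketch below solves that system on synthetic data; the data, seed, and function name are assumptions made for illustration, and the model has no intercept, matching the formulas in this section.

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    """Minimize (1/n) * ||y - X theta||^2 + lam * ||theta||^2.

    Setting the gradient to zero gives (X^T X + n * lam * I) theta = X^T y.
    """
    n, p = X.shape
    A = X.T @ X + n * lam * np.eye(p)
    return np.linalg.solve(A, X.T @ y)

# Synthetic data (illustrative): 50 samples, 3 features, small noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
true_theta = np.array([2.0, -1.0, 0.5])
y = X @ true_theta + 0.1 * rng.normal(size=50)

norms = []
for lam in [0.0, 0.1, 1.0, 10.0]:
    theta = ridge_closed_form(X, y, lam)
    norms.append(np.linalg.norm(theta))
    print(f"lambda={lam}: theta={np.round(theta, 3)}, norm={norms[-1]:.3f}")
```

The coefficient norm falls as λ grows, but the individual coefficients stay nonzero.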

### Lasso Regression (L1 Regularization)  | [Link](https://github.com/AdilShamim8/50-Days-of-Machine-Learning/tree/main/Day%2037%20Lasso%20Regression)

- **Penalty term:** $$ \|\theta\|_1 = \sum_{j=1}^p |\theta_j| $$  
- **Cost function:**  

$$  
J_{\text{lasso}}(\theta) = \frac{1}{n}\sum_{i=1}^n \left(y_i - \theta^T x_i\right)^2 + \lambda \sum_{j=1}^p |\theta_j|.  
$$  

- **Effect:** Can force some coefficients exactly to zero, thus performing automatic feature selection.  
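
To see why the L1 penalty produces exact zeros, consider a simplified one-coefficient problem (an assumption made here for illustration, not the full regression): minimize (z − θ)² + λ|θ|, where z is the unpenalized estimate. The minimizer is the soft-thresholding rule θ = sign(z)·max(|z| − λ/2, 0). The corresponding L2 problem, (z − θ)² + λθ², has minimizer z / (1 + λ), which only rescales.

```python
import numpy as np

def soft_threshold(z, lam):
    """Minimizer of (z - theta)^2 + lam * |theta| for a single coefficient."""
    return np.sign(z) * np.maximum(np.abs(z) - lam / 2.0, 0.0)

def ridge_shrink(z, lam):
    """Minimizer of (z - theta)^2 + lam * theta^2, shown for comparison."""
    return z / (1.0 + lam)

z = np.array([3.0, 0.4, -0.2, -2.0])  # unpenalized estimates (illustrative)
lam = 1.0

lasso_theta = soft_threshold(z, lam)
ridge_theta = ridge_shrink(z, lam)
print("L1 solution:", lasso_theta)
print("L2 solution:", ridge_theta)
```

Under the L1 rule, every estimate with |z| ≤ λ/2 is set exactly to zero, while the L2 rule shrinks all estimates proportionally and zeroes none of them.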

### ElasticNet Regression  | [Link](https://github.com/AdilShamim8/50-Days-of-Machine-Learning/tree/main/Day%2038%20ElasticNet%20Regression)

ElasticNet combines both L1 and L2 penalties. Its cost function is:  

$$  
J_{\text{EN}}(\theta) = \frac{1}{n}\sum_{i=1}^n \left(y_i - \theta^T x_i\right)^2 + \lambda \left(\alpha \|\theta\|_1 + (1-\alpha)\|\theta\|_2^2\right),  
$$  

where:  
- λ controls the overall strength of regularization.  
- α ∈ [0, 1] determines the mix:  
  - α = 1 gives pure Lasso.  
  - α = 0 gives pure Ridge.  

ElasticNet is useful when you want a balance between coefficient shrinkage and variable selection, especially in high-dimensional settings.
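
The mixing behaviour of α can be checked numerically. The sketch below evaluates only the penalty part of the ElasticNet cost for a fixed coefficient vector; the function name and the coefficient values are illustrative assumptions.

```python
import numpy as np

def elastic_net_penalty(theta, alpha):
    """alpha * ||theta||_1 + (1 - alpha) * ||theta||_2^2, as in the cost above."""
    l1 = np.sum(np.abs(theta))
    l2_squared = np.sum(theta ** 2)
    return alpha * l1 + (1.0 - alpha) * l2_squared

theta = np.array([1.0, -2.0, 0.0])  # illustrative coefficients

pure_ridge = elastic_net_penalty(theta, alpha=0.0)  # only the squared L2 norm
pure_lasso = elastic_net_penalty(theta, alpha=1.0)  # only the L1 norm
half_mix = elastic_net_penalty(theta, alpha=0.5)    # equal weight on both
print(pure_ridge, pure_lasso, half_mix)
```

Note that scikit-learn's `ElasticNet` names these quantities differently: its `alpha` is the overall strength (λ here) and its `l1_ratio` is the mix (α here). It also scales the loss and the L2 term by constant factors, so numeric values of λ do not transfer one-to-one.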


---

## 4. Python Code Examples

Below is a self-contained example using the diabetes dataset bundled with scikit-learn that shows how to apply ordinary linear regression, Ridge, Lasso, and ElasticNet. We also show how to use cross-validation to tune hyperparameters for ElasticNet.

```python
# load_boston was removed in scikit-learn 1.2; the bundled diabetes dataset is used instead.
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet, ElasticNetCV

# Load the dataset
X, y = load_diabetes(return_X_y=True)

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Standardize features (regularization is sensitive to scale).
# Fit the scaler on the training set only, so no test-set statistics leak into training.
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# ---------------------------
# 1. Ordinary Linear Regression
# ---------------------------
lr = LinearRegression()
lr.fit(X_train, y_train)
print("Linear Regression R^2:", lr.score(X_test, y_test))

# ---------------------------
# 2. Ridge Regression (L2)
# ---------------------------
ridge = Ridge(alpha=1.0)  # Try different alpha values to control penalty strength.
ridge.fit(X_train, y_train)
print("Ridge Regression R^2:", ridge.score(X_test, y_test))

# ---------------------------
# 3. Lasso Regression (L1)
# ---------------------------
lasso = Lasso(alpha=0.1)  # alpha is the penalty strength (the role λ plays in the formulas).
lasso.fit(X_train, y_train)
print("Lasso Regression R^2:", lasso.score(X_test, y_test))

# ---------------------------
# 4. ElasticNet Regression
# ---------------------------
elastic = ElasticNet(alpha=1.0, l1_ratio=0.5)  # 50-50 mix of L1 and L2
elastic.fit(X_train, y_train)
print("ElasticNet Regression R^2:", elastic.score(X_test, y_test))

# ---------------------------
# 5. ElasticNet with Cross-Validation
# ---------------------------
elastic_cv = ElasticNetCV(
    alphas=[0.001, 0.01, 0.1, 1.0, 10.0],
    l1_ratio=[0.1, 0.5, 0.9],
    cv=5,
    random_state=42
)
elastic_cv.fit(X_train, y_train)
print("ElasticNetCV Best alpha:", elastic_cv.alpha_)
print("ElasticNetCV Best l1_ratio:", elastic_cv.l1_ratio_)
print("ElasticNetCV R^2:", elastic_cv.score(X_test, y_test))
```

### Explanation

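- **StandardScaler:** Regularization penalties are scale-dependent: a feature with a small numeric range needs a larger coefficient and is therefore penalized more heavily. Standardizing puts all features on a comparable scale so the penalty treats their coefficients evenly.
- **Ridge vs. Lasso:** Ridge shrinks all coefficients (helpful with multicollinearity), while Lasso may set some coefficients exactly to zero (feature selection).
- **ElasticNet:** Combines both effects. `ElasticNetCV` selects `alpha` and `l1_ratio` by cross-validation over the candidate values you supply. In scikit-learn, `alpha` plays the role of λ and `l1_ratio` the role of α in the formulas above.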
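
The Ridge vs. Lasso contrast can be checked by counting coefficients that are exactly zero. The sketch below uses synthetic data from `make_regression` rather than the dataset above; the sample sizes, noise level, and `alpha=5.0` are assumptions chosen so the effect is easy to see.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge
from sklearn.preprocessing import StandardScaler

# Synthetic data: 20 features, only 5 of which actually drive the target.
X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=10.0, random_state=0)

# No train/test split here: we only inspect the fitted coefficients.
X = StandardScaler().fit_transform(X)

ridge = Ridge(alpha=5.0).fit(X, y)
lasso = Lasso(alpha=5.0).fit(X, y)

n_zero_ridge = int(np.sum(ridge.coef_ == 0))
n_zero_lasso = int(np.sum(lasso.coef_ == 0))
print("Coefficients exactly zero, Ridge:", n_zero_ridge)
print("Coefficients exactly zero, Lasso:", n_zero_lasso)
```

Ridge keeps every coefficient nonzero, whereas Lasso drops some of the uninformative features entirely. The exact count depends on `alpha` and on the data.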

---

## 5. Conclusion

Regularization techniques are essential for building robust predictive models, especially in high-dimensional spaces. They help balance the bias–variance trade-off by penalizing large coefficients:

- **Ridge Regression** (L2) controls coefficient magnitude.
- **Lasso Regression** (L1) performs variable selection by enforcing sparsity.
- **ElasticNet Regression** provides a tunable balance between L1 and L2 regularization.

Using Python’s scikit-learn, you can quickly experiment with these methods and tune hyperparameters using cross-validation, thereby improving your model’s generalization performance.

This overview, with its formulas and code examples, should give you a solid foundation for exploring regularized linear models further.