<img src="./images/banner.png" width="800">

# Regularization Techniques in Linear Regression

Regularization is a fundamental concept in machine learning that addresses one of the most common challenges in model building: creating models that generalize well to unseen data. As we delve into more complex models, the risk of overfitting increases, and regularization provides a powerful set of techniques to mitigate this risk.


Regularization refers to a set of techniques that prevent overfitting by adding a penalty term to the loss function. This penalty discourages the model from becoming too complex, effectively constraining its capacity to memorize the training data. Regularization helps to create models that are simpler and more generalizable, striking a balance between fitting the training data and maintaining predictive power on new, unseen data.


Why is Regularization Important?

1. **Prevents Overfitting:** By limiting model complexity, regularization helps avoid the model fitting noise in the training data.

2. **Improves Generalization:** Regularized models often perform better on unseen data, making them more reliable in real-world applications.

3. **Feature Selection:** Some regularization techniques can help identify the most important features, leading to more interpretable models.

4. **Handling High-Dimensional Data:** Regularization is particularly useful when dealing with datasets where the number of features is large relative to the number of samples.


In this lecture, we'll focus on three main types of regularization for linear regression:

1. Ridge Regression (L2 Regularization)
2. Lasso Regression (L1 Regularization)
3. Elastic Net Regularization (Combination of L1 and L2)


<img src="./images/regularization.webp" width="800">

Each of these techniques adds a different type of penalty to the model, resulting in different effects on the model's behavior and the resulting coefficients.


Regularization is not just a technique for linear regression; it's a fundamental concept that extends to many areas of machine learning, including:

- Logistic Regression
- Neural Networks (weight decay, dropout)
- Support Vector Machines
- And many more advanced models


💡 **Pro Tip:** Understanding regularization in the context of linear regression provides a solid foundation for grasping more complex regularization techniques in advanced machine learning models.


In the following sections, we'll explore the problem of overfitting, delve into the bias-variance trade-off, and then examine each regularization technique in detail. We'll also look at practical implementations and guidelines for choosing the right regularization approach for your specific problem.


While regularization is a powerful tool, it's not a silver bullet. It's crucial to understand when and how to apply these techniques effectively. By the end of this lecture, you'll have a comprehensive understanding of regularization techniques, their impact on model performance, and how to implement them in your own machine learning projects.

**Table of contents**<a id='toc0_'></a>    
- [The Problem of Overfitting](#toc1_)    
  - [Symptoms of Overfitting](#toc1_1_)    
  - [Causes of Overfitting](#toc1_2_)    
  - [The Importance of Addressing Overfitting](#toc1_3_)    
- [Understanding the Bias-Variance Trade-off](#toc2_)    
  - [Decomposing Prediction Error](#toc2_1_)    
  - [High Bias vs. High Variance Models](#toc2_2_)    
  - [The Trade-off Curve](#toc2_3_)    
  - [Implications for Regularization](#toc2_4_)    
  - [Practical Considerations](#toc2_5_)    
- [Ridge Regression (L2 Regularization)](#toc3_)    
  - [Deriving the Ridge Regression Objective](#toc3_1_)    
  - [How Ridge Regression Works](#toc3_2_)    
  - [Choosing the Regularization Parameter](#toc3_3_)    
  - [Implementing Ridge Regression](#toc3_4_)    
- [Lasso Regression (L1 Regularization)](#toc4_)    
  - [How Lasso Regression Works](#toc4_1_)    
  - [The Path of Coefficients](#toc4_2_)    
  - [Choosing the Regularization Parameter](#toc4_3_)    
  - [Advantages and Limitations](#toc4_4_)    
  - [Implementing Lasso Regression](#toc4_5_)    
  - [Practical Considerations](#toc4_6_)    
- [Elastic Net Regularization](#toc5_)    
  - [Mathematical Formulation](#toc5_1_)    
  - [How Elastic Net Works](#toc5_2_)    
  - [Choosing Hyperparameters](#toc5_3_)    
  - [Advantages and Limitations](#toc5_4_)    
  - [Implementing Elastic Net](#toc5_5_)    
  - [Practical Considerations](#toc5_6_)    
- [Impact of Regularization on Bias and Variance](#toc6_)    
  - [Impact of Ridge Regression (L2)](#toc6_1_)    
  - [Impact of Lasso Regression (L1)](#toc6_2_)    
  - [Impact of Elastic Net](#toc6_3_)    
  - [Practical Implications](#toc6_4_)    
  - [Visualization of the Impact](#toc6_5_)    
- [Summary and Key Takeaways](#toc7_)    

<!-- vscode-jupyter-toc-config
	numbering=false
	anchor=true
	flat=false
	minLevel=2
	maxLevel=6
	/vscode-jupyter-toc-config -->
<!-- THIS CELL WILL BE REPLACED ON TOC UPDATE. DO NOT WRITE YOUR TEXT IN THIS CELL -->

## <a id='toc1_'></a>[The Problem of Overfitting](#toc0_)

Overfitting is a fundamental challenge in machine learning that occurs when a model learns the training data too well, including its noise and peculiarities, at the expense of generalizing to new, unseen data. Understanding overfitting is crucial for developing effective machine learning models, as it directly impacts a model's real-world performance and reliability.


Overfitting occurs when a model captures not just the underlying patterns in the data, but also the random fluctuations and noise. As a result, the model performs exceptionally well on the training data but fails to generalize to new, unseen data.


<img src="./images/overfitting.png" width="800">

🔑 **Key Concept:** An overfit model is like memorizing answers to a test rather than understanding the underlying concepts. It performs well on the known questions but struggles with new, slightly different problems.


### <a id='toc1_1_'></a>[Symptoms of Overfitting](#toc0_)


Recognizing overfitting is crucial for model evaluation and improvement. Here are some common signs:

1. **Large Gap Between Training and Test Performance:** 
   When a model shows significantly better performance on the training data compared to the test data, it's a strong indicator of overfitting. This gap suggests that the model has learned patterns specific to the training set that don't generalize well.

2. **Excessive Model Complexity:** 
   Overly complex models with many parameters relative to the amount of training data are prone to overfitting. In linear regression, this might manifest as coefficients with very large absolute values or a model that uses many features to explain a simple relationship.

3. **Poor Performance on New Data:** 
   The ultimate test of a model is its performance on new, unseen data. If a model that performs well during training fails to make accurate predictions on new data, overfitting is likely the culprit.
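
The first symptom, the gap between training and test error, can be reproduced in a few lines. The sketch below is only an illustration under assumed settings: the quadratic ground truth, the noise level, the sample sizes, and the two polynomial degrees are all choices made for this demo, not values from the lecture.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)

def make_data(n):
    """Noisy samples from a quadratic (the true function is an assumption of this demo)."""
    x = rng.uniform(-1, 1, size=n)
    y = 1.0 + 2.0 * x - 3.0 * x**2 + rng.normal(scale=0.3, size=n)
    return x, y

x_train, y_train = make_data(15)    # small training set
x_test, y_test = make_data(200)     # larger held-out set

def mse(model, x, y):
    return float(np.mean((model(x) - y) ** 2))

results = {}
for degree in (2, 10):
    # Least-squares polynomial fit of the given degree
    model = Polynomial.fit(x_train, y_train, deg=degree)
    results[degree] = (mse(model, x_train, y_train), mse(model, x_test, y_test))
    print(f"degree={degree:2d}  train MSE={results[degree][0]:.3f}  "
          f"test MSE={results[degree][1]:.3f}")
```

The degree-10 fit can never have a higher training error than the degree-2 fit, because the simpler model is a special case of it. What matters is how far its test error sits above its training error.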


### <a id='toc1_2_'></a>[Causes of Overfitting](#toc0_)


Understanding the root causes of overfitting can help in developing strategies to prevent it:

1. **Limited Data:** 
   When the training dataset is too small, the model might learn the noise in the data rather than the true underlying pattern. This is particularly problematic in high-dimensional spaces where the number of features is large compared to the number of samples.

2. **Model Complexity:** 
   Models with high complexity (e.g., many parameters, high-degree polynomials in regression) have the capacity to fit the training data very closely, including its noise. While this results in low training error, it often leads to poor generalization.

3. **Noise in the Data:** 
   Real-world data often contains noise — random fluctuations or errors in measurement. An overfit model might interpret this noise as a pattern to be learned, leading to poor generalization.


### <a id='toc1_3_'></a>[The Importance of Addressing Overfitting](#toc0_)


Tackling overfitting is crucial for several reasons:

- **Model Reliability:** Overfit models are unreliable in real-world applications, as their performance on new data can be unpredictable.
- **Resource Efficiency:** Simpler models that generalize well are often more computationally efficient and easier to maintain.
- **Interpretability:** Non-overfit models tend to be more interpretable, providing clearer insights into the underlying relationships in the data.


🤔 **Why This Matters:** In practical machine learning, the goal is rarely to achieve perfect performance on the training data. Instead, we aim for models that generalize well, making reliable predictions on new, unseen data. Understanding and addressing overfitting is key to achieving this goal.


In the next sections, we'll explore the bias-variance trade-off, which provides a theoretical framework for understanding overfitting, and then delve into regularization techniques that help combat this problem in linear regression models.

## <a id='toc2_'></a>[Understanding the Bias-Variance Trade-off](#toc0_)

The bias-variance trade-off is a fundamental concept in machine learning that provides a framework for understanding model error and the problem of overfitting. It helps us balance the complexity of our model against its ability to generalize to new data. This trade-off is crucial in choosing the right model and in applying regularization techniques effectively.


### <a id='toc2_1_'></a>[Decomposing Prediction Error](#toc0_)


To understand the bias-variance trade-off, we first need to break down the sources of error in our predictions. The expected prediction error of a model can be decomposed into three components:

1. **Bias:** The error introduced by approximating a real-world problem with a simplified model.
2. **Variance:** The error due to the model's sensitivity to fluctuations in the training data.
3. **Irreducible Error:** The inherent noise in the problem that cannot be reduced by any model.


Mathematically, this decomposition is often expressed as:

$$ \text{Expected Error} = \text{Bias}^2 + \text{Variance} + \text{Irreducible Error} $$


<img src="./images/bias-variance-tradeoff.png" width="800">

<img src="./images/bias-variance.png" width="800">

🔑 **Key Concept:** The goal in model selection is to find the sweet spot that minimizes the total error (squared bias plus variance), recognizing that reducing one component often comes at the cost of increasing the other.


### <a id='toc2_2_'></a>[High Bias vs. High Variance Models](#toc0_)


Understanding the characteristics of high bias and high variance models helps in diagnosing and addressing model performance issues:

1. **High Bias (Underfitting)**
   - Symptoms: Poor performance on both training and test data
   - Characteristics: 
     - Too simple to capture the underlying pattern in the data
     - Makes strong assumptions about the data distribution
     - Examples: Linear models for complex, non-linear relationships

   🔍 **Example:** Using a linear regression to model a clearly quadratic relationship would result in high bias.

2. **High Variance (Overfitting)**
   - Symptoms: Excellent performance on training data, poor performance on test data
   - Characteristics:
     - Too complex, capturing noise in the training data
     - Highly sensitive to small fluctuations in the training set
     - Examples: High-degree polynomial models, deep neural networks with many parameters

   🔍 **Example:** Fitting a high-degree polynomial to a dataset with just a few points would likely result in high variance.


### <a id='toc2_3_'></a>[The Trade-off Curve](#toc0_)


The relationship between model complexity, bias, and variance can be visualized with a characteristic U-shaped curve:

<img src="./images/bias-variance-2.jpeg" width="800">

As model complexity increases:
- Bias typically decreases (the model can fit the data more closely)
- Variance typically increases (the model becomes more sensitive to the specific training data)


The optimal model complexity occurs at the point where the sum of squared bias and variance is minimized.
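
One way to make the curve concrete is to estimate squared bias and variance by simulation: refit the same model on many freshly drawn training sets and look at how the predictions behave. The following is a rough sketch, assuming a sine-shaped ground truth, Gaussian noise, and polynomial models of degree 1, 3, and 9. All of these are illustrative choices.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(1)

def true_f(x):
    return np.sin(2 * np.pi * x)   # assumed ground truth for the demo

x_grid = np.linspace(0.05, 0.95, 50)          # points where the error is evaluated
noise_sd, n_train, n_repeats = 0.3, 25, 300

def bias_variance(degree):
    """Estimate squared bias and variance of a degree-`degree` polynomial fit."""
    preds = np.empty((n_repeats, x_grid.size))
    for r in range(n_repeats):
        # Draw a fresh training set and refit the model
        x = rng.uniform(0, 1, n_train)
        y = true_f(x) + rng.normal(scale=noise_sd, size=n_train)
        preds[r] = Polynomial.fit(x, y, deg=degree)(x_grid)
    mean_pred = preds.mean(axis=0)
    bias_sq = float(np.mean((mean_pred - true_f(x_grid)) ** 2))
    variance = float(np.mean(preds.var(axis=0)))
    return bias_sq, variance

summary = {d: bias_variance(d) for d in (1, 3, 9)}
for d, (b2, var) in summary.items():
    print(f"degree={d}  bias^2={b2:.4f}  variance={var:.4f}  sum={b2 + var:.4f}")
```

In runs of this kind the linear model typically shows the largest bias and the degree-9 model the largest variance, with the intermediate degree giving the smallest sum.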


### <a id='toc2_4_'></a>[Implications for Regularization](#toc0_)


Understanding the bias-variance trade-off is crucial for effective regularization:

- Regularization techniques generally work by increasing bias slightly to achieve a larger reduction in variance.
- The goal is to move along the trade-off curve to find the optimal balance for a given problem and dataset.


💡 **Pro Tip:** When applying regularization, monitor both training and validation performance. Training error usually rises slightly as the penalty grows; if validation error falls at the same time, you're reducing variance without adding much bias. If both get worse, the penalty is probably too strong.


### <a id='toc2_5_'></a>[Practical Considerations](#toc0_)


In practice, navigating the bias-variance trade-off involves:
- Careful feature selection and engineering
- Choosing an appropriate model complexity
- Using cross-validation to estimate the true error
- Applying regularization techniques judiciously


> **Important Note:** The optimal trade-off point depends on your specific problem, the amount of data available, and the cost associated with different types of errors in your application.


By understanding the bias-variance trade-off, you'll be better equipped to diagnose model performance issues, choose appropriate regularization techniques, and ultimately build models that generalize well to new, unseen data. In the following sections, we'll explore specific regularization techniques that help manage this trade-off in the context of linear regression.

## <a id='toc3_'></a>[Ridge Regression (L2 Regularization)](#toc0_)

Ridge Regression, also known as L2 regularization, is a powerful technique used to mitigate overfitting in linear regression models. It can be derived from a probabilistic perspective, providing insights into its underlying assumptions and behavior.


This regularization technique is called L2 because it adds a penalty term to the loss function that corresponds to the squared magnitude of the coefficients. This penalty encourages the model to keep the coefficients small, effectively shrinking them towards zero.


Ridge regression can be understood as a Maximum A Posteriori (MAP) estimation under specific probabilistic assumptions:

1. **Likelihood**: We assume the target variable follows a normal distribution around the model's prediction:

   $y | X, \beta \sim \mathcal{N}(X\beta, \sigma^2I)$

2. **Prior**: We assume a Gaussian prior on the weights:

   $\beta \sim \mathcal{N}(0, \tau^2I)$

Here, $\sigma^2$ represents the noise variance, and $\tau^2$ controls the spread of the prior on weights.


<img src="./images/normal.png" width="800">

### <a id='toc3_1_'></a>[Deriving the Ridge Regression Objective](#toc0_)


Using Bayes' theorem, the posterior distribution of $\beta$ given the data is:

$p(\beta | X, y) \propto p(y | X, \beta) \cdot p(\beta)$


Taking the negative log of this posterior (which turns the product into a sum and the maximization into a minimization):

$-\log p(\beta | X, y) = -\log p(y | X, \beta) - \log p(\beta) + \text{const}$


Expanding these terms, up to additive constants that do not depend on $\beta$:

1. $-\log p(y | X, \beta) = \frac{1}{2\sigma^2} \|y - X\beta\|^2_2 + \text{const}$
2. $-\log p(\beta) = \frac{1}{2\tau^2} \|\beta\|^2_2 + \text{const}$


Combining these, multiplying through by $2\sigma^2$, and dropping constants, we get the Ridge regression objective:

$\min_{\beta} \left\{ \|y - X\beta\|^2_2 + \lambda \|\beta\|^2_2 \right\}$

Where $\lambda = \frac{\sigma^2}{\tau^2}$ is the regularization parameter.
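
Because this objective is quadratic in $\beta$, setting its gradient to zero gives the closed-form solution $\hat{\beta} = (X^TX + \lambda I)^{-1}X^Ty$. The snippet below is a minimal sketch of that formula on synthetic data. The function name `ridge_closed_form` and the data are made up for illustration, and no intercept is fitted, so every coefficient is penalized.

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    """Minimizer of ||y - X beta||^2 + lam * ||beta||^2 (no intercept term)."""
    p = X.shape[1]
    # Solve (X^T X + lam * I) beta = X^T y rather than forming an explicit inverse
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
beta_true = np.array([2.0, -1.0, 0.5])
y = X @ beta_true + rng.normal(scale=0.1, size=50)

beta_ols = ridge_closed_form(X, y, lam=0.0)     # lam = 0 recovers least squares
beta_ridge = ridge_closed_form(X, y, lam=10.0)  # coefficients shrink toward zero
print("lambda = 0 :", beta_ols.round(3))
print("lambda = 10:", beta_ridge.round(3))
```

With `lam=0.0` the result coincides with ordinary least squares, and the norm of the coefficient vector decreases as `lam` grows.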


<img src="./images/ridge.webp" width="400">

🔑 **Key Concept:** The regularization parameter $\lambda$ represents the ratio of the noise variance to the prior variance on weights. A larger $\lambda$ implies more confidence in the prior (that weights should be close to zero) relative to the data.


### <a id='toc3_2_'></a>[How Ridge Regression Works](#toc0_)


Ridge regression operates by shrinking the coefficients of correlated predictors towards each other, allowing them to borrow strength from one another. This has several important effects:

1. **Coefficient Stabilization:** 
   In the presence of multicollinearity, ordinary least squares can produce wildly varying coefficients. Ridge regression stabilizes these coefficients, making them more reliable.

2. **Variance Reduction:** 
   By constraining the coefficients, ridge regression reduces the model's variance, often at the cost of introducing a small amount of bias. This is particularly beneficial when the least squares estimates have high variance.

3. **Improved Generalization:** 
   The coefficient shrinkage often leads to better performance on unseen data, as it prevents the model from fitting noise in the training data too closely.


Geometrically, ridge regression can be viewed as constraining the coefficient vector to lie within a sphere centered at the origin. The radius of this sphere is determined by the regularization parameter $\lambda$: a larger $\lambda$ corresponds to a smaller radius. This constraint discourages any single coefficient from growing very large, promoting a more balanced use of features.


### <a id='toc3_3_'></a>[Choosing the Regularization Parameter](#toc0_)


The choice of $\lambda$ is crucial in ridge regression:

- $\lambda = 0$: Equivalent to ordinary least squares (maximum likelihood estimation)
- $\lambda \to \infty$: All coefficients approach zero (except the intercept), reflecting complete trust in the prior
- Optimal $\lambda$: Typically chosen through cross-validation
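
As a rough illustration of the cross-validation option, the sketch below runs 5-fold cross-validation over a logarithmic grid of $\lambda$ values, using the closed-form ridge solution for each fit. The helper names (`ridge_fit`, `cv_mse`), the grid, and the synthetic data are assumptions made for this example.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge solution (no intercept)."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

def cv_mse(X, y, lam, k=5, seed=0):
    """Average validation MSE of ridge over k shuffled folds."""
    idx = np.random.default_rng(seed).permutation(len(y))
    folds = np.array_split(idx, k)
    errors = []
    for i in range(k):
        val = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        beta = ridge_fit(X[train], y[train], lam)
        errors.append(np.mean((X[val] @ beta - y[val]) ** 2))
    return float(np.mean(errors))

rng = np.random.default_rng(42)
X = rng.normal(size=(40, 30))            # few samples relative to features
beta_true = np.zeros(30)
beta_true[:5] = 1.0                      # only five features matter
y = X @ beta_true + rng.normal(scale=1.0, size=40)

lambdas = np.logspace(-3, 3, 13)
scores = [cv_mse(X, y, lam) for lam in lambdas]
best_lambda = lambdas[int(np.argmin(scores))]
print(f"best lambda by 5-fold CV: {best_lambda:.3g}")
```

The same folds are reused for every $\lambda$ (fixed `seed`), so the scores are directly comparable across the grid.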


💡 **Pro Tip:** Plot the coefficient paths (how coefficients change with $\lambda$) to gain insights into feature importance and model behavior.


Note that ridge regression does not perform feature selection; all features are retained in the model. If feature selection is desired, Lasso regression (L1 regularization) can be a better choice (which we'll cover next).


Here are some key advantages and limitations of Ridge Regression:

**Advantages:**
- Handles multicollinearity effectively
- Often improves prediction accuracy on new data
- Provides a continuous shrinkage of coefficients

**Limitations:**
- Does not perform feature selection (all features are kept in the model)
- Scale-dependent: Features need to be standardized before applying ridge regression


### <a id='toc3_4_'></a>[Implementing Ridge Regression](#toc0_)


In practice, ridge regression is available in popular machine learning libraries such as scikit-learn. To see what happens under the hood, we implement it from scratch here instead.


Here's the mini-batch gradient descent function for linear regression, modified to include ridge regularization:

```python
import numpy as np

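def mini_batch_gradient_descent_with_ridge(X, y, learning_rate=0.01, num_epoch=1000, batch_size=20, lambda_reg=0.1):
    """Fit linear regression with an L2 penalty using mini-batch gradient descent."""
    n, m = X.shape                      # n samples, m features
    y = np.asarray(y).reshape(-1, 1)    # column vector, so (predictions - y) never broadcasts to (n, n)
    beta = np.random.randn(m, 1)
    cost_history = []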

    for _ in range(num_epoch):
        shuffled_indices = np.random.permutation(n)
        X_shuffled = X[shuffled_indices]
        y_shuffled = y[shuffled_indices]

        for i in range(0, n, batch_size):
            xi = X_shuffled[i:i+batch_size]
            yi = y_shuffled[i:i+batch_size]
            gradient = compute_gradient_with_ridge(xi, yi, beta, lambda_reg)
            beta -= learning_rate * gradient

        cost = compute_cost_with_ridge(X, y, beta, lambda_reg)
        cost_history.append(cost)

    return beta, cost_history

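def compute_gradient_with_ridge(X, y, beta, lambda_reg):
    m = X.shape[0]  # number of samples in the batch (not the number of features)
    predictions = X.dot(beta)
    # Gradient of the squared-error term plus the gradient of the L2 penalty
    gradient = (1/m) * X.T.dot(predictions - y) + (lambda_reg/m) * beta
    return gradient

def compute_cost_with_ridge(X, y, beta, lambda_reg):
    m = X.shape[0]  # number of samples
    predictions = X.dot(beta)
    # Half mean squared error plus the L2 penalty, both scaled by 1/m
    cost = (1/(2*m)) * np.sum((predictions - y)**2) + (lambda_reg/(2*m)) * np.sum(beta**2)
    return cost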
```


The main changes in this version are:

1. The function name is changed to `mini_batch_gradient_descent_with_ridge` to reflect the addition of ridge regularization.

2. A new parameter `lambda_reg` is added to control the strength of the regularization.

3. The `compute_gradient` and `compute_cost` functions are replaced with `compute_gradient_with_ridge` and `compute_cost_with_ridge`, respectively.

4. In `compute_gradient_with_ridge`, the regularization term `(lambda_reg/m) * beta` is added to the gradient calculation.

5. In `compute_cost_with_ridge`, the regularization term `(lambda_reg/(2*m)) * np.sum(beta**2)` is added to the cost calculation.


The ridge regularization adds a penalty term to the cost function, which helps prevent overfitting by discouraging large values in the beta coefficients. The `lambda_reg` parameter controls the strength of this regularization: a larger value will result in stronger regularization, while a value of 0 would be equivalent to no regularization. When using this function, you can adjust the `lambda_reg` parameter to find the right balance between fitting the training data and preventing overfitting.


> **Important Note:** Always standardize your features before applying ridge regression to ensure that the penalty is applied uniformly across all features.
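
A minimal sketch of that standardization step is shown below. The helper name `standardize` and the synthetic two-feature data are illustrative assumptions. The one detail that matters is that the test set is scaled with the training set's statistics, so no information leaks from test data into the fit.

```python
import numpy as np

def standardize(X_train, X_test):
    """Scale both sets using the training mean and standard deviation."""
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0)
    std = np.where(std == 0, 1.0, std)   # avoid division by zero for constant columns
    return (X_train - mean) / std, (X_test - mean) / std

rng = np.random.default_rng(0)
# Two features on very different scales
X_train = rng.normal(loc=[0.0, 500.0], scale=[1.0, 1000.0], size=(100, 2))
X_test = rng.normal(loc=[0.0, 500.0], scale=[1.0, 1000.0], size=(20, 2))

X_train_s, X_test_s = standardize(X_train, X_test)
print("train means:", X_train_s.mean(axis=0).round(6))
print("train stds :", X_train_s.std(axis=0).round(6))
```

After scaling, a penalty of a given size means the same thing for every coefficient, regardless of the units the original feature was measured in.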


By understanding Ridge Regression from both its mathematical derivation and practical application, you can more effectively leverage this powerful technique to build robust and generalizable linear models, especially in scenarios with multicollinearity or high-dimensional data.

## <a id='toc4_'></a>[Lasso Regression (L1 Regularization)](#toc0_)

Lasso (Least Absolute Shrinkage and Selection Operator) Regression is another powerful regularization technique used in linear regression. Unlike Ridge Regression, Lasso not only helps in preventing overfitting but also performs feature selection, making it particularly useful in high-dimensional datasets where only a subset of features are relevant.


Similar to Ridge Regression, Lasso can be derived from a Bayesian perspective as a Maximum A Posteriori (MAP) estimation:

1. **Likelihood**: We assume the same normal distribution for the target variable:

   $y | X, \beta \sim \mathcal{N}(X\beta, \sigma^2I)$

2. **Prior**: Instead of a Gaussian prior, Lasso assumes a Laplace prior with scale $b$ on the weights:

   $p(\beta) \propto \exp\left(-\frac{\|\beta\|_1}{b}\right)$


Following the same MAP estimation process as with Ridge Regression, we arrive at the Lasso objective function:

$\min_{\beta} \left\{ \|y - X\beta\|^2_2 + \lambda \|\beta\|_1 \right\}$

Where $\|\beta\|_1 = \sum_{j=1}^p |\beta_j|$ is the L1 norm of the coefficient vector and $\lambda = \frac{2\sigma^2}{b}$ is the regularization parameter.


<img src="./images/lasso.webp" width="400">

🔑 **Key Concept:** The use of the L1 norm in the penalty term leads to sparse solutions, effectively performing feature selection by pushing some coefficients exactly to zero.


<img src="./images/l1-l2.ppm" width="800">

### <a id='toc4_1_'></a>[How Lasso Regression Works](#toc0_)


Lasso regression has some unique properties that distinguish it from Ridge regression:

1. **Sparse Solutions**: 
   Lasso can produce sparse models by setting some coefficients exactly to zero. This is due to the geometry of the L1 penalty, which intersects with the error contours at corners, corresponding to zero values for some coefficients.

2. **Feature Selection**: 
   By pushing coefficients to zero, Lasso effectively performs feature selection, identifying the most important predictors in the model.

3. **Bias-Variance Trade-off**: 
   Like Ridge regression, Lasso reduces variance but may increase bias. However, the sparsity induced by Lasso can lead to simpler models that may generalize better in some cases.


Geometrically, the Lasso constraint can be visualized as a diamond-shaped region in 2D (or a cross-polytope in higher dimensions) centered at the origin. The optimization process finds the point where this constraint region first touches the contours of the least squares error function.


### <a id='toc4_2_'></a>[The Path of Coefficients](#toc0_)


One of the most insightful aspects of Lasso is the path of coefficients as $\lambda$ varies:

- As $\lambda$ increases, more coefficients are pushed to exactly zero.
- The order in which coefficients become non-zero as $\lambda$ decreases (or, equivalently, drop to zero as $\lambda$ increases) can provide insights into feature importance.


💡 **Pro Tip:** Plotting the Lasso path (coefficients vs. $\lambda$) can provide valuable insights into the relative importance of features and how the model changes with regularization strength.
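
The sketch below traces a few points on such a path by counting non-zero coefficients at several values of $\lambda$. It is only an illustration: it uses proximal gradient descent (ISTA), a solver swapped in here because it yields exact zeros, and it assumes the scaled objective $\frac{1}{2n}\|y - X\beta\|^2_2 + \lambda\|\beta\|_1$ with standardized features and a centered target. The function names and data are made up for the example.

```python
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_ista(X, y, lam, n_iter=2000):
    """Proximal gradient (ISTA) for (1/(2n)) * ||y - X b||^2 + lam * ||b||_1."""
    n, p = X.shape
    step = n / np.linalg.norm(X, 2) ** 2   # 1 / Lipschitz constant of the smooth part
    beta = np.zeros(p)
    for _ in range(n_iter):
        grad = X.T @ (X @ beta - y) / n
        # Gradient step on the squared error, then shrinkage for the L1 term
        beta = soft_threshold(beta - step * grad, step * lam)
    return beta

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))
X = (X - X.mean(axis=0)) / X.std(axis=0)
beta_true = np.array([3.0, -2.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0])
y = X @ beta_true + rng.normal(scale=0.5, size=100)
y = y - y.mean()

# For this objective, every coefficient is zero once lam reaches max|X^T y| / n
lam_max = np.max(np.abs(X.T @ y)) / len(y)
for frac in (0.5, 0.1, 0.01):
    beta = lasso_ista(X, y, frac * lam_max)
    print(f"lam = {frac:4.2f} * lam_max  nonzero = {np.count_nonzero(beta)}  "
          f"beta = {beta.round(2)}")
```

Features with the strongest effect tend to enter the model first as $\lambda$ is lowered from `lam_max`, which is the pattern a full coefficient-path plot makes visible.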


### <a id='toc4_3_'></a>[Choosing the Regularization Parameter](#toc0_)


As with Ridge regression, choosing the right $\lambda$ is crucial:

- $\lambda = 0$: Equivalent to ordinary least squares
- $\lambda \to \infty$: All coefficients become zero
- Optimal $\lambda$: Typically chosen through cross-validation


### <a id='toc4_4_'></a>[Advantages and Limitations](#toc0_)


**Advantages:**
- Performs feature selection, leading to simpler models
- Can handle high-dimensional data effectively
- Provides interpretable models by identifying key predictors

**Limitations:**
- Can be unstable when features are highly correlated
- May not perform well when there are many relevant features with small effects


### <a id='toc4_5_'></a>[Implementing Lasso Regression](#toc0_)


Here's the mini-batch gradient descent function modified to use Lasso regularization:

```python
import numpy as np

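def mini_batch_gradient_descent_with_lasso(X, y, learning_rate=0.01, num_epoch=1000, batch_size=20, lambda_reg=0.1):
    """Fit linear regression with an L1 penalty using mini-batch (sub)gradient descent."""
    n, m = X.shape                      # n samples, m features
    y = np.asarray(y).reshape(-1, 1)    # column vector, so (predictions - y) never broadcasts to (n, n)
    beta = np.random.randn(m, 1)
    cost_history = []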

    for _ in range(num_epoch):
        shuffled_indices = np.random.permutation(n)
        X_shuffled = X[shuffled_indices]
        y_shuffled = y[shuffled_indices]

        for i in range(0, n, batch_size):
            xi = X_shuffled[i:i+batch_size]
            yi = y_shuffled[i:i+batch_size]
            gradient = compute_gradient_with_lasso(xi, yi, beta, lambda_reg)
            beta -= learning_rate * gradient

        cost = compute_cost_with_lasso(X, y, beta, lambda_reg)
        cost_history.append(cost)

    return beta, cost_history

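def compute_gradient_with_lasso(X, y, beta, lambda_reg):
    m = X.shape[0]  # number of samples in the batch (not the number of features)
    predictions = X.dot(beta)
    # np.sign(beta) is a subgradient of the L1 norm (it returns 0 where beta is exactly 0)
    gradient = (1/m) * X.T.dot(predictions - y) + lambda_reg * np.sign(beta)
    return gradient

def compute_cost_with_lasso(X, y, beta, lambda_reg):
    m = X.shape[0]  # number of samples
    predictions = X.dot(beta)
    # Half mean squared error plus the L1 penalty
    cost = (1/(2*m)) * np.sum((predictions - y)**2) + lambda_reg * np.sum(np.abs(beta))
    return cost

def soft_thresholding(x, lambda_reg):
    # Shrink x toward zero by lambda_reg; values within [-lambda_reg, lambda_reg] become exactly 0
    return np.sign(x) * np.maximum(np.abs(x) - lambda_reg, 0)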
```


The main changes for Lasso regularization are:

1. The function name is changed to `mini_batch_gradient_descent_with_lasso`.

2. In `compute_gradient_with_lasso`, the regularization term is now `lambda_reg * np.sign(beta)`. This is because the (sub)gradient of the L1 norm (used in Lasso) is the sign function, with `np.sign` returning 0 at the non-differentiable point `beta = 0`.

3. In `compute_cost_with_lasso`, the regularization term is now `lambda_reg * np.sum(np.abs(beta))`. This is the L1 norm of the coefficients.

4. A new function `soft_thresholding` is added. This function can be used to implement coordinate descent for Lasso, which is often more efficient than gradient descent for Lasso problems. However, it's not used in the main function here.


Note that Lasso regularization tends to produce sparse solutions (i.e., it can drive some coefficients to exactly zero), which can be useful for feature selection. The `lambda_reg` parameter controls the strength of the regularization, with larger values producing sparser solutions.


One important consideration with Lasso is that the objective function is not differentiable at zero, which can cause issues with gradient-based optimization. In practice, this is often handled by using specialized optimization algorithms (like coordinate descent) or by using a smooth approximation of the L1 norm. The implementation provided here uses plain (sub)gradient descent, which may not converge to the optimal solution for Lasso problems, especially with high-dimensional data or large `lambda_reg` values. In particular, its updates rarely land on exactly zero: coefficients of irrelevant features tend to hover around small values, so exact sparsity usually requires a thresholding step such as `soft_thresholding`.
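
For comparison, here is a rough sketch of how cyclic coordinate descent could use `soft_thresholding`. It is not part of the implementation above. The function `lasso_coordinate_descent`, its defaults, and the synthetic data are assumptions for illustration, and it works with a 1-D `y` and the same scaled objective as `compute_cost_with_lasso`.

```python
import numpy as np

def soft_thresholding(x, lambda_reg):
    return np.sign(x) * np.maximum(np.abs(x) - lambda_reg, 0)

def lasso_coordinate_descent(X, y, lambda_reg=0.1, num_epoch=200):
    """Cyclic coordinate descent for (1/(2n)) * ||y - X beta||^2 + lambda_reg * ||beta||_1."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n      # (1/n) * x_j^T x_j for each column
    residual = y - X @ beta
    for _ in range(num_epoch):
        for j in range(p):
            if col_sq[j] == 0:
                continue                   # skip all-zero columns
            # Correlation of feature j with the residual that leaves feature j out
            rho = X[:, j] @ (residual + X[:, j] * beta[j]) / n
            new_beta_j = soft_thresholding(rho, lambda_reg) / col_sq[j]
            # Keep the residual in sync with the updated coefficient
            residual += X[:, j] * (beta[j] - new_beta_j)
            beta[j] = new_beta_j
    return beta

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
beta_true = np.array([4.0, 0.0, -3.0, 0.0, 0.0, 1.5])
y = X @ beta_true + rng.normal(scale=0.5, size=200)

beta_hat = lasso_coordinate_descent(X, y, lambda_reg=0.3)
print(beta_hat.round(3))
```

Each coordinate update solves its one-dimensional subproblem exactly, which is why irrelevant coefficients end up at exactly zero instead of oscillating around it.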

> **Important Note:** Like Ridge regression, it's important to standardize features before applying Lasso to ensure fair penalization across all features.


### <a id='toc4_6_'></a>[Practical Considerations](#toc0_)


1. **Feature Correlation**: When features are highly correlated, Lasso tends to pick one arbitrarily. Consider using Elastic Net (a combination of L1 and L2 penalties) in such cases.

2. **Stability**: The feature selection property of Lasso can be unstable across different samples. Techniques like stability selection can be used to improve reliability.

3. **Interpretability**: While Lasso provides a form of feature importance, be cautious about interpreting the magnitude of non-zero coefficients directly as feature importance.


By understanding the principles behind Lasso regression and its unique properties, you can effectively leverage this technique for both regularization and feature selection in your linear models, particularly in high-dimensional settings where identifying key predictors is crucial.

## <a id='toc5_'></a>[Elastic Net Regularization](#toc0_)

Elastic Net regularization is a hybrid approach that combines the strengths of both Ridge (L2) and Lasso (L1) regularization. It was developed to address some of the limitations of Lasso, particularly in scenarios with highly correlated features or when the number of predictors significantly exceeds the number of observations.


Elastic Net aims to strike a balance between the L1 and L2 penalties, offering a more flexible and robust regularization technique. It introduces two hyperparameters:

1. $\alpha$: Controls the overall strength of regularization
2. $\rho$: Determines the mix of L1 and L2 penalties


<img src="./images/elastic-net.webp" width="400">

### <a id='toc5_1_'></a>[Mathematical Formulation](#toc0_)


The Elastic Net objective function is defined as:

$\min_{\beta} \left\{ \|y - X\beta\|^2_2 + \alpha \left( \rho \|\beta\|_1 + \frac{1-\rho}{2} \|\beta\|^2_2 \right) \right\}$

Where:
- $\|\beta\|_1$ is the L1 norm (sum of absolute values)
- $\|\beta\|^2_2$ is the squared L2 norm
- $\alpha \geq 0$ is the overall regularization strength
- $0 \leq \rho \leq 1$ is the mixing parameter


🔑 **Key Concept:** When $\rho = 1$, Elastic Net becomes Lasso; when $\rho = 0$, it becomes Ridge regression. Values in between create a compromise between the two.
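
The two limiting cases are easy to check numerically. The helper below, `elastic_net_penalty`, is a hypothetical name introduced for this sketch; it simply evaluates the penalty term of the objective for a given coefficient vector.

```python
import numpy as np

def elastic_net_penalty(beta, alpha, rho):
    """alpha * (rho * ||beta||_1 + (1 - rho) / 2 * ||beta||_2^2)."""
    l1 = np.sum(np.abs(beta))
    l2_sq = np.sum(beta ** 2)
    return alpha * (rho * l1 + 0.5 * (1.0 - rho) * l2_sq)

beta = np.array([1.0, -2.0, 0.0, 0.5])
for rho in (0.0, 0.5, 1.0):
    penalty = elastic_net_penalty(beta, alpha=0.1, rho=rho)
    print(f"rho = {rho}: penalty = {penalty:.4f}")
```

At `rho=1.0` only the L1 term survives, and at `rho=0.0` only the halved squared L2 term, matching the Lasso and Ridge penalties respectively.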


From a Bayesian perspective, Elastic Net can be seen as placing a prior on the coefficients whose density is proportional to the product of a Laplace density (for L1) and a Gaussian density (for L2). This combined prior allows for both sparsity and the grouping effect of correlated features.


### <a id='toc5_2_'></a>[How Elastic Net Works](#toc0_)


Elastic Net combines the beneficial properties of both Ridge and Lasso:

1. **Feature Selection**: 
   Like Lasso, Elastic Net can produce sparse models by setting some coefficients to exactly zero, especially when $\rho$ is close to 1.

2. **Handling Correlated Features**: 
   Unlike Lasso, which tends to arbitrarily select one feature from a group of correlated features, Elastic Net can select groups of correlated features together.

3. **Stability**: 
   The L2 penalty provides stability, especially in scenarios where predictors are highly correlated, addressing a key limitation of Lasso.

4. **Overcome Limitations**: 
   Elastic Net can handle situations where the number of predictors (p) is much larger than the number of observations (n), a scenario where Lasso is limited to selecting at most n variables.
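
The grouping effect in point 2 can be seen on a deliberately extreme example: two identical feature columns. The sketch below uses a small coordinate descent solver written for this illustration (the function `elastic_net_cd` and the data are assumptions, and the loss is scaled by $\frac{1}{2n}$). With `rho=1.0` it behaves like Lasso; with `rho=0.5` the L2 part makes the solution unique and splits the weight evenly.

```python
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def elastic_net_cd(X, y, alpha, rho, n_sweeps=500):
    """Coordinate descent for
    (1/(2n)) * ||y - X b||^2 + alpha * (rho * ||b||_1 + (1 - rho) / 2 * ||b||_2^2)."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    residual = y.copy()
    for _ in range(n_sweeps):
        for j in range(p):
            corr = X[:, j] @ (residual + X[:, j] * beta[j]) / n
            # L1 part enters through the threshold, L2 part through the denominator
            new = soft_threshold(corr, alpha * rho) / (col_sq[j] + alpha * (1.0 - rho))
            residual += X[:, j] * (beta[j] - new)
            beta[j] = new
    return beta

rng = np.random.default_rng(0)
n = 200
z = rng.normal(size=n)
X = np.column_stack([z, z, rng.normal(size=n)])   # first two columns are identical
y = 3.0 * z + rng.normal(scale=0.5, size=n)

beta_lasso = elastic_net_cd(X, y, alpha=0.1, rho=1.0)
beta_enet = elastic_net_cd(X, y, alpha=0.1, rho=0.5)
print("rho = 1.0 (Lasso)      :", beta_lasso.round(3))
print("rho = 0.5 (Elastic Net):", beta_enet.round(3))
```

In the Lasso case the solver puts essentially all the weight on whichever duplicate it updates first, so the choice depends on the update order rather than on the data. The Elastic Net solution assigns the two duplicates the same coefficient.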


### <a id='toc5_3_'></a>[Choosing Hyperparameters](#toc0_)


Selecting appropriate values for $\alpha$ and $\rho$ is crucial for Elastic Net's performance:

- $\alpha$: Controls overall regularization strength. Larger values increase regularization.
- $\rho$: Balances L1 and L2 penalties. Values closer to 1 favor sparsity, while values closer to 0 favor Ridge-like behavior.


💡 **Pro Tip:** Use cross-validation with a grid search over different combinations of $\alpha$ and $\rho$ to find the optimal hyperparameters for your specific dataset.


### <a id='toc5_4_'></a>[Advantages and Limitations](#toc0_)


**Advantages:**
- Combines benefits of both L1 and L2 regularization
- Handles correlated features better than Lasso
- Can perform feature selection while keeping groups of correlated features together instead of arbitrarily dropping all but one

**Limitations:**
- Requires tuning of two hyperparameters instead of one
- May be computationally more intensive due to the additional hyperparameter


### <a id='toc5_5_'></a>[Implementing Elastic Net](#toc0_)


Here's an example of implementing Elastic Net from scratch with NumPy. It uses mini-batch gradient descent and adds both an L1 (Lasso) and an L2 (Ridge) penalty term to the cost and the gradient:

```python
import numpy as np

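def mini_batch_gradient_descent_with_elastic_net(X, y, learning_rate=0.01, num_epoch=1000, batch_size=20, lambda1=0.1, lambda2=0.1):
    n, m = X.shape  # n samples, m features
    # Force y into a column vector so that (predictions - y) is (n, 1) and not broadcast to (n, n)
    y = np.asarray(y).reshape(-1, 1)
    beta = np.random.randn(m, 1)
    cost_history = []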

    for _ in range(num_epoch):
        shuffled_indices = np.random.permutation(n)
        X_shuffled = X[shuffled_indices]
        y_shuffled = y[shuffled_indices]

        for i in range(0, n, batch_size):
            xi = X_shuffled[i:i+batch_size]
            yi = y_shuffled[i:i+batch_size]
            gradient = compute_gradient_with_elastic_net(xi, yi, beta, lambda1, lambda2)
            beta -= learning_rate * gradient

        cost = compute_cost_with_elastic_net(X, y, beta, lambda1, lambda2)
        cost_history.append(cost)

    return beta, cost_history

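def compute_gradient_with_elastic_net(X, y, beta, lambda1, lambda2):
    m = X.shape[0]  # number of samples in this batch
    predictions = X.dot(beta)
    l1_term = lambda1 * np.sign(beta)  # subgradient of lambda1 * ||beta||_1 (np.sign(0) is 0)
    l2_term = lambda2 * beta           # gradient of (lambda2 / 2) * ||beta||_2^2
    gradient = (1/m) * X.T.dot(predictions - y) + l1_term + l2_term
    return gradient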

def compute_cost_with_elastic_net(X, y, beta, lambda1, lambda2):
    m = X.shape[0]
    predictions = X.dot(beta)
    l1_penalty = lambda1 * np.sum(np.abs(beta))
    l2_penalty = (lambda2 / 2) * np.sum(beta**2)
    cost = (1/(2*m)) * np.sum((predictions - y)**2) + l1_penalty + l2_penalty
    return cost

def soft_thresholding(x, lambda1):
    return np.sign(x) * np.maximum(np.abs(x) - lambda1, 0)
```


Key points of this implementation:

1. There are two regularization parameters: `lambda1` for the L1 (Lasso) term and `lambda2` for the L2 (Ridge) term.

2. In `compute_gradient_with_elastic_net`, we include both the L1 term (`lambda1 * np.sign(beta)`) and the L2 term (`lambda2 * beta`) in the gradient calculation.

3. In `compute_cost_with_elastic_net`, we include both the L1 penalty (`lambda1 * np.sum(np.abs(beta))`) and the L2 penalty (`(lambda2 / 2) * np.sum(beta**2)`) in the cost calculation.

4. The `soft_thresholding` helper is not called by the gradient descent loop. It is included because it is the core update in coordinate descent and proximal gradient implementations of Elastic Net.


Elastic Net regularization combines the benefits of both Lasso and Ridge regularization:

- Like Lasso, it can produce sparse models by driving some coefficients to exactly zero, which is useful for feature selection.
- Like Ridge, it handles correlated features well, and it is not restricted to selecting at most n features when p > n (more features than samples).


The balance between L1 and L2 regularization can be adjusted by changing the relative values of `lambda1` and `lambda2`:

- If `lambda1 = 0`, it reduces to Ridge regression.
- If `lambda2 = 0`, it reduces to Lasso regression.
- When both are non-zero, you get the benefits of both regularization techniques.


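As with Lasso, the L1 term is not differentiable at zero, so the update above relies on the subgradient `np.sign(beta)`. One practical consequence is that plain (sub)gradient descent typically leaves small coefficients oscillating around zero rather than setting them exactly to zero. More sophisticated optimization techniques (like coordinate descent or proximal gradient methods) are often used for Elastic Net in practice, especially for high-dimensional problems.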
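
To make the proximal gradient idea concrete, here is a minimal full-batch sketch. It takes a gradient step on the smooth part of the cost (squared error plus the L2 penalty) and then applies soft-thresholding to handle the L1 penalty. The function name `proximal_gradient_elastic_net`, the step size, and the synthetic data are assumptions made for illustration; this is not how any particular library implements Elastic Net.

```python
import numpy as np

def soft_thresholding(x, threshold):
    # Shrink every entry towards zero by `threshold`, clipping at exactly zero.
    return np.sign(x) * np.maximum(np.abs(x) - threshold, 0.0)

def proximal_gradient_elastic_net(X, y, lambda1=0.1, lambda2=0.1,
                                  learning_rate=0.1, num_iter=500):
    n, m = X.shape
    y = np.asarray(y).reshape(-1, 1)
    beta = np.zeros((m, 1))
    for _ in range(num_iter):
        # Gradient of the smooth part: (1/(2n)) * ||X beta - y||^2 + (lambda2/2) * ||beta||^2
        smooth_grad = X.T @ (X @ beta - y) / n + lambda2 * beta
        # Proximal step for lambda1 * ||beta||_1 is soft-thresholding
        beta = soft_thresholding(beta - learning_rate * smooth_grad,
                                 learning_rate * lambda1)
    return beta

# Synthetic data: only features 0 and 2 influence the target.
rng = np.random.default_rng(0)
X = rng.standard_normal((500, 5))
true_beta = np.array([[3.0], [0.0], [-2.0], [0.0], [0.0]])
y = X @ true_beta + 0.1 * rng.standard_normal((500, 1))

beta_hat = proximal_gradient_elastic_net(X, y)
print(beta_hat.ravel())
```

Because the thresholding step clips small values to exactly zero, the irrelevant coefficients end up at zero instead of hovering near it, which is the behaviour the subgradient update struggles to achieve.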

> **Important Note:** As with Ridge and Lasso, it's crucial to standardize features before applying Elastic Net to ensure fair penalization across all features.
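
The effect of feature scale on a penalized fit can be checked directly. The sketch below uses a closed-form Ridge solution (chosen only because it is short; the same reasoning applies to the L1 and Elastic Net penalties) and changes the unit of one feature. The helper names `ridge_closed_form` and `standardize` and the data are illustrative assumptions.

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    """Minimize (1/(2n)) * ||y - X b||^2 + (lam/2) * ||b||^2 (no intercept)."""
    n, p = X.shape
    return np.linalg.solve(X.T @ X / n + lam * np.eye(p), X.T @ y / n)

def standardize(X):
    # Zero mean and unit variance per column.
    return (X - X.mean(axis=0)) / X.std(axis=0)

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 3))
y = X @ np.array([1.0, 2.0, -1.0]) + 0.1 * rng.standard_normal(100)

X_rescaled = X.copy()
X_rescaled[:, 0] *= 1000.0  # same information, different unit (e.g. metres to millimetres)

# Without standardization, the unit change alters the fitted predictions.
pred_raw = X @ ridge_closed_form(X, y, lam=1.0)
pred_rescaled = X_rescaled @ ridge_closed_form(X_rescaled, y, lam=1.0)
print("Max prediction change without standardization:",
      np.max(np.abs(pred_raw - pred_rescaled)))

# With standardization, both versions of the data lead to the same model.
Z, Z_rescaled = standardize(X), standardize(X_rescaled)
pred_z = Z @ ridge_closed_form(Z, y, lam=1.0)
pred_z_rescaled = Z_rescaled @ ridge_closed_form(Z_rescaled, y, lam=1.0)
print("Max prediction change with standardization:   ",
      np.max(np.abs(pred_z - pred_z_rescaled)))
```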


### <a id='toc5_6_'></a>[Practical Considerations](#toc0_)


1. **Interpretation**: While Elastic Net provides a form of feature importance, be cautious about directly interpreting the magnitude of non-zero coefficients as feature importance.

2. **Computational Cost**: The additional hyperparameter can make the tuning process more computationally intensive compared to Ridge or Lasso alone.

3. **Model Complexity**: Elastic Net models can be more complex than pure Lasso models but may offer better predictive performance, especially with correlated features.


By leveraging Elastic Net regularization, you can create models that benefit from both the sparsity of Lasso and the stability of Ridge regression. This makes Elastic Net a versatile choice, particularly useful in scenarios with complex feature interactions or when dealing with high-dimensional data where feature selection is desirable but pure Lasso might be too aggressive.

## <a id='toc6_'></a>[Impact of Regularization on Bias and Variance](#toc0_)

Understanding how regularization affects the bias-variance trade-off is crucial for effectively applying these techniques in practice. This section explores how different regularization methods impact model bias and variance, providing insights into when and why to use each approach.


Before diving into the impact of regularization, let's briefly recap the bias-variance trade-off:

- **Bias**: The error introduced by approximating a real-world problem with a simplified model.
- **Variance**: The model's sensitivity to fluctuations in the training data.
- **Trade-off**: As model complexity increases, bias tends to decrease while variance increases.


The goal of regularization is to find an optimal balance between bias and variance to minimize overall prediction error. 


Regularization techniques generally work by introducing a controlled amount of bias to reduce variance:

1. **Reducing Variance**: 
   By constraining model parameters, regularization reduces the model's sensitivity to individual data points, lowering variance.

2. **Increasing Bias**: 
   The constraints imposed by regularization can prevent the model from perfectly fitting the training data, potentially increasing bias.

3. **Overall Error**: 
   The aim is to decrease variance more than the increase in bias, reducing overall prediction error.
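
These three effects can be estimated with a small Monte Carlo simulation: repeatedly draw a training set, fit a regularized model, and measure how the predictions at fixed test points vary across draws. The sketch below does this for Ridge using a closed-form solution. The data-generating process, the sample sizes, and the helper `ridge_closed_form` are assumptions chosen to keep the example short.

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    """Minimize (1/(2n)) * ||y - X b||^2 + (lam/2) * ||b||^2 (no intercept)."""
    n, p = X.shape
    return np.linalg.solve(X.T @ X / n + lam * np.eye(p), X.T @ y / n)

rng = np.random.default_rng(0)
n_train, p, n_repeats = 30, 10, 500
true_beta = rng.standard_normal(p)
X_test = rng.standard_normal((200, p))
f_test = X_test @ true_beta  # noiseless targets at the test points

lambdas = [0.0, 0.01, 0.1, 1.0, 10.0]
bias_sq, variance = [], []
for lam in lambdas:
    preds = np.empty((n_repeats, X_test.shape[0]))
    for r in range(n_repeats):
        # Fresh training set for every repetition
        X_train = rng.standard_normal((n_train, p))
        y_train = X_train @ true_beta + rng.standard_normal(n_train)
        preds[r] = X_test @ ridge_closed_form(X_train, y_train, lam)
    mean_pred = preds.mean(axis=0)
    bias_sq.append(np.mean((mean_pred - f_test) ** 2))  # squared bias, averaged over test points
    variance.append(np.mean(preds.var(axis=0)))         # variance, averaged over test points

for lam, b, v in zip(lambdas, bias_sq, variance):
    print(f"lambda={lam:<5} bias^2={b:.3f}  variance={v:.3f}  sum={b + v:.3f}")
```

Comparing the `sum` column across rows shows where the reduction in variance outweighs the added bias for this particular setup.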


<img src="./images/ridge-lasso-elastic.webp" width="800">

### <a id='toc6_1_'></a>[Impact of Ridge Regression (L2)](#toc0_)


Ridge regression impacts bias and variance in the following ways:

1. **Continuous Shrinkage**: 
   - Ridge shrinks all coefficients towards zero, but rarely sets them exactly to zero.
   - This results in a more stable model with lower variance.

2. **Bias Introduction**: 
   - As λ (regularization strength) increases, bias generally increases.
   - The model becomes simpler and may underfit if λ is too large.

3. **Variance Reduction**: 
   - Larger λ values lead to more significant variance reduction.
   - This is particularly beneficial when features are correlated.


🔍 **Example:** In a dataset with many correlated features, Ridge can significantly reduce variance by distributing the impact across all features, rather than relying heavily on a subset.
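
A minimal sketch of this behaviour, assuming two nearly identical synthetic features and a closed-form Ridge solution (the helper `ridge_closed_form` and all numeric values are illustrative):

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    """Minimize (1/(2n)) * ||y - X b||^2 + (lam/2) * ||b||^2; lam = 0 gives OLS."""
    n, p = X.shape
    return np.linalg.solve(X.T @ X / n + lam * np.eye(p), X.T @ y / n)

rng = np.random.default_rng(0)
n = 100
x1 = rng.standard_normal(n)
x2 = x1 + 0.01 * rng.standard_normal(n)  # nearly a copy of x1
X = np.column_stack([x1, x2])
y = x1 + x2 + rng.standard_normal(n)     # true coefficients are (1, 1)

beta_ols = ridge_closed_form(X, y, lam=0.0)
beta_ridge = ridge_closed_form(X, y, lam=0.1)

print("OLS coefficients:  ", beta_ols)
print("Ridge coefficients:", beta_ridge)
print("Coefficient norms (OLS, Ridge):",
      np.linalg.norm(beta_ols), np.linalg.norm(beta_ridge))
```

With near-duplicate features, the data barely constrain the difference between the two coefficients, so OLS estimates of that difference are very noisy. The L2 penalty shrinks exactly that poorly determined direction the most.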


### <a id='toc6_2_'></a>[Impact of Lasso Regression (L1)](#toc0_)


Lasso affects bias and variance differently from Ridge:

1. **Feature Selection**: 
   - Lasso can set some coefficients exactly to zero, effectively performing feature selection.
   - This can lead to simpler models with potentially lower variance.

2. **Sparse Solutions**: 
   - As λ increases, more coefficients are set to zero, increasing model sparsity.
   - This can result in more interpretable models but may increase bias if important features are excluded.

3. **Variance Reduction**: 
   - By selecting a subset of features, Lasso can significantly reduce variance, especially in high-dimensional spaces.

4. **Bias-Variance Balance**: 
   - The feature selection property of Lasso can lead to a different bias-variance trade-off compared to Ridge.
   - It may achieve lower variance but potentially higher bias, especially if the true model is not sparse.


💡 **Pro Tip:** Lasso can be particularly effective when you believe only a subset of features are truly relevant to the prediction task.
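
The contrast between Lasso's sparsity and Ridge's continuous shrinkage is easiest to see in the special case of an orthonormal design ($X^\top X / n = I$), where both solutions have a closed form in terms of the OLS coefficients: Lasso soft-thresholds them, while Ridge rescales them by $1/(1+\lambda)$. The sketch below assumes that special case and uses made-up OLS coefficients.

```python
import numpy as np

def soft_thresholding(x, lam):
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

# Hypothetical OLS coefficients for an orthonormal design (X.T @ X / n = I).
beta_ols = np.array([3.0, -1.5, 0.5, 0.2, -0.05])

lambdas = [0.1, 0.6, 2.0]
lasso_zero_counts = []
ridge_zero_counts = []
for lam in lambdas:
    beta_lasso = soft_thresholding(beta_ols, lam)  # Lasso: shift towards zero, then clip
    beta_ridge = beta_ols / (1.0 + lam)            # Ridge: proportional shrinkage
    lasso_zero_counts.append(int(np.sum(beta_lasso == 0)))
    ridge_zero_counts.append(int(np.sum(beta_ridge == 0)))
    print(f"lambda={lam}: lasso={beta_lasso}, ridge={np.round(beta_ridge, 3)}")

print(lasso_zero_counts)  # → [1, 3, 4]
print(ridge_zero_counts)  # → [0, 0, 0]
```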


### <a id='toc6_3_'></a>[Impact of Elastic Net](#toc0_)


Elastic Net combines the properties of both Ridge and Lasso:

1. **Flexible Trade-off**: 
   - By adjusting the mixing parameter (ρ), Elastic Net can achieve a balance between the bias-variance characteristics of Ridge and Lasso.

2. **Grouped Selection**: 
   - In scenarios with groups of correlated features, Elastic Net can select or reject features as groups.
   - This can lead to models with lower variance than Lasso while maintaining some of the interpretability benefits.

3. **Stability**: 
   - Elastic Net tends to be more stable than Lasso in the presence of highly correlated features.
   - This stability often translates to a more favorable bias-variance trade-off in such scenarios.


### <a id='toc6_4_'></a>[Practical Implications](#toc0_)


Understanding these impacts helps in choosing and tuning regularization methods:

1. **Feature Correlation**: 
   - With highly correlated features, Ridge or Elastic Net may be preferable to Lasso.
   - They handle multicollinearity better, leading to more stable models with lower variance.

2. **High-Dimensional Data**: 
   - In scenarios with many features relative to observations, Lasso or Elastic Net can be beneficial.
   - Their feature selection properties can lead to simpler models with lower variance.

3. **Model Interpretability**: 
   - If model interpretability is crucial, Lasso or Elastic Net with a higher L1 ratio might be preferred.
   - The sparsity they induce can make it easier to identify key predictors.

4. **Tuning Process**: 
   - When tuning regularization parameters, monitor both training and validation performance.
   - The optimal regularization strength is often where validation performance peaks, balancing bias and variance.
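
The tuning process in point 4 can be sketched with a simple hold-out split: fit the model for a range of regularization strengths and record training and validation error for each. The example uses a closed-form Ridge fit on synthetic data; the helper `ridge_closed_form`, the split sizes, and the grid are assumptions.

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    """Minimize (1/(2n)) * ||y - X b||^2 + (lam/2) * ||b||^2 (no intercept)."""
    n, p = X.shape
    return np.linalg.solve(X.T @ X / n + lam * np.eye(p), X.T @ y / n)

rng = np.random.default_rng(0)
n, p = 60, 30
true_beta = np.zeros(p)
true_beta[:5] = [3.0, -2.0, 1.5, 1.0, -1.0]  # only 5 relevant features
X = rng.standard_normal((n, p))
y = X @ true_beta + rng.standard_normal(n)

X_train, y_train = X[:40], y[:40]
X_val, y_val = X[40:], y[40:]

lambdas = np.logspace(-4, 2, 25)
train_mse, val_mse = [], []
for lam in lambdas:
    beta = ridge_closed_form(X_train, y_train, lam)
    train_mse.append(np.mean((X_train @ beta - y_train) ** 2))
    val_mse.append(np.mean((X_val @ beta - y_val) ** 2))

best_lambda = lambdas[int(np.argmin(val_mse))]
print("Best lambda on the validation set:", best_lambda)
```

Training error can only grow as the penalty increases, so it is the validation error that identifies a useful regularization strength. With a validation set this small the selected value is noisy, which is one reason to prefer k-fold cross-validation in practice.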


> **Important Note:** The impact of regularization on bias and variance can vary depending on the specific dataset and problem. Always validate your approach using cross-validation and consider the practical implications of your model choices.


### <a id='toc6_5_'></a>[Visualization of the Impact](#toc0_)


To truly understand the impact of regularization on bias and variance, it's often helpful to visualize:

1. **Validation Curves**: Plot training and validation errors against the regularization strength. (Learning curves, by contrast, plot error against training set size.)
2. **Coefficient Paths**: Show how coefficients change as regularization strength increases.
3. **Prediction Error Decomposition**: Visualize how total error, bias, and variance change with regularization.
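
As a starting point for a coefficient path plot, the sketch below computes a Lasso path with scikit-learn's `lasso_path` and reports how many coefficients are non-zero at each regularization strength. The synthetic data and the grid of strengths are assumptions; the returned `coefs` array (one row per feature, one column per strength) can be passed to any plotting library, typically with a logarithmic x-axis.

```python
import numpy as np
from sklearn.linear_model import lasso_path

rng = np.random.default_rng(0)
n, p = 100, 8
X = rng.standard_normal((n, p))
true_beta = np.array([4.0, -3.0, 2.0, 0.0, 0.0, 0.0, 0.0, 0.0])
y = X @ true_beta + rng.standard_normal(n)

# lasso_path does not fit an intercept, so center the data first.
X = X - X.mean(axis=0)
y = y - y.mean()

alphas_in = np.logspace(-3, 2, 30)
alphas, coefs, _ = lasso_path(X, y, alphas=alphas_in)  # coefs: (n_features, n_alphas)

nonzero_per_alpha = np.sum(coefs != 0, axis=0)
for a, k in zip(alphas, nonzero_per_alpha):
    print(f"alpha={a:9.4f}  non-zero coefficients={k}")
```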


By carefully considering how different regularization techniques affect the bias-variance trade-off, you can make more informed decisions about which method to use and how to tune it for your specific problem. This understanding is key to building models that generalize well to unseen data, striking the right balance between model complexity and predictive power.

## <a id='toc7_'></a>[Summary and Key Takeaways](#toc0_)

As we conclude our exploration of regularization techniques in linear regression, let's recap the main concepts and highlight the key takeaways. This summary will help solidify your understanding and provide a quick reference for applying these techniques in practice.


Regularization is a crucial concept in machine learning that addresses overfitting by adding a penalty term to the loss function. We've explored three main regularization techniques:

1. Ridge Regression (L2)
2. Lasso Regression (L1)
3. Elastic Net (Combination of L1 and L2)


The primary goal of regularization is to create models that generalize well to unseen data by finding an optimal balance between bias and variance.


Remember the general steps for implementing regularization:

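1. Split your data into training and testing sets.
2. Preprocess your data (scaling, handling missing values, etc.), fitting each preprocessing step on the training set only and then applying it to the test set to avoid data leakage.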
3. Choose a regularization technique based on your problem characteristics.
4. Use cross-validation to tune hyperparameters.
5. Train your model on the entire training set using the best hyperparameters.
6. Evaluate on the test set to assess generalization.
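
The steps above can be sketched as a scikit-learn workflow. This is one possible arrangement under assumed settings (synthetic data, an Elastic Net model, an arbitrary parameter grid), with an ordinary least squares pipeline fitted alongside as a baseline. Placing the scaler inside the `Pipeline` means it is refit on the training portion of every cross-validation fold.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet, LinearRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: 50 features, 8 informative.
X, y = make_regression(n_samples=150, n_features=50, n_informative=8,
                       noise=15.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Preprocessing and model in one pipeline; hyperparameters tuned by 5-fold CV.
pipeline = Pipeline([("scaler", StandardScaler()),
                     ("model", ElasticNet(max_iter=10000))])
param_grid = {"model__alpha": np.logspace(-2, 1, 10),
              "model__l1_ratio": [0.2, 0.5, 0.8, 1.0]}
search = GridSearchCV(pipeline, param_grid, cv=5,
                      scoring="neg_mean_squared_error")
search.fit(X_train, y_train)  # refits the best model on the full training set

# Unregularized baseline for comparison.
baseline = Pipeline([("scaler", StandardScaler()),
                     ("model", LinearRegression())]).fit(X_train, y_train)

enet_r2 = search.best_estimator_.score(X_test, y_test)
baseline_r2 = baseline.score(X_test, y_test)
print("Best hyperparameters:", search.best_params_)
print("Elastic Net test R^2:", enet_r2)
print("OLS baseline test R^2:", baseline_r2)
```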


Always compare your regularized models against a baseline (e.g., ordinary least squares) to ensure you're gaining benefits from regularization. Regularization is not a one-size-fits-all solution, so experiment with different techniques and hyperparameters to find the best fit for your data.


Regularization is a powerful tool in the data scientist's toolkit, offering ways to create more robust and generalizable models. By understanding the principles behind these techniques and their impact on model behavior, you can make informed decisions about when and how to apply regularization in your projects.


> **Important Note:** While regularization is highly effective, it's not a silver bullet. Always consider the specific characteristics of your data and the requirements of your problem when choosing and applying regularization techniques.


As you move forward in your machine learning journey, remember that the concepts learned here extend beyond linear regression. Many advanced techniques, including neural networks and other complex models, use similar principles of regularization to improve performance and generalization.


By mastering these regularization techniques, you've gained valuable skills that will serve you well in a wide range of machine learning applications. Keep practicing, experimenting with different datasets, and staying curious about new developments in the field!