# Ridge Regression (L2 Regularization)

## Summary

* **Ridge Regression** is also known as **L2 Regularization** and is used to solve the problem of **overfitting** in linear regression models.
* **Overfitting** occurs when a model fits the training data perfectly (**Low Bias**) but performs poorly on new test data (**High Variance**).
* The algorithm modifies the **Cost Function** by adding a penalty term: **Lambda ($\lambda$)** multiplied by the sum of the **squared slopes ($\sum \theta_j^2$)**.
* There is an **inverse relationship** between Lambda and the Slope: as $\lambda$ increases, the magnitude of the slope ($\theta$) decreases.
* A key characteristic of Ridge Regression is that while coefficients (slopes) are shrunk toward zero, for any finite $\lambda$ they **do not reach exactly zero**.

## Exam Notes

### Relationship Between Lambda and Slope

**Question**: What is the relationship between Lambda ($\lambda$) and the Slope ($\theta$) in Ridge Regression?

**Answer**:  
There is an inverse relationship between them. As the value of **Lambda ($\lambda$) increases**, the magnitude of the **Slope ($\theta$) decreases**. However, the slope does **not become exactly zero** for any finite value of Lambda. This is a critical distinction from Lasso (L1 Regularization), which can shrink coefficients to exactly zero.

## Introduction to Overfitting

In linear regression, the goal is to find the **best fit line**. Consider a scenario with only two data points in the training set. A linear regression model will create a line that passes perfectly through both points, resulting in an error of zero and very high accuracy.

While this appears ideal, it often leads to **overfitting**:

* **Training Data**: The model has high accuracy and **Low Bias**.
* **Test Data**: When new data points are introduced, the error increases significantly because the model is too tightly fitted to the training set. This results in **High Variance**.
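
The two-point scenario can be reproduced with a small sketch. The data points and the helper names `fit_line` and `mse` are hypothetical, chosen only to illustrate the gap between training and test error.

```python
def fit_line(points):
    """Return (intercept, slope) of the line through exactly two points."""
    (x1, y1), (x2, y2) = points
    slope = (y2 - y1) / (x2 - x1)
    intercept = y1 - slope * x1
    return intercept, slope


def mse(points, intercept, slope):
    """Mean squared error of the line on a list of (x, y) points."""
    return sum((intercept + slope * x - y) ** 2 for x, y in points) / len(points)


train = [(1.0, 1.0), (2.0, 3.0)]  # hypothetical training points
test = [(3.0, 3.5), (4.0, 4.0)]   # hypothetical unseen points

intercept, slope = fit_line(train)
train_error = mse(train, intercept, slope)  # the line passes through both points
test_error = mse(test, intercept, slope)    # the same line misses the new points

print(train_error, test_error)
```

On this toy data the training error is 0.0 while the test error is 5.625, which is the low-bias, high-variance pattern described above.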

To resolve this issue and reduce overfitting, **Ridge Regression (L2 Regularization)** is used.

## Ridge Regression (L2 Regularization)

Ridge Regression is a regularization technique for linear regression, with $\lambda$ as its hyperparameter.  
It adds a penalty to the standard cost function so that the model cannot fit the training data too perfectly (i.e., the perfect-fit line no longer has a cost of zero).

### The Cost Function

The standard cost function for Linear Regression (Mean Squared Error) is:

$$
J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)})^2
$$

Ridge Regression adds a penalty term to this equation:

$$
J(\theta) =
\frac{1}{2m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)})^2
+
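\lambda \sum_{j=1}^{n} \theta_j^2
$$

* **$\lambda$ (Lambda)**: A hyperparameter ($\lambda \ge 0$) that determines the strength of the penalty.
* **$\sum_{j=1}^{n} \theta_j^2$**: The sum of the squared slopes (coefficients) $\theta_1, \dots, \theta_n$. The intercept $\theta_0$ is conventionally left out of the penalty.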

The penalty term makes large slopes costly.  
Even if the Mean Squared Error is zero, the total cost stays above zero whenever any slope is non-zero, so the minimum of the cost moves to a different, flatter best-fit line.
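
The penalized cost can be sketched in a few lines of plain Python. The function name `ridge_cost`, the toy data, and the choice to leave the intercept out of the penalty are assumptions made for this illustration.

```python
def ridge_cost(theta0, thetas, X, y, lam):
    """Ridge cost: (1/2m) * sum of squared errors + lam * sum of squared slopes."""
    m = len(y)
    predictions = [theta0 + sum(t * x for t, x in zip(thetas, row)) for row in X]
    squared_error = sum((p - yi) ** 2 for p, yi in zip(predictions, y))
    penalty = lam * sum(t ** 2 for t in thetas)  # intercept theta0 is not penalized (assumption)
    return squared_error / (2 * m) + penalty


# Hypothetical two-point training set; the line y = -1 + 2x passes through both points.
X = [[1.0], [2.0]]
y = [1.0, 3.0]

perfect_fit_plain = ridge_cost(-1.0, [2.0], X, y, lam=0.0)  # perfect fit, no penalty
perfect_fit_ridge = ridge_cost(-1.0, [2.0], X, y, lam=1.0)  # same line, penalized
flatter_line_ridge = ridge_cost(0.5, [1.0], X, y, lam=1.0)  # flatter line, small fit error

print(perfect_fit_plain, perfect_fit_ridge, flatter_line_ridge)
```

With $\lambda = 1$ the perfect-fit line costs 4.0, while the flatter line costs only 1.125 despite its non-zero error, so the penalized minimum no longer sits at the perfect fit.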

## Relationship Between Lambda and Slope

Understanding the relationship between **Lambda ($\lambda$)** and the **Slope ($\theta$)** is crucial for understanding how Ridge Regression works.

![image-2.png](attachment:image-2.png)

1. **When $\lambda = 0$**:
   * The penalty term becomes zero.
   * The equation reverts to the standard linear regression cost function.
   * The global minimum remains unchanged.

2. **When $\lambda$ increases (e.g., $\lambda = 10$)**:
   * The cost function curve changes shape, because the penalty grows with $\theta^2$.
   * Its **global minimum** moves to a new position.
   * At this new minimum, the value of the **slope ($\theta$) is smaller in magnitude**.

3. **Continuous Increase**:
   * As $\lambda$ continues to increase (e.g., to 30), the slope decreases further, shifting closer to zero.
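
The three cases above can be checked numerically. This is a minimal sketch that assumes a single feature with no intercept and minimizes the ridge cost by plain gradient descent; the data, the function name `fit_slope`, the learning rate, and the step count are all hypothetical choices.

```python
def fit_slope(xs, ys, lam, lr=0.01, steps=5000):
    """Gradient descent on J(theta) = (1/2m) * sum((theta*x - y)^2) + lam * theta^2."""
    m = len(xs)
    theta = 0.0
    for _ in range(steps):
        # Derivative of the squared-error term plus derivative of the penalty.
        grad = sum((theta * x - y) * x for x, y in zip(xs, ys)) / m + 2 * lam * theta
        theta -= lr * grad
    return theta


xs = [1.0, 2.0, 3.0]  # hypothetical data lying exactly on y = 2x
ys = [2.0, 4.0, 6.0]

slopes = {lam: fit_slope(xs, ys, lam) for lam in (0, 10, 30)}
for lam, theta in slopes.items():
    print(f"lambda={lam:>2}  slope={theta:.4f}")
```

On this data the fitted slope is roughly 2.0 for $\lambda = 0$, about 0.378 for $\lambda = 10$, and about 0.144 for $\lambda = 30$: smaller each time, yet still above zero.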

> **Key Takeaway**:  
> The slope shrinks toward zero as Lambda increases, but for any finite Lambda it **does not reach exactly zero**.
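
Why the slope shrinks but does not vanish can be seen in a simplified case. Assuming a single feature and no intercept, $h_\theta(x) = \theta x$ (a simplification made here, not part of the general model), the ridge cost is:

$$
J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} (\theta x^{(i)} - y^{(i)})^2 + \lambda \theta^2
$$

Setting its derivative to zero:

$$
\frac{dJ}{d\theta} = \frac{1}{m} \sum_{i=1}^{m} (\theta x^{(i)} - y^{(i)}) \, x^{(i)} + 2 \lambda \theta = 0
$$

Solving for $\theta$ gives:

$$
\theta = \frac{\sum_{i=1}^{m} x^{(i)} y^{(i)}}{\sum_{i=1}^{m} (x^{(i)})^2 + 2 m \lambda}
$$

Increasing $\lambda$ only enlarges the denominator, so $\theta$ moves toward zero, but for any finite $\lambda$ it equals zero only if the numerator is already zero.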

## Mathematical Intuition with Multiple Features

In a **Multiple Linear Regression** scenario with features $x_1, x_2, x_3$, the equation is:

$$
h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_3
$$

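Each coefficient ($\theta_j$) represents the change in $y$ for a unit change in $x_j$ while the other features are held fixed. A coefficient is not a correlation, and coefficient sizes are only comparable when the features are on the same scale.

* **High Coefficient**: A value like 0.52 means a unit change in $x_1$ has a comparatively large impact on $y$.
* **Low Coefficient**: A value like 0.24 means a unit change in that feature has a smaller impact on $y$.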

### Effect of Ridge Regression

When Ridge Regression is applied:
* Coefficients for all features are **shrunk**.
* A coefficient might reduce from 0.52 to 0.40, or from 0.24 to 0.14.
* Even for features with little influence on $y$ (e.g., $x_3$), the coefficient is reduced but remains non-zero.

This shrinkage reduces the variance of the model, at the price of a small increase in bias.  
With smaller coefficients, the predictions change less when the training data changes slightly, which reduces overfitting while retaining all features in the model.
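
The shrinkage can be observed on a small multi-feature example. This sketch assumes three uncorrelated, centered features and fits the ridge cost by batch gradient descent; the data, the function name `fit_ridge`, the value $\lambda = 0.15$, and the unpenalized intercept are hypothetical choices for illustration.

```python
def fit_ridge(X, y, lam, lr=0.1, steps=1000):
    """Batch gradient descent on the ridge cost; returns (intercept, slopes)."""
    m, n = len(X), len(X[0])
    theta0, thetas = 0.0, [0.0] * n
    for _ in range(steps):
        errors = [theta0 + sum(t * x for t, x in zip(thetas, row)) - yi
                  for row, yi in zip(X, y)]
        grad0 = sum(errors) / m  # intercept is not penalized (assumption)
        grads = [sum(e * row[j] for e, row in zip(errors, X)) / m + 2 * lam * thetas[j]
                 for j in range(n)]
        theta0 -= lr * grad0
        thetas = [t - lr * g for t, g in zip(thetas, grads)]
    return theta0, thetas


# Hypothetical data: three uncorrelated, centered features and targets built
# from slopes 0.52, 0.24 and 0.05 (x3 has the weakest effect).
X = [[1, 1, 1], [1, -1, -1], [-1, 1, -1], [-1, -1, 1]]
y = [1.0 + 0.52 * x1 + 0.24 * x2 + 0.05 * x3 for x1, x2, x3 in X]

_, plain = fit_ridge(X, y, lam=0.0)   # ordinary linear regression
_, ridge = fit_ridge(X, y, lam=0.15)  # ridge regression

print([round(t, 3) for t in plain])
print([round(t, 3) for t in ridge])
```

On this data the plain fit recovers roughly 0.52, 0.24, and 0.05, while the ridge fit gives roughly 0.40, 0.185, and 0.038. Every coefficient shrinks, including the weak one for $x_3$, and none becomes zero. With correlated features the individual coefficients need not shrink by the same factor.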
