# Ridge and Lasso Theory

## Overfitting, Underfitting, and Generalized Model

### Overfitting
**Overfitting** happens when the model fits the training data too closely, capturing noise along with the underlying pattern. It performs well on the training data (low bias) but poorly on unseen test data (high variance) because it is overly complex and does not generalize.

In the graph below, we can see that the model fits well on training data (blue points) but does not fit well on test data (red points), as they are far from our predicted points (green points).

![Overfitting Example](images/Overfitting_Example.png)  
*Figure 1: Example of Overfitting, where the model fits the training data too closely but fails on test data.*

---

### Underfitting
**Underfitting** occurs when the model is too simple to capture the data’s patterns. It performs poorly on both the training data and test data. This typically happens when the model has high bias, making it unable to learn from the data.

In the graph below, we can see that the model fits neither the training data (blue points) nor the test data (red points) well.

![Underfitting Example](images/Underfitting_Example.png)  
*Figure 2: Example of Underfitting, where the model fails to capture the underlying patterns in the data.*

---

### Generalized Model
A **well-generalized model** finds the balance between overfitting and underfitting, performing well on both training and test data by capturing the core patterns of the data without overcomplicating the model.

In the graph below, we can see that the model fits well on both training data (blue points) and test data (red points).

![Generalized Model Example](images/Generalized_Example.png)  
*Figure 3: Example of a Generalized Model, demonstrating good performance on both training and test data.*
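
The three cases above can be compared numerically by measuring the mean squared error on training and test data. The following is a minimal sketch on made-up data, not the data behind the figures: the dataset, the three toy models (`underfit`, `overfit`, `generalized`), and the helper `mse` are all assumptions chosen for illustration. The "overfit" model here simply memorizes the training points, which stands in for an overly complex curve.

```python
# Toy data (hypothetical): the true relationship is y = 2x, and the
# training labels carry alternating +1 / -1 noise.
train_x = [0, 1, 2, 3, 4, 5]
train_y = [1, 1, 5, 5, 9, 9]
test_x = [0.4, 1.4, 2.4, 3.4, 4.4]
test_y = [2 * x for x in test_x]  # noise-free targets, for clarity


def mse(model, xs, ys):
    """Mean squared error of a prediction function on a dataset."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


mean_x = sum(train_x) / len(train_x)
mean_y = sum(train_y) / len(train_y)


def underfit(x):
    """Too simple: ignores x and always predicts the mean training label."""
    return mean_y


def overfit(x):
    """Too flexible: memorizes the training set and returns the label of
    the nearest training point (zero training error by construction)."""
    nearest = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    return train_y[nearest]


# Ordinary least-squares straight line fitted to the training data.
slope = (
    sum((x - mean_x) * (y - mean_y) for x, y in zip(train_x, train_y))
    / sum((x - mean_x) ** 2 for x in train_x)
)
intercept = mean_y - slope * mean_x


def generalized(x):
    """Balanced: a straight line that captures the trend but not the noise."""
    return intercept + slope * x


results = {}
for name, model in [("underfit", underfit), ("overfit", overfit), ("generalized", generalized)]:
    results[name] = (mse(model, train_x, train_y), mse(model, test_x, test_y))
    print(f"{name:12s} train MSE = {results[name][0]:.3f}  test MSE = {results[name][1]:.3f}")
```

On this toy data, the memorizing model reaches zero training error but a clearly worse test error than the straight line, while the constant model does poorly on both sets.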

---

### Conclusion
Achieving a balance between overfitting and underfitting is crucial for building robust predictive models. Techniques like Ridge and Lasso regression can help in this regard by adding regularization to the model training process, effectively controlling complexity and enhancing generalization.

---
---

## Ridge And Lasso

Let us take overfitting into consideration. In an overfitting model, the model fits the training data too well, often resulting in the cost function being very close to zero. Mathematically, we can describe the cost function as:  
> $J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} (h_\theta(x_i) - y_i)^2$


When the cost function is 0, the difference between every predicted point and the corresponding actual data point is zero, i.e., $(h_\theta(x_i) - y_i)^2 = 0$ for every $i$. The model then fits the training data perfectly, which is usually not desirable: on noisy data it means the model has also fit the noise, so it does not generalize well to unseen data.
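
As a small illustration of the cost function above, here is a sketch in Python. It assumes a straight-line hypothesis $h_\theta(x) = \theta_0 + \theta_1 x$, and the data points are made up for the example.

```python
def cost(theta0, theta1, xs, ys):
    """Squared-error cost J(theta0, theta1) = 1/(2m) * sum((h(x_i) - y_i)^2),
    assuming the straight-line hypothesis h(x) = theta0 + theta1 * x."""
    m = len(xs)
    return sum((theta0 + theta1 * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)


# Hypothetical training data that lies exactly on the line y = 2x.
xs = [1, 2, 3]
ys = [2, 4, 6]

perfect = cost(0, 2, xs, ys)    # the line passes through every point
imperfect = cost(0, 1, xs, ys)  # a flatter line misses every point

print(perfect, imperfect)
```

The first call returns a cost of exactly 0 because every prediction matches its target; the second returns a positive cost.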

Below is an example of an overfitting model, where the cost function is J(θ) = 0:  
![Overfitting Example](images/Overfitting_Example2.png)  
*Figure 4: Example of Overfitting, where the model fits the training data too closely.*

Since an overfit model is not ideal, and a training cost of exactly 0 usually signals that the model has memorized the training data, we use **Ridge and Lasso regularization**.

---

### Ridge (L2 Regularization)

**Ridge regularization** adds an extra term to the cost function that discourages the model from fitting the training data too closely. In this method, we add $\lambda(\text{slope})^2$ to the cost function, where $\lambda$ is a regularization parameter. The new cost function becomes:  
> $J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} (h_\theta(x_i) - y_i)^2 + \lambda(\text{slope})^2$

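Here, $\lambda$ controls the strength of the penalty: the larger $\lambda$ is, the more heavily steep slopes are penalized and the flatter the best-fit line becomes. With $\lambda = 0$, the cost reduces to the original cost function.

In the overfitting case, where the original cost function is 0, the added term $\lambda(\text{slope})^2$ makes the total cost nonzero (as long as the slope itself is not 0).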

Let us assume our slope value is 2 and λ value is 1.  
![Overfitting Example](images/Overfitting_Example3.png)  
*Figure 5: Example of Overfitting, where the model fits the training data too closely with a best-fit line of slope 2.*

In this case our cost function is:  
> $J(\theta) = 0 + 1 \cdot (2)^2 = 4$

Now, the algorithm tries to minimize the cost further by reducing the slope, leading to a less aggressive fit. This introduces a small difference between the predicted and actual data points, resulting in a generalized model.

Suppose the slope is reduced to 1.5. The penalty term drops to $1 \cdot (1.5)^2 = 2.25$, and the cost function becomes:

> $J(\theta) = (\text{small value}) + 1 \cdot (1.5)^2 = (\text{small value}) + 2.25 \approx 3$

![Overfitting Example](images/Overfitting_Example4.png)  
*Figure 6: A new best-fit line with a reduced slope, obtained with the help of regularization.*

We continue adjusting the slope iteratively until the cost function reaches its minimum. This process ultimately results in a more generalized model that balances accuracy on the training data against model complexity.
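
The iterative adjustment described above can be sketched with plain gradient descent. This is an illustrative sketch under stated assumptions, not the exact procedure behind the figures: the data points are made up so that a slope of 2 fits them perfectly, the intercept is fixed at 0 for simplicity, and the learning rate and iteration count are arbitrary choices.

```python
# Hypothetical training data lying exactly on y = 2x, so a slope of 2
# gives zero squared error (the overfitting situation described above).
xs = [1, 2, 3]
ys = [2, 4, 6]
lam = 1.0      # regularization parameter (lambda)
m = len(xs)


def ridge_cost(slope):
    """J = 1/(2m) * sum((slope * x_i - y_i)^2) + lambda * slope^2."""
    fit = sum((slope * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)
    return fit + lam * slope ** 2


def ridge_gradient(slope):
    """Derivative of ridge_cost with respect to the slope."""
    fit_grad = sum((slope * x - y) * x for x, y in zip(xs, ys)) / m
    return fit_grad + 2 * lam * slope


slope = 2.0            # start from the perfectly fitting line
learning_rate = 0.1    # assumed step size
for _ in range(200):
    slope -= learning_rate * ridge_gradient(slope)

print("cost at slope 2.0:", ridge_cost(2.0))
print("cost at slope 1.5:", ridge_cost(1.5))
print("slope after gradient descent:", slope)
print("cost at that slope:", ridge_cost(slope))
```

On this data the cost at slope 2 is 4, as in the worked example, and gradient descent settles on a smaller slope with a lower total cost: a small amount of fitting error is traded for a smaller penalty.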

---

### Lasso (L1 Regularization)

**Lasso regularization** adds an extra term to the cost function that prevents the model from overfitting and can also reduce the number of features. Here, we add $\lambda|\text{slope}|$ to the cost function, where $\lambda$ is a regularization parameter that controls how strongly steep slopes are penalized: the larger $\lambda$ is, the flatter the best-fit line becomes.  
The new cost function can be written as:
> $J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} (h_\theta(x_i) - y_i)^2 + \lambda|\text{slope}|$

In lasso regularization, overfitting is discouraged by adding a term to the cost function that keeps the cost J(θ) from reaching 0 whenever the slope is nonzero. It also simplifies the model by reducing the impact of less important features: many of their coefficients get very close to zero, and some become exactly zero, allowing the model to focus only on the features that truly matter.  
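
To see how a coefficient can become exactly zero, consider a single feature with the intercept fixed at 0. In that special case the minimizers of the lasso and ridge costs have well-known closed forms, used in the sketch below. The data, the function names `lasso_slope` and `ridge_slope`, and the choice of $\lambda = 1$ are assumptions made for illustration.

```python
def lasso_slope(xs, ys, lam):
    """Slope minimizing 1/(2m) * sum((s*x - y)^2) + lam * |s|.
    The closed form is the soft-thresholding rule for one feature."""
    m = len(xs)
    rho = sum(x * y for x, y in zip(xs, ys)) / m   # feature/target correlation term
    z = sum(x * x for x in xs) / m                 # feature scale term
    shrunk = max(abs(rho) - lam, 0.0)              # shrink toward zero, clip at zero
    return (shrunk if rho >= 0 else -shrunk) / z


def ridge_slope(xs, ys, lam):
    """Slope minimizing 1/(2m) * sum((s*x - y)^2) + lam * s^2."""
    m = len(xs)
    rho = sum(x * y for x, y in zip(xs, ys)) / m
    z = sum(x * x for x in xs) / m
    return rho / (z + 2 * lam)


xs = [1, 2, 3]
strong_ys = [2, 4, 6]        # hypothetical target strongly related to x
weak_ys = [0.1, 0.2, 0.3]    # hypothetical target only weakly related to x
lam = 1.0

lasso_strong = lasso_slope(xs, strong_ys, lam)
lasso_weak = lasso_slope(xs, weak_ys, lam)
ridge_strong = ridge_slope(xs, strong_ys, lam)
ridge_weak = ridge_slope(xs, weak_ys, lam)

print("lasso:", lasso_strong, lasso_weak)
print("ridge:", ridge_strong, ridge_weak)
```

For the weakly related target, lasso sets the slope to exactly 0, while ridge only shrinks it to a small nonzero value. This is the behavior that lets lasso act as a feature selector.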

---

### Conclusion

Both **Ridge (L2)** and **Lasso (L1)** regularization are powerful techniques for preventing overfitting by adding penalty terms to the cost function. Ridge regularization works by penalizing large coefficients, which reduces the complexity of the model without eliminating any features. On the other hand, Lasso not only penalizes large coefficients but also performs feature selection by shrinking some coefficients to zero, effectively removing less important features.

Together, these regularization methods help create more generalized models that perform better on unseen data by finding a balance between underfitting and overfitting. The choice between Ridge and Lasso depends on the specific needs of the model: whether we only want to shrink the coefficients to reduce complexity, or also want to perform feature selection. Often, a combination of both methods, called Elastic Net, is used to get the best of both worlds.
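
For reference, an Elastic Net cost function can be sketched in the same notation by adding both penalty terms. The symbols $\lambda_1$ and $\lambda_2$ are assumed names for the two regularization parameters; libraries often parameterize the mix differently.
> $J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} (h_\theta(x_i) - y_i)^2 + \lambda_1|\text{slope}| + \lambda_2(\text{slope})^2$

Setting $\lambda_2 = 0$ recovers the Lasso cost, and setting $\lambda_1 = 0$ recovers the Ridge cost.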
