In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

#### Here we will discuss Ridge and Lasso Regression.

So far, we haven't used any specific techniques to evaluate the models we build. We will discuss those as well. However, something that comes up during evaluation is the concept of overfitting and underfitting.

**Overfitting:** This occurs when your model accurately predicts the training data but fails to achieve a similar level of accuracy on unseen data.

**Underfitting:** This occurs when your model doesn't perform well on the training data.

Underfitting can occur due to a small amount of data, an algorithm that is too simple for the problem, or a learning rate that is too high or too low. It is usually fairly simple to address.

The main problem arises when your model overfits. This happens when the model parameters reproduce the training points almost exactly, noise included (e.g., a curve that bends through every single point when the data really follows a simple exponential shape). When new, unseen data is given to the model, which might deviate from the curve found via training, the model predicts these data points poorly.
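
To make the overfitting idea concrete, here is a minimal sketch on made-up data (a noisy quadratic, assumed purely for illustration). It fits polynomials of increasing degree with `np.polyfit` and compares the error on the training half with the error on the held-out half. The data, the degrees, and the helper `mse` are assumptions and not part of the original notebook.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: a noisy quadratic, split into interleaved train/test halves.
x = np.linspace(-1, 1, 40)
y = 2.0 * x**2 + rng.normal(scale=0.2, size=x.shape)
x_train, y_train = x[::2], y[::2]
x_test, y_test = x[1::2], y[1::2]

def mse(coeffs, xs, ys):
    """Mean squared error of the polynomial `coeffs` on the points (xs, ys)."""
    return float(np.mean((np.polyval(coeffs, xs) - ys) ** 2))

results = {}
for degree in (1, 2, 9):
    coeffs = np.polyfit(x_train, y_train, degree)  # least-squares polynomial fit
    results[degree] = (mse(coeffs, x_train, y_train), mse(coeffs, x_test, y_test))
    print(f"degree={degree}  train MSE={results[degree][0]:.4f}  test MSE={results[degree][1]:.4f}")
```

The training error can only go down as the degree grows, because every higher-degree polynomial contains the lower-degree ones. The held-out error usually tells a different story: degree 1 is poor on both halves (underfitting), while a high degree tends to look great on the training half and worse on the held-out half (overfitting).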

To determine if your model is overfitting or underfitting, we mainly use two terms: `Bias` and `Variance`.

**Bias** is simply the measure of the systematic error our model makes while predicting, i.e. how far its predictions are from the target values on average.

- Low bias means the error between our prediction and the target value is low.
- High bias means this error is high.

Obviously, a model with low bias is favorable.

**Variance** specifies how much our model's predictions change when a different portion of the training dataset (or any dataset) is used to fit it.

- Low variance means there is a small variation in the predictions for different portions of datasets.
- High variance means there is a large variation.

Ideally, we want our model to have low variance.

Thus, an ideal model will have low bias and low variance (though this is not always possible, and we will soon talk about the bias-variance tradeoff).

A model with low bias but high variance is overfitting, and a model with high bias and low variance is underfitting. (If both are high, it is chaos.)

---


**Ridge and Lasso Regression** help solve the problem of overfitting in Linear Regression by introducing a regularization parameter. (Additionally, Lasso also helps in reducing the dimensionality of our data.)

Starting with **Ridge Regression**, also called **L2 Regularization**.

Regularization: a set of methods used for reducing overfitting in ML models.

In `Ridge Regression`, while calculating the Mean Squared Error (MSE), we add a new term:

$$ \text{MSE}_{\text{ridge}} = \frac{1}{2m} \sum_{i=1}^{m} (\hat{y}_i - y_i)^2 + \frac{\lambda}{2m} \sum_{j=1}^{p} w_j^2 $$

where:
- $ m $ is the number of samples,
- $ y_i $ are the observed values,
- $ \hat{y}_i $ are the predicted values,
- $ \lambda $ is the regularization parameter,
- $ w_j $ are the coefficients (slopes),
- $ p $ is the number of features or coefficients.

The additional term $ \frac{\lambda}{2m} \sum_{j=1}^{p} w_j^2 $ helps to penalize large coefficients, thereby reducing overfitting. (The penalty is scaled by $ m $ so that it matches the code implementation further down.)

Adding this penalty term makes the predictions slightly less accurate on the training data (a little more bias) but results in lower variance overall. This means that although it reduces the accuracy on the training data, it compensates by improving the model's performance on new, unseen data.

In simpler words, if our model creates a line that fits our data perfectly, introducing Ridge's term will result in a line that isn't perfect. This increases the MSE from 0 to something higher, but it helps generalize the model, leading to better predictions on new, unseen data.
(We will visualize this below.)

Hence solving the problem of overfitting.
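
A small sketch of this effect, under a few assumptions: the data is synthetic, the helper name `ridge_closed_form` is hypothetical, and instead of gradient descent it uses the closed-form solution of the ridge cost (with an unpenalized intercept), which is a different technique from the one implemented later in this notebook. It only aims to show how the weights and the training error react as $ \lambda $ grows.

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    """Minimise (1/2m)*sum((Xw + b - y)^2) + (lam/2m)*sum(w^2); b is not penalised."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean            # centring removes the intercept
    n = X.shape[1]
    w = np.linalg.solve(Xc.T @ Xc + lam * np.eye(n), Xc.T @ yc)
    b = y_mean - x_mean @ w
    return w, b

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 5))                   # hypothetical toy features
y = X @ np.array([4.0, -3.0, 2.0, 0.0, 0.0]) + rng.normal(scale=1.0, size=30)

norms, train_mses = [], []
for lam in (0.0, 1.0, 10.0, 100.0):
    w, b = ridge_closed_form(X, y, lam)
    norms.append(float(np.linalg.norm(w)))
    train_mses.append(float(np.mean((X @ w + b - y) ** 2)))
    print(f"lambda={lam:6.1f}  ||w||={norms[-1]:.3f}  train MSE={train_mses[-1]:.3f}")
```

As $ \lambda $ increases, the weight vector shrinks and the training MSE rises. Whether the error on unseen data improves depends on the data and on the value of $ \lambda $, so in practice $ \lambda $ is chosen with a validation set.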

---


**Lasso Regression**, also called **L1 Regularization**

Similar to Ridge Regression, `Lasso Regression` also introduces a term to the MSE:

$$ \text{MSE}_{\text{Lasso}} = \frac{1}{2m} \sum_{i=1}^{m} \left( \hat{y}_i - y_i \right)^2 + \frac{\lambda}{m} \sum_{j=1}^{p} \lvert w_j \rvert$$

Where everything means the same as in Ridge Regression, only here we take the sum of the absolute values of the slopes instead of the sum of their squares.

In addition to rectifying the problem of overfitting in the same way as Ridge Regression, Lasso Regression also helps in reducing the dimensionality of our data. This means that it can effectively identify and eliminate irrelevant features.

---

###### Now, how does this eliminate the irrelevant features? It is more of a mathematical reason why this happens, and I will work through and explain it on pen and paper.


### Ridge Regression:-

The derivative of the regularization term $ \frac{\lambda}{2m} \sum_{j=1}^{p} w_j^2 $ with respect to $ w_i $ in Ridge regression is computed as follows:

$$ \frac{\partial}{\partial w_i} \left( \frac{\lambda}{2m} \sum_{j=1}^{p} w_j^2 \right) = \frac{2 \lambda}{2m} w_i = \frac{\lambda}{m} w_i $$

Therefore, the derivative of the regularization term $ \frac{\lambda}{2m} \sum_{j=1}^{p} w_j^2 $ with respect to $ w_i $ is $ \frac{\lambda}{m} w_i $. This derivative is added to the gradient in Ridge regression to adjust the gradient descent update step for the weight $ w_i $, helping to regularize the model and control overfitting. (Nothing changes for $ b $, since the penalty does not contain $ b $.)
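
One way to read this extra gradient term is as "weight decay": each update first shrinks $ w_i $ by a constant factor and then takes the usual data-driven step. The sketch below checks that the two ways of writing the update give the same result. The random data and the values of `lam` and `alpha` (the learning rate) are arbitrary assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 50, 3
X = rng.normal(size=(m, n))      # hypothetical inputs
y = rng.normal(size=m)           # hypothetical targets
w = rng.normal(size=n)           # current weights
b, lam, alpha = 0.5, 2.0, 0.1    # bias, regularization parameter, learning rate

err = X @ w + b - y
data_grad = X.T @ err / m        # gradient of the plain MSE part

# Form 1: step along the full ridge gradient (data term + (lam/m) * w).
w_step = w - alpha * (data_grad + (lam / m) * w)

# Form 2: shrink the weights first, then take the plain MSE step.
w_decay = (1 - alpha * lam / m) * w - alpha * data_grad

print(np.allclose(w_step, w_decay))
```

The shrink factor $ 1 - \alpha \lambda / m $ is the same for every weight, so large weights lose more in absolute terms than small ones, but no weight is pushed all the way to zero by this factor alone.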

### Lasso Regression:-

The derivative of the regularization term $ \frac{\lambda}{m} \sum_{k=1}^{p} |w_k| $ with respect to $ w_j $ in Lasso regression is computed as follows:

$$ \frac{\partial}{\partial w_j} \left( \frac{\lambda}{m} \sum_{k=1}^{p} |w_k| \right) = \frac{\lambda}{m} \cdot \text{sign}(w_j) $$

where $ \text{sign}(w_j) $ is the sign function, defined as:

$$ \text{sign}(w_j) = \begin{cases}
1 & \text{if } w_j > 0 \\
-1 & \text{if } w_j < 0 \\
0 & \text{if } w_j = 0
\end{cases} $$

Therefore, the derivative of the regularization term $ \frac{\lambda}{m} \sum_{k=1}^{p} |w_k| $ with respect to $ w_j $ is $ \frac{\lambda}{m} \cdot \text{sign}(w_j) $. This derivative is added to the gradient in Lasso regression to adjust the gradient descent update step for the weight $ w_j $. Unlike the Ridge term, which is proportional to $ w_j $ and therefore fades away as the weight gets small, this push has the same size no matter how small $ w_j $ is. That is what promotes sparsity: it can drive some weights $ w_j $ all the way to zero, which effectively eliminates irrelevant features.

##### Note: The $ |x| $ function isn't differentiable at $ x = 0 $, but for the purpose of machine learning algorithms, we use its subderivative (at zero, any value in $ [-1, 1] $ is valid, and $ \text{sign} $ picks $ 0 $). I just learned about this concept and will be exploring it further.
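
A minimal sketch of the sparsity effect, with several assumptions stated up front: the data is synthetic, there is no intercept, and next to the plain subgradient update described above it also runs a proximal-gradient variant (soft-thresholding, often called ISTA). The proximal variant is not what this notebook implements; it is included because it is a common way to get weights that are exactly zero. The helper `soft_threshold` is a hypothetical name.

```python
import numpy as np

def soft_threshold(w, t):
    """Shrink every entry of w toward zero by t; entries with |w| <= t become exactly 0."""
    return np.sign(w) * np.maximum(np.abs(w) - t, 0.0)

rng = np.random.default_rng(0)
m, n = 200, 5
X = rng.normal(size=(m, n))
true_w = np.array([3.0, -2.0, 0.0, 0.0, 0.0])   # the last three features are irrelevant
y = X @ true_w + rng.normal(scale=0.1, size=m)

lam, alpha, iters = 40.0, 0.1, 1000

w_sub = np.zeros(n)    # plain subgradient descent, using sign(w) as in the text
w_prox = np.zeros(n)   # proximal variant with soft-thresholding
for _ in range(iters):
    grad_sub = X.T @ (X @ w_sub - y) / m
    w_sub = w_sub - alpha * (grad_sub + (lam / m) * np.sign(w_sub))

    grad_prox = X.T @ (X @ w_prox - y) / m
    w_prox = soft_threshold(w_prox - alpha * grad_prox, alpha * lam / m)

print("subgradient:", np.round(w_sub, 4))
print("proximal:   ", np.round(w_prox, 4))
```

With the subgradient update, the irrelevant weights typically end up hovering very close to zero rather than landing exactly on it, because each step overshoots a little. The soft-thresholding step clips them to exactly zero, which is why it is often preferred when the goal is feature selection.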


In [2]:
#Code implementation of Ridge

def compute_cost_ridge(X,Y,W,b,L):
    """
    This function computes the ridge-regularized cost: the
    Mean Squared Error (MSE) on the training data plus the L2 penalty.

    Args:
        X (ndarray): Input values
        Y (ndarray): Actual values or target values
        W (ndarray): Weights for the input parameters
        b (scalar) : Intercept or bias term
        L (scalar) : Regularization parameter

    Returns:
        total_cost (float): The MSE plus the L2 penalty
    """

    
    m = X.shape[0]
    cost = 0.0
    yhat = np.dot(X,W) + b

    sum_w = np.sum(W**2) #This computes the square of the weights and sums them.

    for i in range(m):
        cost += (yhat[i] - Y[i]) ** 2 

    total_cost = (cost + L*sum_w) / (2 * m)

    return total_cost


def compute_gradient_ridge(X,Y,W,b,L):
    """
    This function computes the gradient of the ridge cost function
    for a given set of W and b values.

    Args:
        X (ndarray) : Training input values
        Y (ndarray) : Target values (output values for the input)
        W (ndarray) : Slopes or weights for the input parameters
        b (scalar) : Intercept or bias parameter
        L (scalar) : Regularization parameter

    Returns:
        dj_dw (ndarray) : Gradient of the cost with respect to W
        dj_db (scalar) : Gradient of the cost with respect to b
    """
        
    m,n = X.shape
    dj_dw = np.zeros(n,)
    dj_db = 0.0
    for i in range(m):
        err = (np.dot(X[i],W) + b) - Y[i]
        for j in range(n):
            dj_dw[j] += err *  X[i,j] 
        dj_db += err 

    dj_dw = dj_dw/m + (L/m)*W   # This adds the regularization term.
    dj_db /= m

    return dj_dw,dj_db


In [3]:
def compute_cost_lasso(X, Y, W, b, L):
    """
    This function computes the lasso-regularized cost: the
    Mean Squared Error (MSE) on the training data plus the L1 penalty.

    Args:
        X (ndarray): Input values
        Y (ndarray): Actual values or target values
        W (ndarray): Weights for the input parameters
        b (scalar): Intercept or bias term
        L (scalar): Regularization parameter

    Returns:
        total_cost (float): The MSE plus the L1 penalty
    """
    
    m = X.shape[0]
    cost = 0.0
    yhat = np.dot(X, W) + b

    sum_w = np.sum(np.abs(W))  # This computes the absolute value of the weights and sums them.

    for i in range(m):
        cost += (yhat[i] - Y[i]) ** 2 

    total_cost = (cost / (2 * m)) + (L / m) * sum_w

    return total_cost


def compute_gradient_lasso(X, Y, W, b, L):
    """
    This function computes the Gradient of the cost function
    for a given set of W and b values.

    Args:
        X (ndarray): Training input values
        Y (ndarray): Target values or output values for the input    
        W (ndarray): Weights for the input parameters
        b (scalar): Intercept or bias parameter
        L (scalar): Regularization parameter

    Returns:
        dj_dw (ndarray): Gradient when partially differentiated with respect to W
        dj_db (scalar): Gradient when partially differentiated with respect to b
    """
        
    m, n = X.shape
    dj_dw = np.zeros(n)
    dj_db = 0.0
    
    for i in range(m):
        err = (np.dot(X[i], W) + b) - Y[i]
        for j in range(n):
            dj_dw[j] += err * X[i, j]
        dj_db += err 

    dj_dw = (dj_dw / m) + (L / m) * np.sign(W)  # This adds the regularization term.
    dj_db /= m

    return dj_dw, dj_db
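
One way to gain confidence in a hand-written gradient is to compare it with a finite-difference estimate of the cost. The sketch below does this for ridge. It uses vectorized re-implementations that follow the same formulas as `compute_cost_ridge` and `compute_gradient_ridge`; the names `ridge_cost`, `ridge_gradient`, and `numerical_gradient`, as well as the random test data, are assumptions for illustration.

```python
import numpy as np

def ridge_cost(X, Y, W, b, L):
    """Vectorized ridge cost: (sum of squared errors + L * sum(W^2)) / (2m)."""
    m = X.shape[0]
    err = X @ W + b - Y
    return (err @ err + L * np.sum(W**2)) / (2 * m)

def ridge_gradient(X, Y, W, b, L):
    """Vectorized ridge gradient with respect to W and b."""
    m = X.shape[0]
    err = X @ W + b - Y
    return X.T @ err / m + (L / m) * W, float(np.mean(err))

def numerical_gradient(X, Y, W, b, L, eps=1e-6):
    """Central-difference estimate of d(cost)/dW, one coordinate at a time."""
    grad = np.zeros_like(W)
    for j in range(W.size):
        step = np.zeros_like(W)
        step[j] = eps
        grad[j] = (ridge_cost(X, Y, W + step, b, L)
                   - ridge_cost(X, Y, W - step, b, L)) / (2 * eps)
    return grad

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 4))
Y = rng.normal(size=20)
W = rng.normal(size=4)
b, L = 0.3, 1.5

dj_dw, dj_db = ridge_gradient(X, Y, W, b, L)
print(np.allclose(dj_dw, numerical_gradient(X, Y, W, b, L), atol=1e-5))
```

The same check does not transfer cleanly to Lasso at weights that are exactly zero, since the absolute value has no single derivative there.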

#### Now that we have understood what these mean and how to implement them, there is something called Elastic Net Regression, which combines both Ridge and Lasso regularization.

And I mean literally; this is the MSE in Elastic Net Regression:

$$ \text{MSE}_{\text{Elastic Net}} = \frac{1}{2m} \sum_{i=1}^{m} \left( \hat{y}_i - y_i \right)^2 + \frac{\lambda_1}{m} \sum_{j=1}^{n} \lvert w_j \rvert + \frac{\lambda_2}{2m} \sum_{j=1}^{n} w_j^2 $$

Here:
- $ m $ is the number of samples.
- $ n $ is the number of features.
- $ y_i $ represents the actual output for the $ i $th sample.
- $ \hat{y}_i $ represents the predicted output for the $ i $th sample.
- $ \lambda_1 $ and $ \lambda_2 $ are regularization parameters controlling the strengths of Lasso (L1) and Ridge (L2) regularization, respectively.

This formulation combines Lasso and Ridge penalties to achieve both feature selection and regularization in the Elastic Net Regression model.



#### Derivative of Elastic Net Regression MSE

The derivative of the Elastic Net Regression MSE with respect to $ w_j $ is:

$$ \frac{\partial \text{MSE}_{\text{Elastic Net}}}{\partial w_j} = \frac{1}{m} \sum_{i=1}^{m} \left( \hat{y}_i - y_i \right) x_{ij} + \frac{\lambda_2}{m} w_j + \frac{\lambda_1}{m} \cdot \text{sign}(w_j) $$

where:
- $ m $ is the number of samples.
- $ y_i $ represents the actual output for the $ i $ th sample.
- $ \hat{y}_i $ represents the predicted output for the $ i $ th sample.
- $ x_{ij} $ is the $ j $ th feature value of the $ i $ th sample.
- $ \lambda_1 $ and $ \lambda_2 $ are regularization parameters controlling the strengths of Lasso (L1) and Ridge (L2) regularization, respectively.
- $ \text{sign}(w_j) $ is the sign function defined as:
  $$ \text{sign}(w_j) = \begin{cases}
  1 & \text{if } w_j > 0 \\
  -1 & \text{if } w_j < 0 \\
  0 & \text{if } w_j = 0
  \end{cases} $$

This derivative helps in computing the gradient during the optimization process of Elastic Net Regression, adjusting the weights $ w_j $ accordingly to minimize the MSE while considering both Lasso and Ridge regularization.


In [None]:
def compute_cost_elastic_net(X, Y, W, b, L1, L2):
    """
    This function computes the Elastic Net cost: the Mean Squared
    Error (MSE) on the training data plus the L1 and L2 penalties.

    Args:
        X (ndarray): Input values
        Y (ndarray): Actual values or target values
        W (ndarray): Weights for the input parameters
        b (scalar): Intercept or bias term
        L1 (scalar): L1 Regularization parameter
        L2 (scalar): L2 Regularization parameter

    Returns:
        total_cost (float): The MSE plus the L1 and L2 penalties
    """
    
    m = X.shape[0]
    cost = 0.0
    yhat = np.dot(X, W) + b

    sum_w_absolute = np.sum(np.abs(W))  
    sum_w_squared = np.sum(W**2)  

    for i in range(m):
        cost += (yhat[i] - Y[i]) ** 2 

    total_cost = (cost / (2 * m)) + (L1 / m) * sum_w_absolute + (L2 / (2 * m)) * sum_w_squared  

    return total_cost


def compute_gradient_elastic_net(X, Y, W, b, L1, L2):
    """
    This function computes the Gradient of the cost function
    for a given set of W and b values.

    Args:
        X (ndarray): Training input values
        Y (ndarray): Target values or output values for the input    
        W (ndarray): Weights for the input parameters
        b (scalar): Intercept or bias parameter
        L1 (scalar): L1 Regularization parameter
        L2 (scalar): L2 Regularization parameter

    Returns:
        dj_dw (ndarray): Gradient when partially differentiated with respect to W
        dj_db (scalar): Gradient when partially differentiated with respect to b
    """
        
    m, n = X.shape
    dj_dw = np.zeros(n)
    dj_db = 0.0
    
    for i in range(m):
        err = (np.dot(X[i], W) + b) - Y[i]
        for j in range(n):
            dj_dw[j] += err * X[i, j]
        dj_db += err 

    dj_dw = (dj_dw / m) + (L1 / m) * np.sign(W) + (L2 / m) * W  
    dj_db /= m

    return dj_dw, dj_db
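
To see how cost and gradient fit together, here is a minimal sketch of a gradient descent loop for Elastic Net. It uses vectorized versions of the formulas above; the function names (`elastic_net_cost`, `elastic_net_gradient`, `fit_elastic_net`), the learning rate `alpha`, the iteration count, and the synthetic data are all assumptions for illustration, not a tuned implementation.

```python
import numpy as np

def elastic_net_cost(X, Y, W, b, L1, L2):
    """MSE/2 plus the L1 and L2 penalties, scaled by m as in the formula above."""
    m = X.shape[0]
    err = X @ W + b - Y
    return (err @ err) / (2 * m) + (L1 / m) * np.sum(np.abs(W)) + (L2 / (2 * m)) * np.sum(W**2)

def elastic_net_gradient(X, Y, W, b, L1, L2):
    """(Sub)gradient of the Elastic Net cost with respect to W and b."""
    m = X.shape[0]
    err = X @ W + b - Y
    dj_dw = X.T @ err / m + (L1 / m) * np.sign(W) + (L2 / m) * W
    return dj_dw, float(np.mean(err))

def fit_elastic_net(X, Y, L1, L2, alpha=0.05, iters=2000):
    """Plain gradient descent starting from zero weights and zero bias."""
    W, b = np.zeros(X.shape[1]), 0.0
    for _ in range(iters):
        dj_dw, dj_db = elastic_net_gradient(X, Y, W, b, L1, L2)
        W = W - alpha * dj_dw
        b = b - alpha * dj_db
    return W, b

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                      # hypothetical features
Y = X @ np.array([2.0, -1.0, 0.0]) + 4.0 + rng.normal(scale=0.1, size=100)

W_plain, b_plain = fit_elastic_net(X, Y, L1=0.0, L2=0.0)   # no regularization
W_en, b_en = fit_elastic_net(X, Y, L1=5.0, L2=5.0)         # both penalties active
print("no penalty :", np.round(W_plain, 3), round(b_plain, 3))
print("elastic net:", np.round(W_en, 3), round(b_en, 3))
```

Setting `L1=0` gives back the Ridge update and setting `L2=0` gives back the Lasso update, so the same loop covers all three models.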


#### When I previously talked about Bias and Variance, I mentioned the concept of the Bias-Variance Trade-off:
It simply means that the more you try to lower the bias (for example, by making the model more flexible), the more the variance tends to increase. Similarly, the more you try to lower the variance (for example, by increasing $ \lambda $), the higher the bias tends to go.

The key idea is that we need to find the optimal balance between the two, where the error on new, unseen data is lowest.
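
The trade-off can be estimated numerically on a toy problem where the true function is known. The sketch below is built on assumptions: the true function `true_f` (a sine curve), the noise level, and the polynomial degrees are all made up for illustration. It refits each model on many freshly drawn noisy training sets and measures the squared bias (how far the average prediction is from the truth) and the variance (how much predictions change between training sets).

```python
import numpy as np

rng = np.random.default_rng(0)

def true_f(x):
    """Hypothetical ground-truth function, known only because the data is synthetic."""
    return np.sin(np.pi * x)

x_train = np.linspace(-1, 1, 30)
x_test = np.linspace(-0.9, 0.9, 50)
noise_std, n_trials = 0.3, 200

summary = {}
for degree in (1, 3, 9):
    preds = np.empty((n_trials, x_test.size))
    for t in range(n_trials):
        # A new noisy training set on every trial.
        y_train = true_f(x_train) + rng.normal(scale=noise_std, size=x_train.size)
        preds[t] = np.polyval(np.polyfit(x_train, y_train, degree), x_test)
    bias_sq = float(np.mean((preds.mean(axis=0) - true_f(x_test)) ** 2))
    variance = float(np.mean(preds.var(axis=0)))
    summary[degree] = (bias_sq, variance)
    print(f"degree={degree}  bias^2={bias_sq:.4f}  variance={variance:.4f}")
```

The straight line (degree 1) has high bias and low variance, while the degree-9 polynomial has low bias and higher variance. Regularization with $ \lambda $ moves a flexible model back toward the low-variance side of this picture.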