# <span style="color:#2E86C1"><b>Linear Regression Model: A Deep Dive</b></span>


Linear regression is a fundamental algorithm used in both **machine learning (ML)** and **deep learning (DL)**. It models the relationship between a dependent variable \($ y $\) and an independent variable \($ x $\) by fitting a **linear equation**. Let's break down the model step by step.

### <span style="color:#D35400"><b>1. Basic Model Architecture</b></span>

In a **deep learning** context, linear regression can be viewed as a **simple neural network** with:

- **One input** (the feature \($ x $\)),
- **One neuron** (which applies a linear transformation \($ \hat{y} = wx + b $\)),
- **One output** (the predicted value \($ \hat{y} $\)).

The equation for linear regression is:

$$
\hat{y} = wx + b
$$

Where:

- \($ w $\) is the **weight** (or slope of the line),
- \($ b $\) is the **bias** (the y-intercept of the line),
- \($ x $\) is the **input feature**,
- \($ \hat{y} $\) is the **predicted output**.
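
As a minimal sketch of the model equation, the function below evaluates the line for a few inputs. The name `predict` and the parameter values are hypothetical and chosen only for illustration.

```python
def predict(x, w, b):
    """Return the prediction y_hat = w * x + b for one input value."""
    return w * x + b


# Hypothetical parameters, chosen only for illustration.
w, b = 2.0, 1.0
y_hat = [predict(x, w, b) for x in [0.0, 1.0, 2.0]]
print(y_hat)  # [1.0, 3.0, 5.0]
```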

### <span style="color:#D35400"><b>2. Loss Function</b></span>

The goal of linear regression is to minimize the error between the predicted values \($ \hat{y} $\) and the actual values \($ y $\). This error is quantified using a **loss function**. The most common loss function for linear regression is the **Mean Squared Error (MSE)**:

$$
\text{Loss} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2
$$

Where:
- \($ n $\) is the **number of data points**,
- \($ y_i $\) is the **actual value** of the \($ i $\)-th data point,
- \($ \hat{y}_i $\) is the **predicted value**.
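
The MSE formula can be sketched in a few lines of plain Python. The helper name `mse` and the sample values are assumptions made for this example.

```python
def mse(y_true, y_pred):
    """Mean Squared Error: average of the squared differences."""
    n = len(y_true)
    return sum((y - y_hat) ** 2 for y, y_hat in zip(y_true, y_pred)) / n


# Hypothetical actual and predicted values.
y_true = [1.0, 3.0, 5.0]
y_pred = [1.0, 2.0, 7.0]

# Squared errors are 0, 1 and 4, so the loss is (0 + 1 + 4) / 3.
loss = mse(y_true, y_pred)
print(loss)
```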

### <span style="color:#D35400"><b>3. Gradient Descent (Weight and Bias Updates)</b></span>

To minimize the loss function, we use **gradient descent**. This optimization algorithm updates the weights and biases by computing the **gradients** of the loss function with respect to these parameters.


---

## <span style="color:#2E86C1"><b>Understanding Gradient Descent</b></span>

### <span style="color:#D35400"><b>1. What is a Gradient?</b></span>

- The **gradient** is a vector that points in the direction of the **steepest ascent** of a function.
- Its magnitude indicates how quickly the function increases when you move in that direction (and how quickly it decreases when you move the opposite way).
- In simple terms, it gives you the **slope** of the function at a specific point.

### <span style="color:#D35400"><b>2. What is Gradient Descent?</b></span>

- **Gradient Descent** is an optimization algorithm used to minimize a function (often a loss function in machine learning).
- It involves taking **steps downhill** towards the minimum point of the function.
- The negative gradient sets the direction of each step, and the gradient's magnitude scaled by the learning rate sets how far to go.

### <span style="color:#D35400"><b>3. What is a Partial Derivative?</b></span>

- A **partial derivative** measures how a function changes as one of its variables changes while keeping other variables constant.
- It allows us to understand the **sensitivity** of the function to each parameter independently.
- In the context of gradient descent, partial derivatives are used to calculate the gradient.
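
One way to build intuition for partial derivatives is to approximate them numerically: nudge one variable, hold the other fixed, and measure the change. The function `f` below is a made-up example, and the helper names are hypothetical.

```python
def f(w, b):
    """Example function of two variables: f(w, b) = w^2 + 3*w*b."""
    return w ** 2 + 3 * w * b


def partial_w(func, w, b, h=1e-6):
    """Central-difference estimate of d(func)/dw, with b held constant."""
    return (func(w + h, b) - func(w - h, b)) / (2 * h)


def partial_b(func, w, b, h=1e-6):
    """Central-difference estimate of d(func)/db, with w held constant."""
    return (func(w, b + h) - func(w, b - h)) / (2 * h)


# Analytically: df/dw = 2w + 3b and df/db = 3w, so at (1, 2) the gradient is (8, 3).
grad = (partial_w(f, 1.0, 2.0), partial_b(f, 1.0, 2.0))
print(grad)
```

Stacking the two partial derivatives into one tuple gives the gradient vector at that point.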

### <span style="color:#28B463"><b>4. Key Concepts</b></span>

- **Steepest Ascent vs. Steepest Descent**:
  - The **gradient** indicates the direction of steepest ascent.
  - **Gradient Descent** uses the negative gradient to find the direction of steepest descent.
  
- **Learning Rate**:
  - The **learning rate** \($ \alpha $\) controls the size of the steps taken during the descent.
  - A small learning rate leads to slow convergence, while a large learning rate may overshoot the minimum.

- **Iteration**:
  - Gradient descent is performed iteratively, updating the weights or parameters until convergence (when the updates become negligibly small).
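
The effect of the learning rate can be seen on a toy function. The sketch below minimizes \($ f(w) = w^2 $\), whose gradient is \($ 2w $\); the function, the starting point, and the three learning rates are assumptions picked for illustration.

```python
def descend(alpha, w0=1.0, steps=50):
    """Run gradient descent on f(w) = w^2 and return the final w."""
    w = w0
    for _ in range(steps):
        grad = 2 * w          # derivative of w^2
        w = w - alpha * grad  # step against the gradient
    return w


slow = descend(alpha=0.01)     # small steps: still far from the minimum at 0
good = descend(alpha=0.4)      # moderate steps: essentially at the minimum
diverged = descend(alpha=1.1)  # too large: each step overshoots further

print(slow, good, diverged)
```

With `alpha=0.01` each step only shrinks `w` by 2%, while `alpha=1.1` flips the sign of `w` and grows its magnitude on every step.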

<center><img src="../../../images/gradient_descent.png" alt="error" width="800"/></center>

---

### <span style="color:#28B463"><b>Gradient of the Loss Function</b></span>

We compute the **partial derivatives** of the loss function with respect to both \($ w $\) (weight) and \($ b $\) (bias):

- **Partial derivative w.r.t \($ w $\)**:

$$
\frac{\partial \text{Loss}}{\partial w} = -\frac{2}{n} \sum_{i=1}^{n} x_i (y_i - \hat{y}_i)
$$

- **Partial derivative w.r.t \($ b $\)**:

$$
\frac{\partial \text{Loss}}{\partial b} = -\frac{2}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)
$$
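
The two gradient formulas translate directly into code. The sketch below also compares them against finite-difference estimates as a sanity check; the function names and the tiny dataset are hypothetical.

```python
def mse(w, b, xs, ys):
    """MSE of the line y_hat = w*x + b on the data."""
    n = len(xs)
    return sum((y - (w * x + b)) ** 2 for x, y in zip(xs, ys)) / n


def gradients(w, b, xs, ys):
    """Analytic partial derivatives of the MSE with respect to w and b."""
    n = len(xs)
    residuals = [y - (w * x + b) for x, y in zip(xs, ys)]
    dw = -2.0 / n * sum(x * r for x, r in zip(xs, residuals))
    db = -2.0 / n * sum(residuals)
    return dw, db


# Hypothetical data lying on the line y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

dw, db = gradients(0.0, 0.0, xs, ys)

# Finite-difference check of the same derivatives at w = 0, b = 0.
h = 1e-6
dw_num = (mse(h, 0.0, xs, ys) - mse(-h, 0.0, xs, ys)) / (2 * h)
db_num = (mse(0.0, h, xs, ys) - mse(0.0, -h, xs, ys)) / (2 * h)
print(dw, db, dw_num, db_num)
```

Both gradients are negative at the starting point, which means increasing \($ w $\) and \($ b $\) lowers the loss there.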

### <span style="color:#28B463"><b>Updating Weight and Bias</b></span>

The **gradient descent** algorithm updates \($ w $\) and \($ b $\) iteratively using the **learning rate** \($ \alpha $\) (which controls the step size of the update):

- **Weight update**:

$$
w_{\text{new}} = w_{\text{old}} - \alpha \cdot \frac{\partial \text{Loss}}{\partial w}
$$

- **Bias update**:

$$
b_{\text{new}} = b_{\text{old}} - \alpha \cdot \frac{\partial \text{Loss}}{\partial b}
$$

These updates are performed over multiple **epochs** (iterations over the entire dataset) until the loss converges to a minimum.
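
Putting the gradient and update formulas together gives a complete training loop. This is a minimal sketch in plain Python; the function name `fit_linear`, the learning rate, the epoch count, and the dataset are all assumptions for illustration.

```python
def fit_linear(xs, ys, alpha=0.05, epochs=2000):
    """Fit y_hat = w*x + b with batch gradient descent on the MSE."""
    n = len(xs)
    w, b = 0.0, 0.0
    for _ in range(epochs):
        residuals = [y - (w * x + b) for x, y in zip(xs, ys)]
        dw = -2.0 / n * sum(x * r for x, r in zip(xs, residuals))
        db = -2.0 / n * sum(residuals)
        # Move both parameters against their gradients.
        w -= alpha * dw
        b -= alpha * db
    return w, b


# Hypothetical data generated from y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

w, b = fit_linear(xs, ys)
print(w, b)  # close to 2 and 1
```

Each pass of the loop uses the whole dataset, so one iteration here corresponds to one epoch.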

---

## <span style="color:#2E86C1"><b>Key Terms Explained</b></span>

- **Loss**: This is a measure of how far off the predicted values \($ \hat{y} $\) are from the actual values \($ y $\). In linear regression, the loss is often calculated using the **MSE**.

- **Gradient Descent**: This is the optimization algorithm used to minimize the loss function by adjusting the model's parameters (**weight** and **bias**). It works by moving in the direction of the steepest decrease of the loss (negative gradient).

- **Epoch**: An epoch refers to one complete pass through the entire training dataset. In each epoch, the model updates its weights and biases using gradient descent.

- **Weight (w)**: This is the parameter that determines how much influence the input variable \($ x $\) has on the output \($ y $\). It's equivalent to the slope of the line in linear regression.

- **Bias (b)**: This is the parameter that shifts the line up or down. It's equivalent to the y-intercept in linear regression.

---

## <span style="color:#2E86C1"><b>Calculating the Updated Weight (w)</b></span>

Let's now dive deeper into the **equation** to calculate the updated **weight** \($ w $\). We will substitute values step by step to get a final formula.

### <span style="color:#D35400"><b>Starting Point</b></span>

We are using **gradient descent** to update \($ w $\):

$$
w_{\text{new}} = w_{\text{old}} - \alpha \cdot \frac{\partial \text{Loss}}{\partial w}
$$

Where:

- \($ \alpha $\) is the **learning rate**,
- \($ \frac{\partial \text{Loss}}{\partial w} $\) is the **gradient of the loss** with respect to \($ w $\).

### <span style="color:#D35400"><b>Loss Function and Gradient</b></span>

For linear regression, the predicted value \($ \hat{y} $\) is:

$$
\hat{y} = w \cdot x + b
$$

And the loss is the **Mean Squared Error (MSE)**:

$$
\text{Loss} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2
$$

Substitute \($ \hat{y}_i = w \cdot x_i + b $\) into the loss function:

$$
\text{Loss} = \frac{1}{n} \sum_{i=1}^{n} (y_i - (w \cdot x_i + b))^2
$$

### <span style="color:#28B463"><b>Gradient Calculation</b></span>

Take the partial derivative of the loss function with respect to \($ w $\):

$$
\frac{\partial \text{Loss}}{\partial w} = -\frac{2}{n} \sum_{i=1}^{n} x_i (y_i - (w \cdot x_i + b))
$$

### <span style="color:#28B463"><b>Bias Substitution \($ b = \bar{y} - w \cdot \bar{x} $\)</b></span>

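Now, substitute the bias term \($ b = \bar{y} - w \cdot \bar{x} $\). This is the value of \($ b $\) that sets \($ \frac{\partial \text{Loss}}{\partial b} $\) to zero for a given \($ w $\), so the bias is held at its best value while \($ w $\) is updated. Here:

- \($ \bar{y} $\) is the **mean** of \($ y $\),
- \($ \bar{x} $\) is the **mean** of \($ x $\).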

Thus, the predicted value becomes:

$$
\hat{y} = w \cdot (x - \bar{x}) + \bar{y}
$$

Substitute this back into the **gradient** of the loss:

$$
\frac{\partial \text{Loss}}{\partial w} = -\frac{2}{n} \sum_{i=1}^{n} x_i \left( (y_i - \bar{y}) - w \cdot (x_i - \bar{x}) \right)
$$

### <span style="color:#D35400"><b>Final Equation for Weight Update</b></span>

Finally, the **weight** gets updated using **gradient descent**:

$$
w_{\text{new}} = w_{\text{old}} + \frac{2\alpha}{n} \sum_{i=1}^{n} x_i \left( (y_i - \bar{y}) - w_{\text{old}} \cdot (x_i - \bar{x}) \right)
$$

This formula shows how the weight \($ w $\) gets updated in each step of the gradient descent process.
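
The final formula can be checked numerically. The sketch below iterates the update and compares the result with the familiar least-squares slope (covariance divided by variance), which is used here only as a reference value. The function name, data, and hyperparameters are hypothetical.

```python
def fit_slope_centered(xs, ys, alpha=0.05, steps=500):
    """Iterate the weight update with b tied to w via b = y_bar - w * x_bar."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    w = 0.0
    for _ in range(steps):
        total = sum(x * ((y - y_bar) - w * (x - x_bar)) for x, y in zip(xs, ys))
        w = w + 2 * alpha / n * total
    b = y_bar - w * x_bar
    return w, b


# Hypothetical noisy data.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.1, 4.9, 7.2, 8.8]

w, b = fit_slope_centered(xs, ys)

# Reference value: least-squares slope = sum of cross-deviations / sum of squared deviations.
x_bar = sum(xs) / len(xs)
y_bar = sum(ys) / len(ys)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
sxx = sum((x - x_bar) ** 2 for x in xs)
w_closed = sxy / sxx
print(w, w_closed, b)
```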

---

# <span style="color:#2E86C1"><b>Ridge Regression: Introducing Regularization</b></span>

In **Ridge Regression**, the primary difference from ordinary linear regression is the inclusion of a **regularization term**. This term helps penalize large weights to prevent **overfitting**, leading to a model that generalizes better.

### <span style="color:#D35400"><b>1. Ridge Regression Loss Function</b></span>

The **Ridge Regression Loss** function combines the **Mean Squared Error (MSE)** with a penalty term that controls the magnitude of the weights:

$$
\text{Loss}_{\text{Ridge}} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \lambda w^2
$$

Where:

- \($ \lambda $\) is the **regularization parameter** (called **alpha** in some Ridge implementations such as scikit-learn's, which should not be confused with the learning rate \($ \alpha $\)),
- \($ w^2 $\) is the **squared weight**; with several features it becomes the sum of the squared weights \($ \sum_{j} w_j^2 $\).

The additional term \($ \lambda w^2 $\) discourages the model from learning large weights, since large weights are a common symptom of overfitting.

### <span style="color:#D35400"><b>2. Gradient of Ridge Loss with Respect to Weight \( $w$ \)</b></span>

To derive the weight update rule, we need to compute the **gradient of the Ridge Loss** with respect to the weight \( $w$ \). This consists of two parts:

- **Gradient of the MSE** (same as in linear regression):

$$
\frac{\partial \text{Loss}_{\text{MSE}}}{\partial w} = -\frac{2}{n} \sum_{i=1}^{n} x_i (y_i - \hat{y}_i)
$$

- **Gradient of the regularization term \($ \lambda w^2 $\)**:

$$
\frac{\partial \lambda w^2}{\partial w} = 2\lambda w
$$

Thus, the total gradient for **Ridge Regression** becomes:

$$
\frac{\partial \text{Loss}_{\text{Ridge}}}{\partial w} = -\frac{2}{n} \sum_{i=1}^{n} x_i (y_i - \hat{y}_i) + 2\lambda w
$$

### <span style="color:#D35400"><b>3. Weight Update Rule for Ridge Regression</b></span>

Using the **gradient descent** algorithm, we update the weight \($ w $\) based on the computed gradient:

$$
w_{\text{new}} = w_{\text{old}} - \alpha \cdot \left( -\frac{2}{n} \sum_{i=1}^{n} x_i (y_i - \hat{y}_i) + 2\lambda w_{\text{old}} \right)
$$

Simplifying the update formula:

$$
w_{\text{new}} = w_{\text{old}} + \frac{2\alpha}{n} \sum_{i=1}^{n} x_i (y_i - \hat{y}_i) - 2\alpha\lambda w_{\text{old}}
$$

This final equation shows how the weight is updated at each step of gradient descent in **Ridge Regression**.
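
A minimal sketch of this update rule in a training loop follows. The text above only gives the weight update, so the bias here is updated with the unregularized rule from the linear regression section; that choice, along with the function name, data, and hyperparameters, is an assumption of this example.

```python
def fit_ridge(xs, ys, lam, alpha=0.05, epochs=3000):
    """Gradient descent on MSE + lam * w^2 (bias left unpenalized)."""
    n = len(xs)
    w, b = 0.0, 0.0
    for _ in range(epochs):
        residuals = [y - (w * x + b) for x, y in zip(xs, ys)]
        dw = -2.0 / n * sum(x * r for x, r in zip(xs, residuals)) + 2 * lam * w
        db = -2.0 / n * sum(residuals)  # assumption: no penalty on the bias
        w -= alpha * dw
        b -= alpha * db
    return w, b


# Hypothetical data generated from y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

w_ols, b_ols = fit_linear = fit_ridge(xs, ys, lam=0.0)  # lam = 0 recovers plain linear regression
w_ridge, b_ridge = fit_ridge(xs, ys, lam=1.0)
print(w_ols, w_ridge)
```

With the penalty switched on, the fitted slope is noticeably smaller than the unregularized slope, and the bias rises to compensate.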

---

## <span style="color:#2E86C1"><b>Key Differences from Ordinary Linear Regression</b></span>

### <span style="color:#28B463"><b>Regularization Term</b></span>

The key difference in Ridge Regression is the inclusion of the **extra term** \($ -2\alpha\lambda w_{\text{old}} $\) in the weight update rule. Because this term is proportional to \($ w_{\text{old}} $\) and has the opposite sign, it shrinks the weight toward zero at every step (often called **weight decay**), helping to prevent **overfitting**.

### <span style="color:#28B463"><b>Regularization Parameter \( $\lambda$ \)</b></span>

- The **regularization parameter \( $\lambda$ \)** controls the strength of the penalty. 
- When \($ \lambda = 0 $\), Ridge Regression becomes equivalent to **ordinary linear regression**.
- A larger \( $\lambda$ \) results in greater penalization, pushing the weights towards zero and reducing model complexity.
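
To see the shrinkage directly, the sketch below evaluates the minimizer of the single-feature Ridge loss for several values of \($ \lambda $\). The closed form \($ w = S_{xy} / (S_{xx} + \lambda) $\), with \($ S_{xy} $\) and \($ S_{xx} $\) the mean-normalized covariance and variance, is derived here by setting both partial derivatives to zero with an unpenalized bias; it is not stated in the text above, and the names and data are hypothetical.

```python
def ridge_slope(xs, ys, lam):
    """Minimizer of MSE + lam * w^2 over w, with the bias at its optimum."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / n
    s_xx = sum((x - x_bar) ** 2 for x in xs) / n
    return s_xy / (s_xx + lam)


# Hypothetical data generated from y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

lambdas = [0.0, 0.1, 1.0, 10.0]
slopes = [ridge_slope(xs, ys, lam) for lam in lambdas]
for lam, slope in zip(lambdas, slopes):
    print(lam, slope)
```

The slope equals the ordinary least-squares value at \($ \lambda = 0 $\) and moves steadily toward zero as \($ \lambda $\) grows, without ever reaching it.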

### <span style="color:#28B463"><b>Impact on Generalization</b></span>

The regularization term \( $\lambda w^2$ \) encourages the model to have **smaller weights**, preventing it from overfitting the training data. This allows the model to generalize better to unseen data, avoiding **overly complex solutions** that fit noise in the data.

---

By adding **Ridge regularization**, we improve the **stability** of the linear model, especially when dealing with **multicollinearity** (where predictor variables are highly correlated). Ridge Regression is an effective tool when you need to balance between fitting your data and maintaining a model that generalizes well.



# <span style="color:#2E86C1"><b>Lasso Regression: Emphasizing Feature Selection</b></span>

In **Lasso Regression**, the key difference from ordinary linear regression is the introduction of a **regularization term** that encourages sparsity in the model. This means that some coefficients can become exactly zero, leading to a simpler and more interpretable model.

### <span style="color:#D35400"><b>1. Lasso Regression Loss Function</b></span>

The **Lasso Regression Loss** function integrates the **Mean Squared Error (MSE)** with a penalty term based on the absolute values of the weights:

$$
\text{Loss}_{\text{Lasso}} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \lambda \sum_{j=1}^{p} |w_j|
$$

Where:

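- \($ \lambda $\) is the **regularization parameter** (also called **alpha** in some Lasso implementations such as scikit-learn's, which should not be confused with the learning rate \($ \alpha $\)),
- \($ p $\) is the **number of features** (weights),
- \($ |w_j| $\) is the **absolute value of the \($ j $\)-th weight**, so \($ \sum_{j=1}^{p} |w_j| $\) is the sum of the absolute values of the weights (the L1 norm).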

The term \($ \lambda \sum_{j=1}^{p} |w_j| $\) encourages some weights to shrink to zero, effectively performing feature selection.

### <span style="color:#D35400"><b>2. Gradient of Lasso Loss with Respect to Weight \($ w $\)</b></span>

To derive the weight update rule for Lasso Regression, we compute the **gradient of the Lasso Loss** with respect to the weight \($ w $\). The gradient consists of two components:

- **Gradient of the MSE** (same as in linear regression):

$$
\frac{\partial \text{Loss}_{\text{MSE}}}{\partial w} = -\frac{2}{n} \sum_{i=1}^{n} x_i (y_i - \hat{y}_i)
$$

- **Gradient of the regularization term \($ \lambda |w| $\)**:

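For \($ w \neq 0 $\), the derivative with respect to \($ w $\) is the **sign function**. At \($ w = 0 $\) the absolute value is not differentiable, and \($ \text{sgn}(0) = 0 $\) is used as a subgradient: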

$$
\frac{\partial \lambda |w|}{\partial w} = \lambda \cdot \text{sgn}(w)
$$

So, the total gradient for **Lasso Regression** is given by:

$$
\frac{\partial \text{Loss}_{\text{Lasso}}}{\partial w} = -\frac{2}{n} \sum_{i=1}^{n} x_i (y_i - \hat{y}_i) + \lambda \cdot \text{sgn}(w)
$$

### <span style="color:#D35400"><b>3. Weight Update Rule for Lasso Regression</b></span>

Using the **gradient descent** algorithm, we update the weight \($ w $\) based on the computed gradient:

$$
w_{\text{new}} = w_{\text{old}} - \alpha \left( -\frac{2}{n} \sum_{i=1}^{n} x_i (y_i - \hat{y}_i) + \lambda \cdot \text{sgn}(w_{\text{old}}) \right)
$$

This simplifies to:

$$
w_{\text{new}} = w_{\text{old}} + \frac{2\alpha}{n} \sum_{i=1}^{n} x_i (y_i - \hat{y}_i) - \alpha \lambda \cdot \text{sgn}(w_{\text{old}})
$$
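
A single step of this update can be sketched as follows. Python has no built-in sign function, so a small helper is defined; the names `sgn` and `lasso_step`, the data, and the hyperparameters are assumptions for illustration.

```python
def sgn(w):
    """Sign function: 1 for positive, -1 for negative, 0 for zero."""
    return (w > 0) - (w < 0)


def lasso_step(w, b, xs, ys, alpha, lam):
    """One (sub)gradient step on MSE + lam * |w| for the weight w."""
    n = len(xs)
    residuals = [y - (w * x + b) for x, y in zip(xs, ys)]
    grad_mse = -2.0 / n * sum(x * r for x, r in zip(xs, residuals))
    return w - alpha * (grad_mse + lam * sgn(w))


# Hypothetical data and starting point.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

# Here grad_mse = -20 and the penalty adds 0.5, so the step is 0.01 * 19.5.
w_new = lasso_step(w=1.0, b=0.0, xs=xs, ys=ys, alpha=0.01, lam=0.5)
print(w_new)
```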

### <span style="color:#D35400"><b>4. Key Differences from Ordinary Linear Regression</b></span>

#### <span style="color:#28B463"><b>Regularization Term</b></span>

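The major distinction in Lasso Regression is the inclusion of the extra term \($ -\alpha \lambda \cdot \text{sgn}(w_{\text{old}}) $\) in the weight update rule, which comes from the **absolute value penalty**. This term pulls every nonzero weight toward zero by the same fixed amount \($ \alpha \lambda $\) per step, regardless of the weight's size, which is what allows the model to zero out less important features.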

#### <span style="color:#28B463"><b>Regularization Parameter \($ \lambda $\)</b></span>

- The **regularization parameter \($ \lambda $\)** controls the strength of the penalty.
- When \($ \lambda = 0 $\), Lasso Regression becomes equivalent to **ordinary linear regression**.
- A larger \($ \lambda $\) increases the penalty, promoting more weights to become zero and leading to a simpler model.

---

## <span style="color:#2E86C1"><b>Weight Updates in Lasso Regression</b></span>

In Lasso Regression, the weight update rule incorporates the term \($ \lambda \cdot \text{sgn}(w_{\text{old}}) $\). This term plays a critical role in how the weights are adjusted during training. Let's break down its impact:

### <span style="color:#D35400"><b>Understanding the Penalty Term</b></span>

- **\($ \lambda $\)**: This is the regularization parameter that controls the strength of the penalty. A larger \($ \lambda $\) encourages more weights to shrink towards zero.
  
- **\($ \text{sgn}(w_{\text{old}})$\)**: The sign function returns:
  - **1** if \($ w_{\text{old}} > 0 $\) (positive weight)
  - **-1** if \($ w_{\text{old}} < 0 $\) (negative weight)
  - **0** if \($ w_{\text{old}} = 0 $\) (zero weight)

### <span style="color:#28B463"><b>Impact of the Penalty Term</b></span>

1. **When \($ w_{\text{old}} $\) is Positive**:
   - The term \($ \lambda \cdot \text{sgn}(w_{\text{old}}) $\) equals \($ +\lambda $\), so it adds a positive amount to the gradient.
   - **Effect**: The penalty reduces the value of the weight \($ w_{\text{new}} $\). 
   - **Interpretation**: This encourages the weight to shrink, thus regularizing the model.

   $$
   w_{\text{new}} = w_{\text{old}} - \alpha \cdot \left(\text{penalty}\right) \quad \text{(penalty is positive)}
   $$

2. **When \($ w_{\text{old}} $\) is Negative**:
   - The term \($ \lambda \cdot \text{sgn}(w_{\text{old}}) $\) equals \($ -\lambda $\), so it adds a negative amount to the gradient.
   - **Effect**: The penalty increases the value of the weight \($ w_{\text{new}} $\) (making it less negative).
   - **Interpretation**: This adjustment reduces the magnitude of the negative weight, pushing it closer to zero.

   $$
   w_{\text{new}} = w_{\text{old}} - \alpha \cdot \left(\text{penalty}\right) \quad \text{(penalty is negative)}
   $$

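3. **When \($ w_{\text{old}} $\) is Small (Close to Zero)**:
   - If \($ |w_{\text{old}}| $\) is smaller than the penalty step \($ \alpha \lambda $\), a plain gradient step moves the weight across zero rather than onto it, so the weight tends to oscillate around zero.
   - **Effect**: Implementations that produce exact zeros typically set the weight to zero whenever the penalty step would flip its sign (soft-thresholding) or use coordinate descent. The weight \($ w_{\text{new}} $\) then becomes zero, effectively eliminating that feature from the model.
   - **Interpretation**: This feature selection property is a key benefit of Lasso Regression.

   $$
   w_{\text{new}} = 0 \quad \text{(if the update drives \( w_{\text{old}} \) to zero)}
   $$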


#### <span style="color:#28B463"><b>Impact on Feature Selection</b></span>

The L1 penalty encourages sparsity, meaning that Lasso can eliminate irrelevant features entirely by setting their corresponding weights to zero. This makes Lasso an effective method for feature selection in high-dimensional datasets.
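
The sketch below contrasts the plain update from this section with a soft-thresholding step, a proximal-gradient technique that is swapped in here for comparison and is not derived in the text above. All names and numbers are hypothetical; the MSE gradient is set to zero to mimic a feature that carries no signal.

```python
def sgn(w):
    """Sign function: 1 for positive, -1 for negative, 0 for zero."""
    return (w > 0) - (w < 0)


def soft_threshold(z, t):
    """Shrink z toward zero by t, returning exactly 0.0 when |z| <= t."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


w_old = 0.003
alpha, lam = 0.1, 0.5
grad_mse = 0.0  # assumption: the data term does not pull on this weight

# Plain (sub)gradient step: subtracts alpha * lam = 0.05 and overshoots past zero.
w_sub = w_old - alpha * (grad_mse + lam * sgn(w_old))

# Proximal step: take the MSE step first, then soft-threshold by alpha * lam.
w_prox = soft_threshold(w_old - alpha * grad_mse, alpha * lam)

print(w_sub, w_prox)
```

The plain step lands on a small negative value and would bounce back on the next iteration, whereas the soft-thresholded step returns exactly `0.0`.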



# <span style="color:#2E86C1"><b>Elastic Net Regression: Combination of Ridge and Lasso</b></span>

**Elastic Net Regression** combines both Lasso and Ridge regression to achieve a balance between feature selection and regularization. It is particularly useful when dealing with highly correlated features.

### <span style="color:#D35400"><b>1. Elastic Net Loss Function</b></span>

The **Elastic Net Loss** function integrates the **Mean Squared Error (MSE)** with both L1 and L2 penalty terms:

$$
\text{Loss}_{\text{Elastic Net}} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \lambda \left( l_1 \sum_{j=1}^{p} |w_j| + (1 - l_1) \sum_{j=1}^{p} w_j^2 \right)
$$

Where:

- \($ \lambda $\) is the **regularization parameter** (similar to Lasso).
- \($ l_1 $\) is the **l1_ratio**, which controls the balance between Lasso and Ridge penalties (0 ≤ l1_ratio ≤ 1).
- \($ \sum_{j=1}^{p} |w_j| $\) is the **sum of the absolute values of the weights** (L1 penalty), where \($ p $\) is the number of features.
- \($ \sum_{j=1}^{p} w_j^2 $\) is the **sum of squares of the weights** (L2 penalty).

### <span style="color:#D35400"><b>2. Gradient of Elastic Net Loss with Respect to Weight \($ w $\)</b></span>

To derive the weight update rule for Elastic Net Regression, we compute the **gradient of the Elastic Net Loss** with respect to the weight \($ w $\). The gradient consists of three components:

- **Gradient of the MSE**:

$$
\frac{\partial \text{Loss}_{\text{MSE}}}{\partial w} = -\frac{2}{n} \sum_{i=1}^{n} x_i (y_i - \hat{y}_i)
$$

- **Gradient of the L1 regularization term** (Lasso):

$$
\frac{\partial (\lambda l_1 |w|)}{\partial w} = \lambda l_1 \cdot \text{sgn}(w)
$$

- **Gradient of the L2 regularization term** (Ridge):

$$
\frac{\partial \left(\lambda (1 - l_1) w^2\right)}{\partial w} = 2\lambda (1 - l_1) w
$$

The total gradient for **Elastic Net Regression** is:

$$
\frac{\partial \text{Loss}_{\text{Elastic Net}}}{\partial w} = -\frac{2}{n} \sum_{i=1}^{n} x_i (y_i - \hat{y}_i) + \lambda l_1 \cdot \text{sgn}(w) + 2\lambda (1 - l_1) w
$$

### <span style="color:#D35400"><b>3. Weight Update Rule for Elastic Net Regression</b></span>

Using the **gradient descent** algorithm, we update the weight \($ w $\) based on the computed gradient:

$$
w_{\text{new}} = w_{\text{old}} - \alpha \left( -\frac{2}{n} \sum_{i=1}^{n} x_i (y_i - \hat{y}_i) + \lambda l_1 \cdot \text{sgn}(w_{\text{old}}) + 2\lambda (1 - l_1) w_{\text{old}} \right)
$$

This simplifies to:

$$
w_{\text{new}} = w_{\text{old}} + \frac{2\alpha}{n} \sum_{i=1}^{n} x_i (y_i - \hat{y}_i) - \alpha \lambda l_1 \cdot \text{sgn}(w_{\text{old}}) - 2\alpha \lambda (1 - l_1) w_{\text{old}}
$$
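
One step of the Elastic Net update can be sketched like this. The function names, data, and hyperparameters are assumptions for illustration, and only the weight is updated because the text above does not give a bias rule for this model.

```python
def sgn(w):
    """Sign function: 1 for positive, -1 for negative, 0 for zero."""
    return (w > 0) - (w < 0)


def elastic_net_step(w, b, xs, ys, alpha, lam, l1_ratio):
    """One gradient step on MSE + lam * (l1_ratio * |w| + (1 - l1_ratio) * w^2)."""
    n = len(xs)
    residuals = [y - (w * x + b) for x, y in zip(xs, ys)]
    grad_mse = -2.0 / n * sum(x * r for x, r in zip(xs, residuals))
    grad_l1 = lam * l1_ratio * sgn(w)
    grad_l2 = 2 * lam * (1 - l1_ratio) * w
    return w - alpha * (grad_mse + grad_l1 + grad_l2)


# Hypothetical data and starting point.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

# Here grad_mse = -20, grad_l1 = 0.25 and grad_l2 = 0.5, so the step is 0.01 * 19.25.
w_new = elastic_net_step(w=1.0, b=0.0, xs=xs, ys=ys, alpha=0.01, lam=0.5, l1_ratio=0.5)
print(w_new)
```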

### <span style="color:#D35400"><b>4. Key Differences from Lasso and Ridge Regression</b></span>

#### <span style="color:#28B463"><b>Combination of Penalties</b></span>

- Elastic Net combines L1 and L2 penalties, allowing it to benefit from both feature selection (L1) and regularization (L2).

#### <span style="color:#28B463"><b>l1_ratio Parameter</b></span>

- The **l1_ratio** parameter controls the balance between Lasso and Ridge regularization:
  - If \($ l_1 = 1 $\), Elastic Net behaves like Lasso.
  - If \($ l_1 = 0 $\), Elastic Net behaves like Ridge.
  - Values between 0 and 1 provide a mix of both.
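
The two limiting cases can be verified by comparing the penalty gradients from the three sections. This is a small sketch with hypothetical function names and an arbitrary \($ \lambda $\).

```python
import math


def sgn(w):
    """Sign function: 1 for positive, -1 for negative, 0 for zero."""
    return (w > 0) - (w < 0)


def ridge_penalty_grad(w, lam):
    return 2 * lam * w


def lasso_penalty_grad(w, lam):
    return lam * sgn(w)


def elastic_net_penalty_grad(w, lam, l1_ratio):
    return lam * l1_ratio * sgn(w) + 2 * lam * (1 - l1_ratio) * w


lam = 0.7
weights = [-2.0, 0.5, 3.0]

# l1_ratio = 1 should reproduce the Lasso penalty gradient.
matches_lasso = all(
    math.isclose(elastic_net_penalty_grad(w, lam, 1.0), lasso_penalty_grad(w, lam))
    for w in weights
)
# l1_ratio = 0 should reproduce the Ridge penalty gradient.
matches_ridge = all(
    math.isclose(elastic_net_penalty_grad(w, lam, 0.0), ridge_penalty_grad(w, lam))
    for w in weights
)
print(matches_lasso, matches_ridge)
```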

---

## <span style="color:#2E86C1"><b>Weight Updates in Elastic Net Regression</b></span>

In Elastic Net Regression, the weight update rule incorporates both L1 and L2 regularization terms, balanced by the **l1_ratio**. Let's break down the impacts:

### <span style="color:#D35400"><b>Understanding the Components</b></span>

1. **L1 Regularization Term**:
   - The term \($ \lambda l_1 \cdot \text{sgn}(w_{\text{old}}) $\) shrinks the magnitude of every nonzero weight by a constant amount per step, independent of the weight's size.
  
2. **L2 Regularization Term**:
   - The term \($ 2\lambda (1 - l_1) w_{\text{old}} $\) penalizes larger weights, encouraging weight decay.

### <span style="color:#28B463"><b>Impact on Weight Updates</b></span>

- **When \($ w_{\text{old}} $\) is Positive**:
  - The L1 term reduces the weight, while the L2 term further encourages smaller weights.

- **When \($ w_{\text{old}} $\) is Negative**:
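  - The L1 term increases the weight, pushing it closer to zero, and the L2 term acts in the same direction: because \($ w_{\text{old}} $\) is negative, subtracting \($ 2\alpha\lambda (1 - l_1) w_{\text{old}} $\) also raises the weight toward zero.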

- **When \($ w_{\text{old}} $\) is Small (Close to Zero)**:
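  - Both regularization terms push the weight towards zero, but the L1 term dominates: its size does not depend on \($ |w_{\text{old}}| $\), while the L2 term shrinks in proportion to it. The L1 term is therefore the one that enables feature selection.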
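
To make the direction and size of each penalty contribution concrete, this sketch evaluates the two penalty parts of the simplified update rule for a positive, a negative, and a near-zero weight. The function name and the hyperparameter values are hypothetical.

```python
def sgn(w):
    """Sign function: 1 for positive, -1 for negative, 0 for zero."""
    return (w > 0) - (w < 0)


def penalty_contributions(w_old, alpha, lam, l1_ratio):
    """Return the L1 and L2 parts that the penalties add to w_new."""
    l1_part = -alpha * lam * l1_ratio * sgn(w_old)
    l2_part = -2 * alpha * lam * (1 - l1_ratio) * w_old
    return l1_part, l2_part


alpha, lam, l1_ratio = 0.1, 0.5, 0.5

pos = penalty_contributions(2.0, alpha, lam, l1_ratio)    # both parts negative
neg = penalty_contributions(-2.0, alpha, lam, l1_ratio)   # both parts positive
small = penalty_contributions(0.01, alpha, lam, l1_ratio) # L1 part much larger

for name, parts in [("positive", pos), ("negative", neg), ("small", small)]:
    print(name, parts)
```

In every case the two parts share a sign that points toward zero; only their relative size changes with the magnitude of the weight.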

### <span style="color:#28B463"><b>Overall Effect on Feature Selection</b></span>

Elastic Net regression encourages sparsity and feature selection through its L1 penalty, while the L2 penalty keeps it stable when features are correlated. This makes it an effective choice in scenarios where there are many features, some of which may be highly correlated.

---

# <span style="color:#2E86C1"><b>Code Implementation</b></span>

In [None]:
import pandas as pd 

In [None]:
data = pd.read_csv('../../../datasets/')

In [None]:
#nothing yet 