# <span style="color:#2E86C1"><b>Linear Regression Model: A Deep Dive</b></span>


Linear regression is a fundamental algorithm used in both **machine learning (ML)** and **deep learning (DL)**. It models the relationship between a dependent variable \($ y $\) and an independent variable \($ x $\) by fitting a **linear equation**. Let's break down the model step by step.

### <span style="color:#D35400"><b>1. Basic Model Architecture</b></span>

In a **deep learning** context, linear regression can be viewed as a **simple neural network** with:

- **One input** (the feature \($ x $\)),
- **One neuron** (which applies a linear transformation \($ \hat{y} = wx + b $\)),
- **One output** (the predicted value \($ \hat{y} $\)).

The equation for linear regression is:

$$
\hat{y} = wx + b
$$

Where:

- \($ w $\) is the **weight** (or slope of the line),
- \($ b $\) is the **bias** (the y-intercept of the line),
- \($ x $\) is the **input feature**,
- \($ \hat{y} $\) is the **predicted output**.
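
As a minimal sketch of this single-neuron model, the forward pass is one multiplication and one addition. The function name `predict` and the parameter values below are hypothetical, chosen only for illustration.

```python
def predict(x, w, b):
    """Return the model output y_hat = w * x + b for a single input x."""
    return w * x + b

# Hypothetical parameters and inputs, not taken from any real data set.
w, b = 2.0, 1.0
inputs = [0.0, 1.0, 2.5]
predictions = [predict(x, w, b) for x in inputs]
print(predictions)  # [1.0, 3.0, 6.0]
```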

### <span style="color:#D35400"><b>2. Loss Function</b></span>

The goal of linear regression is to minimize the error between the predicted values \($ \hat{y} $\) and the actual values \($ y $\). This error is quantified using a **loss function**. The most common loss function for linear regression is the **Mean Squared Error (MSE)**:

$$
\text{Loss} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2
$$

Where:
- \($ n $\) is the **number of data points**,
- \($ y_i $\) is the **actual value** of the \($ i $\)-th data point,
- \($ \hat{y}_i $\) is the **predicted value**.
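
The MSE formula can be sketched in a few lines of plain Python. The helper name `mse` and the sample values are assumptions made for this example.

```python
def mse(y_true, y_pred):
    """Mean squared error between two equal-length sequences."""
    n = len(y_true)
    return sum((y - y_hat) ** 2 for y, y_hat in zip(y_true, y_pred)) / n

# Hypothetical actual and predicted values.
y_true = [3.0, 5.0, 7.0]
y_pred = [2.0, 5.0, 9.0]

# Errors are 1, 0 and -2, so the loss is (1 + 0 + 4) / 3.
loss = mse(y_true, y_pred)
print(loss)
```

Squaring the errors keeps positive and negative deviations from cancelling and penalizes large errors more heavily than small ones.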

### <span style="color:#D35400"><b>3. Gradient Descent (Weight and Bias Updates)</b></span>

To minimize the loss function, we use **gradient descent**. This optimization algorithm updates the weights and biases by computing the **gradients** of the loss function with respect to these parameters.


---

## <span style="color:#2E86C1"><b>Understanding Gradient Descent</b></span>

### <span style="color:#D35400"><b>1. What is a Gradient?</b></span>

- The **gradient** is a vector that points in the direction of the **steepest ascent** of a function.
- Its magnitude tells you how quickly the function increases when you move in that direction.
- For a function of a single variable, it reduces to the **slope** of the function at a specific point.

### <span style="color:#D35400"><b>2. What is Gradient Descent?</b></span>

- **Gradient Descent** is an optimization algorithm used to minimize a function (often a loss function in machine learning).
- It involves taking **steps downhill** towards the minimum point of the function.
- The negative gradient gives the direction to move, and the learning rate scales how far each step goes.

### <span style="color:#D35400"><b>3. What is a Partial Derivative?</b></span>

- A **partial derivative** measures how a function changes as one of its variables changes while keeping other variables constant.
- It allows us to understand the **sensitivity** of the function to each parameter independently.
- In the context of gradient descent, partial derivatives are used to calculate the gradient.
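
One way to build intuition for partial derivatives is to estimate them numerically: nudge one variable, hold the others fixed, and measure the change in the function. The sketch below uses a central difference on a toy function; `f`, `partial`, and the step size `h` are hypothetical names and values chosen for illustration.

```python
def f(w, b):
    """Toy function of two variables, used only for illustration."""
    return w ** 2 + 3 * w * b

def partial(func, args, index, h=1e-6):
    """Central-difference estimate of the partial derivative of func
    with respect to args[index], keeping the other arguments fixed."""
    up = list(args)
    down = list(args)
    up[index] += h
    down[index] -= h
    return (func(*up) - func(*down)) / (2 * h)

point = (2.0, 1.0)
df_dw = partial(f, point, 0)  # analytic value: 2w + 3b = 7
df_db = partial(f, point, 1)  # analytic value: 3w = 6

# The gradient collects both partial derivatives into one vector.
gradient = (df_dw, df_db)
print(gradient)
```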

### <span style="color:#28B463"><b>4. Key Concepts</b></span>

- **Steepest Ascent vs. Steepest Descent**:
  - The **gradient** indicates the direction of steepest ascent.
  - **Gradient Descent** uses the negative gradient to find the direction of steepest descent.
  
- **Learning Rate**:
  - The **learning rate** \($ \alpha $\) controls the size of the steps taken during the descent.
  - A small learning rate leads to slow convergence, while a large learning rate may overshoot the minimum.

- **Iteration**:
  - Gradient descent is performed iteratively, updating the weights or parameters until convergence (when the updates become negligibly small).
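
The effect of the learning rate can be seen on a one-dimensional toy problem. The sketch below minimizes the hypothetical function f(w) = w², whose derivative is 2w, starting from w = 5; the three learning rates are arbitrary values picked to show slow convergence, fast convergence, and divergence.

```python
def descend(alpha, steps=20, w=5.0):
    """Run gradient descent on f(w) = w**2, whose derivative is 2 * w."""
    for _ in range(steps):
        w = w - alpha * 2 * w  # step against the gradient
    return w

small = descend(alpha=0.01)  # each step multiplies w by 0.98: slow progress
good = descend(alpha=0.3)    # each step multiplies w by 0.4: fast convergence
large = descend(alpha=1.1)   # each step multiplies w by -1.2: overshoots and diverges
print(small, good, large)
```

After 20 steps the small rate is still far from the minimum at 0, the moderate rate is essentially there, and the large rate has moved further away than where it started.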

<center><img src="../../../images/gradient_descent.png" alt="error" width="800"/></center>

---

### <span style="color:#28B463"><b>Gradient of the Loss Function</b></span>

We compute the **partial derivatives** of the loss function with respect to both \($ w $\) (weight) and \($ b $\) (bias):

- **Partial derivative w.r.t \($ w $\)**:

$$
\frac{\partial \text{Loss}}{\partial w} = -\frac{2}{n} \sum_{i=1}^{n} x_i (y_i - \hat{y}_i)
$$

- **Partial derivative w.r.t \($ b $\)**:

$$
\frac{\partial \text{Loss}}{\partial b} = -\frac{2}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)
$$
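
A useful sanity check for these formulas is to compare them against finite-difference estimates of the loss. The sketch below is one way to do that in plain Python; the function names and the small data set are assumptions made for this example.

```python
def mse_loss(xs, ys, w, b):
    """Mean squared error of the line y_hat = w * x + b on the data."""
    n = len(xs)
    return sum((y - (w * x + b)) ** 2 for x, y in zip(xs, ys)) / n

def gradients(xs, ys, w, b):
    """Analytic partial derivatives of the MSE with respect to w and b."""
    n = len(xs)
    residuals = [y - (w * x + b) for x, y in zip(xs, ys)]
    dw = -2.0 / n * sum(x * r for x, r in zip(xs, residuals))
    db = -2.0 / n * sum(residuals)
    return dw, db

# Hypothetical data and starting parameters.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]
w, b = 0.0, 0.0
dw, db = gradients(xs, ys, w, b)

# Central-difference estimates of the same derivatives, for comparison.
h = 1e-6
dw_numeric = (mse_loss(xs, ys, w + h, b) - mse_loss(xs, ys, w - h, b)) / (2 * h)
db_numeric = (mse_loss(xs, ys, w, b + h) - mse_loss(xs, ys, w, b - h)) / (2 * h)
print(dw, dw_numeric)
print(db, db_numeric)
```

Both pairs should agree to several decimal places. A mismatch usually points to a sign error or a missing factor of \($ x_i $\) in the weight gradient.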

### <span style="color:#28B463"><b>Updating Weight and Bias</b></span>

The **gradient descent** algorithm updates \($ w $\) and \($ b $\) iteratively using the **learning rate** \($ \alpha $\) (which controls the step size of the update):

- **Weight update**:

$$
w_{\text{new}} = w_{\text{old}} - \alpha \cdot \frac{\partial \text{Loss}}{\partial w}
$$

- **Bias update**:

$$
b_{\text{new}} = b_{\text{old}} - \alpha \cdot \frac{\partial \text{Loss}}{\partial b}
$$

These updates are performed over multiple **epochs** (iterations over the entire dataset) until the loss converges to a minimum.
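
Putting the two update rules together gives a complete training loop. The following is a minimal sketch in plain Python, assuming full-batch updates, zero initialization, and a hypothetical data set generated from y = 2x + 1; the function name `train` and the default values for `alpha` and `epochs` are illustrative choices, not prescribed settings.

```python
def train(xs, ys, alpha=0.05, epochs=2000):
    """Fit y_hat = w * x + b by full-batch gradient descent on the MSE."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        residuals = [y - (w * x + b) for x, y in zip(xs, ys)]
        # Compute both gradients from the same (old) parameters.
        grad_w = -2.0 / n * sum(x * r for x, r in zip(xs, residuals))
        grad_b = -2.0 / n * sum(residuals)
        # Then update both parameters.
        w = w - alpha * grad_w
        b = b - alpha * grad_b
    return w, b

# Hypothetical noise-free data from the line y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
w, b = train(xs, ys)
print(w, b)  # close to 2 and 1
```

Both gradients are computed before either parameter changes, so the weight and bias updates use the same predictions.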

---

## <span style="color:#2E86C1"><b>Key Terms Explained</b></span>

- **Loss**: This is a measure of how far off the predicted values \($ \hat{y} $\) are from the actual values \($ y $\). In linear regression, the loss is often calculated using the **MSE**.

- **Gradient Descent**: This is the optimization algorithm used to minimize the loss function by adjusting the model's parameters (**weight** and **bias**). It works by moving in the direction of the steepest decrease of the loss (negative gradient).

- **Epoch**: An epoch refers to one complete pass through the entire training dataset. With the full-batch gradients used here, each epoch performs one gradient descent update of the weight and bias.

- **Weight (w)**: This is the parameter that determines how much influence the input variable \($ x $\) has on the output \($ y $\). It's equivalent to the slope of the line in linear regression.

- **Bias (b)**: This is the parameter that shifts the line up or down. It's equivalent to the y-intercept in linear regression.

---

## <span style="color:#2E86C1"><b>Calculating the Updated Weight (w)</b></span>

Let's now dive deeper into the **equation** to calculate the updated **weight** \($ w $\). We will substitute values step by step to get a final formula.

### <span style="color:#D35400"><b>Starting Point</b></span>

We are using **gradient descent** to update \($ w $\):

$$
w_{\text{new}} = w_{\text{old}} - \alpha \cdot \frac{\partial \text{Loss}}{\partial w}
$$

Where:

- \($ \alpha $\) is the **learning rate**,
- \($ \frac{\partial \text{Loss}}{\partial w} $\) is the **gradient of the loss** with respect to \($ w $\).

### <span style="color:#D35400"><b>Loss Function and Gradient</b></span>

For linear regression, the predicted value \($ \hat{y} $\) is:

$$
\hat{y} = w \cdot x + b
$$

And the loss is the **Mean Squared Error (MSE)**:

$$
\text{Loss} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2
$$

Substitute \($ \hat{y}_i = w \cdot x_i + b $\) into the loss function:

$$
\text{Loss} = \frac{1}{n} \sum_{i=1}^{n} (y_i - (w \cdot x_i + b))^2
$$

### <span style="color:#28B463"><b>Gradient Calculation</b></span>

Take the partial derivative of the loss function with respect to \($ w $\):

$$
\frac{\partial \text{Loss}}{\partial w} = -\frac{2}{n} \sum_{i=1}^{n} x_i (y_i - (w \cdot x_i + b))
$$
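
The intermediate step is an application of the chain rule to a single squared term. Differentiating the outer square gives a factor of 2, and differentiating the inner expression with respect to \($ w $\) gives \($ -x_i $\):

$$
\frac{\partial}{\partial w} \left( y_i - (w \cdot x_i + b) \right)^2 = 2 \left( y_i - (w \cdot x_i + b) \right) \cdot (-x_i)
$$

Averaging this term over all \($ n $\) data points yields the gradient with respect to \($ w $\).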


### <span style="color:#28B463"><b>Updating weight (w)</b></span>

Finally, the weight update is:

$$
w_{\text{new}} = w_{\text{old}} - \alpha \cdot \left( -\frac{2}{n} \sum_{i=1}^{n} x_i (y_i - (w \cdot x_i + b)) \right)
$$

This formula updates the weight \($ w $\) by moving in the direction of the **steepest descent** (negative gradient) to minimize the loss.
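
A single weight update can be traced by hand on a tiny data set. The numbers below are hypothetical and chosen so the arithmetic stays simple; note that each residual is multiplied by its input \($ x_i $\) before summing.

```python
# Hypothetical two-point data set and starting parameters.
xs = [1.0, 2.0]
ys = [3.0, 5.0]
w_old, b = 1.0, 0.0
alpha = 0.1
n = len(xs)

# Residuals y_i - (w * x_i + b): 3 - 1 = 2 and 5 - 2 = 3.
residuals = [y - (w_old * x + b) for x, y in zip(xs, ys)]

# Gradient: -(2 / 2) * (1 * 2 + 2 * 3) = -8.
grad_w = -2.0 / n * sum(x * r for x, r in zip(xs, residuals))

# Update: 1 - 0.1 * (-8), which is approximately 1.8.
w_new = w_old - alpha * grad_w
print(residuals, grad_w, w_new)
```

The gradient is negative, so subtracting it increases the weight, which moves the predictions toward the larger actual values.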

## <span style="color:#2E86C1"><b>Calculating the Updated Bias (b)</b></span>

Just like with the weight \($ w $\), we use **gradient descent** to update the **bias** \($ b $\). The formula is:

$$
b_{\text{new}} = b_{\text{old}} - \alpha \cdot \frac{\partial \text{Loss}}{\partial b}
$$

Where:

- \($ \alpha $\) is the **learning rate**,
- \($ \frac{\partial \text{Loss}}{\partial b} $\) is the **gradient of the loss** with respect to \($ b $\).

### <span style="color:#D35400"><b>Loss Function and Bias Gradient</b></span>

For linear regression, the predicted value \($ \hat{y} $\) is:

$$
\hat{y} = w \cdot x + b
$$

Substituting \($ \hat{y}_i = w \cdot x_i + b $\) into the **Mean Squared Error (MSE)** loss function:

$$
\text{Loss} = \frac{1}{n} \sum_{i=1}^{n} (y_i - (w \cdot x_i + b))^2
$$

### <span style="color:#28B463"><b>Gradient Calculation for Bias</b></span>

Now, we take the **partial derivative** of the loss function with respect to \($ b $\):

$$
\frac{\partial \text{Loss}}{\partial b} = -\frac{2}{n} \sum_{i=1}^{n} (y_i - (w \cdot x_i + b))
$$
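
As with the weight, the intermediate step is the chain rule applied to a single squared term. The only difference is that the derivative of the inner expression with respect to \($ b $\) is \($ -1 $\), so no factor of \($ x_i $\) appears:

$$
\frac{\partial}{\partial b} \left( y_i - (w \cdot x_i + b) \right)^2 = 2 \left( y_i - (w \cdot x_i + b) \right) \cdot (-1)
$$

Averaging this term over all \($ n $\) data points yields the gradient with respect to \($ b $\).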

### <span style="color:#28B463"><b>Updating Bias (b)</b></span>

Finally, the bias update is:

$$
b_{\text{new}} = b_{\text{old}} - \alpha \cdot \left( -\frac{2}{n} \sum_{i=1}^{n} (y_i - (w \cdot x_i + b)) \right)
$$

This formula updates the bias \($ b $\) by moving in the direction of the **steepest descent** (negative gradient) to minimize the loss.
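
To see that a bias step actually lowers the loss, the sketch below applies one update on a hypothetical two-point data set and compares the MSE before and after. The weight is held fixed so the effect of the bias is isolated; all names and values are assumptions for this example.

```python
def mse_loss(xs, ys, w, b):
    """Mean squared error of the line y_hat = w * x + b on the data."""
    n = len(xs)
    return sum((y - (w * x + b)) ** 2 for x, y in zip(xs, ys)) / n

# Hypothetical data and starting parameters.
xs = [1.0, 2.0]
ys = [3.0, 5.0]
w, b_old = 1.0, 0.0
alpha = 0.1
n = len(xs)

# Residuals are 2 and 3, so the gradient is -(2 / 2) * (2 + 3) = -5.
residuals = [y - (w * x + b_old) for x, y in zip(xs, ys)]
grad_b = -2.0 / n * sum(residuals)

# Update: 0 - 0.1 * (-5) = 0.5.
b_new = b_old - alpha * grad_b

loss_before = mse_loss(xs, ys, w, b_old)  # (4 + 9) / 2 = 6.5
loss_after = mse_loss(xs, ys, w, b_new)   # (2.25 + 6.25) / 2 = 4.25
print(b_new, loss_before, loss_after)
```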

---


# CODE IMPLEMENTATION 