#### Mathematical formulation

##### Assumption:

For now, assume we already know the correct value of the slope \( m \).

Using Ordinary Least Squares (OLS), we found:

$$
m = 78.35
$$

Now, we will focus on optimizing only \( b \) while keeping \( m \) fixed.


#### Step 1:
Start with a random value of \( b \) and write the loss as the sum of squared errors:

$$
L = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2
$$


Since the prediction is

$$
\hat{y}_i = m x_i + b
$$

the loss becomes

$$
L = \sum_{i=1}^{n} (y_i - m x_i - b)^2
$$

Differentiate with respect to \( b \); the derivative of a sum is the sum of the derivatives:


$$
\frac{dL}{db}
=
\frac{d}{db}
\left(
\sum_{i=1}^{n}
(y_i - m x_i - b)^2
\right)
$$


$$
=
\sum_{i=1}^{n}
\frac{d}{db}
\left(
(y_i - m x_i - b)^2
\right)
$$


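Apply the chain rule with the inner function \( g_i = y_i - m x_i - b \), so that \( \frac{d}{db}\, g_i^2 = 2 g_i \, \frac{dg_i}{db} \):

$$
\frac{dL}{db}
=
\sum_{i=1}^{n}
2(y_i - m x_i - b)
\cdot
\frac{d}{db}(y_i - m x_i - b)
$$

The inner derivative is

$$
\frac{d}{db}(y_i - m x_i - b) = -1
$$

so the derivative of the loss simplifies to
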
$$
\frac{dL}{db}
=
-2
\sum_{i=1}^{n}
(y_i - m x_i - b)
$$
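
The result can be sanity-checked numerically. The sketch below compares the analytic derivative with a central finite difference. The data points are hypothetical (the dataset behind \( m = 78.35 \) is not shown in these notes), and the names `loss` and `dloss_db` are chosen only for this illustration.

```python
# Hypothetical toy data; the original dataset is not shown in these notes.
x = [1.0, 2.0, 3.0, 4.0]
y = [105.0, 182.0, 262.0, 339.0]
m = 78.35  # slope held fixed, as assumed above


def loss(b):
    """Sum of squared errors L(b) for the fixed slope m."""
    return sum((yi - m * xi - b) ** 2 for xi, yi in zip(x, y))


def dloss_db(b):
    """Analytic derivative: dL/db = -2 * sum(y_i - m*x_i - b)."""
    return -2 * sum(yi - m * xi - b for xi, yi in zip(x, y))


b = 0.0
h = 1e-5
# Central difference approximation of dL/db at b
numeric = (loss(b + h) - loss(b - h)) / (2 * h)
analytic = dloss_db(b)
print(analytic, numeric)  # the two values should agree closely
```

Because \( L \) is quadratic in \( b \), the central difference matches the analytic value up to floating-point rounding.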


### Evaluating the Derivative at \( b = 0 \) and \( m = 78.35 \)

We know:

$$
\frac{dL}{db}
=
-2 \sum_{i=1}^{n} (y_i - m x_i - b)
$$

Substitute \( m = 78.35 \) and \( b = 0 \):

$$
\frac{dL}{db}
=
-2 \sum_{i=1}^{n} \left( y_i - 78.35 x_i - 0 \right)
$$

Simplifying:

$$
\frac{dL}{db}
=
-2 \sum_{i=1}^{n} \left( y_i - 78.35 x_i \right)
$$
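
As a rough illustration, the snippet below evaluates this expression on hypothetical toy data (not the dataset used to obtain \( m = 78.35 \)). It also computes the value of \( b \) at which the derivative vanishes: setting \( -2 \sum (y_i - m x_i - b) = 0 \) and solving gives \( b = \frac{1}{n} \sum (y_i - m x_i) \), the mean residual.

```python
# Hypothetical toy data; substitute the real dataset used with m = 78.35.
x = [1.0, 2.0, 3.0, 4.0]
y = [105.0, 182.0, 262.0, 339.0]
m = 78.35
n = len(x)

# dL/db at b = 0: -2 * sum(y_i - 78.35 * x_i)
slope_at_zero = -2 * sum(yi - m * xi for xi, yi in zip(x, y))

# Setting dL/db = 0 and solving for b gives the mean residual.
b_star = sum(yi - m * xi for xi, yi in zip(x, y)) / n
slope_at_b_star = -2 * sum(yi - m * xi - b_star for xi, yi in zip(x, y))

print(slope_at_zero)             # negative here, so the loss falls as b increases from 0
print(b_star, slope_at_b_star)   # the derivative is (numerically) zero at b_star
```

A negative derivative at \( b = 0 \) means gradient descent will move \( b \) upward on its first step.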


### Gradient Descent (Single Parameter Version)

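```python
# Assumes b_old, lr, epochs, the fixed slope m, and the data lists x, y are already defined
for i in range(epochs):
    slope = -2 * sum(yi - m * xi - b_old for xi, yi in zip(x, y))  # dL/db at the current b
    b_new = b_old - lr * slope
    b_old = b_new
```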
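
A self-contained version of the loop, as a minimal sketch. The data, starting value, learning rate, and epoch count are all assumptions chosen for illustration, and `history` is a helper list added here to show the path of `b`.

```python
# Hypothetical toy data and hyperparameters, chosen only for illustration.
x = [1.0, 2.0, 3.0, 4.0]
y = [105.0, 182.0, 262.0, 339.0]
m = 78.35      # slope held fixed
b = 0.0        # starting guess for the intercept
lr = 0.1       # learning rate
epochs = 20

history = []
for epoch in range(epochs):
    # Recompute the slope of the loss at the current b on every pass
    slope = -2 * sum(yi - m * xi - b for xi, yi in zip(x, y))
    b = b - lr * slope
    history.append(b)

print(history[:3], b)  # b settles near the mean residual of this toy data
```

With the un-normalized sum of squares, each update multiplies the distance to the optimum by `1 - 2 * lr * n`, so this loop only converges when `lr` is below `1 / n`. Here `n = 4` and `lr = 0.1`, which satisfies that bound.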

# Mathematics Behind Gradient Descent for Linear Regression

## 1. Model Definition

For Simple Linear Regression:

ŷ = mx + b

Where:
- m = slope  
- b = intercept  

---

## 2. Loss Function (Mean Squared Error)

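We use the Mean Squared Error (MSE) as the loss function. It is the sum of squared errors from the single-parameter derivation divided by n, which is why the gradients below carry a factor of 1/n: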

L(m, b) = (1/n) Σ (yi − (mxi + b))²

Where:
- n = number of data points  
- yi = actual value  
- mxi + b = predicted value  
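
The loss can be written as a short function. This is a sketch only: the name `mse` and the sample points are assumptions made for the example.

```python
def mse(m, b, xs, ys):
    """Mean squared error of the line y = m*x + b on the points (xs, ys)."""
    n = len(xs)
    return sum((yi - (m * xi + b)) ** 2 for xi, yi in zip(xs, ys)) / n


# Hypothetical points lying exactly on y = 2x + 1
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

print(mse(2.0, 1.0, xs, ys))  # 0.0, the line fits exactly
print(mse(2.0, 0.0, xs, ys))  # 1.0, every prediction is off by 1
```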

---

## 3. Partial Derivative with respect to m

Differentiate L(m, b) with respect to m:

∂L/∂m = ∂/∂m [ (1/n) Σ (yi − (mxi + b))² ]

Using the chain rule:

∂L/∂m = (1/n) Σ 2(yi − (mxi + b)) (−xi)

Simplifying:

∂L/∂m = (−2/n) Σ xi (yi − (mxi + b))

---

## 4. Partial Derivative with respect to b

Differentiate L(m, b) with respect to b:

∂L/∂b = ∂/∂b [ (1/n) Σ (yi − (mxi + b))² ]

Using the chain rule:

∂L/∂b = (1/n) Σ 2(yi − (mxi + b)) (−1)

Simplifying:

∂L/∂b = (−2/n) Σ (yi − (mxi + b))
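
Both partial derivatives translate directly into code. The function name `gradients` and the sample points below are hypothetical, used only to illustrate the two formulas.

```python
def gradients(m, b, xs, ys):
    """Return (dL/dm, dL/db) for the MSE loss."""
    n = len(xs)
    errors = [yi - (m * xi + b) for xi, yi in zip(xs, ys)]
    dL_dm = (-2 / n) * sum(xi * e for xi, e in zip(xs, errors))
    dL_db = (-2 / n) * sum(errors)
    return dL_dm, dL_db


# Hypothetical points lying exactly on y = 2x + 1
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

print(gradients(2.0, 1.0, xs, ys))  # both gradients vanish at the exact fit
print(gradients(0.0, 0.0, xs, ys))  # both negative: m and b should increase
```

Negative gradients at m = 0, b = 0 mean the loss decreases as either parameter grows, which is the direction gradient descent will take.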

---

## 5. Gradient Descent Update Rule

Gradient Descent updates parameters using:

θ := θ − η (∂L/∂θ)

So for m:

m := m − η (∂L/∂m)

m := m − η [ (−2/n) Σ xi (yi − (mxi + b)) ]

m := m + (2η/n) Σ xi (yi − (mxi + b))

For b:

b := b − η (∂L/∂b)

b := b − η [ (−2/n) Σ (yi − (mxi + b)) ]

b := b + (2η/n) Σ (yi − (mxi + b))

---

## 6. Final Update Equations

m := m + (2η/n) Σ xi (yi − ŷi)

b := b + (2η/n) Σ (yi − ŷi)

Where:

ŷi = mxi + b  
η = learning rate  
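
The two update equations can be combined into a small batch gradient descent routine. This is a minimal sketch: the function name `fit_line`, the default learning rate and epoch count, and the sample points are all assumptions.

```python
def fit_line(xs, ys, lr=0.05, epochs=2000):
    """Fit y = m*x + b by batch gradient descent on the MSE loss."""
    m, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        errors = [yi - (m * xi + b) for xi, yi in zip(xs, ys)]
        # Compute both steps from the same (old) m and b before updating either
        m_step = (2 * lr / n) * sum(xi * e for xi, e in zip(xs, errors))
        b_step = (2 * lr / n) * sum(errors)
        m += m_step
        b += b_step
    return m, b


# Hypothetical points lying exactly on y = 2x + 1
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
m, b = fit_line(xs, ys)
print(m, b)  # approaches m = 2, b = 1
```

Computing the errors once per epoch keeps the two updates simultaneous, so the new m is not used when updating b.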

---

## Interpretation

- (yi − ŷi) represents the prediction error.
- The gradient (∂L/∂m, ∂L/∂b) points in the direction of steepest increase of the loss.
- Subtracting the gradient moves the parameters in the direction of steepest decrease.
- The learning rate η controls how large each update step is.

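The updates are repeated until the loss stops decreasing appreciably. Because the MSE of a linear model is convex in m and b, the minimum reached with a sufficiently small η is the global one, and the resulting m and b define the best-fit line.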
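
To illustrate the role of the learning rate, the sketch below runs the same updates with two values of η on hypothetical data. The function name `final_loss` and both learning rates are assumptions picked for this example, not recommended settings.

```python
def final_loss(lr, epochs=50):
    """Run gradient descent from m = b = 0 and return the final MSE."""
    xs = [0.0, 1.0, 2.0, 3.0]   # hypothetical points on y = 2x + 1
    ys = [1.0, 3.0, 5.0, 7.0]
    n = len(xs)
    m, b = 0.0, 0.0
    for _ in range(epochs):
        errors = [yi - (m * xi + b) for xi, yi in zip(xs, ys)]
        grad_m = (-2 / n) * sum(xi * e for xi, e in zip(xs, errors))
        grad_b = (-2 / n) * sum(errors)
        m, b = m - lr * grad_m, b - lr * grad_b
    return sum((yi - (m * xi + b)) ** 2 for xi, yi in zip(xs, ys)) / n


print(final_loss(0.05))  # small step: the loss shrinks toward 0
print(final_loss(0.3))   # step too large for this data: the loss grows instead
```

The starting loss at m = b = 0 is 21 for these points, so a final value above that indicates the updates overshot and diverged.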


# Gradient Descent Update Equations for m and b

## Model

ŷ = mx + b  

## Loss Function (Mean Squared Error)

L(m, b) = (1/n) Σ (yi − ŷi)²

---

## Gradients

∂L/∂m = (−2/n) Σ xi (yi − ŷi)

∂L/∂b = (−2/n) Σ (yi − ŷi)

---

## Final Update Rules

Using the Gradient Descent rule:

θ := θ − η (∂L/∂θ)

### Update for m

m := m − η (∂L/∂m)

m := m − η [ (−2/n) Σ xi (yi − ŷi) ]

m := m + (2η/n) Σ xi (yi − ŷi)

---

### Update for b

b := b − η (∂L/∂b)

b := b − η [ (−2/n) Σ (yi − ŷi) ]

b := b + (2η/n) Σ (yi − ŷi)

---

## Compact Final Form

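m_new = m_old + (2η/n) Σ xi (yi − (m_old xi + b_old))

b_new = b_old + (2η/n) Σ (yi − (m_old xi + b_old))

Both sums use the old values m_old and b_old, so m and b are updated simultaneously.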

Where:
- η = learning rate  
- n = number of samples  
- (yi − ŷi) = prediction error  

These equations are applied iteratively until convergence.
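
"Until convergence" needs a concrete stopping rule in practice. One common choice, sketched below, is to stop when the loss changes by less than a small tolerance between epochs. The function name `fit_until_converged` and the parameters `tol` and `max_epochs` are hypothetical, and other criteria (such as a small gradient norm) would work as well.

```python
def fit_until_converged(xs, ys, lr=0.05, tol=1e-12, max_epochs=100_000):
    """Gradient descent that stops once the MSE changes by less than tol."""
    n = len(xs)
    m, b = 0.0, 0.0
    prev_loss = float("inf")
    for epoch in range(max_epochs):
        errors = [yi - (m * xi + b) for xi, yi in zip(xs, ys)]
        loss = sum(e * e for e in errors) / n
        if abs(prev_loss - loss) < tol:
            break  # loss has stopped changing: treat as converged
        prev_loss = loss
        m += (2 * lr / n) * sum(xi * e for xi, e in zip(xs, errors))
        b += (2 * lr / n) * sum(errors)
    return m, b, epoch


# Hypothetical points lying exactly on y = 2x + 1
m, b, epochs_used = fit_until_converged([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
print(m, b, epochs_used)
```

The `max_epochs` cap guards against a learning rate that never lets the loss settle.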
