## **Gradient Descent: Step-by-Step Mathematical Example**  
Gradient Descent is an **iterative optimization algorithm** used to minimize the error in Linear Regression by adjusting the parameters \( m \) (slope) and \( b \) (intercept) step by step.

---

## **1. Problem Statement**  
We have a small dataset:

| Hours Studied (\( X \)) | Exam Score (\( Y \)) |
|------------------|------------------|
| 1 | 50 |
| 2 | 55 |
| 3 | 65 |

We assume that the relationship between \( X \) and \( Y \) follows the equation:

\[
Y = mX + b
\]

Our goal is to find the **best values of \( m \) and \( b \)** using **Gradient Descent**.

---

## **2. Define the Cost Function**
We use **Mean Squared Error (MSE)**, scaled by one half, as the cost function:

\[
J(m, b) = \frac{1}{2n} \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2
\]

Where:
- \( n \) = Number of training samples
- \( m \), \( b \) = Slope and intercept being optimized
- \( Y_i \) = Actual target value
- \( \hat{Y}_i \) = Predicted value (\( mX_i + b \))

We use **\( \frac{1}{2n} \)** instead of \( \frac{1}{n} \) so that the factor of 2 produced by differentiating the square cancels. The scaling does not change where the minimum lies.

---

## **3. Compute Partial Derivatives**
To minimize the cost function, we compute the **gradients (partial derivatives)** of \( J(m, b) \). Since \( \hat{Y}_i = mX_i + b \), the chain rule gives:

1. **Gradient w.r.t. \( m \) (slope)**:

\[
\frac{\partial J}{\partial m} = -\frac{1}{n} \sum_{i=1}^{n} X_i (Y_i - \hat{Y}_i)
\]

2. **Gradient w.r.t. \( b \) (intercept)**:

\[
\frac{\partial J}{\partial b} = -\frac{1}{n} \sum_{i=1}^{n} (Y_i - \hat{Y}_i)
\]
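
The cost and its two partial derivatives can be sketched in a few lines of NumPy. The function names `cost` and `gradients` and the step size `eps` are illustrative choices, not part of the derivation. The finite-difference comparison at the end is only a sanity check that the analytic formulas agree with the cost function.

```python
import numpy as np

# Dataset from the problem statement
X = np.array([1.0, 2.0, 3.0])
Y = np.array([50.0, 55.0, 65.0])

def cost(m, b):
    """Half mean squared error J(m, b) for the line Y = mX + b."""
    residuals = Y - (m * X + b)
    return np.sum(residuals ** 2) / (2 * len(X))

def gradients(m, b):
    """Analytic partial derivatives of J with respect to m and b."""
    residuals = Y - (m * X + b)
    dm = -np.mean(X * residuals)
    db = -np.mean(residuals)
    return dm, db

# Central finite differences as an independent check (eps is an arbitrary small step)
eps = 1e-6
m0, b0 = 0.0, 0.0
dm_numeric = (cost(m0 + eps, b0) - cost(m0 - eps, b0)) / (2 * eps)
db_numeric = (cost(m0, b0 + eps) - cost(m0, b0 - eps)) / (2 * eps)

dm, db = gradients(m0, b0)
print("dJ/dm analytic vs numeric:", dm, dm_numeric)
print("dJ/db analytic vs numeric:", db, db_numeric)
```

If the two columns disagree by more than rounding noise, the derivative formula or its sign is the first place to look.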

---

## **4. Gradient Descent Algorithm**
We update \( m \) and \( b \) iteratively using:

\[
m = m - \alpha \cdot \frac{\partial J}{\partial m}
\]

\[
b = b - \alpha \cdot \frac{\partial J}{\partial b}
\]

Where:
- \( \alpha \) = **Learning rate** (controls step size)
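
To see how the learning rate affects the update rule, the sketch below applies 20 updates with two step sizes and compares the resulting cost. The helper `run` and the large value `alpha = 0.5` are assumptions chosen for illustration only; the worked example in this document uses \( \alpha = 0.01 \).

```python
import numpy as np

X = np.array([1.0, 2.0, 3.0])
Y = np.array([50.0, 55.0, 65.0])

def cost(m, b):
    """Half mean squared error for the line Y = mX + b."""
    return np.sum((Y - (m * X + b)) ** 2) / (2 * len(X))

def run(alpha, steps):
    """Apply the update rule `steps` times from m = b = 0 and return the final cost."""
    m, b = 0.0, 0.0
    for _ in range(steps):
        residuals = Y - (m * X + b)
        dm = -np.mean(X * residuals)
        db = -np.mean(residuals)
        # Simultaneous update: both gradients were computed from the old m and b
        m, b = m - alpha * dm, b - alpha * db
    return cost(m, b)

initial_cost = cost(0.0, 0.0)
small_step_cost = run(alpha=0.01, steps=20)  # cost goes down
large_step_cost = run(alpha=0.5, steps=20)   # cost blows up on this dataset
print(initial_cost, small_step_cost, large_step_cost)
```

On this dataset the small step lowers the cost steadily, while the large step overshoots the minimum on every update and the cost grows.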

---

## **5. Step-by-Step Example Calculation**
### **Step 1: Initialize Values**
We start with:
- \( m = 0 \), \( b = 0 \)
- Learning rate \( \alpha = 0.01 \)

### **Step 2: Compute Predictions**
For each \( X \):

\[
\hat{Y} = mX + b
\]

Since **initially \( m = 0 \) and \( b = 0 \):**
- \( \hat{Y}_1 = 0(1) + 0 = 0 \)
- \( \hat{Y}_2 = 0(2) + 0 = 0 \)
- \( \hat{Y}_3 = 0(3) + 0 = 0 \)

### **Step 3: Compute Gradients**
**Compute \( \frac{\partial J}{\partial m} \):**
\[
\frac{\partial J}{\partial m} = -\frac{1}{3} [(1(50 - 0)) + (2(55 - 0)) + (3(65 - 0))]
\]

\[
= -\frac{1}{3} [(50) + (110) + (195)]
\]

\[
= -\frac{1}{3} (355) = -118.33
\]

**Compute \( \frac{\partial J}{\partial b} \):**
\[
\frac{\partial J}{\partial b} = -\frac{1}{3} [(50 - 0) + (55 - 0) + (65 - 0)]
\]

\[
= -\frac{1}{3} (50 + 55 + 65) = -\frac{1}{3} (170) = -56.67
\]

### **Step 4: Update Parameters**
Using \( \alpha = 0.01 \):

\[
m = 0 - (0.01 \times -118.33) = 1.1833
\]

\[
b = 0 - (0.01 \times -56.67) = 0.5667
\]
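
The hand calculation above can be checked with a short script. This is a minimal sketch; the variable names `m_new` and `b_new` are illustrative.

```python
import numpy as np

X = np.array([1.0, 2.0, 3.0])
Y = np.array([50.0, 55.0, 65.0])
m, b, alpha = 0.0, 0.0, 0.01

Y_pred = m * X + b                   # all zeros at the start
dm = -np.mean(X * (Y - Y_pred))      # -(50 + 110 + 195) / 3
db = -np.mean(Y - Y_pred)            # -(50 + 55 + 65) / 3
m_new = m - alpha * dm
b_new = b - alpha * db

print(round(dm, 2), round(db, 2))        # -118.33 -56.67
print(round(m_new, 4), round(b_new, 4))  # 1.1833 0.5667
```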

---

## **6. Next Iteration**
Using updated values \( m = 1.1833 \), \( b = 0.5667 \), we repeat the process:

1. Compute predictions
2. Compute gradients
3. Update parameters

After **multiple iterations**, \( m \) and \( b \) will converge to the **optimal values**.
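
As a rough illustration of the repeated process, the sketch below runs the first five iterations and records the parameters and cost after each update. The `history` list is a hypothetical helper for inspection, not part of the algorithm.

```python
import numpy as np

X = np.array([1.0, 2.0, 3.0])
Y = np.array([50.0, 55.0, 65.0])
m, b, alpha = 0.0, 0.0, 0.01

history = []  # (iteration, m, b, cost) after each update
for i in range(1, 6):
    residuals = Y - (m * X + b)
    dm = -np.mean(X * residuals)
    db = -np.mean(residuals)
    m, b = m - alpha * dm, b - alpha * db
    cost = np.sum((Y - (m * X + b)) ** 2) / (2 * len(X))
    history.append((i, m, b, cost))
    print(f"iter {i}: m = {m:.4f}, b = {b:.4f}, J = {cost:.2f}")
```

On this dataset the second update gives roughly \( m \approx 2.30 \) and \( b \approx 1.10 \), and the cost falls at every step.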

---

## **7. Final Result**
With enough iterations, the algorithm converges to the least-squares solution, which for this dataset is:

\[
m = 7.5, \quad b = \frac{125}{3} \approx 41.67
\]

So, the **best-fit line** is:

\[
Y = 7.5X + 41.67
\]
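
One way to cross-check the converged values is to compare them with a direct least-squares fit. The sketch below uses `np.polyfit` for that purpose; it is a separate closed-form method, not gradient descent.

```python
import numpy as np

X = np.array([1.0, 2.0, 3.0])
Y = np.array([50.0, 55.0, 65.0])

# Degree-1 least-squares fit; coefficients are returned highest power first
slope, intercept = np.polyfit(X, Y, 1)
print(round(slope, 4), round(intercept, 4))  # 7.5 41.6667
```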

---

## **8. Python Implementation**
```python
import numpy as np

# Dataset
X = np.array([1, 2, 3])
Y = np.array([50, 55, 65])

# Initialize parameters
m = 0.0
b = 0.0
alpha = 0.01  # Learning rate
epochs = 20000  # Number of iterations; 1000 is too few to converge at this learning rate
n = len(X)  # Number of data points

# Gradient Descent Loop
for _ in range(epochs):
    Y_pred = m * X + b  # Predictions
    dm = (-1/n) * np.sum(X * (Y - Y_pred))  # Derivative w.r.t m (same formula as Section 3)
    db = (-1/n) * np.sum(Y - Y_pred)  # Derivative w.r.t b (same formula as Section 3)
    m = m - alpha * dm  # Update m
    b = b - alpha * db  # Update b

# Print final values
print("Final Slope (m):", round(m, 5))
print("Final Intercept (b):", round(b, 5))

# Make a prediction for 3 hours of study
print("Prediction for 3 hours of study:", round(m * 3 + b, 2))
```

---

## **9. Key Takeaways**
✅ **Gradient Descent is an iterative approach** to optimize \( m \) and \( b \).  
✅ The **cost function (MSE) measures how far the predictions are from the actual values** (lower is better).  
✅ **Partial derivatives (gradients) guide parameter updates**.  
✅ **Learning rate \( \alpha \) controls step size** (too high = unstable, too low = slow).  
✅ With enough iterations, the parameters **converge to the best-fit line**.  


