## Types of Gradient Descent

1) Batch Gradient Descent (Full Batch GD)
2) Stochastic Gradient Descent (SGD)
3) Mini-Batch Gradient Descent

| Method        | Data Used Per Update       | Number of Updates per Epoch |
| ------------- | -------------------------- | --------------------------- |
| Batch GD      | Entire dataset (n samples) | 1                           |
| SGD           | 1 sample                   | n                           |
| Mini-Batch GD | b samples                  | ⌈n / b⌉                     |
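
The update counts in the table can be checked with a few lines of Python. This is a minimal sketch: the helper name `updates_per_epoch` and the dataset size are hypothetical, and it assumes the last mini-batch may be smaller than `b` when `b` does not divide `n` evenly.

```python
import math

def updates_per_epoch(n: int, batch_size: int) -> int:
    """Number of parameter updates in one full pass over n samples."""
    # ceil covers the case where the final mini-batch is smaller than batch_size
    return math.ceil(n / batch_size)

n = 1000  # hypothetical dataset size

print(updates_per_epoch(n, n))    # Batch GD: one update per epoch
print(updates_per_epoch(n, 1))    # SGD: one update per sample
print(updates_per_epoch(n, 32))   # Mini-batch GD with b = 32
```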


## Mathematical View

$$
\textbf{Full Cost Function:}
\quad
J(\theta) = \frac{1}{n} \sum_{i=1}^{n} L_i(\theta)
$$

$$
\textbf{Batch Gradient Descent:}
\quad
\nabla J(\theta) = \frac{1}{n} \sum_{i=1}^{n} \nabla L_i(\theta)
$$

$$
\textbf{Stochastic Gradient Descent (SGD):}
\quad
\nabla J(\theta) \approx \nabla L_i(\theta),
\quad i \text{ drawn at random from } \{1, \dots, n\}
$$

$$
\textbf{Mini-Batch Gradient Descent:}
\quad
\nabla J(\theta) \approx \frac{1}{b} \sum_{i \in \mathcal{B}} \nabla L_i(\theta),
\quad \mathcal{B} \subseteq \{1, \dots, n\},\ |\mathcal{B}| = b
$$
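
The three gradient expressions above can be compared numerically. The sketch below assumes a squared-error loss for a linear model (the cost used in the derivations later in these notes) and synthetic data. The names `grad` and `Xb` and the data-generating coefficients are illustrative choices, not part of the original notes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (an assumption for illustration): y ≈ 1 + 2*x1 - 3*x2
n = 200
X = rng.normal(size=(n, 2))
y = 1.0 + 2.0 * X[:, 0] - 3.0 * X[:, 1] + 0.1 * rng.normal(size=n)

Xb = np.column_stack([np.ones(n), X])  # prepend a column of ones for the bias
theta = np.zeros(3)

def grad(Xs, ys, theta):
    """Average gradient of the squared error over the rows of Xs."""
    residual = ys - Xs @ theta
    return -2.0 / len(ys) * (Xs.T @ residual)

# Batch GD: exact gradient from all n samples
full_grad = grad(Xb, y, theta)

# SGD: noisy estimate from one randomly chosen sample
i = rng.integers(n)
sgd_grad = grad(Xb[i:i + 1], y[i:i + 1], theta)

# Mini-batch GD: estimate from b randomly chosen samples
b = 32
idx = rng.choice(n, size=b, replace=False)
mini_grad = grad(Xb[idx], y[idx], theta)

# Averaging all n per-sample gradients recovers the full gradient
per_sample = np.array([grad(Xb[k:k + 1], y[k:k + 1], theta) for k in range(n)])
print(np.allclose(per_sample.mean(axis=0), full_grad))
```

The single-sample and mini-batch gradients are estimates of `full_grad`: they are cheaper to compute but vary with the samples drawn.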


Batch Gradient Descent is typically used when:

- The dataset is small
- The loss function is convex
- Stable and deterministic convergence is preferred

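### Mathematical Formulation of Batch GD

#### Batch GD Update

In each epoch, all parameters are updated once, using gradients computed over the entire dataset at the current parameter values ($\eta$ is the learning rate):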

$$
\beta_0 := \beta_0 - \eta \frac{\partial J}{\partial \beta_0}
$$

$$
\beta_1 := \beta_1 - \eta \frac{\partial J}{\partial \beta_1}
$$

$$
\beta_2 := \beta_2 - \eta \frac{\partial J}{\partial \beta_2}
$$
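
The update rule can be sketched as a short NumPy loop. This is a minimal illustration, assuming the two-feature linear model and MSE cost used in these notes. The function name `batch_gd`, the learning rate, the epoch count, and the synthetic data are all assumptions made for the example.

```python
import numpy as np

def batch_gd(x1, x2, y, eta=0.1, epochs=500):
    """Batch gradient descent for y_hat = b0 + b1*x1 + b2*x2 with MSE cost."""
    b0 = b1 = b2 = 0.0
    n = len(y)
    for _ in range(epochs):
        y_hat = b0 + b1 * x1 + b2 * x2
        error = y - y_hat
        # Gradients of the MSE cost, all evaluated at the current parameters
        g0 = -2.0 / n * np.sum(error)
        g1 = -2.0 / n * np.sum(x1 * error)
        g2 = -2.0 / n * np.sum(x2 * error)
        # Simultaneous update: every gradient was computed before any parameter changed
        b0 -= eta * g0
        b1 -= eta * g1
        b2 -= eta * g2
    return b0, b1, b2

# Synthetic, noise-free data so the true parameters are recoverable
rng = np.random.default_rng(42)
x1 = rng.normal(size=100)
x2 = rng.normal(size=100)
y = 4.0 + 1.5 * x1 - 2.0 * x2

b0, b1, b2 = batch_gd(x1, x2, y)
print(round(b0, 3), round(b1, 3), round(b2, 3))
```

Because every epoch uses the same full dataset, repeated runs with the same starting point and learning rate follow the same path.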


## Derivation of ∂J / ∂β₀ (Batch Gradient Descent)

### Step 1: Model

For multiple linear regression:

$$
\hat{y}_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i}
$$

---

### Step 2: Cost Function (MSE)

$$
J(\beta_0, \beta_1, \beta_2)
=
\frac{1}{n}
\sum_{i=1}^{n}
\left(
y_i - \hat{y}_i
\right)^2
$$

Substituting the model:

$$
J =
\frac{1}{n}
\sum_{i=1}^{n}
\left(
y_i -
(\beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i})
\right)^2
$$

---

### Step 3: Differentiate with respect to β₀

Let the error be:

$$
e_i = y_i - \hat{y}_i
$$

Then:

$$
J = \frac{1}{n} \sum_{i=1}^{n} e_i^2
$$

Now differentiate:

$$
\frac{\partial J}{\partial \beta_0}
=
\frac{1}{n}
\sum_{i=1}^{n}
2 e_i
\frac{\partial e_i}{\partial \beta_0}
$$

Since:

$$
e_i = y_i - (\beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i})
$$

we get:

$$
\frac{\partial e_i}{\partial \beta_0} = -1
$$

---

### Step 4: Final Result

$$
\frac{\partial J}{\partial \beta_0}
=
-\frac{2}{n}
\sum_{i=1}^{n}
\left(
y_i - \hat{y}_i
\right)
$$

---

### Note

- The gradient is proportional to the **average error**: it equals $-2$ times the mean of $y_i - \hat{y}_i$
- If predictions are too small on average → gradient becomes negative → β₀ increases
- If predictions are too large on average → gradient becomes positive → β₀ decreases

This is how the bias term corrects itself during Batch Gradient Descent.
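
The derived formula can be sanity-checked against a finite-difference approximation of the cost. The sketch below uses hypothetical helper names (`cost`, `grad_b0`) and synthetic data. The parameters are chosen so that the slopes are already correct and only the bias is too small.

```python
import numpy as np

def cost(b0, b1, b2, x1, x2, y):
    """MSE cost J for the two-feature linear model."""
    return np.mean((y - (b0 + b1 * x1 + b2 * x2)) ** 2)

def grad_b0(b0, b1, b2, x1, x2, y):
    """Analytic dJ/db0 = -(2/n) * sum(y_i - y_hat_i)."""
    return -2.0 * np.mean(y - (b0 + b1 * x1 + b2 * x2))

rng = np.random.default_rng(1)
x1, x2 = rng.normal(size=50), rng.normal(size=50)
y = 3.0 + 0.5 * x1 + 2.0 * x2

# Slopes are correct, bias is too small, so every prediction is too small
b0, b1, b2 = 0.0, 0.5, 2.0

# Central finite difference in the b0 direction
h = 1e-6
numeric = (cost(b0 + h, b1, b2, x1, x2, y) - cost(b0 - h, b1, b2, x1, x2, y)) / (2 * h)
analytic = grad_b0(b0, b1, b2, x1, x2, y)

print(np.isclose(numeric, analytic, atol=1e-5))
print(analytic < 0)  # negative gradient, so b0 - eta * gradient increases b0
```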


## Derivation of ∂J / ∂β₁ (Batch Gradient Descent)

### Step 1: Model

$$
\hat{y}_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i}
$$

---

### Step 2: Cost Function (MSE)

$$
J(\beta_0, \beta_1, \beta_2)
=
\frac{1}{n}
\sum_{i=1}^{n}
\left(
y_i - \hat{y}_i
\right)^2
$$

---

### Step 3: Define Error

$$
e_i = y_i - \hat{y}_i
$$

So,

$$
J = \frac{1}{n} \sum_{i=1}^{n} e_i^2
$$

---

### Step 4: Differentiate with respect to β₁

Using chain rule:

$$
\frac{\partial J}{\partial \beta_1}
=
\frac{1}{n}
\sum_{i=1}^{n}
2 e_i
\frac{\partial e_i}{\partial \beta_1}
$$

Now compute:

$$
e_i = y_i - (\beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i})
$$

So,

$$
\frac{\partial e_i}{\partial \beta_1} = -x_{1i}
$$

---

### Step 5: Substitute

$$
\frac{\partial J}{\partial \beta_1}
=
\frac{1}{n}
\sum_{i=1}^{n}
2 e_i (-x_{1i})
$$

---

### Final Result

$$
\frac{\partial J}{\partial \beta_1}
=
-\frac{2}{n}
\sum_{i=1}^{n}
x_{1i}
\left(
y_i - \hat{y}_i
\right)
$$

---

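### Notes

- Each error term is weighted by the feature value $x_{1i}$
- If the errors are large where CGPA ($x_1$) is large → β₁ receives a larger update
- Features with larger numeric values produce larger gradients, and therefore larger updates for the same learning rate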

This is why **feature scaling is important** in Gradient Descent.
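
The effect of feature scale on the gradient can be seen in a small experiment. This sketch simplifies to a one-feature model for brevity. The feature ranges, the factor of 1000, and the helper names `grad_b1` and `standardize` are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 100
x_small = rng.uniform(0.0, 1.0, size=n)   # hypothetical feature on a 0-1 scale
x_large = 1000.0 * x_small                # same information in 1000x larger units
y = 2.0 + 3.0 * x_small

def grad_b1(x, y, b0=0.0, b1=0.0):
    """dJ/db1 = -(2/n) * sum(x_i * (y_i - y_hat_i)) for a one-feature model."""
    error = y - (b0 + b1 * x)
    return -2.0 / len(y) * np.sum(x * error)

g_small = grad_b1(x_small, y)
g_large = grad_b1(x_large, y)
print(g_large / g_small)  # the gradient grows with the feature's units

def standardize(x):
    """Rescale to zero mean and unit variance."""
    return (x - x.mean()) / x.std()

# After standardization the two versions of the feature coincide,
# so they would produce the same gradients.
print(np.allclose(standardize(x_small), standardize(x_large)))
```

With unscaled features, a learning rate small enough to keep the large-valued feature stable can make progress on the other parameters very slow.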


## Derivation of ∂J / ∂β₂ (Batch Gradient Descent)

### Step 1: Model

$$
\hat{y}_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i}
$$

---

### Step 2: Cost Function

$$
J =
\frac{1}{n}
\sum_{i=1}^{n}
\left(
y_i - \hat{y}_i
\right)^2
$$

---

### Step 3: Define Error

$$
e_i = y_i - \hat{y}_i
$$

So,

$$
J = \frac{1}{n} \sum_{i=1}^{n} e_i^2
$$

---

### Step 4: Differentiate with respect to β₂

Using chain rule:

$$
\frac{\partial J}{\partial \beta_2}
=
\frac{1}{n}
\sum_{i=1}^{n}
2 e_i
\frac{\partial e_i}{\partial \beta_2}
$$

Now,

$$
e_i = y_i - (\beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i})
$$

So,

$$
\frac{\partial e_i}{\partial \beta_2} = -x_{2i}
$$

---

### Final Result

$$
\frac{\partial J}{\partial \beta_2}
=
-\frac{2}{n}
\sum_{i=1}^{n}
x_{2i}
\left(
y_i - \hat{y}_i
\right)
$$


## Final Pattern

For any feature $x_j$:

$$
\frac{\partial J}{\partial \beta_j}
=
-\frac{2}{n}
\sum_{i=1}^{n}
x_{ji}
\left(
y_i - \hat{y}_i
\right)
$$

And for the bias:

$$
\frac{\partial J}{\partial \beta_0}
=
-\frac{2}{n}
\sum_{i=1}^{n}
\left(
y_i - \hat{y}_i
\right)
$$
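
Because the bias and feature gradients share one pattern, they can be computed together in vectorized form by treating the bias as the coefficient of a constant feature equal to 1. The sketch below is one way to do this in NumPy. The function name `mse_gradient` and the random test data are assumptions for the example.

```python
import numpy as np

def mse_gradient(X, y, beta):
    """Gradient of the MSE cost with respect to beta = [b0, b1, ..., bp].

    X has shape (n, p). A column of ones is prepended so the bias gradient
    comes out of the same formula as the feature gradients.
    """
    n = X.shape[0]
    Xb = np.column_stack([np.ones(n), X])
    error = y - Xb @ beta
    return -2.0 / n * (Xb.T @ error)

rng = np.random.default_rng(3)
X = rng.normal(size=(20, 2))
y = rng.normal(size=20)
beta = np.array([0.5, -1.0, 2.0])

g = mse_gradient(X, y, beta)

# Compare against the per-parameter formulas written out one at a time
error = y - (beta[0] + beta[1] * X[:, 0] + beta[2] * X[:, 1])
expected = np.array([
    -2.0 / 20 * np.sum(error),
    -2.0 / 20 * np.sum(X[:, 0] * error),
    -2.0 / 20 * np.sum(X[:, 1] * error),
])
print(np.allclose(g, expected))
```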
