### What is Gradient Descent?

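Gradient descent is an iterative optimization algorithm: it repeatedly adjusts a model's parameters in the direction that reduces the loss. It is used in: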
 - Linear Regression
 - Logistic Regression
 - Neural Networks
 - Deep Learning
 - Transformers
 - LLMs
 - Basically everything that learns weights

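#### Why do we need gradient descent?
1. Matrix inversion in the closed-form (OLS) solution is computationally expensive.
2. The closed-form solution has high memory requirements.
3. The matrix to invert may be non-invertible (singular).
4. Gradient descent scales better to large datasets.
5. Gradient descent generalizes to other machine learning models that have no closed-form solution.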
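
The non-invertibility point can be illustrated with a small sketch. It assumes simple linear regression with an intercept, so the matrix to invert is the 2x2 matrix \( X^T X \) built from rows \( [1, x_i] \). The helper names `normal_matrix` and `det2` and both data lists are hypothetical, made up for illustration.

```python
def normal_matrix(xs):
    """Return X^T X for a design matrix X whose rows are [1, x_i]."""
    n = len(xs)
    sum_x = sum(xs)
    sum_xx = sum(x * x for x in xs)
    return [[n, sum_x], [sum_x, sum_xx]]


def det2(a):
    """Determinant of a 2x2 matrix given as nested lists."""
    return a[0][0] * a[1][1] - a[0][1] * a[1][0]


varied = [1.0, 2.0, 3.0, 4.0]    # hypothetical feature values that differ
constant = [2.0, 2.0, 2.0, 2.0]  # hypothetical feature that never changes

print(det2(normal_matrix(varied)))    # nonzero, so the matrix can be inverted
print(det2(normal_matrix(constant)))  # zero, so the matrix is singular
```

When the determinant is zero the closed-form solution cannot be computed directly, while gradient descent never needs the inverse.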

[day50](https://github.com/DiwanshuG/MachineLearningJourney/blob/main/day50/02_Multiple_Linear_Regression_Mathematical_Derivation.ipynb)

### Intuition of the algorithm

Note: gradient descent is explained here through the lens of linear regression.

- J = loss function
$$
J = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2
$$

##### Given:

- \( n = 4 \)

$$
\hat{y}_i = m x_i + b
$$

##### Cost Function:

$$
J(m, b) = \sum_{i=1}^{4} \left( y_i - (m x_i + b) \right)^2
$$


##### Assumption:

For now, assume we already know the correct value of the slope \( m \).

Using Ordinary Least Squares (OLS), we found:

$$
m = 78.35
$$

Now, we will focus on optimizing only \( b \) while keeping \( m \) fixed.


##### Objective:

We need to find the value of \( b \) for which the loss function \( J(m, b) \) is minimum, 
while keeping \( m \) fixed.

In other words,

$$
\min_{b} \; J(m, b)
$$
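
As a rough sketch of this objective, the snippet below evaluates \( J(m, b) \) for a few candidate values of \( b \) with \( m \) fixed at 78.35. The four data points are hypothetical: they are not the original dataset, and were made up so that a slope of 78.35 fits them. The function name `cost` is also an assumption.

```python
# Hypothetical data: four (x, y) points made up for illustration only.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [105.35, 181.7, 260.05, 340.4]


def cost(m, b, xs, ys):
    """Sum of squared errors J(m, b) for the line y_hat = m * x + b."""
    return sum((y - (m * x + b)) ** 2 for x, y in zip(xs, ys))


m = 78.35  # slope quoted in the text, kept fixed

# Try a few intercepts and compare the resulting loss values.
for b in (0.0, 26.0, 50.0):
    print(f"b = {b:5.1f} -> J = {cost(m, b, xs, ys):.2f}")
```

For this made-up data the loss is smallest at \( b = 26 \) by construction. Trying values by hand like this does not scale, which is why a systematic update rule is needed.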


To see how \( J \) depends on \( b \), expand each squared term:

$$
\left( y_i - (m x_i + b) \right)^2
$$

First rewrite it as:

$$
\left( y_i - m x_i - b \right)^2
$$

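Now apply the identity below, with \( p = y_i - m x_i \) and \( q = b \) (the letters \( p, q \) avoid a clash with the intercept \( b \)):

$$
(p - q)^2 = p^2 - 2pq + q^2
$$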

So,

$$
\left( y_i - m x_i - b \right)^2
=
(y_i - m x_i)^2
- 2b (y_i - m x_i)
+ b^2
$$


Since \( m \) is fixed, the cost function \( J(m, b) \) becomes a function of only \( b \).

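After expansion and summing over all points, it takes the form of a quadratic function:

$$
J(b) = A b^2 + B b + C
$$

The coefficient of \( b^2 \) is \( A = n = 4 \), because each of the four squared terms contributes one \( b^2 \).

Since \( A > 0 \), the cost function is an upward-opening parabola in \( b \) with a single minimum.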
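
To make the quadratic form concrete: summing the expansion over all points gives \( A = n \), \( B = -2 \sum (y_i - m x_i) \) and \( C = \sum (y_i - m x_i)^2 \). The sketch below checks this numerically. The data points are hypothetical (made up so that a slope of 78.35 fits them), and the names `cost` and `quadratic` are assumptions.

```python
# Hypothetical data: four (x, y) points made up for illustration only.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [105.35, 181.7, 260.05, 340.4]
m = 78.35  # slope quoted in the text, kept fixed

# r_i = y_i - m * x_i is the part of each term that does not depend on b.
residuals = [y - m * x for x, y in zip(xs, ys)]

A = len(xs)                         # coefficient of b^2
B = -2 * sum(residuals)             # coefficient of b
C = sum(r ** 2 for r in residuals)  # constant term


def cost(b):
    """J(b) computed directly from the definition."""
    return sum((r - b) ** 2 for r in residuals)


def quadratic(b):
    """J(b) computed from the expanded form A*b^2 + B*b + C."""
    return A * b * b + B * b + C


# Both columns should agree up to floating-point rounding.
for b in (0.0, 10.0, 26.0, 40.0):
    print(f"b = {b:5.1f}  direct = {cost(b):10.4f}  expanded = {quadratic(b):10.4f}")

# The vertex of an upward-opening parabola sits at -B / (2A).
b_vertex = -B / (2 * A)
print("vertex at b =", round(b_vertex, 6))
```

The vertex formula gives the minimum directly here because the problem is one-dimensional. Gradient descent reaches the same point iteratively, which is what carries over to models where no such formula exists.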


#### Step 1:
Select a random starting value for \( b \).

As humans, we can look at the graph of the cost function and visually identify 
the direction in which the minimum lies.

But an algorithm cannot "see" the graph.

It does not know whether it should increase or decrease the value of \( b \).

So instead of visual intuition, it uses the gradient (derivative) 
to mathematically determine the direction of steepest increase of the loss function.

If the derivative is positive, the function is increasing → decrease \( b \).

If the derivative is negative, the function is decreasing → increase \( b \).

Therefore, the algorithm updates the parameter in the opposite direction of the gradient 
to move toward the minimum.


##### So, we compute the slope (derivative) at the current point and update the parameter accordingly.

At any given value of \( b \), we calculate:

$$
\frac{\partial J}{\partial b}
$$

- If the slope is positive → decrease \( b \)
- If the slope is negative → increase \( b \)
- If the slope is zero → we have reached the minimum

So, instead of seeing the graph, the algorithm uses the slope at the current point 
to decide the next step.
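
For the cost function used here, the slope can be written out explicitly. Differentiating each squared term with the chain rule (with \( m \) held fixed) gives:

$$
\frac{\partial J}{\partial b}
= \sum_{i=1}^{4} 2 \left( y_i - m x_i - b \right) \cdot (-1)
= -2 \sum_{i=1}^{4} \left( y_i - m x_i - b \right)
$$

Setting this to zero shows where the slope vanishes:

$$
b = \frac{1}{4} \sum_{i=1}^{4} \left( y_i - m x_i \right)
$$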


##### Gradient Descent Update Rule:

$$
b_{\text{new}} = b_{\text{old}} - \alpha \frac{\partial J}{\partial b}
$$


Where:

$$
\alpha = \text{learning rate (step size)}
$$

$$
\frac{\partial J}{\partial b} = \text{slope of the cost function at the current point}
$$
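
The update rule can be sketched as a few repeated steps. This is a minimal illustration, not a full implementation: the data points are hypothetical (made up so that a slope of 78.35 fits them), the starting value and learning rate are arbitrary choices, and the names `gradient` and `step` are assumptions.

```python
# Hypothetical data: four (x, y) points made up for illustration only.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [105.35, 181.7, 260.05, 340.4]
m = 78.35  # slope quoted in the text, kept fixed


def gradient(b):
    """dJ/db = -2 * sum(y_i - m*x_i - b) for the sum-of-squares cost."""
    return -2 * sum(y - m * x - b for x, y in zip(xs, ys))


def step(b, alpha):
    """One gradient descent update: b_new = b_old - alpha * dJ/db."""
    return b - alpha * gradient(b)


b = 0.0      # arbitrary starting value
alpha = 0.1  # arbitrary learning rate
history = [b]

for _ in range(5):
    b = step(b, alpha)
    history.append(b)

# A negative slope at b = 0 pushes b upward, toward the minimum.
for i, value in enumerate(history):
    print(f"step {i}: b = {value:.5f}, slope = {gradient(value):.5f}")
```

For this particular four-point cost, each update multiplies the distance to the minimum by \( 1 - 8\alpha \), so a learning rate of 0.25 or more makes the iterates oscillate or diverge, while smaller positive values converge.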


##### When to Stop Gradient Descent?

There are three common stopping criteria:

---

### 1. When the Gradient Becomes (Almost) Zero

$$
\frac{\partial J}{\partial b} \approx 0
$$

This means we are very close to the minimum and further updates will not significantly reduce the loss.

---

### 2. When the Change in Loss Becomes Very Small

$$
|J_{\text{new}} - J_{\text{old}}| < \epsilon
$$

Here \( \epsilon \) is a small positive threshold (tolerance) chosen in advance.

This means the loss is no longer decreasing meaningfully, so the algorithm has converged.

---

### 3. Maximum Number of Iterations Reached

Stop after a predefined number of iterations (e.g., 1000 or 10,000).

This prevents infinite loops and controls training time.
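
The three stopping criteria can be combined in one loop, as in the sketch below. It is only one possible arrangement: the data points are hypothetical (made up so that a slope of 78.35 fits them), and the function name `gradient_descent`, its parameter names, and all default thresholds are assumptions chosen for illustration.

```python
# Hypothetical data: four (x, y) points made up for illustration only.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [105.35, 181.7, 260.05, 340.4]
m = 78.35  # slope quoted in the text, kept fixed


def cost(b):
    """Sum of squared errors J(b) with m fixed."""
    return sum((y - (m * x + b)) ** 2 for x, y in zip(xs, ys))


def gradient(b):
    """dJ/db = -2 * sum(y_i - m*x_i - b)."""
    return -2 * sum(y - (m * x + b) for x, y in zip(xs, ys))


def gradient_descent(b=0.0, alpha=0.05, grad_tol=1e-6,
                     loss_tol=1e-9, max_iters=1000):
    """Return (b, updates_done, reason_for_stopping)."""
    loss = cost(b)
    for iteration in range(1, max_iters + 1):
        grad = gradient(b)
        # Criterion 1: the gradient is (almost) zero.
        if abs(grad) < grad_tol:
            return b, iteration - 1, "gradient"
        b = b - alpha * grad
        new_loss = cost(b)
        # Criterion 2: the loss barely changed after this update.
        if abs(new_loss - loss) < loss_tol:
            return b, iteration, "loss"
        loss = new_loss
    # Criterion 3: the iteration budget is used up.
    return b, max_iters, "max_iters"


b_final, steps, reason = gradient_descent()
print(f"b = {b_final:.5f} after {steps} updates (stopped by: {reason})")
```

Which criterion fires first depends on the thresholds and the learning rate, so the returned reason is useful for diagnosing a run that stopped too early or hit the iteration cap.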
