# **Gradient Descent**

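Gradient Descent is an iterative optimization technique: it minimizes a function by repeatedly taking small steps in the direction opposite to the gradient.

It is the backbone of nearly all deep learning training.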

## Linear Regression vs. Gradient Descent

It is common to confuse these two because they are often used together, but they are fundamentally different concepts.

| Feature | Linear Regression | Gradient Descent |
| :--- | :--- | :--- |
| **What is it?** | A statistical **Model** | An optimization **Algorithm** |
| **Analogy** | The **Car** | The **Engine** that moves the car |
| **Purpose** | Defines **what** we want to find (the best-fit line). | Defines **how** we find it (iterative improvement). |
| **Equation** | $y = mx + b$ | $m_{new} = m_{old} - \alpha \cdot \frac{\partial J}{\partial m}$ |

## 1. Linear Regression (The "What")
Linear regression is a problem statement. It says:

> "I have data points, and I want to find the straight line that is closest to all of them."

It defines the **Cost Function** (usually Mean Squared Error), which measures how "wrong" a specific line is. It does **not** inherently specify how to find the best line, just that we want the one with the least error.
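
To make the cost function concrete, here is a minimal sketch of Mean Squared Error for a line $y = mx + b$. The function name `mse_cost` and the toy data are assumptions chosen for illustration, not part of any library.

```python
import numpy as np

def mse_cost(m, b, x, y):
    """Mean Squared Error of the line y = m*x + b on the data (x, y)."""
    predictions = m * x + b
    return np.mean((y - predictions) ** 2)

# Hypothetical toy data that lies exactly on the line y = 2x + 1.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2.0 * x + 1.0

perfect_cost = mse_cost(2.0, 1.0, x, y)  # the true line: no error
worse_cost = mse_cost(1.0, 0.0, x, y)    # a worse line: larger error
```

The cost only scores a candidate line. Choosing which line to try next is left to the solver.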

## 2. Gradient Descent (The "How")
Gradient Descent is a generic method used to minimize functions. When applied to Linear Regression, it acts as the "solver".

### How it works:
1.  **Start Randomly**: Pick a random slope ($m$) and intercept ($b$).
2.  **Check Error**: See how bad the line is (calculate the Cost).
3.  **Calculate Gradient**: Compute the partial derivatives of the cost with respect to $m$ and $b$. They tell us which direction to move each parameter to reduce the error.
4.  **Take a Step**: Nudge $m$ and $b$ slightly in that direction, scaled by the learning rate ($\alpha$).
5.  **Repeat**: Do this many times (often thousands) until the error stops decreasing (convergence).
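
The five steps above can be sketched in a few lines of NumPy. This is an illustrative sketch rather than a reference implementation: the function name `gradient_descent`, the learning rate, the step count, and the toy data are all assumptions, and for simplicity it runs a fixed number of steps instead of checking for convergence. The cost is assumed to be Mean Squared Error.

```python
import numpy as np

def gradient_descent(x, y, learning_rate=0.05, n_steps=5000, seed=0):
    """Fit y = m*x + b by gradient descent on the Mean Squared Error."""
    rng = np.random.default_rng(seed)
    m, b = rng.normal(size=2)                    # 1. start randomly
    n = len(x)
    for _ in range(n_steps):                     # 5. repeat
        error = (m * x + b) - y                  # 2. residuals; cost = mean(error**2)
        grad_m = (2.0 / n) * np.sum(error * x)   # 3. dJ/dm
        grad_b = (2.0 / n) * np.sum(error)       #    dJ/db
        m -= learning_rate * grad_m              # 4. step against the gradient
        b -= learning_rate * grad_b
    return m, b

# Hypothetical toy data that lies exactly on the line y = 2x + 1.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0
m_fit, b_fit = gradient_descent(x, y)  # should end up close to m = 2, b = 1
```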
---

### Summary
*   You can solve Linear Regression **without** Gradient Descent by using a closed-form solution (the **Normal Equation**; with a single feature it reduces to a formula of the form `m = numerator / denominator`). This is fast when the number of features is small but slow when it is very large.
*   You can use Gradient Descent for things **other than** Linear Regression (like Neural Networks, Logistic Regression, etc.).

**In short:** Linear Regression gives you the map and the destination. Gradient Descent is the act of walking step-by-step to get there.




## What happens if we don't use Gradient Descent?
If you don't use Gradient Descent, you can solve Linear Regression using the **Normal Equation**, the closed-form solution of the Ordinary Least Squares (OLS) problem.

Instead of taking steps down a hill (iterative), you essentially "teleport" directly to the bottom using pure algebra and matrix math.

## The Normal Equation
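Mathematics allows us to derive a direct formula to find the best parameters ($\theta$) in one single step:
$$ \theta = (X^T X)^{-1} X^T y $$

Here $X$ is the feature matrix (with a column of ones added for the intercept) and $y$ is the vector of target values.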
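
As a rough sketch of the formula in NumPy, assuming a single feature and hypothetical toy data: rather than forming the inverse explicitly, the example solves the equivalent linear system $(X^T X)\theta = X^T y$ with `np.linalg.solve`, which is generally the more numerically stable choice.

```python
import numpy as np

# Hypothetical toy data that lies exactly on the line y = 2x + 1.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0

# Design matrix: a column of ones (intercept) next to the feature column.
X = np.column_stack([np.ones_like(x), x])

# Solve (X^T X) theta = X^T y, which is equivalent to the Normal Equation.
theta = np.linalg.solve(X.T @ X, X.T @ y)
intercept, slope = theta
```

There is no learning rate and no loop here. The parameters come out of a single linear solve.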

### Pros (Why you might prefer this)
1.  **No Iterations**: You get the answer directly from one formula, with no training loop.
2.  **Exact Solution**: It finds the mathematically precise global minimum (up to floating-point rounding), not just an approximation.
3.  **No Hyperparameters**: You don't need to pick a "Learning Rate" ($\alpha$) or worry about it being too big or too small.

### Cons (The "Catch")
1.  **Slow with Many Features**: The term $(X^T X)^{-1}$ requires computing the *inverse* of an $n \times n$ matrix, where $n$ is the number of features (columns).
    *   The computational complexity of the inversion is roughly **$O(n^3)$**.
    *   If you double your features, the calculation time increases by about **8x**.
    *   For 100,000 features, $X^T X$ alone has $10^{10}$ entries (about 80 GB in 64-bit floats), which is impractical for most computers.
    *   Simply put, linear regression without gradient descent becomes slow on high-dimensional data.
2.  **Memory Intensity**: Storing and inverting extremely large matrices consumes massive amounts of RAM.
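
The arithmetic behind these two drawbacks can be checked with a short back-of-the-envelope sketch. The helper names `gram_matrix_gib` and `relative_inversion_cost` are made up for this example, and the cubic cost model is only the rough $O(n^3)$ estimate from above, not a measured timing.

```python
def gram_matrix_gib(n_features, bytes_per_value=8):
    """Memory in GiB needed to store the n x n matrix X^T X in float64."""
    return n_features ** 2 * bytes_per_value / 2 ** 30

def relative_inversion_cost(n_features, baseline=1_000):
    """Rough O(n^3) cost of the inversion relative to a baseline feature count."""
    return (n_features / baseline) ** 3

for n in (1_000, 10_000, 100_000):
    print(f"{n:>7} features: {gram_matrix_gib(n):8.3f} GiB, "
          f"{relative_inversion_cost(n):,.0f}x the inversion cost of 1,000 features")
```

Doubling the feature count from 1,000 to 2,000 gives `relative_inversion_cost(2_000) == 8.0`, which is the 8x figure quoted above.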

## Summary Table
| Feature | Gradient Descent | Normal Equation (No Gradient Descent) |
| :--- | :--- | :--- |
| **Method** | Iterative (Step-by-step) | Analytical (Direct Formula) |
| **Speed** | Scales well to large datasets | Slow when the number of features is large (roughly $n > 10{,}000$) |
| **Accuracy** | Approximation (very close) | Exact (up to floating-point rounding) |
| **Configuration** | Needs Learning Rate ($\alpha$) | No configuration needed |
| **Best For** | Deep Learning, Big Data | Small tabular datasets |




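## Types of Gradient Descent
There are 3 types of gradient descent. They differ in how much data is used to compute each update:
1. **Batch Gradient Descent**: uses the entire dataset for every step.
2. **Stochastic Gradient Descent (SGD)**: uses a single sample for every step (this is what scikit-learn's `SGDRegressor` is based on).
3. **Mini-batch Gradient Descent**: uses a small subset of the data for every step.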
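
The three variants differ only in how many samples feed each update, so one loop can illustrate all of them. The sketch below is a hand-rolled illustration under assumed settings (the function name `minibatch_gd`, the learning rate, the epoch count, and the toy data are all hypothetical); it is not how any particular library implements these methods.

```python
import numpy as np

def minibatch_gd(x, y, batch_size, learning_rate=0.02, n_epochs=2000, seed=0):
    """Fit y = m*x + b, using `batch_size` samples for each update."""
    rng = np.random.default_rng(seed)
    m, b = 0.0, 0.0
    n = len(x)
    for _ in range(n_epochs):
        order = rng.permutation(n)               # reshuffle the data every epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            error = (m * x[idx] + b) - y[idx]    # residuals on this batch only
            m -= learning_rate * 2.0 * np.mean(error * x[idx])
            b -= learning_rate * 2.0 * np.mean(error)
    return m, b

# Hypothetical toy data that lies exactly on the line y = 2x + 1.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2.0 * x + 1.0

results = {
    "batch": minibatch_gd(x, y, batch_size=len(x)),  # all samples per step
    "stochastic": minibatch_gd(x, y, batch_size=1),  # one sample per step
    "mini-batch": minibatch_gd(x, y, batch_size=2),  # a small subset per step
}
```

On this noise-free toy data all three settings should approach $m = 2$ and $b = 1$. On real, noisy data the smaller batch sizes give cheaper but noisier updates.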

![image.png](attachment:image.png)

![image-2.png](attachment:image-2.png)