### Stochastic Gradient Descent
#### What is Gradient Descent?
- Gradient descent is an iterative optimization algorithm used to minimize a loss function, which represents how far the model’s predictions are from the actual values. The main goal is to adjust the parameters of a model (weights, biases, etc.) so that the error is minimized.
- The update rule for traditional (batch) gradient descent, with parameters $\theta$, learning rate $\alpha$, and loss function $J(\theta)$, is:
  $$
  \theta = \theta - \alpha \nabla_{\theta}J(\theta)
  $$
- In traditional gradient descent, the gradients are computed based on the entire dataset, which can be computationally expensive for large datasets.
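
As a minimal sketch of the update rule above, the following fits a line $y \approx wx + b$ by full-batch gradient descent on mean squared error. The toy data, the function name `batch_gradient_descent`, and the values of `alpha` and `steps` are assumptions chosen for illustration, not part of the original notes.

```python
def batch_gradient_descent(xs, ys, alpha=0.05, steps=2000):
    """Fit y ~ w*x + b by full-batch gradient descent on mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Gradient of J = (1/n) * sum((w*x + b - y)^2), computed over EVERY data point.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        # theta = theta - alpha * gradient
        w -= alpha * grad_w
        b -= alpha * grad_b
    return w, b


# Hypothetical toy data generated from y = 2x + 1 (no noise).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
w, b = batch_gradient_descent(xs, ys)
print(w, b)  # should be close to 2 and 1
```

Note that every step loops over the whole dataset twice, which is the cost that grows with dataset size.
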
#### Need for Stochastic Gradient Descent
For large datasets, computing the gradient using all data points can be slow and memory-intensive. This is where stochastic gradient descent (SGD) comes into play. Instead of using the full dataset to compute the gradient at each step, SGD uses only one random data point (or a small batch of data points) at each iteration. This makes each update much cheaper to compute.

<img src="Images/stochastic.webp" width=600>

#### Working of Stochastic Gradient Descent
- In Stochastic Gradient Descent, the gradient is calculated for each training example (or a small subset of training examples) rather than the entire dataset.
- The update rule becomes:
  $$
  \theta = \theta - \alpha \nabla J(\theta, x_{i}, y_{i})
  $$
  Where
  - $x_{i}$ and $y_{i}$ represent the features and target of the i-th training example.
  - The gradient $\nabla J(\theta, x_{i}, y_{i})$ is now calculated for a single data point or a small batch.
- The key difference from traditional gradient descent is that, in SGD, the parameter updates are made based on a single data point, not the entire dataset. The random selection of data points introduces stochasticity, which can be both an advantage and a challenge.
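
The per-example update can be sketched as follows, again for a line $y \approx wx + b$ with squared error. The function name `sgd`, the `batch_size` parameter, the toy data, and the hyperparameter values are assumptions for illustration. Because the toy data contain no noise, the iterates settle at the exact solution; with noisy data they would instead hover around the minimum.

```python
import random


def sgd(xs, ys, alpha=0.01, epochs=1000, batch_size=1, seed=0):
    """Fit y ~ w*x + b with stochastic (mini-batch) gradient descent."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # visit the examples in a random order each epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Gradient of the squared error, estimated from this batch only.
            grad_w = sum(2 * (w * xs[i] + b - ys[i]) * xs[i] for i in batch) / len(batch)
            grad_b = sum(2 * (w * xs[i] + b - ys[i]) for i in batch) / len(batch)
            # theta = theta - alpha * gradient(theta, x_i, y_i)
            w -= alpha * grad_w
            b -= alpha * grad_b
    return w, b


# Hypothetical toy data generated from y = 2x + 1 (no noise).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
w, b = sgd(xs, ys)                       # one example per update
w2, b2 = sgd(xs, ys, batch_size=2)       # small mini-batches
print(w, b, w2, b2)
```

Setting `batch_size` to the dataset size would recover full-batch gradient descent, so the same loop covers both extremes.
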
#### Advantages of Stochastic Gradient Descent
1. `Efficiency`: Because it uses only one or a few data points to calculate the gradient, each SGD step requires far fewer computations than a full-batch step. On large datasets this often means faster progress per unit of computation.

2. `Memory Efficiency`: Since it does not require storing the entire dataset in memory for each iteration, SGD can handle much larger datasets than traditional gradient descent.

3. `Escaping Local Minima`: The noisy updates in SGD, caused by the stochastic nature of the algorithm, can help the model escape local minima or saddle points, potentially leading to better solutions in non-convex optimization problems (common in deep learning).

4. `Online Learning`: SGD is well-suited for online learning, where the model is trained incrementally as new data comes in, rather than on a static dataset.
#### Challenges of Stochastic Gradient Descent
1. `Noisy Convergence`: Since the gradient is estimated based on a single data point (or a small batch), the updates can be noisy, causing the cost function to fluctuate rather than steadily decrease. This makes convergence slower and more erratic than in batch gradient descent.

2. `Learning Rate Tuning`: SGD is highly sensitive to the choice of learning rate. A learning rate that is too large may cause the algorithm to diverge, while one that is too small can slow down convergence. Adaptive methods like Adam and RMSprop address this by adjusting the learning rate dynamically during training.

3. `Long Training Times`: While each individual update is fast, overall convergence may take more iterations because the steps are more erratic than in batch gradient descent.
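
The learning-rate sensitivity is already visible without any noise. As a small sketch, the following runs plain gradient descent on the one-dimensional loss $J(\theta) = \theta^{2}$ with three learning rates. The function name `descend` and the specific values of `alpha` are assumptions picked to show the three regimes.

```python
def descend(alpha, steps=50, theta=1.0):
    """Run gradient descent on J(theta) = theta**2, whose gradient is 2 * theta."""
    for _ in range(steps):
        theta -= alpha * 2 * theta  # each step multiplies theta by (1 - 2 * alpha)
    return theta


too_small = descend(alpha=0.001)  # factor 0.998 per step: barely moves from 1.0
reasonable = descend(alpha=0.1)   # factor 0.8 per step: shrinks toward the minimum at 0
too_large = descend(alpha=1.1)    # factor -1.2 per step: |theta| grows, i.e. divergence
print(too_small, reasonable, too_large)
```
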

### Momentum
#### What is Momentum?
Momentum is an optimization technique that accelerates gradient descent by accumulating a velocity vector in directions of persistent reduction of the loss function. Instead of relying only on the current gradient, momentum combines the current gradient with a fraction of the previous update. This helps smooth out oscillations and speed up convergence, especially in ravines or areas with high curvature.

#### Update Rule
The update equations with momentum are:
  $$
  v_{t} = \beta v_{t - 1} + (1 - \beta) \nabla_{\theta}J(\theta)
  $$
  $$
  \theta = \theta - \alpha v_{t}
  $$
  Where:  
  - $v_{t}$: velocity (accumulated gradient) at step $t$  
  - $\beta$: momentum coefficient ($0 \leq \beta < 1$), usually set around **0.9**  
  - $\alpha$: learning rate  
  - $\nabla_{\theta}J(\theta)$: gradient of the loss function with respect to the parameters  

Intuition: $v_{t}$ acts like the "inertia" that carries information from past gradients. If the gradient points in the same direction repeatedly, $v_{t}$ builds up, making updates larger in that direction. If the gradients oscillate, successive terms partly cancel inside $v_{t}$, which dampens the oscillations.
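
A minimal sketch of the two update equations, applied to a "ravine" loss $J(x, y) = \tfrac{1}{2}(x^{2} + 100y^{2})$ that is steep in $y$ and shallow in $x$. The loss, the starting point, the function names, and the learning rates are assumptions for illustration. Setting `beta=0` reduces the rule to plain gradient descent, whose learning rate must stay below $2/100 = 0.02$ on this loss to avoid divergence in the steep direction; the momentum run uses a larger `alpha` that plain gradient descent could not tolerate here, so the comparison is between each method at a step size it can handle.

```python
def momentum_descent(grad, theta, alpha, beta=0.9, steps=300):
    """Minimize a function using the momentum update rule from the text."""
    v = [0.0] * len(theta)  # velocity starts at zero
    theta = list(theta)
    for _ in range(steps):
        g = grad(theta)
        # v_t = beta * v_{t-1} + (1 - beta) * gradient
        v = [beta * vi + (1 - beta) * gi for vi, gi in zip(v, g)]
        # theta = theta - alpha * v_t
        theta = [ti - alpha * vi for ti, vi in zip(theta, v)]
    return theta


def ravine_loss(theta):
    x, y = theta
    return 0.5 * (x ** 2 + 100.0 * y ** 2)


def ravine_grad(theta):
    x, y = theta
    return [x, 100.0 * y]


start = [10.0, 1.0]
plain = momentum_descent(ravine_grad, start, alpha=0.015, beta=0.0)        # plain gradient descent
with_momentum = momentum_descent(ravine_grad, start, alpha=0.1, beta=0.9)  # momentum, larger step
print(ravine_loss(plain), ravine_loss(with_momentum))
```

In this setup the momentum run should end with a much lower loss after the same number of steps, because the shallow $x$ direction is no longer held back by the small step size that the steep $y$ direction forces on plain gradient descent.
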

#### Advantages of Momentum
1. `Faster Convergence`: Accelerates learning, especially in directions where gradients consistently point the same way.
2. `Reduced Oscillation`: Helps smooth noisy updates, particularly in SGD.
3. `Effective in Ravines`: Handles cost functions that have steep curvature in one direction and shallow curvature in another.
4. `Simple to Implement`: Only requires storing the previous velocity vector.
#### Challenges of Momentum
1. `Hyperparameter Tuning`: The choice of $\beta$ is critical. A value that is too high can overshoot minima, while one that is too low loses the benefit of momentum.
2. `Risk of Overshooting`: If combined with a large learning rate, the accumulated velocity may push parameters past the optimal point.
3. `Not Adaptive`: Momentum does not adjust learning rates for individual parameters, unlike Adam or RMSprop.
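
The overshooting risk can be illustrated on the one-dimensional loss $J(\theta) = \tfrac{1}{2}\theta^{2}$, whose minimum is at $\theta = 0$. This is a sketch under assumed values (`alpha=0.5`, a start at $\theta = 1$, and the hypothetical helper `trajectory`): with `beta=0` the iterates approach zero from one side, while with `beta=0.9` the accumulated velocity carries them past the minimum to negative values.

```python
def trajectory(alpha, beta, steps=30, theta=1.0):
    """Iterates of the momentum rule on J(theta) = 0.5 * theta**2 (gradient = theta)."""
    v, path = 0.0, [theta]
    for _ in range(steps):
        v = beta * v + (1 - beta) * theta  # gradient of the loss is theta itself
        theta = theta - alpha * v
        path.append(theta)
    return path


plain = trajectory(alpha=0.5, beta=0.0)  # halves theta each step, never crosses zero
heavy = trajectory(alpha=0.5, beta=0.9)  # velocity pushes theta past the minimum
print(min(plain), min(heavy))
```
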