# Optimizers
Neural network optimizers are algorithms that adjust a model's weights to minimize the loss function during training. They determine how the weights are updated in response to the gradients computed by backpropagation.

![optimizers](https://miro.medium.com/v2/resize:fit:640/format:webp/1*XVFmo9NxLnwDr3SxzKy-rA.gif)

![optimizers_overview](https://miro.medium.com/v2/resize:fit:640/format:webp/1*SjtKOauOXFVjWRR7iCtHiA.gif)

## Gradient Descent
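- Batch gradient descent: computes the gradient over the entire training set for each update.
- Stochastic gradient descent (SGD): updates the weights after each individual training example.
- Mini-batch gradient descent: updates the weights using a small subset (mini-batch) of the training set.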
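
The three variants differ only in how many training examples feed each gradient estimate. The sketch below illustrates this with a single hypothetical function, `gradient_descent`, whose `batch_size` argument selects the variant. The one-weight model `y = w * x`, the toy data, and the learning rate are all assumptions chosen for illustration.

```python
import random

def gradient(w, batch):
    """Gradient of the mean squared error for the toy model y = w * x."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

def gradient_descent(data, batch_size, lr=0.1, epochs=200, seed=0):
    """batch_size == len(data): batch GD; 1: SGD; in between: mini-batch GD."""
    rng = random.Random(seed)
    data = list(data)
    w = 0.0
    for _ in range(epochs):
        rng.shuffle(data)  # visit the examples in a new order each epoch
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            w -= lr * gradient(w, batch)  # step against the gradient
    return w

# Hypothetical noise-free data generated from y = 3x
data = [(x / 10, 3 * x / 10) for x in range(1, 11)]

w_batch = gradient_descent(data, batch_size=len(data))  # one update per epoch
w_sgd = gradient_descent(data, batch_size=1)            # one update per example
w_mini = gradient_descent(data, batch_size=4)           # one update per mini-batch
print(w_batch, w_sgd, w_mini)
```

On this toy problem all three settings should recover a weight close to 3. They differ in how many updates they perform per epoch and in how noisy each update is.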


## Momentum
* Builds on stochastic gradient descent; the full name is SGD with momentum.
* Adds a fraction (the momentum term) of the previous update to the current update.
* Smooths updates, dampens oscillations, and can help the optimizer move past shallow local minima.

![momentum](https://miro.medium.com/v2/resize:fit:720/format:webp/1*L5lNKxAHLPYNc6-Zs4Vscw.png)
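
A minimal sketch of the momentum update on a single scalar weight may make the "fraction of the previous update" idea concrete. The function name `momentum_step`, the toy loss `f(w) = w**2`, and the values `lr=0.1` and `beta=0.9` are assumptions for illustration, not values taken from these notes.

```python
def momentum_step(w, velocity, grad, lr=0.1, beta=0.9):
    """One SGD-with-momentum update for a single scalar weight."""
    # New update = fraction of the previous update + plain gradient step.
    velocity = beta * velocity + lr * grad
    return w - velocity, velocity

def grad_f(w):
    """Gradient of the toy loss f(w) = w**2."""
    return 2.0 * w

w, velocity = 5.0, 0.0

# First step: no previous update yet, so it matches plain gradient descent.
w_first, v_first = momentum_step(w, velocity, grad_f(w))

# Later steps reuse 90% of the previous update on top of the new gradient step.
for _ in range(300):
    w, velocity = momentum_step(w, velocity, grad_f(w))
print(w_first, w)
```

Some libraries use a slightly different but closely related formulation (for example, applying the learning rate after accumulating the gradients), so treat this as one common variant.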

## Adagrad
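Adagrad is an optimizer with parameter-specific learning rates, which are adapted according to how frequently a parameter is updated during training. It accumulates the squared gradients of each parameter and divides that parameter's step by the square root of the accumulated sum, so the more (and larger) updates a parameter receives, the smaller its subsequent updates become.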
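
The sketch below illustrates the per-parameter behaviour with two parameters: one that receives a gradient on every step and one that receives a gradient only occasionally. The function name `adagrad_step`, the artificial gradients, the base learning rate of 0.5, and the small constant `eps` (added to avoid division by zero) are assumptions for illustration.

```python
import math

def adagrad_step(w, accum, grad, lr=0.5, eps=1e-8):
    """One Adagrad update over lists of parameters (pure-Python sketch)."""
    # Accumulate the squared gradient of each parameter over all steps so far.
    accum = [a + g * g for a, g in zip(accum, grad)]
    # Scale each parameter's step by 1 / sqrt(its accumulated squared gradient).
    w = [wi - lr * g / (math.sqrt(a) + eps) for wi, g, a in zip(w, grad, accum)]
    return w, accum

w, accum = [0.0, 0.0], [0.0, 0.0]
for step in range(20):
    # Hypothetical gradients: parameter 0 is updated on every step,
    # parameter 1 only on every 10th step (a rarely active feature).
    grad = [1.0, 1.0 if step % 10 == 0 else 0.0]
    w, accum = adagrad_step(w, accum, grad)

# Effective learning rate each parameter would see on its next update.
lr = 0.5
effective_lr = [lr / (math.sqrt(a) + 1e-8) for a in accum]
print(accum, effective_lr)
```

The frequently updated parameter ends up with the larger accumulator and therefore the smaller effective learning rate. Because the accumulator never shrinks, the effective rate can only decrease over time.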

## RMSProp (Root Mean Square Propagation)

- Maintains a moving (discounted) average of the squared gradients.
- Divides the gradient by the square root of this average.
- As a result, flat regions (small gradients) get relatively large steps and steep regions (large gradients) get relatively small steps.

Default values suggested by Hinton (who proposed RMSProp):
- decay factor γ: 0.9 (each update keeps 90% of the previous average of squared gradients)
- learning rate η: 0.001
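
A minimal sketch of the RMSProp update, using the default values listed above. The function names `rmsprop_step` and `last_step_size`, the constant artificial gradients, and the small constant `eps` (added to avoid division by zero) are assumptions for illustration.

```python
import math

def rmsprop_step(w, avg_sq, grad, lr=0.001, gamma=0.9, eps=1e-8):
    """One RMSProp update for a single scalar weight."""
    # Discounted moving average of the squared gradient.
    avg_sq = gamma * avg_sq + (1 - gamma) * grad * grad
    # Divide the gradient by the root of that average before stepping.
    w = w - lr * grad / (math.sqrt(avg_sq) + eps)
    return w, avg_sq

def last_step_size(grad, steps=100):
    """Size of the final step when the same gradient is seen on every step."""
    w, avg_sq = 0.0, 0.0
    for _ in range(steps):
        w_prev = w
        w, avg_sq = rmsprop_step(w, avg_sq, grad)
    return abs(w - w_prev)

step_flat = last_step_size(0.01)    # flat region: tiny gradient
step_steep = last_step_size(100.0)  # steep region: huge gradient
print(step_flat, step_steep)
```

Although the two gradients differ by a factor of 10,000, both step sizes settle close to the learning rate. Relative to plain gradient descent, that is a much larger step on the flat surface and a much smaller one on the steep surface.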


## ADAM (Adaptive Moment Estimation)
ADAM adapts the learning rate of each parameter dynamically. It combines:
- *Momentum*: Uses an exponential moving average of past gradients (first moment estimate).
- *RMSprop*: Scales updates using an exponential moving average of recent squared gradients (second moment estimate).
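
The two moment estimates can be sketched as follows for a single scalar weight. The function name `adam_step` and the toy loss `f(w) = (w - 2)**2` are assumptions for illustration. The bias-correction terms and the defaults `beta1=0.9`, `beta2=0.999`, and `eps=1e-8` are not described in these notes; they follow the commonly cited formulation of the algorithm.

```python
import math

def adam_step(w, m, v, grad, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single scalar weight; t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * grad          # first moment (momentum part)
    v = beta2 * v + (1 - beta2) * grad * grad   # second moment (RMSprop part)
    m_hat = m / (1 - beta1 ** t)                # correct the bias toward zero
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v

def grad_f(w):
    """Gradient of the toy loss f(w) = (w - 2)**2."""
    return 2.0 * (w - 2.0)

w, m, v = 0.0, 0.0, 0.0
history = []
for t in range(1, 501):
    w, m, v = adam_step(w, m, v, grad_f(w), t, lr=0.1)
    history.append(w)
print(history[0], w)
```

With bias correction, the very first step has a magnitude of roughly `lr` regardless of the gradient's scale. Later steps shrink as the gradients become small relative to their recent history.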

## Optimizer Comparison Table
| Optimizer  | Description | Pros  | Cons |
|------------|------------|---------|---------|
| **SGD** (Stochastic Gradient Descent) | Basic gradient descent using a small batch of data. | - Simple & efficient <br> - Good generalization | - Slow convergence <br> - High variance in updates |
| **SGD + Momentum** | SGD with an additional term to smooth updates. | - Faster than vanilla SGD <br> - Reduces oscillations | - Still needs manual learning rate tuning |
| **AdaGrad** | Adapts learning rate per parameter based on accumulated squared gradients. | - Good for sparse data <br> - Less manual learning rate tuning | - Learning rate decreases too much over time |
| **AdaDelta** | Improvement over AdaGrad, avoids aggressive learning rate decay. | - No need to set learning rate <br> - Works well in some cases | - Can be computationally expensive |
| **RMSprop** | Adapts learning rate based on a moving average of recent squared gradients. | - Works well with non-stationary objectives <br> - Handles sparse gradients | - Learning rate and decay factor may require tuning |
| **ADAM** (Adaptive Moment Estimation) | Combines Momentum & RMSprop for adaptive learning rates. | - Fast convergence <br> - Works well for most tasks | - Can generalize worse than SGD <br> - Higher memory usage |