## 1. Gradient Descent

Gradient Descent is a fundamental optimization algorithm used to minimize a function by iteratively moving towards the direction of steepest descent. In the context of machine learning, it's commonly employed to optimize the parameters of a model to minimize a given loss function.

Mathematical Representation:
The algorithm updates the parameters $w$ in the following manner:

$$ g_{t} = \nabla_{w} f(w_{t}) $$

Where:   
- $ g_t $ is the gradient of the loss $ f $ with respect to the parameters $ w $, evaluated at $ w_t $  
- $ \nabla_{w} f(w_{t}) = \left( \frac{\partial f}{\partial w_1}(w_{t}), \frac{\partial f}{\partial w_2}(w_{t}), \ldots, \frac{\partial f}{\partial w_n}(w_{t}) \right) $


$$ w_{t+1} = w_t - \eta g_t $$

Where:  
- $ t $ : step number  
- $ w_t $ : current parameters at step $ t $  
- $ w_{t+1} $ : updated parameters  
- $ g_t $ : gradients at step $ t $, i.e., $ g_t = \nabla_{w} f(w_{t}) $  
- $ \eta $ : learning rate
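
As a minimal sketch of the update rule above, the following runs plain gradient descent on a toy two-parameter quadratic. The loss, the helper names `grad_f` and `gradient_descent`, and the values of `eta` and `steps` are assumptions chosen for illustration, not part of the original notes.

```python
def grad_f(w):
    # Gradient of the toy loss f(w) = (w[0] - 3)^2 + (w[1] + 1)^2.
    # The loss itself is an assumption made for this example.
    return [2.0 * (w[0] - 3.0), 2.0 * (w[1] + 1.0)]


def gradient_descent(w0, eta=0.1, steps=100):
    w = list(w0)
    for _ in range(steps):
        g = grad_f(w)                                  # g_t = grad f(w_t)
        w = [wi - eta * gi for wi, gi in zip(w, g)]    # w_{t+1} = w_t - eta * g_t
    return w


w_final = gradient_descent([0.0, 0.0])
```

With these settings the iterates approach the minimizer $(3, -1)$ of the toy loss; a learning rate that is too large for the curvature would make the same loop diverge.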

#### Drawback

One of the drawbacks of Gradient Descent is that it requires computing the gradient over the entire dataset for each update. This can be computationally expensive, especially for large datasets.


## 2. Stochastic Gradient Descent

#### Key Features:  

**Incremental Updates:** Parameters are updated after processing each sample, making it computationally more efficient compared to Gradient Descent, especially for large datasets.  

**Random Sampling:** SGD typically involves shuffling the dataset and then sequentially sampling one instance at a time to update the parameters. The resulting noise can help the iterates escape shallow local minima, although it does not guarantee reaching the global minimum.

#### Cons:
**Noisy Updates:** Since each update is based on a single sample, the updates can be noisy and may not necessarily decrease the loss function monotonically.  
**Convergence:** Due to the stochastic nature of SGD, convergence to the optimal solution may be slower compared to Gradient Descent. However, it's often used in practice due to its efficiency and ability to handle large datasets.

## 3. Minibatch Gradient Descent

#### Key Features:  
**Efficient Updates:** Parameters are updated after processing a mini-batch of samples, which strikes a balance between the efficiency of SGD and the stability of Gradient Descent. This approach often results in faster convergence compared to pure SGD.
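
The sketch below illustrates mini-batch updates on a one-dimensional linear regression, assuming a mean-squared-error loss and noise-free toy data. The function name `minibatch_sgd`, the data, and all hyperparameter values are hypothetical. Setting `batch_size=1` would give per-sample SGD, and `batch_size=len(xs)` would give full-batch Gradient Descent.

```python
import random


def minibatch_sgd(xs, ys, eta=0.05, batch_size=4, epochs=300, seed=0):
    """Fit y ~ w * x + b with mini-batch gradient steps on the MSE loss."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)                       # reshuffle once per epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            errs = [(w * xs[i] + b) - ys[i] for i in batch]
            # Gradient of the batch-mean squared error w.r.t. w and b
            grad_w = 2.0 * sum(e * xs[i] for e, i in zip(errs, batch)) / len(batch)
            grad_b = 2.0 * sum(errs) / len(batch)
            w -= eta * grad_w
            b -= eta * grad_b
    return w, b


xs = [i / 10 for i in range(20)]
ys = [2.0 * x + 1.0 for x in xs]   # toy data generated from w = 2, b = 1
w, b = minibatch_sgd(xs, ys)
```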

#### Advantage:  
**Efficiency:** MGD, along with SGD, performs more frequent updates compared to Gradient Descent, which can lead to faster convergence, especially for large datasets.  


#### Disadvantages:
**Slow Convergence:** Although MGD is more efficient than GD, it can still suffer from slow convergence. This is because throughout training, some weights may have steeper slopes and require larger updates, while others may have flatter slopes, requiring smaller updates. This imbalance can slow down convergence.  
**Oscillating Updates:** The addition of randomness in the updates (especially in SGD and MiniBatch variants) can sometimes cause the gradients to oscillate, making convergence difficult.  

## 4. Momentum Optimizer

Momentum optimization is a technique commonly used to accelerate the convergence of gradient-based optimization algorithms, such as Gradient Descent variants. It introduces a momentum term that accumulates gradients over time and adjusts the parameter updates accordingly.

Mathematical Representation:
The update rule for the parameters $w$ in Momentum Optimizer is given by:

$$w_{t+1} = w_t + m_t$$
$$m_t = \beta m_{t-1} - \eta g_{t}$$

Where:
- $m_t$ : momentum term, an exponentially decaying accumulation of past gradients (scaled by $-\eta$).
- $\beta$ : another parameter, called the **decay coefficient**, in the range $[0, 1)$. It controls how quickly past gradients are forgotten.
- $\eta$ : learning rate.


Unrolling the recursion, starting from $m_0 = 0$:

$$m_1 = -\eta g_1$$
$$m_2 = \beta m_1 - \eta g_2 = -\eta(\beta g_1 + g_2)$$
$$\vdots$$
$$m_{\tau} = -\eta(\beta^{\tau-1} g_1 + \beta^{\tau-2} g_2 + \ldots + g_{\tau})$$


- $g_1$ has a smaller coefficient than $g_2$, and so on, which is why $m_t$ is called an exponentially decaying moving average of past gradients.

- Momentum optimization helps in accelerating convergence, especially in scenarios where the gradients fluctuate significantly or the loss surface has long, narrow valleys.

- It's important to tune the hyperparameters, particularly the learning rate $\eta$ and the decay coefficient $\beta$, to achieve optimal performance for a given optimization problem.
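
A minimal sketch of the momentum update, run on a toy "narrow valley" quadratic. The helper names `momentum_step` and `grad_valley`, the loss, and the values of `eta` and `beta` are assumptions for illustration.

```python
def momentum_step(w, m, g, eta=0.02, beta=0.9):
    """One momentum update; w, m and g are equal-length lists of floats."""
    m_new = [beta * mi - eta * gi for mi, gi in zip(m, g)]   # m_t = beta * m_{t-1} - eta * g_t
    w_new = [wi + mi for wi, mi in zip(w, m_new)]            # w_{t+1} = w_t + m_t
    return w_new, m_new


def grad_valley(w):
    # Gradient of the toy loss f(w) = 0.5 * (w[0]^2 + 25 * w[1]^2),
    # which is much steeper along w[1] than along w[0] (an assumed example).
    return [w[0], 25.0 * w[1]]


w, m = [5.0, 1.0], [0.0, 0.0]   # m_0 = 0
for _ in range(300):
    w, m = momentum_step(w, m, grad_valley(w))
w_final = w
```

Feeding two hand-picked gradients through `momentum_step` reproduces the unrolled form $m_2 = -\eta(\beta g_1 + g_2)$.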

## 5. Nesterov Optimizer

Nesterov Accelerated Gradient (NAG) is an enhancement of the Momentum optimizer that improves convergence by employing a "look-ahead" approach for gradient computation. It adjusts the gradient calculation by partially updating the parameters before evaluating the gradient.

Mathematical Representation:
The update rule for the parameters $w$ in Nesterov Optimizer is given by:

$$w_{t+1} = w_t + m_t$$
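$$m_t = \beta m_{t-1} - \eta \nabla_w f(w_t + \beta m_{t-1})$$

- $\nabla_w f(w_t + \beta m_{t-1})$ : the gradient evaluated at the look-ahead point, which is what distinguishes Nesterov Accelerated Gradient (NAG) from plain momentum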

#### Explanation:
- Instead of directly computing the gradients at time step $t$, Nesterov optimizer uses a look-ahead approach by performing a partial update ($w_t + \beta m_{t-1}$) of the parameters before computing the gradients.  
- The gradient is then calculated based on the partially updated parameters.  
- This helps in correcting the gradient direction more accurately, especially in scenarios where the momentum term might lead the optimization astray.  
- The parameter update is then adjusted using the momentum term $m_t$, similar to the Momentum optimizer.    
#### Performance:
Nesterov Optimizer, with its Nesterov Accelerated Gradients (NAG), often outperforms traditional SGD and Momentum methods in terms of convergence speed and efficiency, especially for deep learning tasks with complex loss surfaces.  

#### Note:
Tuning the hyperparameters, particularly the learning rate $\eta$ and the decay coefficient $\beta$, is crucial for achieving optimal performance with Nesterov Optimizer. Additionally, it's important to consider the computational overhead of the look-ahead approach, especially for large-scale optimization tasks.  
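
The look-ahead step can be sketched as follows. The only difference from plain momentum is where the gradient is evaluated, so the step takes a gradient function rather than a precomputed gradient. The names `nesterov_step` and `grad_valley`, the toy loss, and the hyperparameter values are assumptions for illustration.

```python
def nesterov_step(w, m, grad_fn, eta=0.02, beta=0.9):
    """One Nesterov update; grad_fn maps a parameter list to a gradient list."""
    lookahead = [wi + beta * mi for wi, mi in zip(w, m)]     # partial update w_t + beta * m_{t-1}
    g = grad_fn(lookahead)                                   # gradient at the look-ahead point
    m_new = [beta * mi - eta * gi for mi, gi in zip(m, g)]   # m_t = beta * m_{t-1} - eta * g
    w_new = [wi + mi for wi, mi in zip(w, m_new)]            # w_{t+1} = w_t + m_t
    return w_new, m_new


def grad_valley(w):
    # Gradient of the assumed toy loss f(w) = 0.5 * (w[0]^2 + 25 * w[1]^2)
    return [w[0], 25.0 * w[1]]


w, m = [5.0, 1.0], [0.0, 0.0]
for _ in range(300):
    w, m = nesterov_step(w, m, grad_valley)
w_final = w
```

Each step costs one gradient evaluation, the same as momentum, but it needs the gradient at a shifted point, which is why the function receives `grad_fn` instead of a gradient value.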

# Learning Rate Optimizers

## 6. AdaGrad Optimizer
AdaGrad (Adaptive Gradient Algorithm) is an optimization algorithm that adapts the learning rate for each parameter based on the history of gradients. It aims to provide a larger learning rate for infrequent parameters and a smaller learning rate for frequent parameters.

Mathematical Representation:
The update rule for the parameters $w$ in AdaGrad Optimizer is given by:

$$w_{t+1} = w_{t} - \frac{\eta}{\sqrt{v_t + \epsilon}} g_{t}$$

- The global learning rate $\eta$ is divided by the square root of $v_t$, which is the running sum of the squared gradients:

$$v_t = v_{t-1} + g^{2}_{t}$$

#### Explanation:
- AdaGrad adapts the learning rate individually for each parameter based on the history of gradients.
- The learning rate is divided by the square root of the accumulated squared gradients. This reduces the effective learning rate for parameters that frequently receive large gradients and keeps it comparatively large for parameters whose gradients are rare or small.
- The accumulation of squared gradients allows the optimizer to decay the effective learning rate over time for each parameter, thus adapting the learning rate to the specific requirements of each parameter.
#### Problems:
- Sensitive to Initial Gradients: At the beginning of training (t=0), the accumulation $v_0$ is initialized with 0. This initialization makes the optimizer sensitive to the initial gradients, potentially affecting convergence.
- Cumulative Accumulation: The accumulation of squared gradients, $v_{\tau} = g^{2}_{1} + g^{2}_{2} + \ldots + g^{2}_{\tau}$, increases with each step. This cumulative effect can lead to extremely small effective learning rates over time, slowing down the optimization process, especially in long training sessions.
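
The shrinking effective learning rate can be seen in a small sketch. It uses the descent form of the update (the scaled gradient is subtracted from $w_t$) and a constant gradient, which is an artificial choice made only to expose the effect. The name `adagrad_step` and the values of `eta` and `eps` are assumptions.

```python
import math


def adagrad_step(w, v, g, eta=0.1, eps=1e-8):
    """One AdaGrad update; w, v and g are equal-length lists of floats."""
    v_new = [vi + gi * gi for vi, gi in zip(v, g)]                 # v_t = v_{t-1} + g_t^2
    w_new = [wi - eta * gi / math.sqrt(vi + eps)                   # w_{t+1} = w_t - eta / sqrt(v_t + eps) * g_t
             for wi, vi, gi in zip(w, v_new, g)]
    return w_new, v_new


w, v = [1.0], [0.0]             # v_0 = 0
step_sizes = []
for _ in range(5):
    g = [2.0]                   # constant gradient, purely illustrative
    w_next, v = adagrad_step(w, v, g)
    step_sizes.append(w[0] - w_next[0])
    w = w_next
```

With a constant gradient the step at iteration $t$ is roughly $\eta / \sqrt{t}$, so `step_sizes` keeps decreasing even though the gradient itself never changes.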

## 7. RMSProp Optimizer - Root Mean Square Propagation

RMSProp (Root Mean Square Propagation) is an optimization algorithm that addresses some of the limitations of AdaGrad by using an exponentially decaying moving average of squared gradients. It aims to provide more stable and adaptive learning rates during training.

Mathematical Representation:
The update rule for the parameters $w$ in RMSProp Optimizer is given by:

$$w_{t+1} = w_{t}+ \triangle w_{t}$$
$$\triangle w_{t} = -\frac{\eta}{RMS(g_t)} \cdot g_t$$
$$RMS(g_t) = \sqrt{v_t + \epsilon}$$
$$v_t = \beta v_{t-1} + (1-\beta)g^2_t$$

#### Explanation:
- RMSProp addresses the sensitivity to initial gradients and the issue of monotonically decreasing learning rates observed in AdaGrad.
- Instead of accumulating all past squared gradients, RMSProp computes an exponentially decaying moving average of squared gradients using the parameter $\beta$.
- This moving average helps to stabilize and adaptively adjust the learning rates for each parameter during training.
- The RMS term in the update rule normalizes the gradient by the square root of the moving average of squared gradients, providing a more balanced and stable learning rate.
#### Benefits:
- Less Sensitive to Initial Gradients: By using an exponentially decaying moving average of squared gradients, RMSProp is less sensitive to the initial gradients encountered during training.
- Avoids Monotonically Decreasing Learning Rates: The adaptive learning rates provided by RMSProp help avoid the issue of monotonically decreasing learning rates, which can occur in AdaGrad due to the accumulation of squared gradients.
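
A minimal sketch of the RMSProp update, again driven by a constant gradient so the behaviour of the step size is easy to read off. The name `rmsprop_step` and the hyperparameter values are assumptions for illustration.

```python
import math


def rmsprop_step(w, v, g, eta=0.01, beta=0.9, eps=1e-8):
    """One RMSProp update; w, v and g are equal-length lists of floats."""
    v_new = [beta * vi + (1.0 - beta) * gi * gi                    # v_t = beta * v_{t-1} + (1 - beta) * g_t^2
             for vi, gi in zip(v, g)]
    w_new = [wi - eta * gi / math.sqrt(vi + eps)                   # delta w_t = -eta / RMS(g_t) * g_t
             for wi, vi, gi in zip(w, v_new, g)]
    return w_new, v_new


w, v = [1.0], [0.0]
step_sizes = []
for _ in range(200):
    w_next, v = rmsprop_step(w, v, [2.0])   # constant gradient, purely illustrative
    step_sizes.append(w[0] - w_next[0])
    w = w_next
```

Unlike a cumulative sum of squared gradients, the moving average $v_t$ levels off at $g^2$ here, so the step size settles near $\eta$ instead of shrinking toward zero.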

## 8. AdaDelta Optimizer  

AdaDelta is an extension of the RMSProp optimizer that aims to overcome its reliance on a global learning rate $\eta$. It achieves this by using a running average of the parameter updates to adaptively adjust the learning rates.

Mathematical Representation:
The update rule for the parameters $w$ in AdaDelta Optimizer is given by:

$$w_{t+1} = w_{t}+ \triangle w_{t}$$
$$\triangle w_{t} = -\frac{RMS(\triangle w_{t-1})}{RMS(g_t)} g_t$$

$$RMS(\triangle w_{t}) = \sqrt{u_t + \epsilon}$$
$$u_t = \beta u_{t-1} + (1-\beta)(\triangle w_t)^2$$

$$RMS(g_t) = \sqrt{v_t + \epsilon}$$
$$v_t = \beta v_{t-1} + (1-\beta)(g_t)^2$$

#### Explanation:
- AdaDelta addresses the issues of sensitivity to initial gradients and continually decreasing learning rates observed in some optimization algorithms.
- By using a running average of the parameter updates, AdaDelta adapts the learning rates on a per-parameter basis without requiring a global learning rate $\eta$.
- The update rule for $\triangle w_{t}$ is scaled by the ratio of the root mean square of the past parameter updates to the root mean square of the gradients.
- This normalization helps to stabilize the learning rates and avoid overly aggressive or overly conservative updates.
- AdaDelta effectively removes the need for manual tuning of the learning rate hyperparameter, making it more convenient to use and less sensitive to the choice of hyperparameters.
#### Benefits:
- Less Sensitive to Initial Gradients: AdaDelta, like RMSProp, is less sensitive to the initial gradients encountered during training, thanks to its adaptive learning rates.
- Avoids Continually Decreasing Learning Rates: By adapting the learning rates on a per-parameter basis, AdaDelta avoids the issue of continually decreasing learning rates observed in some optimization algorithms.
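
The AdaDelta equations can be sketched as below. Note that no learning rate appears in the signature. The name `adadelta_step` and the values of `beta` and `eps` are assumptions; with $u_0 = 0$, the very first step is driven entirely by $\epsilon$, so its size depends on that choice.

```python
import math


def adadelta_step(w, u, v, g, beta=0.9, eps=1e-6):
    """One AdaDelta update; w, u, v and g are equal-length lists of floats."""
    v_new = [beta * vi + (1.0 - beta) * gi * gi                    # running average of g_t^2
             for vi, gi in zip(v, g)]
    # delta w_t = -RMS(delta w_{t-1}) / RMS(g_t) * g_t, using u from the previous step
    delta = [-math.sqrt(ui + eps) / math.sqrt(vi + eps) * gi
             for ui, vi, gi in zip(u, v_new, g)]
    u_new = [beta * ui + (1.0 - beta) * di * di                    # running average of (delta w_t)^2
             for ui, di in zip(u, delta)]
    w_new = [wi + di for wi, di in zip(w, delta)]                  # w_{t+1} = w_t + delta w_t
    return w_new, u_new, v_new


w1, u1, v1 = adadelta_step([1.0], [0.0], [0.0], [2.0])
```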

# Adaptive Moment Optimizers

## 9. Adam Optimizer  


Adam (Adaptive Moment Estimation) is an optimization algorithm that computes adaptive learning rates for each parameter by estimating the first and second moments of the gradients. It combines the concepts of momentum optimization and RMSProp, providing adaptive learning rates along with momentum.  

Mathematical Representation:  
Adam maintains two moving averages:  

+ The first moment $m_t$, which is the exponentially decaying moving average of gradients.  

$$m_t = \beta_1 m_{t-1} + (1-\beta_1)g_t$$

+ The second moment $v_t$, which is the exponentially decaying moving average of squared gradients.  

$$v_t = \beta_2 v_{t-1} + (1-\beta_2)g^2_t$$

Where:

- $\beta_1$ and $\beta_2$ are hyperparameters controlling the decay rates of the moving averages, typically close to 1 but less than 1.  
- $g_t$ is the gradient at time step $t$.  
- $\epsilon$ is a small value added to prevent division by zero.  

$\beta_1, \beta_2 \in [0, 1)$  
e.g.,  $\beta_1 = 0.9$ and $\beta_2 = 0.999$  

Adam also incorporates bias correction to remove initialization bias:  

$$\hat{m_t} = \frac{m_t}{1-\beta^t_1}$$ 
$$\hat{v_t} = \frac{v_t}{1-\beta^t_2}$$

$$w_t = w_{t-1} - \eta \frac{\hat{m_t}}{{\sqrt{\hat{v_t}} + \epsilon}}$$  

#### Explanation:
- Adam combines the concepts of momentum optimization (using $m_t$) and RMSProp (using $v_t$) to provide adaptive learning rates for each parameter.
- The first moment $m_t$ represents the average gradient, providing momentum to the parameter updates.
- The second moment $v_t$ represents the uncentered variance of the gradients. It adapts the learning rate of each parameter to the recent magnitude of its gradients; it carries no information about their direction, because the gradients are squared.
- Bias correction is applied to account for the initialization bias of $m_t$ and $v_t$.
- The parameters are updated using the bias-corrected first moment divided by the square root of the bias-corrected second moment, scaled by the global learning rate $\eta$.
#### Benefits:
- Invariance to Diagonal Rescaling: Adam is invariant to diagonal rescaling of gradients, making it suitable for a wide range of optimization problems.
- Suitable for Online and Non-Stationary Problems: Adam adapts to changes in the optimization landscape, making it suitable for online learning and non-stationary optimization problems.
- Handling Noisy and Sparse Gradients: Adam's adaptive learning rates help in handling noisy and sparse gradients commonly encountered in deep learning tasks.
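
A minimal sketch of a single Adam step following the equations above, with the step counter `t` starting at 1. The name `adam_step` and the default values of `eta` and `eps` are assumptions (the values of `beta1` and `beta2` are the ones quoted above).

```python
import math


def adam_step(w, m, v, g, t, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; t is the 1-based step number."""
    m_new = [beta1 * mi + (1.0 - beta1) * gi for mi, gi in zip(m, g)]        # first moment
    v_new = [beta2 * vi + (1.0 - beta2) * gi * gi for vi, gi in zip(v, g)]   # second moment
    m_hat = [mi / (1.0 - beta1 ** t) for mi in m_new]                        # bias correction
    v_hat = [vi / (1.0 - beta2 ** t) for vi in v_new]
    w_new = [wi - eta * mh / (math.sqrt(vh) + eps)
             for wi, mh, vh in zip(w, m_hat, v_hat)]
    return w_new, m_new, v_new


# Two parameters with gradients of very different magnitude
w1, m1, v1 = adam_step([1.0, -1.0], [0.0, 0.0], [0.0, 0.0], [2.0, -0.5], t=1)
```

On the first step the bias-corrected moments reduce to $\hat{m}_1 = g_1$ and $\hat{v}_1 = g_1^2$, so each parameter moves by approximately $\eta$ in the direction opposite to its gradient, regardless of the gradient's magnitude.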

## 10. AdaMax Optimizer

AdaMax is a variant of the Adam optimizer that generalizes the concept of the second moment to the $L^{\infty}$ norm of the gradients. This adaptation aims to provide a more stable and effective optimization algorithm, particularly for deep learning tasks.
 
Mathematical Representation:  
AdaMax maintains two moving averages similar to Adam:  
- The first moment $m_t$, which is the exponentially decaying moving average of gradients.

$$m_t = \beta_1 m_{t-1} + (1-\beta_1)g_t$$

Adam's second moment is based on the $L^2$ norm of the gradients. This can be generalized to the $L^p$ norm:

$$v_t = \beta^p_2 v_{t-1} + (1-\beta^p_2)|g_t|^p$$

For Adam: $p = 2$  
For AdaMax: $p \to \infty$, which gives the infinity norm of the gradients, denoted $u_t$.

- $u_t$ is an exponentially weighted maximum of the absolute gradients: the previous value is discounted by $\beta_2$ before being compared with the current $|g_t|$.

$$u_t = \max(\beta_2 u_{t-1}, |g_t|)$$

$$\hat{m_t} = \frac{m_t}{1-\beta^t_1}$$ 

$$w_t = w_{t-1} - \eta \frac{\hat{m_t}}{u_t}$$

#### Explanation:
- AdaMax extends the concept of the second moment in Adam by generalizing it to the $L^{\infty}$ norm of the gradients.
- The $L^{\infty}$ norm term $u_t$ tracks an exponentially decayed maximum of the absolute gradients seen during training, providing a stable and adaptive learning rate for each parameter.
- The parameter update is scaled by the ratio of the bias-corrected first moment to the $L^{\infty}$ norm of the gradients, similar to the update rule in Adam.
- AdaMax offers an alternative approach to adaptive learning rate optimization, particularly suitable for scenarios where the maximum gradient magnitude is of interest.
#### Benefits:
- Stable and Adaptive Learning Rates: AdaMax provides stable and adaptive learning rates based on the maximum absolute value of the gradients encountered during training.
- Alternative to Adam: AdaMax offers an alternative to Adam, particularly suited for scenarios where the $L^{\infty}$ norm of the gradients is of importance.
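
A minimal sketch of the AdaMax step. The formula above divides by $u_t$ directly; the small `eps` added to the denominator here is an assumption of this sketch, included only to avoid dividing by zero when the first gradient is exactly zero. The name `adamax_step` and the default `eta` are also assumptions.

```python
def adamax_step(w, m, u, g, t, eta=0.002, beta1=0.9, beta2=0.999, eps=1e-8):
    """One AdaMax update; t is the 1-based step number."""
    m_new = [beta1 * mi + (1.0 - beta1) * gi for mi, gi in zip(m, g)]   # first moment
    u_new = [max(beta2 * ui, abs(gi)) for ui, gi in zip(u, g)]          # u_t = max(beta2 * u_{t-1}, |g_t|)
    m_hat = [mi / (1.0 - beta1 ** t) for mi in m_new]                   # bias-corrected first moment
    # eps is not part of the formula in the text; it only guards against u_t == 0
    w_new = [wi - eta * mh / (ui + eps) for wi, mh, ui in zip(w, m_hat, u_new)]
    return w_new, m_new, u_new


w1, m1, u1 = adamax_step([1.0], [0.0], [0.0], [2.0], t=1)
w2, m2, u2 = adamax_step(w1, m1, u1, [0.1], t=2)   # a much smaller gradient
```

After the second call, `u2` is the discounted earlier maximum rather than the new, smaller gradient, which shows how a large past gradient keeps damping the step for a while.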

## 11. AMSGrad Optimizer

AMSGrad is a variant of Adam that addresses the convergence issues observed in some adaptive learning rate methods like Adam and AdaDelta. It achieves this by modifying the update rule for the second moment so that the estimate used in the denominator never decreases.

Mathematical Representation:
AMSGrad modifies the update rule for the second moment compared to Adam and AdaDelta. Instead of applying bias correction to $v_t$, it keeps a running maximum of the second-moment estimates:

$$\hat{v}_t = \max(\hat{v}_{t-1}, v_t)$$

The update rule for the parameters $w$ in AMSGrad is similar to Adam:

$$w_t = w_{t-1} - \eta \frac{\hat{m_t}}{{\sqrt{\hat{v_t}} + \epsilon}}$$ 

#### Explanation:
- In Adam, $v_t$ can shrink after a run of small gradients, which raises the effective learning rate again and can prevent convergence. AMSGrad modifies the update rule for the second moment to rule this out.
- By taking the maximum of the past and the current second moment, AMSGrad keeps $\hat{v}_t$ non-decreasing, so the effective per-parameter learning rate $\eta / \sqrt{\hat{v}_t}$ never increases.
- The parameter update rule is similar to Adam, with the bias-corrected first moment divided by the square root of the modified second moment, scaled by the global learning rate.
#### Benefits:
- Improved Stability: Because the second-moment estimate used in the update never decreases, the effective learning rate cannot jump back up after a run of small gradients.
- Convergence: By addressing the convergence issues observed in some adaptive learning rate methods, AMSGrad aims to provide more reliable convergence during optimization.
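
The running maximum can be sketched as follows. The step keeps both the raw second moment `v` and the maximum `v_hat`, and uses the bias-corrected first moment in the update as in the rule above. The name `amsgrad_step`, the hyperparameter values, and the gradient sequence are assumptions for illustration.

```python
import math


def amsgrad_step(w, m, v, v_hat, g, t, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One AMSGrad update; t is the 1-based step number."""
    m_new = [beta1 * mi + (1.0 - beta1) * gi for mi, gi in zip(m, g)]
    v_new = [beta2 * vi + (1.0 - beta2) * gi * gi for vi, gi in zip(v, g)]
    v_hat_new = [max(vh, vi) for vh, vi in zip(v_hat, v_new)]   # v_hat_t = max(v_hat_{t-1}, v_t)
    m_hat = [mi / (1.0 - beta1 ** t) for mi in m_new]           # bias-corrected first moment
    w_new = [wi - eta * mh / (math.sqrt(vh) + eps)
             for wi, mh, vh in zip(w, m_hat, v_hat_new)]
    return w_new, m_new, v_new, v_hat_new


w, m, v, v_hat = [1.0], [0.0], [0.0], [0.0]
v_history, v_hat_history = [], []
for t, g in enumerate([2.0, 0.0, 0.0], start=1):   # one large gradient, then none
    w, m, v, v_hat = amsgrad_step(w, m, v, v_hat, [g], t)
    v_history.append(v[0])
    v_hat_history.append(v_hat[0])
```

In this run the raw second moment in `v_history` decays after the first step, while `v_hat_history` stays at its peak value.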

## 12. AdaBound Optimizer

AdaBound is an optimization algorithm that dynamically clips the learning rates during training to keep them within a desired range. It is inspired by the behavior of adaptive learning rate methods like Adam and combines it with the concept of learning rate clipping to achieve more stable and controlled optimization.

Mathematical Representation:
AdaBound maintains the first and second moments similar to Adam:

- The first moment $m_t$:

$$m_t = \beta_1 m_{t-1} + (1-\beta_1)g_t$$

- The second moment $v_t$:

$$v_t = \beta_2 v_{t-1} + (1-\beta_2)g^2_t$$

Where:

- $\beta_1$ and $\beta_2$ are hyperparameters controlling the decay rates of the moving averages, typically close to 1 but less than 1.

AdaBound then dynamically clips the learning rate $\eta_t'$ to keep it within a desired range:  
$${\eta_t}' = Clip \biggl( \frac{\eta}{\sqrt{v_t}}, \eta_{lower}(t), \eta_{upper}(t) \biggr) $$

Where:

- $\eta$ : global learning rate  
- $\eta_{lower}(t)$ : lower bound  
- $\eta_{upper}(t)$ : upper bound  

The clipped learning rate $\eta_t'$ is further scaled by the square root of the time step $t$ to provide a decaying effect:

$$\hat{\eta_t} = \frac{{\eta_t}'}{\sqrt{t}}$$

The parameters $w$ are then updated using the first moment $m_t$ and the scaled, clipped learning rate:

$$w_t = w_{t-1} - \hat{\eta_t}\odot m_t$$

#### Explanation:
- AdaBound combines the adaptive learning rate behavior of algorithms like Adam with the concept of learning rate clipping to achieve more stable and controlled optimization.
- By dynamically clipping the learning rates within a desired range, AdaBound prevents large fluctuations and ensures more stable convergence.
- The clipped learning rate is further scaled by the square root of the time step to provide a decaying effect, allowing for smoother optimization.
#### Benefits:
- Controlled Learning Rates: AdaBound dynamically clips the learning rates during training, ensuring that they stay within a desired range, which can lead to more stable and controlled optimization.
- Stable Convergence: By preventing large fluctuations in the learning rates, AdaBound aims to achieve more stable convergence during optimization.
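
A rough sketch of the clipping step. The text above does not specify $\eta_{lower}(t)$ and $\eta_{upper}(t)$, so the schedules used here, controlled by the hypothetical parameters `final_lr` and `gamma`, are assumptions: both bounds tighten toward `final_lr` as $t$ grows. The `eps` in the denominator is also an assumption, added to avoid division by zero.

```python
import math


def clip(x, lo, hi):
    return max(lo, min(x, hi))


def adabound_step(w, m, v, g, t, eta=0.001, beta1=0.9, beta2=0.999,
                  final_lr=0.1, gamma=0.001, eps=1e-8):
    """One AdaBound-style update; t is the 1-based step number."""
    m_new = [beta1 * mi + (1.0 - beta1) * gi for mi, gi in zip(m, g)]
    v_new = [beta2 * vi + (1.0 - beta2) * gi * gi for vi, gi in zip(v, g)]
    # Assumed bound schedules: both converge to final_lr as t increases
    lower = final_lr * (1.0 - 1.0 / (gamma * t + 1.0))
    upper = final_lr * (1.0 + 1.0 / (gamma * t))
    # Clip eta / sqrt(v_t) per parameter, then apply the 1 / sqrt(t) decay
    rates = [clip(eta / (math.sqrt(vi) + eps), lower, upper) / math.sqrt(t)
             for vi in v_new]
    w_new = [wi - ri * mi for wi, ri, mi in zip(w, rates, m_new)]   # w_t = w_{t-1} - rate * m_t
    return w_new, m_new, v_new, rates


w1, m1, v1, rates1 = adabound_step([1.0], [0.0], [0.0], [2.0], t=1)
# A tiny gradient would give a huge unclipped rate, so the upper bound applies
_, _, _, rates_tiny = adabound_step([1.0], [0.0], [0.0], [1e-6], t=1)
```

Early in training the bounds are wide and the per-parameter rate behaves like an adaptive method; as the bounds close in on `final_lr`, every parameter ends up with nearly the same rate.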

## 13. AdamW Optimizer

AdamW is an optimization algorithm that addresses the issue of weight decay regularization in Adam by incorporating it directly into the parameter update rule. It fixes the discrepancy between $L_2$ regularization and weight decay regularization observed in Adam, ensuring more consistent and effective regularization.

First moment:
$$m_t = \beta_1 m_{t-1} + (1-\beta_1)g_t$$

Second (raw) moment:

$$v_t = \beta_2 v_{t-1} + (1-\beta_2)g^2_t$$

Bias Correction:  
$$\hat{m_t} = \frac{m_t}{1-\beta^t_1}$$ 
$$\hat{v_t} = \frac{v_t}{1-\beta^t_2}$$  

Mathematical Representation:  
The update rule for the parameters $w$ in AdamW incorporates both the first and second moments as well as weight decay regularization:  

$$w_t = w_{t-1} - \Bigg( \eta \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} + \lambda w_{t-1} \Bigg)$$

Weight decay regularization:

$$w_{t+1} = (1-\lambda)w_{t} - \eta \nabla f(w_t)$$

$L_2$ regularization:

$$L^{reg}(w_t) = f(w_t) + \frac{\lambda}{2}||w_t||^2_2$$

$$\nabla L^{reg}(w_t) = \nabla f({w_t}) + \lambda w_t$$

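SGD: $w_{t+1} = w_t - \eta(\nabla f(w_t) + \lambda w_t)$

Under standard SGD, $L_2$ regularization is equivalent to weight decay after reparameterizing the decay coefficient: the update above shrinks $w_t$ by a factor $(1 - \eta\lambda)$, so it matches weight decay with coefficient $\eta\lambda$. This equivalence does not hold for Adam, because the $\lambda w_t$ term added to the gradient is rescaled by $\sqrt{\hat{v}_t}$ along with the rest of the gradient.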


#### Explanation:
- AdamW fixes the discrepancy between $L_2$ regularization and weight decay regularization observed in Adam by directly incorporating weight decay into the parameter update rule.
- Weight decay regularization is added to the parameter update alongside the gradient-based update, ensuring consistent regularization throughout the optimization process.
- By integrating weight decay into the parameter update, AdamW provides more consistent and effective regularization, leading to improved generalization performance.
#### Benefits:
- Consistent Regularization: AdamW ensures consistent regularization throughout the optimization process by incorporating weight decay directly into the parameter update rule.
- Improved Generalization: By providing more effective regularization, AdamW can lead to improved generalization performance, especially in deep learning tasks with complex models and datasets.
#### Important Note:
- Prefer AdamW Over Adam with $L_2$ Regularization: When using Adam optimizer, it is recommended to use AdamW instead of applying $L_2$ regularization separately. This ensures more consistent regularization and can lead to better optimization results.
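
A minimal sketch of the AdamW update rule given above. The gradient `g` passed in is the gradient of $f$ alone, with no $L_2$ term added, and the decay $\lambda w_{t-1}$ is applied outside the adaptive scaling. The name `adamw_step` and the default values of `eta`, `eps` and `lam` are assumptions for illustration.

```python
import math


def adamw_step(w, m, v, g, t, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8, lam=0.01):
    """One AdamW update with decoupled weight decay; t is the 1-based step number."""
    m_new = [beta1 * mi + (1.0 - beta1) * gi for mi, gi in zip(m, g)]
    v_new = [beta2 * vi + (1.0 - beta2) * gi * gi for vi, gi in zip(v, g)]
    m_hat = [mi / (1.0 - beta1 ** t) for mi in m_new]
    v_hat = [vi / (1.0 - beta2 ** t) for vi in v_new]
    # Adaptive gradient step plus weight decay that is NOT divided by sqrt(v_hat)
    w_new = [wi - (eta * mh / (math.sqrt(vh) + eps) + lam * wi)
             for wi, mh, vh in zip(w, m_hat, v_hat)]
    return w_new, m_new, v_new


w1, _, _ = adamw_step([1.0], [0.0], [0.0], [2.0], t=1)      # gradient step plus decay
w_decay, _, _ = adamw_step([1.0], [0.0], [0.0], [0.0], t=1)  # zero gradient: decay only
```

With a zero gradient the parameter still shrinks by the factor $(1 - \lambda)$, which shows that the decay does not pass through the adaptive scaling. Some implementations multiply the decay term by the learning rate; this sketch follows the formula as written above.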