---   

<img align="left" width="110"   src="https://upload.wikimedia.org/wikipedia/commons/c/c3/Python-logo-notext.svg"> 


<h1 align="center">Tools and Techniques for Data Science</h1>
<h1 align="center">Course: Deep Learning</h1>

---
<h3 align="right">Muhammad Sheraz (Data Scientist)</h3>
<h1 align="center">Day34 (Momentum Optimizers)</h1>




<div align='center'><img  src='Images/types_gd.png'></div>

### Drawbacks of base optimizers (GD, SGD, mini-batch GD)

<img align='right' src='Images/gd.png'>

- Gradient Descent uses the whole training set for every update of the weights and biases. With millions of records, each update becomes slow and computationally very expensive.

- SGD addresses this problem by updating the parameters from a single record at a time. It is still slow to converge, because every record needs its own forward and backward propagation, and the path towards the minimum becomes very noisy.

- Mini-batch GD overcomes the SGD drawbacks by using a batch of records for each parameter update. Because it does not use all the records for an update, its path towards the minimum is still not as smooth as that of Gradient Descent.
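
The three variants differ only in how many records feed each update. The sketch below is a toy illustration rather than part of the original lecture: the `train` helper, the one-weight model `y = w * x`, the data, and the hyperparameters are all assumptions chosen to keep it small.

```python
import random

def train(xs, ys, batch_size, lr=0.1, epochs=50, seed=0):
    """Fit y ~ w * x by gradient descent; batch_size selects the variant."""
    rng = random.Random(seed)
    w, updates = 0.0, 0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Mean gradient of the squared error 0.5 * (w*x - y)**2 over the batch
            grad = sum((w * xs[i] - ys[i]) * xs[i] for i in batch) / len(batch)
            w -= lr * grad
            updates += 1
    return w, updates

xs = [i / 10 for i in range(1, 21)]   # 20 toy inputs in (0, 2]
ys = [3.0 * x for x in xs]            # noiseless targets, true w = 3

w_gd, n_gd = train(xs, ys, batch_size=len(xs))   # batch GD: one update per epoch
w_sgd, n_sgd = train(xs, ys, batch_size=1)       # SGD: one update per record
w_mb, n_mb = train(xs, ys, batch_size=5)         # mini-batch GD: in between

print(n_gd, n_sgd, n_mb)
```

All three reach roughly the same weight here. What changes is the number of updates per epoch and how much data each update has to process.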


**Constant Gradient:**
- The gradient stays the same across iterations.
- Often indicates a flat or uniformly sloped region of the optimization landscape.
- In traditional gradient descent, this leads to identical updates at every step.
- In Momentum Optimization, it results in a steady build-up of momentum without abrupt changes.
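
A small sketch of this build-up, assuming the classical momentum form `v = beta * v + lr * grad` (the function name and the numbers are illustrative). With a constant gradient the velocity grows as a geometric series and levels off at `lr * grad / (1 - beta)`:

```python
def velocity_under_constant_gradient(grad, lr, beta, steps):
    """Track the velocity v = beta * v + lr * grad for a fixed gradient."""
    v, history = 0.0, []
    for _ in range(steps):
        v = beta * v + lr * grad
        history.append(v)
    return history

history = velocity_under_constant_gradient(grad=1.0, lr=0.1, beta=0.9, steps=200)
limit = 0.1 * 1.0 / (1.0 - 0.9)   # geometric-series limit: lr * grad / (1 - beta)

print(history[0], history[-1], limit)
```

Under these assumptions the step size ends up about `1 / (1 - beta)` times larger than a plain gradient descent step, which is the "bigger steps in a consistent direction" behaviour.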

**High Curvature:**
- Indicates rapid changes in the optimization landscape.
- Typically found in regions with sharp peaks or valleys.
- In traditional gradient descent, may result in slow convergence or oscillations around the optimum.
- Momentum Optimization helps by smoothing out updates and efficiently navigating through regions with high curvature.
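
To make this concrete, the sketch below compares plain gradient descent with momentum on a toy ravine, `f(x, y) = 0.5 * (x**2 + 50 * y**2)`, which is 50 times more curved along `y` than along `x`. The function, the starting point, the learning rate, and the classical update `v = beta * v + lr * grad` are all assumptions made for illustration.

```python
def loss(x, y):
    return 0.5 * (x * x + 50.0 * y * y)

def grad(x, y):
    return x, 50.0 * y

def run(beta, lr=0.035, steps=200):
    """Minimise the ravine; beta = 0 reduces to plain gradient descent."""
    x, y = 10.0, 1.0
    vx = vy = 0.0
    for _ in range(steps):
        gx, gy = grad(x, y)
        vx = beta * vx + lr * gx
        vy = beta * vy + lr * gy
        x, y = x - vx, y - vy
    return loss(x, y)

loss_gd = run(beta=0.0)
loss_momentum = run(beta=0.9)
print(f"plain GD: {loss_gd:.2e}, momentum: {loss_momentum:.2e}")
```

The steep `y` direction caps the usable learning rate, so plain gradient descent crawls along the shallow `x` direction. With the same learning rate, momentum ends at a lower loss in this setup.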

**Noisy Gradient:**
- Gradients exhibit significant fluctuations due to noisy data or stochasticity.
- Often encountered in scenarios where data is incomplete or corrupted.
- In traditional gradient descent, noisy gradients can lead to slow convergence or convergence to suboptimal solutions.
- Momentum Optimization helps by averaging out the noise over time, resulting in smoother updates and faster convergence despite the noise.
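
The averaging effect can be sketched with an exponentially weighted average over simulated noisy gradients. The true gradient of `1.0`, the Gaussian noise, the seed, and `beta = 0.9` are assumptions for the demonstration, not values from the lecture.

```python
import random
import statistics

def ewa(values, beta):
    """Exponentially weighted average: s = beta * s + (1 - beta) * value."""
    smoothed, s = [], 0.0
    for value in values:
        s = beta * s + (1.0 - beta) * value
        smoothed.append(s)
    return smoothed

rng = random.Random(0)
true_gradient = 1.0
noisy = [true_gradient + rng.gauss(0.0, 1.0) for _ in range(2000)]
smoothed = ewa(noisy, beta=0.9)

warmup = 100  # skip the start, where the average is still biased toward 0
raw_spread = statistics.pstdev(noisy[warmup:])
smooth_spread = statistics.pstdev(smoothed[warmup:])
print(f"raw std: {raw_spread:.3f}, smoothed std: {smooth_spread:.3f}")
```

The smoothed sequence still centres on the true gradient but fluctuates far less, so the parameter updates built from it are steadier.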


**Saddle Point:**
- Saddle points are critical points in the optimization landscape where the gradient is zero but the point is neither a minimum nor a maximum.
- The surface curves upward in some directions and downward in others, and it is often nearly flat close to the saddle itself.
- In traditional gradient descent, convergence at saddle points can be slow due to the flat directions, causing the algorithm to get stuck.
- Momentum Optimization helps in escaping saddle points by accumulating momentum in the steep directions, facilitating faster convergence.
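
A toy illustration of the escape behaviour, using the saddle `f(x, y) = x**2 - y**2` at the origin. The start point just off the saddle, the escape threshold `|y| > 1`, and the classical update `v = beta * v + lr * grad` are assumptions chosen for the sketch.

```python
def steps_to_escape(beta, lr=0.01, max_steps=10_000):
    """Count steps until the iterate has left the saddle along y."""
    x, y = 1.0, 1e-6          # almost exactly on the flat direction
    vx = vy = 0.0
    for step in range(1, max_steps + 1):
        gx, gy = 2.0 * x, -2.0 * y     # gradient of x**2 - y**2
        vx = beta * vx + lr * gx
        vy = beta * vy + lr * gy
        x, y = x - vx, y - vy
        if abs(y) > 1.0:
            return step
    return max_steps

steps_gd = steps_to_escape(beta=0.0)        # plain gradient descent
steps_momentum = steps_to_escape(beta=0.9)
print(steps_gd, steps_momentum)
```

Near the saddle the gradient along `y` is tiny, so plain gradient descent moves away very slowly. Momentum keeps adding the small but consistent pushes together and leaves in fewer steps in this setup.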


## Convex and Non-Convex Optimization



**Convex Optimization:**
- Objective function: convex, so it has no local minima other than the global minimum.
- All local minima are also global minima.
- Gradient-based algorithms (e.g., gradient descent) converge to the global minimum, given a suitable learning rate.
- Examples: Linear programming, quadratic programming, least squares regression, support vector machines with linear kernels.

**Non-Convex Optimization:**
- Objective function: may have multiple local minima and saddle points in addition to the global minimum.
- Convergence to global minimum is not guaranteed.
- Gradient-based algorithms may get stuck in local minima or saddle points.
- Exploration techniques (e.g., random restarts, simulated annealing) are often employed to find better solutions.
- Examples: Neural network training, clustering algorithms (e.g., k-means), many real-world optimization problems.
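
The difference shows up when gradient descent is started from two different points. In the sketch below, the convex function `x**2` and the non-convex function `x**4 - 3*x**2 + x` are illustrative choices, as are the start points and learning rate.

```python
def gradient_descent(grad, x0, lr=0.01, steps=2000):
    """Plain gradient descent on a one-dimensional function."""
    x = x0
    for _ in range(steps):
        x -= lr * grad(x)
    return x

def convex_grad(x):          # derivative of f(x) = x**2
    return 2.0 * x

def nonconvex(x):            # two minima, near x = -1.30 and x = 1.13
    return x**4 - 3.0 * x**2 + x

def nonconvex_grad(x):
    return 4.0 * x**3 - 6.0 * x + 1.0

# Convex case: both starts end at the same (global) minimum, x = 0.
convex_left = gradient_descent(convex_grad, -2.0)
convex_right = gradient_descent(convex_grad, 2.0)

# Non-convex case: the start point decides which minimum is found.
x_left = gradient_descent(nonconvex_grad, -2.0)
x_right = gradient_descent(nonconvex_grad, 2.0)
print(x_left, nonconvex(x_left), x_right, nonconvex(x_right))
```

In the non-convex case the run started on the right settles in the shallower minimum, which is why techniques such as random restarts are used to look for better solutions.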


<img src='Images/connoncon.png'>

**Read here for a detailed lecture**
- <a href='https://pub.towardsai.net/gradient-descent-and-the-melody-of-optimization-algorithms-244830ea2516'>Link1</a>
- <a href='https://towardsdatascience.com/learning-parameters-part-2-a190bef2d12'>Link2</a>

> **Intuition:** If I am repeatedly being asked to move in the same direction, then I should probably gain some confidence and start taking bigger steps in that direction, just as a ball gains momentum while rolling down a slope.

## 1. SGD with momentum

<img align='right' src='Images/mom_int.png'> 

- It usually converges faster than plain Stochastic Gradient Descent.

- The problem with SGD is that, as it approaches the `minima`, the `high oscillation` prevents us from `increasing` the `learning rate`, so it takes a long time to converge. SGD with momentum instead computes `Exponentially Weighted Averages` of the gradients and uses this averaged gradient to update the parameters.

- The equation for updating weights and bias in SGD:

<div align='center'><img src='Images/f11.png'></div>

- The equation for updating weights and bias in SGD with momentum:

<div align='center'><img src='Images/mom_eq.png'></div>


- In SGD with momentum, a momentum term is added to the gradient update, so the current update depends on the previous gradients as well as the current one. This helps SGD converge faster and reduces oscillation.
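
One common way to write the two updates in code is sketched below. It assumes the classical form `v = beta * v + lr * grad`, `w = w - v`. The notation in the figures above may differ, and the exponentially weighted average form scales the gradient by `(1 - beta)` instead. The function names and the toy gradients are illustrative.

```python
def sgd_step(w, grad, lr):
    """Plain SGD: step against the current gradient only."""
    return w - lr * grad

def momentum_step(w, v, grad, lr, beta):
    """SGD with momentum: the velocity v carries over past gradients."""
    v = beta * v + lr * grad
    return w - v, v

gradients = [1.0, 2.0, 3.0]      # made-up gradients for three steps
lr, beta = 1.0, 0.5

w_sgd = 0.0
w_mom, v = 0.0, 0.0
for g in gradients:
    w_sgd = sgd_step(w_sgd, g, lr)
    w_mom, v = momentum_step(w_mom, v, g, lr, beta)

# The final velocity is a decaying sum of all gradients seen so far:
# v = lr * (beta**2 * 1.0 + beta * 2.0 + 3.0)
print(w_sgd, w_mom, v)
```

Because every gradient here points the same way, the momentum version travels further than plain SGD in the same three steps.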


### Why Momentum optimizer

- Momentum optimizer helps accelerate gradient descent by adding a fraction of the update vector from the previous time step to the current update.
- It reduces oscillations and helps the optimizer converge faster by smoothing out the updates.
- This momentum allows the optimizer to continue in the same direction if previous gradients have been consistently pointing in that direction.
- It can help the optimizer move past shallow local minima and saddle points.
- Momentum optimizer helps traverse steep and narrow ravines in the loss landscape.
- Overall, it improves the efficiency and speed of convergence when training neural networks.


> - The main benefit of the Momentum optimizer is `Speed`.
> - As a rule of thumb, in about `99%` of cases the Momentum optimizer is faster than plain Stochastic Gradient Descent.

### Role of alpha in momentum optimizer

- In the momentum optimizer, the parameter alpha (typically denoted as α) is the momentum coefficient.
- Alpha controls the contribution of the previous update vector to the current update.
- A higher value of alpha means more momentum, causing the optimizer to remember and build up speed in previous directions.
- A lower value of alpha reduces the influence of past gradients, making the optimization process less persistent in previous directions.
- The choice of alpha affects the trade-off between exploration and exploitation during optimization.
- Typically, alpha values range between 0 and 1, with common values such as 0.5, 0.9, or 0.99 used in practice.
- Tuning alpha requires balancing between stability (avoiding oscillations) and efficiency (speeding up convergence).


- When alpha is 0:
  - The momentum term becomes ineffective.
  - Each update is solely based on the current gradient without considering any past updates.
  - Essentially equivalent to using vanilla gradient descent without momentum.
  - No memory of past gradients, potentially leading to slower convergence.

- When alpha is 1:
  - The momentum term is fully utilized.
  - In the exponentially weighted average form, the current gradient gets weight `1 - alpha = 0`, so the update is determined entirely by the momentum term and the current gradient is ignored.
  - Maintains constant velocity in previous directions, potentially leading to overshooting or instability.
  - Persistence in previous directions might help escape local minima but can also lead to divergence.
  - Setting alpha to 1 is generally not recommended due to instability risks.
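
The two extremes can be checked by looking at how much weight each past gradient receives. The sketch assumes the exponentially weighted average form `v = alpha * v_prev + (1 - alpha) * gradient`, and `gradient_weights` is a hypothetical helper written for this illustration.

```python
def gradient_weights(alpha, n):
    """Weight given to the gradient from k steps ago, for k = 0 .. n-1,
    under v = alpha * v_prev + (1 - alpha) * gradient."""
    return [(1.0 - alpha) * alpha ** k for k in range(n)]

no_momentum = gradient_weights(0.0, 5)   # all weight on the current gradient
typical = gradient_weights(0.9, 50)
frozen = gradient_weights(1.0, 5)        # every gradient gets weight 0

recent_share = sum(typical[:10])         # share carried by the last 10 gradients
print(no_momentum, frozen, round(recent_share, 3))
```

With `alpha = 0.9`, the most recent `1 / (1 - alpha) = 10` gradients carry about two thirds of the total weight, which is a handy way to read alpha as a memory length.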


### Problems of Momentum Optimizer

- **Overshooting:**
  - High momentum values can cause the optimizer to overshoot the minimum point, leading to oscillations or instability.
- **Difficulty in Fine-Tuning:**
  - Momentum might hinder fine-tuning in the later stages of training when approaching the minimum point.
- **Parameter Sensitivity:**
  - Momentum's effectiveness can vary significantly depending on the choice of learning rate and other hyperparameters.
- **Memory Requirements:**
  - Momentum requires additional memory to store previous update information, which could be a concern for memory-constrained environments.
- **Dependency on Initial Conditions:**
  - Momentum's behavior can be sensitive to the initial conditions, making it harder to predict its performance across different optimization tasks.


<img src='Images/mom.png'>

> One typical drawback of the Momentum optimizer is that it tends to oscillate around the minimum (like a ball rolling in a V-shaped valley) before settling there.
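
This oscillation can be seen on a one-dimensional bowl by counting how often the iterate jumps from one side of the minimum to the other. The function `f(x) = 0.5 * x**2`, the hyperparameters, and the classical update `v = beta * v + lr * grad` are assumptions for the sketch.

```python
def count_crossings(beta, lr=0.1, steps=100):
    """Count how often x changes sign while minimising f(x) = 0.5 * x**2."""
    x, v, crossings = 1.0, 0.0, 0
    for _ in range(steps):
        grad = x                      # f'(x) = x
        v = beta * v + lr * grad
        new_x = x - v
        if new_x * x < 0:             # moved to the other side of the minimum
            crossings += 1
        x = new_x
    return crossings

crossings_gd = count_crossings(beta=0.0)        # plain gradient descent
crossings_momentum = count_crossings(beta=0.9)
print(crossings_gd, crossings_momentum)
```

With this learning rate plain gradient descent approaches the minimum from one side only, while the momentum run overshoots and swings back several times before settling.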

<h1 align='center'>Interview Questions</h1>

**1. What is Momentum Optimization in the context of machine learning and optimization algorithms?**

   **Answer:** 
   Momentum Optimization is a technique used in optimization algorithms, particularly in stochastic gradient descent (SGD) variants, to accelerate convergence and escape local minima. It involves incorporating the notion of momentum, i.e., the accumulation of past gradients, to update the parameters of a model.

**2. How does Momentum Optimization differ from traditional gradient descent?**

   **Answer:** 
   In traditional gradient descent, the parameters are updated in the direction opposite to the gradient of the loss function with respect to those parameters. Momentum Optimization adds a momentum term that accelerates the updates by accumulating gradients over time, allowing for smoother and faster convergence, especially in regions with high curvature.

**3. What is the formula for updating parameters using Momentum Optimization?**

   **Answer:** 
   The formula for updating parameters using Momentum Optimization is:

   <div align='center'><img src='Images/mom_eq.png'></div>

  
**4. What is the purpose of the momentum term in Momentum Optimization?**

   **Answer:** 
   The momentum term serves to accelerate convergence by accumulating gradients from past updates. It smooths out the updates by incorporating information about the direction of previous gradients, which helps to navigate through flat regions or saddle points more efficiently and escape local minima.

**5. How do you choose the value of the momentum parameter `beta` in Momentum Optimization?**

   **Answer:** 
   The choice of the momentum parameter depends on the characteristics of the optimization problem and the data. Typically, values for `beta` (the same momentum coefficient that is called alpha earlier in these notes) range between 0 and 1, with common choices being around 0.9. Higher values of `beta` result in stronger momentum and often faster convergence, but too high a value may lead to oscillations or overshooting.

**6. Can Momentum Optimization be combined with other optimization techniques?**

   **Answer:** 
   Yes, Momentum Optimization can be combined with techniques like learning rate schedules (e.g., learning rate decay) and adaptive learning rate methods (e.g., Adam, RMSprop) to further improve convergence and stability.

**7. What are the advantages of using Momentum Optimization?**

   **Answer:** 
   - Momentum Optimization accelerates convergence, especially in the presence of noisy or sparse gradients.
   - It helps to escape local minima and plateaus more effectively compared to traditional gradient descent.
   - Momentum can act as a form of inertia, smoothing out updates and reducing oscillations during optimization.

**8. Are there any limitations or drawbacks to using Momentum Optimization?**

   **Answer:** 
   - Momentum Optimization may overshoot or oscillate around the optimal solution if the momentum parameter is too high.
   - It introduces an additional hyperparameter `beta` that needs to be tuned.
   - Momentum may cause the algorithm to converge too quickly, skipping over potentially valuable regions of the parameter space.
