---   

<img align="left" width="110"   src="https://upload.wikimedia.org/wikipedia/commons/c/c3/Python-logo-notext.svg"> 


<h1 align="center">Tools and Techniques for Data Science</h1>
<h1 align="center">Course: Deep Learning</h1>

---
<h3 align="right">Muhammad Sheraz (Data Scientist)</h3>
<h1 align="center">Day35 (Nesterov Accelerated Gradient)</h1>




<img  src='Images/mom_nag.png'> 


<a href='https://towardsdatascience.com/learning-parameters-part-2-a190bef2d12'>Read the detailed lecture here</a>


## 1. Nesterov Accelerated Gradient Descent

<img width='50%' align='right' src='Images/mom_nag.png'> 

- A variant of the gradient descent optimization algorithm.
- Extension of basic gradient descent.
- Utilizes `momentum` for faster convergence.
  

**Intuition:**

- Momentum speeds up gradient descent, but if the accumulated momentum is too large, the update can overshoot a minimum and keep climbing the opposite slope. Nesterov Accelerated Gradient (NAG) was developed to address this. It is a lookahead method: because the momentum term γV(t−1) will be applied to the weights anyway, θ−γV(t−1) approximates where the parameters are about to be, and NAG evaluates the gradient at that point instead of at the current θ.


<img src='Images/f_nag.PNG'>
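
Written out, the two update rules differ only in where the gradient is evaluated. This is a sketch in the notation of the intuition above ($\theta$ for the parameters, $\gamma$ for the momentum coefficient, $V$ for the accumulated update); the learning rate $\eta$ and the loss $J$ are symbols assumed here for illustration.

$$
\begin{aligned}
\text{Momentum:}\quad V_t &= \gamma V_{t-1} + \eta \nabla_\theta J(\theta_{t-1}), & \theta_t &= \theta_{t-1} - V_t \\
\text{NAG:}\quad V_t &= \gamma V_{t-1} + \eta \nabla_\theta J(\theta_{t-1} - \gamma V_{t-1}), & \theta_t &= \theta_{t-1} - V_t
\end{aligned}
$$

The same rule is often written with a velocity $v = -V$, in which case the lookahead point becomes $\theta_{t-1} + \gamma v_{t-1}$ and the parameter update becomes $\theta_t = \theta_{t-1} + v_t$.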


**Usage**:
- Particularly useful in training *deep neural networks*.
- Often converges faster than basic gradient descent or plain momentum.
- Helps mitigate oscillations, leading to more stable convergence.


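### Advantages:

- Less likely than plain momentum to overshoot a `local minimum`.
- `Slows` down as it approaches a minimum, because the lookahead gradient corrects the step early.

### Disadvantages:
- It can still get stuck in a `local minimum`.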
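
The disadvantage can be seen on a small non-convex example. The sketch below is illustrative only: the function `f(x) = x**4 - 3*x**2 + x`, the helper `nag_1d`, and the hyperparameter values are assumptions chosen for the demonstration, not part of the lecture material. The function has a shallow minimum near `x ≈ 1.13` and a deeper one near `x ≈ -1.30`.

```python
def f(x):
    # Illustrative non-convex function with two minima
    return x**4 - 3 * x**2 + x

def grad(x):
    # Derivative of f
    return 4 * x**3 - 6 * x + 1

def nag_1d(x0, learning_rate=0.01, momentum=0.5, num_iterations=500):
    x, velocity = x0, 0.0
    for _ in range(num_iterations):
        lookahead = x + momentum * velocity              # peek ahead along the velocity
        velocity = momentum * velocity - learning_rate * grad(lookahead)
        x += velocity
    return x

x_right = nag_1d(1.5)    # starts in the basin of the shallower minimum
x_left = nag_1d(-1.5)    # starts in the basin of the deeper minimum
print("from  1.5:", x_right, f(x_right))
print("from -1.5:", x_left, f(x_left))
```

Both runs settle in whichever basin they start in, so the run started at `1.5` ends at the shallower minimum even though a lower loss exists elsewhere.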

<h1 align='center'>Difference Between Momentum and Nesterov Accelerated Gradient Descent </h1>

| Feature                              | Momentum                                    | Nesterov Accelerated Gradient Descent (NAG)                                      |
|--------------------------------------|---------------------------------------------|-----------------------------------------------------------------------------------|
| Basis                                | Basic gradient descent with momentum       | Extension of momentum-based optimization with lookahead step                      |
| Update Rule                          | Standard momentum update rule              | Modified update rule considering the lookahead position                           |
| Calculation of Gradient              | Gradient calculated at current position    | Gradient calculated at lookahead position                                         |
| Anticipation                         | No anticipation of future direction        | Anticipates future direction by adjusting the lookahead position with momentum    |
| Convergence Improvement              | Helps in accelerating convergence          | Further improves convergence by reducing overshooting                             |
| Application                          | Widely used in various optimization tasks  | Particularly useful in training deep neural networks for faster convergence        |
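
The "Calculation of Gradient" row is the only difference in code. The following is a minimal sketch, assuming a toy loss `f(x) = 0.5 * x**2` and a hypothetical helper `run`; it records how far each method swings past the minimum at `x = 0`.

```python
def grad(x):
    # Gradient of the illustrative loss f(x) = 0.5 * x**2 (minimum at x = 0)
    return x

def run(use_lookahead, x0=1.0, learning_rate=0.1, momentum=0.9, num_iterations=50):
    x, velocity, path = x0, 0.0, [x0]
    for _ in range(num_iterations):
        # Momentum evaluates the gradient at x; NAG evaluates it at the lookahead point
        point = x + momentum * velocity if use_lookahead else x
        velocity = momentum * velocity - learning_rate * grad(point)
        x += velocity
        path.append(x)
    return path

momentum_path = run(use_lookahead=False)
nag_path = run(use_lookahead=True)
print("largest overshoot past the minimum (momentum):", -min(momentum_path))
print("largest overshoot past the minimum (NAG):     ", -min(nag_path))
```

With these settings both methods cross the minimum, but the NAG path turns back sooner, which is the "reducing overshooting" behaviour listed in the table.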


### How it Works?



1. **Initialize Parameters**: Initialize the parameters (weights and biases) of your model randomly or using some predetermined values.

2. **Set Hyperparameters**: Determine the hyperparameters such as the learning rate (`α`) and momentum (`μ`).

3. **Initialize Velocity**: Initialize the velocity vector (`v_0`) to zero or a small random value.

4. **Iterative Optimization**:
   - **Lookahead Position**: Move the current parameters along the previous velocity to get the lookahead position:
     ```
     x_{lookahead} = x_{t-1} + μ * v_{t-1}
     ```
   - **Forward Pass at Lookahead**: Perform a forward pass through your model with the lookahead parameters to compute the loss function.
   - **Calculate Gradient at Lookahead**: Calculate the gradient of the loss function with respect to the parameters at the lookahead position, `∇f(x_{lookahead})`.
   - **Update Velocity with Momentum**: Update the velocity using the momentum term and the lookahead gradient:
     ```
     v_t = μ * v_{t-1} - α * ∇f(x_{lookahead})
     ```
   - **Update Parameters**: Move the parameters by the new velocity:
     ```
     x_t = x_{t-1} + v_t
     ```

5. **Repeat**: Repeat the iterative optimization process until convergence criteria are met (e.g., the loss function stops decreasing or reaches a threshold).

6. **Convergence Check**: Monitor the convergence of the optimization process by tracking the value of the loss function or other performance metrics on a validation set.

7. **Evaluate**: After convergence or upon reaching a stopping criterion, evaluate the performance of your model on a separate test set to assess its generalization ability.

8. **Adjust Hyperparameters**: Fine-tune the hyperparameters if necessary based on the performance evaluation results.

9. **Deploy**: Once satisfied with the model's performance, deploy it for making predictions on new data.

10. **Regularization (Optional)**: Implement regularization techniques such as L1 or L2 regularization to prevent overfitting if needed.


In [1]:
import numpy as np

def nag_optimizer(X, y, initial_params, learning_rate, momentum, num_iterations):
    # Work on a float copy so the caller's initial_params array is not modified in place
    params = np.array(initial_params, dtype=float)
    velocity = np.zeros_like(params)
    
    for i in range(num_iterations):
        # Calculate gradient at lookahead position:
        # the current parameters moved along the momentum-scaled velocity
        lookahead_params = params + momentum * velocity
        gradient = compute_gradient(X, y, lookahead_params)
        
        # Update velocity
        velocity = momentum * velocity - learning_rate * gradient
        
        # Update parameters
        params += velocity
        
        # Optionally, you can include convergence criteria here
        # Example: if np.linalg.norm(gradient) < tolerance:
        #              break
    
    return params

def compute_gradient(X, y, params):
    # Compute gradient of loss function with respect to parameters
    # This is specific to your model and loss function
    # For simplicity, assuming a linear regression model with
    # loss = mean squared error / 2 here
    predictions = np.dot(X, params)
    errors = predictions - y
    gradient = np.dot(X.T, errors) / len(X)
    return gradient

# Example usage
X = np.array([[1, 2], [3, 4], [5, 6]])  # Input features
y = np.array([3, 4, 5])  # Target labels
initial_params = np.zeros(X.shape[1])  # Initialize parameters
learning_rate = 0.01
momentum = 0.9
num_iterations = 1000

# Run Nesterov Accelerated Gradient Descent
final_params = nag_optimizer(X, y, initial_params, learning_rate, momentum, num_iterations)
print("Final parameters:", final_params)
# The exact least-squares solution for this data is [-2, 2.5],
# so the printed values should be close to it.
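
The optimizer above leaves its convergence check as a comment. One way to fill it in is sketched below, assuming the same linear-regression gradient and a hypothetical function name `nag_with_tolerance`; the tolerance and iteration cap are illustrative values, not recommendations.

```python
import numpy as np

def nag_with_tolerance(X, y, learning_rate=0.01, momentum=0.9,
                       max_iterations=10_000, tolerance=1e-6):
    params = np.zeros(X.shape[1])
    velocity = np.zeros_like(params)
    for iteration in range(max_iterations):
        lookahead = params + momentum * velocity
        # Gradient of the halved mean squared error, evaluated at the lookahead point
        gradient = X.T @ (X @ lookahead - y) / len(X)
        # Stop early once the lookahead gradient is numerically negligible
        if np.linalg.norm(gradient) < tolerance:
            break
        velocity = momentum * velocity - learning_rate * gradient
        params = params + velocity
    return params, iteration

X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
y = np.array([3.0, 4.0, 5.0])
params, iterations_used = nag_with_tolerance(X, y)
print("parameters:", params)
print("iterations used:", iterations_used)
```

On this data the loop should stop well before the iteration cap, with parameters near the least-squares solution `[-2, 2.5]`.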


<h1 align='center'>Interview Questions</h1>


### What is Nesterov Accelerated Gradient Descent?

Nesterov Accelerated Gradient Descent (NAG) is a variant of the gradient descent optimization algorithm that accelerates convergence by combining momentum with a gradient evaluated at a lookahead position rather than at the current parameters.

### How does NAG differ from standard momentum-based optimization?

NAG calculates the gradient not at the current position but at a lookahead position, which is adjusted using the momentum term. This lookahead update allows the algorithm to anticipate the future direction of movement, helping to reduce overshooting and improve convergence speed, especially in narrow valleys.

### Why is Nesterov Accelerated Gradient Descent useful?

NAG further refines the update rule by considering the lookahead position, which helps to reduce overshooting and improve convergence, especially in scenarios with complex optimization landscapes. It accelerates convergence, particularly in scenarios with long, narrow valleys.
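
The "long, narrow valley" case can be illustrated with an ill-conditioned quadratic. This is a rough sketch under assumed values: the loss `0.5 * (w[0]**2 + 100 * w[1]**2)`, the helper `steps_to_converge`, and the hyperparameters are all chosen for illustration.

```python
import numpy as np

# Illustrative narrow valley: f(w) = 0.5 * (w[0]**2 + 100 * w[1]**2)
curvatures = np.array([1.0, 100.0])

def grad(w):
    return curvatures * w

def steps_to_converge(momentum, learning_rate=0.01, tolerance=1e-6, max_steps=100_000):
    w = np.array([1.0, 1.0])
    velocity = np.zeros_like(w)
    for step in range(1, max_steps + 1):
        lookahead = w + momentum * velocity
        velocity = momentum * velocity - learning_rate * grad(lookahead)
        w = w + velocity
        # The minimum is at the origin, so the norm of w is the distance to it
        if np.linalg.norm(w) < tolerance:
            return step
    return max_steps

gd_steps = steps_to_converge(momentum=0.0)   # momentum 0 reduces to plain gradient descent
nag_steps = steps_to_converge(momentum=0.9)
print("plain gradient descent:", gd_steps, "steps")
print("NAG:", nag_steps, "steps")
```

The learning rate is limited by the steep direction, so plain gradient descent crawls along the flat one, while the accumulated velocity lets NAG cover that direction in fewer steps.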

### When would you choose to use Nesterov Accelerated Gradient Descent over standard gradient descent?

NAG is particularly useful in training deep neural networks, where it often converges faster than standard gradient descent and plain momentum-based gradient descent. It helps mitigate the oscillations typically seen in momentum-based methods, which can result in overshooting the optimal solution.
