#### 1. What if the loss function has local minima or saddle points?

If the loss function has a local minimum or a saddle point, gradient descent can struggle: the gradient is zero (or close to zero) at these points, so the updates become tiny and the algorithm may stall there instead of converging to the global minimum.
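A minimal sketch of this stalling behaviour, assuming the toy function f(x, y) = x² - y², which has a saddle point at the origin. The function, the `gradient_descent` helper, and the step size are illustrative choices, not part of the notes. This toy function has no minimum at all, so "escaping" here only means moving away from the saddle.

```python
import numpy as np

def grad(p):
    # Gradient of f(x, y) = x**2 - y**2, which has a saddle point at (0, 0).
    x, y = p
    return np.array([2.0 * x, -2.0 * y])

def gradient_descent(start, learning_rate=0.1, num_iterations=100):
    p = np.array(start, dtype=float)
    for _ in range(num_iterations):
        p = p - learning_rate * grad(p)
    return p

# Starting exactly on the x-axis: the y-gradient is always zero, so the
# iterate slides into the saddle point and stays there.
stuck = gradient_descent([1.0, 0.0])

# A tiny perturbation in y is amplified at every step, so the iterate
# moves away from the saddle.
escaped = gradient_descent([1.0, 1e-6])

print("start on axis:  ", stuck)
print("start perturbed:", escaped)
```

The perturbed run hints at why the noise in SGD can be useful: a small random push is enough to leave a saddle point.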

#### 2. What is SGD, and what are its problems and solutions?
Stochastic Gradient Descent (SGD) is a variant of gradient descent that updates the model's parameters using the gradient computed on a single or a few randomly selected training examples at each iteration, instead of using the entire training dataset (as in batch gradient descent). Saddle points become much more common as the number of dimensions grows.
    
**Problems and solutions:**
- **Noise and variance:** The stochastic nature of SGD introduces noise, which can result in high variance during training and lead to instability.
    - Using mini-batches (or, at the extreme, the full batch) averages the gradient over more training examples, which reduces the noise and variance.
- **Local minima and saddle points:** SGD may get stuck in local minima or at saddle points, preventing the model from reaching the global minimum.
    - Techniques like momentum, which introduces inertia to the optimization process, can help the optimizer escape local minima. Additionally, optimization algorithms like Adam and Nesterov accelerated gradient can improve the convergence and robustness of SGD.
- **Convergence speed:** SGD can converge slowly, especially for large datasets or complex models.
    - Adaptive optimization algorithms like AdaGrad, RMSProp, and Adam, which adjust the learning rate based on the history of gradients, can help improve convergence speed.
- **Overfitting:** A model trained with SGD may overfit, especially when the model capacity is high relative to the amount of training data.
    - Regularization techniques, such as L1 and L2 regularization (weight decay), dropout, and batch normalization, can be applied to reduce overfitting.
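A minimal sketch of mini-batch SGD, assuming a synthetic linear-regression problem. The data, the `minibatch_sgd` helper, and all hyperparameter values are illustrative and not taken from the notes. With `batch_size=1` each update uses a single example (noisy), while a larger batch averages the gradient over several examples.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: y = 2*x + 1 plus a little noise.
X = rng.normal(size=(200, 1))
y = 2.0 * X[:, 0] + 1.0 + 0.1 * rng.normal(size=200)

def minibatch_sgd(X, y, batch_size, learning_rate=0.05, num_epochs=20):
    w, b = 0.0, 0.0
    n = len(y)
    for _ in range(num_epochs):
        order = rng.permutation(n)  # reshuffle the examples every epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            error = w * X[idx, 0] + b - y[idx]
            # Gradient of the mean squared error over the mini-batch
            grad_w = 2.0 * np.mean(error * X[idx, 0])
            grad_b = 2.0 * np.mean(error)
            w -= learning_rate * grad_w
            b -= learning_rate * grad_b
    return w, b

w1, b1 = minibatch_sgd(X, y, batch_size=1)
w32, b32 = minibatch_sgd(X, y, batch_size=32)
print("batch_size=1: ", w1, b1)
print("batch_size=32:", w32, b32)
```

Both runs should land near the true values (2, 1); the single-example run typically jitters more around them from step to step.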

### 3. What are convex and non-convex loss functions?
**Convex:** A convex cost function is bowl-shaped: the line segment between any two points on its graph never lies below the graph. Any local minimum is therefore also a global minimum (and a strictly convex function has at most one).

**Non-convex:** A non-convex cost function can have multiple local minima, maxima, or saddle points. Its curve may contain valleys, plateaus, or irregularities.
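A small sketch of the practical difference, assuming two hand-picked one-dimensional functions (both functions and the `gradient_descent` helper are illustrative). On the convex function every starting point reaches the same minimum; on the non-convex one the result depends on where the search starts.

```python
def gradient_descent(grad, x0, learning_rate=0.01, num_iterations=2000):
    x = float(x0)
    for _ in range(num_iterations):
        x -= learning_rate * grad(x)
    return x

def convex_grad(x):
    # Derivative of f(x) = (x - 1)**2, a convex function with its minimum at x = 1
    return 2.0 * (x - 1.0)

def nonconvex_grad(x):
    # Derivative of f(x) = x**4 - 3*x**2 + x, which has two separate local minima
    return 4.0 * x**3 - 6.0 * x + 1.0

for x0 in (-2.0, 2.0):
    print(
        f"start={x0:+.1f}  convex -> {gradient_descent(convex_grad, x0):.4f}  "
        f"non-convex -> {gradient_descent(nonconvex_grad, x0):.4f}"
    )
```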

### 4. What is overshooting?
Overshooting, in the context of machine learning, refers to a situation where the model's parameters update too aggressively, leading to instability or poor convergence. When the learning rate is too high, the model makes large updates to the parameters in each iteration, which can cause the optimization process to overshoot the optimal solution.

        1. Oscillations: If the learning rate is too high, the model's parameters may oscillate around the optimal solution without converging. The model may keep overshooting and undershooting the optimal values, preventing it from settling into a stable state.

        2. Divergence: With an excessively high learning rate, the model's parameters may update so drastically that the loss function increases rather than decreases. This divergence indicates that the learning rate is too large, and the model cannot find a suitable solution.

        3. Slow convergence: Although it may seem counterintuitive, using a very high learning rate can actually slow down the convergence process. The model may take longer to find the optimal solution due to frequent overshooting and the need for more iterations to stabilize.

To avoid this problem, the learning rate should be chosen carefully.
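The three behaviours can be reproduced on a toy problem. This is a sketch assuming f(x) = x², where each gradient step multiplies x by (1 - 2 · learning_rate); the function and the `run_gradient_descent` helper are illustrative.

```python
def run_gradient_descent(learning_rate, x0=1.0, num_iterations=20):
    # Minimise f(x) = x**2, whose gradient is 2*x and whose minimum is at x = 0.
    x = x0
    trajectory = [x]
    for _ in range(num_iterations):
        x = x - learning_rate * 2.0 * x
        trajectory.append(x)
    return trajectory

for lr in (0.1, 0.9, 1.0, 1.1):
    traj = run_gradient_descent(lr)
    print(f"lr={lr}: first steps {traj[:4]}, final |x| = {abs(traj[-1]):.3g}")
```

With `lr=0.1` the iterate shrinks steadily, with `lr=0.9` it flips sign on every step while still shrinking, with `lr=1.0` it bounces between +1 and -1 forever, and with `lr=1.1` it grows without bound.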

### Exponentially Weighted Average:
Exponentially Weighted Average (EWA), also known as Exponential Moving Average (EMA) or Exponentially Weighted Moving Average (EWMA), is a statistical calculation that gives more weight to recent data points while gradually decreasing the influence of older ones. Computing it requires one parameter, α, between 0 and 1, which decides how much weight the current observation receives. (Many texts instead write β = 1 - α for the weight on the previous average.) The EWA is commonly used as a smoothing technique for time series.

**`EWA(t) = (1 - α) * EWA(t-1) + α * value(t)`**
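A minimal sketch of the formula applied to a noisy signal. The helper name `exponentially_weighted_average`, the starting value EWA(0) = 0, and the synthetic data are assumptions made for illustration.

```python
import numpy as np

def exponentially_weighted_average(values, alpha):
    # EWA(t) = (1 - alpha) * EWA(t-1) + alpha * value(t), starting from EWA(0) = 0
    ewa = 0.0
    smoothed = []
    for value in values:
        ewa = (1.0 - alpha) * ewa + alpha * value
        smoothed.append(ewa)
    return np.array(smoothed)

rng = np.random.default_rng(0)
noisy = 5.0 + rng.normal(scale=1.0, size=200)  # noisy signal around 5

smoothed = exponentially_weighted_average(noisy, alpha=0.1)
print("std of raw signal:     ", noisy[50:].std())
print("std of smoothed signal:", smoothed[50:].std())
```

Because the average starts at zero, the first few smoothed values are biased toward zero; this is the effect that the bias correction in Adam compensates for.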

### Momentum:
Momentum helps SGD cope with noisy, erratic updates by introducing a velocity term that accumulates gradients from previous iterations. This speeds up convergence and helps in flat regions or with noisy gradients. The momentum coefficient is a hyperparameter between 0 and 1 that controls the influence of the previous velocity.
- It helps the optimizer move through saddle points and shallow local minima, because the accumulated velocity keeps the parameters moving even where the gradient is close to zero.
- It can overshoot the minimum and then swing back toward it.

Momentum takes the historical gradients into account and accelerates the optimization by carrying part of the previous update into the next one. The velocity is an exponentially weighted sum (an unnormalized moving average) of the previous gradients, scaled by the learning rate.

Algorithm 8.1: Momentum Update

    Inputs:
    - Parameters: w (initial parameter vector)
    - Learning rate: α (step size)
    - Momentum coefficient: β (controls the influence of previous velocity)
    - Batch: B (randomly sampled mini-batch of training examples)
    - Loss function gradient: ∇L(w, x, y) (gradient of the loss function with respect to the parameters w, given input x and target y)

    Procedure:
    1. Before training, set the velocity vector v to zero: v = 0
    2. For each mini-batch B:
        - Compute the average gradient over the batch: ∇ = mean of ∇L(w, x, y) over (x, y) in B
        - Update the velocity: v = β * v - α * ∇
        - Update the parameters using the velocity: w = w + v

    Output:
    - Updated parameter vector w

```python
import numpy as np

def momentum_algorithm(gradient_func, learning_rate, momentum, num_iterations):
    w = np.zeros(2)  # Initial parameter vector
    v = np.zeros(2)  # Initial velocity

    for t in range(num_iterations):
        gradient = gradient_func(w)  # Compute gradient at current parameter vector

        v = momentum * v - learning_rate * gradient  # Update velocity
        w += v  # Update parameter vector

    return w

def gradient_func(w):
    gradient = np.zeros(2)
    gradient[0] = 2 * (w[0] - 3)
    gradient[1] = 2 * (w[1] - 2)
    return gradient

learning_rate = 0.1
momentum = 0.9
num_iterations = 10

optimal_w = momentum_algorithm(gradient_func, learning_rate, momentum, num_iterations)
print("Optimal parameter vector:", optimal_w)
```

```
Optimal parameter vector: [2.98680039 1.99120026]
```


### Nesterov Momentum:
Nesterov Momentum, also known as Nesterov Accelerated Gradient (NAG), is a variant of the Momentum algorithm. Instead of evaluating the gradient at the current parameters, it evaluates it at a lookahead position: the current parameters shifted by the momentum term. Nesterov accelerated gradient can improve the convergence and robustness of SGD.

- Because the gradient is measured where the momentum step is about to land, it tends to overshoot less drastically than the plain Momentum algorithm.

Algorithm 8.2: Nesterov Momentum Update

    Inputs:
    - Parameters: w (initial parameter vector)
    - Learning rate: α (step size)
    - Momentum coefficient: β (controls the influence of previous velocity)
    - Batch: B (randomly sampled mini-batch of training examples)
    - Loss function gradient: ∇L(w, x, y) (gradient of the loss function with respect to the parameters w, given input x and target y)

    Procedure:
    1. Before training, set the velocity vector v to zero: v = 0
    2. For each mini-batch B:
        - Compute the lookahead position: lookahead_position = w + β * v
        - Compute the average gradient at the lookahead position: ∇ = mean of ∇L(lookahead_position, x, y) over (x, y) in B
        - Update the velocity: v = β * v - α * ∇
        - Update the parameters using the velocity: w = w + v

    Output:
    - Updated parameter vector w
**In practice, however, the update is usually rewritten so that the stored parameters are the lookahead position:**
```
v_prev = v # back this up
v = mu * v - learning_rate * dx # velocity update stays the same
x += -mu * v_prev + (1 + mu) * v # position update changes form
```

```python
import numpy as np

def nesterov_momentum_algorithm(gradient_func, learning_rate, momentum, num_iterations):
    w = np.zeros(2)  # Initial parameter vector (stores the lookahead position)
    v = np.zeros(2)  # Initial velocity

    for t in range(num_iterations):
        gradient = gradient_func(w)  # Gradient at the stored (lookahead) parameters
        v_prev = v  # Back up the previous velocity
        v = momentum * v - learning_rate * gradient  # Update velocity
        w += -momentum * v_prev + (1 + momentum) * v  # Update parameter vector

    return w

def gradient_func(w):
    gradient = np.zeros(2)
    gradient[0] = 2 * (w[0] - 3)
    gradient[1] = 2 * (w[1] - 2)
    return gradient

learning_rate = 0.1
momentum = 0.9
num_iterations = 10

optimal_w = nesterov_momentum_algorithm(gradient_func, learning_rate, momentum, num_iterations)
print("Optimal parameter vector:", optimal_w)
```

```
Optimal parameter vector: [2.84591779 1.89727853]
```
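The explicit lookahead form and the rewritten form describe the same trajectory. The sketch below checks this numerically on the same quadratic, under the assumption that both start from zero velocity; the variable names `u`, `u_prev`, and `lookahead` are introduced here for illustration.

```python
import numpy as np

def grad(w):
    # Same quadratic as above: minimum at (3, 2)
    return 2.0 * (w - np.array([3.0, 2.0]))

learning_rate, mu, num_iterations = 0.1, 0.9, 10

# Form 1: explicit lookahead. w holds the actual parameters.
w = np.zeros(2)
v = np.zeros(2)
for _ in range(num_iterations):
    lookahead = w + mu * v
    v = mu * v - learning_rate * grad(lookahead)
    w = w + v

# Form 2: rewritten update. x holds the lookahead parameters directly.
x = np.zeros(2)
u = np.zeros(2)
for _ in range(num_iterations):
    u_prev = u
    u = mu * u - learning_rate * grad(x)
    x = x - mu * u_prev + (1 + mu) * u

print("form 1 lookahead (w + mu * v):", w + mu * v)
print("form 2 iterate (x):           ", x)
```

The two printed vectors agree: the rewritten form simply tracks `w + mu * v` instead of `w`, which avoids evaluating the gradient at a separately computed point.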


### Annealing the learning rate
Annealing the learning rate is a technique commonly used in training deep neural networks. It involves gradually reducing the learning rate over time to improve convergence and avoid overshooting the optimal solution.

There are three common ways to implement learning rate decay:
1. Step decay: Reduce the learning rate by some factor every few epochs. Typical values might be reducing the learning rate by a half every 5 epochs, or by 0.1 every 20 epochs. These numbers depend heavily on the type of problem and the model. One heuristic you may see in practice is to watch the validation error while training with a fixed learning rate, and reduce the learning rate by a constant (e.g. 0.5) whenever the validation error stops improving.

2. Exponential decay: It has the mathematical form $\alpha = \alpha_0 e^{-k t}$, where $\alpha_0, k$ are hyperparameters and $t$ is the iteration number (but you can also use units of epochs).
3. 1/t decay: It has the mathematical form $\alpha = \alpha_0 / (1 + k t)$, where $\alpha_0, k$ are hyperparameters and $t$ is the iteration number.

In practice, we find that step decay is slightly preferable because the hyperparameters it involves (the fraction of decay and the step timings in units of epochs) are more interpretable than the hyperparameter $k$. Lastly, if you can afford the computational budget, err on the side of slower decay and train for a longer time.
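The three schedules can be sketched as plain functions. The function names and the default values for `drop`, `epochs_per_drop`, and `k` are illustrative assumptions, not recommended settings.

```python
import math

def step_decay(lr0, epoch, drop=0.5, epochs_per_drop=5):
    # Multiply the rate by `drop` once every `epochs_per_drop` epochs.
    return lr0 * drop ** (epoch // epochs_per_drop)

def exponential_decay(lr0, t, k=0.1):
    # alpha = alpha_0 * exp(-k * t)
    return lr0 * math.exp(-k * t)

def inverse_time_decay(lr0, t, k=0.1):
    # alpha = alpha_0 / (1 + k * t)
    return lr0 / (1.0 + k * t)

for epoch in (0, 5, 10, 20):
    print(
        epoch,
        step_decay(0.1, epoch),
        exponential_decay(0.1, epoch),
        inverse_time_decay(0.1, epoch),
    )
```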

#### Adagrad:
AdaGrad, short for Adaptive Gradient, is an optimization algorithm commonly used in machine learning and deep learning for updating the parameters of a model during training.
```
grad_squared = 0
while True:
  dx = compute_gradient(x)

  # Problem: grad_squared is never decayed, so it only grows over time
  grad_squared += dx * dx

  x -= (learning_rate * dx) / (np.sqrt(grad_squared) + 1e-7)
```
**Advantage:** AdaGrad reduces the need to manually tune the learning rate, because it adapts the step size of each parameter based on the gradients observed during training. This can be particularly useful when dealing with sparse data or in tasks where different features have vastly different scales.

**Limitation:** As the sum of squared gradients grows over time, the learning rate can become too small, leading to slow convergence or premature stopping. To address this issue, variants of AdaGrad, such as **RMSprop and Adam**, have been proposed that introduce additional mechanisms to control the accumulation of squared gradients.

#### What happens to the step size over a long time?
1. **Shrinking learning rate:** As the sum of squared gradients grows over time, the learning rate in AdaGrad can become progressively smaller. This can cause the algorithm to converge very slowly or even prematurely stop learning altogether.

2. **Sensitivity to the choice of hyperparameters:** AdaGrad's performance is sensitive to the choice of the epsilon parameter, which is added to the denominator to avoid division by zero. If the epsilon value is set too small, it may lead to numerical instability. Conversely, if it is set too large, it may prevent the learning rate from adapting effectively.
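A runnable sketch that records AdaGrad's effective per-parameter step size, assuming a badly scaled quadratic as the objective. The `adagrad` helper, the objective, and the hyperparameter values are illustrative.

```python
import numpy as np

def adagrad(grad, x0, learning_rate=0.5, num_iterations=500, eps=1e-7):
    x = np.array(x0, dtype=float)
    grad_squared = np.zeros_like(x)
    step_sizes = []
    for _ in range(num_iterations):
        dx = grad(x)
        grad_squared += dx * dx  # running sum, never decayed
        effective_lr = learning_rate / (np.sqrt(grad_squared) + eps)
        x -= effective_lr * dx
        step_sizes.append(effective_lr.copy())
    return x, np.array(step_sizes)

def grad(x):
    # Gradient of f(x) = 50 * x[0]**2 + 0.5 * x[1]**2 (very different scales)
    return np.array([100.0 * x[0], 1.0 * x[1]])

x_final, step_sizes = adagrad(grad, [1.0, 1.0])
print("final x:", x_final)
print("effective step size at iteration 1:  ", step_sizes[0])
print("effective step size at iteration 500:", step_sizes[-1])
```

The steeply curved coordinate gets a much smaller step size than the flat one, which is the equalizing effect. The recorded step sizes never increase, because `grad_squared` can only grow.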


#### RMSProp:
The RMSProp update adjusts the Adagrad method in a very simple way in an attempt to reduce its aggressive, monotonically decreasing learning rate. Here, decay_rate is a hyperparameter and typical values are [0.9, 0.99, 0.999]. Notice that the `x -=` update is identical to Adagrad, but the `grad_squared` variable is "leaky": it is an exponentially weighted moving average of the squared gradients rather than a running sum. Hence, RMSProp still modulates the learning rate of each weight based on the magnitudes of its gradients, which has a beneficial equalizing effect, but unlike Adagrad the updates do not get monotonically smaller.
```
grad_squared = 0
while(True):
  dx = compute_gradient(x)
  
  # Fixes AdaGrad: a decaying ("leaky") average instead of an ever-growing sum
  grad_squared = decay_rate * grad_squared + (1 - decay_rate) * dx * dx
  
  x -= (learning_rate*dx) / (np.sqrt(grad_squared) + 1e-7)
```
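A self-contained version of the RMSProp loop, assuming the same quadratic with its minimum at (3, 2) that the momentum examples use. The `rmsprop` helper and the hyperparameter values are illustrative.

```python
import numpy as np

def rmsprop(grad, x0, learning_rate=0.01, decay_rate=0.9,
            num_iterations=1000, eps=1e-7):
    x = np.array(x0, dtype=float)
    grad_squared = np.zeros_like(x)
    for _ in range(num_iterations):
        dx = grad(x)
        # Leaky average of squared gradients: old values fade out over time
        grad_squared = decay_rate * grad_squared + (1 - decay_rate) * dx * dx
        x -= learning_rate * dx / (np.sqrt(grad_squared) + eps)
    return x

def grad(x):
    # Gradient of (x[0] - 3)**2 + (x[1] - 2)**2
    return 2.0 * (x - np.array([3.0, 2.0]))

x_final = rmsprop(grad, [0.0, 0.0])
print("RMSProp result:", x_final)
```

Because each step is normalized by the recent gradient magnitude, the step length stays roughly on the order of `learning_rate`, so the iterate ends up close to the minimum rather than exactly on it.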

#### Adam:
Adam is a recently proposed update that looks a bit like RMSProp with momentum. The (simplified) update looks as follows:
```
# eps = 1e-8, beta1 = 0.9, beta2 = 0.999
m = beta1*m + (1-beta1)*dx
v = beta2*v + (1-beta2)*(dx**2)
x += - learning_rate * m / (np.sqrt(v) + eps)
```

With the bias correction mechanism, the update looks as follows:
```
# t is your iteration counter going from 1 to infinity
m = beta1*m + (1-beta1)*dx
mt = m / (1-beta1**t)
v = beta2*v + (1-beta2)*(dx**2)
vt = v / (1-beta2**t)
x += - learning_rate * mt / (np.sqrt(vt) + eps)
```
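A self-contained sketch of the bias-corrected update, again assuming the quadratic with its minimum at (3, 2). The `adam` helper, the learning rate of 0.1, and the iteration count are illustrative choices.

```python
import numpy as np

def adam(grad, x0, learning_rate=0.1, beta1=0.9, beta2=0.999,
         eps=1e-8, num_iterations=500):
    x = np.array(x0, dtype=float)
    m = np.zeros_like(x)  # moving average of gradients (momentum-like)
    v = np.zeros_like(x)  # moving average of squared gradients (RMSProp-like)
    for t in range(1, num_iterations + 1):
        dx = grad(x)
        m = beta1 * m + (1 - beta1) * dx
        v = beta2 * v + (1 - beta2) * dx**2
        mt = m / (1 - beta1**t)  # bias correction: m and v start at zero
        vt = v / (1 - beta2**t)
        x -= learning_rate * mt / (np.sqrt(vt) + eps)
    return x

def grad(x):
    # Gradient of (x[0] - 3)**2 + (x[1] - 2)**2
    return 2.0 * (x - np.array([3.0, 2.0]))

print("after 1 step:   ", adam(grad, [0.0, 0.0], num_iterations=1))
print("after 500 steps:", adam(grad, [0.0, 0.0]))
```

With the bias correction, the very first step has magnitude close to `learning_rate` in every coordinate, regardless of how large the gradient is.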