# Optimization Algorithms for Deep Neural Networks

In this module, we will study:
1. Gradient Descent
2. Momentum-Based Gradient Descent
3. Nesterov Momentum
4. Adagrad
5. RMSProp
6. Adam


## Optimization Algorithms

**What is optimization?**

In the simplest case, an optimization problem consists of maximizing or minimizing a real function by systematically choosing input values from within an allowed set and computing the value of the function.

In the case of Machine Learning or Deep Learning, optimization refers to _the process of minimizing the loss function by systematically updating the network weights_. 

Mathematically, this is expressed as follows:

\begin{equation}
w^* = \arg\min_w L(w),
\end{equation}

where $L(w)$ and $w$ denote, respectively, the loss function and the weights.

***

**The Task of an Optimization Algorithm:**

Optimization algorithms (in the case of minimization) have one of the following goals:
1. Find the global minimum of the objective function. This is feasible if the objective function is convex, i.e. any local minimum is a global minimum.
2. Find the lowest possible value of the objective function within its neighborhood. That is usually the case when the objective function is not convex, as in most deep learning problems.
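To make the second goal concrete, here is a minimal sketch using a made-up one-dimensional non-convex function (the function `f`, the step size, and the starting points are all assumptions chosen for illustration). It applies the gradient descent update covered in the next section and shows that the minimum we reach depends on where we start.

```python
def f(w):
    """A hypothetical non-convex objective with two local minima."""
    return w**4 - 3.0 * w**2 + w


def df(w):
    """Derivative of f with respect to w."""
    return 4.0 * w**3 - 6.0 * w + 1.0


def descend(w, lr=0.01, steps=2000):
    """Repeatedly step against the derivative, starting from w."""
    for _ in range(steps):
        w = w - lr * df(w)
    return w


w_left = descend(-2.0)   # start on the left side of the landscape
w_right = descend(2.0)   # start on the right side of the landscape

print(f"from -2.0: w = {w_left:.4f}, f(w) = {f(w_left):.4f}")
print(f"from +2.0: w = {w_right:.4f}, f(w) = {f(w_right):.4f}")
```

Both runs stop at a point where the derivative is zero, but only the run started on the left reaches the lower of the two minima.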

There are several optimization techniques; in this module, we will learn the important ones.

***

## Gradient Descent





#### What is Gradient Descent?

Gradient Descent is an optimization algorithm used throughout Machine Learning and Deep Learning. It is a first-order (i.e., gradient-based) algorithm that iteratively updates the parameters of a differentiable cost function until a minimum is reached.

Mathematically, gradient descent can be defined as follows:

\begin{equation}
w := w - \eta \cdot \frac{\partial}{\partial w} L(w)
\end{equation}

In the above equation, $\eta$ denotes the learning rate.

***

Visually, the process of gradient descent optimization can be shown as in the following figure.

![GradientDescent](gradient-descent.png)

***

#### Steps to Implement Gradient Descent

1. Randomly initialize values for weights.
2. Update weights using the following formula.
\begin{equation}
w := w - \eta \cdot \frac{\partial}{\partial w} L(w)
\end{equation}

3. Repeat until the slope is zero, that is, $\frac{\partial}{\partial w} L(w) = 0$ (in practice, until the gradient is close enough to zero).
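The three steps above can be sketched in a few lines of Python. This is only an illustration: the quadratic loss with its minimum at $w = 3$, the tolerance `tol`, and the cap `max_steps` are assumptions, not part of the algorithm's definition.

```python
import random


def loss(w):
    """Hypothetical loss with its minimum at w = 3."""
    return (w - 3.0) ** 2


def grad(w):
    """Derivative of the loss with respect to w."""
    return 2.0 * (w - 3.0)


def gradient_descent(w0, lr=0.1, tol=1e-8, max_steps=10_000):
    """Run w := w - lr * grad(w) until the slope is (almost) zero."""
    w = w0
    steps = 0
    while abs(grad(w)) >= tol and steps < max_steps:
        w = w - lr * grad(w)   # step 2: the update rule
        steps += 1             # step 3: repeat
    return w, steps


random.seed(0)
w_init = random.uniform(-10.0, 10.0)   # step 1: random initialization
w_final, n_steps = gradient_descent(w_init)
print(f"start {w_init:.3f} -> w = {w_final:.6f} after {n_steps} steps")
```

The stopping test compares the gradient against a small tolerance rather than exactly zero, because floating-point updates rarely land on the minimum exactly.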

***
#### Selection of Learning Rate 

The learning rate must be chosen carefully:

1. If it is too small, the model will take a long time to learn.
2. If it is too large, the model may fail to converge: each update overshoots the minimum, so we never settle into it, as shown in the following figure.

![LearningRate](image1.png)


***
![LearningRateSelection](image2.png)
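The effect of the learning rate can also be checked numerically. The sketch below assumes the toy loss $L(w) = w^2$ (gradient $2w$, minimum at $w = 0$) and three illustrative learning rates; none of these values come from the figures above.

```python
def run(lr, w0=1.0, steps=20):
    """Gradient descent on L(w) = w**2, whose gradient is 2 * w."""
    w = w0
    for _ in range(steps):
        w = w - lr * 2.0 * w
    return w


w_small = run(lr=0.01)   # too small: barely moves in 20 steps
w_good = run(lr=0.4)     # reasonable: ends very close to the minimum at 0
w_large = run(lr=1.1)    # too large: every step overshoots and |w| grows

print(f"lr=0.01 -> w = {w_small:.4f}")
print(f"lr=0.4  -> w = {w_good:.2e}")
print(f"lr=1.1  -> w = {w_large:.2f}")
```

For this particular loss each update multiplies $w$ by $(1 - 2\eta)$, so the iterates shrink only when that factor has magnitude below one.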

***


There are three variants of gradient descent: (1) batch gradient descent, (2) stochastic gradient descent, and (3) mini-batch gradient descent.

## Batch Gradient Descent

In this variant, we calculate the gradient for the entire dataset on each training step before we update the weights.

\begin{equation}
\frac{\partial}{\partial w} L(w) = \frac{1}{N} \sum_{i=1}^{N}\frac{\partial}{\partial w} L_i(x_i, y_i, w)
\end{equation}

Because we sum over all individual training examples on every step, the computation quickly becomes very expensive. It is therefore impractical for large datasets.
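As a rough sketch of the averaged gradient above, consider a one-parameter model $\hat{y} = w x$ with a squared-error loss per example. The dataset, the model, and the learning rate are made up for illustration.

```python
# Toy dataset (made up for illustration): y = 2 * x exactly.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]


def example_grad(w, x, y):
    """Gradient of the per-example loss L_i = (w * x - y)**2 w.r.t. w."""
    return 2.0 * (w * x - y) * x


def batch_grad(w):
    """Average of the per-example gradients over the entire dataset."""
    return sum(example_grad(w, x, y) for x, y in zip(xs, ys)) / len(xs)


w, lr = 0.0, 0.05
for _ in range(50):
    w = w - lr * batch_grad(w)   # one update per full pass over the data

print(f"learned w = {w:.6f}")
```

Every single update touches all $N$ examples, which is exactly the cost that grows with the dataset size.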

## Stochastic Gradient Descent

Stochastic Gradient Descent (SGD) was introduced to address this exact issue. Instead of calculating the gradient over all training examples before updating the weights, SGD updates the weights after each training example $(x_i, y_i)$:



\begin{equation}
w := w - \eta \frac{\partial}{\partial w} L_i(x_i, y_i, w)
\end{equation}

As a result, each SGD update is much cheaper to compute, but the gradient estimate is noisy. Because the weights are updated so frequently, this noise can lead to large oscillations, which can make the training process unstable.
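A minimal sketch of the per-example update, again assuming a made-up one-parameter model $\hat{y} = w x$ with squared-error loss. It first lists the individual gradients at $w = 0$ to show how much they differ from their average, then runs SGD with one update per example.

```python
import random

# Toy dataset (made up for illustration): y = 2 * x exactly.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]


def example_grad(w, x, y):
    """Gradient of the per-example loss L_i = (w * x - y)**2 w.r.t. w."""
    return 2.0 * (w * x - y) * x


# Each example gives a different estimate of the full-batch gradient.
per_example = [example_grad(0.0, x, y) for x, y in zip(xs, ys)]
full_batch = sum(per_example) / len(per_example)
print("per-example gradients at w=0:", per_example)
print("full-batch gradient at w=0:  ", full_batch)

rng = random.Random(0)          # fixed seed so the run is repeatable
data = list(zip(xs, ys))
w, lr = 0.0, 0.01
for epoch in range(50):
    rng.shuffle(data)           # visit the examples in a random order
    for x, y in data:
        w = w - lr * example_grad(w, x, y)   # one update per example

print(f"learned w = {w:.6f}")
```

Because this toy data is noise-free, SGD still settles on the exact solution; with real data the per-example disagreement is what produces the oscillations described above.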

## Mini-batch Stochastic Gradient Descent
Mini-batch SGD sits between the two previous ideas, combining the best of both worlds. It randomly selects $n$ training examples, the so-called mini-batch, from the whole dataset and computes the gradients only from them. It essentially tries to approximate Batch Gradient Descent by sampling only a subset of the data. Mathematically:

\begin{equation}
w := w - \eta \frac{\partial}{\partial w} L(x_{(i:i+n)}, y_{(i:i+n)}, w)
\end{equation}

In practice, mini-batch SGD is the most frequently used variation because it’s both computationally cheap and results in more robust convergence.
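The mini-batch update can be sketched as follows, assuming the same kind of made-up one-parameter model $\hat{y} = w x$ with squared-error loss and a batch size of $n = 2$. Averaging the gradient over the mini-batch is an assumption of this sketch.

```python
import random

# Toy dataset (made up for illustration): y = 2 * x exactly.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]


def minibatch_grad(w, indices):
    """Average gradient of (w * x - y)**2 over the selected examples."""
    total = sum(2.0 * (w * xs[i] - ys[i]) * xs[i] for i in indices)
    return total / len(indices)


rng = random.Random(0)   # fixed seed so the run is repeatable
w, lr, n = 0.0, 0.02, 2
for _ in range(200):
    batch = rng.sample(range(len(xs)), n)   # draw n distinct examples
    w = w - lr * minibatch_grad(w, batch)

print(f"learned w = {w:.6f}")
```

Each step costs only $n$ gradient evaluations, yet averaging over the batch makes the estimate less noisy than a single-example update.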

***
## Concerns on SGD

SGD is easy to implement, but it has some limitations:

1. If the loss function changes quickly in one direction and slowly in another, the updates may oscillate strongly along the steep direction, making training progress very slow.

2. If the loss function has a local minimum or a saddle point, SGD may well get stuck there without being able to “jump out” and proceed to a better minimum. This happens because the gradient becomes zero (or very close to zero), so the weights are barely updated.

 **A saddle point is a point on the surface of the graph of a function where the slopes (derivatives) are all zero but which is neither a local minimum nor a local maximum of the function.**

3. The gradients are still noisy because we estimate them based only on a small sample of our dataset. The noisy updates might not align well with the true gradient direction of the loss function.

4. Choosing a good learning rate is tricky and requires time-consuming experimentation with different hyperparameter values.

5. The same learning rate is applied to all of our parameters, which can become problematic for features with different frequencies or significance.
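The saddle-point concern can be reproduced with the textbook function $f(x, y) = x^2 - y^2$, whose gradient is zero at the origin although the origin is neither a minimum nor a maximum. The function, step size, and starting points below are assumptions chosen for illustration.

```python
def grad(x, y):
    """Gradient of f(x, y) = x**2 - y**2 (saddle point at the origin)."""
    return 2.0 * x, -2.0 * y


def descend(x, y, lr=0.1, steps=100):
    """Plain gradient descent on f, starting from (x, y)."""
    for _ in range(steps):
        gx, gy = grad(x, y)
        x, y = x - lr * gx, y - lr * gy
    return x, y


stuck = descend(1.0, 0.0)      # starts exactly on the line y = 0
escaped = descend(1.0, 1e-6)   # same start with a tiny offset in y

print("start on y = 0:   ", stuck)
print("start off the line:", escaped)
```

Started exactly on the line $y = 0$, the iterates slide into the saddle and stay there because the gradient vanishes. A tiny offset in $y$ is enough to move away from it, which is one reason noise or accumulated velocity can help.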

To overcome some of these problems, many improvements have been proposed over the years.


***
## [Momentum-Based SGD](https://towardsdatascience.com/stochastic-gradient-descent-with-momentum-a84097641a5d) 

To overcome the limitations of SGD, a variant called SGD with momentum has been proposed. For a detailed explanation, see the article linked in the section title.

Mathematically, SGD with momentum can be defined as:

\begin{equation}
\begin{aligned}
V_t &= \beta V_{t-1} + (1-\beta) \frac{\partial}{\partial w}L(x, y, w) \\
w &:= w - \eta V_t
\end{aligned}
\end{equation}

where
- $L$ is the loss function
- $\eta$ is the learning rate
- $\beta$ is a constant, called the momentum coefficient; a commonly used value is $0.9$.
- $V_t$ is a term called the velocity.
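A minimal sketch of the momentum update, assuming a hypothetical quadratic loss with its minimum at $w = 3$ and a velocity initialized to zero (both are choices of this sketch rather than requirements of the method).

```python
def grad(w):
    """Gradient of the hypothetical loss L(w) = (w - 3)**2."""
    return 2.0 * (w - 3.0)


def sgd_momentum(w0, lr=0.1, beta=0.9, steps=500):
    """Gradient descent with an exponentially averaged velocity."""
    w, v = w0, 0.0                 # velocity starts at zero (assumption)
    for _ in range(steps):
        v = beta * v + (1.0 - beta) * grad(w)   # V_t update
        w = w - lr * v                          # weight update
    return w


w_final = sgd_momentum(0.0)
print(f"w = {w_final:.6f}")
```

The velocity is a running average of past gradients, so each step is driven by the recent history of the slope rather than by the current gradient alone.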

***

#### Advantages of Using SGD with Momentum

1. We can now escape local minima or saddle points because we keep moving downwards even when the gradient of the current mini-batch is zero.

2. Momentum can also help us reduce the oscillation of the gradients because the velocity vectors can smooth out these highly changing landscapes.

3. Finally, it reduces the noise of the gradients (stochasticity) and follows a more direct walk down the landscape.
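The noise-reduction effect can be illustrated in isolation. The sketch below feeds a stream of artificially noisy gradients (a true value of 1.0 plus uniform noise, both assumptions) through the velocity formula and compares how much the raw gradients and the velocity fluctuate.

```python
import random
import statistics

rng = random.Random(0)   # fixed seed so the run is repeatable
beta = 0.9
true_grad = 1.0          # hypothetical noise-free gradient

raw, smoothed = [], []
v = 0.0
for _ in range(2000):
    g = true_grad + rng.uniform(-1.0, 1.0)   # noisy mini-batch gradient
    v = beta * v + (1.0 - beta) * g          # velocity (running average)
    raw.append(g)
    smoothed.append(v)

# Skip the first steps, where v is still building up from zero.
raw_var = statistics.pvariance(raw[100:])
smooth_var = statistics.pvariance(smoothed[100:])
print(f"variance of raw gradients: {raw_var:.4f}")
print(f"variance of velocity:      {smooth_var:.4f}")
```

The velocity stays close to the true gradient while fluctuating far less than the individual noisy estimates.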

***

## Nesterov Accelerated Gradient

An alternative version of momentum, called Nesterov momentum, calculates the update direction in a slightly different way.

Instead of combining the velocity vector and the gradients, we calculate where the velocity vector would take us and compute the gradient at this point. In other words, we find what the gradient vector would have been if we moved only according to our built-up velocity, and compute it from there.

We can visualize this as below:

![NAG](image3.png)

This anticipatory update prevents us from going too fast and results in increased responsiveness. The most famous algorithm that makes use of Nesterov momentum is called Nesterov accelerated gradient (NAG) and goes as follows:

\begin{equation}
\begin{aligned}
V_t &= \beta V_{t-1} + (1-\beta) \frac{\partial}{\partial w}L(x, y, w - \eta\beta V_{t-1}) \\
w &:= w - \eta V_t
\end{aligned}
\end{equation}

where
- $L$ is the loss function
- $\eta$ is the learning rate
- $\beta$ is a constant, called the momentum coefficient; a commonly used value is $0.9$.
- $V_t$ is a term called the velocity.
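A minimal sketch of NAG on a hypothetical quadratic loss with its minimum at $w = 3$. One convention is assumed here and should be read as such: because the weights are updated with $w := w - \eta V_t$, the look-ahead point is taken to be $w - \eta \beta V_{t-1}$, that is, the current weights moved along the accumulated velocity. Other texts write the look-ahead with a different sign or scaling, depending on how they define the velocity.

```python
def grad(w):
    """Gradient of the hypothetical loss L(w) = (w - 3)**2."""
    return 2.0 * (w - 3.0)


def nag(w0, lr=0.1, beta=0.9, steps=500):
    """Momentum update with the gradient evaluated at a look-ahead point."""
    w, v = w0, 0.0                      # velocity starts at zero (assumption)
    for _ in range(steps):
        w_ahead = w - lr * beta * v     # where the velocity alone would take us
        v = beta * v + (1.0 - beta) * grad(w_ahead)
        w = w - lr * v
    return w


w_final = nag(0.0)
print(f"w = {w_final:.6f}")
```

The only difference from plain momentum is the point at which the gradient is evaluated: `w_ahead` instead of `w`.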

## References

1. [A journey into Optimization algorithms for Deep Neural Networks](https://theaisummer.com/optimization/)
2. [The Hitchhiker’s Guide to Optimization in Machine Learning](https://towardsdatascience.com/the-hitchhikers-guide-to-optimization-in-machine-learning-edcf5a104210)
3. [Gradient Descent Explained](https://towardsdatascience.com/gradient-descent-explained-9b953fc0d2c)
4. [Gentle Introduction to the Adam Optimization Algorithm for Deep Learning](https://machinelearningmastery.com/adam-optimization-algorithm-for-deep-learning/)
5. [Stochastic Gradient Descent with momentum](https://towardsdatascience.com/stochastic-gradient-descent-with-momentum-a84097641a5d)