# Optimization Algorithms for Deep Neural Networks

In this module, we will study:
1. Gradient Descent
2. Momentum-Based Gradient Descent
3. Nesterov Momentum
4. Adagrad
5. RMSProp
6. Adam


## Optimization Algorithms

**What is optimization?**

In the simplest case, an optimization problem consists of maximizing or minimizing a real function by systematically choosing input values from within an allowed set and computing the value of the function.

In the case of Machine Learning or Deep Learning, optimization refers to _the process of minimizing the loss function by systematically updating the network weights_. 

Mathematically, this is expressed as follows:

\begin{equation}
w^* = \arg\min_w L(w),
\end{equation}

where $L(w)$ and $w$ denote, respectively, the loss function and the weights, and $w^*$ is the set of weights that minimizes the loss.
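
As a toy illustration of the arg min notation, the sketch below evaluates a hypothetical one-dimensional loss $L(w) = (w - 3)^2$ on a small grid of candidate weights and picks the candidate with the smallest loss. The loss function and the grid are assumptions chosen only for illustration. A real network has far too many weights for this kind of exhaustive search, which is why iterative methods are used instead.

```python
def loss(w):
    """Hypothetical loss with its minimum at w = 3."""
    return (w - 3.0) ** 2

# Candidate weights from -5.0 to 5.0 in steps of 0.5
candidates = [i * 0.5 for i in range(-10, 11)]

# arg min: the candidate whose loss value is smallest
w_best = min(candidates, key=loss)
print(w_best, loss(w_best))
```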

***

**The Task of an Optimization Algorithm:**

Optimization algorithms (in the case of minimization) have one of the following goals:
1. Find the global minimum of the objective function. This is feasible if the objective function is convex, i.e. any local minimum is a global minimum.
2. Find a local minimum, i.e. the lowest value of the objective function within some neighborhood. This is usually the realistic goal when the objective function is not convex, as is the case in most deep learning problems.

There are several optimization techniques; in this module, we will learn the most important ones.

***

## Gradient Descent





#### What is Gradient Descent?

Gradient descent is an optimization algorithm widely used in machine learning and deep learning. It is a first-order (i.e., gradient-based) algorithm that iteratively updates the parameters of a differentiable cost function until a minimum is reached.

Mathematically, the gradient descent update can be written as follows:

\begin{equation}
w := w - \eta \cdot \frac{\partial}{\partial w} L(w)
\end{equation}

In the above equation, $\eta > 0$ denotes the learning rate, which controls the size of each update step.
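
As a worked example under assumed values (the loss $L(w) = w^2$, the starting weight $w = 4$, and $\eta = 0.1$ are chosen purely for illustration), the gradient is $\frac{\partial}{\partial w} L(w) = 2w = 8$, so a single update gives:

\begin{equation}
w := 4 - 0.1 \cdot 8 = 3.2
\end{equation}

The loss drops from $L(4) = 16$ to $L(3.2) = 10.24$, and repeating the step moves $w$ closer to the minimizer $w = 0$.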

***

Visually, the process of gradient descent optimization can be shown as in the following figure.

![GradientDescent](gradient-descent.png)

***

#### Steps to Implement Gradient Descent

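1. Initialize the weights with random values.
2. Update the weights using the following formula.
\begin{equation}
w := w - \eta \cdot \frac{\partial}{\partial w} L(w)
\end{equation}

3. Repeat step 2 until the slope is zero; that is, $\frac{\partial}{\partial w} L(w) = 0$. In practice, the loop stops once the gradient magnitude falls below a small tolerance or a maximum number of iterations is reached.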
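
The three steps above can be sketched in plain Python as follows. This is a minimal illustration, not a reference implementation: the loss $L(w) = (w - 3)^2$, the function names, the tolerance `tol`, and the cap `max_steps` are all assumptions introduced here.

```python
import random

def loss(w):
    """Hypothetical convex loss with its minimum at w = 3."""
    return (w - 3.0) ** 2

def grad(w):
    """Derivative of the loss with respect to w."""
    return 2.0 * (w - 3.0)

def gradient_descent(w, lr=0.1, tol=1e-8, max_steps=10_000):
    """Apply w := w - lr * grad(w) until the gradient is nearly zero."""
    n_updates = 0
    # Step 3: stop when the slope is (almost) zero or the step budget runs out
    while abs(grad(w)) >= tol and n_updates < max_steps:
        w = w - lr * grad(w)  # Step 2: the update rule
        n_updates += 1
    return w, n_updates

random.seed(0)                      # fixed seed so the run is repeatable
w_init = random.uniform(-5.0, 5.0)  # Step 1: random initialization
w_final, n_updates = gradient_descent(w_init)
print(f"start={w_init:.3f}  end={w_final:.6f}  updates={n_updates}")
```

Testing for a gradient of exactly zero would rarely terminate in floating-point arithmetic, which is why the sketch compares the gradient magnitude against a small tolerance instead.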

***
#### Selection of Learning Rate 

The learning rate must be chosen carefully:

1. If it is too small, the model will take a long time to converge.
2. If it is too large, the model may fail to converge: each update overshoots the minimum, so the weights oscillate around it or diverge, as shown in the following figure.

![LearningRate](image1.png)


***
![LearningRateSelection](image2.png)
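
The effect of the learning rate can also be checked numerically. The sketch below runs the same number of gradient descent steps on a hypothetical loss $L(w) = w^2$ with three learning rates; the loss, the starting point, and the specific rates are assumptions picked to make the three regimes visible.

```python
def run_gradient_descent(lr, w=1.0, n_steps=50):
    """Minimize the hypothetical loss L(w) = w**2, whose gradient is 2 * w."""
    for _ in range(n_steps):
        w = w - lr * 2.0 * w
    return w

# Too small, reasonable, and too large learning rates
results = {lr: run_gradient_descent(lr) for lr in (0.001, 0.1, 1.1)}
for lr, w in results.items():
    print(f"lr={lr:<6} final w={w:.4g}")
```

With the smallest rate, $w$ has barely moved from its starting value after 50 steps; with the middle rate it is very close to the minimum at $w = 0$; with the largest rate every step overshoots and $|w|$ grows instead of shrinking.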

****


There are three variants of gradient descent: (1) batch gradient descent, (2) stochastic gradient descent, and (3) mini-batch gradient descent.

#### Batch Gradient Descent

The equation presented above describes batch gradient descent. In this variant, we calculate the gradient over the entire dataset at each training step before updating the weights.

\begin{equation}
w := w - \eta \cdot \frac{\partial}{\partial w} L(w)
\end{equation}

Because the loss sums over all individual training examples, each update becomes very expensive to compute as the dataset grows. Batch gradient descent is therefore impractical for large datasets.
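
A minimal sketch of batch gradient descent is shown below, assuming a one-parameter linear model $\hat{y} = w x$ trained with mean squared error on a tiny hand-made dataset. The model, the data, the learning rate, and the function name `batch_gradient` are illustrative assumptions; the loss is averaged rather than summed over examples, which only rescales the gradient.

```python
# Toy dataset generated from y = 2 * x (an assumption for illustration)
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

def batch_gradient(w, xs, ys):
    """Gradient of the mean squared error of the model y_hat = w * x,
    computed over the entire dataset."""
    n = len(xs)
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n

w = 0.0    # initial weight
lr = 0.05  # learning rate
for _ in range(100):
    # Every single update loops over all training examples
    w = w - lr * batch_gradient(w, xs, ys)

print(round(w, 6))
```

The cost of each call to `batch_gradient` grows linearly with the number of training examples, which is the expense described above.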

## References

1. [A journey into Optimization algorithms for Deep Neural Networks](https://theaisummer.com/optimization/)
2. [The Hitchhiker’s Guide to Optimization in Machine Learning](https://towardsdatascience.com/the-hitchhikers-guide-to-optimization-in-machine-learning-edcf5a104210)
3. [Gradient Descent Explained](https://towardsdatascience.com/gradient-descent-explained-9b953fc0d2c)
4. [Gentle Introduction to the Adam Optimization Algorithm for Deep Learning](https://machinelearningmastery.com/adam-optimization-algorithm-for-deep-learning/)