## Adaptive Learning Rate and Optimization:-

There is a wide variety of optimizers that we can use to optimize our network parameters. We will focus on the following: SGD with Momentum, AdaGrad, RMSProp, and Adam.

#### SGD with Momentum:-

SGD with momentum, in contrast to plain SGD, dampens oscillations between updates by adding a scaled copy of the previous update to the current one:

$$
\begin{align}
v_{t+1} =& \gamma v_{t} + \alpha \nabla_{\theta_t} J(\theta_t), \nonumber \\
\theta_{t+1} =& \theta_t - v_{t+1}, \nonumber
\end{align}
$$

where $v_0$ is initialized to zeros before the first optimization step, $\gamma$ is the momentum coefficient, and $\alpha$ is the learning rate.
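As a minimal sketch of the momentum update, the snippet below applies it to a toy quadratic objective $J(\theta) = \frac{1}{2}\lVert\theta\rVert^2$, whose gradient is simply $\theta$. The function name `sgd_momentum_step`, the toy objective, and the values of `alpha` and `gamma` are illustrative assumptions, not part of the original description. The velocity is subtracted from the parameters so that the step moves against the gradient.

```python
import numpy as np

def sgd_momentum_step(theta, v, grad, alpha=0.1, gamma=0.9):
    """One SGD-with-momentum update; returns the new (theta, v).

    alpha (learning rate) and gamma (momentum coefficient) are
    illustrative defaults, not values prescribed by the text.
    """
    v_next = gamma * v + alpha * grad   # scaled previous update + current gradient step
    theta_next = theta - v_next         # move against the accumulated direction
    return theta_next, v_next

# Toy objective (assumption): J(theta) = 0.5 * ||theta||^2, so grad J = theta.
theta = np.array([1.0, -2.0])
v = np.zeros_like(theta)                # v_0 initialized to zeros
for _ in range(200):
    theta, v = sgd_momentum_step(theta, v, grad=theta)
```

After the loop, `theta` should be close to the minimizer at the origin for this particular toy problem.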

#### AdaGrad:-

AdaGrad, or adaptive gradient algorithm, is a parameter update method that adapts the learning rate for each parameter: parameters tied to frequently occurring features receive smaller updates, while parameters tied to infrequently occurring features receive larger updates.

$$
\begin{align}
g_{t} =& \nabla_{\theta_{t}} J(\theta_{t}), \nonumber \\
\theta_{t+1} =& \theta_{t} - \frac{\alpha}{\sqrt{G_{t} + \epsilon}} \odot g_{t}, \nonumber
\end{align}
$$

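where $G_{t} \in \mathbb{R}^{d \times d}$ is a diagonal matrix whose $i$-th diagonal entry is the sum of the squared gradients of parameter $i$ up to step $t$, $G_{t,ii} = \sum_{\tau=1}^{t} g_{\tau,i}^2$, and $\epsilon$ is a small constant that avoids division by zero.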
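The following sketch illustrates the AdaGrad update under a few assumptions: only the diagonal of $G_t$ is stored (as a vector), the gradients are made-up values rather than gradients of a real loss, and the names `adagrad_step` and `G_diag` are hypothetical. The first coordinate receives a gradient at every step, the second only once, which mimics a frequent and an infrequent feature.

```python
import numpy as np

def adagrad_step(theta, G_diag, grad, alpha=0.1, eps=1e-8):
    """One AdaGrad update; G_diag holds the diagonal of G_t as a vector."""
    G_diag = G_diag + grad ** 2                           # accumulate squared gradients
    theta = theta - alpha / np.sqrt(G_diag + eps) * grad  # per-parameter scaled step
    return theta, G_diag

theta = np.zeros(2)
G_diag = np.zeros(2)
for step in range(10):
    # Made-up gradients: coordinate 0 is "frequent", coordinate 1 fires only once.
    grad = np.array([1.0, 1.0 if step == 0 else 0.0])
    theta, G_diag = adagrad_step(theta, G_diag, grad)

# Effective per-parameter learning rate after 10 steps.
effective_lr = 0.1 / np.sqrt(G_diag + 1e-8)
```

In this toy run the infrequently updated coordinate ends with the larger effective learning rate, because its accumulated squared gradient is smaller.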

#### RMSProp:-

RMSProp is similar to AdaGrad, with one difference: instead of accumulating all past squared gradients, it keeps an exponentially decaying average of them, so that older gradients are gradually forgotten at a rate set by the decay factor $\gamma$.

$$
\begin{align}
\mathbb{E}[g^2]_{t} =& \gamma \mathbb{E}[g^2]_{t-1} + (1 - \gamma)g_{t}^2, \nonumber \\
\theta_{t+1} =& \theta_{t} - \frac{\alpha}{\sqrt{\mathbb{E}[g^2]_{t} + \epsilon}} \odot g_{t}, \nonumber
\end{align}
$$
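A minimal sketch of the RMSProp update is shown below. The function name `rmsprop_step`, the variable `avg_sq` (standing in for $\mathbb{E}[g^2]_t$), the default hyperparameters, and the constant made-up gradient are assumptions for illustration only.

```python
import numpy as np

def rmsprop_step(theta, avg_sq, grad, alpha=0.01, gamma=0.9, eps=1e-8):
    """One RMSProp update; avg_sq is the running average of squared gradients."""
    avg_sq = gamma * avg_sq + (1 - gamma) * grad ** 2     # decaying average of g^2
    theta = theta - alpha / np.sqrt(avg_sq + eps) * grad  # per-parameter scaled step
    return theta, avg_sq

# With a constant (made-up) gradient of 2.0, the running average approaches 2.0**2.
theta = np.array([1.0])
avg_sq = np.zeros(1)
for _ in range(100):
    theta, avg_sq = rmsprop_step(theta, avg_sq, grad=np.array([2.0]))
```

Unlike the AdaGrad accumulator, `avg_sq` stays bounded here, so the effective learning rate does not shrink toward zero as training continues.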


#### Adam:-

Adam, or adaptive moment estimation, is another method that adapts the learning rate for each parameter. Like RMSProp, it keeps an exponentially decaying average of past squared gradients. However, in contrast to the previous methods, it also uses a momentum-like term that keeps an exponentially decaying average of the past gradients themselves. The parameter update is performed as follows:

$$
\begin{align}
m_{t} =& \beta_1 m_{t-1} + (1 - \beta_1)g_{t}, \nonumber \\
v_{t} =& \beta_2 v_{t-1} + (1 - \beta_2)g_t^2, \nonumber \\
\hat{m}_t =& \frac{m_t}{1-\beta_1^t}, \nonumber \\
\hat{v}_t =& \frac{v_t}{1-\beta_2^t}, \nonumber \\
\theta_{t+1} =& \theta_{t} - \frac{\alpha}{\sqrt{\hat{v}_t} + \epsilon}\hat{m}_t, \nonumber
\end{align}
$$

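where $m_t$ and $v_t$ are the first and second moment estimates of the gradients, while $\hat{m}_t$ and $\hat{v}_t$ are bias-corrected values of the two moments. The correction compensates for $m_0$ and $v_0$ being initialized to zeros, which biases the early estimates toward zero.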
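The Adam equations can be sketched in a few lines of NumPy. The function name `adam_step` is hypothetical, and the default hyperparameters (`alpha=0.001`, `beta1=0.9`, `beta2=0.999`, `eps=1e-8`) are commonly used values assumed here, since the text above does not fix them. The step counter `t` starts at 1 so that the bias-correction denominators are nonzero.

```python
import numpy as np

def adam_step(theta, m, v, grad, t,
              alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update at step t (t starts at 1); returns (theta, m, v)."""
    m = beta1 * m + (1 - beta1) * grad          # first moment estimate
    v = beta2 * v + (1 - beta2) * grad ** 2     # second moment estimate
    m_hat = m / (1 - beta1 ** t)                # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)                # bias-corrected second moment
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Made-up gradients of very different magnitude for two parameters.
theta = np.array([1.0, -1.0])
m = np.zeros_like(theta)
v = np.zeros_like(theta)
theta, m, v = adam_step(theta, m, v, grad=np.array([5.0, -0.2]), t=1)
```

On the first step the bias-corrected moments reduce to $g_1$ and $g_1^2$, so each parameter moves by roughly $\alpha$ in the direction opposite to the sign of its gradient, regardless of the gradient's magnitude.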

## Batch Normalization:-

The description of batch normalization below is adapted from Stanford's CS231n course.

One way to make deep networks easier to train is to use more sophisticated optimization procedures such as SGD+momentum, RMSProp, or Adam. Another strategy is to change the architecture of the network to make it easier to train. One idea along these lines is batch normalization.

The idea is relatively straightforward. Machine learning methods tend to work better when their input data consists of uncorrelated features with zero mean and unit variance. When training a neural network, we can preprocess the data before feeding it to the network to explicitly decorrelate its features; this will ensure that the first layer of the network sees data that follows a nice distribution. However, even if we preprocess the input data, the activations at deeper layers of the network will likely no longer be decorrelated and will no longer have zero mean or unit variance, since they are output from earlier layers in the network. Even worse, during the training process the distribution of features at each layer of the network will shift as the weights of each layer are updated.

At training time, a batch normalization layer uses a minibatch of data to estimate the mean and standard deviation of each feature. These estimated means and standard deviations are then used to center and normalize the features of the minibatch. A running average of these means and standard deviations is kept during training, and at test time these running averages are used to center and normalize features.

It is possible that this normalization strategy could reduce the representational power of the network, since it may sometimes be optimal for certain layers to have features that are not zero-mean or unit variance. To this end, the batch normalization layer includes learnable shift and scale parameters for each feature dimension.
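The forward pass described above can be sketched as follows. This is an illustrative sketch rather than a reference implementation: the function name `batchnorm_forward`, the `momentum` and `eps` values, and the choice to track a running variance (instead of a running standard deviation) are assumptions. Here `gamma` and `beta` are the learnable scale and shift parameters, and `momentum` is the weight given to the old running value.

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, running_mean, running_var,
                      training, momentum=0.9, eps=1e-5):
    """Batch normalization forward pass for x of shape (N, D).

    Returns (out, running_mean, running_var).
    """
    if training:
        mean = x.mean(axis=0)                    # per-feature minibatch mean
        var = x.var(axis=0)                      # per-feature minibatch variance
        x_hat = (x - mean) / np.sqrt(var + eps)  # center and normalize
        # Update the running averages that are used at test time.
        running_mean = momentum * running_mean + (1 - momentum) * mean
        running_var = momentum * running_var + (1 - momentum) * var
    else:
        x_hat = (x - running_mean) / np.sqrt(running_var + eps)
    out = gamma * x_hat + beta                   # learnable scale and shift
    return out, running_mean, running_var

rng = np.random.default_rng(0)
x = 3.0 + 2.0 * rng.standard_normal((64, 3))     # made-up minibatch
out, run_mean, run_var = batchnorm_forward(
    x, gamma=np.ones(3), beta=np.zeros(3),
    running_mean=np.zeros(3), running_var=np.ones(3), training=True)
```

With `gamma` set to ones and `beta` to zeros, each column of `out` has approximately zero mean and unit variance in training mode; other values of `gamma` and `beta` let the layer recover a different mean and scale if that is useful.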