# Training Deep Neural Networks

So far we have trained shallow nets, which have only a few hidden layers. To tackle more complex problems, such as object detection or speech recognition, we have to increase the number of layers and neurons. But training a DNN is not that easy; several problems can occur:

- *Vanishing/exploding gradients* problem: gradients grow smaller and smaller, or bigger and bigger, during backpropagation. This makes the lower layers hard to train.
- You might not have enough data, or labeling the data may be too costly.
- Training may be extremely slow.
- A model with too many parameters can easily overfit the training data.

So, let's tackle these problems. Welcome to Deep Learning!

# Vanishing/Exploding Gradients Problem

In backpropagation, gradients often get smaller and smaller as the algorithm progresses down to the lower layers. As a result, the parameter updates for those layers are negligible; this is called *vanishing gradients*. On the other hand, gradients sometimes grow bigger and bigger, and the parameter updates become so large that the algorithm diverges. This is called *exploding gradients*, and it usually occurs in RNNs.

A main cause of vanishing gradients is the activation function used. With the sigmoid function, for example, inputs of large magnitude push the output toward 0 or 1, where the gradient is almost zero. So backpropagation keeps diluting the gradient as it passes through each layer.
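To make the saturation argument concrete, here is a minimal sketch that evaluates the sigmoid derivative at a few points and multiplies it across layers. The depth `n_layers` is a hypothetical value chosen for illustration, and the weights are ignored, so this only shows the contribution of the activation function itself.

```python
import math

def sigmoid(z):
    """Logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    """Derivative of the sigmoid: s(z) * (1 - s(z)), never larger than 0.25."""
    s = sigmoid(z)
    return s * (1.0 - s)

# The derivative peaks at z = 0 and collapses as |z| grows (saturation).
for z in [0.0, 2.0, 5.0, 10.0]:
    print(f"z = {z:5.1f}   sigmoid'(z) = {sigmoid_grad(z):.6f}")

# Backpropagation multiplies one such factor per layer (weights ignored here),
# so even in the best case (0.25 per layer) the signal shrinks geometrically.
n_layers = 10  # hypothetical depth, for illustration only
best_case = sigmoid_grad(0.0) ** n_layers
print(f"best-case product over {n_layers} layers: {best_case:.2e}")
```

Even when every unit sits at its steepest point, the product over ten layers is already below one millionth, which is why the lower layers barely move.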

## Glorot and He Initialization

Xavier Glorot and Yoshua Bengio published a paper proposing a way to alleviate this problem. They pointed out that the signal needs to flow properly in both directions (forward propagation and backpropagation): the variance of a layer's outputs should equal the variance of its inputs, and the same goes for the gradients before and after they flow through the layer.

So they propose that we should initialize weights randomly as:
$$
fan_{avg} = (fan_{in} + fan_{out}) / 2
\\
\sigma^2 = \frac1{fan_{avg}} , mean=0: for\ normal\ distribution
\\
r = \sqrt{\frac3{fan_{avg}}}, (-r, +r): for\ uniform\ distribution
$$
Here, $fan_{in}$ and $fan_{out}$ are the number of inputs and the number of neurons of a layer.

This is called *Xavier initialization* or *Glorot initialization*. If you replace $fan_{avg}$ with $fan_{in}$, you get *LeCun initialization*, which was proposed back in the 1990s. These initializations can be used with logistic activation functions.

For ReLU (and its variants), there is *He initialization*, where $\sigma^2=\frac2{fan_{in}}$.
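The formulas above can be checked numerically. The following is a small NumPy sketch, not Keras's own implementation: `init_std` and `glorot_uniform_limit` are hypothetical helper names, and the layer size is made up. It relies on the fact that a uniform distribution on $(-r, +r)$ has variance $r^2/3$, which is why $r = \sqrt{3/fan_{avg}}$ gives the same variance as the normal variant.

```python
import numpy as np

def init_std(fan_in, fan_out, scheme="glorot"):
    """Standard deviation of the normal-distribution variant of each scheme."""
    fan_avg = (fan_in + fan_out) / 2
    variance = {
        "glorot": 1.0 / fan_avg,
        "lecun": 1.0 / fan_in,
        "he": 2.0 / fan_in,
    }[scheme]
    return np.sqrt(variance)

def glorot_uniform_limit(fan_in, fan_out):
    """Limit r of the uniform range (-r, +r) for Glorot initialization."""
    fan_avg = (fan_in + fan_out) / 2
    return np.sqrt(3.0 / fan_avg)

fan_in, fan_out = 300, 100   # hypothetical layer size
rng = np.random.default_rng(42)

# Sample a weight matrix from Uniform(-r, r) and compare its variance
# with the 1 / fan_avg target.
r = glorot_uniform_limit(fan_in, fan_out)
W = rng.uniform(-r, r, size=(fan_in, fan_out))
print("target variance:", 1.0 / ((fan_in + fan_out) / 2))
print("sample variance:", W.var())

for scheme in ("glorot", "lecun", "he"):
    print(scheme, "std:", init_std(fan_in, fan_out, scheme))
```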

By default, Keras uses Glorot Initialization with uniform distribution. When creating a layer you can change this to He initialization by setting `kernel_initializer='he_uniform'` or `kernel_initializer='he_normal'` like this:

```python
keras.layers.Dense(10, activation='relu', kernel_initializer='he_normal')
```

If you want to use $fan_{avg}$ in He initialization with uniform distribution, you can use `VarianceScaling` like this:

```python
he_avg_init = keras.initializers.VarianceScaling(scale=2, mode='fan_avg',
                                                distribution='uniform')
keras.layers.Dense(10, activation='relu', kernel_initializer=he_avg_init)
```

## Non-saturating Activation Functions

Earlier we saw that unstable gradients are partly due to a poor choice of activation function. So let's change it: nowadays ReLU is used most often, because it does not saturate for positive values (and it is fast to compute).

Unfortunately, ReLU suffers from a problem called *dying ReLU*: during training some neurons effectively "die", meaning they stop outputting anything other than 0.

To solve this problem, there is a variant called *Leaky ReLU*. It is defined as:
$$
LeakyReLU_\alpha(z) = max(\alpha z, z)
$$
where $\alpha$ defines the slope of the function for $z < 0$ and is typically set to 0.01.

A 2015 paper compared several variants. It found that $\alpha=0.2$ works better than 0.01, and it also evaluated *randomized leaky ReLU* (which picks $\alpha$ randomly) and *parametric leaky ReLU* (which learns the value of $\alpha$ during training).
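As a rough illustration of why the leak helps, the sketch below compares the slopes of ReLU and Leaky ReLU in NumPy. The helper names are made up for this example. For negative inputs the ReLU slope is exactly 0, so no gradient flows back through a unit that is always negative, while the leaky variant keeps a slope of $\alpha$.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def leaky_relu(z, alpha=0.01):
    """max(alpha * z, z), assuming 0 <= alpha <= 1."""
    return np.maximum(alpha * z, z)

def relu_grad(z):
    """Slope of ReLU: 1 for z > 0, 0 otherwise."""
    return (z > 0).astype(float)

def leaky_relu_grad(z, alpha=0.01):
    """Slope of Leaky ReLU: 1 for z > 0, alpha otherwise."""
    return np.where(z > 0, 1.0, alpha)

z = np.array([-3.0, -0.5, 0.5, 3.0])
print("relu:       ", relu(z))
print("leaky_relu: ", leaky_relu(z, alpha=0.2))
print("relu grad:  ", relu_grad(z))
print("leaky grad: ", leaky_relu_grad(z, alpha=0.2))
```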

A 2015 paper also proposed a new activation function called *exponential linear unit (ELU)*:
$$
ELU_\alpha(z) =
\begin{cases}
    \alpha(\exp(z)-1),& if\ z < 0\\
    z,              & if\ z \ge 0
\end{cases}
$$
It solves the dead neuron problem because its gradient is nonzero for $z < 0$. If $\alpha=1$, the function is smooth everywhere, including around $z = 0$, which helps gradient descent.

The main drawback is that ELU is slower to compute than ReLU. It tends to converge in fewer training steps, but each step takes longer, so the overall training time ends up similar to ReLU's.
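The ELU definition above translates almost directly into NumPy. This is an illustrative sketch with hypothetical helper names; the `np.minimum` call is only there so that `exp` never sees large positive inputs whose branch is discarded anyway.

```python
import numpy as np

def elu(z, alpha=1.0):
    """alpha * (exp(z) - 1) for z < 0, z otherwise."""
    z = np.asarray(z, dtype=float)
    neg = alpha * (np.exp(np.minimum(z, 0.0)) - 1.0)
    return np.where(z < 0, neg, z)

def elu_grad(z, alpha=1.0):
    """Slope of ELU: alpha * exp(z) for z < 0, 1 otherwise."""
    z = np.asarray(z, dtype=float)
    return np.where(z < 0, alpha * np.exp(np.minimum(z, 0.0)), 1.0)

z = np.array([-5.0, -1.0, 0.0, 2.0])
print("elu:     ", elu(z))
print("elu grad:", elu_grad(z))
```

For very negative inputs the output approaches $-\alpha$, and the slope stays positive instead of dropping to exactly 0 as it does for ReLU.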

A 2017 paper introduced the *scaled ELU* (SELU). The authors showed that if you build a network from a stack of dense layers, and all hidden layers use SELU, then the network will *self-normalize*: the output of each layer tends to preserve a mean of 0 and a standard deviation of 1 during training. SELU often outperforms other activation functions in such networks.

However, there are conditions for self-normalization to happen:

- Input features must be standardized (mean 0 and std 1).
- Initialization must be LeCun normal, which means setting `kernel_initializer='lecun_normal'`.
- The network's architecture must be sequential: self-normalization is not guaranteed with RNNs or skip connections.
- All layers must be dense, although in some cases SELU also improves performance in CNNs.
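The conditions above can be tried out with a small NumPy experiment, sketched here under several assumptions: the layer sizes are arbitrary, the weights are freshly initialized rather than trained, and the two SELU constants are the (rounded) values from the 2017 paper. It pushes standardized inputs through a deep stack of dense SELU layers with LeCun normal initialization and reports the statistics of the final activations.

```python
import numpy as np

# SELU constants from the 2017 paper (rounded).
SELU_ALPHA = 1.6732632423543772
SELU_SCALE = 1.0507009873554805

def selu(z):
    """scale * ELU_alpha(z) with the fixed SELU constants."""
    neg = SELU_ALPHA * (np.exp(np.minimum(z, 0.0)) - 1.0)
    return SELU_SCALE * np.where(z < 0, neg, z)

rng = np.random.default_rng(0)
n_instances, n_units, n_layers = 1000, 100, 50   # hypothetical sizes

# Standardized inputs (mean 0, std 1).
a = rng.standard_normal((n_instances, n_units))
for _ in range(n_layers):
    # LeCun normal initialization: variance 1 / fan_in, no bias term.
    W = rng.normal(0.0, np.sqrt(1.0 / n_units), size=(n_units, n_units))
    a = selu(a @ W)

final_mean, final_std = a.mean(), a.std()
print(f"after {n_layers} layers: mean={final_mean:.2f}, std={final_std:.2f}")
```

The mean should stay near 0 and the standard deviation near 1 even after many layers, whereas swapping in a different initializer or activation typically makes the statistics drift.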

> ReLU is the default in most cases because it is faster to compute than the alternatives, which reduces the latency of the network. Nowadays latency is often more important than faster training or convergence.

To use Leaky ReLU:

```python
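from tensorflow import keras

model = keras.models.Sequential([
    # [...] earlier layers
    keras.layers.Dense(10),             # no activation on the dense layer itself
    keras.layers.LeakyReLU(alpha=0.2),  # Leaky ReLU applied as a separate layer
    # [...] later layers
])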
```

For SELU:

```python
layer = keras.layers.Dense(10, activation='selu', 
                          kernel_initializer='lecun_normal')
```

## Batch Normalization

The activation functions above can significantly reduce the risk of vanishing/exploding gradients at the start of training, but they don't guarantee that the problem won't come back later during training.

A 2015 paper proposed a technique called *Batch Normalization* (BN) that addresses these problems. It adds an operation just before or after the activation function of each hidden layer. This operation zero-centers and normalizes each input, then scales and shifts the result using two parameter vectors: one for scaling and one for shifting. In other words, the operation lets the model learn the optimal scale and mean of each of the layer's inputs.

If you add a BN layer at the very start of the network, it acts as a standardization step for the inputs (although it only sees one batch at a time).

To zero-center and normalize the inputs, the algorithm estimates each input's mean and standard deviation over the current mini-batch (hence the name "Batch Normalization"). The whole operation is summarized step by step:
$$
1.\ \mu_B = \frac1{m_B}\sum_{i=1}^{m_B}x^{(i)} 
\\
2.\ \sigma_B^2 = \frac1{m_B}\sum_{i=1}^{m_B}(x^{(i)} - \mu_B)^2
\\
3.\ \hat x^{(i)} = \frac{x^{(i)} - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}
\\
4.\ z^{(i)} = \gamma \times \hat x^{(i)} + \beta
$$
In this algorithm:

- $\mu_B$: vector of input means, evaluated over the mini-batch B (it contains one mean per input).
- $\sigma_B$: vector of input standard deviations, also evaluated over B (one standard deviation per input).
- $m_B$: mini-batch size.
- $\hat x^{(i)}$: vector of zero-centered and normalized inputs for instance $i$.
- $\gamma$: the output scale parameter vector.
- $\times$: element-wise multiplication.
- $\beta$: the output shift parameter vector.
- $\epsilon$: a tiny smoothing term that avoids division by zero.
- $z^{(i)}$: the output of the BN operation, a rescaled and shifted version of the inputs.
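The four steps can be written out in a few lines of NumPy. This is a training-time sketch only, with assumed values: the mini-batch is random, `eps=1e-5` is an arbitrary choice for the smoothing term, and `gamma` and `beta` are fixed here although a real layer would learn them.

```python
import numpy as np

def batch_norm_train(X, gamma, beta, eps=1e-5):
    """Apply the four BN steps to a mini-batch X of shape (m_B, n_inputs)."""
    mu_B = X.mean(axis=0)                       # 1. mean per input
    var_B = ((X - mu_B) ** 2).mean(axis=0)      # 2. variance per input
    X_hat = (X - mu_B) / np.sqrt(var_B + eps)   # 3. zero-center and normalize
    Z = gamma * X_hat + beta                    # 4. scale and shift
    return Z, mu_B, var_B

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=3.0, size=(64, 4))   # hypothetical mini-batch
gamma = np.full(4, 2.0)    # stand-in for the learned scale vector
beta = np.full(4, -1.0)    # stand-in for the learned shift vector

Z, mu_B, var_B = batch_norm_train(X, gamma, beta)
print("output mean per input:", Z.mean(axis=0))   # close to beta
print("output std per input: ", Z.std(axis=0))    # close to gamma
```

Whatever the scale and offset of the incoming batch, the output ends up with a mean of about $\beta$ and a standard deviation of about $\gamma$ per input.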

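At test time we need to do something different, since we may have to make predictions for a single instance, which is too small a sample to compute a reliable mean or standard deviation. What today's frameworks do is estimate the means and standard deviations during training, using an exponential moving average of the batch statistics, and then use those estimates at test time.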

Batch Normalization also acts as a regularizer, reducing the need for other regularization techniques.

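It does add complexity to the model, and with it a runtime penalty: predictions are slower because of the extra computations in each layer. This can be avoided after training by merging each BN layer with the previous layer. If the previous layer computes $Z = XW + b$, the BN layer outputs $\gamma \times (XW + b - \mu) / \sigma + \beta$, where $\mu$ and $\sigma$ are the statistics used at test time. So we can define:
$$
Z = XW + b,
\\
W' = \gamma \times W/ \sigma,
\\
b' = \gamma \times (b - \mu)/ \sigma + \beta
$$
If we replace W and b with W' and b', the previous layer directly produces the BN output, and the BN layer can be removed.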
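A quick way to convince yourself that the merge is exact is to compare both paths numerically. The sketch below uses random stand-ins for everything a trained model would hold, and it assumes the algebra obtained by substituting $Z = XW + b$ into the BN transform: both the weights and the shifted bias are multiplied by $\gamma / \sigma$, and $\beta$ is added to the bias.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_out = 5, 3                           # hypothetical layer size
X = rng.standard_normal((8, n_in))
W = rng.standard_normal((n_in, n_out))
b = rng.standard_normal(n_out)

# Statistics and parameters a trained BN layer would hold (random stand-ins).
mu = rng.standard_normal(n_out)
sigma = rng.uniform(0.5, 2.0, size=n_out)    # assumed to include the smoothing term
gamma = rng.standard_normal(n_out)
beta = rng.standard_normal(n_out)

# Unfused: dense layer followed by BN at inference time.
Z = X @ W + b
unfused = gamma * (Z - mu) / sigma + beta

# Fused: fold gamma / sigma into the weights and the bias.
W_fused = W * (gamma / sigma)                # scales each output column
b_fused = gamma * (b - mu) / sigma + beta
fused = X @ W_fused + b_fused

print("max abs difference:", np.abs(unfused - fused).max())
```

The two outputs agree up to floating-point rounding, so the fused layer costs the same as a plain dense layer at prediction time.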

> The extra computations in BN layers may slow down each training step, but training converges in fewer steps, so the overall training time is usually shorter than for the same network without BN.

### Implementing Batch Normalization with Keras

In Keras, just add a `BatchNormalization` layer before or after each hidden layer's activation function, and optionally add a BN layer as the first layer of the model:


```python
import tensorflow as tf
from tensorflow import keras

model = keras.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(256, activation='relu'),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(10, activation='softmax')
])

model.summary()
```

```
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
flatten (Flatten)            (None, 784)               0         
_________________________________________________________________
batch_normalization (BatchNo (None, 784)               3136      
_________________________________________________________________
dense (Dense)                (None, 256)               200960    
_________________________________________________________________
batch_normalization_1 (Batch (None, 256)               1024      
_________________________________________________________________
dense_1 (Dense)              (None, 10)                2570      
Total params: 207,690
Trainable params: 205,610
Non-trainable params: 2,080
_________________________________________________________________
```


Let's look at the parameters of the first BN layer. Two are trainable (gamma and beta) and two are not: the moving mean and moving variance, which are the moving averages we discussed above.


```python
[(var.name, var.trainable) for var in model.layers[1].variables]
```

```
[('batch_normalization/gamma:0', True),
 ('batch_normalization/beta:0', True),
 ('batch_normalization/moving_mean:0', False),
 ('batch_normalization/moving_variance:0', False)]
```

The authors of the BN paper argued in favor of adding the BN layer before the activation function. To do that, remove the activation from the hidden layer and add a separate `Activation` layer after the BN layer. Moreover, since a BN layer already has an offset parameter, you can remove the bias term from the hidden layer by setting `use_bias=False`:

```python
model = keras.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(256, use_bias=False),
    keras.layers.BatchNormalization(),
    keras.layers.Activation('relu'),
    keras.layers.Dense(10, activation='softmax')
])

model.summary()
```

```
Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
flatten_1 (Flatten)          (None, 784)               0         
_________________________________________________________________
batch_normalization_2 (Batch (None, 784)               3136      
_________________________________________________________________
dense_2 (Dense)              (None, 256)               200704    
_________________________________________________________________
batch_normalization_3 (Batch (None, 256)               1024      
_________________________________________________________________
activation (Activation)      (None, 256)               0         
_________________________________________________________________
dense_3 (Dense)              (None, 10)                2570      
Total params: 207,434
Trainable params: 205,354
Non-trainable params: 2,080
_________________________________________________________________
```

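The BN layer's default hyperparameters are usually fine, but you may occasionally need to tweak `momentum`, which controls how the moving averages are updated. Given a new value $v$ (a vector of input means or variances computed over the current batch), the layer updates its running average $\hat v$ as $\hat v \leftarrow \hat v \times momentum + v \times (1 - momentum)$. A good value is close to 1, such as 0.9, 0.99, or 0.999 (add more 9s for larger datasets).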
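To get a feel for what `momentum` does, the sketch below applies the moving-average update `running = running * momentum + batch_value * (1 - momentum)`, which is how Keras documents the `BatchNormalization` update. The scenario is hypothetical: every batch reports the same statistic, and the helper `steps_to_reach` simply counts how many updates the running average needs to get within 5% of it when starting from 0.

```python
def update_moving_average(running, batch_value, momentum=0.99):
    """One exponential-moving-average step, as used for BN's moving statistics."""
    return running * momentum + batch_value * (1.0 - momentum)

def steps_to_reach(target, momentum, tolerance=0.05):
    """Count updates until the running value is within `tolerance` of `target`."""
    running, steps = 0.0, 0          # the running average starts at 0 here
    while abs(running - target) > tolerance * abs(target):
        running = update_moving_average(running, target, momentum)
        steps += 1
    return steps

for momentum in (0.9, 0.99, 0.999):
    n = steps_to_reach(5.0, momentum)
    print(f"momentum={momentum}: {n} steps to get within 5% of the batch value")
```

Each extra 9 makes the average roughly ten times slower to adapt but also ten times less sensitive to a single noisy batch, which is the trade-off behind using more 9s when there are many batches per epoch.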

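Another important hyperparameter is `axis`, which determines the axis that is normalized. It defaults to -1, meaning the last axis is normalized using statistics computed across all the other axes. For example, with data of shape [batch_size, features], each feature gets its own mean and standard deviation. But if the data comes as [batch_size, height, width], the default computes only one mean and one standard deviation per position along the width axis, shared across all rows. To normalize each of the height × width positions independently, set `axis=[1, 2]`.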
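The effect of `axis` is easiest to see from the shape of the statistics. The NumPy sketch below does not call Keras; it mimics the reduction under the assumption that BN averages over every axis that is not listed in `axis`, using a random batch of 28 × 28 inputs.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((32, 28, 28))   # [batch_size, height, width]

# axis=-1: one statistic per position on the last axis,
# averaged over every other axis (batch and height).
mean_last_axis = X.mean(axis=(0, 1))
print("axis=-1     ->", mean_last_axis.shape)

# axis=[1, 2]: one statistic per (height, width) position,
# averaged over the batch axis only.
mean_per_pixel = X.mean(axis=0)
print("axis=[1, 2] ->", mean_per_pixel.shape)
```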

Batch Normalization has become a default layer in deep neural networks.

## Gradient Clipping

To handle exploding gradients, you can clip the gradients during backpropagation so that they never exceed some threshold. This is called *gradient clipping*. The technique is often used in RNNs, where Batch Normalization is tricky to use.

In Keras, set `clipvalue` or `clipnorm` while creating an optimizer:

```python
optimizer = keras.optimizers.SGD(clipvalue=1.0)
model.compile(loss='mse', optimizer=optimizer)
```

This clips every component of the gradient vector to a value between -1 and 1. But if the gradient vector is [0.9, 100.0], it becomes [0.9, 1.0], which changes the orientation of the gradient vector. In practice this works well, but if you don't want to change the orientation, use `clipnorm=1.0` instead: it rescales the whole gradient whenever its $\ell_2$ norm is greater than 1.0. Now [0.9, 100.0] is clipped to approximately [0.009, 0.99996], which keeps the orientation but almost eliminates the first component.
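The difference between the two modes can be reproduced on the example vector with plain NumPy. The helpers below are hypothetical stand-ins that operate on a single gradient vector; they illustrate the arithmetic rather than Keras's internal implementation.

```python
import numpy as np

def clip_by_value(grad, clip_value):
    """Clip each component independently to [-clip_value, clip_value]."""
    return np.clip(grad, -clip_value, clip_value)

def clip_by_norm(grad, clip_norm):
    """Rescale the whole vector if its L2 norm exceeds clip_norm."""
    norm = np.linalg.norm(grad)
    if norm > clip_norm:
        return grad * (clip_norm / norm)
    return grad

def direction(v):
    """Unit vector pointing the same way as v."""
    return v / np.linalg.norm(v)

grad = np.array([0.9, 100.0])
by_value = clip_by_value(grad, 1.0)
by_norm = clip_by_norm(grad, 1.0)

print("clipvalue=1.0:", by_value, " direction:", direction(by_value))
print("clipnorm=1.0: ", by_norm, " direction:", direction(by_norm))
print("original direction:              ", direction(grad))
```

Clipping by value turns a gradient that points almost entirely along the second axis into one that points roughly along the diagonal, while clipping by norm only shortens it.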

You can see what threshold to use by tracking the size of the gradients in TensorBoard, then tune the value to get the best performance.