# Training Deep Neural Networks

So far we have trained shallow nets, which have only a few hidden layers. To tackle more complex problems, such as object detection or speech recognition, we have to increase the number of layers and neurons. But training a DNN is not that easy, and several problems can occur:

- *Vanishing/Exploding gradients* problem: the gradients grow smaller and smaller, or bigger and bigger, as they flow backward through the network during backpropagation. This makes the lower layers hard to train.
- You might not have enough training data, or labeling the data might be too costly.
- Training may be extremely slow.
- A model with too many parameters can easily overfit the training data.

So, let's tackle these problems. Welcome to Deep Learning!

# Vanishing/Exploding Gradients Problem

During backpropagation, gradients often get smaller and smaller as the algorithm works its way down to the lower layers. As a result, the updates to the parameters of those layers are negligible. This is called the *vanishing gradients* problem. On the other hand, the gradients sometimes grow bigger and bigger, so the parameter updates become too large and the algorithm diverges. This is called the *exploding gradients* problem, and it usually occurs in recurrent neural networks (RNNs).

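One main cause of vanishing gradients is the activation function used. The sigmoid function, for example, saturates at 0 or 1 when its input has a large magnitude (positive or negative), and in those regions its derivative is almost zero. Backpropagation therefore has very little gradient to pass down, and what little there is keeps getting diluted at every layer.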
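To make the saturation concrete, here is a minimal sketch using only the standard library. The helper names (`sigmoid`, `sigmoid_grad`, `max_activation_gradient_factor`) are made up for this illustration. It looks only at the factor contributed by the sigmoid derivative and ignores the weights, which also enter the chain rule, so treat it as an upper-bound intuition rather than a full model of backpropagation.

```python
import math

def sigmoid(z):
    """Logistic (sigmoid) function."""
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    """Derivative of the sigmoid: s(z) * (1 - s(z)). It peaks at 0.25 when z = 0."""
    s = sigmoid(z)
    return s * (1.0 - s)

def max_activation_gradient_factor(n_layers):
    """Largest possible product of sigmoid derivatives across n_layers.

    Assumption for illustration: every layer sits at the best case z = 0,
    and the weight terms of the chain rule are left out.
    """
    return 0.25 ** n_layers

# The gradient collapses quickly as the input moves away from zero.
for z in (0.0, 2.0, 5.0, 10.0):
    print(f"z={z:5.1f}  sigmoid={sigmoid(z):.5f}  gradient={sigmoid_grad(z):.6f}")

# Even in the best case, ten stacked sigmoids shrink the factor below one millionth.
print(max_activation_gradient_factor(10))
```

Since the derivative never exceeds 0.25, each extra sigmoid layer can only shrink this factor further, which is why the lower layers receive so little signal.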

## Glorot and He Initialization

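Xavier Glorot and Yoshua Bengio published a paper addressing this problem. They pointed out that the signal needs to flow properly in both directions: forward when making predictions and backward when backpropagating gradients. This means the variance of a layer's outputs should equal the variance of its inputs, and the same goes for the gradients before and after they flow through the layer in the reverse direction.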

So they propose that we should initialize weights randomly as:
$$
fan_{avg} = \frac{fan_{in} + fan_{out}}{2}
\\
\text{Normal distribution: mean } 0 \text{ and variance } \sigma^2 = \frac{1}{fan_{avg}}
\\
\text{Uniform distribution: between } -r \text{ and } +r \text{, with } r = \sqrt{\frac{3}{fan_{avg}}}
$$
Here, $fan_{in}$ is the number of inputs to the layer and $fan_{out}$ is the number of neurons in the layer.
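The two variants describe the same spread: a uniform distribution over $(-r, +r)$ has variance $r^2/3$, which equals $\frac1{fan_{avg}}$. The sketch below checks this numerically with the standard library. The function names `glorot_params` and `glorot_uniform` and the layer size (300 inputs, 100 neurons) are assumptions chosen for illustration, not part of Keras.

```python
import math
import random

def glorot_params(fan_in, fan_out):
    """Return (sigma, r) for the normal and uniform Glorot variants."""
    fan_avg = (fan_in + fan_out) / 2
    sigma = math.sqrt(1.0 / fan_avg)  # standard deviation of the normal variant
    r = math.sqrt(3.0 / fan_avg)      # half-width of the uniform variant
    return sigma, r

def glorot_uniform(fan_in, fan_out, rng):
    """Sample a fan_in x fan_out weight matrix uniformly from (-r, +r)."""
    _, r = glorot_params(fan_in, fan_out)
    return [[rng.uniform(-r, r) for _ in range(fan_out)] for _ in range(fan_in)]

sigma, r = glorot_params(fan_in=300, fan_out=100)
weights = glorot_uniform(300, 100, random.Random(42))

# The empirical variance of the sampled weights should be close to sigma^2.
flat = [w for row in weights for w in row]
empirical_var = sum(w * w for w in flat) / len(flat)
print(f"sigma^2={sigma ** 2:.5f}  r^2/3={r ** 2 / 3:.5f}  empirical={empirical_var:.5f}")
```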

This is called *Xavier Initialization* or *Glorot Initialization*. If you replace $fan_{avg}$ with $fan_{in}$, you get *LeCun Initialization*, which was proposed back in the 1990s. These initializations can be used with the logistic (sigmoid) activation function.

For ReLU (and its variants), there is *He Initialization*, where $\sigma^2=\frac2{fan_{in}}$ for the normal distribution.
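The factor of 2 compensates for ReLU zeroing out roughly half of its inputs. The following rough simulation pushes one random input through a stack of ReLU layers and compares the mean squared activation at the output for $\sigma^2=\frac2{fan_{in}}$ and $\sigma^2=\frac1{fan_{in}}$ (what Glorot gives when $fan_{in}=fan_{out}$). Everything here is an assumption made for illustration: the helper `relu_stack_mean_square`, the width of 64, the depth of 10, and the choice to reuse the same standard normal draws for both scalings so that the comparison is exact.

```python
import math
import random

def relu_stack_mean_square(x, base_weights, weight_std):
    """Forward x through ReLU layers; return the mean squared output activation.

    base_weights holds standard normal draws (one fan_in x fan_out matrix per
    layer); multiplying by weight_std gives weights with that standard deviation.
    """
    h = x
    for base in base_weights:
        n_out = len(base[0])
        z = [weight_std * sum(h_i * row[j] for h_i, row in zip(h, base))
             for j in range(n_out)]
        h = [max(0.0, v) for v in z]  # ReLU
    return sum(v * v for v in h) / len(h)

rng = random.Random(0)
width, depth = 64, 10
x = [rng.gauss(0, 1) for _ in range(width)]
base_weights = [[[rng.gauss(0, 1) for _ in range(width)] for _ in range(width)]
                for _ in range(depth)]

input_ms = sum(v * v for v in x) / width
he_ms = relu_stack_mean_square(x, base_weights, math.sqrt(2.0 / width))
glorot_ms = relu_stack_mean_square(x, base_weights, math.sqrt(1.0 / width))

print(f"input={input_ms:.4f}  he={he_ms:.4f}  glorot={glorot_ms:.6f}")
```

With the He scaling the signal stays on the same order of magnitude as the input, while with the smaller variance it is halved at every layer, so after 10 layers it is about 1000 times weaker.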

By default, Keras uses Glorot Initialization with a uniform distribution. When creating a layer, you can switch to He Initialization by setting `kernel_initializer='he_uniform'` or `kernel_initializer='he_normal'`, like this:

```python
from tensorflow import keras  # assumes the Keras bundled with TensorFlow
keras.layers.Dense(10, activation='relu', kernel_initializer='he_normal')
```

If you want He Initialization with a uniform distribution based on $fan_{avg}$ rather than $fan_{in}$, you can use the `VarianceScaling` initializer like this:

```python
he_avg_init = keras.initializers.VarianceScaling(scale=2, mode='fan_avg',
                                                distribution='uniform')
keras.layers.Dense(10, activation='relu', kernel_initializer=he_avg_init)
```
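For a uniform distribution, `VarianceScaling` draws weights from $(-limit, +limit)$ with $limit = \sqrt{3 \cdot scale / n}$, where $n$ depends on `mode`. The small sketch below reproduces that arithmetic without Keras to show what range the initializer above works out to. The helper `variance_scaling_uniform_limit` and the layer size (300 inputs, 100 neurons) are hypothetical and only mirror the formula.

```python
import math

def variance_scaling_uniform_limit(scale, fan_in, fan_out, mode="fan_avg"):
    """Half-width of the uniform range: sqrt(3 * scale / n), with n chosen by mode."""
    n = {
        "fan_in": fan_in,
        "fan_out": fan_out,
        "fan_avg": (fan_in + fan_out) / 2,
    }[mode]
    return math.sqrt(3.0 * scale / n)

# scale=2 with fan_avg: the He-style variant built in the snippet above.
limit = variance_scaling_uniform_limit(scale=2, fan_in=300, fan_out=100)

# scale=1 with fan_avg: matches the Glorot uniform range r = sqrt(3 / fan_avg).
glorot_limit = variance_scaling_uniform_limit(scale=1, fan_in=300, fan_out=100)

print(f"he_avg limit={limit:.4f}  glorot limit={glorot_limit:.4f}")
```

Doubling `scale` doubles the variance, so the range widens by a factor of $\sqrt2$ compared with Glorot.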

