# Chapter 11: Training Deep Neural Nets

In the previous chapter, we trained a neural network with 2 hidden layers. More complex problems may require much deeper networks, with many more hidden layers and hundreds of neurons per layer. Training these can lead to several problems:

- The _vanishing gradients_ and _exploding gradients_ problems make the lower layers hard to train.
- Training a large network can be very slow.
- A model with millions of parameters risks overfitting the training data.

Below we will discuss methods for solving all of these problems.

## Vanishing/Exploding Gradients Problem

While training a neural network with backpropagation, the algorithm works from the output layer back toward the input layer, computing each layer's contribution to the error gradient. The gradients often get smaller and smaller as the algorithm progresses down to the lower layers, so the updates to the lower layers' weights approach zero and those layers barely train. This is known as the _vanishing gradients_ problem. Alternatively, the gradients can grow bigger and bigger, which can cause the algorithm to diverge. This is called the _exploding gradients_ problem.

Around 2010, a paper by Xavier Glorot and Yoshua Bengio titled ["Understanding the Difficulty of Training Deep Feedforward Neural Networks"](http://proceedings.mlr.press/v9/glorot10a/glorot10a.pdf) identified some reasons for this: the sigmoid activation function combined with the random initialization of the weights using a normal distribution with a mean of 0 and a standard deviation of 1. The paper showed that with this combination the variance of each layer's outputs is much larger than the variance of its inputs. Going forward through the network, the variance keeps growing from layer to layer until the activation function saturates near its horizontal asymptotes, where its derivative is close to 0, which causes the gradient to vanish.
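
The effect described above can be illustrated with a small NumPy experiment. This is only a sketch under stated assumptions: the layer width of 100 units, the 1,000 random input rows, and the seed are arbitrary choices for illustration and do not come from the paper.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n_units = 100  # hypothetical layer width, chosen for illustration

# Unit-variance inputs and weights drawn from N(0, 1)
inputs = rng.normal(0.0, 1.0, size=(1000, n_units))
weights = rng.normal(0.0, 1.0, size=(n_units, n_units))

# Weighted sums fed to the sigmoid, and the resulting activations
pre_activations = inputs @ weights
activations = sigmoid(pre_activations)

input_variance = inputs.var()
pre_activation_variance = pre_activations.var()

# sigmoid'(z) = sigmoid(z) * (1 - sigmoid(z)), which is at most 0.25 (at z = 0)
mean_gradient = np.mean(activations * (1.0 - activations))

print(f"input variance:          {input_variance:.2f}")
print(f"pre-activation variance: {pre_activation_variance:.2f}")
print(f"mean sigmoid derivative: {mean_gradient:.3f}")
```

With this initialization the weighted sums have a far larger variance than the inputs, so most of them land in the flat parts of the sigmoid, where the derivative is far below its maximum of 0.25.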

### Xavier and He Initialization

The authors of the paper found that one way to prevent the vanishing/exploding gradients problem is to ensure that the variance of each layer's outputs equals the variance of its inputs. One way to do this is to initialize the weight matrix using a normal distribution with a mean of 0 and a standard deviation given by

$$ \sigma = \sqrt{\frac{2}{n_\text{ inputs} + n_\text{ outputs}}} $$

or a uniform distribution centered at 0 with a radius, $r$, given by

$$ r = \sqrt{\frac{6}{n_\text{ inputs} + n_\text{ outputs}}} $$

where $n_\text{ inputs}$ and $n_\text{ outputs}$ are the numbers of input and output connections of the layer whose weights are being initialized. This is often known as _Xavier initialization_ after the first author's first name, or sometimes _Glorot initialization_ after his last name.
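
As a quick numerical check of the two formulas above, the sketch below evaluates them for a layer with $28^2 = 784$ inputs and 100 outputs. The function names are hypothetical helpers introduced only for this illustration. A uniform distribution on $[-r, r]$ has variance $r^2 / 3$, so both variants give the weights the same variance.

```python
import math

def xavier_normal_std(n_inputs, n_outputs):
    """Standard deviation for the normal-distribution variant."""
    return math.sqrt(2.0 / (n_inputs + n_outputs))

def xavier_uniform_radius(n_inputs, n_outputs):
    """Radius r for the uniform-distribution variant on [-r, r]."""
    return math.sqrt(6.0 / (n_inputs + n_outputs))

n_inputs, n_outputs = 28 ** 2, 100  # example layer sizes
sigma = xavier_normal_std(n_inputs, n_outputs)
r = xavier_uniform_radius(n_inputs, n_outputs)

# Variance of a uniform distribution on [-r, r]
uniform_variance = r ** 2 / 3.0

print(f"sigma = {sigma:.4f}, r = {r:.4f}")
print(f"normal variance = {sigma ** 2:.6f}, uniform variance = {uniform_variance:.6f}")
```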

For the ReLU activation function, we use a normal distribution with a standard deviation given by

$$ \sigma = \frac{2}{\sqrt{n_\text{ inputs} + n_\text{ outputs}}} $$

or a uniform distribution with a radius given by

$$ r = \sqrt{\frac{12}{n_\text{ inputs} + n_\text{ outputs}}} $$

which is known as _He initialization_. Below is an example of creating a neural network layer that uses He initialization. By default, `tf.layers.dense()` uses Xavier initialization (with a uniform distribution).

In [0]:
import tensorflow as tf

n_inputs = 28 ** 2
n_hidden = 100

X = tf.placeholder(tf.float32, shape=(None, n_inputs))
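# He initialization: scale=2.0 doubles the variance relative to Xavier, and
# mode='fan_avg' averages n_inputs and n_outputs, matching the formula for
# sigma above. The defaults (scale=1.0, mode='fan_in') give a different scaling.
he_init = tf.variance_scaling_initializer(scale=2.0, mode='fan_avg')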
hidden1 = tf.layers.dense(X, n_hidden, activation=tf.nn.relu,
                          kernel_initializer=he_init, name='hidden1')
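
To see what the normal-distribution variant of He initialization produces numerically, here is a NumPy sketch that does not need TensorFlow. It samples a weight matrix with the standard deviation $2 / \sqrt{n_\text{ inputs} + n_\text{ outputs}}$ given above and compares it with the Xavier value. The seed and variable names are assumptions made for this illustration.

```python
import numpy as np

n_inputs, n_outputs = 28 ** 2, 100
rng = np.random.default_rng(42)  # arbitrary seed, for reproducibility only

# Standard deviations from the formulas in this section
xavier_std = np.sqrt(2.0 / (n_inputs + n_outputs))
he_std = 2.0 / np.sqrt(n_inputs + n_outputs)

# Sample a weight matrix for a 784 -> 100 layer
he_weights = rng.normal(0.0, he_std, size=(n_inputs, n_outputs))
empirical_std = he_weights.std()

print(f"Xavier std: {xavier_std:.4f}")
print(f"He std:     {he_std:.4f} (target), {empirical_std:.4f} (sampled)")
```

The He standard deviation is $\sqrt{2}$ times the Xavier one, which means the weight variance is doubled.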

### Nonsaturating Activation Functions

One of the causes of the vanishing/exploding gradients problem discussed in the paper is the sigmoid activation function. The ReLU activation function performs much better, but it has a different problem. If a neuron's weights are updated so that the weighted sum of its inputs is negative for every training instance, the neuron outputs 0 every time. Since the gradient of ReLU is also 0 for negative inputs, Gradient Descent no longer updates the neuron and it remains "dead."

One solution to this problem is to use a "leaky" ReLU function, given by

$$ \text{LeakyReLU}(z) = \max(\alpha z, z) $$

where $\alpha$ is the slope of the function when $z$ is less than 0 (a small positive value such as 0.01). Because this slope is never 0, neurons cannot die completely. Researchers have found that this activation function performs better than the "hard" ReLU function. You can even make $\alpha$ a parameter that the model learns during training.
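
The difference between the two functions shows up in their gradients. The NumPy sketch below uses helper names and sample inputs that are assumptions for illustration. It evaluates both activations and their derivatives at a few points. Note that $\max(\alpha z, z)$ only behaves as a leaky ReLU when $0 < \alpha < 1$.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def relu_grad(z):
    # Derivative of ReLU: 1 for positive inputs, 0 otherwise
    return np.where(z > 0, 1.0, 0.0)

def leaky_relu(z, alpha=0.01):
    # Assumes 0 < alpha < 1, so alpha * z is the larger value when z < 0
    return np.maximum(alpha * z, z)

def leaky_relu_grad(z, alpha=0.01):
    # Derivative of leaky ReLU: 1 for positive inputs, alpha otherwise
    return np.where(z > 0, 1.0, alpha)

z = np.array([-3.0, -0.5, 2.0])

print("ReLU:           ", relu(z), "gradient:", relu_grad(z))
print("leaky ReLU:     ", leaky_relu(z), "gradient:", leaky_relu_grad(z))
```

For the negative inputs the ReLU gradient is exactly 0, so no learning signal passes through, whereas the leaky ReLU gradient is $\alpha$.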

Another activation function, proposed in this [paper](https://arxiv.org/pdf/1511.07289v5.pdf) by Djork-Arné Clevert et al., performs better than leaky ReLU. It is called the _exponential linear unit_ (ELU) and is given by

$$ \text{ELU}_\alpha(z) = \begin{cases}
\alpha\,(\exp(z) - 1) & \text{if}\; z < 0 \\
z & \text{if}\; z \geq 0
\end{cases} $$

It has the following differences from the ReLU function:

- It takes negative values when $z < 0$, which allows the unit to have an average output closer to 0. This helps alleviate the vanishing gradients problem. The hyperparameter $\alpha$ sets the value that ELU approaches when $z$ is a large negative number, namely $-\alpha$.

- It has a nonzero gradient when $z < 0$, preventing the dying units issue.

- When $\alpha = 1$, the function is differentiable everywhere, including at $z = 0$, which helps speed up Gradient Descent.
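
These properties can be checked with a small NumPy version of the function. The helper names and sample inputs below are assumptions made for illustration, and this sketch is separate from TensorFlow's own implementation.

```python
import numpy as np

def elu(z, alpha=1.0):
    z = np.asarray(z, dtype=float)
    # np.minimum keeps the exponential from overflowing for large positive z
    return np.where(z < 0, alpha * np.expm1(np.minimum(z, 0.0)), z)

def elu_grad(z, alpha=1.0):
    z = np.asarray(z, dtype=float)
    # Derivative: alpha * exp(z) for z < 0, and 1 for z >= 0
    return np.where(z < 0, alpha * np.exp(np.minimum(z, 0.0)), 1.0)

z = np.array([-100.0, -1.0, 0.0, 2.0])

print("ELU values:   ", elu(z))
print("ELU gradients:", elu_grad(z))
```

For very negative inputs the output approaches $-\alpha$, and the gradient stays strictly positive for every input, although it becomes very small far below 0.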

The disadvantage of ELU is that it takes longer to compute than ReLU. During training, the extra time is compensated by the fact that ELU helps Gradient Descent converge faster, but at test time it does cause the model to make predictions more slowly.

TensorFlow offers an implementation of ELU, which is used in the code example below:

In [0]:
n_inputs = 28 ** 2
n_hidden = 100

X = tf.placeholder(tf.float32, shape=(None, n_inputs))
# Same layer as before, but with the ELU activation function
hidden1 = tf.layers.dense(X, n_hidden, activation=tf.nn.elu, name='hidden1')

TensorFlow 1.4 and later provide `tf.nn.leaky_relu()`, but the function is also easy to define ourselves:

In [0]:
def leaky_relu(z, alpha=0.01):
    # For 0 < alpha < 1, alpha * z is the larger value only when z < 0
    return tf.maximum(alpha * z, z)

n_inputs = 28 ** 2
n_hidden = 100

X = tf.placeholder(tf.float32, shape=(None, n_inputs))
# Any function that maps a tensor to a tensor can be used as the activation
hidden1 = tf.layers.dense(X, n_hidden, activation=leaky_relu, name='hidden1')

### Batch Normalization

In a 2015 [paper](https://arxiv.org/pdf/1502.03167v3.pdf), Sergey Ioffe and Christian Szegedy proposed a technique called _Batch Normalization_ (BN) to address both the vanishing/exploding gradients problem and the problem that the distribution of each layer's inputs changes during training as the parameters of the previous layers change (i.e. the _Internal Covariate Shift_ problem).

The technique adds an operation to the model just before the activation function of each layer. It zero-centers and normalizes the inputs, then scales and shifts the result using two new parameters per layer. This lets the model learn the optimal scale and mean of the inputs for each layer.

The algorithm starts by computing the empirical mean over the current mini-batch, $B$, which contains $m_B$ instances:

$$ \mu_B = \frac{1}{m_B} \sum\limits_{i\,=\,1}^{m_B} \mathbf{x}^{(i)} $$

Next, we find the empirical variance (the square of the empirical standard deviation, $\sigma_B$) over the same mini-batch, given by

$$ \sigma_B^{\;\;2} = \frac{1}{m_B} \sum\limits_{i\,=\,1}^{m_B} \left( \mathbf{x}^{(i)} - \mu_B \right)^2 $$

Then we zero-center and normalize the inputs in the mini-batch

$$ \hat{\mathbf{x}}^{(i)} = \frac{\mathbf{x}^{(i)} - \mu_B}{\sqrt{\sigma_B^{\;\;2} + \epsilon}} $$

where $\epsilon$ is a small number, typically $10^{-5}$, called the _smoothing term_, which is added to avoid division by zero. Finally, the algorithm computes the output, given by

$$ \mathbf{z}^{(i)} = \gamma\,\hat{\mathbf{x}}^{(i)} + \beta $$

where $\gamma$ is the scaling parameter and $\beta$ is the shift parameter, both of which are learned during training.
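
The four equations above translate almost line by line into NumPy. The following is a minimal sketch of the training-time forward pass only. The function name, the random mini-batch, and the seed are assumptions for illustration, and learning $\gamma$ and $\beta$ is not shown.

```python
import numpy as np

def batch_norm_forward(X, gamma, beta, epsilon=1e-5):
    """Batch Normalization for one mini-batch X of shape (m_B, n_features)."""
    mu = X.mean(axis=0)                        # empirical mean per feature
    var = X.var(axis=0)                        # empirical variance (divides by m_B)
    X_hat = (X - mu) / np.sqrt(var + epsilon)  # zero-center and normalize
    return gamma * X_hat + beta                # scale and shift

rng = np.random.default_rng(0)
X = rng.normal(5.0, 3.0, size=(64, 4))  # mini-batch far from zero mean, unit variance

# With gamma = 1 and beta = 0 the output is simply the normalized input
Z = batch_norm_forward(X, gamma=np.ones(4), beta=np.zeros(4))

# Other values of gamma and beta set the output's standard deviation and mean
Z_scaled = batch_norm_forward(X, gamma=2.0 * np.ones(4), beta=3.0 * np.ones(4))

print("output mean per feature:", Z.mean(axis=0).round(6))
print("output std per feature: ", Z.std(axis=0).round(6))
```

Whatever the mean and variance of the incoming mini-batch, each output feature ends up with a mean of $\beta$ and a standard deviation of approximately $\gamma$.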

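When the model makes predictions, there is no mini-batch to compute statistics from, so it instead uses the empirical mean and standard deviation of the entire training set. These are typically estimated efficiently during training using a moving average. In the end, each batch-normalized layer has 4 sets of parameters: the scaling parameter, $\gamma$, and the shift parameter, $\beta$, which are learned by Gradient Descent, plus the mean, $\mu$, and the standard deviation, $\sigma$, which are estimated from the training data.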
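
One common way to obtain these training-set statistics without a separate pass over the data is to keep an exponential moving average of the mini-batch statistics. The sketch below assumes that approach. The class name, the momentum of 0.9, and the synthetic data are hypothetical choices for illustration, not the API of any library.

```python
import numpy as np

class BatchNormStats:
    """Tracks running estimates of the mean and variance for use at test time."""

    def __init__(self, n_features, momentum=0.9, epsilon=1e-5):
        self.momentum = momentum
        self.epsilon = epsilon
        self.running_mean = np.zeros(n_features)
        self.running_var = np.ones(n_features)

    def update(self, X_batch):
        # Blend the previous estimate with the current mini-batch statistics
        m = self.momentum
        self.running_mean = m * self.running_mean + (1 - m) * X_batch.mean(axis=0)
        self.running_var = m * self.running_var + (1 - m) * X_batch.var(axis=0)

    def normalize(self, X):
        # At test time, use the running estimates instead of batch statistics
        return (X - self.running_mean) / np.sqrt(self.running_var + self.epsilon)

rng = np.random.default_rng(0)
stats = BatchNormStats(n_features=3)

# Simulate training on mini-batches drawn from N(5, 2^2)
for _ in range(200):
    stats.update(rng.normal(5.0, 2.0, size=(32, 3)))

# A single test instance can now be normalized without a mini-batch
x_test = np.array([[5.0, 7.0, 3.0]])
print("running mean:", stats.running_mean.round(2))
print("running var: ", stats.running_var.round(2))
print("normalized test instance:", stats.normalize(x_test).round(2))
```

After enough updates the running estimates settle near the true mean and variance of the data, so a single instance can be normalized consistently at prediction time.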

Adding Batch Normalization to a deep neural network tends to improve the performance of the model, lets you skip normalizing the input data before training the model, and helps the model converge to good parameters in fewer training iterations. However, Batch Normalization causes the model to make predictions more slowly, since each batch-normalized layer adds an extra computation step at prediction time.