# Training Deep Neural Networks (DNNs)
To tackle complex problems we often need deeper models with 10 or more hidden layers, each containing hundreds of neurons, linked by hundreds of thousands of connections. Training such networks raises some common challenges:
- Vanishing Gradients Problem -> many solutions below
- Exploding Gradients Problem -> many solutions below
- Lack of Training Data and Overfitting Noisy Data -> transfer learning and unsupervised pretraining
- Slow Training -> choice of optimizer

# 1. Vanishing and Exploding Gradients
The backpropagation algorithm works by going from the output layer to the input layer and computing the gradient of the cost function with regard to each parameter of the network, then using these gradients to update the parameters with a Gradient Descent step.

As the algorithm progresses backwards, gradients often get smaller toward the earlier layers, so Gradient Descent barely updates their weights. The reason is that the chain rule computes each gradient as a product of many per-layer factors. When most of these factors are smaller than 1, the product shrinks roughly exponentially with depth.

Because each weight update is proportional to its gradient, the weights of the early layers change very little, which hinders learning and can keep training from converging to a good solution. This is called the **vanishing gradients problem**.

In some cases the opposite happens: gradients grow larger and larger, so the weights in the earlier layers of the network are over-adjusted. The updates move further and further away from the optimal weights and the algorithm diverges. This problem is called **exploding gradients** and is quite common in recurrent neural networks (RNNs).
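
A rough way to see both effects is to treat the backpropagated gradient as a product of one factor per layer. The sketch below assumes, purely for illustration, that every layer contributes the same factor. The helper `gradient_scale` and the factors 0.25 (the maximum slope of the logistic sigmoid) and 1.5 are illustrative choices, not values measured on a real network.

```python
def gradient_scale(per_layer_factor, n_layers):
    """Multiply the same factor once per layer, as the chain rule would."""
    scale = 1.0
    for _ in range(n_layers):
        scale *= per_layer_factor
    return scale

# Factors below 1 shrink the gradient exponentially with depth (vanishing)
vanishing = gradient_scale(0.25, n_layers=10)

# Factors above 1 grow it exponentially with depth (exploding)
exploding = gradient_scale(1.5, n_layers=10)

print(f"10 layers, factor 0.25: {vanishing:.2e}")
print(f"10 layers, factor 1.5:  {exploding:.2f}")
```

With only 10 layers, the first product is already below 1e-6, while the second exceeds 50.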

Glorot and Bengio (2010) found the combination of the logistic sigmoid activation function and a normal-distribution weight initialization technique to be a common reason for this behavior. With this combination, the variance of each layer's outputs is larger than the variance of its inputs, so the activations drift toward the flat ends of the sigmoid, where the derivative is close to 0.

### Choice of Initialization
Glorot and Bengio propose an initialization technique in which the variance of a layer's outputs equals the variance of its inputs when moving forward (like a chain of microphones and amplifiers), and the variance of the gradients stays equal before and after flowing through a layer in backpropagation. Both cannot be guaranteed at once unless a layer has as many inputs as outputs, but **Xavier or Glorot Initialization** is a good compromise that works well in practice. It is very similar to LeCun Initialization (1990s).
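
The effect on the forward variance can be checked with a small simulation. This is a minimal sketch under stated assumptions: a single linear neuron with no activation function, 100 inputs and 100 outputs per layer, and the Glorot normal rule `std = sqrt(2 / (fan_in + fan_out))`. The helper `variance_ratio` is a hypothetical name introduced here.

```python
import math
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible

def variance_ratio(weight_std, n_inputs=100, n_draws=2000):
    """Ratio of output variance to input variance for one linear neuron."""
    inputs = [random.gauss(0.0, 1.0) for _ in range(n_inputs)]
    # Mean square of the inputs (their variance, since they are drawn with zero mean)
    input_var = sum(x * x for x in inputs) / n_inputs
    outputs = []
    for _ in range(n_draws):
        # Draw a fresh weight vector and compute the neuron's weighted sum
        weights = [random.gauss(0.0, weight_std) for _ in range(n_inputs)]
        outputs.append(sum(w * x for w, x in zip(weights, inputs)))
    return statistics.pvariance(outputs) / input_var

fan_in = fan_out = 100
naive_ratio = variance_ratio(weight_std=1.0)
glorot_ratio = variance_ratio(weight_std=math.sqrt(2.0 / (fan_in + fan_out)))

print(f"std=1.0 weights: output/input variance = {naive_ratio:.1f}")
print(f"Glorot weights:  output/input variance = {glorot_ratio:.2f}")
```

With unit-variance weights the output variance is roughly `fan_in` times the input variance, whereas the Glorot scaling keeps the ratio close to 1.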

Keras uses Glorot initialization with a uniform distribution by default. This behavior can be changed with the `kernel_initializer` argument when creating a layer.
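
As a hedged sketch of how the initializer might be changed, the cell below assumes Keras is imported from TensorFlow. The layer sizes and activations are arbitrary illustration values.

```python
from tensorflow import keras

# He initialization with a normal distribution, selected by name
dense_he = keras.layers.Dense(10, activation="relu", kernel_initializer="he_normal")

# He-style scaling based on fan_avg instead of fan_in, with a uniform distribution
he_avg_init = keras.initializers.VarianceScaling(
    scale=2.0, mode="fan_avg", distribution="uniform")
dense_he_avg = keras.layers.Dense(10, activation="sigmoid", kernel_initializer=he_avg_init)
```

Passing a string picks a built-in initializer with its default settings, while a `VarianceScaling` instance gives control over the scale, the fan mode and the distribution.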

### Nonsaturating Activation Functions
Besides initialization, the choice of activation function can also lead to saturation. ReLU works well in general since it does not saturate for positive values and is fast to compute. However, some neurons can die during training, meaning they stop outputting anything other than 0. This problem is known as **dying ReLU**.

The variant called **LeakyReLU** avoids this problem by having a non-zero slope even for negative values. There are many types of such leaky ReLUs, and the leakiness can be tuned with the hyperparameter alpha.

Alternatively, we can use the **ELU function**, which looks like a smoothed ReLU. ELU is slower to compute than ReLU because of the exponential, although faster convergence often compensates for this during training. There is a variant called SELU which can lead to the network's self-normalization given certain prerequisites.
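
For reference, the four functions can be written as plain scalar functions. This is a minimal sketch, not how Keras implements them. The default `alpha` values are assumptions chosen for illustration, and the two SELU constants are the standard published values.

```python
import math

# Standard SELU constants
SELU_ALPHA = 1.6732632423543772
SELU_SCALE = 1.0507009873554805

def relu(z):
    """Zero for negative inputs, identity for positive inputs."""
    return max(0.0, z)

def leaky_relu(z, alpha=0.2):
    """Like ReLU, but with a small slope alpha for negative inputs."""
    return z if z > 0 else alpha * z

def elu(z, alpha=1.0):
    """Smooth for negative inputs, approaching -alpha as z goes to -infinity."""
    return z if z > 0 else alpha * (math.exp(z) - 1.0)

def selu(z):
    """Scaled ELU with fixed constants."""
    return SELU_SCALE * elu(z, SELU_ALPHA)

for z in (-2.0, 0.5):
    print(z, relu(z), leaky_relu(z), round(elu(z), 4), round(selu(z), 4))
```

Only `relu` returns exactly 0 for every negative input, which is why a ReLU neuron whose weighted sum stays negative stops learning.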

Géron suggests: SELU > ELU > Leaky ReLU > ReLU > tanh > logistic

In [None]:
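# LeakyReLU, added as its own layer after a Dense layer with He initialization
from tensorflow import keras  # assumes Keras as bundled with TensorFlow

model = keras.models.Sequential([
    [...],  # placeholder for earlier layers
    keras.layers.Dense(30, kernel_initializer="he_normal"),
    keras.layers.LeakyReLU(alpha=0.2),  # alpha is the slope for negative inputs
    [...]   # placeholder for later layers
])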

In [None]:
# SELU
model = keras.models.Sequential([
    [...],
    keras.layers.Dense(30, activation="selu", kernel_initializer="lecun_normal"),
    [...]
])

### Batch Normalization
Proper initialization reduces the vanishing/exploding gradients problems at the beginning of training, but it does not guarantee that they won't come back later. **Batch Normalization (BN)** addresses this problem.

An operation is added just before or after the activation function of each hidden layer. It zero-centers and normalizes each input using the mean and standard deviation of the current mini-batch, then scales and shifts the result. Thus, the operation lets the model learn the optimal scale and mean of each of the layer's inputs.
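
The training-time computation for a single feature can be sketched in plain Python. The function name `batch_norm_train` and the sample values are illustrative assumptions, and `eps` is a small smoothing term assumed here to avoid division by zero.

```python
import math

def batch_norm_train(batch, gamma=1.0, beta=0.0, eps=1e-3):
    """Normalize one feature over a mini-batch, then scale and shift it."""
    mean = sum(batch) / len(batch)
    var = sum((x - mean) ** 2 for x in batch) / len(batch)
    # Zero-center and normalize with the mini-batch statistics
    normalized = [(x - mean) / math.sqrt(var + eps) for x in batch]
    # gamma (scale) and beta (shift) are the values learned by backpropagation
    return [gamma * x_hat + beta for x_hat in normalized]

out = batch_norm_train([1.0, 2.0, 3.0, 4.0], gamma=2.0, beta=0.5)
out_mean = sum(out) / len(out)
print([round(v, 3) for v in out], round(out_mean, 3))
```

After the operation, the mean of the outputs equals `beta` and their spread is set by `gamma`, regardless of the scale of the raw inputs.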

Adding a BN layer as the very first layer is roughly equivalent to standardizing the data before training. 

Four parameter vectors are learned for each batch-normalized layer in Keras' standard implementation of Batch Normalization:
- the output scale vector (learned through backpropagation)
- the output offset vector (learned through backpropagation)
- the final input mean vector (estimated during training with a moving average, but used only after training)
- the final input standard deviation vector (estimated during training with a moving average, but used only after training)
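
The split between the two kinds of vectors can be inspected on a built layer. This is a hedged sketch assuming Keras is imported from TensorFlow. The feature count of 4 is arbitrary. Note that Keras stores a moving variance rather than a standard deviation.

```python
from tensorflow import keras

bn = keras.layers.BatchNormalization()
bn.build((None, 4))  # 4 input features, unknown batch size

# Scale (gamma) and offset (beta) are updated by backpropagation
n_trainable = len(bn.trainable_weights)

# Moving mean and moving variance are updated as running averages instead
n_non_trainable = len(bn.non_trainable_weights)

print(n_trainable, n_non_trainable)
for w in bn.weights:
    print(w.name, tuple(w.shape))
```

Each of the four vectors has one entry per input feature, so a BN layer adds four parameters per input, half of which are not touched by backpropagation.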

Researchers generally found **Batch Normalization to improve all kinds of DNNs**, e.g. it led to huge improvements on the ImageNet classification task. Saturating activation functions such as tanh and the logistic function become usable again. Furthermore, BN acts as a regularizer, reducing the need for dropout.

Downsides are slower predictions and slightly more complex models. Each epoch is generally slower, but fewer epochs are usually needed for convergence, so wall time will usually be shorter.

Applying BN after each layer is so common in practice that the BN layers are often omitted from diagrams. Recent research challenges this approach, but as of today it is still good practice.

In [None]:
# BN as first layer and after every hidden layer
# tends to pay off mainly for much deeper networks
model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(300, activation="elu", kernel_initializer="he_normal"),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(100, activation="elu", kernel_initializer="he_normal"),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(10, activation="softmax")
])

### Gradient Clipping
Another approach to mitigating exploding gradients is to clip the gradients during backpropagation so that they never exceed a threshold. It is mainly used as an alternative to BN in RNNs. For other types of networks, BN is usually sufficient.
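
Clipping can be applied to each gradient component separately or to the gradient vector as a whole. The plain-Python sketch below illustrates the difference. The helpers `clip_by_value` and `clip_by_norm` are hypothetical names that mirror the idea behind the `clipvalue` and `clipnorm` optimizer arguments, not Keras internals.

```python
import math

def clip_by_value(grad, threshold):
    """Clip every component independently to [-threshold, threshold]."""
    return [max(-threshold, min(threshold, g)) for g in grad]

def clip_by_norm(grad, max_norm):
    """Rescale the whole vector if its L2 norm exceeds max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= max_norm:
        return list(grad)
    return [g * max_norm / norm for g in grad]

grad = [0.9, 100.0]
by_value = clip_by_value(grad, 1.0)  # components become nearly equal
by_norm = clip_by_norm(grad, 1.0)    # ratio between components is preserved

print(by_value)
print([round(g, 6) for g in by_norm])
```

Clipping by value turns a gradient that pointed almost entirely along the second axis into one pointing roughly along the diagonal, whereas clipping by norm keeps the direction and only shortens the vector.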


In [1]:
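from tensorflow import keras  # import needed by this cell

# clipvalue=1.0 clips every gradient component to [-1.0, 1.0], which may change
# the orientation of the gradient vector; use clipnorm=1.0 instead to preserve it
optimizer = keras.optimizers.SGD(clipvalue=1.0)
model.compile(loss="mse", optimizer=optimizer)  # model: a Keras model defined earlier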

# 2. Reusing Pretrained Layers

Chapter 14 shows ways of finding a good existing neural network. Rather than training a very large deep network from scratch, it is generally better to find an existing network that solves a similar task and reuse its lower layers. This technique is called **transfer learning**.
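
A minimal sketch of the idea, assuming Keras is imported from TensorFlow: `model_a` is an untrained stand-in for an existing network (in practice it would be loaded with trained weights), and all layer sizes are arbitrary illustration values.

```python
import numpy as np
from tensorflow import keras

# Stand-in for an existing network trained on a similar task
model_a = keras.models.Sequential([
    keras.Input(shape=(8,)),
    keras.layers.Dense(16, activation="relu"),
    keras.layers.Dense(8, activation="relu"),
    keras.layers.Dense(3, activation="softmax"),
])

# Reuse every layer except the task-specific output layer
# (the layers are shared with model_a, not copied)
model_b = keras.models.Sequential(model_a.layers[:-1])
model_b.add(keras.layers.Dense(1, activation="sigmoid"))

# Freeze the reused layers so that only the new output layer is trained at first
for layer in model_b.layers[:-1]:
    layer.trainable = False

# Run a dummy batch through the new model to check that the pieces fit together
predictions = model_b(np.zeros((2, 8), dtype="float32"))
print(tuple(predictions.shape))
```

Because the layers are shared, training `model_b` with unfrozen layers would also change `model_a`. Cloning the original model first avoids this if the original is still needed.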