# Chapter 11: Training Deep Neural Networks

Some problems that can arise when training a deep neural network (DNN):
- Vanishing or exploding gradients. Gradients grow smaller and smaller (or larger and larger) as they flow backward through the network, which makes the lower layers very hard to train.
- Not enough training data, or the data is too costly to label.
- Training may be extremely slow.
- A model with millions of parameters risks overfitting the training set.

## 11.1 The Vanishing/Exploding Gradients Problems

Backpropagation works from the output layer back to the input layer, propagating the error gradient along the way. Gradient Descent then uses these gradients to update each parameter.

**Vanishing gradients problem** - Gradients get smaller and smaller as the algorithm progresses down to the lower layers. As a result, the Gradient Descent update leaves the lower layers' connection weights virtually unchanged, and training never converges to a good solution.

**Exploding gradients problem** - The opposite effect: gradients get bigger and bigger, so the layers receive very large weight updates and the algorithm diverges.

> In general, deep neural networks suffer from unstable gradients; different layers may learn at widely different speeds.
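The shrinking and growing described above can be illustrated with a deliberately simplified sketch. It assumes the gradient is multiplied by the same constant factor at every layer, which real networks do not do exactly. The function name `gradient_scale` and the factor 1.5 are hypothetical choices for illustration; 0.25 is the maximum slope of the logistic sigmoid.

```python
def gradient_scale(per_layer_factor, n_layers):
    """Size of a unit gradient after being multiplied by the same
    factor at each of n_layers layers (a simplifying assumption)."""
    return per_layer_factor ** n_layers


# 0.25 is the largest possible derivative of the logistic sigmoid;
# 1.5 is an arbitrary factor greater than 1, chosen for illustration.
vanishing = gradient_scale(0.25, 10)
exploding = gradient_scale(1.5, 10)

print(f"factor 0.25 over 10 layers: {vanishing:.2e}")
print(f"factor 1.5 over 10 layers:  {exploding:.1f}")
```

Even in this toy setting, ten layers are enough to shrink the gradient by about six orders of magnitude or to grow it more than fiftyfold.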

Glorot and Bengio found that the combination of the logistic sigmoid activation function and the then-common practice of initializing weights from a normal distribution ($\mu=0, \sigma = 1$) contributed to this issue.

- Initializing weights using a normal distribution ($\mu=0, \sigma = 1$):
    - The variance (the spread) of each layer's outputs is much greater than the variance of its inputs.
    - Going forward in the network, the variance keeps increasing after each layer until the activation function saturates (its inputs end up far to the left or right) in the top layers.

- Logistic sigmoid activation function:
    - Because the variance keeps increasing, the inputs to the function become large in magnitude (far left or far right). The function then saturates at 0 or 1, where its derivative is extremely close to 0.
    - So backpropagation has virtually no error gradient to propagate back to the lower layers.
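The two effects above can be checked numerically. The following is a minimal sketch using only the Python standard library, not a real network: it assumes a single layer with 100 standard-normal inputs and 500 neurons whose weights are drawn from $N(0, 1)$. The layer sizes, the seed, and all variable names are arbitrary choices for illustration.

```python
import math
import random
import statistics

random.seed(42)


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def sigmoid_derivative(z):
    s = sigmoid(z)
    return s * (1.0 - s)


fan_in, n_neurons = 100, 500  # hypothetical layer sizes
inputs = [random.gauss(0.0, 1.0) for _ in range(fan_in)]

# Weighted sum for each neuron, with weights drawn from N(0, 1).
pre_activations = []
for _ in range(n_neurons):
    weights = [random.gauss(0.0, 1.0) for _ in range(fan_in)]
    pre_activations.append(sum(w * x for w, x in zip(weights, inputs)))

input_variance = statistics.pvariance(inputs)
z_variance = statistics.pvariance(pre_activations)
mean_slope = statistics.mean([sigmoid_derivative(z) for z in pre_activations])

print(f"input variance:          {input_variance:.2f}")
print(f"pre-activation variance: {z_variance:.2f}")
print(f"mean sigmoid slope:      {mean_slope:.3f} (maximum possible is 0.25)")
```

The pre-activation variance comes out roughly `fan_in` times larger than the input variance, so most neurons sit in the flat tails of the sigmoid and the average slope ends up well below its maximum of 0.25.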

### 11.1.1 Glorot and He Initialization

For the signal to flow properly, we need the variance of the outputs of each layer to be equal to the variance of its inputs, and we need the gradients to have equal variance before and after flowing through a layer in the reverse direction.

> Microphone amplifier analogy: if the knob is set too close to 0, your voice is inaudible; if it is set too close to the max, your voice is saturated. In a chain of amplifiers, every one of them must be set properly for your voice to come out loud and clear at the end of the chain.

> Your voice has to come out of each amplifier at the same amplitude as it came in.

It's impossible to guarantee both conditions (equal output variance and equal gradient variance) unless the layer has an equal number of inputs and neurons (its *fan-in* and *fan-out*). Glorot and Bengio proposed a good compromise: initialize the connection weights randomly according to *Equation 11-1*, where $\text{fan}_\text{avg} = (\text{fan}_\text{in} + \text{fan}_\text{out})/2$. This strategy is called **Xavier initialization** or **Glorot initialization**, and it is the one to use with the logistic activation function.
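These notes do not reproduce Equation 11-1. As it is commonly stated for Glorot initialization, the weights are drawn from either of the following distributions:

$$
\text{Normal distribution with mean } 0 \text{ and variance } \sigma^2 = \frac{1}{\text{fan}_\text{avg}}
$$

$$
\text{Uniform distribution between } -r \text{ and } +r, \text{ with } r = \sqrt{\frac{3}{\text{fan}_\text{avg}}}
$$

Both have the same variance, since a uniform distribution on $[-r, +r]$ has variance $r^2/3 = 1/\text{fan}_\text{avg}$.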

**LeCun initialization** - Glorot initialization with $\text{fan}_\text{avg}$ replaced by $\text{fan}_\text{in}$. The two are equivalent when $\text{fan}_\text{in} = \text{fan}_\text{out}$.

**He initialization** - The initialization strategy for the ReLU activation function (and its variants). It uses a normal distribution with variance $\sigma^2 = 2/\text{fan}_\text{in}$.
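The three strategies differ only in the variance of the distribution the weights are drawn from. The following is a minimal sketch, assuming the commonly cited normal-distribution variances ($1/\text{fan}_\text{avg}$ for Glorot, $1/\text{fan}_\text{in}$ for LeCun, $2/\text{fan}_\text{in}$ for He). The helper names `init_stddev` and `variance_ratio` and the layer size of 300 are hypothetical, and the check uses a single linear layer with no activation function.

```python
import math
import random
import statistics


def init_stddev(fan_in, fan_out, scheme):
    """Standard deviation of the normal distribution for each scheme."""
    if scheme == "glorot":
        return math.sqrt(1.0 / ((fan_in + fan_out) / 2.0))
    if scheme == "lecun":
        return math.sqrt(1.0 / fan_in)
    if scheme == "he":
        return math.sqrt(2.0 / fan_in)
    raise ValueError(f"unknown scheme: {scheme}")


def variance_ratio(fan_in, fan_out, stddev, rng):
    """Output variance divided by input variance for one linear layer."""
    inputs = [rng.gauss(0.0, 1.0) for _ in range(fan_in)]
    outputs = [
        sum(rng.gauss(0.0, stddev) * x for x in inputs)
        for _ in range(fan_out)
    ]
    return statistics.pvariance(outputs) / statistics.pvariance(inputs)


rng = random.Random(0)
fan_in = fan_out = 300  # hypothetical layer size

glorot_std = init_stddev(fan_in, fan_out, "glorot")
naive_ratio = variance_ratio(fan_in, fan_out, 1.0, rng)        # sigma = 1
glorot_ratio = variance_ratio(fan_in, fan_out, glorot_std, rng)

print(f"sigma = 1:    output/input variance = {naive_ratio:.1f}")
print(f"Glorot sigma: output/input variance = {glorot_ratio:.2f}")
```

With $\sigma = 1$ the variance grows by a factor of roughly `fan_in` in a single layer, while with the Glorot standard deviation the ratio stays close to 1, which is the property the amplifier analogy describes.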

### 11.1.2 Nonsaturating Activation Functions

### 11.1.3 Batch Normalization

### 11.1.4 Gradient Clipping

## 11.2 Reusing Pretrained Layers

### 11.2.1 Transfer Learning with Keras

### 11.2.2 Unsupervised Pretraining

### 11.2.3 Pretraining on an Auxiliary Task

## 11.3 Faster Optimizers

### 11.3.1 Momentum Optimization

### 11.3.2 Nesterov Accelerated Gradient

### 11.3.3 AdaGrad

### 11.3.4 RMSProp

### 11.3.5 Adam and Nadam Optimization

### 11.3.6 Learning Rate Scheduling

## 11.4 Avoiding Overfitting Through Regularization

### 11.4.1 $\ell_1$ and $\ell_2$ Regularization

### 11.4.2 Dropout

### 11.4.3 Monte Carlo (MC) Dropout

### 11.4.4 Max-Norm Regularization

## 11.5 Summary and Practical Guidelines