# Chapter 11: Training Deep Neural Networks

In [None]:
import numpy as np
import tensorflow as tf
from tensorflow import keras

Some problems that arise when training a deep neural network (DNN):
- Vanishing or exploding gradients: gradients grow smaller and smaller, or larger and larger, as they flow backward through the network, which makes the lower layers very hard to train.
- There may not be enough training data, or it may be too costly to label.
- Training may be extremely slow.
- A model with millions of parameters risks overfitting the training set.

## 11.1 The Vanishing/Exploding Gradients Problems

Backpropagation computes and propagates the error gradient of each layer from output to input. 

**Vanishing gradients problem** - Gradients often get smaller and smaller as the algorithm progresses down to the lower layers. As a result, Gradient Descent leaves the lower layers' connection weights virtually unchanged, and training never converges to a good solution.

**Exploding gradients problem** - The opposite effect: gradients grow bigger and bigger, so the layers get very large weight updates and the algorithm diverges.

> In general, deep neural networks suffer from unstable gradients; different layers may learn at widely different speeds.

Glorot and Bengio traced this issue to the combination of the logistic sigmoid activation function and weights initialized from a normal distribution ($\mu=0, \sigma = 1$).

- Initializing weights using a normal distribution:
    - The variance (the spread) of the outputs of each layer is much greater than the variance of its inputs.
    - Going forward in the network, the variance keeps increasing after each layer until the activation function saturates (its inputs end up far to the left or right) at the top layers.

- Logistic sigmoid activation function:
    - Because the variance keeps increasing, inputs become large (negative or positive, "far left/right"), so the function outputs values close to 0 or 1, where its derivative is extremely close to 0.
    - So backpropagation has virtually no error gradient to propagate to the lower layers.
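
The two bullet lists above can be checked numerically. The following is a minimal NumPy sketch for a single layer, assuming a hypothetical width of 100 units and standardized inputs; it is an illustration, not the experiment from the original paper.

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid."""
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(42)
n_units = 100                                  # assumed layer width (fan-in = fan-out)
X = rng.standard_normal((1000, n_units))       # standardized inputs: mean 0, variance ~1
W = rng.standard_normal((n_units, n_units))    # weights drawn from N(0, 1)

Z = X @ W                                      # weighted sums (pre-activations)
A = sigmoid(Z)                                 # layer outputs
dA_dZ = A * (1.0 - A)                          # sigmoid derivative, never above 0.25

input_var = X.var()
preact_var = Z.var()                           # roughly n_units times the input variance
saturated_fraction = np.mean((A < 0.01) | (A > 0.99))
mean_derivative = dA_dZ.mean()

print(f"input variance:           {input_var:.2f}")
print(f"pre-activation variance:  {preact_var:.2f}")
print(f"fraction of saturated outputs: {saturated_fraction:.2f}")
print(f"mean sigmoid derivative:  {mean_derivative:.3f}")
```

With these settings the pre-activation variance is on the order of the layer width, most outputs sit near 0 or 1, and the average derivative is far below its maximum of 0.25, which is the saturation effect described above.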

### 11.1.1 Glorot and He Initialization

For the signal to flow properly, we need the variance of the outputs of each layer to be equal to the variance of its inputs, and we need the gradients to have equal variance before and after flowing through a layer in the reverse direction.

> Microphone amplifier analogy: with the knob set too close to 0, your voice is inaudible; set too close to the max, your voice is saturated. In a chain of amplifiers, every knob needs to be set properly for your voice to come out loud and clear at the end of the chain.

> Your voice has to come out of each amplifier at the same amplitude as it came in.

It's impossible to guarantee both (output and gradient variances) unless the layer has an equal number of inputs and neurons (its *fan-in* and *fan-out*). Glorot and Bengio proposed a good compromise: initialize the connection weights randomly according to *Equation 11-1*, where $\text{fan}_\text{avg} = (\text{fan}_\text{in} + \text{fan}_\text{out})/2$. This strategy is called **Xavier initialization** or **Glorot initialization**, and it is the one to use with the logistic activation function.
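
Equation 11-1 is not reproduced in these notes. As it is usually stated (the symbols $\sigma^2$ for the weight variance and $r$ for the bound of the uniform range are notation assumed here), Glorot initialization draws the weights from either of the following:

$$
\text{Normal distribution with mean } 0 \text{ and variance } \sigma^2 = \frac{1}{\text{fan}_\text{avg}}
$$

$$
\text{or a uniform distribution between } -r \text{ and } +r, \text{ with } r = \sqrt{\frac{3}{\text{fan}_\text{avg}}}
$$

The two forms agree, since a uniform distribution on $[-r, r]$ has variance $r^2/3 = 1/\text{fan}_\text{avg}$.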

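**LeCun initialization** - Draws the weights with variance $1/\text{fan}_\text{in}$. Equivalent to Glorot initialization when $\text{fan}_\text{in} = \text{fan}_\text{out}$.

**He initialization** - The initialization strategy for the ReLU activation function (and its variants). Draws the weights with variance $2/\text{fan}_\text{in}$.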
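
To make the three strategies concrete, here is a minimal NumPy sketch. The helper `init_variance` and the layer sizes are hypothetical, and the weights come from a plain normal distribution, so this is not a reproduction of the Keras initializers. It checks that He initialization followed by ReLU roughly preserves the mean squared activation of standardized inputs.

```python
import numpy as np

def init_variance(fan_in, fan_out, scheme):
    """Weight variance for the normal-distribution form of each scheme."""
    fan_avg = (fan_in + fan_out) / 2
    variances = {
        "glorot": 1.0 / fan_avg,
        "he": 2.0 / fan_in,
        "lecun": 1.0 / fan_in,
    }
    return variances[scheme]

rng = np.random.default_rng(0)
fan_in, fan_out = 300, 100                     # assumed layer sizes
X = rng.standard_normal((2000, fan_in))        # standardized inputs

he_std = np.sqrt(init_variance(fan_in, fan_out, "he"))
W = rng.normal(0.0, he_std, size=(fan_in, fan_out))
A = np.maximum(0.0, X @ W)                     # ReLU activations

input_second_moment = np.mean(X ** 2)          # close to 1 for standardized inputs
output_second_moment = np.mean(A ** 2)         # stays close to 1 with He + ReLU

print(f"mean squared input:      {input_second_moment:.2f}")
print(f"mean squared activation: {output_second_moment:.2f}")
```

ReLU zeroes out about half of the weighted sums, which is why the He variance carries a factor of 2 relative to LeCun's $1/\text{fan}_\text{in}$.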

In [None]:
# Keras uses Glorot initialization with a uniform distribution by default.
# Switch to He initialization through kernel_initializer:
keras.layers.Dense(10, activation="relu", kernel_initializer="he_normal")

# He initialization with a uniform distribution based on fan_avg rather than fan_in:
# use the VarianceScaling initializer.
he_avg_init = keras.initializers.VarianceScaling(scale=2, mode='fan_avg',
                                                 distribution='uniform')
keras.layers.Dense(10, activation="sigmoid", kernel_initializer=he_avg_init)

### 11.1.2 Nonsaturating Activation Functions

**Dying ReLUs** - During training, some neurons effectively "die," meaning they stop outputting anything other than 0.

A neuron dies when its weights get tweaked in such a way that the weighted sum of its inputs is negative for all instances in the training set. When this happens, it just keeps outputting 0s, and Gradient Descent does not affect it anymore because the gradient of the ReLU function is 0 when its input is negative.

To solve this problem, use **leaky ReLU**, defined as $\text{LeakyReLU}_\alpha(z) = \max(\alpha z, z)$. The hyperparameter $\alpha$ defines how much the function "leaks": it is the slope of the function for $z<0$ and is typically set to $0.01$.

> Leaky ReLU is just like ReLU, but with a small slope for negative values. A small slope ensures that leaky ReLUs never die.
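
The difference between the two functions shows up in their gradients. The following NumPy sketch (the function names are illustrative, not a library API) evaluates both on a few sample inputs:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def leaky_relu(z, alpha=0.01):
    # max(alpha * z, z): identity for z >= 0, slope alpha for z < 0 (assumes alpha < 1)
    return np.maximum(alpha * z, z)

def relu_grad(z):
    # 0 for negative inputs: a neuron stuck there receives no updates
    return (z > 0).astype(float)

def leaky_relu_grad(z, alpha=0.01):
    # small but nonzero slope for negative inputs
    return np.where(z > 0, 1.0, alpha)

z = np.array([-3.0, -0.5, 2.0])
print("ReLU:            ", relu(z))
print("leaky ReLU:      ", leaky_relu(z))
print("ReLU grad:       ", relu_grad(z))
print("leaky ReLU grad: ", leaky_relu_grad(z))
```

For the negative inputs the ReLU gradient is exactly 0, while the leaky ReLU gradient equals $\alpha$, so Gradient Descent can still move the weights.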

**Exponential linear unit (ELU)** - Outperformed all the ReLU variants in its authors' experiments. It is defined as $\text{ELU}_\alpha(z) = \alpha(\exp(z) - 1)$ if $z<0$, and $z$ otherwise. It looks a lot like the ReLU function, with some major differences:
- Takes on negative values when $z<0$, allowing an average output closer to 0 and alleviating the vanishing gradients problem.
- Hyperparameter $\alpha$ defines the value that the function approaches ($-\alpha$) when $z$ is a large negative number; it is usually set to 1.
- Has a nonzero gradient for $z<0$, avoiding the dead neurons problem.
- If $\alpha = 1$, then the function is smooth everywhere, including around $z=0$, which helps speed up Gradient Descent.

The main drawback of the ELU function is that it is slower to compute than the ReLU function and its variants (due to the exponential function).
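
A minimal NumPy sketch of ELU and its derivative follows, assuming the standard definition $\alpha(\exp(z) - 1)$ for $z<0$ and $z$ otherwise; the function names are illustrative.

```python
import numpy as np

def elu(z, alpha=1.0):
    z = np.asarray(z, dtype=float)
    # np.minimum keeps the exponential from overflowing on large positive z
    return np.where(z < 0, alpha * np.expm1(np.minimum(z, 0.0)), z)

def elu_grad(z, alpha=1.0):
    z = np.asarray(z, dtype=float)
    # alpha * exp(z) for z < 0, 1 otherwise; the two match at z = 0 when alpha = 1
    return np.where(z < 0, alpha * np.exp(np.minimum(z, 0.0)), 1.0)

z = np.array([-50.0, -1.0, 0.0, 2.0])
print("ELU:      ", elu(z))
print("ELU grad: ", elu_grad(z))
```

For large negative inputs the output approaches $-\alpha$, and the gradient stays positive everywhere, unlike ReLU.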

**Scaled ELU (SELU)** - A scaled variant of the ELU activation function. If all hidden layers use it, the network will *self-normalize* (the output of each layer tends to preserve $\mu =0, \sigma =1$ during training), but only under specific conditions:
- Input features must be standardized ($\mu =0, \sigma =1$).
- Every hidden layer's weights must be initialized with LeCun normal initialization.
- The network's architecture must be sequential (a plain stack of dense layers).
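
Self-normalization can be observed with a small simulation. The sketch below assumes the commonly published SELU constants and a hypothetical stack of 50 dense layers of 100 units each, with LeCun normal weights and standardized inputs. It only runs a forward pass; no training is involved.

```python
import numpy as np

# Commonly published SELU constants (assumed here, not derived)
ALPHA = 1.6732632423543772
SCALE = 1.0507009873554805

def selu(z):
    # scale * ELU(z, alpha); np.minimum avoids overflow in the unused branch
    return SCALE * np.where(z > 0, z, ALPHA * np.expm1(np.minimum(z, 0.0)))

rng = np.random.default_rng(42)
n_units, n_layers = 100, 50                    # assumed width and depth
Z = rng.standard_normal((500, n_units))        # standardized inputs

for _ in range(n_layers):
    # LeCun normal initialization: variance 1 / fan_in
    W = rng.normal(0.0, np.sqrt(1.0 / n_units), size=(n_units, n_units))
    Z = selu(Z @ W)

final_mean = Z.mean()
final_std = Z.std()
print(f"after {n_layers} layers: mean {final_mean:.2f}, std {final_std:.2f}")
```

Even after many layers the activations should stay close to mean 0 and standard deviation 1, so the signal neither dies out nor blows up.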

> In general, SELU > ELU > leaky ReLU (and its variants) > ReLU > tanh > logistic.

> Because ReLU is the most used activation function, many libraries and hardware accelerators provide ReLU-specific optimizations; if speed is your priority, ReLU might be the best choice.

In [None]:
# On Fashion MNIST as an example (28x28 images, 10 classes)

# Leaky ReLU: add it as a separate layer right after each hidden Dense layer
model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),                  # from the textbook notebook
    keras.layers.Dense(300, kernel_initializer="he_normal"),     # from the textbook notebook
    keras.layers.LeakyReLU(alpha=0.2),                           # activation for the layer above
    keras.layers.Dense(100, kernel_initializer="he_normal"),
    keras.layers.LeakyReLU(alpha=0.2),
    keras.layers.Dense(10, activation="softmax")                 # output layer: class probabilities
])

# PReLU: same pattern, but alpha is learned during training
model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),                  # from the textbook notebook
    keras.layers.Dense(300, kernel_initializer="he_normal"),     # from the textbook notebook
    keras.layers.PReLU(),                                        # activation for the layer above
    keras.layers.Dense(100, kernel_initializer="he_normal"),
    keras.layers.PReLU(),
    keras.layers.Dense(10, activation="softmax")                 # output layer: class probabilities
])

# SELU: pair the activation with LeCun normal initialization
layer = keras.layers.Dense(10, activation="selu",
                           kernel_initializer="lecun_normal")

### 11.1.3 Batch Normalization

### 11.1.4 Gradient Clipping

## 11.2 Reusing Pretrained Layers

### 11.2.1 Transfer Learning with Keras

### 11.2.2 Unsupervised Pretraining

### 11.2.3 Pretraining on an Auxiliary Task

## 11.3 Faster Optimizers

### 11.3.1 Momentum Optimization

### 11.3.2 Nesterov Accelerated Gradient

### 11.3.3 AdaGrad

### 11.3.4 RMSProp

### 11.3.5 Adam and Nadam Optimization

### 11.3.6 Learning Rate Scheduling

## 11.4 Avoiding Overfitting Through Regularization

### 11.4.1 $\ell_1$ and $\ell_2$ Regularization

### 11.4.2 Dropout

### 11.4.3 Monte Carlo (MC) Dropout

### 11.4.4 Max-Norm Regularization

## 11.5 Summary and Practical Guidelines