In [40]:
import pandas as pd
import numpy as np
import tensorflow as tf

# Training Deep Neural Networks

Training a deep neural network isn't a walk in the park. Here are some of the problems you could run into: <br>
1. You may be faced with the tricky *vanishing gradients* problem or the related *exploding gradients* problem. This is when the gradients grow smaller and smaller, or larger and larger, when flowing backward through the DNN during training. Both of these problems make lower layers very hard to train. <br><br>
2. You might not have enough training data for such a large network, or it might be too costly to label. <br><br>
3. Training may be extremely slow. <br><br>
4. A model with millions of parameters would severely risk overfitting the training set, especially if there are not enough training instances or if they are too noisy.

## The Vanishing/Exploding Gradients Problems

Unfortunately, during backpropagation, gradients often get smaller and smaller as the algorithm progresses down to the lower layers. As a result, the Gradient Descent update leaves the lower layers' connection weights virtually unchanged, and the training never converges to a good solution. We call this the *vanishing gradients* problem. More generally, deep neural networks suffer from unstable gradients; different layers may learn at widely different speeds.

This unfortunate behavior was empirically observed long ago, and it was one of the reasons deep neural networks were mostly abandoned in the early 2000s, but some light was shed in a 2010 paper by Xavier Glorot and Yoshua Bengio. The authors found a few suspects, including the combination of the popular logistic sigmoid activation function and the weight initialization technique that was most popular at the time (i.e., a normal distribution with a mean of 0 and a standard deviation of 1).

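In short, they showed that with this activation function and this initialization scheme, the variance of the outputs of each layer is much greater than the variance of its inputs. Going forward in the network, the variance keeps increasing after each layer until the activation function saturates at the top layers, where its derivative is close to 0 and almost no gradient is left to propagate back.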
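To make this concrete, here is a minimal sketch in plain Python (standard library only, so it runs without TensorFlow). It pushes one standardized input vector through a single dense layer whose weights are drawn from a normal distribution with standard deviation 1, then compares the variance before and after the layer and counts how many sigmoid outputs saturate. The layer sizes, the seed, and the 0.01/0.99 saturation thresholds are arbitrary choices made for this illustration, not values from the paper.

```python
import math
import random
import statistics

def sigmoid(z):
    """Logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-z))

rng = random.Random(0)          # fixed seed so the run is repeatable
fan_in, fan_out = 100, 500      # illustrative layer sizes

# One standardized input vector (mean 0, standard deviation 1).
inputs = [rng.gauss(0.0, 1.0) for _ in range(fan_in)]

# Pre-activations of a dense layer (no bias) with weights drawn from N(0, 1).
pre_activations = [
    sum(rng.gauss(0.0, 1.0) * x for x in inputs)
    for _ in range(fan_out)
]

input_var = statistics.pvariance(inputs)
pre_activation_var = statistics.pvariance(pre_activations)

# Fraction of sigmoid outputs pinned near 0 or 1, where the gradient is tiny.
outputs = [sigmoid(z) for z in pre_activations]
saturated = sum(1 for a in outputs if a < 0.01 or a > 0.99) / fan_out

print(f"input variance:          {input_var:.2f}")
print(f"pre-activation variance: {pre_activation_var:.2f}")
print(f"saturated sigmoid units: {saturated:.0%}")
```

Because each unit sums `fan_in` independent terms, the pre-activation variance should come out on the order of `fan_in` times the input variance, which is what drives so many sigmoid units into saturation.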

### Glorot and He Initialization

In their paper, Glorot and Bengio propose that we need the signal to flow properly in both directions: in the forward direction when making predictions, and in the reverse direction when backpropagating gradients. We don't want the signal to die out, nor do we want it to explode and saturate. For the signal to flow properly, the authors argue that we need the variance of the outputs of each layer to be equal to the variance of its inputs, and we need the gradients to have equal variance before and after flowing through a layer in the reverse direction. It is actually not possible to guarantee both unless the layer has an equal number of inputs and neurons (these numbers are called the *fan-in* and *fan-out* of the layer), but Glorot and Bengio proposed a compromise that works well in practice: initialize each layer's connection weights randomly with a variance based on $fan_{avg} = (fan_{in} + fan_{out}) / 2$. This strategy is called *Xavier initialization* or *Glorot initialization*.

Here's an analogy: if you set a microphone amplifier's knob too close to zero, people won't hear your voice, but if you set it too close to the max, your voice will be saturated and people won't understand what you're saying. Now imagine a chain of such amplifiers: they all need to be set properly in order for your voice to come out loud and clear at the end of the chain. Your voice has to come out of each amplifier at the same amplitude as it came in.
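The chain-of-amplifiers picture can be checked numerically. The sketch below is a simplified experiment in plain Python, assuming linear activations, no biases, and layers with equal fan-in and fan-out (so $fan_{avg}$ equals the layer width). It compares the variance of the signal after a stack of layers when the weights have variance $1 / fan_{avg}$ versus a standard deviation of 1. The helper names, width, depth, and seed are hypothetical choices for this illustration.

```python
import math
import random
import statistics

def linear_layer(batch, fan_out, weight_std, rng):
    """Apply one dense layer (no bias, linear activation) with fresh random weights."""
    fan_in = len(batch[0])
    weights = [[rng.gauss(0.0, weight_std) for _ in range(fan_in)]
               for _ in range(fan_out)]
    return [[sum(w * x for w, x in zip(row, sample)) for row in weights]
            for sample in batch]

def variance_after_stack(n_layers, width, weight_std, rng, n_samples=50):
    """Variance of the signal after n_layers layers, starting from N(0, 1) inputs."""
    batch = [[rng.gauss(0.0, 1.0) for _ in range(width)] for _ in range(n_samples)]
    for _ in range(n_layers):
        batch = linear_layer(batch, width, weight_std, rng)
    return statistics.pvariance([v for sample in batch for v in sample])

rng = random.Random(0)
width, n_layers = 64, 5

# Glorot-style variance: 1 / fan_avg, and fan_avg == width here.
glorot_std = math.sqrt(1.0 / width)

var_glorot = variance_after_stack(n_layers, width, glorot_std, rng)
var_unit_std = variance_after_stack(n_layers, width, 1.0, rng)

print(f"variance after {n_layers} layers, Glorot-style init: {var_glorot:.3g}")
print(f"variance after {n_layers} layers, std-1 init:        {var_unit_std:.3g}")
```

With the Glorot-style scaling the variance should stay close to its starting value of 1, whereas with standard deviation 1 each layer multiplies it by roughly the layer width.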

Using Glorot initialization can speed up training considerably, and it is one of the tricks that led to the success of Deep Learning. By default, Keras uses Glorot initialization with a uniform distribution. When creating a layer, you can change this to He initialization by setting **kernel_initializer='he_uniform'** or **kernel_initializer='he_normal'**, as shown below.

If you want He initialization with a uniform distribution but based on $fan_{avg}$ rather than $fan_{in}$, you can use the VarianceScaling initializer instead.

In [35]:
init_helper = pd.DataFrame(
    columns=['Initialization', 'Activation Functions', r'$\sigma^{2}$ (Normal)'],
    data=np.array([
        ['Glorot', 'None, tanh, logistic, softmax', r'$\frac{1}{fan_{avg}}$'],
        ['He', 'ReLU and variants', r'$\frac{2}{fan_{in}}$'],
        ['LeCun', 'SELU', r'$\frac{1}{fan_{in}}$']
    ])
)

init_helper

| | Initialization | Activation Functions | $\sigma^{2}$ (Normal) |
|---|---|---|---|
| 0 | Glorot | None, tanh, logistic, softmax | $\frac{1}{fan_{avg}}$ |
| 1 | He | ReLU and variants | $\frac{2}{fan_{in}}$ |
| 2 | LeCun | SELU | $\frac{1}{fan_{in}}$ |


In [44]:
# He Normal
he_norm = tf.keras.layers.Dense(10, activation='relu', kernel_initializer='he_normal')

# He using fan avg
avg_init = tf.keras.initializers.VarianceScaling(
    scale=2,
    mode='fan_avg',
    distribution='uniform'
)
he_avg_init = tf.keras.layers.Dense(10, activation='relu', kernel_initializer=avg_init)
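The three schemes in the table differ only in a scale factor and in which fan count they divide by, which is also how `VarianceScaling` is parameterized. The sketch below is a plain-Python helper (the name `variance_scaling` and the `SCHEMES` dictionary are hypothetical, not Keras APIs) that computes the target standard deviation for a normal distribution and the matching limit for a uniform distribution on $[-r, r]$, using the fact that such a uniform distribution has variance $r^2 / 3$. Keras's normal variants draw from a truncated normal, which this sketch ignores.

```python
import math

def variance_scaling(scale, mode, fan_in, fan_out):
    """Return (stddev, uniform_limit) for a target variance of scale / fan."""
    fans = {
        "fan_in": fan_in,
        "fan_out": fan_out,
        "fan_avg": (fan_in + fan_out) / 2,
    }
    variance = scale / fans[mode]
    stddev = math.sqrt(variance)
    uniform_limit = math.sqrt(3.0 * variance)  # U(-r, r) has variance r**2 / 3
    return stddev, uniform_limit

# (scale, mode) pairs matching the table above.
SCHEMES = {
    "glorot": (1.0, "fan_avg"),
    "he": (2.0, "fan_in"),
    "lecun": (1.0, "fan_in"),
}

fan_in, fan_out = 300, 100  # illustrative layer shape
for name, (scale, mode) in SCHEMES.items():
    stddev, limit = variance_scaling(scale, mode, fan_in, fan_out)
    print(f"{name:>6}: stddev={stddev:.4f}, uniform limit={limit:.4f}")
```

Under this parameterization, the `scale=2`, `mode='fan_avg'` initializer defined above corresponds to a target variance of $2 / fan_{avg}$.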

### Nonsaturating Activation Functions

One of the insights in the 2010 paper by Glorot and Bengio was that the problems with unstable gradients were in part due to a poor choice of activation functions. It turns out that other activation functions behave much better in deep neural networks, in particular the ReLU activation function, which does not saturate for positive values and is fast to compute.

Unfortunately, the ReLU activation function is not perfect. It suffers from a problem known as *dying ReLUs*: during training, some neurons effectively "die", meaning they stop outputting anything other than 0. To solve this problem, you may want to use a variant of the ReLU function such as the *leaky ReLU*, defined as $LeakyReLU_{\alpha}(z) = \max(\alpha z, z)$. The hyperparameter $\alpha$ defines how much the function "leaks": it is the slope of the function for z < 0 and is typically set to 0.01.
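As a minimal sketch of why the leak matters, the plain-Python functions below (names are illustrative) implement the leaky ReLU and its derivative for a scalar input. Setting `alpha=0.0` recovers the strict ReLU, whose gradient is exactly 0 for negative inputs, so a neuron stuck there receives no update; any positive `alpha` keeps a small gradient alive.

```python
def leaky_relu(z, alpha=0.01):
    """Return z for z >= 0 and alpha * z otherwise."""
    return z if z >= 0 else alpha * z

def leaky_relu_grad(z, alpha=0.01):
    """Slope of leaky_relu at z (taking the right-hand slope at z == 0)."""
    return 1.0 if z >= 0 else alpha

for z in (-2.0, 3.0):
    print(f"z={z:+.1f}  ReLU grad={leaky_relu_grad(z, alpha=0.0)}  "
          f"leaky grad={leaky_relu_grad(z, alpha=0.01)}")
```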

A 2015 paper compared several variants of the ReLU activation function, and one of its conclusions was that the leaky variants always outperformed the strict ReLU activation function. In fact, setting $\alpha$ = 0.2 (a huge leak) seemed to result in better performance than $\alpha$ = 0.01 (a small leak).

*Randomized leaky ReLU (RReLU)*, where $\alpha$ is picked randomly in a given range during training and fixed to an average value during testing, performed fairly well and seemed to act as a regularizer. Finally, the paper evaluated the *parametric leaky ReLU (PReLU)*, where $\alpha$ is authorized to be learned during training (instead of being a hyperparameter, it becomes a parameter that can be modified by backpropagation like any other parameter). **PReLU was reported to strongly outperform ReLU on large image datasets, but on smaller datasets it runs the risk of overfitting the training set.**
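To illustrate what it means for $\alpha$ to be learned, here is a small hand-worked sketch in plain Python: a single PReLU unit, a squared-error loss, and one gradient-descent step on $\alpha$. The function names, the input, the target, the starting slope, and the learning rate are all hypothetical values chosen so the arithmetic is easy to follow; this is not how Keras implements the layer internally.

```python
def prelu(z, alpha):
    """PReLU forward pass: identity for z >= 0, slope alpha for z < 0."""
    return z if z >= 0 else alpha * z

def prelu_grad_alpha(z):
    """Derivative of prelu(z, alpha) with respect to alpha."""
    return 0.0 if z >= 0 else z

# One gradient-descent step on the loss 0.5 * (output - target) ** 2.
z, target = -2.0, -1.0            # hypothetical pre-activation and target
alpha, learning_rate = 0.25, 0.1  # hypothetical starting slope and step size

error = prelu(z, alpha) - target           # -0.5 - (-1.0) = 0.5
grad = error * prelu_grad_alpha(z)         # chain rule: 0.5 * (-2.0) = -1.0
new_alpha = alpha - learning_rate * grad   # 0.25 + 0.1 = 0.35

print(f"alpha: {alpha} -> {new_alpha:.2f}")
print(f"output: {prelu(z, alpha)} -> {prelu(z, new_alpha):.2f} (target {target})")
```

Only negative inputs contribute to the gradient of $\alpha$, since the slope has no effect on the output when z is positive.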

Last but not least, **a 2015 paper proposed a new activation function called the *exponential linear unit (ELU)* that outperformed all the ReLU variants** in the authors' experiments: training time was reduced and the neural network performed better on the test set.

The ELU activation function looks a lot like the ReLU function, with a few major differences: <br>
1. It takes on negative values when z < 0, which allows the unit to have an average output closer to 0 and helps alleviate the vanishing gradients problem. The hyperparameter $\alpha$ defines the value that the ELU function approaches when z is a large negative number. It is usually set to 1, but you can tweak it like any other hyperparameter. <br><br>
2. It has a nonzero gradient for z < 0, which avoids the dead neurons problem. <br><br>
3. If $\alpha$ is equal to 1 then the function is smooth everywhere, including around z = 0, which helps speed up Gradient Descent since it does not bounce as much to the left and right of z = 0.
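The three properties above follow directly from the ELU definition, $ELU_{\alpha}(z) = \alpha(e^{z} - 1)$ for z < 0 and $z$ otherwise. The sketch below is a scalar plain-Python version (function names are illustrative) that lets you check each one: the output approaches $-\alpha$ for large negative z, the gradient is nonzero for z < 0, and the left-hand slope at 0 is $\alpha$, so it only matches the right-hand slope of 1 when $\alpha$ = 1.

```python
import math

def elu(z, alpha=1.0):
    """ELU: z for z >= 0, alpha * (exp(z) - 1) for z < 0."""
    return z if z >= 0 else alpha * math.expm1(z)

def elu_grad(z, alpha=1.0):
    """Derivative of ELU: 1 for z >= 0, alpha * exp(z) for z < 0."""
    return 1.0 if z >= 0 else alpha * math.exp(z)

print("elu(-50)            =", elu(-50.0))             # approaches -alpha
print("grad at z = -5      =", elu_grad(-5.0))         # small but nonzero
print("left slope, alpha=1 =", elu_grad(-1e-9))        # matches the right slope of 1
print("left slope, alpha=2 =", elu_grad(-1e-9, 2.0))   # kink at z = 0
```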

**The main drawback of the ELU activation function is that it is slower to compute than the ReLU function and its variants** (due to the use of the exponential function). Its faster convergence rate during training compensates for that slow computation, but still, at test time an ELU network will be slower than a ReLU network.

Then, a 2017 paper introduced the *Scaled ELU (SELU)* activation function: as its name suggests, it is a scaled variant of the ELU activation function. The authors showed that if you build a neural network composed exclusively of a stack of dense layers, and if all hidden layers use the SELU activation function, then the network will *self-normalize*: the output of each layer will tend to preserve a mean of 0 and a standard deviation of 1 during training, which solves the vanishing/exploding gradients problem. There are, however, a few conditions for self-normalization to happen: <br>
1. The input features must be standardized (mean 0 and standard deviation 1). <br><br>
2. Every hidden layer's weights must be initialized with LeCun normal initialization. In Keras, this means setting kernel_initializer='lecun_normal'.<br><br>
3. The network's architecture must be sequential. **Unfortunately, if you try to use SELU in nonsequential architectures, such as recurrent networks or networks with skip connections (i.e. connections that skip layers, such as in Wide & Deep nets), self-normalization will not be guaranteed, so SELU will not necessarily outperform other activation functions.**<br><br>
4. The paper only guarantees self-normalization if all layers are dense, but some researchers have noted that the SELU activation function can improve performance in convolutional neural nets as well.
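Self-normalization can be observed in a small experiment. The sketch below, in plain Python, feeds standardized random inputs through a stack of dense SELU layers (no biases) whose weights are drawn from a normal distribution with variance $1 / fan_{in}$, then reports the mean and standard deviation of the final activations. The two constants are the SELU $\alpha$ and scale values from the 2017 paper; the width, depth, batch size, seed, and helper names are arbitrary choices for this illustration, and a plain normal is used where Keras's `lecun_normal` uses a truncated one.

```python
import math
import random
import statistics

SELU_ALPHA = 1.6732632423543772
SELU_SCALE = 1.0507009873554805

def selu(z):
    """Scaled ELU: scale * z for z >= 0, scale * alpha * (exp(z) - 1) otherwise."""
    return SELU_SCALE * (z if z >= 0 else SELU_ALPHA * math.expm1(z))

def selu_dense_stack(batch, n_layers, rng):
    """Pass a batch through n_layers square dense SELU layers with fresh weights."""
    for _ in range(n_layers):
        fan_in = len(batch[0])
        std = math.sqrt(1.0 / fan_in)  # LeCun-style variance: 1 / fan_in
        weights = [[rng.gauss(0.0, std) for _ in range(fan_in)]
                   for _ in range(fan_in)]
        batch = [[selu(sum(w * x for w, x in zip(row, sample))) for row in weights]
                 for sample in batch]
    return batch

rng = random.Random(42)
width, n_samples = 50, 100

# Standardized inputs: mean 0, standard deviation 1.
inputs = [[rng.gauss(0.0, 1.0) for _ in range(width)] for _ in range(n_samples)]
outputs = selu_dense_stack(inputs, n_layers=10, rng=rng)

flat = [v for sample in outputs for v in sample]
out_mean = statistics.mean(flat)
out_std = statistics.pstdev(flat)
print(f"after 10 SELU layers: mean={out_mean:.3f}, std={out_std:.3f}")
```

If the conditions listed above hold, the reported mean should stay near 0 and the standard deviation near 1, up to the noise you would expect from such narrow layers.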

***So, which activation should you use for the hidden layers of your deep neural networks?*** Although your mileage will vary, in general **SELU > ELU > leaky ReLU (and its variants) > ReLU > tanh > logistic**. 

If the network's architecture prevents it from self-normalizing, then ELU may perform better than SELU (since SELU is not smooth at z = 0). If you care a lot about runtime latency, then you may prefer leaky ReLU. If you don't want to tweak yet another hyperparameter, you may use the default $\alpha$ values used by Keras (e.g., 0.3 for leaky ReLU). If you have spare time and computing power, you can use cross-validation to evaluate other activation functions, such as RReLU if your network is overfitting or PReLU if you have a huge training set. **That said, because ReLU is the most used activation function (by far), many libraries and hardware accelerators provide ReLU-specific optimizations; therefore, if speed is your priority, ReLU might still be the best choice.**

In [55]:
# Example of a model using LeakyReLU
model = tf.keras.Sequential([
    tf.keras.layers.Dense(10, kernel_initializer='he_normal'),
    tf.keras.layers.LeakyReLU(alpha=0.2),
    tf.keras.layers.Dense(1)
])

# Example of a model using PReLU
model = tf.keras.Sequential([
    tf.keras.layers.Dense(10, kernel_initializer='he_normal'),
    tf.keras.layers.PReLU(),
    tf.keras.layers.Dense(1)
])

# Example of a model using SELU
model = tf.keras.Sequential([
    tf.keras.layers.Dense(10, kernel_initializer='lecun_normal', activation='selu'),
    tf.keras.layers.Dense(1)
])

### Batch Normalization