# Training Deep Neural Networks

Training deep networks can entail a wide range of problems, from overfitting to vanishing/exploding gradients. The latter issue is especially troublesome because networks are trained by gradient descent, where the gradient is computed automatically by the backpropagation algorithm. If the gradient gets smaller and smaller as it propagates backwards through the network, the weights of the lower layers are barely updated and training may never converge. In the opposite case, if the gradient gets bigger and bigger, the weight updates explode and training diverges.
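
As a rough illustration of both failure modes, the following NumPy sketch pushes a gradient vector backwards through a stack of purely linear layers and records its norm after each one. It is a simplification: activation derivatives are ignored, and the function name `backprop_gradient_norms`, the layer width and the two weight scales are assumptions chosen only for illustration.

```python
import numpy as np

def backprop_gradient_norms(weight_scale, n_layers=20, width=50, seed=0):
    """Norm of a gradient vector after each step back through linear layers."""
    rng = np.random.default_rng(seed)
    grad = np.ones(width)  # stand-in for the gradient at the output layer
    norms = []
    for _ in range(n_layers):
        # Random weights of one layer; backprop multiplies by the transpose
        W = rng.normal(0.0, weight_scale, size=(width, width))
        grad = W.T @ grad
        norms.append(np.linalg.norm(grad))
    return norms

small = backprop_gradient_norms(weight_scale=0.05)  # gradient shrinks layer by layer
large = backprop_gradient_norms(weight_scale=0.5)   # gradient grows layer by layer
```

With the small scale the norm shrinks by a roughly constant factor at every layer, and with the large scale it grows by a roughly constant factor. In a real network the activation derivatives add to this effect.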

In [1]:
import keras
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
from keras.models import Sequential
from keras.layers import Flatten, Dense

Using TensorFlow backend.


### Weight Initialization
Keras defaults to **Glorot initialization** with a uniform distribution, where the bounds of the distribution depend on the number of inputs and outputs of the layer (fan-in and fan-out). The same scheme is also available with a normal distribution. This type of initialization works best with layers that have no activation (linear) or that use activation functions such as tanh, sigmoid and softmax.

In [3]:
Dense(10, kernel_initializer = 'glorot_normal')
Dense(10, kernel_initializer = 'glorot_uniform')

<keras.layers.core.Dense at 0x7fd093327550>
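
To make the dependence on layer size concrete, here is a minimal NumPy sketch of the commonly cited Glorot formulas, where `fan_in` and `fan_out` are the number of inputs and outputs of the layer. The helper names and the 784-to-10 layer shape are illustrative assumptions, not Keras internals.

```python
import numpy as np

def glorot_uniform_limit(fan_in, fan_out):
    """Weights are drawn from U(-limit, limit)."""
    return np.sqrt(6.0 / (fan_in + fan_out))

def glorot_normal_stddev(fan_in, fan_out):
    """Target standard deviation of the normal variant."""
    return np.sqrt(2.0 / (fan_in + fan_out))

fan_in, fan_out = 784, 10  # hypothetical layer shape
limit = glorot_uniform_limit(fan_in, fan_out)

rng = np.random.default_rng(0)
W = rng.uniform(-limit, limit, size=(fan_in, fan_out))

# A uniform distribution on (-limit, limit) has variance limit**2 / 3,
# which equals the variance targeted by the normal variant.
target_variance = glorot_normal_stddev(fan_in, fan_out) ** 2
```

Both variants therefore aim at the same weight variance, 2 / (fan_in + fan_out), and differ only in the shape of the distribution.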

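With ReLU and ReLU-derived activation functions it is often better to use **He initialization**, which differs in the scale of the variance: it depends only on the number of inputs and is larger, to compensate for the pre-activations that ReLU sets to zero.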

In [4]:
Dense(10, kernel_initializer = 'he_normal')
Dense(10, kernel_initializer = 'he_uniform')

<keras.layers.core.Dense at 0x7fd093327c10>
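
The effect of the variance scale can be checked with a small experiment. The sketch below feeds random data through a stack of ReLU layers and compares the spread of the final activations under a He-style standard deviation, sqrt(2 / fan_in), and a Glorot-style one, sqrt(2 / (fan_in + fan_out)). The function name `relu_forward_std` and the depth, width and sample count are assumptions made for this illustration.

```python
import numpy as np

def relu_forward_std(stddev_fn, n_layers=30, width=256, n_samples=500, seed=0):
    """Standard deviation of the activations after a stack of ReLU layers."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=(n_samples, width))
    for _ in range(n_layers):
        W = rng.normal(0.0, stddev_fn(width, width), size=(width, width))
        x = np.maximum(0.0, x @ W)  # dense layer followed by ReLU
    return x.std()

# He-style scale: depends on fan_in only
he_std = relu_forward_std(lambda fan_in, fan_out: np.sqrt(2.0 / fan_in))
# Glorot-style scale: depends on fan_in + fan_out
glorot_std = relu_forward_std(lambda fan_in, fan_out: np.sqrt(2.0 / (fan_in + fan_out)))
```

In this setup the He-style scale keeps the activations at a usable magnitude, while the Glorot-style scale roughly halves their second moment at every ReLU layer, so the signal fades with depth.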

Finally, with the SELU activation the preferred option is **LeCun initialization**.

In [5]:
Dense(10, kernel_initializer = 'lecun_normal')
Dense(10, kernel_initializer = 'lecun_uniform')

<keras.layers.core.Dense at 0x7fd093290110>
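
The three schemes differ only in how the weight variance is scaled. As a compact summary, the sketch below computes the target standard deviation of the normal variants and the corresponding bound of the uniform variants. The helper names `init_stddev` and `uniform_limit` are hypothetical and are not part of Keras.

```python
import math

def init_stddev(scheme, fan_in, fan_out):
    """Target standard deviation of the normal variant of each scheme."""
    if scheme == "glorot":
        return math.sqrt(2.0 / (fan_in + fan_out))
    if scheme == "he":
        return math.sqrt(2.0 / fan_in)
    if scheme == "lecun":
        return math.sqrt(1.0 / fan_in)
    raise ValueError(f"unknown scheme: {scheme}")

def uniform_limit(scheme, fan_in, fan_out):
    """Bound of the uniform variant: U(-limit, limit) has variance limit**2 / 3."""
    return math.sqrt(3.0) * init_stddev(scheme, fan_in, fan_out)

# Compare the three schemes on a square layer (hypothetical size)
stddevs = {name: init_stddev(name, 100, 100) for name in ("glorot", "he", "lecun")}
```

Note that when a layer has as many inputs as outputs, Glorot and LeCun initialization coincide, and He initialization uses twice their variance.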

### Activation Functions
In the past the most common activation function was the S-shaped **sigmoid**. This can be a poor choice in many scenarios because of the vanishing gradient problem: when the sigmoid saturates, that is, when its output is close to 0 or 1, its gradient is close to zero, making convergence very slow or even impossible. The **tanh** function suffers from the same problem, but to a lesser extent: its output ranges from -1 to 1 and is centered on zero, and its gradient peaks at 1 instead of 0.25.

In [8]:
Dense(10, activation='sigmoid')
Dense(10, activation='tanh')

<keras.layers.core.Dense at 0x7fd093293050>
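
Saturation is easy to see by evaluating the derivatives directly. The following NumPy sketch uses the closed-form derivatives of both functions; the helper names and the sample points in `z` are assumptions chosen for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    """Derivative of the sigmoid: s * (1 - s), at most 0.25 (reached at z = 0)."""
    s = sigmoid(z)
    return s * (1.0 - s)

def tanh_grad(z):
    """Derivative of tanh: 1 - tanh(z)**2, at most 1 (reached at z = 0)."""
    return 1.0 - np.tanh(z) ** 2

z = np.array([0.0, 2.0, 5.0, 10.0])
sig_g = sigmoid_grad(z)   # decays towards zero as z grows
tanh_g = tanh_grad(z)     # starts higher, but also decays towards zero
```

Both derivatives vanish for large inputs, so neither function avoids saturation. The advantage of tanh lies in its larger gradient near zero and its zero-centered output.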

The **ReLU** function is the most common non-saturating activation function, both because it works very well in practice and because it is very cheap to compute. Its main drawback is that neurons using it can *die*: when the weighted sum of the inputs is negative for all instances in the training set, a ReLU neuron always outputs zero and its gradient is zero, so gradient descent can no longer update it.

In [9]:
Dense(10, activation='relu')

<keras.layers.core.Dense at 0x7fd093293510>
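
A dead neuron can be reproduced with a few lines of NumPy. In this sketch the inputs are non-negative and the weights and bias are hypothetical values that have been pushed negative, so the pre-activation is negative for every instance. The upstream gradient is assumed to be 1 for simplicity.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(1000, 5))  # non-negative training inputs

w = -np.ones(5)  # hypothetical weights after a bad update
b = -0.1         # hypothetical bias

z = X @ w + b                 # weighted sum: negative for every instance
output = np.maximum(0.0, z)   # ReLU output: always zero

# Gradient of the output with respect to the weights, averaged over the
# training set. The ReLU derivative is 1 where z > 0 and 0 elsewhere.
relu_derivative = (z > 0).astype(float)
grad_w = (relu_derivative[:, None] * X).mean(axis=0)
grad_b = relu_derivative.mean()
```

Since both `grad_w` and `grad_b` are exactly zero, no gradient descent step can move this neuron out of the dead region.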

Several variants have been proposed to address this issue. **Leaky ReLU** gives the function a small slope ($\alpha$) for negative values, so it never becomes flat and the gradient is never exactly zero. Even a very small slope is enough to give dead neurons a chance to wake up. The slope $\alpha$ can also be drawn randomly during training (RReLU) or treated as an additional parameter that is learned during training (PReLU).

In [10]:
Dense(10) # no activation here: LeakyReLU is applied as a separate layer
keras.layers.LeakyReLU(alpha=0.1) # default alpha value is 0.3

<keras.layers.advanced_activations.LeakyReLU at 0x7fd093293c10>
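
For reference, the function itself and its derivative fit in a few lines. This is a plain NumPy sketch rather than the Keras implementation, and the helper names are assumptions.

```python
import numpy as np

def leaky_relu(z, alpha=0.1):
    """Identity for positive inputs, slope alpha for negative inputs."""
    return np.where(z > 0, z, alpha * z)

def leaky_relu_grad(z, alpha=0.1):
    """Derivative: 1 for positive inputs, alpha otherwise (never zero if alpha > 0)."""
    return np.where(z > 0, 1.0, alpha)

z = np.array([-2.0, -0.5, 0.5, 3.0])
values = leaky_relu(z)
grads = leaky_relu_grad(z)
```

Because the derivative is `alpha` rather than zero on the negative side, a neuron whose weighted sum is negative for all instances still receives a small gradient.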

A more interesting variant is the **Exponential Linear Unit (ELU)**, which replaces the flat part on the left with an exponential function. Thus, ELU can take negative values and has a non-zero gradient everywhere. Furthermore, with $\alpha = 1$ it is smooth, meaning that it doesn't have the typical kink of ReLU functions at zero. While ELU is slower to compute than ReLU, it often makes convergence faster thanks to these properties.

In [11]:
Dense(10, activation='elu')

<keras.layers.core.Dense at 0x7fd09329d050>
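
The shape of the function can be sketched in NumPy as follows, assuming the usual definition $\alpha(e^z - 1)$ for negative inputs. The helper names are assumptions, and `np.minimum` is used only to avoid overflow warnings for large positive inputs.

```python
import numpy as np

def elu(z, alpha=1.0):
    """z for positive inputs, alpha * (exp(z) - 1) otherwise."""
    return np.where(z > 0, z, alpha * np.expm1(np.minimum(z, 0.0)))

def elu_grad(z, alpha=1.0):
    """Derivative: 1 for positive inputs, alpha * exp(z) otherwise."""
    return np.where(z > 0, 1.0, alpha * np.exp(np.minimum(z, 0.0)))

z = np.array([-50.0, -1.0, 0.0, 3.0])
values = elu(z)      # approaches -alpha for very negative inputs
grads = elu_grad(z)  # positive everywhere, equal to alpha at z = 0
```

With `alpha=1.0` the derivative at zero is 1 from both sides, which is why the kink disappears. For other values of `alpha` the function stays continuous but its slope jumps at zero.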

When its application is feasible, however, the best choice is often the **Scaled ELU (SELU)** function. In a network whose hidden layers all use SELU, the output of every layer tends to preserve a zero mean and a unit standard deviation, which counters the vanishing/exploding gradient problem. This **self-normalizing** property is guaranteed only when the following conditions are met:
* Inputs should be standardized (not min-max scaled!)
* Hidden layers should use LeCun initialization with normal distribution
* Network should be sequential
* All layers should be dense

In [12]:
Dense(10, activation='selu', kernel_initializer = 'lecun_normal')

<keras.layers.core.Dense at 0x7fd09329d7d0>
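
Self-normalization can be observed with a small NumPy experiment that follows the conditions listed above: standardized inputs, dense layers only, and LeCun normal weights. The two constants are the standard SELU parameters; the depth, width and sample count are assumptions chosen for this sketch, and biases are omitted.

```python
import numpy as np

# Standard SELU constants
SELU_ALPHA = 1.6732632423543772
SELU_SCALE = 1.0507009873554805

def selu(z):
    """Scaled ELU: scale * z for positive inputs, scale * alpha * (exp(z) - 1) otherwise."""
    return SELU_SCALE * np.where(z > 0, z, SELU_ALPHA * np.expm1(np.minimum(z, 0.0)))

rng = np.random.default_rng(42)
width = 256
x = rng.normal(size=(1000, width))  # standardized inputs: mean 0, std 1

for _ in range(50):
    # LeCun normal initialization: stddev = sqrt(1 / fan_in)
    W = rng.normal(0.0, np.sqrt(1.0 / width), size=(width, width))
    x = selu(x @ W)

final_mean, final_std = x.mean(), x.std()
```

After 50 layers the activations should still have a mean close to 0 and a standard deviation close to 1. Replacing `selu` with ReLU or changing the weight scale in this loop makes the statistics drift.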

### Batch Normalization

### Optimizers

### Regularization