# Training Deep Neural Networks
This notebook presents problems encountered when training deep neural networks and some of the
techniques that may be used to solve them.

## Index

[Glorot and He Initialization](#Glorot-and-He-Initialization)

[Nonsaturating Activation Functions](#Nonsaturating-Activation-Functions)

[Batch Normalization](#Batch-Normalization)

[Faster Optimizers](#Faster-Optimizers)

[Learning Rate Scheduling](#Learning-Rate-Scheduling)

[Avoiding Overfitting Through Regularization](#Avoiding-Overfitting-Through-Regularization)

## Glorot and He Initialization
Glorot and Bengio proposed a way to alleviate the unstable gradient problem. The signal needs to
flow properly in both directions: forward when making predictions, and in reverse when
backpropagating gradients.

By default, Keras uses Glorot initialization with a uniform distribution. When creating a
layer, this can be changed to He initialization by setting
```kernel_initializer="he_uniform"``` or ```kernel_initializer="he_normal"```.

In [4]:
import tensorflow as tf
from tensorflow import keras

# He normal
keras.layers.Dense(10, activation='relu', kernel_initializer='he_normal')

<tensorflow.python.keras.layers.core.Dense at 0x1d566573e88>

If He initialization is to be used with a uniform distribution based on $fan_{avg}$ rather
than $fan_{in}$, the ```VarianceScaling``` initializer should be used.

In [5]:
# VarianceScaling
he_avg_init = keras.initializers.VarianceScaling(scale=2.0, mode='fan_avg', distribution='uniform')
keras.layers.Dense(10, activation='sigmoid', kernel_initializer=he_avg_init)


<tensorflow.python.keras.layers.core.Dense at 0x1d56dbcc388>
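
The scaling rules behind these initializers can be written down directly. The sketch below is plain Python rather than the Keras implementation, and the helper names ```glorot_uniform_limit``` and ```he_normal_stddev``` are made up for illustration. It assumes the usual definitions: Glorot uniform draws weights from $[-r, r]$ with $r = \sqrt{3 / fan_{avg}}$, and He normal targets a variance of $2 / fan_{in}$.

```python
import math

def glorot_uniform_limit(fan_in, fan_out):
    """Limit r of the uniform range [-r, r] for Glorot initialization."""
    fan_avg = (fan_in + fan_out) / 2
    return math.sqrt(3 / fan_avg)

def he_normal_stddev(fan_in):
    """Target standard deviation for He initialization (variance 2 / fan_in)."""
    return math.sqrt(2 / fan_in)

# Example: a layer with 784 inputs and 300 outputs
print(glorot_uniform_limit(784, 300))
print(he_normal_stddev(784))
```

Note that He initialization depends only on $fan_{in}$, which is why ```VarianceScaling``` is needed to base it on $fan_{avg}$ instead.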

## Nonsaturating Activation Functions
*Leaky ReLU* can be used to prevent dying neurons. A neuron dies when its weights get
tweaked in such a way that the weighted sum of its inputs is negative for all instances in the
training set. When this happens, it keeps outputting zeros and Gradient Descent no longer affects
it, because the gradient of ReLU is zero when its input is negative. Leaky ReLU fixes this: the
activation has a small slope for x < 0, which ensures that the neuron never dies.

To use the leaky ReLU activation function, add a ```LeakyReLU``` layer to the model right after
the layer it should be applied to. Other activation functions are shown below: the *Parametric
Leaky ReLU (PReLU)*, a variant of leaky ReLU, and the *Scaled ELU (SELU)*, a variant of the
*Exponential Linear Unit (ELU)*.

In [6]:
# Leaky ReLU
model = keras.models.Sequential([
    keras.layers.Dense(10, kernel_initializer='he_normal'),
    keras.layers.LeakyReLU(alpha=0.2),
])

In [7]:
# Parametric Leaky ReLU
model = keras.models.Sequential([
    keras.layers.Dense(10, kernel_initializer='he_normal'),
    keras.layers.PReLU(),
])

In [8]:
# SELU
model = keras.models.Sequential([
    keras.layers.Dense(10, activation='selu', kernel_initializer='lecun_normal')
])
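
To make the difference between these activations concrete, here is a minimal scalar sketch in plain Python. The function names are illustrative and not part of Keras. The SELU constants are the commonly quoted values of its scale and alpha, and the leaky ReLU slope of 0.2 simply mirrors the ```alpha=0.2``` used above.

```python
import math

# Commonly quoted SELU constants
SELU_ALPHA = 1.6732632423543772
SELU_SCALE = 1.0507009873554805

def leaky_relu(z, alpha=0.2):
    """Identity for z > 0, small slope alpha for z <= 0."""
    return z if z > 0 else alpha * z

def elu(z, alpha=1.0):
    """Identity for z > 0, smooth saturation toward -alpha for z <= 0."""
    return z if z > 0 else alpha * (math.exp(z) - 1)

def selu(z):
    """ELU with a fixed alpha, multiplied by a fixed scale."""
    return SELU_SCALE * elu(z, SELU_ALPHA)

for z in (-2.0, 0.0, 2.0):
    print(z, leaky_relu(z), elu(z), selu(z))
```

Unlike ReLU, none of these functions is flat for negative inputs, so the gradient never vanishes entirely there.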

## Batch Normalization
Batch normalization is a technique for training very deep neural networks that standardizes the
inputs to a layer for each mini-batch. This has the effect of stabilizing the learning process
and dramatically reducing the number of training epochs required to train deep networks.

The technique adds an operation to the model before or after the activation function of each
hidden layer. The operation zero-centers and normalizes each input, then scales and shifts the
result using two new parameter vectors per layer: one for scaling and the other for shifting. The
operation lets the model learn the optimal scale and mean of each of the layer's inputs.

A ```BatchNormalization``` layer should be added before or after each hidden layer's activation
function, and optionally before the first layer in the model.

In [9]:
# Batch Normalization
model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(300, activation='elu', kernel_initializer='he_normal'),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(100, activation='elu', kernel_initializer='he_normal'),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(10, activation='softmax')
])

In [10]:
# Model summary
model.summary()

Model: "sequential_3"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
flatten (Flatten)            (None, 784)               0         
_________________________________________________________________
batch_normalization (BatchNo (None, 784)               3136      
_________________________________________________________________
dense_5 (Dense)              (None, 300)               235500    
_________________________________________________________________
batch_normalization_1 (Batch (None, 300)               1200      
_________________________________________________________________
dense_6 (Dense)              (None, 100)               30100     
_________________________________________________________________
batch_normalization_2 (Batch (None, 100)               400       
_________________________________________________________________
dense_7 (Dense)              (None, 10)               

Batch Normalization creates four parameters $\gamma, \beta, \mu$ and $\sigma$.
- $\gamma$ is the output scale parameter vector for the layer (it contains one scale per input)
- $\beta$ is the output shift (offset) parameter vector for the layer (it contains one offset
parameter per input). Each input is offset by its corresponding shift parameter
- $\mu$ is the vector of input means, evaluated over the whole mini-batch (contains one mean per
input)
- $\sigma$ is the vector of input standard deviations, also evaluated over the whole mini-batch
(it contains one standard deviation per input)
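
The transformation described by these four vectors can be sketched for a single input feature. This is a plain-Python illustration, not the Keras implementation: the function name ```batch_norm_train``` is hypothetical, and the smoothing term ```eps``` (assumed here to be 0.001) is added under the square root to avoid division by zero.

```python
import math

def batch_norm_train(batch, gamma, beta, eps=1e-3):
    """Normalize one feature over a mini-batch, then scale and shift it."""
    mu = sum(batch) / len(batch)                          # mini-batch mean
    var = sum((x - mu) ** 2 for x in batch) / len(batch)  # mini-batch variance
    x_hat = [(x - mu) / math.sqrt(var + eps) for x in batch]
    outputs = [gamma * xh + beta for xh in x_hat]
    return outputs, mu, var

outputs, mu, var = batch_norm_train([1.0, 2.0, 3.0, 4.0], gamma=2.0, beta=0.5)
print(outputs, mu, var)
```

After the transformation, the outputs have mean $\beta$ and a standard deviation close to $\gamma$, whatever the scale of the original inputs.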

We can take a look at the parameters of the first BN layer. Two are trainable (by
backpropagation), and two are not.

In [11]:
# Trainable
[(var.name, var.trainable) for var in model.layers[1].variables]


[('batch_normalization/gamma:0', True),
 ('batch_normalization/beta:0', True),
 ('batch_normalization/moving_mean:0', False),
 ('batch_normalization/moving_variance:0', False)]

Depending on the task it might be best to add the BN layers before the activation functions,
rather than after. To do this the activation function must be removed from the hidden layers and
then added separately after the BN layers. Since BN includes one offset parameter per input, the
bias term should be removed from the previous layer (passing ```use_bias=False```)


In [12]:
# BN
model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(300, kernel_initializer='he_normal', use_bias=False),
    keras.layers.BatchNormalization(),
    keras.layers.Activation('elu'),
    keras.layers.Dense(300, kernel_initializer='he_normal', use_bias=False),
    keras.layers.BatchNormalization(),
    keras.layers.Activation('elu'),
    keras.layers.Dense(10, activation='softmax')
])

## Faster Optimizers
Training a large deep neural network can be very slow. Several techniques help speed it up:
applying a good initialization strategy for the connection weights, using a good activation
function, using Batch Normalization, and reusing parts of a different neural network trained on
a similar task.

Another huge speed boost comes from using a faster optimizer than regular Gradient Descent.

### Momentum Optimization
*Momentum optimization*, or SGD with momentum, is a method that helps accelerate the
gradient vectors in the right direction, leading to faster convergence.

In [13]:
# Momentum Optimization
optimizer = keras.optimizers.SGD(learning_rate=0.001, momentum=0.9)
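
As a rough illustration of the update rule, the sketch below applies momentum to a toy one-dimensional cost $J(\theta) = \theta^2$. It assumes the standard formulation $m \leftarrow \beta m - \eta \nabla J(\theta)$, $\theta \leftarrow \theta + m$. The helper ```momentum_step``` is hypothetical, and the learning rate of 0.01 is chosen only to make the toy problem converge quickly.

```python
def momentum_step(theta, m, grad, lr=0.001, beta=0.9):
    """One momentum update: m <- beta*m - lr*grad, then theta <- theta + m."""
    m = beta * m - lr * grad
    return theta + m, m

# Minimize J(theta) = theta**2, whose gradient is 2*theta
theta, m = 5.0, 0.0
for _ in range(1000):
    theta, m = momentum_step(theta, m, grad=2 * theta, lr=0.01)
print(theta)
```

The gradient changes the velocity ```m``` rather than the position directly, so consecutive gradients that point the same way build up speed.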


### Nesterov Accelerated Gradient
The *Nesterov Accelerated Gradient (NAG)* measures the gradient of the cost function not at the
local position $\theta$ but slightly ahead in the direction of the momentum, at $\theta + \beta m$.
This small tweak works because the momentum vector generally points in the right direction
(toward the optimum), so the gradient measured slightly ahead is a bit more accurate than the
gradient at the original position.



In [14]:
# NAG
optimizer = keras.optimizers.SGD(learning_rate=0.001, momentum=0.9, nesterov=True)
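
The look-ahead can be sketched on a toy one-dimensional cost $J(\theta) = \theta^2$. The helper names ```nag_step``` and ```grad_fn``` are hypothetical, and the code only illustrates the idea of evaluating the gradient at $\theta + \beta m$; it is not how Keras implements ```nesterov=True``` internally.

```python
def grad_fn(theta):
    """Gradient of the toy cost J(theta) = theta**2."""
    return 2 * theta

def nag_step(theta, m, grad_fn, lr=0.001, beta=0.9):
    """One NAG update: the gradient is measured at the look-ahead point."""
    lookahead = theta + beta * m
    m = beta * m - lr * grad_fn(lookahead)
    return theta + m, m

theta, m = 5.0, 0.0
for _ in range(1000):
    theta, m = nag_step(theta, m, grad_fn, lr=0.01)
print(theta)
```

The only difference from plain momentum is the point at which the gradient is evaluated.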


### RMSProp
*RMSProp* accumulates only the gradients from the most recent iterations (as opposed to all the
gradients since the beginning of training). It does so by applying exponential decay when
accumulating the squared gradients, which is the first step of the update.

The decay rate $\rho$ (rho) is a hyperparameter so it can be tuned. The default value of 0.9
tends to work well, however.

In [15]:
# RMSProp
optimizer = keras.optimizers.RMSprop(learning_rate=0.001, rho=0.9)
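
A minimal scalar sketch of the two steps follows, assuming the textbook form $s \leftarrow \rho s + (1 - \rho) g^2$, then $\theta \leftarrow \theta - \eta\, g / \sqrt{s + \varepsilon}$. The helper ```rmsprop_step``` and the value of ```eps``` are assumptions for illustration; library implementations may place the smoothing term slightly differently.

```python
import math

def rmsprop_step(theta, s, grad, lr=0.001, rho=0.9, eps=1e-7):
    """One RMSProp update for a single parameter."""
    s = rho * s + (1 - rho) * grad ** 2             # step 1: decayed accumulation
    theta = theta - lr * grad / math.sqrt(s + eps)  # step 2: scaled descent
    return theta, s

theta, s = rmsprop_step(theta=1.0, s=0.0, grad=2.0)
print(theta, s)
```

Because old squared gradients decay away, the effective step size does not keep shrinking over the whole of training.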

### Adam and Nadam Optimization
*Adam* or *adaptive moment estimation* combines the idea of momentum optimization and RMSProp:
just like momentum, it keeps track of an exponentially decaying average of past gradients; and
like RMSProp, it keeps track of an exponentially decaying average of past squared gradients.

Adam is an *adaptive learning rate* algorithm, so it requires less tuning of the learning rate
hyperparameter.

In [16]:
# Adam
optimizer = keras.optimizers.Adam(learning_rate=0.001, beta_1=0.9, beta_2=0.999)
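
The two running averages can be sketched for a single parameter as follows. This uses a common formulation of Adam with bias correction; the helper ```adam_step``` and the value of ```eps``` are assumptions for illustration, not the Keras internals.

```python
import math

def adam_step(theta, m, s, grad, t, lr=0.001, beta_1=0.9, beta_2=0.999, eps=1e-7):
    """One Adam update for a single parameter at iteration t (starting at 1)."""
    m = beta_1 * m + (1 - beta_1) * grad       # decaying average of gradients
    s = beta_2 * s + (1 - beta_2) * grad ** 2  # decaying average of squared gradients
    m_hat = m / (1 - beta_1 ** t)              # bias correction: m and s start at 0
    s_hat = s / (1 - beta_2 ** t)
    theta = theta - lr * m_hat / (math.sqrt(s_hat) + eps)
    return theta, m, s

theta, m, s = adam_step(theta=1.0, m=0.0, s=0.0, grad=2.0, t=1)
print(theta, m, s)
```

On the first iteration the corrected averages reduce to the gradient and its square, so the step has a size of roughly ```lr``` regardless of the gradient's magnitude.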

## Learning Rate Scheduling
Finding a good learning rate is essential. If it is set too high, training may diverge. If it is
set too low, training will eventually converge to the optimum, but it will take a very long time.

A constant learning rate is the simplest choice, but it can be improved upon considerably.
For example, if you start with a large learning rate and then reduce it once training stops
making fast progress, you can reach a good solution faster than with the optimal constant
learning rate. It can also be beneficial to start with a low learning rate, increase it, then
drop it again.

In [17]:
# Power scheduling
optimizer = keras.optimizers.SGD(learning_rate=0.01, decay=1e-4)
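
To see what this schedule does, the sketch below assumes that the ```decay``` argument applies inverse-time decay per training step, $\eta(t) = \eta_0 / (1 + \text{decay} \cdot t)$, as in the legacy Keras optimizers. The helper ```power_schedule``` is hypothetical.

```python
def power_schedule(step, lr0=0.01, decay=1e-4):
    """Learning rate after `step` training steps under inverse-time decay."""
    return lr0 / (1 + decay * step)

for step in (0, 10_000, 20_000, 30_000):
    print(step, power_schedule(step))
```

Under this assumption the learning rate halves after the first 1 / decay = 10,000 steps and then keeps dropping more and more slowly.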

In [18]:
# Exponential scheduling
def exponential_decay_fn(epoch):
    return 0.01 * 0.1**(epoch / 20)
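
To avoid hard-coding the initial rate and the decay period, the schedule can be built by a small factory function. This is a sketch: the name ```exponential_decay``` and its parameters ```lr0``` and ```s``` are illustrative choices.

```python
def exponential_decay(lr0, s):
    """Return a schedule that divides the learning rate by 10 every s epochs."""
    def exponential_decay_fn(epoch):
        return lr0 * 0.1 ** (epoch / s)
    return exponential_decay_fn

exponential_decay_fn = exponential_decay(lr0=0.01, s=20)
for epoch in (0, 10, 20, 40):
    print(epoch, exponential_decay_fn(epoch))
```

A function of this kind, taking the epoch index and returning a learning rate, is what the ```keras.callbacks.LearningRateScheduler``` callback expects; the callback is then passed to ```fit()``` through its ```callbacks``` argument.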


In [19]:
# Piecewise scheduling
def piecewise_constant_fn(epoch):
    if epoch < 5:
        return 0.01
    elif epoch < 15:
        return 0.005  # intermediate rate (assumed value)
    else:
        return 0.001

In [20]:
# Performance scheduling
lr_scheduler = keras.callbacks.ReduceLROnPlateau(factor=0.5, patience=5)

tf.keras also allows the user to implement learning rate scheduling via
```keras.optimizers.schedules```: create a schedule from one of its classes, then pass it to the
optimizer as the learning rate.

In [21]:
# Native keras
s = 20 * len(range(100)) // 32 # number of steps in 20 epochs (batch size = 32)
learning_rate = keras.optimizers.schedules.ExponentialDecay(0.01, s, 0.1)
optimizer = keras.optimizers.SGD(learning_rate)

## Avoiding Overfitting Through Regularization
Deep neural networks tend to have thousands if not millions of parameters. This makes them
extremely powerful and flexible but also prone to overfitting. Regularization techniques are
required to ensure these models are able to generalize.

### $\ell_{1}$ and $\ell_{2}$ Regularization
The $\ell_{2}$ regularizer is called at each step during training to compute the regularization
loss. This is then added to the final loss. The $\ell_{1}$ regularizer can be called through
Keras by ```keras.regularizers.l1()```.

In [22]:
# L2 norm
layer = keras.layers.Dense(100, activation='elu',
                           kernel_initializer='he_normal',
                           kernel_regularizer=keras.regularizers.l2(0.01))
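
The regularization loss itself is simple to compute by hand. The sketch below assumes the usual definitions, where the $\ell_{2}$ penalty is the factor times the sum of squared weights and the $\ell_{1}$ penalty is the factor times the sum of absolute weights. The helper names are illustrative.

```python
def l2_penalty(weights, factor=0.01):
    """factor * sum of squared weights."""
    return factor * sum(w ** 2 for w in weights)

def l1_penalty(weights, factor=0.01):
    """factor * sum of absolute weights."""
    return factor * sum(abs(w) for w in weights)

weights = [1.0, -2.0, 3.0]
print(l2_penalty(weights), l1_penalty(weights))
```

The $\ell_{2}$ penalty grows quadratically with the weights, so it mostly punishes large ones, whereas the $\ell_{1}$ penalty pushes small weights toward exactly zero.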


Since the same regularizer is typically applied to all the layers in the network, Python's
```functools.partial()``` may be used. It lets you create a thin wrapper around any callable,
with some default argument values.


In [23]:
# Wrapper
from functools import partial

RegularizedDense = partial(keras.layers.Dense, activation='elu', kernel_initializer='he_normal',
                           kernel_regularizer=keras.regularizers.l2(0.01))

model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),
    RegularizedDense(300),
    RegularizedDense(100),
    RegularizedDense(10, activation='softmax', kernel_initializer='glorot_uniform')
])


### Dropout
*Dropout* is a successful regularization technique that can yield a 1-2% boost in accuracy.
It is a fairly simple algorithm: at every training step, every neuron (including the input
neurons but always excluding the output neurons) has a probability *p* of being temporarily
*dropped out*, meaning it will be entirely ignored during this training step, but it may be
active during the next step.

A unique neural network is generated at each training step. Once there have been 10,000 training
steps, 10,000 different neural networks have been trained. These neural networks are not
independent of each other, since they share many of their weights. The resulting neural network
can be seen as an averaging ensemble of all these smaller neural networks.

To implement dropout in Keras, use the ```keras.layers.Dropout``` layer. During training it
randomly drops some inputs (setting them to 0) and divides the remaining inputs by the keep
probability.

In [24]:
# Dropout
model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),
    keras.layers.Dropout(rate=0.2),
    keras.layers.Dense(300, activation='elu', kernel_initializer='he_normal'),
    keras.layers.Dropout(rate=0.2),
    keras.layers.Dense(100, activation='elu', kernel_initializer='he_normal'),
    keras.layers.Dropout(rate=0.2),
    keras.layers.Dense(10, activation='softmax')
])
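
What the layer does during training can be sketched on a plain list of inputs. This is an illustration only: the function name ```dropout``` and its ```training``` flag are assumptions for the sketch, not the Keras layer's code.

```python
import random

def dropout(inputs, rate=0.2, training=True):
    """Zero each input with probability `rate`; rescale survivors by 1 / keep_prob."""
    if not training:
        return list(inputs)  # inputs pass through unchanged outside training
    keep_prob = 1 - rate
    return [x / keep_prob if random.random() >= rate else 0.0 for x in inputs]

random.seed(42)
print(dropout([1.0] * 10, rate=0.2))
```

Dividing by the keep probability keeps the expected value of each input the same as without dropout, so the next layer sees signals of a similar scale during and after training.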

### Max-Norm Regularization
For each neuron, *max-norm regularization* constrains the weights $w$ of the incoming
connections such that $\| w\|_{2} \leq r$, where $r$ is the max-norm hyperparameter and
$\|\cdot \|_{2}$ is the $\ell_{2}$ norm.

Reducing $r$ increases the amount of regularization and helps reduce overfitting. To implement
max-norm regularization in Keras, set the ```kernel_constraint``` argument of each hidden layer
to a ```max_norm()``` constraint with the appropriate max value.

In [26]:
# Max-norm regularization
keras.layers.Dense(100, activation='elu', kernel_initializer='he_normal',
                   kernel_constraint=keras.constraints.max_norm(1.0))

<tensorflow.python.keras.layers.core.Dense at 0x1d55ef4a688>
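
The constraint can be sketched for the incoming weights of a single neuron. The sketch assumes the usual approach of computing $\| w\|_{2}$ after each training step and rescaling $w \leftarrow w \, r / \| w\|_{2}$ when the norm exceeds $r$. The helper ```apply_max_norm``` is hypothetical.

```python
import math

def apply_max_norm(w, r=1.0):
    """Rescale the weight vector w so that its L2 norm is at most r."""
    norm = math.sqrt(sum(x ** 2 for x in w))
    if norm <= r:
        return list(w)  # already within the constraint
    return [x * r / norm for x in w]

print(apply_max_norm([3.0, 4.0], r=1.0))  # norm 5 is rescaled down to 1
print(apply_max_norm([0.3, 0.4], r=1.0))  # norm 0.5 is left unchanged
```

Unlike $\ell_{1}$ and $\ell_{2}$ regularization, this adds no term to the loss; it only clips the weights after each update.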