# Chapter 11: Training Deep Neural Nets

Training a small neural net, like the ones we used for the MNIST or iris datasets, is straightforward: such nets train quickly and don't have many parameters. More complex problems, however, may call for much deeper and larger neural nets. This gives rise to 3 main issues:

1. The problem of *vanishing gradients* (or the related *exploding gradients* problem), explained below.

2. Training such a large neural net will take a long time.

3. A model with millions or billions of parameters is at high risk of overfitting the training data.

### Vanishing/Exploding Gradients Problem

Originally, neural nets were initialized with weights and biases sampled from a distribution with mean 0 and std 1. Researchers at the time noticed that backpropagation then ran into one of two problems. Either the gradients got smaller and smaller as they propagated down to the lower layers, so the first few layers were left nearly unchanged after each step (meaning the network would take a very long time to converge to a solution, if it converged at all), or the gradients grew larger and larger from layer to layer and the algorithm diverged.

In 2010, researchers Xavier Glorot and Yoshua Bengio found that, with this initialization and the sigmoid activation function, the variance of each layer's outputs was much greater than the variance of its inputs. "Going forward in the network, the variance keeps increasing after each layer until the activation function saturates at the top layers."

The derivative of the sigmoid/logistic function at high and low values is near 0, so as the variance keeps increasing, the activation functions "saturate" and thus when "backpropagation kicks in, it has virtually no gradient to propagate back through the network, and what little gradient exists keeps getting diluted as backpropagation progresses down through the top layers, so there is really nothing left for the lower layers."
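To put rough numbers on the saturation argument, here is a minimal sketch in plain Python (no TensorFlow). It assumes a deliberately simplified picture in which backpropagation multiplies one sigmoid derivative per layer and ignores the weights, so it illustrates the trend rather than modelling a real network.

```python
import math

def sigmoid(z):
    """Logistic function."""
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    """Derivative of the logistic function: s(z) * (1 - s(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)

# The derivative peaks at z = 0 and collapses once |z| grows (saturation).
for z in (0.0, 2.0, 5.0, 10.0):
    print(f"z = {z:5.1f}  sigmoid'(z) = {sigmoid_grad(z):.6f}")

# Simplifying assumption: one derivative factor per layer, weights ignored.
# Even the best case (z = 0 in every layer) shrinks quickly with depth.
best_case = sigmoid_grad(0.0)
for depth in (1, 5, 10):
    print(f"{depth:2d} layers: gradient factor <= {best_case ** depth:.2e}")
```

Since the derivative never exceeds 0.25, each extra sigmoid layer can only shrink this factor further, which is the dilution described in the quote.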

Glorot and Bengio's solution was to ensure that the variance of the inputs and outputs of each layer was approximately the same. This cannot actually be guaranteed when the number of inputs differs from the number of outputs, so the compromise they proposed (now known as Xavier or Glorot initialization) was to choose weights from a normal distribution with mean 0 and std = sqrt(2/(n_inputs+n_outputs)).
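As a quick sanity check of that formula, the sketch below samples a weight matrix in plain Python and compares the empirical std with the target. The helper names `glorot_std` and `glorot_normal` and the layer sizes are my own, chosen for illustration; this is not how Keras implements its initializers.

```python
import math
import random
import statistics

def glorot_std(n_inputs, n_outputs):
    """Standard deviation from the formula above."""
    return math.sqrt(2.0 / (n_inputs + n_outputs))

def glorot_normal(n_inputs, n_outputs, rng):
    """Sample an n_inputs x n_outputs weight matrix as a list of rows."""
    std = glorot_std(n_inputs, n_outputs)
    return [[rng.gauss(0.0, std) for _ in range(n_outputs)]
            for _ in range(n_inputs)]

rng = random.Random(0)  # fixed seed so the run is repeatable
weights = glorot_normal(300, 100, rng)  # hypothetical layer: 300 inputs, 100 outputs
flat = [w for row in weights for w in row]

print("target std:", glorot_std(300, 100))
print("sample std:", statistics.pstdev(flat))
```

Layers with more connections get smaller initial weights, which is what keeps the output variance from growing layer after layer.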

This approach has been shown to speed up the training of neural nets significantly. Since then, variants of this initialization have been proposed for activation functions other than sigmoid (such as He initialization for ReLU and its variants), but TensorFlow implements them for you, so I won't bother writing them out.

### Nonsaturating Activation Functions

While biological neurons seem to use a roughly sigmoid activation function, the 2010 paper by Glorot and Bengio suggests that in ANNs the sigmoid function is not the best choice. If we use the ReLU instead, it does not saturate for positive values (and it also helps that it is much faster to compute).

One common issue with the ReLU activation function, however, is that ReLU neurons often "die". If the weighted sum of a neuron's inputs is negative for every training instance, the ReLU outputs 0 and its gradient is also 0. Gradient descent then stops updating the neuron's weights, so the neuron keeps outputting 0. In some cases, up to half of the neurons in a ReLU network can be "dead".

To solve this issue, we can use a variant of the ReLU function. One example is the *leaky ReLU*, defined as leaky_relu(z,a)=max(az,z), where a is a small positive slope (a<1). The book gives the analogy of the neuron "going into a coma" rather than dying with the leaky ReLU: because it keeps a small nonzero gradient when z<0, it can still turn back on given enough time. One paper found that a large leak (a=0.2) outperformed a small leak (a=0.01), and that the leaky variants outperformed the original ReLU.
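The difference is easiest to see by comparing gradients for a negative input. This is a scalar sketch in plain Python with function names of my own choosing, not the TensorFlow implementation.

```python
def relu(z):
    """Standard ReLU on a scalar."""
    return max(0.0, z)

def relu_grad(z):
    """ReLU gradient: exactly 0 for negative inputs, so no weight update."""
    return 1.0 if z > 0 else 0.0

def leaky_relu(z, a=0.01):
    """Leaky ReLU: max(a*z, z) with a small positive slope a < 1."""
    return max(a * z, z)

def leaky_relu_grad(z, a=0.01):
    """Leaky ReLU gradient: a small but nonzero slope for negative inputs."""
    return 1.0 if z > 0 else a

z = -3.0  # a neuron whose weighted sum is negative
print("relu:      ", relu(z), "gradient:", relu_grad(z))
print("leaky relu:", leaky_relu(z, a=0.2), "gradient:", leaky_relu_grad(z, a=0.2))
```

With a zero gradient the plain ReLU neuron receives no learning signal at all, while the leaky version still gets a small one and can recover.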

Something else you can do is a *randomized leaky ReLU (RReLU)*, "where a is picked randomly in a given range during training, and it is fixed to an average value during testing. It also performed fairly well and seemed to act as a regularizer." You could also treat a as another model parameter learned through backpropagation, which gives the *parametric leaky ReLU (PReLU)*, but on small datasets this could lead to some overfitting.
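A scalar sketch of the RReLU idea might look like the following. The range `low=0.1, high=0.3` is an arbitrary value I picked for illustration (the notes don't specify one), and the function name is hypothetical.

```python
import random

def rrelu(z, low=0.1, high=0.3, training=True, rng=random):
    """Randomized leaky ReLU on a scalar.

    During training the slope a is drawn uniformly from [low, high];
    at test time it is fixed to the midpoint of that range.
    The default range is illustrative only.
    """
    if z >= 0:
        return z
    a = rng.uniform(low, high) if training else (low + high) / 2.0
    return a * z

rng = random.Random(42)  # fixed seed so the run is repeatable
print([rrelu(-1.0, rng=rng) for _ in range(3)])  # a different slope each call
print(rrelu(-1.0, training=False))               # fixed average slope
```

The random slope injects noise into the negative side of the activation during training, which is presumably where the regularizing effect mentioned in the quote comes from.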

However, a 2015 paper proposed another activation function that may be superior to all the others. Called the *exponential linear unit (ELU)*, it is defined as elu(z,a) = a(exp(z)-1) if z<0, and z if z>=0. The only drawback of this function, it seems, is that it is slower to compute than the ReLU and its variants, but "this is compensated by the faster convergence rate. However, at test time, an ELU network will be slower than a ReLU network."
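Written out for a single value, the ELU definition is only a couple of lines. This is a plain Python sketch; `a=1.0` is just a default I chose for the example.

```python
import math

def elu(z, a=1.0):
    """Exponential linear unit: a*(exp(z) - 1) for z < 0, z otherwise."""
    return z if z >= 0 else a * (math.exp(z) - 1.0)

# Negative inputs are squashed smoothly toward -a instead of being cut to 0.
for z in (-10.0, -1.0, 0.0, 1.0):
    print(f"elu({z:5.1f}) = {elu(z):.4f}")
```

Unlike the ReLU, the output goes below zero for negative inputs and levels off near -a, and the call to `exp` is what makes it slower to compute.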

In [3]:
import tensorflow as tf
model = tf.keras.models.Sequential([
    tf.keras.layers.Dense(10, activation=tf.keras.activations.elu),
    tf.keras.layers.Dense(10, activation=tf.keras.activations.relu)
])

In [6]:
# Keras also ships tf.keras.layers.LeakyReLU and tf.keras.layers.PReLU, but it is
# very easy to write your own activation function like so:
def leaky_relu(z, alpha=0.2):
    return tf.maximum(alpha*z, z)  # element-wise; Python's built-in max() does not work element-wise on tensors

model.add(tf.keras.layers.Dense(10,activation=leaky_relu))

### Batch Normalization

"Although using He initialization along with ELU... can significantly reduce the vanishing/exploding gradients problems at the beginning of training, it doesn't guarantee that they won't come back."

Another really powerful way to reduce the effect of vanishing/exploding gradients is to use Batch Normalization. The idea is to add an operation just before the activation function is applied to the weighted sums in each layer: it first zero-centers and normalizes the values, and then it uses two additional parameters per layer to scale and shift them.

This technique is also used to solve what is called the **Internal Covariate Shift** problem. The normalization and zero-centering of the data addresses "the problem that the distribution of each layer's inputs changes during training, as the parameters of the previous layers change."

The BN algorithm first computes the empirical mean and the empirical std of each input over the current mini-batch.

Then, it zero-centers and normalizes each input by: *new_xi = (xi - mean)/sqrt(std^2 + epsilon)*. This should seem very familiar because it is exactly the standardization I learned back in AP Stats, except for the small epsilon > 0 term, which is added under the square root just to ensure there is no division by 0.

Then, the algorithm scales the new values by **gamma** and shifts them by **beta**, i.e., *z = gamma · new_xi + beta*.
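The training-time steps can be sketched for a single feature in plain Python. This is only the forward pass under simplifying assumptions: `gamma` and `beta` are passed in as fixed numbers rather than learned, epsilon is added to the variance under the square root, and the default `epsilon=1e-3` is a value I picked for the example.

```python
import math

def batch_norm(batch, gamma=1.0, beta=0.0, epsilon=1e-3):
    """Normalize one feature across a mini-batch, then scale and shift it."""
    n = len(batch)
    mean = sum(batch) / n                          # empirical mean
    var = sum((x - mean) ** 2 for x in batch) / n  # empirical variance (std^2)
    # Zero-center and normalize; epsilon avoids division by zero.
    normalized = [(x - mean) / math.sqrt(var + epsilon) for x in batch]
    # Scale by gamma and shift by beta.
    return [gamma * x_hat + beta for x_hat in normalized]

mini_batch = [1.0, 2.0, 3.0, 4.0]  # made-up values for one feature
print(batch_norm(mini_batch))
print(batch_norm(mini_batch, gamma=2.0, beta=5.0))
```

With the defaults the output has mean 0 and a variance just under 1; in a real layer, gamma and beta are trained along with the weights so the network can undo the normalization if that helps.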

"At test time, there is no mini-batch to compute the empirical mean and std, so instead you simply use the whole training set's mean and std."

In [42]:
bn_model = tf.keras.models.Sequential([ # with batch normalization
    tf.keras.layers.Flatten(input_shape=(28,28)),
    tf.keras.layers.Dense(150, activation='sigmoid'),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Dense(150, activation='sigmoid'),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Dense(10, activation='softmax')
])

bn_model.compile(loss='sparse_categorical_crossentropy', 
                 optimizer=tf.keras.optimizers.Adam(learning_rate=5e-4), 
                 metrics=['accuracy'])

In [31]:
no_bn_model = tf.keras.models.Sequential([ # no batch normalization
    tf.keras.layers.Flatten(input_shape=(28,28)),
    tf.keras.layers.Dense(150, activation='sigmoid'),
    tf.keras.layers.Dense(150, activation='sigmoid'),
    tf.keras.layers.Dense(10, activation='softmax')
])
no_bn_model.compile(loss='sparse_categorical_crossentropy',
                    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
                    metrics=['accuracy'])

In [32]:
from tensorflow.keras.datasets import mnist

(x_train, y_train), (x_test, y_test) = mnist.load_data()

In [34]:
bn_model.fit(x_train/255., y_train, epochs=3) # Since we have batch normalization, we can set a higher learning rate
# epoch 1, accuracy: 89%
# epoch 2, accuracy: 94%
# epoch 3, accuracy: 96%


In [36]:
no_bn_model.fit(x_train/255.,y_train, epochs=3)
# epoch 1, accuracy: 71%
# epoch 2, accuracy: 88%
# epoch 3, accuracy: 91%


### Avoiding Overfitting Through Regularization

#### Early stopping

The idea of early stopping is really simple. At regular intervals, evaluate how your model does on a validation set and store the weights of the best-performing model so far. After a set number of epochs with no improvement (the patience), simply stop training and restore the saved weights.
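The bookkeeping behind this is small enough to sketch in plain Python. The version below replays a precomputed list of validation losses instead of training a model; the function name and the loss values are made up, and the stopping rule is a simplified reading of the description above, not necessarily the exact logic of the Keras callback.

```python
def run_with_early_stopping(val_losses, patience):
    """Simulate early stopping over a list of per-epoch validation losses.

    Returns (best_epoch, stopped_epoch), both 1-indexed. In real training,
    the loop body would train for one epoch and save a copy of the weights
    whenever the validation loss improves.
    """
    best_loss = float("inf")
    best_epoch = 0
    epochs_without_improvement = 0
    stopped_epoch = len(val_losses)
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch  # "save the weights" here
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                stopped_epoch = epoch  # "restore the saved weights" here
                break
    return best_epoch, stopped_epoch

losses = [0.90, 0.60, 0.45, 0.47, 0.46, 0.50, 0.44]  # made-up values
print(run_with_early_stopping(losses, patience=2))
print(run_with_early_stopping(losses, patience=5))
```

With these made-up losses, a short patience stops before the late improvement in the last epoch is ever seen, while a longer patience reaches it, so the patience value trades training time against the risk of stopping too soon.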

"Although early stopping works very well in practice, you can usually get much higher performance out of your network by combining it with other regularization techniques."

In [43]:
early_stop = tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=0, restore_best_weights=True)

bn_model.fit(x_train/255.,y_train, validation_data=(x_test/255.,y_test), callbacks=[early_stop], epochs=50)

# training stopped after epoch 9 of the 50 requested

#### L1 and L2 Regularization

We covered l1 and l2 regularization in Ch4; both penalize weights that are too large by adding a term to the loss. The l1 penalty is (alpha) x sum of |weights| and the l2 penalty is (beta) x sum of weights^2.
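For a concrete feel of the two penalty terms, here is a plain Python sketch on a handful of made-up weights. The function names are my own, and in practice the penalty is simply added to the training loss.

```python
def l1_penalty(weights, alpha):
    """alpha times the sum of absolute weights."""
    return alpha * sum(abs(w) for w in weights)

def l2_penalty(weights, beta):
    """beta times the sum of squared weights."""
    return beta * sum(w * w for w in weights)

weights = [0.5, -1.0, 2.0]  # made-up values
print("l1:", l1_penalty(weights, alpha=0.01))
print("l2:", l2_penalty(weights, beta=0.01))
```

Because l2 squares each weight, the largest weight dominates its total, whereas l1 charges every weight in proportion to its size.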

In [47]:
tf.keras.layers.Dense(64, kernel_regularizer=tf.keras.regularizers.l1_l2(l1=0.01, l2=0.01),
                      activity_regularizer=None, bias_regularizer=None)
