# Chapter-11 Training Deep Neural Networks

## The Vanishing/Exploding Gradients Problems

Unfortunately, gradients often get smaller and smaller as the algorithm progresses
down to the lower layers. As a result, the Gradient Descent update leaves the lower
layers’ connection weights virtually unchanged, and training never converges to a
good solution. We call this the vanishing gradients problem.

In some cases, the opposite can happen: the gradients can grow bigger and bigger
until layers get insanely large weight updates and the algorithm diverges. This is
the exploding gradients problem, which most often surfaces in recurrent neural
networks.
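
To make the two failure modes concrete, here is a minimal sketch in plain Python. It assumes a toy chain of one-unit layers, so backpropagation reduces to multiplying one local derivative per layer. The function name `backprop_factor`, the weights, and the depth of 20 are illustrative choices, not values from the text.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def backprop_factor(weight, n_layers, activation="sigmoid", x=0.5):
    """Return d(output)/d(input) for a chain of n one-unit layers (toy model)."""
    grad = 1.0
    a = x
    for _ in range(n_layers):
        z = weight * a
        if activation == "sigmoid":
            a = sigmoid(z)
            local = a * (1.0 - a)    # sigmoid'(z), never larger than 0.25
        else:                        # "linear": identity activation
            a = z
            local = 1.0
        grad *= weight * local       # chain rule: one factor per layer
    return grad

vanishing = backprop_factor(weight=1.0, n_layers=20, activation="sigmoid")
exploding = backprop_factor(weight=2.0, n_layers=20, activation="linear")
print(f"sigmoid, w=1.0: {vanishing:.3e}")
print(f"linear,  w=2.0: {exploding:.3e}")
```

With factors below 1 the product shrinks toward zero as depth grows, and with factors above 1 it grows without bound, which is the intuition behind both problems.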

## Glorot and He Initialization

By default, Keras uses Glorot initialization with a uniform distribution. When
creating a layer, you can change this to He initialization by setting
kernel_initializer="he_uniform" or kernel_initializer="he_normal".
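
As a rough guide to what these options mean, the sketch below computes the weight scales from the commonly cited Glorot and He formulas. The helper names are made up for illustration, and Keras's own implementation may differ in details (for example, in how it samples the normal distribution).

```python
import math

def glorot_uniform_limit(fan_in, fan_out):
    # Glorot uniform: weights drawn from U(-limit, +limit)
    return math.sqrt(6.0 / (fan_in + fan_out))

def he_uniform_limit(fan_in):
    # He uniform: U(-limit, +limit), scaled by fan_in only
    return math.sqrt(6.0 / fan_in)

def he_normal_stddev(fan_in):
    # He normal: zero-mean normal with this standard deviation
    return math.sqrt(2.0 / fan_in)

fan_in, fan_out = 784, 300  # e.g. a Dense(300) layer fed 784 inputs
print(glorot_uniform_limit(fan_in, fan_out))
print(he_uniform_limit(fan_in))
print(he_normal_stddev(fan_in))
```

For the same layer, He initialization yields a wider range than Glorot initialization because it ignores fan_out.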

## Activation Function

In [1]:
# model = keras.models.Sequential([
# [...]
# keras.layers.Dense(10, kernel_initializer="he_normal"),
# keras.layers.LeakyReLU(alpha=0.2),
# [...]
# ])
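
The cell above adds LeakyReLU as a separate layer after a Dense layer that has no activation of its own. As a minimal sketch of what that layer computes on a single value (the function name `leaky_relu` is illustrative, and alpha=0.2 mirrors the cell):

```python
def leaky_relu(z, alpha=0.2):
    """LeakyReLU: identity for z >= 0, slope alpha for z < 0."""
    return z if z >= 0 else alpha * z

for z in (-2.0, -0.5, 0.0, 1.5):
    print(z, leaky_relu(z))
```

Because the slope for negative inputs is alpha rather than zero, the gradient never vanishes completely on that side.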

## Batch Normalization

The technique consists of
adding an operation in the model just before or after the activation function of each
hidden layer. This operation simply zero-centers and normalizes each input, then
scales and shifts the result using two new parameter vectors per layer: one for scaling,
the other for shifting. In other words, the operation lets the model learn the optimal
scale and mean of each of the layer’s inputs.
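
The operation described above can be sketched in plain Python for one mini-batch at training time. This is an illustration under stated assumptions, not Keras's implementation: the function name `batch_norm` is made up, and `eps` is a small smoothing term whose value here is an assumption, added to avoid division by zero.

```python
import math
from statistics import fmean, pvariance

def batch_norm(batch, gamma, beta, eps=1e-3):
    """Apply BN to a mini-batch given as a list of rows (one row per instance)."""
    columns = list(zip(*batch))                 # one tuple per input feature
    mu = [fmean(col) for col in columns]        # per-feature batch mean
    var = [pvariance(col) for col in columns]   # per-feature batch variance
    out = []
    for row in batch:
        # Zero-center and normalize each input
        normalized = [(x - m) / math.sqrt(v + eps) for x, m, v in zip(row, mu, var)]
        # Scale with gamma and shift with beta (the two learned vectors)
        out.append([g * xh + b for g, xh, b in zip(gamma, normalized, beta)])
    return out

batch = [[1.0, 200.0], [2.0, 400.0], [3.0, 600.0]]
scaled = batch_norm(batch, gamma=[1.0, 1.0], beta=[0.0, 0.0])
for row in scaled:
    print([round(v, 3) for v in row])
```

Both features end up on a comparable scale even though their raw ranges differ by two orders of magnitude.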

In [4]:
from tensorflow import keras

In [6]:
model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(300, activation="elu", kernel_initializer="he_normal"),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(100, activation="elu", kernel_initializer="he_normal"),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(10, activation="softmax")
])

In [7]:
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
flatten (Flatten)            (None, 784)               0         
_________________________________________________________________
batch_normalization (BatchNo (None, 784)               3136      
_________________________________________________________________
dense (Dense)                (None, 300)               235500    
_________________________________________________________________
batch_normalization_1 (Batch (None, 300)               1200      
_________________________________________________________________
dense_1 (Dense)              (None, 100)               30100     
_________________________________________________________________
batch_normalization_2 (Batch (None, 100)               400       
_________________________________________________________________
dense_2 (Dense)              (None, 10)                1010      

As you can see, each BN layer adds four parameters per input: γ, β, μ, and σ (for
example, the first BN layer adds 3,136 parameters, which is 4 × 784). The last two
parameters, μ and σ, are the moving averages; they are not affected by
backpropagation, so Keras calls them “non-trainable” (if you count the total number
of BN parameters, 3,136 + 1,200 + 400, and divide by 2, you get 2,368, which is the
total number of non-trainable parameters in this model).
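
The arithmetic in this paragraph, along with the Dense layer counts in the summary, can be checked with a few lines of Python. The variable names are illustrative.

```python
bn_input_sizes = [784, 300, 100]             # inputs to the three BN layers
bn_params = [4 * n for n in bn_input_sizes]  # gamma, beta, mu, sigma per input
non_trainable = sum(bn_params) // 2          # mu and sigma are half of them

# Dense layers: one weight per (input, unit) pair plus one bias per unit
dense_params = [784 * 300 + 300, 300 * 100 + 100]

print(bn_params, non_trainable, dense_params)
```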

In [8]:
[(var.name, var.trainable) for var in model.layers[1].variables]

[('batch_normalization/gamma:0', True),
 ('batch_normalization/beta:0', True),
 ('batch_normalization/moving_mean:0', False),
 ('batch_normalization/moving_variance:0', False)]

Now when you create a BN layer in Keras, it also creates two operations that will be
called by Keras at each iteration during training. These operations will update the
moving averages. Since we are using the TensorFlow backend, these operations are
TensorFlow operations:

In [9]:
model.layers[1].updates

[<tf.Operation 'cond/Identity' type=Identity>,
 <tf.Operation 'cond_1/Identity' type=Identity>]
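
Conceptually, each of these update operations applies an exponential moving average. The sketch below shows that update for a single scalar statistic. The function name is illustrative, and the momentum of 0.99 and the starting value of 0.0 are assumptions chosen for the example.

```python
def update_moving_average(moving, batch_stat, momentum=0.99):
    """Exponential moving average update, one step per training batch."""
    return moving * momentum + batch_stat * (1.0 - momentum)

moving_mean = 0.0                      # starting value (assumed)
for batch_mean in [5.0] * 500:         # pretend every batch has mean 5.0
    moving_mean = update_moving_average(moving_mean, batch_mean)
print(round(moving_mean, 3))
```

The moving average drifts toward the batch statistic gradually, which is why it needs many training iterations to become a reliable estimate.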

## Gradient Clipping

Another popular technique to mitigate the exploding gradients problem is to clip the
gradients during backpropagation so that they never exceed some threshold. This is
called Gradient Clipping. This technique is most often used in recurrent neural
networks, because Batch Normalization is tricky to use in RNNs, as we will see in
Chapter 15. For other types of networks, BN is usually sufficient.

In [10]:
optimizer = keras.optimizers.SGD(clipvalue=1.0)
model.compile(loss="mse", optimizer=optimizer)

This optimizer will clip every component of the gradient vector to a value between
–1.0 and 1.0.
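
A minimal sketch of what clipping by value does to a gradient vector follows. For contrast it also shows clipping by L2 norm, which is not covered in the text above but corresponds to the optimizer's clipnorm argument. Both helper functions are illustrative, not Keras code.

```python
import math

def clip_by_value(grad, threshold=1.0):
    # Clip each component independently, as clipvalue does
    return [max(-threshold, min(threshold, g)) for g in grad]

def clip_by_norm(grad, threshold=1.0):
    # Rescale the whole vector if its L2 norm exceeds the threshold
    norm = math.sqrt(sum(g * g for g in grad))
    return grad if norm <= threshold else [g * threshold / norm for g in grad]

grad = [0.9, 100.0]
print(clip_by_value(grad))   # components clipped independently
print(clip_by_norm(grad))    # direction preserved
```

Clipping by value can change the direction of the gradient vector: here the second component dominated the original gradient, but after clipping the two components are nearly equal. Clipping by norm keeps the direction and only shortens the vector.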

## Reuse Pretrained Layers (Transfer Learning)

In [11]:
# model_A = keras.models.load_model("my_model_A.h5")
# model_B_on_A = keras.models.Sequential(model_A.layers[:-1])
# model_B_on_A.add(keras.layers.Dense(1, activation='sigmoid'))

Note that model_A and model_B_on_A now share some layers. When you train
model_B_on_A, it will also affect model_A. If you want to avoid that, you need to
clone model_A before you reuse its layers.

To do this, you clone model A’s architecture with clone_model(), then copy its
weights (since clone_model() does not clone the weights):

In [12]:
# model_A_clone = keras.models.clone_model(model_A)
# model_A_clone.set_weights(model_A.get_weights())
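
The sharing behaviour is ordinary Python object aliasing, so it can be illustrated without Keras. In the analogy below, `ToyLayer` is a hypothetical stand-in for a layer that just holds a list of weights, and `copy.deepcopy` plays the role of cloning the architecture and copying the weights.

```python
import copy

class ToyLayer:
    """Hypothetical stand-in for a Keras layer: it just holds weights."""
    def __init__(self, weights):
        self.weights = list(weights)

model_a = [ToyLayer([0.1, 0.2]), ToyLayer([0.3]), ToyLayer([0.4])]

# Reusing the layer objects directly: both lists point at the same layers
model_b_shared = model_a[:-1] + [ToyLayer([0.0])]
model_b_shared[0].weights[0] = 9.9      # "training" model B ...
print(model_a[0].weights)               # ... also changed model A

# Working on a copy first keeps the original untouched
model_a2 = [ToyLayer([0.1, 0.2]), ToyLayer([0.3]), ToyLayer([0.4])]
model_a2_clone = copy.deepcopy(model_a2)
model_b_cloned = model_a2_clone[:-1] + [ToyLayer([0.0])]
model_b_cloned[0].weights[0] = 9.9
print(model_a2[0].weights)
```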

Now you could train model_B_on_A for task B, but since the new output layer was
initialized randomly it will make large errors (at least during the first few
epochs), so there will be large error gradients that may wreck the reused weights.
To avoid this, one approach is to freeze the reused layers during the first few
epochs, giving the new layer some time to learn reasonable weights. To do this, set
every reused layer’s trainable attribute to False and compile the model:

In [13]:
# for layer in model_B_on_A.layers[:-1]:
#     layer.trainable = False
# model_B_on_A.compile(loss="binary_crossentropy", optimizer="sgd",
# metrics=["accuracy"])

You must always compile your model after you freeze or unfreeze
layers.

Now you can train the model for a few epochs, then unfreeze the reused layers (which
requires compiling the model again) and continue training to fine-tune the reused
layers for task B. After unfreezing the reused layers, it is usually a good idea to reduce
the learning rate, once again to avoid damaging the reused weights:

In [14]:
# history = model_B_on_A.fit(X_train_B, y_train_B, epochs=4,
# validation_data=(X_valid_B, y_valid_B))

# for layer in model_B_on_A.layers[:-1]:
#     layer.trainable = True

# optimizer = keras.optimizers.SGD(learning_rate=1e-4) # the default learning rate is 1e-2
# model_B_on_A.compile(loss="binary_crossentropy", optimizer=optimizer,
#                      metrics=["accuracy"])
# history = model_B_on_A.fit(X_train_B, y_train_B, epochs=16,
#                 validation_data=(X_valid_B, y_valid_B))
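
The freeze-then-fine-tune schedule can also be illustrated on a toy problem that runs without Keras or the task B data. Everything in this sketch is an assumption made for illustration: the model is y = new × reused × x with one "reused" weight and one "new" weight, and the data, learning rates, and epoch counts are arbitrary.

```python
def train(params, trainable, xs, ts, lr, epochs):
    """Plain gradient descent on MSE for the toy model y = new * reused * x."""
    for _ in range(epochs):
        n = len(xs)
        errs = [params["new"] * params["reused"] * x - t for x, t in zip(xs, ts)]
        grads = {
            "new": sum(2 * e * params["reused"] * x for e, x in zip(errs, xs)) / n,
            "reused": sum(2 * e * params["new"] * x for e, x in zip(errs, xs)) / n,
        }
        for name in params:
            if trainable[name]:              # frozen parameters are skipped
                params[name] -= lr * grads[name]
    return params

def mse(params, xs, ts):
    return sum((params["new"] * params["reused"] * x - t) ** 2
               for x, t in zip(xs, ts)) / len(xs)

xs = [0.5, 1.0, 1.5, 2.0]
ts = [3.0 * x for x in xs]
params = {"reused": 2.0, "new": -1.0}        # "new" starts far from its optimum

# Phase 1: freeze the reused weight and train only the new one
train(params, {"reused": False, "new": True}, xs, ts, lr=0.05, epochs=20)
reused_after_phase_1 = params["reused"]

# Phase 2: unfreeze everything and fine-tune with a smaller learning rate
train(params, {"reused": True, "new": True}, xs, ts, lr=0.005, epochs=20)
print(reused_after_phase_1, params, mse(params, xs, ts))
```

During phase 1 the large initial errors only move the new weight, so the reused weight is still intact when fine-tuning begins.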

It turns out that transfer learning does not work very well with
small dense networks, presumably because small networks learn few patterns, and
dense networks learn very specific patterns, which are unlikely to be useful in other
tasks. Transfer learning works best with deep convolutional neural networks, which
tend to learn feature detectors that are much more general (especially in the lower layers).