# Chapter 11: Training Deep Neural Nets

In the previous chapter, we trained a neural network with 2 hidden layers. More complex problems require networks with more hidden layers with hundreds of neurons per layer. Training these can lead to several problems:

- The _vanishing gradients_ and _exploding gradients_ problem makes lower levels hard to train.
- Training a large network can be very slow.
- A model with millions of parameters risks overfitting the training data.

Below we will discuss methods for solving all of these problems.

## Vanishing/Exploding Gradients Problem

While training a neural network with backpropagation, the algorithm finds the components of the error contributed by each layer to compute the error gradient.  Gradients can often get smaller and smaller as the algorithm progresses, resulting in the gradient contribution from the lower layers approaching zero. This is known as the _vanishing gradient_ problem. Alternatively, the gradient can also can grow bigger and bigger which can cause the algorithm to diverge. This is called the _exploding gradient problem_.

Around 2010, a paper titled ["Understanding the Difficulty of Training Deep Feedforward Neural Networks"](http://proceedings.mlr.press/v9/glorot10a/glorot10a.pdf) found some reasons for this. The sigmoid activation function as well as the random initialization of the weight matrices' elements using a normal distribution with a mean of 0 and a standard deviation of 1. The paper showed the variance of the outputs was much larger than the variance of the inputs. Going forward in the network, the variance kept getting larger and it results in the activation saturating near the horizontal asymptotes, which causes the gradient to vanish.

### Xavier and He Initialization

The authors of the paper found that one way to prevent the vanishing/exploding gradient problem is to ensure that the variance of the input and output of each layer is the same. One way to do this is to initialize the weights matrix using a normal distribution with a mean of 0 and a standard deviation given by

$$ \sigma = \sqrt{\frac{2}{n_\text{ inputs} + n_\text{ outputs}}} $$

or a uniform distribution centered at 0 with a radius, $r$, given by

$$ r = \sqrt{\frac{6}{n_\text{ inputs} + n_\text{ outputs}}} $$

where $n_\text{ inputs}$ and $n_\text{ outputs}$ is the number of input  or output connections in that particular layer. This is often known as _Xavier initialization_ after the author's first name, or sometimes _Glorot initialization_.

For the ReLU activation function, we use a normal distribution with a standard deviation given by

$$ \sigma = \frac{2}{\sqrt{n_\text{ inputs} + n_\text{ outputs}}} $$

or a uniform distribution with a radius given by

$$ r = \sqrt{\frac{24}{n_\text{ inputs} + n_\text{ outputs}}} $$

which is known as _He initialization_. Below is an example of creating a layer of a neural network which uses _He initialization_. By default, `tf.layers.dense()` uses Xavier initialization.

In [0]:
import tensorflow as tf

n_inputs = 28 ** 2
n_hidden = 100

X = tf.placeholder(tf.float32, shape=(None, n_inputs))
he_init = tf.variance_scaling_initializer()
hidden1 = tf.layers.dense(X, n_hidden, activation=tf.nn.relu,
                          kernel_initializer=he_init, name='hidden1')

### Nonsaturating Activation Functions

One of the causes of the vanishing/exploding gradient problem discussed in the paper is the sigmoid activation function. The ReLU activation function performs much better, but it has a different problem. If some neurons output negative values, after the application of the activation function, their output will be stuck at 0. Since the gradient is also 0, the neuron remains "dead."

One solution to this problem is to use a "leaky" ReLU function, given by

$$ \text{LeakyReLU}(z) = \max(\alpha z, z) $$

where $\alpha$ is the slope ofthe ReLU function when the value of $z$ is less than 0. Researchers have found that this activation function performs better than the "hard" ReLU function. You can even have $\alpha$ be a parameter that the model learns during training. This prevents neurons from completely dying.

Another activation function that performs better than leaky ReLU that was proposed in this [paper](https://arxiv.org/pdf/1511.07289v5.pdf) by Djork-Arné Clevert called the _exponential linear unit_ (ELU) given by

$$ \text{ELU}_\alpha(z) = \left\{ \begin{matrix}
\alpha\,(\exp(z) - 1) && \text{if}\;z < 0 \\
z && \text{if}\; z \geq 0
\end{matrix} \right. $$

It has the following differences from the ReLU function:

- It takes negative values when $z < 0$ . which allows the unit to have an average output closer to 0. This helps alleviate the vanishing radient problem. You can tweak the hyperparameter, $\alpha$, sets the negative number that ELU approaches.

- It has a nonzero gradient when $z < 0$, preventing the dying units issue.

- The function is differentiable everywhere, which helps the speed of Gradient Descent.

The disadvantage of ELU is that it takes longer to compute than ReLU. The extra time is compensated for the fact that it helps Gradient Descent converge fasted, but it does cause the model to make predictions more slowly.

TensorFlow offeres an implementation of ELU which is used in the code example below:

In [0]:
tf.reset_default_graph()

n_inputs = 28 ** 2
n_hidden = 100

X = tf.placeholder(tf.float32, shape=(None, n_inputs))
he_init = tf.variance_scaling_initializer()
hidden1 = tf.layers.dense(X, n_hidden, activation=tf.nn.elu, name='hidden1')

TensorFlow does not have an implementation of leaky ReLU, but it is easy to define ourselves:

In [0]:
tf.reset_default_graph()

def leaky_relu(z, alpha=0.01):
  return tf.maximum(alpha * z, z)

n_inputs = 28 ** 2
n_hidden = 100

X = tf.placeholder(tf.float32, shape=(None, n_inputs))
he_init = tf.variance_scaling_initializer()
hidden1 = tf.layers.dense(X, n_hidden, activation=leaky_relu, name='hidden1')

### Batch Normalization

In this [paper](https://arxiv.org/pdf/1502.03167v3.pdf) Sergey Ioffe and Christian Szegedy proposed a technique called _Batch Normalization_ (BN) to address both the vanishing/exploding gradient problem and the problem that the distribution of each layer's inputs change when the parameters of the previous layers change (i.e. the _Internal Covariate Shift_ problem).

The technique adds an operation to the model just before applying the activation function of each layer. It zero-centers and normalizes the inputs, then it scales and shifts the result using two new parameters per layer. This lets the model learn the optimal mean and shift for each layer.

The algorithm starts by first computing the empirical mean for the current mini-batch, $B$, given by

$$ \mu_B = \frac{1}{m_B} \sum\limits_{i\,=\,1}^{m_B} \mathbf{x}^{(i)} $$

Next, we find the empirical standard deviation, given by

$$ \sigma_B^{\;\;2} = \frac{1}{m_B} \sum\limits_{i\,=\,1}^{m_B} \left( \mathbf{x}^{(i)} - \mu_B \right)^2 $$

Then we zero-center and normalize the inputs in the mini-batch

$$ \hat{\mathbf{x}}^{(i)} = \frac{\mathbf{x}^{(i)} - \mu_B}{\sqrt{\sigma_B^{;\;2} + \epsilon}} $$

where $\epsilon$ is a small number, typically $10^{-5}$, called the _smoothing term_ to avoid division by zero. Finally it computes the output given by

$$ \mathbf{z}^{(i)} = \gamma\,\hat{\mathbf{x}}^{(i)} + \beta $$

where $\gamma$ is the scaling parameter and $\beta$ is the shift parameter which are learned during training.

When the model makes predictions, it uses the empirical mean and standard deviation of the entire training set. In the end, the model ends up learning 4 parameters: the mean of the training set, $\mu$; the standard deviation of the training set, $\sigma$; the scaling parameter, $\gamma$; and the shift parameter, $\beta$.

Adding Batch Normalization to a deep neural network improves the performance of the model, lets you skip normalizing the data before training the data, and helps the model converge to the optimal parameters in fewer training iterations. However, using Batch Normalization causes the model to make predictions slower since it adds another computational step for making predictions.

#### Implementing Batch Normalization with TensorFlow

TensorFlow provides a `tf.nn.batch_normalization()` function which normalizes and centers the data, but you must compute the mean and standard deviation yourself.You also have to handle the creation of the scaling and offfset parameters. TensorFlow also includes a `tf.layers.batch_normalization()` function which handles all of batch normalization for you. Below is an example.

In [6]:
tf.reset_default_graph()

n_inputs = 28 ** 2 # MNIST dataset.
n_hidden1 = 300
n_hidden2 = 100
n_outputs = 10

X = tf.placeholder(tf.float32, shape=(None, n_inputs), name='X')

# Indicates if the batch normalization should be using the mini-batch's mean
# or the mean of the entire training set (same with standard deviation).
training = tf.placeholder_with_default(False, shape=(), name='training')

hidden1 = tf.layers.dense(X, n_hidden1, name='hidden1')
bn1 = tf.layers.batch_normalization(hidden1, training=training, momentum=0.9)
bn1_act = tf.nn.elu(bn1)

hidden2 = tf.layers.dense(bn1_act, n_hidden2, name='hidden2')
bn2 = tf.layers.batch_normalization(hidden2, training=training, momentum=0.9)
bn2_act = tf.nn.elu(bn2)

logits_before_bn = tf.layers.dense(bn2_act, n_outputs, name='outputs')
logts = tf.layers.batch_normalization(logits_before_bn, training=training,
                                      momentum=0.9)

Instructions for updating:
Use keras.layers.batch_normalization instead.


The BN algorithm uses _exponential decay_ to compute a running average, which is why it requires the _momentum_ parameter. Given a new value, $v$, it updates the running average $\hat{v}$ given by

$$ \hat{v} \leftarrow \hat{v} \times \text{momentum} + v \times (1 - \text{momentum}) $$

Momentum values should be typically close to 1, e.g. 0.9, 0.99, or 0.999.

Below is an example of using _partial application_ using the `functools` library in order to make the code less repetitive.

In [0]:
from functools import partial

tf.reset_default_graph()

X = tf.placeholder(tf.float32, shape=(None, n_inputs), name='X')

# Indicates if the batch normalization should be using the mini-batch's mean
# or the mean of the entire training set (same with standard deviation).
training = tf.placeholder_with_default(False, shape=(), name='training')

batch_norm_layer = partial(tf.layers.batch_normalization,
                           training=training, momentum=0.9)

hidden1 = tf.layers.dense(X, n_hidden1, name='hidden1')
bn1 = batch_norm_layer(hidden1)
bn1_act = tf.nn.elu(bn1)

hidden2 = tf.layers.dense(bn1_act, n_hidden2, name='hidden2')
bn2 = batch_norm_layer(hidden2)
bn2_act = tf.nn.elu(bn2)

logits_before_bn = tf.layers.dense(bn2_act, n_outputs, name='outputs')
logits = batch_norm_layer(logits_before_bn)

In [8]:
# Setting up the rest of the graph for training.

y = tf.placeholder(tf.int32, shape=(None), name='y')

x_entropy = tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits,
                                                           labels=y)
loss = tf.reduce_mean(x_entropy, name='loss')
optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.01)
training_op = optimizer.minimize(loss)

correct = tf.nn.in_top_k(logits, y, 1)
accuracy = tf.reduce_mean(tf.cast(correct, tf.float32))

init = tf.global_variables_initializer()

Instructions for updating:
Use tf.cast instead.


In [9]:
# Downloading the MNIST dataset.

import numpy as np

(X_train, y_train), (X_test, y_test) = tf.keras.datasets.mnist.load_data()
X_train = X_train.astype(np.float32).reshape(-1, 28*28) / 255.0
X_test = X_test.astype(np.float32).reshape(-1, 28*28) / 255.0
y_train = y_train.astype(np.int32)
y_test = y_test.astype(np.int32)
X_valid, X_train = X_train[:5000], X_train[5000:]
y_valid, y_train = y_train[:5000], y_train[5000:]

def shuffle_batch(X, y, batch_size):
  rnd_idx = np.random.permutation(len(X))
  n_batches = len(X) // batch_size
  for batch_idx in np.array_split(rnd_idx, n_batches):
    X_batch, y_batch = X[batch_idx], y[batch_idx]
    yield X_batch, y_batch

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/mnist.npz


In [10]:
# Training the neural network using Batch Normalization.
# In just 20 training iterations it achieves 97% accuracy on the
# validation set.

n_epochs = 20
batch_size = 200

# These extra ops are for training the scaling and offset parameters
# in batch normalization.
extra_update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)

with tf.Session() as sess:
  init.run()
  for epoch in range(n_epochs):
    for X_batch, y_batch in shuffle_batch(X_train, y_train, batch_size):
      sess.run([training_op, extra_update_ops],
               feed_dict={training: True, X: X_batch, y: y_batch})
    accuracy_val = accuracy.eval(feed_dict={X: X_valid, y: y_valid})
    print('Epoch: {} Validation Accuracy: {}'.format(epoch, accuracy_val))

Epoch: 0 Validation Accuracy: 0.8848000168800354
Epoch: 1 Validation Accuracy: 0.9088000059127808
Epoch: 2 Validation Accuracy: 0.9233999848365784
Epoch: 3 Validation Accuracy: 0.9308000206947327
Epoch: 4 Validation Accuracy: 0.9399999976158142
Epoch: 5 Validation Accuracy: 0.9434000253677368
Epoch: 6 Validation Accuracy: 0.9491999745368958
Epoch: 7 Validation Accuracy: 0.9521999955177307
Epoch: 8 Validation Accuracy: 0.9552000164985657
Epoch: 9 Validation Accuracy: 0.9592000246047974
Epoch: 10 Validation Accuracy: 0.9620000123977661
Epoch: 11 Validation Accuracy: 0.9642000198364258
Epoch: 12 Validation Accuracy: 0.9649999737739563
Epoch: 13 Validation Accuracy: 0.9674000144004822
Epoch: 14 Validation Accuracy: 0.9682000279426575
Epoch: 15 Validation Accuracy: 0.9710000157356262
Epoch: 16 Validation Accuracy: 0.9702000021934509
Epoch: 17 Validation Accuracy: 0.9715999960899353
Epoch: 18 Validation Accuracy: 0.9714000225067139
Epoch: 19 Validation Accuracy: 0.972599983215332


An alternate syntax to training a model this way is to define the `training_op` the following way:

```python
with tf.name_scope("train"):
    optimizer = tf.train.GradientDescentOptimizer(learning_rate)
    extra_update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
    with tf.control_dependencies(extra_update_ops):
        training_op = optimizer.minimize(loss)
```

this lets you train the model using the more simple syntax:

```python
sess.run(training_op, feed_dict={X: X_batch, y: y_batch})
```

### Gradient Clipping

One way to solve the exploding gradients problem is to clip the gradients' values to a defined range. This technique is called [_Gradient Clipping_](http://proceedings.mlr.press/v28/pascanu13.pdf). Though in general people prefer Batch Normalization. Below is an example of Gradient Clipping using TensorFlow:

In [0]:
# An example of gradient clipping.

tf.reset_default_graph()

X = tf.placeholder(tf.float32, shape=(None, n_inputs), name='X')
y = tf.placeholder(tf.int32, shape=(None), name='y')

training = tf.placeholder_with_default(False, shape=(), name='training')

batch_norm_layer = partial(tf.layers.batch_normalization,
                           training=training, momentum=0.9)

hidden1 = tf.layers.dense(X, n_hidden1, name='hidden1')
hidden1_act = tf.nn.elu(hidden1)

hidden2 = tf.layers.dense(hidden1_act, n_hidden2, name='hidden2')
hidden2_act = tf.nn.elu(hidden2)

logits_before_bn = tf.layers.dense(hidden2_act, n_outputs, name='outputs')
logits = batch_norm_layer(logits_before_bn)

x_entropy = tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits,
                                                           labels=y)
loss = tf.reduce_mean(x_entropy, name='loss')

# Gradient clipping is here.
threshold = 1.0
optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.01)
grads_and_vars = optimizer.compute_gradients(loss)
capped_gvs = [(tf.clip_by_value(grad, -threshold, threshold), var)
              for grad, var in grads_and_vars]
training_op = optimizer.apply_gradients(capped_gvs)

## Reusing Pretrained Layers

Instead of training a large DNN from scratch, it is generally better to reuse an existing neural network used for a similar task, then reuse the lower layers of this network. This technique is called _transfer learning_.

### Reusing a TensorFlow model

Below is an example of saving and restoring a TensorFlow model using the `tr.train.import_meta_graph()` function:

In [0]:
# Defining the graph and saving it.

tf.reset_default_graph()

X = tf.placeholder(tf.float32, shape=(None, n_inputs), name='X')
y = tf.placeholder(tf.int32, shape=(None), name='y')

training = tf.placeholder_with_default(False, shape=(), name='training')

batch_norm_layer = partial(tf.layers.batch_normalization,
                           training=training, momentum=0.9)

hidden1 = tf.layers.dense(X, n_hidden1, name='hidden1')
bn1 = batch_norm_layer(hidden1)
bn1_act = tf.nn.elu(bn1)

hidden2 = tf.layers.dense(bn1_act, n_hidden2, name='hidden2')
bn2 = batch_norm_layer(hidden2)
bn2_act = tf.nn.elu(bn2)

logits_before_bn = tf.layers.dense(bn2_act, n_outputs, name='outputs')
logits = batch_norm_layer(logits_before_bn)

training = tf.placeholder_with_default(False, shape=(), name='training')

x_entropy = tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits,
                                                           labels=y)
loss = tf.reduce_mean(x_entropy, name='loss')
optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.01)
training_op = optimizer.minimize(loss)

correct = tf.nn.in_top_k(logits, y, 1)
accuracy = tf.reduce_mean(tf.cast(correct, tf.float32), name='accuracy')

init = tf.global_variables_initializer()

# New code here!
saver = tf.train.Saver()
with tf.Session() as sess:
  init.run()
  saver.save(sess, './my_model.ckpt')

In [0]:
# Getting nodes in the previous graph for reuse.

X = tf.get_default_graph().get_tensor_by_name('X:0')
y = tf.get_default_graph().get_tensor_by_name('y:0')
accuracy = tf.get_default_graph().get_tensor_by_name('accuracy:0')
training_op = tf.get_default_graph().get_operation_by_name('GradientDescent')

In [14]:
# Listing the operations in the predefined graph, truncated for readability.

for op in tf.get_default_graph().get_operations()[:20]:
  print(op.name)

X
y
training/input
training
hidden1/kernel/Initializer/random_uniform/shape
hidden1/kernel/Initializer/random_uniform/min
hidden1/kernel/Initializer/random_uniform/max
hidden1/kernel/Initializer/random_uniform/RandomUniform
hidden1/kernel/Initializer/random_uniform/sub
hidden1/kernel/Initializer/random_uniform/mul
hidden1/kernel/Initializer/random_uniform
hidden1/kernel
hidden1/kernel/Assign
hidden1/kernel/read
hidden1/bias/Initializer/zeros
hidden1/bias
hidden1/bias/Assign
hidden1/bias/read
hidden1/MatMul
hidden1/BiasAdd


Below is an example of creating a collection of important operations. This is often helpful if the graph is large and you only want to reuse certain operations.

In [0]:
# Defining the original graph.

tf.reset_default_graph()

X = tf.placeholder(tf.float32, shape=(None, n_inputs), name='X')
y = tf.placeholder(tf.int32, shape=(None), name='y')

training = tf.placeholder_with_default(False, shape=(), name='training')

batch_norm_layer = partial(tf.layers.batch_normalization,
                           training=training, momentum=0.9)

hidden1 = tf.layers.dense(X, n_hidden1, name='hidden1')
bn1 = batch_norm_layer(hidden1)
bn1_act = tf.nn.elu(bn1)

hidden2 = tf.layers.dense(bn1_act, n_hidden2, name='hidden2')
bn2 = batch_norm_layer(hidden2)
bn2_act = tf.nn.elu(bn2)

logits_before_bn = tf.layers.dense(bn2_act, n_outputs, name='outputs')
logits = batch_norm_layer(logits_before_bn)

training = tf.placeholder_with_default(False, shape=(), name='training')

x_entropy = tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits,
                                                           labels=y)
loss = tf.reduce_mean(x_entropy, name='loss')
optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.01)
training_op = optimizer.minimize(loss)

correct = tf.nn.in_top_k(logits, y, 1)
accuracy = tf.reduce_mean(tf.cast(correct, tf.float32), name='accuracy')

init = tf.global_variables_initializer()

# New code here!
for op in (X, y, accuracy, training_op):
  tf.add_to_collection('my_important_ops', op)

saver = tf.train.Saver()
with tf.Session() as sess:
  init.run()
  saver.save(sess, './my_model.ckpt')

In [0]:
# Restoring the graph and getting the operations from the collection.

tf.reset_default_graph()

saver = tf.train.import_meta_graph('./my_model.ckpt.meta')

X, y, accuracy, training_op = tf.get_collection('my_important_ops')

You can also define a restore Saver which will only restore specified variables. This is useful if you only want to restore the lower layers of a neural network. Below is an example of restoring only the lower layers of a neural network using a saver which only restores specified variables.

In [0]:
# Defining a graph with 5 hidden layers. First implementing gradient clipping.

tf.reset_default_graph()

n_inputs = 28 ** 2

X = tf.placeholder(tf.float32, shape=(None, n_inputs), name='X')
y = tf.placeholder(tf.int32, shape=(None,), name='y')

training = tf.placeholder_with_default(False, shape=(None))

with tf.name_scope('dnn'):
  layer = X
  for i, n_hidden in enumerate((300, 100, 50, 20)):
    hidden = tf.layers.dense(layer, n_hidden1,
                             name='hidden{}'.format(i+1))
    layer = tf.nn.relu(hidden, name='relu{}'.format(i+1))
  logits = tf.layers.dense(layer, n_outputs, name='outputs')

with tf.name_scope('loss'):
  x_entropy = tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits,
                                                             labels=y)
  loss = tf.reduce_mean(x_entropy, name='loss')

with tf.name_scope('train'):
  optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.01)
  training_op = optimizer.minimize(loss)

with tf.name_scope('eval'):
  correct = tf.nn.in_top_k(logits, y, 1)
  accuracy = tf.reduce_mean(tf.cast(correct, tf.float32), name='accuracy')

In [0]:
# Defining the savers and running the training.

reuse_vars = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES,
                               scope='hidden[123]') # Regex string
reuse_saver = tf.train.Saver(reuse_vars)

init = tf.global_variables_initializer()
saver = tf.train.Saver()

In [19]:
# Training the model one time to train the lower layers of the neural network.

import os

model_path = './my_model.ckpt'

with tf.Session() as sess:
  init.run()
  for epoch in range(n_epochs):
    for X_batch, y_batch in shuffle_batch(X_train, y_train, batch_size):
      sess.run(training_op, feed_dict={X: X_batch, y: y_batch, training: True})
    accuracy_val = accuracy.eval(feed_dict={X: X_valid, y: y_valid})
    print('Epoch: {} Validation set accuracy: {}'.format(epoch, accuracy_val))
  saver.save(sess, model_path)

Epoch: 0 Validation set accuracy: 0.8141999840736389
Epoch: 1 Validation set accuracy: 0.8755999803543091
Epoch: 2 Validation set accuracy: 0.8989999890327454
Epoch: 3 Validation set accuracy: 0.9093999862670898
Epoch: 4 Validation set accuracy: 0.9172000288963318
Epoch: 5 Validation set accuracy: 0.9236000180244446
Epoch: 6 Validation set accuracy: 0.9296000003814697
Epoch: 7 Validation set accuracy: 0.9348000288009644
Epoch: 8 Validation set accuracy: 0.9366000294685364
Epoch: 9 Validation set accuracy: 0.9405999779701233
Epoch: 10 Validation set accuracy: 0.9444000124931335
Epoch: 11 Validation set accuracy: 0.946399986743927
Epoch: 12 Validation set accuracy: 0.9491999745368958
Epoch: 13 Validation set accuracy: 0.9488000273704529
Epoch: 14 Validation set accuracy: 0.9535999894142151
Epoch: 15 Validation set accuracy: 0.9552000164985657
Epoch: 16 Validation set accuracy: 0.9581999778747559
Epoch: 17 Validation set accuracy: 0.9588000178337097
Epoch: 18 Validation set accuracy: 0.95

In [20]:
# Training the model again, this time we restore the lower layers

new_model_path = './my_model_new.ckpt'

with tf.Session() as sess:
  init.run()
  reuse_saver.restore(sess, model_path)
  for epoch in range(n_epochs):
    for X_batch, y_batch in shuffle_batch(X_train, y_train, batch_size):
      sess.run(training_op, feed_dict={X: X_batch, y: y_batch, training: True})
    accuracy_val = accuracy.eval(feed_dict={X: X_valid, y: y_valid})
    print('Epoch: {} Validation set accuracy: {}'.format(epoch, accuracy_val))
  saver.save(sess, new_model_path)

Instructions for updating:
Use standard file APIs to check for files with this prefix.
INFO:tensorflow:Restoring parameters from ./my_model.ckpt
Epoch: 0 Validation set accuracy: 0.9404000043869019
Epoch: 1 Validation set accuracy: 0.946399986743927
Epoch: 2 Validation set accuracy: 0.9488000273704529
Epoch: 3 Validation set accuracy: 0.9534000158309937
Epoch: 4 Validation set accuracy: 0.9575999975204468
Epoch: 5 Validation set accuracy: 0.9570000171661377
Epoch: 6 Validation set accuracy: 0.9592000246047974
Epoch: 7 Validation set accuracy: 0.9607999920845032
Epoch: 8 Validation set accuracy: 0.9620000123977661
Epoch: 9 Validation set accuracy: 0.9643999934196472
Epoch: 10 Validation set accuracy: 0.9638000130653381
Epoch: 11 Validation set accuracy: 0.9660000205039978
Epoch: 12 Validation set accuracy: 0.9664000272750854
Epoch: 13 Validation set accuracy: 0.9664000272750854
Epoch: 14 Validation set accuracy: 0.967199981212616
Epoch: 15 Validation set accuracy: 0.9696000218391418
Epo

### Reusing Models From another Frameworks

Below is an example to illustrate how to import models from other frameworks into TensorFlow. This code sets the kernel and bias of a hidden layer at the start of the TensorFlow session using the initializer.


In [21]:
tf.reset_default_graph()

n_inputs = 2
n_hidden = 3

original_w = [[1., 2., 3.], [4., 5., 6.]]
original_b = [7., 8., 9.]

X = tf.placeholder(tf.float32, shape=(None, n_inputs), name='X')
hidden = tf.layers.dense(X, n_hidden, activation=tf.nn.relu, name='hidden')

graph = tf.get_default_graph()
assign_kernel = graph.get_operation_by_name('hidden/kernel/Assign')
assign_bias = graph.get_operation_by_name('hidden/bias/Assign')

init_kernel = assign_kernel.inputs[1]
init_bias = assign_bias.inputs[1]

init = tf.global_variables_initializer()

with tf.Session() as sess:
  sess.run(init, feed_dict={
    init_kernel: original_w,
    init_bias: original_b,
  })
  print(hidden.eval(feed_dict={X: [[10., 11.]]}))

[[ 61.  83. 105.]]


Another way is to make dedicated nodes for assigning the hidden layer's kernel and bias, then set them at any point using placeholders. This is more verbose but allows more control.

In [22]:
tf.reset_default_graph()

n_inputs = 2
n_hidden = 3

original_w = [[1., 2., 3.], [4., 5., 6.]]
original_b = [7., 8., 9.]

X = tf.placeholder(tf.float32, shape=(None, n_inputs), name='X')
hidden = tf.layers.dense(X, n_hidden, activation=tf.nn.relu, name='hidden')

with tf.variable_scope('', default_name='', reuse=True):
  hidden_weights = tf.get_variable('hidden/kernel')
  hidden_bias = tf.get_variable('hidden/bias')
  
original_weights = tf.placeholder(tf.float32, shape=(n_inputs, n_hidden))
original_bias = tf.placeholder(tf.float32, shape=(n_hidden))

assign_hidden_weights = tf.assign(hidden_weights, original_weights)
assign_hidden_bias = tf.assign(hidden_bias, original_bias)

init = tf.global_variables_initializer()

with tf.Session() as sess:
  sess.run(init)
  sess.run(assign_hidden_weights, feed_dict={original_weights: original_w})
  sess.run(assign_hidden_bias, feed_dict={original_bias: original_b})
  print(hidden.eval(feed_dict={X: [[10., 11.]]}))

[[ 61.  83. 105.]]


### Freezing Lower Layers

Since it is likely that the lower layers of the DNN have learned to detect the lower level patterns in the training set, you can reuse the layers as they are by "freezing" their weights. As a result, the higher level layers will be easier to train.

In [0]:
# Defining a new graph which will use the lower level
# layers from the section on using pre-trained layers.

tf.reset_default_graph()

n_inputs = 28 ** 2

X = tf.placeholder(tf.float32, shape=(None, n_inputs), name='X')
y = tf.placeholder(tf.int32, shape=(None,), name='y')

training = tf.placeholder_with_default(False, shape=(None))

with tf.name_scope('dnn'):
  layer = X
  for i, n_hidden in enumerate((300, 100, 50, 20)):
    hidden = tf.layers.dense(layer, n_hidden1,
                             name='hidden{}'.format(i+1))
    layer = tf.nn.relu(hidden, name='relu{}'.format(i+1))
  logits = tf.layers.dense(layer, n_outputs, name='outputs')

with tf.name_scope('loss'):
  x_entropy = tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits,
                                                             labels=y)
  loss = tf.reduce_mean(x_entropy, name='loss')

# New code here!
with tf.name_scope('train'):
  training_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES,
                                    scope='hidden[34]|outputs')
  optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.01)
  training_op = optimizer.minimize(loss, var_list=training_vars)

with tf.name_scope('eval'):
  correct = tf.nn.in_top_k(logits, y, 1)
  accuracy = tf.reduce_mean(tf.cast(correct, tf.float32), name='accuracy')
  
reuse_vars = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES,
                               scope='hidden[123]') # Regex string
reuse_saver = tf.train.Saver(reuse_vars)

init = tf.global_variables_initializer()
saver = tf.train.Saver()

In [24]:
# Training the model with the frozen layers.

with tf.Session() as sess:
  init.run()
  reuse_saver.restore(sess, model_path)
  for epoch in range(n_epochs):
    for X_batch, y_batch in shuffle_batch(X_train, y_train, batch_size):
      sess.run(training_op, feed_dict={X: X_batch, y: y_batch, training: True})
    accuracy_val = accuracy.eval(feed_dict={X: X_valid, y: y_valid})
    print('Epoch: {} Validation set accuracy: {}'.format(epoch, accuracy_val))
  saver.save(sess, new_model_path)

INFO:tensorflow:Restoring parameters from ./my_model.ckpt
Epoch: 0 Validation set accuracy: 0.9422000050544739
Epoch: 1 Validation set accuracy: 0.9470000267028809
Epoch: 2 Validation set accuracy: 0.9502000212669373
Epoch: 3 Validation set accuracy: 0.9509999752044678
Epoch: 4 Validation set accuracy: 0.953000009059906
Epoch: 5 Validation set accuracy: 0.9534000158309937
Epoch: 6 Validation set accuracy: 0.9557999968528748
Epoch: 7 Validation set accuracy: 0.954200029373169
Epoch: 8 Validation set accuracy: 0.9553999900817871
Epoch: 9 Validation set accuracy: 0.9570000171661377
Epoch: 10 Validation set accuracy: 0.9563999772071838
Epoch: 11 Validation set accuracy: 0.9584000110626221
Epoch: 12 Validation set accuracy: 0.9592000246047974
Epoch: 13 Validation set accuracy: 0.9611999988555908
Epoch: 14 Validation set accuracy: 0.9603999853134155
Epoch: 15 Validation set accuracy: 0.9593999981880188
Epoch: 16 Validation set accuracy: 0.9603999853134155
Epoch: 17 Validation set accuracy: 0

In [0]:
# Another way to freeze the lower layers is to add a stop_gradient()
# layer in the graph. Any layer below it will be frozen.

tf.reset_default_graph()

n_inputs = 28 ** 2

X = tf.placeholder(tf.float32, shape=(None, n_inputs), name='X')
y = tf.placeholder(tf.int32, shape=(None,), name='y')

training = tf.placeholder_with_default(False, shape=(None))

with tf.name_scope('dnn'):
  layer = X
  for i, n_hidden in enumerate((300, 100, 50, 20)):
    hidden = tf.layers.dense(layer, n_hidden1,
                             name='hidden{}'.format(i+1))
    layer = tf.nn.relu(hidden, name='relu{}'.format(i+1))
    # New code here!
    if i == 1:
      hidden2 = layer
      layer = tf.stop_gradient(layer)
  logits = tf.layers.dense(layer, n_outputs, name='outputs')

with tf.name_scope('loss'):
  x_entropy = tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits,
                                                             labels=y)
  loss = tf.reduce_mean(x_entropy, name='loss')

with tf.name_scope('train'):
  optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.01)
  training_op = optimizer.minimize(loss)

with tf.name_scope('eval'):
  correct = tf.nn.in_top_k(logits, y, 1)
  accuracy = tf.reduce_mean(tf.cast(correct, tf.float32), name='accuracy')
  
reuse_vars = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES,
                               scope='hidden[123]') # Regex string
reuse_saver = tf.train.Saver(reuse_vars)

init = tf.global_variables_initializer()
saver = tf.train.Saver()

In [26]:
with tf.Session() as sess:
  init.run()
  reuse_saver.restore(sess, model_path)
  for epoch in range(n_epochs):
    for X_batch, y_batch in shuffle_batch(X_train, y_train, batch_size):
      sess.run(training_op, feed_dict={X: X_batch, y: y_batch, training: True})
    accuracy_val = accuracy.eval(feed_dict={X: X_valid, y: y_valid})
    print('Epoch: {} Validation set accuracy: {}'.format(epoch, accuracy_val))
  saver.save(sess, new_model_path)

INFO:tensorflow:Restoring parameters from ./my_model.ckpt
Epoch: 0 Validation set accuracy: 0.9440000057220459
Epoch: 1 Validation set accuracy: 0.9467999935150146
Epoch: 2 Validation set accuracy: 0.949999988079071
Epoch: 3 Validation set accuracy: 0.9502000212669373
Epoch: 4 Validation set accuracy: 0.9520000219345093
Epoch: 5 Validation set accuracy: 0.9556000232696533
Epoch: 6 Validation set accuracy: 0.9531999826431274
Epoch: 7 Validation set accuracy: 0.954800009727478
Epoch: 8 Validation set accuracy: 0.9549999833106995
Epoch: 9 Validation set accuracy: 0.95660001039505
Epoch: 10 Validation set accuracy: 0.9557999968528748
Epoch: 11 Validation set accuracy: 0.95660001039505
Epoch: 12 Validation set accuracy: 0.9559999704360962
Epoch: 13 Validation set accuracy: 0.9589999914169312
Epoch: 14 Validation set accuracy: 0.9553999900817871
Epoch: 15 Validation set accuracy: 0.9574000239372253
Epoch: 16 Validation set accuracy: 0.9584000110626221
Epoch: 17 Validation set accuracy: 0.958

### Caching Frozen Layers

One way to improve the speed of training when you have frozen the lower layers is to run the training set through the frozen layers at the start of training. The code below shows an example of training a model this way, reusing the TensorFlow graph defined above.

In [27]:
with tf.Session() as sess:
  init.run()
  reuse_saver.restore(sess, model_path)
  h2_cache = sess.run(hidden2, feed_dict={X: X_train})
  for epoch in range(n_epochs):
    for h2_batch, y_batch in shuffle_batch(h2_cache, y_train, batch_size):
      sess.run(training_op, feed_dict={hidden2: h2_batch, y: y_batch})
    accuracy_val = accuracy.eval(feed_dict={X: X_valid, y: y_valid})
    print('Epoch: {} Validation set accuracy: {}'.format(epoch, accuracy_val))
  saver.save(sess, new_model_path)

INFO:tensorflow:Restoring parameters from ./my_model.ckpt
Epoch: 0 Validation set accuracy: 0.9455999732017517
Epoch: 1 Validation set accuracy: 0.949999988079071
Epoch: 2 Validation set accuracy: 0.9509999752044678
Epoch: 3 Validation set accuracy: 0.953000009059906
Epoch: 4 Validation set accuracy: 0.9526000022888184
Epoch: 5 Validation set accuracy: 0.9552000164985657
Epoch: 6 Validation set accuracy: 0.9559999704360962
Epoch: 7 Validation set accuracy: 0.9567999839782715
Epoch: 8 Validation set accuracy: 0.9556000232696533
Epoch: 9 Validation set accuracy: 0.9574000239372253
Epoch: 10 Validation set accuracy: 0.9570000171661377
Epoch: 11 Validation set accuracy: 0.9585999846458435
Epoch: 12 Validation set accuracy: 0.9592000246047974
Epoch: 13 Validation set accuracy: 0.9592000246047974
Epoch: 14 Validation set accuracy: 0.9607999920845032
Epoch: 15 Validation set accuracy: 0.9595999717712402
Epoch: 16 Validation set accuracy: 0.9607999920845032
Epoch: 17 Validation set accuracy: 0

### Tweaking, Dropping, or Replacing the Upper Layers

The higher the layer is in the previously trained neural network, the less likely it will be useful in training a new network for different tasks. The output layer is generally always replaced, in many cases the old output layer may be a different shape than the output for the new layer.

One way to try to determine how many layers to freeze is to try freezing all of the hidden layers first, then training the neural network again after unfreezing one or two of the top layers and seeing if performance improves. The more training data, the more layers you can unfreeze.

If you still cannot get good performance with little training data, you can try dropping the top hidden layers and freezing the lower ones. You can keep trying until you find the right number of layers to reuse. If you have a lot of training data, you can replace the top layers instead of dropping them or add more layers.

### Model Zoos

A _model zoo_ is a collection of machine learning models that other people have trained for different machine learning tasks. TensorFlow has its own [model zoo](https://github.com/tensorflow/models). Another popular model zoo is the [Caffe Model Zoo](https://github.com/BVLC/caffe/wiki/Model-Zoo). Saumitro Dasgupta wrote a [converter](https://github.com/ethereon/caffe-tensorflow) to convert Caffe models to TensorFlow.

### Unsupervised Training

If you do not have a large labeled training set and there is not a previously trained model for a similar task, but you do have a large unlabled training set, you can use an _unsupervised pretraining_ algorithms such as _Restricted Boltzmann Machines_ (RBMs) or autoencoders to train successlive DNN layers to find low level features in the training set. Afterwards you can tune the model using supervised learning and backpropagation.

### Pretraining an Auxilary Task

One way to train a DNN if you have limited labeled training data is to train a neural network for a similar task then reuse the lower layers to train a new DNN for the actual task.

Another strategy is to take unlabeled training data and take some data and modify it. You label the unmodified data as "good" and the modified data as "bad" so that you can train a DNN classifier using a supervised algorithm to get lower layers which recognize lower level features for the actual task.

## Faster Optimizers

In this section we will examine optimizers which are faster than plain Gradient Descent which can help speed up training DNNs.

### Momentum Optimizers

Recall that Gradient Descent updates the weight vector, $\theta$ by subtracting the weight of the gradient of the cost function, $\nabla_\theta J(\theta)$ multiplied by the learning rate, $\eta$, i.e.

$$ \theta \leftarrow \theta - \eta\nabla_\theta J(\theta) $$

[Momentum optimization](https://www.researchgate.net/publication/243648538_Some_methods_of_speeding_up_the_convergence_of_iteration_methods), proposed by Boris Polyak in 1964 subtracts the local gradient from a momentum vector, $\mathbf{m}$, multiplied by the learning rate $\eta$ and it updates the weights by adding the momentum vector to the weight vector, $\theta$. To prevent the momentum from growing too large, the algorithm introduces a hyperparameter, $\beta$, called the _momentum_, which is between 0 and 1 (typically 0.9). The algorithm can be written in two stages:

$$ \begin{matrix}
1. && \mathbf{m} \leftarrow \beta\,\mathbf{m} - \eta\nabla_\theta J(\theta) \\
2. && \theta \leftarrow \theta + \mathbf{m}
\end{matrix}  $$

It follows that if the gradient remains constant, the terminal velocity (the maximum size of the weight updates) is given by the learning rate, $\eta$, multiplied by $\frac{1}{1\,-\,\beta}$. If $\beta = 0.9$ then Momentum optimization converges 10 times as quickly as Gradient Descent. Due to the larger steps, Momentum optimization can escape local optima much more quickly than Gradient Descent. Below is an example of implementing Momentum optimization in TensorFlow:


In [0]:
optimizer = tf.train.MomentumOptimizer(learning_rate=0.01, momentum=0.9)

The main drawback of Momentum optimization is it adds another hyperparameter to tune, but generally 0.9 works in practice.

### Nesterov Accelerated Gradient

One improvement to Momentum optimization proposed by Yuli Nesterov in 1983 is called [Nesterov Momentum Optimization](https://scholar.google.com/citations?view_op=view_citation&citation_for_view=DJ8Ep8YAAAAJ:hkOj_22Ku90C) or _Nesterov Accelerated Gradient_ (NAG). The algorithm works in the following steps:

$$ \begin{matrix}
1. && \mathbf{m} \leftarrow \beta\,\mathbf{m} - \eta\nabla_\theta J(\theta + \beta\,\mathbf{m}) \\
2. && \theta \leftarrow \theta + \mathbf{m}
\end{matrix} $$

The only difference between NAG and Momentum optimization is that it computes the gradient of the cost function at $\theta + \beta\,\mathbf{m}$ instead of at the current value of $\theta$. This improves the algorithm since the momentum vector is usually pointing in the direction of the optimal value. Below is an example of NAG using TensorFlow:

In [0]:
optimizer = tf.train.MomentumOptimizer(learning_rate=0.01, momentum=0.9,
                                       use_nesterov=True)

### AdaGrad Optimizer

The [AdaGrad algorithm](http://www.jmlr.org/papers/volume12/duchi11a/duchi11a.pdf) is an algorithm designed to decay the learning rate, doing so faster for steeper gradients, in order to converge more directly towards the global optimum. This is called an _adaptive learning rate_. The algorithm works in two stages:

The first stage computes a vector, $\mathbf{s}$, given by

$$ \mathbf{s} \leftarrow \mathbf{s} + \nabla_\theta J(\theta) \otimes \nabla_\theta J(\theta) $$

where $\otimes$ denotes component-wise multiplication. This is a vectorized form of the following operation:

$$ s_i \leftarrow s_i + \left( \frac{\partial J(\theta)}{\partial \theta_i} \right)^2 $$

where $s_i$ denotes a component of $\mathbf{s}$.The vector $\mathbf{s}$ accumulates the squares of each component of the gradient, and it grows larger when the gradient is steeper.

The second step is similar to Gradient Descent but with a modification. It is given by

$$ \theta \leftarrow \theta - \eta\nabla_\theta J(\theta) \oslash \sqrt{\mathbf{s} + \epsilon} $$

where $\oslash$ denotes component-wise division and $\epsilon$ is a smoothing term to avoid division by zero (it is typically 10<sup>-10</sup>). This operation is a vectorized form of the following operation:

$$ \theta_i \leftarrow \theta_i - \eta \left( \frac{\partial J(\theta)}{\partial \theta_i} \right) \left( s_i + \epsilon \right)^{-1/2} $$

where $\theta_i$ is each component of the vector $\theta$.

AdaGrad performs well for simple quadratic problems, but often stops too early when training DNNs because the learning rate degrades to zero. Below is an example of an AdaGrad optimizer in TensorFlow:

In [0]:
optimizer = tf.train.AdagradOptimizer(learning_rate=0.01)

### RMSProp

Since AdaGrad can decay the learning rate too quickly, [RMSProp](http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf) slows the rate of decay of the learning rate by accumulating only the most recent iterations using exponential decay. The algorithm works as follows:

$$ \begin{matrix}
1. && \mathbf{s} \leftarrow \beta\,\mathbf{s} + (1 - \beta\, ) \nabla_\theta J(\theta) \otimes \nabla_\theta J(\theta) \\
2. && \theta \leftarrow \theta - \eta\nabla_\theta J(\theta) \oslash \sqrt{\mathbf{s} + \epsilon}
\end{matrix} $$

where $\beta$ is the decay rate and is typically set to 0.9. This value typically works in practice so you do not have to tune it. Except for simple problems, RMSProp typically performs better than AdaGrad. Below is an example of RMSProp with TensorFlow:

In [0]:
optimizer = tf.train.RMSPropOptimizer(learning_rate=0.1, momentum=0.9,
                                      decay=0.9, epsilon=True)

### Adam Optimizer

[Adam](https://arxiv.org/pdf/1412.6980v8.pdf) which stands for _adaptive moment estimation_ combines Momentum optimizers and RMSProp, keeping track of an exponentially decaying average of past gradients and past squared gradients. The algorithm works as follows:

$$ \begin{matrix}
1. && \mathbf{m} \leftarrow \beta_1\mathbf{m} - (1 - \beta_1) \nabla_\theta J(\theta) \\
2. && \mathbf{s} \leftarrow \beta_2\mathbf{s} + (1 - \beta_2) \nabla_\theta J(\theta) \otimes \nabla_\theta J(\theta) \\
3. && \mathbf{m} \leftarrow (1 - \beta_1)^{-\,t}\,\mathbf{m} \\
4. && \mathbf{s} \leftarrow (1 - \beta_2)^{-\,t} \,\mathbf{s} \\
5. && \theta \leftarrow \theta + \eta \, \mathbf{m} \oslash \sqrt{\mathbf{s} + \epsilon}
\end{matrix}$$

where $t$ is the training iteration number, $\beta_1$ is the momentum decay rate, and $\beta_2$ is the scaling decay rate. Steps 1, 2, and 5 resemble both Momentum optimization and RMSProp. Steps 3 and 4 account for the fact that $\mathbf{m}$ and $\mathbf{s}$ are initialized to zero vectors, so these steps prevent the algorithm from being biased towards zero at the beginning.

The momentum decay rate, $\beta_1$, is typically initialized to 0.9. The scaling decay rate, $\beta_2$ is initialized to 0.99. These values perform well in practice so its rare you have to tune them. The smoothing parameter, $\epsilon$, is typically initialized to 10<sup>-10</sup>. Since the model's learning rate is adaptive, it is not as necessary to tune the learning rate, $\eta$.

Below is an example of an Adam optimizer in TensorFlow:

In [0]:
optimizer = tf.train.AdamOptimizer(learning_rate=0.001)

### Learning Rate Scheduling

Finding a good learning rate can be difficult. If the learning rate is too high the algorithm can diverge. If it is too low the algorithm will take too long to train. If it is slightly too high the algorithm may dance around the optimum and not converge unless you are using an apadtive learning rate algorithm like AdaGrad, RMSProp, or Adam.

Below are some strategies for _learning rate scheduling_, training methods which start with a high learning rate and gradually reduce it as you get closer to the optimum:

#### Predetermined piecewise constant learning rate

Setting the learning rate to a high value at the start of training, e.g. $\eta_0 = 0.01$ then reducing it to a lower rate, e.g. $\eta_1 = 0.001$ after a constant number of training iterations. This performs well but requires tuning to find which training epoch is the right one to reduce the learning rate at.

#### Performance scheduling

Measure the validation error every $N$ steps and reduce the learning rate by some constant factor, $\lambda$, when the error starts increasing.

#### Exponential scheduling

The learning rate is an exponential function of the iteration number, i.e.

$$ \eta(t) = \eta_0^{\;-t/r} $$

This requires tuning of the initial learning rate, $\eta_0$, and the rate of decay, $r$.

#### Power scheduling

Set the learning rate to the exponential function

$$ \eta(t) = \eta_0 (1 + t/r)^{-c} $$

where $c$ is typically set to 1. This also requires tuning like exponential scheduling, but in this case the learning rate drops much more slowly.

In [0]:
# An example of implementing a learning schedule with TensorFlow.

initial_learning_rate = 0.1
decay_steps = 10000
decay_rate = 0.1
global_step = tf.Variable(0, trainable=False, name='global_step')
learning_rate = tf.train.exponential_decay(initial_learning_rate, global_step,
                                           decay_steps, decay_rate)
optimizer = tf.train.MomentumOptimizer(learning_rate, momentum=0.9)
training_op = optimizer.minimize(loss, global_step=global_step)

For adaptive learning rate optimization methods like AdaGrad, RMSProp or Adam, learning rate scheduling is not necessary.

## Avoiding Overfitting Through Regularization

Since neural networks have tens of thousands of parameters (sometimes millions) they are prone to overfitting the data. The following section goes over the most common ways of introducing regularization into the model to prevent overfitting.

### Early Stopping

One way to prevent overfitting is to implement early stopping (introduced in chapter 4). After a certain number of training iterations (e.g. every 50 iterations), you measure the model's error on the validation set. If after a certain number of training iterations, the error does not decrease, stop training the model. Early stopping typically works best when combined with another regularization technique.

### $\ell_1$ and $\ell_2$ Regularization

You can add $\ell_1$ or $\ell_2$ regularization to neural networks' weights (typically not the biases) just like linear models in chapter 4. One way to do this with TensorFlow is to simply add the regularization to the cost function. Below is an example of adding $\ell_1$ regularization to a neural network with one hidden layer using TensorFlow.

In [0]:
tf.reset_default_graph()

n_inputs = 28 ** 2
n_hidden1 = 300
n_outputs = 10

X = tf.placeholder(tf.float32, shape=(None, n_inputs), name='X')
y = tf.placeholder(tf.int32, shape=(None), name='y')

with tf.name_scope('dnn'):
  hidden1 = tf.layers.dense(X, n_hidden1, activation=tf.nn.relu,
                            name='hidden1')
  logits = tf.layers.dense(hidden1, n_outputs, name='outputs')

W1 = tf.get_default_graph().get_tensor_by_name('hidden1/kernel:0')
W2 = tf.get_default_graph().get_tensor_by_name('outputs/kernel:0')

# New code here!
scale = 0.001 # L1 regularization parameter
with tf.name_scope('loss'):
  x_entropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y,
                                                             logits=logits)
  base_loss = tf.reduce_mean(x_entropy, name='avg_x_entropy')
  reg_losses = tf.reduce_sum(tf.abs(W1)) + tf.reduce_sum(tf.abs(W2))
  loss = tf.add(base_loss, scale * reg_losses, name='loss')
  
with tf.name_scope('eval'):
  correct = tf.nn.in_top_k(logits, y, 1)
  accuracy = tf.reduce_mean(tf.cast(correct, tf.float32), name='accuracy')
  
learning_rate = 0.01

with tf.name_scope('train'):
  optimizer = tf.train.GradientDescentOptimizer(learning_rate)
  training_op = optimizer.minimize(loss)
  
init = tf.global_variables_initializer()

In [39]:
# Training the model and printing the validation error.

with tf.Session() as sess:
  init.run()
  for epoch in range(n_epochs):
    for X_batch, y_batch in shuffle_batch(X_train, y_train, batch_size):
      sess.run(training_op, feed_dict={X: X_batch, y: y_batch})
    accuracy_val = accuracy.eval(feed_dict={X: X_valid, y: y_valid})
    print('Epoch: {} Accuracy: {}'.format(epoch, accuracy_val))

Epoch: 0 Accuracy: 0.8238000273704529
Epoch: 1 Accuracy: 0.8661999702453613
Epoch: 2 Accuracy: 0.8805999755859375
Epoch: 3 Accuracy: 0.8880000114440918
Epoch: 4 Accuracy: 0.892799973487854
Epoch: 5 Accuracy: 0.8956000208854675
Epoch: 6 Accuracy: 0.899399995803833
Epoch: 7 Accuracy: 0.9028000235557556
Epoch: 8 Accuracy: 0.906000018119812
Epoch: 9 Accuracy: 0.9053999781608582
Epoch: 10 Accuracy: 0.9053999781608582
Epoch: 11 Accuracy: 0.9064000248908997
Epoch: 12 Accuracy: 0.9065999984741211
Epoch: 13 Accuracy: 0.9071999788284302
Epoch: 14 Accuracy: 0.9070000052452087
Epoch: 15 Accuracy: 0.9061999917030334
Epoch: 16 Accuracy: 0.9049999713897705
Epoch: 17 Accuracy: 0.9056000113487244
Epoch: 18 Accuracy: 0.9074000120162964
Epoch: 19 Accuracy: 0.9046000242233276


Below is an alternative way of adding $\ell_1$ regularization using `tf.layers.dense()`. You can use `l1_regularizer()`, `l2_regularizer()`, or `l1_l2_regularizer()` functions to add regularization to each layer.

In [0]:
tf.reset_default_graph()

n_inputs = 28 ** 2
n_hidden1 = 300
n_hidden2 = 100
n_outputs = 10
scale = 0.001

X = tf.placeholder(tf.float32, shape=(None, n_inputs), name='X')
y = tf.placeholder(tf.int32, shape=(None), name='y')

regularized_dense_layer = partial(
    tf.layers.dense, activation=tf.nn.relu,
    kernel_regularizer=tf.contrib.layers.l1_regularizer(scale))

with tf.name_scope('dnn'):
  hidden1 = regularized_dense_layer(X, n_hidden1, name='hidden1')
  hidden2 = regularized_dense_layer(hidden1, n_hidden2, name='hidden2')
  logits = regularized_dense_layer(hidden2, n_outputs, name='outputs')
  
with tf.name_scope('loss'):
  x_entropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y,
                                                             logits=logits)
  base_loss = tf.reduce_mean(x_entropy, name='avg_x_entropy')
  reg_losses = tf.get_collection(tf.GraphKeys.REGULARIZATION_LOSSES)
  loss = tf.add_n([base_loss] + reg_losses, name='loss')
  
with tf.name_scope('eval'):
  correct = tf.nn.in_top_k(logits, y, 1)
  accuracy = tf.reduce_mean(tf.cast(correct, tf.float32), name='accuracy')
  
learning_rate = 0.01

with tf.name_scope('train'):
  optimizer = tf.train.GradientDescentOptimizer(learning_rate)
  training_op = optimizer.minimize(loss)
  
init = tf.global_variables_initializer()

In [42]:
with tf.Session() as sess:
  init.run()
  for epoch in range(n_epochs):
    for X_batch, y_batch in shuffle_batch(X_train, y_train, batch_size):
      sess.run(training_op, feed_dict={X: X_batch, y: y_batch})
    accuracy_val = accuracy.eval(feed_dict={X: X_valid, y: y_valid})
    print('Epoch: {} Accuracy: {}'.format(epoch, accuracy_val))

Epoch: 0 Accuracy: 0.6510000228881836
Epoch: 1 Accuracy: 0.8587999939918518
Epoch: 2 Accuracy: 0.8826000094413757
Epoch: 3 Accuracy: 0.8948000073432922
Epoch: 4 Accuracy: 0.9017999768257141
Epoch: 5 Accuracy: 0.9083999991416931
Epoch: 6 Accuracy: 0.9083999991416931
Epoch: 7 Accuracy: 0.9111999869346619
Epoch: 8 Accuracy: 0.9120000004768372
Epoch: 9 Accuracy: 0.9161999821662903
Epoch: 10 Accuracy: 0.9151999950408936
Epoch: 11 Accuracy: 0.9154000282287598
Epoch: 12 Accuracy: 0.9160000085830688
Epoch: 13 Accuracy: 0.9169999957084656
Epoch: 14 Accuracy: 0.9165999889373779
Epoch: 15 Accuracy: 0.9165999889373779
Epoch: 16 Accuracy: 0.9165999889373779
Epoch: 17 Accuracy: 0.9169999957084656
Epoch: 18 Accuracy: 0.9154000282287598
Epoch: 19 Accuracy: 0.9146000146865845


### Dropout

The most popular regularization technique for DNNs is [_dropout_](https://arxiv.org/pdf/1207.0580.pdf) proposed by G. E. Hinton in 2012 and in this [paper](http://jmlr.org/papers/volume15/srivastava14a/srivastava14a.pdf) by Nitish Srivastava et al. which has been shown to improve DNN accuracy by 1-2%.

The algoritm is simple, at every training step each neuron has a probability, $p$, of being temporarily excluded during that round of training. That hyperparameter, $p$, is called the _dropout rate_. Dropout improves the performance of DNNs because it lets you train the DNN as if it were an ensemble of $2^N$ possible DNNs (where $N$ is the number of neurons) so it prevents overfitting. After training, you need to multiply each connection by $(1 - p)$ or the _keep rate_ to make up for the fact that each neuron on average had fewer connections during training. Alternatively you can divide each connection by $(1-p)$ during training, which as a similar effect but is not exactly equivalent.

Below is an example of implementing dropout using TensorFlow:

In [0]:
tf.reset_default_graph()

n_inputs = 28 ** 2
n_hidden1 = 300
n_hidden2 = 100
n_outputs = 10
dropout_rate = 0.5 # == 1 - keep_rate

X = tf.placeholder(tf.float32, shape=(None, n_inputs), name='X')
y = tf.placeholder(tf.int32, shape=(None), name='y')

training = tf.placeholder_with_default(False, shape=(), name='training')

X_drop = tf.layers.dropout(X, dropout_rate, training=training)

with tf.name_scope('dnn'):
  hidden1 = tf.layers.dense(X_drop, n_hidden1, activation=tf.nn.relu,
                            name='hidden1')
  hidden1_drop = tf.layers.dropout(hidden1, dropout_rate, training=training)
  hidden2 = tf.layers.dense(hidden1_drop, n_hidden2, activation=tf.nn.relu,
                            name='hidden2')
  hidden2_drop = tf.layers.dropout(hidden2, dropout_rate, training=training)
  logits = tf.layers.dense(hidden2_drop, n_outputs, name='outputs')

with tf.name_scope('loss'):
  x_entropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y,
                                                             logits=logits)
  loss = tf.reduce_mean(x_entropy, name='loss')
  
with tf.name_scope('eval'):
  correct = tf.nn.in_top_k(logits, y, 1)
  accuracy = tf.reduce_mean(tf.cast(correct, tf.float32), name='accuracy')
  
learning_rate = 0.01

with tf.name_scope('train'):
  optimizer = tf.train.GradientDescentOptimizer(learning_rate)
  training_op = optimizer.minimize(loss)
  
init = tf.global_variables_initializer()

In [49]:
batch_size = 50

with tf.Session() as sess:
  init.run()
  for epoch in range(n_epochs):
    for X_batch, y_batch in shuffle_batch(X_train, y_train, batch_size):
      sess.run(training_op, feed_dict={X: X_batch, y: y_batch})
    accuracy_val = accuracy.eval(feed_dict={X: X_valid, y: y_valid})
    print('Epoch: {} Accuracy: {}'.format(epoch, accuracy_val))

Epoch: 0 Accuracy: 0.9065999984741211
Epoch: 1 Accuracy: 0.9215999841690063
Epoch: 2 Accuracy: 0.9340000152587891
Epoch: 3 Accuracy: 0.9404000043869019
Epoch: 4 Accuracy: 0.9440000057220459
Epoch: 5 Accuracy: 0.9473999738693237
Epoch: 6 Accuracy: 0.9509999752044678
Epoch: 7 Accuracy: 0.9562000036239624
Epoch: 8 Accuracy: 0.9584000110626221
Epoch: 9 Accuracy: 0.9616000056266785
Epoch: 10 Accuracy: 0.9624000191688538
Epoch: 11 Accuracy: 0.9652000069618225
Epoch: 12 Accuracy: 0.9679999947547913
Epoch: 13 Accuracy: 0.9679999947547913
Epoch: 14 Accuracy: 0.9692000150680542
Epoch: 15 Accuracy: 0.9700000286102295
Epoch: 16 Accuracy: 0.9718000292778015
Epoch: 17 Accuracy: 0.9710000157356262
Epoch: 18 Accuracy: 0.972000002861023
Epoch: 19 Accuracy: 0.9742000102996826
