# Chapter 11: Training Deep Neural Nets

In the previous chapter, we trained a neural network with 2 hidden layers. More complex problems require networks with more hidden layers with hundreds of neurons per layer. Training these can lead to several problems:

- The _vanishing gradients_ and _exploding gradients_ problem makes lower levels hard to train.
- Training a large network can be very slow.
- A model with millions of parameters risks overfitting the training data.

Below we will discuss methods for solving all of these problems.

## Vanishing/Exploding Gradients Problem

While training a neural network with backpropagation, the algorithm finds the components of the error contributed by each layer to compute the error gradient.  Gradients can often get smaller and smaller as the algorithm progresses, resulting in the gradient contribution from the lower layers approaching zero. This is known as the _vanishing gradient_ problem. Alternatively, the gradient can also can grow bigger and bigger which can cause the algorithm to diverge. This is called the _exploding gradient problem_.

Around 2010, a paper titled ["Understanding the Difficulty of Training Deep Feedforward Neural Networks"](http://proceedings.mlr.press/v9/glorot10a/glorot10a.pdf) found some reasons for this. The sigmoid activation function as well as the random initialization of the weight matrices' elements using a normal distribution with a mean of 0 and a standard deviation of 1. The paper showed the variance of the outputs was much larger than the variance of the inputs. Going forward in the network, the variance kept getting larger and it results in the activation saturating near the horizontal asymptotes, which causes the gradient to vanish.

### Xavier and He Initialization

The authors of the paper found that one way to prevent the vanishing/exploding gradient problem is to ensure that the variance of the input and output of each layer is the same. One way to do this is to initialize the weights matrix using a normal distribution with a mean of 0 and a standard deviation given by

$$ \sigma = \sqrt{\frac{2}{n_\text{ inputs} + n_\text{ outputs}}} $$

or a uniform distribution centered at 0 with a radius, $r$, given by

$$ r = \sqrt{\frac{6}{n_\text{ inputs} + n_\text{ outputs}}} $$

where $n_\text{ inputs}$ and $n_\text{ outputs}$ is the number of input  or output connections in that particular layer. This is often known as _Xavier initialization_ after the author's first name, or sometimes _Glorot initialization_.

For the ReLU activation function, we use a normal distribution with a standard deviation given by

$$ \sigma = \frac{2}{\sqrt{n_\text{ inputs} + n_\text{ outputs}}} $$

or a uniform distribution with a radius given by

$$ r = \sqrt{\frac{24}{n_\text{ inputs} + n_\text{ outputs}}} $$

which is known as _He initialization_. Below is an example of creating a layer of a neural network which uses _He initialization_. By default, `tf.layers.dense()` uses Xavier initialization.

In [1]:
import tensorflow as tf

n_inputs = 28 ** 2
n_hidden = 100

X = tf.placeholder(tf.float32, shape=(None, n_inputs))
he_init = tf.variance_scaling_initializer()
hidden1 = tf.layers.dense(X, n_hidden, activation=tf.nn.relu,
                          kernel_initializer=he_init, name='hidden1')

Instructions for updating:
Use keras.layers.dense instead.
Instructions for updating:
Colocations handled automatically by placer.


In [0]:
import numpy as np

def reset_graph(seed=42):
  tf.reset_default_graph()
  tf.set_random_seed(seed)
  np.random.seed(seed)

### Nonsaturating Activation Functions

One of the causes of the vanishing/exploding gradient problem discussed in the paper is the sigmoid activation function. The ReLU activation function performs much better, but it has a different problem. If some neurons output negative values, after the application of the activation function, their output will be stuck at 0. Since the gradient is also 0, the neuron remains "dead."

One solution to this problem is to use a "leaky" ReLU function, given by

$$ \text{LeakyReLU}(z) = \max(\alpha z, z) $$

where $\alpha$ is the slope ofthe ReLU function when the value of $z$ is less than 0. Researchers have found that this activation function performs better than the "hard" ReLU function. You can even have $\alpha$ be a parameter that the model learns during training. This prevents neurons from completely dying.

Another activation function that performs better than leaky ReLU that was proposed in this [paper](https://arxiv.org/pdf/1511.07289v5.pdf) by Djork-Arné Clevert called the _exponential linear unit_ (ELU) given by

$$ \text{ELU}_\alpha(z) = \left\{ \begin{matrix}
\alpha\,(\exp(z) - 1) && \text{if}\;z < 0 \\
z && \text{if}\; z \geq 0
\end{matrix} \right. $$

It has the following differences from the ReLU function:

- It takes negative values when $z < 0$ . which allows the unit to have an average output closer to 0. This helps alleviate the vanishing gradient problem. You can tweak the hyperparameter, $\alpha$, sets the negative number that ELU approaches.

- It has a nonzero gradient when $z < 0$, preventing the dying units issue.

- The function is differentiable everywhere, which helps the speed of Gradient Descent.

The disadvantage of ELU is that it takes longer to compute than ReLU. The extra time is compensated for the fact that it helps Gradient Descent converge fasted, but it does cause the model to make predictions more slowly.

TensorFlow offeres an implementation of ELU which is used in the code example below:

In [0]:
reset_graph()

n_inputs = 28 ** 2
n_hidden = 100

X = tf.placeholder(tf.float32, shape=(None, n_inputs))
he_init = tf.variance_scaling_initializer()
hidden1 = tf.layers.dense(X, n_hidden, activation=tf.nn.elu, name='hidden1')

TensorFlow does not have an implementation of leaky ReLU, but it is easy to define ourselves:

In [0]:
reset_graph()

def leaky_relu(z, alpha=0.01):
  return tf.maximum(alpha * z, z)

n_inputs = 28 ** 2
n_hidden = 100

X = tf.placeholder(tf.float32, shape=(None, n_inputs))
he_init = tf.variance_scaling_initializer()
hidden1 = tf.layers.dense(X, n_hidden, activation=leaky_relu, name='hidden1')

### Batch Normalization

In this [paper](https://arxiv.org/pdf/1502.03167v3.pdf) Sergey Ioffe and Christian Szegedy proposed a technique called _Batch Normalization_ (BN) to address both the vanishing/exploding gradient problem and the problem that the distribution of each layer's inputs change when the parameters of the previous layers change (i.e. the _Internal Covariate Shift_ problem).

The technique adds an operation to the model just before applying the activation function of each layer. It zero-centers and normalizes the inputs, then it scales and shifts the result using two new parameters per layer. This lets the model learn the optimal mean and shift for each layer.

The algorithm starts by first computing the empirical mean for the current mini-batch, $B$, given by

$$ \mu_B = \frac{1}{m_B} \sum\limits_{i\,=\,1}^{m_B} \mathbf{x}^{(i)} $$

Next, we find the empirical standard deviation, given by

$$ \sigma_B^{\;\;2} = \frac{1}{m_B} \sum\limits_{i\,=\,1}^{m_B} \left( \mathbf{x}^{(i)} - \mu_B \right)^2 $$

Then we zero-center and normalize the inputs in the mini-batch

$$ \hat{\mathbf{x}}^{(i)} = \frac{\mathbf{x}^{(i)} - \mu_B}{\sqrt{\sigma_B^{;\;2} + \epsilon}} $$

where $\epsilon$ is a small number, typically $10^{-5}$, called the _smoothing term_ to avoid division by zero. Finally it computes the output given by

$$ \mathbf{z}^{(i)} = \gamma\,\hat{\mathbf{x}}^{(i)} + \beta $$

where $\gamma$ is the scaling parameter and $\beta$ is the shift parameter which are learned during training.

When the model makes predictions, it uses the empirical mean and standard deviation of the entire training set. In the end, the model ends up learning 4 parameters: the mean of the training set, $\mu$; the standard deviation of the training set, $\sigma$; the scaling parameter, $\gamma$; and the shift parameter, $\beta$.

Adding Batch Normalization to a deep neural network improves the performance of the model, lets you skip normalizing the data before training the data, and helps the model converge to the optimal parameters in fewer training iterations. However, using Batch Normalization causes the model to make predictions slower since it adds another computational step for making predictions.

#### Implementing Batch Normalization with TensorFlow

TensorFlow provides a `tf.nn.batch_normalization()` function which normalizes and centers the data, but you must compute the mean and standard deviation yourself.You also have to handle the creation of the scaling and offfset parameters. TensorFlow also includes a `tf.layers.batch_normalization()` function which handles all of batch normalization for you. Below is an example.

In [0]:
reset_graph()

n_inputs = 28 ** 2 # MNIST dataset.
n_hidden1 = 300
n_hidden2 = 100
n_outputs = 10

X = tf.placeholder(tf.float32, shape=(None, n_inputs), name='X')

# Indicates if the batch normalization should be using the mini-batch's mean
# or the mean of the entire training set (same with standard deviation).
training = tf.placeholder_with_default(False, shape=(), name='training')

hidden1 = tf.layers.dense(X, n_hidden1, name='hidden1')
bn1 = tf.layers.batch_normalization(hidden1, training=training, momentum=0.9)
bn1_act = tf.nn.elu(bn1)

hidden2 = tf.layers.dense(bn1_act, n_hidden2, name='hidden2')
bn2 = tf.layers.batch_normalization(hidden2, training=training, momentum=0.9)
bn2_act = tf.nn.elu(bn2)

logits_before_bn = tf.layers.dense(bn2_act, n_outputs, name='outputs')
logts = tf.layers.batch_normalization(logits_before_bn, training=training,
                                      momentum=0.9)

Instructions for updating:
Use keras.layers.batch_normalization instead.


The BN algorithm uses _exponential decay_ to compute a running average, which is why it requires the _momentum_ parameter. Given a new value, $v$, it updates the running average $\hat{v}$ given by

$$ \hat{v} \leftarrow \hat{v} \times \text{momentum} + v \times (1 - \text{momentum}) $$

Momentum values should be typically close to 1, e.g. 0.9, 0.99, or 0.999.

Below is an example of using _partial application_ using the `functools` library in order to make the code less repetitive.

In [0]:
from functools import partial

reset_graph()

X = tf.placeholder(tf.float32, shape=(None, n_inputs), name='X')

# Indicates if the batch normalization should be using the mini-batch's mean
# or the mean of the entire training set (same with standard deviation).
training = tf.placeholder_with_default(False, shape=(), name='training')

batch_norm_layer = partial(tf.layers.batch_normalization,
                           training=training, momentum=0.9)

hidden1 = tf.layers.dense(X, n_hidden1, name='hidden1')
bn1 = batch_norm_layer(hidden1)
bn1_act = tf.nn.elu(bn1)

hidden2 = tf.layers.dense(bn1_act, n_hidden2, name='hidden2')
bn2 = batch_norm_layer(hidden2)
bn2_act = tf.nn.elu(bn2)

logits_before_bn = tf.layers.dense(bn2_act, n_outputs, name='outputs')
logits = batch_norm_layer(logits_before_bn)

In [0]:
# Setting up the rest of the graph for training.

y = tf.placeholder(tf.int32, shape=(None), name='y')

x_entropy = tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits,
                                                           labels=y)
loss = tf.reduce_mean(x_entropy, name='loss')
optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.01)
training_op = optimizer.minimize(loss)

correct = tf.nn.in_top_k(logits, y, 1)
accuracy = tf.reduce_mean(tf.cast(correct, tf.float32))

init = tf.global_variables_initializer()

Instructions for updating:
Use tf.cast instead.


In [4]:
# Downloading the MNIST dataset.

(X_train, y_train), (X_test, y_test) = tf.keras.datasets.mnist.load_data()
X_train = X_train.astype(np.float32).reshape(-1, 28*28) / 255.0
X_test = X_test.astype(np.float32).reshape(-1, 28*28) / 255.0
y_train = y_train.astype(np.int32)
y_test = y_test.astype(np.int32)
X_valid, X_train = X_train[:5000], X_train[5000:]
y_valid, y_train = y_train[:5000], y_train[5000:]

def shuffle_batch(X, y, batch_size):
  rnd_idx = np.random.permutation(len(X))
  n_batches = len(X) // batch_size
  for batch_idx in np.array_split(rnd_idx, n_batches):
    X_batch, y_batch = X[batch_idx], y[batch_idx]
    yield X_batch, y_batch

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/mnist.npz


In [0]:
# Training the neural network using Batch Normalization.
# In just 20 training iterations it achieves 97% accuracy on the
# validation set.

n_epochs = 20
batch_size = 200

# These extra ops are for training the scaling and offset parameters
# in batch normalization.
extra_update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)

with tf.Session() as sess:
  init.run()
  for epoch in range(n_epochs):
    for X_batch, y_batch in shuffle_batch(X_train, y_train, batch_size):
      sess.run([training_op, extra_update_ops],
               feed_dict={training: True, X: X_batch, y: y_batch})
    accuracy_val = accuracy.eval(feed_dict={X: X_valid, y: y_valid})
    print('Epoch: {} Validation Accuracy: {}'.format(epoch, accuracy_val))

Epoch: 0 Validation Accuracy: 0.8841999769210815
Epoch: 1 Validation Accuracy: 0.9100000262260437
Epoch: 2 Validation Accuracy: 0.9211999773979187
Epoch: 3 Validation Accuracy: 0.932200014591217
Epoch: 4 Validation Accuracy: 0.9377999901771545
Epoch: 5 Validation Accuracy: 0.9444000124931335
Epoch: 6 Validation Accuracy: 0.9480000138282776
Epoch: 7 Validation Accuracy: 0.9509999752044678
Epoch: 8 Validation Accuracy: 0.954200029373169
Epoch: 9 Validation Accuracy: 0.9575999975204468
Epoch: 10 Validation Accuracy: 0.9588000178337097
Epoch: 11 Validation Accuracy: 0.9603999853134155
Epoch: 12 Validation Accuracy: 0.9634000062942505
Epoch: 13 Validation Accuracy: 0.9639999866485596
Epoch: 14 Validation Accuracy: 0.9661999940872192
Epoch: 15 Validation Accuracy: 0.9679999947547913
Epoch: 16 Validation Accuracy: 0.9678000211715698
Epoch: 17 Validation Accuracy: 0.9685999751091003
Epoch: 18 Validation Accuracy: 0.968999981880188
Epoch: 19 Validation Accuracy: 0.9696000218391418


An alternate syntax to training a model this way is to define the `training_op` the following way:

```python
with tf.name_scope("train"):
    optimizer = tf.train.GradientDescentOptimizer(learning_rate)
    extra_update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
    with tf.control_dependencies(extra_update_ops):
        training_op = optimizer.minimize(loss)
```

this lets you train the model using the more simple syntax:

```python
sess.run(training_op, feed_dict={X: X_batch, y: y_batch})
```

### Gradient Clipping

One way to solve the exploding gradients problem is to clip the gradients' values to a defined range. This technique is called [_Gradient Clipping_](http://proceedings.mlr.press/v28/pascanu13.pdf). Though in general people prefer Batch Normalization. Below is an example of Gradient Clipping using TensorFlow:

In [0]:
# An example of gradient clipping.

reset_graph()

X = tf.placeholder(tf.float32, shape=(None, n_inputs), name='X')
y = tf.placeholder(tf.int32, shape=(None), name='y')

training = tf.placeholder_with_default(False, shape=(), name='training')

batch_norm_layer = partial(tf.layers.batch_normalization,
                           training=training, momentum=0.9)

hidden1 = tf.layers.dense(X, n_hidden1, name='hidden1')
hidden1_act = tf.nn.elu(hidden1)

hidden2 = tf.layers.dense(hidden1_act, n_hidden2, name='hidden2')
hidden2_act = tf.nn.elu(hidden2)

logits_before_bn = tf.layers.dense(hidden2_act, n_outputs, name='outputs')
logits = batch_norm_layer(logits_before_bn)

x_entropy = tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits,
                                                           labels=y)
loss = tf.reduce_mean(x_entropy, name='loss')

# Gradient clipping is here.
threshold = 1.0
optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.01)
grads_and_vars = optimizer.compute_gradients(loss)
capped_gvs = [(tf.clip_by_value(grad, -threshold, threshold), var)
              for grad, var in grads_and_vars]
training_op = optimizer.apply_gradients(capped_gvs)

## Reusing Pretrained Layers

Instead of training a large DNN from scratch, it is generally better to reuse an existing neural network used for a similar task, then reuse the lower layers of this network. This technique is called _transfer learning_.

### Reusing a TensorFlow model

Below is an example of saving and restoring a TensorFlow model using the `tr.train.import_meta_graph()` function:

In [0]:
# Defining the graph and saving it.

reset_graph()

X = tf.placeholder(tf.float32, shape=(None, n_inputs), name='X')
y = tf.placeholder(tf.int32, shape=(None), name='y')

training = tf.placeholder_with_default(False, shape=(), name='training')

batch_norm_layer = partial(tf.layers.batch_normalization,
                           training=training, momentum=0.9)

hidden1 = tf.layers.dense(X, n_hidden1, name='hidden1')
bn1 = batch_norm_layer(hidden1)
bn1_act = tf.nn.elu(bn1)

hidden2 = tf.layers.dense(bn1_act, n_hidden2, name='hidden2')
bn2 = batch_norm_layer(hidden2)
bn2_act = tf.nn.elu(bn2)

logits_before_bn = tf.layers.dense(bn2_act, n_outputs, name='outputs')
logits = batch_norm_layer(logits_before_bn)

training = tf.placeholder_with_default(False, shape=(), name='training')

x_entropy = tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits,
                                                           labels=y)
loss = tf.reduce_mean(x_entropy, name='loss')
optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.01)
training_op = optimizer.minimize(loss)

correct = tf.nn.in_top_k(logits, y, 1)
accuracy = tf.reduce_mean(tf.cast(correct, tf.float32), name='accuracy')

init = tf.global_variables_initializer()

# New code here!
saver = tf.train.Saver()
with tf.Session() as sess:
  init.run()
  saver.save(sess, './my_model.ckpt')

In [0]:
# Getting nodes in the previous graph for reuse.

X = tf.get_default_graph().get_tensor_by_name('X:0')
y = tf.get_default_graph().get_tensor_by_name('y:0')
accuracy = tf.get_default_graph().get_tensor_by_name('accuracy:0')
training_op = tf.get_default_graph().get_operation_by_name('GradientDescent')

In [0]:
# Listing the operations in the predefined graph, truncated for readability.

for op in tf.get_default_graph().get_operations()[:20]:
  print(op.name)

X
y
training/input
training
hidden1/kernel/Initializer/random_uniform/shape
hidden1/kernel/Initializer/random_uniform/min
hidden1/kernel/Initializer/random_uniform/max
hidden1/kernel/Initializer/random_uniform/RandomUniform
hidden1/kernel/Initializer/random_uniform/sub
hidden1/kernel/Initializer/random_uniform/mul
hidden1/kernel/Initializer/random_uniform
hidden1/kernel
hidden1/kernel/Assign
hidden1/kernel/read
hidden1/bias/Initializer/zeros
hidden1/bias
hidden1/bias/Assign
hidden1/bias/read
hidden1/MatMul
hidden1/BiasAdd


Below is an example of creating a collection of important operations. This is often helpful if the graph is large and you only want to reuse certain operations.

In [0]:
# Defining the original graph.

reset_graph()

X = tf.placeholder(tf.float32, shape=(None, n_inputs), name='X')
y = tf.placeholder(tf.int32, shape=(None), name='y')

training = tf.placeholder_with_default(False, shape=(), name='training')

batch_norm_layer = partial(tf.layers.batch_normalization,
                           training=training, momentum=0.9)

hidden1 = tf.layers.dense(X, n_hidden1, name='hidden1')
bn1 = batch_norm_layer(hidden1)
bn1_act = tf.nn.elu(bn1)

hidden2 = tf.layers.dense(bn1_act, n_hidden2, name='hidden2')
bn2 = batch_norm_layer(hidden2)
bn2_act = tf.nn.elu(bn2)

logits_before_bn = tf.layers.dense(bn2_act, n_outputs, name='outputs')
logits = batch_norm_layer(logits_before_bn)

training = tf.placeholder_with_default(False, shape=(), name='training')

x_entropy = tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits,
                                                           labels=y)
loss = tf.reduce_mean(x_entropy, name='loss')
optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.01)
training_op = optimizer.minimize(loss)

correct = tf.nn.in_top_k(logits, y, 1)
accuracy = tf.reduce_mean(tf.cast(correct, tf.float32), name='accuracy')

init = tf.global_variables_initializer()

# New code here!
for op in (X, y, accuracy, training_op):
  tf.add_to_collection('my_important_ops', op)

saver = tf.train.Saver()
with tf.Session() as sess:
  init.run()
  saver.save(sess, './my_model.ckpt')

In [0]:
# Restoring the graph and getting the operations from the collection.

reset_graph()

saver = tf.train.import_meta_graph('./my_model.ckpt.meta')

X, y, accuracy, training_op = tf.get_collection('my_important_ops')

You can also define a restore Saver which will only restore specified variables. This is useful if you only want to restore the lower layers of a neural network. Below is an example of restoring only the lower layers of a neural network using a saver which only restores specified variables.

In [0]:
# Defining a graph with 5 hidden layers. First implementing gradient clipping.

reset_graph()

n_inputs = 28 ** 2

X = tf.placeholder(tf.float32, shape=(None, n_inputs), name='X')
y = tf.placeholder(tf.int32, shape=(None,), name='y')

training = tf.placeholder_with_default(False, shape=(None))

with tf.name_scope('dnn'):
  layer = X
  for i, n_hidden in enumerate((300, 100, 50, 20)):
    hidden = tf.layers.dense(layer, n_hidden1,
                             name='hidden{}'.format(i+1))
    layer = tf.nn.relu(hidden, name='relu{}'.format(i+1))
  logits = tf.layers.dense(layer, n_outputs, name='outputs')

with tf.name_scope('loss'):
  x_entropy = tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits,
                                                             labels=y)
  loss = tf.reduce_mean(x_entropy, name='loss')

with tf.name_scope('train'):
  optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.01)
  training_op = optimizer.minimize(loss)

with tf.name_scope('eval'):
  correct = tf.nn.in_top_k(logits, y, 1)
  accuracy = tf.reduce_mean(tf.cast(correct, tf.float32), name='accuracy')

In [0]:
# Defining the savers and running the training.

reuse_vars = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES,
                               scope='hidden[123]') # Regex string
reuse_saver = tf.train.Saver(reuse_vars)

init = tf.global_variables_initializer()
saver = tf.train.Saver()

In [0]:
# Training the model one time to train the lower layers of the neural network.

import os

model_path = './my_model.ckpt'

with tf.Session() as sess:
  init.run()
  for epoch in range(n_epochs):
    for X_batch, y_batch in shuffle_batch(X_train, y_train, batch_size):
      sess.run(training_op, feed_dict={X: X_batch, y: y_batch, training: True})
    accuracy_val = accuracy.eval(feed_dict={X: X_valid, y: y_valid})
    print('Epoch: {} Validation set accuracy: {}'.format(epoch, accuracy_val))
  saver.save(sess, model_path)

In [0]:
# Training the model again, this time we restore the lower layers

new_model_path = './my_model_new.ckpt'

with tf.Session() as sess:
  init.run()
  reuse_saver.restore(sess, model_path)
  for epoch in range(n_epochs):
    for X_batch, y_batch in shuffle_batch(X_train, y_train, batch_size):
      sess.run(training_op, feed_dict={X: X_batch, y: y_batch, training: True})
    accuracy_val = accuracy.eval(feed_dict={X: X_valid, y: y_valid})
    print('Epoch: {} Validation set accuracy: {}'.format(epoch, accuracy_val))
  saver.save(sess, new_model_path)

Instructions for updating:
Use standard file APIs to check for files with this prefix.
INFO:tensorflow:Restoring parameters from ./my_model.ckpt
Epoch: 0 Validation set accuracy: 0.9498000144958496
Epoch: 1 Validation set accuracy: 0.9549999833106995
Epoch: 2 Validation set accuracy: 0.9552000164985657
Epoch: 3 Validation set accuracy: 0.9559999704360962
Epoch: 4 Validation set accuracy: 0.9585999846458435
Epoch: 5 Validation set accuracy: 0.9606000185012817
Epoch: 6 Validation set accuracy: 0.9599999785423279
Epoch: 7 Validation set accuracy: 0.9617999792098999
Epoch: 8 Validation set accuracy: 0.9629999995231628
Epoch: 9 Validation set accuracy: 0.9628000259399414
Epoch: 10 Validation set accuracy: 0.9635999798774719
Epoch: 11 Validation set accuracy: 0.9652000069618225
Epoch: 12 Validation set accuracy: 0.9664000272750854
Epoch: 13 Validation set accuracy: 0.9666000008583069
Epoch: 14 Validation set accuracy: 0.9678000211715698
Epoch: 15 Validation set accuracy: 0.9674000144004822
E

### Reusing Models From another Frameworks

Below is an example to illustrate how to import models from other frameworks into TensorFlow. This code sets the kernel and bias of a hidden layer at the start of the TensorFlow session using the initializer.


In [0]:
reset_graph()

n_inputs = 2
n_hidden = 3

original_w = [[1., 2., 3.], [4., 5., 6.]]
original_b = [7., 8., 9.]

X = tf.placeholder(tf.float32, shape=(None, n_inputs), name='X')
hidden = tf.layers.dense(X, n_hidden, activation=tf.nn.relu, name='hidden')

graph = tf.get_default_graph()
assign_kernel = graph.get_operation_by_name('hidden/kernel/Assign')
assign_bias = graph.get_operation_by_name('hidden/bias/Assign')

init_kernel = assign_kernel.inputs[1]
init_bias = assign_bias.inputs[1]

init = tf.global_variables_initializer()

with tf.Session() as sess:
  sess.run(init, feed_dict={
    init_kernel: original_w,
    init_bias: original_b,
  })
  print(hidden.eval(feed_dict={X: [[10., 11.]]}))

[[ 61.  83. 105.]]


Another way is to make dedicated nodes for assigning the hidden layer's kernel and bias, then set them at any point using placeholders. This is more verbose but allows more control.

In [0]:
reset_graph()

n_inputs = 2
n_hidden = 3

original_w = [[1., 2., 3.], [4., 5., 6.]]
original_b = [7., 8., 9.]

X = tf.placeholder(tf.float32, shape=(None, n_inputs), name='X')
hidden = tf.layers.dense(X, n_hidden, activation=tf.nn.relu, name='hidden')

with tf.variable_scope('', default_name='', reuse=True):
  hidden_weights = tf.get_variable('hidden/kernel')
  hidden_bias = tf.get_variable('hidden/bias')
  
original_weights = tf.placeholder(tf.float32, shape=(n_inputs, n_hidden))
original_bias = tf.placeholder(tf.float32, shape=(n_hidden))

assign_hidden_weights = tf.assign(hidden_weights, original_weights)
assign_hidden_bias = tf.assign(hidden_bias, original_bias)

init = tf.global_variables_initializer()

with tf.Session() as sess:
  sess.run(init)
  sess.run(assign_hidden_weights, feed_dict={original_weights: original_w})
  sess.run(assign_hidden_bias, feed_dict={original_bias: original_b})
  print(hidden.eval(feed_dict={X: [[10., 11.]]}))

[[ 61.  83. 105.]]


### Freezing Lower Layers

Since it is likely that the lower layers of the DNN have learned to detect the lower level patterns in the training set, you can reuse the layers as they are by "freezing" their weights. As a result, the higher level layers will be easier to train.

In [0]:
# Defining a new graph which will use the lower level
# layers from the section on using pre-trained layers.

reset_graph()

n_inputs = 28 ** 2

X = tf.placeholder(tf.float32, shape=(None, n_inputs), name='X')
y = tf.placeholder(tf.int32, shape=(None,), name='y')

training = tf.placeholder_with_default(False, shape=(None))

with tf.name_scope('dnn'):
  layer = X
  for i, n_hidden in enumerate((300, 100, 50, 20)):
    hidden = tf.layers.dense(layer, n_hidden1,
                             name='hidden{}'.format(i+1))
    layer = tf.nn.relu(hidden, name='relu{}'.format(i+1))
  logits = tf.layers.dense(layer, n_outputs, name='outputs')

with tf.name_scope('loss'):
  x_entropy = tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits,
                                                             labels=y)
  loss = tf.reduce_mean(x_entropy, name='loss')

# New code here!
with tf.name_scope('train'):
  training_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES,
                                    scope='hidden[34]|outputs')
  optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.01)
  training_op = optimizer.minimize(loss, var_list=training_vars)

with tf.name_scope('eval'):
  correct = tf.nn.in_top_k(logits, y, 1)
  accuracy = tf.reduce_mean(tf.cast(correct, tf.float32), name='accuracy')
  
reuse_vars = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES,
                               scope='hidden[123]') # Regex string
reuse_saver = tf.train.Saver(reuse_vars)

init = tf.global_variables_initializer()
saver = tf.train.Saver()

In [0]:
# Training the model with the frozen layers.

with tf.Session() as sess:
  init.run()
  reuse_saver.restore(sess, model_path)
  for epoch in range(n_epochs):
    for X_batch, y_batch in shuffle_batch(X_train, y_train, batch_size):
      sess.run(training_op, feed_dict={X: X_batch, y: y_batch, training: True})
    accuracy_val = accuracy.eval(feed_dict={X: X_valid, y: y_valid})
    print('Epoch: {} Validation set accuracy: {}'.format(epoch, accuracy_val))
  saver.save(sess, new_model_path)

INFO:tensorflow:Restoring parameters from ./my_model.ckpt
Epoch: 0 Validation set accuracy: 0.9531999826431274
Epoch: 1 Validation set accuracy: 0.9539999961853027
Epoch: 2 Validation set accuracy: 0.9553999900817871
Epoch: 3 Validation set accuracy: 0.9552000164985657
Epoch: 4 Validation set accuracy: 0.9567999839782715
Epoch: 5 Validation set accuracy: 0.9563999772071838
Epoch: 6 Validation set accuracy: 0.9580000042915344
Epoch: 7 Validation set accuracy: 0.9577999711036682
Epoch: 8 Validation set accuracy: 0.9581999778747559
Epoch: 9 Validation set accuracy: 0.9584000110626221
Epoch: 10 Validation set accuracy: 0.9584000110626221
Epoch: 11 Validation set accuracy: 0.9589999914169312
Epoch: 12 Validation set accuracy: 0.9577999711036682
Epoch: 13 Validation set accuracy: 0.9595999717712402
Epoch: 14 Validation set accuracy: 0.9606000185012817
Epoch: 15 Validation set accuracy: 0.9599999785423279
Epoch: 16 Validation set accuracy: 0.9599999785423279
Epoch: 17 Validation set accuracy:

In [0]:
# Another way to freeze the lower layers is to add a stop_gradient()
# layer in the graph. Any layer below it will be frozen.

reset_graph()

n_inputs = 28 ** 2

X = tf.placeholder(tf.float32, shape=(None, n_inputs), name='X')
y = tf.placeholder(tf.int32, shape=(None,), name='y')

training = tf.placeholder_with_default(False, shape=(None))

with tf.name_scope('dnn'):
  layer = X
  for i, n_hidden in enumerate((300, 100, 50, 20)):
    hidden = tf.layers.dense(layer, n_hidden1,
                             name='hidden{}'.format(i+1))
    layer = tf.nn.relu(hidden, name='relu{}'.format(i+1))
    # New code here!
    if i == 1:
      hidden2 = layer
      layer = tf.stop_gradient(layer)
  logits = tf.layers.dense(layer, n_outputs, name='outputs')

with tf.name_scope('loss'):
  x_entropy = tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits,
                                                             labels=y)
  loss = tf.reduce_mean(x_entropy, name='loss')

with tf.name_scope('train'):
  optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.01)
  training_op = optimizer.minimize(loss)

with tf.name_scope('eval'):
  correct = tf.nn.in_top_k(logits, y, 1)
  accuracy = tf.reduce_mean(tf.cast(correct, tf.float32), name='accuracy')
  
reuse_vars = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES,
                               scope='hidden[123]') # Regex string
reuse_saver = tf.train.Saver(reuse_vars)

init = tf.global_variables_initializer()
saver = tf.train.Saver()

In [0]:
with tf.Session() as sess:
  init.run()
  reuse_saver.restore(sess, model_path)
  for epoch in range(n_epochs):
    for X_batch, y_batch in shuffle_batch(X_train, y_train, batch_size):
      sess.run(training_op, feed_dict={X: X_batch, y: y_batch, training: True})
    accuracy_val = accuracy.eval(feed_dict={X: X_valid, y: y_valid})
    print('Epoch: {} Validation set accuracy: {}'.format(epoch, accuracy_val))
  saver.save(sess, new_model_path)

INFO:tensorflow:Restoring parameters from ./my_model.ckpt
Epoch: 0 Validation set accuracy: 0.9383999705314636
Epoch: 1 Validation set accuracy: 0.9441999793052673
Epoch: 2 Validation set accuracy: 0.9485999941825867
Epoch: 3 Validation set accuracy: 0.9509999752044678
Epoch: 4 Validation set accuracy: 0.9527999758720398
Epoch: 5 Validation set accuracy: 0.951200008392334
Epoch: 6 Validation set accuracy: 0.954200029373169
Epoch: 7 Validation set accuracy: 0.9559999704360962
Epoch: 8 Validation set accuracy: 0.9559999704360962
Epoch: 9 Validation set accuracy: 0.9562000036239624
Epoch: 10 Validation set accuracy: 0.9559999704360962
Epoch: 11 Validation set accuracy: 0.9562000036239624
Epoch: 12 Validation set accuracy: 0.9556000232696533
Epoch: 13 Validation set accuracy: 0.9575999975204468
Epoch: 14 Validation set accuracy: 0.9580000042915344
Epoch: 15 Validation set accuracy: 0.9577999711036682
Epoch: 16 Validation set accuracy: 0.9580000042915344
Epoch: 17 Validation set accuracy: 0

### Caching Frozen Layers

One way to improve the speed of training when you have frozen the lower layers is to run the training set through the frozen layers at the start of training. The code below shows an example of training a model this way, reusing the TensorFlow graph defined above.

In [0]:
with tf.Session() as sess:
  init.run()
  reuse_saver.restore(sess, model_path)
  h2_cache = sess.run(hidden2, feed_dict={X: X_train})
  for epoch in range(n_epochs):
    for h2_batch, y_batch in shuffle_batch(h2_cache, y_train, batch_size):
      sess.run(training_op, feed_dict={hidden2: h2_batch, y: y_batch})
    accuracy_val = accuracy.eval(feed_dict={X: X_valid, y: y_valid})
    print('Epoch: {} Validation set accuracy: {}'.format(epoch, accuracy_val))
  saver.save(sess, new_model_path)

INFO:tensorflow:Restoring parameters from ./my_model.ckpt
Epoch: 0 Validation set accuracy: 0.9369999766349792
Epoch: 1 Validation set accuracy: 0.9467999935150146
Epoch: 2 Validation set accuracy: 0.9506000280380249
Epoch: 3 Validation set accuracy: 0.9521999955177307
Epoch: 4 Validation set accuracy: 0.9526000022888184
Epoch: 5 Validation set accuracy: 0.9531999826431274
Epoch: 6 Validation set accuracy: 0.9544000029563904
Epoch: 7 Validation set accuracy: 0.954200029373169
Epoch: 8 Validation set accuracy: 0.9545999765396118
Epoch: 9 Validation set accuracy: 0.9559999704360962
Epoch: 10 Validation set accuracy: 0.9559999704360962
Epoch: 11 Validation set accuracy: 0.9570000171661377
Epoch: 12 Validation set accuracy: 0.9575999975204468
Epoch: 13 Validation set accuracy: 0.9563999772071838
Epoch: 14 Validation set accuracy: 0.9567999839782715
Epoch: 15 Validation set accuracy: 0.9575999975204468
Epoch: 16 Validation set accuracy: 0.9581999778747559
Epoch: 17 Validation set accuracy: 

### Tweaking, Dropping, or Replacing the Upper Layers

The higher the layer is in the previously trained neural network, the less likely it will be useful in training a new network for different tasks. The output layer is generally always replaced, in many cases the old output layer may be a different shape than the output for the new layer.

One way to try to determine how many layers to freeze is to try freezing all of the hidden layers first, then training the neural network again after unfreezing one or two of the top layers and seeing if performance improves. The more training data, the more layers you can unfreeze.

If you still cannot get good performance with little training data, you can try dropping the top hidden layers and freezing the lower ones. You can keep trying until you find the right number of layers to reuse. If you have a lot of training data, you can replace the top layers instead of dropping them or add more layers.

### Model Zoos

A _model zoo_ is a collection of machine learning models that other people have trained for different machine learning tasks. TensorFlow has its own [model zoo](https://github.com/tensorflow/models). Another popular model zoo is the [Caffe Model Zoo](https://github.com/BVLC/caffe/wiki/Model-Zoo). Saumitro Dasgupta wrote a [converter](https://github.com/ethereon/caffe-tensorflow) to convert Caffe models to TensorFlow.

### Unsupervised Training

If you do not have a large labeled training set and there is not a previously trained model for a similar task, but you do have a large unlabled training set, you can use an _unsupervised pretraining_ algorithms such as _Restricted Boltzmann Machines_ (RBMs) or autoencoders to train successlive DNN layers to find low level features in the training set. Afterwards you can tune the model using supervised learning and backpropagation.

### Pretraining an Auxilary Task

One way to train a DNN if you have limited labeled training data is to train a neural network for a similar task then reuse the lower layers to train a new DNN for the actual task.

Another strategy is to take unlabeled training data and take some data and modify it. You label the unmodified data as "good" and the modified data as "bad" so that you can train a DNN classifier using a supervised algorithm to get lower layers which recognize lower level features for the actual task.

## Faster Optimizers

In this section we will examine optimizers which are faster than plain Gradient Descent which can help speed up training DNNs.

### Momentum Optimizers

Recall that Gradient Descent updates the weight vector, $\theta$ by subtracting the weight of the gradient of the cost function, $\nabla_\theta J(\theta)$ multiplied by the learning rate, $\eta$, i.e.

$$ \theta \leftarrow \theta - \eta\nabla_\theta J(\theta) $$

[Momentum optimization](https://www.researchgate.net/publication/243648538_Some_methods_of_speeding_up_the_convergence_of_iteration_methods), proposed by Boris Polyak in 1964 subtracts the local gradient from a momentum vector, $\mathbf{m}$, multiplied by the learning rate $\eta$ and it updates the weights by adding the momentum vector to the weight vector, $\theta$. To prevent the momentum from growing too large, the algorithm introduces a hyperparameter, $\beta$, called the _momentum_, which is between 0 and 1 (typically 0.9). The algorithm can be written in two stages:

$$ \begin{matrix}
1. && \mathbf{m} \leftarrow \beta\,\mathbf{m} - \eta\nabla_\theta J(\theta) \\
2. && \theta \leftarrow \theta + \mathbf{m}
\end{matrix}  $$

It follows that if the gradient remains constant, the terminal velocity (the maximum size of the weight updates) is given by the learning rate, $\eta$, multiplied by $\frac{1}{1\,-\,\beta}$. If $\beta = 0.9$ then Momentum optimization converges 10 times as quickly as Gradient Descent. Due to the larger steps, Momentum optimization can escape local optima much more quickly than Gradient Descent. Below is an example of implementing Momentum optimization in TensorFlow:


In [0]:
optimizer = tf.train.MomentumOptimizer(learning_rate=0.01, momentum=0.9)

The main drawback of Momentum optimization is it adds another hyperparameter to tune, but generally 0.9 works in practice.

### Nesterov Accelerated Gradient

One improvement to Momentum optimization proposed by Yuli Nesterov in 1983 is called [Nesterov Momentum Optimization](https://scholar.google.com/citations?view_op=view_citation&citation_for_view=DJ8Ep8YAAAAJ:hkOj_22Ku90C) or _Nesterov Accelerated Gradient_ (NAG). The algorithm works in the following steps:

$$ \begin{matrix}
1. && \mathbf{m} \leftarrow \beta\,\mathbf{m} - \eta\nabla_\theta J(\theta + \beta\,\mathbf{m}) \\
2. && \theta \leftarrow \theta + \mathbf{m}
\end{matrix} $$

The only difference between NAG and Momentum optimization is that it computes the gradient of the cost function at $\theta + \beta\,\mathbf{m}$ instead of at the current value of $\theta$. This improves the algorithm since the momentum vector is usually pointing in the direction of the optimal value. Below is an example of NAG using TensorFlow:

In [0]:
optimizer = tf.train.MomentumOptimizer(learning_rate=0.01, momentum=0.9,
                                       use_nesterov=True)

### AdaGrad Optimizer

The [AdaGrad algorithm](http://www.jmlr.org/papers/volume12/duchi11a/duchi11a.pdf) is an algorithm designed to decay the learning rate, doing so faster for steeper gradients, in order to converge more directly towards the global optimum. This is called an _adaptive learning rate_. The algorithm works in two stages:

The first stage computes a vector, $\mathbf{s}$, given by

$$ \mathbf{s} \leftarrow \mathbf{s} + \nabla_\theta J(\theta) \otimes \nabla_\theta J(\theta) $$

where $\otimes$ denotes component-wise multiplication. This is a vectorized form of the following operation:

$$ s_i \leftarrow s_i + \left( \frac{\partial J(\theta)}{\partial \theta_i} \right)^2 $$

where $s_i$ denotes a component of $\mathbf{s}$.The vector $\mathbf{s}$ accumulates the squares of each component of the gradient, and it grows larger when the gradient is steeper.

The second step is similar to Gradient Descent but with a modification. It is given by

$$ \theta \leftarrow \theta - \eta\nabla_\theta J(\theta) \oslash \sqrt{\mathbf{s} + \epsilon} $$

where $\oslash$ denotes component-wise division and $\epsilon$ is a smoothing term to avoid division by zero (it is typically 10<sup>-10</sup>). This operation is a vectorized form of the following operation:

$$ \theta_i \leftarrow \theta_i - \eta \left( \frac{\partial J(\theta)}{\partial \theta_i} \right) \left( s_i + \epsilon \right)^{-1/2} $$

where $\theta_i$ is each component of the vector $\theta$.

AdaGrad performs well for simple quadratic problems, but often stops too early when training DNNs because the learning rate degrades to zero. Below is an example of an AdaGrad optimizer in TensorFlow:

In [0]:
optimizer = tf.train.AdagradOptimizer(learning_rate=0.01)

### RMSProp

Since AdaGrad can decay the learning rate too quickly, [RMSProp](http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf) slows the rate of decay of the learning rate by accumulating only the most recent iterations using exponential decay. The algorithm works as follows:

$$ \begin{matrix}
1. && \mathbf{s} \leftarrow \beta\,\mathbf{s} + (1 - \beta\, ) \nabla_\theta J(\theta) \otimes \nabla_\theta J(\theta) \\
2. && \theta \leftarrow \theta - \eta\nabla_\theta J(\theta) \oslash \sqrt{\mathbf{s} + \epsilon}
\end{matrix} $$

where $\beta$ is the decay rate and is typically set to 0.9. This value typically works in practice so you do not have to tune it. Except for simple problems, RMSProp typically performs better than AdaGrad. Below is an example of RMSProp with TensorFlow:

In [0]:
optimizer = tf.train.RMSPropOptimizer(learning_rate=0.1, momentum=0.9,
                                      decay=0.9, epsilon=True)

### Adam Optimizer

[Adam](https://arxiv.org/pdf/1412.6980v8.pdf) which stands for _adaptive moment estimation_ combines Momentum optimizers and RMSProp, keeping track of an exponentially decaying average of past gradients and past squared gradients. The algorithm works as follows:

$$ \begin{matrix}
1. && \mathbf{m} \leftarrow \beta_1\mathbf{m} - (1 - \beta_1) \nabla_\theta J(\theta) \\
2. && \mathbf{s} \leftarrow \beta_2\mathbf{s} + (1 - \beta_2) \nabla_\theta J(\theta) \otimes \nabla_\theta J(\theta) \\
3. && \mathbf{m} \leftarrow (1 - \beta_1)^{-\,t}\,\mathbf{m} \\
4. && \mathbf{s} \leftarrow (1 - \beta_2)^{-\,t} \,\mathbf{s} \\
5. && \theta \leftarrow \theta + \eta \, \mathbf{m} \oslash \sqrt{\mathbf{s} + \epsilon}
\end{matrix}$$

where $t$ is the training iteration number, $\beta_1$ is the momentum decay rate, and $\beta_2$ is the scaling decay rate. Steps 1, 2, and 5 resemble both Momentum optimization and RMSProp. Steps 3 and 4 account for the fact that $\mathbf{m}$ and $\mathbf{s}$ are initialized to zero vectors, so these steps prevent the algorithm from being biased towards zero at the beginning.

The momentum decay rate, $\beta_1$, is typically initialized to 0.9. The scaling decay rate, $\beta_2$ is initialized to 0.99. These values perform well in practice so its rare you have to tune them. The smoothing parameter, $\epsilon$, is typically initialized to 10<sup>-10</sup>. Since the model's learning rate is adaptive, it is not as necessary to tune the learning rate, $\eta$.

Below is an example of an Adam optimizer in TensorFlow:

In [0]:
optimizer = tf.train.AdamOptimizer(learning_rate=0.001)

### Learning Rate Scheduling

Finding a good learning rate can be difficult. If the learning rate is too high the algorithm can diverge. If it is too low the algorithm will take too long to train. If it is slightly too high the algorithm may dance around the optimum and not converge unless you are using an apadtive learning rate algorithm like AdaGrad, RMSProp, or Adam.

Below are some strategies for _learning rate scheduling_, training methods which start with a high learning rate and gradually reduce it as you get closer to the optimum:

#### Predetermined piecewise constant learning rate

Setting the learning rate to a high value at the start of training, e.g. $\eta_0 = 0.01$ then reducing it to a lower rate, e.g. $\eta_1 = 0.001$ after a constant number of training iterations. This performs well but requires tuning to find which training epoch is the right one to reduce the learning rate at.

#### Performance scheduling

Measure the validation error every $N$ steps and reduce the learning rate by some constant factor, $\lambda$, when the error starts increasing.

#### Exponential scheduling

The learning rate is an exponential function of the iteration number, i.e.

$$ \eta(t) = \eta_0^{\;-t/r} $$

This requires tuning of the initial learning rate, $\eta_0$, and the rate of decay, $r$.

#### Power scheduling

Set the learning rate to the exponential function

$$ \eta(t) = \eta_0 (1 + t/r)^{-c} $$

where $c$ is typically set to 1. This also requires tuning like exponential scheduling, but in this case the learning rate drops much more slowly.

In [0]:
# An example of implementing a learning schedule with TensorFlow.

initial_learning_rate = 0.1
decay_steps = 10000
decay_rate = 0.1
global_step = tf.Variable(0, trainable=False, name='global_step')
learning_rate = tf.train.exponential_decay(initial_learning_rate, global_step,
                                           decay_steps, decay_rate)
optimizer = tf.train.MomentumOptimizer(learning_rate, momentum=0.9)
training_op = optimizer.minimize(loss, global_step=global_step)

For adaptive learning rate optimization methods like AdaGrad, RMSProp or Adam, learning rate scheduling is not necessary.

## Avoiding Overfitting Through Regularization

Since neural networks have tens of thousands of parameters (sometimes millions) they are prone to overfitting the data. The following section goes over the most common ways of introducing regularization into the model to prevent overfitting.

### Early Stopping

One way to prevent overfitting is to implement early stopping (introduced in chapter 4). After a certain number of training iterations (e.g. every 50 iterations), you measure the model's error on the validation set. If after a certain number of training iterations, the error does not decrease, stop training the model. Early stopping typically works best when combined with another regularization technique.

### $\ell_1$ and $\ell_2$ Regularization

You can add $\ell_1$ or $\ell_2$ regularization to neural networks' weights (typically not the biases) just like linear models in chapter 4. One way to do this with TensorFlow is to simply add the regularization to the cost function. Below is an example of adding $\ell_1$ regularization to a neural network with one hidden layer using TensorFlow.

In [0]:
reset_graph()

n_inputs = 28 ** 2
n_hidden1 = 300
n_outputs = 10

X = tf.placeholder(tf.float32, shape=(None, n_inputs), name='X')
y = tf.placeholder(tf.int32, shape=(None), name='y')

with tf.name_scope('dnn'):
  hidden1 = tf.layers.dense(X, n_hidden1, activation=tf.nn.relu,
                            name='hidden1')
  logits = tf.layers.dense(hidden1, n_outputs, name='outputs')

W1 = tf.get_default_graph().get_tensor_by_name('hidden1/kernel:0')
W2 = tf.get_default_graph().get_tensor_by_name('outputs/kernel:0')

# New code here!
scale = 0.001 # L1 regularization parameter
with tf.name_scope('loss'):
  x_entropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y,
                                                             logits=logits)
  base_loss = tf.reduce_mean(x_entropy, name='avg_x_entropy')
  reg_losses = tf.reduce_sum(tf.abs(W1)) + tf.reduce_sum(tf.abs(W2))
  loss = tf.add(base_loss, scale * reg_losses, name='loss')
  
with tf.name_scope('eval'):
  correct = tf.nn.in_top_k(logits, y, 1)
  accuracy = tf.reduce_mean(tf.cast(correct, tf.float32), name='accuracy')
  
learning_rate = 0.01

with tf.name_scope('train'):
  optimizer = tf.train.GradientDescentOptimizer(learning_rate)
  training_op = optimizer.minimize(loss)
  
init = tf.global_variables_initializer()

In [0]:
# Training the model and printing the validation error.

with tf.Session() as sess:
  init.run()
  for epoch in range(n_epochs):
    for X_batch, y_batch in shuffle_batch(X_train, y_train, batch_size):
      sess.run(training_op, feed_dict={X: X_batch, y: y_batch})
    accuracy_val = accuracy.eval(feed_dict={X: X_valid, y: y_valid})
    print('Epoch: {} Accuracy: {}'.format(epoch, accuracy_val))

Epoch: 0 Accuracy: 0.8309999704360962
Epoch: 1 Accuracy: 0.8709999918937683
Epoch: 2 Accuracy: 0.8838000297546387
Epoch: 3 Accuracy: 0.8934000134468079
Epoch: 4 Accuracy: 0.8966000080108643
Epoch: 5 Accuracy: 0.8988000154495239
Epoch: 6 Accuracy: 0.9016000032424927
Epoch: 7 Accuracy: 0.9043999910354614
Epoch: 8 Accuracy: 0.9057999849319458
Epoch: 9 Accuracy: 0.906000018119812
Epoch: 10 Accuracy: 0.9067999720573425
Epoch: 11 Accuracy: 0.9053999781608582
Epoch: 12 Accuracy: 0.9070000052452087
Epoch: 13 Accuracy: 0.9083999991416931
Epoch: 14 Accuracy: 0.9088000059127808
Epoch: 15 Accuracy: 0.9064000248908997
Epoch: 16 Accuracy: 0.9065999984741211
Epoch: 17 Accuracy: 0.9065999984741211
Epoch: 18 Accuracy: 0.9065999984741211
Epoch: 19 Accuracy: 0.9052000045776367


Below is an alternative way of adding $\ell_1$ regularization using `tf.layers.dense()`. You can use `l1_regularizer()`, `l2_regularizer()`, or `l1_l2_regularizer()` functions to add regularization to each layer.

In [0]:
reset_graph()

n_inputs = 28 ** 2
n_hidden1 = 300
n_hidden2 = 100
n_outputs = 10
scale = 0.001

X = tf.placeholder(tf.float32, shape=(None, n_inputs), name='X')
y = tf.placeholder(tf.int32, shape=(None), name='y')

regularized_dense_layer = partial(
    tf.layers.dense, activation=tf.nn.relu,
    kernel_regularizer=tf.contrib.layers.l1_regularizer(scale))

with tf.name_scope('dnn'):
  hidden1 = regularized_dense_layer(X, n_hidden1, name='hidden1')
  hidden2 = regularized_dense_layer(hidden1, n_hidden2, name='hidden2')
  logits = regularized_dense_layer(hidden2, n_outputs, name='outputs')
  
with tf.name_scope('loss'):
  x_entropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y,
                                                             logits=logits)
  base_loss = tf.reduce_mean(x_entropy, name='avg_x_entropy')
  reg_losses = tf.get_collection(tf.GraphKeys.REGULARIZATION_LOSSES)
  loss = tf.add_n([base_loss] + reg_losses, name='loss')
  
with tf.name_scope('eval'):
  correct = tf.nn.in_top_k(logits, y, 1)
  accuracy = tf.reduce_mean(tf.cast(correct, tf.float32), name='accuracy')
  
learning_rate = 0.01

with tf.name_scope('train'):
  optimizer = tf.train.GradientDescentOptimizer(learning_rate)
  training_op = optimizer.minimize(loss)
  
init = tf.global_variables_initializer()


For more information, please see:
  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
  * https://github.com/tensorflow/addons
If you depend on functionality not listed there, please file an issue.



In [0]:
with tf.Session() as sess:
  init.run()
  for epoch in range(n_epochs):
    for X_batch, y_batch in shuffle_batch(X_train, y_train, batch_size):
      sess.run(training_op, feed_dict={X: X_batch, y: y_batch})
    accuracy_val = accuracy.eval(feed_dict={X: X_valid, y: y_valid})
    print('Epoch: {} Accuracy: {}'.format(epoch, accuracy_val))

Epoch: 0 Accuracy: 0.8176000118255615
Epoch: 1 Accuracy: 0.8751999735832214
Epoch: 2 Accuracy: 0.8913999795913696
Epoch: 3 Accuracy: 0.9020000100135803
Epoch: 4 Accuracy: 0.9064000248908997
Epoch: 5 Accuracy: 0.9082000255584717
Epoch: 6 Accuracy: 0.9121999740600586
Epoch: 7 Accuracy: 0.9146000146865845
Epoch: 8 Accuracy: 0.9150000214576721
Epoch: 9 Accuracy: 0.9187999963760376
Epoch: 10 Accuracy: 0.9179999828338623
Epoch: 11 Accuracy: 0.9196000099182129
Epoch: 12 Accuracy: 0.9179999828338623
Epoch: 13 Accuracy: 0.9192000031471252
Epoch: 14 Accuracy: 0.9196000099182129
Epoch: 15 Accuracy: 0.9179999828338623
Epoch: 16 Accuracy: 0.9186000227928162
Epoch: 17 Accuracy: 0.920199990272522
Epoch: 18 Accuracy: 0.9186000227928162
Epoch: 19 Accuracy: 0.9174000024795532


### Dropout

The most popular regularization technique for DNNs is [_dropout_](https://arxiv.org/pdf/1207.0580.pdf) proposed by G. E. Hinton in 2012 and in this [paper](http://jmlr.org/papers/volume15/srivastava14a/srivastava14a.pdf) by Nitish Srivastava et al. which has been shown to improve DNN accuracy by 1-2%.

The algoritm is simple, at every training step each neuron has a probability, $p$, of being temporarily excluded during that round of training. That hyperparameter, $p$, is called the _dropout rate_. Dropout improves the performance of DNNs because it lets you train the DNN as if it were an ensemble of $2^N$ possible DNNs (where $N$ is the number of neurons) so it prevents overfitting. After training, you need to multiply each connection by $(1 - p)$ or the _keep rate_ to make up for the fact that each neuron on average had fewer connections during training. Alternatively you can divide each connection by $(1-p)$ during training, which as a similar effect but is not exactly equivalent.

Below is an example of implementing dropout using TensorFlow:

In [0]:
reset_graph()

n_inputs = 28 ** 2
n_hidden1 = 300
n_hidden2 = 100
n_outputs = 10
dropout_rate = 0.5 # == 1 - keep_rate

X = tf.placeholder(tf.float32, shape=(None, n_inputs), name='X')
y = tf.placeholder(tf.int32, shape=(None), name='y')

training = tf.placeholder_with_default(False, shape=(), name='training')

X_drop = tf.layers.dropout(X, dropout_rate, training=training)

with tf.name_scope('dnn'):
  hidden1 = tf.layers.dense(X_drop, n_hidden1, activation=tf.nn.relu,
                            name='hidden1')
  hidden1_drop = tf.layers.dropout(hidden1, dropout_rate, training=training)
  hidden2 = tf.layers.dense(hidden1_drop, n_hidden2, activation=tf.nn.relu,
                            name='hidden2')
  hidden2_drop = tf.layers.dropout(hidden2, dropout_rate, training=training)
  logits = tf.layers.dense(hidden2_drop, n_outputs, name='outputs')

with tf.name_scope('loss'):
  x_entropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y,
                                                             logits=logits)
  loss = tf.reduce_mean(x_entropy, name='loss')
  
with tf.name_scope('eval'):
  correct = tf.nn.in_top_k(logits, y, 1)
  accuracy = tf.reduce_mean(tf.cast(correct, tf.float32), name='accuracy')
  
learning_rate = 0.01

with tf.name_scope('train'):
  optimizer = tf.train.GradientDescentOptimizer(learning_rate)
  training_op = optimizer.minimize(loss)
  
init = tf.global_variables_initializer()

Instructions for updating:
Use keras.layers.dropout instead.
Instructions for updating:
Please use `rate` instead of `keep_prob`. Rate should be set to `rate = 1 - keep_prob`.


In [0]:
batch_size = 50

with tf.Session() as sess:
  init.run()
  for epoch in range(n_epochs):
    for X_batch, y_batch in shuffle_batch(X_train, y_train, batch_size):
      sess.run(training_op, feed_dict={X: X_batch, y: y_batch})
    accuracy_val = accuracy.eval(feed_dict={X: X_valid, y: y_valid})
    print('Epoch: {} Accuracy: {}'.format(epoch, accuracy_val))

Epoch: 0 Accuracy: 0.9021999835968018
Epoch: 1 Accuracy: 0.9240000247955322
Epoch: 2 Accuracy: 0.9326000213623047
Epoch: 3 Accuracy: 0.9387999773025513
Epoch: 4 Accuracy: 0.9431999921798706
Epoch: 5 Accuracy: 0.9480000138282776
Epoch: 6 Accuracy: 0.9521999955177307
Epoch: 7 Accuracy: 0.9552000164985657
Epoch: 8 Accuracy: 0.9584000110626221
Epoch: 9 Accuracy: 0.9598000049591064
Epoch: 10 Accuracy: 0.9616000056266785
Epoch: 11 Accuracy: 0.9634000062942505
Epoch: 12 Accuracy: 0.9661999940872192
Epoch: 13 Accuracy: 0.9674000144004822
Epoch: 14 Accuracy: 0.9682000279426575
Epoch: 15 Accuracy: 0.9703999757766724
Epoch: 16 Accuracy: 0.9711999893188477
Epoch: 17 Accuracy: 0.9711999893188477
Epoch: 18 Accuracy: 0.9742000102996826
Epoch: 19 Accuracy: 0.9732000231742859


### Training Sparse Models

All of the optimization techniques presented produce _dense_ models, meaning all or most parameters will be nonzero. If you need a faster model or a model that runs very fast, you may need a sparse model instead.

One way to achieve this is to set all small weights to zero, another option is to apply strong $\ell_1$ regularization. One last option is to apply _Dual Averaging_, also known as [_Follow The Regularized Leader_](https://scholar.google.fr/citations?view_op=view_citation&citation_for_view=DJ8Ep8YAAAAJ:Tyk-4Ss8FVUC) (FTRL), proposed by Yurii Nesterov. TensorFlow implements a variant of FTRL called [_FTRL-Proximal_](https://www.eecs.tufts.edu/~dsculley/papers/ad-click-prediction.pdf) in the `FTRLOptimizer` class.

### Max-Norm Regularization

Another regularization technique is called _max-norm regularization_ for the weights, $\mathbf{w}$, of each hidden layer where it applys the following constraint

$$ ||\,\mathbf{w}\,||_{\,2} \leq r $$

where $||\cdot||_{\,2}$ is the $\ell_2$ norm and $r$ is the max-norm hyperparameter. The algorithm works by computing

$$ \lambda = \max\left( \frac{r}{||\,\mathbf{w}\,||_{\,2}}, \, 1 \right) $$

and then updates each hidden layer's weight vector

$$ \mathbf{w} \leftarrow \lambda\,\mathbf{w} $$

TensorFlow does not have max-norm regularization built in but it is possible to implement it using the `clip_by_norm()` function.

In [0]:
reset_graph()

n_inputs = 28 ** 2
n_hidden1 = 300
n_hidden2 = 50
n_outputs = 10

learning_rate = 0.01
momentum = 0.9

X = tf.placeholder(tf.float32, shape=(None, n_inputs), name='X')
y = tf.placeholder(tf.int32, shape=(None), name='y')

with tf.name_scope('dnn'):
  hidden1 = tf.layers.dense(X, n_hidden1, activation=tf.nn.relu,
                            name='hidden1')
  hidden2 = tf.layers.dense(hidden1, n_hidden2, activation=tf.nn.relu,
                            name='hidden2')
  logits = tf.layers.dense(hidden2, n_outputs, name='outputs')

with tf.name_scope('loss'):
  xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y,
                                                            logits=logits)
  loss = tf.reduce_mean(xentropy, name='loss')
  
with tf.name_scope('train'):
  optimizer = tf.train.MomentumOptimizer(learning_rate, momentum)
  training_op = optimizer.minimize(loss)
  
with tf.name_scope('eval'):
  correct = tf.nn.in_top_k(logits, y, 1)
  accuracy = tf.reduce_mean(tf.cast(correct, tf.float32), name='accuracy')
  
# New code here below
threshold = 1.0

weights1 = tf.get_default_graph().get_tensor_by_name('hidden1/kernel:0')
clipped_weights1 = tf.clip_by_norm(weights1, clip_norm=threshold, axes=1)
clip_weights1 = tf.assign(weights1, clipped_weights1)

weights2 = tf.get_default_graph().get_tensor_by_name('hidden2/kernel:0')
clipped_weights2 = tf.clip_by_norm(weights2, clip_norm=threshold, axes=1)
clip_weights2 = tf.assign(weights2, clipped_weights2)

init = tf.global_variables_initializer()

In [0]:
n_epochs = 20
batch_size = 50

with tf.Session() as sess:
  init.run()
  for epoch in range(n_epochs):
    for X_batch, y_batch in shuffle_batch(X_train, y_train, batch_size):
      sess.run(training_op, feed_dict={X: X_batch, y: y_batch})
      clip_weights1.eval()
      clip_weights2.eval()
    accuracy_val = accuracy.eval(feed_dict={X: X_valid, y: y_valid})
    print('Epoch: {} Accuracy: {}'.format(epoch, accuracy_val))

Epoch: 0 Accuracy: 0.9567999839782715
Epoch: 1 Accuracy: 0.9696000218391418
Epoch: 2 Accuracy: 0.9715999960899353
Epoch: 3 Accuracy: 0.9771999716758728
Epoch: 4 Accuracy: 0.9771999716758728
Epoch: 5 Accuracy: 0.977400004863739
Epoch: 6 Accuracy: 0.982200026512146
Epoch: 7 Accuracy: 0.9810000061988831
Epoch: 8 Accuracy: 0.9800000190734863
Epoch: 9 Accuracy: 0.9824000000953674
Epoch: 10 Accuracy: 0.982200026512146
Epoch: 11 Accuracy: 0.9851999878883362
Epoch: 12 Accuracy: 0.9824000000953674
Epoch: 13 Accuracy: 0.984000027179718
Epoch: 14 Accuracy: 0.9842000007629395
Epoch: 15 Accuracy: 0.9842000007629395
Epoch: 16 Accuracy: 0.984000027179718
Epoch: 17 Accuracy: 0.9833999872207642
Epoch: 18 Accuracy: 0.9842000007629395
Epoch: 19 Accuracy: 0.9843999743461609


The above implementation works, but it is verbose and not reuseable. Below is a different implementation which defines a regularizer similar to the `l1_regularizer()` function.

In [0]:
def max_norm_regularizer(threshold, axes=1, name='max_norm',
                         collection='max_norm'):
  def max_norm(weights):
    clipped = tf.clip_by_norm(weights, clip_norm=threshold, axes=axes)
    clip_weights = tf.assign(weights, clipped)
    tf.add_to_collection(collection, clip_weights)
    return None
  return max_norm

In [0]:
reset_graph()

n_inputs = 28 ** 2
n_hidden1 = 300
n_hidden2 = 50
n_outputs = 10

learning_rate = 0.01
momentum = 0.9

X = tf.placeholder(tf.float32, shape=(None, n_inputs), name='X')
y = tf.placeholder(tf.int32, shape=(None), name='y')

max_norm_reg = max_norm_regularizer(threshold=1.0)

with tf.name_scope('dnn'):
  hidden1 = tf.layers.dense(X, n_hidden1, activation=tf.nn.relu,
                            kernel_regularizer=max_norm_reg, name='hidden1')
  hidden2 = tf.layers.dense(hidden1, n_hidden2, activation=tf.nn.relu,
                            kernel_regularizer=max_norm_reg, name='hidden2')
  logits = tf.layers.dense(hidden2, n_outputs, name='outputs')
  
with tf.name_scope('loss'):
  xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y,
                                                            logits=logits)
  loss = tf.reduce_mean(xentropy, name='loss')
  
with tf.name_scope('train'):
  optimizer = tf.train.MomentumOptimizer(learning_rate, momentum)
  training_op = optimizer.minimize(loss)
  
with tf.name_scope('eval'):
  correct = tf.nn.in_top_k(logits, y, 1)
  accuracy = tf.reduce_mean(tf.cast(correct, tf.float32), name='accuracy')
  
init = tf.global_variables_initializer()

In [0]:
n_epochs = 20
batch_size = 50

clip_all_weights = tf.get_collection('max_norms')

with tf.Session() as sess:
  init.run()
  for epoch in range(n_epochs):
    for X_batch, y_batch in shuffle_batch(X_train, y_train, batch_size):
      sess.run(training_op, feed_dict={X: X_batch, y: y_batch})
      sess.run(clip_all_weights)
    accuracy_val = accuracy.eval(feed_dict={X: X_valid, y: y_valid})
    print('Epoch: {} Accuracy: {}'.format(epoch, accuracy_val))

Epoch: 0 Accuracy: 0.9562000036239624
Epoch: 1 Accuracy: 0.9710000157356262
Epoch: 2 Accuracy: 0.9733999967575073
Epoch: 3 Accuracy: 0.9753999710083008
Epoch: 4 Accuracy: 0.9746000170707703
Epoch: 5 Accuracy: 0.9783999919891357
Epoch: 6 Accuracy: 0.9800000190734863
Epoch: 7 Accuracy: 0.980400025844574
Epoch: 8 Accuracy: 0.9810000061988831
Epoch: 9 Accuracy: 0.9824000000953674
Epoch: 10 Accuracy: 0.9815999865531921
Epoch: 11 Accuracy: 0.9815999865531921
Epoch: 12 Accuracy: 0.9807999730110168
Epoch: 13 Accuracy: 0.9818000197410583
Epoch: 14 Accuracy: 0.9819999933242798
Epoch: 15 Accuracy: 0.9810000061988831
Epoch: 16 Accuracy: 0.9807999730110168
Epoch: 17 Accuracy: 0.982200026512146
Epoch: 18 Accuracy: 0.9811999797821045
Epoch: 19 Accuracy: 0.982200026512146


### Data Augmentation

One last regularization technique is data augmentation. You can augment your training instances in such a way that a human would not be able to tell the difference, for example if the task is image classification you can try slightly rotating the images or changing the lighting conditions slightly. This will make the model more tolerant to minor changes and prevent overfitting.

## Practical Guidelines

Below are good default settings for a DNN:

<table>
  <tr><td><b>Initialization</b></td><td></td><td>He Initialization</td></tr>
  <tr><td><b>Activation function</b></td><td></td><td>ELU</td></tr>
  <tr><td><b>Normalization</b></td><td></td><td>Batch Normalization</td></tr>
  <tr><td><b>Regularization</b></td><td></td><td>Dropout</td></tr>
  <tr><td><b>Optimizer</b></td><td></td><td>Nesterov Accelerated Gradient</td></tr>
  <tr><td><b>Learning rate schedule</b></td><td></td><td>None</td></tr>
</table>

Here is how the default configuration can be tweaked:

- You can add learning rate scheduling if you are unable to find a good learning rate.

- If your training set is too small, you can use data augmentation to add more training instances.

- If you need a sparse model, you can add $\ell_1$ regularization or you can use FTRL optimization as well.

- If you want a fast model at runtime, you can drop Batch Normalization or replace the ELU activation function with ReLU.

## Exercises

### 1. Is it okay to initialize all the weights to the same value as long as that value is selected randomly using He initialization?

No, the point of using He initialization is so that the variance of the outputs is the same as the variance of the inputs. Initializing all of the weights to the same value will have the same affect as having one neuron per layer, and it will not be possible to converge to a good solution during training.

### 2. Is it okay to initialize the bias terms to 0?

Yes, it is fine to initialize the bias terms to zero at first. You can even initialize them randomly like the weights.

### 3. Name three advantages of the ELU activation function over ReLU.

1. ELU can have negative values, which means that the mean of the output values will be closer to zero, which can help solve the vanishing gradients problem.

2. Since ELU outputs negative values instead of zero when the output of the layer before activation is negative and they also have a nonzero gradient, which prevents neurons from going "dead."

3. ELU is differentiable for all possible input values, which means that Gradient Descent will converge faster.

### 4. In which case would you want to use each of the following activation functions:

#### ELU

You can use ELU to help training converge faster or if the neural network's performance is degrading because some neurons are going "dead."

#### Leaky ReLU

You can use leaky ReLU to prevent neurons from going "dead" and if you want prediction to be more performant than ELU, even if it means training may take longer.

#### ReLU

You can use ReLU as a default activation function for the hidden layers in a DNN. It generally performs well and is very fast to compute. Although leaky ReLU and ELU outperform ReLU, people use ReLU for its simplicity.

#### tanh

You can use the tanh activation function as an alternative for the output layer if the DNN is for a regression task whose outputs are between -1 and 1.

#### logistic

Similar to tanh, you can use the logistic activation function as an activation function for an output layer which needs to output probabilities that an instance belongs to a particular class.

#### softmax

Softmax activation function should be used for the output layer of the DNN to make predictions for a classifying instances into mutually exclusive classes.

### 5. What may happen if you set the `momentum` hyperparamter to close to 1 (e.g. 0.999)?

If the `momentum` hyperparameter is too close to one, then the momentum vector, $\mathbf{m}$, will increase where the gradient is steep and not decrease when the gradient is less steep. This will cause the optimization algorithm to bounce around the optimum and not converge.

### 6. Name three ways you can produce a sparse model.

1. You can set all of the weights that are very small to zero.

2. You can apply strong $\ell_1$ regularization to the model.

3. You can use FTRL instead of Adam optimization.

### 7. Does dropout slow down training? Does it slow down inference (i.e. making predictions on new instances)?

Dropout slows training typically by a factor of two since you need to train the DNN for more iterations on average. Dropout also requires TensorFlow to apply additional operations during training which increases training time. Dropout does affect the performance of inference.

### 8. Deep Learning

#### a. Build a DNN with 5 hidden layers of 100 neurons each, He initialization, and the ELU activation function.

In [0]:
# Defining a function which builds the DNN with the specified settings.

def build_dnn(X, n_outputs):
  he_init = tf.variance_scaling_initializer()
  n_hidden = 100
  with tf.name_scope('dnn'):
    hidden1 = tf.layers.dense(X, n_hidden, kernel_initializer=he_init,
                              activation=tf.nn.elu, name='hidden1')
    hidden2 = tf.layers.dense(hidden1, n_hidden, kernel_initializer=he_init,
                              activation=tf.nn.elu, name='hidden2')
    hidden3 = tf.layers.dense(hidden2, n_hidden, kernel_initializer=he_init,
                              activation=tf.nn.elu, name='hidden3')
    hidden4 = tf.layers.dense(hidden3, n_hidden, kernel_initializer=he_init,
                              activation=tf.nn.elu, name='hidden4')
    hidden5 = tf.layers.dense(hidden4, n_hidden, kernel_initializer=he_init,
                              activation=tf.nn.elu, name='hidden5')
    logits = tf.layers.dense(hidden5, n_outputs, name='outputs')
  return hidden1, hidden2, hidden3, hidden4, hidden5, logits

#### b. Using Adam optimization and early stopping, try training it on MNIST but only on digits 0 to 4. You will need a softmax output layer with five neurons, and make sure to save checkpoints at regular intervals and save the final model so you can use it later. 

In [0]:
# Getting the digits from MNIST that are in the range of 0 to 4

X_train_01234 = X_train[y_train < 5]
y_train_01234 = y_train[y_train < 5]

X_valid_01234 = X_valid[y_valid < 5]
y_valid_01234 = y_valid[y_valid < 5]

X_test_01234 = X_test[y_test < 5]
y_test_01234 = y_test[y_test < 5]

In [0]:
# Define a function for building the TensorFlow graph for the classification
# task

from datetime import datetime

n_inputs = 28 ** 2
n_outputs = 5
root_logdir = 'logs'

def build_graph(X, y, learning_rate=0.001, beta1=0.9, beta2=0.999):
  hidden1, hidden2, hidden3, hidden4, hidden5, logits = build_dnn(X, n_outputs)

  with tf.name_scope('loss'):
    x_entropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y,
                                                             logits=logits)
    loss = tf.reduce_mean(x_entropy, name='loss')

  with tf.name_scope('train'):
    optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate, beta1=beta1,
                                       beta2=beta2)
    training_op = optimizer.minimize(loss)
    
  with tf.name_scope('eval'):
    correct = tf.nn.in_top_k(logits, y, 1)
    accuracy = tf.reduce_mean(tf.cast(correct, tf.float32))
    
  now = datetime.utcnow().strftime('%Y%m%d%H%M%S')
  logdir = '{}/run-{}/'.format(root_logdir, now)
  with tf.name_scope('saver'):
    saver = tf.train.Saver()
    loss_summary = tf.summary.scalar('Loss', loss)
    accuracy_summary = tf.summary.scalar('Accuracy', accuracy)
    file_writer = tf.summary.FileWriter(logdir, tf.get_default_graph())
    
  with tf.name_scope('init'):
    init = tf.global_variables_initializer()
    
  return init, saver, training_op, loss, accuracy, loss_summary, \
    accuracy_summary, file_writer

In [0]:
# Define a function for training the DNN.

import os

model_path = 'mnist_model_01234.ckpt'
n_epochs = 500

def train_dnn(X, y, init, saver, training_op, loss, accuracy, loss_summary,
              accuracy_summary, file_writer, n_epochs=100, batch_size=50):
  with tf.Session() as sess:
    if os.path.isfile(model_path):
      saver.restore(sess, model_path)
      with open('{}.epoch'.format(model_path)) as f:
        start_epoch = int(f.read())
    else:
      sess.run(init)
      start_epoch = 0
      
    best_loss = None
    rounds_since_best_loss = 0
    
    for epoch in range(start_epoch, n_epochs):
      if epoch % 10 == 0:
        saver.save(sess, model_path)
        with open('{}.epoch'.format(model_path), 'w') as f:
          f.write(str(epoch))
          f.close()
          
      for X_batch, y_batch in shuffle_batch(X_train_01234, y_train_01234,
                                            batch_size):
        sess.run(training_op, feed_dict={X: X_batch, y: y_batch})
        
      loss_summary_str = loss_summary.eval(
          feed_dict={X: X_valid_01234, y: y_valid_01234})
      acc_summary_str = accuracy_summary.eval(
          feed_dict={X: X_valid_01234, y: y_valid_01234})
      file_writer.add_summary(loss_summary_str, epoch)
      file_writer.add_summary(acc_summary_str, epoch)
      
      if epoch == 0:
        best_loss = loss.eval(feed_dict={X: X_train_01234, y: y_train_01234})
      elif epoch % 5 == 0:
        loss_val = loss.eval(feed_dict={X: X_train_01234, y: y_train_01234})
        if loss_val < best_loss:
          best_loss = loss_val
          rounds_since_best_loss = 0
        else:
          rounds_since_best_loss += 1
          if rounds_since_best_loss == 4:
            break
            
      if epoch % 10 == 0:
        accuracy_val = accuracy.eval(feed_dict={X: X_valid_01234,
                                                y: y_valid_01234})
        print('Epoch: {} Accuracy: {}'.format(epoch, accuracy_val))
        
    saver.save(sess, model_path)
    with open('{}.epoch'.format(model_path), 'w') as f:
      f.write(str(epoch))
      f.close()

In [0]:
# Training the model

reset_graph()

X = tf.placeholder(tf.float32, shape=(None, n_inputs), name='X')
y = tf.placeholder(tf.int32, shape=(None), name='y')

init, saver, training_op, loss, accuracy, loss_summary, accuracy_summary, \
  file_writer = build_graph(X, y)

train_dnn(X, y, init, saver, training_op, loss, accuracy, loss_summary,
          accuracy_summary, file_writer, n_epochs=500, batch_size=50)

Epoch: 0 Accuracy: 0.9757623076438904
Epoch: 10 Accuracy: 0.989835798740387
Epoch: 20 Accuracy: 0.9917904734611511
Epoch: 30 Accuracy: 0.9937450885772705
Epoch: 40 Accuracy: 0.9906176924705505
Epoch: 50 Accuracy: 0.9929632544517517
Epoch: 60 Accuracy: 0.9937450885772705
Epoch: 70 Accuracy: 0.9941360354423523
Epoch: 80 Accuracy: 0.9941360354423523
Epoch: 90 Accuracy: 0.9941360354423523
Epoch: 100 Accuracy: 0.9941360354423523


In [0]:
!wget https://bin.equinox.io/c/4VmDzA7iaHb/ngrok-stable-linux-amd64.zip
!unzip ngrok-stable-linux-amd64.zip

In [0]:
get_ipython().system_raw(
  'tensorboard --logdir {} --host 0.0.0.0 --port 6006 &'.format(root_logdir))
get_ipython().system_raw('./ngrok http 6006 &')

In [0]:
! curl -s http://localhost:4040/api/tunnels | python3 -c \
  "import sys, json; print(json.load(sys.stdin)['tunnels'][0]['public_url'])"

https://f2880a28.ngrok.io


#### c. Tune the hyperparameters with cross-validation and see what precision you can achieve.

In [0]:
# Refactoring into the DNN defined above as an Scikit-Learn Estimator

from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.exceptions import NotFittedError

class DNNClassifier(BaseEstimator, ClassifierMixin):
  def __init__(self, n_hidden=5, n_neurons=100, learning_rate=0.001,
           beta1=0.9, beta2=0.999, batch_size=50,
           batch_normalization_momentum=None, dropout_rate=None,
           activation=tf.nn.elu, n_epochs=500, initializer=he_init):
    self.n_hidden = n_hidden
    self.n_neurons = n_neurons
    self.learning_rate = learning_rate
    self.beta1 = beta1
    self.beta2 = beta2
    self.batch_size = batch_size
    self.batch_normalization_momentum = batch_normalization_momentum
    self.dropout_rate = dropout_rate
    self.activation = activation
    self.n_epochs = n_epochs
    self.initializer = initializer
    self._sess = None
    
  def __del__(self):
    if self._sess is not None:
      self._sess.close()
      
  def _dnn(self, inputs):
    for i in range(self.n_hidden):
      if self.dropout_rate:
        inputs = tf.layers.dropout(inputs, rate=self.dropout_rate,
                                   training=self._training,
                                   name='dropout{}'.format(i))
      inputs = tf.layers.dense(inputs, self.n_neurons,
                               name='hidden{}'.format(i),
                               kernel_initializer=self.initializer)
      if self.batch_normalization_momentum:
        inputs = tf.layers.batch_normalization(
            inputs, training=self._training,
            momentum=self.batch_normalization_momentum,
            name='batch_normal{}'.format(i))
      inputs = self.activation(inputs)
    return inputs
    
  def _build_graph(self, n_inputs, n_outputs):
    reset_graph()
    self._X = tf.placeholder(tf.float32, shape=(None, n_inputs), name='X')
    self._y = tf.placeholder(tf.int32, shape=(None), name='y')
    
    if self.batch_normalization_momentum or self.dropout_rate:
      self._training = tf.placeholder_with_default(False, shape=(),
                                                   name='training')
    else:
      self._training = None
      
    dnn_outputs = self._dnn(self._X)
    logits = tf.layers.dense(dnn_outputs, n_outputs,
                             kernel_initializer=self.initializer,
                             name='logits')
    
    self._y_proba = tf.nn.softmax(logits, name='y_proba')
    
    xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=self._y,
                                                              logits=logits)
    self._loss = tf.reduce_mean(xentropy, name='loss')
    
    optimizer = tf.train.AdamOptimizer(learning_rate=self.learning_rate,
                                       beta1=self.beta1, beta2=self.beta2)
    self._training_op = optimizer.minimize(self._loss)
    self._extra_update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
    
    correct = tf.nn.in_top_k(logits, self._y, 1)
    self._accuracy = tf.reduce_mean(tf.cast(correct, tf.float32),
                                    name="accuracy")
    
    self._saver = tf.train.Saver()
    self._loss_summary = tf.summary.scalar('Loss', self._loss)
    self._accuracy_summary = tf.summary.scalar('Accuracy', self._accuracy)
    now = datetime.utcnow().strftime('%Y%m%d%H%M%S')
    logdir = '{}/run-{}/'.format(root_logdir, now)
    self._file_writer = tf.summary.FileWriter(logdir, tf.get_default_graph())
    
    self._init = tf.global_variables_initializer()
    
    self._graph = tf.get_default_graph()
    
  def _get_model_params(self):
    with self._graph.as_default():
      gvars = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES)
    return {gvar.op.name: val
            for (gvar, val) in zip(gvars, self._sess.run(gvars))}
  
  def _restore_model_params(self, model_params):
    gvar_names = list(model_params.keys())
    assign_ops = {
        gvar_name: self._graph.get_operation_by_name(
            '{}/Assign'.format(gvar_name))
        for gvar_name in gvar_names
    }
    init_values = {gvar_name: assign_op.inputs[1]
                   for gvar_name, assign_op in assign_ops.items()}
    feed_dict = {init_values[gvar_name]: model_params[gvar_name]
                 for gvar_name in gvar_names}
    self._sess.run(assign_ops, feed_dict=feed_dict)
    
  def _train(self, X_train, y_train):
    self._sess.run(self._init)
    
    best_loss = None
    rounds_since_best_loss = 0
    best_params = None
    
    for epoch in range(self.n_epochs):
      for X_batch, y_batch in shuffle_batch(X_train, y_train, self.batch_size):
        feed_dict = {self._X: X_batch, self._y: y_batch}
        if self._training is not None:
          feed_dict[self._training] = True
        self._sess.run(self._training_op, feed_dict=feed_dict)
        if self._extra_update_ops:
          self._sess.run(self._extra_update_ops, feed_dict=feed_dict)
      
      loss_summary_str = self._loss_summary.eval(
          session=self._sess, feed_dict={self._X: X_train, self._y: y_train})
      acc_summary_str = self._accuracy_summary.eval(
          session=self._sess, feed_dict={self._X: X_train, self._y: y_train})
      self._file_writer.add_summary(loss_summary_str, epoch)
      self._file_writer.add_summary(acc_summary_str, epoch)
      
      if epoch == 0:
        best_loss = self._loss.eval(
            session=self._sess, feed_dict={self._X: X_train, self._y: y_train})
        best_params = self._get_model_params()
      elif epoch % 5 == 0:
        loss_val = self._loss.eval(
            session=self._sess, feed_dict={self._X: X_train, self._y: y_train})
        if loss_val < best_loss:
          best_loss = loss_val
          rounds_since_best_loss = 0
          best_params = self._get_model_params()
        else:
          rounds_since_best_loss += 1
          if rounds_since_best_loss == 5:
            self._restore_model_params(best_params)
            break
      
  def save_model(self, model_path):
    if self._sess is None:
      raise NotFittedError()
    self._saver.save(self._sess, model_path)
    
  def restore_model(self, model_path, n_inputs, n_outputs):
    if self._sess is not None:
      self._sess.close()
    self._build_graph(n_inputs, n_outputs)
    self._sess = tf.Session()
    self._saver.restore(self._sess, model_path)
  
  def fit(self, X, y):
    if self._sess is None:
      self._build_graph(X.shape[1], len(set(y)))
      self._sess = tf.Session()
    self._train(X, y)
    return self
    
  def predict_proba(X, y=None):
    if self._sess is None:
      raise NotFittedError()
    return self._y_proba.eval(session=self._sess,
                              feed_dict={self._X: X, self._y: y})
  
  def predict(self, X, y=None):
    y_proba = self.predict_proba(X, y)
    return np.argmax(y_proba, axis=1)
  
  def score(self, X, y):
    if self._sess is None:
      raise NotFittedError()
    return self._accuracy.eval(session=self._sess,
                               feed_dict={self._X: X, self._y: y})

In [0]:
from sklearn.model_selection import RandomizedSearchCV

param_grid = {
    'activation': [tf.nn.relu, tf.nn.elu, leaky_relu, tf.nn.tanh],
    'learning_rate': [0.0001, 0.0005, 0.001, 0.005, 0.01],
    'beta1': [0.9, 0.99, 0.999],
    'beta2': [0.9, 0.99, 0.999],
    'n_neurons': [70, 80, 90, 100, 110, 120, 130, 140, 150],
    'batch_size': [60, 80, 100, 120, 140, 160, 180, 200],
}

rnd_search = RandomizedSearchCV(DNNClassifier(), param_grid, n_iter=40, cv=3)
rnd_search.fit(X_train_01234, y_train_01234)

RandomizedSearchCV(cv=3, error_score='raise-deprecating',
          estimator=DNNClassifier(activation=<function elu at 0x7f7baa3e6488>,
       batch_normalization_momentum=None, batch_size=50, beta1=0.9,
       beta2=0.999, dropout_rate=None,
       initializer=<tensorflow.python.ops.init_ops.VarianceScaling object at 0x7f7b93c00ef0>,
       learning_rate=0.001, n_epochs=500, n_hidden=5, n_neurons=100),
          fit_params=None, iid='warn', n_iter=40, n_jobs=None,
          param_distributions={'activation': [<function relu at 0x7f7baa3989d8>, <function elu at 0x7f7baa3e6488>, <function leaky_relu at 0x7f7b9f805d08>, <function tanh at 0x7f7baa56ba60>], 'learning_rate': [0.0001, 0.0005, 0.001, 0.005, 0.01], 'beta1': [0.9, 0.99, 0.999], 'beta2': [0.9, 0.99, 0.999], 'n_neurons': [70, 80, 90, 100, 110, 120, 130, 140, 150], 'batch_size': [60, 80, 100, 120, 140, 160, 180, 200]},
          pre_dispatch='2*n_jobs', random_state=None, refit=True,
          return_train_score='warn', scoring=N

In [0]:
print('Best score:', rnd_search.best_score_)
print('Best params:', rnd_search.best_params_)

Best score: 0.9901205472628479
Best params: {'n_neurons': 110, 'learning_rate': 0.001, 'beta2': 0.99, 'beta1': 0.9, 'batch_size': 140, 'activation': <function leaky_relu at 0x7f7b9f805d08>}


In [0]:
dnn_clf = rnd_search.best_estimator_
dnn_clf.score(X_test_01234, y_test_01234)

0.99494064

#### d. Now add batch normalization and compare the learning curves. Is it converging faster than before?

In [0]:
dnn_clf = DNNClassifier(batch_normalization_momentum=0.95, n_neurons=110,
                        learning_rate=0.001, beta2=0.99, beta1=0.9,
                        batch_size=140, activation=leaky_relu)
dnn_clf.fit(X_train_01234, y_train_01234)

DNNClassifier(activation=<function leaky_relu at 0x7f7b9f805d08>,
       batch_normalization_momentum=0.95, batch_size=140, beta1=0.9,
       beta2=0.99, dropout_rate=None,
       initializer=<tensorflow.python.ops.init_ops.VarianceScaling object at 0x7f7b93c00ef0>,
       learning_rate=0.001, n_epochs=500, n_hidden=5, n_neurons=110)

In [0]:
dnn_clf.score(X_test_01234, y_test_01234)

0.995719

In [0]:
dnn_clf.score(X_train_01234, y_train_01234)

1.0

Examining TensorBoard shows that the model did converge faster. It also appears that the model is overfitting the training set now.

#### e. Is the model overfitting the training set? Try adding dropout to every layer and try again. Does it help?

The model is overfitting. Below is code which retrains the model using dropout.

In [0]:
dnn_clf = DNNClassifier(batch_normalization_momentum=0.95, n_neurons=110,
                        learning_rate=0.001, beta2=0.99, beta1=0.9,
                        batch_size=140, activation=leaky_relu, dropout_rate=0.5)
dnn_clf.fit(X_train_01234, y_train_01234)

Instructions for updating:
Please use `rate` instead of `keep_prob`. Rate should be set to `rate = 1 - keep_prob`.


DNNClassifier(activation=<function leaky_relu at 0x7f7b9f805d08>,
       batch_normalization_momentum=0.95, batch_size=140, beta1=0.9,
       beta2=0.99, dropout_rate=0.5,
       initializer=<tensorflow.python.ops.init_ops.VarianceScaling object at 0x7f7b93c00ef0>,
       learning_rate=0.001, n_epochs=500, n_hidden=5, n_neurons=110)

In [0]:
dnn_clf.score(X_test_01234, y_test_01234)

0.9939677

In [0]:
dnn_clf.score(X_train_01234, y_train_01234)

0.99703974

The model is overfitting less, but it is also performing worse than without dropout. I could run another parameter search using `RandomizedSearchCV` to tune the hyperparameters. In the interest of time I will omit that and use the model without dropout. Below I will retrain the model without dropout and then save it.

In [0]:
model_path = './my_mnist_model_01234.ckpt'

dnn_clf = DNNClassifier(n_neurons=110, learning_rate=0.001, beta2=0.99,
                        beta1=0.9, batch_size=140, activation=leaky_relu)
dnn_clf.fit(X_train_01234, y_train_01234)
dnn_clf.save_model(model_path)

### 9. Transfer learning.

#### a. Create a new DNN that reuses all the pretrained hidden layers of the previous model, freezes them, and replaces the softmax output layer with the new one.

In [0]:
# Refactoring the class from the previous exercise to have another
# hyperparameter.

class TransferDNNClassifier(DNNClassifier):
  def __init__(self, n_frozen=0, **kwargs):
    DNNClassifier.__init__(self, **kwargs)
    self.n_frozen = n_frozen
    
  def _dnn(self, inputs):
    for i in range(self.n_hidden):
      if self.dropout_rate:
        inputs = tf.layers.dropout(inputs, rate=self.dropout_rate,
                                   training=self._training,
                                   name='dropout{}'.format(i))
      inputs = tf.layers.dense(inputs, self.n_neurons,
                               name='hidden{}'.format(i),
                               kernel_initializer=self.initializer)
      inputs = self.activation(inputs)
      if i + 1 == self.n_frozen:
        inputs = tf.stop_gradient(inputs)
    return inputs

In [0]:
tdnn_clf = TransferDNNClassifier(n_neurons=110, learning_rate=0.001,
                                 beta2=0.99, beta1=0.9, batch_size=140,
                                 activation=leaky_relu, n_frozen=5)
tdnn_clf.restore_model(model_path, 28 * 28, 5)

INFO:tensorflow:Restoring parameters from ./my_mnist_model_01234.ckpt


#### b. Train this new DNN on digits 5 to 9, using only 100 images per digit, and time how long it takes. Despite this small number of examples, can you achieve high precision?

In [0]:
# Clock class for timing training.

import time

class Clock:
  def __init__(self):
    self.start_time = None
  def start(self):
    self.start_time = time.time()
    return self
  def stop(self):
    dt = time.time() - self.start_time
    self.start_time = None
    h, m, s = int(dt // 3600), int(dt % 3600) // 60, dt % 60
    return '{}h {}m {:.3f}s'.format(h, m, s)

In [0]:
# Preparing the MNIST data by first getting all of the instances
# which are 5 through 9.

X_train_56789 = X_train[y_train >= 5]
y_train_56789 = y_train[y_train >= 5] - 5

X_test_56789 = X_test[y_test >= 5]
y_test_56789 = y_test[y_test >= 5] - 5

In [0]:
# Getting a training set with 100 instances from each class.

X_train_transfer = []
y_train_transfer = []

counts = {i: 0 for i in range(5)}
for data, label in zip(X_train_56789, y_train_56789):
  counts[label] += 1
  if counts[label] <= 100:
    X_train_transfer.append(data)
    y_train_transfer.append(label)

X_train_transfer = np.array(X_train_transfer, dtype=np.float32)
y_train_transfer = np.array(y_train_transfer, dtype=np.int32)

In [0]:
# Training the model and timing how long it takes to train it.

clock = Clock().start()
tdnn_clf.fit(X_train_transfer, y_train_transfer)
clock.stop()

'0h 0m 8.735s'

In [0]:
# The accuracy is not that great, but this is expected given we
# are only retraining the output layer.

tdnn_clf.score(X_test_56789, y_test_56789)

0.62353426

#### c. Try caching the frozen layers instead and train the model again: how much faster is it now?

In [0]:
# Refactoring the DNNClassiifer to cache the frozen layers during traning

class CachedDNNClassifier(DNNClassifier):
  def __init__(self, n_frozen=0, **kwargs):
    DNNClassifier.__init__(self, **kwargs)
    self.n_frozen = n_frozen
    
  def _dnn(self, inputs):
    self._cached_layer = None
    for i in range(self.n_hidden):
      if self.dropout_rate:
        inputs = tf.layers.dropout(inputs, rate=self.dropout_rate,
                                   training=self._training,
                                   name='dropout{}'.format(i))
      inputs = tf.layers.dense(inputs, self.n_neurons,
                               name='hidden{}'.format(i),
                               kernel_initializer=self.initializer)
      inputs = self.activation(inputs)
      if i + 1 == self.n_frozen:
        self._cached_layer = inputs
        inputs = tf.stop_gradient(inputs)
    return inputs
  
  def _train(self, X_train, y_train):
    self._sess.run(self._init)
    
    best_loss = None
    rounds_since_best_loss = 0
    best_params = None
    
    X_input = X_train
    if self._cached_layer is not None:
      X_input = self._sess.run(self._cached_layer, feed_dict={self._X: X_train,
                                                              self._y: y_train})
    
    for epoch in range(self.n_epochs):
      for X_batch, y_batch in shuffle_batch(X_input, y_train, self.batch_size):
        inputs = self._X if self._cached_layer is None else self._cached_layer
        feed_dict = {inputs: X_batch, self._y: y_batch}
        if self._training is not None:
          feed_dict[self._training] = True
        self._sess.run(self._training_op, feed_dict=feed_dict)
      
      loss_summary_str = self._loss_summary.eval(
          session=self._sess, feed_dict={self._X: X_train, self._y: y_train})
      acc_summary_str = self._accuracy_summary.eval(
          session=self._sess, feed_dict={self._X: X_train, self._y: y_train})
      self._file_writer.add_summary(loss_summary_str, epoch)
      self._file_writer.add_summary(acc_summary_str, epoch)
      
      if epoch == 0:
        best_loss = self._loss.eval(
            session=self._sess, feed_dict={self._X: X_train, self._y: y_train})
        best_params = self._get_model_params()
      elif epoch % 5 == 0:
        loss_val = self._loss.eval(
            session=self._sess, feed_dict={self._X: X_train, self._y: y_train})
        if loss_val < best_loss:
          best_loss = loss_val
          rounds_since_best_loss = 0
          best_params = self._get_model_params()
        else:
          rounds_since_best_loss += 1
          if rounds_since_best_loss == 5:
            self._restore_model_params(best_params)
            break

In [0]:
cdnn_clf = CachedDNNClassifier(n_neurons=110, learning_rate=0.01,
                               beta2=0.99, beta1=0.9, batch_size=140,
                               activation=leaky_relu, n_frozen=5)
cdnn_clf.restore_model(model_path, 28 * 28, 5)

INFO:tensorflow:Restoring parameters from ./my_mnist_model_01234.ckpt


In [0]:
# Training the model and timing how long it takes to train it. The model
# takes less time to train with caching.

clock = Clock().start()
cdnn_clf.fit(X_train_transfer, y_train_transfer)
clock.stop()

'0h 0m 6.521s'

In [0]:
cdnn_clf.score(X_test_56789, y_test_56789)

0.6877186

#### d. Try again reusing four hidden layers instead of five. Can you achieve a higher precision?

In [0]:
cdnn_clf = CachedDNNClassifier(n_neurons=110, learning_rate=0.01,
                               beta2=0.99, beta1=0.9, batch_size=140,
                               activation=leaky_relu, n_frozen=4)
cdnn_clf.restore_model(model_path, 28 * 28, 5)

INFO:tensorflow:Restoring parameters from ./my_mnist_model_01234.ckpt


In [0]:
cdnn_clf.fit(X_train_transfer, y_train_transfer)

CachedDNNClassifier(n_frozen=4)

In [0]:
cdnn_clf.score(X_test_56789, y_test_56789)

0.7228965

As we can see, reusing only four out of five hidden layers increased accuracy by 10%.

#### e. Now unfreeze two hidden layers and continue training, can you get the model to perform even better?

In [0]:
cdnn_clf = CachedDNNClassifier(n_neurons=110, learning_rate=0.01,
                               beta2=0.99, beta1=0.9, batch_size=50,
                               activation=tf.nn.elu, n_frozen=3)
cdnn_clf.restore_model(model_path, 28 * 28, 5)

INFO:tensorflow:Restoring parameters from ./my_mnist_model_01234.ckpt


In [0]:
cdnn_clf.fit(X_train_transfer, y_train_transfer)

CachedDNNClassifier(n_frozen=3)

In [0]:
cdnn_clf.score(X_test_56789, y_test_56789)

0.8514709

The model does perform slightly better after unfreezing two hidden layers. Increasing the learning rate also helped the model converge faster. ELU also helped the model achieve some better performance.

### 10. Pretraining on an auxilary task.

#### a. Build two DNNs (let's call them DNN A and DNN B), both similar to the one you built earlier but without the output layer: each DNN has 5 hidden layers of 100 neurons each, He initialization, and ELU activation. Next add one more hidden layer with 10 units on top of both DNNs using TensorFlow's `concat()` function, then add an output layer with a single neuron using the logistic activation function.

In [5]:
reset_graph()

n_inputs = 28 ** 2

X_a = tf.placeholder(tf.float32, shape=(None, n_inputs), name='X_a')
X_b = tf.placeholder(tf.float32, shape=(None, n_inputs), name='X_b')

y = tf.placeholder(tf.int32, shape=(None), name='y')

def hidden_layer(inputs, name, n_neurons=100):
  return tf.layers.dense(inputs, n_neurons, kernel_initializer=he_init,
                         activation=tf.nn.elu, name=name)

hidden_1a = hidden_layer(X_a, name='hidden_1a')
hidden_2a = hidden_layer(hidden_1a, name='hidden_2a')
hidden_3a = hidden_layer(hidden_2a, name='hidden_3a')
hidden_4a = hidden_layer(hidden_3a, name='hidden_4a')
hidden_5a = hidden_layer(hidden_4a, name='hidden_5a')

hidden_1b = hidden_layer(X_b, name='hidden_1b')
hidden_2b = hidden_layer(hidden_1b, name='hidden_2b')
hidden_3b = hidden_layer(hidden_2b, name='hidden_3b')
hidden_4b = hidden_layer(hidden_3b, name='hidden_4b')
hidden_5b = hidden_layer(hidden_4b, name='hidden_5b')

concat = tf.concat([hidden_5a, hidden_5b], axis=1, name='concat')

hidden_merged = hidden_layer(concat, n_neurons=10, name='hidden_merged')
logits = tf.layers.dense(hidden_merged, 1, kernel_initializer=he_init)

y_float = tf.cast(y, tf.float32)
xentropy = tf.nn.sigmoid_cross_entropy_with_logits(labels=y_float,
                                                   logits=logits)
loss = tf.reduce_mean(xentropy, name='loss')

optimizer = tf.train.MomentumOptimizer(learning_rate=0.01, momentum=0.9)
training_op = optimizer.minimize(loss)

y_pred = tf.cast(tf.greater_equal(logits, 0), tf.int32)
correct = tf.equal(y_pred, y)
accuracy = tf.reduce_mean(tf.cast(correct, tf.float32), name='accuracy')

saver = tf.train.Saver()

init = tf.global_variables_initializer()

Instructions for updating:
Use tf.cast instead.


#### b. Split the MNIST training set into two sets: split #1 should contain 55,000 images, and split #2 should contain 5,000 images. Create a function that generates a training batch where each instance is a pair of MNIST images picked from split #1. Half of the training instances should be the same class, while the other half should be images from different classes. For each pair, the training label should be 0 if the images are from the same class, or 1 if they are from different classes.

In [0]:
# Downloading MNIST again.

(X_train, y_train), (X_test, y_test) = tf.keras.datasets.mnist.load_data()
X_train = X_train.astype(np.float32).reshape(-1, 28*28) / 255.0
X_test = X_test.astype(np.float32).reshape(-1, 28*28) / 255.0
y_train = y_train.astype(np.int32)
y_test = y_test.astype(np.int32)

In [0]:
# Splitting the training set into split #1 and split #2.

X_train1 = []
y_train1 = []

X_train2 = []
y_train2 = []

counts = {i: 0 for i in range(10)}

for data, label in zip(X_train, y_train):
  if counts[label] < 500:
    counts[label] += 1
    X_train2.append(data)
    y_train2.append(label)
  else:
    X_train1.append(data)
    y_train1.append(label)
    
X_train1 = np.array(X_train1, dtype=np.float32)
y_train1 = np.array(y_train1, dtype=np.int32)

X_train2 = np.array(X_train2, dtype=np.float32)
y_train2 = np.array(y_train2, dtype=np.int32)

In [0]:
# Creating a function for producing batches from split #1 where each
# instance is a pair of digits in X_train1. Each batch has an equal number
# of pairs that are the same digit and pairs that are different digits.

def generate_batch(X, y, batch_size):
  rand_idx1 = np.random.permutation(len(X))
  rand_idx2 = np.random.permutation(len(X))
  
  X_batch1 = []
  X_batch2 = []
  y_batch = []
  
  same_classes = 0
  diff_classes = 0
  
  for i, j in zip(rand_idx1, rand_idx2):
    data1, label1 = X[i], y[i]
    data2, label2 = X[j], y[j]
    
    if label1 == label2:
      if same_classes < batch_size / 2:
        same_classes += 1
        X_batch1.append(data1)
        X_batch2.append(data2)
        y_batch.append([0])
    else:
      if diff_classes < batch_size / 2:
        diff_classes += 1
        X_batch1.append(data1)
        X_batch2.append(data2)
        y_batch.append([1])
    
    if len(X_batch1) == batch_size:
      break
  rand_idx = np.random.permutation(len(X_batch1))
  return np.array(X_batch1)[rand_idx], \
         np.array(X_batch2)[rand_idx], \
         np.array(y_batch)[rand_idx]

#### c. Train the DNN on this training set. For each image pair, you can simultaneously feed the first image to DNN A and the second image to DNN B. The whole network will gradually learn to tell whether two images belong to the same class or not.

In [9]:
# Training the DNNs using early stopping.

model_path = 'dual_dnn_model'
n_epochs = 500
batch_size = 500

with tf.Session() as sess:
  init.run()
  
  best_loss = None
  rounds_since_best_loss = 0
  
  for epoch in range(n_epochs):
    for _ in range(len(X_train1) // batch_size):
      X_batch1, X_batch2, y_batch = generate_batch(X_train1, y_train1,
                                                   batch_size)
      sess.run(training_op,
               feed_dict={X_a: X_batch1, X_b: X_batch2, y: y_batch})
    
    if epoch % 5 == 0:
      acc_val = accuracy.eval(feed_dict={X_a: X_batch1, X_b: X_batch2,
                                         y: y_batch})
      loss_val = loss.eval(feed_dict={X_a: X_batch1, X_b: X_batch2,
                                      y: y_batch})
      print('Epoch: {} Loss: {} Accuracy: {}'.format(epoch, loss_val, acc_val))
      if epoch == 0:
        best_loss = loss_val
      elif loss_val < best_loss:
        best_loss = loss_val
        rounds_since_best_loss = 0
        saver.save(sess, model_path)
      else:
        rounds_since_best_loss += 1
        if rounds_since_best_loss == 5:
          break
  else:
    saver.save(sess, model_path)

Epoch: 0 Loss: 0.6925538778305054 Accuracy: 0.5099999904632568
Epoch: 5 Loss: 0.42443645000457764 Accuracy: 0.8299999833106995
Epoch: 10 Loss: 0.32438501715660095 Accuracy: 0.8479999899864197
Epoch: 15 Loss: 0.30453580617904663 Accuracy: 0.8700000047683716
Epoch: 20 Loss: 0.20559610426425934 Accuracy: 0.9240000247955322
Epoch: 25 Loss: 0.2321556955575943 Accuracy: 0.9160000085830688
Epoch: 30 Loss: 0.21768558025360107 Accuracy: 0.906000018119812
Epoch: 35 Loss: 0.14439666271209717 Accuracy: 0.9480000138282776
Epoch: 40 Loss: 0.17752383649349213 Accuracy: 0.9279999732971191
Epoch: 45 Loss: 0.12383560091257095 Accuracy: 0.9419999718666077
Epoch: 50 Loss: 0.1052306592464447 Accuracy: 0.9599999785423279
Epoch: 55 Loss: 0.101117342710495 Accuracy: 0.9639999866485596
Epoch: 60 Loss: 0.10225614160299301 Accuracy: 0.9639999866485596
Epoch: 65 Loss: 0.07068835198879242 Accuracy: 0.9639999866485596
Epoch: 70 Loss: 0.06899896264076233 Accuracy: 0.9779999852180481
Epoch: 75 Loss: 0.070251792669296

In [12]:
# Restoring the model and calculating the test set accuracy.

with tf.Session() as sess:
  saver.restore(sess, model_path)
  X_batch1, X_batch2, y_batch = generate_batch(X_train, y_train, len(X_train))
  train_acc_val = accuracy.eval(feed_dict={X_a: X_batch1, X_b: X_batch2,
                                           y: y_batch})
  X_batch1, X_batch2, y_batch = generate_batch(X_test, y_test, len(X_test))
  test_acc_val = accuracy.eval(feed_dict={X_a: X_batch1, X_b: X_batch2,
                                          y: y_batch})
  print('Training set accuracy:', train_acc_val)
  print('Test set accuracy:', test_acc_val)

INFO:tensorflow:Restoring parameters from dual_dnn_model
Training set accuracy: 0.97814053
Test set accuracy: 0.97619045


#### d. Now create a new DNN by reusing and freezing the hidden layers of DNN A and adding a softmax output layer on top with 10 neurons. Train this network with split #2 and see if you can achieve high performance despite having only 500 images per class.

In [0]:
reset_graph()

X = tf.placeholder(tf.float32, shape=(None, n_inputs), name='X')
y = tf.placeholder(tf.int32, shape=(None), name='y')

def hidden_layer(inputs, name, n_neurons=100):
  return tf.layers.dense(inputs, n_neurons, kernel_initializer=he_init,
                         activation=tf.nn.elu, name=name)

hidden_1a = hidden_layer(X, name='hidden_1a')
hidden_2a = hidden_layer(hidden_1a, name='hidden_2a')
hidden_3a = hidden_layer(hidden_2a, name='hidden_3a')
hidden_4a = hidden_layer(hidden_3a, name='hidden_4a')
hidden_5a = hidden_layer(hidden_4a, name='hidden_5a')
stop_grad = tf.stop_gradient(hidden_5a)

logits = tf.layers.dense(stop_grad, 10, kernel_initializer=he_init)

xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y,
                                                          logits=logits)
loss = tf.reduce_mean(xentropy, name='loss')

optimizer = tf.train.MomentumOptimizer(learning_rate=0.01, momentum=0.9)
training_op = optimizer.minimize(loss)

correct = tf.nn.in_top_k(logits, y, 1)
accuracy = tf.reduce_mean(tf.cast(correct, tf.float32), name='accuracy')

reuse_vars = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES,
                               scope='hidden_[12345]a')
reuse_saver = tf.train.Saver(reuse_vars)
saver = tf.train.Saver()

init = tf.global_variables_initializer()

In [35]:
# Training DNN A after restoring the hidden layers to classify digits in the
# MNIST dataset.

batch_size = 100
new_model_path = 'dual_dnn_retrained_model'

with tf.Session() as sess:
  sess.run(init)
  reuse_saver.restore(sess, model_path)

  best_loss = None
  rounds_since_best_loss = 0

  for epoch in range(n_epochs):
    for X_batch, y_batch in shuffle_batch(X_train2, y_train2, batch_size):
      sess.run(training_op, feed_dict={X: X_batch, y: y_batch})
  
    if epoch % 5 == 0:
      loss_val = loss.eval(feed_dict={X: X_train2, y: y_train2})
      if epoch % 10 == 0:
        acc_val = accuracy.eval(feed_dict={X: X_train2, y: y_train2})
        print('Epoch: {} Loss: {} Accuracy: {}'.format(epoch, loss_val,
                                                       acc_val))
      if epoch == 0:
        best_loss = loss_val
      elif loss_val < best_loss:
        best_loss = loss_val
        rounds_since_best_loss = 0
        saver.save(sess, new_model_path)
      else:
        rounds_since_best_loss += 1
        if rounds_since_best_loss == 5:
          break
  else:
    saver.save(sess, new_model_path)

INFO:tensorflow:Restoring parameters from dual_dnn_model
Epoch: 0 Loss: 0.14949673414230347 Accuracy: 0.9602000117301941
Epoch: 10 Loss: 0.10927245765924454 Accuracy: 0.968999981880188
Epoch: 20 Loss: 0.09988775849342346 Accuracy: 0.9696000218391418
Epoch: 30 Loss: 0.09442849457263947 Accuracy: 0.9718000292778015
Epoch: 40 Loss: 0.09076770395040512 Accuracy: 0.9724000096321106
Epoch: 50 Loss: 0.0879199281334877 Accuracy: 0.9735999703407288
Epoch: 60 Loss: 0.0857444629073143 Accuracy: 0.973800003528595
Epoch: 70 Loss: 0.0838107168674469 Accuracy: 0.974399983882904
Epoch: 80 Loss: 0.08218634873628616 Accuracy: 0.9753999710083008
Epoch: 90 Loss: 0.08075331151485443 Accuracy: 0.9757999777793884
Epoch: 100 Loss: 0.07950214296579361 Accuracy: 0.9765999913215637
Epoch: 110 Loss: 0.07833652198314667 Accuracy: 0.9765999913215637
Epoch: 120 Loss: 0.07726486027240753 Accuracy: 0.9768000245094299
Epoch: 130 Loss: 0.0763075053691864 Accuracy: 0.9771999716758728
Epoch: 140 Loss: 0.07542519271373749 

In [37]:
# Restoring the model and testing the performance. The model is slightly
# overfitting, but considering it only was given 500 instances of each digit,
# 96.7% accuracy is very good!

with tf.Session() as sess:
  saver.restore(sess, new_model_path)
  train_acc_val = accuracy.eval(feed_dict={X: X_train2, y: y_train2})
  test_acc_val = accuracy.eval(feed_dict={X: X_test, y: y_test})
  print('Train set accuracy:', train_acc_val)
  print('Test set accuracy:', test_acc_val)

INFO:tensorflow:Restoring parameters from dual_dnn_retrained_model
Train set accuracy: 0.9818
Test set accuracy: 0.9667
