# Deep Learning with TensorFlow
### Week 2: Optimisation and regularisation

## Contents

[1. Introduction](#introduction)

[2. Error backpropagation](#backprop)

[3. Optimisers](#optimisers)

[4. Automatic differentiation in TensorFlow (\*)](#autodiff)

[5. Weight regularisation, dropout and early stopping](#regularisation)

[6. TensorFlow regularisers, Dropout layers, metrics and callbacks (\*)](#tf_regularisation)

[7. Batch normalisation (\*)](#batchnorm)

[References](#references)

<a class="anchor" id="introduction"></a>
## Introduction

In the last week of the module we reviewed some important concepts in machine learning, including generalisation, validation, dataset splits, overfitting/underfitting and regularisation. We also took a first look at the prototypical deep learning architecture, the multilayer perceptron.

You also trained your first deep learning models in TensorFlow on the MNIST dataset, using the Sequential API, and learned the core methods `compile`, `fit`, `evaluate` and `predict`. You saw how the low level objects Tensors and Variables are used within these models to encapsulate mutable parameters and computational operations.

In this week of the module we will focus on practical aspects of training deep learning models. We touched on methods of regularisation last week, and how they are important to combat overfitting. Here, we will look at three commonly used regularisation methods for deep learning models: $\mathcal{l}^2$ (and $\mathcal{l}^1$) regularisation, dropout and early stopping.

We will then turn to the issue of optimising neural networks, and study the important backpropagation algorithm, with a focus on its application to MLP models. We will then walk through several popular optimisers that are used in practice. In this exposition, we will see how the vanishing gradients problem naturally occurs in the context of deep networks. We will then look at ways of mitigating this and stabilising information flow through the network using Glorot initialisation and batch normalisation.

We will also learn how to implement all of these techniques in TensorFlow, and see how gradients can easily be computed using automatic differentiation tools. We will introduce callbacks and metrics, which are useful objects for monitoring your models during training and evaluation. 

<a class="anchor" id="backprop"></a>
## Error backpropagation

Gradient-based neural network optimisation can be seen as iterating over the following two main steps:

1. Computation of the (stochastic) gradient of the loss function with respect to the model parameters
2. Use of the computed gradient to update the parameters

We have already seen the parameter update rule according to stochastic gradient descent for a neural network model $f_\theta:\mathbb{R}^D\mapsto Y$, where $Y$ is the target space (e.g. $\mathbb{R}^{n_{L+1}}$ or $[0, 1]^{n_{L+1}}$):

$$
\theta_{t+1} = \theta_{t} - \eta \nabla_t L(\theta_t; \mathcal{D}_m),\qquad t\in\mathbb{N}_0. \label{sgd}\tag{1}
$$

In the above equation, the minibatch loss $L(\theta_t; \mathcal{D}_m)$ is calculated on a randomly drawn sample of data points from the training set,

$$
L(\theta_t; \mathcal{D}_m) = \frac{1}{M} \sum_{x_i, y_i\in\mathcal{D}_m} l(y_i, f_{\theta_t}(x_i)), \label{minibatch_loss}\tag{2}
$$

where $\mathcal{D}_m$ is the randomly sampled minibatch, $M = |\mathcal{D}_m| \ll |\mathcal{D}_{train}|$ is the size of the minibatch, $l:Y\times Y\mapsto \mathbb{R}$ is the loss function for a single prediction, and we denote by $L_i := l(y_i, f_\theta(x_i))$ the per-example loss.

The update \eqref{sgd} requires computation of the term $\nabla_t L(\theta_t; \mathcal{D}_m)$ (from here we drop the $\mathcal{D}_m$ in this expression for brevity); that is, the gradient of the loss function with respect to all of the model parameters, evaluated at the current parameter settings $\theta_t$.

The computation of the derivatives is done through applying the chain rule of differentiation. The algorithm for computing these derivatives in an efficient manner is known as **backpropagation**, and was popularised for use in neural network optimisation in [Rumelhart et al 1986b](#Rumelhart86b) and [Rumelhart et al 1986c](#Rumelhart86c), although the technique dates back earlier, see e.g. [Werbos](#Werbos94) which includes Paul Werbos' 1974 dissertation.

In this section, we will derive the important backpropagation algorithm for finding the loss function derivatives for a multilayer perceptron.

First recall the layer transformations in the MLP:


$$
\begin{align}
\mathbf{h}^{(0)} &:= \mathbf{x}, \label{fp1}\tag{3}\\
\mathbf{h}^{(k)} &= \sigma\left( \mathbf{W}^{(k-1)}\mathbf{h}^{(k-1)} + \mathbf{b}^{(k-1)} \right),\qquad k=1,\ldots, L,\label{fp2}\tag{4}\\
\hat{\mathbf{y}} &= \sigma_{out}\left( \mathbf{w}^{(L)}\mathbf{h}^{(L)} + b^{(L)} \right),\label{fp3} \tag{5}
\end{align}
$$

where $\mathbf{W}^{(k)}\in\mathbb{R}^{n_{k+1}\times n_k}$, $\mathbf{b}^{(k)}\in\mathbb{R}^{n_{k+1}}$, $\mathbf{h}^{(k)}\in\mathbb{R}^{n_k}$, $\hat{\mathbf{y}}\in Y$, $\sigma, \sigma_{out}:\mathbb{R}\mapsto\mathbb{R}$ are activation functions that are applied element-wise, $n_0 := D$, and $n_k$ is the number of units in the $k$-th hidden layer. 

Also recall that we define the **pre-activations**

$$
\mathbf{a}^{(k)} = \mathbf{W}^{(k-1)}\mathbf{h}^{(k-1)} + \mathbf{b}^{(k-1)} \label{preactivations}\tag{6}
$$

and **post-activations**

$$
\mathbf{h}^{(k)} = \sigma(\mathbf{a}^{(k)}). \label{activations}\tag{7}
$$

We will consider the gradient of the loss computed on a single data example, $\nabla_t {L}_i(\theta_t)$; by \eqref{minibatch_loss}, the minibatch gradient is the average of these per-example gradients. We first compute the **forward pass** \eqref{fp1}-\eqref{fp3} and store the pre-activations $\mathbf{a}^{(k)}$ and post-activations $\mathbf{h}^{(k)}$.

<img src="figures/forward.png" alt="Forward pass" style="width: 800px;"/>
<center>Pre-activations, post-activations, weights and biases in the forward pass</center>

Consider the derivative of $L_i$ with respect to $w^{(k)}_{pq}$ and $b^{(k)}_p$. We have:

$$
\begin{align}
\frac{\partial L_i}{\partial w^{(k)}_{pq}} &= \frac{\partial L_i}{\partial a^{(k+1)}_p} \frac{\partial a^{(k+1)}_p}{\partial w^{(k)}_{pq}} \\
&= \frac{\partial L_i}{\partial a^{(k+1)}_p} h^{(k)}_q,
\end{align}
$$

where the second line follows from \eqref{preactivations}. Similarly,

$$
\begin{align}
\frac{\partial L_i}{\partial b^{(k)}_{p}} &= \frac{\partial L_i}{\partial a^{(k+1)}_p} \frac{\partial a^{(k+1)}_p}{\partial b^{(k)}_{p}} \\
&= \frac{\partial L_i}{\partial a^{(k+1)}_p}. 
\end{align}
$$

We introduce the notation $\delta^{(k)}_p := \frac{\partial L_i}{\partial a^{(k)}_p}$, called the **error**. We then write

$$
\begin{align}
\frac{\partial L_i}{\partial w^{(k)}_{pq}} &= \delta^{(k+1)}_p h^{(k)}_q \label{dldw_error}\tag{8}\\
\frac{\partial L_i}{\partial b^{(k)}_{p}} &= \delta^{(k+1)}_p. \label{dldb_error}\tag{9}
\end{align}
$$

We therefore need to compute the quantity $\delta^{(k+1)}_p$ for each hidden and output unit in the network. Again using the chain rule, we have

$$
\begin{align}
\delta^{(k)}_p \equiv \frac{\partial L_i}{\partial a^{(k)}_p} &= \sum_{j=1}^{n_{k+1}} \frac{\partial L_i}{\partial a^{(k+1)}_j} \frac{\partial a^{(k+1)}_j}{\partial a^{(k)}_p} \\
&= \sum_{j=1}^{n_{k+1}} \delta^{(k+1)}_j \frac{\partial a^{(k+1)}_j}{\partial a^{(k)}_p} \label{recursive_error}\tag{10}
\end{align}
$$

Combining \eqref{preactivations} and \eqref{activations} we see that

$$
\begin{align}
a^{(k+1)}_j &= \sum_{l=1}^{n_k} w^{(k)}_{jl} \sigma(a^{(k)}_l) + b^{(k)}_j \label{activation_forward}\tag{11}\\
\frac{\partial a^{(k+1)}_j}{\partial a^{(k)}_p} &= w^{(k)}_{jp} \sigma'(a^{(k)}_p)
\end{align}
$$

where $\sigma'$ is the derivative of the activation function. So from the above equation and \eqref{recursive_error} we have

$$
\begin{align}
\delta^{(k)}_p \equiv \frac{\partial L_i}{\partial a^{(k)}_p} &= \sum_{j=1}^{n_{k+1}} \delta^{(k+1)}_j \frac{\partial a^{(k+1)}_j}{\partial a^{(k)}_p}\\
&=  \sigma'(a^{(k)}_p) \sum_{j=1}^{n_{k+1}} w^{(k)}_{jp}  \delta^{(k+1)}_j \label{error_backprop}\tag{12}
\end{align}
$$

Equation \eqref{error_backprop} is analogous to \eqref{activation_forward}, and describes the backpropagation of errors through the network. We can write it in the more concise form (analogous to \eqref{activations}):

$$
\mathbf{\delta}^{(k)} = \mathbf{\sigma}'(\mathbf{a}^{(k)})(\mathbf{W}^{(k)})^T \mathbf{\delta}^{(k+1)},
$$

where $\mathbf{\sigma}'(\mathbf{a}^{(k)}) = \text{diag} ([\mathbf{\sigma}'(a^{(k)}_p)]_{p=1}^{n_k})$.

<img src="figures/forward_backward_pass.png" alt="Forward and backward passes" style="width: 650px;"/>
<center>Forward and backward passes</center>

Now we can summarise the backpropagation algorithm as follows:

> 1. Propagate the signal forwards by passing an input vector $x_i$ through the network and computing all pre-activations and post-activations using $\mathbf{a}^{(k)} = \mathbf{W}^{(k-1)}\mathbf{h}^{(k-1)} + \mathbf{b}^{(k-1)}$
> 2. Evaluate $\mathbf{\delta}^{(L+1)} = \frac{\partial L_i}{\partial \mathbf{a}^{(L+1)}}$ for the output neurons
> 3. Backpropagate the errors to compute $\mathbf{\delta}^{(k)}$ for each hidden unit using $\mathbf{\delta}^{(k)} = \mathbf{\sigma}'(\mathbf{a}^{(k)})(\mathbf{W}^{(k)})^T \mathbf{\delta}^{(k+1)}$
> 4. Obtain the derivatives of $L_i$ with respect to the weights and biases using $\frac{\partial L_i}{\partial w^{(k)}_{pq}} = \delta^{(k+1)}_p h^{(k)}_q,\quad \frac{\partial L_i}{\partial b^{(k)}_{p}} = \delta^{(k+1)}_p$

The backpropagation algorithm can easily be extended to apply to any directed acyclic graph, but we have presented it here in the case of MLPs for simplicity.
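The four steps above can be sketched in NumPy for a small MLP. This is a minimal illustration rather than how TensorFlow implements it: it assumes $\sigma = \tanh$, an identity output activation and the squared error loss $L_i = \frac{1}{2}\|\hat{\mathbf{y}} - \mathbf{y}\|^2$ (so that $\mathbf{\delta}^{(L+1)} = \hat{\mathbf{y}} - \mathbf{y}$), and the function names `forward`, `loss` and `backprop` are hypothetical. The result is checked against a finite-difference estimate of one weight derivative.

```python
import numpy as np

def forward(x, weights, biases):
    """Forward pass, storing pre-activations a[k] and post-activations h[k]."""
    a, h = [None], [x]                # h[0] = x; a[0] is unused padding
    last = len(weights) - 1
    for k, (W, b) in enumerate(zip(weights, biases)):
        a.append(W @ h[k] + b)        # a^(k+1) = W^(k) h^(k) + b^(k)
        # tanh on hidden layers, identity on the output layer (assumed)
        h.append(a[-1] if k == last else np.tanh(a[-1]))
    return a, h

def loss(x, y, weights, biases):
    _, h = forward(x, weights, biases)
    return 0.5 * np.sum((h[-1] - y) ** 2)

def backprop(x, y, weights, biases):
    a, h = forward(x, weights, biases)                 # step 1
    grads_W = [None] * len(weights)
    grads_b = [None] * len(weights)
    delta = h[-1] - y                                  # step 2: delta^(L+1)
    for k in reversed(range(len(weights))):
        grads_W[k] = np.outer(delta, h[k])             # step 4, eq. (8)
        grads_b[k] = delta                             # step 4, eq. (9)
        if k > 0:
            # step 3, eq. (12): delta^(k) = sigma'(a^(k)) * (W^(k))^T delta^(k+1)
            delta = (1.0 - np.tanh(a[k]) ** 2) * (weights[k].T @ delta)
    return grads_W, grads_b

rng = np.random.default_rng(0)
sizes = [3, 4, 4, 2]                  # n_0 = D = 3, two hidden layers, 2 outputs
weights = [rng.normal(size=(sizes[k + 1], sizes[k])) for k in range(3)]
biases = [rng.normal(size=sizes[k + 1]) for k in range(3)]
x, y = rng.normal(size=3), rng.normal(size=2)

grads_W, grads_b = backprop(x, y, weights, biases)

# Central finite-difference estimate of dL_i / dw^(0)_{12}
eps = 1e-6
weights[0][1, 2] += eps
loss_plus = loss(x, y, weights, biases)
weights[0][1, 2] -= 2 * eps
loss_minus = loss(x, y, weights, biases)
weights[0][1, 2] += eps               # restore the original weight
numerical = (loss_plus - loss_minus) / (2 * eps)

print(abs(numerical - grads_W[0][1, 2]))   # should be close to zero
```

Note that the forward pass stores every $\mathbf{a}^{(k)}$ and $\mathbf{h}^{(k)}$, because the backward pass reuses them.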

<a class="anchor" id="optimisers"></a>
## Optimisers

Recall the two main steps to training neural networks:

1. Computation of the (stochastic) gradient of the loss function with respect to the model parameters
2. Use of the computed gradient to update the parameters

Now that we have seen how gradients of the loss with respect to the parameters can be efficiently computed using the backpropagation algorithm (step 1), we will take a look at several popular gradient-based optimisation algorithms used in deep learning (step 2).

In [None]:
import tensorflow as tf

#### Stochastic gradient descent
We have already seen how stochastic gradient descent (SGD, [Robbins & Monro 1951](#Robbins51)) can be applied to optimise neural network parameters. 

Recall that SGD computes stochastic gradients by computing the loss on a minibatch of samples:

$$
L(\theta_t; \mathcal{D}_m) = \frac{1}{M} \sum_{x_i, y_i\in\mathcal{D}_m} l(y_i, f_{\theta_t}(x_i)),
$$

where $\mathcal{D}_m$ is a randomly sampled minibatch of training data points, and $M = |\mathcal{D}_m|$ is the size of the minibatch (typically much smaller than $|\mathcal{D}_{train}|$). We then use the gradient $\nabla_{\theta_t} L(\theta_t; \mathcal{D}_m)$ to update the parameters according to the SGD update rule

$$
\theta_{t+1} = \theta_{t} - \eta \nabla_{\theta_t} L(\theta_t; \mathcal{D}_m),\qquad t\in\mathbb{N}_0.
$$
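As a rough illustration of this update rule, the sketch below runs minibatch SGD in plain NumPy on a one-dimensional linear regression problem. The synthetic data, the squared error loss and the hyperparameter values are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic training set: y = 2x - 1 plus a little noise (illustrative data)
x_train = rng.uniform(-1.0, 1.0, size=1000)
y_train = 2.0 * x_train - 1.0 + 0.05 * rng.normal(size=1000)

theta = np.zeros(2)      # parameters (slope, intercept)
eta = 0.1                # learning rate
M = 32                   # minibatch size

for t in range(500):
    # Draw a random minibatch D_m from the training set
    idx = rng.choice(x_train.shape[0], size=M, replace=False)
    x_b, y_b = x_train[idx], y_train[idx]
    residual = theta[0] * x_b + theta[1] - y_b
    # Gradient of the minibatch mean squared error with respect to theta
    grad = np.array([np.mean(2.0 * residual * x_b), np.mean(2.0 * residual)])
    # SGD update
    theta = theta - eta * grad

print(theta)   # should be close to the true values (2, -1)
```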

In TensorFlow, an SGD optimizer object can be instantiated from the `tf.keras.optimizers` module. 

Optimiser objects like this one can be passed directly into the `optimizer` keyword argument in `model.compile` instead of the string reference `'sgd'`. This is useful if, for example, you want to change the default learning rate.

In [None]:
# Create an SGD optimiser

sgd = tf.keras.optimizers.SGD()

print(sgd.learning_rate)  # Default learning rate

In [None]:
# Create a new sgd optimiser and change the learning rate 

sgd = tf.keras.optimizers.SGD(learning_rate=0.005)

print(sgd.learning_rate)

Stochastic gradient descent reduces redundancy in the gradient computation, and is faster than full (batch) gradient descent. However, some challenges remain:

* Convergence can still be very slow with SGD
* Setting the learning rate correctly can be difficult, involving trial and error
* Different weights might operate on different scales, and require different rates of learning

Several optimisation algorithms have been proposed to help treat these problems.

#### Momentum

One common tweak to accelerate the slow convergence of SGD is to add momentum ([Qian 1999](#Qian99)):

$$
\begin{align}
\mathbf{g}_t :=&~ \nabla_\theta L(\theta_t; \mathcal{D}_m),\\
\mathbf{v}_{t+1} =&~ \beta \mathbf{v}_t + \eta\mathbf{g}_t\\
\theta_{t+1} =&~ \theta_t - \mathbf{v}_{t+1},
\end{align}
$$

where $\beta\ge0$ is the momentum term, and as before, $\eta>0$ is the learning rate. When $\beta=0$ then we recover plain SGD, but with $\beta>0$ (a typical value is around 0.9), this gives the gradient a short term memory which often accelerates convergence.
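To see the effect of the momentum term, the following sketch applies the update above to a hypothetical ill-conditioned quadratic $L(\theta) = \frac{1}{2}(\theta_1^2 + 25\,\theta_2^2)$. For simplicity it uses the exact gradient rather than a minibatch estimate, and the learning rate and number of steps are arbitrary choices.

```python
import numpy as np

curvature = np.array([1.0, 25.0])     # the two curvatures of the quadratic

def grad(theta):
    # Exact gradient of L(theta) = 0.5 * sum(curvature * theta**2)
    return curvature * theta

def run(beta, eta=0.02, steps=100):
    theta = np.array([1.0, 1.0])
    v = np.zeros_like(theta)
    for _ in range(steps):
        g = grad(theta)
        v = beta * v + eta * g        # accumulate the velocity
        theta = theta - v             # parameter update
    return theta

theta_sgd = run(beta=0.0)             # beta = 0 recovers plain gradient descent
theta_momentum = run(beta=0.9)

print(np.linalg.norm(theta_sgd), np.linalg.norm(theta_momentum))
```

On this problem the run with momentum ends closer to the minimum at the origin, because the velocity keeps building up along the low-curvature direction where plain gradient descent makes slow progress.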

Momentum can be added when creating an SGD optimizer using the `momentum` keyword argument. The default value is `0.0` (plain SGD).

In [None]:
# Create an SGD optimiser with momentum

sgd_with_momentum = tf.keras.optimizers.SGD(momentum=0.9)

print(sgd_with_momentum.learning_rate)
print(sgd_with_momentum.momentum)

#### Nesterov momentum

A common variant of momentum is to use Nesterov momentum ([Nesterov 1983](#Nesterov83)), which computes the gradient correction after the accumulated gradient, instead of before:

$$
\begin{align}
\mathbf{g}_t &= \nabla_\theta L(\theta_t - \beta\mathbf{v}_t; \mathcal{D}_m),\\
\mathbf{v}_{t+1} &= \beta \mathbf{v}_t + \eta\mathbf{g}_t\\
\theta_{t+1} &= \theta_t - \mathbf{v}_{t+1},
\end{align}
$$

The point $\theta_t - \beta\mathbf{v}_t$ approximates the next value of the parameters, and so evaluating the gradient $\nabla_\theta L(\theta_t - \beta\mathbf{v}_t; \mathcal{D}_m)$ there gives the optimiser a sense of 'look-ahead'.

<img src="figures/nesterov_momentum.png" alt="Nesterov momentum" style="width: 550px;"/>
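The only change from standard momentum is the point at which the gradient is evaluated. A minimal sketch on the same kind of hypothetical quadratic, $L(\theta) = \frac{1}{2}(\theta_1^2 + 25\,\theta_2^2)$, again assuming an exact gradient and arbitrary hyperparameter values:

```python
import numpy as np

curvature = np.array([1.0, 25.0])

def grad(theta):
    # Exact gradient of L(theta) = 0.5 * sum(curvature * theta**2)
    return curvature * theta

beta, eta = 0.9, 0.02
theta_nag = np.array([1.0, 1.0])
v = np.zeros_like(theta_nag)

for _ in range(100):
    lookahead = theta_nag - beta * v  # approximate next position of the parameters
    g = grad(lookahead)               # gradient evaluated at the look-ahead point
    v = beta * v + eta * g
    theta_nag = theta_nag - v

print(np.linalg.norm(theta_nag))      # distance from the minimum at the origin
```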

Nesterov momentum can be added to an SGD optimizer using the `nesterov` keyword argument. The default value is `False`.

In [None]:
# Create an SGD optimiser with Nesterov momentum

sgd_with_momentum = tf.keras.optimizers.SGD(momentum=0.9, nesterov=True)

print(sgd_with_momentum.momentum)
print(sgd_with_momentum.nesterov)

#### Adagrad

The Adagrad optimiser ([Duchi et al 2011](#Duchi11)) adapts the learning rate for each parameter, to account for different weights learning on different scales. Parameters that receive a gradient less frequently have larger updates, making Adagrad well suited to sparse data, where most of the features are zero in the data. It is used, for example, in [Pennington et al 2014](#Pennington14) to train GloVe word embedding vectors.

The update rule is

$$
\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{G_t + \epsilon}} \odot \nabla_\theta L(\theta_t; \mathcal{D}_m),
$$

where $G_t\in\mathbb{R}^{p\times p}$ is a diagonal matrix where the diagonal element $(G_t)_{ii}$ is the sum of squares of gradients with respect to $\theta_i$ up to time step $t$. In the above, the division and square root operations are performed element-wise, and $\odot$ is the Hadamard product.

Note that the resulting learning rates per parameter are monotonically decreasing, and eventually the algorithm effectively stops learning.
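The shrinking per-parameter learning rates can be observed directly. The sketch below applies the Adagrad update to a hypothetical quadratic loss with an exact gradient, storing only the diagonal of $G_t$ as a vector, and records the effective learning rate $\eta / \sqrt{G_t + \epsilon}$ at every step.

```python
import numpy as np

eta, epsilon = 0.1, 1e-7
curvature = np.array([1.0, 25.0])     # illustrative loss: 0.5 * sum(curvature * theta**2)
theta = np.array([1.0, 1.0])
G = np.zeros_like(theta)              # diagonal of G_t: running sum of squared gradients

effective_lr = []
for t in range(50):
    g = curvature * theta
    G = G + g ** 2                    # accumulate squared gradients
    step_size = eta / np.sqrt(G + epsilon)
    theta = theta - step_size * g     # element-wise Adagrad update
    effective_lr.append(step_size)

effective_lr = np.array(effective_lr)
print(effective_lr[0], effective_lr[-1])
```

Each parameter gets its own step size, and since $G_t$ can only grow, each step size can only shrink.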

The Adagrad optimiser can be instantiated as follows:

In [None]:
# Create an Adagrad optimiser

adagrad = tf.keras.optimizers.Adagrad(
    learning_rate=0.001, epsilon=1e-07,
)

In the above, all keyword arguments shown are the default settings.

#### RMSprop

RMSprop is an unpublished optimisation method that aims to resolve the vanishing learning rates of Adagrad (it appeared in Geoff Hinton's Coursera course [in lecture 6e](http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf)). It uses a decaying average of past squared gradients. 

The update rule is

$$
\begin{align}
\mathbb{E}[\mathbf{g}^2]_t &= \rho \mathbb{E}[\mathbf{g}^2]_{t-1} + (1 - \rho)(\nabla_\theta L(\theta_t; \mathcal{D}_m))^2\\
\theta_{t+1} &= \theta_t - \frac{\eta}{\sqrt{\mathbb{E}[\mathbf{g}^2]_t + \epsilon}} \odot \nabla_\theta L(\theta_t; \mathcal{D}_m)
\end{align}
$$

As before, the division and square root are performed element-wise, and $\odot$ is the Hadamard product. The $\rho$ term is typically set similar to momentum, around 0.9.
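To illustrate how the decaying average avoids the vanishing learning rate, the sketch below feeds the same constant gradient (an artificial choice, purely for illustration) to the RMSprop average and to an Adagrad-style running sum, and compares the resulting effective learning rates.

```python
import numpy as np

rho, eta, epsilon = 0.9, 0.001, 1e-7
g = np.array([1.0])                   # a constant gradient, for illustration only
theta = np.array([0.0])
avg_sq = np.zeros_like(g)             # RMSprop: decaying average of squared gradients
sum_sq = np.zeros_like(g)             # Adagrad: running sum of squared gradients

for t in range(200):
    avg_sq = rho * avg_sq + (1.0 - rho) * g ** 2
    sum_sq = sum_sq + g ** 2
    theta = theta - eta / np.sqrt(avg_sq + epsilon) * g   # RMSprop update

rmsprop_lr = eta / np.sqrt(avg_sq + epsilon)
adagrad_lr = eta / np.sqrt(sum_sq + epsilon)
print(rmsprop_lr, adagrad_lr)
```

The decaying average settles near $g^2$, so the RMSprop step size stays close to $\eta / |g|$, whereas the running sum keeps growing and the Adagrad step size keeps shrinking.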

The RMSprop optimiser can be instantiated as follows:

In [None]:
# Create an RMSprop optimiser

rmsprop = tf.keras.optimizers.RMSprop(
    learning_rate=0.001, rho=0.9, epsilon=1e-07
)

Again the above are the default settings. Momentum can also be added to the `RMSprop` optimiser with the `momentum` keyword argument:

In [None]:
# Create an RMSprop optimiser with momentum

rmsprop_with_momentum = tf.keras.optimizers.RMSprop(momentum=0.9)

The RMSprop optimiser is also the default optimiser that is chosen if `model.compile` is called without the `optimizer` keyword argument.

#### Adam

The Adam optimiser ([Kingma & Ba 2015](#Kingma15)) is a popular optimisation algorithm that also computes adaptive learning rates per parameter. It estimates first and second moments of the gradients, and the name stands for <ins>Ada</ins>ptive <ins>m</ins>oment estimation.

The update rule is

$$
\begin{align}
\mathbb{E}[\mathbf{g}]_t &= \beta_1\mathbb{E}[\mathbf{g}]_{t-1} + (1 - \beta_1) \nabla_\theta L(\theta_t; \mathcal{D}_m),\\
\mathbb{E}[\mathbf{g}^2]_t &= \beta_2\mathbb{E}[\mathbf{g}^2]_{t-1} + (1 - \beta_2) (\nabla_\theta L(\theta_t; \mathcal{D}_m))^2,\\
\mathbf{m}_t &= \mathbb{E}[\mathbf{g}]_t / (1 - \beta_1^t),\\
\mathbf{v}_t &= \mathbb{E}[\mathbf{g}^2]_t / (1 - \beta_2^t),\\
\theta_{t+1} &= \theta_t - \frac{\eta}{\sqrt{\mathbf{v}_t + \epsilon}}\odot \mathbf{m}_t
\end{align}
$$

Here the moving averages are initialised at zero, $\mathbb{E}[\mathbf{g}]_0 = \mathbb{E}[\mathbf{g}^2]_0 = 0$, the step counter starts at $t = 1$, and $\beta_1^t$, $\beta_2^t$ denote $\beta_1$, $\beta_2$ raised to the power $t$. The $\mathbf{m}_t$ and $\mathbf{v}_t$ terms correct for an initial bias towards zero, and the correction fades as $t$ grows. Typical values are $\beta_1 \approx 0.9$, $\beta_2 \approx 0.999$ and $\epsilon \approx 10^{-7}$.
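The role of the bias correction is easiest to see on the very first step. The sketch below implements the Adam update in NumPy with the correction factors $1 - \beta_1^t$ and $1 - \beta_2^t$ from Kingma & Ba, with the step counter $t$ starting at 1 and the moving averages initialised at zero. The quadratic loss and the learning rate are assumptions made for the example.

```python
import numpy as np

beta_1, beta_2, eta, epsilon = 0.9, 0.999, 0.1, 1e-7
curvature = np.array([1.0, 25.0])     # illustrative loss: 0.5 * sum(curvature * theta**2)
theta = np.array([1.0, 1.0])
mean_g = np.zeros_like(theta)         # moving average of gradients, initialised at zero
mean_g2 = np.zeros_like(theta)        # moving average of squared gradients

for t in range(1, 201):
    g = curvature * theta
    mean_g = beta_1 * mean_g + (1 - beta_1) * g
    mean_g2 = beta_2 * mean_g2 + (1 - beta_2) * g ** 2
    m = mean_g / (1 - beta_1 ** t)    # bias-corrected first moment
    v = mean_g2 / (1 - beta_2 ** t)   # bias-corrected second moment
    if t == 1:
        first_g, first_m, first_v = g.copy(), m.copy(), v.copy()
    theta = theta - eta / np.sqrt(v + epsilon) * m

print(first_g, first_m, first_v)
```

At $t = 1$ the uncorrected average is only $(1 - \beta_1)\mathbf{g}_1$, a tenth of the gradient for $\beta_1 = 0.9$. After correction, $\mathbf{m}_1$ equals the gradient itself and $\mathbf{v}_1$ equals its element-wise square.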

The Adam optimiser can be instantiated as follows:

In [None]:
# Create an Adam optimiser

adam = tf.keras.optimizers.Adam()

print(adam.learning_rate)
print(adam.beta_1)
print(adam.beta_2)
print(adam.epsilon)

The above is a non-exhaustive list of optimisers that have been developed and are actively used in deep learning research and practice. See [here](https://www.tensorflow.org/api_docs/python/tf/keras/optimizers) for a complete list of the optimisers that are available to use in TensorFlow.

The code below will create a demonstration optimization run using one of each of the optimizers described above for the [Beale function](http://benchmarkfcns.xyz/benchmarkfcns/bealefcn.html), a common test function used to evaluate optimization algorithms.

In [None]:
def beale(x, y):
    return (1.5 - x + x * y)**2 + (2.25 - x + x * (y**2))**2 + (2.625 - x + x * (y**3))**2

def grad_beale(x, y):
    ddx = 2*(1.5 - x + x * y)*(y - 1) + 2*(2.25 - x + x * (y**2))*((y**2) - 1) + 2*(2.625 - x + x * (y**3))*((y**3)-1)
    ddy = 2*(1.5 - x + x * y)*(x) + 2*(2.25 - x + x * (y**2))*(2*y*x) + 2*(2.625 - x + x * (y**3))*(3*(y**2)*x)
    return [ddx, ddy]

In this case the gradient is easy enough to calculate by hand as above. Later in this week you will learn how to use the [automatic differentiation](#autodiff) tools in TensorFlow to compute the gradient of any differentiable function for you.

The cell below will run the optimization routine for 100 iterations for each optimizer, and plot the trajectories over the contour plot of the Beale function. Feel free to interrupt the cell execution and restart it to try a different random initial condition. You can also try changing the learning rates and other parameters to see the effect on convergence.

In [None]:
from IPython import display
import matplotlib.pyplot as plt
import numpy as np

x_init = tf.random.normal(())
y_init = tf.random.normal(())

test_fn = beale
grad_fn = grad_beale

X, Y = np.meshgrid(np.linspace(-4, 4, 100), np.linspace(-4, 4, 100))
Z = test_fn(X, Y)
levels = np.exp(np.linspace(0, 10, 25)) - 1
plt.figure(figsize=(13, 8))
plt.contour(X, Y, Z, levels, alpha=0.6, cmap='viridis')
plt.colorbar()
plt.xlabel('x')
plt.ylabel('y')
plt.scatter(x_init.numpy(), y_init.numpy())
plt.scatter(3, 0.5, marker='*', label='Optimum')
optimizers_config = [
    {"name": "SGD", "kwargs": {"learning_rate": 0.01}, "label": "SGD"},
    {"name": "SGD", "kwargs": {"learning_rate": 0.001, "momentum": 0.9, "nesterov": True}, "label": "SGD-NAG"},
    {"name": "Adam", "kwargs": {"learning_rate": 0.1}, "label": "Adam"},
    {"name": "Adagrad", "kwargs": {"learning_rate": 0.1}, "label": "Adagrad"},
    {"name": "RMSprop", "kwargs": {"learning_rate": 0.05}, "label": "RMSprop"}
]
optimizers, states, plot_data = [], [], []
for optimizer in optimizers_config:
    optimizers.append(getattr(tf.keras.optimizers, optimizer['name'])(**optimizer['kwargs']))
    states.append((tf.Variable(x_init, name='x_{}'.format(optimizer['name'])), 
                   tf.Variable(y_init, name='y_{}'.format(optimizer['name']))))
    opt_plot, = plt.plot([x_init.numpy()], [y_init.numpy()], label=optimizer['label'])
    plot_data.append(opt_plot)
plt.title("Test optimization run with {} optimizers. Initial conditions x: {:.4f}, y: {:.4f}".format(
    len(optimizers), x_init.numpy(), y_init.numpy()), fontsize=14)

num_iterations = 100
for i in range(num_iterations):
    try:
        for optimizer, state, data in zip(optimizers, states, plot_data):
            grads = grad_fn(state[0], state[1])
            optimizer.apply_gradients(zip(grads, state))
            data.set_xdata(np.append(data.get_xdata(), state[0].numpy()))
            data.set_ydata(np.append(data.get_ydata(), state[1].numpy()))
        plt.text(0.01, 0.01, 'Iteration {}'.format(i+1), horizontalalignment='left', 
                 verticalalignment='bottom', transform = plt.gca().transAxes, 
                 fontsize=12, bbox=dict(facecolor='white', alpha=1.))
        plt.legend(fontsize=12)
        display.display(plt.gcf())
        display.clear_output(wait=True)
    except KeyboardInterrupt:
        break

There is no single answer for which optimiser is best to use, as it very often depends on the data and the model. Optimisers with adaptive learning rates tend to converge faster and are more frequently chosen than plain SGD. However, it is interesting to note that SGD has been shown to generalise better ([Hardt et al 2015](#Hardt15)), and techniques such as switching from Adam to SGD during training have been proposed ([Keskar & Socher 2017](#Keskar17)). Further methods include annealed learning rates, cyclic learning rates ([Smith 2015](#Smith15)) and decaying momentum ([Chen & Kyrillidis 2019](#Chen19)). Many advances have been made in recent years, and neural network optimisation is still very much an active research area.

*Exercise.* Change the Beale function above for a different [test function](https://en.wikipedia.org/wiki/Test_functions_for_optimization) in the above code.

<a class="anchor" id="autodiff"></a>
## Automatic differentiation in TensorFlow

One of the major advantages of using deep learning frameworks like TensorFlow is the ability to automatically compute gradients of any differentiable operation. When using the Keras `model.fit` API, TensorFlow applies the backpropagation equations automatically to compute gradients, and then uses the optimiser algorithm selected to update the parameters.

In this section, we will see how lower-level tools in TensorFlow can be used to compute gradients of differentiable expressions, and build a custom training loop that exposes the individual training steps, giving you extra flexibility when you need it.

In [None]:
import tensorflow as tf

Operations that you want to differentiate need to be recorded inside a `tf.GradientTape` context:

In [None]:
# Define a simple operation and take the gradient
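
One possible way to fill in a cell like this is sketched below. The function being differentiated is an arbitrary choice for illustration. A constant tensor has to be watched explicitly with `tape.watch`, whereas a `tf.Variable` is tracked by the tape automatically.

```python
import tensorflow as tf

# Gradient with respect to a constant tensor: it must be watched explicitly
x = tf.constant(3.0)
with tf.GradientTape() as tape:
    tape.watch(x)
    y = x ** 2 + 2.0 * x
dy_dx = tape.gradient(y, x)
print(dy_dx.numpy())   # dy/dx = 2x + 2 = 8 at x = 3

# Gradient with respect to a Variable: tracked automatically by the tape
w = tf.Variable(2.0)
with tf.GradientTape() as tape:
    z = tf.sin(w) * w
dz_dw = tape.gradient(z, w)
print(dz_dw.numpy())   # dz/dw = w cos(w) + sin(w)
```

A new tape is opened for the second gradient because a non-persistent tape releases its resources after the first call to `tape.gradient`.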



In [None]:
# Take multiple derivatives



In [None]:
# Gradients can also be computed with respect to intermediate variables



In [None]:
# Gradients can be taken once by default



In [None]:
# Variable objects are tracked automatically



In [None]:
# Take gradients with respect to a layer operation



In [None]:
# Take gradients with respect to a model operation



*Exercise.* Use `tf.GradientTape` to make a plot of the function $\frac{dy}{dx}:[-5, 5]\mapsto\mathbb{R}$, where $y = \sin (x^2) - \frac{x^2}{4}$.

#### Custom training loop
We will now see how to build a custom training loop using `tf.GradientTape` and optimiser objects.

In [None]:
# Load the Fashion-MNIST dataset

(x_train, y_train), (x_test, y_test) = tf.keras.datasets.fashion_mnist.load_data()


In [None]:
# Get the class labels

classes = [
    "T-shirt/top",
    "Trouser",
    "Pullover",
    "Dress",
    "Coat",
    "Sandal",
    "Shirt",
    "Sneaker",
    "Bag",
    "Ankle boot"
]

In [None]:
# View a few training data examples

import numpy as np
import matplotlib.pyplot as plt

n_rows, n_cols = 3, 5
random_inx = np.random.choice(x_train.shape[0], n_rows * n_cols, replace=False)
fig, axes = plt.subplots(n_rows, n_cols, figsize=(14, 8))
fig.subplots_adjust(hspace=0.2, wspace=0.1)

for n, i in enumerate(random_inx):
    row = n // n_cols
    col = n % n_cols
    axes[row, col].imshow(x_train[i])
    axes[row, col].get_xaxis().set_visible(False)
    axes[row, col].get_yaxis().set_visible(False)
    axes[row, col].text(10., -1.5, f'{classes[y_train[i]]}')
plt.show()

In [None]:
# Build the model



In [None]:
# Print the model summary



In [None]:
# Define an optimiser



In [None]:
# Define the loss function



We will also use the `tf.data` module to load the training data into a `tf.data.Dataset` object.

In [None]:
# Load the data into a tf.data.Dataset



In [None]:
# Iterate over the Dataset object



`Dataset` objects come with `map` and `filter` methods for data preprocessing on the fly. For example, we can normalise the pixel values to the range $[0, 1]$ with the `map` method:

In [None]:
# Normalise the pixel values



We could also filter out data examples according to some criterion with the `filter` method. For example, if we wanted to exclude all data examples with label $9$ from the training:

In [None]:
# Filter out all examples with label 9 (ankle boot)



In [None]:
# Shuffle the dataset



In [None]:
# Batch the dataset



In [None]:
# Print the element_spec



We now have everything to write the custom training loop.

In [None]:
# Build the custom training loop



In [None]:
# Build a new model



In [None]:
# Optimise the custom training loop by compiling the training step into a graph



In many cases the data processing pipeline can also be optimised for performance gain, see [here](https://www.tensorflow.org/guide/data_performance) for more information.

<a class="anchor" id="regularisation"></a>
## Weight regularisation, dropout and early stopping

Deep learning models are typically very over-parameterised, often with millions of parameters over many layers in the model. They are universal approximators (see e.g. [Cybenko](#Cybenko89) for the large width case, or [Lu et al](#Lu17) for the large depth case), and so overfitting can be a problem. When training neural networks, it is important to regularise them to prevent overfitting. As written above, there are several forms of regularisation, but in this section we will look at three in particular: weight regularisation, dropout and early stopping.

#### $\mathcal{l}^2$ and $\mathcal{l}^1$ regularisation
Recall that for a linear model of the form

$$
f(\mathbf{x}) = \sum_j w_j \phi_j(\mathbf{x}),
$$

a typical regularisation is to add a sum of squares penalty term to the loss term to discourage the weights $w_j$ from growing too large. In this case, the regularised loss takes the form


$$
L(\mathbf{w}, \alpha) = L_0(\mathbf{w}) + \alpha_2 \sum_i w_i^2,
$$

where $L_0$ is the unconstrained loss function, and $\alpha_2$ is a regularisation hyperparameter. This is $\mathcal{l}^2$ regularisation.

This form of regularisation is often referred to as **weight decay**, although the two terms are technically not the same. Weight decay ([Hanson & Pratt](#Hanson88)) is defined as a modification to the update rule, rather than to the loss function itself:

$$
\mathbf{\theta}_{t+1} \leftarrow (1 - \lambda)\theta_t - \eta g_t,
$$

where $\theta\in\mathbb{R}^p$ are the model parameters, $\lambda$, $\eta$ are hyperparameters, and $g_t$ is the $t$-th batch update. In the case of stochastic gradient descent, the update is the gradient of the unregularised loss, $g_t = \nabla_\theta L_0(\theta_t; \mathcal{D}_m)$, and the two formulations are equivalent when $\lambda = 2\eta\alpha_2$. However, this is not the case for all gradient-based optimisers commonly used in deep learning.
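The equivalence under SGD can be checked numerically. Differentiating the penalty $\alpha_2 \sum_i \theta_i^2$ gives $2\alpha_2\theta$, so an SGD step on the regularised loss shrinks the parameters by a factor $(1 - 2\eta\alpha_2)$, which corresponds to weight decay with $\lambda = 2\eta\alpha_2$. The sketch below assumes a simple quadratic for the unregularised loss $L_0$; the names `grad_L0` and `target` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
theta_0 = rng.normal(size=5)
target = rng.normal(size=5)
eta, alpha_2 = 0.1, 0.05
lam = 2.0 * eta * alpha_2             # decay rate matching the l2 penalty under SGD

def grad_L0(theta):
    # Gradient of an illustrative unregularised loss L0 = 0.5 * ||theta - target||^2
    return theta - target

theta_l2 = theta_0.copy()
theta_wd = theta_0.copy()
for _ in range(20):
    # SGD step on the l2-regularised loss L0 + alpha_2 * sum(theta**2)
    theta_l2 = theta_l2 - eta * (grad_L0(theta_l2) + 2.0 * alpha_2 * theta_l2)
    # Weight decay step, with g_t the gradient of L0 alone
    theta_wd = (1.0 - lam) * theta_wd - eta * grad_L0(theta_wd)

print(np.max(np.abs(theta_l2 - theta_wd)))   # should be close to zero
```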

An alternative weight regularisation is $\mathcal{l}^1$ regularisation, in which the sum of absolute values of the weights is added to the loss term:

$$
L(\mathbf{w}, \alpha) = L_0(\mathbf{w}) + \alpha_1 \sum_i |w_i|.
$$

This form of regularisation encourages sparsity in the weights. Both $\mathcal{l}^1$ and $\mathcal{l}^2$ regularisation discourage the weights from growing too large, which restricts the capacity of the network.

It is also possible to add a weighted combination of both $\mathcal{l}^2$ and $\mathcal{l}^1$ regularisation to the loss function.

#### Dropout
Dropout was introduced by [Srivastava et al](#Srivastava14) in 2014 as a regularisation technique for neural networks that also has the effect of modifying the behaviour of neurons within a network.

The following is taken from the paper abstract:

> Deep neural nets with a large number of parameters are very powerful machine learning systems. However, overfitting is a serious problem in such networks. Large networks are also slow to use, making it difficult to deal with overfitting by combining the predictions of many different large neural nets at test time. Dropout is a technique for addressing this problem. The key idea is to randomly drop units (along with their connections) from the neural network during training. This prevents units from co-adapting too much. During training, dropout samples from an exponential number of different “thinned” networks. At test time, it is easy to approximate the effect of averaging the predictions of all these thinned networks by simply using a single unthinned network that has smaller weights. This significantly reduces overfitting and gives major improvements over other regularization methods.

The method of dropout is to randomly 'zero out' neurons (or equivalently, weight connections) in the network during training according to a Bernoulli mask whose values are independently sampled at every iteration. 

Suppose $\mathbf{W}^{(k)}\in\mathbb{R}^{n_{k+1}\times n_{k}}$ is a weight matrix mapping neurons in layer $k$ to layer $k+1$:

$$
\mathbf{h}^{(k+1)} = \sigma\left( \mathbf{W}^{(k)}\mathbf{h}^{(k)} + \mathbf{b}^{(k)} \right)
$$

We can view dropout as randomly replacing each column of $\mathbf{W}^{(k)}$ with zeros with probability $1 - p_k$, so that $p_k$ is the probability of retaining a neuron. We can write this as applying a Bernoulli mask:

$$
\begin{align}
\mathbf{W}^{(k)} &\leftarrow \mathbf{W}^{(k)} \cdot\text{diag} ([z_{k, j}]_{j=1}^{n_{k}})\\
z_{k, j} &\sim \text{Bernoulli}(p_k), \qquad k=1,\ldots, L.
\end{align}
$$

The following diagrams illustrate the effect of dropout on a neural network.

<img src="figures/no_dropout.png" alt="MLP with a two hidden layers" style="width: 700px;"/>
<center>Neural network without dropout</center>

<img src="figures/dropout1.png" alt="MLP with a two hidden layers" style="width: 700px;"/>
<center>Neural network with dropout</center>

By randomly dropping out neurons in the network, one obvious effect is that the capacity of the model is reduced, and so there is a regularisation effect. Each randomly sampled Bernoulli mask defines a new 'sub-network' that is smaller than the original. 

In addition, a key motivation of dropout is that it prevents neurons from co-adapting too much. Any neuron in the network is no longer able to depend on any other specific neurons being present, and so each neuron learns features that are more robust, and generalise better.

In the figure below (taken from the [original paper](#Srivastava14)) we see features that are learned on the MNIST dataset for a model trained without dropout (left) and one trained with dropout (right). We see that the dropout model learns features that are much less noisy and more meaningful (it is detecting edges, textures, spots etc.) and help the model to generalise better. The non-dropout model's features suggest a large degree of co-adaptation, where the neurons depend on the specific combination of features in order to make good predictions on the training data.

<img src="figures/dropout_no_dropout.png" alt="Features with and without dropout" style="width: 700px;"/>
<center>Learned features in a neural network trained without dropout (left) and with dropout (right). From Srivastava et al 2014</center>

Typically, dropout is applied only in the training phase. When making predictions, all weight connections $\mathbf{W}^{(k)}$ are restored, but rescaled by a factor of $p_k$ (the probability with which each connection was kept) to account for the fact that fewer connections were present during training.

However, [Gal & Ghahramani](#Gal16) describe a Bayesian interpretation of dropout, and propose that dropout also be applied at test time in order to obtain a Bayesian predictive distribution.
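The column masking used during training, and the rescaling used at prediction time, can be sketched in NumPy. This is a minimal illustration rather than how any library implements dropout: the layer sizes, the value of `p_k` and the helper name `dropout_weights` are assumptions, and $p_k$ is taken to be the probability that a column is kept, which is the reading consistent with the $\text{Bernoulli}(p_k)$ mask and the rescaling by $p_k$.

```python
import numpy as np

rng = np.random.default_rng(0)

n_k, n_k_plus_1 = 4, 3   # layer widths (illustrative values)
p_k = 0.8                # probability that a neuron in layer k is kept (assumed value)

W = rng.normal(size=(n_k_plus_1, n_k))   # weight matrix W^(k)
h = rng.normal(size=(n_k,))              # activations h^(k)

def dropout_weights(W, p, rng):
    """Zero out each column of W independently with probability 1 - p."""
    z = rng.binomial(1, p, size=W.shape[1])   # Bernoulli(p) mask, one entry per column
    return W * z, z                           # broadcasting is equivalent to W @ np.diag(z)

# Training time: a fresh mask is sampled for every forward pass
W_train, z = dropout_weights(W, p_k, rng)
pre_activation_train = W_train @ h

# Prediction time: all columns are restored and rescaled by p_k
W_test = p_k * W
pre_activation_test = W_test @ h

# The rescaled weights match the average of the masked weights over many masks
W_mean = np.mean([dropout_weights(W, p_k, rng)[0] for _ in range(20000)], axis=0)
print(np.abs(W_mean - W_test).max())   # small, and shrinks as more masks are averaged
```

The last two lines show why the rescaling is used: averaged over the random masks, the masked weight matrix equals $p_k \mathbf{W}^{(k)}$, so the pre-activations seen at prediction time have the same expected scale as those seen during training.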

#### Early stopping

You might have found in the last week that it is difficult to set a good number of epochs to train for ahead of time. In the simple MNIST example training is quick so it is not a problem to experiment, but in many cases training could take hours or days (or even longer!) and so this is not an option. 

Recall that deep learning models are usually vastly overparameterised, and have the capacity to drastically overfit. A simple but effective remedy is to stop the training before the model starts to overfit. The picture is similar to the balance between capacity and generalisation:

<img src="figures/early_stopping.png" alt="Early stopping" style="width: 500px;"/>
<center>Prediction error vs number of training epochs</center>

With early stopping, the aim is to stop the training when the validation error is at a minimum. This means that the model needs to be regularly evaluated on a held-out validation set (that is not used for training), and the optimisation routine is terminated when the validation error starts to rise. Validation is normally performed once per epoch in the training run.

In practice, the validation error measurements will be noisy, and so it is not reliable to stop the training as soon as the validation error increases. What is usually done is to periodically save model checkpoints (say once per epoch), and set a **patience** threshold that specifies the maximum number of consecutive validation runs that are allowed without improving upon the best score so far. If this patience threshold is reached, the training is terminated.

The early stopping algorithm is outlined in pseudocode below.

Early stopping inputs: `val_metric`, `max_patience`

-------
>```
>best_valid_loss = np.inf
>patience = 0
>
>for epoch in range(max_epochs):
>    epoch_train_loss = train_model(train_data, train_loss)
>    epoch_valid_loss = validate_model(valid_data, val_metric)
>    if epoch_valid_loss < best_valid_loss:
>        best_valid_loss = epoch_valid_loss
>        patience = 0
>    else:
>        patience += 1
>        
>    save_model(epoch)
>    
>    if patience >= max_patience:
>        break  # terminate training
>```

-------

It is also possible to validate the model using a measure that is different to the loss function used for training the model. Therefore `val_metric` is also an input to the early stopping algorithm above.
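To see the role of the patience threshold, the stopping rule from the pseudocode can be run on a fixed list of validation losses. This is only a sketch: the function name `early_stopping_epoch` and the loss values are made up for illustration, and the training and model saving steps are left out.

```python
def early_stopping_epoch(valid_losses, max_patience):
    """Return (stop_epoch, best_epoch, best_loss) for per-epoch validation losses."""
    best_valid_loss = float("inf")
    best_epoch = None
    patience = 0
    for epoch, epoch_valid_loss in enumerate(valid_losses):
        if epoch_valid_loss < best_valid_loss:
            best_valid_loss = epoch_valid_loss
            best_epoch = epoch
            patience = 0          # improvement: reset the counter
        else:
            patience += 1         # no improvement on the best score so far
        if patience >= max_patience:
            return epoch, best_epoch, best_valid_loss   # terminate training
    return len(valid_losses) - 1, best_epoch, best_valid_loss

# Hypothetical noisy validation curve: improves, wobbles, then rises
valid_losses = [1.00, 0.80, 0.85, 0.70, 0.72, 0.71, 0.75, 0.90]

print(early_stopping_epoch(valid_losses, max_patience=1))  # (2, 1, 0.8)
print(early_stopping_epoch(valid_losses, max_patience=3))  # (6, 3, 0.7)
```

With `max_patience=1` the noisy bump at epoch 2 ends the run before the better score at epoch 3 is reached, whereas `max_patience=3` rides through the noise and stops only once the validation loss has failed to improve three times in a row.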

Of course, all of the regularisation techniques mentioned here (and more) can be used together in deep learning models (and they often are).

<a class="anchor" id="tf_regularisation"></a>
## TensorFlow regularisers, Dropout layers and callbacks

In this section we will build on what we have covered already with the `Sequential` API, and include weight regularisers and `Dropout` layers. We will also introduce callback objects, which are very useful for dynamically performing operations during the training run. An example is the `EarlyStopping` callback.

In [None]:
import tensorflow as tf

For this tutorial we will use the diabetes dataset from `sklearn`.

In [None]:
# Load the diabetes dataset



In [None]:
# Print dataset description



In [None]:
# Get the input and target data



In [None]:
# Normalise the target data (this will make clearer training curves)



In [None]:
# Partition the data into training and validation sets



In [None]:
# Load the data into training and validation Dataset objects



In [None]:
# Build the MLP model



In [None]:
# Print the model summary



In [None]:
# Compile the model



In [None]:
# Train the model, including validation



In [None]:
# Plot the training and validation loss

import matplotlib.pyplot as plt

plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('Loss vs. epochs')
plt.ylabel('Loss')
plt.xlabel('Epoch')
plt.legend(['Training', 'Validation'], loc='upper right')
plt.show()

#### Regularise the model

Both $\mathcal{l}^2$ and $\mathcal{l}^1$ regularisation can easily be included using the `kernel_regularizer` and `bias_regularizer` keyword arguments in the `Dense` layer.

Dropout can also be easily included as an additional layer of our model.
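As a rough sketch of how these pieces fit together, the model below adds weight penalties and `Dropout` layers to a small MLP. The layer widths, the regularisation coefficient and the dropout rate are assumed values chosen only for illustration, not the settings used in this tutorial.

```python
import tensorflow as tf

weight_decay = 1e-3   # regularisation coefficient (assumed value)
dropout_rate = 0.3    # fraction of units dropped (assumed value)

model = tf.keras.Sequential([
    tf.keras.Input(shape=(10,)),   # the sklearn diabetes dataset has 10 input features
    tf.keras.layers.Dense(
        64, activation='relu',
        kernel_regularizer=tf.keras.regularizers.l2(weight_decay),
        bias_regularizer=tf.keras.regularizers.l2(weight_decay)),
    tf.keras.layers.Dropout(dropout_rate),
    tf.keras.layers.Dense(
        64, activation='relu',
        kernel_regularizer=tf.keras.regularizers.l1(weight_decay)),
    tf.keras.layers.Dropout(dropout_rate),
    tf.keras.layers.Dense(1),
])

# Each regulariser contributes one penalty term, which Keras adds to the training loss
print(len(model.losses))
```

Note that the `rate` argument of the `Dropout` layer is the fraction of units that are set to zero, not the fraction that is kept. Keras also rescales the retained activations by `1 / (1 - rate)` during training, so no rescaling is needed when making predictions.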

In [None]:
# Redefine the model using l2 regularisation and dropout



In [None]:
# Compile the model



In [None]:
# Train the model, including validation



In [None]:
# Plot the training and validation loss

plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('Loss vs. epochs')
plt.ylabel('Loss')
plt.xlabel('Epoch')
plt.legend(['Training', 'Validation'], loc='upper right')
plt.show()

The $\mathcal{l}^2$ regularisation and dropout have helped to reduce the overfitting of the model. 


#### Callbacks
We can go one step further and introduce early stopping as well, and save the model weights at the best validation score. We can do this with callbacks.

In [None]:
# Create a new model



In [None]:
# Compile the model



The `EarlyStopping` callback is a built-in callback in the `tf.keras.callbacks` module. You can see a complete list of built-in callbacks [here](https://www.tensorflow.org/api_docs/python/tf/keras/callbacks).
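The following is a minimal sketch of how the callback might be configured and passed to `fit`. The random data, the model and the argument values (`patience=5`, `max_epochs = 100`) are assumptions made only so that the example is self-contained.

```python
import numpy as np
import tensorflow as tf

# Hypothetical random regression data, only to make the example self-contained
rng = np.random.default_rng(0)
x = rng.normal(size=(200, 10)).astype('float32')
y = rng.normal(size=(200, 1)).astype('float32')

model = tf.keras.Sequential([
    tf.keras.Input(shape=(10,)),
    tf.keras.layers.Dense(32, activation='relu'),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer='adam', loss='mse', metrics=['mae'])

early_stopping = tf.keras.callbacks.EarlyStopping(
    monitor='val_loss',           # quantity to watch; 'val_mae' would also be available here
    patience=5,                   # epochs without improvement before stopping
    min_delta=0.0,                # smallest change that counts as an improvement
    restore_best_weights=True)    # roll back to the weights from the best epoch

max_epochs = 100
history = model.fit(x[:150], y[:150],
                    validation_data=(x[150:], y[150:]),
                    epochs=max_epochs, batch_size=32,
                    callbacks=[early_stopping], verbose=0)

# Number of epochs actually run (at most max_epochs)
print(len(history.history['val_loss']))
```

Setting `restore_best_weights=True` plays the role of the saved checkpoints in the early stopping algorithm: when training stops, the model is left with the weights from the epoch with the best monitored value rather than those from the final epoch.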

In [None]:
# Create an EarlyStopping callback



In [None]:
# Train the model, including validation



In [None]:
# Plot the training and validation metrics

import numpy as np

plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('Loss vs. epochs')
plt.ylabel('Loss')
plt.xlabel('Epoch')
plt.xticks(np.arange(len(history.history['loss'])))
plt.legend(['Training', 'Validation'], loc='upper right')
plt.show()

#### Custom callbacks

It is also possible (and often useful) to create custom callbacks to perform certain actions depending on the training progress. We will look at building a custom callback to save the model weights, dependent on the performance of a specified validation metric.

Note that the `tf.keras.callbacks` module has the built-in callback `ModelCheckpoint`, which automatically handles model saving for Keras models. We will look at a slightly lower-level method for model saving, using the `tf.train` module.
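One possible shape for such a callback is sketched below. It subclasses `tf.keras.callbacks.Callback` and writes a `tf.train.Checkpoint` whenever the monitored validation metric improves. The class name `BestWeightsCheckpoint`, the file prefix, the random data and the small model are all hypothetical choices for illustration.

```python
import os
import shutil
import tempfile

import numpy as np
import tensorflow as tf


class BestWeightsCheckpoint(tf.keras.callbacks.Callback):
    """Hypothetical callback: write a checkpoint whenever `monitor` improves."""

    def __init__(self, checkpoint_dir, monitor='val_mae'):
        super().__init__()
        self.file_prefix = os.path.join(checkpoint_dir, 'best_weights')
        self.monitor = monitor
        self.best = np.inf
        self.best_path = None

    def on_train_begin(self, logs=None):
        # self.model is attached by Keras before training starts
        self.checkpoint = tf.train.Checkpoint(model=self.model)

    def on_epoch_end(self, epoch, logs=None):
        current = (logs or {}).get(self.monitor)
        if current is not None and current < self.best:
            self.best = current
            # Overwrites the previous best checkpoint and returns its path
            self.best_path = self.checkpoint.write(self.file_prefix)


# Hypothetical random regression data, only to make the example self-contained
rng = np.random.default_rng(0)
x = rng.normal(size=(200, 10)).astype('float32')
y = rng.normal(size=(200, 1)).astype('float32')

model = tf.keras.Sequential([
    tf.keras.Input(shape=(10,)),
    tf.keras.layers.Dense(32, activation='relu'),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer='adam', loss='mse', metrics=['mae'])

checkpoint_dir = tempfile.mkdtemp()
best_weights_callback = BestWeightsCheckpoint(checkpoint_dir, monitor='val_mae')
model.fit(x[:150], y[:150], validation_data=(x[150:], y[150:]),
          epochs=5, batch_size=32, callbacks=[best_weights_callback], verbose=0)

saved_files = sorted(os.listdir(checkpoint_dir))   # checkpoint index and data files
print(saved_files)

# Restore the weights from the best epoch into the model
best_weights_callback.checkpoint.restore(best_weights_callback.best_path)

shutil.rmtree(checkpoint_dir)   # clean up the temporary directory
```

Because the metric to monitor is a constructor argument, the same callback can track a validation measure that differs from the training loss, here the validation mean absolute error.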

In [None]:
# Create a custom callback for saving the model weights



In [None]:
# Create a new model



In [None]:
# Compile the model



In [None]:
# Train the model, including validation



In [None]:
# Inspect the saved checkpoints



In [None]:
# Plot the training and validation metrics

plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.plot(history.history['val_mae'])
plt.title('Loss and validation MAE vs. epochs')
plt.ylabel('Loss / MAE')
plt.xlabel('Epoch')
plt.xticks(np.arange(len(history.history['loss'])))
plt.legend(['Training', 'Val loss', 'Val MAE'], loc='upper right')
plt.show()

In [None]:
# Re-initialise the model



In [None]:
# Restore the best model weights



In [None]:
# Clean up



<a class="anchor" id="batchnorm"></a>
## Batch normalisation

Batch normalisation ([Ioffe & Szegedy 2015](#Ioffe15)) is a widely used method in deep learning. It is used to normalise the distribution of internal activation values in the network, and greatly helps to stabilise learning especially in deep networks. 

The core issue is that of **covariate shift**, which is the change in distribution of input variables to a machine learning model. This can happen in datasets where there is some change in conditions in subsequent data collections, for example over time or location. Whilst the underlying target function might not have changed, the distribution of the input variables does change which means the model could perform poorly in the changed conditions.

The following shows a simple example of a regression function that fails to generalise to new data points whose distribution has shifted from the training data, even though the underlying target function is the same.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.kernel_ridge import KernelRidge

def target(x):
    # Underlying target function (the same for all data)
    return x**3 - 15 * x - 12

# Noisy samples of the target function on the full input range [-5, 5]
n_samples = 100
x_train_all = np.linspace(-5, 5, n_samples)[..., np.newaxis]
y_train_all = target(x_train_all) + 10 * np.random.randn(n_samples, 1)

# Training subset: drop the first 30 points, so only inputs from about x = -2 upwards are seen
x_train_sub = x_train_all[30:]
y_train_sub = y_train_all[30:]

# Fit a kernel regressor on the subset and compute its training MSE
kernel_regressor_sub = KernelRidge(alpha=1e-2, kernel='rbf', gamma=0.5)
kernel_regressor_sub.fit(x_train_sub, y_train_sub)
mse1 = np.mean((kernel_regressor_sub.predict(x_train_sub) - y_train_sub)**2)

# For reference: the same regressor fitted on the full range (not plotted below)
kernel_regressor_all = KernelRidge(alpha=1e-2, kernel='rbf', gamma=0.5)
kernel_regressor_all.fit(x_train_all, y_train_all)

fig = plt.figure(figsize=(14, 5))

# Left: the subset regressor together with the data it was trained on
fig.add_subplot(1, 2, 1)
plt.plot(x_train_sub, target(x_train_sub), '--')
plt.plot(x_train_all, kernel_regressor_sub.predict(x_train_all))
plt.scatter(x_train_sub, y_train_sub, alpha=0.5)
plt.title("Regression function and training data")
plt.legend(['Target function', 'Kernel regressor'])

# Right: the same regressor evaluated on data whose input distribution has shifted
fig.add_subplot(1, 2, 2)
plt.plot(x_train_all, target(x_train_all), '--')
plt.plot(x_train_all, kernel_regressor_sub.predict(x_train_all))
plt.scatter(x_train_all, y_train_all, alpha=0.5)
plt.title("Same regression function on shifted data")
plt.legend(['Target function', 'Kernel regressor'])

plt.show()

The same phenomenon can occur during the course of training deep learning models on large datasets, where stochastic minibatches are used in the optimisation procedure. Furthermore, since deep learning models can be viewed as hierarchical feature extractors, we can encounter problems of **internal covariate shift**, where the activation values in hidden layers also undergo changes of distribution due to changes in parameter values and activations in earlier layers. 

<img src="figures/internal_covariate_shift.png" alt="Internal Covariate Shift" style="width: 750px;"/>
<center>Changes in weights and activations earlier in the network cause internal covariate shift in activations in later layers</center>

Batch normalisation reduces the internal covariate shift by normalising the mean and variance of the activation values in a layer. Intuitively speaking, this means that although layer inputs will change over the course of training, they won't change so much that learning becomes very slow or unstable. Batch normalisation also has a slight regularisation effect on the network.

For a layer with $n_k$-dimensional input $\mathbf{h}^{(k)} = (h^{(k)}_1,\ldots,h^{(k)}_{n_k})$, we normalise each input feature

$$
\hat{h}^{(k)}_j = \frac{h^{(k)}_j - \mathbb{E}[h^{(k)}_j]}{\sqrt{\text{Var}[h^{(k)}_j]}}.
$$

In order to maintain full expressive power of the network, we make sure the final transformation can represent the identity:

$$
z^{(k)}_j = \gamma^{(k)}_j \hat{h}^{(k)}_j + \beta^{(k)}_j,
$$

where $\gamma^{(k)}_j$ and $\beta^{(k)}_j$ are learned parameters. Note that setting $\gamma^{(k)}_j = \sqrt{\text{Var}[h^{(k)}_j]}$ and $\beta^{(k)}_j = \mathbb{E}[h^{(k)}_j]$ recovers the original activations. However, now the model can control the mean and variance of activations within the hidden layer $\mathbf{h}^{(k)}$ by tuning the parameters $\gamma^{(k)}_j$ and $\beta^{(k)}_j$. $\mathbf{z}^{(k)}$ then becomes the new input to the next layer in the network.

Statistics $\mathbb{E}[h^{(k)}_j]$ and $\text{Var}[h^{(k)}_j]$ are estimated over each minibatch $\mathcal{D}_m$:

$$
\begin{align}
\mu^{(k)}_{jm} &= \frac{1}{M} \sum_{i=1}^M h^{(k)}_{ij}\\
\left(\sigma^{(k)}_{jm}\right)^2 &= \frac{1}{M} \sum_{i=1}^M (h^{(k)}_{ij} - \mu^{(k)}_{jm})^2\\
\hat{h}^{(k)}_{ij} &= \frac{h^{(k)}_{ij} - \mu^{(k)}_{jm}}{\sqrt{\left(\sigma^{(k)}_{jm}\right)^2 + \epsilon}}\\
z^{(k)}_{ij} &= \gamma^{(k)}_j\hat{h}^{(k)}_{ij} + \beta^{(k)}_j =: BN_{\gamma^{(k)}, \beta^{(k)}}\left(h^{(k)}_{ij}\right)
\end{align}
$$

where $M = |\mathcal{D}_m|$, and $h^{(k)}_{ij}$ is the activation value of the $j$-th neuron in layer $k$ for input $x_i\in\mathcal{D}_m$.

At training time, the estimates $\mu^{(k)}_{jm}$ and $\sigma^{(k)}_{jm}$ are computed on the minibatch $\mathcal{D}_m$. In practical implementations, a running average of these estimates over the training run is used at test time.
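The minibatch computation above, together with the running averages used at test time, can be sketched in NumPy. The exponential moving average with a `momentum` parameter is one common way of maintaining the running estimates and is an assumption here, as are the function names and the values of `momentum` and `eps`.

```python
import numpy as np

def batch_norm_train(h, gamma, beta, running_mean, running_var, momentum=0.99, eps=1e-3):
    """Normalise a minibatch h of shape (M, n_k) and update the running statistics."""
    mu = h.mean(axis=0)                       # minibatch mean, one value per feature j
    var = h.var(axis=0)                       # minibatch variance (1/M normalisation)
    h_hat = (h - mu) / np.sqrt(var + eps)
    z = gamma * h_hat + beta
    new_mean = momentum * running_mean + (1 - momentum) * mu
    new_var = momentum * running_var + (1 - momentum) * var
    return z, new_mean, new_var

def batch_norm_predict(h, gamma, beta, running_mean, running_var, eps=1e-3):
    """Normalise with the running statistics instead of minibatch statistics."""
    h_hat = (h - running_mean) / np.sqrt(running_var + eps)
    return gamma * h_hat + beta

rng = np.random.default_rng(0)
M, n_k = 60, 4
h = rng.normal(loc=5.0, scale=3.0, size=(M, n_k))   # hypothetical layer activations

gamma, beta = np.ones(n_k), np.zeros(n_k)
running_mean, running_var = np.zeros(n_k), np.ones(n_k)
z, running_mean, running_var = batch_norm_train(h, gamma, beta, running_mean, running_var)

# With gamma = 1 and beta = 0, each feature has approximately zero mean and unit variance
print(z.mean(axis=0).round(3), z.var(axis=0).round(3))

# Choosing gamma = sqrt(var + eps) and beta = mean recovers the original activations
gamma_id = np.sqrt(h.var(axis=0) + 1e-3)
beta_id = h.mean(axis=0)
z_id, _, _ = batch_norm_train(h, gamma_id, beta_id, running_mean, running_var)

# At test time a single example can be normalised, since no minibatch statistics are needed
z_test = batch_norm_predict(h[:1], gamma, beta, running_mean, running_var)
```

The identity check illustrates why the learned scale and shift preserve the expressive power of the network: there is always a setting of $\gamma^{(k)}_j$ and $\beta^{(k)}_j$ that undoes the normalisation.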

The batch normalisation calculation is fully differentiable, and so gradients can be backpropagated through this calculation as normal. 

The batch normalisation operation is implemented as a layer in TensorFlow. In the following experiment we will recreate the first example from the [original paper](#Ioffe15) on the MNIST dataset.

In [None]:
import tensorflow as tf

In [None]:
# Load the MNIST dataset

(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()

In [None]:
# Create Dataset objects

train_dataset = tf.data.Dataset.from_tensor_slices((x_train, y_train))
test_dataset = tf.data.Dataset.from_tensor_slices((x_test, y_test))

train_dataset.element_spec

In [None]:
# Normalise pixel values in the Datasets

def normalise_pixels(image, label):
    return (tf.cast(image, tf.float32) / 255., label)

train_dataset = train_dataset.map(normalise_pixels)
test_dataset = test_dataset.map(normalise_pixels)

In [None]:
# Shuffle and batch the dataset

train_dataset = train_dataset.shuffle(1000).batch(60)
test_dataset = test_dataset.batch(60)

We now define an MLP classifier model for the MNIST dataset. The following demonstrates how to build a model using [the functional API](https://www.tensorflow.org/guide/keras/functional).
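As a sketch of what such a model could look like, the function below builds an MLP with the functional API and optionally inserts `BatchNormalization` layers. The architecture (three hidden layers of 100 sigmoid units, with batch normalisation applied before the nonlinearity) is an assumption based on the first MNIST experiment in the [original paper](#Ioffe15), and the function name `build_mlp` is hypothetical.

```python
import tensorflow as tf

def build_mlp(use_batch_norm=False):
    """Build an MNIST classifier with the functional API."""
    inputs = tf.keras.Input(shape=(28, 28))
    h = tf.keras.layers.Flatten()(inputs)
    for _ in range(3):
        h = tf.keras.layers.Dense(100)(h)                # linear part of the layer
        if use_batch_norm:
            h = tf.keras.layers.BatchNormalization()(h)  # normalise the pre-activations
        h = tf.keras.layers.Activation('sigmoid')(h)
    outputs = tf.keras.layers.Dense(10, activation='softmax')(h)
    return tf.keras.Model(inputs=inputs, outputs=outputs)

no_bn_model = build_mlp(use_batch_norm=False)
bn_model = build_mlp(use_batch_norm=True)

# Each BatchNormalization layer adds gamma, beta and the two running statistics per unit
print(bn_model.count_params() - no_bn_model.count_params())
```

In the functional API each layer is called on the output tensor of the previous one, and the model is defined by its input and output tensors. This makes it straightforward to insert or remove layers conditionally, as done here with the batch normalisation layers.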

In [None]:
# Build the classifier model using the functional API



In [None]:
# Print the model summary



In [None]:
# Define a loss function, optimiser and metric



In [None]:
# Fit the model



In [None]:
# Re-build the model with batch normalisation



In [None]:
# Compile the model 



In [None]:
# Fit the model



We will compare the progress of the test accuracy in both models.

In [None]:
# Plot the test accuracy

import matplotlib.pyplot as plt

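# 'val_accuracy' is the accuracy on the data passed as validation_data
# (assumed here to be test_dataset); 'accuracy' would be the training accuracy
plt.plot(bn_history.history['val_accuracy'])
plt.plot(no_bn_history.history['val_accuracy'], '--')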
plt.legend(['With BN', 'Without BN'])
plt.ylabel("Test accuracy")
plt.xlabel("Epochs")
plt.title("Test accuracy vs epochs")
plt.show()

We see clearly in the above plot that the batch normalisation layers help the model to train faster, and to a higher accuracy.

Batch normalisation reduces internal covariate shift, particularly early on in training. The distribution of the inputs to each layer is more stable, making learning easier.

<a class="anchor" id="references"></a>
### References

<a class="anchor" id="Chen19"></a>
* Chen, J. & Kyrillidis, A. (2019), "Decaying Momentum Helps Neural Network Training", arXiv preprint arXiv:1910.04952.
<a class="anchor" id="Cybenko89"></a>
* Cybenko, G. (1989), "Approximation by superpositions of a sigmoidal function", Mathematics of Control, Signals, and Systems, **2** (4), 303–314.
<a class="anchor" id="Duchi11"></a>
* Duchi, J., Hazan, E., & Singer, Y. (2011), "Adaptive Subgradient Methods for Online Learning and Stochastic Optimization", *Journal of Machine Learning Research*, **12**, 2121–2159.
<a class="anchor" id="Gal16"></a>
* Gal, Y. & Ghahramani, Z. (2016), "Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning", Proceedings of The 33rd International Conference on Machine Learning, **48**, 1050-1059.
<a class="anchor" id="Hanson88"></a>
* Hanson, S. J. & Pratt, L. Y. (1988), "Comparing biases for minimal network construction with back-propagation", in *Proceedings of the 1st International Conference on Neural Information Processing Systems*, 177–185.
<a class="anchor" id="Hardt15"></a>
* Hardt, M., Recht, B., & Singer, Y. (2015), "Train faster, generalize better: Stability of stochastic gradient descent", in *Proceedings of the 33rd International Conference on Machine Learning*, **48**, 1225-1234.
<a class="anchor" id="Ioffe15"></a>
* Ioffe, S. & Szegedy, C. (2015), "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift", in *Proceedings of the 32nd International Conference on Machine Learning*, **37**, 448–456.
<a class="anchor" id="Keskar17"></a>
* Keskar, N. S. & Socher, R. (2017), "Improving Generalization Performance by Switching from Adam to SGD", arXiv preprint, abs/1712.07628.
<a class="anchor" id="Kingma15"></a>
* Kingma, D. P. & Ba, J. L. (2015), "Adam: a Method for Stochastic Optimization", International Conference on Learning Representations, 1–13.
<a class="anchor" id="Lu17"></a>
* Lu, Z., Pu, H., Wang, F., Hu, Z., & Wang, L. (2017), "The Expressive Power of Neural Networks: A View from the Width", Advances in Neural Information Processing Systems 30. Curran Associates, Inc., 6231–6239.
<a class="anchor" id="Nesterov83"></a>
* Nesterov, Y. (1983), "A method for unconstrained convex minimization problem with the rate of convergence $O(1/k^2)$", Doklady AN SSSR (translated as Soviet Math. Dokl.), **269**, 543–547.
<a class="anchor" id="Pennington14"></a>
* Pennington, J., Socher, R. & Manning, C. D. (2014), "Glove: Global vectors for word representation", in *Proceedings of Empirical Methods in Natural Language Processing (EMNLP)*.
<a class="anchor" id="Qian99"></a>
* Qian, N. (1999), "On the momentum term in gradient descent learning algorithms", Neural Networks: The Official Journal of the International Neural Network Society, **12** (1), 145–151.
<a class="anchor" id="Robbins51"></a>
* Robbins, H. and Monro, S. (1951), "A stochastic approximation method", *The annals of mathematical statistics*, 400–407.
<a class="anchor" id="Rumelhart86b"></a>
* Rumelhart, D. E., Hinton, G., and Williams, R. (1986b), "Learning representations by back-propagating errors", Nature, **323**, 533-536.
<a class="anchor" id="Rumelhart86c"></a>
* Rumelhart, D. E., Hinton, G., and Williams, R. (1986c), "Learning Internal Representations by Error Propagation", in Rumelhart, D. E.; McClelland, J. L. (eds.), Parallel Distributed Processing : Explorations in the Microstructure of Cognition. Volume 1: Foundations, Cambridge, MIT Press.
<a class="anchor" id="Smith15"></a>
* Smith, L. N. (2015), "Cyclical Learning Rates for Training Neural Networks", arXiv preprint, abs/1506.01186.
<a class="anchor" id="Srivastava14"></a>
* Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. (2014), "Dropout: A Simple Way to Prevent Neural Networks from Overfitting", Journal of Machine Learning Research, **15**, 1929-1958.
<a class="anchor" id="Werbos94"></a>
* Werbos, P. J. (1994), "The Roots of Backpropagation: From Ordered Derivatives to Neural Networks and Political Forecasting", New York: John Wiley & Sons.