# Training Deep Neural Networks

In the previous lesson, we introduced artificial neural networks & trained our first deep neural networks. But they were shallow nets, with just a few hidden layers. What if you need to tackle a complex problem, such as detecting hundreds of types of objects in high-resolution images? You may need to train a much deeper DNN, perhaps with 10 layers or many more, each containing hundreds of neurons, linked by hundreds of thousands of connections. Training a deep DNN isn't a walk in the park. Here are some of the problems you could run into:

* You may be faced with the tricky *vanishing gradients* problem or the related *exploding gradients* problem. This is when the gradients grow smaller & smaller, or larger & larger, when flowing backward through the DNN during training. Both of these problems make lower layers very hard to train.
* You might not have enough training data for such a large network, or it might be too costly to label.
* Training may be extremely slow.
* A model with millions of parameters would severely risk overfitting the training set, especially if there are not enough training instances or if they are too noisy.

In this lesson, we will go through each of these problems & present techniques to solve them. We will start by exploring the vanishing & exploding gradients problems & some of their most popular solutions. Next, we will look at transfer learning & unsupervised pretraining. Then we will discuss various optimizers that can speed up training large models tremendously. Finally, we will go through a few popular regularisation techniques for large neural networks. 

With these tools, you will be able to train very deep nets. Welcome to Deep Learning!

---

# The Vanishing/Exploding Gradients Problems

As discussed before, the backpropagation algorithm works by going from the output layer to the input layer, propagating the error gradient along the way. Once the algorithm has computed the gradient of the cost function with regard to each parameter in the network, it uses these gradients to update each parameter with a gradient descent step.

Unfortunately, gradients often get smaller & smaller as the algorithm progresses down to the lower layers. As a result, the gradient descent update leaves the lower layers' connection weights virtually unchanged, & training never converges to a good solution. We call this the *vanishing gradients* problem. In some cases, the opposite can happen: the gradients can grow bigger & bigger until layers get insanely large weight updates & the algorithm diverges. This is the *exploding gradients* problem, which surfaces in recurrent neural networks. More generally, deep neural networks suffer from unstable gradients; different layers may learn at widely different speeds.

This unfortunate behavior was empirically observed long ago, & it was one of the reasons deep neural networks were mostly abandoned in the early 2000s. It wasn't clear what caused the gradients to be so unstable when training a DNN, but some light was shed in a 2010 payer by Xavier Glorot & Yoshua Bengio. The authors found a few suspects, including the combination of the popular logistic sigmoid activation function & the weight initialisation technique that was most popular at the time (i.e., a normal distribution with a mean of 0 & a standard deviation of 1). In short, they showed that with this activation function & this initialisation scheme, the variance of the outputs of each layer is much greater than the variance of its inputs. Going forward in the network, the variance keeps increasing after each layer until the activation function saturates at the top layers. This saturation is actually made worse by the fact that the logistic function has a mean of 0.5, not 0 (the hyperbolic tangent function has a mean of 0 & behaves slightly better than the logistic function in deep networks).

Looking at the logistic activation function, you can see that when the inputs become large (negative or positive), the function saturates 0 or 1, with a derivative extremely close to 0. Thus, when backpropagation kicks in, it has virtually no gradient to propagate back through the network; & what little gradient exists keeps getting diluted as backpropagation progresses through the top layers, so there is really nothing left for the lower layers.

<img src = "Images/Logistic Activation Function Saturation.png" width = "500" style = "margin:auto"/>

## Glorot & He Initialisation

In their paper, Glorot & Bengio propose a way to significantly alleviate the unstable gradients problem. They point out that we need the signal to flow properly in both directions: in the forward direction when making predictions, & in the reverse direction when backpropagating gradients. We don't want the signal to die out, nor do we want it to explode & saturate. For the signal to flow properly, the authors argue that we need the variance of the outputs of each layer to be equal to the variance of its inputs, & we need the gradients to have equal variance before & after flowing through a layer in the reverse direction (please check out the paper if you are interested in the mathematical details). It is actually not possible to guarantee both unless the layer has an equal number of inputs & neurons (these numbers are called the *fan-in* & *fan-out* of the layer), but Glorot & Bengio proposed a good compromise that has proven to work very well in practice: the connection weights of each layer must be initialised randomly as described in the below equation, where $fan_{avg} = (fan_{in} + fan_{out})/2$. This initialisation strategy is called *Xavier initialisation* or *Glorot initialisation*, after the paper's first author.

$$Normal\ distribution\ with\ mean = 0\ and\ variance\ \sigma^2 = \frac{1}{fan_{avg}}$$
$$Or\ a\ uniform\ distribution\ between\ -r\ and\ +r,\ with\ r = \sqrt{\frac{3}{fan_{avg}}}$$

If you replace $fan_{avg}$ with $fan_{in}$ in the above equation, you get an initialisation strategy that Yann LeCun proposed in the 1990s. He called it *Lecun initialisation*. Genevieve Orr & Klaus-Robert Muller even recommended it in ther 1998 book *Neural Networks: Tricks of the Trade*. Lecun initialisation is equivalent to Glorot initialisation when $fan_{in} = fan_{out}$. It took over a decade for researches to realise how important this trick is. Using Glorot initialisation can speed up training considerable & it is one of the tricks that led to the success of deep learning. 

Some papers have provided similar strategies for different activation functions. These strategies differ only by the scale of the variance & whether they use $fan_{avg}$ or $fan_{in}$ as shown in the below table (for the uniform distribution, just compute $r = \sqrt{3\sigma^2}$). The initialisation strategy for the ReLU activation function (& its variants, including the ELU activation described shortly) is sometimes called *He initialisation*, after the paper's first author. The SELU activation function will be explained later in this lesson. It should be used with LeCun initialisation (preferably with a normal distribution, as we will see).

|**Initialisation**|**Activation Functions**|**$\sigma^2$ (Normal)**|
|:---|:---|:---|
|Glorot|None, tanh, logistic, softmax| $1/fan_{avg}$ |
|He|ReLU & variants| $2/fan_{in}$ |
|LeCun|SELU| $1/fan_{in}$ |

By default, keras uses Glorot initialisation with a uniform distribution. When creating a layer, you can change this to He initialisation by setting `kernal_initialization = "he_uniform"` or `kernel_initializer = "he_normal"` like this:

In [1]:
import tensorflow as tf
from tensorflow import keras

keras.layers.Dense(10, activation = "relu", kernel_initializer = "he_normal")

<keras.layers.core.dense.Dense at 0x7ff5e7265580>

If you want He initialisation with a uniform distribution but based on $fan_{avg}$ rather than $fan_{in}$, you can use the `VarianceScaling` initializer like this:

In [2]:
he_avg_init = keras.initializers.VarianceScaling(scale = 2, mode = "fan_avg",
                                                 distribution = "uniform")
keras.layers.Dense(10, activation = "sigmoid", kernel_initializer = he_avg_init)

<keras.layers.core.dense.Dense at 0x7ff5ef0209d0>

## Nonsaturation Activation Functions

One of the insights in the 2010 paper by Glorot & Bengio was that the problems with unstable gradients were in part due to a poor choice of activation. Until then most people had assumed that if Mother Nature had chosen to use roughly sigmoid activation functions in biological neurons, they must be an excellent choice. But it turns out that other activation functions behave much better in deep neural networks -- in particular, the ReLU activation function, mostly because it does not saturate for positive values (& because it is fast to compute).

Unfortunately, the ReLU activation function is not perfect. It suffers from a problem known as the *dying ReLUs*: during training, some neurons effectively "die", meaning they stop outputting anything other than 0. In some cases, you may find that half of your network's neurons are dead, especially if you used a large learning rate. A neuron dies when its weights get tweaked in such a way that the weighted sum of its inputs are negative for all instances in the training set. When this happens, it just keeps outputting zeros, & gradient descent does not affect it anymore because the gradient of the ReLU function is zero when its input is negative.

To solve this problem, you may want to use a variant of the ReLU function, such as the *leaky ReLU*. This function is defined as $LeakyReLU_{\alpha}(z) = max(\alpha z, z)$, shown below. The hyperparameter $\alpha$ defines how much the function "leaks": it is the slope of the function for $z < 0$ & is typically set to 0.01. This small slope ensures that leaky ReLUs never die; they can go into a long coma, but they have a chance to eventually wake up. A 2015 paper compared several variants of the ReLU activation function, & one of its conclusions was that the leaky variants always outperformed the strict ReLU activation function. In fact, setting $\alpha = 0.2$ (a huge leak) seemed to result in better performance than $\alpha = 0.01$ (a small leak). The paper also evaluated the *randomised leaky ReLU* (RReLU), where $\alpha$ is picked randomly in a given range during training & is fixed to an average value during testing. RReLU also performed fairly well & seemed to act as a regulariser (reducing the risk of overfitting the training set). Finally, the paper evaluated the *parametric leaky ReLU* (PReLU), where $\alpha$ is authorised to be learned during training (instead of being a hyperparameter, it becomes a parameter that can be modified by backpropagation like any other parameter). PReLU was reported to strongly outperform ReLU on large image datasets, but on smaller datasets it runs the risk of overfitting the training set.

<img src = "Images/Leaky ReLU.png" width = "600" style = "margin:auto">

Last but not least, a 2015 paper by Djork-Arne Clervert proposed a new activation function called the *exponential linear unit* (ELU) that outperformed all the ReLU variants in the authors' experiments; training time was reduced, & the neural network performed better on the test set. The below figure graphs the function & the below equation shows its definition.

$$ELU_\alpha(z) = \Biggl \{\begin{split}
\alpha(e^z - 1)\ if\ z < 0\\
z \quad \quad if\ z \geq 0\\
\end{split}$$

<img src = "Images/ELU Activation Function.png" width = "600" style = "margin:auto"/>

The ELU activation function looks a lot like the ReLU function, with a few major differences:

* It takes on negative values when z < 0, which allows the unit to have an average output closer to 0 & helps alleviate the vanishing gradients problem. The hyperparameter $\alpha$ defines the value that the ELU function approaches when $z$ is a large negative number. It is usually set to 1, but you can tweak it like any other hyperparameter.
* It has a nonzero gradient for $z$ < 0, which avoids the dead neurons problem.
* If $\alpha$ is equal to 1, then the function is smooth everywhere, including around $z$ = 0, which helps speed up gradient descent since it does not bounce as much to the left & right of $z$ = 0.

The main drawback of the ELU activation function is that it is slower to compute than the ReLU function & its variants (due to the use of the exponential function). Its faster convergence rate during training compensates for that slow computation, but still, at test time, an ELU network will be slower than a ReLU network.

Then, a 2017 paper by Gunter Klambauer introduced the Scaled ELU (SELU) activation function: as its name suggests, it is a scaled variant of the ELU activation function. The authors showed that if you build a neural network composed exclusively of a stack of dense layers, & if all hidden layers use the SELU activation function, then the network will *self-normalise*: the output of each layer will tend to preserve a mean of 0 & standard deviation of 1 during training, which solves the vanishing/exploding gradients problem. As a result, the SELU activation function often significantly outperforms other activation functions for such neural nets (especially deep ones). There are, however, a few conditions for self-normalisation to happen (see the paper for the mathematical justification):

* The initial features must be standardised (mean 0 & standard deviation 1).
* Every hidden layer's weights must be initialised with LeCun normal initialisation. In Keras, this means setting `kernel_initializer = "lecun_normal"`.
* The network's architecture must be sequential. Unfortunately, if you try to use SELU in nonsequential architectures, such as recurrent networks or networks with *skip connections* (i.e., connections that skip layers, such as in Wide & Deep nets), self-normalisation will not be guaranteed, so SELU will not necessarily outperform other activation functions.
* The paper only guarantees self-normalisation if all layers are dense, but some researchers have noted that the SELU activation function can improve performance in convolutional neural networks as well.

To use the leaky ReLU activation function, create a `LeakyReLU` layer & add it to your model just after the layer you want to apply it to:

In [4]:
model = keras.models.Sequential([
    keras.layers.Flatten(input_shape = [28, 28]),
    keras.layers.Dense(300, kernel_initializer = "he_normal"),
    keras.layers.LeakyReLU(alpha = 0.2),
    keras.layers.Dense(100, kernel_initializer = "he_normal"),
    keras.layers.LeakyReLU(alpha = 0.2),
    keras.layers.Dense(10, activation = "softmax")
])

For PreLU, replace `LeakyRelu(alpha = 0.2)` with `PreLU()`. There is currently no official implementation of RReLU in Keras, but you can fairly easily implement your own. 

For SELU activation set `activation = "selu"` & `kernel_initializer = "lecun_normal"` when creating a layer:

In [5]:
layer = keras.layers.Dense(10, activation = "selu",
                           kernel_initializer = "lecun_normal")

## Batch Normalisation

Although using He initialisation along with ELU (or any variant of ReLU) can significantly reduce the danger of the vanishing/exploding gradients problems at the beginning of training, it doesn't guarantee that they won't come back during training.

In a 2015 paper, Sergey Ioffe & Christian Szegedy proposed a technique called *Batch Normalisation* (BN) that addressed these problems. The technique consists of adding an operation in the model just before or after the activation function of each hidden layer. This operation simply zero-centers & normalises each input, then scales & shifts the result using two new parameter vectors per layer: one for scaling, the other for shifting. In other words, the operation lets the model learn the optimal scale & mean of each of the layer's inputs. In many cases, if you add a BN layer as the very first layer of your neural netowrk, you do not need to standardise your training set (e.g., using a `StandardScaler`); the BN layer will do it for you (approximately, since it only looks at one batch at a time, & it can also rescale & shift each input feature).

In order to zero-center & normalise the inputs, the algorithm needs to estimate each input's mean & standard deviation. It does so by evaluating the mean & standard deviation of the input over the current mini-batch (hence the name "batch normalisation"). The whole operation is summarised step by step in the below equation.

1. $\mu_B = \frac{1}{m_B}\sum^{m_B}_{i = 1}x^{(i)}$
2. $\sigma_B^2 = \frac{1}{m_B}\sum^{m_B}_{i = 1}(x^{(i)} - \mu_B)^2$
3. $\hat{x}^{(i)} = \frac{x^{(i)} - \mu_B}{\sqrt{\sigma_B^2 + \varepsilon}}$
4. $z^{(i)} = \gamma \otimes \hat{x}^{(i)} + \beta$

In this algorithm:

* $\mu_B$ is the vector of input means, evaluated over the whole mini-batch B (it contains one mean per input).
* $\sigma_B$ is the vector of input standard deviations, also evaluated over the whole mini-batch (it contains one standard deviation per input).
* $m_B$ is the number of instances in the mini-batch.
* $\hat{x}^{(i)}$ is the vector of zero-centered & normalised inputs for instance $i$.
* $\gamma$ is the output scale parameter vector for the layer (it contains one scale parameter per input)
* $\otimes$ represents element-wise multiplication (each input is multiplied by its corresponding output scale parameter).
* $\beta$ is the output shift (offset) parameter vector for the layer (it contains one offset parameter per input). Each input is offset by its corresponding shift parameter.
* $\varepsilon$ is a tiny number that avoids division by zero (typically $10^{-5}$). This is called a *smoothing term*.
* $z^{(i)}$ is the output of the BN operation. It is a rescaled & shifted version of the inputs.

So during training, BN standardised its inputs, then rescales & offsets them. Good! What about at test time? Well, it's not that simple. Indeed, we may need to make predictions for individual instances rather than for batches of instances: in this case, we will have no way to compute each input's mean & standard deviation. Moreover, even if we do have a batch of instances, it may be too small, or the instances may be independent & identically distributed, so computing statistics over the batch instances would be unreliable. One solution could be to wait until the end of training, then run the whole training set through the neural network & compute the mean & standard deviations of each input of the BN layer. These "final" input means & standard deviations could then be used instead of the batch input means & standard deviations could then be used instead of the batch input means & standard deviations when making predictions. However, most implementations of batch normalisation estimate these final statistics during training by using a moving average of the layer's input means & standard deviations. This is what Keras does automatically when you use the `BatchNormalization` layer. To sum up, four parameter vectors are learned in each batch-normalised layer: $\gamma$ (the output scale vecetor), & $\beta$ (the output offset vector) are learned through regular backpropagation, & $\mu$ (the final input mean vector) & $\sigma$ (the final input standard deviation vector) are estimated using the exponential moving average. Note that $\mu$ & $\sigma$ are estimated during training, but they are used only after training (to replace the batch input means & standard deviations of the above equation.

Ioffe & Szegedy demonstrated that batch normalisation considerably improved all the deep neural networks they experimented with, leading to a huge improvement in the ImageNet classification task (ImageNet is a large database of images classified into many classes, commonly used to evaluate computer vision systems). The vanishing gradients problem was strongly reduced, to the point that they could use saturating activation functions such as the tanh & even the logistic activation function. The networks were also much less sensitive to the weight initialisation. The authors were able to use much larger learning rates, significantly speeding up the learning process. Specifically, they note that: 

*Applied to a state-of-the-art image classification model, batch normalisation achieves the same accuracy with 14 times fewer training steps, & beats the original model by a significant margin. [...] Using an ensemble of batch-normalised networks, we improve upon the best published result on ImageNet classification: reaching 4.9% top-5 validation error (& 4.8% test error), exceeding the accuracy of human raters.*

Finally, like a gift that keeps on giving, batch normalisation acts like a regulariser, reducing the need for other regularisation techniques (such as dropout, discussed later in this lesson).

Batch normalisation does, however, add some complexity to the model (although it can remove the need for normalising the input data, as we discussed earlier). Moreover, there is a runtime penalty: the neural network makes slower predictions due to the extra computations required at each layer. Fortunately, it's often possible to fuse the BN layer with the previous layer, after training, thereby avoiding the runtime penalty. This is done by updating the previous layer's weights & biases so that it directly produces outputs of the appropriate scale & offset. For example if the previous layer computes $XW + b$, then the BN layer will compute $\gamma \otimes (XW + b - \mu)/\sigma + \beta$, the equation simplifies to $XW' + b'$. So if we replace the previous layer's weights & biases ($W$ & $b$) with the updated weights & biases ($W'$ & $b'$), we can get rid of the BN layer.

### Implementing Batch Normalisation with Keras

As with most things with Keras, implementing batch normalisation is simple & intuitive. Just add a `BatchNormalization` layer before or after each hidden layer's activation function, & optionally add a BN layer as well as the first layer in your model. For example, this example applies BN after every hidden layer & as the first layer in the model (after flattening the input images):

In [6]:
model = keras.models.Sequential([
    keras.layers.Flatten(input_shape = [28, 28]),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(300, activation = "elu", kernel_initializer = "he_normal"),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(100, activation = "elu", kernel_initializer = "he_normal"),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(10, activation = "softmax")
])

That's all! In this tiny example with just 2 hidden layers, it's unlikely that batch normalisation will have a very positive impact; but for deeper networks it can make a tremendous difference.

Let's display the model summary:

In [7]:
model.summary()

Model: "sequential_2"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 flatten_2 (Flatten)         (None, 784)               0         
                                                                 
 batch_normalization (BatchN  (None, 784)              3136      
 ormalization)                                                   
                                                                 
 dense_9 (Dense)             (None, 300)               235500    
                                                                 
 batch_normalization_1 (Batc  (None, 300)              1200      
 hNormalization)                                                 
                                                                 
 dense_10 (Dense)            (None, 100)               30100     
                                                                 
 batch_normalization_2 (Batc  (None, 100)             

As you can see, each BN layer adds 4 parameters per input: $\gamma$, $\beta$, $\mu$, & $\sigma$ (for example, the first BN layer adds 3,136 parameters, which is 4 $\times$ 784). The last two parameters $\mu$ & $\sigma$, are the moving averages; they are not affected by backpropagation, so Keras calls them "non-trainable" (if you count the the total number of BN parameters, 3135 + 1200 + 400, & divide by 2, you get 2,368, which is the total of non-trainable parameters in this model).

Let's look at the parameters of the first BN layer. Two are trainable (by backpropagation), & two are not:

In [8]:
[(var.name, var.trainable) for var in model.layers[1].variables]

[('batch_normalization/gamma:0', True),
 ('batch_normalization/beta:0', True),
 ('batch_normalization/moving_mean:0', False),
 ('batch_normalization/moving_variance:0', False)]

Now when you create a BN layer in Keras, it also creates two operations that will be called by Keras at each iteration during training. These operations will update the moving averages. Since we are using the TensorFlow backend, these operations are TensorFlow operations.

The authors of the BN paper argued in favor of adding the BN layers before the activation functions, rather than after (as we just did). There is some debate about this, as which is preferable seems to depend on the task -- you can experiment with this too to see which option works best on your dataset. To add the BN layers before the activation functions, you must remove the activation function from the hidden layers & add them as separate layers after the BN layers. Moreover, since a batch normalisation layer includes one offset parameter per input, you can remove the bias term from the previous layer (just pass `use_bias = False` when creating it):

In [9]:
model = keras.models.Sequential([
    keras.layers.Flatten(input_shape = [28, 28]),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(300, kernel_initializer = "he_normal", use_bias = False),
    keras.layers.BatchNormalization(),
    keras.layers.Activation("elu"),
    keras.layers.Dense(100, kernel_initializer = "he_normal", use_bias = False),
    keras.layers.BatchNormalization(),
    keras.layers.Activation("elu"),
    keras.layers.Dense(10, activation = "softmax")
])

The `BatchNormalization` class has quite a few hyperparameters you can tweak. The defaults will usually be fine, but you may occasionally need to tweak the `momentum`. This hyperparameter is used by the `BatchNormalization` layer when it updates the exponential moving averages; given a new value $v$ (i.e., a new vector of input means or standard deviations computed over the current batch), the layer updates the running average $\hat{v}$ using the following equation:

$$\hat{v} \leftarrow \hat{v} \times {momentum} + v \times (1 - {momentum})$$

A good momentum value is typically close to 1; for example, 0.9, 0.99, or 0.999 (you want more 9s for larger datasets & smaller mini-batches).

Another important hyperparameter is `axis`: it determines which axis should be normalized. It defaults to -1, meaning that by default, it will normalise the last axis (using the means & standard deviations computed across the *other* axes). When the input batch is 2D (i.e., the batch shape is [*batch size*, *features*]), this means that each input feature will be normalised based on the mean & standard deviation computed across all the instances in the batch. For example, the first BN layer in the previous code example will independently normalise (& rescale & shift) each of the 784 input features. If we move the first BN layer before the `Flatten` layer, then the input batches will be 3D, with shape [*batch size*, *height*, *width*]; therefore, the BN layer will compute 28 means & 28 standard deviations (1 per column of pixels, computed across all instances in the batch & across all rows in the column), & it will normalise all pixels in a given column using the same mean & standard deviation. There will also be just 28 scale parameters & 28 shift parameters. If instead you still want to treat each of the 784 pixels independently, then you should set `axis = [1, 2]`.

Notice that the BN layer does not perform the same computation during training & after training: it uses batch statistics during the training & the "final" statistics after training (i.e., the final values of the moving averages). Let's take a peek at the source code of this class to see how this is handled:

In [None]:
class BatchNormalization(keras.layers.Layer):
    [...]
    def call(self, inputs, training = None):
        [...]

The `call()` method is the one that performs the computations; as you can see, it has an extra `training` argument, which is set to `None` by default, but the `fit()` method sets it to 1 during training. If you ever need to write a custom layer, & it must behave differently during training & testing, add a `training` argument to the `call()` method & use this argument in the method to decide what to compute.

`BatchNormalization` has become one of the most-used layers in deep neural networks, to the point that it is often omitted in the diagrams, as it is assumed that BN is added after every layer. But a recent paper by Hongyi Zhang may change this assumption: by using a novel *fixed-updated* (fixup) weight initialisation technique, the authors managed to train a very deep neural network (10,000 layers!) without BN, achieving state-of-the-art performance on complex image classification tasks. As this is bleeding-edge research, however, you may want to wait for additional research to confirm this finding before you drop batch normalisation.

## Gradient Clipping

Another popular technique to mitigate the exploding gradients problem is to clip the gradients during backpropagation so they never exceed some threshold. This is called gradient clipping. This technique is most often used in recurrent neural networks, as batch normalization is tricky to use in RNN. For other types of networks, BN is usually sufficient.

In Keras, implementing gradient clipping is just a matter of setting the `clipvalue` or `clipnorm` argument when creating an optimizer, like so:

In [10]:
optimizer = keras.optimizers.SGD(clipvalue = 1.0)
model.compile(loss = "mse", optimizer = optimizer)

This optimizer will clip every component of the gradient vector to a value between -1.0 & 1.0. This means that all the partial derivatives of the loss (with regard to each & every trainable parameter) will be clipped between -1.0 & 1.0. The threshold is a hyperparameter you can tune. Note that it may change the orientation of the gradient vector. For instance, if the original gradient vector is [0.9, 100.0], it points mostly in the direction of the second axis; but once you clip it by value, you get [0.9, 1.0], which points roughly in the diagonal between the two axes. In practice, this approach works well. If you want to ensure that gradient clipping does not change the direction of the gradient vector, you should clip by norm by setting `clipnorm` instead of `clipvalue`. This will clip the whole gradient if its $l_2$ norm is greater than the threshold that you picked. For example, if you set `clipnorm = 1.0`, then the vector [0.9, 100.0] will be clipped to [0.00899964, 0.9999595], preserving its orientation but almost eliminating the first component. If you observe that the gradients explode during training (you can track the size of the gradients using Tensorboard), you may want to try both clipping by value & clipping by norm, with different thresholds, & see which option performs best on the validation set.

---

# Reusing Pretrained Layers

It is generally not a good idea to train a very large DNN from scratch: instead, you should always try to find an existing neural network that accomplishes a similar task to the one you are trying to tackle, then reuse the lower layers of this network. This technique is called *transfer learning*. It will not only speed up training considerably, but also require significantly less training data.

Suppose you have access to a DNN that was trained to classify pictures into 100 different categories, including animals, plants, vehicles, & everyday objects. You now want to train a DNN to classify specific types of vehicles. These tasks are very similar, even partly overlapping, so you should try to reuse parts of the first network.

<img src = "Images/Reusing Pre-Trained Layers.png" width = "500" style = "margin:auto"/>

The output layer of the original model should usually be replaced because it is most likely not useful at all for the new task, & it may not even have the right number of outputs for the new task. 

Similarly, the upper hidden layers of the original model are less likely to be as useful as the lower layers, since the high-level features that are most useful for the new task may differ significantly from the ones that were most useful for the original task. You want to find the right number of layers to reuse.

Try freezing all the reused layers first (i.e., make their weights non-trainable so that gradient descent won't modify them), then train your model & see how it performs, Then try unfreezing one or two of the top hidden layers to let backpropagation tweak them & see if performance improves. The more training data you have, the more layers you can unfreeze. It is also useful to reduce the learning rate when you unfreeze reused layers: this will avoid wrecking their fine-tuned weights.

If you still cannot get good performance, & you have little training data, try dropping the top hidden layer(s) & freezing all the remaining hidden layers again. You can iterate until you find the right number of layers to reuse. If you have plently of training data, you may try replacing the top hidden layers instead of dropping them, & even adding more hidden layers.

## Transfer Learning with Keras

Let's look at an example. Suppose the fashion mnist dataset only contained eight classes -- for example, all the classes except for sandal & shirt. Someone built & trained a keras model on that set & got reasonably good performance (>90% accuracy). Let's call this model A. You now want to tackle a different task: you have images of sandals & shirts, & you want to train a binary classifier (positive = shirt, negative = sandal). Your dataset is quite small; you only have 200 labeled images. When you train a new model for this task (let's call it model B) with the same architecture as model A, it performs reasonably well (97.2% accuracy). But since it's a much easier task (there are just two classes), you were hoping for more. While drinking your morning coffee, you realise that your task is quite similar to task A, so perhaps transfer learning can help. Let's find out.

First, you need to load model A & create a new model based on that model's layers. Let's reuse all the layers except for the output layer:

In [11]:
import numpy as np

(X_train, y_train), (X_test, y_test) = keras.datasets.mnist.load_data()
X_val, X_train = X_train[:5000] / 255.0, X_train[5000:] / 255.0
y_val, y_train = y_train[:5000], y_train[5000:]
X_test = X_test / 255.0

def split_dataset(X, y):
    y_5_or_6 = (y == 5) | (y == 6)
    y_A = y[~y_5_or_6]
    y_A[y_A > 6] -= 2
    y_B = (y[y_5_or_6] == 6).astype(np.float32)
    return ((X[~y_5_or_6], y_A),
            (X[y_5_or_6], y_B))

(X_train_A, y_train_A), (X_train_B, y_train_B) = split_dataset(X_train, y_train)
(X_val_A, y_val_A), (X_val_B, y_val_B) = split_dataset(X_val, y_val)
(X_test_A, y_test_A), (X_test_B, y_test_B) = split_dataset(X_test, y_test)
X_train_B = X_train_B[:200]
y_train_B = y_train_B[:200]

In [12]:
model_A = keras.models.Sequential()
model_A.add(keras.layers.Flatten(input_shape = [28, 28]))
for n_hidden in (300, 100, 50, 50, 50):
    model_A.add(keras.layers.Dense(n_hidden, activation = "selu"))
model_A.add(keras.layers.Dense(8, activation = "softmax"))

model_A.compile(loss = "sparse_categorical_crossentropy",
                optimizer = keras.optimizers.SGD(learning_rate = 1e-3),
                metrics = ["accuracy"])

history = model_A.fit(X_train_A, y_train_A, epochs = 20,
                      validation_data = (X_val_A, y_val_A))

model_A.save("my_model_A.h5")

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


In [13]:
model_A = keras.models.load_model("my_model_A.h5")
model_B_on_A = keras.models.Sequential(model_A.layers[:-1])
model_B_on_A.add(keras.layers.Dense(1, activation = "sigmoid"))

Note that `model_A` & `model_B_on_A` now share some layers. When you train `model_B_on_A`, it will also affect `model_A`. If you want to avoid that, you need to *clone* `model_A` before you reuse its layers. To do this, you clone model A's architecture with `clone_model()`, then copy its weights (since `clone_model()` does not clone the weights):

In [14]:
model_A_clone = keras.models.clone_model(model_A)
model_A_clone.set_weights(model_A.get_weights())

Now you could train `model_B_on_A` for task B, but since the new output layer was initialised randomly it will make large errors (at least during the first few epochs), so there will be large error gradietns that may wreck the reused weights. To avoid this, one approach is to freeze the reused layers during the first few epochs, giving the new layer some time to learn reasonable weights. To do this, set every layer's `trainable` attribute to `False` & compile the model:

In [15]:
for layer in model_B_on_A.layers[:-1]:
    layer.trainable = False
    
model_B_on_A.compile(loss = "binary_crossentropy", optimizer = "sgd",
                     metrics = ["accuracy"])

Now you can train the model for a few epochs, then unfreeze the reused layers (which requires compiling the model again) & continue training to fine-tune the reused layers for task B. After unfreezing the reused layers, it is usually a good idea to reduce the learning rate, once again to avoid damaging the reused weights:

In [16]:
history = model_B_on_A.fit(X_train_B, y_train_B, epochs = 4,
                           validation_data = (X_val_B, y_val_B))

for layer in model_B_on_A.layers[:-1]:
    layer.trainable = True
    
optimizer = keras.optimizers.SGD(learning_rate = 1e-4)
model_B_on_A.compile(loss = "binary_crossentropy", optimizer = optimizer,
                     metrics = ["accuracy"])
history = model_B_on_A.fit(X_train_B, y_train_B, epochs = 16,
                           validation_data = (X_val_B, y_val_B))

Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4
Epoch 1/16
Epoch 2/16
Epoch 3/16
Epoch 4/16
Epoch 5/16
Epoch 6/16
Epoch 7/16
Epoch 8/16
Epoch 9/16
Epoch 10/16
Epoch 11/16
Epoch 12/16
Epoch 13/16
Epoch 14/16
Epoch 15/16
Epoch 16/16


Well, that didn't work. It turns out that transfer learning does not work very well with small dense networks, presumably because small networks learn few patterns, & dense networks learn very specific patterns, which are unlikely to be useful in other tasks. Transfer learning works best with deep convolutional neural networks, which tend to learn feature detectors that are much more general (especially in the lower layers).

## Unsupervised Pretraining

Suppose you want to tackle a complex task for which you don't have much labeled training data, but unfortunately you cannot find a model trained on a similar task. Don't lose hope! First, you should try to gather more labeled training data, but if you can't you may still be able to perform *unsupervised pretraining*. Indeed, it is often cheap to gather unlabeled training examples, but expensive to label them. If you can gather plenty of unlabeled training data, you can try to use it to train an unsupervised model, such as an autoencoder or a generative adversarial network. Then you can reuse the lower layers of the autoencoder or the lower layers of the GAN's discriminator, add the output layer for your task on top, & fine tune the final network using supervised learning (i.e., with the labeled training examples).

It is this technique that Geoffrey Hinton & his team used in 2006 & which led to the revival of neural networks & the success of deep learning. Until 2010, unsupervised pretraining -- typically with restricted Boltzmann machines (RBMs) -- was the norm for deep nets, & only after the vanishing gradients problem was alleviated did it become much more common to train DNNs purely using supervised learning. Unsupervised pretraining (today typically using autoencoders or GANs rather than RBMs) is still a good option when you have a complex task to solve, no similar model you can reuse, & little labeled training data but plenty of unlabeled training data.

Note that in the early days of deep learning it was difficult to train deep models, so people would use a technique called *greedy layer-wise pretraining* (figure below). They would first train an unsupervised model with a single layer, typically an RBM, then they would freeze that layer & add another one on top of it, then train the model again (effectively just training the new layer), then freeze the new layer & add another layer on top of it, train the model again, & so on. Nowadays, things are much simpler: people generally train the full unsupervised model in one shot (starting at step 3) & use autoencoders or GANS rather than RBMs.

<img src = "Images/Unsupervised Pre-Training.png" width = "500" style = "margin:auto"/>

## Pretraining on an Auxiliary Task

If you do not have much labeled training data, one last option is to train a first neural network on an auxiliary task for which you can easily obtain or generate labeled training data, then reuse the lower layers of that network for your actual task. The first neural network's lower layers will learn feature detectors that will likely be reusable by the second neural network.

For example, if you want to build a system to recognise faces, you may only have a few pictures of each individual -- clearly not enough to train a good clasifier. Gathering hundreds of pictures of each person would not be practical. You could, however, gather a lot of pictures of random people on the web & train a first neural network to detect whether or not two different pictures feature the same person. such a network would learn good feature detectors for faces, so reusing its lower layers would allow you to train a good face classifier that uses little training data.

For *natural language processing* (NLP) applications, you can download a corpus of millions of text documents & automatically generate labelled data from it. For example, you could randomly mask out some words & train a model to predict what the missing words are (e.g., it should predict that the missing word in the sentence "What ___ you saying?" is probably "are" or "were"). If you can train a model to reach good performance on this task, then it will already know quite a lot about language, & you can certainly reuse it for your actual task & fine-tune it on your labeled data.

---

# Faster Optimizers

Training a very large deep neural network can be painfully slow. So far, we have seen four ways to speed up training (& reach a better solution): applying a good initialisation strategy for the connection weights, using a good activation function, using batch normalisation, & reusing parts of a pretrained network (possibly built on a auxiliary task or using unsupervised learning). Another huge speed boost comes from using a faster optimiser than the regular gradient descent optimiser. In this section, we will present the most popular algorithms: momentum optimisation, Nesterov accelerated gradient, AdaGrad, RMSProp, & finally Adam & Nadam optimisation.

## Momentum Optimisation

Imagine a bowling ball rolling down a gentle slope on a smooth surface: it will start out slowly, but it will quickly pick up momentum until it eventually reaches terminal velocity (if there is some friction or air resistance). This is the very simple idea behind *momentum optimisation*, proposed by Boris Polyak in 1964. In contrast, regular gradient descent will simply take small, regular steps down the slope, so the algorithm will take much more time to reach the bottom.

Recall that the gradient descent updates the weights $\theta$ by directly subtracting the gradient of the cost function $J(\theta)$ with regard to the weights ($\triangledown_{\theta}J(\theta)$) multiplied by the learning rate $\eta$. The equation is $\theta \leftarrow \theta - \eta \triangledown_{\theta}J(\theta)$. It does not care what the earlier gradients were. If the local gradient is tiny, it goes very slowly. 

Momentum optimisation cares a great deal about what previous gradients were: at each iteration, it subtracts the local gradient from the *momentum vector* $m$ (multiplied by the learning rate $\eta$), & it updates the weights by adding this momentum vector. In other words, the gradient is used for acceleration, not for speed. To simulate some sort of friction mechanism & prevent the momentum from growing too large, the algorithm introduces a new hyperparameter $\beta$, called the *momentum*, which must be set between 0 (high friction) & 1 (no friction). A typical momentum value is 0.9.

1. $$m \leftarrow \beta m - \eta \triangledown_{\theta}J(\theta)$$
2. $$\theta \leftarrow \theta + m$$

You can easily verify that if the gradient remains constant, the terminal velocity (i.e., the maximum size of the weight updates) is equal to that gradient multiplied by the learning rate $\eta$ multiplied by 1/(1 - $\beta$) (ignoring the sign). For example, if $\beta = 0.9$, then the terminal velocity is equal to 10 times the gradient times the learning rate, so momentum optimisation ends up going 10 times faster than gradient descent! This allows momentum optimisation to escape from plateaus much faster than gradient descent. We saw in previous lessons that when the inputs have different scales, the cost function will look like an elongated bowl. Gradient descent goes down the steep slope quite fast, but then it takes a very long time to go down the valley. In contrast, momentum optimisation will roll down the valley faster & faster until it reaches the bottom (optimum). In deep neural networks that don't use batch normalisation, the upper layers will often end up having inputs with very different scales, so using momentum optimisation helps a lot. It can also help roll past local optima.

Implementing momentum optimisation in Keras is a no-brainer: just use the `SGD` optimizer & set its `momentum` hyperparameter, then lie back & profit!

In [17]:
optimizer = keras.optimizers.SGD(learning_rate = 0.001, momentum = 0.9)

The one drawback of momentum optimisation is that it adds yet another hyperparameter to tune. However, the momentum value of 0.9 usually works well in practice & almost always goes faster than regular gradient descent.

## Nesterov Accelerated Gradient

One small variant to momentum optimisation, proposed by Yurii Nesterov in 1983, is almost always faster than vanilla momentum optimisation. The Nesterov accelerated gradient (NAG) method, also known as *Nesterov momentum optimisation*, measures the gradient of the cost function not at the local position $\theta$ but slightly ahead in the direction of the momentum, at $\theta + \beta m$.

1. $$m \leftarrow m - \eta \triangledown_{\theta}J(\theta + \beta m)$$
2. $$\theta \leftarrow \theta + m$$

This small tweak works because in general, the momentum vector will be pointing in the right direction (i.e, toward the optimum), so it will be slightly more accurate to use the gradient measured a bit farther in that direction rather than the gradient at the original position, as you can see in the figure below (where $\triangledown_1$ represents the gradient of the cost function measured at the starting point $\theta$, & $\triangledown_2$ represents the gradient at the point located at $\theta + \beta m$).

<img src = "Images/Nesterov Momentum Optimisation.png" width = "500" style = "margin:auto"/>

As you can see the Nesterov update ends up slightly closer to the optimum. After a while, these small improvements add up & NAG ends up being significantly faster than regular momentum optimisation. Moreover, not that when the momentum pushes the weights across a valley, $\triangledown_1$ continues to push farther across the valley, while $\triangledown_2$ pushes back toward the bottom of the valley. This helps reduce oscillations & thus NAG converges faster.

NAG is generally faster than regular momentum optimisation. To use it, simply set `nesterov = True` when creating the `SGD` optimiser:

In [18]:
optimizer = keras.optimizers.SGD(learning_rate = 0.001, momentum = 0.9, nesterov = True)

## AdaGrad

Consider the elongated bowl problem again: gradient descent starts by quickly going down the steepest slope, which does not point straight toward the global optimum, then it very slowly goes down to the bottom of the valley. It would be nice if the algorithm could correct its direction earlier to point a bit more toward the global optimum. The AdaGrad algorithm achieves this correction by scaling down the gradient vector along the steepest dimensions.

1. $$s \leftarrow s + \triangledown_{\theta}J(\theta) \otimes \triangledown_{\theta}J(\theta)$$
2. $$\theta \leftarrow \theta - \eta \triangledown_{\theta}J(\theta) \oslash \sqrt{s + \varepsilon}$$

The first step accumulates the square of the gradients into the vector $s$ (recall that the $\otimes$ symbol represents the element-wise multiplication). This vectorised form is equivalent to computing $s_i \leftarrow s_i + (\partial J(\theta) / \partial \theta_i)^2$ for each element $s_i$ of the vector $s$; in other words, each $s_i$ accumulates the sqaures of the partial derivative of the cost function with regard to parameter $\theta_i$. If the cost function is steep along the $i^{th}$ dimension, then $s_i$ will get larger & larger at each iteration.

The second step is almost identical to gradient descent, but with one big difference: the gradient vector is scaled down by a factor of $\sqrt{s + \varepsilon}$ (the $\oslash$ represents the element-wise division, & $\varepsilon$ is a smoothing term to avoid division by zero, typically set to $10^{-10}$). This vectorised form is equivalent to simultaneously computing $\theta_i \leftarrow \theta_i - \eta \partial J(\theta) / \partial \theta_i / \sqrt{s_i + \varepsilon}$ for all parameters $\theta_i$.

In short, this algorithm decays the learning rate, but it does so faster for steep dimensions than for dimensions with gentler slopes. This is called an *adaptive learning rate*. It helps point the resulting updates more directly toward the global optimum. One additional benefit is that it requires much less tuning of the learning rate hyperparameter $\eta$.

<img src = "Images/AdaGrad vs Gradient Descent.png" width = "600" style = "margin:auto"/>

AdaGrad frequently performs well for simple quadratic problems, but it often stops too early when training neural networks. The learning rate gets scaled down so much that the algorithm ends up stopping entirely before reaching the global optimum. So even though Keras has an `Adagrad` optimiser, you should not use it to train deep neural networks (it may be efficient for simpler tasks such as linear regression, though). Still, understanding AdaGrad is helpful to grasp the other adaptive learning rate optimisers.

## RMSProp

As we've seen, AdaGrad runs the risk of slowing down a bit too fast & never converging to the global optimum. The *RMSProp* algorithm fixes this by accumulating only the gradients from the most recent iterations (as opposed to all the gradients since the beginning of training). It does so by using exponential decay in the first step.

1. $$s \leftarrow \beta s + (1 - \beta) \triangledown_{\theta}J(\theta) \otimes \triangledown_{\theta} J(\theta)$$
2. $$\theta \leftarrow \theta - \eta \triangledown_{\theta} J(\theta) \oslash \sqrt{s + \varepsilon}$$

The decay rate $\beta$ is typically set to 0.9. Yes, it is once again a new hyperparameter, but this default value often works well, so you may not need to tune it at all.

As you might expect, Keras has an `RMSprop` optimiser:

In [19]:
optimizer = keras.optimizers.RMSprop(learning_rate = 0.001, rho = 0.9)

Note that the `rho` argument corresponds to $\beta$ in the above equation. Except on very simple problems, this optimiser almost always performs much better than AdaGrad. In fact, it was the preferred optimisation algorithm of many researchers until Adam optimisation came around.

## Adam & Nadam Optimisation

Adam, which stands for *adaptive moment estimation*, combines the ideas of momentum optimisation & RMSProp: just like momentum optimisation, it keeps track of an exponential decaying average of past gradients; & just like RMSProp, it keeps track of an exponentially decaying average of past squared gradients.

1. $$m \leftarrow \beta_1 m - (1 - \beta_1) \triangledown_{\theta}J(\theta)$$
2. $$s \leftarrow \beta_2 s + (1 - \beta_2) \triangledown_{\theta}J(\theta) \otimes \triangledown_{\theta} J(\theta)$$
3. $$\hat{m} \leftarrow \frac{m}{1 - \beta_1^t}$$
4. $$\hat{s} \leftarrow \frac{s}{1 - \beta_2^t}$$
5. $$\theta \leftarrow \theta + \eta \hat{m} \oslash \sqrt{\hat{s} + \varepsilon}$$

In this equation, $t$ represents the iteration number (starting at 1).

If you just look at steps 1, 2, & 5, you will notice Adam's close similarity to both momentum optimisation & RMSProp. The only difference is that step 1 computes an exponentially decaying average rather than an exponentially decaying sum, but these are actually equivalent except for a constant factor (the decaying average is just $1 - \beta_1$ times the decaying sum). Steps 3 & 4 are somewhat of a technical detail: since $m$ & $s$ are initialised at 0, they will be biased toward 0 at the beginning of training, so these two steps will help boost $m$ & $s$ at the beginning of training.

The momentum decay hyperparameter $\beta_1$ is typically initialised to 0.9, while the scaling decay hyperparameter $\beta_2$ is often initialised to 0.999. As earlier, the smoothing term $\varepsilon$ is usually initialised to a tiny number such as $10^{-7}$. These are the default values for the `Adam` class (to be precise, `epsilon` defaults to `None`, which tells Keras to use `keras.backend.epsilon()`, which defaults to $10^{-7}$; you can change it using `keras.backend.set_epsilon()`). Here is how to create an Adam optimiser using Keras:

In [20]:
optimizer = keras.optimizers.Adam(learning_rate = 0.001, beta_1 = 0.9, beta_2 = 0.999)

Since Adam is an adaptive learning rate algorithm (like AdaGrad & RMSProp), it requires less tuning of the learning rate hyperparameter $\eta$. You can often use the default value $\mu = 0.001$, making Adam even easier to use than gradient descent.

Finally, two variants of Adam are worth mentioning:

* *AdaMax*
   - Notice that in step 2 of the Adam algorithm, Adam accumulates the squares of the gradients in $s$ (with greater weight for more recent gradients). In step 5, if we ignore $\varepsilon$ ^ steps 3 & 4 (which are technical details anyway), Adam scales down the parameter updates by the square root of $s$. In short, Adam scales down the parameter updates by the $l_2$ norm of the time-decayed gradeitns (recall that the $l_2$ is the square root of the sum of squares). AdaMax, introduced in the same paper as Adam, replaces the $l_2$ norm with the $l_{\infty}$ norm (a fancy way of saying the max). Specifically, it replaces step 2 in the Adam algorithm with $s \leftarrow max(\beta_2 s, \triangledown_{\theta}J(\theta))$, it drops step 4, & in step 5, it scales down the gradient updates by a factor of $s$, which is just the max of the time-decayed gradients. In practice, this can make AdaMax more stable than Adam, but it really depends on the dataset, & in general, Adam performs better. So, this is just one more optimiser you can try if you experience problems with Adam on some task.
* *Nadam*
   - Nadam optimisation is Adam optimisation plus the Nesterov trick, so it will often converge slightly faster than Adam. In his report, the researcher Timothy Dozat compares many different optimizers on various tasks & finds that Nadam generally outperforms Adam but is sometimes outperformed by RMSProp.
   
All of the optimisation techniques discussed so far only rely on the *first-order partial derivatives* (*Jacobians*). The optimisation literature also contains amazing algorithms based on the *second-order partial derivatives* (the *Hessians*, which are the partial derivatives of the Jacobians). Unfortunately, these algorithms are very hard to apply to deep neural networks because there are $n^2$ Hessians per output (where $n$ is the number of parameters), as opposed to just $n$ Jacobians per output. Since DNNs typically have tens of thousands of parameters, the second-order optimisation algorithms often don't even fit in memory, & even when they do, computing the Hessians is just too slow. 

The below table compares all the optimisers we've discussed so far (* is bad, ** is average, & *** is good).

|Class|Convergence Speed|Convergence Quality|
|:---:|:---:|:---:|
|SGD|*|***|
|SGD(momentum = ...)|**|***|
|SGD(momentum = ..., nesterov = True)|**|***|
|Adagrad|***|* (stops too early)|
|RMSprop|***|** or ***|
|Adam|***|** or ***|
|Nadam|***|** or ***|
|AdaMax|***|** or ***|

## Learning Rate Scheduling

Finding a good learning rate is very important. If you set it much too high, training may diverge. If you set it too low, training will eventually converge to the optimum, but it will take a very long time. If you set it slightly too high, it will progress very quickly at first, but it will end up dancing around the optimum, never really settling down. If you have a limited computing budget, you may have to interrupt training before it has converged properly, yielding a suboptimal solution.

<img src = "Images/Learning Curves vs Learning Rates.png" width = "600" style = "margin:auto"/>

As discussed before, you can find a good learning rate by training the model for a few hundred iterations, exponentially increasing the learning rate from a very small value to a very large, & then looking at the learning curve & picking a learning rate slightly lower than the one at which the learning curve starts shooting back up. You can then reinitialise your model & train it with that learning rate.

But you can do better than a constant learning rate: if you start with a large learning rate & then reduce it once training stops making fast progress, you can reach a good solution faster than with the optimal constant learning rate. There are many different strategies to reduce the learning rate during training. It can also be beneficial to start with a low learning rate, increase it, then drop it again. These strategies are called *learning schedules*. These are the most commonly used learning schedules:

* *Power scheduling*
   - Set the learning rate to a function of the iteration number $t$: $\eta(t) = \eta_0 / (1 + t/s)^c$. The initial learning rate $\eta_0$, the power $c$ (typically set to 1), & the steps $s$ are hyperparameters. The learning rate drops at each step. After $s$ steps, it is down to $\eta_0 / 2$. After $s$ more steps, it is down to $\eta_0 / 3$, then it goes down to $\eta_0 / 4$, then $\eta_0 / 5$, & so on. As you can see, this schedule first drops quickly, then more & more slowly. Of course, power scheduling requires tuning $\eta_0$ & $s$ (& possibly $c$).
* *Exponential scheduling*
   - Set the learning rate to $\eta(t) = \eta_0 0.1^{t/s}$. The learning rate will gradually drop by a factor of 10 every $s$ steps. While power scheduling reduces the learning rate more & more slowly, exponential scheduling keeps slashing it by a factor of 10 every $s$ steps.
* *Piecewise constant scheduling*
   - Use a constant learning rate for a number of epochs (e.g., $\eta_0 = 0.1$ for 5 epochs), then a smaller learning rate for another number of epochs (e.g., $\eta_1 = 0.001$ for 50 epochs), & so on. Although this solution can work very well, it requires fiddling around to figure out the right sequence of learning rates & how long to use each of them.
* *Performance scheduling*
   - Measure the validation error every N steps (just like for early stopping), & reduce the learning rate by a factor of $\lambda$ when the error stops dropping.
* *1cycle scheduling*
   - Contrary to the other approaches, *1cycle* (introduced in a 2018 paper by Leslie Smith) starts by increasing the initial learning rate $\eta_0$, growing linearly up to $\eta_1$ halfway through training. Then it decreases the learning rate linearly down to $\eta_0$ again during the second half of training, finishing the last few epochs by dropping the rate down by several orders of magnitude (still linearly). The maximum learning rate $\eta_1$ is chosen using the same approach we used to find the optimal learning rate, & the initial learning rate $\eta_0$ is chosen to be roughly 10 times lower. When using a momentum, we start with a high momentum first (e.g., 0.95), then drop it down to a lower momentum during the first half of training (e.g., down to 0.85, linearly), & then bring it back up to the maximum value (e.g., 0.95) during the second half of training, finishing the last few epochs with that maximum value. Smith did many experiments showing that this approach was often able to speed up training considerably & reach better performance. For example, on the popular CIFAR10 image dataset, this approach reached 91.9% validation accuracy in just 100 epochs, instead of 90.3% accuracy in 800 epochs through a standard approach (with the same neural network architecture).
   
A 2013 paper by Andrew Senior compared the performance of some of the most popular learning schedules when using momentum optimisation to train deep neural networks for speech recognition. The authors concluded that, in this setting, both performance scheduling & exponential scheduling performed well. They both favored exponential scheduling because it was easy to tune & it converged slightly faster to the optimal solution (they also mentioned that it was easier to implement than performance scheduling, but in Keras, both options are easy). That said, the 1cycle approach seems to perform even better.

Implementing power scheduling in Keras is the easiest option: just set the `decay` hyperparameter when creating an optimizer:

In [21]:
optimizer = keras.optimizers.SGD(learning_rate = 0.01, decay = 1e-4)

The `decay` is the inverse of $s$ (the number of steps it takes to divide the learning rate by one more unit), & Keras assumes that $c$ is equal to 1.

Exponential scheduling & piecewise scheduling are quite simple too. You first need to define a function that takes the current epoch & returns the learning rate. For example, let's implement exponential scheduling:

In [22]:
def exponential_decay_fn(epoch):
    return 0.01 * 0.1 ** (epoch / 20)

If you do not want to hardcode $\eta_0$ & $s$, you can create a function that returns a configured function:

In [23]:
def exponential_decay(lr0, s):
    def exponential_decay_fn(epoch):
        return lr0 * 0.1 ** (epoch / s)
    return exponential_decay_fn

exponential_decay_fn = exponential_decay(lr0 = 0.01, s = 20)

Next, create a `LearningRateScheduler` callback, giving it the schedule function, & pass this callback to the `fit()` method:

In [None]:
lr_sceduler = keras.callbacks.LearningRateScheduler(exponential_decay_fn)
history = model.fit(X_train_scaled, y_train, [...], callbacks = [lr_scheduler])

The `LearningRateScheduler` will update the optimiser's `learning_rate` attribute at the beginning of each epoch. Updating the learning rate once per epoch is usually enough, but if you want it to be updated more often, for example at every step, you can always write your own callback. Updating the learning rate at every step makes sense if there are many steps per epoch. Alternatively, you can use the `keras.optimizers.schedules` approach, described shortly.

The schedule function can optionally take the current learning rate as a second argument. For example, the following schedule function multiplies the previous learning rate by $0.1^{1/20}$, which results in the same exponential decay (except the decay now starts at the beginning of epoch 0 instead of 1):

In [24]:
def exponential_decay_fn(epoch, lr):
    return lr * 0.1 ** (1 / 20)

This implementation relies on the optimiser's initial learning rate (contrary to the previous implementation), so make sure to set it appropriately.

When you save a model, the optimiser & its learning rate get saved along with it. This means that with this new schedule function, you could just load a trained model & continue training where it left off, no problem. Things are not so simple if your schedule function uses the `epoch` argument, however: the epoch does not get saved, & it get reset to 0 every time you call the `fit()` method. If you were to continue training a model where it left off, this could lead to a very large learning rate, which would likely damage your model's weights. One solution is to manually set the `fit()` method's `initial_epoch` argument so the `epoch` starts at the right value.

For piecewise constant scheduling, you can use a schedule function like the following one (as earlier, you can define a more general function if you want, then create a `LearningRateScheduler` callback with this function & pass it to the `fit()` method, just like we did for exponential scheduling:

In [25]:
def piecewise_constant_fn(epoch):
    if epoch < 5:
        return 0.01
    elif epoch < 15:
        return 0.005
    else: 
        return 0.001

For performance scheduling, use the `ReduceLROnPlateau` callback. For example, if you pass the following callback to the `fit()` method, it will multiply the learning rate by 0.5 whenever the best validation loss does not improve for five consecutive epochs.

In [26]:
lr_scheduler = keras.callbacks.ReduceLROnPlateau(factor = 0.5, patience = 5)

Lastly, tf.keras offers an alternative way to implement learning rate scheduling: define the learning rate using one of the schedules available in `keras.optimizers.schedules`, then pass this learning rate to any optimiser. This approach updates the learning rate at each step rather than at each epoch. For example, here is how to implement the same exponential schedule as the `exponential_decay_fn()` function we defined earlier:

In [27]:
s = 20 * len(X_train) // 32
learning_rate = keras.optimizers.schedules.ExponentialDecay(0.01, s, 0.1)
optimizer = keras.optimizers.SGD(learning_rate)

This is nice & simple, plus when you save the model, the learning rate & its schedule (including its state) get saved as well. This approach, however, is not part of the Keras API; it is specific to tf.keras.

As for the 1cycle approach, the implementation poses no particular difficulty: just create a custom callback that modifies the learning rate at each iteration (you can updated the optimizer's learning rate by changing `self.model.optimizer.lr`). 

To sum up, exponential decay, performance scheduling, & 1cycle can considerably speed up convergence, so give it a try.

---

# Avoiding Overfitting Through Regularisation

With thousands of parameters, you can fit a whole zoo. Deep neural networks typically have tens of thousands of parameters, sometimes even millions. This gives them an incredible amount of freedom & means that they can fit a huge variety of complex datasets. But this great flexibility also makes the network prone to overfitting the training set. We need regularisation. 

We already implemented one of the best regularisation techniques in the previous lesson, early stopping. Moreover, even though batch normalisation was designed to solve the unstable gradients problems, it also acts like a pretty good regulariser. In this section, we will examine other popular regularisation techniques for neural networks: $l_1$ & $l_2$ regularisation, dropout, & max-norm regularisation.

## $l_1$ & $l_2$ Regularisation

Just like we did for simple linear models, we can also use $l_2$ regularisation to constrain a neural network's connection weights, &/or $l_1$ regularisation if you want a sparse model (with many weights equal to 0). Here is how to apply $l_2$ regularisation to a Keras layer's connection weights, using a regularisation factor of 0.01:

In [28]:
layer = keras.layers.Dense(100, activation = "elu",
                           kernel_initializer = "he_normal",
                           kernel_regularizer = keras.regularizers.l2(0.01))

The `l2()` function returns a regulariser that will be called at each step during training to compute the regularisation loss. This is then added to the final loss. As you might expect, you can just use `keras.regularizers.l1()` if you want $l_1$ regularisation; if you want both $l_1$ & $l_2$ regularisation, use `keras.regularizers.l1_l2()` (specifying both regularisation factors).

Since you will typically want to apply the same regulariser to all layers in your network, as well as using the same activation function & the same initialisation strategy in all hidden layers, you may find yourself repeating the same arguments. This makes the code ugly & error-prone. To avoid this, you can try refactoring your code to use loops. Another option is to use Python's `functools.partial()` function, which lets you create a thin wrapper for any callable, with some default argument values:

In [29]:
from functools import partial

RegularizedDense = partial(keras.layers.Dense, 
                           activation = "elu",
                           kernel_initializer = "he_normal",
                           kernel_regularizer = keras.regularizers.l2(0.01))

model = keras.models.Sequential([
    keras.layers.Flatten(input_shape = [28, 28]),
    RegularizedDense(300),
    RegularizedDense(100),
    RegularizedDense(10, activation = "softmax",
                     kernel_initializer = "glorot_uniform")
])

## Dropout

*Dropout* is one of the most popular regularisation techniques for deep neural networks. It was proposed in a paper by Geoffrey Hinton in 2012 & further detailed in a 2014 paper by Nitish Srivastava, & it has proven to be highly successful: even the state-of-the-art neural networks get a 1-2% accuracy boost simply by adding dropout. This may not sound like a lot, but when a model already has 95% accuracy, getting a 2% accuracy boost means dropping the error rate by almost 40% (going from 5% to roughly 3%).

It is a fairly simple algorithm: at every training set, every neuron (including the input neurons, but always excluding the output neurons) has a probability $p$ of being temporarily "dropped out", meaning it will be entirely ignored during this training step, but may be active during the next step. The hyperparameter $p$ is called the *dropout rate*, & it is typically set between 10% & 50%; closer to 20-30% in recurrent neural networks, & closer to 40-50% in convolutional neural networks. After training, neurons don't get dropped anymore.

<img src = "Images/Dropout Regularisation.png" width = "500" style = "margin:auto"/>

It's surprising at first that this destructive technique works at all. Would a company perform better if its employees were told to toss a coin every morning to decide whether or not to go to work? Well, who knows; perhaps it would! The company would be forced to adapt its organisation; it could not rely on any single person to work the coffee machine or perform any other critical tasks, so this expertise would have to be spread across several people. Employees would have to learn to cooperate with many of their coworkers, not just a handful of them. The company would become much more resilient. If one person quit, it wouldn't make much of a difference. It's unclear whether this idea would actually work for companies, but it certainly does for neural networks. Neurons trained with dropout cannot co-adapt with their neighbouring neurons; they have to be as useful as possible on their own. They also cannot rely excessively on just a few input neurons; they must pay attention to each of their input neurons. They end up being less sensitive to slight changes in the inputs. In the end, you get a more robust network that generalises better.

Another way to understand the power of dropout is to realise that a unique neural network is generated at each training step. Since each neuron can be either present or absent, there are a total of $2^N$ possible networks (where $N$ is the total number of droppable neurons). This is such a huge number that it is virtually impossible for a same neural network to be sampled twice. Once you have run 10,000 training steps, you have essentially trained 10,000 different neural networks (each with just one training instance). These neural networks are obviously not independent because they share many of their weights, but they are nevertheless all different. The resulting neural network can be seen as an averaging ensemble of all these smaller neural networks.

There is one small but important technical detail. Suppose $p = 50%$, in which case during testing, a neuron would be connected to twice as many input neurons as it would be (on average) during training. To compensate for this fact, we need to multiply each neuron's input connection weights by 0.5 after training. If we don't each neuron will get a total input signal roughly twice as large as what the network was trained on & will be unlikely to perform well. More generally, we need to multiply each input connection weight by the *keep probability* (1 - $p$) after training. Alternatively, we can divide each neuron's output by the keep probability during training (these alternatives are not perfectly equivalent, but they work equally well).

To implement dropout using keras, you can use the `keras.layers.Dropout` layer. During training, it randomly drops some inputs (setting them to 0) & divides the remaining inputs by the keep probability. After training, it does nothing at all; it just passes the inputs to the next layer. The following code applies dropout regularisation before every `Dense` layer, using a dropout rate of 0.2:

In [30]:
model = keras.models.Sequential([
    keras.layers.Flatten(input_shape = [28, 28]),
    keras.layers.Dropout(rate = 0.2),
    keras.layers.Dense(300, activation = "elu", kernel_initializer = "he_normal"),
    keras.layers.Dropout(rate = 0.2),
    keras.layers.Dense(100, activation = "elu", kernel_initializer = "he_normal"),
    keras.layers.Dropout(rate = 0.2),
    keras.layers.Dense(10, activation = "softmax")
])

If you observe that the model is overfitting, you can increase the dropout rate. Conversely, you should try decreasing the dropout rate if the model underfits tthe training set. It can also help to increase the dropout rate for large layers, & reduce it for small ones. Moreover, many state-of-the-art architectures only use dropout after the last hidden layer, so you may want to try this if full dropout is too strong.

Dropout does tend to significantly slow down convergence, but it usually results in a much better model when tuned properly. So, it is generally well worth the extra time & effort.

## Monte Carlo (MC) Dropout

In 2016, a paper by Yarin Gal & Zoubin Ghahramani added a few more good reasons to use dropout:

* First, the paper established a profound connection between dropout networks (i.e., neural networks containing a `Dropout` layer before every weight layer) & approximate Bayesian inference, giving droupout a solid mathematical justification.
* Second, the authors introduced a powerful technique called *MC Dropout*, which can boost the performance of any trained droupout model without having to retrain it or even modify it all all, provides a much better measure of the model's uncertainty, & is amazingly simple to implement.

If this all sounds like a "one weird trick" advertisement, then take a look at the following code. It is the full implementation of *MC Dropout*, boosting the dropout model we trained earlier without retraining it:

In [31]:
y_probas = np.stack([model(X_test, training = True) for sample in range(100)])
y_proba = y_probas.mean(axis = 0)

We just made 100 predictions over the test set, setting `training = True` to ensure that the `Dropout` layer is active, & stack the predictions. Since dropout is active, all the predictions will be different. Recall that `predict()` returns a matrix with one row per instance & one column per class. Because there are 10,000 instances in the test set & 10 classes, this is a matrix of shape [10000, 10]. We stack 100 such matrices, so `y_probas` is an array of shape [100, 10000, 10]. Once we average over the first dimension (`axis = 0`), we get `y_proba`, an array of shape [10000, 10], like we would get with a single prediction. That's all! Averaging over multiple predictions with dropout on gives us a Monte Carlo estimate that is generally more reliable than the result of a single prediction with dropout off. For example, let's look at the model's prediction for the first instance in the fashion MNIST test set, with dropout off:

In [37]:
np.round(model.predict(X_test[:1], 2))



array([[0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]], dtype=float32)

The model seems almost certain that this image belongs to class 9 (ankle boot). Should you trust it? Is there really so little room for doubt? Compare this with the predictions made when dropout is activated:

In [38]:
np.round(y_probas[:, :1], 2)

array([[[0.1 , 0.12, 0.04, 0.06, 0.28, 0.11, 0.08, 0.06, 0.05, 0.1 ]],

       [[0.1 , 0.12, 0.05, 0.04, 0.21, 0.16, 0.04, 0.08, 0.11, 0.08]],

       [[0.16, 0.08, 0.07, 0.09, 0.16, 0.12, 0.05, 0.05, 0.06, 0.16]],

       [[0.13, 0.09, 0.15, 0.08, 0.24, 0.14, 0.02, 0.04, 0.04, 0.07]],

       [[0.08, 0.11, 0.08, 0.05, 0.19, 0.2 , 0.05, 0.12, 0.05, 0.07]],

       [[0.06, 0.11, 0.15, 0.02, 0.28, 0.13, 0.05, 0.1 , 0.04, 0.06]],

       [[0.1 , 0.04, 0.06, 0.04, 0.15, 0.09, 0.07, 0.13, 0.14, 0.18]],

       [[0.16, 0.12, 0.06, 0.02, 0.32, 0.1 , 0.05, 0.06, 0.05, 0.06]],

       [[0.12, 0.11, 0.03, 0.05, 0.24, 0.11, 0.06, 0.06, 0.07, 0.15]],

       [[0.28, 0.05, 0.14, 0.05, 0.1 , 0.07, 0.07, 0.1 , 0.08, 0.04]],

       [[0.21, 0.07, 0.08, 0.03, 0.18, 0.12, 0.1 , 0.07, 0.09, 0.05]],

       [[0.09, 0.06, 0.08, 0.06, 0.21, 0.14, 0.07, 0.11, 0.04, 0.15]],

       [[0.04, 0.08, 0.2 , 0.05, 0.21, 0.18, 0.04, 0.05, 0.09, 0.05]],

       [[0.08, 0.08, 0.09, 0.03, 0.19, 0.29, 0.03, 0.1 , 0.05, 0

This tells a very different story: apparently, when we activate dropout, the model is not sure anymore. It seems to prefer a class 9, but sometimes it hesitates with other classes 5 (sandal) & 7 (sneaker), which makes sense given they're all footwear. Once we average over the first dimensions, we get the following MC Dropout predictions:

In [39]:
np.round(y_proba[:1], 2)

array([[0.15, 0.1 , 0.09, 0.06, 0.17, 0.13, 0.06, 0.08, 0.07, 0.1 ]],
      dtype=float32)

The model still thinks this image belongs to class 9, but only with 62% confidence, which seems much more reasonablle than 99%. Plus it's useful to know exactly which other classes it thinks are likely. You can also take a look at the standard deviation of the probability estimates.

In [40]:
y_std = y_probas.std(axis = 0)
np.round(y_std[:1], 2)

array([[0.07, 0.05, 0.05, 0.03, 0.07, 0.05, 0.04, 0.04, 0.04, 0.06]],
      dtype=float32)

Apparently, there's quite a lot of variance in the probability estimates: if you were to build a risk-sensitive system (e.g., a medical or financial system), you should probably treat such an uncertain prediction with extreme caution. You definitely would not treat it like a 99% confident prediction. Moreover, the model's accuracy got a small boost in accuracy.

In [41]:
accuracy = np.sum(y_pred == y_test) / len(y_test)
accuracy

NameError: name 'y_pred' is not defined

If your model contains other layers that behave in a special way during training (such as `BatchNormalisation` layers), then you should not force training mode like we just did. Instead, you should replace the `Dropout` layers with the following `MCDropout` class:

In [42]:
class MCDropout(keras.layers.Dropout):
    def call(self, inputs):
        return super().call(inputs, training = True)

Here, we just subclass the `Dropout` layer & override the `call()` method to force its `training ` argument to `True`. Similarly, you could define an `MCAlphaDropout` class by subclassing `AlphaDropout` instead. If you are creating a model from scratch, it's just a matter of using `MCDropout` rather than `Dropout`. But if you have a model that was already trained using `Dropout`, you need to create a new model that's identical to the existing model except that it replaces the `Dropout` layers with `MCDropout`, then copy the existing model's weights to your new model.

In short, MC dropout is a fantastic technique that boosts dropout models & provides better uncertainty estimates. Of course, since it is just regular dropout during training, it also acts like a regulariser.

## Max-Norm Regularisation

Another regularisation technique that is popular for neural networks is called *max-norm regularisation*: for each neuron, it constrains the weights $w$ of the incoming connections such that $|| w ||_2 \leq r$, where $r$ is the max-norm hyperparameter & $|| . ||_2$ is the $l_2$ norm.

Max-norm regularisation does not add a regularisation loss term to the overall loss function. Instead, it is typically implemented by computing $||w||_2$ after each training step & rescaling $w$ if needed ($w \leftarrow w r/ ||w||_2$).

Reducing $r$ increases the amount of regularisation & helps reduce overfitting. Max-norm regularisation can also help alleviate the unstable gradients problems (if you are not using batch normalisation). 

To implement max-norm regularisation in Keras, set the `kernel_constraint` argument of each hidden layer to a `max_norm()` constraint with the appropriate max value, like so:

In [44]:
keras.layers.Dense(100, activation = "elu", kernel_initializer = "he_normal",
                   kernel_constraint = keras.constraints.max_norm(1.))

<keras.layers.core.dense.Dense at 0x7ff5a954aa90>

After each training iteration, the model's `fit()` method will call the object returned by `max_norm()` passing it the layer's weights. You can also define your own custom constraint function if necessary & use it as the `kernel_constraint`. You can also constrain the bias terms by setting the `bias_constraint` argument.

The `max_norm()` function has an `axis `argument that defaults to 0. A `Dense` layer ususally has weights of shape [*number of inputs*, *number of neurons*], so using `axis = 0` means that the max-norm constraint will apply independently to each neuron's weight vector. If you want to use max-norm with convolutional layers, make sure to set the `max_norm()` constraints` axis` argument appropriately (usually `axis = [0, 1, 2]`).

---

# Summary & Practical Guidelines

In this chapter, we have covered a wide range of techniques, & you may be wondering which ones you should use. This depends on the task, & there is no clear consensus yet, but here is a table configuration that will work fine in most cases, without requiring much hyperparameter tuning. That said, please do not consider these defaults as hard rules.

|Hyperparameter|Default Value|
|:---:|:---:|
|Kernel initialiser|HE initialisation|
|Activation function|ELU|
|Normalisation|None if shallow, Batch norm if deep|
|Regularisation|Early stopping (+ $l_2$ if needed)|
|Optimiser|Momentum optimisation (or RMSProp or Nadam)|
|Learning rate schedule|1cycle|

If the network is a simple stack of dense layers, then it can self-normalise, & you should use the table configuration below instead.

|Hyperparameter|Default value|
|:---:|:---:|
|Kernel initialiser|LeCun initialisation|
|Activation function|SELU|
|Normalisation|None (self-normalisation)|
|Regularisation|Alpha dropout if needed|
|Optimiser|Momentum optimisation (or RMSProp or Nadam)|
|Learning rate schedule|1cycle|

Don't forget to normalise the input features! You should also try to reuse parts of a pretrained neural network if you can find one that solves a similar problem, or use unsupervised pretraining if you have a lot of unlabeled data, or use pretraining on an auxiliary task if you have a lot of labeled data for a similar task.

While the previous guidlines should cover most cases, here are some exceptions:

* If you need a sparse model, you can use $l_1$ regularisation (& optionally zero out the tiny weights after training). If you need an even sparser model, you can use the TensorFlow model optimisation toolkit. This will break self-normalisation, so you should use the default configuration in this case.
* If you need a low-latency model (one that performs lightning-fast predictions), you may need to use fewer layers, fold the batch normalisation layers into previous layers, & possibly use a faster activation function such as leaky ReLU or just ReLU. Having a sparse model will also help. Finally, you may wantto reduce the float precision from 32 bits to 16 or even 8 bits. Again, check out the TensorFlow model optimisation toolkit
* If you are building a risk-sensitive application, or inference latency is not very important in your application, you can use MC dropout to boost performance & get more reliable probability estimates, along with uncertainty estimates.

With these guidelines, you are now ready to train very deep nets! I hope you are now convinced that you can go quite a long way using just keras. They may come a time, when you need to have even more control; for example, to write a custom loss function or to tweak the training algorithm. For such cases you will need to use TensorFlow's lower-level API.