# Chapter 11: Training Deep Neural Networks

In [None]:
import numpy as np
import tensorflow as tf
from tensorflow import keras

Some problems you may run into when training a deep neural network (DNN):
- Vanishing or exploding gradients: gradients grow smaller and smaller (or larger and larger) as they flow backward through the network, which makes the lower layers very hard to train.
- Not enough training data, or data that is too costly to label.
- Training may be extremely slow.
- A model with millions of parameters risks overfitting the training set.

## 11.1 The Vanishing/Exploding Gradients Problems

Backpropagation computes and propagates the error gradient of each layer from output to input. 

**Vanishing gradients problem** - Gradients get smaller and smaller as the algorithm progresses down to the lower layers. As a result, Gradient Descent leaves the lower layers' connection weights virtually unchanged, and training never converges to a good solution.

**Exploding gradients problem** - The opposite effect: gradients get bigger and bigger, so layers receive huge weight updates and the algorithm diverges.

> In general, deep neural networks suffer from unstable gradients; different layers may learn at widely different speeds.

Glorot and Bengio (2010) traced this issue to the combination of the logistic sigmoid activation function and weight initialization using a normal distribution ($\mu=0, \sigma = 1$).

- Initializing weights using a normal distribution:
    - The variance (the spread) of the outputs of each layer is much greater than the variance of its inputs.
    - Going forward in the network, the variance keeps increasing after each layer until the activation function saturates (its inputs end up far to the left or right) at the top layers.

- Logistic sigmoid activation function:
    - Because the variance keeps increasing, the inputs become large (negative or positive), so the outputs saturate at 0 or 1 and the derivative is extremely close to 0.
    - So backpropagation has virtually no error gradient to propagate back to the lower layers.

### 11.1.1 Glorot and He Initialization

For the signal to flow properly, we need the variance of the outputs of each layer to be equal to the variance of its inputs, and we need the gradients to have equal variance before and after flowing through a layer in the reverse direction.

> Microphone amplifier analogy: with the knob set too close to 0 your voice is inaudible, and with it set too close to the max your voice is saturated. In a chain of amplifiers, every one must be set properly for your voice to come out loud and clear at the end of the chain.

> Your voice has to come out of each amplifier at the same amplitude as it came in.

It's impossible to guarantee both (output and gradient variances) unless the layer has an equal number of inputs and neurons (its *fan-in* and *fan-out*). Glorot and Bengio proposed a good compromise: initialize the connection weights randomly as described in *Equation 11-1*, where $\text{fan}_\text{avg} = (\text{fan}_\text{in} + \text{fan}_\text{out})/2$. This is called **Xavier initialization** or **Glorot initialization**, and it is the strategy to use with the logistic activation function.

**LeCun initialization** - Uses $\text{fan}_\text{in}$ in place of $\text{fan}_\text{avg}$, so it is equivalent to Glorot initialization when $\text{fan}_\text{in} = \text{fan}_\text{out}$.

**He initialization** - The initialization strategy for the ReLU activation function (and its variants).
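
As a rough sketch of how the three strategies differ, the plain-Python helper below computes the weight variance each one targets. It assumes the usual formulas, which these notes do not spell out: $\sigma^2 = 1/\text{fan}_\text{avg}$ for Glorot, $1/\text{fan}_\text{in}$ for LeCun, and $2/\text{fan}_\text{in}$ for He, with a uniform range of $\pm\sqrt{3\sigma^2}$. The name `init_scale` is made up for illustration and is not a Keras function.

```python
import math

def init_scale(fan_in, fan_out, strategy="glorot"):
    """Return (sigma, r): the std for a normal draw and the limit for a uniform draw."""
    fan_avg = (fan_in + fan_out) / 2
    if strategy == "glorot":
        variance = 1 / fan_avg
    elif strategy == "lecun":
        variance = 1 / fan_in
    elif strategy == "he":
        variance = 2 / fan_in
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    sigma = math.sqrt(variance)
    r = math.sqrt(3 * variance)  # a uniform distribution on [-r, r] has variance r**2 / 3
    return sigma, r

# Compare the three strategies for a layer with 784 inputs and 300 neurons
for name in ("glorot", "lecun", "he"):
    sigma, r = init_scale(784, 300, name)
    print(f"{name:7s} sigma={sigma:.4f} r={r:.4f}")
```

With `fan_in == fan_out` the Glorot and LeCun values coincide, which is the equivalence mentioned above.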

In [None]:
# Default is Glorot with uniform distribution 
# Change to He initialization
keras.layers.Dense(10, activation="relu", kernel_initializer="he_normal")

# He initialization with uniform distribution based on fan_avg rather than fan_in
# Use VarianceScaling
he_avg_init = keras.initializers.VarianceScaling(scale=2, mode='fan_avg',
                                                 distribution='uniform')
keras.layers.Dense(10, activation="sigmoid", kernel_initializer=he_avg_init)

<tensorflow.python.keras.layers.core.Dense at 0x7f6a37de7d10>

### 11.1.2 Nonsaturating Activation Functions

**Dying ReLUs** - During training, some neurons effectively "die," meaning they stop outputting anything other than 0.

A neuron dies when its weights get tweaked in such a way that the weighted sum of its inputs is negative for all instances in the training set. When this happens, it just keeps outputting 0s, and Gradient Descent does not affect it anymore because the gradient of the ReLU function is 0 when its input is negative.

To solve this problem, use **leaky ReLU** where $\text{LeakyReLU}_\alpha(z) = \text{max}(\alpha z, z) $. The hyperparameter $\alpha$ defines how much the function "leaks": it is the slope of the function for $z<0$ and is typically set to $0.01$.

> Leaky ReLU is just like ReLU, but with a small slope for negative values. A small slope ensures that leaky ReLUs never die.

**Exponential linear unit (ELU)** - Outperformed all the ReLU variants in its authors' experiments. It looks a lot like the ReLU function, with some major differences:
- Takes on negative values when $z<0$, which allows an average output closer to 0 and helps alleviate the vanishing gradients problem.
- The hyperparameter $\alpha$ sets the value the function approaches when $z$ is a large negative number: the output tends to $-\alpha$ ($\alpha$ is usually set to 1).
- Has nonzero gradient for $z<0$, avoiding the dead neurons problem.
- If $\alpha = 1$, then the function is smooth everywhere including around $z=0$, speeding up Gradient Descent.

The main drawback of the ELU function is that it is slower to compute than the ReLU function and its variants (due to the exponential function).
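
To make the two definitions concrete, here is a minimal scalar sketch in plain Python. The ELU formula used, $\alpha(e^z - 1)$ for $z < 0$ and $z$ otherwise, is the standard one and is assumed here since these notes do not write it out.

```python
import math

def leaky_relu(z, alpha=0.01):
    # max(alpha * z, z): slope alpha for z < 0, slope 1 for z >= 0 (assumes 0 < alpha < 1)
    return max(alpha * z, z)

def elu(z, alpha=1.0):
    # Smoothly approaches -alpha for large negative z, identity for z >= 0
    return z if z >= 0 else alpha * (math.exp(z) - 1)

for z in (-5.0, -1.0, 0.0, 2.0):
    print(z, leaky_relu(z), round(elu(z), 4))
```

Both functions keep a nonzero slope for negative inputs, which is what lets a neuron recover instead of dying.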

**Scaled ELU (SELU)** - A scaled variant of the ELU activation function. The network will *self-normalize* (the output of each layer tends to preserve $\mu =0, \sigma =1$ during training), but only under specific conditions:
- Input features must be standardized ($\mu =0, \sigma =1$).
- Every hidden layer's weights must be initialized with LeCun normal initialization.
- Network's architecture must be sequential.

> In general, SELU > ELU > leaky ReLU (and its variants) > ReLU > tanh > logistic.

> Because ReLU is the most used activation function, many libraries and hardware accelerators provide ReLU-specific optimizations; if speed is your priority, ReLU might be the best choice.

In [None]:
# On fashion MNIST as example

# For leaky ReLU
model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),                  # FROM TEXTBOOK NOTEBOOK 
    keras.layers.Dense(300, kernel_initializer="he_normal"),     # FROM TEXTBOOK NOTEBOOK 
    keras.layers.LeakyReLU(alpha=0.2),                           # Activation function after each layer
    keras.layers.Dense(10, activation="softmax")                 # Output layer: one probability per class
])

# For PReLU
model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),                  # FROM TEXTBOOK NOTEBOOK 
    keras.layers.Dense(300, kernel_initializer="he_normal"),     # FROM TEXTBOOK NOTEBOOK 
    keras.layers.PReLU(),                                        # Activation function after each layer
    keras.layers.Dense(10, activation="softmax")                 # Output layer: one probability per class
])

# For SELU
layer = keras.layers.Dense(10, activation="selu",
                           kernel_initializer="lecun_normal")

### 11.1.3 Batch Normalization

Although a good initialization strategy and a nonsaturating activation function reduce the vanishing/exploding gradients problems at the beginning of training, they don't guarantee that the problems won't come back during training.

**Batch Normalization (BN)** - Adds an operation to the model just before or after the activation function of each hidden layer. This operation zero-centers and normalizes each input over the current mini-batch, then scales and shifts the result using two new parameter vectors per layer: one for scaling, the other for shifting.

> If you add a BN layer as the very first layer, you do not need to standardize your training set (e.g., using `StandardScaler`).

At test time you may need to make a prediction for a single instance rather than a batch, so how do you compute the mean and standard deviation?

The solution is to estimate "final" statistics during training, using a moving average of the layer's input means and standard deviations, and to use those instead of the batch statistics when making predictions.

Four parameter vectors are learned in each batch-normalized layer ($\mathbf{\gamma}$ and $\mathbf{\beta}$ through regular backpropagation, $\mathbf{\mu}$ and $\mathbf{\sigma}$ through an exponential moving average):
- $\mathbf{\gamma}$, the output scale vector
- $\mathbf{\beta}$, the output offset vector
- $\mathbf{\mu}$, the final input mean vector
- $\mathbf{\sigma}$, the final input standard deviation vector

> $\mathbf{\mu}$ and $\mathbf{\sigma}$ are estimated during training, but only used after training (to replace batch input means and standard deviation).
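
The training and inference behaviors can be sketched for a single input feature in plain Python. This is an illustration under stated assumptions, not the Keras implementation: the function names are hypothetical, the moving statistic tracked is the variance (mirroring the `moving_variance` variable Keras exposes), and `eps` is a small smoothing term added to avoid division by zero.

```python
import math

def bn_train_step(batch, gamma, beta, moving_mean, moving_var,
                  momentum=0.99, eps=1e-3):
    """Normalize one feature over a mini-batch and update the moving statistics."""
    n = len(batch)
    mean = sum(batch) / n
    var = sum((x - mean) ** 2 for x in batch) / n
    # Zero-center and normalize with the batch statistics, then scale and shift
    out = [gamma * (x - mean) / math.sqrt(var + eps) + beta for x in batch]
    # Exponential moving averages: these become the "final" statistics
    moving_mean = momentum * moving_mean + (1 - momentum) * mean
    moving_var = momentum * moving_var + (1 - momentum) * var
    return out, moving_mean, moving_var

def bn_inference(x, gamma, beta, moving_mean, moving_var, eps=1e-3):
    """After training, use the moving statistics instead of batch statistics."""
    return gamma * (x - moving_mean) / math.sqrt(moving_var + eps) + beta

out, mm, mv = bn_train_step([1.0, 2.0, 3.0, 4.0], gamma=1.0, beta=0.0,
                            moving_mean=0.0, moving_var=1.0, momentum=0.9)
print(out, mm, mv)
print(bn_inference(2.5, 1.0, 0.0, mm, mv))
```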

Batch Normalization acts like a regularizer, reducing the need for other regularization techniques.

There is a runtime penalty: the neural network makes slower predictions due to the extra computations required at each layer. Fortunately, it's often possible to fuse the BN layer with the previous layer, after training, thereby avoiding the runtime penalty.
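
Fusing works because, once training is over, a BN layer is just a fixed affine transformation. Below is a minimal sketch for one neuron, assuming the BN layer sits right after a linear layer $z = \mathbf{w} \cdot \mathbf{x} + b$ and that $\sigma$ already includes any smoothing term. `fuse_bn` is a hypothetical helper, not a Keras API.

```python
def fuse_bn(weights, bias, gamma, beta, mu, sigma):
    """Fold gamma * (z - mu) / sigma + beta into the preceding layer's weights and bias."""
    scale = gamma / sigma
    fused_weights = [scale * w for w in weights]
    fused_bias = scale * (bias - mu) + beta
    return fused_weights, fused_bias

# Check on one example that the fused layer matches layer + BN
w, b = [0.5, -1.2], 0.3
gamma, beta, mu, sigma = 1.5, -0.2, 0.4, 2.0
x = [2.0, 1.0]

z = sum(wi * xi for wi, xi in zip(w, x)) + b
layer_then_bn = gamma * (z - mu) / sigma + beta

fw, fb = fuse_bn(w, b, gamma, beta, mu, sigma)
fused = sum(wi * xi for wi, xi in zip(fw, x)) + fb
print(layer_then_bn, fused)
```

The two printed values agree up to floating-point rounding, so the fused layer needs no separate BN computation at prediction time.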

#### Implementing Batch Normalization with Keras

Just add a `BatchNormalization` layer before or after each hidden layer's activation function.

For example, this model applies BN after every hidden layer and as the first layer in the model (after flattening the input images).

In [None]:
model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(300, activation="elu", kernel_initializer="he_normal"),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(100, activation="elu", kernel_initializer="he_normal"),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(10, activation="softmax")
])

model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
flatten (Flatten)            (None, 784)               0         
_________________________________________________________________
batch_normalization (BatchNo (None, 784)               3136      
_________________________________________________________________
dense (Dense)                (None, 300)               235500    
_________________________________________________________________
batch_normalization_1 (Batch (None, 300)               1200      
_________________________________________________________________
dense_1 (Dense)              (None, 100)               30100     
_________________________________________________________________
batch_normalization_2 (Batch (None, 100)               400       
_________________________________________________________________
dense_2 (Dense)              (None, 10)                1010      

Each BN layer adds four parameters per input: $\gamma, \beta, \mu, \sigma$. So the first BN layer has $4 \times 784 = 3,136$ parameters.

$\mu$ and $\sigma$ are moving averages that backpropagation does not touch, so Keras calls them "non-trainable". They make up half of the BN parameters, so the model has $(3,136 + 1,200 + 400) / 2 = 2,368$ non-trainable parameters in total.
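
The parameter counts in the summary can be reproduced with a few lines of arithmetic. This sketch assumes only the layer sizes shown in the model above (784, 300, 100, and 10 units).

```python
# Number of inputs seen by each of the three BN layers
bn_inputs = [784, 300, 100]
bn_params = [4 * n for n in bn_inputs]      # gamma, beta, mu, sigma per input
non_trainable = sum(bn_params) // 2         # mu and sigma only

# Dense layers: one weight per (input, neuron) pair plus one bias per neuron
dense_shapes = [(784, 300), (300, 100), (100, 10)]
dense_params = [n_in * n_out + n_out for n_in, n_out in dense_shapes]

total = sum(bn_params) + sum(dense_params)
print(bn_params, dense_params)
print("total:", total, "non-trainable:", non_trainable,
      "trainable:", total - non_trainable)
```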

In [None]:
[(var.name, var.trainable) for var in model.layers[1].variables]

[('batch_normalization/gamma:0', True),
 ('batch_normalization/beta:0', True),
 ('batch_normalization/moving_mean:0', False),
 ('batch_normalization/moving_variance:0', False)]

In [None]:
# The layer's `updates` attribute is deprecated (it returns an empty list here)
model.layers[1].updates



[]

In [None]:
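# To add BN layers before the activation functions:
# - Remove the activation function from the hidden layers
# - Add the activations as separate layers after the BN layers
# - Drop the bias term of the preceding Dense layer (use_bias=False),
#   since the BN layer already includes one offset parameter per input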

model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(300, kernel_initializer="he_normal", use_bias=False),
    keras.layers.BatchNormalization(),
    keras.layers.Activation("elu"),
    keras.layers.Dense(100, kernel_initializer="he_normal", use_bias=False),
    keras.layers.BatchNormalization(),
    keras.layers.Activation("elu"),
    keras.layers.Dense(10, activation="softmax")
])

`BatchNormalization` class hyperparameters:
- `momentum`: Used when updating the exponential moving averages. A good momentum value is typically close to 1 (for example 0.9, 0.99, or 0.999).
- `axis`: Determines which axis should be normalized. Defaults to -1, normalizing the last axis (using the means and standard deviations computed across the other axes).

> Note: BN uses batch statistics during training and the "final" statistics after training (ie. the final values of the moving averages).

### 11.1.4 Gradient Clipping

**Gradient Clipping** - Clip the gradients during backpropagation so that they never exceed some threshold in order to mitigate the exploding gradients problem.

In [None]:
# Set clipvalue or clipnorm for Gradient Clipping
optimizer = keras.optimizers.SGD(clipvalue=1.0)
model.compile(loss="mse", optimizer=optimizer)

This optimizer will clip every component of the gradient vector to a value between -1.0 and 1.0.

If you want to ensure that Gradient Clipping does not change the direction of the gradient vector, you should clip by norm by setting `clipnorm` instead of `clipvalue`.
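
The difference between the two options is easy to see on a small gradient vector. The functions below are a plain-Python sketch of the idea, with hypothetical names; they are not the code Keras runs internally.

```python
import math

def clip_by_value(grad, clip):
    # Clip every component independently to [-clip, clip]
    return [max(-clip, min(clip, g)) for g in grad]

def clip_by_norm(grad, max_norm):
    # Rescale the whole vector if its l2 norm exceeds max_norm
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= max_norm:
        return list(grad)
    return [g * max_norm / norm for g in grad]

grad = [0.9, 100.0]
print(clip_by_value(grad, 1.0))  # [0.9, 1.0]: now points mostly along the diagonal
print(clip_by_norm(grad, 1.0))   # still points almost entirely along the second axis
```

Clipping by value changes the orientation of the vector because only the large component is cut; clipping by norm preserves the direction but nearly eliminates the first component.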

## 11.2 Reusing Pretrained Layers

**Transfer learning** - Finding an existing neural network that accomplishes a similar task to the one you are trying to tackle and reusing the lower layers of this network.

> Note: Transfer learning will work best when the inputs have similar low-level features.

> Note: The more similar the tasks are, the more layers you want to reuse (starting with the lower layers). For very similar tasks, try keeping all the hidden layers and just replacing the output layer.

Freeze all the reused layers first (i.e., make their weights non-trainable so that Gradient Descent won't modify them). Then try unfreezing one or two of the top hidden layers at a time and see if performance improves.

It is also useful to reduce the learning rate when you unfreeze reused layers, to avoid wrecking their fine-tuned weights.

### 11.2.1 Transfer Learning with Keras

Suppose the Fashion MNIST dataset only contained eight classes: all the classes except sandal and shirt. Someone has already built and trained a Keras model on that set (model A), reaching better than 90% accuracy.

We now want to train a binary classifier (positive = shirt, negative = sandal) from very little labeled data, so we build a model (model B) that uses transfer learning from model A.

First, we split Fashion MNIST into two datasets, each with its own training, validation, and test sets:
- `X_train_A`: the "Model A" dataset, with all images except sandals and shirts
- `X_train_B`: the "Model B" dataset, with just the first 200 images of sandals or shirts

In [None]:
# FROM TEXTBOOK NOTEBOOK
# Import Fashion MNIST dataset and split into train, valid, test sets

(X_train_full, y_train_full), (X_test, y_test) = keras.datasets.fashion_mnist.load_data()
X_train_full = X_train_full / 255.0
X_test = X_test / 255.0
X_valid, X_train = X_train_full[:5000], X_train_full[5000:]
y_valid, y_train = y_train_full[:5000], y_train_full[5000:]

In [None]:
# FROM TEXTBOOK NOTEBOOK
# Split into "Model A" dataset: all images except sandals or shirts
# Split into "Model B" dataset: 200 images of sandals or shirts

def split_dataset(X, y):
    y_5_or_6 = (y == 5) | (y == 6) # sandals or shirts
    y_A = y[~y_5_or_6]
    y_A[y_A > 6] -= 2 # class indices 7, 8, 9 should be moved to 5, 6, 7
    y_B = (y[y_5_or_6] == 6).astype(np.float32) # binary classification task: is it a shirt (class 6)?
    return ((X[~y_5_or_6], y_A),
            (X[y_5_or_6], y_B))

(X_train_A, y_train_A), (X_train_B, y_train_B) = split_dataset(X_train, y_train)
(X_valid_A, y_valid_A), (X_valid_B, y_valid_B) = split_dataset(X_valid, y_valid)
(X_test_A, y_test_A), (X_test_B, y_test_B) = split_dataset(X_test, y_test)
X_train_B = X_train_B[:200]
y_train_B = y_train_B[:200]

In [None]:
# FROM TEXTBOOK NOTEBOOK
# Build Model A neural network

model_A = keras.models.Sequential()
model_A.add(keras.layers.Flatten(input_shape=[28, 28]))
for n_hidden in (300, 100, 50, 50, 50):
    model_A.add(keras.layers.Dense(n_hidden, activation="selu"))
model_A.add(keras.layers.Dense(8, activation="softmax"))

model_A.compile(loss="sparse_categorical_crossentropy",
                optimizer=keras.optimizers.SGD(learning_rate=1e-3),
                metrics=["accuracy"])

history = model_A.fit(X_train_A, y_train_A, epochs=20,
                    validation_data=(X_valid_A, y_valid_A))

model_A.save("my_model_A.h5")


In [None]:
model_A = keras.models.load_model("my_model_A.h5")
model_B_on_A = keras.models.Sequential(model_A.layers[:-1])     # Use all layers except output layer
model_B_on_A.add(keras.layers.Dense(1, activation="sigmoid"))   # Model B output layer

> Note: `model_A` and `model_B_on_A` now share some layers. When you train `model_B_on_A`, `model_A` is also affected. **Clone** `model_A` before you reuse its layers with `clone_model()` and copy its weights.

In [None]:
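# Clone Model A and copy its weights (clone_model() copies the architecture only).
# To leave model_A untouched, build model_B_on_A from model_A_clone.layers[:-1] instead.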

model_A_clone = keras.models.clone_model(model_A)
model_A_clone.set_weights(model_A.get_weights())

Since the new output layer is initialized randomly, it will make large errors during the first few epochs, producing large error gradients that may wreck the reused weights. To avoid this, freeze the reused layers at first, giving the new layer some time to learn reasonable weights.

In [None]:
# Freeze the reused layers
for layer in model_B_on_A.layers[:-1]:
    layer.trainable = False

# Must compile the model after freezing or unfreezing layers
model_B_on_A.compile(loss="binary_crossentropy", optimizer="sgd",
                     metrics=["accuracy"])

In [None]:
# Train for a few epochs
history = model_B_on_A.fit(X_train_B, y_train_B, epochs=4,
                           validation_data=(X_valid_B, y_valid_B))
# Unfreeze the reused layers
for layer in model_B_on_A.layers[:-1]:
    layer.trainable = True
# Decrease learning rate
optimizer = keras.optimizers.SGD(learning_rate=1e-4) # the default learning rate is 1e-2
# Compile model again
model_B_on_A.compile(loss="binary_crossentropy", optimizer=optimizer,
                     metrics=["accuracy"])
# Train model
history = model_B_on_A.fit(X_train_B, y_train_B, epochs=16,
                           validation_data=(X_valid_B, y_valid_B))


In [None]:
model_B_on_A.evaluate(X_test_B, y_test_B)



[0.056174881756305695, 0.987500011920929]

This result looks impressive, but if you change the classes or the random seed, you will see that the improvement generally drops, vanishes, or even reverses. This is a case of "torturing the data until it confesses": many configurations were tried and only the best result was reported, so it does not hold in general.

It turns out that transfer learning does not work very well with small dense networks, presumably because small networks learn few patterns, and dense networks learn very specific patterns.

Transfer learning works best with deep convolutional neural networks, which tend to learn feature detectors that are much more general (especially in the lower layers).

### 11.2.2 Unsupervised Pretraining

Suppose you want to tackle a complex task for which you don't have much labeled training data. If you can gather plenty of unlabeled training data, you can try to use it to train an unsupervised model, such as an autoencoder or a generative adversarial network (GAN). Then you can reuse the lower layers, add the output layer for your task on top, and fine-tune the final network using supervised learning.

**Greedy layer-wise pretraining** - Used in early days of Deep Learning.
1. First train an unsupervised model with a single layer (typically restricted Boltzmann machines, RBMs).
2. Then freeze that layer and add another one on top of it.
3. Train the model again (effectively just training the new layer).
4. Repeat layer by layer.

Today, people generally train the full unsupervised model in one shot and use autoencoders or GANs rather than RBMs.

### 11.2.3 Pretraining on an Auxiliary Task

One last option is to train a first neural network on an auxiliary task for which you can easily obtain or generate labeled training data, then reuse the lower layers of that network for your actual task. The first network's lower layers will learn feature detectors that will likely be reusable by the second network.

For example, you want to build a system to recognize faces but have only a few pictures of each individual.
1. You can gather a lot of pictures of random people and train a first neural network to detect whether or not two different pictures feature the same person.
2. Reuse its lower layers, which will have learned good feature detectors for faces, to train a good face classifier that needs little training data.

For **natural language processing (NLP)**, you can download millions of text documents and automatically generate labeled data from them.
1. Mask out some words and train a model to predict what the missing words are.
2. If it performs well, it means it already knows a lot about language.
3. You can reuse it for the actual task.

## 11.3 Faster Optimizers

### 11.3.1 Momentum Optimization

**Momentum optimization** - Imagine a bowling ball rolling down a gentle slope on a smooth surface: it will start out slowly, but it will quickly pick up momentum until it eventually reaches terminal velocity.

In contrast, regular Gradient Descent simply takes small, regular steps down the slope.

Momentum optimization cares a great deal about what previous gradients were: at each iteration, it subtracts the local gradient (multiplied by the learning rate $\eta$) from the **momentum vector** $\mathbf{m}$, then updates the weights by adding this momentum vector. The gradient is therefore used as an acceleration, not as a speed.

The hyperparameter $\beta$, called the **momentum**, acts as a sort of friction mechanism that prevents the momentum from growing too large. It is set between 0 (high friction) and 1 (no friction); a typical value is 0.9.

In deep neural networks that don't use Batch Normalization, the upper layers often end up with inputs on very different scales, so momentum optimization helps a lot. It can also help roll past local optima.

> Note: Due to the momentum, the optimizer may overshoot a bit, then come back, overshoot again, and oscillate many times before stabilizing at a minimum. Having a bit of friction gets rid of these oscillations and speeds up convergence.
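
The update described above can be written as $\mathbf{m} \leftarrow \beta \mathbf{m} - \eta \nabla_\theta J(\theta)$ followed by $\theta \leftarrow \theta + \mathbf{m}$. Here is a minimal scalar sketch in plain Python; `momentum_step` is a hypothetical name, and the constant gradient is an artificial setup chosen to show the terminal velocity.

```python
def momentum_step(theta, m, grad, lr=0.1, beta=0.9):
    m = beta * m - lr * grad   # the gradient acts as an acceleration
    theta = theta + m          # the momentum vector acts as the velocity
    return theta, m

# With a constant gradient, the velocity converges to a terminal value
theta, m = 0.0, 0.0
for _ in range(500):
    theta, m = momentum_step(theta, m, grad=1.0)
print(m)  # approaches -lr / (1 - beta) = -1.0, i.e. 10 times the plain Gradient Descent step
```

With $\beta = 0.9$ the terminal step is 10 times the plain Gradient Descent step, which is why momentum escapes plateaus much faster.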

In [None]:
optimizer = keras.optimizers.SGD(learning_rate=0.001, momentum=0.9)

### 11.3.2 Nesterov Accelerated Gradient

**Nesterov Accelerated Gradient (NAG)** (also called *Nesterov momentum optimization*) - Measures the gradient of the cost function not at the local position $\mathbf{\theta}$ but slightly ahead in the direction of the momentum, at $\mathbf{\theta} + \beta \mathbf{m}$.

This small tweak works because in general the momentum vector will be pointing in the right direction (ie. toward the optimum), so it will be slightly more accurate to use the gradient measured a bit further in that direction rather than the gradient at the original position.

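> Note: When the momentum pushes the weights across a valley, regular momentum keeps pushing further across, while the look-ahead gradient used by NAG pushes back toward the bottom of the valley. This reduces oscillations, so NAG generally converges faster than regular momentum optimization.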
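
The only change from regular momentum is where the gradient is evaluated. A minimal scalar sketch, assuming the toy cost function $J(\theta) = \theta^2$ (chosen here for illustration) and a hypothetical `nag_step` helper:

```python
def nag_step(theta, m, grad_fn, lr=0.1, beta=0.9):
    # Evaluate the gradient slightly ahead, at theta + beta * m
    m = beta * m - lr * grad_fn(theta + beta * m)
    return theta + m, m

def grad_fn(theta):
    return 2 * theta  # gradient of J(theta) = theta ** 2

theta, m = 1.0, 0.0
for step in range(200):
    theta, m = nag_step(theta, m, grad_fn)
print(theta)  # very close to the minimum at 0
```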

In [None]:
optimizer = keras.optimizers.SGD(learning_rate=0.001, momentum=0.9, nesterov=True)

### 11.3.3 AdaGrad

> Recall: For an elongated bowl problem, Gradient Descent would go down the steepest slope, which does not point straight toward the global optimum, then it very slowly goes down to the bottom of the valley.

**AdaGrad** algorithm - Corrects its direction to point more toward the global optimum by scaling down the gradient vector along the steepest dimensions. The steps are:

1. Accumulate the squares of the gradients into the vector $\mathbf{s}$. Each $s_i$ accumulates the squares of the partial derivative of the cost function with regard to parameter $\theta_i$. If the cost function is steep along the $i^{th}$ dimension, then $s_i$ will get larger and larger at each iteration.

2. Update the parameters almost exactly as Gradient Descent does, except that the gradient vector is scaled down (element-wise division) by $\sqrt{\mathbf{s} + \epsilon}$, where $\epsilon$ is a smoothing term that avoids division by zero.

**Adaptive learning rate** - Decays the learning rate such that steeper dimensions decay faster than dimensions with gentler slopes.

AdaGrad often stops too early when training neural networks: the learning rate gets scaled down so much that the algorithm ends up stopping entirely before reaching the global optimum.
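
The two steps can be sketched in plain Python as follows. The function name and the constant two-dimensional gradient are assumptions made for illustration; the point is that both dimensions move by the same amount despite very different slopes, and that the effective step keeps shrinking.

```python
import math

def adagrad_step(theta, s, grad, lr=0.1, eps=1e-10):
    # Step 1: accumulate the squared gradients, one entry per parameter
    s = [si + g * g for si, g in zip(s, grad)]
    # Step 2: Gradient Descent step, scaled down element-wise by sqrt(s + eps)
    theta = [t - lr * g / math.sqrt(si + eps)
             for t, g, si in zip(theta, grad, s)]
    return theta, s

theta, s = [0.0, 0.0], [0.0, 0.0]
grad = [10.0, 0.1]  # steep along the first dimension, gentle along the second
for step in range(1, 5):
    previous = theta
    theta, s = adagrad_step(theta, s, grad)
    print(step, [round(p - t, 4) for p, t in zip(previous, theta)])
```

With a constant gradient, the step at iteration $n$ is the learning rate divided by $\sqrt{n}$, which illustrates why training can stall.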

> Note: You **should not** use AdaGrad to train deep neural networks (may be efficient for simpler tasks such as Linear Regression).

### 11.3.4 RMSProp

**RMSProp** algorithm - Fixes AdaGrad's problem of slowing down too fast and never converging to the global optimum by accumulating only the gradients from the most recent iterations (as opposed to all the gradients since the beginning of training). It does so by using exponential decay in the first step.

> Note: The decay rate $\beta$ is typically set to 0.9. This default often works well, so you may not need to tune it at all.

In [None]:
optimizer = keras.optimizers.RMSprop(learning_rate=0.001, rho=0.9) # rho corresponds to beta

> Note: Except on very simple problems, RMSProp almost always performs much better than AdaGrad.
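
A scalar sketch of the RMSProp update, assuming the usual form $s \leftarrow \beta s + (1 - \beta) g^2$ for the first step. The function name is hypothetical and the constant gradient is an artificial setup.

```python
import math

def rmsprop_step(theta, s, grad, lr=0.001, beta=0.9, eps=1e-7):
    # Step 1: exponentially decaying average of the squared gradients
    s = beta * s + (1 - beta) * grad * grad
    # Step 2: same scaled update as AdaGrad
    theta = theta - lr * grad / math.sqrt(s + eps)
    return theta, s

theta, s = 0.0, 0.0
for _ in range(200):
    previous = theta
    theta, s = rmsprop_step(theta, s, grad=2.0)
print(s, previous - theta)
```

Under a constant gradient, `s` settles near the squared gradient and the step size settles near the learning rate instead of shrinking toward zero as it does with AdaGrad.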

### 11.3.5 Adam and Nadam Optimization

**Adam (adaptive moment estimation)** - Combines the ideas of momentum optimization and RMSProp: it keeps track of an exponentially decaying average of past gradients (momentum optimization) and squared gradients (RMSProp).

> Note: The two averages estimate the mean and the (uncentered) variance of the gradients. The mean is often called the *first moment* and the variance the *second moment*, hence the name of the algorithm.

> Note: Refer to *Equation 11-8. Adam algorithm* in book.

Steps 1, 2, and 5 are very similar to both momentum optimization and RMSProp, except that step 1 computes an exponentially decaying average rather than an exponentially decaying sum.

Steps 3 and 4: since $\mathbf{m}$ and $\mathbf{s}$ are initialized at 0, they will be biased toward 0 at the beginning of training, so these two steps help boost $\mathbf{m}$ and $\mathbf{s}$ at the beginning of training.

The momentum decay hyperparameter $\beta_1$ is typically initialized to 0.9 and the scaling decay hyperparameter $\beta_2$ to 0.999.
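
Since the equation itself is not reproduced in these notes, here is a scalar sketch of the five steps as they are commonly written. It is an assumption that this numbering matches Equation 11-8; `adam_step` is a hypothetical helper, `t` is the iteration number starting at 1, and the smoothing term is placed inside the square root as in the AdaGrad description above.

```python
import math

def adam_step(theta, m, s, grad, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-7):
    m = beta1 * m - (1 - beta1) * grad           # step 1: decaying average of gradients
    s = beta2 * s + (1 - beta2) * grad * grad    # step 2: decaying average of squared gradients
    m_hat = m / (1 - beta1 ** t)                 # step 3: bias correction for m
    s_hat = s / (1 - beta2 ** t)                 # step 4: bias correction for s
    theta = theta + lr * m_hat / math.sqrt(s_hat + eps)  # step 5: parameter update
    return theta, m, s

# One step on the toy cost function J(theta) = theta ** 2, starting from theta = 1
theta, m, s = adam_step(theta=1.0, m=0.0, s=0.0, grad=2.0, t=1)
print(theta, m, s)
```

Thanks to the bias correction, the very first update already has a magnitude close to the learning rate, even though `m` and `s` start at 0.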

In [None]:
optimizer = keras.optimizers.Adam(learning_rate=0.001, beta_1=0.9, beta_2=0.999)

> Note: You can often use the default learning rate value, $\eta=0.001$ since adaptive learning rate algorithms {Adam, AdaGrad, RMSProp} require less tuning of the learning rate, making Adam even easier to use than Gradient Descent.

Two variants of Adam:
- **AdaMax**: Adam scales down the parameter updates by the $\ell_2$ norm of the time-decayed gradients ("$\sqrt{\text{sum of squares}}$"); AdaMax replaces it with the $\ell_\infty$ norm (the max).
    - Replaces step 2 with $\mathbf{s} \leftarrow \max(\beta_2 \mathbf{s}, |\nabla_\theta J(\theta)|)$
    - Drops step 4
    - In step 5, scales down the gradient updates by a factor of $\mathbf{s}$, the max of the time-decayed gradients

- **Nadam**: Adam optimization + the Nesterov trick, so it will converge slightly faster than Adam.

> Note: Adaptive optimization methods {RMSProp, Adam, and Nadam optimization} are often great, converging fast to a good solution, but they can generalize poorly on some datasets.

> #### Training Sparse Models

>> All the optimization algorithms presented produce dense models, meaning that most parameters will be nonzero. If you need a fast model at runtime or need to take up less memory, you may prefer to end up with a sparse model instead.

>> One option is to apply strong $\ell_1$ regularization during training, as it pushes the optimizer to zero out as many weights as it can. This usually works much better than simply setting tiny weights to 0 after training.

### 11.3.6 Learning Rate Scheduling

If you start with a large learning rate and then reduce it once training stops making fast progress, you can reach a good solution faster than with the optimal constant learning rate.

The most commonly used learning schedules are:

**Power scheduling**: 
- Set the learning rate to a function of the iteration number $t$: $ \eta(t) = \eta_0 / (1 + t/s)^c $.
- Requires tuning the hyperparameters $\eta_0$, $s$, and $c$ ($c$ is typically set to 1).
- After $s$ steps the learning rate is down to $\eta_0/2$, after $s$ more steps it is down to $\eta_0/3$, and so on.
- The schedule first drops quickly, then more and more slowly.

**Exponential scheduling**:
- Set the learning rate to $ \eta(t) = \eta_0 \cdot 0.1^{t/s}$.
- The learning rate will gradually drop by a factor of 10 every $s$ steps.

**Piecewise constant scheduling**:
- Use a constant learning rate for a number of epochs (eg. $\eta_0 = 0.1$ for 5 epochs).
- Then a smaller learning rate for another number of epochs (eg. $\eta_1 = 0.001$ for 50 epochs) and so on.
- Requires fiddling around to figure out right sequence of learning rates and how long to use each of them.

**Performance scheduling**:
- Measure the validation error every $N$ steps (just like for early stopping).
- Reduce the learning rate by a factor of $\lambda$ when the error stops dropping.

**1cycle scheduling**:
- Increase the initial learning rate $\eta_0$ linearly up to $\eta_1$ halfway through training.
- Then decrease the learning rate linearly down to $\eta_0$ during the second half of training.
- Finish the last few epochs by dropping the rate down by several orders of magnitude (still linearly).
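
To compare how quickly the first two schedules decay, the formulas above can be evaluated directly. This is a plain-Python sketch with hypothetical function names and arbitrary values for $\eta_0$ and $s$.

```python
def power_schedule(t, lr0=0.01, s=1000, c=1):
    return lr0 / (1 + t / s) ** c

def exponential_schedule(t, lr0=0.01, s=1000):
    return lr0 * 0.1 ** (t / s)

for t in (0, 1000, 2000, 3000):
    print(t, round(power_schedule(t), 6), round(exponential_schedule(t), 6))
```

Power scheduling divides the initial rate by 2, 3, 4, and so on every $s$ steps, while exponential scheduling keeps dividing it by 10, so it ends up much lower after the same number of steps.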

In [None]:
# Power scheduling
optimizer = keras.optimizers.SGD(learning_rate=0.01, decay=1e-4) # decay = 1/s

# Exponential scheduling
def exponential_decay_fn(epoch):
    # Takes the current epoch and returns learning rate
    return 0.01 * 0.1**(epoch / 20) # where eta_0=0.01, s=20

def exponential_decay(lr0, s):
    def exponential_decay_fn(epoch):
        return lr0 * 0.1**(epoch / s)
    return exponential_decay_fn # Returns a configured function

exponential_decay_fn = exponential_decay(lr0=0.01, s=20)

In [None]:
# FROM TEXTBOOK NOTEBOOK
# Scaling Fashion MNIST
pixel_means = X_train.mean(axis=0, keepdims=True)
pixel_stds = X_train.std(axis=0, keepdims=True)
X_train_scaled = (X_train - pixel_means) / pixel_stds
X_valid_scaled = (X_valid - pixel_means) / pixel_stds
X_test_scaled = (X_test - pixel_means) / pixel_stds

In [None]:
# FROM TEXTBOOK NOTEBOOK
# Building the model
model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),
    keras.layers.Dense(300, activation="selu", kernel_initializer="lecun_normal"),
    keras.layers.Dense(100, activation="selu", kernel_initializer="lecun_normal"),
    keras.layers.Dense(10, activation="softmax")
])
model.compile(loss="sparse_categorical_crossentropy", optimizer="nadam", metrics=["accuracy"])
n_epochs = 25

In [None]:
# Create LearningRateScheduler callback, giving it the schedule function
# Pass callback to fit() method
lr_scheduler = keras.callbacks.LearningRateScheduler(exponential_decay_fn)
history = model.fit(X_train_scaled, y_train, epochs=n_epochs,
                    validation_data=(X_valid_scaled, y_valid),
                    callbacks=[lr_scheduler])


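The `LearningRateScheduler` callback updates the optimizer's `learning_rate` attribute at the beginning of each epoch. Updating once per epoch is usually enough, but if there are many steps per epoch it can make sense to update the learning rate at every step instead, which requires writing your own callback.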

In [None]:
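def exponential_decay_fn(epoch, lr):
    # The schedule function can also take the current learning rate as a second argument.
    # It then relies on the optimizer's initial learning rate, and decay starts at epoch 0 instead of 1.
    return lr * 0.1**(1 / 20)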

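When you save a model, the optimizer and its learning rate get saved along with it. However, the epoch does not get saved, and it gets reset to 0 every time you call the `fit()` method. If you resume training with a schedule function that depends on `epoch`, the learning rate will restart from its large initial value.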

One solution is to manually set the `fit()` method's `initial_epoch` argument so each `epoch` starts at the right value.

In [None]:
# Piecewise constant scheduling
def piecewise_constant_fn(epoch):
    if epoch < 5:
        return 0.01
    elif epoch < 15:
        return 0.005
    else:
        return 0.001
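
If the boundaries and rates need to change often, the hard-coded branches can be generated instead. The following is one possible sketch; `make_piecewise_constant_fn` is a hypothetical helper name, and the lookup uses the standard library's `bisect` module.

```python
from bisect import bisect_right


def make_piecewise_constant_fn(boundaries, values):
    """Build a schedule function from epoch boundaries and learning rates.

    `values` needs one more entry than `boundaries`: values[i] applies until
    the epoch reaches boundaries[i], and the last value applies afterwards.
    """
    if len(values) != len(boundaries) + 1:
        raise ValueError("values must have exactly one more entry than boundaries")

    def piecewise_constant_fn(epoch):
        # bisect_right counts how many boundaries are <= epoch
        return values[bisect_right(boundaries, epoch)]

    return piecewise_constant_fn


# Same schedule as the hard-coded version: 0.01, then 0.005, then 0.001
schedule = make_piecewise_constant_fn([5, 15], [0.01, 0.005, 0.001])
print([schedule(epoch) for epoch in (0, 4, 5, 14, 15, 24)])
```

The returned function takes a single `epoch` argument, so it could be passed to `LearningRateScheduler` in the same way as the hand-written one.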

For performance scheduling, use the `ReduceLROnPlateau` callback. With the arguments below, it multiplies the learning rate by 0.5 whenever the best validation loss does not improve for 5 consecutive epochs.

In [None]:
# Performance scheduling
lr_scheduler = keras.callbacks.ReduceLROnPlateau(factor=0.5, patience=5)
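
The plateau rule can be traced by hand with a small simulation. This is a simplified sketch of the logic, not the callback's actual code: `simulate_reduce_on_plateau` is a hypothetical function, and it ignores the callback's `min_delta`, `cooldown`, and `min_lr` options.

```python
def simulate_reduce_on_plateau(val_losses, lr=0.01, factor=0.5, patience=5):
    """Return the learning rate in effect after each epoch.

    Simplified: any strict decrease counts as an improvement, and there is
    no cooldown period or lower bound on the learning rate.
    """
    best = float("inf")
    wait = 0  # epochs since the best validation loss last improved
    history = []
    for loss in val_losses:
        if loss < best:
            best, wait = loss, 0
        else:
            wait += 1
            if wait >= patience:
                lr *= factor
                wait = 0
        history.append(lr)
    return history


# Validation loss improves for 3 epochs, then stalls for 10 epochs
losses = [1.0, 0.8, 0.7] + [0.7] * 10
history = simulate_reduce_on_plateau(losses)
print(history)
```

In this trace the rate is halved on the 5th stalled epoch and again on the 10th.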

An alternate way to implement learning rate scheduling is: define the learning rate using one of the schedules available in `keras.optimizers.schedules`, then pass this learning rate to any optimizer. This updates the learning rate at each step rather than at each epoch.

So an alternate way to define `exponential_decay_fn()` would be:

In [None]:
s = 20 * len(X_train) // 32 # number of steps in 20 epochs (batch size = 32)
learning_rate = keras.optimizers.schedules.ExponentialDecay(0.01, s, 0.1)
optimizer = keras.optimizers.SGD(learning_rate)

This is nice and simple, and when saving the model, the learning rate and its schedule (including its state) are saved as well.

> Note: The `keras.optimizers.schedules` approach is specific to `tf.keras`; it is not part of the Keras API.
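
For intuition, the per-step rate produced by `ExponentialDecay(0.01, s, 0.1)` can be reproduced with the formula `initial_learning_rate * decay_rate ** (step / decay_steps)`. The sketch below is plain Python; the training-set size of 55,000 is an assumption used only to get a concrete value for `s`.

```python
def exponential_decay(step, initial_learning_rate=0.01, decay_steps=34375,
                      decay_rate=0.1):
    """Learning rate at a given training step (continuous, non-staircase decay)."""
    return initial_learning_rate * decay_rate ** (step / decay_steps)


n_train = 55_000          # assumed training-set size
s = 20 * n_train // 32    # number of steps in 20 epochs (batch size = 32)

for step in (0, s // 2, s, 2 * s):
    print(f"step {step:>6}: lr = {exponential_decay(step, decay_steps=s):.6f}")
```

The rate drops by a factor of 10 every `s` steps, which matches the per-epoch function that decays by a factor of 10 every 20 epochs.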

For 1cycle scheduling, create a custom callback (like the scheduling callbacks above) that modifies the learning rate (`self.model.optimizer.lr`) at each iteration.
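
The rate such a callback would set at each iteration can be computed by a pure function. The sketch below assumes one common shape for the schedule: a linear ramp up to `max_rate`, a linear ramp back down, then a final linear drop over roughly the last 10% of iterations. The function name and the default ratios (`max_rate / 10` for the starting rate, `start_rate / 1000` for the final rate) are assumptions, not values taken from these notes.

```python
def one_cycle_lr(iteration, total_iterations, max_rate,
                 start_rate=None, last_rate=None):
    """Learning rate for a given iteration of a simple 1cycle schedule."""
    start_rate = max_rate / 10 if start_rate is None else start_rate
    last_rate = start_rate / 1000 if last_rate is None else last_rate
    final_phase = total_iterations // 10 + 1      # iterations in the last phase
    half = (total_iterations - final_phase) // 2  # length of each ramp

    def interpolate(i, i1, i2, rate1, rate2):
        # Linear interpolation between (i1, rate1) and (i2, rate2)
        return rate1 + (rate2 - rate1) * (i - i1) / (i2 - i1)

    if iteration < half:                          # phase 1: ramp up
        return interpolate(iteration, 0, half, start_rate, max_rate)
    if iteration < 2 * half:                      # phase 2: ramp down
        return interpolate(iteration, half, 2 * half, max_rate, start_rate)
    return interpolate(iteration, 2 * half, total_iterations,   # phase 3
                       start_rate, last_rate)


total = 1000
for it in (0, 449, 898, 1000):
    print(it, one_cycle_lr(it, total, max_rate=0.05))
```

A callback would call a function like this once per batch and assign the result to the optimizer's learning rate.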

## 11.4 Avoiding Overfitting Through Regularization

### 11.4.1 $\ell_1$ and $\ell_2$ Regularization

You can use $\ell_2$ regularization to constrain a neural network's connection weights and/or $\ell_1$ regularization if you want a sparse model (with many weights equal to 0).

In [None]:
# Apply l2 regularization to a Keras layer's connection weights
layer = keras.layers.Dense(100, activation="elu",
                           kernel_initializer="he_normal",
                           kernel_regularizer=keras.regularizers.l2(0.01))
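
To make the regularizer's effect concrete, the following plain-Python sketch computes the penalty terms that get added to the loss for a small, made-up kernel. The helper names are hypothetical; the formulas are the usual ones, a factor times the sum of squared weights for $\ell_2$ and a factor times the sum of absolute weights for $\ell_1$.

```python
def l2_penalty(weights, factor=0.01):
    """factor * sum of squared weights (the l2 regularization loss)."""
    return factor * sum(w * w for row in weights for w in row)


def l1_penalty(weights, factor=0.01):
    """factor * sum of absolute weights (the l1 regularization loss)."""
    return factor * sum(abs(w) for row in weights for w in row)


# Made-up 2x2 kernel, shape [number of inputs, number of neurons]
kernel = [[0.5, -1.0],
          [0.0, 2.0]]

print(l2_penalty(kernel))  # 0.01 * (0.25 + 1.0 + 0.0 + 4.0)
print(l1_penalty(kernel))  # 0.01 * (0.5 + 1.0 + 0.0 + 2.0)
```

The $\ell_2$ term grows quadratically with large weights, while the $\ell_1$ term penalizes every nonzero weight in proportion to its size, which is what tends to push weights to exactly 0.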

Since you will typically want to apply the same regularizer to all layers, and the same activation function and initialization strategy to all hidden layers, you may find yourself repeating the same arguments.

To avoid this, refactor the code to use loops or use Python's `functools.partial()` function, which lets you create a thin wrapper for any callable.

In [None]:
from functools import partial

In [None]:
RegularizedDense = partial(keras.layers.Dense,
                           activation="elu",
                           kernel_initializer="he_normal",
                           kernel_regularizer=keras.regularizers.l2(0.01))

model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),
    RegularizedDense(300),
    RegularizedDense(100),
    RegularizedDense(10, activation="softmax",
                     kernel_initializer="glorot_uniform")
])
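
The output layer above works because keyword arguments passed at call time override the ones preset by `partial()`. A small standalone sketch of that behavior follows; `describe_layer` is a hypothetical stand-in for a layer constructor, not a Keras function.

```python
from functools import partial


def describe_layer(units, activation="relu", initializer="glorot_uniform"):
    """Hypothetical stand-in for a layer constructor; returns its settings."""
    return {"units": units, "activation": activation, "initializer": initializer}


# Preset the arguments shared by all hidden layers
HiddenLayer = partial(describe_layer, activation="elu", initializer="he_normal")

hidden = HiddenLayer(300)                       # uses the preset keyword arguments
output = HiddenLayer(10, activation="softmax")  # call-time keyword overrides preset
print(hidden)
print(output)
```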

### 11.4.2 Dropout

At every training step, every neuron (including the input neurons, but always excluding the output neurons) has a probability $p$ of being temporarily "dropped out," meaning it will be entirely ignored during this training step, but it may be active during the next step. The hyperparameter $p$ is called the *dropout rate* and is typically set between 10% and 50%.

After training, the neurons don't get dropped anymore.

Neurons trained with dropout cannot co-adapt with their neighboring neurons; they have to be as useful as possible on their own. They also cannot rely excessively on just a few input neurons; they must pay attention to each of their input neurons. They end up being less sensitive to slight changes in the inputs.

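Another way to see it: each training step drops a different random subset of neurons, so a different, smaller network is trained at each step. The resulting neural network can be seen as an averaging ensemble of all these smaller neural networks.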

Suppose $p=50\%$. During testing, a neuron would then be connected to twice as many input neurons as it was (on average) during training, since only about half of the connections are active at any training step.

To compensate, we need to multiply each neuron's input connection weights by $0.5$ (or more generally by $(1-p)$, the *keep probability*) after training; otherwise each neuron will get a total input signal roughly twice as large as what the network was trained on.

`keras.layers.Dropout` randomly drops some inputs (sets them to 0) and divides the remaining inputs by the keep probability. After training, it just passes the inputs to the next layer.
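
The drop-and-rescale behavior described above can be sketched in a few lines of plain Python. This is an illustration of the idea, not the Keras implementation; the `dropout` function and its `rng` parameter are hypothetical.

```python
import random


def dropout(inputs, rate, training, rng=random):
    """Dropout on a list of floats, rescaling survivors by the keep probability."""
    if not training:
        return list(inputs)  # after training: pass the inputs through unchanged
    keep_prob = 1.0 - rate
    # Each input is dropped with probability `rate`; survivors are scaled up
    return [x / keep_prob if rng.random() >= rate else 0.0 for x in inputs]


rng = random.Random(42)  # fixed seed so the run is repeatable
inputs = [1.0] * 100_000
dropped = dropout(inputs, rate=0.5, training=True, rng=rng)

print(sorted(set(dropped)))                           # only 0.0 and 2.0 occur
print(sum(dropped) / len(dropped))                    # close to the input mean
print(dropout([1.0, 2.0], rate=0.5, training=False))  # unchanged
```

Because the surviving inputs are divided by the keep probability during training, the expected total input signal stays the same, and no weight rescaling is needed afterwards.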

In [None]:
model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),
    keras.layers.Dropout(rate=0.2),  # Dropout before every Dense layer
    keras.layers.Dense(300, activation="elu", kernel_initializer="he_normal"),
    keras.layers.Dropout(rate=0.2),
    keras.layers.Dense(100, activation="elu", kernel_initializer="he_normal"),
    keras.layers.Dropout(rate=0.2),
    keras.layers.Dense(10, activation="softmax")
])

> Note: Since dropout is only active during training, comparing the training loss and the validation loss can be misleading: a model may be overfitting the training set and yet have similar training and validation losses. So make sure to evaluate the training loss without dropout (e.g., after training).

Increase the dropout rate if:
- Model is overfitting
- Large layers

Decrease the dropout rate if:
- Model is underfitting
- Small layers

Alternatively, if full dropout is too strong, use dropout only after the last hidden layer.

> Note: If you want to regularize a self-normalizing network based on the SELU activation function, use **alpha dropout**, a variant of dropout that preserves the mean and standard deviation of its inputs.

### 11.4.3 Monte Carlo (MC) Dropout

**MC Dropout** - Boosts the performance of a trained dropout model without retraining it.

In [None]:
# Make 100 predictions over test set with dropout and stack predictions
y_probas = np.stack([model(X_test_scaled, training=True)
                     for sample in range(100)])
y_probas.shape

(100, 10000, 10)

In [None]:
y_proba = y_probas.mean(axis=0)
y_proba.shape

(10000, 10)

> Recall: `predict()` returns a matrix with 1 row per instance and 1 column per class.

Each call to `model(X_test_scaled, training=True)` returns a matrix of shape $[10000, 10]$.  
`y_probas`, after stacking 100 such matrices, has shape $[100, 10000, 10]$.  
`y_proba`, after averaging over the 1st dimension (`axis=0`), has shape $[10000, 10]$.

Averaging over multiple predictions with dropout on gives us a **Monte Carlo** estimate that is generally more reliable than the result of a single prediction with dropout off.
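
The averaging step is easy to check on a toy case. The sketch below uses plain Python and three made-up stochastic predictions for a single instance over three classes (illustrative numbers, not model output). It computes the per-class mean and the population standard deviation, which is what NumPy's `mean(axis=0)` and `std(axis=0)` would give on the stacked array.

```python
from statistics import fmean, pstdev

# Three stochastic predictions for one instance (rows) over three classes (columns)
samples = [
    [0.01, 0.01, 0.98],
    [0.98, 0.00, 0.02],
    [0.09, 0.31, 0.60],
]

# zip(*samples) groups the values class by class, like reducing over axis=0
mean_proba = [fmean(class_values) for class_values in zip(*samples)]
std_proba = [pstdev(class_values) for class_values in zip(*samples)]

# The predicted class is the one with the highest mean probability
predicted_class = max(range(len(mean_proba)), key=mean_proba.__getitem__)

print([round(p, 2) for p in mean_proba])
print([round(s, 2) for s in std_proba])
print(predicted_class)
```

A single sample (the second row) would have picked class 0 with high confidence; the mean still favors class 2, and the large standard deviations signal that the estimate is uncertain.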

In [None]:
# FROM TEXTBOOK NOTEBOOK
# Build a model using SELU activation function and alpha dropout

model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),
    keras.layers.AlphaDropout(rate=0.2),
    keras.layers.Dense(300, activation="selu", kernel_initializer="lecun_normal"),
    keras.layers.AlphaDropout(rate=0.2),
    keras.layers.Dense(100, activation="selu", kernel_initializer="lecun_normal"),
    keras.layers.AlphaDropout(rate=0.2),
    keras.layers.Dense(10, activation="softmax")
])
optimizer = keras.optimizers.SGD(learning_rate=0.01, momentum=0.9, nesterov=True)  # `lr` is a deprecated alias
model.compile(loss="sparse_categorical_crossentropy", optimizer=optimizer, metrics=["accuracy"])
n_epochs = 20
history = model.fit(X_train_scaled, y_train, epochs=n_epochs,
                    validation_data=(X_valid_scaled, y_valid))


In [None]:
# Predict 1st instance of Fashion MNIST test set
# With model using SELU activation function and alpha dropout
np.round(model.predict(X_test_scaled[:1]), 2)

array([[0., 0., 0., 0., 0., 0., 0., 0., 0., 1.]], dtype=float32)

The model seems almost certain that this image belongs to class 9 (ankle boot).

With MC dropout activated, the individual predictions vary from one sample to the next:

In [None]:
np.round(y_probas[:, :1], 2)

array([[[0.  , 0.  , 0.  , 0.  , 0.  , 0.01, 0.  , 0.01, 0.  , 0.98]],

       [[0.  , 0.  , 0.  , 0.  , 0.  , 0.07, 0.  , 0.03, 0.  , 0.89]],

       [[0.  , 0.  , 0.  , 0.  , 0.  , 0.98, 0.  , 0.  , 0.  , 0.02]],

       [[0.  , 0.  , 0.  , 0.  , 0.  , 0.05, 0.  , 0.19, 0.  , 0.76]],

       [[0.  , 0.  , 0.  , 0.  , 0.  , 0.09, 0.  , 0.31, 0.  , 0.6 ]],

       [[0.  , 0.  , 0.  , 0.  , 0.  , 0.11, 0.  , 0.06, 0.  , 0.83]],

       [[0.  , 0.  , 0.  , 0.  , 0.  , 0.01, 0.  , 0.03, 0.  , 0.96]],

       [[0.  , 0.  , 0.  , 0.  , 0.  , 0.01, 0.  , 0.66, 0.  , 0.33]],

       [[0.  , 0.  , 0.  , 0.  , 0.  , 0.64, 0.  , 0.16, 0.  , 0.2 ]],

       [[0.  , 0.  , 0.  , 0.  , 0.  , 0.07, 0.  , 0.05, 0.  , 0.88]],

       [[0.  , 0.  , 0.  , 0.  , 0.  , 0.11, 0.  , 0.  , 0.  , 0.89]],

       [[0.  , 0.  , 0.  , 0.  , 0.  , 0.03, 0.  , 0.05, 0.  , 0.91]],

       [[0.  , 0.  , 0.  , 0.  , 0.  , 0.03, 0.  , 0.07, 0.  , 0.9 ]],

       [[0.  , 0.  , 0.  , 0.  , 0.  , 0.07, 0.  , 0.  , 0.  , 0

In [None]:
np.round(y_proba[:1], 2) # Average over 1st dimension

array([[0.  , 0.  , 0.  , 0.  , 0.  , 0.13, 0.  , 0.13, 0.  , 0.74]],
      dtype=float32)

With MC dropout, the model is only 74% confident that the image is class 9 (ankle boot); it assigns 13% to class 5 (sandal) and 13% to class 7 (sneaker).

In [None]:
y_std = y_probas.std(axis=0)
np.round(y_std[:1], 2)

array([[0.  , 0.  , 0.  , 0.  , 0.  , 0.2 , 0.  , 0.19, 0.01, 0.27]],
      dtype=float32)

In [None]:
# From textbook notebook
y_pred = np.argmax(y_proba, axis=1)

accuracy = np.sum(y_pred == y_test) / len(y_test)
accuracy

0.8635

If your model contains other layers that behave in a special way during training (such as `BatchNormalization` layers), do not force training mode for the whole model as done above. Instead, replace the `Dropout` layers with the following `MCDropout` class.

In [None]:
class MCDropout(keras.layers.Dropout):              # Subclass the Dropout layer
    def call(self, inputs):                         # Override the call() method
        return super().call(inputs, training=True)  # Force training argument to True

> Note: The `MCDropout` class works with all Keras APIs, including the Sequential API. If you only care about the Functional or Subclassing API, you do not have to create an `MCDropout` class; create a regular `Dropout` layer and call it with `training=True`.

### 11.4.4 Max-Norm Regularization

**Max-norm regularization** - For each neuron, it constrains the weights $\mathbf{w}$ of the incoming connections such that $\| \mathbf{w} \|_2 \leq r$, where $r$ is the max-norm hyperparameter and $\| \cdot \|_2$ is the $\ell_2$ norm.

Max-norm regularization does not add a regularization loss term to the overall loss function. Instead, it typically computes $\| \mathbf{w} \|_2$ after each training step and rescales $\mathbf{w}$ if needed ($\mathbf{w} \leftarrow \mathbf{w} \frac{r}{\| \mathbf{w} \|_2}$).

Reducing $r$ increases the amount of regularization and helps reduce overfitting.

In [None]:
# Set kernel_constraint to a max_norm() constraint
keras.layers.Dense(100, activation="elu", kernel_initializer="he_normal",
                   kernel_constraint=keras.constraints.max_norm(1.))


The `max_norm()` function has an `axis` argument that defaults to 0. A Dense layer usually has weights of shape [*number of inputs, number of neurons*], so using `axis=0` means that the max-norm constraint will apply independently to each neuron's weight vector.
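
The per-neuron rescaling can be sketched in plain Python on a small made-up kernel. This illustrates the rule "rescale a column only if its $\ell_2$ norm exceeds $r$"; `apply_max_norm` is a hypothetical helper, and the actual Keras constraint may differ in numerical details.

```python
import math


def apply_max_norm(kernel, r):
    """Rescale each column of `kernel` so that its l2 norm is at most r.

    `kernel` has shape [number of inputs, number of neurons], so each column
    holds the incoming weights of one neuron (the axis=0 case).
    """
    n_inputs, n_neurons = len(kernel), len(kernel[0])
    constrained = [row[:] for row in kernel]  # copy; leave the input untouched
    for j in range(n_neurons):
        norm = math.sqrt(sum(kernel[i][j] ** 2 for i in range(n_inputs)))
        if norm > r:
            for i in range(n_inputs):
                constrained[i][j] = kernel[i][j] * r / norm
    return constrained


kernel = [[3.0, 0.3],
          [4.0, 0.4]]  # column 0 has norm 5.0, column 1 has norm 0.5
constrained = apply_max_norm(kernel, r=1.0)
print(constrained)
```

Only the first neuron's weights are scaled down (to norm 1.0); the second neuron's weights already satisfy the constraint and are left unchanged.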

## 11.5 Summary and Practical Guidelines

> Note: See *Table 11-3. Default DNN configuration* and *Table 11-4. DNN configuration for a self-normalizing net* in book.

Further tips:
- Remember to normalize the input features.
- Try to reuse parts of a pretrained neural network if you can find one that solves a similar problem.
- Use unsupervised pretraining if you have a lot of unlabeled data.
- Use pretraining on an auxiliary task if you have a lot of labeled data for a similar task.

Some exceptions:
- If you need a sparse model, use $\ell_1$ regularization.
- If you need a low-latency model (lightning-fast predictions):
    - Use fewer layers.
    - Fold the Batch Normalization layers into the previous layers.
    - Use a faster activation function such as leaky ReLU or just ReLU.
    - Having a sparse model also helps.
    - Reduce float precision from 32 bits to 16 or even 8 bits.
- If you are building a risk-sensitive application (or inference latency is not very important), use MC Dropout to boost performance and get more reliable probability estimates, along with uncertainty estimates.