# Chapter 11: Training Deep Neural Networks

**What are the problems that we might face when training deep neural networks?**
- Vanishing/Exploding gradients problem
- Not enough training data for large network
- Slow training
- Overfitting of the training set if there are not enough training instances or if they are too noisy

# 1. Vanishing/Exploding Gradients


In a neural network, we update each parameter (connection weight or bias) using the gradient of the cost function with respect to that parameter. Backpropagation is an efficient way to compute these gradients: it works from the output layer back to the input layer, propagating the error gradient along the way.

However, there are several issues with this algorithm:    

**1. Vanishing Gradient**
- Gradients often get smaller and smaller as the algorithm progresses down to the lower layers. As a result, the weights of the lower layers are left virtually unchanged, so those layers barely learn.

**2. Exploding Gradient**
- Gradients can grow bigger and bigger until layers get very large weight updates and the algorithm diverges.
- This mostly happens in recurrent neural networks.

In summary, DNNs suffer from unstable gradients: different layers may learn at very different speeds.

## Why is this happening?

![figure11.1](images/figure11.1.png)

When inputs become large (negative or positive), the logistic (sigmoid) activation function saturates at 0 or 1, with a derivative close to 0.    
When backpropagation kicks in, there is virtually no gradient to propagate back through the network. The little gradient that exists keeps getting diluted as backpropagation progresses down from the top layers, so the lower layers end up with almost nothing.
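
The saturation effect is easy to check numerically. The sketch below is only an illustration: the helper names `sigmoid` and `sigmoid_derivative` are ours, and the "10 layers" bound ignores the connection weights, which also enter the real backpropagated product.

```python
import math

def sigmoid(z):
    """Logistic function 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_derivative(z):
    """Derivative of the logistic function: sigmoid(z) * (1 - sigmoid(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)

# The derivative peaks at 0.25 (at z = 0) and collapses as |z| grows.
for z in (0.0, 2.0, 5.0, 10.0, -10.0):
    print(f"z = {z:6.1f}   derivative = {sigmoid_derivative(z):.6f}")

# Best case for a chain of 10 sigmoid layers (weights ignored):
# every layer contributes a factor of at most 0.25.
best_case_factor = 0.25 ** 10
print(f"upper bound after 10 layers: {best_case_factor:.2e}")
```

Even in this best case the gradient reaching the lowest layer is about a million times smaller than at the top, which is why saturating activations make deep networks hard to train.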

## What can we do to alleviate unstable gradients?

### 1.1 Xavier initialization / Glorot initialization

- We need the signal to flow properly in both directions: forward when making predictions and backward when backpropagating gradients.
- For the signal to flow properly, the variance of each layer's outputs should equal the variance of its inputs, and the gradients should have equal variance before and after flowing through a layer in the reverse direction.
- It is not possible to guarantee both unless the layer has the same number of inputs and neurons (its fan-in and fan-out).
- A good compromise is to initialize the connection weights of each layer randomly, with a variance scaled by the layer's fan-in and fan-out, as described in the table below.

![table11.1](images/table11.1.png)
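
As a rough numerical check of the Glorot rule, the sketch below draws weights from a uniform distribution and compares their variance with $1 / fan_{avg}$. The function names and the layer sizes (300 inputs, 100 neurons) are assumptions made for illustration; Keras already implements this as `glorot_uniform` and `glorot_normal`.

```python
import numpy as np

def glorot_normal_std(fan_in, fan_out):
    """Standard deviation of the normal distribution: sqrt(1 / fan_avg)."""
    fan_avg = (fan_in + fan_out) / 2
    return np.sqrt(1.0 / fan_avg)

def glorot_uniform_limit(fan_in, fan_out):
    """Bound r of the uniform distribution U(-r, r): sqrt(3 / fan_avg)."""
    fan_avg = (fan_in + fan_out) / 2
    return np.sqrt(3.0 / fan_avg)

fan_in, fan_out = 300, 100          # hypothetical layer sizes
rng = np.random.default_rng(42)
r = glorot_uniform_limit(fan_in, fan_out)
W = rng.uniform(-r, r, size=(fan_in, fan_out))

# U(-r, r) has variance r**2 / 3, which equals 1 / fan_avg by construction.
target_variance = 1.0 / ((fan_in + fan_out) / 2)
print(W.var(), target_variance)
```

The empirical variance of `W` should land very close to the target, since both the uniform and the normal version are built to have the same variance.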

In [1]:
import tensorflow as tf
from tensorflow import keras
import sklearn
import numpy as np
import os
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt

We can use the following initializers:

In [2]:
[name for name in dir(keras.initializers) if not name.startswith("_")]

['Constant',
 'GlorotNormal',
 'GlorotUniform',
 'HeNormal',
 'HeUniform',
 'Identity',
 'Initializer',
 'LecunNormal',
 'LecunUniform',
 'Ones',
 'Orthogonal',
 'RandomNormal',
 'RandomUniform',
 'TruncatedNormal',
 'VarianceScaling',
 'Zeros',
 'constant',
 'deserialize',
 'get',
 'glorot_normal',
 'glorot_uniform',
 'he_normal',
 'he_uniform',
 'identity',
 'lecun_normal',
 'lecun_uniform',
 'ones',
 'orthogonal',
 'random_normal',
 'random_uniform',
 'serialize',
 'truncated_normal',
 'variance_scaling',
 'zeros']

In [4]:
keras.layers.Dense(10, activation="relu", kernel_initializer="he_normal")

<tensorflow.python.keras.layers.core.Dense at 0x24474c38848>

We can change the type of variance scaling like this:

In [5]:
init = keras.initializers.VarianceScaling(scale=2., mode='fan_avg', distribution='uniform')
keras.layers.Dense(10, activation="relu", kernel_initializer=init)

<tensorflow.python.keras.layers.core.Dense at 0x24474c67a88>

### 1.2 Use Nonsaturating Activation Functions

![figure11.2](images/figure11.2.png)

**ReLU**
- Does not saturate for positive values and fast to compute
- Suffers from the dying ReLU problem: during training, some neurons "die", meaning they keep outputting 0. This happens when a neuron's weights get tweaked so that the weighted sum of its inputs is negative for all instances in the training set. Since $ReLU(z) = max(0, z)$, the output is then always 0.
- When this happens, Gradient Descent no longer affects the neuron, because the gradient of ReLU is 0 when its input is negative.

**Leaky ReLU**
- To solve this problem, we can use Leaky ReLU, a variant of ReLU defined as $LeakyReLU(z) = max(\alpha z, z)$. The hyperparameter $\alpha$ defines how much the function "leaks": it is the slope of the function for $z < 0$.
- This small slope ensures that leaky ReLUs never die.
- In the reported comparison, the leaky variants always outperformed the strict ReLU activation function.

**Randomized Leaky ReLU (RReLU)**
- $\alpha$ is picked randomly in a given range during training and is fixed to an average value during testing.
- RReLU performed fairly well and seemed to act as a regularizer.

**Parametric Leaky ReLU**
- $\alpha$ is learned during training instead of being set as a hyperparameter
- PReLU was reported to strongly outperform ReLU on large image datasets, but on smaller datasets it runs the risk of overfitting the training set

**Exponential Linear Unit (ELU)**
- An activation function that outperformed all the ReLU variants in its authors' experiments    
- It is similar to ReLU, but it has a nonzero gradient for $z < 0$, which avoids the dead neurons problem
- If $\alpha = 1$, the function is smooth everywhere, including around $z = 0$, which helps speed up Gradient Descent since it does not bounce as much to the left and right of $z = 0$
- Slower to compute than ReLU and its variants because of the exponential function
- During training, its faster convergence rate compensates for that slow computation


![equation11.2](images/equation11.2.png)  
![figure11.3](images/figure11.3.png)
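
The definitions above translate directly into NumPy. This is a minimal sketch for experimenting with the shapes of the functions, not a replacement for the Keras layers; the default `alpha=0.01` for the leaky variant is an assumed value chosen for illustration.

```python
import numpy as np

def relu(z):
    """ReLU(z) = max(0, z)."""
    return np.maximum(0, z)

def leaky_relu(z, alpha=0.01):
    """LeakyReLU(z) = max(alpha * z, z), valid for 0 < alpha < 1."""
    return np.maximum(alpha * z, z)

def elu(z, alpha=1.0):
    """ELU(z) = alpha * (exp(z) - 1) if z < 0, else z."""
    # np.minimum keeps exp() from overflowing on large positive z,
    # because np.where evaluates both branches.
    return np.where(z < 0, alpha * (np.exp(np.minimum(z, 0)) - 1), z)

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print("ReLU      ", relu(z))
print("Leaky ReLU", leaky_relu(z))
print("ELU       ", elu(z))
```

For negative inputs ReLU outputs exactly 0, while the leaky variant and ELU keep a small nonzero output (and gradient), which is what protects them from dying.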

In [3]:
# activation functions
[m for m in dir(keras.activations) if not m.startswith("_")]

['deserialize',
 'elu',
 'exponential',
 'get',
 'hard_sigmoid',
 'linear',
 'relu',
 'selu',
 'serialize',
 'sigmoid',
 'softmax',
 'softplus',
 'softsign',
 'swish',
 'tanh']

In [4]:
# ReLU and variants
[m for m in dir(keras.layers) if "relu" in m.lower()]

['LeakyReLU', 'PReLU', 'ReLU', 'ThresholdedReLU']

In [6]:
(X_train_full, y_train_full), (X_test, y_test) = keras.datasets.fashion_mnist.load_data()
X_train_full = X_train_full / 255.0
X_test = X_test / 255.0
X_valid, X_train = X_train_full[:5000], X_train_full[5000:]
y_valid, y_train = y_train_full[:5000], y_train_full[5000:]

#### LeakyReLU

In [6]:
tf.random.set_seed(42)
np.random.seed(42)

model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28,28]),
    keras.layers.Dense(300, kernel_initializer='he_normal'),
    keras.layers.LeakyReLU(),
    keras.layers.Dense(100, kernel_initializer='he_normal'),
    keras.layers.LeakyReLU(),
    keras.layers.Dense(10, activation="softmax")
])

In [9]:
model.compile(loss="sparse_categorical_crossentropy",
              optimizer=keras.optimizers.SGD(learning_rate=1e-3),
              metrics=["accuracy"])

In [10]:
history = model.fit(X_train, y_train, epochs=10, validation_data=(X_valid, y_valid))


#### Parametric ReLU

In [11]:
tf.random.set_seed(42)
np.random.seed(42)

model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),
    keras.layers.Dense(300, kernel_initializer="he_normal"),
    keras.layers.PReLU(),
    keras.layers.Dense(100, kernel_initializer="he_normal"),
    keras.layers.PReLU(),
    keras.layers.Dense(10, activation="softmax")
])

In [12]:
model.compile(loss="sparse_categorical_crossentropy",
              optimizer=keras.optimizers.SGD(learning_rate=1e-3),
              metrics=["accuracy"])

In [13]:
history = model.fit(X_train, y_train, epochs=10,
                    validation_data=(X_valid, y_valid))


#### ELU

In [14]:
keras.layers.Dense(10, activation="elu")

<tensorflow.python.keras.layers.core.Dense at 0x20c4ac72dc8>

### 1.3. Batch Normalization

Using He initialization along with ELU or any variant of ReLU can significantly reduce the danger of vanishing/exploding gradients at the beginning of training, but it doesn't guarantee that they won't come back during training.

**What is Batch Normalization?**

- We add an operation in the model just before or after the activation function of each hidden layer.
- This operation zero-centers and normalizes each input, then scales and shifts the result using two new parameter vectors per layer: one for scaling, the other for shifting.
- This means that the model learns the optimal scale and mean of each of the layer's inputs.
- If we add a BN layer as the very first layer of the neural network, we do not need to standardize the training set ourselves.
- To zero-center and normalize the inputs, the algorithm needs to estimate each input's mean and standard deviation.
- It does so by evaluating the mean and standard deviation of the input over the current mini-batch.

**Issues with Batch Normalization**
- During testing, we may need to make predictions for individual instances rather than for batches, so there is no way to compute each input's mean and standard deviation.
- Most implementations of BN estimate these final statistics during training by using a moving average of the layer's input means and standard deviations.
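
The training-time and test-time behaviour described above can be sketched in NumPy as follows. The function names are ours, and the values `momentum=0.99` and `eps=1e-3` are assumptions picked for the example; this is a simplified picture of what a BN layer computes, not the Keras implementation.

```python
import numpy as np

def batch_norm_train(X, gamma, beta, moving_mean, moving_var,
                     momentum=0.99, eps=1e-3):
    """BN forward pass on a mini-batch X of shape (batch_size, n_inputs)."""
    mu = X.mean(axis=0)                      # per-input mean over the batch
    var = X.var(axis=0)                      # per-input variance over the batch
    X_hat = (X - mu) / np.sqrt(var + eps)    # zero-center and normalize
    Z = gamma * X_hat + beta                 # scale and shift
    # Moving averages, kept for use at test time.
    moving_mean = momentum * moving_mean + (1 - momentum) * mu
    moving_var = momentum * moving_var + (1 - momentum) * var
    return Z, moving_mean, moving_var

def batch_norm_predict(X, gamma, beta, moving_mean, moving_var, eps=1e-3):
    """At test time, use the moving averages instead of batch statistics."""
    X_hat = (X - moving_mean) / np.sqrt(moving_var + eps)
    return gamma * X_hat + beta

rng = np.random.default_rng(42)
X = rng.normal(loc=5.0, scale=3.0, size=(32, 4))   # made-up mini-batch
gamma, beta = np.ones(4), np.zeros(4)
Z, moving_mean, moving_var = batch_norm_train(X, gamma, beta,
                                              np.zeros(4), np.ones(4))
print(Z.mean(axis=0), Z.std(axis=0))

# A single instance can be handled at test time: no batch statistics needed.
single = batch_norm_predict(X[:1], gamma, beta, moving_mean, moving_var)
print(single.shape)
```

With `gamma` set to ones and `beta` to zeros, each column of `Z` comes out with a mean of about 0 and a standard deviation of about 1; during training the model is free to learn other values for both vectors.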

**Benefits of Batch Normalization**
- Networks are much less sensitive to the weight initialization
- We can use much larger learning rates, which significantly speeds up the learning process
- BN also acts like a regularizer, reducing the need for other regularization techniques such as dropout
- Although each epoch takes longer because of the extra computation, the algorithm converges much faster with BN, so it takes fewer epochs to reach the same performance.

**Disadvantages of Batch Normalization**
- BN adds complexity to the model.
- The neural network makes slower predictions due to the extra computation required at each layer.
- This can be solved by fusing the BN layer with the previous layer after training, which avoids the runtime penalty.

In practice, we add a BN layer before or after each hidden layer's activation function, and optionally add a BN layer as the first layer in the model.

In [15]:
model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(300, activation="relu"),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(100, activation="relu"),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(10, activation="softmax")
])

### 1.4. Gradient Clipping

To mitigate the exploding gradients problem, we can clip the gradients during backpropagation so that they never exceed some threshold. 
- This technique is most often used in recurrent neural networks, as BN is tricky to use in RNNs.
- For other types of networks, BN is usually sufficient.

In [17]:
optimizer = keras.optimizers.SGD(clipvalue=1.0)
model.compile(loss="mse", optimizer=optimizer)

The optimizer will clip every component of the gradient vector to a value between -1.0 and 1.0
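
One side effect of clipping by value is that it can change the direction of the gradient vector. The NumPy sketch below contrasts it with clipping by norm, which rescales the whole vector instead (Keras optimizers expose this through the `clipnorm` argument). The helper names and the example gradient are assumptions for illustration.

```python
import numpy as np

def clip_by_value(grad, clip_value):
    """Clip every component independently to [-clip_value, clip_value]."""
    return np.clip(grad, -clip_value, clip_value)

def clip_by_norm(grad, clip_norm):
    """Rescale the whole vector if its L2 norm exceeds clip_norm."""
    norm = np.linalg.norm(grad)
    if norm > clip_norm:
        return grad * clip_norm / norm
    return grad

grad = np.array([0.9, 100.0])            # made-up gradient vector
by_value = clip_by_value(grad, 1.0)
by_norm = clip_by_norm(grad, 1.0)
print("by value:", by_value)
print("by norm: ", by_norm)
```

Clipping by value turns `[0.9, 100.0]` into `[0.9, 1.0]`, which points almost along the diagonal, whereas clipping by norm keeps the original direction (dominated by the second axis) and only shrinks the length to 1. Which one works better is an empirical question.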

# 2. Reusing Pretrained Layers

## Transfer Learning

- It is generally not a good idea to train a very large DNN from scratch.
- Instead, we should try to find an existing neural network that accomplishes a similar task, then reuse the lower layers of this network.
- This speeds up training considerably and also requires significantly less training data.
- Transfer learning works best when the inputs have similar low-level features.
- The output layer should be replaced for the new task.
- The upper hidden layers of the original model are less likely to be as useful as the lower layers, since the high-level features that are most useful for the new task may differ significantly from the ones that were most useful for the original task.
- For very similar tasks, try keeping all the hidden layers and just replacing the output layer.

![figure11.4](images/figure11.4.png)

- Try freezing all the reused layers first, that is, make their weights non-trainable so that Gradient Descent won't modify them.
- Then train the model and see how it performs.
- Then try unfreezing one or two of the top hidden layers to let backpropagation tweak them, and see if performance improves.
- When unfreezing reused layers, it is useful to reduce the learning rate to avoid wrecking their fine-tuned weights.
- If we still cannot get good performance and have little training data, try dropping the top hidden layers and freezing all the remaining hidden layers again.
- Iterate until we find the right number of layers to reuse.
- If we have plenty of training data, we may replace the top hidden layers instead of dropping them, and even add more hidden layers.

#### Example

Let's split the Fashion MNIST training set in two:  
- ``X_train_A``: all images of all items except for sandals and shirts (classes 5 and 6).
- ``X_train_B``: a much smaller training set of just the first 200 images of sandals or shirts.

The validation set and the test set are also split this way, but without restricting the number of images.

We will train a model on set A (classification task with 8 classes), and try to reuse it to tackle set B (binary classification). We hope to transfer a little bit of knowledge from task A to task B, since classes in set A (sneakers, ankle boots, coats, t-shirts, etc.) are somewhat similar to classes in set B (sandals and shirts). However, since we are using Dense layers, only patterns that occur at the same location can be reused (in contrast, convolutional layers will transfer much better, since learned patterns can be detected anywhere on the image, as we will see in the CNN chapter)


In [18]:
def split_dataset(X, y):
    y_5_or_6 = (y == 5) | (y == 6) # sandals or shirts
    y_A = y[~y_5_or_6]
    y_A[y_A > 6] -= 2 # class indices 7, 8, 9 should be moved to 5, 6, 7
    y_B = (y[y_5_or_6] == 6).astype(np.float32) # binary classification task: is it a shirt (class 6)?
    return ((X[~y_5_or_6], y_A),
            (X[y_5_or_6], y_B))

(X_train_A, y_train_A), (X_train_B, y_train_B) = split_dataset(X_train, y_train)
(X_valid_A, y_valid_A), (X_valid_B, y_valid_B) = split_dataset(X_valid, y_valid)
(X_test_A, y_test_A), (X_test_B, y_test_B) = split_dataset(X_test, y_test)
X_train_B = X_train_B[:200]
y_train_B = y_train_B[:200]

In [19]:
X_train_A.shape

(43986, 28, 28)

In [20]:
X_train_B.shape

(200, 28, 28)

In [21]:
y_train_A[:30]

array([4, 0, 5, 7, 7, 7, 4, 4, 3, 4, 0, 1, 6, 3, 4, 3, 2, 6, 5, 3, 4, 5,
       1, 3, 4, 2, 0, 6, 7, 1], dtype=uint8)

In [22]:
y_train_B[:30]

array([1., 1., 0., 0., 0., 0., 1., 1., 1., 0., 0., 1., 1., 0., 0., 0., 0.,
       0., 0., 1., 1., 0., 0., 1., 1., 0., 1., 1., 1., 1.], dtype=float32)

In [23]:
tf.random.set_seed(42)
np.random.seed(42)

In [24]:
model_A = keras.models.Sequential()
model_A.add(keras.layers.Flatten(input_shape=[28, 28]))
for n_hidden in (300, 100, 50, 50, 50):
    model_A.add(keras.layers.Dense(n_hidden, activation="selu"))
model_A.add(keras.layers.Dense(8, activation="softmax"))

In [25]:
model_A.compile(loss="sparse_categorical_crossentropy",
                optimizer=keras.optimizers.SGD(learning_rate=1e-3),
                metrics=["accuracy"])

In [26]:
history = model_A.fit(X_train_A, y_train_A, epochs=20,
                    validation_data=(X_valid_A, y_valid_A))


In [27]:
model_A.save("my_model_A.h5")

In [28]:
model_B = keras.models.Sequential()
model_B.add(keras.layers.Flatten(input_shape=[28, 28]))
for n_hidden in (300, 100, 50, 50, 50):
    model_B.add(keras.layers.Dense(n_hidden, activation="selu"))
model_B.add(keras.layers.Dense(1, activation="sigmoid"))

In [29]:
model_B.compile(loss="binary_crossentropy",
                optimizer=keras.optimizers.SGD(learning_rate=1e-3),
                metrics=["accuracy"])

In [30]:
history = model_B.fit(X_train_B, y_train_B, epochs=20,
                      validation_data=(X_valid_B, y_valid_B))


In [31]:
# remove the last layer of A 
model_A = keras.models.load_model("my_model_A.h5")
model_B_on_A = keras.models.Sequential(model_A.layers[:-1])
model_B_on_A.add(keras.layers.Dense(1, activation="sigmoid"))

Note that ``model_B_on_A`` and ``model_A`` actually share layers now, so when we train one, it will update both models. If we want to avoid that, we need to build ``model_B_on_A`` on top of a clone of ``model_A``:

In [32]:
model_A_clone = keras.models.clone_model(model_A)
model_A_clone.set_weights(model_A.get_weights())
model_B_on_A = keras.models.Sequential(model_A_clone.layers[:-1])
model_B_on_A.add(keras.layers.Dense(1, activation="sigmoid"))

In [33]:
for layer in model_B_on_A.layers[:-1]:
    layer.trainable = False

model_B_on_A.compile(loss="binary_crossentropy",
                     optimizer=keras.optimizers.SGD(learning_rate=1e-3),
                     metrics=["accuracy"])

In [34]:
history = model_B_on_A.fit(X_train_B, y_train_B, epochs=4,
                           validation_data=(X_valid_B, y_valid_B))

for layer in model_B_on_A.layers[:-1]:
    layer.trainable = True

model_B_on_A.compile(loss="binary_crossentropy",
                     optimizer=keras.optimizers.SGD(learning_rate=1e-3),
                     metrics=["accuracy"])
history = model_B_on_A.fit(X_train_B, y_train_B, epochs=16,
                           validation_data=(X_valid_B, y_valid_B))


In [35]:
model_B.evaluate(X_test_B, y_test_B)



[0.1408407837152481, 0.9704999923706055]

In [36]:
model_B_on_A.evaluate(X_test_B, y_test_B)



[0.05611313506960869, 0.9940000176429749]

# 3. Faster Optimizers

We have seen four ways to speed up training: 
- Using good initialization strategy for the connection weights
- Using good activation functions
- Using Batch Normalization
- Reusing parts of a pretrained network

Another way is using a faster optimizer than the regular Gradient Descent optimizer

## 3.1 Momentum Optimization

**Momentum:** A bowling ball rolling down a gentle slope starts slowly but quickly picks up momentum until it reaches terminal velocity.    
**Regular GD:** It takes small, regular steps down the slope, so it takes much longer to reach the bottom.

In regular Gradient Descent, we update the weights $\theta$ this way:    
- $\theta \leftarrow \theta - \eta \nabla_{\theta}J(\theta)$    

It does not care about what the earlier gradients were.


In momentum optimization, we update the weights this way:   
- $m \leftarrow \beta m - \eta \nabla_{\theta} J(\theta)$   
- $\theta \leftarrow \theta + m$

It cares a great deal about what previous gradients were. At each iteration, it subtracts the local gradient (multiplied by the learning rate $\eta$) from the momentum vector $m$ and updates the weights by adding this momentum vector. In other words, the gradient is used for acceleration, not for speed. To simulate some sort of friction mechanism and prevent the momentum from growing too large, the algorithm introduces a new hyperparameter $\beta$, called the momentum, which must be set between 0 (high friction) and 1 (no friction). A typical momentum value is 0.9.

Due to momentum, the optimizer may overshoot a bit, then come back, overshoot again, and oscillate like this several times before settling at the minimum. This is one reason it is good to have a bit of friction in the system: it gets rid of these oscillations and thus speeds up convergence.

A drawback is that it adds another hyperparameter to tune. However, a momentum value of 0.9 usually works well in practice and almost always goes faster than regular Gradient Descent.
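
To make the "terminal velocity" idea concrete, here is a minimal sketch of the two update rules applied to a constant gradient. The function names and the numbers are illustrative assumptions. With a constant gradient, the momentum vector converges to $-\eta \nabla / (1 - \beta)$, so with $\beta = 0.9$ each step ends up 10 times larger than a plain Gradient Descent step.

```python
def gd_step(theta, grad, lr):
    """Regular Gradient Descent: theta <- theta - lr * grad."""
    return theta - lr * grad

def momentum_step(theta, m, grad, lr, beta):
    """Momentum: m <- beta * m - lr * grad, then theta <- theta + m."""
    m = beta * m - lr * grad
    return theta + m, m

lr, beta, grad = 0.001, 0.9, 1.0     # hypothetical constant gradient
theta, m = 0.0, 0.0
for _ in range(200):
    theta, m = momentum_step(theta, m, grad, lr, beta)

gd_step_size = abs(gd_step(0.0, grad, lr))   # size of one plain GD step
terminal_velocity = abs(m)                   # size of one momentum step
print(gd_step_size, terminal_velocity, terminal_velocity / gd_step_size)
```

The ratio printed at the end approaches $1 / (1 - \beta) = 10$, which is the speed-up momentum can give on a long, gentle slope.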

In [37]:
optimizer = keras.optimizers.SGD(learning_rate=0.001, momentum=0.9)

## 3.2 Nesterov Accelerated Gradient

Nesterov momentum optimization measures the gradient of the cost function not at the local position $\theta$ but slightly ahead in the direction of the momentum, at $\theta + \beta m$:    
- $m \leftarrow \beta m - \eta \nabla_{\theta} J(\theta + \beta m)$   
- $\theta \leftarrow \theta + m$


This works in general because the momentum vector will be pointing in the right direction, so it will be slightly more accurate to use the gradient measured a bit farther in that direction rather than the gradient at the original position
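
The only change from plain momentum is where the gradient is evaluated. A minimal sketch, assuming a toy cost function $J(\theta) = \theta^2$ and hyperparameters chosen for the example:

```python
def nesterov_step(theta, m, grad_fn, lr, beta):
    """One Nesterov update: the gradient is measured at theta + beta * m."""
    grad = grad_fn(theta + beta * m)   # look ahead along the momentum
    m = beta * m - lr * grad
    return theta + m, m

def grad_fn(theta):
    """Gradient of the toy cost J(theta) = theta**2."""
    return 2.0 * theta

theta, m = 5.0, 0.0                    # arbitrary starting point
for _ in range(500):
    theta, m = nesterov_step(theta, m, grad_fn, lr=0.1, beta=0.9)
print(theta)
```

On this toy problem `theta` converges to the minimum at 0. In Keras the same idea is enabled with `nesterov=True` on the `SGD` optimizer, as in the cell below.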

In [38]:
optimizer = keras.optimizers.SGD(learning_rate=0.001, momentum=0.9, nesterov=True)

## 3.3 Adagrad

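- Adagrad scales down the gradient vector along the steepest dimensions, which amounts to decaying the learning rate faster for steep dimensions than for dimensions with gentler slopes. This lets it correct its direction earlier to point a bit more toward the global optimum, whereas regular Gradient Descent starts by going down the steepest slope, which does not point straight toward the global optimum.
- It often stops too early when training neural networks: the learning rate gets scaled down so much that the algorithm ends up stopping before reaching the global optimum.
- Hence, it should not be used to train deep neural networks.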
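
Both properties can be seen in a small NumPy sketch of the Adagrad update, $s \leftarrow s + \nabla \otimes \nabla$ followed by $\theta \leftarrow \theta - \eta \nabla \oslash \sqrt{s + \epsilon}$. The function name, the constant gradient and the hyperparameter values are assumptions for the demonstration.

```python
import numpy as np

def adagrad_step(theta, s, grad, lr=0.1, eps=1e-10):
    """Accumulate squared gradients, then scale each dimension's step."""
    s = s + grad ** 2
    theta = theta - lr * grad / np.sqrt(s + eps)
    return theta, s

theta = np.array([1.0, 1.0])
s = np.zeros(2)
grad = np.array([10.0, 0.1])    # one steep and one gentle dimension

step_sizes = []
for _ in range(100):
    new_theta, s = adagrad_step(theta, s, grad)
    step_sizes.append(np.abs(new_theta - theta))
    theta = new_theta

print("first step:    ", step_sizes[0])
print("hundredth step:", step_sizes[-1])
```

On the first step both dimensions move by the same amount even though their gradients differ by a factor of 100, which is the per-dimension scaling. By the hundredth step the step size has shrunk by a factor of 10 in both dimensions, and it keeps shrinking, which is why training can stall.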

In [39]:
optimizer = keras.optimizers.Adagrad(learning_rate=0.001)

## 3.4 RMSProp

- RMSProp fixes Adagrad's issue of slowing down too fast and never converging to the global minimum.
- It does so by accumulating only the gradients from the most recent iterations (as opposed to all the gradients since the beginning of training).
- This is achieved by using exponential decay when accumulating the squared gradients, with a decay rate $\rho$ that is typically set to 0.9.
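
A minimal sketch of the RMSProp update under the same kind of toy setup (constant gradient, made-up hyperparameters, function name of our choosing): the accumulator `s` becomes an exponentially decaying average of squared gradients, so the step size levels off instead of shrinking toward zero.

```python
import numpy as np

def rmsprop_step(theta, s, grad, lr=0.01, rho=0.9, eps=1e-10):
    """Decaying average of squared gradients, then a scaled step."""
    s = rho * s + (1 - rho) * grad ** 2
    theta = theta - lr * grad / np.sqrt(s + eps)
    return theta, s

theta, s = 0.0, 0.0
grad = 2.0                       # hypothetical constant gradient

step_sizes = []
for _ in range(200):
    new_theta, s = rmsprop_step(theta, s, grad)
    step_sizes.append(abs(new_theta - theta))
    theta = new_theta

print("accumulator s:", s)                 # settles near grad**2
print("last step size:", step_sizes[-1])   # settles near lr
```

With a constant gradient, `s` settles at the squared gradient and the step size settles at the learning rate, whereas with Adagrad it would keep decaying.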

In [40]:
optimizer = keras.optimizers.RMSprop(learning_rate=0.001, rho=0.9)

## 3.5 Adam and Nadam

- Adam (adaptive moment estimation) combines the ideas of momentum optimization and RMSProp
- Just like momentum optimization, it keeps track of an exponentially decaying average of past gradients
- Just like RMSProp, it keeps track of an exponentially decaying average of past squared gradients
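
Putting the two averages together gives the sketch below. It follows the commonly published form of Adam, including the bias-correction terms that compensate for `m` and `s` starting at 0; those terms, the function name and the `eps` value are not spelled out in these notes and are stated here as assumptions.

```python
import numpy as np

def adam_step(theta, m, s, grad, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-7):
    """One Adam update at iteration t (t starts at 1)."""
    m = beta1 * m + (1 - beta1) * grad         # decaying average of gradients
    s = beta2 * s + (1 - beta2) * grad ** 2    # decaying average of squared gradients
    m_hat = m / (1 - beta1 ** t)               # bias correction
    s_hat = s / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(s_hat) + eps)
    return theta, m, s

theta, m, s = 1.0, 0.0, 0.0
theta, m, s = adam_step(theta, m, s, grad=5.0, t=1)
print(theta)
```

After the very first step the parameter has moved by almost exactly the learning rate, regardless of how large the gradient is: the bias-corrected ratio of the two averages is close to the sign of the gradient.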

In [41]:
optimizer = keras.optimizers.Adam(learning_rate=0.001, beta_1=0.9, beta_2=0.999)

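There are two variants of Adam:  
- AdaMax: scales down the parameter updates by the $\ell_{\infty}$ norm (the max) of the time-decayed gradients instead of their $\ell_{2}$ norm
- Nadam: Adam optimization plus the Nesterov trick. It often converges slightly faster than Adam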

In [42]:
# AdaMax
optimizer = keras.optimizers.Adamax(learning_rate=0.001, beta_1=0.9, beta_2=0.999)

In [43]:
# Nadam
optimizer = keras.optimizers.Nadam(learning_rate=0.001, beta_1=0.9, beta_2=0.999)

## Summary of all optimizers

(*)-bad        
(**)-average    
(***)-good    

![table11.2](images/table11.2.png)

# 4. Learning Rate Scheduling

- If the learning rate is too high, training may diverge. 
- If the learning rate is too low, training will eventually converge to the optimum, but it will take a very long time.
- If the learning rate is set slightly too high, training will make progress very quickly at first but will end up dancing around the optimum and never really settling down.

![figure11.8](images/figure11.8.png)

We can do better than a constant learning rate.    
- If we start with a large learning rate and then reduce it once training stops making fast progress, we can reach a good solution faster than with the optimal constant learning rate.  
- We can also start with low learning rate, increase it, then drop it again.   

## 4.1 Power Scheduling

``lr = lr0 / (1 + steps / s)**c``

- The initial learning rate lr0, the power c (typically set to 1) and the number of steps s are hyperparameters.
- The learning rate drops at each step.
- After s steps, it is down to lr0/2; after s more steps, it is down to lr0/3, then lr0/4, and so on.
- This schedule first drops quickly, then more and more slowly.

In Keras, the ``decay`` argument of the optimizer is the inverse of s (the number of steps it takes to divide the learning rate by one more unit), and Keras assumes that c = 1.
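
The formula can be checked with a few lines of plain Python. The function name and the default values are illustrative; `s=10_000` is chosen because it is the inverse of the `decay=1e-4` used in the cell below.

```python
def power_schedule(step, lr0=0.01, s=10_000, c=1):
    """Power scheduling: lr = lr0 / (1 + step / s)**c."""
    return lr0 / (1 + step / s) ** c

for step in (0, 10_000, 20_000, 30_000, 100_000):
    print(f"step {step:>7}: lr = {power_schedule(step):.6f}")
```

The printed values go from lr0 to lr0/2, lr0/3, lr0/4 after each additional s steps, so the drops get smaller and smaller in absolute terms.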

![power_scheduling](images/power_scheduling.png)

In [2]:
optimizer = keras.optimizers.SGD(learning_rate=0.01, decay=1e-4)

In [3]:
model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),
    keras.layers.Dense(300, activation="selu", kernel_initializer="lecun_normal"),
    keras.layers.Dense(100, activation="selu", kernel_initializer="lecun_normal"),
    keras.layers.Dense(10, activation="softmax")
])
model.compile(loss="sparse_categorical_crossentropy", optimizer=optimizer, metrics=["accuracy"])
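
The training cells in this section refer to `X_train_scaled` and `X_valid_scaled`, which these notes never define. A plausible reading, and only an assumption here, is that they are the images standardized with the per-pixel mean and standard deviation of the training set. The helper below (`standardize` is a hypothetical name) sketches that idea on random stand-in data so that it runs on its own.

```python
import numpy as np

def standardize(X_train, *others, eps=1e-7):
    """Scale every array with the per-pixel mean and std of X_train."""
    means = X_train.mean(axis=0, keepdims=True)
    stds = X_train.std(axis=0, keepdims=True)
    # eps guards against division by zero for constant pixels.
    return [(X - means) / (stds + eps) for X in (X_train, *others)]

# Random stand-in data with the same image shape as Fashion MNIST.
rng = np.random.default_rng(42)
demo_train = rng.random((100, 28, 28))
demo_valid = rng.random((20, 28, 28))
demo_train_scaled, demo_valid_scaled = standardize(demo_train, demo_valid)
print(demo_train_scaled.mean(), demo_train_scaled.std())
```

With the real data this would be called as `X_train_scaled, X_valid_scaled, X_test_scaled = standardize(X_train, X_valid, X_test)`; the validation and test sets are scaled with the training set's statistics, never their own.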

In [None]:
n_epochs = 25
history = model.fit(X_train_scaled, y_train, epochs=n_epochs,
                    validation_data=(X_valid_scaled, y_valid))

## 4.2 Exponential Scheduling

``lr = lr0 * 0.1**(epoch / s)``

- The learning rate gradually drops by a factor of 10 every s steps. In the formula above the schedule is evaluated once per epoch, so s is counted in epochs.
- While power scheduling reduces the learning rate more and more slowly, exponential scheduling keeps slashing it by a factor of 10 every s steps.

![exponential_scheduling](images/exponential_scheduling.png)

In [8]:
def exponential_decay(lr0, s):
    def exponential_decay_fn(epoch):
        return lr0 * 0.1**(epoch / s)
    return exponential_decay_fn

exponential_decay_fn = exponential_decay(lr0=0.01, s=20)

In [None]:
model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),
    keras.layers.Dense(300, activation="selu", kernel_initializer="lecun_normal"),
    keras.layers.Dense(100, activation="selu", kernel_initializer="lecun_normal"),
    keras.layers.Dense(10, activation="softmax")
])
model.compile(loss="sparse_categorical_crossentropy", optimizer="nadam", metrics=["accuracy"])
n_epochs = 25

The ``LearningRateScheduler`` will update the optimizer's ``learning_rate`` at the beginning of each epoch.

In [None]:
lr_scheduler = keras.callbacks.LearningRateScheduler(exponential_decay_fn)
history = model.fit(X_train_scaled, y_train, epochs=n_epochs,
                    validation_data=(X_valid_scaled, y_valid),
                    callbacks=[lr_scheduler])

## 4.3 Piecewise Constant Scheduling

- Use a constant learning rate for a number of epochs then a smaller learning rate for another number of epochs and so on.
- Requires fiddling around to figure out the right sequence of learning rates

![piecewise](images/piecewise.png)

In [None]:
def piecewise_constant_fn(epoch):
    if epoch < 5:
        return 0.01
    elif epoch < 15:
        return 0.005
    else:
        return 0.001

In [None]:
def piecewise_constant(boundaries, values):
    boundaries = np.array([0] + boundaries)
    values = np.array(values)
    def piecewise_constant_fn(epoch):
        return values[np.argmax(boundaries > epoch) - 1]
    return piecewise_constant_fn

piecewise_constant_fn = piecewise_constant([5, 15], [0.01, 0.005, 0.001])

In [None]:
lr_scheduler = keras.callbacks.LearningRateScheduler(piecewise_constant_fn)

model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),
    keras.layers.Dense(300, activation="selu", kernel_initializer="lecun_normal"),
    keras.layers.Dense(100, activation="selu", kernel_initializer="lecun_normal"),
    keras.layers.Dense(10, activation="softmax")
])
model.compile(loss="sparse_categorical_crossentropy", optimizer="nadam", metrics=["accuracy"])
n_epochs = 25
history = model.fit(X_train_scaled, y_train, epochs=n_epochs,
                    validation_data=(X_valid_scaled, y_valid),
                    callbacks=[lr_scheduler])

## 4.4 Performance Scheduling

- Measure the validation error every N steps (just like for early stopping) and then reduce the learning rate by a factor of $\lambda$ when the error stops dropping
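
The logic behind this schedule fits in a small class. This is a simplified sketch with a hypothetical name (`PlateauScheduler`); the Keras `ReduceLROnPlateau` callback used below follows the same idea but adds options such as `min_delta`, `cooldown` and `min_lr`.

```python
class PlateauScheduler:
    """Multiply the learning rate by `factor` after `patience` checks
    in a row without improvement of the validation loss."""

    def __init__(self, lr, factor=0.5, patience=5):
        self.lr = lr
        self.factor = factor
        self.patience = patience
        self.best = float("inf")
        self.wait = 0

    def update(self, val_loss):
        if val_loss < self.best:
            self.best = val_loss       # improvement: reset the counter
            self.wait = 0
        else:
            self.wait += 1
            if self.wait >= self.patience:
                self.lr *= self.factor
                self.wait = 0
        return self.lr

scheduler = PlateauScheduler(lr=0.02, factor=0.5, patience=5)
val_losses = [1.0, 0.9, 0.9, 0.9, 0.9, 0.9, 0.9]   # made-up validation losses
lrs = [scheduler.update(loss) for loss in val_losses]
print(lrs)
```

In this made-up run the loss stops improving after the second check, and the learning rate is halved on the fifth check in a row without improvement.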

In [None]:
lr_scheduler = keras.callbacks.ReduceLROnPlateau(factor=0.5, patience=5)

model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),
    keras.layers.Dense(300, activation="selu", kernel_initializer="lecun_normal"),
    keras.layers.Dense(100, activation="selu", kernel_initializer="lecun_normal"),
    keras.layers.Dense(10, activation="softmax")
])
optimizer = keras.optimizers.SGD(learning_rate=0.02, momentum=0.9)
model.compile(loss="sparse_categorical_crossentropy", optimizer=optimizer, metrics=["accuracy"])
n_epochs = 25
history = model.fit(X_train_scaled, y_train, epochs=n_epochs,
                    validation_data=(X_valid_scaled, y_valid),
                    callbacks=[lr_scheduler])

## 4.5 Using tf.keras Schedulers

In [None]:
model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),
    keras.layers.Dense(300, activation="selu", kernel_initializer="lecun_normal"),
    keras.layers.Dense(100, activation="selu", kernel_initializer="lecun_normal"),
    keras.layers.Dense(10, activation="softmax")
])
s = 20 * len(X_train) // 32 # number of steps in 20 epochs (batch size = 32)
learning_rate = keras.optimizers.schedules.ExponentialDecay(0.01, s, 0.1)
optimizer = keras.optimizers.SGD(learning_rate)
model.compile(loss="sparse_categorical_crossentropy", optimizer=optimizer, metrics=["accuracy"])
n_epochs = 25
history = model.fit(X_train_scaled, y_train, epochs=n_epochs,
                    validation_data=(X_valid_scaled, y_valid))
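
Unlike the epoch-based callback of section 4.2, this schedule is evaluated at every training step. A plain Python sketch of the formula `ExponentialDecay` applies with its default non-staircase behaviour, `lr0 * decay_rate ** (step / decay_steps)`, is shown below. The function name is ours, and 55,000 is the size of the training set after holding out 5,000 validation images.

```python
def exponential_decay_per_step(step, lr0=0.01, decay_steps=34_375, decay_rate=0.1):
    """Learning rate after `step` training steps."""
    return lr0 * decay_rate ** (step / decay_steps)

# Number of steps in 20 epochs with 55,000 instances and batch size 32.
steps_in_20_epochs = 20 * 55_000 // 32
print(steps_in_20_epochs)
print(exponential_decay_per_step(0))
print(exponential_decay_per_step(steps_in_20_epochs))
```

After 20 epochs' worth of steps the learning rate has dropped by a factor of 10, which matches `exponential_decay(lr0=0.01, s=20)` from section 4.2 but with a smooth decrease inside each epoch.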

## 4.6 Summary

- Both performance scheduling and exponential scheduling performed well.
- Exponential scheduling is favoured because it was easy to tune and converged slightly faster to the optimal solution.

# 5. Avoiding Overfitting through Regularization

DNNs typically have tens of thousands of parameters, sometimes even millions. This gives them a great amount of freedom and means they can fit a huge variety of complex datasets. This flexibility also makes the network prone to overfitting the training set.

Common regularization techniques:

- Early Stopping 
- Batch Normalization
- $\ell_{1}$ and $\ell_{2}$ Regularization
- Dropout
- Max-norm regularization

## 5.1 $\ell_{1}$ and $\ell_{2}$ Regularization

In [None]:
layer = keras.layers.Dense(100, activation="elu",
                           kernel_initializer="he_normal",
                           kernel_regularizer=keras.regularizers.l2(0.01))
# or l1(0.1) for ℓ1 regularization with a factor of 0.1
# or l1_l2(0.1, 0.01) for both ℓ1 and ℓ2 regularization, with factors 0.1 and 0.01 respectively
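
What the regularizer adds to the loss can be written out by hand. The sketch below assumes the usual definitions (factor times the sum of squared weights for ℓ2, factor times the sum of absolute weights for ℓ1); the function names and the weight matrix are made up for the example.

```python
import numpy as np

def l2_penalty(W, factor):
    """l2 regularization loss: factor * sum of squared weights."""
    return factor * np.sum(np.square(W))

def l1_penalty(W, factor):
    """l1 regularization loss: factor * sum of absolute weights."""
    return factor * np.sum(np.abs(W))

W = np.array([[1.0, -2.0],
              [0.5,  0.0]])      # made-up weight matrix
print("l2 penalty:", l2_penalty(W, 0.01))
print("l1 penalty:", l1_penalty(W, 0.1))
```

The penalty is added to the training loss only, so the loss reported during training can be noticeably higher than the loss the same weights would give without regularization.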

In [None]:
model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),
    keras.layers.Dense(300, activation="elu",
                       kernel_initializer="he_normal",
                       kernel_regularizer=keras.regularizers.l2(0.01)),
    keras.layers.Dense(100, activation="elu",
                       kernel_initializer="he_normal",
                       kernel_regularizer=keras.regularizers.l2(0.01)),
    keras.layers.Dense(10, activation="softmax",
                       kernel_regularizer=keras.regularizers.l2(0.01))
])
model.compile(loss="sparse_categorical_crossentropy", optimizer="nadam", metrics=["accuracy"])
n_epochs = 2
history = model.fit(X_train_scaled, y_train, epochs=n_epochs,
                    validation_data=(X_valid_scaled, y_valid))

Since we use the same activation function, initialization strategy and regularizer in all hidden layers, we would otherwise have to repeat the same arguments for every layer. We can avoid this with the following wrapper:

In [None]:
from functools import partial

RegularizedDense = partial(keras.layers.Dense,
                           activation="elu",
                           kernel_initializer="he_normal",
                           kernel_regularizer=keras.regularizers.l2(0.01))

model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),
    RegularizedDense(300),
    RegularizedDense(100),
    RegularizedDense(10, activation="softmax")
])
model.compile(loss="sparse_categorical_crossentropy", optimizer="nadam", metrics=["accuracy"])
n_epochs = 2
history = model.fit(X_train_scaled, y_train, epochs=n_epochs,
                    validation_data=(X_valid_scaled, y_valid))

## 5.2 Dropout

- One of the most popular regularization techniques for deep neural networks
- At every training step, every neuron (including the input neurons, but always excluding the output neurons) has a probability p (the dropout rate) of being temporarily "dropped out".
- A dropped neuron is entirely ignored during this training step, but it may be active during the next step.
- After training, neurons don't get dropped anymore.

![figure11.9](images/figure11.9.png)

With dropout regularization, at each training iteration a random subset of all neurons in one or more layers is "dropped out". These neurons output 0 at this iteration.

### Why does dropout work?

- Neurons trained with dropout cannot co-adapt with their neighbouring neurons. They have to be as useful as possible on their own.
- They cannot rely on just a few input neurons. They must pay attention to each of their input neurons.
- They end up being less sensitive to slight changes in the inputs.
- We get a more robust network that generalizes better

**Another way to understand**
- A unique neural network is generated at each training step
- Since each neuron can either be present or absent, we have a total of $2^{N}$ possible neural networks, where $N$ is the total number of droppable neurons.
- This is such a huge number that it is virtually impossible for the same neural network to be sampled twice.
- Once we run 10,000 training steps, we have 10,000 different neural networks.
- These networks are not independent because they share many of their weights, but they are all different.
- The resulting neural network is like an average ensemble of all these smaller neural networks.

### Rescaling

- If our dropout rate is p = 0.5, then during testing a neuron is connected to twice as many input neurons as it was (on average) during training.
- If we don't compensate, each neuron will get a total input signal roughly twice as large as what the network was trained on.
- To compensate, we multiply each neuron's input connection weights by the keep probability 1 - p = 1 - 0.5 = 0.5 after training.
- Alternatively, we can divide each neuron's output by the keep probability during training.
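The rescaling argument can be checked numerically. The following is a minimal plain-Python sketch, not the Keras implementation: the helper names (`dropout`, `total_input`, `mean_train_input`) and the toy activations and weights are assumptions made for illustration. It compares the average input signal a single neuron receives during training, with and without dividing the kept activations by the keep probability, against the signal it receives at test time.

```python
import random

def dropout(activations, rate, rng, rescale):
    """Zero each activation with probability `rate` (training-time behaviour).

    If `rescale` is True, surviving activations are divided by the keep
    probability (1 - rate) so that the expected total signal is unchanged.
    """
    keep_prob = 1.0 - rate
    scale = 1.0 / keep_prob if rescale else 1.0
    return [a * scale if rng.random() < keep_prob else 0.0 for a in activations]

def total_input(activations, weights):
    """Weighted sum of inputs received by a single downstream neuron."""
    return sum(a * w for a, w in zip(activations, weights))

rng = random.Random(42)
activations = [1.0] * 1000  # toy outputs of the previous layer (assumption)
weights = [0.01] * 1000     # toy input connection weights (assumption)
rate = 0.5

def mean_train_input(rescale, n_steps=200):
    """Average input signal over many training steps with dropout active."""
    signals = [total_input(dropout(activations, rate, rng, rescale), weights)
               for _ in range(n_steps)]
    return sum(signals) / n_steps

test_signal = total_input(activations, weights)  # no neuron is dropped at test time
train_plain = mean_train_input(rescale=False)    # plain dropout
train_rescaled = mean_train_input(rescale=True)  # kept activations divided by keep prob

print("test-time signal:              ", test_signal)
print("training signal, no rescaling: ", train_plain)
print("training signal, rescaled:     ", train_rescaled)
print("test-time signal * (1 - p):    ", test_signal * (1 - rate))
```

Without rescaling, the training signal is roughly half the test-time signal, which is why the weights must be multiplied by 1 - p after training. With rescaling during training, the two already match on average.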

In [None]:
model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),
    keras.layers.Dropout(rate=0.2),
    keras.layers.Dense(300, activation="elu", kernel_initializer="he_normal"),
    keras.layers.Dropout(rate=0.2),
    keras.layers.Dense(100, activation="elu", kernel_initializer="he_normal"),
    keras.layers.Dropout(rate=0.2),
    keras.layers.Dense(10, activation="softmax")
])
model.compile(loss="sparse_categorical_crossentropy", optimizer="nadam", metrics=["accuracy"])
n_epochs = 2
history = model.fit(X_train_scaled, y_train, epochs=n_epochs,
                    validation_data=(X_valid_scaled, y_valid))

### Tips

- If the model is overfitting, we can increase the dropout rate.
- If the model is underfitting, we can decrease the dropout rate.
- For large layers, we can increase the dropout rate.
- For small layers, we can reduce the dropout rate.
- Many state-of-the-art architectures only use dropout after the last hidden layer.

### Limitation

Dropout significantly slows down convergence, but it usually results in a much better model when tuned properly.

## 5.3 Max-Norm Regularization

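For each neuron, max-norm regularization constrains the weight vector $\mathbf{w}$ of its incoming connections such that $\lVert \mathbf{w} \rVert_{2} \le r$, where $r$ is the max-norm hyperparameter and $\lVert \cdot \rVert_{2}$ is the $\ell_{2}$ norm.
- Reducing $r$ increases the amount of regularization and helps reduce overfitting.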
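Max-norm regularization is typically implemented by computing $\lVert \mathbf{w} \rVert_{2}$ after each training step and rescaling $\mathbf{w}$ if needed ($\mathbf{w} \leftarrow \mathbf{w} \, r / \lVert \mathbf{w} \rVert_{2}$). The snippet below is a small plain-Python sketch of that rescaling step for a single neuron; the function name `apply_max_norm` and the example vectors are assumptions for illustration and are not part of Keras.

```python
import math

def apply_max_norm(w, r):
    """Return w rescaled so that its L2 norm is at most r.

    Vectors that already satisfy the constraint are returned unchanged.
    """
    norm = math.sqrt(sum(x * x for x in w))
    if norm <= r:
        return list(w)
    return [x * r / norm for x in w]

w_big = [3.0, 4.0]    # L2 norm 5.0, violates the constraint for r = 1.0
w_small = [0.3, 0.4]  # L2 norm 0.5, already within the constraint

clipped = apply_max_norm(w_big, r=1.0)
untouched = apply_max_norm(w_small, r=1.0)

print("rescaled: ", clipped)
print("unchanged:", untouched)
```

Only the length of the weight vector changes, not its direction, so a smaller $r$ shrinks large weight vectors more aggressively.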

In [None]:
layer = keras.layers.Dense(100, activation="selu", kernel_initializer="lecun_normal",
                           kernel_constraint=keras.constraints.max_norm(1.))

In [None]:
MaxNormDense = partial(keras.layers.Dense,
                       activation="selu", kernel_initializer="lecun_normal",
                       kernel_constraint=keras.constraints.max_norm(1.))

model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),
    MaxNormDense(300),
    MaxNormDense(100),
    keras.layers.Dense(10, activation="softmax")
])
model.compile(loss="sparse_categorical_crossentropy", optimizer="nadam", metrics=["accuracy"])
n_epochs = 2
history = model.fit(X_train_scaled, y_train, epochs=n_epochs,
                    validation_data=(X_valid_scaled, y_valid))

# Summary and Practical Guidelines

![table11.3](images/table11.3_2.png)

If the network is a simple stack of dense layers:

![table11.4](images/table11.4.png)

- Normalize input features
- Try to reuse parts of pretrained neural network if we can find one that solves similar problem
- Use unsupervised pretraining if we have a lot of unlabeled data
- Use pretraining on an auxiliary task if we have a lot of labeled data for a similar task
- If we need a sparse model, use $\ell_{1}$ regularization.

# Exercises

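**Is it OK to initialize all the weights to the same value as long as that value is selected randomly using He initialization?**

No. All weights should be sampled independently, so they should not all have the same initial value. If every weight starts at the same value, all neurons in a given layer compute the same output and receive the same gradient, so they stay identical throughout training; random sampling is what breaks this symmetry.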
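The symmetry problem can be seen on a tiny example. The sketch below uses plain Python with a hypothetical one-hidden-layer tanh network; the function name `hidden_weight_gradients`, the network shape, and the toy data are all assumptions for illustration. It computes the gradient of a squared-error loss with respect to the hidden-layer weights, once with every weight set to the same value and once with weights sampled independently.

```python
import math
import random

def hidden_weight_gradients(W1, w2, x, y):
    """Gradient of 0.5 * (y_hat - y)**2 w.r.t. the hidden weights W1.

    W1[j] holds the input weights of hidden unit j (tanh activation),
    w2[j] is the weight from hidden unit j to the single linear output.
    """
    h = [math.tanh(sum(w * xi for w, xi in zip(wj, x))) for wj in W1]
    y_hat = sum(w2j * hj for w2j, hj in zip(w2, h))
    err = y_hat - y
    return [[err * w2j * (1.0 - hj ** 2) * xi for xi in x]
            for w2j, hj in zip(w2, h)]

x, y = [1.0, -2.0], 1.0  # one toy training instance (assumption)
n_hidden = 3

# Case 1: every weight has the same value.
same = 0.3
grads_same = hidden_weight_gradients(
    [[same] * len(x) for _ in range(n_hidden)], [same] * n_hidden, x, y)

# Case 2: weights sampled independently, He-style std = sqrt(2 / fan_in).
rng = random.Random(0)
std = math.sqrt(2.0 / len(x))
W1_rand = [[rng.gauss(0.0, std) for _ in x] for _ in range(n_hidden)]
w2_rand = [rng.gauss(0.0, math.sqrt(2.0 / n_hidden)) for _ in range(n_hidden)]
grads_rand = hidden_weight_gradients(W1_rand, w2_rand, x, y)

print("same init  :", grads_same)
print("random init:", grads_rand)
```

With identical weights every hidden unit receives exactly the same gradient, so a gradient step keeps them identical and the layer behaves like a single neuron. With independent sampling the gradients differ and the units can specialize.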

**Is it OK to initialize the bias terms to 0?**

It is fine to initialize the bias term to zero.

**Name three advantages of the SELU activation function over ReLU**

- It can take negative values, so the average output of the neurons in any given layer is typically closer to zero than when using the ReLU activation function. This helps to alleviate the vanishing gradients problem.
- It always has a nonzero derivative, which avoids the dead neuron issue that can affect ReLU units.
- When the conditions are right (the model is sequential, the weights are initialized using LeCun initialization, the inputs are standardized, and there is no incompatible layer or regularization such as dropout or $\ell_{1}$ regularization), the SELU activation function ensures that the model is self-normalized, which solves the exploding/vanishing gradients problem.
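The first and third points can be illustrated with a quick simulation. The sketch below is plain Python, not a Keras layer: it feeds standard-normal pre-activations (an assumption standing in for standardized inputs with LeCun-initialized weights) through SELU and ReLU and compares the mean and variance of the outputs. The constants `ALPHA` and `SCALE` are the standard SELU parameters; the helper names are made up for this example.

```python
import math
import random

ALPHA = 1.6732632423543772  # standard SELU alpha
SCALE = 1.0507009873554805  # standard SELU scale (lambda)

def selu(z):
    """SELU: scaled ELU, negative for z < 0, with a nonzero slope everywhere."""
    return SCALE * (z if z > 0 else ALPHA * (math.exp(z) - 1.0))

def relu(z):
    return max(0.0, z)

def mean(values):
    return sum(values) / len(values)

def variance(values):
    m = mean(values)
    return sum((v - m) ** 2 for v in values) / len(values)

rng = random.Random(0)
zs = [rng.gauss(0.0, 1.0) for _ in range(200_000)]  # standardized pre-activations

selu_out = [selu(z) for z in zs]
relu_out = [relu(z) for z in zs]

print("SELU mean, variance:", mean(selu_out), variance(selu_out))
print("ReLU mean, variance:", mean(relu_out), variance(relu_out))
```

The SELU outputs keep a mean close to 0 and a variance close to 1, which is the self-normalizing behaviour, while the ReLU outputs have a clearly positive mean because negative inputs are all mapped to 0.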

**In which cases would you want to use each of the following activation functions: SELU, leaky ReLU (and its variants), ReLU, tanh, logistic, softmax?**

- SELU: Good default.
- ReLU variants: If we need the neural network to be as fast as possible
- ReLU: Simple and often people's preferred option despite the fact that it is generally outperformed by SELU and leaky ReLU.
- Tanh: Useful in the output layer if we need to output a number between -1 and 1. Rarely used in hidden layers
- Logistic: Useful in the output layer when we need to estimate a probability. Rarely used in hidden layers
- Softmax: Useful in output layer to output probabilities for mutually exclusive classes. Rarely used in hidden layers

**What may happen if you set the ``momentum`` hyperparameter too close to 1 when using an SGD optimizer?**

- The algorithm will pick up a lot of speed, but its momentum will carry it right past the minimum.
- Then it may slow down and come back, accelerate again, overshoot again and so on.
- It may oscillate this way many times before converging so overall it will take much longer to converge than with a smaller momentum value
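This overshooting behaviour is easy to reproduce on a toy problem. The sketch below is a plain-Python illustration under stated assumptions: it minimizes the one-dimensional cost $J(\theta) = \theta^{2}/2$ with the momentum update $m \leftarrow \beta m - \eta \nabla J(\theta)$, $\theta \leftarrow \theta + m$, and the function name `run_momentum`, the learning rate, and the tolerance are arbitrary choices. It counts the steps needed to converge and how often $\theta$ jumps over the minimum.

```python
def run_momentum(beta, lr=0.1, tol=1e-6, max_steps=100_000):
    """Momentum optimization on J(theta) = theta**2 / 2, starting at theta = 1.

    Returns (number of steps until |theta| and |m| are both below tol,
             number of times theta crossed the minimum at 0).
    """
    theta, m = 1.0, 0.0
    crossings = 0
    for step in range(1, max_steps + 1):
        grad = theta                 # dJ/dtheta for J = theta**2 / 2
        m = beta * m - lr * grad     # momentum vector update
        new_theta = theta + m
        if new_theta * theta < 0:    # sign change: overshot the minimum
            crossings += 1
        theta = new_theta
        if abs(theta) < tol and abs(m) < tol:
            return step, crossings
    return max_steps, crossings

steps_low, crossings_low = run_momentum(beta=0.9)
steps_high, crossings_high = run_momentum(beta=0.99)

print("beta=0.9 :", steps_low, "steps,", crossings_low, "overshoots")
print("beta=0.99:", steps_high, "steps,", crossings_high, "overshoots")
```

On this toy cost, the run with momentum closer to 1 overshoots the minimum many more times and needs far more steps to settle.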

**Does dropout slow down training? Does it slow down inference (making predictions on new instances)?**

- Yes, dropout slows down training.
- However it has no impact on inference speed since it is only turned on during training.