**AUTHOR: RAIHAN SALMAN BAEHAQI (1103220180)**

**PART II** 

**Neural Networks and Deep Learning** 

---

**CHAPTER 11 - Training Deep Neural Networks** 

---

In Chapter 10 we introduced artificial neural networks and trained our first deep neural networks. But they were shallow nets, with just a few hidden layers. What if you need to tackle a complex problem, such as detecting hundreds of types of objects in high-resolution images? You may need to train a much deeper DNN, perhaps with 10 layers or many more, each containing hundreds of neurons, linked by hundreds of thousands of connections. 

Training a deep DNN isn't a walk in the park. Here are some of the problems you could run into: 

- You may be faced with the tricky **vanishing gradients problem** or the related **exploding gradients problem**. This is when the gradients grow smaller and smaller, or larger and larger, when flowing backward through the DNN during training. Both of these problems make lower layers very hard to train. 

- You might not have enough training data for such a large network, or it might be too costly to label. 

- Training may be extremely slow. 

- A model with millions of parameters would severely risk overfitting the training set, especially if there are not enough training instances or if they are too noisy. 

In this chapter we will go through each of these problems and present techniques to solve them. We will start by exploring the vanishing and exploding gradients problems and some of their most popular solutions. Next, we will look at transfer learning and unsupervised pretraining, which can help you tackle complex tasks even when you have little labeled data. Then we will discuss various optimizers that can speed up training large models tremendously. Finally, we will go through a few popular regularization techniques for large neural networks. 

---

## **The Vanishing/Exploding Gradients Problems** 

The backpropagation algorithm works by going from the output layer to the input layer, propagating the error gradient along the way. Once the algorithm has computed the gradient of the cost function with regard to each parameter in the network, it uses these gradients to update each parameter with a Gradient Descent step. 

Unfortunately, gradients often get smaller and smaller as the algorithm progresses down to the lower layers. As a result, the Gradient Descent update leaves the lower layers' connection weights virtually unchanged, and training never converges to a good solution. We call this the **vanishing gradients problem**. 

In some cases, the opposite can happen: the gradients can grow bigger and bigger until layers get insanely large weight updates and the algorithm diverges. This is the **exploding gradients problem**, which surfaces most often in recurrent neural networks. 

More generally, deep neural networks suffer from unstable gradients; different layers may learn at widely different speeds. 

This unfortunate behavior was empirically observed long ago, and it was one of the reasons deep neural networks were mostly abandoned in the early 2000s. It wasn't clear what caused the gradients to be so unstable when training a DNN, but some light was shed in a 2010 paper by Xavier Glorot and Yoshua Bengio. They found suspects including the combination of the popular logistic sigmoid activation function and the weight initialization technique that was most popular at the time. 

**Figure 11-1. Logistic activation function saturation**   
![Figure11-1.jpg](./11.Chapter-11/Figure11-1.jpg) 

When inputs become large (negative or positive), the logistic function saturates at 0 or 1, with a derivative extremely close to 0. Thus, when backpropagation kicks in it has virtually no gradient to propagate back through the network; and what little gradient exists keeps getting diluted as backpropagation progresses down through the top layers, so there is really nothing left for the lower layers. 
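
To put rough numbers on the saturation effect, here is a minimal plain-Python sketch of the logistic function and its derivative. The helper names are illustrative only, and the 10-layer figure is a deliberately simplified best case that ignores the connection weights.

```python
import math

def logistic(z):
    """Logistic (sigmoid) function: 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def logistic_derivative(z):
    """Derivative of the logistic function: sigma(z) * (1 - sigma(z))."""
    s = logistic(z)
    return s * (1.0 - s)

# The derivative peaks at 0.25 (z = 0) and collapses as |z| grows.
for z in (0.0, 2.0, 5.0, 10.0):
    print(f"z = {z:5.1f}  derivative = {logistic_derivative(z):.6f}")

# Simplified best case (an assumption for illustration): all 10 layers sit
# exactly at the peak. Ignoring the weights, the gradient reaching the lowest
# layer is still scaled by 0.25 ** 10.
best_case_factor = logistic_derivative(0.0) ** 10
print(f"best-case scaling over 10 layers: {best_case_factor:.2e}")
```

Even in this best case the gradient shrinks by roughly six orders of magnitude, and saturated neurons make it far worse.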

---

### **Glorot and He Initialization** 

In their paper, Glorot and Bengio propose a way to significantly alleviate the unstable gradients problem. They point out that we need the signal to flow properly in both directions: in the forward direction when making predictions, and in the reverse direction when backpropagating gradients. We don't want the signal to die out, nor do we want it to explode and saturate. 

For the signal to flow properly, the authors argue that we need the variance of the outputs of each layer to be equal to the variance of its inputs, and we need the gradients to have equal variance before and after flowing through a layer in the reverse direction. 

**Analogy:** If you set a microphone amplifier's knob too close to zero, people won't hear your voice, but if you set it too close to the max, your voice will be saturated and people won't understand what you are saying. Now imagine a chain of such amplifiers: they all need to be set properly in order for your voice to come out loud and clear at the end of the chain. Your voice has to come out of each amplifier at the same amplitude as it came in. 

It is actually not possible to guarantee both unless the layer has an equal number of inputs and neurons (these numbers are called the **fan-in** and **fan-out** of the layer), but Glorot and Bengio proposed a good compromise that has proven to work very well in practice. 

**Equation 11-1. Glorot initialization (when using the logistic activation function)**   
![Eq11-1.jpg](./11.Chapter-11/Eq11-1.jpg) 

The connection weights of each layer must be initialized randomly as follows: 
- **Normal distribution** with mean 0 and variance σ² = 1 / fan_avg, where fan_avg = (fan_in + fan_out) / 2 
- OR **uniform distribution** between –r and +r, with r = √(3 / fan_avg) 

This initialization strategy is called **Xavier initialization** or **Glorot initialization**, after the paper's first author. 

If you replace fan_avg with fan_in, you get **LeCun initialization** (proposed by Yann LeCun in the 1990s). 

**Table 11-1. Initialization parameters for each type of activation function**   
![Table11-1.jpg](./11.Chapter-11/Table11-1.jpg) 

The initialization strategy for ReLU and its variants is sometimes called **He initialization**, after the paper's first author (Kaiming He).
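
The three strategies differ only in the variance they target, which a few lines of plain Python can make concrete. This is a sketch, not Keras code: `init_params` is a hypothetical helper, and the He and LeCun variances (2 / fan_in and 1 / fan_in) are assumed to match Table 11-1.

```python
import math

def init_params(fan_in, fan_out, scheme="glorot"):
    """Return (sigma, r): std dev for the normal variant, bound for the uniform one."""
    fan_avg = (fan_in + fan_out) / 2
    variance = {
        "glorot": 1.0 / fan_avg,   # none, tanh, logistic, softmax
        "he": 2.0 / fan_in,        # ReLU and its variants
        "lecun": 1.0 / fan_in,     # SELU
    }[scheme]
    sigma = math.sqrt(variance)
    # A uniform distribution on [-r, r] has variance r**2 / 3,
    # so r = sqrt(3 * variance) gives the same variance as the normal variant.
    r = math.sqrt(3.0 * variance)
    return sigma, r

for scheme in ("glorot", "he", "lecun"):
    sigma, r = init_params(fan_in=300, fan_out=100, scheme=scheme)
    print(f"{scheme:6s}  sigma = {sigma:.4f}  r = {r:.4f}")
```

For the Glorot case this reproduces r = √(3 / fan_avg) from Equation 11-1.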

In [None]:
# By default, Keras uses Glorot initialization with a uniform distribution
# For He initialization with ReLU:
keras.layers.Dense(10, activation="relu", kernel_initializer="he_normal")

# For He initialization with uniform distribution based on fan_avg:
he_avg_init = keras.initializers.VarianceScaling(scale=2., mode='fan_avg',
                                                 distribution='uniform')
keras.layers.Dense(10, activation="sigmoid", kernel_initializer=he_avg_init)

### **Nonsaturating Activation Functions** 

One of the insights in the 2010 paper by Glorot and Bengio was that the vanishing/exploding gradients problems were in part due to a poor choice of activation function. Until then, most people had assumed that if nature chose to use roughly sigmoid activation functions in biological neurons, they must be an excellent choice. But it turns out that other activation functions behave much better in deep neural networks—in particular, the ReLU activation function, mostly because it does not saturate for positive values (and because it is fast to compute). 

Unfortunately, the ReLU activation function is not perfect. It suffers from a problem known as **dying ReLUs**: during training, some neurons effectively "die," meaning they stop outputting anything other than 0. In some cases, you may find that half of your network's neurons are dead, especially if you used a large learning rate. A neuron dies when its weights get tweaked in such a way that the weighted sum of its inputs is negative for all instances in the training set. When this happens, it just keeps outputting 0s, and Gradient Descent does not affect it anymore since the gradient of the ReLU function is 0 when its input is negative. 

To solve this problem, you may want to use a variant of the ReLU function, such as the **leaky ReLU**. 

**Figure 11-2. Leaky ReLU**   
![Figure11-2.jpg](./11.Chapter-11/Figure11-2.jpg) 

This function is defined as LeakyReLU_α(z) = max(αz, z). The hyperparameter α defines how much the function "leaks": it is the slope of the function for z < 0 and is typically set to 0.01. This small slope ensures that leaky ReLUs never die; they can go into a long coma, but they have a chance to eventually wake up. 
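
The difference from the strict ReLU is easiest to see in the gradients. The following plain-Python sketch (function names are illustrative) evaluates both functions and their slopes for a negative input.

```python
def relu(z):
    """Strict ReLU: max(0, z)."""
    return max(0.0, z)

def leaky_relu(z, alpha=0.01):
    """LeakyReLU_alpha(z) = max(alpha * z, z), assuming 0 <= alpha < 1."""
    return max(alpha * z, z)

def relu_gradient(z):
    """Slope of ReLU: 0 for negative inputs, so a 'dead' neuron gets no updates."""
    return 1.0 if z > 0 else 0.0

def leaky_relu_gradient(z, alpha=0.01):
    """Slope of leaky ReLU: alpha for negative inputs, so some gradient still flows."""
    return 1.0 if z > 0 else alpha

z = -5.0
print("ReLU:      ", relu(z), relu_gradient(z))
print("Leaky ReLU:", leaky_relu(z), leaky_relu_gradient(z))
```

With a negative weighted sum, the ReLU gradient is exactly 0 while the leaky ReLU still passes back a fraction α of the gradient, which is what gives the neuron a chance to recover.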

A 2015 paper compared several variants of the ReLU activation function, and one of its conclusions was that the leaky variants always outperformed the strict ReLU activation function in its experiments. The paper also evaluated two other variants: 

- **Randomized leaky ReLU (RReLU)** - α is picked randomly in a given range during training and is fixed to an average value during testing 
- **Parametric leaky ReLU (PReLU)** - α is authorized to be learned during training (instead of being a hyperparameter, it becomes a parameter that can be modified by backpropagation like any other parameter) 

A separate 2015 paper, by Djork-Arné Clevert et al., proposed the **exponential linear unit (ELU)** activation function, which outperformed all the ReLU variants in the authors' experiments: training time was reduced, and the neural network performed better on the test set. 

**Equation 11-2. ELU activation function**   
![Eq11-2.jpg](./11.Chapter-11/Eq11-2.jpg) 

**Figure 11-3. ELU activation function**   
![Figure11-3.jpg](./11.Chapter-11/Figure11-3.jpg) 

It looks a lot like the ReLU function, with a few major differences: 
- It takes on negative values when z < 0, which allows the unit to have an average output closer to 0 and helps alleviate the vanishing gradients problem 
- It has a nonzero gradient for z < 0, which avoids the dying units issue 
- If α is equal to 1 then the function is smooth everywhere, including around z = 0, which helps speed up Gradient Descent 

The main drawback of the ELU activation function is that it is slower to compute than the ReLU function and its variants (due to the use of the exponential function). Its faster convergence rate during training compensates for that slow computation, but at test time an ELU network will be slower than a ReLU network. 
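
As a rough illustration of these properties, here is a plain-Python sketch of ELU and its derivative. It assumes Equation 11-2 has the usual form, α(exp(z) − 1) for z < 0 and z otherwise; the function names are illustrative.

```python
import math

def elu(z, alpha=1.0):
    """ELU_alpha(z) = alpha * (exp(z) - 1) if z < 0, else z."""
    return z if z >= 0 else alpha * (math.exp(z) - 1.0)

def elu_gradient(z, alpha=1.0):
    """Slope of ELU: alpha * exp(z) for z < 0 (never exactly 0), else 1."""
    return 1.0 if z >= 0 else alpha * math.exp(z)

for z in (-5.0, -1.0, -0.001, 0.0, 2.0):
    print(f"z = {z:7.3f}  elu = {elu(z):8.5f}  gradient = {elu_gradient(z):.5f}")
```

The output approaches −α for very negative inputs, the gradient stays positive everywhere, and with α = 1 the slope just left of 0 is close to 1, matching the slope on the right. The call to `exp` for negative inputs is the extra cost mentioned above.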

A 2017 paper by Günter Klambauer et al. introduced the **Scaled ELU (SELU)** activation function. The authors showed that if you build a neural network composed exclusively of a stack of dense layers, and if all hidden layers use the SELU activation function, then the network will **self-normalize**: the output of each layer will tend to preserve a mean of 0 and standard deviation of 1 during training, which solves the vanishing/exploding gradients problem. 

However, self-normalization requires: 
1. The input features must be standardized (mean 0 and standard deviation 1) 
2. Every hidden layer's weights must be initialized with LeCun normal initialization 
3. The network's architecture must be sequential 

If you cannot guarantee these, you should go with ELU rather than SELU. 

**So which activation function should you use?** 

In general, SELU > ELU > leaky ReLU (and its variants) > ReLU > tanh > logistic. If the network's architecture prevents it from self-normalizing, then ELU may perform better than SELU. If you care a lot about runtime latency, then you may prefer leaky ReLU. If you don't want to tweak yet another hyperparameter, use the default α values (e.g., 0.01 for the leaky ReLU). If you have spare time and computing power, use cross-validation to evaluate several activation functions.

In [None]:
# Leaky ReLU
model = keras.models.Sequential([
    keras.layers.Dense(10, kernel_initializer="he_normal"),
    keras.layers.LeakyReLU(alpha=0.2),
])

# PReLU
keras.layers.PReLU()

# SELU (requires LeCun normal initialization and standardized inputs)
layer = keras.layers.Dense(10, activation="selu",
                           kernel_initializer="lecun_normal")

### **Batch Normalization** 

Batch Normalization (BN) is a technique proposed by Sergey Ioffe and Christian Szegedy in a 2015 paper that addresses the vanishing/exploding gradients problems by adding an operation in the model just before or after the activation function of each hidden layer. This operation zero-centers and normalizes each input, then scales and shifts the result using two new parameter vectors per layer: one for scaling, one for shifting. 

**Equation 11-3. Batch Normalization algorithm**   
![Eq11-3.jpg](./11.Chapter-11/Eq11-3.jpg) 

In this equation: 
- μ_B is the empirical mean, evaluated over the whole mini-batch B 
- σ_B is the empirical standard deviation, also evaluated over the whole mini-batch 
- m_B is the number of instances in the mini-batch 
- ε is a tiny number that avoids division by zero (typically 10⁻⁵) 
- γ is the output scale parameter 
- β is the output shift parameter 

BN adds four parameters per input: γ, β, μ, and σ. The first two, γ and β, are trainable and learned through backpropagation. The last two, μ and σ, are not: they are moving averages of the mini-batch means and standard deviations, estimated during training but used only after training (to replace the batch input means and standard deviations in Equation 11-3). 
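
The operation itself is short. Below is a minimal plain-Python sketch for a single input feature over one mini-batch, assuming Equation 11-3 has the standard form (center, divide by √(σ_B² + ε), then scale by γ and shift by β). The function names and the 0.99 decay used for the moving average are illustrative assumptions, not Keras internals.

```python
import math

def batch_norm_train(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Apply BN to one feature over a mini-batch (a list of floats)."""
    m_B = len(batch)
    mu_B = sum(batch) / m_B                             # empirical mean
    var_B = sum((x - mu_B) ** 2 for x in batch) / m_B   # empirical variance
    x_hat = [(x - mu_B) / math.sqrt(var_B + eps) for x in batch]
    z = [gamma * xh + beta for xh in x_hat]             # scale and shift
    return z, mu_B, var_B

def update_moving_average(moving, batch_stat, decay=0.99):
    """Exponential moving average of a batch statistic, kept for use after training."""
    return decay * moving + (1.0 - decay) * batch_stat

batch = [2.0, 4.0, 6.0, 8.0]
z, mu_B, var_B = batch_norm_train(batch)
moving_mean = update_moving_average(0.0, mu_B)

print("normalized:", [round(v, 4) for v in z])
print("batch mean:", mu_B, " batch variance:", var_B)
print("moving mean after one update:", moving_mean)
```

With γ = 1 and β = 0 the outputs have mean 0 and a variance very close to 1; training then adjusts γ and β to whatever scale and offset suit the next layer.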

**Benefits:** 
- Reduces the vanishing gradients problem 
- Networks are much less sensitive to weight initialization 
- Can use much larger learning rates, significantly speeding up the learning process 
- Acts as a regularizer, reducing the need for other regularization techniques 

**Implementation:**

In [None]:
# BN after each hidden layer
model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(300, activation="elu", kernel_initializer="he_normal"),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(100, activation="elu", kernel_initializer="he_normal"),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(10, activation="softmax")
])

model.summary()

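Some researchers prefer to place the BN layer **before** the activation function rather than after. There is some debate about this; which is preferable seems to depend on the task. To place BN before the activation, remove the activation function from the hidden layers and add it back as a separate Activation layer after each BN layer. Since a BN layer includes one shift parameter per input, the bias term of the preceding layer becomes redundant and can be dropped with use_bias=False: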

In [None]:
# BN before activation
model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(300, kernel_initializer="he_normal", use_bias=False),
    keras.layers.BatchNormalization(),
    keras.layers.Activation("elu"),
    keras.layers.Dense(100, kernel_initializer="he_normal", use_bias=False),
    keras.layers.BatchNormalization(),
    keras.layers.Activation("elu"),
    keras.layers.Dense(10, activation="softmax")
])

### **Gradient Clipping** 

Another popular technique to mitigate the exploding gradients problem is to clip the gradients during backpropagation so that they never exceed some threshold. This is called **Gradient Clipping**. This technique is most often used in recurrent neural networks, as Batch Normalization is tricky to use in RNNs. 

For other types of networks, BN is usually sufficient. 
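
The two clipping styles behave quite differently, as this plain-Python sketch suggests. The function names are illustrative stand-ins for what an optimizer does internally, and the example gradient is made up.

```python
import math

def clip_by_value(grad, clip_value):
    """Clip every component independently to [-clip_value, clip_value]."""
    return [max(-clip_value, min(clip_value, g)) for g in grad]

def clip_by_norm(grad, clip_norm):
    """Rescale the whole vector if its l2 norm exceeds clip_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= clip_norm:
        return list(grad)
    return [g * clip_norm / norm for g in grad]

grad = [0.9, 100.0]
by_value = clip_by_value(grad, 1.0)
by_norm = clip_by_norm(grad, 1.0)
print("clipped by value:", by_value)
print("clipped by norm: ", by_norm)
```

Clipping by value turns a gradient that pointed almost entirely along the second axis into one that points roughly along the diagonal, so it can change the direction of the update. Clipping by norm preserves the direction but nearly eliminates the first component.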

In Keras, implementing Gradient Clipping is just a matter of setting the clipvalue or clipnorm argument when creating an optimizer:

In [None]:
# Clip each component to [-1.0, 1.0]
optimizer = keras.optimizers.SGD(clipvalue=1.0)

# Clip by norm (if ||g|| > 1.0, then g ← g / ||g||)
optimizer = keras.optimizers.SGD(clipnorm=1.0)

model.compile(loss="mse", optimizer=optimizer)

---

## **Reusing Pretrained Layers** 

It is generally not a good idea to train a very large DNN from scratch: instead, you should always try to find an existing neural network that accomplishes a similar task to the one you are trying to tackle, then reuse the lower layers of this network. This is called **transfer learning**. It will not only speed up training considerably, but will also require significantly less training data. 

**Figure 11-4. Reusing pretrained layers**   
![Figure11-4.jpg](./11.Chapter-11/Figure11-4.jpg) 

For example, suppose you have access to a DNN that was trained to classify pictures into 100 different categories, including animals, plants, vehicles, and everyday objects. You now want to train a DNN to classify specific types of vehicles. These tasks are similar, even partly overlapping, so you should try to reuse parts of the first network. 

**How much should you reuse?** 
- If the new task is similar, try reusing most layers 
- If the new task is very different, keep only the first few layers 
- In general, the more training data you have, the more layers you can unfreeze 

**General guidelines:** 
1. Freeze all the reused layers first (make their weights non-trainable) 
2. Train your model and see how it performs 
3. Try unfreezing one or two of the top hidden layers to let backpropagation tweak them 
4. The more training data you have, the more layers you can unfreeze 
5. Use a lower learning rate when unfreezing reused layers

### **Transfer Learning with Keras**

In [None]:
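# Load model A
model_A = keras.models.load_model("my_model_A.h5")

# A model built directly on model_A.layers would share those layers, so training
# it would also modify model_A. To keep model A intact, build on a clone instead
# (clone_model() copies the architecture but not the weights)
model_A_clone = keras.models.clone_model(model_A)
model_A_clone.set_weights(model_A.get_weights())

# Reuse every layer except the output layer, then add a new output layer
model_B_on_A = keras.models.Sequential(model_A_clone.layers[:-1])
model_B_on_A.add(keras.layers.Dense(1, activation="sigmoid"))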

# Freeze reused layers
for layer in model_B_on_A.layers[:-1]:
    layer.trainable = False

model_B_on_A.compile(loss="binary_crossentropy", optimizer="sgd",
                     metrics=["accuracy"])

In [None]:
# Train, then unfreeze and fine-tune
history = model_B_on_A.fit(X_train_B, y_train_B, epochs=4,
                           validation_data=(X_valid_B, y_valid_B))

for layer in model_B_on_A.layers[:-1]:
    layer.trainable = True

optimizer = keras.optimizers.SGD(learning_rate=1e-4)  # Lower LR
model_B_on_A.compile(loss="binary_crossentropy", optimizer=optimizer,
                     metrics=["accuracy"])
history = model_B_on_A.fit(X_train_B, y_train_B, epochs=16,
                           validation_data=(X_valid_B, y_valid_B))

### **Unsupervised Pretraining** 

If you cannot find a model trained on a similar task, try unsupervised pretraining. If you can gather plenty of unlabeled training data, you can train an unsupervised model such as an autoencoder or a generative adversarial network (see Chapter 17). Then you can reuse the lower layers and add the output layer for your task on top, and fine-tune the final network using supervised learning. 

**Figure 11-5. Unsupervised pretraining**   
![Figure11-5.jpg](./11.Chapter-11/Figure11-5.jpg)

### **Pretraining on an Auxiliary Task** 

If you do not have much labeled training data, one last option is to train a first neural network on an auxiliary task for which you can easily obtain or generate labeled training data, then reuse the lower layers of that network for your actual task. The first neural network's lower layers will learn feature detectors that will likely be reusable by the second neural network. 

For example, if you want to build a system to recognize faces, you may only have a few pictures of each individual. Gathering hundreds of pictures of each person would not be practical. You could, however, gather a lot of pictures of random people on the web and train a first neural network to detect whether or not two different pictures feature the same person. Such a network would learn good feature detectors for faces, so reusing its lower layers would allow you to train a good face classifier that uses little training data. 

---

## **Faster Optimizers** 

Training a very large deep neural network can be painfully slow. So far we have seen four ways to speed up training: 
1. Applying a good initialization strategy for the connection weights 
2. Using a good activation function 
3. Using Batch Normalization 
4. Reusing parts of a pretrained network 

Another huge speed boost comes from using a faster optimizer than the regular Gradient Descent optimizer. In this section we will present the most popular algorithms: Momentum optimization, Nesterov Accelerated Gradient, AdaGrad, RMSProp, and finally Adam and Nadam optimization.

### **Momentum Optimization** 

Imagine a bowling ball rolling down a gentle slope on a smooth surface: it will start out slowly, but it will quickly pick up momentum until it eventually reaches terminal velocity. This is the very simple idea behind Momentum optimization. 

In contrast, regular Gradient Descent will simply take small regular steps down the slope, so it will take much more time to reach the bottom. 

**Equation 11-4. Momentum algorithm**   
![Eq11-4.jpg](./11.Chapter-11/Eq11-4.jpg) 

The momentum hyperparameter β is typically set to 0.9. Since the momentum vector is initialized to 0, the algorithm will take a few iterations to really pick up momentum.
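
To see the "terminal velocity" numerically, the sketch below applies the Momentum update to a constant gradient, which is an idealized uniform slope. It assumes Equation 11-4 has the form m ← βm − η∇J(θ), θ ← θ + m; the function name is illustrative.

```python
def momentum_step(theta, m, grad, eta=0.001, beta=0.9):
    """One Momentum update: m <- beta * m - eta * grad, then theta <- theta + m."""
    m = beta * m - eta * grad
    return theta + m, m

eta, beta, grad = 0.001, 0.9, 1.0   # constant gradient (idealized uniform slope)
theta, m = 0.0, 0.0
for _ in range(200):
    theta, m = momentum_step(theta, m, grad, eta, beta)

# With a constant gradient, m converges to -eta * grad / (1 - beta).
terminal_velocity = -eta * grad / (1.0 - beta)
print(f"velocity after 200 steps:    {m:.6f}")
print(f"predicted terminal velocity: {terminal_velocity:.6f}")
print(f"plain Gradient Descent step: {-eta * grad:.6f}")
```

With β = 0.9 the terminal velocity is 10 times the size of a plain Gradient Descent step, which is where the speedup comes from.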

In [None]:
optimizer = keras.optimizers.SGD(learning_rate=0.001, momentum=0.9)

### **Nesterov Accelerated Gradient** 

One small variant to Momentum optimization, proposed by Yurii Nesterov in 1983, is almost always faster than vanilla Momentum optimization. The idea is to measure the gradient of the cost function not at the local position θ but slightly ahead in the direction of the momentum, at θ + βm. 
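
The only change from regular Momentum is where the gradient is evaluated. This plain-Python sketch runs both on a toy cost J(θ) = θ² / 2, chosen purely for illustration; the `run` helper and the hyperparameter values are assumptions.

```python
def gradient(theta):
    """Gradient of the toy cost J(theta) = theta**2 / 2."""
    return theta

def run(nesterov, steps=200, eta=0.1, beta=0.9):
    """Minimize the toy cost with Momentum, optionally using the Nesterov look-ahead."""
    theta, m = 1.0, 0.0
    for _ in range(steps):
        # Nesterov measures the gradient slightly ahead, at theta + beta * m
        lookahead = theta + beta * m if nesterov else theta
        m = beta * m - eta * gradient(lookahead)
        theta = theta + m
    return theta

theta_momentum = run(nesterov=False)
theta_nag = run(nesterov=True)
print(f"regular Momentum: theta = {theta_momentum:.3e}")
print(f"Nesterov:         theta = {theta_nag:.3e}")
```

Both variants end up very close to the optimum at 0. On this toy problem the look-ahead damps the oscillations around the minimum, but a single one-dimensional example is only an illustration, not a benchmark.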

**Equation 11-5. Nesterov Accelerated Gradient algorithm**   
![Eq11-5.jpg](./11.Chapter-11/Eq11-5.jpg) 

**Figure 11-6. Regular Momentum optimization versus Nesterov Accelerated Gradient**   
![Figure11-6.jpg](./11.Chapter-11/Figure11-6.jpg)

In [None]:
optimizer = keras.optimizers.SGD(learning_rate=0.001, momentum=0.9, nesterov=True)

### **AdaGrad** 

Consider the elongated bowl problem again: Gradient Descent starts by quickly going down the steepest slope, which does not point straight toward the global optimum, then it very slowly goes down to the bottom of the valley. 

The AdaGrad algorithm corrects this by scaling down the gradient vector along the steepest dimensions. In effect it decays the learning rate, but faster for steep dimensions than for dimensions with gentler slopes. This is called an **adaptive learning rate**. It helps point the resulting updates more directly toward the global optimum. 

**Equation 11-6. AdaGrad algorithm**   
![Eq11-6.jpg](./11.Chapter-11/Eq11-6.jpg) 

**Figure 11-7. AdaGrad versus Gradient Descent: the former reaches the global optimum much faster**   
![Figure11-7.jpg](./11.Chapter-11/Figure11-7.jpg) 

AdaGrad frequently performs well for simple quadratic problems, but unfortunately it often stops too early when training neural networks. The learning rate gets scaled down so much that the algorithm ends up stopping entirely before reaching the global optimum.
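
A single AdaGrad step on a made-up elongated bowl shows the rescaling at work. The sketch assumes Equation 11-6 has the form s ← s + ∇J ⊗ ∇J, θ ← θ − η ∇J ⊘ √(s + ε); the cost function and names are illustrative.

```python
import math

def gradient(theta):
    """Gradient of the toy bowl J = (100 * theta0**2 + theta1**2) / 2."""
    return [100.0 * theta[0], theta[1]]

def adagrad_step(theta, s, grad, eta=0.1, eps=1e-10):
    """One AdaGrad update, applied elementwise."""
    s = [s_i + g_i * g_i for s_i, g_i in zip(s, grad)]
    theta = [t_i - eta * g_i / math.sqrt(s_i + eps)
             for t_i, g_i, s_i in zip(theta, grad, s)]
    return theta, s

theta, s = [1.0, 1.0], [0.0, 0.0]
grad = gradient(theta)
first_theta, first_s = adagrad_step(theta, s, grad)
print("gradient:        ", grad)
print("theta after step:", first_theta)
```

The raw gradient is 100 times larger along the steep axis, yet both coordinates move by the same amount (about η), so the update points at the optimum instead of down the steepest wall. Because s only ever grows, later steps keep shrinking, which is the early-stopping risk described above.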

### **RMSProp** 

AdaGrad runs the risk of slowing down too fast and never converging to the global optimum. The RMSProp algorithm fixes this by accumulating only the gradients from the most recent iterations (as opposed to all the gradients since the beginning of training), using exponential decay. 

**Equation 11-7. RMSProp algorithm**   
![Eq11-7.jpg](./11.Chapter-11/Eq11-7.jpg) 

The decay rate β is typically set to 0.9 (Keras calls this argument `rho`). This default often works well, so it is rarely the first hyperparameter you need to tune.
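
The effect of the exponential decay can be checked with a constant gradient. This plain-Python sketch compares the step size of AdaGrad and RMSProp after 100 iterations, assuming Equation 11-7 has the form s ← βs + (1 − β) ∇J ⊗ ∇J; the function name and values are illustrative.

```python
import math

def effective_steps(steps=100, grad=1.0, eta=0.001, beta=0.9, eps=1e-10):
    """Return the (AdaGrad, RMSProp) step sizes after `steps` identical gradients."""
    s_adagrad, s_rmsprop = 0.0, 0.0
    for _ in range(steps):
        s_adagrad = s_adagrad + grad * grad                           # grows forever
        s_rmsprop = beta * s_rmsprop + (1.0 - beta) * grad * grad     # levels off
    adagrad_step = eta * grad / math.sqrt(s_adagrad + eps)
    rmsprop_step = eta * grad / math.sqrt(s_rmsprop + eps)
    return adagrad_step, rmsprop_step

adagrad_step, rmsprop_step = effective_steps()
print(f"AdaGrad step after 100 iterations: {adagrad_step:.6f}")
print(f"RMSProp step after 100 iterations: {rmsprop_step:.6f}")
```

AdaGrad's step has shrunk by a factor of 10 (1 / √100), while RMSProp's is still about η because its accumulator settles near the squared gradient.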

In [None]:
optimizer = keras.optimizers.RMSprop(learning_rate=0.001, rho=0.9)

### **Adam and Nadam Optimization** 

**Adam** stands for adaptive moment estimation, and it combines the ideas of Momentum optimization and RMSProp: just like Momentum optimization, it keeps track of an exponentially decaying average of past gradients; and just like RMSProp, it keeps track of an exponentially decaying average of past squared gradients. 

**Equation 11-8. Adam algorithm**   
![Eq11-8.jpg](./11.Chapter-11/Eq11-8.jpg) 

The momentum decay hyperparameter β₁ is typically initialized to 0.9, while the scaling decay hyperparameter β₂ is often initialized to 0.999. The smoothing term ε is usually initialized to a tiny number such as 10⁻⁷. These are the default values in TensorFlow. 

Since Adam is an adaptive learning rate algorithm, it requires less tuning of the learning rate hyperparameter η. You can often use the default value η = 0.001, making Adam even easier to use than Gradient Descent.
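
One reason the default learning rate transfers so well is that Adam's first steps have a size of roughly η regardless of the gradient's scale. The plain-Python sketch below shows this for a single parameter. It assumes Equation 11-8 uses the sign convention m ← β₁m − (1 − β₁)∇J, θ ← θ + η m̂ ⊘ √(ŝ + ε), with bias-corrected m̂ and ŝ; the function name is illustrative.

```python
import math

def adam_step(theta, m, s, grad, t, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-7):
    """One Adam update for a single parameter; t is the iteration number (from 1)."""
    m = beta1 * m - (1.0 - beta1) * grad          # decaying average of gradients
    s = beta2 * s + (1.0 - beta2) * grad * grad   # decaying average of squared gradients
    m_hat = m / (1.0 - beta1 ** t)                # bias correction (m and s start at 0)
    s_hat = s / (1.0 - beta2 ** t)
    theta = theta + eta * m_hat / math.sqrt(s_hat + eps)
    return theta, m, s

# Same starting point, gradients that differ by a factor of 1,000
theta_small, _, _ = adam_step(5.0, 0.0, 0.0, grad=10.0, t=1)
theta_large, _, _ = adam_step(5.0, 0.0, 0.0, grad=10000.0, t=1)
print(f"after a step with grad = 10:    {theta_small:.6f}")
print(f"after a step with grad = 10000: {theta_large:.6f}")
```

Both updates move the parameter by about 0.001, which is η.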

In [None]:
optimizer = keras.optimizers.Adam(learning_rate=0.001, beta_1=0.9, beta_2=0.999)

Two variants of Adam are worth mentioning: 

**AdaMax** - Replaces the ℓ₂ norm that Adam uses to scale the parameter updates with the ℓ∞ norm (the maximum of the time-decayed gradients) 

**Nadam** - Adam optimization plus the Nesterov trick, so it often converges slightly faster than Adam 

**Table 11-2. Optimizer comparison**   
![Table11-2.jpg](./11.Chapter-11/Table11-2.jpg)

### **Learning Rate Scheduling** 

Finding a good learning rate is very important. If you set it much too high, training may diverge. If you set it too low, training will eventually converge to the optimum, but it will take a very long time. If you set it slightly too high, it will make progress very quickly at first, but it will end up dancing around the optimum, never really settling down. 

You can do better than using a constant learning rate: if you start with a large learning rate and then reduce it once it stops making fast progress, you can reach a good solution faster than with an optimal constant learning rate. There are many strategies to reduce the learning rate during training. 

**Figure 11-8. Learning curves for various learning rates η**   
![Figure11-8.jpg](./11.Chapter-11/Figure11-8.jpg) 

These are the most common learning schedules: 

**Power scheduling** - Set the learning rate to a function of the iteration number t: η(t) = η₀ / (1 + t/s)^c 

**Exponential scheduling** - Set the learning rate to η(t) = η₀ × 0.1^(t/s) 

**Piecewise constant scheduling** - Use a constant learning rate for a number of epochs, then a smaller learning rate for another number of epochs, and so on 

**Performance scheduling** - Measure the validation error every N steps, and reduce the learning rate by a factor of λ when the error stops dropping 

**1cycle scheduling** - Start by increasing the learning rate linearly from an initial value η₀ up to η₁ halfway through training. Then decrease it linearly back down to η₀ during the second half of training, and finish the last few epochs by dropping the learning rate by several orders of magnitude
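
The power and exponential formulas above are easy to compare in plain Python. The function names and the default values of η₀, s, and c below are arbitrary choices for illustration.

```python
def power_schedule(t, eta0=0.01, s=1000, c=1):
    """eta(t) = eta0 / (1 + t / s) ** c"""
    return eta0 / (1 + t / s) ** c

def exponential_schedule(t, eta0=0.01, s=1000):
    """eta(t) = eta0 * 0.1 ** (t / s)"""
    return eta0 * 0.1 ** (t / s)

for t in (0, 1000, 2000, 3000):
    print(f"t = {t:4d}  power = {power_schedule(t):.6f}  "
          f"exponential = {exponential_schedule(t):.6f}")
```

With c = 1, power scheduling drops to η₀ / 2 after s steps, then η₀ / 3, η₀ / 4, and so on, so it slows down more and more gradually. Exponential scheduling keeps dividing the learning rate by 10 every s steps.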

In [None]:
# Power scheduling: set decay when creating the optimizer
# (decay is the inverse of s in the power scheduling formula; Keras assumes c = 1)
optimizer = keras.optimizers.SGD(learning_rate=0.01, decay=1e-4)

# Exponential scheduling
def exponential_decay_fn(epoch):
    return 0.01 * 0.1**(epoch / 20)

lr_scheduler = keras.callbacks.LearningRateScheduler(exponential_decay_fn)
history = model.fit(X_train, y_train, epochs=30, callbacks=[lr_scheduler])

In [None]:
# Piecewise constant scheduling
def piecewise_constant_fn(epoch):
    if epoch < 5:
        return 0.01
    elif epoch < 15:
        return 0.005
    else:
        return 0.001

# Performance scheduling
lr_scheduler = keras.callbacks.ReduceLROnPlateau(factor=0.5, patience=5)

# Using tf.keras schedules API
s = 20 * len(X_train) // 32  # number of steps in 20 epochs (batch size = 32)
learning_rate = keras.optimizers.schedules.ExponentialDecay(0.01, s, 0.1)
optimizer = keras.optimizers.SGD(learning_rate)

---

## **Avoiding Overfitting Through Regularization** 

With thousands of parameters, a deep neural network has an incredible amount of freedom and can fit a huge variety of complex datasets. But this great flexibility also makes it prone to overfitting the training set. We need regularization. 

We already implemented one of the best regularization techniques: **early stopping**. Moreover, even though Batch Normalization was designed to solve the vanishing/exploding gradients problems, it also acts like a pretty good regularizer. In this section we will examine other popular regularization techniques for neural networks: ℓ₁ and ℓ₂ regularization, dropout, and max-norm regularization.

### **ℓ₁ and ℓ₂ Regularization** 

Just like you did in Chapter 4 for simple linear models, you can use ℓ₂ regularization to constrain a neural network's connection weights, and/or ℓ₁ regularization if you want a sparse model (with many weights equal to 0).
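
Both penalties are simply extra terms added to the training loss. A plain-Python sketch, with made-up weights and a hypothetical data loss, assuming the penalty is the regularization factor times the sum of absolute (ℓ₁) or squared (ℓ₂) weights:

```python
def l1_penalty(weights, factor=0.01):
    """l1 regularization loss: factor * sum(|w|). Pushes weights to exactly 0."""
    return factor * sum(abs(w) for w in weights)

def l2_penalty(weights, factor=0.01):
    """l2 regularization loss: factor * sum(w**2). Shrinks large weights the most."""
    return factor * sum(w * w for w in weights)

weights = [0.5, -2.0, 0.0, 1.5]   # made-up connection weights of one layer
data_loss = 0.30                  # hypothetical loss on the training batch

print(f"l1 penalty: {l1_penalty(weights):.4f}")
print(f"l2 penalty: {l2_penalty(weights):.4f}")
print(f"total loss with l2: {data_loss + l2_penalty(weights):.4f}")
```

The factor 0.01 plays the same role as the argument passed to a regularizer such as `keras.regularizers.l2(0.01)`.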

In [None]:
# ℓ₂ regularization
layer = keras.layers.Dense(100, activation="elu",
                           kernel_initializer="he_normal",
                           kernel_regularizer=keras.regularizers.l2(0.01))

# Using functools.partial for reusability
from functools import partial

RegularizedDense = partial(keras.layers.Dense,
                           activation="elu",
                           kernel_initializer="he_normal",
                           kernel_regularizer=keras.regularizers.l2(0.01))

model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),
    RegularizedDense(300),
    RegularizedDense(100),
    RegularizedDense(10, activation="softmax",
                     kernel_initializer="glorot_uniform")
])

### **Dropout** 

Dropout is one of the most popular regularization techniques for deep neural networks. It was proposed by G. E. Hinton et al. in a 2012 paper and further detailed in a paper by Nitish Srivastava et al. in 2014. It is a fairly simple algorithm: at every training step, every neuron (including the input neurons but excluding the output neurons) has a probability p of being temporarily "dropped out," meaning it will be entirely ignored during this training step, but it may be active during the next step. 

**Figure 11-9. Dropout regularization**   
![Figure11-9.jpg](./11.Chapter-11/Figure11-9.jpg) 

The hyperparameter p is called the **dropout rate**, and it is typically set to 10% to 50%: closer to 20–30% in recurrent neural nets, and closer to 40–50% in convolutional neural networks. 

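After training, neurons don't get dropped anymore. However, there is one technical detail to handle. With p = 50%, for example, a neuron is connected on average to twice as many active inputs at test time as during training, so its total input signal would be about twice as large. To compensate, we need to multiply each neuron's input connection weights by the keep probability (1 – p) after training. Alternatively, we can divide each neuron's output by the keep probability during training (these alternatives are not perfectly equivalent, but they work equally well in practice; the latter is usually preferred because nothing has to be adjusted at inference time). 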

To implement dropout using Keras, use the Dropout layer. During training, it randomly drops some inputs (setting them to 0) and divides the remaining inputs by the keep probability. After training, it does nothing at all; it just passes the inputs to the next layer.
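
The behavior just described can be sketched in a few lines of plain Python. This illustrates the idea only and is not how Keras implements it; the `dropout` helper and the seeded random generator are assumptions made for reproducibility.

```python
import random

def dropout(inputs, rate, training, rng):
    """Drop each input with probability `rate`; scale survivors by 1 / keep_prob."""
    if not training:
        return list(inputs)   # after training: pass the inputs through unchanged
    keep_prob = 1.0 - rate
    return [x / keep_prob if rng.random() < keep_prob else 0.0 for x in inputs]

rng = random.Random(42)
inputs = [1.0] * 10000
train_out = dropout(inputs, rate=0.2, training=True, rng=rng)
test_out = dropout(inputs, rate=0.2, training=False, rng=rng)

mean_train = sum(train_out) / len(train_out)
print(f"first outputs during training: {train_out[:8]}")
print(f"mean output during training:   {mean_train:.3f}")
print(f"mean output after training:    {sum(test_out) / len(test_out):.3f}")
```

Dividing by the keep probability keeps the average signal the same in both modes, which is why nothing needs to be rescaled after training.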

In [None]:
model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),
    keras.layers.Dropout(rate=0.2),
    keras.layers.Dense(300, activation="elu", kernel_initializer="he_normal"),
    keras.layers.Dropout(rate=0.2),
    keras.layers.Dense(100, activation="elu", kernel_initializer="he_normal"),
    keras.layers.Dropout(rate=0.2),
    keras.layers.Dense(10, activation="softmax")
])

### **Monte Carlo (MC) Dropout** 

In 2016, a paper by Yarin Gal and Zoubin Ghahramani showed that you can boost the performance of any trained dropout model, without retraining it, by making predictions with dropout turned on and averaging over multiple predictions (typically 10 to 100). This technique is called **Monte Carlo Dropout** (MC Dropout). It often improves the accuracy of the model somewhat and, more importantly, provides a much better measure of the model's uncertainty. 

To implement MC Dropout, you need to keep dropout active after training. Use training=True when calling the model:

In [None]:
import numpy as np

# Make 100 predictions with dropout active (training=True)
y_probas = np.stack([model(X_test_scaled, training=True)
                     for sample in range(100)])
y_proba = y_probas.mean(axis=0)  # Average over the 100 stochastic predictions
y_std = y_probas.std(axis=0)     # Spread across predictions (uncertainty estimate)

To use MC Dropout without affecting other layers that behave differently during training and testing (such as Batch Normalization), create an MCDropout class:

In [None]:
class MCDropout(keras.layers.Dropout):
    def call(self, inputs):
        return super().call(inputs, training=True)

# Use MCDropout instead of Dropout
model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),
    MCDropout(rate=0.2),
    keras.layers.Dense(300, activation="elu", kernel_initializer="he_normal"),
    MCDropout(rate=0.2),
    keras.layers.Dense(100, activation="elu", kernel_initializer="he_normal"),
    MCDropout(rate=0.2),
    keras.layers.Dense(10, activation="softmax")
])

### **Max-Norm Regularization** 

Another regularization technique that is popular for neural networks is called **max-norm regularization**: for each neuron, it constrains the weights **w** of the incoming connections such that ‖**w**‖₂ ≤ r, where r is the max-norm hyperparameter and ‖ · ‖₂ is the ℓ₂ norm. 

Max-norm regularization does not add a regularization loss term to the overall loss function. Instead, it is typically implemented by computing ‖**w**‖₂ after each training step and rescaling **w** if needed (**w** ← **w** × r / ‖**w**‖₂). 

Reducing r increases the amount of regularization and helps reduce overfitting. Max-norm regularization can also help alleviate the vanishing/exploding gradients problems (if you are not using Batch Normalization).
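
The rescaling rule is simple enough to sketch directly in plain Python. The helper name and the example weight vectors are illustrative.

```python
import math

def max_norm_rescale(w, r=1.0):
    """Rescale w to norm r if its l2 norm exceeds r; otherwise leave it unchanged."""
    norm = math.sqrt(sum(w_i * w_i for w_i in w))
    if norm <= r:
        return list(w)
    return [w_i * r / norm for w_i in w]

too_large = max_norm_rescale([3.0, 4.0], r=1.0)   # norm 5, gets rescaled
small = max_norm_rescale([0.3, 0.4], r=1.0)       # norm 0.5, left alone
print("rescaled: ", too_large)
print("unchanged:", small)
```

The direction of the weight vector is preserved; only its length is capped at r.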

In [None]:
keras.layers.Dense(100, activation="elu", kernel_initializer="he_normal",
                   kernel_constraint=keras.constraints.max_norm(1.))

---

## **Summary and Practical Guidelines** 

In this chapter we have covered a wide range of techniques, and you may be wondering which ones you should use. The configuration in Table 11-3 will work fine in most cases, without requiring much hyperparameter tuning. 

**Table 11-3. Default DNN configuration**   
![Table11-3.jpg](./11.Chapter-11/Table11-3.jpg) 

Don't forget to normalize the input features! Also, if you need a sparse model, you can use ℓ₁ regularization or TensorFlow Model Optimization Toolkit. If you need a low-latency model (one that performs lightning-fast predictions), you may need to use fewer layers, fold the Batch Normalization layers into the previous layers, and use faster activation functions such as leaky ReLU or just ReLU. Having a sparse model will also help. Finally, if you are building a risk-sensitive application, or inference latency is not very important in your application, you can use MC Dropout to boost performance and get more reliable probability estimates, along with uncertainty estimates. 

For a self-normalizing net based on the SELU activation function, you should use the configuration in Table 11-4. 

**Table 11-4. DNN configuration for a self-normalizing net**   
![Table11-4.jpg](./11.Chapter-11/Table11-4.jpg) 

Don't forget to standardize the input features! Moreover, instead of regular dropout, you need to use alpha dropout: this is a variant of dropout that preserves the mean and standard deviation of its inputs (it was introduced in the same paper as SELU, as regular dropout would break self-normalization). 