<a href="https://colab.research.google.com/github/Richish/hands_on_ml/blob/master/11_training_deep_neural_networks.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Challenges
Training a much deeper DNN, perhaps with 10 layers or many more, each containing hundreds of neurons connected by hundreds of thousands of connections, raises several challenges:

1. The vanishing gradients problem (or the related exploding gradients problem) affects deep neural networks and makes lower layers very hard to train.
2. You might not have enough training data for such a large network, or it might be too costly to label. Addressed by transfer learning.
3. Training may be extremely slow. Addressed by faster optimizers.
4. A model with millions of parameters severely risks overfitting the training set, especially if there are not enough training instances or they are too noisy. Addressed by regularization techniques.

# Vanishing/Exploding Gradients Problems

During backpropagation, gradients often get smaller and smaller as the algorithm progresses down to the lower layers. As a result, the Gradient Descent update leaves the lower layers’ connection weights virtually unchanged, and training never converges to a good
solution. This is called the vanishing gradients problem. In some cases, the opposite
can happen: the gradients can grow bigger and bigger, so many layers get insanely
large weight updates and the algorithm diverges. This is the exploding gradients problem,
which is mostly encountered in recurrent neural networks. More generally,
deep neural networks suffer from unstable gradients; different layers may learn at
widely different speeds.

This behavior was one of the reasons why deep neural networks were mostly abandoned for a
long time. It was only around 2010 that significant progress was made in understanding
it.

A paper titled “Understanding the Difficulty of Training Deep Feedforward
Neural Networks” by Xavier Glorot and Yoshua Bengio found a few suspects, including
the combination of the popular logistic sigmoid activation function and the
weight initialization technique that was most popular at the time, namely random initialization
using a normal distribution with a mean of 0 and a standard deviation of 1.
In short, they showed that with this activation function and this initialization scheme,
the variance of the outputs of each layer is much greater than the variance of its
inputs. Going forward in the network, the variance keeps increasing after each layer
until the activation function saturates at the top layers. This is actually made worse by
the fact that the logistic function has a mean of 0.5, not 0 (the hyperbolic tangent
function has a mean of 0 and behaves slightly better than the logistic function in deep
networks).

When the inputs to the logistic function become large (negative or positive), the function saturates at 0 or 1, with a
derivative extremely close to 0. Thus, when backpropagation kicks in, it has virtually
no gradient to propagate back through the network, and what little gradient exists
keeps getting diluted as backpropagation progresses down through the top layers, so
there is really nothing left for the lower layers.
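To make the saturation argument concrete, here is a minimal NumPy sketch (the helper names `sigmoid` and `sigmoid_derivative` are illustrative, not from the notes). It evaluates the logistic function's derivative at a few inputs and shows how small the best-case product of activation derivatives becomes over 10 layers, ignoring the effect of the weights.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_derivative(z):
    """Derivative of the logistic function: s(z) * (1 - s(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)

for z in [0.0, 2.0, 5.0, 10.0]:
    print(f"z = {z:5.1f}  sigmoid = {sigmoid(z):.5f}  derivative = {sigmoid_derivative(z):.2e}")

# The derivative peaks at 0.25 (at z = 0), so even in the best case the
# activation derivatives alone shrink the gradient by a factor of 4 per layer.
best_case_factor_10_layers = 0.25 ** 10
print(best_case_factor_10_layers)
```

The derivative is already tiny for |z| around 5 to 10, which is why saturated units pass almost no gradient to the layers below them.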



## Initializers: Glorot, LeCun, and He Initialization

### Glorot
In their paper, Glorot and Bengio propose a way to significantly alleviate this problem.
We need the signal to flow properly in both directions: in the forward direction
when making predictions, and in the reverse direction when backpropagating gradients.
We don’t want the signal to die out, nor do we want it to explode and saturate.
For the signal to flow properly, the authors argue that we need the variance of the
outputs of each layer to be equal to the variance of its inputs, and we also need the
gradients to have equal variance before and after flowing through a layer in the
reverse direction. 

It is actually not possible to guarantee both unless the layer has an equal
number of inputs and neurons (these numbers are called the fan-in and fan-out of the
layer), but they proposed a good compromise that has proven to work very well in
practice: the connection weights of each layer must be initialized randomly as described in the equations below, where $\text{fan}_\text{avg} = (\text{fan}_\text{in} + \text{fan}_\text{out})/2$. This initialization strategy is called Xavier initialization (after the author’s first name) or Glorot initialization (after his last name).

Normal distribution with mean 0 and variance $\sigma^2 = \dfrac{1}{\text{fan}_\text{avg}}$

Or a uniform distribution between $-r$ and $+r$, with $r = \sqrt{\dfrac{3}{\text{fan}_\text{avg}}}$

### LeCun
If you replace $\text{fan}_\text{avg}$ with $\text{fan}_\text{in}$ in the equations above, you get an initialization strategy
that was actually already proposed by Yann LeCun in the 1990s, called LeCun initialization. It is equivalent to Glorot initialization when $\text{fan}_\text{in} = \text{fan}_\text{out}$. It took over a decade for researchers to realize
just how important this trick really is. Using Glorot initialization can speed up training considerably, and it is one of the tricks that led to the current success of Deep Learning.

### He
Some papers have provided similar strategies for different activation functions.
These strategies differ only by the scale of the variance and whether they use $\text{fan}_\text{avg}$ or $\text{fan}_\text{in}$, as shown in the table below. For the uniform distribution, just compute $r = \sqrt{3\sigma^2}$.

The initialization strategy for the ReLU activation function (and its variants, including the ELU activation described shortly) is sometimes called He initialization (after the last name of its author). 

The SELU activation function is explained later in this section. It should be used with LeCun initialization (preferably with a normal distribution, as we will see).

### Table of initializers:
| Initialization | Activation functions | $\sigma^2$ (Normal) |
| :--- | :--- | :--- |
| Glorot | None, logistic, tanh, softmax | $1/\text{fan}_\text{avg}$ |
| LeCun | SELU | $1/\text{fan}_\text{in}$ |
| He | ReLU and its variants | $2/\text{fan}_\text{in}$ |
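
As a worked example of the table, the following plain-Python sketch computes the variance and the matching uniform limit r for each strategy. The function names and the layer size (300 inputs, 100 neurons) are hypothetical and chosen only for illustration.

```python
import math

def init_variance(strategy, fan_in, fan_out):
    """Return sigma^2 for the given strategy ("glorot", "lecun" or "he")."""
    fan_avg = (fan_in + fan_out) / 2
    if strategy == "glorot":
        return 1 / fan_avg
    if strategy == "lecun":
        return 1 / fan_in
    if strategy == "he":
        return 2 / fan_in
    raise ValueError(f"unknown strategy: {strategy!r}")

def uniform_limit(variance):
    """Return r such that a uniform distribution on [-r, r] has this variance."""
    return math.sqrt(3 * variance)

# Hypothetical layer with 300 inputs and 100 neurons.
fan_in, fan_out = 300, 100
for strategy in ("glorot", "lecun", "he"):
    var = init_variance(strategy, fan_in, fan_out)
    print(f"{strategy:6s}  sigma^2 = {var:.5f}  r = {uniform_limit(var):.4f}")
```

The uniform limit follows from the fact that a uniform distribution on [-r, r] has variance r²/3.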

By default, Keras uses Glorot initialization with a uniform distribution. You can
change this to He initialization by setting kernel_initializer="he_uniform" or kernel_initializer="he_normal" when creating a layer, like this:


In [None]:
from tensorflow import keras
keras.layers.Dense(10, activation="relu", kernel_initializer="he_normal")

<tensorflow.python.keras.layers.core.Dense at 0x7ff6d47a33c8>

In [None]:
# If you want He initialization with a uniform distribution, but based on fan_avg rather
# than fan_in, you can use the VarianceScaling initializer like this:
he_avg_init = keras.initializers.VarianceScaling(scale=2., mode='fan_avg', distribution='uniform') # basically a custom initializer
keras.layers.Dense(10, activation="sigmoid", kernel_initializer=he_avg_init)

<tensorflow.python.keras.layers.core.Dense at 0x7ff6d5128358>

## Activation Functions- Nonsaturating Activation Functions

### ReLU
One of the insights in the 2010 paper by Glorot and Bengio was that the vanishing/
exploding gradients problems were in part due to a poor choice of activation function.
Until then most people had assumed that if Mother Nature had chosen to use
roughly sigmoid activation functions in biological neurons, they must be an excellent
choice. But it turns out that other activation functions behave much better in deep
neural networks, in particular the ReLU activation function, mostly because it does
not saturate for positive values (and also because it is quite fast to compute).

#### Problem of dying ReLUs:
During training, some ReLU neurons effectively die, meaning
they stop outputting anything other than 0. In some cases, you may find that half of your network’s neurons are dead, especially if you used a large learning rate. A neuron dies when its weights get tweaked in such a way that the weighted sum of its inputs is negative for all instances in the training set. When this happens, it just keeps outputting 0s, and Gradient Descent does not affect it anymore, since the gradient
of the ReLU function is 0 when its input is negative.

### Leaky ReLU (solves the problem of dying ReLUs):
This function is defined as $\text{LeakyReLU}_\alpha(z) = \max(\alpha z, z)$. The hyperparameter α defines how much the function “leaks”: it is the
slope of the function for z < 0, and is typically set to 0.01. This small slope ensures that leaky ReLUs never die; they can go into a long coma, but they have a chance to eventually wake up.
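
The difference from plain ReLU is easiest to see numerically. The NumPy sketch below uses illustrative helper names and simply picks the slope α at z = 0 as a convention. It shows that the leaky variant keeps a small nonzero gradient for negative inputs, whereas ReLU's gradient there is 0.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def leaky_relu(z, alpha=0.01):
    """max(alpha * z, z), valid for 0 <= alpha <= 1."""
    return np.maximum(alpha * z, z)

def leaky_relu_gradient(z, alpha=0.01):
    """Slope is 1 for z > 0 and alpha otherwise (the value at z = 0 is a convention)."""
    return np.where(z > 0, 1.0, alpha)

z = np.array([-3.0, -0.5, 0.0, 2.0])
print(relu(z))                 # negative inputs are zeroed out
print(leaky_relu(z))           # negative inputs are scaled by alpha instead
print(leaky_relu_gradient(z))  # never exactly 0, so the neuron can still learn
```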

A 2015 paper compared several variants of the ReLU activation function and one of its conclusions was that the leaky variants always outperformed the strict ReLU activation function. In fact, setting α = 0.2 (huge leak) seemed to result in better performance than α = 0.01 (small leak). 

They also evaluated the **randomized leaky ReLU (RReLU)**, where α is picked randomly in a given range during training, and it is fixed to an average value during testing. It also performed fairly well and seemed to act as a regularizer (reducing the risk of overfitting the training set).

Finally, they also evaluated the **parametric leaky ReLU (PReLU)**, where α is authorized to be learned during training (instead of being a hyperparameter, it becomes a parameter that can be modified by backpropagation like any other parameter). This was reported to strongly outperform ReLU on large image datasets, but on smaller
datasets it runs the risk of overfitting the training set.

### ELU
A 2015 paper by Djork-Arné Clevert et al. proposed a new activation
function called the exponential linear unit (ELU) that outperformed all the ReLU variants in their experiments: training time was reduced and the neural network performed better on the test set.

ELU activation function:

$$
\text{ELU}_\alpha(z) =
\begin{cases}
\alpha(\exp(z) - 1) & \text{if } z < 0 \\
z & \text{if } z \ge 0
\end{cases}
$$

It looks like ReLU for positive values of z.

It differs from ReLU in three ways:
1. It takes on negative values when z < 0, which allows the unit to have an average output closer to 0. This helps alleviate the vanishing gradients problem, as discussed earlier. The hyperparameter α defines the value that the ELU function approaches when z is a large negative number. It is usually set to 1, but you can tweak it like any other hyperparameter if you want.
2. It has a nonzero gradient for z < 0, which avoids the dead neurons problem.
3. If α is equal to 1 then the function is smooth everywhere, including
around z = 0, which helps speed up Gradient Descent, since it does not bounce as much left and right of z = 0.
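
These three properties can be checked with a small NumPy sketch of ELU and its gradient. The function names are illustrative, and the `np.minimum` call is only a numerical safeguard, not part of the definition.

```python
import numpy as np

def elu(z, alpha=1.0):
    """alpha * (exp(z) - 1) for z < 0, z otherwise."""
    z = np.asarray(z, dtype=float)
    # np.minimum keeps exp() from overflowing on large positive inputs,
    # whose branch is discarded by np.where anyway.
    return np.where(z < 0, alpha * (np.exp(np.minimum(z, 0.0)) - 1.0), z)

def elu_gradient(z, alpha=1.0):
    """alpha * exp(z) for z < 0, 1 otherwise."""
    z = np.asarray(z, dtype=float)
    return np.where(z < 0, alpha * np.exp(np.minimum(z, 0.0)), 1.0)

print(elu([-20.0, -1.0, 0.0, 3.0]))      # approaches -alpha for very negative z
print(elu_gradient([-5.0, -1.0]))        # nonzero gradient for z < 0
print(elu_gradient([-1e-9, 0.0, 1e-9]))  # close to 1 on both sides of 0 when alpha = 1
```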

Drawbacks of ELU:
The main drawback of the ELU activation function is that it is slower to compute than the ReLU and its variants (due to the use of the exponential function), but during training this is compensated by the faster convergence rate. However, at test time an ELU network will be slower than a ReLU network.

#### SELU (Scaled ELU)
In a 2017 paper by Günter Klambauer et al., called “Self-Normalizing
Neural Networks”, the authors showed that if you build a neural network composed exclusively of a stack of dense layers, and if all hidden layers use the SELU activation function (which is just a scaled version of the ELU activation function, as its name
suggests), then the network will self-normalize: the output of each layer will tend to preserve mean 0 and standard deviation 1 during training, which solves the vanishing/exploding gradients problem. As a result, this activation function often outperforms other activation functions very significantly for such neural nets (especially
deep ones). **However, there are a few conditions for self-normalization to happen:**

1. The input features must be standardized (mean 0 and standard deviation 1).
2. Every hidden layer’s weights must also be initialized using LeCun normal initialization. In Keras, this means setting kernel_initializer="lecun_normal".
3. **The network’s architecture must be sequential.** Unfortunately, if you try to use SELU in non-sequential architectures, such as recurrent networks or networks with skip connections (i.e., connections that skip layers, such as in wide & deep nets), self-normalization will not be guaranteed, so SELU will not necessarily outperform other activation functions.
4. The paper only guarantees self-normalization if all layers are dense. However, in practice the SELU activation function seems to work great with convolutional neural nets as well.
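
The self-normalizing behaviour can be observed in a rough NumPy simulation. This sketch is illustrative only: the layer sizes are arbitrary, the weights are untrained and have no biases, and a plain normal distribution stands in for Keras's `lecun_normal` (which uses a truncated normal). It pushes standardized random inputs through a deep stack of dense SELU layers and prints the mean and standard deviation of the activations every 10 layers.

```python
import numpy as np

# SELU constants from the Klambauer et al. paper (rounded).
SELU_ALPHA = 1.6732632423543772
SELU_SCALE = 1.0507009873554805

def selu(z):
    negative_part = SELU_ALPHA * (np.exp(np.minimum(z, 0.0)) - 1.0)
    return SELU_SCALE * np.where(z < 0, negative_part, z)

rng = np.random.default_rng(42)
n_instances, n_units, n_layers = 500, 100, 50   # arbitrary illustrative sizes

# Condition 1: standardized inputs (mean 0, standard deviation 1).
a = rng.standard_normal((n_instances, n_units))

for layer in range(1, n_layers + 1):
    # Condition 2: LeCun normal initialization, variance = 1 / fan_in.
    w = rng.normal(0.0, np.sqrt(1.0 / n_units), size=(n_units, n_units))
    # Condition 3: a plain sequential stack of dense layers (no skip connections).
    a = selu(a @ w)
    if layer % 10 == 0:
        print(f"layer {layer:2d}: mean = {a.mean():+.2f}, std = {a.std():.2f}")

final_mean, final_std = a.mean(), a.std()
```

Under these assumptions the printed statistics should stay near mean 0 and standard deviation 1 rather than collapsing or blowing up with depth.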

### Which activation function to use:

For the hidden layers of your deep neural networks, although your mileage will vary, in
general:

SELU > ELU > leaky ReLU (and its variants) > ReLU > tanh > logistic.

If the network’s architecture prevents it from self-normalizing,
then ELU may perform better than SELU (since SELU
is not smooth at z = 0). 

If you care a lot about runtime latency, then you may prefer leaky ReLU. 

If you don’t want to tweak yet another hyperparameter, you may just use the default α values used by Keras (e.g., 0.3 for the leaky ReLU). 
If you have spare time and computing power, you can use cross-validation to evaluate other activation functions, in particular RReLU if your network is overfitting, or PReLU if you have a huge training set.


In [None]:
# To use the leaky ReLU activation function, you must create a LeakyReLU instance like this:
from tensorflow import keras
leaky_relu = keras.layers.LeakyReLU(alpha=0.2)
layer = keras.layers.Dense(10, activation=leaky_relu, kernel_initializer="he_normal")


In [None]:
# For PReLU, just replace LeakyReLU(alpha=0.2) with PReLU(). There is currently no
# official implementation of RReLU in Keras, but you can fairly easily implement your own.
p_relu = keras.layers.PReLU()
layer = keras.layers.Dense(10, activation=p_relu, kernel_initializer="he_normal")

In [None]:
# For SELU activation, just set activation="selu" and kernel_initializer="lecun_normal" when creating a layer:
layer = keras.layers.Dense(10, activation="selu", kernel_initializer="lecun_normal")

## Batch Normalization:

Although using He initialization along with ELU (or any variant of ReLU) can significantly
reduce the vanishing/exploding gradients problems at the beginning of training,
it doesn’t guarantee that they won’t come back during training.

Batch Normalization consists of adding an operation in the model just before or after the
activation function of each hidden layer, simply zero-centering and normalizing each
input, then scaling and shifting the result using two new parameter vectors per layer:
one for scaling, the other for shifting. This operation lets the model
learn the optimal scale and mean of each of the layer’s inputs. 

In many cases, if you
add a BN layer as the very first layer of your neural network, you do not need to
standardize your training set (e.g., using a StandardScaler): the BN layer will do it
for you (well, approximately, since it only looks at one batch at a time, and it can also
rescale and shift each input feature).

In order to zero-center and normalize the inputs, the algorithm needs to estimate
each input’s mean and standard deviation. It does so by evaluating the mean and standard
deviation of each input over the current mini-batch (hence the name “Batch
Normalization”).


So during training, BN just standardizes its inputs, then rescales and offsets them.
What about at test time?

At test time, we may need to make predictions for individual instances rather than for batches, so there may be no reliable batch mean and standard deviation to compute. The layer therefore needs “final” input means and standard deviations to use instead. It is often preferred to estimate these final statistics
during training using a moving average of the layer’s input means and standard
deviations. To sum up, four parameter vectors are learned in each batch-normalized
layer: γ (the output scale vector) and β (the output offset vector) are learned through
regular backpropagation, and μ (the final input mean vector) and σ (the final input
standard deviation vector) are estimated using an exponential moving average. Note
that μ and σ are estimated during training, but they are not used at all during training,
only after training.
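
The training-time and test-time computations described above can be sketched in NumPy as follows. This is a simplified illustration, not the Keras implementation: the function names are hypothetical, the smoothing term `eps` is an assumed small constant, and the sketch stores the variance rather than the standard deviation.

```python
import numpy as np

def batch_norm_training(x, gamma, beta, eps=1e-3):
    """Standardize each feature over the mini-batch, then scale and shift."""
    batch_mean = x.mean(axis=0)
    batch_var = x.var(axis=0)
    x_hat = (x - batch_mean) / np.sqrt(batch_var + eps)  # eps avoids division by zero
    return gamma * x_hat + beta, batch_mean, batch_var

def batch_norm_inference(x, gamma, beta, final_mean, final_var, eps=1e-3):
    """Same transformation, but with the final statistics instead of batch statistics."""
    x_hat = (x - final_mean) / np.sqrt(final_var + eps)
    return gamma * x_hat + beta

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=3.0, size=(32, 4))   # mini-batch of 32 instances, 4 features
gamma, beta = np.ones(4), np.zeros(4)               # initial scale and offset

out, batch_mean, batch_var = batch_norm_training(x, gamma, beta)
print(out.mean(axis=0))   # close to 0 for every feature
print(out.std(axis=0))    # close to 1 for every feature

# A single instance can be normalized once final statistics are available.
single = batch_norm_inference(x[:1], gamma, beta, batch_mean, batch_var)
print(single)
```

Here the batch statistics stand in for the final statistics purely to keep the example short; in practice they would come from the moving averages accumulated during training.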

Batch Normalization
also acts like a regularizer, reducing the need for other regularization techniques
(such as dropout, described later in this chapter).
Batch Normalization does, however, add some complexity to the model (although it
can remove the need for normalizing the input data, as we discussed earlier). Moreover,
there is a runtime penalty: the neural network makes slower predictions due to
the extra computations required at each layer. So if you need predictions to be
lightning-fast, you may want to check how well plain ELU + He initialization perform
before playing with Batch Normalization.



You may find that training is rather slow, because each epoch takes
much more time when you use batch normalization. However, this
is usually counterbalanced by the fact that convergence is much
faster with BN, so it will take fewer epochs to reach the same performance.
All in all, wall time will usually be smaller (this is the
time measured by the clock on your wall).

### Implementing Batch Normalization with Keras

Just add a BatchNormalization layer before or after each hidden layer’s activation
function, and optionally add a BN layer as well as the first layer in your model. For
example, this model applies BN after every hidden layer and as the first layer in the
model (after flattening the input images):

In [None]:
from tensorflow import keras

model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(300, activation="elu", kernel_initializer="he_normal"),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(100, activation="elu", kernel_initializer="he_normal"),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(10, activation="softmax")
])

In [None]:
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
flatten (Flatten)            (None, 784)               0         
_________________________________________________________________
batch_normalization (BatchNo (None, 784)               3136      
_________________________________________________________________
dense (Dense)                (None, 300)               235500    
_________________________________________________________________
batch_normalization_1 (Batch (None, 300)               1200      
_________________________________________________________________
dense_1 (Dense)              (None, 100)               30100     
_________________________________________________________________
batch_normalization_2 (Batch (None, 100)               400       
_________________________________________________________________
dense_2 (Dense)              (None, 10)                1

In [None]:
[(var.name, var.trainable) for var in model.layers[1].variables]

[('batch_normalization/gamma:0', True),
 ('batch_normalization/beta:0', True),
 ('batch_normalization/moving_mean:0', False),
 ('batch_normalization/moving_variance:0', False)]
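
The parameter counts in the summary above can be reproduced by hand. The short sketch below (helper names are illustrative) assumes each BN layer holds four values per input feature, matching the four variables just listed, and that each dense layer has one weight per connection plus one bias per neuron.

```python
def dense_params(n_inputs, n_units, use_bias=True):
    """Weights (n_inputs * n_units) plus one bias per unit if use_bias is set."""
    return n_inputs * n_units + (n_units if use_bias else 0)

def batch_norm_params(n_features):
    """gamma, beta, moving mean and moving variance: four values per feature."""
    return 4 * n_features

print(batch_norm_params(784))   # first BN layer
print(dense_params(784, 300))   # first hidden layer
print(batch_norm_params(300))
print(dense_params(300, 100))
print(batch_norm_params(100))

# Only gamma and beta are trained by backpropagation; the moving statistics
# account for half of each BN layer's parameters.
non_trainable = sum(2 * n for n in (784, 300, 100))
print(non_trainable)
```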

The authors of the BN paper argued in favor of adding the BN layers before the activation
functions, rather than after (as we just did). There is some debate about this, as
it seems to depend on the task. So that’s one more thing you can experiment with to
see which option works best on your dataset. To add the BN layers before the activation
functions, we must remove the activation function from the hidden layers, and
add them as separate layers after the BN layers. Moreover, since a Batch Normalization
layer includes one offset parameter per input, you can remove the bias term from
the previous layer (just pass use_bias=False when creating it):

In [None]:
model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(300, kernel_initializer="he_normal", use_bias=False),
    keras.layers.BatchNormalization(),
    keras.layers.Activation("elu"),
    keras.layers.Dense(100, kernel_initializer="he_normal", use_bias=False),
    keras.layers.BatchNormalization(),   # BN comes before the activation function
    keras.layers.Activation("elu"),
    keras.layers.Dense(10, activation="softmax")
])
model.summary()

Model: "sequential_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
flatten_2 (Flatten)          (None, 784)               0         
_________________________________________________________________
batch_normalization_6 (Batch (None, 784)               3136      
_________________________________________________________________
dense_6 (Dense)              (None, 300)               235200    
_________________________________________________________________
batch_normalization_7 (Batch (None, 300)               1200      
_________________________________________________________________
activation_2 (Activation)    (None, 300)               0         
_________________________________________________________________
dense_7 (Dense)              (None, 100)               30000     

#### Hyperparameters of Batch Normalization layer:

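1. Momentum: this hyperparameter is used when updating the exponential moving averages. Given a
new value $\mathbf{v}$ (i.e., a new vector of input means or standard deviations computed over the current batch), the layer updates the running average $\hat{\mathbf{v}}$ using the following equation:

$$
\hat{\mathbf{v}} \leftarrow \hat{\mathbf{v}} \times \text{momentum} + \mathbf{v} \times (1 - \text{momentum})
$$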

A good momentum value is typically close to 1—for example, 0.9, 0.99, or 0.999 (you
want more 9s for larger datasets and smaller mini-batches).
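
A few lines of Python show how the momentum controls the smoothing. The batch values and the starting value of 0.0 below are made up for illustration, and `update_running_average` is a hypothetical helper implementing the update rule above.

```python
def update_running_average(v_avg, v_new, momentum=0.99):
    """One exponential-moving-average step."""
    return v_avg * momentum + v_new * (1 - momentum)

# Hypothetical per-batch means of one input feature, fluctuating around 5.
batch_means = [5.2, 4.9, 5.1, 4.8, 5.0] * 200

for momentum in (0.9, 0.99):
    v_avg = 0.0   # starting value chosen for illustration
    for v in batch_means:
        v_avg = update_running_average(v_avg, v, momentum)
    print(momentum, round(v_avg, 3))
```

A momentum closer to 1 gives each new batch less weight, so the running average is smoother but takes more batches to move away from its starting value.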

2. Axis: it determines which axis should be normalized.
It defaults to –1, meaning that by default it will normalize the last axis (using
the means and standard deviations computed across the other axes). For example,
when the input batch is 2D (i.e., the batch shape is [batch size, features]), this means
that each input feature will be normalized based on the mean and standard deviation
computed across all the instances in the batch. For example, the first BN layer in the
previous code example will independently normalize (and rescale and shift) each of
the 784 input features. However, if we move the first BN layer before the Flatten
layer, then the input batches will be 3D, with shape [batch size, height, width], therefore
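the BN layer will compute 28 means and 28 standard deviations (one per column of pixels, computed across all instances in the batch and all rows in the column), and
it will normalize all pixels in a given column using the same mean and standard deviation.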
There will also be just 28 scale parameters and 28 shift parameters. If instead
you still want to treat each of the 784 pixels independently, then you should set
axis=[1, 2].
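
The effect of the axis setting on the number of statistics can be checked with NumPy on a random batch. Note one difference in convention: the BN layer's `axis` names the axes that are kept (normalized independently), whereas NumPy's `axis` argument names the axes that are averaged over. The array below is random data used only for its shape.

```python
import numpy as np

rng = np.random.default_rng(0)
batch = rng.random((32, 28, 28))   # [batch size, height, width]

# BN axis=-1: keep the last axis and compute statistics across all the others
# (instances and rows), giving one mean per column of pixels.
means_last_axis = batch.mean(axis=(0, 1))
print(means_last_axis.shape)       # 28 values

# BN axis=[1, 2]: keep both image axes and compute statistics across instances
# only, giving one mean per pixel.
means_per_pixel = batch.mean(axis=0)
print(means_per_pixel.shape)       # 28 x 28 = 784 values
```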

Notice that the BN layer does not perform the same computation during training and
after training: it uses batch statistics during training, and the “final” statistics after
training (i.e., the final value of the moving averages).

Batch Normalization has become one of the most used layers in deep neural networks,
to the point that it is often omitted in the diagrams, as it is assumed that BN is
added after every layer. 

However, a very recent paper by Hongyi Zhang et al. may
well change this: the authors show that by using a novel fixed-update (fixup) weight
initialization technique, they manage to train a very deep neural network (10,000 layers!)
without BN, achieving state-of-the-art performance on complex image classification
tasks.

## Gradient Clipping

Another popular technique to lessen the exploding gradients problem is to simply
clip the gradients during backpropagation so that they never exceed some threshold.
This is called Gradient Clipping.

This technique is most often used in recurrent neural networks, as Batch Normalization is tricky to use in RNNs.
For other types of networks, BN is usually sufficient.

In Keras, implementing Gradient Clipping is just a matter of setting the clipvalue or
clipnorm argument when creating an optimizer. For example:



In [None]:
optimizer = keras.optimizers.SGD(clipvalue=1.0)
model.compile(loss="mse", optimizer=optimizer)

This will clip every component of the gradient vector to a value between –1.0 and 1.0.
This means that all the partial derivatives of the loss (with regard to each and every
trainable parameter) will be clipped between –1.0 and 1.0. The threshold is a hyperparameter
you can tune. Note that it may change the orientation of the gradient vector:
for example, if the original gradient vector is [0.9, 100.0], it points mostly in the
direction of the second axis, but once you clip it by value, you get [0.9, 1.0], which
points roughly along the diagonal between the two axes. In practice, however, this
approach works well. 

If you want to ensure that Gradient Clipping does not change
the direction of the gradient vector, you should clip by norm by setting clipnorm
instead of clipvalue. This will clip the whole gradient if its ℓ2 norm is greater than
the threshold you picked. For example, if you set clipnorm=1.0, then the vector [0.9,
100.0] will be clipped to [0.00899964, 0.9999595], preserving its orientation, but
almost eliminating the first component. If you observe that the gradients explode
during training (you can track the size of the gradients using TensorBoard), you may
want to try both clipping by value and clipping by norm, with different thresholds,
and see which option performs best on the validation set.
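
The two clipping modes can be compared on the example vector with a short NumPy sketch. The helper functions are illustrative and operate on a single gradient vector; they are not the Keras optimizer internals.

```python
import numpy as np

def clip_by_value(gradient, threshold):
    """Clip every component independently to [-threshold, threshold]."""
    return np.clip(gradient, -threshold, threshold)

def clip_by_norm(gradient, threshold):
    """Rescale the whole vector if its l2 norm exceeds the threshold."""
    norm = np.linalg.norm(gradient)
    if norm > threshold:
        return gradient * (threshold / norm)
    return gradient

g = np.array([0.9, 100.0])
print(clip_by_value(g, 1.0))   # components 0.9 and 1.0: the direction changes
print(clip_by_norm(g, 1.0))    # roughly 0.009 and 1.0: the direction is preserved
```

Clipping by norm only rescales the vector, so the ratio between its components stays the same, whereas clipping by value changes that ratio whenever only some components exceed the threshold.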

# Transfer Learning

## Reusing pretrained layers

It is generally not a good idea to train a very large DNN from scratch: instead, you
should always try to find an existing neural network that accomplishes a similar task
to the one you are trying to tackle, then just reuse the lower layers of this network: this is called transfer learning. It will
not only **speed up training considerably, but will also require much less training data.**

For example, suppose that you have access to a DNN that was trained to classify pictures
into 100 different categories, including animals, plants, vehicles, and everyday
objects. You now want to train a DNN to classify specific types of vehicles. These
tasks are very similar, even partly overlapping, so you should try to reuse parts of the
first network.

You want to find the right number of layers to reuse.
The more similar the tasks are, the more layers you want to reuse
(starting with the lower layers). For very similar tasks, you can try
keeping all the hidden layers and just replacing the output layer.


Try freezing all the reused layers first (i.e., make their weights non-trainable, so gradient
descent won’t modify them), then train your model and see how it performs.
Then try unfreezing one or two of the top hidden layers to let backpropagation tweak
them and see if performance improves. The more training data you have, the more layers you can unfreeze. It is also useful to reduce the learning rate when you unfreeze
reused layers: this will avoid wrecking their fine-tuned weights.
If you still cannot get good performance, and you have little training data, try dropping
the top hidden layer(s) and freeze all remaining hidden layers again. You can
iterate until you find the right number of layers to reuse. If you have plenty of training
data, you may try replacing the top hidden layers instead of dropping them, and
even add more hidden layers.

## Transfer Learning With Keras

Let’s look at an example. Suppose the Fashion MNIST dataset only contained 8 classes,
for example all classes except for sandals and shirts. Someone built and trained a
Keras model on that set and got reasonably good performance (>90% accuracy). Let’s
call this model A. You now want to tackle a different task: you have images of sandals
and shirts, and you want to train a binary classifier (positive=shirts, negative=sandals).
However, your dataset is quite small: you only have 200 labeled images. When
you train a new model for this task (let’s call it model B), with the same architecture
as model A, it performs reasonably well (97.2% accuracy), but since it’s a much easier
task (there are just 2 classes), you were hoping for more. You realize that your task is quite similar to task A, so perhaps transfer
learning can help? 

Let’s find out!
First, you need to load model A, and create a new model based on the model A’s layers.
Let’s reuse all layers except for the output layer:



In [None]:
model_A = keras.models.load_model("my_model_A.h5")

model_B_on_A = keras.models.Sequential(model_A.layers[:-1])
model_B_on_A.add(keras.layers.Dense(1, activation="sigmoid"))

model_A and model_B_on_A now share some layers. When you train
model_B_on_A, it will also affect model_A. If you want to avoid that, you need to clone
model_A before you reuse its layers. To do this, you must clone model A’s architecture,
then copy its weights (since clone_model() does not clone the weights):

In [None]:
model_A_clone = keras.models.clone_model(model_A)
model_A_clone.set_weights(model_A.get_weights())



Now we could just train model_B_on_A for task B, but since the new output layer was
initialized randomly, it will make large errors, at least during the first few epochs, so
there will be large error gradients that may wreck the reused weights. To avoid this,
one approach is to freeze the reused layers during the first few epochs, giving the new
layer some time to learn reasonable weights. To do this, simply set every reused layer’s
trainable attribute to False and compile the model:


In [None]:
for layer in model_B_on_A.layers[:-1]:
    layer.trainable = False

model_B_on_A.compile(loss="binary_crossentropy", optimizer="sgd", metrics=["accuracy"])

**You must always compile your model after you freeze or unfreeze
layers.**

Next, we can train the model for a few epochs, then unfreeze the reused layers (which
requires compiling the model again) and continue training to fine-tune the reused
layers for task B. After unfreezing the reused layers, it is usually a good idea to reduce
the learning rate, once again to avoid damaging the reused weights:

In [None]:
history = model_B_on_A.fit(X_train_B, y_train_B, epochs=4, validation_data=(X_valid_B, y_valid_B))

for layer in model_B_on_A.layers[:-1]:
    layer.trainable = True
optimizer = keras.optimizers.SGD(learning_rate=1e-4)  # the default learning rate is 1e-2
model_B_on_A.compile(loss="binary_crossentropy", optimizer=optimizer, metrics=["accuracy"])
history = model_B_on_A.fit(X_train_B, y_train_B, epochs=16, validation_data=(X_valid_B, y_valid_B))

**It turns out that transfer learning does not work very well
with small dense networks: it works best with deep convolutional neural networks.**

## Unsupervised Pretraining

**Unsupervised pretraining is a good option
when you have a complex task to solve, no similar model you can reuse, and little
labeled training data but plenty of unlabeled training data.**

Suppose you want to tackle a complex task for which you don’t have much labeled
training data, but unfortunately you cannot find a model trained on a similar task.
Don’t lose all hope! First, you should of course try to gather more labeled training
data, but if this is too hard or too expensive, you may still be able to perform unsupervised
pretraining. It is often rather cheap to gather unlabeled training
examples, but quite expensive to label them. If you can gather plenty of unlabeled
training data, you can try to train the layers one by one, starting with the lowest layer
and then going up, using an unsupervised feature detector algorithm such as Restricted
Boltzmann Machines or autoencoders. Each layer is
trained on the output of the previously trained layers (all layers except the one being
trained are frozen). Once all layers have been trained this way, you can add the output
layer for your task, and fine-tune the final network using supervised learning (i.e.,
with the labeled training examples). At this point, you can unfreeze all the pretrained
layers, or just some of the upper ones.

This is a rather long and tedious process, but it often works well; in fact, it is this
technique that Geoffrey Hinton and his team used in 2006 and which led to the
revival of neural networks and the success of Deep Learning. Until 2010, unsupervised
pretraining (typically using RBMs) was the norm for deep nets, and it was only
after the vanishing gradients problem was alleviated that it became much more common to train DNNs purely using supervised learning. However, unsupervised pretraining
(today typically using autoencoders rather than RBMs) is still a good option
when you have a complex task to solve, no similar model you can reuse, and little
labeled training data but plenty of unlabeled training data.

## Pretraining on an Auxiliary Task

If you do not have much labeled training data, one last option is to train a first neural
network on an auxiliary task for which you can easily obtain or generate labeled
training data, then reuse the lower layers of that network for your actual task. The
first neural network’s lower layers will learn feature detectors that will likely be reusable
by the second neural network.


For example, if you want to build a system to recognize faces, you may only have a
few pictures of each individual—clearly not enough to train a good classifier. Gathering
hundreds of pictures of each person would not be practical. However, you could
gather a lot of pictures of random people on the web and train a first neural network
to detect whether or not two different pictures feature the same person. Such a network
would learn good feature detectors for faces, so reusing its lower layers would
allow you to train a good face classifier using little training data.


For natural language processing (NLP) applications, you can easily download millions
of text documents and automatically generate labeled data from it. For example, you
could randomly mask out some words and train a model to predict what the missing
words are (e.g., it should predict that the missing word in the sentence “What ___
you saying?” is probably “are” or “were”). If you can train a model to reach good performance
on this task, then it will already know quite a lot about language, and you
can certainly reuse it for your actual task, and fine-tune it on your labeled data.

## Self-supervised learning 
Self-supervised learning is when you automatically generate the
labels from the data itself, then you train a model on the resulting
“labeled” dataset using supervised learning techniques. Since this
approach requires no human labeling whatsoever, it is best classified
as a form of unsupervised learning.
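
As a minimal sketch of the masked-word idea described earlier, the labels can be produced from raw text alone. The helper name `make_masked_examples` and the `___` mask token are illustrative assumptions, not part of any library:

```python
import random

MASK = "___"  # hypothetical mask token, chosen only for illustration


def make_masked_examples(sentences, seed=42):
    """Turn raw sentences into (masked_sentence, missing_word) pairs."""
    rng = random.Random(seed)
    examples = []
    for sentence in sentences:
        words = sentence.split()
        if not words:
            continue  # nothing to mask in an empty sentence
        i = rng.randrange(len(words))  # position of the word to hide
        label = words[i]               # the label comes from the data itself
        masked = words[:i] + [MASK] + words[i + 1:]
        examples.append((" ".join(masked), label))
    return examples


corpus = ["What are you saying?", "The cat sat on the mat."]
pairs = make_masked_examples(corpus)
for masked, label in pairs:
    print(f"{masked!r} -> {label!r}")
```

The resulting pairs can then be fed to any supervised model, even though no human ever labeled them.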

# Faster Optimizers


Another huge speed boost comes from
using a faster optimizer than the regular Gradient Descent optimizer.

The most popular optimizers are: 
1. Momentum optimization
2. Nesterov Accelerated Gradient
3. AdaGrad
4. RMSProp
5. Adam and Nadam optimization.

## Momentum Optimization
Imagine a bowling ball rolling down a gentle slope on a smooth surface: it will start
out slowly, but it will quickly pick up momentum until it eventually reaches terminal
velocity (if there is some friction or air resistance). This is the very simple idea behind
Momentum optimization. In contrast, regular
Gradient Descent will simply take small regular steps down the slope, so it will take
much more time to reach the bottom.

Simple Gradient Descent simply updates the weights θ by directly subtracting the
gradient of the cost function J(θ) with regards to the weights (∇θJ(θ)) multiplied by
the learning rate η. The equation is: θ ← θ – η∇θJ(θ). It does not care about what the
earlier gradients were. If the local gradient is tiny, it goes very slowly.
Momentum optimization cares a great deal about what previous gradients were: at
each iteration, it subtracts the local gradient from the momentum vector m (multiplied
by the learning rate η), and it updates the weights by simply adding this
momentum vector. In other words, the gradient is used for acceleration,
not for speed. To simulate some sort of friction mechanism and prevent the
momentum from growing too large, the algorithm introduces a new hyperparameter
β, simply called the momentum, which must be set between 0 (high friction) and 1
(no friction). A typical momentum value is 0.9.

Equation for Momentum algorithm
1. $m \leftarrow \beta m - \eta \nabla_\theta J(\theta)$
2. $\theta \leftarrow \theta + m$

You can easily verify that if the gradient remains constant, the terminal velocity (i.e.,
the maximum size of the weight updates) is equal to that gradient multiplied by the
learning rate η multiplied by 1/(1 − β) (ignoring the sign). For example, if β = 0.9, then the
terminal velocity is equal to 10 times the gradient times the learning rate, so Momentum
optimization ends up going 10 times faster than Gradient Descent! This allows
Momentum optimization to escape from plateaus much faster than Gradient Descent.
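
The terminal velocity claim can be checked numerically. This is a small sketch with a scalar parameter and a made-up constant gradient of 2.0 (both are assumptions for illustration); `momentum_step` is a hypothetical helper that applies the two update rules of the Momentum algorithm:

```python
def momentum_step(theta, m, grad, lr=0.001, beta=0.9):
    """One Momentum update: m <- beta*m - lr*grad, then theta <- theta + m."""
    m = beta * m - lr * grad
    theta = theta + m
    return theta, m


lr, beta, grad = 0.001, 0.9, 2.0  # constant gradient, for illustration only
theta, m = 0.0, 0.0
for _ in range(500):
    theta, m = momentum_step(theta, m, grad, lr, beta)

# Predicted terminal velocity: gradient * learning rate * 1 / (1 - beta)
terminal = lr * grad / (1 - beta)
print(abs(m), terminal)
```

With β = 0.9 the update size settles at 10 times what plain Gradient Descent would use for the same gradient.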

Recall that when the inputs have very different scales, the
cost function looks like an elongated bowl. Gradient Descent goes
down the steep slope quite fast, but then it takes a very long time to go down the valley. In contrast, Momentum optimization will roll down the valley faster and faster
until it reaches the bottom (the optimum). In deep neural networks that don’t use
Batch Normalization, the upper layers will often end up having inputs with very different
scales, so using Momentum optimization helps a lot. It can also help roll past
local optima.


Due to the momentum, the optimizer may overshoot a bit, then
come back, overshoot again, and oscillate like this many times
before stabilizing at the minimum. This is one of the reasons why it
is good to have a bit of friction in the system: it gets rid of these
oscillations and thus speeds up convergence.

Implementing Momentum optimization in Keras: just use the SGD
optimizer and set its momentum hyperparameter, then lie back and profit!

optimizer = keras.optimizers.SGD(lr=0.001, momentum=0.9)

The one drawback of Momentum optimization is that it adds yet another hyperparameter
to tune. However, the momentum value of 0.9 usually works well in practice
and almost always goes faster than regular Gradient Descent.

In [None]:
import keras
optimizer = keras.optimizers.SGD(lr=0.001, momentum=0.9)

## Nesterov Accelerated Gradient
One small variant to Momentum optimization, proposed by Yurii Nesterov in 1983,
is almost always faster than vanilla Momentum optimization. The idea of Nesterov
Momentum optimization, or Nesterov Accelerated Gradient (NAG), is to measure the
gradient of the cost function not at the local position but slightly ahead in the direction
of the momentum. The only difference from vanilla
Momentum optimization is that the gradient is measured at θ + βm rather than at θ.

Nesterov Accelerated Gradient algorithm
1. $m \leftarrow \beta m - \eta \nabla_\theta J(\theta + \beta m)$
2. $\theta \leftarrow \theta + m$

This small tweak works because in general the momentum vector will be pointing in
the right direction (i.e., toward the optimum), so it will be slightly more accurate to
use the gradient measured a bit farther in that direction rather than using the gradient
at the original position. As you can see, the Nesterov update ends up
slightly closer to the optimum. After a while, these small improvements add up and
NAG ends up being significantly faster than regular Momentum optimization. 

Moreover,
note that when the momentum pushes the weights across a valley, ∇1 (the gradient
measured at the starting point θ) continues to push further across the valley, while
∇2 (the gradient measured at the look-ahead point θ + βm) pushes back toward the
bottom of the valley. This helps reduce oscillations, so NAG converges faster.

NAG will almost always speed up training compared to regular Momentum optimization.
To use it, simply set nesterov=True when creating the SGD optimizer:

optimizer = keras.optimizers.SGD(lr=0.001, momentum=0.9, nesterov=True)
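
To see the difference between the two update rules on a single step, here is a minimal sketch on the toy cost J(θ) = θ² (gradient 2θ). The helper names, the starting values and the learning rate of 0.1 are assumptions chosen to keep the arithmetic easy to follow:

```python
def momentum_step(theta, m, grad_fn, lr, beta):
    """Vanilla Momentum: gradient measured at the current position theta."""
    m = beta * m - lr * grad_fn(theta)
    return theta + m, m


def nag_step(theta, m, grad_fn, lr, beta):
    """NAG: gradient measured at the look-ahead position theta + beta*m."""
    m = beta * m - lr * grad_fn(theta + beta * m)
    return theta + m, m


def grad_fn(theta):
    return 2.0 * theta  # gradient of the toy cost J(theta) = theta**2


theta0, m0 = 1.0, -0.1  # already moving toward the optimum at 0
theta_mom, m_mom = momentum_step(theta0, m0, grad_fn, lr=0.1, beta=0.9)
theta_nag, m_nag = nag_step(theta0, m0, grad_fn, lr=0.1, beta=0.9)
print(theta_mom, theta_nag)
```

Here the look-ahead point (0.91) is closer to the optimum than the starting point, so NAG sees a smaller gradient (1.82 instead of 2.0) and builds up slightly less momentum, which is the braking effect that damps oscillations.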

My own analogy: think of a car with headlights. Because it can see the road a short distance ahead, it knows when to accelerate and when to decelerate.

In [None]:
optimizer = keras.optimizers.SGD(lr=0.001, momentum=0.9, nesterov=True)

## AdaGrad
Consider the elongated bowl problem again: Gradient Descent starts by quickly going
down the steepest slope, then slowly goes down the bottom of the valley. It would be
nice if the algorithm could detect this early on and correct its direction to point a bit
more toward the global optimum.
The AdaGrad algorithm achieves this by scaling down the gradient vector along the
steepest dimensions.

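1. $s \leftarrow s + \nabla_\theta J(\theta) \otimes \nabla_\theta J(\theta)$
2. $\theta \leftarrow \theta - \eta \, \nabla_\theta J(\theta) \oslash \sqrt{s + \epsilon}$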

The first step accumulates the square of the gradients into the vector s (recall that the
⊗ symbol represents the element-wise multiplication). This vectorized form is equivalent
to computing $s_i \leftarrow s_i + (\partial J(\theta) / \partial \theta_i)^2$ for each element $s_i$ of the vector s; in other
words, each $s_i$ accumulates the squares of the partial derivative of the cost function
with regards to parameter $\theta_i$. If the cost function is steep along the ith dimension, then
$s_i$ will get larger and larger at each iteration.

The second step is almost identical to Gradient Descent, but with one big difference:
the gradient vector is scaled down by a factor of $\sqrt{s + \epsilon}$ (the ⊘ symbol represents the
element-wise division, and ϵ is a smoothing term to avoid division by zero, typically
set to $10^{-10}$). This vectorized form is equivalent to computing
$\theta_i \leftarrow \theta_i - \eta \, \frac{\partial J(\theta) / \partial \theta_i}{\sqrt{s_i + \epsilon}}$ for all parameters $\theta_i$ (simultaneously).

In short, this algorithm decays the learning rate, but it does so faster for steep dimensions
than for dimensions with gentler slopes. This is called an adaptive learning rate.
It helps point the resulting updates more directly toward the global optimum (see
Figure 11-7). One additional benefit is that it requires much less tuning of the learning
rate hyperparameter η.
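
The per-dimension scaling can be illustrated with a minimal sketch in plain Python. The elongated-bowl cost, the starting point and the learning rate of 0.1 are assumptions made up for this example; `adagrad_step` is a hypothetical helper that follows the two AdaGrad steps:

```python
import math


def adagrad_step(theta, s, grad, lr=0.1, eps=1e-10):
    """One AdaGrad update on lists of parameters (element-wise operations)."""
    # Step 1: accumulate the squared gradients, one entry per parameter.
    s = [si + g * g for si, g in zip(s, grad)]
    # Step 2: Gradient Descent step, scaled down by sqrt(s_i + eps).
    theta = [t - lr * g / math.sqrt(si + eps)
             for t, g, si in zip(theta, grad, s)]
    return theta, s


def grad_fn(theta):
    # Gradient of J(theta) = 0.5 * (100 * theta_0**2 + theta_1**2):
    # the bowl is 100 times steeper along dimension 0.
    return [100.0 * theta[0], theta[1]]


theta, s = [1.0, 1.0], [0.0, 0.0]
theta, s = adagrad_step(theta, s, grad_fn(theta))
print(theta)
```

Although the gradient is 100 times larger along the first dimension, both parameters move by about 0.1 on this first step, so the update points toward the optimum at the origin rather than straight down the steepest slope.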

AdaGrad often performs well for simple quadratic problems, but unfortunately it
often stops too early when training neural networks. The learning rate gets scaled
down so much that the algorithm ends up stopping entirely before reaching the
global optimum. So even though Keras has an Adagrad optimizer, **you should not use
it to train deep neural networks** (it may be efficient for simpler tasks such as Linear
Regression, though). However, understanding Adagrad is helpful to grasp the other
adaptive learning rate optimizers.




## RMSProp
Although AdaGrad slows down a bit too fast and ends up never converging to the
global optimum, the RMSProp algorithm fixes this by accumulating only the gradients
from the most recent iterations (as opposed to all the gradients since the beginning
of training). 

It does so by using exponential decay in the first step:
RMSProp algorithm:
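1. $s \leftarrow \beta s + (1 - \beta) \nabla_\theta J(\theta) \otimes \nabla_\theta J(\theta)$
2. $\theta \leftarrow \theta - \eta \, \nabla_\theta J(\theta) \oslash \sqrt{s + \epsilon}$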


The decay rate β is typically set to 0.9. Yes, it is once again a new hyperparameter, but
this default value often works well, so you may not need to tune it at all.
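
The effect of the exponential decay on the accumulator s can be seen in a small sketch with a scalar parameter and a made-up constant gradient of 2.0 (an assumption for illustration). The two helper functions are hypothetical names for the first step of each algorithm:

```python
def adagrad_accumulate(s, grad):
    """AdaGrad step 1: s keeps every squared gradient since the start."""
    return s + grad * grad


def rmsprop_accumulate(s, grad, beta=0.9):
    """RMSProp step 1: exponentially decaying average of squared gradients."""
    return beta * s + (1 - beta) * grad * grad


grad = 2.0  # constant gradient, for illustration only
s_ada = s_rms = 0.0
for _ in range(200):
    s_ada = adagrad_accumulate(s_ada, grad)
    s_rms = rmsprop_accumulate(s_rms, grad)

print(s_ada, s_rms)
```

After 200 iterations the AdaGrad accumulator has grown to 800 and keeps growing, so the effective learning rate keeps shrinking, whereas the RMSProp accumulator levels off near the squared gradient (4.0).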

Keras has an RMSProp optimizer:
optimizer = keras.optimizers.RMSprop(lr=0.001, rho=0.9)

Except on very simple problems, this optimizer almost always performs much better
than AdaGrad. In fact, **it was the preferred optimization algorithm of many researchers
until Adam optimization came around.**

In [None]:
optimizer = keras.optimizers.RMSprop(lr=0.001, rho=0.9)

## Adam and Nadam Optimization
Adam, which stands for adaptive moment estimation, combines the ideas of Momentum
optimization and RMSProp: just like Momentum optimization it keeps track of
an exponentially decaying average of past gradients, and just like RMSProp it keeps
track of an exponentially decaying average of past squared gradients.

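1. $m \leftarrow \beta_1 m - (1 - \beta_1) \nabla_\theta J(\theta)$
2. $s \leftarrow \beta_2 s + (1 - \beta_2) \nabla_\theta J(\theta) \otimes \nabla_\theta J(\theta)$
3. $\hat{m} \leftarrow m / (1 - \beta_1^t)$
4. $\hat{s} \leftarrow s / (1 - \beta_2^t)$
5. $\theta \leftarrow \theta + \eta \, \hat{m} \oslash \sqrt{\hat{s} + \epsilon}$

Here t is the iteration number (starting at 1), and the hats mark the bias-corrected values of m and s.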


If you just look at steps 1, 2, and 5, you will notice Adam’s close similarity to both
Momentum optimization and RMSProp. The only difference is that step 1 computes
an exponentially decaying average rather than an exponentially decaying sum, but
these are actually equivalent except for a constant factor (the decaying average is just
1 – β1 times the decaying sum). 

Steps 3 and 4 are somewhat of a technical detail: since
m and s are initialized at 0, they will be biased toward 0 at the beginning of training,
so these two steps will help boost m and s at the beginning of training.
The momentum decay hyperparameter β1 is typically initialized to 0.9, while the scaling
decay hyperparameter β2 is often initialized to 0.999. As earlier, the smoothing
term ϵ is usually initialized to a tiny number such as 10^–7. These are the default values
for the Adam class (to be precise, epsilon defaults to None, which tells Keras to use
keras.backend.epsilon(), which defaults to 10^–7; you can change it using
keras.backend.set_epsilon()).

optimizer = keras.optimizers.Adam(lr=0.001, beta_1=0.9, beta_2=0.999)

Since Adam is an adaptive learning rate algorithm (like AdaGrad and RMSProp), it
requires less tuning of the learning rate hyperparameter η. You can often use the
default value η = 0.001, making Adam even easier to use than Gradient Descent.
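
The five steps can be sketched for a single scalar parameter as follows. This follows the sign convention and the placement of ϵ inside the square root used in these notes (library implementations may differ in such details); `adam_step` is a hypothetical helper, and the gradient value 50.0 is made up for illustration:

```python
import math


def adam_step(theta, m, s, grad, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-7):
    """One Adam update for a scalar parameter; t is the iteration number (from 1)."""
    m = beta1 * m - (1 - beta1) * grad          # step 1: decaying average of gradients
    s = beta2 * s + (1 - beta2) * grad * grad   # step 2: decaying average of squared gradients
    m_hat = m / (1 - beta1 ** t)                # step 3: bias correction for m
    s_hat = s / (1 - beta2 ** t)                # step 4: bias correction for s
    theta = theta + lr * m_hat / math.sqrt(s_hat + eps)  # step 5: parameter update
    return theta, m, s


theta, m, s = 1.0, 0.0, 0.0
theta, m, s = adam_step(theta, m, s, grad=50.0, t=1)
print(theta)
```

Thanks to the bias correction, the very first update has a size of about η = 0.001 even though the gradient is large, which is part of why the default learning rate so often works without tuning.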



## Adamax
**Not very useful in general.** Adamax was introduced in the same paper as Adam. Notice that in step 2 of the Adam equations, Adam accumulates the squares of the gradients in s (with a greater weight
for more recent gradients). In step 5, if we ignore ϵ and steps 3 and 4 (which are
technical details anyway), Adam just scales down the parameter updates by the
square root of s. In short, Adam scales down the parameter updates by the ℓ2
norm of the time-decayed gradients (recall that the ℓ2 norm is the square root of
the sum of squares). Adamax just replaces the ℓ2 norm with the ℓ∞ norm (a fancy
way of saying the max).

In practice, this can make Adamax more stable than Adam, but this really depends
on the dataset, and in general Adam actually performs better. So it’s just one
more optimizer you can try if you experience problems with Adam on some task.

## Nadam optimization 
**Nadam is more important: it is simply Adam optimization plus
the Nesterov trick, so it will often converge slightly faster than Adam.** In his
report, Timothy Dozat compares many different optimizers on various tasks, and
finds that Nadam generally outperforms Adam, but is sometimes outperformed
by RMSProp.

Adaptive optimization methods (including RMSProp, Adam and
Nadam optimization) are often great, converging fast to a good solution.
However, a 2017 paper by Ashia C. Wilson et al. showed
that they can lead to solutions that generalize poorly on some datasets.
So when you are disappointed by your model’s performance,
try using plain Nesterov Accelerated Gradient instead: your dataset
may just be allergic to adaptive gradients. Also check out the latest
research; it is moving fast (e.g., AdaBound).



## Second-order optimization algorithms (using Hessians instead of Jacobians)
All the optimization techniques discussed so far only rely on the first-order partial
derivatives (Jacobians). The optimization literature contains amazing algorithms
based on the second-order partial derivatives (the Hessians, which are the partial
derivatives of the Jacobians). Unfortunately, these algorithms are very hard to apply
to deep neural networks because there are n^2 Hessians per output (where n is the
number of parameters), as opposed to just n Jacobians per output. Since DNNs typically
have tens of thousands of parameters, **the second-order optimization algorithms often don’t even fit in memory, and even when they do, computing the Hessians is
just too slow.**
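
A rough back-of-the-envelope check of the memory claim, assuming 32-bit floats (4 bytes per value, an assumption the text does not state) and a hypothetical network with 50,000 parameters:

```python
def first_order_bytes(n_params, bytes_per_value=4):
    """Memory for n first-order partial derivatives (one per parameter)."""
    return n_params * bytes_per_value


def hessian_bytes(n_params, bytes_per_value=4):
    """Memory for the n^2 second-order partial derivatives."""
    return n_params ** 2 * bytes_per_value


n = 50_000  # hypothetical parameter count, i.e. tens of thousands
print(first_order_bytes(n) / 1e6, "MB for the first-order derivatives")
print(hessian_bytes(n) / 1e9, "GB for the Hessians")
```

Under these assumptions the first-order derivatives take 0.2 MB while the Hessians take 10 GB, and the gap widens quadratically as the network grows.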

## Optimizers to train sparse models

All the optimization algorithms just presented produce dense models, meaning that
most parameters will be nonzero. If you need a blazingly fast model at runtime, or if
you need it to take up less memory, you may prefer to end up with a sparse model
instead.

One trivial way to achieve this is to train the model as usual, then get rid of the tiny
weights (set them to 0). However, this will typically not lead to a very sparse model,
and it may degrade the model’s performance.
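
A minimal sketch of this trivial approach on a plain list of weights. The threshold of 0.01 and the weight values are arbitrary assumptions, and both helper names are hypothetical:

```python
def prune_small_weights(weights, threshold=1e-2):
    """Set every weight whose magnitude is below the threshold to exactly 0."""
    return [0.0 if abs(w) < threshold else w for w in weights]


def sparsity(weights):
    """Fraction of weights that are exactly zero."""
    return sum(1 for w in weights if w == 0.0) / len(weights)


weights = [0.8, -0.003, 0.0004, -1.2, 0.009, 0.25]  # made-up trained weights
pruned = prune_small_weights(weights)
print(pruned, sparsity(pruned))
```

How sparse the result is depends entirely on how many weights happened to end up tiny, which is why this approach alone typically does not go very far.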

A better option is to apply strong ℓ1 regularization during training, as it pushes the
optimizer to zero out as many weights as it can (as in Lasso
Regression).

However, in some cases these techniques may remain insufficient. 
One last option is
to apply **Dual Averaging, often called Follow The Regularized Leader (FTRL)**, a technique
proposed by Yurii Nesterov. When used with ℓ1 regularization, this technique
often leads to very sparse models. Keras implements a variant of FTRL called FTRLProximal
in the FTRL optimizer.

## Summarizing all the optimizers:
1. SGD: simply looks at the local gradient at the current point.
2. Momentum optimizer: accumulates past gradients in a momentum vector m and adds m to theta. To keep the momentum from growing too large, friction is introduced via beta (beta=0.9 keeps 90% of the previous momentum at each step, so the friction is roughly 0.1).
3. Nesterov accelerated gradient: same as momentum, but it looks ahead some distance and uses the cost gradient at that look-ahead point instead of the gradient at the current point.
4. AdaGrad (adaptive gradient): divides the learning rate, per dimension, by the square root of the accumulated squared gradients, so the learning rate decays faster along steeper dimensions. Not useful for deep learning, as the learning rate decays too soon and training stops before converging.
5. RMSProp: same as AdaGrad, but to avoid too much decay of the learning rate, s accumulates only the gradients from the most recent iterations, using exponential decay.
6. Adam: basically momentum + RMSProp.

    i. Adamax: not very useful. Adamax just replaces the ℓ2 norm with the ℓ∞ norm (a fancy way of saying the max).

    ii. Nadam: Adam + Nesterov (looking a short distance ahead).
7. Second order: use second-order derivatives instead of first-order ones. Not practical, since there are n^2 terms, so they often do not fit in memory and are very slow to compute.
8. Optimizers for sparse models: 

    i. Get rid of tiny weights (set them to 0).

    ii. ℓ1 regularization (pushes weights to 0).

    iii. Dual averaging, i.e. FTRL (Follow The Regularized Leader). Used alongside ℓ1 regularization, this leads to very sparse models.




# Learning Schedules: Scheduling the Learning Rate
One approach is to start with a large learning rate and
divide it by 3 until the training algorithm stops diverging. You will not be too far
from the optimal learning rate, which will learn quickly and converge to a good solution.

There are many different
strategies to reduce the learning rate during training. These strategies are called
learning schedules.

## Power scheduling
Set the learning rate to a function of the iteration number t: η(t) = η0 / (1 + t/s)^c.
The initial learning rate η0, the power c (typically set to 1) and the steps s are
hyperparameters. The learning rate drops at each step, and after s steps it is down
to η0 / 2. After s more steps, it is down to η0 / 3. Then down to η0 / 4, then η0 / 5,
and so on. As you can see, this schedule first drops quickly, then more and more
slowly. Of course, this requires tuning η0, s (and possibly c).
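
The schedule itself is easy to sketch as a plain function. The name `power_schedule` and the values η0 = 0.01 and s = 10,000 are assumptions chosen for illustration:

```python
def power_schedule(t, lr0=0.01, s=10_000, c=1):
    """Power scheduling: lr(t) = lr0 / (1 + t/s)**c, where t is the iteration number."""
    return lr0 / (1 + t / s) ** c


for t in (0, 10_000, 20_000, 30_000):
    print(t, power_schedule(t))
```

With c = 1 the learning rate is η0, η0 / 2, η0 / 3 and η0 / 4 after 0, s, 2s and 3s steps, so each further drop takes the same number of steps but is smaller than the previous one.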

In [None]:
"""Implementing power scheduling in Keras is the easiest option: just set the decay
hyperparameter when creating an optimizer. The decay is the inverse of s (the number
of steps it takes to divide the learning rate by one more unit), and Keras assumes
that c is equal to 1. For example:"""
import keras
optimizer = keras.optimizers.SGD(lr=0.01, decay=1e-4)

## Exponential scheduling
Set the learning rate to: η(t) = η0 *((0.1)^(t/s)). The learning rate will gradually drop by a
factor of 10 every s steps. While power scheduling reduces the learning rate more
and more slowly, exponential scheduling keeps slashing it by a factor of 10 every
s steps.

In [None]:
def exponential_decay_fn(epoch):
    return 0.01 * 0.1**(epoch / 20)


In [None]:
# If you do not want to hard-code η0 and s, you can create a function that returns a configured function:
def exponential_decay(lr0, s):
    def exponential_decay_fn(epoch):
        return lr0 * 0.1**(epoch / s)
    return exponential_decay_fn
exponential_decay_fn = exponential_decay(lr0=0.01, s=20)



In [None]:
# Next, just create a LearningRateScheduler callback, giving it the schedule function,
# and pass this callback to the fit() method:
lr_scheduler = keras.callbacks.LearningRateScheduler(exponential_decay_fn)
history = model.fit(X_train_scaled, y_train, [...], callbacks=[lr_scheduler])

"""
The LearningRateScheduler will update the optimizer’s learning_rate attribute at
the beginning of each epoch. Updating the learning rate just once per epoch is usually
enough, but if you want it to be updated more often, for example at every step, you
need to write your own callback (see the notebook for an example). This can make
sense if there are many steps per epoch.
"""

## Piecewise constant scheduling
Use a constant learning rate for a number of epochs (e.g., η0 = 0.1 for 5 epochs),
then a smaller learning rate for another number of epochs (e.g., η1 = 0.001 for 50
epochs), and so on. Although this solution can work very well, it requires fiddling around to figure out the right sequence of learning rates, and how long to
use each of them.


In [None]:
def piecewise_constant_fn(epoch):
    if epoch < 5:
        return 0.01
    elif epoch < 15:
        return 0.005
    else:
        return 0.001
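
If you do not want to hard-code the boundaries and rates, one option is a factory function, in the same spirit as the configurable exponential schedule. This is a sketch using the standard library's `bisect` module; the names `piecewise_constant` and `schedule_fn` are hypothetical:

```python
from bisect import bisect_right


def piecewise_constant(boundaries, values):
    """Return a schedule: values[i] applies until epoch reaches boundaries[i]."""
    if len(values) != len(boundaries) + 1:
        raise ValueError("need exactly one more value than boundaries")

    def schedule_fn(epoch):
        # bisect_right counts how many boundaries are <= epoch,
        # which is the index of the learning rate to use.
        return values[bisect_right(boundaries, epoch)]

    return schedule_fn


schedule_fn = piecewise_constant([5, 15], [0.01, 0.005, 0.001])
print([schedule_fn(epoch) for epoch in (0, 4, 5, 14, 15, 30)])
```

The returned function takes an epoch number and returns a learning rate, so it can be used wherever the hard-coded version is used.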

## Performance scheduling
Measure the validation error every N steps (just like for early stopping) and
reduce the learning rate by a factor of λ when the error stops dropping.

In [None]:
"""
For performance scheduling, simply use the ReduceLROnPlateau callback. For example,
if you pass the following callback to the fit() method, it will multiply the learning
rate by 0.5 whenever the best validation loss does not improve for 5 consecutive
epochs""" 

lr_scheduler = keras.callbacks.ReduceLROnPlateau(factor=0.5, patience=5)
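
To make the rule concrete, here is a simplified pure-Python simulation of the idea. It is not the Keras implementation: it ignores the callback's extra options (such as min_delta, cooldown and min_lr), and the function name and validation losses are made up for illustration:

```python
def reduce_on_plateau(val_losses, lr0=0.01, factor=0.5, patience=5):
    """Return the learning rate used at each epoch, given per-epoch validation losses."""
    lr, best, wait = lr0, float("inf"), 0
    lrs = []
    for loss in val_losses:
        lrs.append(lr)            # learning rate in effect during this epoch
        if loss < best:
            best, wait = loss, 0  # new best loss: reset the counter
        else:
            wait += 1
            if wait >= patience:  # no improvement for `patience` epochs
                lr *= factor
                wait = 0
    return lrs


val_losses = [1.0, 0.9] + [0.95] * 6  # improves twice, then plateaus
lrs = reduce_on_plateau(val_losses)
print(lrs)
```

In this run the loss fails to beat its best value (0.9) for 5 consecutive epochs, so the last epoch runs with the learning rate halved to 0.005.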


## Simpler implementation via tf.keras:
tf.keras offers an alternative way to implement learning rate scheduling: just
define the learning rate using one of the schedules available in keras.optimizers.schedules, then pass this learning rate to any optimizer. This approach updates
the learning rate at each step rather than at each epoch. For example, here is how to
implement the same exponential schedule as earlier:

In [None]:
from tensorflow import keras

s = 20 * len(X_train) // 32 
# number of steps in 20 epochs (batch size = 32)

learning_rate = keras.optimizers.schedules.ExponentialDecay(0.01, s, 0.1)

optimizer = keras.optimizers.SGD(learning_rate)

"""
This is nice and simple, plus when you save the model, the learning rate and its
schedule (including its state) get saved as well. However, this approach is not part of
the Keras API, it is specific to tf.keras.
"""

**Exponential decay or performance scheduling can considerably speed up
convergence.**
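
For reference, the per-step exponential schedule can be written out in plain Python. This sketch assumes the non-staircase form lr(step) = lr0 · rate^(step / decay_steps) and a hypothetical training set of 55,000 instances; the function name is illustrative:

```python
def exponential_decay_per_step(step, lr0=0.01, decay_steps=34_375, decay_rate=0.1):
    """Learning rate after `step` training steps (updated at every step, not every epoch)."""
    return lr0 * decay_rate ** (step / decay_steps)


n_train = 55_000                # hypothetical training set size
batch_size = 32
s = 20 * n_train // batch_size  # number of steps in 20 epochs

print(s)
print(exponential_decay_per_step(0, decay_steps=s))
print(exponential_decay_per_step(s, decay_steps=s))
```

After s steps (20 epochs under these assumptions) the learning rate has dropped by a factor of 10, matching the epoch-based exponential schedule, but it decreases smoothly within each epoch instead of in one jump per epoch.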

# Avoiding Overfitting Through Regularization
With four parameters I can fit an elephant and with five I can make him wiggle his
trunk.
—John von Neumann, cited by Enrico Fermi in Nature 427

With thousands of parameters you can fit the whole zoo. Deep neural networks typically
have tens of thousands of parameters, sometimes even millions. With so many
parameters, the network has an incredible amount of freedom and can fit a huge variety
of complex datasets. But this great flexibility also means that it is prone to overfitting
the training set. 

We need regularization.
We already implemented one of the best regularization techniques earlier:
early stopping. Moreover, even though Batch Normalization was designed to solve
the vanishing/exploding gradients problems, it also acts like a pretty good regularizer.

Other popular regularization techniques for neural networks:
ℓ1 and ℓ2 regularization, dropout and max-norm regularization.



## ℓ1 and ℓ2 Regularization
Just like for simple linear models, you can use ℓ1 and ℓ2 regularization
to constrain a neural network’s connection weights (but typically not its biases).
Here is how to apply ℓ2 regularization to a Keras layer’s connection weights,
using a regularization factor of 0.01:


In [None]:
layer = keras.layers.Dense(100, activation="elu",kernel_initializer="he_normal", kernel_regularizer=keras.regularizers.l2(0.01))
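
To make the penalty concrete, here is a plain-Python sketch of what an ℓ2 or ℓ1 regularizer with factor 0.01 contributes to the loss: the factor times the sum of squared (or absolute) weights. The tiny 2×2 kernel and the helper names are made up for illustration:

```python
def l2_penalty(kernel, factor=0.01):
    """l2 regularization loss: factor * sum of squared connection weights."""
    return factor * sum(w * w for row in kernel for w in row)


def l1_penalty(kernel, factor=0.01):
    """l1 regularization loss: factor * sum of absolute connection weights."""
    return factor * sum(abs(w) for row in kernel for w in row)


kernel = [[0.5, -1.0],
          [2.0, 0.0]]  # hypothetical connection weights (biases are not included)

print(l2_penalty(kernel))
print(l1_penalty(kernel))
```

Note how the single large weight (2.0) dominates the ℓ2 penalty, while the ℓ1 penalty grows only linearly with each weight's magnitude.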

"""
The l2() function returns a regularizer that will be called to compute the regularization
loss, at each step during training. This regularization loss is then added to the
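final loss. As you might expect, you can just use keras.regularizers.l1() if you
want ℓ1 regularization; if you want both ℓ1 and ℓ2 regularization, use
keras.regularizers.l1_l2() (specifying both regularization factors).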
"""

In [None]:
"""
Since you will typically want to apply the same regularizer to all layers in your network,
as well as the same activation function and the same initialization strategy in all
hidden layers, you may find yourself repeating the same arguments over and over.
This makes it ugly and error-prone. To avoid this, you can try refactoring your code
to use loops. Another option is to use Python’s functools.partial() function: it lets
you create a thin wrapper for any callable, with some default argument values. For
example:"""
from functools import partial
RegularizedDense = partial(keras.layers.Dense, activation="elu",kernel_initializer="he_normal", kernel_regularizer=keras.regularizers.l2(0.01))

model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),
    RegularizedDense(300),
    RegularizedDense(100),
    RegularizedDense(10, activation="softmax",
                     kernel_initializer="glorot_uniform")
])

## Dropout
Dropout is one of the most popular regularization techniques for deep neural networks.
Even state-of-the-art neural networks can get a 1–2% accuracy boost simply by adding dropout. 

It is a fairly simple algorithm: at every training step, every neuron (including the
input and hidden neurons, but always excluding the output neurons) has a probability p of being
temporarily “dropped out,” meaning it will be entirely ignored during this training
step, but it may be active during the next step (see Figure 11-9). The hyperparameter
p is called the dropout rate, and it is typically set to 50%. After training, neurons don’t
get dropped anymore. And that’s all.



### Intuition for dropout:
Would a
company perform better if its employees were told to toss a coin every morning to
decide whether or not to go to work? Well, who knows; perhaps it would! The company
would obviously be forced to adapt its organization; it could not rely on any single
person to fill in the coffee machine or perform any other critical tasks, so this
expertise would have to be spread across several people. Employees would have to
learn to cooperate with many of their coworkers, not just a handful of them. The
company would become much more resilient. If one person quit, it wouldn’t make
much of a difference. It’s unclear whether this idea would actually work for companies,
but it certainly does for neural networks. 

Neurons trained with dropout cannot
co-adapt with their neighboring neurons; they have to be as useful as possible on
their own. They also cannot rely excessively on just a few input neurons; they must
pay attention to each of their input neurons. They end up being less sensitive to slight
changes in the inputs. In the end you get a more robust network that generalizes better.

Another way to understand the power of dropout is to realize that a unique neural
network is generated at each training step. Since each neuron can be either present or
absent, there is a total of 2^N possible networks (where N is the total number of droppable
neurons). This is such a huge number that it is virtually impossible for the same
neural network to be sampled twice. Once you have run 10,000 training steps, you
have essentially trained 10,000 different neural networks (each with just one training
instance). These neural networks are obviously not independent since they share
many of their weights, but they are nevertheless all different. The resulting neural
network can be seen as an averaging ensemble of all these smaller neural networks.

There is one small but important technical detail. Suppose p = 50%, in which case
during testing a neuron will be connected to twice as many input neurons as it was
(on average) during training. To compensate for this fact, we need to multiply each neuron’s input connection weights by 0.5 after training. If we don’t, each neuron will
get a total input signal roughly twice as large as what the network was trained on, and
it is unlikely to perform well. More generally, we need to multiply each input connection
weight by the keep probability (1 – p) after training. 

Alternatively, we can divide
each neuron’s output by the keep probability during training (these alternatives are
not perfectly equivalent, but they work equally well).

To implement dropout using Keras, you can use the keras.layers.Dropout layer.
During training, it randomly drops some inputs (setting them to 0) and divides the
remaining inputs by the keep probability. After training, it does nothing at all, it just
passes the inputs to the next layer. For example, the following code applies dropout
regularization before every Dense layer, using a dropout rate of 0.2:

In [None]:
model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),
    keras.layers.Dropout(rate=0.2),
    keras.layers.Dense(300, activation="elu", kernel_initializer="he_normal"),
    keras.layers.Dropout(rate=0.2),
    keras.layers.Dense(100, activation="elu", kernel_initializer="he_normal"),
    keras.layers.Dropout(rate=0.2),
    keras.layers.Dense(10, activation="softmax")
])
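
The scaling trick (drop some inputs, divide the rest by the keep probability) can be sketched in plain Python. This is only an illustration of the idea, not how Keras implements it; the `dropout` helper, the all-ones input and the seed are assumptions:

```python
import random


def dropout(inputs, rate, rng):
    """Training-time dropout: zero each input with probability `rate`,
    and divide the surviving inputs by the keep probability (1 - rate)."""
    keep_prob = 1.0 - rate
    return [x / keep_prob if rng.random() < keep_prob else 0.0 for x in inputs]


rng = random.Random(0)     # fixed seed so the run is repeatable
inputs = [1.0] * 100_000   # made-up inputs, all equal to 1
outputs = dropout(inputs, rate=0.2, rng=rng)

kept_fraction = sum(1 for y in outputs if y != 0.0) / len(outputs)
mean_output = sum(outputs) / len(outputs)
print(kept_fraction, mean_output)
```

Roughly 80% of the inputs survive and each survivor is scaled up to 1.25, so the average signal stays close to 1.0. That is why nothing needs to be rescaled after training, when dropout is switched off.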

Since dropout is only active during training, the training loss is
penalized compared to the validation loss, so comparing the two
can be misleading. In particular, a model may be overfitting the
training set and yet have similar training and validation losses. So
make sure to evaluate the training loss without dropout (e.g., after
training). Alternatively, you can call the fit() method inside a
with keras.backend.learning_phase_scope(1) block: this will
force dropout to be active during both training and validation.

If you observe that the model is overfitting, you can increase the dropout rate. Conversely,
you should try decreasing the dropout rate if the model underfits the training
set. It can also help to increase the dropout rate for large layers, and reduce it for
small ones. Moreover, many state-of-the-art architectures only use dropout after the
last hidden layer, so you may want to try this if full dropout is too strong.

Dropout does tend to significantly slow down convergence, but it usually results in a
much better model when tuned properly. So, it is generally well worth the extra time
and effort.

If you want to regularize a self-normalizing network based on the
SELU activation function (as discussed earlier), you should use
AlphaDropout: this is a variant of dropout that preserves the mean
and standard deviation of its inputs (it was introduced in the same
paper as SELU, as regular dropout would break self-normalization).


## Monte-Carlo (MC) Dropout
