# Training Deep Neural Networks

When you work with deep neural networks, you are likely to run into some of the following problems:

- Vanishing and exploding gradients problems
- Not having enough training data
- Training may be extremely slow
- Overfitting

We will go through each of these problems and present techniques to solve them.

## The Vanishing/Exploding Gradients Problems

More generally, deep neural networks suffer from unstable gradients: different layers may learn at widely different speeds.
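
As a rough illustration of why gradients can vanish, the sketch below multiplies the derivative of the logistic (sigmoid) function across several layers. It deliberately ignores the weights, so it is only an intuition aid, not a model of real backpropagation; the helper name `gradient_scale_bound` is made up for this example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_derivative(z):
    # The derivative of the sigmoid peaks at z = 0, where it equals 0.25
    s = sigmoid(z)
    return s * (1.0 - s)

def gradient_scale_bound(n_layers, z=0.0):
    """Product of n_layers sigmoid derivatives evaluated at z (weights ignored)."""
    return sigmoid_derivative(z) ** n_layers

# Even in the best case (z = 0), the factor shrinks very quickly with depth
for depth in (1, 5, 20):
    print(depth, gradient_scale_bound(depth))
```

Because each factor is at most 0.25, the product over 20 layers is already below one trillionth, which hints at why the lower layers of a deep sigmoid network barely learn.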

### Glorot and He Initialization

The goal is to keep the signal flowing properly in both directions: the variance of each layer's outputs should be equal to the variance of its inputs.

The connection weights of each layer must be initialized randomly.

- _fan-in_: number of inputs of the layer
- _fan-out_: number of neurons of the layer

Using Glorot initialization can speed up training considerably, and it is one of the tricks that led to the success of deep learning.

Some similar strategies have been shown to work better with particular activation functions:

![alt text](images/initializations.png)

By default Keras uses Glorot initialization with a uniform distribution
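
The following is a minimal sketch of how the standard deviation of the initial weights could be derived from _fan-in_ and _fan-out_ for the Glorot, He and LeCun schemes. The function names are made up for this example, and it samples from a plain normal distribution, whereas Keras initializers such as `he_normal` use a truncated normal, so the numbers will not match Keras exactly.

```python
import math
import random

def init_std(fan_in, fan_out, scheme="glorot"):
    """Standard deviation of a normal initializer for the given scheme."""
    fan_avg = (fan_in + fan_out) / 2
    if scheme == "glorot":
        variance = 1.0 / fan_avg
    elif scheme == "he":
        variance = 2.0 / fan_in
    elif scheme == "lecun":
        variance = 1.0 / fan_in
    else:
        raise ValueError(f"unknown scheme: {scheme}")
    return math.sqrt(variance)

def glorot_uniform_limit(fan_in, fan_out):
    """Weights are drawn uniformly from [-limit, +limit]."""
    return math.sqrt(6.0 / (fan_in + fan_out))

def init_weights(fan_in, fan_out, scheme="he", seed=0):
    """Return a fan_in x fan_out weight matrix as nested lists."""
    rng = random.Random(seed)
    std = init_std(fan_in, fan_out, scheme)
    return [[rng.gauss(0.0, std) for _ in range(fan_out)] for _ in range(fan_in)]

print(init_std(100, 100, "glorot"), init_std(100, 100, "he"))
print(glorot_uniform_limit(100, 100))
```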

### Nonsaturating Activation Functions

The ReLU activation function used to be the most popular choice because it does not saturate for positive values and because it is fast to compute.

Unfortunately, it suffers from a problem known as _dying ReLUs_: during training, some neurons effectively "die", meaning they stop outputting anything other than 0.

One alternative is _leaky ReLU_ and its variants, which have been reported to outperform ReLU:

- Randomized leaky ReLU (RReLU): alpha is picked randomly during training, which also reduces overfitting.
- Parametric leaky ReLU (PReLU): alpha is learned during training, so it is treated as a parameter rather than a hyperparameter.

Last but not least, the ELU (_exponential linear unit_) function also outperformed ReLU. One variant of it is the scaled ELU (SELU).
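
For reference, here is a small scalar sketch of the activation functions mentioned in this section. The leaky ReLU default of 0.3 follows the Keras default quoted in these notes, and the two SELU constants are the usual published values; the function names themselves are just illustrative.

```python
import math

# Standard SELU constants
SELU_ALPHA = 1.6732632423543772
SELU_SCALE = 1.0507009873554805

def relu(z):
    # Outputs 0 (with zero gradient) for every negative input
    return max(0.0, z)

def leaky_relu(z, alpha=0.3):
    # A small slope alpha keeps negative inputs from being zeroed out
    return z if z > 0 else alpha * z

def elu(z, alpha=1.0):
    # Smoothly approaches -alpha for large negative inputs
    return z if z > 0 else alpha * (math.exp(z) - 1.0)

def selu(z):
    # A scaled ELU with fixed alpha and scale
    return SELU_SCALE * elu(z, SELU_ALPHA)

for z in (-2.0, -0.5, 0.0, 1.0):
    print(z, relu(z), leaky_relu(z), elu(z), selu(z))
```

Unlike ReLU, the other three functions return a nonzero value for negative inputs, which is what lets a neuron recover instead of dying.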


> "So, which activation function should you use for the hidden layers
of your deep neural networks? Although your mileage will vary, in
general SELU > ELU > leaky ReLU (and its variants) > ReLU > tanh > logistic. If the network’s architecture prevents it from self-normalizing,
then ELU may perform better than SELU (since SELU
is not smooth at z = 0). If you care a lot about runtime latency, then
you may prefer leaky ReLU. If you don’t want to tweak yet another
hyperparameter, you may use the default α values used by Keras
(e.g., 0.3 for leaky ReLU). If you have spare time and computing
power, you can use cross-validation to evaluate other activation
functions, such as RReLU if your network is overfitting or PReLU
if you have a huge training set. That said, because ReLU is the most
used activation function (by far), many libraries and hardware
accelerators provide ReLU-specific optimizations; therefore, if
speed is your priority, ReLU might still be the best choice."

### Batch Normalization

This technique consists of adding an operation in the model just before or after the activation function of each hidden layer. This operation zero-centers and normalizes each input, then scales and shifts the result using two new parameter vectors per layer.

BN also acts as a regularizer, reducing the need for other regularization techniques.

> You may find that training is rather slow, because each epoch takes
much more time when you use Batch Normalization. This is usually
counterbalanced by the fact that convergence is much faster
with BN, so it will take fewer epochs to reach the same performance.

![BatchNormalization](images\batchnormalization.png)

The BatchNormalization class has quite a few hyperparameters you can tweak, such as `momentum`. A good momentum value is typically close to 1 (for example 0.9, 0.99, or 0.999).
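
To make the role of `momentum` more concrete, here is a minimal NumPy sketch of what a Batch Normalization layer computes. The function names are invented for this example and the real Keras layer has more options; the defaults `momentum=0.99` and `eps=1e-3` are chosen to mirror the Keras defaults.

```python
import numpy as np

def batch_norm_train(X, gamma, beta, moving_mean, moving_var,
                     momentum=0.99, eps=1e-3):
    """One training-mode BN step over a batch X of shape (batch, features)."""
    mu = X.mean(axis=0)                      # per-feature batch mean
    var = X.var(axis=0)                      # per-feature batch variance
    X_hat = (X - mu) / np.sqrt(var + eps)    # zero-center and normalize
    out = gamma * X_hat + beta               # learned scale and shift
    # Exponential moving averages, later used at inference time
    moving_mean = momentum * moving_mean + (1.0 - momentum) * mu
    moving_var = momentum * moving_var + (1.0 - momentum) * var
    return out, moving_mean, moving_var

def batch_norm_predict(X, gamma, beta, moving_mean, moving_var, eps=1e-3):
    """Inference-mode BN: uses the moving statistics instead of batch ones."""
    return gamma * (X - moving_mean) / np.sqrt(moving_var + eps) + beta

rng = np.random.default_rng(42)
X = rng.normal(loc=3.0, scale=2.0, size=(64, 5))   # synthetic batch
gamma, beta = np.ones(5), np.zeros(5)
out, mm, mv = batch_norm_train(X, gamma, beta, np.zeros(5), np.ones(5))
print(out.mean(axis=0).round(3), out.std(axis=0).round(3))
```

With a momentum close to 1, the moving averages change slowly, so they only become good estimates of the dataset statistics after many batches.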

## Reusing Pretrained Layers

It is generally not a good idea to train a very large DNN from scratch: instead, you should always try to find an existing neural network that accomplishes a similar task to the one you are trying to tackle. This technique is called _transfer learning_.

The more similar the tasks are, the more layers you want to reuse (starting with the lower layers). For very similar tasks, try keeping all the hidden layers and just replacing the output layer.

If you still cannot get good performance, and you have little training data, try dropping the top hidden layer(s) and freezing all the remaining hidden layers again. You can iterate until you find the right number of layers to reuse. If you have plenty of
training data, you may try replacing the top hidden layers instead of dropping them, and even adding more hidden layers.

You must always compile your model after you freeze or unfreeze layers.

Other ways to cope with not having enough data to train your model are unsupervised pretraining and self-supervised learning.

## Faster Optimizers

In addition to the techniques above, training can be sped up by using a faster optimizer than regular Gradient Descent: momentum optimization, Nesterov Accelerated Gradient, AdaGrad, RMSProp, and finally Adam and Nadam optimization.

In momentum optimization, the gradient is used for acceleration, not for speed.

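Nesterov Accelerated Gradient is a variant of momentum optimization that measures the gradient slightly ahead, in the direction of the momentum, rather than at the current position.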
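
A minimal one-dimensional sketch of the momentum update, with an optional Nesterov look-ahead, might look like this. The toy objective `(x - 3)^2`, the hyperparameter values and the helper names are assumptions made for illustration.

```python
def momentum_step(theta, m, grad_fn, lr=0.1, beta=0.9, nesterov=False):
    """One update: the gradient changes the momentum m, and m moves theta."""
    # Nesterov evaluates the gradient slightly ahead, at theta + beta * m
    lookahead = theta + beta * m if nesterov else theta
    m = beta * m - lr * grad_fn(lookahead)
    return theta + m, m

def minimize(grad_fn, theta0, n_steps=300, **kwargs):
    theta, m = theta0, 0.0
    for _ in range(n_steps):
        theta, m = momentum_step(theta, m, grad_fn, **kwargs)
    return theta

def grad(x):
    # Gradient of the toy objective f(x) = (x - 3)^2
    return 2.0 * (x - 3.0)

print(minimize(grad, theta0=0.0))
print(minimize(grad, theta0=0.0, nesterov=True))
```

Note how the gradient never moves `theta` directly: it only changes the momentum vector, which is the sense in which it acts as an acceleration.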

AdaGrad uses an adaptive learning rate: it decays the learning rate faster for steep dimensions than for dimensions with gentler slopes, and it requires much less tuning of the learning rate hyperparameter.

RMSProp fixes the main problem of AdaGrad, which runs the risk of slowing down a bit too fast and never converging, by accumulating only the gradients from the most recent iterations. Except on simple problems, RMSProp performs better than AdaGrad.
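
The difference between the two can be sketched by feeding both update rules a constant gradient and watching the size of each step. This is a scalar toy setup with made-up helper names and hyperparameters, not a benchmark.

```python
import math

def adagrad_step(theta, s, grad, lr=0.1, eps=1e-10):
    # s accumulates ALL past squared gradients, so it only ever grows
    s = s + grad * grad
    return theta - lr * grad / math.sqrt(s + eps), s

def rmsprop_step(theta, s, grad, lr=0.1, rho=0.9, eps=1e-10):
    # s is an exponentially decaying average of recent squared gradients
    s = rho * s + (1.0 - rho) * grad * grad
    return theta - lr * grad / math.sqrt(s + eps), s

def step_sizes(step_fn, n_steps=100, grad=1.0):
    """Absolute size of each update when the gradient stays constant."""
    theta, s, sizes = 0.0, 0.0, []
    for _ in range(n_steps):
        new_theta, s = step_fn(theta, s, grad)
        sizes.append(abs(new_theta - theta))
        theta = new_theta
    return sizes

adagrad_sizes = step_sizes(adagrad_step)
rmsprop_sizes = step_sizes(rmsprop_step)
print(adagrad_sizes[0], adagrad_sizes[-1])   # keeps shrinking
print(rmsprop_sizes[0], rmsprop_sizes[-1])   # levels off near lr
```

In this setup the AdaGrad step shrinks like `lr / sqrt(t)`, while the RMSProp step settles close to `lr`.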

Adam is the preferred optimizer nowadays. Adam, which stands for _adaptive moment estimation_, combines the ideas of momentum optimization and RMSProp. Nadam is Adam plus the Nesterov trick, so it will often converge slightly faster than Adam.
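
As a sketch of how Adam combines the two ideas, the scalar update below keeps a decaying average of the gradients (the momentum part) and of the squared gradients (the RMSProp part), with bias correction for the first steps. The function name and the toy objective are assumptions for illustration, and the sign convention may differ from some textbooks while being mathematically equivalent.

```python
import math

def adam_step(theta, m, s, grad, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-7):
    """One Adam update for a scalar parameter; t is the 1-based step number."""
    m = beta1 * m + (1.0 - beta1) * grad          # momentum-like average
    s = beta2 * s + (1.0 - beta2) * grad * grad   # RMSProp-like average
    m_hat = m / (1.0 - beta1 ** t)                # bias correction
    s_hat = s / (1.0 - beta2 ** t)
    theta = theta - lr * m_hat / (math.sqrt(s_hat) + eps)
    return theta, m, s

def adam_minimize(grad_fn, theta0, n_steps=1000, lr=0.1):
    theta, m, s = theta0, 0.0, 0.0
    for t in range(1, n_steps + 1):
        theta, m, s = adam_step(theta, m, s, grad_fn(theta), t, lr=lr)
    return theta

# Toy objective f(x) = (x - 3)^2, whose gradient is 2 * (x - 3)
print(adam_minimize(lambda x: 2.0 * (x - 3.0), theta0=0.0))
```

Thanks to the bias correction, the very first step has a size of roughly `lr` regardless of the gradient's scale.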

Optimizer comparison

![optimizers](images\optimizers.png)

## Avoiding Overfitting Through Regularization

One of the best regularization techniques is early stopping, and batch normalization also acts as a regularizer. Other popular techniques are:

- L1 and L2 regularization
- Dropout: It is a fairly simple algorithm: at every training step, every neuron (including the input neurons, but always excluding the output neurons) has a probability p of being
temporarily “dropped out,” meaning it will be entirely ignored during this training step, but it may be active during the next step.
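- Monte Carlo (MC) Dropout: dropout is kept active at prediction time and the predictions of several stochastic forward passes are averaged, which can improve a trained dropout model without retraining it.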
- Max-Norm Regularization
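
As a rough illustration of two of the techniques listed above, here is a NumPy sketch of dropout (in its common "inverted" form, where the kept activations are rescaled during training) and of max-norm regularization. The function names and the toy arrays are assumptions made for this example, not the Keras implementation.

```python
import numpy as np

def dropout(X, rate, rng, training=True):
    """Drop each activation with probability `rate` during training only."""
    if not training or rate == 0.0:
        return X
    keep_prob = 1.0 - rate
    mask = rng.random(X.shape) < keep_prob   # True where the unit is kept
    # Divide by keep_prob so the expected activation stays the same
    return X * mask / keep_prob

def max_norm(W, r=1.0, axis=0):
    """Rescale each neuron's incoming weight vector so its L2 norm is <= r."""
    norms = np.linalg.norm(W, axis=axis, keepdims=True)
    scale = np.minimum(1.0, r / np.maximum(norms, 1e-12))
    return W * scale

rng = np.random.default_rng(0)
X = np.ones((1000, 50))
dropped = dropout(X, rate=0.5, rng=rng)
print(dropped.mean())                        # stays close to 1.0

W = rng.normal(size=(20, 5)) * 3.0           # 20 inputs, 5 neurons
print(np.linalg.norm(max_norm(W), axis=0))   # every column norm is at most 1
```

MC Dropout would correspond to calling `dropout(..., training=True)` at prediction time as well, and averaging the outputs of several such passes.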

## Summary and Practical Guidelines

Although there is no consensus about the best configuration of these techniques, the following one works fine in most cases:

![Configurations](images\configuration.png)

If the network is a simple stack of dense layers, then it can self-normalize, and you should use the following configuration instead:

![Self-normalization](images\selfnormalization.png)

### Practice

Deep neural network on the CIFAR10 image dataset:

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import tensorflow as tf
from tensorflow import keras
import os

In [2]:
(x_train, y_train), (x_test, y_test) = keras.datasets.cifar10.load_data()

In [3]:
x_train.shape

(50000, 32, 32, 3)

In [4]:
y_train.shape

(50000, 1)

In [5]:
np.unique(y_train)

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype=uint8)

Before starting the exercises, we need to set aside a validation set.

In [6]:
X_valid, X_train = x_train[:5000], x_train[5000:]
y_valid, y_train = y_train[:5000], y_train[5000:]

In [7]:
X_valid.shape, X_train.shape

((5000, 32, 32, 3), (45000, 32, 32, 3))

1. Build a DNN with 20 hidden layers of 100 neurons each (that’s too many, but
it’s the point of this exercise). Use He initialization and the ELU activation
function.

In [13]:
model = keras.models.Sequential(
    [keras.layers.Flatten(input_shape=[32, 32, 3])]
    # 20 identical hidden layers: ELU activation with He initialization
    + [keras.layers.Dense(100, activation='elu', kernel_initializer='he_normal')
       for _ in range(20)]
    + [keras.layers.Dense(10, activation='softmax')]
)

2. Using Nadam optimization and early stopping, train the network on the
CIFAR10 dataset. Remember to search for the right learning rate each
time you change the model’s architecture or hyperparameters.

In [22]:
# Nadam optimization
optimizer = keras.optimizers.Nadam(learning_rate=0.00005)

In [23]:
# Compile the model

model.compile(loss="sparse_categorical_crossentropy",
              optimizer=optimizer,
              metrics=["accuracy"])

In [25]:
# Earlystopping
early_stopping_cb = keras.callbacks.EarlyStopping(patience=20, restore_best_weights=True)
# Checkpoint
checkpoint_cb = keras.callbacks.ModelCheckpoint("cifar10_keras_model.h5",
                                                save_best_only=True)
# Tensorboard
run_index = 1 # increment every time you train the model
run_logdir = os.path.join(os.curdir, "my_cifar10_logs", "run_{:03d}".format(run_index))
tensorboard_cb = keras.callbacks.TensorBoard(run_logdir)
callbacks = [early_stopping_cb, checkpoint_cb, tensorboard_cb]
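
The way `patience` interacts with the best epoch can be illustrated with a small pure-Python sketch. The function name `early_stopping_epochs` and the validation losses are made up for illustration; this mimics the idea behind the callback rather than reproducing the Keras implementation.

```python
def early_stopping_epochs(val_losses, patience=20):
    """Return (epoch where training stops, epoch with the best loss)."""
    best_loss = float("inf")
    best_epoch = None
    wait = 0  # epochs elapsed since the last improvement
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return len(val_losses), best_epoch

# Hypothetical validation losses: the best value is reached at epoch 3
losses = [1.0, 0.8, 0.7, 0.75, 0.72, 0.74, 0.9]
print(early_stopping_epochs(losses, patience=3))
```

Training stops `patience` epochs after the best epoch, which is why restoring or checkpointing the best weights matters: the final weights are not the best ones.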

In [26]:
#tensorboard --logdir=./my_cifar10_logs --port=6006

In [27]:
model.fit(X_train, y_train, epochs=100,
            validation_data=(X_valid, y_valid),
            callbacks=callbacks)

Epoch 1/100
...
Epoch 65/100


<keras.callbacks.History at 0x2000cf9fc70>

3. Now try adding Batch Normalization and compare the learning curves: Is it
converging faster than before? Does it produce a better model? How does it
affect training speed?

In [30]:
model_bn = keras.models.Sequential()

model_bn.add(keras.layers.Flatten(input_shape=[32, 32, 3]))

for i in range(20):
    model_bn.add(keras.layers.BatchNormalization())
    model_bn.add(keras.layers.Dense(100, activation='elu', kernel_initializer='he_normal'))

model_bn.add(keras.layers.Dense(10, activation='softmax'))

Model: "sequential_2"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 flatten_2 (Flatten)         (None, 3072)              0         
                                                                 
 batch_normalization_20 (Bat  (None, 3072)             12288     
 chNormalization)                                                
                                                                 
 dense_42 (Dense)            (None, 100)               307300    
                                                                 
 batch_normalization_21 (Bat  (None, 100)              400       
 chNormalization)                                                
                                                                 
 dense_43 (Dense)            (None, 100)               10100     
                                                                 
 batch_normalization_22 (Bat  (None, 100)             

In [31]:
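# Compile the model with a fresh optimizer (an optimizer instance should not be shared between models)
model_bn.compile(loss="sparse_categorical_crossentropy",
              optimizer=keras.optimizers.Nadam(learning_rate=0.00005),
              metrics=["accuracy"])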

checkpoint_cb = keras.callbacks.ModelCheckpoint("cifar10_keras_model_bn.h5",
                                                save_best_only=True)
# Tensorboard
run_index = 1 # increment every time you train the model
run_logdir = os.path.join(os.curdir, "my_cifar10_nb_logs", "run_{:03d}".format(run_index))
tensorboard_cb = keras.callbacks.TensorBoard(run_logdir)
callbacks = [early_stopping_cb, checkpoint_cb, tensorboard_cb]

model_bn.fit(X_train, y_train, epochs=100,
            validation_data=(X_valid, y_valid),
            callbacks=callbacks)

Epoch 1/100
...
Epoch 58/100


<keras.callbacks.History at 0x20013ace280>

In terms of speed, the Batch Normalization model did need fewer epochs, but the model without Batch Normalization still trained faster overall. In terms of model quality, the Batch Normalization model clearly outperformed the first one.

4. Try replacing Batch Normalization with SELU, and make the necessary adjustments
to ensure the network self-normalizes (i.e., standardize the input features,
use LeCun normal initialization, make sure the DNN contains only a
sequence of dense layers, etc.).

In [12]:
model_selu = keras.models.Sequential()

# Nadam optimization
optimizer = keras.optimizers.Nadam(learning_rate=0.00005)

model_selu.add(keras.layers.Flatten(input_shape=[32, 32, 3]))

for i in range(20):
    model_selu.add(keras.layers.Dense(100, activation='selu', kernel_initializer='lecun_normal'))

model_selu.add(keras.layers.Dense(10, activation='softmax'))

# Compile the model

model_selu.compile(loss="sparse_categorical_crossentropy",
              optimizer=optimizer,
              metrics=["accuracy"])

# Earlystopping
early_stopping_cb = keras.callbacks.EarlyStopping(patience=20)

checkpoint_cb = keras.callbacks.ModelCheckpoint("cifar10_keras_model_SELU.h5",
                                                save_best_only=True)
# Tensorboard
run_index = 1 # increment every time you train the model
run_logdir = os.path.join(os.curdir, "my_cifar10_SELU_logs", "run_{:03d}".format(run_index))
tensorboard_cb = keras.callbacks.TensorBoard(run_logdir)
callbacks = [early_stopping_cb, checkpoint_cb, tensorboard_cb]

# Note: self-normalization assumes standardized inputs, but this run uses the raw pixel values
model_selu.fit(X_train, y_train, epochs=100,
            validation_data=(X_valid, y_valid),
            callbacks=callbacks)

Epoch 1/100
...
Epoch 45/100


<keras.callbacks.History at 0x25d912809a0>

The model achieves only a small improvement in the metric. The metric improves slowly, but a grid search over the hyperparameters might improve it further.

5. Try regularizing the model with alpha dropout. Then, without retraining your
model, see if you can achieve better accuracy using MC Dropout.

In [10]:
model_alpha = keras.models.Sequential()

# Nadam optimization
optimizer = keras.optimizers.Nadam(learning_rate=0.00005)

model_alpha.add(keras.layers.Flatten(input_shape=[32, 32, 3]))

for i in range(17):
    model_alpha.add(keras.layers.Dense(100, activation='selu', kernel_initializer='lecun_normal'))

for i in range(3):
    model_alpha.add(keras.layers.Dense(100, activation='selu', kernel_initializer='lecun_normal'))
    model_alpha.add(keras.layers.AlphaDropout(0.30))

model_alpha.add(keras.layers.Dense(10, activation='softmax'))

# Compile the model

model_alpha.compile(loss="sparse_categorical_crossentropy",
              optimizer=optimizer,
              metrics=["accuracy"])

# Earlystopping
early_stopping_cb = keras.callbacks.EarlyStopping(patience=20)

checkpoint_cb = keras.callbacks.ModelCheckpoint("cifar10_keras_model_alpha.h5",
                                                save_best_only=True)
# Tensorboard
run_index = 1 # increment every time you train the model
run_logdir = os.path.join(os.curdir, "my_cifar10_ALPHA_logs", "run_{:03d}".format(run_index))
tensorboard_cb = keras.callbacks.TensorBoard(run_logdir)
callbacks = [early_stopping_cb, checkpoint_cb, tensorboard_cb]

X_means = X_train.mean(axis=0)
X_stds = X_train.std(axis=0)
X_train_scaled = (X_train - X_means) / X_stds
X_valid_scaled = (X_valid - X_means) / X_stds
X_test_scaled = (x_test - X_means) / X_stds

model_alpha.fit(X_train_scaled, y_train, epochs=100,
          validation_data=(X_valid_scaled, y_valid),
          callbacks=callbacks)

Epoch 1/100
...
Epoch 26/100


<keras.callbacks.History at 0x2c04605ce50>

In [11]:
model = keras.models.load_model("cifar10_keras_model_alpha.h5")
model.evaluate(X_valid_scaled, y_valid)



[1.9906935691833496, 0.4431999921798706]

I don't see even a slight improvement with this model. It would be interesting to develop, or find, a model that achieves higher performance on this task.

In [12]:
class MCAlphaDropout(keras.layers.AlphaDropout):
    def call(self, inputs):
        return super().call(inputs, training=True)

In [13]:
mc_model = keras.models.Sequential([
    MCAlphaDropout(layer.rate) if isinstance(layer, keras.layers.AlphaDropout) else layer
    for layer in model.layers
])

In [14]:
def mc_dropout_predict_probas(mc_model, X, n_samples=10):
    # Average the class probabilities over several stochastic forward passes
    Y_probas = [mc_model.predict(X) for _ in range(n_samples)]
    return np.mean(Y_probas, axis=0)

def mc_dropout_predict_classes(mc_model, X, n_samples=10):
    # Pick the class with the highest averaged probability
    Y_probas = mc_dropout_predict_probas(mc_model, X, n_samples)
    return np.argmax(Y_probas, axis=1)

In [15]:
keras.backend.clear_session()
tf.random.set_seed(42)
np.random.seed(42)

y_pred = mc_dropout_predict_classes(mc_model, X_valid_scaled)
accuracy = np.mean(y_pred == y_valid[:, 0])
accuracy



0.4484

MC Dropout did not bring a noticeable improvement (0.4484 versus 0.4432 without it).