To tackle a complex problem such as object detection in high-resolution images, we need to train deeper neural networks, with 10 layers or more and hundreds of neurons in each layer. This introduces the following problems:

1. Vanishing/Exploding Gradients - Deep neural networks are very prone to vanishing or exploding gradients, which make the lower layers hard to train.
2. Inadequate training data - There may not be enough data for such a large network, and labelling it can be costly.
3. High training time.
4. Overfitting - Deep neural networks have a large number of parameters, which can cause overfitting, especially if the training set is too small.

# Vanishing/Exploding Gradients Problem

The backpropagation algorithm starts at the output layer, calculates the error, and propagates the error gradient down to the input layer. It then uses these gradients to update each weight with a Gradient Descent step. The gradients often get smaller and smaller as the algorithm progresses down to the lower layers, so the connection weights of the lower layers are left virtually unchanged. This is the vanishing gradients problem. The opposite can also happen: the gradients grow bigger and bigger, the weight updates become very large, and the algorithm diverges. This is the exploding gradients problem. More generally, deep neural networks suffer from unstable gradients, which cause different layers to learn at very different rates.

The reasons why neural networks suffer from vanishing/exploding gradients were explained in a paper by Xavier Glorot and Yoshua Bengio. The main culprit is the combination of the sigmoid activation function and random normal weight initialisation, which makes the variance of the outputs of each layer greater than the variance of its inputs. When its inputs are large in magnitude, the sigmoid function saturates at 0 or 1, where its gradient is close to zero. During backpropagation this small gradient shrinks further at each layer, leaving virtually nothing for the lower layers.

# Glorot and He Initialization

In their paper, Glorot and Bengio suggested the following initialisations for the sigmoid activation function, where $fan_{avg} = (fan_{in} + fan_{out})/2$ is the average of the number of inputs and outputs of the layer:

1. Normal distribution with mean 0 and variance $\sigma^2 = \frac{1}{fan_{avg}}$
2. Uniform distribution between -r and +r, with $r = \sqrt{\frac{3}{fan_{avg}}}$

This initialisation strategy can speed up training significantly.
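
The two variants above can be checked against each other with a few lines of plain Python. This is only a sketch: the helper name `glorot_params` and the layer sizes are made up for illustration and are not part of Keras.

```python
import math

def glorot_params(fan_in, fan_out):
    """Return (std of the normal variant, limit r of the uniform variant)."""
    fan_avg = (fan_in + fan_out) / 2
    sigma = math.sqrt(1 / fan_avg)   # normal variant: variance 1 / fan_avg
    r = math.sqrt(3 / fan_avg)       # uniform variant: samples drawn from [-r, r]
    return sigma, r

# Illustrative layer with 300 inputs and 100 outputs (assumed sizes)
sigma, r = glorot_params(fan_in=300, fan_out=100)

# A uniform distribution on [-r, r] has variance r**2 / 3,
# so both variants give the weights the same variance.
uniform_variance = r ** 2 / 3
print(sigma ** 2, uniform_variance)
```

This is why the uniform limit carries the factor 3: it makes the variance of the uniform variant equal to $1/fan_{avg}$ as well.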





| Initialisation | Activation functions | $\sigma^2$ (Normal) |
|---|---|---|
| Glorot | None, Tanh, Logistic, Softmax | $1/fan_{avg}$ |
| He | ReLU & variants | $2/fan_{in}$ |
| LeCun | SELU | $1/fan_{in}$ |

By default Keras uses Glorot initialisation with a uniform distribution. We can switch to He initialisation by setting kernel_initializer when creating a layer:

In [2]:
import tensorflow as tf
from tensorflow import keras

In [109]:
tf.__version__
keras.__version__

'3.5.0'

In [110]:
keras.layers.Dense(10, activation="relu", kernel_initializer = "he_normal")

<Dense name=dense_55, built=False>

We can also use He initialisation with a uniform distribution based on $fan_{avg}$ rather than $fan_{in}$ by using the VarianceScaling initializer:

In [111]:
he_avg_init = keras.initializers.VarianceScaling(scale=2, mode='fan_avg', distribution = 'uniform')
keras.layers.Dense(10, activation = 'sigmoid', kernel_initializer = he_avg_init)

<Dense name=dense_56, built=False>

# Nonsaturating Activation Functions

The ReLU activation function behaves much better than the sigmoid function because it is much faster to compute and does not saturate for positive values.

One problem with the ReLU function is that some neurons can die during training, meaning they output only zeros. This usually happens when the learning rate is too high and a neuron's weights get tweaked so that the weighted sum of its inputs is negative for all training instances. Since the gradient of ReLU is zero for negative inputs, gradient descent no longer has any effect on the neuron. This can be solved by using one of the following variants of the ReLU activation function:

1. $LeakyReLU_{\alpha}(z) = max(\alpha z,z)$ 

Here $\alpha$ is the slope for $z < 0$

2. Randomized leaky ReLU - here $\alpha$ is picked randomly in a given range and fixed to an average value during testing.


3. Parametric leaky ReLU - here $\alpha$ is not a hyperparameter but a parameter which can be learned and updated by gradient descent. This activation function outperforms ReLU on large image datasets, but on smaller datasets it runs the risk of overfitting.


4. $ELU_{\alpha}(z) = \begin{cases} 
                       \alpha(\exp(z) - 1): z < 0 \\\\
                       z                  : z \geq 0
                       \end{cases}$
   
   In the experiments of the paper that introduced it, the exponential linear unit (ELU) outperformed all the ReLU variants: training time was reduced and the network performed better on the test set.
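
To make the definitions above concrete, here is a minimal scalar sketch of leaky ReLU and ELU in plain Python. The default values of `alpha` are assumptions chosen for illustration, and the functions are written from the formulas rather than taken from Keras.

```python
import math

def relu(z):
    return max(0.0, z)

def leaky_relu(z, alpha=0.01):
    # max(alpha * z, z): slope 1 for z >= 0, slope alpha for z < 0
    return max(alpha * z, z)

def elu(z, alpha=1.0):
    # alpha * (exp(z) - 1) for z < 0, identity otherwise
    return alpha * (math.exp(z) - 1) if z < 0 else z

for z in (-2.0, -0.5, 0.0, 1.5):
    print(z, relu(z), leaky_relu(z), elu(z))
```

For negative inputs ReLU returns exactly zero, while leaky ReLU and ELU keep a non-zero output and gradient, which is what lets a neuron recover instead of dying.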

## Leaky ReLU in Keras

In [112]:
leaky_relu = keras.layers.LeakyReLU(negative_slope=0.2)
layer = keras.layers.Dense(10, activation=leaky_relu, 
                            kernel_initializer = 'he_normal')

## SELU Activation

In [113]:
# SELU should be paired with LeCun normal initialisation
layer = keras.layers.Dense(10, activation='selu', kernel_initializer='lecun_normal')

# Batch Normalization

Using He initialisation with the ELU activation function can significantly reduce the vanishing/exploding gradients problem at the start of training, but the distribution of each layer's inputs can still change during training. Batch normalisation addresses this issue by normalising the inputs to a layer, then scaling and shifting them. For a mini-batch $B$ of size $m_B$, the algorithm can be summarised as follows:


1. $\Large \mu_{B} = \frac{1}{m_B}\sum_{i = 1}^{m_B} x^{i}$ $\newline$
2. $\Large \sigma_{B}^2 = \frac{1}{m_B}\sum_{i = 1}^{m_B} \left(x^{i} - \mu_B\right)^2$ $\newline$
3. $\Large \hat{x}^i = \frac{x^i - \mu_B}{\sqrt{\sigma_{B}^{2} + \epsilon}}$ $\newline$
4. $\Large  z^i = \gamma \hat{x}^i + \beta$ $\newline$ $\newline$

1. For each input feature, the mean and variance are computed over the mini-batch.
2. Each input is normalised by subtracting the mean and dividing by the square root of the variance plus a tiny constant $\epsilon$, which avoids division by zero.
3. The normalised data is scaled and shifted by the learned parameters $\gamma$ and $\beta$.
4. At test time the layer does not use batch statistics. Instead it uses the final input mean and standard deviation, which are estimated during training with an exponential moving average.
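
The four steps above can be sketched for a single feature in plain Python. This is an illustration of the equations, not the Keras implementation; the function names and the values of `eps` and `momentum` are assumptions.

```python
import math

def batch_norm_feature(xs, gamma=1.0, beta=0.0, eps=1e-3):
    """Apply the four equations to one feature across a mini-batch."""
    m_b = len(xs)
    mu = sum(xs) / m_b                                      # batch mean
    var = sum((x - mu) ** 2 for x in xs) / m_b              # batch variance
    x_hat = [(x - mu) / math.sqrt(var + eps) for x in xs]   # normalise
    z = [gamma * xh + beta for xh in x_hat]                 # scale and shift
    return z, mu, var

def update_moving_average(moving, batch_value, momentum=0.99):
    # Running estimate kept during training and used at test time
    return momentum * moving + (1 - momentum) * batch_value

batch = [1.0, 2.0, 3.0, 4.0]
z, mu, var = batch_norm_feature(batch)
moving_mean = update_moving_average(0.0, mu)
print(mu, var, z, moving_mean)
```

With `gamma=1` and `beta=0` the output has mean zero and close to unit variance; during training the network learns whatever scale and offset suit each feature.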

## Implementing Batch Normalization in Keras

In [114]:
model_ = keras.models.Sequential([
    keras.layers.Flatten(input_shape = [28, 28]),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(300, activation="elu", kernel_initializer='he_normal'),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(100, activation="elu", kernel_initializer="he_normal"),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(10, activation="softmax")
])

In [115]:
model_.summary()

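Since a Batch Normalization layer includes one offset parameter per input, the bias term of the layer that feeds it is redundant and can be removed with use_bias=False. Strictly speaking, the two are interchangeable only when the Batch Normalization layer acts directly on the layer's linear output, before the activation function.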

In [116]:
model_ = keras.models.Sequential([
    keras.layers.Flatten(input_shape = [28, 28]),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(300, activation="elu", kernel_initializer='he_normal', use_bias=False),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(100, activation="elu", kernel_initializer="he_normal", use_bias=False),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(10, activation="softmax")
])

# Gradient Clipping

Gradient clipping mitigates exploding gradients by clipping the gradients during backpropagation so that they never exceed some threshold. It is mostly used in RNNs, where batch normalisation is difficult to apply.

In [117]:
optimizer = keras.optimizers.SGD(clipvalue=1.0)
model_.compile(loss='mse', optimizer=optimizer)

This clips every component of the gradient vector to a value between -1.0 and 1.0. Clipping by value can change the orientation of the gradient vector, so clipping by norm (setting clipnorm instead of clipvalue) may be preferred when the direction of the gradient must be preserved.
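
The difference between the two clipping modes is easy to see on a small example. The sketch below reimplements both ideas in plain Python on a list of floats; it mirrors the behaviour described above rather than Keras internals, and the gradient values are made up.

```python
import math

def clip_by_value(grad, clip=1.0):
    # Clip each component independently to [-clip, clip]
    return [max(-clip, min(clip, g)) for g in grad]

def clip_by_norm(grad, max_norm=1.0):
    # Rescale the whole vector if its L2 norm exceeds max_norm
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= max_norm:
        return list(grad)
    return [g * max_norm / norm for g in grad]

grad = [0.9, 100.0]               # points almost entirely along the second axis
by_value = clip_by_value(grad)    # [0.9, 1.0]: now points almost along the diagonal
by_norm = clip_by_norm(grad)      # same direction as grad, rescaled to length 1
print(by_value, by_norm)
```

Clipping by value almost eliminates the dominance of the second component, whereas clipping by norm keeps the ratio between the components and only shrinks the vector.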

# Reusing Pretrained Layers

Instead of building a large neural network from scratch, it can be useful to find an existing neural network that performs a similar task and reuse its lower layers. This is called transfer learning. In most cases the output layer should be discarded, since it is most likely not relevant to the task at hand.

# Transfer Learning With Keras

In [118]:
model_A = keras.models.load_model("model.keras")  # load a previously saved model
# clone_model() copies the architecture only, so copy the weights explicitly
model_A_clone = keras.models.clone_model(model_A)
model_A_clone.set_weights(model_A.get_weights())
# Build model B on the clone (all layers except the output layer) so that
# training B does not modify model_A, whose layers would otherwise be shared
model_B_on_A = keras.models.Sequential(model_A_clone.layers[:-1])
model_B_on_A.add(keras.layers.Dense(1, activation="sigmoid"))  # new output layer for task B

We could now train this new model on task B, but the new output layer is randomly initialised and will produce large errors at first, and the resulting large error gradients may wreck the reused weights. A solution is to freeze the reused layers and train only the output layer for a few epochs so that it learns reasonable weights. The model must be compiled again whenever layers are frozen or unfrozen.

In [119]:
fashion = keras.datasets.fashion_mnist
(X_train, y_train), (X_test, y_test) = fashion.load_data()

y_train_B = ((y_train == 5) | (y_train == 6)).astype(int)
y_test_B = ((y_test == 5) | (y_test == 6)).astype(int)

In [120]:
for layer in model_B_on_A.layers[:-1]: # set all but the output layer to non-trainable
    layer.trainable = False

In [121]:
model_B_on_A.compile(loss='binary_crossentropy', optimizer='sgd', 
                     metrics=['accuracy'])

In [122]:
history = model_B_on_A.fit(X_train, y_train_B, epochs=4)

Epoch 1/4
1875/1875 ━━━━━━━━━━━━━━━━━━━━ 1s 632us/step - accuracy: 0.8760 - loss: 366.9988
Epoch 2/4
1875/1875 ━━━━━━━━━━━━━━━━━━━━ 1s 637us/step - accuracy: 0.9068 - loss: 198.5783
Epoch 3/4
1875/1875 ━━━━━━━━━━━━━━━━━━━━ 1s 633us/step - accuracy: 0.9057 - loss: 214.2549
Epoch 4/4
1875/1875 ━━━━━━━━━━━━━━━━━━━━ 1s 625us/step - accuracy: 0.9079 - loss: 194.9447


In [123]:
for layer in model_B_on_A.layers[:-1]:
    layer.trainable = True

In [124]:
optimizer = keras.optimizers.SGD(learning_rate=1e-4)
model_B_on_A.compile(loss='binary_crossentropy', optimizer=optimizer, metrics=['accuracy'])
history = model_B_on_A.fit(X_train, y_train_B, epochs = 15)

Epoch 1/15
1875/1875 ━━━━━━━━━━━━━━━━━━━━ 2s 797us/step - accuracy: 0.8594 - loss: 190.9481
Epoch 2/15
1875/1875 ━━━━━━━━━━━━━━━━━━━━ 2s 801us/step - accuracy: 0.8843 - loss: 1.4587
Epoch 3/15
1875/1875 ━━━━━━━━━━━━━━━━━━━━ 2s 843us/step - accuracy: 0.8863 - loss: 0.8058
Epoch 4/15
1875/1875 ━━━━━━━━━━━━━━━━━━━━ 2s 803us/step - accuracy: 0.8819 - loss: 0.5851
Epoch 5/15
1875/1875 ━━━━━━━━━━━━━━━━━━━━ 2s 816us/step - accuracy: 0.8916 - loss: 0.4811
Epoch 6/15
1875/1875 ━━━━━━━━━━━━━━━━━━━━ 2s 798us/step - accuracy: 0.8938 - loss: 0.4106
Epoch 7/15
1875/1875 ━━━━━━━━━━━━━━━━━━━━ 2s 820us/step - accuracy: 0.9031 - loss: 0.3602
Epoch 8/15
1875/1875 ━━━━━━━━━━━━━━━━━━━━ 2s 803us/step - accuracy: 0.9055 - loss: 0.3459
Epoch 

In [125]:
model_B_on_A.evaluate(X_test, y_test_B)

313/313 ━━━━━━━━━━━━━━━━━━━━ 0s 634us/step - accuracy: 0.9099 - loss: 0.6440


[0.7545680403709412, 0.9068999886512756]

# Faster Optimizers

Another way to speed up training neural networks is to use a faster optimizer. 

## Momentum Optimization

Momentum optimization introduces a momentum vector $\bf{m}$ which keeps track of previous gradients. This allows the algorithm to converge faster.

1. $\bf{m} \leftarrow \beta \bf{m} - \eta \nabla_{\theta}J(\bf{\theta})$
2. $\bf{\theta} \leftarrow \bf{\theta} + \bf{m}$

The first step updates the momentum vector $\bf{m}$: the previous momentum, scaled by $\beta$, is combined with the current gradient scaled by the learning rate $\eta$. The second step updates the weights by adding the momentum vector, which moves them in the direction of the momentum.

Usually $\beta$ is set to 0.9.

In [4]:
optimizer = keras.optimizers.SGD(learning_rate = 0.001, momentum = 0.9) # keras implementation
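
The two update steps can also be written out by hand. The following is a scalar sketch in plain Python, with a constant gradient assumed for illustration; `momentum_step` is a made-up helper, not a Keras function.

```python
def momentum_step(theta, m, grad, lr=0.001, beta=0.9):
    m = beta * m - lr * grad   # step 1: update the momentum
    theta = theta + m          # step 2: move the parameter by the momentum
    return theta, m

# With a constant gradient the momentum settles at a terminal velocity
# of -lr * grad / (1 - beta), here 10 times a plain gradient descent step.
theta, m = 0.0, 0.0
for _ in range(200):
    theta, m = momentum_step(theta, m, grad=1.0)

plain_sgd_step = -0.001 * 1.0
print(m, plain_sgd_step)
```

With $\beta = 0.9$ the steady-state step is ten times larger than the plain gradient descent step, which is where the speed-up on long, gently sloping regions comes from.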

## Nesterov Accelerated Gradient

Nesterov Accelerated Gradient is slightly different to vanilla Momentum Optimization. The gradient is measured slightly ahead of the local position in the direction of the momentum.

1. $\bf{m} \leftarrow \beta \bf{m} - \eta \nabla_{\theta}J(\bf{\theta + \beta m})$
2. $\bf{\theta} \leftarrow \bf{\theta} + \bf{m}$

NAG generally converges faster than regular Momentum Optimization.

In [5]:
NAG_optimizer = keras.optimizers.SGD(learning_rate = 0.001, momentum = 0.9, nesterov=True)
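
The only change from the momentum update is where the gradient is evaluated. A minimal scalar sketch in plain Python, assuming the toy cost $J(\theta) = \theta^2$ and a made-up helper `nag_step`:

```python
def nag_step(theta, m, grad_fn, lr, beta):
    # Step 1: measure the gradient at the look-ahead point theta + beta * m
    m = beta * m - lr * grad_fn(theta + beta * m)
    # Step 2: move the parameter by the momentum
    theta = theta + m
    return theta, m

def grad_fn(theta):
    # Gradient of the toy cost J(theta) = theta ** 2
    return 2 * theta

theta, m = 5.0, 0.0
history = []
for _ in range(2):
    theta, m = nag_step(theta, m, grad_fn, lr=0.1, beta=0.9)
    history.append(theta)
print(history)
```

On the second iteration the gradient is measured at 3.1 rather than at the current position 4.0, so the update already accounts for where the momentum is about to carry the parameter.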

## AdaGrad

1. $ \bf{s} \leftarrow \bf{s} + \nabla_{\theta}J(\bf{\theta}) \otimes \nabla_{\theta}J(\bf{\theta})$
2. $ \theta \leftarrow \theta - \eta \nabla_{\theta}J(\bf{\theta}) \oslash \sqrt{\bf{s} + \epsilon}$

The first step accumulates the square of the gradients into the vector s. This is equivalent to $s_i \leftarrow s_i + (\partial J(\theta)/\partial \theta_i)^2$.

The second step scales down the gradient vector by a factor of $\sqrt{s + \epsilon}$. This is equivalent to $\theta_i \leftarrow \theta_i - \eta\frac{\partial J(\theta)/\partial \theta_i}{\sqrt{s_i + \epsilon}}$ for all parameters $\theta_i$ simultaneously.

This algorithm decays the learning rate faster for steep dimensions than for dimensions with gentler slopes. This helps prevent overshooting and points the updates more directly towards the global optimum.
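
The per-dimension scaling can be seen in a small plain-Python sketch of the two steps. The gradient values, learning rate and helper name `adagrad_step` are assumptions made for illustration.

```python
import math

def adagrad_step(theta, s, grad, lr=0.1, eps=1e-7):
    # Step 1: accumulate the squared gradient of each parameter
    s = [si + g * g for si, g in zip(s, grad)]
    # Step 2: scale each parameter's update by 1 / sqrt(s_i + eps)
    theta = [t - lr * g / math.sqrt(si + eps)
             for t, g, si in zip(theta, grad, s)]
    return theta, s

theta, s = [1.0, 1.0], [0.0, 0.0]
grad = [10.0, 0.1]   # steep first dimension, gentle second dimension
theta, s = adagrad_step(theta, s, grad)
print(theta, s)
```

Although the first gradient component is 100 times larger than the second, both parameters move by about the same amount (roughly 0.1), because the steep dimension has its effective learning rate scaled down much more.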

## RMSProp

AdaGrad often slows down too quickly and stops before reaching the optimum when training neural networks. The RMSProp algorithm fixes this by accumulating only the gradients from the most recent iterations, using exponential decay in the first step.

1. $ \bf{s} \leftarrow \beta\bf{s} + (1 - \beta)\nabla_{\theta}J(\bf{\theta}) \otimes \nabla_{\theta}J(\bf{\theta})$
2. $ \theta \leftarrow \theta - \eta \nabla_{\theta}J(\bf{\theta}) \oslash \sqrt{\bf{s} + \epsilon}$

The decay rate $\beta$ is typically set to 0.9 (the rho argument in Keras). RMSProp is among the best and fastest optimizers, possibly behind only Adam Optimization.

In [8]:
RMSProp_optimizer = keras.optimizers.RMSprop(learning_rate = 0.001, rho = 0.9)
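
To see why the exponential decay matters, the sketch below feeds the same constant gradient to an AdaGrad-style accumulator and to the RMSProp accumulator. It is a scalar illustration in plain Python with a made-up helper `rmsprop_step`, not the Keras implementation.

```python
import math

def rmsprop_step(theta, s, grad, lr=0.001, beta=0.9, eps=1e-7):
    s = beta * s + (1 - beta) * grad * grad           # step 1: decaying average
    theta = theta - lr * grad / math.sqrt(s + eps)    # step 2: scaled update
    return theta, s

theta, s_rms, s_ada = 0.0, 0.0, 0.0
for _ in range(100):
    grad = 1.0                    # constant gradient, assumed for illustration
    s_ada += grad * grad          # AdaGrad-style sum keeps growing
    theta, s_rms = rmsprop_step(theta, s_rms, grad)

print(s_ada, s_rms)
```

The AdaGrad sum reaches 100 and would keep growing, shrinking every later update, while the RMSProp average levels off near 1, so the step size stays roughly constant.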

## Adam & Nadam Optimization

Adam Optimization combines the ideas of Momentum Optimization and RMSProp.

1. $\Large \bf{m} \leftarrow \beta_1 \bf{m} - (1 - \beta_1)\nabla_{\theta}J(\bf{\theta})$
2. $\Large \bf{s} \leftarrow \beta_2\bf{s} + (1 - \beta_2)\nabla_{\theta}J(\bf{\theta}) \otimes \nabla_{\theta}J(\bf{\theta})$ 
3. $\Large \hat{\bf{m}} \leftarrow \frac{\bf{m}}{1 - \beta_{1}^{t}}$
4. $\Large \hat{\bf{s}} \leftarrow \frac{\bf{s}}{1 - \beta_{2}^{t}}$
5. $\Large \theta \leftarrow \theta + \eta \hat{\bf{m}}\oslash \sqrt{\hat{\bf{s}} + \epsilon}$

Step 1 differs slightly from Momentum Optimization: it computes a decaying average rather than a decaying sum, but the two are equivalent up to a constant factor of $(1 - \beta_1)$. Steps 3 and 4 boost $\bf{m}$ and $\bf{s}$ at the beginning of training: since both are initialised to 0, they are biased towards zero at the start. Here $t$ is the iteration number, starting at 1.

The hyperparameter $\beta_1$ is typically set to 0.9 and $\beta_2$ is typically set to 0.999. $\epsilon$ is a smoothing term typically set to a small number $10^{-7}$.

In [5]:
Adam_optimizer = keras.optimizers.Adam(learning_rate = 0.001, beta_1 = 0.9, beta_2 = 0.999)

Adam is an adaptive learning rate algorithm and requires less tuning of the learning rate hyperparameter $\eta$. We can use the default value of 0.001.
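
The five steps can be followed by hand for one scalar parameter. This is a plain-Python sketch written from the equations in this section, using the same sign convention; `adam_step` and the gradient value are assumptions for illustration.

```python
import math

def adam_step(theta, m, s, grad, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-7):
    m = beta1 * m - (1 - beta1) * grad                    # step 1
    s = beta2 * s + (1 - beta2) * grad * grad             # step 2
    m_hat = m / (1 - beta1 ** t)                          # step 3: bias correction
    s_hat = s / (1 - beta2 ** t)                          # step 4: bias correction
    theta = theta + lr * m_hat / math.sqrt(s_hat + eps)   # step 5
    return theta, m, s

# First iteration (t = 1) with a large gradient
theta, m, s = adam_step(theta=0.0, m=0.0, s=0.0, grad=50.0, t=1)
print(theta, m, s)
```

On the first iteration the bias-corrected update is almost exactly the learning rate (about 0.001 here) whatever the scale of the gradient, which is one reason the default learning rate needs little tuning.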

### Variants of the Adam Optimizer

Adamax - This variant replaces step 2 with $\bf{s} \leftarrow \max(\beta_2 \bf{s}, |\nabla_\theta J(\theta)|)$, drops step 4, and in step 5 scales down the gradient updates by a factor of $\bf{s}$. This can make it more stable than Adam on certain tasks, but in general Adam performs better.

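Nadam Optimization - This variant adds the Nesterov trick to Adam, measuring the gradient slightly ahead in the direction of the momentum.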

# Avoiding Overfitting Through Regularization

Neural networks have a huge number of parameters and therefore a great many degrees of freedom, which makes them prone to overfitting the training set. This calls for regularization.

## $\ell_1$ and $\ell_2$ Regularization

In [3]:
layer = keras.layers.Dense(100, activation='elu', 
                           kernel_initializer='he_normal',
                           kernel_regularizer=keras.regularizers.l2(0.01))

Applying the same regularizer to every hidden layer by hand is tedious and error-prone. To avoid this we can use functools.partial(), which creates a thin wrapper around any callable with some default argument values.

In [4]:
from functools import partial

In [5]:
RegularizedDense = partial(keras.layers.Dense, 
                           activation='elu',
                           kernel_initializer='he_normal',
                           kernel_regularizer=keras.regularizers.l2(0.01))

In [9]:
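model = keras.models.Sequential([
    keras.layers.Input(shape=[28, 28]),
    keras.layers.Flatten(),  # flatten each 28x28 image before the dense layers
    RegularizedDense(300),
    RegularizedDense(100),
    # keyword arguments passed here override the defaults set in partial()
    RegularizedDense(10, activation="softmax",
                     kernel_initializer="glorot_uniform")
])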

## Dropout