## Vanishing/exploding gradients problem

Gradients often become smaller and smaller (vanish) or uncontrollably larger (explode) as backpropagation progresses through the layers of a NN.
Believed to happen when the variance of each layer's outputs is much greater than the variance of its inputs, so the variance keeps growing from layer to layer.

### Initialization

Initial solution --> use __Glorot initialization__ to set the connection weights of each layer randomly
Where
$$
fan_{avg} = (fan_{in} + fan_{out})/2
$$
- $fan_{in}$ is the number of inputs to the layer
- $fan_{out}$ is the number of neurons in the layer

Then you can use the normal distribution with mean 0 and var $\sigma^2=\frac{1}{fan_{avg}}$

Or a uniform distribution [-r, r] where $r=\sqrt{\frac{3}{fan_{avg}}}$
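
As a rough check of the two formulas above, the sketch below computes $\sigma$ and $r$ for a hypothetical layer (300 inputs, 100 neurons; both numbers are made up for illustration) and shows that the normal and uniform variants target the same variance. The function names are not part of Keras.

```python
import math
import random

def glorot_normal_std(fan_in, fan_out):
    """Standard deviation for Glorot normal initialization."""
    fan_avg = (fan_in + fan_out) / 2
    return math.sqrt(1 / fan_avg)

def glorot_uniform_limit(fan_in, fan_out):
    """Limit r for Glorot uniform initialization on [-r, r]."""
    fan_avg = (fan_in + fan_out) / 2
    return math.sqrt(3 / fan_avg)

# Hypothetical layer: 300 inputs, 100 neurons
fan_in, fan_out = 300, 100
sigma = glorot_normal_std(fan_in, fan_out)
r = glorot_uniform_limit(fan_in, fan_out)

# A uniform distribution on [-r, r] has variance r^2 / 3, which matches sigma^2
print(sigma ** 2, r ** 2 / 3)

# Sample one weight matrix (fan_in rows, fan_out columns) from the uniform variant
rng = random.Random(42)
weights = [[rng.uniform(-r, r) for _ in range(fan_out)] for _ in range(fan_in)]
```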

Alternative strategies exist, mostly differentiated by the scale of the variance or whether they use $fan_{avg}$ or $fan_{in}$

Different activation functions benefit from different types of initialization:

| Initialization | Activation functions          | $\sigma^2$ (Normal)   |
|----------------|-------------------------------|-----------------------|
| Glorot/Xavier  | None, tanh, logistic, softmax | $\frac{1}{fan_{avg}}$ |
| He             | ReLU and variants             | $\frac{2}{fan_{in}}$  |
| LeCun          | SELU                          | $\frac{1}{fan_{in}}$  |

In [None]:
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import Perceptron
from tensorflow import keras
import pandas as pd
import matplotlib.pyplot as plt

In [None]:
# Keras uses Glorot initialization by default.

model = keras.models.Sequential()
# We can switch to He initialization by using the parameters when adding the layer
model.add(keras.layers.Dense(10, activation="relu", kernel_initializer="he_normal"))

# We could change it to still use He initialization but based on fan_avg instead
he_avg_init = keras.initializers.VarianceScaling(scale=2, mode="fan_avg", distribution="uniform")
model.add(keras.layers.Dense(10, activation="sigmoid", kernel_initializer=he_avg_init))

### Nonsaturating activation functions

General rule: SELU > ELU > leaky ReLU > ReLU > tanh > logistic.
Go down the hierarchy based on needs.

#### ReLU

Most common activation function. Very fast to compute.
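Neurons can die --> a dead neuron outputs only 0. This happens when its weights end up such that the weighted sum of its inputs is negative for every training instance; the gradient of ReLU is then 0, so gradient descent stops updating it.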

<img src="https://machinelearningmastery.com/wp-content/uploads/2018/10/Line-Plot-of-Rectified-Linear-Activation-for-Negative-and-Positive-Inputs.png" width="400">

#### Leaky ReLU

Small slope when $z<0$, defined by $\alpha$. $\alpha$ is typically set to 0.01, but it can be handled as a hyperparameter.
Prevents neurons from fully dying.

<img src="https://pytorch.org/docs/stable/_images/LeakyReLU.png" width="400">

#### PReLU

Parametric leaky ReLU --> $\alpha$ is learned during training by backpropagation.
Works better for large datasets, but can overfit on small ones.

#### ELU

Apparently outperforms other ReLU variants at the expense of computation speed during evaluation.
Training is generally faster due to faster convergence rate.

<img src="https://pytorch.org/docs/stable/_images/ELU.png" width="400">

#### SELU

Scaled ELU. Generally outperforms ELU in the following conditions:
- Sequential architecture
- All dense layers
- Normalized inputs
- LeCun normal initialization
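
To make the differences between these activation functions concrete, here is a minimal scalar sketch of each one. The function names are illustrative, not Keras API; PReLU has the same form as leaky ReLU, except that `alpha` would be a learned parameter. The SELU constants are the ones commonly quoted from the SELU paper.

```python
import math

# Constants commonly quoted for SELU
SELU_ALPHA = 1.6732632423543772
SELU_SCALE = 1.0507009873554805

def relu(z):
    # Zero for negative inputs, identity otherwise
    return max(0.0, z)

def leaky_relu(z, alpha=0.01):
    # Small slope alpha for negative inputs keeps the gradient non-zero
    return z if z > 0 else alpha * z

def elu(z, alpha=1.0):
    # Smooth exponential curve for negative inputs, approaching -alpha
    return z if z > 0 else alpha * (math.exp(z) - 1)

def selu(z):
    # Scaled ELU with fixed alpha and scale
    return SELU_SCALE * elu(z, SELU_ALPHA)

for z in (-2.0, 0.0, 2.0):
    print(z, relu(z), leaky_relu(z), elu(z), selu(z))
```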



In [None]:
model = keras.models.Sequential([
    # Add a leaky ReLU activation function by adding a layer to the model like this:
    keras.layers.Dense(10, kernel_initializer="he_normal"),
    keras.layers.LeakyReLU(negative_slope=0.2),
    # PReLU is added as a separate layer in the same way:
    keras.layers.Dense(10, kernel_initializer="he_normal"),
    keras.layers.PReLU(),
    # Alternatively, for the SELU you can set it via a property in the Dense layer
    keras.layers.Dense(10, kernel_initializer="lecun_normal", activation="selu")
])
model.summary()

### Batch Normalization

Another technique to handle vanishing/exploding gradients.
Consists of zero-centering and normalizing inputs before each hidden layer. Input mean/stdv is calculated over each mini-batch.
It then scales and shifts the results using two new parameter vectors.

$$
z^{(i)} = \gamma * \hat{x}^{(i)} + \beta
$$

Where $\gamma$ is the scaling vector, $\hat{x}^{(i)}$ is the normalized input for the $i^{th}$ instance, and $\beta$ is the shifting vector.
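
A minimal sketch of the training-time computation for a single feature follows, assuming a small smoothing term `eps` inside the square root (the value 1e-3 and the sample batch are assumptions for illustration). The function name is hypothetical.

```python
import math

def batch_norm_train(batch, gamma, beta, eps=1e-3):
    """Normalize one feature over a mini-batch, then scale and shift."""
    n = len(batch)
    mu = sum(batch) / n                              # mini-batch mean
    var = sum((x - mu) ** 2 for x in batch) / n      # mini-batch variance
    x_hat = [(x - mu) / math.sqrt(var + eps) for x in batch]
    z = [gamma * xh + beta for xh in x_hat]          # z = gamma * x_hat + beta
    return z, mu, var

batch = [2.0, 4.0, 6.0, 8.0]  # hypothetical values of one feature
z, mu, var = batch_norm_train(batch, gamma=1.5, beta=0.5)
print(mu, var)  # batch statistics
print(z)        # outputs are centered on beta, with spread set by gamma
```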

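$\mu$ and $\sigma$ of the inputs are also tracked on a rolling basis --> each mini-batch seen during training updates a moving average of the input mean and standard deviation. These moving averages are then used to normalize data points at test time, when there may be no batch to compute statistics from.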

Note that **batch normalization** acts as a regularizer, reducing the need for other regularization techniques. When added as the first layer, it can even act as a standardizer, removing the need to pre-normalize the data.

Downsides(ish):
- Generally each epoch takes longer but fewer epochs are required.
- Runtime penalty due to extra computations.
    - Can be mitigated by fusing the BN layer with the previous layer after training? TODO: Investigate
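
On the fusing idea: when a BN layer directly follows a dense layer, the two are both affine at inference time, so they can be collapsed into one set of weights and biases. The sketch below shows this for a single neuron; all names and values are hypothetical, and `eps` is an assumed smoothing term.

```python
import math

def dense(x, w, b):
    # Weighted sum of the inputs plus bias
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def batch_norm_inference(z, gamma, beta, moving_mean, moving_var, eps=1e-3):
    # BN at inference uses the moving statistics, not batch statistics
    return gamma * (z - moving_mean) / math.sqrt(moving_var + eps) + beta

def fuse(w, b, gamma, beta, moving_mean, moving_var, eps=1e-3):
    # Fold the BN scale and shift into the dense layer's weights and bias
    scale = gamma / math.sqrt(moving_var + eps)
    fused_w = [wi * scale for wi in w]
    fused_b = (b - moving_mean) * scale + beta
    return fused_w, fused_b

# Hypothetical parameters for one neuron
w, b = [0.2, -0.5, 1.0], 0.1
gamma, beta, mean, var = 1.5, -0.3, 0.4, 2.0
x = [1.0, 2.0, 3.0]

unfused = batch_norm_inference(dense(x, w, b), gamma, beta, mean, var)
fused_w, fused_b = fuse(w, b, gamma, beta, mean, var)
fused = dense(x, fused_w, fused_b)
print(unfused, fused)  # the two outputs agree up to floating-point error
```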

In [None]:
# Keras supports BN by adding layers to the model
model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28,28]),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(300, activation="elu", kernel_initializer="he_normal"),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(100, activation="elu", kernel_initializer="he_normal"),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(10, activation="softmax"),
])

model.summary()


Each BN layer adds 2 × (number of inputs) trainable parameters ($\gamma$ and $\beta$) and the same number of **non-trainable** parameters (the moving averages of $\mu$ and $\sigma^2$). Non-trainable means they are not updated by backpropagation.

In [None]:
[(var.name, var.trainable) for var in model.layers[1].variables]

[('gamma', True),
 ('beta', True),
 ('moving_mean', False),
 ('moving_variance', False)]

In [None]:
# Alternative is to add each BN layer before the activation function. Some people argue this is better, can easily be done:
model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28,28]),
    keras.layers.BatchNormalization(),
    # use_bias set to False because the BN layer offset substitutes the bias
    keras.layers.Dense(300, kernel_initializer="he_normal", use_bias=False),
    keras.layers.BatchNormalization(),
    keras.layers.Activation("elu"),
    keras.layers.Dense(100, kernel_initializer="he_normal", use_bias=False),
    keras.layers.BatchNormalization(),
    keras.layers.Activation("elu"),
    keras.layers.Dense(10, activation="softmax"),
])

model.summary()


Tweakable hyperparameters for BN:
- momentum: controls how quickly the moving $\mu$ and $\sigma$ are updated. The default is usually fine. Good values are close to 1, and closer still (e.g. 0.99 or 0.999) for larger datasets or smaller mini-batches.
- axis: relevant with multidimensional data. The default works fine when the input is just a 1D array of features

### Gradient Clipping

A common technique for RNN where BN is hard to use.
Limits components of the gradient vectors to defined thresholds (tuned as hyperparameters).
Clipping by value can change the direction of the gradient vector (for example, [0.9, 100] becomes [0.9, 1]), but this works well in practice. To preserve the direction, clip by norm instead: the whole vector is scaled down when its norm exceeds the threshold.
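
The two clipping variants can be sketched in plain Python as follows. The helper names are illustrative and do not correspond to Keras functions.

```python
import math

def clip_by_value(grad, threshold):
    # Clip each component independently to [-threshold, threshold]
    return [max(-threshold, min(threshold, g)) for g in grad]

def clip_by_norm(grad, max_norm):
    # Rescale the whole vector if its L2 norm exceeds max_norm
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= max_norm:
        return list(grad)
    return [g * max_norm / norm for g in grad]

grad = [0.9, 100.0]
print(clip_by_value(grad, 1.0))  # [0.9, 1.0]: only the second component is capped, direction changes
print(clip_by_norm(grad, 1.0))   # both components shrink by the same factor, direction preserved
```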

In [None]:
# Clipping can be implemented easily by using clipvalue
# Use clipnorm if you want to keep the gradient direction.
optimizer = keras.optimizers.SGD(clipvalue=1.0)
model.compile(loss="mse", optimizer=optimizer)

## Transfer learning

You can import trained layers from other neural networks with similar functions to speed up training and reduce training data requirements.

- The closer the tasks the more layers you can import.
- Good to initially try freezing imported layers (so they're not impacted by backpropagation) and try training new NN.
- Unfreeze layers/change number of imported layers until desired performance is achieved.

Note: Transfer learning doesn't work too well with small neural networks, it is usually used for larger CNN or RNNs.

In [None]:
# Keras supports transfer learning by importing previous models
model_A = keras.models.load_model("../chapter-10/my_keras_model_0.001.h5")
model_B_on_A = keras.models.Sequential(model_A.layers[:-1])
# Add a new output layer to substitute the last one from the imported model
model_B_on_A.add(keras.layers.Dense(1, activation="sigmoid"))

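# Note: layers are shared by reference, so training model_B_on_A also changes model_A.
# To avoid this, clone model_A *before* reusing its layers. clone_model() only copies the
# architecture, so the weights have to be copied separately:
model_A_clone = keras.models.clone_model(model_A)
model_A_clone.set_weights(model_A.get_weights())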

# Model B output layer will now be initialized randomly --> the large initial errors can severely disrupt the imported layers' weights
# We can freeze imported layers to prevent output layer training from having too much impact
for layer in model_B_on_A.layers[:-1]:
    layer.trainable = False

model_B_on_A.compile(loss="binary_crossentropy", optimizer="sgd", metrics=["accuracy"])



In [None]:
# Fewer epochs required to train the model
history = model_B_on_A.fit(X_train_B, y_train_B, epochs=4, validation_data=(X_valid_B, y_valid_B))

### Unsupervised pretraining

Less common technique used to improve results when not enough labeled data is available.
Involves reusing the lower layers of a NN that was trained without labels, such as an autoencoder or a generative adversarial network (GAN), and then training the final task on the labeled data.

## Optimization

### Momentum optimization

A modified version of gradient descent. Instead of directly shifting the weights by the gradient of the cost function times the learning rate ($\eta\nabla_{\theta}J(\theta)$), it updates them with a momentum vector $m$, which itself is updated by the gradient of the cost function.
- So the momentum vector accumulates all the gradients up until that point --> the gradient is used for acceleration instead of speed.
- New hyperparameter $\beta$ (the momentum) acts as friction to keep $m$ from growing too large. It ranges from 0 (high friction) to 1 (no friction). Typical value is 0.9.
- Generally converges much faster than gradient descent.

$$
m \leftarrow \beta*m - \eta\nabla_{\theta}J(\theta)
$$
$$
\theta \leftarrow \theta+m
$$
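
A minimal sketch of the two update equations, applied to a hypothetical one-dimensional cost $J(\theta)=\theta^2$ (the cost function, learning rate and starting point are assumptions chosen for illustration):

```python
def momentum_step(theta, m, grad_fn, lr=0.1, beta=0.9):
    # m <- beta * m - lr * grad J(theta)
    m = beta * m - lr * grad_fn(theta)
    # theta <- theta + m
    theta = theta + m
    return theta, m

def grad_fn(theta):
    # Gradient of the hypothetical cost J(theta) = theta^2
    return 2 * theta

theta, m = 5.0, 0.0
for _ in range(300):
    theta, m = momentum_step(theta, m, grad_fn)
print(theta)  # close to the minimum at 0
```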

In [None]:
# By using the momentum parameter Keras will implement SGD with momentum
optimizer = keras.optimizers.SGD(learning_rate=0.001, momentum=0.9)

### Nesterov accelerated gradient

Enhanced version of momentum optimization which almost always works better.
Measures the gradient to add to the momentum slightly ahead of the local position ($\theta+\beta*m$).
- Tiny improvement in each iteration, but adds up to being much faster.
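
The only change from regular momentum is where the gradient is evaluated. A minimal sketch, again on a hypothetical cost $J(\theta)=\theta^2$ with made-up hyperparameters:

```python
def nesterov_step(theta, m, grad_fn, lr=0.1, beta=0.9):
    # Measure the gradient slightly ahead, at theta + beta * m
    lookahead = theta + beta * m
    m = beta * m - lr * grad_fn(lookahead)
    theta = theta + m
    return theta, m

def grad_fn(theta):
    # Gradient of the hypothetical cost J(theta) = theta^2
    return 2 * theta

theta, m = 5.0, 0.0
for _ in range(200):
    theta, m = nesterov_step(theta, m, grad_fn)
print(theta)  # close to the minimum at 0
```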

In [None]:
# By using the nesterov parameter Keras will implement SGD with momentum using NAG
optimizer = keras.optimizers.SGD(learning_rate=0.001, momentum=0.9, nesterov=True)

### AdaGrad

Not useful for neural networks, but forms the basis for other optimizers.
Decays the learning rate along the steeper dimensions (where the gradient is larger).
In practice, does so too fast to be useful for neural networks.

Step 1:
$$
s \leftarrow s + \nabla_{\theta}J(\theta)*\nabla_{\theta}J(\theta)
$$
This step accumulates the square of the partial derivatives of the cost function ($J(\theta)$) along each parameter $\theta_{i}$. --> $s$ gets larger along the steeper dimensions.

Step 2:
$$
\theta \leftarrow \theta - \eta\nabla_{\theta}J(\theta)/\sqrt{s + \epsilon}
$$
So it is gradient descent divided by $\sqrt{s + \epsilon}$, which effectively decays the learning rate along the steeper axes.
$\epsilon$ is an auxiliary parameter to prevent division by 0.
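
The two steps can be sketched for a small parameter vector as below. The gradient values are hypothetical (one steep dimension, one shallow one), and the function name is illustrative.

```python
import math

def adagrad_step(theta, grad, s, lr=0.1, eps=1e-10):
    # Step 1: accumulate the squared partial derivatives per parameter
    s = [s_i + g * g for s_i, g in zip(s, grad)]
    # Step 2: gradient descent step, scaled down by sqrt(s + eps) per parameter
    theta = [t - lr * g / math.sqrt(s_i + eps) for t, g, s_i in zip(theta, grad, s)]
    return theta, s

theta, s = [0.0, 0.0], [0.0, 0.0]
grad = [10.0, 0.1]  # hypothetical constant gradient: steep along dim 0, shallow along dim 1
for step in range(4):
    theta, s = adagrad_step(theta, grad, s)
    print(step, theta, s)  # s grows much faster along the steep dimension
```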

### RMSProp

AdaGrad but with slower learning rate decay
- Adds a hyperparameter $\beta$ which handles exponential decay to AdaGrad.
- Usually set to 0.9.

Basically only changes the first step:
$$
s \leftarrow \beta*s + (1-\beta)\nabla_{\theta}J(\theta)*\nabla_{\theta}J(\theta)
$$

Then applies shift to $\theta$ the same way
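
A minimal single-parameter sketch of this variant follows. With a constant gradient $g$ (a hypothetical setup), $s$ settles near $g^2$ instead of growing without bound as in AdaGrad, so the learning rate stops decaying.

```python
import math

def rmsprop_step(theta, grad, s, lr=0.001, beta=0.9, eps=1e-7):
    # Exponentially decaying average of squared gradients
    s = beta * s + (1 - beta) * grad * grad
    # Same parameter update as AdaGrad
    theta = theta - lr * grad / math.sqrt(s + eps)
    return theta, s

theta, s = 0.0, 0.0
for _ in range(200):
    theta, s = rmsprop_step(theta, grad=2.0, s=s)
print(s)  # approaches grad^2 = 4 rather than growing indefinitely
```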



In [None]:
# Keras has an RMSProp optimizer; rho is the decay hyperparameter (beta in the notes above)
optimizer = keras.optimizers.RMSprop(learning_rate=0.001, rho=0.9)

### Adam and Nadam optimization

Adam (__adaptive moment estimation__) combines momentum optimization with RMSProp.
Basically keeps track of the momentum vector $m$ and the decaying average of past squared gradients $s$, and uses both to shift $\theta$:

$$
\theta \leftarrow \theta + \eta \hat{m} / \sqrt{\hat{s}+\epsilon}
$$

- momentum hyperparameter $\beta_1$ is usually 0.9
- decay hyperparameter $\beta_2$ (RMSProp side of the formula) is usually set to 0.999
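
A single-parameter sketch of one Adam step, keeping the sign convention of the formula above ($m$ accumulates the negative gradient, so $\theta$ is shifted by $+\eta\hat{m}$). The bias correction that produces $\hat{m}$ and $\hat{s}$ is not spelled out in these notes; the form used here (dividing by $1-\beta^t$) is the standard one and is stated as an assumption.

```python
import math

def adam_step(theta, grad, m, s, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-7):
    # Momentum side: decaying average of (negative) gradients
    m = beta1 * m - (1 - beta1) * grad
    # RMSProp side: decaying average of squared gradients
    s = beta2 * s + (1 - beta2) * grad * grad
    # Bias correction (assumed standard form), t is the step number starting at 1
    m_hat = m / (1 - beta1 ** t)
    s_hat = s / (1 - beta2 ** t)
    theta = theta + lr * m_hat / math.sqrt(s_hat + eps)
    return theta, m, s

def grad_fn(theta):
    # Gradient of a hypothetical cost J(theta) = (theta - 3)^2
    return 2 * (theta - 3)

theta, m, s = 0.0, 0.0, 0.0
for t in range(1, 6):
    theta, m, s = adam_step(theta, grad_fn(theta), m, s, t)
    print(t, theta)  # moves toward 3 in steps of roughly lr
```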

**Variations**
- AdaMax: Variation which changes how $\hat{s}$ is computed, generally performs worse, but not always.
- Nadam: Adam + Nesterov trick

Note that RMSProp, Adam and Nadam are good for speed, but occasionally generalize poorly on some datasets. Sometimes it is worth trying plain Nesterov.

In [None]:
# Set the initial learning rate to 0.001; Adam adapts the effective step size as training progresses
optimizer = keras.optimizers.Adam(learning_rate=0.001, beta_1=0.9, beta_2=0.999)

## Learning rate scheduling

- Learning rate way too high: model diverges
- A little too high: suboptimal solution (model converges but dances around the optimal value)
- Too low: too slow

There are strategies to change learning rate during training in order to get the speed of high values with the accuracy of low values

### Power scheduling
$$
\eta(t) = \eta_0/(1+t/s)^c
$$
- $c$ usually set to 1, but is a hyperparameter
- $s$ sets the pace of the decay: with $c=1$, the LR is $\eta_0/2$ after $s$ steps, $\eta_0/3$ after $2s$ steps, and so on
- Requires also tuning $\eta_0$ and $s$
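
The formula is simple enough to evaluate directly. A minimal sketch, with made-up values for $\eta_0$ and $s$:

```python
def power_schedule(t, eta0=0.01, s=10_000, c=1):
    # eta(t) = eta0 / (1 + t/s)^c
    return eta0 / (1 + t / s) ** c

for t in (0, 10_000, 20_000, 30_000):
    print(t, power_schedule(t))  # eta0, then eta0/2, eta0/3, eta0/4 for c=1
```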

### Exponential scheduling
$$
\eta(t) = \eta_0*0.1^{t/s}
$$
- LR drops by a factor of 10 every $s$ steps

### Piecewise constant scheduling
Uses a constant LR for a number of epochs, then a lower constant LR for the next block of epochs, and so on.
Requires manually finding good change values
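
A minimal sketch of such a schedule; the epoch boundaries and learning rates below are hypothetical values that would need tuning.

```python
def piecewise_constant(epoch, boundaries=(5, 15), values=(0.01, 0.005, 0.001)):
    # values[i] applies until epoch reaches boundaries[i]; the last value applies afterwards
    for boundary, value in zip(boundaries, values):
        if epoch < boundary:
            return value
    return values[-1]

for epoch in (0, 4, 5, 14, 15, 30):
    print(epoch, piecewise_constant(epoch))
```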

### Performance scheduling
Reduces LR by a factor of $\lambda$ whenever the validation error stops dropping during training

### 1cycle scheduling
- First find $\eta_1$ by finding the optimal learning rate.
- Set $\eta_0$ roughly 10 times lower
- Schedule grows $\eta$ linearly during the first half of training from $\eta_0 \rightarrow \eta_1$
- Decreases it back during the second half of training.

When using momentum, it is scheduled in the opposite direction (high -> low -> high): it starts high, drops while the LR grows, and climbs back while the LR decreases. (TODO: Understand momentum in this context)
Generally performs better than other methods.
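
The learning rate part of the schedule described above (linear up for the first half, linear back down for the second) can be sketched as a function of the training step. The function name and the values of $\eta_0$, $\eta_1$ and the step count are assumptions.

```python
def one_cycle_lr(step, total_steps, eta0, eta1):
    half = total_steps / 2
    if step <= half:
        # First half: grow linearly from eta0 to eta1
        return eta0 + (eta1 - eta0) * step / half
    # Second half: decrease linearly from eta1 back to eta0
    return eta1 - (eta1 - eta0) * (step - half) / half

total_steps = 100
for step in (0, 25, 50, 75, 100):
    print(step, one_cycle_lr(step, total_steps, eta0=0.001, eta1=0.01))
```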



In [None]:
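# Power scheduling (c=1): the LR follows eta0 / (1 + t/s), where s = decay_steps / decay_rate
# Older Keras versions exposed this as SGD(decay=1/s); recent versions use a schedule object instead
lr_schedule = keras.optimizers.schedules.InverseTimeDecay(
    initial_learning_rate=0.01, decay_steps=10_000, decay_rate=1.0)
optimizer = keras.optimizers.SGD(learning_rate=lr_schedule)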

# Performance scheduling can be done via callback, where lambda is the factor parameter
# The example below multiplies LR by 0.5 when 5 epochs pass with no improvement
lr_scheduler = keras.callbacks.ReduceLROnPlateau(factor=0.5, patience=5)

# Keras supports custom functions for scheduling, useful for piecewise and exponential scheduling
# The function can accept two params you can use in your logic, epoch and lr.
def exponential_decay_fn(epoch, lr):
    return 0.01*0.1**(epoch/20)

lr_scheduler = keras.callbacks.LearningRateScheduler(exponential_decay_fn)
history = model.fit(X_train, y_train, callbacks=[lr_scheduler])

When a model is saved the optimizer and the LR are saved with it --> you can load a trained model and continue training. However, the epoch is not saved --> if the scheduling function uses the epoch you can't just continue training. Use the `initial_epoch` parameter to set the epoch to the correct value.

## Regularization

Besides early stopping and batch normalization there are other regularization techniques to remove useless parameters.
You generally want some form of regularization at every layer of the network.

### $l_1$ and $l_2$ regularization

$l_2$ regularization can constrain a model's weights.
$l_1$ regularization is effective to create a sparse model (many weights set to 0) --> better runtime speed
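
Both penalties are terms added to the loss. A minimal sketch of how they are computed for a hypothetical weight vector, with a regularization factor of 0.01:

```python
def l1_penalty(weights, factor=0.01):
    # Sum of absolute values: pushes weights all the way to 0 (sparsity)
    return factor * sum(abs(w) for w in weights)

def l2_penalty(weights, factor=0.01):
    # Sum of squares: penalizes large weights more heavily
    return factor * sum(w * w for w in weights)

weights = [1.0, -2.0, 0.0]  # hypothetical connection weights
print(l1_penalty(weights), l2_penalty(weights))
```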


In [None]:
# Use the kernel_regularizer parameter
layer = keras.layers.Dense(100, activation="elu",
                           kernel_initializer="he_normal",
                           kernel_regularizer=keras.regularizers.l2(0.01))


You can also use the `keras.regularizers.l1()` and `keras.regularizers.l1_l2()` regularizers

### Dropout

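At every training step, each neuron has a probability $p$ of being temporarily dropped (its output is set to 0 for that step). The rest of the network weights are updated accordingly.
The final model uses all neurons.
Connection weights after training have to be multiplied by the keep probability $1-p$ to compensate for the fact that each neuron was only active a fraction $1-p$ of the time during training. Equivalently, the outputs can be divided by $1-p$ during training, which is what Keras does.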

- $p$ is usually between 10% and 50%
- 20-30% in recurrent NNs
- 40-50% in convolutional NNs

Very useful to reduce overfitting.
- Increase rate if model overfits
- Decrease if underfits
- Larger layers usually use larger dropout rates.
- Many models will only implement dropout after the last hidden layer.
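
A minimal sketch of dropout on a list of activations, using the variant that rescales during training (dividing the surviving outputs by the keep probability) so nothing needs adjusting afterwards. The function name and signature are illustrative.

```python
import random

def dropout(activations, rate, training, rng):
    # At inference time (or with rate 0) every neuron is kept as-is
    if not training or rate == 0.0:
        return list(activations)
    keep = 1.0 - rate
    # Each neuron is dropped with probability `rate`; survivors are scaled by 1/keep
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

rng = random.Random(0)
activations = [1.0] * 1000
out = dropout(activations, rate=0.2, training=True, rng=rng)
print(sum(out) / len(out))  # stays close to the original mean of 1.0
```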

In [None]:
# Use a dropout layer to implement this.
model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28,28]),
    keras.layers.Dropout(rate=0.2),
    ...
])

### MonteCarlo Dropout

Instead of only testing on a final model with all the neurons active, dropout is kept active at prediction time: each test sample is passed through the network several times (each pass with a different random set of inactive neurons) and the predictions are averaged.
- The number of passes (Monte Carlo samples) is a hyperparameter
    - Each additional pass increases prediction time linearly (training is unaffected)
    - Diminishing returns past a certain number
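
The averaging idea can be sketched with a toy single-neuron "model" that applies dropout to its inputs on every prediction. The weights, inputs and sample count are hypothetical; with Keras, the equivalent would be calling the model with `training=True` several times and averaging the results.

```python
import random

def predict_with_dropout(x, weights, rate, rng):
    # One stochastic forward pass: dropout stays active at prediction time
    keep = 1.0 - rate
    return sum(w * (xi / keep if rng.random() < keep else 0.0)
               for w, xi in zip(weights, x))

def mc_dropout_predict(x, weights, rate, n_samples, rng):
    # Average many stochastic predictions for the same input
    samples = [predict_with_dropout(x, weights, rate, rng) for _ in range(n_samples)]
    return sum(samples) / n_samples, samples

rng = random.Random(42)
x, weights = [1.0, 2.0, 3.0], [0.5, -0.2, 0.1]
mean, samples = mc_dropout_predict(x, weights, rate=0.5, n_samples=2000, rng=rng)
print(mean)                         # close to the no-dropout output
print(min(samples), max(samples))   # the spread gives a rough idea of uncertainty
```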

### Max-Norm Regularization

TODO