# Training Deep Neural Networks

What if you need to tackle a complex problem, such as detecting hundreds of types of objects in high-resolution images? You may need to train a much deeper DNN.

Training a deep DNN isn’t a walk in the park. Here are
some of the problems you could run into:

- You may be faced with the tricky *vanishing gradients* problem (gradients shrink toward 0, so the weights of the lower layers are barely updated) or the related *exploding gradients* problem (gradients grow toward infinity or NaN).
- Training may be extremely slow.

In this chapter we will go through each of these problems and present
techniques to solve them. 

## The Vanishing/Exploding Gradients Problems

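Glorot and Bengio found a few suspects behind unstable gradients, including the
combination of the popular logistic sigmoid activation function and
the weight initialization technique that was most popular at the time (i.e.,
a normal distribution with a mean of 0 and a standard deviation of 1). With this
combination, the variance of the outputs of each layer is much greater than the
variance of its inputs, and the fact that the logistic function has a mean of 0.5,
not 0, makes things worse.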

In [1]:
import numpy as np
import tensorflow as tf
from tensorflow import keras
import matplotlib.pyplot as plt

In [2]:
def sigmoid(z):
    return 1 / (1 + np.exp(-z))

<img src='satur.png' />

> When the inputs become large (negative or positive), the function saturates at 0 or 1, with a derivative extremely close to 0.

> Thus, when backpropagation kicks in, it has virtually no gradient to propagate back through the network, and what little gradient exists keeps getting **diluted as backpropagation progresses** down through the top layers, so there is **really nothing left for the lower layers**.
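
To make the saturation concrete, here is a minimal NumPy sketch. The helper name `sigmoid_derivative` and the sample inputs are illustrative assumptions, not part of the original notebook. It evaluates the derivative of the logistic function, σ′(z) = σ(z)(1 − σ(z)), and shows how quickly a product of such factors shrinks when many sigmoid layers are stacked (weights are ignored here for simplicity).

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def sigmoid_derivative(z):
    # d/dz sigmoid(z) = sigmoid(z) * (1 - sigmoid(z)); its maximum is 0.25 at z = 0
    s = sigmoid(z)
    return s * (1 - s)

z = np.array([0.0, 2.0, 5.0, 10.0])
grads = sigmoid_derivative(z)
print(grads)  # the derivative shrinks quickly as |z| grows

# Even in the best case (z = 0 everywhere), backpropagating through
# 10 sigmoid activations multiplies the gradient by 0.25 ten times.
best_case_after_10_layers = 0.25 ** 10
print(best_case_after_10_layers)
```

The largest possible factor is 0.25, so the gradient that reaches the lower layers is scaled down at every sigmoid it passes through.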

### Glorot (Xavier) and He Initialization

Glorot and Bengio proposed a way to significantly alleviate the unstable gradients problem.

<img src="init.png"/>

> By default, Keras uses Glorot initialization with a uniform distribution.

In [11]:
[name for name in dir(keras.initializers) if not name.startswith("_")]

['Constant',
 'GlorotNormal',
 'GlorotUniform',
 'HeNormal',
 'HeUniform',
 'Identity',
 'Initializer',
 'LecunNormal',
 'LecunUniform',
 'Ones',
 'Orthogonal',
 'RandomNormal',
 'RandomUniform',
 'TruncatedNormal',
 'VarianceScaling',
 'Zeros',
 'constant',
 'deserialize',
 'get',
 'glorot_normal',
 'glorot_uniform',
 'he_normal',
 'he_uniform',
 'identity',
 'lecun_normal',
 'lecun_uniform',
 'ones',
 'orthogonal',
 'random_normal',
 'random_uniform',
 'serialize',
 'truncated_normal',
 'variance_scaling',
 'zeros']

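Glorot and Bengio proposed a good compromise that has
proven to work very well in practice: the connection weights of each
layer must be initialized randomly as described in Equation 11-1, where
fan_avg = (fan_in + fan_out) / 2. The weights are drawn either from a normal
distribution with mean 0 and variance σ² = 1 / fan_avg, or from a uniform
distribution between −r and +r, with r = sqrt(3 / fan_avg).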

> - Fan-in: the number of inputs to the layer.
> - Fan-out: the number of neurons (outputs) in the layer.

> Glorot initialization can speed up training considerably, and it is
one of the tricks that led to the success of Deep Learning.
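
As a rough illustration of the uniform variant, the sketch below draws a weight matrix by hand with NumPy and checks that its variance is close to 1 / fan_avg. The function name `glorot_uniform` and the layer sizes (300 inputs, 100 neurons) are assumptions chosen for the example; this is not the Keras implementation.

```python
import numpy as np

def glorot_uniform(fan_in, fan_out, rng):
    # Uniform distribution between -r and +r, with r = sqrt(3 / fan_avg)
    fan_avg = (fan_in + fan_out) / 2
    r = np.sqrt(3 / fan_avg)
    return rng.uniform(-r, r, size=(fan_in, fan_out))

rng = np.random.default_rng(42)
fan_in, fan_out = 300, 100            # hypothetical layer sizes
W = glorot_uniform(fan_in, fan_out, rng)

target_variance = 1 / ((fan_in + fan_out) / 2)
print(W.var(), target_variance)       # the two values should be close
```

A uniform distribution on [−r, r] has variance r² / 3, which is why r = sqrt(3 / fan_avg) yields the same variance as the normal variant.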

In [4]:
keras.layers.Dense(10, activation='relu', kernel_initializer='he_normal')

<keras.layers.core.dense.Dense at 0x1f278b5d3c0>

In [9]:
# Custom initializer: He-style variance scaling (scale=2.) based on fan_avg, with a uniform distribution
init = keras.initializers.VarianceScaling(scale=2., mode='fan_avg', distribution='uniform')
keras.layers.Dense(10, activation='relu', kernel_initializer=init)

<keras.layers.core.dense.Dense at 0x1f278c348b0>

### Nonsaturating Activation Functions

> One of the insights of Glorot and Bengio was that the
problems with unstable gradients were in part due to a poor choice of activation function.

Other activation functions behave much better in deep neural networks, in particular the ReLU activation function, mostly because it does not saturate for positive values (and because it is fast to compute).

> The ReLU activation function is not perfect. It suffers from
a problem known as the dying ReLUs: during training, some neurons effectively “die,” meaning they stop outputting anything other than 0.

To solve this problem, you may want to use a variant of the ReLU function, such as the leaky ReLU. 

In [6]:
def leaky_relu(z, alpha=0.01):
    return np.maximum(alpha*z, z)

> The hyperparameter α defines how much the function “leaks”: it is the slope of the function for z < 0. This small slope ensures that leaky ReLUs never die.

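A 2015 paper by Bing Xu et al. that compared several ReLU variants also evaluated the randomized leaky ReLU (RReLU), where α is picked randomly in a given range during training. RReLU seemed to act as a regularizer.

Finally, the paper evaluated the parametric leaky ReLU (PReLU), where α is learned during training instead of being a hyperparameter.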

> PReLU was reported to strongly outperform ReLU on large image datasets, but on smaller datasets it runs the risk of overfitting the training set.


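A 2015 paper by Djork-Arné Clevert et al. proposed a new activation function called the exponential linear unit (ELU) that outperformed all the ReLU variants in the authors' experiments. Its main drawback is that it is slower to
compute than the ReLU function.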
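
For reference, ELU is defined as α(exp(z) − 1) for z < 0 and z otherwise. The NumPy sketch below follows that definition. The function name `elu` and the default α = 1.0 are assumptions for the example.

```python
import numpy as np

def elu(z, alpha=1.0):
    # For z < 0: alpha * (exp(z) - 1), which approaches -alpha for very negative z.
    # For z >= 0: identity. np.minimum avoids overflow in exp() for large positive z.
    return np.where(z < 0, alpha * (np.exp(np.minimum(z, 0)) - 1), z)

z = np.array([-10.0, -1.0, 0.0, 2.0])
out = elu(z)
print(out)
```

Unlike ReLU, the output is nonzero for negative inputs, so the gradient does not vanish entirely there. The call to `exp` is what makes it slower to compute.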

Günter Klambauer et al. introduced the Scaled
ELU (SELU) activation function:

> If you build a neural network composed exclusively of a stack of dense layers, and all hidden layers use the SELU activation function, then the network will self-normalize: the output of each layer will tend to preserve a mean of 0 and a standard deviation of 1 during training, which solves the vanishing/exploding gradients problem. As a result, the SELU activation function often significantly outperforms other activation functions for such neural nets. There are, however, a few conditions for self-normalization to happen:
>
> - The network’s architecture must be sequential. Unfortunately, if you try to use SELU in nonsequential architectures, such as recurrent networks, self-normalization is not guaranteed, so SELU will not necessarily outperform other activation functions.
> - The input features must be standardized (mean 0 and standard deviation 1).
> - Every hidden layer’s weights must be initialized with LeCun normal
initialization. In Keras, this means setting
kernel_initializer="lecun_normal".
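
The self-normalizing behavior can be observed in a small NumPy experiment. This is only a sketch under stated assumptions: the constants are the rounded values of α and the scale from the SELU paper, the layer width (100), depth (50) and batch size (500) are arbitrary, and a plain normal distribution with variance 1 / fan_in stands in for LeCun normal initialization.

```python
import numpy as np

# Rounded constants from the SELU paper
ALPHA = 1.6732632423543772
SCALE = 1.0507009873554805

def selu(z):
    return SCALE * np.where(z < 0, ALPHA * (np.exp(np.minimum(z, 0)) - 1), z)

rng = np.random.default_rng(42)
n_units = 100
Z = rng.normal(size=(500, n_units))          # standardized inputs: mean 0, std 1

for _ in range(50):                          # 50 stacked dense layers, no biases
    # Normal weights with variance 1 / fan_in (approximating LeCun normal)
    W = rng.normal(scale=np.sqrt(1 / n_units), size=(n_units, n_units))
    Z = selu(Z @ W)

final_mean, final_std = Z.mean(), Z.std()
print(final_mean, final_std)                 # should stay near 0 and 1
```

Replacing `selu` with a sigmoid or a plain ReLU in this loop is a quick way to see the activations drift or collapse instead.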


In general, SELU > ELU > leaky
ReLU (and its variants) > ReLU > tanh > logistic.

- If the network’s architecture prevents it from self-normalizing, then ELU may perform better than SELU.
- If you care a lot about runtime latency, then you may
prefer leaky ReLU. If you don't want to tune yet another hyperparameter, you can use the default α value used by Keras (0.3 for leaky ReLU).
- If you have spare time and computing power, you can use cross-validation to evaluate other
activation functions, such as RReLU if your network is overfitting or PReLU if you
have a huge training set.

> If speed is your priority, ReLU might still be the best choice.

<img src="leak.png"/>

In [7]:
[activation for activation in dir(keras.activations) if not activation.startswith("_")]

['deserialize',
 'elu',
 'exponential',
 'gelu',
 'get',
 'hard_sigmoid',
 'linear',
 'relu',
 'selu',
 'serialize',
 'sigmoid',
 'softmax',
 'softplus',
 'softsign',
 'swish',
 'tanh']

In [8]:
[layer for layer in dir(keras.layers) if "relu" in layer.lower()]

['LeakyReLU', 'PReLU', 'ReLU', 'ThresholdedReLU']

In [13]:
# To use SELU, set activation="selu" and use LeCun normal initialization
layer = keras.layers.Dense(10, activation="selu", kernel_initializer="lecun_normal")
layer

<keras.layers.core.dense.Dense at 0x1f278c340a0>

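### Batch Normalization

Batch Normalization zero-centers and normalizes each input (mean 0, variance 1), then scales the result and shifts it by an offset (a bias-like term), using two learned parameter vectors per layer.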

Although using He initialization along with ELU (or any variant of ReLU) can significantly reduce the danger of the vanishing/exploding gradients problems at the beginning of training, it doesn’t guarantee that they
won’t come back during training.

The technique consists of adding an operation in the model just before or after the activation function of each hidden layer.

<img src="bn.png" />

source : https://www.youtube.com/watch?v=yXOMHOpbon8

In other words, the operation lets the model learn the optimal
scale and mean of each of the layer’s inputs.
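
The training-time computation can be sketched in a few lines of NumPy: compute the batch mean and variance per feature, normalize, then scale by γ and shift by β. The function name `batch_norm_train`, the batch shape and the smoothing term `eps=1e-3` are assumptions for the example. It leaves out the moving averages that a real layer also maintains.

```python
import numpy as np

def batch_norm_train(X, gamma, beta, eps=1e-3):
    mu = X.mean(axis=0)                     # per-feature mean over the batch
    var = X.var(axis=0)                     # per-feature variance over the batch
    X_hat = (X - mu) / np.sqrt(var + eps)   # zero-centered, normalized inputs
    return gamma * X_hat + beta             # learned scale and offset

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=3.0, size=(256, 4))   # unstandardized batch
gamma = np.ones(4)                                  # initial scale
beta = np.zeros(4)                                  # initial offset
Z = batch_norm_train(X, gamma, beta)
print(Z.mean(axis=0), Z.std(axis=0))
```

With γ = 1 and β = 0 the output is simply the standardized batch. During training, γ and β are updated by backpropagation like any other weights.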

In many cases, if you add a
BN layer as the very first layer of your neural network, you do not need to
standardize your training set; the BN layer will do it for you. The authors of the BN paper
also reported that the vanishing gradients problem was
strongly reduced, to the point that they could use saturating activation
functions such as the tanh and even the logistic activation function.

However, Batch Normalization comes with a runtime penalty: the neural network makes slower predictions due to the extra computations required at
each layer.

In [94]:
image_tensor = tf.random.uniform(shape=(3,3), minval=0, maxval=1)
print(image_tensor)

tf.Tensor(
[[0.13832426 0.38750803 0.09253335]
 [0.7931794  0.17145753 0.38980567]
 [0.57801473 0.13108122 0.6219201 ]], shape=(3, 3), dtype=float32)


In [95]:
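# Called outside of training, the layer runs in inference mode and uses its
# initial moving statistics (mean 0, variance 1), so the output barely changes.
bn_output = keras.layers.BatchNormalization()(image_tensor)
print(bn_output)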

tf.Tensor(
[[0.13825515 0.38731444 0.09248712]
 [0.79278314 0.17137186 0.38961092]
 [0.57772595 0.13101573 0.6216094 ]], shape=(3, 3), dtype=float32)
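
The output above is almost identical to the input. A plausible explanation, assuming the layer ran in inference mode with its initial moving mean of 0, moving variance of 1 and the Keras default epsilon of 0.001, is that each value was simply divided by sqrt(1 + 0.001). The NumPy check below reproduces the printed numbers under those assumptions.

```python
import numpy as np

# Values copied from the printed input and output tensors above
x = np.array([[0.13832426, 0.38750803, 0.09253335],
              [0.7931794,  0.17145753, 0.38980567],
              [0.57801473, 0.13108122, 0.6219201]], dtype=np.float32)
printed = np.array([[0.13825515, 0.38731444, 0.09248712],
                    [0.79278314, 0.17137186, 0.38961092],
                    [0.57772595, 0.13101573, 0.6216094]], dtype=np.float32)

# Assumed state of a freshly created layer: moving mean 0, moving variance 1
moving_mean, moving_var, eps = 0.0, 1.0, 1e-3
expected = (x - moving_mean) / np.sqrt(moving_var + eps)

matches = np.allclose(expected, printed, atol=1e-6)
print(matches)
```

To normalize with the statistics of the batch itself, the layer has to be called in training mode.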


In [101]:
model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),
    keras.layers.BatchNormalization(),  # first layer: no need to standardize the inputs manually
    keras.layers.Dense(300, activation="relu", use_bias=False),
    keras.layers.BatchNormalization(),  # BN after the activation
    keras.layers.Dense(100, use_bias=False),
    keras.layers.BatchNormalization(),  # BN before the activation
    keras.layers.Activation('relu'),
    keras.layers.Dense(10, activation="softmax")
])

> The layer before a BatchNormalization layer does not need bias terms: the BatchNormalization layer already has an offset parameter per input, so the bias would be a waste of parameters. You can set use_bias=False when creating those layers.

> The authors of the BN paper argued in favor of adding the BN layers before the activation functions, rather than after.

> You can experiment with this to see which option works best on
your dataset. To add the BN layers before the activation functions, you
must remove the activation function from the hidden layers and add
them as separate layers after the BN layers.

In [96]:
model.summary()

Model: "sequential_3"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 flatten_26 (Flatten)        (None, 784)               0         
                                                                 
 batch_normalization_35 (Bat  (None, 784)              3136      
 chNormalization)                                                
                                                                 
 dense_14 (Dense)            (None, 300)               235200    
                                                                 
 batch_normalization_36 (Bat  (None, 300)              1200      
 chNormalization)                                                
                                                                 
 dense_15 (Dense)            (None, 100)               30000     
                                                                 
 batch_normalization_37 (Bat  (None, 100)             

As you can see, each BN layer adds four parameters per input: γ, β, μ, and
σ (for example, the first BN layer adds 3,136 parameters, which is 4 × 784).

> The last two parameters, μ and σ, are the moving averages; they are not
affected by backpropagation, so Keras calls them “non-trainable”.

In [97]:
bn1 = model.layers[1]
[(var.name, var.trainable) for var in bn1.variables]

[('batch_normalization_35/gamma:0', True),
 ('batch_normalization_35/beta:0', True),
 ('batch_normalization_35/moving_mean:0', False),
 ('batch_normalization_35/moving_variance:0', False)]
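
The parameter counts in the summary can be checked by hand. The short sketch below redoes the arithmetic for the layer sizes used in the model (784, 300 and 100); the variable names are illustrative. The count for the third BN layer is computed here, since that row is cut off in the summary output.

```python
n_inputs, n_hidden1, n_hidden2 = 784, 300, 100

# Each BN layer holds gamma, beta, moving mean and moving variance per input
bn_params = [4 * n for n in (n_inputs, n_hidden1, n_hidden2)]

# Dense layers created with use_bias=False only have a weight matrix
dense_no_bias = [n_inputs * n_hidden1, n_hidden1 * n_hidden2]

# Half of the BN parameters (the moving averages) are non-trainable
non_trainable = sum(bn_params) // 2

print(bn_params, dense_no_bias, non_trainable)
```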

### Gradient Clipping

Another popular technique to mitigate the exploding gradients problem is
to clip the gradients during backpropagation so that they never exceed
some threshold.

This technique is most
often used in recurrent neural networks.

In [106]:
# Gradient clipping is just a matter of setting the clipvalue or clipnorm argument when creating an optimizer
keras.optimizers.SGD(clipvalue=1.0)

<keras.optimizer_v2.gradient_descent.SGD at 0x1f279cb1f60>

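> If clipvalue (float) is set, every component of the gradient is clipped to lie between -clipvalue and clipvalue. If clipnorm is set instead, the gradient is rescaled whenever its ℓ2 norm exceeds the threshold, which preserves its direction.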
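
The difference between the two options is easiest to see on a small gradient vector. The NumPy sketch below is an illustration only: the helper names `clip_by_value` and `clip_by_norm` and the sample gradient are assumptions, and it does not reproduce how Keras applies clipping internally.

```python
import numpy as np

def clip_by_value(grad, clipvalue):
    # Clip every component independently to [-clipvalue, clipvalue]
    return np.clip(grad, -clipvalue, clipvalue)

def clip_by_norm(grad, clipnorm):
    # Rescale the whole vector if its L2 norm exceeds clipnorm
    norm = np.linalg.norm(grad)
    return grad * clipnorm / norm if norm > clipnorm else grad

grad = np.array([0.9, 100.0])
by_value = clip_by_value(grad, 1.0)
by_norm = clip_by_norm(grad, 1.0)
print(by_value)   # the direction changes: both components end up similar in size
print(by_norm)    # the direction is preserved: the first component becomes tiny
```

Clipping by value can change the orientation of the gradient vector, while clipping by norm keeps the orientation but may almost eliminate the smaller components.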

### Reusing Pretrained Layers