<a href="https://colab.research.google.com/github/RohanOpenSource/Deep-Learning-And-Beyond/blob/main/ExplodingAndVanishingGradients.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Training neural networks can run into several problems, most of which we can combat. The first of these is the exploding and vanishing gradient problem. During backpropagation, gradients can become smaller and smaller as they flow toward the lower layers, leaving the weights of those layers virtually unchanged, or they can become larger and larger, making training diverge. This commonly results from using the sigmoid activation function together with weights drawn from a standard normal distribution, a combination that makes the variance of each layer's outputs much larger than the variance of its inputs until the sigmoid saturates. To prevent this, the variance of each layer's outputs should be similar to the variance of its inputs. To achieve this, we can initialize the weights randomly with a variance scaled to the layer's number of inputs and outputs (fan-in and fan-out). This is called Glorot initialization; He initialization is a variant that scales by fan-in only and suits ReLU and its variants. Keras uses Glorot initialization with a uniform distribution by default.
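
The variance argument above can be checked numerically without Keras. The following is a minimal NumPy sketch, not the Keras implementation: the helper names (`naive_normal`, `glorot_uniform`, `he_normal`, `rms_after`), the layer width of 256, and the depth of 10 are all assumptions chosen for illustration. It pushes random data through a stack of bias-free dense layers and reports the root-mean-square activation at the end.

```python
import numpy as np

def naive_normal(fan_in, fan_out, rng):
    # Standard normal weights: variance 1 regardless of layer size.
    return rng.normal(0.0, 1.0, size=(fan_in, fan_out))

def glorot_uniform(fan_in, fan_out, rng):
    # Glorot uniform: U(-limit, limit) with limit = sqrt(6 / (fan_in + fan_out)).
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))

def he_normal(fan_in, fan_out, rng):
    # He normal: variance 2 / fan_in. Keras draws from a truncated normal;
    # a plain normal is used here to keep the sketch short.
    return rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_in, fan_out))

def rms_after(init_fn, activation, n_layers=10, width=256, seed=0):
    """Root-mean-square activation after n_layers bias-free dense layers."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=(1000, width))
    for _ in range(n_layers):
        x = activation(x @ init_fn(width, width, rng))
    return float(np.sqrt(np.mean(x ** 2)))

def identity(z):
    return z

def relu(z):
    return np.maximum(z, 0.0)

rms_naive = rms_after(naive_normal, identity)
rms_glorot = rms_after(glorot_uniform, identity)
rms_he = rms_after(he_normal, relu)

print(f"standard normal, linear layers: {rms_naive:.3e}")
print(f"Glorot uniform, linear layers:  {rms_glorot:.3f}")
print(f"He normal, ReLU layers:         {rms_he:.3f}")
```

With standard normal weights the signal grows by roughly a factor of 16 (the square root of the width) per layer, while the scaled initializers keep it close to its starting scale.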

In [None]:
import tensorflow as tf
import numpy as np
from sklearn.datasets import load_iris

In [None]:
iris = load_iris()  # you know the drill, we're using iris
X = iris.data[:, 2:]  # keep only petal length and petal width
y = iris.target  # class labels 0, 1, 2

In [None]:
model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="sigmoid", kernel_initializer=tf.keras.initializers.GlorotUniform()),  # sigmoid saturates for large inputs, so it can still cause vanishing gradients
    tf.keras.layers.Dense(64, activation="relu"),  # relu isn't perfect either: some neurons can get stuck always outputting 0, making them useless
    tf.keras.layers.Dense(32, activation="leaky_relu"),  # rather than zeroing negative inputs, leaky relu scales them by a small slope, so no neuron goes completely dead
    tf.keras.layers.Dense(3, activation="softmax")
])

model.build([1, 2])  # input shape: batch size 1, 2 features (petal length and width)
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense (Dense)                (1, 128)                  384       
_________________________________________________________________
dense_1 (Dense)              (1, 64)                   8256      
_________________________________________________________________
dense_2 (Dense)              (1, 32)                   2080      
_________________________________________________________________
dense_3 (Dense)              (1, 3)                    99        
Total params: 10,819
Trainable params: 10,819
Non-trainable params: 0
_________________________________________________________________
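
The comments in the model above about sigmoid, ReLU, and leaky ReLU can be illustrated by looking at the derivative of each activation, since backpropagation multiplies these derivatives layer by layer. This is a small NumPy sketch under stated assumptions: the function names are illustrative, and the `negative_slope` of 0.2 is an assumed value, not necessarily the one your Keras version uses for `"leaky_relu"`.

```python
import numpy as np

def sigmoid_grad(z):
    # Derivative of the sigmoid: s * (1 - s), which peaks at 0.25 when z = 0.
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)

def relu_grad(z):
    # ReLU passes gradient 1 for positive inputs and 0 otherwise.
    return (z > 0).astype(float)

def leaky_relu_grad(z, negative_slope=0.2):
    # Leaky ReLU keeps a small, nonzero gradient for negative inputs.
    # negative_slope=0.2 is an assumption for illustration.
    return np.where(z > 0, 1.0, negative_slope)

z = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
print("sigmoid grad:   ", sigmoid_grad(z))
print("relu grad:      ", relu_grad(z))
print("leaky relu grad:", leaky_relu_grad(z))

# Even in the best case (every pre-activation exactly 0), ten stacked
# sigmoid layers scale the gradient by 0.25 ** 10.
best_case = sigmoid_grad(0.0) ** 10
print("best-case factor through 10 sigmoid layers:", best_case)
```

The sigmoid derivative never exceeds 0.25, so a deep stack of sigmoid layers shrinks gradients quickly, while ReLU blocks the gradient entirely for negative inputs and leaky ReLU lets a fraction of it through.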


In [None]:
model.compile(loss="sparse_categorical_crossentropy", optimizer="adam", metrics=["accuracy"])
model.fit(X, y, epochs=30)

Epoch 1/30
Epoch 2/30
Epoch 3/30
Epoch 4/30
Epoch 5/30
Epoch 6/30
Epoch 7/30
Epoch 8/30
Epoch 9/30
Epoch 10/30
Epoch 11/30
Epoch 12/30
Epoch 13/30
Epoch 14/30
Epoch 15/30
Epoch 16/30
Epoch 17/30
Epoch 18/30
Epoch 19/30
Epoch 20/30
Epoch 21/30
Epoch 22/30
Epoch 23/30
Epoch 24/30
Epoch 25/30
Epoch 26/30
Epoch 27/30
Epoch 28/30
Epoch 29/30
Epoch 30/30


<keras.callbacks.History at 0x7fe6d8d16d10>

Sometimes, using variants of ReLU together with Glorot or He initialization is not enough to prevent the gradients from vanishing or exploding. In this case, batch normalization can help. A batch normalization layer zero-centers and normalizes its inputs (the outputs of the layer before it) over each mini-batch, then scales and shifts the result with two learned parameter vectors. This keeps the scale of the activations consistent between layers and, if used as the first layer, can remove the need to standardize the data beforehand.
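
To make the operation concrete, here is a minimal NumPy sketch of the training-time forward pass of batch normalization. It is an illustration, not the Keras layer: the function name `batch_norm_forward` and the toy data are assumptions, and the moving averages that Keras maintains for inference are left out.

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-3):
    """Training-time batch normalization over the batch axis (axis 0)."""
    mean = x.mean(axis=0)                    # per-feature batch mean
    var = x.var(axis=0)                      # per-feature batch variance
    x_hat = (x - mean) / np.sqrt(var + eps)  # zero-center and normalize
    return gamma * x_hat + beta              # learned scale and shift

rng = np.random.default_rng(0)
# Toy batch: 32 samples, 4 features, deliberately far from zero mean / unit variance.
x = rng.normal(loc=120.0, scale=50.0, size=(32, 4))

out = batch_norm_forward(x, gamma=np.ones(4), beta=np.zeros(4))
print("input mean: ", x.mean(axis=0))
print("output mean:", out.mean(axis=0))
print("output std: ", out.std(axis=0))
```

At inference time Keras replaces the batch statistics with moving averages of the mean and variance collected during training. Those two vectors per layer are not updated by backpropagation, which is why a model with batch normalization layers over 784 and 300 features reports 2 × (784 + 300) = 2,168 non-trainable parameters.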

In [None]:
fashion_mnist = tf.keras.datasets.fashion_mnist
(X_train_full, y_train_full), (X_test, y_test) = fashion_mnist.load_data()
# Hold out the first 5,000 images for validation and scale pixels from [0, 255] to [0, 1]
X_valid, X_train = X_train_full[:5000] / 255.0, X_train_full[5000:] / 255.0
y_valid, y_train = y_train_full[:5000], y_train_full[5000:]
X_test = X_test / 255.0

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/train-labels-idx1-ubyte.gz
Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/train-images-idx3-ubyte.gz
Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/t10k-labels-idx1-ubyte.gz
Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/t10k-images-idx3-ubyte.gz


In [None]:
model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape=[28, 28]),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Dense(300, activation="elu", kernel_initializer="he_normal"),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Dense(10, activation="softmax")
])

model.summary()

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
flatten (Flatten)            (None, 784)               0         
_________________________________________________________________
batch_normalization (BatchNo (None, 784)               3136      
_________________________________________________________________
dense_4 (Dense)              (None, 300)               235500    
_________________________________________________________________
batch_normalization_1 (Batch (None, 300)               1200      
_________________________________________________________________
dense_5 (Dense)              (None, 10)                3010      
Total params: 242,846
Trainable params: 240,678
Non-trainable params: 2,168
_________________________________________________________________


In [None]:
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.fit(X_train, y_train, epochs=10, validation_data=(X_valid, y_valid))

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7fe6d0a38bd0>

Another way to mitigate exploding gradients is to clip the gradients during backpropagation so that they never exceed a given threshold. This is called gradient clipping. It does not help with vanishing gradients, since it only caps large values.
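
Keras optimizers accept both a `clipvalue` and a `clipnorm` argument, and the two behave differently. The NumPy sketch below shows the idea on a single gradient vector; the helper names `clip_by_value` and `clip_by_norm` and the example gradient are assumptions for illustration, not the optimizer's internal code.

```python
import numpy as np

def clip_by_value(grad, clip_value):
    # Clip every component independently to [-clip_value, clip_value].
    return np.clip(grad, -clip_value, clip_value)

def clip_by_norm(grad, max_norm):
    # Rescale the whole vector if its L2 norm exceeds max_norm.
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        return grad * (max_norm / norm)
    return grad

grad = np.array([0.9, 100.0])

by_value = clip_by_value(grad, 1.0)
by_norm = clip_by_norm(grad, 1.0)

print("clipped by value:", by_value)
print("clipped by norm: ", by_norm)
```

Clipping by value turns `[0.9, 100.0]` into `[0.9, 1.0]`, which changes the direction of the gradient. Clipping by norm shrinks both components by the same factor, so the direction is preserved but the small component becomes very small.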

In [None]:
optimizer = tf.keras.optimizers.SGD(clipvalue=1.0)  # clip every gradient component to [-1.0, 1.0]
model.compile(loss="sparse_categorical_crossentropy", optimizer=optimizer, metrics=["accuracy"])
model.fit(X_train, y_train, epochs=5, validation_data=(X_valid, y_valid))

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x7fe6d10cb610>