<a href="https://colab.research.google.com/github/RohanOpenSource/Deep-Learning-And-Beyond/blob/main/ExplodingAndVanishingGradients.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Training neural networks comes with a host of problems that are tricky to combat. The first of these is the exploding and vanishing gradient problem. During backpropagation, gradients can become progressively smaller as they flow toward the lower layers, leaving the weights of those layers virtually unchanged, or they can grow larger and larger until training diverges. To prevent this, we want the variance of each layer's outputs to be similar to the variance of its inputs, and likewise for the gradients flowing backward. To achieve this, we can randomly initialize the weights with a variance chosen from the layer's number of inputs and outputs. This is the basis of Glorot, He, and LeCun initialization; Keras uses Glorot uniform initialization by default.
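
To make the "constraints" concrete, here is a minimal NumPy sketch of Glorot uniform initialization. The helper name `glorot_uniform` and the layer sizes are just for illustration; the formula (sample from a uniform distribution on ±sqrt(6 / (fan_in + fan_out))) is the standard Glorot uniform rule, which gives the weights a variance of 2 / (fan_in + fan_out).

```python
import numpy as np

def glorot_uniform(fan_in, fan_out, rng):
    """Sample a (fan_in, fan_out) weight matrix from U(-limit, limit)."""
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))

rng = np.random.default_rng(0)
fan_in, fan_out = 128, 64          # illustrative layer sizes
W = glorot_uniform(fan_in, fan_out, rng)

# A uniform distribution on [-a, a] has variance a**2 / 3,
# so the target variance here is 2 / (fan_in + fan_out).
target_var = 2.0 / (fan_in + fan_out)
print("empirical variance:", W.var())
print("target variance:   ", target_var)
```

He and LeCun initialization follow the same idea but scale by `fan_in` only (variance 2 / fan_in and 1 / fan_in respectively).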

In [1]:
import tensorflow as tf
import numpy as np
from sklearn.datasets import load_iris

In [2]:
iris = load_iris()  # you know the drill, we're using iris
X = iris.data[:, 2:]  # keep only petal length and petal width
y = iris.target

In [3]:
model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="sigmoid", kernel_initializer="glorot_uniform"),
    tf.keras.layers.Dense(64, activation="sigmoid", kernel_initializer="glorot_uniform"),
    tf.keras.layers.Dense(64, activation="sigmoid", kernel_initializer="glorot_uniform"),
    tf.keras.layers.Dense(64, activation="sigmoid", kernel_initializer="glorot_uniform"),
    tf.keras.layers.Dense(64, activation="sigmoid", kernel_initializer="glorot_uniform"),
    tf.keras.layers.Dense(64, activation="sigmoid", kernel_initializer="glorot_uniform"),
    tf.keras.layers.Dense(3, activation="softmax") 
])
model.build([1, 2])
model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 dense (Dense)               (1, 128)                  384       
                                                                 
 dense_1 (Dense)             (1, 64)                   8256      
                                                                 
 dense_2 (Dense)             (1, 64)                   4160      
                                                                 
 dense_3 (Dense)             (1, 64)                   4160      
                                                                 
 dense_4 (Dense)             (1, 64)                   4160      
                                                                 
 dense_5 (Dense)             (1, 64)                   4160      
                                                                 
 dense_6 (Dense)             (1, 3)                    1

In [4]:
model.compile(loss="sparse_categorical_crossentropy", optimizer="adam", metrics=["accuracy"])
model.fit(X, y, epochs=30)

Epoch 1/30
Epoch 2/30
Epoch 3/30
Epoch 4/30
Epoch 5/30
Epoch 6/30
Epoch 7/30
Epoch 8/30
Epoch 9/30
Epoch 10/30
Epoch 11/30
Epoch 12/30
Epoch 13/30
Epoch 14/30
Epoch 15/30
Epoch 16/30
Epoch 17/30
Epoch 18/30
Epoch 19/30
Epoch 20/30
Epoch 21/30
Epoch 22/30
Epoch 23/30
Epoch 24/30
Epoch 25/30
Epoch 26/30
Epoch 27/30
Epoch 28/30
Epoch 29/30
Epoch 30/30


<keras.callbacks.History at 0x7f32ce1e5850>

As displayed above, changing the weight initialization is not enough to fix vanishing or exploding gradients. This is because activation functions can also be huge contributors to funky gradients.
 
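Sigmoid can cause vanishing gradients because it saturates: it squeezes its input into the range (0, 1), and its derivative is at most 0.25 and close to 0 for large positive or negative inputs. Multiplying many of these small derivatives together during backpropagation leaves insanely small gradients by the time they reach the lower layers.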
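
A quick NumPy check of how small the sigmoid derivative gets. This is only a sketch: it looks at the activation derivative alone and ignores the weight matrices, which also multiply into the real gradient. The helper names are made up for this example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    # derivative of the sigmoid: s(z) * (1 - s(z))
    s = sigmoid(z)
    return s * (1.0 - s)

peak = sigmoid_grad(0.0)         # the derivative is largest at z = 0
saturated = sigmoid_grad(10.0)   # nearly zero once the unit saturates

# Upper bound on the product of activation derivatives across
# 6 stacked sigmoid layers (the depth of the model above).
bound_six_layers = peak ** 6

print("peak derivative:     ", peak)
print("saturated derivative:", saturated)
print("bound over 6 layers: ", bound_six_layers)
```

Even in the best case, the sigmoid factor shrinks the gradient by at least 4x per layer, which is why the lower layers of a deep sigmoid network barely move.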

ReLU isn't perfect either, as it can lead to some neurons always outputting 0 (the "dying ReLU" problem). Once that happens their gradient is also 0, so they stop learning and become obsolete (just like pytorch).

Rather than mapping negative inputs to 0, leaky ReLU multiplies them by a small slope (alpha, set to 0.2 in the next model), so the gradient is never exactly 0 and no neurons become entirely useless (NO NEURONS LEFT BEHIND YAYYYY).
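
Here is a small NumPy comparison of the two activations, just to show the difference on negative inputs. The function names `relu_np` and `leaky_relu_np` are made up for this sketch; the Keras layers are what the notebook actually uses.

```python
import numpy as np

def relu_np(z):
    # negative inputs are mapped to 0
    return np.maximum(0.0, z)

def leaky_relu_np(z, alpha=0.2):
    # negative inputs are scaled by alpha instead of being zeroed
    return np.where(z > 0, z, alpha * z)

z = np.array([-2.0, -0.5, 0.0, 1.5])
relu_out = relu_np(z)
leaky_out = leaky_relu_np(z)

print("relu:      ", relu_out)
print("leaky relu:", leaky_out)
```

For the negative inputs, ReLU returns 0 (and has gradient 0), while leaky ReLU returns `alpha * z` (and has gradient `alpha`), so a weight update can still pull the neuron back.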

In [5]:
leaky_relu = tf.keras.layers.LeakyReLU(alpha=0.2)
model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation=leaky_relu, kernel_initializer="he_normal"), 
    tf.keras.layers.Dense(64, activation=leaky_relu, kernel_initializer="he_normal"), 
    tf.keras.layers.Dense(64, activation=leaky_relu, kernel_initializer="he_normal"), 
    tf.keras.layers.Dense(64, activation=leaky_relu, kernel_initializer="he_normal"), 
    tf.keras.layers.Dense(64, activation=leaky_relu, kernel_initializer="he_normal"), 
    tf.keras.layers.Dense(64, activation=leaky_relu, kernel_initializer="he_normal"), 
    tf.keras.layers.Dense(3, activation="softmax")
])


In [6]:
model.compile(loss="sparse_categorical_crossentropy", optimizer="adam", metrics=["accuracy"])
model.fit(X, y, epochs=60)

Epoch 1/60
Epoch 2/60
Epoch 3/60
Epoch 4/60
Epoch 5/60
Epoch 6/60
Epoch 7/60
Epoch 8/60
Epoch 9/60
Epoch 10/60
Epoch 11/60
Epoch 12/60
Epoch 13/60
Epoch 14/60
Epoch 15/60
Epoch 16/60
Epoch 17/60
Epoch 18/60
Epoch 19/60
Epoch 20/60
Epoch 21/60
Epoch 22/60
Epoch 23/60
Epoch 24/60
Epoch 25/60
Epoch 26/60
Epoch 27/60
Epoch 28/60
Epoch 29/60
Epoch 30/60
Epoch 31/60
Epoch 32/60
Epoch 33/60
Epoch 34/60
Epoch 35/60
Epoch 36/60
Epoch 37/60
Epoch 38/60
Epoch 39/60
Epoch 40/60
Epoch 41/60
Epoch 42/60
Epoch 43/60
Epoch 44/60
Epoch 45/60
Epoch 46/60
Epoch 47/60
Epoch 48/60
Epoch 49/60
Epoch 50/60
Epoch 51/60
Epoch 52/60
Epoch 53/60
Epoch 54/60
Epoch 55/60
Epoch 56/60
Epoch 57/60
Epoch 58/60
Epoch 59/60
Epoch 60/60


<keras.callbacks.History at 0x7f32ce15f510>

Sometimes, EVEN ALL OF THIS NONSENSE IS NOT ENOUGH TO END THE WRATH OF THE FUNKY GRADIENT. In such cases, Batch Normalization can help. A batch normalization layer zero-centers and normalizes each mini-batch of its inputs, then scales and shifts the result with two learned parameters per feature, allowing the model to learn the optimal scale and mean for each layer's inputs. This keeps the distribution of each layer's inputs more stable during training, and if used as the first layer, it can remove the need to standardize the data beforehand.
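
As a rough sketch of what a batch normalization layer computes during training, here is the forward pass in NumPy. The function name `batch_norm_train` and the toy data are assumptions for illustration; `gamma` and `beta` stand for the learned scale and shift, and `eps` is a small constant for numerical stability. At inference time Keras uses moving averages of the mean and variance instead of the batch statistics, which this sketch leaves out.

```python
import numpy as np

def batch_norm_train(x, gamma, beta, eps=1e-3):
    """Training-time batch norm over a (batch, features) array."""
    mean = x.mean(axis=0)                     # per-feature batch mean
    var = x.var(axis=0)                       # per-feature batch variance
    x_hat = (x - mean) / np.sqrt(var + eps)   # zero-center and normalize
    return gamma * x_hat + beta               # learned scale and shift

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=3.0, size=(32, 4))   # toy mini-batch

out = batch_norm_train(x, gamma=np.ones(4), beta=np.zeros(4))
print("input mean/std: ", x.mean(axis=0), x.std(axis=0))
print("output mean/std:", out.mean(axis=0), out.std(axis=0))
```

With `gamma = 1` and `beta = 0` the output has roughly zero mean and unit standard deviation per feature; training then adjusts `gamma` and `beta` if some other scale works better.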

In [7]:
fashion_mnist = tf.keras.datasets.fashion_mnist
(X_train_full, y_train_full), (X_test, y_test) = fashion_mnist.load_data()
X_valid, X_train = X_train_full[:5000] /  255.0, X_train_full[5000:] / 255.0
y_valid, y_train = y_train_full[:5000], y_train_full[5000:]
X_test = X_test / 255.0

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/train-labels-idx1-ubyte.gz
Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/train-images-idx3-ubyte.gz
Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/t10k-labels-idx1-ubyte.gz
Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/t10k-images-idx3-ubyte.gz


In [8]:
model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape=[28, 28]),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Dense(50, activation="elu", kernel_initializer="he_normal"),  # ELU acts like ReLU for x >= 0 and like a scaled exponential, alpha * (exp(x) - 1), for x < 0; it's great but slower to compute
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Dense(50, activation="elu", kernel_initializer="he_normal"), 
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Dense(50, activation="elu", kernel_initializer="he_normal"), 
    tf.keras.layers.Dense(10, activation="softmax")
])

model.summary()

Model: "sequential_2"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 flatten (Flatten)           (None, 784)               0         
                                                                 
 batch_normalization (BatchN  (None, 784)              3136      
 ormalization)                                                   
                                                                 
 dense_14 (Dense)            (None, 50)                39250     
                                                                 
 batch_normalization_1 (Batc  (None, 50)               200       
 hNormalization)                                                 
                                                                 
 dense_15 (Dense)            (None, 50)                2550      
                                                                 
 batch_normalization_2 (Batc  (None, 50)              

In [9]:
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.fit(X_train, y_train, epochs=10, validation_data=(X_valid, y_valid))  # most reasonable people would use more epochs but I've got romance novels to read so shut up

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7f32cdfd4990>

Now what would be the fun

One can avoid exploding gradients by clipping the gradients during backpropagation so that they never exceed some threshold. This is called gradient clipping and is quite effective and easy to use.
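
To see what clipping does to a gradient vector, here is a NumPy sketch of the two common variants: clipping each component by value (what `clipvalue` does in Keras optimizers) and rescaling the whole vector by its norm (what `clipnorm` does). The helper names and the example gradient are made up for illustration.

```python
import numpy as np

def clip_by_value(grad, clip):
    # clamp every component to the range [-clip, clip]
    return np.clip(grad, -clip, clip)

def clip_by_norm(grad, max_norm):
    # rescale the whole vector if its L2 norm exceeds max_norm
    norm = np.linalg.norm(grad)
    if norm <= max_norm:
        return grad
    return grad * (max_norm / norm)

g = np.array([0.9, 100.0])   # toy gradient with one exploding component

by_value = clip_by_value(g, 1.0)
by_norm = clip_by_norm(g, 1.0)

print("clipped by value:", by_value)
print("clipped by norm: ", by_norm)
```

Clipping by value can change the direction of the gradient (here both components end up almost equal), while clipping by norm keeps the direction and only shrinks the length. Which one works better is something to try out for the problem at hand.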

In [10]:
optimizer = tf.keras.optimizers.SGD(clipvalue=1.0)  # clip every gradient component to [-1.0, 1.0]
model.compile(loss = "sparse_categorical_crossentropy", optimizer=optimizer, metrics=["accuracy"])
model.fit(X_train, y_train, epochs=5, validation_data=(X_valid, y_valid))

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x7f32d1b95f50>

That's all for this notebook. I'm eternally grateful to finally be back in the game after a year-long hiatus caused by school and poor time management. I realized that pursuing coding is a possible side activity and is more important to me than doing well in school (even though I am and will probably continue to).