
In [1]:
from tensorflow import keras

In [2]:
mnist = keras.datasets.mnist
(X_train_full, y_train_full), (X_test, y_test) = mnist.load_data()


In [3]:
X_train_full[0]

In [4]:
# Hold out the first 5,000 images for validation and scale pixel values from 0-255 to 0-1
X_valid, X_train = X_train_full[:5000]/255, X_train_full[5000:]/255
y_valid, y_train = y_train_full[:5000], y_train_full[5000:]
X_test = X_test/255

# Changing Initializers and Activation Functions

In [11]:
model = keras.Sequential([
    keras.layers.Flatten(input_shape=[28,28]),
    keras.layers.Dense(300, activation="relu", kernel_initializer="he_normal"),
    keras.layers.Dense(100, activation="selu", kernel_initializer="lecun_normal"),
    keras.layers.Dense(10, activation="softmax")
])



In [12]:
model.summary()
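
A rough NumPy sketch of why the initializer matters in a deep ReLU stack. It is an illustration, not Keras code: the helper `forward_mean_square`, the layer width, the depth, and the batch size are all assumptions made for the demo. It samples weights from a plain normal distribution with standard deviation sqrt(2 / fan_in), whereas Keras's `he_normal` draws from a truncated normal with that scale.

```python
import numpy as np

def forward_mean_square(weight_std_fn, n_layers=20, width=512, batch=256, seed=0):
    """Push random inputs through a stack of ReLU layers (hypothetical helper).

    Returns the mean squared activation of the input and after each layer.
    """
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((batch, width))
    history = [float(np.mean(x ** 2))]
    for _ in range(n_layers):
        # Weight scale is chosen by the supplied function of fan_in
        w = rng.standard_normal((width, width)) * weight_std_fn(width)
        x = np.maximum(x @ w, 0.0)  # ReLU
        history.append(float(np.mean(x ** 2)))
    return history

# He-style scale: std = sqrt(2 / fan_in)
he = forward_mean_square(lambda fan_in: np.sqrt(2.0 / fan_in))
# Arbitrary small fixed scale, chosen only for contrast
small = forward_mean_square(lambda fan_in: 0.01)

print(f"He-style init: first={he[0]:.3f}, last={he[-1]:.3f}")
print(f"std=0.01 init: first={small[0]:.3f}, last={small[-1]:.3e}")
```

With the He-style scale the mean squared activation stays on the same order of magnitude across the 20 layers, while the fixed 0.01 scale shrinks it toward zero. That collapsing signal is the forward-pass counterpart of a vanishing gradient.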

# Batch Normalization

In [13]:
model = keras.Sequential([
    keras.layers.Flatten(input_shape=[28,28]),
    keras.layers.Dense(200, activation="relu"),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(100, activation="relu"),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(10, activation="softmax")
    ])



In [14]:
model.summary()

In the model above, batch normalization is applied after the activation function, i.e., batch_norm(activation(x\*w + b)). The authors of the batch normalization technique instead apply it before the activation function, i.e., activation(batch_norm(x\*w + b)), arguing that this ordering works better. The next model follows that ordering; which placement gives the more accurate model can depend on the task, so it is worth trying both.

In [15]:
model = keras.Sequential([
    keras.layers.Flatten(input_shape=[28,28]),

    # 1st hidden layer
    keras.layers.Dense(200, use_bias=False),  # no activation here; no bias needed either, because BatchNormalization adds its own learnable offset (beta) that acts like one
    keras.layers.BatchNormalization(),
    keras.layers.Activation("relu"),  # the activation goes here instead, after batch normalization

    # 2nd hidden layer
    keras.layers.Dense(100, use_bias=False),  # same pattern: no activation, no bias
    keras.layers.BatchNormalization(),
    keras.layers.Activation("relu"),

    #Output layer
    keras.layers.Dense(10, activation="softmax")
    ])



In [16]:
model.summary()
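
The two points made about this model (the bias is redundant in front of batch normalization, and the ordering changes what the next layer sees) can be checked with a small NumPy sketch. The `batch_norm` function below is a hypothetical training-mode stand-in, not the Keras layer: it standardizes each feature over the batch with fixed `gamma=1` and `beta=0`, and the array shapes are arbitrary.

```python
import numpy as np

def batch_norm(z, gamma=1.0, beta=0.0, eps=1e-3):
    """Training-mode batch norm sketch: standardize each feature over the
    batch, then scale by gamma and shift by beta."""
    mean = z.mean(axis=0)
    var = z.var(axis=0)
    return gamma * (z - mean) / np.sqrt(var + eps) + beta

rng = np.random.default_rng(0)
x = rng.standard_normal((32, 8))  # batch of 32 samples with 8 features
w = rng.standard_normal((8, 4))   # weights of a hypothetical 4-unit layer
b = rng.standard_normal(4)        # bias of that layer

# Subtracting the batch mean cancels any constant bias added beforehand
with_bias = batch_norm(x @ w + b)
without_bias = batch_norm(x @ w)
print("bias is redundant:", np.allclose(with_bias, without_bias))

# The two orderings produce differently distributed outputs
bn_then_relu = np.maximum(batch_norm(x @ w), 0.0)  # activation(batch_norm(x*w))
relu_then_bn = batch_norm(np.maximum(x @ w, 0.0))  # batch_norm(activation(x*w))
print("BN -> ReLU minimum:", bn_then_relu.min())
print("ReLU -> BN per-feature means:", relu_then_bn.mean(axis=0))
```

Because the mean subtraction removes `b`, dropping it with `use_bias=False` saves parameters without changing what the layer can represent. The second part shows the practical difference between the orderings: BN followed by ReLU yields non-negative outputs, while ReLU followed by BN yields zero-mean outputs that include negative values.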

# Gradient Clipping

In [17]:
# clipnorm rescales a gradient whose L2 norm exceeds the threshold, so the relative scale between its components is preserved.
# clipvalue instead caps each gradient component independently, e.g. clipvalue=0.5 constrains every component to [-0.5, 0.5], which can change the gradient's direction.
opt = keras.optimizers.Adam(learning_rate=0.01, clipnorm=1.0)
# The labels are integer class IDs (0-9), not one-hot vectors, so use the sparse variant of the loss
model.compile(loss='sparse_categorical_crossentropy', optimizer=opt, metrics=['accuracy'])
# or
# model.compile(loss='sparse_categorical_crossentropy', optimizer=keras.optimizers.Adam(clipnorm=1.0))
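
To make the difference between the two clipping modes concrete, here is a minimal NumPy sketch. The functions `clip_by_value` and `clip_by_norm` are hypothetical helpers written for this illustration, operating on a single made-up gradient vector; they mimic the idea behind the `clipvalue` and `clipnorm` arguments rather than reproducing the optimizer's internals.

```python
import numpy as np

def clip_by_value(grad, clip_value):
    """Cap every component independently to [-clip_value, clip_value]."""
    return np.clip(grad, -clip_value, clip_value)

def clip_by_norm(grad, clip_norm):
    """Rescale the whole vector if its L2 norm exceeds clip_norm."""
    norm = np.linalg.norm(grad)
    if norm <= clip_norm:
        return grad
    return grad * (clip_norm / norm)

grad = np.array([0.9, 100.0])  # made-up gradient with one very large component

by_value = clip_by_value(grad, 1.0)
by_norm = clip_by_norm(grad, 1.0)

print("original ratio:", grad[0] / grad[1])
print("clip by value :", by_value, "ratio:", by_value[0] / by_value[1])
print("clip by norm  :", by_norm, "ratio:", by_norm[0] / by_norm[1])
```

Clipping by value turns the gradient into `[0.9, 1.0]`, so the two components end up almost equal and the update direction changes substantially. Clipping by norm shrinks both components by the same factor, so their ratio, and therefore the direction, is unchanged.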