# Exercises

1. Is it OK to initialise all the weights to the same value as long as that value is selected randomly using He initialisation?
2. Is it OK to initialise the bias terms to 0?
3. Name three advantages of the SELU activation function over ReLU.
4. In which cases would you want to use each of the following activation functions: SELU, leaky ReLU (& its variants), ReLU, tanh, logistic, & softmax?
5. What may happen if you set the `momentum` hyperparameter too close to 1 (e.g., 0.99999) when using an `SGD` optimizer?
6. Name three ways you can produce a sparse model.
7. Does dropout slow down training? Does it slow down inference (i.e., making predictions on new instances)? What about MC Dropout?
8. Practice training a deep neural network on the CIFAR10 image dataset:
   - Build a DNN with 20 hidden layers of 100 neurons each (that's too many, but it's the point of this exercise). Use He initialisation & the ELU activation function.
   - Using Nadam optimisation & early stopping, train the network on the CIFAR10 dataset. You can load it with `keras.datasets.cifar10.load_data()`. The dataset is composed of 60,000 32 × 32-pixel colour images (50,000 for training, 10,000 for testing) with 10 classes, so you'll need a softmax output layer with 10 neurons. Remember to search for the right learning rate each time you change the model's architecture or hyperparameters.
   - Now try adding batch normalisation & compare the learning curves: Is it converging faster than before? Does it produce a better model? How does it affect training speed?
   - Try replacing batch normalisation with SELU, & make the necessary adjustments to ensure the network self-normalises (i.e., standardise the input features, use LeCun normal initialisation, make sure the DNN contains only a sequence of dense layers, etc.).
   - Try regularising the model with alpha dropout. Then, without retraining your model, see if you can achieve better accuracy using MC dropout.
   - Retrain your model using 1cycle scheduling & see if it improves training speed & model accuracy.

---

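1. No. All weights should be initialised independently. If every weight starts at the same value, every neuron in a layer computes the same output & receives the same gradient, so backpropagation can never break the symmetry, no matter how that shared value was chosen.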
2. Yes. It doesn't make much of a difference, because backpropagation will adjust the bias terms anyway.
3. SELU avoids the *dying ReLUs* problem because it has a nonzero derivative for negative inputs. It can also output negative values, so the mean output of each layer is closer to 0, which helps alleviate vanishing gradients. Finally, under a strict set of conditions (a plain sequential stack of dense layers, LeCun normal initialisation, standardised inputs), a SELU network self-normalises, which addresses the exploding & vanishing gradients problems.
4. SELU is a good default activation function for deep nets. Leaky ReLU & its variants are great for their training speed & address the *dying ReLUs* problem. ReLU is often preferred for its simplicity. The hyperbolic tangent can be useful because it outputs values between -1 & 1, although it is mainly used in recurrent nets. Logistic & softmax are useful in the output layer for estimating class probabilities: logistic for binary classification, softmax for multiclass classification.
5. You run the risk of overshooting the minimum: the optimiser picks up so much speed that it shoots past the minimum, turns around, & shoots past it again in the opposite direction. It may oscillate like this many times before it finally settles at the optimum, so training takes much longer than with a smaller `momentum` value.
6. You can train the model normally & then zero out the weights that fall below a certain threshold. You can apply $\ell_1$ regularisation during training, which pushes the optimiser to zero out as many weights as it can. Or you can use the TensorFlow Model Optimization Toolkit.
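7. Dropout does slow down training, but it does not slow down inference, because it is only active during training. MC Dropout behaves exactly like dropout during training, but it stays active at inference & averages several stochastic predictions, so inference is slower by roughly the number of predictions averaged.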
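
Answer 5 is easier to believe with a toy example. The sketch below is not Keras code: it runs plain momentum gradient descent on the one-dimensional function f(x) = x², using the usual update (velocity = momentum × velocity − learning rate × gradient). The function name `run_momentum_descent`, the learning rate of 0.1, the tolerance, & the step cap are all assumptions picked for illustration.

```python
def run_momentum_descent(momentum, learning_rate=0.1, tol=1e-3, max_steps=100_000):
    """Momentum gradient descent on f(x) = x**2, starting from x = 1.

    Returns (converged, steps_taken, sign_changes), where sign_changes counts
    how many times x jumped over the minimum at x = 0.
    """
    x, velocity = 1.0, 0.0
    sign_changes = 0
    for step in range(1, max_steps + 1):
        gradient = 2.0 * x                                  # derivative of x**2
        velocity = momentum * velocity - learning_rate * gradient
        new_x = x + velocity
        if new_x * x < 0:                                   # crossed the minimum
            sign_changes += 1
        x = new_x
        if abs(x) < tol and abs(velocity) < tol:            # settled near x = 0
            return True, step, sign_changes
    return False, max_steps, sign_changes


for momentum in (0.5, 0.9, 0.99999):
    converged, steps, crossings = run_momentum_descent(momentum)
    print(f"momentum={momentum}: converged={converged}, "
          f"steps={steps}, crossings of the minimum={crossings}")
```

With the two smaller momentum values the run settles quickly, whereas with 0.99999 it is still swinging back & forth across the minimum when the step cap is reached.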

# 8a.

In [2]:
import tensorflow as tf
from tensorflow import keras

model1 = keras.models.Sequential()
model1.add(keras.layers.Flatten(input_shape = [32, 32, 3]))
for _ in range(20):
    model1.add(keras.layers.Dense(100, activation = "elu", kernel_initializer = "he_normal"))
model1.add(keras.layers.Dense(10, activation = "softmax"))

# 8b.

After rerunning the code below with learning rates of 1e-5, 5e-5, 1e-4, 5e-4, & 1e-3, I found that 1e-4 gave the highest accuracy. There's got to be a better way to do this: I don't want to rerun the code over & over just to find the best hyperparameters.
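
One low-tech way to avoid the manual reruns is to put the loop in code. The sketch below is only an outline under assumptions: `pick_best_learning_rate` & `stand_in_score` are hypothetical names, & the stand-in scoring function is made up so that the block runs on its own. It does not train anything & its scores are not results.

```python
import math


def pick_best_learning_rate(candidates, train_and_score):
    """Score every candidate learning rate & return (best_rate, all_scores).

    train_and_score is any callable that takes a learning rate & returns a
    number where higher is better (e.g. validation accuracy).
    """
    scores = {rate: train_and_score(rate) for rate in candidates}
    best_rate = max(scores, key=scores.get)
    return best_rate, scores


def stand_in_score(rate):
    # NOT a real training run: an invented score that peaks at 1e-4, used
    # only so that this sketch is runnable without TensorFlow or the data.
    return -abs(math.log10(rate) - math.log10(1e-4))


best_rate, scores = pick_best_learning_rate(
    [1e-5, 5e-5, 1e-4, 5e-4, 1e-3], stand_in_score)
print("best learning rate:", best_rate)
```

In the notebook, `stand_in_score` would be replaced by a function that rebuilds the model, compiles it with `keras.optimizers.Nadam(learning_rate = rate)`, fits it, & returns the validation accuracy from `model.evaluate(X_val, y_val)`. Rebuilding the model inside that function matters, otherwise each candidate would start from the previous candidate's weights.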

In [3]:
nadam_optimisation = keras.optimizers.Nadam(learning_rate = 1e-4, beta_1 = 0.9, beta_2 = 0.999)
checkpoint_callback = keras.callbacks.ModelCheckpoint("my_best_model.h5", save_best_only = True)
early_stopping_callback = keras.callbacks.EarlyStopping(patience = 10, restore_best_weights = True)

((X_train, y_train), (X_test, y_test)) = keras.datasets.cifar10.load_data()
X_val, X_train = X_train[:5000] / 255.0, X_train[5000:] / 255.0
y_val, y_train = y_train[:5000], y_train[5000:]
X_test = X_test / 255.0

model1.compile(loss = "sparse_categorical_crossentropy",
              optimizer = nadam_optimisation,
              metrics = ["accuracy"])
model1.fit(X_train, y_train, epochs = 30,
           validation_data = (X_val, y_val),
           callbacks = [checkpoint_callback, early_stopping_callback])

Epoch 1/30
Epoch 2/30
Epoch 3/30
Epoch 4/30
Epoch 5/30
Epoch 6/30
Epoch 7/30
Epoch 8/30
Epoch 9/30
Epoch 10/30
Epoch 11/30
Epoch 12/30
Epoch 13/30
Epoch 14/30
Epoch 15/30
Epoch 16/30
Epoch 17/30
Epoch 18/30
Epoch 19/30
Epoch 20/30
Epoch 21/30
Epoch 22/30
Epoch 23/30
Epoch 24/30
Epoch 25/30
Epoch 26/30
Epoch 27/30


<keras.callbacks.History at 0x7fbd0dd26d60>

In [4]:
model1 = keras.models.load_model("my_best_model.h5")
model1.evaluate(X_val, y_val)



[1.426145076751709, 0.5005999803543091]

# 8c.

After changing the architecture & rerunning the code below with learning rates of 5e-5, 1e-4, 5e-4, & 1e-3, I found that a learning rate of 1e-4 performed best.

In [5]:
keras.backend.clear_session()

model2 = keras.models.Sequential()
model2.add(keras.layers.Flatten(input_shape = [32, 32, 3]))
model2.add(keras.layers.BatchNormalization())
for _ in range(20):
    model2.add(keras.layers.Dense(100, kernel_initializer = "he_normal"))
    model2.add(keras.layers.BatchNormalization())
    model2.add(keras.layers.Activation("elu"))
model2.add(keras.layers.Dense(10, activation = "softmax"))

In [6]:
nadam_optimisation = keras.optimizers.Nadam(learning_rate = 1e-4, beta_1 = 0.9, beta_2 = 0.999)
checkpoint_callback = keras.callbacks.ModelCheckpoint("my_best_batch_normalised_model.h5", save_best_only = True)

model2.compile(loss = "sparse_categorical_crossentropy",
              optimizer = nadam_optimisation,
              metrics = ["accuracy"])
model2.fit(X_train, y_train, epochs = 30,
           validation_data = (X_val, y_val),
           callbacks = [checkpoint_callback, early_stopping_callback])

Epoch 1/30
Epoch 2/30
Epoch 3/30
Epoch 4/30
Epoch 5/30
Epoch 6/30
Epoch 7/30
Epoch 8/30
Epoch 9/30
Epoch 10/30
Epoch 11/30
Epoch 12/30
Epoch 13/30
Epoch 14/30
Epoch 15/30
Epoch 16/30
Epoch 17/30
Epoch 18/30
Epoch 19/30
Epoch 20/30
Epoch 21/30
Epoch 22/30
Epoch 23/30
Epoch 24/30
Epoch 25/30
Epoch 26/30
Epoch 27/30
Epoch 28/30
Epoch 29/30


<keras.callbacks.History at 0x7fbc832f0d90>

In [7]:
model2 = keras.models.load_model("my_best_batch_normalised_model.h5")
model2.evaluate(X_val, y_val)



[1.344652771949768, 0.527400016784668]

Convergence is definitely faster. The second model takes fewer epochs to reach the first model's lowest validation loss, & it then goes on to reach an even lower validation loss, hitting its minimum at its 24th epoch.

The second model is also more accurate than the first: 52.74% vs. 50.06% validation accuracy.

It took longer to train, because of the extra computations that the batch normalisation layers add.

# 8d.

After changing the architecture & rerunning the code below with learning rates of 5e-5, 1e-4, 5e-4, & 1e-3, I found that a learning rate of 5e-4 produced the best model.

In [8]:
keras.backend.clear_session()

model3 = keras.models.Sequential()
model3.add(keras.layers.Flatten(input_shape = [32, 32, 3]))
for _ in range(20):
    model3.add(keras.layers.Dense(100, activation = "selu", kernel_initializer = "lecun_normal"))
model3.add(keras.layers.Dense(10, activation = "softmax"))

nadam_optimisation = keras.optimizers.Nadam(learning_rate = 5e-4, beta_1 = 0.9, beta_2 = 0.999)
checkpoint_callback = keras.callbacks.ModelCheckpoint("my_best_selu_model.h5", save_best_only = True)

# Need to standardise data for SELU
X_mean = X_train.mean(axis = 0)
X_std = X_train.std(axis = 0)
X_train_scaled = (X_train - X_mean) / X_std
X_val_scaled = (X_val - X_mean) / X_std
X_test_scaled = (X_test - X_mean) / X_std

In [9]:
model3.compile(loss = "sparse_categorical_crossentropy",
               optimizer = nadam_optimisation,
               metrics = ["accuracy"])
model3.fit(X_train_scaled, y_train, epochs = 30,
           validation_data = (X_val_scaled, y_val),
           callbacks = [checkpoint_callback, early_stopping_callback])

Epoch 1/30
Epoch 2/30
Epoch 3/30
Epoch 4/30
Epoch 5/30
Epoch 6/30
Epoch 7/30
Epoch 8/30
Epoch 9/30
Epoch 10/30
Epoch 11/30
Epoch 12/30
Epoch 13/30
Epoch 14/30
Epoch 15/30
Epoch 16/30
Epoch 17/30
Epoch 18/30
Epoch 19/30
Epoch 20/30
Epoch 21/30


<keras.callbacks.History at 0x7fbc54022250>

In [10]:
model3 = keras.models.load_model("my_best_selu_model.h5")
model3.evaluate(X_val_scaled, y_val)



[1.4653328657150269, 0.5001999735832214]

We get 50.02% accuracy, which is not great; it performs similarly to the original model. It also converges at a similar rate to the original model, but it is definitely the fastest to train.
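
To see what "self-normalises" means in practice, here is a small NumPy sketch, separate from the Keras model above. It pushes standardised random inputs through 20 dense SELU layers of 100 units & checks that the activations keep a mean near 0 & a standard deviation near 1. Two assumptions to flag: the weights are drawn from a plain normal distribution with standard deviation sqrt(1 / fan_in), which only approximates Keras's truncated `lecun_normal`, & the biases are left out.

```python
import numpy as np

# SELU constants from the self-normalising networks paper
ALPHA = 1.6732632423543772
SCALE = 1.0507009873554805


def selu(z):
    """SELU: scale * z for z > 0, scale * alpha * (exp(z) - 1) otherwise."""
    negative_part = ALPHA * np.expm1(np.minimum(z, 0.0))
    return SCALE * np.where(z > 0, z, negative_part)


rng = np.random.default_rng(42)
n_units = 100
activations = rng.standard_normal((500, n_units))   # standardised inputs

for _ in range(20):
    # Approximate LeCun normal initialisation: std = sqrt(1 / fan_in)
    weights = rng.normal(scale=np.sqrt(1.0 / n_units), size=(n_units, n_units))
    activations = selu(activations @ weights)

final_mean, final_std = activations.mean(), activations.std()
print(f"after 20 layers: mean={final_mean:.2f}, std={final_std:.2f}")
```

Self-normalisation keeps the signal well-scaled through the stack, but as the result above shows, that alone does not guarantee better accuracy on this task.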

# 8e.

In [11]:
keras.backend.clear_session()

model4 = keras.models.Sequential()
model4.add(keras.layers.Flatten(input_shape = [32, 32, 3]))
for _ in range(20):
    model4.add(keras.layers.Dense(100, activation = "selu", kernel_initializer = "lecun_normal"))
model4.add(keras.layers.AlphaDropout(rate = 0.1))
model4.add(keras.layers.Dense(10, activation = "softmax"))

nadam_optimisation = keras.optimizers.Nadam(learning_rate = 5e-4, beta_1 = 0.9, beta_2 = 0.999)
checkpoint_callback = keras.callbacks.ModelCheckpoint("my_best_selu_alpha_dropout_model.h5", save_best_only = True)
           
model4.compile(loss = "sparse_categorical_crossentropy",
               optimizer = nadam_optimisation,
               metrics = ["accuracy"])
model4.fit(X_train_scaled, y_train, epochs = 30,
           validation_data = (X_val_scaled, y_val),
           callbacks = [checkpoint_callback, early_stopping_callback])

Epoch 1/30
Epoch 2/30
Epoch 3/30
Epoch 4/30
Epoch 5/30
Epoch 6/30
Epoch 7/30
Epoch 8/30
Epoch 9/30
Epoch 10/30
Epoch 11/30
Epoch 12/30
Epoch 13/30
Epoch 14/30
Epoch 15/30
Epoch 16/30
Epoch 17/30
Epoch 18/30
Epoch 19/30


<keras.callbacks.History at 0x7fbbd200d940>

In [12]:
model4 = keras.models.load_model("my_best_selu_alpha_dropout_model.h5")
model4.evaluate(X_val_scaled, y_val)



[1.5007606744766235, 0.4912000000476837]

Ok, so that's alpha dropout. Now, let's try MC dropout.

In [13]:
import numpy as np

class MCAlphaDropout(keras.layers.AlphaDropout):
    # Same as AlphaDropout, but dropout stays active at inference time
    def call(self, inputs):
        return super().call(inputs, training = True)

# Reuse the trained layers of model4, swapping in the always-on dropout layer
mc_alpha_dropout_model = keras.models.Sequential([
    MCAlphaDropout(layer.rate) if isinstance(layer, keras.layers.AlphaDropout) else layer
    for layer in model4.layers
])

def mc_alpha_dropout_predict_prob(model, X, n_samples = 10):
    # Average the class probabilities over n_samples stochastic forward passes
    y_probs = [model.predict(X) for _ in range(n_samples)]
    return np.mean(y_probs, axis = 0)

def mc_alpha_dropout_predict_classes(model, X, n_samples = 10):
    # Pick the class with the highest averaged probability
    y_probs = mc_alpha_dropout_predict_prob(model, X, n_samples)
    return np.argmax(y_probs, axis = 1)

In [14]:
keras.backend.clear_session()

y_pred = mc_alpha_dropout_predict_classes(mc_alpha_dropout_model, X_val_scaled)
np.mean(y_pred == y_val[:, 0])



0.4906

OK. So with alpha dropout, we get 49.12% accuracy, & with MC dropout, we get 49.06%. Neither helps here: both are slightly below the 50.02% of the SELU model without dropout.

# 8f.

In [21]:
keras.backend.clear_session()

model5 = keras.models.Sequential()
model5.add(keras.layers.Flatten(input_shape = [32, 32, 3]))
for _ in range(20):
    model5.add(keras.layers.Dense(100, activation = "selu", kernel_initializer = "lecun_normal"))
model5.add(keras.layers.AlphaDropout(rate = 0.1))
model5.add(keras.layers.Dense(10, activation = "softmax"))

checkpoint_callback = keras.callbacks.ModelCheckpoint("my_best_1cycle_model.h5", save_best_only = True)

model5.compile(loss = "sparse_categorical_crossentropy",
               optimizer = keras.optimizers.SGD(learning_rate = 1e-3),
               metrics = ["accuracy"])

In [22]:
import math

class OneCycleScheduler(keras.callbacks.Callback):
    # Ramps the learning rate linearly from start_rate up to max_rate, back
    # down to start_rate, then anneals it to last_rate over the final iterations
    def __init__(self, iterations, max_rate, start_rate = None,
                 last_iterations = None, last_rate = None):
        super().__init__()
        self.iterations = iterations  # total number of training batches
        self.max_rate = max_rate
        self.start_rate = start_rate or max_rate / 10
        # By default, about 10% of the iterations are kept for the final phase
        self.last_iterations = last_iterations or iterations // 10 + 1
        self.half_iteration = (iterations - self.last_iterations) // 2
        self.last_rate = last_rate or self.start_rate / 1000
        self.iteration = 0
    def _interpolate(self, iter1, iter2, rate1, rate2):
        # Linear interpolation between (iter1, rate1) & (iter2, rate2)
        return ((rate2 - rate1) * (self.iteration - iter1)
                / (iter2 - iter1) + rate1)
    def on_batch_begin(self, batch, logs = None):
        if self.iteration < self.half_iteration:
            # Phase 1: ramp up
            rate = self._interpolate(0, self.half_iteration, self.start_rate, self.max_rate)
        elif self.iteration < 2 * self.half_iteration:
            # Phase 2: ramp back down
            rate = self._interpolate(self.half_iteration, 2 * self.half_iteration,
                                     self.max_rate, self.start_rate)
        else:
            # Phase 3: final annealing
            rate = self._interpolate(2 * self.half_iteration, self.iterations,
                                     self.start_rate, self.last_rate)
        self.iteration += 1
        keras.backend.set_value(self.model.optimizer.learning_rate, rate)

batch_size = 50
n_epochs = 20
onecycle = OneCycleScheduler(math.ceil(len(X_train_scaled) / batch_size) * n_epochs, max_rate = 0.05)
history = model5.fit(X_train_scaled, y_train, epochs = n_epochs, batch_size = batch_size,
                     validation_data = (X_val_scaled, y_val),
                     callbacks=[onecycle, checkpoint_callback, early_stopping_callback])

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20


In [23]:
model5 = keras.models.load_model("my_best_1cycle_model.h5")
model5.evaluate(X_val_scaled, y_val)



[1.4811267852783203, 0.49799999594688416]

By far the fastest to train, but performance is still similar to the original model, & so is convergence.
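
To check what learning rates the scheduler actually applied, the same interpolation logic can be written as a plain function & evaluated without Keras. This is a sketch: `one_cycle_rate` is a hypothetical helper name, it mirrors the arithmetic of `OneCycleScheduler` above, & the iteration count assumes the 45,000 training images left after the validation split, batches of 50, & 20 epochs.

```python
import math


def one_cycle_rate(iteration, iterations, max_rate, start_rate=None,
                   last_iterations=None, last_rate=None):
    """Learning rate at a given batch index, using the 1cycle arithmetic above."""
    start_rate = start_rate or max_rate / 10
    last_iterations = last_iterations or iterations // 10 + 1
    half_iteration = (iterations - last_iterations) // 2
    last_rate = last_rate or start_rate / 1000

    def interpolate(iter1, iter2, rate1, rate2):
        # Linear interpolation between (iter1, rate1) & (iter2, rate2)
        return (rate2 - rate1) * (iteration - iter1) / (iter2 - iter1) + rate1

    if iteration < half_iteration:            # phase 1: ramp up
        return interpolate(0, half_iteration, start_rate, max_rate)
    if iteration < 2 * half_iteration:        # phase 2: ramp back down
        return interpolate(half_iteration, 2 * half_iteration, max_rate, start_rate)
    return interpolate(2 * half_iteration, iterations, start_rate, last_rate)


total_iterations = math.ceil(45_000 / 50) * 20
rates = [one_cycle_rate(i, total_iterations, max_rate=0.05)
         for i in range(total_iterations)]
peak_iteration = rates.index(max(rates))
print(f"start={rates[0]:.4f}, peak={max(rates):.4f} at batch {peak_iteration}, "
      f"end={rates[-1]:.2e}")
```

Note that the schedule is sized for the full 20 epochs, so if early stopping ends the run sooner, the final annealing phase is cut short.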