# Training Deep Neural Nets (exercise)

> If all the weights have the same initial value, even if that value is not zero, then symmetry is not broken (i.e., all neurons in a given layer are equivalent), and backpropagation cannot break it.
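
The symmetry argument can be checked numerically. The following is a minimal NumPy sketch, assuming a tiny two-layer network with tanh hidden units and a mean squared error loss (the network, data, and variable names are illustrative and not part of the exercise). Because every weight starts at the same constant, every hidden neuron receives exactly the same gradient, so a gradient step keeps the neurons identical.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))   # 8 samples, 3 features (random toy data)
y = rng.normal(size=(8, 1))

# Every weight starts at the same non-zero constant.
W1 = np.full((3, 4), 0.5)     # input -> 4 hidden neurons
W2 = np.full((4, 1), 0.5)     # hidden -> 1 output

# Forward pass
h = np.tanh(X @ W1)
y_hat = h @ W2

# Backward pass for the mean squared error loss
grad_out = 2 * (y_hat - y) / len(X)
grad_W2 = h.T @ grad_out
grad_h = grad_out @ W2.T
grad_W1 = X.T @ (grad_h * (1 - h ** 2))

# Each column of grad_W1 belongs to one hidden neuron: they are all equal.
columns_identical = np.allclose(grad_W1, grad_W1[:, :1])
print(columns_identical)
```

Replacing `np.full` with a random initializer makes the columns differ, which is what lets the neurons learn different features.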

> A few advantages of the SELU function over the ReLU function are:
> - It can take negative values, so the average output of each layer stays closer to zero, which helps alleviate the vanishing gradients problem.
> - It always has a non-zero derivative, which avoids the dying units problem.
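
Both points can be seen by evaluating the two functions and their derivatives on a few inputs. This is a small NumPy sketch, not the Keras implementation; the helper names are made up for illustration, and the two constants are the standard SELU values.

```python
import numpy as np

# Standard SELU constants (alpha and scale)
ALPHA = 1.6732632423543772
SCALE = 1.0507009873554805

def relu(z):
    return np.maximum(0.0, z)

def relu_grad(z):
    # Zero for every negative input: a unit stuck there stops learning.
    return (z > 0).astype(float)

def selu(z):
    return SCALE * np.where(z > 0, z, ALPHA * (np.exp(z) - 1))

def selu_grad(z):
    # Strictly positive everywhere, including for negative inputs.
    return SCALE * np.where(z > 0, 1.0, ALPHA * np.exp(z))

z = np.array([-3.0, -1.0, -0.1, 0.5, 2.0])
print("relu      :", relu(z))
print("selu      :", selu(z))
print("relu grad :", relu_grad(z))
print("selu grad :", selu_grad(z))
```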

> MC Dropout is exactly like dropout during training, but it is still active during inference, so each inference is slowed down slightly. More importantly, when using MC Dropout you generally want to run inference 10 times or more to get better predictions. This means that making predictions is slowed down by a factor of 10 or more.
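
The idea of averaging several stochastic forward passes can be sketched without Keras. The example below is an illustration under simplifying assumptions: a single linear layer with softmax, plain (inverted) dropout on the inputs rather than alpha dropout, and random toy data. The function `dropout_forward` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(42)

def dropout_forward(X, W, rate, rng):
    """One stochastic forward pass: dropout on the inputs, then softmax."""
    keep_mask = rng.random(X.shape) >= rate
    X_dropped = X * keep_mask / (1.0 - rate)   # inverted dropout scaling
    logits = X_dropped @ W
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    exp = np.exp(logits)
    return exp / exp.sum(axis=1, keepdims=True)

X = rng.normal(size=(5, 20))   # 5 toy instances, 20 features
W = rng.normal(size=(20, 3))   # toy weights for 3 classes

# Dropout stays active, so each pass gives slightly different probabilities.
n_samples = 10
y_probas = np.stack([dropout_forward(X, W, 0.1, rng) for _ in range(n_samples)])

# The MC Dropout prediction is the average over the stochastic passes.
y_proba_mean = y_probas.mean(axis=0)
y_pred = y_proba_mean.argmax(axis=1)
print(y_pred)
```

The cost is visible in the list comprehension: `n_samples` forward passes per prediction instead of one.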

In [24]:
import os
import numpy as np
import tensorflow as tf
from tensorflow import keras 
import matplotlib.pyplot as plt

 Build a DNN with 20 hidden layers of 100 neurons each (that's too many, but it's the point of this exercise). Use He initialization and the ELU activation function.

In [18]:
np.random.seed(42)
tf.random.set_seed(42)

model = tf.keras.Sequential()
model.add(keras.layers.Flatten(input_shape=(32,32,3)))
for hidden_layer in range(20) :
    model.add(keras.layers.Dense(100, activation='elu', kernel_initializer='he_normal'))
model.add(keras.layers.Dense(10, activation='softmax'))

 Using Nadam optimization and early stopping, train the network on the CIFAR10 dataset.

In [19]:
optimizer = keras.optimizers.Nadam(learning_rate=5e-5)
model.compile(loss=keras.losses.SparseCategoricalCrossentropy(), 
              optimizer=optimizer, metrics=['accuracy'])

Let's load the dataset:

In [21]:
(X_train_full, y_train_full), (X_test, y_test) = keras.datasets.cifar10.load_data()
X_train_full.shape, y_train_full.shape

((50000, 32, 32, 3), (50000, 1))

In [23]:
X_train = X_train_full[10000:]
X_valid = X_train_full[:10000]
y_train = y_train_full[10000:]
y_valid = y_train_full[:10000]

X_train.shape, X_valid.shape

((40000, 32, 32, 3), (10000, 32, 32, 3))

Now we can create the callbacks we need and train the model:

In [25]:
early_stopping = keras.callbacks.EarlyStopping(patience=20)
model_checkpoint = keras.callbacks.ModelCheckpoint('save_model/my_cifar10.h5', save_best_only=True)
run_index = 1 # increment on every training run
run_logdir = os.path.join('logdir', 'my_cifar_10_logs', f'run_{run_index}')
tensorboard_cb = keras.callbacks.TensorBoard(run_logdir)
callbacks = [early_stopping, model_checkpoint, tensorboard_cb]
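
To make the role of `patience` concrete, here is a simplified pure-Python sketch of the stopping rule. It assumes the monitored quantity is the validation loss and that any decrease counts as an improvement; it is not the Keras implementation, and `early_stopping_epoch` is a hypothetical helper.

```python
def early_stopping_epoch(val_losses, patience):
    """Return the 1-based epoch at which training would stop, or None."""
    best = float("inf")
    wait = 0  # epochs since the last improvement
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None  # never ran out of patience

# Made-up validation losses: the best value is reached at epoch 3.
losses = [1.9, 1.7, 1.6, 1.65, 1.62, 1.61, 1.7]
stop = early_stopping_epoch(losses, patience=3)
print(stop)
```

With `patience=20`, as used in this exercise, training continues for 20 epochs past the best one, which is why the checkpoint callback with `save_best_only=True` is needed to recover the best model afterwards.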

In [29]:
%load_ext tensorboard
%tensorboard --logdir=./logdir/my_cifar_10_logs --port=6006

The tensorboard extension is already loaded. To reload it, use:
  %reload_ext tensorboard


In [30]:
model.fit(X_train, y_train, epochs=100, validation_data=(X_valid, y_valid), callbacks=callbacks, verbose=0)

<keras.callbacks.History at 0x2741a0ed5d0>

In [31]:
model = keras.models.load_model('save_model/my_cifar10.h5') # load the best model
model.evaluate(X_valid, y_valid)



[1.495286464691162, 0.4708000123500824]

> With only 47% accuracy on the validation set, let's see if we can improve performance using Batch Normalization.

Try adding Batch Normalization and compare the learning curves: is it converging faster than before? Does it produce a better model?

In [37]:
tf.random.set_seed(42)
np.random.seed(42)

model = keras.models.Sequential()
model.add(keras.layers.Flatten(input_shape=(32, 32, 3)))
model.add(keras.layers.BatchNormalization())
for hidden_layer in range(20) :
    model.add(keras.layers.Dense(100, kernel_initializer='he_normal'))
    model.add(keras.layers.BatchNormalization())
    model.add(keras.layers.Activation('elu'))
model.add(keras.layers.Dense(10, activation='softmax'))
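
For reference, this is roughly what each `BatchNormalization` layer computes on a batch during training. The NumPy sketch below assumes the training-time behavior only (it ignores the moving averages used at inference), and `batch_norm_train` is an illustrative name, not a Keras function.

```python
import numpy as np

def batch_norm_train(Z, gamma, beta, eps=1e-3):
    """Normalize each feature over the batch, then scale and shift."""
    mu = Z.mean(axis=0)                     # per-feature batch mean
    var = Z.var(axis=0)                     # per-feature batch variance
    Z_hat = (Z - mu) / np.sqrt(var + eps)   # zero mean, unit variance
    return gamma * Z_hat + beta             # learned scale and offset

rng = np.random.default_rng(0)
# A batch of 32 pre-activations for a 100-unit layer, far from standardized.
Z = rng.normal(loc=5.0, scale=3.0, size=(32, 100))
out = batch_norm_train(Z, gamma=np.ones(100), beta=np.zeros(100))

print(out.mean(axis=0)[:3], out.std(axis=0)[:3])
```

Placing the layer before the activation, as in the cell above, means the ELU always sees inputs with a controlled mean and variance.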

In [38]:
optimizer = keras.optimizers.Nadam(learning_rate=5e-4) # learning rate chosen by experiment (the best one after 20 epochs of training)
model.compile(loss=keras.losses.SparseCategoricalCrossentropy(), optimizer=optimizer, metrics=['accuracy'])

In [39]:
early_stopping = keras.callbacks.EarlyStopping(patience=20)
model_checkpoint = keras.callbacks.ModelCheckpoint('save_model/my_cifar10_bn.h5', save_best_only=True) # new file name
run_index = 1 # increment on every training run
run_logdir = os.path.join('logdir', 'my_cifar_10_logs', f'run_bn_{run_index}') # new log path
tensorboard_cb = keras.callbacks.TensorBoard(run_logdir)
callbacks = [early_stopping, model_checkpoint, tensorboard_cb]

In [40]:
model.fit(X_train, y_train, epochs=100, validation_data=(X_valid, y_valid), callbacks=callbacks, batch_size=32) # add batch size

Epoch 40/100


<keras.callbacks.History at 0x2742db16830>

In [42]:
model = keras.models.load_model('save_model/my_cifar10_bn.h5') # load the best model
model.evaluate(X_valid, y_valid)



[1.326360821723938, 0.5300999879837036]

> - The model converges faster, in about 10 fewer epochs (40 instead of 50).
> - It is more accurate (53% instead of 47%).
> - But training takes longer: 11 minutes instead of 9.

Try replacing Batch Normalization with SELU, and make the necessary adjustments to ensure the network self-normalizes (i.e., standardize the input features, use LeCun normal initialization).
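
Standardizing the inputs means subtracting the per-feature mean and dividing by the per-feature standard deviation, both computed on the training set only. The sketch below uses small random arrays in place of the CIFAR10 images; the array names and the `eps` guard against zero-variance features are assumptions added for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-ins for image data: pixel values in [0, 255], shape (n, h, w, c).
X_tr = rng.integers(0, 256, size=(1000, 4, 4, 3)).astype(np.float64)
X_va = rng.integers(0, 256, size=(200, 4, 4, 3)).astype(np.float64)

# Statistics come from the training set only, one value per pixel/channel.
means = X_tr.mean(axis=0, keepdims=True)
stds = X_tr.std(axis=0, keepdims=True)
eps = 1e-7  # avoids division by zero if a feature is constant

X_tr_scaled = (X_tr - means) / (stds + eps)
X_va_scaled = (X_va - means) / (stds + eps)  # reuse the training statistics

print(X_tr_scaled.mean(), X_tr_scaled.std())
```

The validation set ends up close to, but not exactly at, zero mean and unit variance, which is expected because its own statistics are never used.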

In [63]:

tf.random.set_seed(42)
np.random.seed(42)

model = keras.models.Sequential()
model.add(keras.layers.Flatten(input_shape=[32, 32, 3]))
for hidden_layer in range(20):
    model.add(keras.layers.Dense(100,
                                 kernel_initializer="lecun_normal",
                                 activation="selu"))
model.add(keras.layers.Dense(10, activation="softmax"))
optimizer = keras.optimizers.Nadam(learning_rate=7e-4) # adjusted learning rate
model.compile(loss=keras.losses.SparseCategoricalCrossentropy(),
              optimizer=optimizer,
              metrics=["accuracy"])

# callbacks
early_stopping = keras.callbacks.EarlyStopping(patience=20)
model_checkpoint = keras.callbacks.ModelCheckpoint('save_model/my_cifar10_selu.h5', save_best_only=True) # new file name
run_index = 1 # increment on every training run
run_logdir = os.path.join('logdir', 'my_cifar_10_logs', f'run_selu_{run_index}') # new log path
tensorboard_cb = keras.callbacks.TensorBoard(run_logdir)
callbacks = [early_stopping, model_checkpoint, tensorboard_cb]

# standardize input
X_means = X_train.mean(axis=0)
X_stds = X_train.std(axis=0)

X_train_scaled = (X_train - X_means) / X_stds # scale with training-set statistics only, to prevent data leakage
X_valid_scaled = (X_valid - X_means) / X_stds
X_test_scaled = (X_test - X_means) / X_stds

# fit model
model.fit(X_train_scaled, y_train, epochs=100, validation_data=(X_valid_scaled, y_valid), callbacks=callbacks, batch_size=32) 

Epoch 22/100
Epoch 23/100
Epoch 24/100
Epoch 25/100
Epoch 26/100
Epoch 27/100
Epoch 28/100
Epoch 29/100


<keras.callbacks.History at 0x27431cc7340>

In [64]:
model = keras.models.load_model("save_model/my_cifar10_selu.h5")
model.evaluate(X_valid_scaled, y_valid)



[1.4797067642211914, 0.49059998989105225]

> Not as good as the model using Batch Normalization (53%), but its convergence was faster (only 29 epochs).

Try regularizing the model with alpha dropout. Then, without retraining your model, see if you can achieve better accuracy using MC Dropout.

In [71]:
tf.random.set_seed(42)
np.random.seed(42)

model = keras.models.Sequential()
model.add(keras.layers.Flatten(input_shape=[32, 32, 3]))
for hidden_layer in range(20):
    model.add(keras.layers.Dense(100,
                                 kernel_initializer="lecun_normal",
                                 activation="selu"))

model.add(keras.layers.AlphaDropout(rate=0.1))
model.add(keras.layers.Dense(10, activation="softmax"))

optimizer = keras.optimizers.Nadam(learning_rate=5e-4) # adjusted learning rate
model.compile(loss=keras.losses.SparseCategoricalCrossentropy(),
              optimizer=optimizer,
              metrics=["accuracy"])

# callbacks
early_stopping = keras.callbacks.EarlyStopping(patience=20)
model_checkpoint = keras.callbacks.ModelCheckpoint('save_model/my_cifar10_alpha.h5', save_best_only=True) # new file name
run_index = 1 # increment on every training run
run_logdir = os.path.join('logdir', 'my_cifar_10_logs', f'run_alpha_{run_index}') # new log path
tensorboard_cb = keras.callbacks.TensorBoard(run_logdir)
callbacks = [early_stopping, model_checkpoint, tensorboard_cb]
    

# fit model
model.fit(X_train_scaled, y_train, epochs=100, validation_data=(X_valid_scaled, y_valid), callbacks=callbacks, batch_size=32) 



<keras.callbacks.History at 0x27433da20b0>

In [72]:
model = keras.models.load_model("save_model/my_cifar10_alpha.h5")
model.evaluate(X_valid_scaled, y_valid)



[1.5190484523773193, 0.4641000032424927]

Let's use MC Dropout now. 

In [73]:
class MCAlphaDropout(keras.layers.AlphaDropout):
    def call(self, inputs):
        # Force training mode so that dropout stays active at inference time
        return super().call(inputs, training=True)

Now let's create a new model, identical to the one we just trained (with the same weights), but with MCAlphaDropout dropout layers instead of AlphaDropout layers:

In [75]:
mc_model = keras.models.Sequential([ 
    MCAlphaDropout(layer.rate) if isinstance(layer, keras.layers.AlphaDropout) else layer for layer in model.layers  
])

Create a function that runs the model 10 times (by default) and returns the mean predicted class probabilities, and a second function that uses these mean probabilities to predict the most likely class.

In [77]:
def mc_dropout_probas(mc_model, X, n_samples=10) :
    y_probas = [mc_model.predict(X) for sample in range(n_samples)]
    return np.mean(y_probas, axis=0)

def mc_dropout_pred_classes(mc_model, X, n_samples=10) :
    y_probas = mc_dropout_probas(mc_model, X, n_samples)
    return np.argmax(y_probas, axis=1)

In [80]:
tf.random.set_seed(42)
np.random.seed(42)

y_pred = mc_dropout_pred_classes(mc_model, X_valid_scaled, 10)
accuracy = np.mean(y_pred == y_valid[:, 0])
accuracy

0.4628

> We get no accuracy improvement in this case.

> So the best model we got in this exercise is the Batch Normalization model.

Retrain your model using 1cycle scheduling and see if it improves training speed and model accuracy.

In [91]:
K = keras.backend

class OneCycleScheduler(keras.callbacks.Callback):
    def __init__(self, iterations, max_rate, start_rate=None,
                 last_iterations=None, last_rate=None):
        self.iterations = iterations
        self.max_rate = max_rate
        self.start_rate = start_rate or max_rate / 10
        self.last_iterations = last_iterations or iterations // 10 + 1
        self.half_iteration = (iterations - self.last_iterations) // 2
        self.last_rate = last_rate or self.start_rate / 1000
        self.iteration = 0
    def _interpolate(self, iter1, iter2, rate1, rate2):
        return ((rate2 - rate1) * (self.iteration - iter1) / (iter2 - iter1) + rate1)
    def on_batch_begin(self, batch, logs):
        if self.iteration < self.half_iteration:
            rate = self._interpolate(0, self.half_iteration, self.start_rate, self.max_rate)
        elif self.iteration < 2 * self.half_iteration:
            rate = self._interpolate(self.half_iteration, 2 * self.half_iteration,  self.max_rate, self.start_rate)
        else:
            rate = self._interpolate(2 * self.half_iteration, self.iterations, self.start_rate, self.last_rate)
        self.iteration += 1
        K.set_value(self.model.optimizer.learning_rate, rate)
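
To see the shape of the schedule without training anything, the same interpolation logic can be written as a plain function of the iteration number. This is a sketch that mirrors the callback above; `one_cycle_rate` is a hypothetical helper, and the iteration count assumes the 40,000 training images of this exercise with a batch size of 128 over 15 epochs.

```python
import math

def one_cycle_rate(iteration, iterations, max_rate, start_rate=None,
                   last_iterations=None, last_rate=None):
    """Learning rate at a given iteration, same rules as OneCycleScheduler."""
    start_rate = start_rate or max_rate / 10
    last_iterations = last_iterations or iterations // 10 + 1
    half = (iterations - last_iterations) // 2
    last_rate = last_rate or start_rate / 1000

    def interpolate(iter1, iter2, rate1, rate2):
        return (rate2 - rate1) * (iteration - iter1) / (iter2 - iter1) + rate1

    if iteration < half:                       # phase 1: ramp up
        return interpolate(0, half, start_rate, max_rate)
    elif iteration < 2 * half:                 # phase 2: ramp back down
        return interpolate(half, 2 * half, max_rate, start_rate)
    else:                                      # phase 3: final decay
        return interpolate(2 * half, iterations, start_rate, last_rate)

# 313 batches per epoch * 15 epochs = 4695 iterations
total = math.ceil(40000 / 128) * 15
half = (total - (total // 10 + 1)) // 2       # 2112

for it in (0, half, 2 * half, total - 1):
    print(it, one_cycle_rate(it, total, max_rate=0.05))
```

The rate climbs linearly from `max_rate / 10` to `max_rate` over the first half, returns to the starting rate over the second half, and then drops by several orders of magnitude during the last 10% or so of the iterations.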

In [92]:
tf.random.set_seed(42)
np.random.seed(42)

model = keras.models.Sequential()
model.add(keras.layers.Flatten(input_shape=[32, 32, 3]))
for hidden_layer in range(20):
    model.add(keras.layers.Dense(100,
                                 kernel_initializer="lecun_normal",
                                 activation="selu"))

model.add(keras.layers.AlphaDropout(rate=0.1))
model.add(keras.layers.Dense(10, activation="softmax"))

optimizer = keras.optimizers.SGD(learning_rate=1e-2) # initial value only: the 1cycle callback overrides it at every batch
model.compile(loss=keras.losses.SparseCategoricalCrossentropy(),
              optimizer=optimizer,
              metrics=["accuracy"])

In [93]:
import math

batch_size = 128
n_epochs = 15
onecycle = OneCycleScheduler(math.ceil(len(X_train_scaled) / batch_size) * n_epochs, max_rate=0.05)
history = model.fit(X_train_scaled, y_train, epochs=n_epochs, batch_size=batch_size,
                    validation_data=(X_valid_scaled, y_valid),
                    callbacks=[onecycle])

Epoch 15/15


One cycle allowed us to train the model in just 15 epochs, each taking only 2 seconds (thanks to the larger batch size). This is several times faster than the fastest model we trained so far. 

Moreover, we improved the model's performance (from 47% to 52%). The batch normalized model reaches a slightly better performance (53%), but it's much slower to train.