# Training Deep Neural Networks

`Check the table of initializers for each activation function in the book (Table 11-1)`

A good initialization strategy for the ReLU activation function is He initialization, which helps alleviate the exploding/vanishing gradients problem.

"By default, Keras uses Glorot initialization with a uniform distribution. When you create a layer, you can switch to He initialization by setting kernel_initializer="he_uniform" or kernel_initializer="he_normal" like this:"

In [2]:
import tensorflow as tf

dense = tf.keras.layers.Dense(50, activation="relu", kernel_initializer="he_normal")

"Alternatively, you can obtain any of the initializations listed in table 11-1 and more using the VarianceScaling initializer. For example, if you want He initialization with a uniform distribution and based on fan_avg (rather than fan_in), you can use the following code:"

In [3]:
he_avg_init = tf.keras.initializers.VarianceScaling(scale=2., mode="fan_avg",
                                                    distribution="uniform")
dense = tf.keras.layers.Dense(50, activation="sigmoid", kernel_initializer=he_avg_init)
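To make the `scale`, `mode`, and `distribution` arguments more concrete, here is a minimal plain-Python sketch of how `VarianceScaling` chooses its spread according to the Keras documentation: the target variance is `scale / n`, where `n` is `fan_in`, `fan_out`, or their average, and the uniform variant samples within `[-limit, limit]` with `limit = sqrt(3 * scale / n)`. The helper name `variance_scaling_spread` and the layer sizes (784 inputs, 50 units) are assumptions made up for this note; they are not part of Keras.

```python
import math

def variance_scaling_spread(fan_in, fan_out, scale=1.0, mode="fan_in",
                            distribution="truncated_normal"):
    """Hypothetical helper: stddev for normal variants, limit for uniform."""
    n = {"fan_in": fan_in,
         "fan_out": fan_out,
         "fan_avg": (fan_in + fan_out) / 2}[mode]
    if distribution == "uniform":
        return math.sqrt(3.0 * scale / n)  # samples lie in [-limit, limit]
    return math.sqrt(scale / n)            # target standard deviation

# He normal: scale=2, mode="fan_in"
he_normal_std = variance_scaling_spread(784, 50, scale=2.0)
# Glorot uniform (the Keras default): scale=1, mode="fan_avg", uniform
glorot_limit = variance_scaling_spread(784, 50, scale=1.0, mode="fan_avg",
                                       distribution="uniform")
# Same settings as he_avg_init: scale=2, mode="fan_avg", uniform
he_avg_limit = variance_scaling_spread(784, 50, scale=2.0, mode="fan_avg",
                                       distribution="uniform")
print(he_normal_std, glorot_limit, he_avg_limit)
```

With the same fan-in and fan-out, the `he_avg_init` limit is `sqrt(2)` times the Glorot uniform limit, since only `scale` differs.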

Better activation functions reduce instability of gradients

## Better activation functions

### Leaky ReLU

"Keras includes the classes LeakyReLU and PReLU in the tf.keras.layers package. Just like for other ReLU variants, you should use He initialization with these. For example:"

In [4]:
leaky_relu = tf.keras.layers.LeakyReLU(negative_slope=0.2) # default is 0.3; this argument was called alpha before Keras 3
dense = tf.keras.layers.Dense(50, activation=leaky_relu, kernel_initializer="he_normal")

We can also use LeakyReLU as a separate layer:

In [5]:
model = tf.keras.models.Sequential([
    tf.keras.layers.Input(shape=[10]),
    tf.keras.layers.Dense(50, kernel_initializer="he_normal"), # no activation
    tf.keras.layers.LeakyReLU(negative_slope=0.2), # activation as a separate layer
    tf.keras.layers.Dense(1)
])

### PReLU

In [6]:
leaky_prelu = tf.keras.layers.PReLU()
dense = tf.keras.layers.Dense(50, activation=leaky_prelu, kernel_initializer="he_normal")

Smooth variants of the ReLU activation function:

### ELU

In [7]:
dense = tf.keras.layers.Dense(50, activation="elu")

### SELU

In [8]:
dense = tf.keras.layers.Dense(50, activation="selu", kernel_initializer="lecun_normal")

### GELU, Swish, Mish

In [9]:
dense = tf.keras.layers.Dense(50, activation="gelu", kernel_initializer="he_normal")
dense = tf.keras.layers.Dense(50, activation="swish", kernel_initializer="he_normal") # Keras does not support generalized Swish
# dense = tf.keras.layers.Dense(50, activation="mish", kernel_initializer="he_normal")  # Keras does not support mish
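For reference, here is a plain-Python sketch of what the activations in this section compute for a single scalar input. The formulas are the standard definitions; the function names are local to this note (not Keras APIs), and the SELU constants are the usual rounded values.

```python
import math

def leaky_relu(z, negative_slope=0.2):
    return z if z > 0 else negative_slope * z

def elu(z, alpha=1.0):
    # smooth for z < 0, saturating at -alpha
    return z if z > 0 else alpha * (math.exp(z) - 1.0)

def selu(z, alpha=1.6732632423543772, scale=1.0507009873554805):
    # a scaled ELU with fixed constants
    return scale * elu(z, alpha)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def swish(z, beta=1.0):
    # beta=1 gives plain Swish (also called SiLU); other values of beta
    # give the generalized Swish
    return z * sigmoid(beta * z)

def gelu(z):
    # exact form: z times the standard normal CDF at z
    return 0.5 * z * (1.0 + math.erf(z / math.sqrt(2.0)))

def mish(z):
    # z * tanh(softplus(z))
    return z * math.tanh(math.log1p(math.exp(z)))

for z in (-2.0, 0.0, 2.0):
    print(z, leaky_relu(z), elu(z), selu(z), swish(z), gelu(z), mish(z))
```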

## Batch normalization

In [10]:
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=[28, 28]),
    tf.keras.layers.Flatten(),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Dense(300, activation="relu",
                          kernel_initializer="he_normal"),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Dense(100, activation="relu",
                          kernel_initializer="he_normal"),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Dense(10, activation="softmax")
])

model.summary()

Looking at the parameters of the first BN layer: two are trainable (by backpropagation) and two are not:

In [11]:
[(var.name, var.trainable) for var in model.layers[1].variables]

[('gamma', True),
 ('beta', True),
 ('moving_mean', False),
 ('moving_variance', False)]
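As a rough sketch of what those four variables do, the plain-Python example below normalizes one feature over a mini-batch, scales and shifts it with `gamma` and `beta` (the trainable pair), and updates the moving averages (the non-trainable pair) that are used at inference time. The `momentum` and `epsilon` values mirror the Keras defaults of 0.99 and 0.001; the function names and the sample batch are hypothetical, and a real layer does this for every feature at once.

```python
import math

def bn_train_step(batch, gamma, beta, moving_mean, moving_var,
                  momentum=0.99, epsilon=1e-3):
    # statistics of the current mini-batch
    mean = sum(batch) / len(batch)
    var = sum((x - mean) ** 2 for x in batch) / len(batch)
    # normalize, then scale (gamma) and shift (beta)
    out = [gamma * (x - mean) / math.sqrt(var + epsilon) + beta for x in batch]
    # exponential moving averages, not updated by backpropagation
    moving_mean = momentum * moving_mean + (1 - momentum) * mean
    moving_var = momentum * moving_var + (1 - momentum) * var
    return out, moving_mean, moving_var

def bn_inference(x, gamma, beta, moving_mean, moving_var, epsilon=1e-3):
    # at inference time the moving statistics replace the batch statistics
    return gamma * (x - moving_mean) / math.sqrt(moving_var + epsilon) + beta

batch = [0.0, 64.0, 128.0, 255.0]
out, mm, mv = bn_train_step(batch, gamma=1.0, beta=0.0,
                            moving_mean=0.0, moving_var=1.0)
print(out, mm, mv)
```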

Adding BN layers before the activation functions this time:

In [12]:
# Removing the bias from layers that come right before BN, since BN layers already have a bias (offset) term.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=[28, 28]),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(300, kernel_initializer="he_normal", use_bias=False),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Activation("relu"),
    tf.keras.layers.Dense(100, kernel_initializer="he_normal", use_bias=False),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Activation("relu"),
    tf.keras.layers.Dense(10, activation="softmax")
])

model.summary()    

## Gradient Clipping

In [13]:
# Most often used in RNNs to mitigate exploding gradients
optimizer = tf.keras.optimizers.SGD(clipvalue=1.0) # clips each gradient component to [-1.0, 1.0]
optimizer = tf.keras.optimizers.SGD(clipnorm=1.0) # clips by norm instead, which preserves the gradient's direction
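The difference between the two options is easiest to see on a single gradient vector. The sketch below is plain Python with hypothetical helper names; it only illustrates the arithmetic and is not how Keras implements clipping internally.

```python
import math

def clip_by_value(grad, clipvalue):
    # clip every component independently to [-clipvalue, clipvalue]
    return [max(-clipvalue, min(clipvalue, g)) for g in grad]

def clip_by_norm(grad, clipnorm):
    # rescale the whole vector if its L2 norm exceeds clipnorm
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= clipnorm:
        return list(grad)
    return [g * clipnorm / norm for g in grad]

grad = [0.9, 100.0]
by_value = clip_by_value(grad, 1.0)  # the direction changes a lot
by_norm = clip_by_norm(grad, 1.0)    # same direction, norm reduced to 1
print(by_value, by_norm)
```

Clipping by value turns a vector that points almost entirely along the second axis into one that points roughly along the diagonal, whereas clipping by norm keeps the ratio between components intact.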


## Transfer Learning with Keras

Here I apply transfer learning to the Fashion MNIST dataset. I will pretend I only have 100 instances of each class (1,000 samples in total). First I train an MLP from scratch on these samples as a baseline, then I use transfer learning and compare the results. The MLP whose layers I transfer was trained on the MNIST dataset with 50k training samples and reached 98% accuracy on the validation set (10k samples).

In [14]:
import numpy as np
tf.keras.backend.clear_session()




In [15]:
(X_train_full, y_train_full), (X_test, y_test) = tf.keras.datasets.fashion_mnist.load_data()

In [16]:
y_test

array([9, 2, 1, ..., 8, 1, 5], shape=(10000,), dtype=uint8)

In [17]:
print(X_train_full.shape)
X_train_full.dtype

(60000, 28, 28)


dtype('uint8')

In [18]:
X_train_1k = np.zeros(shape=(1000, 28, 28), dtype=np.uint8)
y_train_1k = np.zeros(shape=(1000,), dtype=np.uint8)

# get 100 instances from each class
for i in range(10):
    first_index = i*100
    indices = y_train_full == i
    x_100 = X_train_full[indices][:100]
    y_100 = y_train_full[indices][:100]
    X_train_1k[first_index : first_index + 100] = x_100
    y_train_1k[first_index : first_index + 100] = y_100
    

In [19]:
y_train_1k[90:110] # checking if it is correct

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
      dtype=uint8)

In [20]:
X_train_1k[0][1] # checking if it has non-zero values

array([  0,   0,   0,   1,   0,   0,   0,  49, 136, 219, 216, 228, 236,
       255, 255, 255, 255, 217, 215, 254, 231, 160,  45,   0,   0,   0,
         0,   0], dtype=uint8)

In [21]:
indices_shuffled = np.random.default_rng(seed=42).permutation(y_train_1k.shape[0])
indices_shuffled[:10]

array([978, 933, 859, 916, 127, 608, 856, 260, 147, 810])

In [22]:
# shuffle dataset
X_train_1k = X_train_1k[indices_shuffled]
y_train_1k = y_train_1k[indices_shuffled]
y_train_1k[:10]

array([9, 9, 8, 9, 1, 6, 8, 2, 1, 8], dtype=uint8)

In [23]:
X_train = X_train_1k[:800]
y_train = y_train_1k[:800]
X_valid = X_train_1k[800:]
y_valid = y_train_1k[800:]

print(X_train.shape, y_train.shape)

(800, 28, 28) (800,)


In [24]:
model_B = tf.keras.Sequential([
    tf.keras.layers.Input(shape=[28, 28]),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(480, activation="relu", kernel_initializer="he_normal"),
    tf.keras.layers.Dense(480, activation="relu", kernel_initializer="he_normal"),
    tf.keras.layers.Dense(480, activation="relu", kernel_initializer="he_normal"),
    tf.keras.layers.Dense(480, activation="relu", kernel_initializer="he_normal"),
    tf.keras.layers.Dense(480, activation="relu", kernel_initializer="he_normal"),
    tf.keras.layers.Dense(10, activation="softmax")
])

model_B.compile(optimizer=tf.keras.optimizers.Adam(),
                loss="sparse_categorical_crossentropy",
                metrics=["accuracy"]
               )

In [25]:
history = model_B.fit(X_train, y_train,
                      validation_data=(X_valid, y_valid),
                     epochs=30,
                )

Epoch 1/30
25/25 ━━━━━━━━━━━━━━━━━━━━ 2s 23ms/step - accuracy: 0.4275 - loss: 91.1915 - val_accuracy: 0.5350 - val_loss: 16.5927
Epoch 2/30
25/25 ━━━━━━━━━━━━━━━━━━━━ 0s 14ms/step - accuracy: 0.6850 - loss: 9.3166 - val_accuracy: 0.6500 - val_loss: 8.2295
Epoch 3/30
25/25 ━━━━━━━━━━━━━━━━━━━━ 0s 14ms/step - accuracy: 0.7337 - loss: 5.3916 - val_accuracy: 0.6250 - val_loss: 6.6847
Epoch 4/30
25/25 ━━━━━━━━━━━━━━━━━━━━ 1s 14ms/step - accuracy: 0.7862 - loss: 3.0369 - val_accuracy: 0.7550 - val_loss: 4.3091
Epoch 5/30
25/25 ━━━━━━━━━━━━━━━━━━━━ 0s 14ms/step - accuracy: 0.8087 - loss: 2.3798 - val_accuracy: 0.7250 - val_loss: 4.3688
Epoch 6/30
25/25 ━━━━━━━━━━━━━━━━━━━━ 0s 15ms/step - accuracy: 0.8425 - loss: 1.8060 - val_accuracy: 0.7550 - val_loss: 2.7328
Epoch 7/30
...

In [26]:
model_B.evaluate(X_valid, y_valid)

7/7 ━━━━━━━━━━━━━━━━━━━━ 0s 8ms/step - accuracy: 0.8100 - loss: 1.9160

[1.9159855842590332, 0.8100000023841858]

In [27]:
model_B.evaluate(X_test, y_test)

313/313 ━━━━━━━━━━━━━━━━━━━━ 1s 4ms/step - accuracy: 0.7272 - loss: 3.4780

[3.477980852127075, 0.7271999716758728]

Accuracy of model B on test set is 73%. Now, with transfer learning:

In [28]:
# Model A was trained on the MNIST dataset (in the ch10 notebook). It has the same architecture
# as model B, so clone the architecture here and load the saved weights below
model_A = tf.keras.models.clone_model(model_B)

# Compiling would be needed if the model were created from scratch
# model_A.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.00021647994826791666),
#                 loss="sparse_categorical_crossentropy",
#                 metrics=["accuracy"]
#                )

model_A.load_weights("../ch10/checkpoints_q10_mnist.weights.h5")

In [29]:
# Test if model was loaded correctly
(X_train_full_mnist, y_train_full_mnist), (X_test_mnist, y_test_mnist) = tf.keras.datasets.mnist.load_data()
model_A.evaluate(X_test_mnist, y_test_mnist)

313/313 ━━━━━━━━━━━━━━━━━━━━ 2s 5ms/step - accuracy: 0.9817 - loss: 0.1447

[0.14471349120140076, 0.9817000031471252]

In [30]:
# cloning model so model A is not affected by training new model
model_A_clone = tf.keras.models.clone_model(model_A)
model_A_clone.set_weights(model_A.get_weights())

In [31]:
# copying bottom 2 hidden layers ( + flatten layer)
model_B_on_A = tf.keras.Sequential(model_A_clone.layers[:3])
model_B_on_A.add(tf.keras.layers.Dense(480, activation="relu", kernel_initializer="he_normal"))
model_B_on_A.add(tf.keras.layers.Dense(480, activation="relu", kernel_initializer="he_normal"))
model_B_on_A.add(tf.keras.layers.Dense(480, activation="relu", kernel_initializer="he_normal"))
model_B_on_A.add(tf.keras.layers.Dense(10, activation="softmax"))

In [32]:
# Freeze bottom 2 layers
for layer in model_B_on_A.layers[:3]:
    layer.trainable = False

# Always compile after freezing/unfreezing layers
model_B_on_A.compile(optimizer=tf.keras.optimizers.Adam(),
                loss="sparse_categorical_crossentropy",
                metrics=["accuracy"]
               )

In [33]:
history = model_B_on_A.fit(
    X_train, y_train,
    validation_data=(X_valid, y_valid),
    epochs=10
)

Epoch 1/10
25/25 ━━━━━━━━━━━━━━━━━━━━ 2s 19ms/step - accuracy: 0.5400 - loss: 30.8260 - val_accuracy: 0.6350 - val_loss: 8.1867
Epoch 2/10
25/25 ━━━━━━━━━━━━━━━━━━━━ 0s 12ms/step - accuracy: 0.7862 - loss: 3.5184 - val_accuracy: 0.7000 - val_loss: 3.9160
Epoch 3/10
25/25 ━━━━━━━━━━━━━━━━━━━━ 0s 12ms/step - accuracy: 0.8500 - loss: 1.7014 - val_accuracy: 0.7350 - val_loss: 3.5219
Epoch 4/10
25/25 ━━━━━━━━━━━━━━━━━━━━ 0s 12ms/step - accuracy: 0.9050 - loss: 0.9528 - val_accuracy: 0.7200 - val_loss: 3.4333
Epoch 5/10
25/25 ━━━━━━━━━━━━━━━━━━━━ 0s 12ms/step - accuracy: 0.8938 - loss: 1.6955 - val_accuracy: 0.6950 - val_loss: 5.5155
Epoch 6/10
25/25 ━━━━━━━━━━━━━━━━━━━━ 0s 12ms/step - accuracy: 0.9300 - loss: 0.8062 - val_accuracy: 0.7250 - val_loss: 2.1537
Epoch 7/10
...

In [34]:
# Unfreeze bottom 2 layers to fine-tune
for layer in model_B_on_A.layers[:3]:
    layer.trainable = True

# Always compile after freezing/unfreezing layers
model_B_on_A.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.0001),
                loss="sparse_categorical_crossentropy",
                metrics=["accuracy"]
               )

In [35]:
history = model_B_on_A.fit(
    X_train, y_train,
    validation_data=(X_valid, y_valid),
    epochs=20
)

Epoch 1/20
25/25 ━━━━━━━━━━━━━━━━━━━━ 2s 22ms/step - accuracy: 0.9675 - loss: 0.1538 - val_accuracy: 0.7300 - val_loss: 2.2358
Epoch 2/20
25/25 ━━━━━━━━━━━━━━━━━━━━ 0s 14ms/step - accuracy: 0.9812 - loss: 0.1068 - val_accuracy: 0.7500 - val_loss: 2.5014
Epoch 3/20
25/25 ━━━━━━━━━━━━━━━━━━━━ 0s 14ms/step - accuracy: 0.9837 - loss: 0.0566 - val_accuracy: 0.7300 - val_loss: 2.7308
Epoch 4/20
25/25 ━━━━━━━━━━━━━━━━━━━━ 0s 14ms/step - accuracy: 0.9937 - loss: 0.0354 - val_accuracy: 0.7750 - val_loss: 2.0907
Epoch 5/20
25/25 ━━━━━━━━━━━━━━━━━━━━ 0s 13ms/step - accuracy: 0.9937 - loss: 0.0280 - val_accuracy: 0.7700 - val_loss: 2.6440
Epoch 6/20
25/25 ━━━━━━━━━━━━━━━━━━━━ 1s 15ms/step - accuracy: 0.9962 - loss: 0.0367 - val_accuracy: 0.7950 - val_loss: 2.2305
Epoch 7/20
...

In [36]:
model_B_on_A.evaluate(X_test, y_test)

313/313 ━━━━━━━━━━━━━━━━━━━━ 2s 5ms/step - accuracy: 0.7406 - loss: 3.6280

[3.62803053855896, 0.7405999898910522]

Compared with model B, the test loss increased slightly (3.48 to 3.63) while the test accuracy rose a little (72.7% to 74.1%), so transfer learning did not help much here. Probably I should have reused only 1 hidden layer (because the tasks are very different) or trained the network with the frozen layers for more epochs.

The author says "transfer learning does not work very well with small dense networks, presumably because small networks learn few patterns, and dense networks learn very specific patterns, which are unlikely to be useful in other tasks."

## Faster Optimizers

Momentum

In [37]:
optimizer = tf.keras.optimizers.SGD(learning_rate=0.001, momentum=0.9)

Nesterov Accelerated Gradient (NAG)

In [38]:
optimizer = tf.keras.optimizers.SGD(learning_rate=0.001, momentum=0.9, nesterov=True)
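To see what the `momentum` and `nesterov` arguments change, here is a minimal sketch of the textbook update rules on a one-dimensional toy cost `J(theta) = theta**2`. The toy cost and the helper names are assumptions chosen for illustration, and Keras may arrange the same arithmetic differently internally.

```python
def grad(theta):
    # gradient of the toy cost J(theta) = theta**2
    return 2.0 * theta

def momentum_step(theta, m, lr=0.001, beta=0.9, nesterov=False):
    # Nesterov measures the gradient slightly ahead, at theta + beta * m
    lookahead = theta + beta * m if nesterov else theta
    m = beta * m - lr * grad(lookahead)  # update the momentum vector
    return theta + m, m                  # then move by the momentum

theta, m = 1.0, 0.0
for _ in range(1000):
    theta, m = momentum_step(theta, m)

theta_nag, m_nag = 1.0, 0.0
for _ in range(1000):
    theta_nag, m_nag = momentum_step(theta_nag, m_nag, nesterov=True)

print(theta, theta_nag)  # both end up very close to the minimum at 0
```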

RMSProp

In [39]:
optimizer = tf.keras.optimizers.RMSprop(learning_rate=0.001, rho=0.9)

Adam

In [40]:
optimizer = tf.keras.optimizers.Adam(learning_rate=0.001, beta_1=0.9, beta_2=0.999)
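The role of `beta_1` and `beta_2` is clearer when the update is written out. The sketch below implements one scalar Adam step with bias correction in plain Python; `adam_step` is a hypothetical helper, the toy cost `J(theta) = theta**2` is an assumption, and `eps=1e-7` mirrors the Keras default epsilon.

```python
import math

def adam_step(theta, g, m, v, t, lr=0.001, beta_1=0.9, beta_2=0.999, eps=1e-7):
    m = beta_1 * m + (1 - beta_1) * g      # running mean of gradients
    v = beta_2 * v + (1 - beta_2) * g * g  # running mean of squared gradients
    m_hat = m / (1 - beta_1 ** t)          # bias correction (t starts at 1)
    v_hat = v / (1 - beta_2 ** t)
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v

theta, m, v = 1.0, 0.0, 0.0
for t in range(1, 4):
    g = 2.0 * theta  # gradient of the toy cost J(theta) = theta**2
    theta, m, v = adam_step(theta, g, m, v, t)
    print(t, theta)
```

On the very first step the corrected moments equal the gradient and its square, so the parameter moves by roughly `lr` regardless of the gradient's scale.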

Adamax, Nadam, AdamW

In [41]:
optimizer = tf.keras.optimizers.Adamax(learning_rate=0.001, beta_1=0.9, beta_2=0.999)
optimizer = tf.keras.optimizers.AdamW(learning_rate=0.001, beta_1=0.9, beta_2=0.999, weight_decay=0.96) # note: 0.96 is a very strong decay; the Keras default is 0.004
optimizer = tf.keras.optimizers.Nadam(learning_rate=0.001, beta_1=0.9, beta_2=0.999)

"Combining l2 regularization with Adam often results in models that do not generalize well." Use Adam with weight decay (AdamW) instead.
Sometimes, adaptive gradient methods (RMSProp, Adam, and its variants) do not generalize well.

See table 11-2 for optimizers comparison

## Learning Rate Scheduling

Exponential scheduling (or exponential decay)

In [42]:
def exponential_decay_fn(epoch):
    return 0.01 * 0.1 ** (epoch / 20)

# if you do not want to hardcode lr0 and s, build the function with a closure
def exponential_decay(lr0, s):
    def exponential_decay_fn(epoch):
        return lr0 * 0.1 ** (epoch / s)
    return exponential_decay_fn

exponential_decay_fn = exponential_decay(lr0=0.01, s=20)
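A quick sanity check of this schedule, with no Keras needed: the learning rate should drop by a factor of 10 every `s` epochs. The snippet repeats the closure so that it runs on its own; the name `schedule` and the chosen epochs are just for illustration.

```python
def exponential_decay(lr0, s):
    def exponential_decay_fn(epoch):
        return lr0 * 0.1 ** (epoch / s)
    return exponential_decay_fn

schedule = exponential_decay(lr0=0.01, s=20)
# learning rate at the start, halfway to the first 10x drop, and after 1 and 2 drops
rates = [schedule(epoch) for epoch in (0, 10, 20, 40)]
print(rates)
```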

In [43]:
# Create a LearningRateScheduler callback
lr_scheduler = tf.keras.callbacks.LearningRateScheduler(exponential_decay_fn)

# Example
# history = model.fit(X_train, y_train, epochs=10, callbacks=[lr_scheduler])

The schedule function can optionally take the current learning rate as a second argument:

In [44]:
def exponential_decay_fn(epoch, lr):
    return lr * 0.1 ** (1 / 20)

`When saving a model, the epoch does not get saved! If you resume training after loading the model, the epoch restarts at 0. To solve this, use a learning schedule function that does not depend on the epoch, or set fit()'s initial_epoch parameter.`

Piecewise constant scheduling

In [45]:
def piecewise_constant_fn(epoch):
    if epoch < 5:
        return 0.01
    elif epoch < 15:
        return 0.005
    else:
        return 0.001

# Can create a more general function if I do not want to hardcode epoch thresholds just like previous section
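One possible way to write that more general version is sketched below, using the standard library's `bisect` module. The factory name `piecewise_constant` and its argument names are my own choices for this note, not a Keras API.

```python
from bisect import bisect_right

def piecewise_constant(boundaries, values):
    """boundaries: sorted epoch thresholds; values: one more entry than boundaries."""
    if len(values) != len(boundaries) + 1:
        raise ValueError("values must have exactly one more entry than boundaries")

    def piecewise_constant_fn(epoch):
        # count how many thresholds the epoch has reached to pick the segment
        return values[bisect_right(boundaries, epoch)]

    return piecewise_constant_fn

# same schedule as the hardcoded function: 0.01, then 0.005 from epoch 5,
# then 0.001 from epoch 15
piecewise_constant_fn = piecewise_constant([5, 15], [0.01, 0.005, 0.001])
print([piecewise_constant_fn(epoch) for epoch in (0, 4, 5, 14, 15, 30)])
```

The returned function takes only the epoch, so it can be passed to `LearningRateScheduler` in the same way as the hardcoded one.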

Performance scheduling

In [46]:
lr_scheduler = tf.keras.callbacks.ReduceLROnPlateau(factor=0.5, patience=5)
# this multiplies learning rate by 0.5 if best validation loss does not improve for 5 consecutive epochs

Alternative to implementing learning rate scheduling:

In [47]:
batch_size = 32
n_epochs = 20
n_steps = n_epochs * int(np.ceil(len(X_train) / batch_size))  # total number of training steps (as an integer)
scheduled_learning_rate = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=0.01, decay_steps=n_steps, decay_rate=0.1)
optimizer = tf.keras.optimizers.SGD(learning_rate=scheduled_learning_rate)

The code above "is nice and simple, plus when you save the model, the learning rate and its schedule (including its state) get saved as well."
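Unlike the callback-based schedules, this one is evaluated per training step rather than per epoch. As documented for `ExponentialDecay` (with the default `staircase=False`), the rate at a given step is `initial_learning_rate * decay_rate ** (step / decay_steps)`. The plain-Python sketch below reproduces that formula; the helper name is hypothetical, and the sample count of 800 assumes the small training split created earlier in this notebook.

```python
import math

def exponential_decay_at(step, initial_learning_rate=0.01,
                         decay_steps=500, decay_rate=0.1):
    # continuous (non-staircase) exponential decay, evaluated at one step
    return initial_learning_rate * decay_rate ** (step / decay_steps)

batch_size, n_epochs, n_train = 32, 20, 800
n_steps = n_epochs * math.ceil(n_train / batch_size)  # 25 steps per epoch

lr_start = exponential_decay_at(0, decay_steps=n_steps)
lr_end = exponential_decay_at(n_steps, decay_steps=n_steps)
print(n_steps, lr_start, lr_end)
```

Setting `decay_steps` to the total number of training steps means the learning rate is divided by 10 exactly once over the whole run.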

In [48]:
# Implement power scheduling
lr_schedule = tf.keras.optimizers.schedules.InverseTimeDecay(
    initial_learning_rate=0.01,
    decay_steps=10_000,
    decay_rate=1.0,
    staircase=False
)
optimizer = tf.keras.optimizers.SGD(learning_rate=lr_schedule)
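For intuition about how quickly this schedule decays, here is the documented `InverseTimeDecay` formula (with `staircase=False`) written as a plain-Python function; the helper name is hypothetical.

```python
def inverse_time_decay_at(step, initial_learning_rate=0.01,
                          decay_steps=10_000, decay_rate=1.0):
    # lr = lr0 / (1 + decay_rate * step / decay_steps)
    return initial_learning_rate / (1.0 + decay_rate * step / decay_steps)

for step in (0, 10_000, 20_000, 30_000):
    print(step, inverse_time_decay_at(step))
```

With `decay_rate=1.0`, the rate is halved after the first `decay_steps` steps, reaches one third after twice as many, and so on, so it drops quickly at first and then more and more slowly.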

## Avoid overfitting through regularization

### $l_1$ and $l_2$ regularization

In [49]:
# l2
layer = tf.keras.layers.Dense(100, activation="relu",
                              kernel_initializer="he_normal",
                              kernel_regularizer=tf.keras.regularizers.l2(0.01)
                             )
# l1
layer = tf.keras.layers.Dense(100, activation="relu",
                              kernel_initializer="he_normal",
                              kernel_regularizer=tf.keras.regularizers.l1(0.01)
                             )
# l1 and l2
layer = tf.keras.layers.Dense(100, activation="relu",
                              kernel_initializer="he_normal",
                              kernel_regularizer=tf.keras.regularizers.l1_l2(l1=0.01, l2=0.01)
                             )
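To make the regularization factor concrete: the Keras regularizers add `l1 * sum(|w|)` and `l2 * sum(w**2)` to the loss for the layer's kernel. The sketch below computes both penalties for a tiny, made-up weight list; the function names are hypothetical.

```python
def l1_penalty(weights, l1=0.01):
    # sum of absolute values, scaled by the regularization factor
    return l1 * sum(abs(w) for w in weights)

def l2_penalty(weights, l2=0.01):
    # sum of squares, scaled by the regularization factor
    return l2 * sum(w * w for w in weights)

weights = [0.5, -2.0, 0.0, 1.5]  # illustrative values only
penalty_l1 = l1_penalty(weights)
penalty_l2 = l2_penalty(weights)
penalty_l1_l2 = penalty_l1 + penalty_l2  # what l1_l2(l1=0.01, l2=0.01) would add
print(penalty_l1, penalty_l2, penalty_l1_l2)
```

The squared term makes the l2 penalty dominated by large weights, while the l1 penalty grows linearly and therefore tends to push small weights all the way to zero.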

You typically want to apply the same activation function, initializer, and regularizer in all hidden layers. To avoid too much repetition, you can use loops or Python's functools.partial() function, which lets you create a thin wrapper for any callable, with some default argument values:

In [50]:
from functools import partial

RegularizedDense = partial(tf.keras.layers.Dense,
                 activation="relu",
                 kernel_initializer="he_normal",
                 kernel_regularizer=tf.keras.regularizers.l2(0.01))

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=[28, 28]),
    tf.keras.layers.Flatten(),
    RegularizedDense(100),
    RegularizedDense(100),
    RegularizedDense(10, activation="softmax")
])

### Dropout

In [51]:
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=[28, 28]),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dropout(rate=0.2),
    tf.keras.layers.Dense(100, activation="relu", kernel_initializer="he_normal"),
    tf.keras.layers.Dropout(rate=0.2),
    tf.keras.layers.Dense(100, activation="relu", kernel_initializer="he_normal"),
    tf.keras.layers.Dropout(rate=0.2),
    tf.keras.layers.Dense(10, activation="softmax")
])

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"]
             )

history = model.fit(X_train, y_train,
                    validation_data=(X_valid, y_valid),
                    epochs=50)

Epoch 1/50
25/25 ━━━━━━━━━━━━━━━━━━━━ 1s 15ms/step - accuracy: 0.2450 - loss: 91.0348 - val_accuracy: 0.4950 - val_loss: 10.1590
Epoch 2/50
25/25 ━━━━━━━━━━━━━━━━━━━━ 0s 6ms/step - accuracy: 0.3225 - loss: 22.7316 - val_accuracy: 0.3300 - val_loss: 5.0413
Epoch 3/50
25/25 ━━━━━━━━━━━━━━━━━━━━ 0s 6ms/step - accuracy: 0.3125 - loss: 6.8452 - val_accuracy: 0.3100 - val_loss: 3.5483
Epoch 4/50
25/25 ━━━━━━━━━━━━━━━━━━━━ 0s 6ms/step - accuracy: 0.3088 - loss: 4.3433 - val_accuracy: 0.2650 - val_loss: 2.9787
Epoch 5/50
25/25 ━━━━━━━━━━━━━━━━━━━━ 0s 6ms/step - accuracy: 0.2837 - loss: 3.1877 - val_accuracy: 0.2700 - val_loss: 2.4408
Epoch 6/50
25/25 ━━━━━━━━━━━━━━━━━━━━ 0s 6ms/step - accuracy: 0.2937 - loss: 2.6116 - val_accuracy: 0.2250 - val_loss: 2.4072
Epoch 7/50
...

In [52]:
model.evaluate(X_valid, y_valid)

7/7 ━━━━━━━━━━━━━━━━━━━━ 0s 6ms/step - accuracy: 0.3000 - loss: 1.8981

[1.8980798721313477, 0.30000001192092896]

In [53]:
model.evaluate(X_test, y_test)

313/313 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - accuracy: 0.3563 - loss: 1.9297

[1.9297410249710083, 0.3562999963760376]

"Since dropout is only active during training, comparing the training loss and the validation loss can be misleading. In particular, a model may be overfitting the training set and yet have similar training and validation losses. So, make sure to evaluate the training loss without dropout (e.g. after training)"

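To illustrate why the training and validation losses are not directly comparable, here is a minimal plain-Python sketch of what a dropout layer does in each mode. It assumes the usual "inverted dropout" behaviour, where surviving inputs are scaled up by `1 / (1 - rate)` during training and nothing happens at inference time; the function name and the seeded generator are choices made for this note.

```python
import random

def dropout(inputs, rate, training, rng):
    if not training:
        return list(inputs)  # inference: inputs pass through unchanged
    keep = 1.0 - rate
    # training: drop each input with probability `rate`, scale survivors up
    return [x / keep if rng.random() >= rate else 0.0 for x in inputs]

rng = random.Random(42)  # seeded so the run is reproducible
x = [1.0] * 10_000
train_out = dropout(x, rate=0.2, training=True, rng=rng)
test_out = dropout(x, rate=0.2, training=False, rng=rng)

print(sum(train_out) / len(train_out))  # close to 1.0 on average
print(sum(test_out) / len(test_out))
```

Because of the rescaling, the expected activation stays the same in both modes, but every training step sees a noisier, partially masked network, which is why the loss measured during training tends to look worse than the loss of the full network.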
### Monte Carlo (MC) Dropout

In [54]:
import numpy as np

y_probas = np.stack([model(X_test, training=True) 
                     for sample in range(100)])
y_proba = y_probas.mean(axis=0)

In [55]:
# when dropout is turned off
model.predict(X_test[:1]).round(3)

1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 93ms/step

array([[0.032, 0.034, 0.029, 0.013, 0.05 , 0.081, 0.031, 0.121, 0.062,
        0.547]], dtype=float32)

In [56]:
# with MC dropout prediction
y_proba[0].round(3)

array([0.084, 0.034, 0.089, 0.066, 0.085, 0.045, 0.081, 0.072, 0.08 ,
       0.364], dtype=float32)

In [57]:
# standard deviation
y_std = y_probas.std(axis=0)
y_std[0].round(3)

array([0.061, 0.023, 0.063, 0.05 , 0.059, 0.033, 0.059, 0.077, 0.054,
       0.392], dtype=float32)

In [58]:
y_pred = y_proba.argmax(axis=1)
accuracy = (y_pred == y_test).sum() / len(y_test)
print(accuracy)

0.4573


Accuracy improved from 35.6% to 45.7%!

"If your model contains other layers that behave in a special way during training (such as BatchNormalization), then you should not force training mode like we just did. Instead, you should replace the Dropout layers with the following MCDropout class:"

In [59]:
class MCDropout(tf.keras.layers.Dropout):
    def call(self, inputs, training=False):
        return super().call(inputs, training=True)

### Max-Norm Regularization

In [60]:
dense = tf.keras.layers.Dense(
    100, activation="relu", kernel_initializer="he_normal",
    kernel_constraint=tf.keras.constraints.max_norm(1.))
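As a sketch of what the constraint does after each training step: if the L2 norm of a neuron's incoming weight vector exceeds the maximum `r`, the vector is rescaled so that its norm equals `r`; otherwise it is left (almost) untouched. The plain-Python helper below is hypothetical and handles a single weight vector, whereas Keras applies the constraint to the whole kernel along an axis; the small `eps` only guards against division by zero.

```python
import math

def max_norm_rescale(w, max_value=1.0, eps=1e-7):
    norm = math.sqrt(sum(x * x for x in w))
    desired = min(norm, max_value)  # never increase the norm
    return [x * desired / (eps + norm) for x in w]

too_big = max_norm_rescale([3.0, 4.0])  # norm 5 is scaled down to about 1
small = max_norm_rescale([0.3, 0.4])    # norm 0.5 stays essentially unchanged
print(too_big, small)
```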

# Questions

1. To alleviate the unstable gradients problem.
2. It is not OK, because all neurons in a layer would receive the same gradients, so the entire layer would behave like a single neuron replicated across the layer. This would not generalize well, as it would be like having a single neuron per hidden layer.
3. I think it is OK, as long as the weights are initialized properly.
4. Leaky ReLU or ReLU for faster (or just shallow) neural nets. SELU for deep dense nets. Swish, Mish, or GELU for any deep nets.
5. Momentum's peak "speed" becomes much larger. However, it also takes longer to slow down.
6. Use strong $l_1$ regularization, use the TF-MOT package, or get rid of tiny weights (set them to zero).
7. Dropout does slow down training because it is as if we were training a different model every time. It does not slow down inference. MC dropout does slow down inference because it makes the model predict on the same instance multiple times.

8. Coding question:

In [62]:
deep_nn = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(32, 32, 3)),  # CIFAR-10 images are 32x32 RGB
    tf.keras.layers.Flatten()
])
hidden = partial(tf.keras.layers.Dense,
                 units=100,
                 activation="swish",
                 kernel_initializer="he_normal"
                )

for _ in range(20):
    deep_nn.add(hidden())

deep_nn.add(tf.keras.layers.Dense(10, activation="softmax"))

In [64]:
len(deep_nn.layers)

22

In [65]:
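class exponential_lr(tf.keras.callbacks.Callback):
    """Multiplies the learning rate by `factor` after every batch and records it with the loss."""
    def __init__(self, factor):
        super().__init__()
        self.factor = factor
        self.losses = []
        self.rates = []

    def on_batch_end(self, batch, logs=None):
        # grow the learning rate exponentially and log it alongside the current loss
        lr = self.model.optimizer.learning_rate.numpy() * self.factor
        self.model.optimizer.learning_rate = lr
        self.rates.append(lr)
        self.losses.append(logs["loss"])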

In [66]:
import matplotlib.pyplot as plt

def plot(model, X_train, y_train, lr0=1e-5, factor=1.005):
    model = tf.keras.models.clone_model(model)
    expo_lr = exponential_lr(factor=factor)
    optimizer = tf.keras.optimizers.Nadam(learning_rate=lr0)
    model.compile(loss="sparse_categorical_crossentropy",
                  optimizer=optimizer,
                  metrics=["accuracy"])

    history = model.fit(X_train, y_train,
                        epochs=1,
                        validation_split=0.2,
                        callbacks=[expo_lr]
                       )

    losses = expo_lr.losses
    rates = expo_lr.rates

    plt.plot(rates, losses)
    plt.gca().set_xscale("log")
    plt.grid()
    plt.xlabel("Learning rate (log)")
    plt.ylabel("Loss")
    plt.show()
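Before running this search, it can help to estimate how far the learning rate will travel in one epoch, since it is multiplied by `factor` after every batch. The sketch below assumes the default batch size of 32, the 50,000 CIFAR-10 training images used in this exercise, and the `validation_split=0.2` from the function above; `final_learning_rate` is a hypothetical helper.

```python
import math

def final_learning_rate(lr0, factor, n_samples, batch_size=32,
                        validation_split=0.2, epochs=1):
    n_train = int(n_samples * (1 - validation_split))  # samples left for training
    steps = epochs * math.ceil(n_train / batch_size)   # one update per batch
    return lr0 * factor ** steps, steps

lr_end, steps = final_learning_rate(lr0=1e-5, factor=1.005, n_samples=50_000)
print(steps, lr_end)
```

If the loss has not started to blow up by the end of the sweep, a larger `factor` or a larger `lr0` extends the range that gets explored.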

In [67]:
(X_train, y_train), (X_test, y_test) = tf.keras.datasets.cifar10.load_data()

print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)

Downloading data from https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz
170498071/170498071 ━━━━━━━━━━━━━━━━━━━━ 5s 0us/step

(50000, 32, 32, 3) (50000, 1) (10000, 32, 32, 3) (10000, 1)


In [68]:
y_train

array([[6],
       [9],
       [9],
       ...,
       [9],
       [1],
       [1]], shape=(50000, 1), dtype=uint8)