1. What is the problem that Glorot initialization and He initialization aim to fix?

    Both aim to fix the vanishing and exploding gradients problems, known collectively as the unstable gradients problem. Glorot and Bengio point out that the signal must flow properly in both directions: in the forward direction when making predictions, and in the reverse direction when backpropagating gradients. We don’t want the signal to die out, nor do we want it to explode and saturate.
    
    Glorot initialization and He initialization were designed to make the output standard deviation as close as possible to the input standard deviation, at least at the beginning of training. This reduces the vanishing/exploding gradients problem.
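As a rough illustration of that claim, the NumPy sketch below (not part of the original notebook) pushes random inputs through a stack of ReLU layers and reports the root-mean-square activation at the end. The helper name `final_rms`, the depth and width, and the naive N(0, 1) baseline are all assumptions chosen for the demo. He initialization is approximated here with a plain normal distribution of standard deviation sqrt(2 / fan_in).

```python
import numpy as np

def final_rms(weight_std_fn, n_layers=20, width=256, n_samples=1000, seed=42):
    """Push unit-variance random inputs through a stack of ReLU layers and
    return the root-mean-square activation after the last layer."""
    rng = np.random.default_rng(seed)
    a = rng.normal(size=(n_samples, width))
    for _ in range(n_layers):
        fan_in = a.shape[1]
        # Sample every weight independently with the requested std.
        W = rng.normal(scale=weight_std_fn(fan_in), size=(fan_in, width))
        a = np.maximum(0.0, a @ W)  # ReLU
    return float(np.sqrt(np.mean(a ** 2)))

naive_rms = final_rms(lambda fan_in: 1.0)                  # N(0, 1) weights
he_rms = final_rms(lambda fan_in: np.sqrt(2.0 / fan_in))   # He-style scaling

print(f"naive init, RMS after 20 layers: {naive_rms:.3e}")
print(f"He init,    RMS after 20 layers: {he_rms:.3e}")
```

With the naive baseline the signal grows by a large factor at every layer, while the He-scaled weights keep it at roughly the same order of magnitude as the input.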

2. Is it OK to initialize all the weights to the same value as long as that value is selected randomly using He initialization?

    No, all weights should be sampled independently; they should not all have the same initial value. One important goal of sampling weights randomly is to break symmetry: if all the weights have the same initial value, even if that value is not zero, then symmetry is not broken (i.e., all neurons in a given layer are equivalent), and backpropagation will be unable to break it. Concretely, this means that all the neurons in any given layer will always have the same weights. It's like having just one neuron per layer, and much slower. It is virtually impossible for such a configuration to converge to a good solution.
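The symmetry argument can be checked on a toy problem. The following is a minimal sketch, assuming a one-hidden-layer tanh network trained with plain gradient descent on random data. The sizes, the constant 0.5, and the learning rate are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 3))   # toy inputs
y = rng.normal(size=(64, 1))   # toy regression targets

# Every weight in a layer starts at the same value (biases at zero).
W1 = np.full((3, 4), 0.5)
b1 = np.zeros(4)
W2 = np.full((4, 1), 0.5)
b2 = np.zeros(1)

learning_rate = 0.1
for _ in range(50):
    # Forward pass: tanh hidden layer, linear output.
    h = np.tanh(X @ W1 + b1)
    err = (h @ W2 + b2) - y
    # Backward pass: gradients of 0.5 * mean squared error.
    grad_W2 = h.T @ err / len(X)
    grad_b2 = err.mean(axis=0)
    dz = (err @ W2.T) * (1.0 - h ** 2)
    grad_W1 = X.T @ dz / len(X)
    grad_b1 = dz.mean(axis=0)
    W1 -= learning_rate * grad_W1
    b1 -= learning_rate * grad_b1
    W2 -= learning_rate * grad_W2
    b2 -= learning_rate * grad_b2

# Each column of W1 holds the incoming weights of one hidden neuron.
symmetric = np.allclose(W1, W1[:, :1])
moved = not np.allclose(W1, 0.5)
print("hidden neurons still identical:", symmetric)
print("weights changed during training:", moved)
```

The weights do get updated, but every hidden neuron receives the same gradient at every step, so the columns of `W1` never diverge from one another.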

3. Is it OK to initialize the bias terms to 0?

    It is perfectly fine to initialize the bias terms to zero. Some people like to initialize them just like weights, and that's OK too; it does not make much difference.

4. In which cases would you want to use each of the activation functions we discussed in this chapter?

    ReLU remains a good default for the hidden layers on simple tasks: it’s often just as good as the more sophisticated activation functions, it’s very fast to compute, and many libraries and hardware accelerators provide ReLU-specific optimizations. Its ability to output precisely zero can also be useful in some cases (e.g., see Chapter 17).
    
    If you care a lot about runtime latency, then you may prefer leaky ReLU, or parametrized leaky ReLU for more complex tasks. These leaky variants can improve the model's quality without hindering its speed too much compared to ReLU.
    
    For large neural nets and more complex problems, GELU, Swish and Mish can give you a slightly higher quality model, but they have a computational cost. Swish is probably a better default for more complex tasks, and you can even try parametrized Swish with a learnable β parameter for the most complex tasks. Mish may give you slightly better results, but it requires a bit more compute.
    
    For deep MLPs, give SELU a try, but make sure to respect the constraints listed earlier. If you have spare time and computing power, you can use cross-validation to evaluate other activation functions as well.
    
    The hyperbolic tangent (tanh) can be useful in the output layer if you need to output a number in a fixed range (by default between –1 and 1), but nowadays it is not used much in hidden layers, except in recurrent nets.
      
    The sigmoid activation function is also useful in the output layer when you need to estimate a probability (e.g., for binary classification), but it is rarely used in hidden layers (there are exceptions—for example, for the coding layer of variational autoencoders; see Chapter 17).

    The softplus activation function is useful in the output layer when you need to ensure that the output will always be positive. The softmax activation function is useful in the output layer to estimate probabilities for mutually exclusive classes, but it is rarely (if ever) used in hidden layers.
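For reference, here is a small NumPy sketch of several of the functions mentioned above, written from their usual textbook definitions. It is meant for intuition, not as a replacement for the library implementations. The `alpha` and `beta` defaults are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softplus(z):
    return np.logaddexp(0.0, z)  # log(1 + exp(z)), computed stably

def relu(z):
    return np.maximum(0.0, z)

def leaky_relu(z, alpha=0.2):
    return np.maximum(alpha * z, z)  # small slope alpha for z < 0

def swish(z, beta=1.0):
    return z * sigmoid(beta * z)

def mish(z):
    return z * np.tanh(softplus(z))

def softmax(z):
    # Subtract the row maximum first to avoid overflow in exp().
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

z = np.array([-2.0, 0.0, 2.0])
for fn in (relu, leaky_relu, swish, mish, softplus, sigmoid, np.tanh):
    print(f"{fn.__name__:>10}: {np.round(fn(z), 3)}")
print(f"{'softmax':>10}: {np.round(softmax(z), 3)}")
```

Comparing the rows shows the properties the answer relies on: only ReLU outputs exact zeros, softplus is always positive, sigmoid stays in (0, 1), tanh stays in (–1, 1), and the softmax outputs sum to 1.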

5. What may happen if you set the momentum hyperparameter too close to 1 (e.g., 0.99999) when using an SGD optimizer?

    Too little friction: the optimizer keeps overshooting the minimum, so it takes much longer to converge.
    
    If you set the momentum hyperparameter too close to 1 (e.g., 0.99999) when using an SGD optimizer, then the algorithm will likely pick up a lot of speed, hopefully moving roughly toward the global minimum, but its momentum will carry it right past the minimum. Then it will slow down and come back, accelerate again, overshoot again, and so on. It may oscillate this way many times before converging, so overall it will take much longer to converge than with a smaller momentum value.
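The oscillation described above can be reproduced on a one-dimensional quadratic. This is a minimal sketch, assuming the momentum update m ← βm − η∇f(θ), θ ← θ + m. The test function, starting point, tolerance, and step cap are arbitrary choices for illustration.

```python
def steps_to_converge(beta, learning_rate=0.1, tol=1e-3, max_steps=20_000):
    """Run momentum gradient descent on f(theta) = theta**2 / 2 from theta = 1.

    Returns the number of steps until both theta and the momentum vector are
    smaller than tol in absolute value, or max_steps if that never happens.
    """
    theta, m = 1.0, 0.0
    for step in range(1, max_steps + 1):
        gradient = theta                       # f'(theta) = theta
        m = beta * m - learning_rate * gradient
        theta = theta + m
        if abs(theta) < tol and abs(m) < tol:
            return step
    return max_steps

moderate = steps_to_converge(beta=0.9)
extreme = steps_to_converge(beta=0.99999)
print("beta=0.9     ->", moderate, "steps")
print("beta=0.99999 ->", extreme, "steps (equal to max_steps means no convergence)")
```

With β = 0.9 the iterate settles quickly, while with β = 0.99999 it is still swinging back and forth across the minimum when the step cap is reached.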

6. Name three ways you can produce a sparse model.

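    - Train the model normally, then set the tiny weights to zero.
    - Apply strong ℓ1 regularization during training, which pushes the optimizer to zero out as many weights as it can.
    - Use the pruning API in the TensorFlow Model Optimization Toolkit (TF-MOT).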
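The first option can be sketched in a few lines of NumPy. The random matrix below is only a stand-in for a trained weight matrix, and the `threshold` value is a hypothetical cutoff that would need tuning against validation accuracy in practice.

```python
import numpy as np

rng = np.random.default_rng(42)
weights = rng.normal(scale=0.1, size=(100, 100))  # stand-in for trained weights

threshold = 0.05  # hypothetical cutoff, not a recommended value
# Zero out every weight whose magnitude is below the cutoff.
pruned = np.where(np.abs(weights) < threshold, 0.0, weights)

sparsity = float(np.mean(pruned == 0.0))
print(f"fraction of zeroed weights: {sparsity:.1%}")
```

Note that the zeros only save memory or compute if the model is then stored or executed in a format that exploits sparsity.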

7. Does dropout slow down training? Does it slow down inference (i.e., making predictions on new instances)? What about MC dropout?

    Yes, dropout does slow down training, in general roughly by a factor of two. However, it has no impact on inference speed since it is only turned on during training. 
    
    MC Dropout is exactly like dropout during training, but it is still active during inference, so each inference is slowed down slightly. More importantly, when using MC Dropout you generally want to run inference 10 times or more to get better predictions. This means that making predictions is slowed down by a factor of 10 or more.
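The cost difference is easy to see in a toy NumPy sketch. The tiny network, its random "trained" weights, the dropout rate, and the number of samples are all hypothetical. With a Keras model, the usual equivalent is to call the model with `training=True` inside the loop.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical "trained" weights for a tiny 4 -> 16 -> 3 classifier.
W1, b1 = rng.normal(size=(4, 16)), np.zeros(16)
W2, b2 = rng.normal(size=(16, 3)), np.zeros(3)

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def predict_proba(X, dropout_rate=0.0):
    """One forward pass; a dropout_rate above 0 keeps dropout active."""
    h = np.maximum(0.0, X @ W1 + b1)
    if dropout_rate > 0.0:
        keep = rng.random(h.shape) >= dropout_rate
        h = h * keep / (1.0 - dropout_rate)  # inverted dropout scaling
    return softmax(h @ W2 + b2)

X_new = rng.normal(size=(5, 4))

# Regular inference: dropout is off, one forward pass.
y_proba = predict_proba(X_new)

# MC dropout: dropout stays on, and many stochastic passes are averaged.
n_samples = 100
y_probas = np.stack([predict_proba(X_new, dropout_rate=0.2)
                     for _ in range(n_samples)])
y_proba_mc = y_probas.mean(axis=0)
y_std_mc = y_probas.std(axis=0)  # spread across passes, a rough uncertainty signal

print(y_probas.shape)  # (100, 5, 3): n_samples forward passes instead of one
```

The inference cost scales with `n_samples`, which is where the slowdown by a factor of 10 or more comes from.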

8. Practice training a deep neural network on the CIFAR10 image dataset:

    a. Build a DNN with 20 hidden layers of 100 neurons each (that’s too many, but it’s the point of this exercise). Use He initialization and the Swish activation function.
    
    b. Using Nadam optimization and early stopping, train the network on the CIFAR10 dataset. You can load it with tf.keras.datasets.cifar10.load_data(). The dataset is composed of 60,000 32 × 32–pixel color images (50,000 for training, 10,000 for testing) with 10 classes, so you’ll need a softmax output layer with 10 neurons. Remember to search for the right learning rate each time you change the model’s architecture or hyperparameters.

In [1]:
import sys
assert sys.version_info >= (3, 7)
from packaging import version
import tensorflow as tf
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

from pathlib import Path

IMAGES_PATH = Path() / "images" / "deep"
IMAGES_PATH.mkdir(parents=True, exist_ok=True)

def save_fig(fig_id, tight_layout=True, fig_extension="png", resolution=300):
    path = IMAGES_PATH / f"{fig_id}.{fig_extension}"
    if tight_layout:
        plt.tight_layout()
    plt.savefig(path, format=fig_extension, dpi=resolution)

In [2]:
%pip install -q -U tensorboard-plugin-profile

Note: you may need to restart the kernel to use updated packages.


In [3]:
cifar10 = tf.keras.datasets.cifar10.load_data()

(X_train_full, y_train_full), (X_test, y_test) = cifar10

X_train = X_train_full[5000:]
y_train = y_train_full[5000:]
X_valid = X_train_full[:5000]
y_valid = y_train_full[:5000]

Downloading data from https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz


In [4]:
tf.random.set_seed(42)
model = tf.keras.Sequential()
model.add(tf.keras.layers.Flatten(input_shape=[32, 32, 3]))
for _ in range(20):  # 20 hidden layers of 100 neurons each
    model.add(tf.keras.layers.Dense(100, activation="swish",
                                    kernel_initializer="he_normal"))
model.add(tf.keras.layers.Dense(10, activation="softmax"))
# model.summary()

optimizer = tf.keras.optimizers.Nadam(learning_rate=5e-5)

2022-12-04 19:32:52.108722: I tensorflow/core/common_runtime/process_util.cc:146] Creating new thread pool with default inter op setting: 2. Tune using inter_op_parallelism_threads for best performance.


In [5]:
model.compile(loss="sparse_categorical_crossentropy", optimizer=optimizer,
              metrics=["accuracy"])

early_stopping_cb = tf.keras.callbacks.EarlyStopping(patience=20, restore_best_weights=True)
model_checkpoint_cb = tf.keras.callbacks.ModelCheckpoint("my_cifar10_model", save_best_only=True)

run_index = 1 # increment every time you train the model
run_logdir = Path() / "my_cifar10_logs" / f"run_{run_index:03d}"
tensorboard_cb = tf.keras.callbacks.TensorBoard(run_logdir)

callbacks = [early_stopping_cb, model_checkpoint_cb, tensorboard_cb]

# %load_ext tensorboard
# %tensorboard --logdir=./my_cifar10_logs

2022-12-04 19:32:52.684418: I tensorflow/core/profiler/lib/profiler_session.cc:131] Profiler session initializing.
2022-12-04 19:32:52.684885: I tensorflow/core/profiler/lib/profiler_session.cc:146] Profiler session started.
2022-12-04 19:32:52.686177: I tensorflow/core/profiler/lib/profiler_session.cc:164] Profiler session tear down.


In [6]:
# history = model.fit(X_train, y_train, epochs=100, validation_data=(X_valid, y_valid), callbacks=callbacks)

In [7]:
# pd.DataFrame(history.history).plot(
#     figsize=(8, 5), xlim=[0, 29], ylim=[0, 1], grid=True, xlabel="Epoch", style=["r--", "r--.", "b-", "b-*"])

# plt.show()

    c. Now try adding batch normalization and compare the learning curves: is it converging faster than before? Does it produce a better model? How does it affect training speed?

    d. Try replacing batch normalization with SELU, and make the necessary adjustments to ensure the network self-normalizes (i.e., standardize the input features, use LeCun normal initialization, make sure the DNN contains only a sequence of dense layers, etc.).

    e. Try regularizing the model with alpha dropout. Then, without retraining your model, see if you can achieve better accuracy using MC dropout. 

    f. Retrain your model using 1cycle scheduling and see if it improves training speed and model accuracy.
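For part f, the shape of a 1cycle learning rate schedule can be sketched as a plain function of the iteration number. This is one possible piecewise-linear formulation, not the only one. The function name `one_cycle_lr` and the default ratios (start at a tenth of the maximum rate, spend the last 10% of iterations annealing down to a thousandth of the starting rate) are assumptions. It also assumes `total_iterations` is large enough that each phase has at least one step. Wiring it into training, for example through a custom Keras callback that sets the optimizer's learning rate at each batch, is left out.

```python
def one_cycle_lr(iteration, total_iterations, max_lr=1e-3,
                 start_lr=None, last_fraction=0.1, final_lr=None):
    """Return the learning rate to use at the given iteration."""
    start_lr = max_lr / 10 if start_lr is None else start_lr
    final_lr = start_lr / 1000 if final_lr is None else final_lr
    last_iterations = int(total_iterations * last_fraction)
    half = (total_iterations - last_iterations) // 2
    if iteration < half:
        # Phase 1: grow linearly from start_lr up to max_lr.
        return start_lr + (max_lr - start_lr) * iteration / half
    if iteration < 2 * half:
        # Phase 2: shrink linearly from max_lr back down to start_lr.
        return max_lr - (max_lr - start_lr) * (iteration - half) / half
    # Phase 3: anneal linearly from start_lr down to final_lr.
    remaining = total_iterations - 2 * half
    progress = min((iteration - 2 * half) / remaining, 1.0)
    return start_lr + (final_lr - start_lr) * progress

total = 1000
for it in (0, 225, 450, 675, 900, 1000):
    print(f"iteration {it:4d}: lr = {one_cycle_lr(it, total):.2e}")
```

The maximum rate would normally come from the same learning rate search used for the earlier models.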