In [None]:
1. Is it OK to initialize all the weights to the same value as long as that value is selected
randomly using He initialization?


In [None]:
No. He initialization only works if every weight is sampled independently; drawing a single random value and assigning it to all the weights defeats the purpose. If all weights start out equal, every neuron in a given layer computes the same output and receives the same gradient during backpropagation, so all of them are updated identically. This symmetry persists throughout training: the layer behaves as if it had a single neuron, and the network cannot learn distinct features or patterns. The weights must therefore be sampled independently to break the symmetry and let each neuron learn something different.
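
The symmetry argument can be checked numerically. The sketch below is a minimal illustration, not part of the exercise: it assumes a tiny one-hidden-layer tanh network with a squared-error loss, and the helper name `hidden_gradients` and all numeric values are hypothetical.

```python
import math
import random

def hidden_gradients(w_hidden, w_out, x, target):
    """Return dLoss/dw for each hidden unit of a one-hidden-layer tanh network."""
    # Forward pass: hidden unit j computes tanh(sum_i w_hidden[j][i] * x[i])
    h = [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in w_hidden]
    y = sum(wo * hj for wo, hj in zip(w_out, h))
    err = y - target  # derivative of 0.5 * (y - target)**2 with respect to y
    # Backward pass: gradient of the loss with respect to each hidden weight vector
    return [[err * wo * (1 - hj ** 2) * xi for xi in x] for wo, hj in zip(w_out, h)]

x, target = [0.5, -1.0, 2.0], 1.0

# All weights share one value: every hidden unit receives the same gradient
same = hidden_gradients([[0.3] * 3] * 4, [0.3] * 4, x, target)
print(all(g == same[0] for g in same))

# Independently sampled weights: the gradients differ from unit to unit
rng = random.Random(0)
w_hidden = [[rng.gauss(0.0, 0.5) for _ in x] for _ in range(4)]
w_out = [rng.gauss(0.0, 0.5) for _ in range(4)]
diff = hidden_gradients(w_hidden, w_out, x, target)
print(any(g != diff[0] for g in diff))
```

Because identical gradients produce identical updates, the hidden units in the first case stay copies of each other after every training step.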


In [None]:
2. Is it OK to initialize the bias terms to 0?


In [None]:
Yes, initializing the bias terms to 0 is fine. Symmetry between neurons is already broken by the randomly initialized weights, so zero biases do not cause identical updates. During training the biases are adjusted like any other parameter, shifting each neuron's activation threshold as needed. Some people initialize biases the same way as weights, which also works; it makes little difference in practice.


In [None]:
3. Name three advantages of the SELU activation function over ReLU.


In [None]:
Advantages of the SELU activation function over ReLU include:
- Negative outputs: SELU can take on negative values, so the average output of the neurons in a layer is typically closer to 0 than with ReLU, which never outputs a negative value. This helps alleviate the vanishing gradients problem.
- Nonzero derivative everywhere: SELU always has a nonzero gradient, including for negative inputs, which avoids the dying units problem that can affect ReLU.
- Self-normalization: When the conditions are right (a plain stack of dense layers, standardized input features, LeCun normal initialization), each layer's output tends to keep a mean of 0 and a standard deviation of 1 during training. This mitigates vanishing and exploding gradients and makes very deep networks easier to train than with ReLU.
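
The self-normalizing behaviour can be observed in a small simulation. This is a rough sketch under stated assumptions: random standardized inputs, randomly initialized (untrained) dense layers with LeCun normal weights and no biases, and layer sizes chosen only to keep pure Python fast. The function name `layer_stats` is hypothetical.

```python
import math
import random

# Constants from the SELU definition
ALPHA = 1.6732632423543772
SCALE = 1.0507009873554805

def selu(z):
    return SCALE * (z if z > 0 else ALPHA * (math.exp(z) - 1.0))

def layer_stats(n_units=64, n_samples=64, n_layers=10, seed=42):
    """Push standardized inputs through a stack of SELU layers; return (mean, std) per layer."""
    rng = random.Random(seed)
    # Standardized inputs: each feature is drawn from N(0, 1)
    X = [[rng.gauss(0.0, 1.0) for _ in range(n_units)] for _ in range(n_samples)]
    stats = []
    for _ in range(n_layers):
        # LeCun normal initialization: std = sqrt(1 / fan_in); one row per output unit
        std = math.sqrt(1.0 / n_units)
        W = [[rng.gauss(0.0, std) for _ in range(n_units)] for _ in range(n_units)]
        X = [[selu(sum(w * x for w, x in zip(row, sample))) for row in W] for sample in X]
        values = [v for sample in X for v in sample]
        mean = sum(values) / len(values)
        var = sum((v - mean) ** 2 for v in values) / len(values)
        stats.append((mean, math.sqrt(var)))
    return stats

for depth, (mean, std) in enumerate(layer_stats(), start=1):
    print(f"layer {depth:2d}: mean {mean:+.2f}, std {std:.2f}")
```

The printed means should stay near 0 and the standard deviations near 1 at every depth, instead of shrinking or growing from layer to layer.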


In [None]:
4. In which cases would you want to use each of the following activation functions: SELU, leaky
ReLU (and its variants), ReLU, tanh, logistic, and softmax?


In [None]:
Different activation functions are suitable for different scenarios:
- SELU: A good choice for the hidden layers of deep networks made of a plain stack of dense layers, where the conditions for self-normalization can be met and stable convergence is desired.
- Leaky ReLU (and variants such as parametric leaky ReLU and randomized leaky ReLU): Useful when you want to avoid the dying ReLU problem while keeping an activation that is cheap to compute, since negative inputs still produce a small nonzero output and gradient.
- ReLU: A common default for hidden layers because of its simplicity and computational efficiency. It outputs exactly zero for negative inputs, which yields sparse activations.
- Tanh: Suitable for an output layer when the output must lie between -1 and 1. It is rarely used in hidden layers nowadays, except in recurrent networks.
- Logistic (sigmoid): Suitable for an output layer when you need a probability, for example in binary classification, because it squashes the output into the range 0 to 1. It is rarely used in hidden layers.
- Softmax: Used in the output layer for multiclass classification with mutually exclusive classes, where the outputs must be class probabilities that sum to 1.
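
For reference, here is a minimal pure-Python sketch of several of the functions listed above, showing the output ranges that drive these choices. The `alpha=0.01` slope for leaky ReLU is only an illustrative value, and tanh is taken from the standard `math` module.

```python
import math

def relu(z):
    return max(0.0, z)

def leaky_relu(z, alpha=0.01):
    # alpha is the slope for z < 0 (illustrative value)
    return z if z > 0 else alpha * z

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

def softmax(logits):
    # Subtract the largest logit first for numerical stability
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

print(relu(-2.0), leaky_relu(-2.0))     # hidden layers: zero vs. a small negative output
print(math.tanh(-2.0), logistic(-2.0))  # bounded outputs: within (-1, 1) vs. within (0, 1)
print(softmax([2.0, 1.0, 0.1]))         # one probability per class
```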


In [None]:
5. What may happen if you set the momentum hyperparameter too close to 1 (e.g., 0.99999)
when using an SGD optimizer?


In [None]:
With the momentum hyperparameter this close to 1 (e.g., 0.99999), the optimizer keeps almost all of its accumulated velocity from one step to the next. It will likely pick up a lot of speed, head roughly toward the minimum, and then shoot right past it. It then slows down, comes back, and overshoots again. It may oscillate this way many times before converging, so training takes much longer overall than with a smaller momentum value such as the common default of 0.9.
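
A one-dimensional toy problem makes the effect visible. The sketch below assumes the simple loss f(x) = x²/2, a learning rate of 0.1, and a starting point of 1.0; these values and the function name `momentum_sgd` are illustrative choices, not taken from the exercise.

```python
def momentum_sgd(beta, lr=0.1, x0=1.0, n_steps=500):
    """Minimize f(x) = x**2 / 2 with momentum optimization; return the trajectory of x."""
    x, m = x0, 0.0
    trajectory = [x]
    for _ in range(n_steps):
        grad = x                  # f'(x) = x
        m = beta * m - lr * grad  # the momentum vector accumulates past gradients
        x = x + m
        trajectory.append(x)
    return trajectory

for beta in (0.9, 0.99999):
    tail = momentum_sgd(beta)[-100:]
    print(f"beta={beta}: max |x| over the last 100 steps = {max(abs(v) for v in tail):.2e}")
```

With a momentum of 0.9 the iterate settles at the minimum within a few hundred steps, whereas with 0.99999 it is still swinging back and forth with almost its initial amplitude.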


In [None]:
6. Name three ways you can produce a sparse model.


In [None]:
Three ways to produce a sparse model are:
- L1 regularization (Lasso): Adding an L1 penalty term to the loss function pushes the optimizer to drive as many weights as possible to exactly zero, so the corresponding connections become inactive.
- Zeroing out tiny weights: Train the model normally, then set every weight whose magnitude falls below a small threshold to zero. This is the simplest approach, although it may degrade accuracy.
- Pruning during training: Iteratively remove the connections or neurons with the lowest magnitudes or contributions as training progresses, so the remaining weights can adapt. The TensorFlow Model Optimization Toolkit provides tools for this.

Note that dropout alone does not produce a sparse model: it zeroes out activations at random during training only, and the trained weights remain dense.
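
Magnitude-based pruning can be sketched in a few lines. The helper `prune_by_magnitude` and the example weights below are hypothetical; a real workflow would apply the same idea to a model's weight tensors and then check the effect on validation accuracy.

```python
def prune_by_magnitude(weights, sparsity):
    """Zero out the given fraction of weights with the smallest absolute values."""
    n_prune = int(len(weights) * sparsity)
    # Indices sorted by |weight|, smallest first
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    to_zero = set(order[:n_prune])
    return [0.0 if i in to_zero else w for i, w in enumerate(weights)]

weights = [0.8, -0.02, 0.005, -1.3, 0.04, 0.6, -0.001, 0.25]
pruned = prune_by_magnitude(weights, sparsity=0.5)
print(pruned)  # [0.8, 0.0, 0.0, -1.3, 0.0, 0.6, 0.0, 0.25]
```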



In [None]:
7. Does dropout slow down training? Does it slow down inference (i.e., making predictions on
new instances)? What about MC Dropout?


In [None]:
Yes, dropout does slow down training, in general roughly by a factor of two: each step updates only a random subnetwork, so convergence takes more epochs even though the cost of each step barely changes. In exchange, dropout reduces overfitting by limiting co-adaptation among neurons and preventing a few influential neurons from dominating. It has no impact on inference speed, because dropout is only active during training; when making predictions on new instances it is turned off and the full network is used.

MC Dropout (Monte Carlo Dropout) is exactly like dropout during training, but dropout stays active during inference as well. Instead of making a single prediction, the network is sampled multiple times with dropout enabled, and the predictions are averaged. Each prediction therefore requires several forward passes, so inference is slowed down by a factor equal to the number of samples (for example, 10 or more). In return, MC Dropout often improves predictions and provides uncertainty estimates, which is useful where uncertainty quantification matters, such as in Bayesian deep learning.
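
The mechanics of MC Dropout can be sketched without a deep learning library. The example below assumes a single logistic unit with dropout applied to its inputs; the weights, inputs, dropout rate, and function names are all hypothetical. With a Keras model, the same idea amounts to calling the model several times with dropout kept active and averaging the outputs.

```python
import math
import random

def predict(weights, x, rate=0.0, rng=None):
    """Logistic unit with inverted dropout applied to its inputs."""
    total = 0.0
    for w, xi in zip(weights, x):
        if rate > 0.0 and rng.random() < rate:
            continue                    # this input is dropped for this pass
        total += w * xi / (1.0 - rate)  # rescale the kept inputs, as during training
    return 1.0 / (1.0 + math.exp(-total))

def mc_dropout_predict(weights, x, rate=0.2, n_samples=100, seed=42):
    rng = random.Random(seed)
    # Dropout stays active, so each pass gives a slightly different prediction
    preds = [predict(weights, x, rate, rng) for _ in range(n_samples)]
    mean = sum(preds) / n_samples
    std = math.sqrt(sum((p - mean) ** 2 for p in preds) / n_samples)
    return mean, std

weights = [0.5, -1.2, 0.8, 0.3, -0.7, 1.1]  # hypothetical trained weights
x = [1.0, 0.5, -0.3, 2.0, 1.5, 0.2]
print(predict(weights, x))             # single deterministic pass, dropout off
print(mc_dropout_predict(weights, x))  # (mean, std) over 100 stochastic passes
```

The standard deviation across passes serves as a rough uncertainty estimate, and the cost is `n_samples` forward passes instead of one.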


In [None]:
8. Practice training a deep neural network on the CIFAR10 image dataset:
a. Build a DNN with 20 hidden layers of 100 neurons each (that’s too many, but it’s the
point of this exercise). Use He initialization and the ELU activation function.


In [4]:
import tensorflow as tf
from tensorflow import keras

model = keras.models.Sequential()
model.add(keras.layers.Flatten(input_shape=[32, 32, 3]))
for _ in range(20):
    model.add(keras.layers.Dense(100, activation="elu", kernel_initializer="he_normal"))
model.add(keras.layers.Dense(10, activation="softmax"))

In [3]:
pip install tensorflow

Note: you may need to restart the kernel to use updated packages.


In [None]:
b. Using Nadam optimization and early stopping, train the network on the CIFAR10
dataset. You can load it with keras.datasets.cifar10.load_data(). The dataset is
composed of 60,000 32 × 32–pixel color images (50,000 for training, 10,000 for
testing) with 10 classes, so you’ll need a softmax output layer with 10 neurons.
Remember to search for the right learning rate each time you change the model’s
architecture or hyperparameters.


In [5]:
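# Load CIFAR10 and scale the pixel values to the [0, 1] range
(X_train_full, y_train_full), (X_test, y_test) = keras.datasets.cifar10.load_data()

X_train = X_train_full.astype('float32') / 255.
X_test = X_test.astype('float32') / 255.

# One-hot encode the labels to match the categorical_crossentropy loss
y_train = keras.utils.to_categorical(y_train_full, 10)
y_test = keras.utils.to_categorical(y_test, 10)

model.compile(loss="categorical_crossentropy", optimizer=keras.optimizers.Nadam(learning_rate=0.001), metrics=["accuracy"])

# Stop once the validation loss has not improved for 10 epochs, then restore the best weights
early_stopping_cb = keras.callbacks.EarlyStopping(patience=10, restore_best_weights=True)

# The test set is used as validation data here, so early stopping is tuned on it
history = model.fit(X_train, y_train, epochs=100, validation_data=(X_test, y_test), callbacks=[early_stopping_cb])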


Downloading data from https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz




Epoch 1/100
Epoch 2/100
Epoch 3/100
Epoch 4/100
Epoch 5/100
Epoch 6/100
Epoch 7/100
Epoch 8/100

KeyboardInterrupt: 

In [None]:
c. Now try adding Batch Normalization and compare the learning curves: Is it
converging faster than before? Does it produce a better model? How does it affect
training speed?


In [6]:
model = keras.models.Sequential()
model.add(keras.layers.Flatten(input_shape=[32, 32, 3]))
for _ in range(20):
    model.add(keras.layers.Dense(100, kernel_initializer="he_normal", use_bias=False))
    model.add(keras.layers.BatchNormalization())
    model.add(keras.layers.Activation("elu"))
model.add(keras.layers.Dense(10, activation="softmax"))

model.compile(loss="categorical_crossentropy", optimizer=keras.optimizers.Nadam(learning_rate=0.001), metrics=["accuracy"])

history = model.fit(X_train, y_train, epochs=100, validation_data=(X_test, y_test), callbacks=[early_stopping_cb])




Epoch 1/100
 217/1563 [===>..........................] - ETA: 28s - loss: 2.0605 - accuracy: 0.2579

KeyboardInterrupt: 

In [None]:
d. Try replacing Batch Normalization with SELU, and make the necessary adjustments
to ensure the network self-normalizes (i.e., standardize the input features, use
LeCun normal initialization, make sure the DNN contains only a sequence of dense
layers, etc.).


In [7]:
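# Standardize the inputs (zero mean, unit variance per feature), as self-normalization requires
X_means = X_train.mean(axis=0, keepdims=True)
X_stds = X_train.std(axis=0, keepdims=True)
X_train_scaled = (X_train - X_means) / X_stds
X_test_scaled = (X_test - X_means) / X_stds

model = keras.models.Sequential()
model.add(keras.layers.Flatten(input_shape=[32, 32, 3]))
for _ in range(20):
    model.add(keras.layers.Dense(100, activation="selu", kernel_initializer="lecun_normal"))
model.add(keras.layers.Dense(10, activation="softmax"))

model.compile(loss="categorical_crossentropy", optimizer=keras.optimizers.Nadam(learning_rate=0.001), metrics=["accuracy"])

history = model.fit(X_train_scaled, y_train, epochs=100, validation_data=(X_test_scaled, y_test), callbacks=[early_stopping_cb])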




Epoch 1/100

KeyboardInterrupt: 

In [None]:
e. Try regularizing the model with alpha dropout. Then, without retraining your model,
see if you can achieve better accuracy using MC Dropout.

In [8]:
model = keras.models.Sequential()
model.add(keras.layers.Flatten(input_shape=[32, 32, 3]))
for _ in range(20):
    # Alpha dropout is designed for SELU, so pair it with SELU and LeCun normal initialization
    model.add(keras.layers.Dense(100, kernel_initializer="lecun_normal", activation="selu"))
    # A low rate is used because dropout is applied after every one of the 20 layers
    model.add(keras.layers.AlphaDropout(rate=0.1))
model.add(keras.layers.Dense(10, activation="softmax"))

model.compile(loss="categorical_crossentropy", optimizer=keras.optimizers.Nadam(learning_rate=0.001), metrics=["accuracy"])

history = model.fit(X_train, y_train, epochs=100, validation_data=(X_test, y_test), callbacks=[early_stopping_cb])




Epoch 1/100

KeyboardInterrupt: 