
1. **Is it OK to initialize all the weights to the same value as long as that value is selected randomly using He initialization?**
   No. If every weight in a layer starts at the same value, all neurons in that layer compute the same output and receive the same gradient, so backpropagation can never break the symmetry and the layer behaves as if it had a single neuron. This holds even if the shared value was drawn at random. The weights must be sampled independently; He initialization does this by drawing each weight from a distribution scaled to the layer's fan-in (variance 2 / fan_in for the normal variant).
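
   The symmetry argument can be checked numerically. The following is a minimal NumPy sketch, assuming a toy one-hidden-layer ReLU network trained with a halved mean squared error loss; the data, the layer sizes, and the helper `hidden_weight_gradient` are made up for illustration. With a single shared value, every hidden neuron receives the same gradient; with independently drawn He weights, the gradients differ.

   ```python
   import numpy as np

   rng = np.random.default_rng(0)
   n_samples, n_in, n_hidden = 32, 4, 3
   X = rng.normal(size=(n_samples, n_in))   # toy inputs (illustrative only)
   y = rng.normal(size=n_samples)           # toy regression targets

   def hidden_weight_gradient(W1, w2):
       """Gradient of the halved MSE loss with respect to the hidden weights W1."""
       z = X @ W1                            # pre-activations, shape (n_samples, n_hidden)
       h = np.maximum(0.0, z)                # ReLU
       error = h @ w2 - y                    # prediction error, shape (n_samples,)
       dz = np.outer(error, w2) * (z > 0)    # backpropagate through output layer and ReLU
       return X.T @ dz / n_samples           # shape (n_in, n_hidden)

   # One randomly drawn value, copied into every weight
   value = rng.normal(scale=np.sqrt(2.0 / n_in))
   grad_const = hidden_weight_gradient(np.full((n_in, n_hidden), value),
                                       np.full(n_hidden, value))

   # He normal initialization: each weight drawn independently, std = sqrt(2 / fan_in)
   W1_he = rng.normal(scale=np.sqrt(2.0 / n_in), size=(n_in, n_hidden))
   w2_he = rng.normal(scale=np.sqrt(2.0 / n_hidden), size=n_hidden)
   grad_he = hidden_weight_gradient(W1_he, w2_he)

   # Compare every neuron's gradient column with the first neuron's column
   same_const = bool(np.allclose(grad_const, grad_const[:, :1]))
   same_he = bool(np.allclose(grad_he, grad_he[:, :1]))
   print("shared value, identical gradients:", same_const)
   print("He init, identical gradients:", same_he)
   ```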

2. **Is it OK to initialize the bias terms to 0?**
   Yes, it is generally OK to initialize the bias terms to 0. The randomly initialized weights already break the symmetry between neurons, so zero biases do not prevent the network from learning effectively.

3. **Three advantages of the SELU activation function over ReLU:**
   - **Self-Normalizing**: In a plain stack of dense layers with LeCun normal initialization and standardized inputs, SELU keeps each layer's activations close to zero mean and unit variance, which stabilizes training and counters vanishing and exploding gradients.
   - **No Dead Neurons**: SELU has a nonzero gradient for negative inputs, so units cannot get stuck outputting zero as they can with ReLU (the "dying ReLU" problem).
   - **Faster Learning**: When the self-normalization conditions hold, SELU networks often train faster and reach better performance than ReLU networks, without needing additional normalization techniques like batch normalization.
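
   The self-normalizing behavior can be observed with a short NumPy sketch. It assumes the conditions SELU relies on: standardized inputs, LeCun normal initialization, and a plain stack of dense layers. The layer width, depth, and random seed are arbitrary choices for illustration.

   ```python
   import numpy as np

   # Constants from the SELU definition
   SELU_ALPHA = 1.6732632423543772
   SELU_SCALE = 1.0507009873554805

   def selu(z):
       # scale * z for z > 0, scale * alpha * (exp(z) - 1) otherwise
       negative_part = SELU_ALPHA * np.expm1(np.minimum(z, 0.0))
       return SELU_SCALE * np.where(z > 0, z, negative_part)

   rng = np.random.default_rng(42)
   n_units, n_layers = 100, 50
   a = rng.normal(size=(1000, n_units))      # standardized inputs: mean 0, std 1

   for _ in range(n_layers):
       # LeCun normal initialization: std = sqrt(1 / fan_in)
       W = rng.normal(scale=np.sqrt(1.0 / n_units), size=(n_units, n_units))
       a = selu(a @ W)

   print(f"after {n_layers} layers: mean={a.mean():.2f}, std={a.std():.2f}")
   ```

   The printed mean and standard deviation should stay close to 0 and 1 even after many layers, which is the property the first bullet describes.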

4. **When to use each activation function:**
   - **SELU**: Use SELU when you want self-normalizing properties, especially in deep networks without batch normalization.
   - **Leaky ReLU (and variants)**: Use when you want to avoid the dying ReLU problem and need a small gradient for negative inputs.
   - **ReLU**: Use for general purposes, especially in hidden layers of deep networks due to its simplicity and effectiveness.
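   - **Tanh**: Use when outputs must lie between -1 and 1, for example in an output layer. Its zero-centered outputs can help convergence, but it saturates for large inputs, so it is less common in deep hidden layers.
   - **Logistic (Sigmoid)**: Use in the output layer for binary classification problems, where the output is read as a probability. It is rarely used in hidden layers.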
   - **Softmax**: Use in the output layer for multi-class classification problems to get probability distributions over classes.
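
   For reference, the functions in this list can be written in a few lines of NumPy. This is a minimal sketch: the leaky ReLU slope of 0.01 is a common default chosen here as an assumption, and tanh is simply `np.tanh`.

   ```python
   import numpy as np

   def relu(z):
       return np.maximum(0.0, z)

   def leaky_relu(z, alpha=0.01):
       # Small slope alpha for negative inputs keeps the gradient nonzero
       return np.where(z > 0, z, alpha * z)

   def sigmoid(z):
       # Squashes any real number into (0, 1)
       return 1.0 / (1.0 + np.exp(-z))

   def softmax(z):
       # Subtract the row maximum for numerical stability; each row sums to 1
       e = np.exp(z - z.max(axis=-1, keepdims=True))
       return e / e.sum(axis=-1, keepdims=True)

   z = np.array([-2.0, 0.0, 3.0])
   print("relu:      ", relu(z))
   print("leaky_relu:", leaky_relu(z))
   print("tanh:      ", np.tanh(z))
   print("sigmoid:   ", sigmoid(z))
   print("softmax:   ", softmax(z))
   ```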

5. **What may happen if you set the momentum hyperparameter too close to 1 (e.g., 0.99999) when using an SGD optimizer?**
   With momentum this close to 1, the optimizer accumulates a very large velocity. It will probably head roughly toward the minimum, but it shoots right past it, slows down, comes back, and overshoots again. It may oscillate this way many times before settling, so convergence takes much longer than with a smaller momentum value such as 0.9.
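
   A rough way to see the scale of the problem: under a constant gradient, the momentum vector grows until the step size is 1 / (1 - β) times the plain gradient step, which is 10 times for β = 0.9 but about 100,000 times for β = 0.99999. The sketch below simulates momentum on the toy function f(x) = x² / 2, an assumption made purely for illustration, as are the learning rate, step count, and helper names.

   ```python
   def run_momentum(beta, lr=0.1, steps=1000, x0=1.0):
       """Minimize f(x) = x**2 / 2 (gradient = x) with momentum; return all iterates."""
       x, m = x0, 0.0
       xs = []
       for _ in range(steps):
           m = beta * m - lr * x    # momentum vector accumulates past gradients
           x = x + m                # parameter update
           xs.append(x)
       return xs

   def tail_amplitude(xs, window=50):
       """Largest distance from the minimum (x = 0) over the last `window` steps."""
       return max(abs(v) for v in xs[-window:])

   for beta in (0.9, 0.99999):
       amplitude = tail_amplitude(run_momentum(beta))
       print(f"beta={beta}: max |x| over the last 50 steps = {amplitude:.3g}")
   ```

   With β = 0.9 the iterates settle at the minimum well within the 1,000 steps, while with β = 0.99999 they are still swinging back and forth across it with almost their initial amplitude.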

6. **Three ways to produce a sparse model:**
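   - **Pruning**: Train the model normally, then set the weights that are close to zero to exactly zero.
   - **Regularization**: Use L1 regularization during training, which pushes the optimizer to drive as many weights as possible to zero.
   - **Sparse Initialization**: Start with a sparse network by initializing many weights to zero. The zeros only persist if a mask keeps those weights at zero during training; otherwise gradient updates make them nonzero again.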
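
   The pruning step can be sketched in NumPy as follows. The helper names `prune_small_weights` and `sparsity`, the threshold, and the example weights are hypothetical and only illustrate the idea on a single weight array.

   ```python
   import numpy as np

   def prune_small_weights(weights, threshold):
       """Return a copy of `weights` with entries of magnitude below `threshold` set to zero."""
       pruned = weights.copy()
       pruned[np.abs(pruned) < threshold] = 0.0
       return pruned

   def sparsity(weights):
       """Fraction of weights that are exactly zero."""
       return float(np.mean(weights == 0.0))

   w = np.array([0.5, -0.001, 0.02, -0.7, 0.0005])   # made-up trained weights
   w_pruned = prune_small_weights(w, threshold=0.01)
   print("pruned weights:", w_pruned)
   print("sparsity:", sparsity(w_pruned))
   ```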

7. **Does dropout slow down training? Does it slow down inference? What about MC Dropout?**
   - **Training**: Yes, dropout slows down training, in general by roughly a factor of two in the number of epochs needed to converge, because each step updates only a random subnetwork. The per-step cost of sampling the dropout masks is small by comparison.
   - **Inference**: Dropout has no impact on inference speed because it is active only during training. MC Dropout is the exception: it keeps dropout on at inference and averages several stochastic forward passes, so running 10 passes makes predictions about 10 times slower.
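
   The difference between the three modes can be sketched in NumPy. This is a simplified, hypothetical `dropout` function acting on a raw array rather than a Keras layer; it shows that the inference path does no extra work, while MC Dropout repeats the stochastic path `n_passes` times.

   ```python
   import numpy as np

   def dropout(x, rate, training, rng):
       """Inverted dropout: only active when training=True."""
       if not training:
           return x                                # inference: identity, no extra cost
       keep_mask = rng.random(x.shape) >= rate     # keep each unit with probability 1 - rate
       return x * keep_mask / (1.0 - rate)         # rescale so the expected value is unchanged

   def mc_dropout_mean(x, rate, n_passes, rng):
       """Average several stochastic passes: n_passes times the work of a single pass."""
       passes = [dropout(x, rate, training=True, rng=rng) for _ in range(n_passes)]
       return np.mean(passes, axis=0)

   rng = np.random.default_rng(0)
   x = np.ones(1000)
   out_inference = dropout(x, 0.5, training=False, rng=rng)
   out_training = dropout(x, 0.5, training=True, rng=rng)
   out_mc = mc_dropout_mean(x, 0.5, n_passes=200, rng=rng)
   print("distinct values during training:", np.unique(out_training))
   print("mean of the MC average:", out_mc.mean())
   ```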

8. **Practice training a deep neural network on the CIFAR10 image dataset:**
   - **a. Build a DNN with 20 hidden layers of 100 neurons each using He initialization and the ELU activation function.**
     ```python
     import tensorflow as tf
     from tensorflow.keras.layers import Dense, Flatten, ELU
     from tensorflow.keras.models import Sequential

     model = Sequential()
     model.add(Flatten(input_shape=(32, 32, 3)))
     for _ in range(20):
         model.add(Dense(100, kernel_initializer='he_normal'))
         model.add(ELU())
     model.add(Dense(10, activation='softmax'))
     ```

   - **b. Using Nadam optimization and early stopping, train the network on the CIFAR10 dataset.**
     ```python
     from tensorflow.keras.datasets import cifar10
     from tensorflow.keras.callbacks import EarlyStopping

     (x_train, y_train), (x_test, y_test) = cifar10.load_data()
     x_train, x_test = x_train / 255.0, x_test / 255.0

     model.compile(optimizer='nadam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
     early_stopping = EarlyStopping(patience=10, restore_best_weights=True)
     # Keep the History object so the learning curves can be compared later
     history_elu = model.fit(x_train, y_train, epochs=100, validation_split=0.2, callbacks=[early_stopping])
     ```

   - **c. Add Batch Normalization and compare the learning curves.**
     ```python
     from tensorflow.keras.layers import BatchNormalization

     model = Sequential()
     model.add(Flatten(input_shape=(32, 32, 3)))
     for _ in range(20):
         model.add(Dense(100, kernel_initializer='he_normal'))
         model.add(BatchNormalization())
         model.add(ELU())
     model.add(Dense(10, activation='softmax'))
     model.compile(optimizer='nadam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
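     # Keep the History object so these curves can be compared with the run without Batch Normalization
     history_bn = model.fit(x_train, y_train, epochs=100, validation_split=0.2, callbacks=[early_stopping])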
     ```
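
     One simple way to compare the two runs is to look at where the validation loss bottomed out. The helper below is a hypothetical sketch, not part of Keras: it assumes the return value of each `model.fit` call was kept (for example as `history_elu` and `history_bn`, names introduced here) and that its `.history` dict holds `val_loss` and `val_accuracy` lists, as it does when `metrics=['accuracy']` and a validation split are used. The numbers in the example are made up to show the call.

     ```python
     def summarize_history(history_dict):
         """Return (best_epoch, best_val_loss, val_accuracy_at_best_epoch), epochs counted from 1."""
         val_loss = history_dict["val_loss"]
         best_index = min(range(len(val_loss)), key=val_loss.__getitem__)
         return best_index + 1, val_loss[best_index], history_dict["val_accuracy"][best_index]

     # Made-up values purely to show the call; real ones would come from
     # history_elu.history and history_bn.history
     example_history = {
         "val_loss": [1.9, 1.7, 1.6, 1.65],
         "val_accuracy": [0.30, 0.38, 0.42, 0.41],
     }
     summary = summarize_history(example_history)
     print(summary)
     ```

     A run that reaches a lower validation loss in fewer epochs has the better learning curve, although each Batch Normalization epoch also takes longer to compute.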

   - **d. Replace Batch Normalization with SELU and make necessary adjustments.**
     ```python
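     # SELU only self-normalizes if the inputs are standardized (mean 0, std 1)
     pixel_mean = x_train.mean(axis=0, keepdims=True)
     pixel_std = x_train.std(axis=0, keepdims=True)
     x_train = (x_train - pixel_mean) / pixel_std
     x_test = (x_test - pixel_mean) / pixel_std

     model = Sequential()
     model.add(Flatten(input_shape=(32, 32, 3)))
     for _ in range(20):
         # SELU requires LeCun normal initialization and a plain stack of Dense layers
         model.add(Dense(100, kernel_initializer='lecun_normal', activation='selu'))
     model.add(Dense(10, activation='softmax'))
     model.compile(optimizer='nadam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
     model.fit(x_train, y_train, epochs=100, validation_split=0.2, callbacks=[early_stopping])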
     ```

   - **e. Regularize the model with alpha dropout and use MC Dropout for better accuracy.**
     ```python
     import numpy as np
     from tensorflow.keras.layers import AlphaDropout

     model = Sequential()
     model.add(Flatten(input_shape=(32, 32, 3)))
     for _ in range(20):
         # SELU with LeCun normal initialization keeps the network self-normalizing
         model.add(Dense(100, kernel_initializer='lecun_normal', activation='selu'))
         # AlphaDropout preserves the mean and variance of its inputs, unlike regular dropout
         model.add(AlphaDropout(0.1))
     model.add(Dense(10, activation='softmax'))
     model.compile(optimizer='nadam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
     model.fit(x_train, y_train, epochs=100, validation_split=0.2, callbacks=[early_stopping])

     # MC Dropout: keep dropout active at inference and average several stochastic predictions
     def mc_dropout_predict(model, x, n_iter=100):
         # training=True keeps the AlphaDropout layers active during each forward pass
         y_probas = np.stack([model(x, training=True).numpy() for _ in range(n_iter)])
         return y_probas.mean(axis=0), y_probas.std(axis=0)

     mean, std = mc_dropout_predict(model, x_test)
     y_pred = mean.argmax(axis=1)  # predicted class = highest average probability
     ```