## QUESTIONS :

1. Is it OK to initialize all the weights to the same value as long as that value is selected
   randomly using He initialization?
   
2. Is it OK to initialize the bias terms to 0?

3. Name three advantages of the SELU activation function over ReLU.

4. In which cases would you want to use each of the following activation functions: SELU, leaky
   ReLU (and its variants), ReLU, tanh, logistic, and softmax?
   
5. What may happen if you set the momentum hyperparameter too close to 1 (e.g., 0.99999)
   when using an SGD optimizer?
   
6. Name three ways you can produce a sparse model.

7. Does dropout slow down training? Does it slow down inference (i.e., making predictions on
   new instances)? What about MC Dropout?
   
8. Practice training a deep neural network on the CIFAR10 image dataset:

   a. Build a DNN with 20 hidden layers of 100 neurons each (that’s too many, but it’s the
   point of this exercise). Use He initialization and the ELU activation function.
   
   
   b. Using Nadam optimization and early stopping, train the network on the CIFAR10
   dataset. You can load it with `keras.datasets.cifar10.load_data()`. The dataset is
   composed of 60,000 32 × 32–pixel color images (50,000 for training, 10,000 for
   testing) with 10 classes, so you’ll need a softmax output layer with 10 neurons.
   Remember to search for the right learning rate each time you change the model’s
   architecture or hyperparameters.
   
   
   c. Now try adding Batch Normalization and compare the learning curves: Is it
   converging faster than before? Does it produce a better model? How does it affect
   training speed?
   
   
   d. Try replacing Batch Normalization with SELU, and make the necessary adjustments
   to ensure the network self-normalizes (i.e., standardize the input features, use
   LeCun normal initialization, make sure the DNN contains only a sequence of dense
   layers, etc.).
   
   e. Try regularizing the model with alpha dropout. Then, without retraining your model,
   see if you can achieve better accuracy using MC Dropout.
   
   -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------

## ANS :

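1. No. All weights should be sampled independently, not set to one shared value, even if that value is drawn using He initialization. Random initialization exists to break symmetry: if every weight in a layer starts out identical, every neuron in that layer computes the same output and receives the same gradient, so backpropagation keeps them identical and the layer behaves like a single neuron. He initialization only sets the scale of the random distribution (variance 2 / fan_in), which keeps the signal variance stable from layer to layer.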
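
The symmetry argument can be checked by hand on a tiny network. The sketch below assumes a made-up setup (three tanh hidden neurons, one linear output, squared error, illustrative weight values) and compares the hidden-layer gradients when all weights share one value with the gradients when they differ:

```python
import math

def hidden_gradients(hidden_weights, output_weights, x, target):
    """Return dLoss/dW for each hidden neuron of a tiny tanh network."""
    # Forward pass: one tanh hidden layer, then a linear output neuron.
    hidden = [math.tanh(sum(w * xi for w, xi in zip(row, x)))
              for row in hidden_weights]
    y = sum(v * h for v, h in zip(output_weights, hidden))
    error = y - target  # derivative of 0.5 * (y - target)**2 w.r.t. y
    # Backward pass: gradient of the loss for each hidden weight.
    return [[error * v * (1.0 - h * h) * xi for xi in x]
            for v, h in zip(output_weights, hidden)]

x, target = [0.5, -1.0, 2.0], 1.0  # made-up input and label

# Every weight shares one value: all three neurons get the same gradient.
same = hidden_gradients([[0.3] * 3 for _ in range(3)], [0.3] * 3, x, target)

# Independently chosen weights: the gradients differ from neuron to neuron.
varied = hidden_gradients(
    [[0.3, -0.1, 0.2], [-0.4, 0.5, 0.1], [0.2, 0.2, -0.3]],
    [0.3, -0.2, 0.5], x, target)

print(same[0] == same[1] == same[2])  # True
print(varied[0] == varied[1])         # False
```

Because the gradients are identical in the first case, a gradient descent step leaves the three neurons identical, and this repeats at every step.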

2. Yes. Initializing the bias terms to 0 is fine and is the default in most frameworks, because the randomly initialized weights already break the symmetry between neurons. A nonzero initial bias can help in specific cases, but it is rarely necessary.

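3. Three advantages of the SELU activation function over ReLU are:
   a. It can output negative values, so the mean activation of each layer stays closer to 0 than with ReLU, which never outputs negative values. This helps alleviate the vanishing gradients problem.
   b. Its derivative is nonzero for negative inputs, so it avoids the dying units problem that can affect ReLU.
   c. Under the right conditions (standardized inputs, LeCun normal initialization, a plain stack of dense layers), it makes the network self-normalizing: each layer's outputs tend to keep mean 0 and standard deviation 1, which counters vanishing/exploding gradients.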
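
The self-normalizing behavior can be observed without training anything. The following is a rough sketch under stated assumptions: one standardized random input, untrained dense layers with LeCun normal weights (variance 1 / fan_in), and made-up sizes (128 units, 16 layers). It compares how the spread of activations evolves with SELU and with ReLU:

```python
import math
import random

# SELU constants from the self-normalizing networks paper.
ALPHA = 1.6732632423543772
SCALE = 1.0507009873554805

def selu(z):
    return SCALE * (z if z > 0 else ALPHA * (math.exp(z) - 1.0))

def relu(z):
    return max(0.0, z)

def std(values):
    mean = sum(values) / len(values)
    return math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))

def final_activation_std(activation, n_units=128, n_layers=16, seed=42):
    """Push one standardized input through random dense layers and return
    the standard deviation of the last layer's activations."""
    rng = random.Random(seed)
    a = [rng.gauss(0.0, 1.0) for _ in range(n_units)]  # standardized input
    w_std = math.sqrt(1.0 / n_units)  # LeCun normal: variance 1 / fan_in
    for _ in range(n_layers):
        # Each unit draws a fresh random weight vector (no biases).
        a = [activation(sum(rng.gauss(0.0, w_std) * ai for ai in a))
             for _ in range(n_units)]
    return std(a)

selu_std = final_activation_std(selu)
relu_std = final_activation_std(relu)
print(f"SELU: {selu_std:.3f}  ReLU: {relu_std:.6f}")
```

With SELU the standard deviation should stay close to 1, while with ReLU and the same initialization the activations shrink toward 0 as depth grows (which is why ReLU is normally paired with He initialization instead).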

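4. Use cases for activation functions:
   - SELU: A good default for deep networks made of plain dense layers, where its self-normalizing property counters vanishing/exploding gradients.
   - Leaky ReLU and variants: Avoid the dying ReLU problem by keeping a small gradient for negative inputs, while staying cheaper to compute than SELU.
   - ReLU: A simple, computationally efficient default that is heavily optimized in most libraries and hardware; it can also output exactly zero, which is sometimes useful.
   - Tanh: Useful in an output layer that must produce values between -1 and 1; rarely used in hidden layers today, except in recurrent networks.
   - Logistic (sigmoid): Used in output layers that must estimate a probability, such as binary classification; rarely used in hidden layers.
   - Softmax: Used in output layers for multiclass classification with mutually exclusive classes, since its outputs are positive and sum to 1.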
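
To make the two output-layer functions concrete, here is a minimal plain-Python sketch (the logit values are made up for illustration): the logistic function maps one score to a probability, while softmax maps a vector of scores to a probability distribution over classes.

```python
import math

def sigmoid(z):
    """Logistic function: maps a single score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def softmax(logits):
    """Maps a list of scores to positive values that sum to 1."""
    m = max(logits)  # subtract the max logit for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
print([round(p, 3) for p in probs])  # [0.659, 0.242, 0.099]
print(sigmoid(0.0))                  # 0.5
```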

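5. With the momentum hyperparameter too close to 1 (e.g., 0.99999), the SGD optimizer picks up a lot of speed as it heads toward the minimum, and its momentum carries it right past the minimum. It then slows down, comes back, and overshoots again, oscillating many times before it settles, so it takes much longer to converge than with a smaller momentum value.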
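
The effect is easy to reproduce on a toy problem. This sketch assumes a one-dimensional quadratic loss f(x) = x²/2, a learning rate of 0.1, and 500 steps (all arbitrary choices), and applies the classic momentum update with two momentum values:

```python
def run_momentum_sgd(beta, lr=0.1, steps=500, x0=1.0):
    """Minimize f(x) = x**2 / 2 with momentum; return the visited x values."""
    x, m = x0, 0.0
    path = []
    for _ in range(steps):
        grad = x                  # derivative of x**2 / 2
        m = beta * m - lr * grad  # momentum accumulates past gradients
        x = x + m
        path.append(x)
    return path

def steps_to_converge(path, tol=1e-3):
    """Number of steps after which |x| stays below tol (len(path) if never)."""
    last_far = max((i for i, x in enumerate(path) if abs(x) > tol), default=-1)
    return last_far + 1

results = {}
for beta in (0.9, 0.99999):
    path = run_momentum_sgd(beta)
    tail = max(abs(x) for x in path[-50:])
    results[beta] = (steps_to_converge(path), tail)
    print(f"beta={beta}: steps to converge={results[beta][0]}, "
          f"max |x| over last 50 steps={tail:.2e}")
```

With momentum 0.9 the iterate settles near the minimum well within the 500 steps, whereas with 0.99999 it is still swinging back and forth across the minimum with almost its initial amplitude.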

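6. Three ways to produce a sparse model (i.e., one in which most weights are exactly zero):
   a. **L1 Regularization**: Adds a penalty proportional to the absolute value of each weight during training, which pushes many weights to exactly zero.
   b. **Zeroing out tiny weights**: Train the model normally, then set every weight whose magnitude falls below a small threshold to zero.
   c. **Pruning**: Remove connections based on their magnitude, either after training or gradually during training, for example with the TensorFlow Model Optimization Toolkit.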
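
Magnitude-based pruning in its simplest form can be sketched in a few lines. The weight matrix and the threshold below are hypothetical values chosen for illustration; in practice the threshold is tuned so that accuracy does not degrade too much.

```python
def zero_out_small_weights(weights, threshold):
    """Return a copy of a weight matrix with tiny weights set to exactly 0."""
    return [[0.0 if abs(w) < threshold else w for w in row] for row in weights]

def sparsity(weights):
    """Fraction of weights that are exactly zero."""
    flat = [w for row in weights for w in row]
    return sum(1 for w in flat if w == 0.0) / len(flat)

# Hypothetical trained weights, made up for illustration.
trained = [[0.80, -0.002, 0.010],
           [0.001, -0.50, 0.004],
           [0.030, 0.000, -0.90]]

pruned = zero_out_small_weights(trained, threshold=0.05)
print(pruned)
print(f"sparsity: {sparsity(pruned):.2f}")
```

Only the three large weights survive here, so six of the nine entries end up exactly zero.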

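7. Yes, dropout slows down training, often by roughly a factor of two, in exchange for less overfitting. It has no impact on inference speed, because it is active only during training. MC Dropout is identical to dropout during training, but it stays active during inference: you run the model many times (e.g., 10 or more) and average the predictions, so making predictions is slower by roughly that factor. In return, the averaged predictions are usually more reliable and their spread gives an estimate of the model's uncertainty.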
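
The cost and the benefit of MC Dropout both show up in a toy example. This sketch assumes a made-up linear "model" with dropout applied to its inputs; it is not a Keras implementation, only an illustration of averaging many stochastic forward passes:

```python
import random
import statistics

def dropout(values, rate, rng):
    """Inverted dropout: zero each value with probability `rate` and
    rescale the survivors so the expected sum is unchanged."""
    keep = 1.0 - rate
    return [v / keep if rng.random() < keep else 0.0 for v in values]

def predict(features, weights, rate=0.0, rng=None):
    """Toy linear model; dropout is applied to the inputs only when rate > 0."""
    if rate > 0.0:
        features = dropout(features, rate, rng)
    return sum(w * f for w, f in zip(weights, features))

features = [1.0] * 20  # made-up input
weights = [0.5] * 20   # made-up "trained" weights
rng = random.Random(0)

plain = predict(features, weights)  # dropout off: a single pass
samples = [predict(features, weights, rate=0.2, rng=rng)
           for _ in range(2000)]    # dropout on: 2,000 passes
mc_mean = statistics.mean(samples)
mc_std = statistics.stdev(samples)
print(f"plain={plain}, MC mean={mc_mean:.2f}, MC std={mc_std:.2f}")
```

The MC estimate costs as many forward passes as there are samples, which is where the inference slowdown comes from, and the standard deviation of the samples is the uncertainty signal that a single deterministic pass cannot provide.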

8. Answers for training a deep neural network on the CIFAR10 dataset:

   Here's a basic TensorFlow/Keras snippet for parts a and b, with notes on parts c to e:
   ```python
   import tensorflow as tf
   from tensorflow.keras import layers, models, optimizers
   from tensorflow.keras.datasets import cifar10

   # Load CIFAR10 dataset
   (X_train, y_train), (X_test, y_test) = cifar10.load_data()

   # Build DNN
   model = models.Sequential()
   model.add(layers.Flatten(input_shape=(32, 32, 3)))
   for _ in range(20):
       model.add(layers.Dense(100, kernel_initializer='he_normal', activation='elu'))

   model.add(layers.Dense(10, activation='softmax'))

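   # b. Compile and train with Nadam optimization and early stopping.
   # The learning rate below is only an assumed starting point; search for a
   # good value whenever the architecture or hyperparameters change.
   model.compile(optimizer=optimizers.Nadam(learning_rate=5e-5),
                 loss='sparse_categorical_crossentropy',
                 metrics=['accuracy'])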

   early_stopping = tf.keras.callbacks.EarlyStopping(patience=10, restore_best_weights=True)

   history = model.fit(X_train, y_train, epochs=100, validation_split=0.2, callbacks=[early_stopping])

   # c. Add Batch Normalization
   # Note: Rebuild the model with a layers.BatchNormalization() layer after each
   # hidden Dense layer, retrain, and compare the learning curves and the time
   # per epoch with the model above.

   # d. Replace Batch Normalization with SELU
   # Note: Remove the BN layers, use activation='selu' with
   # kernel_initializer='lecun_normal', and standardize the input features
   # (mean 0, standard deviation 1, computed on the training set).
   # Ensure the DNN contains only a sequence of dense layers.

   # e. Regularize with alpha dropout and try MC Dropout
   # Note: Add AlphaDropout layers. For MC Dropout, call the trained model several
   # times with dropout active (training=True) and average the predicted probabilities.
   ```
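
Part d depends on standardized input features. As a minimal plain-Python sketch of that step (the helper names `fit_standardizer` and `standardize` are hypothetical, and the tiny arrays stand in for flattened pixel values), the statistics are computed on the training set only and then reused for any other split:

```python
import math

def fit_standardizer(rows):
    """Compute per-feature mean and standard deviation from training rows."""
    n = len(rows)
    columns = list(zip(*rows))
    means = [sum(col) / n for col in columns]
    stds = [math.sqrt(sum((v - m) ** 2 for v in col) / n)
            for col, m in zip(columns, means)]
    return means, stds

def standardize(rows, means, stds, eps=1e-7):
    """Apply the training-set statistics; eps avoids division by zero."""
    return [[(v - m) / (s + eps) for v, m, s in zip(row, means, stds)]
            for row in rows]

# Tiny made-up stand-in for flattened pixel values.
train = [[0.0, 255.0], [128.0, 0.0], [255.0, 128.0]]
test = [[64.0, 64.0]]

means, stds = fit_standardizer(train)
train_scaled = standardize(train, means, stds)
test_scaled = standardize(test, means, stds)  # reuse the training statistics
print(train_scaled)
```

On the real CIFAR10 arrays the same computation would normally be done with vectorized NumPy operations rather than Python lists; the logic is unchanged.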

---------------------------------------------------------------------------------------------------------------------------------------------------------------------------