## Assignment Solutions

#### 1. Is it OK to initialize all the weights to the same value as long as that value is selected randomly using He initialization?

**Ans:** No. It is standard practice to initialize neural network weights randomly, but initializing all weights to the same value is not recommended, even if that single value is drawn randomly using He initialization.

Initializing all weights in a layer to the same value makes every neuron in that layer compute the same activation. During backpropagation these neurons then also receive identical gradients, so they remain identical after every update. The layer behaves like a single neuron, which makes the extra neurons redundant and severely limits the network's learning capacity. The purpose of random initialization is to break this symmetry so that neurons can learn distinct features and representations.

Each weight must therefore be sampled independently, whether you use He initialization or any other weight initialization technique.
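
The symmetry argument can be checked on a toy network. The following is a minimal sketch using only the Python standard library; the network size, the input values, and the helper name `hidden_grads` are assumptions made for illustration. It compares the gradients of three hidden neurons when all weights share one He-sampled value against the case where each weight is sampled independently.

```python
import math
import random

def hidden_grads(W1, w2, x, y):
    """Gradient of 0.5 * (y_hat - y)**2 w.r.t. each hidden neuron's weight vector."""
    # Forward pass: tanh hidden layer followed by a linear output neuron
    h = [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in W1]
    y_hat = sum(w * hi for w, hi in zip(w2, h))
    err = y_hat - y
    # Backward pass: one gradient row per hidden neuron
    return [[err * w2[j] * (1 - h[j] ** 2) * xi for xi in x]
            for j in range(len(W1))]

x, y = [0.5, -1.2], 1.0
random.seed(0)
he_std = math.sqrt(2 / len(x))  # He initialization: std = sqrt(2 / fan_in)

# Case 1: one He-sampled value shared by every weight
c = random.gauss(0, he_std)
same = hidden_grads([[c, c]] * 3, [c] * 3, x, y)

# Case 2: every weight sampled independently
W1 = [[random.gauss(0, he_std) for _ in x] for _ in range(3)]
w2 = [random.gauss(0, 1) for _ in range(3)]
diff = hidden_grads(W1, w2, x, y)

print("shared value, identical gradients:", same[0] == same[1] == same[2])
print("independent values, identical gradients:", diff[0] == diff[1])
```

With a shared value, the three gradient rows are equal, so a gradient step keeps the neurons identical. With independent values the rows should differ, which is what lets the neurons specialize.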

#### 2. Is it OK to initialize the bias terms to 0?

**Ans:** Yes, it is generally acceptable to initialize the bias terms to 0. As long as the weights are initialized randomly, symmetry is already broken, so zero biases do not introduce symmetry issues or reduce the learning capacity of the network.

Biases are scalar values added to each neuron in a layer, and they serve to shift the activation function of the neurons. Initializing biases to 0 means that initially, the neurons are not biased towards any particular activation value. During the training process, the network will learn the appropriate biases for each neuron based on the data it processes.

#### 3. Name three advantages of the SELU activation function over ReLU.

**Ans:** The SELU (Scaled Exponential Linear Unit) activation function offers several advantages over the commonly used ReLU (Rectified Linear Unit) activation function. Here are three advantages of SELU over ReLU:

**Self-normalizing property:** One significant advantage of SELU is its self-normalizing property. In a deep neural network, the activations can either explode or vanish as the information propagates through the layers. Under the right conditions (standardized inputs, LeCun normal weight initialization, and a plain stack of dense layers), SELU keeps the mean and variance of each layer's activations close to 0 and 1 during training. This property allows for more efficient training of deep neural networks without the need for additional normalization techniques, such as batch normalization.

**Non-zero output for negative values:** Unlike ReLU, which sets all negative inputs to zero, SELU provides a non-zero output for negative values. This can be beneficial in certain scenarios where it is important to preserve information from negative inputs. By allowing negative values to pass through, SELU can capture more nuanced patterns and gradients, potentially leading to improved performance in certain types of tasks.

**Non-zero gradient for negative inputs:** ReLU's gradient is exactly zero for every negative input, so a neuron whose weighted sum is negative for all training instances stops receiving updates (the "dying ReLU" problem). SELU's gradient is non-zero for negative inputs, so such a neuron can still recover. This results in more stable and consistent weight updates during training. Both functions have a kink at zero (SELU's slope changes there too), so the advantage lies in the non-zero gradient rather than in smoothness at the origin.
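
The behavior of the two functions for negative inputs can be checked numerically. This is a small sketch with plain Python functions rather than Keras layers; the constants are the standard SELU values, and the function names are chosen here for illustration.

```python
import math

# Standard SELU constants (alpha and scale)
SELU_ALPHA = 1.6732632423543772
SELU_SCALE = 1.0507009873554805

def relu(z):
    return max(0.0, z)

def relu_grad(z):
    return 1.0 if z > 0 else 0.0

def selu(z):
    return SELU_SCALE * (z if z > 0 else SELU_ALPHA * (math.exp(z) - 1.0))

def selu_grad(z):
    return SELU_SCALE * (1.0 if z > 0 else SELU_ALPHA * math.exp(z))

for z in (-2.0, -0.5, 0.5, 2.0):
    print(f"z={z:+.1f}  relu={relu(z):+.4f} (grad {relu_grad(z):.4f})  "
          f"selu={selu(z):+.4f} (grad {selu_grad(z):.4f})")
```

For negative `z`, `relu` and `relu_grad` are both zero, while `selu` returns a negative value and `selu_grad` stays positive.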

#### 4. In which cases would you want to use each of the following activation functions: SELU, leaky ReLU (and its variants), ReLU, tanh, logistic, and softmax?

**Ans:** \
**SELU (Scaled Exponential Linear Unit):** \
Deep neural networks: SELU is particularly useful in deep neural networks due to its self-normalizing property. It helps stabilize the mean and variance of activations, enabling more efficient training without the need for additional normalization techniques. \
Tasks with negative values: SELU allows negative values to pass through while maintaining non-zero outputs, which can be advantageous when negative information needs to be preserved.

**Leaky ReLU and its variants:** \
Avoiding dead neurons: Leaky ReLU helps mitigate the "dying ReLU" problem, where ReLU neurons can become permanently inactive. By allowing a small negative slope for negative inputs, leaky ReLU ensures that all neurons can contribute to the learning process, preventing dead neurons. \
Improved gradient propagation: Leaky ReLU and its variants can facilitate better gradient flow during backpropagation compared to traditional ReLU, especially when dealing with deep networks.

**ReLU (Rectified Linear Unit):** \
General-purpose activation: ReLU is widely used as a default choice for activation functions in deep learning. It is computationally efficient and introduces non-linearity into the network, which is crucial for learning complex representations. \
Sparse activations: ReLU tends to produce sparse activations, which can be beneficial for scenarios where sparsity is desirable or when dealing with high-dimensional inputs.

**tanh (Hyperbolic tangent):** \
Symmetric activation: tanh produces values between -1 and 1, centered around 0. It is often used in scenarios where symmetric activation is desired, such as in recurrent neural networks (RNNs) or autoencoders. \
Capturing negative values: tanh can capture both positive and negative values, making it suitable for tasks where the input range spans both sides of zero.

**logistic (Sigmoid):** \
Binary classification: The logistic function is commonly used in binary classification tasks where the output is required to be between 0 and 1, representing probabilities or binary decisions. \
Output probability mapping: logistic is suitable for mapping arbitrary real values to a probability range (0 to 1), making it useful in tasks like logistic regression or as the final layer activation in multi-label classification.

**softmax:** \
Multi-class classification: Softmax is typically used in the output layer for multi-class classification problems, where each input sample belongs to exactly one of several mutually exclusive classes. It produces a probability distribution over the classes, ensuring that the output probabilities sum to 1.
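
The difference between the logistic and softmax output activations can be seen on a single vector of logits. This is a minimal sketch in plain Python; the logit values are made up for illustration.

```python
import math

def logistic(z):
    """Map one real value to (0, 1), independently of any other output."""
    return 1.0 / (1.0 + math.exp(-z))

def softmax(logits):
    """Map a vector of logits to a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, -1.0]

# Mutually exclusive classes: probabilities compete and sum to 1
exclusive = softmax(logits)

# Multi-label outputs: each label gets its own independent probability
independent = [logistic(z) for z in logits]

print("softmax :", exclusive, "sum =", sum(exclusive))
print("logistic:", independent, "sum =", sum(independent))
```

The softmax outputs sum to 1, which suits mutually exclusive classes. The logistic outputs are each in (0, 1) but need not sum to 1, which suits binary or multi-label outputs.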

#### 5. What may happen if you set the momentum hyperparameter too close to 1 (e.g., 0.99999) when using an SGD optimizer?

**Ans:** When setting the momentum hyperparameter too close to 1 (e.g., 0.99999) in Stochastic Gradient Descent (SGD) optimization, the following issues may arise:

Overshooting: Momentum helps accelerate SGD by accumulating a fraction of the previous updates to determine the current update direction. When the momentum value is extremely close to 1, it means that the updates from previous iterations have a significant impact on the current update. This can lead to overshooting, where the optimizer overshoots the optimal solution and keeps oscillating around it, struggling to converge.

Slow convergence or divergence: With a momentum value that is too close to 1, the updates become increasingly influenced by past gradients. This can cause the optimizer to continue moving in a particular direction even when it encounters a steep gradient that suggests changing course. As a result, the convergence can become slow, as the optimizer fails to respond effectively to local gradient information. In extreme cases, the optimizer may even diverge, leading to an unstable training process.

Poor settling into minima: Moderate momentum can help the optimizer roll past shallow local minima and plateaus. With momentum this close to 1, however, the accumulated velocity is so persistent that the optimizer also shoots past good minima. Each new gradient changes the velocity only slightly, so the optimizer needs many oscillations before it can settle.

In general, momentum values this close to 1 (such as 0.99999) are not recommended unless they have been observed to help in a specific scenario. Commonly used values range from 0.9 to 0.99, striking a balance between using past gradients and staying responsive to the current gradient.
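
The oscillation described above can be reproduced on a one-dimensional quadratic. The sketch below is a toy simulation, not Keras' optimizer: the loss f(x) = 0.5 * x**2, the learning rate, the step count, and the helper name `tail_amplitude` are all assumptions chosen for illustration.

```python
def tail_amplitude(momentum, lr=0.1, steps=500, tail=100):
    """Run momentum SGD on f(x) = 0.5 * x**2 from x = 1.

    Returns the largest |x| seen during the last `tail` steps, which
    measures how far the optimizer still swings around the minimum at 0.
    """
    x, v = 1.0, 0.0
    history = []
    for _ in range(steps):
        grad = x                      # f'(x) = x
        v = momentum * v - lr * grad  # velocity keeps a fraction of past updates
        x = x + v
        history.append(abs(x))
    return max(history[-tail:])

print("momentum=0.9     :", tail_amplitude(0.9))
print("momentum=0.99999 :", tail_amplitude(0.99999))
```

With momentum 0.9 the swings should have decayed to almost nothing after 500 steps, while with momentum 0.99999 the optimizer should still be oscillating with an amplitude close to its starting distance from the minimum.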

#### 6. Name three ways you can produce a sparse model.

**Ans:** To produce a sparse model, where only a subset of the model's parameters are non-zero or active, several techniques can be employed. Here are three common ways to achieve sparsity in models:

**L1 Regularization (Lasso regularization):** L1 regularization is a technique that adds a penalty term to the loss function of a model, encouraging the model to learn sparse representations. By adding the absolute values of the model's parameters as the regularization term, L1 regularization promotes the shrinking or elimination of irrelevant or less important features. This encourages the model to focus on a subset of the most relevant features, resulting in a sparse model.

**Group Lasso Regularization:** Group Lasso regularization extends the concept of L1 regularization by promoting sparsity at the group level. Instead of penalizing individual parameters, it penalizes entire groups of parameters together. This is particularly useful when dealing with structured data where features or parameters can be grouped together, such as in image processing or natural language processing tasks. Group Lasso encourages the model to select entire groups of features while setting some groups to zero, resulting in sparsity at the group level.

**Pruning small weights:** Train the model normally, then set the weights with the smallest magnitudes to zero, optionally fine-tuning the remaining weights afterwards. This directly yields a model in which only a subset of the parameters is non-zero. Dropout, by contrast, does not produce a sparse model: it zeroes randomly selected units only during training, which regularizes the network and helps prevent overfitting, but all units are active at inference and the learned weights remain dense.
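
What "sparse" means for the weights can be made concrete with a short sketch. It shows soft-thresholding, which is the update an L1 penalty induces in proximal gradient methods, and magnitude pruning, which zeroes the smallest weights of an already trained model and is another common route to sparsity. The weight values, the threshold, and the function names are illustrative assumptions.

```python
def soft_threshold(w, t):
    """Proximal step for an L1 penalty: shrink w toward 0 by t, clipping at exactly 0."""
    if w > t:
        return w - t
    if w < -t:
        return w + t
    return 0.0

def prune_smallest(weights, fraction):
    """Zero out (at least) the given fraction of weights with the smallest magnitudes."""
    k = int(len(weights) * fraction)
    if k == 0:
        return list(weights)
    cutoff = sorted(abs(w) for w in weights)[k - 1]
    # Ties at the cutoff are pruned as well
    return [0.0 if abs(w) <= cutoff else w for w in weights]

weights = [0.8, -0.05, 0.02, -1.3, 0.4, 0.01]

shrunk = [soft_threshold(w, 0.1) for w in weights]
pruned = prune_smallest(weights, 0.5)

print("after L1 soft-thresholding:", shrunk)
print("after magnitude pruning   :", pruned)
```

In both cases the three small weights become exactly zero. Soft-thresholding also shrinks the surviving weights, whereas pruning leaves them untouched.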

#### 7. Does dropout slow down training? Does it slow down inference (i.e., making predictions on new instances)? What about MC Dropout?

**Ans:** Yes, dropout generally slows down training: because a different random subset of units is dropped at each step, the network typically needs more training steps to converge. In exchange, it helps prevent overfitting. Dropout does not slow down inference, because it is only active during training and the entire network is used for predictions. \
MC Dropout (Monte Carlo Dropout) is an extension of the dropout technique that performs a form of approximate Bayesian inference at prediction time. Instead of using the model once per prediction, MC Dropout runs multiple forward passes with dropout enabled and averages the predictions across the passes. This lets the model capture the uncertainty associated with its predictions. Training is unchanged, but inference is slower: with N forward passes, each prediction costs about N times as much as a single pass.
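
The cost and benefit of MC Dropout can be illustrated on a single linear unit. This is a minimal sketch using only the standard library, not a Keras implementation; the inputs, weights, dropout rate, and function names are assumptions for illustration.

```python
import random

def forward(x, weights, drop_rate=0.0, rng=None):
    """One forward pass of a linear unit with inverted dropout on its inputs."""
    keep = 1.0 - drop_rate
    total = 0.0
    for xi, wi in zip(x, weights):
        if drop_rate > 0.0 and rng.random() < drop_rate:
            continue               # this input is dropped for this pass
        total += wi * xi / keep    # rescale survivors to keep the expected sum unchanged
    return total

def mc_dropout_predict(x, weights, drop_rate, n_passes, seed=42):
    """Average n_passes stochastic forward passes; return (mean, standard deviation)."""
    rng = random.Random(seed)
    preds = [forward(x, weights, drop_rate, rng) for _ in range(n_passes)]
    mean = sum(preds) / n_passes
    var = sum((p - mean) ** 2 for p in preds) / n_passes
    return mean, var ** 0.5

x = [1.0, 2.0, -1.0, 0.5]
weights = [0.5, -0.25, 0.75, 1.0]

standard = forward(x, weights)                              # dropout off: one pass
mean, std = mc_dropout_predict(x, weights, 0.2, n_passes=2000)

print("standard prediction :", standard)
print("MC Dropout mean, std:", mean, std)
```

The MC Dropout mean should land close to the standard prediction, and the standard deviation gives an uncertainty estimate that a single pass cannot provide. The price is `n_passes` forward passes per prediction instead of one.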

#### 8. Practice training a deep neural network on the CIFAR10 image dataset:

- **a) Build a DNN with 20 hidden layers of 100 neurons each (that’s too many, but it’s the point of this exercise). Use He initialization and the ELU activation function.**

In [4]:
import tensorflow as tf 
from tensorflow import keras

# Load the CIFAR-10 dataset
(x_train, y_train), (x_test, y_test) = keras.datasets.cifar10.load_data()

# Normalize pixel values to the range [0, 1]
x_train = x_train / 255.0
x_test = x_test / 255.0

# Model
model = keras.Sequential()

# Add the input layer
model.add(keras.layers.Flatten(input_shape=(32, 32, 3)))

# Add 20 hidden layers with 100 neurons each, using He initialization and ELU activation
for _ in range(20):
    model.add(keras.layers.Dense(100, activation='elu', kernel_initializer='he_normal'))

# Add the output layer
model.add(keras.layers.Dense(10, activation='softmax'))

# Compile the model
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])

# Train the model
model.fit(x_train, y_train, epochs=10, batch_size=128, validation_data=(x_test, y_test))


Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10



- **b) Using Nadam optimization and early stopping, train the network on the CIFAR10 dataset. You can load it with keras.datasets.cifar10.load_data(). The dataset is composed of 60,000 32 × 32–pixel color images (50,000 for training, 10,000 for testing) with 10 classes, so you’ll need a softmax output layer with 10 neurons. Remember to search for the right learning rate each time you change the model’s architecture or hyperparameters.**

In [10]:
import tensorflow as tf
from tensorflow import keras

# Load CIFAR-10 dataset
(x_train, y_train), (x_test, y_test) = keras.datasets.cifar10.load_data()

# Normalize pixel values between 0 and 1
x_train = x_train / 255.0
x_test = x_test / 255.0

# Define the deep neural network model (3 hidden layers here, not the 20 built in part a)
model = keras.Sequential([
    keras.layers.Flatten(input_shape=(32, 32, 3)),
    keras.layers.Dense(100, activation='elu', kernel_initializer='he_normal'),
    keras.layers.Dense(100, activation='elu', kernel_initializer='he_normal'),
    keras.layers.Dense(100, activation='elu', kernel_initializer='he_normal'),
    keras.layers.Dense(10, activation='softmax')
])

# Compile the model with Nadam optimizer and sparse categorical cross-entropy loss
model.compile(optimizer=keras.optimizers.Nadam(), loss='sparse_categorical_crossentropy', metrics=['accuracy'])

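# Define early stopping callback: stop after 5 epochs without improvement in
# val_loss (the default monitored metric) and restore the best weights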
early_stopping = keras.callbacks.EarlyStopping(patience=5, restore_best_weights=True)

# Train the model with early stopping
history = model.fit(x_train, y_train, epochs=100, batch_size=128, validation_data=(x_test, y_test), callbacks=[early_stopping])


Epoch 1/100
Epoch 2/100
Epoch 3/100
Epoch 4/100
Epoch 5/100
Epoch 6/100
Epoch 7/100
Epoch 8/100
Epoch 9/100
Epoch 10/100
Epoch 11/100
Epoch 12/100
Epoch 13/100
Epoch 14/100
Epoch 15/100
Epoch 16/100
Epoch 17/100
Epoch 18/100


- **c) Now try adding Batch Normalization and compare the learning curves: Is it converging faster than before? Does it produce a better model? How does it affect training speed?**

In [12]:
import tensorflow as tf
from tensorflow import keras

# Load cifar-10 dataset
(x_train, y_train), (x_test, y_test) = keras.datasets.cifar10.load_data()

# Normalize pixel values between 0 and 1 
x_train = x_train / 255.0
x_test = x_test / 255.0

# Define the deep neural network model with Batch Normalization
model = keras.Sequential([
    keras.layers.Flatten(input_shape=(32, 32, 3)),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(100, activation='elu', kernel_initializer='he_normal'),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(100, activation='elu', kernel_initializer='he_normal'),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(100, activation='elu', kernel_initializer='he_normal'),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(10, activation='softmax')
])

# Compile the model with Nadam optimizer and sparse categorical cross-entropy loss
model.compile(optimizer=keras.optimizers.Nadam(), loss='sparse_categorical_crossentropy', metrics=['accuracy'])

# Define early stopping callback
early_stopping = keras.callbacks.EarlyStopping(patience=5, restore_best_weights=True)

# Train the model with early stopping
history = model.fit(x_train, y_train, epochs=100, batch_size=128, validation_data=(x_test, y_test), callbacks=[early_stopping])

Epoch 1/100
Epoch 2/100
Epoch 3/100
Epoch 4/100
Epoch 5/100
Epoch 6/100
Epoch 7/100
Epoch 8/100
Epoch 9/100
Epoch 10/100
Epoch 11/100
Epoch 12/100


- **d) Try regularizing the model with alpha dropout. Then, without retraining your model, see if you can achieve better accuracy using MC Dropout.**

In [1]:
import tensorflow as tf
from tensorflow import keras

# Load CIFAR-10 dataset
(x_train, y_train), (x_test, y_test) = keras.datasets.cifar10.load_data()

# Normalize pixel values between 0 and 1
x_train = x_train / 255.0
x_test = x_test / 255.0

# Define the deep neural network model with Alpha Dropout
model_alpha_dropout = keras.Sequential([
    keras.layers.Flatten(input_shape=(32, 32, 3)),
    keras.layers.Dense(100, activation='elu', kernel_initializer='he_normal'),
    keras.layers.AlphaDropout(0.2),
    keras.layers.Dense(100, activation='elu', kernel_initializer='he_normal'),
    keras.layers.AlphaDropout(0.2),
    keras.layers.Dense(100, activation='elu', kernel_initializer='he_normal'),
    keras.layers.AlphaDropout(0.2),
    keras.layers.Dense(10, activation='softmax')
])

# Compile the model with Nadam optimizer and sparse categorical cross-entropy loss
model_alpha_dropout.compile(optimizer=keras.optimizers.Nadam(), loss='sparse_categorical_crossentropy', metrics=['accuracy'])

# Train the model with alpha dropout
history_alpha_dropout = model_alpha_dropout.fit(x_train, y_train, epochs=25, batch_size=128, validation_data=(x_test, y_test))

# Evaluate the model with alpha dropout on the test data
_, accuracy_alpha_dropout = model_alpha_dropout.evaluate(x_test, y_test)

# MC Dropout: reuse the trained model without retraining, but keep dropout active
# at inference by calling it with training=True, then average the predicted class
# probabilities over several stochastic forward passes.
# (model.evaluate() runs with dropout disabled, so it cannot be used for this.)
import numpy as np

n_passes = 20  # number of stochastic forward passes (illustrative choice)
y_probas = np.stack([model_alpha_dropout(x_test, training=True).numpy()
                     for _ in range(n_passes)])
y_proba_mean = y_probas.mean(axis=0)

# Accuracy of the averaged predictions (y_test has shape (10000, 1), hence ravel)
accuracy_mc_dropout = np.mean(y_proba_mean.argmax(axis=1) == y_test.ravel())

print("Accuracy with Alpha Dropout:", accuracy_alpha_dropout)
print("Accuracy with MC Dropout:", accuracy_mc_dropout)


Epoch 1/25
Epoch 2/25
Epoch 3/25
Epoch 4/25
Epoch 5/25
Epoch 6/25
Epoch 7/25
Epoch 8/25
Epoch 9/25
Epoch 10/25
Epoch 11/25
Epoch 12/25
Epoch 13/25
Epoch 14/25
Epoch 15/25
Epoch 16/25
Epoch 17/25
Epoch 18/25
Epoch 19/25
Epoch 20/25
Epoch 21/25
Epoch 22/25
Epoch 23/25
Epoch 24/25
Epoch 25/25
Accuracy with Alpha Dropout: 0.4918999969959259
