# Chapter 11: Training Deep Neural Networks Exercises

## 1.

> Is it OK to initialize all the weights to the same value as long as that value is selected randomly using He initialization?

No. Every weight must be sampled independently; drawing a single value from the He distribution and copying it into all the weights does not break the symmetry. If all the weights in a layer are equal, every neuron in that layer computes the same output and receives the same gradient during backpropagation, so the neurons stay identical throughout training. The network then behaves as if it had only one neuron per layer, and it is very unlikely to converge to a good solution.
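
The symmetry argument can be checked numerically. The sketch below is a minimal illustration, assuming a hypothetical 3-4-1 tanh network without biases, random data, and a half-MSE loss (none of which come from the exercise). It compares the hidden-layer gradient when one random value is copied into every weight against the gradient when every weight is drawn independently.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))      # 8 instances, 3 features (made-up data)
y = rng.normal(size=(8, 1))      # made-up regression targets

def hidden_weight_grad(W1, W2):
    """Gradient of 0.5 * MSE w.r.t. W1 for a 3-4-1 tanh network (no biases)."""
    h = np.tanh(X @ W1)                  # hidden activations, shape (8, 4)
    err = h @ W2 - y                     # prediction error, shape (8, 1)
    dh = (err @ W2.T) * (1.0 - h ** 2)   # backpropagate through tanh
    return X.T @ dh / len(X)             # shape (3, 4): one column per hidden neuron

# Case 1: a single value drawn at random, then copied into every weight
c = rng.normal()
g_same = hidden_weight_grad(np.full((3, 4), c), np.full((4, 1), c))

# Case 2: every weight drawn independently (He-style scale: sqrt(2 / fan_in))
W1 = rng.normal(scale=np.sqrt(2 / 3), size=(3, 4))
W2 = rng.normal(scale=np.sqrt(2 / 4), size=(4, 1))
g_indep = hidden_weight_grad(W1, W2)

# True when every hidden neuron receives exactly the same gradient column
same_cols = np.allclose(g_same, g_same[:, :1])
indep_cols = np.allclose(g_indep, g_indep[:, :1])
print("identical gradients (copied value):", same_cols)
print("identical gradients (independent draws):", indep_cols)
```

With the copied value, every hidden neuron gets the same update, so the neurons can never become different from one another.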

## 2.

> Is it OK to initialize the bias terms to 0?

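Yes, it's okay to initialize the bias terms to 0. The randomly initialized weights already break the symmetry, so the biases do not need to be random. Some people initialize them just like the weights, which also works; it makes little difference in practice.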

## 3.

> Name three advantages of the SELU activation function over ReLU.

- Under the right conditions (see exercise 4), the network self-normalizes: the output of each layer tends to preserve a mean of 0 and a standard deviation of 1 during training, which addresses the vanishing/exploding gradients problem.

- It is a scaled variant of the ELU activation function, so it takes negative values when $z<0$, which brings the average output closer to 0 and alleviates the vanishing gradients problem.

- Like ELU, it has a nonzero gradient for $z<0$, which avoids the dead neurons problem (dying ReLUs).
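
The last two points can be seen directly from the function definitions. The following is a small NumPy sketch, not a library implementation; the constants are the commonly quoted approximate SELU values, and the helper names are chosen here for illustration.

```python
import numpy as np

# Approximate SELU constants (assumed here; they match the usual published values)
ALPHA = 1.6732632423543772
SCALE = 1.0507009873554805

def relu(z):
    return np.maximum(0.0, z)

def relu_grad(z):
    return (z > 0).astype(float)

def selu(z):
    # expm1 only ever sees non-positive inputs, so it cannot overflow
    return SCALE * np.where(z > 0, z, ALPHA * np.expm1(np.minimum(z, 0.0)))

def selu_grad(z):
    return SCALE * np.where(z > 0, 1.0, ALPHA * np.exp(np.minimum(z, 0.0)))

z = np.array([-3.0, -1.0, -0.1, 0.5, 2.0])
print("relu      :", relu(z))
print("selu      :", selu(z))        # negative outputs for z < 0
print("relu grad :", relu_grad(z))   # exactly 0 for z < 0
print("selu grad :", selu_grad(z))   # small but nonzero for z < 0
```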

## 4.

> In which cases would you want to use each of the following activation functions: SELU, leaky ReLU (and its variants), ReLU, tanh, logistic, and softmax?

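In general, for hidden layers: $\text{SELU} > \text{ELU} > \text{leaky ReLU (and its variants)} > \text{ReLU} > \tanh > \text{logistic}$. Softmax is not part of this ranking because it is used in the output layer rather than in hidden layers.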

SELU:
- Tends to give the best performance, but the network only self-normalizes when all of the conditions below hold.
- Input features must be standardized (mean 0, standard deviation 1).
- Every hidden layer's weights must be initialized with LeCun normal initialization.
- The network's architecture must be sequential.
- All layers must be dense.
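
To see what these conditions buy, the sketch below pushes standardized random inputs through a plain stack of 20 dense layers with LeCun normal weights and tracks the activation statistics. It is a NumPy simulation under assumed sizes (100 units, 1,000 instances, no biases), not a trained model. The ReLU stack with the same initialization is included only as a contrast.

```python
import numpy as np

ALPHA, SCALE = 1.6732632423543772, 1.0507009873554805  # approximate SELU constants

def selu(z):
    return SCALE * np.where(z > 0, z, ALPHA * np.expm1(np.minimum(z, 0.0)))

rng = np.random.default_rng(42)
n_units = 100
# Standardized inputs: mean 0, standard deviation 1
a_selu = a_relu = rng.standard_normal((1000, n_units))

for _ in range(20):  # a purely sequential stack of dense layers
    # LeCun normal initialization: std = sqrt(1 / fan_in)
    W = rng.normal(scale=np.sqrt(1.0 / n_units), size=(n_units, n_units))
    a_selu = selu(a_selu @ W)
    a_relu = np.maximum(0.0, a_relu @ W)

print(f"SELU after 20 layers: mean={a_selu.mean():.2f}, std={a_selu.std():.2f}")
print(f"ReLU after 20 layers: mean={a_relu.mean():.4f}, std={a_relu.std():.4f}")
```

The SELU statistics should stay roughly near mean 0 and standard deviation 1, while the ReLU activations shrink toward 0 with this initialization.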

leaky ReLU (and its variants):
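- A good choice if you care about runtime latency, since it is cheaper to compute than ELU or SELU.

ReLU:
- If you care about speed and simplicity.
- It is the most widely used activation function, so many libraries and hardware accelerators provide optimizations for it.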

tanh:
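- Useful in the output layer when the output must fall within a bounded range.
- Has a range of -1 to 1.

logistic:
- Useful in the output layer when you need to estimate a probability (e.g. binary classification).
- Has a range of 0 to 1.

softmax:
- Mostly used in the output layer for multiclass classification.
- Outputs estimated class probabilities between 0 and 1.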
- All probabilities add up to 1.
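
The two softmax properties listed above can be verified with a short NumPy sketch. The `softmax` helper and the example logits are illustrative, not taken from any library.

```python
import numpy as np

def softmax(logits):
    # Subtracting the row-wise max avoids overflow and does not change the result
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

logits = np.array([[2.0, 1.0, 0.1],
                   [1000.0, 1000.0, 1000.0]])  # a naive exp() would overflow here
probs = softmax(logits)
print(probs)
print(probs.sum(axis=-1))  # each row sums to 1
```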

## 5.

> What may happen if you set the `momentum` hyperparameter too close to 1 (e.g. 0.99999) when using an SGD optimizer?

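If the `momentum` hyperparameter is set too close to 1, such as 0.99999, there is barely any friction. The optimizer picks up a lot of speed and its momentum carries it right past the optimum; it then slows down, comes back, and overshoots again. It may oscillate like this many times before converging, so training takes much longer than with a smaller momentum value.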
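
A toy simulation makes the lack of friction visible. The sketch below minimizes the one-dimensional function $f(w) = \frac{1}{2}w^2$, an assumption chosen for simplicity, using the classical momentum update (velocity = momentum × velocity − learning rate × gradient). The learning rate and step count are arbitrary.

```python
def run_sgd_momentum(momentum, lr=0.1, steps=500, w0=1.0):
    """Minimize f(w) = 0.5 * w**2 (gradient: w) with classical momentum."""
    w, v = w0, 0.0
    history = []
    for _ in range(steps):
        grad = w                        # derivative of 0.5 * w**2
        v = momentum * v - lr * grad    # velocity accumulates past gradients
        w = w + v
        history.append(w)
    return history

for m in (0.9, 0.99999):
    h = run_sgd_momentum(m)
    # Largest distance from the optimum (w = 0) over the last 100 steps
    tail = max(abs(w) for w in h[-100:])
    print(f"momentum={m}: largest |w| over the last 100 steps = {tail:.2e}")
```

With momentum 0.9 the oscillations die out quickly; with 0.99999 the parameter is still swinging back and forth across the optimum with almost its initial amplitude after the same number of steps.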

## 6.

> Name three ways you can produce a sparse model.

- Train the model as usual, then zero out the tiny weights.
- Apply strong $\ell_1$ regularization during training, which pushes the optimizer to zero out as many weights as it can.
- Use the TensorFlow Model Optimization Toolkit (TF-MOT) to iteratively remove connections during training.
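
The first approach is simple enough to sketch in NumPy. The helper names, the threshold, and the random matrix standing in for trained weights are all assumptions made for illustration.

```python
import numpy as np

def prune_small_weights(weights, threshold):
    """Return a copy of `weights` with entries of magnitude below `threshold` set to 0."""
    pruned = weights.copy()
    pruned[np.abs(pruned) < threshold] = 0.0
    return pruned

def sparsity(weights):
    """Fraction of weights that are exactly zero."""
    return float(np.mean(weights == 0.0))

rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(100, 100))   # stand-in for a trained weight matrix
W_pruned = prune_small_weights(W, threshold=0.05)
print(f"sparsity before: {sparsity(W):.2%}, after: {sparsity(W_pruned):.2%}")
```

In practice the threshold would be tuned against validation accuracy, since zeroing too many weights degrades the model.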

## 7.

> Does dropout slow down training?

> Does it slow down inference (i.e. making predictions on new instances)?

> What about MC Dropout?

Dropout does slow down training, in general by roughly a factor of two: at every training step each neuron has a chance of being ignored, so the network needs more steps to converge to a solution.

Dropout does not slow down inference because it is only active during training.

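MC Dropout is exactly like regular dropout during training, so training is slowed down in the same way. However, dropout stays active during inference, and predictions are averaged over many stochastic forward passes. Inference time therefore grows linearly with the number of Monte Carlo samples: 10 samples make predictions about 10 times slower, and doubling the number of samples doubles the inference time.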
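
The cost difference can be made concrete by counting forward passes. The sketch below uses a hypothetical two-layer NumPy network with random weights standing in for a trained model; the `training` flag mimics keeping dropout active at prediction time.

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 16))    # hypothetical "trained" weights
W2 = rng.normal(size=(16, 3))
calls = {"forward": 0}           # counts forward passes

def predict(X, rate=0.2, training=False):
    """One forward pass; dropout is applied only when training=True."""
    calls["forward"] += 1
    h = np.maximum(0.0, X @ W1)
    if training:
        keep = rng.random(h.shape) >= rate
        h = h * keep / (1.0 - rate)   # inverted dropout keeps the expected scale
    logits = h @ W2
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def mc_dropout_predict(X, n_samples):
    """Average class probabilities over n_samples stochastic forward passes."""
    return np.mean([predict(X, training=True) for _ in range(n_samples)], axis=0)

X = rng.normal(size=(5, 4))
y_plain = predict(X)                           # regular inference: 1 pass
y_mc = mc_dropout_predict(X, n_samples=100)    # MC Dropout: 100 passes
print("forward passes so far:", calls["forward"])
```

Regular inference needs a single deterministic pass, while MC Dropout with 100 samples needs 100 passes over the same inputs.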

## 8.

> Practice training a deep neural network on the CIFAR10 image dataset.

> a. Build a DNN with 20 hidden layers of 100 neurons each (that's too many, but it's the point of this exercise). Use He initialization and the ELU activation function.

> b. Using Nadam optimization and early stopping, train the network on the CIFAR10 dataset.

>> - You can load it with `keras.datasets.cifar10.load_data()`.
>> - The dataset is composed of 60,000 32x32-pixel color images (50,000 for training, 10,000 for testing) with 10 classes.
>> - So you'll need a softmax output layer with 10 neurons.
>> - Remember to search for the right learning rate each time you change the model's architecture or hyperparameters.

> c. Now try adding Batch Normalization and compare the learning curves:

>> - Is it converging faster than before?
>> - Does it produce a better model?
>> - How does it affect training speed?

> d. Try replacing Batch Normalization with SELU, and make the necessary adjustments to ensure the network self-normalizes (i.e. standardize the input features, use LeCun normal initialization, make sure the DNN contains only a sequence of dense layers, etc.).

> e. Try regularizing the model with alpha dropout. Then, without retraining your model, see if you can achieve better accuracy using MC Dropout.

> f. Retrain your model using 1cycle scheduling and see if it improves training speed and model accuracy.
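
For part f, one possible shape for a 1cycle learning rate schedule is sketched below in plain Python. The function name, the default ratios (start at a tenth of the maximum rate, finish a thousand times lower), and the 10% tail are assumptions made for illustration; this is not a Keras API and would still need to be wired into a training loop or callback.

```python
def one_cycle_lr(step, total_steps, max_lr, start_lr=None, final_lr=None, tail_frac=0.1):
    """Learning rate for `step` (0-based) under a simple linear 1cycle schedule."""
    start_lr = max_lr / 10 if start_lr is None else start_lr        # assumed default
    final_lr = start_lr / 1000 if final_lr is None else final_lr    # assumed default
    tail_steps = max(1, int(total_steps * tail_frac))
    half = (total_steps - tail_steps) // 2
    if half < 1:
        raise ValueError("total_steps is too small for a 1cycle schedule")
    if step < half:
        # Phase 1: linear increase from start_lr up to max_lr
        return start_lr + (max_lr - start_lr) * step / half
    if step < 2 * half:
        # Phase 2: linear decrease from max_lr back down to start_lr
        return max_lr - (max_lr - start_lr) * (step - half) / half
    # Phase 3: linear drop from start_lr to final_lr over the remaining steps
    remaining = total_steps - 2 * half
    t = min((step - 2 * half) / max(1, remaining - 1), 1.0)
    return start_lr + (final_lr - start_lr) * t

lrs = [one_cycle_lr(s, total_steps=1000, max_lr=0.05) for s in range(1000)]
print(f"first={lrs[0]:.4g}, peak={max(lrs):.4g}, last={lrs[-1]:.4g}")
```

The maximum rate itself would come from a separate learning rate search, as the exercise notes recommend.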