## A Hands-on Workshop series in Machine Learning
### Session 5: Going Deeper into Neural Networks 
#### Instructor: Aashita Kesarwani



##### Revision:

<img src="https://www.researchgate.net/profile/Mohamed_Zahran6/publication/303875065/figure/fig4/AS:371118507610123@1465492955561/A-hypothetical-example-of-Multilayer-Perceptron-Network.png" width="450" height="500" />
<p style="text-align: center;"> Multi-layer Perceptron </p>

Some basic terms:
* Nodes
* Input layer
* Hidden layers
* Output layer


Some points to note about neural network architecture:
* The number of nodes in the input layer is equal to the number of features plus the bias. 
* Each layer has a bias, though they are often not shown in the neural network diagrams. 
    * The bias nodes have no incoming connections from the nodes in the previous layer, unlike the other nodes in the hidden layers.
    * The bias nodes are connected to each node in the next layer, just like the other nodes.
* The output/activation of each layer is calculated in two steps: weighted sum of the incoming input followed by the activation function.
* The output/activation of each hidden layer becomes the input of the next layer.
* All the nodes in a layer share the same activation function but the activation functions can differ from layer to layer.
* The activation functions contribute the non-linearity in the model. 
* The weights and biases of a neural network are learned using the training examples.
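
The two-step computation described above (weighted sum, then activation) can be sketched in plain Python. This is only an illustrative sketch: the function name `layer_forward`, the weight layout and the numbers are made up for this example and are not taken from any library.

```python
import math

def sigmoid(t):
    """Sigmoid activation: squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-t))

def layer_forward(inputs, weights, biases, activation):
    """Compute the activations of one fully connected layer.

    weights[j][i] is the weight from input node i to node j of this layer;
    biases[j] is the bias term feeding node j.
    """
    activations = []
    for w_row, b in zip(weights, biases):
        # Step 1: weighted sum of the incoming inputs plus the bias
        z = sum(w * x for w, x in zip(w_row, inputs)) + b
        # Step 2: apply the activation function shared by the layer
        activations.append(activation(z))
    return activations

# Two features -> hidden layer with two nodes -> one output node
x = [1.0, 2.0]
hidden = layer_forward(x, [[0.5, -0.5], [1.0, 1.0]], [0.5, -3.0], sigmoid)
# The activations of the hidden layer become the input of the next layer
output = layer_forward(hidden, [[1.0, -1.0]], [0.0], sigmoid)
```

The weights here were picked so that every weighted sum is zero, which makes each sigmoid activation exactly $0.5$ and easy to check by hand.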

Some more terms:
* Weights
* Bias
* Weighted sum $z_i$
* Activation functions $g_i$
* Activations $a_i$ 

The iterative training process in a nutshell: 
1. The weights are initialized.
2. For each input vector, the activations are propagated forward through the network to give the final output. This is forward propagation.
3. This final output is compared with the target value to calculate the cost function.
4. The above cost is propagated backwards using gradients. This is called backpropagation. 
5. The weights are updated using the gradients calculated above.
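
The five steps above can be sketched for the smallest possible network, a single sigmoid node. This is a toy illustration under stated assumptions: the dataset (logical AND), the learning rate, the number of epochs and the squared-error cost are all choices made for this example.

```python
import math
import random

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def cost(weights, bias, data):
    """Squared-error cost J = 1/(2n) * sum of (y - y_pred)^2."""
    total = 0.0
    for x, y in data:
        y_pred = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
        total += (y - y_pred) ** 2
    return total / (2 * len(data))

# Toy dataset (logical AND), chosen only for illustration
data = [([0.0, 0.0], 0.0), ([0.0, 1.0], 0.0),
        ([1.0, 0.0], 0.0), ([1.0, 1.0], 1.0)]

random.seed(0)
# Step 1: initialize the weights
weights = [random.gauss(0.0, 0.1) for _ in range(2)]
bias = 0.0
alpha = 0.5  # learning rate (assumed value)
initial_cost = cost(weights, bias, data)

for epoch in range(500):
    for x, y in data:
        # Step 2: forward propagation
        z = sum(w * xi for w, xi in zip(weights, x)) + bias
        a = sigmoid(z)
        # Step 3: compare the output with the target value
        error = a - y
        # Step 4: backpropagation via the chain rule, dJ/dz = dJ/da * da/dz
        dz = error * a * (1.0 - a)
        # Step 5: update the weights using the gradients
        weights = [w - alpha * dz * xi for w, xi in zip(weights, x)]
        bias -= alpha * dz

final_cost = cost(weights, bias, data)
```

Each pass over the four examples is one epoch, and each weight update is one iteration.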

More terms:
* Cost function $J$
* Forward propagation
* Backward propagation
* Iterations
* Epochs
* Learning rate $\alpha$

From the mathematical theory behind neural networks, it is known that given a function no matter how complicated, we can come up with a neural network that approximates it. See [here](http://neuralnetworksanddeeplearning.com/chap4.html) for a visual non-rigorous proof. 

In fact, a Multilayer Perceptron (MLP) with a single hidden layer is a universal approximator, in theory. You can also look up the "Universal Approximation Theorem" if you're interested. 

It is not as simple in practice though. There are many obstacles in training a neural network to get a good approximation for the desired function. One of the most common problems we encounter, especially in deep networks, is that of vanishing gradients. 

### Vanishing Gradient problem:

Deep learning architectures, that is, neural networks with many hidden layers, are driving the recent advances in machine learning. 

<img src="images/nn.png" width=600>
 
For deep neural networks, the gradients can become so small that the weights barely change and the training process makes little progress even after several epochs. The reason behind vanishing gradients lies in the backpropagation step. 

For a particular weight, say $w$, the weight update is the derivative of the cost function $J(w)$ with respect to that weight, that is $\frac{dJ}{dw}$, multiplied by the learning rate $\alpha$.

$$w := w - \alpha \frac{dJ}{dw}$$

Given the gradients for a layer, how do we compute the gradients for the preceding layer?

Answer: Chain rule for derivatives

$$ \frac{d f(y)}{dx} =  \frac{d f(y)}{dy} \frac{d y}{dx}$$

When we look deeper into the calculations for the gradients, we see that they involve a series of derivatives multiplied with one another because of the chain rule. Please see the solution of exercise 3 in the `Exercise solutions.ipynb` notebook from the last session, which works out the calculation to clarify this.

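Since each of these derivatives is influenced by the derivatives in the layers ahead of it, there are two possibilities:
* Gradients vanish to zero when the multiplying factors are smaller than $1$
$$ 0.9 * 0.8 * 0.7 * 0.6 * 0.5 * 0.4 * 0.3 * 0.2 * 0.1 \approx 0.0004$$
* Gradients explode (grow very large) when the multiplying factors are larger than $1$
$$ 2 * 2 * 2 * 2 * 2 * 2 * 2 * 2 * 2 = 512$$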

The earlier layers in the network are far more affected by either of these problems than the layers closer to the output because the gradients are propagated backwards adding on the multiplying factors for the earlier layers of the network.
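
To make the effect of the repeated multiplication concrete, here is a small sketch that tracks the running product of the factors used in the example above. The helper `repeated_factor` is a made-up name for this illustration. The value $0.25$ is the largest possible slope of the sigmoid function, used here as a best-case factor that ignores the weights.

```python
# The illustrative per-layer factors from the example above
factors = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1]

running_products = []
product = 1.0
for factor in factors:
    # Each extra layer multiplies one more factor into the gradient
    product *= factor
    running_products.append(product)

def repeated_factor(factor, num_layers):
    """Size of a gradient after passing through num_layers equal factors."""
    return factor ** num_layers

# Factors below 1 shrink the gradient; factors above 1 blow it up
shrinking = repeated_factor(0.25, 10)
growing = repeated_factor(2.0, 10)
```

Printing `running_products` shows the product getting smaller with every additional layer, which is why the earliest layers receive the weakest signal.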


How do we accelerate the learning process of our network?
* Does the number of layers make a difference? 
* Does the number of nodes in each layer make a difference?
* Does weight initialization make a difference?
* Does the activation function make a difference?
* Does the learning rate make a difference?
* Does the scale of the features (variables) make a difference?


Let us revisit the sigmoid function. What do you observe about the derivative (slope) of the sigmoid function at different points?

$$sig(t) = \frac{1}{1+e^{-t}}$$

<img src="https://upload.wikimedia.org/wikipedia/commons/5/53/Sigmoid-function-2.svg" width=400 />

The derivative of the sigmoid function becomes vanishingly small in the saturating regions on both sides. When a node operates in these regions, the gradients passing through it are close to zero, so the weights barely get updated.
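
The slope at different points can be checked numerically. The sketch below uses the standard identity $sig'(t) = sig(t)(1 - sig(t))$. The sample points are arbitrary choices for illustration.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def sigmoid_derivative(t):
    """Slope of the sigmoid at t, using sig'(t) = sig(t) * (1 - sig(t))."""
    s = sigmoid(t)
    return s * (1.0 - s)

# Slopes in the middle and in the saturating regions on both sides
slopes = {t: sigmoid_derivative(t) for t in (-10.0, -2.0, 0.0, 2.0, 10.0)}
```

The slope peaks at $t = 0$ and falls off symmetrically on both sides.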

<img src="https://miro.medium.com/max/3946/1*6A3A_rt4YmumHusvTvVTxw.png" width=500 />

Hence, another activation function, ReLU, is usually preferred over the sigmoid activation for the hidden layers.  

#### Rectified Linear Units (ReLU):
The Rectified Linear Unit (ReLU) function is defined as:

\begin{equation}
g(x) = 
\begin{cases} 
x \text{ if } x \geq 0 \\
0 \text{ if } x < 0
\end{cases}
\end{equation}

<img src="https://miro.medium.com/max/2565/1*DfMRHwxY1gyyDmrIAd-gjQ.png" width=400 />


Its derivative is:
\begin{equation}
g'(x) = 
\begin{cases} 
1 \text{ if } x \geq 0 \\
0 \text{ if } x < 0
\end{cases}
\end{equation}

ReLU speeds up the training process significantly as compared to the sigmoid function and hence, it is the most commonly used activation function for the hidden layers.
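
A minimal sketch of ReLU and its derivative, following the definitions above (including the convention that the derivative at $x = 0$ is taken to be $1$):

```python
def relu(x):
    """ReLU: passes non-negative inputs through, zeroes out negative ones."""
    return x if x >= 0 else 0.0

def relu_derivative(x):
    """Slope of ReLU: 1 for x >= 0 and 0 for x < 0 (convention at x = 0)."""
    return 1.0 if x >= 0 else 0.0

inputs = [-3.0, -0.5, 0.0, 0.5, 3.0]
outputs = [relu(x) for x in inputs]
slopes = [relu_derivative(x) for x in inputs]
```

Unlike the sigmoid, the slope does not shrink for large positive inputs, so the factor contributed to the chain rule stays at $1$ for active nodes.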

**Note:** ReLU is not suitable for the output layer of a network designed for binary classification, where the sigmoid function is used instead. The sigmoid function, however, is limited to binary classification, so what do we do when we have more than two classes, which is often the case for real-world datasets?

#### Softmax function
For a multi-class (more than two classes) problem, the number of nodes in the output layer is equal to the number of classes and we want the probabilities for each class to add up to $1$. One simple way to ensure this would be to divide the output for each node by the total sum of the outputs of all the nodes. This is called standard normalization.

$$ prob_i = \frac{z_i}{z_1 + \dots + z_n} $$

The preferred method for multi-class classification problems is to use the softmax function in the output layer. It converts the outputs, say $z_i$'s, into probabilities adding to $1$, by performing standard normalization on the exponentials of the outputs.

For each node, the softmax formula is

$$softmax(z_i) = \frac{e^{z_i}}{e^{z_1} + \dots + e^{z_n}} $$


Softmax works better than standard normalization because it helps mitigate vanishing gradients. Without going into much detail, the exponentials in the softmax cancel the log in the cross-entropy loss/cost function, making the loss roughly linear in $z_i$ and thus speeding up the weight updates.

The only assumption for softmax is that the examples cannot belong to two classes at the same time.
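
The two normalization formulas can be compared with a short sketch. The output values `z` are made up for illustration. Subtracting the largest output before exponentiating is a common numerical-stability trick that is assumed here and does not change the result.

```python
import math

def standard_normalization(z):
    """Divide each output by the sum of all the outputs."""
    total = sum(z)
    return [zi / total for zi in z]

def softmax(z):
    """Standard normalization applied to the exponentials of the outputs."""
    z_max = max(z)  # shifting by the max avoids overflow in exp
    exps = [math.exp(zi - z_max) for zi in z]
    total = sum(exps)
    return [e / total for e in exps]

z = [2.0, 1.0, -1.0]
probs = softmax(z)
naive = standard_normalization(z)
```

With a negative output in `z`, standard normalization produces a negative "probability", whereas softmax always returns positive values that add up to $1$.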

### Weight initialization:

To address the vanishing gradient problems, the general rule for weight initialization is:
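* Generate a random sample from the standard normal distribution (a [Gaussian distribution](https://www.khanacademy.org/math/statistics-probability/modeling-distributions-of-data/more-on-normal-distributions/v/introduction-to-the-normal-distribution) with mean $0$ and variance $1$)
* Multiply each weight by $\sqrt{\frac{2}{n_{in} + n_{out}}}$, where $n_{in}$ and $n_{out}$ are the numbers of nodes feeding into and out of that layer. A closely related rule, He initialization, uses $\sqrt{\frac{2}{n_{in}}}$ instead and is often paired with ReLU.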

More information on this topic can be found in [this paper](http://proceedings.mlr.press/v9/glorot10a.html) by Xavier Glorot and Yoshua Bengio, who introduced this rule. [This video](https://www.coursera.org/lecture/deep-neural-network/weight-initialization-for-deep-networks-RwqYe) on coursera is also helpful.

In Keras, this weight initialization rule can be implemented by passing `kernel_initializer='glorot_normal'` to `Dense()` while defining the layers.
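
For intuition, the rule can be sketched without Keras. This sketch assumes the Glorot normal scaling $\sqrt{2 / (n_{in} + n_{out})}$ and uses a plain Gaussian, whereas Keras draws from a truncated normal. The function name and layer sizes are made up for this example.

```python
import math
import random

def glorot_normal_weights(n_in, n_out, seed=None):
    """Sample an n_out x n_in weight matrix with Glorot normal scaling.

    Assumption: standard-normal samples multiplied by sqrt(2 / (n_in + n_out)).
    """
    rng = random.Random(seed)
    scale = math.sqrt(2.0 / (n_in + n_out))
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(n_in)]
            for _ in range(n_out)]

# A layer with 200 incoming and 100 outgoing nodes (hypothetical sizes)
weights = glorot_normal_weights(200, 100, seed=0)

# The spread of the sampled weights should be close to the target scale
flat = [w for row in weights for w in row]
mean = sum(flat) / len(flat)
std = math.sqrt(sum((w - mean) ** 2 for w in flat) / len(flat))
target = math.sqrt(2.0 / (200 + 100))
```

Wider layers get smaller initial weights, which keeps the weighted sums from growing with the number of inputs.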

### Regularization:
When we have weights that are large in magnitude, the model is more likely to overfit, so we want to penalize large weights. This is achieved by adding a regularization term to the error term in the cost function. As the training process tries to minimize the cost function, the regularization term ensures that the weights are kept small and thereby simplifies the model.

There are two common ways to add the regularization term (using [$L1$ and $L2$-norms](https://machinelearningmastery.com/vector-norms-machine-learning/)):

Cost function without regularization:
$$ J = \frac{1}{2 n} \sum_{i=1}^n (y^{(i)} - y_{pred}^{(i)})^2 $$

Cost function with L2 regularization:
$$ J = \frac{1}{2 n} \sum_{i=1}^n (y^{(i)} - y_{pred}^{(i)})^2 + \lambda \sum_{j=1}^m w_j^2$$

Cost function with L1 regularization:
$$ J = \frac{1}{2 n} \sum_{i=1}^n (y^{(i)} - y_{pred}^{(i)})^2 + \lambda \sum_{j=1}^m |w_j|$$

Here $\lambda$ is the regularization strength, a hyperparameter that is separate from the learning rate $\alpha$, and $m$ is the number of weights.

The forward propagation step remains unchanged, but the gradients in the backpropagation step change because the cost $J$ now includes the regularization term. The weight update rule keeps the same form:
$$ w := w - \alpha \frac{\partial J}{\partial w}$$
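
As a sketch of how the penalty enters the update: differentiating the L2 penalty adds a term proportional to each weight, and the L1 penalty adds a term proportional to its sign. In the code below the regularization strength is named `lam` to keep it distinct from the learning rate `alpha`. Both function names are made up for this illustration, and `data_gradients` stands for the gradients of the unregularized cost.

```python
def l2_regularized_update(weights, data_gradients, alpha, lam):
    """One gradient step when the cost includes lam * sum(w_j ** 2).

    The derivative of the penalty with respect to w_j is 2 * lam * w_j.
    """
    return [w - alpha * (g + 2.0 * lam * w)
            for w, g in zip(weights, data_gradients)]

def sign(w):
    return (w > 0) - (w < 0)

def l1_regularized_update(weights, data_gradients, alpha, lam):
    """One gradient step when the cost includes lam * sum(|w_j|).

    The derivative of the penalty is lam * sign(w_j) (taken as 0 at w_j = 0).
    """
    return [w - alpha * (g + lam * sign(w))
            for w, g in zip(weights, data_gradients)]

# With zero data gradients, only the penalty acts on the weights
weights = [1.0, -2.0]
after_l2 = l2_regularized_update(weights, [0.0, 0.0], alpha=0.1, lam=0.5)
after_l1 = l1_regularized_update(weights, [0.0, 0.0], alpha=0.1, lam=0.5)
```

In both cases the penalty pulls every weight towards zero: L2 shrinks each weight in proportion to its size, while L1 shrinks each weight by a fixed amount.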


This technique of regularization using either the $L1$ or the $L2$-norm is one of the two most commonly used techniques to address overfitting in deep neural networks. The other one is dropout regularization.

### Dropout

Dropout means dropping out (or ignoring) some randomly chosen nodes during the training of the neural network. 

* For each iteration in the training, some nodes are dropped (or their activations are switched off). 
* The same nodes are dropped for both forward and backward propagation. 
* The nodes to be dropped are chosen at random. Each node is kept with probability $p$ and dropped with probability $1-p$, independently of the other nodes. This results in roughly a $1-p$ proportion of the nodes being dropped in each iteration.
* The dropout is implemented on all layers, except the output layer since it would not make much sense to drop the output nodes.
* The dropout is implemented only during the training phase. It is not used for the predictions once the weights are trained.


The network learns multiple independent representations and hence, it is less sensitive to specific weights, thus reducing overfitting. In some ways, a network with dropout is like having an ensemble of neural networks.
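
A minimal sketch of dropout applied to the activations of one layer follows. The function name is made up for this example. Dividing the kept activations by $p$ (so-called inverted dropout) is a common convention that is assumed here rather than described above. It keeps the expected activation the same during training and prediction.

```python
import random

def dropout(activations, keep_prob, rng, training=True):
    """Randomly switch off activations during training.

    Returns the new activations and the mask of kept nodes, so that the
    same nodes can be dropped in the backward pass.
    """
    if not training:
        # Dropout is not used for predictions
        return list(activations), [True] * len(activations)
    mask = [rng.random() < keep_prob for _ in activations]
    # Assumption: inverted dropout, kept activations are scaled by 1/keep_prob
    dropped = [a / keep_prob if keep else 0.0
               for a, keep in zip(activations, mask)]
    return dropped, mask

rng = random.Random(0)
hidden_activations = [1.0] * 10000
dropped, mask = dropout(hidden_activations, keep_prob=0.8, rng=rng)
kept_fraction = sum(mask) / len(mask)

# At prediction time the activations pass through unchanged
unchanged, _ = dropout([0.3, 0.7], keep_prob=0.8, rng=rng, training=False)
```

A fresh mask is drawn in every iteration, so a different random subset of nodes is trained each time.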

### Gradient descent optimizers

* **Stochastic gradient descent**: A single training example is used in each iteration. The gradient (or the derivative of the cost function) is computed for that training example and it is used to update the weights.
* **Batch gradient descent**: The entire training set is used in each iteration. The average of the gradients for all the training examples is computed and it is used to update the weights.

Notes:
* Stochastic gradient descent converges much faster than batch gradient descent for larger datasets, since the weights are updated a lot more frequently.
* For batch gradient descent, the cost function declines consistently with each iteration, whereas for the stochastic gradient descent, the cost fluctuates and declines overall after each epoch.
* The stochastic gradient descent cannot make use of the vectorized operations unlike batch gradient descent.


**Mini-batch gradient descent** is a mix of the above two and a good compromise. The training set is divided into batches and a single batch is used in each iteration. Batch sizes of 32, 64 and 128 are often used.

In practice, mini-batch gradient descent is most commonly used, especially for large datasets. 
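
The three variants differ only in how many examples are used per iteration, which a small batching helper makes visible. The function name is made up for this sketch, and shuffling the examples before each epoch is a common practice assumed here.

```python
import random

def iterate_minibatches(examples, batch_size, rng):
    """Yield the training examples in shuffled batches of batch_size.

    batch_size=1 corresponds to stochastic gradient descent and
    batch_size=len(examples) to batch gradient descent.
    """
    indices = list(range(len(examples)))
    rng.shuffle(indices)
    for start in range(0, len(indices), batch_size):
        yield [examples[i] for i in indices[start:start + batch_size]]

# 100 placeholder examples split into mini-batches of 32
examples = list(range(100))
batches = list(iterate_minibatches(examples, batch_size=32,
                                   rng=random.Random(0)))
batch_sizes = [len(batch) for batch in batches]
```

One pass through all the batches is one epoch. The last batch is smaller when the batch size does not divide the number of examples evenly.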

<img src="https://miro.medium.com/max/1634/1*PV-fcUsNlD9EgTIc61h-Ig.png" width=600 />


### Tuning the learning rate

The learning rate ($\alpha$), which determines the size of the steps in the gradient descent algorithm, is an important hyperparameter that needs to be tuned. 
* If the learning rate is too low, then it takes too long to converge. 
* If the learning rate is too high, then it might oscillate and never reach the minimum.

<img src="https://www.math.purdue.edu/~nwinovic/figures/learning_rates.png" width=600 />

There is no easy formula to find the optimal learning rate for a model with a given dataset, so we need to find it by trial and error.

Our intuition tells us that the learning rate $\alpha$ should be larger at the beginning, when the weights are initialized randomly and are far from optimal, and that it can be reduced as the training process proceeds. This is called a learning rate schedule or learning rate decay. 
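
One simple schedule, shown here only as an example, is time-based decay, where the learning rate is divided by a factor that grows with the epoch number. The function name, the initial rate and the decay rate below are made-up values for illustration.

```python
def decayed_learning_rate(alpha_0, decay_rate, epoch):
    """Time-based decay: alpha = alpha_0 / (1 + decay_rate * epoch)."""
    return alpha_0 / (1.0 + decay_rate * epoch)

# Learning rate for the first five epochs (assumed alpha_0 and decay_rate)
schedule = [decayed_learning_rate(0.1, 0.5, epoch) for epoch in range(5)]
```

The learning rate starts at its initial value in epoch $0$ and shrinks with every epoch, so the steps get smaller as the weights approach a minimum.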

The tuning of the learning rate has been an active area of research, and adaptive gradient descent algorithms such as Adam, RMSprop and AdaGrad have been developed that usually give better results. 

### Tuning the number of epochs

It is helpful to keep a check on both training and validation errors by printing them out at regular intervals and stop the training process at a suitable time to avoid both underfitting and overfitting to the training set.
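
One way to pick the stopping point automatically is to stop once the validation error has not improved for a few epochs in a row. The sketch below assumes this rule. The function name, the `patience` parameter and the error values are made up for illustration.

```python
def early_stopping_epoch(val_errors, patience):
    """Return (best_epoch, stop_epoch) for a list of validation errors.

    Training stops once the validation error has failed to improve for
    `patience` consecutive epochs.
    """
    best_error = float("inf")
    best_epoch = 0
    epochs_without_improvement = 0
    for epoch, error in enumerate(val_errors):
        if error < best_error:
            best_error = error
            best_epoch = epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return best_epoch, epoch
    return best_epoch, len(val_errors) - 1

# Made-up validation errors: they fall at first, then rise (overfitting)
val_errors = [0.90, 0.70, 0.60, 0.55, 0.56, 0.58, 0.60, 0.65]
best_epoch, stop_epoch = early_stopping_epoch(val_errors, patience=2)
```

The weights from `best_epoch` are the ones worth keeping, since the validation error only gets worse afterwards.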

<img src="http://fouryears.eu/wp-content/uploads/2017/12/early_stopping.png" width=400 />


### Normalizing the features:

Suppose we have only two features $x_1$ and $x_2$. Let us also suppose that $x_1$ takes values in the range of $0$ to $0.1$ whereas $x_2$ takes values in the range of $100$ to $500$. How do you think the difference in the scales of the two features will affect the training process of the network?

If the scales of the features vary a lot, gradient descent takes longer to converge. It is helpful to normalize the features to be in the range of $-1$ to $1$ to speed up the learning process. 

<img src="https://www.jeremyjordan.me/content/images/2018/01/Screen-Shot-2018-01-23-at-2.27.20-PM.png" width=600 />

If the input features are not in the typical range of $-1$ to $1$, the default parameters for the network, such as the learning rate, may also not work well for the training process. 
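
A minimal sketch of rescaling each feature to the range of $-1$ to $1$ follows, using min-max scaling as one possible choice. The function name and the sample values are made up, with the ranges taken from the example above.

```python
def scale_to_unit_range(values):
    """Linearly rescale a feature so its minimum maps to -1 and maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant feature carries no information; map it to 0
        return [0.0 for _ in values]
    return [2.0 * (v - lo) / (hi - lo) - 1.0 for v in values]

# Two features on very different scales
x1 = [0.0, 0.05, 0.1]
x2 = [100.0, 300.0, 500.0]

x1_scaled = scale_to_unit_range(x1)
x2_scaled = scale_to_unit_range(x2)
```

After scaling, both features cover the same range, so neither dominates the weighted sums or the gradients.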

### Acknowledgements:
The credits for the images used above are as follows:

- Image 1: https://www.researchgate.net/figure/A-hypothetical-example-of-Multilayer-Perceptron-Network_fig4_303875065
- Images 3 and 4: https://towardsdatascience.com/derivative-of-the-sigmoid-function-536880cf918e
- Image 5: https://medium.com/@danqing/a-practical-guide-to-relu-b83ca804f1f7
- Image 6: https://towardsdatascience.com/gradient-descent-algorithm-and-its-variants-10f652806a3
- Image 7: https://www.math.purdue.edu/~nwinovic/deep_learning_optimization.html
- Image 8: http://fouryears.eu/tags/data-analysis/
- Image 9: https://www.jeremyjordan.me/batch-normalization/
