# Neural Networks Representation

Neural networks are a generalization of the previous logistic regression model to multiple layers. Each layer applies several computations of $\sigma(z)$, where $z=w^Tx+b$. Consider the following image:

![Fig 1: Neural Network Representation](images/neural_network_representation.png)

This network can be described as a 2-layer neural network. The first layer computes $z^{[1]} = w^{[1]T} x + b^{[1]}$ and passes it through the sigmoid function $\sigma$. The second layer computes $z^{[2]} = w^{[2]T} a^{[1]} + b^{[2]}$, taking the output of the first layer as its input, and applies $\sigma$ again to give the final output $\hat{y}$. Each circle performs two computations: first $z^{[i]}$, then $a^{[i]}=\sigma(z^{[i]})$, where $i$ is the layer index.

In other words, the network above consists of 3 logistic regression units stacked in the first layer, whose outputs are fed into a single logistic regression unit in the second layer, which is responsible for producing the output $\hat{y}$.


The computation of the above network can be summarized in the following computation graph:

$$ \boxed{\boxed{z^{[1]} = w^{[1]T} \times x + b^{[1]}} \longrightarrow \boxed{a^{[1]} = \sigma{(z^{[1]})}}} \longrightarrow \boxed{\boxed{z^{[2]}=w^{[2]T} \times a^{[1]} + b^{[2]}} \longrightarrow \boxed{a^{[2]}=\sigma{(z^{[2]})}}} \longrightarrow \boldsymbol{\ell} (a^{[2]},y)$$

- In the computation graph above, outer boxes represent layers, while inner boxes represent computations within a layer.
- **An important note** on notation: superscript square brackets "$[]$" refer to the layer index. They should not be confused with superscript parentheses "$()$", which refer to the example index in the given training/dev/test set.
- The input layer $x$ can also be referred to as $a^{[0]}$.
- Vectorization and broadcasting can be applied to express and parallelize the computations of this network.
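
The computation graph above can be sketched in NumPy as follows. This is a minimal illustration, not a reference implementation: the number of input features (3), the random initialization, and the names `forward_two_layer`, `W1`, `W2` are assumptions made for the example. Here `W1` and `W2` store $w^{[1]T}$ and $w^{[2]T}$, with one row per unit.

```python
import numpy as np

def sigmoid(z):
    """Element-wise logistic function 1 / (1 + e^{-z})."""
    return 1.0 / (1.0 + np.exp(-z))

def forward_two_layer(x, W1, b1, W2, b2):
    """Forward pass of the 2-layer network (hypothetical helper)."""
    z1 = W1 @ x + b1      # z^{[1]}, one entry per hidden unit
    a1 = sigmoid(z1)      # a^{[1]}
    z2 = W2 @ a1 + b2     # z^{[2]}, uses a^{[1]} as its input
    a2 = sigmoid(z2)      # a^{[2]}, i.e. y_hat
    return a1, a2

rng = np.random.default_rng(0)
x = rng.normal(size=(3, 1))            # assumed: 3 input features, one example
W1 = rng.normal(size=(3, 3)) * 0.01    # 3 hidden units
b1 = np.zeros((3, 1))
W2 = rng.normal(size=(1, 3)) * 0.01    # 1 output unit
b2 = np.zeros((1, 1))

a1, y_hat = forward_two_layer(x, W1, b1, W2, b2)
print(a1.shape, y_hat.shape)           # shapes (3, 1) and (1, 1)
```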

# Activation Functions

Throughout the previous material, the sigmoid function $\sigma(x) = \frac{1}{1+e^{-x}}$ was used to transform the output of the linear function $wx+b$ into a nonlinear form. However, it is not the only function that can add nonlinearity to the linear formula. Some alternatives are listed below.

## Tanh Function

The tanh function is bounded between -1 and 1. Mathematically, it is a shifted and rescaled version of the sigmoid function. For hidden units, tanh almost always works better than the sigmoid, because the mean of its outputs is closer to zero, which has a centering effect on the data passed to the next layer. The formula for the tanh function is:

$$\tanh(z) = \frac{e^{z}-e^{-z}}{e^z+e^{-z}}$$

The following graph, from Wolfram Alpha, depicts the behaviour of the tanh function.

![tanh function](images/tanh.png)
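
The statement that tanh is a shifted and rescaled sigmoid can be made concrete with the identity $\tanh(z) = 2\sigma(2z) - 1$, which follows from multiplying the numerator and denominator of the tanh formula by $e^{-z}$. The short check below is only a numerical sketch of that identity; the sample points are arbitrary.

```python
import numpy as np

def sigmoid(z):
    """Element-wise logistic function."""
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-5.0, 5.0, 11)

# Rescale the input by 2, rescale the output by 2, then shift down by 1.
tanh_from_sigmoid = 2.0 * sigmoid(2.0 * z) - 1.0

print(np.allclose(tanh_from_sigmoid, np.tanh(z)))  # True
```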

### Downsides of tanh and sigmoid functions

The most prominent problem with the tanh and sigmoid functions is that when the input $z$ is at either extreme, very negative or very positive, the derivative becomes very small, and so does the gradient. This slows down the learning of the network.
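
To illustrate the saturation effect, the sketch below evaluates both derivatives at $z=0$ and at $z=\pm 10$. The helper names `sigmoid_grad` and `tanh_grad` and the sample points are chosen for this example only.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    """Derivative of the sigmoid: sigma(z) * (1 - sigma(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)

def tanh_grad(z):
    """Derivative of tanh: 1 - tanh(z)^2."""
    return 1.0 - np.tanh(z) ** 2

z = np.array([-10.0, 0.0, 10.0])
sig_g = sigmoid_grad(z)
tanh_g = tanh_grad(z)

# Both derivatives peak at z = 0 (0.25 for sigmoid, 1 for tanh)
# and are close to zero at z = -10 and z = 10.
print(sig_g)
print(tanh_g)
```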

### The Rectified Linear Unit (ReLU) function

The ReLU activation function is mainly intended to address the limitations of the sigmoid and tanh functions.

The ReLU function is defined as follows:

$$f(x)=\begin{cases}x & x\geq 0 \\ 0 & x < 0 \end{cases}$$

Hence, for large positive inputs the derivative of ReLU stays at 1 instead of shrinking toward zero. It has become the standard choice in recent neural network implementations. The graph of ReLU, from Wolfram Alpha, is shown below.

![ReLU Graph](images/ReLU.png)

#### ReLU limitations

- The function is not differentiable at 0. In practice, this is handled by defining the derivative at 0 to be either 0 or 1.
- The derivative is always zero for negative inputs. This limitation is addressed by a variant called leaky ReLU, which has a small slope for negative inputs, for example $$f(x)=\begin{cases}x & x\geq 0 \\ 0.01x & x < 0 \end{cases}$$
Its graph, from paperswithcode.com [https://paperswithcode.com/method/leaky-relu, last accessed: 31-08-2021], is shown below:

<img src="images/LeakyReLU.png" width="200px" height="200px"/>
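
Both definitions translate almost directly into NumPy. The sketch below is illustrative; the function names and the default `slope=0.01` simply mirror the formulas above.

```python
import numpy as np

def relu(x):
    """ReLU: x for x >= 0, and 0 otherwise."""
    return np.maximum(0.0, x)

def leaky_relu(x, slope=0.01):
    """Leaky ReLU: x for x >= 0, and slope * x otherwise."""
    return np.where(x >= 0, x, slope * x)

x = np.array([-2.0, 0.0, 3.0])
print(relu(x))        # values: 0, 0, 3
print(leaky_relu(x))  # values: -0.02, 0, 3
```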

## The need for activation functions

Suppose a neural network has no activation function or, equivalently, uses the identity activation function. Then the output of the whole network can be produced by a single linear function. Consider the following computation across layers:

$$ a^{[1]} = z^{[1]} = w^{[1]}x + b^{[1]} $$
$$ a^{[2]} = z^{[2]} = w^{[2]}a^{[1]} + b^{[2]} $$
$$ a^{[2]} = w^{[2]}(w^{[1]}x + b^{[1]}) + b^{[2]} $$
$$ a^{[2]} = (w^{[2]}w^{[1]})x + (w^{[2]}b^{[1]} + b^{[2]}) $$

Let $$w' = w^{[2]}w^{[1]}$$
and $$b' = w^{[2]}b^{[1]} + b^{[2]}$$

Hence

$$ a^{[2]} = w'x + b' $$

The computation above uses a two-layer network, but it generalizes in the same way to a network with $k$ layers.

With linear activation functions in the hidden layers (and a sigmoid at the output), the neural network is no more expressive than a standard logistic regression model without any hidden layer, as the computation above suggests.

**Hence,** a non-linear activation function is required to transform the linear function $wx+b$ into a nonlinear form.
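
The collapse of stacked linear layers into a single linear map can also be checked numerically. In the sketch below, the layer sizes and the random values are arbitrary assumptions; `W_prime` and `b_prime` play the role of the combined weight and bias from the derivation.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 1))
W1, b1 = rng.normal(size=(3, 4)), rng.normal(size=(3, 1))
W2, b2 = rng.normal(size=(1, 3)), rng.normal(size=(1, 1))

# Two layers with identity activations.
a1 = W1 @ x + b1
a2 = W2 @ a1 + b2

# The equivalent single linear function.
W_prime = W2 @ W1
b_prime = W2 @ b1 + b2

print(np.allclose(a2, W_prime @ x + b_prime))  # True
```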

## Activation Functions Derivatives

The table below lists the derivatives of the activation functions discussed above. These derivatives are used in the backpropagation algorithm.

| Name          | Formula                                                           | Derivative                                                                   |
| :-----------: | :---------------------------------------------------------------: | :--------------------------------------------------------------------------: |
| Sigmoid       | $$ \sigma(z) = \frac{1}{1+e^{-z}} $$                              | $$ \frac{1}{1+e^{-z}} \left( 1 - \frac{1}{1+e^{-z}} \right) = \sigma(z) (1-\sigma(z)) $$ |
| Tanh          | $$ \tanh(z) = \frac{e^{z}-e^{-z}}{e^z+e^{-z}} $$                  | $$ 1 - \left(\frac{e^{z}-e^{-z}}{e^z+e^{-z}}\right)^2 = 1-(\tanh(z))^2 $$    |
| ReLU          | $$ f(x)=\begin{cases}x & x\geq 0 \\ 0 & x < 0 \end{cases} $$      | $$ f'(x)=\begin{cases}1 & x \geq 0 \\ 0 & x < 0 \end{cases} $$               |
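
The closed-form derivatives in the table can be sanity-checked against a central finite difference. This is only a sketch: the helper `numeric_grad`, the step size `eps`, and the sample points are assumptions, and $z=0$ is avoided because ReLU is not differentiable there.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def d_sigmoid(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def d_tanh(z):
    return 1.0 - np.tanh(z) ** 2

def relu(z):
    return np.maximum(0.0, z)

def d_relu(z):
    # Convention from the table: derivative 1 for z >= 0, else 0.
    return (z >= 0).astype(float)

def numeric_grad(f, z, eps=1e-6):
    """Central finite-difference approximation of f'(z)."""
    return (f(z + eps) - f(z - eps)) / (2.0 * eps)

z = np.array([-2.0, -0.5, 0.5, 2.0])
checks = {
    "sigmoid": np.allclose(d_sigmoid(z), numeric_grad(sigmoid, z), atol=1e-6),
    "tanh": np.allclose(d_tanh(z), numeric_grad(np.tanh, z), atol=1e-6),
    "relu": np.allclose(d_relu(z), numeric_grad(relu, z), atol=1e-6),
}
print(checks)
```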

# Gradient Descent for Neural Networks

Below, we describe the forward and backward passes of a neural network trained with the gradient descent optimization algorithm.

Note that, in the derivation below, $g^{[i]}(Z)$ refers to the activation function used in layer $i$.

For a network with $n$ layers, the **forward pass** is:

*for layer $i$ from 0 to $n-1$, do:*

- $Z^{[i+1]} = W^{[i+1]}A^{[i]}+b^{[i+1]}$

- $A^{[i+1]} = g^{[i+1]}(Z^{[i+1]})$

*Finally, for the last layer $n$, the sigmoid is used as the activation function:*

$ A^{[n]} = g^{[n]}(Z^{[n]}) = \sigma(Z^{[n]})$
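
The forward pass loop can be sketched in vectorized form, with one training example per column of `X`. The choice of tanh for the hidden layers, the layer sizes, and the names `forward_pass`, `weights`, and `biases` are assumptions for illustration; the text only fixes the sigmoid in the last layer.

```python
import numpy as np

def sigmoid(Z):
    return 1.0 / (1.0 + np.exp(-Z))

def forward_pass(X, weights, biases, hidden_activation=np.tanh):
    """Return [A^{[0]}, ..., A^{[n]}] for an n-layer network.

    weights[i] and biases[i] hold W^{[i+1]} and b^{[i+1]}.
    """
    n = len(weights)
    A = X                                  # A^{[0]} is the input
    activations = [A]
    for i in range(n):
        # Broadcasting adds the bias column to every example (column) of Z.
        Z = weights[i] @ A + biases[i]
        g = sigmoid if i == n - 1 else hidden_activation
        A = g(Z)
        activations.append(A)
    return activations

rng = np.random.default_rng(0)
layer_sizes = [4, 5, 3, 1]                 # assumed: 4 inputs, two hidden layers, 1 output
weights = [rng.normal(size=(layer_sizes[k + 1], layer_sizes[k])) * 0.1
           for k in range(len(layer_sizes) - 1)]
biases = [np.zeros((layer_sizes[k + 1], 1))
          for k in range(len(layer_sizes) - 1)]

X = rng.normal(size=(4, 10))               # 10 examples, one per column
activations = forward_pass(X, weights, biases)
print([A.shape for A in activations])
```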

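For the backward pass, the backpropagation algorithm computes the gradients of the loss $\ell$ with respect to the parameters. These gradients are then used in the following update step, where $\alpha$ is the learning rate:

*for layer $i$ from 1 to $n$, update the parameters $W$ and $b$ as follows:*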

- $ W^{[i]} = W^{[i]} - \alpha \frac{\partial \ell}{\partial W^{[i]}} $

- $ b^{[i]} = b^{[i]} - \alpha \frac{\partial \ell}{\partial b^{[i]}} $
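
Putting the forward pass and the update rule together, a training loop for a 2-layer network might look like the sketch below. Several things here are assumptions that the text does not specify: the tanh hidden activation, the binary cross-entropy loss, the toy data set, the learning rate, and the gradient expressions in `gradients`, which are the standard results for this particular setup rather than a derivation taken from the text.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(X, params):
    """Forward pass: tanh hidden layer (assumed), sigmoid output."""
    A1 = np.tanh(params["W1"] @ X + params["b1"])
    A2 = sigmoid(params["W2"] @ A1 + params["b2"])
    return A1, A2

def cost(A2, Y, eps=1e-12):
    """Mean binary cross-entropy (assumed loss); eps guards against log(0)."""
    return float(-np.mean(Y * np.log(A2 + eps) + (1 - Y) * np.log(1 - A2 + eps)))

def gradients(X, Y, params, A1, A2):
    """Backward pass for the assumed tanh + sigmoid + cross-entropy setup."""
    m = X.shape[1]
    dZ2 = A2 - Y
    dW2 = dZ2 @ A1.T / m
    db2 = dZ2.sum(axis=1, keepdims=True) / m
    dZ1 = (params["W2"].T @ dZ2) * (1.0 - A1 ** 2)   # tanh'(Z1) = 1 - A1^2
    dW1 = dZ1 @ X.T / m
    db1 = dZ1.sum(axis=1, keepdims=True) / m
    return {"W1": dW1, "b1": db1, "W2": dW2, "b2": db2}

rng = np.random.default_rng(0)
X = rng.normal(size=(2, 200))                        # toy data, 200 examples
Y = (X[0:1] + X[1:2] > 0).astype(float)              # toy labels, shape (1, 200)

params = {
    "W1": rng.normal(size=(4, 2)) * 0.1, "b1": np.zeros((4, 1)),
    "W2": rng.normal(size=(1, 4)) * 0.1, "b2": np.zeros((1, 1)),
}
alpha = 0.5                                          # learning rate (assumed value)

costs = []
for step in range(1000):
    A1, A2 = forward(X, params)
    costs.append(cost(A2, Y))
    grads = gradients(X, Y, params, A1, A2)
    for name in params:                              # W = W - alpha * dW, b = b - alpha * db
        params[name] -= alpha * grads[name]

print(costs[0], costs[-1])                           # the cost should decrease
```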

