## Activation Functions (Nonlinearities):
An activation function is a function associated with a given node which takes in the weighted sum (plus bias) and maps it to another number. Activation functions mimic the firing of neurons in biological neural networks. 

Activation functions should have the properties:
- Continuous at every $x$ in the real numbers
- *Injective* &mdash; one unique output for each unique input. Simple tests: check that the function is strictly monotonic (its derivative never changes sign) or use the horizontal line test. Note that ReLU does not satisfy this, since it maps every negative input to 0
- *Non-linear* &mdash; not just a straight line. ReLU is piecewise linear, but the kink at $x=0$ means it is not linear overall. A non-linear function is necessary for <em>conditional correlation</em>
- Efficient to compute &mdash; because the activation function could be called a massive number of times



### Activation Functions for the Hidden Layer:
- $\texttt{logistic sigmoid}$: $\frac{1}{1+e^{-x}}$ &mdash; maps weighted sums to a value in the interval $(0, 1)$
    - __Derivative:__ suppose $z=\frac{1}{1+e^{-x}}$, then $\frac{dz}{dx}=z(1-z)$
    - This lets you interpret the output of a neuron as a probability measure
    
<img src="images/sigmoid.png" style="width: 35%">
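
The sigmoid and its derivative identity can be checked numerically. The snippet below is a minimal sketch using only the standard library; the helper names (`sigmoid`, `sigmoid_derivative`, `numerical_derivative`) are illustrative and not taken from any particular framework.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic sigmoid: maps any real x into the interval (0, 1).

    Naive form: math.exp(-x) overflows for very negative x (around x < -700).
    """
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_derivative(x: float) -> float:
    """dz/dx = z * (1 - z), where z = sigmoid(x)."""
    z = sigmoid(x)
    return z * (1.0 - z)


def numerical_derivative(f, x: float, h: float = 1e-6) -> float:
    """Central-difference estimate of f'(x), used only as a sanity check."""
    return (f(x + h) - f(x - h)) / (2.0 * h)


for x in (-2.0, 0.0, 3.0):
    analytic = sigmoid_derivative(x)
    numeric = numerical_derivative(sigmoid, x)
    print(f"x={x:+.1f}  z={sigmoid(x):.4f}  z(1-z)={analytic:.4f}  numeric={numeric:.4f}")
```

The analytic and numerical columns should agree to several decimal places, which is a quick way to catch a mistyped derivative.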

- $\texttt{tanh(x)}=\frac{e^{x}-e^{-x}}{e^{x}+e^{-x}}=\frac{2}{1+e^{-2x}} - 1=2 \cdot S(2x) - 1$, where $S$ is the logistic sigmoid &mdash; basically a scaled and shifted version of the logistic sigmoid. It maps weighted sums to a value in the interval $(-1, 1)$.
    - __Derivative:__ suppose $z=\tanh(x)$, then $\frac{dz}{dx}=1-z^2$ 
    - $\texttt{tanh}$ can give a measure of negative correlation, rather than *just* positive correlation in the case of logistic sigmoid. It generally outperforms sigmoid for hidden layers because of its ability to measure negative correlation
    - $\texttt{tanh}$ has a steeper gradient in the neighbourhood of $x=0$: its slope there is $1$, compared with $\frac{1}{4}$ for the logistic sigmoid

<img src="images/tanh.png" style="width: 35%">
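
The relationship between $\tanh$ and the logistic sigmoid, and the difference in slope at $x=0$, can be verified with a short script. This is a sketch under the assumption that $S$ denotes the logistic sigmoid; the function names are illustrative.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic sigmoid S(x) = 1 / (1 + e^-x)."""
    return 1.0 / (1.0 + math.exp(-x))


def tanh_via_sigmoid(x: float) -> float:
    """tanh written as a scaled and shifted sigmoid: 2 * S(2x) - 1."""
    return 2.0 * sigmoid(2.0 * x) - 1.0


def tanh_derivative(x: float) -> float:
    """dz/dx = 1 - z^2, where z = tanh(x)."""
    z = math.tanh(x)
    return 1.0 - z * z


def sigmoid_derivative(x: float) -> float:
    """dz/dx = z * (1 - z), where z = S(x)."""
    z = sigmoid(x)
    return z * (1.0 - z)


for x in (-1.5, 0.0, 0.7):
    print(f"x={x:+.1f}  math.tanh={math.tanh(x):+.6f}  2*S(2x)-1={tanh_via_sigmoid(x):+.6f}")

# Compare the slopes of the two functions at the origin.
print("tanh slope at 0:   ", tanh_derivative(0.0))
print("sigmoid slope at 0:", sigmoid_derivative(0.0))
```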

- $\texttt{ReLU}=\texttt{max(0, x)}$ &mdash; suppresses firing below 0, otherwise echoes the same input to the next layer.
    - Computationally cheaper than the sigmoid functions.
    - Unbounded y-values in the positive direction means the activation can 'blow-up'
    - 'Dying ReLU problem' &mdash; for nodes that output 0, their weight cannot be adjusted during gradient descent since the gradient is 0. A substantial number of nodes in the network can become 'passive' because of that
        - Leaky ReLU aims to mitigate this problem by giving negative inputs a small non-zero slope, so a node's gradient is never exactly 0
        
<table>
    <tr>
        <th style="text-align: center;">ReLU</th>
        <th style="text-align: center;">Leaky ReLU</th>
    </tr>
    <tr>
        <td>            
            <img src="images/relu-graph.png">
        </td>
        <td>
            <img src="images/leaky-relu-graph.png">
        </td>
    </tr>
</table>
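
To make the difference between ReLU and Leaky ReLU concrete, here is a minimal sketch of both functions and their gradients. The negative-side slope `alpha=0.01` is an assumed, commonly used value rather than one given in these notes, and treating the gradient at exactly $x=0$ as belonging to the negative side is a convention chosen for this example.

```python
def relu(x: float) -> float:
    """ReLU: passes positive inputs through unchanged, outputs 0 otherwise."""
    return max(0.0, x)


def relu_grad(x: float) -> float:
    """Gradient of ReLU: 1 for x > 0, else 0 (the source of 'dying' nodes)."""
    return 1.0 if x > 0 else 0.0


def leaky_relu(x: float, alpha: float = 0.01) -> float:
    """Leaky ReLU: like ReLU, but negative inputs are scaled by alpha (assumed 0.01)."""
    return x if x > 0 else alpha * x


def leaky_relu_grad(x: float, alpha: float = 0.01) -> float:
    """Gradient of Leaky ReLU: 1 for x > 0, else alpha, so it is never 0."""
    return 1.0 if x > 0 else alpha


for x in (-3.0, -0.5, 2.0):
    print(
        f"x={x:+.1f}  relu={relu(x):.2f} (grad {relu_grad(x)})  "
        f"leaky={leaky_relu(x):.3f} (grad {leaky_relu_grad(x)})"
    )
```

For negative inputs the ReLU gradient is 0, so the weight update vanishes, while the Leaky ReLU gradient stays at `alpha` and the node can still learn.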



### Activation Functions for the Output Layer:
The choice of activation function in the <em>output layer</em> depends on what prediction is being made.
- $\texttt{sigmoid}$ &mdash; for yes/no probability predictions. 
    - Eg. is this a dog?
- $\texttt{softmax}$ &mdash; for classifications, based on highest probabilities (selecting a single label out of many possible labels). 
    - Eg. does this number look like a 1, a 3 or a 7? 
- No activation function for the output layer &mdash; for non-probability predictions. 
    - Eg. what will the temperature be tomorrow?
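
As an illustration of the softmax case, the sketch below turns raw output-layer values into probabilities and picks the most likely label. It uses the standard definition $\text{softmax}(x)_i = e^{x_i} / \sum_j e^{x_j}$; the labels and raw scores are made-up values for the example.

```python
import math


def softmax(scores):
    """Convert a list of raw scores into probabilities that sum to 1."""
    # Subtracting the maximum does not change the result but avoids overflow.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical raw outputs for the question "is this digit a 1, a 3 or a 7?"
labels = ["1", "3", "7"]
scores = [2.0, 1.0, 0.1]

probs = softmax(scores)
predicted = labels[probs.index(max(probs))]

for label, p in zip(labels, probs):
    print(f"P(digit is {label}) = {p:.3f}")
print("predicted label:", predicted)
```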




<hr />

### Loss Functions:

-  Regression Loss Functions:
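    - __*Mean squared error*__: $E(t, z) = \frac{1}{2}\sum_{i=1}^{m} (t_i-z_i)^2$ &mdash; half the sum of the squared differences between each expected value $t_i$ and the forward-propagation's predicted value $z_i$, over the $m$ terms. Strictly, this is the sum-of-squares form: dividing by $m$ gives the mean, and the factor $\frac{1}{2}$ just simplifies the derivative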

- Binary Classification Loss Functions:
    - __*Cross-entropy*__: $E(t, z) = -(t\log z + (1-t)\log (1-z))$ &mdash; where $t$ is the expected value and $z$ is the forward-propagation's prediction
    - When $t=1$, the second term disappears, leaving $E(t, z) = -\log z$
    - When $t=0$, the first term disappears, leaving $E(t, z) = -\log (1-z)$

- Multi-Class Classification Loss Functions:
    - __*KL-Divergence Loss*__ &mdash; TODO
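
The squared-error and binary cross-entropy losses above can be written directly from their formulas. This is a small sketch with made-up target and prediction values; `binary_cross_entropy` assumes the prediction $z$ lies strictly between 0 and 1, since $\log 0$ is undefined.

```python
import math


def squared_error(targets, outputs):
    """E = 1/2 * sum over i of (t_i - z_i)^2."""
    return 0.5 * sum((t - z) ** 2 for t, z in zip(targets, outputs))


def binary_cross_entropy(t: float, z: float) -> float:
    """E = -(t * log z + (1 - t) * log(1 - z)), for 0 < z < 1."""
    return -(t * math.log(z) + (1.0 - t) * math.log(1.0 - z))


# Regression-style example: two expected values and two predictions.
print("squared error:", squared_error([1.0, 0.0], [0.8, 0.3]))

# Classification example: the loss is small for a confident correct
# prediction and large for a confident wrong one.
print("t=1, z=0.9:", binary_cross_entropy(1.0, 0.9))
print("t=1, z=0.1:", binary_cross_entropy(1.0, 0.1))
print("t=0, z=0.1:", binary_cross_entropy(0.0, 0.1))
```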

#### Error Landscape:
If we think of $E$ as height, then the loss function defines an 'error landscape' over the weight space. The aim of neural network training is then to find a configuration of weights where $E$ takes its global minimum.

We minimise the loss function with the help of an *optimisation algorithm* like __gradient descent__.





### Gradient Descent:

Gradient descent is an optimisation algorithm for finding a local minimum of a differentiable function.

It works by taking steps proportional to the negative of the gradient:

$$w_{new} = w_{old} - \underbrace{\eta \frac{\partial E}{\partial w}}_\text{step}$$

Where $\eta$ is the learning rate &mdash; how big a step each weight update takes, i.e. how fast gradient descent moves towards an optimal weight.
- Higher learning rates can cause gradient descent to overshoot the minimum and 'bounce' up the sides of a convex error landscape. Lower learning rates make training slower and may cause the network to get stuck in a poor local minimum

<table>
    <tr>
        <th style="text-align: center;">Gradient Descent</th>
        <th style="text-align: center;">Fast vs. slow learning rate</th>
    </tr>
    <tr>
        <td>            
            <img src="images/gradient-descent-demo.png">
        </td>
        <td>
            <img src="images/gradient-descent-fast-vs-slow.png">
        </td>
    </tr>
</table>
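
The update rule and the effect of the learning rate can be seen on a one-dimensional toy problem. The sketch below assumes a hypothetical loss $E(w) = (w-3)^2$, with gradient $2(w-3)$ and minimum at $w=3$; the three learning rates are arbitrary values chosen to show slow, reasonable and divergent behaviour.

```python
def gradient_descent(grad, w0, eta, steps):
    """Repeatedly apply w_new = w_old - eta * dE/dw and record every w."""
    w = w0
    history = [w]
    for _ in range(steps):
        w = w - eta * grad(w)
        history.append(w)
    return history


def grad_E(w):
    """Gradient of the toy loss E(w) = (w - 3)^2, whose minimum is at w = 3."""
    return 2.0 * (w - 3.0)


for eta in (0.01, 0.1, 1.1):
    final_w = gradient_descent(grad_E, w0=0.0, eta=eta, steps=20)[-1]
    print(f"eta={eta:<4}  w after 20 steps = {final_w:.3f}")
```

For this particular loss, each step multiplies the distance from $w=3$ by $(1 - 2\eta)$. With $\eta=0.1$ the distance shrinks quickly, with $\eta=0.01$ it shrinks slowly, and with $\eta=1.1$ the factor has magnitude greater than 1, so the iterates overshoot further on every step.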



## Back-Propagation


The aim of back-propagation is to update the weights based on the error function value after forward-propagation. 

It aims to attribute the cause of the error to each node and penalise the node's output weight accordingly.
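
A minimal sketch of this idea for a single sigmoid neuron is shown below. It is not full back-propagation through several layers: it assumes one input $x$, one weight $w$, one bias $b$, the squared-error loss $E = \frac{1}{2}(t-z)^2$, and made-up starting values. The chain rule splits $\frac{\partial E}{\partial w}$ into the loss term, the sigmoid derivative $z(1-z)$, and the input.

```python
import math


def sigmoid(a: float) -> float:
    return 1.0 / (1.0 + math.exp(-a))


def forward(w: float, b: float, x: float) -> float:
    """Forward-propagation: weighted sum plus bias, then the activation."""
    return sigmoid(w * x + b)


def loss(w: float, b: float, x: float, t: float) -> float:
    """Squared error for a single output: E = 1/2 * (t - z)^2."""
    return 0.5 * (t - forward(w, b, x)) ** 2


def backward(w: float, b: float, x: float, t: float):
    """Chain rule: dE/dw = dE/dz * dz/da * da/dw, with a = w*x + b."""
    z = forward(w, b, x)
    dE_dz = z - t            # derivative of 1/2 * (t - z)^2 with respect to z
    dz_da = z * (1.0 - z)    # sigmoid derivative
    return dE_dz * dz_da * x, dE_dz * dz_da  # (dE/dw, dE/db)


def train(x, t, w=0.5, b=0.0, eta=0.5, epochs=100):
    """Gradient descent on w and b using the gradients from backward()."""
    for _ in range(epochs):
        dE_dw, dE_db = backward(w, b, x, t)
        w -= eta * dE_dw
        b -= eta * dE_db
    return w, b


x, t = 1.0, 1.0  # hypothetical training example
print("loss before training:", loss(0.5, 0.0, x, t))
w, b = train(x, t)
print("loss after training: ", loss(w, b, x, t))
```

In a multi-layer network the same chain-rule step is repeated layer by layer, passing the error term backwards from the output towards the input.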




Given a function $f(x)$, an optimisation algorithm helps to either minimise or maximise the value of $f(x)$. In the context of deep learning, we use optimisation algorithms to train the neural network by minimising the cost function $J$.

Gradient descent is one such optimisation algorithm.