## Introduction
Deep learning is a subset of machine learning, which is a subset of artificial intelligence.

<table>
    <tr>
        <th style="text-align: center;">AI subfields</th>
        <th style="text-align: center;">A neural network</th>
    </tr>
    <tr>
        <td>            
            <img src="images/ai-subsets.png">
        </td>
        <td>
            <img src="images/neural-net.png">
            <p style="text-align: left;">
                As a biological brain analogue, each node represents a neuron and each edge represents a synaptic connection
            </p>
        </td>
    </tr>
</table>

Deep learning focuses on a particular class of machine learning algorithms called neural networks &mdash; deep neural networks in particular.

## Activation Functions (Nonlinearities):
An activation function is a function associated with a given node which takes in the weighted sum (plus bias) and maps it to another number. Activation functions mimic the firing of neurons in biological neural networks. 

Activation functions should ideally have the following properties:
- Continuous and differentiable at (almost) every $x$ in the real numbers. ReLU, for example, is not differentiable at $x=0$ but is still widely used
- *Injective* &mdash; each distinct input maps to a distinct output. Simple tests: check that the function is strictly monotonic (for example, its derivative is positive everywhere) or use the horizontal line test
- *Non-linear* &mdash; not just a straight line. ReLU is piecewise linear, but the kink at $x=0$ makes it non-linear overall. Having a non-linear function is necessary for <em>conditional correlation</em>
- Efficient to compute &mdash; because the activation function could be called a massive number of times

The purpose of activation functions is to introduce non-linearity to the output of a neuron.

- Non-linearity can be thought of as “the outcome does not change in proportion to a change in any of the inputs”


### Activation Functions for the Hidden Layer:
- $\texttt{logistic sigmoid}$: $\frac{1}{1+e^{-x}}$ &mdash; maps weighted sums to a value in the interval $(0, 1)$
    - __Derivative:__ suppose $z=\frac{1}{1+e^{-x}}$, then $\frac{dz}{dx}=z(1-z)$
    - This lets you interpret the output of a neuron as a probability
    
<img src="images/sigmoid.png" style="width: 35%">

- $\texttt{tanh(x)}=\frac{e^{x}-e^{-x}}{e^{x}+e^{-x}}=\frac{2}{1+e^{-2x}} - 1=2 \cdot S(2x) - 1$, where $S$ is the logistic sigmoid &mdash; basically a scaled and shifted version of the logistic sigmoid. It maps weighted sums to a value in the interval $(-1, 1)$.
    - __Derivative:__ suppose $z=\tanh(x)$, then $\frac{dz}{dx}=1-z^2$ 
    - $\texttt{tanh}$ can give a measure of negative correlation, rather than *just* positive correlation in the case of logistic sigmoid. It generally outperforms sigmoid for hidden layers because of its ability to measure negative correlation
    - $\texttt{tanh}$ has a steeper gradient than the logistic sigmoid near $x=0$ (a slope of 1 versus 0.25)

<img src="images/tanh.png" style="width: 35%">

- $\texttt{ReLU}=\texttt{max(0, x)}$ &mdash; suppresses firing below 0, otherwise echoes the same input to the next layer.
    - Computationally cheaper than the sigmoid functions.
    - Unbounded y-values in the positive direction means the activation can 'blow-up'
    - 'Dying ReLU problem' &mdash; for nodes that output 0, their weight cannot be adjusted during gradient descent since the gradient is 0. A substantial number of nodes in the network can become 'passive' because of that
        - Leaky ReLU aims to mitigate this problem by ensuring nodes always have a non-zero gradient
        
<table>
    <tr>
        <th style="text-align: center;">ReLU</th>
        <th style="text-align: center;">Leaky ReLU</th>
    </tr>
    <tr>
        <td>            
            <img src="images/relu-graph.png">
        </td>
        <td>
            <img src="images/leaky-relu-graph.png">
        </td>
    </tr>
</table>
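
The hidden-layer activations above and their derivatives can be sketched in plain Python. This is a minimal illustration rather than a reference implementation: the function names are hypothetical, and the leaky ReLU slope of `0.01` is an assumed, commonly used value that the notes do not specify.

```python
import math

def sigmoid(x):
    """Logistic sigmoid: maps any real x into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_prime(x):
    """Derivative of the sigmoid, written as z(1 - z) with z = sigmoid(x)."""
    z = sigmoid(x)
    return z * (1.0 - z)

def tanh_prime(x):
    """Derivative of tanh, written as 1 - z^2 with z = tanh(x)."""
    z = math.tanh(x)
    return 1.0 - z * z

def relu(x):
    """ReLU: suppresses negative inputs, echoes positive ones."""
    return max(0.0, x)

def leaky_relu(x, slope=0.01):
    """Leaky ReLU; `slope` (assumed value) keeps a small gradient for x <= 0."""
    return x if x > 0 else slope * x

def leaky_relu_prime(x, slope=0.01):
    """Gradient of leaky ReLU: never exactly zero."""
    return 1.0 if x > 0 else slope

for x in (-2.0, 0.0, 2.0):
    print(x, sigmoid(x), math.tanh(x), relu(x), leaky_relu(x))
```

Note that `math.exp(-x)` overflows for very large negative `x`, so a production version would need a numerically safer formulation.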



### Activation Functions for the Output Layer:
The choice of activation function in the <em>output layer</em> depends on what prediction is being made.
- $\texttt{sigmoid}$ &mdash; for yes/no probability predictions. 
    - Eg. is this a dog?
- $\texttt{softmax}$ &mdash; for classifications, based on highest probabilities (selecting a single label out of many possible labels). 
    - Eg. does this number look like a 1, a 3 or a 7?
    - $\texttt{softmax}$ serves as a normalising function for when the network needs to classify an input in two or more classes
- No activation function for the output layer &mdash; for non-probability predictions. 
    - Eg. what will the temperature be tomorrow?
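
As a rough sketch of how softmax normalises raw output-layer values into class probabilities, consider the following. The scores and labels are made-up illustrative values, and subtracting the maximum score is a standard numerical safeguard rather than something the notes prescribe.

```python
import math

def softmax(scores):
    """Turn a list of raw scores into probabilities that sum to 1."""
    shift = max(scores)  # subtracting the max avoids overflow in exp
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

labels = [1, 3, 7]          # hypothetical digit labels
scores = [2.0, 1.0, 0.1]    # hypothetical raw outputs, one per label
probs = softmax(scores)

# The predicted label is the one with the highest probability.
predicted = labels[probs.index(max(probs))]
print(probs, predicted)
```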




<hr />

### Loss Functions:
A loss function measures how closely the prediction matches the true label.

-  Regression Loss Functions:
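    - __*Mean squared error*__: $E(t, z) = \frac{1}{2}\sum_{i=1}^{m} (t_i-z_i)^2$ &mdash; half the sum of the squared differences between each expected value $t_i$ and the forward-propagation's predicted value $z_i$. Strictly, this is a sum of squared errors; dividing by $m$ instead of 2 gives the mean. The factor $\frac{1}{2}$ cancels when differentiating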

- Binary Classification Loss Functions:
    - __*Cross-entropy*__: $E(t, z) = -(t\log z + (1-t)\log (1-z))$ &mdash; where $t$ is the expected value and $z$ is the forward-propagation's prediction
    - When $t=1$, the second term disappears, leaving $E(t, z) = -\log z$
    - When $t=0$, the first term disappears, leaving $E(t, z) = -\log (1-z)$

- Multi-Class Classification Loss Functions:
    - __*KL-Divergence Loss*__ &mdash; TODO
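
The regression and binary classification losses above could be written along these lines. The function names and sample numbers are hypothetical, and the small `eps` clamp is an added safeguard (an assumption, not part of the formulas) so that `log` never receives exactly 0.

```python
import math

def half_squared_error(targets, outputs):
    """Half the sum of squared differences, matching the 1/2-sum formula."""
    return 0.5 * sum((t - z) ** 2 for t, z in zip(targets, outputs))

def binary_cross_entropy(t, z, eps=1e-12):
    """Cross-entropy for one example with target t in {0, 1} and prediction z."""
    z = min(max(z, eps), 1.0 - eps)  # clamp (assumed safeguard) away from 0 and 1
    return -(t * math.log(z) + (1 - t) * math.log(1 - z))

print(half_squared_error([1.0, 2.0], [1.5, 1.0]))
print(binary_cross_entropy(1, 0.9))  # confident and correct: small loss
print(binary_cross_entropy(1, 0.1))  # confident and wrong: large loss
```

The last two calls show the asymmetry that makes cross-entropy useful: a confident wrong prediction is penalised far more heavily than a confident correct one is rewarded.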

#### Cost Function
Cost function: $J(w, b) = \frac{1}{m}\sum_{i=1}^m E(t^{(i)}, z^{(i)})$ &mdash; the average of the loss values over all training samples $1 \leq i \leq m$, where $w$ and $b$ are the weight and bias.
- The loss function measures how well the network performed on a single training example
- The cost function measures how well the network's parameters $w$ and $b$ performed on the entire training set
- Eg. if our loss function was cross-entropy, then the cost function would be: $J(w, b) = -\frac{1}{m}\sum_{i=1}^m \big( t^{(i)}\log z^{(i)} + (1-t^{(i)}) \log (1-z^{(i)}) \big)$.
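
To make the dependence of $J$ on $w$ and $b$ concrete, here is a small sketch that assumes the simplest possible 'network': a single sigmoid neuron with one input, so that $z^{(i)} = \sigma(w x^{(i)} + b)$. That model, the function names and the data are all illustrative assumptions.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def cross_entropy(t, z):
    """Loss for a single training example."""
    return -(t * math.log(z) + (1 - t) * math.log(1 - z))

def cost(w, b, inputs, targets):
    """J(w, b): the average cross-entropy loss over the whole training set."""
    m = len(inputs)
    losses = [cross_entropy(t, sigmoid(w * x + b))
              for x, t in zip(inputs, targets)]
    return sum(losses) / m

inputs = [0.5, -1.0, 2.0]   # hypothetical training inputs
targets = [1, 0, 1]         # hypothetical labels

print(cost(0.0, 0.0, inputs, targets))  # every prediction is 0.5, so J = log 2
print(cost(2.0, 0.0, inputs, targets))  # better parameters give a lower cost
```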


The 'learning' of a neural network is the process of finding $w$ and $b$ so as to minimise $J$. This is done through an optimisation algorithm such as __*gradient descent*__.


#### Error/Cost Landscape:
If we think of $E$ as height, then the loss function defines an 'error landscape' over the weight space. The aim of neural network training is then to find a configuration of weights where $E$ takes its global minimum.


<table>
    <tr>
        <th style="text-align: center;">Error landscape</th>
        <th style="text-align: center;">Cost landscape</th>
    </tr>
    <tr>
        <td>            
            <img src="images/error-landscape.png">
        </td>
        <td>
            <img src="images/cost-landscape.png">
            <div style="text-align: left;">
                Here $w$ and $b$ are single real numbers, just to make the landscape possible to visualise. In practice, $w$ would be higher dimensional.
            </div>
        </td>
    </tr>
</table>

The partial derivative $\frac{\partial J(w, b)}{\partial w}$ gives the slope along the $w$ axis, while $\frac{\partial J(w, b)}{\partial b}$ gives the slope along the $b$ axis. When you 'nudge' $w$ a little bit, the derivative $\frac{\partial J(w, b)}{\partial w}$ tells you how much $J(w, b)$ changes.
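
The 'nudge' interpretation can be checked numerically with a central finite difference. The bowl-shaped cost below is a made-up stand-in for a real $J(w, b)$, chosen so the exact slopes are easy to verify by hand: $\frac{\partial J}{\partial w} = 2(w - 3)$ and $\frac{\partial J}{\partial b} = 2(b + 1)$.

```python
def J(w, b):
    # Hypothetical cost surface with its minimum at w = 3, b = -1
    return (w - 3.0) ** 2 + (b + 1.0) ** 2

def slope_along_w(cost, w, b, nudge=1e-6):
    """Approximate dJ/dw by nudging w slightly in both directions."""
    return (cost(w + nudge, b) - cost(w - nudge, b)) / (2.0 * nudge)

def slope_along_b(cost, w, b, nudge=1e-6):
    """Approximate dJ/db by nudging b slightly in both directions."""
    return (cost(w, b + nudge) - cost(w, b - nudge)) / (2.0 * nudge)

dJ_dw = slope_along_w(J, 1.0, 0.0)  # exact value is 2 * (1 - 3) = -4
dJ_db = slope_along_b(J, 1.0, 0.0)  # exact value is 2 * (0 + 1) = 2
print(dJ_dw, dJ_db)
```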




### Gradient Descent:

Gradient descent is an optimisation algorithm for finding a local minimum of a differentiable function. 

It works by repeatedly taking steps proportional to the negative of the gradient, until the weight converges on a minimum (which may be local rather than global):

$$w_{new} = w_{old} - \underbrace{\eta \frac{\partial E}{\partial w}}_\text{step}$$

Where $\eta$ is the learning rate &mdash; how big a step the weight update should take, or how fast gradient descent moves towards an optimal weight.
- Subtracting the derivative term (the slope value), $\frac{\partial E}{\partial w}$, always pushes the weight in the 'downhill' direction

- Higher learning rates can cause gradient descent to overshoot and 'bounce' up the convex error landscape. Lower learning rates slow down training and may cause the network to get stuck before reaching a good minimum
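
A minimal sketch of the update rule and the effect of the learning rate, using a made-up one-dimensional error $E(w) = (w - 3)^2$ whose gradient is $2(w - 3)$. The function names, starting weight and the three learning rates are illustrative assumptions.

```python
def gradient_descent(grad, w, learning_rate, steps):
    """Repeatedly apply w_new = w_old - learning_rate * dE/dw."""
    for _ in range(steps):
        w = w - learning_rate * grad(w)
    return w

def grad_E(w):
    # Gradient of the hypothetical error E(w) = (w - 3)^2, minimised at w = 3
    return 2.0 * (w - 3.0)

# Too small: barely moves. Moderate: converges. Too large: overshoots and diverges.
for eta in (0.001, 0.1, 1.1):
    print(eta, gradient_descent(grad_E, 0.0, eta, 50))
```

For this particular error, each step multiplies the distance to the minimum by $(1 - 2\eta)$, which is why a learning rate above 1 makes the weight bounce further away on every step.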

<table>
    <tr>
        <th style="text-align: center;">Gradient Descent</th>
        <th style="text-align: center;">Fast vs. slow learning rate</th>
    </tr>
    <tr>
        <td>            
            <img src="images/gradient-descent-demo.png">
        </td>
        <td>
            <img src="images/gradient-descent-fast-vs-slow.png">
        </td>
    </tr>
</table>



### Optimisers:

Given a function $f(x)$, an optimisation algorithm helps to either minimise or maximise the value of $f(x)$. In deep learning, we use optimisers to train the neural network by minimising the error/cost function.

<table>
    <tr>
        <td width="50%">   
            <p style="text-align: center;">
                <strong>A few standard optimisers</strong>
            </p>
            <img src="images/optimisers.png">
        </td>
        <td>
            <p style="text-align: center;">
                <strong>Different optimisers traversing the error landscape</strong>
            </p>
            <img src="images/optimisers-visualised.gif">
        </td>
    </tr>
</table>

The best optimiser is the one that traverses the error landscape of the specific problem best. The choice of optimiser is mostly made empirically rather than mathematically. 


#### Stochastic Gradient Descent:

Stochastic gradient descent is when we update the weights after each training example, rather than after a whole epoch as in vanilla gradient descent.
- The problem with this is that a noisy training sample can steer the jump away from the optimum.
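
The difference between the two update schedules might look like this for a one-weight linear model $z = wx$ with the half squared error loss. The model, data, learning rate and function names are all assumptions made for illustration; the data is noise-free, so both variants settle near $w = 2$.

```python
import random

def stochastic_gd(xs, ts, w=0.0, learning_rate=0.05, epochs=20, seed=0):
    """Update w after every single training example."""
    rng = random.Random(seed)
    order = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(order)                 # visit examples in random order
        for i in order:
            grad = (w * xs[i] - ts[i]) * xs[i]   # d/dw of 1/2 (t - w x)^2
            w -= learning_rate * grad
    return w

def batch_gd(xs, ts, w=0.0, learning_rate=0.05, epochs=20):
    """Update w once per epoch, using the gradient averaged over all examples."""
    m = len(xs)
    for _ in range(epochs):
        grad = sum((w * x - t) * x for x, t in zip(xs, ts)) / m
        w -= learning_rate * grad
    return w

xs = [1.0, 2.0, 3.0, 4.0]
ts = [2.0, 4.0, 6.0, 8.0]   # generated by t = 2x, so the ideal weight is 2

print(stochastic_gd(xs, ts), batch_gd(xs, ts))
```

Perturbing one of the targets (say, changing `8.0` to `12.0`) is a quick way to see how a single noisy sample pulls the stochastic updates around.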



#### Stochastic Gradient Descent + Momentum:
One way to decrease the noise associated with stochastic gradient descent is to add momentum. With this, it can pay less attention to the occasional noisy samples that throw it off.


- *Coefficient of momentum* &mdash; the fraction of the gradient from previous iterations that is retained. The gradients of more recent steps have a greater weighting.
    - Usually initialised as $0.5$ and may be changed over later epochs. It is a separate hyperparameter from the learning rate $\eta$

<img src="images/sgd-momentum.png" width="70%">

Since the steps taken can get larger and larger, it may be easy to overshoot a minimum.




Momentum takes into account the gradient of previous steps.

Momentum helps gradient descent converge quicker towards one direction and dampens oscillations.
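
One common way to write the momentum update keeps a running 'velocity' that blends the previous step with the new gradient. The sketch below assumes that classical formulation, a made-up error $E(w) = (w - 3)^2$, and hypothetical names; the full gradient is used instead of per-sample gradients to keep the example deterministic.

```python
def descend_with_momentum(grad, w, learning_rate=0.1, momentum=0.5, steps=100):
    """Gradient descent where each step retains a fraction of the previous step."""
    velocity = 0.0
    for _ in range(steps):
        # Keep `momentum` of the previous step, then add the new gradient step.
        velocity = momentum * velocity - learning_rate * grad(w)
        w = w + velocity
    return w

def grad_E(w):
    # Gradient of the hypothetical error E(w) = (w - 3)^2
    return 2.0 * (w - 3.0)

print(descend_with_momentum(grad_E, 0.0))
```

Setting `momentum=0.0` recovers the plain gradient descent update, which makes the role of the coefficient easy to isolate.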


#### Stochastic Gradient Descent + Momentum + Acceleration:



#### Mini-Batch Gradient Descent:

Mini-batch gradient descent &mdash; updating the weights after each small batch of training examples (more than 1, but fewer than the full training set), rather than after a whole epoch or after a single training example.
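
A rough sketch of the batching logic, again assuming a one-weight linear model $z = wx$ with the half squared error loss. The helper names, batch size and data are illustrative, and real implementations usually shuffle the data before splitting it into batches.

```python
def minibatches(xs, ts, batch_size):
    """Yield consecutive (inputs, targets) slices of at most batch_size examples."""
    for start in range(0, len(xs), batch_size):
        yield xs[start:start + batch_size], ts[start:start + batch_size]

def minibatch_gd(xs, ts, w=0.0, learning_rate=0.05, batch_size=2, epochs=50):
    """Update w once per mini-batch, using the gradient averaged over the batch."""
    for _ in range(epochs):
        for batch_x, batch_t in minibatches(xs, ts, batch_size):
            grad = sum((w * x - t) * x
                       for x, t in zip(batch_x, batch_t)) / len(batch_x)
            w -= learning_rate * grad
    return w

xs = [1.0, 2.0, 3.0, 4.0]
ts = [2.0, 4.0, 6.0, 8.0]   # generated by t = 2x, so the ideal weight is 2

print(minibatch_gd(xs, ts))
```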


#### NAG &mdash; Nesterov Accelerated Gradient:


#### Adagrad &mdash; Adaptive Gradient:


#### Adadelta:


#### RMSProp:


#### Adam &mdash; Adaptive Moment Optimisation:
Combines momentum and RMSProp?



### Computational Graph

Fundamentally, all neural networks are just a single big mathematical function.

When trying to find a neural network that works for a particular application, you're implicitly saying 'there exists some mathematical function that will reasonably approximate the observed behaviour'. The training of neural networks is to find this approximating function.

A computational graph is a way of representing mathematical functions as a directed graph.

#### Representing Functions:

Each node is either an input node, or a function node. Function nodes take in input and produce an output.

<table>
    <tr>
        <th style="text-align: center;">Representing $f(x, y, z)=(x+y) \cdot z$</th>
        <th style="text-align: center;">Representing $f(x, y)=ax^2+bxy+cy^2$</th>
    </tr>
    <tr>
        <td>            
            <img src="images/simple-computational-graph.png">
        </td>
        <td>
            <img src="images/another-computational-graph.png">
            <p style="text-align: left;">
                This could technically be considered a neural network. It's possible to train it with gradient descent and backprop and tune $a, b$ and $c$ if we had a training dataset
            </p>
        </td>
    </tr>
</table>

#### Representing Neural Networks:
Even simple neural networks may have hundreds of thousands of nodes and edges in their computational graph, which would be impractical to write out in standard function notation.

Each node in a neural network encapsulates smaller function nodes for: performing a weighted sum (plus bias), then computing a non-linear activation function.

<img src="images/neural-node-computational-graph.png" width="50%" />

With deep neural nets, we could have millions of individual weights, plus thousands of biases to individually tune. This is why a massive dataset and a massive amount of computing power are required.
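
A single node's two internal steps, the weighted sum (plus bias) followed by a non-linear activation, could be sketched like this. The function name, the default choice of `tanh` and the sample values are illustrative assumptions.

```python
import math

def neuron(inputs, weights, bias, activation=math.tanh):
    """One network node: weighted sum plus bias, then an activation function."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias   # weighted sum node
    return activation(z)                                      # activation node

# 0.5 * 1.0 + (-0.25) * 2.0 + 0.0 = 0.0, and tanh(0) = 0
print(neuron([1.0, 2.0], [0.5, -0.25], 0.0))

# Any other activation can be swapped in, for example ReLU:
print(neuron([1.0, 2.0], [1.0, 1.0], -5.0, activation=lambda z: max(0.0, z)))
```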


## Back-Propagation


The aim of back-propagation + gradient descent is to update the weights based on the error function value after forward-propagation. The aim is to attribute the *cause of the error*, ie. to assign a proportionate magnitude of 'blame' or 'praise', to each node to penalise/boost their output weight accordingly.

Gradient descent requires the concrete values of $\frac{\partial E}{\partial w_{ij}}$ &mdash; the amount $E$ changes when you nudge $w_{ij}$ &mdash; to be known for each weight in the network. Backpropagation is the algorithm used to calculate these gradients; it does not do anything more than that.
    

Gradients are calculated from the last layer to the first layer. Partial computation of the gradient from one layer is reused in the computation of the gradients in the previous layer.

- The reuse of calculated gradients from each layer is what makes backpropagation more efficient than the naive approach of individually calculating the gradients $\frac{\partial E}{\partial w_{ij}}$ for all $i, j$.



 #### Visualised
 

 

<img src="images/backprop-simple.png" width="100%">



$$
\frac{\partial C}{\partial w^{(3)}}
    =
    \frac{\partial C}{\partial a^{(3)}}
    \frac{\partial a^{(3)}}{\partial z^{(3)}}
    \frac{\partial z^{(3)}}{\partial w^{(3)}}
\tag{Weights from layer 2 to 3}
$$

$$
\frac{\partial C}{\partial w^{(2)}}
    =
    \underbrace{
    \frac{\partial C}{\partial a^{(3)}}
    \frac{\partial a^{(3)}}{\partial z^{(3)}}
    }_\text{From $w^{(3)}$}
    \,
    \frac{\partial z^{(3)}}{\partial a^{(2)}}
    \frac{\partial a^{(2)}}{\partial z^{(2)}}
    \frac{\partial z^{(2)}}{\partial w^{(2)}}
$$

$$
\frac{\partial C}{\partial w^{(1)}}
    =
    \underbrace{
    \frac{\partial C}{\partial a^{(3)}}
    \frac{\partial a^{(3)}}{\partial z^{(3)}}
    }_\text{From $w^{(3)}$}
    \,
    \underbrace{
    \frac{\partial z^{(3)}}{\partial a^{(2)}}
    \frac{\partial a^{(2)}}{\partial z^{(2)}}
    }_\text{From $w^{(2)}$}
    \,
    \frac{\partial z^{(2)}}{\partial a^{(1)}}
    \frac{\partial a^{(1)}}{\partial z^{(1)}}
    \frac{\partial z^{(1)}}{\partial w^{(1)}}
$$
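
The three chain-rule expressions above can be traced in code for a chain of three single-neuron layers. This sketch assumes sigmoid activations, $z^{(l)} = w^{(l)} a^{(l-1)} + b^{(l)}$, and the cost $C = \frac{1}{2}(a^{(3)} - t)^2$; none of these choices are fixed by the notes, and the names and numbers are hypothetical. The running `delta` is the reused product marked by the underbraces, and a finite-difference check is included as a sanity test.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward(x, weights, biases):
    """Return the activations [a0, a1, a2, a3], where a0 is the input x."""
    activations = [x]
    for w, b in zip(weights, biases):
        activations.append(sigmoid(w * activations[-1] + b))
    return activations

def cost(x, t, weights, biases):
    return 0.5 * (forward(x, weights, biases)[-1] - t) ** 2

def backprop(x, t, weights, biases):
    """Return dC/dw for each layer, computed from the last layer to the first."""
    a = forward(x, weights, biases)
    grads = [0.0] * len(weights)
    # delta = dC/da * da/dz at the output layer
    delta = (a[-1] - t) * a[-1] * (1.0 - a[-1])
    for k in reversed(range(len(weights))):
        grads[k] = delta * a[k]          # dz/dw is the activation feeding layer k
        if k > 0:
            # Reuse delta: extend it by dz/da (the weight) and da/dz (sigmoid slope).
            delta = delta * weights[k] * a[k] * (1.0 - a[k])
    return grads

def numerical_gradient(x, t, weights, biases, k, nudge=1e-6):
    """Finite-difference estimate of dC/dw for layer k, used only as a check."""
    up, down = list(weights), list(weights)
    up[k] += nudge
    down[k] -= nudge
    return (cost(x, t, up, biases) - cost(x, t, down, biases)) / (2.0 * nudge)

weights, biases = [0.5, -1.2, 0.8], [0.1, 0.0, -0.3]   # hypothetical parameters
analytic = backprop(0.7, 1.0, weights, biases)
numeric = [numerical_gradient(0.7, 1.0, weights, biases, k) for k in range(3)]
print(analytic)
print(numeric)
```

The loop never recomputes the output-layer terms for earlier layers; it only multiplies the existing `delta` by the two new factors, which is the reuse that makes backpropagation efficient.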

### Resources:

- Introduction:
    - <a href="https://medium.com/tebs-lab/introduction-to-deep-learning-a46e92cb0022">Intro to deep learning</a>
- <a href="https://medium.com/datadriveninvestor/overview-of-different-optimizers-for-neural-networks-e0ed119440c3">Types of optimisers</a>
- Computation graph: 
    - <a href="https://medium.com/tebs-lab/deep-neural-networks-as-computational-graphs-867fcaa56c9#:~:text=A%20computational%20graph%20is%20a,or%20functions%20for%20combining%20values.">Basics of computational graphs</a>