## What's so important about backprop

Backprop is a cross-disciplinary computational tool (it has been rediscovered dozens of times in different fields)

**Speeds up** training of neural networks with gradient descent by a factor of roughly 10 million

## What is backprop

Mathematical technique for quickly calculating derivatives

Application of *reverse-mode differentiation* (the application-independent name for backprop) to neural networks

## Gradients

The gradient is
- a direction
- a rate of change

Both of these are valuable
- direction that reduces error
- largest (steepest) rate of change to reduce error

The change is always in the context of something else 
- we are interested in how **error changes with respect to our parameters**

## Calculus 101

If I have a function

$$ f(x) = x^2 $$

The algorithm to find the derivative is known as the **power rule**
1. multiply by the power
2. subtract one from the power

$$f'(x) = 2x $$

More examples

$$ f(x) = 3x^4 + 2x^{2} \rightarrow f'(x) = 12x^3 + 4x $$

If I take the derivative of a constant, it is always zero

If I take the derivative of a term that doesn't depend on the variable I'm differentiating with respect to, that is also zero
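
One way to sanity check a derivative worked out by hand is to compare it with a finite difference. The sketch below does this for the polynomial example above; the helper name `central_difference`, the step `h` and the test points are illustrative choices, not part of the original notes.

```python
def f(x):
    """The example polynomial f(x) = 3x^4 + 2x^2."""
    return 3 * x**4 + 2 * x**2


def f_prime(x):
    """Derivative from the power rule: f'(x) = 12x^3 + 4x."""
    return 12 * x**3 + 4 * x


def central_difference(func, x, h=1e-5):
    """Approximate the slope of func at x from two nearby points."""
    return (func(x + h) - func(x - h)) / (2 * h)


# Compare the hand-derived derivative with the numerical estimate
for x in [-2.0, 0.0, 1.5]:
    analytic = f_prime(x)
    numeric = central_difference(f, x)
    print(f"x={x:5.1f}  power rule={analytic:10.4f}  finite difference={numeric:10.4f}")
```

The two columns should agree to several decimal places; a large gap usually means a mistake in the hand derivation.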

## Notation

Calculus has a number of competing notations

$$ \nabla_{x} f = f' = \frac{df}{dx} $$

## Minimizing a function

We want to find the minimum of a simple function 

$$ f(x) = x^2 $$

We can **sample data** from this function
- it is clear where the minimum is (but let's pretend it isn't!)

| x  | f(x) |
|----|------|
| -5 | 25   |
| -2 | 4    |
| 0  | 0    |
| 2  | 4    |
| 5  | 25   |

Taking the derivative of this function

$$ f'(x) = \frac{df}{dx} = 2x $$

| x  | f(x) | f'(x) |
|----|------|-------|
| -5 | 25   | -10   |
| -2 | 4    | -4    |
| 0  | 0    | 0     |
| 2  | 4    | 4     |
| 5  | 25   | 10    |

This shows us the value of a gradient 
- it shows us the direction towards maximizing a function (we take the negative to minimize it)
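
The idea of following the negative gradient can be sketched as a short loop. The starting point `x = 5.0`, the `step_size` of 0.1 and the 50 iterations are assumptions made for illustration only.

```python
def f(x):
    return x**2


def f_prime(x):
    return 2 * x


x = 5.0          # assumed starting point (one of the sampled values)
step_size = 0.1  # assumed step size, small enough that the steps do not overshoot
history = [x]

for _ in range(50):
    # Move against the gradient, because we want to minimize f
    x = x - step_size * f_prime(x)
    history.append(x)

print(f"final x = {x:.6f}, f(x) = {f(x):.10f}")
```

Each step shrinks `x` towards zero, where the gradient is zero and the function is at its minimum.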

## Fitting a function

The example above shows how we can use gradients to find the minimum of a function

In machine learning, the function we want to minimise is an **error or loss function**

This error is the difference between two functions
- a function that we parametrize $f(x; \theta)$ with weights $\theta$
- a function that we want to learn $F(x)$

We don't have access to $F(x)$ (if we did, we wouldn't need to learn it) - what we do have access to is the ability to sample:
- $x$ - features (inputs)
- $y$ - target / label (outputs)

So we have three things:
1. a function parametrized by weights $\theta$
2. samples of $x$ (features)
3. samples of $y$ (target)

Returning to our simple example with our updated nomenclature

A parameterized function with a single feature

$$ f(x; \theta) = x^2 \cdot \theta_{0} $$

Samples

| x  | y  |
|----|----|
| -2 | 12 |
| 0  | 0  |
| 1  | 3  |

Mean square error

$$E = \frac{1}{2} (f(x; \theta) - y)^2 = \frac{1}{2} (x^2 \cdot \theta_{0} - y)^2 $$

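Derivative of the error (the square brings down the bracket, and the inner term $x^2 \cdot \theta_{0}$ contributes a factor of $x^2$)

$$E' = \frac{dE}{d\theta_{0}} = (x^2 \cdot \theta_{0} - y) \cdot x^2 $$

We can now perform an iterative process to update our parameter $\theta_{0}$ - starting from an initial $\theta_{0} = 0 $

| x  | y  | E'  |
|----|----|-----|
| -2 | 12 | -48 |
| 0  | 0  | 0   |
| 1  | 3  | -3  |

How do we update our parameter?
- let's average over the three samples

$$E' = -51 / 3 = -17$$

As we are minimizing the error, we step in the direction of the negative gradient. A full step would overshoot (it would take $\theta_{0}$ from $0$ to $17$), so we scale the gradient by a small step size $\alpha$ (here $\alpha = 0.1$, chosen for illustration):

$$\theta_{0} \leftarrow \theta_{0} - \alpha E' = 0 - 0.1 \cdot (-17) = 1.7 $$

Which is a step towards the true value of $3.0$ - repeating the update moves $\theta_{0}$ closer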
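
A minimal sketch of this fitting loop in Python. It assumes the gradient obtained by differentiating $E$ with respect to $\theta_{0}$, which is $(x^2 \cdot \theta_{0} - y) \cdot x^2$, and a step size of 0.1. The names `fit`, `error_gradient` and `step_size` are illustrative and not from the original notes.

```python
# Samples (x, y) from the table above
samples = [(-2.0, 12.0), (0.0, 0.0), (1.0, 3.0)]


def model(x, theta):
    """Parameterized function f(x; theta) = x^2 * theta."""
    return x**2 * theta


def error_gradient(x, y, theta):
    """dE/dtheta for E = 0.5 * (x^2 * theta - y)^2."""
    return (model(x, theta) - y) * x**2


def fit(steps, step_size=0.1, theta=0.0):
    """Repeat the averaged gradient update `steps` times (step_size is an assumed value)."""
    for _ in range(steps):
        mean_gradient = sum(error_gradient(x, y, theta) for x, y in samples) / len(samples)
        theta = theta - step_size * mean_gradient  # step against the gradient
    return theta


for steps in [1, 2, 5, 20]:
    print(f"after {steps:2d} updates: theta = {fit(steps):.4f}")
```

With this step size the estimate moves steadily towards 3.0; a step size that is too large would overshoot instead.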

## Practical

Do this (on paper!) for $f(x) = 5 x^3$

## Partial derivatives

In the example above we used our prior knowledge of the true function $F(x)$ to engineer a single feature $x^2$ 
- this is known as inductive bias - use it if you can (it's one reason why convolution works so well!)

What happens if we don't encode this knowledge, and instead have multiple parameters?

$$ f(x;\theta) = \theta_{0} x^2 + \theta_{1} x + \theta_{2} $$

We now need partial derivatives

$$ \frac{\partial f}{\partial \theta_{0}} = x^2 $$

$$ \frac{\partial f}{\partial \theta_{1}} = x $$

$$ \frac{\partial f}{\partial \theta_{2}} = 1 $$

Let's pick some parameters

$$\theta = [1,1,1]$$
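
The partial derivatives above can be checked by nudging one parameter at a time while holding the others fixed. This is a sketch only: the input `x = 2.0` is a hypothetical value (deliberately not one from the practical below), and `numeric_partial` is an illustrative helper name.

```python
def model(x, theta):
    """f(x; theta) = theta_0 * x^2 + theta_1 * x + theta_2."""
    return theta[0] * x**2 + theta[1] * x + theta[2]


def partials(x):
    """Analytic partial derivatives of f wrt theta_0, theta_1, theta_2."""
    return [x**2, x, 1.0]


def numeric_partial(x, theta, index, h=1e-6):
    """Nudge a single parameter by h and measure how much f changes."""
    bumped = list(theta)
    bumped[index] += h
    return (model(x, bumped) - model(x, theta)) / h


theta = [1.0, 1.0, 1.0]
x = 2.0  # hypothetical input

for i, analytic in enumerate(partials(x)):
    numeric = numeric_partial(x, theta, i)
    print(f"df/dtheta_{i}: analytic={analytic:.4f}  numeric={numeric:.4f}")
```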

## Practical

Calculate the partial derivatives and update the parameters using the following data

| x  | y  |
|----|----|
| -2 | -5 |
| 0  | 1  |
| 1  | 4  |

## The linear perceptron

If we have a feature vector of length 3 

$$x = [x_{0}, x_{1}, x_{2}]$$

Let's give each feature its own parameter 

$$ \theta = [\theta_{0}, \theta_{1}, \theta_{2}] $$

And combine them together using a linear combination

$$ f(x; \theta) = x_{0} \cdot \theta_{0} + x_{1} \cdot \theta_{1} + x_{2} \cdot \theta_{2} $$

$$ f(x; \theta) = \sum_{i} x_{i} \cdot \theta_{i} $$
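
As a sketch, the linear combination is a sum of elementwise products (a dot product). The feature and parameter values below are hypothetical.

```python
def linear(x, theta):
    """Linear perceptron: multiply each feature by its parameter and sum."""
    return sum(x_i * theta_i for x_i, theta_i in zip(x, theta))


x = [1.0, 2.0, 3.0]       # hypothetical features
theta = [0.5, -1.0, 2.0]  # hypothetical parameters

# 1.0 * 0.5 + 2.0 * -1.0 + 3.0 * 2.0
print(linear(x, theta))
```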

## The perceptron

This linear combination won't be much use for learning a non-linear function. Let's adjust notation in anticipation of complexity

$$ z(x) = \sum_{i} x_{i} \theta_{i} $$

Let's add an activation function after the linear combination - we'll use a sigmoid (which is a special case of the logistic function)

$$ a(z) = \frac{1}{1 + e^{-z}} $$
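
A small sketch of the perceptron as two steps, the linear combination followed by the sigmoid. The input values are hypothetical, and `math.exp` can overflow for very large negative `z`, which this sketch does not guard against.

```python
import math


def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def perceptron(x, theta):
    """Linear combination z followed by the sigmoid activation a(z)."""
    z = sum(x_i * theta_i for x_i, theta_i in zip(x, theta))
    return sigmoid(z)


x = [1.0, 2.0, 3.0]       # hypothetical features
theta = [0.5, -1.0, 0.2]  # hypothetical parameters

print(f"a(0) = {sigmoid(0.0)}")
print(f"perceptron output = {perceptron(x, theta):.4f}")
```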

## A single hidden layer neural network

Three layers in total - input, hidden & output

Parameters from input -> hidden layer 

$$w_{0}, b_{0}$$

Linear combination of parameters

$$z_{0} = \sum X \cdot w_{0} + b_{0}$$

Hidden layer activation function (sigmoid)

$$a_{0} = \frac{1}{1 + e^{-z_{0}}} $$

Output layer linear combination

$$z_{1} = \sum a_{0} \cdot w_{1} + b_{1}$$

Error function

$$E = \frac{1}{2} (z_{1} - y)^2$$
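
The forward pass through this network can be sketched in plain Python. The layer sizes (3 inputs, 2 hidden units, 1 output) and all weight values are assumptions for illustration; the notes do not fix them.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, w0, b0, w1, b1):
    """Return (z0, a0, z1) for a single hidden layer network.

    w0[i][j] connects input i to hidden unit j; w1[j] connects hidden unit j to the output.
    """
    # Input -> hidden linear combination, one value per hidden unit
    z0 = [sum(x_i * row[j] for x_i, row in zip(x, w0)) + b0[j] for j in range(len(b0))]
    # Hidden layer activation
    a0 = [sigmoid(z) for z in z0]
    # Hidden -> output linear combination (a single output)
    z1 = sum(a * w for a, w in zip(a0, w1)) + b1
    return z0, a0, z1


def error(z1, y):
    """Squared error E = 0.5 * (z1 - y)^2."""
    return 0.5 * (z1 - y) ** 2


# Hypothetical input, target and parameters
x, y = [1.0, 2.0, 3.0], 1.0
w0 = [[0.1, -0.2], [0.0, 0.3], [0.2, 0.1]]
b0 = [0.0, 0.0]
w1 = [1.0, -1.0]
b1 = 0.5

z0, a0, z1 = forward(x, w0, b0, w1, b1)
print(f"z0={z0}, a0={a0}, z1={z1:.4f}, E={error(z1, y):.4f}")
```

Keeping the intermediate values `z0` and `a0` around is a deliberate choice: the partial derivatives of each component are evaluated at exactly these values.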

## Partial derivatives of these components

Partial derivative of the linear combination of the input layer wrt a weight or bias in that layer

$$ \frac{\partial z_{0}}{\partial  w_{0}} = X $$

$$ \frac{\partial z_{0}}{\partial  b_{0}} = 1 $$

Partial derivative of the sigmoid activation on the hidden layer wrt the linear combination

$$ \frac{\partial a_{0}}{\partial z_{0}} = a_{0}(z_{0}) \cdot (1-a_{0}(z_{0})) $$

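Partial derivative of the output layer linear combination wrt the activation, and wrt a weight or bias in that layer

$$ \frac{\partial z_{1}}{\partial a_{0}} = w_{1} $$

$$ \frac{\partial z_{1}}{\partial w_{1}} = a_{0} $$

$$ \frac{\partial z_{1}}{\partial b_{1}} = 1 $$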

Partial derivative of the error wrt the output layer

$$ \frac{\partial E}{\partial z_{1}} = z_{1} - y $$
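
Two of these local derivatives, the sigmoid and the error, can be checked numerically with a finite difference. This is a sketch with hypothetical values for `z0`, `z1` and `y`; `central_difference` is an illustrative helper.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_grad(z):
    """da/dz = a(z) * (1 - a(z))."""
    a = sigmoid(z)
    return a * (1.0 - a)


def error(z1, y):
    return 0.5 * (z1 - y) ** 2


def error_grad(z1, y):
    """dE/dz1 = z1 - y."""
    return z1 - y


def central_difference(func, x, h=1e-5):
    return (func(x + h) - func(x - h)) / (2 * h)


z0, z1, y = 0.3, 1.2, 2.0  # hypothetical values

print("sigmoid:", sigmoid_grad(z0), central_difference(sigmoid, z0))
print("error:  ", error_grad(z1, y), central_difference(lambda z: error(z, y), z1))
```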

Our full model can now be written as a composition of functions

$$ f(x; \theta) = z_{1}(a_{0}(z_{0}(x))) $$

We want to change each of our parameters to minimize the error, so we need the gradient of the error with respect to each one

$$ \frac{\partial E}{\partial w_{0}}, \frac{\partial E}{\partial b_{0}}, \frac{\partial E}{\partial w_{1}}, \frac{\partial E}{\partial b_{1}} $$

## The chain rule

When we have compositions of functions

$$ f(x) = a(z(x)) $$

The **chain rule** shows us how gradients flow through these compositions of functions

$$ \frac{df}{dx} = \frac{da}{dz} \cdot \frac{dz}{dx} $$
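
A quick numerical check of the chain rule for a composition $a(z(x))$. It assumes a one-dimensional inner function $z(x) = \theta x$ with a hypothetical $\theta = 0.5$ and a sigmoid for $a$; the derivative of the whole composition is the product of the two local derivatives.

```python
import math

THETA = 0.5  # hypothetical parameter of the inner function


def z(x):
    """Inner function z(x) = theta * x, so dz/dx = theta."""
    return THETA * x


def a(z_value):
    """Outer function: sigmoid, so da/dz = a * (1 - a)."""
    return 1.0 / (1.0 + math.exp(-z_value))


def f(x):
    """The composition f(x) = a(z(x))."""
    return a(z(x))


def df_dx(x):
    """Chain rule: multiply the local derivatives da/dz and dz/dx."""
    a_value = a(z(x))
    da_dz = a_value * (1.0 - a_value)
    dz_dx = THETA
    return da_dz * dz_dx


x, h = 1.0, 1e-5
numeric = (f(x + h) - f(x - h)) / (2 * h)
print(f"chain rule: {df_dx(x):.8f}  finite difference: {numeric:.8f}")
```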

## Practical

You now have all the tools to derive update equations for all our weights and biases - do so on paper

Afterwards I will write the solution on the whiteboard :)