# **Backprop session - Agenda:** 
- ## Recall what makes up a neural network 

- ## Review the forward pass, sending data through the network to produce an output 

- ## Discuss some intuition for the backpropagation algorithm 

- ## Do a pen-and-paper example of a single round of backpropagation. 

- ## Will **not** cover the details of calculating the gradient. (The equations are included in this notebook for completeness). 




# **The neural network** 

### Will consider a neural network for **regression**, with target values in $\mathbb{R}$ (that is, targets are real values in one dimension).
### Neural networks can also be used for classification, and target values can have multiple classes (or multiple dimensions in regression).  

## **Terminology** (This may vary from author to author):
### $V$: The set of weights connecting the input layer to the hidden neurons. $V_{i, j}$ connects input neuron $i$ to hidden neuron $j$.

### $W$: The set of weights connecting the hidden neurons to the output neuron(s). $W_{i, j}$ connects hidden neuron $i$ to output neuron $j$. (In our case, we only have $j = 1$ since the output is in one dimension.) 

### $\mathbf{x} = (x_1, x_2)$ is a single data example with two features. 

### $z_j$ is the total signal into hidden neuron $j$. <font color = "red"> (Beware: in the lecture slides, $z_j$ is the input into **output neuron j**. In the slides, our $z_j$ is called $h_j$.)  </font>

### Activation function $\sigma(z) = \frac{1}{1 + e^{-z}}$

### $a_j$ is the **activated** signal out of hidden neuron $j$. $a_j = \sigma(z_j)$

### $o$ is the total signal into the output neuron

### $y$ is the total signal out of the output neuron. 

### $t$ is the target value for the datapoint $\mathbf{x}$

### Loss function: Squared error -  $\mathcal{L}(y, t) = (y-t)^2$
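### The activation and loss defined above can be written as two small Python functions. This is only a sketch: the names `sigmoid` and `squared_error` are chosen here for illustration and do not come from the lecture.

```python
import math

def sigmoid(z):
    """Logistic activation: 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def squared_error(y, t):
    """Squared-error loss for one prediction y and one target t."""
    return (y - t) ** 2

print(sigmoid(0.0))             # 0.5
print(squared_error(3.0, 5.0))  # 4.0
```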

![Neural network](img/neural_network.png)

# **The forward pass:** 
### The signal into the hidden neurons is the sum of input value times weights connecting into that neuron.
### Here's an example of the **signal into the first hidden neuron.**

### The accumulated signal **into the first hidden neuron** is $z_1$
![image.png](img/into_hidden_1.png)

### $z_1 = -1 \cdot V_{0,1} + x_1 \cdot V_{1,1} + x_2 \cdot V_{2,1}  $. Likewise, $z_2 = -1 \cdot V_{0,2} + x_1 \cdot V_{1,2} + x_2 \cdot V_{2,2}  $ , $z_3 = -1 \cdot V_{0,3} + x_1 \cdot V_{1,3} + x_2 \cdot V_{2,3}  $ 

### Before the signal gets passed on from the hidden neurons, the **activation function** is applied. 
### The **activated signals out of the hidden neurons are** $$(a_1, a_2, a_3) = (\sigma(z_1), \sigma(z_2), \sigma(z_3))$$ 


### The signal **into the output neuron** is the sum of the activated signals $\mathbf{a}$ and the bias $-1$ multiplied with the weights connecting to the output neuron: 

### $$o = -1 \cdot W_{0,1} + a_1 \cdot W_{1,1} + a_2 \cdot W_{2,1} +  a_3 \cdot W_{3,1} .$$ 

### Since this is a regression network, we do not apply an activation function to the output, so $o = y$. The output value $y$ is the predicted value for the input $(x_1, x_2)$

### Let's make an example. We initialize the network with these weights: 

![image.png](img/with_weights.png)

### Weights from bias neuron: 
$V_{0, 1} = 1$,

$V_{0, 2} = 0$,

$V_{0, 3} = -1$.

### Weights from feature 1: 
$V_{1, 1} = 1$,

$V_{1, 2} = 2$,

$V_{1, 3} = 0$.

### Weights from feature 2: 
$V_{2, 1} = 0$,

$V_{2, 2} = -2$,

$V_{2, 3} = 1$.

### Weights from the hidden neurons: 

$W_{0, 1} = 1$,

$W_{1, 1} = 2$,

$W_{2, 1} = 3$,

$W_{3, 1} = -1$.

### Input data: $\mathbf{x} = (1, 2)$

### Target value: $t = 5$



<br>

## **Using the weights and inputs from above:** 
### \begin{align} z_1 &= -1 \cdot 1 + 1 \cdot 1 + 2 \cdot 0 = 0 \\
z_2 &= -1 \cdot 0 + 1 \cdot 2 + 2 \cdot -2  = -2 \\ 
z_3 &= -1 \cdot -1 + 1 \cdot 0 + 2 \cdot 1  = 3 \\ 
\end{align}
## Activations out of the hidden neurons: 
### \begin{align} a_1 = \sigma(z_1) = \sigma(0) &= 0.5 \\
a_2 = \sigma(z_2) = \sigma(-2) & \approx 0.1192\\ 
a_3 = \sigma(z_3)  = \sigma(3) &\approx 0.952 
\end{align}

## Signal into output neuron: 
### $$ o =  -1 \cdot 1 + 0.5 \cdot 2 + 0.1192 \cdot 3 +  0.952 \cdot -1  \approx -0.5944 $$

## Since this is a network for regression, we don't apply any activation function to the output. The output from the network is 
### $$ y \approx -0.5944 $$

![image.png](img/img_1.png)

### (The calculations above can be done efficiently with matrix multiplication and vector operations in Python. See the lecture slides, or ask in the group session!)
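### As a sketch of the forward pass for this example, the code below uses plain Python lists and explicit sums rather than a matrix library, so that it runs without extra packages. The layout is an assumption made here: `V[i][j]` holds $V_{i, j+1}$ (row $0$ is the bias row), and `W[j]` holds $W_{j, 1}$.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# V[i][j]: weight from input i (i = 0 is the bias) into hidden neuron j + 1
V = [[1, 0, -1],
     [1, 2, 0],
     [0, -2, 1]]
# W[j]: weight from hidden neuron j (j = 0 is the bias) into the output neuron
W = [1, 2, 3, -1]
x = [1, 2]

inputs = [-1] + x                      # prepend the bias signal -1
z = [sum(inputs[i] * V[i][j] for i in range(3)) for j in range(3)]
a = [sigmoid(zj) for zj in z]          # activations out of the hidden layer
hidden = [-1] + a                      # prepend the bias signal -1
y = sum(hidden[j] * W[j] for j in range(4))  # no output activation: y = o

print(z)                               # [0, -2, 3]
print([round(aj, 4) for aj in a])
print(round(y, 4))
```

### With unrounded activations the output comes out at about $-0.595$, which agrees with the hand calculation to two decimals; the small difference comes from rounding $a_2$ and $a_3$ by hand.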


<br><br><br> 

# **Back propagation - some intuition**


### Recall that the target was $5$ while the prediction was approximately $-0.59$. This gives a squared error of $(-0.59 - 5)^2 \approx 31.25$.
### The goal is to use **gradient descent** to find weights $V, W$ that minimize this error, by stepping in the opposite direction of the gradient.

# $$\theta \leftarrow \theta - \eta \cdot \nabla \mathcal{L}(y, t)$$

### If we know the gradient of $\mathcal{L}$, we know how to update the weights to reduce the mistake we made! 
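### To see the update rule in isolation, here is a toy one-parameter problem (not the network itself): we pretend the parameter $\theta$ is the prediction, so the loss is $(\theta - 5)^2$ and its gradient is $2(\theta - 5)$. The function names are made up for this sketch.

```python
def loss(theta, t=5.0):
    return (theta - t) ** 2

def grad(theta, t=5.0):
    return 2.0 * (theta - t)

eta = 0.1
theta = -0.59

# One gradient descent step: move against the gradient.
theta_new = theta - eta * grad(theta)

print(round(theta_new, 3))            # 0.528
print(loss(theta_new) < loss(theta))  # True
```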

<br><br><br> 
### Q: How "sensitive" is the loss $\mathcal{L}$ to the change in a weight? 

<table><tr>
<td> 
  <p align="center" style="padding: 10px">
    <img alt="Forwarding" src="img/last_layer.png">
    <br>
    <em style="color: grey">Last layer change</em>
  </p> 
</td>
<td> 
  <p align="center">
    <img alt="Routing" src="img/first_layer.png">
    <br>
    <em style="color: grey">First layer change</em>
  </p> 
</td>
</tr></table>

## **The backpropagation equations:**  
## \begin{align}
&\frac{\partial \mathcal{L}}{\partial W_{j, 1}}(y, t) =  \overset{\text{ Change in } \mathcal{L} \text { from } y}{\overbrace{2(y-t)}} \cdot \underset{\text{... in } y \text{ from change in }W_{j, 1}}{\underbrace{a_j}} \\
\\
&\frac{\partial \mathcal{L}}{\partial V_{i, j}}(y, t) =  \overset{\text{ Change in } \mathcal{L} \text { from } y}{\overbrace{2(y-t)}} \cdot \underset{\text{ ... in } y \text { from change in }a_j}{\underbrace{W_{j, 1}} } \cdot  \overset{\text{ ... in } a_j \text { from change in } z_j}{\overbrace{a_j(1-a_j)}}\cdot \underset{\text{... in } z_j \text{ from change in }V_{i, j}}{\underbrace{x_i}}
\end{align}

### To change the weight $W_{j, 1}$ in the output layer we use: 
# \begin{align} W_{j, 1} & \leftarrow W_{j,1} - \eta \cdot \underset{\delta_o} {\underbrace{2(y-t)}}\cdot a_j  \\
W_{j, 1} & \leftarrow W_{j,1} -\eta \cdot \delta_o \cdot a_j
\end{align}

### To change the weight $V_{i, j}$ in the first layer, we use 
# \begin{align} V_{i, j} & \leftarrow V_{i, j} - \eta \cdot \underset{\delta_{h(j)}}{\underbrace{2(y-t) W_{j, 1} a_{j} (1-a_j)}} \cdot x_i \\ 
V_{i, j} & \leftarrow V_{i, j} - \eta \cdot \delta_{h(j)} x_i 
\end{align}

## **Back to our example:** 

### We use a learning rate of $\eta = 0.1$

### Input value was $x = (1, 2)$, target $t = 5$

### After the forward pass we found $y \approx -0.59$

### Let's begin by calculating the output delta: 
 ## \begin{align}
 \delta_{o} & = 2(y-t) \\  & = 2\cdot(-5.59) \\ &= -11.18
 \end{align}
 
### We first train the weight connecting the last **bias** to the output neuron: 

![image.png](img/img_4.png)

## \begin{align} 
    W_{0, 1} &\leftarrow W_{0,1} - \eta \cdot \delta_{o} \cdot \overset{\text{bias} = -1} { \overbrace{a_{0}}} \\
     \eta \cdot \delta_{o} \cdot a_0  &= 0.1 \cdot -11.18 \cdot -1 \\
     &= 1.118 \\
     \Longrightarrow W_{0,1} & \leftarrow 1 - 1.118\\
     &= -0.118
\end{align}


### We repeat this for output weights $W_{1, 1}, W_{2, 1} \text{ and }W_ {3, 1}$
![image.png](img/img_3.png)
## \begin{align}
W_{1, 1} &\leftarrow W_{1,1} - \eta \cdot \delta_{o} \cdot a_1 \\
     \eta \cdot \delta_{o} \cdot a_1  &= 0.1 \cdot -11.18 \cdot 0.5 \\
     &= -0.559\\
     \Longrightarrow W_{1,1} & \leftarrow 2 - (-0.559) \\
     &= 2.559
\end{align}
<br> <br> 
## \begin{align}
W_{2, 1} &\leftarrow W_{2,1} - \eta \cdot \delta_{o} \cdot a_2 \\
     \eta \cdot \delta_{o} \cdot a_2  &= 0.1 \cdot -11.18 \cdot 0.119 \\
     &= -0.133\\
     \Longrightarrow W_{2,1} & \leftarrow 3 - (-0.133) \\
     &= 3.133
\end{align}
<br><br>
## \begin{align}
W_{3, 1} &\leftarrow W_{3,1} - \eta \cdot \delta_{o} \cdot a_3 \\
     \eta \cdot \delta_{o} \cdot a_3  &= 0.1 \cdot -11.18 \cdot 0.952 \\
     &= -1.064\\
     \Longrightarrow W_{3,1} & \leftarrow -1 - (-1.064) \\
     &= 0.064
\end{align}
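
### The four output-layer updates above follow the same pattern, so they can be checked in a few lines. This sketch uses the rounded hand values ($y = -0.59$, $a_2 = 0.119$, $a_3 = 0.952$); the list names are assumptions made here, with index $0$ holding the bias.

```python
eta = 0.1
y, t = -0.59, 5.0

a = [-1.0, 0.5, 0.119, 0.952]   # a_0 is the bias signal -1
W = [1.0, 2.0, 3.0, -1.0]       # W_{0,1}, W_{1,1}, W_{2,1}, W_{3,1}

delta_o = 2.0 * (y - t)         # output delta

# W_{j,1} <- W_{j,1} - eta * delta_o * a_j for every j
W_new = [w - eta * delta_o * a_j for w, a_j in zip(W, a)]

print(round(delta_o, 2))              # -11.18
print([round(w, 3) for w in W_new])   # [-0.118, 2.559, 3.133, 0.064]
```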

## We now know all the new weights in the output layer (but don't update them yet! We need the old weights to calculate the hidden deltas):


![image.png](img/img_2.png)

## Now, the weights in the first layer. Recall: 

### \begin{align} \overset{\text{from input neuron i into hidden neuron j}}{\overbrace{V_{i, j}}} & \leftarrow V_{i, j} - \eta \cdot \underset{\delta_{h(j) =\text{ hidden delta no. j} }}{\underbrace{2(y-t) W_{j, 1} a_{j} (1-a_j)}} \cdot \overset{\text{input neuron no. i}}{ \overbrace{x_i}} \\ 
V_{i, j} & \leftarrow V_{i, j} - \eta \cdot \delta_{h(j)} x_i 
\end{align}

### Q: How does adjusting a weight in the first layer change the loss? 

### <font color = "red"> Beware: There is an error in the calculations below. I somehow calculated the output delta as $-13.8$, not $-11.18$. This means that the final numbers are not exactly what we wanted, but the method is the same. Just replace $-13.8$ with $-11.18$. </font>
![first layer](img/first_layer.png)

### Let's first calculate the **hidden deltas** 

![The network](img/backward_prop.png)
### \begin{align}
\delta_{h(1)} = 2(y-t) W_{1, 1} a_1 (1-a_1) &= -13.8 \cdot 2 \cdot 0.5 \cdot (1-0.5)\\
\delta_{h(2)} = 2(y-t) W_{2, 1} a_2 (1-a_2) &= -13.8 \cdot 3 \cdot 0.119 \cdot (1-0.119)\\
\delta_{h(3)} = 2(y-t) W_{3, 1} a_3 (1-a_3) &= -13.8 \cdot -1 \cdot 0.952 \cdot (1-0.952)\\
\end{align}
### This gives
### \begin{align}
\delta_{h(1)} &= -6.9 \\
\delta_{h(2)} &= -4.34 \\
\delta_{h(3)} &= 0.63\\
\end{align}

### We can finally update the weights:

### \begin{align}
\quad V_{0, 1}  \leftarrow V_{0, 1} - \eta \cdot \delta_{h(1)}\cdot \overset{\text{bias}}{\overbrace{-1}} \Longrightarrow V_{0, 1} &= 1 - 0.1 \cdot -6.9 \cdot -1 = 0.31\\
V_{0, 2}  \leftarrow V_{0, 2} - \eta \cdot \delta_{h(2)}\cdot -1 \Longrightarrow V_{0, 2} &= 0 - 0.1 \cdot -4.34 \cdot -1 = -0.434\\
V_{0, 3}  \leftarrow V_{0, 3} - \eta \cdot \delta_{h(3)}\cdot -1 \Longrightarrow V_{0, 3} &= -1 - 0.1 \cdot 0.63 \cdot -1 = -0.9370
\end{align}
### \begin{align}
V_{1, 1}  \leftarrow V_{1, 1} - \eta \cdot \delta_{h(1)}\cdot x_1 \Longrightarrow V_{1, 1} &= 1 - 0.1 \cdot -6.9 \cdot 1 = 1.69\\
V_{1, 2}  \leftarrow V_{1, 2} - \eta \cdot \delta_{h(2)}\cdot x_1 \Longrightarrow V_{1, 2} &= 2 - 0.1 \cdot -4.34 \cdot 1 = 2.434\\
V_{1, 3}  \leftarrow V_{1, 3} - \eta \cdot \delta_{h(3)}\cdot x_1 \Longrightarrow V_{1, 3} &= 0  - 0.1 \cdot 0.63 \cdot 1 = -0.063
\end{align}
### \begin{align}
V_{2, 1}  \leftarrow V_{2, 1} - \eta \cdot \delta_{h(1)}\cdot x_2 \Longrightarrow V_{2, 1} &= 0 - 0.1 \cdot -6.9 \cdot 2 = 1.38\\
V_{2, 2}  \leftarrow V_{2, 2} - \eta \cdot \delta_{h(2)}\cdot x_2 \Longrightarrow V_{2, 2} &= -2 - 0.1 \cdot -4.34 \cdot 2 = -1.132\\
V_{2, 3}  \leftarrow V_{2, 3} - \eta \cdot \delta_{h(3)}\cdot x_2 \Longrightarrow V_{2, 3} &= 1 - 0.1 \cdot 0.63 \cdot 2 = 0.874
\end{align}
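
### The red note earlier in this section says to replace $-13.8$ with $-11.18$. The sketch below repeats the hidden-delta and first-layer calculations with $\delta_o = -11.18$ and the same rounded activations, so the printed numbers differ from the hand-calculated ones above. The variable names and the zero-based list layout are assumptions made here.

```python
eta = 0.1
delta_o = -11.18                 # output delta from the earlier calculation

inputs = [-1.0, 1.0, 2.0]        # bias, x_1, x_2
a = [0.5, 0.119, 0.952]          # a_1, a_2, a_3 (rounded as in the text)
W_old = [2.0, 3.0, -1.0]         # W_{1,1}, W_{2,1}, W_{3,1} BEFORE the update

# V[i][j] holds V_{i, j+1}; row 0 is the bias row
V = [[1.0, 0.0, -1.0],
     [1.0, 2.0, 0.0],
     [0.0, -2.0, 1.0]]

# delta_h(j) = delta_o * W_{j,1} * a_j * (1 - a_j)
delta_h = [delta_o * w * aj * (1.0 - aj) for w, aj in zip(W_old, a)]

# V_{i,j} <- V_{i,j} - eta * delta_h(j) * x_i
V_new = [[V[i][j] - eta * delta_h[j] * inputs[i] for j in range(3)]
         for i in range(3)]

print([round(d, 3) for d in delta_h])   # [-5.59, -3.516, 0.511]
for row in V_new:
    print([round(v, 3) for v in row])
```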

## **Summary of the mathematics**: 
## To change the weight $W_{j, 1}$ in the output layer we use: 
# \begin{align} W_{j, 1} & \leftarrow W_{j,1} - \eta \cdot \underset{\delta_o} {\underbrace{2(y-t)}}\cdot a_j  \\
W_{j, 1} & \leftarrow W_{j,1} -\eta \cdot \delta_o \cdot a_j
\end{align}
![Last layer](img/last_layer.png)


## To change the weight $V_{i, j}$ in the first layer, we use 
# \begin{align} V_{i, j} & \leftarrow V_{i, j} - \eta \cdot \underset{\delta_{h(j)}}{\underbrace{2(y-t) W_{j, 1} a_{j} (1-a_j)}} \cdot x_i \\ 
V_{i, j} & \leftarrow V_{i, j} - \eta \cdot \delta_{h(j)} x_i 
\end{align}
![first layer](img/first_layer.png)
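
### Putting the two update rules together gives one complete round of backpropagation. The sketch below is one possible way to write it, using unrounded activations; `forward` and `sgd_step` are hypothetical helper names, and the hidden deltas are computed from the old $W$ before any weight is changed.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(V, W, x):
    """Return the input vector (with bias), hidden vector (with bias) and output y."""
    inputs = [-1.0] + list(x)
    z = [sum(inputs[i] * V[i][j] for i in range(len(inputs)))
         for j in range(len(V[0]))]
    hidden = [-1.0] + [sigmoid(zj) for zj in z]
    y = sum(h * w for h, w in zip(hidden, W))
    return inputs, hidden, y

def sgd_step(V, W, x, t, eta):
    """One gradient descent step on a single data point."""
    inputs, hidden, y = forward(V, W, x)
    delta_o = 2.0 * (y - t)
    # Hidden deltas use the OLD output weights W[1:], paired with a_1..a_3.
    delta_h = [delta_o * W[j] * hidden[j] * (1.0 - hidden[j])
               for j in range(1, len(W))]
    W_new = [W[j] - eta * delta_o * hidden[j] for j in range(len(W))]
    V_new = [[V[i][j] - eta * delta_h[j] * inputs[i]
              for j in range(len(delta_h))]
             for i in range(len(inputs))]
    return V_new, W_new

V = [[1.0, 0.0, -1.0], [1.0, 2.0, 0.0], [0.0, -2.0, 1.0]]
W = [1.0, 2.0, 3.0, -1.0]
x, t, eta = [1.0, 2.0], 5.0, 0.1

loss_before = (forward(V, W, x)[2] - t) ** 2
V, W = sgd_step(V, W, x, t, eta)
loss_after = (forward(V, W, x)[2] - t) ** 2

print(round(loss_before, 2), round(loss_after, 2))
print(loss_after < loss_before)   # True
```

### For this data point, a single step brings the squared error down from about $31$ to below $1$.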

<br><br><br> 

# **Mathematics of backpropagation**
### The following will be the mathematics of finding the gradient of the loss function. 
### We consider training on a single datapoint. This is known as **stochastic gradient descent**. 

## Step 1: Find the partial derivatives of the loss function with respect to the weights in the last layer. 

$$ \frac{\partial \mathcal{L}}{\partial W_{j, 1}} = \frac{\partial \mathcal{L}}{\partial y} \frac{\partial y}{ \partial W_{j, 1}} $$ 

### We have that $\frac{ \partial{\mathcal{L}} }{ \partial y }(y, t) = 2(y-t)$

### $y = -1 \cdot W_{0, 1} + a_1 \cdot W_{1, 1} + a_2 \cdot W_{2, 1} + \dots $, so $\frac{\partial y}{\partial W_{j, 1}} = a_j $.
### This means that we can update weight $W_{j, 1}$ in the **last layer** using 

### $$W_{j, 1} \leftarrow W_{j, 1} -  \eta \cdot 2(y-t) \cdot a_j$$
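
### One way to gain confidence in the formula $\frac{\partial \mathcal{L}}{\partial W_{j, 1}} = 2(y-t)a_j$ is to compare it with a finite-difference estimate, that is, nudging one weight slightly up and down and measuring how the loss changes. The sketch below does this for the example network; the helper names and the step size `eps` are assumptions made here.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def hidden_vector(V, x):
    """Bias signal -1 followed by the activations a_1, a_2, a_3."""
    inputs = [-1.0] + list(x)
    z = [sum(inputs[i] * V[i][j] for i in range(len(inputs)))
         for j in range(len(V[0]))]
    return [-1.0] + [sigmoid(zj) for zj in z]

def predict(V, W, x):
    return sum(h * w for h, w in zip(hidden_vector(V, x), W))

V = [[1.0, 0.0, -1.0], [1.0, 2.0, 0.0], [0.0, -2.0, 1.0]]
W = [1.0, 2.0, 3.0, -1.0]
x, t = [1.0, 2.0], 5.0

h = hidden_vector(V, x)
y = predict(V, W, x)
analytic = [2.0 * (y - t) * hj for hj in h]   # 2(y - t) * a_j

eps = 1e-6
numeric = []
for j in range(len(W)):
    W_plus = list(W)
    W_plus[j] += eps
    W_minus = list(W)
    W_minus[j] -= eps
    loss_plus = (predict(V, W_plus, x) - t) ** 2
    loss_minus = (predict(V, W_minus, x) - t) ** 2
    numeric.append((loss_plus - loss_minus) / (2.0 * eps))  # central difference

max_diff = max(abs(p - q) for p, q in zip(analytic, numeric))
print(max_diff < 1e-5)   # True
```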

## Step 2: How about the first layer? 
### \begin{align}
\frac{\partial \mathcal{L}}{\partial V_{i, j}} &= \frac{\partial \mathcal{L}}{\partial y} \frac{\partial y}{ \partial V_{i, j}} \\
&= 2(y-t) \frac{\partial y}{ \partial V_{i, j}}
\end{align}
### What is $ \frac{\partial y}{ \partial V_{i, j}}$? 
### Have that 
### \begin{align}
y &= -1 \cdot W_{0, 1} + \sigma(z_1)W_{1, 1} + \sigma(z_2) W_{2, 1} + \sigma(z_3) W_{3, 1}\\
 &= -1 \cdot W_{0, 1} + a_1 W_{1, 1} + a_2 W_{2, 1} + a_3 W_{3, 1},
\end{align}
### so
### $$ \frac{\partial y}{ \partial V_{i, j}} =  \frac{\partial y}{ \partial a_j} \frac{\partial a_j}{ \partial V_{i, j}},$$ since changing the weight $V_{i, j}$ only changes the value of $a_j$ (the derivatives of all the other terms $a_k W_{k, 1}$, $k \neq j$, are zero). From the expression for $y$ we also read off $\frac{\partial y}{ \partial a_j} = W_{j, 1}$.

### Moreover, $a_j = \sigma(z_j)$. The derivative of the sigmoid function is $\sigma'(z_j) = \sigma(z_j)(1 - \sigma(z_j)) = a_j(1-a_j)$, so
### $\frac{\partial a_j}{ \partial V_{i, j}} = a_j(1-a_j) \frac{\partial z_j}{\partial V_{i, j}}. $
### Finally, $z_j = \sum_{k} x_k V_{k, j}$ (with $x_0 = -1$ for the bias), so $\frac{\partial z_j}{\partial V_{i, j}} = x_i$. 
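
### The sigmoid derivative used above, $\sigma'(z) = \sigma(z)(1 - \sigma(z))$, can be checked numerically at the three $z$ values from the example. This is only a sanity check with a central difference; `sigmoid_prime` and `eps` are names chosen for this sketch.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_prime(z):
    """Derivative of the sigmoid written in terms of its own output."""
    a = sigmoid(z)
    return a * (1.0 - a)

eps = 1e-6
diffs = []
for z in (0.0, -2.0, 3.0):
    numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2.0 * eps)
    diffs.append(abs(numeric - sigmoid_prime(z)))
    print(z, round(sigmoid_prime(z), 4), round(numeric, 4))

print(max(diffs) < 1e-7)   # True
```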

## Altogether: 
## \begin{align}
\frac{\partial \mathcal{L}}{\partial V_{i, j}}(y, t) &= \left(\frac{\partial \mathcal{L}}{\partial y}(y, t)\right)\frac{\partial y}{\partial a_j}\frac{\partial a_j}{\partial z_j}\frac{\partial z_j}{\partial V_{i, j}}\\ 
&= 2(y-t)W_{j, 1}a_j(1-a_j)x_i
\end{align}
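
### The final formula $\frac{\partial \mathcal{L}}{\partial V_{i, j}} = 2(y-t)W_{j, 1}a_j(1-a_j)x_i$ can be compared against a finite-difference estimate in the same spirit. The sketch below perturbs each first-layer weight of the example network in turn; the helper name `predict`, the zero-based list layout (`V[i][j]` for $V_{i, j+1}$) and the step size `eps` are assumptions made here.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(V, W, x):
    inputs = [-1.0] + list(x)
    z = [sum(inputs[i] * V[i][j] for i in range(len(inputs)))
         for j in range(len(V[0]))]
    hidden = [-1.0] + [sigmoid(zj) for zj in z]
    return sum(h * w for h, w in zip(hidden, W)), hidden

V = [[1.0, 0.0, -1.0], [1.0, 2.0, 0.0], [0.0, -2.0, 1.0]]
W = [1.0, 2.0, 3.0, -1.0]
x, t = [1.0, 2.0], 5.0
inputs = [-1.0] + x

y, hidden = predict(V, W, x)
a = hidden[1:]                       # a_1, a_2, a_3

eps = 1e-6
max_diff = 0.0
for i in range(3):
    for j in range(3):
        # Analytic gradient: 2(y - t) * W_{j+1,1} * a_j (1 - a_j) * x_i
        analytic = 2.0 * (y - t) * W[j + 1] * a[j] * (1.0 - a[j]) * inputs[i]

        V_plus = [row[:] for row in V]
        V_plus[i][j] += eps
        V_minus = [row[:] for row in V]
        V_minus[i][j] -= eps
        loss_plus = (predict(V_plus, W, x)[0] - t) ** 2
        loss_minus = (predict(V_minus, W, x)[0] - t) ** 2
        numeric = (loss_plus - loss_minus) / (2.0 * eps)  # central difference

        max_diff = max(max_diff, abs(analytic - numeric))

print(max_diff < 1e-5)   # True
```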
