# Backpropagation

Backpropagation is the fundamental algorithm on which nearly all other concepts within deep learning rely. At first, backpropagation may seem very complicated. In essence, however, it is just a systematic application of the [chain rule](https://en.wikipedia.org/wiki/Chain_rule).

Recall that a neural network (parameterized by weights $W$) predicts a target given an input $x$, i.e. $\hat{y} = p(y|x,W)$. Training a neural network then corresponds to changing the weights $W$ such that the error $E$ between the predicted targets $\hat{y}$ and the true targets $y$ is minimized.

Neural networks are usually trained using [gradient descent](https://en.wikipedia.org/wiki/Gradient_descent). Here we update each weight $W_{ij}$ by taking a step in the negative direction of $\frac{\partial E}{\partial W_{ij}}$, with $\eta$ determining the step size (the learning rate).

$$ W_{ij} \leftarrow W_{ij} - \eta \cdot \frac{\partial E}{\partial W_{ij}} $$
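To make the update rule concrete, here is a minimal sketch of gradient descent on a single weight. The objective $E(w) = (w - 3)^2$, the starting weight and the learning rate are all assumptions chosen purely for illustration; they are not part of the networks used later in this notebook.

```python
# Toy objective, assumed only for illustration: E(w) = (w - 3)^2
def error(w):
    return (w - 3.0) ** 2

def error_gradient(w):
    # dE/dw of the toy objective above, derived by hand
    return 2.0 * (w - 3.0)

eta = 0.1   # learning rate (step size), hypothetical value
w = 0.0     # arbitrary starting weight

for step in range(50):
    # Step in the negative direction of the gradient
    w = w - eta * error_gradient(w)

print(w, error(w))  # w ends up close to the minimum at w = 3
```

Here the gradient is written out by hand. In a neural network, the same role is played by the gradients that backpropagation computes.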

Backpropagation is simply an efficient way to calculate the exact gradient of $E$ w.r.t. all the weights in the network.

In practice backpropagation is done with linear algebra. However, presenting the math behind backpropagation in abstract linear algebra hides the inherent simplicity of the algorithm. Because of this, we will in this notebook only deal with networks that have a single unit in each layer. If you want to dig deeper into how backpropagation handles matrices of various sizes, there are numerous sources online.

Furthermore, we will use [mean squared error (MSE)](https://en.wikipedia.org/wiki/Mean_squared_error) as our [cost function](https://en.wikipedia.org/wiki/Loss_function) and [sigmoid functions](https://en.wikipedia.org/wiki/Sigmoid_function) as our [activation functions](https://en.wikipedia.org/wiki/Activation_function).

# Forward pass / propagation

In order to do backpropagation we first need to propagate the input through the network from its input units to its output units. This is called a forward pass or forward propagation. We need to do this both to calculate the error $E$ between our predictions $\hat{y}$ and the actual targets $y$, and to have access to the activations.

If we break forward propagation into its individual steps, it is actually very simple. In each layer we take the input and multiply it by the weights. The result of this multiplication is then used as the input to an activation function, which in this case is the sigmoid function shown in the lower left corner of the animation.


![forward propagation](images/forward_prop.gif)


The animation might be a bit too fast at first, but if you view it a couple of times it should start to make sense. Otherwise, we have included all the steps in static form here:

1) $ s_j = w_{in} \cdot x_{i} $. 

2) $ z_j = \sigma(w_{in} \cdot x_i)$.

3) $ s_k = w_{j} \cdot z_j  $.

4) $ z_k = \sigma(w_j \cdot \sigma(w_{in} \cdot x_i))$.

5) $ s_o = w_{k} \cdot z_k  $.

6) $ z_o = s_o $.

7) $\hat{y}_i = w_k \cdot \sigma(w_j \cdot \sigma(w_{in} \cdot x_{i}))$.

Notice that the functions become longer and longer as we get deeper into the network. This is because the later layers are strictly dependent on the earlier layers. In layer $L_i$ we would in theory need to recalculate all operations from $L_0$ to $L_{i - 1}$. In modern networks this would require immense computational power, which is why we store the result of each operation in a cache in memory. This is important to remember, because it is a direct reason why neural networks have such a big memory requirement compared to other algorithms.
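The caching idea can be sketched as follows. This is an illustrative sketch, not how any particular framework does it: the function name `forward`, the dictionary-based cache and its key names, and the weights and input are all assumptions (the values are deliberately different from the exercise below).

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def forward(x, weights):
    """Forward pass through a chain of single-unit sigmoid layers
    followed by a linear output unit, caching every intermediate result."""
    cache = {"z0": x}                 # the input counts as the first "activation"
    z = x
    for layer, w in enumerate(weights[:-1], start=1):
        s = w * z                     # weighted input to the unit
        z = sigmoid(s)                # activation of the unit
        cache[f"s{layer}"] = s
        cache[f"z{layer}"] = z
    y_hat = weights[-1] * z           # linear output unit, no activation
    cache["y_hat"] = y_hat
    return y_hat, cache

weights = [0.5, -1.0, 2.0]            # hypothetical weights
y_hat, cache = forward(1.5, weights)  # hypothetical input
print(y_hat)
print(sorted(cache))
```

Every value in `cache` is computed exactly once during the forward pass and can be looked up later instead of being recomputed.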

After the forward pass, we have access to both the actual targets and the predictions of the network. Because of this we can calculate the network's error $E$, which we will do with MSE.
$$E = \frac{1}{2}(\hat{y} - y)^2$$
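
A short worked derivation may help explain the factor $\frac{1}{2}$: it cancels the exponent when we differentiate, so the derivative of the error with respect to the prediction is simply the difference between prediction and target.

$$ \frac{\partial E}{\partial \hat{y}} = \frac{\partial}{\partial \hat{y}} \frac{1}{2}(\hat{y} - y)^2 = \hat{y} - y $$

For example, with $\hat{y} = 2$ and $y = 3$ (values chosen only for illustration) we get $E = \frac{1}{2}(2 - 3)^2 = 0.5$ and $\frac{\partial E}{\partial \hat{y}} = -1$.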

In [None]:
import numpy as np

"""
Try to construct the forward propagation of the network above
"""

def sigmoid(x):                                        
    return 1 / (1 + np.exp(-x))

x_i = np.array([1]) # The input to the network (Don't touch)
w_in = np.array([2]) # weight at first layer of the network (Don't touch)

s_j = [0]           # correct this line
z_j = [0]           # correct this line
w_j = np.array([5]) # weight at second layer of the network (Don't touch)


s_k = [0]           # correct this line
z_k = [0]           # correct this line
w_k = np.array([3]) # weight at final layer of the network (Don't touch)


s_o = [0]           # correct this line
z_o = [0]           # correct this line


y_hat = [0]         # correct this line


y_i = np.array([3]) # The target we want the network to predict (Don't touch)
error = [0]         # correct this line

# Print statements to help you get it right
print('z_j = correct' if '{0:0.4f}'.format(z_j[0]) == '0.8808' else "z_j is incorrect")
print('z_k = correct' if '{0:0.4f}'.format(z_k[0]) == '0.9879' else "z_k is incorrect")
print('z_o = correct' if '{0:0.4f}'.format(z_o[0]) == '2.9638' else "z_o is incorrect")
print('y_hat = correct' if '{0:0.4f}'.format(y_hat[0]) == '2.9638' else "y_hat is incorrect")
print('The network is correct!' if '{0:0.4f}'.format(error[0]) == '0.0007' else "The network is incorrect")

As noted before, we use backpropagation to figure out to what extent each weight contributes to the total error of the network. Therefore the error is the starting point of the algorithm.

Where forward propagation goes from the beginning of the network to its end, backpropagation starts at the end of the network and propagates the error backwards until it reaches the input units (hence its name). At each layer it calculates the derivatives of the error with respect to the weights using the chain rule combined with partial derivatives. If you are not fully comfortable with the chain rule and/or partial derivatives, a great (and gentle) introduction can be found [here](https://www.youtube.com/watch?v=AXqhWeUEtQU&list=PLSQl0a2vh4HC5feHa6Rc5c0wbRTx56nF7&index=15).

The number of calculations needed is quite large even for a toy network like ours. However, if you look closely, it is almost the same procedure all the way through the network.

In order to simplify the expressions, we cache the redundant computations and mark them as $C_1 \dots C_n$. These are also the expressions that would be stored in a cache in a deep learning framework. Furthermore, we have marked the changes in each step in blue to give a better overview of what is going on. Spend some time making sure you understand each step. *N.B. most of it is the chain rule and partial derivatives, so if you are not comfortable with these concepts, spend some time looking them up.*

![backpropagation](images/back_prop.png)

As we have already mentioned, initially this might all seem quite complex. However, if you stare at the equations above long enough, you might begin to see a lot of redundancy.

First of all, it is easy to see that in backpropagation each layer is strictly dependent on all layers after it, just like a layer is strictly dependent on all layers before it in forward propagation. This means that backpropagation in big networks is practically infeasible if we don't store the intermediate results while we propagate through the network (here shown as the cache, e.g. $C_1$ and $C_2$).
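A handy way to check hand-derived gradients is to compare them with a finite-difference approximation. The sketch below does this for a three-weight chain like the one above. The helper names `network_error` and `numerical_gradient`, the weights, the input and the target are assumptions made for illustration, and central differences are only an approximation, not part of backpropagation itself.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def network_error(w_in, w_j, w_k, x, y):
    """Half squared error of the single-unit chain for one example."""
    z_j = sigmoid(w_in * x)
    z_k = sigmoid(w_j * z_j)
    y_hat = w_k * z_k                 # linear output unit
    return 0.5 * (y_hat - y) ** 2

def numerical_gradient(weights, x, y, eps=1e-6):
    """Approximate dE/dw for every weight with central differences."""
    grads = []
    for idx in range(len(weights)):
        plus, minus = list(weights), list(weights)
        plus[idx] += eps              # nudge one weight up ...
        minus[idx] -= eps             # ... and down, keeping the others fixed
        diff = network_error(*plus, x, y) - network_error(*minus, x, y)
        grads.append(diff / (2 * eps))
    return grads

weights = [0.5, -1.0, 2.0]            # hypothetical w_in, w_j, w_k
x, y = 1.5, 1.0                       # hypothetical input and target
grads = numerical_gradient(weights, x, y)
print(grads)                          # approximations of dE/dw_in, dE/dw_j, dE/dw_k
```

Note that this needs two full forward passes per weight, which is why it is only useful as a sanity check. Backpropagation gets all gradients from a single backward pass by reusing the cached values.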

In [None]:
"""
Try to calculate the partial derivatives of the error with respect to the weights defined in the last exercise.
In order to do that you will need the variables used before, since the partial derivatives rely on them
"""

def sigmoid_derivative(x):
    return sigmoid(x) * (1 - sigmoid(x))

# ***********************************************************************
# First we calculate the cache values, used to reduce the computations  *
# needed while doing back propagation                                   *
# ***********************************************************************

C1 = [0]

C2 = [0]

C3 = [0]


# ***********************************************************************
# Then we do the actual calculations needed to find the gradient of     *
# the weights                                                           *
# ***********************************************************************

partial_W_k  = [0]   # correct this line

partial_W_j  = [0]   # correct this line

partial_W_in = [0]  # correct this line


# Print statements to help you get it right
print('C1 = correct!' if '{0:0.4f}'.format(C1[0]) == '-0.0362' else "C1 = incorrect")
print('C2 = correct!' if '{0:0.4f}'.format(C2[0]) == '0.0358' else "C2 = incorrect")
print('C3 = correct!' if '{0:0.4f}'.format(C3[0]) == '0.5250' else "C3 = incorrect")

print('partial_W_k = correct!' if '{0:0.4f}'.format(partial_W_k[0]) == '-0.0358' else "partial_W_k = incorrect")
print('partial_W_j = correct!' if '{0:0.4f}'.format(partial_W_j[0]) == '-0.0011' else "partial_W_j = incorrect")
print('partial_W_in = correct!' if '{0:0.6f}'.format(partial_W_in[0]) == '-0.000681' else "partial_W_in = incorrect")

# Error signals

By now you might have realized that each layer can be handled quite independently, given the results of the calculations from the layers that come before it in forward propagation and after it in backward propagation.

The insight that we can treat each layer separately is the foundation of all modern deep learning frameworks. It is nicely captured in backpropagation if we introduce a new symbol, the error signal $\delta$ (*delta*). The error signal is simply the accumulated error at each unit.

The error signal of a unit $j$ is defined as:

$$ \delta_j = \frac{\partial E}{\partial s_j} $$

Simply put, it is a measure of how much the network error varies with the input to unit $j$. The error signal has some nice properties, the most important being that we can now write backpropagation in a simpler and more compact form.

Let's rewrite all the layers of our network using the error signal:
\begin{align}
\delta_o & = (\hat{y}_i - y_i) \\
\delta_k & = \delta_o \cdot w_k \cdot \sigma(s_k)(1 - \sigma(s_k)) \\
\delta_j & = \delta_k \cdot w_j \cdot \sigma(s_j)(1 - \sigma(s_j))
\end{align}

It is not far from the original notation, but it lets us define the equation for each layer in terms of only the error signal of the layer after it, in a correct and simple manner.
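The recursion above can be sketched as a loop that starts with the error signal at the output and repeatedly multiplies by a weight and a sigmoid derivative. This is a minimal sketch for a chain of single-unit layers with a linear output unit. The function name `backward_deltas` and all numeric values are assumptions for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def backward_deltas(x, weights, y):
    """Return the error signals of a single-unit chain,
    ordered from the output unit back to the first hidden unit."""
    # Forward pass, caching the weighted input s of every sigmoid unit
    s_values = []
    z = x
    for w in weights[:-1]:
        s = w * z
        s_values.append(s)
        z = sigmoid(s)
    y_hat = weights[-1] * z           # linear output unit

    # Backward pass: each delta only needs the delta of the unit after it
    delta = y_hat - y                 # error signal at the output
    deltas = [delta]
    for w, s in zip(reversed(weights[1:]), reversed(s_values)):
        delta = delta * w * sigmoid(s) * (1 - sigmoid(s))
        deltas.append(delta)
    return deltas

weights = [0.5, -1.0, 2.0]            # hypothetical w_in, w_j, w_k
deltas = backward_deltas(1.5, weights, 1.0)
print(deltas)
```

Each iteration of the backward loop touches only one weight, one cached $s$ and the previous delta, which is what makes it possible to treat the layers separately.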

*N.B. Notice how much it resembles the expressions we cached earlier.*

# Handling multiple inputs/outputs

The final key to understanding backpropagation is knowing how to handle multiple inputs/outputs. In practice we would rarely have a network with just one unit in each layer, which means that units will generally have multiple inputs and multiple outputs.

Luckily for us, this is not very complicated to handle mathematically. To showcase this, we define a new toy network which splits from one unit into two and then merges back into one unit again.

![split network](images/split_network.png)

In [None]:
"""
Try to construct the forward propagation of the split network above
Remember that all inputs to a unit are simply summed in forward propagation
"""

x_i = np.array([1])  # The input to the network (Don't touch)
w_in = np.array([2]) # weight at first layer of the network (Don't touch)

s_i = [0]  # correct this line
z_i = [0]   # correct this line

w_i_j = np.array([4]) # weight at second layer of the network (Don't touch)
w_i_k = np.array([6]) # weight at second layer of the network (Don't touch)

s_j = [0]   # correct this line
z_j = [0]   # correct this line
w_j = np.array([5]) # weight at final layer of the network (Don't touch)

s_k = [0]   # correct this line
z_k = [0]   # correct this line
w_k = np.array([3]) # weight at final layer of the network (Don't touch)

s_o = [0]   # correct this line
z_o = [0]   # correct this line

y_hat = [0] # correct this line


y_i = np.array([8]) # The target we want the network to predict (Don't touch)

error = [0]   # correct this line

# Print statements to help you get it right
print('z_i = correct' if '{0:0.4f}'.format(z_i[0]) == '0.8808' else "z_i = incorrect")
print('z_j = correct' if '{0:0.4f}'.format(z_j[0]) == '0.9713' else "z_j = incorrect")
print('z_k = correct' if '{0:0.4f}'.format(z_k[0]) == '0.9950' else "z_k = incorrect")
print('z_o = correct' if '{0:0.4f}'.format(z_o[0]) == '7.8416' else "z_o = incorrect")
print('y_hat = correct' if '{0:0.4f}'.format(y_hat[0]) == '7.8416' else "y_hat = incorrect")
print('The network is correct!' if '{0:0.4f}'.format(error[0]) == '0.0125' else "The network is incorrect!")

## Multiple inputs

Looking at the network above, it is clear we need to figure out how to split the error signal in two when going from $o$ to $k$ and $j$. Finding the error signal for $o$ itself is rather simple, since it is the last unit in the network.

$$ \delta_o = (\hat{y}_i - y_i) $$

Luckily for us, it is also simple to split the error signal. An addition in a neural network can be seen as an error signal distributor: it takes the incoming error signal and passes it on unchanged to every unit that was summed. This means that the same $\delta_o$ is distributed to both units in the following manner.

$$ \delta_j = \delta_o \cdot w_j \cdot \sigma(s_j)(1 - \sigma(s_j))$$
$$ \delta_k = \delta_o \cdot w_k \cdot \sigma(s_k)(1 - \sigma(s_k))$$

Notice that $\delta_o$ is included unchanged in both units. This is important to remember, since it is one reason why several architectures you will hear about later (LSTMs, ResNets etc.) work so well. What these architectures have in common is that they integrate addition into the network in order to let the error signal / gradient flow more freely. This tends to have a great impact on the performance of the networks.
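The claim that an addition passes the error signal on unchanged can be checked numerically. In the sketch below, `a` and `b` stand in for the two weighted inputs that are summed at the output unit. Their values, the target and the helper name `error` are assumptions for illustration.

```python
def error(a, b, y):
    """Half squared error of a unit that simply adds its two inputs."""
    s_o = a + b
    return 0.5 * (s_o - y) ** 2

a, b, y = 1.2, 0.7, 3.0   # hypothetical summed inputs and target
eps = 1e-6

delta_o = (a + b) - y     # error signal at the summing unit

# Central-difference approximations of dE/da and dE/db
grad_a = (error(a + eps, b, y) - error(a - eps, b, y)) / (2 * eps)
grad_b = (error(a, b + eps, y) - error(a, b - eps, y)) / (2 * eps)

print(delta_o, grad_a, grad_b)  # all three agree up to numerical precision
```

Both summed inputs receive the full $\delta_o$. It is not halved or otherwise shared out between them.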


## Multiple outputs

Finally we need to look at what happens in the opposite situation. When a unit has multiple outputs, we simply sum the error signals from each output, just as we sum all inputs of a unit in forward propagation. The error signal for unit $i$ is thus:

$$ \delta_i = \left(\sum_{l \in outs(i)} \delta_l \cdot w_{i \to l} \right) \cdot \sigma(s_i)(1 - \sigma(s_i)) $$
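
As a rough sketch, the sum over outgoing connections could be written as a small helper. The name `error_signal`, the `(delta, weight)` pair representation and the numbers below are assumptions made for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def error_signal(s_i, downstream):
    """Error signal of a sigmoid unit with weighted input s_i.

    downstream: list of (delta_l, w_il) pairs, one per unit l that
    unit i sends its output to, where w_il is the connecting weight.
    """
    # Sum the error signals arriving from every outgoing connection ...
    summed = sum(delta_l * w_il for delta_l, w_il in downstream)
    # ... and scale by the sigmoid derivative at this unit
    return summed * sigmoid(s_i) * (1 - sigmoid(s_i))

# Hypothetical unit feeding two downstream units
delta_i = error_signal(0.3, [(0.2, 1.5), (-0.1, 0.5)])
print(delta_i)
```

With a single entry in `downstream` this reduces to the one-unit-per-layer recursion used in the first network.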

The last step is the same as in the former network. Since the input $x_i$ is not passed through an activation function, the error signal that reaches the input contains no sigmoid derivative:

$$ \delta_{in} = \delta_i \cdot w_{in} $$


## Converting error signal to gradients

The error signal allows us to express the propagation of error through the network in a simple and efficient manner, but it does not by itself tell us how to improve the network. For that we need gradient descent. As you might have noticed, the cache we used earlier to store some of the computations looked quite similar to the error signal. In practice we also use a cache to store these deltas, so we only have to compute each delta once.

With the deltas it is simple to calculate how much we should update each weight in the network. In order to do so we need to introduce a learning rate ($\eta$). We will not go into detail explaining the learning rate, but simply put, the learning rate is a constant which constrains the amount with which we can update our weights in one step.

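The amount by which we update each weight is equal to the negative gradient multiplied by our learning rate, where the gradient of a weight is the error signal of the unit it feeds into times the activation it carries. The equations thus become: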

\begin{align}
 \Delta w_j & = - \eta \cdot \delta_o \cdot z_j \\
 \Delta w_k & = - \eta \cdot \delta_o \cdot z_k \\
 \Delta w_{i \to j} & = - \eta \cdot \delta_j \cdot z_i \\
 \Delta w_{i \to k} & = - \eta \cdot \delta_k \cdot z_i \\
 \Delta w_{in} & = - \eta \cdot \delta_i \cdot x_i
\end{align}
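
To see one such update in action, here is a minimal sketch for a single weight feeding a linear output unit. The helper name `update_step` and all numeric values are assumptions for illustration and are unrelated to the exercise values.

```python
def update_step(w, delta, z, eta):
    """Apply w <- w + delta_w, with delta_w = -eta * delta * z."""
    return w - eta * delta * z

# Hypothetical linear output unit: y_hat = w * z
w, z, y, eta = 2.0, 0.9, 3.0, 0.1

y_hat = w * z
delta_o = y_hat - y                     # error signal at the output
error_before = 0.5 * (y_hat - y) ** 2

w_new = update_step(w, delta_o, z, eta)
error_after = 0.5 * (w_new * z - y) ** 2

print(w, w_new)                         # the weight moves towards a better value
print(error_before, error_after)        # the error decreases after the step
```

The prediction was too small, so $\delta_o$ is negative and the update increases the weight. With a learning rate that is too large, the same step could overshoot and increase the error instead.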

In [None]:
"""
Try to calculate the error signals and the weight updates for this new split network.
"""

eta = 0.01 # eta = Learning rate

delta_o = [0]  # correct this line

delta_j = [0]  # correct this line

delta_k = [0]  # correct this line

delta_i = [0]  # correct this line

update_W_j = [0]  # correct this line

update_W_k = [0]   # correct this line

update_W_i_j = [0] # correct this line

update_W_i_k = [0] # correct this line

update_W_in = [0]  # correct this line

You should now have a general understanding of the math underlying deep learning. Having a thorough understanding of how the gradients flow through a given network is fundamental to building new types of layers or complete network architectures that perform well.

Throughout this course we will introduce several types of architectures, and you will end up building a lot of them yourself. Try to keep backpropagation and the flow of gradients in mind while you do so. This will help you a lot in the long term.