<center>
    
    Neural Networks and Backpropagation
    
    Author: Daniel Coble
</center>

We now know enough to talk about multilayer perceptrons (MLPs) and neural networks. As we saw in the gradient descent notebook, gradient descent is the standard algorithm by which a machine 'learns'. But for gradient descent to work, there needs to be an algorithm that calculates gradients efficiently. That algorithm is backpropagation. This notebook covers neural networks, how backpropagation through a neural network works, and how to implement an MLP by hand. That'll involve some in-depth calculation, which is annoying. When doing actual machine learning, we don't have to do that and can just use an ML library.
But it's good to have an understanding of what's going on under the hood. Then we can move on and never have to worry about backpropagation ever again.

A neural network layer is a transform from $\mathbb{R}^n$ to $\mathbb{R}^m$ and consists of an affine transform (linear transform with a bias), then an activation function applied elementwise to the vector.
$$ \text{Layer = affine transform + activation function} $$
$$\phi(x) = \sigma(Wx+b) $$
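
As a minimal sketch of the layer equation above, here is one layer in NumPy. The text does not fix a particular activation, so using `np.tanh` for $\sigma$ is an assumption, and the sizes `n = 3`, `m = 2` and the random weights are purely illustrative.

```python
import numpy as np

def layer(x, W, b, sigma=np.tanh):
    """Apply one NN layer: affine transform, then elementwise activation."""
    return sigma(W @ x + b)

rng = np.random.default_rng(0)
n, m = 3, 2                      # illustrative input and output dimensions
W = rng.normal(size=(m, n))      # weight matrix maps R^n to R^m
b = rng.normal(size=m)           # bias vector in R^m
x = rng.normal(size=n)           # one input vector in R^n

out = layer(x, W, b)
print(out.shape)                 # (2,)
```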

The matrix $W$ and vector $b$ are the weights and bias of the NN layer (sometimes they are both called weights) and parameterize the function $\phi$. $W$ and $b$ are like the $\beta$ in linear regression, and it's the goal of training to find the optimal weights. Think of the weights as a way to parameterize the function we are learning. A single-layer NN isn't too useful; rather, layers are stacked on top of each other to make a multilayer perceptron (hence the name 'layer'). Consider the layer functions $\phi_1: \mathbb{R}^{i} \rightarrow \mathbb{R}^{h_1}$, $\phi_2: \mathbb{R}^{h_1} \rightarrow \mathbb{R}^{h_2}$, $\phi_3: \mathbb{R}^{h_2} \rightarrow \mathbb{R}$. A 3-layer MLP is
$$ \phi(x) = \phi_3(\phi_2(\phi_1(x))) = \sigma\left(W_3\sigma\left(W_2\sigma\left(W_1x+b_1\right) + b_2\right)+b_3\right) $$
and the weights of the entire NN which need to be optimized are $\mathbb{W} = \{W_1,W_2,W_3,b_1,b_2,b_3\}$. For future reference, let's also define the following vectors. 
$$\xi_1 = W_1x + b_1$$
$$ \xi_2 = W_2\phi_1 + b_2 $$
$$ \xi_3 = W_3\phi_2 + b_3 $$
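
The pre-activation vectors matter because backpropagation needs them later, so a forward pass usually stores them. The following is a rough sketch of that idea for a single input vector. The `tanh` activation, the hidden sizes, and the dictionary keys are all assumptions made for illustration.

```python
import numpy as np

def forward(x, params, sigma=np.tanh):
    """Forward pass of a 3-layer MLP that keeps every intermediate vector."""
    W1, b1, W2, b2, W3, b3 = params
    xi1 = W1 @ x + b1            # xi_1 = W_1 x + b_1
    phi1 = sigma(xi1)
    xi2 = W2 @ phi1 + b2         # xi_2 = W_2 phi_1 + b_2
    phi2 = sigma(xi2)
    xi3 = W3 @ phi2 + b3         # xi_3 = W_3 phi_2 + b_3
    phi3 = sigma(xi3)            # network output
    return {"xi1": xi1, "phi1": phi1, "xi2": xi2,
            "phi2": phi2, "xi3": xi3, "phi3": phi3}

rng = np.random.default_rng(0)
i, h1, h2 = 2, 4, 3              # illustrative input and hidden sizes
params = (
    rng.normal(size=(h1, i)), rng.normal(size=h1),
    rng.normal(size=(h2, h1)), rng.normal(size=h2),
    rng.normal(size=(1, h2)), rng.normal(size=1),
)
cache = forward(rng.normal(size=i), params)
print({name: value.shape for name, value in cache.items()})
```
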
For simplicity I will suppress the dependence of $\phi_1$, $\phi_2$, and $\phi_3$ on $x$. We can now work through how the gradient of the loss with respect to the weights is calculated with backpropagation. Let a dataset be $\{(x_i, y_i)\}_{i=1}^N$. The loss is mean squared error.
$$ L(\mathbb{W}) = \frac{1}{N}\sum_{i=1}^N \left(\phi(x_i)-y_i\right)^2 $$
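
For concreteness, the loss can be computed in a couple of lines. This is only a sketch: the function name `mse_loss` and the three prediction/target pairs are made up for illustration, with the predictions standing in for $\phi(x_i)$.

```python
import numpy as np

def mse_loss(predictions, targets):
    """Mean squared error between network outputs and targets."""
    predictions = np.asarray(predictions, dtype=float)
    targets = np.asarray(targets, dtype=float)
    return np.mean((predictions - targets) ** 2)

# (0.25 + 0.04 + 0.01) / 3, so approximately 0.1
loss = mse_loss([0.5, -0.2, 0.9], [1.0, 0.0, 1.0])
print(loss)
```
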
Backpropagation is called that because it works in the opposite direction to inference: where in inference $\phi_1$ is calculated, then $\phi_2$, then $\phi_3$, in backpropagation the gradient with respect to the weights of $\phi_3$ is calculated first, then $\phi_2$, then $\phi_1$. The principle of backpropagation is the chain rule. As the first step, we have to calculate the gradients $\frac{\partial L}{\partial W_3}$, $\frac{\partial L}{\partial b_3}$, and $\frac{\partial L}{\partial \phi_2}$. The first two are used for weight updating and the last is used to continue backpropagation. As a convention, we'll have the numerator of the partial derivative always be a scalar, and if the denominator is a matrix or vector, take that as shorthand for the matrix/vector of the same shape holding the partial derivative with respect to each element. A second subscript $i$ denotes the value for sample $x_i$, so $\phi_{3,i}$ is the network output for $x_i$. The following equations are correct, and I don't blame you if you just want to take my word for that.
$$ \frac{\partial L}{\partial \phi_{3,i}} = \frac{2}{N}\left(\phi_{3,i}-y_i\right) $$
$$ \frac{\partial L}{\partial \xi_{3,i}} = \mathrm{diag}\left(\sigma'\left(\xi_{3,i}\right)\right)\frac{\partial L}{\partial \phi_{3,i}} $$
$$ \frac{\partial L}{\partial W_3} = \sum_{i=1}^N\frac{\partial L}{\partial \xi_{3,i}}\phi_{2,i}^T $$
$$ \frac{\partial L}{\partial b_3} = \sum_{i=1}^N\frac{\partial L}{\partial \xi_{3,i}} $$
$$ \frac{\partial L}{\partial \phi_{2,i}} = \left(\frac{\partial L}{\partial \xi_{3,i}}^T W_3\right)^T $$
This solves the third layer. The second layer is identical.
$$ \frac{\partial L}{\partial \xi_{2,i}} = \mathrm{diag}\left(\sigma'\left(\xi_{2,i}\right)\right)\frac{\partial L}{\partial \phi_{2,i}} $$
$$ \frac{\partial L}{\partial W_2} = \sum_{i=1}^N\frac{\partial L}{\partial \xi_{2,i}}\phi_{1,i}^T $$
$$ \frac{\partial L}{\partial b_2} = \sum_{i=1}^N\frac{\partial L}{\partial \xi_{2,i}} $$
$$ \frac{\partial L}{\partial \phi_{1,i}} = \left(\frac{\partial L}{\partial \xi_{2,i}}^T W_2\right)^T $$
And again for the first layer.
$$ \frac{\partial L}{\partial \xi_{1,i}} = \mathrm{diag}\left(\sigma'\left(\xi_{1,i}\right)\right)\frac{\partial L}{\partial \phi_{1,i}} $$
$$ \frac{\partial L}{\partial W_1} = \sum_{i=1}^N\frac{\partial L}{\partial \xi_{1,i}}x_{i}^T $$
$$ \frac{\partial L}{\partial b_1} = \sum_{i=1}^N\frac{\partial L}{\partial \xi_{1,i}} $$
This is a mess and frankly it sucks. On the bright side, it's actually very nice: each layer reuses the gradient already computed for the layer after it, so it's really not that much work to calculate the gradients with respect to all the weights. That lets us do gradient descent on problems of much larger dimension than we did in the Gradient Descent notebook.
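
To make the equations less abstract, here is a sketch of them in NumPy, checked against a finite-difference gradient. Several things are assumptions for illustration and not taken from this notebook: $\sigma$ is `tanh`, samples are stored as columns of `X` so that the sums over $i$ become matrix products, biases are column vectors, and the sizes and random data are arbitrary.

```python
import numpy as np

def sigma(z):
    return np.tanh(z)

def sigma_prime(z):
    return 1.0 - np.tanh(z) ** 2

def forward(X, p):
    # X has shape (input_dim, N): one sample per column.
    xi1 = p["W1"] @ X + p["b1"]
    phi1 = sigma(xi1)
    xi2 = p["W2"] @ phi1 + p["b2"]
    phi2 = sigma(xi2)
    xi3 = p["W3"] @ phi2 + p["b3"]
    phi3 = sigma(xi3)
    return xi1, phi1, xi2, phi2, xi3, phi3

def loss(X, Y, p):
    return np.mean((forward(X, p)[-1] - Y) ** 2)

def backprop(X, Y, p):
    N = X.shape[1]
    xi1, phi1, xi2, phi2, xi3, phi3 = forward(X, p)
    g = {}
    # Third layer.
    dphi3 = 2.0 / N * (phi3 - Y)
    dxi3 = sigma_prime(xi3) * dphi3            # diag(sigma') product, elementwise
    g["W3"] = dxi3 @ phi2.T                    # sum over samples of outer products
    g["b3"] = dxi3.sum(axis=1, keepdims=True)
    dphi2 = p["W3"].T @ dxi3                   # passed back to the second layer
    # Second layer.
    dxi2 = sigma_prime(xi2) * dphi2
    g["W2"] = dxi2 @ phi1.T
    g["b2"] = dxi2.sum(axis=1, keepdims=True)
    dphi1 = p["W2"].T @ dxi2                   # passed back to the first layer
    # First layer.
    dxi1 = sigma_prime(xi1) * dphi1
    g["W1"] = dxi1 @ X.T
    g["b1"] = dxi1.sum(axis=1, keepdims=True)
    return g

def numerical_grad(X, Y, p, name, eps=1e-6):
    """Central finite differences, used only to check backprop."""
    grad = np.zeros_like(p[name])
    for idx in np.ndindex(p[name].shape):
        old = p[name][idx]
        p[name][idx] = old + eps
        up = loss(X, Y, p)
        p[name][idx] = old - eps
        down = loss(X, Y, p)
        p[name][idx] = old                     # restore the original weight
        grad[idx] = (up - down) / (2 * eps)
    return grad

rng = np.random.default_rng(0)
i, h1, h2, N = 2, 4, 3, 5                      # illustrative sizes
p = {
    "W1": rng.normal(size=(h1, i)), "b1": rng.normal(size=(h1, 1)),
    "W2": rng.normal(size=(h2, h1)), "b2": rng.normal(size=(h2, 1)),
    "W3": rng.normal(size=(1, h2)), "b3": rng.normal(size=(1, 1)),
}
X = rng.normal(size=(i, N))
Y = rng.uniform(-1, 1, size=(1, N))

grads = backprop(X, Y, p)
max_err = max(np.max(np.abs(grads[k] - numerical_grad(X, Y, p, k))) for k in p)
print(max_err)                                 # should be tiny if backprop is right
```

The finite-difference check perturbs one weight at a time, so it needs two forward passes per weight. Backpropagation gets every gradient from one forward and one backward pass, which is why it scales to large networks.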

Resources:

[Wikipedia, Backpropagation](https://en.wikipedia.org/wiki/Backpropagation)

Now let's program all of this. This example will emphasize how NNs can approximate a generic function by sampling from that function. I've created a 2D function below. First, let's plot it and create a dataset by sampling from it.

**Challenge Problem**

The code above is hard-coded for a 3-layer MLP. Refactor it so that you can create a NN of arbitrarily many layers. Experiment with changing the number of layers and hidden units in the datasets. 