In [None]:
import torch, torchvision
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

### A Forward Pass
In order to understand backpropagation, we first need to understand how a forward pass through a neural network works.

![Perceptron](perceptron.png)

Above is a picture of an individual perceptron. These are the building blocks of neural network layers. On the left, in blue circles, you see the inputs labelled $x_1$ through $x_n$. If you look closely you will also see a constant 1 value at the top; it represents the bias. We can explore that a bit later, but for right now just know it is there.

Next we have the weights in green boxes, ranging from $w_0$ to $w_n$. There is one weight for each input plus one for the constant. Each input is multiplied by its weight; think of it as giving that input an importance level.

Next is summing together those values. That is represented via the first red circle.

Finally, the non-linearity function (aka the activation function) is applied to the weighted sum. Common choices are the step function, the sigmoid function, $\tanh$, ReLU, etc. The result gives us our output.
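As a rough sketch of the steps just described (inputs, weights, sum, activation), here is one perceptron in plain Python. The function names `sigmoid` and `perceptron` and the sample values are made up for illustration, and the sigmoid is assumed as the activation.

```python
import math

def sigmoid(v):
    """Logistic sigmoid, one common choice of activation function."""
    return 1.0 / (1.0 + math.exp(-v))

def perceptron(inputs, weights, activation=sigmoid):
    """Forward pass of a single perceptron.

    weights[0] is the bias weight w_0 (paired with the constant 1);
    weights[1:] pair up with the inputs x_1..x_n.
    """
    if len(weights) != len(inputs) + 1:
        raise ValueError("expected one weight per input plus a bias weight")
    # Multiply each input by its weight, then sum everything up.
    weighted_sum = weights[0] * 1.0 + sum(x * w for x, w in zip(inputs, weights[1:]))
    # Apply the non-linearity to the weighted sum.
    return activation(weighted_sum)

# Hypothetical values: 0.0 + (1.0 * 0.5) + (2.0 * -0.25) = 0, and sigmoid(0) = 0.5
print(perceptron([1.0, 2.0], [0.0, 0.5, -0.25]))  # prints 0.5
```

Passing a different function as `activation` swaps the non-linearity without touching the weighted sum.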

### Example
For our example we are going to have 4 layers:
1. Input Layer (3 values)
2. Hidden Layer 1 (4 perceptrons)
3. Hidden Layer 2 (4 perceptrons)
4. Output Layer (1 value)

![Simple Neural Network](simple_neural_network.jpg)

We will assume the top value of the input layer is the bias constant 1, and then $x_1 = 0.8$ and $x_2 = 0.5$.

Each of the weights throughout the neural network is initialized to 0.5 to start out with. Keep in mind that weight initialization matters in practice: the better the weights are initialized, the faster the network trains, and real networks start from random weights rather than one shared value.
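The network set up above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: `layer_forward`, `INIT` and `layer_weights` are hypothetical names, and the bias constant is treated as the first input, so no extra bias is added in the later layers.

```python
import math

def sigmoid(v):
    """Logistic sigmoid activation."""
    return 1.0 / (1.0 + math.exp(-v))

def layer_forward(inputs, weights):
    """Return (weighted_sums, outputs) for one fully connected layer.

    weights[j][i] is the weight from input i to perceptron j.
    """
    weighted_sums = [sum(x * w for x, w in zip(inputs, row)) for row in weights]
    return weighted_sums, [sigmoid(ws) for ws in weighted_sums]

INIT = 0.5                            # every weight starts at 0.5
inputs = [1.0, 0.8, 0.5]              # bias constant, x_1, x_2
layer_weights = [
    [[INIT] * 3 for _ in range(4)],   # input layer -> Hidden Layer 1
    [[INIT] * 4 for _ in range(4)],   # Hidden Layer 1 -> Hidden Layer 2
    [[INIT] * 4],                     # Hidden Layer 2 -> output
]

activations = inputs
for depth, weights in enumerate(layer_weights, start=1):
    sums, activations = layer_forward(activations, weights)
    print(f"layer {depth}: weighted sums {sums}, outputs {activations}")

network_output = activations[0]
```

Running it prints the weighted sum and output of every perceptron, which is handy for checking hand calculations.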

So:

$$ws_{1,1} = (C * w_{1,0}) + (x_{1,1} * w_{1,1}) + (x_{1,2} * w_{1,2}) = (1 * 0.5) + (0.8 * 0.5) + (0.5 * 0.5) = 1.15$$

Using a sigmoidal activation function:

$$output_{1,1} = \frac{1}{1+e^{-ws_{1,1}}} = \frac{1}{1+e^{-1.15}} \approx .7595$$

Do the same for each of the other values in the first Hidden Layer and you get:

1. Value 1: .7595
2. Value 2: .7595
3. Value 3: .7595
4. Value 4: .7595

All four values match because every weight is the same and this is a fully connected network, so on this first pass each perceptron in a layer computes the same thing. Next, the output values of the first HL become the input values for the second HL (this example adds no bias term after the input layer), so:

$$ws_{2,1} = (x_{2,1} * w_{2,1}) + (x_{2,2} * w_{2,2}) + (x_{2,3} * w_{2,3}) + (x_{2,4} * w_{2,4}) = (.7595 * 0.5) + (.7595 * 0.5) + (.7595 * 0.5) + (.7595 * 0.5) = 1.519$$

Using a sigmoidal activation function:

$$output_{2,1} = \frac{1}{1+e^{-ws_{2,1}}} = \frac{1}{1+e^{-1.519}} \approx .8204$$

For Hidden Layer 2 you get:

1. Value 1: .8204
2. Value 2: .8204
3. Value 3: .8204
4. Value 4: .8204

Finally, sum them and apply the activation to them for the output:

$$ws_{3,1} = (x_{3,1} * w_{3,1}) + (x_{3,2} * w_{3,2}) + (x_{3,3} * w_{3,3}) + (x_{3,4} * w_{3,4}) = (.8204 * 0.5) + (.8204 * 0.5) + (.8204 * 0.5) + (.8204 * 0.5) = 1.6408$$

Using a sigmoidal activation function:

$$output_{3,1} = \frac{1}{1+e^{-ws_{3,1}}} = \frac{1}{1+e^{-1.6408}} \approx .8376$$

So, as a result, $.8376$ is the output of one pass through our feed-forward, fully connected neural network!

### Backprop Example
All we need to do to update the weights is go in reverse! As a reminder:

$$\delta_j(k) = -\frac{\partial E(k)}{\partial e_j(k)}\frac{\partial e_j(k)}{\partial y_j(k)}\frac{\partial y_j(k)}{\partial v_j(k)}$$

$$\delta_j(k) = e_j(k)\Phi_j'(v_j(k))$$

$$\Delta w_{ij}(k) = \alpha \delta_j(k)y_i(k)$$ 

where $y_i(k)$ is the output of the perceptron feeding the weight and $\Phi'$ is the derivative of the sigmoid activation function, so:

$$\Phi'(x) = \Phi(x) * (1-\Phi(x))$$ 

We will assume the label for the output is **.95** and the learning rate **$\alpha$ = .001**.

So starting from the output and working towards the weights coming from Value 1 of Hidden Layer 2, we get the local gradient at the output as:

$$\delta_j(k) = e_j(k)\Phi_j'(v_j(k)) = (.95 - .8376)*(\Phi(1.6408)*(1-\Phi(1.6408))) = .1124 * .8376 * .1624 \approx .0153 $$

Next we take the output of the perceptron feeding each weight as $y_i(k)$ and start working backwards through the weights. So:

$$\Delta w_{ij}(k) = \alpha \delta_j(k)y_i(k) \Rightarrow \Delta w_{3,1} = \alpha \delta_j(k) * output_{2,1} = .001 * .0153 * .8204 \approx .000013$$ 

That is a tiny update, mostly because the learning rate is so small. (The sigmoid derivative never exceeds .25, so it also shrinks the gradient a little more at every layer; in deep networks that is what leads to vanishing gradients.) Either way, this is the update to the weight 3,1 going from the top perceptron of HL2 to the output layer. Continue this through and calculate the new $\delta_j(k)$ for each level. A hidden layer has no label to compare against, so its local gradient has to be calculated differently. This is actually very simple! It's just the sum, over the perceptrons that the hidden perceptron feeds into, of their deltas times the connecting weights, all multiplied by the derivative of the activation function. So:

$$\delta_j(k) = \Phi_j'(v_j(k))\sum_{1}\delta_1(k)*w_{1,j}(k)$$

\* **NOTE: This delta update must be done before we update the weights!!!**
\* Also, notational note: The "1" index in the preceding equation goes from output to input, whereas indexing for our weights/inputs goes front to back. So 1 in that equation applied to this network is really 3, then 2, then 1.

So, for each perceptron $j$ in Hidden Layer 2 (there is only one output perceptron, so the sum has a single term):
$$\delta_3(k)\approx .0153$$
$$w_{3,1}, w_{3,2}, w_{3,3}, w_{3,4} = .5$$
$$\sum\delta_3(k)*w_{3,j}(k) = .0153 * .5 = .00765$$
$$\Phi' = (\Phi(1.519)*(1-\Phi(1.519))) = .8204 * .1796 \approx .1473$$
$$\Phi' * \sum\delta_3(k)*w_{3,j}(k) = .1473 * .00765 \approx .0011$$

And we are good to go with our new local gradient ($\delta_j(k)$)!
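To check the backward step by machine, here is a minimal sketch in plain Python using the label .95 and learning rate .001 from the text. The variable names are hypothetical. It assumes all weights are 0.5, no bias after the input layer, and that each Hidden Layer 2 perceptron feeds exactly one output perceptron, so the sum over downstream deltas has a single term.

```python
import math

def sigmoid(v):
    """Logistic sigmoid activation."""
    return 1.0 / (1.0 + math.exp(-v))

def sigmoid_prime(v):
    """Derivative of the sigmoid: Phi(v) * (1 - Phi(v))."""
    s = sigmoid(v)
    return s * (1.0 - s)

ALPHA = 0.001   # learning rate
LABEL = 0.95    # expected output
W = 0.5         # shared initial weight (assumption from the example)

# Forward pass; every perceptron in a layer produces the same value.
h1 = sigmoid(1.0 * W + 0.8 * W + 0.5 * W)   # Hidden Layer 1 output
v2 = 4 * h1 * W                             # weighted sum into Hidden Layer 2
h2 = sigmoid(v2)                            # Hidden Layer 2 output
v3 = 4 * h2 * W                             # weighted sum into the output
out = sigmoid(v3)

# Local gradient at the output: error times activation derivative.
delta_out = (LABEL - out) * sigmoid_prime(v3)

# Local gradient of one Hidden Layer 2 perceptron. It is computed
# from the OLD weight W, before any weight is updated.
delta_h2 = sigmoid_prime(v2) * (delta_out * W)

# Weight change for each Hidden Layer 2 -> output weight.
delta_w_out = ALPHA * delta_out * h2
new_w_out = W + delta_w_out

print(f"delta_out={delta_out:.6f} delta_h2={delta_h2:.6f} new_w_out={new_w_out:.8f}")
```

Because `delta_h2` is computed before `new_w_out` is used anywhere, the ordering in the note above is respected.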

### GPU explanation
So let's look at our network. If you were watching carefully, you probably noticed that a lot of the values were simply sums of products. In matrix form, this is just a dot product, right? So instead of saying:

$$\sum_{1}\delta_1(k)*w_{1,j}(k)$$

We can say:

$$\delta \cdot w_1$$

And avoid writing an explicit loop over each weight! This is a major speed-up, especially when considering that these networks can be incredibly deep and involve thousands or even millions of weights.

Because of their design, GPUs can run a very large number of floating-point operations in parallel, while CPUs have far fewer cores to work with. So, doing dot products of large matrices filled with floating-point values is a perfect task for GPUs.
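The sketch below compares the explicit loop with the matrix form in NumPy. The shapes (4 hidden perceptrons feeding 3 downstream perceptrons) and the random values are hypothetical, picked only so that the sum has more than one term. It runs on the CPU; the same matrix expression is what a GPU library would parallelize.

```python
import numpy as np

rng = np.random.default_rng(0)
deltas = rng.normal(size=3)         # deltas of 3 downstream perceptrons
weights = rng.normal(size=(4, 3))   # weights[j, l]: hidden perceptron j -> downstream l

# Explicit loops: one multiply-add per weight.
looped = np.zeros(4)
for j in range(4):
    for l in range(3):
        looped[j] += deltas[l] * weights[j, l]

# Matrix form: a single matrix-vector product gives all four sums at once.
vectorized = weights @ deltas

print(np.allclose(looped, vectorized))  # prints True
```

Both versions compute the same four sums; the second hands the whole job to one optimized routine instead of twelve separate Python-level operations.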

## Exercises

![Perceptron](perceptron.png)


### Q1:
Apply the forward pass according to the following variables. Assume we are using the sigmoidal activation function. **DON'T FORGET TO INCLUDE THE CONSTANT**

$$x_1 = 2$$
$$x_2 = 1$$
$$x_3 = 3$$
$$w_0 = 0.4$$
$$w_1 = 0.1$$
$$w_2 = -0.6$$
$$w_3 = -0.2$$

### Q2:
Assume the expected output (label) is $.7$ and $\alpha = .1$. What are the final values of the weights after one pass of backprop?