# Neural Network from Scratch

In this lab we'll create a neural network from scratch. Along the way we'll learn about the algorithm that makes it all possible, **backpropagation**. By the end we'll have a mini TensorFlow-like library.

Outline:
1. Intro to Backprop + chain rule, derivatives of (+, * , -) topped with a simple example
2. Transition to thinking about functions as nodes with inputs and outputs in a graph. Rewrite previous example in these terms.
3. Matrix/vector, sigmoid ops + derivatives. MSE or cross-entropy depending on whether we’re regressing or classifying.
4. Create neural net + train with SGD (MNIST?)


## Backpropagation and the Chain Rule

TODO: Improve this intro and change the exercise to $f = (x * y) + (x * -z)$ this illustrates what happens when a node has multiple outputs.

The fundamental idea behind neural networks is to minimize an objective, typically known as the *loss function*. The loss function outputs a single number that tells us how well we're doing, the smaller the better.

We can think of a neural network as a composition of functions with a loss (number) as the output.

$$L = f_n(f_{n-1}(...f_1(x)...))$$

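The idea behind backpropagation, also known as reverse-mode differentiation, is to use the chain rule to write the derivative of the loss with respect to the input as a product of local derivatives:

$$
\frac{\partial L}{\partial x} =
\frac{\partial L}{\partial f_n}
\frac{\partial f_n}{\partial f_{n-1}}
...
\frac{\partial f_1}{\partial x}
$$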
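
To make the chain rule product concrete, here is a minimal sketch with three made-up functions (`f1`, `f2`, `f3` are illustrative choices, not part of the lab). It multiplies the local derivatives from the output back to the input and compares the result with a finite-difference estimate.

```python
import math

# Hypothetical three-stage composition: L = f3(f2(f1(x)))
def f1(x):
    return 3.0 * x        # local derivative: 3

def f2(a):
    return a ** 2         # local derivative: 2a

def f3(b):
    return math.sin(b)    # local derivative: cos(b)

x = 0.5
a = f1(x)
b = f2(a)
L = f3(b)

# Chain rule: multiply the local derivatives, output first, input last.
dL_db = math.cos(b)
db_da = 2.0 * a
da_dx = 3.0
dL_dx = dL_db * db_da * da_dx

# Independent check: central finite difference on the whole composition.
eps = 1e-6
numeric = (f3(f2(f1(x + eps))) - f3(f2(f1(x - eps)))) / (2 * eps)
print(dL_dx, numeric)  # the two numbers should agree closely
```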

#### TODO: transition from above abstract functions to things like multiplication and addition



$$
f = * \hspace{0.5in} f(x,y) = xy
$$

$$
\frac{\partial f}{\partial x} = y \hspace{0.5in} \frac{\partial f}{\partial y} = x
$$

Let's think about this for a bit. What we're saying here is that the rate of change of $f$ with respect to $x$ is $y$, and vice versa. Remember, when we take the derivative with respect to $x$ we treat $y$ as a constant. So let's say $y = 10$ and think about changing $x$ from 3 to 4. Then $f(3, 10) = 30$ and $f(4, 10) = 40$. That's a change of 10, or $y$! Every time we change $x$ by 1, $f$ changes by $y$.


$$
f = + \hspace{0.5in} f(x,y) = x + y
$$

$$
\frac{\partial f}{\partial x} = 1 \hspace{0.5in} \frac{\partial f}{\partial y} = 1
$$

Again, let's think about this. The change in $f$ with respect to $x$ is 1, and the same goes for $y$. This means every time we change $x$ by 1, $f$ will also change by 1. It also shows that $x$ and $y$ are independent of each other.
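
Both local derivatives above can be checked numerically. The sketch below is only an illustration (the helper `slope_wrt_x` is a hypothetical name): it nudges $x$ by a tiny amount while holding $y$ fixed and measures how much the output moves.

```python
def mul(x, y):
    return x * y

def add(x, y):
    return x + y

def slope_wrt_x(func, x, y, eps=1e-6):
    """Central-difference estimate of d func / d x with y held constant."""
    return (func(x + eps, y) - func(x - eps, y)) / (2 * eps)

x, y = 3.0, 10.0
print(mul(4.0, y) - mul(3.0, y))         # 10.0: stepping x from 3 to 4 changes f by y
print(round(slope_wrt_x(mul, x, y), 4))  # should be close to y
print(round(slope_wrt_x(add, x, y), 4))  # should be close to 1
```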

Ok, let's now use this for a simple function.

TODO: picture here

In [None]:
# f(x, y, z) = x * y + (x * z)
# we can split these into subexpressions
# g(x, y) = x * y
# h(x, z) = x * z
# f(x, y, z) = g(x, y) + h(x, z)

# initial values
x = 3
y = 4
z = -5

g = x * y
h = x * z

f = g + h
print(f)

Let's take our function $f$ and apply the chain rule to compute the derivatives for $x, y, z$.

$$
\frac{\partial f}{\partial g} = 1 \hspace{0.1in}
\frac{\partial f}{\partial h} = 1 \hspace{0.1in}
\frac{\partial g}{\partial x} = y \hspace{0.1in}
\frac{\partial g}{\partial y} = x \hspace{0.1in}
\frac{\partial h}{\partial x} = z \hspace{0.1in}
\frac{\partial h}{\partial z} = x
$$


$$
\frac{\partial f}{\partial x} = 
\frac{\partial f}{\partial g}
\frac{\partial g}{\partial x}
+
\frac{\partial f}{\partial h}
\frac{\partial h}{\partial x}
\hspace{0.5in}
\frac{\partial f}{\partial y} = 
\frac{\partial f}{\partial g}
\frac{\partial g}{\partial y}
\hspace{0.5in}
\frac{\partial f}{\partial z} = 
\frac{\partial f}{\partial h}
\frac{\partial h}{\partial z}
$$


$$
\frac{\partial f}{\partial x} = 1 * y + 1 * z = y + z
\hspace{0.5in}
\frac{\partial f}{\partial y} = 1 * x = x
\hspace{0.5in}
\frac{\partial f}{\partial z} = 1 * x = x
$$
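
As a sanity check on the three partials, we can estimate them without any calculus. This is a sketch, not part of the lab's code: `numerical_gradient` is a hypothetical helper that perturbs one argument at a time and uses a central difference.

```python
def f(x, y, z):
    return x * y + x * z

def numerical_gradient(func, args, eps=1e-6):
    """Central-difference estimate of d func / d args[i] for every i."""
    grads = []
    for i in range(len(args)):
        plus = list(args)
        minus = list(args)
        plus[i] += eps
        minus[i] -= eps
        grads.append((func(*plus) - func(*minus)) / (2 * eps))
    return grads

approx = numerical_gradient(f, [3.0, 4.0, -5.0])
# Should be close to [y + z, x, x] evaluated at x=3, y=4, z=-5
print([round(g, 4) for g in approx])
```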

In [None]:
# The above in code
x = 3
y = 4
z = -5

dfdg = 1.0
dfdh = 1.0
dgdx = y
dgdy = x
dhdx = z
dhdz = x

dfdx = dfdg * dgdx + dfdh * dhdx
dfdy = dfdg * dgdy
dfdz = dfdh * dhdz

# the output of backpropagation is the gradient
gradient = [dfdx, dfdy, dfdz]
print(gradient) # [-1.0, 3.0, 3.0]

Notice the following expression, specifically the `+` function:

$$
\frac{\partial f}{\partial x} = 
\frac{\partial f}{\partial g}
\frac{\partial g}{\partial x}
+
\frac{\partial f}{\partial h}
\frac{\partial h}{\partial x}
$$

Think about how $x, y, z$ flow through the graph. Both $y$ and $z$ have 1 output edge and follow 1 path to $f$. On the other hand, $x$ has 2 output edges and follows 2 paths to $f$. Remember, we're calculating **the derivative of f with respect to x**. In order to do this we have to consider all the ways $x$ affects $f$ (all the paths in the graph from $x$ to $f$).

An easy way to see how many paths we have to consider for a node's derivative is to trace all the paths back from the output node to the input node. So if $f$ is the output node and $x$ is the input node, trace all the paths back from $f$ to $x$. It's not always the case that all the output edges of $x$ will lead to $f$, so we shouldn't just assume we have to consider all the output edges of $x$.
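
The path-counting idea can be sketched in a few lines. The graph encoding below is an assumption made for illustration (a plain dictionary of edges, not the lab's `Node` class), and the extra node `u` is a hypothetical dead end added to show an output edge of $x$ that never reaches $f$.

```python
# Graph for f = (x * y) + (x * z) at x=3, y=4, z=-5.
# edges[node] lists (consumer, d consumer / d node) for every output edge.
x, y, z = 3.0, 4.0, -5.0
edges = {
    "x": [("g", y), ("h", z), ("u", 2.0)],  # "u" is a made-up dead end
    "y": [("g", x)],
    "z": [("h", x)],
    "g": [("f", 1.0)],
    "h": [("f", 1.0)],
    "u": [],
    "f": [],
}

def path_products(node, target):
    """Product of local derivatives along each path from node to target."""
    if node == target:
        return [1.0]
    products = []
    for consumer, local in edges[node]:
        for rest in path_products(consumer, target):
            products.append(local * rest)
    return products

for name in ("x", "y", "z"):
    paths = path_products(name, "f")
    # number of paths to f, and the derivative as the sum over those paths
    print(name, len(paths), sum(paths))
```

Here $x$ has 3 output edges but only 2 paths to $f$, so only 2 terms enter the sum.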

Keep these things in mind for the node implementations!

## Graphs, Nodes and Ops

I'd like to draw attention to a few things from the previous section.

1. The picture of the broken down expression and subexpressions of $f$ resembles a graph where the nodes are function applications.
2. We can use dynamic programming to make computing backpropagation efficient. Even in our simple example we see the reuse of $\frac{\partial f}{\partial g}$ and $\frac{\partial f}{\partial h}$. As our graph grows in size and complexity, it becomes much more evident how wasteful it is to recompute partials. The cornerstone of dynamic programming is **solving a large problem through many smaller ones** and **caching**. We'll do both!
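
One way to picture the caching is a single backward sweep in which each node's derivative is computed once and then reused by everything upstream of it. The sketch below assumes a hand-written dictionary encoding (`local`, `order` and `grad` are illustrative names, not the `miniflow` implementation).

```python
# Graph for f = g + h, g = x * y, h = x * z at x=3, y=4, z=-5.
x, y, z = 3.0, 4.0, -5.0

# For each node: (input name, local derivative of the node w.r.t. that input)
local = {
    "f": [("g", 1.0), ("h", 1.0)],
    "g": [("x", y), ("y", x)],
    "h": [("x", z), ("z", x)],
}

# Reverse topological order: a node is visited before any of its inputs.
order = ["f", "g", "h"]

grad = {"f": 1.0}  # cache of df/dnode; df/df = 1 seeds the sweep
for node in order:
    for inp, d in local[node]:
        # Reuse the cached grad[node]; contributions from several paths add up.
        grad[inp] = grad.get(inp, 0.0) + grad[node] * d

print(grad["x"], grad["y"], grad["z"])
```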

In the following exercises you'll implement the forward and backward passes for the nodes in our graph.

You'll write your code in `miniflow.py` (same directory). The autoreload extension will automatically reload your code when you make a change!

In [None]:
%load_ext autoreload
%autoreload 2
from miniflow import *

### Exercise - Implement $f = (x * y) + (x * z)$ using nodes

Implement the `Mul` and `Add` nodes. The `Input` node is already provided.

Node template to get started

```
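class Node(object):
    def __init__(self, input_nodes=None):
        # use None as the default to avoid sharing one list between nodes
        self.input_nodes = input_nodes if input_nodes is not None else []
        self.output_nodes = []
        self.cache = {}
        self.value = 0
        self.dvalues = {}

    def forward(self):
        # TODO: implement and store in self.value
        raise NotImplementedError

    def backward(self):
        # TODO: implement and store in self.dvalues
        raise NotImplementedError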
```

In [None]:
x = Input()
y = Input()
z = Input()

# TODO: implement Mul and Add in miniflow.py
g = Mul(x, y)
h = Mul(x, z)
f = Add(g, h)

# x, y, z nodes will take on these values and pass them to their outputs
feed_dict = {x: 3, y: 4, z: -5}
# compute the derivatives with respect to the following nodes
wrt = [x, y, z]

value, grad = value_and_grad(f, feed_dict, wrt)
assert value == -3 # should be -3
assert grad == [-1, 3, 3] # should be [-1, 3, 3]

## Functions for Neural Networks

Content: Introduce Linear, Sigmoid and CrossEntropyLoss nodes. Matrices, vectors, activation functions, loss functions.

We'll now focus on how we can use differentiable graphs to compute functions for neural networks.
Let's assume we have a vector of features $x$, a vector of weights $w$ and a scalar bias $b$. To compute the output we perform a linear transform.

$$
o = (\sum_i x_iw_i) + b
$$

Or more concisely expressed as a dot product

$$
o =  x^Tw + b
$$

What if we have multiple outputs? Say we have $n$ features and $k$ outputs. Then $b$ is a vector of length $k$, $x$ is a vector of length $n$, and $w$ becomes an $n$ by $k$ matrix, which we'll call $W$ from now on (matrices are typically written with a capital letter).

$$
o = x^TW + b
$$

What if we now have $m$ inputs? In practice it's very common to feed in more than 1 input at a time; the number of inputs is referred to as the batch size. Then $x$ becomes an $m$ by $n$ matrix, which we'll call $X$.

$$
o = XW + b
$$
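
The shapes involved are easy to verify with NumPy. The sizes and random values below are arbitrary choices for illustration; the point is that $b$ is broadcast across the $m$ rows and that each row of the batched result matches the single-example formula.

```python
import numpy as np

m, n, k = 4, 3, 2                  # batch size, features, outputs (arbitrary)
rng = np.random.default_rng(0)
X = rng.standard_normal((m, n))
W = rng.standard_normal((n, k))
b = rng.standard_normal((1, k))

O = X @ W + b                      # b is broadcast across the m rows
print(O.shape)                     # (4, 2)

# Row 0 of the batched output equals the single-example form x^T W + b
row0 = X[0] @ W + b[0]

# and each entry is the explicit sum over features plus the bias
entry = sum(X[0, i] * W[i, 0] for i in range(n)) + b[0, 0]
print(np.allclose(O[0], row0), np.isclose(O[0, 0], entry))
```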

There we have it, the famous linear transform! On its own, though, it's not all that powerful: we'll only do well if the data is linearly separable. This is where non-linear activations and layer stacking come into play. In fact, even a 2-layer neural network can [approximate arbitrary functions](http://neuralnetworksanddeeplearning.com/chap4.html). Pretty cool! There is, however, a big difference between being able to approximate any function in theory and actually being able to do it efficiently and effectively in practice. If it were that easy, we wouldn't need convolutional networks, recurrent neural networks, residual neural networks, generative neural network models, etc.

For this lab, though, we'll keep it relatively simple. By the end of the lab you'll be able to construct and train a neural network with the following architecture.

$$
Input \rightarrow Linear \rightarrow Sigmoid \rightarrow Linear \rightarrow Softmax \rightarrow CrossEntropyLoss
$$

### Exercise - Implement `Linear` Node

In this exercise we'll implement the `Linear` node. This corresponds to the following function:

$$
f(X, W, b) = XW + b
$$

There are a few ways to go about the implementation, here are some possibilities:

1. Treat each element of the matrices and vector as a scalar and use existing `Mul` and `Add` nodes. You might have to implement a `Sum` node as well.
2. Break the function up into two subfunctions, you might call these `MatMul` and `BiasAdd`.
3. Treat it as one function.

Independent of the option chosen, the `dvalues` attribute of the `Linear` node should have 3 key-value pairs, where each value has the same shape as its key. Example: if $W$ has shape $n \times m$, then `dvalues[W]` should also have shape $n \times m$.

**Tip: Write out the full above expression on paper and consider the derivative of $f$ with respect to a single element of $W, X, b$**
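
If you want to check your paper derivation without peeking at a solution, a finite-difference estimate is handy. This is a sketch under one assumption: the gradient is taken of the sum of all output elements, which is what the expected values in the test cell below suggest. `numerical_grad` is a hypothetical helper, not part of `miniflow`.

```python
import numpy as np

def numerical_grad(func, arr, eps=1e-6):
    """Central-difference gradient of sum(func(arr)) w.r.t. each element of arr."""
    grad = np.zeros_like(arr, dtype=float)
    for idx in np.ndindex(arr.shape):
        old = arr[idx]
        arr[idx] = old + eps
        plus = np.sum(func(arr))
        arr[idx] = old - eps
        minus = np.sum(func(arr))
        arr[idx] = old              # restore the original entry
        grad[idx] = (plus - minus) / (2 * eps)
    return grad

X = np.array([[-1., -2.], [-1., -2.]])
W = np.array([[2., -3.], [2., -3.]])
b = np.array([[-3., -3.]])

dW = numerical_grad(lambda w: X @ w + b, W)
db = numerical_grad(lambda bias: X @ W + bias, b)
print(np.round(dW, 4))
print(np.round(db, 4))
```

Comparing these estimates with your node's `dvalues` is a quick way to catch a transposed matrix or a missing sum over the batch.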

In [None]:
x_in, w_in, b_in = Input(), Input(), Input()
# TODO: implement Linear
f = Linear(x_in, w_in, b_in)

x = np.array([[-1., -2.], [-1, -2]])
w = np.array([[2., -3], [2., -3]])
b = np.array([-3., -3]).reshape(1, -1)

feed_dict = {x_in: x, w_in: w, b_in: b}
loss, grad = value_and_grad(f, feed_dict, (x_in, w_in, b_in))
assert np.allclose(loss, np.array([[-9.,  6.], [-9.,  6.]]))
assert np.allclose(grad[0], np.array([[-1.,  -1.], [-1.,  -1.]]))
assert np.allclose(grad[1], np.array([[-2.,  -2.], [-4., -4.]]))
assert np.allclose(grad[2], np.array([[2., 2.]]))

### Exercise - Implement `Sigmoid` Node

In this exercise we'll implement the `Sigmoid` node. This corresponds to the following function:

$$
f(x) = \frac {1} {1 + \exp(-x)}
$$

Where $x$ is the output of a `Linear` node.

There are 2 ways to go about the implementation:

1. Break it up into subfunctions, `Add`, `Divide`, `Exp`, etc.
2. Try to take the derivative of the sigmoid function and see if you can simplify it. The result is surprisingly simple!
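
Whichever route you take, you can compare against a numerical estimate. The sketch below evaluates the sigmoid formula directly and estimates its slope with a central difference, so it does not give away the simplified derivative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([-10., 0., 10.])
eps = 1e-6
# Element-wise slope estimate: nudge every input up and down by eps
slope = (sigmoid(x + eps) - sigmoid(x - eps)) / (2 * eps)

print(np.round(sigmoid(x), 4))
print(np.round(slope, 4))
```

Note how the slope is largest at 0 and nearly vanishes at the extremes, where the sigmoid saturates.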

In [None]:
x_in = Input()
# TODO: implement Sigmoid
f = Sigmoid(x_in)

x = np.array([-10., 0, 10])
feed_dict = {x_in: x}
loss, grad = value_and_grad(f, feed_dict, [x_in])
assert np.allclose(loss, np.array([0., 0.5, 1.]), atol=1.e-4)
assert np.allclose(grad, np.array([0., 0.25, 0.]), atol=1.e-4)

### Exercise - Implement `CrossEntropyLoss` Node

In this exercise we'll implement the `CrossEntropyLoss` node. This corresponds to the following functions:

$$
softmax(x_i) = \frac{e^{x_i}} {\sum_j e^{x_j}}
$$

The input to the $softmax$ function should be an $n \times k$ matrix, where $n$ is the number of examples (batch size) and $k$ is the number of classes an example can belong to. The $softmax$ function then computes the probability of each example belonging to each class. In this case $\sum_j e^{x_j}$ sums over a row in the matrix and $e^{x_i}$ is a single element in that row. The output of the $softmax$ should be a matrix (same shape as the input) of probabilities.

Example:

```
a = [0.2, 1.0, 0.3]
softmax(a) # [0.2309, 0.5139, 0.2552]
```
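
A row-wise softmax can be sketched in NumPy as follows. Subtracting the row maximum before exponentiating is a common numerical-stability trick added here; it is not part of the formula above and does not change the result.

```python
import numpy as np

def softmax(x):
    """Row-wise softmax for an n x k matrix of scores."""
    x = np.atleast_2d(x)
    shifted = x - np.max(x, axis=1, keepdims=True)  # stability only
    exps = np.exp(shifted)
    return exps / np.sum(exps, axis=1, keepdims=True)

scores = np.array([[0.2, 1.0, 0.3],
                   [0.5, 1.0, 1.5]])
probs = softmax(scores)
print(np.round(probs, 4))
print(probs.sum(axis=1))  # every row sums to 1
```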

$$
cross\_entropy\_loss(probs, labels) = \frac{\sum_i -\log(probs_{i, labels_i})} {n}
$$

Where $probs$ is the output of the $softmax$ function, $labels$ are the target labels (integer values from 0 to $k-1$) and $n$ is the number of rows of $probs$. For each row $i$ we pick the element at the index corresponding with the label. 

Ok, but how will this loss encourage our model to classify correctly?

There are 2 key pieces of information here:

1. $probs$ contains values between 0 and 1
2. $\log(p) \to -\infty$ as $p \to 0$, and $\log(1) = 0$

TODO: log function graph?

Clearly we want the probability at the index of the correct label to be 1 or close to it. If it's close to 0 we'll be heavily penalized and our loss will shoot up.

The negative sign in front of the log is simply so our objective is to minimize instead of maximize. The values produced by log between 0 and 1 will be negative or 0.
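
The loss formula and the penalty argument can be made concrete with a small sketch. The probabilities and labels below are made-up values, and this standalone function takes probabilities directly, which may differ from how your `CrossEntropyLoss` node is organized.

```python
import numpy as np

def cross_entropy_loss(probs, labels):
    """Mean negative log-probability of the labeled class in each row."""
    n = probs.shape[0]
    picked = probs[np.arange(n), labels]  # element at column labels[i] in row i
    return np.sum(-np.log(picked)) / n

probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.1, 0.8]])

good = cross_entropy_loss(probs, np.array([0, 2]))  # labels match the confident classes
bad = cross_entropy_loss(probs, np.array([2, 0]))   # labels hit low-probability classes
print(good, bad)  # the second loss is much larger
```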

Once again, there are 2 ways to go about the implementation:

1. Break it down into subfunctions.
2. Similar to the sigmoid function, we can simplify the derivative. The result is even simpler than sigmoid's!

In [None]:
x_in = Input()
y_in = Input()
# TODO: implement CrossEntropyLoss
f = CrossEntropyLoss(x_in, y_in)

# pretend raw scores from a Linear node; the expected values below
# assume the softmax is applied inside CrossEntropyLoss
x = np.array([[0.5, 1., 1.5]])
y = np.array([1])
feed_dict = {x_in: x, y_in: y}
loss, grad = value_and_grad(f, feed_dict, wrt=[x_in])
assert np.allclose(loss, 1.1802, atol=1.e-4)
assert np.allclose(grad, np.array([[0.1863, -0.6928,  0.5064]]), atol=1.e-4)

## Create neural net to train MNIST

Content: neural network layers, SGD

We now have all our pieces in place, all that's left to do is stack them together like legos.

TODO: lego picture

### Exercise - Make a 2-layer neural net to train MNIST using `Input`, `Linear`, `Sigmoid`, `CrossEntropyLoss` nodes.

### Exercise - Training with Stochastic Gradient Descent

## Just here for a small test

In [None]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
N = 100 # number of points per class
D = 2 # dimensionality
K = 3 # number of classes
X = np.zeros((N*K,D)) # data matrix (each row = single example)
y = np.zeros(N*K, dtype='uint8') # class labels
for j in range(K):
    ix = range(N*j,N*(j+1))
    r = np.linspace(0.0,1,N) # radius
    t = np.linspace(j*4,(j+1)*4,N) + np.random.randn(N)*0.2 # theta
    X[ix] = np.c_[r*np.sin(t), r*np.cos(t)]
    y[ix] = j
# let's visualize the data:
plt.scatter(X[:, 0], X[:, 1], c=y, s=40, cmap=plt.cm.Spectral)

In [None]:
X_in, W_in, b_in, y_in = Input(), Input(), Input(), Input()
f = Linear(X_in, W_in, b_in)
f = CrossEntropyLoss(f, y_in)

In [None]:
W = 0.01 * np.random.randn(D,K)
b = np.zeros((1,K))

In [None]:
# works
alpha = 1e-0
for i in range(200):
    feed_dict = {X_in: X, W_in: W, b_in: b, y_in: y}
    loss, grad = value_and_grad(f, feed_dict, [W_in, b_in])
    
    if i % 10 == 0:
        print("Iteration {}: Loss = {}".format(i, loss))
        
    # gradient descent step (this loop uses the full dataset each iteration)
    dW, db = grad
    W -= alpha * dW
    b -= alpha * db
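
The loop above feeds all of `X` on every iteration. For stochastic gradient descent you would instead feed a small random batch each step. The generator below is a sketch of one way to produce such batches (`minibatches` and the demo arrays are hypothetical names, not part of `miniflow`); each yielded pair could be placed in `feed_dict` in place of `X` and `y`.

```python
import numpy as np

def minibatches(X, y, batch_size, rng):
    """Yield shuffled (X_batch, y_batch) pairs that cover the data once."""
    order = rng.permutation(X.shape[0])
    for start in range(0, len(order), batch_size):
        idx = order[start:start + batch_size]
        yield X[idx], y[idx]

rng = np.random.default_rng(0)
X_demo = np.arange(20, dtype=float).reshape(10, 2)
y_demo = np.arange(10)

batches = list(minibatches(X_demo, y_demo, batch_size=4, rng=rng))
print([xb.shape for xb, _ in batches])  # [(4, 2), (4, 2), (2, 2)]
```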