# CSS

In [1]:
from IPython.display import HTML
style = """
<style>
.expo {
  line-height: 150%;
}

.visual {
  width: 600px;
}

</style>
"""
HTML(style)

# Deep Learning from Scratch using Python 

Seth Weidman

11/04/2017

# To begin...

In [2]:
# import tensorflow as tf

Nah...

# Outline

This talk will have two parts:

## Part 1: Neural Nets from Scratch

* We'll implement a basic neural net with one hidden layer, from scratch, and use mathematical principles to get the backpropagation right.

* We'll show that this same framework can be used to learn MNIST.

## Part 2: Transitioning to Deep Learning

* We'll transition to Deep Learning by changing our mental model of neural nets: they consist of "layers" that pass information forwards and backwards between them.

* We'll show how this framework can be used to construct arbitrarily deep neural networks, and show how these can learn MNIST.

# Part 1: Neural Nets from Scratch

<div class="expo">
We've all seen diagrams like the following in the context of neural nets:
</div>

## Part 1: Neural Nets from Scratch

<div class="visual">
<img src='img/neural_net_v3.png'>
</div>

<div class="expo">
Many don't fully understand what is going on in this diagram. This talk will attempt to rectify that.
</div>

## Why neural nets?

<div class="expo">
Let's suppose we need a function that can learn a complicated relationship between inputs and outputs:
</div>

In [3]:
import os
import sys
sys.path.append(os.getcwd())
from helpers import *

In [4]:
df, X, Y = generate_x_y()

### Why Neural Nets?

In [5]:
df

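| Unnamed: 0 | X1 | X2 | X3 | y |
|---|---|---|---|---|
| 0 | 1 | 0 | 0 | 1 |
| 1 | 0 | 1 | 0 | 0 |
| 2 | 1 | 1 | 0 | 1 |
| 3 | 0 | 0 | 0 | 0 |
| 4 | 1 | 0 | 1 | 0 |
| 5 | 0 | 0 | 1 | 0 |
| 6 | 1 | 1 | 1 | 1 |
| 7 | 0 | 1 | 1 | 1 |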


<div class="expo">
So: we have eight observations, and a complex relationship between inputs and outputs. Now, let's build a neural net that can "learn" this relationship.
</div>

### Why Neural Nets?

<div class="expo">
How to even begin? Well, let's look at this from as high a level as possible and then progressively dive deeper. First, we want a function $N$ that--based on the data from before--maps inputs to outputs properly, that is:
</div>

$$ N(1, 0, 0) = 1 $$
$$ N(0, 1, 0) = 0 $$
$$ N(1, 1, 0) = 1 $$

etc.

### Why neural nets

<div class="expo">
First, let's observe that logistic regression can't do this. That is, there are no parameters $b$, $w_1$, $w_2$, and $w_3$ for which the following model reproduces all eight rows of the table:
</div>

$$N(x_1, x_2, x_3) = \frac{1}{1 + e^{-(b + w_1 * x_1 + w_2 * x_2 + w_3 * x_3)}}$$

### Why neural nets

<div class="expo">
This is true for the same reason that logistic regression cannot learn XOR. Our problem is three-dimensional, since we have three features, but the two-dimensional case already shows what goes wrong: no straight line separates the two classes, so the space is not linearly separable:
</div>

<div class="visual">
  <img src="img/xor.png">
</div>
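
For this particular dataset, the impossibility can also be checked by hand. The following is a sketch, under the assumption that we predict 1 when $N > 0.5$ and 0 otherwise, so that the prediction depends only on the sign of the linear score $z = b + w_1 x_1 + w_2 x_2 + w_3 x_3$. Suppose positive scores correspond to $y = 1$ (the opposite convention just flips every inequality and leads to the same contradiction). Four rows of the table then require:

$$
\begin{align}
(0, 1, 0) \mapsto 0 &: \quad b + w_2 < 0 \\
(0, 1, 1) \mapsto 1 &: \quad b + w_2 + w_3 > 0 \\
(1, 0, 0) \mapsto 1 &: \quad b + w_1 > 0 \\
(1, 0, 1) \mapsto 0 &: \quad b + w_1 + w_3 < 0
\end{align}
$$

The first two lines force $w_3 > 0$, while the last two force $w_3 < 0$, so no choice of parameters satisfies all of them.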

### Why neural nets

<div class="expo">
Of course, we *could* manually do feature engineering...
</div>

... but who likes feature engineering???

<div class="expo">
Let's have the computer do the feature engineering for us, via a neural net learning hidden features!
</div>

# Let's make a prediction using a neural net

Our goal will be to:

* Start with our three original features.
* Transform them into four "intermediate features" using a logistic-regression-like transformation.
* Take these four "intermediate features" and use _them_ to predict our final output.

## Step 1: feeding features to the "intermediate" or "hidden" layer

How do we transform our original three features into an intermediate, or hidden layer? Let's call our intermediate features $a_1$, $a_2$, $a_3$, and $a_4$. 

We want $a_1$, for example, to be a linear combination of $x_1$, $x_2$, and $x_3$ - that is, we want some weights $v_{11}$, $v_{21}$, and $v_{31}$ so that:

$$ a_1 = x_1 * v_{11} + x_2 * v_{21} + x_3 * v_{31} $$

and similarly for $a_2$, $a_3$, and $a_4$.

### Step 1: feeding features to the "intermediate" or "hidden" layer

A way to concisely express this is to define your features as a vector

$$ X = \begin{bmatrix}x_1 & x_2 & x_3\end{bmatrix} $$

This should be intuitive, since $X$ is already a row in your data! 

### Step 1: feeding features to the "intermediate" or "hidden" layer

Then, multiplying this vector by a matrix $V$:

$$ V = \begin{bmatrix}v_{11} & v_{12} & v_{13} & v_{14} \\
                      v_{21} & v_{22} & v_{23} & v_{24} \\
                      v_{31} & v_{32} & v_{33} & v_{34}
                      \end{bmatrix} $$

gives us what we want, since "$A = X * V$" is equivalent to:

### Step 1: feeding features to the "intermediate" or "hidden" layer

$$ a_1 = x_1 * v_{11} + x_2 * v_{21} + x_3 * v_{31} $$
$$ a_2 = x_1 * v_{12} + x_2 * v_{22} + x_3 * v_{32} $$
$$ a_3 = x_1 * v_{13} + x_2 * v_{23} + x_3 * v_{33} $$
$$ a_4 = x_1 * v_{14} + x_2 * v_{24} + x_3 * v_{34} $$

which is what we want in order to get four intermediate features.
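
As a quick sanity check on this equivalence, here is a small sketch that compares the matrix product with the four sums written out term by term. The numbers in `x` and `V` are made up purely for illustration, and `A_by_hand` is a name introduced here, not something from the talk's helpers.

```python
import numpy as np

# Hypothetical values, chosen only so the arithmetic is easy to follow
x = np.array([[1.0, 2.0, 3.0]])                 # 1 x 3 row of features
V = np.arange(12, dtype=float).reshape(3, 4)    # 3 x 4 weight matrix

# Matrix form: A = X * V
A = np.dot(x, V)

# Written out: a_j = x_1 * v_1j + x_2 * v_2j + x_3 * v_3j
A_by_hand = np.array(
    [[sum(x[0, i] * V[i, j] for i in range(3)) for j in range(4)]]
)

print(A)
print(np.allclose(A, A_by_hand))
```

Each column of `V` holds the weights feeding one intermediate feature, which is why a single row of inputs times `V` gives all four features at once.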

### Step 1: feeding features to the "intermediate" or "hidden" layer

Let's code this up.

In [6]:
x = np.array(X[0], ndmin=2)
array_print(x)

The array:
 [[1 0 0]]
The dimensions are 1 row and 3 columns


### Step 1: feeding features to the "intermediate" or "hidden" layer

In [7]:
V = np.random.randn(3, 4)
array_print(V)

The array:
 [[ 0.36  0.3  -0.63 -0.34]
 [-1.03  1.26  0.41  1.2 ]
 [-1.07 -0.68  0.08  1.04]]
The dimensions are 3 rows and 4 columns


In [8]:
A = np.dot(x, V)
array_print(A)

The array:
 [[ 0.36  0.3  -0.63 -0.34]]
The dimensions are 1 row and 4 columns


## Where are we?

<div class="visual">
<img src='img/neural_net_4_first_layer.png'>
</div>

## Step 2: feeding these intermediate features through an "activation function"

We're going to use a classic, easy-to-understand activation function, though one that is not often used in cutting-edge applications: the sigmoid function:

$$\sigma(x) = \frac{1}{1 + e^{-x}}$$

### Step 2: feeding these intermediate features through an "activation function"

$$ B = \sigma(A) $$ or

$$ b_1 = \sigma(a_1) $$
$$ b_2 = \sigma(a_2) $$
$$ b_3 = \sigma(a_3) $$
$$ b_4 = \sigma(a_4) $$

### Step 2: feeding these intermediate features through an "activation function"

In [9]:
def sigmoid(x):
    return 1.0/(1.0+np.exp(-x))

In [10]:
B = sigmoid(A)
array_print(B)

The array:
 [[ 0.59  0.58  0.35  0.42]]
The dimensions are 1 row and 4 columns


## Where are we?

<div class="visual">
<img src='img/neural_net_4_first_sigmoid.png'>
</div>

### Step 3: use these intermediate features as a linear combination to the output

We'll multiply these "sigmoided" results by another matrix $W$ to get a single output. Since we want to transform 4 features down into 1, we can use a 4 x 1 matrix:

$$ W = \begin{bmatrix}w_{11} \\
                      w_{21} \\
                      w_{31} \\
                      w_{41}
                      \end{bmatrix} $$

### Step 3: use these intermediate features as a linear combination to the output

And since we want the result to be:

$$ c_1 = w_{11} * b_1 + w_{21} * b_2 + w_{31} * b_3 + w_{41} * b_4 $$

### Step 3: use these intermediate features as a linear combination to the output

This is equivalent to writing:

$$ C = B * W $$

or:

$$ \begin{bmatrix}
c_1 \end{bmatrix} = 
\begin{bmatrix}b_1 &
                  b_2 &
                  b_3 &
                  b_4
                  \end{bmatrix} * 
\begin{bmatrix}w_{11} \\
               w_{21} \\
               w_{31} \\
               w_{41}
               \end{bmatrix} $$

### Step 3: use these intermediate features as a linear combination to the output

So we can simply code this up as:

In [11]:
W = np.random.randn(4, 1)
array_print(W)

The array:
 [[-0.47]
 [ 0.94]
 [ 0.56]
 [-1.2 ]]
The dimensions are 4 rows and 1 column


In [12]:
C = np.dot(B, W)
array_print(C)

The array:
 [[-0.04]]
The dimensions are 1 row and 1 column


## Where are we

<div class="visual">
<img src='img/neural_net_4_second_layer.png'>
</div>

### Step 4: sigmoid this to make a final prediction

Mathematically, we want:

$$ p_1 = \sigma(c_1) $$

So we can simply code this up as:

In [13]:
P = sigmoid(C)
array_print(P)

The array:
 [[ 0.49]]
The dimensions are 1 row and 1 column


## Where are we

<div class="visual">
<img src='img/neural_net_4_final_prediction.png'>
</div>

### Step 5: compute the loss

Mathematically, we'll compute the squared error loss for this single observation, halved so that its derivative comes out cleaner:

$$ L = \frac{1}{2}(y - P)^2 $$

And coding this up is simply:

In [14]:
y = np.array(Y[0], ndmin=2)
L = 0.5 * (y - P) ** 2
array_print(L)

The array:
 [[ 0.13]]
The dimensions are 1 row and 1 column


## Where are we

<div class="visual">
<img src='img/neural_net_4_loss.png'>
</div>

## Now what?

We have made our prediction and computed our loss, $L$. Now what?

Recall: each "step" is just a function applied to some input that results in some output.

### Now what?

If we write out what we just did in terms of mathematical functions, we could write it as:

\begin{align}
A &= a(x, V) \\
B &= b(A) \\
C &= c(B, W) \\
P &= p(C) \\
L &= l(P)
\end{align}

So, say we have a neural net with just one hidden layer. We could write the loss of a neural net on a given observation $ x $ as:

$$ L = l(p(c(b(a(x, V)), W))) $$
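
To make the "nested functions" view concrete, here is a minimal sketch that writes each step as its own Python function and checks that the step-by-step and nested forms agree. The function names mirror the symbols above; passing `y` to `l` explicitly and the random weights drawn from a fixed seed are choices made for this sketch.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def a(x, V):
    return np.dot(x, V)           # linear step into the hidden layer

def b(A):
    return sigmoid(A)             # hidden activation

def c(B, W):
    return np.dot(B, W)           # linear step into the output

def p(C):
    return sigmoid(C)             # final prediction

def l(P, y):
    return 0.5 * (y - P) ** 2     # loss; y is passed explicitly in this sketch

rng = np.random.RandomState(0)    # arbitrary seed, for repeatability only
x = np.array([[1.0, 0.0, 0.0]])
y = np.array([[1.0]])
V = rng.randn(3, 4)
W = rng.randn(4, 1)

# Step by step, as in the cells above
A = a(x, V)
B = b(A)
C = c(B, W)
P = p(C)
L_steps = l(P, y)

# As one nested expression
L_nested = l(p(c(b(a(x, V)), W)), y)

print(np.allclose(L_steps, L_nested))
```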

### Now what?

Mathematically, we _want_ to change the weights in such a way that the loss will be reduced during the next iteration. The equations:

$$ W = W - \frac{\partial l}{\partial W}$$

$$ V = V - \frac{\partial l}{\partial V}$$

do this.

### Now what?

Notice that this "makes sense":

* If $\frac{\partial l}{\partial W}$ is a positive number, then we want to _decrease_ the weight, since increasing the weight would _increase_ our loss. That is exactly what the equation $ W = W - \frac{\partial l}{\partial W}$ does.
* Similarly, if $\frac{\partial l}{\partial W}$ is a negative number, then we want to _increase_ the weight, since increasing the weight would _decrease_ our loss. In both cases, the equation $ W = W - \frac{\partial l}{\partial W}$ works.
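
The sign argument can be seen on a toy one-parameter loss. This is only an illustration: the loss function, the starting points, and the `step_size` of 0.5 are assumptions made for the sketch (the update equations above use an implicit step size of 1).

```python
def loss(w):
    # Toy one-parameter loss with its minimum at w = 3
    return 0.5 * (w - 3.0) ** 2

def grad(w):
    # Derivative of the toy loss with respect to w
    return w - 3.0

step_size = 0.5  # assumption for this sketch only

results = []
for w in (5.0, 0.0):
    w_new = w - step_size * grad(w)
    results.append((w, grad(w), w_new, loss(w), loss(w_new)))
    print("w =", w, "gradient =", grad(w), "-> new w =", w_new,
          "| loss", loss(w), "->", loss(w_new))
```

Starting at 5 the gradient is positive and the weight decreases; starting at 0 the gradient is negative and the weight increases. In both cases the loss goes down.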

## Backpropagation - setup

Now we want to make our neural net smarter by updating its weights. We've seen that to do that, we need to compute $\frac{\partial L}{\partial W}$ and $\frac{\partial L}{\partial V}$. How do we do this?

Well, we know that 

$$ L = l(p(c(b(a(x, V)), W))) $$

### Backpropagation - setup

Our good friend the chain rule tells us that: 

$$ \frac{\partial L}{\partial W} = \frac{\partial l}{\partial P} * \frac{\partial p}{\partial C} * \frac{\partial c}{\partial W}  $$

and 

$$ \frac{\partial L}{\partial V} = \frac{\partial l}{\partial P} * \frac{\partial p}{\partial C} * \frac{\partial c}{\partial B} * \frac{\partial b}{\partial A} * \frac{\partial a}{\partial V}  $$

Each one of these partial derivatives turns out to be simple!

## Backpropagation - step 1:

First, let's compute:

$$ \frac{\partial l}{\partial P} $$

### Backpropagation - step 1:

Since 

$$ L = l(P) = \frac{1}{2}(y - P)^2 $$

Then:

$$ \frac{\partial l}{\partial P} = -(y - P)$$

### Backpropagation - step 1:

And coding this up is simply:

In [15]:
dLdP = -(y - P)
array_print(dLdP)

The array:
 [[-0.51]]
The dimensions are 1 row and 1 column


## Where are we

<div class="visual">
    <img src='img/neural_net_4_loss_grad.png'>
</div>

## Backpropagation - step 2:

Next, let's compute:

$$ \frac{\partial p}{\partial C} $$

Recall that:

$$ P = \begin{bmatrix} p_1 \end{bmatrix} = p(c) = \sigma(c) $$

### A digression on the sigmoid

If

$$\sigma(x) = \frac{1}{1 + e^{-x}}$$

Then 

$$\sigma'(x) = \sigma(x) * (1 - \sigma(x))$$
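
This identity is easy to check numerically. The sketch below compares it with a central finite difference at a handful of points; the grid of points and the step `eps` are arbitrary choices for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

xs = np.linspace(-5, 5, 11)   # a few test points, including x = 0
eps = 1e-5                    # finite-difference step (arbitrary small value)

# Slope estimated from two nearby function values
numeric = (sigmoid(xs + eps) - sigmoid(xs - eps)) / (2 * eps)

# Slope from the closed-form identity
analytic = sigmoid(xs) * (1 - sigmoid(xs))

print(np.max(np.abs(numeric - analytic)))
print(analytic[5])   # derivative at x = 0
```

The derivative peaks at $x = 0$, where $\sigma(0) = 0.5$ and so $\sigma'(0) = 0.25$.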

### Backpropagation - step 2:

So if

$$ p(c) = \sigma(c) $$

then

$$ p'(c) = \sigma(c) * (1 - \sigma(c)) $$

### Backpropagation - step 2:

So, coding this up is simply:

In [16]:
dPdC = sigmoid(C) * (1-sigmoid(C))
array_print(dPdC)

The array:
 [[ 0.25]]
The dimensions are 1 row and 1 column


## Where are we

<div class="visual">
    <img src='img/neural_net_4_prediction_grad.png'>
</div>

### Backpropagation - step 3:

Next we want to compute:

$$ \frac{\partial c}{\partial W} $$

### Backpropagation - step 3:

Recall that:


$$
\begin{align}
C &= \begin{bmatrix} c_1 \end{bmatrix} \\ 
&= c(W) \\
&= w_{11} * b_1 + w_{21} * b_2 + w_{31} * b_3 + w_{41} * b_4
\end{align}
$$

### Backpropagation - step 3:

Now recall that by $ \frac{\partial c}{\partial W} $ we mean:

$$ \begin{bmatrix}\frac{\partial c}{\partial w_{11}} \\
                  \frac{\partial c}{\partial w_{21}} \\
                  \frac{\partial c}{\partial w_{31}} \\
                  \frac{\partial c}{\partial w_{41}}
                  \end{bmatrix} $$

### Backpropagation - step 3:

But since 

$$ c(W) = w_{11} * b_1 + w_{21} * b_2 + w_{31} * b_3 + w_{41} * b_4 $$

$ \frac{\partial c}{\partial w_{11}} $, for example, is just $b_1$, $ \frac{\partial c}{\partial w_{21}} $ is just $b_2$, etc.

### Backpropagation - step 3:

Thus,

$$ \frac{\partial c}{\partial W} =
\begin{bmatrix}\frac{\partial c}{\partial w_{11}} \\
                  \frac{\partial c}{\partial w_{21}} \\
                  \frac{\partial c}{\partial w_{31}} \\
                  \frac{\partial c}{\partial w_{41}}
                  \end{bmatrix} = \begin{bmatrix}b_1 \\
                  b_2 \\
                  b_3 \\
                  b_4
                  \end{bmatrix} $$

Which is just $ B^T$.

### Backpropagation - step 3:

So, coding this up is simply:

In [17]:
dCdW = B.T
array_print(dCdW)

The array:
 [[ 0.59]
 [ 0.58]
 [ 0.35]
 [ 0.42]]
The dimensions are 4 rows and 1 column


Note that this has the same dimensions as `W`, which is what we want.

## Computing $\frac{\partial L}{\partial W}$

Now computing $\frac{\partial L}{\partial W}$ is simply a matter of multiplying the three pieces together, as the chain rule tells us to. Subtracting the result from $W$ then moves the weights in the direction that reduces the loss.

In [18]:
dLdW = np.dot(dCdW, dLdP * dPdC)
array_print(dLdW, 3)

The array:
 [[-0.075]
 [-0.073]
 [-0.044]
 [-0.053]]
The dimensions are 4 rows and 1 column
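
One way to gain confidence in a hand-derived gradient like this is a finite-difference check: nudge each weight slightly, see how the loss changes, and compare with the formula. The sketch below does that for $W$. The values of `B`, `W`, and `y` are drawn from an arbitrary seed rather than taken from the cells above, and `loss_for` is a helper name introduced only for this check.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def loss_for(W, B, y):
    # Loss as a function of W alone, with B and y held fixed
    P = sigmoid(np.dot(B, W))
    return (0.5 * (y - P) ** 2).item()

rng = np.random.RandomState(0)   # arbitrary seed; values are illustrative
B = sigmoid(rng.randn(1, 4))     # stand-in for the hidden-layer output
W = rng.randn(4, 1)
y = np.array([[1.0]])

# Analytic gradient, using the same formulas as above
C = np.dot(B, W)
P = sigmoid(C)
dLdP = -(y - P)
dPdC = sigmoid(C) * (1 - sigmoid(C))
dLdW = np.dot(B.T, dLdP * dPdC)

# Numerical gradient, one weight at a time
eps = 1e-6
numeric = np.zeros_like(W)
for i in range(W.shape[0]):
    W_plus, W_minus = W.copy(), W.copy()
    W_plus[i, 0] += eps
    W_minus[i, 0] -= eps
    numeric[i, 0] = (loss_for(W_plus, B, y) - loss_for(W_minus, B, y)) / (2 * eps)

print(np.max(np.abs(numeric - dLdW)))
```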


## Backpropagation - step 4:

By the same logic that we applied in Step 3, since:

$$ c(W) = w_{11} * b_1 + w_{21} * b_2 + w_{31} * b_3 + w_{41} * b_4 $$

we have:

$$ \frac{\partial c}{\partial W} =
\begin{bmatrix}\frac{\partial c}{\partial w_{11}} \\
                  \frac{\partial c}{\partial w_{21}} \\
                  \frac{\partial c}{\partial w_{31}} \\
                  \frac{\partial c}{\partial w_{41}}
                  \end{bmatrix} = \begin{bmatrix}b_1 \\
                  b_2 \\
                  b_3 \\
                  b_4
                  \end{bmatrix} = B^T $$

### Backpropagation - step 4:

And again, since:

$$ c(W) = w_{11} * b_1 + w_{21} * b_2 + w_{31} * b_3 + w_{41} * b_4 $$

We have:
    
$$ \frac{\partial c}{\partial B} =
\begin{bmatrix}\frac{\partial c}{\partial b_1} &
                  \frac{\partial c}{\partial b_2} &
                  \frac{\partial c}{\partial b_3} &
                  \frac{\partial c}{\partial b_4}
                  \end{bmatrix} = \begin{bmatrix}w_{11} &
                  w_{21} &
                  w_{31} &
                  w_{41}
                  \end{bmatrix} = W^T $$

### Backpropagation - step 4:

So coding this up simply gives:

In [19]:
dCdB = W.T
array_print(dCdB)

The array:
 [[-0.47  0.94  0.56 -1.2 ]]
The dimensions are 1 row and 4 columns


## Where are we

<div class="visual">
    <img src='img/neural_net_4_c_grad.png'>
</div>

### Backpropagation - step 5:

Next, we want to compute:

$$ \frac{\partial b}{\partial A} $$

### Backpropagation - step 5:

But since:

$$ B = b(A) = \sigma(A) $$

Which is really just shorthand for:

$$ B = \begin{bmatrix} b_1 & b_2 & b_3 & b_4 \end{bmatrix} = \begin{bmatrix} \sigma(a_1) & \sigma(a_2) & \sigma(a_3) & \sigma(a_4) \end{bmatrix} $$

### Backpropagation - step 5:

Since we know that:
    
$$ \sigma'(A) = \sigma(A) * (1 - \sigma(A)) $$ 

Then:
    
$$ \frac{\partial b}{\partial A} = \begin{bmatrix} \sigma(a_1) * (1 - \sigma(a_1)) &
\sigma(a_2) * (1 - \sigma(a_2)) &
\sigma(a_3) * (1 - \sigma(a_3)) &
\sigma(a_4) * (1 - \sigma(a_4)) \end{bmatrix} = \sigma(A) * (1 - \sigma(A))$$

### Backpropagation - step 5:

In [20]:
dBdA = sigmoid(A) * (1-sigmoid(A))
array_print(dBdA)

The array:
 [[ 0.24  0.24  0.23  0.24]]
The dimensions are 1 row and 4 columns


## Where are we

<div class="visual">
    <img src='img/neural_net_4_b_grad.png'>
</div>

### Backpropagation - step 6:

Finally, we want to compute the most involved of our partial derivatives:

$$ \frac{\partial a}{\partial V} $$

### Backpropagation - step 6:

Recalling that:

$$ a(X, V) = \begin{bmatrix} a_1 & a_2 & a_3 & a_4 \end{bmatrix}$$

### Backpropagation - step 6:

But $ a(X, V) $ is itself shorthand for the equations:

$$ x_1 * v_{11} + x_2 * v_{21} + x_3 * v_{31} = a_1 $$
$$ x_1 * v_{12} + x_2 * v_{22} + x_3 * v_{32} = a_2 $$
$$ x_1 * v_{13} + x_2 * v_{23} + x_3 * v_{33} = a_3 $$
$$ x_1 * v_{14} + x_2 * v_{24} + x_3 * v_{34} = a_4 $$

### Backpropagation - step 6:

So $ \frac{\partial a}{\partial V} $ is really shorthand for:

$$ \begin{bmatrix}\frac{\partial a}{\partial v_{11}} & \frac{\partial a}{\partial v_{12}} & \frac{\partial a}{\partial v_{13}} & \frac{\partial a}{\partial v_{14}} \\
\frac{\partial a}{\partial v_{21}} & \frac{\partial a}{\partial v_{22}} & \frac{\partial a}{\partial v_{23}} & \frac{\partial a}{\partial v_{24}} \\
\frac{\partial a}{\partial v_{31}} & \frac{\partial a}{\partial v_{32}} & \frac{\partial a}{\partial v_{33}} & \frac{\partial a}{\partial v_{34}} \\
\end{bmatrix} $$

### Backpropagation - step 6:

But, note that focusing on just $a_1$ for example:

$$ \frac{\partial a_1}{\partial v_{11}} = x_1 $$
$$ \frac{\partial a_1}{\partial v_{21}} = x_2 $$
$$ \frac{\partial a_1}{\partial v_{31}} = x_3 $$

Since again, $ x_1 * v_{11} + x_2 * v_{21} + x_3 * v_{31} = a_1 $.

### Backpropagation - step 6:

whereas for $a_2$ and $a_3$

$$ \frac{\partial a_2}{\partial v_{11}} = 0 $$
$$ \frac{\partial a_3}{\partial v_{11}} = 0 $$

Since, for example:

$$ x_1 * v_{12} + x_2 * v_{22} + x_3 * v_{32} = a_2 $$

### Backpropagation - step 6:

So if we write: 
    
$$ A = \begin{bmatrix}a_1 & a_2 & a_3 & a_4 \end{bmatrix} $$

### Backpropagation - step 6:

Then $\frac{\partial a}{\partial V}$ ends up being:

$$ \frac{\partial a}{\partial V} = \begin{bmatrix}
   \begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix} &
   \begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix} &
   \begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix} &
   \begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix}\end{bmatrix} $$

### Backpropagation - step 6:

Which in terms of the matrix multiplication that results is the same as writing just:

$$ \frac{\partial a}{\partial V} = X^T $$
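
The phrase "in terms of the matrix multiplication that results" deserves a concrete look. Since $v_{ij}$ only affects $a_j$, the chain rule gives $\frac{\partial L}{\partial v_{ij}} = \frac{\partial L}{\partial a_j} * x_i$, and stacking these entries is exactly what multiplying by $X^T$ on the left does. The sketch below checks this with made-up numbers; `dLdA` stands in for whatever gradient arrives from the later steps.

```python
import numpy as np

x = np.array([[1.0, 2.0, 3.0]])              # hypothetical 1 x 3 input row
dLdA = np.array([[0.1, -0.2, 0.3, -0.4]])    # hypothetical 1 x 4 upstream gradient

# Entry by entry: dL/dv_ij = x_i * dL/da_j
by_hand = np.zeros((3, 4))
for i in range(3):
    for j in range(4):
        by_hand[i, j] = x[0, i] * dLdA[0, j]

# The same thing as one matrix product
dLdV = np.dot(x.T, dLdA)

print(dLdV)
print(np.allclose(dLdV, by_hand))
```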

### Backpropagation - step 6:

Which is of course easy to code as:

In [21]:
dAdV = x.T
array_print(dAdV)

The array:
 [[1]
 [0]
 [0]]
The dimensions are 3 rows and 1 column


## Where are we

<div class="visual">
    <img src='img/neural_net_4_a_grad.png'>
</div>

## Computing $\frac{\partial l}{\partial V}$

To compute $\frac{\partial l}{\partial V}$, we simply multiply all of these partial derivatives we've calculated together, being careful to use matrix multiplication where necessary and elementwise multiplication where necessary:

### Computing $\frac{\partial l}{\partial V}$

In [22]:
dLdV = np.dot(dAdV, np.dot(dLdP * dPdC, dCdB) * dBdA)
array_print(dLdV)

The array:
 [[ 0.01 -0.03 -0.02  0.04]
 [ 0.    0.    0.    0.  ]
 [ 0.    0.    0.    0.  ]]
The dimensions are 3 rows and 4 columns


Note that this has the same shape as $V$, which is what we want!
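
Which products are elementwise and which are matrix products is easiest to see by tracking shapes. The sketch below rebuilds the same expression one factor at a time, prints each shape, and spot-checks one entry against a finite difference. The input row, target, seed, and the intermediate names `dLdB` and `dLdA` are choices made for this illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.RandomState(1)          # arbitrary seed
x = np.array([[1.0, 0.0, 1.0]])
y = np.array([[0.0]])
V = rng.randn(3, 4)
W = rng.randn(4, 1)

# Forward pass
A = np.dot(x, V)
B = sigmoid(A)
C = np.dot(B, W)
P = sigmoid(C)

# Backward pass, one factor at a time
dLdC = -(y - P) * (P * (1 - P))         # elementwise: (1, 1)
dLdB = np.dot(dLdC, W.T)                # matrix product: (1, 1) x (1, 4) -> (1, 4)
dLdA = dLdB * (B * (1 - B))             # elementwise: (1, 4)
dLdV = np.dot(x.T, dLdA)                # matrix product: (3, 1) x (1, 4) -> (3, 4)

for name, arr in [("dLdC", dLdC), ("dLdB", dLdB), ("dLdA", dLdA), ("dLdV", dLdV)]:
    print(name, arr.shape)

# Spot check of one entry with a central finite difference
def loss(V_):
    P_ = sigmoid(np.dot(sigmoid(np.dot(x, V_)), W))
    return (0.5 * (y - P_) ** 2).item()

eps = 1e-6
V_plus, V_minus = V.copy(), V.copy()
V_plus[0, 2] += eps
V_minus[0, 2] -= eps
numeric = (loss(V_plus) - loss(V_minus)) / (2 * eps)
print(abs(numeric - dLdV[0, 2]))
```

Rows of the gradient that correspond to a zero input (here the second row, since $x_2 = 0$) come out as all zeros, which matches the pattern in the output above.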

## Updating the weights

Updating the weights can now be done simply:

In [23]:
W -= dLdW
V -= dLdV

# Putting this all together

Now let's write some functions that train the neural net. The following just wraps around what we've already done, running one batch through the neural net:

In [24]:
def learn(V, W, x_batch, y_batch):
    """Run one forward and backward pass and update V and W in place."""
    # forward pass
    A = np.dot(x_batch, V)
    B = sigmoid(A)
    C = np.dot(B, W)
    P = sigmoid(C)

    # loss (computed for reference; the update below only needs its gradient)
    L = 0.5 * (y_batch - P) ** 2

    # backpropagation
    dLdP = -1.0 * (y_batch - P)
    dPdC = sigmoid(C) * (1 - sigmoid(C))
    dLdC = dLdP * dPdC
    dCdW = B.T
    dLdW = np.dot(dCdW, dLdC)
    dCdB = W.T
    dBdA = sigmoid(A) * (1 - sigmoid(A))
    dAdV = x_batch.T
    dLdV = np.dot(dAdV, np.dot(dLdC, dCdB) * dBdA)

    # update the weights
    W -= dLdW
    V -= dLdV

    return V, W
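
The training helper used next, `train_and_display`, lives in `helpers` and is not shown in these slides. As a rough idea of what such a loop could look like, here is a minimal stand-in that feeds the eight observations through a condensed copy of `learn` one at a time. The loop structure, the seed, the loss tracking, and the per-epoch ordering are all assumptions of this sketch, not a description of the actual helper.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def learn(V, W, x_batch, y_batch):
    # Condensed version of the function above: forward pass, gradients, in-place update
    A = np.dot(x_batch, V)
    B = sigmoid(A)
    C = np.dot(B, W)
    P = sigmoid(C)
    dLdC = -(y_batch - P) * P * (1 - P)
    dLdW = np.dot(B.T, dLdC)
    dLdV = np.dot(x_batch.T, np.dot(dLdC, W.T) * B * (1 - B))
    W -= dLdW
    V -= dLdV
    return V, W

# The eight observations from the table earlier in the talk
X = np.array([[1, 0, 0], [0, 1, 0], [1, 1, 0], [0, 0, 0],
              [1, 0, 1], [0, 0, 1], [1, 1, 1], [0, 1, 1]], dtype=float)
Y = np.array([[1], [0], [1], [0], [0], [0], [1], [1]], dtype=float)

rng = np.random.RandomState(0)   # seed chosen arbitrarily
V = rng.randn(3, 4)              # 3 inputs -> 4 hidden units
W = rng.randn(4, 1)              # 4 hidden units -> 1 output

n_epochs = 500
losses = []
for epoch in range(n_epochs):
    for i in range(X.shape[0]):
        # One observation at a time, kept 2-D so the shapes match the derivation
        V, W = learn(V, W, X[i:i + 1], Y[i:i + 1])
    # Average loss over all eight rows after this epoch
    P = sigmoid(np.dot(sigmoid(np.dot(X, V)), W))
    losses.append(np.mean(0.5 * (Y - P) ** 2))

print("first epoch loss:", round(losses[0], 3))
print("last epoch loss: ", round(losses[-1], 3))
```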

## Putting this all together

Now, let's play around with some results:

In [25]:
V, W = train_and_display(X, Y, 500, 4)
accuracy_binary(X, Y, V, W)

    epoch  loss
0       0  0.23
1      50  0.13
2     100  0.05
3     150  0.02
4     200  0.01
5     250  0.01
6     300  0.00
7     350  0.00
8     400  0.00
9     450  0.00
10    500  0.00
The data frame of the predictions this neural net produces is:
    Actual  Predicted
0     1.0       0.94
1     0.0       0.06
2     1.0       0.98
3     0.0       0.02
4     0.0       0.08
5     0.0       0.03
6     1.0       0.98
7     1.0       0.95
The accuracy of this trained neural net is 1.0


1.0

# MNIST Illustration

To illustrate that this simple framework really does have all the power of neural nets, let's show that it can solve the MNIST problem, the classic digit recognition task.

In [26]:
# Imports

from sklearn.datasets import fetch_mldata
from sklearn.model_selection import train_test_split

In [27]:
# "Fetch" the data

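# Note: fetch_mldata was deprecated in scikit-learn 0.20 and removed in 0.22.
# On newer versions, fetch_openml('mnist_784') is the replacement, though the
# object it returns may need adapting before it is passed to get_mnist_X_Y.
mnist = fetch_mldata('MNIST original') 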
X_mnist, Y_mnist = get_mnist_X_Y(mnist)
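
The helper `get_mnist_X_Y` is not shown in these slides. Since a later cell takes `np.argmax` over each row of `Y_test`, the labels are presumably one-hot encoded, and pixel values are commonly rescaled to the range 0 to 1 before training. The sketch below shows what that kind of preprocessing could look like on made-up data; the function `one_hot` and the scaling by 255 are assumptions, not the helper's actual code.

```python
import numpy as np

def one_hot(labels, n_classes=10):
    # Turn integer labels into rows with a single 1 in the label's column
    labels = np.asarray(labels, dtype=int)
    encoded = np.zeros((labels.shape[0], n_classes))
    encoded[np.arange(labels.shape[0]), labels] = 1.0
    return encoded

# Made-up stand-ins for raw pixel rows and digit labels
raw_pixels = np.array([[0, 128, 255], [255, 0, 64]])
raw_labels = np.array([3, 7])

X_scaled = raw_pixels / 255.0      # pixel intensities rescaled to [0, 1]
Y_one_hot = one_hot(raw_labels)    # one row per example, ten columns

print(X_scaled)
print(Y_one_hot)
```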

In [28]:
# Train-test split

train_prop = 0.9
X_train, X_test, Y_train, Y_test = train_test_split(
    X_mnist, Y_mnist, 
    test_size=1-train_prop, 
    random_state=1)

## MNIST Illustration

In [29]:
V, W = train_and_display(X_train, Y_train, 10, 50)
accuracy = accuracy_multiclass(X_test, Y_test, V, W)
print("Neural Net MNIST Classification Accuracy:", round(accuracy, 3) * 100, "percent")

    epoch  loss
0       0  0.14
1       1  0.12
2       2  0.12
3       3  0.10
4       4  0.10
5       5  0.10
6       6  0.09
7       7  0.09
8       8  0.08
9       9  0.08
10     10  0.08
Neural Net MNIST Classification Accuracy: 95.0 percent


## MNIST Illustration

Keep in mind we got this accuracy without any "tricks": no convolutions, no dropout, no learning rate tuning - in fact, no "Deep Learning", since we used only one hidden layer! This shows how far simply having a solid understanding of the mathematics underlying neural nets can get you.

# Transitioning to Deep Learning

<div class="expo">
So: we've seen that neural nets can be expressed as mathematical functions. When expressed this way, we can explicitly calculate the derivative of each nested function in the neural net to ensure that the weights are updated in the correct way and that the neural net will "learn".
</div>

<div class="expo">
But, this involved a lot of steps for a simple neural net with just one hidden layer. If we want to build deeper neural nets, we're going to have to come up with a way of describing neural nets other than as just "nested functions".
</div>

## Another way to understand neural nets

Note that if we were going to code up a deeper net - let's say one with two hidden layers instead of one - we would be repeating some steps:

In the middle of the net on the forwards pass, we would be passing the output of a layer through a matrix multiplication, and then through an activation function, twice.

Similarly, on the backwards pass, we would twice be passing "values" backwards, first based on the derivative of the activation function, and then based on the input to the prior layer.
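
To see the repetition concretely, here is a sketch of a net with two hidden layers written out by hand in the style of Part 1. The layer sizes, the seed, and the names `V1`, `V2`, `A1`, `B1`, and so on are choices made for this illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.RandomState(0)     # arbitrary seed
x = np.array([[1.0, 0.0, 0.0]])
y = np.array([[1.0]])
V1 = rng.randn(3, 4)               # input -> hidden layer 1
V2 = rng.randn(4, 4)               # hidden layer 1 -> hidden layer 2
W = rng.randn(4, 1)                # hidden layer 2 -> output

# Forward pass: "matrix multiply, then activate", three times over
A1 = np.dot(x, V1)
B1 = sigmoid(A1)
A2 = np.dot(B1, V2)
B2 = sigmoid(A2)
C = np.dot(B2, W)
P = sigmoid(C)

# Backward pass: "scale by the activation's derivative, then pass through
# the weights", repeated once per layer
dLdC = -(y - P) * P * (1 - P)
dLdW = np.dot(B2.T, dLdC)
dLdA2 = np.dot(dLdC, W.T) * B2 * (1 - B2)
dLdV2 = np.dot(B1.T, dLdA2)
dLdA1 = np.dot(dLdA2, V2.T) * B1 * (1 - B1)
dLdV1 = np.dot(x.T, dLdA1)

print(dLdW.shape, dLdV2.shape, dLdV1.shape)
```

Every extra hidden layer adds one more pair of near-identical lines to each pass, which is the pattern a layer abstraction can capture.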

## Neuron illustration

Here's an illustration of what is going on at each neuron in the net.

<div class="visual">
    <img src='img/neuron_illustration_backprop.png'>
</div>

In addition, this same operation happens at every neuron in every "layer" of the net.

## Towards a new way of thinking of neural nets

We might be able to think of neural nets as a series of "layers", each of which sends values forward to the layer in front of it and sends values backwards during the backwards pass using the process described on the prior slide.

Working backwards from the API we want, we could define a neural net as follows:

```
layer1 = Layer(args)
layer2 = Layer(args)
net = Net([layer1, layer2])
```

# New net framework

Here's how we could define new neural nets:

Neural nets are a series of layers that pass information forwards based on what they receive as <em>input</em> and pass information backwards based on what they receive as the <em>loss</em>.

Let's code this up. For getting me started on this code, I'm indebted to this guy:

<div class="visual">
    <img src="img/andersbll.png">
</div>

[Anders' GitHub](https://github.com/andersbll)

In [163]:
class NeuralNetwork(object):
    def __init__(self, layers, loss_function):
        self.layers = layers
        self.loss_function = loss_function

    def forwardpass(self, X):
        """ Calculate an output Y for the given input X. """
        X_next = X
        for layer in self.layers:
            X_next = layer.fprop(X_next)
        prediction = X_next
        return prediction
    
    def loss(self, prediction, Y):
        """ Calculate the gradient of the loss with respect to the
        prediction, ready to be sent backwards through the net. """
        return self.loss_function(prediction, Y, bprop=True)

    def backpropogate(self, loss):
        """ Backpropagate the loss gradient through the net. """
        loss_next = loss
        for layer in reversed(self.layers):
            loss_next = layer.bprop(loss_next)
        return loss_next

Next, let's define a base `Layer` class, followed by a fully connected layer that implements it.

In [164]:
class Layer(object):
    def _setup(self, input_shape):
        """ Setup layer with parameters that are unknown at __init__(). """
        pass

    def fprop(self, input):
        """ Calculate layer output for given input (forward propagation). """
        raise NotImplementedError()

    def bprop(self, output_grad):
        """ Calculate input gradient. """
        raise NotImplementedError()

In [166]:
class FullyConnected(Layer):
    
    random_seed = None
    
    def __init__(self, n_neurons, activation_function):
        self.n_neurons = n_neurons
        self.activation_function = activation_function        
        self.iteration = 0
        self.weights_initialized = False

    def fprop(self, layer_input):
        self.layer_input = layer_input
        if not self.weights_initialized:
            # Seed and draw the weights once, on the first forward pass,
            # when the width of the input is finally known.
            np.random.seed(seed=self.random_seed)
            self.W = np.random.normal(size=(self.layer_input.shape[1], self.n_neurons))
            self.weights_initialized = True
        self.activation_input = np.dot(layer_input, self.W)
        return self.activation_function(self.activation_input, bprop=False)

    def bprop(self, layer_gradient):
        # Derivative of the activation function at this layer's pre-activation values
        dOutdActivationInput = self.activation_function(self.activation_input, bprop=True)
        # Gradient of the loss with respect to the pre-activation values
        dLossdActivationInput = layer_gradient * dOutdActivationInput
        # Gradient to hand to the previous layer (uses W before it is updated)
        output_grad = np.dot(dLossdActivationInput, self.W.T)
        # Gradient of the loss with respect to this layer's weights
        weight_update = np.dot(self.layer_input.T, dLossdActivationInput)
        self.W = self.W - weight_update
        self.iteration += 1
        return output_grad
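
It is worth convincing ourselves that this layer-by-layer `bprop` computes the same gradients we derived by hand in Part 1. The sketch below uses `TinyDense`, a hypothetical stripped-down stand-in for `FullyConnected` that takes its weights as an argument and stores its weight gradient so the two approaches can be compared directly. The seed and inputs are arbitrary.

```python
import numpy as np

def sigmoid(x, bprop=False):
    s = 1.0 / (1.0 + np.exp(-x))
    return s * (1 - s) if bprop else s

class TinyDense:
    """Hypothetical minimal layer: same fprop/bprop logic, weights passed in."""
    def __init__(self, W):
        self.W = W.copy()

    def fprop(self, layer_input):
        self.layer_input = layer_input
        self.activation_input = np.dot(layer_input, self.W)
        return sigmoid(self.activation_input)

    def bprop(self, layer_gradient):
        grad_act_in = layer_gradient * sigmoid(self.activation_input, bprop=True)
        output_grad = np.dot(grad_act_in, self.W.T)      # uses W before the update
        self.weight_grad = np.dot(self.layer_input.T, grad_act_in)
        self.W = self.W - self.weight_grad
        return output_grad

rng = np.random.RandomState(0)
x = np.array([[1.0, 0.0, 0.0]])
y = np.array([[1.0]])
V = rng.randn(3, 4)
W = rng.randn(4, 1)

# Part 1: gradients from the hand-derived formulas
A = np.dot(x, V)
B = sigmoid(A)
C = np.dot(B, W)
P = sigmoid(C)
dLdC = -(y - P) * sigmoid(C, bprop=True)
dLdW = np.dot(B.T, dLdC)
dLdV = np.dot(x.T, np.dot(dLdC, W.T) * sigmoid(A, bprop=True))

# Part 2: the same net as two stacked layers
hidden, output = TinyDense(V), TinyDense(W)
pred = output.fprop(hidden.fprop(x))
grad = -(y - pred)            # derivative of the loss with respect to the prediction
grad = output.bprop(grad)
hidden.bprop(grad)

print(np.allclose(output.weight_grad, dLdW))
print(np.allclose(hidden.weight_grad, dLdV))
```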

## Activation functions

We'll need to redefine our functions to have a `bprop` option:

In [167]:
def sigmoid(x, bprop=False):
    if bprop:
        return sigmoid(x) * (1-sigmoid(x))
    else:
        return 1.0/(1.0+np.exp(-x))

In [168]:
def mean_square_error(prediction, Y, bprop=False):
    if bprop:
        return -1.0 * (Y - prediction)
    else:
        return 0.5 * (Y - prediction) ** 2

## Neural Network definition

In [169]:
layer1 = FullyConnected(n_neurons=50, activation_function=sigmoid)
layer2 = FullyConnected(n_neurons=10, activation_function=sigmoid)

In [170]:
nn_mnist = NeuralNetwork(
    layers=[layer1, layer2],
    loss_function=mean_square_error)

In [171]:
# Randomly shuffle the indices of the points in the training set:
np.random.seed(4)
train_size = X_train.shape[0]
indices = list(range(train_size))
np.random.shuffle(indices)

In [172]:
def neural_net_pass(net, x, y):
    pred = net.forwardpass(x)
    loss = net.loss(pred, y)
    net.backpropogate(loss)
    return pred

In [173]:
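# Loop through the first 5,000 shuffled training examples, one at a time.
# The print fires whenever an example's original index is a multiple of 1000.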
for index in indices[0:5000]:
    if index % 1000 == 0:
        print(index)
    x = np.array(X_train[index], ndmin=2)
    y = np.array(Y_train[index], ndmin=2)
    neural_net_pass(nn_mnist, x, y)

14000
37000
40000
17000
6000
43000


In [174]:
P = nn_mnist.forwardpass(X_test)
preds = [np.argmax(x) for x in P]
actuals = [np.argmax(x) for x in Y_test]

accuracy = sum(np.array(preds) == np.array(actuals)) * 1.0 / len(preds)
print("Neural Net MNIST Classification Accuracy:", round(accuracy, 3) * 100, "percent")

Neural Net MNIST Classification Accuracy: 71.0 percent
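
The accuracy calculation above can also be written without list comprehensions by letting NumPy take the argmax along each row. The sketch below shows this on made-up scores and one-hot targets for a three-class problem; the arrays are purely illustrative.

```python
import numpy as np

# Made-up network outputs (one row per example) and one-hot targets
P = np.array([[0.1, 0.7, 0.2],
              [0.8, 0.1, 0.1],
              [0.3, 0.3, 0.4],
              [0.2, 0.5, 0.3]])
Y = np.array([[0, 1, 0],
              [1, 0, 0],
              [0, 1, 0],
              [0, 1, 0]])

preds = np.argmax(P, axis=1)       # predicted class for each row
actuals = np.argmax(Y, axis=1)     # true class for each row
accuracy = np.mean(preds == actuals)

print(preds, actuals, accuracy)
```

Three of the four made-up predictions match their targets, so the accuracy here is 0.75.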
