# CSS

In [1]:
from IPython.display import HTML
style = """
<style>
.expo {
  line-height: 150%;
}

.visual {
  width: 400px;
}

</style>
"""
HTML(style)

In [2]:
train_all = False

# Deep Learning from Scratch using Python 

Seth Weidman

11/30/2017

https://github.com/SethHWeidman/ODSC_Neural_Nets_11-04-17/blob/master/ODSC_Deep_Learning_2017.ipynb

# To begin...

In [3]:
# import tensorflow as tf

Nah...

# Outline

This talk will have two parts:

## Part 1: Neural Nets from Scratch

* We'll implement a basic neural net with one hidden layer, from scratch, and use mathematical principles to get the backpropagation right.

## Part 2: Transitioning to Deep Learning

* We'll transition to Deep Learning by changing our mental model of neural nets to be that they contain "layers" which pass information backwards and forwards between them. 

* We'll show how this framework can be used to construct arbitrarily deep neural networks, and show how these can learn MNIST.

# Part 1: Neural Nets from Scratch

<div class="expo">
We've all seen diagrams like the following in the context of neural nets:
</div>

<img src="img/neural_net_basic.png" class="visual">

## Part 1: Neural Nets from Scratch

<div class="visual">
<img src='img/neural_net_v3.png'>
</div>

<div class="expo">
Many don't fully understand what is going on in this diagram. This talk will attempt to rectify that.
</div>

## Why neural nets?

<div class="expo">
Let's suppose we need a function that can learn a complicated relationship between inputs and outputs:
</div>

In [4]:
import os
import sys
sys.path.append(os.getcwd())
from helpers import *

In [5]:
df, X, Y = generate_x_y()

### Why Neural Nets?

In [6]:
df

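| Unnamed: 0 | X1 | X2 | X3 | y |
|---|---|---|---|---|
| 0 | 1 | 0 | 0 | 1 |
| 1 | 0 | 1 | 0 | 0 |
| 2 | 1 | 1 | 0 | 1 |
| 3 | 0 | 0 | 0 | 0 |
| 4 | 1 | 0 | 1 | 0 |
| 5 | 0 | 0 | 1 | 0 |
| 6 | 1 | 1 | 1 | 1 |
| 7 | 0 | 1 | 1 | 1 |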


<div class="expo">
So: we have eight observations, and a complex relationship between inputs and outputs. Now, let's build a neural net that can "learn" this relationship.
</div>

### Why Neural Nets?

<div class="expo">
How to even begin? Well, let's look at this from as high a level as possible and then progressively dive deeper. First, we want a function $N$ that--based on the data from before--maps inputs to outputs properly, that is:
</div>

$$ N(1, 0, 0) = 1 $$
$$ N(0, 1, 0) = 0 $$
$$ N(1, 1, 0) = 1 $$

etc.

### Why neural nets

<div class="expo">
First, let's observe that logistic regression can't do this. That is, there are no parameters $b$, $w_1$, $w_2$, and $w_3$ for which the following function classifies all eight rows correctly:
</div>

$$N(x_1, x_2, x_3) = \frac{1}{1 + e^{-(b + w_1 * x_1 + w_2 * x_2 + w_3 * x_3)}}$$

### Why neural nets

<div class="expo">
This is true for the same reason that logistic regression cannot learn XOR. Our problem is a three dimensional problem since we have three features, but you can easily see in two dimensions that the space is not linearly separable:
</div>

<div class="visual">
  <img src="img/xor.png">
</div>
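A quick way to check this claim numerically is to fit a logistic regression to the eight rows and confirm that it cannot classify all of them correctly. The sketch below is an illustration only: the plain-NumPy gradient descent loop, the function name `fit_logistic`, the step size `lr`, and the number of steps are all assumptions made for brevity, not part of the talk's code.

```python
import numpy as np

# The eight observations from the data frame above.
X = np.array([[1, 0, 0], [0, 1, 0], [1, 1, 0], [0, 0, 0],
              [1, 0, 1], [0, 0, 1], [1, 1, 1], [0, 1, 1]], dtype=float)
y = np.array([1, 0, 1, 0, 0, 0, 1, 1], dtype=float)

def fit_logistic(X, y, lr=0.5, steps=5000):
    """Fit weights w and bias b by gradient descent on the log loss."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        w -= lr * X.T @ (p - y) / len(y)
        b -= lr * np.mean(p - y)
    return w, b

w, b = fit_logistic(X, y)

# Predict 1 when the linear score is positive, 0 otherwise.
predictions = (X @ w + b > 0).astype(float)
logistic_accuracy = np.mean(predictions == y)
print("Logistic regression accuracy:", logistic_accuracy)
```

Because no plane separates the rows with `y = 1` from the rows with `y = 0`, the accuracy stays below 1.0 no matter how long the loop runs.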

### Why neural nets

<div class="expo">
Of course, we *could* manually do feature engineering...
</div>

... but who likes feature engineering???

<div class="expo">
Let's have the computer do the feature engineering for us, via a neural net learning hidden features!
</div>

# Let's make a prediction using a neural net

Our goal will be to:

* Start with our three original features.
* Transform them into four "intermediate features" using logistic-regression-like transformation.
* Take these four "intermediate features" and use _them_ to predict our final output.

## Step 1: feeding features to the "intermediate" or "hidden" layer

How do we transform our original three features into an intermediate, or hidden layer? Let's call our intermediate features $a_1$, $a_2$, $a_3$, and $a_4$. 

We want $a_1$, for example, to be a linear combination of $x_1$, $x_2$, and $x_3$ - that is, we want some weights $v_{11}$, $v_{21}$, and $v_{31}$ so that:

$$ a_1 = x_1 * v_{11} + x_2 * v_{21} + x_3 * v_{31} $$

and similarly for $a_2$, $a_3$, and $a_4$.

### Step 1: feeding features to the "intermediate" or "hidden" layer

A way to concisely express this is to define your features as a vector

$$ X = \begin{bmatrix}x_1 & x_2 & x_3\end{bmatrix} $$

This should be intuitive, since $X$ is already a row in your data! 

### Step 1: feeding features to the "intermediate" or "hidden" layer

Then, multiplying this vector by a matrix $V$:

$$ V = \begin{bmatrix}v_{11} & v_{12} & v_{13} & v_{14} \\
                      v_{21} & v_{22} & v_{23} & v_{24} \\
                      v_{31} & v_{32} & v_{33} & v_{34}
                      \end{bmatrix} $$

gives us what we want, since "$A = X * V$" is equivalent to:

### Step 1: feeding features to the "intermediate" or "hidden" layer

$$ a_1 = x_1 * v_{11} + x_2 * v_{21} + x_3 * v_{31} $$
$$ a_2 = x_1 * v_{12} + x_2 * v_{22} + x_3 * v_{32} $$
$$ a_3 = x_1 * v_{13} + x_2 * v_{23} + x_3 * v_{33} $$
$$ a_4 = x_1 * v_{14} + x_2 * v_{24} + x_3 * v_{34} $$

which is what we want in order to get four intermediate features.
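To make the equivalence concrete, here is a small sketch that computes the four intermediate features twice: once as a single matrix multiplication and once by writing out the sums above. The specific values of `x`, the seeded random generator, and the name `A_by_hand` are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed so the sketch is repeatable
x = np.array([[1.0, 0.0, 1.0]])  # one observation: 1 row, 3 features
V = rng.normal(size=(3, 4))      # 3 inputs -> 4 intermediate features

# Matrix form: A = X * V
A = x @ V

# Element form: a_j = x_1 * v_1j + x_2 * v_2j + x_3 * v_3j
A_by_hand = np.array([[sum(x[0, i] * V[i, j] for i in range(3))
                       for j in range(4)]])

print(np.allclose(A, A_by_hand))
```

Both routes give the same 1 x 4 result, so the matrix form is just a compact way of writing the four equations.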

### Step 1: feeding features to the "intermediate" or "hidden" layer

Let's code this up.

In [7]:
x = np.array(X[0], ndmin=2)
array_print(x)

The array:
 [[1 0 0]]
The dimensions are 1 row and 3 columns


### Step 1: feeding features to the "intermediate" or "hidden" layer

In [8]:
V = np.random.randn(3, 4)
array_print(V)

The array:
 [[ 0.51  0.59 -1.69 -0.82]
 [ 0.65 -1.18  2.04  0.29]
 [-1.31 -0.18  0.07 -0.2 ]]
The dimensions are 3 rows and 4 columns


In [9]:
A = np.dot(x, V)
array_print(A)

The array:
 [[ 0.51  0.59 -1.69 -0.82]]
The dimensions are 1 row and 4 columns


## Where are we?

<div class="visual">
<img src='img/neural_net_4_first_layer.png'>
</div>

## Step 2: feeding these intermediate features through an "activation function"

We're going to use a classic, easy-to-understand activation function, though one that is not often used in cutting-edge applications: the sigmoid function:

$$\sigma(x) = \frac{1}{1 + e^{-x}}$$

### Step 2: feeding these intermediate features through an "activation function"

$$ B = \sigma(A) $$ or

$$ b_1 = \sigma(a_1) $$
$$ b_2 = \sigma(a_2) $$
$$ b_3 = \sigma(a_3) $$
$$ b_4 = \sigma(a_4) $$

### Step 2: feeding these intermediate features through an "activation function"

In [10]:
def sigmoid(x):
    return 1.0/(1.0+np.exp(-x))

In [11]:
B = sigmoid(A)
array_print(B)

The array:
 [[ 0.62  0.64  0.16  0.31]]
The dimensions are 1 row and 4 columns


## Where are we?

<div class="visual">
<img src='img/neural_net_4_first_sigmoid.png'>
</div>

### Step 3: use these intermediate features as a linear combination to the output

We'll multiply these "sigmoided" results by another matrix $W$ to get a single output. Since we want to transform 4 features down into 1, we can use a 4 x 1 matrix:

$$ W = \begin{bmatrix}w_{11} \\
                      w_{21} \\
                      w_{31} \\
                      w_{41}
                      \end{bmatrix} $$

### Step 3: use these intermediate features as a linear combination to the output

And since we want the result to be:

$$ c_1 = w_{11} * b_1 + w_{21} * b_2 + w_{31} * b_3 + w_{41} * b_4 $$

### Step 3: use these intermediate features as a linear combination to the output

This is equivalent to writing:

$$ C = B * W $$

or:

$$ \begin{bmatrix}
c_1 \end{bmatrix} = 
\begin{bmatrix}b_1 &
                  b_2 &
                  b_3 &
                  b_4
                  \end{bmatrix} * 
\begin{bmatrix}w_{11} \\
               w_{21} \\
               w_{31} \\
               w_{41}
               \end{bmatrix} $$

### Step 3: use these intermediate features as a linear combination to the output

So we can simply code this up as:

In [12]:
W = np.random.randn(4, 1)
array_print(W)

The array:
 [[ 1.3 ]
 [ 0.32]
 [-0.79]
 [-0.15]]
The dimensions are 4 rows and 1 column


In [13]:
C = np.dot(B, W)
array_print(C)

The array:
 [[ 0.85]]
The dimensions are 1 row and 1 column


## Where are we

<div class="visual">
<img src='img/neural_net_4_second_layer.png'>
</div>

### Step 4: sigmoid this to make a final prediction

Mathematically, we want:

$$ p_1 = \sigma(c_1) $$

So we can simply code this up as:

In [14]:
P = sigmoid(C)
array_print(P)

The array:
 [[ 0.7]]
The dimensions are 1 row and 1 column


## Where are we

<div class="visual">
<img src='img/neural_net_4_final_prediction.png'>
</div>

### Step 5: compute the loss

Mathematically, we'll compute mean squared error loss:

$$ L = \frac{1}{2}(y - P)^2 $$

And coding this up is simply:

In [15]:
y = np.array(Y[0], ndmin=2)
L = 0.5 * (y - P) ** 2
array_print(L)

The array:
 [[ 0.04]]
The dimensions are 1 row and 1 column


## Where are we

<div class="visual">
<img src='img/neural_net_4_loss.png'>
</div>

## Now what?

We have made our prediction and computed our loss, $L$. Now what?

Recall: each "step" is just a function applied to some input that results in some output.

### Now what?

If we write out what we just did in terms of mathematical functions, we could write it as:

\begin{align}
A &= a(x, V) \\
B &= b(A) \\
C &= c(B, W) \\
P &= p(C) \\
L &= l(P)
\end{align}

So, say we have a neural net with just one hidden layer. We could write the loss of a neural net on a given observation $ x $ as:

$$ L = l(p(c(b(a(x, V)), W))) $$
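The nested form can be written almost literally in code. The following is a sketch under a few assumptions: the function names mirror the equations, `l` takes the target `y` as an explicit second argument (the equations leave it implicit), and the inputs and weights are made-up values.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One small function per step, named after the equations above.
def a(x, V):
    return x @ V

def b(A):
    return sigmoid(A)

def c(B, W):
    return B @ W

def p(C):
    return sigmoid(C)

def l(P, y):
    return 0.5 * (y - P) ** 2

rng = np.random.default_rng(0)
x = np.array([[1.0, 0.0, 0.0]])
y = np.array([[1.0]])
V = rng.normal(size=(3, 4))
W = rng.normal(size=(4, 1))

# Step by step, as in the cells above.
A = a(x, V)
B = b(A)
C = c(B, W)
P = p(C)
L_steps = l(P, y)

# The same computation as one nested expression.
L_nested = l(p(c(b(a(x, V)), W)), y)

print(np.allclose(L_steps, L_nested))
```

Seeing the loss as one nested function is what lets us apply the chain rule to it.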

### Now what?

Mathematically, we _want_ to change the weights in such a way that the loss will be reduced during the next iteration. The equations:

$$ W = W - \frac{\partial l}{\partial W}$$

$$ V = V - \frac{\partial l}{\partial V}$$

do this.

### Now what?

Notice that this "makes sense":

* If $\frac{\partial l}{\partial W}$ is a positive number, then we want to _decrease_ the weight, since increasing the weight would _increase_ our loss. That is exactly what the equation $ W = W - \frac{\partial l}{\partial W}$ does.
* Similarly, if $\frac{\partial l}{\partial W}$ is a negative number, then we want to _increase_ the weight, since increasing the weight would _decrease_ our loss. In both cases, the equation $ W = W - \frac{\partial l}{\partial W}$ works.
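A one-weight example shows both cases. The loss below is hypothetical, chosen only because its minimum (at `w = 3`) and its derivative are easy to read off; the update line is the same rule as in the equations above.

```python
def loss(w):
    # Hypothetical one-weight loss with its minimum at w = 3.
    return 0.25 * (w - 3.0) ** 2

def grad(w):
    # Derivative of the loss above with respect to w.
    return 0.5 * (w - 3.0)

# Case 1: the derivative is positive, so the update decreases the weight.
w_high = 5.0
w_high_new = w_high - grad(w_high)   # 5.0 - 1.0 = 4.0

# Case 2: the derivative is negative, so the update increases the weight.
w_low = 0.0
w_low_new = w_low - grad(w_low)      # 0.0 - (-1.5) = 1.5

print(loss(w_high), "->", loss(w_high_new))
print(loss(w_low), "->", loss(w_low_new))
```

In both cases the weight moves towards 3 and the loss goes down.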

## Backpropagation - setup

Now we want to make our neural net smarter by updating its weights. We've seen that to do that, we need to compute $\frac{\partial L}{\partial W}$ and $\frac{\partial L}{\partial V}$. How do we do this?

Well, we know that 

$$ L = l(p(c(b(a(x, V)), W))) $$

### Backpropagation - setup

Our good friend the chain rule tells us that: 

$$ \frac{\partial L}{\partial W} = \frac{\partial l}{\partial P} * \frac{\partial p}{\partial C} * \frac{\partial c}{\partial W}  $$

and 

$$ \frac{\partial L}{\partial V} = \frac{\partial l}{\partial P} * \frac{\partial p}{\partial C} * \frac{\partial c}{\partial B} * \frac{\partial b}{\partial A} * \frac{\partial a}{\partial V}  $$

Each one of these partial derivatives turns out to be simple!

## Backpropagation - step 1:

First, let's compute:

$$ \frac{\partial l}{\partial P} $$

### Backpropagation - step 1:

Since 

$$ L = l(P) = \frac{1}{2}(y - P)^2 $$

Then:

$$ \frac{\partial l}{\partial P} = -(y - P)$$

### Backpropagation - step 1:

And coding this up is simply:

In [16]:
dLdP = -(y - P)
array_print(dLdP)

The array:
 [[-0.3]]
The dimensions are 1 row and 1 column


## Where are we

<div class="visual">
    <img src='img/neural_net_4_loss_grad.png'>
</div>

## Backpropagation - step 2:

Next, let's compute:

$$ \frac{\partial p}{\partial C} $$

Recall that:

$$ P = \begin{bmatrix} p_1 \end{bmatrix} = p(c) = \sigma(c) $$

### A digression on the sigmoid

If

$$\sigma(x) = \frac{1}{1 + e^{-x}}$$

Then 

$$\sigma'(x) = \sigma(x) * (1 - \sigma(x))$$
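If you'd rather not take the derivative on faith, it can be checked numerically. This sketch compares the formula against a central finite difference; the helper name `sigmoid_derivative`, the grid of test points, and the step `eps` are assumptions for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_derivative(x):
    # The closed form from the slide: sigma(x) * (1 - sigma(x)).
    return sigmoid(x) * (1.0 - sigmoid(x))

x = np.linspace(-5.0, 5.0, 11)
eps = 1e-6

# Central difference approximation of the derivative at each point.
numerical = (sigmoid(x + eps) - sigmoid(x - eps)) / (2 * eps)
max_error = np.max(np.abs(numerical - sigmoid_derivative(x)))

print("Largest difference:", max_error)
```

The derivative peaks at `x = 0`, where it equals 0.25, and shrinks towards zero as `x` moves away from the origin in either direction.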

### Backpropagation - step 2:

So if

$$ p(c) = \sigma(c) $$

then

$$ p'(c) = \sigma(c) * (1 - \sigma(c)) $$

### Backpropagation - step 2:

So, coding this up is simply:

In [17]:
dPdC = sigmoid(C) * (1-sigmoid(C))
array_print(dPdC)

The array:
 [[ 0.21]]
The dimensions are 1 row and 1 column


## Where are we

<div class="visual">
    <img src='img/neural_net_4_prediction_grad.png'>
</div>

### Backpropagation - step 3:

Next we want to compute:

$$ \frac{\partial c}{\partial W} $$

### Backpropagation - step 3:

Recall that:


$$
\begin{align}
C &= \begin{bmatrix} c_1 \end{bmatrix} \\ 
&= c(W) \\
&= w_{11} * b_1 + w_{21} * b_2 + w_{31} * b_3 + w_{41} * b_4
\end{align}
$$

### Backpropagation - step 3:

Now recall that by $ \frac{\partial c}{\partial W} $ we mean:

$$ \begin{bmatrix}\frac{\partial c}{\partial w_{11}} \\
                  \frac{\partial c}{\partial w_{21}} \\
                  \frac{\partial c}{\partial w_{31}} \\
                  \frac{\partial c}{\partial w_{41}}
                  \end{bmatrix} $$

### Backpropagation - step 3:

But since 

$$ c(W) = w_{11} * b_1 + w_{21} * b_2 + w_{31} * b_3 + w_{41} * b_4 $$

$ \frac{\partial c}{\partial w_{11}} $, for example, is just $b_1$, $ \frac{\partial c}{\partial w_{21}} $ is just $b_2$, etc.

### Backpropagation - step 3:

Thus,

$$ \frac{\partial c}{\partial W} =
\begin{bmatrix}\frac{\partial c}{\partial w_{11}} \\
                  \frac{\partial c}{\partial w_{21}} \\
                  \frac{\partial c}{\partial w_{31}} \\
                  \frac{\partial c}{\partial w_{41}}
                  \end{bmatrix} = \begin{bmatrix}b_1 \\
                  b_2 \\
                  b_3 \\
                  b_4
                  \end{bmatrix} $$

Which is just $ B^T$.

### Backpropagation - step 3:

So, coding this up is simply:

In [18]:
dCdW = B.T
array_print(dCdW)

The array:
 [[ 0.62]
 [ 0.64]
 [ 0.16]
 [ 0.31]]
The dimensions are 4 rows and 1 column


Note that this has the same dimensions as `W`, which is what we want.

## Computing $\frac{\partial L}{\partial W}$

Computing $\frac{\partial L}{\partial W}$ is now just a matter of multiplying these three pieces together, as the chain rule prescribes. The result tells us how to update the weights in the right direction.

In [19]:
dLdW = np.dot(dCdW, dLdP * dPdC)
array_print(dLdW, 3)

The array:
 [[-0.039]
 [-0.04 ]
 [-0.01 ]
 [-0.019]]
The dimensions are 4 rows and 1 column


## Backpropagation - step 4:

By the same logic that we applied in Step 3, since:

$$ c(W) = w_{11} * b_1 + w_{21} * b_2 + w_{31} * b_3 + w_{41} * b_4 $$

we have:

$$ \frac{\partial c}{\partial W} =
\begin{bmatrix}\frac{\partial c}{\partial w_{11}} \\
                  \frac{\partial c}{\partial w_{21}} \\
                  \frac{\partial c}{\partial w_{31}} \\
                  \frac{\partial c}{\partial w_{41}}
                  \end{bmatrix} = \begin{bmatrix}b_1 \\
                  b_2 \\
                  b_3 \\
                  b_4
                  \end{bmatrix} = B^T $$

### Backpropagation - step 4:

And again, since:

$$ c(W) = w_{11} * b_1 + w_{21} * b_2 + w_{31} * b_3 + w_{41} * b_4 $$

We have:
    
$$ \frac{\partial c}{\partial B} =
\begin{bmatrix}\frac{\partial c}{\partial b_1} &
                  \frac{\partial c}{\partial b_2} &
                  \frac{\partial c}{\partial b_3} &
                  \frac{\partial c}{\partial b_4}
                  \end{bmatrix} = \begin{bmatrix}w_{11} &
                  w_{21} &
                  w_{31} &
                  w_{41}
                  \end{bmatrix} = W^T $$

### Backpropagation - step 4:

So coding this up simply gives:

In [20]:
dCdB = W.T
array_print(dCdB)

The array:
 [[ 1.3   0.32 -0.79 -0.15]]
The dimensions are 1 row and 4 columns


## Where are we

<div class="visual">
    <img src='img/neural_net_4_c_grad.png'>
</div>

### Backpropagation - step 5:

Next, we want to compute:

$$ \frac{\partial b}{\partial A} $$

### Backpropagation - step 5:

But since:

$$ B = b(A) = \sigma(A) $$

Which is really just shorthand for:

$$ B = \begin{bmatrix} b_1 \\ b_2 \\ b_3 \\ b_4 \end{bmatrix} = \begin{bmatrix} \sigma(a_1) \\ \sigma(a_2) \\ \sigma(a_3) \\ \sigma(a_4) \end{bmatrix} $$

### Backpropagation - step 5:

Since we know that:
    
$$ \sigma'(A) = \sigma(A) * (1 - \sigma(A)) $$ 

Then:
    
$$ \frac{\partial b}{\partial A} = \begin{bmatrix} \sigma(a_1) * (1 - \sigma(a_1)) \\ 
\sigma(a_2) * (1 - \sigma(a_2)) \\ 
\sigma(a_3) * (1 - \sigma(a_3)) \\
\sigma(a_4) * (1 - \sigma(a_4)) \end{bmatrix} = \sigma(A) * (1 - \sigma(A))$$

### Backpropagation - step 5:

In [21]:
dBdA = sigmoid(A) * (1-sigmoid(A))
array_print(dBdA)

The array:
 [[ 0.23  0.23  0.13  0.21]]
The dimensions are 1 row and 4 columns


## Where are we

<div class="visual">
    <img src='img/neural_net_4_b_grad.png'>
</div>

### Backpropagation - step 6:

Finally, we want to compute the most involved of our partial derivatives:

$$ \frac{\partial a}{\partial V} $$

### Backpropagation - step 6:

Recalling that:

$$ a(X, V) = \begin{bmatrix} a_1 \\ a_2 \\ a_3 \\ a_4 \end{bmatrix}$$

### Backpropagation - step 6:

But $ a(X, V) $ is itself shorthand for the equations:

$$ x_1 * v_{11} + x_2 * v_{21} + x_3 * v_{31} = a_1 $$
$$ x_1 * v_{12} + x_2 * v_{22} + x_3 * v_{32} = a_2 $$
$$ x_1 * v_{13} + x_2 * v_{23} + x_3 * v_{33} = a_3 $$
$$ x_1 * v_{14} + x_2 * v_{24} + x_3 * v_{34} = a_4 $$

### Backpropagation - step 6:

So $ \frac{\partial a}{\partial V} $ is really shorthand for:

$$ \begin{bmatrix}\frac{\partial a}{\partial v_{11}} & \frac{\partial a}{\partial v_{12}} & \frac{\partial a}{\partial v_{13}} & \frac{\partial a}{\partial v_{14}} \\
\frac{\partial a}{\partial v_{21}} & \frac{\partial a}{\partial v_{22}} & \frac{\partial a}{\partial v_{23}} & \frac{\partial a}{\partial v_{24}} \\
\frac{\partial a}{\partial v_{31}} & \frac{\partial a}{\partial v_{32}} & \frac{\partial a}{\partial v_{33}} & \frac{\partial a}{\partial v_{34}} \\
\end{bmatrix} $$

### Backpropagation - step 6:

But, note that focusing on just $a_1$ for example:

$$ \frac{\partial a_1}{\partial v_{11}} = x_1 $$
$$ \frac{\partial a_1}{\partial v_{21}} = x_2 $$
$$ \frac{\partial a_1}{\partial v_{31}} = x_3 $$

Since again, $ x_1 * v_{11} + x_2 * v_{21} + x_3 * v_{31} = a_1 $.

### Backpropagation - step 6:

whereas for $a_2$ and $a_3$

$$ \frac{\partial a_2}{\partial v_{11}} = 0 $$
$$ \frac{\partial a_3}{\partial v_{11}} = 0 $$

Since, for example:

$$ x_1 * v_{12} + x_2 * v_{22} + x_3 * v_{32} = a_2 $$

### Backpropagation - step 6:

So if we write: 
    
$$ A = \begin{bmatrix}a_1 \\ a_2 \\ a_3 \\ a_4 \end{bmatrix} $$

### Backpropagation - step 6:

Then $\frac{\partial a}{\partial V}$ ends up being:

$$ \frac{\partial a}{\partial V} = \begin{bmatrix}
   \begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix} &
   \begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix} &
   \begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix} &
   \begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix}\end{bmatrix} $$

### Backpropagation - step 6:

Which in terms of the matrix multiplication that results is the same as writing just:

$$ \frac{\partial a}{\partial V} = X^T $$
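The jump from four stacked copies of $x$ to $X^T$ is easiest to see through the multiplication it takes part in. If a 1 x 4 gradient arrives at $A$ from the later steps, the gradient for $v_{ij}$ is $x_i$ times the $j$-th entry of that gradient, which is exactly what multiplying by $X^T$ on the left produces. The sketch below checks this with finite differences; `G` (a stand-in for the incoming gradient), `scalar_output`, and the numbers used are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.array([[1.0, 2.0, 3.0]])   # hypothetical 1 x 3 observation
V = rng.normal(size=(3, 4))
G = rng.normal(size=(1, 4))       # hypothetical gradient flowing into A

def scalar_output(V):
    # A stand-in for "everything after A": a weighted sum of a_1..a_4.
    return np.sum(G * (x @ V))

# Claimed gradient with respect to V: X^T times the incoming gradient.
grad_V = x.T @ G

# Finite-difference check, one entry of V at a time.
eps = 1e-6
numerical = np.zeros_like(V)
for i in range(3):
    for j in range(4):
        bump = np.zeros_like(V)
        bump[i, j] = eps
        numerical[i, j] = (scalar_output(V + bump)
                           - scalar_output(V - bump)) / (2 * eps)

print(np.allclose(grad_V, numerical, atol=1e-6))
```

Each column of `grad_V` is a copy of `x` scaled by one entry of `G`, which matches the stacked picture above.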

### Backpropagation - step 6:

Which is of course easy to code as:

In [22]:
dAdV = x.T
array_print(dAdV)

The array:
 [[1]
 [0]
 [0]]
The dimensions are 3 rows and 1 column


## Where are we

<div class="visual">
    <img src='img/neural_net_4_a_grad.png'>
</div>

## Computing $\frac{\partial l}{\partial V}$

To compute $\frac{\partial l}{\partial V}$, we simply multiply all of these partial derivatives we've calculated together, being careful to use matrix multiplication where necessary and elementwise multiplication where necessary:

### Computing $\frac{\partial l}{\partial V}$

In [23]:
dLdV = np.dot(dAdV, np.dot(dLdP * dPdC, dCdB) * dBdA)
array_print(dLdV, 3)

The array:
 [[-0.019 -0.005  0.007  0.002]
 [ 0.     0.     0.     0.   ]
 [ 0.     0.     0.     0.   ]]
The dimensions are 3 rows and 4 columns


Note that this has the same shape as $V$, which is what we want!
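Matching shapes are a good sign, but a stronger test is a gradient check: nudge each weight slightly, see how the loss changes, and compare against the derivatives we derived. The sketch below does this for both weight matrices. The helper names `loss` and `numerical_gradient`, the seeded weights, and the sample observation are assumptions; the analytic expressions are the same ones used in the cells above.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def loss(x, y, V, W):
    # Full forward pass followed by the squared error loss.
    P = sigmoid(sigmoid(x @ V) @ W)
    return (0.5 * (y - P) ** 2).item()

rng = np.random.default_rng(0)
x = np.array([[1.0, 0.0, 1.0]])
y = np.array([[1.0]])
V = rng.normal(size=(3, 4))
W = rng.normal(size=(4, 1))

# Analytic gradients, using the same expressions as the cells above.
A = x @ V
B = sigmoid(A)
C = B @ W
P = sigmoid(C)
dLdP = -(y - P)
dPdC = sigmoid(C) * (1 - sigmoid(C))
dBdA = sigmoid(A) * (1 - sigmoid(A))
dLdW = np.dot(B.T, dLdP * dPdC)
dLdV = np.dot(x.T, np.dot(dLdP * dPdC, W.T) * dBdA)

def numerical_gradient(param, eps=1e-6):
    """Central differences; perturbs `param` in place, one entry at a time."""
    grad = np.zeros_like(param)
    for idx in np.ndindex(param.shape):
        original = param[idx]
        param[idx] = original + eps
        up = loss(x, y, V, W)
        param[idx] = original - eps
        down = loss(x, y, V, W)
        param[idx] = original          # restore the weight
        grad[idx] = (up - down) / (2 * eps)
    return grad

max_error_W = np.max(np.abs(numerical_gradient(W) - dLdW))
max_error_V = np.max(np.abs(numerical_gradient(V) - dLdV))
print(max_error_W, max_error_V)
```

Both differences should be tiny, which is evidence that the chain of partial derivatives was assembled correctly.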

## Updating the weights

Updating the weights can now be done simply:

In [24]:
W -= dLdW
V -= dLdV

# Putting this all together

Now let's write some functions that train the neural net. The following just wraps around what we've already done, running one batch through the neural net:

In [25]:
def learn(V, W, x_batch, y_batch):
    # forward pass
    A = np.dot(x_batch,V)
    B = sigmoid(A)
    C = np.dot(B,W)
    P = sigmoid(C)
    
    # loss
    L = 0.5 * (y_batch - P) ** 2
    
    # backpropagation
    dLdP = -1.0 * (y_batch - P)
    dPdC = sigmoid(C) * (1-sigmoid(C))
    dLdC = dLdP * dPdC
    dCdW = B.T
    dLdW = np.dot(dCdW, dLdC)
    dCdB = W.T
    dBdA = sigmoid(A) * (1-sigmoid(A))
    dAdV = x_batch.T
    dLdV = np.dot(dAdV, np.dot(dLdP * dPdC, dCdB) * dBdA)
    
    # update the weights
    W -= dLdW
    V -= dLdV
    
    return V, W
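The training helper used in the next cell lives in `helpers` and isn't shown here, so the following is only a sketch of what a loop around `learn` might look like, not the talk's actual implementation. It restates a condensed `learn`, feeds it one observation at a time, and compares the average loss before and after. The name `mean_loss`, the seed, and the choice of 500 epochs are assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def learn(V, W, x_batch, y_batch):
    # Same forward pass, gradients, and update as the cell above, condensed.
    A = np.dot(x_batch, V)
    B = sigmoid(A)
    C = np.dot(B, W)
    P = sigmoid(C)
    dLdC = -(y_batch - P) * P * (1 - P)
    dLdW = np.dot(B.T, dLdC)
    dLdV = np.dot(x_batch.T, np.dot(dLdC, W.T) * B * (1 - B))
    W -= dLdW
    V -= dLdV
    return V, W

def mean_loss(V, W, X, Y):
    # Average squared error loss over all observations.
    P = sigmoid(np.dot(sigmoid(np.dot(X, V)), W))
    return float(np.mean(0.5 * (Y - P) ** 2))

# The eight observations from the data frame shown earlier.
X = np.array([[1, 0, 0], [0, 1, 0], [1, 1, 0], [0, 0, 0],
              [1, 0, 1], [0, 0, 1], [1, 1, 1], [0, 1, 1]], dtype=float)
Y = np.array([[1], [0], [1], [0], [0], [0], [1], [1]], dtype=float)

rng = np.random.default_rng(0)
V = rng.normal(size=(3, 4))
W = rng.normal(size=(4, 1))

loss_before = mean_loss(V, W, X, Y)
for epoch in range(500):
    # Visit the observations in a fresh random order each epoch.
    for i in rng.permutation(len(X)):
        V, W = learn(V, W, X[i:i + 1], Y[i:i + 1])
loss_after = mean_loss(V, W, X, Y)

print("Loss before:", loss_before, "after:", loss_after)
```

Slicing with `X[i:i + 1]` keeps each observation two-dimensional, so the matrix shapes match the single-row walkthrough above.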

## Putting this all together

Now, let's play around with some results:

In [26]:
V, W = train_and_display(X, Y, 500, 4)
accuracy_binary(X, Y, V, W)

    epoch  loss
0       0  0.23
1      50  0.13
2     100  0.05
3     150  0.02
4     200  0.01
5     250  0.01
6     300  0.00
7     350  0.00
8     400  0.00
9     450  0.00
10    500  0.00
The data frame of the predictions this neural net produces is:
    Actual  Predicted
0     1.0       0.94
1     0.0       0.06
2     1.0       0.98
3     0.0       0.02
4     0.0       0.08
5     0.0       0.03
6     1.0       0.98
7     1.0       0.95
The accuracy of this trained neural net is 1.0


1.0

In [27]:
df

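| Unnamed: 0 | X1 | X2 | X3 | y |
|---|---|---|---|---|
| 0 | 1 | 0 | 0 | 1 |
| 1 | 0 | 1 | 0 | 0 |
| 2 | 1 | 1 | 0 | 1 |
| 3 | 0 | 0 | 0 | 0 |
| 4 | 1 | 0 | 1 | 0 |
| 5 | 0 | 0 | 1 | 0 |
| 6 | 1 | 1 | 1 | 1 |
| 7 | 0 | 1 | 1 | 1 |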


# MNIST Illustration

To illustrate that this simple framework really does have all the power of neural nets, let's show that it can solve the MNIST problem, the classic digit recognition task.

In [28]:
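# Imports
# Note: `fetch_mldata` was deprecated in scikit-learn 0.20 and later removed;
# newer versions provide `fetch_openml` instead.

from sklearn.datasets import fetch_mldata
from sklearn.model_selection import train_test_split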

In [29]:
# "Fetch" the data

mnist = fetch_mldata('MNIST original') 
X_mnist, Y_mnist = get_mnist_X_Y(mnist)

In [30]:
# Train-test split

train_prop = 0.9
X_train, X_test, Y_train, Y_test = train_test_split(
    X_mnist, Y_mnist, 
    test_size=1-train_prop, 
    random_state=1)

## MNIST Illustration

In [31]:
if train_all:
    V, W = train_and_display(X_train, Y_train, 1, 50)
    accuracy = accuracy_multiclass(X_test, Y_test, V, W)
    print("Neural Net MNIST Classification Accuracy:", round(accuracy * 100, 1), "percent")

## MNIST Illustration

Keep in mind we got this accuracy without any "tricks": no convolutions, no dropout, no learning rate tuning - in fact, no "Deep Learning", since we used only one hidden layer! This shows how far simply having a solid understanding of the mathematics underlying neural nets can get you.

# Transitioning to Deep Learning

<div class="expo">
So: we've seen that neural nets can be expressed as mathematical functions. When expressed this way, we can explicitly calculate the derivative of each nested function in the neural net to ensure that the weights are updated in the correct way and that the neural net will "learn".
</div>

<div class="expo">
But, this involved a lot of steps for a simple neural net with just one hidden layer. If we want to build deeper neural nets, we're going to have to come up with a way of describing neural nets other than as just "nested functions".
</div>

## Another way to understand neural nets

Note that if we were going to code up a deeper net - let's say one with two hidden layers instead of one - we would be repeating some steps:

In the middle of the net on the forwards pass, we would be passing the output of a layer through a matrix multiplication, and then through an activation function, twice.

Similarly, on the backwards pass, we would twice be passing "values" backwards, first based on the derivative of the activation function, and then based on the input to the prior layer.

## Neuron illustration

Here's an illustration of what is going on at each neuron in the net.

<div class="visual">
    <img src='img/neuron_illustration_backprop.png'>
</div>

In addition, this same operation is happening at every neuron in each "layer" of the net.

## Towards a new way of thinking of neural nets

We might be able to think of neural nets as a series of "layers", each of which sends values forward to the layer in front of it and sends values backwards during the backwards pass using the process described on the prior slide.

Working backwards from the API we want, we could define a neural net as follows:

```
layer1 = Layer(args)
layer2 = Layer(args)
net = Net([layer1, layer2])
```

# New net framework

Here's how we could define new neural nets:

**Neural nets are a series of layers that pass information forwards based on what they receive as <em>input</em> and pass information backwards based on what they receive as the <em>loss</em>.**

Let's code this up. For getting me started on this code, I'm indebted to this guy:

<div class="visual">
    <img src="img/andersbll.png">
</div>

[Anders' GitHub](https://github.com/andersbll)

## Basic neural net framework

First, a helper function to set up the layers of the neural net:

In [32]:
def setup_layers(hidden_neurons, outputs):
    layers = []
    for i in range(len(hidden_neurons)):
        layer = FullyConnected(neurons=hidden_neurons[i], activation_function=sigmoid)
        layers.append(layer)

    output_layer = FullyConnected(neurons=outputs, activation_function=sigmoid)
    layers.append(output_layer)
    return layers

### Basic neural net framework

Now, a simple framework for running observations through a neural net:

In [33]:
class NeuralNetwork(object):
    def __init__(self, hidden_neurons, outputs, loss_function):
        self.hidden_neurons = hidden_neurons
        self.outputs = outputs
        self.loss_function = loss_function
        self.layers_setup = False

    def forwardpass(self, X, *args):
        """ Calculate an output Y for the given input X. """
        # If it is our first time doing a forward pass, set up the
        # layers of the network:
        if not self.layers_setup:
            self.layers = setup_layers(self.hidden_neurons, self.outputs)
            self.layers_setup = True

        X_next = X
        for layer in self.layers:
            X_next = layer.fprop(X_next)
        prediction = X_next
        return prediction
    
    def loss(self, prediction, Y):
        """ Return the gradient of the loss with respect to the prediction,
        which is the value sent backwards through the net. """
        return self.loss_function(prediction, Y, bprop=True)

    def backpropagate(self, loss):
        """ Backpropagate the loss through the net. """
        loss_next = loss
        for layer in reversed(self.layers):
            loss_next = layer.bprop(loss_next)
        return loss

## Basic layer definition

Next, let's define what a "layer" should be.

In [34]:
class Layer(object):

    def fprop(self, input):
        """ Calculate layer output for given input (forward propagation). """
        raise NotImplementedError()

    def bprop(self, output_grad):
        """ Calculate input gradient. """
        raise NotImplementedError()

## Fully connected layer

We know that our fully connected layer must have three components:

* `__init__` to set it up
* `fprop` that will take in a layer input and send it forward to the next layer appropriately
* `bprop` that will take in a loss from the following layer and send it backwards through the network appropriately.

### Fully connected layer

In addition, during the forward pass and backpropagation, we use the following abbreviations:

* LI = "Layer Input"
* AI = "Activation Input"
* AO = "Activation Output"
* LG = "Layer Gradient" -> the quantity that a layer is receiving from the layer above it.

In [35]:
class FullyConnected(Layer):
    
    def __init__(self, neurons, activation_function):
        self.n_neurons = neurons
        self.activation_function = activation_function        
        self.iterations = 0
        self.weights_initialized = False

    def fprop(self, layer_input):
        self.LI = layer_input
        
        if not self.weights_initialized:
            self.W = np.random.normal(size=(self.LI.shape[1], self.n_neurons))
            self.weights_initialized = True
        
        self.AI = np.dot(self.LI, self.W)
        layer_output = self.activation_function(self.AI, bprop=False)
        return layer_output
    
    def bprop(self, layer_gradient):
        
        dAOdAI = self.activation_function(self.AI, bprop=True)
        dLGdAI = layer_gradient * dAOdAI
        dAIdW = self.LI.T

        # Compute the gradient to pass backwards using the weights that
        # produced this forward pass, before they are updated below.
        output_grad = np.dot(dLGdAI, self.W.T)

        weight_update = np.dot(dAIdW, dLGdAI)
        W_new = self.W - weight_update
        self.W = W_new
        
        self.iterations += 1
        
        return output_grad

## Activation functions

We'll need to redefine our activation and loss functions to take a `bprop` option:

In [36]:
def sigmoid(x, bprop=False):
    if bprop:
        return sigmoid(x) * (1-sigmoid(x))
    else:
        return 1.0/(1.0+np.exp(-x))

In [37]:
def mean_square_error(prediction, Y, bprop=False):
    if bprop:
        return -1.0 * (Y - prediction)
    else:
        return 0.5 * (Y - prediction) ** 2
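Since each function now returns either its value or its derivative depending on `bprop`, it's worth confirming that the two branches agree. The sketch below repeats the two definitions so that it runs on its own, then compares the `bprop=True` branch with a central finite difference of the `bprop=False` branch. The helper `central_difference` and the test values are assumptions.

```python
import numpy as np

def sigmoid(x, bprop=False):
    if bprop:
        return sigmoid(x) * (1 - sigmoid(x))
    return 1.0 / (1.0 + np.exp(-x))

def mean_square_error(prediction, Y, bprop=False):
    if bprop:
        return -1.0 * (Y - prediction)
    return 0.5 * (Y - prediction) ** 2

def central_difference(f, x, eps=1e-6):
    # Elementwise numerical derivative of f at x.
    return (f(x + eps) - f(x - eps)) / (2 * eps)

x = np.array([[-2.0, 0.0, 0.5, 3.0]])
Y = np.array([[0.0, 1.0, 1.0, 0.0]])

sigmoid_error = np.max(np.abs(
    central_difference(sigmoid, x) - sigmoid(x, bprop=True)))

mse_error = np.max(np.abs(
    central_difference(lambda p: mean_square_error(p, Y), x)
    - mean_square_error(x, Y, bprop=True)))

print(sigmoid_error, mse_error)
```

The same check can be reused for any new activation or loss function added later.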

## Defining the net

In [38]:
nn_mnist = NeuralNetwork(
    hidden_neurons=[50],
    outputs=10,
    loss_function=mean_square_error)

## Training

In [39]:
from neural_net import *

In [40]:
def train(net, X_train, Y_train, epochs=5, print_msg=True):
    X_train, Y_train = shuffle_data(X_train, Y_train)
    
    for i in range(epochs):
        one_epoch(net, X_train, Y_train)
        if print_msg:
            print("Done with epoch", i+1)

In [41]:
if train_all:
    train(nn_mnist, X_train, Y_train, epochs=1)

## Does it work?

In [42]:
if train_all:
    accuracy = net_accuracy(nn_mnist, X_test, Y_test)

Yes...kind of.

## Deep Learning Illustration

Yes, in fact, we can use this framework to do Deep Learning. Let's define a neural net with three hidden layers.

In [43]:
nn_mnist_2 = NeuralNetwork(
    hidden_neurons=[75, 50, 25],
    outputs=10,
    loss_function=mean_square_error)

In [44]:
if train_all:
    train(nn_mnist_2, X_train, Y_train, epochs=1)

In [45]:
if train_all:
    accuracy = net_accuracy(nn_mnist_2, X_test, Y_test)

Again, it only "kind of" works.

# Deep Learning Tricks

Now we get to the fun part: tuning our deep learning models using the many tricks researchers have discovered to increase the performance of said models. 

In this talk, we'll get through as many of these as we can:

* Learning rate tuning
* Learning rate decay
* Varying learning rates by layer
* Learning rate momentum

* Dropout
* Dropconnect

* Weight initializations
* Different activation functions



# Learning rate tuning

<img src="img/bengio.png">

"The learning rate is the single most important hyperparameter and one should always make sure it is tuned."

-[Yoshua Bengio](http://www.iro.umontreal.ca/~bengioy/yoshua_en/)

## Learning rate definition

The learning rate is just a number that we multiply the weight update by during each iteration. So if the learning rate is $\alpha$, the weight update equation for a weight matrix $W$ becomes:

$$ W = W - \alpha * \frac{\partial l}{\partial W}$$
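To see why the size of $\alpha$ matters, consider a toy loss with a single weight, $l(w) = w^2$, whose derivative is $2w$. This example is an assumption made for illustration and is not part of the talk's code; the function name `descend` is hypothetical.

```python
def descend(alpha, steps=20, w=1.0):
    """Run `steps` updates of w = w - alpha * dl/dw on the toy loss l(w) = w**2."""
    for _ in range(steps):
        w = w - alpha * 2.0 * w   # dl/dw = 2w
    return w

small = descend(alpha=0.1)    # each step multiplies w by 0.8: shrinks towards 0
edge = descend(alpha=1.0)     # each step flips the sign of w: no progress
large = descend(alpha=1.1)    # each step multiplies w by -1.2: moves away from 0

print(small, edge, large)
```

A rate that is too small learns slowly, and one that is too large overshoots the minimum on every step, which is why the learning rate is worth tuning.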

## Coding this up - learning rates

We'll subclass `FullyConnected` and override its `bprop` function so that the weight update is multiplied by the learning rate:

In [46]:
class FullyConnectedLR(FullyConnected):
    
    def bprop(self, layer_gradient):
        
        dAOdAI = self.activation_function(self.AI, bprop=True)
        dLGdAI = layer_gradient * dAOdAI
        dAIdW = self.LI.T

        # Compute the gradient to pass backwards using the weights that
        # produced this forward pass, before they are updated below.
        output_grad = np.dot(dLGdAI, self.W.T)

        weight_update = np.dot(dAIdW, dLGdAI)
        
        # We now multiply the weight update by a learning rate
        W_new = self.W - self.learning_rate * weight_update
        self.W = W_new
        
        self.iterations += 1
        
        return output_grad

### Coding this up - learning rates

We'll define a new `setup_layers` function to give each layer a learning rate: 

In [47]:
def setup_layers(hidden_neurons, outputs, learning_rate=1.0):
    layers = []
    for i in range(len(hidden_neurons)):
        layer = FullyConnectedLR(neurons=hidden_neurons[i], activation_function=sigmoid)
        # We give each layer a learning rate during the set up
        setattr(layer, "learning_rate", learning_rate)
        layers.append(layer)

    output_layer = FullyConnectedLR(neurons=outputs, activation_function=sigmoid)
    setattr(output_layer, "learning_rate", learning_rate)
    layers.append(output_layer)
    return layers   

### Coding this up - learning rates

In [48]:
class NeuralNetworkLR(NeuralNetwork):
    def __init__(self, hidden_neurons, outputs, loss_function, learning_rate):
        NeuralNetwork.__init__(self, hidden_neurons, outputs, loss_function)
        # Store the learning rate as an instance attribute
        self.learning_rate = learning_rate

    def forwardpass(self, X, *args):
        """ Calculate an output Y for the given input X. """
        
        if not self.layers_setup:
            # Add learning rate to the setup_layers function
            self.layers = setup_layers(self.hidden_neurons, 
                                       self.outputs, 
                                       self.learning_rate)
            self.layers_setup = True

        X_next = X
        for layer in self.layers:
            X_next = layer.fprop(X_next)
        prediction = X_next
        return prediction

## Tuning the learning rate

In [49]:
def accuracy_net_lr(learning_rate):
    nn_mnist_lr = NeuralNetworkLR(
        hidden_neurons=[75, 25],
        outputs=10,
        loss_function=mean_square_error, 
        learning_rate=learning_rate)
    
    train(nn_mnist_lr, X_train, Y_train, epochs=1, print_msg=False)
    accuracy = net_accuracy(nn_mnist_lr, X_test, Y_test)
    
    print("The accuracy of a net with learning rate", 
          learning_rate, 
          "was", 
          round(accuracy * 100, 1), 
          "percent.")
    return accuracy

In [50]:
if train_all:
    learning_rates = np.arange(0.1, 0.6, 0.1)
    accuracies = [accuracy_net_lr(lr) for lr in learning_rates]

## Varying Learning Rates by Layer

Because backpropagation involves multiplying a value by the derivative of the activation function, gradients (which tell the weights how to update) get smaller and smaller as you get further from the output layer:

<img src="img/sigmoid_deriv_trask.png">

**At each sigmoid layer, the gradient is multiplied by at most 0.25, the maximum value of the sigmoid's derivative.** More [here](http://iamtrask.github.io/2015/07/12/basic-python-network/)
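
As a quick numerical check of the 0.25 figure, here is a small standalone sketch. The names `logistic` and `logistic_deriv` are chosen here so they do not clash with the `sigmoid` helper used in the talk's code; they are assumptions for illustration only.

```python
import numpy as np

def logistic(x):
    """Standard sigmoid function."""
    return 1.0 / (1.0 + np.exp(-x))

def logistic_deriv(x):
    """Derivative of the sigmoid: s(x) * (1 - s(x))."""
    s = logistic(x)
    return s * (1.0 - s)

grid = np.linspace(-10, 10, 2001)
max_deriv = logistic_deriv(grid).max()   # the maximum is attained at x = 0

# Upper bound on the factor contributed by the activation derivatives
# after passing back through 1, 2, 3 and 4 sigmoid layers
shrink_bounds = [max_deriv ** n for n in range(1, 5)]
print(max_deriv, shrink_bounds)
```

This only bounds the contribution of the activation derivatives; the weights themselves also scale the gradient as it passes backward.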

### Coding this up: varying learning rates by layer

In [51]:
def setup_layers(hidden_neurons, outputs, learning_rate=1.0, learning_rate_layer_decay=1.0):
    layers = []
    for i in range(len(hidden_neurons)):
        layer = FullyConnectedLR(neurons=hidden_neurons[i], activation_function=sigmoid)
        # Divide the learning rate by the decay factor once for each layer before this one
        setattr(layer, "learning_rate", learning_rate / (learning_rate_layer_decay ** i))
        layers.append(layer)

    output_layer = FullyConnectedLR(neurons=outputs, activation_function=sigmoid)
    # The output layer comes after every hidden layer, so it gets the smallest learning rate
    setattr(output_layer, "learning_rate", learning_rate / (learning_rate_layer_decay ** len(hidden_neurons)))
    layers.append(output_layer)
    return layers   
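
To see which learning rates this produces, here is a minimal sketch that mirrors the formula in `setup_layers` above. The helper name `layer_learning_rates` is made up for illustration and is not part of the talk's code.

```python
def layer_learning_rates(n_hidden_layers, learning_rate=1.0, learning_rate_layer_decay=1.0):
    """Learning rates for each hidden layer in order, followed by the output layer."""
    return [learning_rate / (learning_rate_layer_decay ** i)
            for i in range(n_hidden_layers + 1)]

# Two hidden layers plus an output layer, with a decay factor of 4
rates = layer_learning_rates(2, learning_rate=0.3, learning_rate_layer_decay=4)
print(rates)
```

With a decay factor of 4, each layer's rate is a quarter of the previous layer's, so the layers furthest from the output, where gradients are smallest, take the largest steps.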

### Coding this up: varying learning rates by layer

In [52]:
class NeuralNetworkLRDecay(NeuralNetworkLR):
    def __init__(self, hidden_neurons, outputs, loss_function, learning_rate, 
                 learning_rate_layer_decay):
        NeuralNetworkLR.__init__(self, hidden_neurons, outputs, loss_function, learning_rate)
        self.learning_rate_layer_decay = learning_rate_layer_decay

    def forwardpass(self, X, *args):
        """ Calculate an output Y for the given input X. """
        
        if not self.layers_setup:
            # Add learning rate decay to setup_layers function
            self.layers = setup_layers(self.hidden_neurons, 
                                       self.outputs, 
                                       self.learning_rate,
                                       self.learning_rate_layer_decay)
            self.layers_setup = True

        X_next = X
        for layer in self.layers:
            X_next = layer.fprop(X_next)
        prediction = X_next
        return prediction

In [53]:
nn_mnist_lr = NeuralNetworkLRDecay(
        hidden_neurons=[75, 25],
        outputs=10,
        loss_function=mean_square_error, 
        learning_rate=0.3, 
        learning_rate_layer_decay=4)

In [55]:
if train_all:
    train(nn_mnist_lr, X_train, Y_train, epochs=1, print_msg=False)
    accuracy = net_accuracy(nn_mnist_lr, X_test, Y_test)

## Learning Rate Momentum

The weights in a neural net are updated according to:

$$ W = W - \alpha * \frac{\partial l}{\partial W}$$

Recall that this is equivalent to doing gradient descent with each parameter:

<img src="img/gradient_descent.png">

This is analogous to a ball rolling down a hill. 

Balls rolling down hills have momentum. So, therefore, should our weights!

## Learning rate momentum

Let's define our weight update $ \frac{\partial l}{\partial W} $ at time step $ t $ to be $ U_t $. Then, instead of our weight update being $ U_t $ at each time step, it will be:

$$ U_t + \mu * U_{t-1} + \mu^2 * U_{t-2} + ... $$ 

where $\mu$ is a decay parameter between 0 and 1.

This has the effect of, and is often described as, increasing your learning rate when your weight updates are going in the same direction, iteration after iteration, and lowering your learning rate when the opposite is happening.
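
The sum above does not need to be stored term by term: keeping a running "velocity" $ V_t = \mu * V_{t-1} + U_t $ gives the same result. Here is a minimal sketch checking that, using random arrays as stand-ins for the updates; all names are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
mu = 0.75
updates = [rng.normal(size=(3, 2)) for _ in range(6)]   # stand-ins for U_1, ..., U_6

# Recursive form: V_t = mu * V_{t-1} + U_t
velocity = np.zeros((3, 2))
for U in updates:
    velocity = mu * velocity + U

# Explicit form: U_t + mu * U_{t-1} + mu^2 * U_{t-2} + ...
explicit = sum(mu ** k * U for k, U in enumerate(reversed(updates)))
forms_agree = np.allclose(velocity, explicit)

# Updates that always point the same way accumulate toward 1 / (1 - mu)
same_direction = 0.0
for _ in range(200):
    same_direction = mu * same_direction + 1.0

# Updates that flip sign every step settle at a magnitude of 1 / (1 + mu)
alternating = 0.0
for t in range(200):
    alternating = mu * alternating + (-1.0) ** t

print(forms_agree, same_direction, abs(alternating))
```

With $\mu = 0.75$, consistent updates end up about four times larger than a single update, while updates that keep flipping sign end up smaller than a single update.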

### Coding this up - learning rate momentum

In [56]:
def setup_layers(hidden_neurons, outputs, learning_rate=1.0, learning_rate_layer_decay=1.0, momentum=0.1):
    layers = []
    for i in range(len(hidden_neurons)):
        layer = FullyConnectedLRMomentum(neurons=hidden_neurons[i], activation_function=sigmoid)
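        # Keep the layer-wise learning rate decay from before
        setattr(layer, "learning_rate", learning_rate / (learning_rate_layer_decay ** i))
        # Add momentum parameter to each layer
        setattr(layer, "momentum", momentum)
        layers.append(layer)

    output_layer = FullyConnectedLRMomentum(neurons=outputs, activation_function=sigmoid)
    setattr(output_layer, "learning_rate", learning_rate / (learning_rate_layer_decay ** len(hidden_neurons)))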
    # Add momentum parameter to last layer
    setattr(output_layer, "momentum", momentum)
    layers.append(output_layer)
    return layers   

### Coding this up - learning rate momentum

In [57]:
class FullyConnectedLRMomentum(FullyConnectedLR):
    def __init__(self, neurons, activation_function):
        FullyConnectedLR.__init__(self, neurons, activation_function)
        self.first = True
    
    def bprop(self, layer_gradient):
        
        dAOdAI = self.activation_function(self.AI, bprop=True)
        dLGdAI = layer_gradient * dAOdAI
        dAIdW = self.LI.T
        weight_update_current = np.dot(dAIdW, dLGdAI)
        
        # On the first iteration, initialize the "velocity" to the learning rate times the current update
        if self.first:    
            self.velocity = self.learning_rate * weight_update_current
            self.first = False
        else:
            # On later iterations, multiply velocity by momentum, and add learning rate times current update
            self.velocity = (self.momentum * self.velocity
                             + self.learning_rate * weight_update_current)
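        # Compute the gradient to pass backward with the weights used in the forward pass,
        # before updating them
        output_grad = np.dot(dLGdAI, self.W.T)

        self.W = self.W - self.velocity
        
        self.iterations += 1
        
        return output_grad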

### Coding this up - learning rate momentum

In [58]:
class NeuralNetworkLRMomentum(NeuralNetworkLRDecay):
    def __init__(self, hidden_neurons, outputs, loss_function, learning_rate, 
                 learning_rate_layer_decay, momentum):
        NeuralNetworkLRDecay.__init__(self, hidden_neurons, outputs, loss_function, 
                                      learning_rate, learning_rate_layer_decay)
        # Store momentum as an instance attribute
        self.momentum = momentum
        
    def forwardpass(self, X, *args):
        """ Calculate an output Y for the given input X. """
        
        if not self.layers_setup:
            # Add momentum to setup layers function
            self.layers = setup_layers(self.hidden_neurons, 
                                       self.outputs, 
                                       self.learning_rate,
                                       self.learning_rate_layer_decay, 
                                       self.momentum)
    
            self.layers_setup = True
        X_next = X
        for layer in self.layers:
            X_next = layer.fprop(X_next)
        prediction = X_next
        return prediction

### Testing learning rate momentum

In [59]:
nn_mnist_lr = NeuralNetworkLRMomentum(
        hidden_neurons=[75, 25],
        outputs=10,
        loss_function=mean_square_error, 
        learning_rate=0.3, 
        learning_rate_layer_decay=4,
        momentum=0.75)

In [61]:
if train_all:
    train(nn_mnist_lr, X_train, Y_train, epochs=1, print_msg=False)
    accuracy = net_accuracy(nn_mnist_lr, X_test, Y_test)

## Dropout

Dropout can help prevent neural networks from overfitting. It involves "dropping" a portion of the neurons - that is, setting their values to zero - on each forward pass through the network during training.

<img src="img/dropout.png">

This nudges the network toward learning "redundant representations of its data".
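
Here is a minimal standalone sketch of the "dropping" step on a small array of activations. The function name `drop_neurons` and its `keep_fraction` argument are assumptions for illustration; as in the code that follows, the fraction given is the share of neurons kept, not the share dropped.

```python
import numpy as np

def drop_neurons(activations, keep_fraction, rng):
    """Zero a '1 - keep_fraction' share of the columns (neurons) of `activations`."""
    n_neurons = activations.shape[1]
    n_drop = int(n_neurons * (1 - keep_fraction))
    drop_idx = rng.choice(n_neurons, size=n_drop, replace=False)
    dropped = activations.copy()      # leave the caller's array untouched
    dropped[:, drop_idx] = 0.0
    return dropped

rng = np.random.default_rng(0)
acts = np.ones((4, 8))                # 4 observations, 8 neurons
dropped_acts = drop_neurons(acts, keep_fraction=0.75, rng=rng)
print(dropped_acts)
```

The same neurons are zeroed for every observation in the batch, and a fresh set is drawn on each call, so no single neuron can be relied on from one pass to the next.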

### Coding this up - dropout

In [62]:
class NeuralNetworkDropout(NeuralNetworkLRMomentum):
    def __init__(self, hidden_neurons, outputs, loss_function, learning_rate, 
                 learning_rate_layer_decay, momentum, dropout):
        NeuralNetworkLRMomentum.__init__(self, hidden_neurons, outputs, loss_function, 
                                      learning_rate, learning_rate_layer_decay, momentum)
        # Store dropout as an instance attribute
        self.dropout = dropout

        
    def forwardpass(self, X, predict=False):
        # Give "forwardpass" a "predict" flag so it only applies Dropout during training
        
        if not self.layers_setup:
            self.layers = setup_layers(self.hidden_neurons, 
                                       self.outputs, 
                                       self.learning_rate,
                                       self.learning_rate_layer_decay, 
                                       self.momentum)
            self.layers_setup = True

        X_next = X
        for i, layer in enumerate(self.layers):
            # Set '1-dropout' proportion of the neurons feeding into this layer equal to zero
            if self.dropout and not predict:
                n_inputs = X_next.shape[1]
                zero_indices = np.random.choice(n_inputs, 
                                                size=int(n_inputs * (1 - self.dropout)), 
                                                replace=False)
                # Copy first so the caller's data is not overwritten in place
                X_next = X_next.copy()
                X_next[:, zero_indices] = 0.0
            X_next = layer.fprop(X_next)
        prediction = X_next
        return prediction

### Testing dropout

In [63]:
nn_mnist_lr = NeuralNetworkDropout(
        hidden_neurons=[75, 25],
        outputs=10,
        loss_function=mean_square_error, 
        learning_rate=0.3, 
        learning_rate_layer_decay=4,
        momentum=0.2,
        dropout=0.75)

In [65]:
if train_all:
    train(nn_mnist_lr, X_train, Y_train, epochs=1, print_msg=False)
    accuracy = net_accuracy(nn_mnist_lr, X_test, Y_test)

## Weight initialization

Each neuron passes some values to the next layer. Those values have some variance. To ensure that the total variance that each layer receives from the prior layer is roughly constant, it has become a best practice to scale the variance of each set of weights by the number of neurons in that layer. So, if there are $N$ neurons in a layer, the weights coming out of that layer will be given variance $\frac{1}{N}$.
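
To illustrate why the $\frac{1}{N}$ scaling helps, here is a minimal sketch that compares the variance of a layer's pre-activation values with and without it. The layer sizes and variable names are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_obs, n_in, n_out = 2000, 400, 50
layer_inputs = rng.normal(size=(n_obs, n_in))    # unit-variance values from N = 400 neurons

# Weights with variance 1 versus weights with variance 1 / N
W_unscaled = rng.normal(scale=1.0, size=(n_in, n_out))
W_scaled = rng.normal(scale=1.0 / np.sqrt(n_in), size=(n_in, n_out))

var_unscaled = np.dot(layer_inputs, W_unscaled).var()
var_scaled = np.dot(layer_inputs, W_scaled).var()
print(var_unscaled, var_scaled)
```

Without scaling, each value the next layer receives is a sum of $N$ unit-variance terms, so its variance is roughly $N$; with variance-$\frac{1}{N}$ weights it stays close to 1, which keeps sigmoid units away from their flat, saturated regions.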

### Coding this up - Xavier weight initialization

In [66]:
class FullyConnectedXavier(FullyConnectedLRMomentum):
    def __init__(self, neurons, activation_function):
        FullyConnectedLRMomentum.__init__(self, neurons, activation_function)

    def fprop(self, layer_input):
        self.LI = layer_input
        
        if not self.weights_initialized:
            # Scale the weights' variance by the number of neurons feeding into them,
            # so weights coming out of a layer with N neurons have variance 1 / N
            n_in = self.LI.shape[1]
            self.W = np.random.normal(scale=1 / np.sqrt(n_in), size=(n_in, self.n_neurons))
            self.weights_initialized = True
        
        self.AI = np.dot(self.LI, self.W)
        return self.activation_function(self.AI, bprop=False)

### Coding this up - Xavier weight initialization

In [67]:
def setup_layers(hidden_neurons, outputs, learning_rate=1.0, learning_rate_layer_decay=1.0, momentum=0.1):
    layers = []
    for i in range(len(hidden_neurons)):
        # Change to use Xavier initialization
        layer = FullyConnectedXavier(neurons=hidden_neurons[i], activation_function=sigmoid)
        setattr(layer, "learning_rate", learning_rate / (learning_rate_layer_decay ** i))
        setattr(layer, "momentum", momentum)
        layers.append(layer)

    # Change to use Xavier initialization
    output_layer = FullyConnectedXavier(neurons=outputs, activation_function=sigmoid)
    setattr(output_layer, "learning_rate", learning_rate / (learning_rate_layer_decay ** len(hidden_neurons)))
    setattr(output_layer, "momentum", momentum)
    layers.append(output_layer)
    return layers   

### Testing Xavier weight initialization

In [68]:
nn_mnist_lr = NeuralNetworkDropout(
        hidden_neurons=[75, 25],
        outputs=10,
        loss_function=mean_square_error, 
        learning_rate=0.3, 
        learning_rate_layer_decay=4,
        momentum=0.2,
        dropout=0.75)

In [70]:
if train_all:
    train(nn_mnist_lr, X_train, Y_train, epochs=1, print_msg=False)
    accuracy = net_accuracy(nn_mnist_lr, X_test, Y_test)

## DropConnect

<img src="img/MNIST_performance.png">

As we can see [here](http://rodrigob.github.io/are_we_there_yet/build/classification_datasets_results.html), the highest-performing model on the MNIST data used "DropConnect", where a portion of the _weights_ in the neural net are set to zero, as opposed to a portion of the _neurons_.

[Not in TensorFlow!](https://stackoverflow.com/questions/37135885/dropconnect-in-tensorflow)
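
To make the difference between the two concrete, here is a minimal sketch on a tiny layer. All names and sizes are made up for illustration; this is not the implementation that follows.

```python
import numpy as np

rng = np.random.default_rng(0)
layer_in = rng.normal(size=(4, 6))      # 4 observations, 6 incoming neurons
weights = rng.normal(size=(6, 3))       # 6 incoming neurons -> 3 outgoing neurons
keep = 0.5                              # share of neurons / weights to keep

# Dropout: zero whole neurons (columns of the layer input)
neuron_mask = np.ones(6)
neuron_mask[rng.choice(6, size=int(6 * (1 - keep)), replace=False)] = 0.0
dropout_out = np.dot(layer_in * neuron_mask, weights)

# DropConnect: zero individual weights
weight_mask = np.ones(weights.size)
n_zero = int(weights.size * (1 - keep))
weight_mask[rng.choice(weights.size, size=n_zero, replace=False)] = 0.0
weight_mask = weight_mask.reshape(weights.shape)
dropconnect_out = np.dot(layer_in, weights * weight_mask)

# Dropping a neuron is the same as zeroing its entire row of weights
same_as_row_mask = np.allclose(dropout_out,
                               np.dot(layer_in, weights * neuron_mask[:, None]))
print(same_as_row_mask)
```

Seen this way, Dropout is the special case of DropConnect in which weights are only ever removed a whole row at a time.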

### Coding this up - DropConnect

In [72]:
def apply_drop_connect_weights(weights, drop_connect):
    """Return a copy of `weights` with a '1 - drop_connect' proportion of its entries set to zero."""
    new_weights = weights.copy()
    num_weights = new_weights.size
    # Choose which weights to zero, indexing into the flattened weight matrix
    zero_indices = np.random.choice(num_weights, 
                                    size=int(num_weights * (1 - drop_connect)), 
                                    replace=False)
    # `.flat` lets us assign into the 2-D array using the flattened indices
    new_weights.flat[zero_indices] = 0.0
    
    return new_weights

### Coding this up - DropConnect

In [73]:
class FullyConnectedDropConnect(FullyConnectedXavier):
    def __init__(self, neurons, activation_function, drop_connect):
        FullyConnectedXavier.__init__(self, neurons, activation_function)
        self.drop_connect = drop_connect
    
    def fprop(self, layer_input):

        self.LI = layer_input
        
        if not self.weights_initialized:
            # Scale the weights' variance by the number of neurons feeding into them,
            # so weights coming out of a layer with N neurons have variance 1 / N
            n_in = self.LI.shape[1]
            self.W = np.random.normal(scale=1 / np.sqrt(n_in), size=(n_in, self.n_neurons))
            self.weights_initialized = True
        
        if self.drop_connect:            
            drop_connected_weights = apply_drop_connect_weights(self.W, 
                                                                self.drop_connect)
            self.AI = np.dot(layer_input, 
                             drop_connected_weights)
        else:
            self.AI = np.dot(layer_input, self.W)

        return self.activation_function(self.AI, bprop=False)

### Coding this up - DropConnect

In [74]:
def setup_layers(hidden_neurons, outputs, learning_rate=1.0, learning_rate_layer_decay=1.0, momentum=0.1, 
                 drop_connect=0.0):
    layers = []
    for i in range(len(hidden_neurons)):
        # Change to use DropConnect layers (which keep the Xavier initialization)
        layer = FullyConnectedDropConnect(neurons=hidden_neurons[i], 
                                          activation_function=sigmoid, 
                                          drop_connect=drop_connect)
        setattr(layer, "learning_rate", learning_rate / (learning_rate_layer_decay ** i))
        setattr(layer, "momentum", momentum)
        setattr(layer, "drop_connect", drop_connect)
        layers.append(layer)

    # Change to use a DropConnect layer (which keeps the Xavier initialization)
    output_layer = FullyConnectedDropConnect(neurons=outputs, 
                                             activation_function=sigmoid,
                                             drop_connect=drop_connect)
    setattr(output_layer, "learning_rate", learning_rate / (learning_rate_layer_decay ** len(hidden_neurons)))
    setattr(output_layer, "momentum", momentum)
    setattr(output_layer, "drop_connect", drop_connect)
    layers.append(output_layer)
    return layers   

### Coding this up - DropConnect

In [75]:
class NeuralNetworkDropConnect(NeuralNetworkDropout):
    def __init__(self, hidden_neurons, outputs, loss_function, learning_rate, 
                 learning_rate_layer_decay, momentum, dropout, drop_connect):
        NeuralNetworkDropout.__init__(self, hidden_neurons, outputs, loss_function, 
                                          learning_rate, learning_rate_layer_decay, momentum,
                                          dropout)
        # Store drop_connect as an instance attribute
        self.drop_connect = drop_connect
        
    def forwardpass(self, X, predict=False):
        
        # Add drop_connect to the layers setup
        if not self.layers_setup:
            self.layers = setup_layers(self.hidden_neurons, 
                                       self.outputs, 
                                       self.learning_rate,
                                       self.learning_rate_layer_decay, 
                                       self.momentum, 
                                       self.drop_connect)
            self.layers_setup = True

        X_next = X
        for i, layer in enumerate(self.layers):
            # Set '1-dropout' proportion of the neurons feeding into this layer equal to zero
            if self.dropout and not predict:
                n_inputs = X_next.shape[1]
                zero_indices = np.random.choice(n_inputs, 
                                                size=int(n_inputs * (1 - self.dropout)), 
                                                replace=False)
                # Copy first so the caller's data is not overwritten in place
                X_next = X_next.copy()
                X_next[:, zero_indices] = 0.0
            X_next = layer.fprop(X_next)
        prediction = X_next

        return prediction

### Testing DropConnect

In [76]:
nn_mnist_lr = NeuralNetworkDropConnect(
        hidden_neurons=[75, 25],
        outputs=10,
        loss_function=mean_square_error, 
        learning_rate=0.3, 
        learning_rate_layer_decay=4,
        momentum=0.2,
        dropout=0.75,
        drop_connect=0.9)

In [77]:
if train_all:
    train(nn_mnist_lr, X_train, Y_train, epochs=1, print_msg=False)
    accuracy = net_accuracy(nn_mnist_lr, X_test, Y_test)

**Thanks!**

<img src="img/professional_headshot.png" class="visual">

[Website](https://www.sethweidman.com) | [Medium](https://medium.com/@sethweidman) | [GitHub](https://github.com/sethHWeidman/) | [Twitter](https://twitter.com/SethHWeidman) | [LinkedIn](https://www.linkedin.com/in/sethhweidman/)

Email seth@sethweidman.com if you have any questions.