# **Lecture: Neural Network, Full Breakdown**

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

## **The Data**

In [None]:
# Original data
DATA = pd.read_csv("~/Data/biology/Iris.csv")
print(DATA.keys())
DATA = DATA[['SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm', 'PetalWidthCm', 'Species']]

# Map each species name to a binary class label.
mapping = {"Iris-setosa": 0, "Iris-versicolor": 1, "Iris-virginica" : 1}

# Setosa becomes 0; the other two species become 1.
DATA['Species'] = [mapping[item] for item in DATA['Species']]

# Shuffle the rows before splitting into training and test sets.
DATA = DATA.sample(frac=1)

X = DATA[['SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm', 'PetalWidthCm']]
Y = DATA[['Species']]

X_train = X.iloc[ : 100] 
Y_train = Y.iloc[ : 100]

X_test = X.iloc[100 : ]
Y_test = Y.iloc[100 : ]

## **Network Setup**

In [None]:
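def brain(): # Create the "brain": one weight matrix per layer.
    # Establish random weights, uniform in [-1, 1).
    W1 = 2 * np.random.rand(4,4) - 1 # 4 input features -> 4 nodes in layer 1
    W2 = 2 * np.random.rand(4,3) - 1 # 4 nodes in layer 1 -> 3 nodes in layer 2
    W3 = 2 * np.random.rand(3,1) - 1 # 3 nodes in layer 2 -> 1 output node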
    
    return [W1, W2, W3]

In [None]:
# Activation function.
def sigmoid(z):
    a = 1 / (1 + np.e ** (-z))
    return a
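
The backward pass later in this lecture needs the derivative of the sigmoid, which can be written in terms of the sigmoid itself: $\sigma'(z) = \sigma(z)(1 - \sigma(z))$. The following is a minimal sketch that checks this identity against a finite difference. The helper name `sigmoid_prime` is hypothetical and is not used elsewhere in the notebook.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def sigmoid_prime(z):
    # Derivative of the sigmoid, written in terms of the sigmoid itself.
    s = sigmoid(z)
    return s * (1 - s)

# Compare with a central finite difference at a few sample points.
z = np.linspace(-4, 4, 9)
h = 1e-5
numeric = (sigmoid(z + h) - sigmoid(z - h)) / (2 * h)
max_gap = np.max(np.abs(numeric - sigmoid_prime(z)))
print(max_gap)  # a very small number if the identity holds
```

This is why `back_prop` can rebuild every derivative from the saved pre-activation values alone.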

## **Forward Pass**

In [None]:
# The forward pass.
def forward_pass(X, W1, W2, W3):
    # Layer 1
    z1 = np.dot(X, W1) # Left hand side of node. (Pre-activation)
    a1 = sigmoid(z1) # Right hand side of the node. (Post-activation)
    
    # Layer 2
    z2 = np.dot(a1, W2) # Left hand side of node. (Pre-activation)
    a2 = sigmoid(z2) # Right hand side of the node. (Post-activation)
    
    # Layer 3
    z3 = np.dot(a2, W3) # Left hand side of node. (Pre-activation)
    a3 = sigmoid(z3) # Right hand side of the node. (Post-activation)
    
    # Returns the guess(es) for a single datum's features or for a batch of data.
    return [z1, z2, z3, a3] # Pre-activation values for every layer, plus the final activation a3.
# To recover a2 from z2 (or a1 from z1), send it through the activation function again.
# a3 is the value used in the cost function evaluation; it lies between 0 and 1, it is not exactly 0 or 1.
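
To see how the shapes move through the three layers, here is a small sketch on made-up data. The batch `X_toy` and the seeded generator are assumptions for illustration only; the weight shapes match the ones in `brain()`.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

rng = np.random.default_rng(0)
X_toy = rng.random((5, 4))          # 5 made-up data points, 4 features each
W1 = 2 * rng.random((4, 4)) - 1     # features -> layer 1
W2 = 2 * rng.random((4, 3)) - 1     # layer 1 -> layer 2
W3 = 2 * rng.random((3, 1)) - 1     # layer 2 -> output

z1 = X_toy @ W1
a1 = sigmoid(z1)
z2 = a1 @ W2
a2 = sigmoid(z2)
z3 = a2 @ W3
a3 = sigmoid(z3)

shapes = [z1.shape, z2.shape, z3.shape, a3.shape]
print(shapes)  # [(5, 4), (5, 3), (5, 1), (5, 1)]
```

Each row stays attached to one datum the whole way through, so the final column `a3` holds one guess per datum.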

In [None]:
W1,W2,W3 = brain()
Forward_Pass_Output = forward_pass(X, W1, W2, W3)
# Forward_Pass_Output # The whole batch goes through at once; each row of z1 holds the four pre-activation values for one datum.

## **Evaluation**

$$ \textrm{Cost} = \frac{1}{N} \sum_{i=1}^N (a_{1}^{(3)} - y_i) ^ 2 $$

In [None]:
Cost = np.mean((Forward_Pass_Output[3] - np.asarray(Y)) ** 2) # Mean squared error: the sum of squared differences divided by N.

In [None]:
Cost
# We want the cost to be as close to 0 as possible; with random, untrained weights it is not there yet.
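
As a worked instance of the cost formula, the sketch below evaluates the mean squared error on four made-up guesses. The function name `mse_cost` and the toy numbers are assumptions for illustration, not values from the Iris data.

```python
import numpy as np

def mse_cost(guesses, targets):
    # Flatten both inputs so (N, 1) columns and plain length-N vectors behave the same.
    guesses = np.asarray(guesses, dtype=float).reshape(-1)
    targets = np.asarray(targets, dtype=float).reshape(-1)
    return np.mean((guesses - targets) ** 2)

toy_guesses = np.array([[0.9], [0.2], [0.6], [0.4]])
toy_targets = np.array([[1], [0], [1], [1]])

# (0.01 + 0.04 + 0.16 + 0.36) / 4, which is approximately 0.1425
toy_cost = mse_cost(toy_guesses, toy_targets)
print(toy_cost)
```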

## **Back Propagation (update the weights)**

$$ \textrm{Cost} = \frac{1}{N} \sum_{i=1}^N (a_{1}^{(3)} - y_i) ^ 2 $$
and

$$ a_{1}^{(3)} = \sigma(z_{1}^{(3)}) $$ 

which is the activation in the final node. And, 

$$ z_{1}^{(3)} = w_{11}^{(3)} a_{1}^{(2)} + w_{21}^{(3)} a_{2}^{(2)} + w_{31}^{(3)} a_{3}^{(2)}$$

So,

$$ \textrm{Cost} = \frac{1}{N} \sum_{i=1}^N ( \sigma(\color{Red}{w_{11}^{(3)}} a_{1}^{(2)} + \color{Red}{w_{21}^{(3)}} a_{2}^{(2)} + \color{Red}{w_{31}^{(3)}} a_{3}^{(2)}) - y_i) ^ 2 $$

This last line shows where the first weights we will update (in red) appear in the cost equation. We can compute the gradient with respect to these weights to determine which way to head on the cost surface to reduce the overall error for the data, or for a batch of the data.

Let's take the gradient computation one step at a time.

$$ \nabla \textrm{Cost} = \nabla \frac{1}{N} \sum_{i=1}^N ( \sigma(\color{Red}{w_{11}^{(3)}} a_{1}^{(2)} + \color{Red}{w_{21}^{(3)}} a_{2}^{(2)} + \color{Red}{w_{31}^{(3)}} a_{3}^{(2)}) - y_i) ^ 2 $$

$$= \frac{1}{N} \sum_{i=1}^N \nabla ( \sigma(\color{Red}{w_{11}^{(3)}} a_{1}^{(2)} + \color{Red}{w_{21}^{(3)}} a_{2}^{(2)} + \color{Red}{w_{31}^{(3)}} a_{3}^{(2)}) - y_i) ^ 2 $$

$$ = \frac{1}{N} \sum_{i=1}^N 2(\sigma(\color{Red}{w_{11}^{(3)}} a_{1}^{(2)} + \color{Red}{w_{21}^{(3)}} a_{2}^{(2)} + \color{Red}{w_{31}^{(3)}} a_{3}^{(2)}) - y_i) \\ \left \langle \sigma'(\color{Red}{w_{11}^{(3)}} a_{1}^{(2)} + \color{Red}{w_{21}^{(3)}} a_{2}^{(2)} + \color{Red}{w_{31}^{(3)}} a_{3}^{(2)}) a_{1}^{(2)},
\sigma'(\color{Red}{w_{11}^{(3)}} a_{1}^{(2)} + \color{Red}{w_{21}^{(3)}} a_{2}^{(2)} + \color{Red}{w_{31}^{(3)}} a_{3}^{(2)}) a_{2}^{(2)},
\sigma'(\color{Red}{w_{11}^{(3)}} a_{1}^{(2)} + \color{Red}{w_{21}^{(3)}} a_{2}^{(2)} + \color{Red}{w_{31}^{(3)}} a_{3}^{(2)})
a_{3}^{(2)} \right \rangle $$

$$ = \frac{1}{N} \sum_{i=1}^N 2(\sigma(\color{Red}{w_{11}^{(3)}} a_{1}^{(2)} + \color{Red}{w_{21}^{(3)}} a_{2}^{(2)} + \color{Red}{w_{31}^{(3)}} a_{3}^{(2)}) - y_i) \\ \left \langle \sigma'(\color{Red}{w_{11}^{(3)}} a_{1}^{(2)} + \color{Red}{w_{21}^{(3)}} a_{2}^{(2)} + \color{Red}{w_{31}^{(3)}} a_{3}^{(2)}),
\sigma'(\color{Red}{w_{11}^{(3)}} a_{1}^{(2)} + \color{Red}{w_{21}^{(3)}} a_{2}^{(2)} + \color{Red}{w_{31}^{(3)}} a_{3}^{(2)}),
\sigma'(\color{Red}{w_{11}^{(3)}} a_{1}^{(2)} + \color{Red}{w_{21}^{(3)}} a_{2}^{(2)} + \color{Red}{w_{31}^{(3)}} a_{3}^{(2)})
 \right \rangle \\ \langle a_{1}^{(2)}, a_{2}^{(2)}, a_{3}^{(2)} \rangle $$

 
Now let's simplify the notation by replacing $\color{Red}{w_{11}^{(3)}} a_{1}^{(2)} + \color{Red}{w_{21}^{(3)}} a_{2}^{(2)} + \color{Red}{w_{31}^{(3)}} a_{3}^{(2)} $ with $z_{1}^{(3)} $. Two vectors written side by side are multiplied entry by entry.

$$ \frac{1}{N} \sum_{i=1}^N 2(\sigma(z_{1}^{(3)}) - y_i) \left \langle \sigma'(z_{1}^{(3)} ), \sigma'(z_{1}^{(3)} ), \sigma'(z_{1}^{(3)} ) \right \rangle \langle a_{1}^{(2)}, a_{2}^{(2)}, a_{3}^{(2)} \rangle $$

 
To see the next step concretely, let $N = 4$. We use a second subscript to indicate the datum we are on, so the notation becomes $z_{1_i}^{(3)}$, where $i$ is the datum. So

$$ \frac{1}{4} \sum_{i=1}^4 2(\sigma(z_{1_i}^{(3)}) - y_i) \color{green}{\left \langle \sigma'(z_{1_i}^{(3)}), \sigma'(z_{1_i}^{(3)}), \sigma'(z_{1_i}^{(3)}) \right \rangle} \color{blue}{\langle a_{1_i}^{(2)}, a_{2_i}^{(2)}, a_{3_i}^{(2)} \rangle} =$$

Applying the sum to the vectors can be done term by term,

$$ \frac{1}{4} 2(\sigma(z_{1_1}^{(3)}) - y_1) 
\color{green}{\left \langle \sigma'(z_{1_1}^{(3)}), \sigma'(z_{1_1}^{(3)}), \sigma'(z_{1_1}^{(3)}) \right \rangle}  
\color{blue}{\langle a_{1_1}^{(2)}, a_{2_1}^{(2)}, a_{3_1}^{(2)} \rangle} + \\
\frac{1}{4} 2(\sigma(z_{1_2}^{(3)}) - y_2) 
\color{green}{\left \langle \sigma'(z_{1_2}^{(3)}), \sigma'(z_{1_2}^{(3)}), \sigma'(z_{1_2}^{(3)}) \right \rangle}
\color{blue}{\langle a_{1_2}^{(2)}, a_{2_2}^{(2)}, a_{3_2}^{(2)} \rangle} + \\
\frac{1}{4} 2(\sigma(z_{1_3}^{(3)}) - y_3) 
\color{green}{\left \langle \sigma'(z_{1_3}^{(3)}), \sigma'(z_{1_3}^{(3)}), \sigma'(z_{1_3}^{(3)}) \right \rangle}
\color{blue}{\langle a_{1_3}^{(2)}, a_{2_3}^{(2)}, a_{3_3}^{(2)} \rangle} + \\
\frac{1}{4} 2(\sigma(z_{1_4}^{(3)}) - y_4) 
\color{green}{\left \langle \sigma'(z_{1_4}^{(3)}), \sigma'(z_{1_4}^{(3)}), \sigma'(z_{1_4}^{(3)}) \right \rangle}
\color{blue}{\langle a_{1_4}^{(2)}, a_{2_4}^{(2)}, a_{3_4}^{(2)} \rangle}$$

Then we can collect the terms for each vector position, so that every position holds its own sum over the data,

$$\left \langle 
\left( \frac{2}{4} (\sigma(z_{1_1}^{(3)}) - y_1)\color{green}{\sigma'(z_{1_1}^{(3)})}\color{blue}{a_{1_1}^{(2)}} + \frac{2}{4} (\sigma(z_{1_2}^{(3)}) - y_2)\color{green}{\sigma'(z_{1_2}^{(3)})}\color{blue}{a_{1_2}^{(2)}} + \frac{2}{4}(\sigma(z_{1_3}^{(3)}) - y_3)\color{green}{\sigma'(z_{1_3}^{(3)})}\color{blue}{a_{1_3}^{(2)}} + \frac{2}{4}(\sigma(z_{1_4}^{(3)}) - y_4)\color{green}{\sigma'(z_{1_4}^{(3)})}\color{blue}{a_{1_4}^{(2)}} \right), \\
\left( \frac{2}{4} (\sigma(z_{1_1}^{(3)}) - y_1)\color{green}{\sigma'(z_{1_1}^{(3)})}\color{blue}{a_{2_1}^{(2)}} + \frac{2}{4} (\sigma(z_{1_2}^{(3)}) - y_2)\color{green}{\sigma'(z_{1_2}^{(3)})}\color{blue}{a_{2_2}^{(2)}} + \frac{2}{4}(\sigma(z_{1_3}^{(3)}) - y_3)\color{green}{\sigma'(z_{1_3}^{(3)})} \color{blue}{a_{2_3}^{(2)}} + \frac{2}{4}(\sigma(z_{1_4}^{(3)}) - y_4)\color{green}{\sigma'(z_{1_4}^{(3)})}\color{blue}{a_{2_4}^{(2)}}  \right), \\
\left( \frac{2}{4} (\sigma(z_{1_1}^{(3)}) - y_1)\color{green}{\sigma'(z_{1_1}^{(3)})}\color{blue}{a_{3_1}^{(2)}} + \frac{2}{4} (\sigma(z_{1_2}^{(3)}) - y_2)\color{green}{\sigma'(z_{1_2}^{(3)})} \color{blue}{a_{3_2}^{(2)}} + \frac{2}{4}(\sigma(z_{1_3}^{(3)}) - y_3)\color{green}{\sigma'(z_{1_3}^{(3)})} \color{blue}{a_{3_3}^{(2)}} + \frac{2}{4}(\sigma(z_{1_4}^{(3)}) - y_4)\color{green}{\sigma'(z_{1_4}^{(3)})}\color{blue}{a_{3_4}^{(2)}}\right)
\right \rangle 
$$

Each vector position is now a dot product over the data, so the whole gradient can be written as a matrix product (here $\odot$ multiplies two columns entry by entry),

$$\left \langle \frac{\partial \textrm{Cost}}{\partial \color{Red}{w_{11}^{(3)}}}, \frac{\partial \textrm{Cost}}{\partial \color{Red}{w_{21}^{(3)}}}, \frac{\partial \textrm{Cost}}{\partial \color{Red}{w_{31}^{(3)}}} \right \rangle =
\color{blue}{\sigma\left(\begin{pmatrix} 
z_{1_1}^{(2)} & z_{1_2}^{(2)} & z_{1_3}^{(2)} & z_{1_4}^{(2)} \\
z_{2_1}^{(2)} & z_{2_2}^{(2)} & z_{2_3}^{(2)} & z_{2_4}^{(2)} \\
z_{3_1}^{(2)} & z_{3_2}^{(2)} & z_{3_3}^{(2)} & z_{3_4}^{(2)} 
\end{pmatrix}\right)}
\cdot
\left(
\begin{bmatrix} 
\frac{1}{4} 2(\sigma(z_{1_1}^{(3)}) - y_1) \\
\frac{1}{4} 2(\sigma(z_{1_2}^{(3)}) - y_2) \\
\frac{1}{4} 2(\sigma(z_{1_3}^{(3)}) - y_3) \\
\frac{1}{4} 2(\sigma(z_{1_4}^{(3)}) - y_4) 
\end{bmatrix}
\odot
\color{green}
{\sigma'
\begin{pmatrix} 
z_{1_1}^{(3)} \\
z_{1_2}^{(3)} \\ 
z_{1_3}^{(3)} \\
z_{1_4}^{(3)}
\end{pmatrix}}
\right) =
\color{blue}{\sigma\left(\begin{pmatrix} 
z_{1_1}^{(2)} & z_{2_1}^{(2)} & z_{3_1}^{(2)} \\
z_{1_2}^{(2)} & z_{2_2}^{(2)} & z_{3_2}^{(2)} \\
z_{1_3}^{(2)} & z_{2_3}^{(2)} & z_{3_3}^{(2)} \\
z_{1_4}^{(2)} & z_{2_4}^{(2)} & z_{3_4}^{(2)} 
\end{pmatrix}\right)^ {T}} 
\cdot
\left(
\begin{bmatrix} 
\frac{1}{4} 2(\sigma(z_{1_1}^{(3)}) - y_1) \\
\frac{1}{4} 2(\sigma(z_{1_2}^{(3)}) - y_2) \\
\frac{1}{4} 2(\sigma(z_{1_3}^{(3)}) - y_3) \\
\frac{1}{4} 2(\sigma(z_{1_4}^{(3)}) - y_4) 
\end{bmatrix}
\odot
\color{green}
{\sigma'
\begin{pmatrix} 
z_{1_1}^{(3)} \\
z_{1_2}^{(3)} \\ 
z_{1_3}^{(3)} \\
z_{1_4}^{(3)}
\end{pmatrix}}
\right)
$$

Where,
$$
\color{blue}{\textrm{Output of the previous layer activated} = 
\sigma\left(\begin{pmatrix} 
z_{1_1}^{(2)} & z_{2_1}^{(2)} & z_{3_1}^{(2)} \\
z_{1_2}^{(2)} & z_{2_2}^{(2)} & z_{3_2}^{(2)} \\
z_{1_3}^{(2)} & z_{2_3}^{(2)} & z_{3_3}^{(2)} \\
z_{1_4}^{(2)} & z_{2_4}^{(2)} & z_{3_4}^{(2)} 
\end{pmatrix}\right)} \\ 
\textrm{Difference between guess and target} = 
\begin{bmatrix} 
\frac{1}{4} 2(\sigma(z_{1_1}^{(3)}) - y_1) \\
\frac{1}{4} 2(\sigma(z_{1_2}^{(3)}) - y_2) \\
\frac{1}{4} 2(\sigma(z_{1_3}^{(3)}) - y_3) \\
\frac{1}{4} 2(\sigma(z_{1_4}^{(3)}) - y_4) 
\end{bmatrix}\\
\color{green}{\textrm{Derivative applied to the pre-activation final value} = \sigma'
\begin{pmatrix} 
z_{1_1}^{(3)} \\
z_{1_2}^{(3)} \\ 
z_{1_3}^{(3)} \\
z_{1_4}^{(3)}
\end{pmatrix}}
$$

While there are many details here, the main point of note is that we have all of these pieces from the forward pass. 
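
The step from the term-by-term sum to the matrix product can be checked numerically. The sketch below builds the gradient for the last set of weights both ways on four made-up data points and compares them. The random values and the names `grad_loop`, `grad_matrix`, and `delta3` are assumptions for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

rng = np.random.default_rng(1)
N = 4
z2 = rng.normal(size=(N, 3))               # made-up pre-activations of layer 2
W3 = 2 * rng.random((3, 1)) - 1            # weights into the output node
y = np.array([[0.0], [1.0], [1.0], [0.0]]) # made-up targets

a2 = sigmoid(z2)
z3 = a2 @ W3
a3 = sigmoid(z3)

# Term-by-term sum over the data, one datum at a time.
grad_loop = np.zeros((3, 1))
for i in range(N):
    scale = (2 / N) * (a3[i, 0] - y[i, 0]) * a3[i, 0] * (1 - a3[i, 0])
    grad_loop[:, 0] += scale * a2[i]

# Same gradient as a single matrix product.
delta3 = (2 / N) * (a3 - y) * a3 * (1 - a3)  # entry-by-entry product of two columns
grad_matrix = a2.T @ delta3

print(np.allclose(grad_loop, grad_matrix))
```

Everything on the right-hand side (`a2`, `a3`, `y`) is available after a forward pass, which is the main point of the derivation.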

Now let's take a quick look at the next set of weights as we work our way backwards through the network. Consider again the cost surface,
$$ \textrm{Cost} = \frac{1}{N} \sum_{i=1}^N (a_{1}^{(3)} - y_i) ^ 2 $$

Now let's start building the derivatives. We are going to look at this in a slightly different way than we did above. In the next few lines we will find each of the functions that are composed to make up the Cost function. The arrow points to the derivative of each, and we can use the chain rule to build the full derivative.

$$\frac{1}{N} \sum_{i=1}^N (a_{1}^{(3)} - y_i) ^ 2 \ \ \xrightarrow{\frac{\partial}{\partial a_{1}^{(3)} }} \ \  \frac{2}{N} \sum_{i=1}^N (a_{1}^{(3)} - y_i)$$

$$ a_{1}^{(3)} = \sigma(z_{1}^{(3)}) \ \ \xrightarrow{\frac{\partial}{\partial z_{1}^{(3)} }} \ \ \sigma'(z_{1}^{(3)}) $$ 

which is the activation in the final node. And, 

$$ z_{1}^{(3)} = w_{11}^{(3)} a_{1}^{(2)} + w_{21}^{(3)} a_{2}^{(2)} + w_{31}^{(3)} a_{3}^{(2)} 
\ \ \xrightarrow{\frac{\partial}{\partial a_{1}^{(2)} }} \ \ w_{11}^{(3)} \\ 
z_{1}^{(3)} = w_{11}^{(3)} a_{1}^{(2)} + w_{21}^{(3)} a_{2}^{(2)} + w_{31}^{(3)} a_{3}^{(2)} 
\ \ \xrightarrow{\frac{\partial}{\partial a_{2}^{(2)} }} \ \ w_{21}^{(3)} \\
z_{1}^{(3)} = w_{11}^{(3)} a_{1}^{(2)} + w_{21}^{(3)} a_{2}^{(2)} + w_{31}^{(3)} a_{3}^{(2)} 
\ \ \xrightarrow{\frac{\partial}{\partial a_{3}^{(2)} }} \ \ w_{31}^{(3)}  $$

But to move back to the weights in the layer before we need to keep building. 

So,
$$
a_{1}^{(2)} = \sigma(z_{1}^{(2)}) \ \ \xrightarrow{\frac{\partial}{\partial z_{1}^{(2)} }} \ \ \sigma'(z_{1}^{(2)})  \\
a_{2}^{(2)} = \sigma(z_{2}^{(2)}) \ \ \xrightarrow{\frac{\partial}{\partial z_{2}^{(2)} }} \ \ \sigma'(z_{2}^{(2)}) \\
a_{3}^{(2)} = \sigma(z_{3}^{(2)}) \ \ \xrightarrow{\frac{\partial}{\partial z_{3}^{(2)} }} \ \ \sigma'(z_{3}^{(2)})
$$

and,
$$
z_{1}^{(2)} = w_{11}^{(2)} a_{1}^{(1)} + w_{21}^{(2)} a_{2}^{(1)} + w_{31}^{(2)} a_{3}^{(1)} + w_{41}^{(2)} a_{4}^{(1)}
 \ \ \xrightarrow{\frac{\partial}{\partial w_{11}^{(2)} }} \ \ a_{1}^{(1)} \\
z_{1}^{(2)} = w_{11}^{(2)} a_{1}^{(1)} + w_{21}^{(2)} a_{2}^{(1)} + w_{31}^{(2)} a_{3}^{(1)} + w_{41}^{(2)} a_{4}^{(1)}
 \ \ \xrightarrow{\frac{\partial}{\partial w_{21}^{(2)} }} \ \ a_{2}^{(1)} \\
z_{1}^{(2)} = w_{11}^{(2)} a_{1}^{(1)} + w_{21}^{(2)} a_{2}^{(1)} + w_{31}^{(2)} a_{3}^{(1)} + w_{41}^{(2)} a_{4}^{(1)}
 \ \ \xrightarrow{\frac{\partial}{\partial w_{31}^{(2)} }} \ \ a_{3}^{(1)} \\
z_{1}^{(2)} = w_{11}^{(2)} a_{1}^{(1)} + w_{21}^{(2)} a_{2}^{(1)} + w_{31}^{(2)} a_{3}^{(1)} + w_{41}^{(2)} a_{4}^{(1)}
 \ \ \xrightarrow{\frac{\partial}{\partial w_{41}^{(2)} }} \ \ a_{4}^{(1)} \\
$$

$$
z_{2}^{(2)} = w_{12}^{(2)} a_{1}^{(1)} + w_{22}^{(2)} a_{2}^{(1)} + w_{32}^{(2)} a_{3}^{(1)} + w_{42}^{(2)} a_{4}^{(1)}
 \ \ \xrightarrow{\frac{\partial}{\partial w_{12}^{(2)} }} \ \ a_{1}^{(1)} \\
z_{2}^{(2)} = w_{12}^{(2)} a_{1}^{(1)} + w_{22}^{(2)} a_{2}^{(1)} + w_{32}^{(2)} a_{3}^{(1)} + w_{42}^{(2)} a_{4}^{(1)}
 \ \ \xrightarrow{\frac{\partial}{\partial w_{22}^{(2)} }} \ \ a_{2}^{(1)} \\
z_{2}^{(2)} = w_{12}^{(2)} a_{1}^{(1)} + w_{22}^{(2)} a_{2}^{(1)} + w_{32}^{(2)} a_{3}^{(1)} + w_{42}^{(2)} a_{4}^{(1)}
 \ \ \xrightarrow{\frac{\partial}{\partial w_{32}^{(2)} }} \ \ a_{3}^{(1)} \\
z_{2}^{(2)} = w_{12}^{(2)} a_{1}^{(1)} + w_{22}^{(2)} a_{2}^{(1)} + w_{32}^{(2)} a_{3}^{(1)} + w_{42}^{(2)} a_{4}^{(1)} 
\ \ \xrightarrow{\frac{\partial}{\partial w_{42}^{(2)} }} \ \ a_{4}^{(1)} \\
$$

$$
z_{3}^{(2)} = w_{13}^{(2)} a_{1}^{(1)} + w_{23}^{(2)} a_{2}^{(1)} + w_{33}^{(2)} a_{3}^{(1)} + w_{43}^{(2)} a_{4}^{(1)}
\ \ \xrightarrow{\frac{\partial}{\partial w_{13}^{(2)} }} \ \ a_{1}^{(1)} \\
z_{3}^{(2)} = w_{13}^{(2)} a_{1}^{(1)} + w_{23}^{(2)} a_{2}^{(1)} + w_{33}^{(2)} a_{3}^{(1)} + w_{43}^{(2)} a_{4}^{(1)}
\ \ \xrightarrow{\frac{\partial}{\partial w_{23}^{(2)} }} \ \ a_{2}^{(1)} \\
z_{3}^{(2)} = w_{13}^{(2)} a_{1}^{(1)} + w_{23}^{(2)} a_{2}^{(1)} + w_{33}^{(2)} a_{3}^{(1)} + w_{43}^{(2)} a_{4}^{(1)}
\ \ \xrightarrow{\frac{\partial}{\partial w_{33}^{(2)} }} \ \ a_{3}^{(1)} \\
z_{3}^{(2)} = w_{13}^{(2)} a_{1}^{(1)} + w_{23}^{(2)} a_{2}^{(1)} + w_{33}^{(2)} a_{3}^{(1)} + w_{43}^{(2)} a_{4}^{(1)}
\ \ \xrightarrow{\frac{\partial}{\partial w_{43}^{(2)} }} \ \ a_{4}^{(1)} \\
$$

**NOTE: The products between the scalars, vectors and matrices below are not clear. The pattern above would be needed to turn the sum to a matrix product. The following math is to showcase the pattern and the idea of what will happen in the layers that follow.** 

So the pieces that make up the gradient are, 
$$
\frac{2}{N} \sum_{i=1}^N (a_{1}^{(3)} - y_i)
\sigma'(z_{1}^{(3)})
\langle w_{11}^{(3)}, w_{21}^{(3)}, w_{31}^{(3)} \rangle
\langle \sigma'(z_{1}^{(2)}), \sigma'(z_{2}^{(2)}), \sigma'(z_{3}^{(2)}) \rangle
\langle a_{1}^{(1)}, a_{2}^{(1)}, a_{3}^{(1)}, a_{4}^{(1)} \rangle
$$

And the pattern continues,
$$
\frac{2}{N} \sum_{i=1}^N (a_{1}^{(3)} - y_i)
\sigma'(z_{1}^{(3)})
\langle w_{11}^{(3)}, w_{21}^{(3)}, w_{31}^{(3)} \rangle
\langle \sigma'(z_{1}^{(2)}), \sigma'(z_{2}^{(2)}), \sigma'(z_{3}^{(2)}) \rangle
\begin{bmatrix}
w_{11}^{(2)} & w_{12}^{(2)} & w_{13}^{(2)} \\
w_{21}^{(2)} & w_{22}^{(2)} & w_{23}^{(2)} \\
w_{31}^{(2)} & w_{32}^{(2)} & w_{33}^{(2)} \\
w_{41}^{(2)} & w_{42}^{(2)} & w_{43}^{(2)} 
\end{bmatrix}
\langle \sigma'(z_{1}^{(1)}), \sigma'(z_{2}^{(1)}), \sigma'(z_{3}^{(1)}),  \sigma'(z_{4}^{(1)})  \rangle
X
$$

### **Gradient Structure Reference:**
$$ \textrm{learning rate} \ \color{green}{(a_{1}^{(3)} - y_i)} 
\color{blue}{\sigma'(z_{1}^{(3)})} \langle a_{1}^{(2)}, a_{2}^{(2)}, a_{3}^{(2)} \rangle
$$

$$
\textrm{learning rate} \ \color{green}{(a_{1}^{(3)} - y_i) \sigma'(z_{1}^{(3)}) \textrm{W3}}
\color{blue}{ \langle \sigma'(z_{1}^{(2)}), \sigma'(z_{2}^{(2)}), \sigma'(z_{3}^{(2)}) \rangle}
\langle a_{1}^{(1)}, a_{2}^{(1)}, a_{3}^{(1)}, a_{4}^{(1)} \rangle
$$

$$
 \textrm{learning rate} \ \color{green}{(a_{1}^{(3)} - y_i) \sigma'(z_{1}^{(3)}) \textrm{W3}
 \langle \sigma'(z_{1}^{(2)}), \sigma'(z_{2}^{(2)}), \sigma'(z_{3}^{(2)}) \rangle \textrm{W2}}
\color{blue} {\langle \sigma'(z_{1}^{(1)}), \sigma'(z_{2}^{(1)}), \sigma'(z_{3}^{(1)}),  \sigma'(z_{4}^{(1)}) \rangle}
X
$$
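
The three reference lines above can be sketched in code and checked against a numerical derivative. This is a minimal sketch on made-up data, not the notebook's `back_prop`: it keeps the $\frac{2}{N}$ factor explicit and leaves out the learning rate. The names `cost`, `gradients`, and `numeric_entry` are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def cost(X, y, W1, W2, W3):
    a3 = sigmoid(sigmoid(sigmoid(X @ W1) @ W2) @ W3)
    return np.mean((a3 - y) ** 2)

def gradients(X, y, W1, W2, W3):
    a1 = sigmoid(X @ W1)
    a2 = sigmoid(a1 @ W2)
    a3 = sigmoid(a2 @ W3)
    d3 = (2 / len(X)) * (a3 - y) * a3 * (1 - a3)  # (guess - target) * sigma'(z3)
    d2 = (d3 @ W3.T) * a2 * (1 - a2)              # back through W3, times sigma'(z2)
    d1 = (d2 @ W2.T) * a1 * (1 - a1)              # back through W2, times sigma'(z1)
    return X.T @ d1, a1.T @ d2, a2.T @ d3         # gradients for W1, W2, W3

rng = np.random.default_rng(2)
X = rng.random((6, 4))                            # made-up batch
y = rng.integers(0, 2, size=(6, 1)).astype(float) # made-up binary targets
W1, W2, W3 = (2 * rng.random(s) - 1 for s in [(4, 4), (4, 3), (3, 1)])

g1, g2, g3 = gradients(X, y, W1, W2, W3)

def numeric_entry(W, idx, h=1e-6):
    # Central difference of the cost with respect to one weight entry.
    old = W[idx]
    W[idx] = old + h
    up = cost(X, y, W1, W2, W3)
    W[idx] = old - h
    down = cost(X, y, W1, W2, W3)
    W[idx] = old
    return (up - down) / (2 * h)

gaps = [abs(numeric_entry(W1, (0, 0)) - g1[0, 0]),
        abs(numeric_entry(W2, (1, 2)) - g2[1, 2]),
        abs(numeric_entry(W3, (2, 0)) - g3[2, 0])]
print(gaps)  # each gap should be tiny if the chain rule was applied correctly
```

Each `d` term reuses the one computed for the layer after it, which is the green part of the reference formulas growing from line to line.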

 <p style="text-align: center;"> <img src= nn_an.png width=500 alt='[img: SVM]'/>  </p>

In [None]:
def back_prop(X_data, learning_rate, layers, weights, error):
    # Unpack the pre-activation values saved by the forward pass.
    z1, z2, z3 = layers[0], layers[1], layers[2]
    W1, W2, W3 = weights

    # Activations; the sigmoid derivative is sigma'(z) = sigma(z) * (1 - sigma(z)).
    a1, a2, a3 = sigmoid(z1), sigmoid(z2), sigmoid(z3)

    # Layer 3: error times the sigmoid derivative, then onto W3.
    # The constant 2/N from the cost gradient is absorbed into the learning rate.
    l3_delta = error * a3 * (1 - a3)
    W3_update = np.dot(a2.T, l3_delta)

    # Layer 2: push the delta back through W3.
    l2_error = np.dot(l3_delta, W3.T)
    l2_delta = l2_error * a2 * (1 - a2)
    W2_update = np.dot(a1.T, l2_delta)

    # Layer 1: push the delta back through W2.
    l1_error = np.dot(l2_delta, W2.T)
    l1_delta = l1_error * a1 * (1 - a1)
    W1_update = np.dot(X_data.T, l1_delta)

    # Gradient descent step (the weight arrays are updated in place).
    W3 -= learning_rate * W3_update
    W2 -= learning_rate * W2_update
    W1 -= learning_rate * W1_update
    
    return [W1,W2,W3]
    

## **Training**

For the training we need a bit of each of the above. The steps are:
- Loop over all training data (each pass is an epoch).
    - Break the data into batches (the loop below sends the whole training set through as a single batch).
    - Send a batch forward through the network.
    - Update the weights with the gradients computed above.
    
Things that we will need to send into the training loop:
- the training data.
- the target data.
- number of epochs.
- learning rate.
- weights.
- layers that we will need to make the gradients.
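
The "break the data into batches" step can be sketched as a small generator. This is one possible approach under stated assumptions: the helper `iterate_minibatches` is hypothetical, it works on NumPy arrays (a DataFrame would first need `.to_numpy()`), and the toy arrays are made up.

```python
import numpy as np

def iterate_minibatches(X, Y, batch_size, rng):
    # Shuffle the row order once per epoch, then hand out consecutive slices.
    order = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        rows = order[start:start + batch_size]
        yield X[rows], Y[rows]

rng = np.random.default_rng(3)
X_toy = np.arange(40, dtype=float).reshape(10, 4)  # 10 made-up data points
Y_toy = (np.arange(10) % 2).reshape(10, 1)         # made-up binary targets

batch_sizes = [len(xb) for xb, yb in iterate_minibatches(X_toy, Y_toy, 4, rng)]
print(batch_sizes)  # [4, 4, 2]
```

Inside a training loop, each `(xb, yb)` pair would go through a forward pass and one weight update, so an epoch with three batches makes three updates instead of one.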

In [None]:
# Reset the brain 
W1,W2,W3 = brain()

In [None]:
epoch = 2000
training_features = X_train
learning_rate = .01

In [None]:
history = []
for e in range(epoch + 1):

    layers = forward_pass(training_features, W1, W2, W3) # Forward pass.
    error = layers[3] - Y_train # Evaluation.
    W1,W2,W3 = back_prop(X_train, learning_rate, layers, [W1,W2,W3], error) # Back propagation.
    history.append([e, np.mean((layers[3] - np.asarray(Y_train)) ** 2)]) # Mean squared error for this epoch.
    
    # print('\r',"Our elevation on the cost surface (training):", np.mean((layers[3] - np.asarray(Y_train)) ** 2), end="                          ")
    
history = np.array(history)

In [None]:
plt.scatter(history[:,0], history[:,1], marker='.' )
plt.xlabel("Epochs")
plt.ylabel("Cost")

## **Evaluation**

In [None]:
s = 0
for x,y in zip(np.array(X_test),np.array(Y_test)): # Loop over every test datum.
    A = forward_pass(x, W1, W2, W3)
    if np.round(A[3]) == y: # Round the guess to 0 or 1 and compare it with the target.
        s += 1
        
A_Test = forward_pass(X_test, W1, W2, W3)
print("Our elevation on the cost surface (testing):", np.mean((A_Test[3] - np.asarray(Y_test)) ** 2))
print("score:", 100 * s / len(X_test),"%")
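
The same score can be computed without a loop by rounding the whole column of guesses at once. The sketch below uses made-up guesses and targets; the function name `accuracy` is hypothetical.

```python
import numpy as np

def accuracy(guesses, targets):
    # Round each guess to 0 or 1, then report the percentage that match the targets.
    predicted = np.round(np.asarray(guesses, dtype=float).reshape(-1))
    actual = np.asarray(targets, dtype=float).reshape(-1)
    return 100 * np.mean(predicted == actual)

toy_guesses = np.array([[0.9], [0.2], [0.6], [0.4]])
toy_targets = np.array([[1], [0], [1], [1]])

toy_score = accuracy(toy_guesses, toy_targets)
print(toy_score)  # 75.0
```

With the notebook's variables, the call would be `accuracy(A_Test[3], Y_test)`, assuming `A_Test` comes from a forward pass over the full test set as above.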