# 6. Building your first Deep Neural Network: Introduction to Backpropagation

In this Chapter:
- The Streetlight Problem.
- Matrices and the Matrix relationship.
- Full, Batch, and Stochastic Gradient Descent.
- Neural Networks learn Correlation.
- Overfitting.
- Creating your Own correlation.
- Backpropagation: Long-distance error attribution.
- Linear versus Non-Linear.
- The Secret to Sometimes Correlation.
- Your first Deep Network.
- Backpropagation in Code: Bringing it all together.

> *"O Deep Thought computer," he said, "the task we have designed you to perform is this. We want you to tell us..." he paused, "The Answer."* - Douglas Adams, The Hitchhiker's Guide to the Galaxy

## The Streetlight Problem
### This toy problem considers how a network learns entire datasets

- Consider yourself approaching a street corner in a foreign country.
    - As you approach, you look up and realize that the street light is unfamiliar.
        - How can you know when it's safe to cross the street?

<div style="text-align:center;">
    <img style="width:33%" src="static/imgs/06/streetlight_data.png" />
</div>

- To solve this problem, you might sit at a street corner for a few minutes observing the correlation between each light combination and whether people around you choose to stop or walk.
- In the observed data, there is a perfect correlation between the middle light and whether it's safe to walk.
- You learned this pattern by observing all individual data points and **searching for correlation**.
    - **This is what you're going to train a neural network to do!**

## Preparing the Data
### Neural Networks Don't Read Streetlights

- You have two datasets
    - On the one hand, you have six streetlight states.
    - On the other hand, you have six observations of whether people walked.
- To Prepare this data for the neural network, you have to first split it into two groups
    - What you know
    - What you want to know

## Matrices & the Matrix Relationship
### Translate the Streetlight into Math

<div style="text-align:center;">
    <img style="width:33%" src="static/imgs/06/symbols-to-numbers.png" />
    <img style="width:20%" src="static/imgs/06/output-pattern.png" />
</div>

- You want to teach a neural network to translate a streetlight pattern into the correct stop/walk pattern.
- What you really want to do is mimic the pattern of the streetlight in the form of numbers.
- In data matrices, it's convention to give each recorded example a single row.
- It's also convention to give each thing being recorded a single column.
    - This makes the matrix easy to read.

### Good Data Matrices perfectly mimic the outside world

- The data matrix doesn't have to be all 1s and 0s.
- A Matrix should mimic the patterns that exist in the real world.
    - So you can ask the computer to mimic them.
- An infinite number of matrices exist that perfectly reflect the streetlight patterns in the dataset.
- The underlying pattern isn't the same as the matrix.
    - **It's a Property of the Matrix**.
    - **The pattern is what the matrix is expressing.**
    - The pattern also existed in the streetlights.
- The resulting matrix is called a **lossless representation**
    - because you can perfectly convert back and forth between your stop/walk notes and the matrix.
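
As a small sketch of the "infinite number of matrices" point, the following builds two other matrices that express the same streetlight pattern and checks that each converts back to the original. The names `streetlights_scaled` and `streetlights_shifted`, and the particular numbers chosen for "on" and "off", are assumptions made for illustration.

```python
import numpy as np

streetlights = np.array([[1, 0, 1],
                         [0, 1, 1],
                         [0, 0, 1],
                         [1, 1, 1],
                         [0, 1, 1],
                         [1, 0, 1]])

# Same pattern, different numbers: "on" is 10 instead of 1.
streetlights_scaled = streetlights * 10

# Same pattern again: "off" is 0.1 and "on" is 0.9.
streetlights_shifted = streetlights * 0.8 + 0.1

# Each variant converts back to the original without losing anything.
recovered_from_scaled = streetlights_scaled // 10
recovered_from_shifted = np.rint((streetlights_shifted - 0.1) / 0.8).astype(int)

print(np.array_equal(recovered_from_scaled, streetlights))
print(np.array_equal(recovered_from_shifted, streetlights))
```

The pattern is the same in all three matrices; only the numbers used to express it differ.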

## Creating a Matrix or Two in Python
### Import the Matrices into Python

- Let's Create the Streetlight pattern matrix

In [1]:
import numpy as np

In [2]:
streetlights = np.array([[1,0,1], [0,1,1], [0,0,1], [1,1,1], [0,1,1], [1,0,1]])

- What is NumPy?
    - NumPy is really just a fancy wrapper for an array of arrays that provides special, matrix-oriented functions.
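
A minimal sketch of what "matrix-oriented functions" can look like in practice. The vectors `a` and `b` are made-up values for illustration; the operations shown are scalar multiplication, element-wise multiplication, and the dot product used for predictions in these notes.

```python
import numpy as np

a = np.array([0, 1, 2, 1])
b = np.array([2, 2, 2, 3])

scaled = a * 0.1        # multiply every element by a scalar
elementwise = a * b     # multiply matching elements: [0, 2, 4, 3]
dot_product = a.dot(b)  # weighted sum: 0*2 + 1*2 + 2*2 + 1*3 = 9

print(scaled)
print(elementwise)
print(dot_product)
```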

In [4]:
walk_vs_stop = np.array([[0],[1],[0],[1],[1],[0]])

## Building a Neural Network

In [12]:
# I will attempt to build it.
ws = np.random.rand(streetlights.shape[1])
x_i = streetlights[0]
y_i = walk_vs_stop[0]
lr = .1

for iteration in range(20):
    # predict.
    prediction = x_i.dot(ws)
    # MSE error.
    error = (prediction - y_i) ** 2
    # update weights.
    for j in range(len(ws)):
        gradient = 2 * x_i[j] * (prediction - y_i)
        ws[j] -= lr * gradient
    print("Prediction : " + str(prediction) + " Reality : " + str(y_i[0]) + " Error : " + str(error[0]))

Prediction : 0.9104498947095488 Reality : 0 Error : 0.8289190107766286
Prediction : 0.5462699368257292 Reality : 0 Error : 0.2984108438795862
Prediction : 0.3277619620954375 Reality : 0 Error : 0.10742790379665101
Prediction : 0.1966571772572625 Reality : 0 Error : 0.03867404536679436
Prediction : 0.11799430635435748 Reality : 0 Error : 0.013922656332045966
Prediction : 0.07079658381261444 Reality : 0 Error : 0.005012156279536542
Prediction : 0.04247795028756862 Reality : 0 Error : 0.0018043762606331512
Prediction : 0.025486770172541195 Reality : 0 Error : 0.0006495754538279355
Prediction : 0.01529206210352474 Reality : 0 Error : 0.00023384716337805747
Prediction : 0.009175237262114888 Reality : 0 Error : 8.418497881610151e-05
Prediction : 0.005505142357268955 Reality : 0 Error : 3.030659237379679e-05
Prediction : 0.0033030854143614174 Reality : 0 Error : 1.0910373254567137e-05
Prediction : 0.0019818512486168283 Reality : 0 Error : 3.927734371644081e-06
Prediction : 0.00118911074917005

In [13]:
# Book Implementation.
weights = np.array([.5, .48, -.7])
alpha = .1

input = streetlights[0]
goal_prediction = walk_vs_stop[0]

for iteration in range(20):
    prediction = input.dot(weights)
    error = (goal_prediction - prediction) ** 2
    delta = prediction - goal_prediction
    weights = weights - (alpha * (input * delta))
    print("Error:" + str(error) + " Prediction: " + str(prediction))

Error:[0.04] Prediction: -0.19999999999999996
Error:[0.0256] Prediction: -0.15999999999999992
Error:[0.016384] Prediction: -0.1279999999999999
Error:[0.01048576] Prediction: -0.10239999999999982
Error:[0.00671089] Prediction: -0.08191999999999977
Error:[0.00429497] Prediction: -0.06553599999999982
Error:[0.00274878] Prediction: -0.05242879999999994
Error:[0.00175922] Prediction: -0.04194304000000004
Error:[0.0011259] Prediction: -0.03355443200000008
Error:[0.00072058] Prediction: -0.02684354560000002
Error:[0.00046117] Prediction: -0.021474836479999926
Error:[0.00029515] Prediction: -0.01717986918399994
Error:[0.00018889] Prediction: -0.013743895347199997
Error:[0.00012089] Prediction: -0.010995116277759953
Error:[7.73712525e-05] Prediction: -0.008796093022207963
Error:[4.95176016e-05] Prediction: -0.007036874417766459
Error:[3.1691265e-05] Prediction: -0.0056294995342132115
Error:[2.02824096e-05] Prediction: -0.004503599627370569
Error:[1.29807421e-05] Prediction: -0.00360287970189654

## Learning the Whole Dataset
### The Neural Network has been learning only one streetlight. Don't we want it to learn them all?

- Thus far, you've trained neural networks that learned how to model a single training example. 

In [14]:
# let's generalize the algorithm ourselves.
# I will attempt to build it.
ws = np.random.rand(streetlights.shape[1])
lr = .1
epochs = 7

for iteration in range(epochs):
    for i in range(len(streetlights)):
        # predict.
        prediction = streetlights[i].dot(ws)
        # MSE error.
        error = (prediction - walk_vs_stop[i]) ** 2
        # update weights.
        for j in range(len(ws)):
            gradient = 2 * streetlights[i][j] * (prediction - walk_vs_stop[i])
            ws[j] -= lr * gradient
        print("Prediction : " + str(prediction) + " Reality : " + str(walk_vs_stop[i][0]) + " Error : " + str(error[0]))

Prediction : 1.115349567974397 Reality : 0 Error : 1.2440046587806741
Prediction : 1.3518921380661582 Reality : 1 Error : 0.12382807683277211
Prediction : 0.5656721671456411 Reality : 0 Error : 0.3199850006832461
Prediction : 1.1311599954363323 Reality : 1 Error : 0.017202944402858703
Prediction : 1.0455368512360337 Reality : 1 Error : 0.0020736048204926623
Prediction : 0.42412551132053866 Reality : 0 Error : 0.17988244935290837
Prediction : 0.2544753067923232 Reality : 0 Error : 0.06475768176704702
Prediction : 0.8916019471190477 Reality : 1 Error : 0.011750137868381727
Prediction : 0.30315781133565767 Reality : 0 Error : 0.09190465857382621
Prediction : 0.7455365893202238 Reality : 1 Error : 0.06475162737478443
Prediction : 0.9761149702762076 Reality : 1 Error : 0.0005704946449064446
Prediction : 0.22029560260112177 Reality : 0 Error : 0.04853015252539137
Prediction : 0.13217736156067306 Reality : 0 Error : 0.017470854909140892
Prediction : 0.9151743893333657 Reality : 1 Error : 0.00

In [30]:
# our final weights predictions. Compared with ground truths
[ws.dot(streetlight) for streetlight in streetlights], streetlights, walk_vs_stop

([0.026286497591032618,
  1.0077360597065974,
  0.11707414150199424,
  0.9169484157956358,
  1.0077360597065974,
  0.026286497591032618],
 array([[1, 0, 1],
        [0, 1, 1],
        [0, 0, 1],
        [1, 1, 1],
        [0, 1, 1],
        [1, 0, 1]]),
 array([[0],
        [1],
        [0],
        [1],
        [1],
        [0]]))

In [31]:
# Book Implementation.
weights = np.array([.5, .48, -.7])
alpha = .1

for iteration in range(20):
    error_for_all_lights = 0
    for row_index in range(len(walk_vs_stop)):
        input = streetlights[row_index]
        goal_prediction = walk_vs_stop[row_index]
        
        prediction = input.dot(weights)
        
        error = (goal_prediction - prediction) ** 2
        error_for_all_lights += error
        
        delta = prediction - goal_prediction
        weights = weights - (alpha * (input * delta))
    print("Error:" + str(error_for_all_lights))

Error:[2.65612311]
Error:[0.96287018]
Error:[0.55091659]
Error:[0.36445837]
Error:[0.25167687]
Error:[0.17797575]
Error:[0.12864461]
Error:[0.09511037]
Error:[0.07194564]
Error:[0.05564915]
Error:[0.04394764]
Error:[0.03535797]
Error:[0.028907]
Error:[0.02395166]
Error:[0.02006311]
Error:[0.01695209]
Error:[0.01442082]
Error:[0.01233174]
Error:[0.01058739]
Error:[0.00911723]


## Full, Batch, and Stochastic Gradient Descent
### Stochastic Gradient Descent updates weights one example at a time

- This Idea of learning one example at a time is called **Stochastic Gradient Descent**.
    - It performs a prediction and weight update for each training example separately.
    - It iterates through the entire dataset many times until it can find a weight configuration that works well for the entire dataset.

### (Full) Gradient Descent updates weights one dataset at a time

- Instead of updating the weights once for each training example, the network calculates the average weight update over the entire dataset,
    - changing the weights only each time it computes a full average.

### Batch Gradient Descent updates weights taking in n examples

- Instead of updating the weights after one example or after the entire dataset, you choose a batch size (typically between 8 and 256).
    - The weights are updated once per batch of that many examples.
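
The three variants differ only in how many examples contribute to each weight update, so they can be sketched with a single function. The helper `train` and its `batch_size`, `alpha`, and `epochs` parameters are hypothetical names introduced here; the starting weights are the ones used in the book implementation above. A batch size of 1 corresponds to stochastic gradient descent, the dataset length to full gradient descent, and anything in between to batch gradient descent.

```python
import numpy as np

streetlights = np.array([[1, 0, 1], [0, 1, 1], [0, 0, 1],
                         [1, 1, 1], [0, 1, 1], [1, 0, 1]])
walk_vs_stop = np.array([[0], [1], [0], [1], [1], [0]])


def train(inputs, goals, batch_size, alpha=0.1, epochs=200):
    """Linear network trained with one weight update per batch."""
    weights = np.array([[0.5], [0.48], [-0.7]])
    for _ in range(epochs):
        for start in range(0, len(inputs), batch_size):
            batch_in = inputs[start:start + batch_size]
            batch_goal = goals[start:start + batch_size]
            delta = batch_in.dot(weights) - batch_goal      # shape (n, 1)
            # Average the per-example weight deltas over the batch.
            weight_delta = batch_in.T.dot(delta) / len(batch_in)
            weights -= alpha * weight_delta
    total_error = float(np.sum((inputs.dot(weights) - goals) ** 2))
    return weights, total_error


errors = {}
for batch_size in (1, 3, len(streetlights)):
    _, errors[batch_size] = train(streetlights, walk_vs_stop, batch_size)
    print("batch_size:", batch_size, "total error:", errors[batch_size])
```

Larger batches make fewer updates per pass over the data, so with the same `alpha` and number of epochs they tend to end at a higher error than the one-example-at-a-time version.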

## Neural Networks Learn Correlation
### What did the last neural network learn?

- The correlation is located wherever the weights were set to high numbers.
- Inversely, randomness with respect to the input is found wherever the weights converge to 0.
- How did the network Identify Correlation?
    - In the Process of Gradient Descent, each training example asserts either *up pressure* or *down pressure* on the weights.
    - On average, there was more *up pressure* for the middle weight and more *down pressure* for the other two.
    - Where does the pressure come from?
    - Why is it different for different weights?

## Up and Down Pressure
### It comes from the Data

- Each node is individually trying to correctly predict the output given the input.
- For the most part, each node ignores all other nodes when attempting to do so.
- **The Only cross communication that occurs is that all 3 weights must share the same error measure**.
- **The weight update is nothing more than taking this shared error measure and multiplying it by each respective input**.

<div style="text-align:center;">
    <img style="width:50%" src="static/imgs/06/weight_update.png" />
</div>

- **A Key part of why neural networks learn is error attribution, which means given a shared error, the network needs to figure out which weights contributed (so they can be adjusted) and which weights did not contribute (so they can be left alone)**.
- On average, this causes the network to find the correlation present between the middle weight and the output to be the dominant predictive force,
    - making the network quite accurate.
- Bottom line:
    - The prediction is a weighted sum of the inputs.
    - The learning algorithm rewards inputs that correlate with the output with upward pressure (toward 1)
    - and penalizes inputs that don't correlate with the output with downward pressure (toward 0).
    - The weighted sum of the inputs finds perfect correlation between the input and the output by weighting decorrelated inputs to 0.
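
To make the up and down pressure concrete, the sketch below computes the direction in which each training example pushes each weight. It assumes every weight starts at 0.3 (an arbitrary choice so that every prediction lies strictly between 0 and 1) and looks only at the sign of the update, not its size.

```python
import numpy as np

streetlights = np.array([[1, 0, 1], [0, 1, 1], [0, 0, 1],
                         [1, 1, 1], [0, 1, 1], [1, 0, 1]])
walk_vs_stop = np.array([[0], [1], [0], [1], [1], [0]])

# Assumed starting point: every weight at 0.3.
weights = np.array([0.3, 0.3, 0.3])

predictions = streetlights.dot(weights)       # shape (6,)
deltas = predictions - walk_vs_stop[:, 0]     # positive means "predicted too high"

# The update is weights -= alpha * input * delta, so the direction each
# example pushes each weight is the sign of -(input * delta).
pressure = np.sign(-(streetlights * deltas[:, None])).astype(int)

print(pressure)              # one row per example: +1 up, -1 down, 0 none
print(pressure.sum(axis=0))  # net count per weight (left, middle, right)
```

Counting signs only, the middle weight receives upward pressure from three examples and downward pressure from none, while the left weight nets out at one push down and the right weight at zero. An input of 0 contributes no pressure at all, which is how the shared error gets attributed only to the weights that took part in the prediction.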

## Edge Case: Overfitting
### Sometimes Correlation happens accidentally 

- Error is shared among all of the weights.
- If a particular configuration of weights accidentally creates perfect correlation between the prediction and the output dataset
    - **The Neural Network will stop Learning**
- In essence, **It memorized** the two training examples instead of finding the correlation that will generalize to any possible streetlight configuration.
- The greatest challenge you'll face with deep learning is convincing your neural network to **generalize** instead of just **memorize**.
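
A minimal sketch of this edge case, assuming the two training examples are the first two streetlight rows and using hand-picked weights (`[1.0, 2.0, -1.0]`, chosen only for illustration). The weights fit both training rows exactly, so every delta is zero and learning stops, yet they are wrong on rows the network never saw.

```python
import numpy as np

# The two training examples the network sees (an assumption for this sketch).
train_inputs = np.array([[1, 0, 1],
                         [0, 1, 1]])
train_goals = np.array([0, 1])

# Hand-picked weights that happen to fit both examples.
weights = np.array([1.0, 2.0, -1.0])

train_deltas = train_inputs.dot(weights) - train_goals
print("training deltas:", train_deltas)   # all zero, so every weight update is zero

# Streetlight patterns the network never saw.
unseen_inputs = np.array([[0, 0, 1],
                          [1, 1, 1]])
unseen_goals = np.array([0, 1])
unseen_deltas = unseen_inputs.dot(weights) - unseen_goals
print("unseen deltas:", unseen_deltas)
```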

## Edge Case: Conflicting Pressure
### Sometimes correlation fights itself

- As other nodes learn, they absorb some of the error; they absorb part of the correlation.
- This causes the network to predict with moderate correlative power, which reduces the error.
- The other weights then only try to adjust themselves to correctly predict what's left.
- **Regularization** forces weights with conflicting pressure to move toward 0.
- Regularization aims to say that only weights with really strong correlation can stay on.

<div style="text-align:center;">
    <img style="width:50%" src="static/imgs/06/latent_correlation.png" />
</div>

- In the case of a network with only an input layer and an output layer, each weight learns for itself and finds correlation between its associated column and the output.
- What about when the correlation is indirect, that is, when a combination of the inputs is correlated with the output but no individual column is?
    - To solve this, we use the **Multi-Layer Perceptron** architecture.

## Learning Indirect Correlation
### If your Data doesn't have correlation, create intermediate data that does!

- Neural networks search for correlation between their input and output **layers**.
- Because the Input dataset doesn't correlate with the output dataset, you'll use the input dataset to create an intermediate dataset that does have correlation with the output.
    - It's kind of like **cheating**.

## Creating Correlation

<div style="text-align:center;">
    <img style="width:50%" src="static/imgs/06/hidden-layer.png" />
</div>

- the middle layer represents the intermediate dataset.
- **this network is still just a function**
    - It has a bunch of weights that are collected together in a particular way.
    - Gradient Descent still works because you can calculate how much each weight contributes to the error and adjust it to reduce the error to 0.

## Stacking Neural Networks: A Review

- Look at the stacked neural network architecture, ignore the lower weights, and treat their output as the dataset.
    - The top half of the neural network is then just like the networks trained in the preceding chapter.
    - You can use all the same learning logic to help it learn.
- The part that you don't yet understand is how to update the weights of the first layer.
    - What do they use as their error measure?
- The cached/normalized error measure is called *delta*.
    - You want to figure out how to know the *delta* values at the first layer so they can help the second layer make accurate predictions.

## Backpropagation: Long-distance Error Attribution
### The Weighted average error

<div style="text-align:center;">
    <img style="width:50%" src="static/imgs/06/backpropagation.png" />
</div>

- How do you use the *delta* at layer 2 to figure out the *delta* at layer 1?
    - You multiply it by each of the respective weights for layer 1.
    - It's like the prediction logic in reverse.
- this process of moving the *delta* signal around is called **backpropagation**.

## Backpropagation: Why does this work?
### The weighted average delta

- The delta variable told you the **direction and amount** the value of this node should change next time.
- All backpropagation lets you do is say:
    - Hey, If you want this node to be x amount higher, then each of these previous four nodes needs to be x*weights_1_2 amount higher/lower.
    - Because these weights were amplifying the prediction by weights_1_2 times.
- Once you know this, you can update each weight matrix as you did before.
    - For each weight, multiply its output delta by its input value.
    - & adjust the weight by that much (or you can scale it by alpha).
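
The steps above can be sketched for a single training row as follows. This is an illustrative sketch without any nonlinearity; the random weights, the seed, and the helper `half_squared_error` are assumptions introduced here. The last few lines nudge one first-layer weight by a tiny amount and compare the measured change in error with the backpropagated value, as a sanity check that the delta passed back is the right quantity.

```python
import numpy as np

rng = np.random.default_rng(0)

layer_0 = np.array([[1.0, 0.0, 1.0]])             # one input row, shape (1, 3)
goal = np.array([[1.0]])                          # shape (1, 1)
weights_0_1 = rng.uniform(-1, 1, size=(3, 4))
weights_1_2 = rng.uniform(-1, 1, size=(4, 1))

# Forward pass (no nonlinearity yet).
layer_1 = layer_0.dot(weights_0_1)                # (1, 4)
layer_2 = layer_1.dot(weights_1_2)                # (1, 1)

# Backward pass: send the output delta back through weights_1_2.
layer_2_delta = layer_2 - goal                    # (1, 1)
layer_1_delta = layer_2_delta.dot(weights_1_2.T)  # (1, 4)

# Weight deltas: each weight's input value times its output delta.
weight_delta_1_2 = layer_1.T.dot(layer_2_delta)   # (4, 1)
weight_delta_0_1 = layer_0.T.dot(layer_1_delta)   # (3, 4)


def half_squared_error(w_0_1, w_1_2):
    return 0.5 * ((layer_0.dot(w_0_1).dot(w_1_2) - goal) ** 2).item()


# Finite-difference check on one first-layer weight.
eps = 1e-6
bumped = weights_0_1.copy()
bumped[0, 2] += eps
numeric = (half_squared_error(bumped, weights_1_2)
           - half_squared_error(weights_0_1, weights_1_2)) / eps
print(numeric, weight_delta_0_1[0, 2])
```

The two printed numbers should agree closely. Because the delta here is `prediction - goal` with no factor of 2, the weight deltas are gradients of half the squared error; the missing factor only rescales alpha.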

## Linear vs. Nonlinear
### This is probably the **hardest** Concept in the Book, Let's Take it slowly

- As it turns out, you need **one more piece** to make this neural network train.
- Let's take it from two perspectives:
    - The first will show why the neural network can't train without it.
    - The second will show how to fix the problem.
- The problem lies in the following statement:
    - All linear mappings of linear mappings produce linear mappings.
        - Meaning, no matter how many stacked linear layers you add, there exists an equivalent neural network with one layer.

<div style="text-align:center;">
    <img style="width:66%" src="static/imgs/06/linearity-problem.png" />
</div>
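
A quick numerical sketch of the statement that stacked linear layers collapse into one. The weight matrices are random and the seed is arbitrary; `combined_weights` is a name introduced here for the single equivalent layer.

```python
import numpy as np

rng = np.random.default_rng(0)

X = np.array([[1, 0, 1],
              [0, 1, 1],
              [0, 0, 1],
              [1, 1, 1]])
weights_0_1 = rng.uniform(-1, 1, size=(3, 4))
weights_1_2 = rng.uniform(-1, 1, size=(4, 1))

# Two stacked linear layers.
two_layer_output = X.dot(weights_0_1).dot(weights_1_2)

# Collapse both weight matrices into a single (3, 1) matrix.
combined_weights = weights_0_1.dot(weights_1_2)
one_layer_output = X.dot(combined_weights)

print(np.allclose(two_layer_output, one_layer_output))
```

Whatever the two matrices are, their product is a single matrix that gives the same predictions, so the hidden layer adds no expressive power on its own.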

## Why the Neural Network still doesn't work
### If you trained the three layer network as it is now, it wouldn't converge.

- The middle nodes don't get to add anything to the conversation.
    - They don't get to have correlation of their own.
    - They're more or less correlated to various input nodes.
- But because you know that in the new dataset there is no correlation between any of the inputs and the output, how can the middle layer help?
    - It mixes up a bunch of correlation that's already useless.
    - **What you really need is for the middle layer to be able to selectively correlate with the input**.
- **You want the middle layer to sometimes correlate with an input, and sometimes not correlate**.
    - That gives it correlation of its own.
- This gives the middle layer the opportunity to not just always be x% correlated with one input and y% correlated with another input.
    - Instead, it can be x% correlated with one input only when it wants to be.
- this is called **Conditional Correlation** or **sometimes correlation**.

## The Secret to Sometimes Correlation
### Turn off the node when the value is below 0

- If a Node's value dropped below 0, normally the node would still have the same correlation to the input as always.
    - it would just happen to be negative in value.
    - But if you turn off the node when it would be negative, then it has zero correlation to any inputs whenever it's negative.
    - What does this mean?
- **The node can now pick and choose when it wants to be correlated to something.**
    - This allows it to say something like:
        - Make me perfectly correlated to the left input, but only when the right input is turned off.
- This wasn't possible before. Now the node can be conditional.
    - Now **it can speak for itself**.
- The fancy term for this "if the node would be negative, set it to 0" logic is **nonlinearity**.
    - Without this tweak, the neural network is linear.
- There are many kinds of nonlinearities, but the one discussed here is, in many cases, the best one to use.
    - It's also the simplest: **ReLU**.
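
A small sketch of "sometimes correlation" using hand-picked weights rather than learned ones (the values of `ws_0_1` and `ws_1_2` below are assumptions chosen for illustration). The first hidden node follows the left input only when the right input is off, the second does the reverse, and adding them reproduces the target pattern that no single layer can fit.

```python
import numpy as np


def relu(x):
    return (x > 0) * x


X = np.array([[1, 0, 1],
              [0, 1, 1],
              [0, 0, 1],
              [1, 1, 1]])
y = np.array([[1, 1, 0, 0]]).T

# Hand-picked weights (not learned):
# hidden node 0 computes left - right, hidden node 1 computes right - left.
ws_0_1 = np.array([[ 1, -1],
                   [-1,  1],
                   [ 0,  0]])
ws_1_2 = np.array([[1],
                   [1]])

with_relu = relu(X.dot(ws_0_1)).dot(ws_1_2)
without_relu = X.dot(ws_0_1).dot(ws_1_2)

print(with_relu.ravel())     # matches y
print(without_relu.ravel())  # all zeros: the two nodes cancel each other out
```

With the negative values switched off, each hidden node is active only under its own condition. Without that switch the two nodes always cancel, and the output carries no information.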

## Your First Deep Neural Network
### Here's how to make the prediction

<div style="text-align:center;">
    <img style="width:50%" src="static/imgs/06/introducing-relu.png" />
</div>

In [15]:
import numpy as np

np.random.seed(1)

In [16]:
def ReLU(x):
    return (x > 0) * x

In [17]:
lr = .1
hidden_size = 4

In [18]:
X = np.array([[1, 0, 1], 
              [0, 1, 1], 
              [0, 0, 1], 
              [1, 1, 1]])
y = np.array([[1, 1, 0, 0]]).T

In [19]:
# weights to connect the 3 layers.
ws_0_1 = 2*np.random.random((3, hidden_size)) - 1
ws_1_2 = 2*np.random.random((hidden_size, 1)) - 1
ws_0_1.shape, ws_1_2.shape

((3, 4), (4, 1))

In [20]:
layer_0 = X[0]
layer_1 = ReLU(np.dot(layer_0, ws_0_1))
layer_2 = np.dot(layer_1, ws_1_2)

## Backpropagating in Code
### You can learn the amount that each weight contributes to the final error

In [22]:
def ReLU_grad(x):
    return (x > 0) * 1

In [159]:
for epoch in range(100):
    for i in range(len(X)):
        # get input X[i] & target y[i]
        x_i, y_i = X[i], y[i]
        
        # calculate prediction
        hs = ReLU(np.dot(x_i, ws_0_1))
        prediction = np.dot(hs, ws_1_2)
        
        # calculate error, pure error.
        error = (prediction - y_i) ** 2
        delta = prediction - y_i
        
        # calculate gradients of 1st layer.
        grad_0_1 = np.zeros(ws_0_1.shape)
        for line_i in range(len(ws_0_1)):
            for col_i in range(len(ws_0_1[0])):
                grad_0_1[line_i][col_i] = 2 * delta * x_i[line_i] * ws_1_2[col_i] * ReLU_grad(hs[col_i])
        
        # update weights of 1st layer.
        ws_0_1 -= lr * grad_0_1
        
        # calculate gradients of 2nd layer.
        grad_1_2 = np.zeros(ws_1_2.shape)
        for line_i in range(len(ws_1_2)):
            grad_1_2[line_i]= 2 * delta * hs[line_i]
        
        # update weights of 2nd layer.
        ws_1_2 -= lr * grad_1_2
    if (epoch % 10 == 0):
        print('error: ' + str(error))
ws_1_2

error: [0.]
error: [0.07710452]
error: [0.03764561]
error: [0.00240581]
error: [9.22274338e-06]
error: [0.]
error: [0.]
error: [0.]
error: [0.]
error: [0.]


array([[-0.5910955 ],
       [ 1.13962134],
       [-0.94522481],
       [ 1.11202675]])

In [23]:
# Book Implementation.
for iteration in range(100):
    layer_2_error = 0
    for i in range(len(X)):
        layer_0 = X[i:i+1]
        layer_1 = ReLU(np.dot(layer_0, ws_0_1))
        layer_2 = np.dot(layer_1, ws_1_2)
        
        layer_2_error += np.sum((layer_2 - y[i:i+1]) ** 2)
        layer_2_delta = (layer_2 - y[i:i+1])
        layer_1_delta = layer_2_delta.dot(ws_1_2.T)*ReLU_grad(layer_1)
        
        ws_1_2 -= lr * layer_1.T.dot(layer_2_delta)
        ws_0_1 -= lr * layer_0.T.dot(layer_1_delta)
    if (iteration % 10 == 0):
        print("Error : " + str(layer_2_error))



- Remember, the goal is **error attribution**.
    - It's about figuring out how much each weight contributed to the overall error.
- Now that you know how much the final prediction should move up or down, you need to figure out how much each middle node should move up/down.
    - These are effectively **intermediate predictions**
- Once you have the delta at layer 1, you can use the same processes as before for calculating a weight update.
- Backpropagation is about calculating *deltas* for intermediate layers so you can perform gradient descent.
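
The two implementations above compute the first-layer update in different styles, one entry at a time and as matrix operations. The following sketch checks, for one training row with random weights (arbitrary seed), that the two forms give the same values. It omits the factor of 2 used in the loop version above, which only rescales the learning rate. `relu`, `relu_grad`, `grad_loop`, and `grad_matrix` are names chosen for this sketch.

```python
import numpy as np


def relu(x):
    return (x > 0) * x


def relu_grad(x):
    return (x > 0) * 1


rng = np.random.default_rng(1)
x_i = np.array([1.0, 0.0, 1.0])
y_i = 1.0
ws_0_1 = rng.uniform(-1, 1, size=(3, 4))
ws_1_2 = rng.uniform(-1, 1, size=(4, 1))

hs = relu(x_i.dot(ws_0_1))              # hidden layer, shape (4,)
delta = (hs.dot(ws_1_2) - y_i).item()   # output delta as a plain float

# Loop form: one entry at a time.
grad_loop = np.zeros_like(ws_0_1)
for row in range(3):
    for col in range(4):
        grad_loop[row, col] = delta * x_i[row] * ws_1_2[col, 0] * relu_grad(hs[col])

# Matrix form: delta at layer 1, then an outer product with the input.
layer_1_delta = delta * ws_1_2[:, 0] * relu_grad(hs)   # shape (4,)
grad_matrix = np.outer(x_i, layer_1_delta)             # shape (3, 4)

print(np.allclose(grad_loop, grad_matrix))
```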

## Why do Deep Networks Matter?
### What's the point of creating "intermediate datasets" that have correlation? 

- The two-layer network might have a problem classifying cat vs. non-cat pictures. Why?
- Just like the last streetlight dataset, no individual pixel correlates with whether there's a cat in the picture.
    - Only different configurations of pixels correlate with whether there's a cat.
- **Deep Learning is all about creating intermediate layers (datasets) wherein each node in an intermediate layer represents the presence or absence of a different configuration of inputs**.
- Because intermediate layers detect the presence or absence of various pixel configurations, they give the final layer the information it needs to correctly predict the presence or absence of a cat.
- Some Neural Networks have hundreds of layers.
- **The Rest of this book will be dedicated to studying different phenomena within these layers in an effort to explore the full power of deep neural networks**.

# Challenge: Build a 3-layer Neural Network from Memory!

In [110]:
import numpy as np 

In [111]:
X = np.array([[1, 0, 1], 
              [0, 1, 1], 
              [0, 0, 1], 
              [1, 1, 1]])
y = np.array([[1, 1, 0, 0]]).T

epochs = 10000
lr = 0.1

X.shape, y.shape

((4, 3), (4, 1))

In [112]:
# init weights.
ws_1 = np.random.rand(X.shape[1], 4)
ws_2 = np.random.rand(4, y.shape[1])

ws_1.shape, ws_2.shape

((3, 4), (4, 1))

In [113]:
def relu(x):
    return (x > 0) * x

def grad_relu(x):
    return x > 0

In [114]:
for epoch in range(epochs):
    for i in range(len(X)):
        # get input/output
        layer_in = X[i:i+1]
        
        # calculate prediction
        layer_1 = relu(layer_in.dot(ws_1))
        layer_out = layer_1.dot(ws_2).reshape(1, 1)
        
        # calculate delta 2
        delta_2 = layer_out - y[i:i+1]
        
        # calc error for logs
        error = delta_2 ** 2
        
        # calculate delta 1
        # delta_2.dot(ws_2.T) -> (1, 4)
        # grad_relu(layer_1) -> (1, 4)
        # * : element-wise multiplication.
        delta_1 = delta_2.dot(ws_2.T)*grad_relu(layer_1)
        
        # update weights
        ws_2 -= lr * (layer_1.T.reshape(4,1).dot(delta_2))
        ws_1 -= lr * (layer_in.T.reshape(3,1).dot(delta_1))
    if epoch % 100 == 0:
        print("Error : ", error)

Error :  [[2.05239252]]
Error :  [[0.00109065]]
Error :  [[1.64704463e-09]]
Error :  [[1.55952904e-15]]
Error :  [[1.26140843e-21]]
Error :  [[1.09092376e-27]]
Error :  [[1.30192864e-31]]
Error :  [[1.30192864e-31]]
Error :  [[1.73333695e-31]]
Error :  [[1.50992908e-31]]
Error :  [[1.73333695e-31]]
Error :  [[1.30192864e-31]]
Error :  [[1.79483996e-31]]
Error :  [[1.50992908e-31]]
Error :  [[1.30192864e-31]]
Error :  [[1.30192864e-31]]
Error :  [[1.10933565e-31]]
Error :  [[1.30192864e-31]]
Error :  [[1.50992908e-31]]
Error :  [[1.50992908e-31]]
Error :  [[1.30192864e-31]]
Error :  [[1.30192864e-31]]
Error :  [[1.30192864e-31]]
Error :  [[1.50992908e-31]]
Error :  [[1.30192864e-31]]
Error :  [[1.73333695e-31]]
Error :  [[1.10933565e-31]]
Error :  [[1.30192864e-31]]
Error :  [[1.30192864e-31]]
Error :  [[1.50992908e-31]]
Error :  [[1.30192864e-31]]
Error :  [[1.30192864e-31]]
Error :  [[1.50992908e-31]]
Error :  [[1.30192864e-31]]
Error :  [[1.50992908e-31]]
Error :  [[1.50992908e-31]]


In [116]:
# test weights.
for i in range(len(X)):
    x_i, y_i = X[i], y[i]
    print('y: ', y_i, ' y_hat: ', relu(x_i.dot(ws_1)).dot(ws_2).squeeze())

y:  [1]  y_hat:  0.9999999999999999
y:  [1]  y_hat:  0.9999999999999997
y:  [0]  y_hat:  2.7755575615628914e-16
y:  [0]  y_hat:  3.608224830031759e-16


# Sketches

<div style="text-align:center;">
    <img style="width:66%" src="static/imgs/06/backprop.jpg" /><br/>
    <span style="color:gray;">* Error Note: last layer doesn't have an activation function</span><br/>
    <img style="width:66%" src="static/imgs/06/more-explanation.jpg" />
    <img style="width:66%" src="static/imgs/06/thought_understand.jpg" />
    <img style="width:66%" src="static/imgs/06/debug_mode.jpg" />
    <img style="width:66%" src="static/imgs/06/falsy.jpg" />
</div>