In [1]:
# default_exp ggd
# default_cls_lvl 2

# Learning Multiple Weights at a Time: Generalizing Gradient Descent
> 

In This Chapter:

- Gradient Descent Learning with Multiple Inputs
- Freezing One Weight: What Does It Do?
- Gradient Descent Learning with Multiple Outputs
- Gradient Descent Learning with Multiple Inputs and Outputs
- Visualizing Weight Values
- Visualizing Dot Products

> "You don't learn to walk by following rules. You learn by doing and by falling over." - Richard Branson

## Gradient Descent Learning with Multiple Inputs
### Gradient Descent Also Works with Multiple Inputs

<div style="text-align:center;"><img style="width:500px" src="static/imgs/06/many-to-one.png" /></div>

- The following code shows how a network with multiple inputs can learn:

In [1]:
# an empty network with multiple inputs.
def w_sum(a, b):
    assert(len(a) == len(b))
    S = 0
    for i in range(len(a)):
        S += a[i]*b[i]
    return S

# init weights.
weights = [.1, .2, -.1]

# defining the model.
def neural_network(input, weights):
    prediction = w_sum(input, weights)
    return prediction

In [11]:
# PREDICT + COMPARE: Making a Prediction, and Calculating Error & Delta.
toes = [8.5, 9.5, 9.9, 9.0]
wlrec = [0.65, 0.8, 0.8, 0.9]
nfans = [1.2, 1.3, 0.5, 1.0]

win_or_lose_binary = [1, 1, 0, 1]

true = win_or_lose_binary[0]
input = [toes[0], wlrec[0], nfans[0]]

pred = neural_network(input, weights)

error = (pred - true) ** 2
delta = pred - true

In [2]:
# LEARN: Calculating each weight_delta and putting it on each weight
def ele_mul(number, vector):
    return [number * v_i for v_i in vector]

# we calculate gradients associated w/ each weight.
gradients = ele_mul(delta, input)

print(gradients)

In [6]:
# LEARN: Updating the Weights.
alpha = .01

for i in range(len(weights)):
    weights[i] -= alpha * gradients[i]
    
print(weights)
print(gradients)

[0.1119, 0.20091, -0.09832]
[-1.189999999999999, -0.09099999999999994, -0.16799999999999987]


## Gradient Descent with Multiple Inputs Explained
### Simple to Execute, and Fascinating to Understand

- The properties involved are fascinating and worth discussing.
- Let's compare them with the single-input case, side by side:

In [2]:
# Single Input: Making a Prediction and calculating error and delta.
number_of_toes = [8.5]
win_or_lose_binary = [1]

input = number_of_toes[0]
true = win_or_lose_binary[0]
weight = 0.3

prediction = input * weight
error = (prediction - true) ** 2
gradient = 2 * input * (prediction - true)
delta = prediction - true

### How do you turn a single delta (on the node) into three weight delta values?

<div style="text-align:center;"><img style="width:66%;" src="static/imgs/06/Conversation.png" /></div>

- Delta: a measure of how much higher or lower you want a node's value to be in order to predict perfectly on the current training example.
- Weight delta: a derivative-based estimate of the direction and amount to move a weight in order to reduce the node's delta, accounting for scaling, negative reversal, and stopping.
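
One way to see that a weight delta really is derivative-based is to compare it with a numerical slope. The sketch below is only an illustration: `error_for` and `eps` are names made up here, and the values are those of the first training example. It nudges each weight slightly, measures how the squared error changes, and compares that slope with `2 * delta * input`. The notebook's `delta * input` drops the constant factor 2, which only rescales alpha.

```python
# Hypothetical helper names (error_for, eps); values copied from the first training example.
weights = [0.1, 0.2, -0.1]
inputs = [8.5, 0.65, 1.2]
true = 1

def error_for(ws):
    """Squared error of the weighted sum for a given list of weights."""
    pred = sum(x * w for x, w in zip(inputs, ws))
    return (pred - true) ** 2

pred = sum(x * w for x, w in zip(inputs, weights))
delta = pred - true

# Analytic derivative of the squared error with respect to each weight.
analytic = [2 * delta * x for x in inputs]

# Numerical slope: central difference around the current weights.
eps = 1e-6
numeric = []
for i in range(len(weights)):
    up = list(weights)
    down = list(weights)
    up[i] += eps
    down[i] -= eps
    numeric.append((error_for(up) - error_for(down)) / (2 * eps))

for a, n in zip(analytic, numeric):
    print(round(a, 4), round(n, 4))
```

The two columns should agree closely, and the first pair is the largest in magnitude because its input (8.5) is the largest.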


## Let's Watch several steps of learning

<div style="text-align:center;"><img style="width:66%;" src="static/imgs/06/Iteration_1.png" /></div>

- We can make three individual error/weight curves, one for each weight.
- The slopes of these curves are reflected by the gradient values.
- Why is the gradient steeper for (a) than for the others if they all share the same prediction and error?
    - Because (a) has an input value that's significantly higher than the others and thus a larger derivative.

<div style="text-align:center;"><img style="width:66%;" src="static/imgs/06/Iteration_2.png" /></div>
<div style="text-align:center;"><img style="width:66%;" src="static/imgs/06/Iteration_3.png" /></div>

- Most of the learning (weight changing) was performed on the weight with the largest input.
    - Because the input changes the slope significantly.
- This isn't necessarily advantageous in all settings.
    - A subfield called *normalization* helps encourage learning across all weights despite dataset characteristics such as this.
- This significant difference in slope forced me to set alpha lower than I wanted (.01 instead of .1).
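
A rough way to see why alpha had to drop to .01: with a single training example and the update `weights[i] -= alpha * delta * input[i]`, each step multiplies delta by `1 - alpha * sum(x * x for x in input)`. The sketch below (the helper `run` and the name `scale` are made up for illustration) computes that factor for both learning rates and runs three updates with each.

```python
inputs = [8.5, 0.65, 1.2]
true = 1

def run(alpha, steps=3):
    """Return the squared error seen at the start of each update step."""
    weights = [0.1, 0.2, -0.1]
    errors = []
    for _ in range(steps):
        pred = sum(x * w for x, w in zip(inputs, weights))
        delta = pred - true
        errors.append(delta ** 2)
        weights = [w - alpha * delta * x for w, x in zip(weights, inputs)]
    return errors

# Sum of squared inputs, dominated by 8.5 ** 2.
scale = sum(x * x for x in inputs)

print("delta multiplier, alpha=.01:", 1 - 0.01 * scale)
print("delta multiplier, alpha=.1 :", 1 - 0.1 * scale)
print("errors, alpha=.01:", run(0.01))
print("errors, alpha=.1 :", run(0.1))
```

With alpha = .01 the multiplier is between 0 and 1, so the error shrinks at every step. With alpha = .1 its magnitude is greater than 1, so every update overshoots and the error grows.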

## Freezing one weight: What does it do?

- This experiment is a bit advanced in terms of theory.
    - But I think it's a great exercise for understanding how the weights affect each other.
- You're going to train again, except weight (a) won't ever be adjusted.
    - You'll try to learn the training example using only weights (b) and (c).

In [14]:
lr = .3

for iter in range(3):
    pred = neural_network(input, weights)
    
    error = (pred - true) ** 2
    delta = (pred - true)
    
    gradients = ele_mul(delta, input)
    gradients[0] = 0
    
    print("Iteration : " + str(iter))
    print("Pred : " + str(pred))
    print("Error : " + str(error))
    print("Weights : " + str(weights))
    
    for i in range(len(weights)):
        weights[i] -= lr * gradients[i]

Iteration : 0
Pred : 0.8600000000000001
Error : 0.01959999999999997
Weights : [0.1, 0.2, -0.1]
Iteration : 1
Pred : 0.9382250000000001
Error : 0.003816150624999989
Weights : [0.1, 0.2273, -0.04960000000000005]
Iteration : 2
Pred : 0.97274178125
Error : 0.000743010489422852
Weights : [0.1, 0.239346125, -0.02736100000000008]


<div style="text-align:center;">
    <img style="width:30%" src="static/imgs/06/Experiment_1.png" />
    <img style="width:30%" src="static/imgs/06/Experiment_2.png" />
</div>

- Perhaps you are surprised that the remaining weights still learned despite freezing the weight on the largest input.
- Graph (a) still finds the bottom of the bowl: even though the numerical value of the weight didn't change, its error curve changes as the error falls.
    - The black dot can move horizontally only if the weight is updated.
    - Because weight (a) is frozen for the experiment, the dot must stay fixed, yet the error still goes to zero.
- This is an extremely important lesson.
    - First, if you converged (reached error == 0) with weights (b) and (c) and then tried to train (a), (a) wouldn't move.
        - Why? error == 0 means pred == true, so gradient = 2 * input * (pred - true) == 0.
        - This reveals a potentially damaging property of neural networks:
            - (a) may be a powerful input with lots of predictive power, but **if the network accidentally figures out how to predict accurately on the training data without it, then it will never learn how to incorporate (a) into its prediction.**
- Notice how (a) finds the bottom of the bowl. Instead of the black dot moving, the curve seems to move to the left. What does this mean?
    - The black dot can move horizontally only if the weight is updated.
        - **But it can move vertically**.
- The three graphs, one error curve per weight, are in reality **2D slices of a 4-dimensional space**.
    - Three of the dimensions are the weight values, and the fourth is the error.
- The shape of this 4-D function is called the **error plane**.
    - **Its curvature is determined by the training data**.
- *Error* is determined by the training data.
    - Any network can have any weight value.
        - But the value of the error given any weight configuration is 100% determined by the data.
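
The point about (a) never being incorporated can be checked directly. The following sketch is an illustration under assumptions: it reuses the first training example and the frozen-weight update from the experiment, but runs 50 iterations (a number picked here) so the error is essentially zero, then asks what gradient weight (a) would receive if it were unfrozen. `gradient_a` is a name introduced for this example.

```python
inputs = [8.5, 0.65, 1.2]
true = 1
weights = [0.1, 0.2, -0.1]
lr = 0.3

# Train with weight (a) frozen, as in the experiment, but for more iterations.
for _ in range(50):
    pred = sum(x * w for x, w in zip(inputs, weights))
    delta = pred - true
    gradients = [delta * x for x in inputs]
    gradients[0] = 0  # weight (a) is never adjusted
    weights = [w - lr * g for w, g in zip(weights, gradients)]

# Now "unfreeze" (a) and look at the gradient it would receive.
pred = sum(x * w for x, w in zip(inputs, weights))
delta = pred - true
gradient_a = delta * inputs[0]

print("weight (a):", weights[0])
print("error:", delta ** 2)
print("gradient for (a):", gradient_a)
```

Weight (a) is still at its starting value, and because delta is essentially zero, so is its gradient: further training would leave (a) where it is, even though its input is the largest.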

## Gradient Descent Learning with Multiple Outputs
### Neural Networks can also make multiple predictions using only a single input

- At this point, I hope it's clear that a simple mechanism (stochastic gradient descent) is used consistently to perform learning across a wide variety of architectures.

In [45]:
# An empty network with multiple outputs.
weights = [.3, .2, .9]
def neural_network(input, weights):
    prediction = ele_mul(input, weights)
    return prediction

In [46]:
# PREDICT: Making a prediction and calculating error and delta.
wlrec = [.65, 1, 1, .9]
hurt = [.1, 0, 0, .1]
win = [1, 1, 0, 1]
sad = [.1, 0, .1, .2]

input = wlrec[0]
target = [hurt[0], win[0], sad[0]]

pred = neural_network(input, weights)

error = [0, 0, 0]
pure_error = [0, 0, 0]

for i in range(len(target)):
    error[i] = (pred[i] - target[i]) ** 2
    pure_error[i] = pred[i] - target[i]

In [47]:
pred, error, pure_error

([0.195, 0.13, 0.5850000000000001],
 [0.009025, 0.7569, 0.2352250000000001],
 [0.095, -0.87, 0.4850000000000001])

In [48]:
# COMPARE: Calculating each gradient.
gradients = [input * pure_error[0], input * pure_error[1], input * pure_error[2]]

In [49]:
# UPDATE: Updating the Weights.
lr = 3
for i in range(len(weights)):
    weights[i] -= lr * gradients[i]

In [50]:
[input * w for w in weights]

[0.07458749999999997, 1.2327249999999998, -0.02973750000000019]
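
The predictions after this single update have overshot their targets (for example, `win` is now above 1). A sketch of what happens if the same update is simply repeated, assuming nothing beyond the cells above: with `lr = 3` and an input of .65, each step multiplies every delta by `1 - 3 * 0.65 ** 2`, about -0.27, so the sign flips each time while the size shrinks. The name `inp` is used here instead of `input` only to avoid shadowing the Python builtin.

```python
inp = 0.65
target = [0.1, 1, 0.1]
weights = [0.3, 0.2, 0.9]
lr = 3

for step in range(10):
    pred = [inp * w for w in weights]
    delta = [p - t for p, t in zip(pred, target)]
    # Same rule as above: gradient = input * delta, one per output weight.
    weights = [w - lr * inp * d for w, d in zip(weights, delta)]
    if step < 3:
        print(step, [round(d, 4) for d in delta])

final_pred = [inp * w for w in weights]
print([round(p, 4) for p in final_pred])
```

The printed deltas alternate in sign and shrink, and after ten steps the predictions sit very close to the targets.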

## Gradient Descent with Multiple Inputs & Outputs
### Gradient Descent Generalizes to Arbitrarily Large Networks

<div style="text-align:center;"><img style="width:25%" src="static/imgs/06/many-to-many.png" /></div>

In [4]:
# 1. An empty Network with multiple inputs and outputs.
weights = [[.1, .1, -.3], [.1, .2, 0], [0, 1.3, .1]]

def vect_mat_mul(vect, matrix):
    # each row of the matrix holds the weights feeding one output node.
    assert(len(vect) == len(matrix[0]))
    output = [0] * len(matrix)
    for i in range(len(matrix)):
        output[i] = w_sum(vect, matrix[i])
    return output

def neural_network(input, weights):
    prediction = vect_mat_mul(input, weights)
    return prediction

In [5]:
# 2. PREDICT: Making a Prediction & Calculating Error & Delta (Pure Error).

# Inputs.
toes = [8.5, 9.5, 9.9, 9.0]
wlrec = [0.65,0.8, 0.8, 0.9]
nfans = [1.2, 1.3, 0.5, 1.0]

# Outputs.
hurt = [0.1, 0.0, 0.0, 0.1]
win = [1, 1, 0, 1]
sad = [0.1, 0.0, 0.1, 0.2]

# learning rate.
alpha = .01

input = [toes[0], wlrec[0], nfans[0]]
target = [hurt[0], win[0], sad[0]]

prediction = neural_network(input, weights)
error, delta = [0, 0, 0], [0, 0, 0]
for i in range(len(prediction)):
    error[i] = (prediction[i] - target[i]) ** 2
    delta[i] = prediction[i] - target[i]

In [10]:
# 3. COMPARE: Calculating each weight_delta & Putting it on each weight.

import numpy as np

# we have 9 gradients, or weight deltas.
def outer_prod(vect_a, vect_b):
    out = np.zeros((len(vect_a), len(vect_b)))
    for i in range(len(vect_a)):
        for j in range(len(vect_b)):
            out[i][j] = vect_a[i]*vect_b[j]
    return out

# weights[i] feeds output i, so gradients[i][j] must be delta[i] * input[j].
gradients = outer_prod(delta, input)

In [12]:
# 4. LEARN: Updating the Weights
for i in range(len(weights)):
    for j in range(len(weights[0])):
        weights[i][j] -= alpha * gradients[i][j]
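
The four steps can also be written compactly with NumPy. This is a sketch rather than the notebook's own code: it assumes, as in `vect_mat_mul`, that row `i` of the weight matrix feeds output `i`, so the gradient for `weights[i][j]` is `delta[i] * inputs[j]`. `np.outer` builds all nine values at once.

```python
import numpy as np

# Same weights, first training example, and alpha as in the cells above.
weights = np.array([[0.1, 0.1, -0.3],
                    [0.1, 0.2, 0.0],
                    [0.0, 1.3, 0.1]])
inputs = np.array([8.5, 0.65, 1.2])
target = np.array([0.1, 1.0, 0.1])
alpha = 0.01

# PREDICT: one weighted sum per output row.
pred = weights.dot(inputs)
delta = pred - target
error_before = delta ** 2

# COMPARE: gradients[i][j] = delta[i] * inputs[j], matching weights[i][j].
gradients = np.outer(delta, inputs)

# LEARN: update all nine weights at once.
weights -= alpha * gradients

error_after = (weights.dot(inputs) - target) ** 2
print(error_before)
print(error_after)
```

Every entry of `error_after` should be smaller than the matching entry of `error_before`, which is a quick sanity check that the gradients line up with the weights they update.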

## What Do These weights Learn?
### Each Weight Tries to reduce the Error, But what do they learn in aggregate?

- Congratulations! This is the part of the book where we move on to the first real-world dataset.
    - As luck would have it, **it's one with historical significance**.
- It's called the Modified National Institute of Standards and Technology (MNIST) dataset.
    - It consists of digits that high school students and employees of the US Census Bureau handwrote some years ago.
    - The images are black-and-white pictures of people's handwriting.
    - Accompanying each image is the actual digit the person was writing (0-9).
- Each image is 784 pixels (28x28).
- Each training example contains 784 values.
    - So the neural network must have 784 input values.
- **You want to predict 10 probabilities**
    - One for each digit.
    - The neural network tells you which digit is most likely to be the one that was drawn.
- Let's Take a look at the MNIST Dataset:

In [1]:
import keras

Using TensorFlow backend.


In [2]:
from keras.datasets import mnist

In [5]:
(X_train, y_train), (X_test, y_test) = mnist.load_data()

In [7]:
# let's take a sample.
images = X_train[0:1000]
labels = y_train[0:1000]

In [8]:
images.shape, labels.shape

((1000, 28, 28), (1000,))
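
To feed these images to a network with 784 inputs and 10 outputs, each 28x28 image has to become a flat vector and each label a vector of 10 targets. The sketch below shows one way to do that. It uses random stand-in arrays with the same shapes as `images` and `labels` so that it runs without downloading MNIST, and the division by 255 (scaling pixels to the 0-1 range) is an assumption rather than something this notebook does.

```python
import numpy as np

# Stand-ins for X_train[0:1000] and y_train[0:1000]: random values, not real digits.
rng = np.random.default_rng(0)
images = rng.integers(0, 256, size=(1000, 28, 28), dtype=np.uint8)
labels = rng.integers(0, 10, size=1000)

# Flatten each 28x28 image into 784 input values and scale pixels to [0, 1].
flat_images = images.reshape(len(images), 28 * 28) / 255.0

# One-hot targets: 1.0 in the position of the true digit, 0 everywhere else.
one_hot = np.zeros((len(labels), 10))
one_hot[np.arange(len(labels)), labels] = 1.0

print(flat_images.shape, one_hot.shape)
```

With the real data, the same two steps would be applied to the `images` and `labels` arrays taken from `mnist.load_data()`.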

<div style="text-align:center;">
    <img style="width:50%" src="static/imgs/06/MNIST_clear_model.png" />
</div>

- This Diagram represents the new MNIST classification Neural Network.
- If this network could predict perfectly, it would take in an image's pixels (say, a 2) and predict a 1.0 in the correct output position.
    - ... and a 0 everywhere else.
- If it were able to do this correctly for all images in the dataset, it would have no error.
- But, **what does it mean to modify a bunch of weights to learn a pattern in aggregate**?

## Visualizing Weight Values

<div style="text-align:center;">
    <img style="width:33%" src="static/imgs/06/MNIST_weight_visualization.png" />
</div>

- An interesting & intuitive practice in neural network research is to visualize the weights as if they were an image.
- Each output node has a weight coming from every pixel.
- What is this relationship?
    - If the **weight is high, it means the model believes there's a high degree of correlation between that pixel and the number 2**.
    - If the **weight is very low (even negative), the model believes there is a very low correlation between that pixel and the number 2**.
- A very vague 2 and 1 appear in the two images, which were created using the weights for 2 and 1.
- The bright areas are high weights, and the dark areas are negative weights.
- The neutral color (red) represents a 0.
- Why does it turn out this way?
    - This Takes us back to **Dot Products**
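
Mechanically, viewing weights as an image just means laying the 784 weights that feed one output back out in the 28x28 grid of the input pixels. A minimal sketch, assuming a hypothetical weight matrix with one row per digit (random values here, not trained weights):

```python
import numpy as np

# Hypothetical weights: one row of 784 values per digit; random, not trained.
rng = np.random.default_rng(0)
weights = rng.normal(0.0, 0.1, size=(10, 784))

# The 784 weights feeding the "2" output, arranged like the 28x28 input image.
weight_image = weights[2].reshape(28, 28)

# Row and column of the pixel that the "2" output weighs most heavily.
row, col = np.unravel_index(np.argmax(weight_image), weight_image.shape)
print(weight_image.shape, (row, col), weight_image[row, col])
```

Passing an array like `weight_image` to an image-plotting routine (for example matplotlib's `imshow`) would give a picture in the spirit of the ones above. With trained weights, the bright pixels are the ones the output node treats as evidence for its digit.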

## Visualizing Dot Products (Weighted Sums)

In [10]:
# remember that the dot product is a mathematical measure of similarity.
import numpy as np

a = np.array([0, 1, 0, 1])
b = np.array([1, 0, 1, 0])

In [11]:
a.dot(b)

0

- **A dot product is a loose measurement of similarity between two vectors.**

<div style="text-align:center;">
    <img style="width:50%" src="static/imgs/06/Neural_Similarity.png" />
</div>

- Meaning, if the weight vector is similar to the *input* vector for 2, then it will output a high score because the two vectors are similar.
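
A few more dot products make the "loose similarity" idea concrete. The vectors `c` and `e` below are extra examples added for illustration. `a` and `b` are the ones defined above.

```python
import numpy as np

a = np.array([0, 1, 0, 1])
b = np.array([1, 0, 1, 0])
c = np.array([0, 1, 1, 0])   # overlaps with a in one position
e = np.array([0, 1, -1, 0])  # has a negative weight where c is positive

print(a.dot(a))  # 2: identical vectors, full overlap
print(a.dot(b))  # 0: no overlapping positions
print(a.dot(c))  # 1: partial overlap
print(c.dot(e))  # 0: the positive and negative terms cancel
```

The last line is why negative weights matter in the weight images: an input pixel that lands on a negative weight pulls the score down, which lets an output node count evidence against its digit as well as for it.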

## Summary
### Gradient Descent is a General Learning Algorithm

- The most important subtext of this chapter is that gradient descent is a very flexible learning algorithm.
- If you combine weights in a way that allows you to calculate an error function and a delta, gradient descent can show you how to move the weights to reduce the error.
- We'll spend the rest of this book exploring different types of weight combinations and error functions for which gradient descent is useful.

# Sketches

<div style="text-align:center;">
    <img style="width:333px" src="static/imgs/06/0.jpg" />
    <img style="width:333px" src="static/imgs/06/1.jpg" /><br/>
    <img style="width:333px" src="static/imgs/06/2.jpg" /><br/>
    <img style="width:333px" src="static/imgs/06/3.jpg" />
    <img style="width:333px" src="static/imgs/06/4'.jpg" /><br/>
    <img style="width:333px" src="static/imgs/06/5.jpg" />
    <img style="width:333px" src="static/imgs/06/6.jpg" />
</div>