# Learning Multiple Weights at a Time: Generalizing Gradient Descent

In this chapter, we will:

- Implement gradient descent with multiple inputs.
- Analyze the effects of freezing one weight.
- Implement gradient descent with multiple outputs.
- Implement gradient descent with multiple inputs and outputs.
- Visualize weight values.
- Visualize dot products.

> [Richard Branson] "You don't learn to walk by following rules. You learn by doing and by falling over."

## Gradient Descent Learning with Multiple Inputs

<div style="text-align:center;"><img style="width:400px" src="static/imgs/06/many-to-one.png" /></div>

Let's show how a neural network with multiple inputs can learn:

In [1]:
# an empty network with multiple inputs.
def w_sum(a, b):
    assert(len(a) == len(b))
    S = 0
    for i in range(len(a)):
        S += a[i] * b[i]
    return S

# init weights.
weights = [.1, .2, -.1]

# defining the model.
def neural_network(X, W):
    prediction = w_sum(X, W)
    return prediction

In [2]:
# PREDICT + COMPARE: Making a Prediction, and Calculating Error & Delta.
toes  = [ 8.5, 9.5, 9.9, 9.0]
wlrec = [0.65, 0.8, 0.8, 0.9]
nfans = [ 1.2, 1.3, 0.5, 1.0]

win_or_lose_binary = [1, 1, 0, 1]

true = win_or_lose_binary[0]
x0 = [toes[0], wlrec[0], nfans[0]]

pred = neural_network(x0, weights)

error = (pred - true) ** 2
delta = pred - true

In [3]:
# LEARN: Calculating each weight_delta and putting it on each weight
def ele_mul(number, vector):
    return [number * v_i for v_i in vector]

# we calculate gradients associated w/ each weight.
gradients = ele_mul(delta, x0)

print(gradients)

[-1.189999999999999, -0.09099999999999994, -0.16799999999999987]


In [4]:
# LEARN: Updating the Weights.
alpha = .01

for i in range(len(weights)):
    weights[i] -= alpha * gradients[i]
    
print(weights)
print(gradients)

[0.1119, 0.20091, -0.09832]
[-1.189999999999999, -0.09099999999999994, -0.16799999999999987]


## Gradient Descent with Multiple Inputs: Explanation

The properties involved in GD with multiple inputs are fascinating and worthy of discussion. Let's take a look at them side by side:

In [5]:
# Single Input: Making a Prediction and calculating error and delta.
number_of_toes = [8.5]
win_or_lose_binary = [1]

input = number_of_toes[0]
true = win_or_lose_binary[0]
weight = 0.3

prediction = input * weight
error = (prediction - true) ** 2
gradient = 2 * input * (prediction - true)
delta = prediction - true

### How do you turn a single delta (on the node) into three weight delta values?

<div style="text-align:center;"><img style="width:800px;" src="static/imgs/06/Conversation.png" /></div>

`Delta` in this case is a measure of how much higher or lower we want the node's value to be in order to predict perfectly given the current training example. `Weight Delta` is a derivative-based estimate of the direction and amount we should move a weight to reduce `node_delta`, accounting for scaling, negative reversal, and stopping. All three effects come from multiplying the node's delta by the weight's input.
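
One way to check that multiplying the node's delta by each input really yields a derivative is to compare it with a numerical slope. The sketch below is a sanity check, not part of the chapter's network: `numerical_slope` and `squared_error` are hypothetical helpers that nudge one weight at a time and measure how the squared error changes. The exact derivative of `(pred - true) ** 2` with respect to weight `i` is `2 * delta * x0[i]`; the multi-input code in this chapter drops the constant `2`, which only rescales the learning rate.

```python
def w_sum(a, b):
    assert len(a) == len(b)
    return sum(a_i * b_i for a_i, b_i in zip(a, b))

def squared_error(x, w, true):
    return (w_sum(x, w) - true) ** 2

def numerical_slope(x, w, true, i, eps=1e-6):
    # hypothetical helper: central finite difference on weight i
    w_up = list(w)
    w_up[i] += eps
    w_down = list(w)
    w_down[i] -= eps
    return (squared_error(x, w_up, true) - squared_error(x, w_down, true)) / (2 * eps)

x0 = [8.5, 0.65, 1.2]
weights = [0.1, 0.2, -0.1]
true = 1

delta = w_sum(x0, weights) - true

# analytic derivative: one shared delta, scaled by each weight's own input
analytic = [2 * delta * x_i for x_i in x0]
numeric = [numerical_slope(x0, weights, true, i) for i in range(len(weights))]

for a, n in zip(analytic, numeric):
    print(round(a, 4), round(n, 4))
```

The two columns should agree closely, which shows that the single delta on the node becomes three different weight deltas purely through the three different inputs.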

## Let's Watch Several Steps of Learning

<div style="text-align:center;"><img style="width:800px;" src="static/imgs/06/Iteration_1.png" /></div>

In the above figure, we made three individual error/weight curves, one for each weight. The slope of each curve is reflected by the gradient values. The gradient is steeper for (a) than the others because (a) has an input value that is significantly higher than the others and thus, has a higher derivative.

<div style="text-align:center;"><img style="width:800px;" src="static/imgs/06/Iteration_2.png" /></div>
<div style="text-align:center;"><img style="width:800px;" src="static/imgs/06/Iteration_3.png" /></div>

We can see that most of the learning happened on the weight with the largest input, because a larger input makes the slope much steeper. This imbalance is not always desirable: a technique called `data normalization` encourages learning across all weights despite data characteristics such as this.
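
A minimal sketch of the idea, assuming the simplest possible scaling: divide each input column by its largest value so that all features land in a similar range. `scale_by_max` and `spread` are hypothetical helpers, and real projects often use other schemes (such as subtracting the mean and dividing by the standard deviation). The only point here is to compare gradient magnitudes before and after scaling.

```python
toes  = [8.5, 9.5, 9.9, 9.0]
wlrec = [0.65, 0.8, 0.8, 0.9]
nfans = [1.2, 1.3, 0.5, 1.0]

def scale_by_max(column):
    # hypothetical helper: rescale a feature so its largest magnitude is 1
    largest = max(abs(v) for v in column)
    return [v / largest for v in column]

def gradients_for(x, w, true):
    pred = sum(x_i * w_i for x_i, w_i in zip(x, w))
    delta = pred - true
    return [delta * x_i for x_i in x]

def spread(grads):
    # ratio between the largest and smallest gradient magnitude
    mags = [abs(g) for g in grads]
    return max(mags) / min(mags)

weights = [0.1, 0.2, -0.1]
true = 1

raw_x0 = [toes[0], wlrec[0], nfans[0]]
scaled_x0 = [scale_by_max(col)[0] for col in (toes, wlrec, nfans)]

raw_grads = gradients_for(raw_x0, weights, true)
scaled_grads = gradients_for(scaled_x0, weights, true)

print("raw spread:   ", round(spread(raw_grads), 2))
print("scaled spread:", round(spread(scaled_grads), 2))
```

With the raw inputs, the gradient on the `toes` weight is more than ten times larger than the smallest one; after scaling, all three gradients are of comparable size.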

This significant difference in slope forces us to use a learning rate lower than the usual one (.01 instead of .1).
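
To see why the smaller learning rate is needed, the sketch below runs a few updates on the first training example with both values. It is an illustration under the chapter's setup (same inputs, same starting weights); `train` is a hypothetical helper name.

```python
def train(x, w, true, alpha, steps):
    # repeat predict / compare / learn and record the error before each update
    w = list(w)
    errors = []
    for _ in range(steps):
        pred = sum(x_i * w_i for x_i, w_i in zip(x, w))
        delta = pred - true
        errors.append(delta ** 2)
        for i in range(len(w)):
            w[i] -= alpha * delta * x[i]
    return errors

x0 = [8.5, 0.65, 1.2]
start = [0.1, 0.2, -0.1]

errors_large = train(x0, start, true=1, alpha=0.1, steps=4)
errors_small = train(x0, start, true=1, alpha=0.01, steps=4)

print("alpha=0.1 :", [round(e, 4) for e in errors_large])
print("alpha=0.01:", [round(e, 4) for e in errors_small])
```

With `alpha=0.1` every update overshoots, largely because the step on the `toes` weight is scaled by an input of 8.5, so the error grows instead of shrinking. With `alpha=0.01` the error falls at every step.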

## Freezing One Weight: What Does It Do?

Freezing weights is a great exercise for understanding how the weights affect each other. We are going to train again, except that weight (a) will never be adjusted; we will try to learn using only weights (b) and (c).

In [6]:
lr = .3

for iter in range(3):
    pred = neural_network(x0, weights)
    
    error = (pred - true) ** 2
    delta = (pred - true)
    
    gradients = ele_mul(delta, x0)
    gradients[0] = 0
    
    print(f"Iteration: {iter}")
    print(f"Prediction: {round(pred, 5)}")
    print(f"Error: {round(error, 5)}")
    print(f"Weights: {[round(w, 5) for w in weights]}")
    print(f"- - - - - -")
    
    for i in range(len(weights)):
        weights[i] -= lr * gradients[i]

Iteration: 0
Prediction: 0.96376
Error: 0.00131
Weights: [0.1119, 0.20091, -0.09832]
- - - - - -
Iteration: 1
Prediction: 0.98401
Error: 0.00026
Weights: [0.1119, 0.20798, -0.08527]
- - - - - -
Iteration: 2
Prediction: 0.99294
Error: 5e-05
Weights: [0.1119, 0.2111, -0.07952]
- - - - - -


<div style="text-align:center;">
    <img style="width:400px" src="static/imgs/06/Experiment_1.png" />
    <img style="width:400px" src="static/imgs/06/Experiment_2.png" />
</div>

It is somewhat surprising that the remaining weights could be learned even though the biggest contributing feature was frozen. Graph (a) still finds the bottom of the bowl because the whole error curve shifts as the other weights change. The black dot can move horizontally only if its weight is updated, and because (a)'s weight is frozen, the dot must stay fixed. However, this does not stop the error from going to 0.

This is an extremely important lesson. If we converged using only (b) and (c) and then tried to train (a), it wouldn't move because the error would already be 0. In other words, (a) may be a powerful input with lots of predictive power, but if **the network accidentally figures out how to predict accurately on the training data without it, then it will never learn how to incorporate (a) into its prediction.**
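
The following sketch illustrates this under the chapter's setup: it keeps training with weight (a) frozen until the error is essentially zero, then looks at the gradient (a) would receive if it were unfrozen. The iteration count of 50 is an arbitrary choice for the illustration.

```python
x0 = [8.5, 0.65, 1.2]
weights = [0.1119, 0.20091, -0.09832]
true = 1
lr = 0.3

for _ in range(50):
    pred = sum(x_i * w_i for x_i, w_i in zip(x0, weights))
    delta = pred - true
    gradients = [delta * x_i for x_i in x0]
    gradients[0] = 0  # weight (a) stays frozen
    for i in range(len(weights)):
        weights[i] -= lr * gradients[i]

# now "unfreeze" (a) and see what gradient it would get
pred = sum(x_i * w_i for x_i, w_i in zip(x0, weights))
delta = pred - true
gradient_a = delta * x0[0]

print("error:", delta ** 2)
print("gradient for (a):", gradient_a)
print("weight (a):", weights[0])
```

Because the delta has already shrunk to almost nothing, the gradient for (a) is almost nothing too, even though (a) has the largest input.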

The three graphs showing the loss as a function of each weight are, in reality, **2D slices of a 4-dimensional space**. Three of the dimensions are the weight values, and the fourth is the error. The shape of this 4-D function is called the **error plane**, or the curvature of the loss function. The curvature is determined by the training data.
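
One such slice can be computed directly. The sketch below holds two weights fixed and sweeps the third over a range of values, recording the error at each point. `error_slice` is a hypothetical helper, and the sweep range is arbitrary.

```python
def error_slice(x, w, true, index, values):
    # error as a function of one weight, with the others held fixed
    errors = []
    for v in values:
        trial = list(w)
        trial[index] = v
        pred = sum(x_i * w_i for x_i, w_i in zip(x, trial))
        errors.append((pred - true) ** 2)
    return errors

x0 = [8.5, 0.65, 1.2]
weights = [0.1, 0.2, -0.1]
true = 1

# sweep weight (a) from 0.00 to 0.30 in steps of 0.01
values = [i / 100 for i in range(0, 31)]
errors = error_slice(x0, weights, true, index=0, values=values)

best = values[errors.index(min(errors))]
print("lowest error on this slice at weight (a) =", best)
```

Plotting `errors` against `values` would give a bowl-shaped curve like the ones in the figures. Changing the fixed weights (b) and (c) shifts where the bottom of that bowl sits.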

Our goal is to find the weight configuration at the global minimum of the loss curvature.

## Gradient Descent Learning with Multiple Outputs

Neural networks can also make multiple predictions using only a single input. More broadly, the same simple mechanism (stochastic gradient descent) is used to perform learning across a wide variety of architectures.

In [7]:
# An empty network with multiple outputs.
weights = [.3, .2, .9]

def neural_network(X, W):
    prediction = ele_mul(X, W)
    return prediction

In [8]:
# PREDICT: Making a prediction and calculating error and delta.
wlrec = [.65, 1, 1, .9]
hurt = [.1, 0, 0, .1]
win = [1, 1, 0, 1]
sad = [.1, 0, .1, .2]

x0 = wlrec[0]
target = [hurt[0], win[0], sad[0]]

pred = neural_network(x0, weights)

error = [0, 0, 0]
pure_error = [0, 0, 0]

for i in range(len(target)):
    error[i] = (pred[i] - target[i]) ** 2
    pure_error[i] = pred[i] - target[i]

In [9]:
pred, error, pure_error

([0.195, 0.13, 0.5850000000000001],
 [0.009025, 0.7569, 0.2352250000000001],
 [0.095, -0.87, 0.4850000000000001])

In [10]:
# COMPARE: calculating each gradient.
gradients = [x0 * pure_error[0], x0 * pure_error[1], x0 * pure_error[2]]

In [11]:
# UPDATE: Updating the Weights.
lr = 3
for i in range(len(weights)):
    weights[i] -= lr * gradients[i]

In [12]:
# predict again with the same input (x0) that was used for training.
[round(x0 * w, 4) for w in weights]

[0.0746, 1.2327, -0.0297]
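
Because every output has its own weight and its own delta, each prediction is learned independently of the others. The sketch below repeats the same predict, compare, and update steps for several iterations. It assumes a learning rate of `0.5` and 20 iterations, both arbitrary choices for this illustration rather than the values used above.

```python
def ele_mul(number, vector):
    return [number * v_i for v_i in vector]

weights = [0.3, 0.2, 0.9]
x0 = 0.65
target = [0.1, 1, 0.1]  # hurt, win, sad for the first game
lr = 0.5

for step in range(20):
    pred = ele_mul(x0, weights)
    deltas = [p - t for p, t in zip(pred, target)]
    errors = [d ** 2 for d in deltas]
    for i in range(len(weights)):
        # each weight only sees the delta of its own output
        weights[i] -= lr * x0 * deltas[i]

final_pred = ele_mul(x0, weights)
print([round(p, 4) for p in final_pred])
print([round(e, 6) for e in errors])
```

All three predictions move toward their own targets at the same relative rate, since they share the same input and learning rate.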

## Gradient Descent with Multiple Inputs & Outputs

<div style="text-align:center;"><img style="width:200px;" src="static/imgs/06/many-to-many.png" /></div>

SGD naturally generalizes to arbitrary architectures. Let's implement SGD with multiple inputs and outputs:

In [13]:
# 1. An empty Network with multiple inputs and outputs.
weights = [[.1, .1, -.3], 
           [.1, .2, 0], 
           [0, 1.3, .1]]

def vect_mat_mul(vect, matrix):
    # each row of `matrix` holds the weights feeding one output.
    assert(len(vect) == len(matrix[0]))
    output = [0] * len(matrix)
    for i in range(len(matrix)):
        output[i] = w_sum(vect, matrix[i])
    return output

def neural_network(X, W):
    prediction = vect_mat_mul(X, W)
    return prediction

In [14]:
# 2. PREDICT: Making a Prediction & Calculating Error & Delta (Pure Error).

# Inputs
toes =  [8.5, 9.5, 9.9, 9.0]
wlrec = [.65, 0.8, 0.8, 0.9]
nfans = [1.2, 1.3, 0.5, 1.0]

# Outputs
hurt = [.1, 0, 0,.1]
win  = [ 1, 1, 0, 1]
sad  = [.1, 0,.1,.2]

# learning rate
alpha = .01

x0 = [toes[0], wlrec[0], nfans[0]]
target = [hurt[0], win[0], sad[0]]

prediction = neural_network(x0, weights)
error, delta = [0, 0, 0], [0, 0, 0]
for i in range(len(prediction)):
    error[i] = (prediction[i] - target[i]) ** 2
    delta[i] = prediction[i] - target[i]

In [15]:
import numpy as np

In [16]:
# 3. COMPARE: Calculating each weight_delta & Putting it on each weight.

# we have 9 gradients, or weight deltas.
def outer_prod(vect_a, vect_b):
    out = np.zeros((len(vect_a), len(vect_b)))
    for i in range(len(vect_a)):
        for j in range(len(vect_b)):
            out[i][j] = vect_a[i]*vect_b[j]
    return out

# weights[i] feeds output i, so the gradient for weights[i][j] is delta[i] * x0[j].
gradients = outer_prod(delta, x0)

In [17]:
# 4. LEARN: Updating the Weights
for i in range(len(weights)):
    for j in range(len(weights[0])):
        weights[i][j] -= alpha * gradients[i][j]
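
The same four steps can be written more compactly with NumPy arrays. This is a sketch rather than the chapter's implementation: it assumes `weights[i]` holds the weights feeding output `i` (as `vect_mat_mul` does), so the gradient for `weights[i][j]` is `delta[i] * x0[j]`, which is what `np.outer(delta, x0)` computes.

```python
import numpy as np

weights = np.array([[0.1, 0.1, -0.3],
                    [0.1, 0.2, 0.0],
                    [0.0, 1.3, 0.1]])

x0 = np.array([8.5, 0.65, 1.2])     # toes, wlrec, nfans
target = np.array([0.1, 1.0, 0.1])  # hurt, win, sad
alpha = 0.01

prediction = weights.dot(x0)        # one weighted sum per output
delta = prediction - target
error_before = delta ** 2

gradients = np.outer(delta, x0)     # rows follow outputs, columns follow inputs
weights -= alpha * gradients

error_after = (weights.dot(x0) - target) ** 2
print(error_before)
print(error_after)
```

After a single update, the error of every output should be smaller than before.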

## What Do These Weights Learn?

Even though we understand "how" learning happens, another interesting question is "**what do the weights store while learning?**"

To answer this question, we will move on to the first real-world dataset. It's called the Modified National Institute of Standards and Technology (MNIST) dataset. It consists of digits handwritten some years ago by high school students and employees of the US Census Bureau. The images are black-and-white pictures of people's handwriting, and accompanying each image is the actual digit the person was writing (0-9). Each image is 784 pixels (28 x 28).

In this case, the neural network must have 784 input features. On the other end, we want to predict **10 probabilities**, one for each digit. The neural network will tell us which digit is most likely to be what was drawn.

Let's take a look at the MNIST dataset:

In [25]:
from tensorflow.keras.datasets import mnist

In [26]:
(X_train, y_train), (X_test, y_test) = mnist.load_data()

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/mnist.npz


In [27]:
# let's take a sample.
images = X_train[0:1000]
labels = y_train[0:1000]

In [28]:
images.shape, labels.shape

((1000, 28, 28), (1000,))

<div style="text-align:center;">
    <img style="width:500px" src="static/imgs/06/MNIST_clear_model.png" />
</div>

This diagram represents the new MNIST classification neural network. If this network could predict perfectly, it would take in an image's pixels (say, a 2) and then predict `1` in the correct output position and `0` everywhere else. If it were able to do this correctly for all images in the dataset, it would have no error.
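
To feed images into such a network, each 28 x 28 image is typically flattened into a vector of 784 values, and each label is turned into a vector with a `1` in the position of the correct digit (often called one-hot encoding). The sketch below uses small random placeholder arrays in place of the real `images` and `labels` so that it runs without downloading anything. Dividing by 255 to scale pixels into the 0 to 1 range is a common convention and an assumption here.

```python
import numpy as np

rng = np.random.default_rng(0)

# placeholder data with the same layout as the MNIST sample (not real digits)
images = rng.integers(0, 256, size=(5, 28, 28))
labels = np.array([5, 0, 4, 1, 9])

# flatten each image and scale pixel values to [0, 1]
inputs = images.reshape(len(images), 28 * 28) / 255

# one row per image, with a 1 in the column of the correct digit
one_hot = np.zeros((len(labels), 10))
one_hot[np.arange(len(labels)), labels] = 1

print(inputs.shape, one_hot.shape)
print(one_hot[0])
```

Each row of `one_hot` is the target a perfect network would produce for the corresponding image.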

But **what does it mean to modify a bunch of weights to learn an aggregate pattern**?

## Visualizing Weight Values

<div style="text-align:center;">
    <img style="width:400px;" src="static/imgs/06/MNIST_weight_visualization.png" />
</div>

An interesting and intuitive practice in neural network research is to visualize the weights as if they were an image. Each output node has one weight coming from each pixel, so what does each weight say about that relationship?
- If the weight is high, it means the model believes there's a high degree of correlation between that pixel and the number 2.
- If the weight is very low (even negative), the model believes there is a very low correlation between that pixel and the number 2.

In the above figure, which was created from the weights for `2` and `1`, the network has learned to draw rough outlines of a `2` and a `1`. The neutral color (red) represents a weight of `0`. To understand why these patterns appear, we should go back to the notion of the **dot product**.
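
As a rough sketch of what viewing weights as an image means, the snippet below reshapes a flat weight vector back into a grid and prints one character per weight. The 4 x 4 weights are made up for illustration (a real MNIST output node would have 784 weights arranged as 28 x 28), and `render_weights` is a hypothetical helper.

```python
def render_weights(weights, width, threshold=0.1):
    # '#' for clearly positive, '-' for clearly negative, '.' for near zero
    rows = []
    for start in range(0, len(weights), width):
        chars = []
        for w in weights[start:start + width]:
            if w > threshold:
                chars.append("#")
            elif w < -threshold:
                chars.append("-")
            else:
                chars.append(".")
        rows.append("".join(chars))
    return rows

# made-up weights for a node that responds to a vertical stroke
weights = [
    -0.3, 0.8, 0.0, -0.2,
    -0.4, 0.9, 0.0, -0.05,
    -0.2, 0.7, 0.05, -0.3,
    -0.3, 0.8, 0.0, -0.2,
]

rendered = render_weights(weights, width=4)
for line in rendered:
    print(line)
```

The column of `#` characters is where this made-up node expects bright pixels, the `-` characters mark pixels that count against it, and `.` marks pixels it ignores.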

## Visualizing Dot Products (Weighted Sums)

In [29]:
# remember that the dot product is a mathematical measure of similarity.
import numpy as np

a = np.array([0, 1, 0, 1])
b = np.array([1, 0, 1, 0])

In [30]:
a.dot(b)

0

A dot product is a loose measurement of similarity between two vectors. 

<div style="text-align:center;">
    <img style="width:400px" src="static/imgs/06/Neural_Similarity.png" />
</div>

This means that if the weight vector for `2` is similar to the input vector, the network outputs a high score for `2`, resulting in a higher probability.
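
A few hand-picked vectors make this notion of similarity more concrete. The sketch below uses a plain-Python weighted sum equivalent to `w_sum`; the vectors are chosen only for illustration.

```python
def w_sum(a, b):
    assert len(a) == len(b)
    return sum(a_i * b_i for a_i, b_i in zip(a, b))

a = [0, 1, 0, 1]
b = [1, 0, 1, 0]
c = [0, 1, 1, 0]
d = [0.5, 0, 0.5, 0]
e = [0, 1, -1, 0]

print(w_sum(a, b))  # no overlapping positions
print(w_sum(b, c))  # one overlapping position
print(w_sum(c, c))  # a vector compared with itself
print(w_sum(b, d))  # same pattern, smaller values
print(w_sum(c, e))  # positive and negative overlap cancel out
```

The score is highest when large values line up in the same positions, drops to zero when nothing overlaps, and can also be pulled down by negative weights, which is how a weight vector can count evidence against a digit.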

## Summary
### Gradient Descent is a General Learning Algorithm

The most important subtext in this chapter is that gradient descent is a very flexible, general learning algorithm. If we combine weights in a way that allows us to calculate an error function and a delta, then gradient descent can show us how to move the weights to reduce the error.

We will spend the rest of this book exploring different types of weight combinations and error functions for which gradient descent is useful.

# Sketches

<div style="text-align:center;">
    <img style="width:333px" src="static/imgs/06/0.jpg" />
    <img style="width:333px" src="static/imgs/06/1.jpg" /><br/>
    <img style="width:333px" src="static/imgs/06/2.jpg" /><br/>
    <img style="width:333px" src="static/imgs/06/3.jpg" />
    <img style="width:333px" src="static/imgs/06/4'.jpg" /><br/>
    <img style="width:333px" src="static/imgs/06/5.jpg" />
    <img style="width:333px" src="static/imgs/06/6.jpg" />
</div>

---