# Introduction to Neural Learning: Gradient Descent

In this Chapter, we will:

- Check whether a neural network can make accurate predictions.
- Investigate the importance of measuring error.
- Define **Hot** and **Cold** learning.
- Calculate both "direction" and "amount" from the error.
- Examine **Gradient Descent**.
- Learn about derivatives and how to use them to learn.
- Examine divergence and the learning rate (alpha).

> [Milton Friedman] "The only relevant test of the validity of a hypothesis is comparison of prediction with experience."

## Predict, Compare, & Learn

The last chapter left us with a question: **How do we set weight values so the network predicts accurately?**

Answering this question is the main focus of this chapter.

## Compare

Once we have made a prediction, the next step is to evaluate how well we did. Coming up with a good way to measure error is one of the most **important** and **complicated** subjects in the field of representation learning and machine learning in general.

We make one assumption: **error is always positive** (or zero for a perfect prediction). A negative error would not make sense, because a prediction cannot be "truer" than the given label or numerical target. An error measure estimates how much a prediction missed by.

The output of the "Compare" step is a "hot or cold" type of signal. Given some prediction, we calculate an error measure that says either "a lot" or "a little": roughly "big miss", "little miss", or "perfect prediction".

In this chapter, we evaluate only one simple way of measuring error: `mean squared error`.

## Learn

Learning, which comes next, tells each weight how it can change to reduce the error signal. So learning is all about **error attribution** (or how each weight played its part in creating error).

The goal of the `learn` step is to compute an "update" value for each weight.

## Compare: Does our Network make good predictions?

Let's measure the error and find out:

In [1]:
knob_weight = .5
x0 = .5

# A number you recorded in the real world somewhere.
goal_prediction = .8

pred = knob_weight * x0

# Mean squared error over one data point.
# Squaring makes the error positive, shrinks misses below 1, and amplifies misses above 1.
error = (pred - goal_prediction) ** 2

print(error)

0.30250000000000005


We square the error not only to force it to be positive, but also to amplify big misses (larger than 1) and shrink small ones (smaller than 1), which were "OK" anyway. This makes the network pay attention to the big errors and worry less about the small ones.
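To see this prioritization concretely, the sketch below compares the squared error with the absolute error for a few misses. The miss values are made up for illustration and are not taken from any dataset.

```python
# Illustrative misses (pred - goal); these values are assumptions, not data.
misses = [-2.0, -0.5, 0.1, 3.0]

for miss in misses:
    squared = miss ** 2
    absolute = abs(miss)
    print(f"miss: {miss:+.1f} | squared: {squared:.2f} | absolute: {absolute:.2f}")

# Kept in a list so the effect on small and large misses can be compared.
squared_errors = [m ** 2 for m in misses]
```

Misses smaller than 1 in magnitude come out smaller after squaring, while misses larger than 1 come out larger. The absolute value treats them all proportionally.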

## Why Measure Error?

Measuring error simplifies the problem. Changing `knob_weight` to make the network correctly predict `goal_prediction` is slightly more complicated than changing `knob_weight` to make `error == 0`, which is a plain **minimization** problem.

We also stress that different ways of measuring error prioritize errors differently. As mentioned in the previous example, squaring prioritizes big errors by amplifying them and plays down small errors by shrinking them. If we took the absolute value instead of squaring, we wouldn't have this type of prioritization.

On the other hand, forcing errors to be positive matters because we want to drive the average error down to zero. If errors were allowed to be negative, the average of `err1 = -E` and `err2 = +E` would be zero. We would fool ourselves into thinking that we had predicted perfectly, when in fact we had missed by `E` each time. In summary, **we want errors to be positive so that they do not cancel each other out**.
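The cancellation argument can be checked with a tiny sketch. The value `E = 0.5` is an arbitrary assumption chosen for illustration.

```python
# Two hypothetical signed misses of equal size and opposite sign.
E = 0.5
signed_errors = [-E, +E]

mean_signed = sum(signed_errors) / len(signed_errors)
mean_squared = sum(e ** 2 for e in signed_errors) / len(signed_errors)

print(mean_signed)   # the signed misses cancel each other out
print(mean_squared)  # the squared misses do not cancel
```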

## What's the Simplest form of Neural Learning?

The simplest form of neural learning is the hot and cold method. We know whether to turn the knob up or down by trying both and seeing which one reduces the error.

Hot & Cold learning wiggles the weights to see which direction reduces the error the most, moves the weights in that direction, and repeats until the error gets to **0**.

In [2]:
# 1. An Empty Network.
weight = .1
lr = .01

def neural_network(X, W):
    prediction = X * W
    return prediction

In [3]:
# 1. PREDICT: Making a Prediction & Evaluating Error.
number_of_toes = [8.5]
win_or_loss_binary = [1]

input = number_of_toes[0]
true = win_or_loss_binary[0]

pred = neural_network(input, weight)

error = (pred - true) ** 2  # ^2 forces the error to be positive.
print(error)

0.022499999999999975


In [4]:
# 2. COMPARE: Making a Prediction with a higher weight and evaluating error.
p_up = neural_network(input, weight + lr)

e_up = (p_up - true) ** 2 

print(e_up)

0.004224999999999993


In [5]:
# 3. COMPARE: Making a Prediction with a lower weight and evaluating error.
p_dn = neural_network(input, weight - lr)

e_dn = (p_dn - true) ** 2

print(e_dn)

0.05522499999999994


In [6]:
# 4. COMPARE ERROR & SET NEW WEIGHT
if error > e_up or error > e_dn:
    if e_up < e_dn:
        weight += lr
    else:
        weight -= lr

These last five steps represent only one iteration of hot and cold learning. Some people have to train their networks for **weeks or months** before they find a good enough **weight configuration**.

This reveals that learning in neural networks really is a **Search Problem**. We are searching for the best possible weight configuration that results in the lowest error possible.

## Hot & Cold Learning

Let's implement iterative hot & cold learning:

In [10]:
weight = .5
input = .5
goal_prediction = .8

step_amount = .001

for iteration in range(10000):
    prediction = input * weight
    
    # calculate error.
    error = (prediction - goal_prediction) ** 2
    
    # try knob up.
    p_up = input * (weight + step_amount)
    error_up = (p_up - goal_prediction) ** 2
    
    # try knob down.
    p_dn = input * (weight - step_amount)
    error_dn = (p_dn - goal_prediction) ** 2
    
    if error > error_up or error > error_dn:
        if error_up < error_dn:
            weight += step_amount
            if iteration % 20 == 0:
                print(f"Error: {round(error_up, 4)} | Prediction: {round(input * weight, 3)}")
        else:
            weight -= step_amount
            if iteration % 20 == 0:
                print(f"Error: {round(error_dn, 4)} | Prediction: {round(input * weight, 3)}")

Error: 0.302 | Prediction: 0.251
Error: 0.2911 | Prediction: 0.261
Error: 0.2804 | Prediction: 0.271
Error: 0.2699 | Prediction: 0.281
Error: 0.2596 | Prediction: 0.291
Error: 0.2495 | Prediction: 0.301
Error: 0.2396 | Prediction: 0.311
Error: 0.2299 | Prediction: 0.321
Error: 0.2204 | Prediction: 0.331
Error: 0.2111 | Prediction: 0.341
Error: 0.2021 | Prediction: 0.351
Error: 0.1932 | Prediction: 0.361
Error: 0.1845 | Prediction: 0.371
Error: 0.176 | Prediction: 0.381
Error: 0.1677 | Prediction: 0.391
Error: 0.1596 | Prediction: 0.401
Error: 0.1517 | Prediction: 0.411
Error: 0.144 | Prediction: 0.421
Error: 0.1365 | Prediction: 0.431
Error: 0.1292 | Prediction: 0.441
Error: 0.1222 | Prediction: 0.451
Error: 0.1153 | Prediction: 0.461
Error: 0.1086 | Prediction: 0.471
Error: 0.1021 | Prediction: 0.481
Error: 0.0958 | Prediction: 0.491
Error: 0.0897 | Prediction: 0.501
Error: 0.0838 | Prediction: 0.51
Error: 0.0781 | Prediction: 0.52
Error: 0.0726 | Prediction: 0.53
Error: 0.0673 | Pred

## Characteristics of Hot & Cold Learning

Hot & Cold learning has the following characteristics:

- It is simple: we only have to change the weight and compare.
- It is inefficient: we have to predict multiple times to make a single knob update.
- We have to guess the step size: even though we know the correct direction to move, we don't know the correct amount. A fixed step can overshoot the goal or stall short of it.
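The step-size problem can be seen in a small sketch. The helper name `hot_and_cold` and the deliberately coarse step of `0.3` are assumptions chosen for illustration. The update rule mirrors the loop above.

```python
def hot_and_cold(weight, x, goal, step, iterations):
    """Hypothetical helper: returns the final weight and the number of predictions made."""
    predictions_made = 0
    for _ in range(iterations):
        error = (x * weight - goal) ** 2
        error_up = (x * (weight + step) - goal) ** 2
        error_dn = (x * (weight - step) - goal) ** 2
        predictions_made += 3  # one prediction each for current, up, and down
        if error > error_up or error > error_dn:
            weight += step if error_up < error_dn else -step
    return weight, predictions_made

# Same weight, input, and goal as the loop above, but with a coarse step.
weight, n_preds = hot_and_cold(weight=0.5, x=0.5, goal=0.8, step=0.3, iterations=100)
print(weight, n_preds, 0.5 * weight)
```

With this step the weight stalls near `1.7` (a prediction of about `0.85`), because both neighbouring weights give a larger error, even though reaching the goal of `0.8` needs a weight of `1.6`. The sketch also makes three predictions per iteration to decide on a single update.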

## Calculating Both Direction & Amount from Error

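Let's use the **gradient** as a measure of both direction and amount. With $y$ as the target (`goal_pred` in the code below), we know the following:

$$
\begin{aligned}
f(W) &= (WX - y)^2 \\
\frac{\partial f}{\partial W} &= 2(WX - y)X
\end{aligned}
$$

The code below drops the constant factor of $2$, which only rescales the size of each step.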
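As a quick sanity check of that derivative, here is a sketch of the chain-rule step. It assumes the prediction is $WX$ and writes the target as $y$ (the roles played by `pred` and `goal_pred` in the code), then plugs in the starting values $W = 0.5$, $X = 0.5$, $y = 0.8$:

$$
\begin{aligned}
\frac{\partial}{\partial W}(WX - y)^2 &= 2\,(WX - y)\cdot\frac{\partial (WX - y)}{\partial W} = 2\,(WX - y)\,X \\
2\,(0.5 \cdot 0.5 - 0.8)\cdot 0.5 &= 2 \cdot (-0.55) \cdot 0.5 = -0.55
\end{aligned}
$$

Half of this value, $-0.275$, is what the loop prints as its first gradient, since it computes `(pred - goal_pred) * input` without the factor of $2$.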

In [12]:
weight = .5
goal_pred = .8
input = .5

for iteration in range(20):
    pred = input * weight
    error = (pred - goal_pred) ** 2
    direction_and_amount = (pred - goal_pred) * input  # pure error * scale.
    weight -= direction_and_amount
    print(f"Error: {round(error, 6)} | Gradient: {round(direction_and_amount, 3)} | Prediction: {round(input * weight, 3)}")

Error: 0.3025 | Gradient: -0.275 | Prediction: 0.388
Error: 0.170156 | Gradient: -0.206 | Prediction: 0.491
Error: 0.095713 | Gradient: -0.155 | Prediction: 0.568
Error: 0.053839 | Gradient: -0.116 | Prediction: 0.626
Error: 0.030284 | Gradient: -0.087 | Prediction: 0.669
Error: 0.017035 | Gradient: -0.065 | Prediction: 0.702
Error: 0.009582 | Gradient: -0.049 | Prediction: 0.727
Error: 0.00539 | Gradient: -0.037 | Prediction: 0.745
Error: 0.003032 | Gradient: -0.028 | Prediction: 0.759
Error: 0.001705 | Gradient: -0.021 | Prediction: 0.769
Error: 0.000959 | Gradient: -0.015 | Prediction: 0.777
Error: 0.00054 | Gradient: -0.012 | Prediction: 0.783
Error: 0.000304 | Gradient: -0.009 | Prediction: 0.787
Error: 0.000171 | Gradient: -0.007 | Prediction: 0.79
Error: 9.6e-05 | Gradient: -0.005 | Prediction: 0.793
Error: 5.4e-05 | Gradient: -0.004 | Prediction: 0.794
Error: 3e-05 | Gradient: -0.003 | Prediction: 0.796
Error: 1.7e-05 | Gradient: -0.002 | Prediction: 0.797
Error: 1e-05 | Gradie

What we see here is a superior form of learning known as **gradient descent**. GD allows us to calculate both the direction and amount we should change the weight by to reduce error.

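We use scaling, negative reversal, and stopping to translate the pure error (`pred - goal_pred`) into the absolute amount by which we want to change the weight, all by multiplying the pure error by `input`. Stopping: if `input` is zero, there is nothing to learn and the update is zero; likewise, as `error` approaches zero, so does the update. Negative reversal: if `input` is negative, multiplying by it flips the sign of the update so that the weight still moves in the direction that reduces the error. Scaling: the larger the input (or the pure error), the larger the update.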
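The three effects can be illustrated with a minimal sketch. The helper name `weight_update` is hypothetical, and the prediction is held fixed while the input changes, which is an assumption made purely to isolate each effect.

```python
def weight_update(x, pred, goal):
    """Hypothetical helper mirroring `direction_and_amount`: pure error times input."""
    pure_error = pred - goal
    return pure_error * x

# Stopping: a perfect prediction gives a zero update.
stop = weight_update(x=0.5, pred=0.8, goal=0.8)

# Negative reversal: flipping the sign of the input flips the sign of the update.
pos = weight_update(x=0.5, pred=0.25, goal=0.8)
neg = weight_update(x=-0.5, pred=0.25, goal=0.8)

# Scaling: a ten times larger input gives a ten times larger update.
big = weight_update(x=5.0, pred=0.25, goal=0.8)

print(stop, pos, neg, big)
```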

## One Iteration of Gradient Descent

We will perform a weight update on a single training example, an (`input -> true`) pair:

In [14]:
# 1. an empty network.
weight = .1
alpha = .01

def neural_network(X, W):
    prediction = X * W
    return prediction

In [15]:
# 2. PREDICT: Making a Prediction & Evaluating Error.
number_of_toes = [8.5]
win_or_lose_binary = [1]
input = number_of_toes[0]
goal_prediction = win_or_lose_binary[0]

pred = neural_network(input, weight)

error = (pred - goal_prediction) ** 2

print(error)

0.022499999999999975


In [16]:
# 3. COMPARE: Calculating the Node `Delta` and putting it on the output node.
delta = pred - goal_prediction  # pure error.
print(delta)

-0.1499999999999999


Delta is a measure of how much the node missed. In the previous example, the target was `1.0` and the network's prediction was `0.85`, so the network was too low by `0.15`. This is why delta is `-0.15`.

The primary difference between this implementation and hot and cold learning is the new variable *delta*: the raw amount by which the node was too high or too low.

In [17]:
# 4. LEARN
number_of_toes = [8.5]
win_or_lose_binary = [1]

input = number_of_toes[0]
goal_pred = win_or_lose_binary[0]

pred = neural_network(input, weight)
error = (pred - goal_pred) ** 2
delta = pred - goal_pred

weight_delta = input * delta

In [18]:
# 5. LEARN: Updating the weight
alpha = .01  # learning step.
weight -= alpha * weight_delta

`Alpha` lets us control how fast the network learns. If it learns too fast, it can update weights too aggressively and overshoot. If it learns too slowly, it will spend a lot of time and computing resources converging to the optimal weight configuration.
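A rough way to see both failure modes is to repeat the update above with different values of alpha. The helper name `train`, the ten steps, and the three alpha values are assumptions chosen for illustration.

```python
def train(alpha, steps=10, weight=0.1, x=8.5, goal=1.0):
    """Hypothetical helper: repeats the delta-based update and returns the final error."""
    for _ in range(steps):
        delta = x * weight - goal       # pure error
        weight -= alpha * x * delta     # same update as `weight -= alpha * weight_delta`
    return (x * weight - goal) ** 2

for alpha in (0.0001, 0.01, 0.1):
    print(f"alpha: {alpha} | error after 10 steps: {train(alpha)}")
```

With these values, the smallest alpha barely reduces the error, the middle one converges quickly, and the largest overshoots further on every step. Each update multiplies the miss by `1 - alpha * x ** 2`, so an input of `8.5` leaves little room for a large alpha.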

## Learning is Just Reducing Error

Finally, we can modify weights to reduce error:

In [20]:
# putting it all together.
weight, goal_prediction, input = (0, .8, .5)

for iteration in range(5):
    pred = input * weight
    error = (pred - goal_prediction) ** 2
    delta = pred - goal_prediction
    gradient = delta * input
    weight -= gradient
    print(f"Error: {round(error, 5)} | Prediction: {round(pred, 3)}")

Error: 0.64 | Prediction: 0.0
Error: 0.36 | Prediction: 0.2
Error: 0.2025 | Prediction: 0.35
Error: 0.11391 | Prediction: 0.463
Error: 0.06407 | Prediction: 0.547


What we are trying to do is figure out the **right direction and amount** to modify *weight* so that the error goes down. For any `input` and `goal_prediction`, there is an exact relationship between the **weight** and the resulting **error**. We call this mapping the **loss function**:

<div style="text-align:center;"><img style="width:400px" src="static/imgs/04/Loss_function.png" /></div>

The slope (which is the gradient) tells us where the bottom of the bowl (the lowest error) lies, no matter where we are in the bowl: downhill is always the direction opposite to the slope's sign. We can use this slope to help the neural network reduce the error.
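A rough way to see the bowl without a plot is to evaluate the error over a grid of candidate weights. The grid (0.0 to 3.0 in steps of 0.1) is an assumption chosen for illustration; the input and goal are the ones used in the loop above.

```python
x, goal = 0.5, 0.8  # same input and goal as the loop above

# Candidate weights from 0.0 to 3.0 in steps of 0.1.
weights = [i / 10 for i in range(31)]
errors = [(x * w - goal) ** 2 for w in weights]

# The bottom of the bowl is the weight where x * weight == goal.
best_weight = weights[errors.index(min(errors))]
print(best_weight)
```

The error falls as the weight approaches `goal / x` from either side and rises again past it, which is the bowl shape in the figure.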

## Let's Watch several steps of Learning

Our goal is to reach the bottom of the bowl:

In [21]:
weight, goal_prediction, input = 0, .8, 1.1

for iteration in range(4):
    prediction = input * weight
    error = (prediction - goal_prediction) ** 2
    delta = prediction - goal_prediction
    gradient = input * delta
    weight -= gradient
    print(f"Error: {round(error, 5)} | Prediction: {round(prediction, 5)} | Weight: {round(weight, 5)} | Gradient: {round(gradient, 5)}")

Error: 0.64 | Prediction: 0.0 | Weight: 0.88 | Gradient: -0.88
Error: 0.02822 | Prediction: 0.968 | Weight: 0.6952 | Gradient: 0.1848
Error: 0.00124 | Prediction: 0.76472 | Weight: 0.73401 | Gradient: -0.03881
Error: 5e-05 | Prediction: 0.80741 | Weight: 0.72586 | Gradient: 0.00815


Let's visualize the above steps:

<div style="text-align:center;">
    <img style="width:200px" src="static/imgs/04/step-0.png" />
    <img style="width:200px" src="static/imgs/04/step-1.png" />
    <img style="width:200px" src="static/imgs/04/step-2.png" />
    <img style="width:200px" src="static/imgs/04/step-3.png" />
</div>

## Why Does this Work?

Let's back up and talk about functions. What is a function, and how do we understand one?

We start by analyzing the following function:

In [22]:
def my_function(x):
    return x * 2

`my_function` defines a relationship between the input number `x` and the output `my_function(x)`. Functions are powerful because they can take numbers (say pixels) and convert them into other numbers (the probability that the image contains a cat).

Each function has what we call "moving parts": the pieces we can tweak so that the function generates different outputs. The moving parts of a loss function $J_{W}(\hat{y}, y)$ are its inputs, its general formula, and the weights. We can say the same about neural networks.

Now let's study what we can change to drive the overall error down to zero:

- Changing the input: this would be the same as seeing the world as we want to see it instead of as it actually is, which is a bad idea.
- Changing the formula of $J(w)$: the error calculation is meaningless if it doesn't actually give a good measure of how much we missed, so a fake formula that always gives zero won't solve our problem.
- Changing the weights: this is the only option that
    - doesn't change our perception of the world,
    - doesn't change our goal, and
    - doesn't change our error measure.

Changing the weights means that the function conforms to the patterns in the data.

## Tunnel Vision on one Concept

We consider learning as the process of adjusting the weight to reduce the error down to **0**. For us to be able to do that, **we need to understand how changing the weights changes the error**. Specifically, we want to know the direction and amount that error changes when weight changes.

Let's go back to the loss function formulation:

$$J(w) = (w \cdot x - y)^2$$

We can solve this problem since this relationship is deterministic, computable, and continuous.

## A Box with Rods poking out of it

Computing a derivative can be as simple as answering the following question: "When I tug this part, how much does this other part move?"

A derivative always relates two variables. When the derivative is positive, changing one variable moves the other in the same direction. When the derivative is negative, changing one variable moves the other in the opposite direction.

The derivative represents the direction and the amount that one variable changes if we change the other variable.
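The "tug" can be simulated numerically. The sketch below nudges the weight slightly and measures how much the error moves. The function names `error_fn` and `numerical_derivative` and the nudge size are assumptions for illustration; the input `0.5` and goal `0.8` are the values used earlier in the chapter.

```python
def error_fn(weight, x=0.5, goal=0.8):
    """Squared error of a one-weight network for a fixed input and goal."""
    return (x * weight - goal) ** 2

def numerical_derivative(f, w, nudge=1e-6):
    """Central-difference estimate of how much f moves per unit change in w."""
    return (f(w + nudge) - f(w - nudge)) / (2 * nudge)

left = numerical_derivative(error_fn, 0.5)   # weight below the minimum at 1.6
right = numerical_derivative(error_fn, 3.0)  # weight above the minimum at 1.6
print(left, right)
```

Below the minimum the derivative is negative (raising the weight lowers the error), and above it the derivative is positive (raising the weight raises the error). Its magnitude says how strongly the error responds.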

## What You Really Need to Know

With derivatives, we can take any two variables in any formula and know how they interact. A neural network is just a function made of a bunch of weights that computes an output, which is then fed to the error (or loss) function.

So, for any error function, we can compute the relationship between any weight and the final error of the network.

## What you Don't Really Need to Know: Calculus

As practitioners, we don't really need to master calculus. In practice, it mostly comes down to memorizing and practicing derivative rules for the functions we meet. We just need to remember this: "the derivative between two variables represents the sensitivity between them", meaning:

- In which direction one variable moves when we move the other variable.
- How much one variable changes when we change the other variable.

In the context of our error function, a point of zero sensitivity (zero derivative) is where the error stops changing; for our bowl-shaped error, that point is the minimum.

## How to use a derivative to learn

On our bowl-shaped error curve, the slope always points away from the curve's lowest point: it is positive to the right of the minimum and negative to the left. To minimize the error, we need to move in the opposite direction of the slope.

We can take each weight, calculate the derivative of the loss with respect to it, and then change the weight in the opposite direction of that slope. This method of learning (finding error minimums) is called **gradient descent**.

## Breaking Gradient Descent

Let's implement the gradient descent algorithm:

In [27]:
# weight, goal_prediction, input, lr = .5, .8, .5, .5
weight, goal_prediction, input, lr = .5, .8, 2, .5

In [28]:
for iteration in range(15):
    error = ((input * weight) - goal_prediction) ** 2
    gradient = 2 * (input * weight - goal_prediction) * input
    weight -= gradient * lr
    print(f"ERROR: {round(error, 5)} | GRADIENT: {round(gradient, 5)} | PREDICTION: {round(input * weight, 5)}")

ERROR: 0.04 | GRADIENT: 0.8 | PREDICTION: 0.2
ERROR: 0.36 | GRADIENT: -2.4 | PREDICTION: 2.6
ERROR: 3.24 | GRADIENT: 7.2 | PREDICTION: -4.6
ERROR: 29.16 | GRADIENT: -21.6 | PREDICTION: 17.0
ERROR: 262.44 | GRADIENT: 64.8 | PREDICTION: -47.8
ERROR: 2361.96 | GRADIENT: -194.4 | PREDICTION: 146.6
ERROR: 21257.64 | GRADIENT: 583.2 | PREDICTION: -436.6
ERROR: 191318.76 | GRADIENT: -1749.6 | PREDICTION: 1313.0
ERROR: 1721868.84 | GRADIENT: 5248.8 | PREDICTION: -3935.8
ERROR: 15496819.56 | GRADIENT: -15746.4 | PREDICTION: 11810.6
ERROR: 139471376.04 | GRADIENT: 47239.2 | PREDICTION: -35428.6
ERROR: 1255242384.36 | GRADIENT: -141717.6 | PREDICTION: 106289.0
ERROR: 11297181459.24 | GRADIENT: 425152.8 | PREDICTION: -318863.8
ERROR: 101674633133.15994 | GRADIENT: -1275458.4 | PREDICTION: 956594.6
ERROR: 915071698198.4395 | GRADIENT: 3826375.2 | PREDICTION: -2869780.6


When we set the input to 2, learning overshoots because the input is part of the gradient: with a large input, the gradient is big and the new weight overshoots the target. (Here `lr = .5` simply cancels the factor of 2 in the gradient, so the update is the same `delta * input` used earlier.) As a result, the predictions exploded, getting further away from the true answer at every step.

A simple solution to this problem is to scale the gradient by a smaller learning rate and to look for a value that keeps the updates from overshooting.

## Divergence

<div style="text-align:center;"><img style="width:400px" src="static/imgs/04/divergence.png" /></div>

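The explosion in the error was caused by making the input larger. Because the weight update is the pure error multiplied by the input, a large input produces a large weight update even when the error is small, so the network overcorrects and lands further from the goal than where it started.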

## Introducing $\alpha$: the Learning Rate

The learning rate is the simplest way to prevent overcorrecting weight updates.

Divergence happens when each update overshoots so far that the new derivative is even larger in magnitude than the one we started with. The solution is to multiply the weight update by a fraction to make it smaller. This involves multiplying the weight update by a single real-valued number, usually between $0$ and $1$. Finding the appropriate $\alpha$, even for state-of-the-art neural networks, is often done by guessing.
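For this one-weight network, the effect of alpha can be worked out directly. Assuming the gradient is `2 * (prediction - goal) * input`, substituting the update into the next prediction shows that each step multiplies the miss `prediction - goal` by `1 - 2 * alpha * input ** 2`. The sketch below computes that factor; the helper name `miss_factor` is hypothetical.

```python
def miss_factor(alpha, x):
    """Factor by which (prediction - goal) is multiplied on each update,
    assuming gradient = 2 * (prediction - goal) * x."""
    return 1 - 2 * alpha * x ** 2

for alpha in (0.5, 0.1):
    factor = miss_factor(alpha, x=2)
    verdict = "converges" if abs(factor) < 1 else "does not converge"
    print(f"alpha: {alpha} | factor: {factor:.2f} | {verdict}")
```

With an input of 2, a step size of `0.5` gives a factor of `-3`, so the error grows ninefold per step, as in the diverging run above (`0.04`, `0.36`, `3.24`). A step size of `0.1` gives a factor of about `0.2`, so the miss shrinks at every step.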

## Alpha in Code

Let's add a learning rate to the gradient descent algorithm: 

In [29]:
# implementation.
weight, goal_pred, input, alpha = .5, .8, 2, .1

for iteration in range(7):
    prediction = input * weight
    error = (prediction - goal_pred) ** 2
    gradient = input * 2 * (prediction - goal_pred)
    weight = weight - alpha * gradient
    print(f"ERROR: {round(error, 5)} | PREDICTION: {round(prediction, 5)}")

ERROR: 0.04 | PREDICTION: 1.0
ERROR: 0.0016 | PREDICTION: 0.84
ERROR: 6e-05 | PREDICTION: 0.808
ERROR: 0.0 | PREDICTION: 0.8016
ERROR: 0.0 | PREDICTION: 0.80032
ERROR: 0.0 | PREDICTION: 0.80006
ERROR: 0.0 | PREDICTION: 0.80001


Coming up with a value for the learning rate is largely a matter of **guessing**. Despite all the advancements of deep learning in the past few years, most people just try several orders of magnitude of alpha (10, 1, .1, .01, .001, ...) and tweak the value from there to see what works best. In this respect, neural network design is more art than science.
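That trial-and-error search can be sketched as a simple sweep over the listed orders of magnitude. The helper name `final_error` and the choice of 20 steps are assumptions for illustration; the starting weight, input, and goal match the loop above.

```python
def final_error(alpha, steps=20, weight=0.5, x=2, goal=0.8):
    """Hypothetical helper: runs gradient descent and returns the last error."""
    for _ in range(steps):
        prediction = x * weight
        gradient = 2 * (prediction - goal) * x
        weight -= alpha * gradient
    return (x * weight - goal) ** 2

candidates = [10, 1, 0.1, 0.01, 0.001]
results = {alpha: final_error(alpha) for alpha in candidates}
best_alpha = min(results, key=results.get)

for alpha, err in results.items():
    print(f"alpha: {alpha} | final error: {err:.3g}")
print(f"best: {best_alpha}")
```

For this particular input, the two largest values diverge and the two smallest converge slowly, so the sweep settles on the middle value. A different input or a different number of steps could change which candidate wins.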

## Memorizing

Let's re-code the algorithm on our own:

In [30]:
x, w, y_hat, lr = .1, .3, 1, 10  # y_hat is the target; learning rates tried: .1, .3, 1, and 10 (used here).

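In [32]:
# Note: the output below is from a second run of this cell, which continued from the
# weight left by the first run (w ≈ 7.96576), so it does not start at w = .3.
for iteration in range(7):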
    pred = x * w
    error = (pred - y_hat) ** 2
    gradient = 2 * x * (pred - y_hat)
    w = w - lr * gradient
    print(f"ERROR: {round(error, 5)} | WEIGHT: {round(w, 5)} | PREDICTION: {round(pred, 5)}")

ERROR: 0.04138 | WEIGHT: 8.37261 | PREDICTION: 0.79658
ERROR: 0.02648 | WEIGHT: 8.69809 | PREDICTION: 0.83726
ERROR: 0.01695 | WEIGHT: 8.95847 | PREDICTION: 0.86981
ERROR: 0.01085 | WEIGHT: 9.16678 | PREDICTION: 0.89585
ERROR: 0.00694 | WEIGHT: 9.33342 | PREDICTION: 0.91668
ERROR: 0.00444 | WEIGHT: 9.46674 | PREDICTION: 0.93334
ERROR: 0.00284 | WEIGHT: 9.57339 | PREDICTION: 0.94667


# Sketches

<div style="text-align:center;">
    <img style="width:333px" src="static/imgs/04/Loss_function.jpg" />
    <img style="width:333px" src="static/imgs/04/derivatives.jpg" /><br>
    <img style="width:333px" src="static/imgs/04/derivatives_2.jpg" />
    <img style="width:333px" src="static/imgs/04/dependence.jpg" /><br>
    <img style="width:333px" src="static/imgs/04/model_and_loss.jpg" />
    <img style="width:333px" src="static/imgs/04/optimization.jpg" />
</div>

---