# Introduction to Neural Learning: Gradient Descent

In this Chapter:

- Do Neural Networks make Accurate predictions ?
- Why Measure Error ?
- Hot and Cold Learning.
- Calculating Both direction & Amount from Error.
- Gradient Descent.
- Learning is just reducing error.
- Derivatives and how to use them to learn.
- Divergence and Alpha.

> "The Only Relevant test of the Validity of a Hypothesis is Comparison of its predictions with experience."

## Predict, Compare, & Learn

- "How do we set weight values so the network predicts accurately ?"
    - Answering this Question is the main focus of this Chapter.

## Compare
### Comparing gives a measurement of how much a prediction "missed" by

- Once you've made a prediction, the next step is to evaluate how well you did.
- Coming up with a good way to measure error is one of the most **important** and **complicated** subjects of deep learning.
- Error is never negative: it is positive for any miss and zero only for a perfect prediction.
- In this Chapter, we evaluate only one simple way of measuring error: *mean squared error*.
- This Step will Give you a sense of how much you missed.
    - But that isn't enough to be able to learn.
- The Output of the Compare Logic is a "Hot or Cold" type Signal.
    - Given Some Prediction, you'll calculate an error measure that says either "A Lot" or "A Little".
- It won't tell you why you missed, what direction you missed, or what you should do to fix the error.
- It more or less says "Big miss", "Little miss", "Perfect Prediction".

## Learn
### Learning Tells each weight How it can change to reduce the Error.

- Learning is all about **Error Attribution**.
- Or **How each weight played its part in creating Error**.
    - It's the blame game of deep learning.
- At the end of the day, it results in computing a number for each weight.

## Compare: Does your Network make good predictions?
### Let's Measure the Error & Find out!

In [3]:
knob_weight = .5
input = .5

# A number you recorded in the real world somewhere.
goal_prediction = .8

pred = knob_weight * input

# mean squared error over one data point.
error = (pred - goal_prediction) ** 2  # ^2 will ignore very small errors and will amplify big errors [negative or positive]

print(error)

0.30250000000000005


- Why is the Error Squared ?
    - It forces the Output to be positive.
- Doesn't Squaring make big errors (>1) bigger and small errors (<1) Smaller ?
    - Yeah, It's kind of a weird way of measuring error, but it turns out that amplifying big errors and reducing small errors is OK.
    - You'd rather it pay attention to the big errors & not worry so much about the small ones.

## Why Measure Error ?

### Measuring Error Simplifies the Problem

- Changing `knob_weight` to make the network correctly predict `goal_prediction` is slightly more complicated than changing `knob_weight` to make `error == 0`.

### Different ways of measuring error prioritize error differently

- As mentioned in the previous example, squaring the error prioritizes big errors by amplifying them and de-emphasizes small errors by shrinking them.
- If you took the `absolute value` instead of squaring the error, you wouldn't have this type of prioritization.
    - The Error would be just the positive version of the pure error, which would be fine but different.
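
As a rough illustration of that difference, the sketch below compares squared error with absolute error on one small miss and one large miss. The helper names `squared_error` and `absolute_error` are hypothetical and exist only for this example.

```python
def squared_error(pred, goal):
    """Squared miss: shrinks misses below 1 and amplifies misses above 1."""
    return (pred - goal) ** 2


def absolute_error(pred, goal):
    """Absolute miss: the positive version of the pure error."""
    return abs(pred - goal)


goal = 1.0
for pred in (1.5, 3.0):  # a small miss (0.5) and a large miss (2.0)
    print(pred, squared_error(pred, goal), absolute_error(pred, goal))
```

The small miss of 0.5 shrinks to 0.25 when squared, while the large miss of 2.0 grows to 4.0. The absolute error leaves both untouched.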


### Why do you want only positive error ?

- You'll try to take the average error down to zero.
- If you have two errors, one of -1,000 and the other of +1,000, the average is **zero**.
    - You'd fool yourself into thinking you predicted perfectly, when you missed by 1,000 each time.
    - You want errors to be positive so that they don't cancel each other out.
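
The cancellation is easy to see with two made-up predictions. This is a minimal sketch, and the numbers are hypothetical rather than taken from any dataset.

```python
goal = 0.0
predictions = [1000.0, -1000.0]  # hypothetical predictions that each miss by 1,000

signed_errors = [pred - goal for pred in predictions]
squared_errors = [(pred - goal) ** 2 for pred in predictions]

mean_signed = sum(signed_errors) / len(signed_errors)
mean_squared = sum(squared_errors) / len(squared_errors)

print(mean_signed)   # 0.0: the two misses cancel out
print(mean_squared)  # 1000000.0: the misses are still visible
```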

## What's the Simplest form of Neural Learning ?

### Learning using the Hot and Cold Method

- How do you know whether to turn the knob up or down ?
    - **you try both Up and Down and see which one reduces the Error.**
- Hot & Cold learning means wiggling the weights to see which direction reduces the error the most, moving the weights in that direction, and repeating until the error gets to 0.

In [6]:
# 1. An Empty Network.
weight = .1
lr = .01
def neural_network(input, weight):
    prediction = input * weight
    return prediction

In [7]:
# 1. PREDICT: Making a Prediction & Evaluating Error.
number_of_toes = [8.5]
win_or_loss_binary = [1]

input = number_of_toes[0]
true = win_or_loss_binary[0]

pred = neural_network(input, weight)

error = (pred - true) ** 2  # ^2 forces the error to be positive.
print(error)

0.022499999999999975


In [8]:
# 2. COMPARE: Making a Prediction with a higher weight and evaluating error.
p_up = neural_network(input, weight+lr)

e_up = (p_up - true) ** 2 

print(e_up)

0.004224999999999993


In [9]:
# 3. COMPARE: Making a Prediction with a lower weight and evaluating error.
p_dn = neural_network(input, weight-lr)

e_dn = (p_dn - true) ** 2

print(e_dn)

0.05522499999999994


In [13]:
# 4. COMPARE ERROR & SET NEW WEIGHT
if error > e_up or error > e_dn:  # only move if one direction reduces the error
    if e_up < e_dn:
        weight += lr
    else:
        weight -= lr

- Together, the steps above make up one iteration of hot and cold learning.
- Some People have to train their networks for **Weeks or Months before they find a good enough weight configuration**.
- This Reveals what learning in neural networks really is: **A Search Problem**.
    - You're searching for the best possible weight configuration that results in the lowest error possible.

## Hot & Cold Learning
### This is perhaps the simplest form of learning

In [22]:
weight = .5
input = .5
goal_prediction = .8

step_amount = .001

for iteration in range(10000):
    prediction = input * weight
    
    # calculate error.
    error = (prediction - goal_prediction) ** 2
    
    # try up.
    p_up = input * (weight+step_amount)
    error_up = (p_up - goal_prediction) ** 2
    
    # try down.
    p_dn = input * (weight-step_amount)
    error_dn = (p_dn - goal_prediction) ** 2
    
    if error > error_up or error > error_dn:
        if error_up < error_dn:
            weight += step_amount
            print("Error : " + str(error_up) + " | Prediction : " + str(input * weight))
        else:
            weight -= step_amount
            print("Error : " + str(error_dn) + " | Prediction : " + str(input * weight))

Error : 0.3019502500000001 | Prediction : 0.2505
Error : 0.30140100000000003 | Prediction : 0.251
Error : 0.30085225 | Prediction : 0.2515
Error : 0.30030400000000007 | Prediction : 0.252
Error : 0.2997562500000001 | Prediction : 0.2525
Error : 0.29920900000000006 | Prediction : 0.253
Error : 0.29866224999999996 | Prediction : 0.2535
Error : 0.29811600000000005 | Prediction : 0.254
Error : 0.2975702500000001 | Prediction : 0.2545
Error : 0.29702500000000004 | Prediction : 0.255
Error : 0.29648025 | Prediction : 0.2555
Error : 0.29593600000000003 | Prediction : 0.256
Error : 0.2953922500000001 | Prediction : 0.2565
Error : 0.294849 | Prediction : 0.257
Error : 0.29430625 | Prediction : 0.2575
Error : 0.293764 | Prediction : 0.258
Error : 0.2932222500000001 | Prediction : 0.2585
Error : 0.292681 | Prediction : 0.259
Error : 0.29214025 | Prediction : 0.2595
Error : 0.2916 | Prediction : 0.26
Error : 0.2910602500000001 | Prediction : 0.2605
Error : 0.29052100000000003 | Prediction : 0.261


## Characteristics of Hot & Cold Learning
### It's Simple

- Hot and Cold Learning is Simple.
- **Problem 1**: It's inefficient.
    - You have to predict multiple times to make a single knob update, which seems very inefficient.
- **Problem 2**: Sometimes it's impossible to predict the exact goal prediction.
    - Even though you know the correct direction to move `weight`, you don't know the correct amount, so you move by a fixed `step_amount` and can step past the goal.
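
Both problems show up in a small experiment. The sketch below reruns hot and cold learning with a deliberately coarse `step_amount` of 0.3, a value assumed purely for illustration, and counts how many predictions it makes along the way. The name `input_value` stands in for `input` to avoid shadowing the Python built-in.

```python
weight, input_value, goal_prediction = 0.5, 0.5, 0.8
step_amount = 0.3  # deliberately coarse step (an assumption for this demo)
predictions_made = 0

for iteration in range(20):
    error = (input_value * weight - goal_prediction) ** 2
    error_up = (input_value * (weight + step_amount) - goal_prediction) ** 2
    error_dn = (input_value * (weight - step_amount) - goal_prediction) ** 2
    predictions_made += 3  # current, up, and down predictions

    if error_up < error and error_up < error_dn:
        weight += step_amount
    elif error_dn < error:
        weight -= step_amount
    else:
        break  # neither direction helps, so learning stalls

print(weight, input_value * weight, predictions_made)
```

With these values the weight stalls near 1.7, which gives a prediction near 0.85 instead of 0.8, because stepping either way by 0.3 makes the error worse. Every iteration also costs three predictions for at most one update.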

## Calculating Both Direction & Amount from Error
### Let's Measure the error and find the direction and amount

In [27]:
weight = .5
goal_pred = .8
input = .5

for iteration in range(20):
    pred = input * weight
    error = (pred - goal_pred) ** 2
    direction_and_amount = (pred - goal_pred) * input  # pure error * scale.
    weight -= direction_and_amount
    print("Error: " + str(error) + " Gradient : " + str(direction_and_amount) + " Prediction: " + str(input*weight))

Error: 0.30250000000000005 Gradient : -0.275 Prediction: 0.3875
Error: 0.17015625000000004 Gradient : -0.20625000000000002 Prediction: 0.49062500000000003
Error: 0.095712890625 Gradient : -0.1546875 Prediction: 0.56796875
Error: 0.05383850097656251 Gradient : -0.11601562500000001 Prediction: 0.6259765625
Error: 0.03028415679931642 Gradient : -0.08701171875000002 Prediction: 0.669482421875
Error: 0.0170348381996155 Gradient : -0.06525878906250004 Prediction: 0.70211181640625
Error: 0.00958209648728372 Gradient : -0.04894409179687503 Prediction: 0.7265838623046875
Error: 0.005389929274097089 Gradient : -0.03670806884765626 Prediction: 0.7449378967285156
Error: 0.0030318352166796153 Gradient : -0.02753105163574221 Prediction: 0.7587034225463867
Error: 0.0017054073093822882 Gradient : -0.020648288726806685 Prediction: 0.76902756690979
Error: 0.0009592916115275371 Gradient : -0.015486216545105014 Prediction: 0.7767706751823426
Error: 0.0005396015314842384 Gradient : -0.011614662408828746 Pr

- What you see here is a superior form of learning known as *gradient descent*.
    - This method allows you to calculate both the direction & amount you should change *weight* to reduce *error*.
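- What are Scaling, Negative reversal, & Stopping ?
    - They are the three effects of multiplying the pure error by `input`. Combined, they translate the pure error into the absolute amount you want to change `weight`.
- What is Stopping ?
    - When `input` is close to zero, the gradient is also close to zero, so learning effectively stops.
- What is Negative reversal ?
    - When `input` is negative, multiplying by it flips the sign of the pure error, so `weight` still moves in the direction that reduces the error.
- What is Scaling ?
    - Logically, if your input is big, your weight update should also be big.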
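
The three effects can be seen by holding the pure error fixed and varying only the input. This is a minimal sketch: the function name mirrors the `direction_and_amount` variable from the loop above, and the pure error of -0.5 is a hypothetical value.

```python
def direction_and_amount(pure_error, input_value):
    """Scale the pure error by the input, as in the loop above."""
    return pure_error * input_value


pure_error = -0.5  # hypothetical miss: the prediction was 0.5 too low

print(direction_and_amount(pure_error, 0.0))   # stopping: zero input gives a zero update
print(direction_and_amount(pure_error, 2.0))   # scaling: -1.0
print(direction_and_amount(pure_error, 4.0))   # scaling: -2.0, twice as large
print(direction_and_amount(pure_error, -2.0))  # negative reversal: sign flips to 1.0
```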

## One Iteration of Gradient Descent
### This performs a weight update on a single training example (input -> true) pair

In [2]:
# 1. an empty network.
weight = .1
alpha = .01
def neural_network(input, weight):
    prediction = input * weight
    return prediction

In [29]:
# 2. PREDICT: Making a Prediction & Evaluating Error.
number_of_toes = [8.5]
win_or_lose_binary = [1]
input = number_of_toes[0]
goal_prediction = win_or_lose_binary[0]

pred = neural_network(input, weight)

error = (pred - goal_prediction) ** 2

print(error)

0.022499999999999975


In [30]:
# 3. COMPARE: Calculating the Node `Delta` and putting it on the output node.
delta = pred - goal_prediction  # pure error.
print(delta)

-0.1499999999999999


- Delta is a measurement of how much this node has missed.
    - the goal prediction is `1.0` & the network's prediction was `.85`.
    - so the network was too low by `0.15`.
    - thus delta is negative `.15`.
- The primary difference between the gradient descent loop in the previous section & this implementation is the new variable *delta*.
    - It's the Raw amount that the node was too high or too low.

In [5]:
# 4. LEARN
number_of_toes = [8.5]
win_or_lose_binary = [1]

input = number_of_toes[0]
goal_pred = win_or_lose_binary[0]

pred = neural_network(input, weight)
error = (pred - goal_pred) ** 2
delta = pred - goal_pred

weight_delta = input * delta

In [7]:
# 5. LEARN: Updating the weight
alpha = .01  # learning step.
weight -= alpha * weight_delta

- `alpha` lets you control how fast the network learns.
    - If it learns too fast, It can update weights too aggressively & overshoot.
    - If it learns too slow, It will spend a lot of time & computing resources to converge to the optimal weight configuration.

## Learning is Just Reducing Error
### You can modify weights to Reduce Error

In [9]:
# putting it all together.
weight, goal_prediction, input = (0, .8, .5)

for iteration in range(4):
    pred = input * weight
    error = (pred - goal_prediction) ** 2
    delta = pred - goal_prediction
    gradient = delta * input
    weight -= gradient
    print("Error : " + str(error) + " Prediction : " + str(pred))

Error : 0.6400000000000001 Prediction : 0.0
Error : 0.3600000000000001 Prediction : 0.2
Error : 0.2025 Prediction : 0.35000000000000003
Error : 0.11390625000000001 Prediction : 0.4625


- All you're trying to do is figure out the right direction and amount to modify *weight* so that error goes down.
- For any `input` & `goal_pred`, an exact *relationship* is defined between `error` & `weight`.
- This exact mapping between *weight* and *error* can be plotted:

<div style="text-align:center;"><img style="width:500px" src="static/imgs/04/Loss_function.png" /></div>

- **The sign of the slope tells you which way the bottom of the bowl (lowest error) lies, no matter where you are in the bowl.**
    - You can use this slope to help the neural network reduce the error.
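
To make the bowl shape concrete, the sketch below tabulates the error for a handful of weights, using the `input` of 0.5 and goal of 0.8 from the cell above. The helper `error_for` and the list of candidate weights are assumptions made for this example.

```python
input_value, goal_pred = 0.5, 0.8


def error_for(weight):
    """Squared error of the one-weight network for a given weight."""
    return (input_value * weight - goal_pred) ** 2


weights = [0.0, 0.8, 1.6, 2.4, 3.2]
errors = [error_for(w) for w in weights]
for w, e in zip(weights, errors):
    print("Weight : " + str(w) + " | Error : " + str(e))

# The weight with the lowest error sits at the bottom of the bowl.
best_weight = weights[errors.index(min(errors))]
print(best_weight)
```

The error falls as the weight approaches 1.6 (where `input * weight` equals the goal) and rises again on the other side, which is the bowl in the plot.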

## Let's Watch several steps of Learning
### Will we eventually find the Bottom of the Bowl?

In [12]:
weight, goal_prediction, input = (0, .8, 1.1)

for iteration in range(4):
    prediction = input * weight
    error = (prediction - goal_prediction) ** 2
    delta = prediction - goal_prediction
    gradient = input * delta
    weight -= gradient
    print("Error : " + str(error) + " | Prediction : " + str(prediction) + " | Weight : " + str(weight) + " | gradient : " + str(gradient))

Error : 0.6400000000000001 | Prediction : 0.0 | Weight : 0.8800000000000001 | gradient : -0.8800000000000001
Error : 0.02822400000000005 | Prediction : 0.9680000000000002 | Weight : 0.6951999999999999 | gradient : 0.1848000000000002
Error : 0.0012446784000000064 | Prediction : 0.76472 | Weight : 0.734008 | gradient : -0.0388080000000001
Error : 5.4890317439999896e-05 | Prediction : 0.8074088 | Weight : 0.72585832 | gradient : 0.008149679999999992


- A Visualization of the Above Stats:

<div style="text-align:center;">
    <img style="width:300px" src="static/imgs/04/step-0.png" />
    <img style="width:300px" src="static/imgs/04/step-1.png" />
    <img style="width:300px" src="static/imgs/04/step-2.png" />
    <img style="width:300px" src="static/imgs/04/step-3.png" />
</div>

## Why Does this Work? What is `weight_delta`, really?
### Let's back up & talk about functions. What is a Function & How do you understand One? 

- consider this function:

In [13]:
def my_function(x):
    return x * 2

- The function defines some sort of relationship between the input number & the output number.
- You can see why the ability to learn a function is so powerful:
    - It lets you take some numbers (say pixels) & convert them into other numbers (the probability that the image contains a cat).
- Each function has what you might call `moving parts` 
    - pieces you can tweak so that the function generates different outputs.
- What's controlling the relationship between the input and output of the loss function is the general formula of $J(w)$ & the weights.
    - same can be said about the neural network.
- Thought experiment: Consider changing `goal_pred` to reduce the error.
    - This is silly, but totally doable.
    - You might call this "Giving up".
        - Or setting goals to whatever your capabilities are.
- What if you changed Input until error is zero ?
    - That's akin to seeing the world as you want to see it instead of as it actually is.
        - this is loosely how inceptionism works.
- Now consider changing the formula of the error function (or the loss function)
    - the error calculation is meaningless if it doesn't actually give a good measure of how much you missed.
        - A Fake Formula that gives zeros won't do either.
- changing the weights doesn't change your perception of the world.
    - doesn't change your goal.
    - doesn't destroy your error measure.
- **changing weight means the function conforms to the patterns of the data**.

## Tunnel Vision on one Concept
### Concept: Learning is adjusting the weight to reduce the error to 0

- You need to understand the relationship between Weight & Error (AKA. Understand the Loss Function).
- **To understand the relationship between two variables is to understand how changing one variable changes the other**.
    - You're after the sensitivity between the two variables.
- You want to know the direction & amount that error changes when weight changes.
- Let's go back to the Loss Function Formulation:
$$J(w) = (w*x - y)^{2}$$
- This relationship is Exact, it's Computable, It's Universal.
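
For reference, differentiating this loss with respect to $w$ (holding $x$ and $y$ fixed) with the chain rule gives the sensitivity of error to weight:

$$\frac{dJ}{dw} = 2\,(w*x - y)\,x$$

Here $(w*x - y)$ is `delta` and $x$ is `input`, so up to the constant factor of 2 this is the `weight_delta = input * delta` computed in the earlier cells. The later cells that write `gradient = 2 * (input * weight - goal_prediction) * input` include the factor explicitly.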

## A Box with Rods poking out of it

- *Derivative*: "When I tug on this part, how much does this other part move?"
- You always have the derivative between **two variables**
    - How one variable moves when you change another one.
- When the Derivative is positive, then when you change one variable, the other will move in the same direction.
- If the derivative is negative, then when you change one variable, the other will move in the opposite direction.
- The derivative represents the direction and the amount that one variable changes if you change the other variable.

## What You Really Need to Know
### With derivatives, you can take any two variables in any formula, and know how they interact

- For any function, you can pick any two variables and understand their relationship to each other.
- **A Neural Network is really just one thing: a bunch of weights you use to compute an error function**.
- For any Error Function, you can compute the relationship between any weight and the final error of the network.

## What you Don't Really Need to Know
### Calculus

- Calculus is just about memorizing and practicing every possible derivative rule for every possible function.
- Just Remember this: the derivative between two variables represents the **sensitivity** between them, meaning:
    - In Which Direction One variable moves when you move the other variable.
    - How much One variable changes when you change the other variable.
- Zero sensitivity (a derivative of zero) means the curve is flat at that point, which is what you find at a local optimum.

## How to use a derivative to learn
### *weight_delta* is your derivative

- The slope of a line or curve always points in the opposite direction of the lowest point of the line or curve.
- To minimize error, you move in the opposite direction of the slope.
- you can take each weight value, calculate its derivative with respect to *error* and then change weight in the opposite direction of that slope.
- This Method of learning (finding error minimums) is called **gradient descent**.
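
The claim that the gradient is the slope of the error curve can be checked numerically. The sketch below compares the formula used in this chapter with a central-difference estimate, which nudges the weight slightly in both directions and measures how the error changes. The helper names and the step `h` are assumptions for this example.

```python
input_value, goal_pred = 0.5, 0.8


def error_for(weight):
    return (input_value * weight - goal_pred) ** 2


def analytic_slope(weight):
    # derivative of (input * weight - goal) ** 2 with respect to weight
    return 2 * (input_value * weight - goal_pred) * input_value


def numeric_slope(weight, h=1e-5):
    # central difference: nudge the weight both ways and measure the change in error
    return (error_for(weight + h) - error_for(weight - h)) / (2 * h)


for weight in (0.5, 1.6, 3.0):
    print(weight, analytic_slope(weight), numeric_slope(weight))
```

Below the best weight the slope is negative, so subtracting it raises the weight. Above it the slope is positive, so subtracting it lowers the weight.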

## Breaking Gradient Descent
### Just give me the code

In [68]:
# weight, goal_prediction, input, lr = .5, .8, .5, .5
weight, goal_prediction, input, lr = .5, .8, 2, .5

In [71]:
for iteration in range(15):
    error = ((input * weight) - goal_prediction) ** 2
    gradient = 2 * (input * weight - goal_prediction) * input
    weight -= gradient * lr
    print("ERROR : " + str(error) + " GRADIENT : " + str(gradient) + " PREDICTION : " + str(input * weight))

ERROR : 1.6956463310086472e+27 GRADIENT : 164712905675719.16 PREDICTION : -123534679256788.56
ERROR : 1.5260816979077823e+28 GRADIENT : -494138717027157.44 PREDICTION : 370604037770368.9
ERROR : 1.373473528117004e+29 GRADIENT : 1482416151081472.2 PREDICTION : -1111812113311103.4
ERROR : 1.2361261753053034e+30 GRADIENT : -4447248453244416.5 PREDICTION : 3335436339933313.0
ERROR : 1.112513557774773e+31 GRADIENT : 1.3341745359733248e+16 PREDICTION : -1.0006309019799936e+16
ERROR : 1.0012622019972956e+32 GRADIENT : -4.002523607919974e+16 PREDICTION : 3.001892705939981e+16
ERROR : 9.01135981797566e+32 GRADIENT : 1.2007570823759923e+17 PREDICTION : -9.005678117819942e+16
ERROR : 8.110223836178094e+33 GRADIENT : -3.602271247127977e+17 PREDICTION : 2.7017034353459827e+17
ERROR : 7.299201452560284e+34 GRADIENT : 1.0806813741383931e+18 PREDICTION : -8.105110306037948e+17
ERROR : 6.5692813073042565e+35 GRADIENT : -3.2420441224151793e+18 PREDICTION : 2.4315330918113843e+18
ERROR : 5.91235317657383

- When we set `input` to $2$, learning overshoots: `input` is part of the gradient, so a large input produces a large gradient, and the new weight overshoots the goal.
- The predictions exploded, alternating between negative and positive values.
- They get farther away from the true answer at every step.
- In other words, every update to the weight overcorrects.
- A simple solution to this problem is to scale the gradient by a learning rate (`lr` in the code above, called alpha below) and pick a value small enough that the updates stop overshooting.

## Divergence
### Sometimes neural networks explode in values, oops?

<div style="text-align:center;"><img style="width:666px" src="static/imgs/04/divergence.png" /></div>

- The explosion in the error was caused by the fact that you made the input larger.
- The network overcorrects when the weight update is large while the error is small.

## Introducing Alpha
### It's the simplest way to prevent overcorrecting weight updates 

- What's the Problem you're trying to solve ?
    - If the Input is too big, then the weight update can overcorrect.
- What's the Symptom ?
    - When you overcorrect, the new derivative is even larger in magnitude than when you started.
- The **Solution is to multiply the weight update by a fraction to make it smaller**.
- This involves multiplying the weight update by a single real-valued number between 0 and 1.
- Finding the appropriate alpha, even for state-of-the-art neural networks, is often done by guessing.
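
For this one-weight network, a little algebra shows why input and alpha together decide whether learning overshoots. With `gradient = 2 * input * (prediction - goal)`, each update multiplies the miss `(prediction - goal)` by `1 - 2 * alpha * input ** 2`. The sketch below evaluates that factor. The helper name `miss_multiplier` is hypothetical, and the formula applies only to this single-weight, squared-error setup.

```python
def miss_multiplier(input_value, alpha):
    """Factor applied to (prediction - goal) by one update of the
    single-weight network with gradient = 2 * input * (prediction - goal)."""
    return 1 - 2 * alpha * input_value ** 2


for input_value, alpha in [(0.5, 0.5), (2, 0.5), (2, 0.1)]:
    factor = miss_multiplier(input_value, alpha)
    verdict = "converges" if abs(factor) < 1 else "does not converge"
    print(input_value, alpha, factor, verdict)
```

With `input = 2` and a learning rate of 0.5 the factor is -3, so the miss flips sign and triples at every step, which is the explosion seen in the broken run. Dropping alpha to 0.1 brings the factor to roughly 0.2, so the miss shrinks fivefold per step.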


## Alpha in Code
### Where does our alpha parameter come into play?

In [3]:
# implementation.
weight, goal_pred, input, alpha = .5, .8, 2, .1

for iteration in range(7):
    prediction = input * weight
    error = (prediction - goal_pred) ** 2
    gradient = input * 2 * (prediction - goal_pred)  # But in most cases we don't know the derivative of the Loss function.
    weight = weight - alpha * gradient
    print("ERROR : " + str(error) + " PREDICTION : " + str(prediction))

ERROR : 0.03999999999999998 PREDICTION : 1.0
ERROR : 0.001600000000000003 PREDICTION : 0.8400000000000001
ERROR : 6.400000000000012e-05 PREDICTION : 0.808
ERROR : 2.5600000000001466e-06 PREDICTION : 0.8016000000000001
ERROR : 1.0239999999999165e-07 PREDICTION : 0.80032
ERROR : 4.095999999993982e-09 PREDICTION : 0.800064
ERROR : 1.6384000000089615e-10 PREDICTION : 0.8000128000000001


- How did I know to set Alpha to 0.1?
    - **Honestly, I tried it, and it worked**.
- Despite all the crazy advancements of deep learning in the past few years, most people just try several orders of magnitudes of alpha:
    - 10, 1, .1, .01, .001, .0001 ..
    - & then tweak it from there to see what works best.
- **It's more Art than Science.**
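
The trial-and-error search can be sketched as a small sweep over those orders of magnitude. The helper `final_error`, the choice of 20 iterations, and the name `input_value` are assumptions for this example. The starting weight, goal, and input match the cell above.

```python
def final_error(alpha, weight=0.5, goal_pred=0.8, input_value=2, iterations=20):
    """Run a few gradient descent steps and return the remaining squared error."""
    for _ in range(iterations):
        prediction = input_value * weight
        gradient = 2 * input_value * (prediction - goal_pred)
        weight = weight - alpha * gradient
    return (input_value * weight - goal_pred) ** 2


candidate_alphas = [10, 1, .1, .01, .001, .0001]
results = {alpha: final_error(alpha) for alpha in candidate_alphas}
for alpha, err in results.items():
    print("ALPHA : " + str(alpha) + " ERROR : " + str(err))

# Pick the alpha that leaves the smallest error after the trial run.
best_alpha = min(results, key=results.get)
print(best_alpha)
```

For this setup the two largest values diverge, the smallest ones barely move the weight in 20 steps, and 0.1 ends with the lowest error.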

## Memorizing
### It's time to really learn this stuff!

- Code it on your Own!

In [23]:
x, w, y_hat, lr = .1, .3, 1, 10  # tested .1, .3, 1, & 10 worked as a learning rate.

In [25]:
for iteration in range(7):
    pred = x * w
    error = (pred - y_hat) ** 2
    gradient = 2 * x * (pred - y_hat)
    w = w - lr * gradient
    print("ERROR : " + str(error) + " WEIGHT : " + str(w) + " PREDICTION : " + str(pred))

ERROR : 0.04138121962297752 WEIGHT : 8.372610048 PREDICTION : 0.796576256
ERROR : 0.026483980558705593 WEIGHT : 8.6980880384 PREDICTION : 0.8372610048000001
ERROR : 0.01694974755757159 WEIGHT : 8.95847043072 PREDICTION : 0.86980880384
ERROR : 0.010847838436845818 WEIGHT : 9.166776344576 PREDICTION : 0.895847043072
ERROR : 0.006942616599581312 WEIGHT : 9.3334210756608 PREDICTION : 0.9166776344576001
ERROR : 0.004443274623732043 WEIGHT : 9.46673686052864 PREDICTION : 0.93334210756608
ERROR : 0.002843695759188517 WEIGHT : 9.573389488422912 PREDICTION : 0.946673686052864


# Sketches

<div style="text-align:center;">
    <img style="width:333px" src="static/imgs/04/Loss_function.jpg" />
    <img style="width:333px" src="static/imgs/04/derivatives.jpg" /><br>
    <img style="width:333px" src="static/imgs/04/derivatives_2.jpg" />
    <img style="width:333px" src="static/imgs/04/dependence.jpg" /><br>
    <img style="width:333px" src="static/imgs/04/model_and_loss.jpg" />
    <img style="width:333px" src="static/imgs/04/optimization.jpg" />
</div>