# Optimizing a neural network with backward propagation

# (1) The Need for Optimization

## Predictions with multiple points

- Making accurate predictions gets harder with more points
- At any set of weights, there are many values of the error
- ... corresponding to the many points we make predictions for 

## Loss Function 

- Aggregates errors in predictions from many data points into a single number
- A measure of the model's predictive performance

## Squared error loss function

| Prediction | Actual | Error | Squared Error |
| :- | :- | :- | :- |
| 10 | 20 | -10 | 100 |
| 8 | 3 | 5 | 25 |
| 6 | 1 | 5 | 25 |

- Total Squared Error: 150
- Mean Squared Error: 50
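
The table's arithmetic can be reproduced in a few lines. This is a minimal sketch using NumPy; the variable names are illustrative and not part of the course material.

```python
import numpy as np

# Predictions and actual values taken from the table
predictions = np.array([10, 8, 6])
actuals = np.array([20, 3, 1])

# Error is prediction minus actual, matching the table's sign convention
errors = predictions - actuals
squared_errors = errors ** 2

total_squared_error = squared_errors.sum()
mse = squared_errors.mean()

print(errors)
print(squared_errors)
print(total_squared_error)  # 150
print(mse)                  # 50.0
```

Squaring removes the sign, so the -10 and the two +5 errors cannot cancel each other out.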

<img src = "image/Screenshot 2021-01-17 205409.png">

## Loss Function
- Lower loss function value means a better model
- Goal: Find the weights that give the lowest value for the loss function
- Method for finding those weights: gradient descent

## Gradient descent

- Imagine you are in a pitch-dark field
- You want to find the lowest point
- Feel the ground to see how it slopes
- Take a small step downhill
- Repeat until it is uphill in every direction

## Gradient descent steps

- Start at a random point
- Until you are somewhere flat:
    - Find the slope
    - Take a step downhill
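
As a rough illustration of these steps, here is a minimal sketch on a one-dimensional bowl-shaped function. The function, starting point, step size, and tolerance are all assumptions chosen for the example. The start is fixed rather than random so the run is reproducible.

```python
def loss(w):
    # A simple bowl-shaped function with its lowest point at w = 3
    return (w - 3.0) ** 2

def slope(w):
    # Derivative of the loss above
    return 2.0 * (w - 3.0)

w = 0.0            # starting point
step_size = 0.1    # how far to move per step, relative to the slope
tolerance = 1e-6   # "somewhere flat" means the slope is this close to 0

n_steps = 0
# Until you are somewhere flat: find the slope, take a step downhill
while abs(slope(w)) > tolerance and n_steps < 10_000:
    w = w - step_size * slope(w)
    n_steps += 1

print(w, n_steps)
```

The loop stops close to w = 3, where the slope is nearly zero in both directions.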

## Optimizing a model with a single weight

<img src="image/Screenshot 2021-01-17 210059.png">

# Exercise I: Calculating model errors

For the exercises in this chapter, you'll continue working with the network to predict transactions for a bank.

What is the error (predicted - actual) for the following network using the ReLU activation function when the input data is $[3, 2]$ and the actual value of the target (what you are trying to predict) is 5? It may be helpful to get out a pen and piece of paper to calculate these values.

<img src="image/ch2_ex2_3.png">

Possible Answers:
    
- 5
- 6
- 11 (T)
- 16

# Exercise II: Understanding how weights change model accuracy

Imagine you have to make a prediction for a single data point. The actual value of the target is 7. The weight going from node_0 to the output is 2, as shown below. If you increased it slightly, changing it to 2.01, would the predictions become more accurate, less accurate, or stay the same?

<img src="image/ch2_ex2_3.png">

Possible Answers:
- More accurate
- Less accurate (T)
- Stay the same

# Exercise III: Coding how weight changes affect accuracy

Now you'll get to change weights in a real network and see how they affect model accuracy!

Have a look at the following neural network: 

<img src="image/ch2ex4.png">

Its weights have been pre-loaded as weights_0. Your task in this exercise is to update a single weight in weights_0 to create weights_1, which gives a perfect prediction (in which the predicted value is equal to target_actual: 3).

Use a pen and paper if necessary to experiment with different combinations. You'll use the predict_with_network() function, which takes an array of data as the first argument, and weights as the second argument.

### Instructions

- Create a dictionary of weights called weights_1 where you have changed 1 weight from weights_0 (You only need to make 1 edit to weights_0 to generate the perfect prediction).
- Obtain predictions with the new weights using the predict_with_network() function with input_data and weights_1.
- Calculate the error for the new weights by subtracting target_actual from model_output_1.
- Hit 'Submit Answer' to see how the errors compare!


In [None]:
import numpy as np

# The data point you will make a prediction for
input_data = np.array([0, 3])

# Sample weights
weights_0 = {'node_0': [2, 1],
             'node_1': [1, 2],
             'output': [1, 1]
            }

# The actual target value, used to calculate the error
target_actual = 3

# Make prediction using original weights
model_output_0 = predict_with_network(input_data, weights_0)

# Calculate error: error_0
error_0 = model_output_0 - target_actual

# Create weights that cause the network to make perfect prediction (3): weights_1
weights_1 = {'node_0': [2, 1],
             'node_1': [1, 0],
             'output': [1, 1]
            }

# Make prediction using new weights: model_output_1
model_output_1 = predict_with_network(input_data, weights_1)

# Calculate error: error_1
error_1 = model_output_1 - target_actual

# Print error_0 and error_1
print(error_0)
print(error_1)
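
The preloaded predict_with_network() is not shown in these notes. The sketch below is a hypothetical stand-in that makes the exercise runnable on its own. It assumes two hidden nodes with ReLU activations and a ReLU on the output node, which may differ from the course's actual implementation.

```python
import numpy as np

def relu(x):
    # Rectified linear activation: negative inputs become 0
    return max(0, x)

def predict_with_network(input_data_row, weights):
    # Hypothetical stand-in for the preloaded function (assumed structure:
    # two hidden nodes with ReLU, then a ReLU on the output node)
    node_0_output = relu((input_data_row * weights['node_0']).sum())
    node_1_output = relu((input_data_row * weights['node_1']).sum())
    hidden_layer_outputs = np.array([node_0_output, node_1_output])
    return relu((hidden_layer_outputs * weights['output']).sum())

input_data = np.array([0, 3])
target_actual = 3

weights_0 = {'node_0': [2, 1], 'node_1': [1, 2], 'output': [1, 1]}
weights_1 = {'node_0': [2, 1], 'node_1': [1, 0], 'output': [1, 1]}

error_0 = predict_with_network(input_data, weights_0) - target_actual
error_1 = predict_with_network(input_data, weights_1) - target_actual

print(error_0)  # 6
print(error_1)  # 0
```

With the original weights the hidden nodes output 3 and 6, giving a prediction of 9. Setting the second weight of node_1 to 0 switches that node off for this input, so the prediction drops to 3.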


# Exercise IV: Scaling up to multiple data points

You've seen how different weights will have different accuracies on a single prediction. But usually, you'll want to measure model accuracy on many points. You'll now write code to compare model accuracies for two different sets of weights, which have been stored as weights_0 and weights_1.

input_data is a list of arrays. Each item in that list contains the data to make a single prediction. target_actuals is a list of numbers. Each item in that list is the actual value we are trying to predict.

In this exercise, you'll use the mean_squared_error() function from sklearn.metrics. It takes the true values and the predicted values as arguments.

You'll also use the preloaded predict_with_network() function, which takes an array of data as the first argument, and weights as the second argument.

### Instructions

- Import mean_squared_error from sklearn.metrics.
- Use a for loop to iterate over each row of input_data:
    - Make predictions for each row with weights_0 using the predict_with_network() function and append it to model_output_0.
    - Do the same for weights_1, appending the predictions to model_output_1.
- Calculate the mean squared error of model_output_0 and then model_output_1 using the mean_squared_error() function. The first argument should be the actual values (target_actuals), and the second argument should be the predicted values (model_output_0 or model_output_1).


In [None]:
from sklearn.metrics import mean_squared_error

# Create model_output_0 
model_output_0 = []
# Create model_output_1
model_output_1 = []

# Loop over input_data
for row in input_data:
    # Append prediction to model_output_0
    model_output_0.append(predict_with_network(row, weights_0))
    
    # Append prediction to model_output_1
    model_output_1.append(predict_with_network(row, weights_1))

# Calculate the mean squared error for model_output_0: mse_0
mse_0 = mean_squared_error(target_actuals, model_output_0)

# Calculate the mean squared error for model_output_1: mse_1
mse_1 = mean_squared_error(target_actuals, model_output_1)

# Print mse_0 and mse_1
print("Mean squared error with weights_0: %f" % mse_0)
print("Mean squared error with weights_1: %f" % mse_1)


# (2) Gradient descent 

<img src="image/Screenshot 2021-01-17 210059.png">

## Gradient descent 

- If the slope is positive:
    - Going opposite the slope means moving to lower numbers
    - Subtract the slope from the current value
    - Too big a step might lead us astray
- Solution: learning rate
    - Update each weight by subtracting learning rate * slope
## Slope calculation example

<img src="image/screenshot 2021-01-17 213439.png">

- To calculate the slope for a weight, multiply:
    - Slope of the loss function w.r.t. the value at the node we feed into
    - The value of the node that feeds into our weight
    - Slope of the activation function w.r.t. the value we feed into
- Slope of the mean-squared loss function w.r.t. the prediction:
    - 2 * (Predicted Value - Actual Value) = 2 * Error
    - 2 * -4
- The value of the node that feeds into our weight is 3
- The example multiplies only these two factors, so the activation slope is effectively 1
- Slope for the weight: 2 * -4 * 3 = -24
- If the learning rate is 0.01, the new weight would be 2 - 0.01(-24) = 2.24
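
The same arithmetic as a short sketch. The figure is not reproduced here, so the target of 10 is inferred from the numbers in the list: a node value of 3 and a weight of 2 give a prediction of 6, and an error of -4 then implies an actual value of 10.

```python
node_value = 3        # value of the node feeding into the weight
weight = 2            # current weight
actual = 10           # inferred from prediction 6 and error -4
learning_rate = 0.01

prediction = node_value * weight   # 6
error = prediction - actual        # -4

# Slope of the loss w.r.t. the weight; no activation function in this example
slope = 2 * error * node_value     # -24

new_weight = weight - learning_rate * slope

print(slope)       # -24
print(new_weight)
```

Because the slope is negative, subtracting it increases the weight, which pushes the prediction up toward the target.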

## Network with two inputs affecting prediction

<img src="image/Screenshot 2021-01-17 223735.png">

## Code to calculate slopes and update weights

In [None]:
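import numpy as np

# Weights and input values for a network with two inputs and no hidden layer
weights = np.array([1, 2])
input_data = np.array([3, 4])
target = 6
learning_rate = 0.01

# Forward propagation: the prediction is the weighted sum of the inputs
preds = (weights * input_data).sum()

# Error is prediction minus actual
error = preds - target
print(error)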

In [None]:
# Slope of the squared-error loss w.r.t. each weight
gradient = 2 * input_data * error
print(gradient)

In [None]:
# Take a small step against the gradient, then re-evaluate the error
weights_updated = weights - learning_rate * gradient
preds_updated = (weights_updated * input_data).sum()
error_updated = preds_updated - target
print(error_updated)

# Exercise V: Calculating slopes

You're now going to practice calculating slopes. For the squared error loss on a single data point, the slope with respect to the weights is 2 * x * (xb-y), or 2 * input_data * error. Note that x and b may have multiple numbers (x is a vector for each data point, and b is a vector). In this case, the output will also be a vector, which is exactly what you want.
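
For reference, the expression 2 * x * (xb-y) follows from the chain rule applied to the squared error of a single data point. The symbol $L$ for the loss and the index $j$ are introduced here for the derivation; $x$, $b$, and $y$ are the input vector, weight vector, and target from the text.

$$
L(b) = (x \cdot b - y)^2, \qquad x \cdot b = \sum_j x_j b_j
$$

$$
\frac{\partial L}{\partial b_j} = 2\,(x \cdot b - y)\,\frac{\partial (x \cdot b)}{\partial b_j} = 2\,x_j\,(x \cdot b - y)
$$

Collecting the components for every $j$ gives the vector $2\,x\,(x \cdot b - y)$, which has one slope per weight.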

You're ready to write the code to calculate this slope while using a single data point. You'll use pre-defined weights called weights as well as data for a single point called input_data. The actual value of the target you want to predict is stored in target.

### Instructions


- Calculate the predictions, preds, by multiplying weights by the input_data and computing their sum.
- Calculate the error, which is preds minus target. Notice that this error corresponds to xb-y in the gradient expression.
- Calculate the slope of the loss function with respect to the weights. To do this, you need to take the product of input_data and error and multiply that by 2.


In [None]:
# Calculate the predictions: preds
preds = (weights * input_data).sum()

# Calculate the error: error
error = preds - target

# Calculate the slope: slope
slope = input_data * error * 2

# Print the slope
print(slope)
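
One way to sanity-check the slope formula is to compare it against a numerical estimate obtained by nudging each weight slightly. This is a sketch with illustrative values. The exercise's preloaded weights, input_data, and target are not shown in these notes, so the numbers below are assumptions.

```python
import numpy as np

weights = np.array([1.0, 2.0])
input_data = np.array([3.0, 4.0])
target = 6.0

def squared_error(w):
    # Squared error of a network with no hidden layer on one data point
    return ((w * input_data).sum() - target) ** 2

# Analytic slope from the formula 2 * input_data * error
error = (weights * input_data).sum() - target
analytic_slope = 2 * input_data * error

# Numerical slope: nudge each weight by a small amount in both directions
eps = 1e-6
numerical_slope = np.zeros_like(weights)
for j in range(len(weights)):
    nudge = np.zeros_like(weights)
    nudge[j] = eps
    numerical_slope[j] = (squared_error(weights + nudge)
                          - squared_error(weights - nudge)) / (2 * eps)

print(analytic_slope)
print(numerical_slope)
```

The two results should agree to several decimal places. A large mismatch usually points to a sign or indexing mistake in the analytic formula.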

# Exercise VI: Improving model weights

Hurray! You've just calculated the slopes you need. Now it's time to use those slopes to improve your model. If you subtract the slopes from your weights, you will move in the right direction. However, it's possible to move too far in that direction. So you will want to take a small step in that direction first, using a small learning rate, and verify that the model is improving.

The weights have been pre-loaded as weights, the actual value of the target as target, and the input data as input_data. The predictions from the initial weights are stored as preds.

### Instructions

- Set the learning rate to be 0.01 and calculate the error from the original predictions. This has been done for you.
- Calculate the updated weights by subtracting the product of learning_rate and slope from weights.
- Calculate the updated predictions by multiplying weights_updated with input_data and computing their sum.
- Calculate the error for the new predictions. Store the result as error_updated.
- Hit 'Submit Answer' to compare the updated error to the original!


In [None]:
# Set the learning rate: learning_rate
learning_rate = 0.01

# Calculate the predictions: preds
preds = (weights * input_data).sum()

# Calculate the error: error
error = preds - target

# Calculate the slope: slope
slope = 2 * input_data * error

# Update the weights: weights_updated
weights_updated = weights - learning_rate * slope

# Get updated predictions: preds_updated
preds_updated = (weights_updated * input_data).sum()

# Calculate updated error: error_updated
error_updated = preds_updated - target

# Print the original error
print(error)

# Print the updated error
print(error_updated)


# Exercise VII: Making multiple updates to weights

You're now going to make multiple updates so you can dramatically improve your model weights, and see how the predictions improve with each update.

To keep your code clean, there is a pre-loaded get_slope() function that takes input_data, target, and weights as arguments. There is also a get_mse() function that takes the same arguments. The input_data, target, and weights have been pre-loaded.

This network does not have any hidden layers, and it goes directly from the input (with 3 nodes) to an output node. Note that weights is a single array.

We have also pre-loaded matplotlib.pyplot, and the error history will be plotted after you have done your gradient descent steps.

### Instructions


- Use a for loop to iteratively update weights:
    - Calculate the slope using the get_slope() function.
    - Update the weights using a learning rate of 0.01.
    - Calculate the mean squared error (mse) with the updated weights using the get_mse() function.
    - Append mse to mse_hist.
- Hit 'Submit Answer' to visualize mse_hist. What trend do you notice?


In [None]:
n_updates = 20
mse_hist = []

# Iterate over the number of updates
for i in range(n_updates):
    # Calculate the slope: slope
    slope = get_slope(input_data, target, weights)
    
    # Update the weights: weights
    weights = weights - 0.01 * slope
    
    # Calculate mse with new weights: mse
    mse = get_mse(input_data, target, weights)
    
    # Append the mse to mse_hist
    mse_hist.append(mse)

# Plot the mse history
plt.plot(mse_hist)
plt.xlabel('Iterations')
plt.ylabel('Mean Squared Error')
plt.show()
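
The preloaded get_slope() and get_mse() are not shown in these notes. The sketch below defines hypothetical stand-ins for a network with no hidden layer and a single data point, so the loop can be run without the course environment. The data values are illustrative, and the plot is replaced by a printout.

```python
import numpy as np

def get_error(input_data, target, weights):
    # Prediction minus target for a network with no hidden layer
    return (weights * input_data).sum() - target

def get_slope(input_data, target, weights):
    # Hypothetical stand-in: slope of the squared error w.r.t. each weight
    return 2 * input_data * get_error(input_data, target, weights)

def get_mse(input_data, target, weights):
    # Hypothetical stand-in: with one data point, MSE is the squared error
    return get_error(input_data, target, weights) ** 2

# Illustrative values, not the course's preloaded data
input_data = np.array([1.0, 2.0, 3.0])
target = 0.0
weights = np.array([0.0, 2.0, 1.0])

mse_hist = []
for i in range(20):
    slope = get_slope(input_data, target, weights)
    weights = weights - 0.01 * slope
    mse_hist.append(get_mse(input_data, target, weights))

print(mse_hist[0], mse_hist[-1])
```

With these values the error shrinks by the same factor on every update, so the mean squared error falls steadily across the 20 iterations.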

# (3) Backpropagation

## Backpropagation

<img src="image/Screenshot 2021-01-17 230520.png">

- Allows gradient descent to update all weights in a neural network (by getting gradients for all weights)
- Comes from the chain rule of calculus
- Important to understand the process, but you will generally use a library that implements this

## Backpropagation process

- Trying to estimate the slope of the loss function w.r.t each weight
- Do forward propagation to calculate predictions and errors

<img src="image/Screenshot 2021-01-17 230932.png">

- Go back one layer at a time
- The gradient for a weight is the product of:
    1. Node value feeding into that weight
    2. Slope of loss function w.r.t node it feeds into
    3. Slope of activation function at the node it feeds into

## ReLU Activation Function

<img src="image/Screenshot 2021-01-17 231307.png">

## Backpropagation process

- Need to also keep track of the slopes of the loss function w.r.t node values
- Slopes of node values are the sum of the slopes for all weights that come out of them
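
The process can be sketched for a small network with two inputs, two ReLU hidden nodes, and one output node. The architecture, the weights, and the choice of a linear output node (no activation) are assumptions made for this example. They are not taken from the figures.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def relu_slope(z):
    # Slope of ReLU: 1 for positive inputs, 0 otherwise
    return (z > 0).astype(float)

def forward(x, w_hidden, w_out):
    hidden_in = w_hidden @ x          # input to each hidden node
    hidden_out = relu(hidden_in)      # value of each hidden node
    prediction = w_out @ hidden_out   # linear output node (assumption)
    return hidden_in, hidden_out, prediction

def backward(x, target, w_hidden, w_out):
    hidden_in, hidden_out, prediction = forward(x, w_hidden, w_out)
    # Slope of the squared-error loss w.r.t. the prediction
    d_prediction = 2 * (prediction - target)
    # Output weights: node value feeding in * slope at the node fed into
    grad_w_out = hidden_out * d_prediction
    # Slope of the loss w.r.t. each hidden node value
    d_hidden_out = w_out * d_prediction
    # Pass back through the ReLU at each hidden node
    d_hidden_in = d_hidden_out * relu_slope(hidden_in)
    # Hidden weights: input value * slope at the hidden node it feeds into
    grad_w_hidden = np.outer(d_hidden_in, x)
    return grad_w_hidden, grad_w_out

x = np.array([1.0, 3.0])
target = 4.0
w_hidden = np.array([[1.0, 1.0], [-1.0, 0.0]])  # one row per hidden node
w_out = np.array([2.0, 1.0])

grad_w_hidden, grad_w_out = backward(x, target, w_hidden, w_out)
print(grad_w_hidden)
print(grad_w_out)
```

The second hidden node receives a negative input, so its ReLU slope is 0 and every weight feeding into it gets a gradient of 0. Its output is also 0, so the weight leading out of it gets a gradient of 0 as well.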

# Exercise VIII: The relationship between forward and backward propagation

If you have gone through 4 iterations of calculating slopes (using backward propagation) and then updated weights, how many times must you have done forward propagation?

Possible Answers:
- 0
- 1
- 4 (T)
- 8


# Exercise IX: Thinking about backward propagation

If your predictions were all exactly right, and your errors were all exactly 0, the slope of the loss function with respect to your predictions would also be 0. In that circumstance, which of the following statements would be correct?

Possible Answers:
- The updates to all weights in the network would also be 0. (T)
- The updates to all weights in the network would be dependent on the activation functions.
- The updates to all weights in the network would be proportional to values from the input data.

# (4) Backpropagation in practice

## Backpropagation

<img src="image/Screenshot 2021-01-20 170319.png">

## Calculating slopes associated with any weight

- The gradient for a weight is the product of:
    1. Node value feeding into that weight
    2. Slope of activation function of the node being fed into
    3. Slope of loss function w.r.t output node

<img src="image/Screenshot 2021-01-20 170433.png">

| Current Weight Value | Gradient |
| :- | :- |
| 0 | 0 |
| 1 | 6 |
| 2 | 0 |
| 3 | 18 |

## Backpropagation: Recap

- Start at some random set of weights
- Use forward propagation to make a prediction
- Use backward propagation to calculate the slope of the loss function w.r.t each weight
- Multiply that slope by the learning rate, and subtract from the current weights
- Keep going with that cycle until we get to a flat part

## Stochastic gradient descent

- It is common to calculate slopes on only a subset of the data (a batch)
- Use a different batch of data to calculate the next update
- Start over from the beginning once all data is used
- Each time through the training data is called an epoch
- When slopes are calculated on one batch at a time, this is called stochastic gradient descent
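
A minimal sketch of this batching scheme on a linear model with no hidden layer. The data set, the hypothetical "true" weights used to generate it, the batch size, and the learning rate are all assumptions chosen for illustration.

```python
import numpy as np

# Illustrative data: 8 points generated from hypothetical true weights [2, -1]
X = np.array([[1, 2], [2, 1], [3, 0], [0, 3],
              [1, 1], [2, 2], [3, 1], [1, 3]], dtype=float)
y = X @ np.array([2.0, -1.0])

def mse(weights):
    # Mean squared error over the full data set
    return np.mean((X @ weights - y) ** 2)

rng = np.random.default_rng(0)
weights = np.zeros(2)
learning_rate = 0.01
batch_size = 2
n_epochs = 50

initial_mse = mse(weights)
for epoch in range(n_epochs):
    # Shuffle once per epoch, then walk through the data batch by batch
    order = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        batch = order[start:start + batch_size]
        errors = X[batch] @ weights - y[batch]
        # Slope of the batch's mean squared error w.r.t. each weight
        slope = 2 * X[batch].T @ errors / len(batch)
        weights = weights - learning_rate * slope
final_mse = mse(weights)

print(initial_mse, final_mse)
```

Each update sees only two of the eight points, so individual steps are noisy, but over many epochs the weights move toward values that fit the whole data set.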

# Exercise X: A round of backpropagation

In the network shown below, we have done forward propagation, and node values calculated as part of forward propagation are shown in white. The weights are shown in black. Layers after the question mark show the slopes calculated as part of back-prop, rather than the forward-prop values. Those slope values are shown in purple.

This network again uses the ReLU activation function, so the slope of the activation function is 1 for any node receiving a positive value as input. Assume the node being examined had a positive value (so the activation function's slope is 1).

<img src="image/ch2ex14_1.png">

Possible Answers:
- 0
- 2
- 6 (T)
- Not enough information