# Optimizing a neural network with backward propagation

In [2]:
from sklearn.metrics import mean_squared_error
import numpy as np

When we know what the output should be (like with labeled data) we'll try to minimize a cost or error function by changing the weights in the network. 

When we have multiple data points, things get harder. For any set of weights there are many error values, one for each point we make predictions for.

A loss function aggregates the prediction errors from many data points into a single number. It's a measure of the model's predictive performance, and lower values mean a better fit.

Example: the squared error loss function. We square each error and take the mean of the squared errors.
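
As a minimal sketch of that calculation, the snippet below computes the mean squared error by hand with NumPy. The `targets` and `predictions` values are made up for illustration and do not come from the exercise.

```python
import numpy as np

# Illustrative values only (assumed, not taken from the exercise data)
targets = np.array([1.0, 3.0, 5.0])
predictions = np.array([2.0, 3.0, 3.0])

# Error for each data point
errors = predictions - targets

# Square each error, then average the squares
squared_errors = errors ** 2
mse = squared_errors.mean()

print(mse)
```

For one-dimensional inputs like these, this should agree with `mean_squared_error(targets, predictions)` from `sklearn.metrics`.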

The goal is to find the weights that minimize the loss function. We do this using gradient descent.

### Scaling up to multiple data points
You've seen how different weights will have different accuracies on a single prediction. But usually, you'll want to measure model accuracy on many points. You'll now write code to compare model accuracies for two different sets of weights, which have been stored as weights_0 and weights_1.

input_data is a list of arrays. Each item in that list contains the data to make a single prediction. target_actuals is a list of numbers. Each item in that list is the actual value we are trying to predict.

In this exercise, you'll use the mean_squared_error() function from sklearn.metrics. It takes the true values and the predicted values as arguments.

You'll also use the preloaded predict_with_network() function, which takes an array of data as the first argument, and weights as the second argument.

In [None]:
from sklearn.metrics import mean_squared_error

# Create model_output_0 
model_output_0 = []
# Create model_output_1
model_output_1 = []

# Loop over input_data
for row in input_data:
    # Append prediction to model_output_0
    model_output_0.append(predict_with_network(row, weights_0))
    
    # Append prediction to model_output_1
    model_output_1.append(predict_with_network(row, weights_1))

# Calculate the mean squared error for model_output_0: mse_0
mse_0 = mean_squared_error(target_actuals, model_output_0)

# Calculate the mean squared error for model_output_1: mse_1
mse_1 = mean_squared_error(target_actuals, model_output_1)

# Print mse_0 and mse_1
print("Mean squared error with weights_0: %f" %mse_0)
print("Mean squared error with weights_1: %f" %mse_1)
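
The cell above relies on objects that were preloaded in the exercise environment. To run the same comparison on its own, here is a rough sketch under stated assumptions: the body of `predict_with_network()` is a hypothetical stand-in (two ReLU hidden nodes and a linear output), and the data and weights are illustrative values, not necessarily those used in the exercise. The mean squared error is computed with NumPy so the snippet has no other dependencies.

```python
import numpy as np

def relu(x):
    # Rectified linear activation
    return np.maximum(0, x)

def predict_with_network(row, weights):
    # Hypothetical stand-in for the preloaded function:
    # two ReLU hidden nodes feeding one linear output node
    hidden = relu(np.array([(row * weights["node_0"]).sum(),
                            (row * weights["node_1"]).sum()]))
    return (hidden * weights["output"]).sum()

# Illustrative data and weights (assumed for this sketch)
input_data = [np.array([0, 3]), np.array([1, 2]),
              np.array([-1, -2]), np.array([4, 0])]
target_actuals = [1, 3, 5, 7]
weights_0 = {"node_0": np.array([2, 1]),
             "node_1": np.array([1, 2]),
             "output": np.array([1, 1])}
weights_1 = {"node_0": np.array([2, 1]),
             "node_1": np.array([1.0, 1.5]),
             "output": np.array([1.0, 1.5])}

# One prediction per row, for each set of weights
model_output_0 = [predict_with_network(row, weights_0) for row in input_data]
model_output_1 = [predict_with_network(row, weights_1) for row in input_data]

# Mean squared error, computed directly with NumPy
mse_0 = np.mean((np.array(model_output_0) - np.array(target_actuals)) ** 2)
mse_1 = np.mean((np.array(model_output_1) - np.array(target_actuals)) ** 2)

print("Mean squared error with weights_0: %f" % mse_0)
print("Mean squared error with weights_1: %f" % mse_1)
```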


## Gradient descent
If the slope is positive:
- Going opposite the slope means moving to lower numbers.
- We subtract the slope from the current value.
- Too big a step might lead us too far.

The solution to this is the learning rate: update each weight by subtracting $\text{learning rate} \times \text{slope}$.
![](grad.PNG)

In [4]:
# Code to calculate slopes and update weights
w = np.array([1, 2])
inputd = np.array([3, 4])
target = 6
learning_rate = 0.01

# The prediction is the weighted sum of the inputs
pred = (w * inputd).sum()
error = pred - target
error

5

In [5]:
# Gradient of the squared error loss with respect to the weights
gradient = 2 * inputd * error

w_updated = w - learning_rate * gradient
pred_updated = (w_updated * inputd).sum()
error_updated = pred_updated - target
error_updated

2.5
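
The expression `2 * inputd * error` can be checked with a short derivation. This assumes, as the code suggests, a single data point with inputs $x_i$, weights $w_i$, target $t$ and a squared error loss $L$:

$$
L(w) = \Big(\sum_j w_j x_j - t\Big)^2
\quad\Rightarrow\quad
\frac{\partial L}{\partial w_i} = 2\Big(\sum_j w_j x_j - t\Big)\, x_i = 2 \cdot \text{error} \cdot x_i
$$

With the numbers above, the gradient is $2 \cdot (3, 4) \cdot 5 = (30, 40)$, so the updated weights are $(1, 2) - 0.01 \cdot (30, 40) = (0.7, 1.6)$. The new prediction is $0.7 \cdot 3 + 1.6 \cdot 4 = 8.5$, which gives the error of $2.5$ shown in the output.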

In [None]:
# Structure to do multiple updates
import matplotlib.pyplot as plt

n_updates = 20
mse_hist = []

# Iterate over the number of updates
for i in range(n_updates):
    # Calculate the slope: slope
    slope = ____(____, ____, ____)
    
    # Update the weights: weights
    weights = ____ - ____ * ____
    
    # Calculate mse with new weights: mse
    mse = ____(____, ____, ____)
    
    # Append the mse to mse_hist
    ____

# Plot the mse history
plt.plot(mse_hist)
plt.xlabel('Iterations')
plt.ylabel('Mean Squared Error')
plt.show()
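
The blanks in the cell above are left for the exercise. As one possible way to see the loop run end to end, the sketch below reuses the single data point from the earlier cells. The helpers `get_error()`, `get_slope()` and `get_mse()` are hypothetical names written for this sketch, and the plotting step is left out so the snippet only needs NumPy.

```python
import numpy as np

def get_error(input_data, target, weights):
    # Prediction is the weighted sum of the inputs
    preds = (weights * input_data).sum()
    return preds - target

def get_slope(input_data, target, weights):
    # Gradient of the squared error with respect to the weights
    error = get_error(input_data, target, weights)
    return 2 * input_data * error

def get_mse(input_data, target, weights):
    # With a single data point, the mean squared error is just the squared error
    error = get_error(input_data, target, weights)
    return error ** 2

input_data = np.array([3, 4])
target = 6
weights = np.array([1.0, 2.0])
learning_rate = 0.01

n_updates = 20
mse_hist = []

for i in range(n_updates):
    # Calculate the slope at the current weights
    slope = get_slope(input_data, target, weights)

    # Step against the slope, scaled by the learning rate
    weights = weights - learning_rate * slope

    # Record the loss after the update
    mse_hist.append(get_mse(input_data, target, weights))

print(mse_hist[0], mse_hist[-1])
```

With these particular numbers each update halves the error, so the recorded loss should fall steadily from one iteration to the next.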

## Backpropagation
We already optimized our weights using gradient descent. Now we'll learn how to use backpropagation to calculate the slopes we need to optimize more complex deep learning models.

Backpropagation (BP) takes the prediction error from the output layer and propagates it backwards through the hidden layers all the way to the input layer. It calculates the slopes sequentially, starting with the weights closest to the output layer, moving through the hidden layers and finally reaching the weights coming from the inputs. We then use those slopes to update the weights like we did before.

It allows gradient descent to update all the weights in a neural network, and it comes from the chain rule of calculus.

The big-picture problem is estimating the slope of the loss function with respect to (wrt) each weight. To do this we first need predictions and errors, so we always do forward propagation before backpropagation.

![](bp.PNG)

## Backpropagation in practice
![](bp2.PNG)
![](bp3.PNG)
Remember the three things we need to multiply to find the gradient for a weight:
- The node value feeding into that weight
- The slope of the activation function for the node being fed into
- The slope of the loss function wrt the output node

For the weight going from the upper-right node into the upper-left node we have:
- 0
- 6
- 1 (since 6 > 0, the derivative of ReLU is 1)
![](bp4.PNG)

RECAP:
![](bp5.PNG)
Then we simply keep repeating that cycle until we reach a flat part of the loss function.
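
To make the three-factor product concrete, here is a small sketch for a network with two inputs, two ReLU hidden nodes and one linear output. The architecture, the weights and the data point are assumptions chosen for illustration and are not the network from the figures. For the hidden-layer weights, the loss slope used is the one at the node the weight feeds into, which follows from the chain rule. The result is compared against a numerical estimate of the gradient.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def forward(x, W_hidden, w_out):
    z = W_hidden @ x      # inputs to the hidden nodes
    h = relu(z)           # hidden node values
    pred = w_out @ h      # linear output node
    return z, h, pred

def loss(x, target, W_hidden, w_out):
    # Squared error for a single data point
    return (forward(x, W_hidden, w_out)[2] - target) ** 2

def backprop(x, target, W_hidden, w_out):
    z, h, pred = forward(x, W_hidden, w_out)
    # Slope of the loss wrt the output node
    dloss_dpred = 2 * (pred - target)
    # Output weights: node value feeding in * activation slope (1, linear) * loss slope
    grad_w_out = h * 1.0 * dloss_dpred
    # Slope of the loss wrt each hidden node value (chain rule through w_out)
    dloss_dh = w_out * dloss_dpred
    # Slope of ReLU at each hidden node: 1 where its input is positive, else 0
    relu_slope = (z > 0).astype(float)
    # Hidden weights: grad[j, i] = x[i] * relu_slope[j] * dloss_dh[j]
    grad_W_hidden = np.outer(relu_slope * dloss_dh, x)
    return grad_W_hidden, grad_w_out

def numerical_grad_hidden(x, target, W_hidden, w_out, eps=1e-6):
    # Central finite differences, used only as a sanity check
    grad = np.zeros_like(W_hidden)
    for idx in np.ndindex(W_hidden.shape):
        W_plus, W_minus = W_hidden.copy(), W_hidden.copy()
        W_plus[idx] += eps
        W_minus[idx] -= eps
        grad[idx] = (loss(x, target, W_plus, w_out)
                     - loss(x, target, W_minus, w_out)) / (2 * eps)
    return grad

# Illustrative values (assumed)
x = np.array([1.0, 2.0])
target = 1.0
W_hidden = np.array([[1.0, 0.5], [-1.0, 0.25]])
w_out = np.array([2.0, -1.0])

grad_W_hidden, grad_w_out = backprop(x, target, W_hidden, w_out)
numeric = numerical_grad_hidden(x, target, W_hidden, w_out)

print(grad_W_hidden)
print(numeric)
```

The second hidden node receives a negative input here, so its ReLU slope is 0 and every weight feeding into it gets a zero gradient.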

### Stochastic gradient descent
It is common to calculate slopes on only a subset of the data (a 'batch') for each update. We then use a different batch of data to calculate the next update.

Once all the data has been used, we start over from the beginning.

Each full pass through the training data is called an epoch.
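
The batch and epoch loop can be sketched as follows. This is a minimal illustration on a linear model with synthetic, noise-free data; the data, `true_w`, the batch size and the learning rate are all assumptions chosen for the example.

```python
import numpy as np

# Synthetic data (assumed for this sketch): targets are an exact linear function of the inputs
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
true_w = np.array([1.5, -2.0])
y = X @ true_w

def mse(X, y, w):
    return np.mean((X @ w - y) ** 2)

weights = np.zeros(2)
learning_rate = 0.05
batch_size = 10
n_epochs = 20

initial_mse = mse(X, y, weights)

for epoch in range(n_epochs):
    # One epoch: walk through the training data batch by batch
    for start in range(0, len(X), batch_size):
        X_batch = X[start:start + batch_size]
        y_batch = y[start:start + batch_size]

        # Slope of the mean squared error computed on this batch only
        error = X_batch @ weights - y_batch
        slope = 2 * X_batch.T @ error / len(X_batch)

        # Update the weights using the batch slope
        weights = weights - learning_rate * slope

final_mse = mse(X, y, weights)
print(initial_mse, final_mse)
```

With 100 rows and a batch size of 10, each epoch performs 10 weight updates. In practice the data is often shuffled between epochs, which this sketch leaves out to stay close to the description above.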