### Need for Optimization:
You've seen the forward-propagation algorithm that neural networks use to make predictions. However, the mere fact that a model has the structure of a neural network does not guarantee that it will make good predictions. We need to optimize the model. Changing any weight changes the prediction, so let's see what happens when we change the weights.

In [4]:
# relu() is assumed to be defined in an earlier cell
import numpy as np

# Define predict_with_network()
def predict_with_network(input_data_row, weights):

    # Calculate node 0 value
    node_0_input = (input_data_row * weights["node_0"]).sum()
    node_0_output = relu(node_0_input)

    # Calculate node 1 value
    node_1_input = (input_data_row * weights["node_1"]).sum()
    node_1_output = relu(node_1_input)

    # Put node values into array: hidden_layer_outputs
    hidden_layer_outputs = np.array([node_0_output, node_1_output])
    
    # Calculate model output
    input_to_final_layer = (hidden_layer_outputs * weights["output"]).sum()
    model_output = relu(input_to_final_layer)
    
    # Return model output
    return model_output

### Coding how weight changes affect accuracy

In [5]:

import numpy as np
# The data point you will make a prediction for
input_data = np.array([0, 3])

# Sample weights
weights_0 = {'node_0': [2, 1],
             'node_1': [1, 2],
             'output': [1, 1]
            }

# The actual target value, used to calculate the error
target_actual = 3

# Make prediction using original weights
model_output_0 = predict_with_network(input_data, weights_0)

# Calculate error: error_0
error_0 = model_output_0 - target_actual

# Create weights that cause the network to make perfect prediction (3): weights_1
weights_1 = {'node_0': [2, 1],
             'node_1': [1, 2],
             'output': [-1, 1]
            }

# Make prediction using new weights: model_output_1
model_output_1 = predict_with_network(input_data, weights_1)

# Calculate error: error_1
error_1 = model_output_1 - target_actual

# Print error_0 and error_1
print(error_0)
print(error_1)

6
0


So this change in weights reduced the error from 6 to 0, which improved the model.

#### Scaling up to multiple data points
Making accurate predictions gets harder with multiple points. First of all, at any set of weights, there are many error values, one for each point we make a prediction for. We use a loss function to aggregate all of these errors into a single measure of the model's predictive performance. A common loss function for regression tasks is mean squared error: you square each error and take the average as a measure of model quality. Lower values mean a better model, so our goal is to find the weights that give the lowest value of the loss function. We do this with an algorithm called gradient descent, covered after the following example.

In [8]:
from sklearn.metrics import mean_squared_error

input_data = [np.array([0, 3]), np.array([1, 2]), np.array([-1, -2]), np.array([4, 0])]

target_actuals = [1, 3, 5, 7]

weights_0 = {'node_0': np.array([2, 1]), 'node_1': np.array([1, 2]), 'output': np.array([1, 1])}
weights_1 = {'node_0': np.array([2, 1]), 'node_1': np.array([1. , 1.5]), 'output': np.array([1. , 1.5])}

# Create model_output_0 
model_output_0 = []
# Create model_output_1
model_output_1 = []

# Loop over input_data
for row in input_data:
    # Append prediction to model_output_0
    model_output_0.append(predict_with_network(row, weights_0))
    
    # Append prediction to model_output_1
    model_output_1.append(predict_with_network(row, weights_1))

# Calculate the mean squared error for model_output_0: mse_0
mse_0 = mean_squared_error(target_actuals, model_output_0)

# Calculate the mean squared error for model_output_1: mse_1
mse_1 = mean_squared_error(target_actuals, model_output_1)

# Print mse_0 and mse_1
print("Mean squared error with weights_0: %f" %mse_0)
print("Mean squared error with weights_1: %f" %mse_1)

Mean squared error with weights_0: 37.500000
Mean squared error with weights_1: 49.890625
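
As a rough cross-check, the first score can be reproduced without scikit-learn. The sketch below hard-codes the four predictions that `weights_0` produces for the rows above (9, 9, 0 and 12, worked out by hand from the network definition) and applies the square-then-average recipe directly. The names `predictions`, `errors` and `mse_by_hand` are illustrative and not part of the original notebook.

```python
import numpy as np

# Predictions from weights_0 for the four rows, worked out by hand
predictions = np.array([9, 9, 0, 12])
target_actuals = np.array([1, 3, 5, 7])

# Square each error, then take the average
errors = predictions - target_actuals
mse_by_hand = (errors ** 2).mean()

print(mse_by_hand)  # 37.5
```

By this measure `weights_0` is the better of the two weight sets here, since its loss (37.5) is lower than that of `weights_1` (about 49.89).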


#### Gradient descent
Imagine you are in a pitch-dark field and want to find the lowest point. You might feel the ground to see how it slopes, and take a small step downhill. This gives an improvement, but you are not necessarily at the lowest point yet. So you repeat this process until it is uphill in every direction. This is roughly how gradient descent works. The steps are: start at a random point; then, until you are somewhere flat, find the slope and take a step downhill.

With gradient descent, you repeatedly find a slope that captures how the loss function changes as a weight changes. You then make a small change to the weight to get to a lower point, and repeat until you can't go downhill any more. If the slope is positive, going opposite the slope means moving to lower numbers, and subtracting the slope from the current value achieves this. If the slope is negative, subtracting it moves the weight to higher numbers, which is again downhill. But too big a step might lead us far astray. So instead of directly subtracting the slope, we multiply the slope by a small number, called the learning rate, and change the weight by that product. Learning rates are frequently around 0.01. This ensures we take small steps, so we reliably move towards the optimal weights. But how do we find the relevant slope for each weight we need to update? Working this out for yourself involves calculus, especially the chain rule.
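
The loop described above can be sketched for a single weight. This is only an illustration under made-up assumptions: the loss `(weight - 3) ** 2`, the starting point of 0 and the 1000 steps are all hypothetical choices, picked so that the lowest point (weight = 3) is known in advance.

```python
# Hypothetical loss with its lowest point at weight = 3
def loss(weight):
    return (weight - 3) ** 2

def slope(weight):
    # Derivative of (weight - 3) ** 2 with respect to weight
    return 2 * (weight - 3)

weight = 0.0           # arbitrary starting point
learning_rate = 0.01   # small step size, as discussed above

for step in range(1000):
    # Step against the slope, scaled by the learning rate
    weight = weight - learning_rate * slope(weight)

print(weight, loss(weight))
```

After enough small steps the weight settles very close to 3, where the slope is essentially zero and there is no further downhill direction.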

#### Calculating slopes

In [12]:
weights = np.array([0, 2, 1])
input_data = np.array([1, 2, 3])
target = 0

# Calculate the predictions: preds
preds = (weights * input_data).sum()

# Calculate the error: error
error = preds - target

# Calculate the slope: slope
slope = input_data * error * 2

# Print the slope
print(slope)

[14 28 42]
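
The expression `2 * input_data * error` can be recovered with the chain rule, assuming the loss for this single data point is the squared error and the prediction is the weighted sum computed above. The symbols $w_i$ (weights), $x_i$ (inputs) and $y$ (target) are introduced here only for illustration:

$$
L(w) = \Big(\sum_j w_j x_j - y\Big)^2
\quad\Longrightarrow\quad
\frac{\partial L}{\partial w_i} = 2\,\Big(\sum_j w_j x_j - y\Big)\,x_i = 2 \cdot \text{error} \cdot x_i
$$

With an error of $7$ and $x = (1, 2, 3)$ this gives $(14, 28, 42)$, matching the printed slope.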


#### Improving model weights

In [13]:
# Set the learning rate: learning_rate
learning_rate = 0.01

# Calculate the predictions: preds
preds = (weights * input_data).sum()

# Calculate the error: error
error = preds - target

# Calculate the slope: slope
slope = 2 * input_data * error

# Update the weights: weights_updated
weights_updated = weights - (learning_rate * slope)

# Get updated predictions: preds_updated
preds_updated = (weights_updated * input_data).sum()

# Calculate updated error: error_updated
error_updated = preds_updated - target

# Print the original error
print(error)

# Print the updated error
print(error_updated)

7
5.04
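
A single update lowered the error from 7 to 5.04. As a sketch of what repeating that update looks like, the loop below applies the same rule several times to the same data point. The choice of 20 updates and the `errors` list are illustrative additions, not part of the original exercise.

```python
import numpy as np

weights = np.array([0.0, 2.0, 1.0])
input_data = np.array([1, 2, 3])
target = 0
learning_rate = 0.01
n_updates = 20  # arbitrary number of repetitions, chosen for illustration

errors = []
for _ in range(n_updates):
    # Forward step: prediction and error with the current weights
    preds = (weights * input_data).sum()
    error = preds - target
    errors.append(error)

    # Slope of the squared error with respect to each weight
    slope = 2 * input_data * error

    # Move each weight a small step against its slope
    weights = weights - learning_rate * slope

print(errors[0], errors[1], errors[-1])
```

The first two recorded errors are the 7 and 5.04 seen above, and each further update shrinks the error again.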


#### Backpropagation
You’ve used gradient descent to optimize weights in a simple model. Now we'll add a technique called “back propagation” to calculate the slopes you need to optimize more complex deep learning models.

Just as forward propagation sends input data through the hidden layers and into the output layer, back propagation takes the error from the output layer and propagates it backward through the hidden layers, towards the input layer. 

It calculates the necessary slopes sequentially from the weights closest to the prediction, through the hidden layers, eventually back to the weights coming from the inputs. We then use these slopes to update our weights as you've seen. Back propagation is tricky. So you should focus on the general structure of the algorithm, rather than trying to memorize every mathematical detail.

In the big picture, we are trying to estimate the slope of the loss function with respect to each weight in our network. You've already seen that we use prediction errors to calculate some of those slopes. So we always do forward propagation to make a prediction and calculate an error before we do back propagation.
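
One way to make "the slope of the loss function with respect to each weight" concrete is to nudge one weight at a time and watch how the loss changes. The sketch below does this for the small weighted-sum model from the slope calculation earlier. Note that this numerical nudging is a different technique from back propagation, which gets the same slopes exactly and far more cheaply; it is shown only as an illustration, and the names `loss`, `step` and `numeric_slopes` are hypothetical.

```python
import numpy as np

weights = np.array([0.0, 2.0, 1.0])
input_data = np.array([1.0, 2.0, 3.0])
target = 0.0

def loss(w):
    # Squared error of the weighted-sum prediction for one data point
    return ((w * input_data).sum() - target) ** 2

step = 1e-6  # size of the nudge; an arbitrary small value
numeric_slopes = np.zeros_like(weights)
for i in range(len(weights)):
    nudge = np.zeros_like(weights)
    nudge[i] = step
    # Central difference: change in loss per unit change in weight i
    numeric_slopes[i] = (loss(weights + nudge) - loss(weights - nudge)) / (2 * step)

print(numeric_slopes)  # approximately [14. 28. 42.]
```

The result agrees with the slope computed from the error in the earlier example, which is the kind of agreement back propagation is expected to deliver for every weight in a larger network.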

#### ReLU function:
For the ReLU function, the slope is 0 if the input into a node is negative. If the input into the node is positive, the output is the same as the input, so the slope is 1.
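
A minimal sketch of this rule in code, assuming ReLU is defined as `max(input, 0)`. The helper name `relu_slope` is hypothetical, and treating the slope at exactly 0 as 0 is a convention chosen here rather than something stated in the text.

```python
def relu(node_input):
    # Output equals the input when positive, otherwise 0
    return max(node_input, 0)

def relu_slope(node_input):
    # 1 for positive inputs, 0 otherwise (slope at exactly 0 set to 0 by convention)
    return 1 if node_input > 0 else 0

print(relu(-2), relu_slope(-2))  # 0 0
print(relu(3), relu_slope(3))    # 3 1
```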

#### Backpropagation: Recap
 
We start at some random set of weights. We then go through the following iterative process: use forward propagation to make a prediction; use back propagation to calculate the slope of the loss function with respect to each weight; multiply that slope by the learning rate, and subtract the result from the current weights. Keep going with that cycle until we get to a flat part.
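
The cycle above can be sketched end to end for a very small network. This is an illustration under several assumptions that are not in the original: a single data point, two hidden nodes with ReLU, a plain weighted sum at the output (no activation), squared error as the loss, fixed rather than random starting weights, and 50 repetitions. All variable names are hypothetical.

```python
import numpy as np

def relu(values):
    return np.maximum(values, 0)

# One data point and its target (values borrowed from the first example)
x = np.array([0.0, 3.0])
target = 3.0

# Starting weights: one row per hidden node, plus one weight per hidden output
w_hidden = np.array([[2.0, 1.0], [1.0, 2.0]])
w_output = np.array([1.0, 1.0])
learning_rate = 0.01

errors = []
for _ in range(50):
    # Forward propagation
    hidden_in = w_hidden @ x
    hidden_out = relu(hidden_in)
    prediction = w_output @ hidden_out
    error = prediction - target
    errors.append(error)

    # Back propagation: start with the weights closest to the prediction
    slope_output = 2 * error * hidden_out
    # Pass the error back through the output weights and the ReLU slope
    slope_hidden_in = 2 * error * w_output * (hidden_in > 0)
    slope_hidden = np.outer(slope_hidden_in, x)

    # Update every weight against its slope
    w_output = w_output - learning_rate * slope_output
    w_hidden = w_hidden - learning_rate * slope_hidden

print(errors[0], errors[-1])
```

The slopes are computed from the output weights backward to the input weights, which is the ordering described above, and all slopes are calculated before any weight is changed.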

#### Stochastic gradient descent
For computational efficiency, it is common to calculate slopes on only a subset of the data, called a batch, for each update of the weights. You then use a different batch of data to calculate the next update. Once we have used all our data, we start over again at the beginning of the data. Each time through the full training data is called an epoch. So if we're going through our data for the 3rd time, we'd say we are on the 3rd epoch. When slopes are calculated on one batch at a time, rather than on the full data, that is called stochastic gradient descent, rather than gradient descent, which uses all of the data for each slope calculation. The process will be partially automated for you, but understanding the process will help fix any surprises that come up when building your models.