# Detailing the Partial Derivative Calculations during Backpropagation

This 3-layer neural network example explicitly calculates every partial derivative during the backpropagation phase of training.

Credit to Matt Mazur for his excellent [step-by-step explanation](http://mattmazur.com/2015/03/17/a-step-by-step-backpropagation-example) of backpropagation.

In [None]:
%%javascript
require.config({
    paths: {
        flot: '//www.flotcharts.org/javascript/jquery.flot.min',
        d3: '//cdnjs.cloudflare.com/ajax/libs/d3/3.4.8/d3.min'
    }
});

In [None]:
%%html
<style>
div.output {
  max-height: 1000px;
  overflow-y: scroll;
}
</style>

In [None]:
import numpy as np
import json
import time
from IPython.display import Javascript, display

In [None]:
# Set alpha
# ----
# Alpha is the step-size multiplier (learning rate). Each time we
# adjust the weights, we multiply the adjustment by alpha, which
# scales the size of every update.
alpha = 10
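
To make the role of the step size concrete, here is a minimal sketch, separate from the network in this notebook, that runs gradient descent on the toy function f(w) = (w - 3)^2. The toy function, the starting point, and the helper name `gradient_descent` are illustrative assumptions and do not appear elsewhere in the notebook.

```python
def gradient_descent(alpha, steps=20, w=0.0):
    """Minimize f(w) = (w - 3)**2 with a fixed step size alpha."""
    for _ in range(steps):
        grad = 2 * (w - 3)   # df/dw at the current w
        w -= alpha * grad    # the update is the gradient scaled by alpha
    return w

small = gradient_descent(alpha=0.01)   # tiny steps: still far from the minimum at 3
medium = gradient_descent(alpha=0.1)   # moderate steps: close to 3 after 20 updates
large = gradient_descent(alpha=1.1)    # oversized steps: each update overshoots further

print(small, medium, large)
```

Whether a given value of alpha is too small or too large depends on the scale of the gradients, so a value that diverges on this toy function can still be workable for a sigmoid network whose gradients are much smaller.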

In [None]:
# Activation function
# ----
# We'll use the sigmoid function as our activation function
def sigmoid(x):
    return 1/(1 + np.exp(-x))

# Activation derivative
# ----
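    # We use the derivative to find the direction of the gradient,
    # so we know which way (positive or negative) to move the weights
    # when we update them. Note that this function takes the sigmoid's
    # output, not its input: if s = sigmoid(x), then ds/dx = s*(1 - s).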
def sigmoid_output_derivative(output):
    return output*(1 - output)

# Inputs
X = np.array([[0,0,1]])

# Outputs
Y = np.array([[0]]).T

# Seed a random number generator
np.random.seed(2)

# Randomly initialize weights between -1 and 1
W0 = 2*np.random.random((3,4)) - 1
W1 = 2*np.random.random((4,1)) - 1

for i in range(60000):
    
    # ------------
    # Feed forward
    # ------------
    
    # Set Inputs
    layer0 = X
    
    # For Layer 1, calculate:
    #   layer1_inputs  (net input to each hidden node)
    #   layer1_outputs (sigmoid of each node's net input)
    layer1_inputs     = np.dot(layer0,W0)
    layer1_outputs    = sigmoid(layer1_inputs)
    
    # Calculate input and output for all nodes in Layer 2
    layer2_inputs     = np.dot(layer1_outputs,W1)
    layer2_outputs    = sigmoid(layer2_inputs)
    
    # ---------------
    # Backpropagation
    # ---------------
    
    # First, we want to know how much a change in each of the W1 weights
    # affects the total error. We have 4 weights connected to the output layer,
    # so we will run this calculation for each weight. In partial derivative speak...
    #
    # Effect on ErrorT of changing W1[0] = d_ErrorT / d_W1[0]
    # Effect on ErrorT of changing W1[1] = d_ErrorT / d_W1[1]
    # Effect on ErrorT of changing W1[2] = d_ErrorT / d_W1[2]
    # Effect on ErrorT of changing W1[3] = d_ErrorT / d_W1[3]
    
    # This is where the Chain Rule is used to calculate these partial derivatives.
    # The Chain Rule tells us that:
    #
    # d_ErrorT / d_W1[0] = (d_ErrorT / d_layer2_outputs[0]) * 
    #                      (d_layer2_outputs[0] / d_layer2_inputs[0]) *
    #                      (d_layer2_inputs[0] / d_W1[0])
    #
    # ...and so on for each node in Layer 2 and weight in W1
    #
    # So we now need to find the value of each of the 3 partial derivatives in the equation above.
    
    # Calculate: d_ErrorT / d_layer2_outputs[0]
    # 
    # The standard error calculation is:
    # 0.5 * (correct_output - measured_output)^2
    #
    # ...and if we had multiple output nodes, then the total error would be
    # the sum of the result of that equation over all output nodes.
    # 
    # ErrorT = 0.5 * (Y - layer2_outputs[0])**2
    
    # Now, take the derivative of ErrorT with respect to the output.
    # The chain rule contributes a factor of -1 from the inner term (Y - layer2_outputs[0]):
    # d_ErrorT / d_layer2_outputs[0] = 2 * 0.5 * (Y - layer2_outputs[0])**(2-1) * -1
    #                                = 1 * (Y - layer2_outputs[0]) * -1
    #                                = -(Y - layer2_outputs[0])
    d_ErrorT_d_layer_2_outputs = -(Y - layer2_outputs[0])
    
    # Next, we need to find how the output of each node in Layer 2 changes
    # with respect to its input. Remember, we only have one output node.
    #
    # d_layer2_outputs[0] / d_layer2_inputs[0]
    #
    # The output is the sigmoid (aka logistic) function applied to the node's
    # net input. So, since our output is calculated as...
    #
    # layer2_outputs = sigmoid(layer2_inputs)
    #
    # ...we use the derivative of the sigmoid, which is output*(1 - output):
    #
    # d_layer2_outputs[0] / d_layer2_inputs[0] = layer2_outputs[0] * (1 - layer2_outputs[0])
    d_layer2_outputs_d_layer2_inputs = layer2_outputs * (1 - layer2_outputs)
    
    # Finally, we need to calculate how much the input to Layer 2 changes with respect
    # to each weight feeding into it.
    # Again, the net input to the node in Layer 2 is the sum of products
    # of the Layer 1 outputs and W1:
    #
    # layer2_inputs[0] = layer1_outputs[0]*W1[0] +
    #                    layer1_outputs[1]*W1[1] +
    #                    layer1_outputs[2]*W1[2] +
    #                    layer1_outputs[3]*W1[3]
    #
    # When we take the partial derivative of the above with respect to a given weight,
    # the other terms all go to 0 since they don't depend on that weight, for example:
    #
    # d_layer2_inputs[0] / d_W1[0] = 1*layer1_outputs[0]*W1[0]^(1-1) + 0 + 0 + 0
    #                              = layer1_outputs[0]
    #
    # Here, we need a loop because we have 4 partial derivatives to compute.
    # (In the variable name below, "layer1_weights" refers to W1.)
    d_layer2_inputs_d_layer1_weights = np.zeros((4,1))
    for j in range(4):
        d_layer2_inputs_d_layer1_weights[j][0] = layer1_outputs[0][j]
        
    # Now that we have all of our partial derivatives, we can multiply them together
    # to find d_ErrorT / d_W1
    d_errorT_d_W1 = d_ErrorT_d_layer_2_outputs * \
                    d_layer2_outputs_d_layer2_inputs * \
                    d_layer2_inputs_d_layer1_weights
    
    # Now we can calculate the deltas we want to apply to the weights in W1,
    # scaled by our alpha parameter 
    W1_deltas = alpha * d_errorT_d_W1
    
    # Phew! That was a lot to get through, but now we've backpropagated the error
    # in our output layer (Layer 2) to the hidden layer weights (W1)
    #
    # Now we get to do it all again for W0!
    
    # OK, so using the same logic, we need to find how much changing each
    # weight in W0 contributes to the total output error. For example, looking at
    # the first weight in W0 (W0 is a 3x4 matrix, so its first weight is W0[0][0]):
    #
    # d_ErrorT / d_W0[0][0] = (d_ErrorT / d_layer1_outputs[0]) *
    #                         (d_layer1_outputs[0] / d_layer1_inputs[0]) *
    #                         (d_layer1_inputs[0] / d_W0[0][0])
    
    # This time, we need to find the contribution of each hidden layer node to
    # the total error, so we'll need a loop to cover all 4 hidden layer nodes.
        
    # First, find partial derivative of total error with respect to the
    # hidden layer node's output.
    #
    # By the chain rule:
    #
    # d_ErrorT / d_layer1_outputs[0] = (d_ErrorT / d_layer2_outputs[0]) *
    #                                  (d_layer2_outputs[0] / d_layer2_inputs[0]) *
    #                                  (d_layer2_inputs[0] / d_layer1_outputs[0])
    #
    # The first two terms we know from earlier, so we only need to calculate the third term.
    #
    # layer2_inputs[0] is calculated as:
    #
    # layer2_inputs[0] = sum(layer1_outputs * W1) # aka dot product of layer1_outputs and W1
    # layer2_inputs[0] = (layer1_outputs[0] * W1[0]) +
    #                    (layer1_outputs[1] * W1[1]) +
    #                    (layer1_outputs[2] * W1[2]) +
    #                    (layer1_outputs[3] * W1[3])
    #
    # Then take the partial derivative with respect to layer1_outputs[0], which again
    # is straightforward as all the other terms go to zero.
    #
    # d_layer2_inputs[0] / d_layer1_outputs[0] = W1[0]
    d_layer2_inputs_d_layer1_outputs = W1.T
    
    # Next find the partial derivative of layer1_outputs with respect to layer1_inputs
    #
    # Layer 1 output is the sigmoid of the input
    #
    d_layer1_outputs_d_layer1_inputs = layer1_outputs*(1 - layer1_outputs)
    
    # And now find the partial derivative of layer1 inputs with respect to each W0.
    #
    # layer1_input[0] = (layer0[0] * W0[0][0]) +
    #                   (layer0[1] * W0[1][0]) +
    #                   (layer0[2] * W0[2][0])
    #
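    # d_layer1_inputs[0] / d_W0[0][0] = layer0[0]
    #
    # Every hidden node sees the same input row, so we stack layer0 once per
    # hidden node (shape (4, 3)) instead of hard-coding the input values.
    d_layer1_inputs_d_W0 = np.tile(layer0, (4, 1))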
    
    # Finally, we can determine how much each weight in W0 contributes to the total error:
    d_errorT_d_W0 = d_ErrorT_d_layer_2_outputs * \
                    d_layer2_outputs_d_layer2_inputs * \
                    d_layer2_inputs_d_layer1_outputs * \
                    d_layer1_outputs_d_layer1_inputs * \
                    d_layer1_inputs_d_W0.T
 
    # Now we can calculate the deltas we want to apply to the weights in W0,
    # scaled by our alpha parameter
    W0_deltas = alpha * d_errorT_d_W0
    
    # Update weights
    W1 -= W1_deltas
    W0 -= W0_deltas
    
    # Report the mean absolute error every 5000 iterations
    if i % 5000 == 0:
        print("Error: %.6f" % np.mean(np.abs(Y-layer2_outputs)))
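
One way to gain confidence in hand-derived partial derivatives like the ones in the training loop is to compare them against finite differences. The sketch below is a standalone check, not part of the original notebook: the helper names `total_error`, `analytic_gradients`, and `numeric_gradient`, and the step `eps`, are assumptions chosen for illustration. It re-creates the same input, target, and weight shapes, then compares the chain-rule gradients with central differences of ErrorT.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def total_error(X, Y, W0, W1):
    """ErrorT = 0.5 * (Y - output)**2 for the single training example."""
    layer1_outputs = sigmoid(np.dot(X, W0))
    layer2_outputs = sigmoid(np.dot(layer1_outputs, W1))
    return 0.5 * np.sum((Y - layer2_outputs) ** 2)

def analytic_gradients(X, Y, W0, W1):
    """Chain-rule gradients built from the same factors as the training loop."""
    layer1_outputs = sigmoid(np.dot(X, W0))                # (1, 4)
    layer2_outputs = sigmoid(np.dot(layer1_outputs, W1))   # (1, 1)
    d_error_d_out2 = -(Y - layer2_outputs)                 # (1, 1)
    d_out2_d_in2 = layer2_outputs * (1 - layer2_outputs)   # (1, 1)
    d_error_d_W1 = d_error_d_out2 * d_out2_d_in2 * layer1_outputs.T   # (4, 1)
    d_out1_d_in1 = layer1_outputs * (1 - layer1_outputs)   # (1, 4)
    d_error_d_W0 = (d_error_d_out2 * d_out2_d_in2 * W1.T * d_out1_d_in1) * X.T  # (3, 4)
    return d_error_d_W0, d_error_d_W1

def numeric_gradient(f, W, eps=1e-6):
    """Central finite differences of the scalar function f with respect to W."""
    grad = np.zeros_like(W)
    for idx in np.ndindex(*W.shape):
        original = W[idx]
        W[idx] = original + eps
        plus = f()
        W[idx] = original - eps
        minus = f()
        W[idx] = original          # restore the weight before moving on
        grad[idx] = (plus - minus) / (2 * eps)
    return grad

X = np.array([[0, 0, 1]])
Y = np.array([[0]])
rng = np.random.RandomState(2)
W0 = 2 * rng.random_sample((3, 4)) - 1
W1 = 2 * rng.random_sample((4, 1)) - 1

dW0, dW1 = analytic_gradients(X, Y, W0, W1)
num_dW0 = numeric_gradient(lambda: total_error(X, Y, W0, W1), W0)
num_dW1 = numeric_gradient(lambda: total_error(X, Y, W0, W1), W1)

max_diff = max(np.max(np.abs(dW0 - num_dW0)), np.max(np.abs(dW1 - num_dW1)))
print("Largest analytic vs. numeric difference: %.2e" % max_diff)
```

If the derivation is right, the two sets of gradients should agree up to the approximation error of the finite-difference estimate. A large discrepancy usually points to a sign error or a missing factor in one of the chain-rule terms.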

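The per-weight chain-rule products spelled out in the training loop collapse into a few matrix operations. The following sketch, which assumes the same shapes as this notebook (one sample, 3 inputs, 4 hidden nodes, 1 output), computes the gradients both element by element and in matrix form and checks that they agree. The names `delta1`, `delta2`, and the `explicit_*` / `matrix_*` variables are illustrative, not from the original code.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

X = np.array([[0, 0, 1]])
Y = np.array([[0]])
rng = np.random.RandomState(2)
W0 = 2 * rng.random_sample((3, 4)) - 1
W1 = 2 * rng.random_sample((4, 1)) - 1

# Forward pass
layer1_outputs = sigmoid(X @ W0)               # (1, 4)
layer2_outputs = sigmoid(layer1_outputs @ W1)  # (1, 1)

# Element-by-element gradients: one chain-rule product per weight
explicit_dW1 = np.zeros_like(W1)
explicit_dW0 = np.zeros_like(W0)
out2 = layer2_outputs[0, 0]
d_error_d_in2 = -(Y[0, 0] - out2) * out2 * (1 - out2)
for j in range(4):
    out1_j = layer1_outputs[0, j]
    explicit_dW1[j, 0] = d_error_d_in2 * out1_j
    d_error_d_in1_j = d_error_d_in2 * W1[j, 0] * out1_j * (1 - out1_j)
    for k in range(3):
        explicit_dW0[k, j] = d_error_d_in1_j * X[0, k]

# Matrix form of the same calculation
delta2 = (layer2_outputs - Y) * layer2_outputs * (1 - layer2_outputs)  # (1, 1)
delta1 = (delta2 @ W1.T) * layer1_outputs * (1 - layer1_outputs)       # (1, 4)
matrix_dW1 = layer1_outputs.T @ delta2                                 # (4, 1)
matrix_dW0 = X.T @ delta1                                              # (3, 4)

same = bool(np.allclose(explicit_dW1, matrix_dW1)
            and np.allclose(explicit_dW0, matrix_dW0))
print("Explicit and matrix gradients match:", same)
```

The explicit version mirrors the derivation and is easier to follow, while the matrix version is what most implementations use in practice because it extends directly to batches of samples.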
    