# Chapter 9: Backpropagation

### Single Neuron Example
- **backpropagation** is a widely used algorithm in training feedforward neural networks for supervised learning
- in fitting a neural network, backpropagation computes the gradient of the loss function with respect to the weights of the network for a single input–output example, and does so efficiently 
---
- now that we have an idea of how to measure the impact of each variable on a function’s output, we can begin to write the code to calculate these partial derivatives and see their role in minimizing our loss
- before applying backpropagation to a complete neural network, let’s backpropagate the ReLU function for a single neuron and act as if our intention is to minimize the output for this single neuron:

In [65]:
# Forward pass
x = [1.0, -2.0, 3.0]  # input values
w = [-3.0, -1.0, 2.0]  # weights
b = 1.0  # bias

# Multiplying inputs by weights
wx0 = x[0] * w[0]
wx1 = x[1] * w[1]
wx2 = x[2] * w[2]

# Adding
s = wx0 + wx1 + wx2 + b

# ReLU
y = max(s, 0)  # ReLU activation, as described earlier
y  # output of the ReLU function (the neuron's output)

6.0

- the first order of business is to backpropagate our gradients by calculating partial derivatives with respect to each of our parameters and inputs
- in the context of our neural network, the neuron computes $\mathrm{ReLU}\left(\sum_i w_i x_i + b\right)$, a chain of nested functions to which we will apply the chain rule
- **let's work backwards with our single neuron, starting with the actual output (ReLU function)**
- the output function, which is a ReLU function, doesn’t have a subsequent function’s partial derivatives to be multiplied by
- so in this case, we just need to calculate the partial derivative for the ReLU function
- recall the partial derivative for ReLU is 1 if the input is greater than 0, otherwise it's 0

In [66]:
dy = (1 if s > 0 else 0) 
dy

1

- **moving backwards through our neural network, the function that precedes the activation function is the sum of the weighted inputs and the bias**
- this means we want to calculate the partial derivative of the sum function (the derivative of a sum with respect to any one of its terms is 1) and then use the chain rule to multiply this value by the partial derivative of the subsequent function (the ReLU function), which we stored as $dy$
- we will call the results `dwx0`, `dwx1`, `dwx2` (one per weighted input) and `db` (for the bias)

In [67]:
dwx0 = 1 * dy # derivative with respect to weighted input x
dwx1 = 1 * dy
dwx2 = 1 * dy
db = 1 * dy # derivative with respect to bias

- **continuing backward, the function that comes before the sum is the multiplication of weights and inputs**
- recall that for a product $f(x, y) = x \cdot y$, the partial derivative of $f$ with respect to $x$ equals $y$ and the partial derivative of $f$ with respect to $y$ equals $x$
- we can now apply this and the chain rule to calculate partial derivatives of the $weights * inputs$ multiplication operations with respect to their arguments (weights and inputs)

In [68]:
dx0 = w[0] * dwx0 # input
dw0 = x[0] * dwx0 # weight
dx1 = w[1] * dwx1
dw1 = x[1] * dwx1
dx2 = w[2] * dwx2
dw2 = x[2] * dwx2

- we are working backwards by taking the ReLU function's derivative, then taking the summing operation’s derivative, and multiplying both, and so on
- again, this process is called backpropagation using the chain rule
- as the name implies, the gradients of the resulting output function are passed back through the neural network
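- as a sanity check, the analytic gradients above can be compared against a finite-difference estimate; the sketch below is illustrative only, and the helpers `neuron` and `numerical_grad` are names introduced here, not part of the chapter's code:

```python
# Finite-difference check of the single-neuron gradients.
# `neuron` and `numerical_grad` are hypothetical helpers for illustration.

def neuron(x, w, b):
    """Forward pass: ReLU(sum(x_i * w_i) + b)."""
    s = sum(xi * wi for xi, wi in zip(x, w)) + b
    return max(s, 0.0)


def numerical_grad(f, values, eps=1e-6):
    """Central-difference estimate of df/dv for each entry of `values`."""
    grads = []
    for i in range(len(values)):
        plus = list(values)
        minus = list(values)
        plus[i] += eps
        minus[i] -= eps
        grads.append((f(plus) - f(minus)) / (2 * eps))
    return grads


x = [1.0, -2.0, 3.0]
w = [-3.0, -1.0, 2.0]
b = 1.0

# Analytic gradients from backpropagation
s = sum(xi * wi for xi, wi in zip(x, w)) + b
dy = 1.0 if s > 0 else 0.0          # ReLU derivative
dw_analytic = [xi * dy for xi in x]  # gradient on weights
dx_analytic = [wi * dy for wi in w]  # gradient on inputs
db_analytic = dy                     # gradient on bias

# Numerical estimates of the same quantities
dw_numeric = numerical_grad(lambda w_: neuron(x, w_, b), w)
dx_numeric = numerical_grad(lambda x_: neuron(x_, w, b), x)
db_numeric = numerical_grad(lambda b_: neuron(x, w, b_[0]), [b])[0]

print(dw_analytic, [round(g, 4) for g in dw_numeric])
print(dx_analytic, [round(g, 4) for g in dx_numeric])
print(db_analytic, round(db_numeric, 4))
```

- the two sets of values should agree to within floating-point error; the check is only reliable away from the ReLU kink at $s = 0$, where the derivative is not defined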
---
- at this point, we finally have the partial derivatives for the weights and the bias
- we have everything we need to minimize the output of this neuron 
- for example, we can print the partial derivatives for each weight and the bias:

In [69]:
print(dw0, dw1, dw2, db)

1.0 -2.0 3.0 1


- full code thus far for forward and backward passing a single neuron:

In [70]:
# Forward pass
x = [1.0, -2.0, 3.0]  # input values
w = [-3.0, -1.0, 2.0]  # weights
b = 1.0  # bias

# Multiplying inputs by weights
wx0 = x[0] * w[0]
wx1 = x[1] * w[1]
wx2 = x[2] * w[2]

# Adding
s = wx0 + wx1 + wx2 + b

# ReLU
y = max(s, 0)  # we already described that with ReLU activation function description
print(y)

# Backward pass
dy = (1 if s > 0 else 0)  # derivative on ReLU activation function

dwx0 = 1 * dy 
dwx1 = 1 * dy
dwx2 = 1 * dy
db = 1 * dy

dx0 = w[0] * dwx0  
dw0 = x[0] * dwx0
dx1 = w[1] * dwx1
dw1 = x[1] * dwx1
dx2 = w[2] * dwx2
dw2 = x[2] * dwx2

print(dw0, dw1, dw2, db)

6.0
1.0 -2.0 3.0 1


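- because each intermediate derivative is a simple factor, we can collapse the chain rule into a single expression per parameter and calculate the partial derivatives of the neuron’s output with respect to the weights and input values directly: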

In [71]:
dw0 = (1 if s > 0 else 0) * x[0]
dw1 = (1 if s > 0 else 0) * x[1]
dw2 = (1 if s > 0 else 0) * x[2]
dx0 = (1 if s > 0 else 0) * w[0]
dx1 = (1 if s > 0 else 0) * w[1]
dx2 = (1 if s > 0 else 0) * w[2]

print(dw0, dw1, dw2)

1.0 -2.0 3.0
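- the collapsed expressions above follow from writing out the chain rule in full; as a sketch, with the symbols $s = \sum_i x_i w_i + b$ and $y = \mathrm{ReLU}(s)$ introduced here for illustration:

```latex
\frac{\partial y}{\partial w_i}
  = \frac{d\,\mathrm{ReLU}(s)}{ds}
    \cdot \frac{\partial s}{\partial (x_i w_i)}
    \cdot \frac{\partial (x_i w_i)}{\partial w_i}
  = \mathbb{1}(s > 0) \cdot 1 \cdot x_i
\qquad
\frac{\partial y}{\partial x_i} = \mathbb{1}(s > 0) \cdot 1 \cdot w_i
\qquad
\frac{\partial y}{\partial b} = \mathbb{1}(s > 0) \cdot 1
```

- with $s = 6 > 0$ the indicator equals 1, so the gradient on each weight is just the matching input, and the gradient on each input is just the matching weight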


- all together, the partial derivatives above combined into a vector make up our gradients
- our gradients could be represented with:

In [72]:
dx = [dx0, dx1, dx2]  # gradients on inputs
dw = [dw0, dw1, dw2]  # gradients on weights
print(dx, dw, db) # gradient on bias... just 1 bias here

[-3.0, -1.0, 2.0] [1.0, -2.0, 3.0] 1


- continuing with our single neuron example, with the goal of minimizing the output, we now can apply these gradients to our weights
- for example, our current weights and bias are:

In [73]:
print(w, b)

[-3.0, -1.0, 2.0] 1.0


- we can then apply a fraction of the gradients to these values:

In [74]:
w[0] += -0.001 * dw[0]
w[1] += -0.001 * dw[1]
w[2] += -0.001 * dw[2]
b += -0.001 * db

print(w, b)

[-3.001, -0.998, 1.997] 0.999


- we’ve slightly changed our weights and bias to intelligently decrease the output
- we can see the effects of our tweaks on the output by doing another forward pass:

In [75]:
# Multiplying inputs by weights
wx0 = x[0] * w[0]
wx1 = x[1] * w[1]
wx2 = x[2] * w[2]

# Adding
s = wx0 + wx1 + wx2 + b

# ReLU
y = max(s, 0) 
print(y) # used to yield 6

5.985


- we’ve successfully decreased this neuron’s output from 6.000 to 5.985
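- repeating the same forward pass, backward pass, and update several times lowers the output a little further each time; a minimal sketch, where the name `learning_rate` and the choice of 5 steps are assumptions made for illustration:

```python
x = [1.0, -2.0, 3.0]   # input values
w = [-3.0, -1.0, 2.0]  # initial weights
b = 1.0                # initial bias
learning_rate = 0.001  # same step size as in the update above

outputs = []
for step in range(5):
    # Forward pass
    s = sum(xi * wi for xi, wi in zip(x, w)) + b
    y = max(s, 0.0)
    outputs.append(y)

    # Backward pass
    dy = 1.0 if s > 0 else 0.0
    dw = [xi * dy for xi in x]
    db = dy

    # Step against the gradient to reduce the output
    w = [wi - learning_rate * dwi for wi, dwi in zip(w, dw)]
    b -= learning_rate * db

print([round(o, 3) for o in outputs])
```

- while the neuron stays active ($s > 0$), each step lowers the output by about $0.001 \cdot (1^2 + (-2)^2 + 3^2 + 1) = 0.015$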

### Entire Layer Example
- as we’ve done before, we’ll apply the one-neuron example to the list of samples and expand it to an entire layer of neurons
- to begin, let’s set a list of 3 feature sets (also called samples) for input, where each feature set consists of 4 features
- for this example, our network will consist of a single hidden layer containing 3 neurons
- as before, we’re going to use NumPy to apply the dot product to inputs and weights, and then add biases 
- to do that, we will transpose the weights so that their shape matches the inputs for the dot product, and transpose the biases to produce a row vector that broadcasts across all samples

In [76]:
# From the previous toy example, modified for the purpose of backpropagation
import numpy as np

inputs = np.array([[1, 2, 3, 2.5], [2., 5., -1., 2], [-1.5, 2.7, 3.3, -0.8]])  # now we have 3 samples (feature sets) of data
weights = np.array([[0.2, 0.8, -0.5, 1], # now we have 3 sets of weights - one set for each neuron
                    [0.5, -0.91, 0.26, -0.5],
                    [-0.26, -0.27, 0.17, 0.87]]).T # transpose so each column holds one neuron's weights, matching the input shape
biases = np.array([[2], [3], [0.5]]).T  # one bias for each neuron

# Forward pass
layer_outputs = np.dot(inputs, weights) + biases  # forward pass thru Dense layer
relu_outputs = np.maximum(0, layer_outputs)  # forward pass thru ReLU activation

# Let's optimize and test backpropagation here
# ReLU activation
relu_dvalues = np.ones(relu_outputs.shape) # simulates derivative with respect to input values from next layer passed to current layer during backpropagation
relu_dvalues[layer_outputs <= 0] = 0
drelu = relu_dvalues

# Dense layer
dinputs = np.dot(drelu, weights.T)  # dinputs - multiply by weights
dweights = np.dot(inputs.T, drelu)  # dweights - multiply by inputs
dbiases = np.sum(drelu, axis=0, keepdims=True)  # dbiases - sum over samples (first axis); keepdims=True keeps the result as a (1, n_neurons) row rather than a 1-D array - we discussed this earlier

# Update parameters
weights += -0.001 * dweights
biases += -0.001 * dbiases

print(weights) # updated weights (each moved against its gradient, so some increase)
print(biases) # updated biases

[[ 0.1985  0.5005 -0.2615]
 [ 0.7903 -0.9147 -0.2797]
 [-0.5053  0.2537  0.1647]
 [ 0.9963 -0.5017  0.8663]]
[[1.997 2.998 0.497]]


- in this code, we replaced plain Python functions with NumPy functions, created example data, calculated forward and backward passes, and updated the parameters
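- passing a gradient of ones into the ReLU is equivalent to differentiating the sum of all ReLU outputs, so the layer gradients can be checked numerically; the sketch below is illustrative, and the `objective` helper is a name introduced here rather than part of the chapter's code:

```python
import numpy as np

inputs = np.array([[1, 2, 3, 2.5],
                   [2., 5., -1., 2],
                   [-1.5, 2.7, 3.3, -0.8]])
weights = np.array([[0.2, 0.8, -0.5, 1],
                    [0.5, -0.91, 0.26, -0.5],
                    [-0.26, -0.27, 0.17, 0.87]]).T
biases = np.array([[2, 3, 0.5]])


def objective(w, b):
    """Sum of all ReLU outputs (hypothetical scalar used only for this check)."""
    return np.sum(np.maximum(0, np.dot(inputs, w) + b))


# Analytic gradients, following the same steps as the cell above
layer_outputs = np.dot(inputs, weights) + biases
drelu = (layer_outputs > 0).astype(float)
dweights = np.dot(inputs.T, drelu)
dbiases = np.sum(drelu, axis=0, keepdims=True)

# Central-difference estimate for every weight
eps = 1e-6
dweights_numeric = np.zeros_like(weights)
for i in range(weights.shape[0]):
    for j in range(weights.shape[1]):
        w_plus = weights.copy()
        w_minus = weights.copy()
        w_plus[i, j] += eps
        w_minus[i, j] -= eps
        dweights_numeric[i, j] = (
            objective(w_plus, biases) - objective(w_minus, biases)
        ) / (2 * eps)

print(np.allclose(dweights, dweights_numeric, atol=1e-5))
print(dbiases)
```

- the second bias gradient is smaller than the others because one sample produces a negative pre-activation for that neuron, so ReLU blocks its gradient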

### Updating Neural Network Code
- now let's update the Dense layer and ReLU activation code with a backwards method (for backpropagation)
- during the `forward()` method of our `Layer_Dense()` class, we will want to remember what the inputs were so we can use them for backpropagation
- this can be easily implemented:

In [77]:
class Layer_Dense:    
    ...
    def forward(self, inputs):
        ...
        self.inputs = inputs

- next, we will add our backward pass (backpropagation) code that we developed previously 
- we'll call this the `backward()` method

In [78]:
class Layer_Dense:
 
    ...  

    # Backward pass
    def backward(self, dvalues):
        # Gradients on parameters
        self.dweights = np.dot(self.inputs.T, dvalues)
        self.dbiases = np.sum(dvalues, axis=0, keepdims=True)
        # Gradient on values
        self.dvalues = np.dot(dvalues, self.weights.T)
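- a quick way to sanity-check the `backward()` method is to confirm that each gradient has the same shape as the quantity it belongs to; this sketch assumes the constructor and forward pass used elsewhere in the chapter, and the layer sizes and batch size are arbitrary choices for illustration:

```python
import numpy as np


class Layer_Dense:
    # Constructor and forward pass assumed to match the rest of the chapter
    def __init__(self, n_inputs, n_neurons):
        self.weights = 0.01 * np.random.randn(n_inputs, n_neurons)
        self.biases = np.zeros((1, n_neurons))

    def forward(self, inputs):
        self.inputs = inputs  # remembered for the backward pass
        self.output = np.dot(inputs, self.weights) + self.biases

    def backward(self, dvalues):
        self.dweights = np.dot(self.inputs.T, dvalues)
        self.dbiases = np.sum(dvalues, axis=0, keepdims=True)
        self.dvalues = np.dot(dvalues, self.weights.T)


np.random.seed(0)
layer = Layer_Dense(4, 3)             # 4 input features, 3 neurons
layer.forward(np.random.randn(5, 4))  # batch of 5 samples
layer.backward(np.ones((5, 3)))       # stand-in gradient from the next layer

print(layer.dweights.shape, layer.weights.shape)  # gradient on weights vs. weights
print(layer.dbiases.shape, layer.biases.shape)    # gradient on biases vs. biases
print(layer.dvalues.shape, layer.inputs.shape)    # gradient on inputs vs. inputs
```

- each pair of shapes should match, which is what lets the gradients be applied to the parameters element by element and passed on to the previous layer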

- we will do the same for our `Activation_ReLU()` class

In [79]:
# ReLU activation
class Activation_ReLU:

    # Forward pass
    def forward(self, inputs):
        # Remember input values
        self.inputs = inputs
        self.output = np.maximum(0, inputs)

    # Backward pass
    def backward(self, dvalues):
        # Since we need to modify the original variable, 
        # let's make a copy of the values first
        self.dvalues = dvalues.copy()

        # Zero gradient where input values were not positive
        self.dvalues[self.inputs <= 0] = 0 
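- the copy in `backward()` matters because the masking is done in place; a small illustrative usage, with made-up input and gradient values, showing that the incoming gradient array is left untouched:

```python
import numpy as np


class Activation_ReLU:
    def forward(self, inputs):
        self.inputs = inputs
        self.output = np.maximum(0, inputs)

    def backward(self, dvalues):
        # Copy first so the caller's array is not modified
        self.dvalues = dvalues.copy()
        # Zero gradient where input values were not positive
        self.dvalues[self.inputs <= 0] = 0


relu = Activation_ReLU()
relu.forward(np.array([[1.0, -2.0, 0.0, 3.0]]))

upstream = np.array([[0.5, 0.5, 0.5, 0.5]])  # hypothetical gradient from the next layer
relu.backward(upstream)

print(relu.dvalues)  # gradient is kept only where the input was positive
print(upstream)      # unchanged, because backward() worked on a copy
```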

- at this point, we’ve covered everything we need in order to perform backpropagation except for the derivative of the Softmax activation function and the derivative of the Cross-entropy loss function
---
- our method of doing this will be similar to our previous backpropagation examples
- we will only calculate the derivative in the cross-entropy loss class, because we’re using one combined derivative formula for both the softmax activation and the cross-entropy loss
- we can combine these operations as we assume that cross-entropy loss is always going to be used together with the softmax activation function in the output layer 
- all we need to do is add a `backward()` method to our `Activation_Softmax()` class, and all this backward method needs to do is store the derivative values for the backpropagation we’re going to apply:

In [81]:
class Activation_Softmax:

    # Forward pass
    def forward(self, inputs):
        # Remember input values
        self.inputs = inputs

        # get unnormalized probabilities
        exp_values = np.exp(inputs - np.max(inputs, axis=1, 
                                            keepdims=True))
        # normalize them for each sample
        probabilities = exp_values / np.sum(exp_values, axis=1, 
                                            keepdims=True)

        self.output = probabilities

    # Backward pass
    def backward(self, dvalues):
        self.dvalues = dvalues.copy()
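- the combined derivative of softmax followed by cross-entropy can be sketched as follows, using symbols introduced here for illustration: $z_i$ for the inputs to softmax, $\hat{y}_i$ for the softmax outputs, and $y_i$ for the one-hot targets of a single sample:

```latex
\hat{y}_k = \frac{e^{z_k}}{\sum_j e^{z_j}},
\qquad
L = -\sum_k y_k \log \hat{y}_k,
\qquad
\frac{\partial \hat{y}_k}{\partial z_i} = \hat{y}_k \left(\delta_{ki} - \hat{y}_i\right)

\frac{\partial L}{\partial z_i}
  = -\sum_k \frac{y_k}{\hat{y}_k} \cdot \hat{y}_k \left(\delta_{ki} - \hat{y}_i\right)
  = -y_i + \hat{y}_i \sum_k y_k
  = \hat{y}_i - y_i
```

- the last step uses $\sum_k y_k = 1$ for one-hot targets; because the result depends only on the softmax outputs and the targets, the softmax `backward()` method can simply pass the gradient through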

- next, we’ll add the code to calculate gradients of the loss function with the `Loss_CategoricalCrossentropy()` class

In [82]:
# Cross-entropy loss
class Loss_CategoricalCrossentropy:
    
    # Forward pass
    def forward(self, y_pred, y_true):

        # Number of samples in a batch
        samples = y_pred.shape[0]

        # Probabilities for target values - 
        # only if categorical labels
        if len(y_true.shape) == 1:
            y_pred = y_pred[range(samples), y_true]

        # Losses
        negative_log_likelihoods = -np.log(y_pred)

        # Mask values - only for one-hot encoded labels
        if len(y_true.shape) == 2:
            negative_log_likelihoods *= y_true

        # Overall loss
        data_loss = np.sum(negative_log_likelihoods) / samples
        return data_loss

    # Backward pass
    def backward(self, dvalues, y_true):

        samples = dvalues.shape[0]

        self.dvalues = dvalues.copy()  # Copy so we can safely modify
        self.dvalues[range(samples), y_true] -= 1
        self.dvalues = self.dvalues / samples
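- the `backward()` method above takes the softmax outputs, subtracts 1 at each sample's true class, and divides by the number of samples; the sketch below compares that result with a finite-difference estimate of the mean loss with respect to the softmax inputs, using hypothetical helpers `softmax` and `mean_cross_entropy` and made-up data:

```python
import numpy as np


def softmax(z):
    """Row-wise softmax with the usual max-subtraction for stability."""
    exp_values = np.exp(z - np.max(z, axis=1, keepdims=True))
    return exp_values / np.sum(exp_values, axis=1, keepdims=True)


def mean_cross_entropy(z, y_true):
    """Mean cross-entropy of softmax(z) against integer class labels."""
    probs = softmax(z)
    samples = z.shape[0]
    return -np.mean(np.log(probs[range(samples), y_true]))


rng = np.random.default_rng(0)
z = rng.normal(size=(4, 3))      # softmax inputs for 4 samples, 3 classes
y_true = np.array([0, 2, 1, 2])  # integer class labels

# Analytic gradient: same steps as Loss_CategoricalCrossentropy.backward
samples = z.shape[0]
dz = softmax(z)
dz[range(samples), y_true] -= 1
dz = dz / samples

# Central-difference estimate of d(mean loss)/dz
eps = 1e-6
dz_numeric = np.zeros_like(z)
for i in range(z.shape[0]):
    for j in range(z.shape[1]):
        z_plus = z.copy()
        z_minus = z.copy()
        z_plus[i, j] += eps
        z_minus[i, j] -= eps
        dz_numeric[i, j] = (
            mean_cross_entropy(z_plus, y_true) - mean_cross_entropy(z_minus, y_true)
        ) / (2 * eps)

print(np.allclose(dz, dz_numeric, atol=1e-6))
```

- note that the indexing in `backward()` assumes `y_true` holds integer class indices; one-hot labels would first need converting, for example with `np.argmax(y_true, axis=1)`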

### Full Code

In [83]:
import numpy as np
import random

random.seed(0)
np.random.seed(0)


# Our sample dataset
def create_data(n, k):
    X = np.zeros((n*k, 2))  # data matrix (each row = single example)
    y = np.zeros(n*k, dtype='uint8')  # class labels
    for j in range(k):
        ix = range(n*j, n*(j+1))
        r = np.linspace(0.0, 1, n)  # radius
        t = np.linspace(j*4, (j+1)*4, n) + np.random.randn(n)*0.2  # theta
        X[ix] = np.c_[r*np.sin(t*2.5), r*np.cos(t*2.5)]
        y[ix] = j
    return X, y


# Dense layer
class Layer_Dense:

    # Layer initialization
    def __init__(self, inputs, neurons):
        # Initialize weights and biases
        self.weights = 0.01 * np.random.randn(inputs, neurons)
        self.biases = np.zeros((1, neurons))

    # Forward pass
    def forward(self, inputs):
        # Remember input values
        self.inputs = inputs
        # Calculate output values from input ones, weights and biases
        self.output = np.dot(inputs, self.weights) + self.biases

    # Backward pass
    def backward(self, dvalues):
        # Gradients on parameters
        self.dweights = np.dot(self.inputs.T, dvalues)
        self.dbiases = np.sum(dvalues, axis=0, keepdims=True)
        # Gradient on values
        self.dvalues = np.dot(dvalues, self.weights.T)


# ReLU activation
class Activation_ReLU:

    # Forward pass
    def forward(self, inputs):
        # Remember input values
        self.inputs = inputs
        # Calculate output values from input ones
        self.output = np.maximum(0, inputs)

    # Backward pass
    def backward(self, dvalues):
        # Since we need to modify original variable, 
        # let's make a copy of values first
        self.dvalues = dvalues.copy()

        # Zero gradient where input values were not positive
        self.dvalues[self.inputs <= 0] = 0 


# Softmax activation
class Activation_Softmax:

    # Forward pass
    def forward(self, inputs):
        # Remember input values
        self.inputs = inputs

        # get unnormalized probabilities
        exp_values = np.exp(inputs - np.max(inputs, axis=1, keepdims=True))
        # normalize them for each sample
        probabilities = exp_values / np.sum(exp_values, axis=1, keepdims=True)

        self.output = probabilities

    # Backward pass
    def backward(self, dvalues):
        self.dvalues = dvalues.copy()


# Cross-entropy loss
class Loss_CategoricalCrossentropy:

    # Forward pass
    def forward(self, y_pred, y_true):

        # Number of samples in a batch
        samples = y_pred.shape[0]

        # Probabilities for target values - only if categorical labels
        if len(y_true.shape) == 1:
            y_pred = y_pred[range(samples), y_true]

        # Losses
        negative_log_likelihoods = -np.log(y_pred)

        # Mask values - only for one-hot encoded labels
        if len(y_true.shape) == 2:
            negative_log_likelihoods *= y_true

        # Overall loss
        data_loss = np.sum(negative_log_likelihoods) / samples
        return data_loss

    # Backward pass
    def backward(self, dvalues, y_true):

        samples = dvalues.shape[0]

        self.dvalues = dvalues.copy()  # Copy so we can safely modify
        self.dvalues[range(samples), y_true] -= 1
        self.dvalues = self.dvalues / samples


# Create dataset
X, y = create_data(100, 3)

# Create Dense layer with 2 input features and 3 output values
dense1 = Layer_Dense(2, 3)  # first dense layer, 2 inputs (each sample has 2 features), 3 outputs

# Create ReLU activation (to be used with Dense layer):
activation1 = Activation_ReLU()

# Create second Dense layer with 3 input features (the outputs of the previous layer) and 3 output values
dense2 = Layer_Dense(3, 3)  # second dense layer, 3 inputs, 3 outputs

# Create Softmax activation (to be used with Dense layer):
activation2 = Activation_Softmax()

# Create loss function
loss_function = Loss_CategoricalCrossentropy()

# Make a forward pass of our training data thru this layer
dense1.forward(X)

# Make a forward pass thru activation function - we take output of previous layer here
activation1.forward(dense1.output)

# Make a forward pass thru second Dense layer - it takes outputs of activation function of first layer as inputs
dense2.forward(activation1.output)

# Make a forward pass thru activation function - we take output of previous layer here
activation2.forward(dense2.output)

# Let's see output of the first few samples:
print(activation2.output[:5])

# Calculate loss from output of activation2 (softmax activation)
loss = loss_function.forward(activation2.output, y)

# Print loss value
print('loss:', loss)

# Calculate accuracy from output of activation2 and targets
predictions = np.argmax(activation2.output, axis=1)  # index of the largest probability for each sample (along axis 1)
accuracy = np.mean(predictions==y)

print('acc:', accuracy)

# Backward pass
loss_function.backward(activation2.output, y)
activation2.backward(loss_function.dvalues)
dense2.backward(activation2.dvalues)
activation1.backward(dense2.dvalues)
dense1.backward(activation1.dvalues)

# Print gradients
print(dense1.dweights)
print(dense1.dbiases)
print(dense2.dweights)
print(dense2.dbiases)

[[0.33333333 0.33333333 0.33333333]
 [0.33333317 0.33333318 0.33333364]
 [0.33333289 0.33333292 0.3333342 ]
 [0.33333259 0.33333264 0.33333477]
 [0.33333233 0.33333239 0.33333528]]
loss: 1.0986104615465142
acc: 0.34
[[ 1.57663575e-04  7.83685868e-05  4.73243939e-05]
 [ 1.81610390e-04  1.10455707e-05 -3.30962973e-05]]
[[-3.60553684e-04  9.66122221e-05 -1.03671511e-04]]
[[ 5.44109554e-05  1.07411413e-04 -1.61822369e-04]
 [-4.07913528e-05 -7.16780945e-05  1.12469447e-04]
 [-5.30112970e-05  8.58172904e-05 -3.28059934e-05]]
[[-1.06521079e-05 -9.44490453e-06  2.00970125e-05]]


- at this point, thanks to gradients and backpropagation using the chain rule, we’ll be able to adjust the weights and biases in accordance with the objective of lowering our loss
- this process of adjusting weights and biases using gradients with the objective of decreasing loss is the job of the **optimizer**, which is the subject of the next chapter