# Chapter 9: Backpropagation

Start with a simplified forward pass with just one neuron.

Let’s backpropagate the ReLU function for a single neuron and, as practice, try to minimize that neuron’s output. This shows how we can leverage the chain rule with derivatives and partial derivatives.

We will start by <b>minimizing this more basic output</b> before jumping to the full network and overall loss.

In [1]:
x = [1.0, -2.0, 3.0] # input values
w = [-3.0, -1.0, 2.0] # weights
b = 1.0 # bias

xw0 = x[0] * w[0]
print(xw0)

-3.0


<center><img src='./image/9-1.png' style='width: 70%'/><font color='gray'><i>The first input and weight multiplication.</i></font></center>

In [2]:
xw1 = x[1] * w[1]
xw2 = x[2] * w[2]

print(xw0, xw1, xw2)


-3.0 2.0 6.0


<center><img src='./image/9-2.png' style='width: 70%'/><font color='gray'><i>Input and weight multiplication of all of the inputs.</i></font></center>

Sum all of the weighted inputs together with the bias.

In [3]:
# Adding weighted inputs and a bias
z = xw0 + xw1 + xw2 + b
print(z)

6.0


<center><img src='./image/9-3.png' style='width: 70%'/><font color='gray'><i>Weighted inputs and bias addition.</i></font></center>

In [4]:
# ReLU activation function
y = max(z, 0)
print(y)

6.0


<center><img src='./image/9-4.png' style='width: 70%'/><font color='gray'><i>ReLU activation applied to the neuron output.</i></font></center>

This is the full forward pass through a single neuron and a ReLU activation function.

Let’s treat all of these chained functions as one big function that takes the input values ($x$), weights ($w$), and bias ($b$) as inputs and produces the output $y$.

This big function consists of <b>3 chained functions</b> in total: a multiplication of input values and weights, a sum of these values and bias, as well as a $max$ function as the ReLU activation.
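
As a minimal sketch, the three chained steps can be wrapped into a single function. The helper names `mul`, `relu`, and `neuron` are illustrative assumptions and do not appear in the cells above.

```python
# Hypothetical helpers: each one is a link in the chain.
def mul(value, weight):
    # Multiplication of one input by its weight
    return value * weight

def relu(z):
    # ReLU activation
    return max(z, 0.0)

def neuron(x, w, b):
    # Sum of the weighted inputs and the bias, then the activation
    weighted = [mul(xi, wi) for xi, wi in zip(x, w)]
    z = sum(weighted) + b
    return relu(z)

y = neuron([1.0, -2.0, 3.0], [-3.0, -1.0, 2.0], 1.0)
print(y)  # 6.0, matching the step-by-step forward pass
```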


## 9.1. Derivative of activation function

Backpropagate our gradients by calculating derivatives and partial derivatives w.r.t each of our parameters and inputs.

In the context of our neural network, the big function can be loosely interpreted as:

$$
\text{ReLU} \Big( \sum[ \ \text{input} \cdot \text{weights} \ ] + \text{bias} \Big)
$$

Or, in a form that matches the code more precisely:

$$
\text{ReLU} \Big( x_0 w_0 + x_1 w_1 + x_2 w_2 + b  \Big)
$$

Rewrite the function in a form that makes the derivatives easier to determine:

$$
y = \text{ReLU} \Big( sum \big( mul(x_0 , w_0 ), mul(x_1 , w_1 ), mul(x_2 , w_2 ), b \big) \Big)
$$

Calculate the partial derivative with respect to $x_0$.

$$
\frac{\partial}{\partial x_0} \bigg[ \text{ReLU} \Big( sum \big( mul(x_0 , w_0 ), mul(x_1 , w_1 ), mul(x_2 , w_2 ), b \big) \Big)  \bigg]
$$

The derivative with respect to the layer’s inputs is not used to update any parameters. It is used to chain to another layer.

We can repeat this to calculate all of the other remaining impacts.

We want to know the impact of a given weight or bias on the loss. Thus, we have to calculate the derivative of the loss function and apply the chain rule with the derivatives of all activation functions and neurons in all of the consecutive layers.

Assume, for demonstration purposes, that our neuron receives a gradient of $1$ from the next layer. Multiplying by $1$ does not change any values, which makes each step easier to follow. The color red is used for derivatives.

<center><img src='./image/9-5.png' style='width: 70%'/><font color='gray'><i>Initial gradient (received during backpropagation).</i></font></center>

Recall the derivative of $ReLU()$ w.r.t its input:

$$
f(x) = \max(x, 0) \quad \to \quad \frac{d}{dx} f(x) = \begin{cases} 1 & x > 0 \\ 0 & x \le 0 \end{cases}
$$

The input value to the $ReLU$ function is $6$, so the derivative equals $1$.

We have to use the chain rule and multiply this derivative by the derivative received from the next layer (which is 1 for the purpose of this example).

In [5]:
# Forward pass
x = [1.0, -2.0, 3.0] # input values
w = [-3.0, -1.0, 2.0] # weights
b = 1.0 # bias

# Multiplying inputs by weights
xw0 = x[0] * w[0]
xw1 = x[1] * w[1]
xw2 = x[2] * w[2]

# Adding weighted inputs and a bias
z = xw0 + xw1 + xw2 + b

# ReLU activation function
y = max(z, 0)

# Backward pass
# The derivative from the next layer
dvalue = 1.0

# Derivative of ReLU and the chain rule
drelu_dz = dvalue * (1. if z > 0 else 0.)
print(drelu_dz)

1.0


<center><img src='./image/9-6.png' style='width: 70%'/><font color='gray'><i>Derivative of the $ReLU$ function and chain rule.</i></font></center>

This results in a derivative of 1:

<center><img src='./image/9-7.png' style='width: 70%'/><font color='gray'><i>ReLU and chain rule gradient.</i></font></center>
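
The example above only exercises the case $z > 0$. As a small sketch, the same rule can be wrapped in a helper (the name `relu_backward` is an assumption, not part of the original cells) to show what happens for a negative input or a different upstream gradient:

```python
def relu_backward(z, dvalue):
    # Derivative of max(z, 0) w.r.t. z (1 if z > 0, else 0),
    # multiplied by the gradient received from the next layer.
    return dvalue * (1.0 if z > 0 else 0.0)

print(relu_backward(6.0, 1.0))   # 1.0: the case from the example
print(relu_backward(-2.0, 1.0))  # 0.0: a negative input blocks the gradient
print(relu_backward(6.0, 0.5))   # 0.5: the upstream gradient passes through
```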



## 9.2. Derivative of a sum of the weighted inputs and bias

Moving backward through our neural network, a sum of the weighted inputs and bias comes immediately before we perform the activation function.

Thus, we must calculate the partial derivative of the sum function, then, using the chain rule, multiply this by the partial derivative of the subsequent, outer, function, which is $ReLU$.

Recall that the partial derivative of a sum with respect to any of its operands is always 1:

$$
\begin{align}
f(x,y) = x+y \quad \to \quad & \frac{\partial}{\partial x} f(x,y) = 1\\
& \frac{\partial}{\partial y} f(x,y) = 1
\end{align}
$$

In [6]:
# Forward pass
x = [1.0, -2.0, 3.0] # input values
w = [-3.0, -1.0, 2.0] # weights
b = 1.0 # bias

# Multiplying inputs by weights
xw0 = x[0] * w[0]
xw1 = x[1] * w[1]
xw2 = x[2] * w[2]

# Adding weighted inputs and a bias
z = xw0 + xw1 + xw2 + b

# ReLU activation function
y = max(z, 0)

# Backward pass
# The derivative from the next layer
dvalue = 1.0

# Derivative of ReLU and the chain rule
drelu_dz = dvalue * (1. if z > 0 else 0.)
print(drelu_dz)

# Partial derivatives of the sum, the chain rule
dsum_dxw0 = 1
dsum_dxw1 = 1
dsum_dxw2 = 1
dsum_db = 1
drelu_dxw0 = drelu_dz * dsum_dxw0
drelu_dxw1 = drelu_dz * dsum_dxw1
drelu_dxw2 = drelu_dz * dsum_dxw2
drelu_db = drelu_dz * dsum_db

print(drelu_dxw0, drelu_dxw1, drelu_dxw2, drelu_db)

1.0
1.0 1.0 1.0 1.0


<center><img src='./image/9-8.png' style='width: 70%'/><font color='gray'><i>Partial derivative of the sum function w.r.t. the first weighted input.</i></font></center>

This results in a partial derivative of 1 again:

<center><img src='./image/9-9.png' style='width: 70%'/><font color='gray'><i>The sum and chain rule gradient for the first weighted input.</i></font></center>

Perform the same operation with the next weighted input:

<center><img src='./image/9-10.png' style='width: 70%'/><font color='gray'><i>Partial derivative of the sum function w.r.t. the second weighted input.</i></font></center>

Which results in the next calculated partial derivative:

<center><img src='./image/9-11.png' style='width: 70%'/><font color='gray'><i>The sum and chain rule gradient (for the second weighted input).</i></font></center>

And the last weighted input:

<center><img src='./image/9-12.png' style='width: 70%'/><font color='gray'><i>Partial derivative of the sum function w.r.t. the third weighted input.</i></font></center>

<center><img src='./image/9-13.png' style='width: 70%'/><font color='gray'><i>The sum and chain rule gradient (for the third weighted input).</i></font></center>

Then the bias:

<center><img src='./image/9-14.png' style='width: 70%'/><font color='gray'><i>Partial derivative of the sum function w.r.t. the bias.</i></font></center>

<center><img src='./image/9-15.png' style='width: 70%'/><font color='gray'><i>The sum and chain rule gradient (for the bias).</i></font></center>
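
Because every partial derivative of the sum is 1, the sum simply hands the incoming gradient to each of its operands unchanged. A minimal sketch of that behavior, using a hypothetical helper `sum_backward` and an illustrative upstream gradient of 0.5:

```python
def sum_backward(num_operands, upstream):
    # Each operand of a sum has a local derivative of 1, so the chain
    # rule gives upstream * 1 for every operand.
    return [upstream * 1.0 for _ in range(num_operands)]

# Three weighted inputs plus the bias, upstream gradient of 1 (as in the text)
grads = sum_backward(4, 1.0)
print(grads)  # [1.0, 1.0, 1.0, 1.0]

# With a different upstream gradient, every operand receives that same value
scaled = sum_backward(4, 0.5)
print(scaled)  # [0.5, 0.5, 0.5, 0.5]
```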


## 9.3. Derivative of the multiplication of weights and inputs

The partial derivative of a product with respect to one of its factors is the other factor, that is, whatever that input is being multiplied by.

$$
\begin{align}
f(x,y) = x \cdot y \quad \to \quad & \frac{\partial}{\partial x} f(x,y) = y \\
& \frac{\partial}{\partial y} f(x,y) = x
\end{align}
$$

In [7]:
# Partial derivatives of the multiplication, the chain rule
dmul_dx0 = w[0]
dmul_dx1 = w[1]
dmul_dx2 = w[2]
dmul_dw0 = x[0]
dmul_dw1 = x[1]
dmul_dw2 = x[2]
drelu_dx0 = drelu_dxw0 * dmul_dx0
drelu_dw0 = drelu_dxw0 * dmul_dw0
drelu_dx1 = drelu_dxw1 * dmul_dx1
drelu_dw1 = drelu_dxw1 * dmul_dw1
drelu_dx2 = drelu_dxw2 * dmul_dx2
drelu_dw2 = drelu_dxw2 * dmul_dw2

print(drelu_dx0, drelu_dw0, drelu_dx1, drelu_dw1, drelu_dx2, drelu_dw2)

-3.0 1.0 -1.0 -2.0 2.0 3.0


<center><img src='./image/9-16.png' style='width: 70%'/><font color='gray'><i>Partial derivative of the multiplication function w.r.t. the first input.</i></font></center>

<center><img src='./image/9-17.png' style='width: 70%'/><font color='gray'><i>The multiplication and chain rule gradient (for the first input).</i></font></center>

Below is the complete set of the activated neuron’s partial derivatives with respect to the inputs, weights, and bias:

<center><img src='./image/9-18.png' style='width: 70%'/><font color='gray'><i>Complete backpropagation graph.</i></font></center>

Recall the equation from the beginning:

$$
\frac{\partial}{\partial x_0} \bigg[ \text{ReLU} \Big( sum \big( mul(x_0 , w_0 ), mul(x_1 , w_1 ), mul(x_2 , w_2 ), b \big) \Big)  \bigg]
= \frac{d \text{ReLU} }{d sum()} \cdot \frac{\partial sum() }{\partial mul(x_0 , w_0 )} \cdot \frac{\partial mul(x_0 , w_0 )}{\partial x_0}
$$
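
Substituting the values from this example (sum $= 6 > 0$, $w_0 = -3$, and an upstream gradient of $1$) gives the worked chain below. The symbol $y$ stands for the neuron's output after ReLU:

$$
\frac{\partial y}{\partial x_0} = \underbrace{1}_{\text{ReLU, since } 6 > 0} \cdot \underbrace{1}_{\text{sum}} \cdot \underbrace{w_0}_{\text{mul}} = 1 \cdot 1 \cdot (-3) = -3
$$

This agrees with the value of `drelu_dx0` printed above.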

The partial derivative of a neuron’s function, with respect to the weight, is the input related to this weight, and, with respect to the input, is the related weight. The partial derivative of the neuron’s function with respect to the bias is always 1.
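
One way to sanity-check these rules is to compare them against a finite-difference approximation. The sketch below is not part of the original walkthrough: the `neuron` helper and the step size `h` are assumptions chosen for illustration.

```python
def neuron(x, w, b):
    # Forward pass: weighted sum plus bias, then ReLU
    z = sum(xi * wi for xi, wi in zip(x, w)) + b
    return max(z, 0.0)

x = [1.0, -2.0, 3.0]
w = [-3.0, -1.0, 2.0]
b = 1.0
h = 1e-6  # small nudge used for the numerical estimate (assumed value)

# Analytic gradients: ReLU derivative times the input (for the weights)
# or times 1 (for the bias)
z = sum(xi * wi for xi, wi in zip(x, w)) + b
relu_grad = 1.0 if z > 0 else 0.0
analytic_dw = [relu_grad * xi for xi in x]
analytic_db = relu_grad

# Numerical gradients: nudge one parameter at a time and measure the change
base = neuron(x, w, b)
numeric_dw = []
for i in range(len(w)):
    w_shift = list(w)
    w_shift[i] += h
    numeric_dw.append((neuron(x, w_shift, b) - base) / h)
numeric_db = (neuron(x, w, b + h) - base) / h

print(analytic_dw, analytic_db)
print([round(g, 4) for g in numeric_dw], round(numeric_db, 4))
```

Both lines should agree up to floating-point error, since the neuron is linear in each parameter near this point.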


## 9.4. Update weight and bias to decrease output

All of the partial derivatives above, combined into vectors, make up our gradients.

In [8]:
dx = [drelu_dx0, drelu_dx1, drelu_dx2] # gradients on inputs
dw = [drelu_dw0, drelu_dw1, drelu_dw2] # gradients on weights
db = drelu_db # gradient on bias (just one bias here)

For this single neuron example, we won't need our $dx$.

We will apply these gradients to the weights and bias to try to minimize the output.

Because the gradient points in the direction of steepest ascent, we add a small negative fraction of it to each parameter to decrease the final output value.

In [9]:
print(w, b)

[-3.0, -1.0, 2.0] 1.0


Apply a fraction of the gradients to these values:

In [10]:
w[0] += -0.001 * dw[0]
w[1] += -0.001 * dw[1]
w[2] += -0.001 * dw[2]
b += -0.001 * db
print(w, b)

[-3.001, -0.998, 1.997] 0.999


We changed the weights and bias slightly, each in the direction that decreases the output.

Do another forward pass to see the effects:

In [11]:
# Multiplying inputs by weights
xw0 = x[0] * w[0]
xw1 = x[1] * w[1]
xw2 = x[2] * w[2]

# Adding
z = xw0 + xw1 + xw2 + b

# ReLU activation function
y = max(z, 0)

print(y)

5.985


We've successfully decreased this neuron's output from $6.000$ to $5.985$.
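
Repeating the same update several times should keep lowering the output. The loop below is a minimal sketch of that idea, not part of the original walkthrough. The name `learning_rate` and the choice of five steps are assumptions, and the gradient from the next layer is still taken to be 1.

```python
x = [1.0, -2.0, 3.0]   # input values
w = [-3.0, -1.0, 2.0]  # weights
b = 1.0                # bias
learning_rate = 0.001  # the same fraction used above (name is illustrative)

outputs = []
for _ in range(5):
    # Forward pass
    z = sum(xi * wi for xi, wi in zip(x, w)) + b
    y = max(z, 0.0)
    outputs.append(y)

    # Backward pass (gradient from the next layer assumed to be 1)
    drelu_dz = 1.0 if z > 0 else 0.0
    dw = [drelu_dz * xi for xi in x]
    db = drelu_dz

    # Step against the gradient to decrease the output
    w = [wi - learning_rate * dwi for wi, dwi in zip(w, dw)]
    b -= learning_rate * db

print(outputs)
```

Each step lowers the output by roughly $0.001 \cdot (1^2 + (-2)^2 + 3^2 + 1) = 0.015$, as long as the sum stays positive.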