# Chapter 9: Backpropagation

Start with a simplified forward pass with just one neuron.

Let’s backpropagate the ReLU function for a single neuron and, as practice, try to minimize the output of this single neuron. This shows how we can leverage the chain rule with derivatives and partial derivatives.

We will start by <b>minimizing this more basic output</b> before jumping to the full network and overall loss.

In [1]:
x = [1.0, -2.0, 3.0] # input values
w = [-3.0, -1.0, 2.0] # weights
b = 1.0 # bias

xw0 = x[0] * w[0]
print(xw0)

-3.0


<center><img src='./image/9-1.png' style='width: 70%'/><font color='gray'><i>The first input and weight multiplication.</i></font></center>

In [2]:
xw1 = x[1] * w[1]
xw2 = x[2] * w[2]

print(xw0, xw1, xw2)


-3.0 2.0 6.0


<center><img src='./image/9-2.png' style='width: 70%'/><font color='gray'><i>Input and weight multiplication of all of the inputs.</i></font></center>

Perform a sum of all weighted inputs with a bias.

In [3]:
# Adding weighted inputs and a bias
z = xw0 + xw1 + xw2 + b
print(z)

6.0


<center><img src='./image/9-3.png' style='width: 70%'/><font color='gray'><i>Weighted inputs and bias addition.</i></font></center>

In [4]:
# ReLU activation function
y = max(z, 0)
print(y)

6.0


<center><img src='./image/9-4.png' style='width: 70%'/><font color='gray'><i>ReLU activation applied to the neuron output.</i></font></center>

This is the full forward pass through a single neuron and a ReLU activation function.

Let’s treat all of these chained functions as one big function which takes the input values ($x$), weights ($w$), and bias ($b$) as inputs and produces the output ($y$).

This big function consists of <b>3 chained functions</b> in total: a multiplication of input values and weights, a sum of these values and bias, as well as a $max$ function as the ReLU activation.


## 9.1. Derivative of activation function

Backpropagate our gradients by calculating derivatives and partial derivatives w.r.t each of our parameters and inputs.

In the context of our neural network, the big function can be loosely interpreted as:

$$
\text{ReLU} \Big( \sum[ \ \text{input} \cdot \text{weights} \ ] + \text{bias} \Big)
$$

Or in the form that matches code more precisely as:

$$
\text{ReLU} \Big( x_0 w_0 + x_1 w_1 + x_2 w_2 + b  \Big)
$$

Rewrite the function to the form that will allow us to determine how to calculate the derivatives more easily:

$$
y = \text{ReLU} \Big( sum \big( mul(x_0 , w_0 ), mul(x_1 , w_1 ), mul(x_2 , w_2 ), b \big) \Big)
$$
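To make the nested form concrete, the sketch below mirrors it with small helper functions. The names `mul`, `add_all`, and `relu` are hypothetical helpers introduced only for this illustration; the values are the ones used in the forward pass above.

```python
# Hypothetical helpers mirroring the nested form of the neuron function
def mul(a, b):
    return a * b

def add_all(*values):
    return sum(values)

def relu(value):
    return max(value, 0.0)

x = [1.0, -2.0, 3.0]   # input values
w = [-3.0, -1.0, 2.0]  # weights
b = 1.0                # bias

# y = ReLU(sum(mul(x0, w0), mul(x1, w1), mul(x2, w2), b))
y = relu(add_all(mul(x[0], w[0]), mul(x[1], w[1]), mul(x[2], w[2]), b))
print(y)  # 6.0
```

Writing the neuron this way makes the three chained functions explicit, which is the order in which the chain rule is applied in reverse during the backward pass.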

Calculate the partial derivative with respect to $x_0$.

$$
\frac{∂}{∂x_0} \bigg[ \text{ReLU} \Big( sum \big( mul(x_0 , w_0 ), mul(x_1 , w_1 ), mul(x_2 , w_2 ), b \big) \Big)  \bigg]
$$

The derivative with respect to the layer’s inputs is not used to update any parameters. It is used to chain to another layer.

We can repeat this to calculate all of the other remaining impacts.

We want to know the impact of a given weight or bias on the loss. Thus, we have to calculate the derivative of the loss function and apply the chain rule with the derivatives of all activation functions and neurons in all of the consecutive layers.

Assume, for demonstration purposes, that our neuron receives a gradient of $1$ from the next layer. Multiplying by $1$ won't change any values, which means we can more easily show all of the processes. The color red is used for derivatives.

<center><img src='./image/9-5.png' style='width: 70%'/><font color='gray'><i>Initial gradient (received during backpropagation).</i></font></center>

Recall the derivative of $ReLU()$ w.r.t its input:

$$
f(x) = max(x,0) \quad \to \quad \frac{d}{dx} f(x) = 1 (x>0)
$$

The input value to the $ReLU$ function is $6$, so the derivative equals $1$.

We have to use the chain rule and multiply this derivative with the derivative received from the next layer (which is 1 for the purpose of this example).

In [5]:
# Forward pass
x = [1.0, -2.0, 3.0] # input values
w = [-3.0, -1.0, 2.0] # weights
b = 1.0 # bias

# Multiplying inputs by weights
xw0 = x[0] * w[0]
xw1 = x[1] * w[1]
xw2 = x[2] * w[2]

# Adding weighted inputs and a bias
z = xw0 + xw1 + xw2 + b

# ReLU activation function
y = max(z, 0)

# Backward pass
# The derivative from the next layer
dvalue = 1.0

# Derivative of ReLU and the chain rule
drelu_dz = dvalue * (1. if z > 0 else 0.)
print(drelu_dz)

1.0


<center><img src='./image/9-6.png' style='width: 70%'/><font color='gray'><i>Derivative of the $ReLU$ function and chain rule.</i></font></center>

This results in a derivative of 1:

<center><img src='./image/9-7.png' style='width: 70%'/><font color='gray'><i>ReLU and chain rule gradient.</i></font></center>



## 9.2. Derivative of a sum of the weighted inputs and bias

Moving backward through our neural network, a sum of the weighted inputs and bias comes immediately before we perform the activation function.

Thus, we must calculate the partial derivative of the sum function, then, using the chain rule, multiply this by the partial derivative of the subsequent, outer, function, which is $ReLU$.

Recall the partial derivative of the sum operation is always 1:

$$
\begin{align}
f(x,y) = x+y \quad \to \quad & \frac{∂}{∂x} f(x,y) = 1\\
& \frac{∂}{∂y} f(x,y) = 1
\end{align}
$$

In [6]:
# Forward pass
x = [1.0, -2.0, 3.0] # input values
w = [-3.0, -1.0, 2.0] # weights
b = 1.0 # bias

# Multiplying inputs by weights
xw0 = x[0] * w[0]
xw1 = x[1] * w[1]
xw2 = x[2] * w[2]

# Adding weighted inputs and a bias
z = xw0 + xw1 + xw2 + b

# ReLU activation function
y = max(z, 0)

# Backward pass
# The derivative from the next layer
dvalue = 1.0

# Derivative of ReLU and the chain rule
drelu_dz = dvalue * (1. if z > 0 else 0.)
print(drelu_dz)

# Partial derivatives of the sum, the chain rule
dsum_dxw0 = 1
dsum_dxw1 = 1
dsum_dxw2 = 1
dsum_db = 1
drelu_dxw0 = drelu_dz * dsum_dxw0
drelu_dxw1 = drelu_dz * dsum_dxw1
drelu_dxw2 = drelu_dz * dsum_dxw2
drelu_db = drelu_dz * dsum_db

print(drelu_dxw0, drelu_dxw1, drelu_dxw2, drelu_db)

1.0
1.0 1.0 1.0 1.0


<center><img src='./image/9-8.png' style='width: 70%'/><font color='gray'><i>Partial derivative of the sum function w.r.t. the first weighted input.</i></font></center>

This results in a partial derivative of 1 again:

<center><img src='./image/9-9.png' style='width: 70%'/><font color='gray'><i>The sum and chain rule gradient for the first weighted input.</i></font></center>

Perform the same operation with the next weighted input:

<center><img src='./image/9-10.png' style='width: 70%'/><font color='gray'><i>Partial derivative of the sum function w.r.t. the second weighted input.</i></font></center>

This results in the next calculated partial derivative:

<center><img src='./image/9-11.png' style='width: 70%'/><font color='gray'><i>The sum and chain rule gradient (for the second weighted input).</i></font></center>

And the last weighted input:

<center><img src='./image/9-12.png' style='width: 70%'/><font color='gray'><i>Partial derivative of the sum function w.r.t. the third weighted input.</i></font></center>

<center><img src='./image/9-13.png' style='width: 70%'/><font color='gray'><i>The sum and chain rule gradient (for the third weighted input).</i></font></center>

Then the bias:

<center><img src='./image/9-14.png' style='width: 70%'/><font color='gray'><i>Partial derivative of the sum function w.r.t. the bias.</i></font></center>

<center><img src='./image/9-15.png' style='width: 70%'/><font color='gray'><i>The sum and chain rule gradient (for the bias).</i></font></center>


## 9.3. Derivative of the multiplication of weights and inputs

The derivative for a product is whatever the input is being multiplied by.

$$
\begin{align}
f(x,y) = x \cdot y \quad \to \quad & \frac{∂}{∂x} f(x,y) = y \\
& \frac{∂}{∂y} f(x,y) = x
\end{align}
$$

In [7]:
# Partial derivatives of the multiplication, the chain rule
dmul_dx0 = w[0]
dmul_dx1 = w[1]
dmul_dx2 = w[2]
dmul_dw0 = x[0]
dmul_dw1 = x[1]
dmul_dw2 = x[2]
drelu_dx0 = drelu_dxw0 * dmul_dx0
drelu_dw0 = drelu_dxw0 * dmul_dw0
drelu_dx1 = drelu_dxw1 * dmul_dx1
drelu_dw1 = drelu_dxw1 * dmul_dw1
drelu_dx2 = drelu_dxw2 * dmul_dx2
drelu_dw2 = drelu_dxw2 * dmul_dw2

print(drelu_dx0, drelu_dw0, drelu_dx1, drelu_dw1, drelu_dx2, drelu_dw2)

-3.0 1.0 -1.0 -2.0 2.0 3.0


<center><img src='./image/9-16.png' style='width: 70%'/><font color='gray'><i>Partial derivative of the multiplication function w.r.t. the first input.</i></font></center>

<center><img src='./image/9-17.png' style='width: 70%'/><font color='gray'><i>The multiplication and chain rule gradient (for the first input).</i></font></center>

This gives the complete set of the activated neuron’s partial derivatives with respect to the inputs, weights, and bias:

<center><img src='./image/9-18.png' style='width: 70%'/><font color='gray'><i>Complete backpropagation graph.</i></font></center>

Recall the equation from the beginning:

$$
\frac{∂}{∂x_0} \bigg[ \text{ReLU} \Big( sum \big( mul(x_0 , w_0 ), mul(x_1 , w_1 ), mul(x_2 , w_2 ), b \big) \Big)  \bigg]
= \frac{d \text{ReLU} }{d sum()} \cdot \frac{∂ sum() }{∂ mul(x_0 , w_0 )} \cdot \frac{∂ mul(x_0 , w_0 )}{∂x_0}
$$

The partial derivative of a neuron’s function, with respect to the weight, is the input related to this weight, and, with respect to the input, is the related weight. The partial derivative of the neuron’s function with respect to the bias is always 1.
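As a sanity check on this summary, the sketch below estimates the same gradients numerically with central differences. The helpers `neuron`, `replaced`, and `central_difference`, as well as the step size `eps`, are assumptions made for this illustration and are not part of the chapter's code.

```python
# Numerical check of the single-neuron gradients (illustrative sketch)
def neuron(x, w, b):
    z = sum(xi * wi for xi, wi in zip(x, w)) + b
    return max(z, 0.0)  # ReLU

def replaced(values, index, new_value):
    # Return a copy of the list with one element swapped out
    copy = list(values)
    copy[index] = new_value
    return copy

def central_difference(f, value, eps=1e-6):
    # Symmetric finite-difference estimate of df/dvalue
    return (f(value + eps) - f(value - eps)) / (2 * eps)

x = [1.0, -2.0, 3.0]
w = [-3.0, -1.0, 2.0]
b = 1.0

dw_numeric = [central_difference(lambda v, i=i: neuron(x, replaced(w, i, v), b), w[i])
              for i in range(3)]
dx_numeric = [central_difference(lambda v, i=i: neuron(replaced(x, i, v), w, b), x[i])
              for i in range(3)]
db_numeric = central_difference(lambda v: neuron(x, w, v), b)

print(dw_numeric)  # approximately the inputs x
print(dx_numeric)  # approximately the weights w
print(db_numeric)  # approximately 1
```

The estimates agree with the analytic result only because the ReLU input is positive here; for a negative input, all of these gradients would be 0.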


## 9.4. Update weight and bias to decrease output

All of the partial derivatives above, combined into vectors, make up our gradients.

In [8]:
dx = [drelu_dx0, drelu_dx1, drelu_dx2] # gradients on inputs
dw = [drelu_dw0, drelu_dw1, drelu_dw2] # gradients on weights
db = drelu_db # gradient on bias...just 1 bias here.

For this single neuron example, we won't need our $dx$.

We will apply these gradients to the weights to hopefully minimize the output.

We apply a negative fraction to this gradient to decrease the final output value, as the gradient shows the direction of the steepest ascent.

In [9]:
print(w, b)

[-3.0, -1.0, 2.0] 1.0


Apply a fraction of the gradients to these values:

In [10]:
w[0] += -0.001 * dw[0]
w[1] += -0.001 * dw[1]
w[2] += -0.001 * dw[2]
b += -0.001 * db
print(w, b)

[-3.001, -0.998, 1.997] 0.999


We slightly changed the weights and bias in such a way to decrease the output intelligently.

Do another forward pass to see the effects:

In [11]:
# Multiplying inputs by weights
xw0 = x[0] * w[0]
xw1 = x[1] * w[1]
xw2 = x[2] * w[2]

# Adding
z = xw0 + xw1 + xw2 + b

# ReLU activation function
y = max(z, 0)

print(y)

5.985


We've successfully decreased this neuron's output from $6.000$ to $5.985$.
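The single update above can be repeated in a loop. The sketch below is a minimal illustration that assumes the same inputs, the same factor of `0.001`, and an incoming gradient of 1 at every step; the variable names `learning_rate` and `outputs` are introduced here for illustration.

```python
x = [1.0, -2.0, 3.0]   # input values
w = [-3.0, -1.0, 2.0]  # weights
b = 1.0                # bias
learning_rate = 0.001  # the same fraction used in the update above

outputs = []
for step in range(5):
    # Forward pass
    z = sum(xi * wi for xi, wi in zip(x, w)) + b
    y = max(z, 0.0)
    outputs.append(y)

    # Backward pass (gradient from the "next layer" assumed to be 1)
    drelu_dz = 1.0 if z > 0 else 0.0
    dw = [drelu_dz * xi for xi in x]  # d(output)/d(weight) is the related input
    db = drelu_dz                     # d(output)/d(bias) is 1 times the ReLU gradient

    # Step against the gradient to decrease the output
    w = [wi - learning_rate * dwi for wi, dwi in zip(w, dw)]
    b -= learning_rate * db

print(outputs)  # each value is slightly smaller than the previous one
```

Each step lowers the output by roughly the same amount because the gradient with respect to the weights depends only on the inputs, which stay fixed.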

It does not make sense to decrease the neuron's output in a real neural network. We do this as a simpler exercise.

We want to decrease the loss value, which is the last calculation in the chain of calculations during the forward pass, and it's the first one to calculate the gradient during the backpropagation.

## 9.5. List of samples and a layer of neurons

### A singular sample

A single neuron of the current layer connects to all of the neurons in the next layer: they all receive the output of this neuron.

What happens during backpropagation?

Each neuron from the next layer will return a partial derivative of its function w.r.t all of its inputs.

The neuron in the current layer will receive a vector consisting of these derivatives.

To continue backpropagation, we need to sum this vector into a single value for the single neuron.

For a layer of neurons, it'll be a list of these vectors, or a 2D array.

To apply the chain rule, we need to multiply them by the gradient from the subsequent function.

We should perform a sum along the inputs - the first input to all of the neurons, the second input, and so on. So we have to sum columns.

The array of partial derivatives w.r.t all of the inputs equals the array of weights.

Since the array is transposed, we need to sum its rows instead of columns.

The result is the gradient that we pass on to the next layer in backpropagation (the previous layer in the forward pass).

In [12]:
import numpy as np

# Passed in gradient from the next layer
# for the purpose of this example we're going to use
# a vector of 1s
dvalues = np.array([[1., 1., 1.]])

# We have 3 sets of weights - one set for each neuron
# we have 4 inputs, thus 4 weights
# recall that we keep weights transposed
weights = np.array([[0.2, 0.8, -0.5, 1],
                    [0.5, -0.91 , 0.26, -0.5],
                    [-0.26, -0.27 , 0.17, 0.87]]).T

# for each input, multiply its weights by the gradients passed in
# from the corresponding neurons and sum the products
dx0 = sum(weights[0] * dvalues[0])
dx1 = sum(weights[1] * dvalues[0])
dx2 = sum(weights[2] * dvalues[0])
dx3 = sum(weights[3] * dvalues[0])

# the gradient of the neuron function with respect to inputs
dinputs = np.array([dx0, dx1, dx2, dx3])

print(dinputs)

[ 0.44 -0.38 -0.07  1.37]


We can achieve the same result by using the `np.dot` function. 

Recall that we receive one gradient for each neuron from the next layer, and that each neuron's partial derivatives with respect to its inputs are its weights.

For every input, we multiply each neuron's gradient by the weight that connects this input to that neuron and sum the products, which is exactly a dot product.


In [13]:
import numpy as np

# Passed in gradient from the next layer
# for the purpose of this example we're going to use
# a vector of 1s
dvalues = np.array([[1., 1., 1.]])

# We have 3 sets of weights - one set for each neuron
# we have 4 inputs, thus 4 weights
# recall that we keep weights transposed
weights = np.array([[0.2, 0.8, -0.5, 1],
                    [0.5, -0.91, 0.26, -0.5 ],
                    [-0.26, -0.27, 0.17, 0.87 ]]).T

# sum weights of given input
# and multiply by the passed in gradient for this neuron
dinputs = np.dot(dvalues[ 0 ], weights.T)

print (dinputs)

[ 0.44 -0.38 -0.07  1.37]
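As a quick check of the equivalence described above, the sketch below compares an explicit per-input sum with the `np.dot` version. The gradient values `[1., 2., 3.]` are a hypothetical choice made here so that each neuron contributes a different amount; the names `manual` and `vectorized` are also introduced only for this illustration.

```python
import numpy as np

# Hypothetical incoming gradient, one value per neuron
dvalues = np.array([[1., 2., 3.]])

# Same weights as above, kept transposed: shape (inputs, neurons)
weights = np.array([[0.2, 0.8, -0.5, 1],
                    [0.5, -0.91, 0.26, -0.5],
                    [-0.26, -0.27, 0.17, 0.87]]).T

# Explicit version: for each input, weight every connection by the
# gradient of the neuron it feeds, then sum over the neurons
manual = np.array([np.sum(weights[i] * dvalues[0])
                   for i in range(weights.shape[0])])

# Vectorized version: one dot product
vectorized = np.dot(dvalues[0], weights.T)

print(manual.shape)                      # (4,)
print(np.allclose(manual, vectorized))   # True
```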


### A batch of samples

In [14]:
import numpy as np

# Passed in gradient from the next layer
# for the purpose of this example we're going to use
# an array of an incremental gradient values
dvalues = np.array([[1., 1., 1.],
                    [2., 2., 2.],
                    [3., 3., 3.]])

# We have 3 sets of weights - one set for each neuron
# we have 4 inputs, thus 4 weights
# recall that we keep weights transposed
weights = np.array([[0.2, 0.8, -0.5, 1],
                    [0.5, -0.91, 0.26, -0.5],
                    [-0.26, -0.27, 0.17, 0.87]]).T

# sum weights of given input
# and multiply by the passed in gradient for this neuron
dinputs = np.dot(dvalues, weights.T)

print(dinputs)

[[ 0.44 -0.38 -0.07  1.37]
 [ 0.88 -0.76 -0.14  2.74]
 [ 1.32 -1.14 -0.21  4.11]]


In [15]:
import numpy as np

# Passed in gradient from the next layer for the purpose of this example 
# We're going to use an array of an incremental gradient values
dvalues = np.array([[1., 1., 1.],
                    [2., 2., 2.],
                    [3., 3., 3.]])

# We have 3 sets of inputs - samples
inputs = np.array([[1, 2, 3, 2.5],
                   [2., 5., -1., 2],
                   [-1.5, 2.7, 3.3, -0.8]])

# Sum inputs for given weight and multiply by the passed in gradient for this neuron
dweights = np.dot(inputs.T, dvalues)

print(dweights)

[[ 0.5  0.5  0.5]
 [20.1 20.1 20.1]
 [10.9 10.9 10.9]
 [ 4.1  4.1  4.1]]
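One way to read `np.dot(inputs.T, dvalues)` is as a sum of per-sample contributions. The sketch below, using the same arrays as the cell above, builds each sample's contribution with `np.outer` and confirms that their sum matches the dot product; the names `per_sample` and `dweights_manual` are assumptions made for this illustration.

```python
import numpy as np

dvalues = np.array([[1., 1., 1.],
                    [2., 2., 2.],
                    [3., 3., 3.]])

inputs = np.array([[1, 2, 3, 2.5],
                   [2., 5., -1., 2],
                   [-1.5, 2.7, 3.3, -0.8]])

# Each sample contributes an (inputs x neurons) array:
# its input vector times its incoming gradient vector
per_sample = [np.outer(inputs[i], dvalues[i]) for i in range(len(inputs))]
dweights_manual = np.sum(per_sample, axis=0)

# The dot product performs the same multiplication and sum over samples
dweights = np.dot(inputs.T, dvalues)

print(dweights.shape)                          # (4, 3)
print(np.allclose(dweights, dweights_manual))  # True
```

The result has the same shape as the weights array, which is what lets us apply it directly in a parameter update.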


For the biases, the derivatives come from the sum operation and always equal 1; they are multiplied by the incoming gradients to apply the chain rule.

In [16]:
import numpy as np

# Passed in gradient from the next layer for the purpose of this example
# we're going to use an array of an incremental gradient values
dvalues = np.array([[1., 1., 1.],
                    [2., 2., 2.],
                    [3., 3., 3.]])

# One bias for each neuron
# biases are the row vector with a shape (1, neurons)
biases = np.array([[2, 3, 0.5]])

# dbiases - sum values, do this over samples (first axis), keepdims
# since this by default will produce a 1D array - we explained this in chapter 4
dbiases = np.sum(dvalues, axis=0, keepdims=True)

print(dbiases)

[[6. 6. 6.]]
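The effect of `keepdims` is easiest to see by comparing shapes. This short sketch reuses the `dvalues` array from the cell above; the variable names are introduced only for this illustration.

```python
import numpy as np

dvalues = np.array([[1., 1., 1.],
                    [2., 2., 2.],
                    [3., 3., 3.]])

# Without keepdims the summed axis is dropped
without_keepdims = np.sum(dvalues, axis=0)

# With keepdims the summed axis is kept with length 1,
# matching the (1, neurons) shape of the biases
with_keepdims = np.sum(dvalues, axis=0, keepdims=True)

print(without_keepdims.shape)  # (3,)
print(with_keepdims.shape)     # (1, 3)
```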


### Derivative of the ReLU function

It equals 1 if the input is greater than 0 and 0 otherwise.

The layer passes its outputs through the $ReLU()$ activation during the forward pass.

For the backward pass, $ReLU()$ receives a gradient of the same shape.

The derivative of the ReLU function will form an array of the same shape, filled with 1 when the related input is greater than 0, and 0 otherwise.

To apply the chain rule, we need to multiply this array with the gradients of the following function.

In [17]:
import numpy as np

# Example layer output (4 neurons)
z = np.array([[1, 2, -3, -4],
              [2, -7, -1, 3],
              [-1, 2, 5, -1]])

dvalues = np.array([[1, 2, 3, 4],
                    [5, 6, 7, 8],
                    [9, 10, 11, 12]])

# ReLU activation's derivative
drelu = np.zeros_like(z)
drelu[z > 0] = 1
print(drelu)

# The chain rule
drelu *= dvalues

print(drelu)

[[1 1 0 0]
 [1 0 0 1]
 [0 1 1 0]]
[[ 1  2  0  0]
 [ 5  0  0  8]
 [ 0 10 11  0]]


We can simplify this operation.

In [18]:
import numpy as np

# Example layer output
z = np.array([[1, 2, -3, -4],
              [2, -7, -1, 3],
              [-1, 2, 5, -1]])

dvalues = np.array([[1, 2, 3, 4],
                    [5, 6, 7, 8],
                    [9, 10, 11, 12]])

# ReLU activation's derivative with the chain rule applied
drelu = dvalues.copy()    # ensures that we don't modify it during the ReLU derivative calculation.
drelu[z <= 0] = 0

print(drelu)

[[ 1  2  0  0]
 [ 5  0  0  8]
 [ 0 10 11  0]]
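Another way to express the same operation, shown here only as an alternative sketch, is to multiply the incoming gradients by a boolean mask. NumPy treats `True` as 1 and `False` as 0 in arithmetic, so the result matches the copy-and-zero approach; the names `drelu_copy` and `drelu_mask` are introduced for this comparison.

```python
import numpy as np

z = np.array([[1, 2, -3, -4],
              [2, -7, -1, 3],
              [-1, 2, 5, -1]])

dvalues = np.array([[1, 2, 3, 4],
                    [5, 6, 7, 8],
                    [9, 10, 11, 12]])

# Approach from the cell above: copy, then zero where the input was not positive
drelu_copy = dvalues.copy()
drelu_copy[z <= 0] = 0

# Alternative: multiply by the ReLU derivative expressed as a boolean mask
drelu_mask = dvalues * (z > 0)

print(np.array_equal(drelu_copy, drelu_mask))  # True
```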


Let’s combine the forward and backward pass of a single neuron with a full layer and batch-based partial derivatives. We’ll minimize ReLU’s output, once again, only for this example:

In [19]:
import numpy as np

# Passed in gradient from the next layer for the purpose of this example
# We're going to use an array of an incremental gradient values
dvalues = np.array([[1., 1., 1.],
                    [2., 2., 2.],
                    [3., 3., 3.]])

# We have 3 sets of inputs - samples
inputs = np.array([[1, 2, 3, 2.5],
                   [2., 5., -1., 2],
                   [-1.5, 2.7, 3.3, -0.8]])

# We have 3 sets of weights - one set for each neuron
# we have 4 inputs, thus 4 weights
# recall that we keep weights transposed
weights = np.array([[0.2, 0.8, -0.5, 1],
                    [0.5, -0.91, 0.26, -0.5],
                    [-0.26, -0.27, 0.17, 0.87]]).T

# One bias for each neuron
# biases are the row vector with a shape (1, neurons)
biases = np.array([[2, 3, 0.5]])

# Forward pass
layer_outputs = np.dot(inputs, weights) + biases # Dense layer
relu_outputs = np.maximum(0, layer_outputs) # ReLU activation

# Let's optimize and test backpropagation here
# ReLU activation - simulates derivative with respect to input values
# from next layer passed to current layer during backpropagation
drelu = relu_outputs.copy()
drelu[layer_outputs <= 0] = 0

# Dense layer
# dinputs - multiply by weights
dinputs = np.dot(drelu, weights.T)

# dweights - multiply by inputs
dweights = np.dot(inputs.T, drelu)

# dbiases - sum values, do this over samples (first axis), keepdims
# since this by default will produce a plain list - we explained this in the chapter 4
dbiases = np.sum(drelu, axis = 0 , keepdims = True)

# Update parameters
weights += - 0.001 * dweights
biases += - 0.001 * dbiases

print(weights)
print(biases)

[[ 0.179515   0.5003665 -0.262746 ]
 [ 0.742093  -0.9152577 -0.2758402]
 [-0.510153   0.2529017  0.1629592]
 [ 0.971328  -0.5021842  0.8636583]]
[[1.98489  2.997739 0.497389]]


Update the dense layer and ReLU activation code with a `backward` method (for backpropagation).

In [20]:
import numpy as np

# Dense layer
class Layer_Dense:
    # Layer initialization
    def __init__(self, inputs, neurons):
        self.weights = 0.01 * np.random.randn(inputs, neurons)
        self.biases = np.zeros((1, neurons))

    # Forward pass
    def forward(self, inputs):
        self.output = np.dot(inputs, self.weights) + self.biases
        # To remember what the inputs were (needed when calculating the partial derivative w.r.t weights during backpropagation)
        self.inputs = inputs
        
    # Backward pass
    def backward(self, dvalues):
        # Gradients on parameters
        self.dweights = np.dot(self.inputs.T, dvalues)
        self.dbiases = np.sum(dvalues, axis = 0, keepdims = True)

        # Gradient on values
        self.dinputs = np.dot(dvalues, self.weights.T)


# ReLU activation
class Activation_ReLU:
    # Forward pass
    def forward(self, inputs):
        # Remember input values
        self.inputs = inputs
        self.output = np.maximum(0, inputs)

    # Backward pass
    def backward(self, dvalues):
        # We are going to modify the gradient array, so make a copy of the incoming values first
        self.dinputs = dvalues.copy()

        # Zero gradient where input values were zero or negative
        self.dinputs[self.inputs <= 0] = 0
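
The formulas used in these `backward` methods can be checked numerically. The sketch below is an illustration under stated assumptions, not part of the chapter's code: it uses plain arrays instead of the classes, random data with an arbitrary fixed seed, and a hypothetical scalar `objective` whose gradient with respect to the ReLU output equals `dvalues`.

```python
import numpy as np

rng = np.random.default_rng(0)          # arbitrary fixed seed
inputs = rng.normal(size=(5, 4))        # 5 samples, 4 inputs
weights = 0.5 * rng.normal(size=(4, 3)) # 4 inputs, 3 neurons
biases = rng.normal(size=(1, 3))
dvalues = rng.normal(size=(5, 3))       # stand-in for the next layer's gradient

def objective(w):
    # Scalar whose gradient w.r.t. the ReLU output is exactly dvalues
    relu_out = np.maximum(0, np.dot(inputs, w) + biases)
    return np.sum(relu_out * dvalues)

# Analytic gradient, same formulas as in the backward methods
layer_output = np.dot(inputs, weights) + biases
drelu = dvalues.copy()
drelu[layer_output <= 0] = 0
dweights = np.dot(inputs.T, drelu)

# Central finite differences, one weight at a time
eps = 1e-6
dweights_numeric = np.zeros_like(weights)
for i in range(weights.shape[0]):
    for j in range(weights.shape[1]):
        w_plus = weights.copy()
        w_plus[i, j] += eps
        w_minus = weights.copy()
        w_minus[i, j] -= eps
        dweights_numeric[i, j] = (objective(w_plus) - objective(w_minus)) / (2 * eps)

print(np.allclose(dweights, dweights_numeric, atol=1e-5))  # True
```

A check like this is slow, so it is only useful for testing a backward pass, not for training.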

## 9.6. Categorical Cross-Entropy loss derivative

As stated in chapter 5, the Categorical Cross-Entropy loss function's formula is:

$$
L_i = - \ \text{log} \ \big(  \hat{y}_{i,k} \big) \qquad \text{where }  k \text{ is an index of "true" probability}
$$

where $L_i$ denotes the sample loss value, $i$ - the $i^{th}$ sample in a set, $k$ - the index of the target label (ground-truth label), $y$ - target values, and $\hat y$ - predicted values.

All we need is the output of the Softmax activation function at the index of the correct class.

To calculate partial derivatives with respect to each of the inputs, we need an equation that takes all of them as parameters, thus the choice to use the full equation.

For the purpose of the derivative calculation, we use the full equation mentioned back in chapter 5:

$$
L_i = - \sum_j y_{i,j} \ \text{log} \ \big(  \hat{y}_{i,j} \big)
$$

where $L_i$ denotes sample loss value, $i^{th}$ sample in a set, $j$ - label/output index, $y$ - target values and $\hat y$ predicted values.

Recall,

$$
\begin{align}
f(x) = \text{log} \big( h(x) \big) \quad \to \quad & f'(x) = \frac{1}{h(x)} \cdot h'(x)\\
f(x) = \text{log} (x) \quad \to \quad & \frac{d}{dx} f(x) = \frac{d}{dx} \text{log} (x) =   \frac{1}{x} \cdot \frac{d}{dx} x = \frac{1}{x} \cdot 1 = \frac{1}{x}
\end{align}
$$

First, let’s define the gradient equation as the partial derivative of the loss function w.r.t each of its inputs.

$$
\frac{∂ L_i}{∂ \hat{y}_{i,j}} = \frac{∂ }{∂ \hat{y}_{i,j}} \Big[  - \sum_j y_{i,j} \ \text{log} \ \big(  \hat{y}_{i,j} \big)  \Big] =   - \sum_j y_{i,j} \cdot \frac{∂ }{∂ \hat{y}_{i,j}}  \text{log} \big(  \hat{y}_{i,j} \big)  =   - \sum_j y_{i,j} \cdot \frac{1}{\hat{y}_{i,j}} \cdot \frac{∂ }{∂ \hat{y}_{i,j}}  \hat{y}_{i,j}  =   - \sum_j y_{i,j} \cdot \frac{1}{\hat{y}_{i,j}} \cdot 1  =  - \sum_j \frac{y_{i,j}}{\hat{y}_{i,j}}  =  - \frac{y_{i,j}}{\hat{y}_{i,j}}
$$

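The sum disappears in the last step because we take the partial derivative with respect to one particular $\hat{y}_{i,j}$: every other term of the sum does not depend on it, so its derivative is $0$. The derivative of this loss function with respect to its inputs (the predicted values at the $i^{th}$ sample, since we are interested in a gradient with respect to the predicted values) therefore equals the negative ground-truth vector, divided element-wise by the vector of the predicted values (which is also the output vector of the softmax function).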
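A small worked example may help here. The sketch below uses hypothetical values for one sample (a one-hot target and a made-up softmax output), computes the gradient as the negated division, and compares it with a finite-difference estimate; the helper `loss` and the step `eps` are assumptions of this illustration.

```python
import numpy as np

y_true = np.array([0., 1., 0.])     # hypothetical one-hot target
y_pred = np.array([0.7, 0.1, 0.2])  # hypothetical softmax output

# Analytic gradient: negative targets divided by predictions
grad = -y_true / y_pred
print(grad)  # only the entry at the target index is non-zero: -1 / 0.1 = -10

def loss(p):
    # Full categorical cross-entropy for a single sample
    return -np.sum(y_true * np.log(p))

# Central finite-difference estimate for each predicted value
eps = 1e-6
numeric = np.array([
    (loss(y_pred + eps * np.eye(3)[j]) - loss(y_pred - eps * np.eye(3)[j])) / (2 * eps)
    for j in range(3)
])
print(np.allclose(grad, numeric, atol=1e-4))  # True
```

The small predicted probability for the correct class produces a large negative gradient, which is the signal that this prediction should increase.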


## 9.7. Categorical Cross-Entropy loss derivative code implementation

Add a backward method to the `Loss_CategoricalCrossentropy` class, and pass the array of predictions and the array of true values into it and calculate the negated division of them:

In [21]:
# Common loss class
class Loss:

    # Calculates the data and regularization losses given model output and ground truth values
    def calculate(self, output, y):
        
        # Calculate sample losses
        sample_losses = self.forward(output, y)
        
        # Calculate mean loss
        data_loss = np.mean(sample_losses)
        
        # Return loss
        
        return data_loss

    
# Cross-entropy loss
class Loss_CategoricalCrossentropy(Loss):
    
    # Forward pass
    def forward(self, y_pred, y_true):
        
        # Number of samples in a batch
        samples = len(y_pred)
        
        # Clip data to prevent division by 0
        # Clip both sides to not drag mean towards any value
        y_pred_clipped = np.clip(y_pred, 1e-7, 1 - 1e-7)
        
        # Probabilities for target values - only if categorical labels
        if len(y_true.shape) == 1:
            correct_confidences = y_pred_clipped[ range(samples), y_true]
        
        # Mask values - only for one-hot encoded labels
        elif len(y_true.shape) == 2:
            correct_confidences = np.sum( y_pred_clipped*y_true, axis=1)
        
        # Losses
        negative_log_likelihoods = -np.log(correct_confidences)
        
        return negative_log_likelihoods

    # Backward pass
    def backward(self, dvalues, y_true):

        # Number of samples
        samples = len(dvalues)

        # Number of labels in every sample
        # We'll use the first sample to count them
        labels = len(dvalues[0])

        # If labels are sparse, turn them into one-hot vector
        if len(y_true.shape) == 1:
            y_true = np.eye(labels)[y_true]

        # Calculate gradient
        self.dinputs = -y_true/dvalues

        # Normalize gradient
        self.dinputs = self.dinputs/samples


Optimizers sum all of the gradients related to each weight and bias before multiplying them by the learning rate (or some other factor). This means that the more samples we have in a dataset, the more gradient sets we’ll receive at this step, and the bigger this sum will become. As a consequence, we’ll have to adjust the learning rate according to each set of samples.

To solve this problem, we can divide all of the gradients by the number of samples. A sum of elements divided by a count of them is their mean value (the optimizer will perform the sum). Thus, we will effectively normalize the gradients and make their sum’s magnitude invariant to the number of samples.
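
The effect of this normalization can be illustrated with a small sketch. The function name `normalized_cce_gradient` and the sample values are hypothetical; the larger batch is simply the same two samples repeated ten times, so the per-sample statistics are identical.

```python
import numpy as np

def normalized_cce_gradient(y_pred, y_true):
    # Negated division, then division by the number of samples
    samples = len(y_pred)
    return (-y_true / y_pred) / samples

# Hypothetical predictions and one-hot targets for two samples
y_pred = np.array([[0.7, 0.1, 0.2],
                   [0.1, 0.5, 0.4]])
y_true = np.array([[1., 0., 0.],
                   [0., 1., 0.]])

small_batch = normalized_cce_gradient(y_pred, y_true)
large_batch = normalized_cce_gradient(np.tile(y_pred, (10, 1)),
                                      np.tile(y_true, (10, 1)))

# Summing over samples (as happens later in the layer's backward pass)
# gives the same magnitude regardless of batch size
print(small_batch.sum(axis=0))
print(large_batch.sum(axis=0))
```

Without the division by `samples`, the second sum would be ten times larger than the first, and the learning rate would have to shrink accordingly.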


## 9.8. Softmax activation derivative

Calculating the partial derivative of the Softmax function is a bit more complicated than calculating the derivative of the Categorical Cross-Entropy loss.

Recall the equation of the Softmax activation function and define the derivative:

$$
S_{i,j} = \frac{e^{z_{i,j}}}{\sum_{l=1}^{L} e^{z_{i,l}}} \quad \to \quad \frac{∂ S_{i,j}}{∂z_{i,k}} = \frac{∂ \frac{e^{z_{i,j}}}{\sum_{l=1}^{L} e^{z_{i,l}}}}{∂z_{i,k}}
$$

where $S_{i,j}$ denotes $j^{th}$ Softmax’s output of $i^{th}$ sample, $z$ - input array which is a list of input vectors (output vectors from the previous layer), $z_{i,j}$ - $j^{th}$ Softmax’s input of $i^{th}$ sample, $L$ - number of inputs, $z_{i,k}$ - $k^{th}$ Softmax’s input of $i^{th}$ sample.
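
Because every output $j$ has a partial derivative with respect to every input $k$, the derivative for one sample is an $L \times L$ array. The sketch below only estimates that array numerically for hypothetical inputs, without using any closed-form result; the helper `softmax`, the input values, and the step `eps` are assumptions of this illustration.

```python
import numpy as np

def softmax(z):
    # Subtracting the maximum is for numerical stability and does not change the result
    exp_values = np.exp(z - np.max(z))
    return exp_values / np.sum(exp_values)

z = np.array([1.0, 2.0, 0.5])  # hypothetical inputs for a single sample
eps = 1e-6
L = len(z)

# jacobian[j, k] estimates the partial derivative of output j w.r.t. input k
jacobian = np.zeros((L, L))
for k in range(L):
    step = np.zeros(L)
    step[k] = eps
    jacobian[:, k] = (softmax(z + step) - softmax(z - step)) / (2 * eps)

print(jacobian.shape)        # (3, 3)
print(jacobian.sum(axis=0))  # each column sums to approximately 0
```

Each column sums to roughly zero because the Softmax outputs always add up to 1, so an increase in one output must be balanced by decreases in the others.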