




Derivation of Backpropagation for Multiclass Classification

This document details the step-by-step derivation of the backpropagation equations for a neural network layer performing multiclass classification. We assume a standard setup using Softmax activation and Categorical Cross-Entropy Loss.

1. Notation and Setup

Let's define our variables for a single training example in a classification problem with $N$ classes.

$x$: Input vector of shape $(D \times 1)$, where $D$ is the number of input features.

$W$: Weight matrix of shape $(N \times D)$.

$b$: Bias vector of shape $(N \times 1)$.

$z$: The "logits" or linear output, shape $(N \times 1)$.

$a$: The activation (predicted probabilities) output by Softmax, shape $(N \times 1)$.

$y$: The ground truth label (one-hot encoded vector), shape $(N \times 1)$.

The Forward Pass Equations

Linear Transformation:


$$z_i = \sum_{d=1}^{D} W_{id} x_d + b_i$$


In vector form: $z = Wx + b$

Softmax Activation:


$$a_i = \frac{e^{z_i}}{\sum_{k=1}^{N} e^{z_k}}$$


Note: $a_i$ is often denoted as $\hat{y}_i$.

Categorical Cross-Entropy Loss:


$$L = -\sum_{k=1}^{N} y_k \log(a_k)$$


Since $y$ is a one-hot vector (only one element is 1, the rest are 0), if the true class is $c$, this simplifies to $L = -\log(a_c)$. However, we will use the summation form for the general derivation.
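The three forward-pass equations above can be sketched in plain Python. This is a minimal illustration, not a reference implementation: the helper names (`linear`, `softmax`, `cross_entropy`) and the numeric values are assumptions chosen for the example. Subtracting the largest logit before exponentiating is a common numerical-stability trick that is not part of the derivation; it leaves the Softmax output unchanged.

```python
import math

def linear(W, x, b):
    """z = W x + b, with W given as a list of N rows, each of length D."""
    return [sum(w_id * x_d for w_id, x_d in zip(row, x)) + b_i
            for row, b_i in zip(W, b)]

def softmax(z):
    """a_i = exp(z_i) / sum_k exp(z_k), computed after shifting by max(z)."""
    m = max(z)
    exps = [math.exp(z_i - m) for z_i in z]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(y, a):
    """L = -sum_k y_k log(a_k); terms with y_k == 0 contribute nothing."""
    return -sum(y_k * math.log(a_k) for y_k, a_k in zip(y, a) if y_k != 0)

# Illustrative values (assumed): N = 3 classes, D = 2 features,
# true class at index 1 of the one-hot label.
W = [[0.2, -0.5], [1.0, 0.3], [-0.4, 0.8]]
x = [1.0, 2.0]
b = [0.1, 0.0, -0.1]
y = [0.0, 1.0, 0.0]

z = linear(W, x, b)          # logits
a = softmax(z)               # predicted probabilities
loss = cross_entropy(y, a)   # equals -log(a[1]) for this one-hot label
```

For a one-hot label, `loss` coincides with the simplified form $-\log(a_c)$, which is an easy check that the summation form and the shortcut agree.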

2. The Goal

To update the weights and biases using gradient descent, we need to find the partial derivatives of the Loss $L$ with respect to the weights $W$ and biases $b$:

$$\frac{\partial L}{\partial W} \quad \text{and} \quad \frac{\partial L}{\partial b}$$

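Using the Chain Rule, we can break this down. The expression below is schematic shorthand: each product stands for a sum over the intermediate indices.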

$$\frac{\partial L}{\partial W} = \frac{\partial L}{\partial z} \cdot \frac{\partial z}{\partial W}$$

We will compute this layer by layer, starting from the loss and moving backward.

3. Step-by-Step Derivation

Step A: Derivative of Loss w.r.t. Softmax Output ($a$)

First, we differentiate the loss function $L$ with respect to the $k$-th output activation $a_k$.

$$L = -\sum_{n=1}^{N} y_n \log(a_n)$$

Focusing on a specific class index $k$:

$$\frac{\partial L}{\partial a_k} = -\frac{\partial}{\partial a_k} \left( y_k \log(a_k) + \sum_{n \neq k} y_n \log(a_n) \right)$$

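Here the $a_n$ are treated as independent variables; their coupling through Softmax is handled in Step B. The terms where $n \neq k$ are then constant with respect to $a_k$ and vanish, leaving only the derivative of $y_k \log(a_k)$: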

$$\boxed{\frac{\partial L}{\partial a_k} = -\frac{y_k}{a_k}}$$
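As a quick sanity check on Step A, the sketch below compares $-y_k / a_k$ with a central finite difference of the loss, perturbing one $a_k$ at a time while holding the others fixed. The function names and the sample vectors are assumptions made for illustration; in particular, `a` is just a hand-picked probability vector rather than the output of a real Softmax.

```python
import math

def cross_entropy(y, a):
    """L = -sum_n y_n log(a_n)."""
    return -sum(y_n * math.log(a_n) for y_n, a_n in zip(y, a))

def grad_loss_wrt_a(y, a):
    """Analytic result from Step A: dL/da_k = -y_k / a_k."""
    return [-y_k / a_k for y_k, a_k in zip(y, a)]

def numeric_grad(f, v, eps=1e-6):
    """Central finite difference of f with respect to each entry of v."""
    grad = []
    for k in range(len(v)):
        up = list(v)
        up[k] += eps
        down = list(v)
        down[k] -= eps
        grad.append((f(up) - f(down)) / (2 * eps))
    return grad

# Illustrative values (assumed): true class at index 1.
y = [0.0, 1.0, 0.0]
a = [0.2, 0.5, 0.3]

analytic = grad_loss_wrt_a(y, a)
numeric = numeric_grad(lambda v: cross_entropy(y, v), a)
```

With a one-hot label, only the entry for the true class is nonzero, so the loss is directly sensitive to $a_c$ alone; the other classes enter only through Softmax in the next step.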

Step B: Derivative of Softmax ($a$) w.r.t. Logits ($z$)

This is the trickiest part. The Softmax function relates every output $a_k$ to every input $z_j$ because of the denominator sum.

$$a_k = \frac{e^{z_k}}{\Sigma} \quad \text{where} \quad \Sigma = \sum_{n=1}^{N} e^{z_n}$$

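We need to find $\frac{\partial a_k}{\partial z_j}$. We apply the quotient rule $\left( \frac{u}{v} \right)' = \frac{u'v - uv'}{v^2}$ with $u = e^{z_k}$ and $v = \Sigma$, noting that $\frac{\partial \Sigma}{\partial z_j} = e^{z_j}$. There are two cases to consider.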

Case 1: $k = j$
We are differentiating the output $a_j$ with respect to its own specific input $z_j$.


$$\frac{\partial a_j}{\partial z_j} = \frac{(e^{z_j})(\Sigma) - (e^{z_j})(e^{z_j})}{\Sigma^2}$$

$$= \frac{e^{z_j}}{\Sigma} \cdot \frac{\Sigma - e^{z_j}}{\Sigma} = a_j (1 - a_j)$$

Case 2: $k \neq j$
We are differentiating an output $a_k$ with respect to a different input $z_j$. The numerator $e^{z_k}$ is constant w.r.t. $z_j$, so its derivative is 0.


$$\frac{\partial a_k}{\partial z_j} = \frac{(0)(\Sigma) - (e^{z_k})(e^{z_j})}{\Sigma^2}$$

$$= -\frac{e^{z_k}}{\Sigma} \cdot \frac{e^{z_j}}{\Sigma} = -a_k a_j$$

Summary of Step B:


$$\frac{\partial a_k}{\partial z_j} = \begin{cases} a_j(1 - a_j) & \text{if } k=j \\ -a_k a_j & \text{if } k \neq j \end{cases}$$


This can be written using the Kronecker delta $\delta_{kj}$ (which is 1 if $k=j$ and 0 otherwise):


$$\boxed{\frac{\partial a_k}{\partial z_j} = a_k (\delta_{kj} - a_j)}$$
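The boxed formula defines an $N \times N$ Jacobian matrix. The sketch below builds it directly from $a_k(\delta_{kj} - a_j)$ and compares it with a finite-difference estimate. The helper names and the sample logits are assumptions for illustration only.

```python
import math

def softmax(z):
    """Softmax with the max logit subtracted first for numerical stability."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_jacobian(a):
    """J[k][j] = da_k/dz_j = a_k * (delta_kj - a_j)."""
    n = len(a)
    return [[a[k] * ((1.0 if k == j else 0.0) - a[j]) for j in range(n)]
            for k in range(n)]

def numeric_jacobian(z, eps=1e-6):
    """Central finite-difference estimate of the same Jacobian."""
    n = len(z)
    J = [[0.0] * n for _ in range(n)]
    for j in range(n):
        up = list(z)
        up[j] += eps
        down = list(z)
        down[j] -= eps
        a_up, a_down = softmax(up), softmax(down)
        for k in range(n):
            J[k][j] = (a_up[k] - a_down[k]) / (2 * eps)
    return J

# Illustrative logits (assumed).
z = [-0.7, 1.6, 1.1]
a = softmax(z)
J = softmax_jacobian(a)
J_num = numeric_jacobian(z)
max_err = max(abs(J[k][j] - J_num[k][j]) for k in range(3) for j in range(3))
```

Two properties follow from the formula and are easy to confirm on `J`: the matrix is symmetric, and each column sums to zero because the outputs always sum to 1, so raising one logit must take probability away from the other classes.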

Step C: Derivative of Loss ($L$) w.r.t. Logits ($z$)

Now we combine Step A and Step B using the Chain Rule. We want the sensitivity of the Loss to a specific logit $z_j$. Since $z_j$ affects the Loss through all $N$ Softmax outputs, we must sum gradients from all $a_k$.

$$\frac{\partial L}{\partial z_j} = \sum_{k=1}^{N} \frac{\partial L}{\partial a_k} \cdot \frac{\partial a_k}{\partial z_j}$$

Substitute the results from Steps A and B:

$$\frac{\partial L}{\partial z_j} = \sum_{k=1}^{N} \left( -\frac{y_k}{a_k} \right) \cdot (a_k (\delta_{kj} - a_j))$$

Simplify the term inside the sum. The factors $a_k$ cancel, which is safe because Softmax outputs are strictly positive:

$$\frac{\partial L}{\partial z_j} = -\sum_{k=1}^{N} y_k (\delta_{kj} - a_j)$$

$$\frac{\partial L}{\partial z_j} = -\left( \sum_{k=1}^{N} y_k \delta_{kj} - \sum_{k=1}^{N} y_k a_j \right)$$

Let's analyze the two summation terms:

$\sum_{k=1}^{N} y_k \delta_{kj}$: Since $\delta_{kj}$ is 0 except when $k=j$, this sum collapses to simply $y_j$.

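$\sum_{k=1}^{N} y_k a_j$: $a_j$ is constant w.r.t. the sum index $k$, so we pull it out: $a_j \sum_{k=1}^{N} y_k$. Since $y$ is a one-hot vector, exactly one entry is 1 and the rest are 0, so $\sum_{k=1}^{N} y_k = 1$. Thus, this term is $a_j$. (The same step goes through for any target vector whose entries sum to 1.)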

Putting it back together:


$$\frac{\partial L}{\partial z_j} = -(y_j - a_j)$$

$$\boxed{\frac{\partial L}{\partial z_j} = a_j - y_j}$$

This simple result is one reason Softmax is nearly always paired with Cross-Entropy: the $1/a_k$ factor from the loss cancels the $a_k$ factor from the Softmax derivative, so the error signal is simply the difference between the predicted probability and the true label.
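To tie Steps A through C together, the sketch below computes $\frac{\partial L}{\partial z_j}$ three ways: by the explicit chain-rule sum over $k$, by the closed form $a_j - y_j$, and by finite differences on the loss. All function names and sample values are assumptions made for this example.

```python
import math

def softmax(z):
    """Softmax with the max logit subtracted first for numerical stability."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def loss_from_logits(z, y):
    """Cross-entropy of softmax(z) against the label vector y."""
    a = softmax(z)
    return -sum(y_k * math.log(a_k) for y_k, a_k in zip(y, a) if y_k != 0)

def grad_logits_chain(z, y):
    """Explicit sum over k of (dL/da_k) * (da_k/dz_j), as in Step C."""
    a = softmax(z)
    n = len(z)
    return [sum((-y[k] / a[k]) * a[k] * ((1.0 if k == j else 0.0) - a[j])
                for k in range(n))
            for j in range(n)]

def grad_logits_closed(z, y):
    """Closed form from the boxed result: dL/dz_j = a_j - y_j."""
    return [a_j - y_j for a_j, y_j in zip(softmax(z), y)]

def grad_logits_numeric(z, y, eps=1e-6):
    """Central finite difference of the loss with respect to each logit."""
    grad = []
    for j in range(len(z)):
        up = list(z)
        up[j] += eps
        down = list(z)
        down[j] -= eps
        grad.append((loss_from_logits(up, y) - loss_from_logits(down, y)) / (2 * eps))
    return grad

# Illustrative values (assumed): true class at index 1.
z = [-0.7, 1.6, 1.1]
y = [0.0, 1.0, 0.0]

chain = grad_logits_chain(z, y)
closed = grad_logits_closed(z, y)
numeric = grad_logits_numeric(z, y)
```

The three lists should agree up to floating-point and finite-difference error. The entry for the true class is negative (its logit should increase to lower the loss) and the others are positive, with all entries summing to zero.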