# Gradient Derivation for Logistic Regression


Recall that the cross-entropy loss for multi-class classification is:

$$ L = -\sum_{i=1}^K y_i \log(p_i) $$

where $K$ is the number of classes, $y_i$ is the $i$-th entry of the one-hot encoded true label, and $p_i$ is the predicted probability of class $i$.
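As a quick illustration of the loss formula, here is a minimal NumPy sketch for a single example. The function name `cross_entropy`, the clipping constant `eps`, and the sample vectors are assumptions made for this example and do not come from the text.

```python
import numpy as np

def cross_entropy(y_true, y_pred, eps=1e-12):
    """Cross-entropy L = -sum_i y_i * log(p_i) for a single example."""
    # Clip to avoid log(0); eps is an illustrative choice, not from the text.
    p = np.clip(y_pred, eps, 1.0)
    return -np.sum(y_true * np.log(p))

y_true = np.array([0.0, 1.0, 0.0])  # one-hot: the true class is index 1
y_pred = np.array([0.2, 0.7, 0.1])  # predicted probabilities
loss = cross_entropy(y_true, y_pred)
print(loss)  # equals -log(0.7), since only the true class contributes
```

Because $y$ is one-hot, every term except the one for the true class vanishes, so the loss reduces to $-\log$ of the probability assigned to the correct class.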

## Softmax Function

The softmax function converts logits $z_j$ to probabilities $p_j$:

$$ p_j = \frac{e^{z_j}}{\sum_{k=1}^K e^{z_k}} $$
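The formula above can be sketched in NumPy as follows. Subtracting the largest logit before exponentiating is a common numerical-stability step that is not part of the text; it leaves the result unchanged because the common factor cancels in the ratio. The name `softmax` and the sample logits are illustrative.

```python
import numpy as np

def softmax(z):
    """Softmax of a 1-D array of logits."""
    # Shifting by max(z) does not change the output but prevents
    # overflow in exp() when the logits are large.
    shifted = z - np.max(z)
    exp_z = np.exp(shifted)
    return exp_z / np.sum(exp_z)

z = np.array([2.0, 1.0, 0.1])  # example logits
p = softmax(z)
print(p, p.sum())  # probabilities are positive and sum to 1
```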

## Gradient Derivation


<div class="warning" style='background-color:#E9D8FD; color: #69337A; border: solid #805AD5 4px; border-radius: 4px; padding:0.7em; width:90%'>

### **Calculating the gradient for logistic regression in Python** 

If `y_true` is the one-hot encoded vector of true labels for a multiclass logistic regression problem, and `y_pred` is the vector of predicted (softmax) probabilities, the gradient of the loss with respect to the logits, `dZ`, is simply:

~~~python
# Gradient of the cross-entropy loss with respect to the logits
dZ = y_pred - y_true
~~~

To see why this is so, follow the derivation below.
</div>


The gradient follows from the chain rule together with the derivative of the softmax. We want to find $\frac{\partial L}{\partial z_j}$, the gradient of the loss with respect to the logit $z_j$.

Using the chain rule:

$$\frac{\partial L}{\partial z_j} = \sum_{i=1}^K \frac{\partial L}{\partial p_i} \frac{\partial p_i}{\partial z_j}$$

First, differentiating $L = -\sum_i y_i \log(p_i)$ with respect to $p_i$ gives $\frac{\partial L}{\partial p_i} = -\frac{y_i}{p_i}$.

Now, let's consider $\frac{\partial p_i}{\partial z_j}$:

We start with the softmax function, written for component $i$:

$$p_i = \frac{\exp(z_i)}{\sum_k \exp(z_k)}$$

We want to find $\frac{\partial p_i}{\partial z_j}$ for two cases: when $i = j$ and when $i \neq j$.

### Case 1: $i = j$ (the derivative of $p_i$ with respect to its own logit $z_i$)

1) Let's apply the quotient rule. If $u = \exp(z_i)$ and $v = \sum_k \exp(z_k)$, then $p_i = u / v$ and:

$$\frac{\partial p_i}{\partial z_i} = \frac{v \cdot \frac{\partial u}{\partial z_i} - u \cdot \frac{\partial v}{\partial z_i}}{v^2}$$

2) We have two separate derivatives here:

   - $\frac{\partial u}{\partial z_i} = \exp(z_i)$ (derivative of the numerator)
   - $\frac{\partial v}{\partial z_i} = \exp(z_i)$ (derivative of the denominator, because $\exp(z_i)$ is the only term in the sum that depends on $z_i$)

3) Substituting:

   $$\frac{\partial p_i}{\partial z_i} = \frac{\sum_k \exp(z_k) \cdot \exp(z_i) - \exp(z_i) \cdot \exp(z_i)}{(\sum_k \exp(z_k))^2}$$

4) Factoring out $\frac{\exp(z_i)}{\sum_k \exp(z_k)}$:

   $$\frac{\partial p_i}{\partial z_i} = \frac{\exp(z_i)}{\sum_k \exp(z_k)} \cdot \left(1 - \frac{\exp(z_i)}{\sum_k \exp(z_k)}\right)$$

5) Recognizing $p_i = \frac{\exp(z_i)}{\sum_k \exp(z_k)}$, we get:

   $$\frac{\partial p_i}{\partial z_i} = p_i \cdot (1 - p_i)$$

### Case 2: $i \neq j$ (the derivative of $p_i$ with respect to a different logit $z_j$)

1) Again, let $u = \exp(z_i)$ and $v = \sum_k \exp(z_k)$. But now:

   $$\frac{\partial p_i}{\partial z_j} = \frac{v \cdot \frac{\partial u}{\partial z_j} - u \cdot \frac{\partial v}{\partial z_j}}{v^2}$$

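2) The two derivatives are now:

   - $\frac{\partial u}{\partial z_j} = 0$ (because $z_j$ doesn't appear in the numerator of $p_i$)
   - $\frac{\partial v}{\partial z_j} = \exp(z_j)$ (because $\exp(z_j)$ is the only term in the sum that depends on $z_j$)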

3) Substituting:

   $$\frac{\partial p_i}{\partial z_j} = \frac{0 - \exp(z_i) \cdot \exp(z_j)}{(\sum_k \exp(z_k))^2}$$

4) Factoring:

   $$\frac{\partial p_i}{\partial z_j} = -\frac{\exp(z_i)}{\sum_k \exp(z_k)} \cdot \frac{\exp(z_j)}{\sum_k \exp(z_k)}$$

5) Recognizing $p_i$ and $p_j$, we get:

   $$\frac{\partial p_i}{\partial z_j} = -p_i \cdot p_j$$
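The two cases can be combined into a single Jacobian matrix with entries $J_{ij} = \frac{\partial p_i}{\partial z_j} = p_i(\delta_{ij} - p_j)$, that is, $\mathrm{diag}(p) - p p^\top$. The sketch below builds that matrix and compares it against central finite differences as a sanity check. The helper names, the sample logits, and the step size `h` are assumptions made for this example.

```python
import numpy as np

def softmax(z):
    exp_z = np.exp(z - np.max(z))  # shift by max(z) for numerical stability
    return exp_z / np.sum(exp_z)

def softmax_jacobian(p):
    """J[i, j] = dp_i/dz_j = p_i * (delta_ij - p_j)."""
    # Diagonal: p_i * (1 - p_i)   (Case 1)
    # Off-diagonal: -p_i * p_j    (Case 2)
    return np.diag(p) - np.outer(p, p)

z = np.array([2.0, 1.0, 0.1])  # example logits
p = softmax(z)
J = softmax_jacobian(p)

# Central finite differences: column j approximates dp/dz_j.
h = 1e-6
J_num = np.zeros((z.size, z.size))
for j in range(z.size):
    step = np.zeros_like(z)
    step[j] = h
    J_num[:, j] = (softmax(z + step) - softmax(z - step)) / (2 * h)

max_err = np.max(np.abs(J - J_num))
print(max_err)  # should be tiny if the analytic formulas are right
```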




Putting this together, we split the chain-rule sum into the $i = j$ term (Case 1) and the $i \neq j$ terms (Case 2):

$$\frac{\partial L}{\partial z_j} = -\frac{y_j}{p_j} p_j(1 - p_j) - \sum_{i \neq j} \frac{y_i}{p_i} (-p_i p_j)$$

$$= -y_j(1 - p_j) + \sum_{i \neq j} y_i p_j$$

$$= -y_j + y_j p_j + \sum_{i \neq j} y_i p_j$$

$$= -y_j + \sum_{i=1}^K y_i p_j$$

$$= -y_j + p_j \sum_{i=1}^K y_i$$

Since $y$ is one-hot encoded, $\sum_{i=1}^K y_i = 1$, so:

$$\frac{\partial L}{\partial z_j} = p_j - y_j$$

In Python code, this is exactly `dZ = y_pred - y_true`.
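One way to check the result is to compute the gradient three ways: with the closed form $p - y$, with the explicit chain-rule sum, and with finite differences of the loss. The following is a sketch under assumed sample values; `loss_from_logits`, the logits, and the step size `h` are illustrative and not from the text.

```python
import numpy as np

def softmax(z):
    exp_z = np.exp(z - np.max(z))  # shift by max(z) for numerical stability
    return exp_z / np.sum(exp_z)

def loss_from_logits(z, y_true):
    """Cross-entropy of softmax(z) against a one-hot label vector."""
    return -np.sum(y_true * np.log(softmax(z)))

z = np.array([2.0, 1.0, 0.1])       # example logits
y_true = np.array([0.0, 1.0, 0.0])  # one-hot label, true class is index 1
y_pred = softmax(z)

# 1) Closed form derived above.
dZ = y_pred - y_true

# 2) Explicit chain rule: dL/dz_j = sum_i (dL/dp_i) * (dp_i/dz_j).
dL_dp = -y_true / y_pred
J = np.diag(y_pred) - np.outer(y_pred, y_pred)  # J[i, j] = dp_i/dz_j
dZ_chain = J.T @ dL_dp                          # sums over i for each j

# 3) Central finite differences of the loss.
h = 1e-6
dZ_num = np.zeros_like(z)
for j in range(z.size):
    step = np.zeros_like(z)
    step[j] = h
    dZ_num[j] = (loss_from_logits(z + step, y_true)
                 - loss_from_logits(z - step, y_true)) / (2 * h)

print(dZ, dZ_chain, dZ_num)  # the three vectors should agree closely
```

The entries of `dZ` sum to zero, because both `y_pred` and `y_true` sum to 1.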

## In Matrix Form

When we write `dZ = y_pred - y_true`, we're doing this calculation for all $K$ classes simultaneously. This is the gradient $\nabla_z L = p - y$, where $p$ is the vector of predicted probabilities and $y$ is the one-hot encoded true label vector.
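The same subtraction carries over to a batch of examples when the logits are stacked as rows of a matrix. The sketch below assumes a layout of one example per row, with shape `(N, K)`; the names `softmax_rows`, `Z`, and `labels` and the sample values are hypothetical.

```python
import numpy as np

def softmax_rows(Z):
    """Row-wise softmax for a matrix of logits with shape (N, K)."""
    # keepdims=True keeps the reduced axis so broadcasting works per row.
    Z_shifted = Z - np.max(Z, axis=1, keepdims=True)
    exp_Z = np.exp(Z_shifted)
    return exp_Z / np.sum(exp_Z, axis=1, keepdims=True)

Z = np.array([[2.0, 1.0, 0.1],
              [0.5, 0.5, 3.0]])      # N = 2 examples, K = 3 classes
labels = np.array([1, 2])            # hypothetical integer class labels
Y_true = np.eye(Z.shape[1])[labels]  # one-hot rows, shape (N, K)
Y_pred = softmax_rows(Z)

# Row n holds the gradient of example n's loss w.r.t. its own logits.
dZ = Y_pred - Y_true
print(dZ)
```

If the total loss is defined as the mean over the $N$ examples rather than the sum, `dZ` would additionally be divided by `N`.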

## Conclusion

The expression `y_pred - y_true` is exactly the gradient of the cross-entropy loss with respect to the logits when the probabilities come from a softmax. The derivation shows why the subtraction is so simple: the $\frac{1}{p_i}$ factor from the loss cancels against the $p_i$ factor in the softmax derivative, and the one-hot constraint $\sum_{i=1}^K y_i = 1$ collapses the remaining sum.