# CrossEntropyLoss

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Mitchell-Mirano/sorix/blob/main/docs/learn/loss/03-CrossEntropyLoss.ipynb)
[![Open in GitHub](https://img.shields.io/badge/Open%20in-GitHub-black?logo=github)](https://github.com/Mitchell-Mirano/sorix/blob/main/docs/learn/loss/03-CrossEntropyLoss.ipynb)
[![Open in Docs](https://img.shields.io/badge/Open%20in-Docs-blue?logo=readthedocs)](http://127.0.0.1:8000/sorix/learn/loss/03-CrossEntropyLoss)



The **Cross Entropy** loss measures the performance of a classification model whose output is a probability distribution. The goal is to minimize the difference between the predicted distribution and the true distribution.

In Sorix, `CrossEntropyLoss` is implemented for multiclass classification. It expects **raw logits** as input and applies the **Softmax** internally.

The loss is calculated as the mean over $n$ samples:
$$L = - \frac{1}{n} \sum_{i=1}^{n} \sum_{c=1}^{C} y_{i,c} \ln(p_{i,c})$$

Where:
- $n$ is the batch size.
- $C$ is the number of classes.
- $y_{i,c}$ is 1 if class $c$ is the correct label for sample $i$, 0 otherwise.
- $p_{i,c}$ is the predicted probability for class $c$ of sample $i$ (after Softmax).
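
To make the formula concrete, here is a minimal NumPy sketch that evaluates it for a small batch. It is an illustration only, not Sorix's implementation: the helper name `cross_entropy_from_logits` is hypothetical, and targets are assumed to be integer class indices.

```python
import numpy as np

def cross_entropy_from_logits(logits, targets):
    """Mean cross entropy for integer class targets (illustrative helper)."""
    # Shift by the row-wise maximum so the exponentials cannot overflow
    shifted = logits - logits.max(axis=-1, keepdims=True)
    # p[i, c]: Softmax probability of class c for sample i
    p = np.exp(shifted) / np.exp(shifted).sum(axis=-1, keepdims=True)
    n = logits.shape[0]
    # y[i, c] is one-hot, so the inner sum over c keeps only ln p[i, target_i]
    return -np.log(p[np.arange(n), targets]).mean()

logits = np.array([[2.0, 1.0, 0.1], [0.0, 5.0, 0.2]])
targets = np.array([0, 1])
loss_value = cross_entropy_from_logits(logits, targets)
print(f"Mean cross entropy: {loss_value:.4f}")
```

Because $y_{i,c}$ is one-hot, the double sum reduces to picking one log-probability per sample, which is why the sketch indexes `p` instead of building the full one-hot matrix.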

## 1. Internal Softmax Integration

Combining the Softmax activation and the Cross Entropy loss into a single step is a standard practice in deep learning frameworks. The main reasons are:

### Numerical Stability (The Log-Sum-Exp Trick)
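Softmax involves $e^{x_i}$, which overflows for large positive $x_i$. Sorix avoids this by subtracting the maximum logit before exponentiating, the same shift used in the Log-Sum-Exp trick. The result is mathematically unchanged because the factor $e^{-\max(x)}$ cancels between numerator and denominator:
$$\text{Softmax}(x_i) = \frac{e^{x_i - \max(x)}}{\sum_{j} e^{x_j - \max(x)}}$$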
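
The effect of the shift is easy to see with large logits. The sketch below uses plain NumPy and is independent of Sorix internals. The function names are hypothetical and chosen only for this illustration.

```python
import numpy as np

def naive_softmax(x):
    # Direct translation of the definition; exp overflows for large inputs
    e = np.exp(x)
    return e / e.sum()

def stable_softmax(x):
    # Subtracting max(x) makes the largest exponent exactly 0
    e = np.exp(x - np.max(x))
    return e / e.sum()

def stable_log_softmax(x):
    # Log-Sum-Exp form: x_i - (m + ln(sum_j exp(x_j - m))) with m = max(x)
    m = np.max(x)
    return x - (m + np.log(np.sum(np.exp(x - m))))

x = np.array([1000.0, 1001.0, 1002.0])
with np.errstate(over="ignore", invalid="ignore"):
    naive = naive_softmax(x)  # exp(1000) is inf in float64, so inf / inf gives nan
stable = stable_softmax(x)
log_probs = stable_log_softmax(x)

print("naive: ", naive)
print("stable:", stable)
print("log-probabilities:", log_probs)
```

The naive version returns only `nan` values for these inputs, while the shifted version gives the same probabilities as it would for the logits `[0, 1, 2]`. Computing log-probabilities directly also avoids taking the logarithm of a probability that has underflowed to zero.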

### Computational Efficiency
The derivative of the combined Softmax + Cross Entropy has a simple closed form.

For a single sample with loss $\ell = -\ln(\text{Softmax}(x)_k)$, where $k$ is the correct class, the derivative with respect to logit $x_i$ is $P_i - Y_i$. Because the loss $L$ is averaged over the batch, each sample's gradient is scaled by $1/n$:
$$\frac{\partial L}{\partial x_i} = \frac{1}{n}(P_i - Y_i)$$

Where:
- $x_i$ is the logit for class $i$ of a given sample.
- $P_i$ is the predicted probability for class $i$.
- $Y_i$ is 1 if class $i$ is the target, 0 otherwise.
- $n$ is the batch size.

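This means the gradient is just the difference between the predicted probabilities and the one-hot target, scaled by $1/n$. It is cheap to calculate and numerically robust, because no separate backward pass through the logarithm and the exponentials is needed.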
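
The closed form can be checked without any autograd engine by comparing it to central finite differences. The following is a sketch in plain NumPy. The helper names, the random logits, and the step size `eps` are assumptions made for this illustration.

```python
import numpy as np

def mean_cross_entropy(logits, targets):
    # Stable log-softmax, then the mean negative log-probability of the targets
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()

def analytic_grad(logits, targets):
    # Closed form: (P - Y) / n
    shifted = logits - logits.max(axis=-1, keepdims=True)
    probs = np.exp(shifted) / np.exp(shifted).sum(axis=-1, keepdims=True)
    one_hot = np.zeros_like(probs)
    one_hot[np.arange(len(targets)), targets] = 1.0
    return (probs - one_hot) / len(targets)

def numeric_grad(logits, targets, eps=1e-6):
    # Central differences, perturbing one logit at a time
    grad = np.zeros_like(logits)
    for idx in np.ndindex(*logits.shape):
        plus, minus = logits.copy(), logits.copy()
        plus[idx] += eps
        minus[idx] -= eps
        diff = mean_cross_entropy(plus, targets) - mean_cross_entropy(minus, targets)
        grad[idx] = diff / (2 * eps)
    return grad

rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 3))
targets = np.array([0, 2, 1, 2])

max_abs_diff = np.abs(analytic_grad(logits, targets) - numeric_grad(logits, targets)).max()
print(f"Largest absolute difference: {max_abs_diff:.2e}")
```

Each row of the analytic gradient sums to zero, since both the probabilities and the one-hot target sum to one.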

```python
# Uncomment the next line and run this cell to install sorix
#!pip install 'sorix @ git+https://github.com/Mitchell-Mirano/sorix.git@main'
```

```python
import numpy as np
from sorix import tensor
from sorix.nn import CrossEntropyLoss

# Create logits for 3 classes
logits = tensor([[2.0, 1.0, 0.1], [0.0, 5.0, 0.2]], requires_grad=True)
targets = tensor([0, 1]) # Class 0 for sample 1, class 1 for sample 2

criterion = CrossEntropyLoss()
loss = criterion(logits, targets)

print(f"Logits sample 1: {logits.numpy()[0]} (Highest is label 0)")
print(f"Logits sample 2: {logits.numpy()[1]} (Highest is label 1)")
print(f"Cross Entropy Loss: {loss.item():.4f}")
```

```
Logits sample 1: [2.  1.  0.1] (Highest is label 0)
Logits sample 2: [0.  5.  0.2] (Highest is label 1)
Cross Entropy Loss: 0.2159
```


### Gradient Verification

As mentioned before, the gradient is just $(P - Y) / n$. Let's verify this in Sorix.

```python
loss.backward()
print(f"Gradients w.r.t logits:\n{logits.grad}")

# Manual verification: dL/d_logits = 1/n * (softmax(logits) - targets_one_hot)
batch_size = logits.data.shape[0]
exp_logits = np.exp(logits.data - np.max(logits.data, axis=-1, keepdims=True))
probs = exp_logits / np.sum(exp_logits, axis=-1, keepdims=True)

Y_one_hot = np.zeros_like(probs)
Y_one_hot[np.arange(batch_size), targets.data.flatten().astype(int)] = 1

manual_grad = (probs - Y_one_hot) / batch_size
print(f"\nManual Gradients (P - Y) / n:\n{manual_grad}")
```

```
Gradients w.r.t logits:
[[-0.17049944  0.12121648  0.04928295]
 [ 0.00331929 -0.00737348  0.00405419]]

Manual Gradients (P - Y) / n:
[[-0.17049944  0.12121648  0.04928295]
 [ 0.00331929 -0.00737348  0.00405419]]
```


### Training Example

Let's see how `CrossEntropyLoss` helps a simple layer identify the correct class.

```python
from sorix.optim import SGD
from sorix.nn import Linear

x = tensor([[1.0, 0.0, 0.0]]) # Input data
target = tensor([2]) # We want it to be class 2

model = Linear(3, 3)
optimizer = SGD(model.parameters(), lr=0.1)

print(f"Initial raw scores: {model(x).numpy()}")

for i in range(51):
    y_pred = model(x)
    loss = criterion(y_pred, target)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    if i % 10 == 0:
        # The score for index 2 should increase
        print(f"Step {i:2d} | Loss: {loss.item():.4f} | Output: {y_pred.numpy().flatten()}")

print(f"\nFinal output: {model(x).numpy().flatten()}")
```

```
Initial raw scores: [[-1.129986    1.8432822  -0.88651246]]
Step  0 | Loss: 2.8399 | Output: [-1.129986    1.8432822  -0.88651246]
Step 10 | Loss: 0.6802 | Output: [-1.2658902   0.4506843   0.64198977]
Step 20 | Loss: 0.2603 | Output: [-1.392698   -0.12021631  1.3396983 ]
Step 30 | Loss: 0.1513 | Output: [-1.4783971  -0.39979756  1.7049787 ]
Step 40 | Loss: 0.1052 | Output: [-1.5416893  -0.57646394  1.944937  ]
Step 50 | Loss: 0.0803 | Output: [-1.5918877  -0.70369416  2.1223652 ]

Final output: [-1.5963862 -0.714629   2.1377985]
```
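
The values printed above are raw logits, not probabilities. As a small follow-up sketch in plain NumPy (not part of the Sorix API), applying Softmax to the final output shown above gives the class probabilities and the predicted class:

```python
import numpy as np

# Final logits copied from the training run above
final_logits = np.array([-1.5963862, -0.714629, 2.1377985])

# Stable Softmax: shift by the maximum before exponentiating
shifted = final_logits - final_logits.max()
probs = np.exp(shifted) / np.exp(shifted).sum()

predicted_class = int(np.argmax(probs))
print(f"Probabilities: {probs}")
print(f"Predicted class: {predicted_class}")
```

For this run, the probability assigned to class 2 comes out at roughly 0.92, which is consistent with the loss falling below 0.1. Since `Linear` is randomly initialized, the exact numbers will differ between runs.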
