# Probability, Entropy, Cross-Entropy

* [Artem: The Key Equation Behind Probability](https://www.youtube.com/watch?v=KHVR587oW8I)
* [StatQuest: Neural Networks Part 6: Cross Entropy](https://www.youtube.com/watch?v=6ArSys5qHAU)
* [StatQuest: Neural Networks Part 7: Cross Entropy Derivatives and Backpropagation](https://www.youtube.com/watch?v=xBEh66V9gZo)


# Loss Functions

* https://www.digitalocean.com/community/tutorials/pytorch-loss-functions
* https://keras.io/api/losses/
* https://docs.pytorch.org/docs/stable/nn.html#loss-functions

Loss functions measure how far model predictions are from the true values. The model learns by minimizing this loss during training.

$$\text{Loss} = f(y_{true}, y_{pred})$$

---

## Entropy & Cross-Entropy: The Foundation

### The Origin: Shannon's Information Theory (1948)

Claude Shannon asked: **How do we measure information?**

The key insight is that information is related to **surprise**. "The sun rose today" isn't informative—you expected it. "A meteor hit New York" is highly informative because it's unexpected.

Shannon formalized this. The **information content** of an event with probability $p$ should satisfy:

1. **Rare events carry more information** — $I(p)$ decreases as $p$ increases
2. **Certain events carry no information** — $I(1) = 0$
3. **Independent events add up** — $I(p_1 \cdot p_2) = I(p_1) + I(p_2)$

Up to the choice of base (a constant factor), the **only continuous function** satisfying all three is the negative logarithm:

$$I(x) = -\log(p(x)) = \log\left(\frac{1}{p(x)}\right)$$

This is called **self-information** or **surprisal**, measured in **bits** (log base 2) or **nats** (natural log).

| Event | Probability | Information |
|-------|-------------|-------------|
| Fair coin heads | $p = 0.5$ | $-\log_2(0.5) = 1$ bit |
| Die rolls 6 | $p = 1/6$ | $-\log_2(1/6) \approx 2.58$ bits |
| Certain event | $p = 1$ | $-\log_2(1) = 0$ bits |
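
The table values can be reproduced in a few lines of Python. This is a minimal sketch; the helper name `self_information` is introduced here for illustration only.

```python
import math

def self_information(p: float, base: float = 2.0) -> float:
    """Surprisal of an event with probability p (bits for base 2, nats for base e)."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    # log(1/p) instead of -log(p) so that p = 1 gives 0.0 rather than -0.0
    return math.log(1.0 / p) / math.log(base)

for name, p in [("fair coin heads", 0.5), ("die rolls 6", 1 / 6), ("certain event", 1.0)]:
    print(f"{name}: {self_information(p):.2f} bits")

# Independent events add up: I(p1 * p2) == I(p1) + I(p2)
print(math.isclose(self_information(0.5 * 0.25),
                   self_information(0.5) + self_information(0.25)))
```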

---

### Entropy: Average Surprise

**Entropy** is the *expected* information—the average surprise when sampling from a distribution:

$$H(P) = \mathbb{E}_{x \sim P}[I(x)] = -\sum_{x} p(x) \log p(x)$$

**Intuition 1: Minimum bits to encode messages**

Entropy answers: *What's the minimum average number of bits needed to encode symbols from this distribution?*

| Distribution | Entropy | Interpretation |
|--------------|---------|----------------|
| Fair coin (50/50) | 1 bit | Maximum uncertainty |
| Biased coin (99/1) | 0.08 bits | Almost certain |
| Uniform over 8 items | 3 bits | $\log_2(8) = 3$ |

**Intuition 2: Uncertainty/Disorder**

- **High entropy** = flat, uniform, maximum uncertainty
- **Low entropy** = peaked, concentrated, predictable
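
As a rough check on the entropy table above, the definition translates directly into code. The function name `entropy` and the convention of skipping zero-probability outcomes are choices made for this sketch.

```python
import math

def entropy(probs, base=2.0):
    """Shannon entropy of a discrete distribution.

    Outcomes with p = 0 are skipped, following the convention 0 * log(0) = 0.
    """
    return sum(p * math.log(1.0 / p) for p in probs if p > 0) / math.log(base)

print(f"fair coin:     {entropy([0.5, 0.5]):.2f} bits")
print(f"biased coin:   {entropy([0.99, 0.01]):.2f} bits")
print(f"uniform over 8: {entropy([1 / 8] * 8):.2f} bits")
```

The peaked 99/1 coin lands near zero, while the flat distributions reach the maximum $\log_2(n)$ for their number of outcomes.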

---

### Cross-Entropy: Encoding with the Wrong Distribution

Suppose the true distribution is $P$, but you design your encoding based on distribution $Q$. The **cross-entropy** is:

$$H(P, Q) = -\sum_{x} p(x) \log q(x)$$

This is the **expected bits needed to encode samples from $P$ using a code optimized for $Q$**.

**The critical relationship:**

$$H(P, Q) = H(P) + D_{KL}(P \| Q)$$

Where $D_{KL}$ is **KL-divergence** (the penalty for using the wrong distribution):

$$D_{KL}(P \| Q) = \sum_{x} p(x) \log \frac{p(x)}{q(x)} \geq 0$$

Since $D_{KL} \geq 0$, we have $H(P, Q) \geq H(P)$, with equality only when $P = Q$.
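
The decomposition can be verified numerically. The two distributions below are made-up illustrative values, and the function names are not from any library.

```python
import math

def cross_entropy(p, q):
    """H(P, Q) in bits; assumes q[i] > 0 wherever p[i] > 0."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p, q):
    """D_KL(P || Q) in bits, under the same assumption."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.25, 0.25]  # "true" distribution (illustrative)
q = [0.8, 0.1, 0.1]    # distribution the code was designed for (illustrative)

h_p = cross_entropy(p, p)  # H(P, P) is just H(P)
h_pq = cross_entropy(p, q)
kl = kl_divergence(p, q)

print(f"H(P)        = {h_p:.4f}")
print(f"H(P, Q)     = {h_pq:.4f}")
print(f"H(P) + D_KL = {h_p + kl:.4f}")
```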

---

### Why Cross-Entropy for Classification?

In classification:
- **True distribution $P$**: one-hot labels (e.g., $[0, 1, 0]$)
- **Predicted distribution $Q$**: softmax outputs (e.g., $[0.1, 0.7, 0.2]$)

Since $y$ is one-hot, cross-entropy simplifies to:

$$\mathcal{L} = -\sum_{c} y_c \log(\hat{y}_c) = -\log(\hat{y}_{true})$$

**Why it works:**

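1. **Penalizes confident wrong predictions harshly**: using the natural log, predicting 0.01 for the true class gives loss $-\ln(0.01) \approx 4.6$, while predicting 0.99 gives $-\ln(0.99) \approx 0.01$

2. **Simple gradient**: for softmax + cross-entropy, $\frac{\partial \mathcal{L}}{\partial z_i} = \hat{y}_i - y_i$, where $z_i$ are the logits fed to softmax

3. **Maximum likelihood**: minimizing cross-entropy is equivalent to maximizing the log-likelihood of the labels under the model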
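
The gradient identity $\hat{y}_i - y_i$ can be sanity-checked against finite differences. This is a sketch with arbitrary example logits; the helper names are hypothetical.

```python
import math

def softmax(z):
    m = max(z)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def ce_loss(z, true_idx):
    """Cross-entropy of softmax(z) against a one-hot label at true_idx."""
    return -math.log(softmax(z)[true_idx])

def numerical_grad(z, true_idx, eps=1e-6):
    """Central finite differences of the loss with respect to each logit."""
    grads = []
    for i in range(len(z)):
        up = z[:i] + [z[i] + eps] + z[i + 1:]
        down = z[:i] + [z[i] - eps] + z[i + 1:]
        grads.append((ce_loss(up, true_idx) - ce_loss(down, true_idx)) / (2 * eps))
    return grads

z = [1.0, 2.0, 0.5]  # example logits
true_idx = 1         # index of the true class

y_hat = softmax(z)
analytic = [y_hat[i] - (1.0 if i == true_idx else 0.0) for i in range(len(z))]
numeric = numerical_grad(z, true_idx)

for a, n in zip(analytic, numeric):
    print(f"analytic {a:+.6f}   numeric {n:+.6f}")
```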

---

### Visual Summary

```
                    Information Theory
                           │
              ┌────────────┴────────────┐
              │                         │
         Self-Information           Entropy
         I(x) = -log p(x)       H(P) = E[-log p(x)]
         "surprise of event"    "average surprise"
                                       │
              ┌────────────────────────┴────────────────────────┐
              │                                                 │
        Cross-Entropy                                    KL-Divergence
    H(P,Q) = E_P[-log q(x)]                          D_KL(P||Q) = H(P,Q) - H(P)
    "bits using wrong code"                          "penalty for wrong code"
              │
              │
    Classification Loss
    L = -log(predicted prob of true class)
```

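**The punchline:** Cross-entropy measures how many bits you spend, on average, encoding reality with your model's distribution. With one-hot labels, $H(P) = 0$, so the entire loss is the *extra* cost $D_{KL}(P \| Q)$ of believing the model instead of the true distribution. Minimizing it pulls the model's beliefs toward reality.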

---

## Classification Losses

### Cross-Entropy Loss (Log Loss)

The most common loss for classification. Penalizes confident wrong predictions heavily.

**Binary Cross-Entropy** — for binary classification (0 or 1):
$$\mathcal{L} = -\frac{1}{N}\sum_{i=1}^{N} \left[ y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i) \right]$$
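
A direct translation of the binary formula, as a minimal sketch. The clipping constant `eps` is an assumption added here to avoid `log(0)`; it is not part of the formula itself.

```python
import math

def binary_cross_entropy(y_true, y_prob, eps=1e-12):
    """Mean binary cross-entropy over N examples (natural log)."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # keep p strictly inside (0, 1)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(y_true)

print(binary_cross_entropy([1, 0, 1], [0.9, 0.1, 0.8]))  # mostly right: small loss
print(binary_cross_entropy([1, 0, 1], [0.1, 0.9, 0.2]))  # confidently wrong: large loss
```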

**Categorical Cross-Entropy** — for multi-class (one-hot encoded labels):
$$\mathcal{L} = -\sum_{c=1}^{C} y_c \log(\hat{y}_c)$$

**Sparse Categorical Cross-Entropy** — mathematically identical, but labels are integers instead of one-hot vectors.

| Variant | Label Format | Example (class index 2 of 5, zero-based) |
|---------|--------------|------------------------|
| Categorical | One-hot vector | `[0, 0, 1, 0, 0]` |
| Sparse Categorical | Integer index | `2` |

**Why "sparse"?** One-hot vectors are sparse (mostly zeros). Instead of storing/computing with sparse vectors, just store the index of the 1. For 1000 classes, you store 1 integer vs 1000 floats.

**The math is the same:** Since the one-hot $y$ has only one non-zero entry at position $k$:
$$\mathcal{L} = -\sum_{c=1}^{C} y_c \log(\hat{y}_c) = -\log(\hat{y}_k)$$

The sparse version uses $k$ directly, without creating the one-hot vector.
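
The equivalence is easy to see side by side. Both functions below are illustrative stand-ins written for this note, not the Keras implementations, and the probabilities are made up.

```python
import math

def categorical_ce(one_hot, probs):
    """Cross-entropy with a one-hot label vector."""
    return -sum(y * math.log(p) for y, p in zip(one_hot, probs) if y > 0)

def sparse_categorical_ce(label, probs):
    """Cross-entropy with an integer class index."""
    return -math.log(probs[label])

probs = [0.1, 0.1, 0.6, 0.1, 0.1]  # predicted distribution over 5 classes

dense = categorical_ce([0, 0, 1, 0, 0], probs)
sparse = sparse_categorical_ce(2, probs)
print(dense, sparse)  # the two values agree
```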

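| PyTorch | Keras | Use Case |
|---------|-------|----------|
| `nn.BCELoss()` | `BinaryCrossentropy()` | Binary, inputs are probabilities (after sigmoid) |
| `nn.BCEWithLogitsLoss()` | `BinaryCrossentropy(from_logits=True)` | Binary, inputs are raw logits |
| `nn.CrossEntropyLoss()` | `SparseCategoricalCrossentropy(from_logits=True)` | Multi-class, raw logits, integer labels |
| `nn.CrossEntropyLoss()` with class-probability targets | `CategoricalCrossentropy(from_logits=True)` | Multi-class, raw logits, one-hot labels |
| `nn.NLLLoss()` | No direct equivalent (closest: `SparseCategoricalCrossentropy()`, which expects probabilities rather than log-probabilities) | Multi-class, inputs are log-probabilities (after `log_softmax`), integer labels |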

> **Note:** PyTorch's `CrossEntropyLoss` combines `log_softmax` + `NLLLoss` internally, so don't apply softmax to your model output!
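
To make the note concrete, here is a pure-Python sketch of what "`log_softmax` + NLL" computes, and what happens if softmax is applied first by mistake. It mimics the behaviour described above rather than calling PyTorch; all function names are local to this example.

```python
import math

def log_softmax(z):
    m = max(z)
    log_sum_exp = m + math.log(sum(math.exp(v - m) for v in z))
    return [v - log_sum_exp for v in z]

def nll(log_probs, target):
    """Negative log-likelihood of the target class."""
    return -log_probs[target]

def cross_entropy_from_logits(z, target):
    """log_softmax followed by NLL, i.e. a loss that expects raw logits."""
    return nll(log_softmax(z), target)

logits = [2.0, 1.0, 0.1]
target = 0

correct = cross_entropy_from_logits(logits, target)

# Mistake: softmax is applied in the model, then the loss applies it again.
probs = [math.exp(lp) for lp in log_softmax(logits)]
double = cross_entropy_from_logits(probs, target)

print(f"logits passed directly: {correct:.4f}")
print(f"softmax applied twice:  {double:.4f}")
```

In this example the double-softmax loss is larger even though the prediction is the same, because probabilities all lie in $[0, 1]$ and the second softmax flattens them.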

---

### Focal Loss

Addresses class imbalance by down-weighting easy examples.

$$\mathcal{L} = -\alpha (1 - \hat{y})^\gamma \log(\hat{y})$$

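Where $\hat{y}$ is the predicted probability of the true class, $\gamma \geq 0$ (focusing parameter) shrinks the loss for well-classified examples, and $\alpha$ is a class-balancing weight. Setting $\gamma = 0$ recovers $\alpha$-weighted cross-entropy.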

| PyTorch | Keras |
|---------|-------|
| `torchvision.ops.sigmoid_focal_loss` | `BinaryFocalCrossentropy()` / `CategoricalFocalCrossentropy()` |
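
The down-weighting effect is easiest to see per example. In this sketch `p_true` stands for the predicted probability of the true class, and the defaults `alpha=0.25`, `gamma=2.0` are commonly used values chosen here for illustration.

```python
import math

def focal_loss(p_true, alpha=0.25, gamma=2.0):
    """Focal loss for a single example (natural log)."""
    return -alpha * (1.0 - p_true) ** gamma * math.log(p_true)

def cross_entropy(p_true):
    return -math.log(p_true)

for p in (0.9, 0.5, 0.1):  # easy, ambiguous, hard example
    ce = cross_entropy(p)
    fl = focal_loss(p, alpha=1.0)  # alpha = 1 isolates the effect of gamma
    print(f"p={p:.1f}  CE={ce:.4f}  focal={fl:.4f}  ratio={fl / ce:.2f}")
```

The ratio column is just $(1 - \hat{y})^\gamma$: the easy example keeps about 1% of its cross-entropy loss, while the hard one keeps 81%.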

---

### Hinge Loss

Used in SVMs and "maximum-margin" classification. Labels are $y \in \{-1, +1\}$ and $\hat{y}$ is the raw model score, not a probability.

$$\mathcal{L} = \max(0, 1 - y \cdot \hat{y})$$

| PyTorch | Keras |
|---------|-------|
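| `nn.MultiMarginLoss()` (multi-class hinge; `nn.HingeEmbeddingLoss()` is a different, distance-based loss for similarity learning) | `Hinge()` / `SquaredHinge()` / `CategoricalHinge()` |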
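
A minimal sketch of the binary hinge formula, assuming labels in $\{-1, +1\}$ and a raw, unsquashed score. The function name is made up for this example.

```python
def hinge_loss(y, score):
    """Hinge loss for one example; y is -1 or +1, score is the raw model output."""
    return max(0.0, 1.0 - y * score)

print(hinge_loss(+1, 2.0))   # correct and outside the margin: no loss
print(hinge_loss(+1, 0.3))   # correct but inside the margin: small loss
print(hinge_loss(-1, 0.3))   # wrong side of the boundary: larger loss
```

Unlike cross-entropy, the loss is exactly zero once an example is classified correctly with margin at least 1, so those examples stop contributing gradient.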

---

## Regression Losses

### Mean Squared Error (MSE / L2 Loss)

Penalizes large errors more due to squaring. Sensitive to outliers.

$$\mathcal{L} = \frac{1}{N}\sum_{i=1}^{N}(y_i - \hat{y}_i)^2$$

| PyTorch | Keras |
|---------|-------|
| `nn.MSELoss()` | `MeanSquaredError()` |

---

### Mean Absolute Error (MAE / L1 Loss)

More robust to outliers than MSE.

$$\mathcal{L} = \frac{1}{N}\sum_{i=1}^{N}|y_i - \hat{y}_i|$$

| PyTorch | Keras |
|---------|-------|
| `nn.L1Loss()` | `MeanAbsoluteError()` |
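
The difference in outlier sensitivity between MSE and MAE shows up clearly on a toy example. The data below is invented for illustration.

```python
def mse(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def mae(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

y_true = [1.0, 2.0, 3.0, 4.0]
clean = [1.5, 2.5, 3.5, 4.5]          # every prediction is off by 0.5
with_outlier = [1.5, 2.5, 3.5, 14.0]  # last prediction is off by 10

print(mse(y_true, clean), mae(y_true, clean))                # 0.25 0.5
print(mse(y_true, with_outlier), mae(y_true, with_outlier))  # 25.1875 2.875
```

One bad prediction multiplies MSE by roughly 100 but MAE by less than 6.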

---

### Huber Loss (Smooth L1)

Combines MSE and MAE: quadratic for small errors (smooth gradient near zero), linear for large errors (robust to outliers). The threshold $\delta$ sets where the loss switches between the two.

$$\mathcal{L} = \begin{cases} \frac{1}{2}(y - \hat{y})^2 & \text{if } |y - \hat{y}| \leq \delta \\ \delta|y - \hat{y}| - \frac{1}{2}\delta^2 & \text{otherwise} \end{cases}$$

| PyTorch | Keras |
|---------|-------|
| `nn.HuberLoss()` (`nn.SmoothL1Loss()` matches it when `beta = delta = 1`) | `Huber()` |
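
The piecewise definition maps onto a short function. This is a per-example sketch with `delta=1.0` picked as an illustrative default.

```python
def huber(y, y_hat, delta=1.0):
    """Huber loss for one example: quadratic within delta, linear beyond it."""
    err = abs(y - y_hat)
    if err <= delta:
        return 0.5 * err ** 2
    return delta * err - 0.5 * delta ** 2

print(huber(0.0, 0.5))   # small error, quadratic branch: 0.125
print(huber(0.0, 3.0))   # large error, linear branch: 2.5
print(huber(0.0, 1.0))   # at the threshold both branches give 0.5
```

The $-\frac{1}{2}\delta^2$ term is what makes the two branches meet at $|y - \hat{y}| = \delta$, so the loss has no jump there.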

---

### Cosine Similarity Loss

Measures angle between vectors (ignores magnitude). Useful for embeddings.

$$\mathcal{L} = 1 - \cos(\theta) = 1 - \frac{y \cdot \hat{y}}{\|y\| \, \|\hat{y}\|}$$

| PyTorch | Keras |
|---------|-------|
| `nn.CosineEmbeddingLoss()` (with target `1`) | `CosineSimilarity()` (returns $-\cos(\theta)$ rather than $1 - \cos(\theta)$) |
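
A small sketch of the $1 - \cos(\theta)$ form, showing that only direction matters. It assumes neither vector is all zeros; the function name is local to this example.

```python
import math

def cosine_loss(a, b):
    """1 - cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

print(cosine_loss([1.0, 0.0], [5.0, 0.0]))    # same direction, different length
print(cosine_loss([1.0, 0.0], [0.0, 3.0]))    # orthogonal
print(cosine_loss([1.0, 0.0], [-2.0, 0.0]))   # opposite direction
```

The loss ranges from 0 (same direction) through 1 (orthogonal) to 2 (opposite), regardless of vector magnitudes.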

---

## Other Notable Losses

| Loss | Use Case | PyTorch |
|------|----------|---------|
| **KL Divergence** | Distribution matching, VAEs | `nn.KLDivLoss()` |
| **Triplet Loss** | Metric learning, face recognition | `nn.TripletMarginLoss()` |
| **CTC Loss** | Sequence labeling without frame-level alignment (speech, OCR) | `nn.CTCLoss()` |
| **Dice Loss** | Image segmentation | Custom (Keras: `Dice()`) |

---

## Quick Reference

| Task | Recommended Loss | Output Activation |
|------|-----------------|-------------------|
| Binary Classification | BCEWithLogitsLoss | None (raw logits) |
| Multi-class Classification | CrossEntropyLoss | None (raw logits) |
| Regression | MSELoss or L1Loss | None |
| Regression (with outliers) | HuberLoss | None |
| Segmentation | Dice + BCE | Sigmoid |

**Sources:** [Keras Losses](https://keras.io/api/losses/) | [PyTorch Loss Functions](https://docs.pytorch.org/docs/stable/nn.html#loss-functions)
