# Why MSE Is Not Good for Classification

* With a **threshold function** (0/1), the cost as a function of the parameters has **large flat regions** — the gradient is 0, so optimization stops even if there are still misclassified samples.
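* Even when using a **sigmoid** with **MSE**, the cost surface can still be **uninformative**: the gradient carries a factor $\sigma'(z)=\sigma(z)(1-\sigma(z))$, which vanishes wherever the sigmoid saturates, including for confidently wrong predictions. The surface is also non-convex in the parameters $(w, b)$. The result: convergence depends on initialization and is often slow or unstable.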
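
To make the saturation argument concrete, here is a minimal sketch in plain Python that compares the derivative of the per-sample loss with respect to the logit `z` for MSE and for the cross-entropy loss introduced in the next section. The helper names (`sigmoid`, `grad_mse`, `grad_bce`) and the chosen logits are illustrative assumptions, not part of the original notes.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def grad_mse(z, y):
    # d/dz of (sigmoid(z) - y)^2: the chain rule brings in sigmoid'(z) = p * (1 - p)
    p = sigmoid(z)
    return 2.0 * (p - y) * p * (1.0 - p)

def grad_bce(z, y):
    # d/dz of -[y*log(p) + (1-y)*log(1-p)], which simplifies to p - y
    return sigmoid(z) - y

y = 1.0  # true label; a very negative logit is then a confidently wrong prediction
for z in (-10.0, -5.0, 0.0):
    print(f"z={z:6.1f}  dMSE/dz={grad_mse(z, y):+.6f}  dBCE/dz={grad_bce(z, y):+.6f}")
```

For the most negative logit the MSE derivative is close to zero even though the prediction is wrong, while the cross-entropy derivative stays close to -1.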

---

# From MLE to Cross-Entropy (CE)

1. **Model**: for logistic regression
   $p_\theta(y=1\mid x)=\sigma(w^\top x+b)$
   $p_\theta(y=0\mid x)=1-\sigma(w^\top x+b)$
2. **Likelihood** on a dataset $\{(x_n,y_n)\}_{n=1}^{N}$:
   $\mathcal{L}(\theta)=\prod_n p_\theta(y_n\mid x_n)$
3. **Log-likelihood**, writing $\hat{p}_n=\sigma(w^\top x_n+b)$:
   $\log\mathcal{L}=\sum_n [y_n\log \hat{p}_n + (1-y_n)\log(1-\hat{p}_n)]$
4. **From maximization to minimization**: minimize the **negative log-likelihood** → **Binary Cross-Entropy**
   $$
   \mathcal{J}(\theta)= -\frac{1}{N}\sum_n [y_n\log \hat{p}_n + (1-y_n)\log(1-\hat{p}_n)]
   $$
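   This loss is **smooth** and, for logistic regression, **convex** in $(w, b)$. Its gradient with respect to each logit is the prediction minus the label, so the gradient stays **useful** (large) exactly when a prediction is confidently wrong.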
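
The chain from likelihood to cross-entropy can be checked numerically. The sketch below builds the per-sample likelihood by hand, takes the negative mean log, and compares it with PyTorch's `F.binary_cross_entropy_with_logits`. The random logits stand in for the linear model's output and are purely illustrative.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(8)                  # stand-in for the linear outputs, one per sample
y = torch.randint(0, 2, (8,)).float()    # labels in {0, 1}

p_hat = torch.sigmoid(logits)
# Per-sample likelihood p(y_n | x_n): p_hat where y_n = 1, otherwise 1 - p_hat
likelihood = torch.where(y == 1, p_hat, 1 - p_hat)
nll = -torch.log(likelihood).mean()      # negative log-likelihood, averaged over N

bce = F.binary_cross_entropy_with_logits(logits, y)
print(nll.item(), bce.item())            # the two values should agree up to rounding
```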

---

# Geometric Intuition

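* With a threshold: the cost changes in “steps” → flat regions with no gradient. With sigmoid + MSE the steps are smoothed out, but near-flat plateaus remain wherever the sigmoid saturates.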
* With **sigmoid + CE**: the curve/surface is **smooth**; shifting the boundary slightly changes the cost → the optimizer “feels” which way to move.
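
A small experiment illustrates the difference. The sketch below fixes the weight at 1, sweeps only the bias `b` (which shifts the decision boundary), and evaluates a 0/1 threshold cost and the cross-entropy cost on a hypothetical six-point dataset. The data values and function names are made up for illustration.

```python
import math

# Hypothetical 1D toy data: (feature, label)
data = [(-2.0, 0), (-1.0, 0), (-0.3, 0), (0.8, 1), (1.5, 1), (2.5, 1)]

def zero_one_cost(b):
    # Threshold classifier: predict 1 when x + b > 0; cost = fraction misclassified
    return sum(int((x + b > 0) != bool(y)) for x, y in data) / len(data)

def ce_cost(b):
    # Binary cross-entropy with p = sigmoid(x + b)
    total = 0.0
    for x, y in data:
        p = 1.0 / (1.0 + math.exp(-(x + b)))
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(data)

biases = [i / 10 for i in range(-30, 31)]
zero_one = [zero_one_cost(b) for b in biases]
ce = [ce_cost(b) for b in biases]

print("distinct 0/1 cost values:", len(set(zero_one)))
print("distinct CE cost values: ", len(set(ce)))
```

The threshold cost can only take as many values as there are possible error counts, so neighbouring biases usually give identical cost. The cross-entropy cost changes with every small shift of the boundary.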

---

# PyTorch Implementation (Binary)

* **Model** (1D → 1D) and **loss** options:

  * **Option A (recommended)** – output **logits** (no sigmoid in `forward`) + `nn.BCEWithLogitsLoss()`
  * **Option B** – output includes **sigmoid** + `nn.BCELoss()` (less numerically stable)

* **Optimizer**: typically SGD or Adam

* **Training loop**: `forward → loss → optimizer.zero_grad() → loss.backward() → optimizer.step()` (clearing gradients right after `optimizer.step()` instead works equally well).
  After training, apply a **0.5 threshold** on $\hat{p}$ to obtain class 0/1.
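
To see how the two loss options relate, the sketch below evaluates both on the same logits. For moderate logits they should agree. For an extreme, confidently wrong logit, the float32 sigmoid rounds to exactly 1.0, so `nn.BCELoss` has to rely on its clamped logarithm, while `nn.BCEWithLogitsLoss` works from the logit directly. The tensor values are arbitrary illustrations.

```python
import torch
import torch.nn as nn

logits = torch.tensor([[0.5], [-1.2], [2.0]])
y = torch.tensor([[1.0], [0.0], [1.0]])

# Moderate logits: Option A and Option B compute the same quantity
loss_a = nn.BCEWithLogitsLoss()(logits, y)           # Option A: raw logits
loss_b = nn.BCELoss()(torch.sigmoid(logits), y)      # Option B: probabilities
print(loss_a.item(), loss_b.item())

# Extreme logit with the wrong label: sigmoid saturates in float32
big_logit = torch.tensor([[50.0]])
wrong_label = torch.tensor([[0.0]])
stable = nn.BCEWithLogitsLoss()(big_logit, wrong_label)
naive = nn.BCELoss()(torch.sigmoid(big_logit), wrong_label)
print(stable.item(), naive.item())
```

In the extreme case the two values no longer match, which is the practical reason for preferring Option A.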

---

### Minimal Example (recommended: BCEWithLogitsLoss)

```python
import torch
import torch.nn as nn

X = torch.randn(200, 1)                       # 1D feature
y = (X + 0.5 * torch.randn_like(X) > 0).float().view(-1, 1)  # labels 0/1

class Logistic(nn.Module):
    def __init__(self):
        super().__init__()
        self.lin = nn.Linear(1, 1)
    def forward(self, x):
        return self.lin(x)  # output logits

model = Logistic()
loss_fn = nn.BCEWithLogitsLoss()
opt = torch.optim.SGD(model.parameters(), lr=1e-2)

for epoch in range(100):
    logits = model(X)
    loss = loss_fn(logits, y)
    opt.zero_grad()
    loss.backward()
    opt.step()

with torch.no_grad():
    probs = torch.sigmoid(model(X))
    y_hat = (probs >= 0.5).float()
```

---

# Practical Notes

* **Multi-class**: use `nn.CrossEntropyLoss()` with **logits** (no softmax in `forward`), and integer targets $\in\{0,\dots,K-1\}$.
* **Numerical stability**: prefer `BCEWithLogitsLoss` (it combines sigmoid and BCE safely).
* **Metrics**: track **loss + accuracy** (or F1/AUROC) to see if the model is truly reducing false positives/negatives.
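
The multi-class and metrics notes can be sketched together as follows. This is a minimal example on synthetic data, assuming three classes, 2D features, and a single linear layer; the data-generating rule and hyperparameters are illustrative choices rather than recommendations.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
K = 3                                         # number of classes (illustrative)
X = torch.randn(150, 2)                       # 2D features
y = (X @ torch.randn(2, K)).argmax(dim=1)     # integer targets in {0, ..., K-1}

model = nn.Linear(2, K)                       # forward returns logits, no softmax
loss_fn = nn.CrossEntropyLoss()               # applies log-softmax internally
opt = torch.optim.SGD(model.parameters(), lr=0.1)

history = []                                  # (loss, accuracy) per epoch
for epoch in range(50):
    logits = model(X)
    loss = loss_fn(logits, y)
    opt.zero_grad()
    loss.backward()
    opt.step()
    with torch.no_grad():
        acc = (logits.argmax(dim=1) == y).float().mean().item()
    history.append((loss.item(), acc))

print(f"first epoch: loss={history[0][0]:.3f}, acc={history[0][1]:.3f}")
print(f"last epoch:  loss={history[-1][0]:.3f}, acc={history[-1][1]:.3f}")
```

Logging accuracy next to the loss shows whether a falling loss actually translates into more correct predictions. On imbalanced data, F1 or AUROC would be computed at the same place in the loop.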

---

**In short:**
Cross-Entropy naturally arises from **Maximum Likelihood Estimation** for probabilistic classifiers.
Compared to MSE / threshold losses, it produces a **smooth cost surface with meaningful gradients**, making gradient-based optimization **more efficient and reliable**.
