The following additional libraries are needed to run this
notebook. Note that running on Colab is experimental; please report a GitHub
issue if you have any problems.

In [None]:
!pip install d2l==1.0.3


# The Base Classification Model
:label:`sec_classification`

You may have noticed that the implementations from scratch and the concise implementation using framework functionality were quite similar in the case of regression. The same is true for classification. Since many models in this book deal with classification, it is worth adding functionalities to support this setting specifically. This section provides a base class for classification models to simplify future code.


In [None]:
import torch
from d2l import torch as d2l

## The `Classifier` Class


We define the `Classifier` class below. In the `validation_step` we report both the loss value and the classification accuracy on a validation batch. We draw an update for every `num_val_batches` batches. This has the benefit of generating the averaged loss and accuracy on the whole validation data. These average numbers are not exactly correct if the final batch contains fewer examples, but we ignore this minor difference to keep the code simple.


In [None]:
class Classifier(d2l.Module):  #@save
    """The base class of classification models."""
    def validation_step(self, batch):
        Y_hat = self(*batch[:-1])
        self.plot('loss', self.loss(Y_hat, batch[-1]), train=False)
        self.plot('acc', self.accuracy(Y_hat, batch[-1]), train=False)

By default we use a stochastic gradient descent optimizer, operating on minibatches, just as we did in the context of linear regression.


In [None]:
@d2l.add_to_class(d2l.Module)  #@save
def configure_optimizers(self):
    return torch.optim.SGD(self.parameters(), lr=self.lr)
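
As a minimal sketch of what this optimizer does on each step, the snippet below compares one call to `step()` of `torch.optim.SGD` with the plain update $w \leftarrow w - \eta \nabla_w L$. The parameter values, the toy loss, and the learning rate are illustrative assumptions, not taken from the book.

```python
import torch

# Toy parameter and learning rate (illustrative values only).
w = torch.nn.Parameter(torch.tensor([1.0, -2.0]))
lr = 0.1
optimizer = torch.optim.SGD([w], lr=lr)

# Toy loss L(w) = sum(w^2), whose gradient is 2 * w.
loss = (w ** 2).sum()
loss.backward()

# The update we expect from plain SGD (no momentum, no weight decay).
expected = (w - lr * w.grad).detach().clone()

optimizer.step()
print(w.detach(), expected)
```

With the default settings (no momentum and no weight decay), the two results should agree.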

## Accuracy

Given the predicted probability distribution `y_hat`,
we typically choose the class with the highest predicted probability
whenever we must output a hard prediction.
Indeed, many applications require that we make a choice.
For instance, Gmail must categorize an email into "Primary", "Social", "Updates", "Forums", or "Spam".
It might estimate probabilities internally,
but at the end of the day it has to choose one among the classes.

When predictions are consistent with the label class `y`, they are correct.
The classification accuracy is the fraction of all predictions that are correct.
Although it can be difficult to optimize accuracy directly (it is not differentiable),
it is often the performance measure that we care about the most. It is often *the*
relevant quantity in benchmarks. As such, we will nearly always report it when training classifiers.

Accuracy is computed as follows.
First, if `y_hat` is a matrix,
we assume that the second dimension stores prediction scores for each class.
We use `argmax` to obtain the predicted class by the index for the largest entry in each row.
Then we [**compare the predicted class with the ground truth `y` elementwise.**]
Since the equality operator `==` is sensitive to data types,
we convert `y_hat`'s data type to match that of `y`.
The result is a tensor containing entries of 0 (false) and 1 (true).
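Taking the mean of these entries yields the accuracy, i.e., the fraction of correct predictions; with `averaged=False` the per-example entries are returned instead.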


In [None]:
@d2l.add_to_class(Classifier)  #@save
def accuracy(self, Y_hat, Y, averaged=True):
    """Compute the fraction of correct predictions."""
    Y_hat = Y_hat.reshape((-1, Y_hat.shape[-1]))
    preds = Y_hat.argmax(axis=1).type(Y.dtype)
    compare = (preds == Y.reshape(-1)).type(torch.float32)
    return compare.mean() if averaged else compare
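
To see the computation on concrete numbers, here is a small standalone sketch that repeats the same logic as a plain function, so that it runs without `d2l`. The scores and labels are made up for illustration.

```python
import torch

def accuracy(Y_hat, Y, averaged=True):
    """Standalone copy of the accuracy logic, for illustration only."""
    Y_hat = Y_hat.reshape((-1, Y_hat.shape[-1]))
    preds = Y_hat.argmax(axis=1).type(Y.dtype)
    compare = (preds == Y.reshape(-1)).type(torch.float32)
    return compare.mean() if averaged else compare

# Two examples, three classes (made-up prediction scores).
Y_hat = torch.tensor([[0.1, 0.3, 0.6],
                      [0.3, 0.2, 0.5]])
Y = torch.tensor([0, 2])  # ground-truth labels

per_example = accuracy(Y_hat, Y, averaged=False)
mean_acc = accuracy(Y_hat, Y)
print(per_example)  # tensor([0., 1.])
print(mean_acc)     # tensor(0.5000)
```

Both rows have their largest score in column 2, so only the second example, whose label is 2, is counted as correct.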

## Summary

Classification is a sufficiently common problem that it warrants its own convenience functions. Of central importance in classification is the *accuracy* of the classifier. Note that while we often care primarily about accuracy, we train classifiers to optimize a variety of other objectives for statistical and computational reasons. However, regardless of which loss function was minimized during training, it is useful to have a convenience method for assessing the accuracy of our classifier empirically.


## Exercises

1. Denote by $L_\textrm{v}$ the validation loss, and let $L_\textrm{v}^\textrm{q}$ be its quick and dirty estimate computed by the loss function averaging in this section. Lastly, denote by $l_\textrm{v}^\textrm{b}$ the loss on the last minibatch. Express $L_\textrm{v}$ in terms of $L_\textrm{v}^\textrm{q}$, $l_\textrm{v}^\textrm{b}$, and the sample and minibatch sizes.
1. Show that the quick and dirty estimate $L_\textrm{v}^\textrm{q}$ is unbiased. That is, show that $E[L_\textrm{v}] = E[L_\textrm{v}^\textrm{q}]$. Why would you still want to use $L_\textrm{v}$ instead?
1. Given a multiclass classification loss, denoting by $l(y,y')$ the penalty of estimating $y'$ when we see $y$ and given a probability $p(y \mid x)$, formulate the rule for an optimal selection of $y'$. Hint: express the expected loss, using $l$ and $p(y \mid x)$.


1. Denote by $L_\textrm{v}$ the validation loss, and let $L_\textrm{v}^\textrm{q}$ be its quick and dirty estimate computed by the loss function averaging in this section. Lastly, denote by $l_\textrm{v}^\textrm{b}$ the loss on the last minibatch. Express $L_\textrm{v}$ in terms of $L_\textrm{v}^\textrm{q}$, $l_\textrm{v}^\textrm{b}$, and the sample and minibatch sizes.

Let $N$ be the total number of examples in the validation set.
Let $B$ be the minibatch size.
Let $M$ be the number of minibatches, so $M = \lceil N/B \rceil$. The first $M-1$ minibatches contain $B$ examples each.
Let $R$ be the size of the last minibatch, so $R = N - B(M-1)$ with $1 \leq R \leq B$. If $B$ divides $N$, then $R = B$.
Let $L_i$ be the average loss on the $i$-th minibatch, so that $L_M = l_\textrm{v}^\textrm{b}$.

The validation loss $L_\textrm{v}$ is the average loss over all examples, so each minibatch loss is weighted by the size of its minibatch:
$L_\textrm{v} = \frac{1}{N} \left( B \sum_{i=1}^{M-1} L_i + R \cdot l_\textrm{v}^\textrm{b} \right)$

The quick and dirty estimate $L_\textrm{v}^\textrm{q}$ is the unweighted average of the losses of all $M$ minibatches, including the last one:
$L_\textrm{v}^\textrm{q} = \frac{1}{M} \sum_{i=1}^M L_i$

To express $L_\textrm{v}$ in terms of $L_\textrm{v}^\textrm{q}$ and $l_\textrm{v}^\textrm{b}$, we split off the last minibatch:
$M \cdot L_\textrm{v}^\textrm{q} = \sum_{i=1}^M L_i = \sum_{i=1}^{M-1} L_i + l_\textrm{v}^\textrm{b}$
So, $\sum_{i=1}^{M-1} L_i = M \cdot L_\textrm{v}^\textrm{q} - l_\textrm{v}^\textrm{b}$.

Substitute this back into the expression for $L_\textrm{v}$:
$L_\textrm{v} = \frac{1}{N} \left( B (M \cdot L_\textrm{v}^\textrm{q} - l_\textrm{v}^\textrm{b}) + R \cdot l_\textrm{v}^\textrm{b} \right)$
$L_\textrm{v} = \frac{1}{N} \left( B M \cdot L_\textrm{v}^\textrm{q} + (R - B) \cdot l_\textrm{v}^\textrm{b} \right)$

Since $B(M-1) + R = N$, we have $BM = N - R + B$.
$L_\textrm{v} = \frac{N - R + B}{N} L_\textrm{v}^\textrm{q} + \frac{R - B}{N} l_\textrm{v}^\textrm{b}$

As a sanity check, if $B$ divides $N$ then $R = B$ and the formula reduces to $L_\textrm{v} = L_\textrm{v}^\textrm{q}$, as expected when all minibatches have the same size.

This expresses $L_\textrm{v}$ in terms of $L_\textrm{v}^\textrm{q}$, $l_\textrm{v}^\textrm{b}$, and the sample and minibatch sizes ($N$, $B$, and $R$).
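
A quick numerical check can guard against bookkeeping slips in a derivation like this. The sketch below assumes $M = \lceil N/B \rceil$ minibatches, of which only the last may be short, with $R = N - B(M-1)$ examples. It checks the relation $L_\textrm{v} = \frac{1}{N}\left((N - R + B) L_\textrm{v}^\textrm{q} + (R - B) l_\textrm{v}^\textrm{b}\right)$ on made-up per-example losses, using exact fractions to avoid rounding issues.

```python
from fractions import Fraction
import math

# Made-up per-example validation losses (N = 7) and minibatch size B = 3.
losses = [Fraction(v) for v in (1, 2, 3, 4, 5, 6, 14)]
N, B = len(losses), 3
M = math.ceil(N / B)   # number of minibatches (3)
R = N - B * (M - 1)    # size of the last minibatch (1)

# Split into minibatches and compute the average loss of each.
batches = [losses[i:i + B] for i in range(0, N, B)]
means = [sum(b) / len(b) for b in batches]

L_v = sum(losses) / N  # exact validation loss
L_q = sum(means) / M   # unweighted average of minibatch losses
l_b = means[-1]        # loss on the last minibatch

# Recover L_v from L_q, l_b, and the sizes N, B, R.
recovered = ((N - R + B) * L_q + (R - B) * l_b) / N
print(L_v, L_q, recovered)  # 5 7 5
```

Here the single large loss in the short last minibatch pulls $L_\textrm{v}^\textrm{q}$ up to 7, while the exact validation loss is 5.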

2. Show that the quick and dirty estimate $L_\textrm{v}^\textrm{q}$ is unbiased. That is, show that $E[L_\textrm{v}] = E[L_\textrm{v}^\textrm{q}]$. Why would you still want to use $L_\textrm{v}$ instead?

**Showing $E[L_\textrm{v}] = E[L_\textrm{v}^\textrm{q}]$**

Let $L_i$ be the loss on the $i$-th minibatch.
The quick and dirty estimate $L_\textrm{v}^\textrm{q}$ is the average of the losses of the $M$ minibatches:
$L_\textrm{v}^\textrm{q} = \frac{1}{M} \sum_{i=1}^M L_i$

The expected value of the quick and dirty estimate is:
$E[L_\textrm{v}^\textrm{q}] = E\left[\frac{1}{M} \sum_{i=1}^M L_i\right]$
By the linearity of expectation:
$E[L_\textrm{v}^\textrm{q}] = \frac{1}{M} \sum_{i=1}^M E[L_i]$

Since each minibatch is drawn randomly from the validation set, the expected loss of each minibatch is the true validation loss $L_\textrm{v}$.
$E[L_i] = L_\textrm{v}$ for all $i$.

Substituting this back into the expression for $E[L_\textrm{v}^\textrm{q}]$:
$E[L_\textrm{v}^\textrm{q}] = \frac{1}{M} \sum_{i=1}^M L_\textrm{v} = \frac{1}{M} (M \cdot L_\textrm{v}) = L_\textrm{v}$

Therefore, $E[L_\textrm{v}^\textrm{q}] = L_\textrm{v}$. Since $L_\textrm{v}$ is a constant (the true validation loss), $E[L_\textrm{v}] = L_\textrm{v}$.
Thus, $E[L_\textrm{v}^\textrm{q}] = E[L_\textrm{v}]$, which shows that the quick and dirty estimate $L_\textrm{v}^\textrm{q}$ is an unbiased estimator of the true validation loss $L_\textrm{v}$.
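
The argument can be checked exactly on a tiny example. The sketch below assumes that the randomness comes from shuffling a fixed validation set before splitting it into minibatches, and enumerates every ordering of five made-up per-example losses. The helper `quick_estimate` is a hypothetical name introduced only for this illustration.

```python
from fractions import Fraction
from itertools import permutations

def quick_estimate(losses, batch_size):
    """Unweighted average of minibatch losses (hypothetical helper)."""
    batches = [losses[i:i + batch_size]
               for i in range(0, len(losses), batch_size)]
    means = [Fraction(sum(b), len(b)) for b in batches]
    return sum(means) / len(means)

# Made-up per-example losses; B = 2 gives minibatches of sizes 2, 2, 1.
losses = (1, 2, 3, 4, 10)
B = 2
L_v = Fraction(sum(losses), len(losses))  # exact validation loss

# Average the quick estimate over all orderings of the data.
estimates = [quick_estimate(p, B) for p in permutations(losses)]
expected_L_q = sum(estimates) / len(estimates)

print(L_v, expected_L_q)              # 4 4
print(min(estimates), max(estimates)) # 7/2 5
```

Averaged over all orderings, the quick estimate matches $L_\textrm{v}$, but individual orderings give values that differ from it, depending on which example lands in the short last minibatch.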

**Why would you still want to use $L_\textrm{v}$ instead?**

Although $L_\textrm{v}^\textrm{q}$ is unbiased, it does not equal $L_\textrm{v}$ for a given validation set whenever the last minibatch is smaller than the others. In $L_\textrm{v}^\textrm{q}$ each example in the last minibatch carries weight $1/(MR)$, whereas each example in a full minibatch carries weight $1/(MB)$. The few examples in the last minibatch are therefore overweighted, and the estimate fluctuates with how the data happen to be split into minibatches.

The exact validation loss $L_\textrm{v}$, on the other hand, weights every example equally by $1/N$. It does not depend on the minibatch size or on the order of the data, which makes it reproducible and comparable across runs. Moreover, among weighted averages of i.i.d. per-example losses whose weights sum to one, the equal-weight average has the smallest variance.

Note that both quantities require a full pass over the validation set, so the quick and dirty estimate saves bookkeeping rather than computation: it avoids tracking the size of each minibatch. This is a reasonable shortcut for monitoring progress during training. However, for final evaluation or when comparing models, the exact validation loss $L_\textrm{v}$ is preferable due to its lower variance and greater reliability.

3. Given a multiclass classification loss, denoting by $l(y,y')$ the penalty of estimating $y'$ when we see $y$ and given a probability $p(y \mid x)$, formulate the rule for an optimal selection of $y'$. Hint: express the expected loss, using $l$ and $p(y \mid x)$.

The goal is to find the optimal selection of $y'$ that minimizes the expected loss. The expected loss of estimating $y'$ for a given input $x$ is the sum of the penalties for each possible true label $y$, weighted by the probability of that true label given $x$:

$E[\text{loss} \mid x, y'] = \sum_{y} l(y, y') p(y \mid x)$

To find the optimal $y'$, we need to choose the $y'$ that minimizes this expected loss:

Optimal $y' = \arg\min_{y'} \sum_{y} l(y, y') p(y \mid x)$

This rule states that for a given input $x$, the optimal predicted class $y'$ is the one that minimizes the sum of the penalties for all possible true classes $y$, where each penalty is weighted by the probability of that true class given $x$.

A common example of a loss function is the 0-1 loss, where $l(y, y') = 0$ if $y = y'$ and $l(y, y') = 1$ if $y \neq y'$. In this case, the expected loss becomes:

$E[\text{loss} \mid x, y'] = \sum_{y \neq y'} 1 \cdot p(y \mid x) = \sum_{y \neq y'} p(y \mid x)$

Since $\sum_{y} p(y \mid x) = 1$, we have $\sum_{y \neq y'} p(y \mid x) = 1 - p(y' \mid x)$.

So, for the 0-1 loss, the optimal $y'$ is the one that minimizes $1 - p(y' \mid x)$, which is equivalent to maximizing $p(y' \mid x)$:

Optimal $y'$ (with 0-1 loss) $= \arg\max_{y'} p(y' \mid x)$

This means that with the 0-1 loss, the optimal strategy is to predict the class with the highest posterior probability, which is the maximum a posteriori (MAP) decision rule.
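
The decision rule is easy to evaluate for small problems. The following sketch assumes a two-class problem with a made-up posterior and two made-up loss matrices, indexed as `loss[y][y_pred]` for $l(y, y')$. The function name `bayes_optimal` is hypothetical and introduced only for this illustration.

```python
def bayes_optimal(posterior, loss):
    """Return the prediction minimizing the expected loss.

    posterior[y] is p(y | x); loss[y][y_pred] is l(y, y').
    """
    num_classes = len(posterior)
    expected = [
        sum(loss[y][y_pred] * posterior[y] for y in range(num_classes))
        for y_pred in range(len(loss[0]))
    ]
    best = min(range(len(expected)), key=expected.__getitem__)
    return best, expected

posterior = [0.7, 0.3]          # made-up p(y | x) for classes 0 and 1
zero_one = [[0, 1], [1, 0]]     # 0-1 loss
asymmetric = [[0, 1], [10, 0]]  # predicting 0 when y = 1 costs 10

best_01, _ = bayes_optimal(posterior, zero_one)
best_asym, _ = bayes_optimal(posterior, asymmetric)
print(best_01, best_asym)  # 0 1
```

Under the 0-1 loss the rule picks the most probable class. Under the asymmetric loss it picks class 1 even though class 0 is more probable, because predicting 0 when the true class is 1 is penalized much more heavily.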

[Discussions](https://discuss.d2l.ai/t/6809)
