# The Base Classification Model
:label:`sec_classification`

You may have noticed that the implementations from scratch and the concise implementation using framework functionality were quite similar in the case of regression. The same is true for classification. Since many models in this book deal with classification, it is worth adding functionalities to support this setting specifically. This section provides a base class for classification models to simplify future code.


In [2]:
import torch
from d2l import torch as d2l

## The `Classifier` Class


We define the `Classifier` class below. In `validation_step` we report both the loss value and the classification accuracy on a validation batch.

We draw an update for every `num_val_batches` batches. This has the benefit of generating the averaged loss and accuracy on the whole validation data. These average numbers are not exactly correct if the final batch contains fewer examples, but we ignore this minor difference to keep the code simple.
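
To see why the averaged numbers can be slightly off, here is a minimal sketch in plain Python. The loss values and batch size are made up for illustration, and it assumes the plotted value is the plain mean of the per-batch means.

```python
# Hypothetical per-example validation losses: 5 examples with batch size 2,
# so the final batch holds a single example.
losses = [1.0, 1.0, 2.0, 2.0, 6.0]
batch_size = 2

batches = [losses[i:i + batch_size] for i in range(0, len(losses), batch_size)]
batch_means = [sum(b) / len(b) for b in batches]

# Mean of per-batch means: every batch gets equal weight.
quick_estimate = sum(batch_means) / len(batch_means)
# Mean over all examples: every example gets equal weight.
exact_mean = sum(losses) / len(losses)

print(batch_means)     # [1.0, 2.0, 6.0]
print(quick_estimate)  # 3.0
print(exact_mean)      # 2.4
```

The single example in the short final batch counts as much as a full batch in the quick estimate, which is where the discrepancy comes from. With many full batches the effect is usually small.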


Let's revisit the `Module`:

<img src="../images/module.png">

In short, `Module` specifies **what happens during a training step** and **a validation step**. Subclasses configure the pieces it relies on, such as the forward computation and the loss.

#### Why do this for a classifier?
- The **original** `Module` implementation reports only the loss, not the accuracy.
- For classification we want both, but only on validation data and not during training, so we override `validation_step`.

In [3]:
class Classifier(d2l.Module):  #@save
    """Base class for classification models.

    Overrides `validation_step` to plot both the validation loss
    and the validation accuracy.
    """
    def validation_step(self, batch):
        Y_hat = self(*batch[:-1])
        self.plot('loss', self.loss(Y_hat, batch[-1]), train=False)
        self.plot('acc', self.accuracy(Y_hat, batch[-1]), train=False)

By default we use a stochastic gradient descent optimizer, operating on minibatches, just as we did in the context of linear regression.


In [5]:
@d2l.add_to_class(Classifier)  #@save
def configure_optimizers(self):
    return torch.optim.SGD(self.parameters(), lr=self.lr)
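
As a reminder of what the returned optimizer does, the following standalone sketch runs a single SGD update on a toy problem. The parameter `w`, the loss, and the learning rate are hypothetical values chosen for easy arithmetic and are not part of the `Classifier` class.

```python
import torch

# Hypothetical single parameter and learning rate (illustration only).
w = torch.tensor([1.0], requires_grad=True)
optimizer = torch.optim.SGD([w], lr=0.1)

loss = (w ** 2).sum()   # d(loss)/dw = 2 * w = 2.0 at w = 1.0
optimizer.zero_grad()   # clear any stale gradients
loss.backward()         # populate w.grad
optimizer.step()        # w <- w - lr * grad = 1.0 - 0.1 * 2.0

print(w)  # tensor([0.8000], requires_grad=True)
```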

## Accuracy
- **Accuracy calculation process**:
  1. If `Y_hat` is a matrix, each row contains the prediction scores for each class.
  2. **Use `argmax`** to get the predicted class (index of the largest score) for each row.
  3. Convert the predictions' data type to match that of `Y` for consistent comparison.
  4. **Compare** the predicted classes with the true labels (`Y`) elementwise.
  5. **Result**: A tensor of 0s (false) and 1s (true).
  6. **Summing** these values gives the number of correct predictions; taking their mean, as the code does when `averaged=True`, gives the accuracy as a fraction.


In [6]:
@d2l.add_to_class(Classifier)  #@save
def accuracy(self, Y_hat, Y, averaged=True):
    """Compute the fraction of correct predictions, or per-example correctness if `averaged` is False."""
    Y_hat = Y_hat.reshape((-1, Y_hat.shape[-1]))
    preds = Y_hat.argmax(axis=1).type(Y.dtype)
    compare = (preds == Y.reshape(-1)).type(torch.float32)
    return compare.mean() if averaged else compare
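
To make the steps concrete, here is a small standalone sketch that mirrors the body of `accuracy` outside the class. The scores and labels are made-up values for three examples and three classes.

```python
import torch

# Hypothetical prediction scores: 3 examples (rows) x 3 classes (columns).
Y_hat = torch.tensor([[0.1, 0.7, 0.2],
                      [0.8, 0.1, 0.1],
                      [0.2, 0.3, 0.5]])
Y = torch.tensor([1, 0, 1])  # hypothetical true labels

preds = Y_hat.argmax(dim=1).type(Y.dtype)   # predicted class per row
compare = (preds == Y).type(torch.float32)  # 1.0 where correct, else 0.0
accuracy = compare.mean()                   # fraction of correct predictions

print(preds)     # tensor([1, 0, 2])
print(compare)   # tensor([1., 1., 0.])
print(accuracy)  # tensor(0.6667)
```

Two of the three predictions match their labels, so the averaged accuracy is 2/3, while `compare` corresponds to what the method returns when `averaged=False`.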

## Summary

Classification is a sufficiently common problem that it warrants its own convenience functions. Of central importance in classification is the *accuracy* of the classifier. Note that while we often care primarily about accuracy, we train classifiers to optimize a variety of other objectives for statistical and computational reasons. However, regardless of which loss function was minimized during training, it is useful to have a convenience method for assessing the accuracy of our classifier empirically. 


## Exercises

1. Denote by $L_\textrm{v}$ the validation loss, and let $L_\textrm{v}^\textrm{q}$ be its quick and dirty estimate computed by the loss function averaging in this section. Lastly, denote by $l_\textrm{v}^\textrm{b}$ the loss on the last minibatch. Express $L_\textrm{v}$ in terms of $L_\textrm{v}^\textrm{q}$, $l_\textrm{v}^\textrm{b}$, and the sample and minibatch sizes.
1. Show that the quick and dirty estimate $L_\textrm{v}^\textrm{q}$ is unbiased. That is, show that $E[L_\textrm{v}] = E[L_\textrm{v}^\textrm{q}]$. Why would you still want to use $L_\textrm{v}$ instead?
1. Given a multiclass classification loss, denoting by $l(y,y')$ the penalty of estimating $y'$ when we see $y$ and given a probability $p(y \mid x)$, formulate the rule for an optimal selection of $y'$. Hint: express the expected loss, using $l$ and $p(y \mid x)$.


[Discussions](https://discuss.d2l.ai/t/6809)
