
The following additional libraries are needed to run this
notebook. Note that running on Colab is experimental; please report a GitHub
issue if you have any problems.

In [None]:
!pip install d2l==1.0.3


# The Base Classification Model
:label:`sec_classification`

You may have noticed that the implementations from scratch and the concise implementation using framework functionality were quite similar in the case of regression. The same is true for classification. Since many models in this book deal with classification, it is worth adding functionalities to support this setting specifically. This section provides a base class for classification models to simplify future code.


In [None]:
import torch
from d2l import torch as d2l

## The `Classifier` Class


We define the `Classifier` class below. In the `validation_step` we report both the loss value and the classification accuracy on a validation batch. We draw an update for every `num_val_batches` batches. This has the benefit of generating the averaged loss and accuracy on the whole validation data. These average numbers are not exactly correct if the final batch contains fewer examples, but we ignore this minor difference to keep the code simple.


In [None]:
class Classifier(d2l.Module):  #@save
    """The base class of classification models."""
    def validation_step(self, batch):
        Y_hat = self(*batch[:-1])
        self.plot('loss', self.loss(Y_hat, batch[-1]), train=False)
        self.plot('acc', self.accuracy(Y_hat, batch[-1]), train=False)
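
To get a feel for the size of the discrepancy mentioned above, the following minimal sketch compares the plain mean of per-batch means with the exact mean over all examples. The per-example loss values and the batch size are made up for illustration and do not come from any model.

```python
# Hypothetical per-example validation losses (illustrative values only).
losses = [1.0, 1.0, 1.0, 1.0, 3.0]
batch_size = 2

# Split into minibatches; the last one holds a single example.
batches = [losses[i:i + batch_size] for i in range(0, len(losses), batch_size)]
batch_means = [sum(b) / len(b) for b in batches]

# Plain average of the batch means: every batch counts equally.
quick_estimate = sum(batch_means) / len(batch_means)

# Exact average: every example counts equally.
exact = sum(losses) / len(losses)

print(batch_means)
print(quick_estimate, exact)
```

Here the single example in the last batch receives as much weight as two examples in any other batch, which is why the two averages differ. With batches of equal size they would coincide.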

By default we use a stochastic gradient descent optimizer, operating on minibatches, just as we did in the context of linear regression.


In [None]:
@d2l.add_to_class(d2l.Module)  #@save
def configure_optimizers(self):
    return torch.optim.SGD(self.parameters(), lr=self.lr)
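
As a quick illustration of what the returned optimizer does, here is a minimal sketch of a single SGD step on a toy parameter. The tensor `w`, the quadratic loss, and the learning rate 0.1 are assumptions chosen for illustration and are not part of `d2l`.

```python
import torch

# Toy parameter and learning rate (illustrative assumptions).
w = torch.tensor([1.0, -2.0], requires_grad=True)
optimizer = torch.optim.SGD([w], lr=0.1)

loss = (w ** 2).sum()   # the gradient of this loss is 2 * w
optimizer.zero_grad()   # clear any stale gradients
loss.backward()         # populate w.grad
optimizer.step()        # w <- w - lr * w.grad

print(w)
```

Each entry of `w` moves against its gradient by `lr` times the gradient, which is the update rule used in the linear regression chapter.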

## Accuracy

Given the predicted probability distribution `y_hat`,
we typically choose the class with the highest predicted probability
whenever we must output a hard prediction.
Indeed, many applications require that we make a choice.
For instance, Gmail must categorize an email into "Primary", "Social", "Updates", "Forums", or "Spam".
It might estimate probabilities internally,
but at the end of the day it has to choose one among the classes.

When predictions are consistent with the label class `y`, they are correct.
The classification accuracy is the fraction of all predictions that are correct.
Although it can be difficult to optimize accuracy directly (it is not differentiable),
it is often the performance measure that we care about the most. It is often *the*
relevant quantity in benchmarks. As such, we will nearly always report it when training classifiers.

Accuracy is computed as follows.
First, if `y_hat` is a matrix,
we assume that the second dimension stores prediction scores for each class.
We use `argmax` to obtain the predicted class by the index for the largest entry in each row.
Then we [**compare the predicted class with the ground truth `y` elementwise.**]
Since the equality operator `==` is sensitive to data types,
we convert the predicted class indices to the data type of `y`.
The result is a tensor containing entries of 0 (false) and 1 (true).
Taking the mean yields the accuracy, that is, the fraction of correct predictions; with `averaged=False` the per-example tensor of 0s and 1s is returned instead.


In [None]:
@d2l.add_to_class(Classifier)  #@save
def accuracy(self, Y_hat, Y, averaged=True):
    """Compute the accuracy, or per-example correctness if not averaged."""
    Y_hat = Y_hat.reshape((-1, Y_hat.shape[-1]))
    preds = Y_hat.argmax(axis=1).type(Y.dtype)
    compare = (preds == Y.reshape(-1)).type(torch.float32)
    return compare.mean() if averaged else compare
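
The same computation can be tried outside the class. The sketch below repeats the steps of `accuracy` as a free function on made-up scores and labels; the function name `accuracy_sketch` and the sample values are hypothetical.

```python
import torch

def accuracy_sketch(Y_hat, Y, averaged=True):
    """Standalone version of the steps above (hypothetical helper)."""
    Y_hat = Y_hat.reshape((-1, Y_hat.shape[-1]))   # (num_examples, num_classes)
    preds = Y_hat.argmax(dim=1).type(Y.dtype)      # index of the largest score
    compare = (preds == Y.reshape(-1)).type(torch.float32)
    return compare.mean() if averaged else compare

# Made-up scores for 3 examples and 3 classes, plus their labels.
Y_hat = torch.tensor([[0.1, 0.3, 0.6],
                      [0.3, 0.2, 0.5],
                      [0.7, 0.2, 0.1]])
Y = torch.tensor([2, 0, 0])

per_example = accuracy_sketch(Y_hat, Y, averaged=False)
acc = accuracy_sketch(Y_hat, Y)
print(per_example, acc)
```

The first and third examples are classified correctly and the second is not, so two of the three predictions match the labels.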

## Summary

Classification is a sufficiently common problem that it warrants its own convenience functions. Of central importance in classification is the *accuracy* of the classifier. Note that while we often care primarily about accuracy, we train classifiers to optimize a variety of other objectives for statistical and computational reasons. However, regardless of which loss function was minimized during training, it is useful to have a convenience method for assessing the accuracy of our classifier empirically.


## Exercises

1. Denote by $L_\textrm{v}$ the validation loss, and let $L_\textrm{v}^\textrm{q}$ be its quick and dirty estimate computed by the loss function averaging in this section. Lastly, denote by $l_\textrm{v}^\textrm{b}$ the loss on the last minibatch. Express $L_\textrm{v}$ in terms of $L_\textrm{v}^\textrm{q}$, $l_\textrm{v}^\textrm{b}$, and the sample and minibatch sizes.
1. Show that the quick and dirty estimate $L_\textrm{v}^\textrm{q}$ is unbiased. That is, show that $E[L_\textrm{v}] = E[L_\textrm{v}^\textrm{q}]$. Why would you still want to use $L_\textrm{v}$ instead?
1. Given a multiclass classification loss, denoting by $l(y,y')$ the penalty of estimating $y'$ when we see $y$ and given a probability $p(y \mid x)$, formulate the rule for an optimal selection of $y'$. Hint: express the expected loss, using $l$ and $p(y \mid x)$.


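1. Relation between the validation loss and the minibatch losses. Problem: with $L_\textrm{v}$ the validation loss, $L_\textrm{v}^\textrm{q}$ its quick estimate, and $l_\textrm{v}^\textrm{b}$ the loss on the last minibatch, derive how the three are related through the sample and minibatch sizes.

   Solution. Let the validation set contain $N$ examples, let the minibatch size be $B$, let $M = \lceil N/B \rceil$ be the number of minibatches, and let $n = N - (M-1) B$ be the size of the last minibatch, so that $1 \leq n \leq B$. Writing $L_i$ for the average loss on minibatch $i$, the validation loss is

   $$L_\textrm{v} = \frac{1}{N} \left( B \sum_{i=1}^{M-1} L_i + n L_M \right).$$

   The quick estimate averages the $M$ minibatch losses and ignores the smaller size of the last one:

   $$L_\textrm{v}^\textrm{q} = \frac{1}{M} \sum_{i=1}^{M} L_i.$$

   With $l_\textrm{v}^\textrm{b} = L_M$ we have $\sum_{i=1}^{M-1} L_i = M L_\textrm{v}^\textrm{q} - l_\textrm{v}^\textrm{b}$, and therefore

   $$L_\textrm{v} = \frac{1}{N} \left( M B L_\textrm{v}^\textrm{q} - (B - n) l_\textrm{v}^\textrm{b} \right).$$

   When $n = B$ (the minibatch size divides the number of examples), $N = MB$ and $L_\textrm{v} = L_\textrm{v}^\textrm{q}$. When $n < B$, $L_\textrm{v}$ is the size-weighted average of the first $M-1$ minibatches and the last one.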
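2. Unbiasedness and limitations of the quick estimate. Problem: show that $L_\textrm{v}^\textrm{q}$ is an unbiased estimate of $L_\textrm{v}$, and explain why one would still use $L_\textrm{v}$.

   Solution. Assume the validation examples are drawn i.i.d., so that every per-example loss has the same expectation $\mu$. Then the average loss $L_i$ on each of the $M$ minibatches satisfies $\mathbb{E}[L_i] = \mu$ regardless of the minibatch size, and

   $$\mathbb{E}[L_\textrm{v}^\textrm{q}] = \mathbb{E}\left[ \frac{1}{M} \sum_{i=1}^{M} L_i \right] = \frac{1}{M} \sum_{i=1}^{M} \mathbb{E}[L_i] = \mu = \mathbb{E}[L_\textrm{v}].$$

   Hence $L_\textrm{v}^\textrm{q}$ is an unbiased estimate of $L_\textrm{v}$. There are still reasons to prefer $L_\textrm{v}$. First, variance: when the last minibatch has $n < B$ examples, $L_\textrm{v}^\textrm{q}$ gives each of them more weight than an example in a full minibatch of size $B$, so its variance is larger than that of $L_\textrm{v}$ and the estimate is less stable. Second, exactness: in key experiments or in papers, $L_\textrm{v}$ should be computed exactly so that the reported number does not depend on how the data was split into minibatches. Third, non-random batching: if the examples in the last minibatch differ systematically from the rest (for example, a skewed class distribution in unshuffled data), the assumption above fails and $L_\textrm{v}^\textrm{q}$ can be systematically off.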
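3. Optimal prediction rule for multiclass classification. Problem: given a multiclass loss $l(y, y')$ and the conditional probability $p(y \mid x)$, derive the rule for choosing the optimal $y'$.

   Solution. For an input $x$, the expected loss of predicting $y'$ is

   $$\mathbb{E}_{y \sim p(y \mid x)} [l(y, y')] = \sum_y p(y \mid x) \, l(y, y').$$

   The optimal prediction minimizes this expected loss:

   $$y'^* = \arg\min_{y'} \sum_y p(y \mid x) \, l(y, y').$$

   Special cases. For the 0-1 loss $l(y, y') = \mathbb{I}(y \neq y')$, the expected loss equals $1 - p(y' \mid x)$, so $y'^* = \arg\max_y p(y \mid x)$: we choose the class with the highest posterior probability. For the cross-entropy loss $l(y, y') = -\log y'_y$, where $y'$ is a probability vector, the expected loss $-\sum_y p(y \mid x) \log y'_y$ is minimized over probability vectors by $y' = p(\cdot \mid x)$ (Gibbs' inequality). The optimal prediction is then the conditional distribution itself, and taking its $\arg\max$ recovers the 0-1 decision.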
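
The decision rule in the last solution can be checked numerically. The sketch below assumes a made-up posterior `p` over three classes and a made-up asymmetric loss table indexed as `loss[y][y_pred]`; the helper name `bayes_decision` is hypothetical. It illustrates that the expected-loss minimizer coincides with the most probable class under the 0-1 loss but can differ under an asymmetric loss.

```python
def bayes_decision(p, loss):
    """Return the prediction y' minimizing sum_y p[y] * loss[y][y']."""
    num_classes = len(p)
    expected = [sum(p[y] * loss[y][y_pred] for y in range(num_classes))
                for y_pred in range(num_classes)]
    best = min(range(num_classes), key=lambda k: expected[k])
    return best, expected

p = [0.5, 0.3, 0.2]   # hypothetical p(y | x)

# 0-1 loss: penalty 1 for any mistake, 0 otherwise.
zero_one = [[0 if y == y_pred else 1 for y_pred in range(3)] for y in range(3)]

# Asymmetric loss: missing class 2 is penalized heavily (illustrative values).
asymmetric = [[0, 1, 1],
              [1, 0, 1],
              [10, 10, 0]]

choice_01, risk_01 = bayes_decision(p, zero_one)
choice_asym, risk_asym = bayes_decision(p, asymmetric)
print(choice_01, risk_01)
print(choice_asym, risk_asym)
```

Under the 0-1 loss the rule picks class 0, the most probable one. Under the asymmetric table it picks class 2 even though that class has the lowest probability, because predicting anything else risks the large penalty.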

[Discussions](https://discuss.d2l.ai/t/6809)
