# 15.6. Fine-Tuning BERT for Sequence-Level and Token-Level Applications

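- For sequence-level and token-level natural language processing applications, BERT requires only **minimal architecture changes** (extra fully-connected layers), for example:
  - single text classification (e.g., sentiment analysis and testing linguistic acceptability),
  - text pair classification or regression (e.g., natural language inference and semantic textual similarity),
  - text tagging (e.g., part-of-speech tagging),
  - question answering.
- During supervised learning of a downstream application, the parameters of the extra layers are learned from scratch, while all the parameters of the pretrained BERT model are fine-tuned.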


In the previous sections of this chapter, we designed different models for NLP applications, based on **RNNs, CNNs, attention**, and **MLPs**. These models are helpful when there are space or time constraints; **however**, crafting a specific model for every NLP task is practically infeasible.

In [8_bert.md](../bert_diy/8_bert.md), we introduced a pretraining model, **BERT**, that requires minimal architecture changes for a wide range of NLP tasks.

- On the one hand, at the time of its proposal, BERT improved the state of the art on various NLP tasks.
- On the other hand, as noted in [Section 14.10](https://d2l.ai/chapter_natural-language-processing-pretraining/bert-pretraining.html#sec-bert-pretraining), the **two versions** of the original BERT model come with 110 million and 340 million parameters.
- Thus, when there are sufficient computational resources, we may consider **fine-tuning BERT** for **downstream NLP applications**.

In the following, we generalize a subset of **NLP applications** as sequence-level and token-level.

- **On the sequence level**, we introduce **how to transform** the BERT representation of the text input **into** the output label in single text classification **and** in text pair classification or regression.
  - `input`: the BERT representation of the text
  - `output`:
    - 1) single text: a class label
    - 2) text pair: either a class label (classification) or a continuous value (regression)
- **On the token level**, we briefly introduce **new applications**, listed below, and shed light on how BERT can represent their inputs and transform them into output labels.
  - 3) text tagging
  - 4) question answering

During **fine-tuning**, the “minimal architecture changes” required by BERT across different applications are the extra fully-connected layers.

During **supervised learning of a downstream application**, parameters of the extra layers are learned from scratch while all the parameters in the pretrained BERT model are fine-tuned. In other words, only the extra layers start from random initialization; the BERT parameters start from their pretrained values and continue training on the downstream data.
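The distinction between "learned from scratch" and "fine-tuned" can be illustrated with a minimal sketch in plain Python. The parameter names (`encoder.layer0`, `head.weight`), the helper functions, and the gradients are all hypothetical and chosen only for illustration; a real setup would load an actual BERT checkpoint and compute gradients by backpropagation.

```python
import random

random.seed(0)

# Hypothetical pretrained checkpoint: parameter name -> list of weights.
pretrained = {"encoder.layer0": [0.5, -0.3], "encoder.layer1": [0.1, 0.8]}

def init_finetune_params(pretrained, head_size):
    """Copy pretrained weights and add a freshly initialized output head."""
    params = {name: list(w) for name, w in pretrained.items()}
    # The extra layer has no pretrained values, so it starts from random noise.
    params["head.weight"] = [random.gauss(0.0, 0.01) for _ in range(head_size)]
    return params

def sgd_step(params, grads, lr):
    """Update every parameter, pretrained or new, with one SGD step."""
    return {name: [w - lr * g for w, g in zip(ws, grads[name])]
            for name, ws in params.items()}

params = init_finetune_params(pretrained, head_size=2)
# Made-up gradients, standing in for backpropagation through the task loss.
grads = {name: [1.0] * len(ws) for name, ws in params.items()}
updated = sgd_step(params, grads, lr=0.1)
```

Both groups of parameters move during the update; they differ only in where they start.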

## 15.6.1. Single Text Classification

***Single text classification*** task:

- `input`: a single text sequence
- `output`: a classification result

Besides sentiment analysis that we have studied in this chapter, the Corpus of Linguistic Acceptability (CoLA) is also a dataset for single text classification, judging whether a given sentence is grammatically acceptable or not [[Warstadt et al., 2019]](https://d2l.ai/chapter_references/zreferences.html#warstadt-singh-bowman-2019). **For instance**, “I should study.” is acceptable but “I should studying.” is not.

<center>
    <img style="border-radius: 0.3125em;
    box-shadow: 0 2px 4px 0 rgba(34,36,38,.12),0 2px 10px 0 rgba(34,36,38,.08);" 
    src="https://d2l.ai/_images/bert-one-seq.svg" width = "65%" alt=""/>
    <br>
    <div style="color:orange; border-bottom: 1px solid #d9d9d9;
    display: inline-block;
    color: #999;
    padding: 2px;">
      Fig. 15.6.1 Fine-tuning BERT for single text classification applications, such as sentiment analysis and testing linguistic acceptability. Suppose that the input single text has six tokens.
  	</div>
</center>

[8_bert.md](../bert_diy/8_bert.md) describes the input representation of BERT. The **BERT input sequence** unambiguously represents both `single text` and `text pairs`, where

- the special classification token `“<cls>”` is used for **sequence classification**
- the special separation token `“<sep>”` **marks the end** of a `single text` or **separates** a `pair of texts`.

As shown in [Fig. 15.6.1](https://d2l.ai/chapter_natural-language-processing-applications/finetuning-bert.html#fig-bert-one-seq), in single text classification applications,

- the BERT representation of the special classification token `“<cls>”` **encodes the information of the entire input text sequence**. In self-attention, every token attends to all other tokens, so the output at the `“<cls>”` position aggregates information from the whole sequence; since `“<cls>”` carries no lexical meaning of its own, its representation is free to serve as a summary of the input.
- As the representation of the input single text, it will be fed into a small MLP consisting of fully-connected (dense) layers **to output the distribution** over all the discrete label values.
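The two steps above can be sketched in plain Python. This is a toy illustration rather than the BERT implementation: the `<cls>` vector is made up, the weights are random, and the layer sizes and the `tanh` activation are assumptions chosen for the example.

```python
import math
import random

def dense(x, weights, bias):
    """Fully connected layer: one output per row of `weights`."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def softmax(scores):
    """Turn raw scores into a probability distribution."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def classify(cls_repr, w1, b1, w2, b2):
    """Small MLP on the <cls> representation: dense -> tanh -> dense -> softmax."""
    hidden = [math.tanh(h) for h in dense(cls_repr, w1, b1)]
    return softmax(dense(hidden, w2, b2))

def rand_matrix(rows, cols):
    """Randomly initialized weights: the extra layers are learned from scratch."""
    return [[random.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

random.seed(0)
num_hiddens, num_mlp_hiddens, num_classes = 4, 3, 2  # toy sizes (assumed)

cls_repr = [0.2, -0.1, 0.4, 0.3]  # made-up BERT output at the <cls> position
w1, b1 = rand_matrix(num_mlp_hiddens, num_hiddens), [0.0] * num_mlp_hiddens
w2, b2 = rand_matrix(num_classes, num_mlp_hiddens), [0.0] * num_classes
probs = classify(cls_repr, w1, b1, w2, b2)  # one probability per label
```

Only the `<cls>` position feeds the classifier; the representations of the other tokens influence the prediction indirectly, through attention inside BERT.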

## 15.6.2. Text Pair Classification or Regression

We have also examined **natural language inference** in this chapter, which decides whether a hypothesis can be inferred from a premise. It belongs to ***text pair classification***, a type of application that **classifies a pair of texts**:

- `input`: a pair of texts
- `output`: a class label (for example, whether the premise entails the hypothesis)

***Semantic textual similarity*** is a popular *text pair **regression*** task:

- `input`: a pair of texts
- `output`: a **continuous value**

This task measures the semantic **similarity** of sentences. **For instance**, in the Semantic Textual Similarity Benchmark dataset, the similarity score of a pair of sentences is on an ordinal scale ranging from 0 (no meaning overlap) to 5 (meaning equivalence) [[Cer et al., 2017]](https://d2l.ai/chapter_references/zreferences.html#cer-diab-agirre-ea-2017). The goal is to predict these scores. **Examples** from the Semantic Textual Similarity Benchmark dataset include (sentence 1, sentence 2, similarity score):

* `“A plane is taking off.”`, `“An air plane is taking off.”`, 5.000;
* `“A woman is eating something.”`, `“A woman is eating meat.”`, 3.000;
* `“A woman is dancing.”`, `“A man is talking.”`, 0.000.

<center>
    <img style="border-radius: 0.3125em;
    box-shadow: 0 2px 4px 0 rgba(34,36,38,.12),0 2px 10px 0 rgba(34,36,38,.08);" 
    src="https://d2l.ai/_images/bert-two-seqs.svg" width = "65%" alt=""/>
    <br>
    <div style="color:orange; border-bottom: 1px solid #d9d9d9;
    display: inline-block;
    color: #999;
    padding: 2px;">
      Fig. 15.6.2 Fine-tuning BERT for text pair classification or regression applications, such as natural language inference and semantic textual similarity. Suppose that the input text pair has two and three tokens.
  	</div>
</center>

Compared with single text classification in [Fig. 15.6.1](https://d2l.ai/chapter_natural-language-processing-applications/finetuning-bert.html#fig-bert-one-seq), fine-tuning BERT for **text pair classification** in [Fig. 15.6.2](https://d2l.ai/chapter_natural-language-processing-applications/finetuning-bert.html#fig-bert-two-seqs) differs only in the input representation: the two texts are packed into one BERT input sequence, separated by `“<sep>”`.
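As a minimal sketch of this difference in input representation, the hypothetical helper `pack_bert_input` below builds the token sequence and the segment IDs for a single text or a text pair. The function name and the example tokens are assumptions made for illustration; the layout (`<cls>` first, `<sep>` after each text, segment 0 for the first text and 1 for the second) follows the BERT input representation described earlier.

```python
def pack_bert_input(tokens_a, tokens_b=None):
    """Return (tokens, segment_ids) for a single text or a text pair."""
    tokens = ["<cls>"] + tokens_a + ["<sep>"]
    segments = [0] * len(tokens)  # segment 0 covers <cls>, text A, and its <sep>
    if tokens_b is not None:
        tokens += tokens_b + ["<sep>"]
        segments += [1] * (len(tokens_b) + 1)  # segment 1 covers text B and its <sep>
    return tokens, segments

# Single text, as in single text classification.
single_tokens, single_segments = pack_bert_input(["i", "should", "study"])

# Text pair with two and three tokens, matching the sizes assumed in Fig. 15.6.2.
pair_tokens, pair_segments = pack_bert_input(
    ["a", "plane"], ["an", "airplane", "departs"])
```

The output layers on top of `<cls>` are unchanged between the two cases; only the packed input differs.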

For **text pair regression** tasks such as semantic textual similarity, only trivial changes are needed: the output layer produces **a single continuous label value** instead of a distribution over classes, and training uses the **mean squared error (MSE) loss**, both of which are standard for regression.
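A minimal sketch of these two changes, in plain Python: a linear output layer that maps the `<cls>` representation to one real number, and the mean squared error over a batch. The vectors, weights, and predicted scores are made up for illustration; the target scores are those of the three example pairs listed above.

```python
def regress(cls_repr, weight, bias):
    """Linear output layer producing one continuous value (no softmax)."""
    return sum(w * v for w, v in zip(weight, cls_repr)) + bias

def mse_loss(preds, targets):
    """Mean squared error between predicted and ground-truth scores."""
    return sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(preds)

# One made-up <cls> vector and output layer.
score = regress([1.0, 2.0], weight=[0.5, 1.0], bias=0.5)

# Hypothetical predictions against the similarity scores of the examples above.
preds = [4.5, 3.0, 1.0]
targets = [5.0, 3.0, 0.0]
loss = mse_loss(preds, targets)
```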

## 15.6.3. Text Tagging

Now let us consider **token-level tasks**, such as ***text tagging***, where each token is assigned a label. Among text tagging tasks, ***part-of-speech tagging*** assigns each word a part-of-speech tag (e.g., adjective or determiner) according to the role of the word in the sentence.

**For example**, according to the Penn Treebank II tag set, the sentence `“John Smith ’s car is new”` should be tagged as `“NNP (noun, proper singular) NNP POS (possessive ending) NN (noun, singular or mass) VBZ (verb, third person singular present) JJ (adjective)”`.


<center>
    <img style="border-radius: 0.3125em;
    box-shadow: 0 2px 4px 0 rgba(34,36,38,.12),0 2px 10px 0 rgba(34,36,38,.08);" 
    src="https://d2l.ai/_images/bert-tagging.svg" width = "65%" alt=""/>
    <br>
    <div style="color:orange; border-bottom: 1px solid #d9d9d9;
    display: inline-block;
    color: #999;
    padding: 2px;">
      Fig. 15.6.3 Fine-tuning BERT for text tagging applications, such as part-of-speech tagging. Suppose that the input single text has six tokens.
  	</div>
</center>

Fine-tuning BERT for text tagging applications is illustrated in [Fig. 15.6.3](https://d2l.ai/chapter_natural-language-processing-applications/finetuning-bert.html#fig-bert-tagging). Compared with [Fig. 15.6.1](https://d2l.ai/chapter_natural-language-processing-applications/finetuning-bert.html#fig-bert-one-seq), the only distinction is that in text tagging, the BERT representation of *every token* of the input text is fed into the same extra fully-connected layers to output the label of the token, such as a part-of-speech tag.
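The following sketch illustrates this per-token use of one shared layer. The token vectors, weights, and the three-tag label set are made up for illustration, and a single dense layer stands in for the extra fully-connected layers.

```python
def dense(x, weights, bias):
    """Fully connected layer: one score per tag."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def tag_tokens(token_reprs, weights, bias, tags):
    """Apply the same dense layer to every token and pick the best tag."""
    labels = []
    for x in token_reprs:
        scores = dense(x, weights, bias)  # shared weights at every position
        labels.append(tags[scores.index(max(scores))])
    return labels

tags = ["NNP", "NN", "JJ"]  # a tiny subset of tags (assumed)
weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
bias = [0.0, 0.0, 0.0]

# Made-up BERT representations for three input tokens.
token_reprs = [[2.0, 0.5], [0.1, 1.5], [-1.0, -2.0]]
labels = tag_tokens(token_reprs, weights, bias, tags)
```

Here `labels` contains one tag per input token, whereas single text classification produces one label per sequence.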

## 15.6.4. Question Answering

As another token-level application, ***question answering*** reflects capabilities of reading comprehension. **For example**, the Stanford Question Answering Dataset (SQuAD v1.1) consists of reading passages and questions, where the answer to every question is just a segment of text (text span) from the passage that the question is about [[Rajpurkar et al., 2016]](https://d2l.ai/chapter_references/zreferences.html#rajpurkar-zhang-lopyrev-ea-2016). To explain, consider **a passage** `“Some experts report that a mask’s efficacy is inconclusive. However, mask makers insist that their products, such as N95 respirator masks, can guard against the virus.”` and a **question** `“Who say that N95 respirator masks can guard against the virus?”`. The **answer** should be the text span `“mask makers”` in the passage. Thus, the goal in SQuAD v1.1 is to predict the start and end of the text span in the passage given a pair of question and passage.


<center>
    <img style="border-radius: 0.3125em;
    box-shadow: 0 2px 4px 0 rgba(34,36,38,.12),0 2px 10px 0 rgba(34,36,38,.08);" 
    src="https://d2l.ai/_images/bert-qa.svg" width = "65%" alt=""/>
    <br>
    <div style="color:orange; border-bottom: 1px solid #d9d9d9;
    display: inline-block;
    color: #999;
    padding: 2px;">
      Fig. 15.6.4 Fine-tuning BERT for question answering. Suppose that the input text pair has two and three tokens.
  	</div>
</center>

To fine-tune BERT for **question answering**, the **question** and **passage** are packed as the **first** and **second text sequence**, respectively, in the input of BERT.

- To predict the position of **the start of the text span**,
  - the same additional fully-connected layer **transforms** the BERT representation of any `token` **from the passage** at position $i$ **into** a `scalar score` $s_i$.
  - The `scores` of **all** the passage tokens **are further transformed** by the `softmax` operation **into** a `probability distribution`, so that each token position $i$ in the passage is assigned a probability $p_i$ of being **the start of the text span**.
- Predicting **the end of the text span** works the same way, except that the parameters of its additional fully-connected layer are independent of those used for predicting the start.
  - When predicting the end, any passage token at position $i$ is transformed by this second fully-connected layer, shared across all positions, into a `scalar score` $e_i$.
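The start and end scoring described above can be sketched as follows. The passage vectors and layer parameters are made up for illustration; the point is that one layer (`start_w`, `start_b`) is shared across all passage positions, a second independent layer (`end_w`, `end_b`) produces the end scores, and softmax is taken over passage positions.

```python
import math

def score_tokens(passage_reprs, weight, bias):
    """Map each passage token representation to one scalar score."""
    return [sum(w * v for w, v in zip(weight, x)) + bias for x in passage_reprs]

def softmax(scores):
    """Normalize scores over passage positions into probabilities."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up BERT representations of three passage tokens.
passage_reprs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Two independent layers: one for the start, one for the end.
start_w, start_b = [2.0, 0.0], 0.0
end_w, end_b = [0.0, 2.0], 0.0

start_scores = score_tokens(passage_reprs, start_w, start_b)  # s_i
end_scores = score_tokens(passage_reprs, end_w, end_b)        # e_i
start_probs = softmax(start_scores)                           # p_i
end_probs = softmax(end_scores)
```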

[Fig. 15.6.4](https://d2l.ai/chapter_natural-language-processing-applications/finetuning-bert.html#fig-bert-qa) depicts fine-tuning BERT for question answering.

For question answering, the training objective of supervised learning is simply to maximize the log-likelihoods of the ground-truth start and end positions. When **predicting the span**, we can compute the score $s_i + e_j$ for a valid span from position $i$ to position $j$ ($i \leq j$), and output the span with the highest score.
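Both parts of this paragraph can be sketched in a few lines of plain Python. The scores below are made up, and the exhaustive search over all $(i, j)$ pairs is the simplest possible choice, not necessarily what a production system would use (which might, for example, also cap the span length).

```python
import math

def log_softmax(scores):
    """Log-probabilities over passage positions."""
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_z for s in scores]

def span_loss(start_scores, end_scores, start, end):
    """Negative log-likelihood of the ground-truth start and end positions.

    Minimizing this is the same as maximizing the two log-likelihoods.
    """
    return -(log_softmax(start_scores)[start] + log_softmax(end_scores)[end])

def best_span(start_scores, end_scores):
    """Return the valid span (i <= j) with the highest score s_i + e_j."""
    best, best_score = None, -math.inf
    for i, s in enumerate(start_scores):
        for j in range(i, len(end_scores)):  # enforce i <= j
            score = s + end_scores[j]
            if score > best_score:
                best, best_score = (i, j), score
    return best, best_score

start_scores = [0.1, 2.0, 0.3, 1.5]  # made-up s_i
end_scores = [3.0, 0.2, 2.5, 0.4]    # made-up e_j
span, score = best_span(start_scores, end_scores)
loss = span_loss(start_scores, end_scores, start=1, end=2)
```

In this example the largest unconstrained sum would pair start position 1 with end position 0, but that span is invalid because it ends before it starts; the search therefore returns the span from position 1 to position 2.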