<a href="https://colab.research.google.com/github/Leoli04/llms-notebooks/blob/main/huggingface/hf_nlp_07_main_NLP_tasks_1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

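## Main NLP tasks (Part 1)

### Introduction

In this chapter, we will tackle the following common NLP tasks:

- Token classification
- Masked language modeling (like BERT)
- Summarization
- Translation
- Causal language modeling pretraining (like GPT-2)
- Question answering

The Trainer API is well suited to fine-tuning or training your model without worrying about what happens behind the scenes in the training loop.
Accelerate makes it easier to customize any part of that loop you want.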

In [None]:
!pip install datasets evaluate transformers[sentencepiece]
!pip install accelerate
# To run the training on TPU, you will need to uncomment the following line:
# !pip install cloud-tpu-client==0.10 torch==1.9.0 https://storage.googleapis.com/tpu-pytorch/wheels/torch_xla-1.9-cp37-cp37m-linux_x86_64.whl
!apt install git-lfs

In [None]:
pip install transformers[torch] -U


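### Token classification

The main tasks of this kind are:

- Named entity recognition (NER): finding the entities in a sentence (such as persons, locations, or organizations).
- Part-of-speech tagging (POS): marking each word in a sentence as corresponding to a particular part of speech (such as noun, verb, adjective, etc.).
- Chunking: finding the tokens that belong to the same entity. This task (which can be combined with POS or NER) can be framed as assigning one label (usually B-) to any token at the beginning of a chunk, another label (usually I-) to tokens inside a chunk, and a third label (usually O) to tokens that do not belong to any chunk.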
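To make the B-/I-/O scheme concrete, here is a minimal sketch that turns chunk spans into per-token tags. The helper name `spans_to_bio` and the `(start, end, type)` span format with an exclusive end are assumptions made for illustration; they are not part of the dataset or of any library.

```python
def spans_to_bio(num_tokens, spans):
    """Convert (start, end, type) spans (end exclusive) into BIO tags."""
    tags = ["O"] * num_tokens  # tokens outside every chunk keep the O tag
    for start, end, chunk_type in spans:
        tags[start] = f"B-{chunk_type}"  # first token of the chunk
        for i in range(start + 1, end):
            tags[i] = f"I-{chunk_type}"  # remaining tokens inside the chunk
    return tags


tokens = ["the", "European", "Union", "'s", "veterinary", "committee"]
# Hypothetical annotation: tokens 1 and 2 form one ORG chunk.
tags = spans_to_bio(len(tokens), [(1, 3, "ORG")])
print(list(zip(tokens, tags)))
```

Every token receives exactly one tag, which is what lets chunking be treated as an ordinary per-token classification problem.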

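#### Preparing the data

First, we need a dataset suitable for token classification. In this section we will use the CoNLL-2003 dataset, which contains news stories from Reuters.

##### The CoNLL-2003 dataset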

In [None]:
from datasets import load_dataset

raw_datasets = load_dataset("conll2003")

raw_datasets

You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this dataset from the next major release of `datasets`.


DatasetDict({
    train: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
        num_rows: 14041
    })
    validation: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
        num_rows: 3250
    })
    test: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
        num_rows: 3453
    })
})

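As we can see, the dataset contains labels for the three tasks mentioned earlier: NER, POS, and chunking.

In [None]:
# Look at the first element of the training set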
raw_datasets["train"][0]["tokens"]

['EU', 'rejects', 'German', 'call', 'to', 'boycott', 'British', 'lamb', '.']

In [None]:
# Since we want to perform named entity recognition, we look at the NER tags:
raw_datasets["train"][0]["ner_tags"]

[3, 0, 7, 0, 0, 0, 7, 0, 0]

The correspondence between these integers and the label names can be found in the dataset's features attribute:

In [None]:
ner_feature = raw_datasets["train"].features["ner_tags"]
ner_feature

Sequence(feature=ClassLabel(names=['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC'], id=None), length=-1, id=None)

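This column contains elements that are sequences of ClassLabel. The type of the sequence elements is stored in the feature attribute of ner_feature, and the list of names is available through the names attribute of that feature: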

In [None]:
label_names = ner_feature.feature.names
label_names

['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC']

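We already saw these labels when using the token-classification pipeline in an earlier chapter; here is a quick recap:

- O means the word does not correspond to any entity.
- B-PER / I-PER means the word corresponds to the beginning of / is inside a person entity.
- B-ORG / I-ORG means the word corresponds to the beginning of / is inside an organization entity.
- B-LOC / I-LOC means the word corresponds to the beginning of / is inside a location entity.
- B-MISC / I-MISC means the word corresponds to the beginning of / is inside a miscellaneous entity.

In [None]:
# Decode the labels
# Show the tokens and NER labels of the training examples at indices 0 and 4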
for i in (0,4):
  words = raw_datasets["train"][i]["tokens"]
  labels = raw_datasets["train"][i]["ner_tags"]

  line1 = ""
  line2 = ""
  for word, label in zip(words, labels):
      full_label = label_names[label]
      max_length = max(len(word), len(full_label))
      # Append the word to line1, followed by enough spaces to keep the word and label columns aligned.
      line1 += word + " " * (max_length - len(word) + 1)
      # Append the full label to line2, padded the same way.
      line2 += full_label + " " * (max_length - len(full_label) + 1)

  print(line1)
  print(line2)

EU    rejects German call to boycott British lamb . 
B-ORG O       B-MISC O    O  O       B-MISC  O    O 
Germany 's representative to the European Union 's veterinary committee Werner Zwingmann said on Wednesday consumers should buy sheepmeat from countries other than Britain until the scientific advice was clearer . 
B-LOC   O  O              O  O   B-ORG    I-ORG O  O          O         B-PER  I-PER     O    O  O         O         O      O   O         O    O         O     O    B-LOC   O     O   O          O      O   O       O 


As we can see, for entities spanning two words, such as "European Union" and "Werner Zwingmann", the first word gets a B- label and the second an I- label.
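The B-/I- convention can also be read in the other direction to recover whole entities from a tagged sentence. The following is a small sketch of that decoding; `bio_to_spans` is a hypothetical helper written for this note, and the sentence is a shortened version of the second example above.

```python
def bio_to_spans(words, tags):
    """Group words into (entity_type, text) pairs from BIO tags."""
    spans = []
    current_type, current_words = None, []
    for word, tag in zip(words, tags):
        # A B- tag, or an I- tag of a different type, starts a new entity.
        starts_new = tag.startswith("B-") or (
            tag.startswith("I-") and tag[2:] != current_type
        )
        if tag == "O" or starts_new:
            if current_words:  # close the entity that was open so far
                spans.append((current_type, " ".join(current_words)))
            current_type, current_words = None, []
        if tag != "O":
            current_type = tag[2:]
            current_words.append(word)
    if current_words:  # close an entity that runs to the end of the sentence
        spans.append((current_type, " ".join(current_words)))
    return spans


words = ["the", "European", "Union", "'s", "committee", "Werner", "Zwingmann", "said"]
tags = ["O", "B-ORG", "I-ORG", "O", "O", "B-PER", "I-PER", "O"]
entities = bio_to_spans(words, tags)
print(entities)
```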

In [None]:
pos_feature = raw_datasets["train"].features["pos_tags"]
label_names_pos = pos_feature.feature.names
# Show the tokens and POS labels of the training examples at indices 0 and 4
for i in (0,4):
  words = raw_datasets["train"][i]["tokens"]
  labels = raw_datasets["train"][i]["pos_tags"]

  line1 = ""
  line2 = ""
  for word, label in zip(words, labels):
      full_label = label_names_pos[label]
      max_length = max(len(word), len(full_label))
      # Append the word to line1, followed by enough spaces to keep the word and label columns aligned.
      line1 += word + " " * (max_length - len(word) + 1)
      # Append the full label to line2, padded the same way.
      line2 += full_label + " " * (max_length - len(full_label) + 1)

  print(line1)
  print(line2)

EU  rejects German call to boycott British lamb . 
NNP VBZ     JJ     NN   TO VB      JJ      NN   . 
Germany 's  representative to the European Union 's  veterinary committee Werner Zwingmann said on Wednesday consumers should buy sheepmeat from countries other than Britain until the scientific advice was clearer . 
NNP     POS NN             TO DT  NNP      NNP   POS JJ         NN        NNP    NNP       VBD  IN NNP       NNS       MD     VB  NN        IN   NNS       JJ    IN   NNP     IN    DT  JJ         NN     VBD JJR     . 


##### Processing the data

In [None]:
from transformers import AutoTokenizer
# Create the tokenizer object
model_checkpoint = "bert-base-cased"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

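You can replace model_checkpoint with any other model you prefer from the Hub, or with a local folder in which you have saved a pretrained model and tokenizer. The only constraint is that the tokenizer must be backed by the 🤗 Tokenizers library, so that a "fast" version is available. You can see all the architectures that come with a fast version in this [big table](https://huggingface.co/transformers/#supported-frameworks). To check that the tokenizer object you are using is indeed backed by 🤗 Tokenizers, look at its is_fast attribute: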

In [None]:
tokenizer.is_fast

True

In [None]:
# Tokenize input that is already split into words (is_split_into_words=True), converting it to token IDs the model can consume
inputs = tokenizer(raw_datasets["train"][0]["tokens"], is_split_into_words=True)
inputs.tokens()



['[CLS]',
 'EU',
 'rejects',
 'German',
 'call',
 'to',
 'boycott',
 'British',
 'la',
 '##mb',
 '.',
 '[SEP]']

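The tokenizer added the special tokens used by the model ([CLS] at the beginning and [SEP] at the end) and left most words untouched. The word lamb, however, was split into two subwords, la and ##mb. This creates a mismatch between inputs and labels: the label list has 9 elements, while the input now has 12 tokens. A fast tokenizer makes it easy to map each token back to its word.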

In [None]:
# Get the index of the word each token belongs to
inputs.word_ids()

[None, 0, 1, 2, 3, 4, 5, 6, 7, 7, 8, None]
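As an illustration of what the word IDs give us, the sketch below regroups the sub-tokens into the original words. It works on the token and word-ID lists printed above, copied in as plain Python lists, and assumes the WordPiece convention that continuation pieces start with `##`.

```python
tokens = ["[CLS]", "EU", "rejects", "German", "call", "to", "boycott",
          "British", "la", "##mb", ".", "[SEP]"]
word_ids = [None, 0, 1, 2, 3, 4, 5, 6, 7, 7, 8, None]

words = {}
for token, word_id in zip(tokens, word_ids):
    if word_id is None:
        continue  # special tokens such as [CLS] and [SEP] belong to no word
    # Strip the WordPiece continuation prefix before gluing pieces together.
    piece = token[2:] if token.startswith("##") else token
    words[word_id] = words.get(word_id, "") + piece

rebuilt = [words[i] for i in sorted(words)]
print(rebuilt)
```

Tokens 8 and 9 share word ID 7, so they are merged back into the single word they came from.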

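The first rule we apply is that special tokens get a label of -100. This is because -100 is, by default, the index ignored by the loss function we will use (cross entropy). Then, the first token of each word gets the label of that word. Tokens inside a word but not at its start receive the same label, since they are part of the same entity, except that B- is replaced with I- (because such a token does not begin the entity):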

In [None]:
def align_labels_with_tokens(labels, word_ids):
    new_labels = []
    current_word = None
    for word_id in word_ids:
        if word_id != current_word:
            # Start of a new word!
            current_word = word_id
            label = -100 if word_id is None else labels[word_id]
            new_labels.append(label)
        elif word_id is None:
            # Special token
            new_labels.append(-100)
        else:
            # Same word as previous token
            label = labels[word_id]
            # If the label is B-XXX (an odd ID), change it to I-XXX (the next ID)
            if label % 2 == 1:
                label += 1
            new_labels.append(label)

    return new_labels

In [None]:
labels = raw_datasets["train"][0]["ner_tags"]
word_ids = inputs.word_ids()
print(labels)
print(word_ids)
print(align_labels_with_tokens(labels, word_ids))

[3, 0, 7, 0, 0, 0, 7, 0, 0]
[None, 0, 1, 2, 3, 4, 5, 6, 7, 7, 8, None]
[-100, 3, 0, 7, 0, 0, 0, 7, 0, 0, 0, -100]
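In the output above the only split word, lamb, carries the O label, so the B- to I- conversion is not visible. The sketch below runs the same rules on a made-up input in which an entity word is split into three sub-tokens. The function is restated in compact form only so that the snippet runs on its own; the input values are hypothetical.

```python
label_names = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG",
               "B-LOC", "I-LOC", "B-MISC", "I-MISC"]


def align_labels_with_tokens(labels, word_ids):
    # Compact restatement of the alignment rules described above.
    new_labels, previous_word = [], None
    for word_id in word_ids:
        if word_id is None:
            new_labels.append(-100)  # special token, ignored by the loss
        elif word_id != previous_word:
            new_labels.append(labels[word_id])  # first token of a word
        else:
            label = labels[word_id]
            # Later tokens of the same word: B-XXX (odd ID) becomes I-XXX.
            new_labels.append(label + 1 if label % 2 == 1 else label)
        previous_word = word_id
    return new_labels


# Hypothetical input: one B-PER word split into three sub-tokens, then an O word.
labels = [1, 0]
word_ids = [None, 0, 0, 0, 1, None]
aligned = align_labels_with_tokens(labels, word_ids)
print(aligned)
print([label_names[i] if i != -100 else "IGN" for i in aligned])
```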


In [None]:
def tokenize_and_align_labels(examples):
    # Token IDs for the whole batch of examples
    tokenized_inputs = tokenizer(
        examples["tokens"], truncation=True, is_split_into_words=True
    )
    # Labels for every example in the batch
    all_labels = examples["ner_tags"]
    new_labels = []
    for i, labels in enumerate(all_labels):
        word_ids = tokenized_inputs.word_ids(i)
        new_labels.append(align_labels_with_tokens(labels, word_ids))

    tokenized_inputs["labels"] = new_labels
    return tokenized_inputs

In [None]:
# Process the dataset in batches
tokenized_datasets = raw_datasets.map(
    tokenize_and_align_labels,
    batched=True,
    remove_columns=raw_datasets["train"].column_names,
)

Map:   0%|          | 0/14041 [00:00<?, ? examples/s]

Map:   0%|          | 0/3250 [00:00<?, ? examples/s]

Map:   0%|          | 0/3453 [00:00<?, ? examples/s]

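With the steps above, data preprocessing is complete: tokenize_and_align_labels adds a labels field to the tokenizer output. Its values are produced by align_labels_with_tokens, which keeps the labels aligned with the tokens and converts B-XXX to I-XXX for non-initial sub-tokens, in line with the BIO (Begin-Inside-Outside) tagging scheme:

- B-XXX marks the beginning of an entity.
- I-XXX marks the inside of an entity.
- O marks tokens outside any entity.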

In [None]:
tokenized_datasets

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'token_type_ids', 'attention_mask', 'labels'],
        num_rows: 14041
    })
    validation: Dataset({
        features: ['input_ids', 'token_type_ids', 'attention_mask', 'labels'],
        num_rows: 3250
    })
    test: Dataset({
        features: ['input_ids', 'token_type_ids', 'attention_mask', 'labels'],
        num_rows: 3453
    })
})

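#### Fine-tuning the model with the Trainer API

##### Data collation

We cannot just use DataCollatorWithPadding as in Chapter 3, because it only pads the inputs (input IDs, attention mask, and token type IDs). Here the labels must be padded in exactly the same way as the inputs so that they stay the same size, using -100 as the value so that the corresponding predictions are ignored in the loss computation. All of this is done by DataCollatorForTokenClassification.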

In [None]:
# Data collation: pad every example in a batch to the same length
from transformers import DataCollatorForTokenClassification

data_collator = DataCollatorForTokenClassification(tokenizer=tokenizer)

In [None]:
# Try it on two examples
batch = data_collator([tokenized_datasets["train"][i] for i in range(2)])
batch["labels"]

tensor([[-100,    3,    0,    7,    0,    0,    0,    7,    0,    0,    0, -100],
        [-100,    1,    2, -100, -100, -100, -100, -100, -100, -100, -100, -100]])

In [None]:
for i in range(2):
    print(tokenized_datasets["train"][i]["labels"])

[-100, 3, 0, 7, 0, 0, 0, 7, 0, 0, 0, -100]
[-100, 1, 2, -100]
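The difference between the two outputs is only the padding of the labels. As a rough sketch of that step, the plain-Python helper below pads each label list on the right with -100 up to the longest list in the batch. `pad_labels` is a hypothetical name, and right-side padding is an assumption that matches the tensor printed above.

```python
def pad_labels(batch_labels, pad_value=-100):
    """Right-pad every label list to the longest length in the batch."""
    max_length = max(len(labels) for labels in batch_labels)
    return [
        labels + [pad_value] * (max_length - len(labels))
        for labels in batch_labels
    ]


batch_labels = [
    [-100, 3, 0, 7, 0, 0, 0, 7, 0, 0, 0, -100],
    [-100, 1, 2, -100],
]
padded = pad_labels(batch_labels)
for row in padded:
    print(row)
```

Because the padding value is -100, the padded positions contribute nothing to the cross-entropy loss.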


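##### Metrics

To have the Trainer compute a metric every epoch, we need to define a compute_metrics() function that takes the arrays of predictions and labels and returns a dictionary with the metric names and values.

The traditional framework used to evaluate token classification predictions is seqeval. This metric does not behave like standard accuracy: it takes the lists of labels as strings, not integers, so we need to fully decode the predictions and labels before passing them to the metric.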

In [None]:
!pip install seqeval

In [None]:
import evaluate

metric = evaluate.load("seqeval")

Downloading builder script:   0%|          | 0.00/6.34k [00:00<?, ?B/s]

In [None]:
labels = raw_datasets["train"][0]["ner_tags"]
labels = [label_names[i] for i in labels]
print(raw_datasets["train"][0]["tokens"])
print(labels)

['EU', 'rejects', 'German', 'call', 'to', 'boycott', 'British', 'lamb', '.']
['B-ORG', 'O', 'B-MISC', 'O', 'O', 'O', 'B-MISC', 'O', 'O']


In [None]:
predictions = labels.copy()
predictions[2] = "O"
metric.compute(predictions=[predictions], references=[labels])

{'MISC': {'precision': 1.0,
  'recall': 0.5,
  'f1': 0.6666666666666666,
  'number': 2},
 'ORG': {'precision': 1.0, 'recall': 1.0, 'f1': 1.0, 'number': 1},
 'overall_precision': 1.0,
 'overall_recall': 0.6666666666666666,
 'overall_f1': 0.8,
 'overall_accuracy': 0.8888888888888888}


We get the precision, recall, and F1 score for each separate entity type, as well as overall.
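To see where the overall numbers come from, the following sketch recomputes them at the entity level: an entity counts as correct only if its type and its exact span both match. This is a simplified approximation written for this note, not seqeval's actual implementation, and the helper names are hypothetical.

```python
def extract_entities(tags):
    """Return the set of (type, start, end) spans encoded by a BIO tag list."""
    entities, start, current = set(), None, None
    for i, tag in enumerate(tags + ["O"]):  # the sentinel closes a trailing span
        continues = tag.startswith("I-") and tag[2:] == current
        if current is not None and not continues:
            entities.add((current, start, i))
            current = None
        if tag.startswith("B-") or (tag.startswith("I-") and current is None):
            current, start = tag[2:], i
    return entities


def entity_scores(predictions, references):
    pred, ref = extract_entities(predictions), extract_entities(references)
    true_positives = len(pred & ref)  # exact type-and-span matches
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(ref) if ref else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


references = ["B-ORG", "O", "B-MISC", "O", "O", "O", "B-MISC", "O", "O"]
predictions = references.copy()
predictions[2] = "O"  # miss one of the two MISC entities
precision, recall, f1 = entity_scores(predictions, references)
print(precision, recall, f1)
```

Two of the three reference entities are found and nothing spurious is predicted, which gives precision 1.0, recall 2/3, and F1 0.8, in agreement with the overall scores above.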

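Next we define a compute_metrics() function that keeps only the overall scores.
It first takes the argmax of the logits to convert them to predictions.
We then convert both labels and predictions from integers to strings, dropping every position where the label is -100, and pass the results to the metric.compute() method: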

In [None]:
import numpy as np

# Keep only the overall scores
def compute_metrics(eval_preds):
    logits, labels = eval_preds
    predictions = np.argmax(logits, axis=-1)

    # Remove ignored index (special tokens) and convert to labels
    true_labels = [[label_names[l] for l in label if l != -100] for label in labels]
    true_predictions = [
        [label_names[p] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]
    all_metrics = metric.compute(predictions=true_predictions, references=true_labels)
    return {
        "precision": all_metrics["overall_precision"],
        "recall": all_metrics["overall_recall"],
        "f1": all_metrics["overall_f1"],
        "accuracy": all_metrics["overall_accuracy"],
    }

##### Defining the model

In [None]:
id2label = {i: label for i, label in enumerate(label_names)}
label2id = {v: k for k, v in id2label.items()}

In [None]:
from transformers import AutoModelForTokenClassification

model = AutoModelForTokenClassification.from_pretrained(
    model_checkpoint,
    id2label=id2label,
    label2id=label2id,
)

model.safetensors:   0%|          | 0.00/436M [00:00<?, ?B/s]

Some weights of BertForTokenClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
model.config.num_labels

9

##### Fine-tuning the model

In [None]:
!pip install transformers[torch] -U

In [None]:
from transformers import TrainingArguments
# Define the training arguments
args = TrainingArguments(
    "bert-finetuned-ner",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    num_train_epochs=3,
    weight_decay=0.01,
    push_to_hub=True,
)



In [None]:
from transformers import Trainer
# Pass everything to the Trainer and launch training
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    compute_metrics=compute_metrics,
    tokenizer=tokenizer,
)
trainer.train()

Epoch,Training Loss,Validation Loss,Precision,Recall,F1,Accuracy
1,0.0769,0.069989,0.907007,0.932346,0.919502,0.981604
2,0.0371,0.069327,0.932692,0.946819,0.939703,0.985062


In [None]:
# Upload the model to the Hub
trainer.push_to_hub(commit_message="Training complete")

#### A custom training loop

##### Preparing everything for training

In [None]:
from torch.utils.data import DataLoader

train_dataloader = DataLoader(
    tokenized_datasets["train"],
    shuffle=True,
    collate_fn=data_collator,
    batch_size=8,
)
eval_dataloader = DataLoader(
    tokenized_datasets["validation"], collate_fn=data_collator, batch_size=8
)

In [None]:
model = AutoModelForTokenClassification.from_pretrained(
    model_checkpoint,
    id2label=id2label,
    label2id=label2id,
)

In [None]:
from torch.optim import AdamW
# Optimizer
optimizer = AdamW(model.parameters(), lr=2e-5)

In [None]:
from accelerate import Accelerator

accelerator = Accelerator()
model, optimizer, train_dataloader, eval_dataloader = accelerator.prepare(
    model, optimizer, train_dataloader, eval_dataloader
)

In [None]:
from transformers import get_scheduler

num_train_epochs = 3
num_update_steps_per_epoch = len(train_dataloader)
num_training_steps = num_train_epochs * num_update_steps_per_epoch

lr_scheduler = get_scheduler(
    "linear",
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=num_training_steps,
)
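For intuition about what the "linear" schedule does, the sketch below computes the learning rate by hand: with no warmup it decays in a straight line from the initial value to zero over all training steps. The `linear_lr` helper is hypothetical, and the step count assumes a single process, a batch size of 8, and the 14,041 training examples shown earlier.

```python
import math


def linear_lr(step, base_lr, num_training_steps, num_warmup_steps=0):
    """Learning rate at a given step for linear warmup followed by linear decay."""
    if step < num_warmup_steps:
        return base_lr * step / max(1, num_warmup_steps)
    remaining = num_training_steps - step
    decay_steps = max(1, num_training_steps - num_warmup_steps)
    return base_lr * max(0.0, remaining / decay_steps)


num_train_examples = 14041
batch_size = 8
# The last, smaller batch still counts as one step, hence the ceiling.
steps_per_epoch = math.ceil(num_train_examples / batch_size)
num_training_steps = 3 * steps_per_epoch

for step in (0, num_training_steps // 2, num_training_steps):
    print(step, linear_lr(step, 2e-5, num_training_steps))
```

When training on several processes, each process sees a shorter dataloader, which is why the step count is computed after accelerator.prepare() in the cell above.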

In [None]:
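# Clone the Hub repository into a local folder (helpers assumed to come from huggingface_hub)
from huggingface_hub import Repository, get_full_repo_name

output_dir = "bert-finetuned-ner-accelerate"
repo_name = get_full_repo_name(output_dir)
repo = Repository(output_dir, clone_from=repo_name)

We can now upload anything saved in output_dir by calling the repo.push_to_hub() method.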

##### Training loop

In [None]:
def postprocess(predictions, labels):
    predictions = predictions.detach().cpu().clone().numpy()
    labels = labels.detach().cpu().clone().numpy()

    # Remove ignored index (special tokens) and convert to labels
    true_labels = [[label_names[l] for l in label if l != -100] for label in labels]
    true_predictions = [
        [label_names[p] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]
    return true_labels, true_predictions

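After defining a progress bar to follow how training goes, the loop has three parts:
- Training itself, which is the classic iteration over train_dataloader: a forward pass through the model, then a backward pass and an optimizer step.
- Evaluation, which has one novelty after getting the model outputs on a batch: since two processes may have padded the inputs and labels to different shapes, we need to use accelerator.pad_across_processes() to make the predictions and labels the same shape before calling the gather() method. Without this, evaluation will either error out or hang forever. We then send the results to metric.add_batch() and call metric.compute() once the evaluation loop is over.
- Saving and uploading, where we first save the model and the tokenizer, then call repo.push_to_hub(). Note that we pass blocking=False to tell the 🤗 Hub library to push asynchronously. That way training continues normally and this (long) instruction runs in the background.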
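The reason for padding before gathering can be shown without any distributed setup. In the sketch below, nested Python lists stand in for the tensors held by two processes; the helper names and the values are hypothetical, and the real pad_across_processes() and gather() calls operate on tensors across devices.

```python
def pad_to_common_length(per_process_batches, pad_index=-100):
    """Pad every row to the longest sequence length found on any process."""
    max_length = max(len(row) for batch in per_process_batches for row in batch)
    return [
        [row + [pad_index] * (max_length - len(row)) for row in batch]
        for batch in per_process_batches
    ]


def gather(per_process_batches):
    """Concatenate the batches of all processes along the batch dimension."""
    return [row for batch in per_process_batches for row in batch]


process_0 = [[3, 0, 7, 0], [0, 0, 1, 2]]              # sequence length 4
process_1 = [[0, 5, 0, 0, 0, 0], [1, 2, 0, 0, 0, 0]]  # sequence length 6

padded = pad_to_common_length([process_0, process_1])
gathered = gather(padded)
for row in gathered:
    print(row)
```

After padding, every row has the same length, so the rows can be stacked into one rectangular batch. The -100 padding is later dropped by postprocess(), as long as labels are padded with the same value.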

Here is the complete code for the training loop:

In [None]:
from tqdm.auto import tqdm
import torch

progress_bar = tqdm(range(num_training_steps))

for epoch in range(num_train_epochs):
    # Training
    model.train()
    for batch in train_dataloader:
        outputs = model(**batch)
        loss = outputs.loss
        accelerator.backward(loss)

        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)

    # Evaluation
    model.eval()
    for batch in eval_dataloader:
        with torch.no_grad():
            outputs = model(**batch)

        predictions = outputs.logits.argmax(dim=-1)
        labels = batch["labels"]

        # Necessary to pad predictions and labels for being gathered
        predictions = accelerator.pad_across_processes(predictions, dim=1, pad_index=-100)
        labels = accelerator.pad_across_processes(labels, dim=1, pad_index=-100)

        predictions_gathered = accelerator.gather(predictions)
        labels_gathered = accelerator.gather(labels)

        true_labels, true_predictions = postprocess(predictions_gathered, labels_gathered)
        metric.add_batch(predictions=true_predictions, references=true_labels)

    results = metric.compute()
    print(
        f"epoch {epoch}:",
        {
            key: results[f"overall_{key}"]
            for key in ["precision", "recall", "f1", "accuracy"]
        },
    )

    # Save and upload
    accelerator.wait_for_everyone()
    unwrapped_model = accelerator.unwrap_model(model)
    unwrapped_model.save_pretrained(output_dir, save_function=accelerator.save)
    if accelerator.is_main_process:
        tokenizer.save_pretrained(output_dir)
        repo.push_to_hub(
            commit_message=f"Training in progress epoch {epoch}", blocking=False
        )

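Notes on how the model is saved with 🤗 Accelerate:

In [None]:
# Make every process wait until all of them have reached this point, so that each process holds the same model before saving.
accelerator.wait_for_everyone()
# Then grab the unwrapped_model, which is the base model we defined.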
unwrapped_model = accelerator.unwrap_model(model)
unwrapped_model.save_pretrained(output_dir, save_function=accelerator.save)

#### Using the fine-tuned model

In [None]:
from transformers import pipeline

# Replace this with your own checkpoint
model_checkpoint = "huggingface-course/bert-finetuned-ner"
token_classifier = pipeline(
    "token-classification", model=model_checkpoint, aggregation_strategy="simple"
)
token_classifier("My name is Sylvain and I work at Hugging Face in Brooklyn.")

### Fine-tuning a masked language model

In [None]:
!pip install datasets evaluate transformers[sentencepiece]
!pip install accelerate
# To run the training on TPU, you will need to uncomment the following line:
# !pip install cloud-tpu-client==0.10 torch==1.9.0 https://storage.googleapis.com/tpu-pytorch/wheels/torch_xla-1.9-cp37-cp37m-linux_x86_64.whl
!apt install git-lfs

In [None]:
!pip install transformers[torch] -U

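In some cases you will want to first fine-tune the language model on your data before training a task-specific head. For example, if your dataset contains legal contracts or scientific articles, a vanilla Transformer model like BERT will typically treat the domain-specific words in your corpus as rare tokens, and the resulting performance may be less than satisfactory. By fine-tuning the language model on in-domain data you can boost the performance of many downstream tasks, which means you usually only have to do this step once!

This process of fine-tuning a pretrained language model on in-domain data is usually called domain adaptation.

#### Picking a pretrained model for masked language modeling

On the Hugging Face Hub, candidate models can be found by applying the "Fill-Mask" filter.

Although the BERT and RoBERTa families of models are the most downloaded, we will use a model called DistilBERT, which can be trained much faster with little to no loss in downstream performance. This model was trained with a special technique called knowledge distillation, in which a large "teacher model" like BERT guides the training of a "student model" that has far fewer parameters.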

In [None]:
from transformers import AutoModelForMaskedLM
# Download the model
model_checkpoint = "distilbert-base-uncased"
model = AutoModelForMaskedLM.from_pretrained(model_checkpoint)

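We can see how many parameters this model has by calling the num_parameters() method.

DistilBERT has about 67 million parameters, roughly 40% fewer than the 110 million of the BERT base model, which makes training noticeably faster: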

In [None]:
distilbert_num_parameters = model.num_parameters() / 1_000_000
print(f"'>>> DistilBERT number of parameters: {round(distilbert_num_parameters)}M'")
print(f"'>>> BERT number of parameters: 110M'")

'>>> DistilBERT number of parameters: 67M'
'>>> BERT number of parameters: 110M'


In [None]:
from transformers import AutoTokenizer
# Load the DistilBERT tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

In [None]:
text = "This is a great [MASK]."

In [None]:
import torch

inputs = tokenizer(text, return_tensors="pt")
token_logits = model(**inputs).logits
print(token_logits)
# Find the location of [MASK] and extract its logits.
# With only a condition, torch.where returns a tuple of index tensors, one per dimension (rows, then columns).
# inputs["input_ids"] is 2-D and we only need the column (sequence position) indices, hence [1].
mask_token_index = torch.where(inputs["input_ids"] == tokenizer.mask_token_id)[1]
# Take the first element of the batch, then the logits at the [MASK] position; the trailing ":" keeps the logits for the whole vocabulary.
mask_token_logits = token_logits[0, mask_token_index, :]
print(mask_token_logits)
# Pick the [MASK] candidates with the highest logits.
# torch.topk finds the 5 largest logits; .indices holds the top-k token IDs for each row,
# [0] selects the first (and only) [MASK] position, and tolist() converts the tensor to a Python list.
top_5_tokens = torch.topk(mask_token_logits, 5, dim=1).indices[0].tolist()

for token in top_5_tokens:
    print(f"'>>> {text.replace(tokenizer.mask_token, tokenizer.decode([token]))}'")

tensor([[[ -5.5882,  -5.5868,  -5.5958,  ...,  -4.9448,  -4.8174,  -2.9905],
         [-11.9031, -11.8872, -12.0623,  ..., -10.9570, -10.6464,  -8.6324],
         [-11.9604, -12.1520, -12.1278,  ..., -10.0218,  -8.6074,  -8.0971],
         ...,
         [ -4.8228,  -4.6268,  -5.1041,  ...,  -4.2771,  -5.0184,  -3.9428],
         [-11.2945, -11.2388, -11.3857,  ...,  -9.2063,  -9.3411,  -6.1505],
         [ -9.5213,  -9.4632,  -9.5022,  ...,  -8.6561,  -8.4908,  -4.6903]]],
       grad_fn=<ViewBackward0>)
tensor([[-4.8228, -4.6268, -5.1041,  ..., -4.2771, -5.0184, -3.9428]],
       grad_fn=<IndexBackward0>)
'>>> This is a great deal.'
'>>> This is a great success.'
'>>> This is a great adventure.'
'>>> This is a great idea.'
'>>> This is a great feat.'


#### The dataset

In [None]:
from datasets import load_dataset
# Load the dataset
imdb_dataset = load_dataset("imdb")
imdb_dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
})

In [None]:
sample = imdb_dataset["train"].shuffle(seed=42).select(range(3))

for row in sample:
    print(f"\n'>>> Review: {row['text']}'")
    print(f"'>>> Label: {row['label']}'")


'>>> Review: There is no relation at all between Fortier and Profiler but the fact that both are police series about violent crimes. Profiler looks crispy, Fortier looks classic. Profiler plots are quite simple. Fortier's plot are far more complicated... Fortier looks more like Prime Suspect, if we have to spot similarities... The main character is weak and weirdo, but have "clairvoyance". People like to compare, to judge, to evaluate. How about just enjoying? Funny thing too, people writing Fortier looks American but, on the other hand, arguing they prefer American series (!!!). Maybe it's the language, or the spirit, but I think this series is more English than American. By the way, the actors are really good and funny. The acting is not superficial at all...'
'>>> Label: 1'

'>>> Review: This movie is a great. The plot is very true to the book which is a classic written by Mark Twain. The movie starts of with a scene where Hank sings a song with a bunch of kids called "when you stu

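#### Preprocessing the data

For both autoregressive and masked language modeling, a common preprocessing step is to concatenate all the examples and then split the whole corpus into chunks of equal size. The reason for concatenating everything is that individual examples might get truncated if they are too long, which would lose information that could be useful for the language modeling task!

So, to start, we tokenize the corpus as usual, but without setting the truncation=True option in the tokenizer.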

In [None]:
# Tokenize the text and, with a fast tokenizer, keep the word IDs (used later for whole word masking)
def tokenize_function(examples):
    result = tokenizer(examples["text"])
    if tokenizer.is_fast:
        result["word_ids"] = [result.word_ids(i) for i in range(len(result["input_ids"]))]
    return result


# Use batched=True to activate fast multithreading!
tokenized_datasets = imdb_dataset.map(
    tokenize_function, batched=True, remove_columns=["text", "label"]
)
tokenized_datasets

Map:   0%|          | 0/25000 [00:00<?, ? examples/s]

Token indices sequence length is longer than the specified maximum sequence length for this model (532 > 512). Running this sequence through the model will result in indexing errors


DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids'],
        num_rows: 50000
    })
})

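Now that we have tokenized the movie reviews, the next step is to group them together and split the result into chunks. The tokenizer's model_max_length attribute gives the maximum context size, which is an upper bound on the chunk size. The chunk_size of 128 chosen below is smaller than that bound, which keeps memory use manageable on a GPU such as the ones available in Colab.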

In [None]:
tokenizer.model_max_length

512

In [None]:
chunk_size = 128

In [None]:
# Slicing produces a list of lists for each feature
tokenized_samples = tokenized_datasets["train"][:3]

for idx, sample in enumerate(tokenized_samples["input_ids"]):
    print(f"'>>> Review {idx} length: {len(sample)}'")

'>>> Review 0 length: 363'
'>>> Review 1 length: 304'
'>>> Review 2 length: 133'


In [None]:
concatenated_examples = {
    k: sum(tokenized_samples[k], []) for k in tokenized_samples.keys()
}
total_length = len(concatenated_examples["input_ids"])
print(f"'>>> Concatenated reviews length: {total_length}'")

'>>> Concatenated reviews length: 800'


In [None]:
chunks = {
    k: [t[i : i + chunk_size] for i in range(0, total_length, chunk_size)]
    for k, t in concatenated_examples.items()
}

for chunk in chunks["input_ids"]:
    print(f"'>>> Chunk length: {len(chunk)}'")

'>>> Chunk length: 128'
'>>> Chunk length: 128'
'>>> Chunk length: 128'
'>>> Chunk length: 128'
'>>> Chunk length: 128'
'>>> Chunk length: 128'
'>>> Chunk length: 32'


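As you can see in this example, the last chunk will generally be smaller than the maximum chunk size. There are two main strategies for dealing with this:
- Drop the last chunk if it is smaller than chunk_size.
- Pad the last chunk until its length equals chunk_size.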
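For comparison, the second strategy can be sketched as follows. `chunk_with_padding` is a hypothetical helper, the padding ID of 0 is an assumption about the tokenizer, and the list of integers stands in for the 800 concatenated token IDs from the example above. In a real pipeline the padded positions would also need an attention mask of 0 and labels of -100.

```python
def chunk_with_padding(token_ids, chunk_size, pad_id=0):
    """Split token_ids into chunks and pad the last one up to chunk_size."""
    chunks = [
        token_ids[i : i + chunk_size]
        for i in range(0, len(token_ids), chunk_size)
    ]
    if chunks and len(chunks[-1]) < chunk_size:
        missing = chunk_size - len(chunks[-1])
        chunks[-1] = chunks[-1] + [pad_id] * missing
    return chunks


token_ids = list(range(1, 801))  # stand-in for 800 concatenated token IDs
chunks = chunk_with_padding(token_ids, chunk_size=128)
print([len(chunk) for chunk in chunks])
print(chunks[-1][30:36])
```

All seven chunks now have length 128, and the last one ends with 96 padding IDs.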

We will take the first approach here, so let's wrap all of the above logic in a single function that we can apply to the tokenized datasets:

In [None]:
def group_texts(examples):
    # Concatenate all texts
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    # Compute length of concatenated texts
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # We drop the last chunk if it's smaller than chunk_size
    total_length = (total_length // chunk_size) * chunk_size
    # Split by chunks of max_len
    result = {
        k: [t[i : i + chunk_size] for i in range(0, total_length, chunk_size)]
        for k, t in concatenated_examples.items()
    }
    # Create a new labels column
    result["labels"] = result["input_ids"].copy()
    return result

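In the last step of group_texts() we create a new labels column that is a copy of the input_ids column. As we will see shortly, that is because in masked language modeling the objective is to predict randomly masked tokens in the input batch, and the labels column provides the ground truth for the language model to learn from.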

In [None]:
lm_datasets = tokenized_datasets.map(group_texts, batched=True)
lm_datasets

Map:   0%|          | 0/25000 [00:00<?, ? examples/s]

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids', 'labels'],
        num_rows: 61291
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids', 'labels'],
        num_rows: 59904
    })
    unsupervised: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids', 'labels'],
        num_rows: 122957
    })
})

In [None]:
tokenizer.decode(lm_datasets["train"][1]["input_ids"])

"as the vietnam war and race issues in the united states. in between asking politicians and ordinary denizens of stockholm about their opinions on politics, she has sex with her drama teacher, classmates, and married men. < br / > < br / > what kills me about i am curious - yellow is that 40 years ago, this was considered pornographic. really, the sex and nudity scenes are few and far between, even then it's not shot like some cheaply made porno. while my countrymen mind find it shocking, in reality sex and nudity are a major staple in swedish cinema. even ingmar bergman,"

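#### Fine-tuning DistilBERT with the Trainer API

🤗 Transformers has a dedicated DataCollatorForLanguageModeling for this task. We just pass it the tokenizer and an mlm_probability argument that specifies what fraction of the tokens to mask. We pick 15%, which is the amount used for BERT and a common choice in the literature.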

In [None]:
from transformers import DataCollatorForLanguageModeling

data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

In [None]:
# See how the random masking works
samples = [lm_datasets["train"][i] for i in range(2)]
for sample in samples:
    _ = sample.pop("word_ids")

for chunk in data_collator(samples)["input_ids"]:
    print(f"\n'>>> {tokenizer.decode(chunk)}'")


'>>> [CLS] [MASK] rented i am curious [MASK] yellow from my video store because of all the controversy that surrounded it when it [MASK] first released in 1967. i also heard that at first it was seized by u. s. customs if it ever tried to enter this country, therefore being [unused154] fan of films considered " controversial " i really had to see this for myself. < br / > < br restrict > the plot is centered around a young swedish drama student named lena who wants to learn [MASK] she can [MASK] life. in [MASK] she wants to focus her attentions to making some sort of documentary on what the average [MASK]ede thought about certain political issues [MASK]'

'>>> as the vietnam war and race [MASK] [MASK] the united states. in between asking politicians and ordinary den [MASK]ns of stockholm about their [MASK] on [MASK], she [MASK] sex with her drama [MASK], classmates, and married men. < br / > < [MASK] / > what kills resurrected about i am curious - yellow is that 40 years ago, this was
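Besides [MASK], the output above also contains a few unexpected tokens such as [unused154], which is consistent with BERT's masking recipe: of the selected tokens, about 80% are replaced by [MASK], about 10% by a random token, and about 10% are left unchanged. The sketch below imitates that recipe in plain Python. It is an illustration rather than the collator's actual code; `mask_tokens`, the mask ID of 103, and the vocabulary size of 30522 are assumptions, and special tokens are not excluded as a real collator would do.

```python
import random


def mask_tokens(input_ids, mask_token_id, vocab_size, mlm_probability=0.15, seed=0):
    """Return (masked_ids, labels) following an 80/10/10 masking recipe."""
    rng = random.Random(seed)
    masked = list(input_ids)
    labels = [-100] * len(input_ids)  # -100 means "not a prediction target"
    for i, token_id in enumerate(input_ids):
        if rng.random() >= mlm_probability:
            continue  # token not selected for prediction
        labels[i] = token_id  # the model must recover the original token
        roll = rng.random()
        if roll < 0.8:
            masked[i] = mask_token_id  # replace with [MASK]
        elif roll < 0.9:
            masked[i] = rng.randrange(vocab_size)  # replace with a random token
        # otherwise keep the original token in the input
    return masked, labels


input_ids = list(range(1000, 1200))  # stand-in for one chunk of token IDs
masked, labels = mask_tokens(input_ids, mask_token_id=103, vocab_size=30522)
num_targets = sum(label != -100 for label in labels)
print(f"{num_targets} of {len(input_ids)} tokens selected for prediction")
```

Only the selected positions carry a label, so the loss is computed on those positions alone.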

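One technique that can be used when training models for masked language modeling is to mask whole words together, not just individual tokens. This approach is called whole word masking. If we want to use whole word masking, we need to build a data collator ourselves. A data collator is just a function that takes a list of samples and converts them into a batch.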

In [None]:
import collections
import numpy as np

from transformers import default_data_collator

wwm_probability = 0.2


def whole_word_masking_data_collator(features):
    for feature in features:
        word_ids = feature.pop("word_ids")

        # Create a map between words and corresponding token indices
        mapping = collections.defaultdict(list)
        current_word_index = -1
        current_word = None
        for idx, word_id in enumerate(word_ids):
            if word_id is not None:
                if word_id != current_word:
                    current_word = word_id
                    current_word_index += 1
                mapping[current_word_index].append(idx)

        # Randomly mask words
        mask = np.random.binomial(1, wwm_probability, (len(mapping),))
        input_ids = feature["input_ids"]
        labels = feature["labels"]
        new_labels = [-100] * len(labels)
        for word_id in np.where(mask)[0]:
            word_id = word_id.item()
            for idx in mapping[word_id]:
                new_labels[idx] = labels[idx]
                input_ids[idx] = tokenizer.mask_token_id
        feature["labels"] = new_labels

    return default_data_collator(features)
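The key step in this collator is the map from word index to token positions. The sketch below isolates that step so it can be run on small inputs; `build_word_mapping` is a hypothetical name, and the word-ID lists are made up. Words are renumbered from zero inside each chunk because a chunk can start in the middle of one review and continue into the next, where the word IDs restart.

```python
import collections


def build_word_mapping(word_ids):
    """Map a running word index to the token positions that make up that word."""
    mapping = collections.defaultdict(list)
    current_word_index = -1
    current_word = None
    for idx, word_id in enumerate(word_ids):
        if word_id is not None:  # skip special tokens
            if word_id != current_word:
                current_word = word_id
                current_word_index += 1
            mapping[current_word_index].append(idx)
    return dict(mapping)


# One word split into three sub-tokens, surrounded by special tokens.
print(build_word_mapping([None, 0, 1, 1, 1, 2, None]))
# A chunk that spans the end of one review and the start of the next.
print(build_word_mapping([5, 5, 6, None, None, 0, 0]))
```

Masking a word then amounts to masking every position in its list, which is what the inner loop of the collator does.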

In [None]:
samples = [lm_datasets["train"][i] for i in range(2)]
batch = whole_word_masking_data_collator(samples)

for chunk in batch["input_ids"]:
    print(f"\n'>>> {tokenizer.decode(chunk)}'")


'>>> [CLS] [MASK] rented i am curious - yellow from my video [MASK] because [MASK] [MASK] the controversy that [MASK] it when it was first released in 1967 [MASK] i [MASK] [MASK] that [MASK] first [MASK] was seized by u [MASK] [MASK] [MASK] customs if it ever tried to enter this country, [MASK] being a fan of films [MASK] " controversial " i really had to see this for myself. < br / > < br / > the plot is centered around a [MASK] [MASK] [MASK] student named lena [MASK] wants to learn everything she can [MASK] life. in particular she wants to focus her attentions to making some sort of [MASK] on [MASK] the average swede thought about certain [MASK] issues such'

'>>> as the vietnam war and [MASK] issues in [MASK] united states. in [MASK] asking politicians and ordinary denizens [MASK] stockholm about their opinions on politics [MASK] [MASK] has sex with her drama teacher [MASK] [MASK], and [MASK] men [MASK] < br / [MASK] < br / > [MASK] kills me about [MASK] am curious - yellow is that

In [None]:
# Downsample the dataset:
# train_test_split creates new train and test splits; the training set is set to 10,000 examples and the validation set to 10% of that
train_size = 10_000
test_size = int(0.1 * train_size)

downsampled_dataset = lm_datasets["train"].train_test_split(
    train_size=train_size, test_size=test_size, seed=42
)
downsampled_dataset

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids', 'labels'],
        num_rows: 10000
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids', 'labels'],
        num_rows: 1000
    })
})

In [None]:
from transformers import TrainingArguments

batch_size = 64
# Show the training loss with every epoch
logging_steps = len(downsampled_dataset["train"]) // batch_size
model_name = model_checkpoint.split("/")[-1]

training_args = TrainingArguments(
    output_dir=f"{model_name}-finetuned-imdb",
    overwrite_output_dir=True,
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    weight_decay=0.01,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    push_to_hub=True,
    fp16=True,
    logging_steps=logging_steps,
)



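 Here logging_steps is set so that the training loss is tracked once per epoch, and fp16=True enables mixed-precision training.

 If you are using the whole word masking collator, you also need to set remove_unused_columns=False so that the word_ids column is not dropped during training.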
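
The arithmetic behind logging_steps can be checked by hand. This is a small sketch using the dataset size and batch size from the cells above; it assumes a single device and no gradient accumulation, so the numbers would differ in a multi-GPU setup.

```python
import math

train_size = 10_000  # training examples after downsampling
batch_size = 64      # per-device batch size (single device assumed)

# Integer division, as used for logging_steps in the training arguments
logging_steps = train_size // batch_size

# Actual number of optimizer steps in one epoch (the last batch is partial)
steps_per_epoch = math.ceil(train_size / batch_size)

print(logging_steps, steps_per_epoch)  # 156 157
```

Logging every 156 steps against 157 steps per epoch reports the training loss roughly once per epoch, which is the intent of the setting.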

In [None]:
from transformers import Trainer

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=downsampled_dataset["train"],
    eval_dataset=downsampled_dataset["test"],
    data_collator=data_collator,
    tokenizer=tokenizer,
)

In [None]:
trainer.train()

Epoch,Training Loss,Validation Loss



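##### Perplexity for language models

Unlike tasks such as text classification or question answering, where we are given a labeled corpus to train on, language modeling has no explicit labels. So how do we decide what makes a good language model?

Perplexity has several mathematical definitions, but the one we will use is the exponential of the cross-entropy loss. We can therefore compute the perplexity of our pretrained model by using Trainer.evaluate() to calculate the cross-entropy loss on the test set and then taking the exponential of the result:

A lower perplexity score means a better language model.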
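
To make the definition concrete, here is a minimal sketch that computes perplexity from the probabilities a model assigns to the correct tokens. The probability values are made up for illustration, and `perplexity` is a hypothetical helper rather than a library function.

```python
import math


def perplexity(token_probs):
    """Perplexity = exp(mean negative log-likelihood of the target tokens)."""
    nll = [-math.log(p) for p in token_probs]
    mean_cross_entropy = sum(nll) / len(nll)
    return math.exp(mean_cross_entropy)


# A model that gives every correct token a probability of 0.25
uniform_guess = perplexity([0.25, 0.25, 0.25, 0.25])

# A more confident model (illustrative values)
confident = perplexity([0.9, 0.8, 0.95, 0.85])

print(uniform_guess, confident)
```

A model that spreads its probability evenly over four candidates ends up with a perplexity of about 4, so perplexity can be read as the effective number of choices the model is hesitating between.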

In [None]:
import math

# Compute the perplexity on the test set
eval_results = trainer.evaluate()
print(f">>> Perplexity: {math.exp(eval_results['eval_loss']):.2f}")

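#### Fine-tuning DistilBERT with 🤗 Accelerate

We saw that DataCollatorForLanguageModeling also applies random masking at every evaluation, so the perplexity score fluctuates somewhat from one training run to the next. One way to eliminate this source of randomness is to apply the masking once to the whole test set, and then use the default data collator in 🤗 Transformers to collect the batches during evaluation.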
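
The idea of masking once and reusing the result can be sketched without any library. The code below is a simplified illustration only: `random_mask`, the token ids, and the value used for `MASK_TOKEN_ID` are assumptions, and the real collator additionally replaces some selected tokens with random or unchanged tokens.

```python
import random

MASK_TOKEN_ID = 103  # assumed [MASK] id, for illustration only
IGNORE_INDEX = -100  # label value that the loss ignores


def random_mask(input_ids, rng, mlm_probability=0.15):
    """Return (masked_ids, labels); masked positions keep the original id as label."""
    masked_ids, labels = [], []
    for token_id in input_ids:
        if rng.random() < mlm_probability:
            masked_ids.append(MASK_TOKEN_ID)
            labels.append(token_id)
        else:
            masked_ids.append(token_id)
            labels.append(IGNORE_INDEX)
    return masked_ids, labels


# Toy test set of three sequences with made-up token ids
test_set = [[1000 + i for i in range(20)] for _ in range(3)]

# Mask once, up front, and reuse the result for every evaluation pass
rng = random.Random(0)
fixed_eval_set = [random_mask(ids, rng) for ids in test_set]

epoch_1_inputs = [masked for masked, _ in fixed_eval_set]
epoch_2_inputs = [masked for masked, _ in fixed_eval_set]
print(epoch_1_inputs == epoch_2_inputs)  # True: every epoch sees the same masks
```

Because the masked inputs are stored rather than regenerated, any change in the evaluation loss between epochs comes from the model and not from a different draw of masks.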

In [None]:
def insert_random_mask(batch):
    features = [dict(zip(batch, t)) for t in zip(*batch.values())]
    masked_inputs = data_collator(features)
    # Create a new "masked" column for each column in the dataset
    return {"masked_" + k: v.numpy() for k, v in masked_inputs.items()}

In [None]:
downsampled_dataset = downsampled_dataset.remove_columns(["word_ids"])
eval_dataset = downsampled_dataset["test"].map(
    insert_random_mask,
    batched=True,
    remove_columns=downsampled_dataset["test"].column_names,
)
eval_dataset = eval_dataset.rename_columns(
    {
        "masked_input_ids": "input_ids",
        "masked_attention_mask": "attention_mask",
        "masked_labels": "labels",
    }
)

In [None]:
# Data loaders; the evaluation set uses default_data_collator from 🤗 Transformers:
from torch.utils.data import DataLoader
from transformers import default_data_collator

batch_size = 64
train_dataloader = DataLoader(
    downsampled_dataset["train"],
    shuffle=True,
    batch_size=batch_size,
    collate_fn=data_collator,
)
eval_dataloader = DataLoader(
    eval_dataset, batch_size=batch_size, collate_fn=default_data_collator
)

In [None]:
# Load a fresh copy of the pretrained model
model = AutoModelForMaskedLM.from_pretrained(model_checkpoint)


In [None]:
from torch.optim import AdamW
# Optimizer
optimizer = AdamW(model.parameters(), lr=5e-5)

In [None]:
from accelerate import Accelerator
# Prepare everything for training with the Accelerator object
accelerator = Accelerator()
model, optimizer, train_dataloader, eval_dataloader = accelerator.prepare(
    model, optimizer, train_dataloader, eval_dataloader
)

In [None]:
# Specify the learning rate scheduler
from transformers import get_scheduler

num_train_epochs = 3
num_update_steps_per_epoch = len(train_dataloader)
num_training_steps = num_train_epochs * num_update_steps_per_epoch

lr_scheduler = get_scheduler(
    "linear",
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=num_training_steps,
)
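
To see what a linear schedule with no warmup does to the learning rate over the run, here is a small standalone sketch. It follows the usual linear warmup and decay shape and should be read as an approximation for intuition, not as the library's implementation; `linear_lr` is a hypothetical helper, and the step counts assume a single process with the dataset size and batch size used above.

```python
import math


def linear_lr(step, base_lr, num_training_steps, num_warmup_steps=0):
    """Linear warmup (if any) followed by linear decay to zero."""
    if step < num_warmup_steps:
        return base_lr * step / max(1, num_warmup_steps)
    remaining = num_training_steps - step
    decay_span = max(1, num_training_steps - num_warmup_steps)
    return base_lr * max(0.0, remaining / decay_span)


steps_per_epoch = math.ceil(10_000 / 64)  # 157 batches, single process assumed
num_training_steps = 3 * steps_per_epoch  # 471 steps over 3 epochs

lr_start = linear_lr(0, 5e-5, num_training_steps)
lr_middle = linear_lr(num_training_steps // 2, 5e-5, num_training_steps)
lr_end = linear_lr(num_training_steps, 5e-5, num_training_steps)
print(num_training_steps, lr_start, lr_middle, lr_end)
```

The learning rate starts at the optimizer's value and shrinks steadily to zero at the final step, which is why the total number of training steps has to be known before the scheduler is created.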

In [None]:
# Build the full name of the model repository on the Hugging Face Hub
from huggingface_hub import get_full_repo_name

model_name = "distilbert-base-uncased-finetuned-imdb-accelerate"
repo_name = get_full_repo_name(model_name)
repo_name

In [None]:
from huggingface_hub import Repository
# Create and clone the repository with the Repository class from 🤗 Hub
output_dir = model_name
repo = Repository(output_dir, clone_from=repo_name)

In [None]:
from tqdm.auto import tqdm
import torch
import math
# Training and evaluation loop
progress_bar = tqdm(range(num_training_steps))

for epoch in range(num_train_epochs):
    # Training
    model.train()
    for batch in train_dataloader:
        outputs = model(**batch)
        loss = outputs.loss
        accelerator.backward(loss)

        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)

    # Evaluation
    model.eval()
    losses = []
    for step, batch in enumerate(eval_dataloader):
        with torch.no_grad():
            outputs = model(**batch)

        loss = outputs.loss
        losses.append(accelerator.gather(loss.repeat(batch_size)))

    losses = torch.cat(losses)
    losses = losses[: len(eval_dataset)]
    try:
        perplexity = math.exp(torch.mean(losses))
    except OverflowError:
        perplexity = float("inf")

    print(f">>> Epoch {epoch}: Perplexity: {perplexity}")

    # Save and upload
    accelerator.wait_for_everyone()
    unwrapped_model = accelerator.unwrap_model(model)
    unwrapped_model.save_pretrained(output_dir, save_function=accelerator.save)
    if accelerator.is_main_process:
        tokenizer.save_pretrained(output_dir)
        repo.push_to_hub(
            commit_message=f"Training in progress epoch {epoch}", blocking=False
        )
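
The evaluation part of the loop repeats each batch's mean loss batch_size times and then truncates the result to the dataset length. The sketch below shows, with plain Python lists and made-up loss values, why this gives a per-example weighted average when the last batch is smaller. It assumes a single process; `per_example_losses` is a hypothetical helper.

```python
def per_example_losses(batch_mean_losses, batch_size, dataset_size):
    """Repeat each batch's mean loss batch_size times, then cut to dataset_size."""
    repeated = []
    for mean_loss in batch_mean_losses:
        repeated.extend([mean_loss] * batch_size)
    # Drop the extra entries produced by the partial last batch
    return repeated[:dataset_size]


# Toy setup: 10 examples with batch_size 4 gives batches of 4, 4 and 2 examples
batch_mean_losses = [2.0, 1.0, 4.0]  # illustrative values
losses = per_example_losses(batch_mean_losses, batch_size=4, dataset_size=10)

weighted_mean = sum(losses) / len(losses)
naive_mean = sum(batch_mean_losses) / len(batch_mean_losses)
print(weighted_mean, naive_mean)
```

After truncation the short last batch contributes only two entries instead of four, so it no longer counts as much as a full batch. A plain average of the three batch means would overweight it.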

#### Using our fine-tuned model

In [None]:
from transformers import pipeline

mask_filler = pipeline(
    "fill-mask", model="huggingface-course/distilbert-base-uncased-finetuned-imdb"
)

In [None]:
preds = mask_filler(text)

for pred in preds:
    print(f">>> {pred['sequence']}")