# Causal language modeling

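There are two types of language modeling: causal and masked. This guide covers causal language modeling.
Causal language models are frequently used for text generation. You can use them for creative applications such as a choose-your-own text adventure, or for an intelligent coding assistant like Copilot or CodeParrot.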

In [None]:
#@title
from IPython.display import HTML

HTML('<iframe width="560" height="315" src="https://www.youtube.com/embed/Vpjb1lu0MDk?rel=0&amp;controls=0&amp;showinfo=0" frameborder="0" allowfullscreen></iframe>')

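Causal language modeling predicts the next token in a sequence of tokens, and the model can only attend to tokens on the left. This means the model cannot see future tokens. GPT-2 is an example of a causal language model.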
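
The left-only attention described above is usually expressed as a lower-triangular mask. The following is a minimal plain-Python sketch of that idea, not how GPT-2 is implemented; the `causal_mask` helper and the four-word sequence are made up for illustration.

```python
def causal_mask(seq_len):
    """Return a seq_len x seq_len matrix where entry [i][j] is True
    if position i is allowed to attend to position j."""
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]


# Hypothetical four-token sequence, used only to label the rows.
tokens = ["The", "cat", "sat", "down"]
mask = causal_mask(len(tokens))

for token, row in zip(tokens, mask):
    visible = [t for t, allowed in zip(tokens, row) if allowed]
    print(f"{token:>5} attends to {visible}")
```

Each row lists the positions a token may look at: itself and everything to its left, never anything to its right.
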
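This guide will show you how to:

1. Finetune DistilGPT2 on the r/askscience subset of the ELI5 dataset.
2. Use your finetuned model for inference.

Before you begin, make sure you have all the necessary libraries installed: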

```
pip install transformers datasets evaluate
```

We encourage you to log in to your Hugging Face account so you can upload and share your model with the community. When prompted, enter your token to log in:

In [None]:
from huggingface_hub import notebook_login

notebook_login()

## Load ELI5 dataset

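Start by loading a smaller subset of the r/askscience subset of the ELI5 dataset from the 🤗 Datasets library. This gives you a chance to experiment and make sure everything works before spending more time training on the full dataset.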

In [None]:
from datasets import load_dataset

eli5 = load_dataset("eli5", split="train_asks[:5000]")

Split the dataset's `train_asks` split into a train and test set with the [train_test_split](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.Dataset.train_test_split) method:

In [None]:
eli5 = eli5.train_test_split(test_size=0.2)

Then take a look at an example:

In [None]:
eli5["train"][0]

{'answers': {'a_id': ['c3d1aib', 'c3d4lya'],
  'score': [6, 3],
  'text': ["The velocity needed to remain in orbit is equal to the square root of Newton's constant times the mass of earth divided by the distance from the center of the earth. I don't know the altitude of that specific mission, but they're usually around 300 km. That means he's going 7-8 km/s.\n\nIn space there are no other forces acting on either the shuttle or the guy, so they stay in the same position relative to each other. If he were to become unable to return to the ship, he would presumably run out of oxygen, or slowly fall into the atmosphere and burn up.",
   "Hope you don't mind me asking another question, but why aren't there any stars visible in this photo?"]},
 'answers_urls': {'url': []},
 'document': '',
 'q_id': 'nyxfp',
 'selftext': '_URL_0_\n\nThis was on the front page earlier and I have a few questions about it. Is it possible to calculate how fast the astronaut would be orbiting the earth? Also how d

While this may look like a lot, you are only really interested in the `text` field. What makes language modeling tasks convenient is that you don't need labels (this is also known as an unsupervised task), because the next word *is* the label.
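To make "the next word is the label" concrete, here is a small sketch of how inputs and targets line up for one sequence. The token IDs are hypothetical placeholders, not real tokenizer output.

```python
# Hypothetical token IDs for one short sequence (not real tokenizer output).
token_ids = [464, 2068, 7586, 21831, 18045]

# Each position is trained to predict the token that follows it,
# so the inputs drop the last token and the targets drop the first.
inputs = token_ids[:-1]
targets = token_ids[1:]

for position, target in enumerate(targets):
    context = inputs[: position + 1]
    print(f"context={context} -> predict {target}")
```

No human annotation is involved: the targets are just the same sequence shifted by one position.
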

## Preprocess

In [None]:
#@title
from IPython.display import HTML

HTML('<iframe width="560" height="315" src="https://www.youtube.com/embed/ma1TrR7gE7I?rel=0&amp;controls=0&amp;showinfo=0" frameborder="0" allowfullscreen></iframe>')

The next step is to load a DistilGPT2 tokenizer to process the `text` subfield:

In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilgpt2")

You'll notice from the example above that the `text` field is actually nested inside `answers`. This means you need to extract the `text` subfield from its nested structure with the [flatten](https://huggingface.co/docs/datasets/process.html#flatten) method:

In [None]:
eli5 = eli5.flatten()
eli5["train"][0]

{'answers.a_id': ['c3d1aib', 'c3d4lya'],
 'answers.score': [6, 3],
 'answers.text': ["The velocity needed to remain in orbit is equal to the square root of Newton's constant times the mass of earth divided by the distance from the center of the earth. I don't know the altitude of that specific mission, but they're usually around 300 km. That means he's going 7-8 km/s.\n\nIn space there are no other forces acting on either the shuttle or the guy, so they stay in the same position relative to each other. If he were to become unable to return to the ship, he would presumably run out of oxygen, or slowly fall into the atmosphere and burn up.",
  "Hope you don't mind me asking another question, but why aren't there any stars visible in this photo?"],
 'answers_urls.url': [],
 'document': '',
 'q_id': 'nyxfp',
 'selftext': '_URL_0_\n\nThis was on the front page earlier and I have a few questions about it. Is it possible to calculate how fast the astronaut would be orbiting the earth? Also ho

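Each subfield is now a separate column, as indicated by the `answers` prefix, and the `text` field is now a list. Instead of tokenizing each sentence separately, convert the list to a string so you can tokenize them jointly.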
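As a rough illustration of the flattening and joining described above, the sketch below applies the same idea to a single plain dictionary. The `flatten_record` helper is hypothetical and only approximates the effect on one record; it is not the 🤗 Datasets implementation, and the record values are shortened stand-ins.

```python
def flatten_record(record, prefix=""):
    """Flatten nested dicts into a single level with dotted keys."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten_record(value, prefix=f"{name}."))
        else:
            flat[name] = value
    return flat


# Toy record shaped like the ELI5 example above (values shortened).
record = {
    "q_id": "nyxfp",
    "answers": {
        "a_id": ["c3d1aib", "c3d4lya"],
        "text": ["First answer.", "Second answer."],
    },
}

flat = flatten_record(record)
joined = " ".join(flat["answers.text"])
print(sorted(flat))
print(joined)
```
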

Here is a first preprocessing function that joins the list of strings for each example and tokenizes the result:

In [None]:
def preprocess_function(examples):
    return tokenizer([" ".join(x) for x in examples["answers.text"]])

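To apply this preprocessing function over the entire dataset, use the 🤗 Datasets [map](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.Dataset.map) method. You can speed up `map` by setting `batched=True` to process multiple elements of the dataset at once, and by increasing the number of processes with `num_proc`. Remove any columns you don't need: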

In [None]:
tokenized_eli5 = eli5.map(
    preprocess_function,
    batched=True,
    num_proc=4,
    remove_columns=eli5["train"].column_names,
)
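With `batched=True`, the preprocessing function receives a dictionary that maps each column name to a list of values rather than a single example. The sketch below shows that shape using a stand-in tokenizer; `toy_tokenizer` and `preprocess_batch` are hypothetical names, and the IDs are made up, so this only illustrates the data layout.

```python
def toy_tokenizer(texts):
    """Stand-in for a real tokenizer: splits on whitespace and assigns made-up IDs."""
    vocab = {}
    input_ids = []
    for text in texts:
        ids = [vocab.setdefault(word, len(vocab)) for word in text.split()]
        input_ids.append(ids)
    return {
        "input_ids": input_ids,
        "attention_mask": [[1] * len(ids) for ids in input_ids],
    }


def preprocess_batch(examples):
    # `examples` maps each column name to a list with one entry per example,
    # so examples["answers.text"] is a list of lists of strings.
    return toy_tokenizer([" ".join(answers) for answers in examples["answers.text"]])


batch = {"answers.text": [["a b", "c"], ["a d"]]}
encoded = preprocess_batch(batch)
print(encoded["input_ids"])  # [[0, 1, 2], [0, 3]]
```
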

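This dataset contains the token sequences, but some of them are longer than the maximum input length for the model.

You can now use a second preprocessing function to:

- concatenate all the sequences
- split the concatenated sequences into shorter chunks defined by `block_size`, which should be both shorter than the maximum input length and short enough to fit in your GPU RAM.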

In [None]:
block_size = 128


def group_texts(examples):
    # Concatenate all texts.
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # We drop the small remainder, we could add padding if the model supported it instead of this drop, you can
    # customize this part to your needs.
    if total_length >= block_size:
        total_length = (total_length // block_size) * block_size
    # Split by chunks of block_size.
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated_examples.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result
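
To see what this chunking does, the sketch below runs the same logic on a tiny made-up batch. `group_texts_demo` is a hypothetical copy of `group_texts` that takes `block_size` as an argument so a small value can be used; the token IDs are placeholders.

```python
def group_texts_demo(examples, block_size):
    # Same logic as group_texts above, with block_size passed in explicitly.
    concatenated = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated[list(examples.keys())[0]])
    # Drop the remainder that does not fill a whole block.
    if total_length >= block_size:
        total_length = (total_length // block_size) * block_size
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result


# Three made-up tokenized sequences with 3 + 4 + 3 = 10 tokens in total.
batch = {
    "input_ids": [[1, 2, 3], [4, 5, 6, 7], [8, 9, 10]],
    "attention_mask": [[1, 1, 1], [1, 1, 1, 1], [1, 1, 1]],
}

grouped = group_texts_demo(batch, block_size=4)
print(grouped["input_ids"])  # [[1, 2, 3, 4], [5, 6, 7, 8]]
```

Note that blocks cross the boundaries of the original sequences, and the last two tokens are dropped because they do not fill a whole block.
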

Apply the `group_texts` function over the entire dataset:

In [None]:
lm_dataset = tokenized_eli5.map(group_texts, batched=True, num_proc=4)

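Now create a batch of examples using [DataCollatorForLanguageModeling](https://huggingface.co/docs/transformers/main/en/main_classes/data_collator#transformers.DataCollatorForLanguageModeling). It is more efficient to dynamically pad the sentences to the longest length in a batch during collation than to pad the whole dataset to the maximum length.

Use the end-of-sequence token as the padding token and set `mlm=False`. This uses the inputs as the labels; the shift by one position that aligns each token with its successor is applied inside the model when it computes the loss: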


In [None]:
from transformers import DataCollatorForLanguageModeling

tokenizer.pad_token = tokenizer.eos_token
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
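
The dynamic padding described above can be sketched in plain Python. This is an illustration under stated assumptions, not the library's implementation: it assumes right-side padding, returns lists instead of tensors, and masks only the padded label positions with -100. The `collate_causal_lm` name and the token IDs are hypothetical.

```python
def collate_causal_lm(sequences, pad_token_id):
    """Pad a batch to its longest sequence and build labels for causal LM training."""
    max_len = max(len(seq) for seq in sequences)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for seq in sequences:
        n_pad = max_len - len(seq)
        batch["input_ids"].append(seq + [pad_token_id] * n_pad)
        batch["attention_mask"].append([1] * len(seq) + [0] * n_pad)
        # -100 is the index that PyTorch's cross-entropy loss ignores by default,
        # so padded positions do not contribute to the loss.
        batch["labels"].append(seq + [-100] * n_pad)
    return batch


# Two made-up sequences of different lengths; 0 stands in for the pad token ID.
batch = collate_causal_lm([[5, 6, 7], [8, 9]], pad_token_id=0)
print(batch["input_ids"])  # [[5, 6, 7], [8, 9, 0]]
print(batch["labels"])     # [[5, 6, 7], [8, 9, -100]]
```

Padding only up to the longest sequence in each batch keeps batches of short sequences small, which is the efficiency gain the paragraph above refers to.
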

## Train

If you aren't familiar with finetuning a model with the [Trainer](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Trainer), take a look at the [basic tutorial](https://huggingface.co/docs/transformers/main/en/tasks/../training#train-with-pytorch-trainer)!

You're ready to start training your model now! Load DistilGPT2 with [AutoModelForCausalLM](https://huggingface.co/docs/transformers/main/en/model_doc/auto#transformers.AutoModelForCausalLM):

In [None]:
from transformers import AutoModelForCausalLM, TrainingArguments, Trainer

model = AutoModelForCausalLM.from_pretrained("distilgpt2")

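At this point, only three steps remain:

1. Define your training hyperparameters in [TrainingArguments](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.TrainingArguments). The only required parameter is `output_dir`, which specifies where to save your model. Push this model to the Hub by setting `push_to_hub=True` (you need to be signed in to Hugging Face to upload your model).
2. Pass the training arguments to [Trainer](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Trainer) along with the model, datasets, and data collator.
3. Call [train()](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Trainer.train) to finetune your model.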

In [None]:
training_args = TrainingArguments(
    output_dir="my_awesome_eli5_clm-model",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    weight_decay=0.01,
    push_to_hub=True,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=lm_dataset["train"],
    eval_dataset=lm_dataset["test"],
    data_collator=data_collator,
)

trainer.train()

Once training is complete, use the [evaluate()](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Trainer.evaluate) method to evaluate your model and get its perplexity:

In [None]:
import math

eval_results = trainer.evaluate()
print(f"Perplexity: {math.exp(eval_results['eval_loss']):.2f}")

Perplexity: 49.61
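
Perplexity is the exponential of the mean cross-entropy loss, which is why the cell above applies `math.exp` to `eval_loss`. The sketch below computes it directly from per-token probabilities; the `perplexity` helper and the probabilities are made up for illustration.

```python
import math


def perplexity(token_probabilities):
    """Perplexity = exp(mean negative log-likelihood of the correct tokens)."""
    nll = [-math.log(p) for p in token_probabilities]
    return math.exp(sum(nll) / len(nll))


# A model that always gives the correct token probability 1/4 is as uncertain
# as a uniform choice among 4 tokens.
print(round(perplexity([0.25, 0.25, 0.25]), 6))  # 4.0

# Going the other way: the mean loss that corresponds to the perplexity above.
print(round(math.log(49.61), 3))
```

A lower perplexity means the model assigns higher probability to the tokens that actually occur in the evaluation set.
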

Then share your model to the Hub with the [push_to_hub()](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Trainer.push_to_hub) method so everyone can use your model:

In [None]:
trainer.push_to_hub()

Tip

For a more in-depth example of how to finetune a model for causal language modeling, take a look at the corresponding
[PyTorch notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/language_modeling.ipynb)
or [TensorFlow notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/language_modeling-tf.ipynb).


## Inference

Great, now that you've finetuned a model, you can use it for inference!

Come up with a prompt you'd like to generate text from:

In [None]:
prompt = "Somatic hypermutation allows the immune system to"

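The simplest way to try out your finetuned model for inference is to use it in a [pipeline()](https://huggingface.co/docs/transformers/main/en/main_classes/pipelines#transformers.pipeline). Instantiate a `pipeline` for text generation with your model, and pass your text to it: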

In [None]:
from transformers import pipeline

generator = pipeline("text-generation", model="my_awesome_eli5_clm-model")
generator(prompt)

[{'generated_text': "Somatic hypermutation allows the immune system to be able to effectively reverse the damage caused by an infection.\n\n\nThe damage caused by an infection is caused by the immune system's ability to perform its own self-correcting tasks."}]

You can also reproduce the pipeline's steps manually. First, tokenize the text and return the `input_ids` as PyTorch tensors:

In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("my_awesome_eli5_clm-model")
inputs = tokenizer(prompt, return_tensors="pt").input_ids

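Use the [generate()](https://huggingface.co/docs/transformers/main/en/main_classes/text_generation#transformers.GenerationMixin.generate) method to generate text.
For more details about the different text generation strategies and the parameters that control generation, check out the [Text generation strategies](https://huggingface.co/docs/transformers/main/en/tasks/../generation_strategies) page.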

In [None]:
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("my_awesome_eli5_clm-model")
outputs = model.generate(inputs, max_new_tokens=100, do_sample=True, top_k=50, top_p=0.95)

Decode the generated token IDs back into text:

In [None]:
tokenizer.batch_decode(outputs, skip_special_tokens=True)

["Somatic hypermutation allows the immune system to react to drugs with the ability to adapt to a different environmental situation. In other words, a system of 'hypermutation' can help the immune system to adapt to a different environmental situation or in some cases even a single life. In contrast, researchers at the University of Massachusetts-Boston have found that 'hypermutation' is much stronger in mice than in humans but can be found in humans, and that it's not completely unknown to the immune system. A study on how the immune system"]
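
The `generate()` call above samples with `top_k=50` and `top_p=0.95`. The sketch below shows the general idea of those two filters on a tiny made-up distribution: keep the `top_k` most likely tokens, then keep the smallest set of those whose cumulative probability reaches `top_p`, and sample from what remains. It is a simplified illustration, not the library's implementation; `filter_top_k_top_p`, `normalize`, and the probabilities are hypothetical.

```python
import random


def normalize(pairs):
    """Rescale (token, probability) pairs so the probabilities sum to 1."""
    total = sum(p for _, p in pairs)
    return [(token, p / total) for token, p in pairs]


def filter_top_k_top_p(probs, top_k, top_p):
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    # Top-k: keep only the k most likely tokens, then renormalize.
    ranked = normalize(ranked[:top_k])
    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for token, p in ranked:
        kept.append((token, p))
        cumulative += p
        if cumulative >= top_p:
            break
    return dict(normalize(kept))


# Made-up next-token distribution over a four-word vocabulary.
probs = {"the": 0.5, "a": 0.3, "an": 0.15, "zebra": 0.05}
filtered = filter_top_k_top_p(probs, top_k=3, top_p=0.8)
print(filtered)

# Sample the next token from the filtered distribution.
next_token = random.choices(list(filtered), weights=list(filtered.values()))[0]
print(next_token)
```

Smaller values of `top_k` or `top_p` restrict sampling to the most likely tokens and make the output less varied; larger values allow more diverse continuations.
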