<a href="https://colab.research.google.com/github/weedge/doraemon-nb/blob/main/hf_train_my_translation_model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Task: Fine-tune the Helsinki-NLP/opus-mt-en-zh translation model on the WMT19 dataset

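If you are opening this notebook in Colab, you will probably need to install 🤗 Transformers and 🤗 Datasets. Uncomment and run the cell below.

When you run this notebook on Colab (Google Colaboratory), install Hugging Face's Transformers and Datasets libraries first so that the required functionality is available. The cell below installs both; uncomment it (delete the leading `#`) and execute it.

Installing 🤗 Transformers and 🤗 Datasets usually takes a single command: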

```python
!pip install transformers datasets
```

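After this command runs, both libraries are installed in the Colab runtime and available in the notebook, and you can move on to data processing, model training, and the other tasks below.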

In [1]:
! pip install -q transformers datasets evaluate sacrebleu accelerate sacremoses

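If you are opening this notebook locally, make sure your environment has the latest versions of these libraries installed.

To share your model with the community through the Inference API and generate results like the one shown in the figure below, there are a few more steps to follow.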

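First, you need to store your authentication token from the Hugging Face website ([sign up here](https://huggingface.co/join) if you do not have an account yet), then execute the following cell and paste your access token when prompted: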

Alternatively, add an `HF_TOKEN` secret in Colab.

In [None]:
from huggingface_hub import notebook_login

notebook_login()

Next, install Git-LFS. Uncomment the following command:

In [None]:
# !apt install git-lfs

Make sure your version of Transformers is at least 4.11.0, since the functionality was introduced in that version:

In [2]:
import transformers

print(transformers.__version__)

4.38.2
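
Comparing version strings directly can mislead, since `"4.9.1" >= "4.11.0"` is true as a string comparison. A minimal sketch of a numeric check follows; the helper names `version_tuple` and `is_at_least` are hypothetical, and the parsing only looks at the leading digits of the first three components.

```python
import re


def version_tuple(version):
    """Turn '4.38.2' (or '4.39.0.dev0') into a comparable tuple of ints."""
    parts = []
    for piece in version.split(".")[:3]:
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def is_at_least(installed, required):
    """Return True when the installed version is numerically >= the required one."""
    return version_tuple(installed) >= version_tuple(required)


print(is_at_least("4.38.2", "4.11.0"))  # True
print(is_at_least("4.9.1", "4.11.0"))   # False, although the string comparison says otherwise
```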


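You can find a script version of this notebook [here](https://github.com/huggingface/transformers/tree/master/examples/seq2seq) for fine-tuning your model in a distributed fashion on multiple GPUs or TPUs.

We also quickly upload some telemetry, which tells us which examples and software versions are being used so we know where to prioritize our maintenance efforts. We don't collect (or care about) any personally identifiable information, but if you'd prefer not to be counted, feel free to skip this step or delete this cell entirely.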

In [2]:
from transformers.utils import send_example_telemetry

send_example_telemetry("translation_notebook", framework="pytorch")

# Fine-tuning a model on a translation task

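In this notebook, we will see how to fine-tune a [🤗 Transformers](https://github.com/huggingface/transformers) model for a translation task. We will use the [WMT19 dataset](https://www.statmt.org/wmt19/), a machine translation dataset composed of a collection of various sources, including news commentaries and parliament proceedings.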

In [3]:
model_checkpoint = "Helsinki-NLP/opus-mt-en-zh"

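This notebook should run with any model checkpoint from the [Model Hub](https://huggingface.co/models) as long as that model has a sequence-to-sequence version in the Transformers library. Here we picked the [`Helsinki-NLP/opus-mt-en-zh`](https://huggingface.co/Helsinki-NLP/opus-mt-en-zh) checkpoint.

First, let's look at the English -> Chinese translation quality of the pretrained model.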

In [4]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
tokenizer

tokenizer_config.json:   0%|          | 0.00/44.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/1.40k [00:00<?, ?B/s]

source.spm:   0%|          | 0.00/806k [00:00<?, ?B/s]

target.spm:   0%|          | 0.00/805k [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/1.62M [00:00<?, ?B/s]

MarianTokenizer(name_or_path='Helsinki-NLP/opus-mt-en-zh', vocab_size=65001, model_max_length=512, is_fast=False, padding_side='right', truncation_side='right', special_tokens={'eos_token': '</s>', 'unk_token': '<unk>', 'pad_token': '<pad>'}, clean_up_tokenization_spaces=True),  added_tokens_decoder={
	0: AddedToken("</s>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	1: AddedToken("<unk>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	65000: AddedToken("<pad>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}

In [None]:
model = AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)
model

MarianMTModel(
  (model): MarianModel(
    (shared): Embedding(65001, 512, padding_idx=65000)
    (encoder): MarianEncoder(
      (embed_tokens): Embedding(65001, 512, padding_idx=65000)
      (embed_positions): MarianSinusoidalPositionalEmbedding(512, 512)
      (layers): ModuleList(
        (0-5): 6 x MarianEncoderLayer(
          (self_attn): MarianAttention(
            (k_proj): Linear(in_features=512, out_features=512, bias=True)
            (v_proj): Linear(in_features=512, out_features=512, bias=True)
            (q_proj): Linear(in_features=512, out_features=512, bias=True)
            (out_proj): Linear(in_features=512, out_features=512, bias=True)
          )
          (self_attn_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
          (activation_fn): SiLU()
          (fc1): Linear(in_features=512, out_features=2048, bias=True)
          (fc2): Linear(in_features=2048, out_features=512, bias=True)
          (final_layer_norm): LayerNorm((512,), eps=1e-05

In [None]:
%%time
#text = ["In terms of time, the Chinese space station was built more than 20 years later than the International Space Station.","hello world."]
text = """
Dr. Mitchell nodded approvingly. "Yes, indeed! Fatty fish boast rich stores of omega-3 fatty acids – specifically EPA and DHA – which work miracles within our bodies. Amongst numerous benefits, they decrease triglyceride levels, lower blood pressure, discourage plaque formation, and even suppress irregular heartbeats."

An intriguing idea took root in Alex's mind, blossoming rapidly into resolve. If these simple changes could make such profound differences in his life, why stop there? Why not share this gift with others – friends, colleagues, strangers who might also benefit from understanding how food choices impact heart health? Perhaps he could host educational events, cooking demonstrations, or write articles championing these powerful culinary tools. Yes, he decided, that would be his mission – to spread awareness and inspire change, one plate at a time.

His thoughts must have shown on his face because Dr. Mitchell suddenly grinned, leaning back in her chair with satisfaction. "Ah, I see the wheels turning in that brilliant mind of yours, Alex. Go ahead, run with it!"

Encouraged, he dove deeper into conversation, eager to learn more about the remaining items on Dr. Mitchell's list: dark chocolate, tomatoes, avocados, legumes, and garlic & onions. Each ingredient held untapped potential, waiting to transform ordinary meals into extraordinary acts of self-care. By incorporating them regularly, Alex understood he wasn't merely eating for pleasure or sustenance – he was actively fighting for his future, nurturing himself and his loved ones toward longevity and happiness.

As they parted ways outside the café, promising to reconvene soon to discuss strategies for implementing these dietary shifts, both individuals felt invigorated by possibility. In sharing her wisdom, Dr. Mitchell had kindled a fire within Alex, illuminating pathways previously obscured by darkness and doubt. Armed with knowledge, fortified by friendship, and guided by love, they stepped boldly forward into a world ripe with opportunity – where every meal became an occasion for celebration, gratitude, and healing. --> Text length need to be between 0 and 5000 characters
An error occurred:  The midday sun streamed through the windows of the cozy café, casting dappled shadows onto the wooden tables and chairs. Dr. Sarah Mitchell, a renowned cardiologist, sat across from her longtime friend and patient, Alex. They had known each other since medical school, and their bond ran deep - woven together by years of shared experiences, laughter, and tears.

Alex fidgeted nervously with his coffee cup, staring intently into its swirling depths before finally breaking the silence. "Sarah," he began hesitantly, "you know I trust your judgment implicitly. You've saved my life countless times, and I am eternally grateful." He paused, taking a deep breath before continuing. "But this time feels different. This isn't just about me anymore; it's about my family too."
"""
# Tokenize the text
inputs = tokenizer(text, return_tensors="pt",padding=True,truncation=True).input_ids

# Perform the translation and decode the output
translation = model.generate(inputs, max_new_tokens=1024, do_sample=True, top_k=30, top_p=0.95)

result = tokenizer.batch_decode(translation, skip_special_tokens=True)
print(result)

['米歇尔·米切尔博士(Dr. Mitchell ) 认可 。 “ 是的,确实。 脂肪鱼自夸了丰盛的奥米加-3脂肪酸储备 — — 特别是美国环保局和DHA — — 令我们的身体产生奇迹。 在许多好处中,它们减少了三角水平、血液压低、防止血压形成,甚至抑制了不正常的心跳。 ”一个有趣的想法在亚历克斯的脑中扎根,迅速发展成决心。如果这些简单的变化能使他的生活产生如此深刻的分歧,为什么在那里停止?为什么不与其他人分享这个礼物呢? 朋友、同事、陌生人 — — 他们可能也从食物选择如何影响心脏健康中获益?也许他可以主办教育活动、烹饪示威或写文章来支持这个强大的烹饪工具。 是的,他决定,这就是他的任务 — 传播意识和激励变化,一个时刻。 他的想法在脸上表现得更深刻,因为Mitchell博士突然地燃烧,让大家更感动起来,并且通过她的内心的头脑更满足地分享。 “Ahvo driate,我看到方向,我更喜欢你的头脑, 。 。 继续思考。 继续思考, 继续,继续, 继续, 继续去, 走向。']
CPU times: user 46.7 s, sys: 108 ms, total: 46.8 s
Wall time: 11.7 s
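
Long inputs like the one above can exceed the tokenizer's `model_max_length` of 512 tokens, in which case `truncation=True` drops everything past the limit. One possible workaround, sketched here under the assumption that paragraphs are separated by blank lines and that each paragraph fits within the limit, is to split the text and translate it paragraph by paragraph. The helper `split_paragraphs` is hypothetical.

```python
import re


def split_paragraphs(text):
    """Split text on blank lines and drop empty chunks."""
    chunks = re.split(r"\n\s*\n", text)
    return [chunk.strip() for chunk in chunks if chunk.strip()]


sample = """
First paragraph, line one.

Second paragraph.


Third paragraph.
"""

paragraphs = split_paragraphs(sample)
print(paragraphs)
```

Each paragraph can then be tokenized and passed to `model.generate` on its own, or the whole list can be handed to a translation `pipeline`, which accepts a list of strings.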


The `pipeline` API is well suited to offline workflows and batch translation.

In [None]:
%%time
from transformers import pipeline
text = "In terms of time, the Chinese space station was built more than 20 years later than the International Space Station."
translator = pipeline("translation", model=model_checkpoint, max_length=1024)
res=translator(text)
print(res[0]["translation_text"])

就时间而言,中国空间站的建造比国际空间站晚了20多年。
CPU times: user 3.6 s, sys: 233 ms, total: 3.83 s
Wall time: 2.47 s


## Loading the dataset

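We will use the [🤗 Datasets](https://github.com/huggingface/datasets) library to download the data and get the metric we need for evaluation (to compare our model to the benchmark). This can be easily done with the functions `load_dataset` and `load_metric`. Here we use the English/Chinese (`zh-en`) part of the [WMT19 dataset on Hugging Face Datasets](https://huggingface.co/datasets/wmt19), whose training split contains about 26M sentence pairs.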

In [None]:
from datasets import load_dataset

raw_datasets = load_dataset("wmt19", "zh-en")
#raw_datasets = load_dataset("wmt19", "zh-en",split="train")
#raw_datasets = load_dataset("wmt19", "zh-en",split="train[:1000]")
#raw_datasets = load_dataset("wmt19", "zh-en",split="train[:1000000]")

You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this dataset from the next major release of `datasets`.


Downloading builder script:   0%|          | 0.00/2.78k [00:00<?, ?B/s]

Downloading readme:   0%|          | 0.00/10.2k [00:00<?, ?B/s]

Downloading extra modules:   0%|          | 0.00/41.4k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/36.7M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/13.0M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/1.39G [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/98.2M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/167M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/107M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/100M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/99.3M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/150M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/38.7M [00:00<?, ?B/s]

The `raw_datasets` object is a [`DatasetDict`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasetdict), which contains one key each for the training and validation sets (a test set can be carved out of the training set):

In [4]:
raw_datasets

DatasetDict({
    train: Dataset({
        features: ['translation'],
        num_rows: 25984574
    })
    validation: Dataset({
        features: ['translation'],
        num_rows: 3981
    })
})

In [5]:
# Sample a subset of the training split for training and testing
train_datasets = raw_datasets["train"].train_test_split(train_size=100000,test_size=1000)
train_datasets["validation"] = raw_datasets["validation"]
train_datasets

DatasetDict({
    train: Dataset({
        features: ['translation'],
        num_rows: 100000
    })
    test: Dataset({
        features: ['translation'],
        num_rows: 1000
    })
    validation: Dataset({
        features: ['translation'],
        num_rows: 3981
    })
})

To access an actual element, you need to select a split first, then give an index:

In [6]:
print(train_datasets["train"][0])
print(train_datasets["test"][0])
print(train_datasets["validation"][0])

{'translation': {'en': '90. Once a given report has been considered, all the organizations which participated in its preparation familiarize themselves with the findings, observations and proposals of the members of the committee.', 'zh': '90. 一旦报告经审议之后，所有参与编撰工作的各组织即可了解委员会成员的调查结果、评论意见和提议。'}}
{'translation': {'en': 'The secondary endpoints included overall survival, overall response rates, and safety profile.', 'zh': '次要终点包括整个生存期 、 整体反应率和安全情况.'}}
{'translation': {'en': 'Last week, the broadcast of period drama “Beauty Private Kitchen” was temporarily halted, and accidentally triggered heated debate about faked ratings of locally produced dramas.', 'zh': '上周，古装剧《美人私房菜》临时停播，意外引发了关于国产剧收视率造假的热烈讨论。'}}


To get a sense of what the data looks like, the following function shows some examples picked randomly from the dataset.

In [7]:
import datasets
import random
import pandas as pd
from IPython.display import display, HTML

def show_random_elements(dataset, num_examples=5):
    assert num_examples <= len(dataset), "Can't pick more elements than there are in the dataset."
    picks = []
    for _ in range(num_examples):
        pick = random.randint(0, len(dataset)-1)
        while pick in picks:
            pick = random.randint(0, len(dataset)-1)
        picks.append(pick)

    df = pd.DataFrame(dataset[picks])
    for column, typ in dataset.features.items():
        if isinstance(typ, datasets.ClassLabel):
            df[column] = df[column].transform(lambda i: typ.names[i])
    display(HTML(df.to_html()))
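
The retry loop in `show_random_elements` draws indices until it has enough distinct ones. As a sketch of an alternative, the standard library's `random.sample` picks distinct indices in a single call; the helper name `pick_unique_indices` and the optional `seed` parameter are hypothetical additions for reproducibility.

```python
import random


def pick_unique_indices(dataset_len, num_examples=5, seed=None):
    """Pick `num_examples` distinct indices in [0, dataset_len)."""
    if num_examples > dataset_len:
        raise ValueError("Can't pick more elements than there are in the dataset.")
    rng = random.Random(seed)
    return rng.sample(range(dataset_len), num_examples)


picks = pick_unique_indices(100_000, num_examples=5, seed=42)
print(picks)
```

The resulting list can be used the same way as `picks` in the function above, for example `dataset[picks]`.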

In [8]:
show_random_elements(train_datasets["train"])
show_random_elements(train_datasets["test"])
show_random_elements(train_datasets["validation"])

Unnamed: 0,translation
0,"{'en': 'As the years passed, she worried about the day when she would no longer be able to care for her adopted flock.', 'zh': '时光消逝, 明妮担心有朝一日她自己将不能再照顾这些她领养的鸡群。'}"
1,"{'en': 'His appeal against that decision, which was rejected, led to his stay in Switzerland being prolonged until 6 January 2006.', 'zh': '他就此决定提出上诉，亦被驳回，上诉延长他在瑞士的居留，直至2006年1月6日。'}"
2,"{'en': 'In the present report, OIOS highlights specific internal initiatives undertaken by the Office to heighten the quality of our work, which is at the core of strengthening oversight.', 'zh': '在本报告中，监督厅重点汇报了为提高我们的工作质量而采取的具体内部举措，而这正是加强监督工作的核心。'}"
3,"{'en': 'The indirect costs charged by UN-Women in relation to the management of other resources are based on the rate of recovery of 7 per cent established by the UNDP/UNFPA Executive Board paper DP/2008/11 and decision of 2008/3.', 'zh': '妇女署在管理其他资源方面收取的间接费用依据的是开发署/人口基金执行局DP/2008/11号文件和2008/3号决定规定的7%的回收率。'}"
4,"{'en': 'You dont need an identity, all you need is proper identification so that the guards can identify you.', 'zh': '你们这一代人穿着宽松的裤子，留着奇怪的发型，看起来一模一样。'}"


Unnamed: 0,translation
0,"{'en': 'Its function is to obey orders, not originate them.', 'zh': '它的功能是遵命, 而不是发行命令.'}"
1,"{'en': 'Therefore, to preserve this legal system and to further promote the course of international arms control, disarmament and non- proliferation serves the common interests of all states and is also their shared responsibility.', 'zh': '维护这一法律体系，继续推进国际军控、裁军与防扩散进程，符合各国的共同利益，也是各国的共同责任。'}"
2,"{'en': 'G. Congo 260 - 295 43', 'zh': '刚果'}"
3,"{'en': 'And there was there with us a young man, an Hebrew, servant to the captain of the guard; and we told him, and he interpreted to us our dreams; to each man according to his dream he did interpret.', 'zh': '41:12 在那里同着我们有一个希伯来的少年人，是护卫长的仆人，我们告诉他，他就把我们的梦圆解，是按着各人的梦圆解的。'}"
4,"{'en': 'This devastation is the result primarily of climate change.', 'zh': '这种破坏主要是气候变化的结果。'}"


Unnamed: 0,translation
0,"{'en': 'In the third quarter of last year, the accumulated profits of Chinese mobile phone brands exceeded US$1.5 billion for the first time within a quarter, which was a qualitative breakthrough.', 'zh': '去年第三季度，中国手机品牌的累计利润首次在一个季度内超过了15亿美元，这是一个质的突破。'}"
1,"{'en': 'North and South Korea also reached a consensus on easing the current military tensions, and have decided to hold military talks.', 'zh': '韩朝还就缓解当前军事紧张局势达成一致，并决定举行韩朝军事部门会谈。'}"
2,"{'en': 'Mr Wang, an ethnic Chinese resident of Northern California, purchased goods at a certain website on urgent delivery services twice at a cost of $4 each but both times, the goods failed to arrive promptly.', 'zh': '一位北加州华裔居民王先生说，曾经两次购某网站的加急配送服务，一次加急服务四元，但两次都没有按时送到。'}"
3,"{'en': 'Recently, CCTV and FIFA jointly announced that CCTV has obtained', 'zh': '近日，中央电视台和国际足联共同宣布，中央电视台获得二零一八至二零二二年国际足联各项赛事'}"
4,"{'en': 'And they're here as well-established athletes, athletes that are going to be looking to get on podiums.', 'zh': '而且他们都是这里的老牌田径运动员，即将朝荣登领奖台的目的迈进的田径运动员。'}"


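The metric is an instance of `datasets.Metric`:

In the Hugging Face Datasets library, `datasets.Metric` is a class for evaluating model performance. It provides a standardized way to compute and report a model's metrics on a given task, such as accuracy, precision, recall, or F1.
Using `datasets.Metric` keeps your evaluation consistent with other people's work and makes results easy to compare across models. The class encapsulates all the logic needed to compute the metric, so you only supply the model's predictions and the reference labels.
`datasets.Metric` also supports many different metrics and can be extended with more, which makes it usable across a wide range of natural language processing tasks and evaluation scenarios.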

In [9]:
from datasets import load_metric

metric = load_metric("sacrebleu")
metric

  metric = load_metric("sacrebleu")
You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this metric from the next major release of `datasets`.


Metric(name: "sacrebleu", features: {'predictions': Value(dtype='string', id='sequence'), 'references': Sequence(feature=Value(dtype='string', id='sequence'), length=-1, id='references')}, usage: """
Produces BLEU scores along with its sufficient statistics
from a source against one or more references.

Args:
    predictions (`list` of `str`): list of translations to score. Each translation should be tokenized into a list of tokens.
    references (`list` of `list` of `str`): A list of lists of references. The contents of the first sub-list are the references for the first prediction, the contents of the second sub-list are for the second prediction, etc. Note that there must be the same number of references for each prediction (i.e. all sub-lists must be of the same length).
    smooth_method (`str`): The smoothing method to use, defaults to `'exp'`. Possible values are:
        - `'none'`: no smoothing
        - `'floor'`: increment zero counts
        - `'add-k'`: increment num/deno

You can call its `compute` method with your predictions and labels, which need to be lists of decoded strings (a list of lists for the labels):


In [10]:
fake_preds = ["hello there", "general kenobi"]
fake_labels = [["hello there"], ["general kenobi"]]
res = metric.compute(predictions=fake_preds, references=fake_labels)
print(res)
predictions = ["hello there general kenobi", "foo bar foobar"]
references = [["hello there general kenobi"], ["foo bar foobar"]]
res = metric.compute(predictions=predictions, references=references)
print(res)

{'score': 0.0, 'counts': [4, 2, 0, 0], 'totals': [4, 2, 0, 0], 'precisions': [100.0, 100.0, 0.0, 0.0], 'bp': 1.0, 'sys_len': 4, 'ref_len': 4}
{'score': 100.00000000000004, 'counts': [7, 5, 3, 1], 'totals': [7, 5, 3, 1], 'precisions': [100.0, 100.0, 100.0, 100.0], 'bp': 1.0, 'sys_len': 7, 'ref_len': 7}
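
The `counts` and `totals` fields in the output above are the matched and total n-gram counts for n = 1 to 4. As a rough sketch of where those numbers come from, assuming plain whitespace tokenization (sacrebleu's default tokenizer is more elaborate) and a single reference per prediction, the counts can be reproduced by hand. The helper names are hypothetical.

```python
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_counts(predictions, references, max_n=4):
    """Clipped n-gram matches and totals, summed over the corpus."""
    counts = [0] * max_n
    totals = [0] * max_n
    for pred, ref in zip(predictions, references):
        pred_tokens, ref_tokens = pred.split(), ref.split()
        for n in range(1, max_n + 1):
            pred_grams = Counter(ngrams(pred_tokens, n))
            ref_grams = Counter(ngrams(ref_tokens, n))
            totals[n - 1] += sum(pred_grams.values())
            # A predicted n-gram is credited at most as often as it occurs in the reference.
            counts[n - 1] += sum(min(c, ref_grams[g]) for g, c in pred_grams.items())
    return counts, totals


print(ngram_counts(["hello there", "general kenobi"],
                   ["hello there", "general kenobi"]))
print(ngram_counts(["hello there general kenobi", "foo bar foobar"],
                   ["hello there general kenobi", "foo bar foobar"]))
```

For the two-word sentences there are no 3-grams or 4-grams at all, which matches the zero precisions for those orders and the score of 0.0 in the first output.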


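Including a metric during training is often helpful for evaluating your model's performance. You can quickly load an evaluation method with the 🤗 [Evaluate](https://huggingface.co/docs/evaluate/index) library. For this task, load the [SacreBLEU](https://huggingface.co/spaces/evaluate-metric/sacrebleu) metric (see the 🤗 Evaluate [quick tour](https://huggingface.co/docs/evaluate/a_quick_tour) to learn how to load and compute a metric). Further reading: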
- https://en.wikipedia.org/wiki/BLEU
- https://cloud.google.com/translate/automl/docs/evaluate?hl=zh-cn#bleu
- https://www.cs.cmu.edu/%7Ealavie/Presentations/MT-Evaluation-MT-Summit-Tutorial-19Sep11.pdf
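
For orientation, the fields in the metric output map onto the standard BLEU definition, written here without smoothing. In this sketch, $p_n$ is the n-gram precision (`counts[n-1] / totals[n-1]`, reported as a percentage in `precisions`), $N = 4$ with uniform weights $w_n = 1/4$, $c$ is `sys_len`, and $r$ is `ref_len`:

$$
\mathrm{BLEU} = \mathrm{BP} \cdot \exp\left(\sum_{n=1}^{N} w_n \log p_n\right),
\qquad
\mathrm{BP} =
\begin{cases}
1 & \text{if } c > r \\
e^{\,1 - r/c} & \text{if } c \le r
\end{cases}
$$

When the system and reference lengths are equal, as in both toy outputs above, the brevity penalty is $e^{0} = 1$, which agrees with the reported `bp` of 1.0.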


## Preprocessing the data

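Before we can feed these texts to our model, we need to preprocess them. This is done by a 🤗 Transformers `Tokenizer`, which (as the name indicates) tokenizes the inputs (including converting the tokens to their corresponding IDs in the pretrained vocabulary), puts them in the format the model expects, and generates the other inputs the model requires.

To do all of this, we instantiate our tokenizer with the `AutoTokenizer.from_pretrained` method, which ensures that:

- we get a tokenizer that corresponds to the model architecture we want to use,
- we download the vocabulary used when pretraining this specific checkpoint.

That vocabulary is cached, so it is not downloaded again the next time we run the cell.

With `AutoTokenizer.from_pretrained` we can load the tokenizer that matches a given model architecture without handling the vocabulary download ourselves. This simplifies preprocessing and saves time, since the cached vocabulary is reused on later runs.
Once we have the tokenizer, we can tokenize the texts in the dataset and convert them into a format the model understands. This usually means splitting the text into words or subword units (tokens), converting those units to numeric IDs, and assembling them into batches suitable as model input.
The tokenizer also takes care of other input-related tasks, such as adding the required special tokens (start, end, or separator tokens), padding or truncating sequences to a consistent length, and generating attention masks.
In short, the tokenizer does the work of turning raw text into a format the model can process, and `AutoTokenizer` from the Transformers library makes this preprocessing available for a wide range of model architectures.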

In [13]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
print(tokenizer)

MarianTokenizer(name_or_path='Helsinki-NLP/opus-mt-en-zh', vocab_size=65001, model_max_length=512, is_fast=False, padding_side='right', truncation_side='right', special_tokens={'eos_token': '</s>', 'unk_token': '<unk>', 'pad_token': '<pad>'}, clean_up_tokenization_spaces=True),  added_tokens_decoder={
	0: AddedToken("</s>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	1: AddedToken("<unk>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	65000: AddedToken("<pad>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}


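For the https://huggingface.co/Helsinki-NLP/opus-mt-en-zh tokenizer (the one we are using here), the source and target languages need to be set so that texts are preprocessed correctly: {"target_lang": "zho", "source_lang": "eng"}

By default, `AutoTokenizer` returns a fast tokenizer from the 🤗 Tokenizers library (backed by Rust) when one is available. `MarianTokenizer` has no fast implementation, which is why the output above shows `is_fast=False`.

You can call this tokenizer directly on one sentence or a pair of sentences: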

In [14]:
test_token=tokenizer("Hello, this one sentence!",return_tensors="pt",padding=True,truncation=True).input_ids
print(test_token)

tensor([[3828,    2,   58,  141, 4857,   50,    0]])


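Depending on the model you selected, the dictionary returned by the tokenizer contains different keys (the cells here print only `input_ids`). They don't matter much for what we're doing here (just know they are required by the model we will instantiate later); if you're interested, you can learn more about them in [this tutorial](https://huggingface.co/transformers/preprocessing.html).

Instead of a single sentence, we can pass a list of sentences: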



In [15]:
test_token=tokenizer(["Hello, this one sentence!", "This is another sentence.","hello"],return_tensors="pt",padding=True,truncation=True).input_ids
print(test_token)

tensor([[ 3828,     2,    58,   141,  4857,    50,     0],
        [  208,    32,  1167,  4857,     6,     0, 65000],
        [23675,     0, 65000, 65000, 65000, 65000, 65000]])


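The model parameters are updated from the loss computed between the predicted label sequence and the reference label sequence, so the pad tokens in the labels also have to be set to -100 so that they are ignored when the cross-entropy loss over the sequence is computed: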

In [16]:
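import torch
# torch.where returns a new tensor; the result is not assigned here, so test_token is unchanged
torch.where(test_token == tokenizer.pad_token_id, -100, test_token)
print(test_token)

# Column index of the EOS token in each row (assumes exactly one EOS per row)
end_token_index = torch.where(test_token == tokenizer.eos_token_id)[1]
print(end_token_index)
for idx, end_idx in enumerate(end_token_index):
    print(idx,end_idx)
    # Everything after EOS is padding: mask it in place with -100
    test_token[idx][end_idx+1:] = -100
print(test_token)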

tensor([[ 3828,     2,    58,   141,  4857,    50,     0],
        [  208,    32,  1167,  4857,     6,     0, 65000],
        [23675,     0, 65000, 65000, 65000, 65000, 65000]])
tensor([6, 5, 1])
0 tensor(6)
1 tensor(5)
2 tensor(1)
tensor([[ 3828,     2,    58,   141,  4857,    50,     0],
        [  208,    32,  1167,  4857,     6,     0,  -100],
        [23675,     0,  -100,  -100,  -100,  -100,  -100]])
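
The effect of the masking can also be sketched without torch. The example below works on plain nested lists and reuses the token ids printed above; the helper name `mask_padding` is hypothetical, and the value -100 is the index that PyTorch's cross-entropy loss ignores by default.

```python
def mask_padding(batch, pad_token_id, ignore_index=-100):
    """Replace every pad token id in a batch of label rows with ignore_index."""
    return [
        [ignore_index if token == pad_token_id else token for token in row]
        for row in batch
    ]


PAD_ID = 65000  # pad token id of the tokenizer shown above

labels = [
    [3828, 2, 58, 141, 4857, 50, 0],
    [208, 32, 1167, 4857, 6, 0, PAD_ID],
    [23675, 0, PAD_ID, PAD_ID, PAD_ID, PAD_ID, PAD_ID],
]

for row in mask_padding(labels, PAD_ID):
    print(row)
```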


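To prepare the targets for our model, we need to tokenize them with the target-side tokenizer. Older versions of Transformers did this inside the `as_target_tokenizer` context manager (commented out below), which is now deprecated; the `text_target` argument does the same thing. This ensures that the tokenizer uses the vocabulary and special tokens that correspond to the targets: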

In [17]:
#with tokenizer.as_target_tokenizer():
#  tokenizer(["Hello, this one sentence!", "This is another sentence.","hello"],return_tensors="pt",padding=True,truncation=True)
#print(tokenizer(text_target=["Hello, this one sentence!", "This is another sentence."],return_tensors="pt",padding=True,max_length=3,truncation=True))
print(tokenizer(text_target=["Hello, this one sentence!", "This is another sentence."],return_tensors="pt",padding=True,truncation=True))

{'input_ids': tensor([[ 1541, 25098,     2,    58,     8,  3795,    98,  4443,  6670,    50,
             0],
        [  208,    32,     8,  1783, 15922,    98,  4443,  6670,     6,     0,
         65000]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0]])}


If you are using one of the five T5 checkpoints, which require a special prefix in front of the inputs, you should adapt the following cell.

In [18]:
if model_checkpoint in ["t5-small", "t5-base", "t5-large", "t5-3b", "t5-11b"]:
    prefix = "translate English to Chinese: "
else:
    prefix = ""

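We can then write the function that preprocesses our samples. We just feed them to the `tokenizer` with the argument `truncation=True`, which ensures that any input longer than what the model can handle is truncated to the maximum length the model accepts. The function below also passes `padding=True`, so every batch handed to it is padded to the longest sequence in that batch, and the label positions after the EOS token are set to -100. The data collator later pads each training batch to a common length again.

The preprocessing function looks like this: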

In [19]:
import torch

max_input_length = 512
max_target_length = 512
source_lang = "en"
target_lang = "zh"

def preprocess_function(examples):
    inputs = [prefix + ex[source_lang] for ex in examples["translation"]]
    targets = [ex[target_lang] for ex in examples["translation"]]
    model_inputs = tokenizer(inputs, max_length=max_input_length, truncation=True, padding=True,return_tensors="pt")

    # Tokenize the targets with the target-side vocabulary.
    # `text_target` replaces the deprecated `as_target_tokenizer` context manager.
    labels_input_ids = tokenizer(text_target=targets, max_length=max_target_length, truncation=True, padding=True,return_tensors="pt").input_ids
    # Mask everything after each row's EOS token (the padding) with -100 so the loss ignores it.
    end_token_index = torch.where(labels_input_ids == tokenizer.eos_token_id)[1]
    for idx, end_idx in enumerate(end_token_index):
        labels_input_ids[idx][end_idx+1:] = -100

    model_inputs["labels"] = labels_input_ids
    return model_inputs

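This function works with one or several examples. With several examples, the tokenizer returns one batch per key (a 2-D tensor here, since `return_tensors="pt"` is set):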

In [20]:
examples = train_datasets['train'][:2]
print(examples)
res=preprocess_function(examples)
print(res)

{'translation': [{'en': '90. Once a given report has been considered, all the organizations which participated in its preparation familiarize themselves with the findings, observations and proposals of the members of the committee.', 'zh': '90. 一旦报告经审议之后，所有参与编撰工作的各组织即可了解委员会成员的调查结果、评论意见和提议。'}, {'en': 'Teenage girls benefit when they learn how to value their fertility and infertility through education acquired from parents or through such programmes as TeenSTAR (www.teenstarprogram.org).', 'zh': '通过家长的教育或"少年之星"等方案(www.teenstarprogram.org)认识到如何珍惜自己生育与否的选择对少女很有助益。'}]}
{'input_ids': tensor([[10516,  8641,    13,   816,   101,    66,    74,   651,     2,    61,
             3,   216,    62,  3602,    11,    45,  1917, 46217,  2325,    29,
             3,  4086,     2,  3000,     7,  1518,     4,     3,   301,     4,
             3,  3505,     6,     0, 65000, 65000, 65000, 65000, 65000, 65000,
         65000],
        [11609,  1601,  4450,  1271,  2126,   346,   137,  7065,   529,     9,
   



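To apply this function to all the sentence pairs in our dataset, we use the `map` method of the `train_datasets` object created earlier. This applies the function to all the elements of all the splits, so our training, validation, and test data are preprocessed in a single command.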

In [21]:
tokenized_datasets = train_datasets.map(preprocess_function, batched=True,load_from_cache_file=True,remove_columns=["translation"])
print(tokenized_datasets)

Map:   0%|          | 0/100000 [00:00<?, ? examples/s]

Map:   0%|          | 0/1000 [00:00<?, ? examples/s]

Map:   0%|          | 0/3981 [00:00<?, ? examples/s]

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 100000
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 1000
    })
    validation: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 3981
    })
})


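Even better, the results are automatically cached by the 🤗 Datasets library to avoid spending time on this step the next time you run the notebook. The library is normally smart enough to detect when the function you pass to `map` has changed (and thus requires not using the cached data). For instance, it will properly detect if you change the task in the first cell and rerun the notebook. 🤗 Datasets warns you when it uses cached files; you can pass `load_from_cache_file=False` in the call to `map` to skip the cache and force the preprocessing to be applied again.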

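Note that we pass `batched=True` so that the texts are encoded in batches. With a fast (Rust-backed) tokenizer this enables multi-threaded processing of each batch; `MarianTokenizer` is a slow tokenizer (`is_fast=False` above), so here the benefit is mainly fewer calls to the preprocessing function.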
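
With `batched=True`, `map` calls the function on slices of the dataset (1000 examples at a time by default) rather than on single examples. A plain-Python sketch of that calling pattern follows, where `batched_map` and `add_length` are hypothetical stand-ins for `Dataset.map` and the preprocessing function.

```python
def batched_map(records, fn, batch_size=1000):
    """Apply fn to consecutive slices of records and concatenate the results."""
    output = []
    for start in range(0, len(records), batch_size):
        batch = records[start:start + batch_size]
        output.extend(fn(batch))
    return output


calls = []


def add_length(batch):
    """Toy batch function: record the batch size and return each text's length."""
    calls.append(len(batch))
    return [len(text) for text in batch]


texts = ["hello"] * 2500
lengths = batched_map(texts, add_length, batch_size=1000)
print(len(lengths), calls)
```

One consequence for this notebook: because the preprocessing function pads with `padding=True`, each such slice is padded to its own longest sequence, so padded lengths can differ from one slice to the next.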

## Fine-tuning the model

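Now that our data is ready, we can download the pretrained model and fine-tune it. Since our task is of the sequence-to-sequence kind, we use the `AutoModelForSeq2SeqLM` class. As with the tokenizer, the `from_pretrained` method downloads and caches the model for us.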

In [22]:
from transformers import AutoModelForSeq2SeqLM, DataCollatorForSeq2Seq, Seq2SeqTrainingArguments, Seq2SeqTrainer

model = AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)

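Note that we don't get a warning like in the classification example. This means we used all the weights of the pretrained model and no head was randomly initialized in this case.

To instantiate a `Seq2SeqTrainer`, we need to define three more things. The most important is the [`Seq2SeqTrainingArguments`](https://huggingface.co/transformers/main_classes/trainer.html#transformers.Seq2SeqTrainingArguments), a class that contains all the attributes used to customize training. It requires one folder name, which is used to save the model checkpoints; all other arguments are optional: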

In [23]:
batch_size = 16
model_name = model_checkpoint.split("/")[-1]
args = Seq2SeqTrainingArguments(
    f"{model_name}-finetuned-{source_lang}-to-{target_lang}",
    evaluation_strategy = "epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    weight_decay=0.01,
    save_total_limit=3,
    num_train_epochs=1,
    predict_with_generate=True,
    fp16=True,
    push_to_hub=True,
)

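Here we set the evaluation to run at the end of each epoch, tweak the learning rate, use the `batch_size` defined at the top of the cell, and customize the weight decay. Since `Seq2SeqTrainer` saves the model regularly and our dataset is quite large, we tell it to keep at most three checkpoints. Lastly, we use the `predict_with_generate` option (so that evaluation generates actual translations) and activate mixed-precision training (to go a bit faster).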

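The last argument sets everything up so that we can push the model to the [Hub](https://huggingface.co/models) regularly during training. Remove it if you didn't follow the installation steps at the top of the notebook. If you want to save your model locally under a name different from the repository it will be pushed to, or if you want to push your model under an organization rather than your own namespace, use the `hub_model_id` argument to set the repo name (it needs to be the full name, including your namespace: for instance `weege007/opus-mt-en-zh-finetuned-en-to-zh`).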
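
As a quick sanity check on these settings, the number of optimizer steps in one epoch can be estimated from the training set size and the batch size. The sketch below assumes a single device, no gradient accumulation, and that the last partial batch is kept; the helper name `steps_per_epoch` is hypothetical.

```python
import math


def steps_per_epoch(num_examples, batch_size):
    """Number of batches needed to see every example once."""
    return math.ceil(num_examples / batch_size)


# 100,000 training examples with batch_size = 16, as configured above
print(steps_per_epoch(100_000, 16))  # 6250
```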

Then, we need a special kind of data collator, which pads not only the inputs to the maximum length in the batch, but also the labels:

In [24]:
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)
data_collator

DataCollatorForSeq2Seq(tokenizer=MarianTokenizer(name_or_path='Helsinki-NLP/opus-mt-en-zh', vocab_size=65001, model_max_length=512, is_fast=False, padding_side='right', truncation_side='right', special_tokens={'eos_token': '</s>', 'unk_token': '<unk>', 'pad_token': '<pad>'}, clean_up_tokenization_spaces=True),  added_tokens_decoder={
	0: AddedToken("</s>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	1: AddedToken("<unk>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	65000: AddedToken("<pad>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}, model=MarianMTModel(
  (model): MarianModel(
    (shared): Embedding(65001, 512, padding_idx=65000)
    (encoder): MarianEncoder(
      (embed_tokens): Embedding(65001, 512, padding_idx=65000)
      (embed_positions): MarianSinusoidalPositionalEmbedding(512, 512)
      (layers): ModuleList(
        (0-5): 6 x MarianEncoderLayer(
          (self
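
What such a collator does to a batch can be sketched in plain Python. This is an illustration under simplifying assumptions (list inputs, right padding, labels padded with -100), not the library's implementation; the function name `collate` and the label ids are hypothetical, while the input ids and the pad id 65000 come from the tokenizer outputs above.

```python
def collate(features, pad_token_id, label_pad_token_id=-100):
    """Pad inputs and labels to the longest sequence in the batch."""
    max_input = max(len(f["input_ids"]) for f in features)
    max_label = max(len(f["labels"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        n_real = len(f["input_ids"])
        n_pad = max_input - n_real
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * n_pad)
        # 1 marks real tokens, 0 marks padding the model should not attend to.
        batch["attention_mask"].append([1] * n_real + [0] * n_pad)
        label_pad = max_label - len(f["labels"])
        batch["labels"].append(f["labels"] + [label_pad_token_id] * label_pad)
    return batch


features = [
    {"input_ids": [3828, 2, 58, 141, 4857, 50, 0], "labels": [10, 11, 0]},
    {"input_ids": [23675, 0], "labels": [12, 0]},
]

batch = collate(features, pad_token_id=65000)
for key, rows in batch.items():
    print(key, rows)
```

Padding the labels with -100 rather than the pad id means the padded positions are skipped by the loss without any further masking.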

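The last thing to define for our `Seq2SeqTrainer` is how to compute the metrics from the predictions. We need to define a function for this, which uses the `metric` we loaded earlier, and we have to do a bit of preprocessing to decode the predictions into text: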

In [25]:
import numpy as np

def postprocess_text(preds, labels):
    preds = [pred.strip() for pred in preds]
    labels = [[label.strip()] for label in labels]

    return preds, labels

def compute_metrics(eval_preds):
    preds, labels = eval_preds
    if isinstance(preds, tuple):
        preds = preds[0]
    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)

    # Replace -100 in the labels as we can't decode them.
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    # Some simple post-processing
    decoded_preds, decoded_labels = postprocess_text(decoded_preds, decoded_labels)

    result = metric.compute(predictions=decoded_preds, references=decoded_labels)
    result = {"bleu": result["score"]}

    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in preds]
    result["gen_len"] = np.mean(prediction_lens)
    result = {k: round(v, 4) for k, v in result.items()}
    return result
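
Two details of `compute_metrics` are worth isolating: -100 is not a valid token id, so it has to be swapped back to the pad id before decoding, and `gen_len` counts the non-pad tokens of each prediction. A plain-Python sketch of both steps follows, with hypothetical helper names, made-up token ids, and the pad id 65000 taken from the tokenizer output above.

```python
PAD_ID = 65000


def restore_pad(labels, pad_token_id, ignore_index=-100):
    """Swap the loss-masking value back to the pad id so rows can be decoded."""
    return [
        [pad_token_id if token == ignore_index else token for token in row]
        for row in labels
    ]


def mean_generated_length(preds, pad_token_id):
    """Average number of non-pad tokens per prediction."""
    lengths = [sum(1 for token in row if token != pad_token_id) for row in preds]
    return sum(lengths) / len(lengths)


labels = [[208, 32, 6, 0, -100, -100], [23675, 0, -100, -100, -100, -100]]
preds = [[PAD_ID, 208, 32, 6, 0], [PAD_ID, 23675, 0, PAD_ID, PAD_ID]]

print(restore_pad(labels, PAD_ID))
print(mean_generated_length(preds, PAD_ID))
```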

Then we just need to pass all of this, along with our datasets, to the `Seq2SeqTrainer`:

In [26]:
trainer = Seq2SeqTrainer(
    model,
    args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["test"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

dataloader_config = DataLoaderConfiguration(dispatch_batches=None, split_batches=False, even_batches=True, use_seedable_sampler=True)


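First, let's measure the BLEU score of the model before fine-tuning, using WMT19's own validation set. The cell below evaluates with the `BLEU` metric from the original sacrebleu library: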

In [33]:
from sacrebleu.metrics import BLEU
from tqdm import tqdm

bleu = BLEU()

# Ideally this would run in batches; here only the first 100 validation examples are evaluated
validation_examples = train_datasets['validation'][:100]
batch_data=preprocess_function(validation_examples)
print(batch_data["input_ids"].shape)

generated_tokens = model.generate(batch_data["input_ids"].to("cuda"), max_new_tokens=1024, do_sample=True, top_k=30, top_p=0.95)
label_tokens = batch_data["labels"]

decoded_preds = tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
label_tokens = np.where(label_tokens != -100, label_tokens, tokenizer.pad_token_id)
decoded_labels = tokenizer.batch_decode(label_tokens, skip_special_tokens=True)

preds = [pred.strip() for pred in decoded_preds]
labels = [[label.strip()] for label in decoded_labels]
bleu_score = bleu.corpus_score(preds, labels).score
print(f"BLEU: {bleu_score:>0.2f}\n")

torch.Size([100, 109])
BLEU: 12.70
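
sacrebleu documents the `references` argument of `corpus_score(hypotheses, references)` as a list of reference streams, each stream holding one reference per hypothesis, whereas `labels` in the cell above holds one single-item list per sentence (the nesting accepted by the `evaluate` wrapper of the metric). A minimal sketch of converting between the two layouts, using only the standard library; the helper name `to_reference_streams` and the sample sentences are hypothetical.

```python
def to_reference_streams(per_sentence_refs):
    """Transpose one-list-per-sentence references into one-list-per-stream.

    [[a1, a2], [b1, b2]]  ->  [[a1, b1], [a2, b2]]
    """
    if not per_sentence_refs:
        return []
    n_refs = len(per_sentence_refs[0])
    if any(len(refs) != n_refs for refs in per_sentence_refs):
        raise ValueError("every sentence needs the same number of references")
    return [list(stream) for stream in zip(*per_sentence_refs)]

labels = [["你好。"], ["今天天气很好。"], ["谢谢。"]]
streams = to_reference_streams(labels)
print(streams)                        # [['你好。', '今天天气很好。', '谢谢。']]
print(len(streams), len(streams[0]))  # 1 3
```

With this layout the call would read `bleu.corpus_score(preds, streams)`. For Chinese output, sacrebleu also provides a dedicated tokenizer through `BLEU(tokenize="zh")`. A score computed this way would not necessarily match the figure printed above.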



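You can also use Hugging Face's evaluate library directly: load the validation dataset and score the model with it. See the documentation for details: https://huggingface.co/docs/evaluate/index . The attempt below fails with the error shown after the cell: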

In [34]:
from transformers import pipeline
from datasets import load_dataset
from evaluate import evaluator
import evaluate

#pipe = pipeline("translation", model=model_checkpoint, device=0)
#data = raw_datasets["validation"].shuffle().select(range(1000))
#metric = load_metric("sacrebleu")
task_evaluator = evaluator("translation")  # renamed to avoid shadowing the built-in eval()

results = task_evaluator.compute(model_or_pipeline=model, data=train_datasets["validation"], metric=metric)

print(results)


ValueError: Invalid `input_column` text specified. The dataset contains the following columns: ['translation'].
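
The error indicates that the evaluator looked for a column named `text`, while the dataset exposes a single `translation` column. One possible workaround, sketched here with plain dictionaries rather than a real `datasets.Dataset`, is to flatten each record into separate source and target columns. The `label` column name, the `en`/`zh` keys, and the helper `flatten_translation` are assumptions for illustration.

```python
def flatten_translation(record, source_lang="en", target_lang="zh"):
    """Turn {"translation": {"en": ..., "zh": ...}} into {"text": ..., "label": ...}."""
    pair = record["translation"]
    return {"text": pair[source_lang], "label": pair[target_lang]}

# Hypothetical records shaped like a translation dataset row.
rows = [
    {"translation": {"en": "Hello.", "zh": "你好。"}},
    {"translation": {"en": "Thank you.", "zh": "谢谢。"}},
]
flat = [flatten_translation(row) for row in rows]
print(flat[0])  # {'text': 'Hello.', 'label': '你好。'}
```

On a real dataset the same function could be applied with `dataset.map(flatten_translation, remove_columns=["translation"])` before handing the result to the evaluator.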

Now we can fine-tune our model by calling the `train` method:

In [112]:
trainer.train()

| Epoch | Training Loss | Validation Loss | Bleu | Gen Len |
|---|---|---|---|---|
| 1 | 1.7238 | 1.676539 | 22.4693 | 19.699 |


Non-default generation parameters: {'max_length': 512, 'num_beams': 4, 'bad_words_ids': [[65000]], 'forced_eos_token_id': 0}


TrainOutput(global_step=6250, training_loss=1.7584200390625, metrics={'train_runtime': 2903.1803, 'train_samples_per_second': 34.445, 'train_steps_per_second': 2.153, 'total_flos': 1.0448488784461824e+16, 'train_loss': 1.7584200390625, 'epoch': 1.0})
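
The throughput figures in `TrainOutput` can be cross-checked against each other. A minimal sketch using only the numbers printed above; the sample count and batch size are derived estimates, assuming a single epoch with no gradient accumulation.

```python
global_step = 6250
train_runtime = 2903.1803    # seconds
samples_per_second = 34.445

steps_per_second = global_step / train_runtime
approx_samples = samples_per_second * train_runtime  # examples seen in the epoch
approx_batch_size = approx_samples / global_step     # examples per optimizer step

print(round(steps_per_second, 3))    # 2.153
print(round(approx_samples))         # 100000
print(round(approx_batch_size))      # 16
print(round(train_runtime / 60, 1))  # 48.4 (minutes)
```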

You can now upload the training results to the Hub ( https://huggingface.co/weege007/opus-mt-en-zh-finetuned-en-to-zh ) by running this single command:

In [113]:
trainer.push_to_hub()

Non-default generation parameters: {'max_length': 512, 'num_beams': 4, 'bad_words_ids': [[65000]], 'forced_eos_token_id': 0}


CommitInfo(commit_url='https://huggingface.co/weege007/opus-mt-en-zh-finetuned-en-to-zh/commit/a176d7bd99b2394e3a683c18c6b2cf64500fc3c2', commit_message='End of training', commit_description='', oid='a176d7bd99b2394e3a683c18c6b2cf64500fc3c2', pr_url=None, pr_revision=None, pr_num=None)

You can now share this model with all your friends, family, and favorite pets: they can all load it with the identifier `"your-username/the-name-you-picked"`, for example:

```python
from transformers import AutoModelForSeq2SeqLM

model = AutoModelForSeq2SeqLM.from_pretrained("sgugger/my-awesome-model")
```

## Inference

Great, now that you have fine-tuned a model, you can use it for inference!

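Think of some text you would like to translate into another language. For T5, you need to prefix the input according to the task you are working on. The Marian checkpoint fine-tuned here is trained for a single language pair and does not need a prefix, so the second example below, which keeps the T5-style English-to-Chinese prefix, has that prefix translated as ordinary text: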

In [35]:
text = ["hello.","translate English to Chinese: Legumes share resources with nitrogen-fixing bacteria."]

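The simplest way to try out your fine-tuned model for inference is to use it in a [pipeline()](https://huggingface.co/docs/transformers/main/en/main_classes/pipelines#transformers.pipeline). Instantiate a `pipeline` for translation with your model, and pass your text to it: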

In [36]:
from transformers import pipeline

checkpoint = "weege007/opus-mt-en-zh-finetuned-en-to-zh"

#translator = pipeline("translation_en_to_zh", model=checkpoint, max_length=1024)
translator = pipeline("translation", model=checkpoint, max_length=1024)
translator(text)

[{'translation_text': '你好。'},
 {'translation_text': '英文译为中文: Legumes与固氮细菌共享资源。'}]

If you like, you can also reproduce the results of the `pipeline` manually:

Tokenize the text and return the `input_ids` as PyTorch tensors:

In [37]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
# Note: multiple texts of different lengths need padding and truncation
inputs = tokenizer(text,  return_tensors="pt", padding=True, truncation=True).input_ids
inputs

tensor([[23675,     6,     0, 65000, 65000, 65000, 65000, 65000, 65000, 65000,
         65000, 65000, 65000, 65000, 65000, 65000, 65000, 65000, 65000],
        [15955,  3287,     9,  6019,    37, 34696, 47055,    24,  2064,   247,
            29, 53004,    16,   594, 10110,   129, 54391,     6,     0]])
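
The tensor above shows the shorter sentence padded with id `65000` up to the length of the longer one. A minimal pure-Python sketch of that right-padding, together with the 0/1 attention mask a tokenizer typically returns alongside `input_ids`; the helper `pad_batch` is hypothetical and the pad id is read off the output above.

```python
PAD_ID = 65000  # pad token id as it appears in the tensor above

def pad_batch(sequences, pad_id=PAD_ID):
    """Right-pad every sequence to the longest one and build a 0/1 attention mask."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return input_ids, attention_mask

short = [23675, 6, 0]                # first row of the tensor, without padding
longer = [15955, 3287, 9, 6019, 37]  # shortened copy of the second row, for brevity
ids, mask = pad_batch([short, longer])
print(ids[0])   # [23675, 6, 0, 65000, 65000]
print(mask[0])  # [1, 1, 1, 0, 0]
```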

Use the [generate()](https://huggingface.co/docs/transformers/main/en/main_classes/text_generation#transformers.GenerationMixin.generate) method to create the translation. For more details about the different text generation strategies and the parameters that control generation, check out the [text generation](https://huggingface.co/docs/transformers/main/en/tasks/../main_classes/text_generation) API.

In [38]:
from transformers import AutoModelForSeq2SeqLM

model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)
outputs = model.generate(inputs, max_new_tokens=1024, do_sample=True, top_k=30, top_p=0.95)
outputs

tensor([[65000,  5359,    10,     0, 65000, 65000, 65000, 65000, 65000, 65000,
         65000, 65000, 65000, 65000, 65000, 65000, 65000],
        [65000,     8,  6963, 17717,    76, 21121,    37,  8380,  1070,    67,
         18374, 32989, 25381,  9565,   744,    10,     0]])

Decode the generated token ids back into text:

In [39]:
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
print(tokenizer.decode(outputs[1], skip_special_tokens=True))

你好。
英文译为中文:乐高与固氮细菌共享资源。


In [40]:
tokenizer.batch_decode(outputs, skip_special_tokens=True)

['你好。', '英文译为中文:乐高与固氮细菌共享资源。']

Evaluate the fine-tuned model on the validation set:

In [42]:
from sacrebleu.metrics import BLEU
from tqdm import tqdm

bleu = BLEU()

# Should be done in batches; evaluate on the first 100 validation examples
validation_examples = train_datasets['validation'][:100]
batch_data=preprocess_function(validation_examples)
print(batch_data["input_ids"].shape)

generated_tokens = model.generate(batch_data["input_ids"], max_new_tokens=1024, do_sample=True, top_k=30, top_p=0.95)
label_tokens = batch_data["labels"]

decoded_preds = tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
label_tokens = np.where(label_tokens != -100, label_tokens, tokenizer.pad_token_id)
decoded_labels = tokenizer.batch_decode(label_tokens, skip_special_tokens=True)

preds = [pred.strip() for pred in decoded_preds]
labels = [[label.strip()] for label in decoded_labels]
bleu_score = bleu.corpus_score(preds, labels).score
print(f"BLEU: {bleu_score:>0.2f}\n")

torch.Size([100, 109])
BLEU: 23.64
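
To give the two BLEU figures (12.70 before fine-tuning, 23.64 after) some intuition, here is a minimal sketch of clipped n-gram precision, the quantity BLEU aggregates over several n-gram orders. It is not sacrebleu's implementation: there is no brevity penalty, smoothing, or tokenizer, and the helper names and sample sentences are hypothetical.

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def clipped_precision(hyp_tokens, ref_tokens, n):
    """Share of hypothesis n-grams found in the reference, with each
    n-gram credited at most as often as it occurs in the reference."""
    hyp_counts = Counter(ngrams(hyp_tokens, n))
    ref_counts = Counter(ngrams(ref_tokens, n))
    total = sum(hyp_counts.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref_counts[gram]) for gram, count in hyp_counts.items())
    return matched / total

ref = "the cat is on the mat".split()
print(round(clipped_precision("the the the the the the the".split(), ref, 1), 3))  # 0.286
print(clipped_precision("the cat sat on the mat".split(), ref, 2))                 # 0.6
```

Clipping is what keeps a degenerate output that repeats one frequent word from scoring well: only two of the seven repeated unigrams are credited in the first call.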

