# 微调transformer模型解决翻译任务

In [50]:
model_checkpoint = "./opus-mt-zh-en" 
# 选择一个模型checkpoint

只要预训练的transformer模型包含seq2seq结构的head层，那么本notebook理论上可以使用各种各样的transformer模型[模型面板](https://huggingface.co/models)，解决任何翻译任务。

## 加载数据

In [51]:
import datasets
datasets.__version__

'2.10.0'

In [53]:
from datasets import Dataset, DatasetDict
import pandas as pd

# 读取CSV文件
df = pd.read_csv('small.csv', names=['en', 'zh'],sep="|")  # 假设文件没有头部，第一列为英文，第二列为罗马尼亚语
eval_df = pd.read_csv("eval.csv",names=['en','zh'],sep="|")
# 创建一个列表，其中每个元素是一个字典，包含翻译对
translations = [{'translation': {'en': row['en'], 'zh': row['zh']}} for _, row in df.iterrows()]
validation_translations = [{'translation': {'en': row['en'], 'zh': row['zh']}} for _, row in eval_df.iterrows()]
# 创建一个Dataset
dataset = Dataset.from_pandas(pd.DataFrame(translations))
validation_dataset = Dataset.from_pandas(pd.DataFrame(validation_translations))


raw_datasets = DatasetDict({
    'train': dataset,
    'val': validation_dataset
})

# 查看转换后的数据集
print(raw_datasets)

DatasetDict({
    train: Dataset({
        features: ['translation'],
        num_rows: 25
    })
    val: Dataset({
        features: ['translation'],
        num_rows: 6
    })
})


这个datasets对象本身是一种[`DatasetDict`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasetdict)数据结构. 对于训练集、验证集和测试集，只需要使用对应的key（train，validation，test）即可得到相应的数据。

非常方便

给定一个数据切分的key（train、validation或者test）和下标即可查看数据。

In [54]:
raw_datasets["train"][0]

{'translation': {'en': 'English', 'zh': '中文'}}

## 随机可视化数据集

In [55]:
import datasets
import random
import pandas as pd
from IPython.display import display, HTML

def show_random_elements(dataset, num_examples=5):
    assert num_examples <= len(dataset), "Can't pick more elements than there are in the dataset."
    picks = []
    for _ in range(num_examples):
        pick = random.randint(0, len(dataset)-1)
        while pick in picks:
            pick = random.randint(0, len(dataset)-1)
        picks.append(pick)
    
    df = pd.DataFrame(dataset[picks])
    for column, typ in dataset.features.items():
        if isinstance(typ, datasets.ClassLabel):
            df[column] = df[column].transform(lambda i: typ.names[i])
    display(HTML(df.to_html()))

In [56]:
show_random_elements(raw_datasets["train"])

Unnamed: 0,translation
0,"{'en': 'Liza put on the air of a conquering hero, and sauntered on, enchanted at the uproar. She stuck out her elbows and jerked her head on one side, and said to herself as she passed through the bellowing crowd:', 'zh': '莉莎摆出一副征服英雄的姿态，在喧闹声中陶醉地大步向前走去。她伸出胳膊肘，把头扭向一边，在穿过喧闹的人群时自言自语道：'这才叫果酱！''}"
1,"{'en': '‘Come on, Florrie, you and me ain’t shy; we’ll begin, and bust it!’', 'zh': '来吧，弗洛莉，你和我都不害羞；我们开始吧，跳个痛快！""两个女孩互相拉着对方的手。'}"
2,"{'en': 'English', 'zh': '中文'}"
3,"{'en': 'When she came to the group round the barrel-organ, one of the girls cried out to her:', 'zh': '当她走到围着木桶风琴的人群中时，一个女孩对她喊道：'这是你的新衣服吗？'}"
4,"{'en': 'It was a young girl of about eighteen, with dark eyes, and an enormous fringe, puffed-out and curled and frizzed, covering her whole forehead from side to side, and coming down to meet her eyebrows. She was dressed in brilliant violet, with great lappets of velvet, and she had on her head an enormous black hat covered with feathers.', 'zh': '那是一个大约十八岁的年轻女孩，她有一双乌黑的眼睛，一头蓬松、卷曲、毛茸茸的巨大流苏从两侧覆盖了整个额头，一直垂到眉毛上。她穿着艳丽的紫罗兰色衣服，上面镶嵌着巨大的天鹅绒花边，头上戴着一顶巨大的黑色帽子，上面插满了羽毛。'}"


metric是[`datasets.Metric`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Metric)类的一个实例，查看metric和使用的例子:

In [57]:
from datasets import load_metric

# 加载一个指标，比如 BLEU
metric = load_metric("sacrebleu")

In [58]:
metric

Metric(name: "sacrebleu", features: {'predictions': Value(dtype='string', id='sequence'), 'references': Sequence(feature=Value(dtype='string', id='sequence'), length=-1, id='references')}, usage: """
Produces BLEU scores along with its sufficient statistics
from a source against one or more references.

Args:
    predictions (`list` of `str`): list of translations to score. Each translation should be tokenized into a list of tokens.
    references (`list` of `list` of `str`): A list of lists of references. The contents of the first sub-list are the references for the first prediction, the contents of the second sub-list are for the second prediction, etc. Note that there must be the same number of references for each prediction (i.e. all sub-lists must be of the same length).
    smooth_method (`str`): The smoothing method to use, defaults to `'exp'`. Possible values are:
        - `'none'`: no smoothing
        - `'floor'`: increment zero counts
        - `'add-k'`: increment num/deno

## 评分计算

In [59]:
fake_preds = ["hello there", "general kenobi"]
fake_labels = [["hello there"], ["general kenobi"]]
metric.compute(predictions=fake_preds, references=fake_labels)

{'score': 0.0,
 'counts': [4, 2, 0, 0],
 'totals': [4, 2, 0, 0],
 'precisions': [100.0, 100.0, 0.0, 0.0],
 'bp': 1.0,
 'sys_len': 4,
 'ref_len': 4}

## 数据预处理

在将数据喂入模型之前，我们需要对数据进行预处理。预处理的工具叫Tokenizer。Tokenizer首先对输入进行tokenize，然后将tokens转化为预模型中需要对应的token ID，再转化为模型需要的输入格式。

为了达到数据预处理的目的，我们使用AutoTokenizer.from_pretrained方法实例化我们的tokenizer，这样可以确保：

- 我们得到一个与预训练模型一一对应的tokenizer。
- 使用指定的模型checkpoint对应的tokenizer的时候，我们也下载了模型需要的词表库vocabulary，准确来说是tokens vocabulary。


这个被下载的tokens vocabulary会被缓存起来，从而再次使用的时候不会重新下载。

In [60]:
from transformers import AutoTokenizer
# 需要安装`sentencepiece`： pip install sentencepiece
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)



以我们使用的mBART模型为例，我们需要正确设置source语言和target语言。如果您要翻译的是其他双语语料，请查看[这里](https://huggingface.co/facebook/mbart-large-cc25)。我们可以检查source和target语言的设置：


## word -> index

In [61]:
if "mbart" in model_checkpoint:
    tokenizer.src_lang = "en-XX"
    tokenizer.tgt_lang = "zh-CN"

tokenizer既可以对单个文本进行预处理，也可以对一对文本进行预处理，tokenizer预处理后得到的数据满足预训练模型输入格式

In [62]:
tokenizer("Hello, this one sentence!")

{'input_ids': [1496, 26607, 2, 56, 7, 5112, 95, 4509, 8233, 48, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

上面看到的token IDs也就是input_ids一般来说随着预训练模型名字的不同而有所不同。原因是不同的预训练模型在预训练的时候设定了不同的规则。但只要tokenizer和model的名字一致，那么tokenizer预处理的输入格式就会满足model需求的。关于预处理更多内容参考[这个教程](https://huggingface.co/transformers/preprocessing.html)

除了可以tokenize一句话，我们也可以tokenize一个list的句子。

In [63]:
tokenizer(["Hello, this one sentence!", "This is another sentence."])

{'input_ids': [[1496, 26607, 2, 56, 7, 5112, 95, 4509, 8233, 48, 0], [206, 30, 7, 16689, 19114, 95, 4509, 8233, 5, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]}

注意：为了给模型准备好翻译的targets，我们使用`as_target_tokenizer`来控制targets所对应的特殊token：

In [64]:
with tokenizer.as_target_tokenizer():
    print(tokenizer("Hello, this one sentence!"))
    model_input = tokenizer("Hello, this one sentence!")
    tokens = tokenizer.convert_ids_to_tokens(model_input['input_ids'])
    # 打印看一下special toke
    print('tokens: {}'.format(tokens))

{'input_ids': [3833, 2, 56, 139, 4839, 48, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1]}
tokens: ['▁H', 'ello', ',', '▁this', '▁', 'one', '▁s', 'ent', 'ence', '!', '</s>']




如果您使用的是T5预训练模型的checkpoints，需要对特殊的前缀进行检查。T5使用特殊的前缀来告诉模型具体要做的任务，具体前缀例子如下：


In [65]:
if model_checkpoint in ["t5-small", "t5-base", "t5-larg", "t5-3b", "t5-11b"]:
    prefix = "translate English to Romanian: "
else:
    prefix = ""

现在我们可以把所有内容放在一起组成我们的预处理函数了。我们对样本进行预处理的时候，我们还会`truncation=True`这个参数来确保我们超长的句子被截断。默认情况下，对与比较短的句子我们会自动padding。

In [66]:
max_input_length = 128
max_target_length = 128
source_lang = "en"
target_lang = "zh"

def preprocess_function(examples):
    inputs = [prefix + ex[source_lang] for ex in examples["translation"]]
    targets = [ex[target_lang] for ex in examples["translation"]]
    model_inputs = tokenizer(inputs, max_length=max_input_length, truncation=True)

    # Setup the tokenizer for targets
    with tokenizer.as_target_tokenizer():
        labels = tokenizer(targets, max_length=max_target_length, truncation=True)

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

以上的预处理函数可以处理一个样本，也可以处理多个样本exapmles。如果是处理多个样本，则返回的是多个样本被预处理之后的结果list。

In [67]:
preprocess_function(raw_datasets['train'][:2])

{'input_ids': [[471, 3920, 3707, 6436, 0], [7, 21042, 1481, 17210, 2172, 10955, 25, 7, 21042, 177, 129, 1246, 22, 7, 1813, 7, 2032, 1813, 48, 1246, 2482, 3804, 250, 7, 17933, 589, 12, 147, 3908, 1685, 7, 8378, 6807, 2515, 7, 2757, 18, 3936, 5, 0]], 'attention_mask': [[1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]], 'labels': [[7, 21116, 0], [83, 1, 2, 1, 48, 21, 1, 0]]}

接下来对数据集datasets里面的所有样本进行预处理，处理的方式是使用map函数，将预处理函数prepare_train_features应用到（map)所有样本上。

In [69]:
tokenized_datasets = raw_datasets.map(preprocess_function, batched=True)



[A[A

[A[A

[A[A

[A[A

更好的是，返回的结果会自动被缓存，避免下次处理的时候重新计算（但是也要注意，如果输入有改动，可能会被缓存影响！）。datasets库函数会对输入的参数进行检测，判断是否有变化，如果没有变化就使用缓存数据，如果有变化就重新处理。但如果输入参数不变，想改变输入的时候，最好清理调这个缓存。清理的方式是使用`load_from_cache_file=False`参数。另外，上面使用到的`batched=True`这个参数是tokenizer的特点，以为这会使用多线程同时并行对输入进行处理。

## 微调transformer模型

既然数据已经准备好了，现在我们需要下载并加载我们的预训练模型，然后微调预训练模型。既然我们是做seq2seq任务，那么我们需要一个能解决这个任务的模型类。我们使用`AutoModelForSeq2SeqLM`这个类。和tokenizer相似，`from_pretrained`方法同样可以帮助我们下载并加载模型，同时也会对模型进行缓存，就不会重复下载模型啦。

In [70]:
from transformers import AutoModelForSeq2SeqLM, DataCollatorForSeq2Seq, Seq2SeqTrainingArguments, Seq2SeqTrainer

model = AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)

由于我们微调的任务是机器翻译，而我们加载的是预训练的seq2seq模型，所以不会提示我们加载模型的时候扔掉了一些不匹配的神经网络参数（比如：预训练语言模型的神经网络head被扔掉了，同时随机初始化了机器翻译的神经网络head）。


为了能够得到一个`Seq2SeqTrainer`训练工具，我们还需要3个要素，其中最重要的是训练的设定/参数[`Seq2SeqTrainingArguments`](https://huggingface.co/transformers/main_classes/trainer.html#transformers.Seq2SeqTrainingArguments)。这个训练设定包含了能够定义训练过程的所有属性

In [77]:
batch_size = 6
args = Seq2SeqTrainingArguments(
    "test-translation",
    evaluation_strategy = "epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    weight_decay=0.01,
    save_total_limit=3,
    num_train_epochs=1,
    predict_with_generate=True,
    fp16=False,
)



上面evaluation_strategy = "epoch"参数告诉训练代码：我们每个epcoh会做一次验证评估。

上面batch_size在这个notebook之前定义好了。

由于我们的数据集比较大，同时`Seq2SeqTrainer`会不断保存模型，所以我们需要告诉它至多保存`save_total_limit=3`个模型。

最后我们需要一个数据收集器data collator，将我们处理好的输入喂给模型。

In [78]:
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)

设置好`Seq2SeqTrainer`还剩最后一件事情，那就是我们需要定义好评估方法。我们使用`metric`来完成评估。将模型预测送入评估之前，我们也会做一些数据后处理：

In [91]:
import numpy as np

def postprocess_text(preds, labels):
    preds = [pred.strip() for pred in preds if pred.strip()]
    labels = [[label.strip() for label in label_list if label.strip()] for label_list in labels]
    labels = [label_list for label_list in labels if label_list]  # 过滤掉空的参考列表
    return preds, labels

def compute_metrics(eval_preds):
    preds, labels = eval_preds
    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)
    
    # Replace -100 in the labels as we can't decode them.
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
    
    # Some simple post-processing
    decoded_preds, decoded_labels = postprocess_text(decoded_preds, decoded_labels)
    
    # 确保每个预测都有一个参考翻译
    references_per_prediction = max(len(refs) for refs in decoded_labels) if decoded_labels else 0
    
    valid_indices = [i for i, (pred, label) in enumerate(zip(decoded_preds, decoded_labels)) if pred.strip() and any(l.strip() for l in label)]
    
    if not valid_indices:
        print("Warning: No valid predictions or references. BLEU score computation skipped.")
        return {"bleu": 0.0, "gen_len": 0.0}
    
    valid_preds = [decoded_preds[i] for i in valid_indices]
    valid_labels = [refs for i, refs in enumerate(decoded_labels) if i in valid_indices]
    
    # 确保每个预测有相同数量的参考
    valid_labels = [refs if len(refs) == references_per_prediction else [refs[0]] * references_per_prediction for refs in valid_labels]
    
    result = metric.compute(predictions=valid_preds, references=valid_labels)
    result = {"bleu": result["score"]}
    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in preds]
    result["gen_len"] = np.mean(prediction_lens)
    result = {k: round(v, 4) for k, v in result.items()}
    return result

最后将所有的参数/数据/模型传给`Seq2SeqTrainer`即可

In [92]:
trainer = Seq2SeqTrainer(
    model,
    args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["val"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

  trainer = Seq2SeqTrainer(


调用`train`方法进行微调训练。

In [94]:
trainer.train()

100%|██████████| 5/5 [00:03<00:00,  1.57it/s]Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.

[A


[A[A[A                                    

[A[A                               
                                             

[A[A
100%|██████████| 5/5 [00:08<00:00,  1.57it/s]
[A


[A[A[A                                     
                      

{'eval_loss': 3.0219037532806396, 'eval_bleu': 1.0683, 'eval_gen_len': 19.5, 'eval_runtime': 3.9333, 'eval_samples_per_second': 1.525, 'eval_steps_per_second': 0.254, 'epoch': 1.0}
{'train_runtime': 8.7425, 'train_samples_per_second': 2.86, 'train_steps_per_second': 0.572, 'train_loss': 1.2025309562683106, 'epoch': 1.0}





TrainOutput(global_step=5, training_loss=1.2025309562683106, metrics={'train_runtime': 8.7425, 'train_samples_per_second': 2.86, 'train_steps_per_second': 0.572, 'total_flos': 707628367872.0, 'train_loss': 1.2025309562683106, 'epoch': 1.0})

最后别忘了，查看如何上传模型 ，上传模型到](https://huggingface.co/transformers/model_sharing.html) 到[🤗 Model Hub](https://huggingface.co/models)。随后您就可以像这个notebook一开始一样，直接用模型名字就能使用您的模型啦。
