# 翻译任务

翻译经典的为序列到序列的任务（seq2seq）,在翻译形式上和其他的很多任务接近：
- 文本摘要 (Summarization)：将长文本压缩为短文本，并且还要尽可能保留核心内容。
- 风格转换 (Style transfer)：将文本转换为另一种书写风格，例如将文言文转换为白话文、将古典英语转换为现代英语；
- 生成式问答 (Generative question answering)：对于给定的问题，基于上下文生成对应的答案。

理论上我们也可以将这一节的操作应用于完成这些 Seq2Seq 任务。

本章我们将微调一个 Marian 翻译模型进行汉英翻译，该模型已经基于 Opus 语料对汉英翻译任务进行了预训练，因此可以直接用于翻译。而通过我们的微调，可以进一步提升该模型在特定语料上的性能。

## 数据集的准备
 translation2019zh 语料
 
 example：`{"english": "In Italy, there is no real public pressure for a new, fairer tax system.", "chinese": "在意大利，公众不会真的向政府施压，要求实行新的、更公平的税收制度。"}`

In [1]:
from torch.utils.data import Dataset, random_split
import json

# 选择22w条数据，训练20w， 验证2w
max_dataset_size = 220000
train_set_size = 200000
valid_set_size = 20000

In [2]:
class TRANS(Dataset):
    def __init__(self, data_file):
        self.data = self.load_data(data_file)

    def load_data(self, data_file):
        Data = {}
        with open(data_file, "rt", encoding="utf-8") as f:
            for idx, line in enumerate(f):
                if idx >= max_dataset_size:
                    break
                sample = json.loads(line.strip())
                Data[idx] = sample
        return Data
    
    def __len__(self):
        return len(self.data)
    
    def __getitem__(self, idx):
        return self.data[idx]
    

In [3]:
!pwd

/root/autodl-tmp


In [4]:
data = TRANS("./dataset/translation2019zh/translation2019zh_train.json")
train_data, val_data = random_split(data, [train_set_size, valid_set_size])
test_data = TRANS("./dataset/translation2019zh/translation2019zh_valid.json")

In [5]:
print(f'train set size: {len(train_data)}')
print(f'valid set size: {len(val_data)}')
print(f'test set size: {len(test_data)}')
print(next(iter(train_data)))

train set size: 200000
valid set size: 20000
test set size: 39323
{'english': 'If you need a good facial scrub, you can coarsely grind some coffee beans and use them to scrub your face. They have great exfoliating properties.', 'chinese': '如果你需要优质的洁面粉，你可以粗粗磨一些咖啡豆，然后用它们打磨你的面部，能很好地去角质噢。'}


## 数据预处理
接下来我们通过选择 Helsinki-NLP 提供的汉英翻译模型 opus-mt-zh-en 对应的分词器进行 token IDs的转化
- model_checkpoint 设置为对应的语言即可
- 默认情况下分词器会采用源语言的设定来编码文本（对于英翻译模型 opus-mt-zh-en 而言就是中文），要编码目标语言则需要使用 text_targets 参数

In [6]:
from transformers import AutoTokenizer

model_checkpoint = "./model/opus-mt-zh-en"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

  from .autonotebook import tqdm as notebook_tqdm


In [7]:
tokenizer.special_tokens_map

{'eos_token': '</s>', 'unk_token': '<unk>', 'pad_token': '<pad>'}

展示

In [8]:
zh_sentence = train_data[0]["chinese"]
en_sentence = train_data[0]["english"]

input = tokenizer(zh_sentence)
target = tokenizer(text_target=en_sentence)
print(train_data[0], "\n", input, "\n", target)

{'english': 'If you need a good facial scrub, you can coarsely grind some coffee beans and use them to scrub your face. They have great exfoliating properties.', 'chinese': '如果你需要优质的洁面粉，你可以粗粗磨一些咖啡豆，然后用它们打磨你的面部，能很好地去角质噢。'} 
 {'input_ids': [4790, 257, 20195, 11, 19835, 2119, 11180, 2, 15525, 16977, 16977, 21639, 617, 9999, 17630, 2, 4453, 646, 896, 1444, 21639, 1301, 2119, 1163, 2, 533, 4665, 241, 605, 9981, 7708, 27164, 9, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]} 
 {'input_ids': [686, 37, 204, 12, 472, 58342, 55229, 2, 37, 122, 2394, 57212, 480, 52631, 239, 12220, 48648, 6, 283, 199, 8, 55229, 168, 2260, 5, 446, 53, 1325, 4908, 589, 23840, 7070, 15159, 5, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}


如果没有使用 `text_target`参数则会使用源语言进行分词，产生糟糕的结果

In [9]:
wrong_targets = tokenizer(en_sentence)

print(tokenizer.convert_ids_to_tokens(input["input_ids"]))
print(tokenizer.convert_ids_to_tokens(target["input_ids"]))
print(tokenizer.convert_ids_to_tokens(wrong_targets["input_ids"]))

['▁如果你', '需要', '优质', '的', '洁', '面', '粉', ',', '你可以', '粗', '粗', '磨', '一些', '咖啡', '豆', ',', '然后', '用', '它们', '打', '磨', '你的', '面', '部', ',', '能', '很好', '地', '去', '角', '质', '噢', '。', '</s>']
['▁If', '▁you', '▁need', '▁a', '▁good', '▁facial', '▁scrub', ',', '▁you', '▁can', '▁co', 'arse', 'ly', '▁grind', '▁some', '▁coffee', '▁beans', '▁and', '▁use', '▁them', '▁to', '▁scrub', '▁your', '▁face', '.', '▁They', '▁have', '▁great', '▁ex', 'f', 'oli', 'ating', '▁properties', '.', '</s>']
['▁I', 'f', '▁you', '▁need', '▁a', '▁good', '▁', 'fa', 'ci', 'al', '▁', 'sc', 'ru', 'b', ',', '▁you', '▁can', '▁', 'co', 'ar', 'se', 'ly', '▁g', 'rin', 'd', '▁some', '▁c', 'off', 'ee', '▁be', 'ans', '▁and', '▁', 'use', '▁them', '▁to', '▁', 'sc', 'ru', 'b', '▁your', '▁f', 'ace', '.', '▁They', '▁have', '▁g', 're', 'at', '▁', 'ex', 'f', 'oli', 'a', 'ting', '▁', 'pro', 'per', 'ti', 'es', '.', '</s>']


- 对于翻译任务，标签序列就是目标语言的 token ID 序列。
- 同样需要将填充的 pad 字符设置为 -100，以便在使用交叉熵计算序列损失时将它们忽略

In [10]:
import torch

max_input_len = 128
max_output_len = 128

inputs = [train_data[s_idx]["chinese"] for s_idx in range(4)]
targets = [train_data[s_idx]["english"] for s_idx in range(4)]

model_input = tokenizer(
    inputs,
    padding=True,
    max_length=max_input_len,
    truncation=True,
    return_tensors="pt"
)

labels = tokenizer(
    targets,
    padding=True,
    max_length=max_output_len,
    truncation=True,
    return_tensors="pt"
)["input_ids"]

# Marian 模型会在分词结果的结尾加上特殊 token '</s>'
end_token_index = torch.where(labels == tokenizer.eos_token_id)[1]

# 特殊 token '</s>'之后进行padding
for idx, end_idx in enumerate(end_token_index):
    labels[idx][end_idx+1:] = -100

print('batch_X shape:', {k: v.shape for k, v in model_input.items()})
print('batch_y shape:', labels.shape)
print(model_input)
print(labels)

batch_X shape: {'input_ids': torch.Size([4, 45]), 'attention_mask': torch.Size([4, 45])}
batch_y shape: torch.Size([4, 89])
{'input_ids': tensor([[ 4790,   257, 20195,    11, 19835,  2119, 11180,     2, 15525, 16977,
         16977, 21639,   617,  9999, 17630,     2,  4453,   646,   896,  1444,
         21639,  1301,  2119,  1163,     2,   533,  4665,   241,   605,  9981,
          7708, 27164,     9,     0, 65000, 65000, 65000, 65000, 65000, 65000,
         65000, 65000, 65000, 65000, 65000],
        [23627,  1075,  2132,   263,  4269,  5096,  2641,   373,  6206,    15,
          5963,   102,    69,  4086, 23421,  5251, 19835,    11,     2, 24853,
          3493,   375, 45939,    69,  4378,   272, 21407, 22368,     9,     0,
         65000, 65000, 65000, 65000, 65000, 65000, 65000, 65000, 65000, 65000,
         65000, 65000, 65000, 65000, 65000],
        [  538,  8181,  1209,  7399,    75,  3890,    47,  3758,    69,  2271,
          1209, 33967,   322,   646,  8816,  1926,  8275, 183

考虑到不同模型的移位操作可能存在差异，我们通过模型自带的 prepare_decoder_input_ids_from_labels 函数来完成。完整的批处理函数为

In [11]:
import torch
from torch.utils.data import DataLoader
from transformers import AutoModelForSeq2SeqLM

max_len = 128

device = "cuda" if torch.cuda.is_available else "cpu"
print(f"Using {device} device")

model = AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint, trust_remote_code=True)
model.to(device)

Using cuda device


MarianMTModel(
  (model): MarianModel(
    (shared): Embedding(65001, 512, padding_idx=65000)
    (encoder): MarianEncoder(
      (embed_tokens): Embedding(65001, 512, padding_idx=65000)
      (embed_positions): MarianSinusoidalPositionalEmbedding(512, 512)
      (layers): ModuleList(
        (0-5): 6 x MarianEncoderLayer(
          (self_attn): MarianAttention(
            (k_proj): Linear(in_features=512, out_features=512, bias=True)
            (v_proj): Linear(in_features=512, out_features=512, bias=True)
            (q_proj): Linear(in_features=512, out_features=512, bias=True)
            (out_proj): Linear(in_features=512, out_features=512, bias=True)
          )
          (self_attn_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
          (activation_fn): SiLU()
          (fc1): Linear(in_features=512, out_features=2048, bias=True)
          (fc2): Linear(in_features=2048, out_features=512, bias=True)
          (final_layer_norm): LayerNorm((512,), eps=1e-05

In [12]:
def collote_fn(batch_samples):
    batch_inputs, batch_targets = [], []
    for sample in batch_samples:
        batch_inputs.append(sample["chinese"])
        batch_targets.append(sample["english"])
    batch_data = tokenizer(
        batch_inputs, 
        text_target=batch_targets,
        padding=True,
        max_length=max_len,
        truncation=True,
        return_tensors="pt"
    )
    # 在 labels 序列前添加特殊的起始 token <s>, 并去掉末尾的 token
    batch_data["decoder_input_ids"] = model.prepare_decoder_input_ids_from_labels(batch_data["labels"])
    end_token_index = torch.where(batch_data["labels"] == tokenizer.eos_token_id)[1]
    for idx, end_idx in enumerate(end_token_index):
        batch_data["labels"][idx][end_idx+1:] = -100
    return batch_data


In [13]:
train_dataloader = DataLoader(train_data, batch_size=32, shuffle=True, collate_fn=collote_fn)
valid_dataloader = DataLoader(val_data, batch_size=32, shuffle=False, collate_fn=collote_fn)

In [14]:
batch = next(iter(train_dataloader))
print(batch.keys())
print('batch shape:', {k: v.shape for k, v in batch.items()})
print(batch)

KeysView({'input_ids': tensor([[    7, 14575,  3565,  ..., 65000, 65000, 65000],
        [ 3500, 15402, 18804,  ..., 65000, 65000, 65000],
        [ 3279,  8803,  2064,  ..., 65000, 65000, 65000],
        ...,
        [    7,  6365, 27418,  ...,     9,     0, 65000],
        [ 1824,    63,   322,  ..., 65000, 65000, 65000],
        [    7,  1187,  1847,  ..., 65000, 65000, 65000]]), 'attention_mask': tensor([[1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        ...,
        [1, 1, 1,  ..., 1, 1, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0]]), 'labels': tensor([[19254, 58793,    15,  ...,  -100,  -100,  -100],
        [   24, 22585,  3599,  ...,  -100,  -100,  -100],
        [   26,    21,   419,  ...,  -100,  -100,  -100],
        ...,
        [ 7305,  4660,  5447,  ...,  -100,  -100,  -100],
        [ 2900, 49171,   102,  ...,  -100,  -100,  -100],
        [29121,   456,  3687,  ...,   611,     5,     0]]), 'dec

## 优化模型参数
使用 AutoModelForSeq2SeqLM 构造的模型已经封装好了对应的损失函数，并且计算出的**损失会直接包含在模型的输出 outputs 中**，可以直接通过 outputs.loss 获得，因此训练循环为：

In [15]:
from tqdm.auto import tqdm

def train_loop(dataloader, model, optimizer, lr_scheduler, epoch, total_loss):
    progress_bar = tqdm(range(len(dataloader)))
    progress_bar.set_description(f'loss: {0:>7f}')
    finish_batch_num = (epoch-1) * len(dataloader)
    
    model.train()
    for batch, batch_data in enumerate(dataloader, start=1):
        batch_data = batch_data.to(device)
        outputs = model(**batch_data)
        loss = outputs.loss

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        lr_scheduler.step()

        total_loss += loss.item()
        progress_bar.set_description(f'loss: {total_loss/(finish_batch_num + batch):>7f}')
        progress_bar.update(1)
    return total_loss

验证/测试循环负责评估模型的性能。对于翻译任务，经典的评估指标是 Kishore Papineni 等人在[《BLEU: a Method for Automatic Evaluation of Machine Translation》](https://aclanthology.org/P02-1040.pdf)中提出的 [BLEU ](https://en.wikipedia.org/wiki/BLEU)值，用于度量两个词语序列之间的一致性，但是其并不会衡量语义连贯性或者语法正确性。

由于计算 BLEU 值需要输入分好词的文本，而不同的分词方式会对结果造成影响，因此现在更常用的评估指标是 [SacreBLEU](https://github.com/mjpost/sacrebleu)，它对分词的过程进行了标准化。

In [16]:
from sacrebleu.metrics import BLEU

predictions = [
    "This plugin lets you translate web pages between several languages automatically."
]
bad_predictions_1 = ["This This This This"]
bad_predictions_2 = ["This plugin"]
references = [
    [
        "This plugin allows you to automatically translate web pages between several languages."
    ]
]

bleu = BLEU()
print(bleu.corpus_score(predictions, references).score)
print(bleu.corpus_score(bad_predictions_1, references).score)
print(bleu.corpus_score(bad_predictions_2, references).score)

46.750469682990165
1.683602693167689
0.0


SacreBLEU 默认会采用 mteval-v13a.pl 分词器对文本进行分词，但是它无法处理中文、日文等非拉丁系语言。**对于中文就需要设置参数 tokenize='zh' 手动使用中文分词器**，否则会计算出不正确的 BLEU 值：

In [17]:
from sacrebleu.metrics import BLEU

predictions = [
    "我在苏州大学学习计算机，苏州大学很美丽。"
]

references = [
    [
        "我在环境优美的苏州大学学习计算机。"
    ]
]

bleu = BLEU(tokenize='zh')
print(f'BLEU: {bleu.corpus_score(predictions, references).score}')
bleu = BLEU()
print(f'wrong BLEU: {bleu.corpus_score(predictions, references).score}')

BLEU: 45.340106118883256
wrong BLEU: 0.0


```python
import torch
from transformers import AutoTokenizer
from transformers import AutoModelForSeq2SeqLM

device = 'cuda' if torch.cuda.is_available() else 'cpu'
print(f'Using {device} device')

model_checkpoint = "Helsinki-NLP/opus-mt-zh-en"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)
model = model.to(device)

sentence = '我叫张三，我住在苏州。'

sentence_inputs = tokenizer(sentence, return_tensors="pt").to(device)
sentence_generated_tokens = model.generate(
    sentence_inputs["input_ids"],
    attention_mask=sentence_inputs["attention_mask"],
    max_length=128
)
sentence_decoded_pred = tokenizer.decode(sentence_generated_tokens[0], skip_special_tokens=True)
print(sentence_decoded_pred)```

- outputs:
```
Using cpu device  
My name is Zhang San, and I live in Suzhou.
```

在 `generate()` 生成 token ID 之后，我们通过分词器自带的 `tokenizer.batch_decode()` 函数将 batch 中所有的 token ID 序列都转换为文本，因此翻译多个句子也没有问题：
```python
sentences = ['我叫张三，我住在苏州。', '我在环境优美的苏州大学学习计算机。']

sentences_inputs = tokenizer(
    sentences, 
    padding=True, 
    max_length=128,
    truncation=True, 
    return_tensors="pt"
).to(device)
sentences_generated_tokens = model.generate(
    sentences_inputs["input_ids"],
    attention_mask=sentences_inputs["attention_mask"],
    max_length=128
)
sentences_decoded_preds = tokenizer.batch_decode(sentences_generated_tokens, skip_special_tokens=True)
print(sentences_decoded_preds)
```
```
[
    'My name is Zhang San, and I live in Suzhou.', 
    "I'm studying computers at Suzhou University in a beautiful environment."
]
```

- “验证/测试循环”
    - model.generate() 函数获取预测结果
    - 结果和正确标签都处理为 SacreBLEU 接受的文本列表形式
    - 最后送入到 SacreBLEU 

In [18]:
from sacrebleu.metrics import BLEU
import numpy as np
bleu = BLEU()

def test_loop(dataloader, model):
    preds, labels =[], []

    model.eval()
    for batch_data in tqdm(dataloader):
        batch_data = batch_data.to(device)
        with torch.no_grad():
            generated_tokens =model.generate(
                batch_data["input_ids"],
                attention_mask=batch_data["attention_mask"],
                max_length = max_len,
            ).cpu().numpy()
            label_tokens = batch_data["labels"].cpu().numpy()
            
            decode_preds = tokenizer.batch_decode(generated_tokens, 
                                                  skip_special_tokens=True)
            label_tokens = np.where(label_tokens != -100,    # 条件：检查每个元素是否不等于-100
                                    label_tokens,            # 如果True：保持原值不变
                                    tokenizer.pad_token_id)  # 如果False：替换为pad_token_id
            decode_labels = tokenizer.batch_decode(label_tokens, 
                                                   skip_special_tokens=True)
            
            preds += [pred.strip() for pred in decode_preds] 
            labels += [[label.strip()] for label in decode_labels]
    return bleu.corpus_score(preds, labels).score

In [19]:
tokenizer.pad_token_id

65000

## 保存模型

在训练之前，我们先评估一下没有微调的模型在测试集上的性能。

In [20]:
test_dataloader = DataLoader(test_data, batch_size=32, shuffle=False, collate_fn=collote_fn)

test_loop(test_dataloader, model)

100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1229/1229 [07:06<00:00,  2.88it/s]


42.610827239170156

In [21]:
from torch.optim import AdamW
from transformers import get_scheduler

learning_rate = 1e-5
epoch_num = 3   


optimizer = AdamW(model.parameters(), lr=learning_rate)
lr_scheduler = get_scheduler(
    "linear",
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=epoch_num*len(train_dataloader),
)

total_loss = 0.
best_bleu = 0.
for t in range(epoch_num):
    print(f"Epoch {t+1}/{epoch_num}\n-------------------------------")
    total_loss = train_loop(train_dataloader, model, optimizer, lr_scheduler, t+1, total_loss)
    valid_bleu = test_loop(valid_dataloader, model)
    print(f"BLEU: {valid_bleu:>0.2f}\n")
    if valid_bleu > best_bleu:
        best_bleu = valid_bleu
        print('saving new weights...\n')
        torch.save(model.state_dict(), f'epoch_{t+1}_valid_bleu_{valid_bleu:0.2f}_model_weights.bin')
print("Done!")

Epoch 1/3
-------------------------------


loss: 2.573716: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 6250/6250 [07:40<00:00, 13.57it/s]
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 625/625 [03:29<00:00,  2.98it/s]


BLEU: 46.35

saving new weights...

Epoch 2/3
-------------------------------


loss: 2.501357: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 6250/6250 [07:42<00:00, 13.52it/s]
100%|████████████████████████████████████████████████████████████████████████| 625/625 [03:30<00:00,  2.97it/s]


BLEU: 57.07

saving new weights...

Epoch 3/3
-------------------------------


loss: 2.457017: 100%|██████████████████████████████████████████████████████| 6250/6250 [07:44<00:00, 13.46it/s]
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 625/625 [03:27<00:00,  3.02it/s]


BLEU: 49.90

Done!


## 测试模型


In [24]:
test_dataloader = DataLoader(test_data, batch_size=32, shuffle=False, collate_fn=collote_fn)

import json

model.load_state_dict(torch.load("epoch_2_valid_bleu_57.07_model_weights.bin"))

model.eval()
with torch.no_grad():
    print('evaluating on test set...')
    sources, preds, labels = [], [], []
    for batch_data in tqdm(test_dataloader):
        batch_data = batch_data.to(device)
        generated_tokens = model.generate(
            batch_data["input_ids"],
            attention_mask=batch_data["attention_mask"],
            max_length=max_len,
        ).cpu().numpy()
        label_tokens = batch_data["labels"].cpu().numpy()

        decoded_sources = tokenizer.batch_decode(
            batch_data["input_ids"].cpu().numpy(), 
            skip_special_tokens=True, 
            use_source_tokenizer=True
        )
        decoded_preds = tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
        label_tokens = np.where(label_tokens != -100, label_tokens, tokenizer.pad_token_id)
        decoded_labels = tokenizer.batch_decode(label_tokens, skip_special_tokens=True)

        sources += [source.strip() for source in decoded_sources]
        preds += [pred.strip() for pred in decoded_preds]
        labels += [[label.strip()] for label in decoded_labels]
    bleu_score = bleu.corpus_score(preds, labels).score
    print(f"Test BLEU: {bleu_score:>0.2f}\n")
    results = []
    print('saving predicted results...')
    for source, pred, label in zip(sources, preds, labels):
        results.append({
            "sentence": source, 
            "prediction": pred, 
            "translation": label[0]
        })
    with open('test_data_pred.json', 'wt', encoding='utf-8') as f:
        for exapmle_result in results:
            f.write(json.dumps(exapmle_result, ensure_ascii=False) + '\n')

evaluating on test set...


100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1229/1229 [07:00<00:00,  2.92it/s]


Test BLEU: 54.87

saving predicted results...
