<a href="https://colab.research.google.com/github/JieShenAI/torch/blob/main/huggingface/example/translation/%E8%8B%B1%E6%B1%89%E4%BA%92%E8%AF%91.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## References
* Hugging Face NLP Course, translation chapter: https://huggingface.co/learn/nlp-course/zh-CN/chapter7/4

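## Introduction

`jieshenai/zh_en_translation`
* Based on the `f"Helsinki-NLP/opus-mt-{src}-{trg}"` models and trained on the `kde4` dataset.

* The goal of this notebook is not to show how powerful the pretrained `jieshenai/zh_en_translation` model is, but to walk through a complete Chinese-to-English translation example.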

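### Notes on running on a GPU

* In my run, GPU memory usage peaked at 14.1 GB, which is more than most consumer graphics cards provide.
  * `trainer.train()` uses less GPU memory than `trainer.evaluate`.
  * `trainer.evaluate(max_length=max_length)` uses the most GPU memory; you can skip that line.
  * If your GPU has less memory, lower the batch-size arguments, which this notebook sets to:
    * `per_device_train_batch_size=32`
    * `per_device_eval_batch_size=64`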
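
One way to lower the per-device training batch size without changing the effective batch size is gradient accumulation. The sketch below is only arithmetic: `accumulation_steps` is a hypothetical helper, and it assumes the result would be passed to the `gradient_accumulation_steps` argument of `Seq2SeqTrainingArguments`, which this notebook leaves at 1.

```python
def accumulation_steps(target_batch_size: int, per_device_batch_size: int, num_devices: int = 1) -> int:
    """Return how many micro-batches must be accumulated to reach target_batch_size."""
    per_step = per_device_batch_size * num_devices
    if target_batch_size % per_step != 0:
        raise ValueError("target batch size must be a multiple of per-device batch size x devices")
    return target_batch_size // per_step


# The notebook trains with 32 examples per optimizer step on a single device.
for micro_batch in (32, 16, 8):
    steps = accumulation_steps(32, micro_batch)
    print(f"per_device_train_batch_size={micro_batch}, gradient_accumulation_steps={steps}")
```

Accumulation only affects training. For evaluation, the only lever in this list is a smaller `per_device_eval_batch_size`.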

In [None]:
!pip install transformers==4.24.0
!pip install SentencePiece
!pip install sacremoses
!pip install datasets
!pip install evaluate
!pip install sacrebleu

In [3]:
import torch
from torch import nn
from transformers import (
    # MarianTokenizer,
    # MarianMTModel,
    # MarianConfig,
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    # T5ForConditionalGeneration,
    )

In [None]:
# MarianMTModel?

In [4]:
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

## 验证预训练模型效果

In [1]:
# model = MarianMTModel.from_pretrained(model_name)
# tokenizer = MarianTokenizer.from_pretrained(model_name)

In [None]:
model_checkpoint = "jieshenai/zh_en_translation"
model = AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

In [None]:
# Move the model to the GPU; fall back to the CPU if no GPU is available
model.to(device)

In [7]:
# Translate a single sentence (uses the global `device` defined above)
def trans(model, tokenizer, sample_text):
    batch = tokenizer([sample_text], max_length=128, truncation=True, return_tensors="pt")
    # Move every input tensor to the same device as the model
    for k, v in batch.items():
        batch[k] = v.to(device)
    generated_ids = model.generate(**batch)
    text = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
    return text

In [13]:
zh_text = [
 "今天天气不错。",
 "小明和小红，一起去电影院看电影去了。",
 "天气太热了，咱们去超市买雪糕吃吧。"
]

for zh in zh_text:
  print("zh:", zh)
  print("trans:", trans(model, tokenizer, zh))
  print()

zh: 今天天气不错。
trans: It's a nice day.

zh: 小明和小红，一起去电影院看电影去了。
trans: Ming and Red went to the cinema to see a movie.

zh: 天气太热了，咱们去超市买雪糕吃吧。
trans: It's too hot. Let's go to the market and get ice-cream.



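The Chinese sentences in `zh_text` were made up on the spot. The output shows that the pretrained model's Chinese-to-English translations are acceptable.

Next, we randomly re-initialize the pretrained model's parameters and retrain it on a Chinese-English dataset, which makes the effect of training easier to see.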


## Resetting the model parameters

We deliberately destroy the pretrained parameters to degrade the model's performance.

In [None]:
# Dictionary of model parameters
# model.state_dict().keys()

The parameters of one layer of the original pretrained model:

In [14]:
# k is one of the keys in model.state_dict().keys()
k = "model.encoder.embed_tokens.weight"
model.state_dict()[k]

tensor([[-0.0163,  0.0074, -0.0123,  ..., -0.0095, -0.0437, -0.0572],
        [ 0.0236,  0.0017, -0.0039,  ...,  0.0575, -0.0856, -0.0709],
        [-0.0048,  0.0046,  0.0190,  ...,  0.0015, -0.0178, -0.0430],
        ...,
        [ 0.0017,  0.0350,  0.0080,  ...,  0.0581, -0.0314, -0.0496],
        [ 0.0012,  0.0347,  0.0075,  ...,  0.0574, -0.0319, -0.0500],
        [-0.0034,  0.0066, -0.0055,  ..., -0.0063,  0.0080,  0.0034]],
       device='cuda:0')

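Randomly re-initialize the weights of all `nn.Linear` layers in the model:

In [None]:
def init_xavier(m):
    # Re-initialize fully connected layers only; biases are left untouched
    if isinstance(m, nn.Linear):
        nn.init.xavier_uniform_(m.weight)

model.apply(init_xavier)

After the re-initialization, the parameters no longer match those of the pretrained model. Note that `k` names an embedding matrix rather than an `nn.Linear` weight; it most likely changes because the model ties its output projection (`lm_head`, an `nn.Linear`) to the token embeddings.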

In [16]:
model.state_dict()[k]

tensor([[ 0.0075,  0.0052,  0.0053,  ...,  0.0052,  0.0045,  0.0020],
        [ 0.0092, -0.0093,  0.0047,  ..., -0.0044,  0.0063,  0.0009],
        [ 0.0005, -0.0030, -0.0094,  ...,  0.0037, -0.0007, -0.0014],
        ...,
        [ 0.0072,  0.0009,  0.0066,  ..., -0.0073, -0.0036,  0.0028],
        [-0.0046, -0.0033, -0.0017,  ..., -0.0031, -0.0036,  0.0029],
        [-0.0074,  0.0061,  0.0072,  ..., -0.0002, -0.0034, -0.0004]],
       device='cuda:0')
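
For reference, `nn.init.xavier_uniform_` draws each weight from a uniform distribution on `[-a, a]` with `a = gain * sqrt(6 / (fan_in + fan_out))`. The sketch below computes that bound in plain Python. The shapes are assumptions that are not stated in this notebook: a hidden size of 512 and a vocabulary of 65001 entries, for a linear layer whose weight is shared with the token embeddings.

```python
import math
import random


def xavier_uniform_bound(fan_in: int, fan_out: int, gain: float = 1.0) -> float:
    """Bound a of the U(-a, a) distribution used by Xavier (Glorot) uniform init."""
    return gain * math.sqrt(6.0 / (fan_in + fan_out))


# Assumed shapes, not taken from the notebook: hidden size 512, vocabulary size 65001.
hidden_size, vocab_size = 512, 65001
bound = xavier_uniform_bound(hidden_size, vocab_size)
print(f"bound = {bound:.5f}")

# Draw a few values the way the initializer would: uniformly inside [-bound, bound].
random.seed(0)
sample = [random.uniform(-bound, bound) for _ in range(5)]
print(sample)
```

Under these assumed shapes the bound is a little under 0.0096, which would be consistent with the re-initialized values printed above all lying within about ±0.0094.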

In [17]:
for zh in zh_text:
  print("zh:", zh)
  print("trans:", trans(model, tokenizer, zh))
  print()

zh: 今天天气不错。
trans: 

zh: 小明和小红，一起去电影院看电影去了。
trans: 

zh: 天气太热了，咱们去超市买雪糕吃吧。
trans: 



The model can no longer translate, which is exactly what we wanted.

## Dataset

A Chinese-English translation dataset.

![](https://huggingface.co/datasets/huggingface-course/documentation-images/resolve/main/en/chapter7/language_tags.png)

In [None]:
from datasets import load_dataset

raw_datasets = load_dataset("kde4", lang1="en", lang2="zh_CN")
split_datasets = raw_datasets["train"].train_test_split(train_size=0.9, seed=20)
train_dataset = split_datasets["train"]
test_dataset = split_datasets["test"]

In [19]:
raw_datasets

DatasetDict({
    train: Dataset({
        features: ['id', 'translation'],
        num_rows: 139666
    })
})

In [20]:
raw_datasets['train'][0:3]

{'id': ['0', '1', '2'],
 'translation': [{'en': 'ROLES_OF_TRANSLATORS', 'zh_CN': 'Funda Wang'},
  {'en': 'CREDIT_FOR_TRANSLATORS', 'zh_CN': '开源软件国际化之简体中文组'},
  {'en': 'ROLES_OF_TRANSLATORS', 'zh_CN': 'Funda Wang'}]}

In [21]:
split_datasets

DatasetDict({
    train: Dataset({
        features: ['id', 'translation'],
        num_rows: 125699
    })
    test: Dataset({
        features: ['id', 'translation'],
        num_rows: 13967
    })
})
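
The split sizes above can be checked by hand. This is a small arithmetic sketch; the rounding rule (train count rounded down, remainder assigned to test) is an assumption that happens to reproduce the printed numbers, not a statement about how `train_test_split` is implemented.

```python
import math

total_rows = 139666     # rows in raw_datasets["train"]
train_fraction = 0.9    # train_size passed to train_test_split

# Assumption: the train count is rounded down and the remainder goes to the test split.
n_train = math.floor(total_rows * train_fraction)
n_test = total_rows - n_train

print(n_train, n_test)
```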

In [None]:
# split_datasets["validation"] = split_datasets.pop("test")

In [22]:
train_dataset[0]

{'id': '114557', 'translation': {'en': 'Download', 'zh_CN': '下载'}}

If you are not familiar with how `preprocess_function(examples)` is used, see https://huggingface.co/learn/nlp-course/zh-CN/chapter5/3?fw=pt

The Hugging Face documentation has plenty more to read as well.

In [23]:
max_length = 128

def preprocess_function(examples):
  inputs = [ex["zh_CN"] for ex in examples["translation"]]
  targets = [ex["en"] for ex in examples["translation"]]
  return tokenizer(
      inputs, text_target=targets, max_length=max_length, truncation=True
  )

In [24]:
train_dataset = train_dataset.map(
    preprocess_function,
    batched=True,
    remove_columns=train_dataset.column_names,
)

test_dataset = test_dataset.map(
    preprocess_function,
    batched=True,
    remove_columns=test_dataset.column_names,
)


In [25]:
train_dataset[0], test_dataset[0]

({'input_ids': [7, 25618, 0],
  'attention_mask': [1, 1, 1],
  'labels': [53351, 0]},
 {'input_ids': [7, 9120, 0],
  'attention_mask': [1, 1, 1],
  'labels': [27021, 0]})

The pretrained model can also be loaded like this:

In [None]:
# from transformers import AutoModelForSeq2SeqLM
# model_checkpoint = "jieshenai/zh_en_translation"
# model = AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)

In [26]:
from transformers import DataCollatorForSeq2Seq

data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)

In [27]:
batch = data_collator([train_dataset[i] for i in range(0, 3)])
batch

{'input_ids': tensor([[    7, 25618,     0, 65000, 65000, 65000, 65000, 65000, 65000],
        [    7,   622,    59,     7,    11,  9587,   627, 56848,     0],
        [    7, 46277,     0, 65000, 65000, 65000, 65000, 65000, 65000]]), 'attention_mask': tensor([[1, 1, 1, 0, 0, 0, 0, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 0, 0, 0, 0, 0, 0]]), 'labels': tensor([[53351,     0,  -100,  -100,  -100,  -100,  -100,  -100],
        [54581,  2638, 44518,    14,     7,   622,    59,     0],
        [  456,     0,  -100,  -100,  -100,  -100,  -100,  -100]]), 'decoder_input_ids': tensor([[65000, 53351,     0, 65000, 65000, 65000, 65000, 65000],
        [65000, 54581,  2638, 44518,    14,     7,   622,    59],
        [65000,   456,     0, 65000, 65000, 65000, 65000, 65000]])}
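
The batch above shows three kinds of padding: `input_ids` are padded with id 65000, `labels` are padded with -100, and `decoder_input_ids` look like the labels shifted one position to the right. The plain-Python sketch below reproduces the `decoder_input_ids` from the labels. It is an illustration of the pattern visible in the output, not the collator's actual code; `pad` and `shift_right` are hypothetical helpers, and using 65000 as the decoder start id is inferred from the printed tensors.

```python
PAD_ID = 65000      # padding id seen in the batch above
IGNORE_ID = -100    # label padding value, which the loss skips


def pad(rows, pad_value):
    """Right-pad every row to the length of the longest row."""
    width = max(len(row) for row in rows)
    return [row + [pad_value] * (width - len(row)) for row in rows]


def shift_right(labels, start_id=PAD_ID):
    """Build decoder inputs: prepend start_id, drop the last token, map -100 to PAD_ID."""
    shifted = [start_id] + labels[:-1]
    return [PAD_ID if tok == IGNORE_ID else tok for tok in shifted]


# Label rows taken from the batch printed above.
labels = pad([[53351, 0], [54581, 2638, 44518, 14, 7, 622, 59, 0], [456, 0]], IGNORE_ID)
decoder_input_ids = [shift_right(row) for row in labels]

for row in decoder_input_ids:
    print(row)
```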

## Training

In [28]:
import numpy as np
import evaluate

metric = evaluate.load("sacrebleu")

# Compute the BLEU score
def compute_metrics(eval_preds):
    preds, labels = eval_preds
    # In case the model returns more than the prediction logits
    if isinstance(preds, tuple):
        preds = preds[0]

    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)

    # Replace -100s in the labels as we can't decode them
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    # Some simple post-processing
    decoded_preds = [pred.strip() for pred in decoded_preds]
    decoded_labels = [[label.strip()] for label in decoded_labels]

    result = metric.compute(predictions=decoded_preds, references=decoded_labels)
    return {"bleu": result["score"]}
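
The label handling in `compute_metrics` can be illustrated without a real tokenizer. The sketch below uses a hypothetical toy vocabulary and a `toy_decode` helper (both invented for illustration) to show why -100 must be replaced before decoding and why each reference is wrapped in its own list.

```python
PAD_ID = 65000  # assumed pad token id, matching the padded batches in this notebook

# Hypothetical toy vocabulary standing in for the real tokenizer.
TOY_VOCAB = {53351: "Download", 456: "Open", 0: "</s>", PAD_ID: "<pad>"}
SPECIAL_IDS = {0, PAD_ID}


def toy_decode(ids):
    """Decode ids with the toy vocabulary, skipping special tokens."""
    return " ".join(TOY_VOCAB[i] for i in ids if i not in SPECIAL_IDS)


label_rows = [[53351, 0, -100, -100], [456, 0, -100, -100]]

# Same idea as np.where(labels != -100, labels, pad_id), written with plain lists:
# -100 is not a valid token id, so it is swapped for the pad id before decoding.
cleaned = [[tok if tok != -100 else PAD_ID for tok in row] for row in label_rows]

# One list of references per prediction, which is the shape compute_metrics builds.
references = [[toy_decode(row).strip()] for row in cleaned]
print(references)
```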


* `num_train_epochs` is the number of training epochs. It is set to 2 here to shorten training time; you can set it higher.

In [29]:
from transformers import Seq2SeqTrainingArguments

args = Seq2SeqTrainingArguments(
    "jieshenai/zh_en_translation",
    evaluation_strategy="no",
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=64,
    weight_decay=0.01,
    save_total_limit=3,
    num_train_epochs=2,
    predict_with_generate=True,
    fp16=False
)

In [30]:
train_dataset[0]

{'input_ids': [7, 25618, 0], 'attention_mask': [1, 1, 1], 'labels': [53351, 0]}

In [31]:
from transformers import Seq2SeqTrainer

trainer = Seq2SeqTrainer(
    model,
    args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,

)

In [32]:
trainer.evaluate(max_length=max_length)

***** Running Evaluation *****
  Num examples = 13967
  Batch size = 64


{'eval_loss': 9.394546508789062,
 'eval_bleu': 0.0,
 'eval_runtime': 178.3005,
 'eval_samples_per_second': 78.334,
 'eval_steps_per_second': 1.228}

Results after random initialization.
These are not the pretrained model's results, because we have already destroyed its parameters.

```python
{'eval_loss': 9.30862045288086,
 'eval_bleu': 0.0,
 'eval_runtime': 184.2068,
 'eval_samples_per_second': 75.822,
 'eval_steps_per_second': 1.189}
```



Results from an earlier run, after training on the dataset:

```python
***** Running Evaluation *****
  Num examples = 13967
  Batch size = 64

[219/219 20:19]

{'eval_loss': 1.142004132270813,
 'eval_bleu': 42.28398017121882,
 'eval_runtime': 1251.9837,
 'eval_samples_per_second': 11.156,
 'eval_steps_per_second': 0.175,
 'epoch': 3.0}
```
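
The throughput figures in the log above follow directly from the example count, the batch size, and the runtime. A short arithmetic check, assuming the last partial batch counts as a full step:

```python
import math

num_examples = 13967
eval_batch_size = 64
eval_runtime = 1251.9837  # seconds, from the trained-model run above

# Number of evaluation batches, counting the final partial batch.
eval_steps = math.ceil(num_examples / eval_batch_size)
samples_per_second = num_examples / eval_runtime
steps_per_second = eval_steps / eval_runtime

print(eval_steps, round(samples_per_second, 3), round(steps_per_second, 3))
```

The same 219 steps took about 178 seconds with the randomly initialized model, which produced empty translations for the sample sentences, so most of the extra time here is presumably spent generating longer outputs.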

The run takes a long time: 34 minutes in total.

In [None]:
trainer.train()

***** Running training *****
  Num examples = 125699
  Num Epochs = 2
  Instantaneous batch size per device = 32
  Total train batch size (w. parallel, distributed & accumulation) = 32
  Gradient Accumulation steps = 1
  Total optimization steps = 7858
  Number of trainable parameters = 77419008


Step,Training Loss
500,6.9332
1000,6.3298
1500,6.0799
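
The `Total optimization steps = 7858` line in the log can be reproduced from the dataset size. This sketch assumes one device, no gradient accumulation (as the log reports), and that the last partial batch of each epoch is kept.

```python
import math

num_examples = 125699
train_batch_size = 32
num_epochs = 2

# Batches per epoch, counting the final partial batch.
steps_per_epoch = math.ceil(num_examples / train_batch_size)
total_steps = steps_per_epoch * num_epochs

print(steps_per_epoch, total_steps)
```

By this count, the last logged step (1500) is still less than half of the first epoch.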


In [None]:
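# predict() requires a dataset; the original cell left the call empty,
# so passing the test split here is an assumption
trainer.predict(test_dataset, max_length=max_length)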

In [50]:
for zh in zh_text:
  print("zh:", zh)
  print("trans:", trans(model, tokenizer, zh))
  print()

zh: 今天天气不错。
trans: This the time.

zh: 小明和小红，一起去电影院看电影去了。
trans: This of the the the the a a a a a a a a a a a a a a a the the the the the the the.

zh: 天气太热了，咱们去超市买雪糕吃吧。
trans: This of the a a a a a a a a a a a a a a a a a a a a a a. Written. Written of the a a a a a a a a a a a a a a a a a the the.



In [None]:
trans(model, tokenizer, "午间新闻播报，今日在武汉市有歹徒上街杀人并强奸数位女性。该歹徒随后被警方击毙。")

'Newscast on noon today in Wuhan city when a mob killed a few of them and killed several of them. The mob was shot and killed by the police.'

## Evaluation

After 15 minutes the evaluation had still not finished, so I stopped it manually.

In [49]:
trainer.evaluate(max_length=max_length)

***** Running Evaluation *****
  Num examples = 13967
  Batch size = 64


KeyboardInterrupt: ignored