<a href="https://colab.research.google.com/github/JieShenAI/torch/blob/main/huggingface/example/translation/%E8%8B%B1%E6%B1%89%E4%BA%92%E8%AF%91.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## References
* Hugging Face NLP course, translation chapter: https://huggingface.co/learn/nlp-course/zh-CN/chapter7/4

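## Introduction

`jieshenai/zh_en_translation`
* Based on the `f"Helsinki-NLP/opus-mt-{src}-{trg}"` model and trained on the `kde4` dataset.

* The goal of this article is not to show how powerful the pretrained `jieshenai/zh_en_translation` model is, but to walk through a complete Chinese-to-English translation example.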

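### Notes on running on a GPU

* When I ran this notebook, GPU memory usage peaked at 12.7 GB, which is more than most consumer graphics cards provide.
  * `trainer.train()` uses less GPU memory than `trainer.evaluate`.
  * `trainer.evaluate(max_length=max_length)` has the highest GPU memory usage of any step; you can skip that line. If your GPU has less memory, reduce the batch-size arguments:
    * `per_device_train_batch_size=32`
    * `per_device_eval_batch_size=32`

* A free GPU option that is accessible from mainland China: kaggle.com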
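
One common way to lower peak training memory is to shrink the per-device batch and make up for it with gradient accumulation. The sketch below only does the arithmetic. `accumulation_steps_for` is a hypothetical helper, and its result is meant for the `gradient_accumulation_steps` argument of `Seq2SeqTrainingArguments`. Accumulation affects training only, so `per_device_eval_batch_size` still has to be lowered directly.

```python
def accumulation_steps_for(target_batch, per_device_batch, num_devices=1):
    """Return the accumulation steps needed to reach `target_batch` samples per optimizer step."""
    per_step = per_device_batch * num_devices
    if per_step <= 0 or target_batch % per_step != 0:
        raise ValueError("target_batch must be a positive multiple of per_device_batch * num_devices")
    return target_batch // per_step


# Keep the effective batch size of 32 used in this notebook
# while holding only 8 samples on the GPU at a time.
steps = accumulation_steps_for(target_batch=32, per_device_batch=8)
print(steps)  # 4
```

With these numbers, the training arguments would use `per_device_train_batch_size=8` together with `gradient_accumulation_steps=4`.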

In [None]:
!pip install transformers==4.24.0
!pip install SentencePiece
!pip install sacremoses
!pip install datasets
!pip install evaluate
!pip install sacrebleu

In [2]:
import torch
from torch import nn
from transformers import (
    # MarianTokenizer,
    # MarianMTModel,
    # MarianConfig,
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    # T5ForConditionalGeneration,
    )

## Checking the pretrained model

In [None]:
# model = MarianMTModel.from_pretrained(model_name)
# tokenizer = MarianTokenizer.from_pretrained(model_name)

In [3]:
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
model_checkpoint = "jieshenai/zh_en_translation"
model = AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

In [None]:
# Move the model to the GPU, or keep it on the CPU if no GPU is available
model.to(device)

In [5]:
# Translate one Chinese sentence into English
def trans(model, tokenizer, sample_text):
    batch = tokenizer([sample_text], max_length=128, truncation=True, return_tensors="pt")
    # Move every input tensor to the same device as the model
    for k, v in batch.items():
        batch[k] = v.to(device)
    generated_ids = model.generate(**batch)
    text = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
    return text

In [6]:
zh_text = [
 "今天天气不错。",
 "小明和小红，一起去电影院看电影去了。",
 "天气太热了，咱们去超市买雪糕吃吧。"
]

for zh in zh_text:
  print("zh:", zh)
  print("trans:", trans(model, tokenizer, zh))
  print()

zh: 今天天气不错。
trans: It's a nice day.

zh: 小明和小红，一起去电影院看电影去了。
trans: Ming and Red went to the cinema to see a movie.

zh: 天气太热了，咱们去超市买雪糕吃吧。
trans: It's too hot. Let's go to the market and get ice-cream.



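The Chinese sentences in `zh_text` were written off the cuff; the output shows that the pretrained model produces acceptable Chinese-to-English translations.

The next step is to randomly initialize the pretrained model's parameters and fine-tune on a Chinese-English dataset. (As written, the cells below reuse the weights loaded above and contain no re-initialization step.)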


## Dataset

Chinese-to-English translation dataset

![](https://huggingface.co/datasets/huggingface-course/documentation-images/resolve/main/en/chapter7/language_tags.png)

In [7]:
from datasets import load_dataset

raw_datasets = load_dataset("kde4", lang1="en", lang2="zh_CN")
split_datasets = raw_datasets["train"].train_test_split(train_size=0.95, seed=20)
train_dataset = split_datasets["train"]
test_dataset = split_datasets["test"]

In [8]:
raw_datasets

DatasetDict({
    train: Dataset({
        features: ['id', 'translation'],
        num_rows: 139666
    })
})

In [9]:
raw_datasets['train'][0:3]

{'id': ['0', '1', '2'],
 'translation': [{'en': 'ROLES_OF_TRANSLATORS', 'zh_CN': 'Funda Wang'},
  {'en': 'CREDIT_FOR_TRANSLATORS', 'zh_CN': '开源软件国际化之简体中文组'},
  {'en': 'ROLES_OF_TRANSLATORS', 'zh_CN': 'Funda Wang'}]}

In [10]:
split_datasets

DatasetDict({
    train: Dataset({
        features: ['id', 'translation'],
        num_rows: 132682
    })
    test: Dataset({
        features: ['id', 'translation'],
        num_rows: 6984
    })
})
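
The split sizes above follow from `train_size=0.95`. Below is a minimal sketch of the arithmetic, assuming the train part is rounded down and the complementary test part is rounded up. That rounding rule is an assumption that reproduces the counts shown, and `split_sizes` is a hypothetical helper, not part of `datasets`.

```python
import math


def split_sizes(num_rows, train_size):
    """Assumed rounding: floor for the train part, ceil for the complementary test part."""
    n_train = math.floor(train_size * num_rows)
    n_test = math.ceil((1 - train_size) * num_rows)
    return n_train, n_test


n_train, n_test = split_sizes(139666, 0.95)
print(n_train, n_test)  # 132682 6984
```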

In [11]:
train_dataset[0], test_dataset[0]

({'id': '88416',
  'translation': {'en': 'Maximum upload speed:', 'zh_CN': '最大上传速度 ：'}},
 {'id': '73758', 'translation': {'en': 'Easy', 'zh_CN': '简单'}})

If you are not familiar with how a function such as `preprocess_function(examples)` is used, see https://huggingface.co/learn/nlp-course/zh-CN/chapter5/3?fw=pt

The Hugging Face documentation covers many related topics as well.

In [12]:
max_length = 128

def preprocess_function(examples):
  inputs = [ex["zh_CN"] for ex in examples["translation"]]
  targets = [ex["en"] for ex in examples["translation"]]
  return tokenizer(
      inputs, text_target=targets, max_length=max_length, truncation=True
  )
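
With `batched=True`, the function receives a whole batch at once: a dict that maps each column name to a list of values. The sketch below imitates that input by hand, using two rows printed earlier, and shows only the list extraction that happens before tokenization. `split_pairs` is a hypothetical helper introduced for illustration.

```python
# Hand-written stand-in for what `Dataset.map(..., batched=True)` passes in:
# one dict per batch, mapping each column name to a list of values.
examples = {
    "id": ["88416", "73758"],
    "translation": [
        {"en": "Maximum upload speed:", "zh_CN": "最大上传速度 ："},
        {"en": "Easy", "zh_CN": "简单"},
    ],
}


def split_pairs(examples, source_lang="zh_CN", target_lang="en"):
    """Collect source and target sentences in two parallel lists."""
    inputs = [ex[source_lang] for ex in examples["translation"]]
    targets = [ex[target_lang] for ex in examples["translation"]]
    return inputs, targets


inputs, targets = split_pairs(examples)
print(inputs)   # ['最大上传速度 ：', '简单']
print(targets)  # ['Maximum upload speed:', 'Easy']
```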

In [13]:
train_dataset = train_dataset.map(
    preprocess_function,
    batched=True,
    remove_columns=train_dataset.column_names,
)

test_dataset = test_dataset.map(
    preprocess_function,
    batched=True,
    remove_columns=test_dataset.column_names,
)

In [14]:
train_dataset[0], test_dataset[0]

({'input_ids': [7, 6366, 150, 5763, 10156, 7, 35, 0],
  'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1],
  'labels': [50984, 48962, 10847, 35, 0]},
 {'input_ids': [7, 9120, 0],
  'attention_mask': [1, 1, 1],
  'labels': [27021, 0]})

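Next, create the data collator. `DataCollatorForSeq2Seq` pads the inputs and labels of each batch dynamically; it also receives the model so that it can prepare `decoder_input_ids` from the labels.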

In [15]:
from transformers import DataCollatorForSeq2Seq

data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)

An example batch:

In [16]:
batch = data_collator([train_dataset[i] for i in range(0, 3)])
batch

{'input_ids': tensor([[    7,  6366,   150,  5763, 10156,     7,    35,     0, 65000, 65000,
         65000, 65000, 65000],
        [    7, 25618,  9757,   109,  8216,     7,  2046,   128, 18177,  8017,
             7,     9,     0],
        [    7,  7326,    63,     0, 65000, 65000, 65000, 65000, 65000, 65000,
         65000, 65000, 65000]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'labels': tensor([[50984, 48962, 10847,    35,     0,  -100,  -100,  -100,  -100,  -100],
        [53351,   102,     4, 48747,  8216,  3336, 17943,  4761,     5,     0],
        [  522, 16969,     0,  -100,  -100,  -100,  -100,  -100,  -100,  -100]]), 'decoder_input_ids': tensor([[65000, 50984, 48962, 10847,    35,     0, 65000, 65000, 65000, 65000],
        [65000, 53351,   102,     4, 48747,  8216,  3336, 17943,  4761,     5],
        [65000,   522, 16969,     0, 65000, 65000, 650
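
The batch above shows three conventions: `input_ids` are padded with the pad token id (65000), `labels` are padded with -100 so the loss skips those positions, and `decoder_input_ids` are the labels shifted one position to the right. Here is a plain-Python sketch of the last two, assuming the decoder start token equals the pad token, as the first column of `decoder_input_ids` suggests. `pad_to` and `shift_right` are hypothetical helpers, not library functions.

```python
PAD_ID = 65000        # pad token id seen in the batch above
IGNORE_INDEX = -100   # label value that the loss ignores


def pad_to(seq, length, value):
    """Right-pad `seq` with `value` up to `length`."""
    return list(seq) + [value] * (length - len(seq))


def shift_right(labels, start_id=PAD_ID, pad_id=PAD_ID):
    """Prepend the start token, drop the last label, and turn -100 back into the pad id."""
    shifted = [start_id] + list(labels[:-1])
    return [pad_id if tok == IGNORE_INDEX else tok for tok in shifted]


labels = pad_to([50984, 48962, 10847, 35, 0], 10, IGNORE_INDEX)
decoder_input_ids = shift_right(labels)
print(labels)
# [50984, 48962, 10847, 35, 0, -100, -100, -100, -100, -100]
print(decoder_input_ids)
# [65000, 50984, 48962, 10847, 35, 0, 65000, 65000, 65000, 65000]
```

Both printed rows match the first row of `labels` and `decoder_input_ids` in the batch output.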

## 训练

In [17]:
import numpy as np
import evaluate

metric = evaluate.load("sacrebleu")

# Compute the BLEU score
def compute_metrics(eval_preds):
    preds, labels = eval_preds
    # In case the model returns more than the prediction logits
    if isinstance(preds, tuple):
        preds = preds[0]

    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)

    # Replace -100s in the labels as we can't decode them
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    # Some simple post-processing
    decoded_preds = [pred.strip() for pred in decoded_preds]
    decoded_labels = [[label.strip()] for label in decoded_labels]

    result = metric.compute(predictions=decoded_preds, references=decoded_labels)
    return {"bleu": result["score"]}
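
Two details of `compute_metrics` are easy to miss: the -100 padding in the labels must be replaced by a real token id before decoding, and each reference is wrapped in its own list because the metric accepts several references per sentence. The sketch below mirrors both steps in plain Python. The helper names and the decoded strings are made up for illustration.

```python
IGNORE_INDEX = -100


def replace_ignore_index(label_rows, pad_id):
    """Plain-Python counterpart of np.where(labels != -100, labels, pad_id)."""
    return [[pad_id if tok == IGNORE_INDEX else tok for tok in row] for row in label_rows]


def to_sacrebleu_inputs(decoded_preds, decoded_labels):
    """Strip whitespace and wrap each reference in a one-element list."""
    predictions = [pred.strip() for pred in decoded_preds]
    references = [[label.strip()] for label in decoded_labels]
    return predictions, references


rows = replace_ignore_index([[522, 16969, 0, -100, -100]], pad_id=65000)
print(rows)  # [[522, 16969, 0, 65000, 65000]]

# Made-up decoded strings, for illustration only
preds, refs = to_sacrebleu_inputs([" Easy "], ["Easy\n"])
print(preds, refs)  # ['Easy'] [['Easy']]
```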

* `num_train_epochs` is the number of training epochs. It is set to 2 here to keep training time short; you can increase it.

In [18]:
from transformers import Seq2SeqTrainingArguments

args = Seq2SeqTrainingArguments(
    "jieshenai/zh_en_translation",
    evaluation_strategy="no",
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    weight_decay=0.01,
    save_total_limit=3,
    num_train_epochs=2,
    predict_with_generate=True,
    fp16=False
)
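
To get a feel for what these arguments imply, the number of batches can be estimated from the dataset sizes shown earlier. This is a rough sketch assuming a single device, no gradient accumulation, and that the last partial batch is kept. `batches_per_pass` is a hypothetical helper.

```python
import math


def batches_per_pass(num_examples, batch_size):
    """Number of batches needed to see every example once (last partial batch kept)."""
    return math.ceil(num_examples / batch_size)


train_steps_per_epoch = batches_per_pass(132682, 32)
total_train_steps = train_steps_per_epoch * 2      # num_train_epochs=2
eval_batches = batches_per_pass(6984, 32)

print(train_steps_per_epoch, total_train_steps, eval_batches)  # 4147 8294 219
```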

In [19]:
train_dataset[0]

{'input_ids': [7, 6366, 150, 5763, 10156, 7, 35, 0],
 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1],
 'labels': [50984, 48962, 10847, 35, 0]}

In [20]:
from transformers import Seq2SeqTrainer

trainer = Seq2SeqTrainer(
    model,
    args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

In [None]:
trainer.evaluate(max_length=max_length)

In [None]:
# Training is slow: this run took 34 minutes in total
trainer.train()

Besides calling `trans(model, tokenizer, zh)`, you can also produce translations with `trainer.predict`.

In [None]:
predicts = trainer.predict(
  test_dataset
)

***** Running Prediction *****
  Num examples = 6984
  Batch size = 32
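
`trainer.predict` returns token ids rather than text, so the predictions still need decoding. The sketch below shows one way to do that. It assumes rows may be padded with -100 when batches of different lengths are concatenated, and it uses a toy tokenizer with a made-up vocabulary so the snippet runs without downloading a model. `decode_predictions` and `ToyTokenizer` are hypothetical names.

```python
def decode_predictions(predictions, tokenizer):
    """Turn rows of generated token ids into stripped strings."""
    pad_id = tokenizer.pad_token_id
    # Map -100 padding (assumed) to the pad id so the tokenizer can decode every position
    cleaned = [[pad_id if tok == -100 else int(tok) for tok in row] for row in predictions]
    texts = tokenizer.batch_decode(cleaned, skip_special_tokens=True)
    return [text.strip() for text in texts]


class ToyTokenizer:
    """Stand-in for a real tokenizer; the vocabulary is invented for this example."""
    pad_token_id = 0
    vocab = {1: "Easy", 2: "mode"}

    def batch_decode(self, rows, skip_special_tokens=True):
        # Ids missing from the vocabulary (here: the pad id) are dropped
        return [" ".join(self.vocab[t] for t in row if t in self.vocab) for row in rows]


print(decode_predictions([[1, 2, -100], [1, 0, 0]], ToyTokenizer()))
# ['Easy mode', 'Easy']
```

With the objects from this notebook, the call would presumably be `decode_predictions(predicts.predictions, tokenizer)`.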


Compare the translations after fine-tuning with those from before fine-tuning:

In [None]:
for zh in zh_text:
  print("zh:", zh)
  print("trans:", trans(model, tokenizer, zh))
  print()

## Evaluation

After 15 minutes the evaluation had still not finished, so I stopped it manually.

In [None]:
trainer.evaluate(max_length=max_length)
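
Because generation-based evaluation over all 6984 test rows is slow, one option is to score a random subset instead. The sketch below picks a reproducible set of row indices. The subset size of 200 is an arbitrary choice, and `sample_indices` is a hypothetical helper.

```python
import random


def sample_indices(num_rows, k, seed=0):
    """Pick k distinct row indices reproducibly."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(num_rows), k))


indices = sample_indices(num_rows=6984, k=200)
print(len(indices))  # 200

# With the objects from this notebook, the subset would be used like this:
# small_eval = test_dataset.select(indices)
# trainer.evaluate(eval_dataset=small_eval, max_length=max_length)
```

The BLEU score from a subset is only an estimate of the full test-set score, so it is best used for quick comparisons between runs.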