# Machine Translation with Transformer

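Machine translation is the process of using a computer to convert text in one natural language (the source language) into another natural language (the target language).

This project is a PaddlePaddle implementation of Transformer, the mainstream model for machine translation. It covers model training, prediction, and the use of custom data, so that users can build their own translation model on top of the released material.

Transformer is the network architecture proposed in the paper [Attention Is All You Need](https://arxiv.org/pdf/1706.03762.pdf) for sequence-to-sequence (Seq2Seq) learning tasks such as machine translation. It models the sequence-to-sequence mapping entirely with the attention mechanism.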

<center><img src="https://ai-studio-static-online.cdn.bcebos.com/6e48d8033d7b4b8a8a34baa3af7ebdab2d3d852f7e8348a58e462ed3089db668" width="500" height="313" ></center>
<br><center>Figure 1: Transformer network architecture</center></br>

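Compared with the recurrent neural networks (RNNs) widely used in earlier Seq2Seq models, transforming an input sequence into an output sequence with self-attention has the following main advantages:

- Lower computational complexity
	- For a sequence of length n with feature dimension d, an RNN costs O(n * d * d) (n time steps, each a d-dimensional matrix-vector multiplication), whereas self-attention costs O(n * n * d) (every pair of the n time steps computes a d-dimensional dot product or another relevance function). In practice n is usually smaller than d.
- Higher parallelism
	- In an RNN, the computation at the current time step depends on the result of the previous step. In self-attention, each time step depends only on the input and not on the outputs of earlier steps, so all time steps can be computed fully in parallel.
- Easier learning of long-range dependencies
	- In an RNN, relating two positions that are n apart takes n steps. In self-attention, any two positions are directly connected, and the shorter the path, the more easily the signal propagates.

The self-attention-based sequence modeling block introduced by Transformer has since been widely adopted in semantic representation models such as BERT, with remarkable results.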
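
To make the complexity and parallelism points concrete, here is a minimal NumPy sketch of single-head scaled dot-product self-attention. It is an illustration only, not the PaddlePaddle implementation used later; the function name `self_attention` and the random projection matrices are assumptions made for this example.

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over x of shape [n, d]."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v            # each [n, d]
    # [n, n] score matrix: every position is compared with every other one,
    # which is where the O(n * n * d) cost comes from.
    scores = q @ k.T / np.sqrt(k.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ v, weights

rng = np.random.default_rng(0)
n, d = 5, 8                                        # toy sequence length and feature size
x = rng.normal(size=(n, d))
w_q, w_k, w_v = (rng.normal(size=(d, d)) for _ in range(3))
out, weights = self_attention(x, w_q, w_k, w_v)
print(out.shape, weights.shape)
```

All rows of the score matrix are produced by one matrix product, so no position has to wait for another, and any two positions interact through a single weight.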

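# Environment

- PaddlePaddle framework: the AI Studio platform ships with the latest version, 2.1, preinstalled.

- PaddleNLP: deeply compatible with framework 2.1 and presented as the best practice of PaddlePaddle 2.1 in the NLP domain.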


In [1]:
!unzip -o transformer_mt.zip

Archive:  transformer_mt.zip
   creating: transformer_mt/
  inflating: transformer_mt/1918692.ipynb  
  inflating: transformer_mt/get_data_and_model.sh  
 extracting: transformer_mt/mosesdecoder.tar.gz  
  inflating: transformer_mt/preprocess.sh  
 extracting: transformer_mt/requirements.txt  
 extracting: transformer_mt/train_dev_test.tar.gz  
  inflating: transformer_mt/transformer.base.yaml  
  inflating: transformer_mt/utils.py  
   creating: transformer_mt/__pycache__/
  inflating: transformer_mt/__pycache__/utils.cpython-37.pyc  


In [2]:
%cd transformer_mt/

/home/aistudio/transformer_mt


In [3]:
# Install dependencies
!pip install paddlenlp==2.3.2 -i https://pypi.tuna.tsinghua.edu.cn/simple
!pip install -r requirements.txt -i https://pypi.tuna.tsinghua.edu.cn/simple

Looking in indexes: https://pypi.tuna.tsinghua.edu.cn/simple, https://mirrors.aliyun.com/pypi/simple/, https://pypi.tuna.tsinghua.edu.cn/simple/
Collecting paddlenlp==2.3.2
  Downloading https://mirrors.aliyun.com/pypi/packages/23/99/8bc858da7fa76b4e44de3907bbd045a22b1eb6a5c893ab87a521571d4c20/paddlenlp-2.3.2-py3-none-any.whl (1.4 MB)
Collecting datasets>=2.0.0
  Downloading https://mirrors.aliyun.com/pypi/packages/d3/95/ef83542e7a8e2bfc4432ee2cd8a6b52eb30fb1e605871e8871e94ce65fb1/datasets-2.13.2-py3-none-any.whl (512 kB)
Collecting xxhash
  Downloading https://mirrors.aliyun.com/pypi/packages/54/a0/dae1c5dc27601a61897b48a367232c743c760c765d9ab38be1a903cf0d87/xxhash-3.4.1-cp37-cp37m-manylinux_2_17_x86_64.m

# Pipeline
<center><img src="https://ai-studio-static-online.cdn.bcebos.com/b7aafbcc8e2f4bfc9b864d1a1bc0af749260ef9b690c458399f2ba9c66c1ab80" width="1200" height="600" ></center>
<br><center>Figure 2: Pipeline</center></br>

In [33]:
import os
import time
import yaml
import logging
import argparse
import numpy as np
from pprint import pprint
from attrdict import AttrDict
import jieba

from functools import partial
import paddle
import paddle.nn as nn
import paddle.nn.functional as F
import paddle.distributed as dist
from paddle.io import DataLoader,BatchSampler
from paddlenlp.data import Vocab, Pad
from paddlenlp.datasets import load_dataset
# from paddlenlp.transformers import TransformerModel, InferTransformerModel, CrossEntropyCriterion, position_encoding_init
from paddlenlp.transformers import *
from paddlenlp.utils.log import logger

from utils import post_process_seq

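## 1. Data preprocessing
This tutorial uses the Chinese-English data of the [CWMT](http://nlp.nju.edu.cn/cwmt-wmt/) dataset as the training corpus.
CWMT contains more than 9 million entries of relatively high quality, which makes it well suited for training a Transformer translation model.  
Chinese text is segmented with Jieba and then BPE; English text only needs BPE.  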

### BPE (Byte Pair Encoding)
Advantages of BPE:
- it compresses the vocabulary;
- it alleviates the OOV (out-of-vocabulary) problem to some extent.
<center><img src="https://ai-studio-static-online.cdn.bcebos.com/e7a59d7bab514a6fa17d24a116af0b680fbd664439c948799c0a0541dffd35a2" width="1000" height="500" ></center>
<br><center>Figure 3: Learn BPE</center></br>

<center><img src="https://ai-studio-static-online.cdn.bcebos.com/d4b9c48ba7274395af3fbb267c1f9adcba50dd4b147d4258be58999b3b5a198c" width="1000" height="500" ></center>

<br><center>Figure 4: Apply BPE</center></br>


<center><img src="https://ai-studio-static-online.cdn.bcebos.com/c319a24e5612413fb715885d7143f62882eba16ce43943c5b53903963591687c" width="1000" height="500" ></center>
<br><center>Figure 5: Jieba + BPE</center></br>
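
The "learn BPE" step can be sketched in a few lines of plain Python. This is a toy illustration of the merge-learning idea, not the script that `preprocess.sh` runs; the helper names (`count_pairs`, `merge_pair`, `learn_bpe`), the `</w>` end-of-word marker, and the tie-breaking rule are assumptions made for this example. The minimum pair frequency of 2 mirrors the "no pair has frequency >= 2" message in the preprocessing log.

```python
from collections import Counter

def count_pairs(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in vocab.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Replace every occurrence of `pair` with one merged symbol."""
    merged = {}
    for symbols, freq in vocab.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def learn_bpe(word_freqs, num_merges, min_frequency=2):
    """Greedily learn up to `num_merges` merge rules from word frequencies."""
    vocab = {tuple(word) + ("</w>",): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = count_pairs(vocab)
        if not pairs:
            break
        # Most frequent pair; ties are broken lexicographically (an arbitrary choice).
        best = max(pairs, key=lambda p: (pairs[p], p))
        if pairs[best] < min_frequency:
            break
        merges.append(best)
        vocab = merge_pair(best, vocab)
    return merges, vocab

merges, vocab = learn_bpe({"low": 5, "lower": 2, "newest": 6, "widest": 3}, num_merges=3)
print(merges)
print(vocab)
```

"Apply BPE" then replays the learned merges, in order, on each new word, so frequent fragments become single tokens while rare words fall back to smaller pieces instead of `<unk>`.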


In [5]:
# Data preprocessing: jieba word segmentation, BPE segmentation, and vocabulary building.
!bash preprocess.sh

Decompress train_dev_test data...
jieba tokenize...
Building prefix dict from the default dictionary ...
Dumping model to file cache /tmp/jieba.cache
Loading model cost 0.939 seconds.
Prefix dict has been built successfully.
source learn-bpe and apply-bpe...
no pair has frequency >= 2. Stopping
target learn-bpe and apply-bpe...
no pair has frequency >= 2. Stopping
source get-vocab. if loading pretrained model, use its vocab.
target get-vocab. if loading pretrained model, use its vocab.
Over.


In [6]:
# Download the pretrained model
!bash get_data_and_model.sh

Download model.
--2024-01-13 14:34:46--  https://paddlenlp.bj.bcebos.com/models/transformers/transformer/CWMT2021_step_345000.tar.gz
Resolving paddlenlp.bj.bcebos.com (paddlenlp.bj.bcebos.com)... 182.61.200.229, 182.61.200.195, 2409:8c04:1001:1002:0:ff:b001:368a
Connecting to paddlenlp.bj.bcebos.com (paddlenlp.bj.bcebos.com)|182.61.200.229|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1386250752 (1.3G) [application/x-gzip]
Saving to: 'trained_models/CWMT2021_step_345000.tar.gz'


2024-01-13 14:35:28 (31.5 MB/s) - 'trained_models/CWMT2021_step_345000.tar.gz' saved [1386250752/1386250752]

Decompress model.
Over.


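## 2. Building the DataLoader

The `create_data_loader` function below creates the `DataLoader` objects for the training and validation sets,  
and the `create_infer_loader` function creates the `DataLoader` object for the prediction set.  
A `DataLoader` yields the data one batch at a time. The `paddlenlp` built-ins called inside these functions are briefly described here:
* `paddlenlp.data.Vocab.load_vocabulary`: `Vocab` is the vocabulary class. It bundles a set of methods for mapping between text tokens and ids, and a vocabulary can be built from a file, a dict, JSON, and other sources.
* `paddlenlp.datasets.load_dataset`: when creating a dataset from local files, the recommended approach is to write a read function that matches the local data format and pass it to `load_dataset()`.
* `paddlenlp.data.Pad`: the padding operation.

For details, see the [PaddleNLP documentation](https://paddlenlp.readthedocs.io/zh/latest/data_prepare/dataset_self_defined.html).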
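
As a rough picture of what padding does to a batch, the sketch below right-pads id sequences to a common length in plain Python. It only imitates the effect of `paddlenlp.data.Pad` and is not its implementation; `pad_batch` and `round_up` are hypothetical helper names. `round_up` reproduces the arithmetic of the `padding_vocab` lambda used in the loader functions, which rounds the vocabulary size up to a multiple of `pad_factor`.

```python
def pad_batch(sequences, pad_val=0):
    """Right-pad every id sequence to the length of the longest one."""
    max_len = max(len(seq) for seq in sequences)
    return [seq + [pad_val] * (max_len - len(seq)) for seq in sequences]

def round_up(size, factor):
    """Smallest multiple of `factor` that is >= `size`."""
    return (size + factor - 1) // factor * factor

batch = pad_batch([[5, 6, 7], [8], [9, 10]], pad_val=0)
print(batch)                 # three rows of equal length
print(round_up(10003, 8))    # vocabulary size rounded up to a multiple of 8
```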

<center><img src="https://ai-studio-static-online.cdn.bcebos.com/0085d53068134546bcb914347774430c3a2c94cd77934c2e90420ac740d16fc7" width="700" height="350" ></center>
<br><center>Figure 6: DataLoader construction workflow</center></br>


<center><img src="https://ai-studio-static-online.cdn.bcebos.com/3bf38365718346a19c25729bb67e2e6afe8f0bafc61348018d2b9dd60dc9a8bf" width="1000" height="500" ></center>
<br><center>Figure 7: DataLoader details</center></br>

In [17]:
# Custom reader for local data files
def read(src_path, tgt_path, is_predict=False):
    if is_predict:
        with open(src_path, 'r', encoding='utf8') as src_f:
            for src_line in src_f.readlines():
                src_line = src_line.strip()
                if not src_line:
                    continue
                yield {'src':src_line, 'tgt':''}
    else:
        with open(src_path, 'r', encoding='utf8') as src_f, open(tgt_path, 'r', encoding='utf8') as tgt_f:
            for src_line, tgt_line in zip(src_f.readlines(), tgt_f.readlines()):
                src_line = src_line.strip()
                if not src_line:
                    continue
                tgt_line = tgt_line.strip()
                if not tgt_line:
                    continue
                yield {'src':src_line, 'tgt':tgt_line}
# Keep only samples whose source and target lengths (plus one special token) lie within [min_len, max_len]
def min_max_filer(data, max_len, min_len=0):
    # 1 for special tokens.
    data_min_len = min(len(data[0]), len(data[1])) + 1
    data_max_len = max(len(data[0]), len(data[1])) + 1
    return (data_min_len >= min_len) and (data_max_len <= max_len)

In [18]:

# Create the DataLoaders for the training and validation sets
def create_data_loader(args):
    train_dataset = load_dataset(read, src_path=args.training_file.split(',')[0], tgt_path=args.training_file.split(',')[1], lazy=False)
    dev_dataset = load_dataset(read, src_path=args.validation_file.split(',')[0], tgt_path=args.validation_file.split(',')[1], lazy=False)

    src_vocab = Vocab.load_vocabulary(
        args.src_vocab_fpath,
        bos_token=args.special_token[0],
        eos_token=args.special_token[1],
        unk_token=args.special_token[2])
    trg_vocab = Vocab.load_vocabulary(
        args.trg_vocab_fpath,
        bos_token=args.special_token[0],
        eos_token=args.special_token[1],
        unk_token=args.special_token[2])

    padding_vocab = (
        lambda x: (x + args.pad_factor - 1) // args.pad_factor * args.pad_factor
    )
    args.src_vocab_size = padding_vocab(len(src_vocab))
    args.trg_vocab_size = padding_vocab(len(trg_vocab))

    def convert_samples(sample):
        source = sample['src'].split()
        target = sample['tgt'].split()

        source = src_vocab.to_indices(source)
        target = trg_vocab.to_indices(target)

        return source, target

    # DataLoaders for the training set and the validation set
    data_loaders = []
    for i, dataset in enumerate([train_dataset, dev_dataset]):
        dataset = dataset.map(convert_samples, lazy=False).filter(
            partial(min_max_filer, max_len=args.max_length))

        # BatchSampler: https://www.paddlepaddle.org.cn/documentation/docs/zh/api/paddle/io/BatchSampler_cn.html
        batch_sampler = BatchSampler(dataset,batch_size=args.batch_size, shuffle=True,drop_last=False)
        
        # DataLoader: https://www.paddlepaddle.org.cn/documentation/docs/zh/api/paddle/io/DataLoader_cn.html
        data_loader = DataLoader(
            dataset=dataset,
            batch_sampler=batch_sampler,
            collate_fn=partial(
                prepare_train_input,
                bos_idx=args.bos_idx,
                eos_idx=args.eos_idx,
                # this notebook pads with the <s> index (the model's pad_id also defaults to bos_id)
                pad_idx=args.bos_idx),
            num_workers=0,
            return_list=True)
        data_loaders.append(data_loader)

    return data_loaders


def prepare_train_input(insts, bos_idx, eos_idx, pad_idx):
    """
    Put all padded data needed by training into a list.
    """
    word_pad = Pad(pad_idx)
    src_word = word_pad([inst[0] + [eos_idx] for inst in insts])
    trg_word = word_pad([[bos_idx] + inst[1] for inst in insts])
    lbl_word = np.expand_dims(
        word_pad([inst[1] + [eos_idx] for inst in insts]), axis=2)

    data_inputs = [src_word, trg_word, lbl_word]

    return data_inputs
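
The three arrays built by `prepare_train_input` follow the usual teacher-forcing layout: the decoder input is the target shifted right by one position relative to the label. The sketch below shows that layout on a single unpadded pair; `build_training_triplet` is a hypothetical helper written for this illustration, and the ids are made up.

```python
def build_training_triplet(src_ids, trg_ids, bos_idx=0, eos_idx=1):
    """Build encoder input, decoder input, and label for one sentence pair."""
    encoder_input = src_ids + [eos_idx]   # source sentence terminated by <e>
    decoder_input = [bos_idx] + trg_ids   # what the decoder sees at each step
    label = trg_ids + [eos_idx]           # what it must predict at each step
    return encoder_input, decoder_input, label

enc, dec, lbl = build_training_triplet([11, 12, 13], [21, 22])
print(enc, dec, lbl)
```

At step t the decoder reads `dec[t]` and is trained to output `lbl[t]`, so it never sees the token it is asked to predict.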


In [19]:
# Create the DataLoader for the test set; the steps mirror those for the training and validation loaders above
def create_infer_loader(args):
    dataset = load_dataset(read, src_path=args.predict_file, tgt_path=None, is_predict=True, lazy=False)

    src_vocab = Vocab.load_vocabulary(
        args.src_vocab_fpath,
        bos_token=args.special_token[0],
        eos_token=args.special_token[1],
        unk_token=args.special_token[2])
    trg_vocab = Vocab.load_vocabulary(
        args.trg_vocab_fpath,
        bos_token=args.special_token[0],
        eos_token=args.special_token[1],
        unk_token=args.special_token[2])

    padding_vocab = (
        lambda x: (x + args.pad_factor - 1) // args.pad_factor * args.pad_factor
    )
    args.src_vocab_size = padding_vocab(len(src_vocab))
    args.trg_vocab_size = padding_vocab(len(trg_vocab))

    def convert_samples(sample):
        source = sample['src'].split()
        target = sample['tgt'].split()

        source = src_vocab.to_indices(source)
        target = trg_vocab.to_indices(target)

        return source, target

    dataset = dataset.map(convert_samples, lazy=False)

    # BatchSampler: https://www.paddlepaddle.org.cn/documentation/docs/zh/api/paddle/io/BatchSampler_cn.html
    batch_sampler = BatchSampler(dataset,batch_size=args.infer_batch_size,drop_last=False)
    
    # DataLoader: https://www.paddlepaddle.org.cn/documentation/docs/zh/api/paddle/io/DataLoader_cn.html
    data_loader = DataLoader(
        dataset=dataset,
        batch_sampler=batch_sampler,
        collate_fn=partial(
            prepare_infer_input,
            bos_idx=args.bos_idx,
            eos_idx=args.eos_idx,
            # this notebook pads with the <s> index (the model's pad_id also defaults to bos_id)
            pad_idx=args.bos_idx),
        num_workers=0,
        return_list=True)
    return data_loader, trg_vocab.to_tokens

def prepare_infer_input(insts, bos_idx, eos_idx, pad_idx):
    """
    Put all padded data needed by beam search decoder into a list.
    """
    word_pad = Pad(pad_idx)
    src_word = word_pad([inst[0] + [eos_idx] for inst in insts])

    return [src_word, ]

## 3. Building the model
PaddleNLP provides Transformer APIs that can be called directly:
* [`paddlenlp.transformers.TransformerModel`](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/paddlenlp/transformers/transformer/modeling.py#L523): implementation of the Transformer model
* [`paddlenlp.transformers.InferTransformerModel`](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/paddlenlp/transformers/transformer/modeling.py#L702): the Transformer model used for generation
* [`paddlenlp.transformers.CrossEntropyCriterion`](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/paddlenlp/transformers/transformer/modeling.py#L191): computes the cross-entropy loss
* [`paddlenlp.transformers.position_encoding_init`](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/paddlenlp/transformers/transformer/modeling.py#L17): initialization of the Transformer position encoding
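
For intuition about what a position-encoding table contains, here is a NumPy sketch of the sinusoidal encoding as formulated in the original paper. It is an assumption-laden illustration: `sinusoid_position_encoding` is a hypothetical name, and PaddleNLP's `position_encoding_init` may arrange the sine and cosine components differently.

```python
import numpy as np

def sinusoid_position_encoding(n_position, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)); PE[pos, 2i+1] = cos(same angle).

    Assumes d_model is even.
    """
    positions = np.arange(n_position)[:, None]          # [n_position, 1]
    dims = np.arange(0, d_model, 2)[None, :]            # [1, d_model / 2]
    angles = positions / np.power(10000.0, dims / d_model)
    table = np.zeros((n_position, d_model), dtype="float32")
    table[:, 0::2] = np.sin(angles)                     # even feature indices
    table[:, 1::2] = np.cos(angles)                     # odd feature indices
    return table

# Sizes follow the notebook's configuration: max_length + 1 positions, d_model = 512.
table = sinusoid_position_encoding(n_position=257, d_model=512)
print(table.shape)
```

Because the table is a fixed function of position, it can be regenerated for a different maximum length without retraining.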

<center><img src="https://ai-studio-static-online.cdn.bcebos.com/7fdcad8a336d41b3a3b461de2adce5ae5b28317e681b4efdab29f53641866897" width="500" height="250" ></center>
<br><center>Figure 8: Model construction</center></br>


<center><img src="https://ai-studio-static-online.cdn.bcebos.com/fb181b57c2d347b884502d5d11d8c61e918ee803069d4e26bcf4c6533cf948c6" width="1000" height="500" ></center>
<br><center>Figure 9: Example</center></br>


Build the Transformer network. The class below follows the `TransformerModel` class defined in paddlenlp and uses the APIs that paddlenlp provides.
<center><img src="https://ai-studio-static-online.cdn.bcebos.com/6e48d8033d7b4b8a8a34baa3af7ebdab2d3d852f7e8348a58e462ed3089db668" width="500" height="313" ></center>
<br><center>Figure 1: Transformer network architecture</center></br>

In [29]:
class Transformer(nn.Layer):
    def __init__(
        self,
        src_vocab_size,
        trg_vocab_size,
        max_length,
        num_encoder_layers,
        num_decoder_layers,
        n_head,
        d_model,
        d_inner_hid,
        dropout,
        weight_sharing,
        attn_dropout=None,
        act_dropout=None,
        bos_id=0,
        eos_id=1,
        pad_id=None,
        activation="relu",
        normalize_before=True,
    ):
        super(Transformer, self).__init__()
        self.trg_vocab_size = trg_vocab_size
        self.emb_dim = d_model
        self.bos_id = bos_id
        self.eos_id = eos_id
        self.pad_id = pad_id if pad_id is not None else self.bos_id
        self.dropout = dropout

        self.src_word_embedding = WordEmbedding(vocab_size=src_vocab_size, emb_dim=d_model, bos_id=self.pad_id)
        self.src_pos_embedding = PositionalEmbedding(emb_dim=d_model, max_length=max_length)
        if weight_sharing:
            assert (
                src_vocab_size == trg_vocab_size
            ), "Vocabularies in source and target should be same for weight sharing."
            self.trg_word_embedding = self.src_word_embedding
            self.trg_pos_embedding = self.src_pos_embedding
        else:
            self.trg_word_embedding = WordEmbedding(vocab_size=trg_vocab_size, emb_dim=d_model, bos_id=self.pad_id)
            self.trg_pos_embedding = PositionalEmbedding(emb_dim=d_model, max_length=max_length)

        if not normalize_before:
            encoder_layer = TransformerEncoderLayer(
                d_model=d_model,
                nhead=n_head,
                dim_feedforward=d_inner_hid,
                dropout=dropout,
                activation=activation,
                attn_dropout=attn_dropout,
                act_dropout=act_dropout,
                normalize_before=normalize_before,
            )
            encoder_with_post_norm = TransformerEncoder(encoder_layer, num_encoder_layers)

            decoder_layer = TransformerDecoderLayer(
                d_model=d_model,
                nhead=n_head,
                dim_feedforward=d_inner_hid,
                dropout=dropout,
                activation=activation,
                attn_dropout=attn_dropout,
                act_dropout=act_dropout,
                normalize_before=normalize_before,
            )
            decoder_with_post_norm = TransformerDecoder(decoder_layer, num_decoder_layers)

        self.transformer = paddle.nn.Transformer(
            d_model=d_model,
            nhead=n_head,
            num_encoder_layers=num_encoder_layers,
            num_decoder_layers=num_decoder_layers,
            dim_feedforward=d_inner_hid,
            dropout=dropout,
            attn_dropout=attn_dropout,
            act_dropout=act_dropout,
            activation=activation,
            normalize_before=normalize_before,
            custom_encoder=None if normalize_before else encoder_with_post_norm,
            custom_decoder=None if normalize_before else decoder_with_post_norm,
        )

        if weight_sharing:
            self.linear = lambda x: paddle.matmul(
                x=x, y=self.trg_word_embedding.word_embedding.weight, transpose_y=True
            )
        else:
            self.linear = nn.Linear(in_features=d_model, out_features=trg_vocab_size, bias_attr=False)

    def forward(self, src_word, trg_word):
        src_max_len = paddle.shape(src_word)[-1]
        trg_max_len = paddle.shape(trg_word)[-1]
        src_slf_attn_bias = (
            paddle.cast(src_word == self.pad_id, dtype=paddle.get_default_dtype()).unsqueeze([1, 2]) * -1e4
        )
        src_slf_attn_bias.stop_gradient = True
        trg_slf_attn_bias = self.transformer.generate_square_subsequent_mask(trg_max_len)
        trg_slf_attn_bias.stop_gradient = True
        trg_src_attn_bias = src_slf_attn_bias
        src_pos = paddle.cast(src_word != self.pad_id, dtype=src_word.dtype) * paddle.arange(
            start=0, end=src_max_len, dtype=src_word.dtype
        )
        trg_pos = paddle.cast(trg_word != self.pad_id, dtype=src_word.dtype) * paddle.arange(
            start=0, end=trg_max_len, dtype=trg_word.dtype
        )

        with paddle.static.amp.fp16_guard():
            src_emb = self.src_word_embedding(src_word)
            src_pos_emb = self.src_pos_embedding(src_pos)
            src_emb = src_emb + src_pos_emb
            enc_input = F.dropout(src_emb, p=self.dropout, training=self.training) if self.dropout else src_emb

            trg_emb = self.trg_word_embedding(trg_word)
            trg_pos_emb = self.trg_pos_embedding(trg_pos)
            trg_emb = trg_emb + trg_pos_emb
            dec_input = F.dropout(trg_emb, p=self.dropout, training=self.training) if self.dropout else trg_emb

            dec_output = self.transformer(
                enc_input,
                dec_input,
                src_mask=src_slf_attn_bias,
                tgt_mask=trg_slf_attn_bias,
                memory_mask=trg_src_attn_bias,
            )

            predict = self.linear(dec_output)

        return predict
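
The two attention biases built in `forward` can be reproduced on toy inputs with NumPy. The sketch below is an approximation for illustration: `padding_bias` and `causal_bias` are hypothetical helpers that imitate `src_slf_attn_bias` and the mask returned by `generate_square_subsequent_mask`, under the assumption that the latter uses `-inf` above the diagonal.

```python
import numpy as np

def padding_bias(word_ids, pad_id, neg=-1e4):
    """[batch, 1, 1, src_len] bias: `neg` at padding positions, 0 elsewhere."""
    is_pad = (np.asarray(word_ids) == pad_id).astype("float32")
    return is_pad[:, None, None, :] * neg

def causal_bias(length):
    """[length, length] bias: -inf strictly above the diagonal, 0 on and below it."""
    return np.triu(np.full((length, length), -np.inf, dtype="float32"), k=1)

src_bias = padding_bias([[7, 8, 0, 0]], pad_id=0)
trg_bias = causal_bias(3)
print(src_bias.shape)
print(trg_bias)
```

Both biases are added to the attention scores before the softmax, so masked positions receive a weight of (almost) zero: padding never contributes, and a target position cannot attend to later positions.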

Build the InferTransformer network used for generation, again with the APIs provided by paddlenlp.

In [38]:
class InferTransformer(Transformer):
    def __init__(
        self,
        src_vocab_size,
        trg_vocab_size,
        max_length,
        num_encoder_layers,
        num_decoder_layers,
        n_head,
        d_model,
        d_inner_hid,
        dropout,
        weight_sharing,
        attn_dropout=None,
        act_dropout=None,
        bos_id=0,
        eos_id=1,
        pad_id=None,
        beam_size=4,
        max_out_len=256,
        output_time_major=False,
        beam_search_version="v1",
        activation="relu",
        normalize_before=True,
        **kwargs
    ):
        args = dict(locals())
        args.pop("self")
        args.pop("__class__", None)
        self.beam_size = args.pop("beam_size")
        self.max_out_len = args.pop("max_out_len")
        self.output_time_major = args.pop("output_time_major")
        self.dropout = dropout
        self.beam_search_version = args.pop("beam_search_version")
        kwargs = args.pop("kwargs")
        if self.beam_search_version == "v2":
            self.alpha = kwargs.get("alpha", 0.6)
            self.rel_len = kwargs.get("rel_len", False)
        super(InferTransformer, self).__init__(**args)

        cell = TransformerDecodeCell(
            self.transformer.decoder, self.trg_word_embedding, self.trg_pos_embedding, self.linear, self.dropout
        )

        self.decode = TransformerBeamSearchDecoder(cell, bos_id, eos_id, beam_size, var_dim_in_state=2)

    def forward(self, src_word, trg_word=None):
        if trg_word is not None:
            trg_length = paddle.sum(paddle.cast(trg_word != self.pad_id, dtype="int32"), axis=-1)
        else:
            trg_length = None

        if self.beam_search_version == "v1":
            src_max_len = paddle.shape(src_word)[-1]
            src_slf_attn_bias = (
                paddle.cast(src_word == self.pad_id, dtype=paddle.get_default_dtype()).unsqueeze([1, 2]) * -1e4
            )
            trg_src_attn_bias = src_slf_attn_bias
            src_pos = paddle.cast(src_word != self.pad_id, dtype=src_word.dtype) * paddle.arange(
                start=0, end=src_max_len, dtype=src_word.dtype
            )

            # Run encoder
            src_emb = self.src_word_embedding(src_word)
            src_pos_emb = self.src_pos_embedding(src_pos)
            src_emb = src_emb + src_pos_emb
            enc_input = F.dropout(src_emb, p=self.dropout, training=False) if self.dropout else src_emb
            enc_output = self.transformer.encoder(enc_input, src_slf_attn_bias)

            # Init states (caches) for transformer, need to be updated according to selected beam
            incremental_cache, static_cache = self.transformer.decoder.gen_cache(enc_output, do_zip=True)

            static_cache, enc_output, trg_src_attn_bias = TransformerBeamSearchDecoder.tile_beam_merge_with_batch(
                (static_cache, enc_output, trg_src_attn_bias), self.beam_size
            )

            rs, _ = nn.decode.dynamic_decode(
                decoder=self.decode,
                inits=incremental_cache,
                max_step_num=self.max_out_len,
                memory=enc_output,
                trg_src_attn_bias=trg_src_attn_bias,
                static_cache=static_cache,
                is_test=True,
                output_time_major=self.output_time_major,
                trg_word=trg_word,
                trg_length=trg_length,
            )

            return rs

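        elif self.beam_search_version == "v2":
            # NOTE: `beam_search_v2` is not defined by the Transformer or InferTransformer
            # classes in this notebook, so only beam_search_version="v1" works as written.
            finished_seq, finished_scores = self.beam_search_v2(
                src_word, self.beam_size, self.max_out_len, self.alpha, trg_word, trg_length
            )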
            if self.output_time_major:
                finished_seq = finished_seq.transpose([2, 0, 1])
            else:
                finished_seq = finished_seq.transpose([0, 2, 1])

            return finished_seq
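
The idea behind beam search decoding can be shown with a small pure-Python sketch. It is deliberately simplified and is not how `TransformerBeamSearchDecoder` is implemented: there is no batching, caching, or length penalty, and finished hypotheses simply leave the beam. The function `beam_search` and the probability table in `toy_model` are assumptions invented for this example.

```python
import math

def beam_search(step_log_probs, beam_size, bos_id, eos_id, max_len):
    """step_log_probs(prefix) -> {token_id: log_prob} for the next token."""
    beams = [([bos_id], 0.0)]          # (tokens so far, summed log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            for tok, logp in step_log_probs(tokens).items():
                candidates.append((tokens + [tok], score + logp))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for tokens, score in candidates[:beam_size]:   # keep the best `beam_size`
            (finished if tokens[-1] == eos_id else beams).append((tokens, score))
        if not beams:
            break
    finished.extend(beams)             # hypotheses cut off at max_len
    return sorted(finished, key=lambda c: c[1], reverse=True)

def toy_model(prefix):
    # Hypothetical next-token distributions keyed by the prefix generated so far.
    table = {
        (0,): {2: 0.6, 3: 0.4},
        (0, 2): {1: 0.4, 2: 0.3, 3: 0.3},
        (0, 3): {1: 0.9, 2: 0.05, 3: 0.05},
    }
    probs = table.get(tuple(prefix), {1: 1.0})
    return {tok: math.log(p) for tok, p in probs.items()}

greedy = beam_search(toy_model, beam_size=1, bos_id=0, eos_id=1, max_len=5)
wide = beam_search(toy_model, beam_size=2, bos_id=0, eos_id=1, max_len=5)
print(greedy[0][0], wide[0][0])
```

With a beam of 1 the search commits to the locally best first token and ends on a less likely sentence. A beam of 2 keeps the runner-up alive long enough to find the higher-probability one.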

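## 4. Training the model
Run the `do_train` function, which configures the optimizer, the loss function, and the evaluation metric, perplexity.  

Perplexity measures how well a language model predicts text, which can be read as how fluent the model finds a sentence. It is widely used in machine translation and text generation. The lower the perplexity, the more fluent the sentence and the better the language model.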
<center><img src="https://ai-studio-static-online.cdn.bcebos.com/d6e0d38ae1d94deea1cc299a785b96663317a33d38e84f208c528c4dd03e83f2" width="600" height="300" ></center>
<br><center>Figure 10: Training the model</center></br>
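
Perplexity is the exponential of the average per-token negative log-likelihood, which is why the training code reports `np.exp` of the average loss. A minimal sketch, with `perplexity` as a hypothetical helper name:

```python
import math

def perplexity(token_log_probs):
    """exp of the average negative log-likelihood over the tokens."""
    avg_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_nll)

# A model that spreads probability uniformly over 10000 tokens is as
# "confused" as a 10000-way guess.
uniform = [math.log(1 / 10000)] * 20
print(perplexity(uniform))

# The first logged training step reports avg loss 10.525280; exponentiating it
# should reproduce the logged ppl up to rounding.
print(math.exp(10.525280))
```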


In [30]:
def do_train(args):
    if args.use_gpu:
        place = "gpu"
    else:
        place = "cpu"
    paddle.set_device(place)
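    # Set seed for CE. The YAML value may be the string 'None', so parse it without eval().
    random_seed = None if str(args.random_seed) == "None" else int(args.random_seed)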
    if random_seed is not None:
        paddle.seed(random_seed)

    # Define data loader
    (train_loader), (eval_loader) = create_data_loader(args)

    # Define model
    transformer = Transformer(
        src_vocab_size=args.src_vocab_size,
        trg_vocab_size=args.trg_vocab_size,
        max_length=args.max_length + 1,
        num_encoder_layers=args.n_layer,
        num_decoder_layers=args.n_layer,
        n_head=args.n_head,
        d_model=args.d_model,
        d_inner_hid=args.d_inner_hid,
        dropout=args.dropout,
        weight_sharing=args.weight_sharing,
        bos_id=args.bos_idx,
        eos_id=args.eos_idx)

    # Define loss
    criterion = CrossEntropyCriterion(args.label_smooth_eps, args.bos_idx)

    scheduler = paddle.optimizer.lr.NoamDecay(
        args.d_model, args.warmup_steps, args.learning_rate, last_epoch=0)

    # Define optimizer
    optimizer = paddle.optimizer.Adam(
        learning_rate=scheduler,
        beta1=args.beta1,
        beta2=args.beta2,
        epsilon=float(args.eps),
        parameters=transformer.parameters())

    step_idx = 0

    # Train loop
    for pass_id in range(args.epoch):
        batch_id = 0
        for input_data in train_loader:

            (src_word, trg_word, lbl_word) = input_data

            logits = transformer(src_word=src_word, trg_word=trg_word)

            sum_cost, avg_cost, token_num = criterion(logits, lbl_word)
            
            # Compute gradients
            avg_cost.backward()
            # Update parameters
            optimizer.step()
            # Clear gradients
            optimizer.clear_grad()

            if (step_idx + 1) % args.print_step == 0 or step_idx == 0:
                total_avg_cost = avg_cost.numpy()
                logger.info(
                    "step_idx: %d, epoch: %d, batch: %d, avg loss: %f, "
                    " ppl: %f " %
                    (step_idx, pass_id, batch_id, total_avg_cost,
                        np.exp([min(total_avg_cost, 100)])))

            if (step_idx + 1) % args.save_step == 0:
                # Validation
                transformer.eval()
                total_sum_cost = 0
                total_token_num = 0
                with paddle.no_grad():
                    for input_data in eval_loader:
                        (src_word, trg_word, lbl_word) = input_data
                        logits = transformer(
                            src_word=src_word, trg_word=trg_word)
                        sum_cost, avg_cost, token_num = criterion(logits,
                                                                  lbl_word)
                        total_sum_cost += sum_cost.numpy()
                        total_token_num += token_num.numpy()
                        total_avg_cost = total_sum_cost / total_token_num
                    logger.info("validation, step_idx: %d, avg loss: %f, "
                                " ppl: %f" %
                                (step_idx, total_avg_cost,
                                 np.exp([min(total_avg_cost, 100)])))
                transformer.train()

                if args.save_model:
                    model_dir = os.path.join(args.save_model,
                                             "step_" + str(step_idx))
                    if not os.path.exists(model_dir):
                        os.makedirs(model_dir)
                    paddle.save(transformer.state_dict(),
                                os.path.join(model_dir, "transformer.pdparams"))
                    paddle.save(optimizer.state_dict(),
                                os.path.join(model_dir, "transformer.pdopt"))
            batch_id += 1
            step_idx += 1
            scheduler.step()


    if args.save_model:
        model_dir = os.path.join(args.save_model, "step_final")
        if not os.path.exists(model_dir):
            os.makedirs(model_dir)
        paddle.save(transformer.state_dict(),
                    os.path.join(model_dir, "transformer.pdparams"))
        paddle.save(optimizer.state_dict(),
                    os.path.join(model_dir, "transformer.pdopt"))
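
The learning-rate schedule used above, `NoamDecay`, warms up linearly and then decays with the inverse square root of the step. The sketch below writes out that formula in plain Python. `noam_lr` is a hypothetical helper, `warmup_steps=4000` is an assumed value (the setting is not visible in the truncated configuration printout), and clamping the step to at least 1 is a simplification of how step 0 is handled.

```python
def noam_lr(step, d_model=512, warmup_steps=4000, learning_rate=2.0):
    """lr = learning_rate * d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5)."""
    step = max(step, 1)   # avoid 0 ** -0.5 at the very first step
    return learning_rate * d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

for step in (1, 1000, 4000, 16000):
    print(step, noam_lr(step))
```

The rate peaks at `step == warmup_steps`. This explains why `learning_rate: 2.0` in the configuration is only a scale factor: the effective rate stays orders of magnitude smaller.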

In [31]:
# Load the configuration
yaml_file = 'transformer.base.yaml'
with open(yaml_file, 'rt') as f:
    args = AttrDict(yaml.safe_load(f))
    pprint(args)

{'batch_size': 50,
 'beam_size': 5,
 'beta1': 0.9,
 'beta2': 0.997,
 'bos_idx': 0,
 'd_inner_hid': 2048,
 'd_model': 512,
 'dropout': 0.1,
 'eos_idx': 1,
 'epoch': 50,
 'eps': '1e-9',
 'infer_batch_size': 50,
 'init_from_params': 'trained_models/CWMT2021_step_345000/',
 'label_smooth_eps': 0.1,
 'learning_rate': 2.0,
 'max_length': 256,
 'max_out_len': 256,
 'n_best': 1,
 'n_head': 8,
 'n_layer': 6,
 'output_file': 'train_dev_test/predict.txt',
 'pad_factor': 8,
 'predict_file': 'train_dev_test/ccmt2019-news.zh2en.source_bpe',
 'print_step': 10,
 'random_seed': 'None',
 'save_model': 'trained_models',
 'save_step': 20,
 'special_token': ['<s>', '<e>', '<unk>'],
 'src_vocab_fpath': 'train_dev_test/vocab.ch.src',
 'src_vocab_size': 10000,
 'training_file': 'train_dev_test/train.ch.bpe,train_dev_test/train.en.bpe',
 'trg_vocab_fpath': 'train_dev_test/vocab.en.tgt',
 'trg_vocab_size': 10000,
 'unk_idx': 2,
 'use_gpu': True,
 'validation_file': 'train_dev_

In [34]:
do_train(args)

[2024-01-13 14:54:13,881] [    INFO] - step_idx: 0, epoch: 0, batch: 0, avg loss: 10.525280,  ppl: 37245.261719 
[2024-01-13 14:54:14,986] [    INFO] - step_idx: 9, epoch: 0, batch: 9, avg loss: 10.511203,  ppl: 36724.625000 
[2024-01-13 14:54:16,280] [    INFO] - step_idx: 19, epoch: 0, batch: 19, avg loss: 10.476138,  ppl: 35459.203125 
[2024-01-13 14:54:16,392] [    INFO] - validation, step_idx: 19, avg loss: 10.481450,  ppl: 35648.062500
[2024-01-13 14:54:29,635] [    INFO] - step_idx: 29, epoch: 1, batch: 9, avg loss: 10.414930,  ppl: 33353.914062 
[2024-01-13 14:54:30,940] [    INFO] - step_idx: 39, epoch: 1, batch: 19, avg loss: 10.366558,  ppl: 31778.908203 
[2024-01-13 14:54:31,067] [    INFO] - validation, step_idx: 39, avg loss: 10.387623,  ppl: 32455.423828
[2024-01-13 14:54:44,980] [    INFO] - step_idx: 49, epoch: 2, batch: 9, avg loss: 10.295083,  ppl: 29586.783203 
[2024-01-13 14:54:46,769] [    INFO] - step_idx: 59, epoch: 2, batch: 19, avg loss: 10.218628,  pp

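## 5. Prediction and evaluation
The final quality of a trained model is usually measured on a test set; in machine translation, the standard metric is the BLEU score.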
<center><img src="https://ai-studio-static-online.cdn.bcebos.com/cf7161d1059e4ca9989e03fc09f197c63590185720de444fa7c2d15ac3bee696" width="600" height="300" ></center>
<br><center>Figure 11: Prediction and evaluation</center></br>


In [39]:
def do_predict(args):
    if args.use_gpu:
        place = "gpu"
    else:
        place = "cpu"
    paddle.set_device(place)

    # Define data loader
    test_loader, to_tokens = create_infer_loader(args)

    # Define model
    transformer = InferTransformer(
        src_vocab_size=args.src_vocab_size,
        trg_vocab_size=args.trg_vocab_size,
        max_length=args.max_length + 1,
        num_encoder_layers=args.n_layer,
        num_decoder_layers=args.n_layer,
        n_head=args.n_head,
        d_model=args.d_model,
        d_inner_hid=args.d_inner_hid,
        dropout=args.dropout,
        weight_sharing=args.weight_sharing,
        bos_id=args.bos_idx,
        eos_id=args.eos_idx,
        beam_size=args.beam_size,
        max_out_len=args.max_out_len)

    # Load the trained model
    assert args.init_from_params, (
        "Please set init_from_params to load the infer model.")

    model_dict = paddle.load(
        os.path.join(args.init_from_params, "transformer.pdparams"))

    # To avoid a longer length than training, reset the size of position
    # encoding to max_length
    model_dict["encoder.pos_encoder.weight"] = position_encoding_init(
        args.max_length + 1, args.d_model)
    model_dict["decoder.pos_encoder.weight"] = position_encoding_init(
        args.max_length + 1, args.d_model)
    transformer.load_dict(model_dict)

    # Set evaluate mode
    transformer.eval()

    # The context manager closes the output file even if decoding raises an error
    with open(args.output_file, "w", encoding="utf-8") as f, paddle.no_grad():
        for (src_word, ) in test_loader:
            finished_seq = transformer(src_word=src_word)
            # [batch, seq_len, beam] -> [batch, beam, seq_len]
            finished_seq = finished_seq.numpy().transpose([0, 2, 1])
            for ins in finished_seq:
                for beam_idx, beam in enumerate(ins):
                    if beam_idx >= args.n_best:
                        break
                    id_list = post_process_seq(beam, args.bos_idx, args.eos_idx)
                    word_list = to_tokens(id_list)
                    sequence = " ".join(word_list) + "\n"
                    f.write(sequence)

In [40]:
do_predict(args)

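### Model evaluation
Each line of the prediction output is the highest-scoring translation of the corresponding input line. Because the data is BPE-encoded, the predicted translations are also in BPE form and must be restored to the original data (here, the tokenized text) before they can be evaluated correctly.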
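
The restoration step amounts to deleting the `@@ ` continuation markers that BPE leaves between word pieces. A small Python equivalent of the `sed` expression used in the next cell, with `remove_bpe` as a hypothetical helper name and made-up sample strings:

```python
import re

def remove_bpe(line):
    """Join BPE pieces by removing '@@ ' markers (and a trailing '@@')."""
    return re.sub(r"(@@ )|(@@ ?$)", "", line)

print(remove_bpe("the trans@@ form@@ er mod@@ el"))
print(remove_bpe("end@@"))
```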

In [41]:
# Restore the predictions in predict.txt to tokenized text
! sed -r 's/(@@ )|(@@ ?$)//g' train_dev_test/predict.txt > train_dev_test/predict.tok.txt
# The BLEU evaluation tool comes from https://github.com/moses-smt/mosesdecoder.git
! tar -zxf mosesdecoder.tar.gz
# Compute multi-bleu
! perl mosesdecoder/scripts/generic/multi-bleu.perl train_dev_test/ccmt2019-news.zh2en.ref*.txt < train_dev_test/predict.tok.txt

BLEU = 38.11, 74.5/49.1/32.5/21.7 (BP=0.951, ratio=0.952, hyp_len=22252, ref_len=23371)
It is not advisable to publish scores from multi-bleu.perl.  The scores depend on your tokenizer, which is unlikely to be reproducible from your paper or consistent across research groups.  Instead you should detokenize then use mteval-v14.pl, which has a standard tokenization.  Scores from multi-bleu.perl can still be used for internal purposes when you have a consistent tokenizer.
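
The numbers in the BLEU line fit together as follows: BLEU is the brevity penalty times the geometric mean of the 1- to 4-gram precisions. The sketch below recomputes the score from the rounded values printed above, so it should land close to, but not necessarily exactly on, the reported 38.11. It is a plausibility check, not a replacement for `multi-bleu.perl`.

```python
import math

# Values copied from the multi-bleu.perl output above.
precisions = [0.745, 0.491, 0.325, 0.217]   # 1-gram to 4-gram precision
hyp_len, ref_len = 22252, 23371

# Brevity penalty: penalizes hypotheses shorter than the reference.
brevity_penalty = 1.0 if hyp_len >= ref_len else math.exp(1 - ref_len / hyp_len)
# Geometric mean of the n-gram precisions (uniform weights).
geo_mean = math.exp(sum(math.log(p) for p in precisions) / len(precisions))
bleu = 100 * brevity_penalty * geo_mean
print(f"BP = {brevity_penalty:.3f}, BLEU = {bleu:.2f}")
```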
