# Example 2: Date String Conversion

> Train an encoder-decoder model that converts date strings from one format to another.
>
> For example, from `April 22, 2019` to `2019-04-22`.

In [46]:
from datetime import datetime
from datetime import date

# date.fromordinal(): return the date corresponding to a proleptic Gregorian ordinal
dt = date.fromordinal(1)
dt, dt.strftime("%d, %Y"), dt.isoformat()

(datetime.date(1, 1, 1), '01, 0001', '0001-01-01')

In [47]:
# date.toordinal(): return the proleptic Gregorian ordinal of a date, where January 1 of year 1 has ordinal 1.
dt = datetime.now()  # current date and time
ordinal_dt = date.toordinal(dt)
dt, ordinal_dt

(datetime.datetime(2022, 5, 18, 1, 7, 14, 720711), 738293)

1. Generate random dates and show them in both the input format and the target format.

In [48]:
initialization(42)

In [49]:
MONTHS = ["January", "February", "March", "April", "May", "June", 
          "July", "August", "September", "October", "November", "December"]

def random_dates(n_dates):
    min_date = date(1000,1,1).toordinal()
    max_date = date(9999,12,31).toordinal()
    
    ordinals = np.random.randint(low=min_date, high=max_date+1, size=n_dates)
    dates = [date.fromordinal(ordinal) for ordinal in ordinals]
    
    # `d` avoids shadowing the imported `date` class inside the comprehensions
    X = [MONTHS[d.month-1]+" "+d.strftime("%d, %Y") for d in dates]
    y = [d.isoformat() for d in dates]
    
    return X, y

In [50]:
random_dates(1)

(['November 29, 1333'], ['1333-11-29'])

In [51]:
n_dates = 5
X_example, y_example = random_dates(n_dates)

print("{:25s}{:25s}".format("Input", "Target"))
print("-" * 40)
for i in range(n_dates):
    print("{:25s}{:25s}".format(X_example[i], y_example[i]))

Input                    Target                   
----------------------------------------
July 24, 2837            2837-07-24               
March 21, 1361           1361-03-21               
August 19, 2001          2001-08-19               
August 10, 1709          1709-08-10               
September 03, 2763       2763-09-03               


2. Build the input and target vocabularies (character sets)

In [52]:
# Input character set
input_chars = "".join(sorted(set("".join(MONTHS) + "0123456789, ")))
input_chars

' ,0123456789ADFJMNOSabceghilmnoprstuvy'

In [53]:
# Target character set
output_chars = "0123456789-"
output_chars

'0123456789-'

In [54]:
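src_vocab = len(input_chars)     # size of the input character set: 38
tgt_vocab = len(output_chars)    # size of the target character set: 11

3. Write a function that converts a string into a list of IDs, **with IDs starting at 1** (0 is used later as the padding value)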

In [55]:
# def date_str_to_ids(date_str, chars_list):
#     return [chars_list.index(c) for c in date_str]


def date_str_to_ids(date_str, chars_list):
    return [chars_list.index(c) + 1 for c in date_str]

In [56]:
print("{:25s}{:25s}".format("Input", "Target"))
print("-" * 40)
print("{:25s}{:25s}".format(X_example[0], y_example[0]))

Input                    Target                   
----------------------------------------
July 24, 2837            2837-07-24               


In [57]:
date_str_to_ids(X_example[0], chars_list=input_chars)    # input

[16, 36, 28, 38, 1, 5, 7, 2, 1, 5, 11, 6, 10]

In [58]:
date_str_to_ids(y_example[0], chars_list=output_chars)   # target

[3, 9, 4, 8, 11, 1, 8, 11, 3, 5]

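4. Handling variable-length sequences

    - `Variable-length input sequences`:
        - These can be handled by **padding the shorter sequences** so that all sequences in a batch have the same length, and by **using a mask so that the model ignores the padding tokens**. For better performance, you may also want to build batches from sequences of similar length.
        In `PyTorch`, `pad_sequence()` performs the padding.
        - **Ragged tensors can hold variable-length sequences**.
    
    - `Variable-length output sequences`:
        - If the length of the output sequence is known in advance, you only need to configure the loss function so that it ignores tokens past the end of the sequence. Likewise, code that uses the model should ignore tokens beyond the end of the sequence.
        - In general, however, the output length is not known in advance, so the solution is to **train the model to emit an end-of-sequence token at the end of every sequence.**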
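
The padding-plus-mask idea can be sketched without any tensor library. The block below is a minimal pure-Python illustration, assuming id 0 is the padding value (as in this example); `pad_batch` and `padding_mask` are hypothetical helper names, not part of PyTorch.

```python
PAD_ID = 0  # assumption: id 0 is reserved for padding, as in this example


def pad_batch(sequences, pad_id=PAD_ID):
    """Right-pad every sequence to the length of the longest one."""
    max_len = max(len(seq) for seq in sequences)
    return [list(seq) + [pad_id] * (max_len - len(seq)) for seq in sequences]


def padding_mask(padded, pad_id=PAD_ID):
    """True marks a real token, False marks padding the model should ignore."""
    return [[token != pad_id for token in row] for row in padded]


batch = [[16, 36, 28], [17, 21], [13]]
padded = pad_batch(batch)
mask = padding_mask(padded)
print(padded)  # [[16, 36, 28], [17, 21, 0], [13, 0, 0]]
print(mask)
```

This only works because no real character is ever mapped to id 0, which is why the IDs in step 3 start at 1.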

In [59]:
from torch.nn.utils.rnn import pad_sequence

In [60]:
def prepare_date_strs(date_strs, chars=input_chars):
    X_ids = [torch.tensor(date_str_to_ids(date, chars)) for date in date_strs]
    X = pad_sequence(X_ids, batch_first=True, padding_value=0)
    return X  

In [61]:
prepare_date_strs(X_example[0:4], input_chars)

tensor([[16, 36, 28, 38,  1,  5,  7,  2,  1,  5, 11,  6, 10,  0,  0],
        [17, 21, 33, 23, 26,  1,  5,  4,  2,  1,  4,  6,  9,  4,  0],
        [13, 36, 25, 36, 34, 35,  1,  4, 12,  2,  1,  5,  3,  3,  4],
        [13, 36, 25, 36, 34, 35,  1,  4,  3,  2,  1,  4, 10,  3, 12]])

5. Batching and masking

In [62]:
class Batch:

    def __init__(self, src, tgt, pad=0):
        """
        :param pad: id of the padding token <blank> (default 0)
        """
        self.src = src
        # False where the token equals `pad`, True everywhere else;
        # unsqueeze(-2) inserts a new axis before the last one
        self.src_mask = (src != pad).unsqueeze(-2)

        if tgt is not None:
            self.tgt = tgt[:, :-1]  # decoder input: everything except the last token
            self.tgt_y = tgt[:, 1:]  # expected decoder output: everything except the first (start) token
            self.tgt_mask = self.make_std_mask(self.tgt, pad)
            self.ntokens = (self.tgt_y != pad).data.sum()  # number of non-padding target tokens

    @staticmethod
    # a static method can be called without instantiating the class
    def make_std_mask(tgt, pad):
        """
        Build a mask that hides both padding and future words.
        """
        tgt_mask = (tgt != pad).unsqueeze(-2)
        sequence_len = tgt.size(-1)  # number of time steps (longest sequence in the batch)
        tgt_mask = tgt_mask & subsequent_mask(size=sequence_len).type_as(
            tgt_mask.data
            # &: element-wise AND of the two boolean masks
            # subsequent_mask() returns a tensor of shape (1, size, size)
            # type_as(): cast to the dtype of tgt_mask
        )
        return tgt_mask
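
To see what the combined target mask looks like, here is a minimal pure-Python sketch for a single sequence. It assumes the usual lower-triangular "no peeking ahead" mask; `causal_rows` and `std_mask_rows` are hypothetical names chosen so they do not clash with the notebook's own `subsequent_mask` and `make_std_mask`.

```python
def causal_rows(size):
    """Lower-triangular mask: position i may attend to positions <= i."""
    return [[j <= i for j in range(size)] for i in range(size)]


def std_mask_rows(tgt_row, pad=0):
    """Combine the padding mask and the causal mask for one target sequence."""
    not_pad = [token != pad for token in tgt_row]
    causal = causal_rows(len(tgt_row))
    return [[not_pad[j] and causal[i][j] for j in range(len(tgt_row))]
            for i in range(len(tgt_row))]


example_mask = std_mask_rows([12, 3, 9, 0])  # last position is padding
for row in example_mask:
    print("".join("x" if allowed else "." for allowed in row))
```

Each printed row shows which positions one decoder step may attend to: earlier tokens are visible, later tokens and the padded column are hidden.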

6. Build the dataset

In [63]:
sos_id = tgt_vocab + 1   # 11+1=12

In [64]:
def shift_output_sequences(y, device=None):
    # Prepend <sos> and drop the last position of y.
    # Note: Batch slices this tensor again ([:, :-1] and [:, 1:]),
    # so the final character of y never appears as a training target.
    sos_token = torch.full((len(y), 1), sos_id, dtype=y.dtype, device=y.device)
    decoder = torch.cat((sos_token, y[:, :-1]), dim=1)
    if device == "cuda":
        decoder = decoder.cuda()
    return decoder

In [65]:
def create_dataset(n_dates, device=None):
    X, y = random_dates(n_dates)
    X_pre = prepare_date_strs(X, input_chars)
    y_pre = prepare_date_strs(y, output_chars)
    y_pre_shift = shift_output_sequences(y_pre, device=device)
    return X_pre, y_pre_shift
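
As a reference point, the general teacher-forcing scheme can be sketched in plain Python: the decoder input is the target shifted right by one position behind a start token, and the expected output is the target itself. This is only an illustration of the idea (`teacher_forcing_pair` is a hypothetical helper); the notebook applies its own slicing in `shift_output_sequences` and again in `Batch`.

```python
SOS_ID = 12  # same value as sos_id = tgt_vocab + 1 in this example


def teacher_forcing_pair(target_ids, sos_id=SOS_ID):
    """Return (decoder_input, expected_output) for one target sequence."""
    full = [sos_id] + list(target_ids)
    return full[:-1], full[1:]


target = [3, 9, 4, 8, 11, 1, 8, 11, 3, 5]  # ids of "2837-07-24"
decoder_input, expected_output = teacher_forcing_pair(target)
print(decoder_input)    # [12, 3, 9, 4, 8, 11, 1, 8, 11, 3]
print(expected_output)  # [3, 9, 4, 8, 11, 1, 8, 11, 3, 5]
```

At step t the decoder sees `decoder_input[:t + 1]` and is trained to predict `expected_output[t]`.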

In [66]:
create_dataset(10)

(tensor([[18, 31, 37, 24, 29, 22, 24, 33,  1,  5, 10,  2,  1,  4,  6,  3,  4,  0],
         [13, 32, 33, 27, 28,  1,  4,  4,  2,  1,  4,  4,  8,  3,  0,  0,  0,  0],
         [16, 21, 30, 36, 21, 33, 38,  1,  3, 10,  2,  1,  4,  6, 10,  9,  0,  0],
         [13, 36, 25, 36, 34, 35,  1,  4, 11,  2,  1,  5,  7,  5, 10,  0,  0,  0],
         [16, 36, 28, 38,  1,  5,  8,  2,  1,  4,  5,  6, 12,  0,  0,  0,  0,  0],
         [20, 24, 32, 35, 24, 29, 22, 24, 33,  1,  4,  3,  2,  1,  4,  7, 10, 12],
         [18, 31, 37, 24, 29, 22, 24, 33,  1,  4,  4,  2,  1,  4,  8,  5,  6,  0],
         [13, 36, 25, 36, 34, 35,  1,  3,  9,  2,  1,  4, 10,  9,  4,  0,  0,  0],
         [16, 36, 28, 38,  1,  3,  6,  2,  1,  4,  4,  4,  5,  0,  0,  0,  0,  0],
         [19, 23, 35, 31, 22, 24, 33,  1,  4,  3,  2,  1,  4, 12,  3,  4,  0,  0]]),
 tensor([[12,  2,  4,  1,  2, 11,  2,  2, 11,  3],
         [12,  2,  2,  6,  1, 11,  1,  5, 11,  2],
         [12,  2,  4,  8,  7, 11,  1,  2, 11,  1],
         [12,  

In [67]:
def data_gen(n_batches, batch_size, device=None):
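    """
    Random data generator for the encoder-decoder date-string conversion task
    :param batch_size: batch size
    :param n_batches: number of batches to generate
    """
    for i in range(n_batches):
        X_pre, y_pre = create_dataset(batch_size, device=device)
        # data = torch.randint(2, V, size=(batch_size, s_len))
        # .detach()
        # returns a new tensor that is detached from the current graph but still shares storage with the original;
        # the only difference is that requires_grad is False, so the tensor never needs a gradient and has no grad.
        # requires_grad defaults to False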
        src = X_pre.requires_grad_(False).clone().detach()
        tgt = y_pre.requires_grad_(False).clone().detach()
        if device == "cuda":
            src = src.cuda()
            tgt = tgt.cuda()
        yield Batch(src=src, tgt=tgt, pad=0)

7. Train and evaluate the model

In [68]:
def example_simple_model(device=None):
    # V = 11  # vocabulary size
    criterion = LabelSmoothing(size=tgt_vocab + 2, padding_idx=0, smoothing=0.0)
    model = make_model(src_vocab=src_vocab + 1, tgt_vocab=tgt_vocab + 2, N=2)
    if device == "cuda":
        model.cuda()
    model_size = model.src_embed[0].d_model  # 512

    n_epochs = 20
    n_batch_train_epoch = 200  # number of batches per training epoch
    n_batch_val_epoch = 50  # number of batches per validation epoch
    batch_size = 100

    optimizer = torch.optim.Adam(model.parameters(),
                                 lr=0.5,
                                 betas=(0.9, 0.98),
                                 eps=1e-9)
    lr_scheduler = LambdaLR(
        optimizer=optimizer,
        lr_lambda=lambda step: rate(
            step=step, model_size=model_size, factor=0.1, warmup=400))

    for epoch in range(n_epochs):
        loss_compute = SimpleLossCompute(generator=model.generator,
                                         criterion=criterion)

        print(f"\n|   Epoch: {epoch}   |")
        print("*" * 5 + "Training" + "*" * 5)
        model.train()  # self.training=True

        train_data_iter = data_gen(n_batches=n_batch_train_epoch,
                                   batch_size=batch_size,
                                   device=device)
        run_epoch(data_iter=train_data_iter,
                  model=model,
                  loss_compute=loss_compute,
                  optimizer=optimizer,
                  scheduler=lr_scheduler,
                  mode="train")

        # -----------
        print("*" * 5 + "Validation" + "*" * 5)
        model.eval()  # self.training=False

        val_data_iter = data_gen(n_batches=n_batch_val_epoch,
                                 batch_size=batch_size,
                                 device=device)
        valid_mean_loss = run_epoch(
            data_iter=val_data_iter,
            model=model,
            loss_compute=loss_compute,
            optimizer=DummyOptimizer(),  # None
            scheduler=DummyScheduler(),  # None
            mode="eval")[0]  # returns total_loss / total_tokens
        print(f"|Validation loss: {valid_mean_loss} |")

    model.eval()
    torch.save(model, './models/Pytorch/example_2_date.pth')

In [70]:
example_simple_model("cuda")


|   Epoch: 0   |
*****Training*****
Epoch Step:      1 | Accumulation Step:   2 | Loss:   3.53 | Tokens / Sec:  1870.1 | Learning Rate: 3.0e-07
Epoch Step:     41 | Accumulation Step:  42 | Loss:   2.58 | Tokens / Sec: 21059.7 | Learning Rate: 6.3e-06
Epoch Step:     81 | Accumulation Step:  82 | Loss:   2.01 | Tokens / Sec: 22141.1 | Learning Rate: 1.2e-05
Epoch Step:    121 | Accumulation Step: 122 | Loss:   1.68 | Tokens / Sec: 22022.5 | Learning Rate: 1.8e-05
Epoch Step:    161 | Accumulation Step: 162 | Loss:   1.35 | Tokens / Sec: 22286.5 | Learning Rate: 2.4e-05
*****Validation*****
|Validation loss: 0.9029087424278259 |

|   Epoch: 1   |
*****Training*****
Epoch Step:      1 | Accumulation Step:   2 | Loss:   1.14 | Tokens / Sec: 24451.9 | Learning Rate: 3.0e-05
Epoch Step:     41 | Accumulation Step:  42 | Loss:   0.98 | Tokens / Sec: 22205.0 | Learning Rate: 3.6e-05
Epoch Step:     81 | Accumulation Step:  82 | Loss:   0.84 | Tokens / Sec: 22275.8 | Learning Rate: 4.2e-05
Epoch Step:    121 | Accumulation Step:

8. Write a function that converts IDs back into strings

In [69]:
def ids_to_date_strs(ids, chars_list):
    return [
        "".join([(" " + chars_list)[index] for index in sequence])
        for sequence in ids
    ]
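
A quick round trip shows how the two conversions fit together. The sketch below re-implements both directions for a single string so that it runs on its own; `encode` and `decode` are hypothetical stand-ins for `date_str_to_ids` and `ids_to_date_strs`.

```python
OUTPUT_CHARS = "0123456789-"


def encode(text, chars):
    """Map each character to its 1-based index; id 0 stays free for padding."""
    return [chars.index(c) + 1 for c in text]


def decode(ids, chars):
    """Inverse of encode; prepending a space makes id 0 (padding) decode to ' '."""
    return "".join((" " + chars)[i] for i in ids)


ids = encode("2019-12-14", OUTPUT_CHARS)
print(ids)                                # [3, 1, 2, 10, 11, 2, 3, 11, 2, 5]
print(decode(ids, OUTPUT_CHARS))          # 2019-12-14
print(repr(decode(ids + [0, 0], OUTPUT_CHARS)))  # '2019-12-14  '
```

Padded positions come back as trailing spaces, so a caller can remove them with `rstrip()` if needed.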

9. Preprocess sequences 
    
    Force zero-padding to `length==18(max)`

In [70]:
# F.pad (torch.nn.functional.pad): padding function
t = torch.tensor([[2, 3, 4], [5, 6, 7]])
pd = (2, 2, 1, 1)
F.pad(t, pd, 'constant', 0)

tensor([[0, 0, 0, 0, 0, 0, 0],
        [0, 0, 2, 3, 4, 0, 0],
        [0, 0, 5, 6, 7, 0, 0],
        [0, 0, 0, 0, 0, 0, 0]])

In [71]:
max_input_length = 18

# Preprocess sequences -> force zero-padding to length == 18 (max)
def prepare_date_strs_padded(date_strs):
    X = prepare_date_strs(date_strs, input_chars)
    pd = (0, max_input_length-X.shape[1], 0,0)

    if X.shape[1] < max_input_length:
        X = F.pad(X, pd, 'constant',0)
#     X[:, 0] = 1
    return X

In [72]:
prepare_date_strs_padded(X_example[0:4])

tensor([[16, 36, 28, 38,  1,  5,  7,  2,  1,  5, 11,  6, 10,  0,  0,  0,  0,  0],
        [17, 21, 33, 23, 26,  1,  5,  4,  2,  1,  4,  6,  9,  4,  0,  0,  0,  0],
        [13, 36, 25, 36, 34, 35,  1,  4, 12,  2,  1,  5,  3,  3,  4,  0,  0,  0],
        [13, 36, 25, 36, 34, 35,  1,  4,  3,  2,  1,  4, 10,  3, 12,  0,  0,  0]])

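10. Use the model for prediction

In [73]:
# Predict the ISO-format date strings for a list of input date strings
def pred_date_strs(model, date_strs, device=None):
    X = prepare_date_strs_padded(date_strs)
    if device == "cuda":
        X = X.cuda()  # keep the input on the same device as the mask
        src_mask = torch.ones(1, 1, max_input_length).cuda()
    else:
        src_mask = torch.ones(1, 1, max_input_length)

    # max_len=tgt_vocab (11) happens to equal <sos> plus the 10 characters of YYYY-MM-DD
    ys = greedy_decode(model, src=X, src_mask=src_mask, max_len=tgt_vocab, start_symbol=sos_id)

    # drop the leading <sos>
    y_pred_str = ids_to_date_strs(ys[:, 1:], output_chars)

    return y_pred_str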

In [74]:
model = torch.load('./models/Pytorch/example_2_date.pth',map_location='cpu')

In [75]:
pred_date_strs(model, ["December 14, 2019"])

['2019-12-14']

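# Example 3: A Real-World Case: Translation

We now consider a real-world translation example on the `Multi30k German-English dataset`. This task is considerably smaller than the `WMT` task discussed in the paper, but it is enough to demonstrate the whole system. We also show how to use multiple GPUs to speed up training.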

In [76]:
from torchtext import data, datasets
from torchtext.vocab import build_vocab_from_iterator
from torchtext.data.functional import to_map_style_dataset
from torch.utils.data.distributed import DistributedSampler
from torch.utils.data import DataLoader
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
import torch.multiprocessing as mp
import GPUtil
import spacy
import os
from os.path import exists

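1. Load the German and English tokenizers

    We use `torchtext` and `spacy` to load the dataset and tokenize it.

> `torchtext` (in its older API) mainly consists of three parts:
> - `Field`: configures how the data is preprocessed, e.g. the tokenization method, whether the data is sequential, and the start and end tokens.
> - `Dataset`: loads the dataset.
> - `Iterator`: the output iterator, which splits the data into batches of `batch_size` to feed to the model.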

In [77]:
def load_tokenizers():
    try:
        spacy_de = spacy.load("de_core_news_sm")
    except IOError:
        os.system("python -m spacy download de_core_news_sm")
        spacy_de = spacy.load("de_core_news_sm")

    try:
        spacy_en = spacy.load("en_core_web_sm")
    except IOError:
        os.system("python -m spacy download en_core_web_sm")
        spacy_en = spacy.load("en_core_web_sm")

    return spacy_de, spacy_en

2. Tokenization

In [78]:
def tokenize(text, tokenizer):
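    """
    Tokenize a text
    `tokenizer.tokenizer(text)` runs only spaCy's tokenizer on the text; we iterate over the result to collect the token strings
    :param text: the text
    :param tokenizer: tokenizer, spacy_de or spacy_en
    :return: list of token strings
    """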
    return [token.text for token in tokenizer.tokenizer(text)]

In [79]:
def yield_tokens(data_iter, tokenizer, index):
    # index 0: German, index 1: English
    for from_to_tuple in data_iter:
        yield tokenizer(from_to_tuple[index])

In [80]:
spacy_de, spacy_en = load_tokenizers()
vocab_path = "./models/Pytorch/vocab.pt"

print(
    tokenize(text="Wir können Iteration verwenden, um das gesamte Dokument zu durchlaufen",
             tokenizer=spacy_de))

print(
    tokenize(text="We can use iteration to traverse the entire document",
             tokenizer=spacy_en))

['Wir', 'können', 'Iteration', 'verwenden', ',', 'um', 'das', 'gesamte', 'Dokument', 'zu', 'durchlaufen']
['We', 'can', 'use', 'iteration', 'to', 'traverse', 'the', 'entire', 'document']


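3. Build the corpus dataset

    Batching matters a great deal for speed. We want very evenly sized batches with minimal padding, so we have to modify torchtext's default batching. This code patches the default batching so that it searches over enough sentences to find tight batches.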

In [81]:
train, val, test = datasets.Multi30k(language_pair=("de", "en"))

Examples:
*('Ein Mann mit einem orangefarbenen Hut, der etwas anstarrt.', 'A man in an orange hat starring at something.')
('Ein Boston Terrier läuft über saftig-grünes Gras vor einem weißen Zaun.', 'A Boston Terrier is running on lush green grass in front of a white fence.')*

In [83]:
def build_vocabulary(spacy_de, spacy_en):
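    """
    Build the vocabularies
    - Uses Multi30k, the machine-translation dataset that ships with torchtext
        -- language_pair: the languages of the sentence pairs; the default is German to English. Each item in the dataset is a pair of sentences in the given languages
    - build_vocab_from_iterator(): builds a vocabulary from an iterator
        -- iterator: the iterator that yields the token lists used to build the Vocab
        -- min_freq: minimum frequency a token needs to be included in the vocabulary
        -- specials: special symbols to add
    """
    BOS_WORD = '<s>'  # beginning-of-sequence marker
    EOS_WORD = '</s>'  # end-of-sequence marker
    BLANK_WORD = '<blank>'  # padding marker
    UNK_WORD = '<unk>'  # unknown-token marker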

    def tokenize_de(text):
        return tokenize(text=text, tokenizer=spacy_de)

    def tokenize_en(text):
        return tokenize(text=text, tokenizer=spacy_en)

    print("***Building the German vocabulary***")
    train, val, test = datasets.Multi30k(language_pair=("de", "en"))
    vocab_src = build_vocab_from_iterator(
        iterator=yield_tokens(
            data_iter=train + val + test,
            tokenizer=tokenize_de,
            index=0
        ),
        min_freq=2,
        specials=[BOS_WORD, EOS_WORD, BLANK_WORD, UNK_WORD]
    )

    print("***Building the English vocabulary***")
    train, val, test = datasets.Multi30k(language_pair=("de", "en"))
    vocab_tgt = build_vocab_from_iterator(
        iterator=yield_tokens(
            data_iter=train + val + test,
            tokenizer=tokenize_en,
            index=1
        ),
        min_freq=2,
        specials=[BOS_WORD, EOS_WORD, BLANK_WORD, UNK_WORD]
    )

    vocab_src.set_default_index(vocab_src[UNK_WORD])
    vocab_tgt.set_default_index(vocab_tgt[UNK_WORD])  # look up <unk> in the target vocabulary itself

    return vocab_src, vocab_tgt

In [84]:
def load_vocab(spacy_de, spacy_en, vocab_path):
    if not exists(vocab_path):
        vocab_src, vocab_tgt = build_vocabulary(spacy_de, spacy_en)
        torch.save((vocab_src, vocab_tgt), vocab_path)
        print("Vocabulary built!")
    else:
        vocab_src, vocab_tgt = torch.load(vocab_path)
        print("Vocabulary loaded!")
    print("German vocabulary size: " + str(len(vocab_src)))
    print("English vocabulary size: " + str(len(vocab_tgt)))
    return vocab_src, vocab_tgt

In [85]:
spacy_de, spacy_en = load_tokenizers()
load_vocab(spacy_de, spacy_en, vocab_path)

Vocabulary loaded!
German vocabulary size: 8315
English vocabulary size: 6384


(Vocab(), Vocab())

4. Data iterator

    The iterator defines several steps of the batching process, including cleaning, collating, and batching the data.

In [86]:
def collate_batch(batch,
                  src_pipeline,
                  tgt_pipeline,
                  src_vocab,
                  tgt_vocab,
                  device,
                  max_padding=128,
                  PAD_id=2):
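    """
    Collate a batch: add the start and end tokens, then pad every sequence to the same length
    :param batch: list of (source text, target text) pairs
    :param src_pipeline: source tokenizer, tokenize_de
    :param tgt_pipeline: target tokenizer, tokenize_en
    :param src_vocab: source vocabulary, vocab_src
    :param tgt_vocab: target vocabulary, vocab_tgt
    :param device: device on which the tensors are created
    :param max_padding: length every sequence is padded to (default 128)
    :param PAD_id: padding id, i.e. the <blank> token
    :return: (src, tgt) tensors
    """
    BOS_id = torch.tensor([0], device=device)  # id of <s>, the start-of-sequence token
    EOS_id = torch.tensor([1], device=device)  # id of </s>, the end-of-sequence token
    src_list, tgt_list = [], []
    for (_src, _tgt) in batch:
        # Preprocess the source sentence: add the start and end tokens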
        processed_src = torch.cat(
            [
                BOS_id,  # <s>
                torch.tensor(
                    src_vocab(src_pipeline(_src)),
                    dtype=torch.int64,
                    device=device,
                ),
                EOS_id,  # </s>
            ],
            dim=0,
        )
        # Preprocess the target sentence in the same way
        processed_tgt = torch.cat(
            [
                BOS_id,  # <s>
                torch.tensor(
                    tgt_vocab(tgt_pipeline(_tgt)),
                    dtype=torch.int64,
                    device=device,
                ),
                EOS_id,  # </s>
            ],
            dim=0,
        )

        # F.pad right-pads the shorter sequences so that all sequences in the batch have the same length
        src_list.append(
            F.pad(
                input=processed_src,
                pad=(0, max_padding -
                     len(processed_src)),  # (0, 128-len(processed_src))
                mode="constant",
                value=PAD_id,
            ))
        tgt_list.append(
            F.pad(
                input=processed_tgt,
                pad=(0, max_padding - len(processed_tgt)),
                mode="constant",
                value=PAD_id,
            ))

    # stack(): stack the list of tensors along a new batch dimension
    src = torch.stack(src_list)
    tgt = torch.stack(tgt_list)

    return (src, tgt)
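
The framing and padding done by `collate_batch` can be illustrated on plain lists. This is a minimal sketch assuming the same special-token ids as above (`<s>`=0, `</s>`=1, `<blank>`=2); `frame_and_pad` is a hypothetical helper, and the truncation branch mirrors what a negative pad length does in `F.pad`.

```python
BOS_ID, EOS_ID, PAD_ID = 0, 1, 2  # assumed ids of <s>, </s> and <blank>


def frame_and_pad(token_ids, max_padding, pad_id=PAD_ID):
    """Wrap a sequence in <s> ... </s>, then pad or cut it to max_padding."""
    framed = [BOS_ID] + list(token_ids) + [EOS_ID]
    framed = framed[:max_padding]  # over-long sequences are cut at the end
    return framed + [pad_id] * (max_padding - len(framed))


print(frame_and_pad([7, 8, 9], max_padding=8))          # [0, 7, 8, 9, 1, 2, 2, 2]
print(frame_and_pad([7, 8, 9, 10, 11], max_padding=4))  # [0, 7, 8, 9]
```

Note that a sequence longer than `max_padding` loses its `</s>` token, so `max_padding` should be chosen comfortably above the typical sentence length.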

In [87]:
def create_dataloaders(device,
                       vocab_src,
                       vocab_tgt,
                       spacy_de,
                       spacy_en,
                       batch_size=12000,
                       max_padding=128,
                       is_distributed=True):
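    """
    Create the data loaders
    :param spacy_de: German tokenizer
    :param spacy_en: English tokenizer
    :param batch_size: batch size (default 12000)
    :param is_distributed: whether to use distributed training
    :return: (train_dataloader, valid_dataloader)
    """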

    def tokenize_de(text):
        return tokenize(text=text, tokenizer=spacy_de)

    def tokenize_en(text):
        return tokenize(text=text, tokenizer=spacy_en)

    def collate_fn(batch):
        return collate_batch(
            batch=batch,
            src_pipeline=tokenize_de,
            tgt_pipeline=tokenize_en,
            src_vocab=vocab_src,
            tgt_vocab=vocab_tgt,
            device=device,
            max_padding=max_padding,
            # get_stoi(): returns the vocabulary's token-to-index dict
            PAD_id=vocab_src.get_stoi()["<blank>"],  # 2
        )

    train_iter, valid_iter, test_iter = datasets.Multi30k(language_pair=("de",
                                                                         "en"))
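    """
    1. Convert the datasets to map-style
       - to_map_style_dataset(): converts an `iterable-style` dataset into a `map-style` dataset.
           - a `map-style` dataset maps indices/keys to data samples
           - an `iterable-style` dataset only yields its samples one after another while it is iterated
    """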
    train_iter_map = to_map_style_dataset(train_iter)
    valid_iter_map = to_map_style_dataset(valid_iter)
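    """
    2. Use a distributed sampler
       - DistributedSampler(): for multi-GPU training, restricts each process to loading its own subset of the dataset
           - the distributed sampler requires the dataset to support len()
    """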
    train_sampler = (DistributedSampler(
        dataset=train_iter_map) if is_distributed else None)
    valid_sampler = (DistributedSampler(
        dataset=valid_iter_map) if is_distributed else None)
    """
    3. Create the data loaders: DataLoader() combines a dataset with a sampler and provides an iterable over the dataset.
    """
    train_dataloader = DataLoader(
        dataset=train_iter_map,
        batch_size=batch_size,
        shuffle=(train_sampler is None),  # shuffle only when no sampler is given
        sampler=train_sampler,
        collate_fn=collate_fn,  # collates a list of samples from the `map-style` dataset into one batch
    )
    valid_dataloader = DataLoader(
        dataset=valid_iter_map,
        batch_size=batch_size,
        shuffle=(valid_sampler is None),
        sampler=valid_sampler,
        collate_fn=collate_fn,
    )
    return train_dataloader, valid_dataloader
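
To build intuition for what a distributed sampler hands to each process, the sketch below splits dataset indices across ranks in plain Python. It is a simplified picture, assuming no shuffling and that the index list is padded with repeats so every rank gets the same number of samples; `shard_indices` is a hypothetical helper, not the PyTorch implementation.

```python
import math


def shard_indices(dataset_len, num_replicas, rank):
    """Unshuffled, simplified split of dataset indices across processes."""
    num_samples = math.ceil(dataset_len / num_replicas)
    total_size = num_samples * num_replicas
    indices = list(range(dataset_len))
    indices += indices[: total_size - len(indices)]  # repeat a few samples to pad
    return indices[rank:total_size:num_replicas]     # every num_replicas-th index


for rank in range(3):
    print(rank, shard_indices(10, num_replicas=3, rank=rank))
```

Because each rank only sees its own shard, the per-process `batch_size` passed to `DataLoader` is the global batch size divided by the number of GPUs.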

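5. Multi-GPU training
    
    Finally, to train quickly we use multiple GPUs. This code is not specific to the Transformer, so we do not discuss it in detail. The basic idea of multi-GPU training is to split the work of each training step into chunks that are sent to different GPUs and processed in parallel, which we can do with PyTorch's parallel primitives; the code below uses `DistributedDataParallel` with one worker process per GPU.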

In [88]:
def train_worker(gpu,
                 ngpus_per_node,
                 vocab_src,
                 vocab_tgt,
                 spacy_de,
                 spacy_en,
                 config,
                 is_distributed=False):
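    """
    Training worker
    :param gpu: index of the GPU used by this process (also used as its rank)
    :param ngpus_per_node: number of GPUs on the node
    :param config: configuration dict
    :param is_distributed: whether to use distributed training
    :return:
    """
    # ---- multi-GPU setup ----
    print(f"Training worker process using GPU: {gpu}", flush=True)
    torch.cuda.set_device(gpu)

    pad_idx = vocab_tgt["<blank>"]  # id of the padding token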
    d_model = 512
    model = make_model(len(vocab_src), len(vocab_tgt), N=6)
    model.cuda(gpu)
    module = model
    is_main_process = True

    if is_distributed:
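        """
        torch.distributed: used here for single-node multi-GPU training
        - init_process_group(): initializes the default distributed process group
           - backend: typically NCCL for distributed GPU training and gloo for distributed CPU training
           - init_method: URL that specifies how the processes find each other
           - rank: index of this process (here, the GPU index); rank 0 is the main process
           - world_size: total number of processes/GPUs taking part in training
        - DistributedDataParallel(): DDP mode: one process per GPU; the model is broadcast only at initialization, which speeds up training
        """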
        dist.init_process_group(backend="nccl",
                                init_method="env://",
                                rank=gpu,
                                world_size=ngpus_per_node)
        model = DDP(module=model, device_ids=[gpu])
        module = model.module
        is_main_process = gpu == 0

    # ---- training setup ----
    criterion = LabelSmoothing(size=len(vocab_tgt),
                               padding_idx=pad_idx,
                               smoothing=0.1)
    criterion.cuda(gpu)
    # create the data loaders
    train_dataloader, valid_dataloader = create_dataloaders(
        gpu,
        vocab_src,
        vocab_tgt,
        spacy_de,
        spacy_en,
        batch_size=config["batch_size"] // ngpus_per_node,  # per-process batch size: total batch size // number of GPUs
        max_padding=config["max_padding"],
        is_distributed=is_distributed,
    )
    # optimizer
    optimizer = torch.optim.Adam(model.parameters(),
                                 lr=config["base_lr"],
                                 betas=(0.9, 0.98),
                                 eps=1e-9)
    # learning-rate schedule
    lr_scheduler = LambdaLR(
        optimizer=optimizer,
        lr_lambda=lambda step: rate(
            step, d_model, factor=1, warmup=config["warmup"]),
    )
    train_state = TrainState()  # tracks the number of steps, examples, and tokens processed

    for epoch in range(config["num_epochs"]):
        if is_distributed:  # tell the distributed samplers the current epoch so each epoch is shuffled differently
            train_dataloader.sampler.set_epoch(epoch)
            valid_dataloader.sampler.set_epoch(epoch)

        print(f"\n|   Epoch: {epoch}   |")
        print("*" * 5 + "Training" + "*" * 5)
        model.train()
        print(f"[GPU{gpu}] Epoch {epoch} Training ====", flush=True)
        train_data_iter = (Batch(src=b[0], tgt=b[1], pad=pad_idx)
                           for b in train_dataloader)  #######
        _, train_state = run_epoch(
            data_iter=train_data_iter,
            model=model,
            loss_compute=SimpleLossCompute(module.generator, criterion),
            optimizer=optimizer,
            scheduler=lr_scheduler,
            mode="train+log",
            accum_iter=config["accum_iter"],
            train_state=train_state,
        )

        GPUtil.showUtilization()  # show the current GPU utilization
        # save a checkpoint of the model
        if is_main_process:
            file_path = "./models/Pytorch/%s%.2d.pt" % (config["file_prefix"], epoch)
            torch.save(module.state_dict(), file_path)
        torch.cuda.empty_cache()

        # -----------
        print("*" * 5 + "Validation" + "*" * 5)
        print(f"[GPU{gpu}] Epoch {epoch} Validation ====", flush=True)
        model.eval()
        valid_data_iter = (Batch(src=b[0], tgt=b[1], pad=pad_idx)
                           for b in valid_dataloader)  ######
        valid_mean_loss = run_epoch(
            data_iter=valid_data_iter,
            model=model,
            loss_compute=SimpleLossCompute(module.generator, criterion),
            optimizer=DummyOptimizer(),
            scheduler=DummyScheduler(),
            mode="eval",
        )
        print(valid_mean_loss)
        torch.cuda.empty_cache()

    # save the final model
    if is_main_process:
        file_path = "./models/Pytorch/%sfinal.pt" % config["file_prefix"]
        torch.save(module.state_dict(), file_path)

In [89]:
def train_distributed_model(vocab_src, vocab_tgt, spacy_de, spacy_en, config):
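    """
    Set up distributed training
    """
    ngpus = torch.cuda.device_count()
    os.environ["MASTER_ADDR"] = "localhost"
    os.environ["MASTER_PORT"] = "12356"
    print(f"Number of GPUs detected: {ngpus}")
    print("Spawning training processes ...")

    # mp.spawn(): launches nprocs processes; each one calls fn with its process index as the first argument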
    mp.spawn(
        fn=train_worker,
        nprocs=ngpus,
        args=(ngpus, vocab_src, vocab_tgt, spacy_de, spacy_en, config, True),
    )

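6. Model training

    The `Harvard NLP` team first ran some warmup iterations, but otherwise everything uses the default parameters. On an AWS p3.8xlarge with 4 Tesla V100s, this runs at about 27,000 tokens per second with a batch size of 12,000.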
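
The warmup mentioned above comes from the learning-rate schedule. The `rate` function used by the training code is not shown in this section; the sketch below assumes it follows the schedule of the original paper (linear warmup, then inverse square-root decay), and `noam_rate` is a hypothetical name for it.

```python
def noam_rate(step, model_size, factor, warmup):
    """Learning-rate multiplier: linear warmup followed by 1/sqrt(step) decay."""
    step = max(step, 1)  # guard against 0 ** -0.5 on the very first call
    return factor * (model_size ** -0.5 *
                     min(step ** -0.5, step * warmup ** -1.5))


# Settings taken from the Multi30k config below: d_model=512, factor=1, warmup=3000
for step in (2, 1500, 3000, 12000):
    print(step, format(noam_rate(step, 512, 1.0, 3000), ".1e"))
```

The multiplier grows linearly until `step == warmup`, peaks there, and then decays, so the optimizer's base learning rate only acts as a scale factor on this curve.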

In [90]:
def train_model(vocab_src, vocab_tgt, spacy_de, spacy_en, config):
    """
    Choose how to run training
    """
    if config["distributed"]:  # distributed training
        train_distributed_model(vocab_src, vocab_tgt, spacy_de, spacy_en,
                                config)
    else:  # single-GPU training
        train_worker(gpu=0,
                     ngpus_per_node=1,
                     vocab_src=vocab_src,
                     vocab_tgt=vocab_tgt,
                     spacy_de=spacy_de,
                     spacy_en=spacy_en,
                     config=config,
                     is_distributed=False)

In [91]:
def load_trained_model():
    config = {
        "batch_size": 32,
        "distributed": False,  # distributed training is off by default
        "num_epochs": 8,
        "accum_iter": 10,
        "base_lr": 1.0,
        "max_padding": 72,
        "warmup": 3000,
        "file_prefix": "multi30k_model_",
    }
    model_path = "./models/Pytorch/multi30k_model_final.pt"
    if not exists(model_path):
        train_model(vocab_src, vocab_tgt, spacy_de, spacy_en, config)
        print("Model training finished")

    model = make_model(len(vocab_src), len(vocab_tgt), N=6)
    model.load_state_dict(
        torch.load(model_path,
                   map_location='cpu'))  # load the GPU-trained weights on the CPU
    print("Model loaded")
    return model

In [92]:
if __name__ == "__main__":
    spacy_de, spacy_en = load_tokenizers()
    vocab_path = "./models/Pytorch/vocab.pt" 
    vocab_src, vocab_tgt = load_vocab(spacy_de, spacy_en, vocab_path)
    model = load_trained_model()

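***Building the German vocabulary***
***Building the English vocabulary***
Vocabulary built!
German vocabulary size: 8315
English vocabulary size: 6384
Training worker process using GPU: 0

|   Epoch: 0   |
*****Training*****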
[GPU0] Epoch 0 Training ====
Epoch Step:      1 | Accumulation Step:   1 | Loss:   7.58 | Tokens / Sec:   825.4 | Learning Rate: 5.4e-07
Epoch Step:     41 | Accumulation Step:   5 | Loss:   7.36 | Tokens / Sec:  2880.7 | Learning Rate: 1.1e-05
Epoch Step:     81 | Accumulation Step:   9 | Loss:   6.98 | Tokens / Sec:  2856.0 | Learning Rate: 2.2e-05
Epoch Step:    121 | Accumulation Step:  13 | Loss:   6.71 | Tokens / Sec:  2825.3 | Learning Rate: 3.3e-05
Epoch Step:    161 | Accumulation Step:  17 | Loss:   6.45 | Tokens / Sec:  2841.8 | Learning Rate: 4.4e-05
Epoch Step:    201 | Accumulation Step:  21 | Loss:   6.38 | Tokens / Sec:  2801.3 | Learning Rate: 5.4e-05
Epoch Step:    241 | Accumulation Step:  25 | Loss:   6.23 | Tokens / Sec:  2863.9 | Learning Rate: 6.5e-05
Epoch Step:    281 | Accumulation Step:  29 | Loss:   5.95 | Tokens / Sec:  2808.8 | Learning Rate: 7.6e-05
Epo

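Once training is done, we can decode with the model to produce a set of translations. Below we simply translate the first sentence of the validation set. The dataset is quite small, so the model obtains reasonably good translations even with greedy search.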
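
Greedy search simply picks the highest-scoring token at every step and feeds it back in. The sketch below shows the loop with a toy scoring function standing in for the model; both `greedy_decode_sketch` and `toy_scores` are hypothetical and exist only for illustration.

```python
def greedy_decode_sketch(next_token_scores, start_symbol, end_symbol, max_len):
    """Repeatedly append the best-scoring token until end_symbol or max_len."""
    tokens = [start_symbol]
    while len(tokens) < max_len:
        scores = next_token_scores(tokens)
        best = max(range(len(scores)), key=scores.__getitem__)
        tokens.append(best)
        if best == end_symbol:
            break
    return tokens


def toy_scores(prefix, vocab_size=5):
    """Toy 'model': always prefers the token that follows the last one."""
    preferred = (prefix[-1] + 1) % vocab_size
    return [1.0 if token == preferred else 0.0 for token in range(vocab_size)]


print(greedy_decode_sketch(toy_scores, start_symbol=0, end_symbol=3, max_len=10))
# [0, 1, 2, 3]
```

Greedy search never revisits a choice, which is why beam search (discussed below) can find better translations on harder tasks.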

# Additional Components: BPE, Search, Averaging

In [92]:
spacy_de, spacy_en = load_tokenizers()
vocab_src, vocab_tgt = load_vocab(spacy_de, spacy_en, vocab_path)
model = load_trained_model()

Vocabulary loaded!
German vocabulary size: 8315
English vocabulary size: 6384
Model loaded


## BPE/ Word-piece

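The most basic tokenization methods are: 
-  Splitting on whitespace (**word level**)
   - Drawbacks: the OOV problem, and the model struggles to learn the relations between a word's different tenses and affixes
- **Character n-gram level**
   - Drawbacks: the granularity is too fine, so part of the semantic information is lost; the model input becomes longer, which makes training more complex and convergence harder
   
This is where `Subword` models come in. Their granularity lies between words and characters: for example, "looking" can be split into the two subwords "look" and "ing", and these subwords can in turn be reused to build other words, e.g. "look" and "ed" form "looked". `Subword` methods can therefore shrink the vocabulary considerably while handling related words better.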

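- **BPE**:

> Steps by which BPE learns its subwords: 
> 1. Prepare a sufficiently large training corpus and decide on the desired `Subword` vocabulary size;
> 2. Split every word into its smallest units, e.g. the 26 English letters plus assorted symbols, which form the initial vocabulary, and append the end-of-word marker `</w>` to every word.
    <img src="./images/other/16-56.jpg" width=200>
> 3. Count the frequency of each pair of adjacent units within words in the corpus, and **merge the most frequent pair into a new `Subword` unit**;
    <img src="./images/other/16-57.jpg" width=200>
    The most frequent adjacent pair, "e" and "s", occurs 6+3=9 times, so it is merged into "es". No standalone 's' remains afterwards, so 's' is removed from the vocabulary.
    <img src="./images/other/16-58.jpg" width=200>
> 4. Repeat step 3 until the vocabulary size set in step 1 is reached or the most frequent remaining pair has frequency 1.

> Once the `Subword` vocabulary is available, the words of each input sentence are encoded as follows:
> 1. Sort all subwords in the vocabulary by length, from longest to shortest;
> 2. For a word w, go through the sorted vocabulary in order and check whether the current subword is a substring of the word. If it is, output that subword and continue matching on the rest of the word.
> 3. If some substring is still unmatched after the whole vocabulary has been traversed, replace the remainder with a special symbol such as `<unk>`.
> 4. The word is represented by all the subwords output above.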
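
The merge-learning loop (steps 2 to 4 above) fits in a few lines of plain Python. This is a minimal sketch, not the `subword-nmt` implementation: the toy corpus and its counts are an assumption, chosen so that "e" and "s" occur together 6+3=9 times as in the text, and `pair_counts`, `merge_pair`, `learn_bpe`, and `symbol_set` are hypothetical helper names.

```python
import collections
import re


def pair_counts(vocab):
    """Count adjacent symbol pairs inside words, weighted by word frequency."""
    counts = collections.Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for pair in zip(symbols, symbols[1:]):
            counts[pair] += freq
    return counts


def merge_pair(pair, vocab):
    """Replace every occurrence of the pair (as whole symbols) by its concatenation."""
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    merged = "".join(pair)
    return {pattern.sub(lambda _: merged, word): freq for word, freq in vocab.items()}


def learn_bpe(vocab, num_merges):
    """Greedily merge the most frequent pair, num_merges times at most."""
    merges = []
    for _ in range(num_merges):
        counts = pair_counts(vocab)
        if not counts:
            break
        best = max(counts, key=counts.get)  # on ties, the first pair seen wins
        if counts[best] < 2:                # step 4: stop once the best pair occurs once
            break
        vocab = merge_pair(best, vocab)
        merges.append(best)
    return merges, vocab


def symbol_set(vocab):
    """All symbols currently in use, i.e. the subword vocabulary."""
    return {symbol for word in vocab for symbol in word.split()}


# Words are pre-split into characters, with </w> marking the end of each word.
corpus = {"l o w </w>": 5, "l o w e r </w>": 2,
          "n e w e s t </w>": 6, "w i d e s t </w>": 3}
merges, merged_vocab = learn_bpe(corpus, num_merges=3)
print(merges)
print(sorted(symbol_set(merge_pair(("e", "s"), corpus))))
```

After the first merge the symbol set contains "es" but no standalone "s", while "e" survives because it still occurs in "lower" and "newest".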

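- **WordPiece**:

    `Bert` uses `WordPiece` for tokenization. Like `BPE`, `WordPiece` builds a new `subword` at each step by merging two `subword`s from the vocabulary.
    
    The difference between `WordPiece` and `BPE`: `BPE` ranks pairs of `subword`s directly by frequency and merges the most frequent pair, whereas **`WordPiece` adds to the vocabulary the adjacent pair whose merge increases the language-model likelihood the most**.
    > Suppose a sentence $\mathrm{S}$ in the corpus consists of $\mathrm{n}$ subwords $t_{i}$, and the `subword`s are assumed to be independent. The log-likelihood of the sentence under the language model is then:
    $$
    \log P(S)=\sum_{i=1}^{n} \log P\left(t_{i}\right)
    $$
    Suppose that merging two `subword`s $x$ and $y$ yields a new `subword` $z$. The change in the log-likelihood is then:
    $$
    \log P\left(t_{z}\right)-\left(\log P\left(t_{x}\right)+\log P\left(t_{y}\right)\right)=\log \left(\frac{P\left(t_{z}\right)}{P\left(t_{x}\right) P\left(t_{y}\right)}\right)
    $$
    >
    > This change in likelihood measures the (pointwise) mutual information between `subword`s $x$ and $y$. If the merged pair has the highest mutual information, the two `subword`s are strongly associated under the language model, meaning they frequently occur together in the corpus.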
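
The difference between the two selection criteria shows up on a small numeric example. The sketch below estimates the score from raw counts; all counts are hypothetical and `merge_score` is an illustrative helper, not part of any tokenizer library.

```python
import math


def merge_score(count_x, count_y, count_xy, total):
    """log( P(z) / (P(x) * P(y)) ), with probabilities estimated from counts."""
    p_x, p_y, p_z = count_x / total, count_y / total, count_xy / total
    return math.log(p_z / (p_x * p_y))


# Hypothetical counts: a frequent but loosely associated pair ...
frequent_pair = merge_score(count_x=1000, count_y=1000, count_xy=100, total=10000)
# ... versus a rare pair whose parts almost always occur together.
tight_pair = merge_score(count_x=20, count_y=20, count_xy=18, total=10000)

print(round(frequent_pair, 3), round(tight_pair, 3))
```

A frequency-based criterion would merge the first pair (100 > 18 co-occurrences), whereas the likelihood-based score prefers the second, because its parts are far more strongly associated.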

We can use a library to preprocess the data into `subword` units first; see [subword-nmt](https://github.com/rsennrich/subword-nmt). These models transform the training data to look like this:

    ▁Die ▁Protokoll datei ▁kann ▁ heimlich ▁per ▁E - Mail ▁oder ▁FTP
    ▁an ▁einen ▁bestimmte n ▁Empfänger ▁gesendet ▁werden .

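- **Shared embeddings**: when using `BPE` with a shared vocabulary, we can share the same weight vectors between the `source / target / generator`.

    To add this to the model, simply do the following: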

In [93]:
if False:
    model.src_embed[0].lut.weight = model.tgt_embed[0].lut.weight
    model.generator.lut.weight = model.tgt_embed[0].lut.weight

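## Beam Search and Model Averaging

- **Beam Search**: see `Chapter 16, Natural Language Processing with RNNs and Attention (2), 1.3 Beam Search`.

- **Model Averaging**: the paper averages the last k checkpoints to create an ensembling effect.

    If we have a bunch of models, we can do it like this: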

In [94]:
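def average(model, models):
    "Average the parameters of `models` into `model` (in place)"
    # nn.Module exposes parameters(), not params()
    with torch.no_grad():  # in-place writes to leaf tensors need no_grad
        for ps in zip(*[m.parameters() for m in [model] + models]):
            # ps[0] belongs to `model`; ps[1:] are the matching source tensors
            ps[0].copy_(sum(ps[1:]) / len(ps[1:]))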
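
The arithmetic behind checkpoint averaging can also be shown without any framework. In this sketch, plain Python lists stand in for parameter tensors; `average_checkpoints` and the three toy checkpoints are hypothetical names and values, not part of the original code.

```python
def average_checkpoints(checkpoints):
    """Element-wise mean of checkpoints that share the same keys and shapes."""
    k = len(checkpoints)
    return {
        name: [sum(values) / k
               for values in zip(*(ckpt[name] for ckpt in checkpoints))]
        for name in checkpoints[0]
    }


# Three toy "checkpoints": parameter name -> flat list of weights.
last_k = [
    {"w": [1.0, 2.0], "b": [0.0]},
    {"w": [3.0, 4.0], "b": [1.0]},
    {"w": [5.0, 6.0], "b": [2.0]},
]
averaged = average_checkpoints(last_k)
print(averaged)  # {'w': [3.0, 4.0], 'b': [1.0]}
```

Each parameter is averaged independently across checkpoints, so the result has the same structure as a single checkpoint and can be loaded like one.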

# Results

<img src="./images/other/16-59.png" >

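> On the WMT 2014 English-to-French translation task, the big `Transformer` model from the original paper achieves a BLEU score of 41.0, outperforming all previously published single models at less than 1/4 of the training cost of the previous state-of-the-art model.

With the additional extensions from the previous section, the Harvard NLP team's `OpenNMT-py` replication reaches a BLEU score of 26.9 on the WMT EN-DE dataset.

1. Check the output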

In [95]:
def check_output(valid_dataloader,
                 model,
                 vocab_src,
                 vocab_tgt,
                 n_examples=15,
                 pad_idx=2,
                 eos_string="</s>"):
    """
    :param valid_dataloader: validation data loader, built with DataLoader()
    :param n_examples: number of examples to show (default 15)
    :param pad_idx: index of the padding token
    :param eos_string: </s>, the end-of-sequence token
    """
    results = [()] * n_examples  # [(), (),..., ()]
    for idx in range(n_examples):
        print("\n====Example %d ====\n" % (idx + 1))
        b = next(iter(valid_dataloader))  # fetch one batch from a fresh iterator with next()
        rb = Batch(src=b[0], tgt=b[1], pad=pad_idx)
        # greedy_decode(model=model, src=rb.src, src_mask=rb.src_mask, max_len=64, start_symbol=0)[0]

        # get_stoi(): returns the token-to-index dict
        # get_itos(): returns the list of tokens in the vocabulary, ordered by ascending index
        src_tokens = [
            vocab_src.get_itos()[x] for x in rb.src[0] if x != pad_idx
        ]
        tgt_tokens = [
            vocab_tgt.get_itos()[x] for x in rb.tgt[0] if x != pad_idx
        ]
        print("输入文本:    " + " ".join(src_tokens).replace("\n", ""))
        print("目标文本:    " + " ".join(tgt_tokens).replace("\n", ""))

        model_out = greedy_decode(model=model,
                                  src=rb.src,
                                  src_mask=rb.src_mask,
                                  max_len=72,
                                  start_symbol=0)[0]
        model_txt = (
            " ".join([
                vocab_tgt.get_itos()[x] for x in model_out if x != pad_idx
            ]).split(sep=eos_string, maxsplit=1)[0]  # cut at the first </s> end token
            + eos_string)
        print("输出文本:    " + model_txt.replace("\n", ""))
        results[idx] = (rb, src_tokens, tgt_tokens, model_out, model_txt)
    return results
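
The `split(..., maxsplit=1)[0] + eos_string` idiom used for `model_txt` is easy to misread, so here is a standalone sketch of just that step. The helper name `truncate_at_eos` and the token lists are made up for illustration.

```python
def truncate_at_eos(tokens, eos_string="</s>"):
    """Join tokens, keep the text before the first EOS, then re-append EOS."""
    text = " ".join(tokens)
    return text.split(sep=eos_string, maxsplit=1)[0] + eos_string


decoded = ["<s>", "A", "dog", ".", "</s>", "<blank>", "<blank>"]
print(truncate_at_eos(decoded))  # <s> A dog . </s>

# If the decoder never emitted an EOS token, one is still appended.
print(truncate_at_eos(["<s>", "A", "dog"]))  # <s> A dog</s>
```

Because the end token is always re-appended, the printed output ends in `</s>` even when greedy decoding reached `max_len` without producing one.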

2. Run the model on examples

In [96]:
def run_model_example(n_examples=5):
    global vocab_src, vocab_tgt, spacy_de, spacy_en

    cpu = torch.device("cpu")
    print("正在处理数据中....")
    _, valid_dataloader = create_dataloaders(
        device=cpu,
        vocab_src=vocab_src,
        vocab_tgt=vocab_tgt,
        spacy_de=spacy_de,
        spacy_en=spacy_en,
        batch_size=1,
        is_distributed=False,
    )

    print("加载模型中...")
    model = make_model(len(vocab_src), len(vocab_tgt), N=6)
    model.load_state_dict(
        torch.load("./models/Pytorch/multi30k_model_final.pt",
                   map_location=cpu))  # load weights saved on a GPU onto the CPU

    print("检查模型输出中...")
    example_data = check_output(valid_dataloader,
                                model,
                                vocab_src,
                                vocab_tgt,
                                n_examples=n_examples)

    return model, example_data

In [97]:
model, example_data = run_model_example(n_examples=5)

正在处理数据中....




加载模型中...
检查模型输出中...

====Example 1 ====

输入文本:    <s> Eine Frau in Weiß spielt auf einer schwarzen Gitarre . </s>
目标文本:    <s> A woman in white plays a black guitar . </s>
输出文本:    <s> A woman in white is playing a black guitar . </s>

====Example 2 ====

输入文本:    <s> Ein kleines Mädchen steht draußen in einer Pfütze . </s>
目标文本:    <s> A little girl stands in a puddle outside . </s>
输出文本:    <s> A little girl stands outside in a puddle . </s>

====Example 3 ====

输入文本:    <s> Ein kleines Mädchen späht über eine blaue Mauer . </s>
目标文本:    <s> A little girl peering over a blue wall . </s>
输出文本:    <s> A little girl is peeking over a blue wall . </s>

====Example 4 ====

输入文本:    <s> Der Mann mit dem weißen Gürtel und der Sonnenbrille hält die Hand des Mädchens . </s>
目标文本:    <s> The man in the white belt and sunglasses is holding the girl 's hand . </s>
输出文本:    <s> The man with a white belt and sunglasses is holding the girl . </s>

====Example 5 ====

输入文本:    <s> Ein kleines Kind s

## Attention Visualization

Even with a greedy decoder, the translations look pretty good. We can visualize them further to see what is happening in each attention layer.

In [98]:
import altair as alt  # used by attn_map and visualize_layer below
import pandas as pd

In [99]:
def mtx2df(m, max_row, max_col, row_tokens, col_tokens):
    "convert a dense matrix to a data frame with row and column indices"
    return pd.DataFrame(
        [(
            r,
            c,
            float(m[r, c]),
            "%.3d %s" %
            (r, row_tokens[r] if len(row_tokens) > r else "<blank>"),
            "%.3d %s" %
            (c, col_tokens[c] if len(col_tokens) > c else "<blank>"),
        ) for r in range(m.shape[0])
         for c in range(m.shape[1]) if r < max_row and c < max_col],
        # if float(m[r,c]) != 0 and r < max_row and c < max_col],
        columns=["row", "column", "value", "row_token", "col_token"],
    )


def attn_map(attn, layer, head, row_tokens, col_tokens, max_dim=30):
    df = mtx2df(
        attn[0, head].data,
        max_dim,
        max_dim,
        row_tokens,
        col_tokens,
    )
    return (alt.Chart(data=df).mark_rect().encode(
        x=alt.X("col_token", axis=alt.Axis(title="")),
        y=alt.Y("row_token", axis=alt.Axis(title="")),
        color="value",
        tooltip=["row", "column", "value", "row_token", "col_token"],
    ).properties(height=400, width=400).interactive())
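
The `"%.3d %s"` labels built in `mtx2df` prefix every token with its zero-padded position. A likely reason, stated here as an assumption, is that it keeps repeated tokens distinct and makes alphabetical sorting of the axis labels coincide with sequence order. The sketch below isolates that formatting; `position_labels` is a hypothetical helper, not part of the notebook.

```python
def position_labels(tokens, n):
    """Build axis labels the same way mtx2df does for rows and columns."""
    return [
        "%.3d %s" % (i, tokens[i] if len(tokens) > i else "<blank>")
        for i in range(n)
    ]


tokens = ["<s>", "the", "cat", "saw", "the", "dog", "</s>"]
labels = position_labels(tokens, 8)
print(labels[1], "|", labels[4], "|", labels[7])  # 001 the | 004 the | 007 <blank>

# Both occurrences of "the" get distinct labels, and sorting the labels
# as strings preserves the original token order.
print(len(set(labels)) == len(labels), sorted(labels) == labels)  # True True
```

Positions beyond the end of the token list are labelled `<blank>`, which is how padded rows and columns show up in the heat maps.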

In [100]:
def get_encoder(model, layer):
    return model.encoder.layers[layer].self_attn.p_attn


def get_decoder_self(model, layer):
    return model.decoder.layers[layer].self_attn.p_attn


def get_decoder_src(model, layer):
    return model.decoder.layers[layer].src_attn.p_attn


def visualize_layer(model, layer, getter_fn, ntokens, row_tokens, col_tokens):
    # ntokens = last_example[0].ntokens
    attn = getter_fn(model, layer)
    n_heads = attn.shape[1]
    charts = [
        attn_map(
            attn,
            0,
            h,
            row_tokens=row_tokens,
            col_tokens=col_tokens,
            max_dim=ntokens,
        ) for h in range(n_heads)
    ]
    assert n_heads == 8
    return alt.vconcat(charts[0]
                       # | charts[1]
                       | charts[2]
                       # | charts[3]
                       | charts[4]
                       # | charts[5]
                       | charts[6]
                       # | charts[7]
                       # layer + 1 due to 0-indexing
                       ).properties(title="Layer %d" % (layer + 1))

## Encoder Self Attention

In [101]:
def viz_encoder_self():
    model, example_data = run_model_example(n_examples=1)
    # batch object for the final example
    example = example_data[len(example_data) - 1]

    layer_viz = [
        visualize_layer(model, layer, get_encoder, len(example[1]), example[1],
                        example[1]) for layer in range(6)
    ]
    return alt.hconcat(layer_viz[0]
                       # & layer_viz[1]
                       & layer_viz[2]
                       # & layer_viz[3]
                       & layer_viz[4]
                       # & layer_viz[5]
                       )


show_example(viz_encoder_self)

正在处理数据中....
加载模型中...
检查模型输出中...

====Example 1 ====

输入文本:    <s> Hund an Leine buddelt in ländlicher Gegend im Schnee . </s>
目标文本:    <s> Dog on leash <unk> into snow in rural area . </s>
输出文本:    <s> A dog is <unk> through a snow - covered area . </s>


<img src="./images/other/16-60.svg" >

## Decoder Self Attention

In [102]:
def viz_decoder_self():
    model, example_data = run_model_example(n_examples=1)
    example = example_data[len(example_data) - 1]

    layer_viz = [
        visualize_layer(
            model,
            layer,
            get_decoder_self,
            len(example[1]),
            example[1],
            example[1],
        ) for layer in range(6)
    ]
    return alt.hconcat(layer_viz[0]
                       & layer_viz[1]
                       & layer_viz[2]
                       & layer_viz[3]
                       & layer_viz[4]
                       & layer_viz[5])


show_example(viz_decoder_self)

正在处理数据中....
加载模型中...
检查模型输出中...

====Example 1 ====

输入文本:    <s> Ein Mann beim Wakeboarden im Wasser . </s>
目标文本:    <s> A man wakeboards in the water . </s>
输出文本:    <s> A man is in the water in the water . </s>


<img src="./images/other/16-61.svg" >

## Decoder Src Attention

In [103]:
def viz_decoder_src():
    model, example_data = run_model_example(n_examples=1)
    example = example_data[len(example_data) - 1]

    layer_viz = [
        visualize_layer(
            model,
            layer,
            get_decoder_src,
            max(len(example[1]), len(example[2])),
            example[1],
            example[2],
        )
        for layer in range(6)
    ]
    return alt.hconcat(
        layer_viz[0]
        & layer_viz[1]
        & layer_viz[2]
        & layer_viz[3]
        & layer_viz[4]
        & layer_viz[5]
    )


show_example(viz_decoder_src)

正在处理数据中....
加载模型中...
检查模型输出中...

====Example 1 ====

输入文本:    <s> Ein Mann , der zum <unk> von OU gehört , trägt während eines <unk> eine Sonnenbrille auf dem Kopf . </s>
目标文本:    <s> A man on the coaching staff for OU has sunglasses on his head during a football game . </s>
输出文本:    <s> A man is playing a video of food while a <unk> in sunglasses on his head . </s>


<img src="./images/other/16-62.svg" >

# Conclusion

> Apotosome 05/18/22