# Seq2Seq Model

## 1. Basic Model Architecture

### 1.1 Model Overview

The Seq2Seq model can be pictured as in the figure below:

![img](https://bamboowine-img-1259155549.cos.ap-beijing.myqcloud.com/img/gF2xtP.png)

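If we treat Seq2Seq as a black box, it has three important input and output variables:

+ `enc_input`: the input to the Encoder
+ `dec_input`: the input to the Decoder
+ `dec_output`: the output of the Decoder

There are two further details worth noting:

1.  **Start** and **end markers**, `<SOS>` and `<EOS>`, are added to these three variables;
2.  The input strings are then converted to vectors of indices.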

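### 1.2 Some Questions

+ How are the Decoder's input and output, `dec_input` and `dec_output`, related?

    1) During training, `dec_input[t]` is in general the Decoder's output at time t-1. With teacher forcing, `dec_input` is instead `dec_output` shifted right by one position, as the figure above also shows. 2) During testing, `dec_input[t]` is the Decoder's output at time t-1, as shown below.

    <img src="https://bamboowine-img-1259155549.cos.ap-beijing.myqcloud.com/img/image-20230104233534346.png" alt="image-20230104233534346" style="zoom:88%;" />

+ Can the Decoder fail to stop during training or testing?

    No. During training the length of the target string is known, and during testing decoding is subject to a **length limit**.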
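The two answers above can be sketched in plain Python. This is only an illustration: the helper names `make_decoder_pair` and `greedy_decode` and the toy `step_fn` are assumptions made for this example and are not part of the code in section 2.

```python
SOS, EOS = 'S', 'E'


def make_decoder_pair(target):
    """Return (dec_input, dec_output) for one target string."""
    dec_input = SOS + target    # start marker, then the target
    dec_output = target + EOS   # the target, then the end marker
    return dec_input, dec_output


def greedy_decode(step_fn, max_steps):
    """Feed each prediction back as the next input, for at most max_steps steps."""
    token, output = SOS, []
    for _ in range(max_steps):  # hard length limit, so decoding always terminates
        token = step_fn(token)
        if token == EOS:
            break
        output.append(token)
    return ''.join(output)


dec_input, dec_output = make_decoder_pair('boy')
# With teacher forcing, dec_input[t] equals dec_output[t - 1] for every t >= 1.
shifted_ok = all(dec_input[t] == dec_output[t - 1]
                 for t in range(1, len(dec_input)))

# A toy "decoder" that never emits EOS still stops after max_steps.
never_stops = greedy_decode(lambda tok: 'a', max_steps=5)
```

The length cap in `greedy_decode` plays the same role as the fixed-length placeholder target used at test time later in the post.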

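## 2. Implementation

### 2.1 Importing Modules and Setting the Random Seed

Import the modules the experiment needs and fix the random seeds so that the experiment is reproducible.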

In [2]:
import torch
import copy
import datetime
import numpy as np
import pandas as pd
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader, random_split
from matplotlib import pyplot as plt



In [3]:
def same_seed(seed):
    '''Fixes random number generator seeds for reproducibility.'''
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
    np.random.seed(seed)
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed_all(seed)


same_seed(1)


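### 2.2 Generating the Dataset

The idea for this dataset comes from the web and can be seen as **English antonym translation**. The original version translates one word at a time, like the pairs in `bin_seq` in the code below. Such a dataset cannot provide a test set, though: a brand-new word is something the model has never seen, so its output for that word would be meaningless.

We therefore combine the original pairs two at a time, expanding the 9 pairs into 72 samples (9 × 8 ordered combinations). Each sample consists of two words, and the source and target strings are still antonyms of each other.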


In [4]:
letter = list('SE?abcdefghijklmnopqrstuvwxyz ')
letter2idx = {n: i for i, n in enumerate(letter)}

bin_seq = [
    ['man', 'women'], ['black', 'white'], ['king', 'queen'],
    ['girl', 'boy'], ['up', 'down'], ['high', 'low'],
    ['left', 'right'], ['small', 'big'], ['fat', 'thin']]

seq_data = []
for i in range(len(bin_seq)):
    for j in range(i + 1, len(bin_seq)):
        seq_data.append([bin_seq[i][0] + ' ' + bin_seq[j][0],
                         bin_seq[i][1] + ' ' + bin_seq[j][1]])
        seq_data.append([bin_seq[j][0] + ' ' + bin_seq[i][0],
                         bin_seq[j][1] + ' ' + bin_seq[i][1]])
np.random.shuffle(seq_data)
print(f'seq_data size: {len(seq_data)}')
print('====================================')
print('seq_data: ')
print(seq_data)
print('====================================')
max_len = max([max(len(seq[0]), len(seq[1]))for seq in seq_data])

vocab_size = len(letter)
print(f'vocab_size: {vocab_size}, max seq_len: {max_len}')

"""
seq_data size: 72
====================================
seq_data: 
[['left girl', 'right boy'], ['fat man', 'thin women'], ['girl black', 'boy white'], ['girl up', 'boy down'], ['black man', 'white women'], ['high black', 'low white'], ['small up', 'big down'], ['black up', 'white down'], ['up girl', 'down boy'], ['girl high', 'boy low'], ['black left', 'white right'], ['up left', 'down right'], ['up man', 'down women'], ['left up', 'right down'], ['high small', 'low big'], ['small man', 'big women'], ['black girl', 'white boy'], ['fat king', 'thin queen'], ['girl left', 'boy right'], ['up black', 'down white'], ['high up', 'low down'], ['man king', 'women queen'], ['fat up', 'thin down'], ['up king', 'down queen'], ['high fat', 'low thin'], ['fat black', 'thin white'], ['left small', 'right big'], ['man left', 'women right'], ['king small', 'queen big'], ['small black', 'big white'], ['king high', 'queen low'], ['black king', 'white queen'], ['girl small', 'boy big'], ['black fat', 'white thin'], ['small high', 'big low'], ['left black', 'right white'], ['small fat', 'big thin'], ['king up', 'queen down'], ['fat small', 'thin big'], ['up fat', 'down thin'], ['small left', 'big right'], ['fat girl', 'thin boy'], ['man up', 'women down'], ['up small', 'down big'], ['girl fat', 'boy thin'], ['king left', 'queen right'], ['king man', 'queen women'], ['left man', 'right women'], ['left high', 'right low'], ['man high', 'women low'], ['high left', 'low right'], ['left king', 'right queen'], ['girl king', 'boy queen'], ['man fat', 'women thin'], ['high girl', 'low boy'], ['high man', 'low women'], ['fat left', 'thin right'], ['high king', 'low queen'], ['small king', 'big queen'], ['black high', 'white low'], ['left fat', 'right thin'], ['black small', 'white big'], ['king girl', 'queen boy'], ['man black', 'women white'], ['girl man', 'boy women'], ['up high', 'down low'], ['small girl', 'big boy'], ['fat high', 'thin low'], ['man girl', 'women boy'], ['man small', 'women big'], ['king fat', 'queen thin'], ['king black', 'queen white']]
====================================
vocab_size: 30, max seq_len: 11
"""





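Next, the strings are converted to index vectors by the function `word2vec` below: **each character is replaced by its index in `letter`**. Every string is first padded at the end with '?' up to the **maximum length** `max_len`. Then an **end marker** `E` is appended to `enc_input`, a **start marker** `S` is prepended to `dec_input`, and an **end marker** `E` is appended to `dec_output`. Because the markers are added after padding, `E` always sits at the last position.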

In [5]:
def word2vec(seq_data):
    enc_input_all, dec_input_all, dec_output_all = [], [], []
    seq_data = copy.deepcopy(seq_data)
    for seq in seq_data:
        for i in range(2):
            seq[i] = seq[i] + '?' * (max_len - len(seq[i]))  # 'man??', 'women'

        enc_input = [letter2idx[n] for n in (seq[0] + 'E')]
        dec_input = [letter2idx[n] for n in ('S' + seq[1])]
        dec_output = [letter2idx[n] for n in (seq[1] + 'E')]

        enc_input_all.append(enc_input)
        dec_input_all.append(dec_input)
        dec_output_all.append(dec_output)

    # make tensor
    return torch.LongTensor(enc_input_all), torch.LongTensor(dec_input_all), torch.LongTensor(dec_output_all)

# dim: [len(seq_data), max_len+1]
# enc_input_all, dec_input_all, dec_output_all = word2vec(seq_data)
# enc_input_all.shape, dec_input_all.shape, dec_output_all.shape
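To make the encoding concrete, here is a tensor-free sketch of what happens to a single pair. The helper `encode_pair` and the small `max_len=5` are assumptions chosen for readability; the real `word2vec` above works on the whole list and returns `LongTensor`s.

```python
letter = list('SE?abcdefghijklmnopqrstuvwxyz ')
letter2idx = {ch: i for i, ch in enumerate(letter)}


def encode_pair(src, trg, max_len):
    """Pad both strings with '?' and add the S/E markers (indices only)."""
    src = src + '?' * (max_len - len(src))
    trg = trg + '?' * (max_len - len(trg))
    enc_input = [letter2idx[ch] for ch in src + 'E']
    dec_input = [letter2idx[ch] for ch in 'S' + trg]
    dec_output = [letter2idx[ch] for ch in trg + 'E']
    return enc_input, dec_input, dec_output


enc_input, dec_input, dec_output = encode_pair('up', 'down', max_len=5)
print(enc_input)   # indices of 'up???E'
print(dec_input)   # indices of 'Sdown?'
print(dec_output)  # indices of 'down?E'
```

All three vectors have length `max_len + 1`, which is why the tensors returned by `word2vec` have shape `[len(seq_data), max_len + 1]`.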


### 2.3 Hyperparameters

In [6]:
enc_num_emb = vocab_size        # vocab_size of src
enc_emb_dim = 16
enc_hid_dim = 256
enc_num_layers = 2

dec_num_emb = vocab_size        # vocab_size of trg
dec_emb_dim = 16
dec_hid_dim = 256
dec_num_layers = 2

dropout = 0.5
bidirectional = True

lr = 0.001
n_epochs = 5000
n_early_stop = 1500
n_save_steps = 100
batch_size = 3
teacher_forcing_ratio = 1.0

train_ratio = 0.9
valid_ratio = 0.2
save_path = 'model.ckpt'
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')


### 2.4 Defining the Dataset

In [7]:
class SeqDataset(Dataset):
    def __init__(self, enc_input_all, dec_input_all, dec_output_all) -> None:
        super().__init__()
        self.enc_input_all = enc_input_all
        self.dec_input_all = dec_input_all
        self.dec_output_all = dec_output_all

    def __getitem__(self, idx):
        return self.enc_input_all[idx], self.dec_input_all[idx], self.dec_output_all[idx]

    def __len__(self):
        return len(self.enc_input_all)


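### 2.5 Splitting the Dataset

Using the ratios defined above, the data is split into a **training set**, a **validation set**, and a **test set**. Each split is wrapped in a `SeqDataset` and served through a `DataLoader`.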

In [8]:
train_size = int(len(seq_data) * train_ratio)
train_seq_data, test_seq_data = seq_data[:train_size], seq_data[train_size:]

valid_size = int(train_size * valid_ratio)
train_seq_data, valid_seq_data = train_seq_data[valid_size:], train_seq_data[:valid_size]

train_dataset, valid_dataset, test_dataset = SeqDataset(*word2vec(train_seq_data)), \
    SeqDataset(*word2vec(valid_seq_data)), \
    SeqDataset(*word2vec(test_seq_data))

train_loader = DataLoader(
    train_dataset, batch_size=batch_size, shuffle=True, pin_memory=True)
valid_loader = DataLoader(
    valid_dataset, batch_size=batch_size, shuffle=True, pin_memory=True)
test_loader = DataLoader(
    test_dataset, batch_size=batch_size, shuffle=False, pin_memory=True)

print(f'train data size: {len(train_seq_data)}')
print(f'valid data size: {len(valid_seq_data)}')
print(f'test data size: {len(test_seq_data)}')

"""
train data size: 52
valid data size: 12
test data size: 8
"""



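### 2.6 Network Model

#### 2.6.1 Encoder

The Encoder is fairly simple:

1.  The input first passes through an `nn.Embedding` layer, which maps each character index to a dense vector of size `emb_dim`, so a sentence of indices becomes a 2-D matrix. It plays the role of a one-hot encoding, but with learned, lower-dimensional vectors;
2.  Next comes a GRU. Note that its `bidirectional` argument is taken from the global setting in section 2.3, which is `True`, so the GRU is bidirectional.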

In [9]:
class Encoder(nn.Module):
    def __init__(self, num_emb, emb_dim, hid_dim, num_layers, dropout=0.1) -> None:
        super().__init__()
        self.hid_dim = hid_dim
        self.num_layers = num_layers
        self.embedding = nn.Embedding(num_emb, emb_dim)
        self.rnn = nn.GRU(emb_dim, hid_dim, num_layers,
                          dropout=dropout, bidirectional=bidirectional)
        self.dropout = nn.Dropout(dropout)

    # src: [batch_size, seq_len]
    def forward(self, src):
        # embedded: [batch_size, seq_len, emb_dim]
        embedded = self.embedding(src)
        embedded = self.dropout(embedded)

        # embedded: [seq_len, batch_size, emb_dim]
        embedded = embedded.transpose(0, 1)

        # h0 is zeros, default
        # out: [seq_len, batch_size, D * hid_dim]
        # hn: [D * num_layers, batch_size, hid_dim]
        out, hn = self.rnn(embedded)
        return out, hn
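The output shapes of a bidirectional GRU are easy to get wrong, so a quick standalone check may help. The sizes below are illustrative assumptions (they mirror section 2.3 but are not read from it), and dropout is left at its default of 0.

```python
import torch
import torch.nn as nn

seq_len, batch_size, emb_dim, hid_dim, num_layers = 12, 3, 16, 256, 2
rnn = nn.GRU(emb_dim, hid_dim, num_layers, bidirectional=True)

embedded = torch.randn(seq_len, batch_size, emb_dim)  # [seq_len, batch_size, emb_dim]
out, hn = rnn(embedded)                               # h0 defaults to zeros

print(out.shape)  # [seq_len, batch_size, 2 * hid_dim]
print(hn.shape)   # [2 * num_layers, batch_size, hid_dim]
```

The direction factor doubles the feature axis of `out` but the leading axis of `hn`, which is why the Seq2Seq wrapper later needs two different linear layers to adapt them.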


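#### 2.6.2 Decoder

The Decoder works as follows:

1.  The model takes three inputs:
    + `input`: the Decoder's **prediction** at the previous time step. Training usually uses teacher forcing, in which case `input` is the **ground truth** for the previous time step, i.e. `dec_input_all` from the `Dataset` above. At the first time step, `input` is the **start marker** mentioned earlier;
    + `hidden`: the Decoder's hidden state at the previous time step. At the first time step, `hidden` is the **final hidden state of the Encoder**;
    + `context`: the **context vector** produced by the Encoder;
2.  `input` first passes through an `nn.Embedding` layer;
3.  `context` is then concatenated with the embedded `input` to form the GRU input at the current time step;
4.  Finally, a linear layer maps the GRU output from `hid_dim` to `num_emb`.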


In [10]:
class Decoder(nn.Module):
    def __init__(self, num_emb, emb_dim, hid_dim, num_layers, dropout=0.1) -> None:
        super().__init__()
        self.num_emb = num_emb
        self.hid_dim = hid_dim
        self.num_layers = num_layers
        self.embedding = nn.Embedding(num_emb, emb_dim)
        # self.rnn = nn.GRU(emb_dim, hid_dim, num_layers, dropout=dropout)
        self.rnn = nn.GRU(emb_dim + hid_dim, hid_dim, num_layers,
                          dropout=dropout, bidirectional=False)
        self.dropout = nn.Dropout(dropout)
        self.fc_out = nn.Linear(hid_dim, num_emb)

    # input:    [batch_size,]
    # hidden:   [num_layers, batch_size, hid_dim]
    # context:  [1, batch_size, hid_dim]
    def forward(self, input, hidden, context):
        # embedded: [batch_size, emb_dim]
        embedded = self.embedding(input)
        embedded = self.dropout(embedded)

        # embedded: [1, batch_size, emb_dim]
        embedded = embedded.unsqueeze(0)

        # emb_cxt_cat: [1, batch_size, emb_dim + hid_dim]
        emb_cxt_cat = torch.cat((embedded, context), dim=-1)

        # out: [1, batch_size, hid_dim]
        # hn: [num_layers, batch_size, hid_dim]
        out, hn = self.rnn(emb_cxt_cat, hidden)

        # out: [batch_size, num_emb]
        out = self.fc_out(out.squeeze(0))
        return out, hn


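#### 2.6.3 Seq2Seq

The Seq2Seq model ties the Encoder and the Decoder together:

1.  The model takes three inputs:
    + `src`: the source string;
    + `trg`: the target string. During training it is the **correct target string**; during testing it is **a string of `?` padding characters of maximum length**;
    + `teacher_forcing_ratio`: the probability of using teacher forcing for the Decoder input;
2.  `src` is first fed to the Encoder, giving `enc_out` and `hidden`. In this demo the Encoder output at the last time step serves as the context vector, i.e. `context = enc_out[-1:]`;
3.  Because the Encoder is bidirectional while the Decoder is unidirectional, the dimensions of `enc_out` and `hidden` do not match what the Decoder expects, so linear layers are added to convert them;
4.  The Decoder is then run step by step in a **for loop**, which produces the predictions `dec_outs`.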
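The dimension conversion in step 3 can be checked in isolation. The sketch below uses random tensors in place of real Encoder outputs, and the sizes are assumptions for illustration; the layer names follow the class below.

```python
import torch
import torch.nn as nn

num_layers, batch_size, hid_dim = 2, 3, 256
hidden = torch.randn(2 * num_layers, batch_size, hid_dim)  # bidirectional encoder state
enc_out_last = torch.randn(1, batch_size, 2 * hid_dim)     # stands in for enc_out[-1:]

hid_tran = nn.Linear(2 * num_layers, num_layers)  # mixes the (direction, layer) axis
out_tran = nn.Linear(2 * hid_dim, hid_dim)        # halves the feature axis

# nn.Linear acts on the last axis, so move the layer axis there and back again.
dec_hidden = hid_tran(hidden.permute(1, 2, 0)).permute(2, 0, 1).contiguous()
context = out_tran(enc_out_last)

print(dec_hidden.shape)  # [num_layers, batch_size, hid_dim]
print(context.shape)     # [1, batch_size, hid_dim]
```

Using a learned linear map over the layer axis is one option among several; summing or concatenating the forward and backward states would also give a shape the Decoder accepts.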

In [11]:
class Seq2Seq(nn.Module):
    def __init__(self, encoder, decoder, device) -> None:
        super().__init__()
        self.encoder = encoder
        self.decoder = decoder
        self.device = device

        assert encoder.hid_dim == decoder.hid_dim, "Hidden dimensions of encoder and decoder must be equal!"
        assert encoder.num_layers == decoder.num_layers, "Encoder and decoder must have equal number of layers!"

        # Encoder
        if bidirectional:
            self.hid_tran = nn.Linear(
                encoder.num_layers * 2, decoder.num_layers)
            self.out_tran = nn.Linear(encoder.hid_dim * 2, decoder.hid_dim)

    # src: [batch_size, src_len]
    # trg: [batch_size, trg_len]
    def forward(self, src, trg, teacher_forcing_ratio=0.5):
        batch_size, trg_len = trg.shape
        num_trg_vocab = self.decoder.num_emb
        enc_out, hidden = self.encoder(src)
        context = enc_out[-1:]

        if bidirectional:
            hidden = self.hid_tran(hidden.permute(
                1, 2, 0)).permute(2, 0, 1).contiguous()
            context = self.out_tran(context)

        dec_outs = torch.zeros(trg_len, batch_size,
                               num_trg_vocab).to(self.device)

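        # dec_input: [batch_size,], starts as the 'S' marker
        dec_input = trg[:, 0]
        for t in range(trg_len):
            # out: [batch_size, num_trg_vocab]
            out, hidden = self.decoder(dec_input, hidden, context)
            dec_outs[t] = out
            pred = out.argmax(1)
            # The input for step t + 1 is trg[:, t + 1] under teacher forcing
            # (trg is 'S' + target), otherwise the current prediction.
            if t + 1 < trg_len:
                use_teacher = np.random.random() < teacher_forcing_ratio
                dec_input = trg[:, t + 1] if use_teacher else pred
        return dec_outs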


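### 2.7 Training

Training is conventional. The criterion is `CrossEntropyLoss`, and the loss of each batch is obtained by looping over the samples in the batch, computing each sample's loss, and summing them.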
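The per-sample loop can also be written as a single call. The sketch below, on random logits with assumed sizes, checks that summing the per-sample mean losses equals the token-level mean loss multiplied by the batch size. It is a vectorised alternative, not the code used in the training cell.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
batch_size, trg_len, vocab_size = 3, 12, 30
pred = torch.randn(batch_size, trg_len, vocab_size)            # logits
target = torch.randint(0, vocab_size, (batch_size, trg_len))   # token indices

criterion = nn.CrossEntropyLoss()  # default reduction='mean'

# Per-sample loop: mean over positions, summed over samples.
loop_loss = sum(criterion(pred[i], target[i]) for i in range(batch_size))

# One call over all positions: mean over batch_size * trg_len tokens.
flat_loss = criterion(pred.reshape(-1, vocab_size), target.reshape(-1))

print(torch.allclose(loop_loss, flat_loss * batch_size, atol=1e-5))
```

One consequence of the summed form is that the reported loss scales with the batch size, which is worth keeping in mind when comparing runs with different `batch_size` values.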

In [12]:
def main(train_loader, valid_loader, criterion):
    optimizer = optim.Adam(seq2seq.parameters(), lr=lr)
    # scheduler = optim.lr_scheduler.CosineAnnealingWarmRestarts(optimizer, T_0=16, T_mult=1)

    best_loss, early_stop = np.inf, 0
    for epoch in range(n_epochs):
        loss_record = []
        seq2seq.train()
        for enc_input_batch, dec_input_batch, dec_out_batch in train_loader:
            enc_input_batch, dec_input_batch, dec_out_batch = enc_input_batch.to(
                device), dec_input_batch.to(device), dec_out_batch.to(device)
            pred = seq2seq(enc_input_batch, dec_input_batch,
                           teacher_forcing_ratio)
            pred = pred.transpose(0, 1)

            loss = 0
            for i in range(len(dec_out_batch)):
                loss += criterion(pred[i], dec_out_batch[i])
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            # scheduler.step()

            loss_record.append(loss.item())

        mean_train_loss = sum(loss_record) / len(loss_record)

        seq2seq.eval()
        loss_record = []
        with torch.no_grad():
            for enc_input_batch, dec_input_batch, dec_out_batch in valid_loader:
                enc_input_batch, dec_input_batch, dec_out_batch = enc_input_batch.to(
                    device), dec_input_batch.to(device), dec_out_batch.to(device)
                pred = seq2seq(enc_input_batch, dec_input_batch,
                               teacher_forcing_ratio)
                pred = pred.transpose(0, 1)
                loss = 0
                for i in range(len(dec_out_batch)):
                    loss += criterion(pred[i], dec_out_batch[i])
                loss_record.append(loss.item())

        mean_valid_loss = sum(loss_record) / len(loss_record)

        if (epoch + 1) % n_save_steps == 0:
            print(
                f'Epoch [{epoch + 1}/{n_epochs}]: Train loss: {mean_train_loss:.8f}, Valid loss: {mean_valid_loss:.8f}')
            if mean_valid_loss < best_loss:
                best_loss = mean_valid_loss
                torch.save(seq2seq.state_dict(), save_path)
                print(f'Saving model with loss {best_loss:.8f}...')
                early_stop = 0
            else:
                early_stop += n_save_steps

        if early_stop >= n_early_stop:
            print(f'\nBest valid loss: {best_loss}')
            print('\nModel is not improving, so we halt the training session.')
            return


encoder = Encoder(enc_num_emb, enc_emb_dim, enc_hid_dim,
                  enc_num_layers, dropout).to(device)
decoder = Decoder(dec_num_emb, dec_emb_dim, dec_hid_dim,
                  dec_num_layers, dropout).to(device)
seq2seq = Seq2Seq(encoder, decoder, device).to(device)
criterion = nn.CrossEntropyLoss().to(device)

main(train_loader, valid_loader, criterion)

"""
Epoch [100/5000]: Train loss: 0.02501816, Valid loss: 0.05040905
Saving model with loss 0.05040905...
Epoch [200/5000]: Train loss: 0.11218390, Valid loss: 0.13767572
Epoch [300/5000]: Train loss: 0.00292161, Valid loss: 0.00804337
Saving model with loss 0.00804337...
Epoch [400/5000]: Train loss: 0.06505425, Valid loss: 0.06209812
Epoch [500/5000]: Train loss: 0.00547583, Valid loss: 0.01514314
Epoch [600/5000]: Train loss: 0.00443477, Valid loss: 0.07757122
Epoch [700/5000]: Train loss: 0.00155966, Valid loss: 0.05722510
Epoch [800/5000]: Train loss: 0.00427389, Valid loss: 0.07414128
Epoch [900/5000]: Train loss: 0.18153160, Valid loss: 0.12175162
Epoch [1000/5000]: Train loss: 0.00124802, Valid loss: 0.01263226
Epoch [1100/5000]: Train loss: 0.02267172, Valid loss: 0.14032485
Epoch [1200/5000]: Train loss: 0.00175910, Valid loss: 0.01846752
Epoch [1300/5000]: Train loss: 0.00211835, Valid loss: 0.02610047
Epoch [1400/5000]: Train loss: 0.00064308, Valid loss: 0.14011183
Epoch [1500/5000]: Train loss: 0.00622014, Valid loss: 0.02119574
Epoch [1600/5000]: Train loss: 0.00481345, Valid loss: 0.01523455
Epoch [1700/5000]: Train loss: 0.00129550, Valid loss: 0.35570633
Epoch [1800/5000]: Train loss: 0.00097497, Valid loss: 0.02685569

Best valid loss: 0.008043368288781494

Model is not improving, so we halt the training session.
"""



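### 2.8 Testing

The test stage works as follows:

1.  We first implement a `predict` function, which takes a source string and returns the predicted string;
2.  To satisfy the Seq2Seq model's interface, we build a "fake" target string made of `max_len` `?` characters. Its fixed length bounds the number of decoding steps, which is the key reason **prediction always terminates**;
3.  The Seq2Seq model returns the output `out`. Taking the index of the maximum at each step gives a vector of indices, which `letter` maps back to a string. The string is then cut at the first occurrence of `E` or `?`.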
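The truncation in step 3 is simple enough to show on its own. `truncate_prediction` is a hypothetical helper written for this example; the cell below achieves the same thing with `index` and `min`.

```python
def truncate_prediction(pred, stop_chars=('E', '?')):
    """Cut the decoded string at the first end marker or padding character."""
    cut = len(pred)
    for ch in stop_chars:
        pos = pred.find(ch)
        if pos != -1:
            cut = min(cut, pos)
    return pred[:cut]


print(truncate_prediction('low thin???E'))  # low thin
print(truncate_prediction('white women'))   # white women
```

A string with neither stop character is returned unchanged, which covers targets that fill all `max_len` positions.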

In [13]:
encoder = Encoder(enc_num_emb, enc_emb_dim, enc_hid_dim,
                  enc_num_layers, dropout).to(device)
decoder = Decoder(dec_num_emb, dec_emb_dim, dec_hid_dim,
                  dec_num_layers, dropout).to(device)
seq2seq = Seq2Seq(encoder, decoder, device).to(device)
seq2seq.load_state_dict(torch.load(save_path))
seq2seq.eval()


def index(s, c):
    return s.index(c) if c in s else len(s)


def predict(word: str):
    enc_input, dec_input, _ = word2vec([[word, '?' * max_len]])
    enc_input, dec_input = enc_input.to(device), dec_input.to(device)
    with torch.no_grad():
        out = seq2seq(enc_input, dec_input, teacher_forcing_ratio=0.0)
    out = out.squeeze(1).argmax(1)
    pred = ''.join([letter[i] for i in out])
    end = min(index(pred, 'E'), index(pred, '?'))
    pred = pred[:end]
    return pred


for seq in test_seq_data:
    src, trg = seq
    pred = predict(src)
    print(f"{src:^11} -> {pred:^11}")

"""
  man up    -> women down 
 black man  -> white women
black king  -> white queen
 high fat   ->  low thin  
 girl man   ->  boy women 
 high man   ->  low women 
 man small  ->  women big 
 left king  -> right queen
"""


  man up    -> women down 
 black man  -> white women
black king  -> white duwn 
 high fat   ->  low thin  
 girl man   ->  boy women 
 high man   ->  low women 
 man small  ->  women big 
 left king  -> right women



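## 3. Attention Layer

In the Seq2Seq model above the Encoder is an RNN, so when the input sequence is long the Encoder inevitably forgets part of the information, and the Decoder may then fail to produce the correct output. The most effective remedy is **Attention**: **each time the Decoder updates its state it looks at all of the Encoder's states and focuses on the most relevant ones, which avoids forgetting**.

With an Attention layer the workflow becomes:

+ After the Encoder has finished encoding the input sequence (**keeping all of its states $h_1, h_2, \ldots, h_m$**), Attention and the Decoder work together;

+ Given the Decoder's current state $s_{t}$, Attention computes its relevance to every Encoder state, $\alpha_{t1},\alpha_{t2},\ldots, \alpha_{tm}$, subject to
    $$
    \sum_{i=1}^m \alpha_{ti} = 1
    $$
    which a Softmax guarantees.

+ The Encoder states are averaged with these weights to give the context vector $c_t$:
    $$
    c_t = \alpha_{t1}h_1 + \alpha_{t2}h_2 + \cdots + \alpha_{tm}h_m
    $$

+ The context vector $c_t$ is then concatenated with the embedding of the current time step and used as the Decoder's input.

The core question is now **how Attention computes the relevance weights**. Two methods are introduced here. In the <b><font color="red">first</font></b>, shown in the figure below, the Encoder hidden state $h_i$ and the Decoder's current state $s_t$ are concatenated and left-multiplied by a parameter matrix $W$ to give a vector. The hyperbolic tangent tanh is applied to that vector, squashing its elements into the range -1 to 1, and the result is dotted with a parameter vector $V$ to give the score $\tilde{\alpha}_{ti}$. Once $\tilde{\alpha}_{t1}, \tilde{\alpha}_{t2}, \dots, \tilde{\alpha}_{tm}$ have been computed, a Softmax turns them into $\alpha_{t1}, \alpha_{t2}, \dots, \alpha_{tm}$.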

![image-20230108184250154](https://bamboowine-img-1259155549.cos.ap-beijing.myqcloud.com/img/image-20230108184250154.png)
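Written as a formula, the first method reads as follows. This is a reconstruction from the verbal description, with $\tilde{\alpha}_{ti}$ denoting the unnormalised score and $\alpha_{ti}$ the weight after the Softmax; the order in which $h_i$ and $s_t$ are stacked is an assumption and only affects how the columns of $W$ are arranged.

$$
\tilde{\alpha}_{ti} = V^{T} \tanh\left(W \begin{bmatrix} h_i \\ s_t \end{bmatrix}\right), \qquad
\alpha_{ti} = \frac{\exp(\tilde{\alpha}_{ti})}{\sum_{j=1}^{m} \exp(\tilde{\alpha}_{tj})}
$$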

The <b><font color="red">second method</font></b> is:

1. Apply two parameter matrices $W_K$ and $W_Q$ to $h_i$ and $s_t$ respectively, giving the vectors $k_i$ and $q_t$:
    $$
    \begin{aligned}
    k_i = W_K \cdot h_i, \quad \text{for i = 1 to m} \\
    q_t = W_Q \cdot s_t, \quad \text{for t = 1 to T}
    \end{aligned}
    $$

2. Take the inner product of $k_i$ and $q_t$ to get $\tilde{\alpha}_{ti}$:
    $$
    \tilde{\alpha}_{ti} = k_i^Tq_t, \quad \text{for i = 1 to m,} \,\, \text{t = 1 to T}
    $$

3. Apply a Softmax to obtain $\alpha_{t1}, \alpha_{t2}, \dots, \alpha_{tm}$.

<b><font color="red">This demo uses the first method</font></b>.
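Since the second method is not implemented in this demo, a minimal sketch for a single decoding step may be useful. All sizes and the use of bias-free `nn.Linear` layers for the two parameter matrices are assumptions for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
m, enc_dim, dec_dim, attn_dim = 5, 8, 6, 4      # hypothetical sizes

W_K = nn.Linear(enc_dim, attn_dim, bias=False)  # k_i = W_K h_i
W_Q = nn.Linear(dec_dim, attn_dim, bias=False)  # q_t = W_Q s_t

h = torch.randn(m, enc_dim)   # encoder states h_1 .. h_m
s_t = torch.randn(dec_dim)    # current decoder state

k = W_K(h)                    # [m, attn_dim]
q_t = W_Q(s_t)                # [attn_dim]
scores = k @ q_t              # [m], one inner product per encoder state
alpha = torch.softmax(scores, dim=0)  # weights sum to 1
c_t = alpha @ h               # [enc_dim], weighted average of the encoder states
```

Compared with the first method, this variant needs no tanh and no extra vector $V$, at the cost of projecting both states into a shared space of size `attn_dim`.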

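### 3.1 Encoder

The Encoder is the same as before; the changes are in the **Decoder and in the part that connects it to the Encoder**.

### 3.2 Attention

The Attention code is fairly simple, and its output is the **relevance weights**. The constructor takes the hidden dimensions of the Encoder and the Decoder. `forward` takes the Decoder's current state `s` and the Encoder output `enc_out` (i.e. its hidden states at every time step). To concatenate the two, `s` is first repeated along the sequence axis with `repeat` and `enc_out` is transposed to match; the remaining steps follow the description above exactly. Note the input size of the parameter matrix `atten`: the factor 2 in `(enc_hid_dim * 2)` is there because **the Encoder is bidirectional**, while the output size `dec_hid_dim` is an arbitrary choice.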

In [14]:
class Attention(nn.Module):
    def __init__(self, enc_hid_dim, dec_hid_dim) -> None:
        super().__init__()
        self.atten = nn.Linear((enc_hid_dim * 2) +
                               dec_hid_dim, dec_hid_dim, bias=False)
        self.v = nn.Linear(dec_hid_dim, 1, bias=False)

    # s:        [batch_size, dec_hid_dim]
    # enc_out:  [seq_len, batch_size, enc_hid_dim * D(2)]
    def forward(self, s, enc_out):
        batch_size, seq_len = enc_out.shape[1], enc_out.shape[0]
        # s:   [batch_size, seq_len, dec_hid_dim]
        # enc_out:  [batch_size, seq_len, enc_hid_dim * D(2)]
        s = s.unsqueeze(1).repeat(1, seq_len, 1)
        enc_out = enc_out.transpose(0, 1)

        # energy = [batch_size, src_len, dec_hid_dim]
        energy = torch.tanh(self.atten(torch.cat((s, enc_out), dim=2)))
        # attention: [batch_size, seq_len]
        attention = self.v(energy).squeeze(2)
        # torch.softmax avoids relying on `F`, which is never imported above
        return torch.softmax(attention, dim=1)


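### 3.3 Decoder

Code for this part found online differs somewhat from mine, mainly because **their GRU has a single layer, whereas mine can have several**. Briefly, the design is:

+ Since the GRU may have several layers, its hidden state has one slice per layer. To obtain the `s` that Attention needs, the hidden state is reduced across layers in a **pooling**-like operation, done here with a linear layer, `fc_hidden`;
+ `s` and `enc_out` are fed to Attention, giving the relevance weights `a`;
+ `a` is multiplied with `enc_out` to give the context vector `c`;
+ `c` is concatenated with `embedding` and fed to the GRU.

The remaining operations are the same as before.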
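The first and third bullets are mostly about tensor shapes, so here is a standalone sketch with random tensors. The sizes are assumptions, and the attention weights are random stand-ins for what the `Attention` module would return.

```python
import torch
import torch.nn as nn

num_layers, batch_size, src_len, enc_hid_dim, dec_hid_dim = 2, 3, 12, 256, 256
hidden = torch.randn(num_layers, batch_size, dec_hid_dim)
enc_out = torch.randn(src_len, batch_size, 2 * enc_hid_dim)

# Collapse the layer axis with a learned linear map, giving one state per sample.
fc_hidden = nn.Linear(num_layers, 1)
s = fc_hidden(hidden.permute(1, 2, 0)).squeeze(2)   # [batch_size, dec_hid_dim]

# Stand-in attention weights, normalised over the source positions.
a = torch.softmax(torch.randn(batch_size, src_len), dim=1).unsqueeze(1)

# Batched weighted sum over the source positions.
c = torch.bmm(a, enc_out.transpose(0, 1))           # [batch_size, 1, 2 * enc_hid_dim]

print(s.shape, a.shape, c.shape)
```

Taking only the top layer's state (`hidden[-1]`) would be a simpler way to obtain `s`; the learned map lets the model weight the layers itself.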

In [15]:
class Decoder(nn.Module):
    def __init__(self, num_emb, emb_dim, enc_hid_dim, dec_hid_dim, num_layers, dropout, attention) -> None:
        super().__init__()
        self.num_emb = num_emb
        self.hid_dim = dec_hid_dim
        self.num_layers = num_layers
        self.attention = attention
        self.embedding = nn.Embedding(num_emb, emb_dim)
        self.rnn = nn.GRU(emb_dim + (enc_hid_dim) * 2, dec_hid_dim,
                          num_layers, dropout=dropout, bidirectional=False)
        self.dropout = nn.Dropout(dropout)
        self.fc_hidden = nn.Linear(num_layers, 1)
        self.fc_out = nn.Linear((enc_hid_dim * 2) + dec_hid_dim, num_emb)

    # input:    [batch_size,]
    # hidden:   [num_layers, batch_size, dec_hid_dim]
    # enc_out:  [seq_len, batch_size, D * enc_hid_dim]
    def forward(self, input, hidden, enc_out):
        # embedded: [batch_size, emb_dim]
        embedded = self.dropout(self.embedding(input))

        # embedded: [1, batch_size, emb_dim]
        embedded = embedded.unsqueeze(0)

        s = self.fc_hidden(hidden.permute(1, 2, 0)).permute(
            2, 0, 1).contiguous()
        s = s.squeeze(0)

        # a: [batch_size, 1, src_len]
        a = self.attention(s, enc_out).unsqueeze(1)

        # enc_out: [batch_size, src_len, enc_hid_dim * 2]
        enc_out = enc_out.transpose(0, 1)

        # c: [1, batch_size, enc_hid_dim * 2]
        c = torch.bmm(a, enc_out).transpose(0, 1)

        rnn_input = torch.cat((embedded, c), dim=2)

        # out: [1, batch_size, dec_hid_dim]
        # hn: [num_layers, batch_size, dec_hid_dim]
        out, hn = self.rnn(rnn_input, hidden)

        # out: [batch_size, num_emb]
        out = self.fc_out(torch.cat((out.squeeze(0), c.squeeze(0)), dim=1))
        return out, hn


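### 3.4 Seq2Seq

The Seq2Seq part is almost identical to before: **the Encoder's final hidden state is still passed through a linear layer to match the shape of the Decoder's hidden-state input**.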

In [16]:
class Seq2Seq(nn.Module):
    def __init__(self, encoder, decoder, device) -> None:
        super().__init__()
        self.encoder = encoder
        self.decoder = decoder
        self.device = device
        self.fc = nn.Linear(encoder.hid_dim * 2, decoder.hid_dim)

        assert encoder.hid_dim == decoder.hid_dim, "Hidden dimensions of encoder and decoder must be equal!"
        assert encoder.num_layers == decoder.num_layers, "Encoder and decoder must have equal number of layers!"

        # Encoder
        if bidirectional:
            self.fc_hidden = nn.Linear(
                encoder.num_layers * 2, decoder.num_layers)

    # src: [batch_size, src_len]
    # trg: [batch_size, trg_len]
    def forward(self, src, trg, teacher_forcing_ratio=0.5):
        batch_size, trg_len = trg.shape
        num_trg_vocab = self.decoder.num_emb
        enc_out, hidden = self.encoder(src)

        # hidden: [num_layers, batch_size, hid_dim]
        if bidirectional:
            hidden = self.fc_hidden(hidden.permute(
                1, 2, 0)).permute(2, 0, 1).contiguous()

        dec_outs = torch.zeros(trg_len, batch_size,
                               num_trg_vocab).to(self.device)

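        # dec_input: [batch_size,], starts as the 'S' marker
        dec_input = trg[:, 0]
        for t in range(trg_len):
            # out: [batch_size, num_trg_vocab]
            out, hidden = self.decoder(dec_input, hidden, enc_out)
            dec_outs[t] = out
            pred = out.argmax(1)
            # The input for step t + 1 is trg[:, t + 1] under teacher forcing
            # (trg is 'S' + target), otherwise the current prediction.
            if t + 1 < trg_len:
                use_teacher = np.random.random() < teacher_forcing_ratio
                dec_input = trg[:, t + 1] if use_teacher else pred
        return dec_outs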


### 3.5 Test Results

Probably because the dataset is small, the plain RNN-based Seq2Seq model already performs very well, and adding Attention to the RNN gives no visible improvement.

In [17]:
"""
girl black  ->  boy white 
  left up   -> right down 
 man left   -> women right
 left fat   -> right thin 
  high up   ->  low down  
black small ->  white big 
small king  ->  big queen 
  up left   -> down right 
  up king   -> down queen 
  fat up    ->  thin down 
 king fat   -> queen thin 
 girl king  ->  boy queen 
"""



## 4. Problems Encountered

+ For both the plain Seq2Seq and the Attention-based Seq2Seq, the results were consistently unsatisfactory with `num_layers = 1`, whereas setting `num_layers` to 2 gave very good results.

## 5. References

1. [Seq2Seq 的 PyTorch 实现](https://wmathor.com/index.php/archives/1448/)
2. [pytorch中如何做seq2seq](https://zhuanlan.zhihu.com/p/352276786)
3. [基于pytorch的Seq2Seq的实现](https://blog.csdn.net/loki2018/article/details/118071500)
4. [Seq2Seq(Attention)的PyTorch实现（超级详细）](https://blog.csdn.net/qq_37236745/article/details/107085532)
5. [Attention is all you need：剥离RNN，保留Attention](https://blog.csdn.net/qq_24178985/article/details/118727611)

