# Seq2Seq, Attention


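In this notebook we reproduce Luong's attention model as closely as we can.

Our dataset is very small, with only a little over ten thousand training sentences, so the trained model does not perform well. If you want to train a better model, see the resources below.

## Further reading

#### Lecture slides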
- [cs224d](http://cs224d.stanford.edu/lectures/CS224d-Lecture15.pdf)


#### Papers
- [Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation](https://arxiv.org/abs/1406.1078)
- [Effective Approaches to Attention-based Neural Machine Translation](https://arxiv.org/abs/1508.04025)
- [Neural Machine Translation by Jointly Learning to Align and Translate](https://arxiv.org/abs/1409.0473)


#### PyTorch code
- [seq2seq-tutorial](https://github.com/spro/practical-pytorch/blob/master/seq2seq-translation/seq2seq-translation.ipynb)
- [Tutorial from Ben Trevett](https://github.com/bentrevett/pytorch-seq2seq)
- [IBM seq2seq](https://github.com/IBM/pytorch-seq2seq)
- [OpenNMT-py](https://github.com/OpenNMT/OpenNMT-py)


#### More on machine translation
- [Beam Search](https://www.coursera.org/lecture/nlp-sequence-models/beam-search-4EtHZ)
- Pointer network (text summarization)
- Copy mechanism (text summarization)
- Coverage loss
- ConvSeq2Seq
- Transformer
- Tensor2Tensor

#### TODO
- Try segmenting the Chinese text into words

#### NER
- https://github.com/allenai/allennlp/tree/master/allennlp


In [1]:
import os
import sys
import math
from collections import Counter  # counts token frequencies
import numpy as np
import random

import torch
import torch.nn as nn
import torch.nn.functional as F

import nltk
import jieba  # Chinese word segmentation library, used via jieba.cut

In [2]:
ls /usr/local/share/nltk_data/tokenizers

punkt/


In [5]:
# for word in jieba.tokenize("日本有很多温泉"):  # jieba.tokenize yields (word, start, end) tuples
#     print(word)
for word in jieba.cut("日本有很多温泉"):  # jieba.cut yields the segmented words
    print(word)

('日本', 0, 2)
('有', 2, 3)
('很多', 3, 5)
('温泉', 5, 7)
日本
有
很多
温泉


In [4]:
for word in nltk.word_tokenize("There are a lot of hot springs in Japan."):
    print(word)

There
are
a
lot
of
hot
springs
in
Japan
.


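Reading the English and Chinese data
- English is tokenized with nltk's word tokenizer and lowercased
- Chinese is segmented into words with jieba.cut in the code below; splitting into single characters is kept as a commented-out alternative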
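
The tokenization choices listed above can be tried without installing nltk or jieba. This is only a minimal sketch: `tokenize_pair` is a hypothetical helper, whitespace splitting stands in for nltk's tokenizer, and the Chinese side uses the character-level alternative rather than jieba.

```python
def tokenize_pair(line):
    """Split one 'english<TAB>chinese' line into two token lists."""
    en_text, cn_text = line.strip().split("\t")
    # Stand-in for nltk.word_tokenize: lowercase, then split on whitespace.
    en_tokens = ["BOS"] + en_text.lower().split() + ["EOS"]
    # Character-level Chinese: every character is its own token.
    cn_tokens = ["BOS"] + list(cn_text) + ["EOS"]
    return en_tokens, cn_tokens

en_tokens, cn_tokens = tokenize_pair("Jump start\t立刻开始一件事\n")
print(en_tokens)
print(cn_tokens)
```

Character-level tokens keep the vocabulary small, while word-level segmentation gives shorter sequences; which works better on this dataset is something to test rather than assume.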

In [4]:
def load_data(in_file):
    en = []  # English sentences
    cn = []  # Chinese sentences
    # read the file line by line (assumes the file is UTF-8 encoded)
    with open(in_file, mode='r', encoding='utf-8') as f:
        for line in f:
            print(line)  # print each raw line for inspection
            line = line.strip().split("\t")  # split the tab-separated English/Chinese pair
            print(line)  # print the [english, chinese] pair

            # add "BOS" at the start and "EOS" at the end of every sentence
            en.append(["BOS"] + nltk.word_tokenize(line[0].lower())+["EOS"])

            # Option 1: split the Chinese sentence into single characters
            # cn.append(["BOS"]+[c for c in line[1]]+["EOS"])

            # Option 2 (used here): segment the Chinese sentence into words with jieba
            cn.append(["BOS"]+ [word for word in jieba.cut(line[1])]+["EOS"])

    return en,cn

traing_file = "/Users/zhenwuzhou/AiProject/data/nmt/en-cn/train.txt"
dev_file = "/Users/zhenwuzhou/AiProject/data/nmt/en-cn/dev.txt"
train_en,train_cn = load_data(traing_file)
dev_en,dev_cn = load_data(dev_file)

There are a lot of hot springs in Japan.	日本有很多温泉。

['There are a lot of hot springs in Japan.', '日本有很多温泉。']
There are a lot of hot springs in Japan.	日本有很多温泉。

['There are a lot of hot springs in Japan.', '日本有很多温泉。']
There are a lot of hot springs in Japan.	日本有很多温泉。

['There are a lot of hot springs in Japan.', '日本有很多温泉。']
jump start	立刻开始一件事

['jump start', '立刻开始一件事']
to stick to one's ribs	大吃一顿

["to stick to one's ribs", '大吃一顿']
to stick in one's craw	无法忍受

["to stick in one's craw", '无法忍受']
to stick one's neck out	冒着风险

["to stick one's neck out", '冒着风险']
soft touch	好说话

['soft touch', '好说话']
There are a lot of hot springs in Japan.	日本有很多温泉。

['There are a lot of hot springs in Japan.', '日本有很多温泉。']
There are a lot of hot springs in Japan.	日本有很多温泉。

['There are a lot of hot springs in Japan.', '日本有很多温泉。']
There are a lot of hot springs in Japan.	日本有很多温泉。

['There are a lot of hot springs in Japan.', '日本有很多温泉。']
There are a lot of hot springs in Japan.	日本有很多温泉。

['There are a lot of ho

In [5]:
print(train_en[:10])

[['BOS', 'there', 'are', 'a', 'lot', 'of', 'hot', 'springs', 'in', 'japan', '.', 'EOS'], ['BOS', 'there', 'are', 'a', 'lot', 'of', 'hot', 'springs', 'in', 'japan', '.', 'EOS'], ['BOS', 'there', 'are', 'a', 'lot', 'of', 'hot', 'springs', 'in', 'japan', '.', 'EOS'], ['BOS', 'jump', 'start', 'EOS'], ['BOS', 'to', 'stick', 'to', 'one', "'s", 'ribs', 'EOS'], ['BOS', 'to', 'stick', 'in', 'one', "'s", 'craw', 'EOS'], ['BOS', 'to', 'stick', 'one', "'s", 'neck', 'out', 'EOS'], ['BOS', 'soft', 'touch', 'EOS'], ['BOS', 'there', 'are', 'a', 'lot', 'of', 'hot', 'springs', 'in', 'japan', '.', 'EOS'], ['BOS', 'there', 'are', 'a', 'lot', 'of', 'hot', 'springs', 'in', 'japan', '.', 'EOS']]


In [6]:
print(train_cn[:10])

[['BOS', '日本', '有', '很多', '温泉', '。', 'EOS'], ['BOS', '日本', '有', '很多', '温泉', '。', 'EOS'], ['BOS', '日本', '有', '很多', '温泉', '。', 'EOS'], ['BOS', '立刻', '开始', '一件', '事', 'EOS'], ['BOS', '大吃一顿', 'EOS'], ['BOS', '无法忍受', 'EOS'], ['BOS', '冒', '着', '风险', 'EOS'], ['BOS', '好', '说话', 'EOS'], ['BOS', '日本', '有', '很多', '温泉', '。', 'EOS'], ['BOS', '日本', '有', '很多', '温泉', '。', 'EOS']]


# Building the vocabulary

In [7]:
UNK_IDX = 0
PAD_IDX = 1
def build_dict(sentences, max_words=50000):
    word_count = Counter()
    for sentence in sentences:
        for s in sentence:
            word_count[s] += 1  # Counter is a dict subclass mapping token -> count
    ls = word_count.most_common(max_words)
    # keep the max_words most frequent tokens; 50000 is an arbitrary cap,
    # and this corpus has fewer distinct tokens than that
    print(len(ls))  # number of distinct tokens kept, e.g. 5491 for train_en on the full corpus
    total_words = len(ls) + 2
    # the 2 extra slots are reserved for "UNK" and "PAD"
    # ls = [('BOS', 14533), ('EOS', 14533), ('.', 12521), ('i', 4045), ...
    word_dict = {w[0]: index+2 for index, w in enumerate(ls)}
    # real tokens start at index 2 because 0 and 1 belong to "UNK" and "PAD"
    word_dict["UNK"] = UNK_IDX
    word_dict["PAD"] = PAD_IDX
    return word_dict, total_words

en_dict, en_total_words = build_dict(train_en)
cn_dict, cn_total_words = build_dict(train_cn)
inv_en_dict = {v: k for k, v in en_dict.items()}
# invert the mapping (index -> token) by swapping keys and values
inv_cn_dict = {v: k for k, v in cn_dict.items()}

97
76
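
To see how the indices are assigned, here is a small sketch on a made-up corpus. `build_vocab` is a hypothetical, condensed restatement of `build_dict` above and the toy sentences are invented for illustration.

```python
from collections import Counter

UNK_IDX, PAD_IDX = 0, 1

def build_vocab(sentences, max_words=50000):
    counts = Counter(tok for sent in sentences for tok in sent)
    # The most frequent token gets index 2, the next one 3, and so on.
    vocab = {tok: i + 2 for i, (tok, _) in enumerate(counts.most_common(max_words))}
    vocab["UNK"] = UNK_IDX
    vocab["PAD"] = PAD_IDX
    return vocab

toy = [["BOS", "a", "b", "EOS"], ["BOS", "a", "EOS"], ["BOS", "EOS"]]
vocab = build_vocab(toy)

# A token that never appeared in the corpus falls back to UNK_IDX.
ids = [vocab.get(tok, UNK_IDX) for tok in ["BOS", "a", "zzz", "EOS"]]
print(vocab)
print(ids)
```

"BOS" and "EOS" occur in every sentence, so they take the two lowest free indices, which matches the dictionaries printed in this notebook.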


In [8]:
print(en_dict)
print(inv_en_dict)

{'BOS': 2, 'EOS': 3, 'in': 4, 'one': 5, "'s": 6, 'a': 7, 'to': 8, 'the': 9, 'of': 10, 'there': 11, 'are': 12, 'lot': 13, 'hot': 14, 'springs': 15, 'japan': 16, '.': 17, 'stick': 18, 'out': 19, 'off': 20, 'hard': 21, 'jump': 22, 'start': 23, 'ribs': 24, 'craw': 25, 'neck': 26, 'soft': 27, 'touch': 28, 'doozie': 29, 'at': 30, 'loggerheads': 31, 'sweeten': 32, 'pot': 33, 'sweetness': 34, 'and': 35, 'light': 36, 'sweetheart': 37, 'deal': 38, 'cat': 39, 'line': 40, 'on': 41, 'heart': 42, 'it': 43, 'make': 44, 'fat': 45, 'is': 46, 'fire': 47, 'meow': 48, 'has': 49, 'got': 50, 'your': 51, 'tongue': 52, 'deep': 53, 'six': 54, 'four': 55, 'flusher': 56, 'sticky': 57, 'wicket': 58, 'go': 59, 'reservation': 60, 'flip': 61, 'gum': 62, 'up': 63, 'works': 64, 'lose': 65, 'cool': 66, 'act': 67, 'follow': 68, 'play': 69, 'ball': 70, 'row': 71, 'hoe': 72, 'toe': 73, 'be': 74, 'have': 75, 'set': 76, 'something': 77, 'not': 78, 'how': 79, 'does': 80, 'that': 81, 'grab': 82, 'you': 83, 'babe': 84, 'woods'

In [9]:
print(cn_dict)
print(inv_cn_dict)

{'BOS': 2, 'EOS': 3, '日本': 4, '有': 5, '很多': 6, '温泉': 7, '。': 8, '的': 9, '事': 10, ' ': 11, '出色': 12, '人': 13, '立刻': 14, '开始': 15, '一件': 16, '大吃一顿': 17, '无法忍受': 18, '冒': 19, '着': 20, '风险': 21, '好': 22, '说话': 23, '或': 24, '心存': 25, '怨恨': 26, '使人': 27, '愿意': 28, '做事': 29, '表里不一': 30, '私下交易': 31, 'the': 32, '别人': 33, '强硬': 34, '事情': 35, 'fat': 36, 'is': 37, 'in': 38, 'fire': 39, '非常': 40, '焦灼': 41, '烦躁': 42, '置之不理': 43, '欺骗': 44, '走投无路': 45, '采取': 46, '行动': 47, '反击': 48, '怒火中烧': 49, '阻碍': 50, '发展': 51, '情绪': 52, '失控': 53, '使用': 54, '手段': 55, '很': 56, '努力': 57, '才能': 58, '完成': 59, '遵守规则': 60, '处于': 61, '危险': 62, '状态': 63, '专心致志': 64, '心不在焉': 65, '你': 66, '怎么': 67, '看': 68, '一窍不通': 69, '大概': 70, '记得': 71, '不自量力': 72, '重视': 73, '相信': 74, '的话': 75, '坦率地': 76, '表达意见': 77, 'UNK': 0, 'PAD': 1}
{2: 'BOS', 3: 'EOS', 4: '日本', 5: '有', 6: '很多', 7: '温泉', 8: '。', 9: '的', 10: '事', 11: ' ', 12: '出色', 13: '人', 14: '立刻', 15: '开始', 16: '一件', 17: '大吃一顿', 18: '无法忍受', 19: '冒', 20: '着', 21: '风险', 22: '好', 23: '说话

# Converting all words to indices

In [10]:
def encode(en_sentences, cn_sentences, en_dict, cn_dict, sort_by_len=True):
    '''
        Map tokens to integer indices and optionally sort sentence pairs by English length.
    '''
    # en_sentences = [['BOS', 'anyone', 'can', 'do', 'that', '.', 'EOS'], ...

    out_en_sentences = [[en_dict.get(w, 0) for w in sent] for sent in en_sentences]
    # out_en_sentences = [[2, 328, 43, 14, 28, 4, 3], ...
    # .get(w, 0) returns the index of w, or 0 (UNK_IDX) if w is not in the vocabulary.
    # Every training word is in the vocabulary; unseen dev words map to UNK.

    out_cn_sentences = [[cn_dict.get(w, 0) for w in sent] for sent in cn_sentences]

    # argsort by sentence length
    def len_argsort(seq):
        # sorted() orders the indices by the key, here len(seq[x]), shortest first
        return sorted(range(len(seq)), key=lambda x: len(seq[x]))

    # reorder English and Chinese with the same indices so the pairs stay aligned
    if sort_by_len:
        sorted_index = len_argsort(out_en_sentences)
        # sorted_index = [63, 1544, 1917, 2650, 3998, 6240, 6294, 6703, ...
        # the leading indices point to the shortest sentences

        out_en_sentences = [out_en_sentences[i] for i in sorted_index]
        # out_en_sentences = [[2, 475, 4, 3], [2, 1318, 126, 3], [2, 1707, 126, 3], ...

        out_cn_sentences = [out_cn_sentences[i] for i in sorted_index]

    return out_en_sentences, out_cn_sentences

train_en, train_cn = encode(train_en, train_cn, en_dict, cn_dict)
dev_en, dev_cn = encode(dev_en, dev_cn, en_dict, cn_dict)

In [12]:
print(train_en)

[[2, 29, 3], [2, 29, 3], [2, 22, 23, 3], [2, 27, 28, 3], [2, 30, 31, 3], [2, 37, 38, 3], [2, 30, 31, 3], [2, 37, 38, 3], [2, 53, 54, 3], [2, 55, 56, 3], [2, 57, 58, 3], [2, 61, 19, 3], [2, 22, 23, 3], [2, 27, 28, 3], [2, 32, 9, 33, 3], [2, 34, 35, 36, 3], [2, 32, 9, 33, 3], [2, 34, 35, 36, 3], [2, 39, 6, 48, 3], [2, 59, 20, 60, 3], [2, 69, 21, 70, 3], [2, 44, 7, 93, 3], [2, 44, 97, 98, 3], [2, 62, 63, 9, 64, 3], [2, 65, 5, 6, 66, 3], [2, 21, 67, 8, 68, 3], [2, 73, 4, 9, 40, 3], [2, 7, 21, 71, 8, 72, 3], [2, 8, 74, 41, 9, 40, 3], [2, 79, 80, 81, 82, 83, 3], [2, 7, 84, 4, 9, 85, 3], [2, 8, 18, 8, 5, 6, 24, 3], [2, 8, 18, 4, 5, 6, 25, 3], [2, 8, 18, 5, 6, 26, 19, 3], [2, 9, 45, 46, 4, 9, 47, 3], [2, 49, 9, 39, 50, 51, 52, 3], [2, 8, 18, 8, 5, 6, 24, 3], [2, 8, 18, 4, 5, 6, 25, 3], [2, 8, 18, 5, 6, 26, 19, 3], [2, 5, 6, 42, 78, 4, 43, 3], [2, 94, 5, 6, 95, 96, 43, 3], [2, 75, 5, 6, 42, 76, 41, 77, 3], [2, 20, 9, 86, 10, 5, 6, 87, 3], [2, 88, 20, 89, 90, 5, 91, 92, 3], [2, 11, 12, 7, 13, 10

In [11]:
print(train_cn)

[[2, 12, 9, 13, 24, 10, 3], [2, 12, 9, 13, 24, 10, 3], [2, 14, 15, 16, 10, 3], [2, 22, 23, 3], [2, 25, 26, 3], [2, 31, 3], [2, 25, 26, 3], [2, 31, 3], [2, 43, 3], [2, 44, 33, 9, 13, 3], [2, 45, 3], [2, 49, 3], [2, 14, 15, 16, 10, 3], [2, 22, 23, 3], [2, 27, 28, 29, 3], [2, 30, 3], [2, 27, 28, 29, 3], [2, 30, 3], [2, 40, 12, 3], [2, 46, 34, 47, 48, 3], [2, 54, 34, 9, 55, 3], [2, 73, 3], [2, 76, 77, 3], [2, 50, 35, 51, 3], [2, 52, 53, 3], [2, 12, 9, 35, 3], [2, 60, 3], [2, 56, 57, 58, 59, 9, 10, 3], [2, 61, 62, 63, 3], [2, 66, 67, 68, 3], [2, 69, 3], [2, 17, 3], [2, 18, 3], [2, 19, 20, 21, 3], [2, 32, 11, 36, 11, 37, 11, 38, 11, 32, 11, 39, 3], [2, 41, 42, 3], [2, 17, 3], [2, 18, 3], [2, 19, 20, 21, 3], [2, 65, 3], [2, 74, 33, 75, 3], [2, 64, 3], [2, 70, 71, 3], [2, 72, 3], [2, 4, 5, 6, 7, 8, 3], [2, 4, 5, 6, 7, 8, 3], [2, 4, 5, 6, 7, 8, 3], [2, 4, 5, 6, 7, 8, 3], [2, 4, 5, 6, 7, 8, 3], [2, 4, 5, 6, 7, 8, 3], [2, 4, 5, 6, 7, 8, 3], [2, 4, 5, 6, 7, 8, 3]]


In [13]:
k=10
print(" ".join([inv_cn_dict[i] for i in train_cn[k]]))  # map indices back to words with the inverse dictionary
print(" ".join([inv_en_dict[i] for i in train_en[k]])) 

BOS 走投无路 EOS
BOS sticky wicket EOS
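
The pair printed above still lines up after sorting because both languages are reordered with the same index list. A minimal sketch of that step, on made-up index sequences:

```python
# Three toy sentence pairs, already converted to indices (values are invented).
en = [[2, 8, 18, 3], [2, 29, 3], [2, 11, 12, 7, 13, 3]]
cn = [[2, 17, 3], [2, 12, 9, 13, 3], [2, 4, 5, 3]]

# Indices of the English sentences ordered from shortest to longest.
order = sorted(range(len(en)), key=lambda i: len(en[i]))

# Apply the same order to both sides so that translations stay paired.
en_sorted = [en[i] for i in order]
cn_sorted = [cn[i] for i in order]

print(order)
print(list(zip(en_sorted, cn_sorted)))
```

Sorting each language by its own length instead would silently break the pairing, which is why only the English lengths drive the order.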


# Splitting the sentences into batches

In [14]:
def get_minibatches(n, minibatch_size, shuffle=True):
    idx_list = np.arange(0, n, minibatch_size)  # batch start indices: [0, minibatch_size, 2*minibatch_size, ...]
    if shuffle:
        np.random.shuffle(idx_list)  # shuffle the order of the batches
    minibatches = []
    for idx in idx_list:
        minibatches.append(np.arange(idx, min(idx + minibatch_size, n)))
        # collect the index array of every batch in one list
    return minibatches

In [15]:
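# Split indices 0-99 into 7 groups of up to 15; the order of the groups is shuffled, but indices inside a group stay consecutive
get_minibatches(100,15)  # batch order is random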

[array([75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89]),
 array([90, 91, 92, 93, 94, 95, 96, 97, 98, 99]),
 array([45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59]),
 array([30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44]),
 array([15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29]),
 array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14]),
 array([60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74])]
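
The output above suggests two properties: every index appears exactly once, and only the batch order is random. The sketch below checks both; it restates `get_minibatches` in condensed form so that it runs on its own.

```python
import numpy as np

def get_minibatches(n, minibatch_size, shuffle=True):
    starts = np.arange(0, n, minibatch_size)   # first index of every batch
    if shuffle:
        np.random.shuffle(starts)              # shuffles the batch order only
    return [np.arange(s, min(s + minibatch_size, n)) for s in starts]

batches = get_minibatches(100, 15)

sizes = sorted(len(b) for b in batches)        # six full batches and one short one
covered = np.sort(np.concatenate(batches))     # all indices, put back in order
consecutive = all(np.all(np.diff(b) == 1) for b in batches)

print(sizes, consecutive)
```

Because the sentences were sorted by length beforehand, consecutive indices inside a batch mean similar lengths and therefore little padding.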

# Preparing the final training and evaluation data

In [22]:
def prepare_data(seqs):
    # seqs = [[2, 12, 167, 23, 114, 5, 27, 1755, 4, 3], ...
    lengths = [len(seq) for seq in seqs]  # length of every sentence in the batch
    n_samples = len(seqs)  # number of sentences in the batch
    max_len = np.max(lengths)  # longest sentence; every row is padded to this length
    x = np.zeros((n_samples, max_len)).astype('int32')
    # start from an all-zero matrix of shape (batch_size, max_len) and fill it row by row
    # note: the padding value is 0, which is UNK_IDX rather than PAD_IDX;
    # padded positions are masked out later when the loss is computed

    x_lengths = np.array(lengths).astype("int32")
    # English (source) lengths within a batch are nearly equal because the data was
    # sorted by English length; Chinese (target) lengths vary more

    for idx, seq in enumerate(seqs):
        # copy each sentence into its row; positions past its length stay 0
        x[idx, :lengths[idx]] = seq

    return x, x_lengths  # x_mask

def gen_examples(en_sentences, cn_sentences, batch_size):
    minibatches = get_minibatches(len(en_sentences), batch_size)
    all_ex = []
    for minibatch in minibatches:
        mb_en_sentences = [en_sentences[t] for t in minibatch]
        # only the order of the batches is shuffled; sentences inside a batch keep their sorted order

        mb_cn_sentences = [cn_sentences[t] for t in minibatch]
        mb_x, mb_x_len = prepare_data(mb_en_sentences)
        # mb_x has shape (batch_size, max_len); mb_x_len holds the true length of each sentence
        mb_y, mb_y_len = prepare_data(mb_cn_sentences)

        all_ex.append((mb_x, mb_x_len, mb_y, mb_y_len))
        # each tuple is one batch: (English ids, English lengths, Chinese ids, Chinese lengths)

    return all_ex

batch_size = 10
train_data = gen_examples(train_en, train_cn, batch_size)
random.shuffle(train_data)
dev_data = gen_examples(dev_en, dev_cn, batch_size)

In [26]:
# Inspect one training batch: each batch holds up to batch_size (10) sentences
train_data[5]  # arrays in order: English ids, English lengths, Chinese ids, Chinese lengths

(array([[ 2, 57, 58,  3,  0],
        [ 2, 61, 19,  3,  0],
        [ 2, 22, 23,  3,  0],
        [ 2, 27, 28,  3,  0],
        [ 2, 32,  9, 33,  3],
        [ 2, 34, 35, 36,  3],
        [ 2, 32,  9, 33,  3],
        [ 2, 34, 35, 36,  3],
        [ 2, 39,  6, 48,  3],
        [ 2, 59, 20, 60,  3]], dtype=int32),
 array([4, 4, 4, 4, 5, 5, 5, 5, 5, 5], dtype=int32),
 array([[ 2, 45,  3,  0,  0,  0],
        [ 2, 49,  3,  0,  0,  0],
        [ 2, 14, 15, 16, 10,  3],
        [ 2, 22, 23,  3,  0,  0],
        [ 2, 27, 28, 29,  3,  0],
        [ 2, 30,  3,  0,  0,  0],
        [ 2, 27, 28, 29,  3,  0],
        [ 2, 30,  3,  0,  0,  0],
        [ 2, 40, 12,  3,  0,  0],
        [ 2, 46, 34, 47, 48,  3]], dtype=int32),
 array([3, 3, 6, 4, 5, 3, 5, 3, 4, 6], dtype=int32))
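
The zeros at the end of the rows above are padding, and the length arrays are what later tells the model and the loss where each sentence really ends. The following sketch pads three made-up sequences and derives a boolean mask from the lengths; `pad_batch` is a hypothetical condensed version of `prepare_data`.

```python
import numpy as np

def pad_batch(seqs, pad_value=0):
    lengths = np.array([len(s) for s in seqs], dtype="int32")
    x = np.full((len(seqs), lengths.max()), pad_value, dtype="int32")
    for row, seq in enumerate(seqs):
        x[row, :len(seq)] = seq    # copy the real tokens, leave the tail padded
    return x, lengths

x, lengths = pad_batch([[2, 45, 3], [2, 14, 15, 16, 10, 3], [2, 22, 23, 3]])

# Compare position indices (1 x max_len) with lengths (batch x 1);
# broadcasting gives True for real tokens and False for padding.
mask = np.arange(x.shape[1])[None, :] < lengths[:, None]

print(x)
print(mask.astype(int))
```

Since the notebook pads with 0, which is also `UNK_IDX`, the ids alone cannot distinguish padding from an unknown word; the lengths (or a mask built from them) carry that information.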

# Version without attention
Below is an encoder-decoder model without attention.

In [573]:
# Define the encoder
class PlainEncoder(nn.Module):
    def __init__(self,vocab_size,hidden_size,drop_out=0.2):
        # vocab_size: vocabulary size of the source language
        # hidden_size and drop_out are architecture hyperparameters
        # e.g. for English on the full dataset: vocab_size=5493, hidden_size=100, drop_out=0.2
        super(PlainEncoder,self).__init__()

        # step 1: embedding lookup (the embedding size equals hidden_size)
        self.embed = nn.Embedding(vocab_size,hidden_size)

        # step 2: dropout on the embeddings
        self.dropout = nn.Dropout(drop_out)

        # step 3: the recurrent layer
        # batch_first=True puts the batch dimension first: (batch, seq_len, features)
        # the first argument is the input feature size, the second the hidden size; both are hidden_size here
        self.rnn = nn.GRU(hidden_size,hidden_size,batch_first=True)

    def forward(self,x,lengths):
        # x: token ids of the batch; lengths: true length of every sentence
        # the lengths are needed to read the hidden state at each sentence's last real token
        # e.g. x.shape = torch.Size([64, 10]), lengths = tensor([10, 10, 10, ..., 10, 10, 10])

        # sort the batch by length, longest first (descending=True)
        # sort returns the sorted lengths and the indices they had before sorting
        # sorted_idx = tensor([41, 40, 46, 45, ..., 19, 18, 63])
        # sorted_len = tensor([10, 10, 10, ..., 10, 10, 10])
        sorted_len, sorted_idx = lengths.sort(0, descending=True)

        # reorder the sentences to match the sorted lengths
        x_sorted = x[sorted_idx.long()]

        embedded = self.dropout(self.embed(x_sorted))
        # embedded.shape = torch.Size([64, 10, 100])

        # pack_padded_sequence handles sentences of different lengths,
        # see https://www.cnblogs.com/sbj123456789/p/9834018.html
        # Sentences are padded to a common length during preprocessing. If the padded
        # length is 100 but a sentence has only 7 tokens, we want the state after token 7,
        # not after step 100, and we do not want to run the RNN over 93 padding tokens.
        # The function needs the true length of every sentence, sorted in decreasing
        # order (the default, enforce_sorted=True), and batch_first must match the GRU.
        packed_embedded = nn.utils.rnn.pack_padded_sequence(embedded,
                                                             sorted_len.long().cpu().data.numpy(),
                                                             batch_first=True)

        packed_out,hid = self.rnn(packed_embedded)
        # hid.shape = torch.Size([1, 64, 100])

        # pad_packed_sequence turns the packed output back into a padded tensor
        out, _ = nn.utils.rnn.pad_packed_sequence(packed_out, batch_first=True)
        # out.shape = torch.Size([64, 10, 100])

        # restore the original sentence order, otherwise outputs would not line up with the inputs
        _, original_idx = sorted_idx.sort(0,descending=False)
        out = out[original_idx.long()].contiguous()
        hid = hid[:,original_idx.long()].contiguous()
        # out.shape = torch.Size([64, 10, 100])
        # hid.shape = torch.Size([1, 64, 100])

        return out,hid[[-1]]  # keep only the last layer's state (matters when num_layers > 1)
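
To make the effect of packing concrete, the sketch below runs a tiny GRU on two sequences of lengths 3 and 2. The sizes and random inputs are arbitrary choices for illustration, not values from the notebook.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
rnn = nn.GRU(input_size=4, hidden_size=3, batch_first=True)

# Two sequences, already sorted by length (3, then 2); the second has one padded step.
x = torch.randn(2, 3, 4)
lengths = torch.tensor([3, 2])

packed = nn.utils.rnn.pack_padded_sequence(x, lengths, batch_first=True)
packed_out, hid = rnn(packed)
out, out_lengths = nn.utils.rnn.pad_packed_sequence(packed_out, batch_first=True)

# Output at the last real token of each sequence (step 2 and step 1).
last_real = out[torch.arange(2), lengths - 1]

print(out.shape, hid.shape)
print(torch.allclose(hid[0], last_real))
```

The final hidden state equals the output at each sequence's last real token, and the padded step of the shorter sequence comes back as zeros, so padding never influences the state handed to the decoder.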


In [574]:
# Define the decoder
class PlainDecoder(nn.Module):
    def __init__(self,vocab_size,hidden_size,drou_out=0.2):
        super(PlainDecoder,self).__init__()
        self.embed = nn.Embedding(vocab_size,hidden_size)
        self.drouout = nn.Dropout(drou_out)

        self.rnn = nn.GRU(hidden_size,hidden_size,batch_first=True)

        # linear layer projecting hidden_size features to vocab_size scores
        self.out = nn.Linear(hidden_size,vocab_size)

    def forward(self,y,y_lengths,hid):
        # y and y_lengths are the Chinese (target) token ids and lengths
        # e.g. y.shape = torch.Size([64, 12]), hid.shape = torch.Size([1, 64, 100])

        # as in the encoder: sort, pack_padded_sequence, then pad_packed_sequence
        sorted_len, sorted_idx = y_lengths.sort(0, descending=True)
        y_sorted = y[sorted_idx.long()]
        hid = hid[:, sorted_idx.long()]  # the hidden state must be reordered the same way

        y_embedded = self.drouout(self.embed(y_sorted))
        # (batch_size, output_length, embed_size)

        packed_embedded = nn.utils.rnn.pack_padded_sequence(y_embedded,
                                                            sorted_len.long().cpu().data.numpy(),
                                                            batch_first=True)
        packed_out,hid = self.rnn(packed_embedded,hid)  # start from the encoder's hidden state
        # hid.shape = torch.Size([1, 64, 100])

        # pad_packed_sequence turns the packed output back into a padded tensor
        out,_ = nn.utils.rnn.pad_packed_sequence(packed_out,batch_first=True)

        # restore the original order so outputs line up with the input sentences
        _,original_idx = sorted_idx.sort(0,descending=False)

        output_seq =  out[original_idx.long()].contiguous()
        # output_seq.shape = torch.Size([64, 12, 100])

        hid = hid[:,original_idx.long()].contiguous()
        # hid.shape = torch.Size([1, 64, 100])

        output = F.log_softmax(self.out(output_seq),-1)
        # output.shape = torch.Size([64, 12, 3195])

        return output, hid
        

In [575]:
class PlainSeq2Seq(nn.Module):
    def __init__(self,encoder,decoder):
        # encoder: a PlainEncoder instance
        # decoder: a PlainDecoder instance
        super(PlainSeq2Seq,self).__init__()
        self.encoder = encoder
        self.decoder = decoder

    # chain the two models
    def forward(self,x,x_lengths,y,y_lengths):
        # run the encoder first; self.encoder(...) invokes PlainEncoder.forward
        # and returns its out and hid
        encoder_out, hid = self.encoder(x,x_lengths)

        decoder_out,hid = self.decoder(y,y_lengths,hid=hid)

        return decoder_out,None  # None stands in for attention weights, which this model does not have

    def translate(self,x,x_lengths,y,max_length=10):
        # x: the source sentence as integer token ids
        # x_lengths: the sentence length
        # y: the index of "BOS" (2)

        encoder_out, hid = self.encoder(x,x_lengths)
        preds = []
        batch_size = x.shape[0]
        attns = []
        for i in range(max_length):
            output, hid = self.decoder(y=y,
                    y_lengths=torch.ones(batch_size).long().to(y.device),
                    hid=hid)

            # greedy decoding: BOS is the first input; afterwards the most likely word
            # of this step becomes the input of the next step
            y = output.max(2)[1].view(batch_size, 1)
            preds.append(y)

        return torch.cat(preds,1), None
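
The loop in `translate` is greedy decoding: at every step the highest-scoring word is chosen and fed back in. The idea can be sketched without any model. Here `greedy_decode` and `toy_step` are hypothetical names, and `toy_step` is a made-up stand-in for the decoder that simply prefers the next token id.

```python
def greedy_decode(step_fn, bos, eos, max_length=10):
    """Feed each predicted token back in until EOS or max_length is reached."""
    token, state, output = bos, None, []
    for _ in range(max_length):
        scores, state = step_fn(token, state)
        # argmax over the vocabulary
        token = max(range(len(scores)), key=scores.__getitem__)
        if token == eos:
            break
        output.append(token)
    return output

def toy_step(token, state):
    # Stand-in decoder: puts all the score on (previous token + 1) mod 6.
    scores = [0.0] * 6
    scores[(token + 1) % 6] = 1.0
    return scores, state

decoded = greedy_decode(toy_step, bos=2, eos=5)
print(decoded)
```

Unlike this sketch, `translate` above always runs `max_length` steps and leaves it to the caller to cut the result at the first "EOS"; stopping early is a small variation, not what the notebook does.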


# Initializing the model

In [576]:
en_total_words

14

In [577]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
drop_out = 0.2
hidden_size = 100

# build the model with the chosen hyperparameters
encoder = PlainEncoder(vocab_size=en_total_words,
                       hidden_size=hidden_size,
                       drop_out=drop_out)
decoder = PlainDecoder(vocab_size=cn_total_words,
                       hidden_size=hidden_size,
                       drou_out=drop_out)
model = PlainSeq2Seq(encoder,decoder)


In [578]:
# Masked cross-entropy loss for the seq2seq model
class LanguageModelCriterion(nn.Module):
    def __init__(self):
        super(LanguageModelCriterion,self).__init__()

    def forward(self,input,target,mask):
        # mask marks the real tokens of every sentence (1) and the padding (0);
        # the loss at padded positions is ignored
        # target = tensor([[5,108,8,4,3,0,0,0,0,0,0,0], ...
        # mask   = tensor([[1,1,1,1,1,0,0,0,0,0,0,0], ...
        # shapes: input [64, 12, 3195], target [64, 12], mask [64, 12]

        # input: (batch_size * seq_len, vocab_size), i.e. the first two dimensions are merged
        input = input.contiguous().view(-1,input.size(-1))

        # target: (batch_size * seq_len, 1), e.g. 64 * 12 = 768 rows
        target = target.contiguous().view(-1,1)
        mask = mask.contiguous().view(-1,1)
        output = -input.gather(1, target) * mask
        # input already holds log-probabilities (F.log_softmax in the decoder), so picking
        # the target entry and negating it gives the cross-entropy per position
        # gather explained: https://blog.csdn.net/edogawachia/article/details/80515038
        # output.shape = torch.Size([768, 1])
        # multiplying by mask zeroes the loss at padded positions, where gather
        # still returns a nonzero value

        # average over the real (unmasked) tokens
        output = torch.sum(output) / torch.sum(mask)

        return output

In [579]:
# Build a randomly initialized tensor:
x = torch.rand(5,3,2)
x

tensor([[[0.5073, 0.6437],
         [0.0574, 0.6901],
         [0.6368, 0.5890]],

        [[0.1623, 0.3672],
         [0.6350, 0.0481],
         [0.4503, 0.3044]],

        [[0.1689, 0.4535],
         [0.5477, 0.9881],
         [0.9722, 0.0390]],

        [[0.8396, 0.7042],
         [0.2090, 0.3169],
         [0.0758, 0.8810]],

        [[0.6250, 0.2300],
         [0.0755, 0.1212],
         [0.6610, 0.1891]]])

In [580]:
print(x.shape)
print(x.contiguous().shape)
y = x.contiguous().view(-1,x.size(2))
print(y.shape)
y

torch.Size([5, 3, 2])
torch.Size([5, 3, 2])
torch.Size([15, 2])


tensor([[0.5073, 0.6437],
        [0.0574, 0.6901],
        [0.6368, 0.5890],
        [0.1623, 0.3672],
        [0.6350, 0.0481],
        [0.4503, 0.3044],
        [0.1689, 0.4535],
        [0.5477, 0.9881],
        [0.9722, 0.0390],
        [0.8396, 0.7042],
        [0.2090, 0.3169],
        [0.0758, 0.8810],
        [0.6250, 0.2300],
        [0.0755, 0.1212],
        [0.6610, 0.1891]])

In [581]:
gather_test = torch.rand(2,3)
gather_test

tensor([[0.9722, 0.1772, 0.9262],
        [0.3455, 0.6602, 0.6919]])

In [582]:
index_0 = torch.LongTensor([[1,1,1]])
gather_test.gather(0,index_0) # along dim 0, pick row index 1 (the second row) for every column

tensor([[0.3455, 0.6602, 0.6919]])

In [583]:
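index_1 = torch.LongTensor([[1],  # row 0: take the entry at vocabulary index 1
                            [2]] # row 1: take the entry at vocabulary index 2
                          )
-gather_test.gather(1,index_1) # pick each row's entry at the given index; when the input holds log-probabilities, the negated values are the per-position loss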

tensor([[-0.1772],
        [-0.6919]])
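
Putting `gather` and the mask together gives the masked loss used by `LanguageModelCriterion`. Here is a small worked example with hand-picked probabilities, so the result can be checked by hand; the numbers are invented for illustration.

```python
import math
import torch

# Log-probabilities for 2 positions over a 3-word vocabulary.
log_probs = torch.log(torch.tensor([[0.7, 0.2, 0.1],
                                    [0.1, 0.1, 0.8]]))
target = torch.tensor([[0], [2]])     # index of the correct word at each position
mask = torch.tensor([[1.0], [0.0]])   # the second position is padding

picked = -log_probs.gather(1, target)               # negative log-likelihood per position
loss = torch.sum(picked * mask) / torch.sum(mask)   # mean over real tokens only

print(loss.item(), -math.log(0.7))
```

Only the first position counts, so the loss equals -log(0.7); the confident prediction at the padded position has no effect on it.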

In [584]:
model.to(device)
loss_fn = LanguageModelCriterion().to(device)
optimizer = torch.optim.Adam(model.parameters())

In [585]:
def evaluate(model, data):
    model.eval()
    total_num_words = total_loss = 0.
    with torch.no_grad():  # the model is not updated during evaluation, so no gradients are needed
        for it, (mb_x, mb_x_len, mb_y, mb_y_len) in enumerate(data):
            mb_x = torch.from_numpy(mb_x).to(device).long()
            mb_x_len = torch.from_numpy(mb_x_len).to(device).long()
            mb_input = torch.from_numpy(mb_y[:, :-1]).to(device).long()
            mb_output = torch.from_numpy(mb_y[:, 1:]).to(device).long()
            mb_y_len = torch.from_numpy(mb_y_len-1).to(device).long()
            mb_y_len[mb_y_len<=0] = 1

            mb_pred, attn = model(mb_x, mb_x_len, mb_input, mb_y_len)

            mb_out_mask = torch.arange(mb_y_len.max().item(), device=device)[None, :] < mb_y_len[:, None]
            mb_out_mask = mb_out_mask.float()

            loss = loss_fn(mb_pred, mb_output, mb_out_mask)

            num_words = torch.sum(mb_y_len).item()
            total_loss += loss.item() * num_words
            total_num_words += num_words
    print("Evaluation loss", total_loss/total_num_words)
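
`evaluate` multiplies each batch's mean loss by its word count before dividing by the total word count. The sketch below, on made-up numbers, shows why this differs from simply averaging the batch losses.

```python
# (mean loss of the batch, number of target words in the batch): invented values
batches = [(2.0, 10), (1.0, 30)]

total_loss = sum(loss * n_words for loss, n_words in batches)
total_words = sum(n_words for _, n_words in batches)

per_word_loss = total_loss / total_words                       # every word weighs the same
naive_mean = sum(loss for loss, _ in batches) / len(batches)   # every batch weighs the same

print(per_word_loss, naive_mean)
```

With batches of unequal size, the naive mean over-weights the small batch; the per-word average is the quantity that stays comparable across different batch sizes.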

In [586]:
def train(model, data, num_epochs=2):
    for epoch in range(num_epochs):
        model.train()
        total_num_words = total_loss = 0.
        for it, (mb_x, mb_x_len, mb_y, mb_y_len) in enumerate(data):
            # (English batch, English lengths, Chinese batch, Chinese lengths)

            mb_x = torch.from_numpy(mb_x).to(device).long()
            mb_x_len = torch.from_numpy(mb_x_len).to(device).long()

            # teacher forcing: the first n-1 tokens are the decoder input and the last n-1
            # tokens are the targets, so each input token predicts the token that follows it
            mb_input = torch.from_numpy(mb_y[:, :-1]).to(device).long()
            mb_output = torch.from_numpy(mb_y[:, 1:]).to(device).long()
            mb_y_len = torch.from_numpy(mb_y_len-1).to(device).long()
            # input and target are both one token shorter than the full sentence

            mb_y_len[mb_y_len<=0] = 1

            mb_pred, attn = model(mb_x, mb_x_len, mb_input, mb_y_len)
            # the two return values of PlainSeq2Seq.forward

            # the mask is True at real tokens and False at padding: position indices
            # [0, 1, ..., max_len-1] (shape 1 x max_len) are compared with each sentence's
            # length (shape batch x 1), and broadcasting yields a (batch, max_len) matrix
            mb_out_mask = torch.arange(mb_y_len.max().item(), device=device)[None, :] < mb_y_len[:, None]
            # e.g. mb_out_mask = tensor([[1, 1, 1, ..., 0, 0, 0], [1, 1, 1, ..., 0, 0, 0], ...
            # it is passed to LanguageModelCriterion as the mask argument

            mb_out_mask = mb_out_mask.float()

            loss = loss_fn(mb_pred, mb_output, mb_out_mask)

            num_words = torch.sum(mb_y_len).item()
            # number of target tokens in this batch

            total_loss += loss.item() * num_words
            # loss is a per-token mean, so multiply by the token count to get the batch total

            total_num_words += num_words
            # running total of target tokens

            # update the model
            optimizer.zero_grad()
            loss.backward()
            torch.nn.utils.clip_grad_norm_(model.parameters(), 5.)
            # clip the gradient norm at 5 to guard against exploding gradients

            optimizer.step()

            if it % 100 == 0:
                print("Epoch", epoch, "iteration", it, "loss", loss.item())

        print("Epoch", epoch, "Training loss", total_loss/total_num_words)
        if epoch % 5 == 0:
            evaluate(model, dev_data)  # evaluate on the dev set
train(model, train_data, num_epochs=100)

Epoch 0 iteration 0 loss 2.1663944721221924
Epoch 0 Training loss 2.1663944721221924
Evaluation loss 2.0161991119384766
Epoch 1 iteration 0 loss 2.0190865993499756
Epoch 1 Training loss 2.0190865993499756
Epoch 2 iteration 0 loss 1.8618738651275635
Epoch 2 Training loss 1.8618738651275635
Epoch 3 iteration 0 loss 1.7167640924453735
Epoch 3 Training loss 1.7167640924453735
Epoch 4 iteration 0 loss 1.5836002826690674
Epoch 4 Training loss 1.5836002826690674
Epoch 5 iteration 0 loss 1.4538174867630005
Epoch 5 Training loss 1.4538174867630005
Evaluation loss 1.420767903327942
Epoch 6 iteration 0 loss 1.3237935304641724
Epoch 6 Training loss 1.3237935304641724
Epoch 7 iteration 0 loss 1.215800404548645
Epoch 7 Training loss 1.215800404548645
Epoch 8 iteration 0 loss 1.1033129692077637
Epoch 8 Training loss 1.1033129692077637
Epoch 9 iteration 0 loss 0.9981317520141602
Epoch 9 Training loss 0.9981317520141602
Epoch 10 iteration 0 loss 0.9030351638793945
Epoch 10 Training loss 0.9030351638793

Epoch 86 iteration 0 loss 0.008919456973671913
Epoch 86 Training loss 0.008919456973671913
Epoch 87 iteration 0 loss 0.00880465842783451
Epoch 87 Training loss 0.00880465842783451
Epoch 88 iteration 0 loss 0.008630175143480301
Epoch 88 Training loss 0.008630175143480301
Epoch 89 iteration 0 loss 0.008316387422382832
Epoch 89 Training loss 0.008316387422382832
Epoch 90 iteration 0 loss 0.008283020928502083
Epoch 90 Training loss 0.008283020928502083
Evaluation loss 0.5359717607498169
Epoch 91 iteration 0 loss 0.008138339035212994
Epoch 91 Training loss 0.008138339035212994
Epoch 92 iteration 0 loss 0.008223367854952812
Epoch 92 Training loss 0.008223367854952812
Epoch 93 iteration 0 loss 0.0080807413905859
Epoch 93 Training loss 0.0080807413905859
Epoch 94 iteration 0 loss 0.00772683834657073
Epoch 94 Training loss 0.00772683834657073
Epoch 95 iteration 0 loss 0.007726415526121855
Epoch 95 Training loss 0.007726415526121855
Evaluation loss 0.5385764241218567
Epoch 96 iteration 0 loss 0.

In [587]:
# Translate a few sentences to check the result
def translate_dev(i):
    # print the i-th dev sentence pair
    en_sent = " ".join([inv_en_dict[w] for w in dev_en[i]])
    print(en_sent)
    cn_sent = " ".join([inv_cn_dict[w] for w in dev_cn[i]])
    print("".join(cn_sent))

    mb_x = torch.from_numpy(np.array(dev_en[i]).reshape(1, -1)).long().to(device)
    # add a batch dimension and convert to a tensor

    mb_x_len = torch.from_numpy(np.array([len(dev_en[i])])).long().to(device)
    # the sentence length as a tensor

    bos = torch.Tensor([[cn_dict["BOS"]]]).long().to(device)
    # bos = tensor([[2]])

    translation, attn = model.translate(mb_x, mb_x_len, bos)
    # BOS is the first decoder input
    # e.g. translation = tensor([[ 8,  6, 11, 25, 22, 57, 10,  5,  6,  4]])

    # map the predicted indices back to words
    translation = [inv_cn_dict[i] for i in translation.data.cpu().numpy().reshape(-1)]
    trans = []
    for word in translation:
        if word != "EOS":  # keep words until the first EOS
            trans.append(word)
        else:
            break
    print("".join(trans))

for i in range(1,10):
    translate_dev(i)
    print()

BOS UNK UNK UNK . EOS
BOS UNK UNK UNK 。 EOS
日本有很多温泉。

BOS UNK UNK UNK . EOS
BOS UNK UNK UNK 。 EOS
日本有很多温泉。

BOS there are a lot of hot springs in japan . EOS
BOS 日本 有 很多 温泉 。 EOS
日本有很多温泉。

BOS there are a lot of hot springs in japan . EOS
BOS 日本 有 很多 温泉 。 EOS
日本有很多温泉。

BOS there are a lot of hot springs in japan . EOS
BOS 日本 有 很多 温泉 。 EOS
日本有很多温泉。

BOS there are a lot of hot springs in japan . EOS
BOS 日本 有 很多 温泉 。 EOS
日本有很多温泉。

BOS there are a lot of hot springs in japan . EOS
BOS 日本 有 很多 温泉 。 EOS
日本有很多温泉。

BOS there are a lot of hot springs in japan . EOS
BOS 日本 有 很多 温泉 。 EOS
日本有很多温泉。

BOS there are a lot of hot springs in japan . EOS
BOS 日本 有 很多 温泉 。 EOS
日本有很多温泉。



# Seq2seq model with attention

#### Encoder
- The encoder passes the input tokens through an embedding layer and a GRU layer, producing hidden states that later serve as context vectors

In [588]:
test_x_lengths = next(iter(dev_data))[1]
test_x_lengths = torch.from_numpy(test_x_lengths).to(device).long()
test_x_lengths

tensor([ 6,  6,  6, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12,
        12, 12])

In [589]:
test_sort_x = test_x_lengths.sort(0,descending=True)
test_sort_x

torch.return_types.sort(
values=tensor([12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12,  6,
         6,  6]),
indices=tensor([11,  9, 18, 17, 16, 15, 14, 13, 12, 19, 10,  8,  7,  6,  5,  4,  3,  1,
         2,  0]))

In [590]:
test_sort_out = test_sort_x[0] + 1
test_sort_out

tensor([13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13,  7,
         7,  7])

In [591]:
_, original_idx = test_sort_x[1].sort(0,descending=False)
original_idx

tensor([19, 17, 18, 16, 15, 14, 13, 12, 11,  1, 10,  0,  8,  7,  6,  5,  4,  3,
         2,  9])

In [592]:
# Restore the original order: index test_sort_out with original_idx
test_sort_out[original_idx.long()].contiguous()

tensor([ 7,  7,  7, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13,
        13, 13])
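
The sort-then-restore cells above rely on one fact: sorting the sort indices gives the inverse permutation. Below is a minimal sketch of the same idea in plain Python, using lists instead of tensors purely for illustration. The helper name `argsort` and the example lengths are assumptions and do not come from the notebook.

```python
def argsort(values, descending=False):
    """Return the indices that would sort `values` (stable sort)."""
    return sorted(range(len(values)), key=lambda i: values[i], reverse=descending)

lengths = [6, 12, 6, 12, 9]

# Sort by length, longest first (the order pack_padded_sequence expects by default)
sorted_idx = argsort(lengths, descending=True)
sorted_len = [lengths[i] for i in sorted_idx]

# Sorting the sort indices yields the inverse permutation
original_idx = argsort(sorted_idx)

# Indexing the sorted data with original_idx restores the input order
restored = [sorted_len[i] for i in original_idx]
print(sorted_len)
print(restored)
```

The encoder and decoder use the same two steps: anything computed in sorted order is indexed with `original_idx` before it is returned.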

In [593]:
# Trying out torch.cat
torch_cat_test = torch.rand(5,2,3)
torch_cat_test

tensor([[[0.0123, 0.9678, 0.9676],
         [0.0701, 0.3914, 0.9731]],

        [[0.9625, 0.4149, 0.7018],
         [0.9172, 0.9421, 0.8030]],

        [[0.3468, 0.2552, 0.0889],
         [0.7382, 0.8738, 0.6047]],

        [[0.0547, 0.6712, 0.3401],
         [0.5232, 0.7037, 0.6484]],

        [[0.8720, 0.0330, 0.6181],
         [0.1282, 0.9049, 0.7606]]])

In [594]:
torch_cat_test[-2]

tensor([[0.0547, 0.6712, 0.3401],
        [0.5232, 0.7037, 0.6484]])

In [595]:
torch.cat([torch_cat_test[-2],torch_cat_test[-1]],dim=0) # concatenate along dim 0: more rows

tensor([[0.0547, 0.6712, 0.3401],
        [0.5232, 0.7037, 0.6484],
        [0.8720, 0.0330, 0.6181],
        [0.1282, 0.9049, 0.7606]])

In [596]:
torch.cat([torch_cat_test[-2],torch_cat_test[-1]],dim=1) # concatenate along dim 1: more columns

tensor([[0.0547, 0.6712, 0.3401, 0.8720, 0.0330, 0.6181],
        [0.5232, 0.7037, 0.6484, 0.1282, 0.9049, 0.7606]])

# A single-layer bidirectional RNN encoder

In [597]:
class Encoder(nn.Module):
    def __init__(self,vocab_size,embed_size,encode_hidden_size,decode_hidden_size,drop_out):
        super(Encoder,self).__init__()
        # Embed the token ids
        self.embed = nn.Embedding(vocab_size,embed_size)

        # Dropout on the embeddings
        self.dropout = nn.Dropout(drop_out)

        # Bidirectional GRU
        self.rnn = nn.GRU(embed_size,encode_hidden_size,batch_first=True,
                          bidirectional=True)

        # Fully connected layer: the final hidden states of the two directions are
        # concatenated (2*encode_hidden_size) and projected to the size
        # the decoder expects as its initial hidden state
        self.fc = nn.Linear(2*encode_hidden_size,decode_hidden_size)

    def forward(self, x, lengths):
        # Sort the batch by length, longest first
        sorted_len,sorted_idx = lengths.sort(0,descending=True)
        x_sorted = x[sorted_idx.long()]

        # Embedding followed by dropout
        embedded = self.dropout(self.embed(x_sorted))

        # pack_padded_sequence lets the RNN skip the padded positions
        packed_embedded = nn.utils.rnn.pack_padded_sequence(embedded,
                                                            sorted_len.long().cpu().data.numpy(),
                                                            batch_first=True)
        # Run the RNN
        packed_out , packed_hid = self.rnn(packed_embedded)

        # pad_packed_sequence turns packed_out back into a padded tensor
        padded_out,_ = nn.utils.rnn.pad_packed_sequence(packed_out,batch_first=True)

        # Put padded_out and packed_hid back into the original batch order
        _,original_idx = sorted_idx.sort(0,descending=False)
        out = padded_out[original_idx.long()].contiguous()
        hid = packed_hid[:,original_idx.long()].contiguous()

        # Bidirectional RNN: concatenate the final hidden states of the
        # forward (hid[-2]) and backward (hid[-1]) directions
        hid = torch.cat([hid[-2],hid[-1]],dim=1)

        # Project hid to the shape the decoder needs: [1, batch_size, decode_hidden_size]
        hid = torch.tanh(self.fc(hid)).unsqueeze(0)

        return out,hid


# Luong Attention
- Computes the output from the context vectors and the current decoder hidden states.
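
For reference, the "general" scoring variant from the Luong et al. paper listed in the reading section can be written as follows. Here $h_t$ is the decoder hidden state at target step $t$ and $\bar{h}_s$ is the encoder output at source position $s$. Mapping $W_a$ to `linear_in` and $W_c$ to `linear_out` is my reading of the code in this notebook, not something the notebook states. Note that `linear_out` also carries a bias term, which the paper's formula does not.

$$
\begin{aligned}
\mathrm{score}(h_t, \bar{h}_s) &= h_t^\top W_a \bar{h}_s \\
a_t(s) &= \frac{\exp\big(\mathrm{score}(h_t, \bar{h}_s)\big)}{\sum_{s'} \exp\big(\mathrm{score}(h_t, \bar{h}_{s'})\big)} \\
c_t &= \sum_s a_t(s)\, \bar{h}_s \\
\tilde{h}_t &= \tanh\big(W_c [c_t; h_t]\big)
\end{aligned}
$$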

In [598]:
class Attention(nn.Module):
    def __init__(self,encoder_hidden_size,decode_hidden_size):
        super(Attention,self).__init__()

        self.encoder_hidden_size = encoder_hidden_size
        self.decode_hidden_size = decode_hidden_size

        # Linear map applied to the encoder outputs before scoring
        self.linear_in = nn.Linear(encoder_hidden_size*2,decode_hidden_size,bias=False)

        self.linear_out = nn.Linear(encoder_hidden_size*2+ decode_hidden_size,decode_hidden_size)

    def forward(self,output,context,mask):
        # output: batch_size,output_len,decode_hidden_size
        # context: batch_size,input_len, 2*encode_hidden_size

        batch_size = output.size(0)
        output_len = output.size(1)
        input_len = context.size(1)

        # Merge the first two dimensions of context (batch_size and input_len),
        # apply the linear layer to map encoder_hidden_size*2 to decode_hidden_size,
        # then reshape back to [batch_size,input_len,decode_hidden_size]
        context_in = self.linear_in(context.view(batch_size*input_len,-1)).view(
                        batch_size,input_len,-1) #[batch_size,input_len,decode_hidden_size]

        # context_in.transpose(1,2): batch_size,decode_hidden_size,input_len
        # output: batch_size,output_len,decode_hidden_size
        attn = torch.bmm(output,context_in.transpose(1,2)) # attention scores
        # batch_size,output_len,input_len

        # Set the scores at masked (padding) positions to a large negative number (-1e6),
        # so that softmax gives them a weight of almost zero.
        # masked_fill is not in-place, so the result has to be assigned back.
        attn = attn.masked_fill(mask.bool(),-1e6)

        # Softmax over the last dimension: the weights say how much attention
        # each output position pays to each of the input_len source positions
        attn = F.softmax(attn,dim=2)
        # [batch_size,output_len,input_len]

        context = torch.bmm(attn,context)
        # [batch_size,output_len,encode_hidden_size*2]

        output = torch.cat((context,output),dim=2)
        # batch_size,output_len,encoder_hidden_size*2+ decode_hidden_size

        # Merge the first two dimensions, then apply the linear layer
        # batch_size*output_len,encoder_hidden_size*2+ decode_hidden_size
        output = output.view(batch_size*output_len,-1)
        # tanh activation after the linear layer
        output = torch.tanh(self.linear_out(output))
        # Restore the shape to [batch_size,output_len,decode_hidden_size]
        output = output.view(batch_size,output_len,-1)

        return output,attn
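
To make the masking step concrete, here is a minimal plain-Python sketch of a masked softmax for a single decoder step. It uses lists instead of tensors so that it is self-contained. The function name `masked_softmax` and the example scores are assumptions for illustration. In PyTorch, `Tensor.masked_fill` returns a new tensor and `masked_fill_` is the in-place variant, so the filled scores only take effect if the result is kept.

```python
import math

def masked_softmax(scores, mask, fill_value=-1e6):
    """Softmax over `scores`; mask[i] == 1 marks a padding position."""
    # Replace the scores of padding positions with a large negative number
    filled = [fill_value if m else s for s, m in zip(scores, mask)]
    top = max(filled)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in filled]
    total = sum(exps)
    return [e / total for e in exps]

scores = [2.0, 1.0, 0.5, 3.0]   # attention scores for one decoder step
mask = [0, 0, 1, 1]             # the last two source positions are padding

weights = masked_softmax(scores, mask)
print(weights)
```

The padded positions end up with a weight of zero even though one of them had the highest raw score, and the remaining weights still sum to one.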
                     

# Decoder
- The decoder decides the next output word based on what has been translated so far and on the context vectors.

In [599]:
class Decoder(nn.Module):
    def __init__(self,vocab_size,embed_size,encode_hidden_size,
                 decode_hidden_size,drop_out=0.2):
        super(Decoder,self).__init__()
        # Embed the token ids
        self.embed = nn.Embedding(vocab_size,embed_size)

        # Dropout on the embeddings
        self.dropout = nn.Dropout(drop_out)

        # Unidirectional GRU: a word may only depend on the words before it
        self.rnn = nn.GRU(embed_size,decode_hidden_size,batch_first=True)

        # Attention over the encoder outputs
        self.attention = Attention(encode_hidden_size,decode_hidden_size)

        # Final linear layer maps back to vocab_size to score every word
        self.out = nn.Linear(decode_hidden_size,vocab_size)

    def create_mask(self,x_len,y_len):
        # Returns a mask of shape [batch_size, max_x_len, max_y_len];
        # an entry is 1 where either position is padding and 0 otherwise
        device = x_len.device
        max_x_len = x_len.max()
        max_y_len = y_len.max()
        x_mask = torch.arange(max_x_len, device=device)[None, :] < x_len[:, None]
        y_mask = torch.arange(max_y_len, device=device)[None, :] < y_len[:, None]
        mask = torch.logical_not(x_mask[:, :, None] * y_mask[:, None, :]).int()
        return mask

    def forward(self,ctx,ctx_lengths,y,y_lengths,hid):
        # Sort y and the encoder's final hidden state hid by target length
        sorted_len,sorted_idx = y_lengths.sort(0,descending=True)
        y_sorted = y[sorted_idx.long()]
        hid = hid[:,sorted_idx.long()]

        # Embedding followed by dropout
        y_embedded = self.dropout(self.embed(y_sorted))
        # [batch_size,output_length,embed_size]

        # Pack the padded batch
        packed_seq = nn.utils.rnn.pack_padded_sequence(y_embedded,
                                                sorted_len.long().cpu().data.numpy(),
                                                      batch_first=True)
        # Run the RNN
        packed_out,packed_hid = self.rnn(packed_seq,hid)
        # Unpack back into a padded tensor
        padded_seq,_ = nn.utils.rnn.pad_packed_sequence(packed_out,batch_first=True)

        # Restore the original batch order; use the hidden state returned by
        # the RNN (packed_hid), not the one that was passed in
        _,original_idx = sorted_idx.sort(0,descending=False)
        output_seq = padded_seq[original_idx.long()].contiguous()
        hid = packed_hid[:,original_idx.long()].contiguous()

        # Build the mask: [batch_size, output_len, input_len]
        mask = self.create_mask(y_lengths,ctx_lengths)

        # Apply attention
        output,attn = self.attention(output_seq,ctx,mask)

        # Final linear layer followed by log_softmax (convenient for computing the loss later)
        output = F.log_softmax(self.out(output),-1)

        return output,hid,attn

# Seq2Seq
- Finally, we build the Seq2Seq model, which chains the encoder, attention, and decoder together.

In [600]:
class Seq2Seq(nn.Module):
    def __init__(self,encoder,decoder):
        super(Seq2Seq,self).__init__()
        self.encoder = encoder
        self.decoder = decoder
        
    def forward(self,x,x_lengths,y,y_lengths):
        encoder_out, hid = self.encoder(x,x_lengths)
        output,hid,attn = self.decoder(ctx=encoder_out,
                                      ctx_lengths = x_lengths,
                                      y=y,
                                      y_lengths=y_lengths,
                                      hid=hid)
        return output,attn
    
    def translate(self,x,x_lengths,y,max_length=100):
        encoder_out, hid = self.encoder(x, x_lengths)
        preds = []
        batch_size = x.shape[0]
        attns = []
        for i in range(max_length):
            output, hid, attn = self.decoder(ctx=encoder_out, 
                    ctx_lengths=x_lengths,
                    y=y,
                    y_lengths=torch.ones(batch_size).long().to(y.device),
                    hid=hid)
            y = output.max(2)[1].view(batch_size, 1)
            preds.append(y)
            attns.append(attn)
        return torch.cat(preds, 1), torch.cat(attns, 1)
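
`translate` above decodes greedily: at every step it feeds the arg-max token back in as the next input. It always runs for `max_length` steps and leaves it to the caller to cut the result at EOS. A minimal plain-Python sketch of greedy decoding that stops early at EOS is shown below. `step_fn` is a hypothetical stand-in for one decoder step, and `toy_step` is made up for the demonstration. Neither is part of the notebook's model.

```python
def greedy_decode(step_fn, bos_id, eos_id, max_length=100):
    """Greedy decoding: feed back the arg-max token until EOS or max_length.

    step_fn(prev_token, state) -> (scores, new_state) is a hypothetical
    stand-in for one decoder step; scores is a list indexed by token id.
    """
    token, state, output = bos_id, None, []
    for _ in range(max_length):
        scores, state = step_fn(token, state)
        # Pick the highest-scoring token id
        token = max(range(len(scores)), key=lambda i: scores[i])
        if token == eos_id:
            break
        output.append(token)
    return output

# Toy step function: emits tokens 3, 4, 5 and then EOS (id 1)
def toy_step(prev_token, state):
    step = 0 if state is None else state
    target = [3, 4, 5, 1][min(step, 3)]
    scores = [0.0] * 6
    scores[target] = 1.0
    return scores, step + 1

print(greedy_decode(toy_step, bos_id=0, eos_id=1))
```

Stopping at EOS inside the loop avoids wasted decoder steps for a single sentence. For a batch, each sentence would need its own finished flag.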

In [601]:
dropout = 0.2
embed_size = hidden_size = 100
encoder = Encoder(vocab_size=en_total_words,
                       embed_size=embed_size,
                      encode_hidden_size=hidden_size,
                       decode_hidden_size=hidden_size,
                      drop_out=dropout)
decoder = Decoder(vocab_size=cn_total_words,
                      embed_size=embed_size,
                      encode_hidden_size=hidden_size,
                       decode_hidden_size=hidden_size,
                      drop_out=dropout)
model = Seq2Seq(encoder, decoder)
model = model.to(device)
loss_fn = LanguageModelCriterion().to(device)
optimizer = torch.optim.Adam(model.parameters())

In [602]:
train(model, train_data, num_epochs=30)

Epoch 0 iteration 0 loss 2.1646461486816406
Epoch 0 Training loss 2.1646461486816406
Evaluation loss 2.033919095993042
Epoch 1 iteration 0 loss 2.0176072120666504
Epoch 1 Training loss 2.0176072120666504
Epoch 2 iteration 0 loss 1.876351237297058
Epoch 2 Training loss 1.876351237297058
Epoch 3 iteration 0 loss 1.7437283992767334
Epoch 3 Training loss 1.7437283992767334
Epoch 4 iteration 0 loss 1.6022510528564453
Epoch 4 Training loss 1.6022510528564453
Epoch 5 iteration 0 loss 1.457313060760498
Epoch 5 Training loss 1.457313060760498
Evaluation loss 1.3850897550582886
Epoch 6 iteration 0 loss 1.3056591749191284
Epoch 6 Training loss 1.3056591749191284
Epoch 7 iteration 0 loss 1.1589945554733276
Epoch 7 Training loss 1.1589945554733276
Epoch 8 iteration 0 loss 1.0136878490447998
Epoch 8 Training loss 1.0136878490447998
Epoch 9 iteration 0 loss 0.873263418674469
Epoch 9 Training loss 0.873263418674469
Epoch 10 iteration 0 loss 0.7512374520301819
Epoch 10 Training loss 0.7512374520301819


In [603]:
testTensor1 = torch.tensor([[True, False, True, False, True, True],
        [True, False, True, True, True, True],
        [True, True, True, True, True, True]])
testTensor2 = torch.tensor([[True, True, True, True, True, True, True, True, True, True, True, True],
        [True, False, True, True, True, False, True, False, True, True, True, True],
        [True, True, True, False, True, False, True, True, True, False, True, True]])

In [604]:
testTensor1
print(testTensor1.shape)
testTensor1c = testTensor1[:, :, None] 
print(testTensor1c.shape)
print(testTensor1)
testTensor1c

torch.Size([3, 6])
torch.Size([3, 6, 1])
tensor([[ True, False,  True, False,  True,  True],
        [ True, False,  True,  True,  True,  True],
        [ True,  True,  True,  True,  True,  True]])


tensor([[[ True],
         [False],
         [ True],
         [False],
         [ True],
         [ True]],

        [[ True],
         [False],
         [ True],
         [ True],
         [ True],
         [ True]],

        [[ True],
         [ True],
         [ True],
         [ True],
         [ True],
         [ True]]])

In [605]:
testTensor2
print(testTensor2.shape)
testTensor2c = testTensor2[:, None,: ] 
print(testTensor2c.shape)
print(testTensor2)
testTensor2c

torch.Size([3, 12])
torch.Size([3, 1, 12])
tensor([[ True,  True,  True,  True,  True,  True,  True,  True,  True,  True,
          True,  True],
        [ True, False,  True,  True,  True, False,  True, False,  True,  True,
          True,  True],
        [ True,  True,  True, False,  True, False,  True,  True,  True, False,
          True,  True]])


tensor([[[ True,  True,  True,  True,  True,  True,  True,  True,  True,  True,
           True,  True]],

        [[ True, False,  True,  True,  True, False,  True, False,  True,  True,
           True,  True]],

        [[ True,  True,  True, False,  True, False,  True,  True,  True, False,
           True,  True]]])

In [606]:
testTensor1c2c = testTensor1c * testTensor2c
print(testTensor1c2c.shape)
testTensor1c2c

torch.Size([3, 6, 12])


tensor([[[ True,  True,  True,  True,  True,  True,  True,  True,  True,  True,
           True,  True],
         [False, False, False, False, False, False, False, False, False, False,
          False, False],
         [ True,  True,  True,  True,  True,  True,  True,  True,  True,  True,
           True,  True],
         [False, False, False, False, False, False, False, False, False, False,
          False, False],
         [ True,  True,  True,  True,  True,  True,  True,  True,  True,  True,
           True,  True],
         [ True,  True,  True,  True,  True,  True,  True,  True,  True,  True,
           True,  True]],

        [[ True, False,  True,  True,  True, False,  True, False,  True,  True,
           True,  True],
         [False, False, False, False, False, False, False, False, False, False,
          False, False],
         [ True, False,  True,  True,  True, False,  True, False,  True,  True,
           True,  True],
         [ True, False,  True,  True,  True, False,  

In [607]:
torch.logical_not(testTensor1c2c).int()

tensor([[[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
         [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
         [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
         [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
         [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
         [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]],

        [[0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0],
         [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
         [0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0],
         [0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0],
         [0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0],
         [0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0]],

        [[0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0],
         [0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0],
         [0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0],
         [0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0],
         [0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0],
         [0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0]]], dtype=torch.int32)

In [608]:
testTensor2c.size(-1)

12
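
The broadcasting experiments above mirror what `Decoder.create_mask` does with real sequence lengths. Below is a minimal plain-Python sketch of the same construction, with nested lists instead of tensors. The function name `length_mask` and the example lengths are assumptions for illustration.

```python
def length_mask(x_len, y_len):
    """Pure-Python analogue of a padding mask built from two length lists.

    Returns mask[b][i][j] == 1 when position i (of x) or position j (of y)
    is padding for batch item b, and 0 when both are real tokens.
    """
    max_x, max_y = max(x_len), max(y_len)
    mask = []
    for xl, yl in zip(x_len, y_len):
        x_valid = [i < xl for i in range(max_x)]
        y_valid = [j < yl for j in range(max_y)]
        # A pair (i, j) is kept only if both positions are valid
        mask.append([[int(not (xv and yv)) for yv in y_valid] for xv in x_valid])
    return mask

mask = length_mask(x_len=[2, 1], y_len=[3, 2])
for row in mask[1]:
    print(row)
```

The first batch item has full-length sequences, so its mask is all zeros. The second one masks every pair that touches a padded position.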