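# DeepPavlov Sequence-to-Sequence Model Tutorial
In this tutorial we implement a sequence-to-sequence (seq2seq) model in DeepPavlov (paper: https://arxiv.org/abs/1409.3215).  
  
A sequence-to-sequence model maps an input sequence to a target sequence and consists of two main parts, an encoder and a decoder. The encoder encodes the input sequence into a dense representation, and the decoder uses that dense representation to generate the target sequence.  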
  
![](https://github.com/deepmipt/DeepPavlov/raw/c7896c6db96f43f57cacd9a6a471e37cb70bf07a/examples/tutorials/img/seq2seq.png)  
  
In the figure above, the input sequence is ABC, and the special token `<EOS>` (end of sequence) signals the decoder to start generating the target sequence WXYZ.  
  
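To implement this model in DeepPavlov, we need to write code for several DeepPavlov abstractions:  
- DatasetReader: reads the data
- DatasetIterator: generates batches
- Vocabulary: converts words to indices
- Model: trains and runs the model
- and a few other components for pre- and postprocessing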

In [1]:
%load_ext autoreload
%autoreload 2

import deeppavlov
import json
import numpy as np
import tensorflow as tf

from itertools import chain
from pathlib import Path

## Download and extract the dataset

In [2]:
from deeppavlov.core.data.utils import download_decompress
download_decompress('http://files.deeppavlov.ai/datasets/personachat_v2.tar.gz', './personachat')

2018-08-23 16:26:27.237 DEBUG in 'urllib3.connectionpool'['connectionpool'] at line 205: Starting new HTTP connection (1): files.deeppavlov.ai:80
2018-08-23 16:26:27.907 DEBUG in 'urllib3.connectionpool'['connectionpool'] at line 393: http://files.deeppavlov.ai:80 "GET /datasets/personachat_v2.tar.gz HTTP/1.1" 200 223217972
2018-08-23 16:26:27.915 INFO in 'deeppavlov.core.data.utils'['utils'] at line 65: Downloading from http://files.deeppavlov.ai/datasets/personachat_v2.tar.gz to /home/jiahang/Jiahang_Jupyter_Note/personachat/personachat_v2.tar.gz
100%|██████████| 223M/223M [03:10<00:00, 1.17MB/s]   
2018-08-23 16:29:38.356 INFO in 'deeppavlov.core.data.utils'['utils'] at line 149: Extracting personachat/personachat_v2.tar.gz archive into personachat


## DatasetReader
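A DatasetReader reads and parses data from files.  
  
Here we define a new class, PersonaChatDatasetReader, that reads the PersonaChat dataset.  
  
The PersonaChat dataset consists of dialogs and user personas.  
  
Each user persona is described by a few short sentences, for example:  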
~~~
i like to remodel homes.
i like to go hunting.
i like to shoot a bow.
my favorite holiday is halloween.
~~~

In [3]:
from deeppavlov.core.commands.train import build_model_from_config
from deeppavlov.core.data.dataset_reader import DatasetReader
from deeppavlov.core.data.utils import download_decompress
from deeppavlov.core.common.registry import register

@register('personachat_dataset_reader')
class PersonaChatDatasetReader(DatasetReader):
    """
    PersonaChat dataset from
    Zhang S. et al. Personalizing Dialogue Agents: I have a dog, do you have pets too?
    https://arxiv.org/abs/1801.07243
    Also, this dataset is used in ConvAI2 http://convai.io/
    This class reads dataset to the following format:
    [{
        'persona': [list of persona sentences],
        'x': input utterance,
        'y': output utterance,
        'dialog_history': list of previous utterances
        'candidates': [list of candidate utterances]
        'y_idx': index of y utt in candidates list
      },
       ...
    ]
    """
    def read(self, dir_path: str, mode='self_original'):
        dir_path = Path(dir_path)
        dataset = {}
        for dt in ['train', 'valid', 'test']:
            dataset[dt] = self._parse_data(dir_path / '{}_{}.txt'.format(dt, mode))

        return dataset

    @staticmethod
    def _parse_data(filename):
        examples = []
        print(filename)
        curr_persona = []
        curr_dialog_history = []
        persona_done = False
        with filename.open('r') as fin:
            for line in fin:
                line = ' '.join(line.strip().split(' ')[1:])
                your_persona_pref = 'your persona: '
                if line[:len(your_persona_pref)] == your_persona_pref and persona_done:
                    curr_persona = [line[len(your_persona_pref):]]
                    curr_dialog_history = []
                    persona_done = False
                elif line[:len(your_persona_pref)] == your_persona_pref:
                    curr_persona.append(line[len(your_persona_pref):])
                else:
                    persona_done = True
                    x, y, _, candidates = line.split('\t')
                    candidates = candidates.split('|')
                    example = {
                        'persona': curr_persona,
                        'x': x,
                        'y': y,
                        'dialog_history': curr_dialog_history[:],
                        'candidates': candidates,
                        'y_idx': candidates.index(y)
                    }
                    curr_dialog_history.extend([x, y])
                    examples.append(example)

        return examples
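
To make the file layout that `_parse_data` expects more concrete, the sketch below applies the same splitting steps to two made-up lines. The sample lines and the helper name `strip_line_number` are assumptions for illustration only; they are not taken from the dataset files or from DeepPavlov.

```python
# Two made-up lines in the layout the reader expects (illustrative assumption):
# a leading line number, then either a persona sentence or tab-separated fields.
persona_line = "1 your persona: i like to go hunting."
dialog_line = "2 hi , how are you ?\ti am fine .\t\ti am fine .|no . i live near farms ."


def strip_line_number(line):
    # Drop the first space-separated field (the line number), as the reader does.
    return ' '.join(line.strip().split(' ')[1:])


prefix = 'your persona: '
persona = strip_line_number(persona_line)[len(prefix):]

# Fields: input utterance, reply, an unused field, '|'-separated candidates.
x, y, _, candidates = strip_line_number(dialog_line).split('\t')
candidates = candidates.split('|')

print(persona)               # i like to go hunting.
print(x, '->', y)            # hi , how are you ? -> i am fine .
print(candidates.index(y))   # 0
```

The true reply `y` is expected to appear among the candidates, which is why the reader can store its position as `y_idx`.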

In [4]:
data = PersonaChatDatasetReader().read('./personachat')

personachat/train_self_original.txt
personachat/valid_self_original.txt
personachat/test_self_original.txt


Let's look at the size of each split:

In [5]:
for k in data:
    print(k, len(data[k]))

train 65719
valid 7801
test 7512


In [6]:
data['train'][0]

{'persona': ['i like to remodel homes.',
  'i like to go hunting.',
  'i like to shoot a bow.',
  'my favorite holiday is halloween.'],
 'x': 'hi , how are you doing ? i am getting ready to do some cheetah chasing to stay in shape .',
 'y': 'you must be very fast . hunting is one of my favorite hobbies .',
 'dialog_history': [],
 'candidates': ['my mom was single with 3 boys , so we never left the projects .',
  'i try to wear all black every day . it makes me feel comfortable .',
  'well nursing stresses you out so i wish luck with sister',
  'yeah just want to pick up nba nfl getting old',
  'i really like celine dion . what about you ?',
  'no . i live near farms .',
  'i wish i had a daughter , i am a boy mom . they are beautiful boys though still lucky',
  'yeah when i get bored i play gone with the wind my favorite movie .',
  'hi how are you ? i am eating dinner with my hubby and 2 kids .',
  'were you married to your high school sweetheart ? i was .',
  'that is great to hear !

## Dataset iterator
A dataset iterator generates batches from the data parsed by the DatasetReader.  
  
Let's extract only x and y from the parsed dataset and use them to predict sentence y from sentence x.

In [7]:
from deeppavlov.core.data.data_learning_iterator import DataLearningIterator

@register('personachat_iterator')
class PersonaChatIterator(DataLearningIterator):
    def split(self, *args, **kwargs):
        for dt in ['train', 'valid', 'test']:
            setattr(self, dt, self._to_tuple(getattr(self, dt)))

    @staticmethod
    def _to_tuple(data):
        """
        Returns:
            list of (x, y)
        """
        return list(map(lambda x: (x['x'], x['y']), data))
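
The following is a simplified stand-in for batch generation, not DeepPavlov's implementation. The helper name `toy_gen_batches` and its parameters are hypothetical; the sketch only illustrates the general idea of optionally shuffling (x, y) pairs and slicing them into batches whose x and y parts are returned separately.

```python
import random


def toy_gen_batches(pairs, batch_size, shuffle=True, seed=1337):
    """Yield (xs, ys) tuples with at most batch_size items each."""
    order = list(range(len(pairs)))
    if shuffle:
        random.Random(seed).shuffle(order)
    for start in range(0, len(order), batch_size):
        chunk = [pairs[i] for i in order[start:start + batch_size]]
        xs, ys = zip(*chunk)  # split the pairs into parallel tuples
        yield xs, ys


pairs = [('x{}'.format(i), 'y{}'.format(i)) for i in range(5)]
batches = list(toy_gen_batches(pairs, batch_size=2, shuffle=False))
for xs, ys in batches:
    print(xs, ys)
```

With five pairs and a batch size of two, the last batch holds a single pair.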

Let's look at the data in a batch:

In [8]:
iterator = PersonaChatIterator(data)
batch = [el for el in iterator.gen_batches(5, 'train')][0]
for x, y in zip(*batch):
    print('x:', x)
    print('y:', y)
    print('----------')

x: i think i will look into it .
y: that would be good to do . you should try to look it up on youtube
----------
x: i live in new orleans .
y: what do you do ? i have a toothpick business ,
----------
x: i am blue , because i was born male , then transitioned to female 3 years ago .
y: that is good that you transitioned why would you feel blue ?
----------
x: after school hard to find a job in the city your town sounds fun
y: it is nice . i can enjoy the lake and a few books every weekend .
----------
x: very exact of you ! the hair was reddish so i think it was my own . lol
y: oh my favorite color , red ! i would love your hair
----------


## Tokenizer
A tokenizer splits an utterance into tokens.

In [9]:
from deeppavlov.models.preprocessors.lazy_tokenizer import LazyTokenizer
tokenizer = LazyTokenizer()
tokenizer(['Hello my friend'])

LookupError: 
**********************************************************************
  Resource punkt not found.
  Please use the NLTK Downloader to obtain the resource:

  >>> import nltk
  >>> nltk.download('punkt')
  Searched in:
    - '/home/jiahang/nltk_data'
    - '/usr/share/nltk_data'
    - '/usr/local/share/nltk_data'
    - '/usr/lib/nltk_data'
    - '/usr/local/lib/nltk_data'
    - '/home/jiahang/env/nltk_data'
    - '/home/jiahang/env/lib/nltk_data'
    - ''
**********************************************************************


## Vocabulary
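A vocabulary builds a mapping from tokens to token indices, using the data in the 'train' split.  
  
We implement a DialogVocab class (inheriting from SimpleVocabulary) that adds all tokens from both the x and y utterances to the vocabulary.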

In [None]:
from deeppavlov.core.data.simple_vocab import SimpleVocabulary

@register('dialog_vocab')
class DialogVocab(SimpleVocabulary):
    def fit(self, *args):
        tokens = chain(*args)
        super().fit(tokens)

    def __call__(self, batch, **kwargs):
        indices_batch = []
        for utt in batch:
            tokens = [self[token] for token in utt]
            indices_batch.append(tokens)
        return indices_batch

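Let's create a DialogVocab instance, specifying the save and load paths, the minimum frequency a token needs in order to be added to the vocabulary, and the set of special tokens.  
  
The special tokens are:
- `<PAD>` - padding
- `<BOS>` - beginning of sequence
- `<EOS>` - end of sequence
- `<UNK>` - unknown token (a token that is not in the vocabulary)  
  
Then we fit the vocabulary on the tokens extracted from x and y.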
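
As a rough illustration of what `min_freq`, the special tokens, and the unknown token do, here is a pure-Python sketch. It is not the SimpleVocabulary implementation: `build_vocab` and `encode` are hypothetical helpers, and the sketch assumes that special tokens receive the first indices, in the order given.

```python
from collections import Counter
from itertools import chain


def build_vocab(token_lists, min_freq=2,
                special_tokens=('<PAD>', '<BOS>', '<EOS>', '<UNK>')):
    freqs = Counter(chain(*token_lists))
    # Assumption: special tokens come first, so <PAD>=0, <BOS>=1, <EOS>=2, <UNK>=3.
    idx2tok = list(special_tokens)
    idx2tok += [tok for tok, n in freqs.most_common()
                if n >= min_freq and tok not in special_tokens]
    tok2idx = {tok: i for i, tok in enumerate(idx2tok)}
    return tok2idx, idx2tok


def encode(tokens, tok2idx, unk_token='<UNK>'):
    # Tokens missing from the vocabulary map to the index of the unknown token.
    return [tok2idx.get(tok, tok2idx[unk_token]) for tok in tokens]


utterances = [['hello', 'my', 'friend'], ['hello', 'there']]
tok2idx, idx2tok = build_vocab(utterances)
encoded = encode(['<BOS>', 'hello', 'stranger', '<EOS>'], tok2idx)
print(idx2tok)   # ['<PAD>', '<BOS>', '<EOS>', '<UNK>', 'hello']
print(encoded)   # [1, 4, 3, 2]
```

Only `hello` occurs at least twice, so every other word is treated as rare and encoded as `<UNK>`.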

In [None]:
vocab = DialogVocab(
    save_path='./vocab.dict',
    load_path='./vocab.dict',
    min_freq=2,
    special_tokens=('<PAD>','<BOS>', '<EOS>', '<UNK>',),
    unk_token='<UNK>'
)

vocab.fit(tokenizer(iterator.get_instances(data_type='train')[0]), tokenizer(iterator.get_instances(data_type='train')[1]))
vocab.save()

The 10 most common tokens in the train split:

In [None]:
vocab.freqs.most_common(10)

Number of tokens in the vocabulary:

In [None]:
len(vocab)

Let's use the built vocabulary to encode a tokenized sentence.

In [None]:
vocab([['<BOS>', 'hello', 'my', 'friend', 'there_is_no_such_word_in_dataset', 'and_this', '<EOS>', '<PAD>']])

## Padding
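To feed sequences of token indices to a neural model, we have to make them all the same length.  
  
If a sequence is too short, we append `<PAD>` tokens to its end.  
If a sequence is too long, we truncate it.  
  
The SentencePadder class implements this; it also adds the `<BOS>` and `<EOS>` tokens.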

In [None]:
from deeppavlov.core.models.component import Component

@register('sentence_padder')
class SentencePadder(Component):
    def __init__(self, length_limit, pad_token_id=0, start_token_id=1, end_token_id=2, *args, **kwargs):
        self.length_limit = length_limit
        self.pad_token_id = pad_token_id
        self.start_token_id = start_token_id
        self.end_token_id = end_token_id

    def __call__(self, batch):
        for i in range(len(batch)):
            batch[i] = batch[i][:self.length_limit]
            batch[i] = [self.start_token_id] + batch[i] + [self.end_token_id]
            batch[i] += [self.pad_token_id] * (self.length_limit + 2 - len(batch[i]))
        return batch

In [None]:
padder = SentencePadder(length_limit=6)
vocab(padder(vocab([['hello', 'my', 'friend', 'there_is_no_such_word_in_dataset', 'and_this']])))
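
Since the cell above has no recorded output, the following standalone sketch repeats the padding logic on plain integer ids. The function name `pad_ids` and the example ids are made up for illustration; the default ids mirror the SentencePadder defaults (0 for padding, 1 for start, 2 for end).

```python
def pad_ids(ids, length_limit, pad_id=0, start_id=1, end_id=2):
    ids = ids[:length_limit]             # truncate sequences that are too long
    ids = [start_id] + ids + [end_id]    # wrap with start and end markers
    # Pad up to length_limit + 2, the fixed output length.
    return ids + [pad_id] * (length_limit + 2 - len(ids))


short = pad_ids([7, 8, 9], length_limit=6)
long_ = pad_ids(list(range(10, 20)), length_limit=6)
print(short)  # [1, 7, 8, 9, 2, 0, 0, 0]
print(long_)  # [1, 10, 11, 12, 13, 14, 15, 2]
```

Every output has length `length_limit + 2`, which is why the config later sets the model's `max_length` relative to the padder's `length_limit`.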

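## Seq2Seq model
The model consists of two main components: an encoder and a decoder.  
  
We can implement them independently and then combine them in a single Seq2Seq model.
  
### Encoder
The encoder builds a hidden representation of the input sequence.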

In [None]:
def encoder(inputs, inputs_len, embedding_matrix, cell_size, keep_prob=1.0):
    # inputs: tf.int32 tensor with shape bs x seq_len with token ids
    # inputs_len: tf.int32 tensor with shape bs
    # embedding_matrix: tf.float32 tensor with shape vocab_size x vocab_dim
    # cell_size: hidden size of recurrent cell
    # keep_prob: dropout keep probability
    with tf.variable_scope('encoder'):
        # first of all we should embed every token in input sequence (use tf.nn.embedding_lookup, don't forget about dropout)
        x_emb = tf.nn.dropout(tf.nn.embedding_lookup(embedding_matrix, inputs), keep_prob=keep_prob)
        
        # define recurrent cell (LSTM or GRU)
        encoder_cell = tf.nn.rnn_cell.GRUCell(
                            num_units=cell_size,
                            kernel_initializer=tf.contrib.layers.xavier_initializer(),
                            name='encoder_cell')
        
        # use tf.nn.dynamic_rnn to encode input sequence, use actual length of input sequence
        encoder_outputs, encoder_state = tf.nn.dynamic_rnn(cell=encoder_cell, inputs=x_emb, sequence_length=inputs_len, dtype=tf.float32)
    return encoder_outputs, encoder_state


Check the encoder implementation:  
The next cell should output tensors of shape 32 x 10 x 100 and 32 x 100.

In [None]:
tf.reset_default_graph()
vocab_size = 100
hidden_dim = 100
inputs = tf.cast(tf.random_uniform(shape=[32, 10]) * vocab_size, tf.int32) # bs x seq_len
mask = tf.cast(tf.random_uniform(shape=[32, 10]) * 2, tf.int32) # bs x seq_len
inputs_len = tf.reduce_sum(mask, axis=1)
embedding_matrix = tf.random_uniform(shape=[vocab_size, hidden_dim])

encoder(inputs, inputs_len, embedding_matrix, hidden_dim)

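### Decoder
The decoder uses the encoder outputs and the encoder state to produce the output sequence.  
  
Here, what you need to do is:
- define your decoder cell, decoder_cell (GRU or LSTM)
  
This will be your baseline seq2seq model.  
  
Then, to improve the model:
- add teacher forcing (commonly used when training decoders for machine translation and abstractive summarization)
- add an attention mechanism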

In [None]:
def decoder(encoder_outputs, encoder_state, embedding_matrix, mask,
            cell_size, max_length, y_ph,
            start_token_id=1, keep_prob=1.0,
            teacher_forcing_rate_ph=None,
            use_attention=False, is_train=True):
    # decoder
    # encoder_outputs: tf.float32 tensor with shape bs x seq_len x encoder_cell_size
    # encoder_state: tf.float32 tensor with shape bs x encoder_cell_size
    # embedding_matrix: tf.float32 tensor with shape vocab_size x vocab_dim
    # mask: tf.int32 tensor with shape bs x seq_len with zeros for masked sequence elements
    # cell_size: hidden size of recurrent cell
    # max_length: max length of output sequence
    # start_token_id: id of <BOS> token in vocabulary
    # keep_prob: dropout keep probability
    # teacher_forcing_rate_ph: rate of using teacher forcing on each decoding step
    # use_attention: use attention on encoder outputs or use only encoder_state
    # is_train: is it training or inference? at inference time we can't use teacher forcing
    with tf.variable_scope('decoder'):
        # define decoder recurrent cell
        decoder_cell = tf.nn.rnn_cell.GRUCell(
                            num_units=cell_size,
                            kernel_initializer=tf.contrib.layers.xavier_initializer(),
                            name='decoder_cell')
        
        # initial value of output_token on the previous step is start_token
        output_token = tf.ones(shape=(tf.shape(encoder_outputs)[0],), dtype=tf.int32) * start_token_id
        # let's define initial value of decoder state with encoder_state
        decoder_state = encoder_state

        pred_tokens = []
        logits = []

        # use for loop to sequentially call recurrent cell
        for i in range(max_length):
            """
            TEACHER FORCING
            # here you can try to implement teacher forcing for your model
            # details about teacher forcing are explained further in tutorial
            
            # pseudo code:
            NOTE THAT FOLLOWING CONDITIONS SHOULD BE EVALUATED AT GRAPH RUNTIME
            use tf.cond and tf.logical operations instead of python if
            
            if i > 0 and is_train and random_value < teacher_forcing_rate_ph:
                input_token = y_ph[:, i-1]
            else:
                input_token = output_token

            input_token_emb = tf.nn.embedding_lookup(embedding_matrix, input_token)
            
            """
            if i > 0:
                input_token_emb = tf.cond(
                                      tf.logical_and(
                                          is_train,
                                          tf.random_uniform(shape=(), maxval=1) <= teacher_forcing_rate_ph
                                      ),
                                      lambda: tf.nn.embedding_lookup(embedding_matrix, y_ph[:, i-1]), # teacher forcing
                                      lambda: tf.nn.embedding_lookup(embedding_matrix, output_token)
                                      )
            else:
                input_token_emb = tf.nn.embedding_lookup(embedding_matrix, output_token)

            """
            ATTENTION MECHANISM
            # here you can add attention to your model
            # you can find details about attention further in tutorial
            """            
            if use_attention:
                # compute attention and concat attention vector to input_token_emb
                att = dot_attention(encoder_outputs, decoder_state, mask, scope='att')
                input_token_emb = tf.concat([input_token_emb, att], axis=-1)


            input_token_emb = tf.nn.dropout(input_token_emb, keep_prob=keep_prob)
            # call recurrent cell
            decoder_outputs, decoder_state = decoder_cell(input_token_emb, decoder_state)
            decoder_outputs = tf.nn.dropout(decoder_outputs, keep_prob=keep_prob)
            # project decoder output to embeddings dimension
            embeddings_dim = embedding_matrix.get_shape()[1]
            output_proj = tf.layers.dense(decoder_outputs, embeddings_dim, activation=tf.nn.tanh,
                                          kernel_initializer=tf.contrib.layers.xavier_initializer(),
                                          name='proj', reuse=tf.AUTO_REUSE)
            # compute logits
            output_logits = tf.matmul(output_proj, embedding_matrix, transpose_b=True)

            logits.append(output_logits)
            output_probs = tf.nn.softmax(output_logits)
            output_token = tf.argmax(output_probs, axis=-1)
            pred_tokens.append(output_token)

        y_pred_tokens = tf.transpose(tf.stack(pred_tokens, axis=0), [1, 0])
        y_logits = tf.transpose(tf.stack(logits, axis=0), [1, 0, 2])
    return y_pred_tokens, y_logits

The outputs of the next cell should have shapes:
~~~
32 x 10
32 x 10 x 100
~~~

In [None]:
tf.reset_default_graph()
vocab_size = 100
hidden_dim = 100
inputs = tf.cast(tf.random_uniform(shape=[32, 10]) * vocab_size, tf.int32) # bs x seq_len
mask = tf.cast(tf.random_uniform(shape=[32, 10]) * 2, tf.int32) # bs x seq_len
inputs_len = tf.reduce_sum(mask, axis=1)
embedding_matrix = tf.random_uniform(shape=[vocab_size, hidden_dim])

teacher_forcing_rate = tf.random_uniform(shape=())
y = tf.cast(tf.random_uniform(shape=[32, 10]) * vocab_size, tf.int32)

encoder_outputs, encoder_state = encoder(inputs, inputs_len, embedding_matrix, hidden_dim)
decoder(encoder_outputs, encoder_state, embedding_matrix, mask, hidden_dim, max_length=10,
        y_ph=y, teacher_forcing_rate_ph=teacher_forcing_rate)

### Model
The Seq2Seq model inherits from the TFModel class and implements the following methods:
- train_on_batch - called during training
- \_\_call\_\_ - called to make predictions


In [None]:
from deeppavlov.core.models.tf_model import TFModel

@register('seq2seq')
class Seq2Seq(TFModel):
    def __init__(self, **kwargs):
        # hyperparameters
        
        # dimension of word embeddings
        self.embeddings_dim = kwargs.get('embeddings_dim', 100)
        # size of recurrent cell in encoder and decoder
        self.cell_size = kwargs.get('cell_size', 200)
        # dropout keep_probability
        self.keep_prob = kwargs.get('keep_prob', 0.8)
        # learning rate
        self.learning_rate = kwargs.get('learning_rate', 3e-04)
        # max length of output sequence
        self.max_length = kwargs.get('max_length', 20)
        self.grad_clip = kwargs.get('grad_clip', 5.0)
        self.start_token_id = kwargs.get('start_token_id', 1)
        self.vocab_size = kwargs.get('vocab_size', 11595)
        self.teacher_forcing_rate = kwargs.get('teacher_forcing_rate', 0.0)
        self.use_attention = kwargs.get('use_attention', False)
        
        # create tensorflow session to run computational graph in it
        self.sess_config = tf.ConfigProto(allow_soft_placement=True)
        self.sess_config.gpu_options.allow_growth = True
        self.sess = tf.Session(config=self.sess_config)
        
        self.init_graph()
        
        # define train op
        self.train_op = self.get_train_op(self.loss, self.lr_ph,
                                          optimizer=tf.train.AdamOptimizer,
                                          clip_norm=self.grad_clip)
        # initialize graph variables
        self.sess.run(tf.global_variables_initializer())
        
        super().__init__(**kwargs)
        # load saved model if there is one
        if self.load_path is not None:
            self.load()
        
    def init_graph(self):
        # create placeholders
        self.init_placeholders()

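        # 0/1 masks: token id 0 is assumed to be <PAD>, any other id marks a real token
        self.x_mask = tf.cast(tf.not_equal(self.x_ph, 0), tf.int32)
        self.y_mask = tf.cast(tf.not_equal(self.y_ph, 0), tf.int32)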
        
        self.x_len = tf.reduce_sum(self.x_mask, axis=1)
        
        # create embeddings matrix for tokens
        self.embeddings = tf.Variable(tf.random_uniform((self.vocab_size, self.embeddings_dim), -0.1, 0.1, name='embeddings'), dtype=tf.float32)

        # encoder
        encoder_outputs, encoder_state = encoder(self.x_ph, self.x_len, self.embeddings, self.cell_size, self.keep_prob_ph)

        # decoder
        self.y_pred_tokens, y_logits = decoder(encoder_outputs, encoder_state, self.embeddings, self.x_mask,
                                                      self.cell_size, self.max_length,
                                                      self.y_ph, self.start_token_id, self.keep_prob_ph,
                                                      self.teacher_forcing_rate_ph, self.use_attention, self.is_train_ph)
        
        # loss
        self.y_ohe = tf.one_hot(self.y_ph, depth=self.vocab_size)
        self.y_mask = tf.cast(self.y_mask, tf.float32)
        self.loss = tf.nn.softmax_cross_entropy_with_logits(labels=self.y_ohe, logits=y_logits) * self.y_mask
        self.loss = tf.reduce_sum(self.loss) / tf.reduce_sum(self.y_mask)
    
    def init_placeholders(self):
        # placeholders for inputs
        self.x_ph = tf.placeholder(shape=(None, None), dtype=tf.int32, name='x_ph')
        # at inference time y_ph is used (y_ph exists in computational graph)  when teacher forcing is activated, so we add dummy default value
        # this dummy value is not actually used at inference
        self.y_ph = tf.placeholder_with_default(tf.zeros_like(self.x_ph), shape=(None,None), name='y_ph')

        # placeholders for model parameters
        self.lr_ph = tf.placeholder(dtype=tf.float32, shape=[], name='lr_ph')
        self.keep_prob_ph = tf.placeholder_with_default(1.0, shape=[], name='keep_prob_ph')
        self.is_train_ph = tf.placeholder_with_default(False, shape=[], name='is_train_ph')
        self.teacher_forcing_rate_ph = tf.placeholder_with_default(0.0, shape=[], name='teacher_forcing_rate_ph')
            
    def _build_feed_dict(self, x, y=None):
        feed_dict = {
            self.x_ph: x,
        }
        if y is not None:
            feed_dict.update({
                self.y_ph: y,
                self.lr_ph: self.learning_rate,
                self.keep_prob_ph: self.keep_prob,
                self.is_train_ph: True,
                self.teacher_forcing_rate_ph: self.teacher_forcing_rate,
            })
        return feed_dict
    
    def train_on_batch(self, x, y):
        feed_dict = self._build_feed_dict(x, y)
        loss, _ = self.sess.run([self.loss, self.train_op], feed_dict=feed_dict)
        return loss
    
    def __call__(self, x):
        feed_dict = self._build_feed_dict(x)
        y_pred = self.sess.run(self.y_pred_tokens, feed_dict=feed_dict)
        return y_pred
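
The loss in `init_graph` is meant to average the per-token cross-entropy over the non-padding positions of `y`. The sketch below shows that computation in pure Python for a single sequence. It is an illustration, not the TensorFlow code: the helper name `masked_cross_entropy` is hypothetical, and it assumes that `<PAD>` has id 0.

```python
import math


def masked_cross_entropy(logits, targets, pad_id=0):
    """logits: one list of scores per step; targets: one token id per step."""
    total, count = 0.0, 0
    for step_logits, target in zip(logits, targets):
        if target == pad_id:
            continue  # padded positions do not contribute to the loss
        top = max(step_logits)
        # log of the softmax normalizer, computed in a numerically stable way
        log_z = top + math.log(sum(math.exp(v - top) for v in step_logits))
        total += log_z - step_logits[target]
        count += 1
    return total / count


logits = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0], [5.0, 0.0, 0.0]]
targets = [1, 2, 0]  # the last position is padding
loss = masked_cross_entropy(logits, targets)
print(loss)
```

The two real positions have uniform scores over three classes, so the loss equals log 3; the padded third step is ignored regardless of its scores.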

Let's create the model with random weights and default parameters. Change the model path; otherwise the model will be stored in the deeppavlov/download folder:

In [None]:
s2s = Seq2Seq(
    save_path='PATH_TO_YOUR_WORKING_DIR/model',
    load_path='PATH_TO_YOUR_WORKING_DIR/model'
)

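Here we first run all preprocessing steps, then call the seq2seq model, and finally convert the token indices back to tokens. Since the weights are random, we get a random sequence of words.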

In [None]:
vocab(s2s(padder(vocab([['hello', 'my', 'friend', 'there_is_no_such_word_in_dataset', 'and_this']]))))

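### Attention mechanism
The attention mechanism (paper: https://arxiv.org/abs/1409.0473) gathers information from a "memory" depending on the current state. The aggregation is a weighted sum of the memory items, where the weight of each item depends on the current state.  
  
Without attention, the decoder can use only the last hidden state of the encoder; attention gives it access to all encoder states during decoding.  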
  
![](https://github.com/deepmipt/DeepPavlov/raw/c7896c6db96f43f57cacd9a6a471e37cb70bf07a/examples/tutorials/img/attention.png)

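One of the simplest ways to compute the attention weights (a_ij) is to take the dot product between each memory item and the state and then apply the softmax function. 
There are other ways to compute multiplicative attention weights (paper: https://arxiv.org/abs/1508.04025).  
  
We also need a mask to skip some sequence elements, such as `<PAD>`. To push the weights of unwanted memory items close to zero, we add a large negative value to their logits (the dot products) before applying the softmax function.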

In [None]:
def softmax_mask(values, mask):
    # adds big negative to masked values
    INF = 1e30
    return -INF * (1 - tf.cast(mask, tf.float32)) + values

In [None]:
def dot_attention(memory, state, mask, scope="dot_attention"):
    # memory: bs x seq_len x hidden_dim
    # state: bs x hidden_dim
    # mask: bs x seq_len
    with tf.variable_scope(scope, reuse=tf.AUTO_REUSE):
        # dot product between each item in memory and state
        logits = tf.matmul(memory, tf.expand_dims(state, axis=1), transpose_b=True)
        logits = tf.squeeze(logits, [2])
        
        # apply mask to logits
        logits = softmax_mask(logits, mask)
        
        # apply softmax to logits
        att_weights = tf.expand_dims(tf.nn.softmax(logits), axis=2)
        
        # compute weighted sum of items in memory
        att = tf.reduce_sum(att_weights * memory, axis=1)
        return att

Check your implementation:  
The output should have shape 32 x 100.

In [None]:
tf.reset_default_graph()
memory = tf.random_normal(shape=[32, 10, 100]) # bs x seq_len x hidden_dim
state = tf.random_normal(shape=[32, 100]) # bs x hidden_dim
mask = tf.cast(tf.random_uniform(shape=[32, 10]) * 2, tf.int32) # bs x seq_len, values in {0, 1}
dot_attention(memory, state, mask)
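
To see the numbers behind masked dot-product attention, here is a pure-Python sketch for a single example (no batch dimension). It mirrors the steps of `dot_attention` and `softmax_mask` but is only an illustration; the helper name `dot_attention_py` and the toy values are made up.

```python
import math


def dot_attention_py(memory, state, mask):
    # memory: seq_len x hidden_dim, state: hidden_dim, mask: seq_len of 0/1
    scores = [sum(m * s for m, s in zip(item, state)) for item in memory]
    # Add a large negative value to masked positions before the softmax.
    scores = [sc - 1e30 * (1 - keep) for sc, keep in zip(scores, mask)]
    top = max(scores)
    exps = [math.exp(sc - top) for sc in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted sum of the memory items.
    context = [sum(w * item[d] for w, item in zip(weights, memory))
               for d in range(len(state))]
    return weights, context


memory = [[1.0, 0.0], [0.0, 1.0], [9.0, 9.0]]
state = [1.0, 1.0]
mask = [1, 1, 0]  # the third item is masked out, e.g. padding
weights, context = dot_attention_py(memory, state, mask)
print(weights)  # [0.5, 0.5, 0.0]
print(context)  # [0.5, 0.5]
```

Without the mask, the third item would dominate because its dot product with the state is by far the largest; with the mask its weight is zero.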

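### Teacher forcing
We have implemented a decoder that, during both training and inference, takes its own previous output as input. In the early stages of training, however, the outputs are close to random, so it is hard for the model to learn to produce long sequences by conditioning on them. Teacher forcing addresses this: instead of feeding the model's own output back in, we feed the ground-truth token. This helps during training, but at inference time the model still has to rely on its own outputs.  
  
Using the model's own output:  
<img src="img/sampling.png" alt="sampling" width=50% />  
  
Teacher forcing:  
<img src="img/teacher_forcing.png" alt="teacher_forcing" width=50% />  
  
It is not necessary to feed the ground-truth token at every step: at each step we can choose at random, with some rate, whether to feed the ground truth or the model's prediction. The teacher_forcing_rate parameter of the seq2seq model controls this behavior.  
  
More details on teacher forcing can be found in Section 10.2.1 of the Deep Learning Book (http://www.deeplearningbook.org/contents/rnn.html).  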
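
The per-step choice can be sketched in pure Python as follows. This is a simplified illustration, not the TensorFlow decoder: the helper name `choose_decoder_inputs` is hypothetical, and the model predictions are fixed in advance here, whereas in a real decoder each prediction depends on the inputs fed so far.

```python
import random


def choose_decoder_inputs(targets, predictions, teacher_forcing_rate,
                          start_id=1, rng=None):
    """Return the token id fed to the decoder at every step."""
    rng = rng or random.Random(0)
    inputs = [start_id]  # step 0 always starts from <BOS>
    for i in range(1, len(targets)):
        if rng.random() < teacher_forcing_rate:
            inputs.append(targets[i - 1])      # feed the true previous token
        else:
            inputs.append(predictions[i - 1])  # feed the model's previous output
    return inputs


targets = [5, 6, 7]       # ground-truth ids
predictions = [8, 9, 4]   # ids the model produced at each step (fixed for the demo)
always = choose_decoder_inputs(targets, predictions, teacher_forcing_rate=1.0)
never = choose_decoder_inputs(targets, predictions, teacher_forcing_rate=0.0)
print(always)  # [1, 5, 6]
print(never)   # [1, 8, 9]
```

A rate of 1.0 always feeds the ground truth, a rate of 0.0 always feeds the model's own output, and values in between mix the two at random.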
  

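## Postprocessing  
In the postprocessing step we strip the special tokens `<PAD>`, `<BOS>`, and `<EOS>`: the component below cuts each predicted sequence at the first `<EOS>` and joins the remaining tokens into a string.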

In [None]:
@register('postprocessing')
class SentencePostprocessor(Component):
    def __init__(self, pad_token='<PAD>', start_token='<BOS>', end_token='<EOS>', *args, **kwargs):
        self.pad_token = pad_token
        self.start_token = start_token
        self.end_token = end_token

    def __call__(self, batch):
        for i in range(len(batch)):
            batch[i] = ' '.join(self._postproc(batch[i]))
        return batch
    
    def _postproc(self, utt):
        if self.end_token in utt:
            utt = utt[:utt.index(self.end_token)]
        return utt

In [None]:
postprocess = SentencePostprocessor()

In [None]:
postprocess(vocab(s2s(padder(vocab([['hello', 'my', 'friend', 'there_is_no_such_word_in_dataset', 'and_this']])))))
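
The effect of this step is easier to see on hand-written token lists than on the output of a randomly initialized model. The sketch below repeats the truncation logic as a standalone function; the name `cut_at_eos` and the example tokens are made up for illustration.

```python
def cut_at_eos(tokens, end_token='<EOS>'):
    # Keep only the tokens before the first end token, if there is one.
    if end_token in tokens:
        tokens = tokens[:tokens.index(end_token)]
    return ' '.join(tokens)


with_eos = cut_at_eos(['i', 'am', 'fine', '<EOS>', '<PAD>', '<PAD>'])
without_eos = cut_at_eos(['no', 'end', 'token'])
print(with_eos)     # i am fine
print(without_eos)  # no end token
```

Anything after the first `<EOS>`, including padding, is dropped; a sequence without `<EOS>` is returned unchanged.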

## Create the config file
Let's put all of these components together in a config.

In [None]:
config = {
  "dataset_reader": {
    "name": "personachat_dataset_reader",
    "data_path": "YOUR_PATH_TO_FOLDER_WITH_PERSONACHAT_DATASET"
  },
  "dataset_iterator": {
    "name": "personachat_iterator",
    "seed": 1337,
    "shuffle": True
  },
  "chainer": {
    "in": ["x"],
    "in_y": ["y"],
    "pipe": [
      {
        "name": "lazy_tokenizer",
        "id": "tokenizer",
        "in": ["x"],
        "out": ["x_tokens"]
      },
      {
        "name": "lazy_tokenizer",
        "id": "tokenizer",
        "in": ["y"],
        "out": ["y_tokens"]
      },
      {
        "name": "dialog_vocab",
        "id": "vocab",
        "save_path": "YOUR_PATH_TO_WORKING_DIR/vocab.dict",
        "load_path": "YOUR_PATH_TO_WORKING_DIR/vocab.dict",
        "min_freq": 2,
        "special_tokens": ["<PAD>","<BOS>", "<EOS>", "<UNK>"],
        "unk_token": "<UNK>",
        "fit_on": ["x_tokens", "y_tokens"],
        "in": ["x_tokens"],
        "out": ["x_tokens_ids"]
      },
      {
        "ref": "vocab",
        "in": ["y_tokens"],
        "out": ["y_tokens_ids"]
      },
      {
        "name": "sentence_padder",
        "id": "padder",
        "length_limit": 20,
        "in": ["x_tokens_ids"],
        "out": ["x_tokens_ids"]
      },
      {
        "ref": "padder",
        "in": ["y_tokens_ids"],
        "out": ["y_tokens_ids"]
      },
      {
        "name": "seq2seq",
        "id": "s2s",
        "max_length": "#padder.length_limit+2",
        "cell_size": 250,
        "embeddings_dim": 50,
        "vocab_size": 11595,
        "keep_prob": 0.8,
        "learning_rate": 3e-04,
        "teacher_forcing_rate": 0.0,
        "use_attention": False,
        "save_path": "YOUR_PATH_TO_WORKING_DIR/model",
        "load_path": "YOUR_PATH_TO_WORKING_DIR/model",
        "in": ["x_tokens_ids"],
        "in_y": ["y_tokens_ids"],
        "out": ["y_predicted_tokens_ids"],
      },
      {
        "ref": "vocab",
        "in": ["y_predicted_tokens_ids"],
        "out": ["y_predicted_tokens"]
      },
      {
        "name": "postprocessing",
        "in": ["y_predicted_tokens"],
        "out": ["y_predicted_tokens"]
      }
    ],
    "out": ["y_predicted_tokens"]
  },
  "train": {
    "log_every_n_batches": 100,
    "val_every_n_epochs":0,
    "batch_size": 64,
    "validation_patience": 0,
    "epochs": 20,
    "metrics": ["bleu"],
  }
}
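
Note that `config` is a Python dict literal, not a JSON file: it uses `True`/`False` and tolerates trailing commas, neither of which is valid JSON. Serializing it with the `json` module produces valid JSON. The small sketch below shows this on a toy dict (`toy_config` is a made-up stand-in, not the full config).

```python
import json

# A toy stand-in with the same Python-only features as the config above.
toy_config = {
    "shuffle": True,
    "metrics": ["bleu"],
    "learning_rate": 3e-04,
}

text = json.dumps(toy_config)
print(text)  # {"shuffle": true, "metrics": ["bleu"], "learning_rate": 0.0003}

# Reading the JSON back gives an equal Python dict.
restored = json.loads(text)
print(restored == toy_config)  # True
```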

## Interact with the model using the config

In [None]:
from deeppavlov.core.commands.infer import build_model_from_config
model = build_model_from_config(config)

In [None]:
model(['Hi, how are you?', 'Any ideas my dear friend?'])

## Train the model
Run experiments with and without attention, and with and without teacher forcing.

In [None]:
from deeppavlov.core.commands.train import train_evaluate_model_from_config

In [None]:
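# write the config to disk; the with block closes the file so the contents are flushed
with open('seq2seq.json', 'w') as fout:
    json.dump(config, fout)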

In [None]:
train_evaluate_model_from_config('seq2seq.json')

In [None]:
model = build_model_from_config(config)
model(['hi, how are you?', 'any ideas my dear friend?', 'okay, i agree with you', 'good bye!'])

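To improve the model, you can try multi-layer encoders and decoders (using MultiRNNCell) and an attention mechanism with trainable parameters (instead of the dot-product scoring function).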