In [1]:
%matplotlib inline


Translation with a Sequence to Sequence Network and Attention
=============================================================
**Author**: [Sean Robertson](https://github.com/spro/practical-pytorch)

In this project we will teach a neural network to translate from
French to English.

::

    [KEY: > input, = target, < output]

    > il est en train de peindre un tableau .
    = he is painting a picture .
    < he is painting a picture .

    > pourquoi ne pas essayer ce vin delicieux ?
    = why not try that delicious wine ?
    < why not try that delicious wine ?

    > elle n est pas poete mais romanciere .
    = she is not a poet but a novelist .
    < she not not a poet but a novelist .

    > vous etes trop maigre .
    = you re too skinny .
    < you re all alone .

... to varying degrees of success.

This is made possible by the simple but powerful idea of the [sequence
to sequence network](http://arxiv.org/abs/1409.3215), in which two
recurrent neural networks work together to transform one sequence to
another. An encoder network condenses an input sequence into a vector,
and a decoder network unfolds that vector into a new sequence.

![](http://pytorch.org/tutorials/_images/seq2seq.png)

To improve upon this model we'll use an [attention
mechanism](https://arxiv.org/abs/1409.0473), which lets the decoder
learn to focus over a specific range of the input sequence.

**Recommended Reading:**

I assume you have at least installed PyTorch, know Python, and
understand Tensors:

-  http://pytorch.org/ For installation instructions
-  :doc:`/beginner/deep_learning_60min_blitz` to get started with PyTorch in general
-  :doc:`/beginner/pytorch_with_examples` for a wide and deep overview
-  :doc:`/beginner/former_torchies_tutorial` if you are a former Lua Torch user


It would also be useful to know about Sequence to Sequence networks and
how they work:

-  [Learning Phrase Representations using RNN Encoder-Decoder for
   Statistical Machine Translation](http://arxiv.org/abs/1406.1078)
-  [Sequence to Sequence Learning with Neural
   Networks](http://arxiv.org/abs/1409.3215)
-  [Neural Machine Translation by Jointly Learning to Align and
   Translate](https://arxiv.org/abs/1409.0473)
-  [A Neural Conversational Model](http://arxiv.org/abs/1506.05869)

You will also find the previous tutorials on
:doc:`/intermediate/char_rnn_classification_tutorial`
and :doc:`/intermediate/char_rnn_generation_tutorial`
helpful as those concepts are very similar to the Encoder and Decoder
models, respectively.


**Requirements**



In [2]:
from __future__ import unicode_literals, print_function, division
from io import open
import unicodedata
import string
import re
import random

import torch
import torch.nn as nn
from torch.autograd import Variable
from torch import optim
import torch.nn.functional as F

use_cuda = torch.cuda.is_available()

Loading data files
==================

The data for this project is a set of many thousands of English to
French translation pairs.

[This question on Open Data Stack
Exchange](http://opendata.stackexchange.com/questions/3888/dataset-of-sentences-translated-into-many-languages)
pointed me to the open translation site http://tatoeba.org/ which has
downloads available at http://tatoeba.org/eng/downloads - and better
yet, someone did the extra work of splitting language pairs into
individual text files here: http://www.manythings.org/anki/

The English to French pairs are too big to include in the repo, so
download to ``data/eng-fra.txt`` before continuing. The file is a
tab-separated list of translation pairs:

::

    I am cold.    Je suis froid.

.. Note::
   Download the data from
   [here](https://download.pytorch.org/tutorial/data.zip)
   and extract it to the current directory.



Similar to the character encoding used in the character-level RNN
tutorials, we will be representing each word in a language as a one-hot
vector, or giant vector of zeros except for a single one (at the index
of the word). Compared to the dozens of characters that might exist in a
language, there are many more words, so the encoding vector is much
larger. We will however cheat a bit and trim the data to only use a few
thousand words per language.

![](http://pytorch.org/tutorials/_images/word-encoding.png)
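
To make the one-hot idea concrete, here is a minimal sketch in plain Python. The tiny vocabulary and the helper name `one_hot` are made up for illustration; the networks below take the word index directly and look it up in an embedding layer rather than building these vectors explicitly.

```python
# Toy vocabulary; the words and their indices are made up for illustration.
word2index = {"SOS": 0, "EOS": 1, "je": 2, "suis": 3, "froid": 4}


def one_hot(word, vocab):
    """Return a list of zeros with a single 1 at the word's index."""
    vector = [0] * len(vocab)
    vector[vocab[word]] = 1
    return vector


suis_vector = one_hot("suis", word2index)
print(suis_vector)  # [0, 0, 0, 1, 0]
```

With a few thousand words per language, each such vector would have a few thousand entries, which is why the vocabulary is trimmed.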





We'll need a unique index per word to use as the inputs and targets of
the networks later. To keep track of all this we will use a helper class
called ``Lang`` which has word → index (``word2index``) and index → word
(``index2word``) dictionaries, as well as a count of each word
(``word2count``) that can be used later to replace rare words.




In [3]:
SOS_token = 0
EOS_token = 1


class Lang:
    def __init__(self, name):
        self.name = name
        self.word2index = {}
        self.word2count = {}
        self.index2word = {0: "SOS", 1: "EOS"}
        self.n_words = 2  # Count SOS and EOS

    def addSentence(self, sentence):
        for word in sentence.split(' '):
            self.addWord(word)

    def addWord(self, word):
        if word not in self.word2index:
            self.word2index[word] = self.n_words
            self.word2count[word] = 1
            self.index2word[self.n_words] = word
            self.n_words += 1
        else:
            self.word2count[word] += 1
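
As a rough illustration of the bookkeeping that ``Lang`` does, the sketch below builds the same three dictionaries with a standalone function. The name `build_vocab` and the two sample sentences are assumptions made for this example only; indices start at 2 because 0 and 1 are reserved for SOS and EOS.

```python
def build_vocab(sentences):
    """Build word2index, index2word and word2count from whitespace-split sentences."""
    word2index, word2count = {}, {}
    index2word = {0: "SOS", 1: "EOS"}
    for sentence in sentences:
        for word in sentence.split(' '):
            if word not in word2index:
                index = len(index2word)  # next free index after SOS/EOS
                word2index[word] = index
                index2word[index] = word
                word2count[word] = 1
            else:
                word2count[word] += 1
    return word2index, index2word, word2count


word2index, index2word, word2count = build_vocab(["i am cold .", "i am here ."])
print(word2index)  # {'i': 2, 'am': 3, 'cold': 4, '.': 5, 'here': 6}
print(word2count)  # {'i': 2, 'am': 2, 'cold': 1, '.': 2, 'here': 1}
```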

The files are all in Unicode. To simplify, we will turn Unicode
characters to ASCII, make everything lowercase, and trim most
punctuation.




In [4]:
# Turn a Unicode string to plain ASCII, thanks to
# http://stackoverflow.com/a/518232/2809427
def unicodeToAscii(s):
    return ''.join(
        c for c in unicodedata.normalize('NFD', s)
        if unicodedata.category(c) != 'Mn'
    )


# Lowercase, trim, and remove non-letter characters
def normalizeString(s):
    s = unicodeToAscii(s.lower().strip())
    s = re.sub(r"([.!?])", r" \1", s)
    s = re.sub(r"[^a-zA-Z.!?]+", r" ", s)
    return s
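
To see what each normalization step does, the following trace applies the same operations one at a time to a made-up sentence. The variable names are illustrative and not part of the tutorial code.

```python
import re
import unicodedata

raw = "Elle n'est pas po\u00e8te !"  # made-up input containing an accented letter

lowered = raw.lower().strip()
# Decompose accented characters (NFD) and drop the combining marks (category 'Mn').
ascii_only = ''.join(
    c for c in unicodedata.normalize('NFD', lowered)
    if unicodedata.category(c) != 'Mn'
)
# Put a space in front of sentence punctuation so it becomes its own token.
spaced = re.sub(r"([.!?])", r" \1", ascii_only)
# Collapse every run of other characters (apostrophes, extra spaces) to one space.
cleaned = re.sub(r"[^a-zA-Z.!?]+", r" ", spaced)

print(ascii_only)  # elle n'est pas poete !
print(cleaned)     # elle n est pas poete !
```

This is why apostrophes show up as spaces in the sample pairs, as in "you re too skinny".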

To read the data file we will split the file into lines, and then split
lines into pairs. The files are all English → Other Language, so to
translate from Other Language → English I added the ``reverse``
flag to reverse the pairs.




In [5]:
def readLangs(lang1, lang2, reverse=False):
    print("Reading lines...")

    # Read the file and split into lines
    with open('data/%s-%s.txt' % (lang1, lang2), encoding='utf-8') as f:
        lines = f.read().strip().split('\n')

    # Split every line into pairs and normalize
    pairs = [[normalizeString(s) for s in l.split('\t')] for l in lines]

    # Reverse pairs, make Lang instances
    if reverse:
        pairs = [list(reversed(p)) for p in pairs]
        input_lang = Lang(lang2)
        output_lang = Lang(lang1)
    else:
        input_lang = Lang(lang1)
        output_lang = Lang(lang2)

    return input_lang, output_lang, pairs
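
The splitting and reversing can be tried without the data file. This sketch uses an in-memory string with two made-up lines in the same tab-separated layout, and skips normalization to keep the steps visible.

```python
# Two made-up lines in the same "English<TAB>French" layout as the data file.
sample_text = "I am cold.\tJe suis froid.\nHe is tall.\tIl est grand."

lines = sample_text.strip().split('\n')
pairs = [line.split('\t') for line in lines]
# Reversing each pair turns English -> French into French -> English.
reversed_pairs = [list(reversed(pair)) for pair in pairs]

print(pairs[0])           # ['I am cold.', 'Je suis froid.']
print(reversed_pairs[0])  # ['Je suis froid.', 'I am cold.']
```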

Since there are a *lot* of example sentences and we want to train
something quickly, we'll trim the data set to only relatively short and
simple sentences. A pair is kept only if both sentences have fewer than
``MAX_LENGTH`` = 10 space-separated tokens (ending punctuation counts as
a token) and the English side starts with a form like "I am" or "He is"
etc. (accounting for apostrophes replaced earlier).




In [6]:
MAX_LENGTH = 10

eng_prefixes = (
    "i am ", "i m ",
    "he is", "he s ",
    "she is", "she s",
    "you are", "you re ",
    "we are", "we re ",
    "they are", "they re "
)


def filterPair(p):
    return len(p[0].split(' ')) < MAX_LENGTH and \
        len(p[1].split(' ')) < MAX_LENGTH and \
        p[1].startswith(eng_prefixes)


def filterPairs(pairs):
    return [pair for pair in pairs if filterPair(pair)]
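
A small self-contained check of the filtering rule, assuming pairs are already reversed to [French, English]. The names `MAX_WORDS`, `prefixes` and `keep_pair`, and the three candidate pairs, are made up for this sketch; only a subset of the prefixes is used.

```python
MAX_WORDS = 10  # same limit as MAX_LENGTH above
prefixes = ("i am ", "he is", "you re ")  # a small subset, for illustration


def keep_pair(pair):
    """Keep short pairs whose English side starts with one of the prefixes."""
    french, english = pair
    return (len(french.split(' ')) < MAX_WORDS
            and len(english.split(' ')) < MAX_WORDS
            and english.startswith(prefixes))  # str.startswith accepts a tuple


candidates = [
    ["je suis froid .", "i am cold ."],
    ["le chat dort .", "the cat sleeps ."],
    ["vous etes trop maigre .", "you re too skinny ."],
]
kept = [pair for pair in candidates if keep_pair(pair)]
print(len(kept))  # 2
```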

The full process for preparing the data is:

-  Read text file and split into lines, split lines into pairs
-  Normalize text, filter by length and content
-  Make word lists from sentences in pairs




In [7]:
def prepareData(lang1, lang2, reverse=False):
    input_lang, output_lang, pairs = readLangs(lang1, lang2, reverse)
    print("Read %s sentence pairs" % len(pairs))
    pairs = filterPairs(pairs)
    print("Trimmed to %s sentence pairs" % len(pairs))
    print("Counting words...")
    for pair in pairs:
        input_lang.addSentence(pair[0])
        output_lang.addSentence(pair[1])
    print("Counted words:")
    print(input_lang.name, input_lang.n_words)
    print(output_lang.name, output_lang.n_words)
    print("Count of pairs:")
    print(len(pairs))
    return input_lang, output_lang, pairs


input_lang, output_lang, pairs = prepareData('eng', 'fra', True)
print(random.choice(pairs))

Reading lines...
Read 135842 sentence pairs
Trimmed to 10853 sentence pairs
Counting words...
Counted words:
fra 4489
eng 2925
Count of pairs:
10853
['vous n etes pas invite .', 'you aren t invited .']


The Seq2Seq Model
=================

A Recurrent Neural Network, or RNN, is a network that operates on a
sequence and uses its own output as input for subsequent steps.

A [Sequence to Sequence network](http://arxiv.org/abs/1409.3215), or
seq2seq network, or [Encoder Decoder network](https://arxiv.org/pdf/1406.1078v3.pdf), is a model
consisting of two RNNs called the encoder and decoder. The encoder reads
an input sequence and outputs a single vector, and the decoder reads
that vector to produce an output sequence.

![](http://pytorch.org/tutorials/_images/seq2seq.png)

Unlike sequence prediction with a single RNN, where every input
corresponds to an output, the seq2seq model frees us from sequence
length and order, which makes it ideal for translation between two
languages.

Consider the sentence "Je ne suis pas le chat noir" → "I am not the
black cat". Most of the words in the input sentence have a direct
translation in the output sentence, but are in slightly different
orders, e.g. "chat noir" and "black cat". Because of the "ne/pas"
construction there is also one more word in the input sentence. It would
be difficult to produce a correct translation directly from the sequence
of input words.

With a seq2seq model the encoder creates a single vector which, in the
ideal case, encodes the "meaning" of the input sequence: a single point
in some N-dimensional space of sentences.




The Encoder
-----------

The encoder of a seq2seq network is an RNN that outputs some value for
every word from the input sentence. For every input word the encoder
outputs a vector and a hidden state, and uses the hidden state for the
next input word.

![](http://pytorch.org/tutorials/_images/encoder-network.png)





In [8]:
class EncoderRNN(nn.Module):
    def __init__(self, input_size, hidden_size, n_layers=1):
        super(EncoderRNN, self).__init__()
        self.n_layers = n_layers
        self.hidden_size = hidden_size

        self.embedding = nn.Embedding(input_size, hidden_size)
        self.gru = nn.GRU(hidden_size, hidden_size)

    def forward(self, input, hidden):
        embedded = self.embedding(input).view(1, 1, -1)
        output = embedded
        for i in range(self.n_layers):
            output, hidden = self.gru(output, hidden)
        return output, hidden

    def initHidden(self):
        result = Variable(torch.zeros(1, 1, self.hidden_size))
        if use_cuda:
            return result.cuda()
        else:
            return result

The Decoder
-----------

The decoder is another RNN that takes the encoder output vector(s) and
outputs a sequence of words to create the translation.




### Simple Decoder

In the simplest seq2seq decoder we use only the last output of the encoder.
This last output is sometimes called the *context vector* as it encodes
context from the entire sequence. This context vector is used as the
initial hidden state of the decoder.

At every step of decoding, the decoder is given an input token and
hidden state. The initial input token is the start-of-string ``<SOS>``
token, and the first hidden state is the context vector (the encoder's
last hidden state).

![](http://pytorch.org/tutorials/_images/decoder-network.png)





In [9]:
class DecoderRNN(nn.Module):
    def __init__(self, hidden_size, output_size, n_layers=1):
        super(DecoderRNN, self).__init__()
        self.n_layers = n_layers
        self.hidden_size = hidden_size

        self.embedding = nn.Embedding(output_size, hidden_size)
        self.gru = nn.GRU(hidden_size, hidden_size)
        self.out = nn.Linear(hidden_size, output_size)
        self.softmax = nn.LogSoftmax(dim=1)  # normalize over the vocabulary dimension

    def forward(self, input, hidden):
        output = self.embedding(input).view(1, 1, -1)
        for i in range(self.n_layers):
            output = F.relu(output)
            output, hidden = self.gru(output, hidden)
        output = self.softmax(self.out(output[0]))
        return output, hidden

    def initHidden(self):
        result = Variable(torch.zeros(1, 1, self.hidden_size))
        if use_cuda:
            return result.cuda()
        else:
            return result

I encourage you to train and observe the results of this model, but to
save space we'll be going straight for the gold and introducing the
Attention Mechanism.




### Attention Decoder

If only the context vector is passed between the encoder and decoder,
that single vector carries the burden of encoding the entire sentence.

Attention allows the decoder network to "focus" on a different part of
the encoder's outputs for every step of the decoder's own outputs. First
we calculate a set of *attention weights*. These will be multiplied by
the encoder output vectors to create a weighted combination. The result
(called ``attn_applied`` in the code) should contain information about
that specific part of the input sequence, and thus help the decoder
choose the right output words.

![](https://i.imgur.com/1152PYf.png)
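
The weighted combination can be worked through by hand on a tiny case. The sketch below uses plain Python lists with made-up scores and made-up two-dimensional "encoder outputs"; the real decoder does the same arithmetic with tensors and a learned scoring layer.

```python
import math


def softmax(scores):
    """Turn arbitrary scores into positive weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up scores for three input positions and made-up 2-d encoder outputs.
scores = [2.0, 0.0, 0.0]
encoder_outputs = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]

attn_weights = softmax(scores)
# Weighted combination: sum over positions of weight_i * encoder_output_i.
attn_applied = [
    sum(w * vec[d] for w, vec in zip(attn_weights, encoder_outputs))
    for d in range(2)
]
print(attn_weights)
print(attn_applied)
```

Because the first position has the highest score, the combined vector is dominated by the first encoder output.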

Calculating the attention weights is done with another feed-forward
layer ``attn``, using the decoder's input and hidden state as inputs.
Because there are sentences of all sizes in the training data, to
actually create and train this layer we have to choose a maximum
sentence length (input length, for encoder outputs) that it can apply
to. Sentences of the maximum length will use all the attention weights,
while shorter sentences will only use the first few.

![](http://pytorch.org/tutorials/_images/attention-decoder-network.png)





In [10]:
class AttnDecoderRNN(nn.Module):
    def __init__(self, hidden_size, output_size, n_layers=1, dropout_p=0.1, max_length=MAX_LENGTH):
        super(AttnDecoderRNN, self).__init__()
        self.hidden_size = hidden_size
        self.output_size = output_size
        self.n_layers = n_layers
        self.dropout_p = dropout_p
        self.max_length = max_length

        self.embedding = nn.Embedding(self.output_size, self.hidden_size)
        self.attn = nn.Linear(self.hidden_size * 2, self.max_length)
        self.attn_combine = nn.Linear(self.hidden_size * 2, self.hidden_size)
        self.dropout = nn.Dropout(self.dropout_p)
        self.gru = nn.GRU(self.hidden_size, self.hidden_size)
        self.out = nn.Linear(self.hidden_size, self.output_size)

    def forward(self, input, hidden, encoder_output, encoder_outputs):
        embedded = self.embedding(input).view(1, 1, -1)
        embedded = self.dropout(embedded)

        attn_weights = F.softmax(
            self.attn(torch.cat((embedded[0], hidden[0]), 1)), dim=1)
        attn_applied = torch.bmm(attn_weights.unsqueeze(0),
                                 encoder_outputs.unsqueeze(0))

        output = torch.cat((embedded[0], attn_applied[0]), 1)
        output = self.attn_combine(output).unsqueeze(0)

        for i in range(self.n_layers):
            output = F.relu(output)
            output, hidden = self.gru(output, hidden)

        output = F.log_softmax(self.out(output[0]), dim=1)
        return output, hidden, attn_weights

    def initHidden(self):
        result = Variable(torch.zeros(1, 1, self.hidden_size))
        if use_cuda:
            return result.cuda()
        else:
            return result

<div class="alert alert-info"><h4>Note</h4><p>There are other forms of attention that work around the length
  limitation by using a relative position approach. Read about "local
  attention" in [Effective Approaches to Attention-based Neural Machine
  Translation](https://arxiv.org/abs/1508.04025).</p></div>

Training
========

Preparing Training Data
-----------------------

To train, for each pair we will need an input tensor (indexes of the
words in the input sentence) and target tensor (indexes of the words in
the target sentence). While creating these vectors we will append the
EOS token to both sequences.




In [11]:
def indexesFromSentence(lang, sentence):
    return [lang.word2index[word] for word in sentence.split(' ')]


def variableFromSentence(lang, sentence):
    indexes = indexesFromSentence(lang, sentence)
    indexes.append(EOS_token)
    result = Variable(torch.LongTensor(indexes).view(-1, 1))
    if use_cuda:
        return result.cuda()
    else:
        return result


def variablesFromPair(pair):
    input_variable = variableFromSentence(input_lang, pair[0])
    target_variable = variableFromSentence(output_lang, pair[1])
    return (input_variable, target_variable)
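
Stripped of the tensor wrapping, the conversion is a dictionary lookup per word followed by appending EOS. This sketch uses a made-up vocabulary and a hypothetical helper name `indexes_with_eos` to show the resulting index list.

```python
EOS_token = 1
toy_word2index = {"je": 2, "suis": 3, "froid": 4, ".": 5}  # made-up indices


def indexes_with_eos(word2index, sentence):
    """Look up each word's index and mark the end of the sentence with EOS."""
    indexes = [word2index[word] for word in sentence.split(' ')]
    indexes.append(EOS_token)
    return indexes


print(indexes_with_eos(toy_word2index, "je suis froid ."))  # [2, 3, 4, 5, 1]
```

A word missing from the vocabulary raises a `KeyError`, so sentences passed in should come from the same filtered pairs used to build the vocabulary.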

Training the Model
------------------

To train we run the input sentence through the encoder, and keep track
of every output and the latest hidden state. Then the decoder is given
the ``<SOS>`` token as its first input, and the last hidden state of the
encoder as its first hidden state.

"Teacher forcing" is the concept of using the real target outputs as
each next input, instead of using the decoder's guess as the next input.
Using teacher forcing causes the network to converge faster, but [when the
trained network is exploited, it may exhibit
instability](http://minds.jacobs-university.de/sites/default/files/uploads/papers/ESNTutorialRev.pdf).

You can observe outputs of teacher-forced networks that read with
coherent grammar but wander far from the correct translation.
Intuitively, the network has learned to represent the output grammar and
can "pick up" the meaning once the teacher tells it the first few words,
but it has not properly learned how to create the sentence from the
translation in the first place.

Because of the freedom PyTorch's autograd gives us, we can randomly
choose to use teacher forcing or not with a simple if statement. Turn
``teacher_forcing_ratio`` up to use more of it.




In [26]:
teacher_forcing_ratio = 0.5


def train(input_variable, target_variable, encoder, decoder, encoder_optimizer, decoder_optimizer, criterion, max_length=MAX_LENGTH):
    encoder_hidden = encoder.initHidden()

    encoder_optimizer.zero_grad()
    decoder_optimizer.zero_grad()

    input_length = input_variable.size()[0]
    target_length = target_variable.size()[0]
    
    encoder_outputs = Variable(torch.zeros(max_length, encoder.hidden_size))
    encoder_outputs = encoder_outputs.cuda() if use_cuda else encoder_outputs
   
    loss = 0

    for ei in range(input_length):
        encoder_output, encoder_hidden = encoder(
            input_variable[ei], encoder_hidden)
        encoder_outputs[ei] = encoder_output[0][0]

    decoder_input = Variable(torch.LongTensor([[SOS_token]]))
    decoder_input = decoder_input.cuda() if use_cuda else decoder_input
    
    decoder_hidden = encoder_hidden

    use_teacher_forcing = random.random() < teacher_forcing_ratio

    if use_teacher_forcing:
        # Teacher forcing: Feed the target as the next input
        for di in range(target_length):
            decoder_output, decoder_hidden, decoder_attention = decoder(
                decoder_input, decoder_hidden, encoder_output, encoder_outputs)
            loss += criterion(decoder_output, target_variable[di])
            # Debug output: show the prediction and its target at every step
            print("decoder_output")
            print(decoder_output)
            print("target_variable[di]")
            print(target_variable[di])
            decoder_input = target_variable[di]  # Teacher forcing

    else:
        # Without teacher forcing: use its own predictions as the next input
        for di in range(target_length):
            decoder_output, decoder_hidden, decoder_attention = decoder(
                decoder_input, decoder_hidden, encoder_output, encoder_outputs)
            topv, topi = decoder_output.data.topk(1)
            ni = topi[0][0]
            
            decoder_input = Variable(torch.LongTensor([[ni]]))
            decoder_input = decoder_input.cuda() if use_cuda else decoder_input
            
            loss += criterion(decoder_output, target_variable[di])
            if ni == EOS_token:
                break

    loss.backward()

    encoder_optimizer.step()
    decoder_optimizer.step()

    return loss.data[0] / target_length
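
The difference between the two branches comes down to which token is fed to the decoder at each step. The sketch below isolates that choice with a toy stand-in for the decoder; `toy_decoder_step` and `decoder_inputs` are hypothetical names, and no learning takes place.

```python
import random


def toy_decoder_step(previous_token):
    """Stand-in for the decoder: always 'predicts' previous_token + 1."""
    return previous_token + 1


def decoder_inputs(target, teacher_forcing_ratio, rng):
    """Return the sequence of tokens fed to the decoder for one sentence."""
    use_teacher_forcing = rng.random() < teacher_forcing_ratio
    fed = []
    decoder_input = 0  # SOS
    for target_token in target:
        fed.append(decoder_input)
        prediction = toy_decoder_step(decoder_input)
        # Teacher forcing feeds the true token; otherwise feed the prediction.
        decoder_input = target_token if use_teacher_forcing else prediction
    return fed


rng = random.Random(0)
target = [7, 3, 9]
forced = decoder_inputs(target, 1.0, rng)  # always teacher forcing
free = decoder_inputs(target, 0.0, rng)    # never teacher forcing
print(forced)  # [0, 7, 3]
print(free)    # [0, 1, 2]
```

As in ``train``, the choice is made once per sentence rather than once per token.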

This is a helper function to print time elapsed and estimated time
remaining given the current time and progress %.




In [27]:
import time
import math


def asMinutes(s):
    m = math.floor(s / 60)
    s -= m * 60
    return '%dm %ds' % (m, s)


def timeSince(since, percent):
    now = time.time()
    s = now - since
    es = s / (percent)
    rs = es - s
    return '%s (- %s)' % (asMinutes(s), asMinutes(rs))
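
The estimate is a simple proportion: if `s` seconds have covered a fraction `p` of the work, the total is about `s / p` and the remainder is `s / p - s`. The sketch below reproduces that arithmetic from an elapsed time instead of the clock; `as_minutes` and `eta_from_elapsed` are illustrative names.

```python
import math


def as_minutes(seconds):
    """Format a number of seconds as 'Xm Ys'."""
    minutes = math.floor(seconds / 60)
    return '%dm %ds' % (minutes, seconds - minutes * 60)


def eta_from_elapsed(elapsed, fraction_done):
    """Return 'elapsed (- remaining)' assuming progress is linear in time."""
    estimated_total = elapsed / fraction_done
    remaining = estimated_total - elapsed
    return '%s (- %s)' % (as_minutes(elapsed), as_minutes(remaining))


print(eta_from_elapsed(120, 0.25))  # 2m 0s (- 6m 0s)
```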

The whole training process looks like this:

-  Start a timer
-  Initialize optimizers and criterion
-  Create set of training pairs
-  Start empty losses array for plotting

Then we call ``train`` many times and occasionally print the progress (%
of examples, time so far, estimated time) and average loss.




In [28]:
def trainIters(encoder, decoder, n_iters, print_every=1000, plot_every=100, learning_rate=0.01):
    start = time.time()
    plot_losses = []
    print_loss_total = 0  # Reset every print_every
    plot_loss_total = 0  # Reset every plot_every

    encoder_optimizer = optim.SGD(encoder.parameters(), lr=learning_rate)
    decoder_optimizer = optim.SGD(decoder.parameters(), lr=learning_rate)
    training_pairs = [variablesFromPair(random.choice(pairs))
                      for i in range(n_iters)]
    criterion = nn.NLLLoss()

    for iter in range(1, n_iters + 1):
        training_pair = training_pairs[iter - 1]
        input_variable = training_pair[0]
        target_variable = training_pair[1]
 
        loss = train(input_variable, target_variable, encoder,
                     decoder, encoder_optimizer, decoder_optimizer, criterion)
        print_loss_total += loss
        plot_loss_total += loss

        if iter % print_every == 0:
            print_loss_avg = print_loss_total / print_every
            print_loss_total = 0
            print('%s (%d %d%%) %.4f' % (timeSince(start, iter / n_iters),
                                         iter, iter / n_iters * 100, print_loss_avg))

        if iter % plot_every == 0:
            plot_loss_avg = plot_loss_total / plot_every
            plot_losses.append(plot_loss_avg)
            plot_loss_total = 0

    showPlot(plot_losses)

Plotting results
----------------

Plotting is done with matplotlib, using the array of loss values
``plot_losses`` saved while training.




In [29]:
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
import numpy as np


def showPlot(points):
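    # plt.subplots() already creates the figure; a separate plt.figure()
    # call would leave an extra empty figure behind
    fig, ax = plt.subplots()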
    # this locator puts ticks at regular intervals
    loc = ticker.MultipleLocator(base=0.2)
    ax.yaxis.set_major_locator(loc)
    plt.plot(points)

Evaluation
==========

Evaluation is mostly the same as training, but there are no targets so
we simply feed the decoder's predictions back to itself for each step.
Every time it predicts a word we add it to the output string, and if it
predicts the EOS token we stop there. We also store the decoder's
attention outputs for display later.




In [30]:
def evaluate(encoder, decoder, sentence, max_length=MAX_LENGTH):
    input_variable = variableFromSentence(input_lang, sentence)
    input_length = input_variable.size()[0]
    encoder_hidden = encoder.initHidden()

    encoder_outputs = Variable(torch.zeros(max_length, encoder.hidden_size))
    encoder_outputs = encoder_outputs.cuda() if use_cuda else encoder_outputs

    for ei in range(input_length):
        encoder_output, encoder_hidden = encoder(input_variable[ei],
                                                 encoder_hidden)
        encoder_outputs[ei] = encoder_outputs[ei] + encoder_output[0][0]

    decoder_input = Variable(torch.LongTensor([[SOS_token]]))  # SOS
    decoder_input = decoder_input.cuda() if use_cuda else decoder_input

    decoder_hidden = encoder_hidden

    decoded_words = []
    decoder_attentions = torch.zeros(max_length, max_length)

    for di in range(max_length):
        decoder_output, decoder_hidden, decoder_attention = decoder(
            decoder_input, decoder_hidden, encoder_output, encoder_outputs)
        decoder_attentions[di] = decoder_attention.data
        topv, topi = decoder_output.data.topk(1)
        ni = topi[0][0]
        if ni == EOS_token:
            decoded_words.append('<EOS>')
            break
        else:
            decoded_words.append(output_lang.index2word[ni])
        
        decoder_input = Variable(torch.LongTensor([[ni]]))
        decoder_input = decoder_input.cuda() if use_cuda else decoder_input

    return decoded_words, decoder_attentions[:di + 1]
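
The control flow of ``evaluate`` is a greedy loop: feed the previous prediction back in and stop at EOS or after a fixed number of steps. The sketch below shows just that loop, with a made-up transition table standing in for the trained decoder; `greedy_decode` and `toy_next` are hypothetical names.

```python
EOS_token = 1
MAX_STEPS = 10


def greedy_decode(step_fn, start_token=0, max_steps=MAX_STEPS):
    """Feed each prediction back in until EOS or the step limit is reached."""
    decoded = []
    token = start_token
    for _ in range(max_steps):
        token = step_fn(token)
        if token == EOS_token:
            decoded.append('<EOS>')
            break
        decoded.append(token)
    return decoded


# Made-up transition table standing in for the trained decoder.
toy_next = {0: 5, 5: 8, 8: 1}
print(greedy_decode(toy_next.get))  # [5, 8, '<EOS>']
```

The step limit matters: a decoder that never predicts EOS would otherwise loop forever.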

We can evaluate random sentences from the training set and print out the
input, target, and output to make some subjective quality judgements:




In [31]:
def evaluateRandomly(encoder, decoder, n=10):
    for i in range(n):
        pair = random.choice(pairs)
        print('>', pair[0])
        print('=', pair[1])
        output_words, attentions = evaluate(encoder, decoder, pair[0])
        output_sentence = ' '.join(output_words)
        print('<', output_sentence)
        print('')

Training and Evaluating
=======================

With all these helper functions in place (it looks like extra work, but
it makes it easier to run multiple experiments) we can actually
initialize a network and start training.

Remember that the input sentences were heavily filtered. For this small
dataset we can use relatively small networks of 256 hidden nodes and a
single GRU layer. After about 40 minutes on a MacBook CPU we'll get some
reasonable results.

.. Note:: 
   If you run this notebook you can train, interrupt the kernel,
   evaluate, and continue training later. Comment out the lines where the
   encoder and decoder are initialized and run ``trainIters`` again.




In [32]:
hidden_size = 256
encoder1 = EncoderRNN(input_lang.n_words, hidden_size)
attn_decoder1 = AttnDecoderRNN(hidden_size, output_lang.n_words,
                               1, dropout_p=0.1)

if use_cuda:
    encoder1 = encoder1.cuda()
    attn_decoder1 = attn_decoder1.cuda()

trainIters(encoder1, attn_decoder1, 75000, print_every=5000)

decoder_output
Variable containing:
-8.1219 -8.2469 -7.8807  ...  -7.9698 -8.1224 -7.9490
[torch.cuda.FloatTensor of size 1x2925 (GPU 0)]

target_variable[di]
Variable containing:
 130
[torch.cuda.LongTensor of size 1 (GPU 0)]

decoder_output
Variable containing:
-8.5211 -2.9245 -4.6040  ...  -8.3240 -

decoder_output
Variable containing:
 -8.9619  -5.3987  -1.6336  ...   -9.1835  -8.7175  -9.0075
[torch.cuda.FloatTensor of size 1x2925 (GPU 0)]

target_variable[di]
Variable containing:
 2
[torch.cuda.LongTensor of size 1 (GPU 0)]

decoder_output
Variable containing:
 -8.7737  -4.0964  -2.9202  ...   -9.0451  -8.6081  -9.0477
[torch.cuda.FloatTensor of size 1x2925 (GPU 0)]

target_variable[di]
Variable containing:
 3
[torch.cuda.LongTensor of size 1 (GPU 0)]

decoder_output
Variable containing:
-8.3930 -2.6868 -3.7506  ...  -8.4834 -8.1255 -8.5286
[torch.cuda.FloatTensor of size 1x2925 (GPU 0)]

target_variable[di]
Variable containing:
 798
[torch.cuda.LongTensor of size 1 (GPU 0)]

decoder_output
Variable containing:
-8.2825 -2.4761 -4.6011  ...  -8.2701 -7.9630 -8.2989
[torch.cuda.FloatTensor of size 1x2925 (GPU 0)]

target_variable[di]
Variable containing:
 1320
[torch.cuda.LongTensor of size 1 (GPU 0)]

decoder_output
Variable containing:
-8.1446 -2.7337 -5.3227  ...  -8.0796 -7.79

decoder_output
Variable containing:
 -9.7497  -8.4306  -0.6412  ...   -9.8564  -9.4576  -9.7973
[torch.cuda.FloatTensor of size 1x2925 (GPU 0)]

target_variable[di]
Variable containing:
 2
[torch.cuda.LongTensor of size 1 (GPU 0)]

decoder_output
Variable containing:
 -9.4396  -7.5385  -2.5544  ...   -9.7174  -9.3175  -9.8275
[torch.cuda.FloatTensor of size 1x2925 (GPU 0)]

target_variable[di]
Variable containing:
 3
[torch.cuda.LongTensor of size 1 (GPU 0)]

decoder_output
Variable containing:
-8.4674 -5.6056 -3.1281  ...  -8.5822 -8.2224 -8.6890
[torch.cuda.FloatTensor of size 1x2925 (GPU 0)]

target_variable[di]
Variable containing:
 793
[torch.cuda.LongTensor of size 1 (GPU 0)]

decoder_output
Variable containing:
-8.3129 -4.8019 -3.5895  ...  -8.2307 -7.9437 -8.3590
[torch.cuda.FloatTensor of size 1x2925 (GPU 0)]

target_variable[di]
Variable containing:
 4
[torch.cuda.LongTensor of size 1 (GPU 0)]

decoder_output
Variable containing:
-8.7556 -2.4632 -3.7304  ...  -8.5335 -8.2771 

decoder_output
Variable containing:
 -9.4930  -6.0766  -0.7247  ...   -9.5727  -9.1488  -9.5081
[torch.cuda.FloatTensor of size 1x2925 (GPU 0)]

target_variable[di]
Variable containing:
 2
[torch.cuda.LongTensor of size 1 (GPU 0)]

decoder_output
Variable containing:
 -8.8603  -4.3290  -2.7483  ...   -9.0969  -8.6936  -9.2252
[torch.cuda.FloatTensor of size 1x2925 (GPU 0)]

target_variable[di]
Variable containing:
 3
[torch.cuda.LongTensor of size 1 (GPU 0)]

decoder_output
Variable containing:
-8.6064 -2.6547 -4.2129  ...  -8.6611 -8.2711 -8.7749
[torch.cuda.FloatTensor of size 1x2925 (GPU 0)]

target_variable[di]
Variable containing:
 148
[torch.cuda.LongTensor of size 1 (GPU 0)]

decoder_output
Variable containing:
 -8.8718  -2.1072  -5.0996  ...   -8.7670  -8.4007  -8.7454
[torch.cuda.FloatTensor of size 1x2925 (GPU 0)]

target_variable[di]
Variable containing:
 217
[torch.cuda.LongTensor of size 1 (GPU 0)]

decoder_output
Variable containing:
 -8.9042  -2.0378  -5.7068  ...   -8.7

decoder_output
Variable containing:
 -9.8188  -7.7287  -0.7339  ...   -9.9120  -9.5209  -9.7802
[torch.cuda.FloatTensor of size 1x2925 (GPU 0)]

target_variable[di]
Variable containing:
 75
[torch.cuda.LongTensor of size 1 (GPU 0)]

decoder_output
Variable containing:
-8.6160 -5.3293 -2.4247  ...  -8.6017 -8.2443 -8.7730
[torch.cuda.FloatTensor of size 1x2925 (GPU 0)]

target_variable[di]
Variable containing:
 15
[torch.cuda.LongTensor of size 1 (GPU 0)]

decoder_output
Variable containing:
-8.5765 -4.0648 -3.6644  ...  -8.5171 -8.1059 -8.6265
[torch.cuda.FloatTensor of size 1x2925 (GPU 0)]

target_variable[di]
Variable containing:
 42
[torch.cuda.LongTensor of size 1 (GPU 0)]

decoder_output
Variable containing:
-8.5386 -3.7150 -4.4274  ...  -8.5277 -8.0894 -8.5634
[torch.cuda.FloatTensor of size 1x2925 (GPU 0)]

target_variable[di]
Variable containing:
 487
[torch.cuda.LongTensor of size 1 (GPU 0)]

decoder_output
Variable containing:
-8.6029 -3.4986 -5.0389  ...  -8.5160 -8.1080 -8.

decoder_output
Variable containing:
-10.2079  -8.4810  -0.5949  ...  -10.3650  -9.9121 -10.2111
[torch.cuda.FloatTensor of size 1x2925 (GPU 0)]

target_variable[di]
Variable containing:
 2
[torch.cuda.LongTensor of size 1 (GPU 0)]

decoder_output
Variable containing:
 -9.7681  -7.6726  -3.3417  ...  -10.0872  -9.6740 -10.2034
[torch.cuda.FloatTensor of size 1x2925 (GPU 0)]

target_variable[di]
Variable containing:
 3
[torch.cuda.LongTensor of size 1 (GPU 0)]

decoder_output
Variable containing:
-8.4442 -5.2064 -3.7807  ...  -8.5622 -8.1983 -8.6716
[torch.cuda.FloatTensor of size 1x2925 (GPU 0)]

target_variable[di]
Variable containing:
 367
[torch.cuda.LongTensor of size 1 (GPU 0)]

decoder_output
Variable containing:
-8.3148 -4.2739 -4.5177  ...  -8.3168 -7.8990 -8.3013
[torch.cuda.FloatTensor of size 1x2925 (GPU 0)]

target_variable[di]
Variable containing:
 2
[torch.cuda.LongTensor of size 1 (GPU 0)]

decoder_output
Variable containing:
 -8.6931  -4.8498  -4.8836  ...   -8.9376  -8.

decoder_output
Variable containing:
 -9.9500  -8.7847  -1.3075  ...  -10.0542  -9.5885  -9.9483
[torch.cuda.FloatTensor of size 1x2925 (GPU 0)]

target_variable[di]
Variable containing:
 77
[torch.cuda.LongTensor of size 1 (GPU 0)]

decoder_output
Variable containing:
-8.6691 -6.0952 -3.1320  ...  -8.6784 -8.2922 -8.9006
[torch.cuda.FloatTensor of size 1x2925 (GPU 0)]

target_variable[di]
Variable containing:
 78
[torch.cuda.LongTensor of size 1 (GPU 0)]

decoder_output
Variable containing:
-8.5536 -4.5997 -4.2770  ...  -8.5518 -8.1880 -8.8089
[torch.cuda.FloatTensor of size 1x2925 (GPU 0)]

target_variable[di]
Variable containing:
 11
[torch.cuda.LongTensor of size 1 (GPU 0)]

decoder_output
Variable containing:
-8.6701 -3.8171 -5.1186  ...  -8.5051 -8.1829 -8.7049
[torch.cuda.FloatTensor of size 1x2925 (GPU 0)]

target_variable[di]
Variable containing:
 4
[torch.cuda.LongTensor of size 1 (GPU 0)]

decoder_output
Variable containing:
 -9.4386  -1.0081  -5.6001  ...   -9.2227  -8.8882 

KeyboardInterrupt: 

In [21]:
evaluateRandomly(encoder1, attn_decoder1)

> elle apprend a jouer du piano .
= she is learning the piano .
< she is learning the piano . <EOS>

> elle coud une robe .
= she is sewing a dress .
< she is knitting in dress . <EOS>

> il sait jouer de la guitare .
= he is able to play the guitar .
< he is eager to the the . <EOS>

> je suppose que c est ton pere .
= i m assuming this is your father .
< i m assuming this is your father . <EOS>

> je n ecarte pas cette possibilite .
= i m not discounting that possibility .
< i m not that that girl . <EOS>

> nous sommes en train de lire .
= we re reading .
< we re reading . <EOS>

> vous reagissez de maniere excessive .
= you re overreacting .
< you re overreacting . <EOS>

> je ne tiens pas en place .
= i m restless .
< i m not in . <EOS>

> nous sommes en retard .
= we are late .
< we re late . <EOS>

> vous n etes pas comme moi .
= you re not like me .
< you re not like me . <EOS>



Visualizing Attention
---------------------

A useful property of the attention mechanism is its highly interpretable
outputs. Because it is used to weight specific encoder outputs of the
input sequence, we can see where the network focuses most at each time
step.

You could simply run ``plt.matshow(attentions)`` to see the attention
output displayed as a matrix, with columns corresponding to input steps
and rows to output steps:




In [None]:
output_words, attentions = evaluate(
    encoder1, attn_decoder1, "je suis trop froid .")
plt.matshow(attentions.numpy())
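
The same matrix can also be read numerically: for each output step (row), find the input step (column) with the largest weight. The following is a minimal sketch of that idea. ``strongest_alignments`` is a hypothetical helper that is not part of this tutorial, and the matrix holds toy values invented for illustration, not output from a trained model. With the real ``attentions`` tensor, something like ``attentions.tolist()`` could supply the rows, assuming it is a 2-D tensor.

```python
def strongest_alignments(input_words, output_words, attentions):
    """For each output word, return (output word, most-attended input word, weight).

    attentions is a list of rows (one per output step); each row holds one
    weight per input step. Hypothetical helper, for illustration only.
    """
    pairs = []
    for out_word, row in zip(output_words, attentions):
        # Keep only the columns that correspond to actual input steps
        weights = list(row[:len(input_words)])
        best = max(range(len(weights)), key=weights.__getitem__)
        pairs.append((out_word, input_words[best], weights[best]))
    return pairs


# Toy values for illustration only (not produced by a trained model)
input_words = ["je", "suis", "froid", ".", "<EOS>"]
output_words = ["i", "m", "cold", ".", "<EOS>"]
toy_attentions = [
    [0.70, 0.20, 0.05, 0.03, 0.02],
    [0.15, 0.65, 0.10, 0.05, 0.05],
    [0.05, 0.10, 0.75, 0.05, 0.05],
    [0.02, 0.03, 0.10, 0.80, 0.05],
    [0.02, 0.03, 0.05, 0.10, 0.80],
]

for out_word, in_word, weight in strongest_alignments(
        input_words, output_words, toy_attentions):
    print(out_word, "<-", in_word, weight)
```

A row whose largest weight is small, or whose weights are spread evenly, suggests the decoder did not rely on any single input word at that step.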

For a better viewing experience we will do the extra work of adding axes
and labels:




In [None]:
def showAttention(input_sentence, output_words, attentions):
    # Set up figure with colorbar
    fig = plt.figure()
    ax = fig.add_subplot(111)
    cax = ax.matshow(attentions.numpy(), cmap='bone')
    fig.colorbar(cax)

    # Set up axes
    ax.set_xticklabels([''] + input_sentence.split(' ') +
                       ['<EOS>'], rotation=90)
    ax.set_yticklabels([''] + output_words)

    # Show label at every tick
    ax.xaxis.set_major_locator(ticker.MultipleLocator(1))
    ax.yaxis.set_major_locator(ticker.MultipleLocator(1))

    plt.show()


def evaluateAndShowAttention(input_sentence):
    output_words, attentions = evaluate(
        encoder1, attn_decoder1, input_sentence)
    print('input =', input_sentence)
    print('output =', ' '.join(output_words))
    showAttention(input_sentence, output_words, attentions)


evaluateAndShowAttention("elle a cinq ans de moins que moi .")

evaluateAndShowAttention("elle est trop petit .")

evaluateAndShowAttention("je ne crains pas de mourir .")

evaluateAndShowAttention("c est un jeune directeur plein de talent .")

Exercises
=========

-  Try with a different dataset

   -  Another language pair
   -  Human → Machine (e.g. IoT commands)
   -  Chat → Response
   -  Question → Answer

-  Replace the embeddings with pre-trained word embeddings such as word2vec or
   GloVe
-  Try with more layers, more hidden units, and more sentences. Compare
   the training time and results.
-  If you use a translation file in which each pair contains the same
   phrase twice (``I am test \t I am test``), you can train the network
   as an autoencoder. Try this:

   -  Train as an autoencoder
   -  Save only the Encoder network
   -  Train a new Decoder for translation from there
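
For the autoencoder exercise, the data file needs lines where both sides of the tab are identical. One possible way to build such lines from an existing translation file is sketched below. It assumes each input line is a tab-separated pair, and ``to_autoencoder_lines`` is a hypothetical helper name, not something defined in this tutorial.

```python
def to_autoencoder_lines(lines, column=0):
    """Turn tab-separated translation pairs into (phrase, phrase) pairs.

    column selects which side of each pair to duplicate (0 = first phrase).
    Hypothetical helper; assumes one tab-separated pair per line.
    """
    out = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        phrase = line.split("\t")[column]
        out.append(phrase + "\t" + phrase)
    return out


# Small in-memory sample standing in for the lines of a translation file
sample = [
    "I am cold.\tJ'ai froid.\n",
    "We are late.\tNous sommes en retard.\n",
]
for line in to_autoencoder_lines(sample):
    print(line)
```

Writing the returned lines to a new file, one per line, would give a dataset in the ``phrase \t phrase`` form described above.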


