# 1 - Sequence to Sequence Learning with Neural Networks

![image.png](attachment:3a598937-feab-478b-b4af-75ea330ecbde.png)

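The figure above shows an example translation. The input/source sentence, "guten morgen", is passed through the embedding layer (yellow) and then fed into the encoder (green). We also append a *start of sequence* (`<sos>`) token and an *end of sequence* (`<eos>`) token to the start and end of the sentence, respectively. At each time step, the input to the encoder RNN is both the embedding of the current word, $e(x_t)$, and the hidden state from the previous time step, $h_{t-1}$; the encoder RNN then outputs a new hidden state, $h_t$. We can think of the hidden state as a vector representation of the sentence so far. The RNN can be written as a function of $e(x_t)$ and $h_{t-1}$: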

$h_t = \text{EncoderRNN}(e(x_t), h_{t-1})$
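
To make the recurrence concrete, here is a minimal, library-free sketch of an encoder stepping over "<sos> guten morgen <eos>". Several things here are assumptions for illustration only and do not come from the notebook: the cell is a plain tanh (Elman-style) RNN, the initial hidden state $h_0$ is all zeros, the weights are random rather than trained, and the names `encoder_rnn_step`, `encode`, `EMB_DIM` and `HID_DIM` are hypothetical. In PyTorch the same roles would be played by `nn.Embedding` and a recurrent layer such as `nn.RNN`.

```python
import math
import random

random.seed(0)  # fixed seed so the sketch is reproducible

EMB_DIM, HID_DIM = 4, 3  # toy sizes, chosen only for illustration

# Toy source vocabulary for "<sos> guten morgen <eos>"
vocab = {"<sos>": 0, "guten": 1, "morgen": 2, "<eos>": 3}

def rand_matrix(rows, cols):
    """Random weights; a real model would learn these during training."""
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

embedding = rand_matrix(len(vocab), EMB_DIM)  # e(x): one row per token
W_e = rand_matrix(HID_DIM, EMB_DIM)           # input-to-hidden weights
W_h = rand_matrix(HID_DIM, HID_DIM)           # hidden-to-hidden weights
bias = [0.0] * HID_DIM

def encoder_rnn_step(e_t, h_prev):
    """One step of h_t = tanh(W_e e(x_t) + W_h h_{t-1} + b) (assumed cell)."""
    h_t = []
    for i in range(HID_DIM):
        total = bias[i]
        total += sum(W_e[i][j] * e_t[j] for j in range(EMB_DIM))
        total += sum(W_h[i][j] * h_prev[j] for j in range(HID_DIM))
        h_t.append(math.tanh(total))
    return h_t

def encode(tokens):
    """Run the recurrence over a token list and return every hidden state."""
    h = [0.0] * HID_DIM  # h_0: assumed to be all zeros
    states = []
    for tok in tokens:
        h = encoder_rnn_step(embedding[vocab[tok]], h)
        states.append(h)
    return states

states = encode(["<sos>", "guten", "morgen", "<eos>"])
context = states[-1]  # hidden state after the last token
print(len(states), len(context))  # prints: 4 3
```

Each call to `encoder_rnn_step` sees only the current embedding and the previous hidden state, so the last entry of `states` is the only one that has been influenced by every token in the sentence.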

In [1]:
import torch
import torch.nn as nn
import torch.optim as optim

from torchtext.legacy.datasets import Multi30k
from torchtext.legacy.data import Field, BucketIterator

import spacy
import numpy as np

import random
import math
import time

In [2]:
print(torch.__version__)

1.8.0


In [3]:
print(torch.cuda.is_available())

True


In [4]:
SEED = 1234

random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
torch.cuda.manual_seed(SEED)
torch.backends.cudnn.deterministic = True

In [5]:
spacy_de = spacy.load('de_core_news_sm')
spacy_en = spacy.load('en_core_web_sm')

In [6]:
def tokenize_de(text):
    """
    Tokenizes German text from a string into a list of strings
    """
    return [tok.text for tok in spacy_de.tokenizer(text)]

def tokenize_en(text):
    """
    Tokenizes English text from a string into a list of strings
    """
    return [tok.text for tok in spacy_en.tokenizer(text)]

Tokenize the input text into a list of tokens:

In [10]:
text = "hello world !"
print(tokenize_en(text))

['hello', 'world', '!']


torchtext's `Field` objects handle how data should be processed.

We set the `tokenize` argument to the correct tokenization function for each, with German being the `SRC` (source) field and English being the `TRG` (target) field. The field also appends the "start of sequence" and "end of sequence" tokens via the `init_token` and `eos_token` arguments, and converts all words to lowercase.

In [8]:
SRC = Field(tokenize = tokenize_de, 
            init_token = '<sos>', 
            eos_token = '<eos>', 
            lower = True)

TRG = Field(tokenize = tokenize_en, 
            init_token = '<sos>', 
            eos_token = '<eos>', 
            lower = True)
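
As a rough illustration of what these `Field` options do to a single sentence, the sketch below tokenizes, lowercases, and wraps the result in `<sos>`/`<eos>`. It is not torchtext's implementation: `preprocess` is a hypothetical helper, and whitespace splitting stands in for the spaCy tokenizers so that the example runs without any models installed.

```python
def preprocess(text, tokenize=str.split, init_token="<sos>",
               eos_token="<eos>", lower=True):
    """Approximate the Field options used above on one sentence."""
    tokens = tokenize(text)  # stand-in for tokenize_de / tokenize_en
    if lower:
        tokens = [tok.lower() for tok in tokens]
    # Wrap the sentence in start-of-sequence and end-of-sequence tokens
    return [init_token] + tokens + [eos_token]

print(preprocess("Guten Morgen !"))
# prints: ['<sos>', 'guten', 'morgen', '!', '<eos>']
```

Passing `tokenize=tokenize_de` instead of the default would use the spaCy-based tokenizer defined earlier.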

Next, we download and load the train, validation and test data.

The dataset we'll be using is the `Multi30k` dataset. This is a dataset with ~30,000 parallel English, German and French sentences, each with ~12 words per sentence.

`exts` specifies which languages to use as the source and target (source goes first) and `fields` specifies which field to use for the source and target.

In [9]:
train_data, valid_data, test_data = Multi30k.splits(exts = ('.de', '.en'), 
                                                    fields = (SRC, TRG))

FileNotFoundError: [Errno 2] No such file or directory: '.data\\multi30k\\train.de'
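
The traceback shows the loader looking for a file named `train.de` under `.data\multi30k`, which matches the way `exts` is described above: a shared path prefix combined with a source and a target extension. The sketch below illustrates that pairing idea on two throwaway files. It assumes plain-text files with one sentence per line, aligned by line number; `load_parallel` is a hypothetical helper and not part of torchtext.

```python
import os
import tempfile

def load_parallel(path_prefix, exts):
    """Pair line i of the source file with line i of the target file."""
    src_path, trg_path = (path_prefix + ext for ext in exts)
    with open(src_path, encoding="utf-8") as src_f, \
         open(trg_path, encoding="utf-8") as trg_f:
        return [(s.strip(), t.strip()) for s, t in zip(src_f, trg_f)]

# Demo on temporary files so the sketch runs without the real dataset.
with tempfile.TemporaryDirectory() as tmp:
    prefix = os.path.join(tmp, "train")
    with open(prefix + ".de", "w", encoding="utf-8") as f:
        f.write("guten morgen\nzwei hunde\n")
    with open(prefix + ".en", "w", encoding="utf-8") as f:
        f.write("good morning\ntwo dogs\n")
    pairs = load_parallel(prefix, exts=(".de", ".en"))

print(pairs)
# prints: [('guten morgen', 'good morning'), ('zwei hunde', 'two dogs')]
```

As with `exts = ('.de', '.en')` in the cell above, the source extension comes first, so each pair is `(German, English)`.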