# Neural Machine Translation with Attention

This notebook trains a sequence-to-sequence (seq2seq) model for Spanish-to-English translation using [tf.keras](https://www.tensorflow.org/programmers_guide/keras) and [eager execution](https://www.tensorflow.org/programmers_guide/eager). This is an advanced example that assumes some knowledge of sequence-to-sequence models.

After training the model in this notebook, you will be able to enter a Spanish sentence, such as *"¿todavia estan en casa?"*, and get back the English translation: *"are you still at home?"*

The translation quality is reasonable for a toy example, but the generated attention plot is perhaps more interesting. It shows which parts of the input sentence have the model's attention while translating:

![](./images/spanish-english-attention-plot.png)

Note: This example takes approximately 10 minutes to run on a single P100 GPU.

In [1]:
from __future__ import absolute_import, division, print_function

import tensorflow as tf
import tensorflow.contrib.eager as tfe

tfe.enable_eager_execution()

import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

import unicodedata
import re
import numpy as np
import os
import time

print(tf.__version__)

ModuleNotFoundError: No module named 'matplotlib'

## Download and prepare the dataset

We'll use a language dataset provided by http://www.manythings.org/anki/. This dataset contains language translation pairs in the format:

```
May I borrow this book?	¿Puedo tomar prestado este libro?
```
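
Each line of the file holds an English sentence and its Spanish translation separated by a single tab. A minimal sketch of splitting one such line, assuming exactly two fields per line; `parse_pair` is a hypothetical helper and not part of the notebook:

```python
def parse_pair(line):
    """Split one tab-separated line into (english, spanish).

    Assumes the line has exactly two tab-separated fields.
    """
    # Only the trailing newline is removed, so the sentences stay untouched.
    english, spanish = line.rstrip("\n").split("\t")
    return english, spanish


sample = "May I borrow this book?\t¿Puedo tomar prestado este libro?\n"
pair = parse_pair(sample)
print(pair)
```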

There are a variety of languages available, but we'll use the English-Spanish dataset. For convenience, we've hosted a copy of this dataset on Google Cloud, but you can also download your own copy. After downloading the dataset, here are the steps we'll take to prepare the data:

1. Add a *start* and *end* token to each sentence.
2. Clean the sentences by removing special characters.
3. Create a word index and reverse word index (dictionaries mapping from word → id and id → word).
4. Pad each sentence to a maximum length.

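In terms of the code below, the preprocessing is:
- Add `<start>` and `<end>` to each sentence
- Remove the special characters from each sentence
- Create `word2index` and `index2word`
- Pad each sentence to the maximum length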

In [2]:
path_to_zip = tf.keras.utils.get_file("spa-end.zip", 
                                     origin='http://download.tensorflow.org/data/spa-eng.zip',
                                     extract=True)
path_to_zip
# /home/panxie/.keras/datasets/spa-eng/spa.txt

'/home/panxie/.keras/datasets/spa-end.zip'

In [3]:
path_to_file = os.path.dirname(path_to_zip)+"/spa-eng/spa.txt"
path_to_file

'/home/panxie/.keras/datasets/spa-eng/spa.txt'

In [4]:
with open(path_to_file, "rb") as f:
    for i,line in enumerate(f):
        if i > 5:
            break
        print(line)

b'Go.\tVe.\n'
b'Go.\tVete.\n'
b'Go.\tVaya.\n'
b'Go.\tV\xc3\xa1yase.\n'
b'Hi.\tHola.\n'
b'Run!\t\xc2\xa1Corre!\n'
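
Because the file was opened in binary mode, non-ASCII characters show up as escaped UTF-8 bytes. A short sketch, using one of the lines printed above, of how decoding recovers the text:

```python
raw = b'Go.\tV\xc3\xa1yase.\n'

# Decode the UTF-8 bytes, drop the newline, and split on the tab separator.
text = raw.decode("utf-8")
english, spanish = text.strip().split("\t")
print(english, spanish)
```

This is why `create_dataset` later opens the file in text mode with `encoding='UTF-8'`.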


In [5]:
# strip accents: NFD-decompose the string, then drop the combining marks (category 'Mn')
def unicode_to_ascii(s):
    return "".join(c for c in unicodedata.normalize('NFD', s) 
                   if unicodedata.category(c) != 'Mn')
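
To see why this works, note that NFD normalization splits an accented letter into a base letter followed by a combining mark, and the combining mark has Unicode category `Mn`. A small sketch of that behaviour; `strip_accents` is an illustrative stand-in that mirrors the function above:

```python
import unicodedata


def strip_accents(s):
    # Keep every character except combining marks left over from NFD.
    return "".join(c for c in unicodedata.normalize("NFD", s)
                   if unicodedata.category(c) != "Mn")


# U+00E1 (a with acute) decomposes into 'a' plus the combining acute accent.
decomposed = unicodedata.normalize("NFD", "\u00e1")
print(len(decomposed), [unicodedata.category(c) for c in decomposed])

cleaned = strip_accents("¿Todavía están en casa?")
print(cleaned)
```

Despite the function's name, the result is not guaranteed to be pure ASCII: characters such as `¿` have no decomposition and pass through unchanged.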

In [6]:
# preprocessing
def preprocess_sentence(w):
    w = unicode_to_ascii(w.lower().strip())
    
    # creating a space between a word and the punctuation following it
    # eg: "he is a boy." => "he is a boy ." 
    w = re.sub(r"([?.!,¿])", r" \1 ",w)
    w = re.sub(r'[" "]+', " ", w)
    
    # replacing every run of characters other than (a-z, A-Z, ".", "?", "!", ",", "¿") with a single space
    w = re.sub(r"[^a-zA-Z?.!,¿]+", " ", w)
    w = w.strip()
    
    w = '<start> ' + w + ' <end>'
    return w

In [7]:
string = "xiepan has $60. Chuyan has ￥40#."
preprocess_sentence(string)

'<start> xiepan has . chuyan has . <end>'
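
The punctuation-spacing step can be seen in isolation on the example sentence from the introduction. A minimal sketch using only `re`; the variable names are illustrative:

```python
import re

sentence = "¿todavia estan en casa?"

# Surround each punctuation mark with spaces so it becomes its own token.
spaced = re.sub(r"([?.!,¿])", r" \1 ", sentence)
# Collapse runs of spaces and trim both ends.
spaced = re.sub(r" +", " ", spaced).strip()

tokens = spaced.split(" ")
print(tokens)
```

Keeping exactly one space between tokens matters because the vocabulary is later built with `split(' ')`, which would otherwise produce empty-string tokens.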

In [8]:
# 1. Remove the accents
# 2. Clean the sentences
# 3. Return word pairs in the format: [ENGLISH, SPANISH]
def create_dataset(path, num_examples):
    lines = []
    with open(path, encoding='UTF-8') as f:
        for i, line in enumerate(f):
            if i == num_examples:
                break
            line = line.strip()
            lines.append(line)
    word_pairs = [[preprocess_sentence(w) for w in l.split('\t')] for l in lines]
    return word_pairs

In [9]:
word_pairs = create_dataset(path_to_file, 3)
print(word_pairs)

[['<start> go . <end>', '<start> ve . <end>'], ['<start> go . <end>', '<start> vete . <end>'], ['<start> go . <end>', '<start> vaya . <end>']]


In [10]:
# This class creates a word -> index mapping (e.g., "dad" -> 5) and vice-versa 
# (e.g., 5 -> "dad") for each language.
class LanguageIndex():
    def __init__(self, lang):
        self.lang = lang
        self.word2index = {}
        self.index2word = {}
        self.vocab = set()
        
        self.create_index()
        
    def create_index(self):
        for phrase in self.lang:
            self.vocab.update(phrase.split(' '))
        self.vocab = sorted(self.vocab)
        
        self.word2index['<pad>'] = 0
        for index, word in enumerate(self.vocab):
            self.word2index[word] = index + 1
        for word, index in self.word2index.items():
            self.index2word[index] = word
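
The same indexing scheme can be traced by hand on two of the cleaned phrases shown earlier. This is a standalone sketch that mirrors the logic of `create_index` without using the class:

```python
phrases = ["<start> go . <end>", "<start> vete . <end>"]

# Collect the vocabulary and sort it so that ids are deterministic.
vocab = sorted({word for phrase in phrases for word in phrase.split(" ")})

# Id 0 is reserved for padding, so real words start at 1.
word2index = {"<pad>": 0}
for index, word in enumerate(vocab):
    word2index[word] = index + 1
index2word = {index: word for word, index in word2index.items()}

# Encode a phrase to ids and decode it back again.
encoded = [word2index[w] for w in phrases[1].split(" ")]
decoded = " ".join(index2word[i] for i in encoded)
print(word2index)
print(encoded, "->", decoded)
```

Reserving id 0 for `<pad>` means the zeros added by post-padding never collide with a real word.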

In [15]:
def max_length(tensor):
    return max(len(t) for t in tensor)

def load_dataset(path, num_examples):
    # creating cleaned input, output pairs
    pairs = create_dataset(path, num_examples)
    
    # index language using the class defined above
    input_lang = LanguageIndex(sp for en, sp in pairs)  # self.lang is a generator, consumed once by create_index()
    target_lang = LanguageIndex(en for en, sp in pairs)
    
    # Spanish sentences
    input_tensor = [[input_lang.word2index[s] for s in sp.split(' ')] for en, sp in pairs]
    
    # English sentences
    target_tensor = [[target_lang.word2index[s] for s in en.split(' ')] for en, sp in pairs]
    
    # Calculate max_length of input and output tensor
    # Here, we'll set those to the longest sentence in the dataset
    max_length_inp, max_length_tar = max_length(input_tensor), max_length(target_tensor)
    
    # padding the input and output tensor to the maximum length
    input_tensor = tf.keras.preprocessing.sequence.pad_sequences(input_tensor,
                                                                maxlen=max_length_inp,
                                                                padding='post')
    target_tensor = tf.keras.preprocessing.sequence.pad_sequences(target_tensor, 
                                                                  maxlen=max_length_tar, 
                                                                  padding='post')
    return input_tensor, target_tensor, input_lang, target_lang, max_length_inp, max_length_tar
    

In [20]:
# Try experimenting with the size of that dataset
num_examples = 30000
input_tensor, target_tensor, input_lang, target_lang, max_length_inp, max_length_targ = load_dataset(path_to_file, num_examples)

In [21]:
input_tensor.shape, target_tensor.shape, input_lang, target_lang, max_length_inp, max_length_targ

((30000, 16),
 (30000, 15),
 <__main__.LanguageIndex at 0x7fc0989a7550>,
 <__main__.LanguageIndex at 0x7fc09ff73ba8>,
 16,
 15)
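
The shapes above show every Spanish sentence padded to 16 ids and every English sentence to 15. A pure-Python sketch of what `padding='post'` does to a list of id sequences; `pad_post` is a hypothetical helper for illustration, not the Keras implementation:

```python
def pad_post(sequences, maxlen, value=0):
    """Right-pad each id list with `value` up to `maxlen`.

    Assumes no sequence is longer than `maxlen`, which holds when
    `maxlen` is the length of the longest sequence.
    """
    return [seq + [value] * (maxlen - len(seq)) for seq in sequences]


batch = [[3, 5, 1, 2], [3, 4, 6, 7, 1, 2]]
maxlen = max(len(seq) for seq in batch)
padded = pad_post(batch, maxlen)
print(padded)
```

The padding value 0 is the id assigned to `<pad>` in `LanguageIndex`.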

In [25]:
# Creating training and validation sets using an 80-20 split
input_tensor_train, input_tensor_val, target_tensor_train, target_tensor_val = train_test_split(input_tensor, target_tensor, test_size=0.2)

# Show length
len(input_tensor_train), len(target_tensor_train), len(input_tensor_val), len(target_tensor_val)

(24000, 24000, 6000, 6000)
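
The important property of the split is that inputs and targets are shuffled with the same permutation, so each Spanish sentence stays paired with its English translation. A plain-Python sketch of that idea on toy data; `split_pairs` and its `seed` parameter are hypothetical, and this is not how scikit-learn implements `train_test_split`:

```python
import random


def split_pairs(inputs, targets, test_size=0.2, seed=0):
    """Shuffle indices once and use them to split both lists consistently."""
    indices = list(range(len(inputs)))
    random.Random(seed).shuffle(indices)
    n_val = int(len(indices) * test_size)
    val_idx, train_idx = indices[:n_val], indices[n_val:]
    return ([inputs[i] for i in train_idx], [inputs[i] for i in val_idx],
            [targets[i] for i in train_idx], [targets[i] for i in val_idx])


inputs = list(range(10))              # stand-ins for Spanish id sequences
targets = [i * 100 for i in inputs]   # stand-ins for the matching English ones
in_train, in_val, tar_train, tar_val = split_pairs(inputs, targets)
print(len(in_train), len(in_val))
```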

In [27]:
BUFFER_SIZE = len(input_tensor_train)
BATCH_SIZE = 64
N_BATCH = BUFFER_SIZE//BATCH_SIZE
embedding_dim = 256
units = 1024
# the vocabulary sizes include the '<pad>' entry at index 0
vocab_inp_size = len(input_lang.word2index)
vocab_tar_size = len(target_lang.word2index)

dataset = tf.data.Dataset.from_tensor_slices((input_tensor_train, target_tensor_train)).shuffle(BUFFER_SIZE)
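# batch(BATCH_SIZE, drop_remainder=True) raises "TypeError: batch() got an unexpected
# keyword argument 'drop_remainder'" on TensorFlow releases that predate that keyword,
# so the last partial batch is dropped with the tf.contrib.data transformation instead
dataset = dataset.apply(tf.contrib.data.batch_and_drop_remainder(BATCH_SIZE))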
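
`N_BATCH` uses integer division, so it counts only full batches, which matches what dropping the remainder does. A plain-Python sketch of drop-remainder batching on toy data; `batches_drop_remainder` is a hypothetical helper and not a TensorFlow API:

```python
def batches_drop_remainder(items, batch_size):
    """Yield consecutive full batches; a short final batch is discarded."""
    n_full = len(items) // batch_size
    for i in range(n_full):
        yield items[i * batch_size:(i + 1) * batch_size]


items = list(range(10))
batches = list(batches_drop_remainder(items, 4))
print(batches)  # items 8 and 9 do not fill a batch and are dropped

# Full batches per epoch for the sizes used in this notebook.
n_batch = 24000 // 64
print(n_batch)
```

Dropping the remainder keeps every batch the same shape, which simplifies code that assumes a fixed batch dimension.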