# Simple RNN Encode-Decoder for Translation

**Learning Objectives**
1. Learn how to create a tf.data.Dataset for seq2seq problems
1. Learn how to train an encoder-decoder model in Keras
1. Learn how to save the encoder and the decoder as separate models 
1. Learn how to piece together the trained encoder and decoder into a translation function
1. Learn how to use the BLUE score to evaluate a translation model

## Introduction

In this lab we'll build a translation model from Spanish to English using an encoder-decoder model architecture. We'll benchmark our results using the industry standard BLEU score then deploy for online prediction using an AI Platform custom prediction routine.

In [2]:
import os
import pickle
import sys

import nltk
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
import tensorflow as tf
from tensorflow.keras.layers import (
    Dense,
    Embedding,
    GRU,
    Input,
)
from tensorflow.keras.models import (
    load_model,
    Model,
)

import utils_preproc

print(tf.__version__)

2.0.0


In [3]:
SEED = 0
MODEL_PATH = 'translate_models/baseline'
DATA_URL = 'http://storage.googleapis.com/download.tensorflow.org/data/spa-eng.zip'
LOAD_CHECKPOINT = False

In [4]:
tf.random.set_seed(SEED)

## Downloading the Data

We'll use a language dataset provided by http://www.manythings.org/anki/. The dataset contains Spanish-English  translation pairs in the format:

```
May I borrow this book?	¿Puedo tomar prestado este libro?
```

The dataset is a curated list of 120K translation pairs from http://tatoeba.org/, a platform for community contributed translations by native speakers.

In [24]:
path_to_zip = tf.keras.utils.get_file(
    'spa-eng.zip', origin=DATA_URL, extract=True)

path_to_file = os.path.join(
    os.path.dirname(path_to_zip),
    "spa-eng/spa.txt"
)
print("Translation data stored at:", path_to_file)

Translation data stored at: /Users/dherin/.keras/datasets/spa-eng/spa.txt


In [25]:
data = pd.read_csv(
    path_to_file, sep='\t', header=None, names=['english', 'spanish'])

In [26]:
data.sample(3)

Unnamed: 0,english,spanish
105399,Is there anything to drink in the refrigerator?,¿Hay algo para beber en el refrigerador?
92723,I cut my little finger peeling potatoes.,Me hice un corte en el meñique pelando papas.
48227,Holland is a small country.,Holanda es un país pequeño.


From the `utils_preproc` package we have written for you,
we will use the following functions to pre-process our dataset of sentence pairs.

## Sentence Preprocessing

The `utils_preproc.preprocess_sentence()` method does the following:
1. Converts sentence lower case
2. Adds a space between puncation and words
3. Replaces tokens that aren't a-z or punctation with space
4. Adds `<start>` and `<end>` tokens

For example:

In [27]:
raw = [
    "No estamos comiendo.",
    "Está llegando el invierno.",
    "El invierno se acerca.",
    "Tom no comio nada.",
    "Su pierna mala le impidió ganar la carrera.",
    "Su respuesta es erronea.",
    "¿Qué tal si damos un paseo después del almuerzo?"
]

In [28]:
processed = [utils_preproc.preprocess_sentence(s) for s in raw]
processed

['<start> no estamos comiendo . <end>',
 '<start> esta llegando el invierno . <end>',
 '<start> el invierno se acerca . <end>',
 '<start> tom no comio nada . <end>',
 '<start> su pierna mala le impidio ganar la carrera . <end>',
 '<start> su respuesta es erronea . <end>',
 '<start> ¿ que tal si damos un paseo despues del almuerzo ? <end>']

## Sentence Integerizing

The `utils_preproc.tokenize()` method does the following:
    
1. Splits each sentence into a token list
1. Maps each token to an integer
1. Pads to length of longest sentence 

It returns an instance of a [Keras Tokenizer](https://keras.io/preprocessing/text/)
containing the token-integer mapping along with the integerized sentences:

In [29]:
integerized, tokenizer = utils_preproc.tokenize(processed)
integerized

array([[ 1,  4,  8,  9,  3,  2,  0,  0,  0,  0,  0,  0,  0],
       [ 1, 10, 11,  5,  6,  3,  2,  0,  0,  0,  0,  0,  0],
       [ 1,  5,  6, 12, 13,  3,  2,  0,  0,  0,  0,  0,  0],
       [ 1, 14,  4, 15, 16,  3,  2,  0,  0,  0,  0,  0,  0],
       [ 1,  7, 17, 18, 19, 20, 21, 22, 23,  3,  2,  0,  0],
       [ 1,  7, 24, 25, 26,  3,  2,  0,  0,  0,  0,  0,  0],
       [ 1, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37,  2]], dtype=int32)

The outputed tokenizer can be used to get back the actual works
from the integers representing them:

In [23]:
tokenizer.sequences_to_texts(integerized)

['<start> no estamos comiendo . <end>',
 '<start> esta llegando el invierno . <end>',
 '<start> el invierno se acerca . <end>',
 '<start> tom no comio nada . <end>',
 '<start> su pierna mala le impidio ganar la carrera . <end>',
 '<start> su respuesta es erronea . <end>',
 '<start> ¿ que tal si damos un paseo despues del almuerzo ? <end>']

## Creating the tf.data.Dataset

### `load_and_preprocess`

Let's first implement a function that will read the raw sentence-pair file
and preprocess the sentences with `utils_preproc.preprocess_sentence`.

The `load_and_preprocess` function takes as input
- the path where the sentence-pair file is located
- the number of examples one wants to read in

It returns a tuple whose first component contains the english
preprocessed sentences, while the second component contains the
spanish ones:

In [40]:
def load_and_preprocess(path, num_examples):
    with open(path_to_file, 'r') as fp:
        lines = fp.read().strip().split('\n')

    # TODO 1a
    sentence_pairs = [ 
        [utils_preproc.preprocess_sentence(sent) for sent in line.split('\t')]
        for line in lines[:num_examples]
    ]

    return zip(*sentence_pairs)

In [41]:
en, sp = load_and_preprocess(path_to_file, num_examples=10)

print(en[-1])
print(sp[-1])

<start> fire ! <end>
<start> incendio ! <end>


### `load_and_integerize`

Using `utils_preproc.tokenize`, let us now implement the function `load_and_integerize` that takes as input the data path along with the number of examples we want to read in and returns the following tuple:

```python
  (input_tensor, target_tensor, inp_lang_tokenizer, targ_lang_tokenizer)
```

where 


* `input_tensor` is an integer tensor of shape `(batch_size, max_length_inp)` containing the integerized versions of the source language sentences
* `target_tensor` is an integer tensor of shape `(batch_size, max_length_targ)` containing the integerized versions of the target language sentences
* `inp_lang_tokenizer` is the source language tokenizer
* `targ_lang_tokenizer` is the target language tokenizer

In [44]:
def load_and_integerize(path, num_examples=None):

    targ_lang, inp_lang = load_and_preprocess(path, num_examples)

    input_tensor, inp_lang_tokenizer = utils_preproc.tokenize(inp_lang)
    target_tensor, targ_lang_tokenizer = utils_preproc.tokenize(targ_lang)

    return input_tensor, target_tensor, inp_lang_tokenizer, targ_lang_tokenizer

### Train and eval splits

We'll split this data 80/20 into train and validation, and we'll use only the first 30K examples, Since we'll be training on a single GPU. 
 
Let us set variable for that:

In [49]:
TEST_PROP = 0.2
NUM_EXAMPLES = 30000

Now let's load our integerized data:

In [43]:
input_tensor, target_tensor, inp_lang, targ_lang = load_and_integerize(
    path_to_file, NUM_EXAMPLES)

Let us store the maximal sentence length of both languages into two variables:

In [13]:
max_length_targ = target_tensor.shape[1]
max_length_inp = input_tensor.shape[1]

We are now using scikit-learn `train_test_split` to create our splits:

In [14]:
splits = train_test_split(
    input_tensor, target_tensor, test_size=TEST_PROP, random_state=SEED)

input_tensor_train = splits[0]
input_tensor_val = splits[1]

target_tensor_train = splits[2]
target_tensor_val = splits[3]

In [15]:
(len(input_tensor_train), len(target_tensor_train),
 len(input_tensor_val), len(target_tensor_val))

(24000, 24000, 6000, 6000)

In [16]:
print("Input Language; int to word mapping")
print(input_tensor_train[0])
print(utils_preproc.int2word(inp_lang, input_tensor_train[0]), '\n')

print("Target Language; int to word mapping")
print(target_tensor_train[0])
print(utils_preproc.int2word(targ_lang, target_tensor_train[0]))

Input Language; int to word mapping
[  1 133  14 316   3   2   0   0   0   0   0   0   0   0   0   0]
['<start>', 'deja', 'de', 'leer', '.', '<end>', '', '', '', '', '', '', '', '', '', ''] 

Target Language; int to word mapping
[  1  86 341   3   2   0   0   0   0   0   0]
['<start>', 'stop', 'reading', '.', '<end>', '', '', '', '', '', '']


In [17]:
input_tensor_train[:2]

array([[  1, 133,  14, 316,   3,   2,   0,   0,   0,   0,   0,   0,   0,
          0,   0,   0],
       [  1,  49,   7,  19, 448,   3,   2,   0,   0,   0,   0,   0,   0,
          0,   0,   0]], dtype=int32)

In [18]:
target_tensor_train[:2]

array([[  1,  86, 341,   3,   2,   0,   0,   0,   0,   0,   0],
       [  1,  19,   8,  21, 519,   3,   2,   0,   0,   0,   0]],
      dtype=int32)

### Create tf.data dataset

Note how our labels are our reference translations shifted ahead by 1

In [19]:
def encoder_decoder_dataset(encoder_input, decoder_input):

    # shift ahead by 1
    target = tf.roll(decoder_input, -1, 1)

    # replace last column with 0s
    zeros = tf.zeros([target.shape[0], 1], dtype=tf.int32)
    target = tf.concat((target[:, :-1], zeros), axis=-1)

    dataset = tf.data.Dataset.from_tensor_slices(
        ((encoder_input, decoder_input), target))

    return dataset

In [20]:
BUFFER_SIZE = len(input_tensor_train)
BATCH_SIZE = 64
steps_per_epoch = len(input_tensor_train)//BATCH_SIZE
embedding_dim = 256
units = 1024
vocab_inp_size = len(inp_lang.word_index) + 1
vocab_tar_size = len(targ_lang.word_index) + 1

In [35]:
train_dataset = encoder_decoder_dataset(
    input_tensor_train, target_tensor_train).shuffle(
    BUFFER_SIZE).repeat().batch(BATCH_SIZE, drop_remainder=True)

eval_dataset = encoder_decoder_dataset(
    input_tensor_val, target_tensor_val).batch(
    BATCH_SIZE, drop_remainder=True)

## Model

We use an encoder-decoder decoder architecture, however we embed our words into a latent space prior to feeding them into the RNN.

In [36]:
encoder_inputs = Input(shape=(None,), name="encoder_input")

encoder_inputs_embedded = Embedding(
    input_dim=vocab_inp_size,
    output_dim=embedding_dim,
    input_length=max_length_inp)(encoder_inputs)

encoder_rnn = GRU(
     units=1024,
     return_sequences=True,
     return_state=True,
     recurrent_initializer='glorot_uniform')

encoder_outputs, encoder_state = encoder_rnn(encoder_inputs_embedded)

In [37]:
decoder_inputs = Input(shape=(None,), name="decoder_input")

decoder_inputs_embedded = Embedding(
    input_dim=vocab_tar_size,
    output_dim=embedding_dim,
    input_length=max_length_targ)(decoder_inputs)

decoder_rnn = GRU(
    units=1024,
    return_sequences=True,
    return_state=True,
    recurrent_initializer='glorot_uniform')

decoder_outputs, decoder_state = decoder_rnn(
    decoder_inputs_embedded, initial_state=encoder_state)

In [38]:
# Classifier (take each intermediate hidden state and predict word)

decoder_dense = Dense(vocab_tar_size, activation='softmax')

predictions = decoder_dense(decoder_outputs)

In [39]:
model = Model(inputs=[encoder_inputs, decoder_inputs], outputs=predictions)

model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')
model.summary()

Model: "model_3"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
encoder_input (InputLayer)      [(None, None)]       0                                            
__________________________________________________________________________________________________
decoder_input (InputLayer)      [(None, None)]       0                                            
__________________________________________________________________________________________________
embedding_2 (Embedding)         (None, None, 256)    2409984     encoder_input[0][0]              
__________________________________________________________________________________________________
embedding_3 (Embedding)         (None, None, 256)    1263360     decoder_input[0][0]              
____________________________________________________________________________________________

In [40]:
model.fit(
    train_dataset,
    steps_per_epoch=steps_per_epoch,
    validation_data=eval_dataset,
    epochs=10
)

Train for 375 steps, validate for 93 steps
Epoch 1/9
Epoch 2/9
Epoch 3/9
Epoch 4/9
Epoch 5/9
Epoch 6/9
Epoch 7/9
Epoch 8/9
Epoch 9/9


<tensorflow.python.keras.callbacks.History at 0x15da12da0>

## Prediction

We can't just use model.predict(), because we don't know all the inputs we used during training. We only know the encoder_input (source language) but not the decoder_input (target language).

We do however know the first token of the decoder input, which is the `<start>` token. So using this plus the state of the encoder RNN, we can predict the next token. We will then use that token to be the second token of decoder input, and continue like this until we predict the `<end>` token, or we reach some defined max length.

In [42]:
if LOAD_CHECKPOINT:
    encoder_model = load_model(os.path.join(MODEL_PATH, 'encoder_model.h5'))
    decoder_model = load_model(os.path.join(MODEL_PATH, 'decoder_model.h5'))

else:
    encoder_model = Model(inputs=encoder_inputs, outputs=encoder_state)

    decoder_state_input = Input(shape=(units,), name="decoder_state_input")

    # Reuses weights from the decoder_rnn layer
    decoder_outputs, decoder_state = decoder_rnn(
        decoder_inputs_embedded, initial_state=decoder_state_input)

    # Reuses weights from the decoder_dense layer
    predictions = decoder_dense(decoder_outputs)

    decoder_model = Model(
        inputs=[decoder_inputs, decoder_state_input],
        outputs=[predictions, decoder_state]
    )

In [60]:
def decode_sequences(input_seqs, output_tokenizer, max_decode_length=50):
    """
    Arguments:
    input_seqs: int tensor of shape (BATCH_SIZE, SEQ_LEN)
    output_tokenizer: Tokenizer used to conver from int to words

    Returns translated sentences
    """
    # Encode the input as state vectors.
    batch_size = input_seqs.shape[0]
    states_value = encoder_model.predict(input_seqs)

    # Populate the first character of target sequence with the start character.
    target_seq = tf.ones([batch_size, 1])

    decoded_sentences = [[] for _ in range(batch_size)]

    # Sampling loop for a batch of sequences
    for i in range(max_decode_length):

        output_tokens, decoder_state = decoder_model.predict(
            [target_seq, states_value])

        # Sample a token
        sampled_token_index = np.argmax(output_tokens[:, -1, :], axis=-1)

        tokens = utils_preproc.int2word(output_tokenizer, sampled_token_index)

        for j in range(batch_size):
            decoded_sentences[j].append(tokens[j])

        # Update the target sequence (of length 1).
        target_seq = tf.expand_dims(tf.constant(sampled_token_index), axis=-1)

        # Update states
        states_value = decoder_state

    return decoded_sentences

Now we're ready to predict!

In [61]:
sentences = [
    "No estamos comiendo.",
    "Está llegando el invierno.",
    "El invierno se acerca.",
    "Tom no comio nada.",
    "Su pierna mala le impidió ganar la carrera.",
    "Su respuesta es erronea.",
    "¿Qué tal si damos un paseo después del almuerzo?"
]

reference_translations = [
    "We're not eating.",
    "Winter is coming.",
    "Winter is coming.",
    "Tom ate nothing.",
    "His bad leg prevented him from winning the race.",
    "Your answer is wrong.",
    "How about going for a walk after lunch?"
]

machine_translations = decode_sequences(
    utils_preproc.preprocess(sentences, inp_lang),
    targ_lang,
    max_length_targ
)

for i in range(len(sentences)):
    print('-')
    print('INPUT:')
    print(sentences[i])
    print('REFERENCE TRANSLATION:')
    print(reference_translations[i])
    print('MACHINE TRANSLATION:')
    print(machine_translations[i])

-
INPUT:
No estamos comiendo.
REFERENCE TRANSLATION:
We're not eating.
MACHINE TRANSLATION:
['we', 're', 'not', 'eating', '.', '<end>', '', '', '', '', '']
-
INPUT:
Está llegando el invierno.
REFERENCE TRANSLATION:
Winter is coming.
MACHINE TRANSLATION:
['winter', 'is', 'coming', '.', '<end>', '', '', '', '', '', '']
-
INPUT:
El invierno se acerca.
REFERENCE TRANSLATION:
Winter is coming.
MACHINE TRANSLATION:
['winter', 'is', 'coming', '.', '<end>', '', '', '', '', '', '']
-
INPUT:
Tom no comio nada.
REFERENCE TRANSLATION:
Tom ate nothing.
MACHINE TRANSLATION:
['tom', 'didn', 't', 'eat', 'lunch', '.', '<end>', '', '', '', '']
-
INPUT:
Su pierna mala le impidió ganar la carrera.
REFERENCE TRANSLATION:
His bad leg prevented him from winning the race.
MACHINE TRANSLATION:
['his', 'mother', 'is', 'out', '.', '<end>', '', '', '', '', '']
-
INPUT:
Su respuesta es erronea.
REFERENCE TRANSLATION:
Your answer is wrong.
MACHINE TRANSLATION:
['his', 'word', 'is', 'law', '.', '<end>', '', '', '', 

### Checkpoint Model

Save model artifacts

In [66]:
if not LOAD_CHECKPOINT:

    os.makedirs(MODEL_PATH, exist_ok=True)

    model.save(os.path.join(MODEL_PATH, 'model.h5'))
    encoder_model.save(os.path.join(MODEL_PATH, 'encoder_model.h5'))
    decoder_model.save(os.path.join(MODEL_PATH, 'decoder_model.h5'))

    with open(os.path.join(MODEL_PATH, 'encoder_tokenizer.pkl'), 'wb') as fp:
        pickle.dump(inp_lang, fp)

    with open(os.path.join(MODEL_PATH, 'decoder_tokenizer.pkl'), 'wb') as fp:
        pickle.dump(targ_lang, fp)

## Evaluation Metric (BLEU)

Unlike say, image classification, there is no one right answer for a machine translation. However our current loss metric, cross entropy, only gives credit when the machine translation matches the exact same word in the same order as the reference translation. 

Many attempts have been made to develop a better metric for natural language evaluation. The most popular currently is Bilingual Evaluation Understudy (BLEU).

- It is quick and inexpensive to calculate.
- It allows flexibility for the ordering of words and phrases.
- It is easy to understand.
- It is language independent.
- It correlates highly with human evaluation.
- It has been widely adopted.

The score is from 0 to 1, where 1 is an exact match.

It works by counting matching n-grams between the machine and reference texts, regardless of order. BLUE-4 counts matching n grams from 1-4 (1-gram, 2-gram, 3-gram and 4-gram). It is common to report both BLUE-1 and BLUE-4

It still is imperfect, since it gives no credit to synonyms and so human evaluation is still best when feasible. However BLEU is commonly considered the best among bad options for an automated metric.

The NLTK framework has an implementation that we will use.

We can't run calculate BLEU during training, because at that time the correct decoder input is used. Instead we'll calculate it now.

For more info: https://machinelearningmastery.com/calculate-bleu-score-for-text-python/

In [69]:
def bleu_1(reference, candidate):
    reference = list(filter(lambda x: x != '', reference))  # remove padding
    candidate = list(filter(lambda x: x != '', candidate))  # remove padding
    smoothing_function = nltk.translate.bleu_score.SmoothingFunction().method1
    return nltk.translate.bleu_score.sentence_bleu(
        reference, candidate, (1,), smoothing_function)

In [70]:
def bleu_4(reference, candidate):
    reference = list(filter(lambda x: x != '', reference))  # remove padding
    candidate = list(filter(lambda x: x != '', candidate))   # remove padding
    smoothing_function = nltk.translate.bleu_score.SmoothingFunction().method1
    return nltk.translate.bleu_score.sentence_bleu(
        reference, candidate, (.25, .25, .25, .25), smoothing_function)

This takes ~ 5min to run, the bulk of which is decoding the 6000 sentences in the validation set.

In [54]:
%%time
num_examples = len(input_tensor_val)
bleu_1_total = 0
bleu_4_total = 0


for idx in range(num_examples):

    reference_sentence = utils_preproc.int2word(
        targ_lang, target_tensor_val[idx][1:])

    decoded_sentence = decode_sequences(
        input_tensor_val[idx:idx+1], targ_lang, max_length_targ)[0]

    bleu_1_total += bleu_1(reference_sentence, decoded_sentence)
    bleu_4_total += bleu_4(reference_sentence, decoded_sentence)

print('BLEU 1: {}'.format(bleu_1_total/num_examples))
print('BLEU 4: {}'.format(bleu_4_total/num_examples))

BLEU 1: 0.25360982204741905
BLEU 4: 0.04659966638405097
CPU times: user 8min 46s, sys: 26.6 s, total: 9min 13s
Wall time: 7min 39s


## Results

**Hyperparameters**

- Batch_Size: 64
- Optimizer: adam
- Embed_dim: 256
- GRU Units: 1024
- Train Examples: 24,000
- Epochs: 10
- Hardware: P100 GPU

**Performance**
- Training Time: 5min 
- Cross-entropy loss: train: 0.0722 - val: 0.9062
- BLEU 1: 0.2519574312515255
- BLEU 4: 0.04589972764144636

## Deploy

See `translate_deploy.ipynb`

### References

- Francois Chollet: https://github.com/keras-team/keras/blob/master/examples/lstm_seq2seq.py
