# Intro

In this lab we'll build a translation model from Spanish to English using an encoder-decoder model architecture. We'll benchmark our results using the industry standard BLEU score then deploy for online predictio using an AI Platform custom prediction routine.

In [1]:
import os
import pickle
import io
import sys

import tensorflow as tf
import numpy as np
import nltk # for BLEU score
from sklearn.model_selection import train_test_split

import utils_preproc

print(tf.__version__) # 2.0.0-beta1

2.0.0-beta1


In [2]:
SEED=0
MODEL_PATH = 'translate_models/baseline'
LOAD_CHECKPOINT=False # True if you've already trained and don't want to re-train

## Download Data

We'll use a language dataset provided by http://www.manythings.org/anki/. The dataset contains Spanish-English  translation pairs in the format:

```
May I borrow this book?	¿Puedo tomar prestado este libro?
```

The dataset is a curated list of 120K translation pairs from http://tatoeba.org/, a platform for community contributed translations by native speakers.

In [3]:
path_to_zip = tf.keras.utils.get_file(
    'spa-eng.zip', origin='http://storage.googleapis.com/download.tensorflow.org/data/spa-eng.zip',
    extract=True)

path_to_file = os.path.dirname(path_to_zip)+"/spa-eng/spa.txt"

## Data Preprocessing

The `preprocess_sentence()` method does the following:
1. Converts sentence lower case
2. Adds a space between puncation and words
3. Replaces tokens that aren't a-z or punctation with space
4. Adds `<start>` and `<end>` tokens

The `tokenize()` method does the following:
    
1. Maps each word to an integer
2. Pads to length of longest sentence 

Note where each is being used in the subsequent cells

In [4]:
en_sentence = u"May I borrow this book?"
sp_sentence = u"¿Puedo tomar prestado este libro?"
print(utils_preproc.preprocess_sentence(en_sentence))
print(utils_preproc.preprocess_sentence(sp_sentence).encode('utf-8'))

<start> may i borrow this book ? <end>
b'<start> \xc2\xbf puedo tomar prestado este libro ? <end>'


In [5]:
def create_dataset(path, num_examples):
    lines = io.open(path, encoding='UTF-8').read().strip().split('\n')

    word_pairs = [[utils_preproc.preprocess_sentence(w) for w in l.split('\t')]  for l in lines[:num_examples]]

    return zip(*word_pairs)

In [6]:
en, sp = create_dataset(path_to_file, None)
print(en[-1])
print(sp[-1])

<start> if you want to sound like a native speaker , you must be willing to practice saying the same sentence over and over in the same way that banjo players practice the same phrase over and over until they can play it correctly and at the desired tempo . <end>
<start> si quieres sonar como un hablante nativo , debes estar dispuesto a practicar diciendo la misma frase una y otra vez de la misma manera en que un musico de banjo practica el mismo fraseo una y otra vez hasta que lo puedan tocar correctamente y en el tiempo esperado . <end>


In [7]:
def load_dataset(path, num_examples=None):
    # creating cleaned input, output pairs
    targ_lang, inp_lang = create_dataset(path, num_examples)

    input_tensor, inp_lang_tokenizer = utils_preproc.tokenize(inp_lang)
    target_tensor, targ_lang_tokenizer = utils_preproc.tokenize(targ_lang)

    return input_tensor, target_tensor, inp_lang_tokenizer, targ_lang_tokenizer

### Limit size to 30000

Since we'll be training on a single GPU, we'll use only the first 30K examples. We'll split this data 80/20 into train and validation.

In [8]:
def max_length(tensor):
    return max(len(t) for t in tensor)

In [9]:
# Try experimenting with the size of that dataset
num_examples = 30000
input_tensor, target_tensor, inp_lang, targ_lang = load_dataset(path_to_file, num_examples)

# Calculate max_length of the target tensors
max_length_targ, max_length_inp = max_length(target_tensor), max_length(input_tensor)

In [10]:
# Creating training and validation sets using an 80-20 split
input_tensor_train, input_tensor_val, target_tensor_train, target_tensor_val = train_test_split(
    input_tensor, target_tensor, test_size=0.2, random_state=0)

# Show length
len(input_tensor_train), len(target_tensor_train), len(input_tensor_val), len(target_tensor_val)

(24000, 24000, 6000, 6000)

In [11]:
print("Input Language; int to word mapping")
print(input_tensor_train[0])
print(utils_preproc.int2word(inp_lang, input_tensor_train[0]),'\n')
print("Target Language; int to word mapping")
print(target_tensor_train[0])
print(utils_preproc.int2word(targ_lang, target_tensor_train[0]))

Input Language; int to word mapping
[  1 133  14 316   3   2   0   0   0   0   0   0   0   0   0   0]
['<start>', 'deja', 'de', 'leer', '.', '<end>', '', '', '', '', '', '', '', '', '', ''] 

Target Language; int to word mapping
[  1  86 341   3   2   0   0   0   0   0   0]
['<start>', 'stop', 'reading', '.', '<end>', '', '', '', '', '', '']


### Create tf.data dataset

Note how our labels are our reference translations shifted ahead by 1

In [12]:
def encoder_decoder_dataset(encoder_input, decoder_input):
    """Converts sequence pairs into a tf.data.Dataset suitable for
    encoder-decoder learnign using teacher forcing.
    
    Arguments:
    encoder_input: tensor of shape (num_examples,encoder_ seq_length) fed into the encoder RNN
    decoder_input: tensor of shape (num_examples,decoder_seq_length) fed into the decoder RNN during training
   
    Returns:
    tf.data.Dataset of shape ((encoder_input,decoder_input),target)
        target is the decoder_input shifted ahead by 1, with a 0 for padding at the end
    """
    target = tf.roll(decoder_input,-1,1) # shift ahead by 1
    target = tf.concat((target[:,:-1],tf.zeros([target.shape[0],1],dtype=tf.int32)),axis=-1) # replace last column with 0s
    return tf.data.Dataset.from_tensor_slices(((encoder_input, decoder_input), target))
    
tf.random.set_seed(SEED)

BUFFER_SIZE = len(input_tensor_train)
BATCH_SIZE = 64
steps_per_epoch = len(input_tensor_train)//BATCH_SIZE
embedding_dim = 256
units = 1024
vocab_inp_size = len(inp_lang.word_index)+1
vocab_tar_size = len(targ_lang.word_index)+1

train_dataset = encoder_decoder_dataset(input_tensor_train, target_tensor_train).shuffle(BUFFER_SIZE).batch(BATCH_SIZE, drop_remainder=True)
eval_dataset = encoder_decoder_dataset(input_tensor_val, target_tensor_val).batch(BATCH_SIZE, drop_remainder=True)

W0620 18:07:26.343856 140665555871488 deprecation.py:323] From /home/jupyter/.local/lib/python3.5/site-packages/tensorflow/python/data/util/random_seed.py:58: add_dispatch_support.<locals>.wrapper (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where


In [13]:
(example_encoder_input_batch, example_decoder_input_batch), example_target_batch = next(iter(train_dataset))
example_encoder_input_batch[:3], example_decoder_input_batch[:3], example_target_batch[:3]

(<tf.Tensor: id=65, shape=(3, 16), dtype=int32, numpy=
 array([[   1,    4, 5125,    3,    2,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0],
        [   1, 3281,    3,    2,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0],
        [   1, 1182,   32,   22,   50,    3,    2,    0,    0,    0,    0,
            0,    0,    0,    0,    0]], dtype=int32)>,
 <tf.Tensor: id=69, shape=(3, 11), dtype=int32, numpy=
 array([[   1,    5, 1953,    3,    2,    0,    0,    0,    0,    0,    0],
        [   1,   28,   38,  230,    3,    2,    0,    0,    0,    0,    0],
        [   1,  244,  128,   49,   56,    3,    2,    0,    0,    0,    0]],
       dtype=int32)>,
 <tf.Tensor: id=73, shape=(3, 11), dtype=int32, numpy=
 array([[   5, 1953,    3,    2,    0,    0,    0,    0,    0,    0,    0],
        [  28,   38,  230,    3,    2,    0,    0,    0,    0,    0,    0],
        [ 244,  128,   49,   56,    3,    2,    0,    0,    0,    0,   

## Model

We use an encoder-decoder decoder architecture, however we embed our words into a latent space prior to feeding them into the RNN.

In [14]:
%%time
tf.random.set_seed(SEED)

embedding_dim = 256
units = 1024
vocab_inp_size = len(inp_lang.word_index)+1
vocab_tar_size = len(targ_lang.word_index)+1

# Encoder
encoder_inputs = tf.keras.layers.Input(shape=(None,),name="encoder_input")
encoder_inputs_embedded = tf.keras.layers.Embedding(input_dim=vocab_inp_size, output_dim=embedding_dim,input_length=max_length_inp)(encoder_inputs)
encoder_rnn = tf.keras.layers.GRU(
     units = 1024,
     return_sequences=True,
     return_state=True, # what is recurrent_initializer?
     recurrent_initializer='glorot_uniform')
encoder_outputs, encoder_state = encoder_rnn(encoder_inputs_embedded)


# Decoder
decoder_inputs = tf.keras.layers.Input(shape=(None,),name="decoder_input")
decoder_inputs_embedded = tf.keras.layers.Embedding(vocab_tar_size, embedding_dim,input_length=max_length_targ)(decoder_inputs)
decoder_rnn = tf.keras.layers.GRU(
    units = 1024,
    return_sequences=True,
    return_state=True,
    recurrent_initializer='glorot_uniform')
decoder_outputs, decoder_state = decoder_rnn(decoder_inputs_embedded,initial_state=encoder_state)

# Classifier (take each intermediate hidden state and predict word)
decoder_dense = tf.keras.layers.Dense(vocab_tar_size, activation='softmax')
predictions = decoder_dense(decoder_outputs)

# Model definition
if LOAD_CHECKPOINT:
    model = tf.keras.models.load_model(os.path.join(MODEL_PATH,'model.h5'))
else:
    model = tf.keras.models.Model(inputs=[encoder_inputs,decoder_inputs], outputs=predictions)
    model.compile(optimizer='adam', loss='sparse_categorical_crossentropy') 
    model.summary()
    model.fit(train_dataset,validation_data=eval_dataset, epochs=10)

Model: "model"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
encoder_input (InputLayer)      [(None, None)]       0                                            
__________________________________________________________________________________________________
decoder_input (InputLayer)      [(None, None)]       0                                            
__________________________________________________________________________________________________
embedding (Embedding)           (None, None, 256)    2409984     encoder_input[0][0]              
__________________________________________________________________________________________________
embedding_1 (Embedding)         (None, None, 256)    1263360     decoder_input[0][0]              
______________________________________________________________________________________________

## Prediction

We can't just use model.predict(), because we don't know all the inputs we used during training. We only know the encoder_input (source language) but not the decoder_input (target language).

We do however know the first token of the decoder input, which is the `<start>` token. So using this plus the state of the encoder RNN, we can predict the next token. We will then use that token to be the second token of decoder input, and continue like this until we predict the `<end>` token, or we reach some defined max length.

In [15]:
if LOAD_CHECKPOINT:
    encoder_model = tf.keras.models.load_model(os.path.join(MODEL_PATH,'encoder_model.h5'))
    decoder_model = tf.keras.models.load_model(os.path.join(MODEL_PATH,'decoder_model.h5'))
    
else:
    encoder_model = tf.keras.models.Model(encoder_inputs, encoder_state)

    decoder_state_input = tf.keras.layers.Input(shape=(units,),name="decoder_state_input")

    decoder_outputs, decoder_state = decoder_rnn(decoder_inputs_embedded, initial_state=decoder_state_input) # reuses layer weights
    predictions = decoder_dense(decoder_outputs) # reuses layer weights

    decoder_model = tf.keras.models.Model(
        [decoder_inputs,decoder_state_input],
        [predictions,decoder_state])

def decode_sequences(input_seqs, output_tokenizer, max_decode_length=50):
    """
    Arguments:
    input_seqs: int tensor of shape (BATCH_SIZE,SEQ_LEN)
    output_tokenizer: keras_preprocessing.text.Tokenizer used to conver from int to words
    
    Returns translated sentences
    """
    # Encode the input as state vectors.
    batch_size = input_seqs.shape[0]
    states_value = encoder_model.predict(input_seqs)

    # Populate the first character of target sequence with the start character.
    target_seq = tf.ones([batch_size,1])

    # Sampling loop for a batch of sequences
    # (to simplify, here we assume a batch of size 1).
    decoded_sentences = [[] for _ in range(batch_size)]
    for i in range(max_decode_length):
        output_tokens, decoder_state = decoder_model.predict(
            [target_seq,states_value])
        
        # Sample a token
        sampled_token_index = np.argmax(output_tokens[:, -1, :],axis=-1)
        tokens = utils_preproc.int2word(output_tokenizer,sampled_token_index)
        for j in range (batch_size):
            decoded_sentences[j].append(tokens[j])

        # Update the target sequence (of length 1).
        target_seq = tf.expand_dims(tf.constant(sampled_token_index),axis=-1)

        # Update states
        states_value = decoder_state

    return decoded_sentences

Now we're ready to predict!

In [16]:
sentences = [
    "No estamos comiendo.",
    "Está llegando el invierno.",
    "El invierno se acerca.",
    "Tom no comio nada.", 
    "Su pierna mala le impidió ganar la carrera.",
    "Su respuesta es erronea.",
    "¿Qué tal si damos un paseo después del almuerzo?"
]

reference_translations = [
    "We're not eating.",
    "Winter is coming.",
    "Winter is coming.",
    "Tom ate nothing.",
    "His bad leg prevented him from winning the race.",
    "Your answer is wrong.",
    "How about going for a walk after lunch?"
]

machine_translations = decode_sequences(
    utils_preproc.preprocess(sentences,inp_lang),
    targ_lang,
    max_length_targ
)

for i in range(len(sentences)):
    print('-')
    print('INPUT:')
    print(sentences[i])
    print('REFERENCE TRANSLATION:')
    print(reference_translations[i])
    print('MACHINE TRANSLATION:')
    print(machine_translations[i])

-
INPUT:
No estamos comiendo.
REFERENCE TRANSLATION:
We're not eating.
MACHINE TRANSLATION:
['we', 're', 'not', 'eating', '.', '<end>', '', '', '', '', '']
-
INPUT:
Está llegando el invierno.
REFERENCE TRANSLATION:
Winter is coming.
MACHINE TRANSLATION:
['winter', 'is', 'on', 'the', 'grass', '.', '<end>', '', '', '', '']
-
INPUT:
El invierno se acerca.
REFERENCE TRANSLATION:
Winter is coming.
MACHINE TRANSLATION:
['winter', 'is', 'coming', 'on', '.', '<end>', '', '', '', '', '']
-
INPUT:
Tom no comio nada.
REFERENCE TRANSLATION:
Tom ate nothing.
MACHINE TRANSLATION:
['tom', 'didn', 't', 'eat', 'lunch', '.', '<end>', '', '', '', '']
-
INPUT:
Su pierna mala le impidió ganar la carrera.
REFERENCE TRANSLATION:
His bad leg prevented him from winning the race.
MACHINE TRANSLATION:
['her', 'car', 'turned', 'red', '.', '<end>', '', '', '', '', '']
-
INPUT:
Su respuesta es erronea.
REFERENCE TRANSLATION:
Your answer is wrong.
MACHINE TRANSLATION:
['her', 'answer', 'is', 'weak', '.', '<end>', ''

### Checkpoint Model

Save model artifacts

In [17]:
if not LOAD_CHECKPOINT:
    os.makedirs(MODEL_PATH,exist_ok=True)
    model.save(os.path.join(MODEL_PATH,'model.h5'))
    encoder_model.save(os.path.join(MODEL_PATH,'encoder_model.h5'))
    decoder_model.save(os.path.join(MODEL_PATH,'decoder_model.h5'))
    pickle.dump(inp_lang,open(os.path.join(MODEL_PATH,'encoder_tokenizer.pkl'),'wb'))
    pickle.dump(targ_lang,open(os.path.join(MODEL_PATH,'decoder_tokenizer.pkl'),'wb'))

## Evaluation Metric (BLEU)

Unlike say, image classification, there is no one right answer for a machine translation. However our current loss metric, cross entropy, only gives credit when the machine translation matches the exact same word in the same order as the reference translation. 

Many attempts have been made to develop a better metric for natural language evaluation. The most popular currently is Bilingual Evaluation Understudy (BLEU).

- It is quick and inexpensive to calculate.
- It allows flexibility for the ordering of words and phrases.
- It is easy to understand.
- It is language independent.
- It correlates highly with human evaluation.
- It has been widely adopted.

The score is from 0 to 1, where 1 is an exact match.

It works by counting matching n-grams between the machine and reference texts, regardless of order. BLUE-4 counts matching n grams from 1-4 (1-gram, 2-gram, 3-gram and 4-gram). It is common to report both BLUE-1 and BLUE-4

It still is imperfect, since it gives no credit to synonyms and so human evaluation is still best when feasible. However BLEU is commonly considered the best among bad options for an automated metric.

The NLTK framework has an implementation that we will use.

We can't run calculate BLEU during training, because at that time the correct decoder input is used. Instead we'll calculate it now.

For more info: https://machinelearningmastery.com/calculate-bleu-score-for-text-python/

In [18]:
def bleu_1(reference, candidate):
    reference = list(filter(lambda x: x != '', reference)) # remove padding
    candidate = list(filter(lambda x: x != '', candidate)) # remove padding
    smoothing_function = nltk.translate.bleu_score.SmoothingFunction().method1 
    return nltk.translate.bleu_score.sentence_bleu(reference, candidate, (1,),smoothing_function)

def bleu_4(reference, candidate):
    reference = list(filter(lambda x: x != '', reference)) # remove padding
    candidate = list(filter(lambda x: x != '', candidate)) # remove padding
    smoothing_function = nltk.translate.bleu_score.SmoothingFunction().method1 
    return nltk.translate.bleu_score.sentence_bleu(reference, candidate, (.25,.25,.25,.25),smoothing_function)

This takes ~ 5min to run, the bulk of which is decoding the 6000 sentences in the validation set.

In [19]:
%%time
num_examples = len(input_tensor_val)
bleu_1_total = 0
bleu_4_total = 0

for idx in range(num_examples):
    reference_sentence = utils_preproc.int2word(targ_lang, target_tensor_val[idx][1:])
    decoded_sentence = decode_sequences(input_tensor_val[idx:idx+1],targ_lang,max_length_targ)[0]
    bleu_1_total += bleu_1(reference_sentence,decoded_sentence)
    bleu_4_total += bleu_4(reference_sentence,decoded_sentence)
print('BLEU 1: {}'.format(bleu_1_total/num_examples))
print('BLEU 4: {}'.format(bleu_4_total/num_examples))

BLEU 1: 0.2555061026047909
BLEU 4: 0.04662961766141444
CPU times: user 6min 41s, sys: 56.5 s, total: 7min 37s
Wall time: 5min 18s


## Results

**Hyperparameters**

- Batch_Size: 64
- Optimizer: adam
- Embed_dim: 256
- GRU Units: 1024
- Train Examples: 24,000
- Epochs: 10
- Hardware: P100 GPU

**Performance**
- Training Time: 5min 
- Cross-entropy loss: train: 0.0722 - val: 0.9062
- BLEU 1: 0.2519574312515255
- BLEU 4: 0.04589972764144636

## Deploy

See `translate_deploy.ipynb`

### References

- Francois Chollet: https://github.com/keras-team/keras/blob/master/examples/lstm_seq2seq.py
