## Next steps

- add attention (tf.keras.layers.Attention)
- see if benchmarks improved
- deploy for online prediction

### Resources

- Francois Chollet: https://github.com/keras-team/keras/blob/master/examples/lstm_seq2seq.py
- Attention Keras: https://towardsdatascience.com/light-on-math-ml-attention-with-keras-dc8dbc1fad39
- TF NMT: https://colab.sandbox.google.com/drive/1R4Hxvzf1a6H95N2sjh5_lVRat_59Zxlx#scrollTo=ddefjBMa3jF0

### To research

- Better understanding of teacher forcing: https://machinelearningmastery.com/teacher-forcing-for-recurrent-neural-networks/

#### completed

- make reproducible
- prediction works
- establish benchmark on: bleu score (do after the fact, then maybe add as keras eval metric)
- add way to checkpoint/restore models

In [1]:
import os
import unicodedata
import re
import io

import tensorflow as tf
import numpy as np
import nltk

print(tf.__version__) # 2.0.0-beta0
from sklearn.model_selection import train_test_split

2.0.0-beta0


In [2]:
SEED=0
MODEL_PATH = 'translate_models/baseline'
LOAD_CHECKPOINT=True

## Download Data

We'll use a language dataset provided by http://www.manythings.org/anki/. This dataset contains language translation pairs in the format:

```
May I borrow this book?	¿Puedo tomar prestado este libro?
```

The dataset is a curated list of 120K translation pairs from http://tatoeba.org/, a platform for community contributed translations by native speakers.

In [3]:
path_to_zip = tf.keras.utils.get_file(
    'spa-eng.zip', origin='http://storage.googleapis.com/download.tensorflow.org/data/spa-eng.zip',
    extract=True)

path_to_file = os.path.dirname(path_to_zip)+"/spa-eng/spa.txt"

## Data Preprocessing

1. lower case
2. add space between puncation and words
3. replace tokens that aren't a-z or punctation with space
4. add \<start> and \<end> tokens
5. tokenize 
6. pad to length of longest sentence (post-pad)
7. convert to tf.data dataset
8. shuffle and batch

In [4]:
# Converts the unicode file to ascii
def unicode_to_ascii(s):
    return ''.join(c for c in unicodedata.normalize('NFD', s)
        if unicodedata.category(c) != 'Mn')


def preprocess_sentence(w):
    w = unicode_to_ascii(w.lower().strip())

    # creating a space between a word and the punctuation following it
    # eg: "he is a boy." => "he is a boy ."
    # Reference:- https://stackoverflow.com/questions/3645931/python-padding-punctuation-with-white-spaces-keeping-punctuation
    w = re.sub(r"([?.!,¿])", r" \1 ", w)
    w = re.sub(r'[" "]+', " ", w)

    # replacing everything with space except (a-z, A-Z, ".", "?", "!", ",")
    w = re.sub(r"[^a-zA-Z?.!,¿]+", " ", w)

    w = w.rstrip().strip()

    # adding a start and an end token to the sentence
    # so that the model know when to start and stop predicting.
    w = '<start> ' + w + ' <end>'
    return w

In [5]:
en_sentence = u"May I borrow this book?"
sp_sentence = u"¿Puedo tomar prestado este libro?"
print(preprocess_sentence(en_sentence))
print(preprocess_sentence(sp_sentence).encode('utf-8'))

<start> may i borrow this book ? <end>
b'<start> \xc2\xbf puedo tomar prestado este libro ? <end>'


In [6]:
def create_dataset(path, num_examples):
    lines = io.open(path, encoding='UTF-8').read().strip().split('\n')

    word_pairs = [[preprocess_sentence(w) for w in l.split('\t')]  for l in lines[:num_examples]]

    return zip(*word_pairs)

In [7]:
en, sp = create_dataset(path_to_file, None)
print(en[-1])
print(sp[-1])

<start> if you want to sound like a native speaker , you must be willing to practice saying the same sentence over and over in the same way that banjo players practice the same phrase over and over until they can play it correctly and at the desired tempo . <end>
<start> si quieres sonar como un hablante nativo , debes estar dispuesto a practicar diciendo la misma frase una y otra vez de la misma manera en que un musico de banjo practica el mismo fraseo una y otra vez hasta que lo puedan tocar correctamente y en el tiempo esperado . <end>


In [8]:
def tokenize(lang):
  lang_tokenizer = tf.keras.preprocessing.text.Tokenizer(
      filters='')
  lang_tokenizer.fit_on_texts(lang)

  tensor = lang_tokenizer.texts_to_sequences(lang)

  tensor = tf.keras.preprocessing.sequence.pad_sequences(tensor,
                                                         padding='post')

  return tensor, lang_tokenizer

In [9]:
def load_dataset(path, num_examples=None):
    # creating cleaned input, output pairs
    targ_lang, inp_lang = create_dataset(path, num_examples)

    input_tensor, inp_lang_tokenizer = tokenize(inp_lang)
    target_tensor, targ_lang_tokenizer = tokenize(targ_lang)

    return input_tensor, target_tensor, inp_lang_tokenizer, targ_lang_tokenizer

### Limit size to 30000

In [10]:
def max_length(tensor):
    return max(len(t) for t in tensor)

In [11]:
# Try experimenting with the size of that dataset
num_examples = 30000
input_tensor, target_tensor, inp_lang, targ_lang = load_dataset(path_to_file, num_examples)

# Calculate max_length of the target tensors
max_length_targ, max_length_inp = max_length(target_tensor), max_length(input_tensor)

In [12]:
# Creating training and validation sets using an 80-20 split
input_tensor_train, input_tensor_val, target_tensor_train, target_tensor_val = train_test_split(
    input_tensor, target_tensor, test_size=0.2, random_state=0)

# Show length
len(input_tensor_train), len(target_tensor_train), len(input_tensor_val), len(target_tensor_val)

(24000, 24000, 6000, 6000)

In [13]:
def convert(lang,tensor):
    return [lang.index_word[t] if t!=0 else '' for t in tensor]

In [14]:
print ("Input Language; index to word mapping")
print(input_tensor_train[0])
print(convert(inp_lang, input_tensor_train[0]))
print ()
print ("Target Language; index to word mapping")
print(target_tensor_train[0])
print(convert(targ_lang, target_tensor_train[0]))

Input Language; index to word mapping
[  1 133  14 316   3   2   0   0   0   0   0   0   0   0   0   0]
['<start>', 'deja', 'de', 'leer', '.', '<end>', '', '', '', '', '', '', '', '', '', '']

Target Language; index to word mapping
[  1  86 341   3   2   0   0   0   0   0   0]
['<start>', 'stop', 'reading', '.', '<end>', '', '', '', '', '', '']


### Create tf.data dataset

In [15]:
def encoder_decoder_dataset(encoder_input, decoder_input):
    """Converts sequence pairs into a tf.data.Dataset suitable for
    encoder-decoder learnign using teacher forcing.
    
    Arguments:
        encoder_input: tensor of shape (num_examples,encoder_ seq_length) fed into the encoder RNN
        decoder_input: tensor of shape (num_examples,decoder_seq_length) fed into the decoder RNN during training
    Returns
        tf.data.Dataset of shape ((encoder_input,decoder_input),target)
        target is the decoder_input shifted ahead by 1, with a 0 for padding at the end
    """
    target = tf.roll(decoder_input,-1,1) # shift ahead by 1
    target = tf.concat((target[:,:-1],tf.zeros([target.shape[0],1],dtype=tf.int32)),axis=-1) # replace last column with 0s
    return tf.data.Dataset.from_tensor_slices(((encoder_input, decoder_input), target))
    
tf.random.set_seed(SEED)

BUFFER_SIZE = len(input_tensor_train)
BATCH_SIZE = 64
steps_per_epoch = len(input_tensor_train)//BATCH_SIZE
embedding_dim = 256
units = 1024
vocab_inp_size = len(inp_lang.word_index)+1
vocab_tar_size = len(targ_lang.word_index)+1

train_dataset = encoder_decoder_dataset(input_tensor_train, target_tensor_train).shuffle(BUFFER_SIZE).batch(BATCH_SIZE, drop_remainder=True)
eval_dataset = encoder_decoder_dataset(input_tensor_val, target_tensor_val).batch(BATCH_SIZE, drop_remainder=True)

W0616 02:36:59.587545 140501408745216 deprecation.py:323] From /home/jupyter/.local/lib/python3.5/site-packages/tensorflow/python/data/util/random_seed.py:58: add_dispatch_support.<locals>.wrapper (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where


In [16]:
(example_encoder_input_batch, example_decoder_input_batch), example_target_batch = next(iter(train_dataset))
example_encoder_input_batch[:3], example_decoder_input_batch[:3], example_target_batch[:3]

(<tf.Tensor: id=65, shape=(3, 16), dtype=int32, numpy=
 array([[   1,    4, 5125,    3,    2,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0],
        [   1, 3281,    3,    2,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0],
        [   1, 1182,   32,   22,   50,    3,    2,    0,    0,    0,    0,
            0,    0,    0,    0,    0]], dtype=int32)>,
 <tf.Tensor: id=69, shape=(3, 11), dtype=int32, numpy=
 array([[   1,    5, 1953,    3,    2,    0,    0,    0,    0,    0,    0],
        [   1,   28,   38,  230,    3,    2,    0,    0,    0,    0,    0],
        [   1,  244,  128,   49,   56,    3,    2,    0,    0,    0,    0]],
       dtype=int32)>,
 <tf.Tensor: id=73, shape=(3, 11), dtype=int32, numpy=
 array([[   5, 1953,    3,    2,    0,    0,    0,    0,    0,    0,    0],
        [  28,   38,  230,    3,    2,    0,    0,    0,    0,    0,    0],
        [ 244,  128,   49,   56,    3,    2,    0,    0,    0,    0,   

## Model

In [17]:
%%time
tf.random.set_seed(SEED)

embedding_dim = 256
units = 1024
vocab_inp_size = len(inp_lang.word_index)+1
vocab_tar_size = len(targ_lang.word_index)+1

# Encoder
encoder_inputs = tf.keras.layers.Input(shape=(None,),name="encoder_input")
encoder_inputs_embedded = tf.keras.layers.Embedding(input_dim=vocab_inp_size, output_dim=embedding_dim,input_length=max_length_inp)(encoder_inputs)
encoder_outputs, encoder_state = tf.keras.layers.GRU(
     units = 1024,
     return_sequences=True,
     return_state=True, # what is recurrent_initializer?
     recurrent_initializer='glorot_uniform')(encoder_inputs_embedded)


# Decoder
decoder_inputs = tf.keras.layers.Input(shape=(None,),name="decoder_input")
decoder_inputs_embedded = tf.keras.layers.Embedding(vocab_tar_size, embedding_dim,input_length=max_length_targ)(decoder_inputs)
decoder_rnn = tf.keras.layers.GRU(
    units = 1024,
    return_sequences=True,
    return_state=True,
    recurrent_initializer='glorot_uniform')
decoder_outputs, decoder_state = decoder_rnn(decoder_inputs_embedded,initial_state=encoder_state)

# Classifier (take each intermediate hidden state and predict word)
decoder_dense = tf.keras.layers.Dense(vocab_tar_size, activation='softmax')
predictions = decoder_dense(decoder_outputs)

# Model definition
if LOAD_CHECKPOINT:
    model = tf.keras.models.load_model(os.path.join(MODEL_PATH,'model.h5'))
else:
    model = tf.keras.models.Model(inputs=[encoder_inputs,decoder_inputs], outputs=predictions)
    model.compile(optimizer='adam', loss='sparse_categorical_crossentropy') # metrics=[bleu_1]
    model.summary()
    model.fit(train_dataset,validation_data=eval_dataset, epochs=10)

CPU times: user 4.34 s, sys: 244 ms, total: 4.58 s
Wall time: 4.32 s


## Prediction

We can't just use model.predict(), because we don't know all the inputs we used during training. We only know the encoder_input (source language) but not the decoder_input (target language).

We do however know the first token of the decoder input, which is the START character. So using this plus the state of the encoder RNN, we can predict the next token. We will then use that token to be the second token of decoder input, and continue like this until we predict the END token, or we reach some defined max length.

In [18]:
if LOAD_CHECKPOINT:
    encoder_model = tf.keras.models.load_model(os.path.join(MODEL_PATH,'encoder_model.h5'))
    decoder_model = tf.keras.models.load_model(os.path.join(MODEL_PATH,'decoder_model.h5'))
    
else:
    encoder_model = tf.keras.models.Model(encoder_inputs, encoder_state)

    decoder_state_input = tf.keras.layers.Input(shape=(units,),name="decoder_state_input")

    decoder_outputs, decoder_state = decoder_rnn(decoder_inputs_embedded, initial_state=decoder_state_input) # reuses layer weights
    predictions = decoder_dense(decoder_outputs) # reuses layer weights

    decoder_model = tf.keras.models.Model(
        [decoder_inputs,decoder_state_input],
        [predictions,decoder_state])

def decode_sequence(input_seq):
    # Encode the input as state vectors.
    states_value = encoder_model.predict(input_seq)

    # Populate the first character of target sequence with the start character.
    target_seq = tf.constant([[1]])

    # Sampling loop for a batch of sequences
    # (to simplify, here we assume a batch of size 1).
    stop_condition = False
    decoded_sentence = []
    for _ in range(max_length_targ):
        output_tokens, decoder_state = decoder_model.predict(
            [target_seq,states_value])

        # Sample a token
        sampled_token_index = np.argmax(output_tokens[0, -1, :])
        sampled_char = targ_lang.index_word[sampled_token_index]
        decoded_sentence.append(sampled_char)

        # Exit condition: either hit max length
        # or find stop character.
        if sampled_char == '<end>': break

        # Update the target sequence (of length 1).
        target_seq = tf.constant([[sampled_token_index]])

        # Update states
        states_value = decoder_state

    return decoded_sentence

W0616 02:37:04.155453 140501408745216 hdf5_format.py:171] No training configuration found in save file: the model was *not* compiled. Compile it manually.
W0616 02:37:04.357814 140501408745216 hdf5_format.py:171] No training configuration found in save file: the model was *not* compiled. Compile it manually.


In [19]:
for seq_index in range(10):
    decoded_sentence = decode_sequence(input_tensor_val[seq_index: seq_index + 1])
    print('-')
    print('Input:')
    print(convert(inp_lang, input_tensor_val[seq_index][1:]))
    print('Reference Translation:')
    print(convert(targ_lang, target_tensor_val[seq_index][1:]))
    print('Machine Translation:')
    print(decoded_sentence)

-
Input:
['quise', 'pagar', '.', '<end>', '', '', '', '', '', '', '', '', '', '', '']
Reference Translation:
['i', 'wanted', 'to', 'pay', '.', '<end>', '', '', '', '']
Machine Translation:
['i', 'wanted', 'them', 'to', 'know', '.', '<end>']
-
Input:
['yo', 'camino', 'todos', 'los', 'dias', '.', '<end>', '', '', '', '', '', '', '', '']
Reference Translation:
['i', 'walk', 'every', 'day', '.', '<end>', '', '', '', '']
Machine Translation:
['i', 'swim', 'every', 'day', '.', '<end>']
-
Input:
['tom', 'no', 'comio', 'nada', '.', '<end>', '', '', '', '', '', '', '', '', '']
Reference Translation:
['tom', 'ate', 'nothing', '.', '<end>', '', '', '', '', '']
Machine Translation:
['tom', 'didn', 't', 'eat', 'anything', '.', '<end>']
-
Input:
['su', 'respuesta', 'es', 'erronea', '.', '<end>', '', '', '', '', '', '', '', '', '']
Reference Translation:
['your', 'answer', 'is', 'wrong', '.', '<end>', '', '', '', '']
Machine Translation:
['her', 'answer', 'is', 'funny', '.', '<end>']
-
Input:
['te', 

### Checkpoint Model



In [20]:
os.makedirs(MODEL_PATH,exist_ok=True)
model.save(os.path.join(MODEL_PATH,'model.h5'))
encoder_model.save(os.path.join(MODEL_PATH,'encoder_model.h5'))
decoder_model.save(os.path.join(MODEL_PATH,'decoder_model.h5'))

## Evaluation Metric (BLEU)

Our loss metric, cross entropy log loss, is not the best eval metric for machine translation.

Unlike say, image classification, there is no one right answer for a machine translation. However our current loss metric, cross entropy, only gives credit when the machine translation matches the exact same word in the same order as the reference translation.

Many attempts have been made to develop a better metric for natural language evaluation. The most popular currently is Bilingual Evaluation Understudy (BLUE).

- It is quick and inexpensive to calculate.
- It is easy to understand.
- It is language independent.
- It correlates highly with human evaluation.
- It has been widely adopted.

It has the advantages that it allows comparison to multiple reference translations, and allows flexibility for the ordering of words and phrases. It still is imperfect, since it gives no credit to synonyms, so human evaluation is still best

The score is from 0 to 1, where 1 is an exact match. In practice on about 500 sentences (40 general news stories), a human translator scored 0.3468 against four references and scored 0.2571 against two references.

It works by counting matching n-grams between the machine and reference texts, regardless of order. BLUE-4 counts matching n grams from 1-4 (1-gram, 2-gram, 3-gram and 4-gram). It is common to report both BLUE-1 and BLUE-4

The NLTK framework has an implementation that we will use.

Furthermore we can't run evaluation during training, because at that time the correct decoder input is used. We will run our eval metrics after.

For more info: https://machinelearningmastery.com/calculate-bleu-score-for-text-python/

In [21]:
def bleu_1(reference, candidate):
    smoothing_function = nltk.translate.bleu_score.SmoothingFunction().method1 
    return nltk.translate.bleu_score.sentence_bleu(reference, candidate, (1,),smoothing_function)

def bleu_4(reference, candidate):
    smoothing_function = nltk.translate.bleu_score.SmoothingFunction().method1 
    return nltk.translate.bleu_score.sentence_bleu(reference, candidate, (.25,.25,.25,.25),smoothing_function)

This takes ~ 3min to run, the bulk of which is decoding the 6000 sentences in the validation set.

In [None]:
%%time
num_examples = len(input_tensor_val)
bleu_1_total = 0
bleu_4_total = 0

for idx in range(num_examples):
    reference_sentence = convert(targ_lang, target_tensor_val[idx][1:])
    decoded_sentence = decode_sequence(input_tensor_val[idx:idx+1])
    bleu_1_total += bleu_1(reference_sentence,decoded_sentence)
    bleu_4_total += bleu_4(reference_sentence,decoded_sentence)
print('BLEU 1: {}'.format(bleu_1_total/num_examples))
print('BLEU 4: {}'.format(bleu_4_total/num_examples))

## Benchmarks

- Batch_Size: 64
- Optimizer: adam
- Embed_dim: 256
- GRU Units: 1024
- Train Examples: 24,000
- Epochs: 10
- Hardware: P100 GPU

**Baseline**
- 5min - loss: 0.0722 - val_loss: 0.9062
- BLEU 1: 0.2519574312515255
- BLEU 4: 0.04589972764144636
- Manuel Inspection:most translations make sense, but are synonyms not exact matches

Expect loss to be considerably higher once targets are no longer the same as input

## Deploy

Note that to decode our sequences we're not just calling .predict() on a single Keras model. We're using .predict() on two different models with some python code in between. On top of that we're calling each model multiple times in a for loop.

Because of this we can't just export to SavedModel and deploy to AI Platform. Instead we'll take advantage of AI Platforms custom prediction function.