## Next steps

- check in as git branch
- offset labels by one step from the decoder input (end token = 2; probably fill the last position with a reserved token)
- establish benchmarks: log loss, BLEU score, manual inspection
- add attention (tf.keras.layers.Attention)
- see if benchmarks improved
- deploy for online prediction
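
For the BLEU benchmark, here is a minimal dependency-free sketch of clipped n-gram precision, the quantity BLEU is built from. It is not full BLEU (which combines several n-gram orders and a brevity penalty); a library implementation such as NLTK's `sentence_bleu` would probably be the better choice for the real benchmark. `ngrams` and `clipped_precision` are hypothetical helper names, not part of this notebook.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(candidate, reference, n=1):
    """Fraction of candidate n-grams that also occur in the reference.

    Each n-gram is credited at most as often as it appears in the
    reference, so repeating a correct word does not inflate the score.
    """
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref_counts[gram])
                  for gram, count in cand_counts.items())
    return matched / total


reference = "we re going .".split()
candidate = "we re going going".split()  # made-up model output

print(clipped_precision(candidate, reference, n=1))  # 0.75
print(clipped_precision(candidate, reference, n=2))
```

The repeated "going" is only credited once, which is why the unigram score is 3/4 rather than 4/4.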

### Resources

- Francois Chollet: https://github.com/keras-team/keras/blob/master/examples/lstm_seq2seq.py
- Attention Keras: https://towardsdatascience.com/light-on-math-ml-attention-with-keras-dc8dbc1fad39
- TF NMT: https://colab.sandbox.google.com/drive/1R4Hxvzf1a6H95N2sjh5_lVRat_59Zxlx#scrollTo=ddefjBMa3jF0

#### completed

- make reproducible

In [1]:
import os
import unicodedata
import re
import io

import tensorflow as tf
print(tf.__version__) # 2.0.0-beta0
from sklearn.model_selection import train_test_split

2.0.0-beta0


In [2]:
SEED=0

## Download Data

We'll use a language dataset provided by http://www.manythings.org/anki/. This dataset contains language translation pairs in the format:

```
May I borrow this book?	¿Puedo tomar prestado este libro?
```

The dataset is a curated list of 120K translation pairs from http://tatoeba.org/, a platform for community-contributed translations by native speakers.
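
As a quick illustration of the format, each line holds the English sentence, a tab, then the Spanish sentence. A minimal sketch of splitting one raw line (the sample line is the one shown above):

```python
# One raw line from the file: English, a tab, then Spanish.
line = "May I borrow this book?\t¿Puedo tomar prestado este libro?"

# Splitting on the tab yields the two sides of the translation pair.
english, spanish = line.split("\t")

print(english)
print(spanish)
```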

In [3]:
path_to_zip = tf.keras.utils.get_file(
    'spa-eng.zip', origin='http://storage.googleapis.com/download.tensorflow.org/data/spa-eng.zip',
    extract=True)

path_to_file = os.path.dirname(path_to_zip)+"/spa-eng/spa.txt"

## Data Preprocessing

1. lower case
2. add space between punctuation and words
3. replace characters that aren't a-z or punctuation with a space
4. add \<start> and \<end> tokens
5. tokenize 
6. pad to length of longest sentence (post-pad)
7. convert to tf.data dataset
8. shuffle and batch
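
Steps 5 and 6 can be sketched without TensorFlow. This is a rough, dependency-free approximation of what `Tokenizer` and `pad_sequences` do in this notebook, not their actual implementation; `build_word_index` and `encode_and_pad` are hypothetical helper names.

```python
from collections import Counter


def build_word_index(sentences):
    """Map each word to an integer id, most frequent word first.

    Ids start at 1 so that 0 stays free for padding.
    """
    counts = Counter(word for s in sentences for word in s.split())
    return {word: i
            for i, (word, _) in enumerate(counts.most_common(), start=1)}


def encode_and_pad(sentences, word_index):
    """Convert sentences to id lists and post-pad with 0 to equal length."""
    encoded = [[word_index[w] for w in s.split()] for s in sentences]
    longest = max(len(seq) for seq in encoded)
    return [seq + [0] * (longest - len(seq)) for seq in encoded]


sentences = ["<start> nos vamos . <end>", "<start> hola <end>"]
word_index = build_word_index(sentences)
padded = encode_and_pad(sentences, word_index)

print(word_index)
print(padded)  # [[1, 3, 4, 5, 2], [1, 6, 2, 0, 0]]
```

Because `<start>` and `<end>` occur in every sentence they end up with the lowest ids, and the shorter sentence is filled with trailing zeros.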

In [4]:
# Strips accents from a unicode string (NFD-decompose, then drop combining marks)
def unicode_to_ascii(s):
    return ''.join(c for c in unicodedata.normalize('NFD', s)
        if unicodedata.category(c) != 'Mn')


def preprocess_sentence(w):
    w = unicode_to_ascii(w.lower().strip())

    # creating a space between a word and the punctuation following it
    # eg: "he is a boy." => "he is a boy ."
    # Reference:- https://stackoverflow.com/questions/3645931/python-padding-punctuation-with-white-spaces-keeping-punctuation
    w = re.sub(r"([?.!,¿])", r" \1 ", w)
    w = re.sub(r'[" "]+', " ", w)

    # replacing everything with space except (a-z, A-Z, ".", "?", "!", ",", "¿")
    w = re.sub(r"[^a-zA-Z?.!,¿]+", " ", w)

    w = w.strip()

    # adding a start and an end token to the sentence
    # so that the model knows when to start and stop predicting.
    w = '<start> ' + w + ' <end>'
    return w

In [5]:
en_sentence = u"May I borrow this book?"
sp_sentence = u"¿Puedo tomar prestado este libro?"
print(preprocess_sentence(en_sentence))
print(preprocess_sentence(sp_sentence).encode('utf-8'))

<start> may i borrow this book ? <end>
b'<start> \xc2\xbf puedo tomar prestado este libro ? <end>'
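
The accent handling is worth a closer look, since the `¿` survives in the output above (hence the `\xc2\xbf` bytes) while accented letters do not. A minimal sketch of the same normalization in isolation; `strip_accents` is a hypothetical name for what `unicode_to_ascii` does:

```python
import unicodedata


def strip_accents(text):
    """Decompose characters (NFD) and drop the combining marks (category Mn)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if unicodedata.category(c) != "Mn")


print(strip_accents("músico"))  # musico
print(strip_accents("mañana"))  # manana
print(strip_accents("¿"))       # ¿ (no combining mark to drop, so it stays)
```

So the result is not strictly ASCII: characters that have no base letter plus accent decomposition pass through unchanged.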


In [6]:
def create_dataset(path, num_examples):
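    # read all lines and close the file afterwards
    with io.open(path, encoding='UTF-8') as f:
        lines = f.read().strip().split('\n')

    # each line is "english<TAB>spanish"; clean both sides of every pair
    word_pairs = [[preprocess_sentence(w) for w in l.split('\t')] for l in lines[:num_examples]]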

    return zip(*word_pairs)

In [7]:
en, sp = create_dataset(path_to_file, None)
print(en[-1])
print(sp[-1])

<start> if you want to sound like a native speaker , you must be willing to practice saying the same sentence over and over in the same way that banjo players practice the same phrase over and over until they can play it correctly and at the desired tempo . <end>
<start> si quieres sonar como un hablante nativo , debes estar dispuesto a practicar diciendo la misma frase una y otra vez de la misma manera en que un musico de banjo practica el mismo fraseo una y otra vez hasta que lo puedan tocar correctamente y en el tiempo esperado . <end>


In [8]:
def tokenize(lang):
  lang_tokenizer = tf.keras.preprocessing.text.Tokenizer(
      filters='')
  lang_tokenizer.fit_on_texts(lang)

  tensor = lang_tokenizer.texts_to_sequences(lang)

  tensor = tf.keras.preprocessing.sequence.pad_sequences(tensor,
                                                         padding='post')

  return tensor, lang_tokenizer

In [9]:
def load_dataset(path, num_examples=None):
    # creating cleaned input, output pairs
    targ_lang, inp_lang = create_dataset(path, num_examples)

    input_tensor, inp_lang_tokenizer = tokenize(inp_lang)
    target_tensor, targ_lang_tokenizer = tokenize(targ_lang)

    return input_tensor, target_tensor, inp_lang_tokenizer, targ_lang_tokenizer

### Limit size to 3000 (30000 for a full run)

In [10]:
def max_length(tensor):
    return max(len(t) for t in tensor)

In [11]:
# Try experimenting with the size of that dataset
num_examples = 3000  # 30000 for a full run
input_tensor, target_tensor, inp_lang, targ_lang = load_dataset(path_to_file, num_examples)

# Calculate max_length of the target and input tensors
max_length_targ, max_length_inp = max_length(target_tensor), max_length(input_tensor)

In [12]:
# Creating training and validation sets using an 80-20 split
input_tensor_train, input_tensor_val, target_tensor_train, target_tensor_val = train_test_split(
    input_tensor, target_tensor, test_size=0.2, random_state=SEED)

# Show length
len(input_tensor_train), len(target_tensor_train), len(input_tensor_val), len(target_tensor_val)

(2400, 2400, 600, 600)

In [13]:
def convert(lang, tensor):
  for t in tensor:
    if t != 0:
      print("%d ----> %s" % (t, lang.index_word[t]))

In [14]:
print("Input Language; index to word mapping")
convert(inp_lang, input_tensor_train[0])
print()
print("Target Language; index to word mapping")
convert(targ_lang, target_tensor_train[0])

Input Language; index to word mapping
1 ----> <start>
38 ----> nos
87 ----> vamos
3 ----> .
2 ----> <end>

Target Language; index to word mapping
1 ----> <start>
14 ----> we
22 ----> re
249 ----> going
3 ----> .
2 ----> <end>


### Create tf.data dataset

In [15]:
tf.random.set_seed(SEED)

BUFFER_SIZE = len(input_tensor_train)
BATCH_SIZE = 64
steps_per_epoch = len(input_tensor_train)//BATCH_SIZE
embedding_dim = 256
units = 1024
vocab_inp_size = len(inp_lang.word_index)+1
vocab_tar_size = len(targ_lang.word_index)+1

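# NOTE: the decoder input and the label are the same tensor here;
# offsetting the label by one step is still open (see Next steps).
train_dataset = tf.data.Dataset.from_tensor_slices(((input_tensor_train, target_tensor_train), target_tensor_train)).shuffle(BUFFER_SIZE)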
train_dataset = train_dataset.batch(BATCH_SIZE, drop_remainder=True)

eval_dataset = tf.data.Dataset.from_tensor_slices(((input_tensor_val, target_tensor_val), target_tensor_val))
eval_dataset = eval_dataset.batch(BATCH_SIZE, drop_remainder=True)


In [16]:
(example_encoder_input_batch, example_decoder_input_batch), example_target_batch = next(iter(train_dataset))
example_encoder_input_batch[:3], example_decoder_input_batch[:3], example_target_batch[:3]

(<tf.Tensor: id=41, shape=(3, 11), dtype=int32, numpy=
 array([[   1,   13,   31,    3,    2,    0,    0,    0,    0,    0,    0],
        [   1,   25, 1399,    3,    2,    0,    0,    0,    0,    0,    0],
        [   1,   15,  204,    3,    2,    0,    0,    0,    0,    0,    0]],
       dtype=int32)>, <tf.Tensor: id=45, shape=(3, 8), dtype=int32, numpy=
 array([[  1,   6,   8,  68,   3,   2,   0,   0],
        [  1,   4, 504,  11,   3,   2,   0,   0],
        [  1,   4, 173,  19,   3,   2,   0,   0]], dtype=int32)>, <tf.Tensor: id=49, shape=(3, 8), dtype=int32, numpy=
 array([[  1,   6,   8,  68,   3,   2,   0,   0],
        [  1,   4, 504,  11,   3,   2,   0,   0],
        [  1,   4, 173,  19,   3,   2,   0,   0]], dtype=int32)>)
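
Note that the decoder input and the target batch above are identical, so the model can simply copy its input. The "offset by one" item from the next steps can be sketched on a single padded sequence (plain lists here; the same slicing should apply to the padded arrays, under the assumption that the decoder `Input` shape is then shortened by one as well):

```python
# A padded target sequence as shown in the output above:
# 1 = <start>, 2 = <end>, 0 = padding.
target = [1, 6, 8, 68, 3, 2, 0, 0]

# Teacher forcing: the decoder sees tokens 0..n-2 and must predict
# tokens 1..n-1, i.e. the label is the input shifted left by one step.
decoder_input = target[:-1]
label = target[1:]

for seen, expected in zip(decoder_input, label):
    print(seen, "->", expected)
```

With this shift the first prediction is made from `<start>` alone and the last real prediction is `<end>`.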

## Model

In [17]:
tf.random.set_seed(SEED)

embedding_dim = 256
units = 1024
vocab_inp_size = len(inp_lang.word_index)+1
vocab_tar_size = len(targ_lang.word_index)+1

# Encoder
encoder_inputs = tf.keras.layers.Input(shape=(max_length_inp,), name="encoder_input")
encoder_inputs_embedded = tf.keras.layers.Embedding(
    input_dim=vocab_inp_size, output_dim=embedding_dim, input_length=max_length_inp)(encoder_inputs)
encoder_outputs, encoder_state = tf.keras.layers.GRU(
    units=units,
    return_sequences=True,
    return_state=True,
    # recurrent_initializer sets the initial values of the recurrent (state-to-state) weights
    recurrent_initializer='glorot_uniform')(encoder_inputs_embedded)


# Decoder (starts from the encoder's final state)
decoder_inputs = tf.keras.layers.Input(shape=(max_length_targ,), name="decoder_input")
decoder_inputs_embedded = tf.keras.layers.Embedding(
    vocab_tar_size, embedding_dim, input_length=max_length_targ)(decoder_inputs)
decoder_outputs, _ = tf.keras.layers.GRU(
    units=units,
    return_sequences=True,
    return_state=True,
    recurrent_initializer='glorot_uniform')(decoder_inputs_embedded, initial_state=encoder_state)

# Classifier (take each intermediate hidden state and predict word)
predictions = tf.keras.layers.Dense(vocab_tar_size, activation='softmax')(decoder_outputs)

# Model definition

model = tf.keras.models.Model(inputs=[encoder_inputs,decoder_inputs], outputs=predictions)
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')
model.summary()
model.fit(train_dataset,validation_data=eval_dataset)

Model: "model"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
encoder_input (InputLayer)      [(None, 11)]         0                                            
__________________________________________________________________________________________________
decoder_input (InputLayer)      [(None, 8)]          0                                            
__________________________________________________________________________________________________
embedding (Embedding)           (None, 11, 256)      488704      encoder_input[0][0]              
__________________________________________________________________________________________________
embedding_1 (Embedding)         (None, 8, 256)       235008      decoder_input[0][0]              
______________________________________________________________________________________________

<tensorflow.python.keras.callbacks.History at 0x7fc68a44b9e8>
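
For the log loss benchmark it may help to see what `sparse_categorical_crossentropy` computes per timestep. As far as I can tell, the model above does not mask padding, so padded positions are averaged into the loss too. A minimal sketch with made-up probabilities (the numbers and the 4-word vocabulary are assumptions for illustration only):

```python
import math


def sparse_crossentropy(probabilities, label):
    """Negative log-probability assigned to the correct class id."""
    return -math.log(probabilities[label])


# Hypothetical per-timestep softmax outputs over a 4-word vocabulary
# (id 0 is padding), paired with the true label at each timestep.
steps = [
    ([0.1, 0.7, 0.1, 0.1], 1),     # real token
    ([0.1, 0.2, 0.6, 0.1], 2),     # real token
    ([0.9, 0.05, 0.03, 0.02], 0),  # padding position
]

losses = [sparse_crossentropy(p, y) for p, y in steps]
mean_all = sum(losses) / len(losses)

# Excluding padding positions gives the loss on real tokens only.
real = [loss for loss, (_, y) in zip(losses, steps) if y != 0]
mean_real = sum(real) / len(real)

print(round(mean_all, 4), round(mean_real, 4))
```

Padding is easy to predict, so the unmasked average comes out lower than the loss on real tokens. When comparing runs, it is probably worth reporting the masked figure.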