# Neural Machine Translation

This session will train much faster on a GPU!

In [None]:
import tensorflow as tf
from sklearn.model_selection import train_test_split

import unicodedata
import re
import numpy as np
import time

We'll use a language dataset provided by http://www.manythings.org/anki/ to translate from English to German

In [None]:
!wget --quiet http://www.manythings.org/anki/deu-eng.zip
!unzip deu-eng.zip

lines = open('deu.txt', encoding='UTF-8').read().strip().split('\n')

Archive:  deu-eng.zip
  inflating: deu.txt                 
  inflating: _about.txt              


In [None]:
lines[11]

'Wait!\tWarte!\tCC-BY 2.0 (France) Attribution: tatoeba.org #1744314 (belgavox) & #2122378 (Pfirsichbaeumchen)'

# Preprocessing

In [None]:
def preprocess_sentence(w):
  w = w.lower().strip()
  # This next line is confusing!
  # We normalize unicode data, umlauts will be converted to normal letters
  w = w.replace("ß", "ss")
  w = ''.join(c for c in unicodedata.normalize('NFD', w) if unicodedata.category(c) != 'Mn')

  # creating a space between a word and the punctuation following it
  # eg: "he is a boy." => "he is a boy ."
  # Reference:- https://stackoverflow.com/questions/3645931/python-padding-punctuation-with-white-spaces-keeping-punctuation
  w = re.sub(r"([?.!,¿])", r" \1 ", w)
  w = re.sub(r'[" "]+', " ", w)

  # replacing everything with space except (a-z, A-Z, ".", "?", "!"); note that commas are removed as well
  w = re.sub(r"[^a-zA-Z?.!]+", " ", w)
  w = w.strip()

  # adding a start and an end token to the sentence
  # so that the model knows when to start and stop predicting.
  w = '<start> ' + w + ' <end>'
  return w

sentence = "May I borrow this book?"
print(preprocess_sentence(sentence))
sentence = "Über die Wolken."
print(preprocess_sentence(sentence))

<start> may i borrow this book ? <end>
<start> uber die wolken . <end>
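
The "confusing" normalization line can be looked at in isolation. The following is a minimal sketch (the helper name `strip_accents` is made up for this illustration): NFD splits an accented letter into a base letter plus a combining mark, and the marks (Unicode category `Mn`) are then filtered out.

```python
import unicodedata

def strip_accents(text):
    """Decompose characters (NFD) and drop the combining marks (category 'Mn')."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

word = "\u00dcber"  # "Über" written with the precomposed character U+00DC
decomposed = unicodedata.normalize("NFD", word)
print(len(word), len(decomposed))   # the decomposed form has one extra code point
print(strip_accents("\u00dcber die Wolken"))
print(strip_accents("Stra\u00dfe"))  # 'ß' has no decomposition, so it passes through unchanged
```

This is also why preprocess_sentence replaces "ß" by hand before normalizing: the character would otherwise survive the normalization and be wiped out by the later a-z filter.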


In [None]:
english = []
german = []
for line in lines:
  en = line.split('\t')[0]
  de = line.split('\t')[1]
  english.append(preprocess_sentence(en))
  german.append(preprocess_sentence(de))

Using the complete dataset will probably kill the Google Colab notebook. Why? RAM problems! You can either reduce the number of training examples, use a smaller batch size, or use a smaller vocabulary (in which case you have to take care of UNKnown words).
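
The "smaller vocabulary" option could look roughly like this. It is a plain-Python sketch, not what this notebook does: the helper names `build_capped_vocab` and `encode` and the `<unk>` token are assumptions made for the illustration. The Keras Tokenizer offers the same idea through its `num_words` and `oov_token` arguments.

```python
from collections import Counter

def build_capped_vocab(sentences, max_words, unk_token="<unk>"):
    """Keep the `max_words` most frequent words; ID 0 stays free for padding, ID 1 is the unknown token."""
    counts = Counter(word for s in sentences for word in s.split())
    vocab = {unk_token: 1}
    for word, _ in counts.most_common(max_words):
        vocab[word] = len(vocab) + 1
    return vocab

def encode(sentence, vocab, unk_token="<unk>"):
    """Map every word to its ID, falling back to the unknown ID for rare words."""
    return [vocab.get(word, vocab[unk_token]) for word in sentence.split()]

corpus = ["<start> ich fiel . <end>",
          "<start> ich fiel hin . <end>",
          "<start> ich sturzte . <end>"]
vocab = build_capped_vocab(corpus, max_words=4)
print(vocab)
print(encode("<start> ich sturzte . <end>", vocab))  # the rare word "sturzte" becomes the unknown ID
```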

In [None]:
NUM_EXAMPLES = 30000
english = english[:NUM_EXAMPLES]
german = german[:NUM_EXAMPLES]

In [None]:
german[50:60]

['<start> mach mit ! <end>',
 '<start> spring rein ! <end>',
 '<start> druck mich ! <end>',
 '<start> nimm mich in den arm ! <end>',
 '<start> umarme mich ! <end>',
 '<start> mir ist es wichtig . <end>',
 '<start> ich fiel . <end>',
 '<start> ich fiel hin . <end>',
 '<start> ich sturzte . <end>',
 '<start> ich bin hingefallen . <end>']

This time, instead of using the **TextVectorization** layer to preprocess and tokenize the text, we use our own **preprocess_sentence** function followed by the Keras [**Tokenizer**](https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/text/Tokenizer).

It's good to get comfortable with both tools, since they have similar methods. For example, instead of adapt, Tokenizer uses fit_on_texts.

Disadvantage:
TextVectorization automatically pads to the longest sequence. With Tokenizer you have to pad on your own with the **pad_sequences** function.

Advantage:
Tokenizer comes directly with the **word_index** attribute and the **sequences_to_texts** method. We implemented these two on our own last week (compare them!).
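
To see what these methods do conceptually, here is a rough plain-Python imitation. All four function names are invented for this sketch and the real Keras implementation differs in detail; the point is only the workflow: build a frequency-ordered index starting at 1, convert texts to IDs, pad with 0 at the end, and map IDs back to text.

```python
from collections import Counter

def fit_word_index(texts):
    """Assign IDs by descending frequency, starting at 1 (0 is kept for padding)."""
    counts = Counter(word for text in texts for word in text.split())
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

def texts_to_ids(texts, word_index):
    return [[word_index[w] for w in text.split()] for text in texts]

def pad_post(sequences, value=0):
    """Append `value` until every sequence is as long as the longest one."""
    longest = max(len(seq) for seq in sequences)
    return [seq + [value] * (longest - len(seq)) for seq in sequences]

def ids_to_text(ids, word_index):
    index_word = {i: w for w, i in word_index.items()}
    return " ".join(index_word[i] for i in ids if i != 0)  # skip padding

texts = ["<start> go . <end>", "<start> i fell down . <end>"]
word_index = fit_word_index(texts)
padded = pad_post(texts_to_ids(texts, word_index))
print(padded)
print(ids_to_text(padded[0], word_index))
```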


In [None]:
en_tokenizer = tf.keras.preprocessing.text.Tokenizer(filters='')
en_tokenizer.fit_on_texts(english)

data_en = en_tokenizer.texts_to_sequences(english)
data_en = tf.keras.preprocessing.sequence.pad_sequences(data_en, padding='post')

ge_tokenizer = tf.keras.preprocessing.text.Tokenizer(filters='')
ge_tokenizer.fit_on_texts(german)

data_ge = ge_tokenizer.texts_to_sequences(german)
data_ge = tf.keras.preprocessing.sequence.pad_sequences(data_ge,padding='post')

In [None]:
data_en[0]

array([ 1, 34,  3,  2,  0,  0,  0,  0,  0,  0], dtype=int32)

In [None]:
X_train,  X_test, Y_train, Y_test = train_test_split(data_en, data_ge, test_size=0.2)

BATCH_SIZE = 64
BUFFER_SIZE = len(X_train)
steps_per_epoch = BUFFER_SIZE // BATCH_SIZE
embedding_dims = 256
hidden_units = 1024

In [None]:
def max_len(sentence):
    return max(len(s) for s in sentence)

max_length_input = max_len(data_en)
max_length_output = max_len(data_ge)  
input_vocab_size = len(en_tokenizer.word_index) + 1  
output_vocab_size = len(ge_tokenizer.word_index) + 1
print(output_vocab_size)

7262


This time we shuffle and batch the dataset before starting the training. It does not make a difference for the result! We do it up front this time so that we have a sample batch to check that the encoder and decoder layers are working.

In [None]:
dataset = tf.data.Dataset.from_tensor_slices((X_train, Y_train)).shuffle(BUFFER_SIZE).batch(BATCH_SIZE, 
                                                                                            drop_remainder=True)

for example in dataset.take(1):
  example_x, example_y = example
print(example_x.shape) 
print(example_y.shape) 

(64, 10)
(64, 13)


## Without Attention

We will use the Subclass API to create an Encoder and a Decoder module.

Without attention it is possible to use only the Functional API. However, to implement attention there is a [bug](https://github.com/tensorflow/addons/issues/1153) that only allows it through classes  :( 

From the beginning we will use the subclass API, otherwise the jump between no attention and attention is too big.

If someone manages to convert the attention code to a functional version, please show me how :) The [official documentation](https://github.com/tensorflow/addons/tree/master/tensorflow_addons/seq2seq) says that it is now working...

It is nevertheless good to learn the Subclass API, as we will 100% need it when building a Transformer from scratch

In [None]:
# ENCODER
class Encoder(tf.keras.Model):
    def __init__(self, vocab_size, embedding_dims, hidden_units):
        super().__init__()
        self.hidden_units = hidden_units
        self.embedding_layer = tf.keras.layers.Embedding(vocab_size, embedding_dims)
        self.lstm_layer = tf.keras.layers.LSTM(hidden_units, return_sequences=False, 
                                                     return_state=True )
    
    def initialize_hidden_state(self): 
        return [tf.zeros((BATCH_SIZE, self.hidden_units)), 
                tf.zeros((BATCH_SIZE, self.hidden_units))] 
                                                               
    def call(self, input, hidden_state):
        embedding = self.embedding_layer(input)
        output, h_state, c_state = self.lstm_layer(embedding, initial_state = hidden_state)
        return output, h_state, c_state


encoder = Encoder(input_vocab_size, embedding_dims, hidden_units)

In [None]:
# Test  the encoder
sample_initial_state = encoder.initialize_hidden_state()
sample_output, sample_h, sample_c = encoder(example_x, sample_initial_state)
print(sample_output.shape)
print(sample_h.shape)

(64, 1024)
(64, 1024)


We are going to use the TensorFlow Addons package (tfa) for seq2seq models.

In [None]:
import tensorflow_addons as tfa
# DECODER

class Decoder(tf.keras.Model):
  def __init__(self, vocab_size, embedding_dim, hidden_units):
    super().__init__()
    
    self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
    
    self.lstm_cell = tf.keras.layers.LSTMCell(hidden_units)
   
    self.sampler = tfa.seq2seq.sampler.TrainingSampler()
    
    self.output_layer = tf.keras.layers.Dense(vocab_size)
    self.decoder = tfa.seq2seq.BasicDecoder(self.lstm_cell, 
                                            sampler=self.sampler, 
                                            output_layer=self.output_layer)

  def call(self, inputs, initial_state):
    embedding = self.embedding(inputs)
    # We will pass sequences without the <END> token, so the length is max length - 1
    outputs, _, _ = self.decoder(embedding, initial_state=initial_state, 
                                 sequence_length=BATCH_SIZE*[max_length_output-1])
    return outputs

decoder = Decoder(output_vocab_size, embedding_dims, hidden_units)

In [None]:
# Test the decoder
sample_y = tf.random.uniform((BATCH_SIZE, max_length_output))
sample_decoder_output = decoder(sample_y, initial_state=[sample_h, sample_c])

print(sample_decoder_output.rnn_output.shape)

(64, 12, 7262)


Because we padded our sentences, we don't
want to bias our results by considering equality of pad words between the labels
and predictions. This custom loss function masks our predictions with the labels, so
padded positions on the label are also removed from the predictions, and we only
compute our loss using the non zero elements on both the label and predictions.

The predicted Tensor (the logits) has shape (BATCH_SIZE, max_length_output - 1, output_vocab_size)

The real Tensor has shape (BATCH_SIZE, max_length_output - 1), because the <start> token is dropped from the targets
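
The masking idea can be checked by hand on a single tiny sequence. This is a plain-Python sketch with made-up numbers (`masked_cross_entropy` is a hypothetical helper, not part of the notebook). It also shows one design choice worth knowing: averaging over all positions, as tf.reduce_mean does in the loss function of this notebook, gives a smaller number than averaging over the real tokens only.

```python
import math

def masked_cross_entropy(real, logits):
    """real: token IDs for one sequence (0 = padding); logits: one list of scores per position."""
    losses, mask = [], []
    for target, scores in zip(real, logits):
        log_norm = math.log(sum(math.exp(s) for s in scores))
        losses.append(log_norm - scores[target])    # -log softmax(scores)[target]
        mask.append(0.0 if target == 0 else 1.0)
    masked = [l * m for l, m in zip(losses, mask)]
    mean_over_all = sum(masked) / len(masked)            # padded positions count as zeros
    mean_over_real = sum(masked) / max(sum(mask), 1.0)   # alternative: real tokens only
    return mean_over_all, mean_over_real

real = [2, 1, 0, 0]                                  # two real tokens, two padded positions
logits = [[0.0, 0.0, 2.0], [0.0, 1.0, 0.0], [5.0, 0.0, 0.0], [5.0, 0.0, 0.0]]
print(masked_cross_entropy(real, logits))
```

Whatever the model predicts at the padded positions has no influence on either value.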

In [None]:
optimizer = tf.keras.optimizers.Adam()

def loss_function(real, pred):
  cross_entropy = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True, reduction='none')
  loss = cross_entropy(y_true=real, y_pred=pred)
  mask = tf.logical_not(tf.math.equal(real,0))   #output 0 for y=0 else output 1
  mask = tf.cast(mask, dtype=loss.dtype)  # mask and loss have to have the same Tensor type
  loss = mask * loss
  loss = tf.reduce_mean(loss) # you need one loss scalar number for the mini batch
  return loss 

We have to handle the training loop manually as well. Each step of the loop below
handles the flow of data through the encoder and the decoder, computes the loss, and
applies the gradient of the loss to the trainable weights. The loss is accumulated per epoch.


These are essentially the same steps we took before with our example_x and example_y data. Try to understand them before running the loop.

In [None]:
EPOCHS = 100

for epoch in range(EPOCHS):
  start = time.time()

  encoder_hidden = encoder.initialize_hidden_state() # Every epoch we use a zero Tensor matrix
  epoch_loss = 0

  for (batch, (input, target)) in enumerate(dataset.take(steps_per_epoch)):
    with tf.GradientTape() as tape:
        # Pass the input through the encoder 
        encoder_output, encoder_h, encoder_c = encoder(input, encoder_hidden)
        decoder_input = target[ : , :-1 ] # ignore <end> token
        real = target[ : , 1: ]           # ignore <start> token
        # The encoder hidden state and the decoder input
        # are passed to the decoder
        decoder_output = decoder(decoder_input, [encoder_h, encoder_c]) 
        logits = decoder_output.rnn_output
        batch_loss = loss_function(real, logits)

    variables = encoder.trainable_variables + decoder.trainable_variables
    gradients = tape.gradient(batch_loss, variables)
    optimizer.apply_gradients(zip(gradients, variables))
    epoch_loss += batch_loss

    if batch % 100 == 0:
        print('Epoch {} Batch {} Loss {:.4f}'.format(epoch + 1,
                                                   batch,
                                                   batch_loss.numpy()))
  print('Epoch {} Loss {:.4f}'.format(epoch + 1,
                                      epoch_loss / steps_per_epoch))
  print('Time {:.4f} sec\n'.format(time.time() - start))

Epoch 1 Batch 0 Loss 4.2367
Epoch 1 Batch 100 Loss 1.9599
Epoch 1 Batch 200 Loss 1.9536
Epoch 1 Batch 300 Loss 1.8202
Epoch 1 Loss 1.9803
Time 76.0210 sec

Epoch 2 Batch 0 Loss 1.6242
Epoch 2 Batch 100 Loss 1.6441
Epoch 2 Batch 200 Loss 1.6222
Epoch 2 Batch 300 Loss 1.6233
Epoch 2 Loss 1.6129
Time 74.5426 sec

Epoch 3 Batch 0 Loss 1.4970
Epoch 3 Batch 100 Loss 1.4370
Epoch 3 Batch 200 Loss 1.3742
Epoch 3 Batch 300 Loss 1.3840
Epoch 3 Loss 1.4619
Time 74.3314 sec

Epoch 4 Batch 0 Loss 1.3155
Epoch 4 Batch 100 Loss 1.3850
Epoch 4 Batch 200 Loss 1.3250
Epoch 4 Batch 300 Loss 1.4406
Epoch 4 Loss 1.3539
Time 74.4977 sec

Epoch 5 Batch 0 Loss 1.2954
Epoch 5 Batch 100 Loss 1.3336
Epoch 5 Batch 200 Loss 1.2161
Epoch 5 Batch 300 Loss 1.2900
Epoch 5 Loss 1.2684
Time 75.9332 sec

Epoch 6 Batch 0 Loss 1.1367
Epoch 6 Batch 100 Loss 1.1299
Epoch 6 Batch 200 Loss 1.2074
Epoch 6 Batch 300 Loss 1.0741
Epoch 6 Loss 1.1272
Time 75.0640 sec

Epoch 7 Batch 0 Loss 1.0356
Epoch 7 Batch 100 Loss 1.0060
Epoch 

**Translation**

In [None]:
def translate(sentence, preprocess=True):
    if preprocess:
        sentence = preprocess_sentence(sentence)
        sentence_tokens = en_tokenizer.texts_to_sequences([sentence])
        input = tf.keras.preprocessing.sequence.pad_sequences(sentence_tokens, maxlen=max_length_input, padding='post')
    else:
        input = sentence
    input = tf.convert_to_tensor(input)

    encoder_hidden = [tf.zeros((1, hidden_units)), tf.zeros((1, hidden_units))]
    encoder_output, encoder_h, encoder_c = encoder(input, encoder_hidden)

    ### This time we use the greedy sampler because we want the word with the highest probability!
    ### We are not generating new text, where a probability sampling would be better
    greedy_sampler = tfa.seq2seq.GreedyEmbeddingSampler()

    # Instantiate a BasicDecoder object
    decoder_instance = tfa.seq2seq.BasicDecoder(cell=decoder.lstm_cell, 
                                                sampler=greedy_sampler, 
                                                output_layer=decoder.output_layer)

    ### The BasicDecoder wraps only the Decoder's LSTM cell, so the inputs of each decoding step
    ### have to be outputs of the embedding layer. tfa.seq2seq.GreedyEmbeddingSampler() takes care of this.
    ### We only need the weights of the embedding layer, which we get with decoder.embedding.variables[0],
    ### and pass this matrix to the BasicDecoder when calling it

    decoder_embedding_matrix = decoder.embedding.variables[0]

    # Additionally, we give the start token to the decoder, and also the end token, so that it stops translating
    start_token = tf.convert_to_tensor([ge_tokenizer.word_index['<start>']])
    end_token = ge_tokenizer.word_index['<end>']

    outputs, _, _ = decoder_instance(decoder_embedding_matrix, start_tokens = start_token, 
                                     end_token= end_token, initial_state=[encoder_h, encoder_c])

    result_sequence  = outputs.sample_id.numpy()
    return ge_tokenizer.sequences_to_texts(result_sequence)[0]

translate("I love you!")

'ich liebe dich ! <end>'

In [None]:
translate("I want to kiss you!")

'ich mochte dich kussen . <end>'

In [None]:
translate("I played the piano today")

'ich habe das auto gekauft . <end>'

In [None]:
translate("The teacher was happy to train the language model")

'das licht war aus . <end>'

[**BLEU Scores**](https://www.nltk.org/_modules/nltk/translate/bleu_score.html)

In [None]:
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

bleu_scores = []
smooth_fn = SmoothingFunction()

for input, target in zip(X_test, Y_test):
    original = ge_tokenizer.sequences_to_texts([target])[0]
    predicted = translate([input], preprocess=False)
    original = re.sub(r"(<end>)|(<start>)|\?|!|\.", "", original)   # raw strings avoid invalid-escape warnings
    predicted = re.sub(r"(<end>)|\?|!|\.", "", predicted)
    original_tokens = original.strip().split(" ")
    predicted_tokens = predicted.strip().split(" ")
    score = sentence_bleu([original_tokens], predicted_tokens, 
                          smoothing_function=smooth_fn.method1)
    bleu_scores.append(score)

np.mean(np.array(bleu_scores)) * 100

24.307577949992638
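
To get a feeling for what the score measures, here is a strongly simplified BLEU variant in plain Python. It is a sketch under assumptions: only 1-grams and 2-grams are used and there is no smoothing, whereas sentence_bleu defaults to n-grams up to length 4 and the cell above adds smoothing. The function names are made up for the illustration.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def clipped_precision(reference, candidate, n):
    """Share of candidate n-grams that also occur in the reference (counts are clipped)."""
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    overlap = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    return overlap / max(sum(cand_counts.values()), 1)

def simple_bleu(reference, candidate, max_n=2):
    precisions = [clipped_precision(reference, candidate, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    # Brevity penalty: candidates shorter than the reference are punished
    if len(candidate) >= len(reference):
        brevity = 1.0
    else:
        brevity = math.exp(1 - len(reference) / len(candidate))
    return brevity * geo_mean

reference = "ich habe heute klavier gespielt".split()
print(simple_bleu(reference, reference))
print(simple_bleu(reference, "ich habe das auto gekauft".split()))
```

A perfect match scores 1, and a fluent but wrong sentence that shares only its first two words with the reference scores far lower.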

## With Attention

The Encoder stays almost the same. The only change is that the LSTM layer now needs to return its output at every input time step, so that we can pass these outputs to the attention mechanism.
For this we set return_sequences=True (return_state=True was already set before). [Read here](https://medium.com/@sanjivgautamofficial/lstm-in-keras-56a59264c0b2)
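
The difference between the two flags is easy to mix up, so here is a toy illustration in plain Python. It uses a scalar recurrence instead of a real LSTM, and the function `toy_rnn` and its weights are invented for the sketch: the sequence of outputs contains one state per input step, while the returned state is only the last one.

```python
import math

def toy_rnn(inputs, return_sequences=False, w=0.5, u=0.8):
    """Scalar recurrence h_t = tanh(w * x_t + u * h_{t-1}) with h_0 = 0."""
    h, all_states = 0.0, []
    for x in inputs:
        h = math.tanh(w * x + u * h)
        all_states.append(h)
    output = all_states if return_sequences else h
    return output, h   # (output, final state), loosely mirroring return_state=True

inputs = [1.0, 0.0, -1.0, 2.0]
last_only, final_state = toy_rnn(inputs, return_sequences=False)
every_step, _ = toy_rnn(inputs, return_sequences=True)
print(last_only == final_state)                          # without sequences the output is just the last state
print(len(every_step), every_step[-1] == final_state)    # with sequences there is one output per input step
```

This matches the shapes printed by the encoder tests: (64, 1024) without sequences and (64, 10, 1024) with them.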

In [None]:
class EncoderAttention(tf.keras.Model):
    def __init__(self, vocab_size, embedding_dims, hidden_units):
        super().__init__()
        self.hidden_units = hidden_units
        self.embedding_layer = tf.keras.layers.Embedding(vocab_size, embedding_dims)
        self.lstm_layer = tf.keras.layers.LSTM(hidden_units, return_sequences=True, 
                                                     return_state=True ) # We need the lstm outputs 
                                                                         # to calculate attention!
    
    def initialize_hidden_state(self): 
        return [tf.zeros((BATCH_SIZE, self.hidden_units)), 
                tf.zeros((BATCH_SIZE, self.hidden_units))] 
                                                               
    def call(self, input, hidden_state):
        embedding = self.embedding_layer(input)
        output, h_state, c_state = self.lstm_layer(embedding, initial_state = hidden_state)
        return output, h_state, c_state


encoder = EncoderAttention(input_vocab_size, embedding_dims, hidden_units)

In [None]:
# Test  the encoder
sample_initial_state = encoder.initialize_hidden_state()
sample_output, sample_h, sample_c = encoder(example_x, sample_initial_state)
print(sample_output.shape)
print(sample_h.shape)

(64, 10, 1024)
(64, 1024)


The Decoder is the one that changes the most. The new lines are marked with the comment "# N". From now on, we always need to set up the attention memory (the encoder outputs) first, and then build the decoder's initial state from the encoder's final state through the attention cell.
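
What the attention mechanism computes at one decoding step can be sketched in a few lines of plain Python. This is a simplified dot-product version that leaves out any learned projections, and `dot_attention` is a made-up helper, so treat it as an illustration of the idea rather than of the tfa implementation: score every encoder output against the current decoder state, turn the scores into weights with a softmax, and build the context vector as the weighted sum.

```python
import math

def dot_attention(query, memory):
    """query: decoder state (list of floats); memory: one encoder output per source position."""
    scores = [sum(q * m for q, m in zip(query, keys)) for keys in memory]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]            # numerically stable softmax
    total = sum(exps)
    weights = [e / total for e in exps]
    context = [sum(w * keys[d] for w, keys in zip(weights, memory))
               for d in range(len(query))]
    return weights, context

memory = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # three source positions, hidden size 2
query = [2.0, 0.0]
weights, context = dot_attention(query, memory)
print([round(w, 3) for w in weights])
print([round(c, 3) for c in context])
```

The source positions whose encoder output points in the same direction as the query receive most of the weight.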

In [None]:
class DecoderAttention(tf.keras.Model):
  def __init__(self, vocab_size, embedding_dim, hidden_units):
    super().__init__()
    
    self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
    
    self.lstm_cell = tf.keras.layers.LSTMCell(hidden_units)
   
    self.sampler = tfa.seq2seq.sampler.TrainingSampler()

    self.attention_mechanism = tfa.seq2seq.LuongAttention(hidden_units, memory_sequence_length=BATCH_SIZE*[max_length_input]) #N
    
    self.attention_cell = tfa.seq2seq.AttentionWrapper(cell=self.lstm_cell, # N
                                  attention_mechanism=self.attention_mechanism, 
                                  attention_layer_size=hidden_units)
    
    self.output_layer = tf.keras.layers.Dense(vocab_size)
    self.decoder = tfa.seq2seq.BasicDecoder(self.attention_cell, # N
                                            sampler=self.sampler, 
                                            output_layer=self.output_layer)

  def build_initial_state(self, batch_size, encoder_state): #N
    decoder_initial_state = self.attention_cell.get_initial_state(batch_size=batch_size, dtype=tf.float32)
    decoder_initial_state = decoder_initial_state.clone(cell_state=encoder_state)
    return decoder_initial_state


  def call(self, inputs, initial_state):
    embedding = self.embedding(inputs)
    outputs, _, _ = self.decoder(embedding, initial_state=initial_state, sequence_length=BATCH_SIZE*[max_length_output-1])
    return outputs

decoder = DecoderAttention(output_vocab_size, embedding_dims, hidden_units)

In [None]:
# Test the decoder
sample_y = tf.random.uniform((BATCH_SIZE, max_length_output))
decoder.attention_mechanism.setup_memory(sample_output) # Attention needs the encoder outputs of all time steps
                                                        # as its memory
initial_state = decoder.build_initial_state(BATCH_SIZE, [sample_h, sample_c]) # N


sample_decoder_output = decoder(sample_y, initial_state)

print(sample_decoder_output.rnn_output.shape)

(64, 12, 7262)


Same loss function as before! No changes

In [None]:
EPOCHS = 100

for epoch in range(EPOCHS):
  start = time.time()

  encoder_hidden = encoder.initialize_hidden_state() # Every epoch we use a zero Tensor matrix
  epoch_loss = 0

  for (batch, (input, target)) in enumerate(dataset.take(steps_per_epoch)):
    with tf.GradientTape() as tape:
        # Pass the input through the encoder 
        encoder_output, encoder_h, encoder_c = encoder(input, encoder_hidden)
        decoder_input = target[ : , :-1 ] # Ignore <end> token
        real = target[ : , 1: ]         # ignore <start> token
        # The encoder outputs, the encoder final state and the decoder input
        # are passed to the decoder
        decoder.attention_mechanism.setup_memory(encoder_output) # N
        decoder_initial_state = decoder.build_initial_state(BATCH_SIZE, [encoder_h, encoder_c]) # N
        decoder_output = decoder(decoder_input, decoder_initial_state) 
        logits = decoder_output.rnn_output
        batch_loss = loss_function(real, logits)

    variables = encoder.trainable_variables + decoder.trainable_variables
    gradients = tape.gradient(batch_loss, variables)
    optimizer.apply_gradients(zip(gradients, variables))
    epoch_loss += batch_loss

    if batch % 100 == 0:
        print('Epoch {} Batch {} Loss {:.4f}'.format(epoch + 1,
                                                   batch,
                                                   batch_loss.numpy()))
  print('Epoch {} Loss {:.4f}'.format(epoch + 1,
                                      epoch_loss / steps_per_epoch))
  print('Time {:.4f} sec\n'.format(time.time() - start))

Epoch 1 Batch 0 Loss 4.5146
Epoch 1 Batch 100 Loss 2.9734
Epoch 1 Batch 200 Loss 1.9706
Epoch 1 Batch 300 Loss 2.0473
Epoch 1 Loss 2.8581
Time 111.3372 sec

Epoch 2 Batch 0 Loss 1.8199
Epoch 2 Batch 100 Loss 1.5814
Epoch 2 Batch 200 Loss 1.7211
Epoch 2 Batch 300 Loss 1.5967
Epoch 2 Loss 1.6797
Time 111.2661 sec

Epoch 3 Batch 0 Loss 1.6079
Epoch 3 Batch 100 Loss 1.5383
Epoch 3 Batch 200 Loss 1.4252
Epoch 3 Batch 300 Loss 1.6349
Epoch 3 Loss 1.5118
Time 111.4935 sec

Epoch 4 Batch 0 Loss 1.3876
Epoch 4 Batch 100 Loss 1.3024
Epoch 4 Batch 200 Loss 1.3459
Epoch 4 Batch 300 Loss 1.4257
Epoch 4 Loss 1.4006
Time 110.7898 sec

Epoch 5 Batch 0 Loss 1.2880
Epoch 5 Batch 100 Loss 1.3369
Epoch 5 Batch 200 Loss 1.2947
Epoch 5 Batch 300 Loss 1.3020
Epoch 5 Loss 1.3138
Time 111.7957 sec

Epoch 6 Batch 0 Loss 1.2069
Epoch 6 Batch 100 Loss 1.1938
Epoch 6 Batch 200 Loss 1.2285
Epoch 6 Batch 300 Loss 1.1304
Epoch 6 Loss 1.2177
Time 111.7171 sec

Epoch 7 Batch 0 Loss 1.1337
Epoch 7 Batch 100 Loss 1.0426


In [None]:
def translate(sentence, preprocess=True):
    if preprocess:
        sentence = preprocess_sentence(sentence)
        sentence_tokens = en_tokenizer.texts_to_sequences([sentence])
        input = tf.keras.preprocessing.sequence.pad_sequences(sentence_tokens, maxlen=max_length_input, padding='post')
    else:
        input = sentence
    input = tf.convert_to_tensor(input)

    encoder_hidden = [tf.zeros((1, hidden_units)), tf.zeros((1, hidden_units))]
    encoder_output, encoder_h, encoder_c = encoder(input, encoder_hidden)
    start_token = tf.convert_to_tensor([ge_tokenizer.word_index['<start>']])
    end_token = ge_tokenizer.word_index['<end>']

    # This time we use the greedy sampler because we want the word with the highest probability!
    # We are not generating new text, where a probability sampling would be better
    greedy_sampler = tfa.seq2seq.GreedyEmbeddingSampler()

    # Instantiate a BasicDecoder object
    decoder_instance = tfa.seq2seq.BasicDecoder(cell=decoder.attention_cell, # N
                                                sampler=greedy_sampler, output_layer=decoder.output_layer)
    # Setup Memory in decoder stack
    decoder.attention_mechanism.setup_memory(encoder_output) # N

    # set decoder_initial_state
    decoder_initial_state = decoder.build_initial_state(batch_size=1, encoder_state=[encoder_h, encoder_c]) # N

    ### The BasicDecoder wraps only the Decoder's RNN cell, so the inputs of each decoding step
    ### have to be outputs of the embedding layer. tfa.seq2seq.GreedyEmbeddingSampler() takes care of this.
    ### We only need the weights of the embedding layer, which we get with decoder.embedding.variables[0], and pass this matrix to the BasicDecoder when calling it

    decoder_embedding_matrix = decoder.embedding.variables[0]

    outputs, _, _ = decoder_instance(decoder_embedding_matrix, start_tokens = start_token, end_token= end_token, initial_state=decoder_initial_state)

    result_sequence  = outputs.sample_id.numpy()
    return ge_tokenizer.sequences_to_texts(result_sequence)[0]

translate("I love you!")

'ich liebe dich . <end>'

In [None]:
translate("I want to kiss you")

['ich mochte dich kussen . <end>']

In [None]:
translate("I played the piano today")

['ich habe heute ein klavier gespielt . <end>']

In [None]:
translate("The teacher was happy to train the language model")

['der lehrer war glucklich . <end>']

In [None]:
bleu_scores = []
smooth_fn = SmoothingFunction()

for input, target in zip(X_test, Y_test):
    original = ge_tokenizer.sequences_to_texts([target])[0]
    predicted = translate([input], preprocess=False)
    original = re.sub(r"(<end>)|(<start>)|\?|!|\.", "", original)   # raw strings avoid invalid-escape warnings
    predicted = re.sub(r"(<end>)|\?|!|\.", "", predicted)
    original_tokens = original.strip().split(" ")
    predicted_tokens = predicted.strip().split(" ")
    score = sentence_bleu([original_tokens], predicted_tokens, 
                          smoothing_function=smooth_fn.method1)
    bleu_scores.append(score)

np.mean(np.array(bleu_scores)) * 100

26.038424524712184

IMPORTANT: Such complex models **need** an adaptive learning rate! Also the hyperparameters have to be tuned according to the task. In this course, we are not implementing them, but you should definitely play with them to improve your model!

Beam search can be very helpful to achieve a better BLEU score. It can be implemented with the tfa.seq2seq.BeamSearchDecoder class.
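
Why beam search can beat greedy decoding is easiest to see on a toy example. The sketch below is plain Python and independent of tfa: `beam_search`, `toy_step` and the probability table are all invented for the illustration. The "model" is a lookup table in which the locally best first word leads to a worse overall sentence.

```python
import math

def beam_search(step_fn, start, end, beam_width=2, max_len=5):
    """step_fn(prefix) returns {token: probability} for the next token."""
    beams = [([start], 0.0)]                      # (tokens, summed log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            for token, prob in step_fn(tokens).items():
                candidates.append((tokens + [token], score + math.log(prob)))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for tokens, score in candidates[:beam_width]:   # keep only the best hypotheses
            if tokens[-1] == end:
                finished.append((tokens, score))
            else:
                beams.append((tokens, score))
        if not beams:
            break
    return max(finished + beams, key=lambda c: c[1])

# Hypothetical toy model: "a" looks better at first, but "b z" is the better sentence.
TABLE = {
    ("<s>",): {"a": 0.6, "b": 0.4},
    ("<s>", "a"): {"x": 0.5, "y": 0.5},
    ("<s>", "b"): {"z": 0.9, "x": 0.1},
}

def toy_step(prefix):
    return TABLE.get(tuple(prefix), {"</s>": 1.0})

print(beam_search(toy_step, "<s>", "</s>", beam_width=1))   # beam width 1 is greedy decoding
print(beam_search(toy_step, "<s>", "</s>", beam_width=2))
```

With a beam width of 1 the search commits to "a" and ends with probability 0.3, while a width of 2 keeps "b" alive and finds the sentence with probability 0.36.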

# Continue Learning

This time the extra code is waaay easier than the course code! :D This will definitely help you understand the previous code!

## Date Translation

Train a seq2seq model that can convert a date string from one format to another (e.g., from "April 22, 2019" to "2019-04-22"). We use character-level translation.

Note: Of course this can be simply done with regular expressions, but let's make a neural network learn the rules only from data!
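
For comparison, the rule-based version mentioned in the note could look like this. It is a small sketch with a hypothetical helper `to_iso`, and it also makes a handy baseline for checking the network's predictions later.

```python
import re

MONTH_NAMES = ["January", "February", "March", "April", "May", "June",
               "July", "August", "September", "October", "November", "December"]
DATE_PATTERN = re.compile(r"^([A-Za-z]+) (\d{1,2}), (\d{4})$")

def to_iso(date_string):
    """Convert e.g. 'April 22, 2019' to '2019-04-22' with explicit rules."""
    match = DATE_PATTERN.match(date_string)
    if match is None:
        raise ValueError(f"unrecognised date: {date_string!r}")
    month_name, day, year = match.groups()
    month = MONTH_NAMES.index(month_name) + 1   # raises ValueError for unknown month names
    return f"{year}-{month:02d}-{int(day):02d}"

print(to_iso("April 22, 2019"))
```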

In [None]:
from datetime import date

MONTHS = ["January", "February", "March", "April", "May", "June",
          "July", "August", "September", "October", "November", "December"]

# Create random dates
def random_dates(n_dates):
    min_date = date(1000, 1, 1).toordinal()
    max_date = date(9999, 12, 31).toordinal()

    ordinals = np.random.randint(max_date - min_date, size=n_dates) + min_date
    dates = [date.fromordinal(ordinal) for ordinal in ordinals]

    x = [MONTHS[dt.month - 1] + " " + dt.strftime("%d, %Y") for dt in dates]
    y = [dt.isoformat() for dt in dates]
    return x, y

In [None]:
np.random.seed(42)

x_example, y_example = random_dates(3)
for idx in range(3):
    print(f"Input: {x_example[idx]}, Target: {y_example[idx]}")

Input: September 20, 7075, Target: 7075-09-20
Input: May 15, 8579, Target: 8579-05-15
Input: January 11, 7103, Target: 7103-01-11


Preprocessing (Always the most tedious part...)

In [None]:
# create mapping from vocab chars to ints
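# Note: '0' appears twice in the digit string, so the dict built below keeps the later index for '0' and one ID stays unused (harmless)
input_vocab = "".join(sorted(set("".join(MONTHS)))) + "01234567890, "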
output_vocab = "0123456789-"
input_vocab

'ADFJMNOSabceghilmnoprstuvy01234567890, '

In [None]:
input_char2id = {c:i for i, c in enumerate(input_vocab)}
output_char2id = {c:i for i, c in enumerate(output_vocab)}

In [None]:
print([input_char2id[x] + 1 for x in x_example[0]])

[8, 12, 20, 23, 12, 17, 10, 12, 21, 39, 29, 37, 38, 39, 34, 37, 34, 32]


Let's write a function that converts date strings to integers. We want to pad the input strings so that all inputs have the same length. **ALWAYS use 0 as the padding ID**

What is the maximal input length? 

September xx, xxxx : 18 characters

What is the output length? xxxx-xx-xx : 10 characters

In [None]:
def string_to_char(data_str, vocabulary, max_length=None):
    if max_length:
        ids = [vocabulary[character] + 1 for character in data_str] # we add one to have 0 as a padding
        for i in range(max_length - len(ids)):
          ids.append(0)
    else:
        ids = [vocabulary[character] for character in data_str]
    return np.array(ids)

max_input_length = 18
max_output_length = 10     
string_to_char(x_example[1], input_char2id, max_length=max_input_length)

array([ 5,  9, 26, 39, 28, 32, 38, 39, 35, 32, 34, 36,  0,  0,  0,  0,  0,
        0])

In [None]:
def create_dataset(n_dates):
    x_strings, y_strings = random_dates(n_dates)
    x_ids = []
    for x in x_strings:
        x_ids.append(string_to_char(x, input_char2id, max_length=max_input_length))
    y_ids = []
    for y in y_strings:
        y_ids.append(string_to_char(y, output_char2id))
    return tf.convert_to_tensor(x_ids), tf.convert_to_tensor(y_ids)

In [None]:
X_train, y_train = create_dataset(10000)
X_valid, y_valid = create_dataset(2000)
X_test, y_test = create_dataset(2000)

In [None]:
y_train[0]

<tf.Tensor: shape=(10,), dtype=int32, numpy=array([ 7,  3,  7,  2, 10,  1,  1, 10,  2,  5], dtype=int32)>

Finally the model part!

**First Version** Let's try a very basic Seq2Seq model without teacher forcing (the Decoder never sees the true target characters as input)

In [None]:
embedding_size = 32
batch_size = 64
hidden_units = 128

In [None]:
# ENCODER
encoder = tf.keras.models.Sequential()
encoder.add(tf.keras.layers.Embedding(input_dim=len(input_vocab) + 1,
                           output_dim=embedding_size,
                           input_shape=[None]))
encoder.add(tf.keras.layers.LSTM(hidden_units))

# DECODER
decoder = tf.keras.models.Sequential()
decoder.add(tf.keras.layers.LSTM(hidden_units, return_sequences=True))
decoder.add(tf.keras.layers.Dense(len(output_vocab) + 1, activation="softmax"))

# ENCODER-DECODER
model = tf.keras.models.Sequential()
model.add(encoder)
model.add(tf.keras.layers.RepeatVector(max_output_length)) # Repeat the encoder's final output once per output time step as decoder input
model.add(decoder)

model.compile(loss="sparse_categorical_crossentropy", optimizer="adam",
              metrics=["accuracy"])
history = model.fit(X_train, y_train, epochs=20,
                    validation_data=(X_valid, y_valid), batch_size=batch_size)

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


In [None]:
model.summary()

Model: "sequential_14"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
sequential_12 (Sequential)   (None, 128)               83712     
_________________________________________________________________
repeat_vector_4 (RepeatVecto (None, 10, 128)           0         
_________________________________________________________________
sequential_13 (Sequential)   (None, 10, 12)            133132    
Total params: 216,844
Trainable params: 216,844
Non-trainable params: 0
_________________________________________________________________


Let's use the model to make a prediction

In [None]:
date_str = "July 14, 1789" # not named `date`: that would shadow the datetime.date import used by random_dates
date_int = string_to_char(date_str, input_char2id, max_length=max_input_length)
date_tensor = tf.convert_to_tensor([date_int]) # Wrapped in a list, since the model expects a batch of inputs
prediction = np.argmax(model.predict(date_tensor), axis=-1) 

In [None]:
prediction.shape

(1, 10)

In [None]:
"".join([output_vocab[x] for x in prediction[0]])

'1789-07-14'

**Second Version** Let's try a more advanced model (for learning purposes)

Instead of feeding the decoder a simple repetition of the encoder's output vector, we can feed it the target sequence, shifted by one time step to the right. This way, at each time step the decoder will know what the previous target character was. This should help us tackle more complex sequence-to-sequence problems.

Now let's create the decoder's inputs (for training, validation and testing). The sos (start of sentence) token will be represented by the ID len(output_vocab) + 1 = 12, which lies outside the range 0-10 used by the output characters.
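
The shift is easier to follow on plain lists first. Below is a minimal sketch (`shift_right` is a made-up helper for this illustration) using one of the example dates: at every time step the decoder input is the character that came just before the one it has to predict.

```python
def shift_right(targets, sos_id):
    """Prepend the start token and drop the last element of every target sequence."""
    return [[sos_id] + seq[:-1] for seq in targets]

output_vocab = "0123456789-"
sos_id = len(output_vocab) + 1                        # 12, the same value the notebook uses
target = [output_vocab.index(ch) for ch in "7075-09-20"]
decoder_input = shift_right([target], sos_id)[0]

for step in range(3):
    print(f"step {step}: decoder sees {decoder_input[step]}, should predict {target[step]}")
```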

In [None]:
sos_id = len(output_vocab) + 1

def shifted_output_sequences(Y):
    sos_tokens = tf.fill(dims=(len(Y), 1), value=sos_id)
    return tf.concat([sos_tokens, Y[:, :-1]], axis=1)

X_train_decoder = shifted_output_sequences(y_train)
X_valid_decoder = shifted_output_sequences(y_valid)
X_test_decoder = shifted_output_sequences(y_test)

In [None]:
X_train_decoder[0] # 12 is the SOS 

<tf.Tensor: shape=(10,), dtype=int32, numpy=array([12,  7,  3,  7,  2, 10,  1,  1, 10,  2], dtype=int32)>
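
The effect of the shift is easier to see on a tiny array. The sketch below re-implements the same idea in NumPy (the helper name `shift_right` and the toy targets are assumptions for illustration, not part of the notebook): at time step t the decoder input is the target from step t-1, and step 0 gets the SOS ID.

```python
import numpy as np

def shift_right(Y, sos_id):
    """Prepend an SOS column and drop the last time step of every sequence."""
    sos_column = np.full((len(Y), 1), sos_id, dtype=Y.dtype)
    return np.concatenate([sos_column, Y[:, :-1]], axis=1)

# Toy targets: 2 sequences of 4 character IDs each (made-up values)
Y_toy = np.array([[7, 3, 7, 2],
                  [1, 9, 4, 5]], dtype=np.int32)

decoder_in = shift_right(Y_toy, sos_id=12)
print(decoder_in)
```

Each row keeps its length, so the decoder input and the target stay aligned step by step.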

This is no longer a simple sequential model, so it's time to use the functional API. We need an Input layer for each input.

An LSTM layer can return its final internal states. The returned states can be used to resume the LSTM execution later, or to initialize another LSTM. This setting is commonly used in the encoder-decoder sequence-to-sequence model, where the encoder's final state is used as the initial state of the decoder.
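
As a minimal sketch of this mechanism, the snippet below runs an LSTM with `return_state=True` on random toy data and passes the two returned states to a second LSTM. The shapes and the 8 units are arbitrary choices for illustration, not the notebook's settings.

```python
import numpy as np
import tensorflow as tf

# Toy batch: 2 sequences, 5 time steps, 4 features per step (random values)
x = tf.random.uniform((2, 5, 4))

# return_state=True makes the layer return (output, state_h, state_c)
toy_encoder = tf.keras.layers.LSTM(8, return_state=True)
output, state_h, state_c = toy_encoder(x)
print(output.shape, state_h.shape, state_c.shape)

# Without return_sequences, the output is the last hidden state
print(np.allclose(output.numpy(), state_h.numpy()))

# The two states can initialize a second LSTM with the same number of units
toy_decoder = tf.keras.layers.LSTM(8, return_sequences=True)
decoded = toy_decoder(tf.random.uniform((2, 3, 4)),
                      initial_state=[state_h, state_c])
print(decoded.shape)
```

Since the output duplicates `state_h` here, the encoder below discards it with `_` and keeps only the two states.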

In [None]:
# ENCODER
encoder_input = tf.keras.layers.Input(shape=[None], dtype=tf.int32)
encoder_embedding = tf.keras.layers.Embedding(
                                input_dim=len(input_vocab) + 1,
                                output_dim=embedding_size)(encoder_input)

_, encoder_state_h, encoder_state_c = tf.keras.layers.LSTM(
                                hidden_units, return_state=True)(encoder_embedding)

# DECODER
decoder_input = tf.keras.layers.Input(shape=[None], dtype=tf.int32)
decoder_embedding = tf.keras.layers.Embedding(
                                input_dim=len(output_vocab) + 2, # IDs run from 0 to sos_id = len(output_vocab) + 1
                                output_dim=embedding_size)(decoder_input)

decoder_lstm = tf.keras.layers.LSTM(hidden_units, return_sequences=True)(
                                        decoder_embedding, initial_state=[encoder_state_h, encoder_state_c])
decoder_output = tf.keras.layers.Dense(len(output_vocab) + 1,
                                    activation="softmax")(decoder_lstm)

# ENCODER-DECODER
model = tf.keras.models.Model(inputs=[encoder_input, decoder_input], outputs=[decoder_output])

model.compile(loss="sparse_categorical_crossentropy", optimizer="adam",
              metrics=["accuracy"])
history = model.fit([X_train, X_train_decoder], y_train, epochs=10,
                    validation_data=([X_valid, X_valid_decoder], y_valid))

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


Now, within 8 epochs, we reach 100% accuracy on the validation dataset. (Don't forget to use early stopping!)
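
One way to add early stopping is the Keras `EarlyStopping` callback. The sketch below is self-contained and uses a throwaway model on random toy data (the names `toy_model`, `X_toy` and `y_toy` are assumptions for illustration); the `patience` value is an arbitrary choice.

```python
import numpy as np
import tensorflow as tf

# Toy data only, to make the sketch runnable; not the notebook's dataset
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 4)).astype("float32")
y_toy = (X_toy.sum(axis=1) > 0).astype("int32")

toy_model = tf.keras.Sequential([
    tf.keras.layers.Dense(8, activation="relu"),
    tf.keras.layers.Dense(2, activation="softmax"),
])
toy_model.compile(loss="sparse_categorical_crossentropy", optimizer="adam",
                  metrics=["accuracy"])

# Stop once val_loss has not improved for `patience` epochs,
# then restore the weights from the best epoch
early_stopping = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=2, restore_best_weights=True)

toy_history = toy_model.fit(X_toy, y_toy, epochs=20, validation_split=0.2,
                            callbacks=[early_stopping], verbose=0)
print("epochs run:", len(toy_history.history["val_loss"]))
```

In the notebook, the same callback would be passed as `callbacks=[early_stopping]` to the existing `model.fit` call, which already provides `validation_data`.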

**Third Version** Same as before, but using the decoder from the TensorFlow Addons seq2seq module (which includes sampling, beam search and attention).

In [None]:
# ENCODER (Stays the same as before)
encoder_input = tf.keras.layers.Input(shape=[None], dtype=tf.int32)
encoder_embedding = tf.keras.layers.Embedding(
                                input_dim=len(input_vocab) + 1,
                                output_dim=embedding_size)(encoder_input)
_, encoder_state_h, encoder_state_c = tf.keras.layers.LSTM(
                                hidden_units, return_state=True)(encoder_embedding)

# DECODER 
decoder_input = tf.keras.layers.Input(shape=[None], dtype=tf.int32)
decoder_embedding = tf.keras.layers.Embedding(
                                input_dim=len(output_vocab) + 2, # IDs run from 0 to sos_id = len(output_vocab) + 1
                                output_dim=embedding_size)(decoder_input)
# (This part changes! A LOT)
# Instead of using our decoder_lstm from the previous code, we use the BasicDecoder
# Inputs: RNNCell (LSTM, GRU or simple RNN cell)
#         Sampler - TrainingSampler reads the decoder inputs as they are (teacher forcing)
#         Output layer
# Outputs: final outputs, final state, final sequence lengths
decoder_cell = tf.keras.layers.LSTMCell(hidden_units)
sampler = tfa.seq2seq.sampler.TrainingSampler()
output_layer = tf.keras.layers.Dense(len(output_vocab) + 1, activation="softmax")
decoder = tfa.seq2seq.basic_decoder.BasicDecoder(decoder_cell, sampler,
                                                 output_layer=output_layer)
# Calling the decoder plays the same role as calling decoder_lstm in the previous code
decoder_outputs, _, _ = decoder(decoder_embedding,
                                initial_state=[encoder_state_h, encoder_state_c])
# The result holds rnn_output and sample_id; keep rnn_output (the output layer's probabilities)
decoder_outputs = decoder_outputs.rnn_output

# ENCODER-DECODER
model = tf.keras.models.Model(inputs=[encoder_input, decoder_input],outputs=[decoder_outputs])

model.compile(loss="sparse_categorical_crossentropy", optimizer="adam",
              metrics=["accuracy"])
history = model.fit([X_train, X_train_decoder], y_train, epochs=10,
                    validation_data=([X_valid, X_valid_decoder], y_valid))

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


**Fourth Version** This is a task for you. Add attention using the Subclassing API and tfa.seq2seq.AttentionWrapper, similar to the translation example above.