<a href="https://colab.research.google.com/github/JosephBless/DL/blob/main/Seq2seq_Translation_LSTM_with_Attention.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Machine Translation with LSTM and attention

This notebook showcases the attention layer using a seq2seq model trained to translate from English to French. The model is composed of a bidirectional LSTM as the encoder and an LSTM as the decoder, and the outputs of both are fed to an attention layer. The dataset comes from Udacity's repository, and for text preprocessing SentencePiece is used to convert the input text into sub-words.

You can choose either to train everything from scratch yourself or to load the already trained models and just test them. If you choose to train the model yourself, in my experience it takes more than six hours to achieve 95% accuracy on the validation set using a GPU. It will take much longer on a CPU or a TPU; I do not recommend those options.

If you are running this notebook on Google Colab, first remember to change the runtime type to GPU. There is also a good chance that the default 12GB of RAM assigned to your notebook will not be enough. Unfortunately, there seems to be no way to request more RAM unless the notebook crashes due to insufficient memory, which makes a popup appear in the lower left corner offering you more memory. By clicking on it, you'll get 25GB of RAM, which should be sufficient. But remember, you then have to start all over and run the cells again. In my experience, step 5, where the text is mapped into sequences of integers, is the one that crashes the VM.

Finally, the code blocks marked as **[TRAINING]** should be executed only if you want to follow the training path. If you want to load the trained models, skip them.

# 1. Clone the attention layer repository

This is to add the attention layer to Keras since at this moment it is not part of the project.

In [None]:
!git clone https://github.com/ziadloo/attention_keras.git

# 2. Download the dataset

The dataset is composed of 137860 sentences in both English and French. Each sentence is written in one line and corresponding lines of the two files are the same sentences in different languages.
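
As a rough illustration of this line-aligned layout, the sketch below pairs up two tiny in-memory files. The sentences and the variable names `en_file`, `fr_file`, and `pairs` are made up for the example; the real files are downloaded in the next cell.

```python
import io

# Two tiny in-memory stand-ins for small_vocab_en and small_vocab_fr
# (the sentences are illustrative, not taken from the dataset)
en_file = io.StringIO("new jersey is sometimes quiet .\nparis is nice .\n")
fr_file = io.StringIO("new jersey est parfois calme .\nparis est agréable .\n")

# Line i of one file is the translation of line i of the other
pairs = [(en.strip(), fr.strip()) for en, fr in zip(en_file, fr_file)]

for en, fr in pairs:
    print(en, "->", fr)
```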

In [None]:
!wget -P ./attention_keras/data https://github.com/udacity/deep-learning/raw/master/language-translation/data/small_vocab_en
!wget -P ./attention_keras/data https://github.com/udacity/deep-learning/raw/master/language-translation/data/small_vocab_fr

# 3. Install the SentencePiece library

[SentencePiece](https://github.com/google/sentencepiece/blob/master/python/README.md) is a great library for converting text into sub-words. Sub-words are the preferred way of tokenizing text since they sit somewhere between character-level tokenization and word-level tokenization.
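
To get a feel for the three granularities, here is a toy sketch. It uses a hand-picked, hypothetical vocabulary and a simple greedy longest-match rule; this is not how SentencePiece's unigram model segments text, it only illustrates that sub-words fall between characters and words.

```python
# Toy comparison of tokenization granularities. The vocabulary below is
# hypothetical and hand-picked; SentencePiece learns its own from data.
def greedy_subword_tokenize(text, vocab):
    """Split text into the longest vocabulary pieces, left to right."""
    pieces = []
    i = 0
    while i < len(text):
        # Try the longest possible piece first, fall back to one character
        for j in range(len(text), i, -1):
            if text[i:j] in vocab or j == i + 1:
                pieces.append(text[i:j])
                i = j
                break
    return pieces

sentence = "freezing rainy"
toy_vocab = {"freez", "ing", "rain", "y", " "}

char_tokens = list(sentence)
word_tokens = sentence.split(" ")
subword_tokens = greedy_subword_tokenize(sentence, toy_vocab)

print(len(char_tokens), char_tokens)
print(len(word_tokens), word_tokens)
print(len(subword_tokens), subword_tokens)
```

The character-level split yields many tokens from a tiny vocabulary, the word-level split yields few tokens from a large vocabulary, and the sub-word split lands in between.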

In [None]:
!pip install sentencepiece

# 4. Configure Python's path

This adds the cloned repository to Python's search path so that the `attention_keras` library we just downloaded can be imported.

In [None]:
import os
import sys
base_dir = os.path.join(os.getcwd(), "attention_keras")
sys.path.insert(0, base_dir)

# 5. Train the sub-word mapping

Thanks to SentencePiece, it is easy to build a sub-word mapping for our dataset. This process gives us a model that maps the English sub-words in our dataset to integers that we can use in our machine learning model, and a separate model that does the same for French.

Once run, four files will be generated in the `data` folder (a `.model` and a `.vocab` file for each language), which we can load back into SentencePiece to map our input sentences to integers.

**[TRAINING]**

In [None]:
import sentencepiece as spm

target_vocab_size_en = 400
target_vocab_size_fr = 600

spm.SentencePieceTrainer.Train(
    f" --input={base_dir}/data/small_vocab_en --model_type=unigram --hard_vocab_limit=false" +
    f" --model_prefix={base_dir}/data/en --vocab_size={target_vocab_size_en}")
spm.SentencePieceTrainer.Train(
    f" --input={base_dir}/data/small_vocab_fr --model_type=unigram --hard_vocab_limit=false" +
    f" --model_prefix={base_dir}/data/fr --vocab_size={target_vocab_size_fr}")

This block loads the sub-word mapping into memory. Make sure you run it whether you want to train the model yourself or not.

In [None]:
import sentencepiece as spm

sp_en = spm.SentencePieceProcessor()
sp_en.Load(os.path.join(base_dir, "data", 'en.model'))

sp_fr = spm.SentencePieceProcessor()
sp_fr.Load(os.path.join(base_dir, "data", 'fr.model'))

Now that we have our models loaded into `sp_en` and `sp_fr`, we can read the text files and convert them to sequences of integers. Once we are done with this phase, we won't need the actual text anymore.

The `pad_sequences` function from `keras` is also used to make all the samples of the same length. Since this is a small dataset, all the samples are made as long as the longest one in the dataset.

We will need two extra tokens for the input language (English in our case) and three for the output language (French). The extra tokens are `<end>`, `<empty>`, and `<start>`. Each sample sequence has an `<end>` token appended to mark its end. Samples shorter than the longest one are padded with `<empty>`. `<start>` is used only in the output samples, since the decoder needs a `<start>` token to kick it off. Because the decoder input carries an extra `<start>` at its beginning, the output samples are padded to two tokens longer than the longest French sentence (to accommodate both `<start>` and `<end>`), while the input samples are only one token longer than the longest English sentence, since we only append `<end>`.

Although I named them `<end>`, `<empty>`, and `<start>`, they are never used in these text forms, only as integer IDs. One last thing: `<end>` and `<empty>` might end up having the same IDs in English and French, but that is not necessarily true, so I keep a separate version of each token per language.
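
The padding and shifting scheme can be sketched in pure Python as follows. The token IDs and the helper `pad_post` are hypothetical and chosen for readability; the notebook itself derives the IDs from the SentencePiece vocabulary sizes and uses `pad_sequences` and `np.roll`.

```python
# Pure-Python sketch of the padding and shifting scheme described above.
# The token IDs are hypothetical stand-ins for <end>, <empty>, and <start>.
END, EMPTY, START = 100, 101, 102

def pad_post(seq, length, value):
    """Right-pad seq with value up to the given length."""
    return seq + [value] * (length - len(seq))

french_ids = [[5, 6, 7], [8, 9]]            # two already-encoded toy samples
with_end = [s + [END] for s in french_ids]  # append <end> to each sample
max_len = max(len(s) for s in with_end)     # longest sample including <end>

# Targets: padded to max_len + 1 so the shifted input has room for <start>
targets = [pad_post(s, max_len + 1, EMPTY) for s in with_end]
# Decoder inputs: the targets shifted right by one, starting with <start>
decoder_inputs = [[START] + t[:-1] for t in targets]

for x, y in zip(decoder_inputs, targets):
    print("input :", x)
    print("target:", y)
```

At every position the decoder input holds the token that precedes the target token, which is what lets the decoder learn to predict the next token from the previous ones.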

In [None]:
import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences
# Use the public Keras API rather than the private tensorflow.python package
from tensorflow.keras.utils import to_categorical

with open(os.path.join(base_dir, 'data', 'small_vocab_en'),
          'r', encoding='utf-8') as file:
  en_text = file.read().split("\n")

with open(os.path.join(base_dir, 'data', 'small_vocab_fr'),
          'r', encoding='utf-8') as file:
  fr_text = file.read().split("\n")

train_en_X = []
train_fr_X = []
train_fr_Y = []

en_max_len = 0
fr_max_len = 0

vocab_size_en = sp_en.GetPieceSize()
vocab_size_fr = sp_fr.GetPieceSize()

# Assuming three extra tokens: <end>: #vocab_size_en | #vocab_size_fr,
# <empty>: #vocab_size_en+1 | #vocab_size_fr+1, and <start>: #vocab_size_fr+2

end_token_id_en = vocab_size_en
empty_token_id_en = vocab_size_en + 1
end_token_id_fr = vocab_size_fr
empty_token_id_fr = vocab_size_fr + 1
start_token_id_fr = vocab_size_fr + 2

# The input text only needs two extra tokens while the output needs 3
vocab_size_en = vocab_size_en + 2
vocab_size_fr = vocab_size_fr + 3


for i in range(len(en_text)):
  en_seq = sp_en.EncodeAsIds(en_text[i].strip()) + [end_token_id_en]
  en_max_len = max(en_max_len, len(en_seq))
  train_en_X.append(en_seq)

  fr_seq = sp_fr.EncodeAsIds(fr_text[i].strip()) + [end_token_id_fr]
  fr_max_len = max(fr_max_len, len(fr_seq))
  train_fr_X.append(fr_seq)

# Cleaning up the memory (we don't need them anymore)
#en_text = []
#fr_text = []

# Padding all the samples with <empty> token to make them all of the same length
# equal to the longest one
train_en_X = pad_sequences(train_en_X, maxlen=en_max_len,
                           padding="post", value=empty_token_id_en)
# maxlen is fr_max_len+1 since we need to accommodate <start>
train_fr_X = pad_sequences(train_fr_X, maxlen=fr_max_len+1,
                           padding="post", value=empty_token_id_fr)

# Converting the train_fr_Y to a one-hot vector needed by the training phase as
# the output
train_fr_Y = to_categorical(train_fr_X, num_classes=vocab_size_fr)

# Moving the last <empty> to the first position in each input sample
train_fr_X = np.roll(train_fr_X, 1, axis=-1)
# Changing the first token in each input sample to <start>
train_fr_X[:, 0] = start_token_id_fr

fr_max_len = fr_max_len + 1

# 6. Custom metrics

These are two custom metrics that I think represent accuracy of a translation model better.

First, there's `masked_categorical_accuracy`, which acts just like `categorical_accuracy` but with a mask. It is a better measure than the unmasked version because the unmasked version also gives credit for predicting the `<empty>` tokens at the end of the padded sequences. Those are rather easy to learn, since they are all the same single token, and they are pruned off when the output is mapped back to text. This metric excludes them from the reported accuracy.

Second, we have `exact_matched_accuracy`. This metric counts a sample as learned only if all of its tokens are predicted without a miss. So the reported percentage is the ratio of sentences translated completely correctly, not of individual tokens.
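
The difference between the two metrics is easiest to see on a toy example. The sketch below works directly on token IDs in plain Python rather than on one-hot tensors, and computes each metric over the whole toy set instead of per batch; `EMPTY` and the function names are hypothetical stand-ins for the Keras versions defined in the next cell.

```python
# Pure-Python sketch of the two metrics on token IDs (not one-hot vectors).
# EMPTY is a hypothetical ID standing in for the <empty> padding token.
EMPTY = 0

def masked_accuracy(y_true, y_pred, mask_id):
    """Fraction of non-padding tokens predicted correctly."""
    correct = total = 0
    for true_seq, pred_seq in zip(y_true, y_pred):
        for t, p in zip(true_seq, pred_seq):
            if t != mask_id:
                total += 1
                correct += int(t == p)
    return correct / total

def exact_match_accuracy(y_true, y_pred, mask_id):
    """Fraction of sequences whose non-padding tokens are all correct."""
    matched = 0
    for true_seq, pred_seq in zip(y_true, y_pred):
        matched += all(t == p for t, p in zip(true_seq, pred_seq)
                       if t != mask_id)
    return matched / len(y_true)

y_true = [[4, 5, 6, EMPTY], [7, 8, EMPTY, EMPTY]]
y_pred = [[4, 5, 9, EMPTY], [7, 8, 3, 3]]

print(masked_accuracy(y_true, y_pred, EMPTY))       # 4 of 5 real tokens
print(exact_match_accuracy(y_true, y_pred, EMPTY))  # 1 of 2 sentences
```

Note that the wrong predictions in the padded positions of the second sample do not count against it, while the single wrong token in the first sample makes the whole sentence a miss for the exact-match metric.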

In [None]:
import tensorflow.keras.backend as K
# Import from the public Keras API to avoid mixing it with tensorflow.python
from tensorflow.keras.metrics import MeanMetricWrapper

class MaskedCategoricalAccuracy(MeanMetricWrapper):

    def __init__(self, mask_id, name='masked_categorical_accuracy', dtype=None):
        super(MaskedCategoricalAccuracy, self).__init__(
            masked_categorical_accuracy, name, dtype=dtype, mask_id=mask_id)


def masked_categorical_accuracy(y_true, y_pred, mask_id):
    true_ids = K.argmax(y_true, axis=-1)
    pred_ids = K.argmax(y_pred, axis=-1)
    maskBool = K.not_equal(true_ids, mask_id)
    maskInt64 = K.cast(maskBool, 'int64')
    maskFloatX = K.cast(maskBool, K.floatx())

    count = K.sum(maskFloatX)
    equals = K.equal(true_ids * maskInt64,
                     pred_ids * maskInt64)
    # Named `correct` rather than `sum` to avoid shadowing the Python built-in
    correct = K.sum(K.cast(equals, K.floatx()) * maskFloatX)
    return correct / count


class ExactMatchedAccuracy(MeanMetricWrapper):

    def __init__(self, mask_id, name='exact_matched_accuracy', dtype=None):
        super(ExactMatchedAccuracy, self).__init__(
            exact_matched_accuracy, name, dtype=dtype, mask_id=mask_id)


def exact_matched_accuracy(y_true, y_pred, mask_id):
    true_ids = K.argmax(y_true, axis=-1)
    pred_ids = K.argmax(y_pred, axis=-1)

    maskBool = K.not_equal(true_ids, mask_id)
    maskInt64 = K.cast(maskBool, 'int64')

    diff = (true_ids - pred_ids) * maskInt64
    matches = K.cast(K.not_equal(diff, K.zeros_like(diff)), 'int64')
    matches = K.sum(matches, axis=-1)
    matches = K.cast(K.equal(matches, K.zeros_like(matches)), K.floatx())

    return K.mean(matches)

# 7. Defining the models

There are three models to define: the training model, the encoder model, and the decoder model. The latter two are used after the training phase for text generation.

If you want to load the models from disk, remember that these models all share the same layers and weights, so loading them completely is not that straightforward. The easiest way is to define the models as if you were doing so for the first time and then load the weights for the training model (just the weights, not the model). Since the training model holds the weights for all the layers, doing so loads the weights for the encoder and decoder as well.

In [None]:
from tensorflow.keras import Input, layers, models
from layers.attention import AttentionLayer

hidden_dim = 128

# Encoder input (English)
input_en = Input(batch_shape=(None, en_max_len), name='input_en')

# English embedding layer
embedding_en = layers.Embedding(vocab_size_en, hidden_dim, name='embedding_en')
embedded_en = embedding_en(input_en)

# Encoder RNN (LSTM) layer
encoder_lstm = layers.Bidirectional(
                  layers.LSTM(hidden_dim,
                              return_sequences=True, return_state=True),
                  name="encoder_lstm")
(encoded_en,
  forward_h_en, forward_c_en,
  backward_h_en, backward_c_en) = encoder_lstm(embedded_en)

# Decoder input (French)
input_fr = Input(batch_shape=(None, None), name='input_fr')

# French embedding layer
embedding_fr = layers.Embedding(vocab_size_fr, hidden_dim, name='embedding_fr')
embedded_fr = embedding_fr(input_fr)

state_h_en = layers.concatenate([forward_h_en, backward_h_en])
state_c_en = layers.concatenate([forward_c_en, backward_c_en])

# Decoder RNN (LSTM) layer
decoder_lstm = layers.LSTM(hidden_dim * 2, return_sequences=True,
                           return_state=True, name="decoder_lstm")
(encoded_fr,
  forward_h_fr, forward_c_fr) = decoder_lstm(embedded_fr,
                 initial_state=[state_h_en, state_c_en])

# Attention layer
attention_layer = AttentionLayer(name='attention_layer')
attention_out, attention_states = attention_layer({"values": encoded_en,
                                                   "query": encoded_fr})

# Concatenating the decoder output with attention output
rnn_output = layers.concatenate([encoded_fr, attention_out], name="rnn_output")

# Dense layer
dense_layer0 = layers.Dense(2048, activation='relu', name='dense_0')
dl0 = dense_layer0(rnn_output)

dense_layer1 = layers.Dense(1024, activation='relu', name='dense_1')
dl1 = dense_layer1(dl0)

dense_layer2 = layers.Dense(512, activation='relu', name='dense_2')
dl2 = dense_layer2(dl1)

dl2 = layers.Dropout(0.4)(dl2)

dense_layer3 = layers.Dense(vocab_size_fr, activation='softmax', name='dense_3')
dense_output = dense_layer3(dl2)

training_model = models.Model([input_en, input_fr], dense_output)
training_model.summary()

training_model.compile(optimizer='adam',
                       loss='categorical_crossentropy',
                       metrics=[MaskedCategoricalAccuracy(empty_token_id_fr),
                                ExactMatchedAccuracy(empty_token_id_fr)])

Now, the generative models (the encoder and the decoder).

It is worth mentioning that `attention_state` is made part of the output for the decoder only to be able to extract the attention scores to plot them. If you do not want to plot the attention scores, you can exclude them from the output.
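
To make the attention scores concrete: for each decoder step they are typically a set of weights over the encoder positions that sum to one. The sketch below is a minimal pure-Python illustration that assumes dot-product scoring; the `AttentionLayer` cloned in step 1 may score differently, and the helper names `softmax` and `attend` are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, values):
    """Return (context, weights) for one decoder step.

    Uses dot-product scoring as an illustrative assumption; the layer
    used in this notebook may score differently.
    """
    scores = [sum(q * v for q, v in zip(query, value)) for value in values]
    weights = softmax(scores)
    dim = len(values[0])
    context = [sum(w * value[d] for w, value in zip(weights, values))
               for d in range(dim)]
    return context, weights

# Three hypothetical encoder outputs and one decoder output (dimension 2)
encoder_outputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
decoder_output = [2.0, 0.0]

context, weights = attend(decoder_output, encoder_outputs)
print(weights)  # one weight per encoder position, summing to 1
```

Stacking one such weight vector per generated token gives the alignment matrix plotted in step 10.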

In [None]:
# The encoder model that encodes English input into encoded output and states
encoder_model = models.Model([input_en],
                             [encoded_en,
                              state_h_en, state_c_en])
encoder_model.summary()


# The decoder model, to generate the French tokens (in integer form)
input_h = layers.Input(batch_shape=(None, hidden_dim * 2),
                       name='input_h')
input_c = layers.Input(batch_shape=(None, hidden_dim * 2),
                       name='input_c')

(decoder_output,
  output_h,
  output_c) = decoder_lstm(embedded_fr,
                           initial_state=[input_h, input_c])

input_encoded_en = layers.Input(batch_shape=(None, en_max_len, hidden_dim * 2),
                                name='input_encoded_en')

attention_out, attention_state = attention_layer({"values": input_encoded_en,
                                                  "query": decoder_output})

generative_output = layers.concatenate([decoder_output,
                                        attention_out],
                                       name="generative_output")

g0 = dense_layer0(generative_output)
g1 = dense_layer1(g0)
g2 = dense_layer2(g1)
dense_output = dense_layer3(g2)

decoder_model = models.Model([input_encoded_en, input_fr,
                              input_h, input_c],
                             [dense_output, attention_state,
                              output_h, output_c])
decoder_model.summary()

# 8. Training the model / loading the weights

If you want, you can train the model yourself. To save you time, though, I've included the trained weights, which you can simply load. If you decide to train the model yourself, in my experience 170 epochs are enough. As a disclaimer, each epoch took around 130 seconds on a GPU, and it takes a lot longer on a CPU or even a TPU.
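
As a quick sanity check on these figures, the arithmetic can be sketched as follows. It only multiplies the two quoted numbers; actual timings will vary with hardware.

```python
# Back-of-the-envelope training time from the figures quoted above
epochs = 170
seconds_per_epoch = 130

total_hours = epochs * seconds_per_epoch / 3600
print(f"{total_hours:.1f} hours")
```

This comes to a little over six hours, in line with the estimate given in the introduction.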

**[TRAINING]**

In [None]:
from tensorflow.keras.callbacks import EarlyStopping

pocket = EarlyStopping(monitor='val_exact_matched_accuracy', min_delta=0.001,
                       patience=10, verbose=1, mode='max',
                       restore_best_weights = True)

history = training_model.fit(x=[train_en_X, train_fr_X], y=train_fr_Y, batch_size=786,
                             epochs=200, verbose=1, validation_split=0.3, shuffle=True,
                             workers=3, use_multiprocessing=True, callbacks=[pocket])

Saving the model weights to disk.

**[TRAINING]**

In [None]:
training_model.save_weights(os.path.join(base_dir, "data", "lstm_weights.h5"))

This block downloads the saved model weights. Running it too soon after the previous cell will lead to an error, so give it some time first. If it fails anyway, just try again.

**[TRAINING]**

In [None]:
from google.colab import files

# This is code to download the model weights into your computer
files.download(os.path.join(base_dir, "data", "lstm_weights.h5"))

This block plots the history of the loss value and the two accuracy metrics over the course of training, for both the training set and the validation set. You can run it only if you trained the model yourself.

**[TRAINING]**

In [None]:
import matplotlib.pyplot as plt
from matplotlib.pyplot import figure

figure(num=None, figsize=(11, 7))

# Plot training & validation loss values
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('Model loss')
plt.ylabel('Loss')
plt.xlabel('Epoch')
plt.legend(['Train', 'Validation'], loc='upper right')
plt.show()

figure(num=None, figsize=(11, 7))

# Plot training & validation masked_categorical_accuracy values
plt.plot(history.history['masked_categorical_accuracy'])
plt.plot(history.history['val_masked_categorical_accuracy'])
plt.title('Model accuracy')
plt.ylabel('Accuracy')
plt.xlabel('Epoch')
plt.legend(['Train', 'Validation'], loc='lower right')
plt.show()

figure(num=None, figsize=(11, 7))

# Plot training & validation exact_matched_accuracy values
plt.plot(history.history['exact_matched_accuracy'])
plt.plot(history.history['val_exact_matched_accuracy'])
plt.title('Model exact match accuracy')
plt.ylabel('Accuracy')
plt.xlabel('Epoch')
plt.legend(['Train', 'Validation'], loc='lower right')
plt.show()

This block loads the weights from the repo. **Run it if you decided to load the weights instead of training the model yourself**. If you ran it by mistake after training your own model, don't worry: it will load the weights that you've just saved.

In [None]:
training_model.load_weights(
    os.path.join(base_dir, "data", "lstm_weights.h5"))

# 9. Evaluate the model using the whole dataset

In this block, we evaluate the model on the whole dataset. This is especially useful if you decided to load the model instead of training it, so you can see its accuracy yourself.

In [None]:
results = training_model.evaluate(x=[train_en_X, train_fr_X], y=train_fr_Y,
                                  batch_size=786, verbose=1,
                                  workers=1, use_multiprocessing=False)

print('Test loss:', results[0])
print('Test masked categorical accuracy:', results[1])
print('Test exact matched accuracy:', results[2])

# 10. Testing the model with your input and plotting the alignment matrix

In this block, you can type in your string in English to be translated to French. At the end, as a bonus, you'll see how the model's attention layer has mapped the words from English to French (also known as the alignment matrix).

Just a disclaimer: this model is trained on a small dataset (the 137860 sentence pairs from step 2). Do not expect much from it! The provided English sentence is taken from the dataset, so it should be translated correctly, but your custom ones might not result in a good translation.

In [None]:
import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences
import math
import sys

# The input English string
english_string = "the united states is never freezing during november , but the united states is sometimes rainy in winter ."

# First, let's tokenize the English string, then pad it
english_tokens = sp_en.EncodeAsIds(english_string.strip()) + [end_token_id_en]
english_tokens = pad_sequences([english_tokens], maxlen=en_max_len,
                               padding="post", value=empty_token_id_en)

# The encoder, we only need to use it once per each English string
(encoded_en_test,
  state_h_en_test, state_c_en_test) = encoder_model.predict(english_tokens)

# In order to find a better translation, we are using Beam search
beam_search_list = [{
  "decoder_input": {
    "input_encoded_en": encoded_en_test,
    "input_fr": np.array([[start_token_id_fr]]),
    "input_h": state_h_en_test,
    "input_c": state_c_en_test
  },
  "score": 0.0,
  "parent_node": None,
  "depth": 0,
  "attention_weights": None,
}]
ended_branches = []

beam_size = 10

# We are generating up to fr_max_len tokens
for i in range(fr_max_len):
  new_beam_candidates = []
  # Predict the next token for each member of the list
  for beam in beam_search_list:
    # Use the decoder to predict the next token using the previously
    # predicted token
    (output,
      attention_out,
      state_h_en_test,
      state_c_en_test) = decoder_model.predict(beam["decoder_input"])
    # Find the top beam_size candidates
    top_k = np.argpartition(output[0, 0, :], -beam_size)[-beam_size:]
    # For each candidate, put it in the list to predict the next token for it
    for k in top_k:
      if output[0, 0, k].item() > 0.0:
        log_k = math.log(output[0, 0, k].item())
      else:
        log_k = -sys.float_info.max

      if k == end_token_id_fr:
        ended_branches.append({
          "decoder_input": {
            "input_encoded_en": encoded_en_test,
            "input_fr": np.array([[k]]),
            "input_h": state_h_en_test,
            "input_c": state_c_en_test,
          },
          "score": beam["score"] + log_k,
          "parent_node": beam,
          "depth": beam["depth"] + 1,
          "attention_weights": attention_out,
        })
      else:
        new_beam_candidates.append({
          "decoder_input": {
            "input_encoded_en": encoded_en_test,
            "input_fr": np.array([[k]]),
            "input_h": state_h_en_test,
            "input_c": state_c_en_test,
          },
          "score": beam["score"] + log_k,
          "parent_node": beam,
          "depth": beam["depth"] + 1,
          "attention_weights": attention_out,
        })

  # Keeping only the top beam_size in the list
  beam_search_list = sorted(new_beam_candidates,
                            key=lambda b: b["score"],
                            reverse=True)[0:beam_size]

# Now that we are done with our beam search, let's take the best score and
# detokenize it
beam_node = sorted(beam_search_list + ended_branches,
                   key=lambda b: b["score"] / b["depth"],
                   reverse=True)[0]

# Trace the best beam back to the parent node
all_french_tokens = []
attention_weights = []
while beam_node["parent_node"] is not None:
    all_french_tokens.append(
        beam_node["decoder_input"]["input_fr"][0, 0].item())
    attention_weights.append(beam_node["attention_weights"])
    beam_node = beam_node["parent_node"]

# We traced from tail to head, so we need to reverse the order to have it the right way
all_french_tokens.reverse()
attention_weights.reverse()

# If there's any token out of the vocab, exclude it. This includes `<end>`,
# `<empty>`, and <start> tokens
french_tokens = [t for t in all_french_tokens if t < sp_fr.get_piece_size()]

# Voila!
french_string = sp_fr.DecodeIds(french_tokens)

print("The input English string: ", english_string)
print("The output French string: ", french_string)
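
The beam search in the cell above can be hard to follow with all the decoder plumbing around it. Below is a stripped-down sketch of the same idea: expand every live hypothesis, set aside the ones that emit `<end>`, keep the top `beam_size` of the rest, and finally pick the hypothesis with the best length-normalized log-probability. The function `toy_next_token_probs` is a hypothetical stand-in for `decoder_model.predict`, and its probabilities are invented for the example.

```python
import math

# Hypothetical next-token model: maps a prefix (tuple of tokens) to a
# probability distribution. It stands in for decoder_model.predict.
def toy_next_token_probs(prefix):
    if len(prefix) == 0:
        return {"a": 0.6, "b": 0.4}
    if prefix[-1] == "a":
        return {"a": 0.1, "b": 0.3, "<end>": 0.6}
    return {"a": 0.2, "b": 0.1, "<end>": 0.7}

def beam_search(next_probs, beam_size, max_len, end_token="<end>"):
    """Return the finished or live hypothesis with the best
    length-normalized log-probability."""
    beams = [((), 0.0)]  # (tokens, summed log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            for token, prob in next_probs(tokens).items():
                entry = (tokens + (token,), score + math.log(prob))
                if token == end_token:
                    # Hypotheses that emit <end> leave the beam
                    finished.append(entry)
                else:
                    candidates.append(entry)
        # Keep only the beam_size best live hypotheses
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
    # Normalize by length so longer hypotheses are not unfairly penalized
    return max(finished + beams, key=lambda c: c[1] / len(c[0]))

best_tokens, best_score = beam_search(toy_next_token_probs,
                                      beam_size=2, max_len=4)
print(best_tokens, best_score)
```

Dividing the score by the hypothesis length mirrors the `b["score"] / b["depth"]` key used in the notebook when choosing the final translation.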

Plotting the alignment matrix

In [None]:
import os
import numpy as np
import matplotlib.pyplot as plt

# Plotting the alignment matrix

attention_mat = []
for attn in attention_weights:
  attention_mat.append(attn.reshape(-1))

# Each row of attention_mat currently belongs to one French token. We want the
# English tokens on the left axis and the French tokens along the bottom, in
# the same order as the tick labels set below, so a plain transpose is enough
attention_mat = np.transpose(np.array(attention_mat))

fig, ax = plt.subplots(figsize=(16, 16))
ax.imshow(attention_mat)

ax.set_xticks(np.arange(attention_mat.shape[1]))
ax.set_yticks(np.arange(attention_mat.shape[0]))

def map_en_special_tokens(t):
    switcher = {}
    switcher[end_token_id_en] = "<end>"
    switcher[empty_token_id_en] = "<empty>"
    return switcher.get(t, "<unknown>")

def map_fr_special_tokens(t):
    switcher = {}
    switcher[end_token_id_fr] = "<end>"
    switcher[empty_token_id_fr] = "<empty>"
    switcher[start_token_id_fr] = "<start>"
    return switcher.get(t, "<unknown>")

ax.set_xticklabels([sp_fr.IdToPiece(t)
                    if t < sp_fr.get_piece_size() else map_fr_special_tokens(t)
                    for t in all_french_tokens])
ax.set_yticklabels([sp_en.IdToPiece(t)
                    if t < sp_en.get_piece_size() else map_en_special_tokens(t)
                    for t in english_tokens[0].tolist()])

ax.tick_params(labelsize=12)
ax.tick_params(axis='x', labelrotation=90)

plt.show()