## The dataset


The dataset is the WMT-14 English-German translation data from https://nlp.stanford.edu/projects/nmt/. More than 4.5 million sentence pairs are available, but I will use only 10k pairs to keep the computation feasible.


In [33]:
import os
import random

n_sentences = 10000

# Loading English train sentences
original_en_sentences = []
with open(
    os.path.join("./data/en-de", "train_10k.en"), "r", encoding="utf-8"
) as en_file:
    for i, row in enumerate(en_file):
        # if i >= n_sentences:
        #     break
        original_en_sentences.append(row.strip().split(" "))

# loading German train sentences
original_de_sentences = []
with open(
    os.path.join("./data/en-de", "train_10k.de"), "r", encoding="utf-8"
) as de_file:
    for i, row in enumerate(de_file):
        # if i >= n_sentences:
        #     break
        original_de_sentences.append(row.strip().split(" "))

# Loading English test sentences
oritinal_en_test_sentences = []

with open(
    os.path.join("./data/en-de", "test_100.en"), "r", encoding="utf-8"
) as en_file:
    for i, row in enumerate(en_file):
        # if i >= n_sentences:
        #     break
        oritinal_en_test_sentences.append(row.strip().split(" "))

# Loading German test sentences
oritinal_de_test_sentences = []
with open(
    os.path.join("./data/en-de", "test_100.de"), "r", encoding="utf-8"
) as de_file:
    for i, row in enumerate(de_file):
        # if i >= n_sentences:
        #     break
        oritinal_de_test_sentences.append(row.strip().split(" "))

### displaying random sentences and their respective translations
for i in range(3):
    # randint is inclusive at both ends, so stop at the last valid index
    index = random.randint(0, len(original_en_sentences) - 1)
    print("English: ", " ".join(original_en_sentences[index]))
    print("German: ", " ".join(original_de_sentences[index]), "\n")

English:  If you use Registry Trash Keys Finder successfully .
German:  Sie nutzen Registry Trash Keys Finder erfolgreich . 

English:  The NH Giustiniano is a brand-new hotel in the heart of exclusive Prati , residential and commercial neighbourhood within walking distance of St. Peter ’ s Cathedral , Castel S.Angelo and the Vatican Museums .
German:  Sie wohnen im Herzen des exklusiven Stadtteils Prati - von dieser Geschäfts- und Wohngegend gelangen Sie zu Fuß zum Petersdom , der Engelsburg ( Castel Sant &apos; Angelo ) und den Vatikanischen Museen . 

English:  Guests with cars will find the A.C. Hotel Hoferer conveniently close to the A8 motorway , and can also park for free on site .
German:  Das A.C. Hotel Hoferer liegt günstig nahe der Autobahn A8 und verfügt über kostenfreie Parkplätze . 



# Adding special tokens

#### I will add `<s>` to mark the start of a sentence and `</s>` to mark the end of a sentence

This way, prediction can run for an arbitrary number of time steps. Using `<s>` as the starting token signals to the decoder that it should start predicting tokens from the target language.

If the `</s>` token is not used to mark the end of a sentence, the decoder has no signal for ending it. This can lead the model into an endless loop of predictions.
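
To make the role of the two tokens concrete, here is a minimal sketch of a greedy decoding loop. It is not the decoder built later in this notebook: `greedy_decode`, `toy_next_token` and `never_stops` are hypothetical stand-ins for a trained model, and `max_steps` is an assumed safety cap.

```python
START, END = "<s>", "</s>"


def greedy_decode(next_token, max_steps=10):
    """Generate tokens until END is produced or max_steps is reached."""
    tokens = [START]  # the start token primes the decoder
    for _ in range(max_steps):
        token = next_token(tokens)
        tokens.append(token)
        if token == END:  # the end token is the stop signal
            break
    return tokens


# Hypothetical stand-in for a trained decoder: replays a fixed translation.
canned = ["Guten", "Morgen", END]


def toy_next_token(prefix):
    return canned[len(prefix) - 1]


# Hypothetical decoder that never emits END; only max_steps stops it.
def never_stops(prefix):
    return "und"


print(greedy_decode(toy_next_token))
print(greedy_decode(never_stops, max_steps=5))
```

The second call shows why a cap on the number of steps is still useful in practice: a model that never predicts `</s>` would otherwise keep generating.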


In [34]:
en_sentences = [["<s>"] + sent + ["</s>"] for sent in original_en_sentences]
de_sentences = [["<s>"] + sent + ["</s>"] for sent in original_de_sentences]
test_en_sentences = [["<s>"] + sent + ["</s>"] for sent in oritinal_en_test_sentences]
test_de_sentences = [["<s>"] + sent + ["</s>"] for sent in oritinal_de_test_sentences]

for i in range(2):
    # randint is inclusive at both ends, so stop at the last valid index
    index = random.randint(0, len(en_sentences) - 1)
    print("English: ", " ".join(en_sentences[index]))
    print("German: ", " ".join(de_sentences[index]), "\n")

print("English Test: ", " ".join(test_en_sentences[0]))
print("German Test: ", " ".join(test_de_sentences[0]))

English:  <s> The best way to do this is with a link to this web page . </s>
German:  <s> Am einfachsten ist es , an entsprechender Stelle einen Link auf diese Seite einzubinden . </s> 

English:  <s> Fixed an issue where the IME input tool used to enter Japanese , Korean , Chinese and Indic characters was covered by the &quot; Add Bookmark &quot; panel . </s>
German:  <s> Die Überdeckung des IME-Eingabeprogramms zur Eingabe von japanischen , koreanischen , chinesischen und indischen Zeichen durch den &quot; Lesezeichen für diese Seite gesetzt &quot; -Dialog wurde behoben . </s> 

English Test:  <s> Orlando Bloom and Miranda Kerr still love each other </s>
German Test:  <s> Orlando Bloom und Miranda Kerr lieben sich noch immer </s>


# Splitting the data into training and validation sets

#### 90% training and 10% validation


In [35]:
from sklearn.model_selection import train_test_split
import numpy as np

train_en_sentences, valid_en_sentences, train_de_sentences, valid_de_sentences = (
    train_test_split(en_sentences, de_sentences, test_size=0.1)
)

print(train_en_sentences[1])
print(train_de_sentences[1])

['<s>', 'Free', 'public', 'parking', 'is', 'possible', 'at', 'a', 'location', 'nearby', '(', 'reservation', 'is', 'not', 'possible', ')', '.', '</s>']
['<s>', 'Öffentliche', 'Parkplätze', 'stehen', 'kostenfrei', 'in', 'der', 'Nähe', '(', 'Reservierung', 'ist', 'nicht', 'möglich', ')', 'zur', 'Verfügung', '.', '</s>']
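
The split above does not fix a seed, so a different 90/10 partition is drawn on every run; `train_test_split` accepts a `random_state` argument for reproducibility. The sketch below shows, in plain Python, what an aligned split of two parallel lists amounts to. `split_parallel`, its `seed` parameter and the toy data are assumptions made for illustration.

```python
import random


def split_parallel(src, tgt, valid_fraction=0.1, seed=0):
    """Shuffle index positions once and apply the same split to both lists."""
    assert len(src) == len(tgt), "source and target must be aligned"
    indices = list(range(len(src)))
    random.Random(seed).shuffle(indices)
    n_valid = int(len(indices) * valid_fraction)
    valid_idx, train_idx = indices[:n_valid], indices[n_valid:]

    def pick(data, idx):
        return [data[i] for i in idx]

    # Same return order as train_test_split(src, tgt)
    return (
        pick(src, train_idx),
        pick(src, valid_idx),
        pick(tgt, train_idx),
        pick(tgt, valid_idx),
    )


en = [f"en-{i}" for i in range(10)]
de = [f"de-{i}" for i in range(10)]
train_en, valid_en, train_de, valid_de = split_parallel(en, de)
print(len(train_en), len(valid_en))
print(list(zip(train_en, train_de))[:3])
```

Because one shuffled index list drives both languages, every English sentence stays paired with its German translation.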


### Defining sequence lengths for the two languages


In [36]:
import pandas as pd

# Getting some basic statistics from the data

# convert train_en_sentences to a pandas series
pd.Series(train_en_sentences).str.len().describe(percentiles=[0.05, 0.5, 0.95])

count    9000.000000
mean       27.419000
std        14.356768
min         8.000000
5%         11.000000
50%        24.000000
95%        56.000000
max       102.000000
dtype: float64

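The statistics above show that 5% of the English training sentences have at most 11 tokens, 50% have at most 24 tokens, and 95% have at most 56 tokens. These counts include the added `<s>` and `</s>` tokens.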


In [37]:
pd.Series(train_de_sentences).str.len().describe(percentiles=[0.05, 0.5, 0.95])

count    9000.000000
mean       24.821222
std        12.896984
min         8.000000
5%         11.000000
50%        22.000000
95%        50.000000
max       102.000000
dtype: float64

The statistics above show that 5% of the German training sentences have at most 11 tokens, 50% have at most 22 tokens, and 95% have at most 50 tokens.

The minimum and maximum sentence lengths are 8 and 102 tokens respectively in both languages. However, this will not always be the case.
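
These percentiles are what guide the choice of a fixed sequence length. As a rough sketch, the helper below measures which fraction of sentences fits into a given maximum length without truncation. The `coverage` function and the `lengths` list are made up for illustration and are not the notebook's data.

```python
def coverage(lengths, max_len):
    """Fraction of sequences that fit into max_len tokens without truncation."""
    return sum(1 for n in lengths if n <= max_len) / len(lengths)


# Hypothetical token counts (including <s> and </s>), not the real corpus.
lengths = [8, 11, 15, 22, 24, 30, 41, 50, 56, 102]

for max_len in (24, 50, 56):
    print(f"max_len={max_len}: {coverage(lengths, max_len):.0%} fit untruncated")
```

Applied to the real length series, the same check would show how many sentences a given cut-off truncates.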


### Padding the sentences with pad_sequences from keras


In [38]:
from tensorflow.keras.preprocessing.sequence import pad_sequences

n_en_seq_length = 50
n_de_seq_length = 50
unk_token = "<unk>"
pad_token = "<pad>"

train_en_sentences_padded = pad_sequences(
    train_en_sentences,
    maxlen=n_en_seq_length,
    value=pad_token,
    dtype=object,
    truncating="post",
    padding="post",
)

valid_en_sentences_padded = pad_sequences(
    valid_en_sentences,
    maxlen=n_en_seq_length,
    value=pad_token,
    dtype=object,
    truncating="post",
    padding="post",
)

test_en_sentences_padded = pad_sequences(
    test_en_sentences,
    maxlen=n_en_seq_length,
    value=pad_token,
    dtype=object,
    truncating="post",
    padding="post",
)


train_de_sentences_padded = pad_sequences(
    train_de_sentences,
    maxlen=n_de_seq_length,
    value=pad_token,
    dtype=object,
    truncating="post",
    padding="post",
)

valid_de_sentences_padded = pad_sequences(
    valid_de_sentences,
    maxlen=n_de_seq_length,
    value=pad_token,
    dtype=object,
    truncating="post",
    padding="post",
)

test_de_sentences_padded = pad_sequences(
    test_de_sentences,
    maxlen=n_de_seq_length,
    value=pad_token,
    dtype=object,
    truncating="post",
    padding="post",
)

valid_en_sentences_padded[0]

array(['<s>', 'If', 'you', 'need', 'that', 'functionality', ',', 'you',
       'can', 'eliminate', 'those', 'kinds', 'of', 'layers', 'from',
       'your', 'PSD', 'by', 'converting', 'the', 'layer', 'effects', 'to',
       'stand-alone', 'layers', 'or', '&apos;', 'smart', 'objects',
       '&apos;', '&#91;', 'right', 'click', 'on', 'the', 'Layer', 'in',
       'the', 'Photoshop', 'layers', 'palette', '&#93;', '.', '</s>',
       '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>'], dtype=object)
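
For intuition, here is a plain-Python sketch of what `padding="post"` and `truncating="post"` do to a single token list. `pad_or_truncate` is a hypothetical helper, not part of Keras.

```python
def pad_or_truncate(tokens, maxlen, pad="<pad>"):
    """Mimic padding="post" and truncating="post" for one token list."""
    clipped = tokens[:maxlen]  # post-truncation keeps the first maxlen tokens
    return clipped + [pad] * (maxlen - len(clipped))  # post-padding appends pads


short = ["<s>", "Hello", "world", "</s>"]
long = ["<s>"] + [f"w{i}" for i in range(8)] + ["</s>"]

print(pad_or_truncate(short, 6))
print(pad_or_truncate(long, 6))
```

Note that the long example loses its `</s>` token: with post-truncation, any sentence longer than the chosen sequence length reaches the model without an end-of-sentence marker.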

In [39]:
from tensorflow.keras.layers import TextVectorization
import os

# Read the English text used to build the vocabulary, one entry per line.
# NOTE: this file is read from "./data/en-fr", not "./data/en-de".
en_vocabulary = []
with open(os.path.join("./data/en-fr", "vocab.en"), "r", encoding="utf-8") as en_file:
    for row in en_file:
        en_vocabulary.append(row.strip())

# Read the German text used to build the vocabulary.
# NOTE: these are the raw training sentences, not a vocabulary file.
de_vocabulary = []
with open(
    os.path.join("./data/en-de", "train_10k.de"), "r", encoding="utf-8"
) as de_file:
    for row in de_file:
        de_vocabulary.append(row.strip())

# Learn the vocabularies. With its default settings, TextVectorization
# lower-cases the text and strips punctuation before splitting on whitespace.
text_vectorizer_en = TextVectorization(output_mode="int")
text_vectorizer_de = TextVectorization(output_mode="int")
text_vectorizer_en.adapt(en_vocabulary)
text_vectorizer_de.adapt(de_vocabulary)


en_vocabulary = text_vectorizer_en.get_vocabulary()
de_vocabulary = text_vectorizer_de.get_vocabulary()
text_vectorizer_de.get_vocabulary()

['',
 '[UNK]',
 'und',
 'die',
 'der',
 'in',
 'sie',
 'von',
 'das',
 'zu',
 'mit',
 'ist',
 'für',
 'den',
 'im',
 'auf',
 'ein',
 'des',
 'eine',
 'dem',
 'sich',
 'hotel',
 'es',
 'quot',
 'werden',
 'an',
 'oder',
 'nicht',
 'als',
 'sind',
 'auch',
 'a',
 'ich',
 'wird',
 'einem',
 'über',
 'aus',
 'einen',
 'um',
 'the',
 'bei',
 'zur',
 'wie',
 'können',
 'einer',
 'er',
 'am',
 'nur',
 'nach',
 'alle',
 'so',
 'diese',
 'kann',
 'zimmer',
 'zum',
 'wir',
 'wenn',
 'man',
 'dieses',
 'bis',
 'vom',
 'durch',
 'hat',
 'and',
 'sehr',
 'ihre',
 'haben',
 'wurde',
 'daß',
 'war',
 'aber',
 'bietet',
 'pro',
 'dass',
 'unter',
 'vor',
 'b',
 'was',
 'sein',
 '�',
 'liegt',
 'ihr',
 'ihnen',
 'of',
 'sowie',
 'noch',
 'dieser',
 'gibt',
 'hier',
 'entfernt',
 '1',
 'mehr',
 'ihrer',
 'stadt',
 '2',
 'de',
 'denn',
 'apos',
 'seine',
 'windows',
 'befindet',
 'eines',
 'finden',
 'to',
 'lage',
 '“',
 'euch',
 'du',
 'damit',
 'zwischen',
 'zeit',
 'hotels',
 'c',
 'dann',
 'diesem',

# Defining the model


In [40]:
en_unk_token = en_vocabulary.pop(1)
de_unk_token = de_vocabulary.pop(1)

en_unk_token, de_unk_token

('[UNK]', '[UNK]')

In [42]:
import tensorflow as tf

en_lookup_layer = tf.keras.layers.StringLookup(
    oov_token=en_unk_token,
    vocabulary=en_vocabulary,
    mask_token=pad_token,
    pad_to_max_tokens=False,
)

de_lookup_layer = tf.keras.layers.StringLookup(
    oov_token=de_unk_token,
    vocabulary=de_vocabulary,
    mask_token=pad_token,
    pad_to_max_tokens=False,
)

In [43]:
wid_sample = en_lookup_layer(
    "iron cement protects the ingot against the hot , abrasive steel casting process .".split(
        " "
    )
)
print(f"Word IDs: {wid_sample}")
print(f"Sample vocabulary: {en_lookup_layer.get_vocabulary()[:10]}")

Word IDs: [269792 395373 156726     75 275279 448899     75 287062      1 454356
  94376 397282 159214      1]
Sample vocabulary: ['<pad>', '[UNK]', '', 'o', 'av', 're', 'ms', 'm', 'i', 'd']
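
In the sample above, both `,` and `.` map to ID 1, which is `[UNK]`. A plausible explanation is that the vocabulary was built with `TextVectorization`, whose default standardization lower-cases text and strips punctuation, while the lookup layer receives the raw tokens. The plain-Python emulation below illustrates the effect. `standardize` and `build_lookup` are rough hypothetical stand-ins, not the Keras implementation.

```python
import re
import string


def standardize(text):
    """Rough emulation of lower-casing and punctuation stripping."""
    return re.sub(f"[{re.escape(string.punctuation)}]", "", text.lower())


def build_lookup(sentences, pad="<pad>", unk="[UNK]"):
    """Map each standardized word to an integer ID; 0 is padding, 1 is OOV."""
    vocab = [pad, unk]
    for sentence in sentences:
        for word in standardize(sentence).split():
            if word not in vocab:
                vocab.append(word)
    return {word: idx for idx, word in enumerate(vocab)}


lookup = build_lookup(["The hot , abrasive steel ."])
tokens = ["<s>", "The", "the", "hot", ",", ".", "</s>", "<pad>"]
ids = [lookup.get(tok, lookup["[UNK]"]) for tok in tokens]
print(dict(zip(tokens, ids)))
```

If the real layers behave the same way, cased words, punctuation and the `<s>`/`</s>` markers would all collapse into `[UNK]`, so it may be worth checking how the padded training tokens are encoded before training.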


In [44]:
class BahdanauAttention(tf.keras.layers.Layer):
    def __init__(self, units):
        super().__init__()
        # Weights to compute Bahdanau attention
        self.Wa = tf.keras.layers.Dense(units, use_bias=False)
        self.Ua = tf.keras.layers.Dense(units, use_bias=False)

        self.attention = tf.keras.layers.AdditiveAttention(use_scale=True)

    def call(self, query, key, value, mask, return_attention_scores=False):

        # Compute `Wa.ht`.
        wa_query = self.Wa(query)

        # Compute `Ua.hs`.
        ua_key = self.Ua(key)

        # Compute masks
        query_mask = tf.ones(tf.shape(query)[:-1], dtype=bool)
        value_mask = mask

        # Compute the attention
        context_vector, attention_weights = self.attention(
            inputs=[wa_query, value, ua_key],
            mask=[query_mask, value_mask, value_mask],
            return_attention_scores=True,
        )

        if not return_attention_scores:
            return context_vector
        else:
            return context_vector, attention_weights
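
For reference, the computation that the layer above delegates to `AdditiveAttention` can be sketched in plain Python. This is a simplified illustration under stated assumptions: queries and keys are taken as already projected (the role of `Wa` and `Ua`), the learned scale defaults to ones, and `additive_attention` is a hypothetical helper rather than the Keras code.

```python
import math


def additive_attention(queries, keys, values, key_mask, scale=None):
    """Masked additive attention over already-projected queries and keys.

    queries: T_q x d, keys: T_k x d, values: T_k x d_v, key_mask: T_k booleans.
    """
    dim = len(queries[0])
    scale = scale or [1.0] * dim
    contexts, all_weights = [], []
    for q in queries:
        # score(q, k) = sum_d scale_d * tanh(q_d + k_d)
        scores = [
            sum(s * math.tanh(qd + kd) for s, qd, kd in zip(scale, q, k))
            for k in keys
        ]
        # Masked (padding) positions get a very negative score
        scores = [sc if keep else -1e9 for sc, keep in zip(scores, key_mask)]
        # Numerically stable softmax over the key positions
        peak = max(scores)
        exps = [math.exp(sc - peak) for sc in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Context vector: attention-weighted sum of the values
        context = [
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ]
        contexts.append(context)
        all_weights.append(weights)
    return contexts, all_weights


queries = [[0.5, -0.2]]
keys = [[0.1, 0.3], [0.9, -0.4], [0.0, 0.0]]
values = keys
mask = [True, True, False]  # the last encoder position is padding
context, weights = additive_attention(queries, keys, values, mask)
print(weights)
print(context)
```

The masked position receives zero weight, so padding tokens in the encoder input do not contribute to the context vector.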

In [46]:
import tensorflow.keras.backend as K

K.clear_session()

# Defining the encoder layers
encoder_input = tf.keras.layers.Input(shape=(n_en_seq_length,), dtype=tf.string)
# Converting tokens to IDs
encoder_wid_out = en_lookup_layer(encoder_input)

# Embedding layer and lookup
encoder_emb_out = tf.keras.layers.Embedding(
    len(en_lookup_layer.get_vocabulary()), 128, mask_zero=True
)(encoder_wid_out)

# Encoder GRU layer
encoder_gru_out, encoder_gru_last_state = tf.keras.layers.GRU(
    256, return_sequences=True, return_state=True
)(encoder_emb_out)

# Defining the encoder model: in - encoder_input / out - output of the GRU layer
encoder = tf.keras.models.Model(inputs=encoder_input, outputs=encoder_gru_out)

# Defining the decoder layers
decoder_input = tf.keras.layers.Input(shape=(n_de_seq_length - 1,), dtype=tf.string)
# Converting tokens to IDs (Decoder)
decoder_wid_out = de_lookup_layer(decoder_input)

# Embedding layer and lookup (decoder)
full_de_vocab_size = len(de_lookup_layer.get_vocabulary())
decoder_emb_out = tf.keras.layers.Embedding(full_de_vocab_size, 128, mask_zero=True)(
    decoder_wid_out
)
decoder_gru_out = tf.keras.layers.GRU(256, return_sequences=True)(
    decoder_emb_out, initial_state=encoder_gru_last_state
)

# The attention mechanism (inputs: [q, v, k])
decoder_attn_out, attn_weights = BahdanauAttention(256)(
    query=decoder_gru_out,
    key=encoder_gru_out,
    value=encoder_gru_out,
    mask=(encoder_wid_out != 0),
    return_attention_scores=True,
)

# Concatenate GRU output and the attention output
context_and_rnn_output = tf.keras.layers.Concatenate(axis=-1)(
    [decoder_attn_out, decoder_gru_out]
)

# Final prediction layer (size of the vocabulary)
decoder_out = tf.keras.layers.Dense(full_de_vocab_size, activation="softmax")(
    context_and_rnn_output
)

# Final seq2seq model
seq2seq_model = tf.keras.models.Model(
    inputs=[encoder.inputs, decoder_input], outputs=decoder_out
)

# We will use this model later to visualize attention patterns
attention_visualizer = tf.keras.models.Model(
    inputs=[encoder.inputs, decoder_input], outputs=[attn_weights, decoder_out]
)

# Compiling the model with a loss and an optimizer
seq2seq_model.compile(
    loss="sparse_categorical_crossentropy", optimizer="adam", metrics=["accuracy"]
)

# Print model summary
seq2seq_model.summary()

In [48]:
from tensorflow.keras.layers import StringLookup
from bleu import compute_bleu


class BLEUMetric(object):

    def __init__(self, vocabulary, name="bleu", **kwargs):
        """Computes the BLEU score (Metric for machine translation)"""
        super().__init__()
        self.vocab = vocabulary
        self.id_to_token_layer = StringLookup(vocabulary=self.vocab, invert=True)

    def calculate_bleu_from_predictions(self, real, pred):
        """Calculate the BLEU score for targets and predictions"""

        # Get the predicted token IDs
        pred_argmax = tf.argmax(pred, axis=-1)

        # Convert token IDs to words using the vocabulary and the StringLookup
        pred_tokens = self.id_to_token_layer(pred_argmax)
        real_tokens = self.id_to_token_layer(real)

        def clean_text(tokens):
            """Clean padding and <s>/</s> tokens to only keep meaningful words"""

            # 3. Strip the string of any extra white spaces
            translations_in_bytes = tf.strings.strip(
                # 2. Replace everything after the eos token with blank
                tf.strings.regex_replace(
                    # 1. Join all the tokens to one string in each sequence
                    tf.strings.join(tf.transpose(tokens), separator=" "),
                    # Raw string: "\/" is an invalid escape in a normal string
                    r"</s>.*",
                    "",
                ),
            )

            # Decode the byte stream to a string
            translations = np.char.decode(
                translations_in_bytes.numpy().astype(np.bytes_), encoding="utf-8"
            )

            # If the string is empty, substitute the [UNK] token;
            # otherwise compute_bleu fails with a division by zero
            translations = [
                sent if len(sent) > 0 else de_unk_token for sent in translations
            ]

            # Split the sequences to individual tokens
            translations = np.char.split(translations).tolist()

            return translations

        # Get the clean versions of the predictions and real sequences
        pred_tokens = clean_text(pred_tokens)
        # We have to wrap each real sequence in a list to make use of a function to compute bleu
        real_tokens = [[token_seq] for token_seq in clean_text(real_tokens)]

        # The compute_bleu method accepts the translations and references in the following format
        # translation - list of list of tokens
        # references - list of list of list of tokens
        bleu, precisions, bp, ratio, translation_length, reference_length = (
            compute_bleu(real_tokens, pred_tokens, smooth=False)
        )

        return bleu


In [49]:
# One reference translation in the nested format compute_bleu expects:
# references - list of list of list of tokens
reference = [
    [
        [
            "als", "müssen", "müssen", "wir", "in",
            "erfahrung", "bringen", "wo", "sie", "wohnen",
        ]
    ]
]

# Translation that keeps a long run of correctly predicted tokens
translation = [
    [
        de_unk_token, de_unk_token, "müssen", "wir", "in",
        "erfahrung", "bringen", "wo", "sie", "wohnen",
    ]
]

bleu1, _, _, _, _, _ = compute_bleu(reference, translation)

# Translation in which wrong tokens break up the correct run
translation = [
    [
        de_unk_token, "einmal", "müssen", de_unk_token, "in",
        "erfahrung", "bringen", "wo", "sie", "wohnen",
    ]
]

bleu2, _, _, _, _, _ = compute_bleu(reference, translation)

print(f"BLEU score with longer correctly predicted phrases: {bleu1}")
print(f"BLEU score without longer correctly predicted phrases: {bleu2}")

BLEU score with longer correctly predicted phrases: 0.7598356856515925
BLEU score without longer correctly predicted phrases: 0.537284965911771
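
To see where these two numbers come from, here is a minimal, unsmoothed BLEU sketch in plain Python. It handles one translation and one reference only and is not the imported `compute_bleu`; `ngrams` and `sentence_bleu` are hypothetical helpers written for this illustration.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of order n in a token list."""
    return Counter(tuple(tokens[i : i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(reference, translation, max_order=4):
    """Unsmoothed BLEU for one translation against one reference."""
    precisions = []
    for n in range(1, max_order + 1):
        cand, ref = ngrams(translation, n), ngrams(reference, n)
        overlap = sum((cand & ref).values())  # clipped n-gram matches
        total = sum(cand.values())
        precisions.append(overlap / total if total else 0.0)
    if min(precisions) == 0:
        return 0.0
    # Geometric mean of the n-gram precisions
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_order)
    # Brevity penalty punishes translations shorter than the reference
    ratio = len(translation) / len(reference)
    brevity_penalty = 1.0 if ratio > 1.0 else math.exp(1 - 1 / ratio)
    return geo_mean * brevity_penalty


ref = ["als", "müssen", "müssen", "wir", "in", "erfahrung", "bringen", "wo", "sie", "wohnen"]
long_run = ["[UNK]", "[UNK]", "müssen", "wir", "in", "erfahrung", "bringen", "wo", "sie", "wohnen"]
broken_run = ["[UNK]", "einmal", "müssen", "[UNK]", "in", "erfahrung", "bringen", "wo", "sie", "wohnen"]

bleu1 = sentence_bleu(ref, long_run)
bleu2 = sentence_bleu(ref, broken_run)
print(bleu1, bleu2)
```

For the first translation the n-gram precisions are 8/10, 7/9, 6/8 and 5/7; for the second they are 7/10, 5/9, 4/8 and 3/7. Breaking up the correct run costs only one unigram but removes two matches at every higher order, which is why the score drops from about 0.76 to about 0.54.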


In [50]:
import time


def prepare_data(de_lookup_layer, train_xy, valid_xy, test_xy):
    """Create a data dictionary from the padded (English, German) token arrays"""

    data_dict = {}
    for label, data_xy in zip(
        ["train", "valid", "test"], [train_xy, valid_xy, test_xy]
    ):

        data_x, data_y = data_xy
        en_inputs = data_x
        de_inputs = data_y[:, :-1]
        de_labels = de_lookup_layer(data_y[:, 1:]).numpy()
        data_dict[label] = {
            "encoder_inputs": en_inputs,
            "decoder_inputs": de_inputs,
            "decoder_labels": de_labels,
        }

    return data_dict


def shuffle_data(en_inputs, de_inputs, de_labels, shuffle_inds=None):
    """Shuffle the data randomly (but all of inputs and labels at once)"""

    if shuffle_inds is None:
        # If shuffle_inds are not passed create a shuffling automatically
        shuffle_inds = np.random.permutation(np.arange(en_inputs.shape[0]))
    else:
        # Shuffle the provided shuffle_inds
        shuffle_inds = np.random.permutation(shuffle_inds)

    # Return shuffled data
    return (
        en_inputs[shuffle_inds],
        de_inputs[shuffle_inds],
        de_labels[shuffle_inds],
    ), shuffle_inds


def check_for_nans(loss, model, en_lookup_layer, de_lookup_layer, x, y, batch_index):
    """Print the first sample of the batch (x, y) that yields a NaN loss"""

    def print_sample(r_i):
        print("en_input ->", x[0][r_i].tolist())
        print("en_input_wid ->", en_lookup_layer(x[0][r_i]).numpy().tolist())
        print("de_input ->", x[1][r_i].tolist())
        print("de_input_wid ->", de_lookup_layer(x[1][r_i]).numpy().tolist())
        print("de_output_wid ->", y[r_i].tolist())

    if np.isnan(loss):
        for r_i in range(len(y)):
            # Evaluate one sample at a time to locate the offending row
            loss_sample, _ = model.evaluate(
                [x[0][r_i : r_i + 1], x[1][r_i : r_i + 1]], y[r_i : r_i + 1], verbose=0
            )
            if np.isnan(loss_sample):

                print("=" * 25, "nan detected", "=" * 25)
                print("train_batch", batch_index, "r_i", r_i)
                print_sample(r_i)

                if r_i > 0:
                    # Print the preceding sample for comparison
                    print("=" * 25, "no-nan", "=" * 25)
                    print_sample(r_i - 1)
                    return
                else:
                    continue


def train_model(
    model,
    en_lookup_layer,
    de_lookup_layer,
    train_xy,
    valid_xy,
    test_xy,
    epochs,
    batch_size,
    shuffle=True,
    predict_bleu_at_training=False,
):
    """Training the model and evaluating on validation/test sets"""

    # Define the metric
    bleu_metric = BLEUMetric(de_vocabulary)

    # Define the data
    data_dict = prepare_data(de_lookup_layer, train_xy, valid_xy, test_xy)

    shuffle_inds = None

    for epoch in range(epochs):

        # Reset metric logs every epoch
        if predict_bleu_at_training:
            bleu_log = []
        accuracy_log = []
        loss_log = []

        # =================================================================== #
        #                         Train Phase                                 #
        # =================================================================== #

        # Shuffle data at the beginning of every epoch
        if shuffle:
            (en_inputs_raw, de_inputs_raw, de_labels), shuffle_inds = shuffle_data(
                data_dict["train"]["encoder_inputs"],
                data_dict["train"]["decoder_inputs"],
                data_dict["train"]["decoder_labels"],
                shuffle_inds,
            )
        else:
            (en_inputs_raw, de_inputs_raw, de_labels) = (
                data_dict["train"]["encoder_inputs"],
                data_dict["train"]["decoder_inputs"],
                data_dict["train"]["decoder_labels"],
            )
        # Get the number of training batches
        n_train_batches = en_inputs_raw.shape[0] // batch_size

        prev_loss = None
        # Train one batch at a time
        for i in range(n_train_batches):
            # Status update
            print(f"Training batch {i+1}/{n_train_batches}", end="\r")

            # Get a batch of inputs (english and german sequences)
            x = [
                en_inputs_raw[i * batch_size : (i + 1) * batch_size],
                de_inputs_raw[i * batch_size : (i + 1) * batch_size],
            ]
            # Get a batch of targets (german sequences offset by 1)
            y = de_labels[i * batch_size : (i + 1) * batch_size]

            loss, accuracy = model.evaluate(x, y, verbose=0)

            # Check if any samples are causing NaNs
            check_for_nans(loss, model, en_lookup_layer, de_lookup_layer, x, y, i)

            # Train for a single step
            model.train_on_batch(x, y)
            # Evaluate the model to get the metrics
            # loss, accuracy = model.evaluate(x, y, verbose=0)

            # Update the epoch's log records of the metrics
            loss_log.append(loss)
            accuracy_log.append(accuracy)

            if predict_bleu_at_training:
                # Get the final prediction to compute BLEU
                pred_y = model.predict(x)
                bleu_log.append(bleu_metric.calculate_bleu_from_predictions(y, pred_y))

        print("")
        print(f"\nEpoch {epoch+1}/{epochs}")
        if predict_bleu_at_training:
            print(
                f"\t(train) loss: {np.mean(loss_log)} - accuracy: {np.mean(accuracy_log)} - bleu: {np.mean(bleu_log)}"
            )
        else:
            print(
                f"\t(train) loss: {np.mean(loss_log)} - accuracy: {np.mean(accuracy_log)}"
            )
        # =================================================================== #
        #                      Validation Phase                               #
        # =================================================================== #

        val_en_inputs = data_dict["valid"]["encoder_inputs"]
        val_de_inputs = data_dict["valid"]["decoder_inputs"]
        val_de_labels = data_dict["valid"]["decoder_labels"]

        val_loss, val_accuracy, val_bleu = evaluate_model(
            model,
            de_lookup_layer,
            val_en_inputs,
            val_de_inputs,
            val_de_labels,
            batch_size,
        )

        # Print the evaluation metrics of each epoch
        print(
            f"\t(valid) loss: {val_loss} - accuracy: {val_accuracy} - bleu: {val_bleu}"
        )

    # =================================================================== #
    #                      Test Phase                                     #
    # =================================================================== #

    test_en_inputs = data_dict["test"]["encoder_inputs"]
    test_de_inputs = data_dict["test"]["decoder_inputs"]
    test_de_labels = data_dict["test"]["decoder_labels"]

    test_loss, test_accuracy, test_bleu = evaluate_model(
        model,
        de_lookup_layer,
        test_en_inputs,
        test_de_inputs,
        test_de_labels,
        batch_size,
    )

    print(f"\n(test) loss: {test_loss} - accuracy: {test_accuracy} - bleu: {test_bleu}")


def evaluate_model(
    model, de_lookup_layer, en_inputs_raw, de_inputs_raw, de_labels, batch_size
):
    """Evaluate the model on various metrics such as loss, accuracy and BLEU"""

    # Define the metric
    bleu_metric = BLEUMetric(de_vocabulary)

    loss_log, accuracy_log, bleu_log = [], [], []
    # Get the number of batches
    n_batches = en_inputs_raw.shape[0] // batch_size
    print(" ", end="\r")

    # Evaluate one batch at a time
    for i in range(n_batches):
        # Status update
        print(f"Evaluating batch {i+1}/{n_batches}", end="\r")

        # Get the inputs and targets
        x = [
            en_inputs_raw[i * batch_size : (i + 1) * batch_size],
            de_inputs_raw[i * batch_size : (i + 1) * batch_size],
        ]
        y = de_labels[i * batch_size : (i + 1) * batch_size]

        # Get the evaluation metrics
        loss, accuracy = model.evaluate(x, y, verbose=0)
        # Get the predictions to compute BLEU
        pred_y = model.predict(x)

        # Update logs
        loss_log.append(loss)
        accuracy_log.append(accuracy)
        bleu_log.append(bleu_metric.calculate_bleu_from_predictions(y, pred_y))

    return np.mean(loss_log), np.mean(accuracy_log), np.mean(bleu_log)
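
Two details of the functions above are easy to miss: `prepare_data` shifts the German sequence by one position to build decoder inputs and labels, and both the training and evaluation loops use `shape[0] // batch_size`, which silently skips a trailing partial batch. The plain-Python sketch below illustrates both. The helper names are hypothetical, the 9000 and 1000 sizes follow from the 90/10 split, and the test size of 100 is assumed from the file name `test_100`.

```python
def shift_for_teacher_forcing(target):
    """Decoder input drops the last token; labels drop the first one."""
    return target[:-1], target[1:]


def batch_usage(n_samples, batch_size):
    """Return (number of full batches, samples used, samples dropped)."""
    n_batches = n_samples // batch_size
    used = n_batches * batch_size
    return n_batches, used, n_samples - used


decoder_in, labels = shift_for_teacher_forcing(
    ["<s>", "Guten", "Morgen", "</s>", "<pad>"]
)
print("decoder input:", decoder_in)
print("labels:       ", labels)

for name, n in [("train", 9000), ("valid", 1000), ("test", 100)]:
    n_batches, used, dropped = batch_usage(n, batch_size=72)
    print(f"{name}: {n_batches} batches, {used} used, {dropped} dropped")
```

With a batch size of 72, all training samples are used, but under these assumptions 64 validation samples and 28 test samples would never be evaluated. A batch size that divides the set sizes, or an extra final partial batch, would avoid that.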

In [51]:
epochs = 10
batch_size = 72

t1 = time.time()
train_model(
    seq2seq_model,
    en_lookup_layer,
    de_lookup_layer,
    (train_en_sentences_padded, train_de_sentences_padded),
    (valid_en_sentences_padded, valid_de_sentences_padded),
    (test_en_sentences_padded, test_de_sentences_padded),
    epochs,
    batch_size,
    shuffle=False,
)
t2 = time.time()

print(f"\nIt took {t2-t1} seconds to complete the training")

Training batch 1/125

W0000 00:00:1729010716.288403   29115 op_level_cost_estimator.cc:699] Error in PredictCost() for the op: op: "Softmax" attr { key: "T" value { type: DT_FLOAT } } inputs { dtype: DT_FLOAT shape { unknown_rank: true } } device { type: "CPU" vendor: "GenuineIntel" model: "111" frequency: 2303 num_cores: 24 environment { key: "cpu_instruction_set" value: "AVX SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2" } environment { key: "eigen" value: "3.4.90" } l1_cache_size: 49152 l2_cache_size: 1310720 l3_cache_size: 31457280 memory_size: 268435456 } outputs { dtype: DT_FLOAT shape { unknown_rank: true } }
2024-10-15 12:45:16.338459: E tensorflow/core/util/util.cc:131] oneDNN supports DT_BOOL only on platforms with AVX-512. Falling back to the default Eigen-based implementation if present.
2024-10-15 12:45:16.447618: I tensorflow/core/framework/local_rendezvous.cc:405] Local rendezvous is aborting with status: INVALID_ARGUMENT: Incompatible shapes: [32,49,50] vs. [32,50]
	 [[{{node functional_1_1/bahdana

InvalidArgumentError: Graph execution error:

Detected at node functional_1_1/bahdanau_attention_1/additive_attention_1/sub defined at (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main

  File "<frozen runpy>", line 88, in _run_code

  File "/home/mbeleck/anaconda3/envs/tf2-cuda/lib/python3.12/site-packages/ipykernel_launcher.py", line 18, in <module>

  File "/home/mbeleck/anaconda3/envs/tf2-cuda/lib/python3.12/site-packages/traitlets/config/application.py", line 1075, in launch_instance

  File "/home/mbeleck/anaconda3/envs/tf2-cuda/lib/python3.12/site-packages/ipykernel/kernelapp.py", line 739, in start

  File "/home/mbeleck/anaconda3/envs/tf2-cuda/lib/python3.12/site-packages/tornado/platform/asyncio.py", line 205, in start

  File "/home/mbeleck/anaconda3/envs/tf2-cuda/lib/python3.12/asyncio/base_events.py", line 641, in run_forever

  File "/home/mbeleck/anaconda3/envs/tf2-cuda/lib/python3.12/asyncio/base_events.py", line 1986, in _run_once

  File "/home/mbeleck/anaconda3/envs/tf2-cuda/lib/python3.12/asyncio/events.py", line 88, in _run

  File "/home/mbeleck/anaconda3/envs/tf2-cuda/lib/python3.12/site-packages/ipykernel/kernelbase.py", line 545, in dispatch_queue

  File "/home/mbeleck/anaconda3/envs/tf2-cuda/lib/python3.12/site-packages/ipykernel/kernelbase.py", line 534, in process_one

  File "/home/mbeleck/anaconda3/envs/tf2-cuda/lib/python3.12/site-packages/ipykernel/kernelbase.py", line 437, in dispatch_shell

  File "/home/mbeleck/anaconda3/envs/tf2-cuda/lib/python3.12/site-packages/ipykernel/ipkernel.py", line 362, in execute_request

  File "/home/mbeleck/anaconda3/envs/tf2-cuda/lib/python3.12/site-packages/ipykernel/kernelbase.py", line 778, in execute_request

  File "/home/mbeleck/anaconda3/envs/tf2-cuda/lib/python3.12/site-packages/ipykernel/ipkernel.py", line 449, in do_execute

  File "/home/mbeleck/anaconda3/envs/tf2-cuda/lib/python3.12/site-packages/ipykernel/zmqshell.py", line 549, in run_cell

  File "/home/mbeleck/anaconda3/envs/tf2-cuda/lib/python3.12/site-packages/IPython/core/interactiveshell.py", line 3075, in run_cell

  File "/home/mbeleck/anaconda3/envs/tf2-cuda/lib/python3.12/site-packages/IPython/core/interactiveshell.py", line 3130, in _run_cell

  File "/home/mbeleck/anaconda3/envs/tf2-cuda/lib/python3.12/site-packages/IPython/core/async_helpers.py", line 128, in _pseudo_sync_runner

  File "/home/mbeleck/anaconda3/envs/tf2-cuda/lib/python3.12/site-packages/IPython/core/interactiveshell.py", line 3334, in run_cell_async

  File "/home/mbeleck/anaconda3/envs/tf2-cuda/lib/python3.12/site-packages/IPython/core/interactiveshell.py", line 3517, in run_ast_nodes

  File "/home/mbeleck/anaconda3/envs/tf2-cuda/lib/python3.12/site-packages/IPython/core/interactiveshell.py", line 3577, in run_code

  File "/tmp/ipykernel_29115/3462044664.py", line 5, in <module>

  File "/tmp/ipykernel_29115/2909701450.py", line 143, in train_model

  File "/home/mbeleck/anaconda3/envs/tf2-cuda/lib/python3.12/site-packages/keras/src/utils/traceback_utils.py", line 117, in error_handler

  File "/home/mbeleck/anaconda3/envs/tf2-cuda/lib/python3.12/site-packages/keras/src/backend/tensorflow/trainer.py", line 432, in evaluate

  File "/home/mbeleck/anaconda3/envs/tf2-cuda/lib/python3.12/site-packages/keras/src/backend/tensorflow/trainer.py", line 165, in one_step_on_iterator

  File "/home/mbeleck/anaconda3/envs/tf2-cuda/lib/python3.12/site-packages/keras/src/backend/tensorflow/trainer.py", line 154, in one_step_on_data

  File "/home/mbeleck/anaconda3/envs/tf2-cuda/lib/python3.12/site-packages/keras/src/backend/tensorflow/trainer.py", line 82, in test_step

  File "/home/mbeleck/anaconda3/envs/tf2-cuda/lib/python3.12/site-packages/keras/src/utils/traceback_utils.py", line 117, in error_handler

  File "/home/mbeleck/anaconda3/envs/tf2-cuda/lib/python3.12/site-packages/keras/src/layers/layer.py", line 899, in __call__

  File "/home/mbeleck/anaconda3/envs/tf2-cuda/lib/python3.12/site-packages/keras/src/utils/traceback_utils.py", line 117, in error_handler

  File "/home/mbeleck/anaconda3/envs/tf2-cuda/lib/python3.12/site-packages/keras/src/ops/operation.py", line 46, in __call__

  File "/home/mbeleck/anaconda3/envs/tf2-cuda/lib/python3.12/site-packages/keras/src/utils/traceback_utils.py", line 156, in error_handler

  File "/home/mbeleck/anaconda3/envs/tf2-cuda/lib/python3.12/site-packages/keras/src/models/functional.py", line 182, in call

  File "/home/mbeleck/anaconda3/envs/tf2-cuda/lib/python3.12/site-packages/keras/src/ops/function.py", line 171, in _run_through_graph

  File "/home/mbeleck/anaconda3/envs/tf2-cuda/lib/python3.12/site-packages/keras/src/models/functional.py", line 597, in call

  File "/home/mbeleck/anaconda3/envs/tf2-cuda/lib/python3.12/site-packages/keras/src/utils/traceback_utils.py", line 117, in error_handler

  File "/home/mbeleck/anaconda3/envs/tf2-cuda/lib/python3.12/site-packages/keras/src/layers/layer.py", line 899, in __call__

  File "/home/mbeleck/anaconda3/envs/tf2-cuda/lib/python3.12/site-packages/keras/src/utils/traceback_utils.py", line 117, in error_handler

  File "/home/mbeleck/anaconda3/envs/tf2-cuda/lib/python3.12/site-packages/keras/src/ops/operation.py", line 46, in __call__

  File "/home/mbeleck/anaconda3/envs/tf2-cuda/lib/python3.12/site-packages/keras/src/utils/traceback_utils.py", line 156, in error_handler

  File "/tmp/ipykernel_29115/2301211892.py", line 23, in call

  File "/home/mbeleck/anaconda3/envs/tf2-cuda/lib/python3.12/site-packages/keras/src/utils/traceback_utils.py", line 117, in error_handler

  File "/home/mbeleck/anaconda3/envs/tf2-cuda/lib/python3.12/site-packages/keras/src/layers/layer.py", line 899, in __call__

  File "/home/mbeleck/anaconda3/envs/tf2-cuda/lib/python3.12/site-packages/keras/src/utils/traceback_utils.py", line 117, in error_handler

  File "/home/mbeleck/anaconda3/envs/tf2-cuda/lib/python3.12/site-packages/keras/src/ops/operation.py", line 46, in __call__

  File "/home/mbeleck/anaconda3/envs/tf2-cuda/lib/python3.12/site-packages/keras/src/utils/traceback_utils.py", line 156, in error_handler

  File "/home/mbeleck/anaconda3/envs/tf2-cuda/lib/python3.12/site-packages/keras/src/layers/attention/attention.py", line 229, in call

  File "/home/mbeleck/anaconda3/envs/tf2-cuda/lib/python3.12/site-packages/keras/src/layers/attention/attention.py", line 177, in _apply_scores

Incompatible shapes: [32,49,50] vs. [32,50]
	 [[{{node functional_1_1/bahdanau_attention_1/additive_attention_1/sub}}]] [Op:__inference_one_step_on_iterator_3189]
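The last line of the traceback says that a tensor of shape [32, 49, 50] could not be combined with one of shape [32, 50] in the `sub` node of the additive attention layer. A plausible reading, which the notebook itself does not confirm, is that the first tensor holds the attention scores (batch, decoder steps, encoder steps) and the second is a per-sentence mask over encoder steps that has no decoder-step axis. The sketch below re-implements NumPy-style broadcasting on plain shape tuples to show why these two shapes clash and why a mask with an explicit axis of size 1 would not. The helper `broadcast_shape` and the expanded shape `(32, 1, 50)` are illustrative assumptions, not part of the notebook or of Keras.

```python
from itertools import zip_longest


def broadcast_shape(a, b):
    """Return the broadcast shape of two shapes under NumPy-style rules.

    Raises ValueError when the shapes are incompatible.
    """
    result = []
    # Align the shapes on their trailing axes; missing leading axes count as 1.
    for dim_a, dim_b in zip_longest(reversed(a), reversed(b), fillvalue=1):
        if dim_a == dim_b or dim_b == 1:
            result.append(dim_a)
        elif dim_a == 1:
            result.append(dim_b)
        else:
            raise ValueError(f"Incompatible shapes: {list(a)} vs. {list(b)}")
    return tuple(reversed(result))


scores_shape = (32, 49, 50)  # taken from the traceback
mask_shape = (32, 50)        # taken from the traceback

try:
    broadcast_shape(scores_shape, mask_shape)
    flat_mask_ok = True
except ValueError as err:
    # Trailing axes match (50 vs. 50), but the next pair (49 vs. 32) does not.
    flat_mask_ok = False
    print(err)

# Hypothetical fix: give the mask an explicit decoder-step axis of size 1.
expanded_shape = broadcast_shape(scores_shape, (32, 1, 50))
print(expanded_shape)
```

If this reading is right, the thing to check is how masks reach the attention layer inside the custom `call` method, for example whether the encoder mask is passed through unchanged while the decoder input is one step shorter.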

In [53]:
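def get_attention_matrix_for_sampled_data(
    attention_model, target_lookup_layer, test_xy, n_samples=5
):
    """Collect attention weights for randomly sampled test sentences.

    `test_xy` is an (inputs, targets) pair of padded token arrays. Returns a
    list of dicts holding the trimmed attention matrix, the input words and
    the predicted words for each sampled sentence.
    """
    test_x, test_y = test_xy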

    # Sample sentence indices with replacement, so duplicates are possible
    rand_ids = np.random.randint(0, len(test_x), size=(n_samples,))
    print(rand_ids)
    results = []

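    for rid in rand_ids:
        # Slice rather than index so the batch dimension of size 1 is kept
        en_input = test_x[rid : rid + 1]
        # Decoder input: the target sentence without its final token
        de_input = test_y[rid : rid + 1, :-1]

        # Drop padding and stop at the end-of-sentence token; en_start_i
        # records where the real tokens begin so the attention columns
        # can be aligned with them below
        clean_en_input = []
        en_start_i = 0
        for i, w in enumerate(en_input.ravel()):
            if w == "<pad>":
                en_start_i = i + 1
                continue

            clean_en_input.append(w)
            if w == "</s>":
                break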

        attn_weights, predictions = attention_model.predict([en_input, de_input])
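        predicted_word_ids = np.argmax(predictions, axis=-1).ravel()
        # Look the vocabulary up once instead of once per predicted word
        vocabulary = target_lookup_layer.get_vocabulary()
        predicted_words = [vocabulary[wid] for wid in predicted_word_ids]

        # Keep the prediction up to and including the first end-of-sentence token
        clean_predicted_words = []
        for w in predicted_words:
            clean_predicted_words.append(w)
            if w == "</s>":
                break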

        results.append(
            {
                "attention_weights": attn_weights[
                    0,
                    : len(clean_predicted_words),
                    en_start_i : en_start_i + len(clean_en_input),
                ],
                "input_words": clean_en_input,
                "predicted_words": clean_predicted_words,
            }
        )

    return results
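To see what the padding and end-of-sentence handling in the function above does to a single sentence, the same loop can be run on a toy token list. This is a minimal sketch: the helper name `trim_tokens` and the example tokens (including `<s>`) are made up for illustration, and only `<pad>` and `</s>` come from the notebook.

```python
def trim_tokens(tokens, pad="<pad>", eos="</s>"):
    """Return (start_index, kept_tokens) for one padded token sequence."""
    start = 0
    kept = []
    for i, token in enumerate(tokens):
        if token == pad:
            # Padding before the sentence shifts the start of the real tokens.
            start = i + 1
            continue
        kept.append(token)
        if token == eos:
            # Anything after the end-of-sentence token is ignored.
            break
    return start, kept


# Padding in front of the sentence: the real tokens start at index 2.
left_start, left_kept = trim_tokens(
    ["<pad>", "<pad>", "<s>", "good", "morning", "</s>"]
)
left_columns = slice(left_start, left_start + len(left_kept))

# Padding after the sentence: the loop stops at </s> before reaching it.
right_start, right_kept = trim_tokens(
    ["<s>", "good", "morning", "</s>", "<pad>", "<pad>"]
)

print(left_start, left_kept, left_columns)
print(right_start, right_kept)
```

The resulting column slice mirrors `en_start_i : en_start_i + len(clean_en_input)` in the function, which is what keeps the attention columns lined up with the words shown on the plot axis.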

In [60]:
# import matplotlib.pyplot as plt
# %matplotlib inline

# _, axes = plt.subplots(5, 1, figsize=(100,100))

# attention_results = get_attention_matrix_for_sampled_data(
#     attention_visualizer,
#     de_lookup_layer,
#     (test_en_sentences_padded, test_de_sentences_padded),
#     n_samples = 5
# )

# for ax, result in zip(axes, attention_results):

#     ax.imshow(result["attention_weights"])
#     x_labels = result["input_words"]
#     y_labels = result["predicted_words"]
#     ax.set_xticks(np.arange(len(x_labels)))
#     ax.set_xticklabels(x_labels, rotation=45)
#     ax.set_yticks(np.arange(len(y_labels)))
#     ax.set_yticklabels(y_labels, rotation=0)