# The data

https://www.kaggle.com/datasets/mohamedlotfy50/wmt-2014-english-french/data
There over 4.5 million sentence pairs available. However, I will only use 25,000 pairs due to computational feasiblility.


In [60]:
import pandas as pd
import numpy as np

n_sentences = 25000

data = pd.read_csv(
    "./data/en-fr/wmt14_translate_fr-en_train.csv", nrows=n_sentences
).dropna()

data.head()

Unnamed: 0,en,fr
0,Resumption of the session,Reprise de la session
1,I declare resumed the session of the European ...,Je déclare reprise la session du Parlement eur...
2,"Although, as you will have seen, the dreaded '...","Comme vous avez pu le constater, le grand ""bog..."
3,You have requested a debate on this subject in...,Vous avez souhaité un débat à ce sujet dans le...
4,"In the meantime, I should like to observe a mi...","En attendant, je souhaiterais, comme un certai..."


# spliting the sentences into tokens


In [61]:
import random

original_en_sentences = [sent.strip().split(" ") for sent in data["en"]]
original_fr_sentences = [sent.strip().split(" ") for sent in data["fr"]]


for i in range(3):
    index = random.randint(0, 10000)
    print("English: ", " ".join(original_en_sentences[index]))
    print("French: ", " ".join(original_fr_sentences[index]), "\n")

English:  Therefore, in future, prior approval should take place in a targeted manner, that is, only in cases of uncertainty or risk.
French:  Les fonctionnaires chargés du contrôle financier devraient exercer leurs fonctions de manière décentralisée, c' est-à-dire dans les directions générales, auprès des collègues qui dépensent l' argent. 

English:  As regards the other aspect of Agenda, the aspect of cohesion and regional development, there we do indeed have great achievements to point to, but there are still less developed regions, particularly island regions, to which more attention should be paid.
French:  En ce qui concerne l' autre volet de l' Agenda, Monsieur le Président de la Commission, à savoir l'aspect de la cohésion et du développement régional, il y a, certes, de grands succès, cependant, il subsiste encore des régions qui restent en retard, et notamment les régions insulaires, sur lesquelles nous devrions nous attarder. 

English:  I do not need to remind you that con

# Adding special tokens

#### I will add "< s >" to mark the start of a sentence and "< /s >" to mark the end of a sentence

This way
we prediction can be done for an arbitrary number of time steps. Using < s > as the starting token gives a
way to signal to the decoder that it should start predicting tokens from the target language.

if < /s > token is not used to mark the end of a sentence, the decoder cannot be signaled to
end a sentence. This can lead the model to enter an infinite loop of predictions.


In [62]:
en_sentences = [["<s>"] + sent + ["</s>"] for sent in original_en_sentences]
fr_sentences = [["<s>"] + sent + ["</s>"] for sent in original_fr_sentences]

for i in range(2):
    index = random.randint(0, 10000)
    print("English: ", " ".join(en_sentences[index]))
    print("German: ", " ".join(fr_sentences[index]), "\n")

English:  <s> Finally, I would like to thank Parliament for the very constructive debate on the key aspects of these proposals and, in particular, of course, the rapporteur, Mrs Berger. </s>
German:  <s> Enfin, je voudrais remercier le Parlement pour le débat très constructif sur les aspects fondamentaux de ces propositions et, en particulier, bien sûr, le rapporteur, Mme Berger. </s> 

English:  <s> Mr Kinnock has also devoted some fine-sounding phrases to this subject but, at the same time, it is entirely unclear, at this moment when we have to make a decision, what, for example, happens with whistle-blowers who want to get something off their chest and cannot do this internally but who want to address the outside world - the press or Parliament. </s>
German:  <s> M. Kinnock y a consacré de belles paroles mais, en même temps, il n' est toujours pas clair du tout, au moment où nous devons prendre une décision, de savoir ce qui va se passer lorsque ces informateurs ne trouveront pas d'

# splitting training and validation dataset

#### 80% training, 10% validation and 10% for testing


In [63]:
from sklearn.model_selection import train_test_split
import numpy as np

(
    train_en_sentences,
    valid_test_en_sentences,
    train_fr_sentences,
    valid_test_fr_sentences,
) = train_test_split(en_sentences, fr_sentences, test_size=0.2)


(valid_en_sentences, test_en_sentences, valid_fr_sentences, test_fr_sentences) = (
    train_test_split(valid_test_en_sentences, valid_test_fr_sentences, test_size=0.5)
)


print(train_en_sentences[1])
print(train_fr_sentences[1])
print("\n")
print(test_en_sentences[0])
print(test_fr_sentences[0])

print(f"Train size: {len(train_en_sentences)}")
print(f"Valid size: {len(valid_en_sentences)}")
print(f"Test size: {len(test_en_sentences)}")

['<s>', 'We', 'can', 'only', 'do', 'that', 'by', 'intensifying', 'co-operation', 'within', 'the', 'Community.', '</s>']
['<s>', 'Ce', 'ne', 'sera', 'possible', 'que', 'par', 'le', 'biais', "d'un", 'renforcement', 'de', 'la', 'coopération', 'au', 'sein', 'de', 'la', 'Communauté.', '</s>']


['<s>', 'One', 'of', 'the', 'very', 'first', 'speakers,', 'Mr', 'Wuori,', 'spoke', 'of', 'the', 'remarks', 'made', 'earlier', 'today', 'by', 'Mr', 'Havel.', '</s>']
['<s>', 'Un', 'des', 'tout', 'premiers', 'orateurs,', 'M.', 'Wuori,', 'a', 'parlé', 'des', 'remarques', "qu'a", 'faites', "aujourd'hui", 'M.', 'Havel.', '</s>']
Train size: 20000
Valid size: 2500
Test size: 2500


### Defining sequence leghts fot the two languages


In [64]:
# Getting some basic statistics from the data

# convert train_en_sentences to a pandas series
pd.Series(train_en_sentences).str.len().describe(percentiles=[0.05, 0.5, 0.95])

count    20000.000000
mean        27.589950
std         15.737363
min          3.000000
5%           8.000000
50%         24.000000
95%         57.000000
max        150.000000
dtype: float64

In [65]:
pd.Series(train_fr_sentences).str.len().describe(percentiles=[0.05, 0.5, 0.95])

count    20000.000000
mean        28.874700
std         16.748511
min          3.000000
5%           8.000000
50%         26.000000
95%         60.000000
max        148.000000
dtype: float64

# from the train data statistics above, 95% of the english sentences have lengths of 57 while in the french, 95 % of sentences have lengths of 60


### padding the sentences with pad_sequences from keras


In [66]:
from tensorflow.keras.preprocessing.sequence import pad_sequences

n_en_seq_length = 60
n_fr_seq_length = 60
# unk_token = "<unk>"
pad_token = "[pad]"

train_en_sentences_padded = pad_sequences(
    train_en_sentences,
    maxlen=n_en_seq_length,
    value=pad_token,
    dtype=object,
    truncating="post",
    padding="post",
)

valid_en_sentences_padded = pad_sequences(
    valid_en_sentences,
    maxlen=n_en_seq_length,
    value=pad_token,
    dtype=object,
    truncating="post",
    padding="post",
)

test_en_sentences_padded = pad_sequences(
    test_en_sentences,
    maxlen=n_en_seq_length,
    value=pad_token,
    dtype=object,
    truncating="post",
    padding="post",
)


train_fr_sentences_padded = pad_sequences(
    train_fr_sentences,
    maxlen=n_fr_seq_length,
    value=pad_token,
    dtype=object,
    truncating="post",
    padding="post",
)

valid_fr_sentences_padded = pad_sequences(
    valid_fr_sentences,
    maxlen=n_fr_seq_length,
    value=pad_token,
    dtype=object,
    truncating="post",
    padding="post",
)

test_fr_sentences_padded = pad_sequences(
    test_fr_sentences,
    maxlen=n_fr_seq_length,
    # value=unk_token,
    dtype=object,
    truncating="post",
    padding="post",
)

print(train_en_sentences_padded[1])

['<s>' 'We' 'can' 'only' 'do' 'that' 'by' 'intensifying' 'co-operation'
 'within' 'the' 'Community.' '</s>' '[pad]' '[pad]' '[pad]' '[pad]'
 '[pad]' '[pad]' '[pad]' '[pad]' '[pad]' '[pad]' '[pad]' '[pad]' '[pad]'
 '[pad]' '[pad]' '[pad]' '[pad]' '[pad]' '[pad]' '[pad]' '[pad]' '[pad]'
 '[pad]' '[pad]' '[pad]' '[pad]' '[pad]' '[pad]' '[pad]' '[pad]' '[pad]'
 '[pad]' '[pad]' '[pad]' '[pad]' '[pad]' '[pad]' '[pad]' '[pad]' '[pad]'
 '[pad]' '[pad]' '[pad]' '[pad]' '[pad]' '[pad]' '[pad]']


# Converting to token IDs


In [67]:
from tensorflow.keras.layers import TextVectorization
import os

# using text vectorization
# text_vectorizer_en = TextVectorization(output_mode="int")
# text_vectorizer_fr = TextVectorization(output_mode="int")
# text_vectorizer_en.adapt(data["en"])
# text_vectorizer_fr.adapt(data["fr"])

en_vocabulary = []
with open(os.path.join("./data/en-fr", "vocab.en"), "r", encoding="utf-8") as en_file:
    for ri, row in enumerate(en_file):

        en_vocabulary.append(row.strip())

fr_vocabulary = []
with open(os.path.join("./data/en-fr", "vocab.fr"), "r", encoding="utf-8") as en_file:
    for ri, row in enumerate(en_file):

        fr_vocabulary.append(row.strip())

text_vectorizer_en = TextVectorization(output_mode="int")
text_vectorizer_fr = TextVectorization(output_mode="int")
text_vectorizer_en.adapt(en_vocabulary)
text_vectorizer_fr.adapt(fr_vocabulary)


en_vocabulary = text_vectorizer_en.get_vocabulary()
fr_vocabulary = text_vectorizer_fr.get_vocabulary()

In [68]:
en_unk_token = en_vocabulary.pop(1)
fr_unk_token = fr_vocabulary.pop(1)

en_unk_token, fr_unk_token

('[UNK]', '[UNK]')

In [69]:
import tensorflow as tf

# pad_token = "<PAD>"

# English look up layer
en_lookup_layer = tf.keras.layers.StringLookup(
    vocabulary=en_vocabulary,
    oov_token=en_unk_token,
    mask_token=pad_token,
    pad_to_max_tokens=False,
)

# French look up layer
fr_lookup_layer = tf.keras.layers.StringLookup(
    vocabulary=fr_vocabulary,
    oov_token=en_unk_token,
    mask_token=pad_token,
    pad_to_max_tokens=False,
)

In [70]:
wid_sample = en_lookup_layer(
    "iron cement protects the ingot against the hot , abrasive steel casting process .".split(
        " "
    )
)
print(f"Word IDs: {wid_sample}")
print(f"Sample vocabulary: {en_lookup_layer.get_vocabulary()[:10]}")

Word IDs: [269792 395373 156726     75 275279 448899     75 287062      1 454356
  94376 397282 159214      1]
Sample vocabulary: ['[pad]', '[UNK]', '', 'o', 'av', 're', 'ms', 'm', 'i', 'd']


In [71]:
# dir(en_lookup_layer)
# en_lookup_layer.get_vocabulary()

# Defining the encoder


In [72]:
# takes n_en_seq_length of sentences
encoder_input = tf.keras.layers.Input(shape=(n_en_seq_length,), dtype=tf.string)

# using lookup layer into word IDs
encoder_wid_out = en_lookup_layer(encoder_input)

"""
With the tokens converted into IDs, route the generated word IDs to a token embedding layer.
Pass in the size of the vocabulary (derived from the en_lookup_layer's get_vocabulary()
method) and the embedding size (128) and finally then ask the layer to mask any zero-valued inputs
as they don’t contain any information:

"""
en_full_vocab_size = len(en_lookup_layer.get_vocabulary())
encoder_emb_out = tf.keras.layers.Embedding(en_full_vocab_size, 128, mask_zero=True)(
    encoder_wid_out
)


encoder_gru_out, encoder_gru_last_state = tf.keras.layers.GRU(
    256, return_sequences=True, return_state=True
)(encoder_emb_out)

encoder = tf.keras.models.Model(inputs=encoder_input, outputs=encoder_gru_out)

# Defining the Decoder with teacher forcing


In [73]:
decoder_input = tf.keras.layers.Input(shape=(n_fr_seq_length - 1,), dtype=tf.string)

# convert tokens to IDs using the de_lookup_layer
decoder_wid_out = fr_lookup_layer(decoder_input)

# decoder embedding layer
fr_full_vocab_size = len(fr_lookup_layer.get_vocabulary())
decoder_emb_out = tf.keras.layers.Embedding(fr_full_vocab_size, 128, mask_zero=True)(
    decoder_wid_out
)

# decoder layer>>> pass the last state of the encoder into the decoder
decoder_gru_out = tf.keras.layers.GRU(256, return_sequences=True)(
    decoder_emb_out, initial_state=encoder_gru_last_state
)

# Badanau Attention


In [74]:
class BahdanauAttention(tf.keras.layers.Layer):
    def __init__(self, units):
        super().__init__()
        # Weights to compute Bahdanau attention
        self.Wa = tf.keras.layers.Dense(units, use_bias=False)
        self.Ua = tf.keras.layers.Dense(units, use_bias=False)

        self.attention = tf.keras.layers.AdditiveAttention(use_scale=True)

    def call(self, query, key, value, mask, return_attention_scores=False):

        # Compute `Wa.ht`.
        wa_query = self.Wa(query)

        # Compute `Ua.hs`.
        ua_key = self.Ua(key)

        # Compute masks
        query_mask = tf.ones(tf.shape(query)[:-1], dtype=bool)
        value_mask = mask

        # Compute the attention
        context_vector, attention_weights = self.attention(
            inputs=[wa_query, value, ua_key],
            mask=[query_mask, value_mask, value_mask],
            return_attention_scores=True,
        )

        if not return_attention_scores:
            return context_vector
        else:
            return context_vector, attention_weights

In [75]:
import tensorflow.keras.backend as K

K.clear_session()

# Defining the encoder layers
encoder_input = tf.keras.layers.Input(shape=(n_en_seq_length,), dtype=tf.string)
# Converting tokens to IDs
encoder_wid_out = en_lookup_layer(encoder_input)

# Embedding layer and lookup
encoder_emb_out = tf.keras.layers.Embedding(
    len(en_lookup_layer.get_vocabulary()), 128, mask_zero=True
)(encoder_wid_out)

# Encoder GRU layer
encoder_gru_out, encoder_gru_last_state = tf.keras.layers.GRU(
    256, return_sequences=True, return_state=True
)(encoder_emb_out)

# Defining the encoder model: in - encoder_input / out - output of the GRU layer
encoder = tf.keras.models.Model(inputs=encoder_input, outputs=encoder_gru_out)

# Defining the decoder layers
decoder_input = tf.keras.layers.Input(shape=(n_fr_seq_length - 1,), dtype=tf.string)
# Converting tokens to IDs (Decoder)
decoder_wid_out = fr_lookup_layer(decoder_input)

# Embedding layer and lookup (decoder)
full_de_vocab_size = len(fr_lookup_layer.get_vocabulary())
decoder_emb_out = tf.keras.layers.Embedding(full_de_vocab_size, 128, mask_zero=True)(
    decoder_wid_out
)
decoder_gru_out = tf.keras.layers.GRU(256, return_sequences=True)(
    decoder_emb_out, initial_state=encoder_gru_last_state
)

# The attention mechanism (inputs: [q, v, k])
decoder_attn_out, attn_weights = BahdanauAttention(256)(
    query=decoder_gru_out,
    key=encoder_gru_out,
    value=encoder_gru_out,
    mask=(encoder_wid_out != 0),
    return_attention_scores=True,
)

# Concatenate GRU output and the attention output
context_and_rnn_output = tf.keras.layers.Concatenate(axis=-1)(
    [decoder_attn_out, decoder_gru_out]
)

# Final prediction layer (size of the vocabulary)
decoder_out = tf.keras.layers.Dense(full_de_vocab_size, activation="softmax")(
    context_and_rnn_output
)

# Final seq2seq model
seq2seq_model = tf.keras.models.Model(
    inputs=[encoder.inputs, decoder_input], outputs=decoder_out
)

# We will use this model later to visualize attention patterns
attention_visualizer = tf.keras.models.Model(
    inputs=[encoder.inputs, decoder_input], outputs=[attn_weights, decoder_out]
)

# Compiling the model with a loss and an optimizer
seq2seq_model.compile(
    loss="sparse_categorical_crossentropy", optimizer="adam", metrics=["accuracy"]
)

# Print model summary
seq2seq_model.summary()



# Defining the final model


In [76]:
decoder_attn_out, attn_weights = BahdanauAttention(256)(
    query=decoder_gru_out,
    key=encoder_gru_out,
    value=encoder_gru_out,
    mask=(encoder_wid_out != 0),  # mask that denotes which tokens need to be ignored
    return_attention_scores=True,
)


# combine the attention output and the decoder's GRU output to create a
# single concatenated input for the prediction
context_and_gru_output = tf.keras.layers.Concatenate(axis=-1)(
    [decoder_attn_out, decoder_gru_out]
)

# Prediction layer takes the concatenated attention's context vewctore andthe GRU ouput to
# produce probability distributions over the French tokens for each time step
decoder_out = tf.keras.layers.Dense(fr_full_vocab_size, activation="softmax")(
    context_and_gru_output
)



In [77]:
# final end-to-end model
seq2seq_model = tf.keras.models.Model(
    inputs=[encoder.inputs, decoder_input], outputs=decoder_out
)

seq2seq_model.compile(
    loss="sparse_categorical_crossentropy", optimizer="adam", metrics=["accuracy"]
)

# Attention visualization


In [78]:
attention_visualizer = tf.keras.models.Model(
    inputs=[encoder.inputs, decoder_input], outputs=[attn_weights, decoder_out]
)

# custom training


### Prepare data


In [79]:
def prepare_data(fr_lookup_layer, train_xy, valid_xy, test_xy):
    """
    The prepare_data() function takes the source sentence and target sentence pairs and generates
    encoder and decoder inputs and decoder labels.

    • fr_lookup_layer   =>  The StringLookup layer of the French language
    • train_xy          =>  A tuple containing tokenized English sentences and tokenized
                            French sentences in the training set, respectively
    • valid_xy          =>  Similar to train_xy but for validation data
    • test_xy           =>  Similar to train_xy but for test data


    For each training, validation, and test dataset, this function generates:

    • encoder_inputs => Tokenized English sentences as in the preprocessed dataset
    • decoder_inputs => All tokens except the last of each French sentence
    • decoder_labels => All token IDs except the first of each French sentence, where token
                        IDs are generated by the fr_lookup_layer

    decoder_labels will be decoder_inputs shifted one token to the left
    """
    # Create a data dictionary from the dataframes containing data
    data_dict = {}
    for label, data_xy in zip(
        ["train", "valid", "test"], [train_xy, valid_xy, test_xy]
    ):
        data_x, data_y = data_xy
        en_inputs = data_x
        fr_inputs = data_y[:, :-1]
        fr_labels = fr_lookup_layer(data_y[:, 1:]).numpy()
        data_dict[label] = {
            "encoder_input": en_inputs,
            "decoder_inputs": fr_inputs,
            "decoder_labels": fr_labels,
        }
    return data_dict

### Shuffle data


In [80]:
def shuffle_data(en_inputs, fr_inputs, fr_labels, shuffle_inds=None):
    """
    Shuffle the data randomly (but all inputs and labels at once)
    """

    if shuffle_inds is None:
        # If shuffle_inds are not passed, create a shuffle automatically
        shuffle_inds = np.random.permutation(np.arange(en_inputs.shape[0]))
    else:
        # Shuffle the provided shuffle_inds
        shuffle_inds = np.random.permutation(shuffle_inds)

    # return shuffled data
    return (
        en_inputs[shuffle_inds],
        fr_inputs[shuffle_inds],
        fr_labels[shuffle_inds],
    ), shuffle_inds

### Train function


#### utility


In [81]:
import time


def prepare_data(de_lookup_layer, train_xy, valid_xy, test_xy):
    """Create a data dictionary from the dataframes containing data"""

    data_dict = {}
    for label, data_xy in zip(
        ["train", "valid", "test"], [train_xy, valid_xy, test_xy]
    ):

        data_x, data_y = data_xy
        en_inputs = data_x
        de_inputs = data_y[:, :-1]
        de_labels = de_lookup_layer(data_y[:, 1:]).numpy()
        data_dict[label] = {
            "encoder_inputs": en_inputs,
            "decoder_inputs": de_inputs,
            "decoder_labels": de_labels,
        }

    return data_dict


def shuffle_data(en_inputs, de_inputs, de_labels, shuffle_inds=None):
    """Shuffle the data randomly (but all of inputs and labels at ones)"""

    if shuffle_inds is None:
        # If shuffle_inds are not passed create a shuffling automatically
        shuffle_inds = np.random.permutation(np.arange(en_inputs.shape[0]))
    else:
        # Shuffle the provided shuffle_inds
        shuffle_inds = np.random.permutation(shuffle_inds)

    # Return shuffled data
    return (
        en_inputs[shuffle_inds],
        de_inputs[shuffle_inds],
        de_labels[shuffle_inds],
    ), shuffle_inds


def check_for_nans(loss, model, en_lookup_layer, de_lookup_layer):

    if np.isnan(loss):
        for r_i in range(len(y)):
            loss_sample, _ = model.evaluate(
                [x[0][r_i : r_i + 1], x[1][r_i : r_i + 1]], y[r_i : r_i + 1], verbose=0
            )
            if np.isnan(loss_sample):

                print("=" * 25, "nan detected", "=" * 25)
                print("train_batch", i, "r_i", r_i)
                print("en_input ->", x[0][r_i].tolist())
                print("en_input_wid ->", en_lookup_layer(x[0][r_i]).numpy().tolist())
                print("de_input ->", x[1][r_i].tolist())
                print("de_input_wid ->", de_lookup_layer(x[1][r_i]).numpy().tolist())
                print("de_output_wid ->", y[r_i].tolist())

                if r_i > 0:
                    print("=" * 25, "no-nan", "=" * 25)
                    print("en_input ->", x[0][r_i - 1].tolist())
                    print(
                        "en_input_wid ->",
                        en_lookup_layer(x[0][r_i - 1]).numpy().tolist(),
                    )
                    print("de_input ->", x[1][r_i - 1].tolist())
                    print(
                        "de_input_wid ->",
                        de_lookup_layer(x[1][r_i - 1]).numpy().tolist(),
                    )
                    print("de_output_wid ->", y[r_i - 1].tolist())
                    return
                else:
                    continue


def train_model(
    model,
    en_lookup_layer,
    de_lookup_layer,
    train_xy,
    valid_xy,
    test_xy,
    epochs,
    batch_size,
    shuffle=True,
    predict_bleu_at_training=False,
):
    """Training the model and evaluating on validation/test sets"""

    # Define the metric
    bleu_metric = BLEUMetric(fr_vocabulary)

    # Define the data
    data_dict = prepare_data(de_lookup_layer, train_xy, valid_xy, test_xy)

    shuffle_inds = None

    for epoch in range(epochs):

        # Reset metric logs every epoch
        if predict_bleu_at_training:
            blue_log = []
        accuracy_log = []
        loss_log = []

        # =================================================================== #
        #                         Train Phase                                 #
        # =================================================================== #

        # Shuffle data at the beginning of every epoch
        if shuffle:
            (en_inputs_raw, de_inputs_raw, de_labels), shuffle_inds = shuffle_data(
                data_dict["train"]["encoder_inputs"],
                data_dict["train"]["decoder_inputs"],
                data_dict["train"]["decoder_labels"],
                shuffle_inds,
            )
        else:
            (en_inputs_raw, de_inputs_raw, de_labels) = (
                data_dict["train"]["encoder_inputs"],
                data_dict["train"]["decoder_inputs"],
                data_dict["train"]["decoder_labels"],
            )
        # Get the number of training batches
        n_train_batches = en_inputs_raw.shape[0] // batch_size

        prev_loss = None
        # Train one batch at a time
        for i in range(n_train_batches):
            # Status update
            print(f"Training batch {i+1}/{n_train_batches}", end="\r")

            # Get a batch of inputs (english and german sequences)
            x = [
                en_inputs_raw[i * batch_size : (i + 1) * batch_size],
                de_inputs_raw[i * batch_size : (i + 1) * batch_size],
            ]
            # Get a batch of targets (german sequences offset by 1)
            y = de_labels[i * batch_size : (i + 1) * batch_size]

            loss, accuracy = model.evaluate(x, y, verbose=0)

            # Check if any samples are causing NaNs
            check_for_nans(loss, model, en_lookup_layer, de_lookup_layer)

            # Train for a single step
            model.train_on_batch(x, y)
            # Evaluate the model to get the metrics
            # loss, accuracy = model.evaluate(x, y, verbose=0)

            # Update the epoch's log records of the metrics
            loss_log.append(loss)
            accuracy_log.append(accuracy)

            if predict_bleu_at_training:
                # Get the final prediction to compute BLEU
                pred_y = model.predict(x)
                bleu_log.append(bleu_metric.calculate_bleu_from_predictions(y, pred_y))

        print("")
        print(f"\nEpoch {epoch+1}/{epochs}")
        if predict_bleu_at_training:
            print(
                f"\t(train) loss: {np.mean(loss_log)} - accuracy: {np.mean(accuracy_log)} - bleu: {np.mean(bleu_log)}"
            )
        else:
            print(
                f"\t(train) loss: {np.mean(loss_log)} - accuracy: {np.mean(accuracy_log)}"
            )
        # =================================================================== #
        #                      Validation Phase                               #
        # =================================================================== #

        val_en_inputs = data_dict["valid"]["encoder_inputs"]
        val_de_inputs = data_dict["valid"]["decoder_inputs"]
        val_de_labels = data_dict["valid"]["decoder_labels"]

        val_loss, val_accuracy, val_bleu = evaluate_model(
            model,
            de_lookup_layer,
            val_en_inputs,
            val_de_inputs,
            val_de_labels,
            batch_size,
        )

        # Print the evaluation metrics of each epoch
        print(
            f"\t(valid) loss: {val_loss} - accuracy: {val_accuracy} - bleu: {val_bleu}"
        )

    # =================================================================== #
    #                      Test Phase                                     #
    # =================================================================== #

    test_en_inputs = data_dict["test"]["encoder_inputs"]
    test_de_inputs = data_dict["test"]["decoder_inputs"]
    test_de_labels = data_dict["test"]["decoder_labels"]

    test_loss, test_accuracy, test_bleu = evaluate_model(
        model,
        de_lookup_layer,
        test_en_inputs,
        test_de_inputs,
        test_de_labels,
        batch_size,
    )

    print(f"\n(test) loss: {test_loss} - accuracy: {test_accuracy} - bleu: {test_bleu}")


def evaluate_model(
    model, de_lookup_layer, en_inputs_raw, de_inputs_raw, de_labels, batch_size
):
    """Evaluate the model on various metrics such as loss, accuracy and BLEU"""

    # Define the metric
    bleu_metric = BLEUMetric(de_vocabulary)

    loss_log, accuracy_log, bleu_log = [], [], []
    # Get the number of batches
    n_batches = en_inputs_raw.shape[0] // batch_size
    print(" ", end="\r")

    # Evaluate one batch at a time
    for i in range(n_batches):
        # Status update
        print(f"Evaluating batch {i+1}/{n_batches}", end="\r")

        # Get the inputs and targers
        x = [
            en_inputs_raw[i * batch_size : (i + 1) * batch_size],
            de_inputs_raw[i * batch_size : (i + 1) * batch_size],
        ]
        y = de_labels[i * batch_size : (i + 1) * batch_size]

        # Get the evaluation metrics
        loss, accuracy = model.evaluate(x, y, verbose=0)
        # Get the predictions to compute BLEU
        pred_y = model.predict(x)

        # Update logs
        loss_log.append(loss)
        accuracy_log.append(accuracy)
        bleu_log.append(bleu_metric.calculate_bleu_from_predictions(y, pred_y))

    return np.mean(loss_log), np.mean(accuracy_log), np.mean(bleu_log)

In [82]:
def train_model(
    model,
    en_lookup_layer,
    fr_lookup_layer,
    train_xy,
    valid_xy,
    test_xy,
    epochs,
    batch_size,
    shuffle=True,
    predict_bleu_at_training=False,
):
    """Training the model and evaluating on validation/test sets"""

    # Define the metric
    bleu_metric = BLEUMetric(en_vocabulary)

    # Define the data
    data_dict = prepare_data(fr_lookup_layer, train_xy, valid_xy, test_xy)

    shuffle_inds = None

    for epoch in range(epochs):

        # Reset metric logs every epoch
        if predict_bleu_at_training:
            bleu_log = []
        accuracy_log = []
        loss_log = []

        # ========================================================== #
        #                   Train Phase                              #
        # ========================================================== #

        # shuffle data at the beginning of every epoch
        if shuffle:
            (en_inputs_raw, fr_inputs_raw, fr_labels), shuffle_inds = shuffle_data(
                data_dict["train"]["encoder_inputs"],
                data_dict["train"]["decoder_inputs"],
                data_dict["train"]["decoder_labels"],
                shuffle_inds,
            )
        else:
            (en_inputs_raw, de_inputs_raw, de_labels) = (
                data_dict["train"]["encoder_inputs"],
                data_dict["train"]["decoder_inputs"],
                data_dict["train"]["decoder_labels"],
            )

        # Get the number of training batches
        n_train_batches = en_inputs_raw.shape[0] // batch_size

        prev_loss = None
        # Train one batch at a time
        for i in range(n_train_batches):
            # Status update
            print("Training batch {}/{}".format(i + 1, n_train_batches), end="\r")

            # Get a batch of inputs (english and french sequences)
            x = [
                en_inputs_raw[i * batch_size : (i + 1) * batch_size],
                fr_inputs_raw[i * batch_size : (i + 1) * batch_size],
            ]

            # Get a batch of targets (french sequences offset by 1)
            y = fr_labels[i * batch_size : (i + 1) * batch_size]

            loss, accuracy = model.evaluate(x, y, verbose=0)

            # check if any samples are causing NaNs
            check_for_nans(loss, model, en_lookup_layer, fr_lookup_layer)

            # Train for a single step
            model.train_on_batch(x, y)

            # Update the epoch's log records of the metrics
            loss_log.append(loss)
            accuracy_log.append(accuracy)

            if predict_bleu_at_training:
                # Get the final prediction to compute BLEU
                pred_y = model.predict(x)
                bleu_log.append(bleu_metric.calculate_bleu_from_predictions(pred_y))

            print("")
            print("\nEpoch {}/{}".format(epoch + 1, epochs))
            if predict_bleu_at_training:

                print(
                    f"\t(train) loss: {np.mean(loss_log)} - accuracy:{np.mean(accuracy_log)} - bleu: {np.mean(bleu_log)}"
                )
            else:
                print(
                    f"\t(train) loss: {np.mean(loss_log)} - accuracy: {np.mean(accuracy_log)}"
                )

            # ========================================================== #
            #                   Validation Phase                         #
            # ========================================================== #

            val_en_inputs = data_dict["valid"]["encoder_inputs"]
            val_fr_inputs = data_dict["valid"]["decoder_inputs"]
            val_fr_labels = data_dict["valid"]["decoder_labels"]

            val_loss, val_accuracy, val_bleu = evaluate_model(
                model,
                fr_lookup_layer,
                val_en_inputs,
                val_fr_inputs,
                val_fr_labels,
                batch_size,
            )
            # Print the evaluation metrics of each epoch
            print(
                "\t(valid) loss: {} - accuracy: {} - bleu: {}".format(
                    val_loss, val_accuracy, val_bleu
                )
            )

        # ========================================================== #
        #                   test Phase                               #
        # ========================================================== #

        test_en_inputs = data_dict["test"]["encoder_inputs"]
        test_fr_inputs = data_dict["test"]["decoder_inputs"]
        test_fr_labels = data_dict["test"]["decoder_labels"]
        test_loss, test_accuracy, test_bleu = evaluate_model(
            model,
            fr_lookup_layer,
            test_en_inputs,
            test_fr_inputs,
            test_fr_labels,
            batch_size,
        )

        print(
            "\n(test) loss: {} - accuracy: {} - bleu: {}".format(
                test_loss, test_accuracy, test_bleu
            )
        )

## visualizing attention


In [83]:
def get_attention_matrix_for_sampled_data(
    attention_model, target_lookup_layer, test_xy, n_samples=5
):
    test_x, test_y = test_xy
    rand_ids = np.random.randint(0, len(test_xy[0]), size=(n_samples,))
    results = []
    for rid in rand_ids:
        en_input = test_x[rid : rid + 1]
        fr_input = test_y[rid : rid + 1, :-1]
        attn_weights, predictions = attention_model.predict([en_input, fr_input])
        predicted_word_ids = np.argmax(predictions, axis=-1).ravel()
        predicted_words = [
            target_lookup_layer.get_vocabulary()[wid] for wid in predicted_word_ids
        ]
        clean_en_input = []
        en_start_i = 0
        for i, w in enumerate(en_input.ravel()):
            if w == "<pad>":
                en_start_i = i + 1
                continue
            clean_en_input.append(w)
            if w == "</s>":
                break
        clean_predicted_words = []
        for w in predicted_words:
            clean_predicted_words.append(w)
            if w == "</s>":
                break

            results.append(
                {
                    "attention_weights": attn_weights[
                        0,
                        : len(clean_predicted_words),
                        en_start_i : en_start_i + len(clean_en_input),
                    ],
                    "input_words": clean_en_input,
                    "predicted_words": clean_predicted_words,
                }
            )
    return results

In [84]:
train_en_sentences_padded_ID = text_vectorizer_en(str(train_en_sentences_padded))
train_fr_sentences_padded_ID = text_vectorizer_fr(str(train_fr_sentences_padded))


valid_en_sentences_padded_ID = text_vectorizer_en(str(train_en_sentences_padded))
valid_fr_sentences_padded_ID = text_vectorizer_fr(str(train_fr_sentences_padded))

test_en_sentences_padded_ID = text_vectorizer_en(str(test_en_sentences_padded))
test_fr_sentences_padded_ID = text_vectorizer_fr(str(test_fr_sentences_padded))

In [85]:
import time

epochs = 10
batch_size = 72

t1 = time.time()
train_model(
    seq2seq_model,
    en_lookup_layer,
    fr_lookup_layer,
    (train_en_sentences_padded_ID, train_fr_sentences_padded_ID),
    (
        valid_en_sentences_padded_ID,
        valid_fr_sentences_padded_ID,
    ),
    (
        test_en_sentences_padded_ID,
        test_fr_sentences_padded_ID,
    ),
    epochs,
    batch_size,
    shuffle=False,
)
t2 = time.time()

print(f"\nIt took {t2-t1} seconds to complete the training")

2024-10-15 12:29:25.713722: W tensorflow/core/framework/op_kernel.cc:1840] OP_REQUIRES failed at strided_slice_op.cc:117 : INVALID_ARGUMENT: Index out of range using input dim 1; input has only 1 dims
2024-10-15 12:29:25.713766: I tensorflow/core/framework/local_rendezvous.cc:405] Local rendezvous is aborting with status: INVALID_ARGUMENT: Index out of range using input dim 1; input has only 1 dims


InvalidArgumentError: {{function_node __wrapped__StridedSlice_device_/job:localhost/replica:0/task:0/device:CPU:0}} Index out of range using input dim 1; input has only 1 dims [Op:StridedSlice] name: strided_slice/