# The data

Source: https://www.kaggle.com/datasets/mohamedlotfy50/wmt-2014-english-french/data

There are over 4.5 million sentence pairs available. However, I will only use 25,000 pairs to keep the computation feasible.


In [1]:
import pandas as pd
import numpy as np

n_sentences = 25000

data = pd.read_csv(
    "./data/en-fr/wmt14_translate_fr-en_train.csv", nrows=n_sentences
).dropna()

data.head()

| Unnamed: 0 | en | fr |
|---|---|---|
| 0 | Resumption of the session | Reprise de la session |
| 1 | I declare resumed the session of the European ... | Je déclare reprise la session du Parlement eur... |
| 2 | Although, as you will have seen, the dreaded '... | Comme vous avez pu le constater, le grand "bog... |
| 3 | You have requested a debate on this subject in... | Vous avez souhaité un débat à ce sujet dans le... |
| 4 | In the meantime, I should like to observe a mi... | En attendant, je souhaiterais, comme un certai... |


# Splitting the sentences into tokens


In [2]:
import random

original_en_sentences = [sent.strip().split(" ") for sent in data["en"]]
original_fr_sentences = [sent.strip().split(" ") for sent in data["fr"]]


for i in range(3):
    index = random.randint(0, 10000)
    print("English: ", " ".join(original_en_sentences[index]))
    print("French: ", " ".join(original_fr_sentences[index]), "\n")

English:  I completely understand your concerns, and I feel that there is a need for effort and a certain degree of consistency within the European Union itself on this issue.
French:  Je comprends parfaitement les inquiétudes de M. le député et je vois bien que l'on a besoin d'un effort et d'une certaine cohérence y compris à l'intérieur de l'Union européenne en ce qui concerne cette matière. 

English:  If the world' s other important currency trading areas were to be excluded from the Tobin tax, the fact that tax was liable to be paid in Europe would lead to currency dealing moving to those areas.
French:  Si les autres zones de change importantes du monde restaient en dehors de la taxe Tobin, la taxe perçue en Europe aurait pour effet de transférer le marché des changes vers ces zones-là. 

English:  Three amendments can be partially approved, and one can be approved in principle.
French:  Trois amendements peuvent être partiellement adoptés et un peut l'être en principe. 
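
Splitting on single spaces leaves punctuation attached to the neighbouring word (for example, `issue.` in the first sample above stays one token). For comparison, here is a minimal sketch of a regex-based alternative. It is not what this notebook uses, and `whitespace_tokenize` and `punct_tokenize` are hypothetical helper names introduced only for illustration.

```python
import re


def whitespace_tokenize(sentence):
    """Same behaviour as the notebook: split on single spaces."""
    return sentence.strip().split(" ")


def punct_tokenize(sentence):
    """Hypothetical alternative: separate punctuation marks from words."""
    return re.findall(r"\w+|[^\w\s]", sentence)


sample = "Three amendments can be approved."
print(whitespace_tokenize(sample))
print(punct_tokenize(sample))
```

The second tokenizer yields a smaller vocabulary because `approved` and `approved.` are no longer distinct entries, at the cost of longer sequences.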



# Adding special tokens

#### I will add `<s>` to mark the start of a sentence and `</s>` to mark the end of a sentence

This way, prediction can run for an arbitrary number of time steps. Using `<s>` as the starting token gives a
way to signal to the decoder that it should start predicting tokens from the target language.

If the `</s>` token is not used to mark the end of a sentence, the decoder has no signal telling it to
stop. This can lead the model into an endless loop of predictions.
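
The role of the two markers at inference time can be sketched with a toy decoding loop. This is only an illustration, not the model built later in this notebook: `toy_next_token` is a hypothetical stand-in for a trained decoder, and `max_steps` is an assumed safety cap.

```python
def toy_next_token(prev_token):
    """Hypothetical stand-in for a trained decoder: maps the previous token to the next one."""
    transitions = {
        "<s>": "Reprise",
        "Reprise": "de",
        "de": "la",
        "la": "session",
        "session": "</s>",
    }
    return transitions.get(prev_token, "</s>")


def greedy_decode(next_token_fn, max_steps=60):
    """Generate tokens starting from <s> until </s> appears or max_steps is reached."""
    token = "<s>"  # the start marker tells the decoder to begin predicting
    output = []
    for _ in range(max_steps):
        token = next_token_fn(token)
        if token == "</s>":  # the end marker stops the loop
            break
        output.append(token)
    return output


decoded = greedy_decode(toy_next_token)
print(" ".join(decoded))
```

Without the `</s>` check, the loop would only stop because of the `max_steps` cap.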


In [3]:
en_sentences = [["<s>"] + sent + ["</s>"] for sent in original_en_sentences]
fr_sentences = [["<s>"] + sent + ["</s>"] for sent in original_fr_sentences]

for i in range(2):
    index = random.randint(0, 10000)
    print("English: ", " ".join(en_sentences[index]))
    print("French: ", " ".join(fr_sentences[index]), "\n")

English:  <s> Not only do these decisions conflict with the Treaty and apportion to the institutions of the Union more power than that to which they are entitled but, worst of all, their effect will be counter-productive. </s>
French:  <s> Les adoptions de ce jour sont non seulement contraires au Traité et octroient aux organes de l'Union davantage de pouvoirs que ceux qui lui reviennent, mais surtout, elles produiront un effet contraire. </s> 

English:  <s> I also believe that the Commission, or President Prodi, has acted as a guardian of the Treaties. </s>
French:  <s> En outre, je pense que la Commission, Monsieur le Président Prodi, a été gardienne des Traités. </s> 



# Splitting into training, validation and test datasets

#### 80% training, 10% validation and 10% for testing


In [4]:
from sklearn.model_selection import train_test_split
import numpy as np

(
    train_en_sentences,
    valid_test_en_sentences,
    train_fr_sentences,
    valid_test_fr_sentences,
) = train_test_split(en_sentences, fr_sentences, test_size=0.2)


(valid_en_sentences, test_en_sentences, valid_fr_sentences, test_fr_sentences) = (
    train_test_split(valid_test_en_sentences, valid_test_fr_sentences, test_size=0.5)
)


print(train_en_sentences[1])
print(train_fr_sentences[1])
print("\n")
print(test_en_sentences[0])
print(test_fr_sentences[0])

['<s>', 'The', 'situation', 'should', 'also', 'remind', 'us', 'all', 'that', 'the', 'ratification', 'of', 'the', 'Treaty', 'for', 'an', 'International', 'Criminal', 'Court', 'has', 'been', 'shamefully', 'slow.', '</s>']
['<s>', 'La', 'situation', 'devrait', 'nous', 'rappeler', 'aussi', 'le', 'fait', 'que', 'la', 'ratification', 'du', 'traité', 'instituant', 'le', 'Tribunal', 'pénal', 'international', 'a', 'été', 'honteusement', 'lente.', '</s>']


['<s>', 'I', 'must', 'say', 'that', 'Alex', "Langer's", 'presence', 'here', 'in', 'the', 'European', 'Parliament', 'is', 'sadly', 'missed.', 'His', 'efforts', 'to', 'build', 'peace', 'should', 'serve', 'as', 'an', 'example,', 'to', 'all', 'of', 'us,', 'on', 'how', 'peaceful', 'coexistence', 'can', 'be', 'achieved.', '</s>']
['<s>', 'Je', 'tiens', 'à', 'dire', 'que', 'je', 'ressens', 'fortement', "l'absence", 'en', 'ce', 'moment,', 'dans', 'ce', 'Parlement,', "d'Alex", 'Langer,', 'qui', 'a', 'été', 'un', 'architecte', 'de', 'la', 'paix', 'et',

### Defining sequence lengths for the two languages


In [5]:
# Getting some basic statistics from the data

# convert train_en_sentences to a pandas series
pd.Series(train_en_sentences).str.len().describe(percentiles=[0.05, 0.5, 0.95])

count    20000.000000
mean        27.699450
std         15.769621
min          3.000000
5%           8.000000
50%         25.000000
95%         58.000000
max        150.000000
dtype: float64

In [6]:
pd.Series(train_fr_sentences).str.len().describe(percentiles=[0.05, 0.5, 0.95])

count    20000.000000
mean        28.995000
std         16.805437
min          3.000000
5%           9.000000
50%         26.000000
95%         60.000000
max        154.000000
dtype: float64

# From the training data statistics above, 95% of the English sentences have at most 58 tokens, while 95% of the French sentences have at most 60 tokens (both counts include the start and end markers)
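
One way to sanity-check a chosen sequence length is to measure what fraction of sentences fit within it without truncation. The sketch below does this in plain Python; `coverage` is a hypothetical helper and `toy_lengths` is made-up data, not taken from the dataset.

```python
def coverage(lengths, max_len):
    """Fraction of sequences whose length is at most max_len."""
    return sum(1 for n in lengths if n <= max_len) / len(lengths)


# Made-up token counts for illustration only
toy_lengths = [3, 8, 12, 25, 25, 30, 41, 58, 60, 150]

for max_len in (25, 60):
    print(max_len, coverage(toy_lengths, max_len))
```

On the real data, the same function could be applied to `[len(s) for s in train_en_sentences]` to see how many sentences a given length would truncate.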


### Padding the sentences with pad_sequences from Keras


In [7]:
from tensorflow.keras.preprocessing.sequence import pad_sequences

n_en_seq_length = 60
n_fr_seq_length = 60
unk_token = "<unk>"

train_en_sentences_padded = pad_sequences(
    train_en_sentences,
    maxlen=n_en_seq_length,
    # value=unk_token,
    dtype=object,
    truncating="post",
    padding="post",
)

valid_en_sentences_padded = pad_sequences(
    valid_en_sentences,
    maxlen=n_en_seq_length,
    # value=unk_token,
    dtype=object,
    truncating="post",
    padding="post",
)

test_en_sentences_padded = pad_sequences(
    test_en_sentences,
    maxlen=n_en_seq_length,
    # value=unk_token,
    dtype=object,
    truncating="post",
    padding="post",
)


train_fr_sentences_padded = pad_sequences(
    train_fr_sentences,
    maxlen=n_fr_seq_length,
    # value=unk_token,
    dtype=object,
    truncating="post",
    padding="post",
)

valid_fr_sentences_padded = pad_sequences(
    valid_fr_sentences,
    maxlen=n_fr_seq_length,
    # value=unk_token,
    dtype=object,
    truncating="post",
    padding="post",
)

test_fr_sentences_padded = pad_sequences(
    test_fr_sentences,
    maxlen=n_fr_seq_length,
    # value=unk_token,
    dtype=object,
    truncating="post",
    padding="post",
)

print(train_en_sentences_padded[1])

['<s>' 'The' 'situation' 'should' 'also' 'remind' 'us' 'all' 'that' 'the'
 'ratification' 'of' 'the' 'Treaty' 'for' 'an' 'International' 'Criminal'
 'Court' 'has' 'been' 'shamefully' 'slow.' '</s>' 0.0 0.0 0.0 0.0 0.0 0.0
 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0]
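
In the output above, the padded positions hold the float `0.0` (the default `value` of `pad_sequences`) rather than a string. As an illustration of string-based padding, here is a minimal pure-Python sketch. `pad_or_truncate` is a hypothetical helper, and using `"[PAD]"` as the fill value is an assumption chosen to match the mask token passed to the lookup layers later in the notebook.

```python
def pad_or_truncate(tokens, max_len, pad_token="[PAD]"):
    """Cut tokens to max_len from the end ("post" truncation),
    then fill the end with pad_token ("post" padding)."""
    clipped = tokens[:max_len]
    return clipped + [pad_token] * (max_len - len(clipped))


sentence = ["<s>", "Resumption", "of", "the", "session", "</s>"]
print(pad_or_truncate(sentence, 8))  # two pad tokens appended
print(pad_or_truncate(sentence, 4))  # sentence cut after four tokens
```

Note that "post" truncation removes the `</s>` marker from any sentence longer than the chosen length, which is also true of the `pad_sequences` calls above.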


# Converting to token IDs


In [8]:
from tensorflow.keras.layers import TextVectorization
import os

# using text vectorization
# text_vectorizer_en = TextVectorization(output_mode="int")
# text_vectorizer_fr = TextVectorization(output_mode="int")
# text_vectorizer_en.adapt(data["en"])
# text_vectorizer_fr.adapt(data["fr"])

# Read the vocabulary files, one entry per line
en_vocabulary = []
with open(os.path.join("./data/en-fr", "vocab.en"), "r", encoding="utf-8") as en_file:
    for row in en_file:
        en_vocabulary.append(row.strip())

fr_vocabulary = []
with open(os.path.join("./data/en-fr", "vocab.fr"), "r", encoding="utf-8") as fr_file:
    for row in fr_file:
        fr_vocabulary.append(row.strip())

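# standardize=None keeps case and punctuation, so vocabulary entries stay comparable
# to the whitespace-split tokens above (the default lowercases and strips punctuation)
text_vectorizer_en = TextVectorization(output_mode="int", standardize=None)
text_vectorizer_fr = TextVectorization(output_mode="int", standardize=None)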
text_vectorizer_en.adapt(en_vocabulary)
text_vectorizer_fr.adapt(fr_vocabulary)


en_vocabulary = text_vectorizer_en.get_vocabulary()
fr_vocabulary = text_vectorizer_fr.get_vocabulary()

In [9]:
en_unk_token = en_vocabulary.pop(1)
fr_unk_token = fr_vocabulary.pop(1)

en_unk_token, fr_unk_token

('[UNK]', '[UNK]')

In [10]:
import tensorflow as tf

pad_token = "[PAD]"

# English look up layer
en_lookup_layer = tf.keras.layers.StringLookup(
    vocabulary=en_vocabulary,
    oov_token=en_unk_token,
    mask_token=pad_token,
    pad_to_max_tokens=False,
)

# French look up layer
fr_lookup_layer = tf.keras.layers.StringLookup(
    vocabulary=fr_vocabulary,
    oov_token=fr_unk_token,
    mask_token=pad_token,
    pad_to_max_tokens=False,
)
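
The index layout of a lookup layer configured with both a mask token and a single out-of-vocabulary token can be mimicked with a plain dictionary: index 0 for the mask token, index 1 for the OOV token, and the vocabulary from index 2 onwards. The sketch below is a simplified stand-in, not the Keras implementation; `build_lookup` and `tokens_to_ids` are hypothetical helpers and the three-entry vocabulary is made up.

```python
def build_lookup(vocabulary, mask_token="[PAD]", oov_token="[UNK]"):
    """Map the mask token to 0, the OOV token to 1, and vocabulary tokens to 2, 3, ..."""
    full_vocab = [mask_token, oov_token] + list(vocabulary)
    return {token: index for index, token in enumerate(full_vocab)}


def tokens_to_ids(tokens, lookup, oov_token="[UNK]"):
    """Replace each token by its ID, falling back to the OOV ID for unknown tokens."""
    return [lookup.get(token, lookup[oov_token]) for token in tokens]


lookup = build_lookup(["<s>", "</s>", "session"])
ids = tokens_to_ids(["<s>", "Resumption", "session", "</s>", "[PAD]"], lookup)
print(ids)
```

Because padding maps to ID 0, an embedding layer with `mask_zero=True` can skip those positions.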

In [11]:
# dir(en_lookup_layer)
# en_lookup_layer.get_vocabulary()

# Defining the encoder


In [12]:
# takes sequences of n_en_seq_length string tokens
encoder_input = tf.keras.layers.Input(shape=(n_en_seq_length,), dtype=tf.string)

# convert the tokens into word IDs with the lookup layer
encoder_wid_out = en_lookup_layer(encoder_input)

# With the tokens converted into IDs, route the word IDs to a token embedding layer.
# Pass in the vocabulary size (from en_lookup_layer.get_vocabulary()) and the
# embedding size (128), and ask the layer to mask zero-valued inputs, since they
# are padding and carry no information.
en_full_vocab_size = len(en_lookup_layer.get_vocabulary())
encoder_emb_out = tf.keras.layers.Embedding(en_full_vocab_size, 128, mask_zero=True)(
    encoder_wid_out
)


encoder_gru_out, encoder_gru_last_state = tf.keras.layers.GRU(
    256, return_sequences=True, return_state=True
)(encoder_emb_out)

encoder = tf.keras.models.Model(inputs=encoder_input, outputs=encoder_gru_out)

# Defining the Decoder with teacher forcing


In [13]:
decoder_input = tf.keras.layers.Input(shape=(n_fr_seq_length - 1,), dtype=tf.string)

# convert tokens to IDs using the fr_lookup_layer
decoder_wid_out = fr_lookup_layer(decoder_input)

# decoder embedding layer
fr_full_vocab_size = len(fr_lookup_layer.get_vocabulary())
decoder_emb_out = tf.keras.layers.Embedding(fr_full_vocab_size, 128, mask_zero=True)(
    decoder_wid_out
)

# decoder GRU layer: the last state of the encoder is passed in as the initial state
decoder_gru_out = tf.keras.layers.GRU(256, return_sequences=True)(
    decoder_emb_out, initial_state=encoder_gru_last_state
)
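
The decoder input above has length `n_fr_seq_length - 1`, which is consistent with the usual teacher-forcing setup: the decoder receives the target sentence without its last position and is trained to predict the same sentence shifted one step to the left. The training code is not shown in this section, so the slicing below is an assumption, and `teacher_forcing_pair` is a hypothetical helper.

```python
def teacher_forcing_pair(target_tokens):
    """Return (decoder_input, decoder_target) for one padded target sentence.

    decoder_input  = all tokens except the last
    decoder_target = all tokens except the first
    """
    return target_tokens[:-1], target_tokens[1:]


fr = ["<s>", "Reprise", "de", "la", "session", "</s>", "[PAD]", "[PAD]"]
dec_in, dec_target = teacher_forcing_pair(fr)

for step_in, step_target in zip(dec_in, dec_target):
    print(f"{step_in:>10} -> {step_target}")
```

At every step the decoder sees the true previous token rather than its own earlier prediction, which is what distinguishes teacher forcing from free-running decoding.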

# Bahdanau Attention


In [14]:
class BahdanauAttention(tf.keras.layers.Layer):
    def __init__(self, units):
        super().__init__()
        # Weights to compute Bahdanau attention
        self.Wa = tf.keras.layers.Dense(units, use_bias=False)
        self.Ua = tf.keras.layers.Dense(units, use_bias=False)

        self.attention = tf.keras.layers.AdditiveAttention(
            use_scale=True
        )  # takes query, value and key (in that order)
        # query = the decoder GRU's hidden state at each time step

    def call(self, query, key, value, mask, return_attention_scores=False):

        # compute 'Wa.ht'.

        wa_query = self.Wa(query)

        # Compute Ua.hs
        ua_key = self.Ua(key)

        # compute masks
        query_mask = tf.ones(tf.shape(query)[:-1], dtype=bool)
        value_mask = mask

        # Compute the attention
        context_vector, attention_weights = self.attention(
            inputs=[wa_query, value, ua_key],
            mask=[query_mask, value_mask, value_mask],
            return_attention_scores=True,
        )

        if not return_attention_scores:
            return context_vector
        else:
            return context_vector, attention_weights
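
To make the computation concrete, here is a minimal pure-Python sketch of additive attention for a single query vector. It assumes the query and keys have already been projected (the role of `Wa` and `Ua` above) and uses a per-dimension `scale` vector as the scoring weights. The function name and the toy numbers are hypothetical, and the exact correspondence to the internals of the Keras layer should be treated as an assumption.

```python
import math


def additive_attention(query, keys, values, scale=None, mask=None):
    """Additive (Bahdanau-style) attention for one query vector.

    score_s = sum_k scale[k] * tanh(query[k] + keys[s][k])
    """
    scale = scale if scale is not None else [1.0] * len(query)
    mask = mask if mask is not None else [True] * len(keys)

    scores = []
    for key, keep in zip(keys, mask):
        score = sum(w * math.tanh(q + k) for w, q, k in zip(scale, query, key))
        scores.append(score if keep else float("-inf"))

    # Softmax over source positions; masked positions receive weight 0
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    # Context vector: attention-weighted sum of the value vectors
    dim = len(values[0])
    context = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return context, weights


query = [0.5, -0.2]
keys = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
context, weights = additive_attention(query, keys, values, mask=[True, True, False])
print(weights)
print(context)
```

The third source position is masked, so it contributes nothing to the context vector, which is the same effect the padding mask has in the layer above.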

# Defining the final model


In [15]:
decoder_attn_out, attn_weights = BahdanauAttention(256)(
    query=decoder_gru_out,
    key=encoder_gru_out,
    value=encoder_gru_out,
    mask=(encoder_wid_out != 0),  # mask that denotes which tokens need to be ignored
    return_attention_scores=True,
)


# combine the attention output and the decoder's GRU output to create a
# single concatenated input for the prediction
context_and_gru_output = tf.keras.layers.Concatenate(axis=-1)(
    [decoder_attn_out, decoder_gru_out]
)

# Prediction layer takes the concatenated attention context vector and the GRU output to
# produce probability distributions over the French tokens for each time step
decoder_out = tf.keras.layers.Dense(fr_full_vocab_size, activation="softmax")(
    context_and_gru_output
)



In [18]:
# final end-to-end model
seq2seq_model = tf.keras.models.Model(
    inputs=[encoder.inputs, decoder_input], outputs=decoder_out
)

seq2seq_model.compile(
    loss="sparse_categorical_crossentropy", optimizer="adam", metrics=["accuracy"]
)
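
The `sparse_categorical_crossentropy` loss compares integer target IDs with predicted probability distributions. As a rough illustration of the quantity being minimised, the sketch below computes the mean negative log-probability of the correct token in plain Python. It ignores padding and masking, and the probabilities are made up.

```python
import math


def sparse_categorical_crossentropy(target_ids, predicted_probs):
    """Mean of -log(p[target]) over time steps (no masking of padded steps)."""
    losses = [-math.log(probs[t]) for t, probs in zip(target_ids, predicted_probs)]
    return sum(losses) / len(losses)


# Two time steps over a toy vocabulary of three tokens
targets = [2, 0]
probs = [
    [0.1, 0.2, 0.7],    # step 1: correct token (ID 2) gets probability 0.7
    [0.5, 0.25, 0.25],  # step 2: correct token (ID 0) gets probability 0.5
]
loss = sparse_categorical_crossentropy(targets, probs)
print(round(loss, 4))
```

"Sparse" refers to the targets being integer IDs rather than one-hot vectors, which avoids materialising a vocabulary-sized vector for every target token.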

# Attention visualization


In [20]:
attention_visualizer = tf.keras.models.Model(
    inputs=[encoder.inputs, decoder_input], outputs=[attn_weights, decoder_out]
)
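
The attention weights returned by such a model have one row per target step and one column per source position. A simple way to read them is to pick, for each target token, the source token with the largest weight. The sketch below does this for a single sentence pair; `hard_alignment` is a hypothetical helper and the weights are made up rather than produced by the model.

```python
def hard_alignment(attention_weights, source_tokens, target_tokens):
    """Pair each target token with the source token that received the most attention."""
    pairs = []
    for target_token, row in zip(target_tokens, attention_weights):
        best = max(range(len(row)), key=row.__getitem__)
        pairs.append((target_token, source_tokens[best]))
    return pairs


source = ["<s>", "the", "session", "</s>"]
target = ["la", "session"]
toy_weights = [
    [0.10, 0.70, 0.10, 0.10],  # "la" attends mostly to "the"
    [0.05, 0.15, 0.70, 0.10],  # "session" attends mostly to "session"
]
alignment = hard_alignment(toy_weights, source, target)
print(alignment)
```

Plotting the full weight matrix as a heatmap gives a softer view of the same information.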