# The data

https://www.kaggle.com/datasets/mohamedlotfy50/wmt-2014-english-french/data
There are over 4.5 million sentence pairs available. However, I will only use 25,000 pairs to keep the computation feasible.


In [5]:
import pandas as pd
import numpy as np

n_sentences = 25000

data = pd.read_csv("./data/en-fr/wmt14_translate_fr-en_train.csv", nrows=n_sentences)

data.head()

Unnamed: 0,en,fr
0,Resumption of the session,Reprise de la session
1,I declare resumed the session of the European ...,Je déclare reprise la session du Parlement eur...
2,"Although, as you will have seen, the dreaded '...","Comme vous avez pu le constater, le grand ""bog..."
3,You have requested a debate on this subject in...,Vous avez souhaité un débat à ce sujet dans le...
4,"In the meantime, I should like to observe a mi...","En attendant, je souhaiterais, comme un certai..."


# Splitting the sentences into tokens


In [18]:
import random

original_en_sentences = [sent.strip().split(" ") for sent in data["en"]]
original_fr_sentences = [sent.strip().split(" ") for sent in data["fr"]]


for i in range(3):
    index = random.randint(0, 10000)
    print("English: ", " ".join(original_en_sentences[index]))
    print("French: ", " ".join(original_fr_sentences[index]), "\n")

English:  How is it possible not to be outraged when a bank refuses to divulge the names of the Erika' s owners on the grounds of professional confidentiality, without governments reacting?
French:  Comment ne pas être indigné lorsqu'une banque refuse de révéler le nom des propriétaires de l'Erika, invoquant le secret bancaire, sans que les gouvernements réagissent ? 

English:  Joint motion for a resolution on the negotiations to form a government in Austria
French:  Proposition de résolution commune sur les négociations gouvernementales en Autriche 

English:  Madam President, I would like to thank Mr Poettering for advertising this debate.
French:  Madame la Présidente, je voudrais remercier M. Poettering pour le coup de publicité qu' il vient de donner à ce débat. 
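
One detail worth knowing about `split(" ")`: it splits on every single space, so two consecutive spaces produce an empty-string token, whereas `split()` with no argument collapses any run of whitespace. A minimal illustration on a made-up string (the sample is an assumption, not a row from the dataset):

```python
# Hypothetical sample with a double space; not taken from the dataset.
sample = "Resumption of  the session "

# Same call pattern as in the cell above: strip, then split on a single space.
tokens_single_space = sample.strip().split(" ")

# Alternative: split on any run of whitespace.
tokens_any_whitespace = sample.split()

print(tokens_single_space)
print(tokens_any_whitespace)
```

If empty tokens turn out to matter for the vocabulary, switching to `split()` is one possible remedy; whether the dataset actually contains such double spaces has not been checked here.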



# Adding special tokens

#### I will add `<s>` to mark the start of a sentence and `</s>` to mark the end of a sentence

This way, prediction can run for an arbitrary number of time steps. Using `<s>` as the starting token signals to the decoder that it should start predicting tokens from the target language.

If the `</s>` token is not used to mark the end of a sentence, the decoder has no way to signal that the sentence is finished. This can lead the model to enter an infinite loop of predictions.
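
To make the role of the two markers concrete, here is a minimal sketch of a greedy decoding loop. It is not the model built in this notebook: `greedy_decode`, `toy_next_token` and the `max_steps` cap are hypothetical names introduced for illustration, and the "decoder" simply replays a fixed sentence.

```python
from typing import Callable, List

START, END = "<s>", "</s>"


def greedy_decode(next_token: Callable[[List[str]], str], max_steps: int = 60) -> List[str]:
    """Generate tokens until </s> is produced or max_steps is reached."""
    output = [START]  # <s> tells the decoder where to begin
    for _ in range(max_steps):
        token = next_token(output)
        output.append(token)
        if token == END:  # </s> is the only natural stopping signal
            break
    return output


# Toy stand-in for a trained decoder: replays a fixed French sentence.
_canned = ["Reprise", "de", "la", "session", END]


def toy_next_token(prefix: List[str]) -> str:
    # prefix[0] is <s>, so len(prefix) - 1 tokens have been generated so far.
    return _canned[min(len(prefix) - 1, len(_canned) - 1)]


decoded = greedy_decode(toy_next_token)
print(" ".join(decoded))

# A decoder that never emits </s> only stops because of the max_steps cap.
capped = greedy_decode(lambda prefix: "de", max_steps=5)
print(len(capped))
```

The second call shows why the end marker matters: without it, the loop is bounded only by the hard step limit.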


In [20]:
en_sentences = [["<s>"] + sent + ["</s>"] for sent in original_en_sentences]
fr_sentences = [["<s>"] + sent + ["</s>"] for sent in original_fr_sentences]

for i in range(2):
    index = random.randint(0, 10000)
    print("English: ", " ".join(en_sentences[index]))
    print("French: ", " ".join(fr_sentences[index]), "\n")

English:  <s> I have already said that the authority will be charged with developing networks with national food safety agencies and bodies in the Member States. </s>
French:  <s> J'ai déjà déclaré que cette autorité sera chargée de développer des réseaux avec les agences et organes nationaux de sécurité alimentaire au sein des États membres. </s> 

English:  <s> I propose that we start the debate immediately. </s>
French:  <s> Je vous propose de commencer le débat tout de suite. </s> 



# Splitting into training, validation and test sets

#### 80% training, 10% validation and 10% for testing


In [41]:
from sklearn.model_selection import train_test_split
import numpy as np

(
    train_en_sentences,
    valid_test_en_sentences,
    train_fr_sentences,
    valid_test_fr_sentences,
) = train_test_split(en_sentences, fr_sentences, test_size=0.2)


(valid_en_sentences, test_en_sentences, valid_fr_sentences, test_fr_sentences) = (
    train_test_split(valid_test_en_sentences, valid_test_fr_sentences, test_size=0.5)
)


print(train_en_sentences[1])
print(train_fr_sentences[1])
print("\n")
print(test_en_sentences[0])
print(test_fr_sentences[0])

['<s>', 'Mr', 'President,', 'since', 'the', 'signature', 'of', 'the', 'Treaty', 'of', 'Amsterdam,', 'the', 'European', 'Union', 'has', 'had', 'responsibility', 'for', 'combating', 'racism', 'and', 'xenophobia.', '</s>']
['<s>', 'Monsieur', 'le', 'Président,', "l'", 'Union', 'européenne,', 'depuis', 'la', 'signature', 'du', 'traité', "d'", 'Amsterdam,', 'possède', 'des', 'compétences', 'dans', 'la', 'lutte', 'contre', 'le', 'racisme', 'et', 'la', 'xénophobie.', '</s>']


['<s>', 'This', 'country', 'has', 'a', 'very', 'small', 'market', 'which,', 'unfortunately,', 'is', 'often', 'at', 'the', 'same', 'time', 'defined', 'as', 'the', 'relevant', 'market.', 'This', 'in', 'contrast', 'to', 'Germany,', 'where', 'a', 'very', 'experienced', 'Kartellamt', 'is', 'exercising', 'its', 'powers', 'within', 'a', 'gigantic', 'market.', '</s>']
['<s>', 'Le', 'pays', 'ne', 'représente', "qu'", 'un', 'tout', 'petit', 'marché', 'que', "l'", 'on', 'qualifie', 'cependant', 'souvent', 'et', 'malheureusement', 
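
The 80/10/10 proportions come from splitting twice: first hold out 20%, then cut that held-out part in half. A minimal standard-library sketch of the same arithmetic, with a fixed seed so the split is repeatable (`split_80_10_10` and the placeholder pairs are hypothetical, not part of the notebook):

```python
import random


def split_80_10_10(pairs, seed=0):
    """Shuffle pairs and split them 80/10/10 in two steps."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)

    n_holdout = len(shuffled) * 20 // 100  # step 1: hold out 20%
    holdout, train = shuffled[:n_holdout], shuffled[n_holdout:]

    n_valid = len(holdout) // 2            # step 2: halve the held-out part
    return train, holdout[:n_valid], holdout[n_valid:]


# Placeholder pairs standing in for (English, French) sentences.
pairs = [(f"en_{i}", f"fr_{i}") for i in range(25000)]
train, valid, test = split_80_10_10(pairs)
print(len(train), len(valid), len(test))
```

With scikit-learn, passing `random_state` to each `train_test_split` call gives the same kind of reproducibility.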

### Defining sequence lengths for the two languages


In [42]:
# Getting some basic statistics from the data

# convert train_en_sentences to a pandas series
pd.Series(train_en_sentences).str.len().describe(percentiles=[0.05, 0.5, 0.95])

count    20000.000000
mean        27.609650
std         15.676941
min          3.000000
5%           8.000000
50%         25.000000
95%         57.000000
max        150.000000
dtype: float64

In [43]:
pd.Series(train_fr_sentences).str.len().describe(percentiles=[0.05, 0.5, 0.95])

count    20000.000000
mean        28.920400
std         16.709905
min          3.000000
5%           9.000000
50%         26.000000
95%         60.000000
max        154.000000
dtype: float64

# From the training data statistics above, 95% of the English sentences have at most 57 tokens, while 95% of the French sentences have at most 60 tokens (both counts include the `<s>` and `</s>` markers)
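
One way to sanity-check a candidate sequence length is to measure how many sentences fit within it without truncation. A minimal sketch, assuming a hypothetical helper `coverage_at` and made-up token counts rather than the real training set:

```python
def coverage_at(lengths, max_len):
    """Fraction of sequences whose length is <= max_len (i.e. not truncated)."""
    return sum(length <= max_len for length in lengths) / len(lengths)


# Hypothetical token counts standing in for len(sentence) over the training set.
toy_lengths = [3, 8, 12, 25, 25, 31, 40, 57, 60, 150]

for max_len in (57, 60):
    print(max_len, coverage_at(toy_lengths, max_len))
```

On the notebook's data the equivalent call would be `coverage_at([len(s) for s in train_en_sentences], 60)`.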


### Padding the sentences with pad_sequences from Keras


In [47]:
from tensorflow.keras.preprocessing.sequence import pad_sequences

n_en_seq_length = 60
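n_de_seq_length = 60  # length for the French sequences ("de" in the name is a leftover; the target language here is French)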
unk_token = "<unk>"

train_en_sentences_padded = pad_sequences(
    train_en_sentences,
    maxlen=n_en_seq_length,
    # value=unk_token,
    dtype=object,
    truncating="post",
    padding="post",
)

valid_en_sentences_padded = pad_sequences(
    valid_en_sentences,
    maxlen=n_en_seq_length,
    # value=unk_token,
    dtype=object,
    truncating="post",
    padding="post",
)

test_en_sentences_padded = pad_sequences(
    test_en_sentences,
    maxlen=n_en_seq_length,
    # value=unk_token,
    dtype=object,
    truncating="post",
    padding="post",
)


train_fr_sentences_padded = pad_sequences(
    train_fr_sentences,
    maxlen=n_de_seq_length,
    # value=unk_token,
    dtype=object,
    truncating="post",
    padding="post",
)

valid_fr_sentences_padded = pad_sequences(
    valid_fr_sentences,
    maxlen=n_de_seq_length,
    # value=unk_token,
    dtype=object,
    truncating="post",
    padding="post",
)

test_fr_sentences_padded = pad_sequences(
    test_fr_sentences,
    maxlen=n_de_seq_length,
    # value=unk_token,
    dtype=object,
    truncating="post",
    padding="post",
)

print(train_en_sentences_padded[1])

['<s>' 'Mr' 'President,' 'since' 'the' 'signature' 'of' 'the' 'Treaty'
 'of' 'Amsterdam,' 'the' 'European' 'Union' 'has' 'had' 'responsibility'
 'for' 'combating' 'racism' 'and' 'xenophobia.' '</s>' 0.0 0.0 0.0 0.0 0.0
 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0]
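
The padded positions above are the float `0.0` because `pad_sequences` pads with `value=0.0` by default and the `value` argument is commented out. If a string padding token is preferred, the same post-truncation and post-padding can be sketched in plain Python. The `"<pad>"` token and the `pad_tokens` helper are assumptions for illustration, not something the notebook defines:

```python
PAD = "<pad>"  # assumed padding token; the notebook itself pads with 0.0


def pad_tokens(tokens, max_len, pad_token=PAD):
    """Truncate at the end, then right-pad with pad_token up to max_len."""
    clipped = tokens[:max_len]
    return clipped + [pad_token] * (max_len - len(clipped))


short = ["<s>", "Resumption", "of", "the", "session", "</s>"]

padded = pad_tokens(short, 8)      # shorter than max_len: padded on the right
truncated = pad_tokens(short, 4)   # longer than max_len: cut at the end

print(padded)
print(truncated)
```

Note that truncating at the end drops the closing `</s>` from sentences longer than the limit, which is also what `truncating="post"` does.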
