## The dataset


The dataset is the WMT-14 English-German translation data from https://nlp.stanford.edu/projects/nmt/. More than 4.5 million sentence pairs are available, but I will use only 10k pairs to keep the computation feasible.


In [1]:
import os
import random

n_sentences = 10000

# Loading English train sentences
original_en_sentences = []
with open(os.path.join("data", "train_10k.en"), "r", encoding="utf-8") as en_file:
    for i, row in enumerate(en_file):
        if i >= n_sentences:
            break
        original_en_sentences.append(row.strip().split(" "))

# Loading German train sentences
original_de_sentences = []
with open(os.path.join("data", "train_10k.de"), "r", encoding="utf-8") as de_file:
    for i, row in enumerate(de_file):
        if i >= n_sentences:
            break
        original_de_sentences.append(row.strip().split(" "))

# Loading English test sentences
oritinal_en_test_sentences = []

with open(os.path.join("data", "test_100.en"), "r", encoding="utf-8") as en_file:
    for i, row in enumerate(en_file):
        if i >= n_sentences:
            break
        oritinal_en_test_sentences.append(row.strip().split(" "))

# Loading German test sentences
oritinal_de_test_sentences = []
with open(os.path.join("data", "test_100.de"), "r", encoding="utf-8") as de_file:
    for i, row in enumerate(de_file):
        if i >= n_sentences:
            break
        oritinal_de_test_sentences.append(row.strip().split(" "))

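# Displaying random sentences and their respective translations
for i in range(3):
    # randrange excludes the upper bound, so the index is always valid
    # (randint(0, 10000) could return 10000 and raise an IndexError)
    index = random.randrange(len(original_en_sentences))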
    print("English: ", " ".join(original_en_sentences[index]))
    print("German: ", " ".join(original_de_sentences[index]), "\n")

English:  Class EA-JA ; Our wide range of automatic cars include : Renault Clios , Opel Vectras and Ford ...
German:  Klasse EA-JA ; unser breites Sortiment an Autos mit Automatik beinhalten : Renault Clio , Opel ... 

English:  Are you looking for a partner who will guide you , reliably and competently , every step of the way , from your first decision to buy , through the planning stage , to a complete aftersales service ?
German:  Sie wünschen sich einen Partner , der Sie von der ersten Entscheidung über die Planung bis hin zum kompletten Service verlässlich und kompetent begleitet ? 

English:  Aquasphere manufactures and sells a comprehensive line of pool accessories , spare parts for pools , water treatment systems and pool care products .
German:  Aquasphere stellt eine umfangreiche Reihe an Schwimmbeckenzubehör , Ersatzteilen und Wasserpflegesystemen her und vermarktet diese . 
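
The four loading loops in the cell above share the same shape, so they could be folded into one helper. The following is only a sketch: `load_sentences` and its `limit` parameter are hypothetical names that do not appear in the notebook, and the demo writes a throwaway file so that it runs without the real data.

```python
import itertools
import os
import tempfile


def load_sentences(path, limit=None):
    """Read up to `limit` lines and split each one on single spaces."""
    with open(path, "r", encoding="utf-8") as f:
        # islice stops after `limit` lines (or reads everything if limit is None)
        return [row.strip().split(" ") for row in itertools.islice(f, limit)]


# Demo on a temporary file standing in for data/train_10k.en
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo.en")
    with open(demo_path, "w", encoding="utf-8") as f:
        f.write("Hello world .\nGood morning .\nThank you .\n")
    demo = load_sentences(demo_path, limit=2)

print(demo)
```

With such a helper, each of the four lists would be one call, for example `load_sentences(os.path.join("data", "train_10k.en"), n_sentences)`.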



# Adding special tokens

#### I will add `<s>` to mark the start of a sentence and `</s>` to mark the end of a sentence

With these two tokens, prediction can run for an arbitrary number of time steps. Using `<s>` as the starting token signals to the decoder that it should start predicting tokens from the target language.

If the `</s>` token is not used to mark the end of a sentence, the decoder has no signal to stop. This can lead the model into an infinite loop of predictions.
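
To make the role of the two tokens concrete, here is a minimal sketch of a greedy decoding loop. It is not the model used in this notebook: `greedy_decode`, `toy_next_token`, and the `max_steps` guard are hypothetical names, and the "decoder" is a canned list of outputs standing in for a trained network.

```python
def greedy_decode(next_token_fn, max_steps=20):
    """Generate tokens until "</s>" is produced or max_steps is reached."""
    tokens = ["<s>"]  # the start token tells the decoder to begin predicting
    for _ in range(max_steps):
        token = next_token_fn(tokens)
        tokens.append(token)
        if token == "</s>":  # the end token is the only stop signal
            break
    return tokens


# Hypothetical stand-in for a trained decoder: replays a fixed translation
canned_output = ["Hallo", "Welt", "</s>"]


def toy_next_token(prefix):
    # prefix[0] is "<s>", so the next canned token sits at len(prefix) - 1
    return canned_output[len(prefix) - 1]


print(greedy_decode(toy_next_token))
# ['<s>', 'Hallo', 'Welt', '</s>']
```

The `max_steps` cap is a safety net: a decoder that never emits `</s>` would otherwise keep generating indefinitely.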


In [2]:
en_sentences = [["<s>"] + sent + ["</s>"] for sent in original_en_sentences]
de_sentences = [["<s>"] + sent + ["</s>"] for sent in original_de_sentences]
test_en_sentences = [["<s>"] + sent + ["</s>"] for sent in oritinal_en_test_sentences]
test_de_sentences = [["<s>"] + sent + ["</s>"] for sent in oritinal_de_test_sentences]

for i in range(2):
    index = random.randrange(len(en_sentences))  # upper bound excluded, avoids IndexError
    print("English: ", " ".join(en_sentences[index]))
    print("German: ", " ".join(de_sentences[index]), "\n")

print("English Test: ", " ".join(test_en_sentences[0]))
print("German Test: ", " ".join(test_de_sentences[0]))

English:  <s> The NH Giustiniano is set in an impressive building housing a range of elegant and spacious rooms and luxurious suites , as well as 8 meeting rooms with a capacity of 180 people . </s>
German:  <s> Ein beeindruckendes Gebäude beherbergt das NH Giustiniano mit seinen eleganten , geräumigen Zimmern sowie den luxuriösen Suiten . Darüber hinaus stehen Ihnen 8 Tagungsräume für bis zu 180 Personen zur Verfügung . </s> 

English:  <s> This section holds the most general questions about PHP : what it is and what it does . </s>
German:  <s> Dieses Kapitel beinhaltet allgemeine Fragen zu PHP : Was es ist und was es tut . </s> 

English Test:  <s> Orlando Bloom and Miranda Kerr still love each other </s>
German Test:  <s> Orlando Bloom und Miranda Kerr lieben sich noch immer </s>


# Splitting into training and validation datasets

#### 90% training and 10% validation


In [3]:
from sklearn.model_selection import train_test_split
import numpy as np

train_en_sentences, valid_en_sentences, train_de_sentences, valid_de_sentences = (
    train_test_split(en_sentences, de_sentences, test_size=0.1)
)

print(train_en_sentences[1])
print(train_de_sentences[1])

['<s>', 'Explore', 'new', 'NI', 'technologies', 'ranging', 'from', 'a', 'portable', 'handheld', 'sound', 'and', 'vibration', 'analyzer', 'to', 'high-channel', 'microphone', 'arrays', 'and', 'embedded', 'systems', '.', '</s>']
['<s>', 'Die', 'zunehmende', 'Flut', 'an', 'Mess-', 'und', 'Simulationsdaten', 'stellt', 'viele', 'Abteilungen', 'und', 'Unternehmen', 'vor', 'eine', 'Herausforderung', '.', '</s>']
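
Passing both lists to `train_test_split` in one call matters because the same shuffle is applied to each, so every English sentence stays paired with its German translation. The sketch below illustrates that idea in plain Python. It is not scikit-learn's implementation, and `split_pairs`, `valid_fraction`, and `seed` are hypothetical names.

```python
import random


def split_pairs(src, tgt, valid_fraction=0.1, seed=0):
    """Shuffle aligned sentence pairs together, then split into train/validation."""
    assert len(src) == len(tgt), "source and target must be aligned"
    indices = list(range(len(src)))
    random.Random(seed).shuffle(indices)  # one permutation shared by both lists
    n_valid = int(len(indices) * valid_fraction)
    valid_idx, train_idx = indices[:n_valid], indices[n_valid:]

    def pick(seq, idx):
        return [seq[i] for i in idx]

    # Same return order as train_test_split(a, b): a_train, a_valid, b_train, b_valid
    return pick(src, train_idx), pick(src, valid_idx), pick(tgt, train_idx), pick(tgt, valid_idx)


toy_en = [f"en{i}" for i in range(10)]
toy_de = [f"de{i}" for i in range(10)]
tr_en, va_en, tr_de, va_de = split_pairs(toy_en, toy_de)
print(len(tr_en), len(va_en))  # 9 1
```

The fixed `seed` makes the split repeatable; `train_test_split` offers the same through its `random_state` argument, which the cell above leaves unset, so the split there changes from run to run.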


### Defining sequence lengths for the two languages


In [4]:
import pandas as pd

# Getting some basic statistics from the data

# convert train_en_sentences to a pandas series
pd.Series(train_en_sentences).str.len().describe(percentiles=[0.05, 0.5, 0.95])

count    9000.000000
mean       27.380889
std        14.290006
min         8.000000
5%         11.000000
50%        24.000000
95%        56.000000
max       102.000000
dtype: float64

The statistics above show that 5% of English sentences have 11 tokens or fewer, 50% have 24 or fewer, and 95% have 56 or fewer. These counts include the `<s>` and `</s>` tokens.


In [5]:
pd.Series(train_de_sentences).str.len().describe(percentiles=[0.05, 0.5, 0.95])

count    9000.000000
mean       24.822222
std        12.911805
min         8.000000
5%         11.000000
50%        22.000000
95%        50.000000
max       102.000000
dtype: float64

The statistics above show that 5% of German sentences have 11 tokens or fewer, 50% have 22 or fewer, and 95% have 50 or fewer.

The minimum and maximum sentence lengths are 8 and 102 tokens, respectively, in both languages. However, this will not always be the case.
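
The percentiles are useful mainly for picking a fixed sequence length: a candidate length can be judged by the fraction of sentences that fit within it without truncation. A small sketch of that check follows; `coverage` is a hypothetical helper and the toy sentences are made up for illustration.

```python
def coverage(sentences, max_len):
    """Fraction of sentences whose token count is at most max_len."""
    return sum(len(sent) <= max_len for sent in sentences) / len(sentences)


# Toy data: token lists of length 3, 4, and 6 (special tokens included)
toy_sentences = [
    ["<s>", "a", "</s>"],
    ["<s>", "a", "b", "</s>"],
    ["<s>", "a", "b", "c", "d", "</s>"],
]

print(coverage(toy_sentences, 4))  # two of the three sentences fit
```

On the real data, `coverage(train_en_sentences, 56)` should come out at roughly 0.95, in line with the 95th percentile reported above.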


### Padding the sentences with `pad_sequences` from Keras


In [16]:
from tensorflow.keras.preprocessing.sequence import pad_sequences

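# Fixed lengths: shorter sentences are padded, longer ones are cut to this many tokens.
# pad_sequences pads at the front by default (padding="pre"); with `value` left
# commented out below, the fill value is the default 0.0 rather than a token string.
n_en_seq_length = 50
n_de_seq_length = 50
unk_token = "<unk>"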

train_en_sentences_padded = pad_sequences(
    train_en_sentences,
    maxlen=n_en_seq_length,
    # value=unk_token,
    dtype=object,
    truncating="post",
)

valid_en_sentences = pad_sequences(
    valid_en_sentences,
    maxlen=n_en_seq_length,
    # value=unk_token,
    dtype=object,
    truncating="post",
)

test_en_sentences = pad_sequences(
    test_en_sentences,
    maxlen=n_en_seq_length,
    # value=unk_token,
    dtype=object,
    truncating="post",
)


train_de_sentences_padded = pad_sequences(
    train_de_sentences,
    maxlen=n_de_seq_length,
    # value=unk_token,
    dtype=object,
    truncating="post",
)

valid_de_sentences = pad_sequences(
    valid_de_sentences,
    maxlen=n_de_seq_length,
    # value=unk_token,
    dtype=object,
    truncating="post",
)

test_de_sentences = pad_sequences(
    test_de_sentences,
    maxlen=n_de_seq_length,
    # value=unk_token,
    dtype=object,
    truncating="post",
)
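
To show what these calls do to a single sentence, here is a plain-Python sketch of front padding combined with `truncating="post"`. It is an illustration rather than the Keras implementation: `pad_or_truncate` is a hypothetical helper, and the `"<pad>"` fill token is an assumption (the cell above leaves `value` at its default).

```python
def pad_or_truncate(tokens, maxlen, pad_token="<pad>"):
    """Left-pad to maxlen, or keep only the first maxlen tokens if too long."""
    if len(tokens) >= maxlen:
        return tokens[:maxlen]  # "post" truncation drops tokens from the end
    return [pad_token] * (maxlen - len(tokens)) + tokens  # "pre" padding


print(pad_or_truncate(["<s>", "Hallo", "</s>"], 5))
# ['<pad>', '<pad>', '<s>', 'Hallo', '</s>']

print(pad_or_truncate(["<s>", "a", "b", "c", "</s>"], 3))
# ['<s>', 'a', 'b']
```

The second call shows a side effect worth keeping in mind: truncating at the end removes the `</s>` token from any sentence longer than the chosen length.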