## The dataset


The dataset is the WMT-14 English-German translation data from https://nlp.stanford.edu/projects/nmt/. Over 4.5 million sentence pairs are available. However, I will use only 10,000 pairs to keep the computation feasible.


In [1]:
import os
import random

n_sentences = 10000

# Loading English train sentences
original_en_sentences = []
with open(os.path.join("data", "train_10k.en"), "r", encoding="utf-8") as en_file:
    for i, row in enumerate(en_file):
        if i >= n_sentences:
            break
        original_en_sentences.append(row.strip().split(" "))

# Loading German train sentences
original_de_sentences = []
with open(os.path.join("data", "train_10k.de"), "r", encoding="utf-8") as de_file:
    for i, row in enumerate(de_file):
        if i >= n_sentences:
            break
        original_de_sentences.append(row.strip().split(" "))

# Loading English test sentences
oritinal_en_test_sentences = []

with open(os.path.join("data", "test_100.en"), "r", encoding="utf-8") as en_file:
    for i, row in enumerate(en_file):
        if i >= n_sentences:
            break
        oritinal_en_test_sentences.append(row.strip().split(" "))

# Loading German test sentences
oritinal_de_test_sentences = []
with open(os.path.join("data", "test_100.de"), "r", encoding="utf-8") as de_file:
    for i, row in enumerate(de_file):
        if i >= n_sentences:
            break
        oritinal_de_test_sentences.append(row.strip().split(" "))

# Displaying random sentences and their respective translations
for i in range(3):
    # randrange excludes the upper bound, so the index is always valid
    index = random.randrange(len(original_en_sentences))
    print("English: ", " ".join(original_en_sentences[index]))
    print("German: ", " ".join(original_de_sentences[index]), "\n")

English:  The guests of the area are invited to stop at City Apart Hotel .
German:  Die Gäste des Bereichs werden eingeladen , am City Apart Hotel zu stoppen . 

English:  The warehouse in the beginning is used for the storage of the spezie and other articles of commerce of the Far East .
German:  Das Lager am Anfang wird für die Lagerung des spezie und anderer Artikel des Handels vom Fernen Osten benutzt . 

English:  You can chose from a wide selection of over 60 of Casino Tropez � s most popular games , and play either for real money or in practice mode .
German:  Sie k � nnen sich zwischen 60 verschiedenen Casino Tropez Spielen entscheiden und sogar um echtes Geld spielen . 
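
The four loading loops above share the same pattern, so they could be folded into a single helper. The following is a minimal sketch rather than the notebook's actual code: `load_sentences` is a hypothetical name introduced here, and the demo reads a throwaway temporary file instead of the real files under `data`.

```python
import os
import tempfile
from itertools import islice


def load_sentences(path, max_sentences=None):
    """Read one sentence per line and split it into tokens on single spaces.

    Hypothetical helper, not part of the original notebook.
    """
    with open(path, "r", encoding="utf-8") as f:
        # islice stops after max_sentences lines (None means read every line)
        return [row.strip().split(" ") for row in islice(f, max_sentences)]


# Tiny demonstration on a temporary file standing in for a real data file
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo.en")
    with open(demo_path, "w", encoding="utf-8") as f:
        f.write("Hello world .\nHow are you ?\nFine .\n")
    demo_sentences = load_sentences(demo_path, max_sentences=2)

print(demo_sentences)
```

With a helper like this, each of the four lists would be loaded with one call, for example `load_sentences(os.path.join("data", "train_10k.en"), n_sentences)`.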



# Adding special tokens

#### I will add `<s>` to mark the start of a sentence and `</s>` to mark the end of a sentence

This way, prediction can run for an arbitrary number of time steps. Using `<s>` as the starting token signals to the decoder that it should start predicting tokens from the target language.

If the `</s>` token is not used to mark the end of a sentence, the decoder has no signal to stop. This can lead the model into an infinite loop of predictions.
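
To make the role of the two tokens concrete, here is a minimal sketch of a greedy decoding loop. It is an illustration under stated assumptions, not the model built in this notebook: `greedy_decode`, `predict_next`, and `max_steps` are hypothetical names, and the "decoder" is a toy function that replays a fixed sentence. The `max_steps` cap is an extra safeguard assumed here for the case where the end token never appears.

```python
def greedy_decode(predict_next, max_steps=50):
    """Generate tokens until "</s>" is predicted or max_steps is reached."""
    tokens = ["<s>"]  # the start token tells the decoder to begin predicting
    for _ in range(max_steps):
        next_token = predict_next(tokens)
        tokens.append(next_token)
        if next_token == "</s>":  # the end token is the stop signal
            break
    return tokens


# Toy stand-in for a trained decoder: replays a fixed German sentence.
canned = ["Guten", "Morgen", ".", "</s>"]


def toy_predict_next(tokens):
    # tokens always starts with "<s>", so the next canned token is at len - 1
    return canned[len(tokens) - 1]


decoded = greedy_decode(toy_predict_next)
print(" ".join(decoded))

# A decoder that never emits "</s>" only stops because of the step limit.
never_stops = greedy_decode(lambda tokens: "und", max_steps=5)
print(len(never_stops))
```

The second call shows the failure mode described above: without `</s>`, only the step limit ends the loop.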


In [2]:
en_sentences = [["<s>"] + sent + ["</s>"] for sent in original_en_sentences]
de_sentences = [["<s>"] + sent + ["</s>"] for sent in original_de_sentences]
test_en_sentences = [["<s>"] + sent + ["</s>"] for sent in oritinal_en_test_sentences]
test_de_sentences = [["<s>"] + sent + ["</s>"] for sent in oritinal_de_test_sentences]

for i in range(2):
    index = random.randrange(len(en_sentences))
    print("English: ", " ".join(en_sentences[index]))
    print("German: ", " ".join(de_sentences[index]), "\n")

print("English Test: ", " ".join(test_en_sentences[0]))
print("German Test: ", " ".join(test_de_sentences[0]))

English:  <s> In these portraits , Velázquez has well repaid the debt of gratitude that he owed to his first patron , whom Velázquez stood by during Olivares &apos;s fall from power , thus exposing himself to the great risk of the anger of the jealous Philip . The king , however , showed no sign of malice towards his favorite painter . </s>
German:  <s> Während dieser Reise muss Velázquez Details der Übergabe von Breda von dem damaligen Kommandanten des spanischen Heeres gehört und eine erste Porträtstudie für sein späteres Historiengemälde der Übergabe von Breda gemacht haben . </s> 

English:  <s> � � El Gordo � works differently to most lottery games played , as countless people can share one single ticket . </s>
German:  <s> &quot; El Gordo &quot; verl � uft anders als die meisten Lotterie Spiele . </s> 

English Test:  <s> Orlando Bloom and Miranda Kerr still love each other </s>
German Test:  <s> Orlando Bloom und Miranda Kerr lieben sich noch immer </s>


# Splitting the training and validation datasets

#### 90% training and 10% validation


In [3]:
from sklearn.model_selection import train_test_split
import numpy as np

train_en_sentences, valid_en_sentences, train_de_sentences, valid_de_sentences = (
    train_test_split(en_sentences, de_sentences, test_size=0.1)
)

print(train_en_sentences[1])
print(train_de_sentences[1])

['<s>', 'While', 'access', 'to', 'software', 'determines', 'our', 'ability', 'to', 'participate', 'in', 'a', 'digital', 'society', 'and', 'governs', 'our', 'ability', 'for', 'communication', ',', 'education', 'and', 'work', ',', 'software', 'itself', 'represents', 'a', 'reservoir', 'of', 'codified', 'skill', '.', '</s>']
['<s>', 'Während', 'der', 'Zugang', 'zu', 'Software', 'über', 'unsere', 'Fähigkeit', 'entscheidet', ',', 'an', 'der', 'digitalen', 'Gesellschaft', 'teilhaben', 'zu', 'können', 'und', 'unsere', 'Möglichkeiten', 'bei', 'Kommunikation', ',', 'Bildung', 'und', 'Beruf', 'entscheidend', 'bestimmt', ',', 'stellt', 'Software', 'gleichzeitig', 'auch', 'ein', 'Reservoir', 'von', 'in', 'Programmcode', 'gegossenen', 'Fertigkeiten', 'dar', '.', '</s>']
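
The printed pair shows that the English and German sentences stay aligned after the split: `train_test_split` applies the same shuffle to every list it receives. The sketch below illustrates that idea using only the standard library. It is not how scikit-learn is implemented, and `paired_split`, `valid_fraction`, and the seed value are assumptions introduced for illustration.

```python
import random


def paired_split(source, target, valid_fraction=0.1, seed=0):
    """Shuffle two parallel lists with one permutation, then split them."""
    if len(source) != len(target):
        raise ValueError("source and target must have the same length")
    indices = list(range(len(source)))
    random.Random(seed).shuffle(indices)  # fixed seed gives a repeatable split
    n_valid = round(len(indices) * valid_fraction)
    valid_idx, train_idx = indices[:n_valid], indices[n_valid:]

    def pick(data, idx):
        return [data[i] for i in idx]

    return (pick(source, train_idx), pick(source, valid_idx),
            pick(target, train_idx), pick(target, valid_idx))


# Toy parallel data: sentence i in one list corresponds to sentence i in the other
toy_en = [["en", str(i)] for i in range(20)]
toy_de = [["de", str(i)] for i in range(20)]

tr_en, va_en, tr_de, va_de = paired_split(toy_en, toy_de)
print(len(tr_en), len(va_en))
```

With scikit-learn itself, passing `random_state` to `train_test_split` gives the same kind of repeatable split.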


### Defining sequence lengths for the two languages


In [4]:
import pandas as pd

# Getting some basic statistics from the data

# Convert train_en_sentences to a pandas Series and summarize the
# sentence lengths (in tokens, including <s> and </s>)
pd.Series(train_en_sentences).str.len().describe(percentiles=[0.05, 0.5, 0.95])

count    9000.000000
mean       27.364222
std        14.288396
min         8.000000
5%         11.000000
50%        24.000000
95%        56.000000
max       102.000000
dtype: float64
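
One common way to use these statistics is to pick a fixed sequence length near a high percentile (for English, the 95% value above is 56 tokens), then pad shorter sentences and truncate longer ones. The sketch below assumes that approach; the output above does not say which lengths are actually chosen. `pad_or_truncate`, the `<pad>` token, and the toy sentences are hypothetical.

```python
def pad_or_truncate(tokens, length, pad_token="<pad>"):
    """Return exactly `length` tokens: cut long sentences, pad short ones."""
    clipped = tokens[:length]  # note: plain truncation can drop the final "</s>"
    return clipped + [pad_token] * (length - len(clipped))


toy_sentences = [
    ["<s>", "Hello", "</s>"],
    ["<s>", "How", "are", "you", "?", "</s>"],
]

# A length of 4 is used here only to keep the example small
fixed = [pad_or_truncate(sentence, 4) for sentence in toy_sentences]
for sentence in fixed:
    print(sentence)
```

Every sentence now has the same number of tokens, which is what batching into fixed-shape arrays requires.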