## The dataset


The dataset is the WMT-14 English-German translation data from https://nlp.stanford.edu/projects/nmt/. More than 4.5 million sentence pairs are available, but I will use only 10k pairs to keep the computation feasible.


In [1]:
import os
import random

n_sentences = 10000

# Loading English train sentences
original_en_sentences = []
with open(
    os.path.join("./data/en-de", "train_10k.en"), "r", encoding="utf-8"
) as en_file:
    for i, row in enumerate(en_file):
        # if i >= n_sentences:
        #     break
        original_en_sentences.append(row.strip().split(" "))

# loading German train sentences
original_de_sentences = []
with open(
    os.path.join("./data/en-de", "train_10k.de"), "r", encoding="utf-8"
) as de_file:
    for i, row in enumerate(de_file):
        # if i >= n_sentences:
        #     break
        original_de_sentences.append(row.strip().split(" "))

# Loading English test sentences
oritinal_en_test_sentences = []

with open(
    os.path.join("./data/en-de", "test_100.en"), "r", encoding="utf-8"
) as en_file:
    for i, row in enumerate(en_file):
        # if i >= n_sentences:
        #     break
        oritinal_en_test_sentences.append(row.strip().split(" "))

# Loading German test sentences
oritinal_de_test_sentences = []
with open(
    os.path.join("./data/en-de", "test_100.de"), "r", encoding="utf-8"
) as de_file:
    for i, row in enumerate(de_file):
        # if i >= n_sentences:
        #     break
        oritinal_de_test_sentences.append(row.strip().split(" "))

# Displaying random sentences and their respective translations
for i in range(3):
    # randint(0, 10000) includes 10000, which is out of range for 10,000 sentences
    index = random.randrange(len(original_en_sentences))
    print("English: ", " ".join(original_en_sentences[index]))
    print("German: ", " ".join(original_de_sentences[index]), "\n")

English:  The address of the hotel is Dusseldorf 40211 , Kloster Strasse 53 .
German:  Die Adresse des Hotels ist Dusseldorf 40211 , Kloster Strasse 53. 

English:  We observe the presentation and development of the Newcomer .
German:  Dabei beobachten wir die Präsentation und Entwicklung der Newcomer . 

English:  The LateRooms rates for Days Inn Bristol M5 in Bristol are the total price of the room and not the &apos; per person &apos; rate .
German:  Die bei LateRooms für das Days Inn Bristol M5 in Bristol angegebenen Preise verstehen sich als Komplettpreise pro Zimmer , NICHT “ pro Person ” . 



In [2]:
original_de_sentences

[['iron',
  'cement',
  'ist',
  'eine',
  'gebrauchs-fertige',
  'Paste',
  ',',
  'die',
  'mit',
  'einem',
  'Spachtel',
  'oder',
  'den',
  'Fingern',
  'als',
  'Hohlkehle',
  'in',
  'die',
  'Formecken',
  '(',
  'Winkel',
  ')',
  'der',
  'Stahlguss',
  '-Kokille',
  'aufgetragen',
  'wird',
  '.'],
 ['Nach',
  'der',
  'Aushärtung',
  'schützt',
  'iron',
  'cement',
  'die',
  'Kokille',
  'gegen',
  'den',
  'heissen',
  ',',
  'abrasiven',
  'Stahlguss',
  '.'],
 ['feuerfester',
  'Reparaturkitt',
  'für',
  'Feuerungsanlagen',
  ',',
  'Öfen',
  ',',
  'offene',
  'Feuerstellen',
  'etc.'],
 ['Der', 'Bau', 'und', 'die', 'Reparatur', 'der', 'Autostraßen', '...'],
 ['die',
  'Mitteilungen',
  'sollen',
  'den',
  'geschäftlichen',
  'kommerziellen',
  'Charakter',
  'tragen',
  '.'],
 ['der',
  'Vertrieb',
  'Ihrer',
  'Waren',
  'und',
  'Dienstleistungen',
  'durch',
  'das',
  'Postfach-System',
  'WIRD',
  'NICHT',
  'ZUGELASSEN',
  '.'],
 ['die',
  'Werbeversande',
 

# Adding special tokens

#### I will add `<s>` to mark the start of a sentence and `</s>` to mark the end of a sentence

This way, prediction can run for an arbitrary number of time steps. Using `<s>` as the starting token gives a
way to signal to the decoder that it should start predicting tokens from the target language.

If the `</s>` token is not used to mark the end of a sentence, the decoder has no signal for when to
stop. This can lead the model into an endless loop of predictions.
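
To make the role of the two tokens concrete, here is a minimal sketch of a greedy decoding loop. The `TOY_NEXT_TOKEN` table and the `greedy_decode` function are hypothetical stand-ins for a trained decoder, not the model defined later in this notebook. Generation starts from `<s>` and stops as soon as `</s>` is produced, with `max_steps` acting as a safety cap.

```python
# Toy stand-in for a trained decoder: maps the previous token to the next one.
# The table is made up for illustration; a real model would return a
# probability distribution over the vocabulary instead.
TOY_NEXT_TOKEN = {
    "<s>": "Guten",
    "Guten": "Morgen",
    "Morgen": "</s>",
}


def greedy_decode(next_token_fn, max_steps=20):
    """Generate tokens until </s> is produced or max_steps is reached."""
    tokens = ["<s>"]  # <s> tells the decoder to start predicting
    for _ in range(max_steps):  # hard cap in case </s> never appears
        next_token = next_token_fn(tokens[-1])
        tokens.append(next_token)
        if next_token == "</s>":  # </s> is the signal to stop
            break
    return tokens


decoded = greedy_decode(TOY_NEXT_TOKEN.get)
print(decoded)

# A "decoder" that never emits </s> only stops because of the cap.
looping = greedy_decode(lambda previous: "und", max_steps=5)
print(looping)
```

The second call shows why the end token matters: without it, the only thing that ends generation is the step limit.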


In [3]:
en_sentences = [["<s>"] + sent + ["</s>"] for sent in original_en_sentences]
de_sentences = [["<s>"] + sent + ["</s>"] for sent in original_de_sentences]
test_en_sentences = [["<s>"] + sent + ["</s>"] for sent in oritinal_en_test_sentences]
test_de_sentences = [["<s>"] + sent + ["</s>"] for sent in oritinal_de_test_sentences]

for i in range(2):
    index = random.randrange(len(en_sentences))  # upper bound excluded, so the index is always valid
    print("English: ", " ".join(en_sentences[index]))
    print("German: ", " ".join(de_sentences[index]), "\n")

print("English Test: ", " ".join(test_en_sentences[0]))
print("German Test: ", " ".join(test_de_sentences[0]))

English:  <s> Unnoticed , it has taken centre stage in the public sphere . </s>
German:  <s> Er hat unbemerkt die vordere Bühne der Öffentlichkeit eingenommen . </s> 

English:  <s> How do I know it is safe ? </s>
German:  <s> Woran sehe ich , dass meine Daten geschützt sind ? </s> 

English Test:  <s> Orlando Bloom and Miranda Kerr still love each other </s>
German Test:  <s> Orlando Bloom und Miranda Kerr lieben sich noch immer </s>


# Splitting the data into training and validation sets

#### 90% training and 10% validation


In [4]:
from sklearn.model_selection import train_test_split
import numpy as np

train_en_sentences, valid_en_sentences, train_de_sentences, valid_de_sentences = (
    train_test_split(en_sentences, de_sentences, test_size=0.1)
)

print(train_en_sentences[1])
print(train_de_sentences[1])

['<s>', 'Owing', 'to', 'geography', ',', 'heavy', 'reliance', 'on', 'automobiles', ',', 'and', 'the', 'Los', 'Angeles', '/', 'Long', 'Beach', 'port', 'complex', ',', 'Los', 'Angeles', 'suffers', 'from', 'air', 'pollution', 'in', 'the', 'form', 'of', 'smog', '.', 'The', 'Los', 'Angeles', 'Basin', 'and', 'the', 'San', 'Fernando', 'Valley', 'are', 'susceptible', 'to', 'atmospheric', 'inversion', ',', 'which', 'holds', 'in', 'the', 'exhausts', 'from', 'road', 'vehicles', ',', 'airplanes', ',', 'locomotives', ',', 'shipping', ',', 'manufacturing', ',', 'and', 'other', 'sources', '.', '</s>']
['<s>', 'Der', 'Großraum', 'Los', 'Angeles', '/', 'Long', 'Beach', '/', 'Riverside', 'zählte', '2007', 'nach', 'einem', 'Bericht', 'der', '„', 'American', 'Lung', 'Association', '“', 'zum', 'städtischen', 'Gebiet', 'mit', 'der', 'höchsten', 'Luftverschmutzung', 'in', 'den', 'Vereinigten', 'Staaten', '.', '</s>']
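
Passing both lists to `train_test_split` in one call applies the same shuffle to each, so every English sentence stays paired with its German translation. The sketch below reproduces that idea with the standard library only. `paired_split` and the toy sentences are hypothetical and are not part of the notebook or of scikit-learn.

```python
import random


def paired_split(source, target, valid_fraction=0.1, seed=0):
    """Shuffle two parallel lists with one permutation, then split them."""
    assert len(source) == len(target), "parallel lists must have equal length"
    indices = list(range(len(source)))
    random.Random(seed).shuffle(indices)  # one shuffle, used for both lists
    n_valid = round(len(indices) * valid_fraction)
    valid_idx, train_idx = indices[:n_valid], indices[n_valid:]

    def pick(data, idx):
        return [data[i] for i in idx]

    return (
        pick(source, train_idx),
        pick(source, valid_idx),
        pick(target, train_idx),
        pick(target, valid_idx),
    )


toy_en = [f"en-{i}" for i in range(20)]
toy_de = [f"de-{i}" for i in range(20)]
train_en, valid_en, train_de, valid_de = paired_split(toy_en, toy_de)

print(len(train_en), len(valid_en))
# Every pair still shares the same sentence number after the split.
print(all(e[3:] == d[3:] for e, d in zip(train_en, train_de)))
```

With scikit-learn itself, passing a fixed `random_state` to `train_test_split` makes the split reproducible across runs.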


### Defining sequence lengths for the two languages


In [5]:
import pandas as pd

# Getting some basic statistics from the data

# convert train_en_sentences to a pandas series
pd.Series(train_en_sentences).str.len().describe(percentiles=[0.05, 0.5, 0.95])

count    9000.000000
mean       27.337667
std        14.155593
min         8.000000
5%         11.000000
50%        24.000000
95%        56.000000
max       102.000000
dtype: float64

The statistics above show that 5% of English sentences have at most 11 tokens, 50% have at most 24 tokens, and 95% have at most 56 tokens. These counts include the `<s>` and `</s>` tokens added earlier.


In [6]:
pd.Series(train_de_sentences).str.len().describe(percentiles=[0.05, 0.5, 0.95])

count    9000.000000
mean       24.760333
std        12.744718
min         8.000000
5%         11.000000
50%        22.000000
95%        50.000000
max       102.000000
dtype: float64

The statistics above show that 5% of German sentences have at most 11 tokens, 50% have at most 22 tokens, and 95% have at most 50 tokens.

The minimum and maximum sentence lengths are 8 and 102 tokens respectively in both languages. However, this will not always be the case, since the values depend on the sample and on the random split.
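
One way to read these percentiles is as a guide for choosing a maximum sequence length: a cutoff near the 95th percentile keeps most sentences intact and truncates only the long tail. As a rough check, the share of sentences that fit within a given length can be computed directly. The lengths below are made-up toy values and `coverage` is a hypothetical helper.

```python
def coverage(lengths, max_len):
    """Fraction of sequences whose length is at most max_len."""
    return sum(length <= max_len for length in lengths) / len(lengths)


toy_lengths = [8, 11, 15, 22, 24, 30, 41, 50, 56, 102]  # made-up token counts

print(coverage(toy_lengths, 50))   # share of sentences kept intact at length 50
print(coverage(toy_lengths, 102))  # everything fits at the maximum length
```

On the real data, the same helper could be called as `coverage([len(s) for s in train_en_sentences], 50)`.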


### Padding the sentences with `pad_sequences` from Keras


In [11]:
from tensorflow.keras.preprocessing.sequence import pad_sequences

n_en_seq_length = 50
n_de_seq_length = 50
unk_token = "<unk>"

train_en_sentences_padded = pad_sequences(
    train_en_sentences,
    maxlen=n_en_seq_length,
    # value=unk_token,
    dtype=object,
    truncating="post",
    padding="post",
)

valid_en_sentences_padded = pad_sequences(
    valid_en_sentences,
    maxlen=n_en_seq_length,
    # value=unk_token,
    dtype=object,
    truncating="post",
    padding="post",
)

test_en_sentences_padded = pad_sequences(
    test_en_sentences,
    maxlen=n_en_seq_length,
    # value=unk_token,
    dtype=object,
    truncating="post",
    padding="post",
)


train_de_sentences_padded = pad_sequences(
    train_de_sentences,
    maxlen=n_de_seq_length,
    # value=unk_token,
    dtype=object,
    truncating="post",
    padding="post",
)

valid_de_sentences_padded = pad_sequences(
    valid_de_sentences,
    maxlen=n_de_seq_length,
    # value=unk_token,
    dtype=object,
    truncating="post",
    padding="post",
)

test_de_sentences_padded = pad_sequences(
    test_de_sentences,
    maxlen=n_de_seq_length,
    # value=unk_token,
    dtype=object,
    truncating="post",
    padding="post",
)

train_en_sentences_padded[0]

array(['<s>', 'If', 'you', 'aren', '’', 't', 'already', 'following',
       'along', ',', 'I', 'highly', 'recommend', 'checking', 'out', 'the',
       'How', 'To', 'Create', 'A', 'WordPress', 'Theme', 'tutorial',
       'series', 'by', 'Ian', 'Stewart', '(', 'ThemeShaper.com', ')', '.',
       '</s>', 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
       0.0, 0.0, 0.0, 0.0, 0.0, 0.0], dtype=object)
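
In the output above the padding positions hold `0.0`, because the `value` argument is commented out and `pad_sequences` falls back to its default. To show what `truncating="post"` and `padding="post"` do to a single sentence, here is a minimal pure-Python sketch. `pad_or_truncate` and the `<pad>` token are hypothetical and only illustrate the behaviour; they are not part of Keras.

```python
def pad_or_truncate(tokens, max_len, pad_token="<pad>"):
    """Cut tokens to max_len from the end, or right-pad with pad_token."""
    clipped = tokens[:max_len]  # like truncating="post": drop tokens at the end
    return clipped + [pad_token] * (max_len - len(clipped))  # like padding="post"


short = ["<s>", "Guten", "Morgen", "</s>"]
long = ["<s>"] + ["wort"] * 10 + ["</s>"]

print(pad_or_truncate(short, 6))
print(pad_or_truncate(long, 6))
```

Note that the truncated sentence loses its `</s>` token, which is also what happens to any sentence longer than the chosen sequence length when truncating at the end.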

In [None]:
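# convert tokens to IDs

# Vocabulary size is the number of distinct tokens, not the number of sentences
de_vocabulary = {token for sent in original_de_sentences for token in sent}
n_vocab = (
    len(de_vocabulary) + 1
)  # adding one because of the special token (<unk>) to denote the words that are out of vocabulary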
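
As a rough illustration of the token-to-ID step, the sketch below builds a vocabulary from tokenized sentences and maps unseen words to `<unk>`. `build_vocab`, `encode`, and the toy sentences are hypothetical, and reserving ID 0 for `<unk>` is an assumption made for this example only.

```python
from collections import Counter


def build_vocab(sentences, unk_token="<unk>"):
    """Map each distinct token to an integer ID; ID 0 is reserved for unk_token."""
    counts = Counter(token for sent in sentences for token in sent)
    vocab = {unk_token: 0}
    for token, _ in counts.most_common():  # frequent tokens get the lowest IDs
        vocab[token] = len(vocab)
    return vocab


def encode(sentence, vocab, unk_token="<unk>"):
    """Replace tokens with IDs, falling back to the unk_token ID."""
    return [vocab.get(token, vocab[unk_token]) for token in sentence]


toy_sentences = [
    ["<s>", "Guten", "Morgen", "</s>"],
    ["<s>", "Guten", "Tag", "</s>"],
]
toy_vocab = build_vocab(toy_sentences)

# Five distinct tokens plus <unk>
print(len(toy_vocab))
# "Abend" never appeared in toy_sentences, so it maps to the <unk> ID
print(encode(["<s>", "Guten", "Abend", "</s>"], toy_vocab))
```

The vocabulary size here is the number of distinct tokens plus one for `<unk>`, which is the quantity a model's embedding layer would need.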

# Defining the model
