# Machine Translation with different encoders

Most state-of-the-art methods of machine translation use currently an encoder-decoder structure. The encoder tries to find a vector representation for the phrase in the source language and the decoder takes this representation as a basis to generate the phrase in the target language. The goal of the following study is to compare different kinds of encoders for representing the meaning of a source phrase in a vector. For this, I will focus on three different types:
- recurrent neural networks (i.e. LSTM) ([3], [4])
- transformer ([5], [6])
- convolutional neural networks ([1], [2])

The structure of the encoders will be based on the work in the referenced papers. For the decoder, I will always use an LSTM, to generate the output sentence. This will allow me, to only compare the differences of the methods in encoding the meaning of a phrase.

> Unfortunately, the LSTM didn't produce a very sensible output in the beginning. Consequently, I spent all of my time, investigating the problems and trying to fix and improve the LSTM network. Below, you can find my attempts and experiments and the variables I tried to change. Therefore, I didn't have any time to also build and tweak the transformer or CNN encoder. 

## 0 - Constants/Imports

In [151]:
import math
import random
import sys
from pprint import pprint

import numpy as np
import torch
import torch.nn as nn
from nltk.tokenize import word_tokenize
from sklearn.metrics import accuracy_score, precision_score, recall_score
from torch import optim
from torch.nn import CrossEntropyLoss
from torch.utils.data import DataLoader, Dataset

In [152]:
PADDING_TOKEN = '<PAD>'
UNKNOWN_TOKEN = '<UNK>'
START_TOKEN = '<SOS>'
END_TOKEN = '<EOS>'

device = torch.device('cuda:0')

In [153]:
hyperparameters = {
    'batch_size': 16,
    'embedding_dim': 256,
    'lstm_out_dim': 512,
    'epochs': 150,
    'learning_rate': 0.001
}

## 1 - Loading Data
I will use the Multi30k dataset, which contains source phrases in German and target phrases in English.

For this, I downloaded the raw train datasets from https://github.com/multi30k/ in English and German and aligned the translated sentences in one file as following:
```
[German sentence]\t[English sentence]
[German sentence]\t[English sentence]
...
```

The following code loads the data from this file, creates train and test sets, tokenizes the sentences and encodes the words.

In [154]:
class MTDataset(Dataset):
    def __init__(self, path, max_lines=1000, dataset=None):
        data_file = self._read_file(path, max_lines)

        if dataset is None:
            self.max_length_source = -1
            self.max_length_target = -1
            vocab_source_lang = {PADDING_TOKEN, UNKNOWN_TOKEN, START_TOKEN, END_TOKEN}
            vocab_target_lang = {PADDING_TOKEN, UNKNOWN_TOKEN, START_TOKEN, END_TOKEN}
            for sample in data_file:
                vocab_source_lang.update(sample['vocab_source_lang'])
                vocab_target_lang.update(sample['vocab_target_lang'])
                self.max_length_source = max(self.max_length_source, len(sample['vocab_source_lang']))
                self.max_length_target = max(self.max_length_target, len(sample['vocab_target_lang']))

            self.vocab_source_lang = {word: index for index, word in enumerate(list(vocab_source_lang))}
            self.vocab_target_lang = {word: index for index, word in enumerate(list(vocab_target_lang))}

            # START token, END token
            self.max_length_source += 2
            self.max_length_target += 2
        else:
            self.vocab_source_lang = dataset.vocab_source_lang
            self.vocab_target_lang = dataset.vocab_target_lang
            self.max_length_source = dataset.max_length_source
            self.max_length_target = dataset.max_length_target

        self.samples = []
        for sample in data_file:
            source = [self.get_encoded_source_word(word) for word in sample['vocab_source_lang']]
            source.insert(0, self.get_encoded_source_word(START_TOKEN))
            source.append(self.get_encoded_source_word(END_TOKEN))
            source.extend([self.get_encoded_source_word(PADDING_TOKEN)] * (
                    self.max_length_source - len(sample['vocab_source_lang'])))

            target = [self.get_encoded_target_word(word) for word in sample['vocab_target_lang']]
            target.insert(0, self.get_encoded_target_word(START_TOKEN))
            target.append(self.get_encoded_target_word(END_TOKEN))
            target.extend([self.get_encoded_target_word(PADDING_TOKEN)] * (
                    self.max_length_target - len(sample['vocab_target_lang'])))

            self.samples.append({
                'source': torch.tensor(source),
                'target': torch.tensor(target)
            })

    def _read_file(self, path, max_lines):
        lines = []
        with open(path) as f:
            for line_index, sample in enumerate(f):
                split = sample.rstrip().split('\t')
                if len(split) == 2:
                    vocab_source_lang, vocab_target_lang = split
                    lines.append({
                        'vocab_source_lang': [word.lower() for word in word_tokenize(vocab_source_lang)],
                        'vocab_target_lang': [word.lower() for word in word_tokenize(vocab_target_lang)],
                    })

                    if line_index == max_lines:
                        break
        return lines

    def get_encoded_source_word(self, word):
        if word in self.vocab_source_lang:
            return self.vocab_source_lang[word]
        else:
            return self.vocab_source_lang[UNKNOWN_TOKEN]

    def get_encoded_target_word(self, word):
        if word in self.vocab_target_lang:
            return self.vocab_target_lang[word]
        else:
            return self.vocab_target_lang[UNKNOWN_TOKEN]

    def get_decoded_target_word(self, index):
        found = list(filter(lambda x: x[1] == index, self.vocab_target_lang.items()))
        if len(found) > 0:
            return found[0][0]
        else:
            return UNKNOWN_TOKEN

    def get_decoded_source_word(self, index):
        found = list(filter(lambda x: x[1] == index, self.vocab_source_lang.items()))
        if len(found) > 0:
            return found[0][0]
        else:
            return UNKNOWN_TOKEN

    def __getitem__(self, item):
        return self.samples[item]

    def __len__(self):
        return len(self.samples)

In [155]:
dataset = MTDataset('data/multi30k_dev.txt')
print(dataset[:10])

[{'source': tensor([ 978,  919,  571,  781, 1140,  236, 1421,  175, 1533,  910, 1460,  563,
        1238,  543,   33, 2184, 2184, 2184, 2184, 2184, 2184, 2184, 2184, 2184,
        2184, 2184, 2184, 2184, 2184, 2184, 2184, 2184, 2184, 2184, 2184, 2184,
        2184, 2184, 2184, 2184, 2184, 2184, 2184, 2184, 2184, 2184, 2184, 2184]), 'target': tensor([ 793, 1269,   71, 1165,  863, 1598,   10, 1044,  177, 1774,  995,  425,
          20, 1853, 1853, 1853, 1853, 1853, 1853, 1853, 1853, 1853, 1853, 1853,
        1853, 1853, 1853, 1853, 1853, 1853, 1853, 1853, 1853, 1853, 1853, 1853,
        1853, 1853, 1853])}, {'source': tensor([ 978,  233, 1140, 1912,  986, 1127, 1186, 1891,  543,   33, 2184, 2184,
        2184, 2184, 2184, 2184, 2184, 2184, 2184, 2184, 2184, 2184, 2184, 2184,
        2184, 2184, 2184, 2184, 2184, 2184, 2184, 2184, 2184, 2184, 2184, 2184,
        2184, 2184, 2184, 2184, 2184, 2184, 2184, 2184, 2184, 2184, 2184, 2184]), 'target': tensor([ 793,  792, 1799, 1265, 1176,   41, 

In [156]:
def split_data(source_path, target_path_train, target_path_test, train_split=0.8):
    with open(source_path, 'r') as source:
        lines = source.readlines()

    delimiter = int(len(lines) * train_split)

    with open(target_path_train, 'w') as target_train:
        for line in lines[:delimiter]:
            target_train.write(line)
    with open(target_path_test, 'w') as target_test:
        for line in lines[delimiter:]:
            target_test.write(line)

In [157]:
split_data('data/multi30k_dev.txt', 'data/dev_train', 'data/dev_test')

In [158]:
def dataloader(path_train, path_test, batch_size):
    train_dataset = MTDataset(path_train, max_lines=-1)
    test_dataset = MTDataset(path_test, max_lines=-1, dataset=train_dataset)

    train_dataloader = DataLoader(train_dataset,
                                  batch_size=batch_size,
                                  shuffle=True)
    test_dataloader = DataLoader(test_dataset,
                                 batch_size=batch_size,
                                 shuffle=True)

    return train_dataloader, test_dataloader

In [159]:
train_dataloader, test_dataloader = dataloader('data/dev_train', 'data/dev_test', hyperparameters['batch_size'])
train_dataset = train_dataloader.dataset

## 2 - Models
### 2.1 - recurrent neural network (LSTM)

In [160]:
class DCEPEncoder(nn.Module):
    def __init__(self, source_vocab_size, embedding_dim, encoder_out_dim, padding_idx, dropout_prob):
        super(DCEPEncoder, self).__init__()

        self.embeddings = nn.Embedding(source_vocab_size, embedding_dim, padding_idx=padding_idx)
        self.lstm = nn.LSTM(embedding_dim, encoder_out_dim, 8, batch_first=True)
        self.dropout = nn.Dropout(dropout_prob)

    def forward(self, source):
        embedding = self.embeddings(source)
        dropped_out = self.dropout(embedding)
        _, states = self.lstm(dropped_out)

        return states

In [161]:
class DCEPDecoder(nn.Module):
    def __init__(self, target_vocab_size, embedding_dim, decoder_out_dim, padding_idx, dropout_prob):
        super(DCEPDecoder, self).__init__()

        self.embeddings = nn.Embedding(target_vocab_size, embedding_dim, padding_idx=padding_idx)
        self.lstm = nn.LSTM(embedding_dim, decoder_out_dim, 8, batch_first=True)
        self.classifier = nn.Linear(decoder_out_dim, target_vocab_size)
        self.dropout = nn.Dropout(dropout_prob)

    def forward(self, target_word, input_states):
        embedding = self.embeddings(target_word.unsqueeze(0).transpose(0,1))
        dropped_out = self.dropout(embedding)
        output, output_states = self.lstm(dropped_out, input_states)
        prediction = self.classifier(output).squeeze(1)

        return prediction, output_states

In [162]:
class DCEPSeq2Seq(nn.Module):
    def __init__(self, encoder, decoder, encoded_target_SOS, encoded_target_EOS):
        super(DCEPSeq2Seq, self).__init__()

        self.encoder = encoder
        self.decoder = decoder
        self.encoded_target_SOS = encoded_target_SOS
        self.encoded_target_EOS = encoded_target_EOS

    def forward(self, source, target=None):
        predicted_sentence = []

        states = self.encoder(source)

        # predictions during training
        if target is not None:
            target = target.transpose(0,1)
            predicted_word = target[0]
            for word in target:
                base_word = word if random.random() > 0.5 else predicted_word

                predicted_word_layer, states = self.decoder(base_word, states)
                predicted_word = torch.max(predicted_word_layer, 1).indices
                predicted_sentence.append(predicted_word_layer)

        # predictions during testing/production
        else:
            sentence_length = 0
            predicted_word = torch.tensor([self.encoded_target_SOS] * source.shape[0], device=device)
            while predicted_word != torch.tensor(self.encoded_target_EOS, device=device):
                predicted_word_layer, states = self.decoder(predicted_word, states)
                predicted_word = torch.max(predicted_word_layer, 1).indices
                predicted_sentence.append(predicted_word)
                sentence_length += 1

                if sentence_length > 30:
                    break
        return torch.stack(predicted_sentence)

In [163]:
loss_function = CrossEntropyLoss(ignore_index=train_dataset.get_encoded_target_word(PADDING_TOKEN))

dcepEncoder = DCEPEncoder(len(train_dataset.vocab_source_lang),
                          hyperparameters['embedding_dim'],
                          hyperparameters['lstm_out_dim'],
                          train_dataset.get_encoded_source_word(PADDING_TOKEN),
                          0.5)

dcepDecoder = DCEPDecoder(len(train_dataset.vocab_target_lang),
                          hyperparameters['embedding_dim'],
                          hyperparameters['lstm_out_dim'],
                          train_dataset.get_encoded_target_word(PADDING_TOKEN),
                          0.5)
dcepSeq2seq = DCEPSeq2Seq(dcepEncoder,
                          dcepDecoder,
                          train_dataset.get_encoded_target_word(START_TOKEN),
                          train_dataset.get_encoded_target_word(END_TOKEN))
dcepSeq2seq.to(device)

optimizer = optim.Adam(dcepSeq2seq.parameters(), lr=hyperparameters['learning_rate'])

In [164]:
# translate example sentence to see how the translation improves during the training 
def translate_test():
    # <SOS> a man in a blue shirt is standing on a ladder and cleans a window . <EOS>
    sentence = "<SOS> ein mann in einem blauen hemd steht auf einer leiter und putzt ein fenster .".split(' ')
    encoded_sentence = torch.tensor([train_dataset.get_encoded_source_word(word.lower()) for word in sentence], device=device).unsqueeze(0)
    translated = dcepSeq2seq(encoded_sentence).squeeze(0)
    return [train_dataset.get_decoded_target_word(int(word)) for word in translated]

In [165]:
print(f'{hyperparameters["epochs"]} EPOCHS - {math.floor(len(train_dataset) / train_dataloader.batch_size)} BATCHES PER EPOCH')

for epoch in range(hyperparameters['epochs']):
    total_loss = 0
    for i, batch in enumerate(train_dataloader):
        source = batch['source'].to(device)
        target = batch['target'].type(torch.LongTensor).to(device)

        output = dcepSeq2seq(source, target)
#        print(output.transpose(0,1).size())
#        print(target.size())
#        print(torch.max(output.transpose(0,1), 2).indices.size())
#        print()
#        print([train_dataset.get_decoded_source_word(int(word)) for word in source[0]])
#        print([train_dataset.get_decoded_target_word(int(word)) for word in target[:, 1:][0]])
#        max_output = torch.max(output.transpose(0,1)[:, :-1], 2).indices
#        print([train_dataset.get_decoded_target_word(int(word)) for word in max_output[0]])
        loss = loss_function(output.transpose(0,1)[:, :-1].reshape(-1, output.shape[2]), target[:, 1:].reshape(-1))
        total_loss += loss.item()

        # print average loss for the epoch
        sys.stdout.write(f'\repoch {epoch}, batch {i}: {np.round(total_loss / (i + 1), 4)}')

        # compute gradients
        loss.backward()

        # update parameters
        optimizer.step()

        # reset gradients
        optimizer.zero_grad()
    print()
    print(translate_test())

150 EPOCHS - 50 BATCHES PER EPOCH
epoch 0, batch 49: 5.4177
['a', 'man', 'a', 'a', 'a', 'a', 'a', '.', '.', '<EOS>']
epoch 1, batch 49: 4.9296
['a', 'man', 'a', 'a', 'a', 'a', '.', '<EOS>']
epoch 2, batch 49: 4.8235
['a', 'man', 'in', 'a', 'a', 'a', 'a', '.', '<EOS>']
epoch 3, batch 49: 4.7721
['a', 'man', 'in', 'a', 'a', 'a', '.', '.', '<EOS>']
epoch 4, batch 49: 4.7375
['a', 'man', 'in', 'a', 'a', 'a', 'a', '.', '.', '<EOS>']
epoch 5, batch 49: 4.7104
['a', 'man', 'in', 'a', 'a', 'a', 'a', '.', '<EOS>']
epoch 6, batch 49: 4.6989
['a', 'man', 'in', 'a', 'a', 'a', 'a', 'a', '.', '<EOS>']
epoch 7, batch 49: 4.6661
['a', 'man', 'in', 'a', 'a', 'a', 'a', 'a', '.', '<EOS>']
epoch 8, batch 49: 4.6167
['a', 'man', 'in', 'a', 'a', 'a', 'a', 'a', '.', '<EOS>']
epoch 9, batch 49: 4.5744
['a', 'man', 'in', 'a', 'a', 'a', 'a', 'a', '<EOS>']
epoch 10, batch 49: 4.5412
['a', 'man', 'in', 'a', 'a', 'a', 'a', 'a', '.', '<EOS>']
epoch 11, batch 49: 4.5209
['a', 'man', 'in', 'a', 'a', 'a', 'a', 'a', '.

epoch 88, batch 49: 3.7029
['a', 'woman', 'in', 'a', 'a', 'shirt', 'a', 'a', 'a', 'a', '.', '<EOS>']
epoch 89, batch 49: 3.7141
['a', 'man', 'in', 'a', 'a', 'a', 'a', 'a', '<EOS>']
epoch 90, batch 49: 3.7115
['a', 'woman', 'woman', 'a', 'a', 'a', 'a', 'a', 'a', 'a', '.', '<EOS>']
epoch 91, batch 49: 3.6966
['a', 'woman', 'in', 'a', 'a', 'a', 'a', 'a', 'a', '<EOS>']
epoch 92, batch 49: 3.6908
['a', 'woman', 'woman', 'a', 'a', 'a', 'a', 'a', 'a', 'a', '.', '<EOS>']
epoch 93, batch 49: 3.9604
['a', 'woman', 'in', 'a', 'a', 'a', 'a', 'a', 'a', 'a', '.', '<EOS>']
epoch 94, batch 49: 3.9332
['a', 'woman', 'of', 'a', 'a', 'a', 'a', 'a', 'a', '.', '.', '<EOS>']
epoch 95, batch 49: 4.0727
['a', 'woman', 'of', 'a', 'a', 'a', 'a', 'a', 'a', '.', '.', '<EOS>']
epoch 96, batch 49: 4.0589
['a', 'man', 'in', 'a', 'a', 'a', 'a', 'a', 'a', 'a', '<EOS>']
epoch 97, batch 49: 4.0059
['a', 'man', 'in', 'a', 'a', 'a', 'a', 'a', 'a', 'a', '.', '.', '.', '<EOS>']
epoch 98, batch 49: 3.8607
['a', 'man', 'in', 

In [166]:
dcepSeq2seq.eval()
print(translate_test())

['a', 'man', 'in', 'a', 'a', 'a', 'a', 'a', 'a', '.', '.', '<EOS>']


### 2.2 - transformer
*future work*

### 2.3 - convolutional neural network
*future work*

## 3 - Evaluation
*future work*

## 4 - Discussion
As can be seen, the translation doesn't work well. The first few words are translated correctly, but thenit stops predicting the correct words. Since, there is already a big descrepancy, I didn't implement a proper test with metrics like the *bleu score* or similar. This certainly needs to be done in future work when it becomes visible that the model works correctly. There are multiple things, I tried to improve and fix the model. 

The first thing, I tried, was changing the data set. In the beginning I used the DCEP corpus (Digital Corpus of the European Parliament), but after no success, I changed it to the Multi30k corpus. This corpus was used in multiple papers and models for translation. In the end, this didn't improve the model. Taking this step further, I also created my own 'corpus', which consisted of just two translated sentences. For this test, I used a big number of epochs and no dropout. In theory, the model should now learn the corpus by heart. This experiemnt failed, since the model could only translate one sentence, while outputting a wrong translation for the other sentence. Therefore, the problem needed to be somewhere else.

Next, I tried to adapt the hyperparameters, namely the batch size, learning rate, dimensions and dropout. The **batch size** didn't have a big effect on the results, when kept in between 16 and 256 (it was just slowing the training/overflowing the memory). Changing the **dimensions** was more difficult to test, since higher dimensions immediatly prolonged the training time for hours. I used some trial and error (with some trainings running for 20 hours), but while higher dimensions performed somewhat better, they still didn't produce any complete sentences and were in my perspective not the biggest problem. For a better training time, i kept them lower, but for tweaking the models performance, they may be adapted in future work. The **learning rate** had a much higher impact. I started with a learning rate of around *0,02*. Lowering this to something between *0,002* to *0.0002* had a major impact on the cross entropy loss. While the loss was converging to around *7* in the beginning, it now converged to around *3,6*. But still, the model was unable to produce complete sentences and wasn't able to get the loss lower after around the 100th epoch.

Debugging the whole E2E model seemed very hard to me, since I couldn't pin down the location of the problem. It could be part of the encoder, the decoder, the combination or even the data loading. From observing the translated test sentences, it nonetheless seems to me, that the problem lies in the encoder. The decoder is able to predict word by word and consistently also predicts and <EOS> token. The encoder somehow seems to fail encoding the middle and end part of the sequence, while encoding the beginning of the sequence quite well. Future work may test this, by using a training corpus with only very short sentences, to see how much the LSTM encoder can remember. Furthermore, it may be possible to connect a different encoder (e.g. a transformer) to rule out problems of the decoder. Another try could be using closer related languages, e.g. German/Swedish or Italian/Spanish. This may be a simpler problem to solve for the model. With more computational resources, it would also be very interesting to test different hyperparamters more systematically.

## References
[1] Gehring et al. 2017. Convolutional Sequence to Sequence Learning
[2] Gehring et al. 2017. A Convolutional Encoder Model for Neural Machine Translation
[3] Cho et al. 2014. Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation
[4] Zhou et al. 2016. Deep Recurrent Models with Fast-Forward Connections for Neural Machine Translation
[5] Zhou et al. 2020. Incorporating BERT into Neural Machine Translation
[6] Vaswani et al. 2017. Attention is All you Need