<a href="https://colab.research.google.com/github/bucuram/machine-translation-labs/blob/main/Lab2_MT.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Overview of Approaches to MT

### Open-source MT frameworks
* [Moses](http://www.statmt.org/moses/). Paper: [Moses: Open Source Toolkit for Statistical Machine Translation](https://aclanthology.org/P07-2045.pdf). C++

* [OpenNMT](https://github.com/OpenNMT/OpenNMT-py). Paper: [OpenNMT: Open-Source Toolkit for Neural Machine Translation](https://aclanthology.org/P17-4012.pdf). PyTorch / TensorFlow. Developed by Harvard NLP and SYSTRAN
* [Marian](https://marian-nmt.github.io/). Paper: [Marian: Fast Neural Machine Translation in C++](https://aclanthology.org/P18-4020.pdf). C++. Developed by Microsoft Translator
* [Fairseq](https://github.com/pytorch/fairseq). Paper: [FAIRSEQ: A Fast, Extensible Toolkit for Sequence Modeling](https://aclanthology.org/N19-4009.pdf). PyTorch. Developed by Facebook AI
* [Nematus](https://github.com/EdinburghNLP/nematus). Paper: [Nematus: a Toolkit for Neural Machine Translation](https://aclanthology.org/E17-3017.pdf). TensorFlow. Developed by Edinburgh NLP
* [Sockeye](https://github.com/awslabs/sockeye). Paper: [Sockeye 2: A Toolkit for Neural Machine Translation](https://aclanthology.org/2020.eamt-1.50.pdf). MXNet. Developed by Amazon
* [JoeyNMT](https://github.com/joeynmt/joeynmt). Paper: [Joey NMT: A Minimalist NMT Toolkit for Novices](https://aclanthology.org/D19-3019v1.pdf). PyTorch



### Exploring the fairseq framework

Installing `fairseq`

In [None]:
!pip install fairseq 

### Downloading the data

We will use the [Europarl parallel corpus](https://www.statmt.org/europarl/), which contains translations of European Parliament proceedings.

In [None]:
!wget https://object.pouta.csc.fi/OPUS-Europarl/v8/moses/en-ro.txt.zip

In [2]:
!mkdir data
!mv en-ro.txt.zip data/en-ro.txt.zip

In [None]:
!unzip data/en-ro.txt.zip -d data/
!rm data/Europarl.en-ro.xml

We check the number of lines in each file.

In [None]:
!wc -l data/Europarl*

We merge the source and target files horizontally with the [paste](https://www.geeksforgeeks.org/paste-command-in-linux-with-examples/) command, so that each line of the resulting file contains a source line and a target line separated by a tab. We then shuffle the merged lines with `shuf`, which keeps every sentence pair together.

In [8]:
!paste data/Europarl.en-ro.ro data/Europarl.en-ro.en | shuf > data/shuf-Europarl.en-ro.both
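Note that `shuf` is not seeded here, so rerunning the cell gives a different order and therefore a different train/dev/test split. As a rough sketch of the same merge-then-shuffle idea in Python with a fixed seed (the function name `paste_and_shuffle` and the toy sentences are made up for illustration):

```python
import random

def paste_and_shuffle(src_lines, tgt_lines, seed=0):
    """Join parallel lines with a tab, then shuffle the pairs together."""
    if len(src_lines) != len(tgt_lines):
        raise ValueError("source and target must have the same number of lines")
    merged = [f"{src}\t{tgt}" for src, tgt in zip(src_lines, tgt_lines)]
    random.Random(seed).shuffle(merged)  # fixed seed: same order on every run
    return merged

# Toy data standing in for Europarl.en-ro.ro and Europarl.en-ro.en
ro = ["Bună ziua.", "Mulțumesc.", "La revedere."]
en = ["Good afternoon.", "Thank you.", "Goodbye."]
for line in paste_and_shuffle(ro, en):
    print(line)
```

Because the pairs are merged before shuffling, a Romanian sentence can never be separated from its English translation.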

In [None]:
# Print the first five sentence pairs (Romanian is in the first column)
with open('data/shuf-Europarl.en-ro.both', 'r', encoding='utf8') as fh:
    for i in range(5):
        ro_sentence, en_sentence = fh.readline().strip().split('\t')
        print('RO: {}\nEN: {}\n'.format(ro_sentence, en_sentence))

We will use a subset of the Europarl en-ro corpus in our experiments. We separate the data into train, dev and test sets using [sed](https://www.geeksforgeeks.org/sed-command-in-linux-unix-with-examples/) to select line ranges and `cut` to select the language column.

In [10]:
!sed -n 1,20000p data/shuf-Europarl.en-ro.both | cut -f 1 > data/train.ro
!sed -n 1,20000p data/shuf-Europarl.en-ro.both | cut -f 2 > data/train.en

In [11]:
!sed -n 20001,21000p data/shuf-Europarl.en-ro.both | cut -f 1 > data/dev.ro
!sed -n 20001,21000p data/shuf-Europarl.en-ro.both | cut -f 2 > data/dev.en

In [12]:
!sed -n 21001,21500p data/shuf-Europarl.en-ro.both | cut -f 1 > data/test.ro
!sed -n 21001,21500p data/shuf-Europarl.en-ro.both | cut -f 2 > data/test.en
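The same split can be sketched in Python. This is only an illustration on toy data: `split_pairs` is a hypothetical helper, and the sizes 6/2/2 stand in for the 20000/1000/500 lines used in the cells above.

```python
def split_pairs(lines, n_train, n_dev, n_test):
    """Slice an already shuffled list into train/dev/test without overlap."""
    if n_train + n_dev + n_test > len(lines):
        raise ValueError("not enough lines for the requested split sizes")
    train = lines[:n_train]
    dev = lines[n_train:n_train + n_dev]
    test = lines[n_train + n_dev:n_train + n_dev + n_test]
    return train, dev, test

# Toy tab-separated lines standing in for shuf-Europarl.en-ro.both
lines = [f"ro {i}\ten {i}" for i in range(10)]
train, dev, test = split_pairs(lines, n_train=6, n_dev=2, n_test=2)

train_ro = [line.split("\t")[0] for line in train]  # like `cut -f 1`
train_en = [line.split("\t")[1] for line in train]  # like `cut -f 2`
print(len(train), len(dev), len(test))
```

Slicing consecutive, non-overlapping ranges guarantees that no sentence pair appears in more than one set.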

### Text preprocessing

We clean the texts and remove bad sentence pairs.

* Remove pairs in which at least one element (source or target) is empty.
* Remove pairs in which at least one of the sentences has 100 or more words. For simplicity, we split sentences on whitespace and treat the resulting pieces as words; for example, the sentence `How are you?` consists of 3 "words": `["How", "are", "you?"]`.
* Remove pairs in which one sentence has at least 5 times as many words as the other.

In [18]:
def clean_sentence_pairs(raw_lines):
    # Remove pairs with empty sentences
    clean_lines = [pair for pair in raw_lines
                  if len(pair[0]) > 0 
                  and len(pair[1]) > 0]
    print(f'Pairs with empty lines removed: {len(raw_lines) - len(clean_lines)}')
    raw_lines = clean_lines

    # Remove very long sentences
    clean_lines = [pair for pair in raw_lines
                  if len(pair[0].split()) < 100 
                  and len(pair[1].split()) < 100]
    print(f'Pairs with long sentences removed: {len(raw_lines) - len(clean_lines)}')
    raw_lines = clean_lines

    # Remove pairs with high length ratios
    clean_lines = [pair for pair in raw_lines
                  if len(pair[0].split())/len(pair[1].split()) < 5
                  and len(pair[1].split())/len(pair[0].split()) < 5]
    print(f'Pairs with high length ratio removed: {len(raw_lines) - len(clean_lines)}')

    return clean_lines
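To see how the three rules behave on individual pairs, here is a small sketch that expresses them as one predicate. `keep_pair` is a hypothetical helper, not part of the notebook's pipeline, and the toy sentences are invented; it also counts words rather than characters for the emptiness check, which treats whitespace-only lines as empty.

```python
def keep_pair(src, tgt, max_words=100, max_ratio=5):
    """Return True if the pair passes all three cleaning rules."""
    src_len, tgt_len = len(src.split()), len(tgt.split())
    if src_len == 0 or tgt_len == 0:
        return False  # rule 1: empty side
    if src_len >= max_words or tgt_len >= max_words:
        return False  # rule 2: 100 or more words
    if src_len >= max_ratio * tgt_len or tgt_len >= max_ratio * src_len:
        return False  # rule 3: one side at least 5 times longer
    return True

examples = [
    ("How are you?", "Ce mai faci?"),          # kept: 3 vs 3 words
    ("", "Bună ziua."),                        # dropped: empty source
    ("Yes", "Da , desigur , sunt de acord"),   # dropped: 1 vs 7 words
    (" ".join(["word"] * 100), " ".join(["cuvânt"] * 100)),  # dropped: too long
]
for src, tgt in examples:
    print(keep_pair(src, tgt))
```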

In [19]:
# Read source and target lines
with open('data/train.en', 'r', encoding='utf8') as en_file:
    en_lines = [line.strip() for line in en_file]
with open('data/train.ro', 'r', encoding='utf8') as ro_file:
    ro_lines = [line.strip() for line in ro_file]

# Pair each English line with its Romanian counterpart
input_pairs = list(zip(en_lines, ro_lines))

In [None]:
# Apply function to sentence pairs
cleaned_pairs = clean_sentence_pairs(input_pairs)

# Write the result into new files
with open('data/cleaned-train.en', 'w', encoding='utf8') as en_clean_file:
    en_clean_file.write('\n'.join([pair[0] for pair in cleaned_pairs]))
with open('data/cleaned-train.ro', 'w', encoding='utf8') as ro_clean_file:
    ro_clean_file.write('\n'.join([pair[1] for pair in cleaned_pairs]))

### 4. (Optional) Tokenization

In a typical natural language processing pipeline, one of the main pre-processing steps is tokenization. Its task is to turn a string into a list of tokens, in other words, to separate words from punctuation marks (e.g. `Hi, Mary!` $\rightarrow$ `["Hi", ",", "Mary", "!"]`). A typical choice of tokenizer for MT is `mosestokenizer`.
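As a rough illustration of what a tokenizer does, the example above can be reproduced with a single regular expression. This is a deliberately naive sketch (the function name `simple_tokenize` is made up); it is not how the Moses tokenizer works, which relies on language-specific rules for abbreviations, hyphens and similar cases.

```python
import re

def simple_tokenize(text):
    """Split text into runs of word characters and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)

print(simple_tokenize("Hi, Mary!"))  # ['Hi', ',', 'Mary', '!']
```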

In [None]:
!pip install mosestokenizer

In [None]:
from mosestokenizer import MosesTokenizer, MosesDetokenizer
sentence = "Am avut posibilitatea de a ne exprima aşteptările."

# The sentence is Romanian, so we request the Romanian rules
with MosesTokenizer('ro') as tokenizer:
    tok_sentence = tokenizer(sentence)
    print(f'Tokenized: {tok_sentence}')
with MosesDetokenizer('ro') as detokenizer:
    detok_sentence = detokenizer(tok_sentence)
    print(f'Detokenized: {detok_sentence}')

However, tokenization has some drawbacks. It requires knowledge of patterns of the particular language at hand, and it is not always reversible.


In [None]:
test_sentence = "Cele mai vechi atestări documentare ale termenului de „rumân/român” cunoscute în mod cert sunt conţinute în relatări, jurnale şi rapoarte de călătorie redactate de umanişti renascentişti din secolul al XVI-lea."

# The sentence is Romanian, so we request the Romanian rules
with MosesTokenizer('ro') as tokenizer:
    tok_sentence = tokenizer(test_sentence)
    print(f'Tokenized: {tok_sentence}')
with MosesDetokenizer('ro') as detokenizer:
    detok_sentence = detokenizer(tok_sentence)
    print(f'Detokenized: {detok_sentence}')

As you can see, tokenizing this sentence and then detokenizing it does not reproduce the original exactly. While this would not matter for, say, text classification, it is a problem for MT: firstly, we want natural-looking output, and secondly, unexpected whitespace lowers the BLEU score of our translations.
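The loss of information can be made concrete with a toy round trip. The sketch below uses a naive regex tokenizer and an equally naive detokenizer that glues every punctuation mark to the preceding token; both functions are hypothetical stand-ins, not the Moses implementation. Once the tokens are separated, the detokenizer can no longer tell that `„` belongs to the following word or that `/` had no space after it.

```python
import re

def simple_tokenize(text):
    """Split text into runs of word characters and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)

def simple_detokenize(tokens):
    """Naive rule: attach every punctuation token to the token before it."""
    out = ""
    for tok in tokens:
        is_punct = re.fullmatch(r"[^\w\s]", tok) is not None
        if out and not is_punct:
            out += " "
        out += tok
    return out

original = "termenului de „rumân/român” cunoscute"
restored = simple_detokenize(simple_tokenize(original))
print(original)
print(restored)
print("round trip ok:", restored == original)
```

The rule works for simple cases such as `Hi, Mary!`, but fails on the opening quotation mark and the slash, which is the kind of mismatch described above.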

In this tutorial, we **will not** tokenize our data, because SentencePiece, which we will use for subword segmentation (see section 6 of this notebook), can handle untokenized text. It is also language-independent and reversible.

### 5. (Optional) Truecasing

As the next step, we need to deal with capitalization. We have three options:

1. Lowercase everything
2. Use a truecaser
3. Do nothing

**Truecasing** is the process of restoring case information to badly-cased or noncased text.
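A minimal sketch of the idea, under simplifying assumptions: we learn the most frequent casing of each word from positions that are not sentence-initial (where capitalization is informative), and then use it to recase only the first word of a sentence. The function names and the toy corpus are invented for illustration; a real truecaser such as the Moses one is more elaborate.

```python
from collections import Counter, defaultdict

def train_truecaser(sentences):
    """Count the casing of each word, skipping sentence-initial words."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        for word in sentence.split()[1:]:
            counts[word.lower()][word] += 1
    # Keep the most frequent surface form for every lowercased word
    return {low: forms.most_common(1)[0][0] for low, forms in counts.items()}

def truecase(sentence, model):
    """Replace only the first word with its most frequent casing, if known."""
    words = sentence.split()
    if words:
        words[0] = model.get(words[0].lower(), words[0])
    return " ".join(words)

corpus = [
    "We visited the parliament in Bucharest",
    "Yesterday the parliament met in Bucharest",
    "In Bucharest the debate continued",
]
model = train_truecaser(corpus)
print(truecase("The parliament met", model))        # the parliament met
print(truecase("Bucharest is the capital", model))  # Bucharest is the capital
```

The first word `The` is lowercased because it mostly appears in lowercase inside sentences, while `Bucharest` keeps its capital letter.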

We will use the `MosesTruecaser` from [`sacremoses`](https://github.com/alvations/sacremoses).

In [None]:
!pip install -U sacremoses

We will have to train the truecaser on our data.

In [26]:
!mkdir model/

In [None]:
from sacremoses import MosesTruecaser, MosesTokenizer

mtr = MosesTruecaser()
rotok = MosesTokenizer(lang='ro')

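# Tokenize the cleaned Romanian training data line by line
with open('data/cleaned-train.ro', 'r', encoding='utf8') as ro_file:
    tokenized_docs = [rotok.tokenize(line) for line in ro_file]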
mtr.train(tokenized_docs, save_to='model/ro.truecasemodel')


In [None]:
mtr = MosesTruecaser()
entok = MosesTokenizer(lang='en')

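# Tokenize the cleaned English training data line by line
with open('data/cleaned-train.en', 'r', encoding='utf8') as en_file:
    tokenized_docs = [entok.tokenize(line) for line in en_file]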
mtr.train(tokenized_docs, save_to='model/en.truecasemodel')


Using the trained models for truecasing.

In [None]:
!sacremoses -j 14 truecase -m model/ro.truecasemodel < data/cleaned-train.ro > data/tc_cleaned-train.ro
!sacremoses -j 14 truecase -m model/en.truecasemodel < data/cleaned-train.en > data/tc_cleaned-train.en

### 6. Subword segmentation

The last preprocessing step is subword segmentation. Words will be split into smaller parts based on character co-occurrence frequency. The most common words will remain in one piece, and rare words will be broken into several units.
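To make the frequency-based splitting concrete, here is a toy sketch of byte-pair-encoding style merging: starting from single characters, the most frequent adjacent pair of symbols is merged repeatedly. This is an illustration only. The function names and word frequencies are invented, and SentencePiece's default algorithm is a unigram language model rather than BPE (BPE is one of its selectable model types).

```python
from collections import Counter

def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` with one merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)

def learn_bpe(word_freqs, num_merges):
    """Learn merge rules by repeatedly joining the most frequent symbol pair."""
    vocab = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        vocab = {merge_pair(symbols, best): freq for symbols, freq in vocab.items()}
    return merges

def segment(word, merges):
    """Apply the learned merges, in order, to a new word."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)

word_freqs = {"low": 7, "lower": 5, "newest": 3, "widest": 2}
merges = learn_bpe(word_freqs, num_merges=2)
print(merges)
print(segment("low", merges))
print(segment("lowest", merges))
```

With only two merges, the frequent word `low` already becomes a single unit, while the unseen word `lowest` is split into `low` plus single characters.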

We will use [`SentencePiece`](https://github.com/google/sentencepiece). Paper: [SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing](https://arxiv.org/pdf/1808.06226.pdf)

In [None]:
!pip install sentencepiece

We can train a model for splitting our text into subwords. Note that it is common to have a joint vocabulary for source and target languages.

In [33]:
import sentencepiece as spm

spm.SentencePieceTrainer.train(input=['data/cleaned-train.en', 'data/cleaned-train.ro'],
                               model_prefix='model/sentpiece',
                               vocab_size=4000)

In [None]:
sp = spm.SentencePieceProcessor(model_file='model/sentpiece.model')
encoded = sp.encode('Am avut posibilitatea de a ne exprima aşteptările.', out_type=str)
print(encoded)
encoded_str = ' '.join(encoded)
print(encoded_str)

We trained the model with `vocab_size=4000`, so SentencePiece learns a vocabulary of 4000 subword units and segments every word into pieces from that vocabulary.

We will apply the model to the training data.

In [37]:
for filename in ['cleaned-train.en', 'cleaned-train.ro']:
    with open(f'data/{filename}', 'r', encoding='utf8') as in_fh:
        sp_out = sp.encode([line.strip() for line in in_fh], out_type=str)
    with open(f'data/sentpiece-{filename}', 'w', encoding='utf8') as out_fh:
        out_fh.writelines([' '.join(line) + '\n' for line in sp_out])

In [None]:
with open('data/sentpiece-cleaned-train.ro', 'r', encoding='utf8') as f:
    sentpieces_ro = f.readlines()
print(sentpieces_ro[:2])

### Repeat for dev sets

In [None]:
# Cleaning

# Read source and target lines
with open('data/dev.en', 'r', encoding='utf8') as en_file:
    en_lines = [line.strip() for line in en_file]
with open('data/dev.ro', 'r', encoding='utf8') as ro_file:
    ro_lines = [line.strip() for line in ro_file]

# Pair each English line with its Romanian counterpart
input_pairs = list(zip(en_lines, ro_lines))

# Apply function to sentence pairs
cleaned_pairs = clean_sentence_pairs(input_pairs)

# Write the result into new files
with open('data/cleaned-dev.en', 'w', encoding='utf8') as en_clean_file:
    en_clean_file.write('\n'.join([pair[0] for pair in cleaned_pairs]))
with open('data/cleaned-dev.ro', 'w', encoding='utf8') as ro_clean_file:
    ro_clean_file.write('\n'.join([pair[1] for pair in cleaned_pairs]))

In [41]:
# Subword segmentation
for filename in ['cleaned-dev.en', 'cleaned-dev.ro']:
    with open(f'data/{filename}', 'r', encoding='utf8') as in_fh:
        sp_out = sp.encode([line.strip() for line in in_fh], out_type=str)
    with open(f'data/sentpiece-{filename}', 'w', encoding='utf8') as out_fh:
        out_fh.writelines([' '.join(line) + '\n' for line in sp_out])

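We need to write the training and dev data into binary files that `fairseq` reads during training, and to build the vocabularies. `fairseq-preprocess` does both; the `--joined-dictionary` flag makes the source and target sides share one vocabulary, which matches our joint SentencePiece model.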

In [None]:
!fairseq-preprocess --source-lang ro \
                    --target-lang en \
                    --trainpref data/sentpiece-cleaned-train \
                    --validpref data/sentpiece-cleaned-dev \
                    --destdir data/bin-data \
                    --joined-dictionary
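Conceptually, building a joined dictionary means counting every subword on both sides and mapping each one to an integer ID, after which sentences can be stored as lists of IDs. The sketch below illustrates that idea only. The helper names and toy sentences are invented, and the real `fairseq` dictionary additionally reserves special symbols and uses its own binary file format, neither of which is modelled here.

```python
from collections import Counter

def build_joined_dictionary(src_sentences, tgt_sentences):
    """Count subwords over both languages and assign IDs by frequency."""
    counts = Counter()
    for sentence in list(src_sentences) + list(tgt_sentences):
        counts.update(sentence.split())
    # Sort by descending frequency, ties broken alphabetically
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    token_to_id = {tok: idx for idx, (tok, _) in enumerate(ordered)}
    return ordered, token_to_id

def binarize(sentence, token_to_id, unk_id):
    """Map each subword to its ID, using unk_id for unseen subwords."""
    return [token_to_id.get(tok, unk_id) for tok in sentence.split()]

# Toy SentencePiece-style input (already split into pieces)
src = ["▁de ▁a ▁ne", "▁a ▁fost"]
tgt = ["▁a ▁chance", "▁to ▁be ▁a"]
ordered, token_to_id = build_joined_dictionary(src, tgt)
unk_id = len(token_to_id)  # hypothetical ID for unknown subwords

print(ordered[:3])
print(binarize("▁a ▁fost ▁unknown", token_to_id, unk_id))
```

Because the dictionary is shared, the piece `▁a` gets the same ID whether it occurs in a Romanian or an English sentence.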

### Train the model

Now that we have preprocessed some texts, we are finally ready to train a translation model. It will not be very good, because we are only using 20,000 sentence pairs for training and we do not have a lot of time, but nevertheless it should learn something useful.

Run the command below. It will train a model with a 2-layer Transformer encoder and a 2-layer Transformer decoder for 10 epochs. This will take some time. (Check that you have selected runtime type 'GPU'.)

In [None]:
!pip uninstall -y sacrebleu
!pip install sacrebleu==1.5.1

In [None]:
!pip install tensorboardX

In [None]:
!fairseq-train data/bin-data --arch transformer \
                             --lr 0.005 \
                             --encoder-attention-heads 4 \
                             --encoder-embed-dim 32 \
                             --encoder-layers 2 \
                             --encoder-ffn-embed-dim 64 \
                             --decoder-attention-heads 4 \
                             --decoder-embed-dim 32 \
                             --decoder-layers 2 \
                             --decoder-ffn-embed-dim 64 \
                             --max-epoch 10 \
                             --optimizer adam \
                             --max-tokens 4000 \
                             --save-dir data/et2en_model \
                             --log-format json \
                             --tensorboard-logdir data/et2en_model/log-tb \
                             --eval-bleu \
                             --eval-bleu-remove-bpe=sentencepiece

Given the small training set, we got low BLEU scores at validation, which indicates that our model did not learn to translate well yet, but is not generating completely random output either.
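For intuition about what the BLEU score measures, the sketch below computes its main building block, the clipped (modified) n-gram precision of a hypothesis against a single reference. This is not full BLEU, which combines the precisions for n = 1 to 4 and applies a brevity penalty; use `sacrebleu` for real scores. The function names and example sentences are made up for illustration.

```python
from collections import Counter

def ngrams(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def modified_precision(hypothesis, reference, n):
    """Clipped n-gram precision of one hypothesis against one reference."""
    hyp_counts = ngrams(hypothesis.split(), n)
    ref_counts = ngrams(reference.split(), n)
    total = sum(hyp_counts.values())
    if total == 0:
        return 0.0
    # Each hypothesis n-gram is credited at most as often as it occurs in the reference
    matched = sum(min(count, ref_counts[gram]) for gram, count in hyp_counts.items())
    return matched / total

reference = "we had the opportunity to express our expectations"
good = "we had the chance to express our expectations"
bad = "the the the the"

print(modified_precision(good, reference, 1))  # 0.875
print(modified_precision(bad, reference, 1))   # 0.25
print(modified_precision(bad, reference, 2))   # 0.0
```

The degenerate output `the the the the` still gets some unigram credit, but no bigram credit, which is why a weak model tends to score low without scoring exactly zero.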

In [None]:
%load_ext tensorboard
%tensorboard --logdir data/et2en_model/log-tb

### Translating

Before we can translate a sentence, we need to preprocess it in the same way as we did the training and development sets.

We will create the file with our sample text.

In [59]:
!echo "România este un stat situat în sud-estul Europei Centrale, pe cursul inferior al Dunării" > data/input.ro
!echo "România a apărut ca stat, condus de Alexandru Ioan Cuza, în 1859." >> data/input.ro
!echo "A fost recunoscută ca ţară independentă 19 ani mai târziu." >> data/input.ro

Preprocessing the input text.

In [60]:
# Subword segmentation
with open(f'data/input.ro', 'r', encoding='utf8') as in_fh:
    sp_out = sp.encode([line.strip() for line in in_fh], out_type=str)
with open(f'data/sentpiece-input.ro', 'w', encoding='utf8') as out_fh:
    out_fh.writelines([' '.join(line) + '\n' for line in sp_out])

Now we can translate it. To get a readable sentence, we also need to reverse SentencePiece splitting afterwards.

In [None]:
!cat data/sentpiece-input.ro | fairseq-interactive data/bin-data \
                                               --source-lang ro \
                                               --target-lang en \
                                               --path data/et2en_model/checkpoint_best.pt \
                                               > data/output.en

In [None]:
!cat data/output.en

In [63]:
!grep "^H" data/output.en | cut -f3 > data/hypothesis.en
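The same extraction can be sketched in Python. The layout assumed here is only what the `grep`/`cut` command above implies: hypothesis lines start with `H` and carry the text in the third tab-separated field. The sample string is a hand-written stand-in with invented scores, not real `fairseq-interactive` output, and `extract_hypotheses` is a hypothetical helper.

```python
def extract_hypotheses(output_text):
    """Return the third tab-separated field of every line starting with 'H-'."""
    hypotheses = []
    for line in output_text.splitlines():
        if line.startswith("H-"):
            fields = line.split("\t")
            if len(fields) >= 3:
                hypotheses.append(fields[2])
    return hypotheses

# Hand-written stand-in for the contents of data/output.en (values are invented)
sample = (
    "S-0\t▁A ▁fost\n"
    "H-0\t-1.5\t▁It ▁was\n"
    "P-0\t-1.0 -2.0\n"
)
print(extract_hypotheses(sample))
```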

In [None]:
# De-SentencePiece
with open(f'data/hypothesis.en', 'r', encoding='utf8') as in_fh:
    de_sp_out = [sp.decode(line.strip().split()) for line in in_fh]
    print(de_sp_out)
with open(f'data/de-sentpiece-hypothesis.en', 'w', encoding='utf8') as out_fh:
    out_fh.writelines([line + '\n' for line in de_sp_out])
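The reason SentencePiece is reversible is that the original spaces are stored inside the pieces as the marker `▁` (U+2581). As a simplified sketch of what decoding amounts to (the function `join_pieces` and the example pieces are made up, and `sp.decode` handles additional cases that this ignores):

```python
def join_pieces(pieces, marker="\u2581"):
    """Concatenate pieces and turn the word-boundary marker back into spaces."""
    return "".join(pieces).replace(marker, " ").strip()

pieces = ["▁It", "▁was", "▁re", "co", "gn", "ized", "."]
print(join_pieces(pieces))  # It was recognized.
```

No language-specific rules are needed: pieces without the marker are simply glued to the previous piece.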

In [None]:
!cat data/de-sentpiece-hypothesis.en

You can probably see that our model generates readable English text, but it is not necessarily a translation of the input. The language model component is already OK, but the conditioning part is not working yet. You will fix it when you train a bigger baseline with more data.

## Assignment

To be uploaded here: https://forms.gle/A8rHbckQcfZfRMJMA

Try to experiment with other framework(s) for machine translation. You can use the list of resources from the beginning of the lab or any other NMT framework you know.

You can use [Europarl data](https://www.statmt.org/europarl/) or experiment with other data from https://www.statmt.org/.

Split the data you are using into train/dev/test sets and report your performance in terms of BLEU score and/or other MT metrics.

### Resources

* [Intro to PyTorch](https://github.com/udacity/deep-learning-v2-pytorch/tree/master/intro-to-pytorch)
* [PyTorch Lightning](https://github.com/PyTorchLightning/pytorch-lightning)
* [Intro to TensorFlow](https://github.com/udacity/intro-to-ml-tensorflow)

Notebook adapted from: [MTAT.06.055 Machine Translation](https://courses.cs.ut.ee/2021/mt/spring/Main/HomePage)