<a href="https://colab.research.google.com/github/bucuram/machine-translation-labs/blob/main/Lab2_MT.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Overview of Approaches to MT

### Open NMT frameworks
* [Moses](http://www.statmt.org/moses/). Paper: [Moses: Open Source Toolkit for Statistical Machine Translation](https://aclanthology.org/P07-2045.pdf). C++

* [OpenNMT](https://github.com/OpenNMT/OpenNMT-py). Paper: [OpenNMT: Open-Source Toolkit for Neural Machine Translation](https://aclanthology.org/P17-4012.pdf). PyTorch / TensorFlow. Developed by Harvard NLP,  SYSTRAN
* [Marian](https://marian-nmt.github.io/). Paper: [Marian: Fast Neural Machine Translation in C++](https://aclanthology.org/P18-4020.pdf). C++. Developed by Microsoft Translator
* [Fairseq](https://github.com/pytorch/fairseq). Paper: [fairseq: A Fast, Extensible Toolkit for Sequence Modeling](https://aclanthology.org/N19-4009.pdf). PyTorch. Developed by Facebook AI
* [Nematus](https://github.com/EdinburghNLP/nematus). Paper: [Nematus: a Toolkit for Neural Machine Translation](https://aclanthology.org/E17-3017.pdf). TensorFlow. Developed by Edinburgh NLP
* [Sockeye](https://github.com/awslabs/sockeye). Paper: [Sockeye 2: A Toolkit for Neural Machine Translation](https://aclanthology.org/2020.eamt-1.50.pdf). MXNet. Developed by Amazon
* [JoeyNMT](https://github.com/joeynmt/joeynmt). Paper: [Joey NMT: A Minimalist NMT Toolkit for Novices](https://aclanthology.org/D19-3019v1.pdf). PyTorch



### Exploring the fairseq framework

Installing fairseq and tensorboardX (mosestokenizer is installed later, in the tokenization section).

In [21]:
!pip install fairseq tensorboardX



### Downloading the data

We will use the [Europarl parallel corpus](https://www.statmt.org/europarl/), which contains translations of the proceedings of the European Parliament.

In [22]:
!wget https://object.pouta.csc.fi/OPUS-Europarl/v8/moses/en-ro.txt.zip

--2021-10-20 19:30:24--  https://object.pouta.csc.fi/OPUS-Europarl/v8/moses/en-ro.txt.zip
Resolving object.pouta.csc.fi (object.pouta.csc.fi)... 86.50.254.18, 86.50.254.19
Connecting to object.pouta.csc.fi (object.pouta.csc.fi)|86.50.254.18|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 39495951 (38M) [application/zip]
Saving to: ‘en-ro.txt.zip.1’


2021-10-20 19:30:28 (10.5 MB/s) - ‘en-ro.txt.zip.1’ saved [39495951/39495951]



In [4]:
!mkdir data
!mv en-ro.txt.zip data/en-ro.txt.zip

In [5]:
!unzip data/en-ro.txt.zip -d data
!rm data/Europarl.en-ro.xml

/content/data
Archive:  en-ro.txt.zip
  inflating: README                  
  inflating: LICENSE                 
  inflating: Europarl.en-ro.en       
  inflating: Europarl.en-ro.ro       
  inflating: Europarl.en-ro.xml      


The number of lines in each file:

In [6]:
!wc -l data/Europarl*

   400356 Europarl.en-ro.en
   400356 Europarl.en-ro.ro
   800712 total


We shuffle and merge the source and target files horizontally (each line of the resulting file will contain a source line and a target line, separated by a tab).

In [7]:
!paste data/Europarl.en-ro.ro data/Europarl.en-ro.en | shuf > data/shuf-Europarl.en-ro.both
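Because `shuf` picks a new random order on every run, the train/dev/test split changes whenever the cell is re-executed. If a reproducible split is preferred, the same merge-and-shuffle step can be sketched in Python with a fixed seed. The function name `shuffle_parallel`, the seed value and the toy sentences are assumptions made for this illustration only.

```python
import random

def shuffle_parallel(src_lines, tgt_lines, seed=1234):
    """Shuffle two aligned lists of sentences with the same permutation."""
    if len(src_lines) != len(tgt_lines):
        raise ValueError("source and target must have the same number of lines")
    pairs = list(zip(src_lines, tgt_lines))
    # A local random generator keeps the result repeatable
    # without touching the global random state.
    random.Random(seed).shuffle(pairs)
    return pairs

# Toy data standing in for the Europarl files
ro = ["unu", "doi", "trei", "patru"]
en = ["one", "two", "three", "four"]

for ro_sent, en_sent in shuffle_parallel(ro, en):
    print(f"{ro_sent}\t{en_sent}")
```

Shuffling the pairs rather than the two files separately is what keeps every source sentence next to its translation.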

In [8]:
# Print the first five sentence pairs (Romanian first, English second)
with open('data/shuf-Europarl.en-ro.both', 'r', encoding='utf8') as fh:
    for i in range(5):
        ro_sentence, en_sentence = fh.readline().strip().split('\t')
        print('RO: {}\nEN: {}\n'.format(ro_sentence, en_sentence))

RO: Aş dori să îi mulţumesc în mod deosebit raportoarei, dna Lynne, pentru munca excelentă în întocmirea propunerii de rezoluţie referitoare la propunerea Comisiei şi pentru eforturile sale, de-a lungul mai multor ani, în vederea îmbunătăţirii sănătăţii şi siguranţei la locul de muncă pentru lucrătorii din sectorul spitalicesc şi al asistenţei medicale.
EN: In particular I want to thank the rapporteur, Mrs Lynne, for her excellent work in drawing up the motion for a resolution on the Commission's proposal and for her efforts over a number of years to improve health and safety at work for workers in the hospital and healthcare sector.

RO: Acestea ridică între noi bariere inutile pe care, la urma urmelor, le-am demontat timp de decenii şi aşa trebuie să facem şi pe viitor.
EN: These build unnecessary walls between us which, after all, we have been dismantling for decades, and that is what the future is about.

RO: Salut propunerea de prevederi mai solide care încearcă să prevină transpo

We will use a subset of the Europarl en-ro corpus in our experiments. We split the data into train, dev and test sets.

In [9]:
!sed -n 1,20000p data/shuf-Europarl.en-ro.both | cut -f 1 > data/train.ro
!sed -n 1,20000p data/shuf-Europarl.en-ro.both | cut -f 2 > data/train.en

In [10]:
!sed -n 20001,21000p data/shuf-Europarl.en-ro.both | cut -f 1 > data/dev.ro
!sed -n 20001,21000p data/shuf-Europarl.en-ro.both | cut -f 2 > data/dev.en

In [11]:
!sed -n 21001,21500p data/shuf-Europarl.en-ro.both | cut -f 1 > data/test.ro
!sed -n 21001,21500p data/shuf-Europarl.en-ro.both | cut -f 2 > data/test.en
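The `sed` line ranges above are 1-based and inclusive, so they correspond to the Python slices `[0:20000]`, `[20000:21000]` and `[21000:21500]`. A minimal sketch of the same split on an in-memory list of pairs follows; `split_pairs` is a hypothetical helper and not part of the original pipeline.

```python
def split_pairs(pairs, n_train=20000, n_dev=1000, n_test=500):
    """Cut consecutive, non-overlapping train/dev/test slices from shuffled pairs."""
    needed = n_train + n_dev + n_test
    if len(pairs) < needed:
        raise ValueError(f"need at least {needed} pairs, got {len(pairs)}")
    train = pairs[:n_train]
    dev = pairs[n_train:n_train + n_dev]
    test = pairs[n_train + n_dev:needed]
    return train, dev, test

# Toy example with 10 pairs and a 6/2/2 split
toy_pairs = [(f"ro{i}", f"en{i}") for i in range(10)]
train, dev, test = split_pairs(toy_pairs, n_train=6, n_dev=2, n_test=2)
print(len(train), len(dev), len(test))
```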

### Text preprocessing

We clean the texts and remove bad sentence pairs.

* Removing the pairs in which at least one element of the pair (source or target) is empty.
* Removing the pairs in which at least one of the sentences has 100 or more words. For simplicity, let's just split the sentences by whitespaces and consider the resulting pieces words; for example, the sentence `How are you?` will consist of 3 "words": `["How", "are", "you?"]`.
* Removing the pairs in which one sentence has at least 9 times as many words as the other one.

In [12]:
def clean_sentence_pairs(raw_lines):
    # Remove pairs with empty sentences
    clean_lines = [pair for pair in raw_lines
                  if len(pair[0]) > 0 
                  and len(pair[1]) > 0]
    print(f'Pairs with empty lines removed: {len(raw_lines) - len(clean_lines)}')
    raw_lines = clean_lines

    # Remove very long sentences
    clean_lines = [pair for pair in raw_lines
                  if len(pair[0].split()) < 100 
                  and len(pair[1].split()) < 100]
    print(f'Pairs with long sentences removed: {len(raw_lines) - len(clean_lines)}')
    raw_lines = clean_lines

    # Remove pairs with high length ratios
    clean_lines = [pair for pair in raw_lines
                  if len(pair[0].split())/len(pair[1].split()) < 9
                  and len(pair[1].split())/len(pair[0].split()) < 9]
    print(f'Pairs with high length ratio removed: {len(raw_lines) - len(clean_lines)}')

    return clean_lines
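To check the three rules on a few hand-made pairs, the same conditions can be written as a single predicate. This is a standalone sketch: `keep_pair` and the toy pairs are made up for illustration, and the thresholds mirror the ones used in `clean_sentence_pairs`.

```python
def keep_pair(src, tgt, max_len=100, max_ratio=9):
    """Return True if a sentence pair passes all three cleaning rules."""
    src_len, tgt_len = len(src.split()), len(tgt.split())
    # Rule 1: neither side may be empty
    if src_len == 0 or tgt_len == 0:
        return False
    # Rule 2: neither side may have max_len or more words
    if src_len >= max_len or tgt_len >= max_len:
        return False
    # Rule 3: neither side may be max_ratio (or more) times longer than the other
    return src_len / tgt_len < max_ratio and tgt_len / src_len < max_ratio

toy_pairs = [
    ("How are you?", "Ce mai faci?"),  # kept
    ("", "Bună ziua."),                # dropped: empty source
    ("word " * 100, "cuvânt"),         # dropped: 100 words
    ("a b c d e f g h i", "x"),        # dropped: length ratio of exactly 9
    ("a b c d e f g h", "x"),          # kept: length ratio of 8
]
for src, tgt in toy_pairs:
    print(keep_pair(src, tgt), repr(src[:20]), repr(tgt))
```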

In [13]:
# Read the English and Romanian lines
with open('data/train.en', 'r', encoding='utf8') as en_file:
    en_lines = [line.strip() for line in en_file]
with open('data/train.ro', 'r', encoding='utf8') as ro_file:
    ro_lines = [line.strip() for line in ro_file]

# Each pair is (English sentence, Romanian sentence)
input_pairs = list(zip(en_lines, ro_lines))

In [14]:
# Apply function to sentence pairs
cleaned_pairs = clean_sentence_pairs(input_pairs)

# Write the result into new files
with open('data/cleaned-train.en', 'w', encoding='utf8') as en_clean_file:
    en_clean_file.write('\n'.join([pair[0] for pair in cleaned_pairs]))
with open('data/cleaned-train.ro', 'w', encoding='utf8') as ro_clean_file:
    ro_clean_file.write('\n'.join([pair[1] for pair in cleaned_pairs]))

Pairs with empty lines removed: 0
Pairs with long sentences removed: 31
Pairs with high length ratio removed: 0


### 4. (Optional) Tokenization

In a typical natural language processing pipeline, one of the main pre-processing steps is tokenization. Its task is to turn a string into a list of tokens, in other words, to separate words from punctuation marks (e.g. `Hi, Mary!` $\rightarrow$ `["Hi", ",", "Mary", "!"]`). A typical choice of tokenizer for MT is `mosestokenizer`.

In [None]:
!pip install mosestokenizer

In [15]:
from mosestokenizer import MosesTokenizer, MosesDetokenizer
sentence = "Am avut posibilitatea de a ne exprima aşteptările."

with MosesTokenizer('en') as tokenizer:
    tok_sentence = tokenizer(sentence)
    print(f'Tokenized: {tok_sentence}')
with MosesDetokenizer('en') as detokenizer:
    detok_sentence = detokenizer(tok_sentence)
    print(f'Detokenized: {detok_sentence}')

Tokenized: ['Am', 'avut', 'posibilitatea', 'de', 'a', 'ne', 'exprima', 'aşteptările', '.']
Detokenized: Am avut posibilitatea de a ne exprima aşteptările.


However, tokenization has some drawbacks. It requires knowledge of patterns of the particular language at hand, and it is not always reversible.


In [16]:
test_sentence = "Cele mai vechi atestări documentare ale termenului de „rumân/român” cunoscute în mod cert sunt conţinute în relatări, jurnale şi rapoarte de călătorie redactate de umanişti renascentişti din secolul al XVI-lea."

with MosesTokenizer('en') as tokenizer:
    tok_sentence = tokenizer(test_sentence)
    print(f'Tokenized: {tok_sentence}')
with MosesDetokenizer('en') as detokenizer:
    detok_sentence = detokenizer(tok_sentence)
    print(f'Detokenized: {detok_sentence}')

Tokenized: ['Cele', 'mai', 'vechi', 'atestări', 'documentare', 'ale', 'termenului', 'de', '„', 'rumân', '/', 'român', '”', 'cunoscute', 'în', 'mod', 'cert', 'sunt', 'conţinute', 'în', 'relatări', ',', 'jurnale', 'şi', 'rapoarte', 'de', 'călătorie', 'redactate', 'de', 'umanişti', 'renascentişti', 'din', 'secolul', 'al', 'XVI', '@-@', 'lea', '.']
Detokenized: Cele mai vechi atestări documentare ale termenului de „rumân / român ” cunoscute în mod cert sunt conţinute în relatări, jurnale şi rapoarte de călătorie redactate de umanişti renascentişti din secolul al XVI-lea.


As you can see, when we tokenize this sentence and then detokenize it again, it is not reproduced exactly: extra spaces appear around the slash and before the closing quotation mark. While this would not be a problem for text classification, for example, it is a problem for MT. Firstly, we want natural-looking output, and secondly, unexpected whitespace can lower the BLEU score of our translations.
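To see why stray spaces hurt n-gram based metrics, here is a minimal sketch of clipped n-gram precision, one of the ingredients of BLEU, computed on whitespace-split tokens. It is a simplified illustration and not a full BLEU implementation: real scorers apply their own tokenization and combine several n-gram orders with a brevity penalty, so the size of the effect may differ. The function name `ngram_precision` is an assumption of this example.

```python
from collections import Counter

def ngram_precision(hypothesis, reference, n):
    """Clipped n-gram precision of a whitespace-split hypothesis against one reference."""
    def ngrams(text):
        tokens = text.split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    hyp, ref = ngrams(hypothesis), ngrams(reference)
    total = sum(hyp.values())
    if total == 0:
        return 0.0
    # Each hypothesis n-gram is credited at most as often as it occurs in the reference
    matched = sum(min(count, ref[gram]) for gram, count in hyp.items())
    return matched / total

# A fragment of the sentence above, before and after the tokenize/detokenize round trip
reference = "termenului de „rumân/român” cunoscute"
round_trip = "termenului de „rumân / român ” cunoscute"

for n in (1, 2):
    print(n, ngram_precision(round_trip, reference, n))
```

The round-tripped fragment contains exactly the same characters apart from spaces, yet most of its unigrams and bigrams no longer match the reference.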

In this tutorial, we **will not** tokenize our data, because SentencePiece, which we will use for subword segmentation (see section 6 of this notebook) can handle untokenized text. It is also language-independent and reversible.

### 5. (Optional) Truecasing

As the next step, we need to deal with capitalization. We have three options:

1. Lowercase everything
2. Use a truecaser
3. Do nothing

**Truecasing** is the process of restoring case information to badly-cased or noncased text.

We will use the MosesTruecaser from [sacremoses](https://github.com/alvations/sacremoses).

In [17]:
!pip install -U sacremoses

Collecting sacremoses
  Downloading sacremoses-0.0.46-py3-none-any.whl (895 kB)
     |████████████████████████████████| 895 kB 5.2 MB/s 
Installing collected packages: sacremoses
Successfully installed sacremoses-0.0.46


We will have to train the truecaser on our data.

In [None]:
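import os
from sacremoses import MosesTruecaser, MosesTokenizer

# The truecasing models are saved under model/, so make sure it exists
os.makedirs('model', exist_ok=True)

mtr = MosesTruecaser()
rotok = MosesTokenizer(lang='ro')

with open('data/cleaned-train.ro', 'r', encoding='utf8') as ro_file:
    tokenized_docs = [rotok.tokenize(line) for line in ro_file]
mtr.train(tokenized_docs, save_to='model/ro.truecasemodel')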


In [None]:
mtr = MosesTruecaser()
entok = MosesTokenizer(lang='en')

with open('data/cleaned-train.en', 'r', encoding='utf8') as en_file:
    tokenized_docs = [entok.tokenize(line) for line in en_file]
mtr.train(tokenized_docs, save_to='model/en.truecasemodel')


Using the trained models.

In [None]:
!sacremoses -j 14 truecase -m model/ro.truecasemodel < data/cleaned-train.ro > data/tc_cleaned-train.ro
!sacremoses -j 14 truecase -m model/en.truecasemodel < data/cleaned-train.en > data/tc_cleaned-train.en
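To make the idea behind truecasing concrete, the following is a deliberately simplified sketch and not the sacremoses implementation: it learns the most frequent written form of each word from positions other than the start of a sentence, and then uses that form to rewrite sentence-initial words. The names `train_truecaser` and `truecase` and the toy corpus are hypothetical.

```python
from collections import Counter, defaultdict

def train_truecaser(tokenized_sentences):
    """Learn the most frequent surface form of each word, ignoring sentence-initial tokens."""
    forms = defaultdict(Counter)
    for tokens in tokenized_sentences:
        # Position 0 is skipped: its capitalization says nothing about the word itself
        for token in tokens[1:]:
            forms[token.lower()][token] += 1
    return {lower: counts.most_common(1)[0][0] for lower, counts in forms.items()}

def truecase(tokens, model):
    """Rewrite only the first token to its learned form; unknown words stay unchanged."""
    if not tokens:
        return tokens
    first = model.get(tokens[0].lower(), tokens[0])
    return [first] + tokens[1:]

corpus = [
    ["We", "thank", "the", "Commission", "."],
    ["The", "Commission", "thanks", "the", "rapporteur", "."],
]
model = train_truecaser(corpus)
print(truecase(["The", "report", "is", "ready", "."], model))
print(truecase(["Commission", "members", "agreed", "."], model))
```

In this toy model the sentence-initial `The` is lowercased, while `Commission` keeps its capital because it is usually capitalized inside sentences too.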

### 6. Subword segmentation

The last preprocessing step is subword segmentation. Words will be split into smaller parts based on character co-occurrence frequency. The most common words will remain in one piece, and rare words will be broken into several units.
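As an illustration of frequency-based splitting, here is a toy sketch of byte-pair-encoding (BPE) style merging: it repeatedly joins the most frequent pair of adjacent symbols. This is only meant to convey the intuition. SentencePiece uses a unigram language model by default rather than BPE, and the word counts and function names below are invented for the example.

```python
from collections import Counter

def most_frequent_pair(vocab):
    """Return the adjacent symbol pair with the highest total frequency."""
    pairs = Counter()
    for symbols, freq in vocab.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs.most_common(1)[0][0] if pairs else None

def merge_pair(vocab, pair):
    """Replace every occurrence of the pair with a single merged symbol."""
    merged = {}
    for symbols, freq in vocab.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

# Toy corpus: each word starts as a sequence of characters, with its frequency
word_counts = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
vocab = {tuple(word): freq for word, freq in word_counts.items()}

for _ in range(3):
    pair = most_frequent_pair(vocab)
    vocab = merge_pair(vocab, pair)
    print("merged", pair)
print(vocab)
```

After a few merges, frequent character sequences become single units, while rarer words remain split into smaller pieces.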

We will use [SentencePiece](https://github.com/google/sentencepiece).

In [23]:
!pip install sentencepiece



We can train a model for splitting our text into subwords. Note that it is common to have a joint vocabulary for source and target languages.

In [26]:
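import os
import sentencepiece as spm

# The model files are written to model/, so make sure the directory exists
os.makedirs('model', exist_ok=True)

spm.SentencePieceTrainer.train(input=['data/cleaned-train.en', 'data/cleaned-train.ro'],
                               model_prefix='model/sentpiece',
                               vocab_size=4000)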

In [28]:
sp = spm.SentencePieceProcessor(model_file='model/sentpiece.model')
encoded = sp.encode('Am avut posibilitatea de a ne exprima aşteptările.', out_type=str)
print(encoded)
encoded_str = ' '.join(encoded)
print(encoded_str)

['▁Am', '▁avut', '▁posibilitatea', '▁de', '▁a', '▁ne', '▁exprim', 'a', '▁aştept', 'ările', '.']
▁Am ▁avut ▁posibilitatea ▁de ▁a ▁ne ▁exprim a ▁aştept ările .
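The `▁` symbol marks the positions where a space preceded a piece, which is what makes the segmentation reversible. The sketch below shows the idea in plain Python for ordinary pieces like the ones printed above. The helper `pieces_to_text` is hypothetical, and in practice `sp.decode` should be used, since it also deals with special tokens and other cases this sketch ignores.

```python
def pieces_to_text(pieces):
    """Join subword pieces and turn the '▁' word-boundary marker back into spaces."""
    return "".join(pieces).replace("▁", " ").strip()

pieces = ['▁Am', '▁avut', '▁posibilitatea', '▁de', '▁a', '▁ne',
          '▁exprim', 'a', '▁aştept', 'ările', '.']
print(pieces_to_text(pieces))
```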


We have trained a model with 4000 subwords. This means that SentencePiece will split the words so that the vocabulary size will be 4000.

We will apply the model on the training data.

In [43]:
# File names are given without the directory so that the output path
# becomes data/sentpiece-cleaned-train.* as expected by the later steps
for filename in ['cleaned-train.en', 'cleaned-train.ro']:
    with open(f'data/{filename}', 'r', encoding='utf8') as in_fh:
        sp_out = sp.encode([line.strip() for line in in_fh], out_type=str)
    with open(f'data/sentpiece-{filename}', 'w', encoding='utf8') as out_fh:
        out_fh.writelines([' '.join(line) + '\n' for line in sp_out])

In [44]:
with open('data/sentpiece-cleaned-train.ro', 'r', encoding='utf8') as f:
    sentpieces_ro = f.readlines()
print(sentpieces_ro[:2])

['▁Aş ▁dori ▁să ▁î i ▁mulţumesc ▁în ▁mod ▁deosebit ▁raport oare i , ▁dn a ▁L y n ne , ▁pentru ▁munca ▁excelent ă ▁în ▁în to c mi rea ▁propuneri i ▁de ▁rezoluţie ▁referito are ▁la ▁propunerea ▁Comisiei ▁şi ▁pentru ▁eforturile ▁sale , ▁de - a ▁lung ul ▁mai ▁mult or ▁ani , ▁în ▁vedere a ▁îmbunătăţ i rii ▁sănătăţii ▁şi ▁siguranţ ei ▁la ▁loc ul ▁de ▁muncă ▁pentru ▁lucră torii ▁din ▁sectorul ▁sp it al ic esc ▁şi ▁al ▁asistenţ ei ▁medical e .\n', '▁Aceste a ▁ridic ă ▁între ▁noi ▁bariere ▁in ut ile ▁pe ▁care , ▁la ▁urma ▁ ur m elor , ▁le - am ▁de m on t at ▁timp ▁de ▁de ce ni i ▁şi ▁aşa ▁trebuie ▁să ▁facem ▁şi ▁pe ▁viitor .\n']


### Repeat for dev sets

In [34]:
# Cleaning

# Read source and target lines
with open('data/dev.en', 'r', encoding='utf8') as en_file:
    en_lines = [line.strip() for line in en_file]
with open('data/dev.ro', 'r', encoding='utf8') as ro_file:
    ro_lines = [line.strip() for line in ro_file]

# Each pair is (English sentence, Romanian sentence)
input_pairs = list(zip(en_lines, ro_lines))

# Apply function to sentence pairs
cleaned_pairs = clean_sentence_pairs(input_pairs)

# Write the result into new files
with open('data/cleaned-dev.en', 'w', encoding='utf8') as en_clean_file:
    en_clean_file.write('\n'.join([pair[0] for pair in cleaned_pairs]))
with open('data/cleaned-dev.ro', 'w', encoding='utf8') as ro_clean_file:
    ro_clean_file.write('\n'.join([pair[1] for pair in cleaned_pairs]))

Pairs with empty lines removed: 0
Pairs with long sentences removed: 1
Pairs with high length ratio removed: 0


In [40]:
# Subword segmentation
# File names are given without the directory so that the output path
# becomes data/sentpiece-cleaned-dev.* as expected by fairseq-preprocess
for filename in ['cleaned-dev.en', 'cleaned-dev.ro']:
    with open(f'data/{filename}', 'r', encoding='utf8') as in_fh:
        sp_out = sp.encode([line.strip() for line in in_fh], out_type=str)
    with open(f'data/sentpiece-{filename}', 'w', encoding='utf8') as out_fh:
        out_fh.writelines([' '.join(line) + '\n' for line in sp_out])

We need to write the training and dev data into binary files from which `fairseq` will read during training.

In [45]:
!fairseq-preprocess --source-lang ro \
                    --target-lang en \
                    --trainpref data/sentpiece-cleaned-train \
                    --validpref data/sentpiece-cleaned-dev \
                    --destdir data/bin-data \
                    --joined-dictionary

2021-10-20 19:51:34 | INFO | fairseq_cli.preprocess | Namespace(align_suffix=None, alignfile=None, all_gather_list_size=16384, bf16=False, bpe=None, checkpoint_shard_count=1, checkpoint_suffix='', cpu=False, criterion='cross_entropy', dataset_impl='mmap', destdir='data/bin-data', empty_cache_freq=0, fp16=False, fp16_init_scale=128, fp16_no_flatten_grads=False, fp16_scale_tolerance=0.0, fp16_scale_window=None, joined_dictionary=True, log_format=None, log_interval=100, lr_scheduler='fixed', memory_efficient_bf16=False, memory_efficient_fp16=False, min_loss_scale=0.0001, model_parallel_size=1, no_progress_bar=False, nwordssrc=-1, nwordstgt=-1, only_source=False, optimizer=None, padding_factor=8, profile=False, quantization_config_path=None, scoring='bleu', seed=1, source_lang='ro', srcdict=None, target_lang='en', task='translation', tensorboard_logdir=None, testpref=None, tgtdict=None, threshold_loss_scale=None, thresholdsrc=0, thresholdtgt=0, tokenizer=None, tpu=False, trainpref='data/se

### Train the model

Now that we have preprocessed some texts, we are finally ready to train a translation model. It will not be very good, because we are only using 20,000 sentence pairs for training and we do not have a lot of time, but nevertheless it should learn something useful.

Run the command below. It will train a model with 2-layer Transformer encoder and decoder for 10 epochs. This will take some time. (Check that you have selected runtime type 'GPU'.)

In [47]:
!pip uninstall -y sacrebleu
!pip install sacrebleu==1.5.1

Found existing installation: sacrebleu 2.0.0
Uninstalling sacrebleu-2.0.0:
  Would remove:
    /usr/local/bin/sacrebleu
    /usr/local/lib/python3.7/dist-packages/sacrebleu-2.0.0.dist-info/*
    /usr/local/lib/python3.7/dist-packages/sacrebleu/*
Proceed (y/n)? y
  Successfully uninstalled sacrebleu-2.0.0
Collecting sacrebleu==1.5.1
  Downloading sacrebleu-1.5.1-py3-none-any.whl (54 kB)
     |████████████████████████████████| 54 kB 1.9 MB/s 
Collecting portalocker==2.0.0
  Downloading portalocker-2.0.0-py2.py3-none-any.whl (11 kB)
Installing collected packages: portalocker, sacrebleu
  Attempting uninstall: portalocker
    Found existing installation: portalocker 2.3.2
    Uninstalling portalocker-2.3.2:
      Successfully uninstalled portalocker-2.3.2
Successfully installed portalocker-2.0.0 sacrebleu-1.5.1


In [48]:
!fairseq-train data/bin-data --arch transformer \
                             --lr 0.005 \
                             --encoder-attention-heads 4 \
                             --encoder-embed-dim 32 \
                             --encoder-layers 2 \
                             --encoder-ffn-embed-dim 64 \
                             --decoder-attention-heads 4 \
                             --decoder-embed-dim 32 \
                             --decoder-layers 2 \
                             --decoder-ffn-embed-dim 64 \
                             --max-epoch 10 \
                             --optimizer adam \
                             --max-tokens 4000 \
                             --save-dir data/et2en_model \
                             --log-format json \
                             --tensorboard-logdir data/et2en_model/log-tb \
                             --eval-bleu \
                             --eval-bleu-remove-bpe=sentencepiece

2021-10-20 19:54:47 | INFO | fairseq_cli.train | Namespace(activation_dropout=0.0, activation_fn='relu', adam_betas='(0.9, 0.999)', adam_eps=1e-08, adaptive_input=False, adaptive_softmax_cutoff=None, adaptive_softmax_dropout=0, all_gather_list_size=16384, arch='transformer', attention_dropout=0.0, batch_size=None, batch_size_valid=None, best_checkpoint_metric='loss', bf16=False, bpe=None, broadcast_buffers=False, bucket_cap_mb=25, checkpoint_shard_count=1, checkpoint_suffix='', clip_norm=0.0, cpu=False, criterion='cross_entropy', cross_self_attention=False, curriculum=0, data='data/bin-data', data_buffer_size=10, dataset_impl=None, ddp_backend='c10d', decoder_attention_heads=4, decoder_embed_dim=32, decoder_embed_path=None, decoder_ffn_embed_dim=64, decoder_input_dim=32, decoder_layerdrop=0, decoder_layers=2, decoder_layers_to_keep=None, decoder_learned_pos=False, decoder_normalize_before=False, decoder_output_dim=32, device_id=0, disable_validation=False, distributed_backend='nccl', d

Given the small training set, we get low BLEU scores on the validation set. This indicates that our model has not yet learned to translate well, but it is not generating completely random output either.

In [None]:
%load_ext tensorboard
%tensorboard --logdir data/et2en_model/log-tb

### Translating

Before we can translate a sentence, we need to preprocess it in the same way as we did the training and development sets.

We will create the file with our sample text.

In [61]:
!echo "România este un stat situat în sud-estul Europei Centrale, pe cursul inferior al Dunării" >> data/input.ro
!echo "România a apărut ca stat, condus de Alexandru Ioan Cuza, în 1859." >> data/input.ro
!echo "A fost recunoscută ca ţară independentă 19 ani mai târziu." >> data/input.ro

Preprocessing the input text.

In [62]:
# Subword segmentation
with open(f'data/input.ro', 'r', encoding='utf8') as in_fh:
    sp_out = sp.encode([line.strip() for line in in_fh], out_type=str)
with open(f'data/sentpiece-input.ro', 'w', encoding='utf8') as out_fh:
    out_fh.writelines([' '.join(line) + '\n' for line in sp_out])

Now we can translate it. To get a readable sentence, we also need to reverse SentencePiece splitting afterwards.

In [63]:
!cat data/sentpiece-input.ro | fairseq-interactive data/bin-data \
                                               --source-lang ro \
                                               --target-lang en \
                                               --path data/et2en_model/checkpoint_best.pt \
                                               > data/output.en

To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor'). (Triggered internally at  /pytorch/aten/src/ATen/native/BinaryOps.cpp:467.)
  return torch.floor_divide(self, other)


In [57]:
!cat data/output.en

2021-10-20 20:03:59 | INFO | fairseq_cli.interactive | Namespace(all_gather_list_size=16384, batch_size=1, batch_size_valid=None, beam=5, bf16=False, bpe=None, broadcast_buffers=False, bucket_cap_mb=25, buffer_size=1, checkpoint_shard_count=1, checkpoint_suffix='', constraints=None, cpu=False, criterion='cross_entropy', curriculum=0, data='data/bin-data', data_buffer_size=10, dataset_impl=None, ddp_backend='c10d', decoding_format=None, device_id=0, disable_validation=False, distributed_backend='nccl', distributed_init_method=None, distributed_no_spawn=False, distributed_num_procs=1, distributed_port=-1, distributed_rank=0, distributed_world_size=1, distributed_wrapper='DDP', diverse_beam_groups=-1, diverse_beam_strength=0.5, diversity_rate=-1.0, empty_cache_freq=0, eval_bleu=False, eval_bleu_args=None, eval_bleu_detok='space', eval_bleu_detok_args=None, eval_bleu_print_samples=False, eval_bleu_remove_bpe=None, eval_tokenized_bleu=False, fast_stat_sync=False, find_unused_parameters=Fals

In [65]:
!grep "^H" data/output.en | cut -f3 > data/hypothesis.en
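The `grep`/`cut` command above keeps the lines that start with `H` and takes their third tab-separated field. The same extraction can be sketched in Python, which is convenient if the hypotheses are needed inside the notebook. The function name `extract_hypotheses` and the sample lines are made up for illustration and are not real model output. Unlike `cut`, this sketch skips lines that have fewer than three fields.

```python
def extract_hypotheses(lines):
    """Collect the third tab-separated field of every line that starts with 'H'."""
    hypotheses = []
    for line in lines:
        if not line.startswith("H"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 3:
            hypotheses.append(fields[2])
    return hypotheses

# Hypothetical sample lines, only meant to show the tab-separated layout
sample = [
    "S-0\t▁A ▁fost ▁bine\n",
    "H-0\t-0.5\t▁It ▁was ▁good\n",
    "some unrelated log line\n",
]
print(extract_hypotheses(sample))
```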

In [66]:
# De-SentencePiece
with open(f'data/hypothesis.en', 'r', encoding='utf8') as in_fh:
    de_sp_out = [sp.decode(line.strip().split()) for line in in_fh]
    print(de_sp_out)
with open(f'data/de-sentpiece-hypothesis.en', 'w', encoding='utf8') as out_fh:
    out_fh.writelines([line + '\n' for line in de_sp_out])

['Romania in Romania, Romania, Greece, I am axiganction of the Exanction of the Exition of the 19959599999999999.', 'Romania in Romania, Romania, Greece, I am axiganction of the Exanction of the Exition of the 19959599999999999.', 'The country has been condemned in 199 years.', 'Romania is a whole of the Easter of Central Central Central Central Central Central Central Central Central Central Central Central Central Bank.', 'Romania in Romania, Romania, Greece, I am axiganction of the Exanction of the Exition of the 19959599999999999.', 'The country has been condemned in 199 years.']


In [67]:
!cat data/de-sentpiece-hypothesis.en

Romania in Romania, Romania, Greece, I am axiganction of the Exanction of the Exition of the 19959599999999999.
Romania in Romania, Romania, Greece, I am axiganction of the Exanction of the Exition of the 19959599999999999.
The country has been condemned in 199 years.
Romania is a whole of the Easter of Central Central Central Central Central Central Central Central Central Central Central Central Central Bank.
Romania in Romania, Romania, Greece, I am axiganction of the Exanction of the Exition of the 19959599999999999.
The country has been condemned in 199 years.


As you can see, the model generates text that looks roughly like English, but it is not really a translation of the input. The language model component already works to some extent, but the conditioning on the source sentence is not working yet. You will fix this when you train a bigger baseline with more data.

## Assignment

To be uploaded here: https://forms.gle/A8rHbckQcfZfRMJMA

Try to experiment with other framework(s) for machine translation. You can use the list of resources from the beginning of the lab or any other NMT framework you know.

You can use [Europarl data](https://www.statmt.org/europarl/) or experiment with other data from https://www.statmt.org/.

Split the data you are using into train/dev/test sets and report your performance in terms of BLEU score.

### Resources

* [Intro to Pytorch](https://github.com/udacity/deep-learning-v2-pytorch/tree/master/intro-to-pytorch)
* [Pytorch Lightning](https://github.com/PyTorchLightning/pytorch-lightning)
* [Intro to Tensorflow](https://github.com/udacity/intro-to-ml-tensorflow)

Notebook adapted from: [MTAT.06.055 Machine Translation](https://courses.cs.ut.ee/2021/mt/spring/Main/HomePage)