# Lab 2: Joey NMT Advanced

In this notebook, we'll train a Transformer model for translating TED talks from French (*fr*) into English (*en*). We'll do some configuration debugging and experiment with hyperparameters, then inspect evaluation metrics and find out how robust the model is.

The pre-processing code is a bit lengthy, but it reflects reality: often getting the data into the right format and selecting the right pieces takes more code than the actual model training ;) 

At the very end of this Colab you'll also find instructions on how to get started with backtranslation as a data augmentation technique, and how to build a multilingual model. These topics are not mandatory but might be fun to explore if you have time.

**Important:** Before you start, set runtime type to GPU.

Author: Julia Kreutzer

In [1]:
import os

In [2]:
!pip install torch==1.8.0+cu101 -f https://download.pytorch.org/whl/torch_stable.html

Looking in links: https://download.pytorch.org/whl/torch_stable.html
Collecting torch==1.8.0+cu101
  Downloading https://download.pytorch.org/whl/cu101/torch-1.8.0%2Bcu101-cp37-cp37m-linux_x86_64.whl (763.5MB)
     |████████████████████████████████| 763.5MB 24kB/s 
ERROR: torchvision 0.9.1+cu101 has requirement torch==1.8.1, but you'll have torch 1.8.0+cu101 which is incompatible.
ERROR: torchtext 0.9.1 has requirement torch==1.8.1, but you'll have torch 1.8.0+cu101 which is incompatible.
Installing collected packages: torch
  Found existing installation: torch 1.8.1+cu101
    Uninstalling torch-1.8.1+cu101:
      Successfully uninstalled torch-1.8.1+cu101
Successfully installed torch-1.8.0+cu101


In [3]:
!pip install joeynmt

Collecting joeynmt
  Downloading https://files.pythonhosted.org/packages/c6/0a/f383560ca9eedbd4a09f5d9d1523aaf3b4b5e0e79a1e5d9c72ff370826d4/joeynmt-1.3-py3-none-any.whl (84kB)
     |████████████████████████████████| 92kB 4.2MB/s 
Collecting sacrebleu>=1.3.6
  Downloading https://files.pythonhosted.org/packages/7e/57/0c7ca4e31a126189dab99c19951910bd081dea5bbd25f24b77107750eae7/sacrebleu-1.5.1-py3-none-any.whl (54kB)
     |████████████████████████████████| 61kB 8.5MB/s 
Collecting pylint
  Downloading https://files.pythonhosted.org/packages/10/f0/9705d6ec002876bc20b6923cbdeeca82569a895fc214211562580e946079/pylint-2.8.2-py3-none-any.whl (357kB)
     |████████████████████████████████| 358kB 21.3MB/s 
Collecting torchtext==0.9.0
  Downloading https://files.pythonhosted.org/packages/36/50/84184d6230686e230c464f0dd4ff32eada2756b4a0b9cefec68b88d1d580/torchtext-0.9.0-cp37-cp37m-manylinux1_x86_64.whl (7.1MB)
     |████████████████████████████████| 7.1MB 

# Data Preparation

We'll use *English - French* translations from the [IWSLT 2017 challenge](https://wit3.fbk.eu/2017-01-c), the ["unofficial" task](https://sites.google.com/site/iwsltevaluation2017/TED-tasks?authuser=0). This challenge is about translating TED talks from multiple languages.

## Download

This step downloads a 292 MB archive. If you download it ahead of time, store a copy in your Google Drive and access it from there.

In [4]:
! pip install gdown



In [5]:
!gdown https://drive.google.com/uc?id=1gFeuPTRc3RB4DhJEkhr8O-a8PObM7Ix2

Downloading...
From: https://drive.google.com/uc?id=1gFeuPTRc3RB4DhJEkhr8O-a8PObM7Ix2
To: /content/2017-01-trnted.tgz
292MB [00:04, 59.2MB/s]


In [6]:
! tar -zxvf /content/2017-01-trnted.tgz

2017-01-trnted/
2017-01-trnted/texts/
2017-01-trnted/._texts.html
2017-01-trnted/texts.html
2017-01-trnted/texts/ar/
2017-01-trnted/texts/de/
2017-01-trnted/texts/en/
2017-01-trnted/texts/fr/
2017-01-trnted/texts/ja/
2017-01-trnted/texts/ko/
2017-01-trnted/texts/zh/
2017-01-trnted/texts/zh/en/
2017-01-trnted/texts/zh/en/._.eval
2017-01-trnted/texts/zh/en/.eval
2017-01-trnted/texts/zh/en/._.info
2017-01-trnted/texts/zh/en/.info
2017-01-trnted/texts/zh/en/._zh-en.tgz
2017-01-trnted/texts/zh/en/zh-en.tgz
2017-01-trnted/texts/ko/en/
2017-01-trnted/texts/ko/en/._.eval
2017-01-trnted/texts/ko/en/.eval
2017-01-trnted/texts/ko/en/._.info
2017-01-trnted/texts/ko/en/.info
2017-01-trnted/texts/ko/en/._ko-en.tgz
2017-01-trnted/texts/ko/en/ko-en.tgz
2017-01-trnted/texts/ja/en/
2017-01-trnted/texts/ja/en/._.eval
2017-01-trnted/texts/ja/en/.eval
2017-01-trnted/texts/ja/en/._.info
2017-01-trnted/texts/ja/en/.info
2017-01-trnted/texts/ja/en/._ja-en.tgz
2017-01-trnted/texts/ja/en/ja-en.tgz
2017-01-trnte

The `texts` subdirectory contains translation data for multiple languages. Let's start with `fr-en`, French to English translations.

In [7]:
!tar -xvf 2017-01-trnted/texts/fr/en/fr-en.tgz

fr-en/
fr-en/IWSLT17.TED.dev2010.fr-en.en.xml
fr-en/IWSLT17.TED.dev2010.fr-en.fr.xml
fr-en/IWSLT17.TED.tst2010.fr-en.en.xml
fr-en/IWSLT17.TED.tst2010.fr-en.fr.xml
fr-en/IWSLT17.TED.tst2011.fr-en.en.xml
fr-en/IWSLT17.TED.tst2011.fr-en.fr.xml
fr-en/IWSLT17.TED.tst2012.fr-en.en.xml
fr-en/IWSLT17.TED.tst2012.fr-en.fr.xml
fr-en/IWSLT17.TED.tst2013.fr-en.en.xml
fr-en/IWSLT17.TED.tst2013.fr-en.fr.xml
fr-en/IWSLT17.TED.tst2014.fr-en.en.xml
fr-en/IWSLT17.TED.tst2014.fr-en.fr.xml
fr-en/IWSLT17.TED.tst2015.fr-en.en.xml
fr-en/IWSLT17.TED.tst2015.fr-en.fr.xml
fr-en/README
fr-en/train.en
fr-en/train.tags.fr-en.en
fr-en/train.tags.fr-en.fr


## Pre-processing

The parallel data is stored in XML (see the description in the README). Each file contains multiple documents, however, so proper XML parsing would require splitting it first. We'll go the quick and dirty way, as in this [pre-processing script](https://github.com/pytorch/fairseq/blob/master/examples/translation/prepare-iwslt14.sh), and simply remove all meta-information that we're not interested in (i.e. every line containing an XML tag, except for the `<seg>` lines that hold the sentences in the dev and test files). This is *not a good example of mindful pre-processing*, but it is good enough for now.

In [8]:
! head -n 20 /content/fr-en/train.tags.fr-en.en

<doc docid="1" genre="lectures"> 
<url>http://www.ted.com/talks/al_gore_on_averting_climate_crisis</url> 
<keywords>talks, alternative energy, cars, climate change, culture, environment, global issues, politics, science, sustainability, technology</keywords> 
<speaker>Al Gore</speaker> 
<talkid>1</talkid> 
<title>Al Gore: Averting the climate crisis</title> 
<description>TED Talk Subtitles and Transcript: With the same humor and humanity he exuded in "An Inconvenient Truth," Al Gore spells out 15 ways that individuals can address climate change immediately, from buying a hybrid to inventing a new, hotter brand name for global warming.</description> 
Thank you so much, Chris. And it's truly a great honor to have the opportunity to come to this stage twice; I'm extremely grateful. 
I have been blown away by this conference, and I want to thank all of you for the many nice comments about what I had to say the other night. 
And I say that sincerely, partly because  Put yourselves in my pos

In [9]:
def remove_xml(filename):
  """Remove all lines that contain XML brackets except for those in <seg>."""
  valid_lines = []
  with open(filename, 'r', encoding='utf-8') as ofile:
    for line in ofile:
      if '<seg' in line:
        # Dev and test sets: keep only the content between the <seg> tags.
        content = line.strip().split('>')[1].split('<')[0]
      elif '<' in line or '>' in line:
        # Any other line with angle brackets is meta-information; skip it.
        continue
      else:
        content = line.strip()
      valid_lines.append(content)
  return valid_lines
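
To make the filtering rule concrete, here is a minimal sketch that applies the same logic to a few in-memory lines instead of a file. The helper name `filter_lines` and the sample lines are made up for illustration; the `<seg>` line only imitates the format of the dev and test XML files.

```python
def filter_lines(lines):
  """Same rule as remove_xml, but for a list of strings (illustrative helper)."""
  kept = []
  for line in lines:
    if '<seg' in line:
      # Dev/test format: keep only the text between <seg ...> and </seg>.
      kept.append(line.strip().split('>')[1].split('<')[0].strip())
    elif '<' in line or '>' in line:
      # Any other line with angle brackets is treated as metadata and dropped.
      continue
    else:
      kept.append(line.strip())
  return kept


# Hypothetical sample lines: one metadata line, one plain sentence, one <seg> line.
sample = [
    '<title>Al Gore: Averting the climate crisis</title>',
    'Thank you so much, Chris.',
    '<seg id="1"> Merci beaucoup, Chris. </seg>',
]
print(filter_lines(sample))
```

Note that a training sentence that itself contains `<` or `>` would also be dropped by this rule, which is one reason why the approach counts as quick and dirty.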

In [10]:
targets = remove_xml('/content/fr-en/train.tags.fr-en.en')
print(f'Read {len(targets)} target sentences.')
    
sources = remove_xml('/content/fr-en/train.tags.fr-en.fr')
print(f'Read {len(sources)} source sentences.')

Read 232825 target sentences.
Read 232825 source sentences.


Let's check if they match.

In [11]:
num_examples = 3
for s, t in zip(sources[:num_examples], targets[:num_examples]):
  print(s)
  print(t)
  print()

Merci beaucoup, Chris. C'est vraiment un honneur de pouvoir venir sur cette scène une deuxième fois. Je suis très reconnaissant.
Thank you so much, Chris. And it's truly a great honor to have the opportunity to come to this stage twice; I'm extremely grateful.

J'ai été très impressionné par cette conférence, et je tiens à vous remercier tous pour vos nombreux et sympathiques commentaires sur ce que j'ai dit l'autre soir.
I have been blown away by this conference, and I want to thank all of you for the many nice comments about what I had to say the other night.

Et je dis çà sincèrement, en autres parce que --Faux sanglot-- j'en ai besoin ! --Rires-- Mettez-vous à ma place!
And I say that sincerely, partly because  Put yourselves in my position.



Looks good! You might already see that the translations are sometimes not very literal. Let's write them to files so we can feed them to Joey NMT.

In [12]:
def write_to_file(sentences, filename):
  """Write sentences to file."""
  with open(filename, 'w') as ofile:
    for sent in sentences:
      ofile.write(sent.strip()+'\n')

In [13]:
data_dir = '/content/fr-en'

In [14]:
file_prefix = 'parallel_'

In [15]:
src_lang = 'fr'
trg_lang = 'en'

In [16]:
train_src_file = os.path.join(data_dir, file_prefix+'train.'+src_lang)
train_trg_file = os.path.join(data_dir, file_prefix+'train.'+trg_lang)
write_to_file(targets, train_trg_file)
write_to_file(sources, train_src_file)

Great, now we need a development and a test set. As the development set we can pick any of the `tst` or `dev` files in the data we just downloaded (these were used for testing and evaluation in previous years). We'll go with `tst2015` for testing and `tst2014` for development.

**Question for you**: Is this choice important? How do you think selecting a different dev/test set could influence our findings?

In [17]:
test_targets = remove_xml('/content/fr-en/IWSLT17.TED.tst2015.fr-en.en.xml')
print(f'Read {len(test_targets)} test target sentences.')
    
test_sources = remove_xml('/content/fr-en/IWSLT17.TED.tst2015.fr-en.fr.xml')
print(f'Read {len(test_sources)} test source sentences.')

dev_targets = remove_xml('/content/fr-en/IWSLT17.TED.tst2014.fr-en.en.xml')
print(f'Read {len(dev_targets)} dev target sentences.')
    
dev_sources = remove_xml('/content/fr-en/IWSLT17.TED.tst2014.fr-en.fr.xml')
print(f'Read {len(dev_sources)} dev source sentences.')

Read 1210 test target sentences.
Read 1210 test source sentences.
Read 1306 dev target sentences.
Read 1306 dev source sentences.


In [18]:
dev_src_file = os.path.join(data_dir, file_prefix+'dev.'+src_lang)
dev_trg_file = os.path.join(data_dir, file_prefix+'dev.'+trg_lang)
test_src_file = os.path.join(data_dir, file_prefix+'test.'+src_lang)
test_trg_file = os.path.join(data_dir, file_prefix+'test.'+trg_lang)

write_to_file(dev_targets, dev_trg_file)
write_to_file(dev_sources, dev_src_file)
write_to_file(test_targets, test_trg_file)
write_to_file(test_sources, test_src_file)

## Sub-words

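We follow the same procedure as in Lab 1: learn joint BPE merges on the concatenated training data, apply them to all splits, and build a joint vocabulary.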
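
As a reminder of what learning BPE merges means, here is a toy sketch of the merge-learning loop in plain Python. It is *not* the `subword-nmt` implementation (which differs in details such as how the end of a word is marked); the function names and the tiny word-frequency corpus are assumptions made purely for illustration.

```python
import collections


def merge_pair(symbols, pair):
  """Replace every adjacent occurrence of `pair` in `symbols` by one symbol."""
  merged, i = [], 0
  while i < len(symbols):
    if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
      merged.append(symbols[i] + symbols[i + 1])
      i += 2
    else:
      merged.append(symbols[i])
      i += 1
  return tuple(merged)


def learn_bpe(word_freqs, num_merges):
  """Learn up to `num_merges` BPE merges from a {word: frequency} dict."""
  # Start from single characters plus an end-of-word marker.
  vocab = {tuple(word) + ('</w>',): freq for word, freq in word_freqs.items()}
  merges = []
  for _ in range(num_merges):
    # Count how often each pair of adjacent symbols occurs in the corpus.
    pairs = collections.Counter()
    for symbols, freq in vocab.items():
      for pair in zip(symbols, symbols[1:]):
        pairs[pair] += freq
    if not pairs:
      break
    best = max(pairs, key=pairs.get)  # On ties, the first pair seen wins.
    merges.append(best)
    vocab = {merge_pair(symbols, best): freq for symbols, freq in vocab.items()}
  return merges


# Toy corpus (assumption): word -> frequency.
toy_corpus = {'low': 5, 'lower': 2, 'newest': 6, 'widest': 3}
print(learn_bpe(toy_corpus, 3))
```

Each merge adds one new symbol, so `-s $bpe_size` in the cell below controls how many such merge operations are learned.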

In [19]:
bpe_size = 4000

In [20]:
train_joint_file = os.path.join(data_dir, file_prefix+'train.'+src_lang+'-'+trg_lang)

src_files = {'train': train_src_file, 'dev': dev_src_file, 'test': test_src_file}
trg_files = {'train': train_trg_file, 'dev': dev_trg_file, 'test': test_trg_file}

vocab_src_file = os.path.join(data_dir, f'vocab.{bpe_size}.{src_lang}')
vocab_trg_file = os.path.join(data_dir, f'vocab.{bpe_size}.{trg_lang}')
bpe_file = os.path.join(data_dir, f'bpe.codes.{bpe_size}')

In [21]:
! cat $train_src_file $train_trg_file > $train_joint_file

! subword-nmt learn-bpe \
  --input $train_joint_file \
  -s $bpe_size \
  -o $bpe_file

In [22]:
src_bpe_files = {}
trg_bpe_files = {}
for split in ['train', 'dev', 'test']:
  src_input_file = src_files[split]
  trg_input_file = trg_files[split]
  src_output_file = src_input_file.replace(split, f'{split}.{bpe_size}.bpe')
  trg_output_file = trg_input_file.replace(split, f'{split}.{bpe_size}.bpe')
  src_bpe_files[split] = src_output_file
  trg_bpe_files[split] = trg_output_file

  ! subword-nmt apply-bpe \
    -c $bpe_file \
    < $src_input_file > $src_output_file

  ! subword-nmt apply-bpe \
    -c $bpe_file \
    < $trg_input_file > $trg_output_file


In [23]:
! wget https://raw.githubusercontent.com/joeynmt/joeynmt/master/scripts/build_vocab.py

--2021-05-19 12:39:27--  https://raw.githubusercontent.com/joeynmt/joeynmt/master/scripts/build_vocab.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.108.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2034 (2.0K) [text/plain]
Saving to: ‘build_vocab.py’


2021-05-19 12:39:28 (36.5 MB/s) - ‘build_vocab.py’ saved [2034/2034]



In [24]:
vocab_src_file = src_bpe_files['train']
vocab_trg_file = trg_bpe_files['train']
bpe_vocab_file = os.path.join(data_dir, f'joint.{bpe_size}bpe.vocab')

! python build_vocab.py  \
  $vocab_src_file $vocab_trg_file \
  --output_path $bpe_vocab_file

# Training

In [25]:
from google.colab import drive
drive_home = '/content/drive'
drive.mount(drive_home)

Mounted at /content/drive


In [26]:
g_drive_path = "/content/drive/My\ Drive/NMT_Lab2/models/%s-%s" % (src_lang, trg_lang)

In [27]:
experiment_name = 'ted_fr_en'

In [28]:
model_path = os.path.join(g_drive_path, experiment_name)

Copy the BPE merges to Google Drive so we don't lose them.

In [29]:
bpe_drive_path = "/content/drive/My\ Drive/NMT_Lab2/bpe/%s-%s" % (src_lang, trg_lang)
! mkdir -p $bpe_drive_path
! cp $bpe_file $bpe_drive_path 

**TODO:**

The following configuration file contains *three* bugs that prevent it from working (i.e. from quickly giving reasonable BLEU scores for translating from French into English). Find and fix all three. Try not to compare with the config from Lab 1 ;)

In [53]:
# Create the config
broken_config = """
name: "{name}"

data:
    src: "{source_language}"
    trg: "{target_language}"
    train: "{data_dir}/parallel_train.{bpe_size}.bpe"  
    dev:   "{data_dir}/parallel_dev.{bpe_size}.bpe"
    test:  "{data_dir}/parallel_test.{bpe_size}.bpe"
    level: "bpe"                   # Here we specify we're working on BPEs.
    lowercase: False                
    max_sent_length: 30             # Extend to longer sentences.
    src_vocab: "{src_vocab_path}"
    trg_vocab: "{trg_vocab_path}"

testing:
    beam_size: 5
    alpha: 1.0
    sacrebleu:                      # sacrebleu options
        remove_whitespace: True     # `remove_whitespace` option in sacrebleu.corpus_chrf() function (default: True)
        tokenize: "intl"            # `tokenize` option in sacrebleu.corpus_bleu() function (options include: "none" (use for already tokenized test data), "13a" (default minimal tokenizer), "intl" which mostly does punctuation and unicode, etc) 

training:
    #load_model: "{model_path}/1.ckpt" # if uncommented, load a pre-trained model from this checkpoint
    random_seed: 42
    optimizer: "adam"
    normalization: "tokens"
    adam_betas: [0.9, 0.999] 
    scheduling: "plateau"           # Alternative: try switching from plateau to Noam scheduling
    patience: 30                    # For plateau: decrease learning rate by decrease_factor if validation score has not improved for this many validation rounds.
    learning_rate_factor: 0.5       # factor for Noam scheduler (used with Transformer)
    learning_rate_warmup: 1000      # warmup steps for Noam scheduler (used with Transformer)
    decrease_factor: 0.7
    loss: "crossentropy"
    learning_rate: 0.00001
    learning_rate_min: 0.00000001
    weight_decay: 0.0
    label_smoothing: 0.1
    batch_size: 4096
    batch_type: "token"
    eval_batch_size: 3600
    eval_batch_type: "token"
    batch_multiplier: 1
    early_stopping_metric: "ppl"
    epochs: 5                     # Decrease while playing around and checking that it works. Around 30 is sufficient to check if it's working at all.
    validation_freq: 500          # Set to at least once per epoch.
    logging_freq: 100
    eval_metric: "bleu"
    model_dir: "{model_path}"
    overwrite: True                 # Set to True if you want to overwrite possibly existing models. 
    shuffle: True
    use_cuda: True
    max_output_length: 100
    print_valid_sents: [0, 1, 2, 3]
    keep_last_ckpts: 3

model:
    initializer: "xavier"
    bias_initializer: "zeros"
    init_gain: 1.0
    embed_initializer: "xavier"
    embed_init_gain: 1.0
    tied_embeddings: True        # Joint vocabulary.
    tied_softmax: True
    encoder:
        type: "transformer"
        num_layers: 6
        num_heads: 4             # Increase to 8 for larger data.
        embeddings:
            embedding_dim: 256   # Increase to 512 for larger data.
            scale: True
            dropout: 0.2
        # typically ff_size = 4 x hidden_size
        hidden_size: 256         # Increase to 512 for larger data.
        ff_size: 1024            # Increase to 2048 for larger data.
        dropout: 0.3
    decoder:
        type: "transformer"
        num_layers: 6
        num_heads: 4              # Increase to 8 for larger data.
        embeddings:
            embedding_dim: 256    # Increase to 512 for larger data.
            scale: True
            dropout: 0.2
        # typically ff_size = 4 x hidden_size
        hidden_size: 256         # TODO: Increase to 512 for larger data.
        ff_size: 1024            # TODO: Increase to 2048 for larger data.
        dropout: 0.3
""".format(name=experiment_name, 
           source_language=src_lang, 
           target_language=trg_lang,
           data_dir=data_dir, 
           model_path=model_path, 
           src_vocab_path=bpe_vocab_file,
           trg_vocab_path=bpe_vocab_file, 
           bpe_size=bpe_size)
with open("transformer_{name}.yaml".format(name=experiment_name),'w') as f:
    f.write(broken_config)

If you try running this training multiple times for debugging, set `overwrite` to `True` in the config.

In [None]:
!python -m joeynmt train transformer_ted_fr_en.yaml

2021-05-19 13:08:18,404 - INFO - root - Hello! This is Joey-NMT (version 1.3).
2021-05-19 13:08:18,420 - INFO - joeynmt.data - Loading training data...
2021-05-19 13:08:22,881 - INFO - joeynmt.data - Building vocabulary...
2021-05-19 13:08:23,115 - INFO - joeynmt.data - Loading dev data...
2021-05-19 13:08:23,138 - INFO - joeynmt.data - Loading test data...
2021-05-19 13:08:23,150 - INFO - joeynmt.data - Data loaded.
2021-05-19 13:08:23,151 - INFO - joeynmt.model - Building an encoder-decoder model...
2021-05-19 13:08:23,357 - INFO - joeynmt.model - Enc-dec model built.
2021-05-19 13:08:23.474453: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2021-05-19 13:08:25,112 - INFO - joeynmt.training - Total params: 12170752
2021-05-19 13:08:27,579 - INFO - joeynmt.helpers - cfg.name                           : ted_fr_en
2021-05-19 13:08:27,579 - INFO - joeynmt.helpers - cfg.data.src                       : fr
2021-05-19 13

If correct, the model should obtain roughly the following BLEU scores (or better!):


*   Step 500: 0.05
*   Step 1000: 1.24
*   Step 1500: 1.97
*   Step 2000: 3.71
*   ...
*   Step 3000: 7.12
*   Step 8000: 15.30
*   Step 17000: 21.69
*   Step 27000: 23.67  (around 2h training time)

You don't need to wait that long for the purpose of this exercise: Julia can provide a checkpoint for an already trained model :)


**TODO:**
1. Why does the number of source words reported in the log not match the specified number of BPE merges? Tip: browse the [subword-nmt GitHub](https://github.com/rsennrich/subword-nmt).
2. Imagine your job is to provide the best translation system as soon as possible. Try changing a few hyperparameters to see what the best score is that you can get within three epochs of training. You may also coordinate this with your colleagues.
  * Suggestions: try changing the BPE size, learning rate, or batch size.
  * Recommendation: create a new configuration and experiment directory for each experiment so you can tell them apart.
  * You can spend endless time on this, but try to select a few settings that you'd hope could improve the result. 
  * Do you observe any tendency? Compare with your colleagues.
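
To keep experiments apart, one option is to derive a distinct name and model directory from the hyperparameters of each run. The following is only a sketch under assumptions: the helper `experiment_grid`, the grid values, and the `/content/models` root are hypothetical and not part of Joey NMT.

```python
import itertools
import os


def experiment_grid(base_name, grid, root_dir):
  """Yield (name, overrides, model_dir) for every combination in `grid`."""
  keys = sorted(grid)
  for values in itertools.product(*(grid[key] for key in keys)):
    overrides = dict(zip(keys, values))
    suffix = '_'.join(f'{key}-{value}' for key, value in overrides.items())
    name = f'{base_name}_{suffix}'
    yield name, overrides, os.path.join(root_dir, name)


# Hypothetical values, chosen only to show the mechanics.
grid = {'bpe_size': [2000, 4000], 'batch_size': [2048, 4096]}
for name, overrides, model_dir in experiment_grid('ted_fr_en', grid, '/content/models'):
  print(name, overrides, model_dir)
```

Each `name` and `model_dir` could then be filled into its own copy of the configuration, so that logs and checkpoints of different runs never overwrite each other.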


*Notes*


# Testing

For the following exercises you may either use your own model or the trained one provided by Julia (trained for 30 epochs).

Now that we have a trained model, let's see how well it does. We'll probe it with the following kinds of examples:

1. A *training/memorization/overfitting* check: Did the model learn to perfectly translate the training set?
2. Unseen but from the *same domain*: Did the model learn to generalize to unseen examples?
3. *Out-of-domain*: Can the model translate a random sentence from the source language?

It will be increasingly hard for the model to do well on these. But even in the training set you can probably find outliers that the model does not translate well.

In [33]:
from joeynmt.helpers import load_config
import yaml


def download_pretrained_model_from_gdrive(
    checkpoint='1sIaNogftpt-moKEKRBMAKbE4QdOb7XwM',
    config='1_FpHfRn8bxLu_pAUgj99jtZuBcCABgXV',
    src_vocabulary='1esULLiG-2fS6Ucj2LMndUoID8WaY_QuZ',
    trg_vocabulary='1sdygCZxK6h8M1khDlh-TLAq0Y4DNczl_',
    bpe_merges='17XeygY048oXQHHzmH4u_hiJl1GndkiSN',
    directory='/content/pretrained_model'):
  
  """Download pretrained model from ids and place it in given directory. 
  Adjust paths in config as needed.
  Default ids are for a model as specified above, but trained for the full
  30 epochs. 
  """

  # Download files and place them into the new directory.
  original_config = os.path.join(directory, 'original_config.yaml')
  new_config = os.path.join(directory, 'config.yaml')
  checkpoint_path = os.path.join(directory, 'best.ckpt')
  trg_vocab_path = os.path.join(directory, 'trg_vocab.txt')
  src_vocab_path = os.path.join(directory, 'src_vocab.txt')
  bpe_path = os.path.join(directory, 'bpe.merges')

  def gdown_by_id(id, output):
    ! gdown 'https://drive.google.com/uc?id='$id -O $output

  ! mkdir -p $directory
  gdown_by_id(checkpoint, checkpoint_path)
  gdown_by_id(config, original_config)
  gdown_by_id(src_vocabulary, src_vocab_path)
  gdown_by_id(trg_vocabulary, trg_vocab_path)
  gdown_by_id(bpe_merges, bpe_path)

  # Overwrite paths in config.
  config = load_config(original_config)
  config['data']['src_vocab'] = src_vocab_path
  config['data']['trg_vocab'] = trg_vocab_path
  config['training']['model_dir'] = directory
  config['training']['load_model'] = checkpoint_path
  with open(new_config, 'w') as cfile:
    yaml.dump(config, cfile)
  return new_config, bpe_path


In [34]:
# Download a pretrained model.
pretrained_config, pretrained_bpe = download_pretrained_model_from_gdrive()

Downloading...
From: https://drive.google.com/uc?id=1sIaNogftpt-moKEKRBMAKbE4QdOb7XwM
To: /content/pretrained_model/best.ckpt
157MB [00:02, 57.5MB/s]
Downloading...
From: https://drive.google.com/uc?id=1_FpHfRn8bxLu_pAUgj99jtZuBcCABgXV
To: /content/pretrained_model/original_config.yaml
100% 3.70k/3.70k [00:00<00:00, 5.78MB/s]
Downloading...
From: https://drive.google.com/uc?id=1esULLiG-2fS6Ucj2LMndUoID8WaY_QuZ
To: /content/pretrained_model/src_vocab.txt
100% 26.3k/26.3k [00:00<00:00, 1.75MB/s]
Downloading...
From: https://drive.google.com/uc?id=1sdygCZxK6h8M1khDlh-TLAq0Y4DNczl_
To: /content/pretrained_model/trg_vocab.txt
100% 26.3k/26.3k [00:00<00:00, 3.83MB/s]
Downloading...
From: https://drive.google.com/uc?id=17XeygY048oXQHHzmH4u_hiJl1GndkiSN
To: /content/pretrained_model/bpe.merges
100% 33.8k/33.8k [00:00<00:00, 55.1MB/s]


**TODO:**


1.   Pick 2-5 sentences each from the three sets described above and translate them with your model in `translate` mode. Remember that you need to split them into BPEs first (already done for 1 and 2; example code for that in Lab 1).
2.   Compare their translations: Can you tell from these examples what kind of data the model was trained on? Anything surprisingly good or bad?
3. Choose one sentence that the model translated really well. Can you perturb it so that it's still very similar to the original but the translation is very different or significantly worse? 

Small changes in the input leading to only small changes in the output can be seen as a criterion for robustness. The harder it is to find such adversarial inputs, the more robust the model is.
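
If you'd like a starting point for the perturbation exercise, the sketch below generates a few simple variants of a sentence (lowercasing, dropping the final punctuation, swapping neighbouring words, and character-swap typos). The function `perturbations` and the choice of perturbation types are assumptions for illustration; there are many other ways to perturb an input.

```python
def perturbations(sentence):
  """Yield (description, variant) pairs that differ slightly from `sentence`."""
  words = sentence.split()
  yield 'lowercased', sentence.lower()
  yield 'no final punctuation', sentence.rstrip('.!?')
  for i in range(len(words) - 1):
    # Swap two neighbouring words.
    swapped = words[:i] + [words[i + 1], words[i]] + words[i + 2:]
    yield f'swapped words {i} and {i + 1}', ' '.join(swapped)
  for i, word in enumerate(words):
    if len(word) > 3:
      # Swap the second and third character to imitate a typo.
      typo = word[0] + word[2] + word[1] + word[3:]
      yield f'typo in word {i}', ' '.join(words[:i] + [typo] + words[i + 1:])


for description, variant in perturbations('Merci beaucoup, Chris.'):
  print(f'{description}: {variant}')
```

Remember that each variant still has to be split into BPEs before you enter it in `translate` mode.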



In [35]:
# Either use $pretrained_config for the pretrained model or your own trained model.
!python -m joeynmt translate $pretrained_config

2021-05-18 14:49:08,811 - INFO - root - Hello! This is Joey-NMT (version 1.3).
2021-05-18 14:49:12,731 - INFO - joeynmt.model - Building an encoder-decoder model...
2021-05-18 14:49:12,934 - INFO - joeynmt.model - Enc-dec model built.

Please enter a source sentence (pre-processed): 
bonjour comment tu vas
JoeyNMT: Hypotheses ranked by score
JoeyNMT #1: How do you do that?

Please enter a source sentence (pre-processed): 
salut
JoeyNMT: Hypotheses ranked by score
JoeyNMT #1: We have to do that.

Please enter a source sentence (pre-processed): 
je pense que tu es malade
JoeyNMT: Hypotheses ranked by score
JoeyNMT #1: I think you are.

Please enter a source sentence (pre-processed): 
stop
JoeyNMT: Hypotheses ranked by score
JoeyNMT #1: stop there.

Please enter a source sentence (pre-processed): 
c'est fini
JoeyNMT: Hypotheses ranked by score
JoeyNMT #1: That's it.

Please enter a source sentence (pre-processed): 
quit
JoeyNMT: Hypotheses ranked by score
JoeyNMT #1: We have to do that.



*Notes:*

# Evaluation

We now have an intuition of what the model can do and where its limits are. During validation, we trust the BLEU score to tell us whether the model is progressing.



1.   Compute the `sacrebleu` ([GitHub](https://github.com/mjpost/sacrebleu)) score for the dev set translations of a chosen step, which are stored in your model directory (`.hyps`). Below is an example call. Does the score match the result reported in `validations.txt` and `train.log`?
2.   In the configuration we chose one particular tokenizer, but there are other options (hint: explore the sacrebleu documentation). Does the reported score change? If so, why do you think this happens?
3.   The `sacrebleu` library also implements the ChrF score. Compute ChrF as well as BLEU for two validation steps. How do the two steps differ with respect to ChrF and BLEU? Are the differences comparable?

(We did not tokenize our data before feeding it to the model. Do you think it makes a difference? You can try it out with the `sacremoses` library that implements tokenizers.)

In [38]:
# Helper function
def read_sentences(inputfile):
  """Read sentences from file into list."""
  lines = []
  with open(inputfile, 'r') as ofile:
    for line in ofile:
      lines.append(line.strip())
  print(f'Read {len(lines)} sentences from file {inputfile}.')
  return lines

In [None]:
# Model outputs
hyps = read_sentences('path/to/.hyps')
# And references for the same dev set
refs = read_sentences('path/to/references')

In [None]:
import sacrebleu
sacrebleu.corpus_bleu(hyps, [refs])

Note the one-element list that we're passing to the BLEU score calculation. BLEU was originally proposed to compute quality scores relative to multiple reference translations, so the function expects a list of reference sets. In practice, multiple references are rarely available, so we have to work with what we have.
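
To see why additional references can matter, here is a toy sketch of the clipped (modified) unigram precision that is one ingredient of BLEU. It is deliberately simplified: unigrams only, whitespace tokenization, no brevity penalty, and it is *not* how `sacrebleu` is implemented. The function name and example sentences are assumptions for illustration.

```python
import collections


def clipped_unigram_precision(hypothesis, references):
  """Modified unigram precision of one hypothesis against several references."""
  hyp_counts = collections.Counter(hypothesis.split())
  # For each word, the maximum count observed in any single reference.
  max_ref_counts = collections.Counter()
  for reference in references:
    for word, count in collections.Counter(reference.split()).items():
      max_ref_counts[word] = max(max_ref_counts[word], count)
  # A hypothesis word only counts as often as it appears in some reference.
  matched = sum(min(count, max_ref_counts[word])
                for word, count in hyp_counts.items())
  return matched / max(sum(hyp_counts.values()), 1)


hypothesis = 'the cat sat on the mat'
reference_a = 'the cat is on the mat'
reference_b = 'a cat sat on a mat'
print(clipped_unigram_precision(hypothesis, [reference_a]))
print(clipped_unigram_precision(hypothesis, [reference_a, reference_b]))
```

With the second reference, the word "sat" also finds a match, so the precision goes up: a valid translation is less likely to be penalized just because it words things differently from a single reference.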

# Extra: Backtranslation & Multilingual

These experiments take more time than you'll have in the lab and relate to contents covered later this week. They might be interesting to explore if you want to keep learning about NMT :)

### Backtranslation

The downloaded data also contains a `train.en` file: monolingual data for English. This can be used to improve our model with backtranslation. There are multiple steps and options involved:
  * First, you need to train an `en-fr` model.
  * Then use this reverse model to translate the monolingual data (or a part of it, depending on translation speed) into French.
  * Now you have synthetic training data that you can either 1) mix with the original training data as it is, or 2) mix in at a certain ratio, since this data probably has lower quality.
  * You can then either 1) further train the original `fr-en` model on this data, or 2) retrain a `fr-en` model to see if it gets better than the original model.
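
The mixing step can be sketched as follows. This is a minimal illustration under assumptions: sentence pairs are held in memory as `(source, target)` tuples, and the helper `mix_parallel` with its `synthetic_per_original` ratio is hypothetical; choosing a good ratio is part of the experiment.

```python
import random


def mix_parallel(original, synthetic, synthetic_per_original=1.0, seed=42):
  """Combine original pairs with a subsample of synthetic pairs and shuffle."""
  rng = random.Random(seed)
  # Cap the synthetic data at the requested ratio relative to the original data.
  num_synthetic = min(len(synthetic), int(len(original) * synthetic_per_original))
  mixed = list(original) + rng.sample(synthetic, num_synthetic)
  rng.shuffle(mixed)
  return mixed


# Toy stand-ins for real data: (French source, English target) pairs.
original = [(f'fr {i}', f'en {i}') for i in range(4)]
# Synthetic pairs: back-translated French paired with real monolingual English.
synthetic = [(f'synthetic fr {i}', f'mono en {i}') for i in range(10)]
mixed = mix_parallel(original, synthetic, synthetic_per_original=0.5)
print(len(mixed))
```

The mixed pairs would then go through the same BPE and file-writing steps as the original training data.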


*Notes:*

### Multilingual

The downloaded directory also contains data for other languages paired with English on the target side: `ar`, `de`, `ja`, `ko`, `zh`. Additional training data from other languages often helps to improve translation quality when training data is small.

We'll try out the "many-to-one" approach here: learning to translate from many languages into English. For the opposite direction, we would need to add special target language tags to the source (e.g. `<2fr>`, `<2ja>`) to tell the model which language it should translate into.
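
For the one-to-many direction, adding such tags could look like the sketch below. The helper `add_target_tag` is hypothetical; the important practical detail (an assumption about your pipeline, not something Joey NMT does for you) is that the tag should remain a single token, for instance by adding it after BPE has been applied.

```python
def add_target_tag(source_sentences, target_language):
  """Prefix each source sentence with a tag such as <2fr>."""
  tag = f'<2{target_language}>'
  return [f'{tag} {sentence}' for sentence in source_sentences]


# English sources that should be translated into French and Japanese.
tagged_fr = add_target_tag(['Thank you so much, Chris.'], 'fr')
tagged_ja = add_target_tag(['Thank you so much, Chris.'], 'ja')
print(tagged_fr + tagged_ja)
```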

* First, select one or more language pairs to add to fr-en.
* Repeat the pre-processing pipeline for them. Training and dev sets should get concatenated for joint training. BPE training should also be done on a concatenation of the training sets for all languages, so that the sub-word merges reflect all languages.
* Depending on the number of languages, your concatenated dev set might grow too large for regular validation during training, so you can also just take a smaller subset from each language and combine them. 
* Do you find improvements over the original model? For a direct comparison you would need to translate the original fr-en dev or test set with the multilingual model (not the concatenated ones used for this experiment) and compare the scores.
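
Building a smaller combined dev set could be sketched like this. The helper `combined_dev_subset`, the dictionary layout, and the subset size are assumptions for illustration; in practice you would pick a size that keeps validation reasonably fast.

```python
import random


def combined_dev_subset(dev_sets, per_language=2, seed=42):
  """Take up to `per_language` random pairs from each language's dev set."""
  rng = random.Random(seed)
  combined = []
  for language in sorted(dev_sets):
    pairs = dev_sets[language]
    # Never ask for more pairs than a language actually has.
    combined.extend(rng.sample(pairs, min(per_language, len(pairs))))
  return combined


# Toy stand-ins: language -> list of (source, English target) pairs.
dev_sets = {
    'fr': [(f'fr {i}', f'en {i}') for i in range(5)],
    'de': [('de 0', 'en 0')],
}
subset = combined_dev_subset(dev_sets, per_language=2)
print(len(subset))
```

Sampling pairs rather than separate source and target lines keeps both sides aligned, which matters as soon as the subset is written back to parallel files.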


*Notes:*