# WMT reproduction

# 1. Building the vocabularies for the model

Before creating the vocabularies, I needed to prepare some files.

1. I filtered the web_monolingual and witaj_monolingual Upper Sorbian data so that these files only keep lines whose characters all occur in the other files.  
To do that, I collected the set of characters that occur in any of those other files with the script known-hsb-cs-chars.sh, which writes known-hsb-cs-chars.txt:

```bash
# known-hsb-cs-chars.sh:
( python charset.py DGT.cs-de.cs; \
  python charset.py Europarl.cs-de.cs; \
  python charset.py News-Commentary.cs-de.cs; \
  python charset.py newstest2019-csde.cs; \
  python charset.py newstest2019-decs.cs; \
  python charset.py OpenSubtitles.cs-de.cs; \
  python charset.py WMT-News.cs-de.cs; \
  python charset.py sorbian_institute_monolingual.hsb; \
  python charset.py train.hsb-de.hsb ) | sort | uniq > known-hsb-cs-chars.txt
```
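charset.py itself is not shown here. As a rough idea of what such a helper could look like, here is a minimal sketch that prints every distinct non-whitespace character of a file, one per line. This is an assumption for illustration (including the name `charset_sketch.py` and the choice to skip whitespace), not the original script:

```python
# charset_sketch.py: hypothetical stand-in for charset.py, not the original script
import sys


def charset(lines):
    """Return the set of distinct non-whitespace characters in an iterable of lines."""
    chars = set()
    for line in lines:
        chars.update(c for c in line if not c.isspace())
    return chars


if __name__ == "__main__" and len(sys.argv) > 1:
    # Print one character per line so that `sort | uniq` can merge several files.
    with open(sys.argv[1], encoding="utf-8") as f:
        for c in sorted(charset(f)):
            print(c)
```

Printing one character per line is what lets the shell pipeline above take the union over all files.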

Then I ran filtering.py on the web_monolingual and witaj_monolingual data to filter them. filtering.py contains this:

```python
# filtering.py:
import sys

if __name__ == "__main__":
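    # Load the allowed characters (one per line) and always allow the space.
    with open("known-hsb-cs-chars.txt", encoding="utf-8") as f:
        chars = set(map(str.strip, f))
        chars.add(" ")

    # Print only the lines that consist entirely of allowed characters.
    with open(sys.argv[1], encoding="utf-8") as f:
        for line in map(str.strip, f):
            if all(c in chars for c in line):
                print(line)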
```

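2. After filtering, I duplicated the HSB data (the monolingual files and both sides of the train.hsb-de parallel corpus) so that HSB is well represented when the vocabulary is built and HSB tokens are not too rare to earn a place in it. The files carry the prefix 25; note that the Perl expression `x 26` used here writes 26 copies in total, that is, the original plus 25 duplicates. Example: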
```python
! perl -0777pe '$_=$_ x 26' train.hsb-de.de > 25train.hsb-de.de
```
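For readers who do not use Perl, the same repetition can be sketched in Python. This is only an illustrative equivalent, not part of the original pipeline; `repeat_file` and the toy file names are made up, and the number of copies is a parameter (the Perl command above uses 26):

```python
import tempfile
from pathlib import Path


def repeat_file(src, dst, copies):
    """Write `copies` back-to-back copies of src to dst and return the line count of dst."""
    text = Path(src).read_text(encoding="utf-8")
    if text and not text.endswith("\n"):
        text += "\n"  # keep the last line of one copy from merging with the next copy
    Path(dst).write_text(text * copies, encoding="utf-8")
    return text.count("\n") * copies


# Tiny demo on a throwaway file (the file names here are made up).
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "toy.hsb"
    dst = Path(tmp) / "3toy.hsb"
    src.write_text("line one\nline two\n", encoding="utf-8")
    n_lines = repeat_file(src, dst, copies=3)
    repeated = dst.read_text(encoding="utf-8")

print(n_lines)  # 6
```

Like the Perl one-liner, this reads the whole file into memory, which is fine for the small HSB files.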

Then I concatenated all the files I wanted to use for building the vocabularies, as follows. The resulting combined files are the ones I later pass to subword-nmt. At this point it does not matter that the source and target files differ in size.


```python
! cat 25train.hsb-de.de \
    OpenSubtitles.cs-de.de \
    DGT.cs-de.de \
    Europarl.cs-de.de \
    News-Commentary.cs-de.de \
    WMT-News.cs-de.de > train_file.de
    
! cat 25train.hsb-de.hsb \
    25sorbian_institute_monolingual.hsb \
    25web_monolingual_filtered.hsb \
    25witaj_monolingual_filtered.hsb \
    OpenSubtitles.cs-de.cs \
    DGT.cs-de.cs \
    Europarl.cs-de.cs \
    News-Commentary.cs-de.cs \
    WMT-News.cs-de.cs > train_file.cshsb
```

All of the following commands are in the file creating_vocabs_and_train_val_files.sh and follow the subword-nmt tutorial.  
First we run learn-bpe on the concatenation of our training files. The parameters of this operation are taken from the article. After that we run apply-bpe on the source and target files separately and pipe the result through get-vocab to get two separate vocabularies:

```bash
# learn BPE on concatenation of train files and extract vocabs
cat train_file.de train_file.cshsb | subword-nmt learn-bpe -s 20000 -o codes --min-frequency 2 --total-symbols

subword-nmt apply-bpe -c codes < train_file.de | subword-nmt get-vocab > vocab_file.de
subword-nmt apply-bpe -c codes < train_file.cshsb | subword-nmt get-vocab > vocab_file.cshsb
```
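The files written by get-vocab contain one `token count` pair per line, most frequent token first. A quick way to inspect them from Python is sketched below; `read_vocab` is a hypothetical helper and the sample lines are made up purely to show the format:

```python
def read_vocab(lines):
    """Parse `token count` lines (the get-vocab output format) into a dict."""
    vocab = {}
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        token, count = parts
        vocab[token] = int(count)
    return vocab


# Made-up sample lines, only to illustrate the format.
sample_de = ["die 120", "und 95", "Haus@@ 7"]
sample_cshsb = ["a 200", "und 3", "do@@ 40"]

de = read_vocab(sample_de)
cshsb = read_vocab(sample_cshsb)
shared = sorted(set(de) & set(cshsb))  # tokens present on both sides
print(len(de), len(cshsb), shared)
```

With the real files, `read_vocab(open("vocab_file.de", encoding="utf-8"))` would give the German side; because the BPE codes are learned jointly, some overlap between the two vocabularies is expected.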

We will explain the BPE operations in a separate notebook; here we proceed with building the vocabularies.

After getting the vocabularies from subword-nmt get-vocab, we use the following script to make them compatible with the Sockeye model. The script is in the file subword-nmt_vocab2sockeye_vocab.py:

```python
#!/usr/bin/env python3
# Diplom/WMT_repeat/parent_child/subword-nmt_vocab2sockeye_vocab.py

import json
import sys

from argparse import ArgumentParser

def getArgs():
    usage="subword-nmt_vocab2sockeye_vocab.py  subword_nmt_vocab [sockeye_vocab_json]"
    help = ''
    parser = ArgumentParser(usage=usage, description=help, add_help=True)
    parser.add_argument("vocab_filename", type=str, help="Subword-nmt's vocabulary filename")
    parser.add_argument("vocab_json",
         nargs   = '?',
         type    = lambda f: open(f, mode='w', encoding='UTF-8'),
         default = sys.stdout,
         help    = "output file [sys.stdout]")
    parser.add_argument("--add_bt_tag",
         dest    = "add_bt_tag",
         action  = 'store_true',
         default = False,
         help    = "Also include BT tag in vocab [%(default)s]")
    parser.add_argument("--add_generic_tags",
         dest    = "add_generic_tags",
         action  = 'store_true',
         default = False,
         help    = "Also include the generic tags <TAG01>..<TAG25> in vocab [%(default)s]")
    parser.add_argument("--add_glossaries",
         dest    = "add_glossaries",
         nargs   = '?',
         default = None,
         help    = "Also include the glossaries in vocab [%(default)s]")


    cmd_args = parser.parse_args()

    return cmd_args




if __name__ == '__main__':
    args = getArgs()
    vocab = dict({
        "<pad>": 0,
        "<unk>": 1,
        "<s>": 2,
        "</s>": 3,
        })

    if args.add_bt_tag:
        vocab.update({'<BT>': len(vocab)})

    if args.add_generic_tags:
        vocab.update({ f'<TAG{i:02}>' : i+len(vocab)-1 for i in range(1, 26) })

    with open(args.vocab_filename, mode='r', encoding='UTF-8') as f:
        # Each get-vocab line is "token count"; keep the token and skip blank lines.
        tokens = (l.split()[0] for l in f if l.strip())
        vocab.update({ k: v for v, k in enumerate(tokens, start=len(vocab)) })

    if args.add_glossaries is not None:
        last_token_id = len(vocab.keys())
        with open(args.add_glossaries, mode='rt', encoding='UTF-8') as f:
            for token in map(str.strip, f.readlines()):
                if token != '' and token not in vocab:
                    vocab[token] = last_token_id
                    last_token_id += 1

    json.dump(vocab, args.vocab_json, indent=2, ensure_ascii=False)

```
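To make the resulting id layout easier to see, here is a condensed sketch of the same assignment logic as a plain function. `build_vocab` is a hypothetical helper and `<URL>` is a made-up glossary entry. One deliberate difference from the script above: a token that appears twice keeps its first id, so the ids stay contiguous.

```python
def build_vocab(tokens, add_bt=False, add_generic=False, glossaries=()):
    """Assign ids: special symbols first, then optional tags, BPE tokens, glossaries."""
    vocab = {"<pad>": 0, "<unk>": 1, "<s>": 2, "</s>": 3}
    if add_bt:
        vocab["<BT>"] = len(vocab)
    if add_generic:
        for i in range(1, 26):
            vocab[f"<TAG{i:02}>"] = len(vocab)
    for token in tokens:
        vocab.setdefault(token, len(vocab))  # assumption: first occurrence wins
    for token in glossaries:
        if token and token not in vocab:
            vocab[token] = len(vocab)
    return vocab


vocab = build_vocab(["die", "und"], add_bt=True, add_generic=True, glossaries=["<URL>"])
print(vocab["<BT>"], vocab["<TAG01>"], vocab["<TAG25>"], vocab["die"], vocab["<URL>"])
# 4 5 29 30 32
```

With both tag options enabled, the first BPE token therefore gets id 30, and glossary entries are appended after the last BPE token.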

This script adds the special symbols and tags and writes the vocabularies as JSON files suitable for the Sockeye model. Here is how we apply it to our vocabularies:
```bash
# adding tags and creating sockeye-compatible vocabs
python subword-nmt_vocab2sockeye_vocab.py --add_bt_tag --add_generic_tags --add_glossaries glossaries_file.txt vocab_file.de > augmented_vocab_de.json
python subword-nmt_vocab2sockeye_vocab.py --add_bt_tag --add_generic_tags --add_glossaries glossaries_file.txt vocab_file.cshsb > augmented_vocab_cshsb.json
```
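Before passing the JSON files on, a small sanity check can catch a broken vocabulary early. The sketch below assumes, as far as I understand Sockeye's expectations, that the four special symbols must sit at ids 0 to 3 and that the ids should form a gap-free range starting at 0. `check_vocab` is a hypothetical helper, not part of Sockeye:

```python
import json


def check_vocab(vocab):
    """Return a list of human-readable problems found in a token -> id mapping."""
    problems = []
    for token, expected in (("<pad>", 0), ("<unk>", 1), ("<s>", 2), ("</s>", 3)):
        if vocab.get(token) != expected:
            problems.append(f"{token} should have id {expected}, found {vocab.get(token)}")
    if sorted(vocab.values()) != list(range(len(vocab))):
        problems.append("ids are not a contiguous range starting at 0")
    return problems


# In-memory examples; a real check would use json.load on the generated file.
good = json.loads('{"<pad>": 0, "<unk>": 1, "<s>": 2, "</s>": 3, "die": 4}')
bad = json.loads('{"<pad>": 0, "<unk>": 1, "<s>": 2, "</s>": 3, "die": 5}')

print(check_vocab(good))  # []
print(check_vocab(bad))   # ['ids are not a contiguous range starting at 0']
```

For the real files, something like `check_vocab(json.load(open("augmented_vocab_de.json", encoding="utf-8")))` should return an empty list.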

We will pass these vocabularies as parameters to our Sockeye model.  
In the next chapter we describe how we create the actual BPE files for the parent and child models separately.