# Building the vocabularies for the model

Here I present the pipeline for building the vocabularies for the machine translation model from the data.  
I have the following files for the parent model:  
- 1M.ru: file with 1 million Russian sentences  
- 1M.chv: file with 1 million Chuvash sentences  

This is the data I have for the child model:  
- khakas_ru.txt: file with 39523 Russian sentences  
- khakas_kh.txt: file with 39523 Khakas sentences

## The preparation of the text files for translation model must include the following steps:  
### 1. checking for empty lines  

I know for sure that the Ru-Kh data doesn't include empty lines because I prepared the corpora. Now I will check the Ru-Chv data by counting the lines that contain no non-whitespace characters.

In [9]:
! grep -cvE '[^[:space:]]' 1M.ru
! grep -cvE '[^[:space:]]' 1M.chv

0
0
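
As a cross-check of the `grep` count, the same test can be sketched in Python. The function name `count_blank_lines` and the throwaway demo file are my own assumptions for illustration. Python's `str.strip()` treats a slightly wider set of Unicode characters as whitespace than `[[:space:]]` does in some locales, so the two counts may differ on unusual input.

```python
import os
import tempfile

def count_blank_lines(path):
    """Return the number of lines that are empty or whitespace-only."""
    with open(path, encoding="utf-8") as f:
        return sum(1 for line in f if not line.strip())

# Tiny demonstration on a throwaway file (hypothetical content, not the real corpus).
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False, encoding="utf-8") as tmp:
    tmp.write("Привет.\n\n   \nПока.\n")
demo_blank = count_blank_lines(tmp.name)
os.remove(tmp.name)
print(demo_blank)  # 2: one empty line and one line with spaces only
```

For the real data, `count_blank_lines("1M.ru")` and `count_blank_lines("1M.chv")` should both return 0, matching the output above.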


### 2. checking for repeated lines

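I checked the Ru-Chv files for repeated lines using the commands below. Each command stops at the first line that occurs for the third time (or, with `== 2`, for the fourth time) and prints its line number and text; the result is noted in the comment under each command: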

In [None]:
! awk 'seen[$0]++ == 1 { lines[$0]; next } $0 in lines { print NR, $0; exit }' 1M.chv
# 241307 — Тӗлӗнмелле.
! awk 'seen[$0]++ == 1 { lines[$0]; next } $0 in lines { print NR, $0; exit }' 1M.ru
# 43899 — Конечно.
! awk 'seen[$0]++ == 2 { lines[$0]; next } $0 in lines { print NR, $0; exit }' 1M.ru
# 54716 Вот и все.

There were repeated lines, but having looked at them closely, I found that a line repeated in one language does not have duplicates in the other language; it is translated with synonyms instead. So this is not a mistake in the dataset.
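
The observation above can be checked more systematically by counting repeats per side and repeats of whole sentence pairs. This is a minimal sketch, not the check I actually ran. The function name `duplicate_report` is hypothetical, and the target strings in the toy example are placeholders rather than real Chuvash translations.

```python
from collections import Counter

def duplicate_report(src_lines, tgt_lines):
    """Count distinct lines repeated on each side and distinct pairs repeated as a whole."""
    if len(src_lines) != len(tgt_lines):
        raise ValueError("parallel files must have the same number of lines")
    src_counts = Counter(src_lines)
    tgt_counts = Counter(tgt_lines)
    pair_counts = Counter(zip(src_lines, tgt_lines))
    return {
        "repeated_src": sum(1 for c in src_counts.values() if c > 1),
        "repeated_tgt": sum(1 for c in tgt_counts.values() if c > 1),
        "repeated_pairs": sum(1 for c in pair_counts.values() if c > 1),
    }

# Toy example: the Russian line repeats, but each occurrence has a different translation.
ru = ["Конечно.", "Да.", "Конечно."]
chv = ["placeholder translation A", "placeholder translation B", "placeholder translation C"]
report = duplicate_report(ru, chv)
print(report)  # {'repeated_src': 1, 'repeated_tgt': 0, 'repeated_pairs': 0}
```

If `repeated_pairs` stays at 0 (or close to it) on the real files, the repeated lines are indeed paired with different translations and do not need to be removed.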

### 3. shuffling the lines
It is important to shuffle the lines so that the model trains better and so that data from different sources is represented in the train, val and test sets.

In [None]:
# Join the two sides with a tab (assumed not to occur inside sentences), shuffle the pairs, then split them back.
! paste 1M.ru 1M.chv | shuf | awk -F'\t' '{ print $1 > "1M_shuffed.ru" ; print $2 > "1M_shuffed.chv" }'
! paste khakas_ru.txt khakas_kh.txt | shuf | awk -F'\t' '{ print $1 > "khakas_shuffed.ru" ; print $2 > "khakas_shuffed.kh" }'
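
The same shuffle can be sketched in Python, where the two sides are kept together as pairs in memory, so the result does not depend on a separator character that might also occur inside a sentence. The function name `shuffle_parallel`, the `seed` parameter and the placeholder sentences are assumptions made for this illustration.

```python
import random

def shuffle_parallel(src_lines, tgt_lines, seed=0):
    """Shuffle two aligned lists with the same permutation."""
    if len(src_lines) != len(tgt_lines):
        raise ValueError("parallel files must have the same number of lines")
    pairs = list(zip(src_lines, tgt_lines))
    random.Random(seed).shuffle(pairs)  # a fixed seed makes the shuffle reproducible
    return [s for s, _ in pairs], [t for _, t in pairs]

# Placeholder data: line i on one side belongs to line i on the other.
src = [f"ru sentence {i}: text" for i in range(10)]
tgt = [f"kh sentence {i}" for i in range(10)]
src_shuffled, tgt_shuffled = shuffle_parallel(src, tgt, seed=42)

# Every shuffled pair still carries the same index on both sides.
aligned = all(
    s.split()[2].rstrip(":") == t.split()[2]
    for s, t in zip(src_shuffled, tgt_shuffled)
)
print(aligned)  # True
```

Fixing the seed is a design choice: it lets me regenerate exactly the same train, val and test sets later.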

### 4. Splitting the data into train, val and test sets

We need to split these files into train, val and test sets.
In the original article the train, val and test files had the following sizes:

In [1]:
# train parent
! wc -l /Users/macbook/Documents/NewFolder/Diplom/WMT_repeat/parent_child/train_file_parent.de
! wc -l /Users/macbook/Documents/NewFolder/Diplom/WMT_repeat/parent_child/train_file_parent.cs
# val parent
! wc -l /Users/macbook/Documents/NewFolder/Diplom/WMT_repeat/data/de-cs_parallel/newstest2019-decs.cs
! wc -l /Users/macbook/Documents/NewFolder/Diplom/WMT_repeat/data/de-cs_parallel/newstest2019-decs.de
# train child
! wc -l /Users/macbook/Documents/NewFolder/Diplom/WMT_repeat/parent_child/train.hsb-de.de
! wc -l /Users/macbook/Documents/NewFolder/Diplom/WMT_repeat/parent_child/train.hsb-de.hsb
# val child
! wc -l /Users/macbook/Documents/NewFolder/Diplom/WMT_repeat/parent_child/devel.hsb-de.de
! wc -l /Users/macbook/Documents/NewFolder/Diplom/WMT_repeat/parent_child/devel.hsb-de.hsb
# test child
! wc -l /Users/macbook/Documents/NewFolder/Diplom/WMT_repeat/parent_child/devel_test.hsb-de.de
! wc -l /Users/macbook/Documents/NewFolder/Diplom/WMT_repeat/parent_child/devel_test.hsb-de.hsb

 22097622 /Users/macbook/Documents/NewFolder/Diplom/WMT_repeat/parent_child/train_file_parent.de
 22097622 /Users/macbook/Documents/NewFolder/Diplom/WMT_repeat/parent_child/train_file_parent.cs
    1997 /Users/macbook/Documents/NewFolder/Diplom/WMT_repeat/data/de-cs_parallel/newstest2019-decs.cs
    1997 /Users/macbook/Documents/NewFolder/Diplom/WMT_repeat/data/de-cs_parallel/newstest2019-decs.de
   60000 /Users/macbook/Documents/NewFolder/Diplom/WMT_repeat/parent_child/train.hsb-de.de
   60000 /Users/macbook/Documents/NewFolder/Diplom/WMT_repeat/parent_child/train.hsb-de.hsb
    2000 /Users/macbook/Documents/NewFolder/Diplom/WMT_repeat/parent_child/devel.hsb-de.de
    2000 /Users/macbook/Documents/NewFolder/Diplom/WMT_repeat/parent_child/devel.hsb-de.hsb
    2000 /Users/macbook/Documents/NewFolder/Diplom/WMT_repeat/parent_child/devel_test.hsb-de.de
    2000 /Users/macbook/Documents/NewFolder/Diplom/WMT_repeat/parent_child/devel_test.hsb-de.hsb


The data I have:

In [4]:
! wc -l /Users/macbook/Documents/NewFolder/Diplom/Rus_work/1M.chv
! wc -l /Users/macbook/Documents/NewFolder/Diplom/Rus_work/1M.ru

 1002013 /Users/macbook/Documents/NewFolder/Diplom/Rus_work/1M.chv
 1002013 /Users/macbook/Documents/NewFolder/Diplom/Rus_work/1M.ru


In [5]:
! wc -l /Users/macbook/Documents/NewFolder/Diplom/Rus_work/khakas_kh.txt

   39523 /Users/macbook/Documents/NewFolder/Diplom/Rus_work/khakas_kh.txt


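Given these sizes, I think it is fine to take 2000 lines for the parent val set and 1000 lines each for the child val and child test sets. This leaves 1002013 - 2000 = 1000013 lines for parent training and 39523 - 1000 - 1000 = 37523 lines for child training.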

In [None]:
head -n 1000013 1M_shuffed.ru > train_parent.ru
tail -n +1000014 1M_shuffed.ru > val_parent.ru
head -n 1000013 1M_shuffed.chv > train_parent.chv
tail -n +1000014 1M_shuffed.chv > val_parent.chv

head -n 38523 khakas_shuffed.ru > train_val_child.ru
tail -n +38524 khakas_shuffed.ru > test_child.ru
head -n 37523 train_val_child.ru > train_child.ru
tail -n +37524 train_val_child.ru > val_child.ru

head -n 38523 khakas_shuffed.kh > train_val_child.kh
tail -n +38524 khakas_shuffed.kh > test_child.kh
head -n 37523 train_val_child.kh > train_child.kh
tail -n +37524 train_val_child.kh > val_child.kh
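
The line numbers passed to `head` and `tail` above follow from simple arithmetic, sketched here in Python. The helper names `split_sizes` and `split_lines` are hypothetical. The layout mirrors the shell commands: train comes first, then val, then test at the end of the file.

```python
def split_sizes(total, val, test=0):
    """Return (train, val, test) sizes for a file with `total` lines."""
    train = total - val - test
    if train <= 0:
        raise ValueError("val and test sets leave no lines for training")
    return train, val, test

def split_lines(lines, val, test=0):
    """Split a list of lines into train, val and test parts, in that order."""
    train_n, val_n, _ = split_sizes(len(lines), val, test)
    return (
        lines[:train_n],
        lines[train_n:train_n + val_n],
        lines[train_n + val_n:],
    )

parent_sizes = split_sizes(1002013, val=2000)
child_sizes = split_sizes(39523, val=1000, test=1000)
print(parent_sizes)  # (1000013, 2000, 0)
print(child_sizes)   # (37523, 1000, 1000)
```

So `head -n 1000013` and `tail -n +1000014` cut the parent file after line 1000013, and the child file is cut after lines 37523 and 38523.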

## Building the vocabularies:

After preparing the necessary files, it is time to create the vocabularies that the model will use. According to the pipeline presented in the WMT article, I need to duplicate the KH data 25 times so that it is well represented in the vocab and the KH tokens are not too rare to earn a place in it. The commands below write 26 copies of each child training file (the original plus 25 duplicates):

In [None]:
! perl -0777pe '$_=$_ x 26' train_child.kh > 26train_child.kh
! perl -0777pe '$_=$_ x 26' train_child.ru > 26train_child.ru

Then I concatenated the parent training files with the oversampled child files. The combined files are what I will later pass to subword-nmt for building the vocabularies.

In [None]:
! cat train_parent.ru 26train_child.ru > train_vocab.ru
! cat train_parent.chv 26train_child.kh > train_vocab.chvkh
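
To see what the oversampling does to the balance of the combined file, here is a small sketch of the arithmetic. The function names `build_vocab_corpus` and `child_share` are hypothetical; the line counts are the train sizes from the split above.

```python
def build_vocab_corpus(parent_lines, child_lines, copies=26):
    """Concatenate the parent lines with `copies` repetitions of the child lines."""
    return list(parent_lines) + list(child_lines) * copies

def child_share(parent_n, child_n, copies):
    """Fraction of lines in the combined file that come from the child corpus."""
    child_total = child_n * copies
    return child_total / (parent_n + child_total)

combined_lines = 1000013 + 37523 * 26
print(combined_lines)  # 1975611

share_without = child_share(1000013, 37523, copies=1)   # roughly 3.6% of the lines
share_with = child_share(1000013, 37523, copies=26)     # roughly 49% of the lines

# The same construction on a toy corpus.
toy = build_vocab_corpus(["p1", "p2"], ["c1"], copies=3)
print(toy)  # ['p1', 'p2', 'c1', 'c1', 'c1']
```

With 26 copies the child data makes up about half of the combined file, so BPE merges learned on it reflect Khakas about as much as Chuvash.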

All of the following code is run from the RU_creating_vocabs_and_train_val_files.sh file and follows the subword-nmt tutorial (the subword-nmt package must be installed).  
First we perform the learn-bpe operation on the concatenation of our training files. The parameters of this operation are taken from the WMT article. After that we perform the apply-bpe operation on the source and target files separately and pass each result to get-vocab to get two separate vocabs:

```bash
# learn BPE on concatenation of train files and extract vocabs
cat train_vocab.ru train_vocab.chvkh | subword-nmt learn-bpe -s 20000 -o codes --min-frequency 2 --total-symbols

subword-nmt apply-bpe -c codes < train_vocab.ru | subword-nmt get-vocab > vocab_file.ru
subword-nmt apply-bpe -c codes < train_vocab.chvkh | subword-nmt get-vocab > vocab_file.chvkh
```

We will explain the BPE operations in a separate notebook, here we will proceed with building the vocabularies.

After getting the vocabs from the subword-nmt get-vocab operation, we use the following script to make them compatible with the sockeye model. The script is in the file subword-nmt_vocab2sockeye_vocab.py:

```python
#!/usr/bin/env python3
# Diplom/WMT_repeat/parent_child/subword-nmt_vocab2sockeye_vocab.py

import json
import sys

from argparse import ArgumentParser

def getArgs():
    usage="subword-nmt_vocab2sockeye_vocab.py  subword_nmt_vocab [sockeye_vocab_json]"
    help = ''
    parser = ArgumentParser(usage=usage, description=help, add_help=True)
    parser.add_argument("vocab_filename", type=str, help="Subword-nmt's vocabulary filename")
    parser.add_argument("vocab_json",
         nargs   = '?',
         type    = lambda f: open(f, mode='w', encoding='UTF-8'),
         default = sys.stdout,
         help    = "output file [sys.stdout]")
    parser.add_argument("--add_bt_tag",
         dest    = "add_bt_tag",
         action  = 'store_true',
         default = False,
         help    = "Also include BT tag in vocab [%(default)s]")
    parser.add_argument("--add_generic_tags",
         dest    = "add_generic_tags",
         action  = 'store_true',
         default = False,
         help    = "Also include generic tags (<TAG01> to <TAG25>) in vocab [%(default)s]")
    parser.add_argument("--add_glossaries",
         dest    = "add_glossaries",
         nargs   = '?',
         default = None,
         help    = "Also include the glossaries in vocab [%(default)s]")


    cmd_args = parser.parse_args()

    return cmd_args




if __name__ == '__main__':
    args = getArgs()
    vocab = dict({
        "<pad>": 0,
        "<unk>": 1,
        "<s>": 2,
        "</s>": 3,
        })

    if args.add_bt_tag:
        vocab.update({'<BT>': len(vocab)})

    if args.add_generic_tags:
        vocab.update({ f'<TAG{i:02}>' : i+len(vocab)-1 for i in range(1, 26) })

    with open(args.vocab_filename, mode='r', encoding='UTF-8') as f:
        vocab.update({ k: v for v, k in enumerate(map(lambda l: l.strip().split()[0], f.readlines()), start=len(vocab.keys())) })

    if args.add_glossaries is not None:
        last_token_id = len(vocab.keys())
        with open(args.add_glossaries, mode='rt', encoding='UTF-8') as f:
            for token in map(str.strip, f.readlines()):
                if token != '' and token not in vocab:
                    vocab[token] = last_token_id
                    last_token_id += 1

    json.dump(vocab, args.vocab_json, indent=2, ensure_ascii=False)

```

This script adds all the special tags and turns the vocabs into JSON files suitable for the sockeye model. Here is how we apply it to our vocabs:

```bash
# adding tags and creating sockeye-compatible vocabs
python subword-nmt_vocab2sockeye_vocab_adj.py --add_bt_tag --add_generic_tags --add_glossaries glossaries_file.txt vocab_file.ru > augmented_vocab_ru.json
python subword-nmt_vocab2sockeye_vocab_adj.py --add_bt_tag --add_generic_tags --add_glossaries glossaries_file.txt vocab_file.chvkh > augmented_vocab_chvkh.json
```
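
To make the ID layout produced by the conversion script concrete, here is a simplified sketch of its numbering logic on a toy token list. The function `build_sockeye_vocab` and the toy tokens are my own illustration, not part of the script. Unlike the script, this version skips a token that is already in the vocab instead of overwriting its ID, and it leaves out the glossaries.

```python
def build_sockeye_vocab(tokens, add_bt_tag=True, add_generic_tags=True, n_tags=25):
    """Assign consecutive IDs: special symbols first, then tags, then BPE tokens."""
    vocab = {"<pad>": 0, "<unk>": 1, "<s>": 2, "</s>": 3}
    if add_bt_tag:
        vocab["<BT>"] = len(vocab)
    if add_generic_tags:
        for i in range(1, n_tags + 1):
            vocab[f"<TAG{i:02}>"] = len(vocab)
    for token in tokens:
        if token not in vocab:  # simplification: keep the first ID of a repeated token
            vocab[token] = len(vocab)
    return vocab

# Toy tokens in the order they would appear in a get-vocab file (most frequent first).
toy_vocab = build_sockeye_vocab(["the@@", "cat"])
print(toy_vocab["<BT>"], toy_vocab["<TAG01>"], toy_vocab["<TAG25>"])  # 4 5 29
print(toy_vocab["the@@"], toy_vocab["cat"])  # 30 31
```

With both tag options enabled, IDs 0 to 3 are the special symbols, 4 is `<BT>`, 5 to 29 are the 25 generic tags, and the BPE tokens start at 30.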

As a result of all these actions we will have two files:  
augmented_vocab_ru.json  
augmented_vocab_chvkh.json  
We will use these vocabs as parameters to our sockeye model.  
In the next chapter we will talk about how we make actual BPE files for parent and child models separately.