0. Cleaning unnecessary symbols from the files:

In [12]:
! gsed -i 's/^[[:blank:]●]*//' khakas_ru.txt
! gsed -i 's/^[[:blank:]●]*//' khakas_kh.txt
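
For readers without GNU sed, the same cleanup can be sketched in Python. This is only an illustration of what the pattern `^[[:blank:]●]*` removes; the function name `strip_leading_symbols` is my own and not part of the pipeline.

```python
import re

# Any run of spaces, tabs or bullet characters at the very start of a line,
# mirroring the sed pattern ^[[:blank:]●]*
LEADING_JUNK = re.compile(r"^[ \t●]*")

def strip_leading_symbols(line: str) -> str:
    """Remove leading blanks and bullets; the rest of the line is untouched."""
    return LEADING_JUNK.sub("", line)

print(strip_leading_symbols(" ● \tПривет, мир"))  # Привет, мир
print(strip_leading_symbols("a ● b"))             # a ● b (bullets inside the line stay)
```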

In [22]:
! paste -d $'\t' khakas_ru.txt khakas_kh.txt | shuf | awk -F $'\t' '{print $1 > "khakas_shuffed.ru" ; print $2 > "khakas_shuffed.kh"}'
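
The point of the paste, shuf, awk chain is that both sides of the corpus are shuffled with the same permutation, so line N of the Russian file still translates line N of the Khakas file. A minimal Python sketch of the same idea follows; `shuffle_parallel` and its `seed` argument are my own additions (the `shuf` call above is unseeded).

```python
import random

def shuffle_parallel(src_lines, tgt_lines, seed=0):
    """Shuffle two aligned lists of sentences with the same permutation."""
    if len(src_lines) != len(tgt_lines):
        raise ValueError("source and target must have the same number of lines")
    pairs = list(zip(src_lines, tgt_lines))
    # A seeded generator makes the shuffle reproducible.
    random.Random(seed).shuffle(pairs)
    shuffled_src = [src for src, _ in pairs]
    shuffled_tgt = [tgt for _, tgt in pairs]
    return shuffled_src, shuffled_tgt

ru = ["ru 1", "ru 2", "ru 3", "ru 4"]
kh = ["kh 1", "kh 2", "kh 3", "kh 4"]
ru_shuffled, kh_shuffled = shuffle_parallel(ru, kh)

# Every Russian line still sits next to its Khakas counterpart.
assert all(r.split()[1] == k.split()[1] for r, k in zip(ru_shuffled, kh_shuffled))
```

Note that the shell version relies on the tab as a separator, so a sentence that itself contains a tab would be split in the wrong place by awk; working on pairs avoids that.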

1. Splitting the Chuvash and Khakas data into train, val and test sets

In [None]:
! head -n 1000013 1M_shuffed.ru > train_parent.ru
! tail -n +1000014 1M_shuffed.ru > val_parent.ru
! head -n 1000013 1M_shuffed.chv > train_parent.chv
! tail -n +1000014 1M_shuffed.chv > val_parent.chv

! head -n 38523 khakas_shuffed.ru > train_val_child.ru
! tail -n +38524 khakas_shuffed.ru > test_child.ru
! head -n 37523 train_val_child.ru > train_child.ru
! tail -n +37524 train_val_child.ru > val_child.ru

! head -n 38523 khakas_shuffed.kh > train_val_child.kh
! tail -n +38524 khakas_shuffed.kh > test_child.kh
! head -n 37523 train_val_child.kh > train_child.kh
! tail -n +37524 train_val_child.kh > val_child.kh
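
The head/tail pairs above amount to slicing a list at fixed positions. As a sketch of the child split (function names are mine), the first cut separates train+val from test and the second separates train from val. With the numbers used above this corresponds to `n_train=37523` and `n_val=1000`; the toy example uses smaller values.

```python
def split_head_tail(lines, n_head):
    """Mimic `head -n N` plus `tail -n +(N+1)`: first n_head lines, then the rest."""
    return lines[:n_head], lines[n_head:]

def three_way_split(lines, n_train, n_val):
    """Cut off the test set first, then divide the remainder into train and val."""
    train_val, test = split_head_tail(lines, n_train + n_val)
    train, val = split_head_tail(train_val, n_train)
    return train, val, test

lines = [f"sentence {i}" for i in range(10)]
train, val, test = three_way_split(lines, n_train=6, n_val=2)
print(len(train), len(val), len(test))  # 6 2 2
```

Applying the same call to the `.ru` and `.kh` lists keeps the three sets aligned, because both files were shuffled together.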

2. After filtering the files, I need to repeat the KH training data so that it is well represented in the vocab; otherwise the KH tokens would be too rare to earn a significant place in it. The commands below write 26 copies of each file (the original plus 25 duplicates):

In [None]:
! perl -0777pe '$_=$_ x 26' train_child.kh > 26train_child.kh
! perl -0777pe '$_=$_ x 26' train_child.ru > 26train_child.ru

In [None]:
! cat train_parent.ru 26train_child.ru > train_vocab.ru
! cat train_parent.chv 26train_child.kh > train_vocab.chvkh

The `cat` commands above concatenate the parent training data with the repeated child data. The resulting combined files are what I will later pass to subword-nmt for building the vocabularies.
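
To get a feeling for what the repetition does, here is a small sketch that computes the share of child lines in the combined file. It assumes the training files contain exactly as many lines as the `head -n` values above (1000013 for the parent, 37523 for the child) and it counts lines, not tokens; `child_share` is a name I made up for the illustration.

```python
def child_share(n_parent, n_child, copies):
    """Fraction of lines in the combined file that come from the child corpus."""
    child_lines = n_child * copies
    return child_lines / (n_parent + child_lines)

N_PARENT = 1_000_013  # assumed line count of train_parent.*
N_CHILD = 37_523      # assumed line count of train_child.*

print(f"{child_share(N_PARENT, N_CHILD, 1):.3f}")   # 0.036
print(f"{child_share(N_PARENT, N_CHILD, 26):.3f}")  # 0.494

# Repeating the whole file, as the perl one-liner does, is the same as
# repeating the list of its lines.
toy = ["a\n", "b\n"]
assert toy * 3 == ["a\n", "b\n", "a\n", "b\n", "a\n", "b\n"]
```

Under these assumptions the child data goes from under 4% of the lines to roughly half of them.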

First we run the learn-bpe operation on the training files. The parameters of this operation are taken from the article. After that we run apply-bpe on the source and target files separately and pipe the result into get-vocab, which gives two separate vocabs:

```bash
# learn BPE on concatenation of train files and extract vocabs
cat train_vocab.ru train_vocab.chvkh | subword-nmt learn-bpe -s 20000 -o codes --min-frequency 2 --total-symbols

subword-nmt apply-bpe -c codes < train_vocab.ru | subword-nmt get-vocab > vocab_file.ru
subword-nmt apply-bpe -c codes < train_vocab.chvkh | subword-nmt get-vocab > vocab_file.chvkh
```

We will explain the BPE operations in a separate notebook; here we proceed with building the vocabularies.

After getting the vocabs from the subword-nmt get-vocab step, we use the following script to make them compatible with the Sockeye model. The script is in the file subword-nmt_vocab2sockeye_vocab.py:

```python
#!/usr/bin/env  python3

import json
import sys

from argparse import ArgumentParser

def getArgs():
    usage="subword-nmt_vocab2sockeye_vocab.py  subword_nmt_vocab [sockeye_vocab_json]"
    help = ''
    parser = ArgumentParser(usage=usage, description=help, add_help=True)
    parser.add_argument("vocab_filename", type=str, help="Subword-nmt's vocabulary filename")
    parser.add_argument("vocab_json",
         nargs   = '?',
         type    = lambda f: open(f, mode='w', encoding='UTF-8'),
         default = sys.stdout,
         help    = "output file [sys.stdout]")
    parser.add_argument("--add_bt_tag",
         dest    = "add_bt_tag",
         action  = 'store_true',
         default = False,
         help    = "Also include BT tag in vocab [%(default)s]")
    parser.add_argument("--add_generic_tags",
         dest    = "add_generic_tags",
         action  = 'store_true',
         default = False,
         help    = "Also include generic tags <TAG01>..<TAG25> in vocab [%(default)s]")
    parser.add_argument("--add_glossaries",
         dest    = "add_glossaries",
         nargs   = '?',
         default = None,
         help    = "Also include the glossaries in vocab [%(default)s]")


    cmd_args = parser.parse_args()

    return cmd_args




if __name__ == '__main__':
    args = getArgs()
    vocab = dict({
        "<pad>": 0,
        "<unk>": 1,
        "<s>": 2,
        "</s>": 3,
        })

    if args.add_bt_tag:
        vocab.update({'<BT>': len(vocab)})

    if args.add_generic_tags:
        vocab.update({ f'<TAG{i:02}>' : i+len(vocab)-1 for i in range(1, 26) })

    with open(args.vocab_filename, mode='r', encoding='UTF-8') as f:
        vocab.update({ k: v for v, k in enumerate(map(lambda l: l.strip().split()[0], f.readlines()), start=len(vocab.keys())) })

    if args.add_glossaries is not None:
        last_token_id = len(vocab.keys())
        with open(args.add_glossaries, mode='rt', encoding='UTF-8') as f:
            for token in map(str.strip, f.readlines()):
                if token != '' and token not in vocab:
                    vocab[token] = last_token_id
                    last_token_id += 1

    json.dump(vocab, args.vocab_json, indent=2, ensure_ascii=False)

```
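
To make the resulting id layout easier to see, here is a condensed restatement of the script's logic as a plain function, run on a toy token list. `build_vocab` and the toy tokens are hypothetical and only for illustration; glossaries are left out, and the sketch assumes that no token in the list coincides with a special symbol.

```python
def build_vocab(tokens, add_bt_tag=False, add_generic_tags=False):
    """Assign ids in the same order as the script: specials, <BT>, tags, tokens."""
    vocab = {"<pad>": 0, "<unk>": 1, "<s>": 2, "</s>": 3}
    if add_bt_tag:
        vocab["<BT>"] = len(vocab)
    if add_generic_tags:
        offset = len(vocab)
        for i in range(1, 26):  # <TAG01> .. <TAG25>
            vocab[f"<TAG{i:02}>"] = offset + i - 1
    offset = len(vocab)
    for position, token in enumerate(tokens):
        vocab[token] = offset + position
    return vocab

toy_tokens = ["the@@", "cat", "s"]  # hypothetical BPE tokens
vocab = build_vocab(toy_tokens, add_bt_tag=True, add_generic_tags=True)
print(vocab["<BT>"], vocab["<TAG01>"], vocab["<TAG25>"], vocab["the@@"])  # 4 5 29 30
```

So with both flags set, ids 0 to 3 are the basic special symbols, 4 is `<BT>`, 5 to 29 are the generic tags, and the subword tokens start at 30.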

This code adds all the special tags and turns the vocabs into JSON files suitable for the Sockeye model. Here is how we apply it to our vocabs:
```bash
# adding tags and creating sockeye-compatible vocabs
python subword-nmt_vocab2sockeye_vocab_adj.py --add_bt_tag --add_generic_tags --add_glossaries glossaries_file.txt vocab_file.ru > augmented_vocab_ru.json
python subword-nmt_vocab2sockeye_vocab_adj.py --add_bt_tag --add_generic_tags --add_glossaries glossaries_file.txt vocab_file.chvkh > augmented_vocab_chvkh.json
```
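
Before passing the JSON files on, it can be worth running a quick sanity check on them. The sketch below is my own and works under the assumption that the ids should be unique and contiguous from 0, with the four basic special symbols at ids 0 to 3 as in the script above; `find_vocab_problems` is a hypothetical helper, not part of Sockeye or subword-nmt.

```python
def find_vocab_problems(vocab):
    """Return a list of human-readable problems; an empty list means the vocab looks fine."""
    problems = []
    expected_specials = {"<pad>": 0, "<unk>": 1, "<s>": 2, "</s>": 3}
    for token, token_id in expected_specials.items():
        if vocab.get(token) != token_id:
            problems.append(f"{token} should have id {token_id}, found {vocab.get(token)}")
    # Unique, gap-free ids are exactly 0 .. len(vocab) - 1 once sorted.
    if sorted(vocab.values()) != list(range(len(vocab))):
        problems.append("ids are not unique and contiguous from 0")
    return problems

good = {"<pad>": 0, "<unk>": 1, "<s>": 2, "</s>": 3, "cat": 4}
gappy = {"<pad>": 0, "<unk>": 1, "<s>": 2, "</s>": 3, "cat": 5}
print(find_vocab_problems(good))   # []
print(find_vocab_problems(gappy))  # ['ids are not unique and contiguous from 0']
```

For the real files, the dictionary would come from `json.load` on `augmented_vocab_ru.json` or `augmented_vocab_chvkh.json`. A gap can appear, for example, if a token in the subword-nmt vocab file coincides with one of the special symbols, because the script then overwrites the earlier id.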

We will use these vocabs as parameters to our Sockeye model.  
In the next chapter we will talk about how we make the actual BPE files for the parent and child models separately.