# PoLitBert - Polish RoBERTa model

## Preparation of vocabulary and encoding the data

Corpora used:
* Wikipedia, Link: 
* Oscar
* Polish Books

Useful resources:
* https://github.com/pytorch/fairseq/blob/master/examples/roberta/README.pretraining.md
* https://github.com/musixmatchresearch/umberto/issues/2

In [None]:
import csv
import sys
import datetime as dt
import os
from pathlib import Path
import re

from tqdm import tqdm

import mmap

## Create vocabulary

### Prepare data for vocab

A separate text file with one sentence per line was created for training the vocabulary.
We split the text with a Polish sentence tokenizer extended with [additional abbreviations](https://gist.github.com/ksopyla/f05fe2f48bbc9de895368b8a7863b5c3)
typical for the Polish language.
The SentencePiece model is capable of handling around 12,000,000 sentences, so larger files are not necessary.
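
As a rough illustration of the "one sentence per line" preparation, the sketch below splits text on sentence-ending punctuation while skipping periods that belong to known abbreviations. It is a simple regex-based stand-in, not the tokenizer used for this corpus; the `split_sentences` helper, the `BOUNDARY` pattern and the tiny `ABBREVIATIONS` set are assumptions made for this example.

```python
import re

# Hypothetical, tiny abbreviation list for illustration only;
# the list linked above is much longer.
ABBREVIATIONS = {"np", "tzw", "ul", "prof", "dr"}

# Boundary candidate: '.', '!' or '?' followed by whitespace
# and an uppercase letter (including Polish diacritics).
BOUNDARY = re.compile(r"[.!?]\s+(?=[A-ZĄĆĘŁŃÓŚŹŻ])")


def split_sentences(text):
    """Split text into sentences, ignoring periods that end a known abbreviation."""
    sentences, start = [], 0
    for match in BOUNDARY.finditer(text):
        words = re.findall(r"\w+", text[start:match.start()])
        is_abbreviation = (
            text[match.start()] == "."
            and bool(words)
            and words[-1].lower() in ABBREVIATIONS
        )
        if is_abbreviation:
            continue  # not a real sentence boundary
        sentences.append(text[start:match.start() + 1].strip())
        start = match.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


sample = "Mieszkam przy ul. Długiej. To jest np. dobry przykład! Koniec."
for sentence in split_sentences(sample):
    print(sentence)  # one sentence per line
```

A trained sentence tokenizer handles many more cases (quotes, initials, ordinal numbers); the sketch only shows why the abbreviation list matters.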

### Train the BPE vocabulary model

We used the [SentencePiece](https://github.com/google/sentencepiece) segmentation model trained on raw
sentences with a fixed final vocabulary size. Two models were trained, with 32K and 50K unique tokens.

Training and segmentation can be done in two ways:
- as a python module,
- as a command-line tool.

To use it as a command-line tool, SentencePiece should be installed from source, as described in the
[build the C++ version from source](https://github.com/google/sentencepiece#c-from-source) section of the documentation.


#### Training SentencePiece vocab using command line

* 32k vocab:
```
spm_train \
    --input=./data/corpus_raw/corpus_books_wiki_12M_lines.txt \
    --max_sentence_length=4192 \
    --model_prefix=./data/vocab/books_wikipedia_v32k_sen10M.spm.bpe \
    --vocab_size=32000 \
    --model_type=bpe \
    --shuffle_input_sentence=true \
    --input_sentence_size=10000000 \
    --bos_id=0 --eos_id=1 --pad_id=2 --unk_id=3
```

* 50k vocab:
```
spm_train \
    --input=./data/corpus_raw/corpus_books_wiki_12M_lines.txt \
    --max_sentence_length=4192 \
    --model_prefix=./data/vocab/books_wikipedia_v50k_sen10M.spm.bpe \
    --vocab_size=50000 \
    --model_type=bpe \
    --shuffle_input_sentence=true \
    --input_sentence_size=10000000 \
    --bos_id=0 --eos_id=1 --pad_id=2 --unk_id=3
```

#### Training SentencePiece vocab with Python module

For reference, the example below shows how to train a SentencePiece model from a Python script.

In [None]:
import sentencepiece as spm

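vocab_size = 32000
model_type = "bpe"
iss = 10_000_000  # number of sentences sampled from the input file

data_file = './data/corpus_raw/corpus_books_wiki_12M_lines.txt'

# Same model prefix as in the command-line example above
tok_model = "books_wikipedia_v32k_sen10M.spm.bpe"
tok_model = os.path.abspath(f"./data/vocab/{tok_model}")

piece_options = ' --bos_id=0 --eos_id=1 --pad_id=2 --unk_id=3 --shuffle_input_sentence=true'

# model_type has to be passed explicitly; otherwise SentencePiece uses its default (unigram)
cmd = f"--input={data_file} --model_prefix={tok_model} --model_type={model_type} --num_threads=4 --vocab_size={vocab_size} --input_sentence_size={iss}" + piece_options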
print(cmd)

start = dt.datetime.now()
print(start)
spm.SentencePieceTrainer.train(cmd)
end = dt.datetime.now()

print(f"Created vocab of {vocab_size} tokens from {data_file}, took {end-start}.")

In [10]:
# Example segmentation usage:

# make segmenter instance and load the model file (m.model)
sp = spm.SentencePieceProcessor()
sp.load(f"{tok_model}.model")

# verify vocab size
print(sp.get_piece_size())

# encode: text => pieces
text = """Będąc młodym programistą (hoho), czytałem "Dziady" w 1983r."""
print(sp.encode_as_pieces(text))

32000
['▁Będąc', '▁młodym', '▁programi', 'stą', '▁(', 'ho', 'ho', '),', '▁czy', 't', 'ałem', '▁"', 'D', 'zia', 'dy', '"', '▁w', '▁1983', 'r', '.']


### Fairseq vocab

Using the SentencePiece vocabulary with fairseq requires changing the separator used in the dictionary file.
All tab (_\t_) characters should be replaced with a single space in the vocab file.

In [12]:
for vocab_size in ("32k", "50k"):
    vocab_file = Path(f"./data/vocab/books_wikipedia_v{vocab_size}_sen10M.spm.bpe.vocab")

    # e.g. books_wikipedia_v32k_sen10M.spm.bpe_fair.vocab
    output_path = f"{vocab_file.with_suffix('')}_fair.vocab"

    # Replace the tab separator with a space, as expected by fairseq
    with open(vocab_file, encoding="utf-8") as f, open(output_path, "w", encoding="utf-8") as output_file:
        output_file.write(f.read().replace('\t', ' '))
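
To make the effect of the conversion concrete, here is a small in-memory sketch applied to single vocab lines. The helper name `to_fairseq_line` and the sample pieces and scores are made up for illustration; real values come from the trained `.vocab` file.

```python
def to_fairseq_line(spm_line):
    """Convert one '<piece>\\t<score>' line into '<piece> <score>'."""
    piece, score = spm_line.rstrip("\n").split("\t")
    return f"{piece} {score}"


# Made-up sample lines in the tab-separated SentencePiece layout
spm_lines = ["▁w\t-0", "ałem\t-1", "▁Będąc\t-2"]

fair_lines = [to_fairseq_line(line) for line in spm_lines]
print(fair_lines)

# Every converted line should consist of exactly two space-separated fields
assert all(len(line.split(" ")) == 2 for line in fair_lines)
```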

### Encode data with sentence piece model

The prepared training and test datasets are encoded with the SentencePiece tokenizer, for both the 32k and 50k vocabularies.

* 32k vocab:

```
DATA_PATH=./data/wiki_books_oscar/
VOCAB_SIZE=32k

for SPLIT in test train ; do \
    spm_encode \
    --model=./data/vocab/books_wikipedia_v${VOCAB_SIZE}_sen10M.spm.bpe.model \
    --extra_options=bos:eos \
    --output_format=piece \
    < ${DATA_PATH}corpus_wiki_books_oscar_${SPLIT}.txt \
    > ${DATA_PATH}corpus_wiki_books_oscar_${SPLIT}_vocab${VOCAB_SIZE}.txt.bpe
done
```

* 50k vocab:

```
DATA_PATH=./data/wiki_books_oscar/
VOCAB_SIZE=50k

for SPLIT in test train ; do \
    spm_encode \
    --model=./data/vocab/books_wikipedia_v${VOCAB_SIZE}_sen10M.spm.bpe.model \
    --extra_options=bos:eos \
    --output_format=piece \
    < ${DATA_PATH}corpus_wiki_books_oscar_${SPLIT}.txt \
    > ${DATA_PATH}corpus_wiki_books_oscar_${SPLIT}_vocab${VOCAB_SIZE}.txt.bpe
done
```
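
The encoded files contain one sentence per line, with pieces separated by spaces and wrapped in begin- and end-of-sentence markers (because of `--extra_options=bos:eos`). The sketch below only mimics that line layout in plain Python. It assumes the default SentencePiece piece strings `<s>` and `</s>`, and `format_encoded_line` is a hypothetical helper, not part of the SentencePiece API.

```python
def format_encoded_line(pieces, bos="<s>", eos="</s>"):
    """Join pieces with spaces and wrap them in BOS/EOS markers."""
    return " ".join([bos, *pieces, eos])


# First pieces of the segmentation example shown earlier
pieces = ["▁Będąc", "▁młodym", "▁programi", "stą"]
print(format_encoded_line(pieces))
```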

## Data binarization with Fairseq

### Preprocessing the BPE-encoded and split data with fairseq-preprocess

* Data processed with 32k vocab:

```
DATA_PATH=./data/wiki_books_oscar/
VOCAB_SIZE=32k

fairseq-preprocess \
    --only-source \
    --srcdict ./data/vocab/books_wikipedia_v${VOCAB_SIZE}_sen10M.spm.bpe_fair.vocab \
    --trainpref ${DATA_PATH}corpus_wiki_books_oscar_train_vocab${VOCAB_SIZE}.txt.bpe \
    --validpref ${DATA_PATH}corpus_wiki_books_oscar_test_vocab${VOCAB_SIZE}.txt.bpe \
    --destdir ${DATA_PATH}vocab${VOCAB_SIZE} \
    --workers 8
```

* Data processed with 50k vocab:

```
DATA_PATH=./data/wiki_books_oscar/
VOCAB_SIZE=50k

fairseq-preprocess \
    --only-source \
    --srcdict ./data/vocab/books_wikipedia_v${VOCAB_SIZE}_sen10M.spm.bpe_fair.vocab \
    --trainpref ${DATA_PATH}corpus_wiki_books_oscar_train_vocab${VOCAB_SIZE}.txt.bpe \
    --validpref ${DATA_PATH}corpus_wiki_books_oscar_test_vocab${VOCAB_SIZE}.txt.bpe \
    --destdir ${DATA_PATH}vocab${VOCAB_SIZE} \
    --workers 8
```
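
Since the two commands differ only in the vocabulary size, they can also be generated from a single template. The sketch below builds the argument list in Python using only the flags shown above. `build_preprocess_cmd` is a hypothetical helper, and the `vocab_dir` value in the usage lines is an assumption that should point at the directory where the converted `_fair.vocab` files were written.

```python
import shlex


def build_preprocess_cmd(vocab_size, vocab_dir, data_path="./data/wiki_books_oscar/"):
    """Return the fairseq-preprocess argument list for one vocabulary size."""
    prefix = f"{data_path}corpus_wiki_books_oscar"
    return [
        "fairseq-preprocess",
        "--only-source",
        "--srcdict", f"{vocab_dir}books_wikipedia_v{vocab_size}_sen10M.spm.bpe_fair.vocab",
        "--trainpref", f"{prefix}_train_vocab{vocab_size}.txt.bpe",
        "--validpref", f"{prefix}_test_vocab{vocab_size}.txt.bpe",
        "--destdir", f"{data_path}vocab{vocab_size}",
        "--workers", "8",
    ]


for size in ("32k", "50k"):
    cmd = build_preprocess_cmd(size, vocab_dir="./data/vocab/")
    # Print the command; pass `cmd` to subprocess.run(cmd, check=True) to execute it
    print(shlex.join(cmd))
```

Keeping the command as a list avoids shell quoting issues if it is later passed to `subprocess.run`.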