The SentencePiece model is used to tokenize the input strings and decode the output tokens. You can create your own model with the google/sentencepiece library, or use our default one at `t5.data.DEFAULT_SPM_PATH`. If you create your own, you must use the flags `--pad_id=0 --eos_id=1 --unk_id=2 --bos_id=-1` with `spm_train` to be compatible with our model code.
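
One way to confirm that a trained model follows this ID convention is to query its special-token IDs after loading. The sketch below assumes an object with the `pad_id()`, `eos_id()`, `unk_id()` and `bos_id()` methods of `sentencepiece.SentencePieceProcessor`; the helper name `check_special_ids` is hypothetical and is not part of t5 or sentencepiece.

```python
# Expected special-token IDs, taken from the flags quoted above.
EXPECTED_IDS = {"pad_id": 0, "eos_id": 1, "unk_id": 2, "bos_id": -1}


def check_special_ids(processor):
    """Return {name: (expected, actual)} for every ID that does not match.

    `processor` is assumed to be a sentencepiece.SentencePieceProcessor
    (or any object exposing the same pad_id/eos_id/unk_id/bos_id methods).
    """
    mismatches = {}
    for name, expected in EXPECTED_IDS.items():
        actual = getattr(processor, name)()
        if actual != expected:
            mismatches[name] = (expected, actual)
    return mismatches
```

Called on a loaded processor, an empty dict means the model matches the convention; any entry shows which flag was set differently at training time.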

1. Preprocess the data so that each line contains one sentence.


Speaker 1 / speaker 2 tokens: add them separately, or just keep them in the source text? The second option is fine, since they are already in the data.

https://github.com/google/sentencepiece/blob/master/doc/special_symbols.md

2. Run training.

how to train: https://github.com/google/sentencepiece#train-sentencepiece-model

pass multiple files: https://github.com/google/sentencepiece/issues/489#issuecomment-631556141

all options: https://github.com/google/sentencepiece/blob/master/doc/options.md
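
For the multiple-files case, `spm_train` takes `--input` as a comma-separated list of files. A minimal sketch of building that string from a directory of text files follows; the helper name `build_input_arg` and its `pattern` parameter are illustrative, not part of sentencepiece.

```python
from pathlib import Path


def build_input_arg(txt_dir, pattern="*.txt"):
    """Join all matching files into one comma-separated string."""
    # sort for a reproducible file order
    files = sorted(Path(txt_dir).glob(pattern))
    if not files:
        raise FileNotFoundError(f"no files matching {pattern!r} in {txt_dir}")
    return ",".join(str(p) for p in files)
```

The resulting string can be passed as `--input=` to `spm_train` or as `input=` to `SentencePieceTrainer.Train`. The Python wrapper also accepts a list directly, which is what the training cell in this notebook does.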

### preprocessing

In [None]:
# preprocess ods_data
from tqdm.notebook import tqdm
from pathlib import Path
import json
from ru_sent_tokenize import ru_sent_tokenize

path = Path('/home/kuratov/data/ods_shards/ods_shards/')

save_path = Path('/home/kuratov/data/ods_shards/merged_txt')
save_path.mkdir(parents=True, exist_ok=True)

files = sorted(list(path.glob('*json')))

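for f in tqdm(files):
    sentences = []
    # close the shard file after reading instead of leaking the handle
    with f.open('r', encoding='utf-8') as fin:
        dialogs = json.load(fin)
    for d in dialogs:
        # join the strings of one dialog into a single text
        dialog = ' '.join(d)
        # put a space after each speaker token
        dialog = dialog.replace('<speaker1>', '<speaker1> ').replace('<speaker2>', '<speaker2> ')
        if len(dialog) == 0:
            continue
        sentences += ru_sent_tokenize(dialog)
    # write one sentence per line, as SentencePiece training expects
    with (save_path / (f.stem + '.txt')).open('w', encoding='utf-8') as fout:
        for sent in sentences:
            fout.write(sent + '\n')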
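
To make the speaker-token handling concrete, here is a small sketch of the join-and-replace step on its own. The sample turns are made up, and the assumption that each turn starts with its speaker token and no space is a guess, since the real shard format is not shown here. `space_speaker_tokens` is a hypothetical helper name.

```python
def space_speaker_tokens(turns):
    """Join dialog turns and put a space after each speaker token."""
    dialog = " ".join(turns)
    return (dialog.replace("<speaker1>", "<speaker1> ")
                  .replace("<speaker2>", "<speaker2> "))


# Hypothetical dialog: each turn starts with its speaker token.
turns = ["<speaker1>Привет!", "<speaker2>Как дела?"]
print(space_speaker_tokens(turns))
# <speaker1> Привет! <speaker2> Как дела?
```

If a turn already has a space after its token, this replacement yields a double space there.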

### train sentencepiece

In [1]:
import sentencepiece as sp

In [2]:
from pathlib import Path

In [3]:
save_path = Path('/home/kuratov/data/ods_shards/merged_txt')

In [5]:
sp.SentencePieceTrainer.Train(input=list(save_path.glob('*.txt')), vocab_size=50259,
                              pad_id=0, eos_id=1, unk_id=2, bos_id=-1,
                              model_prefix='ods_data_50259_sp_1M_speaker_tokens',
                              user_defined_symbols=['<speaker1>','<speaker2>'],
                              train_extremely_large_corpus=True, # to run on full data
                              input_sentence_size=1000000, # sample 1M sentences to use less RAM; the full data needs ~630 GB
                              )

In [7]:
sp_tokenizer = sp.SentencePieceProcessor(model_file='./ods_data_50259_sp_dgx3_spkr_tokens.model')

In [25]:
sp_tokenizer.tokenize('<speaker1> Привет!')

[5, 3, 655, 11]

In [29]:
sp_tokenizer.decode_ids([5, 3, 655, 11])

'<speaker1> Привет!'