<h1> Tokenizer </h1>

Notes from:

https://colab.research.google.com/drive/1aFgzrUv3udM_gNJNUoLaHIm78QHtxdIz?usp=sharing#scrollTo=tA4HMrnFJ33e

The SentencePiece tokenizer is included in SpeechBrain.


In [1]:
import os
import re
import sentencepiece as spm
from speechbrain.tokenizers.SentencePiece import SentencePiece
from speechbrain.utils.data_utils import get_all_files, download_file

In [7]:
project_dir = "/mnt/c/Users/Gaia/Documents/Schoolwork/ML4S/ICNALE_SM_2.0_A"
transcript_dir = "./ICNALE_Spoken_Monologue_2.0_Transcripts/Unmerged_classified/ICNALE_SM_ENS_XXX_NX00"
os.chdir(project_dir)

In [8]:
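# Quick vocabulary and token count across the corpus

# match_and expects a list of patterns that must all occur in the file path
trans_files = get_all_files(transcript_dir, match_and=['.txt'])

def build_vocab(trans_files):
    """Return ({index: token}, total token count) over all transcript files."""
    vocab = {}
    seen = set()  # set membership is O(1); scanning vocab.values() is O(n)
    token_count = 0
    for file in trans_files:
        with open(file, encoding='utf-8') as f:
            text = f.read()
        # Drop the byte-order mark and flatten newlines to spaces
        text = text.replace('\ufeff', '').replace('\n', ' ')
        # Split on spaces, strip surrounding punctuation, and lowercase
        tokens = [a.strip('.,- ').lower() for a in text.split(' ') if a != '']
        token_count += len(tokens)
        for token in set(tokens):
            if token not in seen:
                seen.add(token)
                vocab[len(vocab)] = token
    return vocab, token_count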


In [12]:
vocab, token_count = build_vocab(trans_files)
print(len(vocab), token_count)

3781 92431


For the tokenizer, we want something that can generalize well to words the model has not seen, so that the model can (in theory) perform well on the test set. 
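
One rough way to put a number on this is the out-of-vocabulary (OOV) rate: the share of test tokens that never occur in the training data. The sketch below assumes whitespace-split, lowercased tokens similar to those in `build_vocab`. The names `tokenize` and `oov_rate` and the toy sentences are made up for illustration and are not part of the project.

```python
def tokenize(text):
    """Split on whitespace, strip surrounding punctuation, lowercase, drop empties."""
    tokens = (t.strip('.,- ').lower() for t in text.split())
    return [t for t in tokens if t]


def oov_rate(train_texts, test_texts):
    """Return the fraction of test tokens that never occur in the training texts."""
    train_vocab = {tok for text in train_texts for tok in tokenize(text)}
    test_tokens = [tok for text in test_texts for tok in tokenize(text)]
    if not test_tokens:
        return 0.0
    unseen = sum(1 for tok in test_tokens if tok not in train_vocab)
    return unseen / len(test_tokens)


# Toy data (hypothetical, not taken from ICNALE)
train = ["I think because you make me", "I think it is good"]
test = ["I think smoking is bad"]
print(oov_rate(train, test))  # 2 of the 5 test tokens are unseen
```

With a word-level vocabulary every unseen word counts as OOV, whereas a sub-word tokenizer can still represent it as a sequence of smaller pieces.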

We should also think about the probability distribution across the tokens. With a vocabulary of 7k, I suspect the model will default to the most frequent words if we simply take the probability distribution over all possible tokens. Worse, tokens that never appear in training will get zero prior probability and will never be output at test time.

This is a further problem because learners generally draw on a narrow vocabulary, so a few words ('I', 'because', 'um', 'and') will dominate the distribution. We may have to account for this with some kind of smoothing somewhere in the model to boost low-frequency tokens during prediction.
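
To make both concerns concrete, here is a minimal sketch on a toy token list. It measures how much of the data the top-k word types cover, then compares unsmoothed relative frequencies with additive (add-alpha) smoothing. Additive smoothing is only one possible reading of "smoothing" here; in a neural model the closer analogue would probably be label smoothing in the loss. The function names and the toy sample are assumptions for illustration.

```python
from collections import Counter


def top_k_coverage(tokens, k):
    """Fraction of all token occurrences accounted for by the k most frequent types."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return sum(c for _, c in counts.most_common(k)) / total


def smoothed_probs(tokens, vocab, alpha=1.0):
    """Add-alpha smoothing over vocab; alpha=0 gives plain relative frequencies."""
    counts = Counter(tokens)
    denom = len(tokens) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / denom for w in vocab}


# Toy learner-like sample (hypothetical): a few words dominate
tokens = "i think um i think because um i and i".split()
vocab = sorted(set(tokens) | {"conjecture"})  # one type never seen in training

print(top_k_coverage(tokens, k=2))
mle = smoothed_probs(tokens, vocab, alpha=0.0)
smooth = smoothed_probs(tokens, vocab, alpha=1.0)
print(mle["conjecture"], smooth["conjecture"])
```

Without smoothing the unseen type gets probability zero. With alpha = 1 it gets a small nonzero share, taken from the frequent words.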

Below are my experiments with the SentencePiece tokenizer, using the wrapper that ships with SpeechBrain.

In [13]:
sp = spm.SentencePieceProcessor()

In [14]:
#creates, trains, and saves tokenizer
SentencePiece(
    model_dir = 'tokenizers',
    vocab_size = 26**2, #experiment with values here
    annotation_train = './data/training.json', #.json train manifest
    annotation_read = 'words', #key from .json with val text string
    model_type = 'bpe',
    annotation_format = 'json'
)


sp.load('tokenizers/676_bpe.model') #path to tokenizer model

#test on common words and pseudowords, and
# demonstrate ID encoding (for use with model)

#with bpe vocab size = 676
print(sp.encode_as_pieces('I think because you make me'), '\n\n', 
      sp.encode_as_ids('I think because you make me'), 
      '\n\n', sp.encode_as_pieces('wordlable jibberish cojecture'))

['▁I', '▁think', '▁because', '▁you', '▁make', '▁me'] 

 [25, 87, 133, 43, 426, 188] 

 ['▁wor', 'd', 'l', 'able', '▁j', 'ib', 'b', 'er', 'is', 'h', '▁co', 'j', 'ec', 'ture']


sentencepiece_trainer.cc(177) LOG(INFO) Running command: --input=./data/training.txt --model_prefix=tokenizers/676_bpe --model_type=bpe --bos_id=-1 --eos_id=-1 --pad_id=-1 --unk_id=0 --max_sentencepiece_length=10 --character_coverage=1.0 --add_dummy_prefix=True --vocab_size=676
sentencepiece_trainer.cc(77) LOG(INFO) Starts training with : 
trainer_spec {
  input: ./data/training.txt
  input_format: 
  model_prefix: tokenizers/676_bpe
  model_type: BPE
  vocab_size: 676
  self_test_sample_size: 0
  character_coverage: 1
  input_sentence_size: 0
  shuffle_input_sentence: 1
  seed_sentencepiece_size: 1000000
  shrinking_factor: 0.75
  max_sentence_length: 4192
  num_threads: 16
  num_sub_iterations: 2
  max_sentencepiece_length: 10
  split_by_unicode_script: 1
  split_by_number: 1
  split_by_whitespace: 1
  split_digits: 0
  treat_whitespace_as_suffix: 0
  allow_whitespace_only_pieces: 0
  required_chars: 
  byte_fallback: 0
  vocabulary_output_piece_score: 1
  train_extremely_large_corpu

In [16]:
#bpe vocab size = 2600

SentencePiece(
    model_dir = 'tokenizers',
    vocab_size = 26*100, #experiment with values here
    annotation_train = './data/training.json', #.json train manifest
    annotation_read = 'words', #key from .json with val text string
    model_type = 'bpe',
    annotation_format = 'json'
)

sp.load('tokenizers/2600_bpe.model')

print(sp.encode_as_pieces('I think because you make me'), '\n\n',  
      '\n\n', sp.encode_as_pieces('wordlable jibberish cojecture'))

['▁I', '▁think', '▁because', '▁you', '▁make', '▁me'] 

 

 ['▁wor', 'd', 'l', 'able', '▁j', 'ib', 'ber', 'ish', '▁co', 'j', 'ec', 'ture']


sentencepiece_trainer.cc(177) LOG(INFO) Running command: --input=./data/training.txt --model_prefix=tokenizers/2600_bpe --model_type=bpe --bos_id=-1 --eos_id=-1 --pad_id=-1 --unk_id=0 --max_sentencepiece_length=10 --character_coverage=1.0 --add_dummy_prefix=True --vocab_size=2600

In [18]:

#unigram vocab size = 676

SentencePiece(
    model_dir = 'tokenizers',
    vocab_size = 26**2, #experiment with values here
    annotation_train = './data/training.json', #.json train manifest
    annotation_read = 'words', #key from .json with val text string
    model_type = 'unigram',
    annotation_format = 'json'
)

sp.load('tokenizers/676_unigram.model')

print(sp.encode_as_pieces('I think because you make me'), '\n\n',  
      '\n\n', sp.encode_as_pieces('wordlable jibberish cojecture'))

['▁I', '▁think', '▁because', '▁you', '▁make', '▁me'] 

 

 ['▁w', 'or', 'd', 'l', 'able', '▁', 'j', 'i', 'b', 'b', 'er', 'i', 's', 'h', '▁co', 'j', 'e', 'c', 't', 'u', 're']


sentencepiece_trainer.cc(177) LOG(INFO) Running command: --input=./data/training.txt --model_prefix=tokenizers/676_unigram --model_type=unigram --bos_id=-1 --eos_id=-1 --pad_id=-1 --unk_id=0 --max_sentencepiece_length=10 --character_coverage=1.0 --add_dummy_prefix=True --vocab_size=676

In [19]:
SentencePiece(
    model_dir = 'tokenizers',
    vocab_size = 26*100, #What would be a good value here?
    annotation_train = './data/training.json', #.json train manifest
    annotation_read = 'words', #key with text string
    model_type = 'unigram',
    annotation_format = 'json'
)

#unigram with 2600 vocab size 
sp.load('tokenizers/2600_unigram.model')
print(sp.encode_as_pieces('I think because you make me'), '\n\n',
sp.encode_as_pieces('wordlable jibberish cojecture'))

['▁I', '▁think', '▁because', '▁you', '▁make', '▁me'] 

 ['▁wor', 'd', 'l', 'able', '▁', 'j', 'i', 'b', 'b', 'er', 'ish', '▁co', 'j', 'e', 'c', 't', 'ure']


sentencepiece_trainer.cc(177) LOG(INFO) Running command: --input=./data/training.txt --model_prefix=tokenizers/2600_unigram --model_type=unigram --bos_id=-1 --eos_id=-1 --pad_id=-1 --unk_id=0 --max_sentencepiece_length=10 --character_coverage=1.0 --add_dummy_prefix=True --vocab_size=2600

In [20]:
SentencePiece(
    model_dir = 'tokenizers',
    vocab_size = 3174, #This is 'max' value 
    annotation_train = './data/training.json', #.json train manifest
    annotation_read = 'words', #key with text string
    model_type = 'unigram',
    annotation_format = 'json'
)

#unigram with vocab size 3174
sp.load('tokenizers/3174_unigram.model')
print(sp.encode_as_pieces('I think because you make me'), '\n\n',
sp.encode_as_pieces('wordlable jibberish cojecture'))

['▁I', '▁think', '▁because', '▁you', '▁make', '▁me'] 

 ['▁word', 'l', 'able', '▁j', 'i', 'b', 'be', 'r', 'ish', '▁co', 'j', 'e', 'c', 't', 'ure']


sentencepiece_trainer.cc(177) LOG(INFO) Running command: --input=./data/training.txt --model_prefix=tokenizers/3174_unigram --model_type=unigram --bos_id=-1 --eos_id=-1 --pad_id=-1 --unk_id=0 --max_sentencepiece_length=10 --character_coverage=1.0 --add_dummy_prefix=True --vocab_size=3174

In [22]:
SentencePiece(
    model_dir = 'tokenizers',
    vocab_size = 3174, #This is 'max' value 
    annotation_train = './data/training.json', #.json train manifest
    annotation_read = 'words', #key with text string
    model_type = 'bpe',
    annotation_format = 'json'
)

#bpe with vocab size 3174
sp.load('tokenizers/3174_bpe.model')
print(sp.encode_as_pieces('I think because you make me'), '\n\n',
sp.encode_as_pieces('wordlable jibberish cojecture'))

['▁I', '▁think', '▁because', '▁you', '▁make', '▁me'] 

 ['▁wor', 'd', 'l', 'able', '▁j', 'ib', 'ber', 'ish', '▁co', 'j', 'ec', 'ture']


sentencepiece_trainer.cc(177) LOG(INFO) Running command: --input=./data/training.txt --model_prefix=tokenizers/3174_bpe --model_type=bpe --bos_id=-1 --eos_id=-1 --pad_id=-1 --unk_id=0 --max_sentencepiece_length=10 --character_coverage=1.0 --add_dummy_prefix=True --vocab_size=3174

The max-vocabulary models (3174) give the best results on the test strings above, but a large vocabulary may slow down training. So far, I recommend either unigram or BPE with a vocabulary of 2600. Both capture some morphological features (affixes) as well as the most frequent whole words, while keeping the vocabulary size fairly low.

This might help the model make connections between phonemes and sub-word segments or syllables, rather than learning only at the whole-word level.


I am not sure how this affects the search space during decoding. There is probably a tradeoff: sub-word tokenization yields longer sequences, but there are fewer (softmax?) probabilities to assign at each sequence position.
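
As a crude way to look at this tradeoff, the sketch below takes the segmentations of 'wordlable jibberish cojecture' printed in the cells above and multiplies sequence length by vocabulary size, i.e. the number of output scores a plain softmax would produce over the whole sequence. This is only a proxy under strong assumptions: it ignores beam width, the model architecture, and the fact that real words are split far less than these pseudowords. The helper name `softmax_scores` is hypothetical.

```python
def softmax_scores(pieces, vocab_size):
    """Total output scores: one per vocabulary entry at each sequence position."""
    return len(pieces) * vocab_size


# Pieces copied from the notebook outputs above
segmentations = {
    ("bpe", 676): ['▁wor', 'd', 'l', 'able', '▁j', 'ib', 'b', 'er', 'is', 'h',
                   '▁co', 'j', 'ec', 'ture'],
    ("bpe", 2600): ['▁wor', 'd', 'l', 'able', '▁j', 'ib', 'ber', 'ish',
                    '▁co', 'j', 'ec', 'ture'],
    ("bpe", 3174): ['▁wor', 'd', 'l', 'able', '▁j', 'ib', 'ber', 'ish',
                    '▁co', 'j', 'ec', 'ture'],
    ("unigram", 676): ['▁w', 'or', 'd', 'l', 'able', '▁', 'j', 'i', 'b', 'b',
                       'er', 'i', 's', 'h', '▁co', 'j', 'e', 'c', 't', 'u', 're'],
    ("unigram", 2600): ['▁wor', 'd', 'l', 'able', '▁', 'j', 'i', 'b', 'b', 'er',
                        'ish', '▁co', 'j', 'e', 'c', 't', 'ure'],
    ("unigram", 3174): ['▁word', 'l', 'able', '▁j', 'i', 'b', 'be', 'r', 'ish',
                        '▁co', 'j', 'e', 'c', 't', 'ure'],
}

for (model_type, vocab_size), pieces in segmentations.items():
    total = softmax_scores(pieces, vocab_size)
    print(f"{model_type:8s} vocab={vocab_size:5d} len={len(pieces):3d} scores={total}")
```

On this one example the smaller vocabularies produce longer sequences but still fewer total scores, so for these sizes the vocabulary term dominates.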

Also, my vocabulary sizes are just guesses and are not really informed. If 2600 turns out to be too big, we can trim it down; 1200 or 1600 might be good choices. If we have time, it might be fun to test different tokenization approaches in our pipeline and see how they affect output and computation time.
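
A small harness along these lines could make that comparison repeatable. It is a sketch only: `compare_tokenizers` is a hypothetical helper, and the two stand-in encoders exist so the code runs without trained models. In the notebook, each entry would instead be something like the `encode_as_pieces` method of a loaded `SentencePieceProcessor`. Note that it times tokenization only, not model training.

```python
import time


def compare_tokenizers(encoders, sentences):
    """For each named encoder, report mean pieces per whitespace word and elapsed time."""
    n_words = sum(len(s.split()) for s in sentences)
    results = {}
    for name, encode in encoders.items():
        start = time.perf_counter()
        n_pieces = sum(len(encode(s)) for s in sentences)
        elapsed = time.perf_counter() - start
        results[name] = {
            "pieces_per_word": n_pieces / n_words,
            "seconds": elapsed,
        }
    return results


# Stand-in encoders (hypothetical): whole words and single characters
encoders = {
    "word": lambda s: s.split(),
    "char": lambda s: list(s.replace(" ", "▁")),
}
sentences = ["I think because you make me", "wordlable jibberish cojecture"]

for name, stats in compare_tokenizers(encoders, sentences).items():
    print(name, round(stats["pieces_per_word"], 2), f"{stats['seconds']:.6f}s")
```

Pieces per word gives a single number for how much longer the sequences get as the vocabulary shrinks, which can then be set against training time for each candidate size.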