# Tokenizer from scratch

In [34]:
from tokenizers import (ByteLevelBPETokenizer, 
                        CharBPETokenizer, 
                        SentencePieceBPETokenizer)

example_str = ['努力超越自我',
               '希望是梦想的一道光',
               '冒然行动是不可取的',
               '动我心房，酸我眼眶，一生的伤',
               '嘌呤、脂肪酸']

## Sentence Piece tokenizer

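The `SentencePiece` tokenizer learns frequent character sequences from the training text, splits each input sentence into those learned units, and maps every unit to an integer ID. The `▁` symbol in the output marks the start of a whitespace-delimited word.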

In [35]:
tokenizer = SentencePieceBPETokenizer()

tokenizer.train(['./allCh.txt'], vocab_size = 20000)

for msg in example_str: 
    output = tokenizer.encode(msg)
    print(output.ids, output.tokens, output.offsets)


[87, 1101, 3775, 2039] ['▁', '努力', '超越', '自我'] [(0, 1), (0, 2), (2, 4), (4, 6)]
[2127, 556, 1093, 1048, 929, 199] ['▁希望', '是', '梦想', '的一', '道', '光'] [(0, 2), (2, 3), (3, 5), (5, 7), (7, 8), (8, 9)]
[3257, 2040, 556, 1834, 274, 700] ['▁然', '行动', '是', '不可', '取', '的'] [(0, 1), (1, 3), (3, 4), (4, 6), (6, 7), (7, 8)]
[87, 243, 492, 459, 495, 11, 937, 492, 706, 1145, 1886, 165] ['▁', '动', '我', '心', '房', ',', '酸', '我', '眼', ',一', '生的', '伤'] [(0, 1), (0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 6), (6, 7), (7, 8), (8, 10), (10, 12), (12, 13)]
[87, 89, 937] ['▁', '、', '酸'] [(0, 1), (0, 1), (1, 2)]


Notice that words appearing more frequently in the training set (e.g. 努力: 46 times, 超越: 3 times, 自我: 9 times, 希望: 43 times, 梦想: 48 times, 行动: 7 times) are more likely to be grouped into a single token by the `SentencePiece` tokenizer. Words that rarely or never appear (e.g. 心房: 0 times, 眼眶: 0 times) are not grouped together.

Frequently appearing sequences that are not words (e.g. 的一: 80 times) are also grouped together by the tokenizer.
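The grouping behaviour described above follows from how BPE training works: it repeatedly merges the most frequent adjacent pair, whether or not that pair is a real word. Below is a minimal toy sketch of one merge step. It is not the `tokenizers` implementation, and the corpus and its counts are made up for illustration.

```python
from collections import Counter

def count_pairs(words):
    """Count adjacent symbol pairs, weighted by how often each sequence occurs."""
    pairs = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

# Hypothetical toy corpus: character sequences with made-up counts.
words = {tuple('我的一生'): 5, tuple('他的一天'): 4, tuple('努力'): 3}

best_pair, best_count = count_pairs(words).most_common(1)[0]
print(best_pair, best_count)  # ('的', '一') 9

words = merge_pair(words, best_pair)
print(list(words))
```

In this toy corpus the non-word 的一 is merged before the real word 努力, simply because it occurs more often.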

Characters that never appear in the training set cannot be learned, so the tokenizer does not recognize them (e.g. 冒 in 冒然, and 嘌呤) and they are missing from the output.

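Open question: why does the tokenizer fail to recognize 脂肪, which does appear in the dataset? One plausible cause, not verified here, is the `limit_alphabet` argument of `train`, which as far as I know defaults to 1000: characters outside the most frequent 1000 are left out of the initial alphabet, and a Chinese corpus can easily contain more distinct characters than that.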
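One way to investigate a character that appears in the corpus but is still dropped is to rank all characters by frequency and see which ones fall outside a size-limited alphabet. The sketch below assumes the tokenizer keeps only the `limit` most frequent characters. The helper name `outside_alphabet` and the toy corpus are hypothetical, and the cutoff is an assumption about the library's behaviour rather than something confirmed in this notebook.

```python
from collections import Counter

def outside_alphabet(corpus, sample, limit):
    """Return the characters of `sample` that are not among the
    `limit` most frequent non-whitespace characters of `corpus`."""
    counts = Counter(ch for ch in corpus if not ch.isspace())
    kept = {ch for ch, _ in counts.most_common(limit)}
    return [ch for ch in sample if ch not in kept]

# Toy corpus: 的 appears 3 times, 一 twice, 我 once, 冒 never.
toy_corpus = '的的的 一一 我'
print(outside_alphabet(toy_corpus, '我的一冒', limit=2))  # ['我', '冒']
```

On the real data, this could be run with the contents of `./allCh.txt` as `corpus`, `'脂肪'` as `sample`, and the assumed alphabet limit as `limit`.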

## CharBPE tokenizer

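`CharBPETokenizer` starts from individual characters and marks the end of each word with `</w>`. In the output below every Chinese character ends up as its own token. A likely reason (not verified here) is that the default normalizer puts spaces around CJK characters, so each character is treated as a separate word and no multi-character merges are learned.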

In [36]:
tokenizer = CharBPETokenizer()

tokenizer.train(['./allCh.txt'], vocab_size = 20000)

for msg in example_str: 
    output = tokenizer.encode(msg)
    print(output.ids, output.tokens, output.offsets)

[1900, 1302, 1087, 1320, 1082, 1196] ['努</w>', '力</w>', '超</w>', '越</w>', '自</w>', '我</w>'] [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 6)]
[1084, 1242, 1437, 1401, 1225, 1283, 1143, 1270, 1862] ['希</w>', '望</w>', '是</w>', '梦</w>', '想</w>', '的</w>', '一</w>', '道</w>', '光</w>'] [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 6), (6, 7), (7, 8), (8, 9)]
[1881, 1374, 1726, 1437, 1919, 1890, 1581, 1283] ['然</w>', '行</w>', '动</w>', '是</w>', '不</w>', '可</w>', '取</w>', '的</w>'] [(1, 2), (2, 3), (3, 4), (4, 5), (5, 6), (6, 7), (7, 8), (8, 9)]
[1726, 1196, 1816, 1774, 1601, 1067, 1196, 1398, 1601, 1143, 1937, 1283, 1631] ['动</w>', '我</w>', '心</w>', '房</w>', '，</w>', '酸</w>', '我</w>', '眼</w>', '，</w>', '一</w>', '生</w>', '的</w>', '伤</w>'] [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 6), (6, 7), (7, 8), (9, 10), (10, 11), (11, 12), (12, 13), (13, 14)]
[1505, 1067] ['、</w>', '酸</w>'] [(2, 3), (5, 6)]
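The offsets in the output above skip some positions, which means some input characters produced no token at all. A small sketch for listing them, assuming the offsets are character indices into the original string (which matches the `CharBPETokenizer` output shown here). The helper name `dropped_characters` is made up for this example.

```python
def dropped_characters(text, offsets):
    """Return (index, character) pairs not covered by any token offset."""
    covered = set()
    for start, end in offsets:
        covered.update(range(start, end))
    return [(i, ch) for i, ch in enumerate(text) if i not in covered]

# Text and offsets copied from the last CharBPE example above.
text = '嘌呤、脂肪酸'
offsets = [(2, 3), (5, 6)]
print(dropped_characters(text, offsets))
# [(0, '嘌'), (1, '呤'), (3, '脂'), (4, '肪')]
```

The same check on `'动我心房，酸我眼眶，一生的伤'` with its offsets reports index 8, the character 眶.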


## ByteLevelBPE tokenizer

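`ByteLevelBPETokenizer` runs BPE over the UTF-8 bytes of the text instead of over characters. Each byte is shown as a printable stand-in character, which is why the tokens below look garbled. No input character is dropped, but a rare character can be split across several tokens (e.g. 眶 becomes `'çľ'` and `'¶'`).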

In [37]:
tokenizer = ByteLevelBPETokenizer()

tokenizer.train(['./allCh.txt'], vocab_size = 20000)

for msg in example_str: 
    output = tokenizer.encode(msg)
    print(output.ids, output.tokens, output.offsets)

[763, 4651, 2273] ['åĬªåĬĽ', 'è¶ħè¶Ĭ', 'èĩªæĪĳ'] [(0, 2), (2, 4), (4, 6)]
[801, 272, 747, 615, 427, 650] ['å¸ĮæľĽ', 'æĺ¯', 'æ¢¦æĥ³', 'çļĦä¸Ģ', 'éģĵ', 'åħī'] [(0, 2), (2, 3), (3, 5), (5, 7), (7, 8), (8, 9)]
[2903, 527, 2355, 272, 1797, 933, 260] ['åĨĴ', 'çĦ¶', 'è¡ĮåĬ¨', 'æĺ¯', 'ä¸įåı¯', 'åıĸ', 'çļĦ'] [(0, 1), (1, 2), (2, 4), (4, 5), (5, 7), (7, 8), (8, 9)]
[609, 276, 333, 1796, 258, 2005, 276, 703, 313, 114, 258, 1698, 260, 629] ['åĬ¨', 'æĪĳ', 'å¿ĥ', 'æĪ¿', 'ï¼Į', 'éħ¸', 'æĪĳ', 'çľ¼', 'çľ', '¶', 'ï¼Į', 'ä¸ĢçĶŁ', 'çļĦ', 'ä¼¤'] [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 6), (6, 7), (7, 8), (8, 9), (8, 9), (9, 10), (10, 12), (12, 13), (13, 14)]
[905, 234, 357, 97, 567, 8094, 2005] ['åĺ', 'Į', 'åĳ', '¤', 'ãĢģ', 'èĦĤèĤª', 'éħ¸'] [(0, 1), (0, 1), (1, 2), (1, 2), (2, 3), (3, 5), (5, 6)]
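The garbled-looking tokens above can be turned back into readable text. The sketch below rebuilds the byte-to-printable-character table in the style popularised by GPT-2's byte-level BPE, which, to my understanding, is the same mapping the byte-level pre-tokenizer uses. The helper names `to_visible` and `from_visible` are made up for this example.

```python
def bytes_to_unicode():
    """Map each of the 256 byte values to a printable character.
    Printable Latin-1 bytes map to themselves; the rest are shifted to 256+."""
    printable = set(range(33, 127)) | set(range(161, 173)) | set(range(174, 256))
    table, extra = {}, 0
    for b in range(256):
        if b in printable:
            table[b] = chr(b)
        else:
            table[b] = chr(256 + extra)
            extra += 1
    return table

BYTE_TO_CHAR = bytes_to_unicode()
CHAR_TO_BYTE = {c: b for b, c in BYTE_TO_CHAR.items()}

def to_visible(text):
    """Show the UTF-8 bytes of `text` as stand-in characters."""
    return ''.join(BYTE_TO_CHAR[b] for b in text.encode('utf-8'))

def from_visible(tokens):
    """Join byte-level tokens and decode them back to normal text."""
    raw = bytes(CHAR_TO_BYTE[c] for tok in tokens for c in tok)
    return raw.decode('utf-8')

print(to_visible('努力'))                       # åĬªåĬĽ
print(from_visible([to_visible('眶')]))         # 眶
```

Because a single character may be spread over several tokens, tokens should be concatenated before decoding. Decoding one partial token on its own can fail with a `UnicodeDecodeError`.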


## Experiment with Tibetan text

In [48]:
tokenizer = SentencePieceBPETokenizer()

tokenizer.train(['./bo.txt'], vocab_size = 20000)

example_str = [
   'བསྟན་བསྟན་འཛིན་སྒྲོལ་མ་རྒྱལ་པོ་']

for msg in example_str: 
    output = tokenizer.encode(msg)
    print(output.ids, output.tokens, output.offsets)

[166, 3355, 842, 1188, 8219, 143, 507] ['▁བ', 'སྟན་བ', 'སྟན་', 'འཛིན་', 'སྒྲོལ་', 'མ་', 'རྒྱལ་པོ་'] [(0, 1), (1, 6), (6, 10), (10, 15), (15, 21), (21, 23), (23, 31)]


1. Even if the words are not parsed correctly, it *might* not matter in the end (not sure)

1. Explore whether explicit rules can be specified through the tokenizer's parameters.
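As one example of an explicit rule, Tibetan syllables end with the tsheg mark (`་`, U+0F0B), so text could be split on it before training. The following is a plain-Python sketch of that rule only. The function name `split_syllables` is hypothetical, and whether such a split improves the trained tokenizer has not been tested here. The `tokenizers` library does expose configurable pre-tokenizers that might carry a rule like this.

```python
import re

TSHEG = '\u0f0b'  # Tibetan syllable delimiter

# A run of characters ending in a tsheg, or a trailing run without one.
SYLLABLE = re.compile(f'[^{TSHEG}]*{TSHEG}|[^{TSHEG}]+')

def split_syllables(text):
    """Split text after every tsheg, keeping the tsheg with its syllable."""
    return SYLLABLE.findall(text)

sample = 'བསྟན་བསྟན་འཛིན་སྒྲོལ་མ་རྒྱལ་པོ་'
parts = split_syllables(sample)
print(len(parts))
print(parts)
```

Unlike the learned tokens above, where `'སྟན་བ'` crosses a syllable boundary, every piece produced here ends exactly at a tsheg, and joining the pieces gives back the original string.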