## normalizer
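- Cleans up the raw corpus before tokenization, for example by applying Unicode normalization and stripping accents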

In [2]:
# -------------------------------- normalizer ---------------------------------- #

from tokenizers import normalizers
from tokenizers.normalizers import NFD, StripAccents
normalizer = normalizers.Sequence([NFD(), StripAccents()])

normalizer.normalize_str("Héllò hôw are ü?")
# "Hello how are u?"

'Hello how are u?'
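To make the effect of `NFD` followed by `StripAccents` more concrete, here is a minimal sketch using only the standard library. It is an approximation, not the library's implementation: the helper name `strip_accents_nfd` is hypothetical, and it assumes that stripping accents amounts to dropping every character in a Unicode "Mark" category after NFD decomposition.

```python
import unicodedata


def strip_accents_nfd(text: str) -> str:
    """Approximate NFD + StripAccents (hypothetical helper, not the tokenizers API)."""
    # NFD splits an accented letter into a base letter plus combining marks,
    # e.g. "é" becomes "e" followed by U+0301.
    decomposed = unicodedata.normalize("NFD", text)
    # Assumption: accents are the characters whose category starts with "M" (marks).
    return "".join(ch for ch in decomposed if not unicodedata.category(ch).startswith("M"))


print(strip_accents_nfd("Héllò hôw are ü?"))
# Hello how are u?
```

The order matters: without the NFD step, "é" is a single precomposed code point with no separate mark to remove.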

In [None]:
# A normalizer can be assigned to a tokenizer
# tokenizer.normalizer = normalizer

## pre-tokenizer
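- The pre-tokenizer splits the text into smaller pieces (roughly "words"). These pieces set an upper bound on the final tokens: every final token is either a whole piece or a part of one
    - For example, BPE only merges symbols inside a piece, so a learned token never spans two pieces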

In [3]:
from tokenizers.pre_tokenizers import Whitespace
pre_tokenizer = Whitespace()
pre_tokenizer.pre_tokenize_str("Hello! How are you? I'm fine, thank you.")
# [("Hello", (0, 5)), ("!", (5, 6)), ("How", (7, 10)), ("are", (11, 14)), ("you", (15, 18)),
#  ("?", (18, 19)), ("I", (20, 21)), ("'", (21, 22)), ('m', (22, 23)), ("fine", (24, 28)),
#  (",", (28, 29)), ("thank", (30, 35)), ("you", (36, 39)), (".", (39, 40))]

[('Hello', (0, 5)),
 ('!', (5, 6)),
 ('How', (7, 10)),
 ('are', (11, 14)),
 ('you', (15, 18)),
 ('?', (18, 19)),
 ('I', (20, 21)),
 ("'", (21, 22)),
 ('m', (22, 23)),
 ('fine', (24, 28)),
 (',', (28, 29)),
 ('thank', (30, 35)),
 ('you', (36, 39)),
 ('.', (39, 40))]

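- The pre-tokenizer returns the start and end offsets of each piece in the original string
- Whitespace splits on whitespace characters (space, tab, newline, and so on) and also separates punctuation from word characters, which is why "Hello" and "!" are separate pieces above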
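The splitting and the offsets can be imitated with the standard `re` module. This is only a sketch: it assumes the pattern `\w+|[^\w\s]+`, which the tokenizers documentation gives for `Whitespace`, and the helper name `whitespace_pre_tokenize` is hypothetical.

```python
import re

# Assumed pattern: a run of word characters, or a run of characters that are
# neither word characters nor whitespace (i.e. punctuation).
WHITESPACE_PATTERN = re.compile(r"\w+|[^\w\s]+")


def whitespace_pre_tokenize(text: str) -> list[tuple[str, tuple[int, int]]]:
    """Return (piece, (start, end)) pairs, mimicking Whitespace().pre_tokenize_str."""
    return [(m.group(), m.span()) for m in WHITESPACE_PATTERN.finditer(text)]


pieces = whitespace_pre_tokenize("Hello! How are you?")
print(pieces)
# [('Hello', (0, 5)), ('!', (5, 6)), ('How', (7, 10)), ('are', (11, 14)), ('you', (15, 18)), ('?', (18, 19))]
```

Each offset pair indexes back into the input, so `text[start:end]` recovers the piece.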

Use pre_tokenizers.Sequence to combine several pre-tokenizers; they are applied in the order given

In [4]:
from tokenizers import pre_tokenizers
from tokenizers.pre_tokenizers import Digits
pre_tokenizer = pre_tokenizers.Sequence([Whitespace(), Digits(individual_digits=True)]) # pre-tokenizers run in the listed order
pre_tokenizer.pre_tokenize_str("Call 911!")
# [("Call", (0, 4)), ("9", (5, 6)), ("1", (6, 7)), ("1", (7, 8)), ("!", (8, 9))]

[('Call', (0, 4)), ('9', (5, 6)), ('1', (6, 7)), ('1', (7, 8)), ('!', (8, 9))]
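The chaining can be sketched in plain Python: each step receives the pieces produced by the previous step and may split them further, while offsets stay relative to the original string. All names below (`whitespace_split`, `individual_digits_split`, `run_sequence`) are hypothetical, and the regular expressions are assumptions that approximate `Whitespace` and `Digits(individual_digits=True)` rather than reproduce the library's code.

```python
import re


def whitespace_split(text, base=0):
    # Approximates Whitespace: word runs or punctuation runs.
    return [(m.group(), (base + m.start(), base + m.end()))
            for m in re.finditer(r"\w+|[^\w\s]+", text)]


def individual_digits_split(text, base=0):
    # Approximates Digits(individual_digits=True): every digit becomes its own
    # piece, while runs of non-digits stay together.
    return [(m.group(), (base + m.start(), base + m.end()))
            for m in re.finditer(r"\d|\D+", text)]


def run_sequence(text, steps):
    """Apply each splitting step, in order, to the pieces of the previous step."""
    pieces = [(text, (0, len(text)))]
    for step in steps:
        # `start` shifts the offsets of sub-pieces back into the original string.
        pieces = [out for piece, (start, _) in pieces for out in step(piece, start)]
    return pieces


print(run_sequence("Call 911!", [whitespace_split, individual_digits_split]))
# [('Call', (0, 4)), ('9', (5, 6)), ('1', (6, 7)), ('1', (7, 8)), ('!', (8, 9))]
```

In this sketch the first step yields "Call", "911" and "!", and the second step splits only "911" further.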

In [None]:
# Assign the pre-tokenizer to a tokenizer
# tokenizer.pre_tokenizer = pre_tokenizer

## model
- There are four model types:
    - models.BPE
    - models.Unigram
    -