### Analyzing the dataset

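The dataset is named `news-commentary-v6` and contains text data in five languages (……). The cell below prints the first 200 characters of the English file: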

In [6]:
from pathlib import Path
DATA_DIR = Path("./data")
DATASET_NAME = 'news-commentary-v6'
lang = 'en'

with open(DATA_DIR / f'{DATASET_NAME}.{lang}', encoding='utf-8') as f:
    text = f.read()

print(text[:200])

Musharraf's Last Act?
Desperate to hold onto power, Pervez Musharraf has discarded Pakistan's constitutional framework and declared a state of emergency.
His goal?
To stifle the independent judiciary 


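As the output shows, the data is organized as one sentence per line, so the whole file can be treated as a single string both for training the tokenizers and for tokenizing.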
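
Before training, it can help to confirm the one-sentence-per-line layout and get a feel for the corpus size. The following is a minimal sketch; `describe_corpus` and the `sample` string are hypothetical names introduced here for illustration, and the sample is not the real data.

```python
def describe_corpus(text):
    """Return simple size statistics for a one-sentence-per-line corpus."""
    # Count only non-empty lines, since each one is expected to hold a sentence.
    lines = [line for line in text.splitlines() if line.strip()]
    return {
        "num_lines": len(lines),
        "num_chars": len(text),
        "num_utf8_bytes": len(text.encode("utf-8")),
    }


# Hypothetical sample in the same layout as the file (not the real data).
sample = "His goal?\nTo stifle the independent judiciary.\n"
stats = describe_corpus(sample)
print(stats)
```

In the notebook, calling `describe_corpus(text)` on the string read above would report the same statistics for the full English file.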

### Training the tokenizers


In [5]:
from itertools import product
from tokenizers import Tokenizer
from tokenizers.models import WordPiece, BPE, Unigram
from tokenizers.trainers import WordPieceTrainer, BpeTrainer, UnigramTrainer
from tokenizers.pre_tokenizers import Whitespace

VOCAB_SIZE = [1000, 3000, 5000]
NAME_TEMPLATE = "{model}-{vocab_size}.json"
TOKENIZER_DIR = Path("./tokenizers")
TOKENIZER_DIR.mkdir(exist_ok=True)
models = [WordPiece, BPE, Unigram]
trainers = [WordPieceTrainer, BpeTrainer, UnigramTrainer]


for vocab_size, (model, trainer_cls) in product(VOCAB_SIZE, zip(models, trainers)):
    print(f"Training {model.__name__} with vocab size {vocab_size}")
    # Start from an empty model and split on whitespace/punctuation before training.
    tokenizer = Tokenizer(model())
    tokenizer.pre_tokenizer = Whitespace()
    # Instantiate the matching trainer without shadowing the loop variable.
    trainer = trainer_cls(vocab_size=vocab_size, special_tokens=["[UNK]"])
    tokenizer.train([str(DATA_DIR / f'{DATASET_NAME}.{lang}')], trainer)
    # Save each tokenizer under a name such as "BPE-1000.json".
    file_name = NAME_TEMPLATE.format(model=model.__name__, vocab_size=vocab_size)
    tokenizer.save(str(TOKENIZER_DIR / file_name))

Training WordPiece with vocab size 1000
Training BPE with vocab size 1000
Training Unigram with vocab size 1000
Training WordPiece with vocab size 3000
Training BPE with vocab size 3000
Training Unigram with vocab size 3000
Training WordPiece with vocab size 5000
Training BPE with vocab size 5000
Training Unigram with vocab size 5000
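
The loop above walks through every combination of vocabulary size and model, with the vocabulary size as the outer dimension. The sketch below reproduces that ordering using plain strings in place of the model classes, and checks which of the expected files are present. The names `MODEL_NAMES`, `VOCAB_SIZES`, `expected_tokenizer_files` and `missing_tokenizer_files` are introduced here for illustration only.

```python
from itertools import product
from pathlib import Path

# Stand-ins for the model classes; only their names matter for the file names.
MODEL_NAMES = ["WordPiece", "BPE", "Unigram"]
VOCAB_SIZES = [1000, 3000, 5000]
NAME_TEMPLATE = "{model}-{vocab_size}.json"


def expected_tokenizer_files():
    """List the file names in the order the training loop writes them."""
    return [
        NAME_TEMPLATE.format(model=name, vocab_size=size)
        for size, name in product(VOCAB_SIZES, MODEL_NAMES)
    ]


def missing_tokenizer_files(directory):
    """Return the expected file names that do not exist in `directory`."""
    return [
        name
        for name in expected_tokenizer_files()
        if not (Path(directory) / name).exists()
    ]


files = expected_tokenizer_files()
print(len(files))
print(missing_tokenizer_files("./tokenizers"))
```

After a complete training run, the second `print` should show an empty list; any names it does list correspond to configurations that were not saved.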




### Computing the compression ratio
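
The metric used in the cell below is the number of UTF-8 bytes in the text divided by the number of token IDs, so a higher value means each token covers more of the input. The following sketch works through that arithmetic on a tiny example. It uses plain whitespace splitting as a stand-in tokenization, not one of the trained tokenizers, and `bytes_per_token` is a hypothetical helper name.

```python
def bytes_per_token(text, tokens):
    """UTF-8 bytes in `text` divided by the number of tokens."""
    if not tokens:
        raise ValueError("cannot compute a ratio for zero tokens")
    return len(text.encode("utf-8")) / len(tokens)


# Stand-in tokenization: plain whitespace splitting, NOT a trained tokenizer.
ascii_text = "His goal?"
ratio = bytes_per_token(ascii_text, ascii_text.split())
print(ratio)  # 9 bytes / 2 tokens = 4.5

# Non-ASCII characters take more than one byte in UTF-8, so the byte count
# can exceed the character count.
accented = "café au lait"
print(len(accented), len(accented.encode("utf-8")))  # 12 characters, 13 bytes
```

Measuring in bytes rather than characters keeps the ratio comparable across languages whose characters need more than one byte in UTF-8.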

In [8]:
def compute_compression_ratio(text, tokenizer):
    """Return the number of UTF-8 bytes per token ID for `text`."""
    num_utf8_bytes = len(text.encode('utf-8'))
    encoded_ids = tokenizer.encode(text)
    num_encoded_ids = len(encoded_ids.ids)
    return num_utf8_bytes / num_encoded_ids

for vocab_size, model in product(VOCAB_SIZE, models):
    file_name = NAME_TEMPLATE.format(model=model.__name__, vocab_size=vocab_size)
    tokenizer = Tokenizer.from_file(str(TOKENIZER_DIR / file_name))
    ratio = compute_compression_ratio(text, tokenizer)
    print(f"{file_name}: {ratio:.2f}")

WordPiece-1000.json: 2.73
BPE-1000.json: 2.99
Unigram-1000.json: 2.69
WordPiece-3000.json: 3.76
BPE-3000.json: 3.92
Unigram-3000.json: 3.48
WordPiece-5000.json: 4.23
BPE-5000.json: 4.35
Unigram-5000.json: 3.75
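
To compare the results more easily, the sketch below arranges the ratios printed above into a table by vocabulary size and model, and picks the model with the highest ratio for each size. The numbers are copied from the output above; the helper names `format_table` and `best_model_per_vocab` are introduced here for illustration. Note that these ratios are measured on the same English file the tokenizers were trained on.

```python
# Ratios copied from the notebook output above, keyed by (model, vocab size).
RATIOS = {
    ("WordPiece", 1000): 2.73, ("BPE", 1000): 2.99, ("Unigram", 1000): 2.69,
    ("WordPiece", 3000): 3.76, ("BPE", 3000): 3.92, ("Unigram", 3000): 3.48,
    ("WordPiece", 5000): 4.23, ("BPE", 5000): 4.35, ("Unigram", 5000): 3.75,
}
MODELS = ["WordPiece", "BPE", "Unigram"]


def format_table(ratios):
    """Render one row per vocabulary size and one column per model."""
    sizes = sorted({size for _, size in ratios})
    header = f"{'vocab':>6} " + " ".join(f"{m:>10}" for m in MODELS)
    rows = [
        f"{size:>6} " + " ".join(f"{ratios[(m, size)]:>10.2f}" for m in MODELS)
        for size in sizes
    ]
    return "\n".join([header, *rows])


def best_model_per_vocab(ratios):
    """Map each vocabulary size to the model with the highest ratio."""
    best = {}
    for (model, size), ratio in ratios.items():
        if size not in best or ratio > ratios[(best[size], size)]:
            best[size] = model
    return best


table = format_table(RATIOS)
best = best_model_per_vocab(RATIOS)
print(table)
print(best)
```

In these measurements the ratio grows with the vocabulary size for all three models, and BPE has the highest ratio at each size.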
