# 🛠️ Building a tokenizer, block by block

A tokenizer is built from modular components:
- Normalization
- Pre-tokenization
- Model/trainer setup
- Post-processing (special tokens, templates)
- Decoding
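
To make the first two blocks in the list above concrete, here is a minimal standard-library sketch of normalization and pre-tokenization. It is not how the Tokenizers library implements them; the function names `normalize` and `pre_tokenize` and the regular expression are illustrative assumptions.

```python
import re
import unicodedata

def normalize(text):
    """Toy normalizer: NFD decomposition, lowercasing, then accent removal."""
    decomposed = unicodedata.normalize("NFD", text)
    lowered = decomposed.lower()
    # After NFD, accents are separate combining marks (Unicode category "Mn").
    return "".join(ch for ch in lowered if unicodedata.category(ch) != "Mn")

def pre_tokenize(text):
    """Toy pre-tokenizer: runs of word characters, or single punctuation marks."""
    # Assumption: "\w+" approximates a word; the library's rules differ in edge cases.
    return re.findall(r"\w+|[^\w\s]", text)

print(normalize("Héllò hôw are ü?"))
print(pre_tokenize("Let's test my pre-tokenizer."))
```

The model then only ever sees these small pre-tokenized pieces, which is why the choice of normalizer and pre-tokenizer shapes the final vocabulary.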

Let’s assemble WordPiece, BPE (byte-level), and Unigram tokenizers from scratch with the Tokenizers library!

Install the Datasets, Evaluate, and Transformers libraries to run this notebook. The Tokenizers library is installed as a dependency of Transformers.

In [None]:
!pip install datasets evaluate transformers[sentencepiece]

## 1️⃣ Acquire Training Corpus (WikiText-2 for Demo Speed)

Load and batch a public corpus as input for training tokenizers.


In [None]:
from datasets import load_dataset

dataset = load_dataset("wikitext", name="wikitext-2-raw-v1", split="train")

def get_training_corpus():
    # Yield the corpus in batches of 1,000 texts
    for i in range(0, len(dataset), 1000):
        yield dataset[i : i + 1000]["text"]

# Optionally, save the texts to a file for direct training
with open("wikitext-2.txt", "w", encoding="utf-8") as f:
    for i in range(len(dataset)):
        f.write(dataset[i]["text"] + "\n")
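
The batching pattern used by the generator above can be sketched on a plain list, without downloading anything. The helper name `batch_iterator` and the toy data are assumptions for illustration only.

```python
def batch_iterator(texts, batch_size=1000):
    """Yield consecutive slices of `texts`, each at most `batch_size` long."""
    for start in range(0, len(texts), batch_size):
        yield texts[start:start + batch_size]

# Toy stand-in for the dataset's "text" column.
toy_texts = [f"line {i}" for i in range(5)]
batches = list(batch_iterator(toy_texts, batch_size=2))
print(batches)
```

A generator can only be consumed once, which is why the notebook calls `get_training_corpus()` again each time it trains a new tokenizer.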


## 2️⃣ Build a WordPiece (BERT-style) Tokenizer, Step by Step

Let’s assemble the tokenizer’s main blocks one at a time: model, normalizer, pre-tokenizer, and trainer here, then the post-processor and decoder in the next two sections.


In [None]:
from tokenizers import (
    Tokenizer,
    decoders,
    models,
    normalizers,
    pre_tokenizers,
    processors,
    trainers,
)

# 2.1 Model: WordPiece with an unknown token
tokenizer = Tokenizer(models.WordPiece(unk_token="[UNK]"))

# 2.2 Normalization: lowercase and strip accents
tokenizer.normalizer = normalizers.Sequence([
    normalizers.NFD(),           # Unicode decomposition, so accents become separate marks
    normalizers.Lowercase(),     # Lowercase
    normalizers.StripAccents(),  # Remove the accent marks
])

print(tokenizer.normalizer.normalize_str("Héllò hôw are ü?"))  # Should print: hello how are u?

# 2.3 Pre-tokenizer: split on whitespace, then isolate punctuation
tokenizer.pre_tokenizer = pre_tokenizers.Sequence([
    pre_tokenizers.WhitespaceSplit(),
    pre_tokenizers.Punctuation(),
])
print(tokenizer.pre_tokenizer.pre_tokenize_str("Let's test my pre-tokenizer."))

# 2.4 Trainer: WordPieceTrainer
special_tokens = ["[UNK]", "[PAD]", "[CLS]", "[SEP]", "[MASK]"]
trainer = trainers.WordPieceTrainer(vocab_size=25000, special_tokens=special_tokens)

# 2.5 Train from iterator
tokenizer.train_from_iterator(get_training_corpus(), trainer=trainer)
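
Once a vocabulary exists, a WordPiece model splits each pre-tokenized word by greedy longest-match-first lookup. The sketch below shows that idea on a hand-written toy vocabulary; `wordpiece_encode` and `toy_vocab` are hypothetical names, and the library's own implementation may differ in details.

```python
def wordpiece_encode(word, vocab, unk_token="[UNK]", prefix="##"):
    """Greedy longest-match-first split of a single pre-tokenized word."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink from the right.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = prefix + candidate  # continuation pieces carry the prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece fits at this position: the whole word becomes unknown.
            return [unk_token]
        pieces.append(match)
        start = end
    return pieces

toy_vocab = {"token", "##izer", "##ize", "##r", "test"}
print(wordpiece_encode("tokenizer", toy_vocab))
print(wordpiece_encode("xyz", toy_vocab))
```

This is why the model needs an `unk_token`: a word with no matching pieces has to map to something.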

## 3️⃣ Post-Processing Template for Special Tokens

Add [CLS]/[SEP] formatting for single and sentence-pair cases.


In [None]:
cls_token_id = tokenizer.token_to_id("[CLS]")
sep_token_id = tokenizer.token_to_id("[SEP]")
print(cls_token_id, sep_token_id)

tokenizer.post_processor = processors.TemplateProcessing(
    single="[CLS]:0 $A:0 [SEP]:0",
    pair="[CLS]:0 $A:0 [SEP]:0 $B:1 [SEP]:1",
    special_tokens=[("[CLS]", cls_token_id), ("[SEP]", sep_token_id)],
)

# 3.1 Test the template
encoding = tokenizer.encode("Let's test this tokenizer.")
print(encoding.tokens)

encoding = tokenizer.encode("Let's test this tokenizer...", "on a pair of sentences.")
print(encoding.tokens)
print(encoding.type_ids)
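
The effect of the two template strings can be mimicked with plain lists. This is a rough sketch of what the template expresses, not the library's code; `apply_bert_template` is a hypothetical helper.

```python
def apply_bert_template(tokens_a, tokens_b=None):
    """Wrap one or two token lists in [CLS]/[SEP] and build matching type IDs."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"]
    type_ids = [0] * len(tokens)  # first sentence and its special tokens get type 0
    if tokens_b is not None:
        tokens += tokens_b + ["[SEP]"]
        type_ids += [1] * (len(tokens_b) + 1)  # second sentence and its [SEP] get type 1
    return tokens, type_ids

tokens, type_ids = apply_bert_template(["let", "'", "s"], ["on", "a", "pair"])
print(tokens)
print(type_ids)
```

The `:0` and `:1` suffixes in the template correspond to the values placed in `type_ids` here.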

## 4️⃣ Add a Decoder

Transform token IDs back to human-readable text.


In [None]:
tokenizer.decoder = decoders.WordPiece(prefix="##")
decoded = tokenizer.decode(encoding.ids)
print(decoded)  # Should print: "let's test this tokenizer... on a pair of sentences."
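
The core of WordPiece decoding is gluing continuation pieces back onto the previous token. A simplified sketch follows, with `wordpiece_decode` as a hypothetical name. It deliberately omits the spacing cleanup around punctuation that the expected output above relies on, and it assumes special tokens have already been removed.

```python
def wordpiece_decode(tokens, prefix="##"):
    """Join tokens with spaces, attaching prefixed pieces to the preceding token."""
    text = ""
    for token in tokens:
        if token.startswith(prefix):
            text += token[len(prefix):]  # continuation: no space, drop the prefix
        elif text:
            text += " " + token          # new word: separate with a space
        else:
            text = token                 # very first token
    return text

print(wordpiece_decode(["token", "##izer", "works"]))
```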

## 5️⃣ Save & Reload

Save to JSON, reload to verify portability.


In [None]:
tokenizer.save("tokenizer.json")
# Tokenizer is already imported above, so no aliased re-import is needed
new_tokenizer = Tokenizer.from_file("tokenizer.json")


## 6️⃣ Make Transformer-Compatible: PreTrainedTokenizerFast

Wrap your custom tokenizer so it can be used with the 🤗 Transformers API.


In [None]:
from transformers import PreTrainedTokenizerFast

wrapped_tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=tokenizer,
    unk_token="[UNK]",
    pad_token="[PAD]",
    cls_token="[CLS]",
    sep_token="[SEP]",
    mask_token="[MASK]",
)

## 7️⃣ Build a ByteLevel BPE Tokenizer (GPT-2)

Skip the normalizer (GPT-2 does not use one), use a byte-level pre-tokenizer with a BPE trainer, then add a byte-level post-processor and decoder.


In [None]:
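# Model: BPE; no unknown token is set because the byte-level pre-tokenizer maps all text to known byte symbols
tokenizer = Tokenizer(models.BPE())
# Pre-tokenizer: byte-level, without adding a space at the start of the text
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
trainer = trainers.BpeTrainer(vocab_size=25000, special_tokens=["<|endoftext|>"])
tokenizer.train_from_iterator(get_training_corpus(), trainer=trainer)
# Post-processor and decoder: byte-level counterparts of the pre-tokenizer
tokenizer.post_processor = processors.ByteLevel(trim_offsets=False)
tokenizer.decoder = decoders.ByteLevel()

encoding = tokenizer.encode("Let's test this tokenizer.")
print(encoding.tokens)
print(tokenizer.decode(encoding.ids))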
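
What the BPE trainer learns can be illustrated with a toy merge loop over raw UTF-8 bytes. This is a simplified sketch under stated assumptions, not the library's trainer: the helpers `most_frequent_pair` and `merge_pair` are hypothetical, and word frequencies and the library's printable byte alphabet are ignored.

```python
from collections import Counter

def most_frequent_pair(sequences):
    """Count adjacent symbol pairs across all sequences; return the most common."""
    pairs = Counter()
    for seq in sequences:
        pairs.update(zip(seq, seq[1:]))
    return pairs.most_common(1)[0][0] if pairs else None

def merge_pair(seq, pair):
    """Replace every occurrence of `pair` in `seq` with one merged symbol."""
    merged, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            merged.append(seq[i] + seq[i + 1])
            i += 2
        else:
            merged.append(seq[i])
            i += 1
    return merged

# Start from single bytes, so any input text is representable.
words = ["low", "lower", "lowest"]
sequences = [[bytes([b]) for b in w.encode("utf-8")] for w in words]

for _ in range(2):  # learn two merges
    pair = most_frequent_pair(sequences)
    sequences = [merge_pair(seq, pair) for seq in sequences]

print(sequences)
```

After two merges the shared prefix has become a single symbol in all three words, which is the behavior a larger `vocab_size` extends to longer and rarer strings.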

## 8️⃣ Build a Unigram Tokenizer (XLNet)

Reuse the same building blocks in XLNet style: a normalizer, a Metaspace pre-tokenizer, a UnigramTrainer, and a post-processing template that places the special tokens at the end of the sequence.


In [None]:
from tokenizers import Regex

tokenizer = Tokenizer(models.Unigram())

tokenizer.normalizer = normalizers.Sequence([
    normalizers.Replace("``", '"'),
    normalizers.Replace("''", '"'),
    normalizers.NFKD(),
    normalizers.StripAccents(),
    normalizers.Replace(Regex(" {2,}"), " "),
])
tokenizer.pre_tokenizer = pre_tokenizers.Metaspace()
special_tokens = ["<cls>", "<sep>", "<unk>", "<mask>", "<s>", "</s>"]
trainer = trainers.UnigramTrainer(
    vocab_size=25000, special_tokens=special_tokens, unk_token="<unk>"
)
tokenizer.train_from_iterator(get_training_corpus(), trainer=trainer)
encoding = tokenizer.encode("Let's test this tokenizer.")
print(encoding.tokens)

cls_token_id = tokenizer.token_to_id("<cls>")
sep_token_id = tokenizer.token_to_id("<sep>")
# XLNet-style template: <cls> goes at the end and gets its own type ID (2)
tokenizer.post_processor = processors.TemplateProcessing(
    single="$A:0 <sep>:0 <cls>:2",
    pair="$A:0 <sep>:0 $B:1 <sep>:1 <cls>:2",
    special_tokens=[("<sep>", sep_token_id), ("<cls>", cls_token_id)],
)
encoding = tokenizer.encode("Let's test this tokenizer...", "on a pair of sentences!")
print(encoding.tokens)
print(encoding.type_ids)

tokenizer.decoder = decoders.Metaspace()
print(tokenizer.decode(encoding.ids))
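
At encoding time, a Unigram model chooses the segmentation whose pieces have the highest combined probability. The sketch below shows that search with a small Viterbi-style dynamic program. The probabilities are made up for illustration, `unigram_segment` is a hypothetical name, and the library's implementation differs in many details.

```python
import math

def unigram_segment(word, log_probs, unk_token="<unk>"):
    """Return the highest-scoring split of `word` into pieces from `log_probs`."""
    n = len(word)
    best_score = [0.0] + [-math.inf] * n  # best log-probability of word[:i]
    best_start = [0] * (n + 1)            # where the last piece of that best split starts
    for end in range(1, n + 1):
        for start in range(end):
            piece = word[start:end]
            if piece in log_probs:
                score = best_score[start] + log_probs[piece]
                if score > best_score[end]:
                    best_score[end] = score
                    best_start[end] = start
    if best_score[n] == -math.inf:
        return [unk_token]  # no way to cover the word with known pieces
    # Walk backwards through the recorded start positions to recover the pieces.
    pieces, end = [], n
    while end > 0:
        start = best_start[end]
        pieces.append(word[start:end])
        end = start
    return pieces[::-1]

# Made-up probabilities, purely for illustration.
toy_probs = {"u": 0.05, "n": 0.05, "i": 0.05, "g": 0.05, "r": 0.05, "a": 0.05,
             "m": 0.05, "un": 0.1, "uni": 0.2, "gram": 0.15}
toy_log_probs = {piece: math.log(p) for piece, p in toy_probs.items()}

print(unigram_segment("unigram", toy_log_probs))
print(unigram_segment("xyz", toy_log_probs))
```

Unlike the greedy WordPiece lookup, this search compares whole segmentations, so a shorter first piece can win if it leads to a better overall split.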

## ✅ Summary

- Build and mix tokenizer components: normalization, pre-tokenization, model/trainer, post-processing, decoding.
- Try any model: WordPiece (BERT), ByteLevel BPE (GPT-2), Unigram (XLNet).
- Save, reload, and wrap for use with Transformers.