# 🏷️ WordPiece tokenization

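WordPiece is the subword tokenization algorithm Google developed to pretrain BERT, and BERT-derived models such as DistilBERT and MobileBERT use it too.
Its training resembles BPE, but it picks each merge with a scoring function rather than raw pair frequency, and it encodes words by greedy longest-match against the vocabulary.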

Install the Transformers, Datasets, and Evaluate libraries to run this notebook.

In [None]:
!pip install datasets evaluate transformers[sentencepiece]

## 1️⃣ WordPiece Training Algorithm: How It Works

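- Start with a base vocabulary: the special tokens plus the initial alphabet.
- Mark every character that is not word-initial with a prefix (`##` for BERT) so continuation subwords can be told apart from word starts.
- Score every adjacent pair of tokens:
  `score = freq(pair) / (freq(first) * freq(second))`
  Dividing by the frequencies of the two parts favors pairs whose parts rarely occur apart.
- Merge the best-scoring pair, add the merged token to the vocabulary, and repeat until the target vocabulary size is reached.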
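
To see how the score differs from plain pair frequency, here is a minimal sketch on a hypothetical toy set of word counts (`toy_word_freqs` and `split_word` are illustrative names, not part of this notebook's pipeline). It computes both the most frequent pair and the best-scoring pair.

```python
from collections import defaultdict

# Hypothetical toy word counts, chosen only to illustrate the formula.
toy_word_freqs = {"hug": 10, "pug": 5, "pun": 12, "bun": 4, "hugs": 5}

def split_word(word):
    """First character as-is, every later character with the '##' prefix."""
    return [c if i == 0 else f"##{c}" for i, c in enumerate(word)]

token_freqs = defaultdict(int)
pair_freqs = defaultdict(int)
for word, freq in toy_word_freqs.items():
    pieces = split_word(word)
    for piece in pieces:
        token_freqs[piece] += freq
    for first, second in zip(pieces, pieces[1:]):
        pair_freqs[(first, second)] += freq

# WordPiece score: pair frequency normalized by the frequencies of both parts.
scores = {
    pair: freq / (token_freqs[pair[0]] * token_freqs[pair[1]])
    for pair, freq in pair_freqs.items()
}

most_frequent = max(pair_freqs, key=pair_freqs.get)
best_scored = max(scores, key=scores.get)
print("most frequent pair:", most_frequent, pair_freqs[most_frequent])
print("best-scoring pair: ", best_scored, round(scores[best_scored], 3))
```

In this toy data the most frequent pair is `('##u', '##g')`, but `##u` occurs in every word, so its score is low. The pair `('##g', '##s')` wins because `##s` never appears without `##g` in front of it.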


## 2️⃣ Example Corpus and Pre-tokenization

We reuse the corpus and build word frequencies just like the BPE example.


In [None]:
corpus = [
    "This is the Hugging Face Course.",
    "This chapter is about tokenization.",
    "This section shows several tokenizer algorithms.",
    "Hopefully, you will be able to understand how they are trained and generate tokens.",
]

In [None]:
from transformers import AutoTokenizer
from collections import defaultdict

# Pre-tokenize with a WordPiece-style (BERT) tokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
word_freqs = defaultdict(int)
for text in corpus:
    words_with_offsets = tokenizer.backend_tokenizer.pre_tokenizer.pre_tokenize_str(text)
    for word, _offset in words_with_offsets:
        word_freqs[word] += 1

print(word_freqs)  # maps each pre-tokenized word to its count in the corpus


## 3️⃣ Build the Initial Alphabet

The initial alphabet contains every character that starts a word, kept as-is, plus every character that appears later in a word, prefixed with `##`.


In [None]:
alphabet = []
for word in word_freqs.keys():
    if word[0] not in alphabet:
        alphabet.append(word[0])
    for letter in word[1:]:
        # Membership test on the prefixed token, so each continuation character is added once
        if f"##{letter}" not in alphabet:
            alphabet.append(f"##{letter}")

alphabet.sort()
print(alphabet)  # e.g. ['##a', '##b', ..., 'C', 'F', ..., 'a', 'b', 'c']
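
As an alternative, the same alphabet can be collected in a set, which rules out duplicates by construction. This is a minimal sketch on a hypothetical word list (`toy_words` stands in for `word_freqs.keys()`), not a replacement for the cell above.

```python
# Hypothetical word list standing in for word_freqs.keys().
toy_words = ["This", "is", "the", "this"]

alphabet_set = set()
for word in toy_words:
    alphabet_set.add(word[0])                        # word-initial character, no prefix
    alphabet_set.update(f"##{c}" for c in word[1:])  # continuation characters

toy_alphabet = sorted(alphabet_set)
print(toy_alphabet)
```

Because `#` sorts before letters, the `##` tokens come first, then uppercase initials, then lowercase ones.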

## 4️⃣ Start Vocabulary with Special Tokens and Initial Alphabet

The vocabulary starts with BERT's special tokens, followed by the initial alphabet.


In [None]:
vocab = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"] + alphabet.copy()


## 5️⃣ Split Words for Training

Each word is split into characters: the first stays as-is, and every later character gets the `##` prefix.


In [None]:
splits={
    word:[c if i==0 else f"##{c}" for i,c in enumerate(word)]
    for word in word_freqs.keys()
}
splits

## 6️⃣ Compute Pair Scores

Track letter frequencies and pair frequencies. Score each pair as freq(pair) / (freq(first) * freq(second)).
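
Written out, with all frequencies weighted by word counts, the score of a pair $(a, b)$ is:

$$
\mathrm{score}(a, b) = \frac{\mathrm{freq}(ab)}{\mathrm{freq}(a)\,\mathrm{freq}(b)}
$$

As a worked example, with counts tallied by hand from the corpus above: `T` occurs 3 times (always in "This"), `##h` occurs 8 times, and the two are adjacent 3 times, which gives

$$
\mathrm{score}(\texttt{T}, \texttt{\#\#h}) = \frac{3}{3 \times 8} = 0.125
$$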


In [None]:
def compute_pair_scores(splits):
    letter_freqs = defaultdict(int)
    pair_freqs = defaultdict(int)
    for word, freq in word_freqs.items():
        split = splits[word]
        if len(split) == 1:
            letter_freqs[split[0]] += freq
            continue
        for i in range(len(split) - 1):
            pair = (split[i], split[i + 1])
            letter_freqs[split[i]] += freq
            pair_freqs[pair] += freq
        letter_freqs[split[-1]] += freq

    scores = {
        pair: freq / (letter_freqs[pair[0]] * letter_freqs[pair[1]])
        for pair, freq in pair_freqs.items()
    }
    return scores

pair_scores = compute_pair_scores(splits)
for pair in list(pair_scores)[:6]:
    print(f"{pair}: {pair_scores[pair]}")
# Example: ('T', '##h'): 0.125, ('##h', '##i'): 0.03, etc.


In [None]:
best_pair = ""
max_score = None
for pair, score in pair_scores.items():
    if max_score is None or max_score < score:
        best_pair = pair
        max_score = score

print(best_pair, max_score)

In [None]:
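# Preview the token this merge creates; the training loop below does the actual merge and append.
print(best_pair[0] + best_pair[1][2:])  # the second element of a pair always carries the '##' prefix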

## 7️⃣ Training Loop: Select and Merge Best Pairs (by Score)

Repeat until the vocabulary reaches the target size: recompute the scores, pick the best-scoring pair, merge it in every split, and add the new token to the vocabulary.


In [None]:
def merge_pair(a, b, splits):
    for word in word_freqs:
        split = splits[word]
        if len(split) == 1:
            continue
        i = 0
        while i < len(split) - 1:
            if split[i] == a and split[i + 1] == b:
                merge = a + b[2:] if b.startswith("##") else a + b
                split = split[:i] + [merge] + split[i + 2 :]
            else:
                i += 1
        splits[word] = split
    return splits

vocab_size = 70
while len(vocab) < vocab_size:
    scores = compute_pair_scores(splits)
    best_pair, max_score = "", None
    for pair, score in scores.items():
        if max_score is None or max_score < score:
            best_pair = pair
            max_score = score
    splits = merge_pair(*best_pair, splits)
    new_token = (
        best_pair[0] + best_pair[1][2:]
        if best_pair[1].startswith("##")
        else best_pair[0] + best_pair[1]
    )
    vocab.append(new_token)
print(vocab)  # All learned tokens
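
The naming rule for merged tokens is easy to miss: the `##` of the second piece is dropped, so a merged token carries the prefix at most once, and only when its first piece is itself a continuation. Here is a minimal sketch of that rule on a single split. `merge_tokens` and `apply_merge` are hypothetical helpers that return a new list instead of updating `splits` in place.

```python
def merge_tokens(first, second):
    """Join two adjacent pieces, dropping the '##' of the second one."""
    return first + (second[2:] if second.startswith("##") else second)

def apply_merge(split, first, second):
    """Return a new split with every adjacent (first, second) replaced by the merged token."""
    merged, i = [], 0
    while i < len(split):
        if i < len(split) - 1 and split[i] == first and split[i + 1] == second:
            merged.append(merge_tokens(first, second))
            i += 2  # skip both pieces that were just merged
        else:
            merged.append(split[i])
            i += 1
    return merged

print(merge_tokens("h", "##u"))    # word-initial piece + continuation
print(merge_tokens("##g", "##s"))  # continuation + continuation
print(apply_merge(["h", "##u", "##g", "##s"], "##g", "##s"))
```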


## 8️⃣ WordPiece Tokenization Algorithm

To tokenize a word, find the longest prefix that is in the vocabulary, emit it, prefix the remainder with `##`, and repeat. If at any point no prefix matches, the whole word is tokenized as `[UNK]`.


In [None]:
def encode_word(word):
    tokens = []
    while len(word) > 0:
        i = len(word)
        while i > 0 and word[:i] not in vocab:
            i -= 1
        if i == 0:
            return ["[UNK]"]
        tokens.append(word[:i])
        word = word[i:]
        if len(word) > 0:
            word = f"##{word}"
    return tokens

# Test with known and unknown words
print(encode_word("Hugging"))  # ['Hugg', '##i', '##n', '##g']
print(encode_word("HOgging"))  # ['[UNK]']
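
To try the greedy longest-match step without running the training loop first, the sketch below takes the vocabulary as an explicit argument. `encode_word_with` and `toy_vocab` are hypothetical names, and the vocabulary is a hand-picked set rather than a trained one. Using a set also makes each membership test constant-time, unlike a lookup in the `vocab` list.

```python
def encode_word_with(word, vocab_set, unk="[UNK]"):
    """Greedy longest-match-first encoding of one word against vocab_set."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # non-initial pieces carry the prefix
            if candidate in vocab_set:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # one unmatched position makes the whole word unknown
        tokens.append(piece)
        start = end
    return tokens

toy_vocab = {"Hugg", "H", "##i", "##n", "##g", "##u"}
print(encode_word_with("Hugging", toy_vocab))
print(encode_word_with("Hugs", toy_vocab))
```

The second call matches `H`, `##u`, and `##g` but finds nothing for the final `s`, so the result is a single `[UNK]` rather than a partial tokenization.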



## 9️⃣ Full Text Tokenization

Tokenize a sentence by pre-tokenizing and then applying the WordPiece strategy to each word.


In [None]:
def tokenize(text):
    # Same pre-tokenizer as in training, reached through the public backend_tokenizer attribute
    pre_tokenize_result = tokenizer.backend_tokenizer.pre_tokenizer.pre_tokenize_str(text)
    pre_tokenized_text = [word for word, offset in pre_tokenize_result]
    encoded_words = [encode_word(word) for word in pre_tokenized_text]
    return sum(encoded_words, [])  # flatten the per-word token lists

print(tokenize("This is the Hugging Face course!"))
# '!' never appears in the training corpus, so it is tokenized as '[UNK]'
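
The two stages (pre-tokenize, then encode each word) can also be sketched without the Transformers dependency. Everything here is an assumption made for illustration. The regex in `simple_pre_tokenize` only approximates BERT's pre-tokenizer (it treats `_` as a word character, for example), and `toy_encode` is a hypothetical stand-in for `encode_word`.

```python
import re
from itertools import chain

def simple_pre_tokenize(text):
    """Rough stand-in for BERT pre-tokenization: runs of word characters, or single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)

def tokenize_with(text, encode):
    """Pre-tokenize, encode each word, and flatten the per-word token lists."""
    return list(chain.from_iterable(encode(word) for word in simple_pre_tokenize(text)))

def toy_encode(word):
    """Hypothetical per-word encoder backed by a fixed lookup table."""
    known = {"Hi": ["Hi"], "there": ["the", "##re"]}
    return known.get(word, ["[UNK]"])

print(simple_pre_tokenize("Hi there!"))
print(tokenize_with("Hi there!", toy_encode))
```

`chain.from_iterable` flattens in a single pass, whereas `sum(lists, [])` copies the accumulated list at every step.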


## ✅ Summary

- WordPiece builds its vocabulary by scoring possible merges, not just raw frequency.
- Tokenization uses a greedy, longest-match-first search for known subwords, with '##' indicating word continuations.
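- If any part of a word cannot be matched, the whole word becomes `[UNK]`. BPE would instead mark only the characters missing from its vocabulary as unknown.
- A fixed-size subword vocabulary can represent most words of English and many other languages without one entry per word, which is why WordPiece is central to models like BERT.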
