WordPiece is the tokenization algorithm Google developed to **pretrain BERT**. It has since been reused in quite a few Transformer models **based on BERT**, such as DistilBERT, MobileBERT, Funnel Transformers, and MPNet.

How WordPiece works:

1. Like BPE, WordPiece starts from a small vocabulary including the special tokens used by the model and the initial alphabet.
2. Since it identifies subwords by adding a prefix (like ## for BERT), each word is initially split by adding that prefix to all the characters inside the word. So, for instance, "word" gets split like this: "w ##o ##r ##d". Thus, the initial alphabet contains all the characters present at the beginning of a word and the characters present inside a word preceded by the WordPiece prefix.
3. Then, again like BPE, WordPiece learns merge rules. Instead of selecting the most frequent pair, WordPiece computes a score for each pair: score = freq_of_pair / (freq_of_first_element × freq_of_second_element). Cool idea: by dividing the frequency of the pair by the product of the frequencies of each of its parts, the algorithm prioritizes the merging of pairs where the individual parts are less frequent in the vocabulary. For instance, it won’t necessarily merge ("un", "##able") even if that pair occurs very frequently in the vocabulary, because the two parts "un" and "##able" will likely each appear in a lot of other words and have a high frequency. In contrast, a pair like ("hu", "##gging") will probably be merged faster (assuming the word “hugging” appears often in the vocabulary) since "hu" and "##gging" are likely to be less frequent individually. (Source: https://huggingface.co/learn/nlp-course/chapter6/6?fw=pt)
4. Tokenization differs in WordPiece and BPE in that WordPiece only **saves the final vocabulary, not the merge rules learned.**
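
To make the scoring rule in step 3 concrete, here is a small sketch with made-up counts. The numbers and the helper name `wordpiece_score` are assumptions chosen for illustration, not measurements from a real corpus. It shows that the most frequent pair, which BPE would merge, is not necessarily the pair WordPiece picks.

```python
# Toy counts, invented for illustration only.
token_freq = {"un": 500, "##able": 400, "hu": 30, "##gging": 25}
pair_freq = {("un", "##able"): 100, ("hu", "##gging"): 20}


def wordpiece_score(pair):
    """score = freq_of_pair / (freq_of_first_element * freq_of_second_element)"""
    first, second = pair
    return pair_freq[pair] / (token_freq[first] * token_freq[second])


# BPE-style choice: the most frequent pair.
bpe_choice = max(pair_freq, key=pair_freq.get)
# WordPiece-style choice: the pair with the highest score.
wordpiece_choice = max(pair_freq, key=wordpiece_score)

print("BPE would merge:      ", bpe_choice)
print("WordPiece would merge:", wordpiece_choice)
for pair in pair_freq:
    print(pair, wordpiece_score(pair))
```

With these counts the frequent pair ("un", "##able") gets a much lower score than ("hu", "##gging"), because its parts are common on their own.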

# Implementing WordPiece

In [1]:
corpus = [
    "This is the Hugging Face Course.",
    "This chapter is about tokenization.",
    "This section shows several tokenizer algorithms.",
    "Hopefully, you will be able to understand how they are trained and generate tokens.",
]

First, we need to pre-tokenize the corpus into words:

In [2]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")



In [4]:
word_freqs = dict()

for text in corpus:
    # Split the text into words and punctuation with BERT's pre-tokenizer
    words_with_offsets = tokenizer.backend_tokenizer.pre_tokenizer.pre_tokenize_str(text)
    words = [word for word, offset in words_with_offsets]
    # Count how often each word occurs in the corpus
    for word in words:
        word_freqs[word] = word_freqs.get(word, 0) + 1

word_freqs

{'This': 3,
 'is': 2,
 'the': 1,
 'Hugging': 1,
 'Face': 1,
 'Course': 1,
 '.': 4,
 'chapter': 1,
 'about': 1,
 'tokenization': 1,
 'section': 1,
 'shows': 1,
 'several': 1,
 'tokenizer': 1,
 'algorithms': 1,
 'Hopefully': 1,
 ',': 1,
 'you': 1,
 'will': 1,
 'be': 1,
 'able': 1,
 'to': 1,
 'understand': 1,
 'how': 1,
 'they': 1,
 'are': 1,
 'trained': 1,
 'and': 1,
 'generate': 1,
 'tokens': 1}
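
The same counts can be reproduced without downloading a tokenizer. The sketch below assumes that, for this small English corpus, splitting into runs of word characters and single punctuation marks is a close enough stand-in for BERT's pre-tokenizer. The regex and the `pre_tokenize` helper are assumptions for illustration, not the library's implementation.

```python
import re
from collections import Counter

corpus = [
    "This is the Hugging Face Course.",
    "This chapter is about tokenization.",
    "This section shows several tokenizer algorithms.",
    "Hopefully, you will be able to understand how they are trained and generate tokens.",
]


def pre_tokenize(text):
    # Runs of word characters, or any single character that is neither
    # a word character nor whitespace (i.e. punctuation).
    return re.findall(r"\w+|[^\w\s]", text)


# Counter keeps keys in order of first appearance, like the dict above.
word_freqs = Counter(word for text in corpus for word in pre_tokenize(text))

print(pre_tokenize(corpus[0]))
print(len(word_freqs), "distinct words")
print(word_freqs["This"], word_freqs["."], word_freqs[","])
```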

The alphabet is the set of unique symbols made up of the first letter of every word, plus every other letter that appears inside a word, prefixed by ##:

In [11]:
alphabet = []

for word in word_freqs.keys():
    # The first character of a word is stored as-is
    if word[0] not in alphabet:
        alphabet.append(word[0])
    # Characters inside the word get the "##" prefix
    for char in word[1:]:
        letter = "##" + char
        if letter not in alphabet:
            alphabet.append(letter)

alphabet.sort()
alphabet


['##a',
 '##b',
 '##c',
 '##d',
 '##e',
 '##f',
 '##g',
 '##h',
 '##i',
 '##k',
 '##l',
 '##m',
 '##n',
 '##o',
 '##p',
 '##r',
 '##s',
 '##t',
 '##u',
 '##v',
 '##w',
 '##y',
 '##z',
 ',',
 '.',
 'C',
 'F',
 'H',
 'T',
 'a',
 'b',
 'c',
 'g',
 'h',
 'i',
 's',
 't',
 'u',
 'w',
 'y']
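
As a compact alternative, the alphabet can be derived from the initial split of each word. This is a minimal sketch on a handful of sample words; `initial_split` is a hypothetical helper name, not something defined elsewhere in this notebook.

```python
def initial_split(word):
    # First character as-is, every following character prefixed with "##".
    return [word[0]] + ["##" + ch for ch in word[1:]]


# A few sample words; in the notebook these would be the keys of word_freqs.
words = ["This", "is", "the", "word"]

# The alphabet is the sorted set of all pieces produced by the initial splits.
alphabet = sorted({piece for w in words for piece in initial_split(w)})

print(initial_split("word"))
print(alphabet)
```

Sorting places the "##"-prefixed pieces first, then uppercase letters, then lowercase letters, which is the same ordering as in the output above.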

We also add the special tokens used by the model at the beginning of that vocabulary. In the case of BERT, it’s the list ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"]:

In [114]:
vocab = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"] + alphabet.copy()

In [124]:
splits = dict()

for word in word_freqs:
    # Start with the first character, then add every other character with the "##" prefix
    splits[word] = [word[0]]
    for char in word[1:]:
        letter = "##" + char
        splits[word] = splits[word] + [letter]

splits

{'This': ['T', '##h', '##i', '##s'],
 'is': ['i', '##s'],
 'the': ['t', '##h', '##e'],
 'Hugging': ['H', '##u', '##g', '##g', '##i', '##n', '##g'],
 'Face': ['F', '##a', '##c', '##e'],
 'Course': ['C', '##o', '##u', '##r', '##s', '##e'],
 '.': ['.'],
 'chapter': ['c', '##h', '##a', '##p', '##t', '##e', '##r'],
 'about': ['a', '##b', '##o', '##u', '##t'],
 'tokenization': ['t',
  '##o',
  '##k',
  '##e',
  '##n',
  '##i',
  '##z',
  '##a',
  '##t',
  '##i',
  '##o',
  '##n'],
 'section': ['s', '##e', '##c', '##t', '##i', '##o', '##n'],
 'shows': ['s', '##h', '##o', '##w', '##s'],
 'several': ['s', '##e', '##v', '##e', '##r', '##a', '##l'],
 'tokenizer': ['t', '##o', '##k', '##e', '##n', '##i', '##z', '##e', '##r'],
 'algorithms': ['a',
  '##l',
  '##g',
  '##o',
  '##r',
  '##i',
  '##t',
  '##h',
  '##m',
  '##s'],
 'Hopefully': ['H', '##o', '##p', '##e', '##f', '##u', '##l', '##l', '##y'],
 ',': [','],
 'you': ['y', '##o', '##u'],
 'will': ['w', '##i', '##l', '##l'],
 'be': ['b', '##e

# We are ready for training

First, we need a function that computes the score of each pair. We’ll need to use this at each step of the training:

In [125]:
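# ### My bad :( First attempt, kept for reference: it counts each pair and letter once per word instead of weighting by word frequency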
# # def compute_pair_scores(splits):
# letter_freqs = dict()
# pair_frqs = dict()
# score_first = dict()
# score_second = dict()
# for keys, values in splits.items():
#     # print(values)
#     for i in range(len(values)-1):
#         # print(i)
#         # print(f'{len(values)=}')
#         pair_chr = (values[i], values[i+1])
#         chr = values[i] 

#         if pair_chr not in pair_frqs:
#             pair_frqs[pair_chr] = 1
#         else:
#             pair_frqs[pair_chr] += 1
        
#         if chr not in letter_freqs.keys():
#             letter_freqs[chr] = 1
#         else:
#             letter_freqs[chr] += 1
        
#         if i == len(values)-2:
#             # print(f'{i}')
#             chr_last = values[i+1]
#             if chr_last not in letter_freqs.keys():
#                 letter_freqs[chr_last] = 1
#             else:
#                 letter_freqs[chr_last] += 1

    
# for keys_freq in pair_frqs.keys():
#     for _, values in splits.items():
#         for i in range(len(values)-1):
#             if keys_freq[0] == values[i] and keys_freq[1] == values[i+1]:
#                 if keys_freq not in score_first:
#                     score_first[keys_freq] = 1
#                 else:
#                     score_first[keys_freq] += 1
    
# for keys in score_first.keys():
#         score_second[keys] = score_first[keys] / (letter_freqs[keys[0]] * letter_freqs[keys[1]])

# pair_max = ""
# max_score = None
# for keys , values in score_second.items():
#     if max_score is None or max_score < values:
#         max_score = values
#         pair_max = keys[0] + keys[1]

# print(pair_max)
# print(max_score)
# score_second

In [126]:
# :D
def compute_pair_scores(splits):
    # Relies on the global word_freqs built earlier
    letter_freq = dict()
    pair_freq = dict()
    scores = dict()

    for word, freq in word_freqs.items():
        split = splits[word]
        # A word that is already a single token has no pairs to score
        if len(split) == 1:
            continue
        for i in range(len(split) - 1):
            pair = (split[i], split[i + 1])
            # Weight every pair and every token by the frequency of the word
            pair_freq[pair] = pair_freq.get(pair, 0) + freq
            letter_freq[split[i]] = letter_freq.get(split[i], 0) + freq
        # The loop above stops before the last token, so count it here
        letter_freq[split[-1]] = letter_freq.get(split[-1], 0) + freq

    # score = freq_of_pair / (freq_of_first_element * freq_of_second_element)
    for pair, freq in pair_freq.items():
        scores[pair] = freq / (letter_freq[pair[0]] * letter_freq[pair[1]])

    return scores


In [127]:
pair_scores = compute_pair_scores(splits)
for i, key in enumerate(pair_scores.keys()):
    print(f"{key}: {pair_scores[key]}")
    if i >= 5:
        break

('T', '##h'): 0.125
('##h', '##i'): 0.03409090909090909
('##i', '##s'): 0.02727272727272727
('i', '##s'): 0.1
('t', '##h'): 0.03571428571428571
('##h', '##e'): 0.011904761904761904


In [128]:
# Find the pair with the best score

pair_max = ""
max_score = None
for pair, score in pair_scores.items():
    if max_score is None or max_score < score:
        max_score = score
        pair_max = pair

print(pair_max)
print(max_score)


('a', '##b')
0.2
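
As a sanity check, the first score printed earlier can be recomputed by hand from the word counts: "T" only starts "This" (3 occurrences), "##h" appears inside "This", "the", "chapter", "shows", "algorithms" and "they" (3 + 1 + 1 + 1 + 1 + 1 = 8), and the pair ("T", "##h") occurs 3 times. The sketch below also shows `max` with a key function as a shorter way to pick the best pair. The dictionary only copies the six scores printed earlier, so its maximum is not the maximum over all pairs.

```python
# The six scores printed earlier, copied here as sample data.
sample_scores = {
    ("T", "##h"): 0.125,
    ("##h", "##i"): 0.03409090909090909,
    ("##i", "##s"): 0.02727272727272727,
    ("i", "##s"): 0.1,
    ("t", "##h"): 0.03571428571428571,
    ("##h", "##e"): 0.011904761904761904,
}

# Hand check: freq of the pair = 3, freq of "T" = 3, freq of "##h" = 8.
hand_computed = 3 / (3 * 8)
print(hand_computed == sample_scores[("T", "##h")])

# max() returns the first key with the highest score, which matches the
# behaviour of the explicit loop with a strict "<" comparison.
best_pair = max(sample_scores, key=sample_scores.get)
print(best_pair, sample_scores[best_pair])
```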


In [129]:
def merge_pair(a, b, splits):
    # Merged token: drop the "##" prefix of the second part when gluing the two together
    merge = a + b[2:] if b.startswith("##") else a + b
    for word in word_freqs:
        split = splits[word]
        if len(split) == 1:
            continue
        i = 0
        while i < len(split) - 1:
            if split[i] == a and split[i + 1] == b:
                # Replace the tokens at positions i and i+1 with the merged token
                split = split[:i] + [merge] + split[i + 2:]
            else:
                i += 1

        splits[word] = split
    return splits

In [95]:
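# Left commented out: running it updates splits in place.
# The output below shows splits["about"] after the merge.
# splits = merge_pair("a", "##b", splits)
# splits["about"]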

['ab', '##o', '##u', '##t']
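
To try the merge step in isolation, here is a sketch of a variant that works on a single split and returns a new list instead of updating the `splits` dictionary. The name `merge_in_split` is hypothetical, and this is an illustration of the idea rather than a drop-in replacement for `merge_pair`.

```python
def merge_in_split(split, a, b):
    """Return a new split with every adjacent (a, b) replaced by the merged token."""
    # Drop the "##" prefix of the second part when gluing the two together.
    merged = a + b[2:] if b.startswith("##") else a + b
    result = []
    i = 0
    while i < len(split):
        if i < len(split) - 1 and split[i] == a and split[i + 1] == b:
            result.append(merged)
            i += 2  # skip both parts of the merged pair
        else:
            result.append(split[i])
            i += 1
    return result


print(merge_in_split(["a", "##b", "##o", "##u", "##t"], "a", "##b"))
print(merge_in_split(["H", "##u", "##g", "##g", "##i", "##n", "##g"], "##g", "##g"))
```

The second call shows that a merge inside a word keeps the prefix: "##g" and "##g" become "##gg", and the final "##g" is left alone because it has no partner.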

Now we have everything we need to loop until we have learned all the merges we want.

In [130]:
vocab = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"] + alphabet.copy()
vocab_size = 70

while len(vocab) < vocab_size:
    scores = compute_pair_scores(splits)
    # Pick the pair with the highest score (the first one wins on ties)
    best_pair, max_score = "", None
    for pair, score in scores.items():
        if max_score is None or max_score < score:
            best_pair = pair
            max_score = score
    # Apply the merge to every word, then add the merged token to the vocabulary
    splits = merge_pair(*best_pair, splits)
    new_token = (
        best_pair[0] + best_pair[1][2:]
        if best_pair[1].startswith("##")
        else best_pair[0] + best_pair[1]
    )
    vocab.append(new_token)


In [131]:
vocab

['[PAD]',
 '[UNK]',
 '[CLS]',
 '[SEP]',
 '[MASK]',
 '##a',
 '##b',
 '##c',
 '##d',
 '##e',
 '##f',
 '##g',
 '##h',
 '##i',
 '##k',
 '##l',
 '##m',
 '##n',
 '##o',
 '##p',
 '##r',
 '##s',
 '##t',
 '##u',
 '##v',
 '##w',
 '##y',
 '##z',
 ',',
 '.',
 'C',
 'F',
 'H',
 'T',
 'a',
 'b',
 'c',
 'g',
 'h',
 'i',
 's',
 't',
 'u',
 'w',
 'y',
 'ab',
 '##fu',
 'Fa',
 'Fac',
 '##ct',
 '##ful',
 '##full',
 '##fully',
 'Th',
 'ch',
 '##hm',
 'cha',
 'chap',
 'chapt',
 '##thm',
 'Hu',
 'Hug',
 'Hugg',
 'sh',
 'th',
 'is',
 '##thms',
 '##za',
 '##zat',
 '##ut']
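
The cells above depend on notebook globals (`word_freqs`, `splits`, `tokenizer`). As a recap, here is a self-contained sketch of the whole training procedure in one place. It assumes a regex stand-in for BERT's pre-tokenizer, and the function names and signatures are choices made for this sketch, not an official API.

```python
import re
from collections import Counter

SPECIAL_TOKENS = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"]


def pre_tokenize(text):
    # Assumption: words and single punctuation marks approximate BERT's pre-tokenizer.
    return re.findall(r"\w+|[^\w\s]", text)


def compute_pair_scores(word_freqs, splits):
    token_freq, pair_freq = Counter(), Counter()
    for word, freq in word_freqs.items():
        split = splits[word]
        if len(split) == 1:  # single-token words have no pairs
            continue
        for token in split:
            token_freq[token] += freq
        for pair in zip(split, split[1:]):
            pair_freq[pair] += freq
    return {
        pair: freq / (token_freq[pair[0]] * token_freq[pair[1]])
        for pair, freq in pair_freq.items()
    }


def merge_pair(a, b, splits):
    merged = a + b[2:] if b.startswith("##") else a + b
    for word, split in splits.items():
        i = 0
        while i < len(split) - 1:
            if split[i] == a and split[i + 1] == b:
                split = split[:i] + [merged] + split[i + 2:]
            else:
                i += 1
        splits[word] = split  # same keys, so updating while iterating is safe
    return merged


def train_wordpiece(corpus, vocab_size):
    word_freqs = Counter(w for text in corpus for w in pre_tokenize(text))
    splits = {w: [w[0]] + ["##" + c for c in w[1:]] for w in word_freqs}
    alphabet = sorted({t for split in splits.values() for t in split})
    vocab = SPECIAL_TOKENS + alphabet
    while len(vocab) < vocab_size:
        scores = compute_pair_scores(word_freqs, splits)
        if not scores:  # every word is a single token: nothing left to merge
            break
        best = max(scores, key=scores.get)  # first pair with the highest score
        vocab.append(merge_pair(*best, splits))
    return vocab


corpus = [
    "This is the Hugging Face Course.",
    "This chapter is about tokenization.",
    "This section shows several tokenizer algorithms.",
    "Hopefully, you will be able to understand how they are trained and generate tokens.",
]
vocab = train_wordpiece(corpus, vocab_size=70)
print(len(vocab))
print(vocab[45:50])  # the first five learned merges
```

Unlike the loop above, this sketch stops early if no pairs are left to merge, so it cannot fail when the requested vocabulary size is larger than the corpus supports.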

# Tokenize new text

To tokenize a new text:

1. We pre-tokenize it and split it into words.
2. We apply the tokenization algorithm to each word: we look for the biggest subword starting at the beginning of the word and split it off.
3. We repeat the process on the remaining part, and so on for the rest of that word and the following words in the text:

In [145]:
# EXCITING ALGORITHM :D
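def encode_word(word_new):
    tokens = []
    while len(word_new) > 0:
        # Find the longest prefix of the remaining word that is in the vocabulary
        i = len(word_new)
        while i > 0 and word_new[:i] not in vocab:
            i -= 1
        # No prefix matched: the whole word becomes unknown
        if i == 0:
            return ["[UNK]"]
        tokens.append(word_new[:i])
        word_new = word_new[i:]
        # What remains sits inside the word, so it gets the "##" prefix
        if len(word_new) > 0:
            word_new = "##" + word_new
    return tokens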

# Let’s test it on one word that can be built from the vocabulary, and another that can’t:

In [144]:
print(encode_word("Hugging"))
print(encode_word("HOgging"))

['Hugg', '##i', '##n', '##g']
['[UNK]']
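
To see the longest-match-first search step by step, here is a sketch that records every lookup result. It takes the vocabulary as a parameter and uses a hand-picked subset of the learned vocabulary; `encode_word_traced` is a hypothetical name used only for this illustration.

```python
def encode_word_traced(word, vocab):
    """Greedy longest-match-first encoding that also records each step."""
    tokens, steps = [], []
    rest = word
    while rest:
        end = len(rest)
        # Shrink the candidate prefix until it is found in the vocabulary.
        while end > 0 and rest[:end] not in vocab:
            end -= 1
        if end == 0:
            steps.append(f"no prefix of {rest!r} is in the vocabulary -> [UNK]")
            return ["[UNK]"], steps
        steps.append(f"longest prefix of {rest!r} in the vocabulary: {rest[:end]!r}")
        tokens.append(rest[:end])
        rest = rest[end:]
        if rest:
            rest = "##" + rest  # the remainder is inside the word
    return tokens, steps


# A hand-picked subset of the vocabulary learned above.
vocab = {"H", "Hu", "Hug", "Hugg", "##g", "##i", "##n", "[UNK]"}

for word in ["Hugging", "HOgging"]:
    tokens, steps = encode_word_traced(word, vocab)
    print(word, "->", tokens)
    for step in steps:
        print("   ", step)
```

For "HOgging" the first lookup succeeds with "H", but nothing matches the remainder "##Ogging", so the whole word is replaced by "[UNK]" rather than only the failing part.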


# Now let’s write a function that tokenizes a text. Cool!

In [146]:

def tokenize(text):
    pre_tokenize_result = tokenizer.backend_tokenizer.pre_tokenizer.pre_tokenize_str(text)
    pre_tokenized_text = [word for word, offset in pre_tokenize_result]
    encoded_words = [encode_word(word) for word in pre_tokenized_text]
    return sum(encoded_words, [])

In [147]:
tokenize("This is the Hugging Face course!")

['Th',
 '##i',
 '##s',
 'is',
 'th',
 '##e',
 'Hugg',
 '##i',
 '##n',
 '##g',
 'Fac',
 '##e',
 'c',
 '##o',
 '##u',
 '##r',
 '##s',
 '##e',
 '[UNK]']
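
The same result can be reproduced without the `transformers` tokenizer. The sketch below hard-codes the 70-token vocabulary learned above, stores it in a set for fast membership tests, uses a regex as an assumed stand-in for BERT's pre-tokenizer, and flattens the per-word results with `itertools.chain` instead of `sum(..., [])`.

```python
import re
from itertools import chain

# The 70-token vocabulary learned above, written compactly.
vocab = set(
    "[PAD] [UNK] [CLS] [SEP] [MASK] "
    "##a ##b ##c ##d ##e ##f ##g ##h ##i ##k ##l ##m ##n ##o ##p ##r ##s "
    "##t ##u ##v ##w ##y ##z , . C F H T a b c g h i s t u w y "
    "ab ##fu Fa Fac ##ct ##ful ##full ##fully Th ch ##hm cha chap chapt "
    "##thm Hu Hug Hugg sh th is ##thms ##za ##zat ##ut".split()
)


def encode_word(word, vocab):
    tokens = []
    while word:
        end = len(word)
        # Longest prefix of the remaining word that is in the vocabulary.
        while end > 0 and word[:end] not in vocab:
            end -= 1
        if end == 0:
            return ["[UNK]"]
        tokens.append(word[:end])
        word = word[end:]
        if word:
            word = "##" + word
    return tokens


def tokenize(text, vocab):
    # Assumption: words and single punctuation marks approximate BERT's pre-tokenizer.
    words = re.findall(r"\w+|[^\w\s]", text)
    return list(chain.from_iterable(encode_word(w, vocab) for w in words))


print(tokenize("This is the Hugging Face course!", vocab))
```

Only the vocabulary is needed at this stage. No merge rules are consulted, which is the difference from BPE noted at the start.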

## So, based on the training dataset, we build and update a vocabulary, which we can then use to tokenize new text and corpora :)