# L2: Role of the Tokenizers


In [2]:
pip install -r requirements.txt

Looking in indexes: https://pypi.org/simple, https://download.pytorch.org/whl/cpu
Note: you may need to restart the kernel to use updated packages.


In [3]:
import warnings
warnings.filterwarnings('ignore')

In [1]:
training_data = [
    "walker walked a long walk",
]

## BPE - Byte-Pair Encoding

**Byte-Pair Encoding (BPE)** is a data compression technique that has been adapted for tokenizing text in natural language processing (NLP). It is used by transformer models such as GPT-2 and RoBERTa (BERT, by contrast, uses WordPiece, which is covered later in this notebook). BPE manages the trade-off between word-based and character-based representations by breaking words into subword units, which lets the model handle rare and unknown words efficiently.

**How BPE works:**  
**1. Initial Step (Character Level Representation):**
Every word is initially represented as a sequence of characters, treating each character as an individual token.

```Example: "low" → [ 'l', 'o', 'w' ]```

**2. Merging Frequent Pairs:**
The algorithm iteratively merges the most frequent adjacent pair of tokens (initially, characters). Each merge creates a new token that represents the merged pair and adds it to the vocabulary. The process repeats for a predefined number of merges or until the vocabulary reaches a target size (the `vocab_size` argument of the trainer used in this notebook).

**3. Subword Units:**
Over time, BPE produces subword units that capture common word patterns, such as prefixes, suffixes, and common root parts of words. For example, in English, words like "playing," "plays," and "played" may share common subwords like "play".

**4. Handling Unknown Words:**
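Because words are broken into subword units, rare or previously unseen words can still be tokenized: they are split into smaller subword units that are already in the vocabulary, falling back to single characters if needed. This only works for characters that were seen during training. A character missing from the base vocabulary has no token to fall back on, so it is either mapped to an unknown token (if one is configured) or dropped, as the `"she walked"` example in this notebook shows.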
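
The merge loop described in steps 1 and 2 can be sketched in plain Python. This is a minimal illustration, not the `tokenizers` implementation: the function name `train_bpe` and its `num_merges` parameter are hypothetical, and ties between equally frequent pairs are broken by first occurrence, which may differ from the choice a library makes.

```python
from collections import Counter


def train_bpe(corpus, num_merges):
    """Learn up to `num_merges` BPE merge rules (hypothetical helper)."""
    # Whitespace pre-tokenization, then split every word into characters.
    words = Counter(tuple(word) for line in corpus for word in line.split())
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in words.items():
            for left, right in zip(symbols, symbols[1:]):
                pairs[(left, right)] += freq
        if not pairs:
            break  # every word is a single symbol, nothing left to merge
        best = max(pairs, key=pairs.get)  # ties: the first pair seen wins
        merges.append(best)
        # Rewrite every word, replacing the best pair with one merged symbol.
        merged_words = Counter()
        for symbols, freq in words.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            merged_words[tuple(out)] += freq
        words = merged_words
    return merges


corpus = ["walker walked a long walk"]
merges = train_bpe(corpus, num_merges=4)
print(merges)
```

The corpus contains ten distinct characters, so four merges yield a vocabulary of 14 tokens. Several pairs are tied at the first step (`w a`, `a l`, and `l k` each occur three times), so the first merge depends on the tie-breaking rule.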


**Benefits of BPE:** 

**- Efficient Vocabulary Size:**
By merging frequently occurring pairs, BPE generates a compact vocabulary of subword units, allowing the model to capture word structure without needing to store a large number of full words.

**- Handles Rare Words:**
It allows the model to handle rare or out-of-vocabulary words by breaking them into smaller, meaningful subword units.

**- Language Flexibility:**
It is useful for morphologically rich languages, where words can have many different forms.

In summary, BPE is a widely used tokenization method in NLP that balances word-based and character-based approaches, providing both flexibility and efficiency in handling language data.



In [2]:
from tokenizers.trainers import BpeTrainer
from tokenizers.models import BPE
from tokenizers import Tokenizer
from tokenizers.pre_tokenizers import Whitespace

bpe_tokenizer = Tokenizer(BPE())
bpe_tokenizer.pre_tokenizer = Whitespace()

bpe_trainer = BpeTrainer(vocab_size=14)


In [3]:
bpe_tokenizer.train_from_iterator(training_data, bpe_trainer)






In [4]:
bpe_tokenizer.get_vocab()

{'wal': 11,
 'd': 1,
 'l': 5,
 'g': 3,
 'w': 9,
 'walk': 12,
 'o': 7,
 'k': 4,
 'walke': 13,
 'n': 6,
 'r': 8,
 'al': 10,
 'a': 0,
 'e': 2}

In [5]:
bpe_tokenizer.encode("walker walked a long walk").tokens

['walke', 'r', 'walke', 'd', 'a', 'l', 'o', 'n', 'g', 'walk']

In [6]:
bpe_tokenizer.encode("wlk").ids

[9, 5, 4]

In [7]:
bpe_tokenizer.encode("wlk").tokens

['w', 'l', 'k']

In [8]:
bpe_tokenizer.encode("she walked").tokens

['e', 'walke', 'd']
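
The three outputs above can be reproduced with a small pure-Python sketch of BPE encoding: split each word into characters, then apply the learned merges in order. The merge list below is an assumption inferred from the vocabulary ids 10 to 13 (`al`, `wal`, `walk`, `walke`), and `bpe_encode` is a hypothetical helper, not part of the `tokenizers` API. Dropping characters outside the alphabet mirrors what this notebook shows for `s` and `h` when no unknown token is configured.

```python
# Merge rules inferred from the vocabulary above; an assumption, not read
# from the trained tokenizer.
MERGES = [("a", "l"), ("w", "al"), ("wal", "k"), ("walk", "e")]
ALPHABET = set("adegklnorw")  # the ten single-character tokens in the vocabulary


def bpe_encode(text):
    """Encode text with the merge rules above (hypothetical helper)."""
    tokens = []
    for word in text.split():
        # Start from characters; characters outside the alphabet are dropped
        # because there is no unknown token to map them to.
        symbols = [ch for ch in word if ch in ALPHABET]
        # Apply each merge rule in the order it was learned.
        for left, right in MERGES:
            merged, i = [], 0
            while i < len(symbols):
                if (i + 1 < len(symbols)
                        and symbols[i] == left and symbols[i + 1] == right):
                    merged.append(left + right)
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            symbols = merged
        tokens.extend(symbols)
    return tokens


print(bpe_encode("walker walked a long walk"))
print(bpe_encode("wlk"))
print(bpe_encode("she walked"))
```

Note that `"she"` loses two of its three characters without any warning, which is why tokenizers are usually trained with an explicit unknown token or a byte-level alphabet.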

## WordPiece

In [9]:
from real_wordpiece.trainer import RealWordPieceTrainer
from tokenizers.models import WordPiece

real_wordpiece_tokenizer = Tokenizer(WordPiece())
real_wordpiece_tokenizer.pre_tokenizer = Whitespace()

real_wordpiece_trainer = RealWordPieceTrainer(
    vocab_size=27,
)

In [10]:
real_wordpiece_trainer.train_tokenizer(
    training_data, real_wordpiece_tokenizer
)
real_wordpiece_tokenizer.get_vocab()

{'long': 21,
 '##l': 2,
 '##e': 4,
 '##o': 9,
 'k': 12,
 'r': 14,
 '##ng': 20,
 '##n': 10,
 'o': 16,
 '##lk': 25,
 'e': 13,
 '##d': 8,
 'n': 17,
 '##a': 1,
 'l': 6,
 '##ed': 23,
 'wa': 24,
 'g': 18,
 '##g': 11,
 '##er': 22,
 'w': 0,
 '##r': 7,
 'walk': 26,
 'a': 5,
 'lo': 19,
 '##k': 3,
 'd': 15}

In [11]:
real_wordpiece_tokenizer.encode("walker walked a long walk").tokens

['walk', '##er', 'walk', '##ed', 'a', 'long', 'walk']

In [12]:
real_wordpiece_tokenizer.encode("wlk").tokens

['w', '##lk']

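**Unknown Characters:**
The following line produces an error because the input contains characters (`s` and `h`) that never appear in the training data, and the trained vocabulary has no `[UNK]` token to fall back on. Uncomment the line and run it to see the error.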

In [16]:
#real_wordpiece_tokenizer.encode("she walked").tokens

Exception: WordPiece error: Missing [UNK] token from the vocabulary
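
To see why the error occurs, it helps to sketch how WordPiece encodes a single word: it repeatedly takes the longest piece of the remaining text that is in the vocabulary, prefixing continuation pieces with `##`. The code below is a simplified illustration under that assumption. `wordpiece_encode_word` is a hypothetical helper, and the vocabulary is copied from the `get_vocab()` output above.

```python
# Tokens copied from the real_wordpiece_tokenizer vocabulary shown above.
VOCAB = {
    "long", "##l", "##e", "##o", "k", "r", "##ng", "##n", "o", "##lk", "e",
    "##d", "n", "##a", "l", "##ed", "wa", "g", "##g", "##er", "w", "##r",
    "walk", "a", "lo", "##k", "d",
}


def wordpiece_encode_word(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first encoding of one word (hypothetical helper)."""
    tokens, start = [], 0
    while start < len(word):
        end, match = len(word), None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation pieces carry the ## prefix
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            # No piece fits, so the whole word becomes the unknown token,
            # which must itself be in the vocabulary.
            if unk_token not in vocab:
                raise ValueError(f"Missing {unk_token} token from the vocabulary")
            return [unk_token]
        tokens.append(match)
        start = end
    return tokens


print(wordpiece_encode_word("walker", VOCAB))
print(wordpiece_encode_word("wlk", VOCAB))
print(wordpiece_encode_word("she", VOCAB | {"[UNK]"}))
```

Calling `wordpiece_encode_word("she", VOCAB)` without adding `[UNK]` raises a `ValueError`, which mirrors the exception shown above.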

## HuggingFace WordPiece and special tokens

In [13]:
from tokenizers.trainers import WordPieceTrainer

unk_token = "[UNK]"

wordpiece_model = WordPiece(unk_token=unk_token)
wordpiece_tokenizer = Tokenizer(wordpiece_model)
wordpiece_tokenizer.pre_tokenizer = Whitespace()
wordpiece_trainer = WordPieceTrainer(
    vocab_size=28,
    special_tokens=[unk_token]
)

In [14]:
wordpiece_tokenizer.train_from_iterator(
    training_data, 
    wordpiece_trainer
)
wordpiece_tokenizer.get_vocab()






{'##r': 16,
 '##ng': 25,
 '##e': 14,
 '##g': 19,
 'walk': 22,
 'walked': 26,
 'a': 1,
 'e': 3,
 'n': 7,
 '##k': 13,
 'lo': 24,
 'l': 6,
 'o': 8,
 '##d': 15,
 '[UNK]': 0,
 'walker': 27,
 'walke': 23,
 '##l': 12,
 'k': 5,
 'd': 2,
 '##a': 11,
 'w': 10,
 'r': 9,
 '##o': 17,
 '##n': 18,
 'g': 4,
 'wa': 20,
 '##lk': 21}

In [15]:
wordpiece_tokenizer.encode("walker walked a long walk").tokens

['walker', 'walked', 'a', 'lo', '##ng', 'walk']

In [16]:
wordpiece_tokenizer.encode("wlk").tokens

['w', '##lk']

In [17]:
wordpiece_tokenizer.encode("she walked").tokens

['[UNK]', 'walked']

## Unigram

In [18]:
from tokenizers.trainers import UnigramTrainer
from tokenizers.models import Unigram

unigram_tokenizer = Tokenizer(Unigram())
unigram_tokenizer.pre_tokenizer = Whitespace()
unigram_trainer = UnigramTrainer(
    vocab_size=14, 
    special_tokens=[unk_token],
    unk_token=unk_token,
)

unigram_tokenizer.train_from_iterator(training_data, unigram_trainer)
unigram_tokenizer.get_vocab()




{'a': 6,
 'e': 10,
 'n': 11,
 'walk': 4,
 'w': 3,
 'l': 5,
 'r': 7,
 'k': 2,
 'd': 12,
 'walke': 1,
 '[UNK]': 0,
 'o': 8,
 'g': 9}




In [19]:
unigram_tokenizer.encode("walker walked a long walk").tokens

['walke', 'r', 'walke', 'd', 'a', 'l', 'o', 'n', 'g', 'walk']

In [20]:
unigram_tokenizer.encode("wlk").tokens

['w', 'l', 'k']

In [21]:
unigram_tokenizer.encode("she walked").tokens

['sh', 'e', 'walke', 'd']

In [22]:
unigram_tokenizer.encode("she walked").ids

[0, 10, 1, 12]
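
A Unigram tokenizer assigns each token a probability and picks the segmentation of a word with the highest total score. The sketch below illustrates that search with a small Viterbi-style dynamic program. The log-probabilities are hypothetical values chosen for illustration, not the scores learned by `unigram_tokenizer`, and `unigram_segment` is a made-up helper. It also does not model how unknown characters are grouped into `[UNK]`.

```python
import math

# Hypothetical log-probabilities: multi-character tokens score higher than
# single characters. These are NOT the values learned by the trainer above.
LOG_PROBS = {"walke": -2.0, "walk": -2.0}
LOG_PROBS.update({ch: -3.0 for ch in "adegklnorw"})


def unigram_segment(word, log_probs):
    """Return the highest-scoring segmentation of `word`, or None if none exists."""
    # best[i] holds (score, tokens) for the best segmentation of word[:i].
    best = [(0.0, [])] + [(-math.inf, None)] * len(word)
    for end in range(1, len(word) + 1):
        for start in range(end):
            piece = word[start:end]
            prev_score, prev_tokens = best[start]
            if piece in log_probs and prev_tokens is not None:
                score = prev_score + log_probs[piece]
                if score > best[end][0]:
                    best[end] = (score, prev_tokens + [piece])
    return best[-1][1]


print(unigram_segment("walker", LOG_PROBS))
print(unigram_segment("long", LOG_PROBS))
print(unigram_segment("wlk", LOG_PROBS))
```

With these assumed scores, `walke` + `r` beats `walk` + `e` + `r` because it uses fewer tokens for the same characters, which matches the segmentation shown in the outputs above.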