# 🧱 Tokenization


Tokenization is the process of breaking text down into smaller units called tokens, which can be words, characters, or subwords. See the [preamble](./README.md) for what tokenization does, why it is needed, and how it works. The materials are listed below in the recommended reading order:

> .  
> ├── [README.md](./README.md)  
> ├── [rulebased.md](./rulebased.md)  
> ├── [bpe.md](./bpe.md)  
> ├── [unigram.md](./unigram.md)  
> └── [wordpiece.md](./wordpiece.md)


## Rule-Based Tokenization

Rule-based tokenization splits text into tokens using a set of predefined rules. We can use `nltk` for this via its `word_tokenize` function, which is derived from the Penn Treebank tokenizer. The implementation is fairly involved, so you can read [the source](https://github.com/nltk/nltk/blob/5e0a6c4d69f001d86e86ada9b9f9a363bae1c692/nltk/tokenize/destructive.py#L37) yourself, but in essence it splits off quotes, then punctuation, parentheses, and dashes, and finally contractions.


In [5]:
from nltk.tokenize import word_tokenize

text = "Lorem ipsum dolor sit amet."
tokens = word_tokenize(text)
print(tokens)

['Lorem', 'ipsum', 'dolor', 'sit', 'amet', '.']
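To give a feel for the kind of rules involved, here is a minimal sketch that uses only Python's `re` module. It is not NLTK's implementation: the function name `simple_rule_tokenize` and the two rules are assumptions made for illustration, and they cover far fewer cases than the real tokenizer (for example, decimal numbers and dashes are not handled).

```python
import re

def simple_rule_tokenize(text):
    """Toy rule-based tokenizer (illustrative only, not NLTK's rule set)."""
    # Rule 1: surround common punctuation and parentheses with spaces.
    text = re.sub(r'([.,!?;:()"])', r" \1 ", text)
    # Rule 2: split off English contractions, e.g. "don't" -> "do" + "n't".
    text = re.sub(r"(\w)(n't)\b", r"\1 \2", text)
    text = re.sub(r"(\w)('s|'re|'ve|'ll|'d|'m)\b", r"\1 \2", text)
    # Whatever is separated by whitespace is now a token.
    return text.split()

print(simple_rule_tokenize("Lorem ipsum dolor sit amet."))
print(simple_rule_tokenize("Don't panic, it's fine."))
```

The second call shows the contraction rule at work: `Don't` becomes `Do` and `n't`, and `it's` becomes `it` and `'s`.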


## Byte-Pair Encoding (BPE) Tokenization

BPE is the subword tokenization algorithm used in models like GPT-2 and RoBERTa. One detail that isn't covered in the [notes](./bpe.md): GPT-2's byte-level BPE represents the space character as `Ġ`, so a token that starts with `Ġ` begins with a space.


In [7]:
from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
text = "Lorem ipsum dolor sit amet."
tokens = tokenizer.tokenize(text)
print(tokens)

['L', 'orem', 'Ġ', 'ips', 'um', 'Ġd', 'olor', 'Ġsit', 'Ġam', 'et', '.']
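Because the marker character `Ġ` (U+0120) stands in for a space, the original string can be recovered by joining the tokens and swapping the marker back. The sketch below works on the token list printed above; the helper name `detokenize_gpt2` is hypothetical, and it only undoes the space marker, not the rest of GPT-2's byte-to-character mapping.

```python
SPACE_MARKER = "\u0120"  # the "Ġ" character used in place of a leading space

# Token list copied from the output above, with the marker written as an escape.
tokens = ["L", "orem", "\u0120", "ips", "um", "\u0120d", "olor",
          "\u0120sit", "\u0120am", "et", "."]

def detokenize_gpt2(tokens):
    """Join subword tokens and turn the space marker back into a space."""
    return "".join(tokens).replace(SPACE_MARKER, " ")

print(detokenize_gpt2(tokens))  # Lorem ipsum dolor sit amet.
```

In practice, the tokenizer's own `convert_tokens_to_string` method should be preferred, since it reverses the full byte mapping rather than just the space marker.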


## Unigram Tokenization

Unigram is the subword tokenization algorithm used in models like XLNet. See the [notes](./unigram.md) for how the logic works. The `"▁"` characters in the output come from SentencePiece, which replaces spaces with the `▁` (U+2581) marker before building its vocabulary; you can read them as spaces. [^1]

[^1]: https://huggingface.co/learn/llm-course/en/chapter6/7


In [22]:
from transformers import XLNetTokenizer

tokenizer = XLNetTokenizer.from_pretrained("xlnet-base-cased")
text = "Lorem ipsum dolor sit amet."
tokens = tokenizer.tokenize(text)
print(tokens)

['▁Lore', 'm', '▁', 'ip', 'sum', '▁do', 'lor', '▁sit', '▁a', 'met', '.']
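The marker is easier to understand from the input side. The sketch below imitates the whitespace escaping that SentencePiece applies before segmentation, assuming its default behaviour of also prepending a marker to the start of the text. The function name `escape_whitespace` is hypothetical, and real SentencePiece additionally normalizes the text, which is omitted here.

```python
WORD_BOUNDARY = "\u2581"  # the "▁" character that stands in for a space

def escape_whitespace(text):
    """Replace spaces with the marker and prepend one (assumed default behaviour)."""
    return WORD_BOUNDARY + text.replace(" ", WORD_BOUNDARY)

# Token list copied from the output above, with the marker written as an escape.
tokens = ["\u2581Lore", "m", "\u2581", "ip", "sum", "\u2581do", "lor",
          "\u2581sit", "\u2581a", "met", "."]

escaped = escape_whitespace("Lorem ipsum dolor sit amet.")
print(escaped)
# The tokens are simply a segmentation of the escaped string.
print("".join(tokens) == escaped)  # True
```

Since the tokens concatenate back to the escaped string exactly, detokenization is lossless: join the tokens, replace the marker with a space, and drop the leading space.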


## WordPiece Tokenization

WordPiece is the subword tokenization algorithm used in models like BERT. See the related [notes](./wordpiece.md) for how it works. As the output shows, the tokenizer adds a `##` prefix to tokens that continue a word, which distinguishes them from word-initial tokens.


In [24]:
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
text = "Lorem ipsum dolor sit amet."
tokens = tokenizer.tokenize(text)
print(tokens)

['Lo', '##rem', 'i', '##ps', '##um', 'do', '##lor', 'sit', 'am', '##et', '.']
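The `##` prefix falls out of how a single word is segmented: at each position the tokenizer greedily takes the longest vocabulary entry that matches, and every piece after the first is looked up with `##` in front. Below is a minimal sketch of that greedy longest-match step. The name `wordpiece_split` and the toy vocabulary are assumptions for illustration (this is not BERT's real vocabulary), and the step that first splits text into words and punctuation is left out.

```python
def wordpiece_split(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first segmentation of one word (illustrative sketch)."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # non-initial pieces carry the prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces

# Hypothetical toy vocabulary containing just the pieces seen in the output above.
toy_vocab = {"Lo", "##rem", "i", "##ps", "##um", "do", "##lor", "sit", "am", "##et", "."}

for word in ["Lorem", "ipsum", "dolor", "sit", "amet"]:
    print(word, "->", wordpiece_split(word, toy_vocab))
```

With this toy vocabulary the pieces match the BERT output above, and a word with no matching pieces collapses to a single `[UNK]` token.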
