# Lesson: Comparing Trained LLM Tokenizers

## Setup

We start by setting up the lab: install the `transformers` library and suppress warnings.

In [None]:
!pip install "transformers>=4.46.1"

In [None]:
# Warning control
import warnings
warnings.filterwarnings('ignore')

## Tokenizing Text

In this section, you will tokenize the sentence "Hello Internet of Things!" using the tokenizer of the [`bert-base-cased` model](https://huggingface.co/google-bert/bert-base-cased).

Let's import the `AutoTokenizer` class, define the sentence to tokenize, and instantiate the tokenizer.

<p style="background-color:#fff1d7; padding:15px; "> <b>FYI: </b> The transformers library has a set of Auto classes, like AutoConfig, AutoModel, and AutoTokenizer. Given a model name or path, an Auto class reads the checkpoint's configuration and instantiates the matching concrete class, so you don't have to pick it yourself.</p>

In [3]:
from transformers import AutoTokenizer

In [4]:
# define the sentence to tokenize
sentence = "Hello Internet of Things!"

In [5]:
# load the pretrained tokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

You'll now apply the tokenizer to the sentence. The tokenizer splits the sentence into tokens and returns the ID of each token.

In [6]:
# apply the tokenizer to the sentence and extract the token ids
token_ids = tokenizer(sentence).input_ids

In [7]:
print(token_ids)

[101, 8667, 4639, 1104, 7149, 106, 102]


To map each token ID to its corresponding token, you can use the `decode` method of the tokenizer.

In [8]:
# decode each ID on its own to see the token it stands for
for token_id in token_ids:
    print(tokenizer.decode(token_id))

[CLS]
Hello
Internet
of
Things
!
[SEP]
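
The correspondence between IDs and tokens printed above can be pictured as a lookup table. The sketch below hard-codes only the seven pairs shown in the output; `toy_vocab`, `toy_decode`, and `strip_special` are hypothetical names used for illustration and are not part of `transformers`. It also makes visible that `[CLS]` and `[SEP]` were added by the tokenizer and do not come from the sentence itself.

```python
# Toy lookup built only from the IDs and tokens printed above.
# The real bert-base-cased vocabulary has many more entries.
token_ids = [101, 8667, 4639, 1104, 7149, 106, 102]
tokens = ["[CLS]", "Hello", "Internet", "of", "Things", "!", "[SEP]"]
toy_vocab = dict(zip(token_ids, tokens))


def toy_decode(ids):
    """Map each ID to its token using the toy table."""
    return [toy_vocab[i] for i in ids]


def strip_special(ids, special=("[CLS]", "[SEP]")):
    """Drop the IDs whose token is one of the special markers."""
    return [i for i in ids if toy_vocab[i] not in special]


print(toy_decode(token_ids))
print(toy_decode(strip_special(token_ids)))
```

With the real tokenizer, passing `add_special_tokens=False` when calling it should leave the two markers out of `input_ids`.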


## Visualizing Tokenization

In this section, you'll wrap the code of the previous section in the function `show_tokens`. The function takes in a text and the model name, and prints the vocabulary length of the tokenizer and a colored list of the tokens.

In [9]:
# A list of colors in RGB for representing the tokens
colors = [
    '102;194;165', '252;141;98', '141;160;203',
    '231;138;195', '166;216;84', '255;217;47'
]

def show_tokens(sentence: str, tokenizer_name: str):
    """ Show the tokens each separated by a different color """

    # Load the tokenizer and tokenize the input
    tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)
    token_ids = tokenizer(sentence).input_ids

    # Extract vocabulary length
    print(f"Vocab length: {len(tokenizer)}")

    # Print a colored list of tokens
    for idx, t in enumerate(token_ids):
        print(
            f'\x1b[0;30;48;2;{colors[idx % len(colors)]}m' +
            tokenizer.decode(t) +
            '\x1b[0m',
            end=' '
        )
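
The color strings in `show_tokens` are ANSI escape sequences. As a minimal sketch, the same formatting can be isolated in a small pure function (the name `colorize` is hypothetical): `0;30` resets the style and selects a black foreground, `48;2;R;G;B` sets a 24-bit background color, and `\x1b[0m` resets the style after the token.

```python
# Same palette as in show_tokens: background colors as "R;G;B" strings.
colors = [
    '102;194;165', '252;141;98', '141;160;203',
    '231;138;195', '166;216;84', '255;217;47'
]


def colorize(token: str, idx: int) -> str:
    """Wrap a token in ANSI codes, cycling through the palette by index."""
    background = colors[idx % len(colors)]
    return f"\x1b[0;30;48;2;{background}m{token}\x1b[0m"


# Print a few tokens, each on its own background color.
for idx, token in enumerate(["Hello", "Internet", "of", "Things", "!"]):
    print(colorize(token, idx), end=" ")
print()
```

Because the index wraps around with `%`, the seventh token reuses the first color; neighbouring tokens still get different backgrounds.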

Here's the text that you'll use to explore the different tokenization strategies of each model.

In [10]:
text = """
English and INTERNET OF THINGS
🎵 鸟
show_tokens False None elif == >= else: two tabs:"    " Three tabs: "       "
12.0*50=600
"""

You'll now use the `bert-base-cased` tokenizer again and compare its tokenization strategy to that of `Xenova/gpt-4`.

**bert-base-cased**

In [11]:
show_tokens(text, "bert-base-cased")

Vocab length: 28996
[CLS] English and IN ##TE ##R ##NE ##T OF T ##H ##ING ##S [UNK] [UNK] show _ token ##s F ##als ##e None el ##if = = > = else :
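
In this output, the `##` prefix marks a piece that continues the previous piece of the same word, and `[UNK]` stands in for text the vocabulary cannot cover. BERT's WordPiece tokenizer splits each word by greedy longest-match-first lookup. The sketch below shows that idea with a tiny hypothetical vocabulary (`toy_vocab`) chosen to reproduce a few of the splits above; it is an illustration, not the library's implementation.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of a single word."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation piece
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece fits: the whole word becomes the unknown token.
            return [unk]
        pieces.append(match)
        start = end
    return pieces


# Hypothetical mini-vocabulary, for illustration only.
toy_vocab = {"IN", "##TE", "##R", "##NE", "##T", "show", "_", "token", "##s"}

for word in ["INTERNET", "tokens", "🎵"]:
    print(word, wordpiece(word, toy_vocab))
```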

**Optional - bert-base-uncased**

You can also try the uncased version of the BERT model and compare the vocabulary length and tokenization strategy of the two versions.

In [12]:
show_tokens(text, "bert-base-uncased")

Vocab length: 30522
[CLS] english and internet of things [UNK] [UNK] show _ token ##s false none eli ##f = = > = else : two tab ##s : " " three tab ##s
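
The uncased tokenizer normalizes text before splitting it, which is why `INTERNET OF THINGS` comes out as the whole words `internet`, `of`, and `things` instead of fragments. Below is a rough sketch of that normalization step, assuming it amounts to lowercasing plus accent stripping; `normalize_uncased` is a hypothetical helper, not the library's code.

```python
import unicodedata


def normalize_uncased(text: str) -> str:
    """Lowercase the text and drop combining accent marks."""
    lowered = text.lower()
    # NFD separates base characters from their combining marks.
    decomposed = unicodedata.normalize("NFD", lowered)
    # Category "Mn" covers the non-spacing (combining) marks.
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


print(normalize_uncased("English and INTERNET OF THINGS"))
print(normalize_uncased("Café"))
```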

**GPT-4**

In [13]:
show_tokens(text, "Xenova/gpt-4")

Vocab length: 100263

 English  and  INTERN ET  OF  TH INGS 
 � � �  � � � 
 show _tokens  False  None  elif  ==  >=  else :  two  tabs :"      "  Thre

### Optional Models to Explore

You can also explore the tokenization strategies of other models, such as those suggested below. As you compare them, consider the following features:
- Vocabulary length
- Special tokens
- Tokenization of the tabs, special characters and special keywords
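
To keep the comparison consistent, the features listed above can be collected per tokenizer. This is a minimal sketch: `summarize_tokenizer` is a hypothetical helper that relies only on `len(tokenizer)` and `tokenizer(text).input_ids`, as used earlier in this lesson, plus the `all_special_tokens` attribute of `transformers` tokenizers.

```python
def summarize_tokenizer(tokenizer, text: str) -> dict:
    """Collect a few comparable facts about a tokenizer for a given text."""
    token_ids = tokenizer(text).input_ids
    return {
        # Same quantity that show_tokens prints as "Vocab length".
        "vocab_length": len(tokenizer),
        # Markers such as [CLS], [SEP], or [UNK], depending on the model.
        "special_tokens": list(tokenizer.all_special_tokens),
        # How many tokens the text costs with this tokenizer.
        "num_tokens": len(token_ids),
        "num_unique_tokens": len(set(token_ids)),
    }
```

For example, `summarize_tokenizer(AutoTokenizer.from_pretrained("gpt2"), text)` would return the summary for `gpt2`. A lower `num_tokens` for the same text suggests that the vocabulary covers that text more compactly.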

**gpt2**

In [14]:
show_tokens(text, "gpt2")

Vocab length: 50257

 English  and  INTER NET  OF  TH INGS 
 � � �  � � � 
 show _ t ok ens  False  None  el if  ==  >=  else :  two  tabs 

**Flan-T5-small**

In [None]:
show_tokens(text, "google/flan-t5-small")

**Starcoder 2 - 15B**

In [None]:
show_tokens(text, "bigcode/starcoder2-15b")

**Phi-3**

In [None]:
show_tokens(text, "microsoft/Phi-3-mini-4k-instruct")

**Qwen2 - Vision-Language Model**

In [None]:
show_tokens(text, "Qwen/Qwen2-VL-7B-Instruct")

<p style="background-color:#f2f2ff; padding:15px; border-width:3px; border-color:#e2e2ff; border-style:solid; border-radius:6px"> ⬇
&nbsp; <b>Download Notebooks:</b> If you'd like to download the notebook: 1) click on the <em>"File"</em> option on the top menu of the notebook and then 2) click on <em>"Download as"</em> and select <em>"Notebook (.ipynb)"</em>. For more help, please see the <em>"Appendix – Tips, Help, and Download"</em> Lesson.</p>