# Tokenizers

In this Colab session, we explore the world of tokenizers.

You can run this notebook on a free CPU, or locally on your box if you prefer.


## Reminder: 2 important pro-tips for using Colab:

**Pro-tip 1:**

The top of every colab has some pip installs. You may receive errors from pip when you run this, such as:

> gcsfs 2025.3.2 requires fsspec==2025.3.2, but you have fsspec 2025.3.0 which is incompatible.

These pip compatibility errors can be safely ignored. It's tempting to try to fix them by changing version numbers, but that will actually introduce real problems!

**Pro-tip 2:**

In the middle of running a Colab, you might get an error like this:

> Runtime error: CUDA is required but not available for bitsandbytes. Please consider installing [...]

This is a super-misleading error message! Please don't try changing versions of packages...

This actually happens because Google has switched out your Colab runtime, perhaps because Google Colab was too busy. The solution is:

1. Runtime menu >> Disconnect and delete runtime
2. Reload the colab from fresh, then Edit menu >> Clear all outputs
3. Connect to a new T4 using the button at the top right
4. Select "View resources" from the menu on the top right to confirm you have a GPU
5. Rerun the cells in the colab, from the top down, starting with the pip installs

And all should work great - otherwise, ask me!
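Step 4 can also be checked from a code cell. The sketch below is one possible check and not part of the original notebook: it assumes that a GPU runtime exposes the `nvidia-smi` command-line tool, and the helper name `gpu_visible` is made up for illustration.

```python
import shutil
import subprocess


def gpu_visible() -> bool:
    """Return True if nvidia-smi is on the PATH and lists at least one GPU."""
    if shutil.which("nvidia-smi") is None:
        return False  # no NVIDIA tooling found, so most likely a CPU runtime
    try:
        result = subprocess.run(
            ["nvidia-smi", "-L"],  # -L prints one line per detected GPU
            capture_output=True,
            text=True,
            timeout=30,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0 and "GPU" in result.stdout


print("GPU visible:", gpu_visible())
```

Once `torch` is installed, `torch.cuda.is_available()` answers the same question from inside Python.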

In [None]:
# if this gives an "ERROR" about pip dependency conflicts, ignore it! It doesn't affect anything.

!pip install -q --upgrade torch==2.5.1+cu124 torchvision==0.20.1+cu124 torchaudio==2.5.1+cu124 --index-url https://download.pytorch.org/whl/cu124

!pip install -q --upgrade transformers==4.48.3 datasets==3.2.0
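To see which versions actually ended up installed, without touching any pins, you can query the package metadata. This is a small sketch using only the standard library; the `PINNED` dictionary simply repeats the versions from the cell above, and `installed_versions` is a hypothetical helper name.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions requested by the pip cell above (repeated here for comparison only).
PINNED = {"transformers": "4.48.3", "datasets": "3.2.0"}


def installed_versions(packages):
    """Map each package name to its installed version, or None if missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


for name, got in installed_versions(PINNED).items():
    status = "matches pin" if got == PINNED[name] else "differs from pin"
    print(f"{name}: installed={got}, pinned={PINNED[name]} ({status})")
```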

# Sign in to Hugging Face

1. If you haven't already done so, create a free HuggingFace account at https://huggingface.co, navigate to Settings, and create a new API token with write permissions.

**IMPORTANT** when you create your HuggingFace API key, please be sure to select read/write permissions for your key by clicking on the WRITE tab, otherwise you may get problems later.

2. Press the "key" icon on the side panel to the left, and add a new secret:
`HF_TOKEN = your_token`

3. Execute the cell below to log in.

In [3]:
from google.colab import userdata
from huggingface_hub import login
from transformers import AutoTokenizer

In [4]:
hf_token = userdata.get('HF_TOKEN')
login(hf_token, add_to_git_credential=True)
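If you run this notebook locally instead of on Colab, `google.colab.userdata` is not available. One common alternative, sketched here under the assumption that you have exported your token as an `HF_TOKEN` environment variable, is to read it from the environment. `resolve_hf_token` is an illustrative helper, not part of `huggingface_hub`.

```python
import os


def resolve_hf_token(environ=os.environ):
    """Return the token stored in HF_TOKEN, or None if it is unset or blank."""
    token = environ.get("HF_TOKEN", "").strip()
    return token or None


token = resolve_hf_token()
print("HF_TOKEN found" if token else "HF_TOKEN is not set")

# With a token in hand, the login call is the same as in the cell above:
# login(token, add_to_git_credential=True)
```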

# Accessing Llama 3.1 from Meta

Before you can use the fantastic Llama 3.1, Meta requires you to sign their terms of service.

Visit their model instructions page in Hugging Face:
https://huggingface.co/meta-llama/Meta-Llama-3.1-8B

At the top of the page are instructions on how to agree to their terms. If possible, you should use the same email as your huggingface account.

In my experience approval comes in a couple of minutes. Once you've been approved for any 3.1 model, it applies to the whole 3.1 family of models. For whatever reason, occasionally Meta doesn't approve access. If that happens to you, please follow [this](https://colab.research.google.com/drive/1deJO03YZTXUwcq2vzxWbiBhrRuI29Vo8?usp=sharing) troubleshooting.

If the next cell gives you an error, then please check:  
1. Are you logged in to HuggingFace? Try running `login()` to check your key works
2. Did you set up your API key with full read and write permissions?
3. If you visit the Llama3.1 page with the link above, does it show that you have access to the model near the top?



In [5]:
tokenizer = AutoTokenizer.from_pretrained('meta-llama/Llama-3.1-8B', trust_remote_code=True)


In [6]:
text = "I am excited to how Tokenizers in action to all my AI Engineer fellows."
tokens = tokenizer.encode(text)
tokens

[128000,
 40,
 1097,
 12304,
 311,
 1268,
 9857,
 12509,
 304,
 1957,
 311,
 682,
 856,
 15592,
 29483,
 87819,
 13]

In [7]:
len(text), len(tokens)

(71, 17)
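Those two numbers give a quick feel for how much text one token covers. The arithmetic below uses only the values printed above; it is in line with the commonly quoted rule of thumb of roughly four characters per token for English text, though a single sentence is a very small sample.

```python
n_chars, n_tokens = 71, 17  # values returned by the cell above

# The first token is the <|begin_of_text|> marker, so it covers no characters.
chars_per_token = n_chars / n_tokens
chars_per_text_token = n_chars / (n_tokens - 1)

print(f"{chars_per_token:.2f} characters per token (counting the special token)")
print(f"{chars_per_text_token:.2f} characters per token (text tokens only)")
```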

In [8]:
# Decoding the ids gives the text back, now prefixed with the <|begin_of_text|> special token

tokenizer.decode(tokens)

'<|begin_of_text|>I am excited to how Tokenizers in action to all my AI Engineer fellows.'

In [9]:
# batch_decode on a flat list of ids decodes each id on its own, so each string is one token.
# Note that the space before a word is part of that word's token.
# Tokens are case-sensitive.

tokenizer.batch_decode(tokens)

['<|begin_of_text|>',
 'I',
 ' am',
 ' excited',
 ' to',
 ' how',
 ' Token',
 'izers',
 ' in',
 ' action',
 ' to',
 ' all',
 ' my',
 ' AI',
 ' Engineer',
 ' fellows',
 '.']
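Because each space travels with the token that follows it, gluing the pieces back together recovers the original sentence exactly. The check below is a plain-Python sketch that copies the pieces from the output above, so it runs without loading the tokenizer.

```python
text = "I am excited to how Tokenizers in action to all my AI Engineer fellows."

# Token strings copied from the batch_decode output above.
pieces = ['<|begin_of_text|>', 'I', ' am', ' excited', ' to', ' how', ' Token',
          'izers', ' in', ' action', ' to', ' all', ' my', ' AI', ' Engineer',
          ' fellows', '.']

text_pieces = pieces[1:]  # drop the special token, which is not part of the text
reconstructed = "".join(text_pieces)
leading_space_count = sum(piece.startswith(" ") for piece in text_pieces)

print("Round trip matches:", reconstructed == text)
print("Tokens that carry a leading space:", leading_space_count, "of", len(text_pieces))
```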

In [10]:
# The whole vocab

tokenizer.vocab

{'umm': 27054,
 "Ġ{}'.": 26307,
 'ilities': 4396,
 'nas': 46523,
 '_assign': 21345,
 'ÎŁÎ¹': 121880,
 '(tweet': 74610,
 'obb': 21046,
 '$")Ċ': 88753,
 'rarian': 96020,
 'ĠNorway': 32603,
 'Ġbearing': 18534,
 'ä»ĬæĹ¥': 110589,
 '/gin': 79724,
 'izzard': 39248,
 'ĠISO': 22705,
 'Ġeuro': 18140,
 'ĠInt': 1357,
 'Ġautobiography': 91537,
 'ĠMay': 3297,
 'Ġdomestically': 98890,
 'Ġà¤Ĩà¤¶': 114382,
 '\');");Ċ': 98533,
 '-posts': 89671,
 'Ġbreeding': 40308,
 'CÃ³mo': 96997,
 'Ġfoyer': 100016,
 'łĢ': 64319,
 'impl': 6517,
 'Ġframed': 47093,
 'Ġpel': 12077,
 'bcm': 92285,
 'obox': 33560,
 'ĠÑĢÐ¾Ð·ÑĢÐ°Ñħ': 120973,
 'Ġfiletype': 91371,
 'Ġcupboard': 87041,
 'SACTION': 47259,
 'enumerate': 77669,
 'Ġ`-': 94897,
 'Ġump': 86142,
 'ĠremoveObject': 82722,
 'åįİ': 86461,
 'candidate': 47374,
 'ï¼ĮèĢĮä¸Ķ': 116496,
 'Ġentrenched': 82144,
 'Ġspice': 42786,
 'Ġatmos': 14036,
 'DEX': 10962,
 '_constant': 36067,
 'Ġstatistic': 43589,
 '="")Ċ': 64841,
 'RYPTO': 60694,
 '.parallel': 72557,
 'à¹Īà¸§à¸¡': 104722,


In [11]:
# special tokens

tokenizer.get_added_vocab()

{'<|begin_of_text|>': 128000,
 '<|end_of_text|>': 128001,
 '<|reserved_special_token_0|>': 128002,
 '<|reserved_special_token_1|>': 128003,
 '<|finetune_right_pad_id|>': 128004,
 '<|reserved_special_token_2|>': 128005,
 '<|start_header_id|>': 128006,
 '<|end_header_id|>': 128007,
 '<|eom_id|>': 128008,
 '<|eot_id|>': 128009,
 '<|python_tag|>': 128010,
 '<|reserved_special_token_3|>': 128011,
 '<|reserved_special_token_4|>': 128012,
 '<|reserved_special_token_5|>': 128013,
 '<|reserved_special_token_6|>': 128014,
 '<|reserved_special_token_7|>': 128015,
 '<|reserved_special_token_8|>': 128016,
 '<|reserved_special_token_9|>': 128017,
 '<|reserved_special_token_10|>': 128018,
 '<|reserved_special_token_11|>': 128019,
 '<|reserved_special_token_12|>': 128020,
 '<|reserved_special_token_13|>': 128021,
 '<|reserved_special_token_14|>': 128022,
 '<|reserved_special_token_15|>': 128023,
 '<|reserved_special_token_16|>': 128024,
 '<|reserved_special_token_17|>': 128025,
 '<|reserved_special_to

# Instruct variants of models

Many models have a variant that has been trained for use in chats.  
These are typically labelled with the word "Instruct" at the end.  
They have been trained to expect prompts in a particular format that includes system, user and assistant prompts.  

The utility method `apply_chat_template` converts the familiar list-of-messages format into the input prompt this model expects.

In [12]:
tokenizer = AutoTokenizer.from_pretrained('meta-llama/Meta-Llama-3.1-8B-Instruct', trust_remote_code=True)


In [15]:

messages = [
    {'role': 'system', 'content': 'You are a helpful assistant.'},
    {'role': 'user', 'content': 'Tell a light-hearted joke for a room of data scientists.'}
]

prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)

<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023
Today Date: 26 Jul 2024

You are a helpful assistant.<|eot_id|><|start_header_id|>user<|end_header_id|>

Tell a light-hearted joke for a room of data scientists.<|eot_id|><|start_header_id|>assistant<|end_header_id|>
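To make the structure of that prompt explicit, here is a hand-written approximation in plain Python. It is not the real template (which ships with the tokenizer and covers more cases); `render_llama31_prompt` and its `preamble` parameter are invented for this sketch, and the date lines are simply copied from the output above.

```python
def render_llama31_prompt(messages, preamble=""):
    """Approximate the Llama 3.1 chat layout printed above (illustrative only)."""
    parts = ["<|begin_of_text|>"]
    for message in messages:
        content = message["content"]
        if message["role"] == "system" and preamble:
            # The printed prompt shows date lines placed before the system text.
            content = preamble + "\n\n" + content
        parts.append(
            f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
            f"{content}<|eot_id|>"
        )
    # add_generation_prompt=True leaves an open assistant header for the model.
    parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)


messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Tell a light-hearted joke for a room of data scientists."},
]
preamble = "Cutting Knowledge Date: December 2023\nToday Date: 26 Jul 2024"

prompt = render_llama31_prompt(messages, preamble)
print(prompt)
```

Each turn is a role header, a blank line, the content, and an `<|eot_id|>` marker; the trailing assistant header is where generation starts.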




## **Trying New Models**  
+ We'll now work with 3 models.
+ Phi3 from Microsoft, Qwen2 from Alibaba Cloud, and StarCoder2 from BigCode (ServiceNow + HuggingFace + Nvidia)

> Phi3 vs Llama3.1

In [23]:
PHI3_MODEL_NAME = "microsoft/Phi-3-mini-4k-instruct"
QWEN2_MODEL_NAME = "Qwen/Qwen2-7B-Instruct"
STARCODER2_MODEL_NAME = "bigcode/starcoder2-3b"

In [17]:
phi3_tokenizer = AutoTokenizer.from_pretrained(PHI3_MODEL_NAME)

text = "I am excited to how Tokenizers in action to all my AI Engineer fellows."
print(tokenizer.encode(text))  # from the llama3.1-8B Instruct tokenizer
print()
print(phi3_tokenizer.encode(text))  # from the Phi3 tokenizer



[128000, 40, 1097, 12304, 311, 1268, 9857, 12509, 304, 1957, 311, 682, 856, 15592, 29483, 87819, 13]

[306, 626, 24173, 304, 920, 25159, 19427, 297, 3158, 304, 599, 590, 319, 29902, 10863, 261, 8379, 1242, 29889]


In [18]:
tokens = phi3_tokenizer.encode(text)
print(phi3_tokenizer.batch_decode(tokens))  # completely different results than with llama3.1

['I', 'am', 'excited', 'to', 'how', 'Token', 'izers', 'in', 'action', 'to', 'all', 'my', 'A', 'I', 'Engine', 'er', 'fell', 'ows', '.']
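One practical consequence is that token ids mean nothing outside the tokenizer that produced them. The sketch below pairs the ids and pieces printed above for both tokenizers (copied by hand, so no model download is needed) and looks at the ids the two encodings have in common.

```python
# Ids and decoded pieces copied from the Llama 3.1 outputs above.
llama_ids = [128000, 40, 1097, 12304, 311, 1268, 9857, 12509, 304, 1957, 311,
             682, 856, 15592, 29483, 87819, 13]
llama_pieces = ['<|begin_of_text|>', 'I', ' am', ' excited', ' to', ' how',
                ' Token', 'izers', ' in', ' action', ' to', ' all', ' my',
                ' AI', ' Engineer', ' fellows', '.']

# Ids and decoded pieces copied from the Phi3 outputs above.
phi3_ids = [306, 626, 24173, 304, 920, 25159, 19427, 297, 3158, 304, 599, 590,
            319, 29902, 10863, 261, 8379, 1242, 29889]
phi3_pieces = ['I', 'am', 'excited', 'to', 'how', 'Token', 'izers', 'in',
               'action', 'to', 'all', 'my', 'A', 'I', 'Engine', 'er', 'fell',
               'ows', '.']

llama_map = dict(zip(llama_ids, llama_pieces))
phi3_map = dict(zip(phi3_ids, phi3_pieces))

shared_ids = sorted(set(llama_map) & set(phi3_map))
for token_id in shared_ids:
    print(token_id, "| Llama 3.1:", repr(llama_map[token_id]),
          "| Phi3:", repr(phi3_map[token_id]))
```

The same number stands for a different piece of text in each vocabulary, so ids should always be decoded with the tokenizer that encoded them.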


In [20]:
print(tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True))
print()
print(phi3_tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True))

<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023
Today Date: 26 Jul 2024

You are a helpful assistant.<|eot_id|><|start_header_id|>user<|end_header_id|>

Tell a light-hearted joke for a room of data scientists.<|eot_id|><|start_header_id|>assistant<|end_header_id|>



<|system|>
You are a helpful assistant.<|end|>
<|user|>
Tell a light-hearted joke for a room of data scientists.<|end|>
<|assistant|>



> Qwen2 vs Llama3.1 vs Phi3

In [26]:
qwen2_tokenizer = AutoTokenizer.from_pretrained(QWEN2_MODEL_NAME)

text = "I am excited to how Tokenizers in action to all my AI Engineer fellows."
print(tokenizer.encode(text))
print()
print(phi3_tokenizer.encode(text))
print()
print(qwen2_tokenizer.encode(text))

[128000, 40, 1097, 12304, 311, 1268, 9857, 12509, 304, 1957, 311, 682, 856, 15592, 29483, 87819, 13]

[306, 626, 24173, 304, 920, 25159, 19427, 297, 3158, 304, 599, 590, 319, 29902, 10863, 261, 8379, 1242, 29889]

[40, 1079, 12035, 311, 1246, 9660, 12230, 304, 1917, 311, 678, 847, 15235, 28383, 86719, 13]
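The three encodings also differ in length. As a rough comparison, the sketch below counts the ids printed above and relates them to the 71 characters of the sentence; it works on the copied lists only and makes no claim about how these tokenizers behave on other text.

```python
text = "I am excited to how Tokenizers in action to all my AI Engineer fellows."

# Token id lists copied from the output above.
encodings = {
    "Llama 3.1": [128000, 40, 1097, 12304, 311, 1268, 9857, 12509, 304, 1957,
                  311, 682, 856, 15592, 29483, 87819, 13],
    "Phi3": [306, 626, 24173, 304, 920, 25159, 19427, 297, 3158, 304, 599, 590,
             319, 29902, 10863, 261, 8379, 1242, 29889],
    "Qwen2": [40, 1079, 12035, 311, 1246, 9660, 12230, 304, 1917, 311, 678,
              847, 15235, 28383, 86719, 13],
}

token_counts = {name: len(ids) for name, ids in encodings.items()}
for name, count in token_counts.items():
    print(f"{name:10s} {count:2d} tokens, {len(text) / count:.2f} characters per token")
```

Note that the Llama 3.1 list starts with the `<|begin_of_text|>` id (128000), so its count includes one token that carries no text.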


In [25]:
print(tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True))
print()
print(phi3_tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True))
print()
print(qwen2_tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True))

<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023
Today Date: 26 Jul 2024

You are a helpful assistant.<|eot_id|><|start_header_id|>user<|end_header_id|>

Tell a light-hearted joke for a room of data scientists.<|eot_id|><|start_header_id|>assistant<|end_header_id|>



<|system|>
You are a helpful assistant.<|end|>
<|user|>
Tell a light-hearted joke for a room of data scientists.<|end|>
<|assistant|>


<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
Tell a light-hearted joke for a room of data scientists.<|im_end|>
<|im_start|>assistant
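The Phi3 and Qwen2 layouts are simple enough to reproduce with a format string per turn. This is an illustrative sketch derived only from the outputs printed above, not the templates bundled with either tokenizer; `CHAT_FORMATS` and `render_prompt` are made-up names.

```python
# Per-turn layout and generation prompt, read off the printed outputs above.
CHAT_FORMATS = {
    "phi3": {
        "turn": "<|{role}|>\n{content}<|end|>\n",
        "generation": "<|assistant|>\n",
    },
    "qwen2": {
        "turn": "<|im_start|>{role}\n{content}<|im_end|>\n",
        "generation": "<|im_start|>assistant\n",
    },
}


def render_prompt(messages, style):
    """Render a list of chat messages in one of the layouts above."""
    fmt = CHAT_FORMATS[style]
    turns = "".join(fmt["turn"].format(**message) for message in messages)
    # The trailing generation prompt mirrors add_generation_prompt=True.
    return turns + fmt["generation"]


messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Tell a light-hearted joke for a room of data scientists."},
]

for style in CHAT_FORMATS:
    print(render_prompt(messages, style))
```

In practice, prefer `apply_chat_template`, since the real templates can handle cases this sketch ignores.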



In [28]:
starcoder2_tokenizer = AutoTokenizer.from_pretrained(STARCODER2_MODEL_NAME)

code = """
def hello_world(person):
      print(f"Hello, {person}!")
"""

tokens = starcoder2_tokenizer.encode(code)
# print each token id next to the text it decodes to
for token in tokens:
    print(f"{token} = {starcoder2_tokenizer.decode(token)}")

222 = 

610 = def
17966 =  hello
100 = _
5879 = world
45 = (
6427 = person
731 = ):
416 = 
     
1489 =  print
45 = (
107 = f
39 = "
8302 = Hello
49 = ,
320 =  {
6427 = person
130 = }
16013 = !")
222 = 
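Reading the listing, a few ids appear more than once: the newline, the opening parenthesis and `person`. A quick tally over the ids copied from the output above makes that visible.

```python
from collections import Counter

# Token ids copied, in order, from the StarCoder2 output above.
ids = [222, 610, 17966, 100, 5879, 45, 6427, 731, 416, 1489, 45, 107, 39,
       8302, 49, 320, 6427, 130, 16013, 222]

counts = Counter(ids)
repeated = {token_id: n for token_id, n in counts.items() if n > 1}

print(f"{len(ids)} tokens, {len(counts)} distinct ids")
print("Ids used more than once:", repeated)
```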

