# Introduction to Tokenizers
A tokenizer maps between text and tokens for a particular model.

- Translates between text and tokens with the `encode()` and `decode()` methods
- Contains a vocab that can include special tokens to signal information to the LLM, such as the start of a prompt
- Can include a chat template that knows how to format chat messages for this model
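
To make the three points above concrete, here is a minimal toy sketch. `ToyTokenizer` and its `<|start|>` token are hypothetical names invented for illustration; real tokenizers split text into sub-word fragments rather than whole words, so this only mirrors the shape of the interface.

```python
# Toy word-level tokenizer: illustrative only, not how Hugging Face tokenizers work internally.
class ToyTokenizer:
    def __init__(self, words):
        # ID 0 is reserved for a hypothetical start-of-text special token.
        self.vocab = {"<|start|>": 0}
        for word in words:
            self.vocab.setdefault(word, len(self.vocab))
        # Reverse mapping, needed for decoding.
        self.id_to_token = {i: t for t, i in self.vocab.items()}

    def encode(self, text):
        # Prepend the special token, then look up each whitespace-separated word.
        return [self.vocab["<|start|>"]] + [self.vocab[w] for w in text.split()]

    def decode(self, ids):
        # Map every ID back to its string and join with spaces.
        return " ".join(self.id_to_token[i] for i in ids)


toy = ToyTokenizer(["I", "am", "excited", "to", "show", "tokenizers"])
ids = toy.encode("I am excited to show tokenizers")
print(ids)
print(toy.decode(ids))
```

The decoded string contains the special token, so it is close to the input but not identical to it.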

Tokenizers compared below:

- Llama 3.2
- Phi 3
- Qwen 2
- StarCoder 2
- DeepSeek R1

In [1]:
from huggingface_hub import login  
from transformers import AutoTokenizer
from transformers import AutoModel

In [2]:
# Local: load HF_TOKEN from a .env file and log in to the Hugging Face Hub
import os
from dotenv import load_dotenv 
load_dotenv() 
hf_token = os.environ["HF_TOKEN"]
login(hf_token, add_to_git_credential=True)

Note: Environment variable`HF_TOKEN` is set and is the current active token independently from the token you've just configured.


In [3]:
# Colab: read HF_TOKEN from the Colab secrets instead (uncomment to use)
# from google.colab import userdata
# hf_token = userdata.get("HF_TOKEN") 
# login(hf_token, add_to_git_credential=True)

In [4]:
tokenizer=AutoTokenizer.from_pretrained('meta-llama/Llama-3.2-3B', trust_remote_code=True)

In [5]:
text="I am excited to show Tokenizers in action to my LLM engineers" 
tokens=tokenizer.encode(text) 
tokens

[128000,
 40,
 1097,
 12304,
 311,
 1501,
 9857,
 12509,
 304,
 1957,
 311,
 856,
 445,
 11237,
 25175]

In [7]:
len(text)

61

In [8]:
len(tokens)

15

As a rule of thumb, one token corresponds to roughly 4 characters of English text.
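
The ratio can be checked against the numbers printed above. This small sketch copies the text and the token IDs from the earlier cells, so it runs without loading a model.

```python
# Values copied from the cells above.
text = "I am excited to show Tokenizers in action to my LLM engineers"
tokens = [128000, 40, 1097, 12304, 311, 1501, 9857, 12509, 304, 1957, 311, 856, 445, 11237, 25175]

# Ratio including the <|begin_of_text|> special token (ID 128000).
chars_per_token = len(text) / len(tokens)
# Ratio counting only the tokens that carry text.
chars_per_text_token = len(text) / (len(tokens) - 1)

print(f"{chars_per_token:.2f} characters per token")
print(f"{chars_per_text_token:.2f} characters per text token")
```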

Decoding the tokens returns text that is close to, but not exactly, the original text, "I am excited to show Tokenizers in action to my LLM engineers".

In [9]:
tokenizer.decode(tokens)

'<|begin_of_text|>I am excited to show Tokenizers in action to my LLM engineers'

`<|begin_of_text|>` is a special token that maps to a single token ID (128000) and marks the start of a text prompt.

In [10]:
tokenizer.batch_decode(tokens)

['<|begin_of_text|>',
 'I',
 ' am',
 ' excited',
 ' to',
 ' show',
 ' Token',
 'izers',
 ' in',
 ' action',
 ' to',
 ' my',
 ' L',
 'LM',
 ' engineers']

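- Each string in the list corresponds to exactly one token.
- The space in front of some strings is part of the token.
- Tokenization is case sensitive: " am" and "Am" map to different tokens.
- Less common words are split into fragments: "Tokenizers" becomes " Token" + "izers", 2 tokens.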
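
A quick way to see that the leading spaces belong to the tokens is to glue the pieces back together without any separator. The sketch below reuses the list printed by `batch_decode` above, so no model is needed.

```python
text = "I am excited to show Tokenizers in action to my LLM engineers"
# Pieces copied from the batch_decode output above.
pieces = ['<|begin_of_text|>', 'I', ' am', ' excited', ' to', ' show', ' Token', 'izers',
          ' in', ' action', ' to', ' my', ' L', 'LM', ' engineers']

# Drop the special token and concatenate the rest without adding separators.
rebuilt = "".join(pieces[1:])
print(rebuilt == text)

# Pieces without a leading space are either the very first piece
# or continue the word started by the previous piece.
no_leading_space = [p for p in pieces[1:] if not p.startswith(" ")]
print(no_leading_space)
```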

##### tokenizer.vocab 
`tokenizer.vocab` is a dictionary holding the complete mapping from word fragments to token IDs.

In [11]:
tokenizer.vocab

{'ÙĤØ©': 101581,
 '_TOPIC': 75177,
 'Ġstructures': 14726,
 '130': 5894,
 'Ġdims': 41988,
 'wig': 38022,
 'Ġfoi': 22419,
 'Ġ{$': 14249,
 'addClass': 12567,
 'à¸´à¸ķ': 101267,
 'ĠRULE': 44897,
 'Ð»ÑĥÐ¶': 107919,
 'Î³ÎŃÎ½': 119926,
 'Waiting': 43204,
 'âĢľ.': 77284,
 'Tracks': 54023,
 "'est": 17771,
 'Å¡ov': 126126,
 'ĠLua': 38762,
 'å½¢æĪĲ': 115376,
 'Ġbelongings': 64028,
 'VARCHAR': 80751,
 'Ġíļ': 108366,
 'ĠyÃªu': 103755,
 'Ġ/>,': 78651,
 'ĠPipeline': 42007,
 'Ð¼Ð°Ð·': 117835,
 '*L': 87613,
 '_toggle': 49960,
 'ĠHtmlWebpackPlugin': 93515,
 'Ġ("%': 51634,
 "Ġ.'": 44684,
 'Ġasteroid': 55479,
 '(comb': 99980,
 '_COLUMNS': 73003,
 'ually': 1870,
 'Concrete': 84694,
 'ĠSpoon': 94613,
 'ĉnames': 95040,
 'Ġrocked': 78360,
 'ingle': 2222,
 'Slider': 22226,
 'Ġhandsome': 44877,
 'Ġ(~': 31857,
 'Ġintegral': 26154,
 '.:.:.:.:.:.:.:.:.:.:.:.:.:.:.:.:': 125437,
 'ìĦł': 101151,
 '/photos': 50875,
 'weak': 13451,
 'slideUp': 79374,
 'Ġcontrasting': 75055,
 '_DELETED': 87021,
 'Ġuses': 5829,
 '.ListVi
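
Many keys in the output above look garbled (`Ġstructures`, `ĉnames`, `Ð»ÑĥÐ¶`). A likely explanation is that the vocab stores raw bytes as printable stand-in characters. The sketch below assumes this tokenizer uses the same byte-to-character table as GPT-2's byte-level BPE, which the `Ġ` prefixes suggest but which the notebook does not confirm. `readable` is a hypothetical helper name.

```python
def bytes_to_unicode():
    """Byte -> printable character table in the style of GPT-2's byte-level BPE."""
    # Bytes that are already printable map to themselves:
    # '!'..'~', then the two Latin-1 ranges 0xA1..0xAC and 0xAE..0xFF.
    printable = (list(range(0x21, 0x7F))
                 + list(range(0xA1, 0xAD))
                 + list(range(0xAE, 0x100)))
    byte_values = printable[:]
    code_points = printable[:]
    shift = 0
    for b in range(256):
        if b not in printable:
            # Remaining bytes (space, control characters, ...) are moved above U+00FF.
            byte_values.append(b)
            code_points.append(256 + shift)
            shift += 1
    return dict(zip(byte_values, (chr(c) for c in code_points)))


# Reverse table: stand-in character -> original byte value.
UNICODE_TO_BYTE = {ch: b for b, ch in bytes_to_unicode().items()}


def readable(vocab_key):
    """Turn a vocab key back into the text it stands for."""
    raw = bytes(UNICODE_TO_BYTE[ch] for ch in vocab_key)
    return raw.decode("utf-8", errors="replace")


for key in ["Ġstructures", "ĉnames", "Ð»ÑĥÐ¶"]:
    print(repr(key), "->", repr(readable(key)))
```

Under that assumption, `Ġ` stands for a space and `ĉ` for a tab, and the multi-character sequences are UTF-8 bytes of non-Latin text.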

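`get_added_vocab()` lists the tokens added on top of the base vocabulary. Here these are the special tokens, many of which are reserved placeholders.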

In [11]:
tokenizer.get_added_vocab()

{'<|begin_of_text|>': 128000,
 '<|end_of_text|>': 128001,
 '<|reserved_special_token_0|>': 128002,
 '<|reserved_special_token_1|>': 128003,
 '<|finetune_right_pad_id|>': 128004,
 '<|reserved_special_token_2|>': 128005,
 '<|start_header_id|>': 128006,
 '<|end_header_id|>': 128007,
 '<|eom_id|>': 128008,
 '<|eot_id|>': 128009,
 '<|python_tag|>': 128010,
 '<|reserved_special_token_3|>': 128011,
 '<|reserved_special_token_4|>': 128012,
 '<|reserved_special_token_5|>': 128013,
 '<|reserved_special_token_6|>': 128014,
 '<|reserved_special_token_7|>': 128015,
 '<|reserved_special_token_8|>': 128016,
 '<|reserved_special_token_9|>': 128017,
 '<|reserved_special_token_10|>': 128018,
 '<|reserved_special_token_11|>': 128019,
 '<|reserved_special_token_12|>': 128020,
 '<|reserved_special_token_13|>': 128021,
 '<|reserved_special_token_14|>': 128022,
 '<|reserved_special_token_15|>': 128023,
 '<|reserved_special_token_16|>': 128024,
 '<|reserved_special_token_17|>': 128025,
 '<|reserved_special_to

Instruct models are fine-tuned for chat.

In [12]:
tokenizer = AutoTokenizer.from_pretrained('meta-llama/Llama-3.2-3B-Instruct', trust_remote_code=True)

Instruct models have been trained to expect prompts in a particular structure, with a particular set of special tokens that identify the system message, the user message and the assistant's responses, so that together they form a chat.

In [13]:
messages = [
    {'role': "system", 'content': "You are a helpful assistant" }, 
    {'role': "user", 'content': "Tell a light-hearted joke for a room of Data Scientists"}
]

Hugging Face tokenizers provide the `apply_chat_template()` method. It takes messages in the OpenAI API format shown above and converts them into the prompt structure that this particular model expects.

With `tokenize=False` the method returns the prompt as text, as printed below. With `tokenize=True` it returns the token IDs instead.

In [14]:
prompt=tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True) 
print(prompt)

<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023
Today Date: 01 Feb 2025

You are a helpful assistant<|eot_id|><|start_header_id|>user<|end_header_id|>

Tell a light-hearted joke for a room of Data Scientists<|eot_id|><|start_header_id|>assistant<|end_header_id|>
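
The structure of this prompt can be made visible by splitting it back into turns. This is a rough sketch using a regular expression on a simplified copy of the output above (the two date lines are omitted); it is meant for inspection only and is not part of the `transformers` API.

```python
import re

# Simplified copy of the printed prompt, without the two date lines.
prompt = (
    "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\n"
    "You are a helpful assistant<|eot_id|>"
    "<|start_header_id|>user<|end_header_id|>\n\n"
    "Tell a light-hearted joke for a room of Data Scientists<|eot_id|>"
    "<|start_header_id|>assistant<|end_header_id|>\n\n"
)

# A header names the role; the message body runs until <|eot_id|> or the end of the prompt.
pattern = re.compile(
    r"<\|start_header_id\|>(?P<role>.*?)<\|end_header_id\|>\n\n"
    r"(?P<content>.*?)(?:<\|eot_id\|>|\Z)",
    re.DOTALL,
)

turns = [(m["role"], m["content"]) for m in pattern.finditer(prompt)]
for role, content in turns:
    print(f"{role!r}: {content!r}")
```

The last turn has the role `assistant` and empty content: that open header is what `add_generation_prompt=True` appends so the model continues with its reply.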




### New Models

In [15]:
PHI3_MODEL_NAME="microsoft/Phi-3-mini-4k-instruct" 
QWEN2_MODEL_NAME="Qwen/Qwen2-7B-Instruct"
STARCODER2_MODEL_NAME="bigcode/starcoder2-7b"

In [16]:
phi3_tokenizer=AutoTokenizer.from_pretrained(PHI3_MODEL_NAME)

In [17]:
text = "I am excited to show Tokenizers in action to my LLM engineers"
print("------Llama3----------")
print(tokenizer.encode(text)) 
print("------PHI3------------")
print(phi3_tokenizer.encode(text))

------Llama3----------
[128000, 40, 1097, 12304, 311, 1501, 9857, 12509, 304, 1957, 311, 856, 445, 11237, 25175]
------PHI3------------
[306, 626, 24173, 304, 1510, 25159, 19427, 297, 3158, 304, 590, 365, 26369, 6012, 414]


In [18]:
print("------Llama3----------")
llama3_tokens=tokenizer.encode(text)
print(llama3_tokens)
print(tokenizer.batch_decode(llama3_tokens))
print("------PHI3------------")
phi3_tokens=phi3_tokenizer.encode(text)
print(phi3_tokens) 
print(phi3_tokenizer.batch_decode(phi3_tokens))

------Llama3----------
[128000, 40, 1097, 12304, 311, 1501, 9857, 12509, 304, 1957, 311, 856, 445, 11237, 25175]
['<|begin_of_text|>', 'I', ' am', ' excited', ' to', ' show', ' Token', 'izers', ' in', ' action', ' to', ' my', ' L', 'LM', ' engineers']
------PHI3------------
[306, 626, 24173, 304, 1510, 25159, 19427, 297, 3158, 304, 590, 365, 26369, 6012, 414]
['I', 'am', 'excited', 'to', 'show', 'Token', 'izers', 'in', 'action', 'to', 'my', 'L', 'LM', 'engine', 'ers']


In [19]:
print("---------Llama3-------")
print(tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True) )
print("---------PHI3---------")
print(phi3_tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True))

---------Llama3-------
<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023
Today Date: 01 Feb 2025

You are a helpful assistant<|eot_id|><|start_header_id|>user<|end_header_id|>

Tell a light-hearted joke for a room of Data Scientists<|eot_id|><|start_header_id|>assistant<|end_header_id|>


---------PHI3---------
<|system|>
You are a helpful assistant<|end|>
<|user|>
Tell a light-hearted joke for a room of Data Scientists<|end|>
<|assistant|>



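The two models use different prompt formats: Llama 3 wraps each role in `<|start_header_id|>` and `<|end_header_id|>` and ends each message with `<|eot_id|>`, while Phi 3 uses `<|system|>`, `<|user|>` and `<|assistant|>` markers and ends each message with `<|end|>`.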
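
To show the difference side by side, here are two hand-written renderers that approximate the layouts printed above. `render_llama3_style` and `render_phi3_style` are hypothetical helpers, not the models' real chat templates: the Llama sketch leaves out the date lines, and exact whitespace is an assumption read off the printed output.

```python
def render_llama3_style(messages):
    """Approximate the Llama 3 layout: role headers plus <|eot_id|> terminators."""
    out = "<|begin_of_text|>"
    for m in messages:
        out += f"<|start_header_id|>{m['role']}<|end_header_id|>\n\n{m['content']}<|eot_id|>"
    # Open the assistant turn so the model continues from here.
    return out + "<|start_header_id|>assistant<|end_header_id|>\n\n"


def render_phi3_style(messages):
    """Approximate the Phi 3 layout: one marker per role plus <|end|> terminators."""
    out = ""
    for m in messages:
        out += f"<|{m['role']}|>\n{m['content']}<|end|>\n"
    # Open the assistant turn so the model continues from here.
    return out + "<|assistant|>\n"


messages = [
    {"role": "system", "content": "You are a helpful assistant"},
    {"role": "user", "content": "Tell a light-hearted joke for a room of Data Scientists"},
]
print(render_llama3_style(messages))
print(render_phi3_style(messages))
```

In practice `apply_chat_template()` should be preferred, since it uses the template shipped with each model.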

In [20]:
qwen2_tokenizer=AutoTokenizer.from_pretrained(QWEN2_MODEL_NAME) 

print("---------------Llama3---------------")
print(tokenizer.encode(text))
print("---------------PHI3-----------------")
print(phi3_tokenizer.encode(text))
print("---------------QWEN2-----------------")
print(qwen2_tokenizer.encode(text))

---------------Llama3---------------
[128000, 40, 1097, 12304, 311, 1501, 9857, 12509, 304, 1957, 311, 856, 445, 11237, 25175]
---------------PHI3-----------------
[306, 626, 24173, 304, 1510, 25159, 19427, 297, 3158, 304, 590, 365, 26369, 6012, 414]
---------------QWEN2-----------------
[40, 1079, 12035, 311, 1473, 9660, 12230, 304, 1917, 311, 847, 444, 10994, 24198]


In [23]:
print("---------------Llama3---------------")
print(tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True))
print("---------------PHI3-----------------")
print(phi3_tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True))
print("---------------QWEN2-----------------")
print(qwen2_tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True))

---------------Llama3---------------
<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023
Today Date: 01 Feb 2025

You are a helpful assistant<|eot_id|><|start_header_id|>user<|end_header_id|>

Tell a light-hearted joke for a room of Data Scientists<|eot_id|><|start_header_id|>assistant<|end_header_id|>


---------------PHI3-----------------
<|system|>
You are a helpful assistant<|end|>
<|user|>
Tell a light-hearted joke for a room of Data Scientists<|end|>
<|assistant|>

---------------QWEN2-----------------
<|im_start|>system
You are a helpful assistant<|im_end|>
<|im_start|>user
Tell a light-hearted joke for a room of Data Scientists<|im_end|>
<|im_start|>assistant



In [21]:
starcoder2_tokenizer=AutoTokenizer.from_pretrained(STARCODER2_MODEL_NAME)

In [22]:
code = """
def hello_world(person):
    print("Hello", person)
"""

In [23]:
tokens=starcoder2_tokenizer.encode(code)
for token in tokens:
    print(f"{token}={starcoder2_tokenizer.decode(token)}")

222=

610=def
17966= hello
100=_
5879=world
45=(
6427=person
731=):
303=
   
1489= print
459=("
8302=Hello
411=",
4944= person
46=)
222=



In [24]:
deepseek_r1_model = "deepseek-ai/DeepSeek-R1"
deepseek_r1_tokenizer = AutoTokenizer.from_pretrained(deepseek_r1_model) 

In [25]:
print("---------------Llama3---------------")
print(tokenizer.encode(text))
print("---------------PHI3-----------------")
print(phi3_tokenizer.encode(text))
print("---------------QWEN2-----------------")
print(qwen2_tokenizer.encode(text))
print("---------------DEEPSEEK--------------")
print(deepseek_r1_tokenizer.encode(text))

---------------Llama3---------------
[128000, 40, 1097, 12304, 311, 1501, 9857, 12509, 304, 1957, 311, 856, 445, 11237, 25175]
---------------PHI3-----------------
[306, 626, 24173, 304, 1510, 25159, 19427, 297, 3158, 304, 590, 365, 26369, 6012, 414]
---------------QWEN2-----------------
[40, 1079, 12035, 311, 1473, 9660, 12230, 304, 1917, 311, 847, 444, 10994, 24198]
---------------DEEPSEEK--------------
[0, 43, 1030, 15046, 304, 1801, 47948, 24524, 295, 4271, 304, 1026, 33792, 47, 26170]


In [26]:
print("---------------Llama3---------------")
print(tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True))
print("---------------PHI3-----------------")
print(phi3_tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True))
print("---------------QWEN2-----------------")
print(qwen2_tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True))
print("---------------DEEPSEEK--------------")
print(deepseek_r1_tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True))

---------------Llama3---------------
<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023
Today Date: 01 Feb 2025

You are a helpful assistant<|eot_id|><|start_header_id|>user<|end_header_id|>

Tell a light-hearted joke for a room of Data Scientists<|eot_id|><|start_header_id|>assistant<|end_header_id|>


---------------PHI3-----------------
<|system|>
You are a helpful assistant<|end|>
<|user|>
Tell a light-hearted joke for a room of Data Scientists<|end|>
<|assistant|>

---------------QWEN2-----------------
<|im_start|>system
You are a helpful assistant<|im_end|>
<|im_start|>user
Tell a light-hearted joke for a room of Data Scientists<|im_end|>
<|im_start|>assistant

---------------DEEPSEEK--------------
<｜begin▁of▁sentence｜>You are a helpful assistant<｜User｜>Tell a light-hearted joke for a room of Data Scientists<｜Assistant｜>
