## Tokenizer

A **tokenizer** is the bridge between human-readable text and the integer token IDs a model actually processes:

- **Encode/Decode**  
  Converts your input text into tokens (numbers) before feeding it to the model, and turns tokens back into readable text.

- **Vocabulary & Special Tokens**  
  Holds the list of all tokens the model knows (its “vocab”), including special markers like start-of-prompt, end-of-text, speaker labels, etc.

- **Chat Templates (Optional)**  
  Some tokenizers come with pre-built templates to wrap your messages (e.g. system/user tags), so you don’t have to craft them by hand every time.
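
The three roles above can be sketched with a toy word-level tokenizer. This is only an illustration: real LLM tokenizers use learned subword vocabularies, and the `ToyTokenizer` class and its `<|unk|>` token are hypothetical names made up for this example.

```python
# Toy word-level tokenizer (hypothetical, for illustration only).
class ToyTokenizer:
    BOS = "<|begin_of_text|>"   # special marker for the start of a text
    UNK = "<|unk|>"             # assumed placeholder for unknown words

    def __init__(self, corpus):
        # Special tokens take the first ids, then every distinct word in the corpus.
        words = sorted(set(corpus.split()))
        self.vocab = {tok: i for i, tok in enumerate([self.BOS, self.UNK] + words)}
        self.inverse = {i: tok for tok, i in self.vocab.items()}

    def encode(self, text):
        # Text -> ids, with the start marker prepended.
        unk = self.vocab[self.UNK]
        return [self.vocab[self.BOS]] + [self.vocab.get(w, unk) for w in text.split()]

    def decode(self, ids, skip_special_tokens=True):
        # Ids -> text, optionally hiding the start marker.
        specials = {self.vocab[self.BOS]}
        toks = [self.inverse[i] for i in ids
                if not (skip_special_tokens and i in specials)]
        return " ".join(toks)


sentence = "tokenizers turn text into numbers"
toy = ToyTokenizer(sentence)
toy_ids = toy.encode(sentence)
print(toy_ids)
print(toy.decode(toy_ids))
print(toy.decode(toy_ids, skip_special_tokens=False))
```

The same encode, decode, and special-token ideas appear in the real tokenizers loaded later, only with much larger vocabularies.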


In [1]:
!pip install -q transformers

In [4]:
from google.colab import userdata
from huggingface_hub import login
from transformers import AutoTokenizer

## 1. Hugging Face API Token

1. Go to https://huggingface.co and **sign up** or log in.  
2. Open **Settings → Access Tokens** and click **Create new token**.  
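3. Under **Permissions**, select at least **Read** (read access is enough to download models; **Write** is only needed for uploads), then **Generate** and copy the token.  
4. In Colab, click the key icon in the left side panel, add the secret below, and enable notebook access for it:  
   ```bash
   HF_TOKEN=<your_token>
   ```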


In [5]:
hf_token = userdata.get("HF_TOKEN")
login(hf_token)
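
Outside Colab, `userdata` is not available. A minimal sketch of an alternative, assuming the token is stored in an environment variable named `HF_TOKEN`; the helper `get_hf_token` is a hypothetical name introduced here, not part of any library.

```python
import os


def get_hf_token(env=None, name="HF_TOKEN"):
    """Return the token stored under `name`, or raise a clear error if it is missing."""
    env = os.environ if env is None else env
    token = env.get(name, "").strip()
    if not token:
        raise RuntimeError(f"{name} is not set; add it as a secret or environment variable.")
    return token


# Demonstration with a fake mapping instead of the real environment.
print(get_hf_token({"HF_TOKEN": "hf_example_token"}))

# Real usage (not executed here):
# from huggingface_hub import login
# login(token=get_hf_token())
```

Failing early with a readable message is easier to debug than a later authentication error from the model download.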

## 2. Meta Llama 3.1 Access

1. Open  
   https://huggingface.co/meta-llama/Meta-Llama-3.1-8B
2. Click the “Accept terms” banner (use your HF email).  
3. Wait a few minutes for approval; once granted, it covers all Llama 3.1 models.

**Troubleshoot if you see errors:**  
- Run `login()` to check you’re logged in.  
- Make sure your HF token has at least Read rights.  
- On the model page, confirm it says “Access granted.”  
- Still blocked? Try this Colab:  
  https://colab.research.google.com/drive/1deJO03YZTXUwcq2vzxWbiBhrRuI29Vo8?usp=sharing  


In [8]:
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B", trust_remote_code=True)


In [9]:
text = "I am an AI developer/engineer and looking forward to showing my LLM engineers how tokenizers work."
tokens = tokenizer.encode(text)
tokens

[128000,
 40,
 1097,
 459,
 15592,
 16131,
 83145,
 261,
 323,
 3411,
 4741,
 311,
 9204,
 856,
 445,
 11237,
 25175,
 1268,
 4037,
 12509,
 990,
 13]

**Note:** The first token, `128000`, is the special token `<|begin_of_text|>`. It tells the model where a new text or prompt begins.

In [10]:
len(tokens)

22

In [11]:
# Decode to recreate the original text
decoding = tokenizer.decode(tokens)
decoding

'<|begin_of_text|>I am an AI developer/engineer and looking forward to showing my LLM engineers how tokenizers work.'

In [12]:
# Decode each token id separately to see the text piece behind every id
batch_decodes = tokenizer.batch_decode(tokens)
batch_decodes

['<|begin_of_text|>',
 'I',
 ' am',
 ' an',
 ' AI',
 ' developer',
 '/engine',
 'er',
 ' and',
 ' looking',
 ' forward',
 ' to',
 ' showing',
 ' my',
 ' L',
 'LM',
 ' engineers',
 ' how',
 ' token',
 'izers',
 ' work',
 '.']
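
To make the link between ids and pieces explicit, the sketch below copies the two outputs shown above into plain Python lists (no model download needed), pairs them up, and checks that the pieces concatenate back to the original sentence. The variable names are illustrative.

```python
# Token ids and decoded pieces copied from the notebook outputs above.
ids = [128000, 40, 1097, 459, 15592, 16131, 83145, 261, 323, 3411, 4741, 311,
       9204, 856, 445, 11237, 25175, 1268, 4037, 12509, 990, 13]
pieces = ['<|begin_of_text|>', 'I', ' am', ' an', ' AI', ' developer', '/engine', 'er',
          ' and', ' looking', ' forward', ' to', ' showing', ' my', ' L', 'LM',
          ' engineers', ' how', ' token', 'izers', ' work', '.']
text = ("I am an AI developer/engineer and looking forward to showing "
        "my LLM engineers how tokenizers work.")

# One id corresponds to exactly one piece.
for token_id, piece in zip(ids, pieces):
    print(f"{token_id:>6} -> {piece!r}")

# Dropping the special start token and joining the pieces restores the text.
rebuilt = "".join(pieces[1:])
print(rebuilt == text)

# Rough density measure: text tokens (excluding the start token) per word.
tokens_per_word = (len(ids) - 1) / len(text.split())
print(round(tokens_per_word, 2))
```

Note that the leading space belongs to the piece (`' am'`, `' an'`), which is why a plain join with no separator is enough to rebuild the sentence.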

In [14]:
# The full vocabulary: a dictionary mapping each token (word fragment) to its integer id.
tokenizer.vocab

{'ĠAcer': 77077,
 '.Exceptions': 69367,
 '.portal': 35863,
 'ĠØ¨ØŃØ«': 113751,
 'Eric': 50554,
 'Î¸Î®': 105344,
 'ĠæŀĹ': 115897,
 'ĠØ³ÙĦØ³': 124051,
 'Ġsupervised': 60089,
 '-ÑĤÐ°ÐºÐ¸': 121601,
 'à¹Ģà¸®': 121975,
 '_scene': 38396,
 'obuf': 18971,
 'áº»': 101498,
 'Ġsao': 104372,
 'ĠBankasÄ±': 127703,
 '.raw': 18648,
 'Ġchiáº¿n': 104854,
 'åĿļ': 114248,
 'åĢ«': 119862,
 'ĉrestore': 97127,
 'Ġkrij': 70537,
 '.setName': 28737,
 'ĠCarmen': 71058,
 'ï¼Įå°±æĺ¯': 121191,
 'estureRecognizer': 83020,
 'Fields': 9118,
 'Ġ])->': 92513,
 'grammar': 42194,
 'ê¹Ģ': 108922,
 'ĠÐ¿ÑĢÐµÐ¿Ð°ÑĢÐ°ÑĤÑĭ': 125873,
 'ĠChÃ¢u': 113917,
 'ĠâĪĴ': 25173,
 '(sim': 48860,
 'ĠBaptist': 43748,
 "$('": 37188,
 'Ġnothing': 4400,
 'ĠHess': 99805,
 '.INPUT': 82914,
 'ĠIo': 30755,
 'cimiento': 67137,
 '.body': 5189,
 '+b': 36193,
 'Ġbesteht': 99521,
 'Ġcreations': 53862,
 'ĠPy': 5468,
 'Ġdiscriminator': 82838,
 ':variables': 57305,
 'ĠIGN': 50939,
 'ĠçĶ·': 109078,
 'ÙĨØ§Ø¯': 115577,
 '.atan': 72616,
 'IGGER': 41361,
 'ĠnÃ¤c
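
Keys such as `'Ġsupervised'` or `'Î¸Î®'` look garbled because byte-level BPE vocabularies store raw bytes as printable stand-in characters. The sketch below rebuilds the GPT-2 style byte-to-character table and uses it to turn a few keys from the output above back into readable text. It assumes this tokenizer uses that same table, which the `Ġ` prefixes suggest; `readable` is a hypothetical helper name.

```python
def bytes_to_unicode():
    """GPT-2 style table: every byte value 0..255 gets a printable unicode character."""
    printable = (list(range(ord("!"), ord("~") + 1))
                 + list(range(ord("¡"), ord("¬") + 1))
                 + list(range(ord("®"), ord("ÿ") + 1)))
    byte_values = printable[:]
    code_points = printable[:]
    shift = 0
    for b in range(256):
        if b not in printable:
            # Non-printable bytes (space, tab, ...) are moved to code points 256 and up.
            byte_values.append(b)
            code_points.append(256 + shift)
            shift += 1
    return {b: chr(c) for b, c in zip(byte_values, code_points)}


BYTE_TO_CHAR = bytes_to_unicode()
CHAR_TO_BYTE = {c: b for b, c in BYTE_TO_CHAR.items()}


def readable(vocab_key):
    """Map a vocabulary key back to bytes, then decode those bytes as UTF-8."""
    raw = bytes(CHAR_TO_BYTE[ch] for ch in vocab_key)
    return raw.decode("utf-8", errors="replace")


for key in ["Ġsupervised", "ĉrestore", "Î¸Î®", "ĠChÃ¢u"]:
    print(repr(key), "->", repr(readable(key)))
```

Under this mapping `Ġ` stands for a space and `ĉ` for a tab, so a key starting with `Ġ` is a token that begins a new word.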

In [15]:
# The special tokens reserved in the vocab, used to signal structure (start of text, end of turn, etc.) to the LLM.
tokenizer.get_added_vocab()

{'<|begin_of_text|>': 128000,
 '<|end_of_text|>': 128001,
 '<|reserved_special_token_0|>': 128002,
 '<|reserved_special_token_1|>': 128003,
 '<|finetune_right_pad_id|>': 128004,
 '<|reserved_special_token_2|>': 128005,
 '<|start_header_id|>': 128006,
 '<|end_header_id|>': 128007,
 '<|eom_id|>': 128008,
 '<|eot_id|>': 128009,
 '<|python_tag|>': 128010,
 '<|reserved_special_token_3|>': 128011,
 '<|reserved_special_token_4|>': 128012,
 '<|reserved_special_token_5|>': 128013,
 '<|reserved_special_token_6|>': 128014,
 '<|reserved_special_token_7|>': 128015,
 '<|reserved_special_token_8|>': 128016,
 '<|reserved_special_token_9|>': 128017,
 '<|reserved_special_token_10|>': 128018,
 '<|reserved_special_token_11|>': 128019,
 '<|reserved_special_token_12|>': 128020,
 '<|reserved_special_token_13|>': 128021,
 '<|reserved_special_token_14|>': 128022,
 '<|reserved_special_token_15|>': 128023,
 '<|reserved_special_token_16|>': 128024,
 '<|reserved_special_token_17|>': 128025,
 '<|reserved_special_to

## 3. Instruct Model Variants

An **Instruct** model is simply a version of an LLM that’s been fine-tuned to follow human-style “instructions” or chat prompts. Under the hood:

- It’s trained on datasets where inputs are labeled as `system`, `user`, and `assistant` messages.

- This tuning helps it interpret “do X” or “explain Y” requests more reliably.

- When a model name ends in `-Instruct`, the model has already been tuned for instruction following (chat) rather than plain text completion.


Use the tokenizer's helper to format a conversation:  
```python
tokenizer.apply_chat_template(messages)
```


**When to use:**  
Pick an **Instruct** variant whenever you need a conversational, instruction-aware LLM.  

In [17]:
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-8B-Instruct", trust_remote_code=True)


In [18]:
messages = [
    {"role": "system", "content": "You are a helpful assistant"},
    {"role": "user", "content": "Tell a light-hearted joke for a room of AI Engineer"}
]

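# tokenize=False returns the formatted string; add_generation_prompt=True appends the assistant header
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)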
print(prompt)

<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023
Today Date: 26 Jul 2024

You are a helpful assistant<|eot_id|><|start_header_id|>user<|end_header_id|>

Tell a light-hearted joke for a room of AI Engineer<|eot_id|><|start_header_id|>assistant<|end_header_id|>




## Tokenizers for Key Models

We’ll try out tokenizers for several open-source LLMs:

- **Llama 3.1**  
  Meta’s open-weight model family  
- **Phi 3**  
  Microsoft’s small language model  
- **Qwen2**  
  Alibaba Cloud’s model family  
- **Starcoder2**  
  BigCode’s model specialized for code  


In [19]:
PHI3_MODEL_NAME = "microsoft/Phi-3-mini-4k-instruct"
QWEN2_MODEL_NAME = "Qwen/Qwen2-7B-Instruct"
STARCODER2_MODEL_NAME = "bigcode/starcoder2-3b"

In [20]:
# 1- Phi 3
phi3_tokenizer = AutoTokenizer.from_pretrained(PHI3_MODEL_NAME)

text = "I am an AI developer/engineer and excited to give my LLM team a hands-on tour of Tokenizers."
# Llama 3.1 token ids, for comparison
print(tokenizer.encode(text))
print("======================")
# Phi 3: decode each id separately to see the individual pieces
tokens = phi3_tokenizer.encode(text)
print(phi3_tokenizer.batch_decode(tokens))


[128000, 40, 1097, 459, 15592, 16131, 83145, 261, 323, 12304, 311, 3041, 856, 445, 11237, 2128, 264, 6206, 10539, 7364, 315, 9857, 12509, 13]
['I', 'am', 'an', 'A', 'I', 'developer', '/', 'engine', 'er', 'and', 'excited', 'to', 'give', 'my', 'L', 'LM', 'team', 'a', 'hands', '-', 'on', 'tour', 'of', 'Token', 'izers', '.']


In [21]:
print(tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True))
print("=======================")
print(phi3_tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True))

<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023
Today Date: 26 Jul 2024

You are a helpful assistant<|eot_id|><|start_header_id|>user<|end_header_id|>

Tell a light-hearted joke for a room of AI Engineer<|eot_id|><|start_header_id|>assistant<|end_header_id|>


<|system|>
You are a helpful assistant<|end|>
<|user|>
Tell a light-hearted joke for a room of AI Engineer<|end|>
<|assistant|>



In [24]:
# 2- Qwen2
qwen2_tokenizer = AutoTokenizer.from_pretrained(QWEN2_MODEL_NAME)

text = "I am an AI developer/engineer and excited to give my LLM team a hands-on tour of Tokenizers."
print(tokenizer.encode(text))
print("======================")
print(phi3_tokenizer.encode(text))
print("======================")
print(qwen2_tokenizer.encode(text))


[128000, 40, 1097, 459, 15592, 16131, 83145, 261, 323, 12304, 311, 3041, 856, 445, 11237, 2128, 264, 6206, 10539, 7364, 315, 9857, 12509, 13]
[306, 626, 385, 319, 29902, 13897, 29914, 10599, 261, 322, 24173, 304, 2367, 590, 365, 26369, 3815, 263, 6567, 29899, 265, 6282, 310, 25159, 19427, 29889]
[40, 1079, 458, 15235, 15754, 82045, 261, 323, 12035, 311, 2968, 847, 444, 10994, 2083, 264, 6078, 10326, 7216, 315, 9660, 12230, 13]
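
The three id lists above are easier to compare once they are counted. The sketch below copies them into a dictionary and reports how many ids each tokenizer produced for the same sentence. It is a rough comparison only: ids are meaningful only within their own vocabulary, and the special start id is assumed to be `128000` for Llama 3.1, as shown earlier.

```python
text = ("I am an AI developer/engineer and excited to give my LLM team "
        "a hands-on tour of Tokenizers.")

# Encodings copied from the output above.
encodings = {
    "Llama 3.1": [128000, 40, 1097, 459, 15592, 16131, 83145, 261, 323, 12304, 311, 3041,
                  856, 445, 11237, 2128, 264, 6206, 10539, 7364, 315, 9857, 12509, 13],
    "Phi 3": [306, 626, 385, 319, 29902, 13897, 29914, 10599, 261, 322, 24173, 304, 2367,
              590, 365, 26369, 3815, 263, 6567, 29899, 265, 6282, 310, 25159, 19427, 29889],
    "Qwen2": [40, 1079, 458, 15235, 15754, 82045, 261, 323, 12035, 311, 2968, 847, 444,
              10994, 2083, 264, 6078, 10326, 7216, 315, 9660, 12230, 13],
}

LLAMA_BOS_ID = 128000  # <|begin_of_text|>, only present in the Llama list
n_words = len(text.split())
counts = {name: len(ids) for name, ids in encodings.items()}

for name, ids in encodings.items():
    # Exclude the start-of-text marker so only tokens covering the sentence are counted.
    text_tokens = len(ids) - (1 if ids[0] == LLAMA_BOS_ID else 0)
    print(f"{name:<10} ids={counts[name]:>2}  text tokens={text_tokens:>2}  "
          f"tokens/word={text_tokens / n_words:.2f}")
```

Fewer tokens for the same text means more text fits in a fixed context window, which is one practical reason tokenizer choice matters.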


In [25]:
print(tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True))
print("=======================")
print(phi3_tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True))
print("=======================")
print(qwen2_tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True))

<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023
Today Date: 26 Jul 2024

You are a helpful assistant<|eot_id|><|start_header_id|>user<|end_header_id|>

Tell a light-hearted joke for a room of AI Engineer<|eot_id|><|start_header_id|>assistant<|end_header_id|>


<|system|>
You are a helpful assistant<|end|>
<|user|>
Tell a light-hearted joke for a room of AI Engineer<|end|>
<|assistant|>

<|im_start|>system
You are a helpful assistant<|im_end|>
<|im_start|>user
Tell a light-hearted joke for a room of AI Engineer<|im_end|>
<|im_start|>assistant

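The Phi 3 and Qwen2 layouts printed above are simple enough to reproduce by hand, which shows what a chat template does: it wraps each message in model-specific marker tokens. The two functions below are hypothetical hand-written imitations of those layouts, not the templates shipped with the tokenizers, and they ignore details such as default system prompts.

```python
messages = [
    {"role": "system", "content": "You are a helpful assistant"},
    {"role": "user", "content": "Tell a light-hearted joke for a room of AI Engineer"},
]


def render_chatml_style(messages, add_generation_prompt=True):
    """Imitates the <|im_start|>role ... <|im_end|> layout seen in the Qwen2 output."""
    out = "".join(f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages)
    if add_generation_prompt:
        # Open an assistant turn so the model continues with its reply.
        out += "<|im_start|>assistant\n"
    return out


def render_phi3_style(messages, add_generation_prompt=True):
    """Imitates the <|role|> ... <|end|> layout seen in the Phi 3 output."""
    out = "".join(f"<|{m['role']}|>\n{m['content']}<|end|>\n" for m in messages)
    if add_generation_prompt:
        out += "<|assistant|>\n"
    return out


print(render_chatml_style(messages))
print("=======================")
print(render_phi3_style(messages))
```

In practice, prefer `apply_chat_template`: the marker tokens must match exactly what the model saw during fine-tuning, and the shipped template is the reliable source for that.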


In [26]:
# 3- Starcoder2
starcoder2_tokenizer = AutoTokenizer.from_pretrained(STARCODER2_MODEL_NAME, trust_remote_code=True)

code = """
def hello_world(person):
  print("Hello", person)
"""

tokens = starcoder2_tokenizer.encode(code)
for token in tokens:
  print(f"{token}={starcoder2_tokenizer.decode(token)}")


222=

610=def
17966= hello
100=_
5879=world
45=(
6427=person
731=):
353=
 
1489= print
459=("
8302=Hello
411=",
4944= person
46=)
222=

