# Tokenizers

In this Colab session, we explore how tokenizers work.

You can run this notebook on a free CPU, or locally on your box if you prefer.


In [1]:
# Importing the `userdata` module from Google Colab.
# `userdata` reads the secrets stored in the Colab "key" side panel (such as API tokens).
# It is built into the Colab runtime and is not available when running locally.
from google.colab import userdata

# Importing the `login` function from the Hugging Face Hub library.
# This function allows users to authenticate and log in to their Hugging Face account
# to access models, datasets, or other resources that require authentication.
from huggingface_hub import login

# Importing the `AutoTokenizer` class from the Transformers library.
# The `AutoTokenizer` is a versatile tokenizer that automatically selects the
# appropriate tokenizer class for a given pre-trained model. It is useful for preparing
# text data for NLP tasks such as tokenization and model input formatting.
from transformers import AutoTokenizer

# Sign in to Hugging Face

1. If you haven't already done so, create a free Hugging Face account at https://huggingface.co, navigate to Settings, and create a new API token with write permissions.

2. Press the "key" icon on the side panel to the left, and add a new secret:
`HF_TOKEN = your_token`

3. Execute the cell below to log in.

In [2]:
hf_token = userdata.get('HF_TOKEN')
login(hf_token, add_to_git_credential=True)
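
Because the notebook can also be run locally, where `google.colab` is not available, the token lookup can be sketched with a fallback. This is a minimal sketch and not part of the original notebook: the helper name `get_hf_token` is hypothetical, and reading the token from an `HF_TOKEN` environment variable outside Colab is an assumption about how you store it on your own machine.

```python
import os


def get_hf_token(secret_name="HF_TOKEN"):
    """Return the token from Colab secrets, else from the environment.

    `get_hf_token` is a hypothetical helper, not part of any library.
    """
    try:
        # Only importable inside a Colab runtime.
        from google.colab import userdata
    except ImportError:
        # Assumption: outside Colab the token lives in an environment variable.
        return os.environ.get(secret_name)
    return userdata.get(secret_name)
```

The returned value can then be passed to `login` in place of the direct `userdata.get('HF_TOKEN')` call.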

# Accessing Llama 3.1 from Meta

To use Llama 3.1, Meta requires you to accept their terms of service.

Visit their model instructions page in Hugging Face:
https://huggingface.co/meta-llama/Meta-Llama-3.1-8B

At the top of the page are instructions on how to agree to their terms. If possible, use the same email address as your Hugging Face account.

In my experience approval comes in a couple of minutes. Once you've been approved for any 3.1 model, it applies to the whole family of models.


In [3]:
# Initializing a tokenizer using the `AutoTokenizer` class from the Transformers library.
# The `from_pretrained` method loads a pre-trained tokenizer from the Hugging Face model hub.
# - 'meta-llama/Meta-Llama-3.1-8B': Specifies the identifier of the pre-trained model for which
#   the tokenizer is being loaded. This is a LLaMA model variant with 8 billion parameters.
# - trust_remote_code=True: This parameter allows the use of custom code provided by the model's repository
#   on Hugging Face. This is necessary when the model requires specific preprocessing or tokenization logic
#   defined in the repository. Ensure that the code is from a trusted source to avoid security risks.
tokenizer = AutoTokenizer.from_pretrained('meta-llama/Meta-Llama-3.1-8B', trust_remote_code=True)



In [4]:
text = "I am excited to show Tokenizers in action to my LLM engineers"
# The `encode` method converts the input text into a sequence of integers (token IDs)
# that correspond to the vocabulary of the pre-trained model.
# These token IDs are the numerical representation of the text used as input for the model.
tokens = tokenizer.encode(text)
# Print or display the resulting token IDs.
tokens

[128000,
 40,
 1097,
 12304,
 311,
 1501,
 9857,
 12509,
 304,
 1957,
 311,
 856,
 445,
 11237,
 25175]

In [5]:
# Rule of thumb: a token covers roughly 4 characters of English text on average.
# Text is converted into token IDs, so letter-level information is not directly visible to the model.
# This is one reason language models are often poor at questions such as
# "How many 'a's are in this sentence?": they process tokens, not individual letters.
# Their training emphasizes the meaning of text rather than character-by-character parsing.
len(tokens)

15
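
The "about 4 characters per token" rule of thumb can be checked against this very sentence. The sketch below hard-codes the pieces that the Llama 3.1 tokenizer reports for the text in this notebook (excluding the `<|begin_of_text|>` special token), so it runs without downloading a model. It is only an illustration for one sentence, not a general measurement.

```python
text = "I am excited to show Tokenizers in action to my LLM engineers"

# Token pieces for `text` as reported by the Llama 3.1 tokenizer in this
# notebook, hard-coded here and excluding the <|begin_of_text|> special token.
pieces = ["I", " am", " excited", " to", " show", " Token", "izers",
          " in", " action", " to", " my", " L", "LM", " engineers"]

# The pieces concatenate back to the original string.
assert "".join(pieces) == text

chars_per_token = len(text) / len(pieces)
print(f"{len(text)} characters / {len(pieces)} tokens = {chars_per_token:.2f}")

# Counting a letter is trivial on characters, but the model only receives
# one integer ID per piece, so the letters inside a piece are not visible.
letter = "e"
print(f"'{letter}' occurs {text.count(letter)} times in the text")
```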

In [6]:
# Decoding the list of token IDs back into a string using the `decode` method.
# This method takes the list of token IDs (numerical representation of the text)
# and reconstructs the original text or an approximation of it.
# Note that the Llama tokenizer adds the special token <|begin_of_text|> at the start.
tokenizer.decode(tokens)

'<|begin_of_text|>I am excited to show Tokenizers in action to my LLM engineers'

### Special Tokens in Language Models

Special tokens are additional tokens defined in a tokenizer's vocabulary that serve specific purposes during training and inference. These tokens are typically reserved for tasks or structural purposes and are not part of the natural language input.

#### **What is `<|begin_of_text|>`?**

- **Purpose**: The `<|begin_of_text|>` token (or its equivalent, depending on the tokenizer) is used to indicate the start of a text sequence. It provides context to the language model that a new input or prompt is starting.
- **LLaMA Model**: In the case of LLaMA and similar models, `<|begin_of_text|>` is part of the tokenizer's vocabulary and is often added by default to the beginning of the input text during tokenization. This helps the model recognize where a sequence starts.
- **Decoding**: When decoding the tokens back into text, this special token may be included unless explicitly removed by the tokenizer's settings.
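
The decoding point above can be illustrated without a model. This is a minimal sketch that filters out special-token IDs before decoding, using IDs printed elsewhere in this notebook; `strip_special` is a hypothetical helper written for illustration only.

```python
# Token IDs produced by the Llama 3.1 tokenizer earlier in this notebook.
tokens = [128000, 40, 1097, 12304, 311, 1501, 9857, 12509,
          304, 1957, 311, 856, 445, 11237, 25175]

# Two special-token IDs reported by this tokenizer's added vocabulary.
SPECIAL_IDS = {128000,  # <|begin_of_text|>
               128001}  # <|end_of_text|>


def strip_special(ids, special_ids=SPECIAL_IDS):
    """Drop special-token IDs (hypothetical helper for illustration)."""
    return [i for i in ids if i not in special_ids]


content_ids = strip_special(tokens)
print(len(tokens), "->", len(content_ids))
```

With the real tokenizer, passing `skip_special_tokens=True` to `decode` serves the same purpose and returns the text without `<|begin_of_text|>`.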

#### **Common Special Tokens**
1. **`<|end_of_text|>`**:
   - Marks the end of a sequence.
   - Used to signal that the text input or output should stop.
   
2. **`[CLS]`** (Classification Token):
   - Specific to models like BERT.
   - Used at the start of a sequence for tasks like classification.
   
3. **`[SEP]`** (Separator Token):
   - Separates two segments of text in tasks involving multiple inputs (e.g., question answering).
   
4. **`<pad>`** (Padding Token):
   - Used to pad sequences to a uniform length for batching.
   - Prevents uneven inputs during training or inference.

5. **`<unk>`** (Unknown Token):
   - Replaces words or subwords not found in the tokenizer's vocabulary.
   
6. **Task-Specific Tokens**:
   - Custom tokens for models fine-tuned on specific tasks, such as `<question>` or `<answer>` for question-answering tasks.
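
Of the tokens listed above, padding is the easiest to show in isolation. The sketch below right-pads two sequences to the same length and builds the matching attention mask. `pad_batch` is a hypothetical helper, and the pad ID `128004` is borrowed from the `<|finetune_right_pad_id|>` entry in this notebook's added-vocabulary output purely as an example value.

```python
def pad_batch(sequences, pad_id):
    """Right-pad token-ID lists to equal length and build attention masks.

    Hypothetical helper for illustration; returns (padded_ids, attention_mask).
    """
    max_len = max(len(seq) for seq in sequences)
    padded, mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        padded.append(list(seq) + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        mask.append([1] * len(seq) + [0] * n_pad)
    return padded, mask


PAD_ID = 128004  # assumption: <|finetune_right_pad_id|>, used only as an example
batch = [[128000, 40, 1097], [128000, 40, 1097, 12304, 311]]
padded_ids, attention_mask = pad_batch(batch, PAD_ID)
print(padded_ids)
print(attention_mask)
```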

#### **How Special Tokens Are Used**
1. **Training**:
   - Special tokens help structure the input for various tasks (e.g., sequence classification, translation, text generation).
   - For instance, `<|begin_of_text|>` ensures that the model learns where sequences begin.
   
2. **Inference**:
   - During text generation, the model may use tokens like `<|begin_of_text|>` and `<|end_of_text|>` to manage input-output boundaries.
   
3. **Fine-Tuning**:
   - Custom special tokens can be added for domain-specific tasks to guide the model's attention or outputs.

### Why Special Tokens Are Important

- They provide **structure** to the model's input and output.
- Enable the model to handle **complex tasks** such as:
  - Multi-segment inputs (e.g., context and query).
  - Sequential processing (e.g., beginning and end markers).
  - Differentiating between plain text and task-specific metadata.



In [7]:
# The `batch_decode` method decodes a batch (a list of token-ID sequences) into a list of strings.
# Passing a flat list of IDs, as here, treats every ID as its own sequence,
# so the result shows the text piece behind each individual token.
tokenizer.batch_decode(tokens)

['<|begin_of_text|>',
 'I',
 ' am',
 ' excited',
 ' to',
 ' show',
 ' Token',
 'izers',
 ' in',
 ' action',
 ' to',
 ' my',
 ' L',
 'LM',
 ' engineers']
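
The output has one string per token because each element of the input is decoded as a separate sequence. The toy sketch below mimics that behaviour with a lookup table built from the IDs and pieces printed in this notebook. `toy_decode` and `toy_batch_decode` are hypothetical stand-ins written for illustration, not the library implementation.

```python
ids = [128000, 40, 1097, 12304, 311, 1501, 9857, 12509,
       304, 1957, 311, 856, 445, 11237, 25175]
pieces = ["<|begin_of_text|>", "I", " am", " excited", " to", " show",
          " Token", "izers", " in", " action", " to", " my", " L", "LM",
          " engineers"]

# Lookup table from token ID to text piece, taken from the outputs above.
id_to_piece = dict(zip(ids, pieces))


def toy_decode(sequence):
    """Join the pieces of one sequence of IDs into a single string."""
    return "".join(id_to_piece[i] for i in sequence)


def toy_batch_decode(batch):
    """Decode each element of `batch` as its own sequence."""
    return [toy_decode([item] if isinstance(item, int) else item)
            for item in batch]


print(toy_batch_decode(ids))    # flat list: one string per token
print(toy_batch_decode([ids]))  # list of lists: one string per sequence
```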

In [8]:
# `tokenizer.vocab` holds the full vocabulary; `get_added_vocab()` lists only the added (special) tokens.
tokenizer.get_added_vocab()

{'<|begin_of_text|>': 128000,
 '<|end_of_text|>': 128001,
 '<|reserved_special_token_0|>': 128002,
 '<|reserved_special_token_1|>': 128003,
 '<|finetune_right_pad_id|>': 128004,
 '<|reserved_special_token_2|>': 128005,
 '<|start_header_id|>': 128006,
 '<|end_header_id|>': 128007,
 '<|eom_id|>': 128008,
 '<|eot_id|>': 128009,
 '<|python_tag|>': 128010,
 '<|reserved_special_token_3|>': 128011,
 '<|reserved_special_token_4|>': 128012,
 '<|reserved_special_token_5|>': 128013,
 '<|reserved_special_token_6|>': 128014,
 '<|reserved_special_token_7|>': 128015,
 '<|reserved_special_token_8|>': 128016,
 '<|reserved_special_token_9|>': 128017,
 '<|reserved_special_token_10|>': 128018,
 '<|reserved_special_token_11|>': 128019,
 '<|reserved_special_token_12|>': 128020,
 '<|reserved_special_token_13|>': 128021,
 '<|reserved_special_token_14|>': 128022,
 '<|reserved_special_token_15|>': 128023,
 '<|reserved_special_token_16|>': 128024,
 '<|reserved_special_token_17|>': 128025,
 '<|reserved_special_to

# Instruct variants of models

Many models have a variant that has been trained for chat use.  
These are typically labelled with the word "Instruct" at the end.  
They have been trained to expect prompts in a particular format that includes system, user and assistant messages.  

The utility method `apply_chat_template` converts the familiar list-of-messages format into the right input prompt for this model.

In [9]:
tokenizer = AutoTokenizer.from_pretrained('meta-llama/Meta-Llama-3.1-8B-Instruct', trust_remote_code=True)


In [10]:

messages = [
    {"role": "system", "content": "You are a helpful assistant"},
    {"role": "user", "content": "Tell a light-hearted joke for a room of Data Scientists"}
  ]

prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)

<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023
Today Date: 26 Jul 2024

You are a helpful assistant<|eot_id|><|start_header_id|>user<|end_header_id|>

Tell a light-hearted joke for a room of Data Scientists<|eot_id|><|start_header_id|>assistant<|end_header_id|>




**`tokenizer.apply_chat_template`**:
   - This method applies the model's chat template to format the conversation into a single prompt string.
   - Parameters:
     - **`messages`**: The conversation messages to format.
     - **`tokenize=False`**: Returns the result as plain text instead of token IDs.
     - **`add_generation_prompt=True`**: Appends the opening of an assistant turn so the model knows where to continue. In this case, the trailing `<|start_header_id|>assistant<|end_header_id|>` header is the signal for the model to generate the next part of the conversation.
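
To make the layout concrete, here is a minimal sketch that rebuilds the prompt structure printed above in plain Python. `render_llama3_chat` is a hypothetical function, not the real template: it is simplified and leaves out the date preamble that the actual Llama 3.1 template inserts into the system message.

```python
def render_llama3_chat(messages, add_generation_prompt=True):
    """Simplified, hypothetical re-creation of the Llama 3.1 chat layout.

    Omits the date preamble that the real template inserts into the
    system message.
    """
    prompt = "<|begin_of_text|>"
    for message in messages:
        prompt += (f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
                   f"{message['content']}<|eot_id|>")
    if add_generation_prompt:
        # An open assistant header tells the model it is its turn to write.
        prompt += "<|start_header_id|>assistant<|end_header_id|>\n\n"
    return prompt


messages = [
    {"role": "system", "content": "You are a helpful assistant"},
    {"role": "user", "content": "Tell a light-hearted joke for a room of Data Scientists"},
]
print(render_llama3_chat(messages))
```

Setting `add_generation_prompt=False` in this sketch drops the final assistant header, which is the layout you would want when the last message is already a complete assistant reply.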

# Trying new models

We will now work with 3 models:

- Phi3 from Microsoft
- Qwen2 from Alibaba Cloud
- Starcoder2 from BigCode (ServiceNow + Hugging Face + NVIDIA)

In [11]:
PHI3_MODEL_NAME = "microsoft/Phi-3-mini-4k-instruct"
QWEN2_MODEL_NAME = "Qwen/Qwen2-7B-Instruct"
STARCODER2_MODEL_NAME = "bigcode/starcoder2-3b"

In [12]:
phi3_tokenizer = AutoTokenizer.from_pretrained(PHI3_MODEL_NAME)

text = "I am excited to show Tokenizers in action to my LLM engineers"
print(tokenizer.encode(text))
print()
tokens = phi3_tokenizer.encode(text)
print(phi3_tokenizer.batch_decode(tokens))



[128000, 40, 1097, 12304, 311, 1501, 9857, 12509, 304, 1957, 311, 856, 445, 11237, 25175]

['I', 'am', 'excited', 'to', 'show', 'Token', 'izers', 'in', 'action', 'to', 'my', 'L', 'LM', 'engine', 'ers']


In [13]:
# Tokenization works differently in each model, so it would not make sense to
# use the tokenizer of one model with another model. The `AutoTokenizer` class from
# Hugging Face loads the matching tokenizer for whichever model name it is given.

print(tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True))
print()
print(phi3_tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True))

<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023
Today Date: 26 Jul 2024

You are a helpful assistant<|eot_id|><|start_header_id|>user<|end_header_id|>

Tell a light-hearted joke for a room of Data Scientists<|eot_id|><|start_header_id|>assistant<|end_header_id|>



<|system|>
You are a helpful assistant<|end|>
<|user|>
Tell a light-hearted joke for a room of Data Scientists<|end|>
<|assistant|>



In [14]:
qwen2_tokenizer = AutoTokenizer.from_pretrained(QWEN2_MODEL_NAME)

text = "I am excited to show Tokenizers in action to my LLM engineers"
print(tokenizer.encode(text))
print()
print(phi3_tokenizer.encode(text))
print()
print(qwen2_tokenizer.encode(text))


[128000, 40, 1097, 12304, 311, 1501, 9857, 12509, 304, 1957, 311, 856, 445, 11237, 25175]

[306, 626, 24173, 304, 1510, 25159, 19427, 297, 3158, 304, 590, 365, 26369, 6012, 414]

[40, 1079, 12035, 311, 1473, 9660, 12230, 304, 1917, 311, 847, 444, 10994, 24198]
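
The three ID lists above can be compared directly. The sketch below hard-codes them from this output and counts how many positions carry the same ID in the Llama 3.1 and Qwen2 encodings. The position-by-position comparison assumes both tokenizers split the sentence at the same places, which the equal token counts suggest but do not prove.

```python
llama_ids = [128000, 40, 1097, 12304, 311, 1501, 9857, 12509,
             304, 1957, 311, 856, 445, 11237, 25175]
phi3_ids = [306, 626, 24173, 304, 1510, 25159, 19427, 297,
            3158, 304, 590, 365, 26369, 6012, 414]
qwen2_ids = [40, 1079, 12035, 311, 1473, 9660, 12230, 304,
             1917, 311, 847, 444, 10994, 24198]

for name, ids in [("Llama 3.1", llama_ids), ("Phi3", phi3_ids), ("Qwen2", qwen2_ids)]:
    print(f"{name}: {len(ids)} tokens")

# Drop Llama's leading <|begin_of_text|> ID so positions line up with Qwen2.
llama_body = llama_ids[1:]
same_position = sum(a == b for a, b in zip(llama_body, qwen2_ids))
print(f"Llama 3.1 and Qwen2 share {same_position} of {len(qwen2_ids)} IDs by position")
```

Even where a few IDs happen to coincide, most do not, so IDs produced by one tokenizer are meaningless to a model trained with another.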


In [15]:
print(tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True))
print()
print(phi3_tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True))
print()
print(qwen2_tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True))

<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023
Today Date: 26 Jul 2024

You are a helpful assistant<|eot_id|><|start_header_id|>user<|end_header_id|>

Tell a light-hearted joke for a room of Data Scientists<|eot_id|><|start_header_id|>assistant<|end_header_id|>



<|system|>
You are a helpful assistant<|end|>
<|user|>
Tell a light-hearted joke for a room of Data Scientists<|end|>
<|assistant|>


<|im_start|>system
You are a helpful assistant<|im_end|>
<|im_start|>user
Tell a light-hearted joke for a room of Data Scientists<|im_end|>
<|im_start|>assistant
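
The Phi3 and Qwen2 layouts printed above are simple enough to rebuild by hand, which makes the differences between chat formats easy to see. Both functions below are hypothetical sketches that reproduce only what this output shows; the real templates shipped with each tokenizer may handle cases (such as a missing system message) that these do not.

```python
def render_phi3_chat(messages, add_generation_prompt=True):
    """Hypothetical sketch of the Phi3 layout shown above."""
    prompt = "".join(f"<|{m['role']}|>\n{m['content']}<|end|>\n" for m in messages)
    if add_generation_prompt:
        prompt += "<|assistant|>\n"
    return prompt


def render_chatml(messages, add_generation_prompt=True):
    """Hypothetical sketch of the ChatML-style layout printed for Qwen2."""
    prompt = "".join(f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n"
                     for m in messages)
    if add_generation_prompt:
        prompt += "<|im_start|>assistant\n"
    return prompt


messages = [
    {"role": "system", "content": "You are a helpful assistant"},
    {"role": "user", "content": "Tell a light-hearted joke for a room of Data Scientists"},
]
print(render_phi3_chat(messages))
print(render_chatml(messages))
```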



In [16]:
# starcoder2 is a model for coding, so it makes sense that its tokens
# are geared towards code

starcoder2_tokenizer = AutoTokenizer.from_pretrained(STARCODER2_MODEL_NAME, trust_remote_code=True)
code = """
def hello_world(person):
  print("Hello", person)
"""
tokens = starcoder2_tokenizer.encode(code)
for token in tokens:
  print(f"{token}={starcoder2_tokenizer.decode(token)}")


222=

610=def
17966= hello
100=_
5879=world
45=(
6427=person
731=):
353=
 
1489= print
459=("
8302=Hello
411=",
4944= person
46=)
222=
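
Pieces such as `):` and `("` show that the vocabulary contains multi-character entries that are common in source code. As a rough illustration, the toy sketch below uses only the pieces printed above as its vocabulary and splits the same snippet by greedy longest-match. This is a swapped-in, simplified technique and not the learned merge procedure a real tokenizer applies; `greedy_tokenize` is a hypothetical function.

```python
code = '\ndef hello_world(person):\n  print("Hello", person)\n'

# Pieces copied from the StarCoder2 output above; "\n " is newline + one space.
vocab = {"\n", "def", " hello", "_", "world", "(", "person", "):", "\n ",
         " print", '("', "Hello", '",', " person", ")"}


def greedy_tokenize(text, vocab):
    """Split `text` by always taking the longest vocabulary entry that matches.

    A toy stand-in, not the merge procedure real tokenizers use.
    """
    longest = max(len(piece) for piece in vocab)
    pieces, pos = [], 0
    while pos < len(text):
        for size in range(min(longest, len(text) - pos), 0, -1):
            candidate = text[pos:pos + size]
            if candidate in vocab:
                pieces.append(candidate)
                pos += size
                break
        else:
            raise ValueError(f"no vocabulary entry matches at position {pos}")
    return pieces


pieces = greedy_tokenize(code, vocab)
print(len(pieces), pieces)
```

For this snippet the greedy split happens to reproduce the same 16 pieces as the output above, but that agreement should not be expected in general.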

