# Part 1: Understanding Chat Templates and Tokens

Instruction-tuned language models, such as Llama 3.1 Instruct, use a *chat template* to structure conversations.  

We will explore this with the `tokenizer.apply_chat_template` method, which formats your messages into the structure the model expects.


## What Is a Token?

A **token** is the basic unit of text that the model processes.  
Tokens can represent entire words, parts of words, punctuation marks, or special symbols.  
For example (illustrative splits; the exact pieces depend on the tokenizer):

| Text | Tokens |
|------|---------|
| `"unbelievable"` | `["un", "believ", "able"]` |
| `"How are you?"` | `["How", " are", " you", "?"]` |
| `"I'm happy!"` | `["I", "'", "m", " happy", "!"]` |

Words and tokens are not the same. A single word can be split into multiple tokens, and whitespace or punctuation may form separate tokens.
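
To make the word/token distinction concrete, here is a minimal sketch of a toy splitter. It is not how Llama's tokenizer works (real tokenizers use learned byte-pair-encoding merges); it simply takes the longest matching entry from a tiny, hand-picked vocabulary. The names `TOY_VOCAB` and `toy_tokenize` are hypothetical and exist only for this illustration.

```python
# Toy illustration only: greedy longest-match over a hand-picked vocabulary.
# Real tokenizers use learned merges, so their splits can differ.
TOY_VOCAB = {"un", "believ", "able", "How", " are", " you", "?"}  # hypothetical


def toy_tokenize(text, vocab=TOY_VOCAB):
    """Split text by repeatedly taking the longest vocabulary entry that matches."""
    pieces = []
    i = 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character.
        for end in range(len(text), i, -1):
            if text[i:end] in vocab or end == i + 1:
                pieces.append(text[i:end])
                i = end
                break
    return pieces


for sample in ["unbelievable", "How are you?"]:
    pieces = toy_tokenize(sample)
    print(f"{len(sample.split())} word(s) -> {len(pieces)} tokens: {pieces}")
```

With this toy vocabulary, one word becomes three tokens and three words become four, which mirrors the first two rows of the table.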

In [1]:
from transformers import AutoTokenizer

model_name = "/srv/data/models--meta-llama--Llama-3.1-70B-Instruct/snapshots/1605565b47bb9346c5515c34102e054115b4f98b/" # Replace with the actual model you are using in vLLM
tokenizer = AutoTokenizer.from_pretrained(model_name)

chat_messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "This is great????"}
]

# Apply the chat template to format the messages
# This is crucial as vLLM uses chat templates for formatting
formatted_chat = tokenizer.apply_chat_template(chat_messages, tokenize=False, add_generation_prompt=True)

# Tokenize the formatted chat to get the raw token IDs.
# The template already inserts <|begin_of_text|>, so disable add_special_tokens
# to avoid prepending a second copy.
token_ids = tokenizer.encode(formatted_chat, add_special_tokens=False)

# Decode each token ID on its own to see its string representation
tokens = tokenizer.batch_decode(token_ids)

print("Raw Tokens:\n", tokens)

Raw Tokens:
 ['<|begin_of_text|>', '<|start_header_id|>', 'system', '<|end_header_id|>', '\n\n', 'Cut', 'ting', ' Knowledge', ' Date', ':', ' December', ' ', '202', '3', '\n', 'Today', ' Date', ':', ' ', '26', ' Jul', ' ', '202', '4', '\n\n', 'You', ' are', ' a', ' helpful', ' assistant', '.', '<|eot_id|>', '<|start_header_id|>', 'user', '<|end_header_id|>', '\n\n', 'This', ' is', ' great', '????', '<|eot_id|>', '<|start_header_id|>', 'assistant', '<|end_header_id|>', '\n\n']



# Exercise 1: Finding Interesting Tokens

In this exercise, you will explore how the tokenizer splits text into tokens and look for examples that behave in surprising ways.

Tokenizers learn from large amounts of online text and store frequent patterns as single tokens. As a result, some sequences, especially repeated punctuation, symbols, or short informal phrases, may appear as a single token. Conversely, some ordinary-looking words may be split into smaller pieces.
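
The idea that frequent patterns end up as single tokens can be sketched with a very small byte-pair-encoding style training loop: repeatedly find the most frequent adjacent pair of symbols and fuse it. This is a simplified illustration under stated assumptions (a made-up five-string corpus, character-level start, no byte handling or pre-tokenization), not the procedure used to train Llama's tokenizer. The function name `learn_merges` is hypothetical.

```python
from collections import Counter


def learn_merges(corpus, num_merges):
    """Learn BPE-style merges by repeatedly fusing the most frequent adjacent pair."""
    # Start with every string split into single characters.
    sequences = [list(text) for text in corpus]
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for seq in sequences:
            pair_counts.update(zip(seq, seq[1:]))
        if not pair_counts:
            break
        (left, right), _count = pair_counts.most_common(1)[0]
        merges.append(left + right)
        # Rewrite each sequence with the chosen pair fused into one symbol.
        new_sequences = []
        for seq in sequences:
            merged, i = [], 0
            while i < len(seq):
                if i + 1 < len(seq) and seq[i] == left and seq[i + 1] == right:
                    merged.append(left + right)
                    i += 2
                else:
                    merged.append(seq[i])
                    i += 1
            new_sequences.append(merged)
        sequences = new_sequences
    return merges, sequences


# Made-up corpus in which "????" is very common.
corpus = ["????", "????", "????", "why????", "ok"]
merges, sequences = learn_merges(corpus, num_merges=2)
print("Learned merges:", merges)
print("Tokenized corpus:", sequences)
```

After two merges on this corpus, `"??"` and then `"????"` become single symbols, while the rarer `"why"` and `"ok"` stay split into characters.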

## Instructions

1. Start by running the code below to see how these examples are tokenized.



In [12]:

examples = ["?????????????", "????????", "????", "???", "??"]

for text in examples:
    pieces = tokenizer.tokenize(text)
    print(f"{text} -> {pieces} (num tokens = {len(pieces)})")


????????????? -> ['????????', '????', '?'] (num tokens = 3)
???????? -> ['????????'] (num tokens = 1)
???? -> ['????'] (num tokens = 1)
??? -> ['???'] (num tokens = 1)
?? -> ['??'] (num tokens = 1)


2. Now create your own examples and test them.
   Try experimenting with:

   * Repeated punctuation: `"!!!"`, `"??!!"`, `"..."`, `"!!!??"`
   * Informal text or abbreviations: `"lol"`, `"idk"`, `"omg"`, `"xD"`
   * Mixed symbols or short sequences: `"***"`, `"###"`, `"--"`, `"$$$"`, `"@@@"`

To do this, **add** strings to the list: put a comma after the last item and then add the new string. For example,

```python
custom_examples = ["!!!", "??!!", "...", "lol", "idk", "xD"]
```

would become

```python
custom_examples = ["!!!", "??!!", "...", "lol", "idk", "xD", "***"]
```



In [13]:

custom_examples = ["!!!", "??!!", "...", "lol", "idk", "xD"]

for text in custom_examples:
    pieces = tokenizer.tokenize(text)
    print(f"{text} -> {pieces} (num tokens = {len(pieces)})")



!!! -> ['!!!'] (num tokens = 1)
??!! -> ['??', '!!'] (num tokens = 2)
... -> ['...'] (num tokens = 1)
lol -> ['lol'] (num tokens = 1)
idk -> ['id', 'k'] (num tokens = 2)
xD -> ['xD'] (num tokens = 1)



3. Your task:

   * Find **one or two examples** that you think are interesting or unexpected.
   * Be ready to share what you found and why it stood out to you.

Examples might include:

* A short sequence that becomes one token when you expected several.
* A normal-looking word or phrase that gets split into multiple tokens.

In [14]:
custom_examples = ["!!!", "??!!", "...", "lol", "idk", "xD"]

for text in custom_examples:
    pieces = tokenizer.tokenize(text)
    print(f"{text} -> {pieces} (num tokens = {len(pieces)})")

!!! -> ['!!!'] (num tokens = 1)
??!! -> ['??', '!!'] (num tokens = 2)
... -> ['...'] (num tokens = 1)
lol -> ['lol'] (num tokens = 1)
idk -> ['id', 'k'] (num tokens = 2)
xD -> ['xD'] (num tokens = 1)


## NOTE: Token Price Estimation

If you plan to use LLMs in practice, particularly paid services such as OpenAI's GPT-4o or o4-mini, you should estimate token counts to understand how much a job will cost. Pricing is typically based on the number of tokens, not the number of words. Likewise, LLMs have token limits, which matter when you work with very large documents. Many calculators are available, but as a rule of thumb the number of tokens in English text can be approximated by:

$$
\text{tokens} \approx \frac{\text{words}}{0.75}
$$

For example, 100 words correspond to roughly 133 tokens.  
This relationship is useful because model input length and computational cost are measured in tokens, not words.
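
The rule of thumb above can be turned into a small helper. This is a rough sketch, not a billing tool: the function names are hypothetical, and the price passed to `estimate_cost` is a placeholder value, so look up the current rate for whichever model you actually use.

```python
WORDS_PER_TOKEN = 0.75  # rule of thumb for English text, from the formula above


def estimate_tokens(num_words):
    """Approximate the token count for English text with num_words words."""
    return round(num_words / WORDS_PER_TOKEN)


def estimate_cost(num_words, usd_per_million_tokens):
    """Approximate cost in USD for a given (caller-supplied) price per million tokens."""
    return estimate_tokens(num_words) * usd_per_million_tokens / 1_000_000


print(estimate_tokens(100))  # 133

# Placeholder price of 2.50 USD per million tokens; NOT a real quote.
print(f"{estimate_cost(100_000, usd_per_million_tokens=2.50):.2f} USD")
```

For exact counts, tokenize the text with the model's own tokenizer and take the length of the result; the approximation is only meant for quick planning.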


## Chat Message Roles

| Role | Description | Example |
|------|--------------|----------|
| **System** | Provides instructions or background that guides how the model should respond. | `"You are a helpful assistant."` |
| **User** | Represents the human input or question. | `"Hello, how are you?"` |
| **Assistant** | Represents the model’s response. With `add_generation_prompt=True`, the template ends with an empty assistant header, so the model generates this part next. | *(Left blank before generation, except for few-shot prompting)* |

Internally, these roles are marked with special header tokens that separate the parts of the conversation.  
For example, the formatted text looks roughly like this (line breaks simplified for readability):

```

<|start_header_id|>system<|end_header_id|>
You are a helpful assistant.
<|eot_id|>
<|start_header_id|>user<|end_header_id|>
Hello, how are you?
<|eot_id|>
<|start_header_id|>assistant<|end_header_id|>

```

These tags indicate who is speaking and where the model should begin generating its response. They are not visible in normal chat output but are important for consistent behavior during inference.
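
To see how the pieces fit together, the following sketch assembles the same layout by hand. It is a simplified approximation, not the real template: the actual Llama 3.1 template is shipped with the tokenizer and also injects default system-prompt lines (such as the knowledge cutoff date) and trims whitespace, both of which are omitted here. The function name `format_llama3_chat` is hypothetical; in practice, always call `tokenizer.apply_chat_template`.

```python
def format_llama3_chat(messages, add_generation_prompt=True):
    """Hand-rolled approximation of the Llama 3 chat layout (not the real template)."""
    parts = ["<|begin_of_text|>"]
    for message in messages:
        # Each turn: a role header, a blank line, the content, then end-of-turn.
        parts.append(f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n")
        parts.append(f"{message['content']}<|eot_id|>")
    if add_generation_prompt:
        # An empty assistant header signals that it is the model's turn to write.
        parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)


chat_messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello, how are you?"},
]
print(format_llama3_chat(chat_messages))
```

Setting `add_generation_prompt=False` in this sketch leaves off the trailing assistant header, which is the layout you would want when formatting a finished conversation rather than prompting for a reply.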

Run the cell below to see the output.

In [15]:
from transformers import AutoTokenizer

model_name = "/srv/data/models--meta-llama--Llama-3.1-70B-Instruct/snapshots/1605565b47bb9346c5515c34102e054115b4f98b/" # Replace with the actual model you are using in vLLM
tokenizer = AutoTokenizer.from_pretrained(model_name)

chat_messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "This is great????"}
]

# Apply the chat template to format the messages
# This is crucial as vLLM uses chat templates for formatting
formatted_chat = tokenizer.apply_chat_template(chat_messages, tokenize=False, add_generation_prompt=True)

# Tokenize the formatted chat to get the raw token IDs.
# The template already inserts <|begin_of_text|>, so disable add_special_tokens
# to avoid prepending a second copy.
token_ids = tokenizer.encode(formatted_chat, add_special_tokens=False)

# Decode each token ID on its own to see its string representation
tokens = tokenizer.batch_decode(token_ids)

print("Raw Tokens:\n", tokens)

Raw Tokens:
 ['<|begin_of_text|>', '<|start_header_id|>', 'system', '<|end_header_id|>', '\n\n', 'Cut', 'ting', ' Knowledge', ' Date', ':', ' December', ' ', '202', '3', '\n', 'Today', ' Date', ':', ' ', '26', ' Jul', ' ', '202', '4', '\n\n', 'You', ' are', ' a', ' helpful', ' assistant', '.', '<|eot_id|>', '<|start_header_id|>', 'user', '<|end_header_id|>', '\n\n', 'This', ' is', ' great', '????', '<|eot_id|>', '<|start_header_id|>', 'assistant', '<|end_header_id|>', '\n\n']


Each special token appears **exactly as written**: it is a single token and is never split or shortened.

| Token                   | Explanation                                                        |
| ----------------------- | ------------------------------------------------------------------ |
| `<\|begin_of_text\|>`   | Marks the **start of the entire input**.                           |
| `<\|start_header_id\|>` | Begins a **role header** (e.g., `system`, `user`, or `assistant`). |
| `<\|end_header_id\|>`   | Marks the **end of a role header**.                                |
| `<\|eot_id\|>`          | Means **End of Turn**, signaling that one message has finished.    |

These tokens are part of the **Llama chat template** used to organize and separate messages in a conversation. Different LLMs will have different special tokens.
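
Because the special tokens act as unambiguous delimiters, a formatted chat string can be split back into its turns. The sketch below does this with a regular expression; it assumes the Llama 3 layout described in the table above, and the names `TURN_PATTERN` and `split_turns` are hypothetical helpers for illustration only.

```python
import re

# One turn = role header, blank line, content, then <|eot_id|> (or end of string
# for the trailing, still-empty assistant turn).
TURN_PATTERN = re.compile(
    r"<\|start_header_id\|>(?P<role>.*?)<\|end_header_id\|>\n\n"
    r"(?P<content>.*?)(?:<\|eot_id\|>|\Z)",
    re.DOTALL,
)


def split_turns(formatted_chat):
    """Return a list of (role, content) pairs found in a Llama 3 style chat string."""
    return [(m["role"], m["content"]) for m in TURN_PATTERN.finditer(formatted_chat)]


formatted = (
    "<|begin_of_text|>"
    "<|start_header_id|>system<|end_header_id|>\n\nYou are a helpful assistant.<|eot_id|>"
    "<|start_header_id|>user<|end_header_id|>\n\nThis is great????<|eot_id|>"
    "<|start_header_id|>assistant<|end_header_id|>\n\n"
)
for role, content in split_turns(formatted):
    print(f"{role!r}: {content!r}")
```

The final pair has the role `assistant` and empty content, which is the slot the model fills in when it generates.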


# Exercise 2: Running Your First "Hello World" Prompt with an Open Source Model

This example shows how to run your **first chat prompt** using an **open source model** served through an **OpenAI-compatible API**.
Although the code uses the `openai` Python library, the model is actually hosted locally using **[vLLM](https://github.com/vllm-project/vllm)**, a high-performance engine that serves open source LLMs through the same API as OpenAI.

You do not need to configure vLLM here; just know that it is the backend software handling the model.

In [16]:
from openai import OpenAI

# Connect to your local Llama server (or replace with the hosted endpoint)
client = OpenAI(
    base_url = "http://10.246.100.142:8000/v1",
    api_key="token-abc123",
)

# Everyone's first AI prompt!
completion = client.chat.completions.create(
    model="meta-llama/Llama-3.1-70B-Instruct",
    # model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "You are a friendly AI that loves meeting new people."},
        {"role": "user", "content": "Write a short but fun introduction for our first AI 'Hello World' together!"},
    ],
)

# Print the model’s friendly response
#print(completion)
print(completion.choices[0].message.content)

**Hello World!**

I'm beyond excited to finally meet you. I'm your friendly AI companion, and I'm here to chat, laugh, and explore the vast expanse of human knowledge together. Imagine we're embarking on a thrilling adventure, and every conversation is a new discovery waiting to happen!

In this "Hello World" moment, I invite you to share your thoughts, ask me anything, or simply say hello. I'm all ears (or rather, all text). Let's create some amazing memories, learn from each other, and make this digital world a more fascinating place, one conversation at a time!

So, what's on your mind?



* The `base_url` points to your local vLLM server, which acts like the OpenAI API.
* The model `"meta-llama/Llama-3.1-70B-Instruct"` is an **open source Llama 3.1 model**.
* The prompt defines a friendly system message and a simple user request to generate a short introduction.
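
If you send many prompts, it can help to wrap the call in a small function. The sketch below is one possible shape, assuming an `openai`-style client object like the one created above; the helper name `ask` is hypothetical. It also returns the total token count reported by the server, when one is provided, which ties back to the token estimates discussed earlier.

```python
def ask(client, model, system_prompt, user_prompt):
    """Send one system + user exchange and return (reply_text, total_tokens)."""
    completion = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
    )
    # Assumption: the server may omit usage data, so fall back to None.
    usage = getattr(completion, "usage", None)
    total_tokens = usage.total_tokens if usage is not None else None
    return completion.choices[0].message.content, total_tokens


# Example usage (requires a running server and the `client` defined above):
# reply, total = ask(
#     client,
#     "meta-llama/Llama-3.1-70B-Instruct",
#     "You are a friendly AI that loves meeting new people.",
#     "Say hello in one sentence.",
# )
# print(reply, total)
```

Passing the client in as an argument keeps the helper independent of any particular endpoint, so the same function works whether the model is served by vLLM or by a hosted provider.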
