# Exploring Chat Templates with SmolLM2

This notebook demonstrates how to use chat templates with the `SmolLM2` model. Chat templates help structure interactions between users and AI models, ensuring consistent and contextually appropriate responses.

# Importing HuggingFace and logging in:

In [1]:
# Authenticate to Hugging Face
from huggingface_hub import login, whoami  # huggingface_hub is used to interact with the Hugging Face platform

login()  # log in to Hugging Face
# whoami()


# Importing necessary libraries:

In [2]:
# transformers is used to load and run ML models such as LLMs, including LLaMA
 # AutoModelForCausalLM loads a causal LM, which generates text token by token
 # AutoTokenizer loads the tokenizer
from transformers import AutoModelForCausalLM, AutoTokenizer
# trl is used to fine-tune models, e.g. with reinforcement learning from human feedback (RLHF)
 # setup_chat_format modifies the model and tokenizer so the model can handle chat-like conversations
from trl import setup_chat_format
# PyTorch is used for running neural networks (NNs)
import torch

## SmolLM2 Chat Template

Let's explore how to use a chat template with the `SmolLM2` model. We'll define a simple conversation and apply the chat template.

# Setting device:
To make CUDA available, activate the CUDA environment and run the notebook with a CUDA-enabled kernel.

In [3]:
device = (
    "cuda"
    if torch.cuda.is_available()
    else "cpu"
)
print("using", device)

using cuda


# Loading the model onto the device, loading its tokenizer, then getting both ready for conversational tasks:
`setup_chat_format()` is a function in trl which sets up a model and tokenizer for conversational AI tasks. This function:
- Adds special tokens to the tokenizer, e.g. <|im_start|> and <|im_end|>, to indicate the start and end of each message in a conversation.
- Resizes the model's embedding layer to accommodate the new tokens.
- Sets the chat_template of the tokenizer, which is used to format the input data into a chat-like format.
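
The first two steps in the list above can be pictured with a toy example. This is only a sketch, not TRL's implementation: it uses a plain dictionary as a stand-in vocabulary and a list of lists as a stand-in embedding table, and the helper name `add_special_tokens_toy` is hypothetical. New rows are zero-initialized purely for simplicity, which is an assumption; real libraries typically initialize new embedding rows differently.

```python
def add_special_tokens_toy(vocab, embeddings, new_tokens):
    """Add tokens to a toy vocab and grow the toy embedding table to match."""
    dim = len(embeddings[0])  # embedding width, taken from an existing row
    for token in new_tokens:
        if token in vocab:
            continue  # already known: nothing to add or resize
        vocab[token] = len(vocab)  # assign the next free id
        embeddings.append([0.0] * dim)  # one new row per new token (zero init is an assumption)
    return vocab, embeddings


# Toy vocabulary of three tokens, each with a 4-dimensional embedding
vocab = {"hello": 0, "world": 1, "!": 2}
embeddings = [[0.1] * 4 for _ in vocab]

vocab, embeddings = add_special_tokens_toy(
    vocab, embeddings, ["<|im_start|>", "<|im_end|>"]
)
print("vocab size:", len(vocab), "| embedding rows:", len(embeddings))
```

The point of the sketch is the invariant: after new tokens are added, the embedding table must have exactly one row per vocabulary entry, which is why the embedding layer has to be resized.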

In [4]:
# Declaring which model will be used
model_name = "HuggingFaceTB/SmolLM2-135M"

# Loads this model, which is a causal language model, into the device being used
model = AutoModelForCausalLM.from_pretrained(
    pretrained_model_name_or_path = model_name ).to(device)

# loads the tokenizer associated with the loaded model 
tokenizer = AutoTokenizer.from_pretrained(pretrained_model_name_or_path = model_name)

# Modifies the model and tokenizer to work in a format suitable for chat-style interactions
model, tokenizer = setup_chat_format(model=model, tokenizer=tokenizer)

# Providing the language model with more context on how conversational dialogue should flow:

In [5]:
# Define messages (list) for SmolLM2
messages = [ # list containing two dictionaries which represent messages in the convo
    # user message
    {"role": "user", "content": "Hello, how are you?"},
    # assistant message
    {"role": "assistant", "content": "I'm doing well, thank you! How can I assist you today?",},
]

# Applying chat template without tokenization:
I.e., printing the formatted messages without tokenizing or decoding

In [6]:
# formats the list of messages using the chat template set on the model's tokenizer
input_text = tokenizer.apply_chat_template(messages, tokenize=False)

# Printing 'messages' formatted, but never tokenized
print("Conversation with template:", input_text)

Conversation with template: <|im_start|>user
Hello, how are you?<|im_end|>
<|im_start|>assistant
I'm doing well, thank you! How can I assist you today?<|im_end|>



# Decoding the conversation after tokenization:
I.e., `messages` is formatted and tokenized. Printing it as text requires decoding the tokens.

Setting `add_generation_prompt` to True appends the tokens that mark the start of a new assistant message.
Note that the conversation is represented as above, but with a further assistant header at the end:
    <|im_start|>assistant
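
To make the effect of `add_generation_prompt` concrete, here is a minimal pure-Python sketch of the formatting seen in the printed output above. It is an approximation for illustration only, not the tokenizer's actual template, and the function name `render_chatml` is hypothetical.

```python
def render_chatml(messages, add_generation_prompt=False):
    """Render role/content messages as ChatML-style text (illustrative sketch)."""
    text = ""
    for message in messages:
        # Each message is wrapped in <|im_start|>ROLE ... <|im_end|>
        text += f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
    if add_generation_prompt:
        # Open a new assistant turn so the model knows it should reply next
        text += "<|im_start|>assistant\n"
    return text


messages = [
    {"role": "user", "content": "Hello, how are you?"},
    {"role": "assistant", "content": "I'm doing well, thank you! How can I assist you today?"},
]

print(render_chatml(messages))
print(render_chatml(messages, add_generation_prompt=True))
```

The only difference between the two printed strings is the trailing assistant header, which is left open (no `<|im_end|>`) because the model is expected to fill it in.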


In [7]:
input_text = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True)

print("tokenized input_text:", input_text, "\n")

# Printing decoded tokenized 'messages'
print("Conversation decoded:", tokenizer.decode(token_ids = input_text))

tokenized input_text: [1, 4093, 198, 19556, 28, 638, 359, 346, 47, 2, 198, 1, 520, 9531, 198, 57, 5248, 2567, 876, 28, 9984, 346, 17, 1073, 416, 339, 4237, 346, 1834, 47, 2, 198, 1, 520, 9531, 198] 

Conversation decoded: <|im_start|>user
Hello, how are you?<|im_end|>
<|im_start|>assistant
I'm doing well, thank you! How can I assist you today?<|im_end|>
<|im_start|>assistant



# Tokenizing the conversation and printing it encoded:

In [8]:
input_text = tokenizer.apply_chat_template(messages, add_generation_prompt=True)

# Printing tokenized 'messages' encoded
print("Conversation tokenized:", input_text)

Conversation tokenized: [1, 4093, 198, 19556, 28, 638, 359, 346, 47, 2, 198, 1, 520, 9531, 198, 57, 5248, 2567, 876, 28, 9984, 346, 17, 1073, 416, 339, 4237, 346, 1834, 47, 2, 198, 1, 520, 9531, 198]
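
Comparing the token ids with the decoded text suggests that id 1 corresponds to `<|im_start|>` and id 2 to `<|im_end|>`. Taking that as an assumption inferred from the output above (not something looked up in the tokenizer), the following sketch splits the id list into turns. The helper name `split_turns` is hypothetical.

```python
# Assumed special-token ids, inferred from the printed output above
IM_START, IM_END = 1, 2


def split_turns(token_ids):
    """Group token ids into turns delimited by the assumed start/end ids."""
    turns = []
    current = None  # ids of the turn being read, or None when between turns
    for tok in token_ids:
        if tok == IM_START:
            current = []
            turns.append({"ids": current, "closed": False})
        elif tok == IM_END and current is not None:
            turns[-1]["closed"] = True
            current = None
        elif current is not None:
            current.append(tok)
        # ids that appear between turns (the newline after <|im_end|>) are skipped
    return turns


token_ids = [1, 4093, 198, 19556, 28, 638, 359, 346, 47, 2, 198,
             1, 520, 9531, 198, 57, 5248, 2567, 876, 28, 9984, 346, 17,
             1073, 416, 339, 4237, 346, 1834, 47, 2, 198,
             1, 520, 9531, 198]

for i, turn in enumerate(split_turns(token_ids), start=1):
    print(f"turn {i}: {len(turn['ids'])} ids, closed={turn['closed']}")
```

Under that assumption, the last turn is the only one left unclosed: it is the generation prompt added by `add_generation_prompt=True`.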


<div style='background-color: lightblue; padding: 10px; border-radius: 5px; margin-bottom: 20px; color:black'>
    <h2 style='margin: 0;color:blue'>Exercise: Process a dataset for SFT</h2>
    <p>Take a dataset from the Hugging Face hub and process it for SFT. </p>
    <p><b>Difficulty Levels</b></p>
    <p>🐢 Convert the `HuggingFaceTB/smoltalk` dataset into chatml format.</p>
    <p>🐕 Convert the `openai/gsm8k` dataset into chatml format.</p>
</div>

In [9]:
from IPython.display import display, HTML

# displays an HTML iframe embedded inside the notebook, showing the contents of
# https://huggingface.co/datasets/HuggingFaceTB/smoltalk/embed/viewer/all
# which is the viewer for the smoltalk dataset. The dataset is
# separated into 2 splits (training and testing) and 14 subsets, coming from 13 sources
display(
    HTML(
    """<iframe
        src="https://huggingface.co/datasets/HuggingFaceTB/smoltalk/embed/viewer/all/train?row=0"
        frameborder="0"
        width="100%"
        height="360px"
    ></iframe>
    """
    )
)



In [10]:
from datasets import load_dataset

# loads the smoltalk dataset of subset "everyday-conversations"
ds = load_dataset("HuggingFaceTB/smoltalk", "everyday-conversations")

def process_dataset(sample):
    # TODO: 🐢 Convert the sample into a chat format
     # the sample is already in chat format, i.e., the roles and content are already formatted
    chat_messages = sample["messages"]  # will be passed to the tokenizer's chat template method

    # TODO: use the tokenizer's method to apply the chat template
     # Tokenizing the (already chat-formatted) sample:
    tokenized_messages = tokenizer.apply_chat_template(chat_messages, tokenize=True, add_generation_prompt=True)

    # overwriting the sample's 'messages' value with the token ids; note that other keys ('full_topic') remain untokenized
    sample["messages"] = tokenized_messages
    return sample

# running process_dataset on every sample in ds:
tokenized_ds = ds.map(process_dataset)

# Printing before and after tokenization of first n entries
n = 3
for i in range(n): 
    print(f"Original Sample {i+1}: {ds['train'][i]}" "\n")
    print(f"Tokenized Sample {i+1}: {tokenized_ds['train'][i]['messages']}")
    print("|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|", "\n")

Original Sample 1: {'full_topic': 'Travel/Vacation destinations/Beach resorts', 'messages': [{'content': 'Hi there', 'role': 'user'}, {'content': 'Hello! How can I help you today?', 'role': 'assistant'}, {'content': "I'm looking for a beach resort for my next vacation. Can you recommend some popular ones?", 'role': 'user'}, {'content': "Some popular beach resorts include Maui in Hawaii, the Maldives, and the Bahamas. They're known for their beautiful beaches and crystal-clear waters.", 'role': 'assistant'}, {'content': 'That sounds great. Are there any resorts in the Caribbean that are good for families?', 'role': 'user'}, {'content': 'Yes, the Turks and Caicos Islands and Barbados are excellent choices for family-friendly resorts in the Caribbean. They offer a range of activities and amenities suitable for all ages.', 'role': 'assistant'}, {'content': "Okay, I'll look into those. Thanks for the recommendations!", 'role': 'user'}, {'content': "You're welcome. I hope you find the perfec

# Additional: Decoding newly tokenized messages:

In [11]:
def decode_dataset(sample):
    tokenized_messages = sample["messages"]
    
    decoded_messages = tokenizer.decode(tokenized_messages)

    sample["decoded_messages"] = decoded_messages
    
    return sample

decoded_ds=tokenized_ds.map(decode_dataset)


n = 3
for i in range(n): 
    print(f"Decoded Tokenized Sample {i+1}: {decoded_ds['train'][i]['decoded_messages']}" "\n")

Decoded Tokenized Sample 1: <|im_start|>user
Hi there<|im_end|>
<|im_start|>assistant
Hello! How can I help you today?<|im_end|>
<|im_start|>user
I'm looking for a beach resort for my next vacation. Can you recommend some popular ones?<|im_end|>
<|im_start|>assistant
Some popular beach resorts include Maui in Hawaii, the Maldives, and the Bahamas. They're known for their beautiful beaches and crystal-clear waters.<|im_end|>
<|im_start|>user
That sounds great. Are there any resorts in the Caribbean that are good for families?<|im_end|>
<|im_start|>assistant
Yes, the Turks and Caicos Islands and Barbados are excellent choices for family-friendly resorts in the Caribbean. They offer a range of activities and amenities suitable for all ages.<|im_end|>
<|im_start|>user
Okay, I'll look into those. Thanks for the recommendations!<|im_end|>
<|im_start|>assistant
You're welcome. I hope you find the perfect resort for your vacation.<|im_end|>
<|im_start|>assistant


Decoded Tokenized Sample 2: <

In [12]:
# displays an HTML iframe embedded inside the notebook, showing the contents of
# https://huggingface.co/datasets/openai/gsm8k/embed/viewer/main/train
# which is the viewer for the openai gsm8k dataset. The dataset is
# separated into 2 splits (training and testing) and 2 subsets
display(
    HTML(
        """<iframe
              src="https://huggingface.co/datasets/openai/gsm8k/embed/viewer/main/train"
              frameborder="0"
              width="100%"
              height="360px"
        ></iframe>
        """
    )
)

Note that the OpenAI GSM8K dataset stores each sample as a question and an answer, both of type string.

In [13]:
ds = load_dataset("openai/gsm8k", "main")

# print(ds['train'][0])

def process_dataset(sample):
    # 1. creating a message format with the role and content:
    messages = [
        {"role": "user", "content": sample["question"]},
        {"role": "assistant","content": sample["answer"]},
    ]

    sample['messages'] = messages
    
    return sample

processed_ds = ds.map(process_dataset)


def tokenize_dataset(sample):
    # TODO: 🐢 Convert the sample into a chat format
     # the sample is now in chat format, i.e., the roles and content were set by process_dataset
    chat_messages = sample["messages"]  # will be passed to the tokenizer's chat template method

    # TODO: use the tokenizer's method to apply the chat template
     # Tokenizing the chat-formatted sample:
    tokenized_messages = tokenizer.apply_chat_template(chat_messages, tokenize=True, add_generation_prompt=True)

    # overwriting the sample's 'messages' value with the token ids; note that other keys ('question', 'answer') remain untokenized
    sample["messages"] = tokenized_messages
    return sample

# running tokenize_dataset on processed_ds:
tokenized_ds = processed_ds.map(tokenize_dataset)


# Printing original dataset, formatted dataset, and tokenized formatted dataset for n entries
n = 3
for i in range(n): 
    print(f"Original Sample {i+1}: {ds['train'][i]}" "\n")
    print(f"Formatted Sample {i+1}: {processed_ds['train'][i]['messages']}" "\n")
    print(f"Tokenized & Formatted Sample {i+1}: {tokenized_ds['train'][i]['messages']}")
    print("|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|", "\n")

Original Sample 1: {'question': 'Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May?', 'answer': 'Natalia sold 48/2 = <<48/2=24>>24 clips in May.\nNatalia sold 48+24 = <<48+24=72>>72 clips altogether in April and May.\n#### 72'}

Formatted Sample 1: [{'content': 'Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May?', 'role': 'user'}, {'content': 'Natalia sold 48/2 = <<48/2=24>>24 clips in May.\nNatalia sold 48+24 = <<48+24=72>>72 clips altogether in April and May.\n#### 72', 'role': 'assistant'}]

Tokenized & Formatted Sample 1: [1, 4093, 198, 62, 6927, 542, 3459, 23026, 288, 216, 36, 40, 282, 874, 2428, 281, 4124, 28, 284, 965, 1041, 3459, 2745, 347, 800, 23026, 281, 2405, 30, 1073, 800, 23026, 1250, 36366, 542, 5948, 13587, 281, 4124, 284, 2405, 47, 2, 198, 1, 520, 9531, 198, 62, 

## Conclusion

This notebook demonstrated how to apply chat templates with the `SmolLM2` model. By structuring interactions with chat templates, we can ensure that AI models provide consistent and contextually relevant responses.

In the exercise, you tried converting a dataset into ChatML format. Luckily, TRL will do this for you, but it's useful to understand what's going on under the hood.