# Unit 1: 0 - Introduction to SmolLM2

**Collaborators**:
* Roberto Rodriguez ([@Cyb3rWard0g](https://x.com/Cyb3rWard0g))

## What is SmolLM2?

[SmolLM2](https://huggingface.co/HuggingFaceTB/SmolLM2-1.7B-Instruct) is a family of compact language models available in three sizes: 135M, 360M, and 1.7B parameters. These models are designed to be efficient while maintaining strong performance on a wide range of tasks. The 1.7B variant excels in instruction following, knowledge retention, reasoning, and mathematics.

### Install Required Libraries

In [None]:
# !pip install transformers torch

### Loading SmolLM2 Efficiently

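The model weights are about **3.42 GB**. To avoid downloading them on every run, we check whether a saved copy exists in a local directory and load from there when it does; otherwise we download the model and save it for next time: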

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
import os

MODEL_NAME = "HuggingFaceTB/SmolLM2-1.7B-Instruct"
MODEL_DIR = "data/smollm2"

def load_model():
    if os.path.exists(MODEL_DIR):
        print("Loading model from local directory.")
        model = AutoModelForCausalLM.from_pretrained(MODEL_DIR)
    else:
        print("Downloading model...")
        model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
        model.save_pretrained(MODEL_DIR)
    return model

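# Pick the best available device: CUDA GPU, then Apple Silicon (MPS), then CPU
if torch.cuda.is_available():
    device = torch.device("cuda")
elif torch.backends.mps.is_available():
    device = torch.device("mps")
else:
    device = torch.device("cpu")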
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = load_model().to(device)

Loading model from local directory.
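
One caveat about the check above: `os.path.exists(MODEL_DIR)` is also true for an empty or half-written directory, in which case loading from it would fail. A slightly more defensive sketch is to look for the `config.json` file that `save_pretrained` writes. The helper name `has_saved_model` is hypothetical and not part of the original notebook, and it does not verify that the weight files themselves are complete.

```python
import tempfile
from pathlib import Path

def has_saved_model(model_dir):
    """Return True if model_dir contains a config.json file.

    Assumption: the presence of config.json is treated as a good-enough
    sign that save_pretrained finished writing to this directory.
    """
    return (Path(model_dir) / "config.json").is_file()

# Quick demonstration using a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    before = has_saved_model(tmp)                 # empty directory
    (Path(tmp) / "config.json").write_text("{}")  # simulate a saved config
    after = has_saved_model(tmp)                  # config.json now present

print(before, after)  # False True
```

In `load_model`, this check could stand in for `os.path.exists(MODEL_DIR)`.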



In [2]:
# Display model architecture
print(model)

LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(49152, 2048, padding_idx=2)
    (layers): ModuleList(
      (0-23): 24 x LlamaDecoderLayer(
        (self_attn): LlamaAttention(
          (q_proj): Linear(in_features=2048, out_features=2048, bias=False)
          (k_proj): Linear(in_features=2048, out_features=2048, bias=False)
          (v_proj): Linear(in_features=2048, out_features=2048, bias=False)
          (o_proj): Linear(in_features=2048, out_features=2048, bias=False)
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear(in_features=2048, out_features=8192, bias=False)
          (up_proj): Linear(in_features=2048, out_features=8192, bias=False)
          (down_proj): Linear(in_features=8192, out_features=2048, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm((2048,), eps=1e-05)
        (post_attention_layernorm): LlamaRMSNorm((2048,), eps=1e-05)
      )
    )
    (norm): LlamaRMSNorm((2048,), eps=1e-05)
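
The layer shapes in this printout are enough to estimate the model's size by hand. The sketch below adds up the weight matrices under two assumptions that the printout does not confirm: that the output head shares its weights with `embed_tokens` (so it adds no parameters), and that weights are stored in 16 bits.

```python
# Shapes read off the printed architecture.
VOCAB, HIDDEN, INTERMEDIATE, LAYERS = 49152, 2048, 8192, 24

embedding = VOCAB * HIDDEN
attention = 4 * HIDDEN * HIDDEN        # q_proj, k_proj, v_proj, o_proj (no bias)
mlp = 3 * HIDDEN * INTERMEDIATE        # gate_proj, up_proj, down_proj (no bias)
norms = 2 * HIDDEN                     # two RMSNorm weight vectors per layer
per_layer = attention + mlp + norms

# Assumption: the output head is tied to embed_tokens and adds nothing extra.
total_params = embedding + LAYERS * per_layer + HIDDEN  # + final RMSNorm

BYTES_PER_PARAM = 2  # assumption: float16/bfloat16 storage
size_gb = total_params * BYTES_PER_PARAM / 1e9

print(f"{total_params:,} parameters")  # 1,711,376,384 parameters
print(f"{size_gb:.2f} GB at 16-bit")   # 3.42 GB at 16-bit
```

Under these assumptions the estimate lines up with the ~3.42 GB figure quoted earlier. Held in float32, the same weights would occupy roughly twice that.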
 

In [38]:
# Show tokenizer metadata
print(f"Model Max Context Length: {tokenizer.model_max_length} tokens")
print(f"Tokenizer Special Tokens: {tokenizer.special_tokens_map}")

Model Max Context Length: 8192 tokens
Tokenizer Special Tokens: {'bos_token': '<|im_start|>', 'eos_token': '<|im_end|>', 'unk_token': '<|endoftext|>', 'pad_token': '<|im_end|>', 'additional_special_tokens': ['<|im_start|>', '<|im_end|>']}


## Interacting with SmolLM2

### Direct Prompting (Raw Input, No Chat Template)
* Sends the raw string `"The capital of France is"` to the model
* Generates only the next tokens in the sequence, with no `<|im_start|>` markers

In [39]:
# Direct Prompt Completion (No Chat Format)
prompt = "The capital of France is"

# Encode raw text input with attention mask
encoded_input = tokenizer(prompt, return_tensors="pt").to(device)
input_ids = encoded_input["input_ids"]
attention_mask = encoded_input["attention_mask"]

# Generate next token predictions
outputs = model.generate(
    input_ids,
    attention_mask=attention_mask,  # Pass the mask explicitly: pad and EOS share the same token here
    max_new_tokens=10,
    eos_token_id=tokenizer.eos_token_id  # Ensures stopping when EOS is reached
)

# Decode output
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)

print("Generated Text (Direct Completion):")
print(generated_text)

Generated Text (Direct Completion):
The capital of France is Paris.
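
The attention mask matters most once several prompts of different lengths are batched together. This tokenizer reuses `<|im_end|>` as its pad token, so the token ID alone cannot tell the model whether a position is padding or a real end of turn; the mask can. Below is a minimal plain-Python sketch of left-padding, the side usually recommended for decoder-only generation. The helper `left_pad` and the token IDs are made up for illustration, and the pad ID of 2 is taken from `padding_idx=2` in the architecture printout.

```python
PAD_ID = 2  # assumption: pad token ID, matching padding_idx=2 in the printout

def left_pad(batch, pad_id=PAD_ID):
    """Left-pad lists of token IDs to equal length and build attention masks."""
    width = max(len(ids) for ids in batch)
    input_ids, attention_mask = [], []
    for ids in batch:
        n_pad = width - len(ids)
        input_ids.append([pad_id] * n_pad + list(ids))
        attention_mask.append([0] * n_pad + [1] * len(ids))  # 0 = ignore, 1 = attend
    return input_ids, attention_mask

# Illustrative token IDs, not real tokenizer output.
batch = [[504, 3575, 282, 4649, 314], [504, 3575, 2]]
padded_ids, mask = left_pad(batch)

print(padded_ids[1])  # [2, 2, 504, 3575, 2]
print(mask[1])        # [0, 0, 1, 1, 1]
```

In the second row the two leading `2`s are masked out as padding, while the trailing `2` is a real token the model attends to. With a real tokenizer, the same layout typically comes from setting `tokenizer.padding_side = "left"` and calling `tokenizer(prompts, padding=True, return_tensors="pt")`.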


### Chat-based Interaction


#### Inspecting the Chat Template

The instruct variant of SmolLM2 was fine-tuned on conversations wrapped in a fixed chat format, so prompts should follow the same structure at inference time. The tokenizer stores this format as a Jinja chat template, which you can inspect:

In [6]:
chat_template = tokenizer.chat_template
print("Chat Template:")
print(chat_template)

Chat Template:
{% for message in messages %}{% if loop.first and messages[0]['role'] != 'system' %}{{ '<|im_start|>system
You are a helpful AI assistant named SmolLM, trained by Hugging Face<|im_end|>
' }}{% endif %}{{'<|im_start|>' + message['role'] + '
' + message['content'] + '<|im_end|>' + '
'}}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant
' }}{% endif %}
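
The Jinja template is dense, so here is a plain-Python sketch of the same logic. It is only a reading aid: `render_chat` is a hypothetical helper, and real code should keep using `tokenizer.apply_chat_template`.

```python
DEFAULT_SYSTEM = "You are a helpful AI assistant named SmolLM, trained by Hugging Face"

def render_chat(messages, add_generation_prompt=False):
    """Mirror the chat template using ordinary string formatting."""
    parts = []
    # A default system turn is injected when the conversation has none.
    if messages and messages[0]["role"] != "system":
        parts.append(f"<|im_start|>system\n{DEFAULT_SYSTEM}<|im_end|>\n")
    # Every message becomes one <|im_start|>role ... <|im_end|> turn.
    for message in messages:
        parts.append(
            f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
        )
    # The generation prompt opens an assistant turn for the model to complete.
    if add_generation_prompt:
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)

messages = [{"role": "user", "content": "The capital of France is"}]
rendered = render_chat(messages, add_generation_prompt=True)
print(rendered)
```

Passing a conversation that already starts with a `system` message skips the default system turn, exactly as the `loop.first` condition in the template does.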


#### Applying Chat Template

* Uses `apply_chat_template`
* Produces structured output with special tokens (`<|im_start|>`, `<|im_end|>`)

In [22]:
# Chat-based interaction using instruction-tuned format
messages = [{"role": "user", "content": "The capital of France is"}]
non_tokenized_chat = tokenizer.apply_chat_template(messages, tokenize=False)
print(non_tokenized_chat)

<|im_start|>system
You are a helpful AI assistant named SmolLM, trained by Hugging Face<|im_end|>
<|im_start|>user
The capital of France is<|im_end|>



The `apply_chat_template` method accepts an `add_generation_prompt` argument. When it is set to `True`, the template appends the tokens that open an assistant turn.

In [25]:
# Chat-based interaction using instruction-tuned format
messages = [{"role": "user", "content": "The capital of France is"}]
non_tokenized_chat = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(non_tokenized_chat)

<|im_start|>system
You are a helpful AI assistant named SmolLM, trained by Hugging Face<|im_end|>
<|im_start|>user
The capital of France is<|im_end|>
<|im_start|>assistant



As you can see, the prompt now ends with `<|im_start|>assistant`. This ensures that the model writes an assistant response rather than doing something unexpected, such as continuing the user’s message.

#### Tokenizing Chat

In [27]:
encoded_input = tokenizer(non_tokenized_chat, return_tensors="pt").to(device)
encoded_input

{'input_ids': tensor([[    1,  9690,   198,  2683,   359,   253,  5356,  5646, 11173,  3365,
          3511,   308, 34519,    28,  7018,   411,   407, 19712,  8182,     2,
           198,     1,  4093,   198,   504,  3575,   282,  4649,   314,     2,
           198,     1,   520,  9531,   198]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}
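
In the IDs above, `1` and `2` appear exactly where the rendered prompt has `<|im_start|>` and `<|im_end|>`. Assuming that mapping (it can be confirmed with `tokenizer.convert_ids_to_tokens`), the short sketch below locates the turn boundaries in the sequence.

```python
# Token IDs copied from the tokenizer output above.
ids = [1, 9690, 198, 2683, 359, 253, 5356, 5646, 11173, 3365,
       3511, 308, 34519, 28, 7018, 411, 407, 19712, 8182, 2,
       198, 1, 4093, 198, 504, 3575, 282, 4649, 314, 2,
       198, 1, 520, 9531, 198]

IM_START, IM_END = 1, 2  # assumption: IDs of <|im_start|> and <|im_end|>

turn_starts = [i for i, t in enumerate(ids) if t == IM_START]
turn_ends = [i for i, t in enumerate(ids) if t == IM_END]

print(len(ids))     # 35
print(turn_starts)  # [0, 21, 31]
print(turn_ends)    # [19, 29]
```

Three turns are opened but only two are closed: the final assistant turn is left open for the model to fill in.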

#### Generating Chat Response

In [None]:
# Defining Input
input_ids = encoded_input["input_ids"]
attention_mask = encoded_input["attention_mask"]  # Explicit attention mask

# Generate response with proper settings
outputs = model.generate(
    input_ids,
    attention_mask=attention_mask,  # Pass the mask explicitly: pad and EOS share the same token here
    max_new_tokens=50,
    eos_token_id=tokenizer.eos_token_id  # Stops at <|im_end|>
)

In [30]:
# Decode and print response
generated_text = tokenizer.decode(outputs[0])
print(generated_text)

<|im_start|>system
You are a helpful AI assistant named SmolLM, trained by Hugging Face<|im_end|>
<|im_start|>user
The capital of France is<|im_end|>
<|im_start|>assistant
The capital of France is Paris.<|im_end|>


### Processing Assistant Output

Once we generate a response from the model, the output contains the entire prompt history, including the system message and user query. However, we are only interested in extracting the assistant’s actual response.

To achieve this, we follow these steps:

1. Count the number of prompt tokens before generation. This allows us to separate the model's output from the original input.
2. Extract only the newly generated tokens after the prompt length.
3. Decode the assistant's output properly without including system/user messages.

**Step 1: Count Number of Prompt Tokens**
Before generating text, we measure the number of input tokens:

In [32]:
count_prompt_tokens = input_ids.shape[1]
count_prompt_tokens

35

**Step 2: Extract Generated Tokens**
After generation, we extract only the newly generated tokens:

In [33]:
generated_tokens = outputs[0, count_prompt_tokens:]
generated_tokens

tensor([ 504, 3575,  282, 4649,  314, 7042,   30,    2])

**Step 3: Decode Assistant Response**
Now, we decode only the assistant’s response:

In [34]:
output = tokenizer.decode(generated_tokens, skip_special_tokens=False)
output

'The capital of France is Paris.<|im_end|>'

To remove `<|im_end|>` from the result, we can also skip special tokens when decoding:

In [35]:
output = tokenizer.decode(generated_tokens, skip_special_tokens=True)
output

'The capital of France is Paris.'
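
An alternative to `skip_special_tokens=True` is to cut the generated IDs at the first end-of-sequence token before decoding, which also discards anything that follows it. This is a sketch on plain lists: `truncate_at_eos` is a hypothetical helper, and the EOS ID of 2 is read off the generated tokens shown in Step 2.

```python
def truncate_at_eos(token_ids, eos_id):
    """Return token_ids up to, but not including, the first eos_id."""
    token_ids = list(token_ids)
    if eos_id in token_ids:
        return token_ids[:token_ids.index(eos_id)]
    return token_ids  # no EOS found, e.g. generation hit max_new_tokens first

generated = [504, 3575, 282, 4649, 314, 7042, 30, 2]  # values from Step 2
trimmed = truncate_at_eos(generated, eos_id=2)
print(trimmed)  # [504, 3575, 282, 4649, 314, 7042, 30]
```

With the tensors used in this notebook, the input would be `generated_tokens.tolist()` and the EOS ID would come from `tokenizer.eos_token_id`.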

## Final Code Block

In [None]:
messages = [{"role": "user", "content": "The capital of France is"}]

# Convert messages into model-compatible format
input_text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# Encode input with attention mask
encoded_input = tokenizer(input_text, return_tensors="pt").to(device)
input_ids = encoded_input["input_ids"]
attention_mask = encoded_input["attention_mask"]
count_prompt_tokens = input_ids.shape[1]  # Save prompt length

# Generate response
outputs = model.generate(
    input_ids, 
    attention_mask=attention_mask,
    max_new_tokens=50,
    eos_token_id=tokenizer.eos_token_id,  # Stop when the end-of-sequence token (<|im_end|>) is generated
)

# Extract only assistant-generated tokens
generated_tokens = outputs[0, count_prompt_tokens:]

# Decode assistant response
output = tokenizer.decode(generated_tokens, skip_special_tokens=True)
print("Assistant Response:", output)

Assistant Response: The capital of France is Paris.
