# Unit 1: 1 - Tokenization and Next Token Prediction with SmolLM2

**Collaborators**:
* Roberto Rodriguez ([@Cyb3rWard0g](https://x.com/Cyb3rWard0g))

## Install Required Libraries

In [None]:
# !pip install transformers torch

## SmolLM2 Tokenization

### Initializing SmolLM2 Tokenizer

In [None]:
from transformers import AutoTokenizer
import torch

MODEL_NAME = "HuggingFaceTB/SmolLM2-1.7B-Instruct"
device = torch.device("cuda" if torch.cuda.is_available() else "mps" if torch.backends.mps.is_available() else "cpu")
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

### Manual Tokenization (Text Without Special Tokens)

In [2]:
text = "The Capital of France is"
tokens = tokenizer.tokenize(text)
input_ids = tokenizer.convert_tokens_to_ids(tokens)

In [3]:
print("Tokens (without special tokens):", tokens)

Tokens (without special tokens): ['The', 'ĠCapital', 'Ġof', 'ĠFrance', 'Ġis']


In [4]:
print("Token IDs (without special tokens):", input_ids)

Token IDs (without special tokens): [504, 14937, 282, 4649, 314]


In [5]:
print("Decoded Text (without special tokens):", tokenizer.decode(input_ids))

Decoded Text (without special tokens): The Capital of France is


#### Byte-Level BPE Tokenization in SmolLM2
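SmolLM2 uses [byte-level Byte Pair Encoding (BPE)](https://huggingface.co/learn/nlp-course/chapter6/5), so a space is encoded as part of the token that follows it and is displayed as the special character `Ġ`. In the example above, `The` has no `Ġ` because the text does not begin with a space, while every later word starts with `Ġ` because a space precedes it. A token that continues a word without a preceding space would not carry the marker. Keeping the space inside the token lets the tokenizer reconstruct the original text exactly when decoding.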
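
To make the `Ġ` marker less mysterious, the sketch below rebuilds a GPT-2-style byte-to-character table and uses it to turn the tokens from the output above back into text. It assumes SmolLM2 follows that same table, which the `Ġ` in the output suggests but this notebook does not confirm. `bytes_to_unicode` and `detokenize` are illustrative helpers written for this example, not `transformers` APIs.

```python
def bytes_to_unicode():
    """GPT-2-style table mapping each byte (0-255) to a printable character."""
    printable = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    byte_values = printable[:]
    code_points = printable[:]
    shift = 0
    for b in range(256):
        if b not in printable:
            # Non-printable bytes (including space) are shifted past U+00FF.
            byte_values.append(b)
            code_points.append(256 + shift)
            shift += 1
    return {b: chr(c) for b, c in zip(byte_values, code_points)}

BYTE_TO_CHAR = bytes_to_unicode()
CHAR_TO_BYTE = {c: b for b, c in BYTE_TO_CHAR.items()}

def detokenize(tokens):
    """Map every token character back to its byte, then decode as UTF-8."""
    raw = bytes(CHAR_TO_BYTE[ch] for tok in tokens for ch in tok)
    return raw.decode("utf-8")

tokens = ["The", "ĠCapital", "Ġof", "ĠFrance", "Ġis"]
space_marker = BYTE_TO_CHAR[ord(" ")]
text = detokenize(tokens)

print(repr(space_marker))  # 'Ġ'
print(text)                # The Capital of France is
```

Under this assumption the space byte (32) lands on code point 288, which is `Ġ`, so stripping the marker and restoring the space is enough to recover the original string.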

### Automatic Tokenization (Chat With Special Tokens)

In [6]:
messages = [{"role": "user", "content": "The Capital of France is"}]
non_tokenized_chat = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
encoded_input = tokenizer(non_tokenized_chat, return_tensors="pt").to(device)

In [7]:
encoded_input

{'input_ids': tensor([[    1,  9690,   198,  2683,   359,   253,  5356,  5646, 11173,  3365,
          3511,   308, 34519,    28,  7018,   411,   407, 19712,  8182,     2,
           198,     1,  4093,   198,   504, 14937,   282,  4649,   314,     2,
           198,     1,   520,  9531,   198]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}

In [8]:
input_ids_tensor = encoded_input["input_ids"]

In [9]:
print("Input IDs (with special tokens):", input_ids_tensor)

Input IDs (with special tokens): tensor([[    1,  9690,   198,  2683,   359,   253,  5356,  5646, 11173,  3365,
          3511,   308, 34519,    28,  7018,   411,   407, 19712,  8182,     2,
           198,     1,  4093,   198,   504, 14937,   282,  4649,   314,     2,
           198,     1,   520,  9531,   198]])
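
As a quick sanity check, the IDs from the manual tokenization earlier should appear unchanged inside the chat-formatted sequence, with the template contributing everything around them. The sketch below copies both ID lists from the outputs above and searches for one inside the other. `find_subsequence` is a helper written for this example.

```python
# IDs copied from the two outputs above.
plain_ids = [504, 14937, 282, 4649, 314]
chat_ids = [
    1, 9690, 198, 2683, 359, 253, 5356, 5646, 11173, 3365,
    3511, 308, 34519, 28, 7018, 411, 407, 19712, 8182, 2,
    198, 1, 4093, 198, 504, 14937, 282, 4649, 314, 2,
    198, 1, 520, 9531, 198,
]

def find_subsequence(haystack, needle):
    """Return the first index where needle occurs in haystack, or -1."""
    n = len(needle)
    for start in range(len(haystack) - n + 1):
        if haystack[start:start + n] == needle:
            return start
    return -1

start = find_subsequence(chat_ids, plain_ids)
added = len(chat_ids) - len(plain_ids)

print("Prompt starts at position:", start)        # 24
print("Tokens added by the template:", added)     # 30
print("Occurrences of ID 1:", chat_ids.count(1))  # 3
print("Occurrences of ID 2:", chat_ids.count(2))  # 2
```

ID 1 appears once per opened turn and ID 2 once per closed turn, which suggests they are the turn delimiters. Decoding the tensor is the reliable way to confirm that.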


In [10]:
print("Decoded Chat (with special tokens):\n", tokenizer.decode(input_ids_tensor[0]))

Decoded Chat (with special tokens):
 <|im_start|>system
You are a helpful AI assistant named SmolLM, trained by Hugging Face<|im_end|>
<|im_start|>user
The Capital of France is<|im_end|>
<|im_start|>assistant
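
The decoded text follows a simple turn-based layout, which can be approximated by hand. The sketch below is a hand-written imitation of the output shown above, not the tokenizer's actual template. `render_chatml` is a hypothetical helper, and prepending the default system message only when none is supplied is an assumption.

```python
DEFAULT_SYSTEM = "You are a helpful AI assistant named SmolLM, trained by Hugging Face"

def render_chatml(messages, add_generation_prompt=True, system=DEFAULT_SYSTEM):
    """Approximate the chat layout shown above (illustrative, not the real template)."""
    # Assumption: a default system turn is added when the caller gives none.
    if not messages or messages[0]["role"] != "system":
        messages = [{"role": "system", "content": system}] + list(messages)
    # Each turn is wrapped as <|im_start|>role\ncontent<|im_end|>\n
    parts = [f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages]
    if add_generation_prompt:
        # Open an assistant turn so the model continues from here.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)

rendered = render_chatml([{"role": "user", "content": "The Capital of France is"}])
print(rendered)
```

The trailing `<|im_start|>assistant` line is what `add_generation_prompt=True` contributes: it leaves the assistant turn open so the next predicted tokens form the reply.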



## Next Token Prediction - Autoregressive Modeling

In [11]:
# Define input prompt
prompt = "The Capital of France is"
input_text = tokenizer(prompt, return_tensors="pt").to(device)  # keep inputs on the same device as the model
input_text

{'input_ids': tensor([[  504, 14937,   282,  4649,   314]]), 'attention_mask': tensor([[1, 1, 1, 1, 1]])}

### Loading SmolLM2 Efficiently

Hugging Face caches downloaded models by default, but saving an explicit copy (**~3.42 GB**) to a project-local directory makes its location predictable. We check whether that directory exists before loading:

In [12]:
from transformers import AutoModelForCausalLM
import os

MODEL_NAME = "HuggingFaceTB/SmolLM2-1.7B-Instruct"
MODEL_DIR = "data/smollm2"

def load_model():
    if os.path.exists(MODEL_DIR):
        print("Loading model from local directory.")
        model = AutoModelForCausalLM.from_pretrained(MODEL_DIR)
    else:
        print("Downloading model...")
        model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
        model.save_pretrained(MODEL_DIR)
    return model

model = load_model().to(device)

Loading model from local directory.


#### Generating Next Token Logits

In [13]:
import torch

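# Run a forward pass without tracking gradients
with torch.no_grad():
    outputs = model(**input_text)
    logits = outputs.logits               # shape: [batch, seq_len, vocab_size]
    next_token_logits = logits[:, -1, :]  # scores for the token that follows the prompt
    next_token_id = next_token_logits.argmax(dim=-1)  # greedy pick for each sequence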

print("Predicted Next Token:", tokenizer.decode(next_token_id))

Predicted Next Token:  Paris
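
The indexing in the cell above is easier to see on a tiny example. The sketch below uses plain nested lists with made-up numbers and a made-up four-word vocabulary to mirror `logits[:, -1, :]` followed by an argmax over the vocabulary. Nothing here comes from the real model.

```python
# Toy logits with shape [batch=1][seq_len=3][vocab=4]; all values are invented.
toy_vocab = ["Paris", "London", "the", "a"]
toy_logits = [[
    [0.1, 0.3, 2.0, 1.5],  # scores for what follows token 1
    [0.2, 0.1, 1.0, 0.9],  # scores for what follows token 2
    [4.0, 1.0, 2.5, 2.0],  # scores for what follows the last token
]]

# Equivalent of logits[:, -1, :]: keep only the last position of each sequence.
last_position = [sequence[-1] for sequence in toy_logits]

# Equivalent of argmax over the vocabulary axis: index of the largest score.
next_ids = [max(range(len(scores)), key=scores.__getitem__) for scores in last_position]
next_tokens = [toy_vocab[i] for i in next_ids]

print(next_ids)     # [0]
print(next_tokens)  # ['Paris']
```

Only the last position matters for next-token prediction, because each position's scores describe the token that would come immediately after it.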


## Understanding Token Scores: Probabilities vs. Logits

When generating text with SmolLM2, the model assigns scores to tokens based on their likelihood of being the next token. However, there are two different ways to interpret these scores:

* Softmax Probabilities (Normalized Likelihoods)
* Raw Logits (Unnormalized Scores)

Both methods provide valuable insights into token selection but serve different purposes.

### Softmax Probabilities: Interpreting Token Likelihood
The first approach applies softmax normalization to convert raw logits into probabilities. These probabilities indicate how likely each token is relative to others at a given step.

In [14]:
import torch.nn.functional as F

def get_top_k_predictions(input_text, model, tokenizer, k=5):
    inputs = tokenizer(input_text, return_tensors="pt").to(device)
    with torch.no_grad():
        outputs = model(**inputs)
    logits = outputs.logits[:, -1, :]  # Get the logits for the last token
    probs = F.softmax(logits, dim=-1)  # Apply softmax to get probabilities
    top_k_probs, top_k_indices = torch.topk(probs, k, dim=-1)
    
    for i in range(k):
        token = tokenizer.decode(top_k_indices[0][i])
        score = top_k_probs[0][i].item()
        print(f"Token: {token}, Score: {score:.4f}")

In [15]:
get_top_k_predictions("The Capital of France is", model, tokenizer)

Token:  Paris, Score: 0.8024
Token:  the, Score: 0.0374
Token:  a, Score: 0.0267
Token:  known, Score: 0.0110
Token:  called, Score: 0.0062


What This Shows:
* The probability distribution of the next possible tokens.
* The highest probability token is the most likely next token.
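* Across the full vocabulary the probabilities sum to 1 because of the softmax transformation; the five shown here are only the largest ones and add up to about 0.88.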
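
For readers who want to see the normalization itself, here is a minimal pure-Python softmax applied to made-up logits for a six-token toy vocabulary. It is a sketch of the standard formula, not the `torch` implementation, and it also sums the five probabilities reported in the output above.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Invented logits for a six-token toy vocabulary.
toy_logits = [5.0, 2.0, 1.5, 0.5, 0.0, -1.0]
probs = softmax(toy_logits)
top_k = sorted(probs, reverse=True)[:3]

print(f"sum over full vocabulary: {sum(probs):.4f}")
print(f"sum over top-3 only:      {sum(top_k):.4f}")

# The five probabilities printed in the notebook output above.
reported = [0.8024, 0.0374, 0.0267, 0.0110, 0.0062]
print(f"sum of reported top-5:    {sum(reported):.4f}")
```

The full distribution sums to 1, while any top-k slice sums to less. The remaining mass is spread over the rest of the vocabulary.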

### Raw Logits: Interpreting Model Confidence
The second approach examines raw logits, which are unnormalized scores produced directly by the model before softmax is applied.

In [16]:
def get_top_k_raw_logits(input_text, model, tokenizer, k=5):
    inputs = tokenizer(input_text, return_tensors="pt").to(device)
    with torch.no_grad():
        outputs = model(**inputs)
    logits = outputs.logits[:, -1, :]  # Get raw logits for the last token
    top_k_logits, top_k_indices = torch.topk(logits, k, dim=-1)
    
    for i in range(k):
        token = tokenizer.decode(top_k_indices[0][i])
        score = top_k_logits[0][i].item()
        print(f"Token: {token}, Logit Score: {score:.4f}")

In [17]:
get_top_k_raw_logits("The Capital of France is", model, tokenizer)

Token:  Paris, Logit Score: 18.4019
Token:  the, Logit Score: 15.3357
Token:  a, Logit Score: 14.9981
Token:  known, Logit Score: 14.1096
Token:  called, Logit Score: 13.5362


What This Shows:
* Raw model outputs before softmax.
* These scores are not probabilities and can have negative values.
* The higher the logit, the more preferred the token (but not in a probabilistic sense).
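
Logits and probabilities are linked through their differences: the gap between two logits fixes the ratio of the corresponding probabilities, since the softmax denominator cancels. The sketch below checks this against the numbers printed in the two outputs above. Small mismatches are expected because the printed values are rounded to four decimals.

```python
import math

# Values copied from the two notebook outputs above.
logits = {" Paris": 18.4019, " the": 15.3357, " a": 14.9981, " known": 14.1096, " called": 13.5362}
probs = {" Paris": 0.8024, " the": 0.0374, " a": 0.0267, " known": 0.0110, " called": 0.0062}

ratios = {}
for token in list(logits)[1:]:
    # p(Paris) / p(token) should equal exp(logit(Paris) - logit(token)).
    from_logits = math.exp(logits[" Paris"] - logits[token])
    from_probs = probs[" Paris"] / probs[token]
    ratios[token] = (from_logits, from_probs)
    print(f"Paris vs{token}: exp(logit gap) = {from_logits:.1f}, prob ratio = {from_probs:.1f}")
```

A gap of about 3 logits between ` Paris` and ` the` therefore corresponds to ` Paris` being roughly 21 times more likely.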

### Softmax Probabilities vs. Logits: Which One to Use?

| Approach | What It Represents | When to Use |
| --- | --- | --- |
| Softmax Probabilities | Normalized likelihood of each token (values between 0 and 1 that sum to 1 over the vocabulary) | When you want to understand how likely each token is relative to the others. |
| Raw Logits | Unnormalized scores before softmax (can be negative and do not sum to 1) | When you want to inspect the model's unnormalized preferences directly, for example to compare score gaps between tokens. |

## Iterative Token Generation

The iterative decoding loop below mimics how SmolLM2 generates text one token at a time. At each step, the model scores every token in the vocabulary, the loop prints the top-k candidates ranked by raw logit, and the highest-ranked token is selected (greedy decoding) and appended to the input. The loop repeats until `<|im_end|>` (the end-of-turn token) appears among the top-k candidates or `max_tokens` tokens have been generated.

In [23]:
import torch

def generate_top_k_tokens(prompt, model, tokenizer, k=5, max_tokens=10):
    """
    Generates text iteratively, displaying the top-k token predictions at each step.
    Stops when <|im_end|> appears among the top-k predictions or max_tokens is reached.

    Args:
        prompt (str): Input text to start generation.
        model: Pretrained language model (SmolLM2).
        tokenizer: Tokenizer corresponding to the model.
        k (int): Number of top tokens to display at each step.
        max_tokens (int): Maximum number of tokens to generate.

    Returns:
        str: The final generated sequence.
    """
    
    # Tokenize input and move to device
    inputs = tokenizer(prompt, return_tensors="pt").to(device)
    generated_tokens = []
    
    for step in range(max_tokens):
        with torch.no_grad():
            outputs = model(**inputs)
        
        # Get raw logits for the last token
        logits = outputs.logits[:, -1, :]
        
        # Get top-k token indices and scores
        top_k_logits, top_k_indices = torch.topk(logits, k, dim=-1)

        print(f"\nStep {step + 1}: Top {k} Token Predictions")
        for i in range(k):
            token_text = tokenizer.decode(top_k_indices[0][i])
            logit_score = top_k_logits[0][i].item()
            print(f"Rank {i+1}: Token = '{token_text}', Logit Score = {logit_score:.4f}")

            # Stop if <|im_end|> appears in the top-k predictions
            if token_text.strip() == "<|im_end|>":
                print("\nStopping Generation: Encountered End-of-Sequence Token (<|im_end|>) in Top-K")
                final_output = prompt + "".join(generated_tokens)
                print("\nFinal Generated Text:", final_output)
                return final_output
        
        # Select the top token
        top_token_id = top_k_indices[0, 0].unsqueeze(0)  # Take the highest ranked token
        top_token_text = tokenizer.decode(top_token_id)

        # Append selected token to the output
        generated_tokens.append(top_token_text)
        
        # Append the new token to the input sequence
        inputs = {
            "input_ids": torch.cat([inputs["input_ids"], top_token_id.unsqueeze(0)], dim=1),
            "attention_mask": torch.cat([inputs["attention_mask"], torch.tensor([[1]]).to(device)], dim=1),
        }
    
    final_output = prompt + "".join(generated_tokens)
    print("\nFinal Generated Text:", final_output)
    return final_output

In [25]:
# Run the iterative token generation
output = generate_top_k_tokens("The Capital of France is", model, tokenizer, k=4, max_tokens=5)


Step 1: Top 4 Token Predictions
Rank 1: Token = ' Paris', Logit Score = 18.4019
Rank 2: Token = ' the', Logit Score = 15.3357
Rank 3: Token = ' a', Logit Score = 14.9981
Rank 4: Token = ' known', Logit Score = 14.1096

Step 2: Top 4 Token Predictions
Rank 1: Token = '.', Logit Score = 17.5881
Rank 2: Token = ',', Logit Score = 17.0111
Rank 3: Token = '."', Logit Score = 16.2235
Rank 4: Token = '.",', Logit Score = 15.6103

Step 3: Top 4 Token Predictions
Rank 1: Token = '
', Logit Score = 13.8837
Rank 2: Token = '<|im_end|>', Logit Score = 13.6790

Stopping Generation: Encountered End-of-Sequence Token (<|im_end|>) in Top-K

Final Generated Text: The Capital of France is Paris.
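
The function above stops as soon as `<|im_end|>` shows up anywhere in the top-k list, even when another token ranks first, as happens in Step 3. A common alternative is to stop only when the end token is the one actually selected. The sketch below shows that variant on a stand-in "model": `TOY_SCORES` and `toy_next_scores` are hypothetical, the scores are invented, and nothing here calls SmolLM2.

```python
EOS = "<|im_end|>"

# Hypothetical stand-in for the model: maps the last token to invented scores.
TOY_SCORES = {
    " is":    {" Paris": 18.4, " the": 15.3, " a": 15.0},
    " Paris": {".": 17.6, ",": 17.0, EOS: 9.0},
    ".":      {EOS: 13.9, "\n": 13.7, " The": 10.0},
}

def toy_next_scores(tokens):
    """Return candidate scores for the next token (only the last token is used)."""
    return TOY_SCORES[tokens[-1]]

def greedy_decode(prompt_tokens, max_tokens=10):
    """Append the highest-scoring token until EOS is selected or the budget runs out."""
    tokens = list(prompt_tokens)
    for _ in range(max_tokens):
        scores = toy_next_scores(tokens)
        best = max(scores, key=scores.get)  # greedy: highest score wins
        if best == EOS:                     # stop only when EOS is the chosen token
            break
        tokens.append(best)
    return "".join(tokens[len(prompt_tokens):])

completion = greedy_decode(["The", " Capital", " of", " France", " is"])
print(repr(completion))  # ' Paris.'
```

With this stopping rule, an end token that merely ranks second, as in Step 3 of the real run, would not end generation. The loop would keep going until the end token wins or `max_tokens` is reached.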
