# Unit 1: 2 - Beam Search with GPT2

**Collaborators**:
* Roberto Rodriguez ([@Cyb3rWard0g](https://x.com/Cyb3rWard0g))

## Install Required Libraries

In [None]:
# !pip install transformers torch

## Loading GPT2 Efficiently

To avoid downloading the model every time (**~548 MB**), we first check if it exists locally before loading:

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
import os

MODEL_NAME = "openai-community/gpt2"
MODEL_DIR = "data/gpt2"

def load_model():
    if os.path.exists(MODEL_DIR):
        print("Loading model from local directory.")
        model = AutoModelForCausalLM.from_pretrained(MODEL_DIR)
    else:
        print("Downloading model...")
        model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
        model.save_pretrained(MODEL_DIR)
    return model

device = torch.device("cuda" if torch.cuda.is_available() else "mps" if torch.backends.mps.is_available() else "cpu")
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = load_model().to(device)

Downloading model...



## Next Token Prediction: The Core of LLMs

Large Language Models (LLMs) are **autoregressive**, meaning they generate text **one token at a time** based on previous context. 

### How Next Token Prediction Works:
- The model takes an input sequence of tokens.
- It produces a score (logit) for every token in the vocabulary, which a softmax turns into a probability distribution over the next token.
- The most likely token (or one sampled from the distribution) is selected and appended to the sequence.
- The process repeats until an **End of Sequence (EOS)** token is produced or a maximum length is reached.
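The loop described above can be sketched without loading any model. The example below is only an illustration: `TOY_LOGITS`, `softmax`, and `greedy_decode` are hypothetical names, and the hand-written score table stands in for a real network that would condition on the whole context rather than only the last token.

```python
import math

# Hypothetical toy "model": maps the last token to raw scores (logits)
# for each possible next token. A real LLM conditions on the full context.
TOY_LOGITS = {
    "<bos>": {"the": 2.0, "a": 1.0, "<eos>": -1.0},
    "the": {"cat": 1.5, "dog": 1.2, "<eos>": -0.5},
    "a": {"cat": 1.0, "dog": 1.1, "<eos>": -0.5},
    "cat": {"sleeps": 1.0, "<eos>": 2.0},
    "dog": {"sleeps": 0.5, "<eos>": 2.5},
    "sleeps": {"<eos>": 3.0},
}


def softmax(logits):
    """Turn a dict of logits into a dict of probabilities that sum to 1."""
    peak = max(logits.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(score - peak) for tok, score in logits.items()}
    total = sum(exps.values())
    return {tok: value / total for tok, value in exps.items()}


def greedy_decode(start="<bos>", max_new_tokens=10):
    """Append the most likely next token until <eos> or the step limit."""
    sequence = [start]
    for _ in range(max_new_tokens):
        probs = softmax(TOY_LOGITS[sequence[-1]])
        next_token = max(probs, key=probs.get)  # greedy choice
        sequence.append(next_token)
        if next_token == "<eos>":
            break
    return sequence


print(greedy_decode())  # ['<bos>', 'the', 'cat', '<eos>']
```

Replacing the `max(...)` line with a random draw weighted by `probs` would turn this greedy loop into sampling.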

In [None]:
prompt = "The capital of France is"
input_ids = tokenizer(prompt, return_tensors="pt")["input_ids"].to(device)

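# Run a forward pass and keep only the logits for the last position,
# which score every vocabulary token as the candidate next token
with torch.no_grad():
    outputs = model(input_ids)
    logits = outputs.logits[:, -1, :]        # shape: (batch_size, vocab_size)
    next_token_id = logits.argmax(dim=-1)    # greedy choice for each batch item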

print("Predicted Next Token:", tokenizer.decode(next_token_id))

Predicted Next Token:  the


## Beam Search: Improving Generated Sequences

While greedy decoding picks the single highest probability token at each step, **beam search** keeps multiple "beams" (hypotheses) and explores various possibilities before finalizing an output.

### How Beam Search Works:
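- Maintains **num_beams** candidate token sequences at every step.
- Expands each sequence with its likely next tokens and keeps only the best-scoring candidates overall.
- Applies a **length penalty**: the cumulative log-probability is divided by the sequence length raised to this exponent, so values above 0 favor longer sequences and values below 0 favor shorter ones.
- Returns the top **num_return_sequences** outputs based on these scores.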
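To make these steps concrete, here is a minimal sketch of beam search over a hand-written probability table. It is a simplified illustration, not the implementation used by `model.generate`: `TOY_PROBS` and `beam_search` are hypothetical names, and the final ranking assumes the score is the cumulative log-probability divided by the number of generated tokens raised to `length_penalty`.

```python
import math

# Hypothetical next-token probabilities, keyed by the last token only.
TOY_PROBS = {
    "<bos>": {"A": 0.6, "B": 0.4},
    "A": {"x": 0.4, "y": 0.35, "<eos>": 0.25},
    "B": {"z": 0.9, "<eos>": 0.1},
    "x": {"<eos>": 1.0},
    "y": {"<eos>": 1.0},
    "z": {"<eos>": 1.0},
}


def beam_search(num_beams=2, max_new_tokens=5, length_penalty=1.0):
    """Return (log_prob, tokens) hypotheses, best length-normalized score first."""
    beams = [(0.0, ["<bos>"])]  # (cumulative log-probability, tokens)
    finished = []
    for _ in range(max_new_tokens):
        # Expand every live beam with every possible next token.
        candidates = []
        for logp, tokens in beams:
            for tok, p in TOY_PROBS[tokens[-1]].items():
                candidates.append((logp + math.log(p), tokens + [tok]))
        # Keep only the num_beams best candidates overall.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = []
        for logp, tokens in candidates[:num_beams]:
            if tokens[-1] == "<eos>":
                finished.append((logp, tokens))
            else:
                beams.append((logp, tokens))
        if not beams:
            break
    finished.extend(beams)  # hypotheses cut off by the step limit

    def score(item):
        logp, tokens = item
        generated = max(len(tokens) - 1, 1)  # exclude <bos>, avoid dividing by 0
        return logp / (generated ** length_penalty)

    return sorted(finished, key=score, reverse=True)


for logp, tokens in beam_search(num_beams=2):
    print(tokens, round(math.exp(logp), 2))
```

With `num_beams=1` the search degenerates to greedy decoding and returns `<bos> A x <eos>` (probability 0.24). With `num_beams=2` it also keeps the initially less likely `B` alive and finds `<bos> B z <eos>` (probability 0.36), which is the benefit beam search is meant to provide.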

In [7]:
def generate_with_beam_search(
    prompt, model, tokenizer, 
    max_new_tokens=12, num_beams=4, length_penalty=1.0, num_return_sequences=3
):
    """
    Generates text using Beam Search Decoding.

    Parameters:
    - prompt (str): The input text to continue.
    - model: The GPT2 causal language model.
    - tokenizer: The tokenizer for GPT2.
    - max_new_tokens (int): Maximum number of new tokens to generate (decoding steps).
    - num_beams (int): Number of beams to use.
    - length_penalty (float): Length penalty (higher promotes longer sequences).
    - num_return_sequences (int): Number of top-ranked sequences to return.

    Returns:
    - List of generated sequences ranked by score.
    """
    # Encode the raw text (input IDs plus attention mask) and move the
    # tensors to the same device as the model
    encoded_input = tokenizer(prompt, return_tensors="pt").to(model.device)
    input_ids = encoded_input["input_ids"]
    attention_mask = encoded_input["attention_mask"]

    outputs = model.generate(
        input_ids,
        attention_mask=attention_mask,
        max_new_tokens=max_new_tokens,
        num_beams=num_beams,
        length_penalty=length_penalty,
        num_return_sequences=num_return_sequences,
        eos_token_id=tokenizer.eos_token_id
    )

    generated_sequences = [tokenizer.decode(output, skip_special_tokens=False) for output in outputs]

    print("\n=== Beam Search Results ===\n")
    for i, seq in enumerate(generated_sequences):
        print(f"Rank {i+1}: {seq}")

    return generated_sequences

In [8]:
# Define input sentence
prompt = "Conclusion: thanks a lot. That's all for today"

# Run beam search decoding
beam_search_results = generate_with_beam_search(
    prompt, 
    model, 
    tokenizer, 
    max_new_tokens=5,        # Number of steps
    num_beams=4,             # Number of beams
    length_penalty=1.0,      # Length penalty
    num_return_sequences=3   # Number of return sequences
)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.



=== Beam Search Results ===

Rank 1: Conclusion: thanks a lot. That's all for today.

Advertisements<|endoftext|>
Rank 2: Conclusion: thanks a lot. That's all for today. I hope you enjoyed
Rank 3: Conclusion: thanks a lot. That's all for today. I'll be back
