<a href="https://colab.research.google.com/github/Lakshmi-Adhikari-AI/LLM-HuggingFace/blob/main/ch2/mod4_handling_multiple_sequences.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Handling multiple sequences (PyTorch)

Install the Transformers, Datasets, and Evaluate libraries to run this notebook.

In [28]:
!pip install datasets evaluate transformers[sentencepiece]



## 🧠 Chapter 2: Handling Multiple Sequences

This notebook explores **batching**, **padding**, and **attention masks** in Hugging Face Transformers.




## 🔁 Batching Sequences

**Batching** is the act of sending **multiple sentences** through the model at once, instead of one-by-one.

- It's **faster** and more efficient (especially on GPUs).
- Most models are **trained and optimized** to handle batches.
- Even a single sentence must be wrapped as a batch: a **list of lists**, which becomes a **2D tensor**.


In [29]:
ids = [1045, 2310, 2023]  # Token IDs for the tokens "i", "ve", "this"

# Convert to batch format
batched_ids = [ids]  # Batch of 1 sentence
print(batched_ids)

batched_ids = [ids, ids]  # Batch of 2 identical sentences
print(batched_ids)


[[1045, 2310, 2023]]
[[1045, 2310, 2023], [1045, 2310, 2023]]


## 🧪 Model Expects Batched Input

The model expects a **2D tensor** input of shape `(batch_size, sequence_length)`, even for a single sentence.

If you pass a 1D tensor built from `[id1, id2, id3]`, the forward pass raises an error because the batch dimension is missing.

✅ Correct: `torch.tensor([[id1, id2, id3]])`  
❌ Wrong: `torch.tensor([id1, id2, id3])`
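
If the IDs are already in a 1D tensor, one way to add the missing batch dimension is `unsqueeze(0)`. The following is a minimal sketch that only inspects shapes and does not call the model; the three IDs are the example values used earlier in this notebook.

```python
import torch

ids = [1045, 2310, 2023]  # a flat list of token IDs (example values)

flat = torch.tensor(ids)      # 1D tensor, shape (3,): no batch dimension
batched = flat.unsqueeze(0)   # 2D tensor, shape (1, 3): a batch of one sequence

print(flat.dim(), tuple(flat.shape))
print(batched.dim(), tuple(batched.shape))

# Wrapping the list before building the tensor gives the same result
same = torch.equal(batched, torch.tensor([ids]))
print(same)
```

Either route works; wrapping the Python list in another list is usually simpler when you start from raw IDs.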


In [30]:
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

sequence = "I've been waiting for a HuggingFace course my whole life."

# Tokenization
tokens = tokenizer.tokenize(sequence)
print(tokens)

# Convert to IDs
token_ids = tokenizer.convert_tokens_to_ids(tokens)
print(token_ids)

# Convert to 2D tensor (batch of 1)
input_ids = torch.tensor([token_ids])
print("Input IDs:", input_ids)
print("Size:", input_ids.size())

# Model inference
output = model(input_ids)
print("Logits:", output.logits)


['i', "'", 've', 'been', 'waiting', 'for', 'a', 'hugging', '##face', 'course', 'my', 'whole', 'life', '.']
[1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012]
Input IDs: tensor([[ 1045,  1005,  2310,  2042,  3403,  2005,  1037, 17662, 12172,  2607,
          2026,  2878,  2166,  1012]])
Size: torch.Size([1, 14])
Logits: tensor([[-2.7276,  2.8789]], grad_fn=<AddmmBackward0>)


## 📏 Padding Sequences

When batching multiple sequences of **different lengths**, you must make them the **same length** using padding.

Why?

- Tensors must be **rectangular**
- Padding adds a **special token** to the shorter sequences
- `tokenizer.pad_token_id` gives the ID of the padding token for your model
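
The padding step can be sketched as a small pure-Python helper. `pad_batch` is a hypothetical name used only for illustration (it is not part of Transformers), and the sketch assumes right-padding with a pad ID of 0; in practice you would pass `tokenizer.pad_token_id`.

```python
def pad_batch(sequences, pad_id):
    """Right-pad every sequence with `pad_id` up to the longest length in the batch."""
    max_len = max(len(seq) for seq in sequences)
    # Build new lists so the caller's sequences are left unchanged
    return [list(seq) + [pad_id] * (max_len - len(seq)) for seq in sequences]


# Assumed pad ID of 0 for this example
padded = pad_batch([[200, 200, 200], [200, 200]], pad_id=0)
print(padded)  # [[200, 200, 200], [200, 200, 0]]
```

Every row now has the same length, so the nested list can be turned into a rectangular tensor.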


In [31]:
# Example: two different length sequences
sequence1_input_ids = [[200, 200, 200]]
sequence2_input_ids = [[200, 200]]

# Use padding for sequence2
batched_input_ids = [
    [200, 200, 200],
    [200, 200, tokenizer.pad_token_id]
]

print("Padding token ID:", tokenizer.pad_token_id)

# Inference
print("Seq 1 logits:", model(torch.tensor(sequence1_input_ids)).logits)
print("Seq 2 logits:", model(torch.tensor(sequence2_input_ids)).logits)  # Reference result for the unpadded sequence 2
print("Batched logits:", model(torch.tensor(batched_input_ids)).logits)  # Row 2 differs from the line above: the model attended to the pad token


Padding token ID: 0
Seq 1 logits: tensor([[ 1.5694, -1.3895]], grad_fn=<AddmmBackward0>)
Seq 2 logits: tensor([[ 0.5803, -0.4125]], grad_fn=<AddmmBackward0>)
Batched logits: tensor([[ 1.5694, -1.3895],
        [ 1.3374, -1.2163]], grad_fn=<AddmmBackward0>)


## 🧠 Attention Masks

**Attention masks** tell the model which tokens to **attend to** (1) and which to **ignore** (0). Their main use is to make the model **ignore padding**.

- Shape of the attention mask = shape of `input_ids`
- 1 → real token
- 0 → pad token
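
To see why a 0 in the mask removes a token's influence, here is a simplified illustration of a masked softmax over attention scores. It is a sketch of the general idea, not DistilBERT's actual implementation: `masked_softmax` is a hypothetical helper, the scores are made-up numbers, and it assumes at least one position is unmasked.

```python
import math


def masked_softmax(scores, mask):
    """Softmax over `scores`, giving zero weight wherever `mask` is 0."""
    # Masked positions are set to -inf, so exp(-inf) contributes exactly 0
    masked = [s if m == 1 else float("-inf") for s, m in zip(scores, mask)]
    peak = max(masked)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]


scores = [2.0, 1.0, 3.0]  # hypothetical raw attention scores for 3 positions

unmasked = masked_softmax(scores, [1, 1, 1])   # every position receives some weight
with_mask = masked_softmax(scores, [1, 1, 0])  # last position (padding) receives weight 0.0

print(unmasked)
print(with_mask)
```

With the mask applied, the remaining weights are the same as if the padded position had never been in the sequence.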


In [32]:
# Define attention mask
attention_mask = [
    [1, 1, 1],     # All real tokens
    [1, 1, 0]      # Last token is padding
]

# With the mask, the model ignores the pad token, so row 2 matches the unpadded sequence 2 logits
outputs = model(torch.tensor(batched_input_ids), attention_mask=torch.tensor(attention_mask))
print("Logits with attention mask:", outputs.logits)


Logits with attention mask: tensor([[ 1.5694, -1.3895],
        [ 0.5803, -0.4125]], grad_fn=<AddmmBackward0>)


## ✏️ Try It Yourself

1. Take these two sentences:
   - `"I've been waiting for a HuggingFace course my whole life."`
   - `"I hate this so much!"`

2. Tokenize and encode both
3. Run each sentence **individually** through the model and note the logits
4. Now:
   - Pad the shorter sequence
   - Create an attention mask
   - Run both as a **batch** with `attention_mask`


In [33]:
# Input sentences
sentence1 = "I've been waiting for a HuggingFace course my whole life."
sentence2 = "I hate this so much!"

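# Encode: unlike tokenize() + convert_tokens_to_ids(), encode() also adds
# the special tokens [CLS] (101) and [SEP] (102), so the logits differ from the earlier cell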
ids1 = tokenizer.encode(sentence1)
ids2 = tokenizer.encode(sentence2)

print("Sentence 1 IDs:", ids1)
print("Sentence 2 IDs:", ids2)

# Inference individually
print("Sentence 1 logits:", model(torch.tensor([ids1])).logits)
print("Sentence 2 logits:", model(torch.tensor([ids2])).logits)

# Pad both sequences to the same length
pad_id = tokenizer.pad_token_id
max_len = max(len(ids1), len(ids2))

ids1_padded = ids1 + [pad_id] * (max_len - len(ids1))
ids2_padded = ids2 + [pad_id] * (max_len - len(ids2))

# Attention masks
attn_mask1 = [1] * len(ids1) + [0] * (max_len - len(ids1))
attn_mask2 = [1] * len(ids2) + [0] * (max_len - len(ids2))

# Batched input and mask
batch_input = torch.tensor([ids1_padded, ids2_padded])
batch_mask = torch.tensor([attn_mask1, attn_mask2])

# Inference
print("Batched logits with attention mask:", model(batch_input, attention_mask=batch_mask).logits)


Sentence 1 IDs: [101, 1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012, 102]
Sentence 2 IDs: [101, 1045, 5223, 2023, 2061, 2172, 999, 102]
Sentence 1 logits: tensor([[-1.5607,  1.6123]], grad_fn=<AddmmBackward0>)
Sentence 2 logits: tensor([[ 4.1692, -3.3464]], grad_fn=<AddmmBackward0>)
Batched logits with attention mask: tensor([[-1.5607,  1.6123],
        [ 4.1692, -3.3464]], grad_fn=<AddmmBackward0>)


## ✅ Summary

- ✅ Transformers expect **batched inputs** (2D tensors)
- ✅ Use **padding** to equalize sequence lengths
- ✅ Use **attention masks** to **ignore padding**
- ✅ Together, padding and attention masks give the **same logits** for individual and batched inference

You're now ready to handle **multiple sequences** effectively using Hugging Face Transformers!
