# Understanding Padding and Attention Masks

This notebook demonstrates how to handle sequences of different lengths using:
- **Padding**: Making all sequences the same length
- **Attention Masks**: Telling the model which tokens are real vs padding

Both are needed to run a language model on a batch of sequences, because a batch must form a single rectangular tensor.

In [None]:
! pip install transformers

In [None]:
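from transformers import AutoTokenizer
from huggingface_hub import login
from google.colab import userdata  # Colab-only helper for reading notebook secrets

# meta-llama/Llama-3.2-1B is a gated model, so log in with a Hugging Face token
HF_TOKEN = userdata.get("HF_TOKEN")
login(HF_TOKEN)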

In [None]:
# Load Llama 3.2 1B tokenizer
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B")

print(f"Tokenizer loaded: {tokenizer.__class__.__name__}")
print(f"Vocabulary size: {len(tokenizer):,} tokens")
print("\nSpecial tokens:")
print(f"  BOS: {tokenizer.bos_token} (ID: {tokenizer.bos_token_id})")
print(f"  EOS: {tokenizer.eos_token} (ID: {tokenizer.eos_token_id})")
# PAD is expected to print as None: this tokenizer has no pad token until one is set below
print(f"  PAD: {tokenizer.pad_token} (ID: {tokenizer.pad_token_id})")

---

## The Problem - Different Length Sequences

Let's create some example sentences of **different lengths** to see why padding is necessary.


In [None]:
# Create sentences of varying lengths
sentences = [
    "Hi!",
    "How are you?",
    "What's the weather like today?",
    "I'm working on a machine learning project using transformers."
]

print("="*80)
print("EXAMPLE SENTENCES (Different Lengths)")
print("="*80)

for i, sentence in enumerate(sentences, 1):
    print(f"\n{i}. \"{sentence}\"")
    print(f"   Length: {len(sentence)} characters")


### Tokenize Without Padding

Let's first tokenize these sentences **without padding** to see what happens:


In [None]:
print("\n" + "="*80)
print("TOKENIZATION WITHOUT PADDING")
print("="*80)

# Tokenize each sentence separately (no padding).
# By default encode() also adds special tokens such as BOS, so they are included in the counts.
for i, sentence in enumerate(sentences, 1):
    tokens = tokenizer.encode(sentence)
    print(f"\nSentence {i}: \"{sentence}\"")
    print(f"  Number of tokens: {len(tokens)}")
    print(f"  Token IDs: {tokens}")

print("\n" + "="*80)
print("❌ PROBLEM: All sequences have different lengths!")
print("   Cannot process as a batch - tensors must have the same shape.")
print("="*80)
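
To make the shape constraint concrete, here is a minimal sketch in plain Python (no tokenizer needed). The token IDs are made up for illustration, and `is_rectangular` is a hypothetical helper, not part of `transformers`. It checks whether a list of sequences could be stacked into a single 2-D tensor.

```python
# Toy token IDs, invented for illustration (not real Llama token IDs)
toy_batch = [
    [1, 17, 5],
    [1, 42, 8, 99, 3],
    [1, 7, 7, 23, 64, 12, 2, 9],
]

def is_rectangular(batch):
    """Return True if every sequence in the batch has the same length."""
    distinct_lengths = {len(seq) for seq in batch}
    return len(distinct_lengths) <= 1

lengths = [len(seq) for seq in toy_batch]
print(f"Sequence lengths: {lengths}")
print(f"Can be stacked into one tensor: {is_rectangular(toy_batch)}")
```

A ragged list like `toy_batch` fails the check, which is the situation padding is meant to resolve.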

---

## The Solution - Padding

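**Padding** adds a special pad token to shorter sequences until every sequence in the batch is as long as the longest one. Llama tokenizers do not ship with a dedicated `<pad>` token, so this notebook reuses the EOS token as the pad token.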
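
As a rough illustration of what padding does, the sketch below pads toy sequences by hand in plain Python. `pad_batch` and the pad ID of `0` are hypothetical choices made for this example only. The real tokenizer handles this internally and pads on the side given by its `padding_side` setting.

```python
def pad_batch(batch, pad_id, side="right"):
    """Pad every sequence to the length of the longest one in the batch."""
    max_len = max(len(seq) for seq in batch)
    padded = []
    for seq in batch:
        filler = [pad_id] * (max_len - len(seq))
        padded.append(seq + filler if side == "right" else filler + seq)
    return padded

toy_batch = [[1, 17, 5], [1, 42, 8, 99, 3]]  # made-up token IDs
PAD_ID = 0  # assumed pad ID for this toy example only

for row in pad_batch(toy_batch, PAD_ID):
    print(row)
```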

Let's tokenize the same sentences **with padding**:


In [None]:
# First, set the pad token (Llama models don't have one by default)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
    print(f"✓ Set pad_token to: {tokenizer.pad_token} (ID: {tokenizer.pad_token_id})\n")
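
One consequence of reusing EOS as the pad token is that token IDs alone can no longer separate padding from a genuine end-of-sequence token. The toy sketch below uses plain Python and made-up IDs (`EOS_ID = 2` is an assumption for illustration, not the real Llama ID). Counting by ID overcounts the padding, while the attention mask gives the intended answer.

```python
EOS_ID = 2       # assumed EOS ID for this toy example
PAD_ID = EOS_ID  # the pad token reuses EOS, as in the cell above

# One sequence that genuinely ends with EOS, followed by two padding positions
toy_ids  = [1, 15, 27, EOS_ID, PAD_ID, PAD_ID]
toy_mask = [1, 1, 1, 1, 0, 0]

padding_by_id = sum(1 for tok in toy_ids if tok == PAD_ID)
padding_by_mask = sum(1 for m in toy_mask if m == 0)

print(f"Padding counted by token ID: {padding_by_id}")   # also counts the real EOS
print(f"Padding counted by mask:     {padding_by_mask}")
```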

In [None]:
print("="*80)
print("TOKENIZATION WITH PADDING")
print("="*80)

# Tokenize with padding
result = tokenizer(
    sentences,
    padding=True,              # Pad every sequence to the longest one in the batch
    truncation=False,          # Don't truncate long sequences
    return_tensors="pt",       # Return PyTorch tensors
    return_attention_mask=True # Return attention mask
)

input_ids = result["input_ids"]
attention_mask = result["attention_mask"]

print("\n✓ All sequences padded to the same length!")
print(f"\nBatch shape: {input_ids.shape}")
print(f"  - {input_ids.shape[0]} sequences (batch size)")
print(f"  - {input_ids.shape[1]} tokens per sequence (length of the longest sequence)")
print("\n" + "="*80)


### Examining the Padded Sequences

Let's look at each sequence in detail to see where padding was added:


In [None]:
print("\n" + "="*80)
print("DETAILED VIEW: INPUT IDS (with padding)")
print("="*80)

for i, (sentence, ids, mask) in enumerate(zip(sentences, input_ids, attention_mask), 1):
    ids_list = ids.tolist()
    mask_list = mask.tolist()

    # Count real tokens vs padding tokens
    num_real_tokens = sum(mask_list)
    num_padding_tokens = len(mask_list) - num_real_tokens

    print(f"\nSequence {i}: \"{sentence}\"")
    print(f"  Real tokens: {num_real_tokens}")
    print(f"  Padding tokens: {num_padding_tokens}")
    print(f"  Input IDs: {ids_list}")

    # Highlight padding tokens (the index below only holds for right-side padding)
    if num_padding_tokens > 0 and tokenizer.padding_side == "right":
        print(f"  └─> Padding starts at position {num_real_tokens}")

print("\n" + "="*80)


---

## Attention Masks - Telling the Model What to Ignore

The **attention mask** is a binary tensor with the same shape as `input_ids`. For each position it tells the model:
- `1` = Real token (pay attention to this)
- `0` = Padding token (ignore this)
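
A mask like this can be derived from the unpadded lengths alone. The sketch below is plain Python with a hypothetical helper, `build_attention_mask`. It assumes right-side padding and is only meant to illustrate the idea, not how `transformers` builds the mask internally.

```python
def build_attention_mask(lengths, max_len=None):
    """Return one row per sequence: 1 for real tokens, 0 for padding (right-padded)."""
    if max_len is None:
        max_len = max(lengths)
    return [[1] * n + [0] * (max_len - n) for n in lengths]

toy_lengths = [3, 5, 8]  # made-up sequence lengths
for row in build_attention_mask(toy_lengths):
    print(row)
```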

Let's examine the attention masks:


In [None]:
print("="*80)
print("ATTENTION MASKS")
print("="*80)

for i, (sentence, mask) in enumerate(zip(sentences, attention_mask), 1):
    mask_list = mask.tolist()

    print(f"\nSequence {i}: \"{sentence}\"")
    print(f"  Attention Mask: {mask_list}")
    print(f"  Legend: 1 = Real token, 0 = Padding token")

    # Visual representation
    print(f"  Visual: ", end="")
    for m in mask_list:
        print("█" if m == 1 else "░", end="")
    print("  (█ = real, ░ = padding)")

print("\n" + "="*80)
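
To see what "ignore" can mean in practice, here is a simplified sketch of one common way a mask is applied to attention scores: masked positions are set to negative infinity before the softmax, so they receive zero weight. It is written in plain Python with made-up scores. `masked_softmax` is a hypothetical helper, it assumes at least one real token per row, and real model implementations differ in detail.

```python
import math

def masked_softmax(scores, mask):
    """Softmax over scores, giving zero weight to positions where mask == 0."""
    masked = [s if m == 1 else float("-inf") for s, m in zip(scores, mask)]
    top = max(masked)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]

toy_scores = [2.0, 1.0, 0.5, 3.0, 3.0]  # made-up attention scores for one query
toy_mask = [1, 1, 1, 0, 0]              # the last two positions are padding

weights = masked_softmax(toy_scores, toy_mask)
print([round(w, 3) for w in weights])
```

Even though the padded positions have the highest raw scores here, they end up with zero weight, and the remaining weights still sum to 1.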


---

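## Complete Matrices

Finally, print both tensors in full. Each has shape `(batch size, longest sequence length)`, and every row of the attention mask lines up position by position with the same row of the input IDs.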


In [None]:
print("INPUT IDS:")
print(input_ids)
print("\nATTENTION MASK:")
print(attention_mask)
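
Summing each row of a mask matrix recovers the number of real tokens per sequence. The sketch below does this in plain Python on a made-up mask (not the output of the cell above), and also reports what fraction of the batch is padding.

```python
# Made-up attention mask for three right-padded sequences
toy_mask = [
    [1, 1, 1, 0, 0, 0],
    [1, 1, 1, 1, 1, 0],
    [1, 1, 1, 1, 1, 1],
]

real_per_row = [sum(row) for row in toy_mask]
total_positions = len(toy_mask) * len(toy_mask[0])
padding_share = 1 - sum(real_per_row) / total_positions

print(f"Real tokens per sequence: {real_per_row}")
print(f"Share of the batch that is padding: {padding_share:.0%}")
```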