# 🔧 Transformers Basics

**Module 01 | Notebook 1 of 2**

In this notebook, you'll learn the fundamental building blocks of working with transformer models using the Hugging Face Transformers library.

## Learning Objectives

By the end of this notebook, you will be able to:
1. Load pre-trained models using `AutoModel` and `AutoTokenizer`
2. Understand the tokenization process
3. Perform model inference
4. Use pipelines for common tasks

---

## 📦 Setup

In [None]:
# Install dependencies quietly into the kernel's environment
%pip install -q transformers datasets torch accelerate
print("✅ Dependencies installed!")

In [None]:
import torch
from transformers import AutoTokenizer, AutoModel, AutoModelForSequenceClassification
from transformers import AutoModelForSeq2SeqLM, AutoModelForCausalLM
import warnings
warnings.filterwarnings('ignore')

# Check device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

---

## 1️⃣ Understanding Tokenization

### What is Tokenization?

Tokenization is the process of converting text into smaller units (tokens) that the model can process. These tokens are then converted to numerical IDs.

```
Text: "Hello, how are you?"
  ↓ Tokenization (an uncased tokenizer also lowercases the text)
Tokens: ["hello", ",", "how", "are", "you", "?"]
  ↓ Convert to IDs
Token IDs: [7592, 1010, 2129, 2024, 2017, 1029]
```
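
The two steps in the diagram can be sketched in plain Python. This is a toy illustration, not the tokenizer's real implementation: `TOY_VOCAB`, `toy_tokenize`, and `toy_encode` are hypothetical names, and the six-entry vocabulary only reuses the IDs from the diagram.

```python
import re

# Hypothetical six-entry vocabulary; the IDs are the ones shown in the diagram.
TOY_VOCAB = {"hello": 7592, ",": 1010, "how": 2129, "are": 2024, "you": 2017, "?": 1029}
UNK_ID = 100  # fallback ID for anything outside the toy vocabulary

def toy_tokenize(text):
    """Lowercase the text, then split it into words and punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text.lower())

def toy_encode(text):
    """Map each token to its ID, falling back to UNK_ID for unknown tokens."""
    return [TOY_VOCAB.get(token, UNK_ID) for token in toy_tokenize(text)]

toy_tokens = toy_tokenize("Hello, how are you?")
toy_ids = toy_encode("Hello, how are you?")
print(toy_tokens)
print(toy_ids)
```

A real tokenizer does the same lookup against a vocabulary of tens of thousands of entries and splits unknown words into subword pieces instead of mapping them straight to an unknown-token ID.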

### Why Tokenization Matters

- Models can only process numbers, not text
- Different models use different tokenization strategies
- Token count affects memory usage and processing time
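
To make the second point concrete: BERT-style tokenizers use WordPiece, which splits a word into known subword pieces by greedy longest-match-first and marks continuation pieces with `##`. The sketch below is a simplified illustration under stated assumptions: `wordpiece_split` and the tiny `toy_vocab` are hypothetical, and real tokenizers add text normalization and a much larger learned vocabulary.

```python
def wordpiece_split(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of one word into subword pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation-piece marker
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits: the whole word is unknown
        pieces.append(match)
        start = end
    return pieces

# Hypothetical vocabulary, chosen only to show the splitting behaviour.
toy_vocab = {"token", "##ization", "##ize", "##s", "un", "##known"}

print(wordpiece_split("tokenization", toy_vocab))
print(wordpiece_split("tokens", toy_vocab))
print(wordpiece_split("xyz", toy_vocab))
```

Because the split depends entirely on the vocabulary, the same sentence can produce different token counts under different tokenizers.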

In [None]:
# Load a tokenizer
model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Simple tokenization example
text = "Hello, how are you doing today?"

# Tokenize
tokens = tokenizer.tokenize(text)
print(f"Original text: {text}")
print(f"Tokens: {tokens}")
print(f"Number of tokens: {len(tokens)}")

In [None]:
# Convert tokens to IDs
token_ids = tokenizer.convert_tokens_to_ids(tokens)
print(f"Token IDs: {token_ids}")

# We can also go back from IDs to tokens
decoded_tokens = tokenizer.convert_ids_to_tokens(token_ids)
print(f"Decoded tokens: {decoded_tokens}")

### The Complete Tokenization Pipeline

In practice, we call the tokenizer directly. Its `__call__` method tokenizes the text, adds special tokens, converts tokens to IDs, and builds the attention mask in one step:

In [None]:
# Complete tokenization with the __call__ method
encoded = tokenizer(text, return_tensors="pt")

print("Encoded outputs:")
print(f"  Keys: {list(encoded.keys())}")
print(f"  input_ids shape: {encoded['input_ids'].shape}")
print(f"  input_ids: {encoded['input_ids']}")
print(f"  attention_mask: {encoded['attention_mask']}")

### Understanding the Outputs

| Field | Description |
|-------|-------------|
| `input_ids` | Token IDs fed to the model, including special tokens |
| `attention_mask` | 1s for real tokens, 0s for padding |
| `token_type_ids` | Segment IDs: 0 for the first sentence, 1 for the second (sentence pairs) |
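
To see how the three fields relate, here is a hand-rolled sketch of the layout BERT uses for a sentence pair: `[CLS] A [SEP] B [SEP]`. It is an illustration only: `encode_pair` is a hypothetical helper, the IDs passed in for the two sentences are placeholders, and 101/102/0 are the `[CLS]`/`[SEP]`/`[PAD]` IDs of `bert-base-uncased`.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # special-token IDs in bert-base-uncased

def encode_pair(ids_a, ids_b, pad_to):
    """Build input_ids, attention_mask and token_type_ids for a sentence pair."""
    input_ids = [CLS_ID] + ids_a + [SEP_ID] + ids_b + [SEP_ID]
    # Segment 0 covers [CLS], sentence A and its [SEP]; segment 1 covers the rest.
    token_type_ids = [0] * (len(ids_a) + 2) + [1] * (len(ids_b) + 1)
    attention_mask = [1] * len(input_ids)

    # Right-pad every field to the requested length.
    n_pad = max(0, pad_to - len(input_ids))
    return {
        "input_ids": input_ids + [PAD_ID] * n_pad,
        "attention_mask": attention_mask + [0] * n_pad,
        "token_type_ids": token_type_ids + [0] * n_pad,
    }

# Placeholder IDs standing in for two already-tokenized sentences.
toy_pair = encode_pair([7592, 2129], [2024, 2017, 1029], pad_to=10)
for name, values in toy_pair.items():
    print(f"{name:15s} {values}")
```

With the real tokenizer, passing two strings (`tokenizer(sentence_a, sentence_b)`) produces the same kind of layout.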

In [None]:
# Visualize the tokenization
print("Token-by-token breakdown:")
print("-" * 40)
for token_id, attention in zip(encoded['input_ids'][0], encoded['attention_mask'][0]):
    token = tokenizer.convert_ids_to_tokens([token_id.item()])[0]
    print(f"ID: {token_id.item():5d} | Attention: {attention.item()} | Token: {token}")

### Special Tokens

BERT-style models use special tokens to mark sequence boundaries, padding, and out-of-vocabulary words:

In [None]:
print("Special tokens:")
print(f"  [CLS] token: {tokenizer.cls_token} (ID: {tokenizer.cls_token_id})")
print(f"  [SEP] token: {tokenizer.sep_token} (ID: {tokenizer.sep_token_id})")
print(f"  [PAD] token: {tokenizer.pad_token} (ID: {tokenizer.pad_token_id})")
print(f"  [UNK] token: {tokenizer.unk_token} (ID: {tokenizer.unk_token_id})")
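
When decoding, special tokens are usually dropped and `##` continuation pieces are glued back onto the preceding token. A rough sketch of that idea in plain Python follows; `toy_decode` is a hypothetical helper, not the library's decoder, and it ignores details such as spacing around punctuation.

```python
SPECIAL_TOKENS = {"[CLS]", "[SEP]", "[PAD]", "[UNK]", "[MASK]"}

def toy_decode(tokens, skip_special_tokens=True):
    """Join WordPiece tokens into a string, merging '##' continuation pieces."""
    words = []
    for token in tokens:
        if skip_special_tokens and token in SPECIAL_TOKENS:
            continue
        if token.startswith("##") and words:
            words[-1] += token[2:]  # glue the piece onto the previous word
        else:
            words.append(token)
    return " ".join(words)

toy_example = ["[CLS]", "token", "##ization", "matters", "[SEP]", "[PAD]", "[PAD]"]
print(toy_decode(toy_example))
print(toy_decode(toy_example, skip_special_tokens=False))
```

The library equivalent is `tokenizer.decode(ids, skip_special_tokens=True)`, which works on IDs rather than token strings.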

---

## 2️⃣ Loading Pre-trained Models

### The Auto Classes

Hugging Face provides `Auto` classes that read the checkpoint's configuration and instantiate the matching model architecture:

| Class | Use Case |
|-------|----------|
| `AutoModel` | Base model without a task head (returns hidden states) |
| `AutoModelForSequenceClassification` | Text classification |
| `AutoModelForSeq2SeqLM` | Sequence-to-sequence (summarization, translation) |
| `AutoModelForCausalLM` | Text generation (GPT-style) |
| `AutoModelForQuestionAnswering` | Extractive QA |
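
Conceptually, an `Auto` class reads the `model_type` field from the checkpoint's configuration and looks up the matching concrete class. The toy registry below sketches that dispatch idea; every name in it (`ToyBert`, `ToyGPT2`, `TOY_REGISTRY`, `toy_auto_from_config`) is hypothetical, and it is not how the library is implemented internally.

```python
class ToyBert:
    def __init__(self, config):
        self.config = config

class ToyGPT2:
    def __init__(self, config):
        self.config = config

# Hypothetical registry: model_type string -> concrete class.
TOY_REGISTRY = {"bert": ToyBert, "gpt2": ToyGPT2}

def toy_auto_from_config(config):
    """Pick the class registered for config['model_type'] and instantiate it."""
    model_type = config["model_type"]
    if model_type not in TOY_REGISTRY:
        raise ValueError(f"Unrecognized model_type: {model_type!r}")
    return TOY_REGISTRY[model_type](config)

toy_model_a = toy_auto_from_config({"model_type": "bert", "hidden_size": 768})
toy_model_b = toy_auto_from_config({"model_type": "gpt2", "hidden_size": 768})
print(type(toy_model_a).__name__, type(toy_model_b).__name__)
```

This is why the same `from_pretrained` call works across checkpoints: the caller names the task, and the configuration names the architecture.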

In [None]:
# Load the base BERT model
model = AutoModel.from_pretrained(model_name)

print(f"Model type: {type(model).__name__}")
print(f"Number of parameters: {sum(p.numel() for p in model.parameters()):,}")

In [None]:
# Get model configuration
config = model.config

print("Model configuration:")
print(f"  Hidden size: {config.hidden_size}")
print(f"  Number of layers: {config.num_hidden_layers}")
print(f"  Number of attention heads: {config.num_attention_heads}")
print(f"  Vocabulary size: {config.vocab_size}")
print(f"  Max position embeddings: {config.max_position_embeddings}")
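
The parameter count printed earlier can be reproduced from these configuration values. The arithmetic below is a sketch that assumes the standard `bert-base-uncased` settings: hidden size 768, 12 layers, vocabulary 30,522, and 512 positions, plus an intermediate size of 3,072 and 2 token types, which the cell above does not print. `count_bert_params` is a hypothetical helper.

```python
def count_bert_params(vocab_size, hidden, layers, max_positions,
                      intermediate, type_vocab=2):
    """Count weights and biases of a BERT-style encoder with a pooler."""
    layer_norm = 2 * hidden  # scale + bias

    # Word, position and token-type embeddings, followed by one LayerNorm.
    embeddings = (vocab_size + max_positions + type_vocab) * hidden + layer_norm

    # Self-attention: query, key, value and output projections, plus LayerNorm.
    attention = 4 * (hidden * hidden + hidden) + layer_norm
    # Feed-forward: up-projection, down-projection, plus LayerNorm.
    feed_forward = (hidden * intermediate + intermediate
                    + intermediate * hidden + hidden
                    + layer_norm)
    per_layer = attention + feed_forward

    pooler = hidden * hidden + hidden
    return embeddings + layers * per_layer + pooler

toy_total = count_bert_params(vocab_size=30522, hidden=768, layers=12,
                              max_positions=512, intermediate=3072)
print(f"{toy_total:,}")
```

Under those assumptions the result is 109,482,240, about 110M parameters, of which roughly a fifth sit in the embedding tables.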

### Model Inference

Let's pass our tokenized text through the model:

In [None]:
# Move model to device
model = model.to(device)

# Prepare inputs
inputs = tokenizer(text, return_tensors="pt").to(device)

# Run inference (no gradient computation needed)
with torch.no_grad():
    outputs = model(**inputs)

print(f"Output keys: {list(outputs.keys())}")
print(f"Last hidden state shape: {outputs.last_hidden_state.shape}")
print(f"  - Batch size: {outputs.last_hidden_state.shape[0]}")
print(f"  - Sequence length: {outputs.last_hidden_state.shape[1]}")
print(f"  - Hidden size: {outputs.last_hidden_state.shape[2]}")

### Understanding the Output

The `last_hidden_state` contains embeddings for each token:

```
Shape: [batch_size, sequence_length, hidden_size]
       [1,          10,              768]
```

Each token is now represented as a 768-dimensional vector that captures its meaning in context.

In [None]:
# Extract the [CLS] token embedding (often used for classification)
cls_embedding = outputs.last_hidden_state[0, 0, :]  # First token of the first sequence in the batch
print(f"CLS embedding shape: {cls_embedding.shape}")
print(f"CLS embedding (first 10 values): {cls_embedding[:10]}")
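
The indexing `[0, 0, :]` is easier to read with a tiny stand-in. The nested list below mimics a `[batch_size, sequence_length, hidden_size]` tensor of shape `[1, 3, 4]` with made-up values; it only illustrates which slice is selected.

```python
# Made-up values: 1 sequence, 3 tokens, hidden size 4.
toy_hidden_state = [
    [
        [0.1, 0.2, 0.3, 0.4],   # position 0: [CLS]
        [1.1, 1.2, 1.3, 1.4],   # position 1: first real token
        [2.1, 2.2, 2.3, 2.4],   # position 2: [SEP]
    ]
]

toy_shape = [
    len(toy_hidden_state),         # batch size
    len(toy_hidden_state[0]),      # sequence length
    len(toy_hidden_state[0][0]),   # hidden size
]
print(toy_shape)

# Equivalent of last_hidden_state[0, 0, :] on a tensor.
toy_cls_embedding = toy_hidden_state[0][0]
print(toy_cls_embedding)
```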

---

## 3️⃣ Task-Specific Models

For a specific task, load the matching model class, which adds a task head on top of the base model:

In [None]:
# Load a classification model
classifier = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased-finetuned-sst-2-english"
).to(device)

classifier_tokenizer = AutoTokenizer.from_pretrained(
    "distilbert-base-uncased-finetuned-sst-2-english"
)

print(f"Number of labels: {classifier.config.num_labels}")
print(f"Label mapping: {classifier.config.id2label}")

In [None]:
# Run classification
test_texts = [
    "I absolutely love this product!",
    "This is the worst experience ever.",
    "It's okay, nothing special."
]

for sample in test_texts:
    inputs = classifier_tokenizer(sample, return_tensors="pt").to(device)

    with torch.no_grad():
        outputs = classifier(**inputs)

    # Convert logits to probabilities and pick the most likely class
    probs = torch.softmax(outputs.logits, dim=-1)
    predicted_class = torch.argmax(probs, dim=-1).item()
    confidence = probs[0, predicted_class].item()

    print(f"Text: {sample}")
    print(f"  → {classifier.config.id2label[predicted_class]} ({confidence:.2%})")
    print()
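
The post-processing inside the loop, softmax followed by argmax, can be written out by hand. The logits below are invented for illustration, and `toy_softmax` is a hypothetical helper; the label mapping is the one this SST-2 checkpoint reports in `id2label`.

```python
import math

toy_id2label = {0: "NEGATIVE", 1: "POSITIVE"}

def toy_softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

toy_logits = [-2.0, 2.0]  # invented scores for one input
toy_probs = toy_softmax(toy_logits)
toy_predicted = max(range(len(toy_probs)), key=toy_probs.__getitem__)  # argmax

print(f"{toy_id2label[toy_predicted]} ({toy_probs[toy_predicted]:.2%})")
```

The wider the gap between the two logits, the closer the winning probability gets to 100%, so a high confidence value reflects the logit margin rather than a calibrated certainty.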

---

## 4️⃣ Using Pipelines

For quick prototyping, use the `pipeline` API which abstracts away tokenization and post-processing:

In [None]:
from transformers import pipeline

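# Sentiment analysis pipeline (model named explicitly instead of relying on the task default)
sentiment_pipeline = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english",
    device=0 if torch.cuda.is_available() else -1
)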

results = sentiment_pipeline(test_texts)

print("Pipeline Results:")
for sample, result in zip(test_texts, results):
    print(sample)
    print(f"  → {result['label']} ({result['score']:.2%})\n")
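
Under the hood, a pipeline chains three stages: preprocess (tokenize), forward (run the model), and postprocess (turn logits into labels and scores). The sketch below mirrors that flow with stand-ins; `ToyPipeline`, its whitespace tokenizer, and its word-list "model" are all hypothetical and exist only to show the structure.

```python
import math

class ToyPipeline:
    """Three-stage flow: preprocess -> forward -> postprocess."""

    POSITIVE_WORDS = {"love", "great", "good"}
    NEGATIVE_WORDS = {"worst", "bad", "hate"}

    def preprocess(self, text):
        # Stand-in for tokenization: lowercase, split on whitespace, trim punctuation.
        return [word.strip(".,!?") for word in text.lower().split()]

    def forward(self, tokens):
        # Stand-in for the model: count cue words to produce two "logits".
        neg = sum(token in self.NEGATIVE_WORDS for token in tokens)
        pos = sum(token in self.POSITIVE_WORDS for token in tokens)
        return [float(neg), float(pos)]

    def postprocess(self, logits):
        # Softmax over the two scores, then report the larger one.
        peak = max(logits)
        exps = [math.exp(x - peak) for x in logits]
        probs = [e / sum(exps) for e in exps]
        index = probs.index(max(probs))
        return {"label": ["NEGATIVE", "POSITIVE"][index], "score": probs[index]}

    def __call__(self, texts):
        return [self.postprocess(self.forward(self.preprocess(t))) for t in texts]

toy_pipeline = ToyPipeline()
toy_results = toy_pipeline(["I absolutely love this product!",
                            "This is the worst experience ever."])
for toy_result in toy_results:
    print(toy_result)
```

The output has the same shape as the real pipeline's: one dictionary per input with a `label` and a `score`.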

In [None]:
# Summarization pipeline
summarizer = pipeline(
    "summarization",
    model="facebook/bart-large-cnn",
    device=0 if torch.cuda.is_available() else -1
)

long_text = """
The Amazon rainforest, also known as Amazonia, is a moist broadleaf tropical rainforest 
in the Amazon biome that covers most of the Amazon basin of South America. This basin 
encompasses 7,000,000 km2 (2,700,000 sq mi), of which 5,500,000 km2 (2,100,000 sq mi) 
are covered by the rainforest. This region includes territory belonging to nine nations 
and 3,344 formally acknowledged indigenous territories. The majority of the forest is 
contained within Brazil, with 60% of the rainforest, followed by Peru with 13%, Colombia 
with 10%, and with minor amounts in Bolivia, Ecuador, French Guiana, Guyana, Suriname, 
and Venezuela.
"""

summary = summarizer(long_text, max_length=50, min_length=20, do_sample=False)
print(f"Original length: {len(long_text.split())} words")
print(f"Summary: {summary[0]['summary_text']}")

### Available Pipelines

| Pipeline | Task |
|----------|------|
| `text-classification` | Sentiment, topic classification (alias: `sentiment-analysis`) |
| `token-classification` | NER, POS tagging |
| `question-answering` | Extractive QA |
| `summarization` | Text summarization |
| `translation` | Machine translation |
| `text-generation` | GPT-style generation |
| `fill-mask` | Masked language modeling |

---

## 5️⃣ Batch Processing

For efficiency, process multiple inputs at once:

In [None]:
# Batch tokenization with padding
texts = [
    "Short text.",
    "This is a medium length sentence.",
    "This is a much longer sentence that contains many more words and tokens."
]

# Tokenize with padding
batch_encoded = tokenizer(
    texts,
    padding=True,           # Pad to longest in batch
    truncation=True,        # Truncate if too long
    max_length=32,          # Maximum length
    return_tensors="pt"     # Return PyTorch tensors
)

print(f"Batch shape: {batch_encoded['input_ids'].shape}")
print("\nInput IDs:")
print(batch_encoded['input_ids'])
print("\nAttention Mask (0 = padding):")
print(batch_encoded['attention_mask'])
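
What `padding=True` and `truncation=True` do to a batch can be sketched without the library. `pad_batch` is a hypothetical helper that works on lists of placeholder IDs: it truncates to `max_length`, pads to the longest remaining sequence, and builds the matching attention mask. Unlike the real tokenizer, it cuts naively and does not keep `[SEP]` at the end of a truncated sequence.

```python
def pad_batch(sequences, max_length, pad_id=0):
    """Truncate to max_length, then right-pad every sequence to the longest one."""
    truncated = [seq[:max_length] for seq in sequences]
    longest = max(len(seq) for seq in truncated)

    input_ids, attention_mask = [], []
    for seq in truncated:
        n_pad = longest - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Placeholder ID sequences of different lengths.
toy_batch = pad_batch(
    [[101, 7, 102], [101, 7, 8, 9, 102], [101, 7, 8, 9, 10, 11, 12, 102]],
    max_length=6,
)
for ids, mask in zip(toy_batch["input_ids"], toy_batch["attention_mask"]):
    print(ids, mask)
```

Every row ends up the same length, which is what allows the batch to be stacked into a single rectangular tensor.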

---

## 🎯 Student Challenge

Now it's your turn! Complete the following exercises:

### Challenge 1: Compare Tokenizers
Load tokenizers for `bert-base-uncased` and `gpt2`, then compare how they tokenize the same sentence.

In [None]:
# TODO: Your code here
# 1. Load both tokenizers
# 2. Tokenize: "The transformer architecture revolutionized natural language processing."
# 3. Print the tokens and token counts for each

# bert_tokenizer = ...
# gpt2_tokenizer = ...

test_sentence = "The transformer architecture revolutionized natural language processing."

# Your solution:


### Challenge 2: Model Size Comparison
Load `distilbert-base-uncased` and `bert-base-uncased`, then compare their parameter counts.

In [None]:
# TODO: Your code here
# 1. Load both models
# 2. Count parameters for each
# 3. Calculate the size reduction percentage

# Your solution:


---

## 📝 Key Takeaways

1. **Tokenization** converts text to numerical IDs that models can process
2. **Auto classes** automatically detect the right model architecture
3. **Task-specific models** add appropriate heads for classification, generation, etc.
4. **Pipelines** provide a high-level API for quick prototyping
5. **Batch processing** with padding improves efficiency

---

## ➡️ Next Steps

Continue to `02_model_architecture.ipynb` to learn about:
- Encoder vs. Decoder architectures
- Attention mechanism visualization
- Memory and compute requirements