# Complete Guide to Transformer Models 🤖

## A Beginner-Friendly Tutorial for CS Students

Welcome! In this tutorial, you'll learn about three major types of transformer models:
1. **Encoder-Only Models** (like BERT)
2. **Decoder-Only Models** (like GPT)
3. **Encoder-Decoder Models** (like T5)

By the end, you'll understand how each works and be able to use them for real tasks!

---

## 📦 Setup: Installing Required Libraries

First, let's install the libraries we'll need. Run this cell once at the beginning.

In [None]:
# Install required libraries
!pip install transformers torch sentencepiece -q

print("✅ All libraries installed successfully!")

## Import Libraries

Now let's import everything we need:

In [None]:
from transformers import (
    BertTokenizer, BertModel, BertForSequenceClassification,
    GPT2Tokenizer, GPT2LMHeadModel,
    T5Tokenizer, T5ForConditionalGeneration,
    pipeline
)
import torch
import warnings
warnings.filterwarnings('ignore')

print("✅ Libraries imported successfully!")
print(f"Using PyTorch version: {torch.__version__}")

---

# Section 1: Encoder-Only Models (BERT) 🔍

## What are Encoder-Only Models?

**Encoder-only models** read and understand text. They're great at:
- Understanding the meaning of sentences
- Classification tasks (spam detection, sentiment analysis)
- Question answering
- Finding similar sentences

**Key Feature:** They can look at the *entire* sentence at once (bidirectional attention).

**Popular Example:** BERT (Bidirectional Encoder Representations from Transformers)

### How BERT Works:
1. Takes in a sentence
2. Converts words to numbers (tokenization)
3. Processes the entire sentence at once
4. Outputs a representation (embedding) that captures the meaning
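
Step 2 (tokenization) can be sketched without any library. The snippet below is a toy greedy longest-match subword tokenizer in the spirit of WordPiece. The vocabulary, the IDs, and the `toy_tokenize` helper are invented for illustration and are not BERT's real vocabulary or implementation.

```python
# Toy vocabulary (hypothetical; the real BERT vocabulary has about 30,000 entries).
VOCAB = {"[CLS]": 0, "[SEP]": 1, "[UNK]": 2, "transform": 3, "##ers": 4,
         "are": 5, "amazing": 6, "!": 7}

def toy_tokenize(text):
    """Greedy longest-match subword split, WordPiece style."""
    tokens = ["[CLS]"]                       # special start token
    for word in text.lower().replace("!", " !").split():
        start, pieces = 0, []
        while start < len(word):
            end, piece = len(word), None
            while end > start:               # try the longest candidate first
                cand = word[start:end]
                if start > 0:
                    cand = "##" + cand       # mark a word-internal piece
                if cand in VOCAB:
                    piece = cand
                    break
                end -= 1
            if piece is None:                # nothing matched: unknown word
                pieces = ["[UNK]"]
                break
            pieces.append(piece)
            start = end
        tokens.extend(pieces)
    tokens.append("[SEP]")                   # special end token
    return tokens

tokens = toy_tokenize("Transformers are amazing!")
ids = [VOCAB[t] for t in tokens]
print(tokens)  # ['[CLS]', 'transform', '##ers', 'are', 'amazing', '!', '[SEP]']
print(ids)     # [0, 3, 4, 5, 6, 7, 1]
```

The real tokenizer works on the same principle, which is why one word can become several tokens.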

Let's see it in action!

## Example 1.1: Loading BERT and Getting Embeddings

In [None]:
# Step 1: Load the BERT tokenizer and model
print("Loading BERT model...")
bert_tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
bert_model = BertModel.from_pretrained('bert-base-uncased')
print("✅ BERT model loaded!\n")

# Step 2: Let's encode some text
text = "Transformers are amazing for natural language processing!"
print(f"Input text: '{text}'\n")

# Step 3: Tokenize (convert text to numbers)
inputs = bert_tokenizer(text, return_tensors='pt', padding=True, truncation=True)
print("Tokenized input IDs:")
print(inputs['input_ids'])
print(f"\nTokens: {bert_tokenizer.convert_ids_to_tokens(inputs['input_ids'][0])}\n")

# Step 4: Get BERT's output
with torch.no_grad():  # We don't need gradients for inference
    outputs = bert_model(**inputs)

# Step 5: Extract the embeddings
# last_hidden_state contains embeddings for each token
embeddings = outputs.last_hidden_state
print(f"Shape of embeddings: {embeddings.shape}")
print(f"This means: [batch_size=1, sequence_length={embeddings.shape[1]}, hidden_size={embeddings.shape[2]}]")
print("\n💡 Each token now has a 768-dimensional vector that captures its meaning in context!")

## Example 1.2: Sentiment Analysis with BERT

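Let's use a BERT-family model to classify whether movie reviews are positive or negative! The pipeline's default checkpoint for this task is DistilBERT (a smaller, distilled version of BERT) fine-tuned for sentiment classification. It only predicts POSITIVE or NEGATIVE, so a neutral review is forced into one of the two labels.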

In [None]:
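# Using a pre-trained sentiment analysis pipeline.
# The model is DistilBERT (a distilled BERT) fine-tuned on SST-2 movie-review sentiment;
# naming it explicitly keeps the results reproducible.
print("Loading sentiment analysis model...")
sentiment_analyzer = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english"
)
print("✅ Model loaded!\n")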

# Test sentences
sentences = [
    "This movie was absolutely fantastic! I loved every minute.",
    "Terrible film. Waste of time and money.",
    "It was okay, nothing special but not bad either.",
    "Best movie I've seen this year! Highly recommend!"
]

print("Analyzing sentiments...\n")
for sentence in sentences:
    result = sentiment_analyzer(sentence)[0]
    print(f"Text: '{sentence}'")
    print(f"Sentiment: {result['label']} (confidence: {result['score']:.4f})")
    print("-" * 80)

## Example 1.3: Understanding Sentence Similarity

BERT can help us understand which sentences are similar in meaning!

In [None]:
from torch.nn.functional import cosine_similarity

def get_sentence_embedding(text):
    """Get the average embedding for a sentence using BERT"""
    inputs = bert_tokenizer(text, return_tensors='pt', padding=True, truncation=True)
    with torch.no_grad():
        outputs = bert_model(**inputs)
    # Average all token embeddings (mean pooling)
    return outputs.last_hidden_state.mean(dim=1)

# Compare three sentences
sentence1 = "The cat sat on the mat."
sentence2 = "A feline rested on the rug."
sentence3 = "Python is a programming language."

# Get embeddings
emb1 = get_sentence_embedding(sentence1)
emb2 = get_sentence_embedding(sentence2)
emb3 = get_sentence_embedding(sentence3)

# Calculate similarities
sim_1_2 = cosine_similarity(emb1, emb2).item()
sim_1_3 = cosine_similarity(emb1, emb3).item()
sim_2_3 = cosine_similarity(emb2, emb3).item()

print("Sentence Similarity Analysis:\n")
print(f"Sentence 1: '{sentence1}'")
print(f"Sentence 2: '{sentence2}'")
print(f"Sentence 3: '{sentence3}'\n")
print(f"Similarity (1 ↔ 2): {sim_1_2:.4f} 👍 (similar meaning, expected highest)")
print(f"Similarity (1 ↔ 3): {sim_1_3:.4f} 👎 (different topics)")
print(f"Similarity (2 ↔ 3): {sim_2_3:.4f} 👎 (different topics)")
print("\n💡 Compare the scores relative to each other: mean-pooled BERT embeddings often give fairly high similarity even for unrelated sentences.")
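
The helper above averages over every position, which is fine when sentences are encoded one at a time. If you batch sentences with padding, the padded positions should be left out of the average. Below is a minimal pure-Python sketch of masked mean pooling and cosine similarity. The names `masked_mean` and `cosine` and the toy 2-dimensional vectors are illustrative assumptions, not part of the `transformers` API.

```python
import math

def masked_mean(token_vectors, attention_mask):
    """Average only the vectors whose mask entry is 1 (real tokens)."""
    dim = len(token_vectors[0])
    total = [0.0] * dim
    count = 0
    for vec, keep in zip(token_vectors, attention_mask):
        if keep:
            total = [t + v for t, v in zip(total, vec)]
            count += 1
    return [t / count for t in total]

def cosine(a, b):
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Two real tokens followed by one padding position (toy values).
vectors = [[1.0, 0.0], [0.0, 1.0], [9.0, 9.0]]
mask = [1, 1, 0]

pooled = masked_mean(vectors, mask)
print(pooled)                        # [0.5, 0.5] (the padding vector is ignored)
print(cosine(pooled, [1.0, 1.0]))    # close to 1.0: same direction
print(cosine([1.0, 0.0], [0.0, 1.0]))  # 0.0: orthogonal vectors
```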

## 🎯 Key Takeaways: Encoder-Only Models

✅ **Best for:** Understanding and analyzing text

✅ **Can see:** The entire input at once (bidirectional)

✅ **Common tasks:**
- Text classification
- Sentiment analysis
- Named entity recognition
- Question answering
- Sentence similarity

❌ **Not good for:** Generating new text

---

# Section 2: Decoder-Only Models (GPT) 📝

## What are Decoder-Only Models?

**Decoder-only models** are designed to generate text. They're great at:
- Writing stories, articles, code
- Continuing text from a prompt
- Conversational AI
- Creative writing

**Key Feature:** They can only look at *previous* words when predicting the next word (unidirectional/causal attention).
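
The difference between bidirectional and causal attention comes down to which positions each token is allowed to see. Here is a small sketch that builds both visibility masks as nested lists. The `attention_mask` helper is a made-up name for illustration; real models apply the same pattern as tensors inside the attention layers.

```python
def attention_mask(seq_len, causal):
    """Row i lists which positions token i may attend to (1 = visible)."""
    return [[1 if (not causal or j <= i) else 0 for j in range(seq_len)]
            for i in range(seq_len)]

encoder_mask = attention_mask(4, causal=False)  # encoder style (BERT): sees everything
decoder_mask = attention_mask(4, causal=True)   # decoder style (GPT): sees only the past

for row in decoder_mask:
    print(row)
# [1, 0, 0, 0]
# [1, 1, 0, 0]
# [1, 1, 1, 0]
# [1, 1, 1, 1]
```

Each row of the causal mask stops at its own position, so token 2 can never peek at token 3.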

**Popular Example:** GPT (Generative Pre-trained Transformer)

### How GPT Works:
1. You give it a starting prompt
2. It predicts the next word based on previous words
3. It adds that word to the sequence
4. Repeats until it generates the desired length
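
The four steps above form a loop, which can be sketched with a toy "model". The `NEXT` table below is a hypothetical stand-in that looks only at the previous token; a real GPT conditions on the whole prefix and scores tens of thousands of tokens at each step.

```python
# Hypothetical next-token probabilities, keyed by the previous token only.
NEXT = {
    "once": {"upon": 1.0},
    "upon": {"a": 1.0},
    "a":    {"time": 0.8, "land": 0.2},
    "time": {"</s>": 1.0},
    "land": {"</s>": 1.0},
}

def greedy_generate(prompt_tokens, max_new_tokens=10):
    """Start from the prompt, predict the next token, append it, repeat."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        # Unknown context: fall back to ending the sequence.
        probs = NEXT.get(tokens[-1], {"</s>": 1.0})
        next_token = max(probs, key=probs.get)  # greedy: take the most likely token
        if next_token == "</s>":                # end-of-sequence token stops generation
            break
        tokens.append(next_token)
    return tokens

print(" ".join(greedy_generate(["once", "upon"])))  # once upon a time
```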

Let's explore!

## Example 2.1: Loading GPT-2 and Generating Text

In [None]:
# Step 1: Load GPT-2 tokenizer and model
print("Loading GPT-2 model...")
gpt_tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
gpt_model = GPT2LMHeadModel.from_pretrained('gpt2')

# GPT-2 needs a padding token
gpt_tokenizer.pad_token = gpt_tokenizer.eos_token
print("✅ GPT-2 model loaded!\n")

# Step 2: Create a prompt
prompt = "Once upon a time, in a land far away,"
print(f"Prompt: '{prompt}'\n")

# Step 3: Tokenize the input
inputs = gpt_tokenizer(prompt, return_tensors='pt')
print(f"Tokenized input: {inputs['input_ids']}")
print(f"Tokens: {gpt_tokenizer.convert_ids_to_tokens(inputs['input_ids'][0])}\n")

# Step 4: Generate text!
print("Generating text...\n")
with torch.no_grad():
    outputs = gpt_model.generate(
        inputs['input_ids'],
        attention_mask=inputs['attention_mask'],  # Marks which tokens are real (avoids a warning)
        max_length=100,  # Maximum total length in tokens (prompt + generated text)
        num_return_sequences=1,  # Number of different completions
        temperature=0.8,  # Randomness of sampling (higher = more varied)
        do_sample=True,  # Use sampling instead of greedy decoding
        top_k=50,  # Sample only from the 50 most likely tokens
        pad_token_id=gpt_tokenizer.eos_token_id
    )

# Step 5: Decode and print the generated text
generated_text = gpt_tokenizer.decode(outputs[0], skip_special_tokens=True)
print("Generated Story:")
print("=" * 80)
print(generated_text)
print("=" * 80)

## Example 2.2: Multiple Generation Strategies

Let's see how different parameters affect text generation!

In [None]:
prompt = "The future of artificial intelligence is"

print(f"Prompt: '{prompt}'\n")
print("=" * 80)

# Strategy 1: Greedy decoding (always picks most likely word)
print("\n1️⃣ GREEDY DECODING (deterministic, safe)")
print("-" * 80)
inputs = gpt_tokenizer(prompt, return_tensors='pt')
with torch.no_grad():
    outputs = gpt_model.generate(
        inputs['input_ids'],
        max_length=50,
        do_sample=False,  # No sampling, always pick most likely
        pad_token_id=gpt_tokenizer.eos_token_id
    )
print(gpt_tokenizer.decode(outputs[0], skip_special_tokens=True))

# Strategy 2: High temperature (more creative/random)
print("\n\n2️⃣ HIGH TEMPERATURE (creative, varied)")
print("-" * 80)
with torch.no_grad():
    outputs = gpt_model.generate(
        inputs['input_ids'],
        max_length=50,
        do_sample=True,
        temperature=1.5,  # High temperature = more random
        pad_token_id=gpt_tokenizer.eos_token_id
    )
print(gpt_tokenizer.decode(outputs[0], skip_special_tokens=True))

# Strategy 3: Low temperature (more focused/deterministic)
print("\n\n3️⃣ LOW TEMPERATURE (focused, consistent)")
print("-" * 80)
with torch.no_grad():
    outputs = gpt_model.generate(
        inputs['input_ids'],
        max_length=50,
        do_sample=True,
        temperature=0.3,  # Low temperature = more deterministic
        pad_token_id=gpt_tokenizer.eos_token_id
    )
print(gpt_tokenizer.decode(outputs[0], skip_special_tokens=True))

print("\n" + "=" * 80)
print("\n💡 Notice how temperature affects creativity and coherence!")
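
To see why temperature changes the output, it helps to look at what it does to the probabilities. The sketch below divides made-up logits by the temperature before a standard softmax. The three logit values are invented for illustration; a real model produces one logit per vocabulary entry.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Divide logits by the temperature, then apply a standard softmax."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)                       # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]  # made-up scores for three candidate tokens

for t in (0.3, 1.0, 1.5):
    probs = softmax_with_temperature(logits, t)
    print(f"temperature={t}: {[round(p, 3) for p in probs]}")
```

At a low temperature almost all of the probability lands on the top token, so sampling behaves nearly like greedy decoding. At a high temperature the distribution flattens and unlikely tokens are picked more often.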

## Example 2.3: Interactive Story Generator

Create your own story beginnings and see what GPT-2 generates!

In [None]:
def generate_story(prompt, max_length=100, temperature=0.8):
    """Generate a story continuation from a prompt"""
    inputs = gpt_tokenizer(prompt, return_tensors='pt')
    
    with torch.no_grad():
        outputs = gpt_model.generate(
            inputs['input_ids'],
            max_length=max_length,
            temperature=temperature,
            do_sample=True,
            top_k=50,
            top_p=0.95,
            pad_token_id=gpt_tokenizer.eos_token_id
        )
    
    return gpt_tokenizer.decode(outputs[0], skip_special_tokens=True)

# Try different story prompts
prompts = [
    "In the year 2150, humans discovered",
    "The mysterious package arrived at midnight, containing",
    "She opened the old book and found"
]

for i, prompt in enumerate(prompts, 1):
    print(f"\n{'='*80}")
    print(f"Story {i}")
    print(f"{'='*80}")
    print(f"\nPrompt: '{prompt}'\n")
    story = generate_story(prompt, max_length=80)
    print(story)

print("\n" + "=" * 80)
print("\nüí° Try changing the prompts above to generate your own stories!")

## 🎯 Key Takeaways: Decoder-Only Models

✅ **Best for:** Generating new text

✅ **Can see:** Only previous words in the sequence (left-to-right)

✅ **Common tasks:**
- Text generation
- Story writing
- Code completion
- Chatbots
- Text completion

❌ **Not ideal for:** Understanding/analyzing text (though modern large GPT models can do this too!)

**Important Parameters:**
- `temperature`: Controls randomness when sampling (must be greater than 0; values near 0 approach greedy decoding, higher = more varied)
- `top_k`: Only consider the k most likely next tokens
- `top_p`: Nucleus sampling - consider the smallest set of most likely tokens whose cumulative probability reaches p
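
Both filters can be sketched in a few lines of plain Python. The helpers `top_k_filter` and `top_p_filter` and the toy distribution are illustrative assumptions, not the library's internals; they show only the idea of trimming the distribution before sampling.

```python
def top_k_filter(probs, k):
    """Keep the k most likely tokens and renormalize."""
    kept = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in kept)
    return {tok: p / total for tok, p in kept}

def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative probability reaches p."""
    kept, cumulative = [], 0.0
    for tok, prob in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept.append((tok, prob))
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(prob for _, prob in kept)
    return {tok: prob / total for tok, prob in kept}

probs = {"cat": 0.5, "dog": 0.3, "fish": 0.15, "rock": 0.05}  # toy distribution

print(sorted(top_k_filter(probs, 2)))    # ['cat', 'dog']
print(sorted(top_p_filter(probs, 0.9)))  # ['cat', 'dog', 'fish']
```

Top-k always keeps a fixed number of tokens. Top-p keeps more tokens when the model is uncertain and fewer when it is confident.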

---

# Section 3: Encoder-Decoder Models (T5) 🔄

## What are Encoder-Decoder Models?

**Encoder-decoder models** combine the best of both worlds! They:
- Use an encoder to understand the input
- Use a decoder to generate the output
- Are perfect for transformation tasks

**Key Feature:** While generating the output one token at a time, the decoder can attend to the encoder's representation of the *entire* input (cross-attention).

**Popular Example:** T5 (Text-to-Text Transfer Transformer)

### How T5 Works:
1. **Encoder** reads and understands the input
2. Creates a representation of the input
3. **Decoder** uses that representation to generate output
4. Everything is framed as a text-to-text task!

### Use Cases:
- Translation
- Summarization
- Question answering
- Text transformation

Let's dive in!

## Example 3.1: Loading T5 and Understanding the Architecture

In [None]:
# Step 1: Load T5 tokenizer and model
print("Loading T5 model... (this might take a moment)")
t5_tokenizer = T5Tokenizer.from_pretrained('t5-small')
t5_model = T5ForConditionalGeneration.from_pretrained('t5-small')
print("✅ T5 model loaded!\n")

# T5 uses task prefixes to know what to do
print("📚 T5 uses prefixes to understand the task:")
print("-" * 80)
print("'translate English to German: ' → Translation")
print("'summarize: ' → Summarization")
print("'question: ... context: ...' → Question Answering")
print("'sst2 sentence: ' → Sentiment Analysis")
print("=" * 80)
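
Because the task is selected purely by the text in front of the input, building a T5 input is just string formatting. The sketch below shows one way to keep the prefixes in one place. The `PREFIXES` table and the `build_t5_input` helper are hypothetical names; the prefix strings themselves are ones the original T5 checkpoints were trained with (sentiment uses `sst2 sentence: `).

```python
# Task prefixes used by the original T5 checkpoints; the helper is illustrative.
PREFIXES = {
    "translate_en_de": "translate English to German: ",
    "summarize": "summarize: ",
    "sentiment": "sst2 sentence: ",
}

def build_t5_input(task, text):
    """Prepend the task prefix so the model knows which task to perform."""
    if task not in PREFIXES:
        raise ValueError(f"No prefix known for task {task!r}")
    # " ".join(text.split()) collapses newlines and repeated spaces.
    return PREFIXES[task] + " ".join(text.split())

print(build_t5_input("summarize", "Machine learning is\n a subset of AI."))
# summarize: Machine learning is a subset of AI.
```

A prefix the model never saw during training is just ordinary text to it, so unsupported prefixes tend to produce unreliable output.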

## Example 3.2: Text Summarization

Let's use T5 to summarize a long piece of text!

In [None]:
# Long article about machine learning
article = """
Machine learning is a subset of artificial intelligence that focuses on the development 
of algorithms and statistical models that enable computer systems to improve their 
performance on a specific task through experience. Unlike traditional programming, 
where explicit instructions are provided, machine learning systems learn patterns from 
data. There are three main types of machine learning: supervised learning, where the 
algorithm learns from labeled data; unsupervised learning, where the algorithm finds 
patterns in unlabeled data; and reinforcement learning, where an agent learns to make 
decisions by receiving rewards or penalties. Deep learning, a subset of machine learning, 
uses neural networks with multiple layers to learn hierarchical representations of data. 
Machine learning has revolutionized many industries, including healthcare, finance, 
transportation, and entertainment, enabling applications such as disease diagnosis, 
fraud detection, autonomous vehicles, and recommendation systems.
"""

print("Original Article:")
print("=" * 80)
print(article.strip())
print("=" * 80)
print(f"\nOriginal length: {len(article.split())} words\n")

# Prepare input for T5 (add task prefix)
input_text = "summarize: " + article
inputs = t5_tokenizer(input_text, return_tensors='pt', max_length=512, truncation=True)

# Generate summary
print("Generating summary...\n")
with torch.no_grad():
    summary_ids = t5_model.generate(
        inputs['input_ids'],
        max_length=100,
        min_length=30,
        length_penalty=2.0,
        num_beams=4,  # Beam search for better quality
        early_stopping=True
    )

summary = t5_tokenizer.decode(summary_ids[0], skip_special_tokens=True)

print("Generated Summary:")
print("=" * 80)
print(summary)
print("=" * 80)
print(f"\nSummary length: {len(summary.split())} words")
print(f"Compression ratio: {len(article.split()) / len(summary.split()):.1f}x shorter!")
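
The `num_beams=4` setting above turns on beam search: instead of committing to the single best token at each step, the decoder keeps several partial outputs alive and extends them all. Here is a toy sketch of the idea. The `NEXT` table is a hypothetical model, and this is not the library's implementation, which also handles end-of-sequence tokens, length penalties, and batching.

```python
import math

# Hypothetical next-token probabilities keyed by the previous token.
NEXT = {
    "<s>": {"a": 0.6, "b": 0.4},
    "a":   {"x": 0.5, "y": 0.5},
    "b":   {"z": 0.9, "w": 0.1},
}

def beam_search(num_beams, steps):
    """Keep the num_beams highest-scoring partial sequences at every step."""
    beams = [(["<s>"], 0.0)]  # (tokens, summed log-probability)
    for _ in range(steps):
        candidates = []
        for tokens, score in beams:
            for tok, p in NEXT.get(tokens[-1], {}).items():
                candidates.append((tokens + [tok], score + math.log(p)))
        if not candidates:
            break
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:num_beams]
    return beams

greedy = beam_search(num_beams=1, steps=2)[0]  # one beam is the same as greedy decoding
wide = beam_search(num_beams=2, steps=2)[0]

print(greedy[0], round(math.exp(greedy[1]), 2))  # ['<s>', 'a', 'x'] 0.3
print(wide[0], round(math.exp(wide[1]), 2))      # ['<s>', 'b', 'z'] 0.36
```

Greedy decoding picks `a` first because it looks best locally. With two beams the search also explores `b` and finds a sequence with higher overall probability.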

## Example 3.3: Translation

T5 can translate text between languages!

In [None]:
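def translate_text(text, target_language='German'):
    """Translate English text using T5.

    The original T5 checkpoints were trained to translate English into
    German, French, and Romanian; other targets will not work reliably.
    """
    # T5 format: "translate English to [language]: [text]"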
    input_text = f"translate English to {target_language}: {text}"
    inputs = t5_tokenizer(input_text, return_tensors='pt', max_length=512, truncation=True)
    
    with torch.no_grad():
        outputs = t5_model.generate(
            inputs['input_ids'],
            max_length=128,
            num_beams=4,
            early_stopping=True
        )
    
    return t5_tokenizer.decode(outputs[0], skip_special_tokens=True)

# Test sentences
sentences = [
    "Hello, how are you today?",
    "Machine learning is fascinating.",
    "I love studying computer science."
]

print("English to German Translation:")
print("=" * 80)

for sentence in sentences:
    translation = translate_text(sentence, 'German')
    print(f"\n🇬🇧 English:  {sentence}")
    print(f"🇩🇪 German:   {translation}")
    print("-" * 80)

print("\n💡 T5 learned translation from supervised translation examples in its multi-task training mix!")

## Example 3.4: Question Answering

T5 can answer questions based on provided context!

In [None]:
def answer_question(question, context):
    """Answer a question given context using T5"""
    # T5 format for QA
    input_text = f"question: {question} context: {context}"
    inputs = t5_tokenizer(input_text, return_tensors='pt', max_length=512, truncation=True)
    
    with torch.no_grad():
        outputs = t5_model.generate(
            inputs['input_ids'],
            max_length=50,
            num_beams=4,
            early_stopping=True
        )
    
    return t5_tokenizer.decode(outputs[0], skip_special_tokens=True)

# Context paragraph
context = """
Python is a high-level, interpreted programming language created by Guido van Rossum 
and first released in 1991. It emphasizes code readability and simplicity, making it 
an excellent choice for beginners. Python supports multiple programming paradigms, 
including procedural, object-oriented, and functional programming. It has a large 
standard library and is widely used in web development, data science, artificial 
intelligence, scientific computing, and automation.
"""

print("Context:")
print("=" * 80)
print(context.strip())
print("=" * 80)
print()

# Questions to ask
questions = [
    "Who created Python?",
    "When was Python first released?",
    "What programming paradigms does Python support?",
    "What is Python used for?"
]

print("Question Answering:")
print("=" * 80)

for question in questions:
    answer = answer_question(question, context)
    print(f"\n❓ Q: {question}")
    print(f"✅ A: {answer}")
    print("-" * 80)

print("\n💡 T5 extracts and generates answers from the context!")

## Example 3.5: Multiple Tasks Showcase

Let's demonstrate T5's versatility by running multiple different tasks!

In [None]:
def t5_task(task_prefix, text, max_length=100):
    """Generic function to run any T5 task"""
    input_text = f"{task_prefix} {text}"
    inputs = t5_tokenizer(input_text, return_tensors='pt', max_length=512, truncation=True)
    
    with torch.no_grad():
        outputs = t5_model.generate(
            inputs['input_ids'],
            max_length=max_length,
            num_beams=4,
            early_stopping=True
        )
    
    return t5_tokenizer.decode(outputs[0], skip_special_tokens=True)

print("T5: One Model, Multiple Tasks! 🚀")
print("=" * 80)

# The prefixes below are ones the original T5 checkpoints were trained with.
# Prefixes such as "grammar:" or "paraphrase:" need a fine-tuned checkpoint.

# Task 1: Grammatical acceptability (CoLA)
print("\n1️⃣ GRAMMATICAL ACCEPTABILITY")
print("-" * 80)
bad_grammar = "She don't likes apples and oranges very much."
verdict = t5_task("cola sentence:", bad_grammar, max_length=10)
print(f"Sentence: {bad_grammar}")
print(f"Verdict:  {verdict}")

# Task 2: Sentiment analysis (SST-2)
print("\n2️⃣ SENTIMENT ANALYSIS")
print("-" * 80)
text = "This product exceeded my expectations! Amazing quality."
sentiment = t5_task("sst2 sentence:", text, max_length=10)
print(f"Text: {text}")
print(f"Sentiment: {sentiment}")

# Task 3: Semantic similarity (STS-B), scored from 0 to 5
print("\n3️⃣ SEMANTIC SIMILARITY")
print("-" * 80)
first = "The quick brown fox jumps over the lazy dog."
second = "A fast brown fox leaps over a sleepy dog."
score = t5_task("stsb sentence1:", f"{first} sentence2: {second}", max_length=10)
print(f"Sentence 1: {first}")
print(f"Sentence 2: {second}")
print(f"Similarity: {score}")

print("\n" + "=" * 80)
print("\n💡 T5 treats everything as a text-to-text transformation!")
print("   Just change the task prefix to change the task!")

## 🎯 Key Takeaways: Encoder-Decoder Models

✅ **Best for:** Transformation tasks (input → output)

✅ **Architecture:**
- **Encoder** understands the input (bidirectional attention)
- **Decoder** generates the output (causal attention + cross-attention to encoder)
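
Cross-attention is the piece that connects the two halves: the query comes from the decoder, while the keys and values are the encoder's outputs. Below is a minimal single-head sketch in plain Python. It leaves out the learned projection matrices and multiple heads that a real model has, and the vectors are toy values chosen for illustration.

```python
import math

def softmax(xs):
    peak = max(xs)                                  # subtract the max for stability
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(decoder_query, encoder_states):
    """Scaled dot-product attention of one decoder query over all encoder states."""
    scale = math.sqrt(len(decoder_query))
    scores = [sum(q * k for q, k in zip(decoder_query, state)) / scale
              for state in encoder_states]
    weights = softmax(scores)                       # how much to read from each input token
    dim = len(encoder_states[0])
    context = [sum(w * state[d] for w, state in zip(weights, encoder_states))
               for d in range(dim)]                 # weighted average of encoder states
    return weights, context

encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy encoder outputs, one per input token
weights, context = cross_attention([2.0, 0.0], encoder_states)

print([round(w, 3) for w in weights])
print([round(c, 3) for c in context])
```

The query points along the first dimension, so the first and third encoder states receive the most weight.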

✅ **Common tasks:**
- Translation
- Summarization
- Question answering
- Paraphrasing
- Grammar correction
- Any text-to-text transformation!

✅ **Advantages:**
- Combines understanding and generation
- Great for complex transformations
- Versatile (one model, many tasks)

**T5's Special Feature:** Everything is framed as text-to-text, making it extremely flexible!

---

# 🎓 Final Comparison: Which Model When?

## Quick Reference Guide

| Model Type | Architecture | Best For | Example Models | Can Generate? | Can Understand? |
|------------|--------------|----------|----------------|---------------|------------------|
| **Encoder-Only** | Bidirectional | Understanding & Analysis | BERT, RoBERTa | ❌ No | ✅ Yes |
| **Decoder-Only** | Causal (Left-to-right) | Text Generation | GPT, GPT-2, GPT-3 | ✅ Yes | ⚠️ Limited |
| **Encoder-Decoder** | Both | Transformation Tasks | T5, BART | ✅ Yes | ✅ Yes |

## Decision Tree 🌳

```
What's your task?
│
├─ Need to UNDERSTAND text?
│  ├─ Classification → Encoder-Only (BERT)
│  ├─ Similarity → Encoder-Only (BERT)
│  └─ Analysis → Encoder-Only (BERT)
│
├─ Need to GENERATE text?
│  ├─ Creative writing → Decoder-Only (GPT)
│  ├─ Continuation → Decoder-Only (GPT)
│  └─ Chatbot → Decoder-Only (GPT)
│
└─ Need to TRANSFORM text?
   ├─ Translation → Encoder-Decoder (T5)
   ├─ Summarization → Encoder-Decoder (T5)
   └─ Question Answering → Encoder-Decoder (T5)
```
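
The same decision tree can be written as a small lookup table, which is handy if you want to experiment with it in code. The task names and the `recommend` helper are illustrative assumptions; treat the result as a starting point rather than a rule.

```python
# Task -> model family, mirroring the decision tree (names are illustrative).
FAMILY_FOR_TASK = {
    "classification": "Encoder-Only (BERT)",
    "similarity": "Encoder-Only (BERT)",
    "analysis": "Encoder-Only (BERT)",
    "creative writing": "Decoder-Only (GPT)",
    "continuation": "Decoder-Only (GPT)",
    "chatbot": "Decoder-Only (GPT)",
    "translation": "Encoder-Decoder (T5)",
    "summarization": "Encoder-Decoder (T5)",
    "question answering": "Encoder-Decoder (T5)",
}

def recommend(task):
    """Return the suggested model family, or a fallback for unknown tasks."""
    return FAMILY_FOR_TASK.get(task.strip().lower(), "No suggestion for this task")

print(recommend("Summarization"))  # Encoder-Decoder (T5)
print(recommend("chatbot"))        # Decoder-Only (GPT)
```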

## Real-World Examples 🌍

### Encoder-Only (BERT)
- 📧 Email spam detection
- 😊 Sentiment analysis in product reviews
- 🏷️ Named entity recognition
- 🔍 Semantic search engines

### Decoder-Only (GPT)
- ✍️ Content creation (blogs, articles)
- 💬 Conversational AI assistants
- 💻 Code generation and completion
- 📖 Story and creative writing

### Encoder-Decoder (T5)
- 🌐 Language translation
- 📄 Document summarization
- ❓ Question answering systems
- ✏️ Grammar and style correction

---

# 🎉 Congratulations!

You've completed the transformer models tutorial! Here's what you learned:

## ✅ Key Concepts Mastered:

1. **Encoder-Only Models (BERT)**
   - Bidirectional understanding
   - Text embeddings
   - Classification and analysis tasks

2. **Decoder-Only Models (GPT)**
   - Autoregressive generation
   - Temperature and sampling strategies
   - Creative text generation

3. **Encoder-Decoder Models (T5)**
   - Text-to-text framework
   - Translation and summarization
   - Multi-task learning

## 🚀 Next Steps:

1. **Experiment:** Modify the code examples with your own text!
2. **Explore:** Try different model sizes (e.g., `bert-large-uncased`, `gpt2-medium`, `t5-base`)
3. **Build:** Create your own application using these models
4. **Learn More:** Check out Hugging Face documentation at https://huggingface.co/docs

## 📚 Additional Resources:

- Hugging Face Transformers: https://huggingface.co/transformers/
- "Attention Is All You Need" paper (original Transformer)
- BERT paper: "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding"
- GPT papers: GPT, GPT-2, GPT-3
- T5 paper: "Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer"

## 💡 Pro Tips:

- Start with small models for experimentation (faster and less memory)
- Use GPU acceleration for faster inference (if available)
- Read the model cards on Hugging Face for capabilities and limitations
- Fine-tune models on your specific data for better performance

---

### Happy coding! 🎈

*Remember: These models are tools. Understanding when and how to use each one is the real skill!*

## 🛠️ Practice Exercises

Try these challenges to test your understanding:

### Beginner:
1. Modify the BERT sentiment analyzer to analyze your own sentences
2. Change the GPT-2 temperature and observe the differences in generation
3. Use T5 to summarize your favorite news article

### Intermediate:
4. Create a function that classifies movie reviews as positive/negative using BERT
5. Build a story generator that takes a genre as input and generates appropriate stories
6. Translate sentences from English to multiple languages using T5

### Advanced:
7. Compare embeddings from BERT for synonyms vs. unrelated words
8. Implement beam search manually for GPT-2 generation
9. Fine-tune T5 on a custom dataset (requires additional data)

Use the cells below to work on these exercises!

In [None]:
# Your practice code here!
# Try out the exercises above or experiment with your own ideas

