# Week 6 Lab: Pre-trained Models - Feature Extraction

## Learning Objectives
- Load and use pre-trained BERT and GPT models
- Extract contextual embeddings from both architectures
- Compare BERT vs GPT representations
- Visualize attention patterns
- Use pre-trained features for classification
- Understand when to use each architecture

**Note**: This lab focuses on feature extraction (not fine-tuning) to work with limited compute resources.

---

## Part 1: Setup and Installation

In [None]:
# Install required libraries (run once)
# !pip install transformers torch numpy matplotlib seaborn scikit-learn

import torch
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from transformers import BertTokenizer, BertModel, GPT2Tokenizer, GPT2Model
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
import warnings
warnings.filterwarnings('ignore')

# Set random seed
torch.manual_seed(42)
np.random.seed(42)

# Configure plotting
plt.style.use('seaborn-v0_8-whitegrid')
%matplotlib inline

print("Setup complete!")
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")

## Part 2: Load Pre-trained Models

We'll load BERT-base (uncased) and GPT-2 small from the Hugging Face Hub. The first run downloads the weights.
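
As a sanity check on the parameter count printed by the next cell, the total for `BertModel` can be reproduced by hand from the architecture. The sketch below assumes the standard `bert-base-uncased` configuration (vocabulary 30,522, 512 positions, 2 segment types, 12 layers, hidden size 768, feed-forward size 3,072) and counts the pooler layer as well. The helper name `bert_param_count` is made up for illustration.

```python
def bert_param_count(vocab=30522, max_pos=512, type_vocab=2,
                     layers=12, hidden=768, ffn=3072):
    """Count parameters of a BERT encoder with a pooler (hypothetical helper)."""
    layer_norm = 2 * hidden  # scale + shift

    # Word, position, and segment embeddings, followed by a LayerNorm
    embeddings = (vocab + max_pos + type_vocab) * hidden + layer_norm

    # Q, K, V, and output projections (weights + biases), then a LayerNorm
    attention = 4 * (hidden * hidden + hidden) + layer_norm
    # Two dense layers (hidden -> ffn -> hidden), then a LayerNorm
    feed_forward = (hidden * ffn + ffn) + (ffn * hidden + hidden) + layer_norm
    per_layer = attention + feed_forward

    pooler = hidden * hidden + hidden  # dense layer applied to [CLS]
    return embeddings + layers * per_layer + pooler

total = bert_param_count()
print(f"{total:,} parameters (~{total / 1e6:.1f}M)")
```

Most of the budget sits in the 12 encoder layers; the embedding table accounts for roughly a fifth of the total.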

In [None]:
# Load BERT
print("Loading BERT-base-uncased...")
bert_tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
bert_model = BertModel.from_pretrained('bert-base-uncased')
bert_model.eval()

print(f"BERT loaded:")
print(f"  Parameters: {sum(p.numel() for p in bert_model.parameters())/1e6:.1f}M")
print(f"  Layers: {bert_model.config.num_hidden_layers}")
print(f"  Hidden size: {bert_model.config.hidden_size}")
print(f"  Attention heads: {bert_model.config.num_attention_heads}")
print(f"  Vocabulary: {bert_model.config.vocab_size}")

In [None]:
# Load GPT-2
print("Loading GPT-2 (small)...")
gpt_tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
gpt_model = GPT2Model.from_pretrained('gpt2')
gpt_model.eval()

# GPT-2 needs padding token
gpt_tokenizer.pad_token = gpt_tokenizer.eos_token

print(f"GPT-2 loaded:")
print(f"  Parameters: {sum(p.numel() for p in gpt_model.parameters())/1e6:.1f}M")
print(f"  Layers: {gpt_model.config.n_layer}")
print(f"  Hidden size: {gpt_model.config.n_embd}")
print(f"  Attention heads: {gpt_model.config.n_head}")
print(f"  Vocabulary: {gpt_model.config.vocab_size}")

## Part 3: Extract BERT Embeddings

BERT provides contextual embeddings for each token. The [CLS] token is commonly used as a sentence representation.
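
With `output_hidden_states=True`, the model returns one tensor per layer plus the embedding output, each of shape `(batch, seq_len, hidden_size)`. The sketch below mimics that layout with random NumPy arrays (no model is loaded; `fake_hidden_states` is a stand-in) to show what the `layer` index and the `[0, 0, :]` slice select.

```python
import numpy as np

rng = np.random.default_rng(0)
num_layers, batch, seq_len, hidden = 12, 1, 9, 768

# Stand-in for outputs.hidden_states: embedding output + one entry per layer
fake_hidden_states = tuple(
    rng.normal(size=(batch, seq_len, hidden)) for _ in range(num_layers + 1)
)

layer = -1                                # last layer, same as index 12 here
selected = fake_hidden_states[layer]      # (batch, seq_len, hidden)
cls_embedding = selected[0, 0, :]         # first sentence, position 0 = [CLS]
all_embeddings = selected[0, :, :]        # every token of the first sentence

print(len(fake_hidden_states))   # number of entries in the tuple
print(cls_embedding.shape)       # one vector for the sentence
print(all_embeddings.shape)      # one vector per token
```

Index 0 of the tuple is the embedding output, so a 12-layer model yields 13 entries.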

In [None]:
def get_bert_embeddings(text, model, tokenizer, layer=-1):
    """
    Extract BERT embeddings for text
    
    Args:
        text: Input string
        model: BERT model
        tokenizer: BERT tokenizer
        layer: Which layer to extract from (-1 = last layer)
    
    Returns:
        cls_embedding: [CLS] token embedding (sentence representation)
        all_embeddings: All token embeddings
    """
    # Tokenize
    inputs = tokenizer(text, return_tensors='pt', padding=True, truncation=True, max_length=512)
    
    # Get embeddings
    with torch.no_grad():
        outputs = model(**inputs, output_hidden_states=True)
    
    # Extract from specified layer
    hidden_states = outputs.hidden_states[layer]  # (batch, seq_len, hidden_size)
    
    # [CLS] token is first token
    cls_embedding = hidden_states[0, 0, :].numpy()  # (hidden_size,)
    all_embeddings = hidden_states[0, :, :].numpy()  # (seq_len, hidden_size)
    
    return cls_embedding, all_embeddings

# Test with sample sentences
sample_sentences = [
    "The cat sat on the mat.",
    "BERT provides bidirectional context.",
    "Pre-training enables transfer learning."
]

print("Extracting BERT embeddings...\n")
for sent in sample_sentences:
    cls_emb, all_emb = get_bert_embeddings(sent, bert_model, bert_tokenizer)
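    # Include [CLS]/[SEP] so the token count matches the embedding shape
    tokens = bert_tokenizer.convert_ids_to_tokens(bert_tokenizer(sent)['input_ids'])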
    print(f"Sentence: {sent}")
    print(f"  Tokens: {tokens}")
    print(f"  [CLS] embedding shape: {cls_emb.shape}")
    print(f"  All embeddings shape: {all_emb.shape}")
    print(f"  [CLS] L2 norm: {np.linalg.norm(cls_emb):.3f}\n")

In [None]:
# Visualize BERT embeddings with PCA
sentences_for_viz = [
    "The cat sleeps.",
    "The dog runs.",
    "She loves reading books.",
    "He enjoys playing games.",
    "Machine learning is fascinating.",
    "Natural language processing is powerful.",
    "Python is a programming language.",
    "Java is also a programming language."
]

# Extract [CLS] embeddings
bert_cls_embeddings = []
for sent in sentences_for_viz:
    cls_emb, _ = get_bert_embeddings(sent, bert_model, bert_tokenizer)
    bert_cls_embeddings.append(cls_emb)

bert_cls_embeddings = np.array(bert_cls_embeddings)  # (n_sentences, 768)

# Apply PCA for 2D visualization
pca = PCA(n_components=2)
bert_pca = pca.fit_transform(bert_cls_embeddings)

# Plot
fig, ax = plt.subplots(figsize=(12, 8))

for i, (sent, coords) in enumerate(zip(sentences_for_viz, bert_pca)):
    ax.scatter(coords[0], coords[1], s=200, alpha=0.6)
    ax.annotate(sent, xy=coords, xytext=(5, 5), textcoords='offset points',
                fontsize=9, bbox=dict(boxstyle='round,pad=0.5', facecolor='wheat', alpha=0.5))

ax.set_xlabel('PCA Component 1', fontsize=12, fontweight='bold')
ax.set_ylabel('PCA Component 2', fontsize=12, fontweight='bold')
ax.set_title('BERT [CLS] Embeddings in 2D Space (PCA)', fontsize=14, fontweight='bold')
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print(f"Explained variance: {pca.explained_variance_ratio_.sum():.3f}")

## Part 4: Extract GPT Embeddings

GPT-2 has no [CLS] token. Because its attention is causal, only the last token's hidden state has seen the whole sentence, so we'll use that as the sentence representation.
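
One caveat if you later batch several sentences: the tokenizer pads shorter sequences, so position `-1` may hold a padding token instead of the final word. The function in this lab processes one sentence at a time, so it is not affected. A minimal sketch of picking the last non-padding position from the attention mask, assuming right-side padding and using NumPy arrays as stand-ins for the model output:

```python
import numpy as np

# Stand-ins: 2 sentences padded to length 5, hidden size 4
hidden = np.arange(2 * 5 * 4, dtype=float).reshape(2, 5, 4)
attention_mask = np.array([[1, 1, 1, 1, 1],    # 5 real tokens
                           [1, 1, 1, 0, 0]])   # 3 real tokens + 2 pads

# Index of the last real token in each row (valid only for right padding)
last_idx = attention_mask.sum(axis=1) - 1
# Pick one position per sentence with paired row/column indices
last_token_emb = hidden[np.arange(len(hidden)), last_idx]

print(last_idx)
print(last_token_emb.shape)
```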

In [None]:
def get_gpt_embeddings(text, model, tokenizer, layer=-1):
    """
    Extract GPT-2 embeddings
    
    Returns:
        last_token_embedding: Last token's hidden state (sentence representation)
        all_embeddings: All token embeddings
    """
    # Tokenize
    inputs = tokenizer(text, return_tensors='pt', padding=True, truncation=True, max_length=512)
    
    # Get embeddings
    with torch.no_grad():
        outputs = model(**inputs, output_hidden_states=True)
    
    # Extract from specified layer
    hidden_states = outputs.hidden_states[layer]
    
    # Last token as sentence embedding
    last_token_embedding = hidden_states[0, -1, :].numpy()
    all_embeddings = hidden_states[0, :, :].numpy()
    
    return last_token_embedding, all_embeddings

# Extract GPT embeddings for same sentences
print("Extracting GPT-2 embeddings...\n")
gpt_sentence_embeddings = []

for sent in sample_sentences:
    last_emb, all_emb = get_gpt_embeddings(sent, gpt_model, gpt_tokenizer)
    gpt_sentence_embeddings.append(last_emb)
    tokens = gpt_tokenizer.tokenize(sent)
    print(f"Sentence: {sent}")
    print(f"  Tokens: {tokens}")
    print(f"  Last token embedding shape: {last_emb.shape}")
    print(f"  All embeddings shape: {all_emb.shape}")
    print(f"  Last token L2 norm: {np.linalg.norm(last_emb):.3f}\n")

## Part 5: Compare BERT vs GPT Embeddings

Let's compare how BERT and GPT represent the same sentences.
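
The comparison relies on cosine similarity: the dot product of two vectors divided by the product of their L2 norms. It ignores vector length and measures only direction. BERT and GPT-2 embeddings live in different spaces, so similarities are meaningful within one model, not across the two. A small NumPy sketch (the helper name `cosine` is illustrative):

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two 1-D vectors (illustrative helper)."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

a = np.array([1.0, 2.0, 3.0])
same_direction = cosine(a, 2 * a)                                  # scaling does not matter
orthogonal = cosine(np.array([1.0, 0.0]), np.array([0.0, 1.0]))    # no shared direction
opposite = cosine(a, -a)                                           # pointing the other way

print(same_direction, orthogonal, opposite)
```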

In [None]:
# Extract both BERT and GPT embeddings for comparison
comparison_sentences = [
    "The bank is by the river.",
    "I deposited money at the bank.",
    "The model performs well.",
    "She is a fashion model.",
]

bert_embs = []
gpt_embs = []

for sent in comparison_sentences:
    bert_cls, _ = get_bert_embeddings(sent, bert_model, bert_tokenizer)
    gpt_last, _ = get_gpt_embeddings(sent, gpt_model, gpt_tokenizer)
    
    bert_embs.append(bert_cls)
    gpt_embs.append(gpt_last)

bert_embs = np.array(bert_embs)
gpt_embs = np.array(gpt_embs)

print(f"BERT embeddings shape: {bert_embs.shape}")
print(f"GPT embeddings shape: {gpt_embs.shape}")

In [None]:
# Visualize both in same 2D space
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 6))

# BERT visualization
pca_bert = PCA(n_components=2)
bert_2d = pca_bert.fit_transform(bert_embs)

for i, sent in enumerate(comparison_sentences):
    color = 'red' if 'bank' in sent else 'blue' if 'model' in sent else 'gray'
    ax1.scatter(bert_2d[i, 0], bert_2d[i, 1], s=200, c=color, alpha=0.6)
    ax1.annotate(sent, xy=bert_2d[i], xytext=(5, 5), textcoords='offset points',
                fontsize=8, bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.5))

ax1.set_title('BERT Embeddings (Bidirectional)', fontsize=13, fontweight='bold')
ax1.set_xlabel('PCA 1', fontsize=11)
ax1.set_ylabel('PCA 2', fontsize=11)
ax1.grid(True, alpha=0.3)

# GPT visualization
pca_gpt = PCA(n_components=2)
gpt_2d = pca_gpt.fit_transform(gpt_embs)

for i, sent in enumerate(comparison_sentences):
    color = 'red' if 'bank' in sent else 'blue' if 'model' in sent else 'gray'
    ax2.scatter(gpt_2d[i, 0], gpt_2d[i, 1], s=200, c=color, alpha=0.6)
    ax2.annotate(sent, xy=gpt_2d[i], xytext=(5, 5), textcoords='offset points',
                fontsize=8, bbox=dict(boxstyle='round', facecolor='lightblue', alpha=0.5))

ax2.set_title('GPT-2 Embeddings (Autoregressive)', fontsize=13, fontweight='bold')
ax2.set_xlabel('PCA 1', fontsize=11)
ax2.set_ylabel('PCA 2', fontsize=11)
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("Check whether sentences that share a word land near each other. With only four points, PCA is a rough guide.")

In [None]:
# Compute cosine similarities
from sklearn.metrics.pairwise import cosine_similarity

bert_sim = cosine_similarity(bert_embs)
gpt_sim = cosine_similarity(gpt_embs)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

# BERT similarities
im1 = ax1.imshow(bert_sim, cmap='RdYlGn', aspect='auto', vmin=0, vmax=1)
ax1.set_title('BERT Cosine Similarities', fontsize=12, fontweight='bold')
ax1.set_xticks(range(len(comparison_sentences)))
ax1.set_yticks(range(len(comparison_sentences)))
ax1.set_xticklabels([f'S{i+1}' for i in range(len(comparison_sentences))], fontsize=9)
ax1.set_yticklabels([f'S{i+1}' for i in range(len(comparison_sentences))], fontsize=9)
plt.colorbar(im1, ax=ax1)

# GPT similarities
im2 = ax2.imshow(gpt_sim, cmap='RdYlGn', aspect='auto', vmin=0, vmax=1)
ax2.set_title('GPT-2 Cosine Similarities', fontsize=12, fontweight='bold')
ax2.set_xticks(range(len(comparison_sentences)))
ax2.set_yticks(range(len(comparison_sentences)))
ax2.set_xticklabels([f'S{i+1}' for i in range(len(comparison_sentences))], fontsize=9)
ax2.set_yticklabels([f'S{i+1}' for i in range(len(comparison_sentences))], fontsize=9)
plt.colorbar(im2, ax=ax2)

plt.tight_layout()
plt.show()

print("\nS1/S2 share 'bank' and S3/S4 share 'model', but in different senses. Compare these pairs against the cross pairs.")
print(f"BERT - bank pair similarity: {bert_sim[0, 1]:.3f}")
print(f"GPT - bank pair similarity: {gpt_sim[0, 1]:.3f}")

## Part 6: Visualize Attention Patterns

Let's examine what BERT and GPT attend to in sentences.
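
The structural difference comes from masking: BERT applies softmax over all positions, while GPT-2 sets scores for future positions to negative infinity before the softmax. A minimal NumPy sketch with random scores standing in for the scaled query-key products (no model involved; `softmax` is a local helper):

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)   # subtract row max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
seq_len = 5
scores = rng.normal(size=(seq_len, seq_len))   # stand-in for scaled Q.K^T

# Bidirectional (BERT-style): every query sees every key
full_attn = softmax(scores)

# Causal (GPT-style): query i may only see keys 0..i
causal_mask = np.tril(np.ones((seq_len, seq_len), dtype=bool))
causal_attn = softmax(np.where(causal_mask, scores, -np.inf))

print(np.round(causal_attn, 2))
```

Each row still sums to 1 in both cases; the causal version simply redistributes the weight over the allowed (earlier) positions.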

In [None]:
def visualize_bert_attention(text, model, tokenizer, layer=11, head=0):
    """
    Visualize BERT attention patterns
    """
    # Tokenize
    inputs = tokenizer(text, return_tensors='pt')
    tokens = tokenizer.convert_ids_to_tokens(inputs['input_ids'][0])
    
    # Get attention
    with torch.no_grad():
        outputs = model(**inputs, output_attentions=True)
    
    # Extract attention from specified layer and head
    attention = outputs.attentions[layer][0, head, :, :].numpy()
    
    # Plot heatmap
    fig, ax = plt.subplots(figsize=(10, 9))
    
    im = ax.imshow(attention, cmap='Blues', aspect='auto')
    
    ax.set_xticks(range(len(tokens)))
    ax.set_yticks(range(len(tokens)))
    ax.set_xticklabels(tokens, rotation=45, ha='right', fontsize=10)
    ax.set_yticklabels(tokens, fontsize=10)
    
    ax.set_xlabel('Attend TO (keys)', fontsize=11, fontweight='bold')
    ax.set_ylabel('Attend FROM (queries)', fontsize=11, fontweight='bold')
    ax.set_title(f'BERT Attention Pattern (Layer {layer+1}, Head {head+1})',
                fontsize=13, fontweight='bold')
    
    plt.colorbar(im, ax=ax, label='Attention Weight')
    plt.tight_layout()
    plt.show()
    
    return attention, tokens

# Visualize attention for a sample sentence
test_sentence = "The cat sat on the mat because it was tired."
attention, tokens = visualize_bert_attention(test_sentence, bert_model, bert_tokenizer, layer=11, head=0)

print(f"\nTokens: {tokens}")
print(f"Notice: BERT can attend to ALL tokens (full matrix, no triangular restriction)")

In [None]:
def visualize_gpt_attention(text, model, tokenizer, layer=11, head=0):
    """
    Visualize GPT-2 attention patterns (should show causal/triangular)
    """
    inputs = tokenizer(text, return_tensors='pt')
    tokens = tokenizer.convert_ids_to_tokens(inputs['input_ids'][0])
    
    with torch.no_grad():
        outputs = model(**inputs, output_attentions=True)
    
    attention = outputs.attentions[layer][0, head, :, :].numpy()
    
    fig, ax = plt.subplots(figsize=(10, 9))
    
    im = ax.imshow(attention, cmap='Oranges', aspect='auto')
    
    ax.set_xticks(range(len(tokens)))
    ax.set_yticks(range(len(tokens)))
    ax.set_xticklabels(tokens, rotation=45, ha='right', fontsize=9)
    ax.set_yticklabels(tokens, fontsize=9)
    
    ax.set_xlabel('Attend TO (keys)', fontsize=11, fontweight='bold')
    ax.set_ylabel('Attend FROM (queries)', fontsize=11, fontweight='bold')
    ax.set_title(f'GPT-2 Attention Pattern (Layer {layer+1}, Head {head+1}) - Notice Triangular!',
                fontsize=13, fontweight='bold')
    
    plt.colorbar(im, ax=ax, label='Attention Weight')
    plt.tight_layout()
    plt.show()
    
    return attention, tokens

# Visualize GPT attention (should be lower-triangular due to causal masking)
gpt_attention, gpt_tokens = visualize_gpt_attention(test_sentence, gpt_model, gpt_tokenizer, layer=11, head=0)

print(f"\nTokens: {gpt_tokens}")
print(f"Notice: GPT-2 shows CAUSAL (lower-triangular) pattern - can't attend to future tokens!")

## Part 7: Use for Classification Task

Use BERT embeddings as features for a simple sentiment classification task.
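
Feature extraction means the encoder stays frozen and only a small classifier is trained on its output vectors. The sketch below shows that idea with NumPy only: synthetic 8-dimensional vectors stand in for [CLS] embeddings, and a logistic regression is fit by plain gradient descent. The data, the dimensions, and the `train_logreg` helper are assumptions for illustration; the lab cells use scikit-learn's `LogisticRegression` instead.

```python
import numpy as np

def train_logreg(X, y, lr=0.1, steps=200):
    """Fit logistic regression by batch gradient descent (illustrative)."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # predicted P(y=1)
        grad = p - y                             # gradient of log loss w.r.t. the logit
        w -= lr * (X.T @ grad) / len(y)
        b -= lr * grad.mean()
    return w, b

rng = np.random.default_rng(0)
# Synthetic "frozen features": two well-separated clusters
X = np.vstack([rng.normal(+1.0, 0.3, size=(20, 8)),
               rng.normal(-1.0, 0.3, size=(20, 8))])
y = np.array([1] * 20 + [0] * 20)

w, b = train_logreg(X, y)
pred = ((X @ w + b) > 0).astype(int)
print(f"Training accuracy: {(pred == y).mean():.2f}")
```

Only `w` and `b` are learned here; the feature vectors never change, which is what keeps this approach cheap.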

In [None]:
# Create simple sentiment dataset
train_texts = [
    # Positive
    "This movie was excellent!",
    "I absolutely loved it.",
    "Best film I've seen this year.",
    "Highly recommend this.",
    "Fantastic performance by the actors.",
    "Brilliant storytelling.",
    # Negative
    "Terrible waste of time.",
    "I hated every minute.",
    "Worst movie ever made.",
    "Do not watch this.",
    "Boring and predictable.",
    "Completely disappointing."
]

train_labels = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]  # 1=positive, 0=negative

test_texts = [
    "Amazing experience!",
    "Not worth your time.",
    "Truly inspiring film.",
    "Awful and boring."
]

test_labels = [1, 0, 1, 0]

print(f"Training set: {len(train_texts)} examples")
print(f"Test set: {len(test_texts)} examples")

In [None]:
# Extract BERT features
print("Extracting BERT features for training...")
train_features_bert = []
for text in train_texts:
    cls_emb, _ = get_bert_embeddings(text, bert_model, bert_tokenizer)
    train_features_bert.append(cls_emb)

train_features_bert = np.array(train_features_bert)

print("Extracting BERT features for test...")
test_features_bert = []
for text in test_texts:
    cls_emb, _ = get_bert_embeddings(text, bert_model, bert_tokenizer)
    test_features_bert.append(cls_emb)

test_features_bert = np.array(test_features_bert)

print(f"\nTrain features shape: {train_features_bert.shape}")
print(f"Test features shape: {test_features_bert.shape}")

In [None]:
# Train logistic regression classifier on BERT features
clf_bert = LogisticRegression(max_iter=1000, random_state=42)
clf_bert.fit(train_features_bert, train_labels)

# Predict on test set
pred_bert = clf_bert.predict(test_features_bert)

# Evaluate
accuracy_bert = accuracy_score(test_labels, pred_bert)

print("=== BERT Feature-Based Classification ===")
print(f"\nTest Accuracy: {accuracy_bert:.3f}")
print(f"\nPredictions vs True:")
for text, true_label, pred_label in zip(test_texts, test_labels, pred_bert):
    label_map = {0: 'Negative', 1: 'Positive'}
    match = "✓" if true_label == pred_label else "X"
    print(f"  {match} '{text}'")
    print(f"     True: {label_map[true_label]}, Predicted: {label_map[pred_label]}")

## Part 8: Compare BERT vs GPT for Classification

Now extract GPT-2 features and compare performance.
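
Keep in mind that the test set has only four examples, so each accuracy can take only the values 0, 0.25, 0.5, 0.75, or 1.0. One way to see how little that pins down is a confidence interval. The sketch below uses the Wilson score interval, chosen here for illustration and not part of the original lab; `wilson_interval` is a hypothetical helper.

```python
import math

def wilson_interval(correct, n, z=1.96):
    """Approximate 95% Wilson score interval for a proportion (illustrative)."""
    p = correct / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return max(0.0, center - half), min(1.0, center + half)

for correct in (2, 3, 4):
    lo, hi = wilson_interval(correct, 4)
    print(f"{correct}/4 correct -> accuracy {correct / 4:.2f}, "
          f"95% interval [{lo:.2f}, {hi:.2f}]")
```

Even a perfect 4/4 leaves a lower bound not far above 0.5, so a gap between the two models on this test set is a demonstration, not evidence.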

In [None]:
# Extract GPT features
print("Extracting GPT-2 features for training...")
train_features_gpt = []
for text in train_texts:
    last_emb, _ = get_gpt_embeddings(text, gpt_model, gpt_tokenizer)
    train_features_gpt.append(last_emb)

train_features_gpt = np.array(train_features_gpt)

print("Extracting GPT-2 features for test...")
test_features_gpt = []
for text in test_texts:
    last_emb, _ = get_gpt_embeddings(text, gpt_model, gpt_tokenizer)
    test_features_gpt.append(last_emb)

test_features_gpt = np.array(test_features_gpt)

# Train classifier on GPT features
clf_gpt = LogisticRegression(max_iter=1000, random_state=42)
clf_gpt.fit(train_features_gpt, train_labels)

# Predict
pred_gpt = clf_gpt.predict(test_features_gpt)
accuracy_gpt = accuracy_score(test_labels, pred_gpt)

print("=== GPT-2 Feature-Based Classification ===")
print(f"\nTest Accuracy: {accuracy_gpt:.3f}")

# Compare
print("\n=== Comparison ===")
print(f"BERT accuracy: {accuracy_bert:.3f}")
print(f"GPT-2 accuracy: {accuracy_gpt:.3f}")

if accuracy_bert > accuracy_gpt:
    print("\nBERT scored higher on this tiny test set.")
elif accuracy_bert < accuracy_gpt:
    print("\nGPT-2 scored higher on this tiny test set.")
else:
    print("\nTie. With 4 test examples, accuracy moves in steps of 0.25.")

In [None]:
# Visualize comparison
fig, ax = plt.subplots(figsize=(10, 6))

models = ['BERT\n(Bidirectional)', 'GPT-2\n(Autoregressive)']
accuracies = [accuracy_bert, accuracy_gpt]
colors = ['#E74C3C', '#3498DB']

bars = ax.bar(models, accuracies, color=colors, alpha=0.7, edgecolor='black', linewidth=2)

ax.set_ylabel('Test Accuracy', fontsize=12, fontweight='bold')
ax.set_title('Classification Performance: BERT vs GPT-2 (Feature Extraction)', fontsize=14,
            fontweight='bold')
ax.set_ylim(0, 1.1)

# Add value labels
for i, (model, acc) in enumerate(zip(models, accuracies)):
    ax.text(i, acc + 0.03, f'{acc:.3f}', ha='center', fontsize=12, fontweight='bold')

ax.axhline(y=0.5, color='gray', linestyle='--', alpha=0.5, label='Random baseline')
ax.legend(fontsize=10)

ax.grid(True, alpha=0.2, axis='y')

plt.tight_layout()
plt.show()

print("\nNote: with only 4 test examples this is a demonstration, not a benchmark. Bidirectional encoders such as BERT are often preferred for classification.")

## Summary and Key Takeaways

### What We Learned:

1. **Loading Pre-trained Models**:
   - HuggingFace makes it easy to load BERT and GPT
   - The models used here are modest in size (BERT-base about 110M parameters, GPT-2 small about 124M); the largest GPT-2 variant has 1.5B
   
2. **Feature Extraction**:
   - BERT: Use [CLS] token for sentence embedding
   - GPT: Use last token for sentence embedding
   - Both provide rich 768-dim contextual representations
   
3. **BERT vs GPT**:
   - BERT: Bidirectional, better for classification/understanding
   - GPT: Causal, better for generation tasks
   - Both learn useful representations
   
4. **Attention Patterns**:
   - BERT: Full attention matrix (can see all tokens)
   - GPT: Lower-triangular (causal mask prevents future peeking)
   
5. **Practical Use**:
   - Pre-trained features work well for downstream tasks
   - A simple classifier on top can give a reasonable baseline (verify on a larger test set)
   - No fine-tuning needed for quick prototyping

### Next Steps:
- Week 7: Advanced architectures (T5, GPT-3, scaling laws)
- Week 10: Fine-tuning and prompt engineering
- Experiment: Try on your own classification/generation tasks!

## Exercises

1. **Different Layers**: Extract embeddings from different entries of BERT's `hidden_states` (indices 0, 6, 12; index 0 is the embedding output). How do they differ?

2. **Pooling Strategies**: Compare [CLS] vs mean pooling vs max pooling for sentence embeddings.

3. **Multi-class Classification**: Extend to 3+ classes (positive/neutral/negative).

4. **Attention Analysis**: Find which words BERT/GPT attend to most for specific tasks.

5. **Domain Adaptation**: Try on technical text or social media. Does performance degrade?

6. **Embedding Similarity**: Build a semantic search using cosine similarity of BERT embeddings.

7. **Generation with GPT**: Use GPT-2 to generate text (extend beyond feature extraction).
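
As a starting point for Exercise 2, the three pooling strategies can be sketched on a stand-in array. This assumes a right-padded batch with an attention mask like the one the tokenizer returns; the random `hidden` array replaces real model output, and all variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for one layer's output: 2 sentences, 6 positions, hidden size 4
hidden = rng.normal(size=(2, 6, 4))
attention_mask = np.array([[1, 1, 1, 1, 1, 1],
                           [1, 1, 1, 1, 0, 0]])   # second sentence has 2 pads

mask = attention_mask[:, :, None].astype(bool)    # (2, 6, 1), broadcasts over hidden

cls_pooled = hidden[:, 0, :]                                    # vector at the [CLS] position
mean_pooled = (hidden * mask).sum(axis=1) / mask.sum(axis=1)    # average over real tokens only
max_pooled = np.where(mask, hidden, -np.inf).max(axis=1)        # max over real tokens only

print(cls_pooled.shape, mean_pooled.shape, max_pooled.shape)
```

Masking matters for mean and max pooling: without it, padding positions would leak into the sentence vector.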