# Lab 2.5.5: Introduction to LoRA (Low-Rank Adaptation)

**Module:** 2.5 - Hugging Face Ecosystem  
**Time:** 2 hours  
**Difficulty:** ⭐⭐⭐ (Intermediate-Advanced)

---

## Learning Objectives

By the end of this lab, you will:
- [ ] Understand how LoRA reduces trainable parameters
- [ ] Configure LoRA with the PEFT library
- [ ] Compare memory usage: LoRA vs full fine-tuning
- [ ] Train a model with LoRA adapters
- [ ] Merge and save LoRA weights

---

## Prerequisites

- Completed: Labs 2.5.1 through 2.5.4
- Knowledge of: Matrix operations, fine-tuning concepts

---

## Real-World Context

**The Fine-Tuning Dilemma**: You want to customize a 7B parameter model for your use case. Full fine-tuning in FP32 with an Adam-style optimizer requires:
- ~28 GB for model weights (FP32)
- ~28 GB for gradients
- ~56 GB for optimizer states (two moment estimates per parameter)
- Total: ~112 GB before activations!

**LoRA's Solution**: Only train ~0.1% of parameters. Same 7B model, but:
- ~14 GB for model weights (frozen, BF16)
- ~500 MB for LoRA adapters, their gradients, and optimizer states
- Total: ~15 GB, plus activations!

This is why LoRA has become the go-to method for fine-tuning large models!
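
The arithmetic behind these estimates can be sketched in a few lines. This is a rough back-of-envelope calculation, not a profiler: the helper `training_memory_gb` is hypothetical, it assumes an Adam-style optimizer that keeps two FP32 moment tensors per trainable parameter plus FP32 gradients, and it ignores activations, the adapter weights themselves, and framework overhead.

```python
GB = 1e9

def training_memory_gb(n_params, n_trainable, weight_bytes, train_bytes=4):
    """Rough estimate in GB; ignores activations and framework overhead."""
    weights = n_params * weight_bytes / GB
    gradients = n_trainable * train_bytes / GB
    adam_states = 2 * n_trainable * train_bytes / GB  # first and second moments
    return {
        "weights": weights,
        "gradients": gradients,
        "adam_states": adam_states,
        "total": weights + gradients + adam_states,
    }

n_7b = 7_000_000_000

# Full fine-tuning: every parameter is trainable, weights kept in FP32
full_est = training_memory_gb(n_7b, n_7b, weight_bytes=4)

# LoRA: frozen BF16 weights, roughly 0.1% of the parameter count is trainable
lora_est = training_memory_gb(n_7b, n_7b // 1000, weight_bytes=2)

for label, est in [("Full fine-tuning", full_est), ("LoRA", lora_est)]:
    parts = ", ".join(f"{k}={v:.2f} GB" for k, v in est.items())
    print(f"{label}: {parts}")
```

Real numbers depend on precision, optimizer, batch size, and sequence length, so treat the output as an order-of-magnitude guide.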

---

## ELI5: How Does LoRA Work?

> **Imagine you're customizing a car...**
>
> **Full fine-tuning**: Replace every single part of the engine, transmission, interior - basically build a new car.
>
> **LoRA**: Keep the original car, just add a small turbo booster and a custom air filter. The car still works the same way, but now it's tuned for YOUR driving style.
>
> **The Math (simplified)**:
> - Original weight matrix W: 1000 x 1000 = 1,000,000 parameters
> - LoRA: Add A (1000 x 8) + B (8 x 1000) = 16,000 parameters
> - Reduction: 98.4% fewer trainable parameters!
>
> **Key Insight**: Most of the "knowledge" is in the pretrained weights. We only need to add a small "adjustment" for our specific task.
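
The simplified math above can be checked directly. The helper name `lora_param_count` is made up for this sketch; it only counts the entries of the two low-rank matrices.

```python
def lora_param_count(d_in, d_out, r):
    """Parameters in A (r x d_in) plus B (d_out x r)."""
    return r * d_in + d_out * r

full_count = 1000 * 1000
lora_count = lora_param_count(d_in=1000, d_out=1000, r=8)
eli5_reduction = 1 - lora_count / full_count

print(f"Full matrix: {full_count:,} parameters")
print(f"LoRA (r=8):  {lora_count:,} parameters")
print(f"Reduction:   {eli5_reduction:.1%}")
```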

---

## Part 1: Understanding LoRA Mathematically

In [None]:
import torch
import torch.nn as nn
import numpy as np
import gc

# Check environment
print("Environment Check")
print("=" * 50)
print(f"PyTorch: {torch.__version__}")
print(f"CUDA: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

In [None]:
# Let's visualize how LoRA works
print("LoRA: Low-Rank Adaptation Explained")
print("=" * 60)

# Original weight matrix (pretrained)
d_in = 768   # Input dimension (like BERT hidden size)
d_out = 768  # Output dimension
rank = 8     # LoRA rank (much smaller than d_in, d_out)

print(f"\nOriginal linear layer: {d_in} x {d_out} = {d_in * d_out:,} parameters")

# Original weight matrix W (frozen during LoRA training)
W = torch.randn(d_out, d_in)

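# LoRA decomposition: instead of updating W directly,
# we add a low-rank update: W' = W + BA
# where B is d_out x rank and A is rank x d_in.
# (In practice BA is also scaled by alpha / rank; the scale is omitted here.)

# Note: real LoRA initializes B to zeros so training starts from the pretrained
# behavior. Both matrices are random here only so the demo shows a nonzero update.
A = torch.randn(rank, d_in)   # "Down projection"
B = torch.randn(d_out, rank)  # "Up projection"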

print(f"\nLoRA matrices:")
print(f"  A (down): {rank} x {d_in} = {rank * d_in:,} parameters")
print(f"  B (up):   {d_out} x {rank} = {d_out * rank:,} parameters")
print(f"  Total LoRA: {rank * d_in + d_out * rank:,} parameters")

reduction = 1 - (rank * d_in + d_out * rank) / (d_in * d_out)
print(f"\nParameter reduction: {reduction:.1%}")

In [None]:
# Demonstrate LoRA forward pass
print("\nLoRA Forward Pass Demo")
print("-" * 60)

# Input vector
x = torch.randn(1, d_in)  # Batch of 1, hidden_size input

# Original output (without LoRA)
original_output = x @ W.T

# LoRA adds a low-rank update
# h = x @ W.T + x @ A.T @ B.T
# The BA product forms a low-rank matrix that "adjusts" the original weights

lora_adjustment = x @ A.T @ B.T  # This is the "delta" from LoRA
lora_output = original_output + lora_adjustment

print(f"Input shape: {x.shape}")
print(f"Original output shape: {original_output.shape}")
print(f"LoRA adjustment shape: {lora_adjustment.shape}")
print(f"Final output shape: {lora_output.shape}")

# The key insight: BA is a product through a rank-sized bottleneck,
# so its rank can never exceed `rank`
print(f"\nRank of BA matrix: at most {rank} (by construction)")
print(f"Rank of original W: up to {min(d_in, d_out)}")
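
The rank claim can also be verified numerically. The sketch below uses small float64 matrices (the names `A_small` and `B_small` are chosen to avoid overwriting the variables above) and `torch.linalg.matrix_rank`; the sizes are arbitrary.

```python
import torch

torch.manual_seed(0)
d_small, r_small = 64, 8

# float64 keeps the numerical rank estimate well away from the tolerance
A_small = torch.randn(r_small, d_small, dtype=torch.float64)
B_small = torch.randn(d_small, r_small, dtype=torch.float64)

delta_W = B_small @ A_small  # d_small x d_small, but built from rank-8 factors
rank_delta = torch.linalg.matrix_rank(delta_W).item()

print(f"delta_W shape: {tuple(delta_W.shape)}")
print(f"Numerical rank of delta_W: {rank_delta} (upper bound: {r_small})")
```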

### Key LoRA Parameters

| Parameter | Typical Value | Description |
|-----------|---------------|-------------|
| **rank (r)** | 8-64 | Rank of the low-rank matrices. Higher = more capacity, more params |
| **alpha** | 16-32 | Scaling factor. Often set to 2*rank. Final scaling = alpha/rank |
| **dropout** | 0.05-0.1 | Dropout applied to LoRA layers |
| **target_modules** | ["q_proj", "v_proj"] | Which layers to apply LoRA to |
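
To see how rank, alpha, and dropout fit together, here is a minimal sketch of a LoRA-wrapped linear layer. `LoRALinearSketch` is a hypothetical class written for this lab, not PEFT's implementation; the Kaiming initialization of `lora_A` is an assumption, while the zero initialization of `lora_B` follows the LoRA paper so that the adapter starts as a no-op.

```python
import math
import torch
import torch.nn as nn

class LoRALinearSketch(nn.Module):
    """Frozen nn.Linear plus a trainable low-rank update scaled by alpha / r."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16, dropout: float = 0.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)  # pretrained weights stay frozen

        self.lora_A = nn.Parameter(torch.empty(r, base.in_features))
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init
        nn.init.kaiming_uniform_(self.lora_A, a=math.sqrt(5))  # assumed init

        self.scaling = alpha / r
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        update = self.dropout(x) @ self.lora_A.T @ self.lora_B.T
        return self.base(x) + self.scaling * update

torch.manual_seed(0)
sketch_layer = LoRALinearSketch(nn.Linear(768, 768), r=8, alpha=16)
x_demo = torch.randn(2, 768)

# With B at zero, the wrapped layer reproduces the base layer exactly
same_at_init = torch.allclose(sketch_layer(x_demo), sketch_layer.base(x_demo))
sketch_trainable = sum(p.numel() for p in sketch_layer.parameters() if p.requires_grad)

print(f"Matches base layer at init: {same_at_init}")
print(f"Trainable parameters: {sketch_trainable:,}")
```

Because only `lora_A` and `lora_B` require gradients, an optimizer built over the trainable parameters never touches the pretrained weight.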

---

## Part 2: LoRA with PEFT Library

In [None]:
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from peft import LoraConfig, get_peft_model, TaskType

# Load base model
model_name = "distilbert-base-uncased"
print(f"Loading {model_name}...")

tokenizer = AutoTokenizer.from_pretrained(model_name)
base_model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    num_labels=2,
    torch_dtype=torch.bfloat16
).to(device)

# Count parameters before LoRA
total_params = sum(p.numel() for p in base_model.parameters())
trainable_before = sum(p.numel() for p in base_model.parameters() if p.requires_grad)
print(f"\nBase model parameters: {total_params:,}")
print(f"Trainable before LoRA: {trainable_before:,} (100%)")

In [None]:
# Explore model structure to find target modules
print("\nModel layer names (looking for linear layers):")
print("-" * 60)

for name, module in base_model.named_modules():
    if isinstance(module, nn.Linear):
        print(f"{name}: {module.in_features} -> {module.out_features}")

In [None]:
# Create LoRA configuration
lora_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,  # Sequence classification
    r=8,                         # Rank
    lora_alpha=16,               # Scaling factor
    lora_dropout=0.1,            # Dropout
    target_modules=["q_lin", "v_lin"],  # DistilBERT uses these names
    bias="none",                 # Don't train biases
    modules_to_save=["classifier", "pre_classifier"]  # Train these normally
)

print("LoRA Configuration:")
print("-" * 40)
print(f"  Rank: {lora_config.r}")
print(f"  Alpha: {lora_config.lora_alpha}")
print(f"  Scaling: {lora_config.lora_alpha / lora_config.r}")
print(f"  Target modules: {lora_config.target_modules}")
print(f"  Modules to save: {lora_config.modules_to_save}")

In [None]:
# Apply LoRA to the model
print("\nApplying LoRA...")
peft_model = get_peft_model(base_model, lora_config)

# Check trainable parameters
peft_model.print_trainable_parameters()

# Manual calculation
trainable_after = sum(p.numel() for p in peft_model.parameters() if p.requires_grad)
total_after = sum(p.numel() for p in peft_model.parameters())

print(f"\nDetailed breakdown:")
print(f"  Total parameters: {total_after:,}")
print(f"  Trainable parameters: {trainable_after:,}")
print(f"  Trainable %: {100 * trainable_after / total_after:.2f}%")
print(f"  Reduction: {100 * (1 - trainable_after / trainable_before):.1f}%")
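
As a sanity check, the trainable count can be derived by hand and compared with what PEFT reports. The arithmetic below assumes DistilBERT's six transformer blocks each contain one 768 x 768 `q_lin` and one 768 x 768 `v_lin`, a 768 -> 768 `pre_classifier`, and a 768 -> 2 `classifier`; all variable names are local to this sketch.

```python
hidden_size = 768
n_blocks = 6
n_target_modules = 2   # q_lin and v_lin
lora_rank = 8
n_labels = 2

# Each adapted layer gains A (r x d) and B (d x r)
expected_lora = n_blocks * n_target_modules * (lora_rank * hidden_size + hidden_size * lora_rank)

# modules_to_save are trained in full (weights + biases)
expected_pre_classifier = hidden_size * hidden_size + hidden_size
expected_classifier = hidden_size * n_labels + n_labels

expected_trainable = expected_lora + expected_pre_classifier + expected_classifier

print(f"LoRA matrices:   {expected_lora:,}")
print(f"pre_classifier:  {expected_pre_classifier:,}")
print(f"classifier:      {expected_classifier:,}")
print(f"Expected total:  {expected_trainable:,}")
```

Note that most of the trainable parameters here come from the classification head, not from the LoRA matrices.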

---

## Part 3: Memory Comparison

In [None]:
import gc

def measure_training_memory(model, sample_input, sample_labels, optimizer_class=torch.optim.AdamW):
    """
    Measure GPU memory (in GB) during a single training step.
    Requires CUDA: the torch.cuda memory counters are meaningless on CPU.
    """
    if not torch.cuda.is_available():
        raise RuntimeError("measure_training_memory requires a CUDA device")

    gc.collect()
    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats()

    # Move to GPU
    model = model.to(device).train()
    sample_input = {k: v.to(device) for k, v in sample_input.items()}
    sample_labels = sample_labels.to(device)

    # Memory after model load
    model_memory = torch.cuda.memory_allocated() / 1e9

    # Create the optimizer over trainable parameters only.
    # AdamW allocates its state lazily, so no state exists yet at this point.
    trainable_params = [p for p in model.parameters() if p.requires_grad]
    optimizer = optimizer_class(trainable_params, lr=2e-5)

    # Forward pass
    outputs = model(**sample_input, labels=sample_labels)
    loss = outputs.loss
    forward_memory = torch.cuda.memory_allocated() / 1e9

    # Backward pass
    loss.backward()
    backward_memory = torch.cuda.memory_allocated() / 1e9

    # Optimizer step: the first step() allocates Adam's m and v tensors,
    # so the growth over backward_memory approximates the optimizer state
    optimizer.step()
    optimizer_memory = torch.cuda.memory_allocated() / 1e9 - backward_memory
    optimizer.zero_grad()

    peak_memory = torch.cuda.max_memory_allocated() / 1e9

    # Cleanup
    del optimizer

    return {
        "model_memory_gb": model_memory,
        "optimizer_memory_gb": optimizer_memory,
        "forward_memory_gb": forward_memory,
        "backward_memory_gb": backward_memory,
        "peak_memory_gb": peak_memory
    }

# Create sample input
sample_text = "This is a test sentence for memory measurement."
sample_input = tokenizer(sample_text, return_tensors="pt", padding="max_length", max_length=128)
sample_labels = torch.tensor([1])

In [None]:
# Measure LoRA memory
print("Measuring LoRA training memory...")
lora_memory = measure_training_memory(peft_model, sample_input, sample_labels)

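# Clean up for full fine-tuning test.
# get_peft_model wraps base_model in place, so both references must be
# dropped before the GPU memory can actually be released.
del peft_model, base_model
gc.collect()
torch.cuda.empty_cache()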

In [None]:
# Load fresh model for full fine-tuning comparison
print("\nMeasuring full fine-tuning memory...")
full_model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    num_labels=2,
    torch_dtype=torch.bfloat16
)

full_memory = measure_training_memory(full_model, sample_input, sample_labels)

# Clean up
del full_model
gc.collect()
torch.cuda.empty_cache()

In [None]:
# Compare memory usage
print("\n" + "=" * 60)
print("MEMORY COMPARISON")
print("=" * 60)

print(f"\n{'Metric':<25} {'Full Fine-tune':<15} {'LoRA':<15} {'Savings':>10}")
print("-" * 65)

for key in lora_memory:
    full_val = full_memory[key]
    lora_val = lora_memory[key]
    savings = (1 - lora_val / full_val) * 100 if full_val > 0 else 0
    
    metric_name = key.replace("_", " ").replace(" gb", "").title()
    print(f"{metric_name:<25} {full_val:<15.2f} {lora_val:<15.2f} {savings:>9.1f}%")

print("\nKey Insight: LoRA significantly reduces optimizer memory!")
print("(Adam stores m & v for each trainable parameter)")
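
The link between trainable parameters and optimizer memory can be seen on a toy model without a GPU. This sketch assumes PyTorch's `AdamW`, which stores its moments under the state keys `exp_avg` and `exp_avg_sq` and creates them on the first `step()`; the toy network and variable names are made up for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
toy = nn.Sequential(nn.Linear(16, 16), nn.Linear(16, 4))

# Freeze the first layer, mimicking a frozen pretrained backbone
for p in toy[0].parameters():
    p.requires_grad_(False)

toy_opt = torch.optim.AdamW(toy.parameters(), lr=1e-3)
states_before_step = len(toy_opt.state)  # state is created lazily

toy(torch.randn(2, 16)).sum().backward()
toy_opt.step()

# Count the values held in Adam's first and second moment tensors
moment_values = sum(
    tensor.numel()
    for state in toy_opt.state.values()
    for key, tensor in state.items()
    if key in ("exp_avg", "exp_avg_sq")
)
toy_trainable = sum(p.numel() for p in toy.parameters() if p.requires_grad)

print(f"Optimizer state entries before first step: {states_before_step}")
print(f"Trainable values: {toy_trainable}, Adam moment values: {moment_values}")
```

Frozen parameters never receive gradients, so the optimizer skips them and allocates no moments for them. That is where LoRA's optimizer-memory savings come from.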

---

## Part 4: Training with LoRA

In [None]:
from datasets import load_dataset
from transformers import Trainer, TrainingArguments
import evaluate

# Load dataset (smaller subset for demo)
print("Loading IMDB dataset...")
dataset = load_dataset("imdb")

# Use smaller subsets for faster demo
train_dataset = dataset['train'].shuffle(seed=42).select(range(5000))
eval_dataset = dataset['test'].shuffle(seed=42).select(range(1000))

print(f"Train: {len(train_dataset):,} examples")
print(f"Eval: {len(eval_dataset):,} examples")

In [None]:
# Tokenize
def tokenize_fn(examples):
    return tokenizer(
        examples['text'],
        padding='max_length',
        truncation=True,
        max_length=256
    )

print("Tokenizing...")
tokenized_train = train_dataset.map(tokenize_fn, batched=True, remove_columns=['text'])
tokenized_eval = eval_dataset.map(tokenize_fn, batched=True, remove_columns=['text'])

tokenized_train = tokenized_train.rename_column("label", "labels")
tokenized_eval = tokenized_eval.rename_column("label", "labels")

In [None]:
# Create fresh LoRA model
print("\nCreating LoRA model...")
base_model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    num_labels=2,
    torch_dtype=torch.bfloat16
)

lora_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,
    r=16,              # Slightly higher rank for better performance
    lora_alpha=32,
    lora_dropout=0.1,
    target_modules=["q_lin", "v_lin"],
    modules_to_save=["classifier", "pre_classifier"]
)

peft_model = get_peft_model(base_model, lora_config)
peft_model.print_trainable_parameters()

In [None]:
# Training arguments
training_args = TrainingArguments(
    output_dir="./results/lora_imdb",
    num_train_epochs=3,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=64,
    learning_rate=3e-4,  # Higher LR for LoRA is common
    warmup_ratio=0.1,
    weight_decay=0.01,
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
    bf16=True,
    logging_steps=50,
    report_to="none"
)

# Metrics
accuracy = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return accuracy.compute(predictions=predictions, references=labels)

In [None]:
# Create trainer
trainer = Trainer(
    model=peft_model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_eval,
    tokenizer=tokenizer,  # Recent transformers releases rename this argument to processing_class
    compute_metrics=compute_metrics
)

print("\nStarting LoRA training...")
print("=" * 60)

In [None]:
# Train!
import time
start_time = time.time()

train_result = trainer.train()

training_time = time.time() - start_time
print(f"\nTraining completed in {training_time:.1f}s")

In [None]:
# Evaluate
print("\nEvaluation Results:")
print("-" * 40)
eval_results = trainer.evaluate()
for key, value in eval_results.items():
    if isinstance(value, float):
        print(f"  {key}: {value:.4f}")

---

## Part 5: Saving and Loading LoRA Adapters

In [None]:
# Save only the LoRA adapters (very small!)
adapter_path = "./results/lora_adapter"
peft_model.save_pretrained(adapter_path)

print(f"Adapter saved to {adapter_path}")

# Check size
import os
total_size = 0
print("\nSaved files:")
for f in os.listdir(adapter_path):
    size = os.path.getsize(os.path.join(adapter_path, f)) / 1e6
    total_size += size
    print(f"  {f}: {size:.2f} MB")
print(f"\nTotal adapter size: {total_size:.2f} MB")
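
A rough estimate of the expected adapter size helps confirm that the saved files contain only the adapter and the saved head. The helper `estimated_adapter_mb` is hypothetical; it assumes the rank-16 configuration from Part 4, the DistilBERT shapes used throughout this lab, and no file-format overhead. Whether the weights are stored at 2 or 4 bytes per value depends on your PEFT version and settings, so both cases are shown.

```python
def estimated_adapter_mb(r, hidden=768, n_blocks=6, n_targets=2, n_labels=2, bytes_per_value=2):
    """Approximate size of LoRA matrices plus the fully saved classification head."""
    lora_values = n_blocks * n_targets * 2 * r * hidden
    head_values = (hidden * hidden + hidden) + (hidden * n_labels + n_labels)
    return (lora_values + head_values) * bytes_per_value / 1e6

for nbytes, dtype_label in [(2, "bf16"), (4, "fp32")]:
    size_mb = estimated_adapter_mb(r=16, bytes_per_value=nbytes)
    print(f"Estimated adapter size ({dtype_label}): {size_mb:.2f} MB")
```

If the total reported above is far larger than these estimates, check whether extra files such as optimizer checkpoints were written to the same directory.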

In [None]:
# Load adapter onto a fresh base model
from peft import PeftModel

print("\nLoading adapter onto fresh base model...")

# Load base model
fresh_base = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    num_labels=2,
    torch_dtype=torch.bfloat16
)

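# Load LoRA adapter.
# By default the adapter is loaded for inference with its weights frozen;
# pass is_trainable=True to PeftModel.from_pretrained to continue training.
loaded_model = PeftModel.from_pretrained(fresh_base, adapter_path)
loaded_model.print_trainable_parameters()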

print("\nAdapter loaded successfully!")

In [None]:
# Test the loaded model
loaded_model = loaded_model.to(device).eval()

test_texts = [
    "This movie was absolutely fantastic! A masterpiece!",
    "Terrible film. Waste of time and money.",
    "It was okay, nothing special."
]

print("\nTesting loaded model:")
print("-" * 60)

for text in test_texts:
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=256).to(device)
    with torch.no_grad():
        outputs = loaded_model(**inputs)
        probs = torch.softmax(outputs.logits, dim=1)
        pred = torch.argmax(probs, dim=1).item()
        conf = probs[0][pred].item()
    
    sentiment = "POSITIVE" if pred == 1 else "NEGATIVE"
    print(f"{sentiment} ({conf:.1%}): {text[:50]}...")

---

## Part 6: Merging LoRA Weights

In [None]:
# Merge LoRA weights into base model for deployment
# This creates a standard model (no PEFT overhead)

print("Merging LoRA weights into base model...")
merged_model = loaded_model.merge_and_unload()

print(f"\nMerged model type: {type(merged_model).__name__}")
print(f"Parameters: {sum(p.numel() for p in merged_model.parameters()):,}")

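# The LoRA matrices are now folded into the regular weights (no adapter modules remain).
# Because the adapter was loaded for inference, the merged weights are typically
# still frozen; re-enable gradients before any further training.
trainable = sum(p.numel() for p in merged_model.parameters() if p.requires_grad)
print(f"Trainable: {trainable:,} (call merged_model.requires_grad_(True) to train further)")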
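
Conceptually, merging folds the scaled low-rank product into the frozen weight: W' = W + (alpha / r) * BA. The sketch below checks on small random matrices that the merged weight gives the same output as the unmerged forward pass. It illustrates the math only and is not PEFT's implementation; all names are local to the example.

```python
import torch

torch.manual_seed(0)
d_m, r_m, alpha_m = 32, 4, 8
scale_m = alpha_m / r_m

W0 = torch.randn(d_m, d_m, dtype=torch.float64)     # frozen pretrained weight
A_m = torch.randn(r_m, d_m, dtype=torch.float64)    # trained down projection
B_m = torch.randn(d_m, r_m, dtype=torch.float64)    # trained up projection
x_m = torch.randn(3, d_m, dtype=torch.float64)

# Unmerged: base path plus the scaled adapter path (two extra matmuls)
unmerged_out = x_m @ W0.T + scale_m * (x_m @ A_m.T @ B_m.T)

# Merged: one weight matrix of the original shape, a single matmul
W_merged = W0 + scale_m * (B_m @ A_m)
merged_out = x_m @ W_merged.T

max_diff = (unmerged_out - merged_out).abs().max().item()
print(f"Merged weight shape: {tuple(W_merged.shape)}")
print(f"Max output difference: {max_diff:.2e}")
```

The merged matrix has the same shape as the original weight, which is why the merged model has no extra parameters and no adapter overhead at inference time.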

In [None]:
# Save merged model
merged_path = "./results/merged_model"
merged_model.save_pretrained(merged_path)
tokenizer.save_pretrained(merged_path)

print(f"\nMerged model saved to {merged_path}")

# Check size
total_size = 0
print("\nSaved files:")
for f in os.listdir(merged_path):
    size = os.path.getsize(os.path.join(merged_path, f)) / 1e6
    total_size += size
    if size > 1:
        print(f"  {f}: {size:.1f} MB")
print(f"\nTotal merged model size: {total_size:.1f} MB")

---

## Part 7: LoRA Rank Comparison

In [None]:
# Compare different LoRA ranks
print("LoRA Rank Comparison (Theoretical)")
print("=" * 60)

ranks = [4, 8, 16, 32, 64]
d = 768  # DistilBERT hidden size
num_layers = 6  # DistilBERT has 6 layers
num_targets = 2  # q_lin and v_lin

base_params = 66_955_010  # DistilBERT-base

print(f"\n{'Rank':<8} {'LoRA Params':<15} {'% of Base':<12} {'Estimated Acc':>15}")
print("-" * 55)

for r in ranks:
    # Each LoRA layer adds: r*d_in + r*d_out parameters
    lora_per_layer = 2 * r * d  # A and B matrices
    total_lora = lora_per_layer * num_targets * num_layers
    
    pct = 100 * total_lora / base_params
    
    # Illustrative placeholder only, NOT a measured result: a made-up curve that
    # grows with log2(rank) to show the usual diminishing-returns shape
    est_acc = 0.88 + 0.02 * np.log2(r / 4)
    est_acc = min(est_acc, 0.93)  # Arbitrary cap for the illustration
    
    print(f"{r:<8} {total_lora:<15,} {pct:<12.3f} {est_acc:>14.1%}")

print("\nNote: Actual accuracy depends heavily on task and data!")

---

## Common Mistakes

### Mistake 1: Wrong Target Modules

```python
# Wrong: Using BERT module names for DistilBERT
lora_config = LoraConfig(
    target_modules=["query", "value"]  # BERT uses these
)

# Right: Check model's actual layer names
lora_config = LoraConfig(
    target_modules=["q_lin", "v_lin"]  # DistilBERT uses these
)
```
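
One way to avoid guessing is to list the distinct names of the linear layers before writing the config. The helper `linear_leaf_names` and the `ToyAttention` stand-in below are hypothetical; on a real model you would pass the loaded model instead of the toy encoder.

```python
import torch.nn as nn

def linear_leaf_names(model: nn.Module):
    """Return the sorted, de-duplicated last name component of every nn.Linear."""
    names = set()
    for full_name, module in model.named_modules():
        if isinstance(module, nn.Linear):
            names.add(full_name.split(".")[-1])
    return sorted(names)

class ToyAttention(nn.Module):
    """Stand-in block that borrows DistilBERT-style attribute names."""
    def __init__(self):
        super().__init__()
        self.q_lin = nn.Linear(8, 8)
        self.k_lin = nn.Linear(8, 8)
        self.v_lin = nn.Linear(8, 8)
        self.out_lin = nn.Linear(8, 8)

toy_encoder = nn.Sequential(ToyAttention(), ToyAttention())
candidate_targets = linear_leaf_names(toy_encoder)
print(candidate_targets)
```

Any name you put in `target_modules` should appear in this list for the model you are actually adapting.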

### Mistake 2: Too Low Learning Rate

```python
# Wrong: Using full fine-tuning LR
args = TrainingArguments(learning_rate=2e-5)

# Right: LoRA often needs higher LR
args = TrainingArguments(learning_rate=3e-4)  # 10-15x higher
```

### Mistake 3: Forgetting modules_to_save

```python
# Wrong: Only LoRA layers are trained, classifier stays random!
lora_config = LoraConfig(
    target_modules=["q_lin", "v_lin"]
)

# Right: Also train the classification head (both layers are newly initialized in DistilBERT)
lora_config = LoraConfig(
    target_modules=["q_lin", "v_lin"],
    modules_to_save=["classifier", "pre_classifier"]  # Train these normally
)
```
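
A quick way to catch this mistake is to list the trainable parameters whose names mention the head. The helper `trainable_params_matching` is hypothetical and is shown on a toy model; with a PEFT-wrapped model the parameter names carry extra prefixes, which is why the check uses substring matching.

```python
import torch.nn as nn

def trainable_params_matching(model: nn.Module, keyword: str):
    """Names of parameters that require gradients and contain `keyword`."""
    return [
        name
        for name, param in model.named_parameters()
        if param.requires_grad and keyword in name
    ]

class ToyClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Linear(8, 8)
        self.classifier = nn.Linear(8, 2)

    def forward(self, x):
        return self.classifier(self.encoder(x))

toy_clf = ToyClassifier()
for p in toy_clf.encoder.parameters():
    p.requires_grad_(False)  # frozen backbone, trainable head

head_params = trainable_params_matching(toy_clf, "classifier")
encoder_params = trainable_params_matching(toy_clf, "encoder")
print(f"Trainable head parameters: {head_params}")
print(f"Trainable encoder parameters: {encoder_params}")
```

An empty list for the head means the classifier would stay at its random initialization during training.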

---

## Checkpoint

You've learned:
- The mathematical intuition behind LoRA
- How to configure LoRA with PEFT
- Memory benefits of LoRA vs full fine-tuning
- How to train, save, and load LoRA adapters
- How to merge LoRA weights for deployment

---

## Further Reading

- [LoRA Paper](https://arxiv.org/abs/2106.09685)
- [PEFT Documentation](https://huggingface.co/docs/peft)
- [QLoRA Paper](https://arxiv.org/abs/2305.14314) (LoRA + Quantization)

---

## Cleanup

In [None]:
# Cleanup
del peft_model, loaded_model, merged_model, trainer
gc.collect()
torch.cuda.empty_cache()

print(f"GPU memory after cleanup: {torch.cuda.memory_allocated()/1e9:.2f} GB")
print("\nLab 2.5.5 complete!")