# Task 9.5: LoRA - Parameter-Efficient Fine-Tuning

**Module:** 9 - Hugging Face Ecosystem  
**Time:** 2 hours  
**Difficulty:** ⭐⭐⭐⭐ (Intermediate-Advanced)

---

## Learning Objectives

By the end of this notebook, you will:
- [ ] Understand the theory behind LoRA (Low-Rank Adaptation)
- [ ] Configure LoRA with the PEFT library
- [ ] Compare memory usage: LoRA vs full fine-tuning
- [ ] Fine-tune a model using LoRA
- [ ] Merge LoRA adapters back into the base model
- [ ] Appreciate when and why to use PEFT methods

---

## Prerequisites

- Completed: Task 9.4 (Trainer Fine-tuning)
- Knowledge of: Basic linear algebra (matrices), fine-tuning concepts

---

## Real-World Context

Imagine you want to customize a 70 billion parameter model for your specific use case. Full fine-tuning would require:
- ~140GB just for model weights (in fp16)
- ~560GB for Adam optimizer states (two fp32 values per parameter, about 4x the fp16 weights)
- Multiple high-end GPUs and weeks of training
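
The estimate above can be reproduced with a few lines of arithmetic. This is a rough sketch that assumes 2 bytes per fp16 weight and two fp32 Adam states (8 bytes) per parameter. The function name `full_finetune_memory_gb` is illustrative, and gradients and activations are ignored.

```python
def full_finetune_memory_gb(num_params, weight_bytes=2, adam_state_bytes=8):
    """Rough memory estimate in GB (1 GB = 1e9 bytes).

    Assumptions: fp16 weights (2 bytes each) and two fp32 Adam states
    (2 x 4 bytes) per parameter. Gradients and activations are not counted.
    """
    weights_gb = num_params * weight_bytes / 1e9
    optimizer_gb = num_params * adam_state_bytes / 1e9
    return weights_gb, optimizer_gb


weights_gb, optimizer_gb = full_finetune_memory_gb(70e9)
print(f"Weights:   ~{weights_gb:.0f} GB")
print(f"Optimizer: ~{optimizer_gb:.0f} GB")
```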

**LoRA changes everything:**
- Train only 0.1-1% of parameters
- Fit training on a single GPU
- Complete in hours, not weeks
- Often match or come close to full fine-tuning quality (the LoRA paper reports on-par results on several benchmarks)

**Real-world LoRA users:**
- Stability AI: Fine-tuned Stable Diffusion for specific styles
- Microsoft: Personalized Copilot assistants
- Countless developers: Custom chatbots and assistants

---

## ELI5: What is LoRA?

> **Imagine you're customizing a car.** You could:
> - Option A: Rebuild the entire engine from scratch (expensive, slow)
> - Option B: Add a turbocharger attachment (cheap, fast, removable!)
>
> **LoRA is Option B for AI models.**
>
> Instead of modifying the entire model (billions of numbers), LoRA:
> 1. **Freezes** the original model completely
> 2. **Adds** tiny "adapter" layers alongside key components
> 3. **Trains** only the adapters (millions vs billions of parameters)

### Visual: How LoRA Works

```
STANDARD LAYER:                    LORA LAYER:
                                   
  Input (x)                          Input (x)
     │                                  │
     ▼                                  ├──────────────┐
  ┌─────┐                               ▼              ▼
  │  W  │  ← Full weight matrix     ┌─────┐        ┌─────┐
  │     │    (10000×10000 params)   │  W  │ FROZEN │  A  │ (10000×8)
  │     │                           │     │        └──┬──┘
  └──┬──┘                           └──┬──┘           ▼
     │                                 │          ┌─────┐
     ▼                                 │          │  B  │ (8×10000)
  Output                               │          └──┬──┘
                                       │             │
                                       ▼             ▼
                                    Original    +   LoRA
                                       └─────┬───────┘
                                             ▼
                                          Output

  100,000,000 params             160,000 trainable params
  (all trainable)                  (625x fewer!)
```

> **The clever math trick:**
> - Original layer: multiply by huge matrix W (10000 × 10000 = 100M parameters)
> - LoRA says: "Changes to W can be approximated by two small matrices"
> - Small matrices A (10000 × 8) and B (8 × 10000) = only 160K parameters!
> - That's 625x fewer parameters to train!
>
> **In AI terms:** LoRA approximates weight updates using low-rank matrix decomposition, dramatically reducing trainable parameters while maintaining model quality.

---

## Part 1: Setup and Understanding the Math

In [None]:
# Install PEFT library
# Note: These packages are pre-installed in the NGC PyTorch container.
# Running pip install ensures you have compatible versions.
# If NOT using NGC container, ensure you have ARM64-compatible packages for DGX Spark.

# transformers>=4.41 is required because this notebook uses the eval_strategy argument.
!pip install -q "peft>=0.6.0" "transformers>=4.41.0" "datasets>=2.14.0" "evaluate>=0.4.0" "accelerate>=0.24.0"

import torch
import torch.nn as nn
import numpy as np
import evaluate
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from peft import LoraConfig, get_peft_model, TaskType
import warnings
warnings.filterwarnings('ignore')

print(f"PyTorch: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")

### The Math Behind LoRA

In a standard neural network layer:
$$y = Wx$$

During fine-tuning, we update W:
$$y = (W + \Delta W)x$$

LoRA's key insight: $\Delta W$ can be decomposed into two smaller matrices:
$$\Delta W = BA$$

Where:
- $W$ is the original weight matrix (frozen)
- $B$ is a small matrix of shape (d, r)
- $A$ is a small matrix of shape (r, k)
- $r$ is the "rank" (typically 8-64, much smaller than d or k)
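
Putting these pieces together, and including the $\alpha / r$ scaling used by the `SimpleLoRALayer` code later in this notebook, the forward pass and the parameter count can be written out as a short worked example (assuming $x \in \mathbb{R}^{k}$ is a column vector):

$$y = Wx + \frac{\alpha}{r} BAx, \qquad W \in \mathbb{R}^{d \times k},\; B \in \mathbb{R}^{d \times r},\; A \in \mathbb{R}^{r \times k}$$

$$\underbrace{d\,k}_{\text{full } \Delta W} \quad \text{vs.} \quad \underbrace{r\,(d + k)}_{\text{LoRA}}, \qquad \text{e.g. } d = k = 768,\ r = 8:\quad 589{,}824 \ \text{vs.}\ 12{,}288 \quad (48\times \text{ fewer})$$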

In [None]:
# Visualize the math
def visualize_lora_math():
    # Typical transformer dimensions
    d_model = 768  # Hidden size
    
    # Original weight matrix
    W_params = d_model * d_model
    print(f"Original W matrix: {d_model} x {d_model} = {W_params:,} parameters")
    
    # LoRA decomposition with different ranks
    print("\nLoRA decomposition:")
    print(f"{'Rank (r)':<10} {'A params':<12} {'B params':<12} {'Total':<12} {'Reduction':<12}")
    print("="*60)
    
    for r in [4, 8, 16, 32, 64]:
        A_params = d_model * r
        B_params = r * d_model
        total = A_params + B_params
        reduction = W_params / total
        print(f"{r:<10} {A_params:<12,} {B_params:<12,} {total:<12,} {reduction:.0f}x")

visualize_lora_math()
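
The table above assumes a square `d_model x d_model` layer. As a small extension, the same count for a rectangular layer is `rank * (in_features + out_features)`. This is only a sketch: the helper name `lora_param_count` is made up for illustration, and the 768 -> 3072 shape is assumed to be BERT-base's feed-forward expansion.

```python
def lora_param_count(in_features, out_features, rank):
    """Trainable parameters in one LoRA pair: A is (rank, in), B is (out, rank)."""
    return rank * in_features + out_features * rank


# (in_features, out_features) pairs; the 768 -> 3072 shape is an assumed FFN size
for in_f, out_f in [(768, 768), (768, 3072)]:
    full = in_f * out_f
    lora = lora_param_count(in_f, out_f, rank=8)
    print(f"{in_f} x {out_f}: full={full:,}  lora={lora:,}  reduction={full / lora:.0f}x")
```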

In [None]:
# Demonstrate LoRA computation
class SimpleLoRALayer(nn.Module):
    """A simple demonstration of how LoRA works."""
    
    def __init__(self, in_features, out_features, rank=8, alpha=16):
        super().__init__()
        
        # Original frozen weight
        self.W = nn.Linear(in_features, out_features, bias=False)
        self.W.weight.requires_grad = False  # Frozen!
        
        # LoRA adapters (these are trainable)
        self.lora_A = nn.Linear(in_features, rank, bias=False)
        self.lora_B = nn.Linear(rank, out_features, bias=False)
        
        # Scaling factor
        self.scaling = alpha / rank
        
        # Initialize A randomly and B with zeros, so B @ A = 0 and the layer initially behaves exactly like the frozen W
        nn.init.kaiming_uniform_(self.lora_A.weight)
        nn.init.zeros_(self.lora_B.weight)
    
    def forward(self, x):
        # Original path (frozen)
        original = self.W(x)
        
        # LoRA path (trainable)
        lora = self.lora_B(self.lora_A(x)) * self.scaling
        
        return original + lora

# Create and test
layer = SimpleLoRALayer(768, 768, rank=8)

# Count parameters
frozen_params = sum(p.numel() for p in layer.W.parameters())
trainable_params = sum(p.numel() for p in [layer.lora_A.weight, layer.lora_B.weight])

print(f"Frozen parameters: {frozen_params:,}")
print(f"Trainable parameters: {trainable_params:,}")
print(f"Trainable %: {100 * trainable_params / (frozen_params + trainable_params):.2f}%")
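
Why initialize B with zeros? A minimal pure-Python sketch (tiny hand-picked matrices, no PyTorch; helper names such as `matvec` are illustrative) suggests the reason: with B = 0 the LoRA path contributes nothing, so the layer's output initially equals the frozen layer's output.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, scaling):
    """Compute W x + scaling * B (A x), mirroring SimpleLoRALayer.forward."""
    original = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [o + scaling * u for o, u in zip(original, update)]


W = [[1.0, 2.0, 0.0], [0.0, 1.0, 3.0], [2.0, 0.0, 1.0]]  # frozen 3x3 weight
A = [[0.5, -1.0, 0.25]]                                   # stand-in for a random A, shape (1, 3)
B_zero = [[0.0], [0.0], [0.0]]                            # zero-initialized B, shape (3, 1)
x = [1.0, 2.0, 3.0]

frozen_out = matvec(W, x)
initial_out = lora_forward(W, A, B_zero, x, scaling=2.0)
print(frozen_out == initial_out)  # the LoRA path adds exactly zero at initialization
```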

---

## Part 2: Using PEFT Library for LoRA

In [None]:
# Helper to track memory
def get_memory_usage():
    if torch.cuda.is_available():
        allocated = torch.cuda.memory_allocated() / 1e9
        reserved = torch.cuda.memory_reserved() / 1e9
        return allocated, reserved
    return 0, 0

def print_memory(label):
    alloc, res = get_memory_usage()
    print(f"[{label}] Allocated: {alloc:.2f} GB, Reserved: {res:.2f} GB")

# Clear any existing memory
torch.cuda.empty_cache()
print_memory("Start")

In [None]:
# Load a model for comparison
model_name = "bert-base-uncased"

print(f"Loading {model_name}...")
base_model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    num_labels=2,
    torch_dtype=torch.bfloat16
)

print_memory("After loading base model")

# Count parameters
total_params = sum(p.numel() for p in base_model.parameters())
trainable_params = sum(p.numel() for p in base_model.parameters() if p.requires_grad)

print(f"\nBase model:")
print(f"  Total parameters: {total_params:,}")
print(f"  Trainable parameters: {trainable_params:,}")

In [None]:
# Configure LoRA
# NOTE: target_modules names vary by model architecture!
# See the "Common patterns by model architecture" list under "Mistake 1" below for model-specific names.
# When target_modules is a list, PEFT matches each entry against the end of the module name,
# so "query" matches "bert.encoder.layer.X.attention.self.query"

lora_config = LoraConfig(
    # Core LoRA parameters
    r=8,                        # Rank (lower = fewer params, higher = more capacity)
    lora_alpha=16,              # Scaling numerator: the update is scaled by lora_alpha / r (2*r is a common choice)
    lora_dropout=0.1,           # Dropout for regularization
    
    # Which layers to apply LoRA to
    # For BERT/RoBERTa: use "query", "key", "value", "dense"
    # For DistilBERT: use "q_lin", "k_lin", "v_lin"
    # For LLaMA/Mistral: use "q_proj", "k_proj", "v_proj", "o_proj"
    target_modules=["query", "value"],  # Attention layers (works for BERT variants)
    
    # Task type
    task_type=TaskType.SEQ_CLS,  # Sequence classification
    
    # Additional options
    bias="none",                # Don't train biases
    modules_to_save=["classifier"],  # Train classifier head normally
)

print("LoRA Configuration:")
print(f"  Rank: {lora_config.r}")
print(f"  Alpha: {lora_config.lora_alpha}")
print(f"  Target modules: {lora_config.target_modules}")
print(f"  Dropout: {lora_config.lora_dropout}")
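
To see which layers a `target_modules` list would select, here is a simplified sketch of suffix matching on module names. It is an approximation written for illustration, not PEFT's actual implementation. The function `matches_target` and the sample names (modeled on BERT's naming) are assumptions.

```python
def matches_target(module_name, target_modules):
    """Return True if the name equals a target or ends with '.<target>'."""
    return any(
        module_name == target or module_name.endswith("." + target)
        for target in target_modules
    )


# Sample names modeled on BERT's layer naming (illustrative, not exhaustive)
sample_names = [
    "bert.encoder.layer.0.attention.self.query",
    "bert.encoder.layer.0.attention.self.key",
    "bert.encoder.layer.0.attention.self.value",
    "bert.encoder.layer.0.attention.output.dense",
    "classifier",
]

selected = [n for n in sample_names if matches_target(n, ["query", "value"])]
for name in selected:
    print(name)
```

Swapping in `["dense"]` as the target list would select the `attention.output.dense` entry, which is why a generic name like `"dense"` can touch more layers than expected.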

In [None]:
# Apply LoRA to the model
peft_model = get_peft_model(base_model, lora_config)

print_memory("After applying LoRA")

# Print trainable parameters
print("\nParameter summary:")
peft_model.print_trainable_parameters()

In [None]:
# Examine the model structure
print("\nLoRA layers added:")
for name, param in peft_model.named_parameters():
    if 'lora' in name.lower():
        print(f"  {name}: {param.shape}")
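
As a sanity check on the printed summary, the trainable count can be estimated by hand. This sketch assumes BERT-base's 12 encoder layers and hidden size 768, LoRA on `query` and `value` only, and a fully trained 768 -> 2 classifier head with bias. The actual PEFT output is the authoritative number.

```python
num_layers = 12        # BERT-base encoder layers (assumed)
hidden = 768           # BERT-base hidden size (assumed)
rank = 8
targets_per_layer = 2  # "query" and "value"

# Each targeted layer gets A (rank x hidden) and B (hidden x rank)
lora_params = num_layers * targets_per_layer * rank * (hidden + hidden)

# modules_to_save=["classifier"]: weight (2 x hidden) plus bias (2)
classifier_params = 2 * hidden + 2

expected_trainable = lora_params + classifier_params
print(f"LoRA params:        {lora_params:,}")
print(f"Classifier params:  {classifier_params:,}")
print(f"Expected trainable: {expected_trainable:,}")
```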

---

## Part 3: Memory Comparison - LoRA vs Full Fine-tuning

In [None]:
from transformers import Trainer, TrainingArguments
from datasets import load_dataset
import time

def measure_training_memory(model, tokenizer, dataset, use_lora=True, description=""):
    """Measure memory and time for training."""
    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats()
    
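    # Small subset for quick test. IMDB's train split is grouped by label,
    # so shuffle first to get both classes in the subset.
    shuffled = dataset['train'].shuffle(seed=42)
    small_train = shuffled.select(range(500))
    small_eval = shuffled.select(range(500, 600))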
    
    # Tokenize
    def tokenize(examples):
        return tokenizer(
            examples['text'],
            truncation=True,
            padding='max_length',
            max_length=128
        )
    
    tokenized_train = small_train.map(tokenize, batched=True, remove_columns=['text'])
    tokenized_train = tokenized_train.rename_column('label', 'labels')
    tokenized_train.set_format('torch')
    
    tokenized_eval = small_eval.map(tokenize, batched=True, remove_columns=['text'])
    tokenized_eval = tokenized_eval.rename_column('label', 'labels')
    tokenized_eval.set_format('torch')
    
    # Training args
    args = TrainingArguments(
        output_dir="./temp_training",
        num_train_epochs=1,
        per_device_train_batch_size=8,
        per_device_eval_batch_size=8,
        learning_rate=2e-5,
        bf16=True,
        logging_strategy="no",
        save_strategy="no",
        report_to="none",
    )
    
    trainer = Trainer(
        model=model,
        args=args,
        train_dataset=tokenized_train,
        eval_dataset=tokenized_eval,
    )
    
    # Train and measure
    start_time = time.time()
    trainer.train()
    train_time = time.time() - start_time
    
    peak_memory = torch.cuda.max_memory_allocated() / 1e9
    
    return {
        'description': description,
        'peak_memory_gb': peak_memory,
        'train_time_seconds': train_time,
        'trainable_params': sum(p.numel() for p in model.parameters() if p.requires_grad)
    }

print("Memory measurement function ready!")

In [None]:
# Load dataset
print("Loading dataset...")
imdb = load_dataset("imdb")

tokenizer = AutoTokenizer.from_pretrained(model_name)

In [None]:
# Test 1: Full fine-tuning
print("\n" + "="*50)
print("TEST 1: FULL FINE-TUNING")
print("="*50)

# Fresh model for full fine-tuning
full_model = AutoModelForSequenceClassification.from_pretrained(
    model_name, num_labels=2, torch_dtype=torch.bfloat16
).cuda()

full_results = measure_training_memory(full_model, tokenizer, imdb, use_lora=False, description="Full Fine-tuning")
print(f"Peak memory: {full_results['peak_memory_gb']:.2f} GB")
print(f"Train time: {full_results['train_time_seconds']:.1f}s")
print(f"Trainable params: {full_results['trainable_params']:,}")

# Cleanup
del full_model
torch.cuda.empty_cache()

In [None]:
# Test 2: LoRA fine-tuning (r=8)
print("\n" + "="*50)
print("TEST 2: LoRA (r=8)")
print("="*50)

lora_model_r8 = AutoModelForSequenceClassification.from_pretrained(
    model_name, num_labels=2, torch_dtype=torch.bfloat16
)
lora_config_r8 = LoraConfig(
    r=8, lora_alpha=16, lora_dropout=0.1,
    target_modules=["query", "value"],
    task_type=TaskType.SEQ_CLS,
    modules_to_save=["classifier"]
)
lora_model_r8 = get_peft_model(lora_model_r8, lora_config_r8).cuda()

lora_r8_results = measure_training_memory(lora_model_r8, tokenizer, imdb, use_lora=True, description="LoRA r=8")
print(f"Peak memory: {lora_r8_results['peak_memory_gb']:.2f} GB")
print(f"Train time: {lora_r8_results['train_time_seconds']:.1f}s")
print(f"Trainable params: {lora_r8_results['trainable_params']:,}")

del lora_model_r8
torch.cuda.empty_cache()

In [None]:
# Test 3: LoRA fine-tuning (r=16)
print("\n" + "="*50)
print("TEST 3: LoRA (r=16)")
print("="*50)

lora_model_r16 = AutoModelForSequenceClassification.from_pretrained(
    model_name, num_labels=2, torch_dtype=torch.bfloat16
)
lora_config_r16 = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.1,
    target_modules=["query", "value"],
    task_type=TaskType.SEQ_CLS,
    modules_to_save=["classifier"]
)
lora_model_r16 = get_peft_model(lora_model_r16, lora_config_r16).cuda()

lora_r16_results = measure_training_memory(lora_model_r16, tokenizer, imdb, use_lora=True, description="LoRA r=16")
print(f"Peak memory: {lora_r16_results['peak_memory_gb']:.2f} GB")
print(f"Train time: {lora_r16_results['train_time_seconds']:.1f}s")
print(f"Trainable params: {lora_r16_results['trainable_params']:,}")

del lora_model_r16
torch.cuda.empty_cache()

In [None]:
# Summary comparison
print("\n" + "="*70)
print("MEMORY COMPARISON SUMMARY")
print("="*70)

results = [full_results, lora_r8_results, lora_r16_results]

print(f"{'Method':<18}{'Mem (GB)':<10}{'Time (s)':<10}{'Params':<14}{'Mem red.':<10}{'Param red.'}")
print("-"*70)

baseline_memory = full_results['peak_memory_gb']
baseline_params = full_results['trainable_params']

for r in results:
    # Report both reductions relative to full fine-tuning
    mem_reduction = f"{baseline_memory / r['peak_memory_gb']:.2f}x"
    param_reduction = f"{baseline_params / r['trainable_params']:.0f}x"
    print(f"{r['description']:<18}{r['peak_memory_gb']:<10.2f}{r['train_time_seconds']:<10.1f}{r['trainable_params']:<14,}{mem_reduction:<10}{param_reduction}")

---

## Part 4: Full Training with LoRA

In [None]:
# Full training pipeline with LoRA
from datasets import DatasetDict
import evaluate

print("Setting up full LoRA training pipeline...")

# Prepare data
train_val = imdb['train'].train_test_split(test_size=0.1, seed=42)
dataset = DatasetDict({
    'train': train_val['train'],
    'validation': train_val['test'],
    'test': imdb['test']
})

# Tokenize
def tokenize_function(examples):
    return tokenizer(
        examples['text'],
        truncation=True,
        padding='max_length',
        max_length=256
    )

tokenized = dataset.map(tokenize_function, batched=True, remove_columns=['text'])
tokenized = tokenized.rename_column('label', 'labels')

print(f"Train: {len(tokenized['train']):,}")
print(f"Validation: {len(tokenized['validation']):,}")
print(f"Test: {len(tokenized['test']):,}")

In [None]:
# Create model with LoRA
base_model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    num_labels=2,
    id2label={0: "NEGATIVE", 1: "POSITIVE"},
    label2id={"NEGATIVE": 0, "POSITIVE": 1},
    torch_dtype=torch.bfloat16
)

# LoRA config with more target modules
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.1,
    target_modules=["query", "key", "value", "dense"],  # Attention projections + every "dense" layer (attention output, FFN, pooler)
    task_type=TaskType.SEQ_CLS,
    modules_to_save=["classifier"]
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()

In [None]:
# Training setup
accuracy = evaluate.load("accuracy")
f1 = evaluate.load("f1")

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return {
        'accuracy': accuracy.compute(predictions=predictions, references=labels)['accuracy'],
        'f1': f1.compute(predictions=predictions, references=labels)['f1']
    }

training_args = TrainingArguments(
    output_dir="./lora_results",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
    learning_rate=1e-4,  # LoRA can use higher LR than full fine-tuning!
    warmup_ratio=0.1,
    eval_strategy="epoch",  # named evaluation_strategy in transformers < 4.41
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
    bf16=True,
    logging_steps=100,
    report_to="none",
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized['train'],
    eval_dataset=tokenized['validation'],
    compute_metrics=compute_metrics,
)

print("Trainer ready!")
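
With `warmup_ratio=0.1`, the learning rate ramps up over roughly the first 10% of optimizer steps. Here is a rough sketch of the step arithmetic, assuming a single GPU, no gradient accumulation, and the 22,500-example training split produced by the 90/10 split of IMDB's 25,000 training reviews. The exact rounding inside `Trainer` may differ.

```python
import math

train_examples = 22_500   # 90% of IMDB's 25,000 training reviews
batch_size = 16           # per_device_train_batch_size
epochs = 3                # num_train_epochs
warmup_ratio = 0.1

steps_per_epoch = math.ceil(train_examples / batch_size)  # last batch may be partial
total_steps = steps_per_epoch * epochs
warmup_steps = math.ceil(total_steps * warmup_ratio)

print(f"Steps per epoch: {steps_per_epoch}")
print(f"Total steps:     {total_steps}")
print(f"Warmup steps:    {warmup_steps}")
```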

In [None]:
# Train!
print("\nStarting LoRA training...")
train_result = trainer.train()

print(f"\nTraining complete!")
print(f"Time: {train_result.metrics['train_runtime']:.1f}s")
print(f"Samples/sec: {train_result.metrics['train_samples_per_second']:.1f}")

In [None]:
# Evaluate on test set
print("\nEvaluating on test set...")
test_results = trainer.evaluate(tokenized['test'])

print("\nTest Results:")
for key, value in test_results.items():
    if isinstance(value, float):
        print(f"  {key}: {value:.4f}")
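
The accuracy and F1 numbers reported by `compute_metrics` follow the standard definitions. Here is a minimal sketch with made-up predictions and labels. It computes binary F1 with label 1 as the positive class, which is assumed to match the `f1` metric's default.

```python
def accuracy_and_f1(predictions, references, positive_label=1):
    """Compute accuracy and binary F1 from two equal-length label lists."""
    pairs = list(zip(predictions, references))
    correct = sum(p == r for p, r in pairs)
    tp = sum(p == positive_label and r == positive_label for p, r in pairs)
    fp = sum(p == positive_label and r != positive_label for p, r in pairs)
    fn = sum(p != positive_label and r == positive_label for p, r in pairs)
    accuracy = correct / len(pairs)
    denominator = 2 * tp + fp + fn
    f1 = 2 * tp / denominator if denominator else 0.0
    return accuracy, f1


# Made-up toy data: 1 = POSITIVE, 0 = NEGATIVE
preds = [1, 0, 1, 1, 0, 0, 1, 0]
labels = [1, 0, 0, 1, 0, 1, 1, 0]
acc, f1_score = accuracy_and_f1(preds, labels)
print(f"accuracy={acc:.3f}, f1={f1_score:.3f}")
```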

---

## Part 5: Saving and Merging LoRA Adapters

In [None]:
# Save just the LoRA adapters (very small!)
adapter_path = "./lora_adapters"
model.save_pretrained(adapter_path)
tokenizer.save_pretrained(adapter_path)  # Always save tokenizer alongside adapter!

import os
print("Saved adapter files:")
for f in os.listdir(adapter_path):
    size_mb = os.path.getsize(os.path.join(adapter_path, f)) / 1e6
    print(f"  {f}: {size_mb:.2f} MB")

# Compare to full model size
adapter_size = sum(os.path.getsize(os.path.join(adapter_path, f)) for f in os.listdir(adapter_path)) / 1e6
print(f"\nTotal adapter size: {adapter_size:.2f} MB")
print("Full BERT-base model: ~440 MB (fp32 checkpoint)")
print(f"Savings: {440/adapter_size:.0f}x smaller!")

In [None]:
# Load adapters onto a fresh model
from peft import PeftModel

print("Loading adapters onto fresh base model...")

# Fresh base model
fresh_base = AutoModelForSequenceClassification.from_pretrained(
    model_name, num_labels=2, torch_dtype=torch.bfloat16
)

# Load adapters
loaded_model = PeftModel.from_pretrained(fresh_base, adapter_path)

print("Adapters loaded!")
loaded_model.print_trainable_parameters()

In [None]:
# Merge adapters into base model (for faster inference)
print("Merging adapters into base model...")

merged_model = loaded_model.merge_and_unload()

print(f"\nMerged model parameters: {sum(p.numel() for p in merged_model.parameters()):,}")
print("Adapters are now part of the model weights!")
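
Conceptually, merging folds the scaled low-rank product into the frozen weight: W' = W + (alpha / r) * B A. The following pure-Python sketch checks that the merged weight gives the same output as the two-path computation. It uses tiny made-up matrices and illustrative helper names, and it is not how PEFT implements `merge_and_unload` internally.

```python
def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)] for row in X]


def matvec(M, v):
    """Multiply a matrix by a vector."""
    return [sum(m * x_i for m, x_i in zip(row, v)) for row in M]


W = [[1.0, 0.0], [0.0, 1.0]]   # frozen weight, shape (2, 2)
B = [[1.0], [2.0]]             # LoRA B, shape (2, 1)
A = [[0.5, -0.5]]              # LoRA A, shape (1, 2)
scaling = 2.0                  # alpha / r
x = [3.0, 1.0]

# Two-path output: W x + scaling * B (A x)
two_path = [w + scaling * u for w, u in zip(matvec(W, x), matvec(B, matvec(A, x)))]

# Merged weight: W' = W + scaling * (B A), followed by a single matrix-vector product
BA = matmul(B, A)
W_merged = [[w + scaling * d for w, d in zip(w_row, d_row)] for w_row, d_row in zip(W, BA)]
merged = matvec(W_merged, x)

print(two_path, merged)
```

Because the merged model needs only one matrix multiply per layer, inference carries no extra adapter overhead.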

In [None]:
# Save merged model (full size, but includes LoRA changes)
merged_path = "./merged_model"
merged_model.save_pretrained(merged_path)
tokenizer.save_pretrained(merged_path)

print(f"\nMerged model saved to {merged_path}")
print("This model can be loaded without PEFT library!")

---

## Part 6: Multiple LoRA Adapters (Advanced)

One powerful feature: you can train multiple adapters and switch between them!
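
The idea can be illustrated without any libraries: one frozen weight is shared, and each named adapter contributes its own low-rank pair. The class `ToyMultiAdapterLayer` below is a hypothetical toy written for this explanation, not PEFT's API.

```python
class ToyMultiAdapterLayer:
    """Toy layer: one shared frozen weight plus named rank-1 adapters."""

    def __init__(self, weight):
        self.weight = weight    # frozen matrix, list of rows
        self.adapters = {}      # adapter name -> (B column, A row)
        self.active = None

    def add_adapter(self, name, b_col, a_row):
        self.adapters[name] = (b_col, a_row)

    def set_adapter(self, name):
        if name not in self.adapters:
            raise KeyError(f"Unknown adapter: {name}")
        self.active = name

    def forward(self, x):
        out = [sum(w * v for w, v in zip(row, x)) for row in self.weight]
        if self.active is not None:
            b_col, a_row = self.adapters[self.active]
            a_dot_x = sum(a * v for a, v in zip(a_row, x))  # A x (a scalar for rank 1)
            out = [o + b * a_dot_x for o, b in zip(out, b_col)]
        return out


toy_layer = ToyMultiAdapterLayer([[1.0, 0.0], [0.0, 1.0]])
toy_layer.add_adapter("sentiment", b_col=[1.0, 0.0], a_row=[1.0, 1.0])
toy_layer.add_adapter("spam", b_col=[0.0, 1.0], a_row=[1.0, -1.0])

toy_input = [2.0, 3.0]
toy_layer.set_adapter("sentiment")
sentiment_out = toy_layer.forward(toy_input)
toy_layer.set_adapter("spam")
spam_out = toy_layer.forward(toy_input)
print(sentiment_out, spam_out)
```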

In [None]:
# Demonstrate multiple adapter concept
print("Multiple LoRA Adapters Concept:")
print("\n1. Train one adapter for sentiment analysis")
print("2. Train another adapter for spam detection")
print("3. Load same base model")
print("4. Switch adapters based on task!")

print("\n" + "="*50)
print("ADAPTER SWITCHING EXAMPLE")
print("="*50)

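# This would work like (load_adapter needs an explicit adapter_name):
# model.load_adapter("sentiment_adapter", adapter_name="sentiment")
# model.set_adapter("sentiment")
# output1 = model(**sentiment_input)
#
# model.load_adapter("spam_adapter", adapter_name="spam")
# model.set_adapter("spam")
# output2 = model(**spam_input)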

print("""
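from transformers import AutoModelForSequenceClassification
from peft import PeftModel

# Load base model once (with a classification head, matching the adapters' task)
base = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)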

# Add multiple adapters
model = PeftModel.from_pretrained(base, "sentiment_adapter")
model.load_adapter("spam_adapter", adapter_name="spam")

# Switch between them
model.set_adapter("default")  # sentiment
sentiment_output = model(**sentiment_inputs)

model.set_adapter("spam")  # spam
spam_output = model(**spam_inputs)
""")

---

## Try It Yourself: Experiment with LoRA Configurations

Try different LoRA configurations and compare results:
1. Different ranks (4, 8, 16, 32, 64)
2. Different target modules
3. Different alpha values

<details>
<summary>Hint</summary>

```python
# Try these configurations:
configs = [
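    # task_type=TaskType.SEQ_CLS keeps the classifier head trainable
    LoraConfig(r=4, lora_alpha=8, target_modules=["query", "value"], task_type=TaskType.SEQ_CLS),
    LoraConfig(r=16, lora_alpha=32, target_modules=["query", "value"], task_type=TaskType.SEQ_CLS),
    LoraConfig(r=8, lora_alpha=16, target_modules=["query", "key", "value", "dense"], task_type=TaskType.SEQ_CLS),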
]
```
</details>

In [None]:
# YOUR CODE HERE
# Experiment with different LoRA configurations




---

## Common Mistakes

### Mistake 1: Wrong Target Modules

In [None]:
# WRONG: Guessing module names
# LoraConfig(target_modules=["attention", "mlp"])  # May not match!

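# CORRECT: Check actual module names on a fresh (unwrapped) model.
# base_model was modified in place by get_peft_model, so its module names now include LoRA wrappers.
inspect_model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)
print("Finding target modules:")
print("\nMethod 1: Print named modules")
for name, module in inspect_model.named_modules():
    if isinstance(module, nn.Linear):
        if 'attention' in name.lower() or 'query' in name.lower():
            print(f"  {name}")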

print("\nCommon patterns by model architecture:")
print("  BERT/RoBERTa:    'query', 'key', 'value', 'dense'")
print("  DistilBERT:      'q_lin', 'k_lin', 'v_lin'")
print("  GPT-2:           'c_attn', 'c_proj', 'c_fc' (Conv1D layers; set fan_in_fan_out=True)")
print("  LLaMA/Mistral:   'q_proj', 'k_proj', 'v_proj', 'o_proj'")
print("  DeBERTa-v2/v3:   'query_proj', 'key_proj', 'value_proj'")
print("  T5:              'q', 'k', 'v', 'o'")
print("\nTip: Always inspect your model's named_modules() to find correct names!")

### Mistake 2: Not Saving the Tokenizer

In [None]:
# WRONG: Only saving adapter
# model.save_pretrained(path)  # Missing tokenizer!

# CORRECT: Save both
# model.save_pretrained(path)
# tokenizer.save_pretrained(path)

print("Always save the tokenizer with your adapter!")

### Mistake 3: Using Too High a Learning Rate with LoRA

In [None]:
# LoRA allows higher LR than full fine-tuning, but not too high!
print("Learning rate guidelines:")
print("\nFull fine-tuning:")
print("  1e-5 to 5e-5 (conservative)")
print("\nLoRA fine-tuning:")
print("  1e-4 to 3e-4 (can be higher!)")
print("\nBut still not:")
print("  1e-2+ (too high, unstable training)")

---

## Checkpoint

You've learned:
- ✅ The theory behind LoRA (low-rank decomposition)
- ✅ How to configure LoRA with PEFT
- ✅ Memory savings comparison (LoRA vs full fine-tuning)
- ✅ How to train with LoRA
- ✅ How to save and merge adapters
- ✅ When and why to use PEFT methods

---

## Challenge: LoRA for a Large Model

Apply LoRA to fine-tune a larger model (e.g., `microsoft/deberta-v3-base` or `roberta-large`) and compare the memory savings!

In [None]:
# YOUR CHALLENGE CODE HERE




---

## Further Reading

- [LoRA Paper](https://arxiv.org/abs/2106.09685)
- [PEFT Documentation](https://huggingface.co/docs/peft)
- [QLoRA Paper](https://arxiv.org/abs/2305.14314) (LoRA + Quantization)
- [PEFT Methods Comparison](https://huggingface.co/blog/peft)

---

## Cleanup

In [None]:
import os
import shutil
import gc

# Clean up saved files
for path in ["./lora_results", "./lora_adapters", "./merged_model", "./temp_training"]:
    if os.path.exists(path):
        shutil.rmtree(path)
        print(f"Removed {path}")

# Clear memory
gc.collect()
torch.cuda.empty_cache()

print("\nCleanup complete!")

---

## Next Steps

In the next notebook, **06-model-upload.ipynb**, we'll learn how to share your fine-tuned models on the Hugging Face Hub, including creating proper model cards!

Great job completing Task 9.5! You now understand how to efficiently fine-tune large models with LoRA!