# Task 10.2: Fine-Tuning Llama 3.1 8B with LoRA

**Module:** 10 - Large Language Model Fine-Tuning  
**Time:** 3 hours  
**Difficulty:** ⭐⭐⭐☆☆

---

## Learning Objectives

By the end of this notebook, you will:
- [ ] Load Llama 3.1 8B with 4-bit quantization on DGX Spark
- [ ] Apply LoRA adapters to attention layers using PEFT
- [ ] Prepare an instruction-following dataset
- [ ] Fine-tune using Unsloth for 2x faster training
- [ ] Evaluate the fine-tuned model on a held-out test set

---

## Prerequisites

- Completed: Task 10.1 (LoRA Theory)
- Knowledge of: PyTorch training loops, Hugging Face Transformers
- Required: Llama 3.1 model access (request it on the model's Hugging Face page, then authenticate with `huggingface-cli login`)

---

## Real-World Context

### Why Fine-Tune an 8B Model?

Imagine you're building a **customer support chatbot** for your company. The base Llama model is great at general conversation, but:

- It doesn't know your product names or features
- It doesn't follow your company's communication style
- It might give incorrect information about your services

**Fine-tuning** teaches the model your specific domain while preserving most of its general knowledge.

### What Makes 8B Special?

| Model Size | Capability | DGX Spark Training |
|------------|------------|--------------------|
| 1-3B | Basic tasks | Full fine-tuning possible |
| **8B** | **Production quality** | **LoRA is perfect** |
| 70B | State-of-the-art | QLoRA required |

8B is the **sweet spot** - smart enough for most tasks, small enough to iterate quickly.

---

## ELI5: What is Fine-Tuning?

> **Imagine you hired a brilliant new employee (the base model).** They went to the best schools and know a lot about everything.
>
> But on their first day, they don't know:
> - Your company's product names
> - How you talk to customers
> - Your internal processes
>
> **Fine-tuning is like onboarding.** You show them examples of how things are done at YOUR company. After a few days, they combine their general brilliance with your specific knowledge.
>
> **With LoRA, it's even better:** Instead of retraining them from scratch (expensive!), you just give them a small "cheat sheet" of adjustments. They keep everything they learned before, plus your specific tweaks.

---

## Part 1: Environment Setup

First, let's set up our environment. We'll use Unsloth for faster training.

In [None]:
# Setup cell - run this first!
import warnings
warnings.filterwarnings('ignore')

import os
import sys
import json
import gc
from datetime import datetime

import torch
import numpy as np
from typing import Dict, List, Optional, Tuple

# Set environment variables for optimal performance
os.environ['TOKENIZERS_PARALLELISM'] = 'false'
os.environ['PYTORCH_CUDA_ALLOC_CONF'] = 'expandable_segments:True'

print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name()}")
    print(f"Total GPU memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")
    print(f"CUDA version: {torch.version.cuda}")

In [None]:
# Memory monitoring utility
def print_gpu_memory(label: str = ""):
    """Print current GPU memory usage."""
    if torch.cuda.is_available():
        allocated = torch.cuda.memory_allocated() / 1e9
        reserved = torch.cuda.memory_reserved() / 1e9
        print(f"[{label}] GPU Memory: {allocated:.2f} GB allocated, {reserved:.2f} GB reserved")

def clear_memory():
    """Clear GPU memory."""
    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
        torch.cuda.synchronize()

print_gpu_memory("Initial")

### Installing Required Libraries

We'll use Unsloth for faster training when available. Install inside your NGC container:

```bash
# IMPORTANT: Run inside NGC PyTorch container on DGX Spark
# The NGC container already includes PyTorch, CUDA, and most dependencies

# Install fine-tuning libraries (these work on ARM64)
pip install --no-deps trl peft accelerate

# Unsloth provides 2x faster training (ARM64 support may vary)
# If this fails, the notebook automatically falls back to standard transformers
pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git" || echo "Unsloth not available, using standard transformers"

# bitsandbytes should already be in NGC container
# If not: pip install bitsandbytes
```

**Note for DGX Spark users:**
- The NGC PyTorch container is recommended (includes ARM64-optimized libraries)
- If Unsloth installation fails, don't worry - the notebook automatically falls back to standard transformers
- Standard transformers work perfectly on DGX Spark, just ~2x slower than Unsloth

In [None]:
# Note: warnings were already suppressed in the setup cell, so there is no need to repeat that here

import torch
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
)
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from datasets import Dataset
from trl import SFTTrainer
import gc

# Try Unsloth for faster training (optional - falls back gracefully)
try:
    from unsloth import FastLanguageModel
    USE_UNSLOTH = True
    print("Unsloth available - will use for 2x faster training!")
except ImportError:
    USE_UNSLOTH = False
    print("Using standard transformers (Unsloth not available - training will be slower but works fine)")

print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")

---

## Part 2: Loading the Model with 4-bit Quantization

### ELI5: What is 4-bit Quantization?

> **Imagine you have a very detailed map (the model weights)** with every tiny road and path marked.
>
> **Normal precision (16-bit):** The map shows roads accurate to the centimeter. Very detailed, but takes up lots of space.
>
> **4-bit quantization:** The map shows roads accurate to the meter. You lose some tiny details, but the major roads are still clear, and the map fits in your pocket!
>
> **For LLMs:** An 8B model at 16-bit takes ~16GB. At 4-bit, it takes ~4-5GB. You can now fit it in memory with room for training!

In [None]:
# Model configuration
MODEL_NAME = "meta-llama/Llama-3.1-8B-Instruct"
# Alternatives if you don't have Llama access:
# MODEL_NAME = "mistralai/Mistral-7B-Instruct-v0.2"
# MODEL_NAME = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"  # For quick testing

MAX_SEQ_LENGTH = 2048
LOAD_IN_4BIT = True

print(f"Model: {MODEL_NAME}")
print(f"Max sequence length: {MAX_SEQ_LENGTH}")
print(f"4-bit quantization: {LOAD_IN_4BIT}")

# Optional: Clear buffer cache for cleaner memory state
# This is CRITICAL for 70B models, helpful for 8B to ensure stable loading
import subprocess
try:
    # -n makes sudo fail immediately instead of waiting for a password prompt
    subprocess.run(["sudo", "-n", "sh", "-c", "sync; echo 3 > /proc/sys/vm/drop_caches"],
                   check=True, capture_output=True)
    print("Buffer cache cleared for optimal memory state")
except (subprocess.CalledProcessError, FileNotFoundError):
    print("Could not clear buffer cache (requires passwordless sudo) - continuing anyway")

In [None]:
# Load model with Unsloth (faster) or standard transformers
print_gpu_memory("Before loading")

if USE_UNSLOTH:
    # Unsloth path - 2x faster, 50% less memory
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name=MODEL_NAME,
        max_seq_length=MAX_SEQ_LENGTH,
        dtype=torch.bfloat16,  # Use bfloat16 for Blackwell
        load_in_4bit=LOAD_IN_4BIT,
    )
    print("Model loaded with Unsloth!")
    
else:
    # Standard transformers path
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=LOAD_IN_4BIT,
        bnb_4bit_quant_type="nf4",  # Normal Float 4-bit
        bnb_4bit_compute_dtype=torch.bfloat16,
        bnb_4bit_use_double_quant=True,  # Double quantization saves more memory
    )
    
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_NAME,
        quantization_config=bnb_config,
        device_map="auto",
        torch_dtype=torch.bfloat16,
        # Note: trust_remote_code not needed for official Llama models
    )
    
    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    tokenizer.pad_token = tokenizer.eos_token
    tokenizer.padding_side = "right"
    
    # Prepare for k-bit training
    model = prepare_model_for_kbit_training(model)
    print("Model loaded with standard transformers!")

print_gpu_memory("After loading")
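
The memory figures quoted for 16-bit versus 4-bit loading can be checked with a little arithmetic. The sketch below is a back-of-the-envelope estimate, not a measurement: the parameter count of 8.03 billion is an approximation, and the 0.127 bits per parameter of overhead for double-quantized NF4 constants is the figure reported in the QLoRA paper. Some layers, such as the embeddings, typically stay in 16-bit, so the usage you observe after loading will be somewhat higher.

```python
def weight_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


# Assumption: approximate parameter count for Llama 3.1 8B
N_PARAMS = 8.03e9

bf16_gb = weight_memory_gb(N_PARAMS, 16)
nf4_gb = weight_memory_gb(N_PARAMS, 4)
# NF4 also stores quantization constants per block of weights;
# with double quantization this adds roughly 0.127 bits per parameter.
nf4_dq_gb = weight_memory_gb(N_PARAMS, 4 + 0.127)

print(f"bf16 weights:              {bf16_gb:.2f} GB")
print(f"NF4 weights (no overhead): {nf4_gb:.2f} GB")
print(f"NF4 + double-quant consts: {nf4_dq_gb:.2f} GB")
```

Compare these estimates with the "After loading" line printed by `print_gpu_memory` to see how much of the footprint comes from unquantized layers and allocator overhead.
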

In [None]:
# Verify model is loaded correctly
print(f"Model type: {type(model).__name__}")
print(f"Model parameters: {sum(p.numel() for p in model.parameters()):,}")
print(f"Tokenizer vocabulary size: {len(tokenizer)}")

# Quick test
test_prompt = "Hello, my name is"
inputs = tokenizer(test_prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=10, do_sample=False)
print(f"\nTest generation: {tokenizer.decode(outputs[0], skip_special_tokens=True)}")

---

## Part 3: Adding LoRA Adapters

Now we'll add LoRA adapters to the model. We'll only train these small adapter weights, not the full model.
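
To make "small adapter weights" concrete, here is a toy, pure-Python sketch of the LoRA update for a single linear layer. It is an illustration only: the dimensions are tiny, the Gaussian initialization of `A` is a stand-in chosen for this example, and the helper names `matmul` and `lora_delta` are hypothetical. It shows two properties recalled from Task 10.1: the update is `(alpha / r) * B @ A`, and because `B` starts at zero the adapted layer initially behaves exactly like the frozen one.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_delta(A, B, alpha, r):
    """Delta W = (alpha / r) * B @ A, with A: r x d_in and B: d_out x r."""
    scale = alpha / r
    return [[scale * v for v in row] for row in matmul(B, A)]


d_in, d_out, r, alpha = 6, 4, 2, 4
rng = random.Random(0)
A = [[rng.gauss(0, 0.02) for _ in range(d_in)] for _ in range(r)]  # random init (toy choice)
B = [[0.0] * r for _ in range(d_out)]                              # zero init

delta = lora_delta(A, B, alpha, r)
is_zero_at_init = all(v == 0.0 for row in delta for v in row)

full_params = d_in * d_out          # weights in the frozen layer
lora_param_count = r * (d_in + d_out)  # weights in A plus B

print(f"Delta is zero at init: {is_zero_at_init}")
print(f"Frozen weights: {full_params}, LoRA weights: {lora_param_count}")
```

At this toy size the adapter is almost as large as the layer itself; the savings appear when `d_in` and `d_out` are in the thousands and `r` stays small.
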

In [None]:
# LoRA Configuration
LORA_CONFIG = {
    "r": 16,  # Rank - higher = more capacity but more memory
    "lora_alpha": 32,  # Scaling factor - usually 2*r
    "lora_dropout": 0.05,  # Dropout for regularization
    "target_modules": [
        "q_proj",  # Query projection
        "k_proj",  # Key projection  
        "v_proj",  # Value projection
        "o_proj",  # Output projection
        # Uncomment for more capacity (uses more memory):
        # "gate_proj",
        # "up_proj", 
        # "down_proj",
    ],
    "bias": "none",  # Don't train biases
    "task_type": "CAUSAL_LM",
}

print("LoRA Configuration:")
for key, value in LORA_CONFIG.items():
    print(f"  {key}: {value}")

In [None]:
# Apply LoRA to the model
if USE_UNSLOTH:
    # Unsloth's optimized LoRA
    model = FastLanguageModel.get_peft_model(
        model,
        r=LORA_CONFIG["r"],
        lora_alpha=LORA_CONFIG["lora_alpha"],
        lora_dropout=LORA_CONFIG["lora_dropout"],
        target_modules=LORA_CONFIG["target_modules"],
        bias="none",
        use_gradient_checkpointing="unsloth",  # Unsloth's optimized checkpointing
        random_state=42,
    )
else:
    # Standard PEFT
    lora_config = LoraConfig(**LORA_CONFIG)
    model = get_peft_model(model, lora_config)
    model.enable_input_require_grads()

# Print trainable parameters
def print_trainable_parameters(model):
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        all_param += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    print(
        f"Trainable params: {trainable_params:,} || "
        f"All params: {all_param:,} || "
        f"Trainable%: {100 * trainable_params / all_param:.2f}%"
    )
    return trainable_params, all_param

trainable, total = print_trainable_parameters(model)
print_gpu_memory("After LoRA")

### What's Happening?

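Notice that we're training only a small fraction of the total parameters. This is the magic of LoRA:

- **~8B parameters** in the base model
- **~13.6M trainable LoRA parameters** with the configuration above (roughly 8-80M across typical configs)
- **99%+ of weights are frozen**

With 4-bit loading, the printed percentage can look higher than the true value of about 0.17%: bitsandbytes packs two 4-bit weights into each stored byte, so `param.numel()` reports roughly half the real count for quantized layers.

This means:
- Much less memory for gradients and optimizer states
- Faster training iterations
- Lower risk of "catastrophic forgetting", since the base weights stay frozen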
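
The trainable-parameter count can be derived by hand from the layer shapes. The sketch below assumes the published Llama 3.1 8B dimensions (hidden size 4096, key/value projection width 1024 from grouped-query attention, MLP width 14336, 32 layers); the helper `lora_params` is a name introduced here for illustration. Each adapted `Linear(d_in -> d_out)` gains an `A` matrix of shape `r x d_in` and a `B` matrix of shape `d_out x r`.

```python
def lora_params(r: int, shapes: dict, num_layers: int) -> int:
    """Count LoRA weights: r * (d_in + d_out) per adapted projection, per layer."""
    per_layer = sum(r * (d_in + d_out) for d_in, d_out in shapes.values())
    return per_layer * num_layers


# Assumed Llama 3.1 8B dimensions
HIDDEN, KV_DIM, MLP_DIM, LAYERS = 4096, 1024, 14336, 32

ATTN = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
}
MLP = {
    "gate_proj": (HIDDEN, MLP_DIM),
    "up_proj": (HIDDEN, MLP_DIM),
    "down_proj": (MLP_DIM, HIDDEN),
}

attn_only = lora_params(16, ATTN, LAYERS)
attn_and_mlp = lora_params(16, {**ATTN, **MLP}, LAYERS)

print(f"r=16, attention only:  {attn_only:,}")
print(f"r=16, attention + MLP: {attn_and_mlp:,}")
```

The count grows linearly with `r`, so doubling the rank doubles the adapter size; adding the MLP projections roughly triples it at the same rank.
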

---

## Part 4: Preparing the Training Dataset

We'll create a sample instruction-following dataset. In practice, you'd use your own domain-specific data.

In [None]:
# Sample instruction dataset - replace with your own data!
sample_data = [
    {
        "instruction": "What is machine learning?",
        "input": "",
        "output": "Machine learning is a subset of artificial intelligence that enables computers to learn patterns from data without being explicitly programmed. Instead of following hard-coded rules, ML algorithms improve their performance through experience, identifying patterns and making decisions based on the data they're trained on."
    },
    {
        "instruction": "Explain the concept of overfitting.",
        "input": "",
        "output": "Overfitting occurs when a machine learning model learns the training data too well, including its noise and outliers. This results in excellent performance on training data but poor generalization to new, unseen data. It's like memorizing answers instead of understanding concepts - you'll ace the practice test but struggle with new questions."
    },
    {
        "instruction": "What is the difference between supervised and unsupervised learning?",
        "input": "",
        "output": "Supervised learning uses labeled data where we know the correct answers (like classifying emails as spam/not spam). The model learns to map inputs to known outputs. Unsupervised learning works with unlabeled data, finding hidden patterns or structures (like grouping similar customers). Think of it as learning with a teacher vs. exploring on your own."
    },
    {
        "instruction": "Summarize the following text.",
        "input": "The transformer architecture revolutionized natural language processing when it was introduced in 2017. Unlike previous models that processed text sequentially, transformers can process entire sequences in parallel using a mechanism called self-attention. This allows them to capture long-range dependencies in text more effectively and train much faster on modern hardware.",
        "output": "Transformers, introduced in 2017, transformed NLP by processing text in parallel using self-attention, enabling better capture of long-range dependencies and faster training compared to sequential models."
    },
    {
        "instruction": "Write a Python function to calculate factorial.",
        "input": "",
        "output": "```python\ndef factorial(n):\n    if n < 0:\n        raise ValueError('Factorial is not defined for negative numbers')\n    if n <= 1:\n        return 1\n    return n * factorial(n - 1)\n\n# Example usage:\nprint(factorial(5))  # Output: 120\n```"
    },
    {
        "instruction": "What are the key hyperparameters in training neural networks?",
        "input": "",
        "output": "Key hyperparameters include: 1) Learning rate - controls step size during optimization, 2) Batch size - number of samples processed before updating weights, 3) Number of epochs - full passes through the dataset, 4) Network architecture - layers, neurons, activation functions, 5) Regularization - dropout rate, weight decay, 6) Optimizer choice - Adam, SGD, etc."
    },
    {
        "instruction": "Explain gradient descent in simple terms.",
        "input": "",
        "output": "Imagine you're blindfolded on a hilly terrain, trying to find the lowest valley. Gradient descent is like feeling the slope with your feet and taking small steps downhill. The 'gradient' tells you which direction is steepest, and you keep stepping until you reach a valley (minimum loss). The step size is your 'learning rate'."
    },
    {
        "instruction": "What is transfer learning?",
        "input": "",
        "output": "Transfer learning reuses knowledge from a model trained on one task to help with a different but related task. Like how learning to ride a bicycle helps you learn to ride a motorcycle - you don't start from scratch. In AI, we often take models pre-trained on large datasets and fine-tune them for specific tasks, saving time and data."
    },
    {
        "instruction": "Compare batch normalization and layer normalization.",
        "input": "",
        "output": "Batch normalization normalizes across the batch dimension - it computes statistics from all samples in a batch. Layer normalization normalizes across the feature dimension - each sample is normalized independently. BatchNorm works well for CNNs and large batches, while LayerNorm is preferred for transformers and RNNs where batch sizes may vary."
    },
    {
        "instruction": "What is LoRA and why is it useful?",
        "input": "",
        "output": "LoRA (Low-Rank Adaptation) is a technique for efficient fine-tuning of large language models. Instead of updating all model weights, LoRA adds small trainable matrices that capture the changes needed for your task. This reduces memory requirements by 10-100x and allows fine-tuning of models that wouldn't otherwise fit in GPU memory."
    },
]

print(f"Created {len(sample_data)} training examples")

In [None]:
# Format data for Llama 3.1 Instruct format
def format_instruction_llama3(example: dict) -> str:
    """
    Format a single example into Llama 3.1 Instruct format.
    
    Llama 3.1 uses this format:
    <|begin_of_text|><|start_header_id|>system<|end_header_id|>
    {system_message}<|eot_id|><|start_header_id|>user<|end_header_id|>
    {user_message}<|eot_id|><|start_header_id|>assistant<|end_header_id|>
    {assistant_message}<|eot_id|>
    """
    system_message = "You are a helpful AI assistant specializing in machine learning and data science."
    
    # Combine instruction and input if input exists
    if example.get("input", ""):
        user_message = f"{example['instruction']}\n\n{example['input']}"
    else:
        user_message = example['instruction']
    
    # Format in Llama 3.1 style
    formatted = f"""<|begin_of_text|><|start_header_id|>system<|end_header_id|>

{system_message}<|eot_id|><|start_header_id|>user<|end_header_id|>

{user_message}<|eot_id|><|start_header_id|>assistant<|end_header_id|>

{example['output']}<|eot_id|>"""
    
    return formatted

# Test formatting
print("Sample formatted example:")
print("=" * 50)
print(format_instruction_llama3(sample_data[0]))
print("=" * 50)
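
The same layout generalizes to any list of chat messages. The helper below, `render_llama3`, is a hand-rolled sketch written for this notebook rather than a library function: it reproduces the header and `<|eot_id|>` structure used above and can optionally leave the assistant header open for generation. The tokenizer's own `apply_chat_template` method remains the authoritative source for the format, and its output may differ in details such as default system text.

```python
def render_llama3(messages, add_generation_prompt=False):
    """Render [{'role': ..., 'content': ...}, ...] in the layout shown above."""
    text = "<|begin_of_text|>"
    for message in messages:
        text += (
            f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
            f"{message['content']}<|eot_id|>"
        )
    if add_generation_prompt:
        # Leave the assistant turn open so the model writes the reply
        text += "<|start_header_id|>assistant<|end_header_id|>\n\n"
    return text


demo_messages = [
    {"role": "system", "content": "You are a helpful AI assistant."},
    {"role": "user", "content": "What is machine learning?"},
]
training_text = render_llama3(
    demo_messages + [{"role": "assistant", "content": "A way to learn from data."}]
)
inference_prompt = render_llama3(demo_messages, add_generation_prompt=True)

print(training_text)
```

One thing worth checking on your setup: many tokenizers prepend the beginning-of-text token themselves, so a string that already starts with `<|begin_of_text|>` can end up with two. Inspecting the first few token ids of a tokenized example will show whether that happens.
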

In [None]:
# Create dataset
formatted_data = [{"text": format_instruction_llama3(ex)} for ex in sample_data]
dataset = Dataset.from_list(formatted_data)

# Split into train/eval
dataset = dataset.train_test_split(test_size=0.2, seed=42)
train_dataset = dataset['train']
eval_dataset = dataset['test']

print(f"Training examples: {len(train_dataset)}")
print(f"Evaluation examples: {len(eval_dataset)}")

---

## Part 5: Training with Unsloth/SFTTrainer

Now the exciting part - actually training the model!

In [None]:
# Training configuration
OUTPUT_DIR = "./llama3-8b-lora-finetuned"

training_args = TrainingArguments(
    output_dir=OUTPUT_DIR,
    
    # Training hyperparameters
    num_train_epochs=3,
    per_device_train_batch_size=2,  # Adjust based on memory
    per_device_eval_batch_size=2,
    gradient_accumulation_steps=4,  # Effective batch size = 2 * 4 = 8
    
    # Optimizer settings
    learning_rate=2e-4,
    weight_decay=0.01,
    warmup_ratio=0.03,
    lr_scheduler_type="cosine",
    
    # Memory optimization
    fp16=False,  # Use bf16 instead on Blackwell
    bf16=True,
    gradient_checkpointing=True,
    
    # Logging
    logging_steps=10,
    eval_strategy="steps",
    eval_steps=50,
    save_strategy="steps",
    save_steps=100,
    save_total_limit=2,
    
    # Other
    seed=42,
    report_to="none",  # Disable wandb/tensorboard
)

print("Training configuration:")
print(f"  Epochs: {training_args.num_train_epochs}")
print(f"  Batch size: {training_args.per_device_train_batch_size}")
print(f"  Gradient accumulation: {training_args.gradient_accumulation_steps}")
print(f"  Effective batch size: {training_args.per_device_train_batch_size * training_args.gradient_accumulation_steps}")
print(f"  Learning rate: {training_args.learning_rate}")
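
It helps to work out how many optimizer steps this configuration produces before training. The sketch below is an approximation of the Trainer's bookkeeping (exact rounding can vary between transformers versions), and `optimizer_steps` is a hypothetical helper. With the 8 training examples left after the 80/20 split, the run takes only 3 optimizer steps in total, so `logging_steps=10`, `eval_steps=50`, and `save_steps=100` are never reached. Those values suit a real dataset with hundreds of examples or more.

```python
import math


def optimizer_steps(n_examples, per_device_batch, grad_accum, epochs, n_devices=1):
    """Estimate total optimizer steps for a training run."""
    batches_per_epoch = math.ceil(n_examples / (per_device_batch * n_devices))
    steps_per_epoch = max(1, math.ceil(batches_per_epoch / grad_accum))
    return steps_per_epoch * epochs


# Sample dataset in this notebook: 8 training examples
sample_steps = optimizer_steps(n_examples=8, per_device_batch=2, grad_accum=4, epochs=3)
# A more realistic dataset size for comparison
larger_steps = optimizer_steps(n_examples=1000, per_device_batch=2, grad_accum=4, epochs=3)

print(f"Sample dataset: {sample_steps} optimizer steps")
print(f"1000 examples:  {larger_steps} optimizer steps")
```

For a quick smoke test on the sample data, smaller logging and evaluation intervals make the progress visible.
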

In [None]:
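# Create the trainer
# Note: these keyword arguments follow older TRL releases. Newer releases
# rename `tokenizer` to `processing_class` and move `dataset_text_field`,
# `max_seq_length`, and `packing` into `SFTConfig`; check your installed version.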
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    dataset_text_field="text",
    max_seq_length=MAX_SEQ_LENGTH,
    args=training_args,
    packing=False,  # Set to True for more efficient training with short examples
)

print("Trainer created successfully!")
print_gpu_memory("Before training")

In [None]:
# Train!
print("Starting training...")
print("=" * 50)

start_time = datetime.now()
train_result = trainer.train()
end_time = datetime.now()

print("=" * 50)
print(f"Training completed in {end_time - start_time}")
print_gpu_memory("After training")

# Print training metrics
metrics = train_result.metrics
print("\nTraining Metrics:")
print(f"  Total steps: {train_result.global_step}")
for label, key in [
    ("Training loss", "train_loss"),
    ("Training runtime (s)", "train_runtime"),
    ("Samples/second", "train_samples_per_second"),
]:
    value = metrics.get(key)
    # Only apply numeric formatting when the metric is actually present
    print(f"  {label}: {value:.4f}" if value is not None else f"  {label}: N/A")

---

## Part 6: Evaluating the Fine-Tuned Model

Let's test our fine-tuned model!

In [None]:
# Put model in inference mode
if USE_UNSLOTH:
    FastLanguageModel.for_inference(model)
model.eval()

def generate_response(prompt: str, max_new_tokens: int = 256) -> str:
    """
    Generate a response from the fine-tuned model.
    """
    # Format prompt in Llama 3.1 style
    system_message = "You are a helpful AI assistant specializing in machine learning and data science."
    formatted_prompt = f"""<|begin_of_text|><|start_header_id|>system<|end_header_id|>

{system_message}<|eot_id|><|start_header_id|>user<|end_header_id|>

{prompt}<|eot_id|><|start_header_id|>assistant<|end_header_id|>

"""
    
    inputs = tokenizer(formatted_prompt, return_tensors="pt").to(model.device)
    
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            temperature=0.7,
            top_p=0.9,
            do_sample=True,
            pad_token_id=tokenizer.eos_token_id,
        )
    
    # Decode only the generated part
    generated = outputs[0][inputs['input_ids'].shape[1]:]
    response = tokenizer.decode(generated, skip_special_tokens=True)
    
    return response.strip()

In [None]:
# Test prompts - mix of in-domain and out-of-domain
test_prompts = [
    "What is the difference between LoRA and full fine-tuning?",
    "Explain attention mechanism in transformers.",
    "Write a Python function to calculate the mean of a list.",
    "What is regularization and why is it important?",
]

print("Testing fine-tuned model:")
print("=" * 70)

for prompt in test_prompts:
    print(f"\nPrompt: {prompt}")
    print("-" * 50)
    response = generate_response(prompt)
    print(f"Response: {response}")
    print("=" * 70)

---

## Part 7: Saving the Model

We have two options:
1. **Save LoRA adapters only** (small, ~50MB)
2. **Merge and save full model** (large, ~16GB)

In [None]:
# Option 1: Save LoRA adapters only (recommended for most use cases)
ADAPTER_PATH = "./llama3-8b-lora-adapter"

model.save_pretrained(ADAPTER_PATH)
tokenizer.save_pretrained(ADAPTER_PATH)

print(f"LoRA adapter saved to {ADAPTER_PATH}")

# Check size
import os
adapter_size = sum(os.path.getsize(os.path.join(ADAPTER_PATH, f)) 
                   for f in os.listdir(ADAPTER_PATH) 
                   if os.path.isfile(os.path.join(ADAPTER_PATH, f)))
print(f"Adapter size: {adapter_size / 1e6:.2f} MB")

In [None]:
# Option 2: Merge LoRA into base model and save (for deployment)
# Warning: This creates a ~16GB model!

MERGED_MODEL_PATH = "./llama3-8b-merged"
SAVE_MERGED = False  # Set to True to save merged model

if SAVE_MERGED:
    print("Merging LoRA weights into base model...")
    
    if USE_UNSLOTH:
        # Unsloth method
        model.save_pretrained_merged(
            MERGED_MODEL_PATH,
            tokenizer,
            save_method="merged_16bit",
        )
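    else:
        # Standard PEFT method. Merging into 4-bit weights is lossy, so reload
        # the base model in bf16 and merge the adapter saved in Option 1 into it.
        from peft import PeftModel
        base_model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype=torch.bfloat16)
        merged_model = PeftModel.from_pretrained(base_model, ADAPTER_PATH).merge_and_unload()
        merged_model.save_pretrained(MERGED_MODEL_PATH)
        tokenizer.save_pretrained(MERGED_MODEL_PATH)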
    
    print(f"Merged model saved to {MERGED_MODEL_PATH}")
else:
    print("Skipping merged model save (set SAVE_MERGED=True to enable)")

---

## Part 8: Loading the Fine-Tuned Model Later

Here's how to load your fine-tuned model in a new session.

In [None]:
# Example: Loading the LoRA adapter later
def load_finetuned_model(base_model_name: str, adapter_path: str):
    """
    Load a fine-tuned model by combining base model with LoRA adapter.
    """
    from peft import PeftModel
    
    # Load base model
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
    )
    
    base_model = AutoModelForCausalLM.from_pretrained(
        base_model_name,
        quantization_config=bnb_config,
        device_map="auto",
    )
    
    # Load adapter
    model = PeftModel.from_pretrained(base_model, adapter_path)
    tokenizer = AutoTokenizer.from_pretrained(adapter_path)
    
    return model, tokenizer

print("To load your fine-tuned model later:")
print(f"model, tokenizer = load_finetuned_model('{MODEL_NAME}', '{ADAPTER_PATH}')")

---

## Try It Yourself: Exercises

### Exercise 1: Custom Dataset

Create a dataset for a specific domain (e.g., medical, legal, coding) and fine-tune the model.

<details>
<summary>Hint</summary>

Use the same format as `sample_data` but with domain-specific examples. Aim for at least 50-100 examples for meaningful fine-tuning.
</details>

In [None]:
# Exercise 1: Your custom dataset here
custom_data = [
    # Add your domain-specific examples
]

### Exercise 2: Hyperparameter Tuning

Experiment with different LoRA configurations:
- Try ranks: 8, 16, 32, 64
- Try different target modules
- Compare training loss curves

In [None]:
# Exercise 2: Hyperparameter experiments here

### Exercise 3: Evaluation Metrics

Implement proper evaluation using:
- Perplexity on held-out data
- ROUGE scores for summarization
- Human evaluation criteria
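
As a starting point for the first metric, perplexity is the exponential of the mean per-token negative log-likelihood. The sketch below uses made-up loss values to show the calculation; the function name `perplexity` is introduced here for illustration. In practice, the `eval_loss` value returned by `trainer.evaluate()` is already a mean cross-entropy, so `math.exp(eval_loss)` gives a comparable number.

```python
import math


def perplexity(token_nlls):
    """Perplexity = exp(mean negative log-likelihood per token)."""
    if not token_nlls:
        raise ValueError("Need at least one token loss")
    return math.exp(sum(token_nlls) / len(token_nlls))


# Illustrative values only: a model that assigns probability 1/4 to every token
uniform_over_four = [math.log(4)] * 10
ppl = perplexity(uniform_over_four)

print(f"Perplexity: {ppl:.2f}")
```

A perplexity of 4 here means the model is, on average, as uncertain as if it were choosing uniformly among four tokens. Lower is better, and values are only comparable between models that share a tokenizer.
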

In [None]:
# Exercise 3: Evaluation implementation here

---

## Common Mistakes

### Mistake 1: Wrong Chat Template

```python
# ❌ Wrong: Generic prompt format
prompt = f"User: {question}\nAssistant:"

# ✅ Right: Use model's specific chat template
prompt = f"""<|begin_of_text|><|start_header_id|>user<|end_header_id|>

{question}<|eot_id|><|start_header_id|>assistant<|end_header_id|>

"""
```

**Why:** Each model is trained with a specific format. Using the wrong format leads to poor results.

### Mistake 2: Not Enabling Gradient Checkpointing

```python
# ❌ Risky: No checkpointing
training_args = TrainingArguments(
    gradient_checkpointing=False,  # All activations stay in memory; can OOM on long sequences
)

# ✅ Right: Enable checkpointing
training_args = TrainingArguments(
    gradient_checkpointing=True,  # Trades compute for memory
)
```

**Why:** Gradient checkpointing saves memory by discarding most intermediate activations during the forward pass and recomputing them during the backward pass, at the cost of extra compute per step.

### Mistake 3: Too High Learning Rate

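```python
# ❌ Wrong: Learning rate too high
learning_rate = 1e-3  # Often unstable for LoRA

# ✅ Right: A moderate learning rate for LoRA
learning_rate = 2e-4  # Works well for most LoRA fine-tuning
```

**Why:** Each LoRA update is multiplied by `lora_alpha / r` (32 / 16 = 2 in this notebook), so a large learning rate is amplified further and can make the loss diverge. Note that 2e-4 is still well above the rates commonly used for full fine-tuning, so "lower" here means lower than 1e-3, not lower than a full fine-tuning run.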

---

## Checkpoint

You've learned:
- ✅ How to load large models with 4-bit quantization
- ✅ How to apply LoRA adapters using PEFT/Unsloth
- ✅ How to prepare instruction-following datasets
- ✅ How to fine-tune with SFTTrainer
- ✅ How to save and load fine-tuned models

---

## Challenge (Optional): Full Pipeline

Build a complete fine-tuning pipeline that:
1. Loads data from a CSV file
2. Cleans and formats it automatically
3. Splits into train/validation/test
4. Trains with early stopping
5. Evaluates on test set
6. Saves the best checkpoint
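
Steps 1 to 3 can be prototyped without any ML libraries. The sketch below is one possible starting point under stated assumptions: the CSV has `instruction`, `input`, and `output` columns (mirroring `sample_data`), cleaning means trimming whitespace and dropping incomplete or duplicate rows, and the split fractions are arbitrary defaults. The helper names `load_examples` and `split_examples` are hypothetical.

```python
import csv
import io
import random


def load_examples(csv_file):
    """Read rows, strip whitespace, and drop incomplete or duplicate examples."""
    examples, seen = [], set()
    for row in csv.DictReader(csv_file):
        instruction = (row.get("instruction") or "").strip()
        output = (row.get("output") or "").strip()
        if not instruction or not output:
            continue  # skip rows missing a required field
        key = (instruction, output)
        if key in seen:
            continue  # skip exact duplicates
        seen.add(key)
        examples.append({
            "instruction": instruction,
            "input": (row.get("input") or "").strip(),
            "output": output,
        })
    return examples


def split_examples(examples, val_frac=0.1, test_frac=0.1, seed=42):
    """Shuffle deterministically, then split into train/validation/test."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_frac)
    n_val = int(len(shuffled) * val_frac)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test


# Tiny in-memory CSV standing in for a real file
demo_csv = io.StringIO(
    "instruction,input,output\n"
    "What is LoRA?,,Low-rank adaptation.\n"
    "What is LoRA?,,Low-rank adaptation.\n"
    ",,Missing instruction\n"
    "Define overfitting,,Memorizing noise.\n"
)
examples = load_examples(demo_csv)
print(f"Kept {len(examples)} of 4 rows")
```

For step 4, transformers provides `EarlyStoppingCallback`, which expects `load_best_model_at_end=True` and a `metric_for_best_model` in the training arguments; that combination also covers step 6.
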

---

## Further Reading

- [Unsloth Documentation](https://github.com/unslothai/unsloth)
- [PEFT Library](https://huggingface.co/docs/peft)
- [TRL Library](https://huggingface.co/docs/trl)
- [Llama 3.1 Model Card](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct)

---

## Cleanup

In [None]:
# Cleanup
del model, tokenizer, trainer
clear_memory()

print("Cleanup complete!")
print_gpu_memory("After cleanup")

---

## Next Steps

**[Task 10.3: 70B Model QLoRA Fine-Tuning](03-70b-qlora-finetuning.ipynb)**

Ready for the DGX Spark showcase? In the next notebook, you'll fine-tune a **70 billion parameter model** - something that would require multiple expensive GPUs elsewhere!