# Lab 4.6.8.2: QLoRA Fine-Tuning

**Capstone Option E:** Browser-Deployed Fine-Tuned LLM (Matcha Expert)  
**Phase:** 2 of 6  
**Time:** 6-8 hours  
**Difficulty:** ⭐⭐⭐⭐

---

## Phase Objectives

By completing this phase, you will:
- [ ] Configure QLoRA for efficient fine-tuning
- [ ] Use Unsloth for 2x training speed
- [ ] Track experiments with MLflow
- [ ] Train LoRA adapters on matcha dataset
- [ ] Evaluate training quality
- [ ] Save adapters for merging

---

## Phase Checklist

- [ ] Environment configured
- [ ] Dataset loaded from Phase 1
- [ ] Base model loaded in 4-bit
- [ ] LoRA adapters configured
- [ ] MLflow experiment created
- [ ] Training completed
- [ ] Adapters saved
- [ ] Quality verified

---

## Why This Matters

**Fine-tuning transforms a general model into a domain expert.**

| Before Fine-Tuning | After Fine-Tuning |
|-------------------|-------------------|
| Generic responses about tea | Specific matcha expertise |
| May hallucinate details | Accurate domain knowledge |
| Inconsistent style | Consistent expert persona |
| "I don't know" on specifics | Detailed, authoritative answers |

**DGX Spark Advantage:** With 128GB unified memory, we can fine-tune efficiently while keeping the full model accessible for validation.

---

## ELI5: What is QLoRA?

> **Imagine teaching a master chef to specialize in Japanese cuisine.**
>
> **Regular training** would be like sending them back to culinary school for 4 years - relearning everything from scratch. Expensive and time-consuming.
>
> **LoRA (Low-Rank Adaptation)** is like giving them a specialized notebook where they write down just the Japanese-specific techniques. They keep all their existing skills, and just add the new knowledge.
>
> **QLoRA (Quantized LoRA)** is even smarter - it compresses the chef's existing knowledge (4-bit quantization) so it takes up less space, while the new notebook stays full quality. This means we can work with a much bigger chef (larger model) in the same kitchen (GPU memory).
>
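> **Result:** We train only the small notebook (LoRA adapters, ~30MB) instead of the entire chef (base model, ~2GB), so gradients and optimizer state are kept for only a small fraction of the parameters. That is where most of the memory savings come from.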
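
The notebook analogy can be made concrete with a little arithmetic. For a linear layer with a `d_out x d_in` weight matrix, LoRA trains two small matrices of shapes `r x d_in` and `d_out x r`, and scales their product by `lora_alpha / r`. The sketch below uses a hypothetical 1024 x 1024 layer, chosen only for illustration and not taken from the Gemma architecture; the helper names are also made up for this example.

```python
def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters LoRA adds to one d_out x d_in linear layer."""
    # Matrix A has shape (r, d_in) and matrix B has shape (d_out, r).
    return r * d_in + d_out * r


def lora_scaling(lora_alpha: float, r: int) -> float:
    """Factor applied to the low-rank update in standard LoRA."""
    return lora_alpha / r


# Hypothetical layer size, for illustration only.
d_in, d_out, r = 1024, 1024, 16

full = d_in * d_out
added = lora_param_count(d_in, d_out, r)

print(f"Full layer parameters:   {full:,}")
print(f"LoRA parameters (r={r}): {added:,}")
print(f"Fraction trained:        {added / full:.2%}")
print(f"Scaling (lora_alpha=16): {lora_scaling(16, r)}")
```

Doubling `r` doubles the adapter size, while the frozen base weights stay untouched.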

---

## Part 1: Environment Setup

In [None]:
# Environment Setup
import os
import sys
from pathlib import Path
from datetime import datetime
import json
import torch

print("🍵 PHASE 2: QLORA FINE-TUNING")
print("="*70)
print(f"Date: {datetime.now().strftime('%Y-%m-%d %H:%M')}")
# Query device properties only when CUDA is present; otherwise these calls raise
if torch.cuda.is_available():
    print(f"\nGPU: {torch.cuda.get_device_name(0)}")
    print(f"GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")
else:
    print("\nGPU: Not available")
print(f"CUDA Version: {torch.version.cuda}")
print(f"PyTorch Version: {torch.__version__}")

In [None]:
# Project Configuration
PROJECT_DIR = Path("./matcha-expert")
DATA_DIR = PROJECT_DIR / "data"
MODEL_DIR = PROJECT_DIR / "models"
MODEL_DIR.mkdir(parents=True, exist_ok=True)

# Training Configuration
CONFIG = {
    # Model
    "base_model": "unsloth/gemma-3-270m-it",  # 270M instruction-tuned Gemma 3
    "max_seq_length": 2048,
    
    # LoRA
    "lora_r": 16,             # Rank - higher = more capacity
    "lora_alpha": 16,         # Scaling factor
    "lora_dropout": 0,        # Dropout (0 for small datasets)
    "target_modules": [       # Which layers to adapt
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    
    # Training
    "num_epochs": 3,
    "batch_size": 2,
    "gradient_accumulation_steps": 4,  # Effective batch = 8
    "learning_rate": 2e-4,
    "warmup_ratio": 0.03,
    "weight_decay": 0.01,
    
    # Paths
    "output_dir": str(MODEL_DIR / "matcha-lora"),
    "dataset_path": str(DATA_DIR / "matcha-dataset"),
}

print("📋 TRAINING CONFIGURATION")
print("="*70)
for key, value in CONFIG.items():
    if isinstance(value, list):
        print(f"   {key}: [{len(value)} modules]")
    else:
        print(f"   {key}: {value}")
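
To see how these settings interact, the following sketch estimates how many optimizer steps a run will take and how many of them are warmup. It is an approximation for a single device: the exact count can differ slightly between Trainer versions depending on how the last partial accumulation window is rounded. The function name and the 1,000-example dataset size are assumptions made for this example.

```python
import math


def training_schedule(num_examples, batch_size, grad_accum, num_epochs, warmup_ratio):
    """Approximate optimizer-step counts for a single-device run."""
    effective_batch = batch_size * grad_accum
    # Mini-batches per epoch, then optimizer updates per epoch.
    batches_per_epoch = math.ceil(num_examples / batch_size)
    steps_per_epoch = max(1, math.ceil(batches_per_epoch / grad_accum))
    total_steps = steps_per_epoch * num_epochs
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    return {
        "effective_batch": effective_batch,
        "steps_per_epoch": steps_per_epoch,
        "total_steps": total_steps,
        "warmup_steps": warmup_steps,
    }


# Hypothetical dataset size; the other values mirror CONFIG above.
plan = training_schedule(
    num_examples=1000, batch_size=2, grad_accum=4, num_epochs=3, warmup_ratio=0.03
)
for name, value in plan.items():
    print(f"{name}: {value}")
```

Comparing `total_steps` with `eval_steps` and `save_steps` is a quick way to check that evaluation and checkpointing will actually trigger during a short run.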

In [None]:
# Memory usage helper
def log_memory(stage: str = ""):
    """
    Log current GPU memory usage.
    
    Useful for tracking memory consumption at different stages
    of the training pipeline on DGX Spark.
    """
    if torch.cuda.is_available():
        allocated = torch.cuda.memory_allocated() / 1e9
        reserved = torch.cuda.memory_reserved() / 1e9
        print(f"💾 Memory [{stage}]: {allocated:.2f}GB allocated, {reserved:.2f}GB reserved")
    else:
        print("💾 No GPU available")

log_memory("Initial")

---

## Part 2: Load Dataset

In [None]:
from datasets import load_from_disk, Dataset

# Load dataset from Phase 1
dataset_path = Path(CONFIG["dataset_path"])

if dataset_path.exists():
    dataset = load_from_disk(str(dataset_path))
    print(f"✅ Loaded dataset from {dataset_path}")
    print(f"   Train: {len(dataset['train'])} examples")
    print(f"   Validation: {len(dataset['validation'])} examples")
    print(f"   Test: {len(dataset['test'])} examples")
else:
    print(f"❌ Dataset not found at {dataset_path}")
    print("   Please complete Phase 1 first!")

In [None]:
# Preview a training example
if 'dataset' in dir():
    sample = dataset['train'][0]
    print("📝 SAMPLE TRAINING EXAMPLE")
    print("="*70)
    for msg in sample['messages']:
        role = msg['role'].upper()
        content = msg['content'][:200] + "..." if len(msg['content']) > 200 else msg['content']
        print(f"\n[{role}]")
        print(content)

---

## Part 3: Load Model with Unsloth

Unsloth advertises roughly 2x faster training and up to 60% less memory use, achieved through hand-optimized kernels. Actual gains depend on the model, sequence length, and hardware.

In [None]:
# Load model with Unsloth for 2x speedup

from unsloth import FastLanguageModel

print("🚀 Loading model with Unsloth...")
print(f"   Model: {CONFIG['base_model']}")
print(f"   Max Sequence Length: {CONFIG['max_seq_length']}")

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=CONFIG["base_model"],
    max_seq_length=CONFIG["max_seq_length"],
    load_in_4bit=True,      # Load in 4-bit for memory efficiency
    dtype=torch.bfloat16,   # Use bfloat16 for DGX Spark
)

print(f"\n✅ Model loaded successfully!")
log_memory("After model load")
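
As a sanity check on the number reported by `log_memory`, the weight storage can be estimated from the parameter count and the bits per parameter. This is a rough lower bound under stated assumptions: it uses the nominal 270M size from the model name, and it ignores quantization constants, layers kept in higher precision, the CUDA context, and activations. The helper name is hypothetical.

```python
def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes, as in log_memory)."""
    return num_params * bits_per_param / 8 / 1e9


# Nominal size suggested by the model name; an approximation.
num_params = 270e6

for label, bits in [("fp32", 32), ("bf16", 16), ("4-bit", 4)]:
    print(f"{label:>5}: {weight_memory_gb(num_params, bits):.3f} GB")
```

If the allocated memory is far above the 4-bit estimate, it is worth checking whether the model was actually loaded quantized.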

In [None]:
# Add LoRA adapters

print("🔧 Adding LoRA adapters...")
print(f"   Rank (r): {CONFIG['lora_r']}")
print(f"   Alpha: {CONFIG['lora_alpha']}")
print(f"   Target modules: {len(CONFIG['target_modules'])}")

model = FastLanguageModel.get_peft_model(
    model,
    r=CONFIG["lora_r"],
    lora_alpha=CONFIG["lora_alpha"],
    lora_dropout=CONFIG["lora_dropout"],
    target_modules=CONFIG["target_modules"],
    bias="none",
    use_gradient_checkpointing="unsloth",  # Optimized checkpointing
    random_state=42,
)

# Count trainable parameters
def count_parameters(model):
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    return trainable, total

trainable, total = count_parameters(model)
print(f"\n✅ LoRA adapters added!")
print(f"   Trainable parameters: {trainable:,} ({trainable/total*100:.2f}%)")
print(f"   Total parameters: {total:,}")
log_memory("After LoRA")

---

## Part 4: Prepare Training Data

In [None]:
# Format dataset for training

def format_chat_template(example):
    """
    Format messages into the chat template expected by the model.
    
    This function applies the model's chat template to convert
    the messages format into a training-ready text format.
    
    Args:
        example: Dataset example with 'messages' field
        
    Returns:
        Dict with 'text' field containing formatted conversation
    """
    messages = example['messages']
    
    # Apply chat template
    text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=False,
    )
    
    return {"text": text}

# Apply formatting
print("📝 Formatting dataset for training...")

train_dataset = dataset['train'].map(
    format_chat_template,
    remove_columns=dataset['train'].column_names,
)

val_dataset = dataset['validation'].map(
    format_chat_template,
    remove_columns=dataset['validation'].column_names,
)

print(f"✅ Formatting complete!")
print(f"   Train examples: {len(train_dataset)}")
print(f"   Validation examples: {len(val_dataset)}")

In [None]:
# Preview formatted example
print("📋 FORMATTED EXAMPLE")
print("="*70)
sample_text = train_dataset[0]['text']
print(sample_text[:1000] + "..." if len(sample_text) > 1000 else sample_text)
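
To build intuition for what a chat template does, here is a toy renderer. It is a simplified stand-in, not the tokenizer's real template: the turn markers are modeled on Gemma-style formatting as an assumption, system messages are left out, and `render_chat` is a hypothetical name. The output of `tokenizer.apply_chat_template` remains the authoritative format.

```python
def render_chat(messages, add_generation_prompt=False):
    """Toy chat renderer; a stand-in for tokenizer.apply_chat_template."""
    parts = []
    for msg in messages:
        # Gemma-style templates label the assistant turn "model".
        role = "model" if msg["role"] == "assistant" else msg["role"]
        parts.append(f"<start_of_turn>{role}\n{msg['content']}<end_of_turn>\n")
    if add_generation_prompt:
        # Open an empty model turn so generation continues from here.
        parts.append("<start_of_turn>model\n")
    return "".join(parts)


conversation = [
    {"role": "user", "content": "How hot should the water be?"},
    {"role": "assistant", "content": "Around 70-80 C."},
]

# Training text: the full conversation, including the answer.
training_text = render_chat(conversation)
# Inference prompt: the question only, ending with an open model turn.
inference_prompt = render_chat(conversation[:1], add_generation_prompt=True)

print(training_text)
print(inference_prompt)
```

This is why training uses `add_generation_prompt=False` (the answer is already in the text) while inference uses `add_generation_prompt=True`.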

---

## Part 5: Configure MLflow Tracking

In [None]:
import mlflow

# Configure MLflow
mlflow_dir = PROJECT_DIR / "mlruns"
mlflow_dir.mkdir(parents=True, exist_ok=True)

mlflow.set_tracking_uri(f"file://{mlflow_dir.absolute()}")
mlflow.set_experiment("matcha-expert-training")

print(f"📊 MLflow configured")
print(f"   Tracking URI: {mlflow.get_tracking_uri()}")
print(f"   Experiment: matcha-expert-training")

In [None]:
# Start MLflow run
run_name = f"train-{datetime.now().strftime('%Y%m%d-%H%M')}"

mlflow.start_run(run_name=run_name)

# Log configuration
mlflow.log_params({
    "base_model": CONFIG["base_model"],
    "lora_r": CONFIG["lora_r"],
    "lora_alpha": CONFIG["lora_alpha"],
    "num_epochs": CONFIG["num_epochs"],
    "batch_size": CONFIG["batch_size"],
    "learning_rate": CONFIG["learning_rate"],
    "train_examples": len(train_dataset),
    "val_examples": len(val_dataset),
})

print(f"✅ MLflow run started: {run_name}")

---

## Part 6: Configure Trainer

In [None]:
from trl import SFTTrainer, SFTConfig

# Training arguments
training_args = SFTConfig(
    output_dir=CONFIG["output_dir"],
    
    # Batch size
    per_device_train_batch_size=CONFIG["batch_size"],
    per_device_eval_batch_size=CONFIG["batch_size"],
    gradient_accumulation_steps=CONFIG["gradient_accumulation_steps"],
    
    # Training duration
    num_train_epochs=CONFIG["num_epochs"],
    max_seq_length=CONFIG["max_seq_length"],
    
    # Optimizer
    learning_rate=CONFIG["learning_rate"],
    lr_scheduler_type="cosine",
    warmup_ratio=CONFIG["warmup_ratio"],
    weight_decay=CONFIG["weight_decay"],
    optim="adamw_8bit",  # Memory-efficient optimizer
    
    # Precision
    bf16=True,
    
    # Logging
    logging_steps=10,
    logging_first_step=True,
    
    # Evaluation
    eval_strategy="steps",
    eval_steps=50,
    
    # Saving
    save_strategy="steps",
    save_steps=100,
    save_total_limit=3,
    
    # Other
    seed=42,
    report_to=[],  # We'll use MLflow manually
    dataset_text_field="text",
    packing=False,  # Don't pack sequences
)

print("📋 TRAINING ARGUMENTS")
print("="*70)
print(f"   Epochs: {training_args.num_train_epochs}")
print(f"   Batch size: {training_args.per_device_train_batch_size}")
print(f"   Gradient accumulation: {training_args.gradient_accumulation_steps}")
print(f"   Effective batch size: {training_args.per_device_train_batch_size * training_args.gradient_accumulation_steps}")
print(f"   Learning rate: {training_args.learning_rate}")
print(f"   Precision: {'BF16' if training_args.bf16 else 'FP32'}")
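
The combination of `lr_scheduler_type="cosine"` and `warmup_ratio` produces a learning rate that ramps up linearly and then decays along a half cosine. The sketch below reproduces that shape in plain Python so the values can be inspected. It is a minimal re-implementation for illustration, and the step counts (375 total, 12 warmup) are hypothetical.

```python
import math


def lr_at_step(step, total_steps, warmup_steps, peak_lr):
    """Linear warmup followed by cosine decay to zero."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    # Fraction of the post-warmup phase that has elapsed, from 0 to 1.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


# Hypothetical step counts; peak_lr mirrors CONFIG["learning_rate"].
total_steps, warmup_steps, peak_lr = 375, 12, 2e-4

for step in (0, 6, 12, 100, 200, 375):
    print(f"step {step:>3}: lr = {lr_at_step(step, total_steps, warmup_steps, peak_lr):.2e}")
```

With a very short run, the warmup phase may last only a handful of steps, so most of training happens on the decaying part of the curve.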

In [None]:
# Create trainer

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    args=training_args,
)

print("✅ Trainer configured!")
log_memory("After trainer setup")

---

## Part 7: Train the Model

In [None]:
# Training
print("🏋️ STARTING TRAINING")
print("="*70)
print(f"   Start time: {datetime.now().strftime('%H:%M:%S')}")
print(f"   Expected duration: ~15-30 minutes for {len(train_dataset)} examples")
print("\n   Progress:")

# Train
train_result = trainer.train()

print(f"\n✅ Training complete!")
print(f"   End time: {datetime.now().strftime('%H:%M:%S')}")
print(f"   Total steps: {train_result.global_step}")
print(f"   Final loss: {train_result.training_loss:.4f}")

In [None]:
# Log training metrics to MLflow

mlflow.log_metrics({
    "train_loss": train_result.training_loss,
    "train_steps": train_result.global_step,
    "train_runtime_seconds": train_result.metrics.get("train_runtime", 0),
    "train_samples_per_second": train_result.metrics.get("train_samples_per_second", 0),
})

print("📊 Metrics logged to MLflow")

In [None]:
# Evaluate on validation set
print("📊 Running evaluation...")

eval_result = trainer.evaluate()

print(f"\n📊 EVALUATION RESULTS")
print("="*70)
for key, value in eval_result.items():
    print(f"   {key}: {value:.4f}" if isinstance(value, float) else f"   {key}: {value}")

# Log to MLflow (keys returned by evaluate() already carry the "eval_" prefix)
mlflow.log_metrics({k: v for k, v in eval_result.items() if isinstance(v, (int, float))})
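
Evaluation loss is easier to interpret as perplexity. Assuming the reported loss is the mean per-token cross-entropy in nats, perplexity is simply its exponential. The helper below is a hypothetical convenience for this conversion, and the loss values in the loop are examples rather than results from this lab.

```python
import math


def perplexity(mean_cross_entropy: float) -> float:
    """Convert a mean per-token cross-entropy (in nats) to perplexity."""
    return math.exp(mean_cross_entropy)


# Example loss values, not measurements from this lab.
for loss in (1.5, 1.0, 0.5):
    print(f"loss {loss:.1f} -> perplexity {perplexity(loss):.2f}")
```

A validation perplexity that rises while training loss keeps falling is a typical sign of overfitting on a small dataset.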

---

## Part 8: Test the Fine-Tuned Model

In [None]:
# Enable inference mode
FastLanguageModel.for_inference(model)

def generate_response(question: str, max_tokens: int = 256) -> str:
    """
    Generate a response from the fine-tuned model.
    
    Args:
        question: User question about matcha
        max_tokens: Maximum tokens to generate
        
    Returns:
        Model's response
    """
    messages = [
        {"role": "system", "content": "You are a matcha tea expert with deep knowledge of Japanese tea culture, preparation methods, health benefits, and culinary applications."},
        {"role": "user", "content": question},
    ]
    
    inputs = tokenizer.apply_chat_template(
        messages,
        tokenize=True,
        add_generation_prompt=True,
        return_tensors="pt",
    ).to(model.device)
    
    outputs = model.generate(
        inputs,
        max_new_tokens=max_tokens,
        temperature=0.7,
        top_p=0.9,
        do_sample=True,
        pad_token_id=tokenizer.pad_token_id,
    )
    
    # Decode only the new tokens
    response = tokenizer.decode(outputs[0][inputs.shape[1]:], skip_special_tokens=True)
    return response.strip()

print("✅ Inference mode enabled")
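
The `temperature` and `top_p` arguments control how the next token is sampled. The following sketch illustrates the idea on a tiny made-up vocabulary: logits are divided by the temperature, converted to probabilities, and then only the most likely tokens whose cumulative probability reaches `top_p` are kept. It is a simplified illustration, not the library's implementation, and the logits are invented.

```python
import math
import random


def top_p_filter(logits, temperature=0.7, top_p=0.9):
    """Return {token: probability} after temperature scaling and nucleus filtering."""
    # Temperature-scaled softmax (subtract the max for numerical stability).
    scaled = {tok: logit / temperature for tok, logit in logits.items()}
    peak = max(scaled.values())
    exp = {tok: math.exp(v - peak) for tok, v in scaled.items()}
    total = sum(exp.values())
    probs = {tok: v / total for tok, v in exp.items()}

    # Keep the most likely tokens until their cumulative mass reaches top_p.
    kept, cumulative = {}, 0.0
    for tok, p in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept[tok] = p
        cumulative += p
        if cumulative >= top_p:
            break

    # Renormalize so the kept probabilities sum to 1.
    norm = sum(kept.values())
    return {tok: p / norm for tok, p in kept.items()}


# Made-up logits for four candidate next tokens.
logits = {"whisk": 3.0, "sift": 2.0, "boil": 0.5, "freeze": -1.0}
nucleus = top_p_filter(logits)
print(nucleus)

# Draw one token from the filtered distribution.
rng = random.Random(0)
choice = rng.choices(list(nucleus), weights=list(nucleus.values()), k=1)[0]
print(choice)
```

Lower temperatures sharpen the distribution, so fewer tokens survive the `top_p` cutoff and answers become more deterministic.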

In [None]:
# Test with sample questions

TEST_QUESTIONS = [
    "What's the difference between ceremonial and culinary grade matcha?",
    "How should I store my matcha to keep it fresh?",
    "What's the correct water temperature for making matcha?",
]

print("🧪 TESTING FINE-TUNED MODEL")
print("="*70)

for i, question in enumerate(TEST_QUESTIONS, 1):
    print(f"\n❓ Question {i}: {question}")
    print(f"\n💬 Response:")
    response = generate_response(question)
    print(response)
    print("-"*70)

---

## Part 9: Save LoRA Adapters

In [None]:
# Save LoRA adapters

adapter_path = Path(CONFIG["output_dir"]) / "final"
adapter_path.mkdir(parents=True, exist_ok=True)

# Save model and tokenizer
model.save_pretrained(str(adapter_path))
tokenizer.save_pretrained(str(adapter_path))

# Calculate adapter size
adapter_size = sum(f.stat().st_size for f in adapter_path.glob("*.safetensors")) / 1e6

print(f"✅ LoRA adapters saved!")
print(f"   Path: {adapter_path}")
print(f"   Size: {adapter_size:.1f} MB")

# List saved files
print(f"\n📁 Saved files:")
for f in sorted(adapter_path.iterdir()):
    size = f.stat().st_size / 1e6
    print(f"   {f.name}: {size:.2f} MB")

In [None]:
# Log artifacts to MLflow

# Save config
config_path = adapter_path / "training_config.json"
with open(config_path, 'w') as f:
    json.dump(CONFIG, f, indent=2)

# Log to MLflow
mlflow.log_artifact(str(config_path))
mlflow.log_metric("adapter_size_mb", adapter_size)

# End run
mlflow.end_run()

print("✅ MLflow run completed and artifacts logged")

---

## Common Issues

### Issue 1: CUDA Out of Memory
**Symptom:** `RuntimeError: CUDA out of memory`  
**Fix:** Reduce batch_size or max_seq_length
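
One way to reduce `batch_size` without changing the optimization behavior much is to raise `gradient_accumulation_steps` by the same factor. The helper below is a hypothetical sketch of that adjustment on a config dictionary shaped like `CONFIG`.

```python
def halve_batch(config: dict) -> dict:
    """Return a copy with half the per-device batch and more accumulation steps."""
    if config["batch_size"] < 2:
        raise ValueError("batch_size is already 1; reduce max_seq_length instead")
    updated = dict(config)
    updated["batch_size"] = config["batch_size"] // 2
    # Scale accumulation so batch_size * accumulation stays constant
    # (exact when the original batch_size is even).
    updated["gradient_accumulation_steps"] = (
        config["gradient_accumulation_steps"] * config["batch_size"] // updated["batch_size"]
    )
    return updated


original = {"batch_size": 2, "gradient_accumulation_steps": 4}
smaller = halve_batch(original)
print(smaller)
```

Peak activation memory scales with the per-device batch, so this trades some throughput for a smaller footprint while keeping the effective batch size.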

### Issue 2: Loss Not Decreasing
**Symptom:** Loss stays flat or increases  
**Fix:** Check learning rate (try lower), check data format
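
A quick way to confirm whether the loss is really flat is to compare the first and last few logged values. The sketch below works on a list of dictionaries shaped like the Trainer's `trainer.state.log_history`; the function name, the window size, and the sample history are assumptions made for this example.

```python
def loss_trend(log_history, window=5):
    """Relative change between the first and last `window` logged training losses.

    Returns None when there are too few entries. Negative values mean the
    loss went down.
    """
    # Evaluation entries use "eval_loss", so they are skipped here.
    losses = [entry["loss"] for entry in log_history if "loss" in entry]
    if len(losses) < 2 * window:
        return None
    start = sum(losses[:window]) / window
    end = sum(losses[-window:]) / window
    if start == 0:
        return None
    return (end - start) / start


# Made-up history: training loss falls from 2.0 to 1.1, with one eval entry mixed in.
fake_history = [{"step": 10 * i, "loss": 2.0 - 0.1 * i} for i in range(10)]
fake_history.append({"step": 100, "eval_loss": 1.3})

trend = loss_trend(fake_history)
print(f"Relative change in training loss: {trend:+.1%}")
```

A value near zero or above zero after a full epoch suggests revisiting the learning rate or the formatted training text.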

### Issue 3: Model Outputs Garbage
**Symptom:** Random tokens, incomplete sentences  
**Fix:** Check chat template formatting, ensure tokenizer matches model

### Issue 4: Training Too Slow
**Symptom:** Hours per epoch  
**Fix:** Ensure Unsloth is being used, check GPU utilization

---

## Metrics & Outputs

| Metric | Expected | Actual |
|--------|----------|--------|
| Training Loss | < 1.0 | [Fill in] |
| Eval Loss | < 1.5 | [Fill in] |
| Training Time | 15-30 min | [Fill in] |
| Adapter Size | ~20-50 MB | [Fill in] |
| Peak Memory | ~8-12 GB | [Fill in] |

---

## Phase Complete!

You've achieved:
- ✅ Loaded model with Unsloth for 2x speedup
- ✅ Configured QLoRA adapters
- ✅ Trained on matcha dataset
- ✅ Tracked experiments with MLflow
- ✅ Saved LoRA adapters

**Next:** [Lab 4.6.8.3: Merge and Export](./lab-4.6.8.3-merge-and-export.ipynb)

---

In [None]:
# Cleanup
import gc

# Free GPU memory: drop references, collect garbage, then release cached blocks
del model
del trainer
gc.collect()
torch.cuda.empty_cache()

print("✅ Phase 2 Complete!")
print("\n🎯 Next Steps:")
print("   1. Review MLflow logs for training metrics")
print("   2. Test model responses for quality")
print("   3. Proceed to Lab 4.6.8.3 for LoRA merging")
print(f"\n   Adapters saved at: {adapter_path}")

log_memory("After cleanup")