# Task 10.3: Fine-Tuning Llama 3.1 70B with QLoRA

**Module:** 10 - Large Language Model Fine-Tuning  
**Time:** 4 hours  
**Difficulty:** ⭐⭐⭐⭐☆

## This is the DGX Spark Showcase!

You're about to do something that would normally require:
- **4x NVIDIA A100 80GB** (~$60,000+), or
- **8x NVIDIA RTX 4090** (~$16,000+), or  
- **Cloud GPU rental** ($20-50/hour)

**On your DGX Spark**, you'll fine-tune a 70-billion parameter model on a single desktop machine. This is what makes DGX Spark special!

---

## Learning Objectives

By the end of this notebook, you will:
- [ ] Clear system cache before loading large models
- [ ] Load Llama 3.1 70B with QLoRA configuration
- [ ] Monitor and document memory usage throughout the process
- [ ] Fine-tune the 70B model on a custom dataset
- [ ] Understand what makes DGX Spark unique for this task

---

## Prerequisites

- Completed: Task 10.1 (LoRA Theory) and Task 10.2 (8B Fine-tuning)
- Hardware: DGX Spark with 128GB unified memory
- Model access: Meta Llama 3.1 70B (gated; request access on the Hugging Face model page for `meta-llama/Llama-3.1-70B-Instruct`)

---

## ELI5: What is QLoRA?

> **Remember our chef analogy from the LoRA notebook?** QLoRA adds another trick.
>
> **LoRA:** Give the chef a small recipe card instead of retraining them.
>
> **QLoRA:** Also compress the chef's entire cookbook (base weights) into a tiny, highly-efficient format while keeping the recipe card (adapters) in high quality.
>
> **The technical trick:** 
> - Base model weights are stored in 4-bit precision (NF4 format)
> - LoRA adapters stay in higher precision (bfloat16)
> - Computations happen in bfloat16 for accuracy
> - Double quantization further reduces memory overhead
>
> **Result:** A 70B model that normally needs 140GB (in float16) fits in ~35-45GB!
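
The numbers in the result line follow from simple arithmetic. The sketch below assumes a nominal 70e9 parameters and uses the quantization-constant overheads described in the QLoRA paper: one fp32 constant per 64-weight block, or, with double quantization, 8-bit constants plus one fp32 constant per 256 blocks. It ignores layers that stay in 16-bit (such as embeddings and norms), so real usage will run somewhat higher.

```python
def weight_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Memory in GB (1e9 bytes) for n_params stored at bits_per_param."""
    return n_params * bits_per_param / 8 / 1e9

N_PARAMS = 70e9  # assumption: "70B" taken literally

# Quantization-constant overhead, in bits per parameter
overhead_plain = 32 / 64                      # one fp32 absmax per 64 weights
overhead_double = 8 / 64 + 32 / (64 * 256)    # 8-bit constants + fp32 per 256 blocks

fp16_gb = weight_memory_gb(N_PARAMS, 16)
nf4_gb = weight_memory_gb(N_PARAMS, 4 + overhead_plain)
nf4_dq_gb = weight_memory_gb(N_PARAMS, 4 + overhead_double)

print(f"float16 weights:            {fp16_gb:.1f} GB")
print(f"4-bit, plain constants:     {nf4_gb:.1f} GB")
print(f"4-bit, double quantization: {nf4_dq_gb:.1f} GB")
```

Double quantization saves a few GB at this scale, which is why the configuration later in this notebook enables it.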

---

## Memory Budget Analysis

Let's understand why this works on DGX Spark:

| Component | FP16 Memory | QLoRA Memory |
|-----------|------------|---------------|
| 70B Model Weights | ~140 GB | **~35 GB** (4-bit) |
| LoRA Parameters | - | ~0.5 GB |
| Gradients (LoRA only) | - | ~1 GB |
| Optimizer States | - | ~4 GB |
| Activations (with gradient checkpointing) | - | ~5-10 GB |
| **Total** | **~400+ GB** (full fine-tuning: weights, gradients, optimizer states) | **~45-55 GB** |

**DGX Spark Unified Memory: 128GB** - plenty of room!
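
As a quick sanity check, the QLoRA column can be summed and compared with the 128 GB pool. This is only a sketch that reuses the table's own estimates; the dictionary keys are illustrative names, and the operating system and other processes also draw from the same unified memory.

```python
# (low, high) estimates in GB, copied from the QLoRA column of the table
budget_gb = {
    "weights_4bit": (35.0, 35.0),
    "lora_params": (0.5, 0.5),
    "gradients": (1.0, 1.0),
    "optimizer_states": (4.0, 4.0),
    "activations": (5.0, 10.0),
}
UNIFIED_MEMORY_GB = 128.0

low = sum(lo for lo, _ in budget_gb.values())
high = sum(hi for _, hi in budget_gb.values())
headroom = UNIFIED_MEMORY_GB - high

print(f"Estimated total: {low:.1f}-{high:.1f} GB")
print(f"Worst-case headroom: {headroom:.1f} GB")
```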

### Why Consumer GPUs Can't Do This

- RTX 4090: 24GB VRAM, so a single card cannot hold the ~35 GB of 4-bit weights
- Multi-GPU setup: Requires complex tensor parallelism, PCIe bandwidth limitations
- CPU offloading: Works but 10-100x slower

**DGX Spark's unified memory** means the full model can be accessed efficiently without complex distribution strategies!

---

## Part 1: System Preparation (CRITICAL)

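Before loading a 70B model, we **MUST** clear the system buffer cache. On DGX Spark the CPU and GPU share one 128GB memory pool, so file data that Linux keeps in its page cache can occupy memory the model weights need.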

In [None]:
# CRITICAL: Clear buffer cache before loading large models
import subprocess
import os
import gc
import time

def clear_buffer_cache():
    """
    Clear Linux buffer cache to free up memory for model loading.
    This is CRITICAL for loading 70B models!
    
    Note: Requires sudo access. If you get permission errors,
    run this command manually in terminal:
    sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'
    """
    print("Clearing buffer cache...")
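    try:
        # Flush dirty pages to disk so the cache can be dropped safely
        subprocess.run(["sync"], check=True)
        # Drop page cache, dentries and inodes.
        # "sudo -n" fails immediately instead of waiting for a password,
        # which would otherwise hang a notebook cell that has no terminal.
        subprocess.run(
            ["sudo", "-n", "sh", "-c", "echo 3 > /proc/sys/vm/drop_caches"],
            check=True
        )
        print("Buffer cache cleared successfully!")
    except (subprocess.CalledProcessError, FileNotFoundError) as e:
        # FileNotFoundError covers containers where sudo is not installed
        print(f"Warning: Could not clear cache automatically: {e}")
        print("Please run this command manually:")
        print("  sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'")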

# Clear Python garbage collector
gc.collect()

# Clear buffer cache
clear_buffer_cache()
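
To see whether clearing the cache actually freed anything, it can help to read the page-cache figures before and after. The following is a minimal sketch, not part of the original notebook: `parse_meminfo` and `cache_summary_gb` are hypothetical helper names, and the field names (`Cached`, `Buffers`, `MemAvailable`) are the standard ones from Linux `/proc/meminfo`, which reports values in kB.

```python
import os

def parse_meminfo(text: str) -> dict:
    """Parse /proc/meminfo-style text into {field: kilobytes}."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields

def cache_summary_gb(fields: dict) -> dict:
    """Return page cache and available memory in GB (1e9 bytes)."""
    kb_to_gb = 1024 / 1e9
    cached_kb = fields.get("Cached", 0) + fields.get("Buffers", 0)
    return {
        "cached_gb": cached_kb * kb_to_gb,
        "available_gb": fields.get("MemAvailable", 0) * kb_to_gb,
    }

# Only attempt the read on systems that expose /proc/meminfo (Linux)
if os.path.exists("/proc/meminfo"):
    with open("/proc/meminfo") as f:
        print(cache_summary_gb(parse_meminfo(f.read())))
```

Running it once before and once after `clear_buffer_cache()` should show `cached_gb` falling and `available_gb` rising if the drop succeeded.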

In [None]:
# Memory monitoring utilities
import torch
from datetime import datetime

# Graceful import for psutil - may not be installed in all NGC container versions
try:
    import psutil
    PSUTIL_AVAILABLE = True
except ImportError:
    PSUTIL_AVAILABLE = False
    print("Note: psutil not available. Install with 'pip install psutil' for system memory monitoring.")
    print("GPU memory monitoring will still work.")

def get_memory_info():
    """
    Get comprehensive memory information.
    """
    info = {
        'timestamp': datetime.now().strftime('%H:%M:%S'),
    }
    
    # System memory (if psutil available)
    if PSUTIL_AVAILABLE:
        sys_mem = psutil.virtual_memory()
        info['system_total_gb'] = sys_mem.total / 1e9
        info['system_used_gb'] = sys_mem.used / 1e9
        info['system_available_gb'] = sys_mem.available / 1e9
    else:
        # Fallback: estimate from /proc/meminfo on Linux
        try:
            with open('/proc/meminfo', 'r') as f:
                meminfo = f.read()
            mem_total = int([x for x in meminfo.split('\n') if 'MemTotal' in x][0].split()[1]) * 1024
            mem_avail = int([x for x in meminfo.split('\n') if 'MemAvailable' in x][0].split()[1]) * 1024
            info['system_total_gb'] = mem_total / 1e9
            info['system_available_gb'] = mem_avail / 1e9
            info['system_used_gb'] = (mem_total - mem_avail) / 1e9
        except Exception:
            info['system_total_gb'] = 0
            info['system_used_gb'] = 0
            info['system_available_gb'] = 0
    
    # GPU memory
    if torch.cuda.is_available():
        info['gpu_allocated_gb'] = torch.cuda.memory_allocated() / 1e9
        info['gpu_reserved_gb'] = torch.cuda.memory_reserved() / 1e9
        
        # Get device properties for total memory
        props = torch.cuda.get_device_properties(0)
        info['gpu_total_gb'] = props.total_memory / 1e9
    
    return info

def print_memory_status(label: str = ""):
    """
    Print formatted memory status.
    """
    info = get_memory_info()
    print(f"\n{'='*60}")
    print(f"Memory Status: {label} ({info['timestamp']})")
    print(f"{'='*60}")
    print(f"System RAM: {info['system_used_gb']:.1f} / {info['system_total_gb']:.1f} GB used")
    print(f"System Available: {info['system_available_gb']:.1f} GB")
    if 'gpu_allocated_gb' in info:
        print(f"GPU Allocated: {info['gpu_allocated_gb']:.1f} GB")
        print(f"GPU Reserved: {info['gpu_reserved_gb']:.1f} GB")
        print(f"GPU Total: {info['gpu_total_gb']:.1f} GB")
    print(f"{'='*60}\n")
    return info

# Track memory throughout the notebook
memory_log = []

def log_memory(label: str):
    """Log memory status for later analysis."""
    info = print_memory_status(label)
    info['label'] = label
    memory_log.append(info)

# Initial status
log_memory("Initial (after cache clear)")

In [None]:
# System information
import torch

print("System Information:")
print(f"  PyTorch version: {torch.__version__}")
print(f"  CUDA available: {torch.cuda.is_available()}")

if torch.cuda.is_available():
    print(f"  GPU: {torch.cuda.get_device_name()}")
    props = torch.cuda.get_device_properties(0)
    print(f"  Total GPU Memory: {props.total_memory / 1e9:.1f} GB")
    print(f"  CUDA Capability: {props.major}.{props.minor}")
    print(f"  Multiprocessors: {props.multi_processor_count}")

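# Set environment variables for optimal performance.
# Note: PyTorch reads PYTORCH_CUDA_ALLOC_CONF when the CUDA allocator is first
# initialized, and the cells above have already touched CUDA. If the setting
# does not seem to apply, restart the kernel and run these lines before any
# CUDA call.
import os
os.environ['PYTORCH_CUDA_ALLOC_CONF'] = 'expandable_segments:True'
os.environ['TOKENIZERS_PARALLELISM'] = 'false'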

---

## Part 2: Loading the 70B Model with QLoRA

This is the moment of truth! We'll load a 70-billion parameter model into memory.

In [None]:
import warnings
# Suppress verbose warnings from transformers/PEFT/bitsandbytes that clutter output
# These warnings don't affect functionality - we filter them for cleaner notebooks
warnings.filterwarnings('ignore')

import time  # Required for timing model loading and training

from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
)
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from datasets import Dataset
from trl import SFTTrainer

# Try Unsloth for faster training
try:
    from unsloth import FastLanguageModel
    USE_UNSLOTH = True
    print("Unsloth available - will use for faster training!")
except ImportError:
    USE_UNSLOTH = False
    print("Using standard transformers (Unsloth not available)")

In [None]:
# Model configuration for 70B
MODEL_NAME = "meta-llama/Llama-3.1-70B-Instruct"
# Alternative: "unsloth/Llama-3.1-70B-Instruct-bnb-4bit" (pre-quantized)

MAX_SEQ_LENGTH = 2048  # Longer sequences (e.g. 4096) work but use more activation memory

print(f"Model: {MODEL_NAME}")
print(f"Max sequence length: {MAX_SEQ_LENGTH}")
print("\nThis is a 70B parameter model - loading will take several minutes...")

In [None]:
# QLoRA quantization configuration
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                    # 4-bit quantization
    bnb_4bit_quant_type="nf4",            # NormalFloat4 - optimal for normally distributed weights
    bnb_4bit_compute_dtype=torch.bfloat16, # Compute in bfloat16 for Blackwell
    bnb_4bit_use_double_quant=True,       # Quantize the quantization constants too!
)

print("QLoRA Configuration:")
print(f"  4-bit quantization: {bnb_config.load_in_4bit}")
print(f"  Quantization type: {bnb_config.bnb_4bit_quant_type}")
print(f"  Compute dtype: {bnb_config.bnb_4bit_compute_dtype}")
print(f"  Double quantization: {bnb_config.bnb_4bit_use_double_quant}")
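
To make the idea behind 4-bit storage concrete, here is a toy sketch of blockwise absmax quantization in plain Python. It is deliberately simplified and is not the bitsandbytes implementation: it uses 16 evenly spaced code values, whereas NF4 uses 16 non-uniform values derived from normal-distribution quantiles. The function names are hypothetical.

```python
# 16 evenly spaced code values in [-1, 1]. NF4 uses 16 non-uniform values
# derived from normal-distribution quantiles instead (not reproduced here).
LEVELS = [-1 + 2 * i / 15 for i in range(16)]

def quantize_block(block):
    """Return (absmax, codes): one scale per block plus a 4-bit code per weight."""
    absmax = max(abs(w) for w in block) or 1.0  # avoid dividing by zero
    codes = [
        min(range(16), key=lambda i: abs(LEVELS[i] - w / absmax))
        for w in block
    ]
    return absmax, codes

def dequantize_block(absmax, codes):
    """Map 4-bit codes back to approximate weights."""
    return [LEVELS[c] * absmax for c in codes]

block = [0.02, -0.5, 0.31, 0.0, 0.44, -0.12]
absmax, codes = quantize_block(block)
restored = dequantize_block(absmax, codes)
max_err = max(abs(a - b) for a, b in zip(block, restored))
print(f"absmax={absmax}, codes={codes}, max error={max_err:.4f}")
```

Each block stores one scale (the absmax) plus a 4-bit code per weight. Double quantization then compresses those per-block scales as well.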

In [None]:
# Load the 70B model - this is the big moment!
print("\n" + "="*60)
print("LOADING 70B MODEL - This will take 5-15 minutes...")
print("="*60 + "\n")

load_start = time.time()
log_memory("Before loading model")

if USE_UNSLOTH:
    # Unsloth path - faster loading and training
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name=MODEL_NAME,
        max_seq_length=MAX_SEQ_LENGTH,
        dtype=torch.bfloat16,
        load_in_4bit=True,
    )
else:
    # Standard transformers path
    print("Loading tokenizer...")
    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    tokenizer.pad_token = tokenizer.eos_token
    tokenizer.padding_side = "right"
    
    print("Loading model with QLoRA quantization...")
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_NAME,
        quantization_config=bnb_config,
        device_map="auto",
        torch_dtype=torch.bfloat16,
        low_cpu_mem_usage=True,  # Important for large models
        # Note: trust_remote_code not needed for official Llama models
    )
    
    # Prepare for k-bit training
    model = prepare_model_for_kbit_training(model)

load_time = time.time() - load_start
print(f"\nModel loaded in {load_time/60:.1f} minutes!")

log_memory("After loading model")

In [None]:
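# Verify the model loaded correctly
print("Model Information:")
print(f"  Model type: {type(model).__name__}")
# Note: bitsandbytes packs two 4-bit weights into each stored byte, so numel()
# on the quantized layers reports roughly half of the nominal 70B count.
print(f"  Stored parameter elements: {sum(p.numel() for p in model.parameters()):,}")
print(f"  Tokenizer vocabulary: {len(tokenizer):,}")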

# Quick inference test
print("\nQuick inference test...")
test_input = tokenizer("The capital of France is", return_tensors="pt").to(model.device)
with torch.no_grad():
    output = model.generate(**test_input, max_new_tokens=10, do_sample=False)
print(f"Test output: {tokenizer.decode(output[0], skip_special_tokens=True)}")

### Celebrate!

You just loaded a **70 billion parameter model** on a single desktop GPU! This is remarkable - let's document the memory usage.

In [None]:
# Document the memory achievement
mem_info = get_memory_info()

print("\n" + "="*60)
print("DGX SPARK ACHIEVEMENT: 70B Model Loaded!")
print("="*60)
print(f"\nGPU Memory Used: {mem_info.get('gpu_allocated_gb', 0):.1f} GB")
print(f"GPU Memory Reserved: {mem_info.get('gpu_reserved_gb', 0):.1f} GB")
print(f"System RAM Available: {mem_info.get('system_available_gb', 0):.1f} GB")
print("\nThis would require:")
print("  - 4x A100 80GB GPUs ($60,000+)")
print("  - 8x RTX 4090 GPUs ($16,000+)")
print("  - Cloud rental at $20-50/hour")
print("\nBut you're doing it on a single DGX Spark!")
print("="*60)

---

## Part 3: Adding LoRA Adapters

Now we add the trainable LoRA adapters. For 70B, we'll use a conservative configuration.

In [None]:
# LoRA configuration for 70B - optimized for memory
LORA_CONFIG_70B = {
    "r": 16,                # Rank - 16 is good balance for 70B
    "lora_alpha": 32,       # Scaling factor
    "lora_dropout": 0.05,
    "target_modules": [
        "q_proj",           # Query projection
        "k_proj",           # Key projection
        "v_proj",           # Value projection
        "o_proj",           # Output projection
        # For 70B, we skip MLP layers to save memory
        # Uncomment if you have memory headroom:
        # "gate_proj", "up_proj", "down_proj",
    ],
    "bias": "none",
    "task_type": "CAUSAL_LM",
}

print("LoRA Configuration for 70B:")
for k, v in LORA_CONFIG_70B.items():
    print(f"  {k}: {v}")
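
The number of trainable weights this configuration adds can be estimated by hand. The sketch below assumes the published Llama 3.1 70B dimensions (hidden size 8192, 80 layers, 8 key/value heads of dimension 128); treat these as assumptions and check `model.config` on your install.

```python
# Assumed Llama 3.1 70B dimensions (verify against model.config)
HIDDEN = 8192        # hidden_size
N_LAYERS = 80        # num_hidden_layers
KV_DIM = 8 * 128     # num_key_value_heads * head_dim (grouped-query attention)
RANK = 16            # r from LORA_CONFIG_70B

# (in_features, out_features) of each targeted projection
shapes = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
}

# Each adapter adds A (r x in) and B (out x r), i.e. r * (in + out) weights
per_layer = sum(RANK * (i + o) for i, o in shapes.values())
lora_params = per_layer * N_LAYERS

print(f"LoRA parameters per layer: {per_layer:,}")
print(f"LoRA parameters total:     {lora_params:,}")
print(f"Share of 70e9 base weights: {100 * lora_params / 70e9:.3f}%")
```

Adding the MLP projections (`gate_proj`, `up_proj`, `down_proj`) would increase this count considerably, which is the memory trade-off the configuration comment refers to.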

In [None]:
# Apply LoRA to the model
log_memory("Before adding LoRA")

if USE_UNSLOTH:
    model = FastLanguageModel.get_peft_model(
        model,
        r=LORA_CONFIG_70B["r"],
        lora_alpha=LORA_CONFIG_70B["lora_alpha"],
        lora_dropout=LORA_CONFIG_70B["lora_dropout"],
        target_modules=LORA_CONFIG_70B["target_modules"],
        bias="none",
        use_gradient_checkpointing="unsloth",
        random_state=42,
    )
else:
    lora_config = LoraConfig(**LORA_CONFIG_70B)
    model = get_peft_model(model, lora_config)
    model.enable_input_require_grads()

log_memory("After adding LoRA")

# Print parameter counts.
# PEFT's helper corrects for 4-bit packing (two weights per stored byte);
# summing p.numel() directly would under-count the quantized base weights.
trainable_params, total_params = model.get_nb_trainable_parameters()

print(f"\nParameter Summary:")
print(f"  Total parameters: {total_params:,}")
print(f"  Trainable parameters: {trainable_params:,}")
print(f"  Trainable %: {100 * trainable_params / total_params:.4f}%")

---

## Part 4: Dataset Preparation

We'll use a small sample dataset for demonstration. In practice, use your domain-specific data.

In [None]:
# Sample training data - replace with your own!
training_examples = [
    {
        "instruction": "Explain quantum computing in simple terms.",
        "response": "Quantum computing uses quantum mechanics to process information in fundamentally different ways than classical computers. While regular computers use bits (0 or 1), quantum computers use qubits that can be 0, 1, or both at once (superposition). This allows them to explore many solutions simultaneously, making them potentially powerful for specific problems like cryptography and molecular simulation."
    },
    {
        "instruction": "What are the key differences between GPT and BERT architectures?",
        "response": "GPT (Generative Pre-trained Transformer) and BERT (Bidirectional Encoder Representations from Transformers) differ in three main ways: 1) Direction - GPT is unidirectional (left-to-right), while BERT is bidirectional, processing text in both directions. 2) Training objective - GPT predicts the next token, while BERT uses masked language modeling. 3) Use cases - GPT excels at text generation, while BERT is better for understanding and classification tasks."
    },
    {
        "instruction": "How does backpropagation work in neural networks?",
        "response": "Backpropagation calculates how to adjust neural network weights to reduce prediction error. It works in two phases: 1) Forward pass - input flows through the network to produce output. 2) Backward pass - the error is propagated backwards, computing gradients using the chain rule. Each weight is updated proportionally to how much it contributed to the error, allowing the network to learn from mistakes."
    },
    {
        "instruction": "What is the attention mechanism and why is it important?",
        "response": "The attention mechanism allows neural networks to focus on relevant parts of the input when producing output. Instead of compressing all input into a fixed-size vector, attention computes weighted sums where weights indicate importance. This enables models to handle long sequences, capture long-range dependencies, and provides interpretability by showing what the model 'attends to'. It's the foundation of transformers and modern NLP."
    },
    {
        "instruction": "Explain the concept of transfer learning.",
        "response": "Transfer learning reuses knowledge from one task to help with another related task. Instead of training from scratch, you take a model pre-trained on a large dataset (like ImageNet or text corpora) and adapt it to your specific task. This works because early layers learn general features applicable to many tasks. Benefits include: faster training, better performance with limited data, and reduced computational requirements."
    },
    {
        "instruction": "What is the difference between L1 and L2 regularization?",
        "response": "L1 (Lasso) and L2 (Ridge) regularization both prevent overfitting by penalizing large weights, but differently: L1 adds the absolute value of weights to the loss (|w|), encouraging sparse models with some weights exactly zero - useful for feature selection. L2 adds squared weights (w²), encouraging small but non-zero weights - better when all features are relevant. L1 produces simpler models; L2 handles correlated features better."
    },
    {
        "instruction": "How do transformers handle positional information?",
        "response": "Transformers use positional encoding because self-attention is permutation-invariant - it doesn't inherently know token order. Two main approaches: 1) Sinusoidal positional encoding adds sine/cosine functions of different frequencies to embeddings, allowing the model to learn relative positions. 2) Learned positional embeddings train position vectors directly. Modern variants include RoPE (rotary position embedding) and ALiBi, which handle longer sequences better."
    },
    {
        "instruction": "What are the main challenges in training very large language models?",
        "response": "Training large language models faces several challenges: 1) Memory - models with billions of parameters require distributed training across many GPUs. 2) Compute cost - training GPT-4 scale models costs millions of dollars in compute. 3) Data quality - need massive high-quality datasets. 4) Stability - larger models can have training instabilities requiring careful hyperparameter tuning. 5) Evaluation - it's hard to comprehensively evaluate capabilities and limitations."
    },
]

print(f"Created {len(training_examples)} training examples")

In [None]:
# Format for Llama 3.1 Instruct
def format_example_llama3(example: dict) -> str:
    """Format a single example in Llama 3.1 chat format."""
    return f"""<|begin_of_text|><|start_header_id|>system<|end_header_id|>

You are an expert AI assistant specializing in machine learning and deep learning. Provide clear, accurate, and educational explanations.<|eot_id|><|start_header_id|>user<|end_header_id|>

{example['instruction']}<|eot_id|><|start_header_id|>assistant<|end_header_id|>

{example['response']}<|eot_id|>"""

# Create dataset
formatted_data = [{"text": format_example_llama3(ex)} for ex in training_examples]
dataset = Dataset.from_list(formatted_data)

# For 70B, we'll use all data for training (small dataset)
train_dataset = dataset

print(f"Training examples: {len(train_dataset)}")
print(f"\nSample formatted example:")
print("="*50)
print(formatted_data[0]['text'][:500] + "...")
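
The same layout generalizes to any list of chat messages. The helper below is a hypothetical sketch that mirrors this notebook's hand-written format. In practice `tokenizer.apply_chat_template` from transformers is the usual route, and the official template may insert extra system text, so its output can differ from this one.

```python
def render_llama3_chat(messages, add_generation_prompt=False):
    """Render [{'role': ..., 'content': ...}] in the notebook's Llama 3 layout."""
    text = "<|begin_of_text|>"
    for msg in messages:
        text += (
            f"<|start_header_id|>{msg['role']}<|end_header_id|>\n\n"
            f"{msg['content']}<|eot_id|>"
        )
    if add_generation_prompt:
        # Open an assistant turn for the model to complete at inference time
        text += "<|start_header_id|>assistant<|end_header_id|>\n\n"
    return text

example = render_llama3_chat([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is LoRA?"},
    {"role": "assistant", "content": "A low-rank adapter method."},
])
print(example)
```

Training examples include the assistant turn, while inference prompts stop after the assistant header (`add_generation_prompt=True`) so the model writes the reply.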

---

## Part 5: Training Configuration

We need to be careful with training parameters for 70B to manage memory.

In [None]:
# Training configuration optimized for 70B on DGX Spark
OUTPUT_DIR = "./llama3-70b-qlora-finetuned"

training_args = TrainingArguments(
    output_dir=OUTPUT_DIR,
    
    # Batch size - CRITICAL for memory!
    per_device_train_batch_size=1,        # Keep at 1 for 70B
    gradient_accumulation_steps=8,         # Effective batch = 8
    
    # Training duration
    num_train_epochs=1,                    # Ignored while max_steps is positive
    max_steps=50,                          # Takes precedence; ~50 passes over the 8 demo examples
    
    # Optimizer - use 8-bit Adam to save memory
    learning_rate=1e-4,                    # Slightly lower for 70B
    optim="adamw_8bit",                    # 8-bit optimizer!
    weight_decay=0.01,
    warmup_ratio=0.1,
    lr_scheduler_type="cosine",
    
    # Memory optimization - CRITICAL
    bf16=True,                             # Use bfloat16
    fp16=False,
    gradient_checkpointing=True,           # Trade compute for memory
    
    # Logging
    logging_steps=5,
    save_strategy="steps",
    save_steps=25,
    save_total_limit=1,                    # Save only 1 checkpoint
    
    # Other
    seed=42,
    report_to="none",
    dataloader_pin_memory=False,           # Reduce memory overhead
)

print("Training Configuration for 70B:")
print(f"  Batch size: {training_args.per_device_train_batch_size}")
print(f"  Gradient accumulation: {training_args.gradient_accumulation_steps}")
print(f"  Effective batch size: {training_args.per_device_train_batch_size * training_args.gradient_accumulation_steps}")
print(f"  Learning rate: {training_args.learning_rate}")
print(f"  Optimizer: {training_args.optim}")
print(f"  Max steps: {training_args.max_steps}")
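
It is worth working out what these settings mean for a dataset this small. The sketch below assumes a single device and the 8 demo examples from Part 4; the variable names are illustrative.

```python
import math

num_examples = 8          # len(train_dataset) in this demo
per_device_batch = 1
grad_accum = 8
max_steps = 50

effective_batch = per_device_batch * grad_accum
# One optimizer step consumes effective_batch examples
steps_per_epoch = max(1, math.ceil(num_examples / effective_batch))
passes_over_data = max_steps / steps_per_epoch

print(f"Effective batch size: {effective_batch}")
print(f"Optimizer steps per epoch: {steps_per_epoch}")
print(f"Passes over the data in {max_steps} steps: {passes_over_data:.0f}")
```

With real datasets of thousands of examples the same 50 steps would cover only a fraction of one epoch, so adjust `max_steps` or rely on `num_train_epochs` accordingly.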

In [None]:
# Create trainer
log_memory("Before creating trainer")

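# Note: these keyword arguments follow older TRL releases. Newer releases move
# dataset_text_field, max_seq_length and packing into SFTConfig and rename
# tokenizer to processing_class (exact names vary by version), so adjust
# this call to match your installed TRL.
trainer = SFTTrainer(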
    model=model,
    tokenizer=tokenizer,
    train_dataset=train_dataset,
    dataset_text_field="text",
    max_seq_length=MAX_SEQ_LENGTH,
    args=training_args,
    packing=False,
)

log_memory("After creating trainer")
print("Trainer created successfully!")

---

## Part 6: Training!

This is it - training a 70B model on your desktop!

In [None]:
# TRAIN THE 70B MODEL!
print("\n" + "="*70)
print("TRAINING 70B MODEL - This is the DGX Spark showcase!")
print("="*70 + "\n")

log_memory("Before training")

train_start = time.time()

# Train!
train_result = trainer.train()

train_time = time.time() - train_start

log_memory("After training")

print("\n" + "="*70)
print(f"TRAINING COMPLETE!")
print(f"Total time: {train_time/60:.1f} minutes")
print("="*70)

In [None]:
# Training metrics
metrics = train_result.metrics
nan = float("nan")  # Numeric fallback so the format specs below cannot fail

print("\nTraining Metrics:")
# The step count lives on the TrainOutput itself, not in the metrics dict
print(f"  Total steps: {train_result.global_step}")
print(f"  Training loss: {metrics.get('train_loss', nan):.4f}")
print(f"  Runtime: {metrics.get('train_runtime', nan):.1f} seconds")
print(f"  Samples/second: {metrics.get('train_samples_per_second', nan):.3f}")
print(f"  Steps/second: {metrics.get('train_steps_per_second', nan):.3f}")

---

## Part 7: Testing the Fine-Tuned Model

In [None]:
# Put model in inference mode
if USE_UNSLOTH:
    FastLanguageModel.for_inference(model)
model.eval()

def generate_70b_response(prompt: str, max_new_tokens: int = 200) -> str:
    """Generate response from the fine-tuned 70B model."""
    formatted = f"""<|begin_of_text|><|start_header_id|>system<|end_header_id|>

You are an expert AI assistant specializing in machine learning and deep learning.<|eot_id|><|start_header_id|>user<|end_header_id|>

{prompt}<|eot_id|><|start_header_id|>assistant<|end_header_id|>

"""
    
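    # The prompt already begins with <|begin_of_text|>, so skip the tokenizer's own BOS
    inputs = tokenizer(formatted, return_tensors="pt", add_special_tokens=False).to(model.device)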
    
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            temperature=0.7,
            top_p=0.9,
            do_sample=True,
            pad_token_id=tokenizer.eos_token_id,
        )
    
    generated = outputs[0][inputs['input_ids'].shape[1]:]
    return tokenizer.decode(generated, skip_special_tokens=True).strip()

In [None]:
# Test the fine-tuned model
test_prompts = [
    "What is the difference between CNN and RNN architectures?",
    "Explain how dropout works as regularization.",
    "What makes LoRA an efficient fine-tuning method?",
]

print("\n" + "="*70)
print("Testing Fine-Tuned 70B Model")
print("="*70)

for prompt in test_prompts:
    print(f"\nQ: {prompt}")
    print("-"*50)
    response = generate_70b_response(prompt)
    print(f"A: {response}")
    print("="*70)

---

## Part 8: Memory Analysis Summary

In [None]:
# Create memory usage visualization
import matplotlib.pyplot as plt

# Extract memory data
labels = [m['label'] for m in memory_log]
gpu_allocated = [m.get('gpu_allocated_gb', 0) for m in memory_log]
gpu_reserved = [m.get('gpu_reserved_gb', 0) for m in memory_log]

fig, ax = plt.subplots(figsize=(14, 6))

x = range(len(labels))
width = 0.35

bars1 = ax.bar([i - width/2 for i in x], gpu_allocated, width, label='Allocated', color='steelblue')
bars2 = ax.bar([i + width/2 for i in x], gpu_reserved, width, label='Reserved', color='lightsteelblue')

# Add 128GB line for reference
ax.axhline(y=128, color='red', linestyle='--', label='DGX Spark Capacity (128GB)')

ax.set_ylabel('GPU Memory (GB)')
ax.set_title('70B Model QLoRA Training: Memory Usage Over Time')
ax.set_xticks(x)
ax.set_xticklabels(labels, rotation=45, ha='right')
ax.legend()
ax.set_ylim(0, 140)

# Add value labels on bars
for bar in bars1:
    height = bar.get_height()
    ax.annotate(f'{height:.1f}',
                xy=(bar.get_x() + bar.get_width() / 2, height),
                xytext=(0, 3),
                textcoords="offset points",
                ha='center', va='bottom', fontsize=8)

plt.tight_layout()
plt.savefig('70b_memory_analysis.png', dpi=150, bbox_inches='tight')
plt.show()
plt.close(fig)  # Release memory

print("\nMemory usage saved to 70b_memory_analysis.png")
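
Alongside the chart, the change between consecutive snapshots shows which step cost the most memory. Below is a minimal sketch with a hypothetical helper, `memory_deltas`, that works on entries shaped like those in `memory_log`. The numbers in `example_log` are made up for illustration and are not measured results.

```python
def memory_deltas(log, key="gpu_allocated_gb"):
    """Return [(label, change_in_gb)] between consecutive log entries."""
    deltas = []
    for prev, cur in zip(log, log[1:]):
        deltas.append((cur["label"], cur.get(key, 0.0) - prev.get(key, 0.0)))
    return deltas

# Hypothetical numbers for illustration only, not measured results
example_log = [
    {"label": "Before loading model", "gpu_allocated_gb": 0.0},
    {"label": "After loading model", "gpu_allocated_gb": 38.0},
    {"label": "After adding LoRA", "gpu_allocated_gb": 38.5},
]

for label, delta in memory_deltas(example_log):
    print(f"{label}: {delta:+.1f} GB")
```

In the notebook itself, `memory_deltas(memory_log)` would give the real per-step changes.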

In [None]:
# Final summary
print("\n" + "="*70)
print("DGX SPARK 70B FINE-TUNING - FINAL SUMMARY")
print("="*70)

# Snapshots in memory_log can miss transient spikes during training, so read
# PyTorch's peak counters for the true high-water marks.
peak_alloc_gb = torch.cuda.max_memory_allocated() / 1e9
peak_reserved_gb = torch.cuda.max_memory_reserved() / 1e9
# train_loss is the average loss over the run; NaN fallback keeps :.4f valid
avg_loss = metrics.get('train_loss', float('nan'))

print(f"""
Model: {MODEL_NAME}
Total Parameters: ~70 billion
Trainable Parameters: {trainable_params:,} ({100*trainable_params/total_params:.4f}%)

Memory Usage:
  Peak GPU Allocated: ~{peak_alloc_gb:.1f} GB
  Peak GPU Reserved: ~{peak_reserved_gb:.1f} GB
  DGX Spark Capacity: 128 GB
  Headroom: ~{128 - peak_reserved_gb:.1f} GB

Training:
  Total Time: {train_time/60:.1f} minutes
  Steps Completed: {train_result.global_step}
  Average Loss: {avg_loss:.4f}

What This Would Cost Elsewhere:
  Cloud (4x A100): ~$20-50/hour
  Hardware (4x A100): ~$60,000+
  Consumer (8x 4090): ~$16,000+, plus multi-GPU sharding complexity

On DGX Spark: Your desktop. No cloud bills. No waiting.
""")
print("="*70)

---

## Part 9: Saving the Model

In [None]:
# Save LoRA adapter (small compared with the base model; exact size is printed below)
ADAPTER_PATH = "./llama3-70b-qlora-adapter"

model.save_pretrained(ADAPTER_PATH)
tokenizer.save_pretrained(ADAPTER_PATH)

print(f"LoRA adapter saved to {ADAPTER_PATH}")

# Check size
adapter_size = sum(
    os.path.getsize(os.path.join(ADAPTER_PATH, f))
    for f in os.listdir(ADAPTER_PATH)
    if os.path.isfile(os.path.join(ADAPTER_PATH, f))
)
print(f"Adapter size: {adapter_size / 1e6:.1f} MB")
print(f"\nTo load later:")
print(f"  1. Load base model with QLoRA config")
print(f"  2. Load adapter: PeftModel.from_pretrained(base_model, '{ADAPTER_PATH}')")
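
The size printed above can be cross-checked with a rough estimate. This sketch assumes about 65.5 million LoRA weights, which is what r=16 on the four attention projections gives for 80 layers with hidden size 8192 and 8 key/value heads of dimension 128. The file on disk also depends on the dtype the adapter was saved in, and the directory total includes the tokenizer files.

```python
def adapter_size_mb(n_params: int, bytes_per_param: int) -> float:
    """Approximate adapter weight size in MB (1e6 bytes)."""
    return n_params * bytes_per_param / 1e6

# Assumption: r=16 on q/k/v/o projections of an 80-layer, 8192-wide model
N_LORA_PARAMS = 65_536_000

for dtype_name, nbytes in [("bfloat16", 2), ("float32", 4)]:
    size = adapter_size_mb(N_LORA_PARAMS, nbytes)
    print(f"{dtype_name}: ~{size:.0f} MB")
```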

---

## Common Mistakes with 70B Models

### Mistake 1: Not Clearing Buffer Cache

```python
# ❌ Wrong: Loading without clearing cache
model = AutoModelForCausalLM.from_pretrained(...)  # OOM!

# ✅ Right: Clear cache first
subprocess.run(["sudo", "sh", "-c", "sync; echo 3 > /proc/sys/vm/drop_caches"])
model = AutoModelForCausalLM.from_pretrained(...)
```

### Mistake 2: Batch Size Too Large

```python
# ❌ Wrong: Using 8B settings
per_device_train_batch_size = 4  # OOM on 70B!

# ✅ Right: Minimal batch size with accumulation
per_device_train_batch_size = 1
gradient_accumulation_steps = 8
```

### Mistake 3: Not Using 8-bit Optimizer

```python
# ❌ Wrong: Standard optimizer
optim = "adamw_torch"  # Uses too much memory for 70B

# ✅ Right: 8-bit optimizer
optim = "adamw_8bit"  # Significantly reduces memory
```

---

## Checkpoint

You've achieved:
- ✅ Loaded a 70B parameter model with QLoRA quantization
- ✅ Documented memory usage throughout the process
- ✅ Fine-tuned the model on custom data
- ✅ Saved the fine-tuned adapter for later use
- ✅ Understood what makes DGX Spark special for this task

**Congratulations!** You've done something that most ML practitioners never get to experience - fine-tuning a 70B model on desktop hardware!

---

## Cleanup

In [None]:
# Cleanup
del model, tokenizer, trainer
gc.collect()
torch.cuda.empty_cache()

print_memory_status("After cleanup")
print("Cleanup complete!")

---

## Next Steps

Now that you've mastered 70B fine-tuning:

1. **[Task 10.4: Dataset Preparation](04-dataset-preparation.ipynb)** - Create professional instruction datasets
2. **[Task 10.5: DPO Training](05-dpo-training.ipynb)** - Improve model quality with preference data
3. **[Task 10.7: Ollama Integration](07-ollama-integration.ipynb)** - Deploy your fine-tuned model locally

---

## Further Reading

- [QLoRA Paper](https://arxiv.org/abs/2305.14314)
- [Llama 3.1 Technical Report](https://ai.meta.com/research/publications/the-llama-3-herd-of-models/)
- [bitsandbytes Documentation](https://github.com/TimDettmers/bitsandbytes)