# Complete Guide to Fine-Tuning Llama 3 Models (2025 Edition)

This notebook provides a comprehensive walkthrough of fine-tuning Llama 3 models using the latest best practices and top frameworks.

## What You'll Learn

1. **When to fine-tune** vs. when to use RAG or prompting
2. **Top 5 fine-tuning frameworks** compared and demonstrated
3. **Best practices** for efficient training (LoRA, QLoRA, memory optimization)
4. **Hardware requirements** for different model sizes
5. **Synthetic data generation** techniques

## Frameworks Covered

| Framework | Best For | Key Feature |
|-----------|----------|-------------|
| **Unsloth** | Speed & Memory Efficiency | 2-5x faster, 70% less VRAM |
| **torchtune** | PyTorch Native | Official Meta support |
| **TRL** | Hugging Face Ecosystem | SFTTrainer, RLHF support |
| **Axolotl** | Flexibility | YAML config, many techniques |
| **LLaMA-Factory** | All-in-one | WebUI, 100+ models |

## When to Fine-Tune vs. Alternatives

### Use Fine-Tuning When:
- You need consistent formatting or style
- Domain-specific knowledge is required
- You want to reduce inference costs (a smaller fine-tuned model can match a larger base model on a narrow task)
- Privacy: You can't send data to external APIs

### Use RAG Instead When:
- Knowledge changes frequently
- You need source attribution
- You have a large, evolving knowledge base

### Use Prompt Engineering When:
- Quick iteration is needed
- Limited training data available
- Task is well-defined with few examples

## Hardware Requirements

| Model Size | Full Fine-Tune (VRAM) | LoRA (VRAM) | QLoRA 4-bit (VRAM) |
|------------|-----------------------|-------------|--------------------|
| 1B | 8GB | 6GB | 4GB |
| 3B | 16GB | 10GB | 6GB |
| 8B | 40GB | 16GB | 8GB |
| 70B | 320GB | 80GB | 24GB |

**Recommendation**: Start with QLoRA on consumer GPUs (RTX 3090/4090 with 24GB), then scale up if needed.
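
The table can be turned into a quick lookup. The sketch below hard-codes the approximate figures from the table and returns the least-compressed method that fits a given VRAM budget. The function name `suggest_method` and the preference order (full fine-tune, then LoRA, then QLoRA) are illustrative assumptions; actual memory use also depends on sequence length, batch size, and optimizer choice.

```python
# Approximate VRAM needs in GB, copied from the hardware table above.
VRAM_GB = {
    "1B": {"Full fine-tune": 8, "LoRA": 6, "QLoRA (4-bit)": 4},
    "3B": {"Full fine-tune": 16, "LoRA": 10, "QLoRA (4-bit)": 6},
    "8B": {"Full fine-tune": 40, "LoRA": 16, "QLoRA (4-bit)": 8},
    "70B": {"Full fine-tune": 320, "LoRA": 80, "QLoRA (4-bit)": 24},
}


def suggest_method(model_size, available_gb):
    """Return the first method (full, then LoRA, then QLoRA) that fits, else None."""
    for method, needed_gb in VRAM_GB[model_size].items():
        if needed_gb <= available_gb:
            return method
    return None


# Example: what fits on a single 24GB consumer GPU?
for size in VRAM_GB:
    print(size, "->", suggest_method(size, 24))
```

A result of `None` means none of the three methods fits on one device of that size according to the table.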

---
# Part 1: Environment Setup

Let's install the required packages for all frameworks we'll explore.

In [None]:
# Core packages for all frameworks
!pip install -q transformers datasets accelerate peft bitsandbytes

# TRL for Hugging Face training
!pip install -q trl

# Unsloth for fast training (install separately as it has specific requirements)
# !pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"

# torchtune (PyTorch native)
!pip install -q torchtune

# For evaluation and visualization
!pip install -q wandb matplotlib

In [None]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from datasets import load_dataset
from trl import SFTTrainer, SFTConfig
import os

# Check GPU availability
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"VRAM: {torch.cuda.get_device_properties(0).total_memory / 1024**3:.1f} GB")

---
# Part 2: Framework 1 - Unsloth (Recommended for Speed)

[Unsloth](https://github.com/unslothai/unsloth) is a fine-tuning library focused on speed and memory efficiency; the project reports a **2-5x speedup** with **70% less memory**.

**Key Features:**
- No accuracy degradation, according to the project
- Automatic Ollama export
- Supports Llama 3.x, Gemma, Mistral, Qwen, and more

In [None]:
# Install Unsloth (run this cell if you want to use Unsloth)
# For Colab:
# !pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"

# For local installation:
# !pip install unsloth

In [None]:
# Unsloth Fine-Tuning Example
# Remove the surrounding triple quotes to run (requires Unsloth)

'''
from unsloth import FastLanguageModel

# Configuration
max_seq_length = 2048
dtype = None  # Auto-detect
load_in_4bit = True  # Use 4-bit quantization

# Load model with Unsloth optimizations
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Llama-3.2-3B-Instruct-bnb-4bit",
    max_seq_length=max_seq_length,
    dtype=dtype,
    load_in_4bit=load_in_4bit,
)

# Add LoRA adapters
model = FastLanguageModel.get_peft_model(
    model,
    r=16,  # LoRA rank - higher = more parameters
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_alpha=16,
    lora_dropout=0,
    bias="none",
    use_gradient_checkpointing="unsloth",  # Unsloth optimization
    random_state=42,
)

# print_trainable_parameters() prints its own summary and returns None
model.print_trainable_parameters()
'''

---
# Part 3: Framework 2 - TRL + Hugging Face (Most Popular)

[TRL](https://huggingface.co/docs/trl) is Hugging Face's library for training language models with reinforcement learning and supervised fine-tuning.

**Key Features:**
- SFTTrainer for supervised fine-tuning
- PPO, DPO, ORPO for alignment
- Seamless integration with transformers ecosystem

In [None]:
# TRL Fine-Tuning with QLoRA

# Model configuration
model_id = "meta-llama/Llama-3.2-3B-Instruct"  # Or use a smaller model for testing

# 4-bit quantization config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

# LoRA configuration
lora_config = LoraConfig(
    r=16,  # LoRA rank
    lora_alpha=32,  # Alpha scaling
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

In [None]:
# Load a sample dataset
# Using a small subset for demonstration
dataset = load_dataset("HuggingFaceH4/no_robots", split="train[:1000]")

print(f"Dataset size: {len(dataset)}")
print(f"\nExample:")
print(dataset[0])

In [None]:
# Format dataset for chat template
def format_chat_template(example):
    """Format the dataset for Llama 3 chat template."""
    messages = example.get("messages", [])
    if not messages:
        # Handle different dataset formats
        messages = [
            {"role": "user", "content": example.get("prompt", "")},
            {"role": "assistant", "content": example.get("completion", "")}
        ]
    return {"messages": messages}

# Apply formatting
formatted_dataset = dataset.map(format_chat_template)
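
To make the target format concrete, here is a minimal sketch of how a `messages` list maps onto the Llama 3 header and end-of-turn tokens. The helper `render_llama3_chat` is hypothetical and hand-rolled for illustration only. In real training, rely on `tokenizer.apply_chat_template`, since the template shipped with a given checkpoint may add more (for example a default system block).

```python
def render_llama3_chat(messages, add_generation_prompt=False):
    """Hand-rolled approximation of the Llama 3 chat layout (illustration only)."""
    text = "<|begin_of_text|>"
    for message in messages:
        # Each turn is a role header, a blank line, the content, then <|eot_id|>.
        text += f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
        text += f"{message['content']}<|eot_id|>"
    if add_generation_prompt:
        # An open assistant header asks the model to produce the next turn.
        text += "<|start_header_id|>assistant<|end_header_id|>\n\n"
    return text


example = [
    {"role": "user", "content": "Hi"},
    {"role": "assistant", "content": "Hello!"},
]
print(render_llama3_chat(example))
```

Training examples include the assistant turn, as in `example`; at inference time the prompt ends with the open assistant header instead.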

In [None]:
# TRL Training Configuration
# Note: This is configured for demonstration. Remove the surrounding triple quotes to run actual training.

'''
from trl import SFTTrainer, SFTConfig

# Load model with quantization
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

# Prepare model for training
model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, lora_config)

# Training arguments
training_args = SFTConfig(
    output_dir="./llama3-finetuned",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    logging_steps=10,
    save_strategy="epoch",
    bf16=True,
    max_seq_length=2048,  # Renamed to max_length in recent TRL releases
    packing=True,  # Pack multiple samples into one sequence
)

# Create trainer
trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=formatted_dataset,
    tokenizer=tokenizer,  # Recent TRL releases call this argument processing_class
)

# Start training
trainer.train()

# Save the model
trainer.save_model("./llama3-finetuned-final")
'''

---
# Part 4: Framework 3 - torchtune (PyTorch Native)

[torchtune](https://github.com/pytorch/torchtune) is PyTorch's official library for LLM fine-tuning, with deep ecosystem integration.

**Key Features:**
- Official Meta support for Llama models
- YAML-based configuration
- Multi-GPU and multi-node training
- Supports NVIDIA, AMD, Intel, and Apple Silicon

In [None]:
# torchtune uses CLI commands for training
# Here are the key commands:

# 1. Download a model
# !tune download meta-llama/Llama-3.2-3B-Instruct --output-dir ./models/llama-3.2-3b

# 2. Single-GPU LoRA fine-tuning
# !tune run lora_finetune_single_device --config llama3_2/3B_lora_single_device

# 3. Multi-GPU full fine-tuning
# !tune run --nproc_per_node 2 full_finetune_distributed --config llama3_2/3B_full

# 4. QLoRA for memory-constrained environments
# !tune run lora_finetune_single_device --config llama3_2/3B_qlora_single_device

print("torchtune commands shown above. Uncomment to run.")

In [None]:
# Example torchtune YAML configuration
# Save this as 'custom_config.yaml' to use with torchtune

torchtune_config = '''
# Custom torchtune configuration for Llama 3.2 3B
model:
  _component_: torchtune.models.llama3_2.lora_llama3_2_3b
  lora_attn_modules: ['q_proj', 'k_proj', 'v_proj', 'output_proj']  # torchtune names the attention output projection output_proj
  lora_rank: 16
  lora_alpha: 32

tokenizer:
  _component_: torchtune.models.llama3.llama3_tokenizer
  path: ./models/llama-3.2-3b/tokenizer.model

checkpointer:
  _component_: torchtune.training.FullModelHFCheckpointer
  checkpoint_dir: ./models/llama-3.2-3b
  output_dir: ./outputs/llama-3.2-3b-finetuned

dataset:
  _component_: torchtune.datasets.alpaca_dataset
  source: tatsu-lab/alpaca

optimizer:
  _component_: torch.optim.AdamW
  lr: 2e-5
  weight_decay: 0.01

lr_scheduler:
  _component_: torchtune.training.lr_schedulers.get_cosine_schedule_with_warmup
  num_warmup_steps: 100

# Training settings (torchtune recipes read these as top-level keys)
epochs: 3
batch_size: 4
gradient_accumulation_steps: 4
compile: false  # Set to true for faster training with torch.compile
'''

print(torchtune_config)
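
Printing the configuration does not put it on disk, and the CLI tools read a file. A small helper for saving a config string might look like the sketch below. The helper name `save_config` and the `configs/` path are hypothetical, and `demo_config` is a stand-in so the cell runs on its own; in the notebook you would pass `torchtune_config` and `custom_config.yaml` instead.

```python
from pathlib import Path


def save_config(text, path):
    """Write a config string to disk, creating parent directories, and return the path."""
    target = Path(path)
    target.parent.mkdir(parents=True, exist_ok=True)
    # Strip the blank first line left by the triple-quoted string.
    target.write_text(text.strip() + "\n", encoding="utf-8")
    return target


# Stand-in config so this example is self-contained.
demo_config = """
epochs: 3
batch_size: 4
"""

saved = save_config(demo_config, "configs/demo_config.yaml")
print(saved.read_text(encoding="utf-8"))
```

Once the real file is saved, a run such as `tune run lora_finetune_single_device --config custom_config.yaml` can point at it.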

---
# Part 5: Framework 4 - Axolotl (Maximum Flexibility)

[Axolotl](https://github.com/axolotl-ai-cloud/axolotl) is known for its flexibility and community support.

**Key Features:**
- YAML configuration for reproducibility
- Supports many training techniques (LoRA, QLoRA, full fine-tune)
- Rapid adoption of new models and methods
- Great for beginners

In [None]:
# Axolotl YAML configuration example
# Save as 'axolotl_config.yaml'

axolotl_config = '''
base_model: meta-llama/Llama-3.2-3B-Instruct
model_type: LlamaForCausalLM
tokenizer_type: AutoTokenizer

load_in_8bit: false
load_in_4bit: true

# LoRA configuration
adapter: lora
lora_r: 16
lora_alpha: 32
lora_dropout: 0.05
lora_target_modules:
  - q_proj
  - k_proj
  - v_proj
  - o_proj
  - gate_proj
  - up_proj
  - down_proj

# Dataset
datasets:
  - path: HuggingFaceH4/no_robots
    type: chat_template
    chat_template: llama3

# Training parameters
sequence_len: 2048
sample_packing: true
pad_to_sequence_len: true

gradient_accumulation_steps: 4
micro_batch_size: 4
num_epochs: 3
learning_rate: 2e-4
lr_scheduler: cosine
warmup_ratio: 0.1

optimizer: adamw_bnb_8bit
bf16: true
tf32: true

gradient_checkpointing: true
flash_attention: true

output_dir: ./outputs/llama-3.2-3b-axolotl
'''

print(axolotl_config)

# To run Axolotl:
# !pip install axolotl
# !accelerate launch -m axolotl.cli.train axolotl_config.yaml

---
# Part 6: Framework 5 - LLaMA-Factory (All-in-One)

[LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory) provides a unified interface for training 100+ LLMs with web UI support.

**Key Features:**
- WebUI for no-code fine-tuning
- Supports 100+ models
- Integrated methods: SFT, RLHF, DPO, PPO
- Unsloth integration for speed

In [None]:
# LLaMA-Factory installation and usage

# Install
# !git clone --depth 1 https://github.com/hiyouga/LLaMA-Factory.git
# !cd LLaMA-Factory && pip install -e ".[torch,metrics]"

# Launch WebUI (easiest way to start)
# !cd LLaMA-Factory && llamafactory-cli webui

# Or use CLI for training
# !cd LLaMA-Factory && llamafactory-cli train examples/train_lora/llama3_lora_sft.yaml

print("LLaMA-Factory commands shown above. Uncomment to run.")

In [None]:
# LLaMA-Factory configuration example
llamafactory_config = '''
### Model
model_name_or_path: meta-llama/Llama-3.2-3B-Instruct

### Method
stage: sft
do_train: true
finetuning_type: lora
lora_rank: 16
lora_alpha: 32
lora_target: q_proj,k_proj,v_proj,o_proj,gate_proj,up_proj,down_proj

### Dataset
dataset: alpaca_en
template: llama3
cutoff_len: 2048

### Training
output_dir: outputs/llama3-lora
per_device_train_batch_size: 4
gradient_accumulation_steps: 4
learning_rate: 2.0e-4
num_train_epochs: 3.0
lr_scheduler_type: cosine
warmup_ratio: 0.1
bf16: true

### Optimization
quantization_bit: 4
gradient_checkpointing: true
'''

print(llamafactory_config)

---
# Part 7: Best Practices for Fine-Tuning

## Data Quality Matters Most

1. **Quality over Quantity**: 1,000 high-quality examples often beat 100,000 noisy ones
2. **Diversity**: Cover edge cases and variations in your dataset
3. **Formatting**: Use consistent formatting that matches inference

## LoRA Hyperparameter Guidelines

| Parameter | Recommended | Notes |
|-----------|-------------|-------|
| `lora_r` | 8-32 | Higher = more capacity, more memory |
| `lora_alpha` | 16-64 | Usually 2x lora_r |
| `lora_dropout` | 0.0-0.1 | 0 for small datasets |
| `target_modules` | All attention + MLP | q,k,v,o,gate,up,down |
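
The "more capacity, more memory" trade-off can be made concrete with a little arithmetic. In standard LoRA, a linear layer of shape `d_out x d_in` gains two small matrices with `r * (d_in + d_out)` parameters in total, and the update is scaled by `lora_alpha / r`. The sketch below assumes a hypothetical hidden size of 4096 purely for illustration; the function names are not from any library.

```python
def lora_param_count(d_in, d_out, r):
    """Parameters added by one LoRA adapter: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)


def lora_scaling(lora_alpha, r):
    """Scaling factor applied to the low-rank update in standard LoRA."""
    return lora_alpha / r


hidden = 4096  # hypothetical hidden size, for illustration only
for r in (8, 16, 32):
    added = lora_param_count(hidden, hidden, r)
    frozen = hidden * hidden
    print(f"r={r}: {added:,} adapter params vs {frozen:,} frozen ({added / frozen:.2%})")

print("scaling for lora_alpha=32, r=16:", lora_scaling(32, 16))
```

Doubling `r` doubles the adapter size, and keeping `lora_alpha` at twice `r` holds the scaling factor at 2.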

## Training Hyperparameters

| Parameter | Recommended | Notes |
|-----------|-------------|-------|
| Learning rate | 1e-4 to 2e-4 | Lower for larger models |
| Batch size | 4-8 per GPU | Use gradient accumulation |
| Epochs | 1-3 | Watch for overfitting |
| Warmup | 10% of steps | Prevents early instability |
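
These settings interact: the effective batch size is the per-GPU batch size times the gradient accumulation steps times the number of GPUs, and the warmup length follows from the total step count. The sketch below is a rough estimate under the assumption that each example is one sequence (no packing); `training_schedule` is a hypothetical helper, and trainers may round slightly differently.

```python
import math


def training_schedule(num_examples, per_device_batch, grad_accum,
                      num_gpus=1, epochs=3, warmup_ratio=0.1):
    """Estimate optimizer steps and warmup steps for a training run."""
    effective_batch = per_device_batch * grad_accum * num_gpus
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    total_steps = steps_per_epoch * epochs
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    return {
        "effective_batch": effective_batch,
        "steps_per_epoch": steps_per_epoch,
        "total_steps": total_steps,
        "warmup_steps": warmup_steps,
    }


# 1,000 examples with batch size 4 and 4 accumulation steps, as in the TRL example.
print(training_schedule(num_examples=1000, per_device_batch=4, grad_accum=4))
```

With sample packing enabled, several examples share one sequence, so the real step count will be lower than this estimate.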

---
# Part 8: Synthetic Data Generation

When you don't have enough training data, you can generate synthetic examples using a larger model.

## Techniques:

1. **Self-Instruct**: Use the model to generate instruction-response pairs
2. **Evol-Instruct**: Evolve simple instructions into complex ones
3. **Distillation**: Use a larger model to generate training data for a smaller one

In [None]:
import ollama

def generate_synthetic_examples(topic: str, num_examples: int = 5) -> str:
    """
    Generate synthetic training examples using a local LLM.
    
    Args:
        topic: The topic or domain for generating examples
        num_examples: Number of examples to generate
    
    Returns:
        Raw model output containing INSTRUCTION/RESPONSE pairs as text,
        or an empty string if generation fails
    """
    prompt = f'''Generate {num_examples} diverse instruction-response pairs for training a helpful AI assistant on the topic of "{topic}".

Format each example as:
INSTRUCTION: [user question or task]
RESPONSE: [helpful, detailed response]

Make the examples varied in complexity and style. Include edge cases.'''

    try:
        response = ollama.chat(
            model='llama3.2',
            messages=[{'role': 'user', 'content': prompt}]
        )
        return response['message']['content']
    except Exception as e:
        print(f"Error generating synthetic data: {e}")
        return ""

# Example usage (uncomment to run)
# synthetic_examples = generate_synthetic_examples("Python debugging", num_examples=3)
# print(synthetic_examples)
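
The generated text still has to be split into structured pairs before it can be used for training. A minimal parser for the `INSTRUCTION:` / `RESPONSE:` layout requested in the prompt could look like the sketch below. The name `parse_pairs` and the output keys are assumptions for illustration, and real model output may deviate from the requested layout (for example by adding markdown emphasis), so treat this as a starting point.

```python
import re

# Capture the text after each INSTRUCTION: and RESPONSE: marker, up to the
# next INSTRUCTION: marker or the end of the string.
PAIR_PATTERN = re.compile(
    r"INSTRUCTION:\s*(.*?)\s*RESPONSE:\s*(.*?)\s*(?=INSTRUCTION:|\Z)",
    re.DOTALL,
)


def parse_pairs(raw_text):
    """Split raw model output into a list of instruction-response dicts."""
    pairs = []
    for instruction, response in PAIR_PATTERN.findall(raw_text):
        if instruction and response:  # skip incomplete pairs
            pairs.append({"instruction": instruction, "response": response})
    return pairs


sample = """INSTRUCTION: What does a traceback show?
RESPONSE: The chain of calls that led to the error.

INSTRUCTION: How do I set a breakpoint?
RESPONSE: Call breakpoint() where you want to pause."""

for pair in parse_pairs(sample):
    print(pair)
```

Text without the expected markers yields an empty list, which makes malformed generations easy to detect and discard.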

---
# Part 9: Evaluating Your Fine-Tuned Model

Always evaluate your model after fine-tuning to ensure it has learned the desired behavior without forgetting general capabilities.

In [None]:
def evaluate_model(model, tokenizer, test_prompts: list):
    """
    Simple evaluation function for fine-tuned models.
    
    Args:
        model: The fine-tuned model
        tokenizer: The tokenizer
        test_prompts: List of test prompts to evaluate
    """
    model.eval()
    results = []
    
    for prompt in test_prompts:
        messages = [{"role": "user", "content": prompt}]
        
        # Format for Llama 3 chat template
        formatted_prompt = tokenizer.apply_chat_template(
            messages, 
            tokenize=False, 
            add_generation_prompt=True
        )
        
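        # The chat template already adds <|begin_of_text|>, so skip the tokenizer's own BOS
        inputs = tokenizer(
            formatted_prompt, return_tensors="pt", add_special_tokens=False
        ).to(model.device)
        
        with torch.no_grad():
            outputs = model.generate(
                **inputs,
                max_new_tokens=256,
                temperature=0.7,
                do_sample=True,
            )
        
        # Decode only the newly generated tokens, not the echoed prompt
        prompt_length = inputs["input_ids"].shape[1]
        response = tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True)
        results.append({"prompt": prompt, "response": response})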
        
    return results

# Example test prompts
test_prompts = [
    "Explain what machine learning is in simple terms.",
    "Write a Python function to calculate the Fibonacci sequence.",
    "What are the benefits of using renewable energy?",
]
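
Reading responses by eye does not scale, so it can help to attach a simple automatic check to each prompt. The sketch below computes a pass rate over results shaped like the output of `evaluate_model`. The helper `score_results`, the checks, and the demo responses are all hypothetical placeholders; meaningful checks depend on your task.

```python
def score_results(results, checks):
    """Return the fraction of results whose response passes its prompt's check."""
    if not results:
        return 0.0
    passed = sum(1 for item in results if checks[item["prompt"]](item["response"]))
    return passed / len(results)


# Made-up results for illustration; real ones come from a model.
demo_results = [
    {"prompt": "Write a Python function to calculate the Fibonacci sequence.",
     "response": "def fib(n):\n    return n if n < 2 else fib(n - 1) + fib(n - 2)"},
    {"prompt": "Explain what machine learning is in simple terms.",
     "response": ""},
]

# One predicate per prompt: code answers must define a function,
# explanations must contain at least five words.
demo_checks = {
    "Write a Python function to calculate the Fibonacci sequence.":
        lambda text: "def " in text,
    "Explain what machine learning is in simple terms.":
        lambda text: len(text.split()) >= 5,
}

pass_rate = score_results(demo_results, demo_checks)
print(f"Pass rate: {pass_rate:.0%}")
```

Running the same checks against the base model and the fine-tuned model gives a rough before-and-after comparison.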

---
# Part 10: Exporting to Ollama for Local Use

After fine-tuning, you can export your model to Ollama for easy local deployment.

In [None]:
# Export fine-tuned model to GGUF format for Ollama

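def export_to_gguf(model_path: str, output_path: str, quantization: str = "q8_0"):
    """
    Convert a fine-tuned model to GGUF format for Ollama.
    
    Requires a local llama.cpp checkout. The converter's --outtype accepts
    types such as f32, f16, bf16, q8_0, or auto; K-quants such as q4_k_m
    need a separate pass with llama.cpp's llama-quantize tool afterwards.
    """
    import subprocess
    
    # Convert to GGUF (argument list avoids shell quoting problems with paths)
    cmd = [
        "python", "llama.cpp/convert_hf_to_gguf.py", model_path,
        "--outfile", output_path, "--outtype", quantization,
    ]
    print(f"Running: {' '.join(cmd)}")
    # subprocess.run(cmd, check=True)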
    
# Create Ollama Modelfile
modelfile_content = '''
FROM ./llama-3.2-3b-finetuned.gguf

TEMPLATE """{{ if .System }}<|start_header_id|>system<|end_header_id|>

{{ .System }}<|eot_id|>{{ end }}{{ if .Prompt }}<|start_header_id|>user<|end_header_id|>

{{ .Prompt }}<|eot_id|>{{ end }}<|start_header_id|>assistant<|end_header_id|>

{{ .Response }}<|eot_id|>"""

PARAMETER stop <|start_header_id|>
PARAMETER stop <|end_header_id|>
PARAMETER stop <|eot_id|>
PARAMETER temperature 0.7
'''

print("Modelfile for Ollama:")
print(modelfile_content)

# To create the Ollama model:
# !ollama create my-finetuned-model -f Modelfile
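
If training used LoRA or QLoRA, the saved output holds adapter weights rather than a full model, while the GGUF converter expects a complete Hugging Face checkpoint. One way to bridge the gap is to merge the adapter into the base weights first. The sketch below assumes the `peft` and `transformers` packages from Part 1; the function name `merge_lora_adapter` and the example paths are hypothetical. The imports sit inside the function so that defining it does not require a GPU or a model download.

```python
def merge_lora_adapter(base_model_id, adapter_dir, merged_dir):
    """Merge a LoRA adapter into its base model and save a standalone checkpoint."""
    import torch
    from peft import PeftModel
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Load the base model unquantized so the merged weights are full tensors.
    base = AutoModelForCausalLM.from_pretrained(base_model_id, torch_dtype=torch.bfloat16)
    model = PeftModel.from_pretrained(base, adapter_dir)

    # Fold the low-rank updates into the base weights and drop the adapter wrappers.
    merged = model.merge_and_unload()
    merged.save_pretrained(merged_dir)

    # Save the tokenizer alongside so the directory is self-contained.
    AutoTokenizer.from_pretrained(base_model_id).save_pretrained(merged_dir)
    return merged_dir


# Example usage (uncomment to run; paths are placeholders):
# merge_lora_adapter(
#     "meta-llama/Llama-3.2-3B-Instruct",
#     "./llama3-finetuned-final",
#     "./llama3-merged",
# )
```

The merged directory can then be passed as `model_path` to the conversion step.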

---
# Part 11: Framework Comparison Summary

## Decision Guide

| Your Situation | Recommended Framework |
|----------------|----------------------|
| Limited GPU memory | **Unsloth** |
| Need fastest training | **Unsloth** |
| PyTorch ecosystem | **torchtune** |
| Hugging Face ecosystem | **TRL** |
| Maximum flexibility | **Axolotl** |
| Prefer web UI | **LLaMA-Factory** |
| New to fine-tuning | **Axolotl** or **LLaMA-Factory** |
| Production deployment | **torchtune** or **TRL** |

## Performance Comparison (8B model on RTX 4090)

| Framework | Training Speed | Memory Usage |
|-----------|---------------|---------------|
| Unsloth | ~2.1x faster | ~8GB |
| TRL + QLoRA | Baseline | ~12GB |
| torchtune | ~1.1x faster | ~11GB |
| Axolotl | Baseline | ~12GB |
| LLaMA-Factory | ~1.5x faster (w/ Unsloth) | ~8GB |

---
# Additional Resources

## Official Documentation
- [Meta Llama Fine-tuning Guide](https://www.llama.com/docs/how-to-guides/fine-tuning/)
- [Hugging Face PEFT](https://huggingface.co/docs/peft)
- [TRL Documentation](https://huggingface.co/docs/trl)

## Framework Repositories
- [Unsloth](https://github.com/unslothai/unsloth) - Fast training with low memory
- [torchtune](https://github.com/pytorch/torchtune) - PyTorch native
- [Axolotl](https://github.com/axolotl-ai-cloud/axolotl) - Flexible YAML config
- [LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory) - All-in-one solution

## Tutorials & Guides
- [Unsloth Llama 3 Tutorial](https://unsloth.ai/docs/get-started/fine-tuning-llms-guide)
- [Fine-tune Llama 3.1 with TRL](https://huggingface.co/blog/mlabonne/sft-llama3)
- [llama-cookbook Fine-tuning](https://github.com/meta-llama/llama-cookbook/tree/main/getting-started/finetuning)

## Related Notebooks in This Course
- `6.0-fine-tuning-llama3-what-you-need-to-know.md` - Conceptual overview
- `6.2-quantization-precision-format-code-explanation.ipynb` - Deep dive into quantization