# Fine-Tuning Gemma 3 1B Instruct: Complete Guide 🎯

Welcome to the comprehensive guide for fine-tuning Google's Gemma 3 1B Instruct model! This notebook will walk you through the entire process of customizing a pre-trained language model for your specific use case.

## What you'll learn:
- Understanding fine-tuning vs training from scratch
- Setting up the environment for different devices (CPU, CUDA, Apple Silicon)
- Loading and preparing the Gemma 3 1B Instruct model
- Creating and formatting training datasets
- Implementing LoRA (Low-Rank Adaptation) for efficient fine-tuning
- Training with different optimization techniques
- Evaluating and testing your fine-tuned model
- Saving and sharing your custom model

## Prerequisites:
- Python 3.8 or higher
- At least 16GB of RAM (32GB+ recommended)
- GPU with 8GB+ VRAM (or Apple Silicon with 16GB+ unified memory)
- HuggingFace account and token for Gemma access
- Basic understanding of machine learning concepts

## Step 1: Understanding Fine-Tuning

Before we start, let's understand what fine-tuning means and why it's powerful:

### 🧠 **What is Fine-Tuning?**
Fine-tuning takes a pre-trained model and adapts it to your specific task or domain by training it on your custom dataset.

### 🎯 **Types of Fine-Tuning:**
- **Full Fine-Tuning**: Updates all model parameters (expensive, high quality)
- **LoRA (Low-Rank Adaptation)**: Updates only small adapter layers (efficient, good quality)
- **Prompt Tuning**: Learns optimal prompts (very efficient, task-specific)

### 💡 **Why Fine-Tune Gemma 3 1B Instruct?**
- Smaller model = faster training and inference
- Good performance for many tasks
- Fits in consumer hardware
- Already instruction-tuned for better baseline

### 📊 **Device Considerations:**
- **Apple Silicon (M1/M2/M3)**: Great for LoRA fine-tuning, unified memory advantage
- **NVIDIA GPUs**: Excellent for all types of fine-tuning
- **CPU Only**: Possible but slow, best for very small datasets

## Step 2: Install Required Libraries

Let's install all the necessary libraries for fine-tuning:

In [1]:
# Install required libraries for fine-tuning
import subprocess
import sys

def install_package(package):
    """Install a package using pip"""
    subprocess.check_call([sys.executable, "-m", "pip", "install", package])

# Core libraries for fine-tuning
packages = [
    "transformers>=4.36.0",    # Latest transformers with Gemma support
    "torch>=2.1.0",           # PyTorch with MPS support
    "datasets",               # For dataset handling
    "accelerate",             # For distributed training
    "peft",                   # For LoRA and other parameter-efficient methods
    "bitsandbytes",           # For quantization (if supported)
    "trl",                    # For training utilities
    "psutil",                 # For system monitoring
    "sentencepiece",          # For tokenization
    "protobuf",               # Required for some tokenizers
]

print("📦 Installing fine-tuning packages...")
print("⚠️  This may take several minutes")
print()

for package in packages:
    try:
        print(f"Installing {package}...")
        install_package(package)
        print(f"✅ {package} installed successfully")
    except Exception as e:
        print(f"❌ Failed to install {package}: {e}")
        if "bitsandbytes" in package:
            print("💡 bitsandbytes may not be available on Apple Silicon - this is OK")

print("\n🎉 Installation complete!")
print("\n💡 Note: Some packages may show warnings - this is normal")

📦 Installing fine-tuning packages...
⚠️  This may take several minutes

Installing transformers>=4.36.0...
✅ transformers>=4.36.0 installed successfully
Installing torch>=2.1.0...
✅ transformers>=4.36.0 installed successfully
Installing torch>=2.1.0...



[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m24.3.1[0m[39;49m -> [0m[32;49m25.2[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m


✅ torch>=2.1.0 installed successfully
Installing datasets...



[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m24.3.1[0m[39;49m -> [0m[32;49m25.2[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m


✅ datasets installed successfully
Installing accelerate...



[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m24.3.1[0m[39;49m -> [0m[32;49m25.2[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m


✅ accelerate installed successfully
Installing peft...



[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m24.3.1[0m[39;49m -> [0m[32;49m25.2[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m


✅ peft installed successfully
Installing bitsandbytes...



[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m24.3.1[0m[39;49m -> [0m[32;49m25.2[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m


✅ bitsandbytes installed successfully
Installing trl...



[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m24.3.1[0m[39;49m -> [0m[32;49m25.2[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m


✅ trl installed successfully
Installing psutil...



[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m24.3.1[0m[39;49m -> [0m[32;49m25.2[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m


✅ psutil installed successfully
Installing sentencepiece...



[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m24.3.1[0m[39;49m -> [0m[32;49m25.2[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m


✅ sentencepiece installed successfully
Installing protobuf...



[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m24.3.1[0m[39;49m -> [0m[32;49m25.2[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m


✅ protobuf installed successfully

🎉 Installation complete!




[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m24.3.1[0m[39;49m -> [0m[32;49m25.2[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m


## Step 3: Environment Setup and Device Detection

Let's set up our environment and detect the best device for training:

In [2]:
# Import all necessary libraries
import torch
from transformers import (
    AutoTokenizer, 
    AutoModelForCausalLM, 
    TrainingArguments, 
    Trainer,
    DataCollatorForLanguageModeling
)
import transformers  # Import the module itself to access __version__
from datasets import Dataset, load_dataset
from peft import LoraConfig, get_peft_model, TaskType, PeftModel
import psutil
import json
import os
import warnings
import time
from typing import Dict, List

# Suppress warnings for cleaner output
warnings.filterwarnings("ignore", category=UserWarning)
os.environ["TOKENIZERS_PARALLELISM"] = "false"

# System information
print("🖥️  SYSTEM INFORMATION")
print("=" * 60)
print(f"Python version: {sys.version}")
print(f"PyTorch version: {torch.__version__}")
print(f"Transformers version: {transformers.__version__}")

# Memory information
memory = psutil.virtual_memory()
print(f"\n💾 MEMORY INFORMATION:")
print(f"Total RAM: {memory.total / (1024**3):.1f} GB")
print(f"Available RAM: {memory.available / (1024**3):.1f} GB")
print(f"RAM usage: {memory.percent:.1f}%")

# Device detection with detailed information
print(f"\n🚀 DEVICE DETECTION:")
if torch.cuda.is_available():
    device = "cuda"
    gpu_count = torch.cuda.device_count()
    print(f"✅ CUDA available with {gpu_count} GPU(s)")
    for i in range(gpu_count):
        gpu_name = torch.cuda.get_device_name(i)
        gpu_memory = torch.cuda.get_device_properties(i).total_memory / (1024**3)
        print(f"   GPU {i}: {gpu_name} ({gpu_memory:.1f} GB)")
    print(f"   CUDA version: {torch.version.cuda}")
    
elif torch.backends.mps.is_available():
    device = "mps"
    print(f"✅ Apple Silicon (MPS) available")
    print(f"   Unified memory: {memory.total / (1024**3):.1f} GB")
    print(f"   MPS is ideal for LoRA fine-tuning")
    
else:
    device = "cpu"
    print(f"⚠️  Using CPU only")
    print(f"   Training will be slower but still possible")
    print(f"   Consider using smaller batch sizes")

print(f"\n🎯 Selected device: {device}")

# Training recommendations based on device
print(f"\n📋 TRAINING RECOMMENDATIONS:")
if device == "cuda":
    print("   • Use LoRA or full fine-tuning")
    print("   • Batch size: 4-16 depending on GPU memory")
    print("   • Enable gradient checkpointing for larger models")
elif device == "mps":
    print("   • LoRA fine-tuning recommended")
    print("   • Batch size: 2-8 depending on memory")
    print("   • Use float16 precision")
else:
    print("   • LoRA fine-tuning only")
    print("   • Small batch size: 1-2")
    print("   • Consider using smaller dataset")

print("\n✅ Environment setup complete!")

  from .autonotebook import tqdm as notebook_tqdm


🖥️  SYSTEM INFORMATION
Python version: 3.13.2 | packaged by Anaconda, Inc. | (main, Feb  6 2025, 18:56:02) [GCC 11.2.0]
PyTorch version: 2.8.0+cu128
Transformers version: 4.56.2

💾 MEMORY INFORMATION:
Total RAM: 31.2 GB
Available RAM: 22.9 GB
RAM usage: 26.8%

🚀 DEVICE DETECTION:
✅ CUDA available with 1 GPU(s)
   GPU 0: NVIDIA GeForce RTX 4090 (24.0 GB)
   CUDA version: 12.8

🎯 Selected device: cuda

📋 TRAINING RECOMMENDATIONS:
   • Use LoRA or full fine-tuning
   • Batch size: 4-16 depending on GPU memory
   • Enable gradient checkpointing for larger models

✅ Environment setup complete!


## Step 4: HuggingFace Authentication

Gemma models require authentication. Let's set up your HuggingFace token:

In [3]:
# HuggingFace Authentication Setup
print("🔐 HUGGINGFACE AUTHENTICATION")
print("=" * 50)

# Check if user is already logged in
from huggingface_hub import HfApi
try:
    api = HfApi()
    user_info = api.whoami()
    print(f"✅ Already authenticated as: {user_info['name']}")
    print(f"   Email: {user_info.get('email', 'Not provided')}")
    HF_TOKEN = True
except Exception:
    print("❌ Not authenticated with HuggingFace")
    HF_TOKEN = False

# If not authenticated, provide instructions
if not HF_TOKEN:
    print("\n🔑 TO ACCESS GEMMA MODELS:")
    print("1. Go to https://huggingface.co/settings/tokens")
    print("2. Create a new token with 'Read' permissions")
    print("3. Accept the Gemma license at: https://huggingface.co/google/gemma-3-1b-it")
    print("4. Run: huggingface-cli login")
    print("5. Paste your token when prompted")
    print("\n💡 Alternative: Set HF_TOKEN environment variable")
    print("   export HF_TOKEN=your_token_here")
    
    # Check for environment variable
    import os
    if 'HF_TOKEN' in os.environ:
        print("\n✅ Found HF_TOKEN in environment variables")
        HF_TOKEN = True
    else:
        print("\n⚠️  No HF_TOKEN found. Please authenticate before proceeding.")

if HF_TOKEN:
    print("\n🎉 Ready to proceed with Gemma model loading!")
else:
    print("\n⏹️  Please complete authentication before continuing.")

🔐 HUGGINGFACE AUTHENTICATION
✅ Already authenticated as: bobbinetor
   Email: petruolo95@gmail.com

🎉 Ready to proceed with Gemma model loading!
✅ Already authenticated as: bobbinetor
   Email: petruolo95@gmail.com

🎉 Ready to proceed with Gemma model loading!


## Step 5: Load and Prepare the Gemma 3 1B Instruct Model

**🎯 What this cell does:**
This step downloads and loads the Gemma 3 1B Instruct model from HuggingFace, along with its tokenizer. The model will be configured with the optimal settings for fine-tuning on your device.

**📋 What happens inside:**
- Downloads the model and tokenizer (may take several minutes on first run)
- Configures memory optimization settings based on your device
- Sets up proper tokenization with pad/EOS tokens
- Displays model information (parameters, memory usage, device placement)
- Performs a tokenizer test to verify everything works

**⚙️ What you can customize:**
- `MODEL_NAME`: Change to use different Gemma variants or other models
- `torch_dtype`: Modify precision (float16 for speed vs float32 for accuracy)
- `attn_implementation`: Keep as "eager" for Gemma3 compatibility
- `device_map`: Adjust GPU allocation strategy for multi-GPU setups

**🚨 Common issues and solutions:**
- **Authentication error**: Make sure you completed Step 4 (HuggingFace login)
- **Out of memory**: The model needs ~3-4GB RAM/VRAM minimum
- **Slow download**: Model is ~3GB - be patient on slow connections
- **Token mismatch**: The cell automatically fixes tokenizer pad/EOS token issues

**💡 Expected output:**
You should see successful model loading, parameter count (~2.5B parameters), and a tokenizer test showing proper text encoding/decoding.

In [4]:
# Load and prepare Gemma 3 1B Instruct model
MODEL_NAME = "google/gemma-3-1b-it"

print(f"🤖 LOADING GEMMA 3 1B INSTRUCT MODEL")
print("=" * 50)
print(f"Model: {MODEL_NAME}")
print(f"Device: {device}")

# Memory optimization settings
torch_dtype = torch.float16 if device != "cpu" else torch.float32

try:
    print("📥 Loading tokenizer...")
    tokenizer = AutoTokenizer.from_pretrained(
        MODEL_NAME,
        trust_remote_code=True,
    )
    
    # Set pad token if not already set
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token
        tokenizer.pad_token_id = tokenizer.eos_token_id
    
    print("✅ Tokenizer loaded successfully")
    print(f"   Vocabulary size: {len(tokenizer)}")
    print(f"   Pad token: {tokenizer.pad_token}")
    print(f"   EOS token: {tokenizer.eos_token}")
    
    print("\n📥 Loading model...")
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_NAME,
        torch_dtype=torch_dtype,
        trust_remote_code=True,
        device_map="auto" if device == "cuda" else None,
        attn_implementation="eager",  # Fix for Gemma3 attention
    )
    
    # Move model to device if not using device_map
    if device != "cuda":
        model = model.to(device)
    
    print("✅ Model loaded successfully")
    
    # Model information
    num_params = sum(p.numel() for p in model.parameters())
    print(f"\n📊 MODEL INFORMATION:")
    print(f"   Parameters: {num_params:,}")
    print(f"   Model size: ~{num_params * 2 / 1e9:.1f} GB (FP16)")
    print(f"   Device: {device}")
    print(f"   Data type: {torch_dtype}")
    print(f"   Attention implementation: eager")
    
    # Memory usage check
    if device == "cuda":
        memory_used = torch.cuda.memory_allocated() / 1e9
        memory_total = torch.cuda.get_device_properties(0).total_memory / 1e9
        print(f"   GPU memory used: {memory_used:.1f}GB / {memory_total:.1f}GB")
    
    # Test tokenizer with a simple example
    print(f"\n🧪 TOKENIZER TEST:")
    test_text = "### Instruction:\nHello\n\n### Response:\n"
    test_tokens = tokenizer.encode(test_text)
    print(f"   Test text: {repr(test_text)}")
    print(f"   Tokens: {len(test_tokens)}")
    print(f"   Decoded back: {repr(tokenizer.decode(test_tokens))}")
    
except Exception as e:
    print(f"❌ Error loading model: {e}")
    print("💡 This might be due to:")
    print("   • HuggingFace authentication issues")
    print("   • Insufficient memory")
    print("   • Network connectivity")
    print("   • Missing model access permissions")

print(f"\n✅ Model setup complete!")

🤖 LOADING GEMMA 3 1B INSTRUCT MODEL
Model: google/gemma-3-1b-it
Device: cuda
📥 Loading tokenizer...


`torch_dtype` is deprecated! Use `dtype` instead!


✅ Tokenizer loaded successfully
   Vocabulary size: 262145
   Pad token: <pad>
   EOS token: <eos>

📥 Loading model...
✅ Model loaded successfully

📊 MODEL INFORMATION:
   Parameters: 999,885,952
   Model size: ~2.0 GB (FP16)
   Device: cuda
   Data type: torch.float16
   Attention implementation: eager
   GPU memory used: 2.0GB / 25.8GB

🧪 TOKENIZER TEST:
   Test text: '### Instruction:\nHello\n\n### Response:\n'
   Tokens: 11
   Decoded back: '<bos>### Instruction:\nHello\n\n### Response:\n'

✅ Model setup complete!
✅ Model loaded successfully

📊 MODEL INFORMATION:
   Parameters: 999,885,952
   Model size: ~2.0 GB (FP16)
   Device: cuda
   Data type: torch.float16
   Attention implementation: eager
   GPU memory used: 2.0GB / 25.8GB

🧪 TOKENIZER TEST:
   Test text: '### Instruction:\nHello\n\n### Response:\n'
   Tokens: 11
   Decoded back: '<bos>### Instruction:\nHello\n\n### Response:\n'

✅ Model setup complete!


## Step 6: Setup LoRA Configuration and Device Optimization

**🎯 What this cell does:**
This step configures LoRA (Low-Rank Adaptation) for efficient fine-tuning and allows you to customize device and precision settings. LoRA reduces training time and memory usage by only training small adapter layers instead of the entire model.

**📋 What happens inside:**
- **Device Override**: Option to force specific device (CUDA/MPS/CPU) 
- **Precision Selection**: Choose between FP16, BF16, or FP32 for different speed/accuracy tradeoffs
- **LoRA Setup**: Creates and applies LoRA adapters to the model
- **Parameter Analysis**: Shows how many parameters will actually be trained
- **Gradient Verification**: Ensures training will work properly

**⚙️ Key variables you can customize:**

**Device and Precision:**
- `FORCE_DEVICE`: Set to `"cuda"`, `"mps"`, or `"cpu"` to override auto-detection
- `FORCE_PRECISION`: Set to `"fp16"` (fastest), `"bf16"` (stable), or `"fp32"` (most accurate)

**LoRA Parameters:**
- `LORA_R`: Rank (8-64) - Higher = better quality but more parameters to train
- `LORA_ALPHA`: Scaling (usually 2x rank) - Controls adapter influence 
- `LORA_DROPOUT`: Dropout rate (0.0-0.3) - Prevents overfitting
- `target_modules`: Which model layers to adapt - Gemma-optimized list included

**📊 Parameter efficiency guide:**
- **r=8**: Fastest, minimal memory, good for simple tasks
- **r=16**: Balanced (recommended for most use cases)
- **r=32+**: Best quality, more memory, for complex tasks

**💡 Expected output:**
You should see your device configuration, LoRA parameters applied, and a dramatic reduction in trainable parameters (typically <1% of total model parameters).

In [5]:
# Setup LoRA Configuration and Device/Precision Selection
print("🔧 SETTING UP LORA CONFIGURATION")
print("=" * 50)

# Check if model exists
if 'model' not in globals():
    print("❌ Error: 'model' variable not found!")
    print("💡 Please run Step 5 (Load Gemma 3 1B model) first")
else:
    # === DEVICE AND PRECISION CUSTOMIZATION ===
    print("⚙️ DEVICE AND PRECISION SELECTION:")
    print("Current auto-detected device:", device)
    
    # Allow user to override device selection
    FORCE_DEVICE = None  # Set to "cuda", "mps", or "cpu" to override auto-detection
    FORCE_PRECISION = "fp32"  # Set to "fp16", "bf16", or "fp32" to override auto-selection
    
    # Apply device override if specified
    if FORCE_DEVICE:
        device = FORCE_DEVICE
        print(f"🔄 Device overridden to: {device}")
    
    # Determine precision settings
    if FORCE_PRECISION:
        if FORCE_PRECISION == "fp16":
            use_fp16, use_bf16 = True, False
        elif FORCE_PRECISION == "bf16":
            use_fp16, use_bf16 = False, True
        else:  # fp32
            use_fp16, use_bf16 = False, False
        print(f"🔄 Precision overridden to: {FORCE_PRECISION}")
    else:
        # Auto-select precision based on device
        if device == "cuda":
            use_fp16, use_bf16 = True, False  # FP16 for CUDA
        elif device == "mps":
            use_fp16, use_bf16 = False, False  # FP32 for MPS (compatibility)
        else:
            use_fp16, use_bf16 = False, False  # FP32 for CPU
    
    print(f"✅ Selected configuration:")
    print(f"   Device: {device}")
    print(f"   Precision: {'FP16' if use_fp16 else 'BF16' if use_bf16 else 'FP32'}")
    
    # === LORA CONFIGURATION ===
    print(f"\n🎯 LORA PARAMETERS:")
    
    # LoRA parameters (customizable)
    LORA_R = 16        # Rank of adaptation (higher = more parameters, better quality)
    LORA_ALPHA = 32    # LoRA scaling parameter (typically 2x rank)
    LORA_DROPOUT = 0.1 # LoRA dropout (0.0-0.3)

    # Define target modules for LoRA (Gemma-specific)
    target_modules = [
        "q_proj",     # Query projection
        "k_proj",     # Key projection
        "v_proj",     # Value projection
        "o_proj",     # Output projection
        "gate_proj",  # Gate projection (MLP)
        "up_proj",    # Up projection (MLP)
        "down_proj"   # Down projection (MLP)
    ]

    # Create LoRA configuration
    peft_config = LoraConfig(
        r=LORA_R,
        lora_alpha=LORA_ALPHA,
        target_modules=target_modules,
        lora_dropout=LORA_DROPOUT,
        bias="none",
        task_type=TaskType.CAUSAL_LM,
    )

    print(f"   Rank (r): {LORA_R}")
    print(f"   Alpha: {LORA_ALPHA}")
    print(f"   Dropout: {LORA_DROPOUT}")
    print(f"   Target modules: {len(target_modules)} layers")

    # Apply LoRA to the model
    print(f"\n🔄 Applying LoRA to model...")
    try:
        # Enable gradient checkpointing before applying LoRA
        model.gradient_checkpointing_enable()
        
        # Apply LoRA
        model = get_peft_model(model, peft_config)
        
        # Ensure all LoRA parameters require gradients
        for name, param in model.named_parameters():
            if 'lora_' in name:
                param.requires_grad = True
        
        print("✅ LoRA applied successfully!")
        
        # Print trainable parameters
        model.print_trainable_parameters()
        
        # Calculate parameter efficiency
        total_params = sum(p.numel() for p in model.parameters())
        trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
        
        print(f"\n📊 PARAMETER EFFICIENCY:")
        print(f"   Total parameters: {total_params:,}")
        print(f"   Trainable parameters: {trainable_params:,}")
        print(f"   Percentage trainable: {100 * trainable_params / total_params:.2f}%")
        print(f"   Memory reduction: ~{(total_params - trainable_params) / total_params * 100:.1f}%")
        
        # Verify gradient setup
        grad_params = sum(1 for p in model.parameters() if p.requires_grad)
        print(f"   Parameters with gradients: {grad_params}")
        
    except Exception as e:
        print(f"❌ Error applying LoRA: {e}")
        print("💡 This might be due to model architecture or memory issues")

    print(f"\n✅ LoRA setup complete! Ready for efficient fine-tuning.")

🔧 SETTING UP LORA CONFIGURATION
⚙️ DEVICE AND PRECISION SELECTION:
Current auto-detected device: cuda
🔄 Precision overridden to: fp32
✅ Selected configuration:
   Device: cuda
   Precision: FP32

🎯 LORA PARAMETERS:
   Rank (r): 16
   Alpha: 32
   Dropout: 0.1
   Target modules: 7 layers

🔄 Applying LoRA to model...
✅ LoRA applied successfully!
trainable params: 13,045,760 || all params: 1,012,931,712 || trainable%: 1.2879

📊 PARAMETER EFFICIENCY:
   Total parameters: 1,012,931,712
   Trainable parameters: 13,045,760
   Percentage trainable: 1.29%
   Memory reduction: ~98.7%
   Parameters with gradients: 364

✅ LoRA setup complete! Ready for efficient fine-tuning.
✅ LoRA applied successfully!
trainable params: 13,045,760 || all params: 1,012,931,712 || trainable%: 1.2879

📊 PARAMETER EFFICIENCY:
   Total parameters: 1,012,931,712
   Trainable parameters: 13,045,760
   Percentage trainable: 1.29%
   Memory reduction: ~98.7%
   Parameters with gradients: 364

✅ LoRA setup complete! Ready 

## Step 7: Prepare Training Dataset

**🎯 What this cell does:**
Creates and formats a training dataset for instruction-following. This example uses a small sample dataset for demonstration - in practice, you'll want to replace this with your own data for better results.

**📋 What happens inside:**
- **Sample Data Creation**: Creates example instruction-response pairs
- **Data Expansion**: Repeats examples to create a larger training set
- **Format Conversion**: Converts to "Human: ... Assistant: ..." format (works better than ### format)
- **Dataset Creation**: Converts to HuggingFace Dataset format
- **Train/Validation Split**: Automatically splits data (80% train, 20% validation)

**⚙️ How to customize with your own data:**

**Option 1 - Replace sample_data:**
```python
sample_data = [
    {
        "instruction": "Your custom instruction here",
        "response": "Your expected response here"
    },
    # Add more examples...
]
```

**Option 2 - Load from file:**
```python
import json
with open('your_data.json', 'r') as f:
    sample_data = json.load(f)
```

**Option 3 - Use HuggingFace dataset:**
```python
from datasets import load_dataset
dataset = load_dataset("your-dataset-name")
# Convert to instruction-response format
```

**📏 Dataset size recommendations:**
- **Demo/Testing**: 50-100 examples (current setup)
- **Small project**: 500-1,000 examples  
- **Production**: 5,000+ examples
- **Complex tasks**: 10,000+ examples

**🎯 Data quality tips:**
- Ensure diverse instruction types and lengths
- Keep responses focused and consistent in style
- Include edge cases and error handling examples
- Balance different topics/domains in your data

**💡 Expected output:**
You should see the dataset creation process, example formatted text, and train/validation split statistics.

In [6]:
# Create a sample dataset for demonstration
# Replace this with your own dataset!

print("📚 PREPARING TRAINING DATASET")
print("=" * 50)

# Sample data: AI assistant training examples
sample_data = [
    {
        "instruction": "Explain what machine learning is in simple terms.",
        "response": "Machine learning is a type of artificial intelligence where computers learn to make predictions or decisions by finding patterns in data, rather than being explicitly programmed for every task."
    },
    {
        "instruction": "What are the benefits of renewable energy?",
        "response": "Renewable energy sources like solar and wind power are sustainable, reduce greenhouse gas emissions, decrease dependence on fossil fuels, and can lead to long-term cost savings while creating jobs in green technology sectors."
    },
    {
        "instruction": "How do you make a simple vegetable soup?",
        "response": "To make vegetable soup: 1) Heat oil in a pot, 2) Sauté diced onions and garlic, 3) Add chopped vegetables like carrots, celery, and potatoes, 4) Pour in vegetable broth, 5) Season with herbs and spices, 6) Simmer until vegetables are tender, about 20-30 minutes."
    },
    {
        "instruction": "What is the importance of exercise for mental health?",
        "response": "Exercise significantly benefits mental health by releasing endorphins that improve mood, reducing stress and anxiety, improving sleep quality, boosting self-esteem, and providing a healthy outlet for managing emotions and daily pressures."
    },
    {
        "instruction": "Explain the concept of photosynthesis.",
        "response": "Photosynthesis is the process by which plants convert sunlight, carbon dioxide, and water into glucose (sugar) and oxygen. This occurs mainly in plant leaves using chlorophyll, providing energy for the plant and producing oxygen as a byproduct that's essential for most life on Earth."
    }
]

# Extend the dataset by repeating and varying the examples
# In practice, you'd want hundreds or thousands of examples
extended_data = []
for i in range(20):  # Repeat each example multiple times
    for item in sample_data:
        extended_data.append(item)

print(f"📝 Created dataset with {len(extended_data)} examples")

# Format data for instruction following
def format_instruction(example):
    """Format the data as instruction-following examples - simplified format"""
    # Use simpler format to avoid confusion with ### Response: pattern
    return f"Human: {example['instruction']}\n\nAssistant: {example['response']}"

# Apply formatting
formatted_texts = [format_instruction(item) for item in extended_data]

print("\n📋 Example formatted training sample:")
print("-" * 40)
print(formatted_texts[0])
print("-" * 40)

# Create HuggingFace dataset
dataset = Dataset.from_dict({"text": formatted_texts})

print(f"\n✅ Dataset created with {len(dataset)} examples")
print(f"   Example keys: {list(dataset.features.keys())}")

# Split into train/validation
train_test_split = dataset.train_test_split(test_size=0.2, seed=42)
train_dataset = train_test_split['train']
eval_dataset = train_test_split['test']

print(f"   Training examples: {len(train_dataset)}")
print(f"   Validation examples: {len(eval_dataset)}")

print("\n💡 Note: In practice, you should use a much larger dataset (1000+ examples)")
print("   for better fine-tuning results. This is just a demonstration.")

📚 PREPARING TRAINING DATASET
📝 Created dataset with 100 examples

📋 Example formatted training sample:
----------------------------------------
Human: Explain what machine learning is in simple terms.

Assistant: Machine learning is a type of artificial intelligence where computers learn to make predictions or decisions by finding patterns in data, rather than being explicitly programmed for every task.
----------------------------------------

✅ Dataset created with 100 examples
   Example keys: ['text']
   Training examples: 80
   Validation examples: 20

💡 Note: In practice, you should use a much larger dataset (1000+ examples)
   for better fine-tuning results. This is just a demonstration.


## Step 8: Tokenize the Dataset

**🎯 What this cell does:**
Converts your text data into tokens (numbers) that the model can understand. This step also handles proper special token management and label creation for training.

**📋 What happens inside:**
- **Dependency Check**: Verifies all required variables from previous steps exist
- **Tokenization Function**: Converts text to token IDs with proper padding and truncation
- **Special Token Handling**: Adds BOS/EOS tokens correctly (crucial for generation quality)
- **Label Creation**: Creates training labels and masks padding tokens (-100) so they're ignored
- **Batch Processing**: Efficiently processes the entire dataset
- **Statistics**: Shows tokenization results and sample decoded text

**⚙️ Key parameter you can customize:**
- `MAX_LENGTH`: Maximum sequence length in tokens
  - **256**: For short Q&A pairs, faster training
  - **512**: Balanced (recommended for most cases)
  - **1024**: For longer conversations, more memory needed
  - **2048+**: For very long text, requires significant memory

**📊 Length selection guide:**
- Check your data: Most examples should fit in MAX_LENGTH
- Longer sequences = more memory usage and slower training
- Shorter sequences = faster but may truncate important content
- Monitor "Number of padding tokens" in output - less padding = more efficient

**🔧 Advanced customizations:**
```python
# In tokenize_function, you can modify:
truncation=True,          # Set to False to see truncation warnings
padding="max_length",     # Or "longest" for variable length
add_special_tokens=True,  # Critical for proper generation
```

**🚨 Troubleshooting:**
- **Missing variables error**: Run previous steps in order
- **Out of memory**: Reduce MAX_LENGTH or batch size
- **No padding tokens**: Data might be too long for MAX_LENGTH

**💡 Expected output:**
You should see successful tokenization progress bars, statistics about token lengths, and a sample of decoded tokenized text.

In [7]:
# Tokenization configuration
MAX_LENGTH = 512  # Adjust based on your data and memory

print(f"🔤 TOKENIZING DATASET")
print("=" * 40)

# Check if required variables exist
missing_vars = []
if 'tokenizer' not in globals():
    missing_vars.append('tokenizer (from Step 5)')
if 'train_dataset' not in globals():
    missing_vars.append('train_dataset (from Step 6)')
if 'eval_dataset' not in globals():
    missing_vars.append('eval_dataset (from Step 6)')

if missing_vars:
    print("❌ ERROR: Missing required variables!")
    print("💡 The following variables are not defined:")
    for var in missing_vars:
        print(f"   • {var}")
    print("\n🔄 REQUIRED STEPS:")
    print("   1. Run Step 4: HuggingFace authentication")
    print("   2. Run Step 5: Load Gemma 3 1B model (creates tokenizer)")
    print("   3. Run Step 6: Prepare training dataset (creates train_dataset, eval_dataset)")
    print("   4. Then run this Step 7: Tokenize the dataset")
    
    # Show what variables ARE defined
    defined_vars = [var for var in globals().keys() if not var.startswith('_') and var not in ['In', 'Out', 'get_ipython']]
    print(f"\n📋 Currently defined variables: {', '.join(sorted(defined_vars))}")
    
    # Exit early to prevent further errors
    print("\n⏹️ Stopping execution. Please run the missing steps first.")
    
else:
    print(f"✅ All required variables found!")
    print(f"Max sequence length: {MAX_LENGTH}")

    def tokenize_function(examples):
        """Tokenize the text examples with proper special token handling"""
        # Tokenize the texts with explicit special token handling
        tokenized = tokenizer(
            examples["text"],
            truncation=True,
            padding="max_length",
            max_length=MAX_LENGTH,
            add_special_tokens=True,  # Add BOS and EOS tokens properly
            return_tensors=None,
        )
        
        # For causal LM, labels are the same as input_ids
        # This ensures the model learns to generate the response portion
        tokenized["labels"] = tokenized["input_ids"].copy()
        
        # Replace padding token labels with -100 so they're ignored in loss
        labels = tokenized["labels"]
        for i, label_seq in enumerate(labels):
            # Convert to list if it's not already
            if hasattr(label_seq, 'tolist'):
                label_seq = label_seq.tolist()
            # Replace pad tokens with -100
            labels[i] = [-100 if token == tokenizer.pad_token_id else token for token in label_seq]
        
        tokenized["labels"] = labels
        
        return tokenized

    # Tokenize datasets
    print("🔄 Tokenizing training dataset...")
    tokenized_train = train_dataset.map(
        tokenize_function,
        batched=True,
        remove_columns=train_dataset.column_names,
        desc="Tokenizing training data"
    )

    print("🔄 Tokenizing validation dataset...")
    tokenized_eval = eval_dataset.map(
        tokenize_function,
        batched=True,
        remove_columns=eval_dataset.column_names,
        desc="Tokenizing validation data"
    )

    print("✅ Tokenization complete!")

    # Examine tokenized data
    sample_tokens = tokenized_train[0]
    print(f"\n📊 TOKENIZATION STATISTICS:")
    print(f"   Sample input_ids length: {len(sample_tokens['input_ids'])}")
    print(f"   Sample attention_mask length: {len(sample_tokens['attention_mask'])}")
    print(f"   Number of padding tokens in sample: {sample_tokens['attention_mask'].count(0)}")

    # Show a sample of tokenized text
    print(f"\n🔍 SAMPLE TOKENIZED TEXT:")
    sample_text = tokenizer.decode(sample_tokens['input_ids'][:50], skip_special_tokens=True)
    print(f"First 50 tokens decoded: {sample_text}...")

    print(f"\n💾 Dataset ready for training!")
    print(f"   Training samples: {len(tokenized_train)}")
    print(f"   Validation samples: {len(tokenized_eval)}")

🔤 TOKENIZING DATASET
✅ All required variables found!
Max sequence length: 512
🔄 Tokenizing training dataset...


Tokenizing training data: 100%|██████████| 80/80 [00:00<00:00, 1580.56 examples/s]

🔄 Tokenizing validation dataset...



Tokenizing validation data: 100%|██████████| 20/20 [00:00<00:00, 1291.65 examples/s]

✅ Tokenization complete!

📊 TOKENIZATION STATISTICS:
   Sample input_ids length: 512
   Sample attention_mask length: 512
   Number of padding tokens in sample: 446

🔍 SAMPLE TOKENIZED TEXT:
First 50 tokens decoded: ...

💾 Dataset ready for training!
   Training samples: 80
   Validation samples: 20





## Step 9: Configure Training Arguments

**🎯 What this cell does:**
Sets up all the training parameters optimized for your specific device and use case. These settings control how the model learns - too aggressive and it won't converge, too conservative and training will be slow.

**📋 What happens inside:**
- **Device Optimization**: Automatically adjusts batch sizes and workers for your hardware
- **Training Parameters**: Sets epochs, learning rate, and optimization settings
- **Memory Management**: Configures gradient checkpointing and precision settings
- **Evaluation Setup**: Configures validation frequency and model saving
- **Time Estimation**: Provides realistic training time estimates

**⚙️ Key parameters you can customize:**

**Training Speed vs Quality:**
- `NUM_EPOCHS`: Number of training passes through data
  - **1**: Quick test, minimal learning
  - **2-3**: Balanced (recommended for most cases)
  - **5+**: Risk of overfitting with small datasets
  
- `LEARNING_RATE`: How fast the model learns
  - **1e-5**: Conservative, stable but slow
  - **1e-4**: Balanced (current setting)
  - **5e-4**: Aggressive, faster but less stable

**Batch Size (device-dependent):**
- **CUDA**: `batch_size = 4-8` (depending on GPU memory)
- **Apple Silicon**: `batch_size = 2-4` (recommended)
- **CPU**: `batch_size = 1-2` (memory limited)

**Advanced fine-tuning:**
- `WEIGHT_DECAY`: Regularization (0.01 = balanced, 0.1 = strong)
- `WARMUP_RATIO`: Learning rate warmup (0.1 = 10% of training)
- `MAX_GRAD_NORM`: Gradient clipping (1.0 = balanced)

**Memory vs Speed tradeoffs:**
- `gradient_accumulation_steps`: Higher = larger effective batch with same memory
- `gradient_checkpointing`: Saves memory but slower training
- `dataloader_num_workers`: More workers = faster data loading

**💡 When to adjust settings:**
- **Model not learning**: Increase learning rate or epochs
- **Loss exploding**: Decrease learning rate, increase warmup
- **Out of memory**: Reduce batch size, increase gradient accumulation
- **Training too slow**: Increase batch size, reduce gradient checkpointing

**📊 Expected output:**
You should see your optimized training configuration, device-specific settings, and estimated training time for your hardware.

In [8]:
# Training configuration based on device and precision settings
print(f"⚙️  CONFIGURING TRAINING ARGUMENTS")
print("=" * 50)

# Check if required variables exist
missing_vars = []
if 'device' not in globals():
    missing_vars.append('device (from Step 3)')
if 'tokenized_train' not in globals():
    missing_vars.append('tokenized_train (from Step 8)')
if 'use_fp16' not in globals():
    missing_vars.append('use_fp16 (from Step 6)')

if missing_vars:
    print("❌ ERROR: Missing required variables!")
    print("💡 The following variables are not defined:")
    for var in missing_vars:
        print(f"   • {var}")
    print("\n🔄 REQUIRED STEPS:")
    print("   1. Run Step 3: Environment Setup (creates device)")
    print("   2. Run Steps 4-8: Complete model and dataset preparation")
    print("   3. Then run this Step 9: Configure Training Arguments")
    
    # Use fallback values
    device = "mps"  # Default fallback
    use_fp16, use_bf16 = False, False  # Safe defaults
    print(f"\n🔄 Using fallback values:")
    print(f"   Device: {device}")
    print(f"   Precision: FP32 (safe default)")
    print("⚠️  Training time estimates will be inaccurate without tokenized_train")
else:
    print(f"✅ All required variables found!")

print(f"   Optimizing for device: {device}")
print(f"   Using precision: {'FP16' if use_fp16 else 'BF16' if use_bf16 else 'FP32'}")

# === TRAINING PARAMETERS CONFIGURATION ===
print(f"\n🎯 TRAINING PARAMETERS:")

# Device-specific training parameters
if device == "cuda":
    # CUDA optimized settings
    batch_size = 4
    gradient_accumulation_steps = 2
    dataloader_num_workers = 4
    
elif device == "mps":
    # Apple Silicon optimized settings
    batch_size = 2
    gradient_accumulation_steps = 4
    dataloader_num_workers = 2
    
else:
    # CPU settings
    batch_size = 1
    gradient_accumulation_steps = 8
    dataloader_num_workers = 1

# Effective batch size calculation
effective_batch_size = batch_size * gradient_accumulation_steps
print(f"   Batch size: {batch_size}")
print(f"   Gradient accumulation steps: {gradient_accumulation_steps}")
print(f"   Effective batch size: {effective_batch_size}")

# Additional training parameters (customizable)
NUM_EPOCHS = 2          # Reduced epochs to prevent overfitting to EOS pattern
LEARNING_RATE = 1e-4    # Lower learning rate for more stable training  
WEIGHT_DECAY = 0.01     # Weight decay for regularization
WARMUP_RATIO = 0.1      # Warmup ratio
MAX_GRAD_NORM = 1.0     # Gradient clipping

print(f"   Epochs: {NUM_EPOCHS}")
print(f"   Learning rate: {LEARNING_RATE}")
print(f"   Weight decay: {WEIGHT_DECAY}")

# Output directory
output_dir = "./gemma-3-1b-it-finetuned"

# Training arguments (updated for newer transformers versions)
training_args = TrainingArguments(
    output_dir=output_dir,
    overwrite_output_dir=True,
    
    # Training parameters
    num_train_epochs=NUM_EPOCHS,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    gradient_accumulation_steps=gradient_accumulation_steps,
    
    # Optimization
    learning_rate=LEARNING_RATE,
    weight_decay=WEIGHT_DECAY,
    warmup_ratio=WARMUP_RATIO,
    lr_scheduler_type="cosine",
    max_grad_norm=MAX_GRAD_NORM,
    
    # Precision - Uses values from Step 6
    fp16=use_fp16,
    bf16=use_bf16,
    
    # Memory optimization - Fixed for gradient issues
    gradient_checkpointing=True,
    gradient_checkpointing_kwargs={"use_reentrant": False},  # Fix for newer PyTorch
    dataloader_pin_memory=False,
    dataloader_num_workers=dataloader_num_workers,
    
    # Evaluation and logging (updated parameter names)
    eval_strategy="steps",          # Updated from evaluation_strategy
    eval_steps=50,
    logging_steps=10,
    save_strategy="steps",          # Added explicit save strategy
    save_steps=100,
    save_total_limit=2,
    
    # Early stopping
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    greater_is_better=False,
    
    # Reproducibility
    seed=42,
    data_seed=42,
    
    # Reporting
    report_to=[],  # Disable wandb for now
    run_name="gemma-3-1b-it-finetune",
    
    # Disable cache for gradient checkpointing compatibility
    remove_unused_columns=False,
)

print(f"\n📋 TRAINING CONFIGURATION SUMMARY:")
print(f"   Device: {device}")
print(f"   Precision: {'FP16' if use_fp16 else 'BF16' if use_bf16 else 'FP32'}")
print(f"   Epochs: {training_args.num_train_epochs}")
print(f"   Learning rate: {training_args.learning_rate}")
print(f"   Effective batch size: {effective_batch_size}")
print(f"   Gradient checkpointing: {training_args.gradient_checkpointing}")
print(f"   Gradient checkpointing (reentrant): False")

# Estimate training time (only if tokenized_train exists)
print(f"\n⏱️  TRAINING ESTIMATES:")
if 'tokenized_train' in globals():
    total_steps = len(tokenized_train) // effective_batch_size * training_args.num_train_epochs
    print(f"   Total training steps: {total_steps}")
    
    # Adjust time estimates based on precision
    if device == "mps":
        base_time = 45 if not use_fp16 else 30  # FP32 vs FP16
        estimated_time = total_steps * base_time / 60
        precision_note = "FP32" if not use_fp16 else "FP16"
        print(f"   Estimated time on Apple Silicon ({precision_note}): ~{estimated_time:.1f} minutes")
    elif device == "cuda":
        base_time = 10 if use_fp16 else 15  # FP16 vs FP32
        estimated_time = total_steps * base_time / 60
        precision_note = "FP16" if use_fp16 else "BF16" if use_bf16 else "FP32"
        print(f"   Estimated time on GPU ({precision_note}): ~{estimated_time:.1f} minutes")
    else:
        estimated_time = total_steps * 120 / 60  # FP32 on CPU
        print(f"   Estimated time on CPU (FP32): ~{estimated_time:.1f} minutes")
else:
    print("   ⚠️  Cannot estimate training time: tokenized_train not found")
    print("   💡 Complete Steps 4-8 first to get accurate estimates")
    print("   📊 Estimated steps: ~30 (assuming 100 examples, 3 epochs)")

print(f"\n✅ Training arguments configured!")
print(f"\n📋 NEXT STEPS:")
print(f"   • Step 10: Setup trainer")
print(f"   • Step 11: Start training")

⚙️  CONFIGURING TRAINING ARGUMENTS
✅ All required variables found!
   Optimizing for device: cuda
   Using precision: FP32

🎯 TRAINING PARAMETERS:
   Batch size: 4
   Gradient accumulation steps: 2
   Effective batch size: 8
   Epochs: 2
   Learning rate: 0.0001
   Weight decay: 0.01

📋 TRAINING CONFIGURATION SUMMARY:
   Device: cuda
   Precision: FP32
   Epochs: 2
   Learning rate: 0.0001
   Effective batch size: 8
   Gradient checkpointing: True
   Gradient checkpointing (reentrant): False

⏱️  TRAINING ESTIMATES:
   Total training steps: 20
   Estimated time on GPU (FP32): ~5.0 minutes

✅ Training arguments configured!

📋 NEXT STEPS:
   • Step 10: Setup trainer
   • Step 11: Start training


## Step 10: Setup Data Collator and Trainer

**🎯 What this cell does:**
Creates the HuggingFace Trainer object that will handle the actual training process. This step combines your model, data, and training settings into a unified training system.

**📋 What happens inside:**
- **Dependency Verification**: Ensures all previous steps completed successfully
- **Data Collator Setup**: Configures how training batches are created and padded
- **Trainer Creation**: Combines model, data, and training arguments
- **Memory Analysis**: Shows current memory usage before training starts
- **Compatibility Handling**: Uses appropriate API parameters for your transformers version

**⚙️ Understanding the components:**

**Data Collator settings:**
- `mlm=False`: We're doing causal language modeling, not masked language modeling
- `pad_to_multiple_of=8`: Optimizes memory alignment for better performance
- `return_tensors="pt"`: Returns PyTorch tensors (required for training)

**Trainer configuration:**
- Links your LoRA-enhanced model with tokenized datasets
- Applies training arguments from Step 9
- Sets up automatic evaluation and model saving
- Handles gradient computation and backpropagation

**🔧 Advanced customizations:**
If you encounter issues, you can modify the trainer setup:

```python
# For older transformers versions:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_eval,
    data_collator=data_collator,
    tokenizer=tokenizer,  # Instead of processing_class
)
```

**🚨 Common issues and solutions:**
- **"processing_class" error**: Your transformers version uses the older "tokenizer" parameter
- **Memory warnings**: Normal for large models - training will handle this
- **CUDA out of memory**: Reduce batch size in Step 9 and rerun
- **MPS compatibility issues**: Set FORCE_PRECISION="fp32" in Step 6

**💡 Expected output:**
You should see successful trainer creation, training setup summary with your model and dataset info, and current memory usage statistics.

In [9]:
# Setup data collator and trainer
print("📦 SETTING UP DATA COLLATOR AND TRAINER")
print("=" * 50)

# Check if required variables exist
missing_vars = []
required_vars = ['tokenizer', 'model', 'training_args', 'tokenized_train', 'tokenized_eval']
for var in required_vars:
    if var not in globals():
        missing_vars.append(var)

if missing_vars:
    print("❌ ERROR: Missing required variables!")
    print("💡 The following variables are not defined:")
    for var in missing_vars:
        print(f"   • {var}")
    print("\n🔄 REQUIRED STEPS:")
    print("   1. Run Steps 3-9 in order to create all required variables")
    print("   2. Then run this Step 10: Setup trainer")
    
else:
    # Data collator for language modeling
    data_collator = DataCollatorForLanguageModeling(
        tokenizer=tokenizer,
        mlm=False,  # We're doing causal LM, not masked LM
        pad_to_multiple_of=8,  # For efficiency
        return_tensors="pt",  # Ensure proper tensor format
    )

    print("✅ Data collator created")

    # Create trainer
    print("🏋️ Creating trainer...")

    try:
        trainer = Trainer(
            model=model,
            args=training_args,
            train_dataset=tokenized_train,
            eval_dataset=tokenized_eval,
            data_collator=data_collator,
            processing_class=tokenizer,  # Use processing_class instead of tokenizer
        )
        
        print("✅ Trainer created successfully!")
        
        # Print training setup summary
        print(f"\n📊 TRAINING SETUP SUMMARY:")
        print(f"   Model: {MODEL_NAME}")
        print(f"   Device: {device}")
        print(f"   Training method: LoRA fine-tuning")
        print(f"   Training samples: {len(tokenized_train)}")
        print(f"   Validation samples: {len(tokenized_eval)}")
        print(f"   Output directory: {output_dir}")
        print(f"   Mixed precision: FP16={training_args.fp16}, BF16={training_args.bf16}")
        
        # Memory check before training
        if device == "cuda":
            memory_used = torch.cuda.memory_allocated() / 1e9
            memory_total = torch.cuda.get_device_properties(0).total_memory / 1e9
            print(f"   GPU memory: {memory_used:.1f}GB / {memory_total:.1f}GB")
        elif device == "mps":
            print(f"   Apple Silicon unified memory in use")
        
        current_memory = psutil.virtual_memory()
        print(f"   System RAM: {current_memory.percent:.1f}% used")
        
        print(f"\n🎯 Ready to start training!")
        
    except Exception as e:
        print(f"❌ Error creating trainer: {e}")
        print("💡 This might be due to:")
        print("   • Mixed precision compatibility issues")
        print("   • Memory or configuration issues")
        print("   • Missing variables from previous steps")
        
        # Provide specific help for common issues
        if "fp16" in str(e).lower() and device == "mps":
            print("\n🔧 SOLUTION: Re-run Step 6 and set FORCE_PRECISION='fp32'")
        elif "tokenizer" in str(e).lower():
            print("\n🔧 SOLUTION: Make sure you've run Step 5 (Load Model)")
        elif "processing_class" in str(e).lower():
            print("\n💡 Note: Using fallback tokenizer parameter for compatibility")
            # Fallback to older API
            try:
                trainer = Trainer(
                    model=model,
                    args=training_args,
                    train_dataset=tokenized_train,
                    eval_dataset=tokenized_eval,
                    data_collator=data_collator,
                    tokenizer=tokenizer,  # Fallback to tokenizer parameter
                )
                print("✅ Trainer created with fallback method!")
            except Exception as fallback_e:
                print(f"❌ Fallback also failed: {fallback_e}")

📦 SETTING UP DATA COLLATOR AND TRAINER
✅ Data collator created
🏋️ Creating trainer...
✅ Trainer created successfully!

📊 TRAINING SETUP SUMMARY:
   Model: google/gemma-3-1b-it
   Device: cuda
   Training method: LoRA fine-tuning
   Training samples: 80
   Validation samples: 20
   Output directory: ./gemma-3-1b-it-finetuned
   Mixed precision: FP16=False, BF16=False
   GPU memory: 2.1GB / 25.8GB
   System RAM: 28.1% used

🎯 Ready to start training!


## Step 11: Pre-Training Verification and Fine-Tuning

**🎯 What this cell does:**
This step performs critical pre-training checks to prevent common training failures, then starts the actual fine-tuning process. The verification fixes have been essential for resolving gradient computation issues.

**📋 What happens inside:**

**Pre-training Verification (Critical Fixes):**
- **Training Mode**: Ensures model is ready for gradient computation
- **LoRA Gradient Setup**: Verifies all LoRA parameters can receive gradients
- **Cache Disabling**: Prevents conflicts between caching and gradient checkpointing
- **Token Synchronization**: Aligns tokenizer settings with model configuration
- **Gradient Testing**: Performs a test forward/backward pass to verify training will work
- **Generation Testing**: Checks the model doesn't immediately produce EOS tokens

**Actual Training Process:**
- **Training Execution**: Starts the full fine-tuning process
- **Progress Monitoring**: Shows training progress, loss, and evaluation metrics
- **Time Tracking**: Measures actual training duration
- **Model Saving**: Automatically saves the best model based on validation loss
- **Error Handling**: Gracefully handles interruptions and saves partial progress

**⚙️ What you can monitor during training:**

**Training Progress Indicators:**
- **Training Loss**: Should generally decrease over time
- **Evaluation Loss**: Should decrease and stay close to training loss
- **Learning Rate**: Will follow cosine schedule (high → low)
- **Step Time**: Time per training step (should be consistent)

**Good Training Signs:**
- Loss decreases steadily in early steps
- Evaluation loss tracks training loss closely
- No "NaN" or "inf" values in losses
- Memory usage remains stable

**Warning Signs:**
- Loss increases or stays flat
- Large gap between train and eval loss (overfitting)
- Very slow training (check device utilization)
- Memory errors (reduce batch size and restart)

**🛑 If training fails:**
The cell includes comprehensive error handling and will try to save partial progress. Common solutions:
- **Out of memory**: Reduce batch_size in Step 9, restart from Step 10
- **Gradient errors**: The pre-training verification should prevent these
- **Slow training**: Check that your device (GPU/MPS) is actually being used

**⏱️ Training duration expectations:**
- **Apple Silicon (M1/M2)**: 15-45 minutes for demo dataset
- **Modern GPU**: 5-15 minutes for demo dataset  
- **CPU only**: 1-3 hours for demo dataset

**💡 Expected output:**
You should see pre-training verification passes, training progress bars, decreasing loss values, and successful model saving to the output directory.

In [10]:
# Pre-training fixes and verification
print("🔧 PRE-TRAINING FIXES AND VERIFICATION")
print("=" * 50)

# Check if all required variables exist
required_vars = ['model', 'trainer', 'tokenizer']
missing_vars = [var for var in required_vars if var not in globals()]

if missing_vars:
    print(f"❌ ERROR: Missing variables: {missing_vars}")
    print("💡 Please run previous steps first")
else:
    print("✅ All required variables found")
    
    # Fix 1: Ensure model is in training mode
    model.train()
    print("✅ Model set to training mode")
    
    # Fix 2: Verify LoRA parameters have gradients
    lora_params_with_grad = 0
    lora_params_total = 0
    
    for name, param in model.named_parameters():
        if 'lora_' in name:
            lora_params_total += 1
            if param.requires_grad:
                lora_params_with_grad += 1
            else:
                # Force gradient requirement for LoRA parameters
                param.requires_grad = True
                lora_params_with_grad += 1
                
    print(f"✅ LoRA parameters with gradients: {lora_params_with_grad}/{lora_params_total}")
    
    # Fix 3: Disable use_cache in model config (conflicts with gradient checkpointing)
    if hasattr(model.config, 'use_cache'):
        model.config.use_cache = False
        print("✅ Model use_cache disabled")
    
    # Fix 4: Update model and generation config with tokenizer tokens
    if hasattr(model.config, 'pad_token_id'):
        model.config.pad_token_id = tokenizer.pad_token_id
    if hasattr(model.config, 'eos_token_id'):
        model.config.eos_token_id = tokenizer.eos_token_id
    if hasattr(model, 'generation_config'):
        model.generation_config.pad_token_id = tokenizer.pad_token_id
        model.generation_config.eos_token_id = tokenizer.eos_token_id
    
    print("✅ Token IDs synchronized between model and tokenizer")
    
    # Fix 5: Verify gradient computation setup
    try:
        # Create a dummy forward pass to verify gradients
        dummy_input = torch.randint(1, 1000, (1, 10)).to(model.device)  # Avoid token 0
        with torch.enable_grad():
            outputs = model(dummy_input, labels=dummy_input)
            loss = outputs.loss
            
        print(f"✅ Forward pass successful, loss: {loss.item():.4f}")
        
        # Test gradient computation
        loss.backward()
        
        # Check if gradients were computed
        grad_found = False
        for name, param in model.named_parameters():
            if param.grad is not None and 'lora_' in name:
                grad_found = True
                break
                
        if grad_found:
            print("✅ Gradient computation verified")
        else:
            print("⚠️  No gradients found - this may cause training issues")
            
        # Clear gradients
        model.zero_grad()
        
        # Fix 6: Verify model doesn't immediately predict EOS
        print("\n🔍 Testing model generation before training...")
        test_input = tokenizer("Human: Hello\n\nAssistant:", return_tensors="pt", add_special_tokens=True)
        test_input = {k: v.to(model.device) for k, v in test_input.items()}
        
        with torch.no_grad():
            test_output = model.generate(
                **test_input,
                max_new_tokens=5,
                temperature=0.7,
                do_sample=True,
                pad_token_id=tokenizer.pad_token_id,
                eos_token_id=tokenizer.eos_token_id,
            )
        
        generated_tokens = test_output[0][test_input['input_ids'].shape[1]:].tolist()
        if len(generated_tokens) > 0 and generated_tokens[0] != tokenizer.eos_token_id:
            print("✅ Model generates non-EOS tokens before training")
        else:
            print("⚠️  Model may have EOS generation issues")
        
    except Exception as e:
        print(f"❌ Gradient verification failed: {e}")
        print("💡 This may cause training issues")
    
    print("\n🎯 Pre-training verification complete!")
    print("📋 If any issues were found above, address them before training")

🔧 PRE-TRAINING FIXES AND VERIFICATION
✅ All required variables found
✅ Model set to training mode
✅ LoRA parameters with gradients: 364/364
✅ Model use_cache disabled
✅ Token IDs synchronized between model and tokenizer
✅ Forward pass successful, loss: 18.5367
✅ Forward pass successful, loss: 18.5367


`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.
Caching is incompatible with gradient checkpointing in Gemma3DecoderLayer. Setting `past_key_values=None`.
Caching is incompatible with gradient checkpointing in Gemma3DecoderLayer. Setting `past_key_values=None`.


✅ Gradient computation verified

🔍 Testing model generation before training...
✅ Model generates non-EOS tokens before training

🎯 Pre-training verification complete!
📋 If any issues were found above, address them before training
✅ Model generates non-EOS tokens before training

🎯 Pre-training verification complete!
📋 If any issues were found above, address them before training


### 🔧 STARTING THE FINE TUNING PROCESS:

The pre-training verification step addresses several common issues:

1. **Gradient Computation**: Ensures all LoRA parameters have `requires_grad=True`
2. **Model Mode**: Sets the model to training mode for proper gradient flow
3. **Cache Conflicts**: Disables `use_cache` to prevent conflicts with gradient checkpointing
4. **Token Alignment**: Synchronizes tokenizer tokens with model and generation configs
5. **Attention Implementation**: Uses "eager" attention (better for Gemma3 training)
6. **Gradient Checkpointing**: Uses non-reentrant mode for better compatibility

These fixes resolve the "element 0 of tensors does not require grad" error you encountered.

In [11]:
# Start training
print("🚀 STARTING FINE-TUNING PROCESS")
print("=" * 50)
print("⚠️  This may take some time depending on your hardware")
print("💡 Monitor GPU/CPU usage and temperature")
print()

# Record training start time
training_start_time = time.time()

try:
    # Start training
    print("🏋️ Beginning training...")
    trainer.train()
    
    # Calculate training time
    training_end_time = time.time()
    training_duration = training_end_time - training_start_time
    
    print(f"\n🎉 TRAINING COMPLETED!")
    print(f"   Total training time: {training_duration / 60:.1f} minutes")
    print(f"   Average time per epoch: {training_duration / training_args.num_train_epochs / 60:.1f} minutes")
    
    # Get training metrics
    train_metrics = trainer.state.log_history
    if train_metrics:
        final_train_loss = None
        final_eval_loss = None
        
        # Find the final losses
        for log in reversed(train_metrics):
            if 'train_loss' in log and final_train_loss is None:
                final_train_loss = log['train_loss']
            if 'eval_loss' in log and final_eval_loss is None:
                final_eval_loss = log['eval_loss']
            if final_train_loss is not None and final_eval_loss is not None:
                break
        
        print(f"\n📊 FINAL METRICS:")
        if final_train_loss:
            print(f"   Final training loss: {final_train_loss:.4f}")
        if final_eval_loss:
            print(f"   Final validation loss: {final_eval_loss:.4f}")
    
    # Save the final model
    print(f"\n💾 Saving fine-tuned model...")
    trainer.save_model()
    tokenizer.save_pretrained(output_dir)
    
    print(f"✅ Model saved to: {output_dir}")
    
except KeyboardInterrupt:
    print(f"\n⏹️  Training interrupted by user")
    print(f"   Partial model may be saved in {output_dir}")
    
except Exception as e:
    print(f"\n❌ Training failed: {e}")
    print(f"💡 Common issues:")
    print(f"   • Out of memory: Reduce batch size or use gradient checkpointing")
    print(f"   • Model too large: Use smaller model or more aggressive LoRA settings")
    print(f"   • Device issues: Check CUDA/MPS availability")
    
    # Try to save partial progress
    try:
        trainer.save_model(output_dir + "_partial")
        print(f"📁 Partial model saved to: {output_dir}_partial")
    except:
        pass

print(f"\n🏁 Training session complete!")

🚀 STARTING FINE-TUNING PROCESS
⚠️  This may take some time depending on your hardware
💡 Monitor GPU/CPU usage and temperature

🏋️ Beginning training...


Step,Training Loss,Validation Loss


Repo card metadata block was not found. Setting CardData to empty.



🎉 TRAINING COMPLETED!
   Total training time: 0.3 minutes
   Average time per epoch: 0.1 minutes

📊 FINAL METRICS:
   Final training loss: 0.8612

💾 Saving fine-tuned model...
✅ Model saved to: ./gemma-3-1b-it-finetuned

🏁 Training session complete!
✅ Model saved to: ./gemma-3-1b-it-finetuned

🏁 Training session complete!


## Step 12: Test the Fine-Tuned Model

**🎯 What this cell does:**
This comprehensive testing cell compares your fine-tuned model against the original base model to evaluate training success. It tests both models with the same questions and shows you the differences side-by-side.

**📋 What happens inside:**

**Step 1 - Base Model Loading:**
- Downloads and loads the original Gemma 3-1B model (without your fine-tuning)
- Uses the same device and precision settings as your fine-tuned model
- Provides a clean baseline for comparison

**Step 2 - Model Verification:**
- Confirms your fine-tuned model is available and ready
- Ensures both models are properly configured for testing

**Step 3 - Side-by-Side Testing:**
- Tests both models with 5 diverse questions covering different domains
- Uses proper isolation techniques to prevent response contamination
- Shows responses from both models for direct comparison
- Provides immediate analysis of each comparison

**Step 4 - Evaluation Guidance:**
- Explains how to interpret the results
- Describes what successful fine-tuning looks like
- Identifies potential issues and their meanings

**⚙️ Understanding the test questions:**
The cell tests diverse topics to evaluate general knowledge retention:
- **Machine Learning**: Tests if fine-tuning improved AI knowledge
- **Cooking**: Tests practical everyday knowledge  
- **Science**: Tests factual scientific knowledge
- **Biology**: Tests educational content understanding
- **Programming**: Tests technical knowledge retention

**🎯 What good results look like:**
- **Different Responses**: Fine-tuned model gives different (hopefully better) answers than base model
- **Coherent Output**: Both models produce readable, relevant responses
- **No Empty Responses**: Both models generate meaningful content (not just EOS tokens)
- **Domain Appropriateness**: Responses match the question topics

**⚠️ Potential issues and what they mean:**
- **Identical Responses**: Fine-tuning may not have been effective (data size, learning rate, epochs)
- **Empty Responses**: Token generation issues (usually fixed by our training setup)
- **Worse Responses**: Possible overfitting or inappropriate training data
- **Inconsistent Quality**: Normal variation, but consistent patterns may indicate training issues

**🔧 If you see problems:**
- **No differences**: Increase epochs, learning rate, or dataset size in previous steps
- **Quality degradation**: Reduce learning rate or add more diverse training data
- **Empty outputs**: Check that Step 11 completed successfully
- **Generation errors**: Restart the kernel and rerun from Step 11

**💡 Expected output:**
You should see successful loading of both models, followed by detailed comparisons showing your fine-tuned model producing different and ideally improved responses compared to the original model.

In [12]:
# 🧪 COMPLETE MODEL COMPARISON: Base vs Fine-Tuned
print("🧪 COMPLETE BASE MODEL vs FINE-TUNED MODEL COMPARISON")
print("=" * 70)
print("🎯 This cell will help you compare your original model with your fine-tuned version")
print("📊 You'll see responses from both models side-by-side to evaluate training success")
print("=" * 70)

import gc

def test_model_safely(model_to_test, model_name, prompt):
    """
    Test a single model with a prompt using proper isolation techniques
    """
    print(f"\n🔍 Testing {model_name}...")
    print(f"📝 Question: {prompt}")
    print("-" * 50)
    
    try:
        # Set model to evaluation mode
        model_to_test.eval()
        
        # Clear any cached states for isolation
        if hasattr(model_to_test, 'past_key_values'):
            model_to_test.past_key_values = None
        if hasattr(model_to_test, '_past_key_values'): 
            model_to_test._past_key_values = None
        
        # Force garbage collection
        gc.collect()
        if device == "cuda":
            torch.cuda.empty_cache()
        
        # Tokenize the prompt
        inputs = tokenizer(f"Human: {prompt}\nAssistant:", return_tensors="pt", padding=False)
        inputs = {k: v.to(device) for k, v in inputs.items()}
        input_length = inputs['input_ids'].shape[1]
        
        # Generate response with isolation
        with torch.no_grad():
            outputs = model_to_test.generate(
                input_ids=inputs['input_ids'],
                attention_mask=inputs['attention_mask'],
                max_new_tokens=100,
                temperature=0.8,
                do_sample=True,
                top_p=0.9,
                pad_token_id=tokenizer.pad_token_id,
                eos_token_id=tokenizer.eos_token_id,
                use_cache=False,  # Complete isolation
                repetition_penalty=1.1,
            )
        
        # Extract and decode the generated response
        generated_tokens = outputs[0][input_length:]
        if len(generated_tokens) > 0:
            response = tokenizer.decode(generated_tokens, skip_special_tokens=True).strip()
            if response and len(response) > 1:
                print(f"✅ {model_name} Response:")
                print(f"   {response}")
                return True, response
            else:
                print(f"❌ {model_name}: Empty response generated")
                return False, "Empty response"
        else:
            print(f"❌ {model_name}: No tokens generated")
            return False, "No tokens"
            
    except Exception as e:
        print(f"❌ {model_name} Error: {e}")
        return False, str(e)

# Test questions covering different domains
test_questions = [
    "What is machine learning?",
    "How do you make pasta?", 
    "What causes lightning?",
    "Explain photosynthesis",
    "What is Python programming?"
]

print("\n🔄 STEP 1: LOADING BASE MODEL")
print("=" * 40)
print("Loading the original Gemma 3-1B model (without fine-tuning)...")

base_model = None
try:
    base_model = AutoModelForCausalLM.from_pretrained(
        MODEL_NAME,
        torch_dtype=torch.float16 if device != "cpu" else torch.float32,
        trust_remote_code=True,
        attn_implementation="eager"
    )
    base_model = base_model.to(device)
    print("✅ Base model loaded successfully!")
    
except Exception as e:
    print(f"❌ Error loading base model: {e}")

print("\n🔄 STEP 2: CHECKING FINE-TUNED MODEL")
print("=" * 40)
if 'model' in globals():
    print("✅ Fine-tuned model is available and ready!")
else:
    print("❌ Fine-tuned model not found! Please run Steps 5-11 first.")

print("\n🔄 STEP 3: SIDE-BY-SIDE COMPARISON")
print("=" * 40)

if base_model is not None and 'model' in globals():
    
    for i, question in enumerate(test_questions, 1):
        print(f"\n{'='*70}")
        print(f"🧪 TEST {i}: {question}")
        print(f"{'='*70}")
        
        # Test Base Model
        base_success, base_response = test_model_safely(base_model, "BASE MODEL", question)
        
        print("")  # Space between models
        
        # Test Fine-tuned Model  
        ft_success, ft_response = test_model_safely(model, "FINE-TUNED MODEL", question)
        
        # Quick comparison
        print(f"\n📊 Comparison for '{question}':")
        if base_success and ft_success:
            print("   ✅ Both models generated responses")
            if base_response != ft_response:
                print("   🔍 Responses are different (good sign of successful fine-tuning)")
            else:
                print("   ⚠️  Responses are identical (might indicate fine-tuning issues)")
        elif ft_success and not base_success:
            print("   🎯 Fine-tuned model performed better!")
        elif base_success and not ft_success:
            print("   ⚠️  Base model performed better - check fine-tuning")
        else:
            print("   ❌ Both models had issues")
    
    # Clean up base model
    print(f"\n🧹 CLEANING UP...")
    del base_model
    if device == "cuda":
        torch.cuda.empty_cache()
    gc.collect()
    
else:
    print("❌ Cannot perform comparison - missing models")

print(f"\n📋 STEP 4: EVALUATION GUIDE")
print("=" * 40)
print("🔍 How to interpret the results:")
print()
print("✅ GOOD SIGNS:")
print("   • Fine-tuned model generates coherent responses")
print("   • Responses are different from base model")
print("   • No empty responses or error messages")
print("   • Responses show knowledge improvements")
print()
print("⚠️  POTENTIAL ISSUES:")
print("   • Fine-tuned model gives identical responses to base model")
print("   • Empty responses or immediate EOS tokens")
print("   • Error messages during generation")
print("   • Responses seem worse than base model")
print()
print("🎯 WHAT SUCCESS LOOKS LIKE:")
print("   • Your fine-tuned model should show some differences from the base model")
print("   • Responses should be coherent and relevant to the questions")
print("   • The model should handle different types of questions appropriately")
print("   • Fine-tuning effects may be subtle but should be noticeable")

print(f"\n✅ COMPLETE MODEL COMPARISON FINISHED!")
print("🎉 If your fine-tuned model generated good responses, congratulations!")
print("💡 If you see issues, review the training data and parameters in earlier steps.")

🧪 COMPLETE BASE MODEL vs FINE-TUNED MODEL COMPARISON
🎯 This cell will help you compare your original model with your fine-tuned version
📊 You'll see responses from both models side-by-side to evaluate training success

🔄 STEP 1: LOADING BASE MODEL
Loading the original Gemma 3-1B model (without fine-tuning)...
✅ Base model loaded successfully!

🔄 STEP 2: CHECKING FINE-TUNED MODEL
✅ Fine-tuned model is available and ready!

🔄 STEP 3: SIDE-BY-SIDE COMPARISON

🧪 TEST 1: What is machine learning?

🔍 Testing BASE MODEL...
📝 Question: What is machine learning?
--------------------------------------------------
✅ Base model loaded successfully!

🔄 STEP 2: CHECKING FINE-TUNED MODEL
✅ Fine-tuned model is available and ready!

🔄 STEP 3: SIDE-BY-SIDE COMPARISON

🧪 TEST 1: What is machine learning?

🔍 Testing BASE MODEL...
📝 Question: What is machine learning?
--------------------------------------------------
✅ BASE MODEL Response:
   Machine learning (ML) is a field of artificial intelligence tha

## Step 13: Save and Share Your Model

**🎯 What this cell does:**
Prepares your fine-tuned model for deployment and sharing by creating documentation, saving configuration files, and providing deployment instructions.

**📋 What happens inside:**

**Documentation Creation:**
- **Model Card**: Creates a README.md with model description, training details, and usage examples
- **Training Config**: Saves all training parameters as JSON for reproducibility
- **File Inventory**: Lists all saved files with their sizes
- **Usage Instructions**: Provides code examples for loading and using your model

**Files Created in Output Directory:**
- `adapter_config.json`: LoRA configuration
- `adapter_model.safetensors`: Your trained LoRA weights  
- `README.md`: Complete model documentation
- `training_config.json`: Training parameters for reproducibility
- `tokenizer.json` & related: Tokenizer files
- `training_args.bin`: HuggingFace training arguments

**⚙️ Customization options:**

**Model Card Content:**
You can edit the `model_card_content` to include:
- Specific use cases your model was trained for
- Performance benchmarks and evaluation results
- Known limitations and bias considerations
- Citation information if publishing

**Sharing Options:**

**Option 1 - HuggingFace Hub (Recommended):**
```python
# Upload to HuggingFace Hub
model.push_to_hub("your-username/gemma-3-1b-your-domain")
tokenizer.push_to_hub("your-username/gemma-3-1b-your-domain")
```

**Option 2 - Direct File Sharing:**
- Share the entire output directory
- Recipients can load with: `PeftModel.from_pretrained(base_model, "path/to/directory")`

**Option 3 - Merge and Save Full Model:**
```python
# Merge LoRA weights into base model for standalone use
merged_model = model.merge_and_unload()
merged_model.save_pretrained("./merged_model")
```

**🔧 Advanced deployment considerations:**

**Production Optimization:**
- **Quantization**: Use 8-bit or 4-bit quantization for smaller memory footprint
- **ONNX Conversion**: Convert to ONNX for cross-platform deployment
- **TensorRT**: Optimize for NVIDIA inference servers
- **Model Pruning**: Remove unused parameters for smaller size

**Quality Assurance:**
- Test with diverse inputs not in training data
- Implement safety filters for production use
- Monitor for bias and inappropriate outputs
- Set up feedback collection for continuous improvement

**💡 Expected output:**
You should see successful creation of documentation files, training configuration export, complete file listing with sizes, and detailed instructions for sharing and using your fine-tuned model.

In [13]:
# Save and prepare model for sharing
print("💾 SAVING AND PREPARING MODEL")
print("=" * 50)

# Create model card with training information
model_card_content = f"""
# Gemma 3 1B Instruct Fine-tuned Model

## Model Description
This is a fine-tuned version of Google's Gemma 3 1B Instruct model, adapted for custom instruction-following tasks.

## Training Details
- **Base model**: {MODEL_NAME}
- **Fine-tuning method**: LoRA (Low-Rank Adaptation)
- **Training device**: {device}
- **LoRA rank**: {LORA_R}
- **LoRA alpha**: {LORA_ALPHA}
- **Training epochs**: {training_args.num_train_epochs}
- **Learning rate**: {training_args.learning_rate}
- **Batch size**: {effective_batch_size} (effective)

## Usage
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel

# Load tokenizer and base model
tokenizer = AutoTokenizer.from_pretrained("google/gemma-3-1b-it")
base_model = AutoModelForCausalLM.from_pretrained("google/gemma-3-1b-it")

# Load fine-tuned model
model = PeftModel.from_pretrained(base_model, "path/to/this/model")

# Generate text
prompt = "### Instruction:\\nYour question here\\n\\n### Response:\\n"
inputs = tokenizer.encode(prompt, return_tensors="pt")
outputs = model.generate(inputs, max_length=200, temperature=0.7)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
```

## Training Data
The model was fine-tuned on a custom instruction-following dataset.

## Limitations
- This is a demonstration model with limited training data
- May not generalize well to all tasks
- Requires the same format for optimal performance

## License
This model inherits the license from the base Gemma 3 model.
"""

# Save model card
with open(f"{output_dir}/README.md", "w") as f:
    f.write(model_card_content)

print("📝 Model card created")

# Save training configuration
training_config = {
    "base_model": MODEL_NAME,
    "device": device,
    "lora_config": {
        "r": LORA_R,
        "alpha": LORA_ALPHA,
        "dropout": LORA_DROPOUT,
        "target_modules": target_modules
    },
    "training_args": training_args.to_dict() if hasattr(training_args, 'to_dict') else str(training_args),
    "dataset_size": len(extended_data),
    "max_length": MAX_LENGTH
}

with open(f"{output_dir}/training_config.json", "w") as f:
    json.dump(training_config, f, indent=2, default=str)

print("⚙️ Training configuration saved")

# List all saved files
print(f"\n📁 SAVED FILES IN {output_dir}:")
try:
    for file in os.listdir(output_dir):
        file_path = os.path.join(output_dir, file)
        if os.path.isfile(file_path):
            size_mb = os.path.getsize(file_path) / (1024 * 1024)
            print(f"   {file} ({size_mb:.1f} MB)")
except Exception as e:
    print(f"❌ Error listing files: {e}")

# Instructions for using the model
print(f"\n🚀 NEXT STEPS:")
print(f"1. Test the model thoroughly with your use case")
print(f"2. If performance is good, consider training with more data")
print(f"3. You can push to HuggingFace Hub for sharing:")
print(f"   model.push_to_hub('your-username/gemma-3-1b-it-finetuned')")
print(f"4. Or share the '{output_dir}' folder directly")

print(f"\n✅ Model preparation complete!")

💾 SAVING AND PREPARING MODEL
📝 Model card created
⚙️ Training configuration saved

📁 SAVED FILES IN ./gemma-3-1b-it-finetuned:
   README.md (0.0 MB)
   tokenizer.json (31.8 MB)
   tokenizer_config.json (1.1 MB)
   adapter_model.safetensors (49.8 MB)
   training_args.bin (0.0 MB)
   added_tokens.json (0.0 MB)
   tokenizer.model (4.5 MB)
   training_config.json (0.0 MB)
   chat_template.jinja (0.0 MB)
   adapter_config.json (0.0 MB)
   special_tokens_map.json (0.0 MB)

🚀 NEXT STEPS:
1. Test the model thoroughly with your use case
2. If performance is good, consider training with more data
3. You can push to HuggingFace Hub for sharing:
   model.push_to_hub('your-username/gemma-3-1b-it-finetuned')
4. Or share the './gemma-3-1b-it-finetuned' folder directly

✅ Model preparation complete!


## 🎯 Summary and Next Steps

Congratulations! You've successfully completed the fine-tuning process for Gemma 3 1B Instruct. Here's what you've accomplished:

### ✅ What you achieved:
1. **Environment Setup**: Configured the system for different devices (CPU, CUDA, Apple Silicon)
2. **Model Loading**: Successfully loaded and prepared Gemma 3 1B Instruct for fine-tuning
3. **Dataset Preparation**: Created and formatted training data for instruction-following
4. **LoRA Implementation**: Applied efficient fine-tuning with Low-Rank Adaptation
5. **Training Execution**: Ran the complete fine-tuning process
6. **Model Evaluation**: Tested the fine-tuned model's performance
7. **Model Deployment**: Saved and prepared the model for sharing

### 🔑 Key concepts learned:
- **Parameter-Efficient Fine-Tuning**: Using LoRA to reduce computational requirements
- **Device Optimization**: Configuring training for different hardware
- **Dataset Formatting**: Preparing instruction-following datasets
- **Training Monitoring**: Understanding metrics and performance
- **Model Evaluation**: Testing and validating fine-tuned models

### 🚀 Improvement strategies:

#### For better results:
1. **More Training Data**: Use 1000+ high-quality examples
2. **Longer Training**: Increase epochs and fine-tune learning rate
3. **Better Data Quality**: Clean, diverse, and relevant examples
4. **Hyperparameter Tuning**: Experiment with LoRA rank, learning rate, batch size
5. **Evaluation Metrics**: Implement proper evaluation beyond loss

#### Advanced techniques:
1. **QLoRA**: Quantized LoRA for even more efficiency
2. **Multi-task Training**: Train on multiple tasks simultaneously
3. **Reinforcement Learning from Human Feedback (RLHF)**: Align with human preferences
4. **Curriculum Learning**: Progressive training difficulty
5. **Model Merging**: Combine multiple fine-tuned adapters

### 💡 Production considerations:
- **Quantization**: Use 8-bit or 4-bit quantization for deployment
- **Optimization**: ONNX conversion or TensorRT for inference speed
- **Monitoring**: Track model performance in production
- **Safety**: Implement content filtering and bias detection
- **Versioning**: Keep track of model versions and training data

### 🛠️ Troubleshooting tips:
- **Memory Issues**: Reduce batch size, use gradient checkpointing, or try CPU training
- **Slow Training**: Check device utilization, use mixed precision, optimize data loading
- **Poor Performance**: Increase training data, adjust learning rate, check data quality
- **Overfitting**: Use validation split, early stopping, or regularization

### 📚 Further learning:
- [PEFT Documentation](https://huggingface.co/docs/peft)
- [LoRA Paper](https://arxiv.org/abs/2106.09685)
- [Gemma Model Documentation](https://huggingface.co/docs/transformers/model_doc/gemma)
- [Fine-tuning Best Practices](https://huggingface.co/blog/rlhf)

### 🎉 Congratulations!
You now have a working fine-tuned Gemma 3 1B Instruct model and the knowledge to improve it further. The techniques you've learned can be applied to other models and tasks. Happy fine-tuning! 🚀

## 📏 Model Dimensions Analysis

**🎯 What this cell does:**
Provides comprehensive analysis of your fine-tuned model's architecture, dimensions, and parameter distribution. This helps you understand the model structure and verify your LoRA configuration.

**📊 Information displayed:**
- Model architecture and layer dimensions
- Parameter counts (total, trainable, frozen)
- LoRA adapter dimensions and locations
- Memory usage breakdown
- Model size comparison (base vs fine-tuned)

In [14]:
# Comprehensive Model Dimensions Analysis
print("📏 FINE-TUNED MODEL DIMENSIONS ANALYSIS")
print("=" * 60)

if 'model' not in globals():
    print("❌ Fine-tuned model not found!")
    print("💡 Please run the training steps first to create the 'model' variable")
else:
    import torch
    
    # === BASIC MODEL INFORMATION ===
    print("🤖 BASIC MODEL INFORMATION:")
    print(f"   Model type: {type(model).__name__}")
    print(f"   Base model: {MODEL_NAME}")
    print(f"   Device: {model.device}")
    print(f"   Training dtype: {model.dtype}")
    
    # === PARAMETER ANALYSIS ===
    print(f"\n📊 PARAMETER BREAKDOWN:")
    
    # Count different types of parameters
    total_params = 0
    trainable_params = 0
    frozen_params = 0
    lora_params = 0
    base_params = 0
    
    param_details = {}
    
    for name, param in model.named_parameters():
        param_count = param.numel()
        total_params += param_count
        
        if param.requires_grad:
            trainable_params += param_count
            if 'lora_' in name:
                lora_params += param_count
        else:
            frozen_params += param_count
            base_params += param_count
        
        # Group by layer type for detailed analysis
        layer_type = name.split('.')[0] if '.' in name else name
        if layer_type not in param_details:
            param_details[layer_type] = {'total': 0, 'trainable': 0}
        param_details[layer_type]['total'] += param_count
        if param.requires_grad:
            param_details[layer_type]['trainable'] += param_count
    
    print(f"   Total parameters: {total_params:,}")
    print(f"   Trainable parameters: {trainable_params:,} ({100 * trainable_params / total_params:.3f}%)")
    print(f"   Frozen parameters: {frozen_params:,} ({100 * frozen_params / total_params:.1f}%)")
    print(f"   LoRA parameters: {lora_params:,}")
    print(f"   Base model parameters: {base_params:,}")
    
    # === LORA ADAPTER ANALYSIS ===
    print(f"\n🔧 LORA ADAPTER DETAILS:")
    lora_adapters = {}
    
    for name, param in model.named_parameters():
        if 'lora_' in name and param.requires_grad:
            # Extract layer info
            parts = name.split('.')
            layer_name = '.'.join(parts[:-2])  # Remove lora_A/B and weight
            adapter_type = parts[-2]  # lora_A or lora_B
            
            if layer_name not in lora_adapters:
                lora_adapters[layer_name] = {}
            
            lora_adapters[layer_name][adapter_type] = {
                'shape': list(param.shape),
                'params': param.numel(),
                'dtype': str(param.dtype)
            }
    
    print(f"   Number of LoRA adapters: {len(lora_adapters)}")
    print(f"   LoRA rank (r): {LORA_R}")
    print(f"   LoRA alpha: {LORA_ALPHA}")
    print(f"   LoRA dropout: {LORA_DROPOUT}")
    
    # Show sample LoRA adapter dimensions
    if lora_adapters:
        print(f"\n   📋 Sample LoRA adapter dimensions:")
        for i, (layer_name, adapters) in enumerate(list(lora_adapters.items())[:3]):  # Show first 3
            print(f"   {i+1}. {layer_name}:")
            if 'lora_A' in adapters:
                print(f"      LoRA A: {adapters['lora_A']['shape']} ({adapters['lora_A']['params']:,} params)")
            if 'lora_B' in adapters:
                print(f"      LoRA B: {adapters['lora_B']['shape']} ({adapters['lora_B']['params']:,} params)")
        
        if len(lora_adapters) > 3:
            print(f"   ... and {len(lora_adapters) - 3} more adapters")
    
    # === MODEL ARCHITECTURE ANALYSIS ===
    print(f"\n🏗️  MODEL ARCHITECTURE:")
    
    # Get model config
    config = model.config if hasattr(model, 'config') else model.base_model.config
    
    print(f"   Hidden size: {config.hidden_size}")
    print(f"   Number of layers: {config.num_hidden_layers}")
    print(f"   Number of attention heads: {config.num_attention_heads}")
    if hasattr(config, 'num_key_value_heads'):
        print(f"   Number of key-value heads: {config.num_key_value_heads}")
    print(f"   Intermediate size: {config.intermediate_size}")
    print(f"   Vocabulary size: {config.vocab_size}")
    print(f"   Max position embeddings: {config.max_position_embeddings}")
    
    # === LAYER-BY-LAYER BREAKDOWN ===
    print(f"\n📈 LAYER-BY-LAYER PARAMETER BREAKDOWN:")
    print(f"{'Layer Type':<25} {'Total Params':<15} {'Trainable':<15} {'Percentage':<10}")
    print("-" * 70)
    
    for layer_type, counts in sorted(param_details.items()):
        total = counts['total']
        trainable = counts['trainable']
        percentage = (trainable / total * 100) if total > 0 else 0
        print(f"{layer_type:<25} {total:<15,} {trainable:<15,} {percentage:<10.2f}%")
    
    # === MEMORY ANALYSIS ===
    print(f"\n💾 MEMORY USAGE ANALYSIS:")
    
    # Calculate model memory usage
    model_memory_bytes = sum(p.numel() * p.element_size() for p in model.parameters())
    trainable_memory_bytes = sum(p.numel() * p.element_size() for p in model.parameters() if p.requires_grad)
    
    model_memory_mb = model_memory_bytes / (1024 * 1024)
    trainable_memory_mb = trainable_memory_bytes / (1024 * 1024)
    
    print(f"   Total model memory: {model_memory_mb:.1f} MB")
    print(f"   Trainable parameters memory: {trainable_memory_mb:.1f} MB")
    print(f"   Memory efficiency: {100 * (1 - trainable_memory_mb / model_memory_mb):.1f}% reduction")
    
    # Device memory if available
    if device == "cuda" and torch.cuda.is_available():
        allocated = torch.cuda.memory_allocated() / (1024**3)
        cached = torch.cuda.memory_reserved() / (1024**3)
        print(f"   GPU memory allocated: {allocated:.2f} GB")
        print(f"   GPU memory cached: {cached:.2f} GB")
    
    # === COMPARISON WITH BASE MODEL ===
    print(f"\n📊 COMPARISON WITH BASE MODEL:")
    base_model_params = 2_506_327_808  # Approximate Gemma 3-1B parameter count
    print(f"   Original Gemma 3-1B parameters: {base_model_params:,}")
    print(f"   Added LoRA parameters: {lora_params:,}")
    print(f"   Parameter increase: {100 * lora_params / base_model_params:.3f}%")
    print(f"   Effective model size: {base_model_params + lora_params:,} parameters")
    
    # === TRAINABLE PARAMETER LOCATIONS ===
    print(f"\n🎯 TRAINABLE PARAMETER LOCATIONS:")
    trainable_layers = []
    for name, param in model.named_parameters():
        if param.requires_grad and 'lora_' in name:
            layer_info = name.split('.')[:-2]  # Remove lora_A/B.weight
            layer_path = '.'.join(layer_info)
            if layer_path not in trainable_layers:
                trainable_layers.append(layer_path)
    
    print(f"   Number of layers with LoRA adapters: {len(trainable_layers)}")
    if trainable_layers:
        print(f"   Target modules: {target_modules}")
        print(f"   Sample adapted layers:")
        for i, layer in enumerate(trainable_layers[:5]):  # Show first 5
            print(f"      {i+1}. {layer}")
        if len(trainable_layers) > 5:
            print(f"      ... and {len(trainable_layers) - 5} more layers")
    
    print(f"\n✅ Model dimensions analysis complete!")
    print(f"💡 Your LoRA fine-tuning is using only {100 * trainable_params / total_params:.3f}% of the model parameters!")

📏 FINE-TUNED MODEL DIMENSIONS ANALYSIS
🤖 BASIC MODEL INFORMATION:
   Model type: PeftModelForCausalLM
   Base model: google/gemma-3-1b-it
   Device: cuda:0
   Training dtype: torch.float16

📊 PARAMETER BREAKDOWN:
   Total parameters: 1,012,931,712
   Trainable parameters: 13,045,760 (1.288%)
   Frozen parameters: 999,885,952 (98.7%)
   LoRA parameters: 13,045,760
   Base model parameters: 999,885,952

🔧 LORA ADAPTER DETAILS:
   Number of LoRA adapters: 364
   LoRA rank (r): 16
   LoRA alpha: 32
   LoRA dropout: 0.1

   📋 Sample LoRA adapter dimensions:
   1. base_model.model.model.layers.0.self_attn.q_proj.lora_A:
   2. base_model.model.model.layers.0.self_attn.q_proj.lora_B:
   3. base_model.model.model.layers.0.self_attn.k_proj.lora_A:
   ... and 361 more adapters

🏗️  MODEL ARCHITECTURE:
   Hidden size: 1152
   Number of layers: 26
   Number of attention heads: 4
   Number of key-value heads: 1
   Intermediate size: 6912
   Vocabulary size: 262144
   Max position embeddings: 32768

