# Tool-Calling Fine-Tuning with SFT

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ProfSynapse/Toolset-Training/blob/main/Trainers/notebooks/sft_colab_beginner.ipynb)

## 🎓 What You'll Learn

This notebook teaches you how to fine-tune a language model to use the **Claudesidian-MCP toolset** for Obsidian vault operations. By the end, you'll have:

- **A custom AI model** that can call tools to manage Obsidian vaults
- **Hands-on experience** with supervised fine-tuning (SFT)
- **Understanding** of hyperparameters and how they affect training

## 🔬 What is SFT?

**SFT (Supervised Fine-Tuning)** is like teaching through examples:
- You show the model examples of correct tool-calling behavior
- The model learns to replicate those patterns
- Use SFT when teaching a model a **new skill** (like using tools)

**When to use SFT:**
- ✅ Teaching tool-calling from scratch
- ✅ Learning new task formats
- ✅ Initial training with positive examples

**Not for:**
- ❌ Refining existing behavior (use KTO instead)
- ❌ Teaching preferences between good/bad outputs (use preference learning)

## 💻 Hardware Requirements

**Recommended GPU:**
- 7B models: T4 (15GB VRAM) - ✅ **Free Colab tier works!**
- 13B models: A100 (40GB VRAM) - Colab Pro
- 70B models: A100 (80GB VRAM) - Colab Pro+

**Training time:** ~45 minutes for a 7B model

## 1. Installation

Install Unsloth and dependencies. This takes ~2 minutes.

In [None]:
%%capture
# Install Unsloth for faster training (%%capture must be the first line of the cell)
!pip install unsloth
!pip uninstall unsloth -y && pip install --upgrade --no-cache-dir "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"

In [None]:
%%capture
# Install training dependencies (%%capture must be the first line of the cell)
!pip install -U "transformers>=4.45.0"
!pip install "datasets>=2.14.0"
!pip install -U accelerate bitsandbytes
!pip install -U trl peft xformers triton

## 2. Mount Google Drive (Optional)

Save checkpoints to Google Drive so they persist if runtime disconnects.

In [None]:
from google.colab import drive
import os

drive.mount('/content/drive')

# Create output directory
DRIVE_OUTPUT_DIR = "/content/drive/MyDrive/SFT_Training"
os.makedirs(DRIVE_OUTPUT_DIR, exist_ok=True)

print(f"✓ Google Drive mounted")
print(f"✓ Checkpoints will be saved to: {DRIVE_OUTPUT_DIR}")

## 3. HuggingFace Credentials

Add your HF token to Colab secrets:
1. Click the 🔑 key icon in the left sidebar
2. Add new secret: `HF_TOKEN`
3. Get token from https://huggingface.co/settings/tokens

In [None]:
import os
from google.colab import userdata
from huggingface_hub import HfApi

# Get token from Colab secrets
HF_TOKEN = userdata.get('HF_TOKEN')
os.environ['HF_TOKEN'] = HF_TOKEN

# Get your HuggingFace username automatically
api = HfApi()
hf_user = api.whoami(token=HF_TOKEN)["name"]

print(f"✓ HuggingFace token loaded")
print(f"✓ Username: {hf_user}")

## 4. Model Configuration

**What this does:** Choose the base model you want to fine-tune and configure basic settings.

Think of this like choosing which "brain" you want to teach tool-calling skills to.

In [None]:
# @title ⚙️ Model & Dataset Configuration
# @markdown Use the dropdowns to select your model and configure your dataset.

# @markdown ### 🧠 Base Model Selection
# @markdown Choose a model based on your VRAM availability. Models are ordered by size (1B - 24B).
# @markdown * **1B-3B:** Fast, runs on any GPU
# @markdown * **7B-9B:** Standard balance of speed/intelligence
# @markdown * **12B-24B:** High intelligence, requires ~12GB-24GB VRAM (A100 recommended for 20B+)
MODEL_NAME = "unsloth/mistral-7b-instruct-v0.3-bnb-4bit" # @param ["unsloth/Llama-3.2-1B-Instruct-bnb-4bit", "unsloth/Qwen2.5-1.5B-Instruct-bnb-4bit", "unsloth/gemma-2-2b-it-bnb-4bit", "unsloth/Llama-3.2-3B-Instruct-bnb-4bit", "unsloth/Qwen2.5-3B-Instruct-bnb-4bit", "unsloth/Phi-3.5-mini-instruct", "unsloth/Qwen2.5-7B-Instruct-bnb-4bit", "unsloth/mistral-7b-v0.3-bnb-4bit", "unsloth/mistral-7b-instruct-v0.3-bnb-4bit", "unsloth/Qwen3-VL-8B-Instruct-unsloth-bnb-4bit", "unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit",  "unsloth/DeepSeek-R1-0528-Qwen3-8B-unsloth-bnb-4bit", "unsloth/gemma-2-9b-it-bnb-4bit", "unsloth/Llama-3.2-11B-Vision-Instruct-unsloth-bnb-4bit", "unsloth/Mistral-Nemo-Instruct-v1-bnb-4bit", "unsloth/gemma-3-12b-it-unsloth-bnb-4bit", "unsloth/Qwen2.5-14B-Instruct-bnb-4bit", "unsloth/Phi-4", "unsloth/gpt-oss-20b-unsloth-bnb-4bit", "unsloth/Mistral-Small-24B-Instruct-2501-bnb-4bit", "unsloth/Magistral-Small-2509-unsloth-bnb-4bit"]

# @markdown ### 📏 Max Sequence Length
# @markdown  Maximum tokens per training example (prompt plus response). 2048 is standard; higher values require more VRAM.
MAX_SEQ_LENGTH = 2048 # @param [1024, 2048, 4096, 8192] {type:"raw"}

# @markdown ### 📚 Dataset Configuration
DATASET_NAME = "professorsynapse/claudesidian-synthetic-dataset" # @param {type:"string"}
DATASET_FILE = "tools_sft_v1.5_11.29.25.jsonl" # @param {type:"string"}

# @markdown ### 🏷️ Output Model Name
# @markdown Name your fine-tuned model (e.g., `my-tool-model-v1`).
OUTPUT_MODEL_NAME = "nexus-tools-sft-7b" # @param {type:"string"}

print(f"✓ Configuration set:")
print(f"  • Model: {MODEL_NAME}")
print(f"  • Context: {MAX_SEQ_LENGTH}")
print(f"  • Dataset: {DATASET_NAME}/{DATASET_FILE}")
print(f"  • Output: {OUTPUT_MODEL_NAME}")

## 5. Load Model and Tokenizer

**What this does:** Downloads the base model and prepares it for training.

The model is the "brain" that will learn tool-calling. The tokenizer converts text into numbers the model can process. We use 4-bit quantization to fit large models into limited GPU memory.
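
As a rough illustration of why 4-bit loading matters, the sketch below estimates the memory needed just to hold the weights at different precisions. It assumes a nominal 7 billion parameters and ignores quantization metadata, activations, LoRA adapters, and optimizer state, so real usage will be higher. `weight_memory_gib` is a hypothetical helper written for this example, not part of Unsloth.

```python
def weight_memory_gib(num_params: float, bits_per_param: int) -> float:
    """Approximate memory (GiB) needed to store the weights alone."""
    bytes_total = num_params * bits_per_param / 8
    return bytes_total / 1024**3

NUM_PARAMS = 7e9  # Nominal size of a "7B" model (assumption for illustration)

for bits in (16, 8, 4):
    print(f"{bits:>2}-bit weights: ~{weight_memory_gib(NUM_PARAMS, bits):.1f} GiB")
```

The 4-bit figure is a quarter of the 16-bit one, which is where the "~75%" memory reduction quoted in the loading cell comes from.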

In [None]:
from unsloth import FastLanguageModel
import torch

# Check GPU and store info for lineage
GPU_NAME = torch.cuda.get_device_name(0)
CUDA_VERSION = torch.version.cuda
GPU_MEMORY_GB = torch.cuda.get_device_properties(0).total_memory / 1024**3

print(f"Using GPU: {GPU_NAME}")
print(f"CUDA version: {CUDA_VERSION}")
print(f"Available VRAM: {GPU_MEMORY_GB:.1f} GB")
print()

In [None]:
# Load the base model and tokenizer from HuggingFace
# This downloads the model weights (roughly 4-6 GB for a pre-quantized 4-bit 7B model)
#
# Parameters explained:
#   model_name: Which model to download
#   max_seq_length: Max tokens model can process at once
#   dtype=None: Auto-detect best precision for your GPU
#   load_in_4bit=True: Use 4-bit quantization to save memory
#   token: Your HF token for accessing the model

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=MODEL_NAME,
    max_seq_length=MAX_SEQ_LENGTH,
    dtype=None,  # Auto-detect (usually bfloat16 or float16)
    load_in_4bit=True,  # Reduces memory usage by ~75%
    token=HF_TOKEN,
)

# ══════════════════════════════════════════════════════════════════════
# CRITICAL: Apply chat template BEFORE training using Unsloth
# ══════════════════════════════════════════════════════════════════════
# This ensures the special tokens (<|im_start|>, <|im_end|>, [INST], etc.)
# are properly handled as single tokens, not as literal text strings.
# Without this, the model will output special tokens as literal text!

from unsloth.chat_templates import get_chat_template

# Detect the correct chat template based on model name
if "qwen" in MODEL_NAME.lower():
    CHAT_TEMPLATE_NAME = "chatml"  # Qwen uses ChatML format
elif "llama" in MODEL_NAME.lower():
    CHAT_TEMPLATE_NAME = "llama-3"
elif "mistral" in MODEL_NAME.lower():
    CHAT_TEMPLATE_NAME = "mistral"
elif "gemma" in MODEL_NAME.lower():
    CHAT_TEMPLATE_NAME = "gemma"
elif "phi" in MODEL_NAME.lower():
    CHAT_TEMPLATE_NAME = "phi-3"
elif "deepseek" in MODEL_NAME.lower():
    CHAT_TEMPLATE_NAME = "chatml"  # DeepSeek uses ChatML
else:
    CHAT_TEMPLATE_NAME = "chatml"  # Default fallback

tokenizer = get_chat_template(
    tokenizer,
    chat_template=CHAT_TEMPLATE_NAME,
)

print(f"Applied {CHAT_TEMPLATE_NAME} chat template")

# Store for lineage
TOTAL_PARAMS = sum(p.numel() for p in model.parameters())

print("Model loaded successfully")
print(f"  Model has {TOTAL_PARAMS:,} parameters")
print(f"  Chat template: {CHAT_TEMPLATE_NAME}")

## 6. Apply LoRA Adapters

**What this does:** Add trainable "adapter" layers to the model instead of training the entire thing.

Think of LoRA like teaching a new skill through muscle memory - we add small specialized layers that learn the new behavior, while keeping the main "brain" frozen. This is way faster and uses less memory than retraining everything.
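
To make "small specialized layers" concrete, here is a minimal sketch of how many parameters one LoRA adapter adds to a single weight matrix. It assumes the standard LoRA factorization (two low-rank matrices `A` and `B`) and a hypothetical 4096 x 4096 projection; the real layer shapes depend on the model you pick, and `lora_param_count` is an illustrative helper only.

```python
def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Parameters added by one LoRA adapter: A is (rank x d_in), B is (d_out x rank)."""
    return rank * d_in + d_out * rank

# Hypothetical square projection, e.g. a 4096 x 4096 attention weight (assumption).
d_in, d_out, rank = 4096, 4096, 32

full = d_in * d_out
added = lora_param_count(d_in, d_out, rank)
print(f"Frozen weight: {full:,} parameters")
print(f"LoRA adapter:  {added:,} parameters ({added / full:.2%} of the frozen weight)")
```

Doubling the rank doubles the adapter size, which is why higher ranks cost more memory. In standard LoRA the adapter's update is also scaled by `alpha / r`, so the notebook's choice of `alpha = 2 * r` keeps that scale at 2 whatever rank you select.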

In [None]:
# @title 🔧 LoRA Adapter Configuration
# @markdown Configure the size and strength of the fine-tuning adapters.

# @markdown ### 🎛️ LoRA Parameters
# @markdown **Rank (r):** Higher = more adapter capacity, but slower training and more memory (Standard: 16-64).
# @markdown Alpha will be automatically set to 2 * r.
LORA_R = 32 # @param [8, 16, 32, 64, 128] {type:"raw"}

LORA_ALPHA = LORA_R * 2

# @markdown **Dropout:** Helps prevent overfitting (Standard: 0.05).
LORA_DROPOUT = 0.05 # @param {type:"number"}

# @markdown **Random Seed:** Change this for different initialization.
RANDOM_STATE = 3407 # @param {type:"integer"}

# Target modules for LoRA
TARGET_MODULES = ["q_proj", "k_proj", "v_proj", "o_proj",
                  "gate_proj", "up_proj", "down_proj"]

model = FastLanguageModel.get_peft_model(
    model,
    r=LORA_R,
    lora_alpha=LORA_ALPHA,
    lora_dropout=LORA_DROPOUT,
    bias="none",
    target_modules=TARGET_MODULES,
    use_gradient_checkpointing="unsloth",
    random_state=RANDOM_STATE,
)

# Store for lineage
TRAINABLE_PARAMS = sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f"✓ LoRA adapters applied:")
print(f"  • Rank: {LORA_R}")
print(f"  • Alpha: {LORA_ALPHA}")
print(f"  • Dropout: {LORA_DROPOUT}")
print(f"  • Trainable params: {TRAINABLE_PARAMS:,}")

## 7. Load and Prepare Dataset

**What this does:** Downloads training examples and formats them for the model.

The dataset contains examples of correct tool-calling behavior. Think of it like a textbook full of solved problems that the model will learn from.

In [None]:
from datasets import load_dataset

"""
LOAD DATASET FROM HUGGINGFACE

This downloads pre-made training examples of correct tool-calling behavior.
Each example shows: user request → tool call → result → assistant response
"""

print(f"Loading dataset: {DATASET_NAME}/{DATASET_FILE}")
dataset = load_dataset(
    DATASET_NAME,  # HuggingFace repository containing the dataset
    data_files=DATASET_FILE,  # Specific JSONL file to use
    split="train"  # Use the training split
)

# Store dataset info for lineage
DATASET_SIZE = len(dataset)

print(f"Loaded {DATASET_SIZE} examples")
print(f"\nSample example:")
print(dataset[0])

"""
CONVERT TOOL CALLS FORMAT FOR QWEN MODELS

Qwen models expect a flattened tool_calls format:
  - Qwen format:  tool_call.name, tool_call.arguments
  - OpenAI format: tool_call.function.name, tool_call.function.arguments

This conversion ensures the dataset works with Qwen's native chat template.
"""

def convert_to_qwen_tool_format(example):
    """
    Convert OpenAI-style tool_calls to Qwen-compatible format.

    OpenAI format:
        {"tool_calls": [{"function": {"name": "...", "arguments": "..."}}]}

    Qwen format:
        {"tool_calls": [{"name": "...", "arguments": "..."}]}
    """
    conversations = example.get("conversations", [])
    converted_conversations = []

    for msg in conversations:
        new_msg = dict(msg)  # Copy the message

        if "tool_calls" in msg and msg["tool_calls"]:
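            # Convert from OpenAI nested format to Qwen flat format.
            # Calls that are already flat are passed through unchanged
            # (previously they were silently dropped by the filter).
            new_msg["tool_calls"] = [
                {
                    "name": tc["function"]["name"],
                    "arguments": tc["function"]["arguments"]
                }
                if tc.get("function")  # OpenAI nested format
                else tc  # Already flat: keep as-is
                for tc in msg["tool_calls"]
            ]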

        converted_conversations.append(new_msg)

    return {"conversations": converted_conversations}

# Detect Qwen models and apply conversion
is_qwen = 'qwen' in MODEL_NAME.lower()

if is_qwen:
    print("\n" + "=" * 60)
    print("QWEN MODEL DETECTED - Converting tool call format")
    print("=" * 60)
    print("Converting: OpenAI format -> Qwen format")
    print("  Before: tool_call.function.name")
    print("  After:  tool_call.name")

    dataset = dataset.map(
        convert_to_qwen_tool_format,
        desc="Converting to Qwen format"
    )

    print("Tool calls converted to Qwen-compatible format")
    print()
else:
    print(f"\nNon-Qwen model detected - keeping original tool call format")

# Chat template was already applied in section 5 using get_chat_template
# No need for manual template definitions here
print(f"\nUsing chat template: {CHAT_TEMPLATE_NAME} (applied in section 5)")

In [None]:
"""
FORMAT DATASET FOR TRAINING

Convert the conversation format into the exact text format the model expects.
This applies the chat template (already configured in section 5) to each example.
"""
import json

def render_tool_calls_to_content(tool_calls):
    """
    Convert tool_calls array to text content for Qwen ChatML format.
    
    Renders as:
    <tool_call>
    {"name": "toolName", "arguments": {...}}
    </tool_call>
    """
    if not tool_calls:
        return ""
    
    rendered_parts = []
    for tc in tool_calls:
        # Handle OpenAI nested format: {"function": {"name": ..., "arguments": ...}}
        if "function" in tc and tc["function"]:
            func = tc["function"]
            name = func.get("name", "")
            args = func.get("arguments", "{}")
        else:
            # Handle flat format: {"name": ..., "arguments": ...}
            name = tc.get("name", "")
            args = tc.get("arguments", "{}")
        
        # Parse arguments if it's a string, keep as-is if dict
        if isinstance(args, str):
            try:
                args_obj = json.loads(args)
            except json.JSONDecodeError:
                args_obj = args  # Not valid JSON: keep the raw string
        else:
            args_obj = args
        
        # Format the tool call
        tool_call_obj = {"name": name, "arguments": args_obj}
        rendered_parts.append(
            f"<tool_call>\n{json.dumps(tool_call_obj, indent=2)}\n</tool_call>"
        )
    
    return "\n".join(rendered_parts)


def sanitize_conversations(conversations):
    """
    Ensure all message fields are properly set and tool_calls are rendered to content.
    """
    sanitized = []
    for msg in conversations:
        new_msg = dict(msg)
        
        # Get existing content (or empty string if None)
        content = new_msg.get("content") or ""
        
        # If there are tool_calls, render them to content
        if "tool_calls" in new_msg and new_msg["tool_calls"]:
            tool_content = render_tool_calls_to_content(new_msg["tool_calls"])
            if tool_content:
                # Combine existing content with tool calls
                if content:
                    content = f"{content}\n\n{tool_content}"
                else:
                    content = tool_content
        
        new_msg["content"] = content
        
        # Remove tool_calls since we've rendered them to content
        # (the chat template doesn't handle them)
        if "tool_calls" in new_msg:
            del new_msg["tool_calls"]
        
        sanitized.append(new_msg)
    return sanitized


def format_chat_template(example):
    """
    Convert conversations to tokenizer's chat template.

    Input: {"conversations": [{"role": "user", "content": "..."}, ...]}
    Output: {"text": "<|im_start|>user\n...<|im_end|>\n..."}
    """
    # Sanitize conversations and render tool_calls to content
    conversations = sanitize_conversations(example["conversations"])
    
    text = tokenizer.apply_chat_template(
        conversations,
        tokenize=False,
        add_generation_prompt=False
    )
    return {"text": text}

# Apply formatting to entire dataset
dataset = dataset.map(
    format_chat_template,
    remove_columns=dataset.column_names,
    desc="Formatting dataset"
)

print("Dataset formatted for training")
print(f"Chat template: {CHAT_TEMPLATE_NAME}")
print(f"\nFormatted example (first 1000 characters):")
print(dataset[0]["text"][:1000])
print("\n... (truncated)")
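
Because tool calls are rendered into plain text between `<tool_call>` tags, the same convention can be parsed back out, which is handy for spot-checking formatted examples or model outputs. The following is a minimal sketch under that assumption; `extract_tool_calls` and the tool name `readNote` are hypothetical and exist only for this illustration.

```python
import json
import re

# Matches the <tool_call>...</tool_call> blocks rendered into message content.
TOOL_CALL_PATTERN = re.compile(r"<tool_call>\s*(.*?)\s*</tool_call>", re.DOTALL)

def extract_tool_calls(text: str) -> list:
    """Return every well-formed tool call found in `text` as a dict."""
    calls = []
    for raw in TOOL_CALL_PATTERN.findall(text):
        try:
            parsed = json.loads(raw)
        except json.JSONDecodeError:
            continue  # Skip blocks whose payload is not valid JSON
        if isinstance(parsed, dict) and "name" in parsed:
            calls.append(parsed)
    return calls

# Hypothetical assistant message; the tool name and path are made up.
sample = (
    "I'll open that note.\n\n"
    "<tool_call>\n"
    '{"name": "readNote", "arguments": {"path": "Projects/roadmap.md"}}\n'
    "</tool_call>"
)
print(extract_tool_calls(sample))
```

Running a check like this over a handful of formatted examples is a cheap way to confirm that every rendered tool call is still valid JSON before spending GPU time on training.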

## 8. Training Configuration

**What this does:** Set the hyperparameters that control how the model learns.

This is the **most important section** - these settings determine how fast the model learns, how much memory it uses, and how good the final result will be. Think of it like configuring a study plan: how many hours per day, how many review sessions, how to handle difficult material, etc.

In [None]:
from trl import SFTTrainer, SFTConfig
from unsloth import is_bfloat16_supported
from datetime import datetime

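# Create a timestamped output directory for this run.
# To resume an interrupted run, set RESUME_RUN to that run's folder name
# (for example "20250101_093000") so the same directory, and its checkpoints,
# are reused. Leave it as None to start a new run.
RESUME_RUN = None
TRAINING_TIMESTAMP = RESUME_RUN or datetime.now().strftime("%Y%m%d_%H%M%S")
output_dir = f"{DRIVE_OUTPUT_DIR}/{TRAINING_TIMESTAMP}"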

# @title 🏃 Training Hyperparameters
# @markdown Control the speed and quality of training.

# @markdown ### ⚡ Performance Settings
# @markdown **Batch Size:** Examples per step. Lower if you run out of memory.
BATCH_SIZE = 2 # @param [1, 2, 4, 6, 8, 10, 12, 16] {type:"raw"}

# @markdown **Gradient Accumulation:** Simulates larger batches.
GRADIENT_ACCUMULATION = 4 # @param [1, 2, 4, 6, 8, 10, 12, 16] {type:"raw"}

# @markdown ### 🧠 Learning Rate Configuration
# @markdown **Step 1: Choose the Magnitude (Exponent)**
# @markdown This is the most important setting. It determines the "speed" of learning.
# @markdown * **4** = Standard (1e-4). Recommended for 7B models and SFT.
# @markdown * **5** = Slow (1e-5). Use if training is unstable or for larger models.
# @markdown * **6** = Very Slow (1e-6). Precise but takes much longer.
LEARNING_RATE_EXPONENT = 4 # @param [4, 5, 6, 7] {type:"raw"}

# @markdown **Step 2: Choose the Multiplier**
# @markdown Fine-tunes the rate within that magnitude (e.g., Multiplier 2 + Exponent 4 = 2e-4).
LEARNING_RATE_MULTIPLIER = 2 # @param [1, 2, 3, 4, 5, 6, 7, 8, 9] {type:"raw"}

LEARNING_RATE = LEARNING_RATE_MULTIPLIER * (10 ** -LEARNING_RATE_EXPONENT)

# @markdown ### 🔄 Epochs
# @markdown Number of passes through the dataset.
NUM_EPOCHS = 3 # @param {type:"integer"}

# @markdown ### 💾 Saving & Logging
SAVE_STEPS = 50 # @param {type:"integer"}
LOGGING_STEPS = 5 # @param {type:"integer"}

# Other training settings
WARMUP_RATIO = 0.1
MAX_GRAD_NORM = 1.0
LR_SCHEDULER = "cosine"
OPTIMIZER = "adamw_8bit"
USE_BF16 = is_bfloat16_supported()

training_args = SFTConfig(
    output_dir=output_dir,
    per_device_train_batch_size=BATCH_SIZE,
    gradient_accumulation_steps=GRADIENT_ACCUMULATION,
    learning_rate=LEARNING_RATE,
    max_grad_norm=MAX_GRAD_NORM,
    lr_scheduler_type=LR_SCHEDULER,
    warmup_ratio=WARMUP_RATIO,
    num_train_epochs=NUM_EPOCHS,
    max_seq_length=MAX_SEQ_LENGTH,
    packing=False,
    dataset_text_field="text",
    fp16=not USE_BF16,
    bf16=USE_BF16,
    optim=OPTIMIZER,
    gradient_checkpointing=True,
    logging_steps=LOGGING_STEPS,
    save_steps=SAVE_STEPS,
    save_total_limit=3,
    seed=RANDOM_STATE,
    report_to="none",
)

# Calculate effective batch size
EFFECTIVE_BATCH_SIZE = BATCH_SIZE * GRADIENT_ACCUMULATION

print("✓ Training configuration ready")
print(f"  • Batch Size: {BATCH_SIZE}")
print(f"  • Gradient Accum.: {GRADIENT_ACCUMULATION}")
print(f"  • Effective Batch: {EFFECTIVE_BATCH_SIZE}")
print(f"  • Learning Rate: {LEARNING_RATE} ({LEARNING_RATE_MULTIPLIER}e-{LEARNING_RATE_EXPONENT})")
print(f"  • Epochs: {NUM_EPOCHS}")
print(f"  • Output Dir: {output_dir}")
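
To see how batch size, gradient accumulation, and epochs translate into optimizer steps, the sketch below works through the arithmetic with the default settings. The dataset size of 1,000 examples is made up for illustration, `training_schedule` is a hypothetical helper, and the calculation assumes a single GPU; the trainer's own count can differ slightly depending on how it handles a final partial batch.

```python
import math

def training_schedule(num_examples, batch_size, grad_accum, epochs, warmup_ratio):
    """Estimate optimizer steps for a single-GPU run (a simplification)."""
    effective_batch = batch_size * grad_accum
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    total_steps = steps_per_epoch * epochs
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    return effective_batch, steps_per_epoch, total_steps, warmup_steps

# 1,000 examples is a made-up dataset size used only for illustration.
effective, per_epoch, total, warmup = training_schedule(
    num_examples=1000, batch_size=2, grad_accum=4, epochs=3, warmup_ratio=0.1
)
print(f"Effective batch size: {effective}")
print(f"Steps per epoch:      {per_epoch}")
print(f"Total steps:          {total}")
print(f"Warmup steps:         {warmup}")
```

Knowing the total step count up front helps when choosing `SAVE_STEPS` and `LOGGING_STEPS`: a save interval larger than the total number of steps would never write an intermediate checkpoint.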

## 9. Initialize Trainer

**What this does:** Creates the training engine that coordinates everything.

The SFTTrainer is the orchestrator - it takes the model, dataset, and configuration, then handles all the training mechanics (gradient updates, checkpointing, logging, etc.).

In [None]:
# Create the SFTTrainer
# This combines the model, dataset, and configuration into one training pipeline

trainer = SFTTrainer(
    model=model,  # The model with LoRA adapters
    tokenizer=tokenizer,  # For converting text to tokens
    train_dataset=dataset,  # Our formatted training examples
    args=training_args,  # All the hyperparameters we configured
)

print("✓ Trainer initialized")
print("  Ready to start training!")

## 10. Train!

**What this does:** The actual learning happens here!

The model will:
1. **Read examples** from the dataset
2. **Predict** what the response should be
3. **Compare** its prediction to the correct answer
4. **Update weights** to get closer to the correct answer
5. **Repeat** this process for 3 epochs (3 full passes through the data)

**What to expect:**
- Training takes ~45 minutes for 7B models on T4 GPU
- You'll see progress updates every 5 steps
- Loss should generally decrease over time (learning is working!)
- Checkpoints are saved every 50 steps (`SAVE_STEPS`) to Google Drive

**What the metrics mean:**
- **Loss:** How "wrong" the model is (lower = better, aim for <1.0)
- **Learning Rate:** Ramps up during warmup (the first 10% of steps), then gradually decreases along a cosine schedule
- **Samples/sec:** Training speed (depends on GPU)
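
The learning rate you see in the logs follows the warmup and cosine settings from the training configuration. The function below is a minimal sketch of that general shape (linear warmup, then cosine decay towards zero); it is a hand-written approximation for intuition, not the scheduler code the trainer runs, and the step counts are illustrative.

```python
import math

def lr_at_step(step, total_steps, warmup_steps, peak_lr):
    """Linear warmup to peak_lr, then cosine decay towards zero."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

# Illustrative values: 2e-4 peak (the notebook default), 100 total steps, 10 warmup steps.
for step in (0, 5, 10, 55, 100):
    print(f"step {step:>3}: lr = {lr_at_step(step, 100, 10, 2e-4):.2e}")
```

If the logged learning rate is still climbing during the first few updates, that is the warmup phase and not a sign that something is wrong.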

**💾 Checkpoint Resumption:**
If your Colab session disconnects, don't worry! Your checkpoints are saved to Google Drive. You can resume training by:
1. Re-running sections 1-9 (setup, model loading, dataset prep, config)
2. Making sure `output_dir` points at the interrupted run's folder. It includes a timestamp, so a freshly generated `output_dir` starts empty and training begins from scratch
3. Running the training cell below, which detects the latest checkpoint in `output_dir` and resumes from there
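
When picking the latest checkpoint, the step number has to be compared numerically, because alphabetical order puts `checkpoint-50` after `checkpoint-150`. The sketch below demonstrates the difference on a hypothetical run folder; `latest_checkpoint` is an illustrative helper and the paths are made up.

```python
import os

def latest_checkpoint(paths):
    """Return the checkpoint path with the highest step number, or None."""
    def step_of(path):
        # Directory names look like "checkpoint-<step>".
        return int(os.path.basename(path).rsplit("-", 1)[-1])
    return max(paths, key=step_of) if paths else None

# Hypothetical run folder with three saved checkpoints.
paths = [
    "/content/run/checkpoint-50",
    "/content/run/checkpoint-100",
    "/content/run/checkpoint-150",
]
print("Alphabetical last:", sorted(paths)[-1])
print("Numeric latest:   ", latest_checkpoint(paths))
```

Taking the basename before splitting also keeps the lookup safe when the parent folder name itself contains a hyphen.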

In [None]:
import glob
import os
import time

# Check GPU memory
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")
print()

# ══════════════════════════════════════════════════════════════════════
# 🔍 Check for existing checkpoints (automatic resumption)
# ══════════════════════════════════════════════════════════════════════
checkpoint_dirs = sorted(glob.glob(f"{output_dir}/checkpoint-*"))
resume_from_checkpoint = None

if checkpoint_dirs:
    # Found checkpoints - get the latest one
    latest_checkpoint = max(checkpoint_dirs, key=lambda x: int(x.split("-")[-1]))
    resume_from_checkpoint = latest_checkpoint
    print(f"✓ Found existing checkpoint: {os.path.basename(latest_checkpoint)}")
    print(f"  Resuming training from this checkpoint")
    print(f"  Total checkpoints found: {len(checkpoint_dirs)}")
else:
    print("ℹ️  No existing checkpoints found - starting fresh training")

print()

# Start training (or resume)
print("=" * 60)
if resume_from_checkpoint:
    print("RESUMING TRAINING FROM CHECKPOINT")
else:
    print("STARTING TRAINING")
print("=" * 60)
print()

# Track training time
training_start_time = time.time()

trainer_stats = trainer.train(resume_from_checkpoint=resume_from_checkpoint)

# Calculate training duration
TRAINING_DURATION_SECONDS = time.time() - training_start_time
TRAINING_DURATION_MINUTES = TRAINING_DURATION_SECONDS / 60

# Store final metrics for lineage
FINAL_LOSS = trainer_stats.training_loss
TOTAL_STEPS = trainer_stats.global_step

print()
print("=" * 60)
print("TRAINING COMPLETED")
print("=" * 60)
print(f"Final loss: {FINAL_LOSS:.4f}")
print(f"Total steps: {TOTAL_STEPS}")
print(f"Training time: {TRAINING_DURATION_MINUTES:.1f} minutes")

## 11. Build Training Lineage

**What this does:** Captures all training metadata for reproducibility and analysis.

This creates a complete record of:
- Base model and configuration
- Dataset details
- All hyperparameters used
- Training results and metrics
- Hardware and environment info

This information will be automatically added to your HuggingFace model card!
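
One practical use of a saved lineage record is comparing two runs to see exactly which settings changed. The sketch below shows one way to do that, assuming each run's lineage has been loaded as a nested dictionary; `flatten` and `diff_runs` are hypothetical helpers, and the two records are made-up fragments, not real results.

```python
def flatten(record: dict, prefix: str = "") -> dict:
    """Flatten nested dicts into dotted keys, e.g. {"a": {"b": 1}} -> {"a.b": 1}."""
    flat = {}
    for key, value in record.items():
        dotted = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{dotted}."))
        else:
            flat[dotted] = value
    return flat

def diff_runs(run_a: dict, run_b: dict) -> dict:
    """Map each dotted key whose value differs to an (a, b) pair."""
    a, b = flatten(run_a), flatten(run_b)
    keys = sorted(a.keys() | b.keys())
    return {k: (a.get(k), b.get(k)) for k in keys if a.get(k) != b.get(k)}

# Two made-up lineage fragments that differ only in LoRA rank and alpha.
run_1 = {"lora_config": {"r": 16, "alpha": 32}, "training_config": {"num_epochs": 3}}
run_2 = {"lora_config": {"r": 32, "alpha": 64}, "training_config": {"num_epochs": 3}}
print(diff_runs(run_1, run_2))
```

Settings that are identical in both runs are omitted, so the output stays short even when the full lineage is large.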

In [None]:
import json
from datetime import datetime

"""
BUILD COMPLETE TRAINING LINEAGE

This dictionary captures EVERYTHING about the training run for:
- Reproducibility
- Model card generation
- Experiment tracking
- Analysis and comparison
"""

TRAINING_LINEAGE = {
    # IDENTIFICATION
    "model_name": OUTPUT_MODEL_NAME,
    "training_method": "SFT",
    "training_timestamp": TRAINING_TIMESTAMP,
    "training_date": datetime.now().strftime("%Y-%m-%d %H:%M:%S"),
    
    # BASE MODEL
    "base_model": {
        "name": MODEL_NAME,
        "total_parameters": TOTAL_PARAMS,
        "quantization": "4-bit",
        "max_seq_length": MAX_SEQ_LENGTH,
        "chat_template": CHAT_TEMPLATE_NAME,  # Track chat template used
        "is_qwen": is_qwen,
    },
    
    # LORA CONFIGURATION
    "lora_config": {
        "r": LORA_R,
        "alpha": LORA_ALPHA,
        "dropout": LORA_DROPOUT,
        "target_modules": TARGET_MODULES,
        "trainable_parameters": TRAINABLE_PARAMS,
        "trainable_percentage": round(TRAINABLE_PARAMS / TOTAL_PARAMS * 100, 4),
    },
    
    # DATASET
    "dataset": {
        "name": DATASET_NAME,
        "file": DATASET_FILE,
        "huggingface_url": f"https://huggingface.co/datasets/{DATASET_NAME}",
        "total_examples": DATASET_SIZE,
        "chat_template": CHAT_TEMPLATE_NAME,
        "qwen_tool_format_conversion": is_qwen,  # Track if conversion was applied
    },
    
    # TRAINING HYPERPARAMETERS
    "training_config": {
        "batch_size": BATCH_SIZE,
        "gradient_accumulation_steps": GRADIENT_ACCUMULATION,
        "effective_batch_size": EFFECTIVE_BATCH_SIZE,
        "learning_rate": LEARNING_RATE,
        "learning_rate_scheduler": LR_SCHEDULER,
        "warmup_ratio": WARMUP_RATIO,
        "max_grad_norm": MAX_GRAD_NORM,
        "num_epochs": NUM_EPOCHS,
        "optimizer": OPTIMIZER,
        "precision": "bf16" if USE_BF16 else "fp16",
        "gradient_checkpointing": True,
        "packing": False,
        "random_seed": RANDOM_STATE,
    },
    
    # TRAINING RESULTS
    "training_results": {
        "final_loss": round(FINAL_LOSS, 4),
        "total_steps": TOTAL_STEPS,
        "training_duration_minutes": round(TRAINING_DURATION_MINUTES, 1),
    },
    
    # HARDWARE & ENVIRONMENT
    "hardware": {
        "gpu": GPU_NAME,
        "gpu_memory_gb": round(GPU_MEMORY_GB, 1),
        "cuda_version": CUDA_VERSION,
        "platform": "Google Colab",
    },
    
    # FRAMEWORK VERSIONS
    "framework_versions": {
        "torch": torch.__version__,
        "transformers": __import__("transformers").__version__,
        "trl": __import__("trl").__version__,
        "peft": __import__("peft").__version__,
        "unsloth": "latest",
    },
}

# Save lineage to file
lineage_path = f"{output_dir}/training_lineage.json"
with open(lineage_path, "w") as f:
    json.dump(TRAINING_LINEAGE, f, indent=2)

print("Training lineage captured!")
print(f"  Saved to: {lineage_path}")
print()
print("Summary:")
print(f"  Base Model: {MODEL_NAME}")
print(f"  Chat Template: {CHAT_TEMPLATE_NAME}")
print(f"  Dataset: {DATASET_NAME}/{DATASET_FILE} ({DATASET_SIZE} examples)")
print(f"  Method: SFT (Supervised Fine-Tuning)")
print(f"  LoRA: r={LORA_R}, alpha={LORA_ALPHA}")
print(f"  LR: {LEARNING_RATE}, Epochs: {NUM_EPOCHS}")
print(f"  Final Loss: {FINAL_LOSS:.4f}")
print(f"  Duration: {TRAINING_DURATION_MINUTES:.1f} min on {GPU_NAME}")
if is_qwen:
    print(f"  Qwen tool format conversion: Applied")

## 12. Upload to HuggingFace

**What this does:** Share your trained model with the world!

The model card will be **automatically generated** with all your training details:
- Base model and configuration
- Dataset information
- All hyperparameters
- Training results
- Hardware used

We'll create **three versions** of your model:

1. **LoRA adapters** - Small files that contain just the changes
2. **Merged 16-bit model** - Full model with adapters merged in
3. **GGUF quantizations** - Optimized versions for CPU/GPU inference

In [None]:
def generate_model_card(lineage: dict, hf_username: str) -> str:
    """
    Generate a comprehensive HuggingFace model card from training lineage.
    
    This creates a professional README.md with all training details
    for reproducibility and transparency.
    """
    
    base_model = lineage["base_model"]["name"]
    dataset = lineage["dataset"]
    lora = lineage["lora_config"]
    training = lineage["training_config"]
    results = lineage["training_results"]
    hardware = lineage["hardware"]
    frameworks = lineage["framework_versions"]
    
    model_card = f'''---
language:
- en
license: apache-2.0
library_name: transformers
tags:
- tool-calling
- sft
- supervised-fine-tuning
- claudesidian
- obsidian
- fine-tuned
- unsloth
base_model: {base_model}
datasets:
- {dataset["name"]}
pipeline_tag: text-generation
model-index:
- name: {lineage["model_name"]}
  results:
  - task:
      type: text-generation
    metrics:
    - name: Final Loss
      type: loss
      value: {results["final_loss"]}
---

# {lineage["model_name"]}

This model was fine-tuned using **SFT (Supervised Fine-Tuning)** to learn tool-calling behavior for the Claudesidian vault application.

## Model Description

- **Base Model:** [{base_model}](https://huggingface.co/{base_model})
- **Training Method:** SFT (Supervised Fine-Tuning)
- **Task:** Tool-calling for Obsidian vault operations
- **Training Date:** {lineage["training_date"]}

## Training Details

### Dataset

| Property | Value |
|----------|-------|
| Dataset | [{dataset["name"]}]({dataset["huggingface_url"]}) |
| File | {dataset["file"]} |
| Total Examples | {dataset["total_examples"]:,} |
| Chat Template | {dataset["chat_template"]} |

### LoRA Configuration

| Parameter | Value |
|-----------|-------|
| Rank (r) | {lora["r"]} |
| Alpha (α) | {lora["alpha"]} |
| Dropout | {lora["dropout"]} |
| Target Modules | {', '.join(lora["target_modules"])} |
| Trainable Parameters | {lora["trainable_parameters"]:,} ({lora["trainable_percentage"]}%) |

### Training Hyperparameters

| Parameter | Value |
|-----------|-------|
| Batch Size | {training["batch_size"]} |
| Gradient Accumulation | {training["gradient_accumulation_steps"]} |
| Effective Batch Size | {training["effective_batch_size"]} |
| Learning Rate | {training["learning_rate"]} |
| LR Scheduler | {training["learning_rate_scheduler"]} |
| Warmup Ratio | {training["warmup_ratio"]} |
| Max Grad Norm | {training["max_grad_norm"]} |
| Epochs | {training["num_epochs"]} |
| Optimizer | {training["optimizer"]} |
| Precision | {training["precision"]} |
| Packing | {training["packing"]} |
| Random Seed | {training["random_seed"]} |

### Training Results

| Metric | Value |
|--------|-------|
| Final Loss | {results["final_loss"]} |
| Total Steps | {results["total_steps"]:,} |
| Training Duration | {results["training_duration_minutes"]} minutes |

### Hardware

| Component | Value |
|-----------|-------|
| GPU | {hardware["gpu"]} |
| GPU Memory | {hardware["gpu_memory_gb"]} GB |
| CUDA Version | {hardware["cuda_version"]} |
| Platform | {hardware["platform"]} |

### Framework Versions

| Library | Version |
|---------|--------|
| PyTorch | {frameworks["torch"]} |
| Transformers | {frameworks["transformers"]} |
| TRL | {frameworks["trl"]} |
| PEFT | {frameworks["peft"]} |
| Unsloth | {frameworks["unsloth"]} |

## Usage

### With Transformers

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("{hf_username}/{lineage["model_name"]}")
tokenizer = AutoTokenizer.from_pretrained("{hf_username}/{lineage["model_name"]}")

# Example tool-calling prompt
messages = [{{
    "role": "user",
    "content": "Show me the contents of my project roadmap file."
}}]

inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,  # append the assistant turn marker so the model starts its reply
    return_tensors="pt",
)
outputs = model.generate(inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0]))
```

### With Ollama (GGUF)

```bash
# Pull the GGUF build straight from the Hugging Face Hub (note the hf.co/ prefix)
ollama pull hf.co/{hf_username}/{lineage["model_name"]}

# Run inference
ollama run hf.co/{hf_username}/{lineage["model_name"]}
```

### With LM Studio

1. Open LM Studio → "Discover" tab
2. Search for `{hf_username}/{lineage["model_name"]}`
3. Download the Q4_K_M or Q5_K_M GGUF version
4. Load and test with tool-calling prompts

## Intended Use

This model is designed for:
- Tool-calling in Obsidian vault management applications
- Claudesidian MCP integration
- Local AI assistants that interact with note-taking systems

## Limitations

- Trained specifically for Claudesidian tool schemas
- May not generalize to other tool-calling formats
- Best performance with the specific tool set it was trained on

## Citation

```bibtex
@misc{{{lineage["model_name"].replace("-", "_")},
  author = {{{hf_username}}},
  title = {{{lineage["model_name"]}: SFT Fine-tuned Tool-Calling Model}},
  year = {{2025}},
  publisher = {{HuggingFace}},
  url = {{https://huggingface.co/{hf_username}/{lineage["model_name"]}}}
}}
```

## Training Lineage

<details>
<summary>Click to expand full training configuration (JSON)</summary>

```json
{json.dumps(lineage, indent=2)}
```

</details>
'''
    
    return model_card

# Generate model card
MODEL_CARD = generate_model_card(TRAINING_LINEAGE, hf_user)

# Save model card locally
model_card_path = f"{output_dir}/README.md"
# Write as UTF-8 so non-ASCII characters in the card (e.g. α) are not mangled
with open(model_card_path, "w", encoding="utf-8") as f:
    f.write(MODEL_CARD)

print("✓ Model card generated!")
print(f"  Saved to: {model_card_path}")
print()
print("Preview (first 50 lines):")
print("=" * 60)
print("\n".join(MODEL_CARD.split("\n")[:50]))
print("...")

In [None]:
from huggingface_hub import HfApi

# Upload LoRA adapters with model card
print("Uploading LoRA adapters...")

model.push_to_hub(
    f"{hf_user}/{OUTPUT_MODEL_NAME}",
    token=HF_TOKEN,
    private=False
)
tokenizer.push_to_hub(
    f"{hf_user}/{OUTPUT_MODEL_NAME}",
    token=HF_TOKEN,
    private=False
)

# Upload model card (README.md)
api = HfApi()
api.upload_file(
    path_or_fileobj=model_card_path,
    path_in_repo="README.md",
    repo_id=f"{hf_user}/{OUTPUT_MODEL_NAME}",
    token=HF_TOKEN,
)

# Upload training lineage JSON
api.upload_file(
    path_or_fileobj=lineage_path,
    path_in_repo="training_lineage.json",
    repo_id=f"{hf_user}/{OUTPUT_MODEL_NAME}",
    token=HF_TOKEN,
)

print("✓ LoRA adapters uploaded to HuggingFace")
print("✓ Model card with full lineage uploaded")
print(f"  View at: https://huggingface.co/{hf_user}/{OUTPUT_MODEL_NAME}")

In [None]:
# Upload merged 16-bit model with model card
print("Merging LoRA weights into base model (16-bit)...")
print("This will take ~5 minutes...")

model.push_to_hub_merged(
    f"{hf_user}/{OUTPUT_MODEL_NAME}-merged",
    tokenizer,
    save_method="merged_16bit",
    token=HF_TOKEN,
    private=False
)

# Upload model card to merged repo too
api.upload_file(
    path_or_fileobj=model_card_path,
    path_in_repo="README.md",
    repo_id=f"{hf_user}/{OUTPUT_MODEL_NAME}-merged",
    token=HF_TOKEN,
)

api.upload_file(
    path_or_fileobj=lineage_path,
    path_in_repo="training_lineage.json",
    repo_id=f"{hf_user}/{OUTPUT_MODEL_NAME}-merged",
    token=HF_TOKEN,
)

print("✓ Merged model uploaded to HuggingFace")
print("✓ Model card with full lineage uploaded")
print(f"  View at: https://huggingface.co/{hf_user}/{OUTPUT_MODEL_NAME}-merged")

In [None]:
# Create GGUF quantizations
quantization_methods = ["q4_k_m", "q5_k_m", "q8_0"]

print("Creating GGUF quantizations...")
print(f"This will create {len(quantization_methods)} versions")
print()

model.push_to_hub_gguf(
    f"{hf_user}/{OUTPUT_MODEL_NAME}",
    tokenizer,
    quantization_method=quantization_methods,
    token=HF_TOKEN,
)

# Re-upload model card (GGUF upload overwrites it with Unsloth's generic template)
api.upload_file(
    path_or_fileobj=model_card_path,
    path_in_repo="README.md",
    repo_id=f"{hf_user}/{OUTPUT_MODEL_NAME}",
    token=HF_TOKEN,
)

api.upload_file(
    path_or_fileobj=lineage_path,
    path_in_repo="training_lineage.json",
    repo_id=f"{hf_user}/{OUTPUT_MODEL_NAME}",
    token=HF_TOKEN,
)

print()
print("✓ GGUF quantizations created and uploaded!")
print("✓ Model card restored (GGUF upload overwrites it)")
print(f"  View at: https://huggingface.co/{hf_user}/{OUTPUT_MODEL_NAME}")

## 13. Evaluate Model (Optional)

**What this does:** Run automated tests to measure your model's tool-calling accuracy.

This will:
- Load your trained model with vLLM for fast inference
- Run test prompts covering all 47 tools
- Calculate pass rates by category
- Generate evaluation lineage that can be added to your model card

**Skip this section** if you want to evaluate later using the standalone evaluation notebook.
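
The per-category pass rates mentioned above come down to a grouped count. Here is a minimal sketch, assuming each evaluation record exposes a `category` string and a `passed` boolean. The `Record` class and the category names are hypothetical stand-ins, not the Evaluator's actual types:

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Record:
    """Hypothetical stand-in for one evaluation record."""
    category: str
    passed: bool


def pass_rates_by_category(records):
    """Return {category: (passed, total, percent)} for a list of records."""
    counts = defaultdict(lambda: [0, 0])  # category -> [passed, total]
    for record in records:
        counts[record.category][1] += 1
        if record.passed:
            counts[record.category][0] += 1
    return {
        category: (passed, total, round(passed / total * 100, 1))
        for category, (passed, total) in counts.items()
    }


# Illustrative data only; real categories come from the prompt files.
sample_records = [
    Record("category_a", True),
    Record("category_a", False),
    Record("category_b", True),
]
rates = pass_rates_by_category(sample_records)
```

A per-category breakdown is useful because a high overall pass rate can hide one category of tools that fails consistently.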

In [None]:
# @title Run Evaluation (Optional)
# @markdown Test your trained model's tool-calling accuracy and behavioral patterns.

# @markdown ### Enable Evaluation
run_evaluation = True # @param {type:"boolean"}

# @markdown ### Test Suite Selection
# @markdown * **Tool Coverage (47 tools):** Tests each tool individually
# @markdown * **Behavioral Patterns (24 tests):** Tests context efficiency, executePrompt delegation, etc.
# @markdown * **All Suites:** Runs both tool coverage and behavioral patterns
eval_test_suite = "All Suites" # @param ["Tool Coverage (47 tools)", "Behavioral Patterns (24 tests)", "All Suites"]

if run_evaluation:
    print("Installing vLLM for evaluation...")
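    import os
    import subprocess
    import sys
    # Install with the notebook's own interpreter so vLLM lands in the active environment
    subprocess.run([sys.executable, "-m", "pip", "install", "-q", "vllm>=0.6.0"], check=True)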
    
    # Download evaluator framework
    import requests
    from pathlib import Path
    
    os.makedirs("Evaluator/prompts", exist_ok=True)
    os.makedirs("Evaluator/results", exist_ok=True)
    os.makedirs("tools", exist_ok=True)
    
    REPO_BASE = "https://raw.githubusercontent.com/ProfSynapse/Toolset-Training/main"
    
    eval_files = {
        "Evaluator/__init__.py": "Evaluator/__init__.py",
        "Evaluator/runner.py": "Evaluator/runner.py",
        "Evaluator/schema_validator.py": "Evaluator/schema_validator.py",
        "Evaluator/prompt_sets.py": "Evaluator/prompt_sets.py",
        "Evaluator/reporting.py": "Evaluator/reporting.py",
        "Evaluator/config.py": "Evaluator/config.py",
        "Evaluator/prompts/tool_prompts.json": "Evaluator/prompts/tool_prompts.json",
        "Evaluator/prompts/baseline.json": "Evaluator/prompts/baseline.json",
        "Evaluator/prompts/behavioral_patterns.json": "Evaluator/prompts/behavioral_patterns.json",
        "tools/tool_schemas.json": "tools/tool_schemas.json",
    }
    
    print("Downloading evaluation framework...")
    for remote_path, local_path in eval_files.items():
        url = f"{REPO_BASE}/{remote_path}"
        try:
            response = requests.get(url, timeout=30)
            response.raise_for_status()
            Path(local_path).parent.mkdir(parents=True, exist_ok=True)
            with open(local_path, 'w', encoding='utf-8') as f:
                f.write(response.text)
        except Exception as e:
            print(f"  Warning: Failed to download {remote_path}: {e}")
    
    print("Evaluation framework ready")
else:
    print("Evaluation skipped. Set run_evaluation = True to enable.")

In [None]:
if run_evaluation:
    from vllm import LLM, SamplingParams
    from dataclasses import dataclass
    from typing import Any, Dict, Mapping, Sequence
    import time
    import sys
    
    sys.path.insert(0, '/content')
    
    # Use the merged model for evaluation
    EVAL_MODEL = f"{hf_user}/{OUTPUT_MODEL_NAME}-merged"
    
    print(f"Loading model for evaluation: {EVAL_MODEL}")
    print("This may take 1-2 minutes...")
    
    # Initialize vLLM
    eval_llm = LLM(
        model=EVAL_MODEL,
        tensor_parallel_size=1,
        gpu_memory_utilization=0.85,
        max_model_len=2048,
        trust_remote_code=True,
        dtype="auto",
        token=HF_TOKEN,
    )
    
    # Get the vLLM tokenizer for proper chat template handling
    eval_tokenizer = eval_llm.get_tokenizer()
    
    # Create vLLM client for evaluator
    @dataclass
    class VLLMResponse:
        message: str
        raw: Dict[str, Any]
        latency_s: float
    
    class VLLMClient:
        def __init__(self, llm, llm_tokenizer, temperature=0.2, max_tokens=1024, seed=42):
            self.llm = llm
            self.tokenizer = llm_tokenizer
            self.temperature = temperature
            self.max_tokens = max_tokens
            self.seed = seed
        
        def chat(self, messages):
            # Use the tokenizer's chat template instead of manual construction
            # This ensures special tokens are handled correctly
            prompt = self.tokenizer.apply_chat_template(
                messages,
                tokenize=False,
                add_generation_prompt=True
            )
            
            sampling_params = SamplingParams(
                temperature=self.temperature,
                max_tokens=self.max_tokens,
                seed=self.seed,
            )
            
            start = time.perf_counter()
            outputs = self.llm.generate([prompt], sampling_params)
            latency_s = time.perf_counter() - start
            
            output = outputs[0]
            message = output.outputs[0].text.strip()
            
            return VLLMResponse(
                message=message,
                raw={"output": message},
                latency_s=latency_s
            )
    
    eval_client = VLLMClient(eval_llm, eval_tokenizer)
    print("Model loaded for evaluation")
    
    # Run evaluation
    from Evaluator.prompt_sets import load_prompt_cases, filter_prompts
    from Evaluator.runner import evaluate_cases
    from Evaluator.reporting import build_evaluation_lineage, generate_evaluation_model_card_section
    from Evaluator.config import PromptFilter
    
    # Map test suite to files - updated with behavioral patterns
    eval_suite_map = {
        "Tool Coverage (47 tools)": ["Evaluator/prompts/tool_prompts.json"],
        "Behavioral Patterns (24 tests)": ["Evaluator/prompts/behavioral_patterns.json"],
        "All Suites": [
            "Evaluator/prompts/tool_prompts.json",
            "Evaluator/prompts/behavioral_patterns.json"
        ],
    }
    
    eval_prompt_files = eval_suite_map[eval_test_suite]
    all_eval_records = []
    suite_results = {}  # Track results per suite
    
    print()
    print("=" * 60)
    print("RUNNING EVALUATION")
    print(f"Test Suite: {eval_test_suite}")
    print("=" * 60)
    
    for prompt_file in eval_prompt_files:
        suite_name = prompt_file.split("/")[-1].replace(".json", "")
        cases = load_prompt_cases(prompt_file)
        print(f"\nRunning {len(cases)} tests from {suite_name}")
        
        records = evaluate_cases(
            cases=cases,
            client=eval_client,
            dry_run=False,
        )
        all_eval_records.extend(records)
        
        passed = sum(1 for r in records if r.passed)
        # Guard against an empty suite so the rate never divides by zero
        rate = passed / len(records) * 100 if records else 0.0
        suite_results[suite_name] = {
            "passed": passed,
            "total": len(records),
            "rate": rate
        }
        print(f"   Results: {passed}/{len(records)} passed ({rate:.1f}%)")
    
    # Calculate overall results
    eval_passed = sum(1 for r in all_eval_records if r.passed)
    eval_total = len(all_eval_records)
    EVAL_PASS_RATE = round(eval_passed / eval_total * 100, 1) if eval_total else 0.0
    
    print()
    print("=" * 60)
    print("EVALUATION COMPLETE")
    print("=" * 60)
    print(f"\nResults by Suite:")
    for suite_name, results in suite_results.items():
        print(f"  {suite_name}: {results['passed']}/{results['total']} ({results['rate']:.1f}%)")
    print(f"\nOverall: {eval_passed}/{eval_total} passed ({EVAL_PASS_RATE}%)")
    
    # Build evaluation lineage
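    # Read the settings from the client so the lineage always matches what was actually used
    eval_config = {
        "temperature": eval_client.temperature,
        "max_tokens": eval_client.max_tokens,
        "seed": eval_client.seed,
    }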
    eval_hardware = {
        "gpu": GPU_NAME,
        "gpu_memory_gb": round(GPU_MEMORY_GB, 1),
        "platform": "Google Colab",
    }
    
    EVALUATION_LINEAGE = build_evaluation_lineage(
        records=all_eval_records,
        model_name=EVAL_MODEL,
        test_suites=eval_prompt_files,
        eval_config=eval_config,
        hardware_info=eval_hardware,
    )
    
    # Add suite-level breakdown to lineage
    EVALUATION_LINEAGE["suite_results"] = suite_results
    
    MODEL_CARD_EVAL_SECTION = generate_evaluation_model_card_section(EVALUATION_LINEAGE)
    
    # Save evaluation lineage
    eval_lineage_path = f"{output_dir}/evaluation_lineage.json"
    with open(eval_lineage_path, "w") as f:
        json.dump(EVALUATION_LINEAGE, f, indent=2)
    
    print()
    print(f"Evaluation lineage saved: {eval_lineage_path}")
    print(f"  Overall Pass Rate: {EVAL_PASS_RATE}%")

In [None]:
if run_evaluation:
    import re
    import tempfile
    from huggingface_hub import hf_hub_download
    
    print("Uploading evaluation results to HuggingFace...")
    
    # Upload to both LoRA and merged repos
    repos_to_update = [
        f"{hf_user}/{OUTPUT_MODEL_NAME}",
        f"{hf_user}/{OUTPUT_MODEL_NAME}-merged",
    ]
    
    for repo_id in repos_to_update:
        try:
            # Upload evaluation lineage JSON
            api.upload_file(
                path_or_fileobj=eval_lineage_path,
                path_in_repo="evaluation_lineage.json",
                repo_id=repo_id,
                token=HF_TOKEN,
            )
            
            # Download and update README with evaluation section
            try:
                readme_path = hf_hub_download(repo_id=repo_id, filename="README.md", token=HF_TOKEN)
                with open(readme_path, 'r', encoding='utf-8') as f:
                    existing_readme = f.read()
                
                if "## Evaluation Results" in existing_readme:
                    pattern = r'## Evaluation Results.*?(?=\n## |\Z)'
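                    # Pass the replacement through a function so backslashes in the section are kept literally
                    updated_readme = re.sub(pattern, lambda _: MODEL_CARD_EVAL_SECTION, existing_readme, flags=re.DOTALL)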
                else:
                    updated_readme = existing_readme.rstrip() + "\n\n" + MODEL_CARD_EVAL_SECTION
                
                with tempfile.NamedTemporaryFile(mode='w', suffix='.md', delete=False, encoding='utf-8') as f:
                    f.write(updated_readme)
                    temp_readme = f.name
                
                api.upload_file(
                    path_or_fileobj=temp_readme,
                    path_in_repo="README.md",
                    repo_id=repo_id,
                    token=HF_TOKEN,
                )
                print(f"  ✓ Updated: {repo_id}")
            except Exception as e:
                print(f"  ⚠️  Could not update README for {repo_id}: {e}")
        
        except Exception as e:
            print(f"  ⚠️  Failed to update {repo_id}: {e}")
    
    print()
    print("=" * 60)
    print("✓ EVALUATION RESULTS UPLOADED")
    print("=" * 60)
    print(f"Pass Rate: {EVAL_PASS_RATE}%")
    print(f"View model card: https://huggingface.co/{hf_user}/{OUTPUT_MODEL_NAME}-merged")

## Done!

Your model has been trained and uploaded to HuggingFace, and evaluated if you enabled Section 13!

### What You Accomplished

| Step | Description |
|------|-------------|
| ✅ Fine-tuned | Trained a language model to use the Claudesidian toolset |
| ✅ Lineage | Captured complete training metadata for reproducibility |
| ✅ Model Card | Auto-generated with all hyperparameters |
| ✅ Formats | Created LoRA, merged 16-bit, and GGUF versions |
| ✅ Evaluation | Tested tool-calling accuracy (if enabled) |
| ✅ Published | Model card includes training AND evaluation results |

### Your Lineage Files

| File | Location | Description |
|------|----------|-------------|
| `training_lineage.json` | HuggingFace + Google Drive | All training parameters |
| `evaluation_lineage.json` | HuggingFace + Google Drive | Test results by category |
| `README.md` | HuggingFace | Auto-generated model card |
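
Because the lineage files are plain JSON, two runs can be compared with a small diff. The sketch below is one way to do that, assuming the lineage is a nested dictionary. The key names and values in the demo are invented for illustration and may not match the real `training_lineage.json` layout:

```python
import json
import tempfile
from pathlib import Path


def flatten(data, prefix=""):
    """Flatten nested dicts into dotted keys, e.g. {"a": {"b": 1}} -> {"a.b": 1}."""
    flat = {}
    for key, value in data.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat


def diff_lineage(path_a, path_b):
    """Return {dotted_key: (value_a, value_b)} for every setting that differs."""
    a = flatten(json.loads(Path(path_a).read_text(encoding="utf-8")))
    b = flatten(json.loads(Path(path_b).read_text(encoding="utf-8")))
    return {
        key: (a.get(key), b.get(key))
        for key in sorted(a.keys() | b.keys())
        if a.get(key) != b.get(key)
    }


# Demo with two made-up lineage files written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    run_a = Path(tmp) / "run_a.json"
    run_b = Path(tmp) / "run_b.json"
    run_a.write_text(json.dumps({"training": {"learning_rate": 2e-4, "num_epochs": 3}}), encoding="utf-8")
    run_b.write_text(json.dumps({"training": {"learning_rate": 1e-4, "num_epochs": 3}}), encoding="utf-8")
    changes = diff_lineage(run_a, run_b)
```

Running the same diff on `evaluation_lineage.json` from two runs shows which pass rates moved alongside the hyperparameter change.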

### Next Steps

**Test locally with LM Studio:**
1. Open LM Studio → "Discover" tab
2. Search for your model name
3. Download the GGUF version
4. Test with tool-calling prompts

**Test with Ollama:**
```bash
ollama pull hf.co/{your-username}/{model-name}
ollama run hf.co/{your-username}/{model-name}
```

**Refine with KTO:**
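After SFT, you can further improve your model with KTO (Kahneman-Tversky Optimization), a preference-learning method that trains on individual outputs labeled as desirable or undesirable, to steer it toward better tool calls and away from worse ones.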
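
KTO training data is unpaired: each row is a single output with a good/bad label. As a rough sketch, the `prompt` / `completion` / `label` column names below follow the unpaired-preference format that TRL's KTO trainer expects, to my understanding. The `passed` field and the sample tool-call text are hypothetical and do not reflect a real Claudesidian schema:

```python
def to_kto_rows(samples):
    """Convert judged samples into unpaired KTO rows: prompt, completion, label."""
    return [
        {
            "prompt": sample["prompt"],
            "completion": sample["completion"],
            "label": bool(sample["passed"]),  # True = desirable, False = undesirable
        }
        for sample in samples
    ]


# Made-up samples for illustration only.
judged = [
    {
        "prompt": "Open my roadmap note.",
        "completion": "tool_call: read_note(path='roadmap.md')",
        "passed": True,
    },
    {
        "prompt": "Open my roadmap note.",
        "completion": "Sure! Here is your roadmap...",
        "passed": False,
    },
]
kto_rows = to_kto_rows(judged)
```

Outputs that failed evaluation are natural candidates for the undesirable rows, so the evaluation step can feed the next round of training.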

---

**Questions?** Check the [Evaluator README](https://github.com/ProfSynapse/Toolset-Training/blob/main/Evaluator/README.md) or open an issue on GitHub.