# vLLM Inference & Evaluation for Tool-Calling Models

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ProfSynapse/Toolset-Training/blob/main/Trainers/notebooks/vllm_inference_evaluation.ipynb)

## 🎯 Purpose

This notebook evaluates fine-tuned models for **Claudesidian-MCP tool-calling accuracy**. Use it to:

- **Load any model** (HuggingFace, local path, LoRA adapters)
- **Run comprehensive evaluations** (47 tools + behavioral tests)
- **Compare model performance** across different test suites
- **Generate detailed reports** with pass rates and failure analysis

## 📊 Test Suites Available

1. **Full Coverage** (47 tests) - One test per tool
2. **Behavioral Patterns** (21 tests) - Context efficiency, executePrompt usage
3. **Baseline** (6 tests) - General workflows and clarification handling
4. **Tool Combos** - Multi-step tool sequences

## 💻 Hardware Requirements

- **7B models:** T4 GPU (15GB VRAM) - ✅ Free Colab works
- **13B models:** A100 (40GB VRAM) - Colab Pro
- **Inference time:** ~1-2 minutes for full coverage (47 tests)

## 1. Installation

Install vLLM for fast inference and evaluation dependencies.
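
After installing, it can help to confirm that the installed vLLM actually satisfies the `0.6.0` minimum. The sketch below is a rough check, not part of the notebook: `parse_release`, `meets_minimum`, and `installed_version` are hypothetical helper names, and the comparison only looks at the first three numeric components of the version string.

```python
import re
from importlib.metadata import PackageNotFoundError, version


def parse_release(version_string: str) -> tuple:
    """Return the first three numeric components, e.g. '0.6.3.post1' -> (0, 6, 3)."""
    public_part = version_string.split("+")[0]  # drop local tags such as '+cu118'
    numbers = [int(n) for n in re.findall(r"\d+", public_part)[:3]]
    return tuple(numbers + [0] * (3 - len(numbers)))


def meets_minimum(installed: str, minimum: str) -> bool:
    """Compare numerically, so '0.10.0' counts as newer than '0.6.0'."""
    return parse_release(installed) >= parse_release(minimum)


def installed_version(package: str):
    """Return the installed version string, or None if the package is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


found = installed_version("vllm")
if found is None:
    print("vllm is not installed")
else:
    print(f"vllm {found} meets minimum: {meets_minimum(found, '0.6.0')}")
```

A plain string comparison would rank `0.10.0` below `0.6.0`, which is why the components are compared as integers.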

In [None]:
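%%capture
# Install vLLM and dependencies (%%capture must be the first line of the cell)
# Quote the version spec so the shell does not treat ">" as an output redirect
!pip install "vllm>=0.6.0"
!pip install requests pandas
!pip install huggingface_hub

# Note: %%capture also hides this message
print("✓ Dependencies installed")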

## 2. Download Evaluation Framework

Download the Evaluator code and all test suites from the repository.
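
A download can succeed at the HTTP level and still leave an unusable file behind, so a quick sanity pass over the downloaded files may be worthwhile. The sketch below is an assumption-laden helper (`check_downloads` is a hypothetical name, not part of the Evaluator package) that flags files that are missing, empty, or not valid JSON.

```python
import json
from pathlib import Path


def check_downloads(expected_paths, root="."):
    """Return {path: problem} for files that are missing, empty, or invalid JSON."""
    problems = {}
    for relative_path in expected_paths:
        path = Path(root) / relative_path
        if not path.is_file():
            problems[relative_path] = "missing"
        elif path.stat().st_size == 0:
            problems[relative_path] = "empty"
        elif path.suffix == ".json":
            try:
                json.loads(path.read_text(encoding="utf-8"))
            except json.JSONDecodeError as exc:
                problems[relative_path] = f"invalid JSON: {exc.msg}"
    return problems
```

In the notebook this could be called as `check_downloads(files_to_download.values())` once the download cell has run; an empty dict means every file passed the checks.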

In [None]:
import os
import requests
from pathlib import Path

# Create directory structure
os.makedirs("Evaluator/prompts", exist_ok=True)
os.makedirs("Evaluator/results", exist_ok=True)
os.makedirs("tools", exist_ok=True)

# Base URL for raw files from GitHub
REPO_BASE = "https://raw.githubusercontent.com/ProfSynapse/Toolset-Training/main"

# Files to download
files_to_download = {
    # Core evaluator modules
    "Evaluator/__init__.py": "Evaluator/__init__.py",
    "Evaluator/runner.py": "Evaluator/runner.py",
    "Evaluator/schema_validator.py": "Evaluator/schema_validator.py",
    "Evaluator/prompt_sets.py": "Evaluator/prompt_sets.py",
    "Evaluator/reporting.py": "Evaluator/reporting.py",
    "Evaluator/config.py": "Evaluator/config.py",
    
    # Prompt sets
    "Evaluator/prompts/tool_prompts.json": "Evaluator/prompts/tool_prompts.json",
    "Evaluator/prompts/behavioral_patterns.json": "Evaluator/prompts/behavioral_patterns.json",
    "Evaluator/prompts/baseline.json": "Evaluator/prompts/baseline.json",
    "Evaluator/prompts/tool_combos.json": "Evaluator/prompts/tool_combos.json",
    
    # Tool schemas (needed for validation)
    "tools/tool_schemas.json": "tools/tool_schemas.json",
}

def download_file(url, dest):
    """Download a file from URL to destination."""
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    Path(dest).parent.mkdir(parents=True, exist_ok=True)
    with open(dest, 'w', encoding='utf-8') as f:
        f.write(response.text)

print("Downloading evaluation framework...")
failed_downloads = []

for remote_path, local_path in files_to_download.items():
    url = f"{REPO_BASE}/{remote_path}"
    try:
        download_file(url, local_path)
        print(f"  ✓ {remote_path}")
    except Exception as e:
        print(f"  ✗ Failed: {remote_path} - {e}")
        failed_downloads.append(remote_path)

if failed_downloads:
    print(f"\n⚠️  Failed to download {len(failed_downloads)} files:")
    for path in failed_downloads:
        print(f"    - {path}")
else:
    print("\n✓ Evaluation framework ready!")

## 3. Configure Model to Load

Choose which model you want to evaluate. You can load from:
- **HuggingFace Hub** - Any public or private model
- **Local path** - Model saved in this session
- **LoRA adapters** - Base model + your adapters

In [None]:
# @title ⚙️ Model Configuration
# @markdown Select how you want to load the model.

# @markdown ### 📍 Model Source
model_source = "HuggingFace" # @param ["HuggingFace", "Local Path", "LoRA Adapters"]

# @markdown ### 🤗 HuggingFace Configuration
# @markdown If using HuggingFace, enter the model name (e.g., `username/model-name`).
hf_model_name = "professorsynapse/nexus-tools-sft-7b-merged" # @param {type:"string"}
hf_token_required = False # @param {type:"boolean"}

# @markdown ### 📁 Local Path Configuration
# @markdown If using a local path, enter the full path to the model directory.
local_model_path = "/content/drive/MyDrive/model" # @param {type:"string"}

# @markdown ### 🔧 LoRA Configuration
# @markdown If using LoRA adapters, specify base model and adapter path.
base_model_name = "unsloth/mistral-7b-instruct-v0.3-bnb-4bit" # @param {type:"string"}
lora_adapter_path = "/content/drive/MyDrive/lora_adapters" # @param {type:"string"}

# Determine final model configuration
if model_source == "HuggingFace":
    MODEL_NAME = hf_model_name
    USE_LORA = False
    print(f"✓ Configuration set: HuggingFace model")
    print(f"  Model: {MODEL_NAME}")
elif model_source == "Local Path":
    MODEL_NAME = local_model_path
    USE_LORA = False
    print(f"✓ Configuration set: Local model")
    print(f"  Path: {MODEL_NAME}")
else:  # LoRA Adapters
    MODEL_NAME = base_model_name
    USE_LORA = True
    LORA_PATH = lora_adapter_path
    print(f"✓ Configuration set: LoRA adapters")
    print(f"  Base model: {MODEL_NAME}")
    print(f"  Adapters: {LORA_PATH}")

# Handle HF token if needed
HF_TOKEN = None
if hf_token_required:
    try:
        from google.colab import userdata
        HF_TOKEN = userdata.get('HF_TOKEN')
        print("  ✓ HuggingFace token loaded from secrets")
    except Exception:
        print("  ⚠️  Could not load HF_TOKEN from secrets. Add it in the 🔑 Secrets panel if needed.")

## 4. Load Model with vLLM

Initialize the vLLM engine for fast inference.
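
Before loading, a back-of-the-envelope estimate can show whether the weights are likely to fit. The sketch below assumes weight memory is simply parameter count times bytes per parameter and that `gpu_memory_utilization` caps the share of VRAM vLLM may use; it ignores activations, the CUDA context, and other overhead, so treat the numbers as optimistic. Both function names are hypothetical.

```python
def weight_memory_gib(params_billion: float, bytes_per_param: float) -> float:
    """Approximate size of the model weights in GiB."""
    return params_billion * 1e9 * bytes_per_param / 1024**3


def kv_cache_budget_gib(total_vram_gib: float, gpu_memory_utilization: float,
                        weights_gib: float) -> float:
    """Rough memory left for the KV cache; a negative value suggests the model will not fit."""
    return total_vram_gib * gpu_memory_utilization - weights_gib


for label, bytes_per_param in [("16-bit", 2.0), ("4-bit", 0.5)]:
    weights = weight_memory_gib(7, bytes_per_param)
    budget = kv_cache_budget_gib(15.0, 0.85, weights)
    print(f"7B {label}: weights ~{weights:.1f} GiB, KV budget ~{budget:.1f} GiB")
```

Under these assumptions, 16-bit weights for a 7B model come to roughly 13 GiB, which is already more than 85% of a 15 GB card. In that case a smaller or quantized checkpoint is the likelier fix, while a shorter `max_model_len` mainly helps when the KV-cache budget is small but still positive.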

In [None]:
from vllm import LLM, SamplingParams
import torch

# @title 🚀 vLLM Configuration
# @markdown Configure vLLM inference settings.

# @markdown ### 🔧 Performance Settings
tensor_parallel_size = 1 # @param {type:"integer"}
gpu_memory_utilization = 0.85 # @param {type:"slider", min:0.5, max:0.95, step:0.05}
max_model_len = 2048 # @param [1024, 2048, 4096, 8192] {type:"raw"}

# @markdown ### 🐛 Troubleshooting Options
# @markdown Enable if you're having issues loading the model.
enforce_eager = False # @param {type:"boolean"}
disable_custom_all_reduce = False # @param {type:"boolean"}

# Check GPU
if torch.cuda.is_available():
    print(f"Using GPU: {torch.cuda.get_device_name(0)}")
    total_vram = torch.cuda.get_device_properties(0).total_memory / 1024**3
    print(f"Total VRAM: {total_vram:.1f} GB")
    
    # Memory check
    estimated_usage = total_vram * gpu_memory_utilization
    print(f"Target VRAM usage: {estimated_usage:.1f} GB")
    
    if total_vram < 15:
        print(f"\n⚠️  Warning: Limited VRAM detected ({total_vram:.1f} GB)")
        print(f"   Consider using smaller models or reducing max_model_len")
else:
    print("⚠️  No GPU detected. vLLM requires a GPU.")
    raise RuntimeError("GPU required for vLLM")

print(f"\nInitializing vLLM engine...")
print(f"  • Model: {MODEL_NAME}")
print(f"  • Tensor Parallel: {tensor_parallel_size}")
print(f"  • GPU Memory: {gpu_memory_utilization:.0%}")
print(f"  • Max Length: {max_model_len}")

# Build vLLM kwargs
vllm_kwargs = {
    "model": MODEL_NAME,
    "tensor_parallel_size": tensor_parallel_size,
    "gpu_memory_utilization": gpu_memory_utilization,
    "max_model_len": max_model_len,
    "trust_remote_code": True,
    "dtype": "auto",
}

# Add LoRA if needed
if USE_LORA:
    vllm_kwargs["enable_lora"] = True
    print(f"  • LoRA enabled: {LORA_PATH}")

# Add HF token if needed
if HF_TOKEN:
    os.environ['HF_TOKEN'] = HF_TOKEN

# Add troubleshooting options if enabled
if enforce_eager:
    vllm_kwargs["enforce_eager"] = True
    print(f"  • Enforce eager mode: True (slower but more compatible)")

if disable_custom_all_reduce:
    vllm_kwargs["disable_custom_all_reduce"] = True
    print(f"  • Custom all-reduce disabled: True")

# Initialize vLLM with better error handling
print("\n⏳ Loading model... (this may take 1-2 minutes)")
try:
    llm = LLM(**vllm_kwargs)
    print("\n✓ vLLM engine ready!")
    
    # Check actual memory usage
    torch.cuda.synchronize()
    current_vram = torch.cuda.memory_allocated() / 1024**3
    max_vram = torch.cuda.max_memory_allocated() / 1024**3
    print(f"  VRAM allocated: {current_vram:.1f} GB")
    print(f"  Peak VRAM: {max_vram:.1f} GB")
    
except Exception as e:
    print(f"\n❌ Failed to initialize vLLM")
    print(f"\nError: {str(e)}")
    print("\n" + "=" * 60)
    print("TROUBLESHOOTING STEPS")
    print("=" * 60)
    print("\n1. **Model Not Found**")
    print("   • Verify model name is correct")
    print("   • If private model, enable 'hf_token_required' above")
    print("   • Try downloading manually first:")
    print(f"     !huggingface-cli download {MODEL_NAME}")
    print("\n2. **Out of Memory (OOM)**")
    print("   • Reduce gpu_memory_utilization to 0.7 or lower")
    print("   • Reduce max_model_len to 1024")
    print("   • Use smaller model (3B instead of 7B)")
    print(f"   • Your GPU has {total_vram:.1f} GB VRAM")
    print("\n3. **Compatibility Issues**")
    print("   • Enable 'enforce_eager' option above")
    print("   • Enable 'disable_custom_all_reduce' option above")
    print("   • Try upgrading vLLM:")
    print("     !pip install --upgrade vllm")
    print("\n4. **Model Format Issues**")
    print("   • Some models need specific vLLM versions")
    print("   • Try a different model format (merged vs GGUF)")
    print("   • Check if model is compatible with vLLM")
    print("\n5. **Colab-Specific Issues**")
    print("   • Free tier T4 may be overloaded - try later")
    print("   • Restart runtime and try again")
    print("   • Consider Colab Pro for A100 GPUs")
    print("\n" + "=" * 60)
    
    # Re-raise to stop execution
    raise

## 5. Create vLLM Client for Evaluator

Wrap vLLM in a client that works with the Evaluator framework.
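
The wrapper only has to expose a `chat(messages)` method that returns an object with `message`, `raw`, and `latency_s`. As a minimal sketch of that contract, here is a stub that returns a canned reply instead of calling a model. `StubClient` and `StubResponse` are hypothetical names, and the assumption that the Evaluator needs nothing beyond these three fields is inferred from the response class used in this notebook, not from the Evaluator's documentation.

```python
import time
from dataclasses import dataclass, field
from typing import Any, Dict, Mapping, Sequence


@dataclass
class StubResponse:
    """Same three fields as the real response object."""
    message: str
    raw: Dict[str, Any] = field(default_factory=dict)
    latency_s: float = 0.0


class StubClient:
    """Hypothetical stand-in: returns a canned reply instead of calling a model."""

    def __init__(self, canned_reply: str):
        self.canned_reply = canned_reply

    def chat(self, messages: Sequence[Mapping[str, str]]) -> StubResponse:
        start = time.perf_counter()
        # Record the most recent user turn so the raw payload is inspectable
        last_user = next(
            (m["content"] for m in reversed(messages) if m.get("role") == "user"), ""
        )
        raw = {"prompt": last_user, "output": self.canned_reply}
        return StubResponse(self.canned_reply, raw, time.perf_counter() - start)


reply = StubClient("ok").chat([{"role": "user", "content": "List my notes"}])
print(reply.message, reply.raw["prompt"])
```

A stub like this can be handy for exercising the reporting cells without a GPU, provided the assumption about the interface holds.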

In [None]:
from dataclasses import dataclass
from typing import Any, Dict, Mapping, Sequence
import time

@dataclass
class VLLMResponse:
    """Response from vLLM inference."""
    message: str
    raw: Dict[str, Any]
    latency_s: float

class VLLMClient:
    """
    vLLM client that implements the same interface as OllamaClient/LMStudioClient.
    This allows it to work seamlessly with the Evaluator framework.
    """

    def __init__(
        self,
        llm: LLM,
        temperature: float = 0.2,
        top_p: float = 0.9,
        max_tokens: int = 1024,
        seed: int = None,
    ):
        self.llm = llm
        self.temperature = temperature
        self.top_p = top_p
        self.max_tokens = max_tokens
        self.seed = seed

    def chat(self, messages: Sequence[Mapping[str, str]]) -> VLLMResponse:
        """
        Send a chat conversation to vLLM and return the response.

        Args:
            messages: List of message dicts with 'role' and 'content' keys

        Returns:
            VLLMResponse with the assistant's message, raw output, and latency
        """
        # Format messages into a prompt
        # Detect model type and use appropriate format
        model_name_lower = MODEL_NAME.lower()
        
        if 'mistral' in model_name_lower:
            # Mistral format: <s>[INST] user [/INST] assistant</s>
            prompt = self._format_mistral(messages)
        elif 'llama-3' in model_name_lower or 'llama3' in model_name_lower:
            # Llama 3 format
            prompt = self._format_llama3(messages)
        elif 'qwen' in model_name_lower:
            # Qwen format
            prompt = self._format_qwen(messages)
        else:
            # Fallback for unrecognized models: <|role|> ... <|end|> markers
            prompt = self._format_chatml(messages)

        # Create sampling params
        sampling_params = SamplingParams(
            temperature=self.temperature,
            top_p=self.top_p,
            max_tokens=self.max_tokens,
            seed=self.seed,
        )

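        # Generate. When LoRA is configured, attach the adapter to the request;
        # enable_lora=True on the engine alone does not apply any adapter.
        lora_request = None
        if USE_LORA:
            from vllm.lora.request import LoRARequest
            lora_request = LoRARequest("eval_adapter", 1, LORA_PATH)
        start = time.perf_counter()
        outputs = self.llm.generate([prompt], sampling_params, lora_request=lora_request)
        latency_s = time.perf_counter() - start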

        # Extract response
        output = outputs[0]
        message = output.outputs[0].text.strip()

        # Build raw response dict
        raw = {
            "prompt": prompt,
            "output": message,
            "finish_reason": output.outputs[0].finish_reason,
            "prompt_tokens": len(output.prompt_token_ids),
            "completion_tokens": len(output.outputs[0].token_ids),
        }

        return VLLMResponse(
            message=message,
            raw=raw,
            latency_s=latency_s
        )

    def _format_mistral(self, messages: Sequence[Mapping[str, str]]) -> str:
        """Format for Mistral models."""
        prompt_parts = []
        for msg in messages:
            role = msg.get("role", "")
            content = msg.get("content", "")
            if role == "user":
                prompt_parts.append(f"[INST] {content} [/INST]")
            elif role == "assistant":
                prompt_parts.append(f" {content}</s>")
            elif role == "system":
                prompt_parts.append(f"{content} ")
        return "<s>" + "".join(prompt_parts)

    def _format_llama3(self, messages: Sequence[Mapping[str, str]]) -> str:
        """Format for Llama 3 models."""
        prompt_parts = []
        for msg in messages:
            role = msg.get("role", "")
            content = msg.get("content", "")
            prompt_parts.append(f"<|start_header_id|>{role}<|end_header_id|>\n\n{content}<|eot_id|>")
        prompt_parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
        return "<|begin_of_text|>" + "".join(prompt_parts)

    def _format_qwen(self, messages: Sequence[Mapping[str, str]]) -> str:
        """Format for Qwen models."""
        prompt_parts = []
        for msg in messages:
            role = msg.get("role", "")
            content = msg.get("content", "")
            prompt_parts.append(f"<|im_start|>{role}\n{content}<|im_end|>\n")
        prompt_parts.append("<|im_start|>assistant\n")
        return "".join(prompt_parts)

    def _format_chatml(self, messages: Sequence[Mapping[str, str]]) -> str:
        """Fallback <|role|> ... <|end|> format (Phi-3 style). ChatML proper uses <|im_start|>/<|im_end|>, as in _format_qwen."""
        prompt_parts = []
        for msg in messages:
            role = msg.get("role", "")
            content = msg.get("content", "")
            prompt_parts.append(f"<|{role}|>\n{content}<|end|>\n")
        prompt_parts.append("<|assistant|>\n")
        return "".join(prompt_parts)

# Create client with default settings
vllm_client = VLLMClient(
    llm=llm,
    temperature=0.2,
    top_p=0.9,
    max_tokens=1024,
    seed=42,
)

print("✓ vLLM client created and ready for evaluation")

## 6. Configure Evaluation

Choose which test suites to run and configure generation settings.
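
One detail worth knowing when setting `max_prompts`: the run loop in the next section builds the limit into a filter for each prompt file, so the cap appears to apply per file rather than to the run as a whole. The sketch below illustrates the arithmetic under that assumption (`planned_prompt_count` is a hypothetical helper, and the Tool Combos suite is left out because its size is not stated).

```python
def planned_prompt_count(suite_sizes, max_prompts):
    """Total prompts that would run if the limit is applied to each file separately."""
    if max_prompts <= 0:  # 0 means "no limit" in the configuration form
        return sum(suite_sizes)
    return sum(min(size, max_prompts) for size in suite_sizes)


sizes = [47, 21, 6]  # Full Coverage, Behavioral Patterns, Baseline
print(planned_prompt_count(sizes, 0))  # 74
print(planned_prompt_count(sizes, 5))  # 15
```

So with several suites selected, a limit of 5 would run up to 5 prompts from each file, not 5 in total.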

In [None]:
# @title 🎯 Evaluation Configuration
# @markdown Select test suites and configure generation parameters.

# @markdown ### 📋 Test Suite Selection
test_suite = "Full Coverage (47 tools)" # @param ["Full Coverage (47 tools)", "Behavioral Patterns (21 tests)", "Baseline (6 tests)", "Tool Combos (Multi-step)", "All Suites"]

# @markdown ### 🔢 Limits
# @markdown Limit prompts for quick testing (0 = no limit).
max_prompts = 0 # @param {type:"integer"}

# @markdown ### 🎲 Generation Settings
eval_temperature = 0.2 # @param {type:"slider", min:0.0, max:1.0, step:0.1}
eval_top_p = 0.9 # @param {type:"slider", min:0.0, max:1.0, step:0.1}
eval_max_tokens = 1024 # @param {type:"integer"}
eval_seed = 42 # @param {type:"integer"}

# @markdown ### 💾 Output Settings
save_to_drive = True # @param {type:"boolean"}
drive_output_dir = "/content/drive/MyDrive/Evaluation_Results" # @param {type:"string"}

# Map test suite to prompt files
suite_map = {
    "Full Coverage (47 tools)": ["Evaluator/prompts/tool_prompts.json"],
    "Behavioral Patterns (21 tests)": ["Evaluator/prompts/behavioral_patterns.json"],
    "Baseline (6 tests)": ["Evaluator/prompts/baseline.json"],
    "Tool Combos (Multi-step)": ["Evaluator/prompts/tool_combos.json"],
    "All Suites": [
        "Evaluator/prompts/tool_prompts.json",
        "Evaluator/prompts/behavioral_patterns.json",
        "Evaluator/prompts/baseline.json",
        "Evaluator/prompts/tool_combos.json",
    ]
}

prompt_files = suite_map[test_suite]

# Update client settings
vllm_client.temperature = eval_temperature
vllm_client.top_p = eval_top_p
vllm_client.max_tokens = eval_max_tokens
vllm_client.seed = eval_seed

# Setup output directory
if save_to_drive:
    try:
        from google.colab import drive
        drive.mount('/content/drive', force_remount=False)
        os.makedirs(drive_output_dir, exist_ok=True)
        print(f"✓ Google Drive mounted: {drive_output_dir}")
    except Exception:
        print("⚠️  Could not mount Google Drive. Results will only be saved locally.")
        save_to_drive = False

print(f"\n✓ Evaluation configured:")
print(f"  • Test Suite: {test_suite}")
print(f"  • Prompt Files: {len(prompt_files)}")
if max_prompts > 0:
    print(f"  • Max Prompts: {max_prompts}")
else:
    print(f"  • Max Prompts: No limit (all prompts)")
print(f"  • Temperature: {eval_temperature}")
print(f"  • Top-p: {eval_top_p}")
print(f"  • Max Tokens: {eval_max_tokens}")
print(f"  • Seed: {eval_seed}")

## 7. Run Evaluation

Execute the test suite and collect results.
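
The summary statistics reduce to a few counts and an average. The sketch below shows one way to compute them defensively, using plain `(passed, latency_s)` pairs as a stand-in for the Evaluator's record objects (an assumption made to keep the example self-contained). It guards against an empty run and averages latency only over records that actually have one.

```python
from statistics import mean


def summarize(results):
    """results: iterable of (passed, latency_s) pairs; latency_s may be None."""
    results = list(results)
    total = len(results)
    passed = sum(1 for ok, _ in results if ok)
    latencies = [lat for _, lat in results if lat is not None]
    return {
        "total": total,
        "passed": passed,
        "failed": total - passed,
        "pass_rate": passed / total * 100 if total else 0.0,
        "avg_latency_s": mean(latencies) if latencies else None,
    }


print(summarize([(True, 1.0), (False, 3.0), (True, None)]))
```

Dividing the latency sum by the number of all records instead would understate the average whenever some records have no latency, for example after an error.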

In [None]:
import sys
sys.path.insert(0, '/content')  # Add current dir to path

from Evaluator.prompt_sets import load_prompt_cases, filter_prompts
from Evaluator.runner import evaluate_cases
from Evaluator.reporting import build_run_payload, build_evaluation_lineage
from Evaluator.config import PromptFilter
from datetime import datetime
import json

# Results storage
all_records = []

print("=" * 60)
print("STARTING EVALUATION")
print("=" * 60)
print()

for prompt_file in prompt_files:
    print(f"\n📝 Loading prompts from: {prompt_file}")

    # Load and filter prompts
    cases = load_prompt_cases(prompt_file)
    prompt_filter = PromptFilter(tags=None, limit=max_prompts if max_prompts > 0 else None)
    selected_cases = filter_prompts(cases, prompt_filter)

    print(f"   • Loaded {len(cases)} prompts")
    print(f"   • Selected {len(selected_cases)} prompts for evaluation")

    if not selected_cases:
        print("   ⚠️  No prompts matched filters, skipping...")
        continue

    # Progress callback
    completed = 0
    def on_record(record):
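        # The cell body runs at module scope, so "completed" is a global;
        # "nonlocal" would raise a SyntaxError here.
        global completed
        completed += 1
        status = "✓" if record.passed else "✗"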
        time_str = f"{record.latency_s:.2f}s" if record.latency_s else "N/A"
        print(f"   [{completed}/{len(selected_cases)}] {status} {record.case.id} ({time_str})")

    # Run evaluation
    print(f"\n🔄 Running evaluation...")
    records = evaluate_cases(
        cases=selected_cases,
        client=vllm_client,
        dry_run=False,
        on_record=on_record,
    )

    all_records.extend(records)

    # Calculate stats for this file
    passed = sum(1 for r in records if r.passed)
    failed = sum(1 for r in records if not r.passed)
    avg_latency = sum(r.latency_s for r in records if r.latency_s) / len(records) if records else 0

    print(f"\n   Results: {passed}/{len(records)} passed ({passed/len(records)*100:.1f}%)")
    print(f"   Average latency: {avg_latency:.2f}s")

# Overall summary
print("\n" + "=" * 60)
print("EVALUATION COMPLETE")
print("=" * 60)

total_passed = sum(1 for r in all_records if r.passed)
total_failed = sum(1 for r in all_records if not r.passed)
total_tests = len(all_records)
overall_avg_latency = sum(r.latency_s for r in all_records if r.latency_s) / total_tests if total_tests else 0

# Guard against an empty run so the percentage math cannot divide by zero
pass_pct = total_passed / total_tests * 100 if total_tests else 0.0
fail_pct = total_failed / total_tests * 100 if total_tests else 0.0

print(f"\n📊 Overall Results:")
print(f"   • Total Tests: {total_tests}")
print(f"   • Passed: {total_passed} ({pass_pct:.1f}%)")
print(f"   • Failed: {total_failed} ({fail_pct:.1f}%)")
print(f"   • Average Latency: {overall_avg_latency:.2f}s")
print(f"   • Total Time: {sum(r.latency_s for r in all_records if r.latency_s):.2f}s")

# Save results
EVAL_TIMESTAMP = datetime.now().strftime("%Y%m%d_%H%M%S")
model_name_safe = MODEL_NAME.replace("/", "_").replace(":", "_")
results_file = f"Evaluator/results/eval_{model_name_safe}_{EVAL_TIMESTAMP}.json"

# Build metadata for payload
eval_metadata = {
    "model": MODEL_NAME,
    "prompts_path": ", ".join(prompt_files),
    "test_suite": test_suite,
    "temperature": eval_temperature,
    "top_p": eval_top_p,
    "max_tokens": eval_max_tokens,
    "seed": eval_seed,
    "max_prompts": max_prompts if max_prompts > 0 else "all",
}

# Build payload
payload = build_run_payload(
    records=all_records,
    metadata=eval_metadata,
)

# Save locally
with open(results_file, 'w') as f:
    json.dump(payload, f, indent=2)
print(f"\n💾 Results saved locally: {results_file}")

# Save to Google Drive if enabled
if save_to_drive and os.path.exists(drive_output_dir):
    drive_results_file = f"{drive_output_dir}/eval_{model_name_safe}_{EVAL_TIMESTAMP}.json"
    with open(drive_results_file, 'w') as f:
        json.dump(payload, f, indent=2)
    print(f"💾 Results saved to Drive: {drive_results_file}")

# Store for analysis
eval_results = {
    "records": all_records,
    "payload": payload,
    "timestamp": EVAL_TIMESTAMP,
    "results_file": results_file,
}

## 8. Analyze Results by Category

Break down pass rates by tool category and show detailed failure information.
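
When reading the category table, keep in mind that a test with several tags is counted once under each tag, so the per-category totals can add up to more than the number of tests. The sketch below shows the grouping on plain `(tags, passed)` pairs; the pair format, the tag names, and the name `pass_rates_by_tag` are illustrative assumptions.

```python
from collections import defaultdict


def pass_rates_by_tag(results):
    """results: iterable of (tags, passed) pairs. Returns {tag: (passed, total, rate)}."""
    counts = defaultdict(lambda: [0, 0])  # tag -> [passed, total]
    for tags, passed in results:
        for tag in tags or ["untagged"]:
            counts[tag][1] += 1
            if passed:
                counts[tag][0] += 1
    return {tag: (p, t, p / t * 100) for tag, (p, t) in sorted(counts.items())}


sample = [(["search", "vault"], True), (["search"], False), ([], True)]
for tag, (passed, total, rate) in pass_rates_by_tag(sample).items():
    print(f"{tag}: {passed}/{total} ({rate:.1f}%)")
```

Here three tests produce four category entries, because the first test carries two tags.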

In [None]:
from collections import defaultdict
import pandas as pd

# Group by tags
results_by_tag = defaultdict(lambda: {"passed": 0, "failed": 0, "total": 0})

for record in all_records:
    tags = record.case.tags if hasattr(record.case, 'tags') and record.case.tags else ["untagged"]

    for tag in tags:
        results_by_tag[tag]["total"] += 1
        if record.passed:
            results_by_tag[tag]["passed"] += 1
        else:
            results_by_tag[tag]["failed"] += 1

# Convert to DataFrame
df_data = []
for tag, stats in sorted(results_by_tag.items()):
    pass_rate = (stats["passed"] / stats["total"] * 100) if stats["total"] > 0 else 0
    df_data.append({
        "Category": tag,
        "Passed": stats["passed"],
        "Failed": stats["failed"],
        "Total": stats["total"],
        "Pass Rate": f"{pass_rate:.1f}%"
    })

df = pd.DataFrame(df_data)
print("\n📊 Results by Category:")
print("=" * 60)
print(df.to_string(index=False))

# Show failures if any
failures = [r for r in all_records if not r.passed]
if failures:
    print(f"\n\n❌ Failed Tests ({len(failures)}):")
    print("=" * 60)
    for i, record in enumerate(failures[:15], 1):  # Show first 15 failures
        print(f"\n{i}. {record.case.id}")
        print(f"   Question: {record.case.question[:100]}..." if len(record.case.question) > 100 else f"   Question: {record.case.question}")
        
        if record.error:
            print(f"   Error: {record.error}")
        elif record.validator and record.validator.issues:
            print(f"   Issues:")
            for issue in record.validator.issues[:3]:  # Show first 3 issues
                print(f"      • [{issue.level}] {issue.message}")

    if len(failures) > 15:
        print(f"\n   ... and {len(failures) - 15} more failures")
else:
    print("\n\n✅ All tests passed!")

## 9. Generate Markdown Report

Create a human-readable markdown report of the evaluation.

In [None]:
from Evaluator.reporting import render_markdown

# Generate markdown
markdown_file = f"Evaluator/results/eval_{model_name_safe}_{EVAL_TIMESTAMP}.md"
markdown_content = render_markdown(all_records, MODEL_NAME, test_suite)

with open(markdown_file, 'w') as f:
    f.write(markdown_content)

print(f"✓ Markdown report saved locally: {markdown_file}")

# Save to Google Drive if enabled
if save_to_drive and os.path.exists(drive_output_dir):
    drive_markdown_file = f"{drive_output_dir}/eval_{model_name_safe}_{EVAL_TIMESTAMP}.md"
    with open(drive_markdown_file, 'w') as f:
        f.write(markdown_content)
    print(f"✓ Markdown report saved to Drive: {drive_markdown_file}")

# Display preview
print("\n" + "=" * 60)
print("REPORT PREVIEW")
print("=" * 60)
print(markdown_content[:2000])  # Show first 2000 chars
if len(markdown_content) > 2000:
    print("\n... (see full report in markdown file)")

## 10. Build Evaluation Lineage

**What this does:** Creates a comprehensive record of the evaluation for model cards and tracking.

This captures:
- All test configurations and settings
- Pass rates by category
- Failure analysis with specific issues
- Performance metrics (latency, throughput)
- Hardware information

The lineage can be:
1. **Embedded in model cards** - Shows evaluation results on HuggingFace
2. **Saved as JSON** - For programmatic analysis and comparison
3. **Uploaded to HuggingFace** - Alongside your model
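
To make the idea of a lineage record concrete, here is a small sketch of what such a dictionary might look like. Only the keys that this notebook reads back (`results_summary`, `test_config.total_prompts`, `performance.avg_latency_s`) are taken from the code; the remaining layout, the function name `sketch_lineage`, and the sample values are assumptions, and the real `build_evaluation_lineage` output may differ.

```python
import json


def sketch_lineage(model_name, passed, total, latencies, eval_config):
    """Build an illustrative lineage dict; not the Evaluator's actual schema."""
    avg_latency = sum(latencies) / len(latencies) if latencies else 0.0
    return {
        "model": model_name,
        "test_config": {"total_prompts": total, **eval_config},
        "results_summary": {
            "passed": passed,
            "failed": total - passed,
            "overall_pass_rate": round(passed / total * 100, 1) if total else 0.0,
        },
        "performance": {"avg_latency_s": round(avg_latency, 2)},
    }


# Made-up numbers for a hypothetical model
lineage = sketch_lineage("example-org/example-model", 40, 47, [1.0, 2.0],
                         {"temperature": 0.2, "seed": 42})
print(json.dumps(lineage, indent=2))
```

Keeping the record JSON-serializable is what allows it to be saved, compared across runs, and uploaded alongside the model.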

In [None]:
from Evaluator.reporting import build_evaluation_lineage, generate_evaluation_model_card_section

# Build evaluation configuration dict
eval_config = {
    "temperature": eval_temperature,
    "top_p": eval_top_p,
    "max_tokens": eval_max_tokens,
    "seed": eval_seed,
}

# Capture hardware info
hardware_info = {
    "gpu": torch.cuda.get_device_name(0) if torch.cuda.is_available() else "N/A",
    "gpu_memory_gb": round(torch.cuda.get_device_properties(0).total_memory / 1024**3, 1) if torch.cuda.is_available() else 0,
    "cuda_version": torch.version.cuda if torch.cuda.is_available() else "N/A",
    "platform": "Google Colab",
    "vllm_config": {
        "tensor_parallel_size": tensor_parallel_size,
        "gpu_memory_utilization": gpu_memory_utilization,
        "max_model_len": max_model_len,
    }
}

# Build evaluation lineage
EVALUATION_LINEAGE = build_evaluation_lineage(
    records=all_records,
    model_name=MODEL_NAME,
    test_suites=prompt_files,
    eval_config=eval_config,
    hardware_info=hardware_info,
)

# Generate model card section
MODEL_CARD_EVAL_SECTION = generate_evaluation_model_card_section(EVALUATION_LINEAGE)

# Save lineage to file
lineage_file = f"Evaluator/results/eval_lineage_{model_name_safe}_{EVAL_TIMESTAMP}.json"
with open(lineage_file, 'w') as f:
    json.dump(EVALUATION_LINEAGE, f, indent=2)

print("✓ Evaluation lineage built!")
print(f"  Saved to: {lineage_file}")

# Save to Google Drive if enabled
if save_to_drive and os.path.exists(drive_output_dir):
    drive_lineage_file = f"{drive_output_dir}/eval_lineage_{model_name_safe}_{EVAL_TIMESTAMP}.json"
    with open(drive_lineage_file, 'w') as f:
        json.dump(EVALUATION_LINEAGE, f, indent=2)
    print(f"  Saved to Drive: {drive_lineage_file}")

print()
print("📋 Lineage Summary:")
print(f"  • Model: {MODEL_NAME}")
print(f"  • Test Suites: {', '.join(prompt_files)}")
print(f"  • Pass Rate: {EVALUATION_LINEAGE['results_summary']['overall_pass_rate']}%")
print(f"  • Tests: {EVALUATION_LINEAGE['results_summary']['passed']}/{EVALUATION_LINEAGE['test_config']['total_prompts']} passed")
print(f"  • Avg Latency: {EVALUATION_LINEAGE['performance']['avg_latency_s']}s")

# Show preview of model card section
print("\n" + "=" * 60)
print("MODEL CARD SECTION PREVIEW")
print("=" * 60)
print(MODEL_CARD_EVAL_SECTION[:1500])
if len(MODEL_CARD_EVAL_SECTION) > 1500:
    print("\n... (truncated)")

## 11. Upload Results to HuggingFace (Optional)

**What this does:** Upload evaluation results to your model's HuggingFace repository.

This will upload:
- `evaluation_lineage.json` - Complete evaluation data for programmatic access
- Update the model card with evaluation results section

**Prerequisites:**
- HF_TOKEN with write access in Colab secrets
- Model must already exist on HuggingFace

**Skip this section** if you just want to save results locally or to Google Drive.
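
The model card update boils down to "replace the `## Evaluation Results` section if it exists, otherwise append it". The sketch below isolates that step so it can be tried on a string before touching a real repository. `upsert_evaluation_section` is a hypothetical helper name; the pattern is the same one used in the upload cell.

```python
import re

# Match from the section heading up to the next level-2 heading or end of file
SECTION_PATTERN = re.compile(r"## Evaluation Results.*?(?=\n## |\Z)", flags=re.DOTALL)


def upsert_evaluation_section(readme: str, section: str) -> str:
    """Replace an existing evaluation section, or append one if none exists."""
    block = section.rstrip("\n") + "\n"
    if SECTION_PATTERN.search(readme):
        # A function replacement keeps backslashes in the section literal
        return SECTION_PATTERN.sub(lambda _match: block, readme, count=1)
    return readme.rstrip() + "\n\n" + block


readme = "# Model\n\n## Evaluation Results\nold numbers\n\n## License\nMIT\n"
print(upsert_evaluation_section(readme, "## Evaluation Results\nnew numbers\n"))
```

Passing the section as a plain replacement string would make `re.sub` interpret any backslashes in it as escape sequences, which is why a function is used instead.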

In [None]:
# @title 📤 Upload to HuggingFace
# @markdown Configure and upload evaluation results to your model's HuggingFace repository.

# @markdown ### 🔐 HuggingFace Token
# @markdown Make sure HF_TOKEN is set in Colab secrets (🔑 icon in sidebar).
upload_to_hf = True # @param {type:"boolean"}

# @markdown ### 📦 Repository Settings
# @markdown The repo where results will be uploaded. Should match your model.
hf_repo_id = "professorsynapse/nexus-tools-sft-7b-merged" # @param {type:"string"}

if upload_to_hf:
    from huggingface_hub import HfApi, hf_hub_download
    import tempfile
    
    # Get HF token
    try:
        from google.colab import userdata
        upload_token = userdata.get('HF_TOKEN')
        if not upload_token:
            raise ValueError("HF_TOKEN not found")
    except Exception as e:
        print(f"❌ Could not get HF_TOKEN: {e}")
        print("   Add HF_TOKEN to Colab secrets (🔑 icon in sidebar)")
        upload_to_hf = False

if upload_to_hf:
    api = HfApi()
    
    print(f"📤 Uploading evaluation results to: {hf_repo_id}")
    print()
    
    try:
        # 1. Upload evaluation lineage JSON
        print("1. Uploading evaluation_lineage.json...")
        with tempfile.NamedTemporaryFile(mode='w', suffix='.json', delete=False) as f:
            json.dump(EVALUATION_LINEAGE, f, indent=2)
            temp_lineage_path = f.name
        
        api.upload_file(
            path_or_fileobj=temp_lineage_path,
            path_in_repo="evaluation_lineage.json",
            repo_id=hf_repo_id,
            token=upload_token,
        )
        print("   ✓ evaluation_lineage.json uploaded")
        
        # 2. Try to update the model card with evaluation section
        print("2. Updating model card with evaluation results...")
        try:
            # Download existing README
            readme_path = hf_hub_download(
                repo_id=hf_repo_id,
                filename="README.md",
                token=upload_token,
            )
            with open(readme_path, 'r') as f:
                existing_readme = f.read()
            
            # Check if evaluation section already exists
            if "## Evaluation Results" in existing_readme:
                # Replace existing evaluation section
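                import re
                pattern = r'## Evaluation Results.*?(?=\n## |\Z)'
                # Use a function as the replacement so backslashes in the section are inserted literally
                updated_readme = re.sub(pattern, lambda _match: MODEL_CARD_EVAL_SECTION, existing_readme, flags=re.DOTALL)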
                print("   Replacing existing evaluation section...")
            else:
                # Append evaluation section
                updated_readme = existing_readme.rstrip() + "\n\n" + MODEL_CARD_EVAL_SECTION
                print("   Adding new evaluation section...")
            
            # Upload updated README
            with tempfile.NamedTemporaryFile(mode='w', suffix='.md', delete=False) as f:
                f.write(updated_readme)
                temp_readme_path = f.name
            
            api.upload_file(
                path_or_fileobj=temp_readme_path,
                path_in_repo="README.md",
                repo_id=hf_repo_id,
                token=upload_token,
            )
            print("   ✓ README.md updated with evaluation results")
            
        except Exception as e:
            print(f"   ⚠️  Could not update README: {e}")
            print("   The evaluation_lineage.json was still uploaded successfully.")
        
        print()
        print("=" * 60)
        print("✓ UPLOAD COMPLETE")
        print("=" * 60)
        print(f"\nView your model: https://huggingface.co/{hf_repo_id}")
        print(f"\nUploaded files:")
        print(f"  • evaluation_lineage.json - Full evaluation data")
        print(f"  • README.md - Model card with evaluation section")
        
    except Exception as e:
        print(f"\n❌ Upload failed: {e}")
        print("\nTroubleshooting:")
        print("  • Verify HF_TOKEN has write access")
        print("  • Check that the repository exists")
        print("  • Ensure you have permission to write to the repo")
else:
    print("ℹ️  Upload to HuggingFace skipped")
    print("   Set upload_to_hf = True to enable")

## 12. Quick Test Interface

Test your model with custom prompts interactively.

In [None]:
# @title 🧪 Quick Test Interface
# @markdown Test your model with a custom prompt.

# @markdown ### Enter your test prompt:
test_prompt = "Can you search for all notes that mention 'Claude Code' and show me the results?" # @param {type:"string"}

# @markdown ### Generation settings:
quick_temperature = 0.2 # @param {type:"slider", min:0.0, max:1.0, step:0.1}
quick_max_tokens = 512 # @param {type:"integer"}

print("🤖 Generating response...\n")

# Create message
messages = [{"role": "user", "content": test_prompt}]

# Update client
vllm_client.temperature = quick_temperature
vllm_client.max_tokens = quick_max_tokens

# Generate
response = vllm_client.chat(messages)

print("=" * 60)
print("RESPONSE")
print("=" * 60)
print(response.message)
print()
print(f"⏱️  Latency: {response.latency_s:.2f}s")
print(f"📊 Tokens: {response.raw.get('completion_tokens', 'N/A')}")

# Validate response
from Evaluator.schema_validator import validate_assistant_response

try:
    validation = validate_assistant_response(response.message)
    print(f"\n✓ Validation: {'PASSED' if validation.passed else 'FAILED'}")

    if validation.tool_calls:
        print(f"\n🔧 Tool Calls Detected ({len(validation.tool_calls)}):")
        for tc in validation.tool_calls:
            print(f"   • {tc.name}")
            print(f"     Arguments: {list(tc.arguments.keys())}")

    if validation.issues:
        print(f"\n⚠️  Issues ({len(validation.issues)}):")
        for issue in validation.issues:
            print(f"   • [{issue.level}] {issue.message}")
except Exception as e:
    print(f"\n❌ Validation error: {e}")

## Done!

### What You Have

| Output | Description |
|--------|-------------|
| **JSON Results** | Full test details with pass/fail for each prompt |
| **Markdown Report** | Human-readable summary with tables |
| **Evaluation Lineage** | Complete metadata for reproducibility |
| **Model Card Section** | Auto-generated evaluation results for HuggingFace |
| **Category Breakdown** | Pass rates per tool type |
| **Failure Analysis** | Specific issues identified |

### Lineage Tracking

The evaluation lineage captures:
- Test suites and configurations used
- Pass rates by category
- Failure analysis with specific issues
- Performance metrics (latency, throughput)
- Hardware information (GPU, VRAM, CUDA)
- Full JSON embedded in model card

### Pass Rate Targets

| Suite | Target | Description |
|-------|--------|-------------|
| Full Coverage | 85%+ | Model knows all 47 tools |
| Behavioral Patterns | 75%+ | Context efficiency, executePrompt usage |
| Baseline | 100% | General workflows |
| Tool Combos | 80%+ | Multi-step sequences |
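
The targets in the table can be checked mechanically once per-suite pass rates are known. The sketch below is one way to do that; `TARGETS` mirrors the table, while `check_targets` and the sample pass rates are made up for illustration.

```python
# Minimum pass rates (percent) taken from the table above
TARGETS = {
    "Full Coverage": 85.0,
    "Behavioral Patterns": 75.0,
    "Baseline": 100.0,
    "Tool Combos": 80.0,
}


def check_targets(pass_rates, targets=TARGETS):
    """Return {suite: (rate, target, met)} for every suite that has a target."""
    return {
        suite: (rate, targets[suite], rate >= targets[suite])
        for suite, rate in pass_rates.items()
        if suite in targets
    }


# Hypothetical results from one run
for suite, (rate, target, met) in check_targets({"Full Coverage": 87.2, "Baseline": 83.3}).items():
    print(f"{suite}: {rate:.1f}% (target {target:.0f}%) {'met' if met else 'not met'}")
```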

### Common Issues

- **Missing context object** - Model didn't include required context fields
- **Wrong tool called** - Model used incorrect tool for task
- **Invalid arguments** - Parameters don't match schema
- **No tool call** - Model responded with text instead of tool call

### Next Steps

1. **Review failures** - Check the detailed failure breakdown
2. **Identify patterns** - Are failures concentrated in specific categories?
3. **Retrain if needed** - Use failures to improve training data
4. **Upload results** - Evaluation results auto-populate model cards
5. **Deploy** - Models passing 85%+ are production-ready

---

**Questions?** Check the [Evaluator README](https://github.com/ProfSynapse/Toolset-Training/blob/main/Evaluator/README.md) or open an issue on GitHub.