# vLLM Tool Call Parser Comparison

This notebook demonstrates:
1. **Pydantic-validated tool call models** with schema validation
2. **Three parser implementations** (Regex, Incremental, State Machine)
3. **vLLM comparison** - custom parsers vs native tool parsing
4. **Structured output approaches** - Post-hoc parsing vs Constrained decoding (Outlines, XGrammar)

## Key Features for vLLM Tool Calling
- OpenAI Chat Completions API compatibility
- Incremental/streaming parsing for early tool call detection
- Error recovery from malformed LLM output
- JSON Schema validation
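
To make the incremental parsing idea concrete, here is a minimal sketch of a streaming detector. It assumes Hermes-style `<tool_call>...</tool_call>` tags around a JSON object. The class name `IncrementalToolCallDetector` and its `feed` method are hypothetical and are not the repository's parser.

```python
import json


class IncrementalToolCallDetector:
    """Hypothetical helper: buffers streamed text and emits completed tool calls."""

    OPEN, CLOSE = "<tool_call>", "</tool_call>"

    def __init__(self) -> None:
        self._buffer = ""

    def feed(self, chunk: str) -> list[dict]:
        """Append a streamed chunk and return any tool calls completed by it."""
        self._buffer += chunk
        calls = []
        while True:
            start = self._buffer.find(self.OPEN)
            if start == -1:
                break  # no opening tag seen yet
            end = self._buffer.find(self.CLOSE, start)
            if end == -1:
                break  # tag opened but not closed; wait for more text
            payload = self._buffer[start + len(self.OPEN):end]
            # Drop everything up to and including the closing tag
            self._buffer = self._buffer[end + len(self.CLOSE):]
            try:
                calls.append(json.loads(payload))
            except json.JSONDecodeError:
                pass  # malformed payload is skipped in this sketch
        return calls


detector = IncrementalToolCallDetector()
chunks = [
    "Sure. <tool_",
    'call>{"name": "get_weather", ',
    '"arguments": {"city": "NYC"}}</tool',
    "_call> Done.",
]
for i, chunk in enumerate(chunks):
    for call in detector.feed(chunk):
        print(f"chunk {i}: detected call to {call['name']}")
```

The call is reported as soon as the closing tag arrives, even though the tags are split across chunk boundaries and more text follows.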

## Requirements
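- Google Colab with a GPU runtime (T4 recommended)
- Enough GPU memory for the chosen model: the default 1.5B model fits comfortably on a T4, while 7B models need roughly 14 GB for FP16 weights alone (about 8 GB only with 8-bit quantization)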
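
As a rough sanity check on GPU sizing, weight memory is approximately the parameter count times the bytes per parameter. The sketch below applies that rule of thumb. It ignores the KV cache, activations, and framework overhead, so real usage is higher. The function name is hypothetical and the parameter counts are nominal round numbers.

```python
def estimate_weight_memory_gb(num_params: float, bytes_per_param: float) -> float:
    """Weights-only estimate in GB (1 GB = 1e9 bytes); excludes KV cache and overhead."""
    return num_params * bytes_per_param / 1e9


for label, params in [("1.5B", 1.5e9), ("7B", 7e9)]:
    fp16 = estimate_weight_memory_gb(params, 2)  # 2 bytes per parameter
    int8 = estimate_weight_memory_gb(params, 1)  # 1 byte per parameter
    print(f"{label}: ~{fp16:.1f} GB at FP16, ~{int8:.1f} GB at 8-bit")
```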

## Step 1: Check GPU and Install Dependencies

In [None]:
# Check GPU availability
!nvidia-smi

# Install vLLM and dependencies
!pip install -q vllm openai

## Step 2: Clone the Repository

In [None]:
# Clone the repo
!git clone https://github.com/shravsssss/vLLM-Tool-Call-Parser.git
%cd vLLM-Tool-Call-Parser

# Install requirements
!pip install -q -r requirements.txt

In [None]:
# Pydantic Models for Tool Calls
# These models provide validation compatible with OpenAI Chat Completions API

import json
import re
from typing import Any
from pydantic import BaseModel, Field, field_validator, model_validator

class ToolCall(BaseModel):
    """Pydantic-validated tool call model.
    
    Features:
    - Function name validation (must be valid identifier)
    - Auto-parsing of JSON string arguments
    - OpenAI API format compatibility
    - Required-field check against a JSON Schema
    """
    
    id: str | None = Field(default=None, description="Tool call ID (e.g., call_abc123)")
    name: str = Field(..., min_length=1, max_length=256)
    arguments: dict[str, Any] = Field(default_factory=dict)
    
    @field_validator("name")
    @classmethod
    def validate_function_name(cls, v: str) -> str:
        """Ensure function name is a valid Python identifier."""
        v = v.strip()
        if not re.match(r"^[a-zA-Z_][a-zA-Z0-9_]*$", v):
            raise ValueError(f"Invalid function name '{v}'")
        return v
    
    @field_validator("arguments", mode="before")
    @classmethod  
    def parse_arguments(cls, v: Any) -> dict[str, Any]:
        """Auto-parse JSON string arguments (OpenAI format)."""
        if isinstance(v, str):
            return json.loads(v) if v.strip() else {}
        return v or {}
    
    def to_openai_format(self) -> dict[str, Any]:
        """Convert to OpenAI tool_calls format."""
        return {
            "id": self.id,
            "type": "function", 
            "function": {
                "name": self.name,
                "arguments": json.dumps(self.arguments),
            }
        }
    
    def matches_schema(self, schema: dict[str, Any]) -> bool:
        """Check that every field in the schema's `required` list is present.

        This is a presence check only; argument types and enums are not validated.
        """
        required = schema.get("required", [])
        return all(field in self.arguments for field in required)

# Demo the Pydantic model
print("=" * 60)
print("PYDANTIC TOOL CALL VALIDATION DEMO")
print("=" * 60)

# Valid tool call
call = ToolCall(name="get_weather", arguments={"city": "NYC", "unit": "celsius"})
print(f"\n1. Valid ToolCall: {call}")
print(f"   OpenAI format: {call.to_openai_format()}")

# Auto-parse JSON string (OpenAI API sends arguments as string)
call2 = ToolCall(name="search", arguments='{"query": "python", "limit": 10}')
print(f"\n2. Auto-parsed JSON string: {call2.arguments}")

# Schema validation
schema = {"required": ["city"], "properties": {"city": {"type": "string"}}}
print(f"\n3. Schema validation: {call.matches_schema(schema)}")

# Validation error demo
try:
    bad_call = ToolCall(name="123invalid", arguments={})
except Exception as e:
    print(f"\n4. Validation error (expected): {e}")

print("\n" + "=" * 60)

## Step 2.5: Pydantic Models Demo

This project uses **Pydantic** for robust tool call validation - a key skill for vLLM development.
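
A required-field check is only part of JSON Schema validation. As a sketch of what a slightly fuller check could look like, the function below also compares argument types and `enum` values for flat schemas. The name `check_arguments` is hypothetical, and for complete JSON Schema support a dedicated validator library would be the better choice.

```python
from typing import Any

# Mapping from JSON Schema type names to Python types (flat schemas only)
_JSON_TYPES: dict[str, tuple[type, ...]] = {
    "string": (str,),
    "integer": (int,),
    "number": (int, float),
    "boolean": (bool,),
    "object": (dict,),
    "array": (list,),
}


def check_arguments(arguments: dict[str, Any], schema: dict[str, Any]) -> list[str]:
    """Return a list of problems; an empty list means the arguments passed."""
    errors = []
    for field in schema.get("required", []):
        if field not in arguments:
            errors.append(f"missing required field '{field}'")
    properties = schema.get("properties", {})
    for key, value in arguments.items():
        spec = properties.get(key)
        if spec is None:
            continue  # unknown keys are tolerated in this sketch
        type_name = spec.get("type", "")
        expected = _JSON_TYPES.get(type_name)
        # bool is a subclass of int in Python, so reject it for numeric types
        bool_as_number = isinstance(value, bool) and type_name in ("integer", "number")
        if expected and (not isinstance(value, expected) or bool_as_number):
            errors.append(f"'{key}' should be of type {type_name}")
        if "enum" in spec and value not in spec["enum"]:
            errors.append(f"'{key}' must be one of {spec['enum']}")
    return errors


weather_schema = {
    "required": ["city"],
    "properties": {
        "city": {"type": "string"},
        "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
    },
}
print(check_arguments({"city": "NYC", "unit": "celsius"}, weather_schema))
print(check_arguments({"unit": "kelvin"}, weather_schema))
```

Returning a list of messages rather than a boolean makes it easier to report every problem at once, for example in a retry prompt to the model.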

In [None]:
# If cloning fails (private repo), install directly.
# Quote the specifiers so the shell does not treat ">" as a redirect.
!pip install -q "pydantic>=2.0" "openai>=1.0.0"

# Create necessary directories and files inline
import os
os.makedirs("src/parser_benchmark/models", exist_ok=True)
os.makedirs("src/parser_benchmark/parsers", exist_ok=True)

print("Setup complete! You can either:")
print("1. Clone the repo if it's public")
print("2. Or copy the parser code from the cells below")

## Step 3: Start vLLM Server

Choose your model and parser type. Available parsers:
- `hermes` - For Hermes/Qwen models (XML-wrapped JSON)
- `llama3_json` - For Llama 3.x models
- `mistral` - For Mistral models
- `granite` - For IBM Granite models
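
For orientation, the `hermes` format wraps each call's JSON in `<tool_call>` tags. The sketch below extracts such calls from a finished response with a regular expression. It is a simplified illustration with a hypothetical function name (`extract_hermes_tool_calls`), not vLLM's parser or the repository's regex parser.

```python
import json
import re

# Non-greedy match so several tool calls in one response are found separately
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(.*?)\s*</tool_call>", re.DOTALL)


def extract_hermes_tool_calls(text: str) -> list[dict]:
    """Return a list of {"name", "arguments"} dicts found in Hermes-style output."""
    calls = []
    for match in TOOL_CALL_RE.finditer(text):
        try:
            data = json.loads(match.group(1))
        except json.JSONDecodeError:
            continue  # skip payloads that are not valid JSON
        if isinstance(data, dict) and "name" in data:
            calls.append({"name": data["name"], "arguments": data.get("arguments", {})})
    return calls


sample = (
    "Let me check.\n"
    '<tool_call>\n{"name": "get_weather", "arguments": {"city": "Paris"}}\n</tool_call>'
)
print(extract_hermes_tool_calls(sample))
```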

In [None]:
# Configuration
MODEL = "Qwen/Qwen2.5-1.5B-Instruct"  # Small model for Colab free tier
# MODEL = "Qwen/Qwen2.5-7B-Instruct"  # Larger model (needs more GPU RAM)

PARSER = "hermes"  # Options: hermes, llama3_json, mistral, granite

print(f"Model: {MODEL}")
print(f"Parser: {PARSER}")

In [None]:
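import subprocess
import sys
import time

import requests

# Start the vLLM server in the background. Logs go to a file rather than a
# pipe: an unread pipe can fill up and stall the server, and reading it while
# the server is still running would block.
LOG_PATH = "vllm_server.log"
with open(LOG_PATH, "w") as log_file:
    vllm_process = subprocess.Popen(
        [
            sys.executable, "-m", "vllm.entrypoints.openai.api_server",
            "--model", MODEL,
            "--tool-call-parser", PARSER,
            "--enable-auto-tool-choice",
            "--port", "8000",
            "--max-model-len", "4096",
        ],
        stdout=log_file,
        stderr=subprocess.STDOUT,
    )

print("Starting vLLM server...")
print("This may take 1-2 minutes to load the model.")

# Poll the health endpoint until the server is ready (up to about 2 minutes)
ready = False
for i in range(120):
    if vllm_process.poll() is not None:
        print(f"vLLM exited early with code {vllm_process.returncode}.")
        break
    try:
        response = requests.get("http://localhost:8000/health", timeout=5)
        if response.status_code == 200:
            print(f"\nvLLM server is ready! (took about {i + 1} seconds)")
            ready = True
            break
    except requests.exceptions.RequestException:
        pass  # Server is not accepting connections yet
    time.sleep(1)
    if i > 0 and i % 10 == 0:
        print(f"Still loading... ({i}s)")

if not ready:
    print("Server did not become ready. Last log output:")
    with open(LOG_PATH) as f:
        print(f.read()[-2000:])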

## Step 4: Test the Server

In [None]:
from openai import OpenAI

# Connect to vLLM
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Test with a simple tool call
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get current weather for a location",
            "parameters": {
                "type": "object",
                "properties": {
                    "city": {"type": "string", "description": "City name"},
                    "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
                },
                "required": ["city"],
            },
        },
    }
]

response = client.chat.completions.create(
    model=client.models.list().data[0].id,
    messages=[{"role": "user", "content": "What's the weather in San Francisco?"}],
    tools=tools,
    tool_choice="auto",
)

print("Test Response:")
print(f"Content: {response.choices[0].message.content}")
print(f"Tool Calls: {response.choices[0].message.tool_calls}")

## Step 5: Run Full Comparison

This runs comprehensive tests comparing your custom parsers against vLLM's native parsing.
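
Error recovery in this context means salvaging a tool call from output that is not quite valid JSON. One possible strategy, sketched below under the assumption that the most common defects are truncated output and trailing commas, is to repair the text and retry parsing. `repair_json` is a hypothetical helper, not the repository's implementation.

```python
import json
import re
from typing import Any


def repair_json(text: str) -> Any:
    """Parse JSON, attempting simple repairs first; return None if that fails."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        pass

    fixed = text.strip()
    # Track open brackets and string state to find what is left unclosed
    stack = []
    in_string = False
    escaped = False
    for ch in fixed:
        if in_string:
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif ch in "{[":
            stack.append("}" if ch == "{" else "]")
        elif ch in "}]" and stack:
            stack.pop()
    if in_string:
        fixed += '"'  # close a string that was cut off
    fixed += "".join(reversed(stack))  # close brackets innermost first
    # Remove trailing commas before a closing bracket.
    # Caveat: this regex would also alter a ",}" sequence inside a string value.
    fixed = re.sub(r",\s*([}\]])", r"\1", fixed)
    try:
        return json.loads(fixed)
    except json.JSONDecodeError:
        return None


print(repair_json('{"name": "get_weather", "arguments": {"city": "NYC",}}'))
print(repair_json('{"name": "search", "arguments": {"query": "pyth'))
print(repair_json("not json at all"))
```

Repairs like these are heuristics: a truncated string is closed wherever it stopped, so the recovered arguments should still be validated before a tool is executed.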

In [None]:
import sys
sys.path.insert(0, 'src')

from parser_benchmark.vllm_comparison import (
    VLLMConfig,
    VLLMComparisonRunner,
    DEFAULT_TEST_PROMPTS,
)

# Create config and runner
config = VLLMConfig.local(port=8000, parser=PARSER)
runner = VLLMComparisonRunner(config)

# Check connection
connected, message = runner.check_connection()
print(message)

if not connected:
    print("Error: vLLM server not reachable. Check the server logs.")

In [None]:
# Run full comparison
print("Running full comparison...")
print(f"Testing with {len(DEFAULT_TEST_PROMPTS)} prompts")

report = runner.run_full_comparison(
    prompts=DEFAULT_TEST_PROMPTS,
    test_streaming=True,
    test_error_recovery=True,
)

print("\n" + "="*60)
print("COMPARISON COMPLETE")
print("="*60)

## Step 6: View Results

In [None]:
# Print summary
print("\n" + "="*60)
print("ACCURACY COMPARISON")
print("="*60)
print(f"\n{'Parser':<25} {'Accuracy':>10}")
print("-" * 40)
print(f"{'vLLM Native':<25} {report.vllm_accuracy:>9.1f}%")
for name, acc in report.accuracy_scores.items():
    print(f"{name:<25} {acc:>9.1f}%")

print("\n" + "="*60)
print("LATENCY COMPARISON (parsing only)")
print("="*60)
print(f"\n{'Parser':<25} {'Avg Latency':>12}")
print("-" * 40)
print(f"{'vLLM Native':<25} {report.vllm_avg_latency_ms:>10.2f} ms")
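for name, lat in report.avg_latency_ms.items():
    if lat > 0:
        # Ratio above 1 means the custom parser was faster than vLLM; below 1, slower
        ratio = report.vllm_avg_latency_ms / lat
        print(f"{name:<25} {lat:>10.2f} ms ({ratio:.1f}x vs vLLM)")
    else:
        print(f"{name:<25} {lat:>10.2f} ms")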

if report.streaming_advantage:
    sa = report.streaming_advantage
    print("\n" + "="*60)
    print("STREAMING ADVANTAGE")
    print("="*60)
    print(f"\nvLLM waits for complete response: {sa.vllm_total_time_ms:.1f} ms")
    print(f"Incremental parser first detection: {sa.incremental_first_call_ms:.1f} ms")
    print(f"\nAdvantage: {sa.advantage_ms:.1f} ms ({sa.advantage_percent:.1f}% earlier detection)")

print("\n" + "="*60)
print("ERROR RECOVERY WINS")
print("="*60)
for parser, wins in report.error_recovery_summary.items():
    print(f"  {parser}: {wins} cases")

## Step 7: Save Results for Dashboard

In [None]:
from datetime import datetime

# Save results to JSON
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
output_file = f"vllm_comparison_{timestamp}.json"

runner.save_results(report, output_file)
print(f"Results saved to: {output_file}")

In [None]:
# Download the file
from google.colab import files
files.download(output_file)

print("\n" + "="*60)
print("NEXT STEPS")
print("="*60)
print("""  
1. Download the JSON file (should start automatically)
2. Go to the HuggingFace Space dashboard
3. Click on 'vLLM Comparison' tab
4. Upload the JSON file
5. See your comparison charts!
""")

## Optional: Test with Different Parsers

Restart vLLM with a different `--tool-call-parser` to compare.

## Summary: Skills Demonstrated

This notebook demonstrates key skills for vLLM tool calling development:

| Skill | Demonstrated |
|-------|--------------|
| **Python + Pydantic** | ToolCall model with validators |
| **OpenAI API Compatibility** | `to_openai_format()`, Chat Completions |
| **Incremental Parsing** | Early tool call detection in streaming |
| **Outlines/XGrammar Knowledge** | Comparison with constrained decoding |
| **vLLM Tool Parsers** | Testing hermes, llama3_json, mistral |
| **Error Recovery** | Handling malformed LLM output |
| **JSON Schema Validation** | `matches_schema()` method (required-field check) |

**Links:**
- [HuggingFace Dashboard](https://huggingface.co/spaces/sravyayepuri/tool-call-parser-benchmark)
- [Outlines](https://github.com/outlines-dev/outlines)
- [XGrammar](https://github.com/mlc-ai/xgrammar)
- [vLLM Guided Decoding](https://docs.vllm.ai/en/latest/features/structured_outputs.html)

In [None]:
# vLLM Guided Decoding Example (Constrained Decoding)
# This shows how vLLM can use BOTH approaches

print("""
vLLM SUPPORTS BOTH APPROACHES:

1. POST-HOC PARSING (what we tested above):
   python -m vllm.entrypoints.openai.api_server \\
       --model Qwen/Qwen2.5-7B-Instruct \\
       --tool-call-parser hermes \\
       --enable-auto-tool-choice

2. CONSTRAINED DECODING (guaranteed valid JSON):
""")

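# Example of vLLM guided decoding (requires the vLLM Python API).
# Parameter names vary across vLLM versions: this sketch uses GuidedDecodingParams,
# while newer releases expose the same feature as "structured outputs".
# Check the documentation for your installed version.
guided_decoding_example = '''
from vllm import LLM, SamplingParams
from vllm.sampling_params import GuidedDecodingParams

llm = LLM(model="Qwen/Qwen2.5-7B-Instruct")

# Define JSON Schema for tool calls
tool_schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "arguments": {"type": "object"}
    },
    "required": ["name"]
}

# Guided decoding constrains sampling so the output follows the schema.
# The backend ("xgrammar" or "outlines") is selected at the engine level.
sampling_params = SamplingParams(
    guided_decoding=GuidedDecodingParams(json=tool_schema),
    max_tokens=256,
)

outputs = llm.generate(["Call a function to get weather"], sampling_params)
# outputs[0].outputs[0].text follows the schema, unless max_tokens cuts it off
'''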

print(guided_decoding_example)

print("""
HYBRID APPROACH (Recommended for Production):
  1. Use guided decoding to ensure valid JSON structure
  2. Use custom parser to extract and validate against app schema
  3. Best of both worlds: guaranteed structure + fast validation
""")

In [None]:
# Structured Output Approaches Comparison

print("""
╔══════════════════════════════════════════════════════════════════════════════╗
║              STRUCTURED OUTPUT: TWO APPROACHES                                ║
╠══════════════════════════════════════════════════════════════════════════════╣
║                                                                               ║
║  1. POST-HOC PARSING (This Project)                                           ║
║     ├─ Parse LLM output AFTER generation                                      ║
║     ├─ Works with ANY LLM API (OpenAI, Anthropic, vLLM)                       ║
║     ├─ Supports streaming with early detection                                ║
║     ├─ Can recover from malformed output                                      ║
║     └─ Zero generation overhead                                               ║
║                                                                               ║
║  2. CONSTRAINED DECODING (Outlines, XGrammar)                                 ║
║     ├─ Guide generation at LOGIT level                                        ║
║     ├─ Mask invalid tokens during sampling                                    ║
║     ├─ 100% guaranteed valid output                                           ║
║     ├─ Requires inference engine integration                                  ║
║     └─ 2-15% generation overhead                                              ║
║                                                                               ║
╚══════════════════════════════════════════════════════════════════════════════╝

COMPARISON TABLE:
""")

comparison = """
┌────────────────────────┬───────────────────────┬─────────────────────────────┐
│ Aspect                 │ Post-hoc Parsing      │ Constrained Decoding        │
├────────────────────────┼───────────────────────┼─────────────────────────────┤
│ When it runs           │ After generation      │ During generation           │
│ Guarantees valid JSON  │ No (recovers errors)  │ Yes (100%)                  │
│ Latency overhead       │ <0.1ms                │ 2-15% generation time       │
│ Works with any LLM     │ Yes                   │ Requires integration        │
│ Streaming support      │ Yes (early detection) │ Limited                     │
│ Error recovery         │ Yes                   │ N/A (no errors)             │
└────────────────────────┴───────────────────────┴─────────────────────────────┘
"""
print(comparison)

print("""
KEY LIBRARIES:

OUTLINES (github.com/outlines-dev/outlines):
  - Uses Finite State Machines from JSON Schema
  - Masks invalid logits during sampling
  - Example:
    generator = outlines.generate.json(model, schema)
    result = generator(prompt)  # Always valid JSON

XGRAMMAR (github.com/mlc-ai/xgrammar):
  - Compiles grammars to token masks
  - Optimized for batch inference
  - Supports JSON Schema, regex, EBNF

vLLM INTEGRATION:
  - Post-hoc: --tool-call-parser hermes/llama3_json/mistral
  - Constrained: guided_decoding_backend="outlines" or "xgrammar"
""")

## Structured Output: Parsing vs Constrained Decoding

This section compares two approaches to structured LLM output:

| Approach | Description | Libraries |
|----------|-------------|-----------|
| **Post-hoc Parsing** | Parse output after generation | This project, vLLM parsers |
| **Constrained Decoding** | Guide generation at logit level | Outlines, XGrammar, Guidance |
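
To illustrate what guiding generation at the logit level means, here is a toy sketch in plain Python. It is not how Outlines or XGrammar are implemented; the vocabulary, the state table, and the scores are all made up for illustration. At each step, tokens that the grammar disallows get a score of negative infinity, so greedy selection can only pick a valid token.

```python
import math

# Toy vocabulary; a real tokenizer has tens of thousands of entries
VOCAB = ["{", "}", '"name"', ":", '"get_weather"', "hello"]

# Hand-written state machine accepting exactly {"name":"get_weather"}
# state -> {allowed token: next state}
GRAMMAR = {
    0: {"{": 1},
    1: {'"name"': 2},
    2: {":": 3},
    3: {'"get_weather"': 4},
    4: {"}": 5},
}
FINAL_STATE = 5


def mask_logits(logits: list[float], state: int) -> list[float]:
    """Set the score of every token the grammar disallows to -inf."""
    allowed = GRAMMAR[state]
    return [s if tok in allowed else -math.inf for tok, s in zip(VOCAB, logits)]


def constrained_greedy_decode(logits_per_step: list[list[float]]) -> str:
    """Greedy decoding where each step only considers grammar-valid tokens."""
    state, output = 0, []
    for logits in logits_per_step:
        if state == FINAL_STATE:
            break
        masked = mask_logits(logits, state)
        best = max(range(len(VOCAB)), key=lambda i: masked[i])
        token = VOCAB[best]
        output.append(token)
        state = GRAMMAR[state][token]
    return "".join(output)


# Fake model scores that always prefer the invalid token "hello"
fake_logits = [[0.1, 0.2, 0.3, 0.1, 0.2, 5.0]] * 5
print(constrained_greedy_decode(fake_logits))
```

Even though the fake model always scores `hello` highest, the mask forces a well-formed result. A post-hoc parser would instead receive the unconstrained text and have to detect or repair the problem afterwards.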

In [None]:
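def restart_vllm_with_parser(parser_name: str) -> bool:
    """Restart vLLM with a different parser; return True once it is healthy."""
    global vllm_process, PARSER
    
    # Stop the existing server, if any
    if vllm_process:
        vllm_process.terminate()
        vllm_process.wait()
        time.sleep(2)
    
    PARSER = parser_name
    
    # Start a new server; append logs to a file so an unread pipe cannot fill up
    with open("vllm_server.log", "a") as log_file:
        vllm_process = subprocess.Popen(
            [
                "python", "-m", "vllm.entrypoints.openai.api_server",
                "--model", MODEL,
                "--tool-call-parser", parser_name,
                "--enable-auto-tool-choice",
                "--port", "8000",
                "--max-model-len", "4096",
            ],
            stdout=log_file,
            stderr=subprocess.STDOUT,
        )
    
    print(f"Restarting with {parser_name} parser...")
    
    # Poll the health endpoint until the server is ready (up to about 2 minutes)
    for i in range(120):
        if vllm_process.poll() is not None:
            print(f"vLLM exited early with code {vllm_process.returncode}; see vllm_server.log")
            return False
        try:
            response = requests.get("http://localhost:8000/health", timeout=5)
            if response.status_code == 200:
                print(f"Ready with {parser_name} parser!")
                return True
        except requests.exceptions.RequestException:
            pass  # Server is not accepting connections yet
        time.sleep(1)
    
    print("Timeout!")
    return False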

# Example: Test with llama3_json parser
# restart_vllm_with_parser("llama3_json")

## Cleanup

In [None]:
# Stop vLLM server
if vllm_process:
    vllm_process.terminate()
    vllm_process.wait()
    print("vLLM server stopped.")