GatewayBench


Synthetic benchmark dataset generator for evaluating LLM gateway systems.

GatewayBench generates high-quality test cases with ground truth labels for evaluating:

  • Tool selection from large tool sets
  • Information retrieval scenarios
  • Multi-turn conversations
  • Stress testing with complex schemas

Features

  • High Performance: async generation runs 10-20x faster than sequential API calls
  • Comprehensive Testing: 80%+ test coverage with mocked APIs
  • Ground Truth Labels: Each example includes relevance labels for evaluation
  • Four Task Types: Tool-heavy, retrieval, chat, and stress scenarios
  • Built-in Analytics: Dataset statistics and analysis tools
  • Type Safe: Full Pydantic models and type hints
  • CI/CD Ready: GitHub Actions workflow included

Quick Start

Installation

# Basic installation
pip install gatewaybench

# Development installation with test dependencies
git clone https://github.com/ModaLabs/GatewayBench.git
cd GatewayBench
pip install -e .[dev]

Requirements: Python 3.8+

Generate Your First Dataset

# Set your OpenAI API key
export OPENAI_API_KEY='your-api-key-here'

# Generate a dataset (100 examples by default)
python scripts/generate_dataset.py

# Generate more examples
NUM_EXAMPLES=2000 CONCURRENCY=20 python scripts/generate_dataset.py

# Or use the CLI for custom generation
gatewaybench generate --num 1500 --output data/raw/dataset.jsonl --concurrency 20

Validate and Analyze

# Validate your dataset
gatewaybench validate data/raw/dataset.jsonl

# Show comprehensive statistics
gatewaybench stats data/raw/dataset.jsonl

Task Types

GatewayBench generates four types of test scenarios:

Task Type    Tool Count   Required Tools   Use Case
tool-heavy   20-80        1-4              Testing tool selection in large sets
retrieval    5-20         1+               Information retrieval scenarios
chat         0            0                Pure conversation without tools
stress       60-100       1-10             Stress testing with complex schemas

Programmatic Usage

Basic Generation

import asyncio
from gatewaybench import generate_batch_async, validate_example

async def main():
    # Generate 100 examples
    examples = await generate_batch_async(
        num_examples=100,
        task_distribution={
            "tool-heavy": 0.3,
            "retrieval": 0.3,
            "chat": 0.25,
            "stress": 0.15
        },
        concurrency=20
    )
    
    # Validate all examples
    for example in examples:
        validate_example(example)
    
    print(f"Generated {len(examples)} valid examples")

asyncio.run(main())
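
If you want to persist the results yourself, here is a minimal sketch of writing a batch to JSONL. It assumes each generated example is a JSON-serializable dict; if your version returns Pydantic models, convert them first (e.g. with model_dump()):

import json

def save_jsonl(examples, path):
    # Write one JSON object per line, matching the data/raw/*.jsonl layout
    with open(path, "w", encoding="utf-8") as f:
        for example in examples:
            f.write(json.dumps(example, ensure_ascii=False) + "\n")

# e.g. call save_jsonl(examples, "data/raw/dataset.jsonl") from main() above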

Using Configuration Objects

from gatewaybench import GenerationConfig, generate_batch_async
import asyncio

config = GenerationConfig(
    openai_api_key="your-key",
    num_examples=500,
    concurrency=30,
    deterministic=True,
    random_seed=42,
    output_path="data/my_dataset.jsonl"
)

examples = asyncio.run(generate_batch_async(
    num_examples=config.num_examples,
    task_distribution=config.task_distribution,
    api_key=config.openai_api_key,
    concurrency=config.concurrency
))

Working with Type-Safe Models

from gatewaybench import Example, TaskType, RelevanceLabel

# Load and validate with Pydantic
example_dict = {...}  # Your example data
example = Example.from_dict(example_dict)

# Type-safe access
print(example.task_type)  # TaskType enum
print(example.metadata.difficulty)  # int (1-5)

# Get required tools
required_tools = example.get_required_tools()

Data Schema

Each example is a JSON object with:

{
  "id": "uuid-string",
  "user_prompt": "User query or request",
  "conversation_history": [],
  "task_type": "tool-heavy|retrieval|chat|stress",
  "relevance_by_name": {
    "tool_name": "required|useful|irrelevant"
  },
  "ideal_tool_subset": ["required_tool_1", "required_tool_2"],
  "reference_answer": "Expected answer using ideal tools",
  "model_pool": [...],
  "required_capabilities": [],
  "metadata": {
    "domain": "analytics|support|coding|...",
    "difficulty": 1-5,
    "source": "synthetic",
    "version": "1.0.0"
  }
}

Note: For chat tasks, relevance_by_name and ideal_tool_subset are omitted.

See docs/SCHEMA_SPEC.md for complete schema documentation.
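
As a quick orientation, the sketch below reads a generated JSONL file with the standard library and runs a plausible consistency check on the ground-truth labels. Field names follow the schema above; the check itself is illustrative and may need adjusting to your labeling conventions:

import json

with open("data/raw/dataset.jsonl", encoding="utf-8") as f:
    examples = [json.loads(line) for line in f]

for ex in examples:
    # Chat tasks omit the tool-labeling fields, so fall back to empty containers
    labels = ex.get("relevance_by_name", {})
    ideal = ex.get("ideal_tool_subset", [])
    # Assumed invariant: every tool in the ideal subset carries the "required" label
    if any(labels.get(name) != "required" for name in ideal):
        print("Unexpected label in example", ex["id"])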

Performance

GatewayBench uses asynchronous API calls for dramatically faster generation:

Examples   Sequential    Async (20 workers)   Speedup
100        ~2-3 min      ~10-20 sec           10-15x
500        ~10-15 min    ~1-2 min             10-15x
1500       ~20-30 min    ~2-3 min             10-20x

Key optimizations:

  • ✅ Concurrent API calls with configurable workers
  • ✅ Cached OpenAI clients and prompts
  • ✅ Efficient batch processing and I/O
  • ✅ Retry logic with exponential backoff

Adjust concurrency based on your API rate limits:

gatewaybench generate --num 1000 --concurrency 30
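
For intuition, here is a minimal sketch of the general pattern behind these numbers (bounded concurrency plus exponential backoff with jitter). It is illustrative only and does not reproduce GatewayBench's internal implementation; the helper names are made up:

import asyncio
import random

async def call_with_retry(make_call, max_attempts=5, base_delay=1.0):
    # Retry an async call with exponential backoff plus jitter
    for attempt in range(max_attempts):
        try:
            return await make_call()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            await asyncio.sleep(base_delay * 2 ** attempt + random.random())

async def run_bounded(factories, concurrency=20):
    # A semaphore caps the number of API calls in flight at once
    sem = asyncio.Semaphore(concurrency)

    async def guarded(factory):
        async with sem:
            return await call_with_retry(factory)

    return await asyncio.gather(*(guarded(f) for f in factories))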

Development

Running Tests

# Run all tests
make test

# Run with coverage report
make test-verbose

# Run specific test file
pytest tests/test_validator.py -v

Code Quality

# Format code
make format

# Run linting
make lint

# Type checking
make type-check

# Security audit
make security

# Run all checks
make check-all

Project Structure

gatewaybench/
├── core/           # Core generation logic
│   ├── generator.py
│   ├── validator.py
│   ├── tool_generator.py
│   ├── config.py
│   └── types.py
├── utils/          # Utilities
│   ├── caching.py
│   ├── logging.py
│   └── stats.py
├── prompts/        # LLM prompts
└── cli.py          # Command-line interface

tests/              # Comprehensive test suite
scripts/            # Utility scripts
docs/               # Documentation

Using the Dataset

GatewayBench v1 is available as a ready-to-use dataset on Hugging Face:

from datasets import load_dataset

# Load the dataset (2,000 examples)
ds = load_dataset("ModaLabs/gatewaybench-v1", split="train")

# Access an example
print(ds[0])

# Filter by task type
tool_heavy = [ex for ex in ds if ex["task_type"] == "tool-heavy"]
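
For a quick look at the dataset's composition, a small follow-up using collections.Counter (field names follow the schema above; purely illustrative):

from collections import Counter

# Task-type and difficulty distributions
print(Counter(ex["task_type"] for ex in ds))
print(Counter(ex["metadata"]["difficulty"] for ex in ds))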

See docs/DATASET_CARD.md for full dataset documentation.


Contributing

Contributions are welcome! To set up a development environment and submit a change:

  1. Fork the repository
  2. Create a feature branch
  3. Install development dependencies: pip install -e .[dev]
  4. Make your changes with tests
  5. Run the test suite: make test
  6. Submit a pull request

Citation

If you use GatewayBench in your research, please cite:

@misc{bedi2025gatewaybench,
  title={GatewayBench: Evaluating LLM Gateways -- A Benchmark and Measurement Study of Cost/Latency/Quality Tradeoffs},
  author={Bedi, Pranav and Al Rasheed, Mohammed and Al Ani, Mazin},
  year={2025},
  url={https://github.com/ModaLabs/GatewayBench}
}

License

MIT License - see LICENSE for details.
