Synthetic benchmark dataset generator for evaluating LLM gateway systems.
GatewayBench generates high-quality test cases with ground truth labels for evaluating:
- Tool selection from large tool sets
- Information retrieval scenarios
- Multi-turn conversations
- Stress testing with complex schemas
Key features:
- High Performance: 10-20x faster than sequential generation via async API calls
- Comprehensive Testing: 80%+ test coverage with mocked APIs
- Ground Truth Labels: Each example includes relevance labels for evaluation
- Four Task Types: Tool-heavy, retrieval, chat, and stress scenarios
- Built-in Analytics: Dataset statistics and analysis tools
- Type Safe: Full Pydantic models and type hints
- CI/CD Ready: GitHub Actions workflow included
# Basic installation
pip install gatewaybench
# Development installation with test dependencies
git clone https://github.com/ModaLabs/GatewayBench.git
cd GatewayBench
pip install -e .[dev]
Requirements: Python 3.8+
# Set your OpenAI API key
export OPENAI_API_KEY='your-api-key-here'
# Generate a dataset (100 examples by default)
python scripts/generate_dataset.py
# Generate more examples
NUM_EXAMPLES=2000 CONCURRENCY=20 python scripts/generate_dataset.py
# Or use the CLI for custom generation
gatewaybench generate --num 1500 --output data/raw/dataset.jsonl --concurrency 20
# Validate your dataset
gatewaybench validate data/raw/dataset.jsonl
# Show comprehensive statistics
gatewaybench stats data/raw/dataset.jsonl
GatewayBench generates four types of test scenarios:
| Task Type | Tool Count | Required Tools | Use Case |
|---|---|---|---|
| tool-heavy | 20-80 | 1-4 | Testing tool selection in large sets |
| retrieval | 5-20 | 1+ | Information retrieval scenarios |
| chat | 0 | 0 | Pure conversation without tools |
| stress | 60-100 | 1-10 | Stress testing with complex schemas |
import asyncio
from gatewaybench import generate_batch_async, validate_example
async def main():
    # Generate 100 examples
    examples = await generate_batch_async(
        num_examples=100,
        task_distribution={
            "tool-heavy": 0.3,
            "retrieval": 0.3,
            "chat": 0.25,
            "stress": 0.15
        },
        concurrency=20
    )
    # Validate all examples
    for example in examples:
        validate_example(example)
    print(f"Generated {len(examples)} valid examples")
asyncio.run(main())
# Advanced configuration with GenerationConfig
from gatewaybench import GenerationConfig, generate_batch_async
import asyncio
config = GenerationConfig(
    openai_api_key="your-key",
    num_examples=500,
    concurrency=30,
    deterministic=True,
    random_seed=42,
    output_path="data/my_dataset.jsonl"
)
examples = asyncio.run(generate_batch_async(
    num_examples=config.num_examples,
    task_distribution=config.task_distribution,
    api_key=config.openai_api_key,
    concurrency=config.concurrency
))
from gatewaybench import Example, TaskType, RelevanceLabel
# Load and validate with Pydantic
example_dict = {...} # Your example data
example = Example.from_dict(example_dict)
# Type-safe access
print(example.task_type) # TaskType enum
print(example.metadata.difficulty) # int (1-5)
# Get required tools
required_tools = example.get_required_tools()
Each example is a JSON object with:
{
  "id": "uuid-string",
  "user_prompt": "User query or request",
  "conversation_history": [],
  "task_type": "tool-heavy|retrieval|chat|stress",
  "relevance_by_name": {
    "tool_name": "required|useful|irrelevant"
  },
  "ideal_tool_subset": ["required_tool_1", "required_tool_2"],
  "reference_answer": "Expected answer using ideal tools",
  "model_pool": [...],
  "required_capabilities": [],
  "metadata": {
    "domain": "analytics|support|coding|...",
    "difficulty": 1-5,
    "source": "synthetic",
    "version": "1.0.0"
  }
}
Note: For chat tasks, relevance_by_name and ideal_tool_subset are omitted.
See docs/SCHEMA_SPEC.md for complete schema documentation.
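To make the ground-truth fields concrete, here is a minimal evaluation sketch: it loads a generated JSONL file and scores a gateway's tool choice against ideal_tool_subset. The tool_selection_scores helper, the file path, and the selected-tools input are illustrative only and not part of the GatewayBench API; only Example.from_dict and the schema fields above come from the library.
import json
from gatewaybench import Example
def tool_selection_scores(example, selected_tools):
    # Compare a gateway's chosen tools against the ground-truth ideal_tool_subset.
    # Assumes the Pydantic model exposes ideal_tool_subset (absent for chat tasks).
    ideal = set(getattr(example, "ideal_tool_subset", None) or [])
    chosen = set(selected_tools)
    if not ideal:
        return 1.0, 1.0  # chat tasks: there is nothing to select
    hits = len(ideal & chosen)
    precision = hits / len(chosen) if chosen else 0.0
    recall = hits / len(ideal)
    return precision, recall
# Hypothetical path; point this at your generated file.
with open("data/raw/dataset.jsonl") as f:
    examples = [Example.from_dict(json.loads(line)) for line in f]
p, r = tool_selection_scores(examples[0], {"required_tool_1"})
print(f"precision={p:.2f} recall={r:.2f}")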
GatewayBench uses asynchronous API calls for dramatically faster generation:
| Examples | Sequential | Async (20 workers) | Speedup |
|---|---|---|---|
| 100 | ~2-3 min | ~10-20 sec | 10-15x |
| 500 | ~10-15 min | ~1-2 min | 10-15x |
| 1500 | ~20-30 min | ~2-3 min | 10-20x |
Key optimizations:
- ✅ Concurrent API calls with configurable workers
- ✅ Cached OpenAI clients and prompts
- ✅ Efficient batch processing and I/O
- ✅ Retry logic with exponential backoff
Adjust concurrency based on your API rate limits:
gatewaybench generate --num 1000 --concurrency 30
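If you wrap your own API calls in a similar pipeline, the pattern behind these optimizations is semaphore-bounded concurrency plus exponential-backoff retries. The sketch below is a generic illustration, not GatewayBench internals; with_retries and bounded_gather are hypothetical helpers, and coro_factories stands in for whatever async calls you are making.
import asyncio
import random
async def with_retries(coro_factory, max_attempts=5, base_delay=1.0):
    # Exponential backoff: sleep 1s, 2s, 4s, ... plus jitter between attempts.
    for attempt in range(max_attempts):
        try:
            return await coro_factory()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            await asyncio.sleep(base_delay * (2 ** attempt) + random.random())
async def bounded_gather(coro_factories, concurrency=20):
    # A semaphore caps in-flight calls so the concurrency setting matches your rate limits.
    sem = asyncio.Semaphore(concurrency)
    async def run_one(factory):
        async with sem:
            return await with_retries(factory)
    return await asyncio.gather(*(run_one(f) for f in coro_factories))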
# Run all tests
make test
# Run with coverage report
make test-verbose
# Run specific test file
pytest tests/test_validator.py -v
# Format code
make format
# Run linting
make lint
# Type checking
make type-check
# Security audit
make security
# Run all checks
make check-all
gatewaybench/
├── core/ # Core generation logic
│ ├── generator.py
│ ├── validator.py
│ ├── tool_generator.py
│ ├── config.py
│ └── types.py
├── utils/ # Utilities
│ ├── caching.py
│ ├── logging.py
│ └── stats.py
├── prompts/ # LLM prompts
└── cli.py # Command-line interface
tests/ # Comprehensive test suite
scripts/ # Utility scripts
docs/ # Documentation
GatewayBench v1 is available as a ready-to-use dataset on Hugging Face:
from datasets import load_dataset
# Load the dataset (2,000 examples)
ds = load_dataset("ModaLabs/gatewaybench-v1", split="train")
# Access an example
print(ds[0])
# Filter by task type
tool_heavy = [ex for ex in ds if ex["task_type"] == "tool-heavy"]
See docs/DATASET_CARD.md for full dataset documentation.
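As a quick sanity check on the published split, something like the following tallies examples per task type; it relies only on the load_dataset call above and the task_type field from the schema.
from collections import Counter
from datasets import load_dataset
ds = load_dataset("ModaLabs/gatewaybench-v1", split="train")
counts = Counter(ex["task_type"] for ex in ds)
for task_type, n in counts.most_common():
    print(f"{task_type:12s} {n}")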
- docs/API.md - Complete API documentation
- docs/SCHEMA_SPEC.md - Data schema specification
- docs/CONFIG.md - Configuration guide
- docs/DATASET_CARD.md - Dataset card and usage
- PROJECT_DOCUMENTATION.md - Technical overview
Contributions are welcome! Please see our development setup:
- Fork the repository
- Create a feature branch
- Install development dependencies: pip install -e .[dev]
- Make your changes with tests
- Run the test suite: make test
- Submit a pull request
If you use GatewayBench in your research, please cite:
@misc{bedi2025gatewaybench,
  title={GatewayBench: Evaluating LLM Gateways -- A Benchmark and Measurement Study of Cost/Latency/Quality Tradeoffs},
  author={Bedi, Pranav and Al Rasheed, Mohammed and Al Ani, Mazin},
  year={2025},
  url={https://github.com/ModaLabs/GatewayBench}
}
MIT License - see LICENSE for details.