Synthetic benchmark dataset generator for evaluating LLM gateway systems.
GatewayBench generates high-quality test cases with ground truth labels for evaluating:
- Tool selection from large tool sets
- Information retrieval scenarios
- Multi-turn conversations
- Stress testing with complex schemas
Key features:
- High Performance: 10-20x faster than sequential generation via async API calls
- Comprehensive Testing: 80%+ test coverage with mocked APIs
- Ground Truth Labels: Each example includes relevance labels for evaluation
- Four Task Types: Tool-heavy, retrieval, chat, and stress scenarios
- Built-in Analytics: Dataset statistics and analysis tools
- Type Safe: Full Pydantic models and type hints
- CI/CD Ready: GitHub Actions workflow included
# Basic installation
pip install gatewaybench
# Development installation with test dependencies
git clone https://github.com/ModaLabs/GatewayBench.git
cd GatewayBench
pip install -e .[dev]
Requirements: Python 3.8+
# Set your OpenAI API key
export OPENAI_API_KEY='your-api-key-here'
# Generate a dataset (100 examples by default)
python scripts/generate_dataset.py
# Generate more examples
NUM_EXAMPLES=2000 CONCURRENCY=20 python scripts/generate_dataset.py
# Or use the CLI for custom generation
gatewaybench generate --num 1500 --output data/raw/dataset.jsonl --concurrency 20
# Validate your dataset
gatewaybench validate data/raw/dataset.jsonl
# Show comprehensive statistics
gatewaybench stats data/raw/dataset.jsonl
GatewayBench generates four types of test scenarios:
| Task Type | Tool Count | Required Tools | Use Case |
|---|---|---|---|
| tool-heavy | 20-80 | 1-4 | Testing tool selection in large sets |
| retrieval | 5-20 | 1+ | Information retrieval scenarios |
| chat | 0 | 0 | Pure conversation without tools |
| stress | 60-100 | 1-10 | Stress testing with complex schemas |
import asyncio
from gatewaybench import generate_batch_async, validate_example
async def main():
    # Generate 100 examples
    examples = await generate_batch_async(
        num_examples=100,
        task_distribution={
            "tool-heavy": 0.3,
            "retrieval": 0.3,
            "chat": 0.25,
            "stress": 0.15
        },
        concurrency=20
    )
    # Validate all examples
    for example in examples:
        validate_example(example)
    print(f"Generated {len(examples)} valid examples")
asyncio.run(main())
# Advanced configuration with GenerationConfig
from gatewaybench import GenerationConfig, generate_batch_async
import asyncio
config = GenerationConfig(
    openai_api_key="your-key",
    num_examples=500,
    concurrency=30,
    deterministic=True,
    random_seed=42,
    output_path="data/my_dataset.jsonl"
)
examples = asyncio.run(generate_batch_async(
    num_examples=config.num_examples,
    task_distribution=config.task_distribution,
    api_key=config.openai_api_key,
    concurrency=config.concurrency
))
from gatewaybench import Example, TaskType, RelevanceLabel
# Load and validate with Pydantic
example_dict = {...} # Your example data
example = Example.from_dict(example_dict)
# Type-safe access
print(example.task_type) # TaskType enum
print(example.metadata.difficulty) # int (1-5)
# Get required tools
required_tools = example.get_required_tools()
Each example is a JSON object with:
{
  "id": "uuid-string",
  "user_prompt": "User query or request",
  "conversation_history": [],
  "task_type": "tool-heavy|retrieval|chat|stress",
  "relevance_by_name": {
    "tool_name": "required|useful|irrelevant"
  },
  "ideal_tool_subset": ["required_tool_1", "required_tool_2"],
  "reference_answer": "Expected answer using ideal tools",
  "model_pool": [...],
  "required_capabilities": [],
  "metadata": {
    "domain": "analytics|support|coding|...",
    "difficulty": 1-5,
    "source": "synthetic",
    "version": "1.0.0"
  }
}
Note: For chat tasks, relevance_by_name and ideal_tool_subset are omitted.
See docs/SCHEMA_SPEC.md for complete schema documentation.
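To make the ground-truth fields concrete, here is a minimal evaluation sketch: it loads a generated JSONL file and scores a gateway's tool choice against ideal_tool_subset. The tool_selection_scores helper, the file path, and the selected-tools input are illustrative only and not part of the GatewayBench API; only Example.from_dict and the schema fields above come from the library.
import json
from gatewaybench import Example
def tool_selection_scores(example, selected_tools):
    # Compare a gateway's chosen tools against the ground-truth ideal_tool_subset.
    # Assumes the Pydantic model exposes ideal_tool_subset (absent for chat tasks).
    ideal = set(getattr(example, "ideal_tool_subset", None) or [])
    chosen = set(selected_tools)
    if not ideal:
        return 1.0, 1.0  # chat tasks: there is nothing to select
    hits = len(ideal & chosen)
    precision = hits / len(chosen) if chosen else 0.0
    recall = hits / len(ideal)
    return precision, recall
# Hypothetical path; point this at your generated file.
with open("data/raw/dataset.jsonl") as f:
    examples = [Example.from_dict(json.loads(line)) for line in f]
p, r = tool_selection_scores(examples[0], {"required_tool_1"})
print(f"precision={p:.2f} recall={r:.2f}")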
GatewayBench uses asynchronous API calls for dramatically faster generation:
| Examples | Sequential | Async (20 workers) | Speedup |
|---|---|---|---|
| 100 | ~2-3 min | ~10-20 sec | 10-15x |
| 500 | ~10-15 min | ~1-2 min | 10-15x |
| 1500 | ~20-30 min | ~2-3 min | 10-20x |
Key optimizations:
- ✅ Concurrent API calls with configurable workers
- ✅ Cached OpenAI clients and prompts
- ✅ Efficient batch processing and I/O
- ✅ Retry logic with exponential backoff
Adjust concurrency based on your API rate limits:
gatewaybench generate --num 1000 --concurrency 30
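If you wrap your own API calls in a similar pipeline, the pattern behind these optimizations is semaphore-bounded concurrency plus exponential-backoff retries. The sketch below is a generic illustration, not GatewayBench internals; with_retries and bounded_gather are hypothetical helpers, and coro_factories stands in for whatever async calls you are making.
import asyncio
import random
async def with_retries(coro_factory, max_attempts=5, base_delay=1.0):
    # Exponential backoff: sleep 1s, 2s, 4s, ... plus jitter between attempts.
    for attempt in range(max_attempts):
        try:
            return await coro_factory()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            await asyncio.sleep(base_delay * (2 ** attempt) + random.random())
async def bounded_gather(coro_factories, concurrency=20):
    # A semaphore caps in-flight calls so the concurrency setting matches your rate limits.
    sem = asyncio.Semaphore(concurrency)
    async def run_one(factory):
        async with sem:
            return await with_retries(factory)
    return await asyncio.gather(*(run_one(f) for f in coro_factories))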
# Run all tests
make test
# Run with coverage report
make test-verbose
# Run specific test file
pytest tests/test_validator.py -v
# Format code
make format
# Run linting
make lint
# Type checking
make type-check
# Security audit
make security
# Run all checks
make check-all
gatewaybench/
├── core/ # Core generation logic
│ ├── generator.py
│ ├── validator.py
│ ├── tool_generator.py
│ ├── config.py
│ └── types.py
├── utils/ # Utilities
│ ├── caching.py
│ ├── logging.py
│ └── stats.py
├── prompts/ # LLM prompts
└── cli.py # Command-line interface
tests/ # Comprehensive test suite
scripts/ # Utility scripts
docs/ # Documentation
GatewayBench v1 is available as a ready-to-use dataset on Hugging Face:
from datasets import load_dataset
# Load the dataset (2,000 examples)
ds = load_dataset("ModaLabs/gatewaybench-v1", split="train")
# Access an example
print(ds[0])
# Filter by task type
tool_heavy = [ex for ex in ds if ex["task_type"] == "tool-heavy"]
See docs/DATASET_CARD.md for full dataset documentation.
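As a quick sanity check on the published split, something like the following tallies examples per task type; it relies only on the load_dataset call above and the task_type field from the schema.
from collections import Counter
from datasets import load_dataset
ds = load_dataset("ModaLabs/gatewaybench-v1", split="train")
counts = Counter(ex["task_type"] for ex in ds)
for task_type, n in counts.most_common():
    print(f"{task_type:12s} {n}")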
- docs/API.md - Complete API documentation
- docs/SCHEMA_SPEC.md - Data schema specification
- docs/CONFIG.md - Configuration guide
- docs/DATASET_CARD.md - Dataset card and usage
- PROJECT_DOCUMENTATION.md - Technical overview
Contributions are welcome! Please see our development setup:
- Fork the repository
- Create a feature branch
- Install development dependencies: pip install -e .[dev]
- Make your changes with tests
- Run the test suite: make test
- Submit a pull request
If you use GatewayBench in your research, please cite:
@misc{bedi2025gatewaybench,
  title={GatewayBench: Evaluating LLM Gateways -- A Benchmark and Measurement Study of Cost/Latency/Quality Tradeoffs},
  author={Bedi, Pranav and Al Rasheed, Mohammed and Al Ani, Mazin},
  year={2025},
  url={https://github.com/ModaLabs/GatewayBench}
}
MIT License - see LICENSE for details.