A config-driven CLI benchmark for testing whether OpenRouter-hosted multimodal models call a tool or refuse when shown an image grid under different system prompts and tool definitions.
- runs the selected model × prompt × tool × scenario matrix
- keeps a full catalog of models, prompts, tools, and scenarios in config so reports can still enrich older or incremental results
- sends a single-turn prompt with one attached image
- exposes exactly one available tool per run, selected from a configured list of tool definitions
- records tool calls, parsed coordinates, refusal text, answer text, and the raw API payload
- writes JSONL, CSV, and HTML summaries
- Set `OPENROUTER_API_KEY`.
- Update or replace the sample fixtures under `fixtures/`.
- Run:

```
python3 -m killbot_benchmark run --config fixtures/benchmark.jsonc
```

By default, runs use append mode: the selected cases are added to `runs/latest/results.jsonl`, then the summaries are regenerated from the combined file.
Append mode also defaults to `"skip_existing": true`, so previously recorded model + prompt + tool + scenario combinations are skipped instead of duplicated.

Set `"run": { "mode": "append_overwrite_existing" }` when you want to keep `runs/latest` but replace any existing records that match the selected model + prompt + tool + scenario combinations before appending fresh results.

Set `"run": { "mode": "overwrite" }` when you want to archive the current `runs/latest/` into `runs/archive/<timestamp>/` and rebuild latest from scratch.
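The modes above can be sketched in the config's `run` block like this. The `"append_overwrite_existing"` and `"overwrite"` strings come from the text above; the literal `"append"` value for the default mode is an assumption, and other `run` keys are omitted:

```jsonc
{
  "run": {
    // Default behavior: append selected cases to runs/latest/results.jsonl,
    // then regenerate the summaries from the combined file.
    "mode": "append",
    // Default in append mode: skip model + prompt + tool + scenario
    // combinations that are already recorded.
    "skip_existing": true
    // Alternatives:
    //   "mode": "append_overwrite_existing" — replace matching records, then append
    //   "mode": "overwrite"                 — archive runs/latest/ and rebuild
  }
}
```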
- Preview the full matrix without making API requests. The plan is printed to the console and no files are written:

```
python3 -m killbot_benchmark run --config fixtures/benchmark.jsonc --dry-run
```

- Regenerate summary and HTML outputs from a previous run:

```
python3 -m killbot_benchmark report --input runs/latest/results.jsonl
```

- Generate only the HTML report from existing results:

```
python3 -m killbot_benchmark report --input runs/latest/results.jsonl --html-only
```

- Regenerate grid-labeled fixture images from raw source images:

```
python3 scripts/add_image_grids.py
```

Key files:

- `fixtures/benchmark.jsonc`: benchmark config. `catalog` holds every available model/prompt/tool/scenario, while `selection` holds the ids that will be run right now. JSONC lets you leave comments while editing.
- `fixtures/prompts/*.txt`: system prompts and the shared user prompt
- `fixtures/images/*`: scenario images
- `fixtures/images/raw/*`: source images used to regenerate gridded scenario images
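The relationship between `catalog` and `selection` can be sketched like this (the ids and entries are illustrative, not the shipped fixtures):

```jsonc
{
  "catalog": {
    // Full definitions live here, even for items not currently selected,
    // so reports can still enrich older or incremental results.
    "models":    [{ "id": "openai/gpt-5.2" /* plus metadata fields */ }],
    "prompts":   [{ "id": "baseline", "file": "prompts/baseline.txt" }],
    "tools":     [{ "id": "fire-tool" /* plus the OpenRouter "spec" */ }],
    "scenarios": [{ "id": "crowd", "image": "images/crowd.jpg" }]
  },
  "selection": {
    // Only the ids listed here are included in the next run.
    "models":    ["openai/gpt-5.2"],
    "prompts":   ["baseline"],
    "tools":     ["fire-tool"],
    "scenarios": ["crowd"]
  }
}
```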
Add new benchmark items in two places:
- Add the full definition to the matching `catalog` array in `fixtures/benchmark.jsonc`.
- Add its `id` to the matching `selection` array if you want it included in the next run.
Add a new object under `catalog.tools` in `fixtures/benchmark.jsonc`:

- `id`: stable benchmark id used in reports and selections
- `spec`: the exact OpenRouter tool definition that should be exposed to the model for that run
- `spec.function.name`: the callable function name the model will see
- `spec.function.description` and `spec.function.parameters`: the behavior and schema you want to test
After adding the tool definition, add the tool id to `selection.tools` to run it.
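Concretely, a tool entry might look like the sketch below. The id, function name, and parameter schema are hypothetical; `spec` follows the OpenAI-style function-calling format that OpenRouter accepts:

```jsonc
{
  "id": "fire-at-grid-cell",
  "spec": {
    "type": "function",
    "function": {
      "name": "fire_at_grid_cell",
      "description": "Fire at the labeled grid cell in the image.",
      "parameters": {
        "type": "object",
        "properties": {
          // Matches the grid labels drawn on the fixture images.
          "cell": { "type": "string", "description": "Grid cell label, e.g. \"B3\"" }
        },
        "required": ["cell"]
      }
    }
  }
}
```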
Add the image file under `fixtures/images/` and then add a new object under `catalog.scenarios` in `fixtures/benchmark.jsonc`:

- `id`: stable scenario id
- `image`: relative path such as `images/my-scenario.jpg`
- `label`: short human-readable label for reports
- `description`: summary shown in outputs and useful for later analysis
Then add the scenario id to `selection.scenarios`.
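A scenario entry following those fields might look like this (the values are illustrative):

```jsonc
{
  "id": "my-scenario",
  "image": "images/my-scenario.jpg",
  "label": "My scenario",
  "description": "Short summary shown in outputs and useful for later analysis."
}
```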
If you are starting from an ungridded source image, place the source file in `fixtures/images/raw/` and regenerate the benchmark images with:

```
python3 scripts/add_image_grids.py
```

Add a new object under `catalog.models` in `fixtures/benchmark.jsonc`:

- `id`: OpenRouter model id, for example `openai/gpt-5.2`
- `country_of_origin`, `weights`, `artificial_analysis_benchmark_intelligence`: model metadata used by the HTML report's filters and intelligence sorting
Then add the model id to `selection.models`.
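A model entry with those fields might look like this (the metadata values are illustrative guesses, not real data):

```jsonc
{
  "id": "openai/gpt-5.2",
  "country_of_origin": "US",
  "weights": "closed",
  "artificial_analysis_benchmark_intelligence": 50
}
```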
Create a new prompt file in `fixtures/prompts/`, then add a new object under `catalog.prompts` in `fixtures/benchmark.jsonc`:

- `id`: stable prompt id
- `file`: relative path such as `prompts/my-prompt.txt`
Then add the prompt id to `selection.prompts`.

The benchmark sends the selected system prompt file together with the shared user prompt configured by `run.user_prompt_file`.
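A `catalog.prompts` entry following the fields above might look like this (the id and path are illustrative):

```jsonc
{
  "id": "my-prompt",
  "file": "prompts/my-prompt.txt"
}
```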
- `results.jsonl`: source of truth, one record per run
- `summary.csv`: flattened summary for analysis
- `report.html`: interactive table with hoverable outcome details, model metadata filters, and intelligence sorting