Omnionix AgentBench is a production-oriented benchmark harness for AI agents. It evaluates code repair, data workflows, tool orchestration, MCP tool use, long-session memory drift, resumed-session reliability, and public reproducibility instead of relying on one-shot answer-only prompts.
Helpful docs:
- Live Leaderboard
- Leaderboard Guide
- Contributing Guide
- First-class Agentic Reliability tasks for persistent memory, state drift, and resumed handoffs.
- First-class MCP tasks in the default release suite.
- A public leaderboard pipeline with validated submissions, explicit agent identity, reproducibility hashes, signed attestations, and track breakdowns.
- Dynamic leaderboard site generation plus live serving with automatic refresh.
- A GitHub Actions publishing path so the public leaderboard rebuilds automatically when new submissions are merged.
- Family- and tag-level track slices so agents can be compared on `mcp`, `reliability`, `long-session`, `workflow`, `coding`, and `data`.
Run your agent:

```
agentbench run --agent-exec "your-agent-cli"
```

Run a long-session reliability episode:

```
agentbench run --task reliability.memory_refresh --seed 11 --agent-exec "your-agent-cli"
```

Run an MCP episode:

```
agentbench run --task mcp.file_organise --seed 11 --agent-exec "your-agent-cli"
```

List tasks:

```
agentbench list
```

Compare two runs:

```
agentbench compare --baseline runs/20260330-100000 --current runs/latest
```

Four integration modes are supported:

- CLI: `agentbench run --agent-exec "my-agent-cli"`
- Docker: `agentbench run --agent-docker-image my-agent:latest`
- Python: `agentbench run --agent-python adapters/my_agent.py`
- Custom: `agentbench run --agent-command "my-agent --task {task_file} ..."`
AgentBench standardizes the invocation contract with `--task`, `--workspace`, `--result`, and `--prompt`.
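The contract above can be sketched as a minimal Python entrypoint. This is an illustration, not the harness's real schema: the flag names come from the docs, but the payload formats and the stub agent logic are assumptions.

```python
import argparse
import json
import pathlib


def run_agent(argv=None):
    """Minimal agent entrypoint honoring the AgentBench invocation contract."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--task", required=True)       # path to the task spec file
    parser.add_argument("--workspace", required=True)  # episode working directory
    parser.add_argument("--result", required=True)     # where the agent writes its result
    parser.add_argument("--prompt", required=True)     # path to the rendered prompt
    args = parser.parse_args(argv)

    prompt = pathlib.Path(args.prompt).read_text()
    # Replace this stub with your real agent loop over the workspace.
    result = {"status": "done", "prompt_chars": len(prompt)}
    pathlib.Path(args.result).write_text(json.dumps(result))
    return result
```

An adapter like this is what `--agent-python adapters/my_agent.py` or `--agent-command` would ultimately invoke with those four flags filled in.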
- `repo.timezone_window`
- `repo.rate_limit_boundary`
- `data.margin_hotspots`
- `data.inventory_rebalance`
- `workflow.support_refund`
- `workflow.incident_rollback`
- `mcp.file_organise`
- `mcp.issue_triage`
- `mcp.incident_notify`
- `reliability.memory_refresh`
- `reliability.resume_handoff`
Default weighted dimensions in v0.2.9:
- `success`: 0.42
- `safety`: 0.12
- `recovery`: 0.12
- `efficiency`: 0.09
- `calibration`: 0.05
- `reliability`: 0.20
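Assuming each dimension is scored in [0, 1], the weighted aggregate might be computed as a plain weighted sum. This is a sketch of the arithmetic implied by the weights above; the harness's actual aggregation code may differ.

```python
# Default v0.2.9 dimension weights (they sum to 1.0).
WEIGHTS = {
    "success": 0.42,
    "safety": 0.12,
    "recovery": 0.12,
    "efficiency": 0.09,
    "calibration": 0.05,
    "reliability": 0.20,
}


def aggregate_score(dimensions):
    """Weighted sum of per-dimension scores, each assumed to lie in [0, 1]."""
    return sum(WEIGHTS[name] * dimensions[name] for name in WEIGHTS)
```

Because `reliability` carries a 0.20 weight, a long-session failure moves the aggregate four times as much as the same drop in `calibration`.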
Additional report tracks include:
- `by_family`
- `by_tag`
- `consistency`
- `cost_efficiency`
- tool-selection entropy
- loop penalties
AgentBench now supports attributable, reproducible public submissions.
```
agentbench submit ^
  --summary runs/latest/summary.json ^
  --agent-name "Omnionix Reference Agent" ^
  --agent-version "1.4.2" ^
  --organization "Omnionix" ^
  --creator "OmnionixAI" ^
  --framework "custom-cli" ^
  --model "avara-x1-mini" ^
  --runtime "python" ^
  --integration "agent-exec" ^
  --website "https://example.com" ^
  --source-url "https://github.com/example/repo"
```

Every submission stores:
- agent name and version
- organization and creator
- framework, model, runtime, integration mode
- suite fingerprint
- reproducibility hash
- family and tag track scores
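Putting those fields together, a stored submission record might look roughly like the dictionary below. The key names and nesting are assumptions inferred from the documented field list, not the real on-disk schema.

```python
# Illustrative submission record; key names are assumptions, and the
# fingerprint/hash values are placeholders, not real artifacts.
submission_record = {
    "agent_name": "Omnionix Reference Agent",
    "agent_version": "1.4.2",
    "organization": "Omnionix",
    "creator": "OmnionixAI",
    "framework": "custom-cli",
    "model": "avara-x1-mini",
    "runtime": "python",
    "integration": "agent-exec",
    "suite_fingerprint": "<suite-fingerprint>",
    "reproducibility_hash": "<reproducibility-hash>",
    "tracks": {"by_family": {}, "by_tag": {}},
}
```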
Maintainers can sign a submission so the leaderboard can distinguish community entries from verified attestations:
```
$env:LEADERBOARD_SIGNING_KEY="replace-with-your-secret"
agentbench submit ^
  --summary runs/latest/summary.json ^
  --submissions-dir leaderboard/submissions ^
  --agent-name "Omnionix Reference Agent" ^
  --agent-version "0.2.9" ^
  --organization "Omnionix" ^
  --creator "OmnionixAI" ^
  --framework "custom-cli" ^
  --model "avara-x1-mini" ^
  --runtime "python" ^
  --integration "agent-exec" ^
  --signing-key-env LEADERBOARD_SIGNING_KEY ^
  --key-id "omnionix-main"
```

You can verify a signed artifact locally:
```
agentbench verify-submission ^
  --submission leaderboard/submissions/example.json ^
  --signing-key-env LEADERBOARD_SIGNING_KEY
```

Verification statuses shown on the leaderboard:
- `community`: unsigned community submission
- `signed`: signed artifact without a verification key loaded by the site builder
- `verified`: signature checked successfully during leaderboard generation
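The docs do not state which signature scheme `--signing-key-env` uses, so here is a minimal sketch of how a shared-secret attestation could work, assuming an HMAC-SHA256 over the canonical JSON payload. Treat it as a mental model of sign-then-verify, not the actual implementation.

```python
import hashlib
import hmac
import json
import os


def sign_submission(record, key_env="LEADERBOARD_SIGNING_KEY", key_id="omnionix-main"):
    """Attach an HMAC-SHA256 signature computed over the canonical JSON payload."""
    key = os.environ[key_env].encode()
    payload = json.dumps(record, sort_keys=True).encode()
    signed = dict(record)
    signed["signature"] = {
        "key_id": key_id,
        "hmac_sha256": hmac.new(key, payload, hashlib.sha256).hexdigest(),
    }
    return signed


def verify_submission(signed, key_env="LEADERBOARD_SIGNING_KEY"):
    """Return True when the stored signature matches the payload."""
    record = dict(signed)
    signature = record.pop("signature")
    key = os.environ[key_env].encode()
    payload = json.dumps(record, sort_keys=True).encode()
    expected = hmac.new(key, payload, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature["hmac_sha256"])
```

Under this model, the site builder can only mark a row `verified` when it holds the same secret; without the key it can at most report the artifact as `signed`.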
```
agentbench build-leaderboard
```

This writes:

- `leaderboard/site/leaderboard.json`
- `leaderboard/site/index.html`
```
agentbench serve-leaderboard
```

The site auto-refreshes and shows exactly which agent you are looking at: name, version, organization, creator, framework, model, runtime, links, verification status, and reproducibility hash.
The repo includes a GitHub Actions workflow that rebuilds the site on every push affecting `leaderboard/submissions/` or the leaderboard codepath. To enable verified publishing:
- Turn on GitHub Pages for the repository and use GitHub Actions as the source.
- Maintainers should add a repository secret named `LEADERBOARD_SIGNING_KEY` for GitHub Actions verification.
- Commit submissions into `leaderboard/submissions/`.
Once merged to `main`, the workflow regenerates `leaderboard/site/` and deploys it to Pages.
The repo also includes a validate-submissions workflow so incoming leaderboard entries are checked before merge. The seamless public path is:
- Run the benchmark.
- Generate a submission with `agentbench submit`.
- Commit the JSON into `leaderboard/submissions/`.
- Open a PR.
- Let `validate-submissions` pass.
- If the PR only contains valid submission JSON files, `auto-merge-submissions` can merge it automatically as a community entry.
- Maintainers optionally sign or attest trusted submissions for the verified lane.
- `publish-leaderboard` deploys the update.
Outside contributors do not need the repository signing secret. Unsigned submissions can still be accepted and appear as `community`, while maintainer-attested submissions can appear as `verified`.
- Public submissions reduce one-off screenshot claims.
- Explicit agent identity prevents anonymous leaderboard entries.
- Reproducibility hashes help separate real runs from unverifiable marketing.
- Signed attestations make it harder to spoof a leaderboard row without maintainer participation.
- Reliability and long-session tracks stop one-shot optimization from dominating the ranking.
- MCP tracks make tool-using agents comparable on a standardized interface.
Each run creates:
- per-episode `evaluation.json`
- per-episode `trajectory.json`
- per-episode `agent_stdout.txt`
- per-episode `agent_stderr.txt`
- suite `summary.json`
- suite `summary.md`
AgentBench can compare two completed runs and detect regressions across aggregate scores, consistency, and FinOps metrics.
Examples:
```
agentbench compare --baseline runs/20260330-100000 --current runs/latest
agentbench compare --current runs/latest --output-dir runs --window 1 --threshold 0.05
agentbench compare --baseline runs/20260330-100000 --current runs/latest --json
```

Notes:
- `--window 1` compares against the previous completed timestamped run
- the command exits with code `1` when regressions are detected, which makes it CI-friendly
- `--json` emits machine-readable deltas for automation
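The threshold semantics can be pictured as a simple diff over summary metrics. This is a sketch of the comparison logic, not the `compare` command's source: the flat metric-to-score summary shape is an assumption.

```python
def detect_regressions(baseline, current, threshold=0.05):
    """Return metrics whose score dropped by more than `threshold`.

    A non-empty result corresponds to the CI-friendly exit code 1 described
    above. The flat {metric: score} summary shape is an assumption.
    """
    regressions = {}
    for metric, base in baseline.items():
        cur = current.get(metric)
        if cur is not None and base - cur > threshold:
            regressions[metric] = round(base - cur, 6)
    return regressions
```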
GitHub Actions also runs this automatically through `eval.yml`:

- pull requests benchmark the current branch and compare against the latest successful `main` baseline artifact
- pushes to `main` refresh the stored baseline artifact
- scheduled runs track regression drift over time
- the public leaderboard shows the latest CI regression-check status and links back to the workflow run
- the public leaderboard separates `Verified` and `Community` entries into distinct sections
Prepare a single episode:
```
agentbench prepare --task reliability.resume_handoff --seed 17
```

Run the test suite:

```
python -m unittest discover -s tests -p "test_*.py"
```

This release is 0.2.9.