BixBench

Description

BixBench is an environment for evaluating AI agents on real-world bioinformatics computational analysis tasks. The tasks are built from Code Ocean capsules containing published bioinformatics analyses: agents are given access to biological datasets and must answer hypothesis-driven research questions through multi-step analytical trajectories.

Capabilities

Exploring and analyzing biological datasets using CLI tools
Writing and executing bioinformatics analysis code
Interpreting results from genomic, transcriptomic, and proteomic analyses
Multi-step computational biology reasoning

Compute Requirements

Agents in BixBench are given a sandbox with access to bioinformatics tools (samtools, bcftools, bedtools, tabix) and the full Code Ocean capsule data (~5.91 GB total).
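
As a quick sanity check, the listed tools can be looked up on `PATH` from inside the sandbox. The following is a minimal sketch, not part of the environment itself; the helper name `find_missing_tools` is hypothetical, and only the four tool names come from the description above.

```python
import shutil

# Tool names taken from the sandbox description above.
EXPECTED_TOOLS = ("samtools", "bcftools", "bedtools", "tabix")


def find_missing_tools(tools=EXPECTED_TOOLS):
    """Return the names of tools that cannot be found on PATH."""
    return [name for name in tools if shutil.which(name) is None]


if __name__ == "__main__":
    missing = find_missing_tools()
    if missing:
        print("Missing tools:", ", ".join(missing))
    else:
        print("All expected tools are available.")
```

An empty result means every tool resolved to an executable; it does not verify tool versions.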

License

Apache 2.0.

Tasks

There is one split in this environment:

Test: 205 bioinformatics analysis tasks

Each task is derived from a Code Ocean capsule and presents a hypothesis-driven question about biological data. Tasks span diverse bioinformatics domains including genomics, transcriptomics, and proteomics.

Reward Structure

This is a multi-turn environment with binary reward at submission:

1.0 — Correct answer
0.0 — Incorrect answer

Evaluation uses two modes depending on the task:

String verifier: Case-insensitive string matching with LLM semantic fallback (gpt-5-mini)
Range verifier: Numeric proximity check with distractor-based tolerance

Exact matches are checked first to avoid unnecessary LLM calls.
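
The two verifier modes could look roughly like the sketch below. It is an illustration under stated assumptions, not the environment's actual implementation: the function names are hypothetical, the LLM fallback is passed in as a plain callable rather than a real API call, and the tolerance rule (half the distance from the target to the nearest distractor) is an assumed reading of "distractor-based tolerance".

```python
from typing import Callable, Sequence


def string_reward(answer: str, target: str,
                  llm_judge: Callable[[str, str], bool]) -> float:
    """Binary reward: case-insensitive exact match first, LLM judge as fallback."""
    if answer.strip().lower() == target.strip().lower():
        return 1.0  # exact match, so the LLM judge is never called
    return 1.0 if llm_judge(answer, target) else 0.0


def range_reward(answer: float, target: float,
                 distractors: Sequence[float]) -> float:
    """Binary reward based on numeric proximity to the target.

    ASSUMPTION: the tolerance is half the gap between the target and
    its nearest distractor. The real rule may differ.
    """
    if not distractors:
        return 1.0 if answer == target else 0.0
    tolerance = min(abs(d - target) for d in distractors) / 2
    return 1.0 if abs(answer - target) < tolerance else 0.0
```

Passing the judge as a callable keeps the exact-match shortcut testable without network access, which mirrors the stated goal of avoiding unnecessary LLM calls.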

Data

Task data consists of a Parquet metadata file and Code Ocean capsules containing biological datasets. Capsules are mounted at /orwd_data/bixbench/capsules/ in production.
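
For orientation, a capsule's files can be enumerated with the standard library. This sketch assumes each capsule lives in its own subdirectory under the mount point, named by a capsule identifier; that layout, the `capsule_id` parameter, and the function name `list_capsule_files` are assumptions for illustration, not details confirmed by the environment.

```python
from pathlib import Path

# Production mount point stated above.
CAPSULE_ROOT = Path("/orwd_data/bixbench/capsules")


def list_capsule_files(capsule_id: str, root: Path = CAPSULE_ROOT) -> list[str]:
    """List files in one capsule as sorted paths relative to the capsule directory.

    ASSUMPTION: capsules are stored as <root>/<capsule_id>/...
    """
    capsule_dir = root / capsule_id
    if not capsule_dir.is_dir():
        raise FileNotFoundError(f"No capsule directory at {capsule_dir}")
    return sorted(
        p.relative_to(capsule_dir).as_posix()
        for p in capsule_dir.rglob("*")
        if p.is_file()
    )
```

Outside production, `root` can point at any local directory that holds extracted capsules.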

Source: futurehouse/BixBench

Tools

Tool	Description
`submit_answer`	Submit your answer for binary evaluation.
`bash`	Execute shell commands.
`glob`	Find files by pattern.
`grep`	Search file contents.
`ls`	List directory contents.
`read`	Read file contents.
`write`	Write to files.
`edit`	Edit existing files.
`multi_edit`	Apply multiple edits to a file.
`todo_write`	Track task progress.

Time Horizon

BixBench is a multi-turn environment. Agents iteratively explore data, write analysis code, and execute computations before submitting a final answer.

Environment Difficulty

Model performance on BixBench from the original paper (open-answer setting):

Model	Accuracy
Claude 3.5 Sonnet	17%
GPT-4o	9%

In the multiple-choice setting, even frontier models performed no better than random, indicating that fully autonomous bioinformatics research remained challenging at the time of the benchmark's release.

Other Environment Requirements

OpenAI API key: Required for LLM-based fallback grading in string verification. Pass via secrets={"openai_api_key": "..."}.
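
One way to avoid hard-coding the key is to build the `secrets` dictionary from an environment variable. This is a minimal sketch: the variable name `OPENAI_API_KEY` and the helper `build_secrets` are assumptions, and only the `{"openai_api_key": ...}` shape comes from the text above.

```python
import os
from typing import Mapping


def build_secrets(env: Mapping[str, str] = os.environ) -> dict:
    """Build the secrets dict from the environment.

    ASSUMPTION: the key is stored in the OPENAI_API_KEY variable.
    """
    key = env.get("OPENAI_API_KEY")
    if not key:
        raise RuntimeError("OPENAI_API_KEY is not set; LLM fallback grading needs it.")
    return {"openai_api_key": key}
```

The returned dictionary can then be passed wherever the environment accepts its `secrets` argument.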

Safety

Agents in BixBench interact with published biological datasets in a sandboxed environment. The environment does not involve human subjects or clinical data requiring special protections.

Citations

@article{mitchener2025bixbench,
  author    = {Mitchener, Ludovico and Laurent, Jon M and Tenmann, Benjamin and Narayanan, Siddharth and Wellawatte, Geemi P and White, Andrew and Sani, Lorenzo and Rodriques, Samuel G},
  title     = {BixBench: a Comprehensive Benchmark for LLM-based Agents in Computational Biology},
  journal   = {arXiv preprint arXiv:2503.00096},
  year      = {2025},
  url       = {https://arxiv.org/abs/2503.00096}
}
