EnvCommons/BixBench

BixBench

Description

BixBench is an environment for evaluating AI agents on real-world bioinformatics computational analysis tasks. The tasks are built from Code Ocean capsules containing published bioinformatics analyses. Agents are given access to the biological datasets and must answer hypothesis-driven research questions through multi-step analytical trajectories.

Capabilities

  • Exploring and analyzing biological datasets using CLI tools
  • Writing and executing bioinformatics analysis code
  • Interpreting results from genomic, transcriptomic, and proteomic analyses
  • Multi-step computational biology reasoning

Compute Requirements

Agents in BixBench are given a sandbox with access to bioinformatics tools (samtools, bcftools, bedtools, tabix) and the full Code Ocean capsule data (~5.91 GB total).
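As a minimal sketch of how an agent or maintainer might confirm that the tools listed above are present in a sandbox, the snippet below looks each one up on PATH. The tool names come from this README; the helper `missing_tools` is a hypothetical name introduced for illustration and is not part of the environment.

```python
import shutil

# Tool names as listed in this README; the check itself is an illustrative assumption.
EXPECTED_TOOLS = ["samtools", "bcftools", "bedtools", "tabix"]


def missing_tools(tools=EXPECTED_TOOLS):
    """Return the subset of `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    print("missing tools:", ", ".join(missing) if missing else "none")
```

Running this once at the start of a trajectory is a cheap way to find out which tools are usable before planning an analysis around them.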

License

Apache 2.0.

Tasks

There is one split in this environment:

  • Test: 205 bioinformatics analysis tasks

Each task is derived from a Code Ocean capsule and presents a hypothesis-driven question about biological data. Tasks span diverse bioinformatics domains including genomics, transcriptomics, and proteomics.

Reward Structure

This is a multi-turn environment with binary reward at submission:

  • 1.0 — Correct answer
  • 0.0 — Incorrect answer

Evaluation uses two modes depending on the task:

  • String verifier: Case-insensitive string matching with LLM semantic fallback (gpt-5-mini)
  • Range verifier: Numeric proximity check with distractor-based tolerance

Exact matches are checked first to avoid unnecessary LLM calls.
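The two modes can be sketched roughly as follows. This is an illustration under stated assumptions, not the environment's actual grader: the function names, the whitespace normalization, and the `llm_judge` callback are hypothetical, and the range check assumes the tolerance has already been reduced to a `low`/`high` interval, since this README does not describe how the distractors determine it.

```python
from typing import Callable, Optional


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace (the whitespace handling is an assumption)."""
    return " ".join(text.strip().lower().split())


def string_verify(
    submitted: str,
    expected: str,
    llm_judge: Optional[Callable[[str, str], bool]] = None,
) -> float:
    """Binary reward: exact case-insensitive match first, then an optional LLM fallback."""
    if normalize(submitted) == normalize(expected):
        return 1.0  # exact match, so no LLM call is made
    if llm_judge is not None and llm_judge(submitted, expected):
        return 1.0  # judged semantically equivalent by the (hypothetical) fallback
    return 0.0


def range_verify(submitted: str, low: float, high: float) -> float:
    """Binary reward: 1.0 if the submitted number lies in the assumed [low, high] interval."""
    try:
        value = float(submitted)
    except ValueError:
        return 0.0  # non-numeric submissions are scored as incorrect
    return 1.0 if low <= value <= high else 0.0
```

For example, `string_verify("TP53", "tp53")` returns 1.0 without consulting the judge, while a paraphrased answer would only score 1.0 if the supplied judge accepts it.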

Data

Task data consists of a Parquet metadata file and Code Ocean capsules containing biological datasets. Capsules are mounted at /orwd_data/bixbench/capsules/ in production.
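A small sketch for enumerating the mounted capsules is shown below. The mount path is taken from this README; the assumption that each capsule appears as its own subdirectory, and the helper name `list_capsules`, are illustrative and not confirmed by the source.

```python
from pathlib import Path

# Production mount point stated in this README.
CAPSULE_ROOT = Path("/orwd_data/bixbench/capsules")


def list_capsules(root=CAPSULE_ROOT):
    """Return sorted subdirectory names under `root`, or [] if the path is not mounted.

    Assumes one subdirectory per capsule, which this README does not confirm.
    """
    root = Path(root)
    if not root.is_dir():
        return []
    return sorted(entry.name for entry in root.iterdir() if entry.is_dir())


if __name__ == "__main__":
    for name in list_capsules():
        print(name)
```

Returning an empty list for a missing mount keeps the helper usable outside production, where the path may not exist.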

Source: futurehouse/BixBench

Tools

| Tool | Description |
| --- | --- |
| submit_answer | Submit your answer for binary evaluation. |
| bash | Execute shell commands. |
| glob | Find files by pattern. |
| grep | Search file contents. |
| ls | List directory contents. |
| read | Read file contents. |
| write | Write to files. |
| edit | Edit existing files. |
| multi_edit | Apply multiple edits to a file. |
| todo_write | Track task progress. |

Time Horizon

BixBench is a multi-turn environment. Agents iteratively explore data, write analysis code, and execute computations before submitting a final answer.

Environment Difficulty

Model performance on BixBench from the original paper (open-answer setting):

| Model | Accuracy |
| --- | --- |
| Claude 3.5 Sonnet | 17% |
| GPT-4o | 9% |

In the multiple-choice setting, even frontier models performed no better than random guessing, indicating that fully autonomous bioinformatics research remained challenging at the time of the benchmark's release.

Other Environment Requirements

  • OpenAI API key: Required for LLM-based fallback grading in string verification. Pass via secrets={"openai_api_key": "..."}.
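One way to avoid hard-coding the key is to build the `secrets` dictionary from an environment variable, as in the sketch below. Only the `{"openai_api_key": ...}` shape comes from this README; the helper `build_secrets` is hypothetical, reading from `OPENAI_API_KEY` is an assumption based on the variable name the OpenAI SDK conventionally uses, and the call that accepts `secrets=` is not shown because the README does not specify it.

```python
import os


def build_secrets(env=os.environ):
    """Build the secrets mapping from an environment-like mapping.

    Reads OPENAI_API_KEY (an assumed variable name) and fails loudly if it is unset.
    """
    key = env.get("OPENAI_API_KEY")
    if not key:
        raise RuntimeError("OPENAI_API_KEY is not set; LLM fallback grading needs it")
    return {"openai_api_key": key}
```

The resulting dictionary can then be passed as the `secrets` argument wherever the environment is instantiated.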

Safety

Agents in BixBench interact with published biological datasets in a sandboxed environment. The environment does not involve human subjects or clinical data requiring special protections.

Citations

@article{mitchener2025bixbench,
  author    = {Mitchener, Ludovico and Laurent, Jon M and Tenmann, Benjamin and Narayanan, Siddharth and Wellawatte, Geemi P and White, Andrew and Sani, Lorenzo and Rodriques, Samuel G},
  title     = {BixBench: a Comprehensive Benchmark for LLM-based Agents in Computational Biology},
  journal   = {arXiv preprint arXiv:2503.00096},
  year      = {2025},
  url       = {https://arxiv.org/abs/2503.00096}
}
