Evaluating LLM-Driven Protein Design: Agents Lack Iterative Evaluation Depth

Jeonghyeon Kim & Philip Romero · Romero Lab, Duke University

Paper: coming soon · Leaderboard: RomeroLab-Duke/BioDesignBench-Leaderboard · Reference MCP server: jasonkim8652/protein-design-mcp (`pip install protein-design-mcp`)
BioDesignBench is a benchmark for testing whether tool-augmented LLM agents can orchestrate the stochastic, multi-step pipelines of computational protein design. Where existing chemistry-agent and code-agent benchmarks evaluate deterministic tool chains, we focus on the qualitatively different setting in which generative tools (RFdiffusion, ProteinMPNN, Boltz-2) sample from distributions over structures and sequences and a competent practitioner must generate multiple candidates and screen them across complementary biophysical metrics before a design is viable.
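To make that setting concrete, here is a minimal sketch of the generate-then-screen pattern the benchmark probes. The callables, metric names, and thresholds are illustrative stand-ins, not the actual protein-design-mcp tool API:

```python
from typing import Callable

def generate_and_screen(
    generate_backbone: Callable[[], str],    # stochastic, e.g. an RFdiffusion call
    design_sequence: Callable[[str], str],   # stochastic, e.g. a ProteinMPNN call
    score_candidate: Callable[[str], dict],  # e.g. Boltz-2 + biophysical metrics
    n_candidates: int = 16,
) -> list[str]:
    """Sample many candidates, then screen each across complementary metrics."""
    thresholds = {"iptm": 0.80, "plddt": 85.0}  # assumed cutoffs, for illustration
    viable = []
    for _ in range(n_candidates):
        backbone = generate_backbone()        # every call samples a new structure
        sequence = design_sequence(backbone)  # every call samples a new sequence
        metrics = score_candidate(sequence)
        if all(metrics.get(k, 0.0) >= v for k, v in thresholds.items()):
            viable.append(sequence)
    return viable
```

A single pass through this loop is rarely enough; the benchmark rewards agents that keep sampling and re-evaluating until candidates clear every screen.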
We evaluate four frontier LLMs (DeepSeek V3, GPT-5, Claude Sonnet 4.5, Gemini 2.5 Pro) under guided and unguided MCP-tool presentation modes against deterministic and human baselines on 76 expert-curated tasks drawn from 2024–2026 literature. The headline finding: top-tier agents now beat a hardcoded pipeline, but invoke evaluation tools at only 14% of expert depth, and workflow guidance rescues coverage without rescuing depth.
| Agent / baseline | Hybrid score (100 pts) |
|---|---|
| Human Oracle | 74.9 |
| Human Expert | 61.3 |
| DeepSeek V3 (unguided) | 60.4 |
| DeepSeek V3 (guided) | 58.5 |
| GPT-5 (unguided) | 55.6 |
| GPT-5 (guided) | 55.3 |
| Hardcoded Pipeline | 54.2 |
| Claude Sonnet 4.5 (guided) | 50.2 |
| Claude Sonnet 4.5 (unguided) | 41.2 |
| Gemini 2.5 Pro | 8.4 |
- Top-tier LLM agents now beat a deterministic pipeline. DeepSeek V3 and GPT-5 surpass a hand-engineered hardcoded pipeline (54.2) under both modes. Autonomous protein-design orchestration is no longer infeasible.
- Coverage–depth dissociation. Workflow guidance closes the coverage gap (Rescue Index up to +3.01) but leaves utilisation depth unchanged (Rescue Index ≈ 0). Better tool docs cannot teach iterative depth.
- Evaluation depth, not tool knowledge, is the bottleneck. Across 836 task–condition observations, evaluation depth per candidate (sketched below) correlates with total score at ρ = 0.685 (p < 10⁻¹¹⁷). LLM agents generate backbone candidates at expert-level rates but evaluate each one at 14% of expert depth. Forced-depth interventions confirm this is causal.
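As a rough illustration of that statistic, here is a hedged sketch of depth-per-candidate computed from a tool-call trace. The trace format and tool-name sets are assumptions; the real analysis lives in `biodesignbench/tool_audit.py`:

```python
# Assumed tool names; the reference server's actual names may differ.
GENERATIVE_TOOLS = {"rfdiffusion", "proteinmpnn"}
EVALUATION_TOOLS = {"boltz2_predict", "compute_sasa", "rosetta_score"}

def evaluation_depth(trace: list[dict]) -> float:
    """Evaluation tool calls per generated candidate in one agent run."""
    n_generated = sum(1 for call in trace if call["tool"] in GENERATIVE_TOOLS)
    n_evaluated = sum(1 for call in trace if call["tool"] in EVALUATION_TOOLS)
    return n_evaluated / max(n_generated, 1)  # guard against empty runs
```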
To prevent contamination of future language models, the 76 task specifications, their input PDBs, ground truth, and oracle outputs are deliberately not released here. The benchmark is hosted as a private HuggingFace dataset and agents are evaluated through the public submission flow at the leaderboard URL above. The repo contains:
- the scoring & evaluation pipeline (`biodesignbench/eval/`)
- the agent harness, baselines, and bio-specific agent wrappers (`biodesignbench/agents/`)
- the MCP tool provider that maps the 17 reference tools to Anthropic / OpenAI / Gemini function-calling schemas (`biodesignbench/tools/`; a schema-mapping sketch follows below)
- the 2 × 5 taxonomy module (`biodesignbench/taxonomy.py`)
- the LLM judge for the 28-point rubric portion (`biodesignbench/eval/llm_judge/`)
- all paper figure-generating analysis scripts (`scripts/analysis/`)
- the HuggingFace Space leaderboard backend (`biodesignbench-leaderboard/`)

Anything that would let you reconstruct a task (input files, prompts, ground truth, baseline outputs, results CSVs) is held privately by Romero Lab and served at evaluation time only.
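For a sense of what the tool provider in `biodesignbench/tools/` does, here is a hedged sketch of mapping one tool spec onto the three function-calling formats. The tool spec itself is hypothetical; the provider-side schema shapes are the standard ones for each vendor:

```python
# Hypothetical tool spec, using a JSON-Schema "parameters" block.
TOOL_SPEC = {
    "name": "proteinmpnn_design",
    "description": "Design sequences for a fixed backbone.",
    "parameters": {
        "type": "object",
        "properties": {
            "pdb_path": {"type": "string"},
            "num_sequences": {"type": "integer", "default": 8},
        },
        "required": ["pdb_path"],
    },
}

def to_openai(spec: dict) -> dict:
    # OpenAI tools wrap the spec as {"type": "function", "function": {...}}.
    return {"type": "function", "function": spec}

def to_anthropic(spec: dict) -> dict:
    # Anthropic tools rename "parameters" to "input_schema".
    return {
        "name": spec["name"],
        "description": spec["description"],
        "input_schema": spec["parameters"],
    }

def to_gemini(spec: dict) -> dict:
    # Gemini function declarations keep name / description / parameters as-is.
    return {
        "name": spec["name"],
        "description": spec["description"],
        "parameters": spec["parameters"],
    }
```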
```
BioDesignBench/
├── biodesignbench/                 # Python package
│   ├── taxonomy.py                 # 2 × 5 design matrix (DesignApproach × MolecularSubject)
│   ├── eval/                       # 100-point scoring pipeline
│   │   ├── tier1/                  # Bio-coding tasks (unit-test style)
│   │   ├── tier2/                  # Design tasks (4D metrics + Boltz-2 verification)
│   │   ├── metrics/                # approach / orchestration / quality / etc.
│   │   ├── llm_judge/              # 28-pt LLM judge panel (PoLL with self-exclusion)
│   │   └── pipeline.py             # Top-level orchestration
│   ├── agents/                     # Agent harness
│   │   ├── general_purpose/        # GPT-5, Claude Sonnet, Gemini, DeepSeek wrappers
│   │   ├── bio_specific/           # Biomni / STELLA / BioML wrappers
│   │   └── baselines/              # Hardcoded pipeline + human-expert agent
│   ├── tools/                      # 17-tool MCP provider with mode toggle
│   ├── interventions.py            # Forced-depth & low-diversity intervention specs
│   └── tool_audit.py               # Tool-call trace analysis
├── biodesignbench-leaderboard/     # Gradio HuggingFace Space (backend + UI)
├── scripts/analysis/               # All paper figure / SI analysis scripts (60 files)
├── docker/sandbox/                 # Sandbox image for executing agent-generated code
├── docs/PRD.md                     # Project requirements document
├── pyproject.toml
└── environment.yml
```
```bash
git clone https://github.com/RomeroLab/BioDesignBench.git
cd BioDesignBench

# Conda environment (CPU only; no protein-design GPU tools)
conda env create -f environment.yml
conda activate biodesignbench

# Editable install with optional extras
pip install -e ".[dev,agents]"
```

For the GPU-side protein-design tools (RFdiffusion, ProteinMPNN, Boltz-2, PyRosetta, AF2), install the reference MCP server:

```bash
pip install protein-design-mcp
# Source, Dockerfiles, and Modal deploy template:
# https://github.com/jasonkim8652/protein-design-mcp
```

```bash
cp .env.example .env
# Edit .env and set ANTHROPIC_API_KEY / OPENAI_API_KEY / GOOGLE_API_KEY / DEEPSEEK_API_KEY
```

Quick sanity check of the public package:

```python
from biodesignbench.eval.pipeline import score_design
from biodesignbench.taxonomy import get_category, DesignApproach, MolecularSubject

# 2 × 5 taxonomy
cat = get_category("dn_bnd_001")
print(cat.approach, cat.subject)
# DesignApproach.DE_NOVO, MolecularSubject.BINDER

# Score a hypothetical design (without task data, only the rubric pipeline)
help(score_design)
```

All paper figures and SI analyses are reproducible from the canonical score CSVs (held privately). Each script in `scripts/analysis/` is named after the figure it produces:
```
scripts/analysis/bdb_022_fig2_leaderboard.py          # Figure 2: leaderboard
scripts/analysis/bdb_023_fig3_mode_comparison.py      # Figure 3: coverage–depth dissociation
scripts/analysis/bdb_050_variance_decomposition.py    # Figure 5: variance partition
scripts/analysis/bdb_060_contamination.py             # SI Figure 9: contamination
```
Submissions are accepted through the HuggingFace Space: https://huggingface.co/spaces/RomeroLab-Duke/BioDesignBench-Leaderboard
Unlike most agent benchmarks, submitters do not host an HTTP endpoint. The 76 task descriptions never leave Romero Lab infrastructure. You provide:
- an LLM provider + API key: we run the BioDesignBench agent loop against your chosen model (Anthropic / OpenAI / Google / DeepSeek) inside the leaderboard backend. Your key is scrubbed from our records immediately after the dispatch phase.
- (optional) a custom MCP URL if you want to evaluate your own tool implementations. Otherwise, the agent calls our reference protein-design-mcp endpoint.
Each submission carries a unique canary token embedded as an HTML comment in every task prompt, so we can retrospectively detect leakage if any future model regurgitates it.
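A minimal sketch of how such a canary scheme can work; the helper names and token format here are hypothetical, not the leaderboard's actual code:

```python
import hashlib
import secrets

def make_canary(submission_id: str) -> str:
    # Unique, high-entropy token per submission (format is an assumption).
    seed = submission_id + secrets.token_hex(16)
    return "BDB-CANARY-" + hashlib.sha256(seed.encode()).hexdigest()[:16]

def embed_canary(task_prompt: str, canary: str) -> str:
    # HTML comments don't render, but a model trained on the raw text sees them.
    return f"<!-- {canary} -->\n{task_prompt}"

def leaked(model_output: str, canary: str) -> bool:
    # Retrospective check: a future model regurgitating the token signals leakage.
    return canary in model_output
```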
If you want to benchmark a new tool implementation (a faster structure predictor, a different diffusion backbone, your own stability model) against the same 76 tasks / same scoring rubric used by the paper, stand up an HTTPS endpoint satisfying the MCP contract and paste the URL into the submission form's Advanced: Custom MCP section:
- Contract + hosting options: `biodesignbench-leaderboard/README.md`
- Minimal FastAPI stub (~150 lines): `biodesignbench-leaderboard/example_mcp_server.py`
- Reference implementation to fork: `jasonkim8652/protein-design-mcp` (PyPI: `protein-design-mcp`; Modal deploy template included in `deploy/modal_app.py`)
The MCP server, ours or yours, only ever sees operational tool arguments (sequences, PDB paths, hotspot residues). It never sees the raw task prompt or evaluation criteria.
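For orientation, here is a hedged FastAPI sketch of what a custom endpoint could look like. The route and payload shapes are assumptions for illustration; consult `example_mcp_server.py` for the real contract:

```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class ToolCall(BaseModel):
    tool: str        # e.g. "proteinmpnn_design" (hypothetical tool name)
    arguments: dict  # operational arguments only: sequences, PDB paths, hotspots

@app.post("/call")
def call_tool(req: ToolCall) -> dict:
    # Dispatch to your own implementation; the benchmark never sends the
    # task prompt or evaluation criteria here, only tool arguments.
    if req.tool == "proteinmpnn_design":
        return {"sequences": ["MKT..."]}  # placeholder output
    return {"error": f"unknown tool {req.tool}"}
```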
Rate limit: 1 submission per calendar month per organization. LLM-judge API costs are paid by Romero Lab; please be considerate.
| Phase | Step | Status |
|---|---|---|
| A | Dispatch tasks → CPU scoring (5/6 components) | live |
| B | Boltz-2 structure verification | live (Modal-hosted A10G sidecar) |
| C | LLM-judge panel (28-pt hybrid) | live |
| D | Finalize + publish | live |
See `biodesignbench-leaderboard/README.md` for the Modal companion-app deployment notes.
```bibtex
@article{biodesignbench2026,
  title  = {Evaluating LLM-Driven Protein Design:
            Agents Lack Iterative Evaluation Depth},
  author = {Kim, Jeonghyeon and Romero, Philip},
  year   = {2026},
}
```

Code: MIT. Task content (held privately): not licensed for redistribution.
- Jeonghyeon Kim – jeonghyeon.kim@duke.edu
- Philip Romero – philip.romero@duke.edu