PaperBench is an environment for evaluating language model agents on their ability to replicate machine learning research papers. Based on the PaperBench benchmark from OpenAI, it gives each agent a research paper and requires a complete reproduction: implementing the paper's methods, executing experiments, and generating results that match the original findings. Tasks are drawn from ICML 2024 papers spanning diverse ML topics, with detailed hierarchical rubrics co-developed with the original paper authors.

Replicating a paper end to end exercises:
- Reading and understanding complex ML research papers
- Implementing algorithms and models described in papers from scratch
- Setting up experimental pipelines and running experiments on GPU
- Iterating on code using bash, file viewing, and editing tools
- Producing a self-contained reproduction script (reproduce.sh) and submission repository
Agents in PaperBench are given a sandbox with an NVIDIA L4 GPU, with internet access enabled. The sandbox includes a pre-configured Python virtual environment and Docker for building custom environments if needed.
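As an illustration, a first step an agent might take is to confirm the GPU is visible from inside the sandbox. The sketch below assumes PyTorch is available in the pre-configured virtual environment; if it is not, only the nvidia-smi check applies.

```python
# Illustrative sanity check an agent might run at the start of an episode.
# Assumes PyTorch is installed in the pre-configured virtual environment.
import shutil
import subprocess

# Confirm the NVIDIA driver and the L4 GPU are visible.
if shutil.which("nvidia-smi"):
    print(subprocess.run(["nvidia-smi", "-L"], capture_output=True, text=True).stdout)

try:
    import torch
    print("CUDA available:", torch.cuda.is_available())
    if torch.cuda.is_available():
        print("Device:", torch.cuda.get_device_name(0))
except ImportError:
    print("PyTorch is not installed in this environment yet.")
```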
License: MIT.
There are 2 splits with 23 tasks total, each corresponding to a distinct ICML 2024 paper:
| Split | Tasks | Type | Description |
|---|---|---|---|
| dev | 3 | validation | Small subset for development testing |
| main | 20 | test | Full evaluation set of ICML 2024 papers |
Each task provides the agent with a research paper (in PDF and markdown format), an addendum with clarifications, and a blacklist of resources the agent must not use (e.g., the paper's original codebase). The agent must produce a git repository at /home/agent/submission/ containing source code and a reproduce.sh script.
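As a hedged sketch, an agent could run a simple self-check of this layout before submitting. The check itself is not part of PaperBench; it only verifies the requirements stated above.

```python
# Hypothetical pre-submission check, not part of the environment itself.
# Verifies the layout described above: a git repository at
# /home/agent/submission/ containing a reproduce.sh entry point.
from pathlib import Path

SUBMISSION = Path("/home/agent/submission")

def check_submission() -> list[str]:
    problems = []
    if not (SUBMISSION / ".git").exists():
        problems.append("submission is not a git repository")
    if not (SUBMISSION / "reproduce.sh").is_file():
        problems.append("reproduce.sh is missing")
    return problems

if __name__ == "__main__":
    issues = check_submission()
    print("OK" if not issues else "\n".join(issues))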
This is a sparse reward environment with continuous scoring. The answer tool triggers grading at the end of the episode, while intermediate tool calls (bash, view, str_replace, insert, create) return a reward of 0.
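Schematically, the reward structure looks like the sketch below. The function and variable names are hypothetical; only the behavior (zero reward for intermediate tools, rubric score at the end) comes from the description above.

```python
# Schematic reward structure; names are hypothetical, not the actual API.
def tool_reward(tool_name: str, rubric_root_score: float | None = None) -> float:
    """Intermediate tools yield 0; `answer` ends the episode with the rubric score."""
    if tool_name == "answer":
        assert rubric_root_score is not None
        return rubric_root_score  # continuous value in [0, 1] from the judge
    return 0.0  # bash, view, str_replace, insert, create
```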
Grading uses an LLM-as-judge (o3-mini with high reasoning effort) that evaluates the agent's submission against a hierarchical rubric. The rubric decomposes each paper into fine-grained requirements across three categories:
| Category | Scoring | Description |
|---|---|---|
| Code Development | Binary (0 or 1) | Was the algorithm/method correctly implemented? |
| Code Execution | Binary (0 or 1) | Did the code run successfully via reproduce.sh? |
| Result Analysis | Binary (0 or 1) | Do reproduced results match the paper's findings? |
Parent nodes in the rubric receive weighted averages of their children's scores, propagating up to a single root score between 0 and 1 that serves as the final reward.
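To make the aggregation concrete, here is a minimal sketch of weighted score propagation over a rubric tree. The node structure and field names are illustrative assumptions, not the actual rubric.json schema.

```python
# Illustrative weighted-average propagation over a rubric tree.
# Field names ("weight", "children", "score") are assumptions for this sketch.

def node_score(node: dict) -> float:
    """Return the leaf score, or the weighted average of child scores."""
    children = node.get("children", [])
    if not children:
        return float(node["score"])  # leaf: binary 0 or 1 from the judge
    total_weight = sum(c.get("weight", 1.0) for c in children)
    return sum(c.get("weight", 1.0) * node_score(c) for c in children) / total_weight

# Toy example: two requirements, one satisfied and one not.
rubric = {
    "children": [
        {"weight": 2.0, "score": 1},  # e.g. a Code Development requirement met
        {"weight": 1.0, "score": 0},  # e.g. a Result Analysis requirement unmet
    ]
}
print(node_score(rubric))  # 0.666..., the root score used as the final reward
```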
Each task contains a paper directory with:
- paper.pdf and paper.md — the research paper
- addendum.md — clarifications and scope notes
- blacklist.txt — resources the agent must not access
- rubric.json — hierarchical grading rubric (removed from the agent's sandbox to prevent cheating; used server-side for grading)
Paper data is stored on the OpenReward platform. The rubric is never exposed to the agent.
Agents are given six tools:
- bash: Execute shell commands in the sandbox (with the Python virtualenv auto-activated)
- view: View file contents with optional line range
- str_replace: Find and replace text in files
- insert: Insert content at a specific line number
- create: Create a new file with given content
- answer: Submit the final reproduction for grading (terminates the episode)
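As a rough illustration of the editing tools (a conceptual sketch, not the environment's actual implementation), a str_replace-style edit amounts to an exact text substitution in a file:

```python
# Conceptual sketch of a str_replace-style edit; not PaperBench's implementation.
from pathlib import Path

def str_replace(path: str, old: str, new: str) -> None:
    """Replace the first exact occurrence of `old` with `new` in the file at `path`."""
    file = Path(path)
    text = file.read_text()
    if old not in text:
        raise ValueError(f"text to replace not found in {path}")
    file.write_text(text.replace(old, new, 1))

# Example: fix a hard-coded hyperparameter in a training script.
# str_replace("train.py", "lr = 1e-2", "lr = 3e-4")
```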
PaperBench is a long-horizon, multi-turn environment. Agents must read a paper, implement its methods, run experiments, and iterate on their code. The default reproduction timeout is 12 hours. There is no limit on the number of tool calls; the agent decides when to call answer.
Results from the original PaperBench paper using BasicAgent with a 12-hour time limit (average replication score across 20 papers):
| Model | Score |
|---|---|
| Claude 3.5 Sonnet (New) | 21.0% |
| o1-high | 13.2% |
| DeepSeek-R1 | 6.0% |
| GPT-4o | 4.1% |
| Gemini 2.0 Flash | 3.2% |
| o3-mini-high | 2.6% |
With an iterative agent scaffold (no early exit), o1-high achieves 24.4% and reaches 26.0% with a 36-hour time limit. For comparison, ML PhDs achieved 41.4% on a 3-paper subset after 48 hours of work.
PaperBench requires the following secrets to be passed via the session:
- openai_api_key — Required. Used server-side by the SimpleJudge (o3-mini) for grading, and passed into the sandbox so agents can use the OpenAI API during reproduction.
- hf_token — Required. Passed into the sandbox so agents can download datasets and model weights from HuggingFace.
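Inside the sandbox, agents would typically consume these credentials as environment variables. The variable names below (OPENAI_API_KEY, HF_TOKEN) are assumptions about how the secrets are exposed, not confirmed by the description above.

```python
# Sketch of consuming the session secrets inside the sandbox.
# Assumes they are exposed as OPENAI_API_KEY and HF_TOKEN environment variables;
# these names are an assumption, not part of the PaperBench docs.
import os

from openai import OpenAI
from huggingface_hub import login

# Call the OpenAI API during reproduction (e.g. for a paper that uses LLM calls).
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# Authenticate to Hugging Face so gated datasets and model weights can be downloaded.
login(token=os.environ["HF_TOKEN"])
```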
Agents in PaperBench operate within isolated sandboxes with full internet access enabled (required for downloading datasets and model weights). The sandbox is destroyed after the session ends. The rubric is removed from the agent's environment to prevent the agent from gaming the evaluation. The primary safety consideration is that agents execute arbitrary code with GPU access and network connectivity, which is contained by the sandbox.
```bibtex
@article{starace2025paperbench,
  title={PaperBench: Evaluating AI's Ability to Replicate AI Research},
  author={Starace, Giulio and Jaffe, Oliver and Sherburn, Dane and Aung, James and Chan, Jun Shern and Maksin, Leon and Dias, Rachel and Mays, Evan and Kinsella, Benjamin and Thompson, Wyatt and Heidecke, Johannes and Glaese, Amelia and Patwardhan, Tejal},
  journal={arXiv preprint arXiv:2504.01848},
  year={2025}
}
```