PaperBench is an environment for evaluating language model agents on their ability to replicate machine learning research papers. Based on the PaperBench benchmark from OpenAI, it gives each agent a research paper and requires a complete reproduction: implementing the paper's methods, executing experiments, and generating results that match the original findings. Tasks are drawn from ICML 2024 papers spanning diverse ML topics, with detailed hierarchical rubrics co-developed with the original paper authors.

Replicating a paper end to end exercises:
- Reading and understanding complex ML research papers
- Implementing algorithms and models described in papers from scratch
- Setting up experimental pipelines and running experiments on GPU
- Iterating on code using bash, file viewing, and editing tools
- Producing a self-contained reproduction script (reproduce.sh) and submission repository
Agents in PaperBench are given a sandbox with an NVIDIA L4 GPU, with internet access enabled. The sandbox includes a pre-configured Python virtual environment and Docker for building custom environments if needed.
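As an illustration, a first step an agent might take is to confirm the GPU is visible from inside the sandbox. The sketch below assumes PyTorch is available in the pre-configured virtual environment; if it is not, only the nvidia-smi check applies.

```python
# Illustrative sanity check an agent might run at the start of an episode.
# Assumes PyTorch is installed in the pre-configured virtual environment.
import shutil
import subprocess

# Confirm the NVIDIA driver and the L4 GPU are visible.
if shutil.which("nvidia-smi"):
    print(subprocess.run(["nvidia-smi", "-L"], capture_output=True, text=True).stdout)

try:
    import torch
    print("CUDA available:", torch.cuda.is_available())
    if torch.cuda.is_available():
        print("Device:", torch.cuda.get_device_name(0))
except ImportError:
    print("PyTorch is not installed in this environment yet.")
```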
License: MIT.
There are 2 splits with 23 tasks total, each corresponding to a distinct ICML 2024 paper:
| Split | Tasks | Type | Description |
|---|---|---|---|
| dev | 3 | validation | Small subset for development testing |
| main | 20 | test | Full evaluation set of ICML 2024 papers |
Each task provides the agent with a research paper (in PDF and markdown format), an addendum with clarifications, and a blacklist of resources the agent must not use (e.g., the paper's original codebase). The agent must produce a git repository at /home/agent/submission/ containing source code and a reproduce.sh script.
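As a hedged sketch, an agent could run a simple self-check of this layout before submitting. The check itself is not part of PaperBench; it only verifies the requirements stated above.

```python
# Hypothetical pre-submission check, not part of the environment itself.
# Verifies the layout described above: a git repository at
# /home/agent/submission/ containing a reproduce.sh entry point.
from pathlib import Path

SUBMISSION = Path("/home/agent/submission")

def check_submission() -> list[str]:
    problems = []
    if not (SUBMISSION / ".git").exists():
        problems.append("submission is not a git repository")
    if not (SUBMISSION / "reproduce.sh").is_file():
        problems.append("reproduce.sh is missing")
    return problems

if __name__ == "__main__":
    issues = check_submission()
    print("OK" if not issues else "\n".join(issues))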
This is a sparse reward environment with continuous scoring. The answer tool triggers grading at the end of the episode, while intermediate tool calls (bash, view, str_replace, insert, create) return a reward of 0.
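Schematically, the reward structure looks like the sketch below. The function and variable names are hypothetical; only the behavior (zero reward for intermediate tools, rubric score at the end) comes from the description above.

```python
# Schematic reward structure; names are hypothetical, not the actual API.
def tool_reward(tool_name: str, rubric_root_score: float | None = None) -> float:
    """Intermediate tools yield 0; `answer` ends the episode with the rubric score."""
    if tool_name == "answer":
        assert rubric_root_score is not None
        return rubric_root_score  # continuous value in [0, 1] from the judge
    return 0.0  # bash, view, str_replace, insert, create
```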
Grading uses an LLM-as-judge (o3-mini with high reasoning effort) that evaluates the agent's submission against a hierarchical rubric. The rubric decomposes each paper into fine-grained requirements across three categories:
| Category | Scoring | Description |
|---|---|---|
| Code Development | Binary (0 or 1) | Was the algorithm/method correctly implemented? |
| Code Execution | Binary (0 or 1) | Did the code run successfully via reproduce.sh? |
| Result Analysis | Binary (0 or 1) | Do reproduced results match the paper's findings? |
Parent nodes in the rubric receive weighted averages of their children's scores, propagating up to a single root score between 0 and 1 that serves as the final reward.
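To make the aggregation concrete, here is a minimal sketch of weighted score propagation over a rubric tree. The node structure and field names are illustrative assumptions, not the actual rubric.json schema.

```python
# Illustrative weighted-average propagation over a rubric tree.
# Field names ("weight", "children", "score") are assumptions for this sketch.

def node_score(node: dict) -> float:
    """Return the leaf score, or the weighted average of child scores."""
    children = node.get("children", [])
    if not children:
        return float(node["score"])  # leaf: binary 0 or 1 from the judge
    total_weight = sum(c.get("weight", 1.0) for c in children)
    return sum(c.get("weight", 1.0) * node_score(c) for c in children) / total_weight

# Toy example: two requirements, one satisfied and one not.
rubric = {
    "children": [
        {"weight": 2.0, "score": 1},  # e.g. a Code Development requirement met
        {"weight": 1.0, "score": 0},  # e.g. a Result Analysis requirement unmet
    ]
}
print(node_score(rubric))  # 0.666..., the root score used as the final reward
```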
Each task contains a paper directory with:
- paper.pdf and paper.md — the research paper
- addendum.md — clarifications and scope notes
- blacklist.txt — resources the agent must not access
- rubric.json — hierarchical grading rubric (removed from the agent's sandbox to prevent cheating; used server-side for grading)
Paper data is stored on the OpenReward platform. The rubric is never exposed to the agent.
Agents are given six tools:
- bash: Execute shell commands in the sandbox (with the Python virtualenv auto-activated)
- view: View file contents with optional line range
- str_replace: Find and replace text in files
- insert: Insert content at a specific line number
- create: Create a new file with given content
- answer: Submit the final reproduction for grading (terminates the episode)
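As a rough illustration of the editing tools (a conceptual sketch, not the environment's actual implementation), a str_replace-style edit amounts to an exact text substitution in a file:

```python
# Conceptual sketch of a str_replace-style edit; not PaperBench's implementation.
from pathlib import Path

def str_replace(path: str, old: str, new: str) -> None:
    """Replace the first exact occurrence of `old` with `new` in the file at `path`."""
    file = Path(path)
    text = file.read_text()
    if old not in text:
        raise ValueError(f"text to replace not found in {path}")
    file.write_text(text.replace(old, new, 1))

# Example: fix a hard-coded hyperparameter in a training script.
# str_replace("train.py", "lr = 1e-2", "lr = 3e-4")
```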
PaperBench is a long-horizon, multi-turn environment. Agents must read a paper, implement its methods, run experiments, and iterate on their code. The default reproduction timeout is 12 hours. There is no limit on the number of tool calls; the agent decides when to call answer.
Results from the original PaperBench paper using BasicAgent with a 12-hour time limit (average replication score across 20 papers):
| Model | Score |
|---|---|
| Claude 3.5 Sonnet (New) | 21.0% |
| o1-high | 13.2% |
| DeepSeek-R1 | 6.0% |
| GPT-4o | 4.1% |
| Gemini 2.0 Flash | 3.2% |
| o3-mini-high | 2.6% |
With an iterative agent scaffold (no early exit), o1-high achieves 24.4% and reaches 26.0% with a 36-hour time limit. For comparison, ML PhDs achieved 41.4% on a 3-paper subset after 48 hours of work.
PaperBench requires the following secrets to be passed via the session:
- openai_api_key — Required. Used server-side by the SimpleJudge (o3-mini) for grading, and passed into the sandbox so agents can use the OpenAI API during reproduction.
- hf_token — Required. Passed into the sandbox so agents can download datasets and model weights from HuggingFace.
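Inside the sandbox, agents would typically consume these credentials as environment variables. The variable names below (OPENAI_API_KEY, HF_TOKEN) are assumptions about how the secrets are exposed, not confirmed by the description above.

```python
# Sketch of consuming the session secrets inside the sandbox.
# Assumes they are exposed as OPENAI_API_KEY and HF_TOKEN environment variables;
# these names are an assumption, not part of the PaperBench docs.
import os

from openai import OpenAI
from huggingface_hub import login

# Call the OpenAI API during reproduction (e.g. for a paper that uses LLM calls).
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# Authenticate to Hugging Face so gated datasets and model weights can be downloaded.
login(token=os.environ["HF_TOKEN"])
```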
Agents in PaperBench operate within isolated sandboxes with full internet access enabled (required for downloading datasets and model weights). The sandbox is destroyed after the session ends. The rubric is removed from the agent's environment to prevent the agent from gaming the evaluation. The primary safety consideration is that agents execute arbitrary code with GPU access and network connectivity, which is contained by the sandbox.
```bibtex
@article{starace2025paperbench,
  title={PaperBench: Evaluating AI's Ability to Replicate AI Research},
  author={Starace, Giulio and Jaffe, Oliver and Sherburn, Dane and Aung, James and Chan, Jun Shern and Maksin, Leon and Dias, Rachel and Mays, Evan and Kinsella, Benjamin and Thompson, Wyatt and Heidecke, Johannes and Glaese, Amelia and Patwardhan, Tejal},
  journal={arXiv preprint arXiv:2504.01848},
  year={2025}
}
```