AutoResearchBench


Official repository and reference code for inference and evaluation on AutoResearchBench, the benchmark introduced in "AutoResearchBench: Benchmarking AI Agents on Complex Scientific Literature Discovery".



Quick links: Quick Start · Hugging Face Dataset · Benchmark Data · Repository Map

Requires Python 3.10+. Features: batch inference, deep and wide evaluation, search backends, obfuscated benchmark bundle.

Abstract

Autonomous scientific research has advanced significantly with the development of AI agents. One key step in this process is finding the right scientific literature, whether to explore existing knowledge for a research problem or to acquire evidence for verifying assumptions and supporting claims.

To assess AI agents' capability in driving this process, we present AutoResearchBench, a dedicated benchmark for autonomous scientific literature discovery.

AutoResearchBench consists of two complementary task types:

  • Deep Research: requires tracking down a specific target paper through a progressive, multi-step probing process.
  • Wide Research: requires comprehensively collecting a set of papers satisfying given conditions.

Compared to previous benchmarks on agentic web browsing, AutoResearchBench is distinguished along three dimensions: it is research-oriented, calling for in-depth comprehension of scientific concepts; literature-focused, demanding fine-grained utilization of detailed information; and open-ended, involving an unknown number of qualified papers and thus requiring deliberate reasoning and search throughout. These properties make AutoResearchBench uniquely suited for evaluating autonomous research capabilities, and extraordinarily challenging.

Even the most powerful LLMs, despite having largely conquered general agentic web-browsing benchmarks such as BrowseComp, achieve only 9.39% accuracy on Deep Research and 9.31% IoU on Wide Research, while many other strong baselines fall below 5%. We publicly release the dataset and evaluation pipeline to facilitate future research in this direction.

Figures

Construction pipeline (high-level overview). Vector figure: assets/construction-pipeline.pdf.

Illustrative benchmark cases. Vector figure: assets/autoresearchbench-cases.pdf.

Main experimental results, reported with the DeepXiv search tool (end-to-end systems are evaluated separately under the table's protocol). A raster export of the paper's summary table is included in the repository.

Repository Map

Icon Component Purpose
🚀 run_inference.sh + inference.py Main batch inference entrypoint with .env-driven configuration.
🔎 tool_deepxivsearch.py + tool_websearch.py Search backends for academic retrieval and general web retrieval.
🧠 prompts.py + utils.py Shared prompting logic, model client wiring, and JSONL helpers.
📊 evaluate/evaluate_deep_search.py + evaluate/evaluate_wide_search.py Deep-search judging and wide-search retrieval metrics.
🔓 decrypt_benchmark.py + benchmark_crypto.py Local bundle restoration from the released .obf.json file back to plaintext JSONL.
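
Putting these pieces together, an approximate repository layout is sketched below. The evaluate/, input_data/, and output_data/ paths follow those used in the Quick Start and Benchmark Data sections; root-level placement of the remaining files is an assumption.

.
├── run_inference.sh
├── inference.py
├── tool_deepxivsearch.py
├── tool_websearch.py
├── prompts.py
├── utils.py
├── decrypt_benchmark.py
├── benchmark_crypto.py
├── evaluate/
│   ├── run_evaluate.sh
│   ├── evaluate_deep_search.py
│   └── evaluate_wide_search.py
├── input_data/     (benchmark JSONL inputs)
└── output_data/    (inference outputs consumed by evaluation)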

Quick Start

  1. Install dependencies:
python3 -m pip install -r requirements.txt
  2. Create an environment file:
cp example.env .env
  3. Fill in the required fields in .env (see the filled-in example after this list):
MODEL=your_model_name
OPENAI_API_KEY=your_api_key
OPENAI_API_BASE=your_api_base
INPUT_FILE=input_data/academic_deepsearch_example.jsonl
  4. Run inference:
bash run_inference.sh
  5. Run evaluation:
bash evaluate/run_evaluate.sh deep --input-file output_data/inference_output.jsonl
bash evaluate/run_evaluate.sh wide --input-file output_data/inference_output.jsonl --gt-file path/to/gt.jsonl
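
A filled-in .env might look like the following. This is only an illustrative sketch: the model name, key, and API base are placeholders, and example.env may define additional optional settings beyond the four required fields.

MODEL=gpt-4o
OPENAI_API_KEY=sk-your-key
OPENAI_API_BASE=https://api.openai.com/v1
INPUT_FILE=input_data/academic_deepsearch_example.jsonl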

Benchmark Data

The released benchmark bundle is hosted on the Hugging Face dataset repo Lk123/AutoResearchBench.

1. Download the released bundle

mkdir -p input_data

curl -L \
  -o input_data/AutoResearchBench.jsonl.obf.json \
  https://huggingface.co/datasets/Lk123/AutoResearchBench/resolve/main/AutoResearchBench.jsonl.obf.json

If you mirror the bundle into a private Hugging Face repo, add -H "Authorization: Bearer ${HF_TOKEN}" to the curl command.
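
For example, an authenticated download from a private mirror might look like this (the your-org namespace and HF_TOKEN are placeholders):

curl -L \
  -H "Authorization: Bearer ${HF_TOKEN}" \
  -o input_data/AutoResearchBench.jsonl.obf.json \
  https://huggingface.co/datasets/your-org/AutoResearchBench/resolve/main/AutoResearchBench.jsonl.obf.json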

2. Decrypt it locally

python3 decrypt_benchmark.py \
  --input-file input_data/AutoResearchBench.jsonl.obf.json \
  --output-file input_data/AutoResearchBench.jsonl
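
After decryption, a quick sanity check confirms the output is plaintext JSONL (the exact record count depends on the released bundle):

wc -l input_data/AutoResearchBench.jsonl
head -n 1 input_data/AutoResearchBench.jsonl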

3. Point inference to the decrypted JSONL

INPUT_FILE=input_data/AutoResearchBench.jsonl
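
One way to switch an existing .env over and rerun on the full benchmark (a sketch assuming a single INPUT_FILE line in .env and GNU sed in-place syntax):

sed -i 's|^INPUT_FILE=.*|INPUT_FILE=input_data/AutoResearchBench.jsonl|' .env
bash run_inference.sh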

Note

The released file on Hugging Face is the obfuscated bundle. Run inference on the decrypted .jsonl, not on the .obf.json file directly.

Citation

If you use this benchmark or code, please cite the AutoResearchBench publication when available, and retain the dataset attribution required by the Hugging Face repository license.

License

This repository is released under the Apache License 2.0. See LICENSE for details.

Notes

  • Inference automatically skips questions that already exist in the output JSONL file.
  • run_inference.sh and evaluate/run_evaluate.sh both load configuration from .env by default. Set AUTORESEARCHBENCH_ENV_FILE to use a different environment file.
  • Use --verbose on Python entrypoints when you need detailed debugging logs.
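
For example, assuming AUTORESEARCHBENCH_ENV_FILE is read from the process environment, a run against an alternate configuration might look like this (the experiments/alt.env path is a placeholder):

AUTORESEARCHBENCH_ENV_FILE=experiments/alt.env bash run_inference.sh
AUTORESEARCHBENCH_ENV_FILE=experiments/alt.env bash evaluate/run_evaluate.sh deep --input-file output_data/inference_output.jsonl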
