AutoResearchBench


Official repository and reference code for inference and evaluation on AutoResearchBench, the benchmark introduced in "AutoResearchBench: Benchmarking AI Agents on Complex Scientific Literature Discovery".



Quick links: Quick Start · Hugging Face Dataset · Benchmark Data · Repository Map

Requires Python 3.10+. Features: batch inference, deep and wide evaluation, search backends, obfuscated benchmark bundle.

Abstract

Autonomous scientific research has advanced significantly with the development of AI agents. One key step in this process is finding the right scientific literature, whether to explore existing knowledge for a research problem or to acquire evidence for verifying assumptions and supporting claims.

To assess AI agents' capability in driving this process, we present AutoResearchBench, a dedicated benchmark for autonomous scientific literature discovery.

AutoResearchBench consists of two complementary task types:

  • Deep Research: requires tracking down a specific target paper through a progressive, multi-step probing process.
  • Wide Research: requires comprehensively collecting a set of papers satisfying given conditions.

Compared to previous benchmarks on agentic web browsing, AutoResearchBench is distinguished along three dimensions: it is research-oriented, calling for in-depth comprehension of scientific concepts; literature-focused, demanding fine-grained utilization of detailed information; and open-ended, involving an unknown number of qualified papers and thus requiring deliberate reasoning and search throughout. These properties make AutoResearchBench uniquely suited for evaluating autonomous research capabilities, and extraordinarily challenging.

Even the most powerful LLMs, despite having largely conquered general agentic web-browsing benchmarks such as BrowseComp, achieve only 9.39% accuracy on Deep Research and 9.31% IoU on Wide Research, while many other strong baselines fall below 5%. We publicly release the dataset and evaluation pipeline to facilitate future research in this direction.

Figures

Construction pipeline (high-level overview). Vector figure: assets/construction-pipeline.pdf.

Illustrative benchmark cases. Vector figure: assets/autoresearchbench-cases.pdf.

Main experimental results, reported with the DeepXiv search tool (end-to-end systems are evaluated separately under the table's protocol). A raster export of the paper's summary table is included in the repository.

Repository Map

Icon Component Purpose
🚀 run_inference.sh + inference.py Main batch inference entrypoint with .env-driven configuration.
🔎 tool_deepxivsearch.py + tool_websearch.py Search backends for academic retrieval and general web retrieval.
🧠 prompts.py + utils.py Shared prompting logic, model client wiring, and JSONL helpers.
📊 evaluate/evaluate_deep_search.py + evaluate/evaluate_wide_search.py Deep-search judging and wide-search retrieval metrics.
🔓 decrypt_benchmark.py + benchmark_crypto.py Local bundle restoration from the released .obf.json file back to plaintext JSONL.
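
Putting these pieces together, an approximate repository layout is sketched below. The evaluate/, input_data/, and output_data/ paths follow those used in the Quick Start and Benchmark Data sections; root-level placement of the remaining files is an assumption.

.
├── run_inference.sh
├── inference.py
├── tool_deepxivsearch.py
├── tool_websearch.py
├── prompts.py
├── utils.py
├── decrypt_benchmark.py
├── benchmark_crypto.py
├── evaluate/
│   ├── run_evaluate.sh
│   ├── evaluate_deep_search.py
│   └── evaluate_wide_search.py
├── input_data/     (benchmark JSONL inputs)
└── output_data/    (inference outputs consumed by evaluation)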

Quick Start

  1. Install dependencies:
python3 -m pip install -r requirements.txt
  2. Create an environment file:
cp example.env .env
  3. Fill in the required fields in .env (see the filled-in example after this list):
MODEL=your_model_name
OPENAI_API_KEY=your_api_key
OPENAI_API_BASE=your_api_base
INPUT_FILE=input_data/academic_deepsearch_example.jsonl
  4. Run inference:
bash run_inference.sh
  5. Run evaluation:
bash evaluate/run_evaluate.sh deep --input-file output_data/inference_output.jsonl
bash evaluate/run_evaluate.sh wide --input-file output_data/inference_output.jsonl --gt-file path/to/gt.jsonl
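
A filled-in .env might look like the following. This is only an illustrative sketch: the model name, key, and API base are placeholders, and example.env may define additional optional settings beyond the four required fields.

MODEL=gpt-4o
OPENAI_API_KEY=sk-your-key
OPENAI_API_BASE=https://api.openai.com/v1
INPUT_FILE=input_data/academic_deepsearch_example.jsonl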

Benchmark Data

The released benchmark bundle is hosted on the Hugging Face dataset repo Lk123/AutoResearchBench.

1. Download the released bundle

mkdir -p input_data

curl -L \
  -o input_data/AutoResearchBench.jsonl.obf.json \
  https://huggingface.co/datasets/Lk123/AutoResearchBench/resolve/main/AutoResearchBench.jsonl.obf.json

If you mirror the bundle into a private Hugging Face repo, add -H "Authorization: Bearer ${HF_TOKEN}" to the curl command.
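
For example, an authenticated download from a private mirror might look like this (the your-org namespace and HF_TOKEN are placeholders):

curl -L \
  -H "Authorization: Bearer ${HF_TOKEN}" \
  -o input_data/AutoResearchBench.jsonl.obf.json \
  https://huggingface.co/datasets/your-org/AutoResearchBench/resolve/main/AutoResearchBench.jsonl.obf.json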

2. Decrypt it locally

python3 decrypt_benchmark.py \
  --input-file input_data/AutoResearchBench.jsonl.obf.json \
  --output-file input_data/AutoResearchBench.jsonl
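
After decryption, a quick sanity check confirms the output is plaintext JSONL (the exact record count depends on the released bundle):

wc -l input_data/AutoResearchBench.jsonl
head -n 1 input_data/AutoResearchBench.jsonl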

3. Point inference to the decrypted JSONL

INPUT_FILE=input_data/AutoResearchBench.jsonl
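
One way to switch an existing .env over and rerun on the full benchmark (a sketch assuming a single INPUT_FILE line in .env and GNU sed in-place syntax):

sed -i 's|^INPUT_FILE=.*|INPUT_FILE=input_data/AutoResearchBench.jsonl|' .env
bash run_inference.sh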

Note

The released file on Hugging Face is the obfuscated bundle. Run inference on the decrypted .jsonl, not on the .obf.json file directly.

Citation

If you use this benchmark or code, please cite the AutoResearchBench publication when available, and retain the dataset attribution required by the Hugging Face repository license.

License

This repository is released under the Apache License 2.0. See LICENSE for details.

Notes

  • Inference automatically skips questions that already exist in the output JSONL file.
  • run_inference.sh and evaluate/run_evaluate.sh both load configuration from .env by default. Set AUTORESEARCHBENCH_ENV_FILE to use a different environment file.
  • Use --verbose on Python entrypoints when you need detailed debugging logs.
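
For example, assuming AUTORESEARCHBENCH_ENV_FILE is read from the process environment, a run against an alternate configuration might look like this (the experiments/alt.env path is a placeholder):

AUTORESEARCHBENCH_ENV_FILE=experiments/alt.env bash run_inference.sh
AUTORESEARCHBENCH_ENV_FILE=experiments/alt.env bash evaluate/run_evaluate.sh deep --input-file output_data/inference_output.jsonl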
