QUEST

Paper | Website | Hugging Face Collection | Model

Introduction

QUEST is a general-purpose Deep Search Agent designed to handle a wide range of search tasks, with strong capabilities in fact seeking, citation grounding, and report synthesis.

🚀 Updates

  • 14/05/2026: We released QUEST, including model checkpoints, data, and code.


🔗 Resources

All released models and datasets are organized in the Hugging Face collection. You can also try the hosted demo.

Type Resources
35B checkpoints RL, MT+SFT, MT, SFT
30B checkpoints RL, MT+SFT, SFT
Smaller checkpoints 9B, 4B, 2B
Training data RL data, SFT objective data, SFT open-ended data

Model selection note: if you only need objective-task evaluation, we recommend the MT+SFT checkpoints, which perform better on objective benchmarks. For comprehensive evaluation across both objective and open-ended tasks, we recommend the RL checkpoints.

Release note: cached databases and mid-training data are still under legal review and will be released once that review confirms they can be shared.

📊 Results Snapshot

Overall benchmark snapshot. QUEST-35B is compared with leading proprietary and open deep research agents across eight benchmarks covering fact seeking, citation grounding, and report synthesis.

QUEST benchmark snapshot

30B-scale comparison. QUEST-30B is compared against other open research agents of similar scale, highlighting its performance across objective and open-ended benchmarks.

QUEST 30B comparison

Scaling across QUEST checkpoints. Smaller QUEST models show consistent gains as model size increases from 2B to 35B on representative benchmarks.

QUEST smaller model comparison

Training-stage ablation. The training recipe is broken down into vanilla, SFT, mid-training, and RL stages to show how much each stage contributes to final performance.

QUEST training stage comparison

πŸ› οΈ Environment Setup

Create an environment and install the shared runtime dependencies:

pip install -r requirements.txt

This environment is intended for inference, data generation, and evaluation workflows. Training uses separate backend stacks: install SFT dependencies under training_scripts/sft/ according to LlamaFactory requirements, and install RL dependencies under training_scripts/rl/ according to VERL requirements.
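
If you want to keep this runtime isolated from the training stacks, here is a minimal sketch using a Python virtual environment (venv is an assumption here; conda or uv work equally well):

# create and activate an isolated environment for inference, data generation, and evaluation
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt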

Optional local databases and caches used by search, visit, and scholar tools live under the repository-level database/ directory:

database/

These files are not included in the repository. If you do not download existing databases, the search and visit caches are created automatically during runs. Providing prebuilt databases is useful when you want to reuse cached results, reduce external requests, or run workflows that require prepared search/scholar indexes.

βš™οΈ Runtime Configuration

The exact environment variables depend on the workflow. Common groups include:

  • Search: SERPER_KEY_ID (search and scholar fallback)
  • Visit: JINA_API_KEYS (page reading and page summarization)
  • Azure/OpenAI-compatible: API_KEY, API_BASE, AZURE_OPENAI_ENDPOINT, AZURE_OPENAI_API_VERSION, AZURE_OPENAI_DEPLOYMENT (shared legacy and fallback LLM paths)
  • Inference summary and memory: SUMMARY_MODEL_NAME, MEMORY_MODEL_NAME, MEMORY_API_KEY, MEMORY_API_BASE (visit summarization and memory condensation)
  • Reward/eval LLMs: EVAL_LLM_*, CITATION_EVAL_LLM_*, OPENENDED_EVAL_LLM_* (objective, citation, and open-ended reward evaluation)
  • Services: SEARCH_NODES_CONF, SCHOLAR_NODES_CONF, PYTHON_NODES_CONF, EVAL_LLM_NODES_CONF (tool and local eval-node routing)

For inference, see inference/api_config.yaml for the default configuration template. For the full RL backend environment list, see the DeepResearch recipe README. Any <HOST_IP> or [PLACEHOLDER] values in committed configs are examples only. Replace them with the real host IPs, ports, model paths, and credentials for your own inference, evaluation, or training deployment.
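
As an illustration only, the snippet below exports a few of the variables above before a run. Every value is a placeholder, the value formats are assumptions, and the exact set of variables you need depends on the workflow:

# search and page-reading credentials (placeholders, not real keys)
export SERPER_KEY_ID="<your-serper-key>"
export JINA_API_KEYS="<your-jina-key>"
# models used for visit summarization and memory condensation
export SUMMARY_MODEL_NAME="<summary-model-name>"
export MEMORY_MODEL_NAME="<memory-model-name>"
export MEMORY_API_KEY="<memory-api-key>"
export MEMORY_API_BASE="<http://HOST_IP:PORT/v1>"
# service-node routing (assumed here to point at node configuration files)
export SEARCH_NODES_CONF="<path/to/search_nodes.conf>"
export SCHOLAR_NODES_CONF="<path/to/scholar_nodes.conf>"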

🧪 Benchmark Replication

Inference

Use inference/ when you have a model endpoint and want to run benchmark predictions with the QUEST agent.

Before launching, configure:

api_config.yaml
server_endpoints.conf

Then check the benchmark script and update:

DATASET
OUTPUT_PATH
TASK_LOG_DIR
MODEL_PATH
MAX_WORKERS
MEMORY_THRESHOLD
LLM_MAX_TOKENS
API_CONFIG_FILE
SERVER_ENDPOINTS_FILE

Run the benchmark-specific launch script from inference/ after configuration. Endpoint routing is controlled by server_endpoints.conf, which the agent can reload during a run. See inference/README.md for the available launch scripts and benchmark-specific defaults.
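
As a rough sketch, the edited variables at the top of a benchmark launch script might look like the following; all values are arbitrary illustrations, not repository defaults:

# illustrative placeholders; replace with the paths and limits for your own deployment
DATASET="<path/to/benchmark/dataset>"
OUTPUT_PATH="<path/to/prediction/output>"
TASK_LOG_DIR="<path/to/task/logs>"
MODEL_PATH="<served-model-name-or-path>"
MAX_WORKERS=16
MEMORY_THRESHOLD=0.8
LLM_MAX_TOKENS=32768
API_CONFIG_FILE="api_config.yaml"
SERVER_ENDPOINTS_FILE="server_endpoints.conf"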

Evaluation

Evaluation scripts consume prediction directories produced by inference/.

Benchmark Directory
BrowseComp evaluation/browsecomp/
BrowseComp-Plus evaluation/browsecomp_plus/
GAIA evaluation/gaia/
HLE evaluation/hle/
DeepResearch Bench evaluation/drbench/
LiveResearchBench evaluation/liveresearchbench/
Mind2Web2 evaluation/Mind2Web2/
WideSearch evaluation/widesearch/

For a new run, update the target result directory, dataset path, model or run name, judge model, worker count, and judge credentials.

See evaluation/README.md for benchmark-specific commands and notes.

🎯 Mid-training / SFT Training

Use training_scripts/sft for mid-training and supervised fine-tuning workflows. Before training, prepare the mid-training/SFT datasets and convert them to the format expected by LlamaFactory. See Resources for released model checkpoints and datasets.

The SFT backend is based on LlamaFactory. Use its data configuration and training entrypoints under training_scripts/sft/LlamaFactory/ after the datasets are prepared.
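
A minimal launch sketch, assuming the standard LlamaFactory CLI is installed in the SFT environment; the config path below is hypothetical and should be replaced with the prepared QUEST mid-training or SFT config:

cd training_scripts/sft/LlamaFactory
# hypothetical config path; point this at your prepared training config
llamafactory-cli train <path/to/quest_sft_config.yaml>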

πŸƒ Run Training

RL Backend

Use training_scripts/rl as the working directory:

cd training_scripts/rl

The active recipe is:

recipe/deepresearch/

Core files:

Path Purpose
recipe/deepresearch/run_deepresearch_fully_async_megatron.sh Main fully async Megatron launcher
recipe/deepresearch/agent_loop/ Multi-turn research rollout logic
recipe/deepresearch/reward.py Reward routing for objective, citation, and open-ended tasks
recipe/deepresearch/tools/ Search, scholar, visit, Python, memory-related tool implementations
recipe/deepresearch/scripts/ Search/scholar services and FAISS build scripts
recipe/deepresearch/config/ Tool, service-node, eval-node, and trainer configs
recipe/deepresearch/data/ Default train/validation parquet files

Before building FAISS, confirm that the required databases are available:

visit database
search database
scholar database

Also make sure the Python interpreter service is running if the training workers will use the Python tool.

Then build the FAISS indexes:

bash recipe/deepresearch/scripts/init_faiss_search.sh --skip-merge
bash recipe/deepresearch/scripts/init_faiss_scholar.sh --skip-merge

Then start the services:

bash recipe/deepresearch/scripts/run_search_service.sh
bash recipe/deepresearch/scripts/run_scholar_service.sh

Launch training:

bash recipe/deepresearch/run_deepresearch_fully_async_megatron.sh

The full runbook, including environment variables and FAISS setup, is in:

training_scripts/rl/recipe/deepresearch/README.md

🧬 Data Generation

Objective Tasks

Objective tasks use a verifiable rubric-tree pipeline.

High-level flow:

generate trajectories -> merge rubric predictions -> format verifier inputs
-> refine rubric trees -> verify rubric trees -> extract accepted questions

See task/obj_task/README.md for the runnable commands and expected input/output paths.

Objective Verifier Scripts

Generate one Python verifier script per formatted objective task.

See task/obj_eval/README.md for the generation command and expected formatted-task input structure.

Open-Ended Tasks

Open-ended longform generation lives under task/open_ended_task/.

High-level flow:

generate longform tasks -> extract proposed QAs -> generate criteria
-> polish criteria -> generate reference answers -> refine final answers
-> extract final answers

See task/open_ended_task/README.md for the runnable commands and expected input/output paths.

Open-Ended Evaluation

Rubric-based document quality evaluation lives under task/open_ended_eval/:

cd task/open_ended_eval
bash run_eval.sh

It compares an answer against a reference answer across criteria such as comprehensiveness, insight, instruction following, and readability.

πŸ—ΊοΈ Documentation Map

We provide details of each component in the READMEs below.

Area Directory Main README What It Contains
Inference inference/ inference/README.md QUEST inference pipeline
RL backend training_scripts/rl/recipe/deepresearch/ training_scripts/rl/recipe/deepresearch/README.md QUEST RL training recipe
SFT backend training_scripts/sft/ training_scripts/sft/README.md LlamaFactory-based SFT backend
Objective task generation task/obj_task/ task/obj_task/README.md Objective task generation pipeline
Objective verifier scripts task/obj_eval/ task/obj_eval/README.md Objective-task verifier generation
Open-ended task generation task/open_ended_task/ task/open_ended_task/README.md Open-ended task generation pipeline
Open-ended evaluation task/open_ended_eval/ task/open_ended_eval/README.md Open-ended task evaluation pipeline
Evaluation evaluation/ evaluation/README.md Benchmark evaluation scripts

πŸ“ Citation

If our paper or related resources prove valuable to your research, we kindly ask for a citation.

@misc{xie2026quest,
  title={QUEST: Training Frontier Deep Research Agents with Fully Synthetic Tasks},
  author={Xie, Jian and Lin, Tianhe and Wang, Zilu and Ning, Yuting and Yao, Yuekun and Xue, Tianci and Zhang, Zhehao and Li, Zhongyang and Zhang, Kai and Wu, Yufan and Chen, Shijie and Gou, Boyu and Han, Mingzhe and Su, Yu and Sun, Huan},
  journal={arXiv preprint arXiv:XXXX.XXXXX},
  year={2026}
}
