CuSearch is a lightweight trajectory-selection layer for RLVR / GRPO agentic RAG: from the N·G rollouts generated per step, it keeps K for the policy update, selected by search depth (the number of well-formed searches with non-empty retrieval), reallocating rollout budget toward deeper trajectories via Search-Depth Greedy Allocation (SDGA). SDGA comes in two variants: SDGA-Auto (implicit curriculum) and SDGA-Phase (explicit phase threshold). Rewards, model, and environment stay unchanged.
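The selection step can be pictured with a minimal Python sketch. Names such as Trajectory, search_depth, and sdga_select are hypothetical illustrations, not the repo's API, and the greedy/phase logic below is one plausible reading of the description above:

# Minimal sketch of SDGA trajectory selection (illustrative; not the repo's API).
from dataclasses import dataclass
from typing import List

@dataclass
class Trajectory:
    search_depth: int  # count of well-formed searches with non-empty retrieval
    # ... the real rollout object also carries tokens, rewards, etc.

def sdga_select(rollouts: List[Trajectory], k: int,
                variant: str = "auto", phase_threshold: int = 1) -> List[Trajectory]:
    """Keep k of the N*G rollouts per step, biased toward deeper searches."""
    if variant == "auto":
        # Implicit curriculum: greedily keep the deepest trajectories.
        return sorted(rollouts, key=lambda t: t.search_depth, reverse=True)[:k]
    if variant == "phase":
        # Explicit phase: prefer trajectories at or above the current depth
        # threshold, then top up with the deepest of the remainder.
        above = [t for t in rollouts if t.search_depth >= phase_threshold]
        below = sorted((t for t in rollouts if t.search_depth < phase_threshold),
                       key=lambda t: t.search_depth, reverse=True)
        return (above + below)[:k]
    raise ValueError(f"unknown SDGA variant: {variant}")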
On ZeroSearch (Qwen2.5-3B), SDGA-Phase improves average EM by up to +11.8 points over standard GRPO-Full, with consistent gains under Search-R1 and the other reported settings.
conda create -n cusearch python=3.9
conda activate cusearch
conda install -c conda-forge pyarrow pandas numpy
pip install torch==2.4.0 --index-url https://download.pytorch.org/whl/cu121
pip install vllm==0.6.3 --no-build-isolation
pip install "wandb<0.13"
pip install serpapi
# verl
pip install -e . --no-build-isolation
# flash attention 2
# pip3 install flash-attn --no-build-isolation
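After installing, a quick sanity check confirms the pinned stack imports cleanly (a minimal sketch, nothing repo-specific):

# Verify the pinned dependencies import and see the GPU.
import torch
import vllm

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("vllm:", vllm.__version__)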
(1) Download the training dataset.
huggingface-cli download --repo-type dataset --resume-download Alibaba-NLP/ZeroSearch_dataset --local-dir ZeroSearch_dataset
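Optionally, inspect the downloaded data with the pyarrow/pandas stack installed above. A hedged sketch: the exact file layout under ZeroSearch_dataset/ is an assumption, so adjust the glob if needed.

# Peek at the training data (file layout is an assumption; adjust the glob).
import glob
import pandas as pd

files = glob.glob("ZeroSearch_dataset/**/*.parquet", recursive=True)
if not files:
    raise SystemExit("no parquet files found; check the download path")
df = pd.read_parquet(files[0])
print(df.shape)
print(df.head())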
(2) Download the simulation LLMs.
# Simulation LLMs are available in different parameter sizes. Choose the one that best suits your needs; the 7B version is recommended for its stable and reliable simulation performance.
huggingface-cli download --resume-download Alibaba-NLP/Simulation_LLM_google_7B_V2 --local-dir Simulation_LLM_google_7B
(3) Launch a local simulation server.
# Prompt-based simulation
bash ./simulator_vLLM.sh
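Before kicking off training, you can check that the simulator is reachable. This sketch assumes simulator_vLLM.sh exposes vLLM's OpenAI-compatible endpoint on localhost:8000; read the script for the actual host, port, and model name.

# Ping the local simulation server (endpoint and port are assumptions;
# see simulator_vLLM.sh for the values actually used).
import json
import urllib.request

with urllib.request.urlopen("http://localhost:8000/v1/models", timeout=5) as r:
    print(json.load(r))  # should list the simulation LLM if the server is up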
(4) Conduct RL training (default script: GRPO + optional CuSearch / SDGA).
# Activate the same conda env as in Dependencies (section above)
conda activate cusearch
# Real Google / SerpAPI search (only if you use a non-simulated retriever in your config)
export SER_API_KEY=your_api_key
# CuSearch knobs live in train_grpo.sh: enable_sdga, sdga_variant (auto | phase), etc.
# START_THRESHOLD / END_THRESHOLD there control the simulated-retrieval curriculum schedule
# (interpolated in the agent loop; see llm_agent/generation.py and the sketch below),
# not the SDGA depth allocator.
## Prompt-based simulation
bash train_grpo.sh
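For intuition, the threshold schedule can be sketched as a simple interpolation over training steps. This is illustrative only: the actual shape and parameters are whatever llm_agent/generation.py implements.

# Illustrative curriculum schedule for the simulated-retrieval threshold.
# A plain linear ramp is shown; the real schedule lives in llm_agent/generation.py.
def curriculum_threshold(step: int, total_steps: int,
                         start: float, end: float) -> float:
    frac = min(max(step / max(total_steps, 1), 0.0), 1.0)
    return start + frac * (end - start)

# e.g., ramping from START_THRESHOLD=0.0 to END_THRESHOLD=0.5 over 200 steps
for s in (0, 50, 100, 200):
    print(s, curriculum_threshold(s, 200, 0.0, 0.5))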
Figure 1 summarizes Exact Match (EM, %) on seven open-domain QA benchmarks (NQ, TriviaQA, PopQA, HotpotQA, 2WikiMultiHopQA, MuSiQue, Bamboogle) under veRL + GRPO with ZeroSearch / Search-R1-style setups. Figures 2–3 show search-depth behavior and training curves.
Figure 2. Average valid search count during training and within-batch search-depth distribution shift.
Figure 3. ZeroSearch + Qwen2.5-3B: EM on NQ, mean reward, and average search depth vs. training step.
This repository builds on the ZeroSearch training stack and ecosystem (including simulated retrieval workflows) and the veRL RL training framework used throughout the paper’s experiments. We thank the authors of ZeroSearch (Sun et al., 2025), Search-R1 (Jin et al., 2025), veRL, and RAGEN for open-sourcing strong baselines and infrastructure.
For questions about this codebase or the method, please open an issue in the repository.
If this work is helpful, please cite the CuSearch paper (venue/version subject to camera-ready updates):
@misc{cusearch2026curriculum,
  title={CuSearch: Curriculum Rollout Sampling via Search Depth for Agentic {RAG}},
  author={Shen, Jianghan and Luo, Siqi and Cheng, Xinyu and Xiong, Jing and Li, Yue and Liu, Jiyao and Lin, Jiashi and Chen, Yirong and He, Junjun},
  howpublished={Manuscript under review},
  year={2026}
}

If you use the ZeroSearch data/simulator pipeline, please also cite:
@article{sun2025zerosearch,
  title={ZeroSearch: Incentivize the Search Capability of LLMs without Searching},
  author={Sun, Hao and Qiao, Zile and Guo, Jiayan and Fan, Xuanbo and Hou, Yingyan and Jiang, Yong and Xie, Pengjun and Huang, Fei and Zhang, Yan},
  journal={arXiv preprint arXiv:2505.04588},
  year={2025}
}


