Step-wise Peer Advantage with Diversity-Aware Exploration Rewards for Multi-Answer Question Answering
Models | Data | Environment | Retrieval | Training | Evaluation
Large language models are increasingly deployed as tool-augmented agents to acquire information beyond parametric knowledge. While recent work has improved long-horizon tool-use reasoning, most approaches focus on tasks with a single correct answer. In contrast, many real-world queries require discovering a comprehensive set of valid answers, a setting known as Multi-Answer QA. This setting raises two challenges: fine-grained credit assignment over long search trajectories and reward alignment for sustained exploration beyond easy high-frequency entities. We propose SPADER, a reinforcement learning framework for long-horizon tool use in Multi-Answer QA. SPADER includes Step-wise Peer Advantage (SPA), a critic-free step-level credit assignment mechanism that aligns parallel trajectories by decision step and estimates advantages from peer returns. It also includes a diversity-aware exploration reward that promotes long-tail entity discovery by upweighting rare findings and downweighting redundant ones. Experiments on QAMPARI, Mintaka, WebQSP, and QUEST show that SPADER generally improves recall and overall F1 over prompting-based agents, outcome-supervised RL methods, and recent step-level supervision approaches. Our code and model weights are available at https://github.com/KhanCold/spader.
SPADER trains tool-augmented search agents for multi-answer QA. The repository keeps the original VERL training stack and adds a SPADER reward manager plus a masked step-level GRPO advantage estimator.
| Component | Scope |
|---|---|
| Retrieval environment | Local HTTP service for Wikipedia-based search |
| Training implementation | VERL-based SPADER with step-wise peer advantage (SPA) |
| Reward design | Diversity-aware exploration reward manager |
| Evaluation | Lightweight scripts for released model variants |
| Checkpoints | Qwen3-8B and Llama-3.1-8B SPADER releases |
| Model | Hugging Face Repo | Download |
|---|---|---|
qwen3-8b-spader |
KhanCold/qwen3-8b-spader |
huggingface-cli download KhanCold/qwen3-8b-spader --local-dir spader/models/qwen3-8b-spader |
llama3-8b-spader |
KhanCold/llama3-8b-spader |
huggingface-cli download KhanCold/llama3-8b-spader --local-dir spader/models/llama3-8b-spader |
examples/sglang_multiturn/
config/ # QAMPARI training configs
qampari/ # training launch scripts
search_r1_like/local_dense_retriever/
# local retrieval server
run_test/
inference_scripts/ # vLLM serving scripts
run_sh/ # evaluation launch scripts
run_eval.py # evaluation entrypoint
verl/ # VERL training framework with SPADER additions
Place training files anywhere on disk and pass them with TRAIN_DATA and VAL_DATA.
Training uses VERL-style .parquet files, where each sample should include:
| Field | Type | Description |
|---|---|---|
prompt |
list of messages | Chat messages consumed by the rollout agent |
data_source |
string | Dataset or task name used by the reward function |
reward_model |
dict | Contains ground_truth for scoring |
extra_info |
dict, optional | Optional metadata such as index |
For Multi-Answer QA, reward_model.ground_truth should contain a list of answer entities with aliases:
{
"ground_truth": [
["Entity A", "Alias A"],
["Entity B"]
]
}Evaluation uses .jsonl files, one JSON object per line. Each object should include:
| Field | Type | Description |
|---|---|---|
question_text or question |
string | Input question |
ground_truth |
list of lists | Gold entities and aliases |
data_source |
string, optional | Dataset or task name |
Place evaluation files under data/evaluation by default, or set EVAL_DATA_DIR to another directory.
Named evaluation datasets are resolved from EVAL_DATA_DIR.
For example, --dataset qampari reads data/evaluation/qampari_test.jsonl.
You can also pass an explicit .jsonl path with --dataset /path/to/file.jsonl.
Create the main training/runtime environment:
conda create --name verl-sglang python=3.12
conda activate verl-sglang
pip install -e ".[sglang,vllm]"
pip install https://github.com/Dao-AILab/flash-attention/releases/download/v2.8.3/flash_attn-2.8.3+cu12torch2.9cxx11abiTRUE-cp312-cp312-linux_x86_64.whl
pip install cachetoolsIf your CUDA/PyTorch build needs FlashAttention installed manually, install the wheel matching your CUDA, Python, and PyTorch versions. The experiments used a CUDA 12 build.
Create the retrieval environment:
conda env create -f retriever.yml
conda activate retriever
pip install torch==2.8.0 torchaudio==2.8.0 torchvision --index-url https://download.pytorch.org/whl/cu128Download the base models to models/:
python - <<'PY'
from huggingface_hub import snapshot_download
snapshot_download(
repo_id="Qwen/Qwen3-8B",
local_dir="models/Qwen3-8B",
allow_patterns=["*.json", "*.safetensors", "*.model", "tokenizer*", "*.py"],
ignore_patterns=["*.bin", "*.h5", "*.ot", "*.msgpack"],
)
snapshot_download(
repo_id="meta-llama/Llama-3.1-8B-Instruct",
local_dir="models/Llama-3.1-8B-Instruct",
allow_patterns=["*.json", "*.safetensors", "tokenizer*", "*.py"],
ignore_patterns=["*.bin", "*.h5", "*.ot", "*.msgpack"],
)
PYTraining scripts default to the paths above. Override BASE_MODEL if your checkpoints are stored elsewhere.
SPADER uses an HTTP retrieval service at http://127.0.0.1:8000/retrieve.
For QAMPARI, use the Wikipedia 2021 index. Build it from the chunked dump:
mkdir -p data/QAMPARI_wikipedia_2021
wget "https://aggreg-qa.s3.amazonaws.com/chunked_wikipedia.tar.gz" \
-O data/QAMPARI_wikipedia_2021/chunked_wikipedia.tar.gz
bash examples/sglang_multiturn/search_r1_like/local_dense_retriever/build_qampari_index.shOr download the prebuilt index:
mkdir -p data
wget "https://huggingface.co/datasets/KhanCold/QAMPARI_wikipedia_2021/resolve/main/QAMPARI_wikipedia_2021.tar" \
-O data/QAMPARI_wikipedia_2021.tar
tar -xf data/QAMPARI_wikipedia_2021.tar -C dataOptional BM25 + reranker mode:
huggingface-cli download BAAI/bge-reranker-v2-m3 \
--local-dir models/bge-reranker-v2-m3 \
--local-dir-use-symlinks False
conda activate retriever
bash examples/sglang_multiturn/search_r1_like/local_dense_retriever/start_bm25_rerank_server.shSmoke test:
curl -X POST "http://localhost:8000/retrieve" \
-H "Content-Type: application/json" \
-d '{"queries": ["Mel Brooks directed and produced films"], "topk": 3}'Run commands from the repository root. All scripts accept normal Hydra overrides after the script name.
| Setting | Command |
|---|---|
| Qwen3-8B GRPO | bash examples/sglang_multiturn/qampari/run_qwen3-8b_grpo.sh |
| Qwen3-8B SPADER | bash examples/sglang_multiturn/qampari/run_qwen3-8b_spader.sh |
| Llama-3.1-8B GRPO | bash examples/sglang_multiturn/qampari/run_llama3-8b_grpo.sh |
| Llama-3.1-8B SPADER | bash examples/sglang_multiturn/qampari/run_llama3-8b_spader.sh |
Useful overrides:
BASE_MODEL=models/Qwen3-8B \
TRAIN_DATA=/path/to/train.parquet \
VAL_DATA=/path/to/val.parquet \
OUTPUT_ROOT=output \
bash examples/sglang_multiturn/qampari/run_qwen3-8b_spader.sh \
trainer.n_gpus_per_node=8 \
trainer.save_freq=100TRAIN_DATA and VAL_DATA are required by the launch scripts.
SPADER-specific code paths:
| Module | Path |
|---|---|
| Reward manager | verl/workers/reward_manager/spader_step.py, registered as spader_step |
| Advantage estimator | masked_step_grpo in verl/trainer/ppo/core_algos.py |
| Configs | examples/sglang_multiturn/config/qampari_*_spader.yaml |
First merge an FSDP actor checkpoint to Hugging Face format if needed:
python -m verl.model_merger merge \
--backend fsdp \
--local_dir output/<run_name>/checkpoints/global_step_<step>/actor \
--target_dir models/qwen3-8b-spaderServe the model with vLLM:
bash run_test/inference_scripts/run_qwen3_8b_spader.shThen evaluate:
EVAL_DATA_DIR=/path/to/evaluation_data \
python run_test/run_eval.py \
--dataset all \
--model qwen3-8b-spader \
--concurrency 20 \
--analyzeConvenience scripts are available for all released variants:
bash run_test/run_sh/run_eval_qwen3_8b_grpo.sh
bash run_test/run_sh/run_eval_qwen3_8b_spader.sh
bash run_test/run_sh/run_eval_llama3_8b_grpo.sh
bash run_test/run_sh/run_eval_llama3_8b_spader.shResults are written to run_test/results/<dataset>/.
@misc{shi2026spaderstepwisepeeradvantage,
title={SPADER: Step-wise Peer Advantage with Diversity-Aware Exploration Rewards for Multi-Answer Question Answering},
author={Qiming Shi and Zhaolu Kang and Yunfan Zhou and Di Weng and Yingcai Wu},
year={2026},
eprint={2606.00593},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2606.00593},
}