SPADER

Step-wise Peer Advantage with Diversity-Aware Exploration Rewards for Multi-Answer Question Answering

Abstract

Large language models are increasingly deployed as tool-augmented agents to acquire information beyond parametric knowledge. While recent work has improved long-horizon tool-use reasoning, most approaches focus on tasks with a single correct answer. In contrast, many real-world queries require discovering a comprehensive set of valid answers, a setting known as Multi-Answer QA. This setting raises two challenges: fine-grained credit assignment over long search trajectories and reward alignment for sustained exploration beyond easy high-frequency entities. We propose SPADER, a reinforcement learning framework for long-horizon tool use in Multi-Answer QA. SPADER includes Step-wise Peer Advantage (SPA), a critic-free step-level credit assignment mechanism that aligns parallel trajectories by decision step and estimates advantages from peer returns. It also includes a diversity-aware exploration reward that promotes long-tail entity discovery by upweighting rare findings and downweighting redundant ones. Experiments on QAMPARI, Mintaka, WebQSP, and QUEST show that SPADER generally improves recall and overall F1 over prompting-based agents, outcome-supervised RL methods, and recent step-level supervision approaches. Our code and model weights are available at https://github.com/KhanCold/spader.

SPADER trains tool-augmented search agents for multi-answer QA. The repository keeps the original VERL training stack and adds a SPADER reward manager plus a masked step-level GRPO advantage estimator.

Overview

Component	Scope
Retrieval environment	Local HTTP service for Wikipedia-based search
Training implementation	VERL-based SPADER with step-wise peer advantage (SPA)
Reward design	Diversity-aware exploration reward manager
Evaluation	Lightweight scripts for released model variants
Checkpoints	Qwen3-8B and Llama-3.1-8B SPADER releases

Released Models

Model	Hugging Face Repo	Download
`qwen3-8b-spader`	`KhanCold/qwen3-8b-spader`	`huggingface-cli download KhanCold/qwen3-8b-spader --local-dir spader/models/qwen3-8b-spader`
`llama3-8b-spader`	`KhanCold/llama3-8b-spader`	`huggingface-cli download KhanCold/llama3-8b-spader --local-dir spader/models/llama3-8b-spader`

Repository Layout

examples/sglang_multiturn/
  config/                         # QAMPARI training configs
  qampari/                        # training launch scripts
  search_r1_like/local_dense_retriever/
                                  # local retrieval server
run_test/
  inference_scripts/              # vLLM serving scripts
  run_sh/                         # evaluation launch scripts
  run_eval.py                     # evaluation entrypoint
verl/                             # VERL training framework with SPADER additions

Data

Place training files anywhere on disk and pass them with TRAIN_DATA and VAL_DATA. Training uses VERL-style .parquet files, where each sample should include:

Field	Type	Description
`prompt`	list of messages	Chat messages consumed by the rollout agent
`data_source`	string	Dataset or task name used by the reward function
`reward_model`	dict	Contains `ground_truth` for scoring
`extra_info`	dict, optional	Optional metadata such as `index`

For Multi-Answer QA, reward_model.ground_truth should contain a list of answer entities with aliases:

{
  "ground_truth": [
    ["Entity A", "Alias A"],
    ["Entity B"]
  ]
}

Evaluation uses .jsonl files, one JSON object per line. Each object should include:

Field	Type	Description
`question_text` or `question`	string	Input question
`ground_truth`	list of lists	Gold entities and aliases
`data_source`	string, optional	Dataset or task name

Place evaluation files under data/evaluation by default, or set EVAL_DATA_DIR to another directory. Named evaluation datasets are resolved from EVAL_DATA_DIR. For example, --dataset qampari reads data/evaluation/qampari_test.jsonl. You can also pass an explicit .jsonl path with --dataset /path/to/file.jsonl.

Environment

Create the main training/runtime environment:

conda create --name verl-sglang python=3.12
conda activate verl-sglang
pip install -e ".[sglang,vllm]"
pip install https://github.com/Dao-AILab/flash-attention/releases/download/v2.8.3/flash_attn-2.8.3+cu12torch2.9cxx11abiTRUE-cp312-cp312-linux_x86_64.whl
pip install cachetools

If your CUDA/PyTorch build needs FlashAttention installed manually, install the wheel matching your CUDA, Python, and PyTorch versions. The experiments used a CUDA 12 build.

Create the retrieval environment:

conda env create -f retriever.yml
conda activate retriever
pip install torch==2.8.0 torchaudio==2.8.0 torchvision --index-url https://download.pytorch.org/whl/cu128

Base Models for training

Download the base models to models/:

python - <<'PY'
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="Qwen/Qwen3-8B",
    local_dir="models/Qwen3-8B",
    allow_patterns=["*.json", "*.safetensors", "*.model", "tokenizer*", "*.py"],
    ignore_patterns=["*.bin", "*.h5", "*.ot", "*.msgpack"],
)

snapshot_download(
    repo_id="meta-llama/Llama-3.1-8B-Instruct",
    local_dir="models/Llama-3.1-8B-Instruct",
    allow_patterns=["*.json", "*.safetensors", "tokenizer*", "*.py"],
    ignore_patterns=["*.bin", "*.h5", "*.ot", "*.msgpack"],
)
PY

Training scripts default to the paths above. Override BASE_MODEL if your checkpoints are stored elsewhere.

Retrieval

SPADER uses an HTTP retrieval service at http://127.0.0.1:8000/retrieve.

For QAMPARI, use the Wikipedia 2021 index. Build it from the chunked dump:

mkdir -p data/QAMPARI_wikipedia_2021
wget "https://aggreg-qa.s3.amazonaws.com/chunked_wikipedia.tar.gz" \
  -O data/QAMPARI_wikipedia_2021/chunked_wikipedia.tar.gz
bash examples/sglang_multiturn/search_r1_like/local_dense_retriever/build_qampari_index.sh

Or download the prebuilt index:

mkdir -p data
wget "https://huggingface.co/datasets/KhanCold/QAMPARI_wikipedia_2021/resolve/main/QAMPARI_wikipedia_2021.tar" \
  -O data/QAMPARI_wikipedia_2021.tar
tar -xf data/QAMPARI_wikipedia_2021.tar -C data

Optional BM25 + reranker mode:

huggingface-cli download BAAI/bge-reranker-v2-m3 \
  --local-dir models/bge-reranker-v2-m3 \
  --local-dir-use-symlinks False

conda activate retriever
bash examples/sglang_multiturn/search_r1_like/local_dense_retriever/start_bm25_rerank_server.sh

Smoke test:

curl -X POST "http://localhost:8000/retrieve" \
  -H "Content-Type: application/json" \
  -d '{"queries": ["Mel Brooks directed and produced films"], "topk": 3}'

SPADER RL Training

Run commands from the repository root. All scripts accept normal Hydra overrides after the script name.

Setting	Command
Qwen3-8B GRPO	`bash examples/sglang_multiturn/qampari/run_qwen3-8b_grpo.sh`
Qwen3-8B SPADER	`bash examples/sglang_multiturn/qampari/run_qwen3-8b_spader.sh`
Llama-3.1-8B GRPO	`bash examples/sglang_multiturn/qampari/run_llama3-8b_grpo.sh`
Llama-3.1-8B SPADER	`bash examples/sglang_multiturn/qampari/run_llama3-8b_spader.sh`

Useful overrides:

BASE_MODEL=models/Qwen3-8B \
TRAIN_DATA=/path/to/train.parquet \
VAL_DATA=/path/to/val.parquet \
OUTPUT_ROOT=output \
bash examples/sglang_multiturn/qampari/run_qwen3-8b_spader.sh \
  trainer.n_gpus_per_node=8 \
  trainer.save_freq=100

TRAIN_DATA and VAL_DATA are required by the launch scripts.

SPADER-specific code paths:

Module	Path
Reward manager	`verl/workers/reward_manager/spader_step.py`, registered as `spader_step`
Advantage estimator	`masked_step_grpo` in `verl/trainer/ppo/core_algos.py`
Configs	`examples/sglang_multiturn/config/qampari_*_spader.yaml`

Evaluation

First merge an FSDP actor checkpoint to Hugging Face format if needed:

python -m verl.model_merger merge \
  --backend fsdp \
  --local_dir output/<run_name>/checkpoints/global_step_<step>/actor \
  --target_dir models/qwen3-8b-spader

Serve the model with vLLM:

bash run_test/inference_scripts/run_qwen3_8b_spader.sh

Then evaluate:

EVAL_DATA_DIR=/path/to/evaluation_data \
python run_test/run_eval.py \
  --dataset all \
  --model qwen3-8b-spader \
  --concurrency 20 \
  --analyze

Convenience scripts are available for all released variants:

bash run_test/run_sh/run_eval_qwen3_8b_grpo.sh
bash run_test/run_sh/run_eval_qwen3_8b_spader.sh
bash run_test/run_sh/run_eval_llama3_8b_grpo.sh
bash run_test/run_sh/run_eval_llama3_8b_spader.sh

Results are written to run_test/results/<dataset>/.

Citation

@misc{shi2026spaderstepwisepeeradvantage,
      title={SPADER: Step-wise Peer Advantage with Diversity-Aware Exploration Rewards for Multi-Answer Question Answering}, 
      author={Qiming Shi and Zhaolu Kang and Yunfan Zhou and Di Weng and Yingcai Wu},
      year={2026},
      eprint={2606.00593},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2606.00593}, 
}

Name		Name	Last commit message	Last commit date
Latest commit History 2 Commits
examples/sglang_multiturn		examples/sglang_multiturn
run_test		run_test
verl		verl
.git-blame-ignore-revs		.git-blame-ignore-revs
.gitattributes		.gitattributes
.gitignore		.gitignore
.pre-commit-config.yaml		.pre-commit-config.yaml
CONTRIBUTING.md		CONTRIBUTING.md
LICENSE		LICENSE
Notice.txt		Notice.txt
README.md		README.md
pyproject.toml		pyproject.toml
requirements-cuda.txt		requirements-cuda.txt
requirements.txt		requirements.txt
requirements_sglang.txt		requirements_sglang.txt
retriever-win.yml		retriever-win.yml
retriever.yml		retriever.yml
setup.py		setup.py
verl-sglang.yml		verl-sglang.yml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

SPADER

Abstract

Overview

Released Models

Repository Layout

Data

Environment

Base Models for training

Retrieval

SPADER RL Training

Evaluation

Citation

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

SPADER

Abstract

Overview

Released Models

Repository Layout

Data

Environment

Base Models for training

Retrieval

SPADER RL Training

Evaluation

Citation

About

Resources

License

Contributing

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages