SearchSkill-SFT-7B-Instruct | SearchSkill-SFT-7B-Base | SearchSkill-RL-7B-Instruct-GRPO | SearchSkill-RL-7B-Base-GRPO
SearchSkill teaches language models to use search tools through an evolving SkillBank. This repository keeps the code and data needed to inspect the pipeline, rebuild training data, train SFT/RL policies, and evaluate on the included dev/full benchmark splits.
After two-stage SFT, SearchSkill improves multi-hop exact match (EM) over search-agent baselines built on the same Qwen2.5-Instruct backbones. The figures below compare results on HotpotQA, 2WikiMultiHopQA, MuSiQue, and Bamboogle; Avg. is the macro average over these four multi-hop benchmarks.
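To make the Avg. column concrete: a macro average weights each benchmark equally, regardless of how many questions it contains. The following is a minimal sketch; the dataset keys and the scores are placeholders chosen for illustration, not results from this repository.

```python
from statistics import fmean

# The four multi-hop benchmarks named above (short keys are an assumption).
MULTIHOP = ("hotpotqa", "2wiki", "musique", "bamboogle")


def macro_average(em_by_dataset):
    """Unweighted mean of per-dataset exact-match scores."""
    missing = [name for name in MULTIHOP if name not in em_by_dataset]
    if missing:
        raise KeyError(f"missing scores for: {missing}")
    return fmean([em_by_dataset[name] for name in MULTIHOP])


# Placeholder numbers for illustration only, not reported results.
example = {"hotpotqa": 40.0, "2wiki": 30.0, "musique": 20.0, "bamboogle": 50.0}
print(macro_average(example))  # 35.0
```

Because each dataset counts once, a small benchmark such as Bamboogle influences Avg. as much as a large one.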
- data_preparation/: sampled training data, profiling reports, and sampling scripts.
- skill_bank/: seed bank, four evolution rounds, final SkillBank, and the policy eval script.
- teacher_trajectory/: teacher rollout code and released canonical trajectories.
- supervised_finetuning/: SFT data builders, LoRA training wrappers, and merge script.
- reinforcement_learning/: RL data builder, training wrappers, and benchmark evaluation launcher.
- benchmarks/: public dev/ and full/ test JSONL files.
- external/runtime_patch/: SearchSkill-specific files to overlay into a compatible VERL-style RL runtime.
Create and activate a conda environment. Install the PyTorch build that matches your CUDA version, then install the project dependencies.
git clone https://github.com/HIT-HJC/SearchSkill.git
cd SearchSkill
conda create -n searchskill python=3.10 -y
conda activate searchskill
python -m pip install --upgrade pip
# Example for CUDA 12.1. Replace this line with the correct PyTorch command for your machine.
# python -m pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
python -m pip install -r requirements.txt
Set common paths:
export SEARCHSKILL_ROOT="$(pwd)"
export ROOT="$SEARCHSKILL_ROOT"
export PYTHON_BIN="$(command -v python)"
export HF_MODELS="<directory_containing_Qwen2.5_models_and_e5-base-v2>"
export HF_DATA="<dataset_mirror_root>"
export HF_CACHE="<hf_cache_root>"
For evaluation-only use, install PyTorch first, then install requirements-eval.txt in the same environment in place of the full dependency file:
python -m pip install -r requirements-eval.txt
SearchSkill uses an HTTP retriever endpoint during teacher rollout, RL training, and evaluation. Start it before running any command that calls search.
export E5_INDEX_PATH="<retriever_index_dir>"
export E5_CORPUS_PATH="<retriever_corpus_jsonl>"
export RETRIEVER_HOST="127.0.0.1"
export RETRIEVER_PORT="8000"
export FAISS_GPU=0 # set to 1 for a GPU FAISS index
bash data_preparation/samples/trajectory_pruning/start_local_retriever.sh
The retriever corpus should contain either a contents field formatted as title\ntext, or separate title and text fields. The retriever wrapper loads $HF_MODELS/e5-base-v2; install a GPU-enabled FAISS build separately if you set FAISS_GPU=1. The evaluation launcher checks the /retrieve endpoint before running.
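The two accepted corpus layouts can be illustrated with a small normalizer. This is a sketch of the format described above, not the repository's loader; the helper name split_record is hypothetical.

```python
def split_record(record):
    """Return (title, text) from either corpus layout described above."""
    if "contents" in record:
        # Layout 1: a single field, title on the first line, text after it.
        title, _, text = record["contents"].partition("\n")
        return title, text
    if "title" in record and "text" in record:
        # Layout 2: separate fields.
        return record["title"], record["text"]
    raise ValueError("record needs 'contents' or both 'title' and 'text'")


combined = {"contents": "Paris\nCapital of France."}
separate = {"title": "Paris", "text": "Capital of France."}
print(split_record(combined) == split_record(separate))  # True
```

Splitting on the first newline only keeps any later line breaks inside the passage text.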
Key retriever files:
- data_preparation/samples/trajectory_pruning/start_local_retriever.sh
- external/runtime_patch/search_r1/search/retrieval_server.py
- external/runtime_patch/search_r1/search/retrieval.py
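Since every search-calling command depends on the retriever being up, a quick reachability check can save a failed run. The sketch below only tests whether the TCP port accepts connections. It does not send a retrieval request, because the request schema is not described here. The helper names are hypothetical.

```python
import os
import socket


def retrieve_url(host, port):
    """Build the URL of the /retrieve endpoint mentioned above."""
    return f"http://{host}:{port}/retrieve"


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, int(port)), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # Defaults mirror the RETRIEVER_HOST / RETRIEVER_PORT exports above.
    host = os.environ.get("RETRIEVER_HOST", "127.0.0.1")
    port = os.environ.get("RETRIEVER_PORT", "8000")
    state = "reachable" if port_is_open(host, port) else "not reachable"
    print(f"{retrieve_url(host, port)} is {state}")
```

An open port does not guarantee that the index has finished loading, so treat this as a first check only.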
The repository already includes released samples and benchmark splits. To rebuild the sampled training data:
bash data_preparation/run_singlehop_sampling.sh
bash data_preparation/run_multihop_sampling.sh
Key code:
- data_preparation/sample_singlehop_train.py
- data_preparation/sample_multihop_train.py
- data_preparation/samples/*/train_sample_*.jsonl
- benchmarks/dev/*.jsonl
- benchmarks/full/*.jsonl
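The released samples and benchmark splits are JSONL files, so a quick inspection needs only the standard library. The sketch below assumes each non-empty line is a JSON object and makes no assumption about field names; summarize_jsonl is a hypothetical helper.

```python
import json
from collections import Counter
from pathlib import Path


def summarize_jsonl(path):
    """Return (record_count, {field_name: occurrences}) for a JSONL file."""
    count = 0
    fields = Counter()
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            record = json.loads(line)
            count += 1
            fields.update(record.keys())
    return count, dict(fields)


if __name__ == "__main__":
    # Run from the repository root; prints nothing if the directory is absent.
    for path in sorted(Path("benchmarks/dev").glob("*.jsonl")):
        total, fields = summarize_jsonl(path)
        print(path.name, total, sorted(fields))
```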
The final released bank is:
skill_bank/round_4_musique/outputs/final_skill_bank.md
To rebuild the four public evolution rounds, set an OpenAI-compatible API endpoint and run:
export OPENAI_API_KEY="<your_key>"
export OPENAI_BASE_URL="${OPENAI_BASE_URL:-https://api.openai.com/v1}"
python skill_bank/round_1_singlehop/build_packets.py
python skill_bank/round_1_singlehop/run_b1_expand.py --base-url "$OPENAI_BASE_URL"
python skill_bank/round_2_hotpotqa/build_packets.py
python skill_bank/round_2_hotpotqa/run_b2_expand.py --base-url "$OPENAI_BASE_URL"
python skill_bank/round_3_2wiki/build_packets.py
python skill_bank/round_3_2wiki/run_b3_expand.py --base-url "$OPENAI_BASE_URL"
python skill_bank/round_4_musique/build_packets.py
python skill_bank/round_4_musique/run_b4_expand.py --base-url "$OPENAI_BASE_URL"
Key code:
- skill_bank/inputs/seed_skill_bank.md
- skill_bank/round_*/build_packets.py
- skill_bank/round_*/run_b*_expand.py
- skill_bank/round_4_musique/outputs/final_skill_bank.md
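The eight rebuild commands follow a regular pattern: each round runs build_packets.py and then run_b<N>_expand.py. The sketch below only constructs that command list, which is useful for a dry run or for driving the rounds from one script. The helper evolution_commands is hypothetical and not part of the repository.

```python
import os
import shlex

# Round directories exactly as they appear in the commands above.
ROUNDS = ["round_1_singlehop", "round_2_hotpotqa", "round_3_2wiki", "round_4_musique"]


def evolution_commands(base_url, python_bin="python"):
    """Return the eight commands, in order, as argument lists."""
    commands = []
    for index, name in enumerate(ROUNDS, start=1):
        round_dir = f"skill_bank/{name}"
        commands.append([python_bin, f"{round_dir}/build_packets.py"])
        commands.append(
            [python_bin, f"{round_dir}/run_b{index}_expand.py", "--base-url", base_url]
        )
    return commands


if __name__ == "__main__":
    url = os.environ.get("OPENAI_BASE_URL", "https://api.openai.com/v1")
    for command in evolution_commands(url):
        print(shlex.join(command))  # dry run: print only, do not execute
```

To execute rather than print, each list could be passed to subprocess.run(command, check=True), so that a failed round stops the sequence.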
The released canonical trajectory file is:
teacher_trajectory/data/canonical_trajectories.jsonl
To run a small teacher rollout, keep the retriever server running and execute:
python teacher_trajectory/src/build_manifest.py \
--output-dir teacher_trajectory/work/manifest \
--train-datasets hotpotqa \
--hotpot-count 20 \
--nq-count 0 --triviaqa-count 0 --2wiki-count 0 --musique-count 0 --failure-count 0
python teacher_trajectory/src/run_teacher_rollout.py \
--manifest-path teacher_trajectory/work/manifest/manifest.jsonl \
--output-dir teacher_trajectory/work/rollout \
--skill-bank-path skill_bank/round_4_musique/outputs/final_skill_bank.md \
--base-url "$OPENAI_BASE_URL" \
--retriever-host "$RETRIEVER_HOST" \
--retriever-port "$RETRIEVER_PORT" \
--max-examples 20 \
--resume
Key code:
- teacher_trajectory/src/build_manifest.py
- teacher_trajectory/src/run_teacher_rollout.py
- teacher_trajectory/src/merge_rollout_outputs.py
- teacher_trajectory/src/build_canonical_teacher_set.py
- teacher_trajectory/src/pack_sft.py
The checked-in SFT data can be used directly. To rebuild it from the released canonical trajectories:
python supervised_finetuning/scripts/build_stage1_dataset.py \
--input-path teacher_trajectory/data/canonical_trajectories.jsonl \
--output-dir supervised_finetuning/data/stage1
python supervised_finetuning/scripts/build_stage2_dataset.py \
--input-train supervised_finetuning/data/stage1/train.jsonl \
--input-eval supervised_finetuning/data/stage1/eval.jsonl \
--skill-bank-path skill_bank/round_4_musique/outputs/final_skill_bank.md \
--output-dir supervised_finetuning/data/stage2
Train the two-stage 7B instruct policy:
export HF_MODELS="<directory_containing_Qwen2.5_models>"
export CUDA_VISIBLE_DEVICES=0,1,2,3
export NPROC_PER_NODE=4
bash supervised_finetuning/scripts/train_stage1_7b_instruct.sh
bash supervised_finetuning/scripts/train_stage2_7b_instruct.sh
Merge the stage-two LoRA adapter into a dense checkpoint:
python supervised_finetuning/scripts/merge_lora.py \
--base-model-path "$HF_MODELS/Qwen2.5-7B-Instruct" \
--adapter-path supervised_finetuning/models/stage2 \
--output-dir supervised_finetuning/models/stage2_7b_instruct_merged \
--device cuda \
--overwrite
Other released wrappers are available for 3B/7B base and instruct backbones:
- supervised_finetuning/scripts/train_stage1_*.sh
- supervised_finetuning/scripts/train_stage2_*.sh
- supervised_finetuning/scripts/train_lora.py
- supervised_finetuning/scripts/merge_lora.py
RL training requires a compatible VERL-style runtime. Overlay the runtime patch first:
cp -r external/runtime_patch/* "<rl_runtime>/"
export RUNTIME_ROOT="<rl_runtime>"
Build or reuse the released RL parquet data:
python reinforcement_learning/scripts/build_policy_dataset.py \
--train-jsonl reinforcement_learning/source_data/policy_training_pool/train.jsonl \
--dev-jsonl reinforcement_learning/source_data/policy_training_pool/dev.jsonl \
--skill-bank-path skill_bank/round_4_musique/outputs/final_skill_bank.md \
--output-dir reinforcement_learning/data/policy_7b_instruct
Start the retriever server, set the SFT checkpoint, and launch GRPO training:
export MODEL_PATH="$SEARCHSKILL_ROOT/supervised_finetuning/models/stage2_7b_instruct_merged"
export DATA_DIR="$SEARCHSKILL_ROOT/reinforcement_learning/data/policy_7b_instruct"
export RETRIEVER_HOST="127.0.0.1"
export RETRIEVER_PORT="8000"
export CUDA_VISIBLE_DEVICES=0,1,2,3
bash reinforcement_learning/scripts/train_7b_instruct.sh
Key code:
- reinforcement_learning/scripts/build_policy_dataset.py
- reinforcement_learning/scripts/launch_training.sh
- reinforcement_learning/scripts/train_*.sh
- external/runtime_patch/verl/trainer/main_ppo_searchskill.py
- external/runtime_patch/verl/utils/reward_score/searchskill.py
You can evaluate either a local checkpoint or one of the released Hugging Face SearchSkill models. Keep the retriever server running, then run a dev split:
export MODEL_PATH="HJCHJC/SearchSkill-SFT-7B-Instruct"
export BENCHMARK_SPLIT=dev
export SHARD_COUNT=1
export GPU_IDS_CSV=0
bash reinforcement_learning/scripts/evaluate_policy.sh nq
Run all dev benchmarks:
MODEL_PATH="$MODEL_PATH" BENCHMARK_SPLIT=dev bash reinforcement_learning/scripts/evaluate_policy.sh all
Run full benchmarks:
MODEL_PATH="$MODEL_PATH" BENCHMARK_SPLIT=full bash reinforcement_learning/scripts/evaluate_policy.sh all
The evaluation output is written under eval/<run_name>/. The launcher supports all, singlehop, multihop, or a single dataset name: nq, triviaqa, popqa, hotpotqa, 2wiki, musique, or bamboogle.
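The launcher's target argument can be read as a selector over the seven datasets. The sketch below shows one plausible expansion. The multi-hop group matches the four benchmarks this README names as multi-hop. Treating nq, triviaqa, and popqa as the single-hop group is an assumption, and expand_target is a hypothetical helper, not the launcher's actual logic.

```python
# Assumed grouping: the remaining three datasets are treated as single-hop.
SINGLEHOP = ("nq", "triviaqa", "popqa")
# Stated in this README as the four multi-hop benchmarks.
MULTIHOP = ("hotpotqa", "2wiki", "musique", "bamboogle")


def expand_target(target):
    """Map a launcher target to the list of dataset names it covers."""
    if target == "all":
        return list(SINGLEHOP + MULTIHOP)
    if target == "singlehop":
        return list(SINGLEHOP)
    if target == "multihop":
        return list(MULTIHOP)
    if target in SINGLEHOP or target in MULTIHOP:
        return [target]
    raise ValueError(f"unknown target: {target!r}")


print(expand_target("multihop"))
# ['hotpotqa', '2wiki', 'musique', 'bamboogle']
```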
Key code:
- reinforcement_learning/scripts/evaluate_policy.sh
- skill_bank/nq_eval/eval_nq_qwen_skillbank.py
- skill_bank/nq_eval/eval_common.py
- benchmarks/dev/*.jsonl
- benchmarks/full/*.jsonl
- Model weights are hosted at https://huggingface.co/HJCHJC.
- Retrieval indexes and corpus files are external resources and are not included in this repository.
- API keys, caches, logs, local paths, and raw experiment outputs are intentionally excluded.
- For a more verbose step-by-step reproduction guide, see REPRODUCE.md.