Project Page · Paper · 🤗 Dataset · Chinese Documentation
Note: This repository is the Plan-Act-Replan Agent codebase in SVFSearch.
For mmsearch-r1-game code, see: https://github.com/SVFSearch/SVFSearch-mmsearch-r1-game
This repository contains the inference and evaluation code for SVFSearch, a multimodal retrieval-augmented QA system with dynamic tool routing. Three runnable entry points are provided:
- `run_agent.py`: Plan-Act-Replan agent (dynamic tool routing)
- `run_workflow.py`: fixed baseline workflow (img_ann → query vote → text_ann → answer)
- `run_direct_qa.py`: direct QA baseline without external retrieval services

- Main Entrypoints
- Repository Layout
- Requirements
- Services and Default Ports
- Quick Start
- Input Data Format
- Run Evaluation
- Multi-Model Batch Benchmark
- Output Files
- Key Environment Variables
- Pipeline Notes
- FAQ
| Entry | Purpose | Recommended Use |
|---|---|---|
| `run_agent.py` | Dynamic planning + multi-tool retrieval | Primary benchmark pipeline |
| `run_workflow.py` | Fixed 4-step workflow | Stable baseline comparison |
| `run_direct_qa.py` | Direct model answering only | Fast no-retrieval baseline |
| `run_benchmark.sh` | Multi-model batch runner | Large-scale experiments |

.
├── run_agent.py
├── run_workflow.py
├── run_benchmark.sh
├── run_direct_qa.py
├── qa_agent/
│   ├── config.py
│   ├── graph.py
│   ├── llm_client.py
│   ├── pipeline.py
│   ├── prompts.py
│   ├── retrieval.py
│   ├── schema.py
│   └── tool_skills.py
├── tools/
│   ├── img_emb_server.py
│   ├── text_emb_server.py
│   ├── multimodal_emb_server.py
│   ├── bm25_server.py
│   ├── kn_lookup_server.py
│   └── *_client.sh / *_test.py
├── skills/
└── data/

- Python 3.10 (recommended)
- Linux + NVIDIA GPU (vLLM and embedding services are GPU-first)
- CUDA / PyTorch / vLLM properly installed

pip install -r requirements.txt

Note: `requirements.txt` may include machine-specific wheel paths. Replace any incompatible entries with versions that match your own environment.

Default values are defined in qa_agent/config.py.
| Service | Default Endpoint | Description |
|---|---|---|
| LLM API | `http://127.0.0.1:8000/v1` | OpenAI-compatible endpoint |
| `img_ann` | `http://127.0.0.1:8001/img_ann` | Image retrieval |
| `kn_lookup` | `http://127.0.0.1:8002/kn_lookup` | Knowledge enrichment |
| `multimodal_ann` | `http://127.0.0.1:8003/multimodal_ann` | Multimodal retrieval |
| `text_ann` | `http://127.0.0.1:8004/text_ann` | Text retrieval |
| `bm25_ann` | `http://127.0.0.1:8005/bm25_ann` | Keyword retrieval |

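
Before launching a run, it can help to confirm that something is actually listening on each default port. The following is a minimal sketch, not part of the repository: the helper names `host_port` and `is_listening` are hypothetical, and the check only confirms that a TCP connection succeeds, not that the service responds correctly.

```python
import socket
from typing import Tuple
from urllib.parse import urlparse

# Default endpoints copied from the table above.
DEFAULT_ENDPOINTS = {
    "llm": "http://127.0.0.1:8000/v1",
    "img_ann": "http://127.0.0.1:8001/img_ann",
    "kn_lookup": "http://127.0.0.1:8002/kn_lookup",
    "multimodal_ann": "http://127.0.0.1:8003/multimodal_ann",
    "text_ann": "http://127.0.0.1:8004/text_ann",
    "bm25_ann": "http://127.0.0.1:8005/bm25_ann",
}


def host_port(url: str) -> Tuple[str, int]:
    """Extract (host, port) from an endpoint URL (hypothetical helper)."""
    parsed = urlparse(url)
    return parsed.hostname, parsed.port


def is_listening(url: str, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to the endpoint's host and port succeeds."""
    host, port = host_port(url)
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    for name, url in DEFAULT_ENDPOINTS.items():
        state = "up" if is_listening(url) else "DOWN"
        print(f"{name:15s} {url:45s} {state}")
```

A port reported as `DOWN` usually means the corresponding server has not been started yet or is bound to a different port.
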
vllm serve <YOUR_VLM_MODEL> --host 0.0.0.0 --port 8000

Or use the helper script:

bash tools/vllm_serve.sh

text_ann

export TEXT_EMB_MODEL_PATH=/path/to/text-embedding-model
export TEXT_ANN_PATH=/path/to/text_ann_corpus.jsonl
python tools/text_emb_server.py

img_ann

export IMG_EMB_MODEL_PATH=/path/to/image-backbone
export IMG_EMB_CKPT_PATH=/path/to/image-ckpt.pt
export IMG_ANN_PATH=/path/to/image_ann_pool.jsonl
python tools/img_emb_server.py

multimodal_ann

export MULTIMODAL_EMB_MODEL_PATH=/path/to/multimodal-embedding-model
export MULTIMODAL_ANN_PATH=/path/to/query_multimodal.final.jsonl
python tools/multimodal_emb_server.py

kn_lookup

python tools/kn_lookup_server.py

Custom knowledge files:

KN_FILES=/path/a.jsonl:/path/b.jsonl python tools/kn_lookup_server.py

bm25_ann (optional)

export BM25_DATA_PATH=/path/to/corpus.jsonl
python tools/bm25_server.py

All three entry points accept JSONL input where each line has the following structure:

{
"query": "optional; used by direct_qa --use_kn",
"img": "/path/to/image.jpg",
"qa": {
"question": "question text",
"options": ["option A", "option B", "option C", "option D"],
"answer": "option A"
}
}

Tips:
- `qa.answer` can be either `A`/`B`/`C`/`D` or the full option text.
- Prefer passing `--input` explicitly instead of relying on defaults.

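
Because `qa.answer` may be either a letter or the full option text, a small pre-flight check can catch malformed lines before a long run. This is a sketch under stated assumptions rather than the repository's own loader: the function names and the `answer_letter` field are hypothetical, and `query` is treated as optional as described above.

```python
import json
import string


def normalize_answer(qa: dict) -> str:
    """Return the gold answer as a letter, given a letter or the full option text."""
    answer = qa["answer"].strip()
    options = qa["options"]
    letters = string.ascii_uppercase[: len(options)]
    if answer in options:  # full option text
        return letters[options.index(answer)]
    if len(answer) == 1 and answer.upper() in letters:  # bare letter
        return answer.upper()
    raise ValueError(f"answer {answer!r} matches no option")


def parse_record(line: str) -> dict:
    """Parse one JSONL line and check the fields shown in the format above."""
    record = json.loads(line)
    for key in ("img", "qa"):
        if key not in record:
            raise KeyError(f"missing required field {key!r}")
    for key in ("question", "options", "answer"):
        if key not in record["qa"]:
            raise KeyError(f"missing required field qa.{key}")
    # 'answer_letter' is a hypothetical convenience field, not part of the format.
    record["qa"]["answer_letter"] = normalize_answer(record["qa"])
    return record


if __name__ == "__main__":
    sample = json.dumps({
        "img": "/path/to/image.jpg",
        "qa": {
            "question": "question text",
            "options": ["option A", "option B", "option C", "option D"],
            "answer": "option A",
        },
    })
    print(parse_record(sample)["qa"]["answer_letter"])
```
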
The benchmark dataset is available at 🤗 SVFSearchData.

python run_agent.py \
--input /path/to/benchmark.jsonl \
--output outputs/predictions.jsonl \
--answer-sheet outputs/answer_sheet.jsonl \
--stats outputs/stats.json \
--log-file outputs/run.log \
--llm-model <YOUR_SERVED_MODEL_NAME> \
--llm-base-url http://127.0.0.1:8000/v1 \
--llm-api-key EMPTY \
--text-ann-url http://127.0.0.1:8004/text_ann \
--img-ann-url http://127.0.0.1:8001/img_ann \
--multimodal-ann-url http://127.0.0.1:8003/multimodal_ann

python run_workflow.py \
--input /path/to/benchmark.jsonl \
--output outputs/workflow_predictions.jsonl \
--answer-sheet outputs/workflow_answer_sheet.jsonl \
--stats outputs/workflow_stats.json \
--log-file outputs/workflow_run.log \
--llm-model <YOUR_SERVED_MODEL_NAME> \
--llm-base-url http://127.0.0.1:8000/v1 \
--llm-api-key EMPTY \
--text-ann-url http://127.0.0.1:8004/text_ann \
--img-ann-url http://127.0.0.1:8001/img_ann

python run_direct_qa.py \
--input /path/to/benchmark.jsonl \
--output outputs/direct_qa.jsonl \
--model <YOUR_LOCAL_MODEL_OR_HF_PATH>

Optional `--use_kn`:
- Reads `query_rag_kn_part_1.jsonl` and `qwen_rag_kn_part_2.jsonl` from the current working directory.
- If these files are elsewhere, update the script or create symlinks.

run_benchmark.sh executes each model in MODELS sequentially:
- Start vLLM
- Run `run_agent.py` or `run_workflow.py`
- Stop vLLM and clean up GPU processes

# Run agent pipeline
RUNNER=agent INPUT_PATH=/path/to/benchmark.jsonl bash run_benchmark.sh
# Run workflow pipeline
RUNNER=workflow INPUT_PATH=/path/to/benchmark.jsonl bash run_benchmark.sh
# Dry run (no actual execution)
DRY_RUN=1 bash run_benchmark.sh

Before running, adjust the machine-specific config inside `run_benchmark.sh`:
- `MODELS`: model name, TP size, GPU memory ratio
- `INPUT_PATH`: default may not match your dataset
- `CUDA_VISIBLE_DEVICES`: match your actual GPU topology

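
The per-model sequence (start vLLM, run the chosen entry point, stop vLLM) can be pictured with a dry-run planner that only prints commands. This is an illustrative Python sketch, not a port of `run_benchmark.sh`: the function names are hypothetical, output flags are omitted, and the third step is a placeholder `echo` because the actual cleanup commands are not shown in this document.

```python
import shlex
from typing import List


def plan_model_run(model: str, runner: str, input_path: str, port: int = 8000) -> List[List[str]]:
    """Build the three commands for one model without executing anything."""
    if runner not in ("agent", "workflow"):
        raise ValueError(f"unknown runner: {runner!r}")
    script = "run_agent.py" if runner == "agent" else "run_workflow.py"
    base_url = f"http://127.0.0.1:{port}/v1"
    return [
        # Step 1: start the model server.
        ["vllm", "serve", model, "--host", "0.0.0.0", "--port", str(port)],
        # Step 2: run the selected entry point against it.
        ["python", script, "--input", input_path,
         "--llm-model", model, "--llm-base-url", base_url],
        # Step 3: placeholder only; the real script's cleanup is not reproduced here.
        ["echo", "stop vLLM and clean up GPU processes"],
    ]


def dry_run(models: List[str], runner: str, input_path: str) -> List[str]:
    """Return the shell lines that would run, one model after another."""
    lines = []
    for model in models:
        for cmd in plan_model_run(model, runner, input_path):
            lines.append(shlex.join(cmd))
    return lines


if __name__ == "__main__":
    for line in dry_run(["model-a", "model-b"], "agent", "/path/to/benchmark.jsonl"):
        print(line)
```
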
| File | Description |
|---|---|
| `predictions.jsonl` | Compact prediction results |
| `answer_sheet.jsonl` | Debug details (route / evidence / trace / raw_output) |
| `stats.json` | Aggregate metrics (accuracy, tool usage, latency, etc.) |
| `run.log` | Runtime logs (base64 image blobs are filtered out) |

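
To get a quick look at what a run produced without assuming any particular schema, a generic JSONL summary is often enough. The sketch below makes no claim about the fields inside `predictions.jsonl` or `answer_sheet.jsonl`; it only counts records and reports which top-level keys appear. The function name `summarize_jsonl` is hypothetical.

```python
import json
import sys
from collections import Counter
from pathlib import Path


def summarize_jsonl(path) -> dict:
    """Count records and how often each top-level key appears in a JSONL file."""
    key_counts = Counter()
    records = 0
    with Path(path).open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:  # tolerate blank lines
                continue
            record = json.loads(line)
            records += 1
            key_counts.update(record.keys())
    return {"records": records, "keys": dict(key_counts)}


if __name__ == "__main__" and len(sys.argv) > 1:
    # Example: python summarize.py outputs/predictions.jsonl
    print(json.dumps(summarize_jsonl(sys.argv[1]), indent=2))
```

A key that appears in fewer records than the total is a hint that some samples failed partway through and are worth checking in `run.log`.
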
Common variables (see qa_agent/config.py for the full list):
| Variable | Description |
|---|---|
| `LLM_BASE_URL`, `LLM_API_KEY`, `LLM_MODEL` | LLM service configuration |
| `TEXT_ANN_URL`, `IMG_ANN_URL`, `MULTIMODAL_ANN_URL`, `BM25_ANN_URL` | ANN service endpoints |
| `ANN_TOPK`, `ANN_TIMEOUT` | Retrieval parameters |
| `MAX_PLAN_ROUNDS`, `PLAN_MAX_ATTEMPTS`, `ANSWER_MAX_ATTEMPTS` | Agent loop limits |
| `KN_LOOKUP_URL`, `KN_LOOKUP_TIMEOUT` | Knowledge lookup service |
| `IMG_ANN_KN_TOP_QUERIES`, `IMG_ANN_KN_SELECT_MODE` | `majority` or `llm` |
| `DEBUG` | Enable verbose debug output |

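
As an illustration of how such variables typically interact with defaults, the sketch below resolves the endpoint URLs from the environment and falls back to the documented default endpoints. It assumes a plain "environment variable overrides default" rule; the actual logic lives in `qa_agent/config.py` and may differ. The name `resolve_settings` is hypothetical, and variables whose defaults are not documented here are left out.

```python
import os
from typing import Dict, Mapping, Optional

# Defaults taken from the documented default endpoints.
DEFAULTS = {
    "LLM_BASE_URL": "http://127.0.0.1:8000/v1",
    "IMG_ANN_URL": "http://127.0.0.1:8001/img_ann",
    "KN_LOOKUP_URL": "http://127.0.0.1:8002/kn_lookup",
    "MULTIMODAL_ANN_URL": "http://127.0.0.1:8003/multimodal_ann",
    "TEXT_ANN_URL": "http://127.0.0.1:8004/text_ann",
    "BM25_ANN_URL": "http://127.0.0.1:8005/bm25_ann",
}


def resolve_settings(env: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Return each setting from the environment, falling back to its default."""
    env = os.environ if env is None else env
    return {name: env.get(name, default) for name, default in DEFAULTS.items()}


if __name__ == "__main__":
    for name, value in resolve_settings().items():
        print(f"{name}={value}")
```

For example, exporting `TEXT_ANN_URL` before running this sketch changes only that entry and leaves the other endpoints at their defaults.
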
- `skills/*/SKILL.md` files are injected into the planning prompt in `run_agent.py`.
- `kn_lookup` is auto-triggered after `img_ann`; no manual planner call is needed.
- `bm25_ann` is called with planner-generated `bm25_query`, useful for exact keyword matching.

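
The `majority` value of `IMG_ANN_KN_SELECT_MODE` and the "query vote" step of the fixed workflow both suggest a majority vote over candidate queries returned by image retrieval. The repository's actual selection logic is not shown in this document, so the following is only one plausible reading: it assumes `img_ann` yields a list of candidate query strings, and the function name `majority_query` is hypothetical.

```python
from collections import Counter
from typing import List, Optional


def majority_query(candidates: List[str]) -> Optional[str]:
    """Pick the most frequent candidate; ties go to the one seen first."""
    cleaned = [c.strip() for c in candidates if c and c.strip()]
    if not cleaned:
        return None
    # Counter.most_common keeps first-encountered order among equal counts.
    return Counter(cleaned).most_common(1)[0][0]


if __name__ == "__main__":
    hits = ["boss fight guide", "character skin", "boss fight guide"]
    print(majority_query(hits))
```
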
| Problem | Solution |
|---|---|
| `FileNotFoundError: data/benckmark.jsonl` | Pass `--input` explicitly and verify the path. |
| vLLM health check failed | Verify --llm-base-url matches the service port. |
| Empty ANN results | Check that the service is running and index paths are correct. |
| Dependency installation failure | Replace machine-specific wheel entries in requirements.txt. |
If you find this work useful, please cite our paper:
@misc{mao2026svfsearchmultimodalknowledgeintensivebenchmark,
title={SVFSearch: A Multimodal Knowledge-Intensive Benchmark for Short-Video Frame Search in the Gaming Vertical Domain},
author={Lingtao Mao and Huangyu Dai and Xinyu Sun and Zihan Liang and Ben Chen and Chenyi Lei and Wenwu Ou},
year={2026},
eprint={2605.17946},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={https://arxiv.org/abs/2605.17946},
}