DyFo: A Training-Free Dynamic Focus Visual Search for Enhancing LMMs in Fine-Grained Visual Understanding
This is the official repo for DyFo (Dynamic Focus), a training-free visual search method that enhances LMMs/MLLMs in fine-grained visual understanding by simulating human dynamic visual focus.
- [2025-08-11]: 🚀 Updated to be compatible with the latest vLLM. Merged the DyFo and expert environments for easier setup.
- [2025-05-15]: 🚀 Code released.
- [2025-04-21]: ⭐️ DyFo is selected as a Poster Highlight at CVPR 2025 (top 13.5% of accepted papers)! Check out this link for details.
- We introduce DyFo (Dynamic Focus), a training-free visual search method that dynamically adjusts focus regions to enhance fine-grained visual understanding in large multimodal models (LMMs).
- The focus adjustment is guided by a bidirectional interaction between LMMs and visual experts, optimized via a Monte Carlo Tree Search (MCTS) algorithm.
- DyFo effectively filters out irrelevant content while avoiding the need for additional training or specialized localization modules, leading to improved fine-grained visual understanding and reduced hallucination in LMMs.
DyFo combines two components for collaborative inference: (1) a large multimodal model (LMM) such as Qwen2-VL or LLaVA-1.5 (served with vLLM), and (2) a visual expert such as LangSAM (this link).
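To make the collaboration concrete, below is a minimal sketch of how an MCTS loop can drive the focus search. It is a conceptual illustration only, not the repo's implementation: `expert_propose_regions` and `lmm_score` are hypothetical stubs standing in for calls to the visual-expert and LMM servers.

```python
import math
import random
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical stubs for the two components; in the actual pipeline these would be
# requests to the LMM server and the visual-expert (LangSAM) server.
def expert_propose_regions(image_size, query):
    """Visual expert: propose candidate focus boxes (x0, y0, x1, y1) for the query."""
    w, h = image_size
    return [(0, 0, w, h), (0, 0, w // 2, h), (w // 2, 0, w, h), (0, h // 2, w, h)]

def lmm_score(image_size, region, question):
    """LMM: confidence that the question is answerable from this focus region (stubbed)."""
    return random.random()

@dataclass
class Node:
    region: tuple
    parent: Optional["Node"] = None
    children: list = field(default_factory=list)
    visits: int = 0
    value: float = 0.0

    def ucb(self, c: float = 1.4) -> float:
        """Upper-confidence bound used during selection."""
        if self.visits == 0:
            return float("inf")
        return self.value / self.visits + c * math.sqrt(math.log(self.parent.visits) / self.visits)

def mcts_focus_search(image_size, question, iterations=32):
    root = Node(region=(0, 0, *image_size))
    root.visits = 1  # keeps log(parent.visits) defined for the root's children
    for _ in range(iterations):
        # 1. Selection: descend by UCB until reaching a leaf focus region.
        node = root
        while node.children:
            node = max(node.children, key=Node.ucb)
        # 2. Expansion: the visual expert proposes finer focus regions for this leaf.
        node.children = [Node(region=r, parent=node) for r in expert_propose_regions(image_size, question)]
        # 3. Simulation: the LMM judges one newly proposed region.
        child = random.choice(node.children)
        reward = lmm_score(image_size, child.region, question)
        # 4. Backpropagation: propagate the reward up to the root.
        while child is not None:
            child.visits += 1
            child.value += reward
            child = child.parent
    # Answer the question on the most-visited (most promising) focus region.
    return max(root.children, key=lambda n: n.visits).region

if __name__ == "__main__":
    print(mcts_focus_search(image_size=(640, 480), question="What color is the small sign?"))
```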
Note
If you encounter network issues accessing GitHub or HuggingFace during installation, you can try using these mirror sites:
```bash
conda create -n dyfo python=3.11
conda activate dyfo
pip install -r requirements.txt
```
- Clone the lang-segment-anything repository:
```bash
git clone https://github.com/luca-medeiros/lang-segment-anything && cd lang-segment-anything
```
- (Manual Action) Modify line 41 in `lang_sam/models/gdino.py` to support batch inference:
```python
inputs = self.processor(images=images_pil, text=texts_prompt, return_tensors="pt", padding=True).to(self.model.device)
```
- (Manual Action) Modify line 47 in `lang_sam/models/gdino.py` to adapt to the latest transformers (version 4.55):
```python
threshold=box_threshold,
```
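For context: recent transformers releases renamed the `box_threshold` keyword of `post_process_grounded_object_detection` to `threshold`, which is what this one-keyword edit accounts for. After the change, the call around line 47 should look roughly like the fragment below; it is a sketch for orientation only, and the surrounding variable names and arguments are assumptions that may differ in your lang-segment-anything checkout.

```python
# Fragment sketch of the post-processing call in lang_sam/models/gdino.py after the edit.
# Only the `threshold=box_threshold` keyword is the required change; the other
# arguments shown here are assumptions about the surrounding code.
results = self.processor.post_process_grounded_object_detection(
    outputs,
    inputs.input_ids,
    threshold=box_threshold,        # older transformers expected `box_threshold=box_threshold`
    text_threshold=text_threshold,
    target_sizes=[img.size[::-1] for img in images_pil],
)
```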
Download the dataset from this link and unzip dataset.zip to get the following directory structure:
```
.
├── dyfo
│   ├── scripts
│   └── src
└── playground (dataset)
    └── data
        └── eval
            ├── pope
            └── vstar
```
To start both LMM and Visual Expert servers:
```bash
# Start LMM server (recommend tmux)
conda activate dyfo
CUDA_VISIBLE_DEVICES=0 bash dyfo/scripts/lmm_server/<qwen/llava>_server.sh
```
```bash
# Start Visual Expert server (recommend tmux)
conda activate dyfo
CUDA_VISIBLE_DEVICES=1 bash dyfo/scripts/expert_server/start_server.sh
```
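Before launching a full run, it can help to confirm the LMM server is reachable. The snippet below is a minimal sanity check that assumes the server script exposes vLLM's OpenAI-compatible API on `localhost:8000`; the host, port, and API key are placeholders, so check the chosen `*_server.sh` script for the actual values.

```python
# Minimal sanity check against an assumed vLLM OpenAI-compatible endpoint.
# Host/port/API key below are placeholders; adjust them to match the server script.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
served = client.models.list()
print([m.id for m in served.data])  # should print the served LMM's name if the server is up
```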
For POPE evaluation:
- Batch testing (all 9 sub-datasets, about 6-7 hours):
```bash
conda activate dyfo
CUDA_VISIBLE_DEVICES=0 bash dyfo/scripts/pope/<qwen/llava>_batch.sh
```
- Single-dataset testing (about 40-50 minutes):
```bash
# take gqa_random for example
# other datasets: <coco/aokvqa/gqa>/<coco/aokvqa/gqa>_pope_<random/popular/adversarial>
conda activate dyfo
CUDA_VISIBLE_DEVICES=0 bash dyfo/scripts/pope/stream_pope_<qwen/llava>.sh mcts False gqa/gqa_pope_random
```
For V* evaluation (about 30 minutes):
```bash
conda activate dyfo
CUDA_VISIBLE_DEVICES=0 bash dyfo/scripts/vstar/stream_vstar_<qwen/llava>.sh mcts False
```
The experimental results of the new version are shown below:
| Dataset | Type | Model | Accuracy↑ | Precision | Recall | F1 Score↑ |
|---|---|---|---|---|---|---|
| MSCOCO | random | LLaVA1.5 | 92.03 | 93.94 | 89.87 | 91.86 |
| | | Qwen2-VL | 92.33 | 96.49 | 87.87 | 91.97 |
| | popular | LLaVA1.5 | 88.77 | 87.69 | 90.20 | 88.93 |
| | | Qwen2-VL | 89.20 | 90.50 | 87.60 | 89.02 |
| | adversarial | LLaVA1.5 | 83.33 | 79.66 | 89.53 | 84.31 |
| | | Qwen2-VL | 86.87 | 86.62 | 87.20 | 86.91 |
| A-OKVQA | random | LLaVA1.5 | 90.43 | 87.42 | 94.47 | 90.80 |
| | | Qwen2-VL | 92.33 | 92.05 | 92.67 | 92.36 |
| | popular | LLaVA1.5 | 84.83 | 79.04 | 94.80 | 86.21 |
| | | Qwen2-VL | 89.17 | 87.07 | 92.00 | 89.47 |
| | adversarial | LLaVA1.5 | 75.17 | 68.11 | 94.67 | 79.22 |
| | | Qwen2-VL | 82.13 | 76.78 | 92.13 | 83.76 |
| GQA | random | LLaVA1.5 | 90.03 | 87.27 | 93.73 | 90.39 |
| | | Qwen2-VL | 88.60 | 94.74 | 81.73 | 87.76 |
| | popular | LLaVA1.5 | 80.33 | 74.00 | 93.53 | 82.63 |
| | | Qwen2-VL | 85.87 | 88.93 | 81.93 | 85.29 |
| | adversarial | LLaVA1.5 | 75.03 | 68.33 | 93.33 | 78.90 |
| | | Qwen2-VL | 81.87 | 82.12 | 81.47 | 81.79 |

Table 1. Results on POPE for MSCOCO/A-OKVQA/GQA with LLaVA1.5 and Qwen2-VL.
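For reference, the POPE metrics above follow the usual binary protocol, treating a "yes" answer as the positive class. Below is a minimal sketch of that computation, using hypothetical example lists rather than output from the repo's evaluation scripts.

```python
# POPE-style metrics over binary "yes"/"no" answers, with "yes" as the positive class.
def pope_metrics(preds, labels):
    tp = sum(p == "yes" and l == "yes" for p, l in zip(preds, labels))
    fp = sum(p == "yes" and l == "no" for p, l in zip(preds, labels))
    tn = sum(p == "no" and l == "no" for p, l in zip(preds, labels))
    fn = sum(p == "no" and l == "yes" for p, l in zip(preds, labels))
    accuracy = (tp + tn) / len(labels)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, f1

# Hypothetical example answers, not repo outputs.
print(pope_metrics(["yes", "no", "yes", "no"], ["yes", "no", "no", "no"]))
```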
| Dataset | Model | Attribute↑ | Spatial↑ | Overall↑ |
|---|---|---|---|---|
| V* | DyFo-L | 65.22 | 57.89 | 62.30 |
| | DyFo-Q | 80.87 | 78.95 | 80.10 |

Table 2. Results on V*. DyFo-L and DyFo-Q represent our method with LLaVA1.5 and Qwen2-VL, respectively.
- Please refer to our paper for detailed experimental results.
If you find our project useful, please star our repo and cite our paper as follows:
```bibtex
@misc{li2025dyfotrainingfreedynamicfocus,
      title={DyFo: A Training-Free Dynamic Focus Visual Search for Enhancing LMMs in Fine-Grained Visual Understanding},
      author={Geng Li and Jinglin Xu and Yunzhen Zhao and Yuxin Peng},
      year={2025},
      eprint={2504.14920},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2504.14920},
}
```
- V*: Guided Visual Search as a Core Mechanism in Multimodal LLMs
- Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution
- LLaVA 1.5: Improved Baselines with Visual Instruction Tuning
- LangSam: Language Segment-Anything (Cool Expert!)
- vLLM: Efficient Memory Management for Large Language Model Serving with PagedAttention
- VCD: Mitigating Object Hallucinations in Large Vision-Language Models through Visual Contrastive Decoding
