We propose ACTIVE-O3, a purely reinforcement-learning-based training framework built on top of GRPO, designed to equip MLLMs with active perception capabilities. We further establish a comprehensive benchmark suite to evaluate ACTIVE-O3 on both general open-world tasks, such as small-object and dense-object grounding, and domain-specific scenarios, including small-object detection in remote sensing and autonomous driving, as well as fine-grained interactive segmentation. Experimental results demonstrate that ACTIVE-O3 significantly enhances active perception compared to Qwen2.5-VL with CoT prompting. Figure 1 shows a zero-shot reasoning example on the V* benchmark, where ACTIVE-O3 correctly identifies the number on the traffic light by zooming in on the relevant region, while Qwen2.5-VL fails to do so. Moreover, across all downstream tasks, ACTIVE-O3 consistently improves performance under fixed computational budgets. We hope this work provides a simple codebase and evaluation protocol to facilitate future research on active perception for MLLMs.
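To make the zoom-in behavior above concrete, the sketch below shows a generic crop-and-requery step. It is not the actual ACTIVE-O3 policy or prompt format (see `demo/activeo3_demo_vstar.py` for that); the `propose_region`/`answer_on_crop` helpers and the padding rule are illustrative assumptions only.

```python
from PIL import Image

def zoom_in(image: Image.Image, box, pad_ratio: float = 0.1) -> Image.Image:
    """Crop a proposed region (x1, y1, x2, y2) with a small context margin.

    Illustration only: ACTIVE-O3 learns where to zoom via GRPO rewards,
    rather than using a fixed cropping rule like this one.
    """
    x1, y1, x2, y2 = box
    pad_w, pad_h = (x2 - x1) * pad_ratio, (y2 - y1) * pad_ratio
    return image.crop((
        max(0, x1 - pad_w),
        max(0, y1 - pad_h),
        min(image.width, x2 + pad_w),
        min(image.height, y2 + pad_h),
    ))

# Hypothetical usage: the model first proposes a region of interest,
# then answers the question on the zoomed-in crop.
# box = propose_region(image, "the traffic light")                      # assumed helper
# answer = answer_on_crop(zoom_in(image, box), "What number is shown?")  # assumed helper
```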
- Release the weights.
- Release the inference demo.
- Release the dataset.
- Release the training scripts.
- Release the evaluation scripts.
```bash
# build environment
conda create -n activeo3 python=3.10
conda activate activeo3

# install packages
pip install torch==2.5.1 torchvision==0.20.1
pip install flash-attn --no-build-isolation
pip install transformers==4.51.3
pip install qwen-omni-utils[decord]

# run demo
python demo/activeo3_demo_vstar.py
```
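If you prefer to drive the model from your own script rather than the bundled demo, the minimal sketch below loads a checkpoint through the standard Qwen2.5-VL interface in `transformers`. The checkpoint path, image path, and question are placeholders, and it assumes the released weights are Qwen2.5-VL-compatible; consult `demo/activeo3_demo_vstar.py` for the exact prompting used by ACTIVE-O3.

```python
# Minimal sketch: load an ACTIVE-O3 checkpoint via the Qwen2.5-VL interface.
import torch
from PIL import Image
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

model_id = "path/to/activeo3-checkpoint"  # placeholder: path to the released weights
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

image = Image.open("example.jpg")  # placeholder image
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "What number is shown on the traffic light?"},
    ],
}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
inputs = processor(text=[prompt], images=[image], return_tensors="pt").to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=256)
new_tokens = output_ids[:, inputs["input_ids"].shape[1]:]  # keep only the generated part
print(processor.batch_decode(new_tokens, skip_special_tokens=True)[0])
```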
For academic usage, this project is licensed under the 2-clause BSD License. For commercial inquiries, please contact Chunhua Shen.
If you find this work helpful for your research, please cite:
```bibtex
@article{zhu2025active,
  title={Active-O3: Empowering Multimodal Large Language Models with Active Perception via GRPO},
  author={Zhu, Muzhi and Zhong, Hao and Zhao, Canyu and Du, Zongze and Huang, Zheng and Liu, Mingyu and Chen, Hao and Zou, Cheng and Chen, Jingdong and Yang, Ming and others},
  journal={arXiv preprint arXiv:2505.21457},
  year={2025}
}
```