TL;DR (1) - Allow the MLLM to control its own visual perception process.
TL;DR (2) - Treat visual perception as a function-calling process and control it through Visual Perception Tokens, which the MLLM outputs in the same manner as natural-language tokens.
The image illustrates the two types of Visual Perception Tokens. The Region Selection Token carries explicit semantic information, representing important regions with bounding boxes. The Vision Re-Encoding Token carries no explicit semantic information; instead, the Vision Projector extracts control information directly from its hidden state.
The inference process incorporating Visual Perception Tokens can be divided into three stages.
First, the MLLM generates Visual Perception Tokens based on the image and the given question.
Next, the Vision Branches perform a second perception of the image guided by the Visual Perception Tokens.
Finally, the MLLM utilizes the Vision Features obtained from both perception stages to answer the question.
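A conceptual sketch of this three-stage loop is shown below. All names in it (answer_with_vpt, vision_branches.perceive, and so on) are hypothetical placeholders for illustration, not the actual API of this repository.

```python
# Conceptual sketch of the three-stage inference described above.
# Every name here is a hypothetical placeholder, not this repository's API.

def answer_with_vpt(mllm, vision_branches, image, question):
    # Stage 1: the MLLM reads the image and the question and emits Visual
    # Perception Tokens (e.g., a Region Selection Token with a bounding box,
    # or a Vision Re-Encoding Token) just like ordinary text tokens.
    first_pass = mllm.generate(image=image, prompt=question)
    vp_tokens = first_pass.visual_perception_tokens

    # Stage 2: the Vision Branches perform a second perception of the image
    # guided by the Visual Perception Tokens (cropping to the selected region,
    # or re-encoding the image and projecting the features with the Vision
    # Projector conditioned on the token's hidden state).
    extra_features = vision_branches.perceive(image, vp_tokens)

    # Stage 3: the MLLM answers using the vision features from both
    # perception stages.
    return mllm.generate(image=image, prompt=question,
                         extra_vision_features=extra_features)
```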
Performance comparison of MLLMs with and without Visual Perception Tokens.
Datasets marked with "*" were not used during training.
A 2B model with Visual Perception Tokens can even outperform the 7B model without Visual Perception Tokens.
Examples collected from the test sets. The responses were generated by the 7B model and the 2B+VPT model.
Model | Stage | Base Model | Finetuned Modules | #Total Params | Download Link |
---|---|---|---|---|---|
Qwen2-VL-2b-VPT-Det-Alignment | After Alignment | Qwen/Qwen2-VL-2B-Instruct | Projector | 2.75B | 🤗 HuggingFace Model |
Qwen2-VL-2b-VPT-Seg-Alignment | After Alignment | Qwen/Qwen2-VL-2B-Instruct | Projector | 2.76B | 🤗 HuggingFace Model |
Qwen2-VL-2b-VPT-CLIP | After Instruction Tuning | Qwen/Qwen2-VL-2B-Instruct | All | 2.45B | 🤗 HuggingFace Model |
Qwen2-VL-2b-VPT-Det | After Instruction Tuning | Qwen2-VL-2b-VPT-Det-Alignment | All | 2.75B | 🤗 HuggingFace Model |
Qwen2-VL-2b-VPT-Det-NoPrompt | After Instruction Tuning | Qwen2-VL-2b-VPT-Det-Alignment | All | 2.75B | 🤗 HuggingFace Model |
Qwen2-VL-2b-VPT-Seg | After Instruction Tuning | Qwen2-VL-2b-VPT-Seg-Alignment | All | 2.76B | 🤗 HuggingFace Model |
Qwen2-VL-7b-VPT-CLIP | After Instruction Tuning | Qwen/Qwen2-VL-7B-Instruct | LoRA-r512 | 8.32B | 🤗 HuggingFace Model |
- Our models fall into two types based on the training stage. When an additional vision encoder is required, an extra alignment step is performed first; the aligned model then serves as the starting point for the subsequent instruction tuning.
- The 7B model is fine-tuned with LoRA; the released weights have already been merged.
- The NoPrompt model corresponds to the Free Choice setting: it does not rely on specific prompts to trigger a Visual Perception Token. Instead, the model autonomously decides whether to use a Visual Perception Token and which type to apply.
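For reference, here is a minimal loading sketch. It assumes the released checkpoints remain loadable through the standard Qwen2-VL classes once the modified modeling_qwen2_vl_vpt.py from this repo is installed; the model id below is a placeholder to be replaced with the actual HuggingFace link from the table above. If a checkpoint registers a custom architecture, load it with the class defined in modeling_qwen2_vl_vpt.py instead.

```python
# Minimal loading sketch (assumption: the standard Qwen2-VL classes work with
# the released checkpoints once the modified Transformers tree is installed).
from PIL import Image
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

model_id = "path-or-repo-id/Qwen2-VL-2b-VPT-Det"  # placeholder
model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

image = Image.open("example.jpg")  # placeholder image
messages = [{"role": "user",
             "content": [{"type": "image"},
                         {"type": "text", "text": "What is written on the sign?"}]}]
text = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=[text], images=[image], return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=128)
print(processor.batch_decode(output, skip_special_tokens=True)[0])
```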
Our training and evaluation data are hosted on 🤗 VPT Datasets. The entire dataset is divided into multiple splits, each corresponding to a separate JSON file. For the evaluation datasets, we have annotated the reasoning process involving the use of the Visual Perception Token; during evaluation, adjustments need to be made depending on the model being evaluated.
Dataset | Stage | Compatible Model | Training/Evaluation | #Samples |
---|---|---|---|---|
MixVRT_CLIP_Full | Instruction Tuning | CLIP | Training | 829k |
MixVRT_Detection_Full | Instruction Tuning | Det | Training | 829k |
MixVRT_Seg_Full | Instruction Tuning | Seg | Training | 829k |
CUB_Birds_action_test | Instruction Tuning | All | Evaluation | 0.5k |
DocVQA_region_test | Instruction Tuning | All | Evaluation | 0.9k |
DUDE_region_test | Instruction Tuning | All | Evaluation | 0.6k |
Flickr30k_action_test | Instruction Tuning | All | Evaluation | 1.5k |
LLaVA_COCO_free_action_test | Instruction Tuning | All | Evaluation | 1k |
LLaVA_COCO_single_action_test | Instruction Tuning | All | Evaluation | 1k |
OI_region_test | Instruction Tuning | All | Evaluation | 1k |
POPE_action_test | Instruction Tuning | All | Evaluation | 3k |
TextCap_region_test | Instruction Tuning | All | Evaluation | 8.5k |
TextVQA_region_test | Instruction Tuning | All | Evaluation | 0.5k |
VSR_region_test | Instruction Tuning | All | Evaluation | 0.4k |
llava_alignment_detection_qwen_response_train | Alignment | Det | Training | 585k |
llava_alignment_seg_qwen_response_train | Alignment | Seg | Training | 585k |
llava_alignment_detection_qwen_response_eval | Alignment | Det | Evaluation | 5k |
llava_alignment_seg_qwen_response_eval | Alignment | Seg | Evaluation | 5k |
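As a starting point, a split can be fetched programmatically with huggingface_hub; the repo id and filename below are placeholders to be replaced with the actual values from the dataset page.

```python
# Fetch one JSON split from the VPT Datasets repo (placeholder identifiers).
import json
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="placeholder/VPT-Datasets",   # hypothetical repo id
    filename="DocVQA_region_test.json",   # one JSON file per split
    repo_type="dataset",
)
with open(path) as f:
    split = json.load(f)                  # list (or dict) of samples
print(type(split), len(split))
```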
Our code is primarily based on Transformers and LLaMA-Factory. DevNote outlines the core modifications we made to the original Transformers and LLaMA-Factory libraries; it serves as an overview to help readers understand our code structure and as a starting point for exploration.
Setting up our environment primarily involves installing modified versions of Transformers 4.45.2 and LLaMA-Factory 0.9.1.dev0. The steps to create the environment are as follows.
# clone transformers 4.45.2
git clone https://github.com/huggingface/transformers.git
cd transformers
git checkout v4.45.2
# copy modeling_qwen2_vl_vpt.py
mkdir -p "src/transformers/models/qwen2_vl_vpt"
cp "VPT/transformers/src/transformers/models/qwen2_vl_vpt/modeling_qwen2_vl_vpt.py" "src/transformers/models/qwen2_vl_vpt/"
# modify the paths of transformers and llama-factory in the env.yml file
# create environment
cd /to/this/folder
conda env create -f env.yml
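After the environment is created, a quick sanity check can confirm that the modified Transformers tree is the one being imported and that the VPT modeling file is in place (its expected location follows from the cp command above):

```python
# Sanity check: verify the Transformers version and the copied VPT modeling file.
import os
import transformers
import transformers.models as models

print(transformers.__version__)  # expected: 4.45.2
vpt_file = os.path.join(os.path.dirname(models.__file__),
                        "qwen2_vl_vpt", "modeling_qwen2_vl_vpt.py")
print("VPT modeling file present:", os.path.exists(vpt_file))
```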
Prepare the following datasets and adjust the image paths in the JSON files downloaded from 🤗 VPT Datasets (a path-rewriting sketch follows the list below).
- COCO2017
- CUB_200_2011
- DocVQA
- DUDE
- Flickr30k
- GQA
- OCRVQA
- OpenImage
- TextVQA
- VG
- VSR
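A minimal path-rewriting sketch is shown below. It assumes each sample stores its image location under an "image" key; adjust the key, file name, and roots to match the actual JSON files.

```python
# Rewrite image roots in one downloaded split (placeholder paths and key).
import json

SPLIT_FILE = "DocVQA_region_test.json"      # placeholder split file
OLD_ROOT = "/original/image/root"           # placeholder
NEW_ROOT = "/path/to/your/local/datasets"   # placeholder

with open(SPLIT_FILE) as f:
    samples = json.load(f)

for sample in samples:
    if "image" in sample:                   # assumed key name
        sample["image"] = sample["image"].replace(OLD_ROOT, NEW_ROOT)

with open(SPLIT_FILE, "w") as f:
    json.dump(samples, f, ensure_ascii=False, indent=2)
```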
Our training and evaluation are supported by LLaMA-Factory. Adjust the paths in the configs/*.yaml files before running the code.
cd LLaMA-Factory
# Detection Projector Alignment
llamafactory-cli train configs/Qwen2-VL-2b-VPT-Det-Alignment.yaml
# Segmentation Projector Alignment
llamafactory-cli train configs/Qwen2-VL-2b-VPT-Seg-Alignment.yaml
# 2b Model + VPT-original vision encoder
llamafactory-cli train configs/Qwen2-VL-2b-VPT-CLIP.yaml
# 2b Model + VPT-DINO
llamafactory-cli train configs/Qwen2-VL-2b-VPT-Det.yaml
# 2b Model + VPT-SAM
llamafactory-cli train configs/Qwen2-VL-2b-VPT-Seg.yaml
# 7b Model + VPT-original vision encoder
llamafactory-cli train configs/Qwen2-VL-7b-VPT-CLIP.yaml
The evaluation folder contains the code for evaluation. Run python evaluation.py to complete the evaluation process.
If you find our work useful, please cite using this BibTeX:
@misc{yu2025vpt,
title={Introducing Visual Perception Token into Multimodal Large Language Model},
author={Runpeng Yu and Xinyin Ma and Xinchao Wang},
year={2025},
eprint={2502.17425},
archivePrefix={arXiv},
}