VisualPerceptionToken

TL;DR (1) - Allow the MLLM to control its own visual perception process.

TL;DR (2) - Treat visual perception as a function-calling process and control it through Visual Perception Tokens. The MLLM outputs Visual Perception Tokens in the same manner as natural-language tokens.

[Paper] [Models] [Datasets]

Graphical Abstract

πŸ”§ The figure illustrates two types of Visual Perception Tokens. The Region Selection Token carries explicit semantic information, representing important regions as bounding boxes, while the Vision Re-Encoding Token carries no semantic information; instead, the Vision Projector extracts control information directly from its hidden state. Inference with Visual Perception Tokens proceeds in three stages. First, the MLLM generates Visual Perception Tokens based on the image and the given question. Next, the Vision Branches perform a second perception of the image, guided by the Visual Perception Tokens. Finally, the MLLM uses the Vision Features obtained from both perception stages to answer the question.

. πŸ‘ Performance comparison of MLLMs with and without Visual Perception Tokens. Datasets marked with ``*'' are not used in the training process. A 2B model with Visual Perception Tokens can even outperform the 7B model without Visual Perception Tokens.

. πŸ‘ Examples collected from the testing sets. The responses were generated by the 7B model and the 2B+VPT model.

Table of Contents

  1. Inventory
  2. Environment Setup
  3. Training and Evaluation

Inventory

Models

| Model | Stage | Base Model | Finetuned Modules | #Total Params | Download Link |
| --- | --- | --- | --- | --- | --- |
| Qwen2-VL-2b-VPT-Det-Alignment | After Alignment | Qwen/Qwen2-VL-2B-Instruct | Projector | 2.75B | πŸ€— HuggingFace Model |
| Qwen2-VL-2b-VPT-Seg-Alignment | After Alignment | Qwen/Qwen2-VL-2B-Instruct | Projector | 2.76B | πŸ€— HuggingFace Model |
| Qwen2-VL-2b-VPT-CLIP | After Instruction Tuning | Qwen/Qwen2-VL-2B-Instruct | All | 2.45B | πŸ€— HuggingFace Model |
| Qwen2-VL-2b-VPT-Det | After Instruction Tuning | Qwen2-VL-2b-VPT-Det-Alignment | All | 2.75B | πŸ€— HuggingFace Model |
| Qwen2-VL-2b-VPT-Det-NoPrompt | After Instruction Tuning | Qwen2-VL-2b-VPT-Det-Alignment | All | 2.75B | πŸ€— HuggingFace Model |
| Qwen2-VL-2b-VPT-Seg | After Instruction Tuning | Qwen2-VL-2b-VPT-Seg-Alignment | All | 2.76B | πŸ€— HuggingFace Model |
| Qwen2-VL-7b-VPT-CLIP | After Instruction Tuning | Qwen/Qwen2-VL-7B-Instruct | LoRA-r512 | 8.32B | πŸ€— HuggingFace Model |
  • Our model can be categorized into two types based on the training stage. In cases where an additional vision encoder is required, an extra alignment step is performed. The aligned model then serves as the starting point for the subsequent instruction tuning process.
  • The 7B model is fine-tuned using LoRA. The released model has already been merged.
  • The NoPrompt model corresponds to the Free Choice model. It does not rely on specific prompts to trigger the Visual Perception Token; instead, the model autonomously decides whether to use the Visual Perception Token and which type to apply.

Datasets

Our training and evaluation data are hosted on πŸ€— VPT Datasets. The full dataset is divided into multiple splits, each corresponding to a separate JSON file. For the evaluation datasets, we annotate the reasoning process that involves the Visual Perception Token; during evaluation, these annotations need to be adjusted for the model being tested.

| Dataset | Stage | Compatible Model | Training/Evaluation | #Samples |
| --- | --- | --- | --- | --- |
| MixVRT_CLIP_Full | Instruction Tuning | CLIP | Training | 829k |
| MixVRT_Detection_Full | Instruction Tuning | Det | Training | 829k |
| MixVRT_Seg_Full | Instruction Tuning | Seg | Training | 829k |
| CUB_Birds_action_test | Instruction Tuning | All | Evaluation | 0.5k |
| DocVQA_region_test | Instruction Tuning | All | Evaluation | 0.9k |
| DUDE_region_test | Instruction Tuning | All | Evaluation | 0.6k |
| Flickr30k_action_test | Instruction Tuning | All | Evaluation | 1.5k |
| LLaVA_COCO_free_action_test | Instruction Tuning | All | Evaluation | 1k |
| LLaVA_COCO_single_action_test | Instruction Tuning | All | Evaluation | 1k |
| OI_region_test | Instruction Tuning | All | Evaluation | 1k |
| POPE_action_test | Instruction Tuning | All | Evaluation | 3k |
| TextCap_region_test | Instruction Tuning | All | Evaluation | 8.5k |
| TextVQA_region_test | Instruction Tuning | All | Evaluation | 0.5k |
| VSR_region_test | Instruction Tuning | All | Evaluation | 0.4k |
| llava_alignment_detection_qwen_response_train | Alignment | Det | Training | 585k |
| llava_alignment_seg_qwen_response_train | Alignment | Seg | Training | 585k |
| llava_alignment_detection_qwen_response_eval | Alignment | Det | Evaluation | 5k |
| llava_alignment_seg_qwen_response_eval | Alignment | Seg | Evaluation | 5k |
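
The splits can be pulled with huggingface_hub as sketched below. The dataset repo id is a placeholder for the πŸ€— VPT Datasets link above, and the loop assumes each split is a JSON list of samples.

# Download sketch for the VPT dataset splits (each split is a separate JSON file).
import glob
import json
import os

from huggingface_hub import snapshot_download

DATASET_ID = "<hf-dataset-repo-id-behind-the-VPT-Datasets-link>"  # placeholder

local_dir = snapshot_download(repo_id=DATASET_ID, repo_type="dataset")
for path in sorted(glob.glob(os.path.join(local_dir, "*.json"))):
    with open(path) as f:
        samples = json.load(f)  # assumes the split is a JSON list of samples
    print(os.path.basename(path), len(samples))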

Code Development Note

Our code is primarily based on Transformers and LLaMA-Factory. DevNote outlines the core modifications we made to the original Transformers and LLaMA-Factory libraries. It serves as an overview of our code structure and provides a starting point for exploration.

Environment Setup

Prepare Environment

Setting up our environment primarily involves installing modified versions of Transformers 4.45.2 and LLaMA-Factory 0.9.1.dev0. The steps are as follows.

# clone transformers 4.45.2
git clone https://github.com/huggingface/transformers.git
cd transformers
git checkout v4.45.2
# copy modeling_qwen2_vl_vpt.py
mkdir -p "src/transformers/models/qwen2_vl_vpt"
cp "VPT/transformers/src/transformers/models/qwen2_vl_vpt/modeling_qwen2_vl_vpt.py" "src/transformers/models/qwen2_vl_vpt/"
# modify the path of transformers and llama-factory in env.yml file
# create environment
cd /to/this/folder
conda env create -f env.yml
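
After the environment is created and activated, a quick check, sketched below under the assumption that the modified Transformers was installed from the source tree prepared above, confirms the pinned version and the copied VPT modeling file.

# Sanity-check sketch for the modified Transformers install.
import os

import transformers

print(transformers.__version__)  # expected: 4.45.2

vpt_file = os.path.join(
    os.path.dirname(transformers.__file__),
    "models", "qwen2_vl_vpt", "modeling_qwen2_vl_vpt.py",
)
print("VPT modeling file present:", os.path.isfile(vpt_file))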

Prepare Dataset

Prepare the following datasets and adjust the image paths in the JSON files downloaded from πŸ€— VPT Datasets (a path-rewriting sketch follows the list below).

  • COCO2017
  • CUB_200_2011
  • DocVQA
  • DUDE
  • Flickr30k
  • GQA
  • OCRVQA
  • OpenImage
  • TextVQA
  • VG
  • VSR

Training and Evaluation

Our training and evaluation are supported by LLaMA-Factory. Adjust the paths in the configs/*.yaml files before running the code.

cd LLaMA-Factory
# Detection Projector Alignment
llamafactory-cli train configs/Qwen2-VL-2b-VPT-Det-Alignment.yaml
# Segmentation Projector Alignment
llamafactory-cli train configs/Qwen2-VL-2b-VPT-Seg-Alignment.yaml
# 2b Model + VPT-original vision encoder
llamafactory-cli train configs/Qwen2-VL-2b-VPT-CLIP.yaml
# 2b Model + VPT-DINO
llamafactory-cli train configs/Qwen2-VL-2b-VPT-Det.yaml
# 2b Model + VPT-SAM
llamafactory-cli train configs/Qwen2-VL-2b-VPT-Seg.yaml
# 7b Model + VPT-original vision encoder
llamafactory-cli train configs/Qwen2-VL-7b-VPT-CLIP.yaml

The evaluation folder contains the evaluation code. Run python evaluation.py to complete the evaluation.

Citation

If you find our work useful, please cite using this BibTeX:

@misc{yu2025vpt,
      title={Introducing Visual Perception Token into Multimodal Large Language Model}, 
      author={Runpeng Yu and Xinyin Ma and Xinchao Wang},
      year={2025},
      eprint={2502.17425},
      archivePrefix={arXiv},
}
