Anqi Zhang1,2, Xiaokang Ji1, Guangyu Gao1*, Jianbo Jiao2, Chi Harold Liu1, Yunchao Wei3,4
1School of Computer Science, Beijing Institute of Technology
2The MIx group, School of Computer Science, University of Birmingham
3WEI Lab, Institute of Information Science, Beijing Jiaotong University
4Beijing Academy of Artificial Intelligence
Our paper has been accepted to CVPR 2026.
✅️ No external expert decoder for text-guided referring segmentation.
✅️ Only 1 [SEG] token for segmentation.
✅️ First method to integrate the two characteristics above with solid, competitive performance.
🚀 A step forward in integrating segmentation ability inside the MLLM.
Our project investigates whether and how we can unlock segmentation ability from the MLLM itSELF with 1 segmentation Embedding (SELF1E) while achieving competitive results, eliminating the need for external decoders. To this end, our approach targets a fundamental limitation: the resolution reduction of pixel-shuffled image features in MLLMs.
- First, we retain image features at their original, uncompressed resolution and refill them with residual features extracted from the MLLM-processed compressed features, improving feature precision.
- Next, we apply pixel-unshuffle operations to the image features both with and without LLM processing, unleashing the details of the compressed features and amplifying the residual features at the uncompressed resolution, which further enhances the resolution of the refilled features.
- Finally, we redesign the attention mask with dual perception pathways, i.e., image-to-image and image-to-segmentation, enabling rich feature interaction between pixels and the segmentation token.
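The refill step above can be sketched roughly as follows. This is a minimal illustration, not the repository's implementation: the tensor shapes, the compression ratio `r=2`, and the additive residual combination are all assumptions for the sketch.

```python
import torch
import torch.nn.functional as F

def refill_features(uncompressed: torch.Tensor,
                    llm_processed: torch.Tensor,
                    r: int = 2) -> torch.Tensor:
    """Illustrative sketch of refilling original-resolution features
    with residual information from LLM-processed compressed features.

    uncompressed : (B, C, H, W)            vision features at original resolution
    llm_processed: (B, C*r*r, H//r, W//r)  features after r x r compression + LLM
    """
    # pixel_shuffle inverts the r x r spatial compression: it folds the
    # channel dimension back into space, restoring the original H x W grid.
    restored = F.pixel_shuffle(llm_processed, r)  # (B, C, H, W)
    # Treat the LLM-processed features as a residual refilled into the
    # uncompressed features, so original-resolution detail is preserved
    # while LLM context is injected.
    return uncompressed + restored
```

`F.pixel_unshuffle` is the exact inverse (space-to-depth), which is why the compressed and uncompressed pathways can be aligned on a common grid before combining.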
```shell
pip install -r requirements.txt
```
Note: with some versions of `transformers`, Qwen2 models may not accept an `attention_mask` argument; you may need to modify the code accordingly.
Following the LISA dataset preparation, the training data consists of 4 types of data:

- Semantic segmentation datasets: ADE20K, COCO-Stuff, PACO-LVIS, PASCAL-Part, COCO Images

  Note: For COCO-Stuff, we use the annotation file stuffthingmaps_trainval2017.zip. We only use the PACO-LVIS part of PACO. COCO images should be placed in the `dataset/coco/` directory.
- Referring segmentation datasets: refCOCO, refCOCO+, refCOCOg, refCLEF (saiapr_tc-12), gRefCOCO

  Note: the original links for the refCOCO-series data are down, so we have replaced them with new ones. If the download speed is very slow or unstable, we also provide a OneDrive link. You must still follow the rules required by the original datasets.
- Reasoning segmentation dataset: ReasonSeg
- Advanced Visual Question Answering dataset: LLaVA-Instruct-150k
- Traditional Visual Question Answering datasets: follow the InternVL VQA dataset preparation. We use the vqav2, okvqa, textvqa, vizwiz, and gqa datasets for training.
Download them from the links above and organize them as follows.
```
├── dataset
│   ├── ade20k
│   │   ├── annotations
│   │   └── images
│   ├── coco
│   │   └── train2017
│   │       ├── 000000000009.jpg
│   │       └── ...
│   ├── cocostuff
│   │   └── train2017
│   │       ├── 000000000009.png
│   │       └── ...
│   ├── llava_dataset
│   │   └── llava_instruct_150k.json
│   ├── reason_seg
│   │   └── ReasonSeg
│   │       ├── train
│   │       ├── val
│   │       └── explanatory
│   ├── refer_seg
│   │   ├── images
│   │   │   ├── saiapr_tc-12
│   │   │   └── mscoco
│   │   │       └── images
│   │   │           └── train2014
│   │   ├── refclef
│   │   ├── refcoco
│   │   ├── refcoco+
│   │   ├── refcocog
│   │   └── grefcoco
│   └── vlpart
│       ├── paco
│       │   └── annotations
│       └── pascal_part
│           ├── train.json
│           └── VOCdevkit
├── data
│   ├── coco
│   ├── gqa
│   ├── mmbench
│   ├── mme
│   ├── okvqa
│   ├── pope
│   ├── textvqa
│   ├── vizwiz
│   └── vqav2
```
Train the model for 1 epoch:
```shell
deepspeed --master_port=24995 train_hf_ivl_seq.py \
  --version="***/InternVL3-2B" \
  --dataset_dir='./dataset' \
  --dataset="sem_seg||refer_seg||reason_seg||vqa" \
  --sample_rates="1,1,1,1" \
  --batch_size 5 \
  --grad_accumulation_steps 8 \
  --gradient_checkpointing \
  --exp_name="${EXP_NAME}" \
  --model_max_length 512 \
  --explanatory -1 \
  --lora_r 128 \
  --lora_alpha 256 \
  --epochs 1 \
  --lr 1e-4 \
  --vision_lr 1e-4 \
  --optimize_vision \
  --use_llm_lora \
  --use_vision_lora
```
Train the model with more segmentation data for the SEG version:
```shell
deepspeed --master_port=24995 train_hf_ivl_seq.py \
  --version="***/InternVL3-2B" \
  --dataset_dir='./dataset' \
  --dataset="sem_seg||refer_seg||reason_seg||vqa" \
  --sample_rates="6,20,6,1" \
  --batch_size 5 \
  --grad_accumulation_steps 8 \
  --gradient_checkpointing \
  --exp_name="${EXP_NAME}" \
  --model_max_length 512 \
  --explanatory -1 \
  --lora_r 128 \
  --lora_alpha 256 \
  --epochs 1 \
  --lr 1e-4 \
  --vision_lr 1e-4 \
  --optimize_vision \
  --use_llm_lora \
  --use_vision_lora
```
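The `--dataset` and `--sample_rates` flags pair each dataset type with a sampling weight (here, refer_seg is drawn about 20x more often than vqa). A minimal sketch of how such weights could drive per-sample dataset selection; the parsing and the use of `random.choices` are illustrative assumptions, not the repository's actual dataloader:

```python
import random

def pick_dataset(datasets: str, sample_rates: str) -> str:
    """Choose one dataset type for the next training sample, weighted by rate.

    datasets:     e.g. "sem_seg||refer_seg||reason_seg||vqa"
    sample_rates: e.g. "6,20,6,1"
    """
    names = datasets.split("||")
    weights = [float(r) for r in sample_rates.split(",")]
    assert len(names) == len(weights), "one rate per dataset"
    # random.choices draws with probability proportional to the weights,
    # so relative (not normalized) rates are sufficient.
    return random.choices(names, weights=weights, k=1)[0]
```

Because only relative weights matter, `"6,20,6,1"` and `"12,40,12,2"` yield the same sampling distribution.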
Evaluate on all segmentation datasets:
```shell
bash eval_all.sh
```
Evaluate on VQA datasets: please refer to the InternVL VQA datasets for preparation, download the evaluation code, and then run:
```shell
bash eval_vqas.sh
```
Note: DeepSpeed ZeRO-3 optimization is not supported for this method due to the customized design of the MLLM and data loading.
If you find this project useful in your research, please consider citing:
```bibtex
@inproceedings{zhang2026self1e,
  author    = {Zhang, Anqi and Ji, Xiaokang and Gao, Guangyu and Jiao, Jianbo and Liu, Chi Harold and Wei, Yunchao},
  title     = {SELF1E: Rethinking MLLM Itself as a Segmenter with a Single Segmentation Token},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year      = {2026},
}
```