Anqi Zhang1,2, Xiaokang Ji1, Guangyu Gao1*, Jianbo Jiao2, Chi Harold Liu1, Yunchao Wei3,4
1School of Computer Science, Beijing Institute of Technology
2The MIx group, School of Computer Science, University of Birmingham
3WEI Lab, Institute of Information Science, Beijing Jiaotong University
4Beijing Academy of Artificial Intelligence
Our paper has been accepted to CVPR 2026.
✅️ No external expert decoder for text-guided referring segmentation.
✅️ Only 1 [SEG] token for segmentation.
✅️ First method to integrate the two characteristics above with solid, competitive performance.
🚀 A step forward in integrating segmentation ability inside the MLLM.
Our project investigates whether and how we can unlock segmentation ability from the MLLM itSELF with 1 segmentation Embedding (SELF1E) while achieving competitive results, eliminating the need for external decoders. To this end, our approach targets a fundamental limitation: the resolution reduction of pixel-shuffled image features in MLLMs.
- First, we retain image features at their original, uncompressed resolution and refill them with residual features extracted from the MLLM-processed compressed features, improving feature precision.
- Next, we apply pixel-unshuffle operations to the image features both with and without LLM processing, unleashing the details of the compressed features and amplifying the residual features at the uncompressed resolution, which further enhances the resolution of the refilled features.
- Finally, we redesign the attention mask with dual perception pathways, i.e., image-to-image and image-to-segmentation, enabling rich feature interaction between pixels and the segmentation token.
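The refill step above can be sketched roughly as follows. This is a minimal illustration, not the repository's implementation: the tensor shapes, the compression ratio `r=2`, and the additive residual combination are all assumptions for the sketch.

```python
import torch
import torch.nn.functional as F

def refill_features(uncompressed: torch.Tensor,
                    llm_processed: torch.Tensor,
                    r: int = 2) -> torch.Tensor:
    """Illustrative sketch of refilling original-resolution features
    with residual information from LLM-processed compressed features.

    uncompressed : (B, C, H, W)            vision features at original resolution
    llm_processed: (B, C*r*r, H//r, W//r)  features after r x r compression + LLM
    """
    # pixel_shuffle inverts the r x r spatial compression: it folds the
    # channel dimension back into space, restoring the original H x W grid.
    restored = F.pixel_shuffle(llm_processed, r)  # (B, C, H, W)
    # Treat the LLM-processed features as a residual refilled into the
    # uncompressed features, so original-resolution detail is preserved
    # while LLM context is injected.
    return uncompressed + restored
```

`F.pixel_unshuffle` is the exact inverse (space-to-depth), which is why the compressed and uncompressed pathways can be aligned on a common grid before combining.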
```shell
pip install -r requirements.txt
```
Note: with some versions of `transformers`, Qwen2 models may not accept an `attention_mask` argument; you may need to modify the code accordingly.
Following the LISA dataset preparation, the training data consists of 4 types of data:

- Semantic segmentation datasets: ADE20K, COCO-Stuff, PACO-LVIS, PASCAL-Part, COCO Images

  Note: For COCO-Stuff, we use the annotation file stuffthingmaps_trainval2017.zip. We only use the PACO-LVIS part of PACO. COCO images should be placed in the `dataset/coco/` directory.
- Referring segmentation datasets: refCOCO, refCOCO+, refCOCOg, refCLEF (saiapr_tc-12), gRefCOCO

  Note: the original links for the refCOCO-series data are down, so we have replaced them with new ones. If the download speed is very slow or unstable, we also provide a OneDrive link. You must still follow the rules required by the original datasets.
- Reasoning segmentation dataset: ReasonSeg
- Advanced Visual Question Answering dataset: LLaVA-Instruct-150k
- Traditional Visual Question Answering datasets: follow the InternVL VQA dataset preparation. We use the vqav2, okvqa, textvqa, vizwiz, and gqa datasets for training.
Download them from the links above and organize them as follows.
```
├── dataset
│   ├── ade20k
│   │   ├── annotations
│   │   └── images
│   ├── coco
│   │   └── train2017
│   │       ├── 000000000009.jpg
│   │       └── ...
│   ├── cocostuff
│   │   └── train2017
│   │       ├── 000000000009.png
│   │       └── ...
│   ├── llava_dataset
│   │   └── llava_instruct_150k.json
│   ├── reason_seg
│   │   └── ReasonSeg
│   │       ├── train
│   │       ├── val
│   │       └── explanatory
│   ├── refer_seg
│   │   ├── images
│   │   │   ├── saiapr_tc-12
│   │   │   └── mscoco
│   │   │       └── images
│   │   │           └── train2014
│   │   ├── refclef
│   │   ├── refcoco
│   │   ├── refcoco+
│   │   ├── refcocog
│   │   └── grefcoco
│   └── vlpart
│       ├── paco
│       │   └── annotations
│       └── pascal_part
│           ├── train.json
│           └── VOCdevkit
├── data
│   ├── coco
│   ├── gqa
│   ├── mmbench
│   ├── mme
│   ├── okvqa
│   ├── pope
│   ├── textvqa
│   ├── vizwiz
│   └── vqav2
```
Train the model for 1 epoch:
```shell
deepspeed --master_port=24995 train_hf_ivl_seq.py \
  --version="***/InternVL3-2B" \
  --dataset_dir='./dataset' \
  --dataset="sem_seg||refer_seg||reason_seg||vqa" \
  --sample_rates="1,1,1,1" \
  --batch_size 5 \
  --grad_accumulation_steps 8 \
  --gradient_checkpointing \
  --exp_name="${EXP_NAME}" \
  --model_max_length 512 \
  --explanatory -1 \
  --lora_r 128 \
  --lora_alpha 256 \
  --epochs 1 \
  --lr 1e-4 \
  --vision_lr 1e-4 \
  --optimize_vision \
  --use_llm_lora \
  --use_vision_lora
```
Train the model with more segmentation data for the SEG version:
```shell
deepspeed --master_port=24995 train_hf_ivl_seq.py \
  --version="***/InternVL3-2B" \
  --dataset_dir='./dataset' \
  --dataset="sem_seg||refer_seg||reason_seg||vqa" \
  --sample_rates="6,20,6,1" \
  --batch_size 5 \
  --grad_accumulation_steps 8 \
  --gradient_checkpointing \
  --exp_name="${EXP_NAME}" \
  --model_max_length 512 \
  --explanatory -1 \
  --lora_r 128 \
  --lora_alpha 256 \
  --epochs 1 \
  --lr 1e-4 \
  --vision_lr 1e-4 \
  --optimize_vision \
  --use_llm_lora \
  --use_vision_lora
```
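The `--dataset` and `--sample_rates` flags pair each dataset type with a sampling weight (here, refer_seg is drawn about 20x more often than vqa). A minimal sketch of how such weights could drive per-sample dataset selection; the parsing and the use of `random.choices` are illustrative assumptions, not the repository's actual dataloader:

```python
import random

def pick_dataset(datasets: str, sample_rates: str) -> str:
    """Choose one dataset type for the next training sample, weighted by rate.

    datasets:     e.g. "sem_seg||refer_seg||reason_seg||vqa"
    sample_rates: e.g. "6,20,6,1"
    """
    names = datasets.split("||")
    weights = [float(r) for r in sample_rates.split(",")]
    assert len(names) == len(weights), "one rate per dataset"
    # random.choices draws with probability proportional to the weights,
    # so relative (not normalized) rates are sufficient.
    return random.choices(names, weights=weights, k=1)[0]
```

Because only relative weights matter, `"6,20,6,1"` and `"12,40,12,2"` yield the same sampling distribution.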
Evaluate on all segmentation datasets:
```shell
bash eval_all.sh
```
Evaluate on VQA datasets: please refer to the InternVL VQA datasets for preparation, download the evaluation code, and then run:
```shell
bash eval_vqas.sh
```
Note: DeepSpeed ZeRO-3 optimization is not supported for this method due to the customized design of the MLLM and data loading.
If you find this project useful in your research, please consider citing:
```bibtex
@inproceedings{zhang2026self1e,
  author    = {Zhang, Anqi and Ji, Xiaokang and Gao, Guangyu and Jiao, Jianbo and Liu, Chi Harold and Wei, Yunchao},
  title     = {SELF1E: Rethinking MLLM Itself as a Segmenter with a Single Segmentation Token},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year      = {2026},
}
```