Parallel region captioning in action: given multiple masks, PerceptionDLM describes all regions simultaneously in a single denoising pass (demo videos: demo_0.mp4, demo_1.mp4).
PerceptionDLM is a multimodal diffusion language model optimized for efficient parallel region perception. Built upon a strong foundational baseline (PerceptionDLM-Base), it fully leverages the parallel decoding nature of diffusion language models (DLMs): given an image and multiple region masks, it generates descriptions for all regions simultaneously within a single denoising process — avoiding the linear latency growth of autoregressive (AR) region captioners.
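To make the latency argument concrete, here is a toy count of forward passes. It is a sketch under stated assumptions, not the project's scheduler: a sequential AR captioner is assumed to spend one forward pass per generated token for each region in turn, while a parallel denoiser is assumed to spend a fixed number of denoising steps no matter how many regions share the pass. The function names and the 32-token, 32-step setting are illustrative, and the per-pass cost (which grows with sequence length) is ignored.

```python
def ar_forward_passes(num_regions: int, tokens_per_region: int) -> int:
    """Toy model: sequential AR decoding, one forward pass per token per region."""
    return num_regions * tokens_per_region


def parallel_forward_passes(num_regions: int, steps: int) -> int:
    """Toy model: all regions are denoised together, so the count ignores num_regions."""
    if num_regions < 1:
        raise ValueError("need at least one region")
    return steps


def tokens_per_forward(total_tokens: int, forward_passes: int) -> float:
    """Average number of tokens produced per forward pass."""
    return total_tokens / forward_passes


# Illustrative setting: 32 tokens per caption, 32 denoising steps.
for n in (1, 3, 8):
    ar = ar_forward_passes(n, tokens_per_region=32)
    par = parallel_forward_passes(n, steps=32)
    tpf = tokens_per_forward(n * 32, par)
    print(f"regions={n}: AR passes={ar}, parallel passes={par}, toy TPF={tpf:.1f}")
```

In this toy model the AR count grows linearly with the number of regions while the parallel count stays flat. Measured speedups will be smaller than the ratio of pass counts, because each parallel pass processes a longer sequence.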
- 🧩 Parallel region captioning. Describe many masked regions in a single denoising pass, achieving up to 3.4× throughput speedup in dense multi-region scenarios.
- 🏆 Strong diffusion VLM baseline. PerceptionDLM-Base outperforms LLaDA-V on 15 / 16 multimodal benchmarks, establishing a new state of the art among open discrete diffusion VLMs.
- 📊 New benchmark — ParaDLC-Bench. A multi-region localized captioning benchmark that jointly evaluates caption quality and inference efficiency.
- 🔁 Fully open. Code, model weights, training data recipe, and evaluation suite are released.
- [2026-6] 🎉 PerceptionDLM is released! Paper, code, models, and ParaDLC-Bench are now public.
We use uv for fast and reproducible Python environment management.
# 1. Install uv
curl -LsSf https://astral.sh/uv/install.sh | sh
# 2. Clone the repository
git clone https://github.com/MSALab-PKU/PerceptionDLM.git
cd PerceptionDLM
# 3. Sync the environment for the model you want to use
uv sync --extra=dmllm # for PerceptionDLM-Base
uv sync --extra=pdmllm # for PerceptionDLM (parallel region perception)
After syncing, activate the virtual environment (e.g., source .venv/bin/activate).
| Type | Name | Link |
|---|---|---|
| Model | PerceptionDLM-Base (8B) | 🤗 MSALab/PerceptionDLM-Base |
| Model | PerceptionDLM (8B) | 🤗 MSALab/PerceptionDLM |
| Backbone | LLaDA-8B-Instruct (HF format) | 🤗 MSALab/LLaDA-8B-Instruct-HF |
| Data | PerceptionDLM-Data (training data) | 🤗 MSALab/PerceptionDLM-Data |
| Benchmark | ParaDLC-Bench | 🤗 MSALab/ParaDLC-Bench · evaluation/ParaDLC-Bench |
Run direct inference on the provided sample images in assets/.
python demo/infer_dmllm.py \
--model-path MSALab/PerceptionDLM-Base \
--image assets/demo.jpg \
--prompt "What color shirt is the man in the picture wearing?" \
--gen-length 64 --block-length 64 --steps 64
Generate captions for one or more binary masks in parallel:
python demo/infer_pdmllm.py \
--model-path MSALab/PerceptionDLM \
--image assets/demo.jpg \
--masks assets/demo_mask_0.jpg \
assets/demo_mask_1.jpg \
assets/demo_mask_2.jpg \
--gen-length 32 --steps 32 --temperature 0.0 --top-p 1.0
💡 A web demo is also available under demo/gradio.
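When the number of masks varies per image, it can be convenient to assemble the region-captioning command programmatically. The helper below is a minimal sketch: `build_region_caption_cmd` is a hypothetical name, and it only reuses the script path and flags shown in the command above, with the same default values.

```python
import shlex
from typing import List, Sequence


def build_region_caption_cmd(
    image: str,
    masks: Sequence[str],
    model_path: str = "MSALab/PerceptionDLM",
    gen_length: int = 32,
    steps: int = 32,
    temperature: float = 0.0,
    top_p: float = 1.0,
) -> List[str]:
    """Build the argv list for demo/infer_pdmllm.py (hypothetical helper)."""
    if not masks:
        raise ValueError("at least one mask path is required")
    return [
        "python", "demo/infer_pdmllm.py",
        "--model-path", model_path,
        "--image", image,
        "--masks", *masks,  # all masks are passed to a single invocation
        "--gen-length", str(gen_length),
        "--steps", str(steps),
        "--temperature", str(temperature),
        "--top-p", str(top_p),
    ]


cmd = build_region_caption_cmd(
    "assets/demo.jpg",
    [f"assets/demo_mask_{i}.jpg" for i in range(3)],
)
print(shlex.join(cmd))
# To execute from the repository root: subprocess.run(cmd, check=True)
```

Passing the command as a list rather than a single string avoids shell-quoting problems when image or mask paths contain spaces.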
Download the datasets from Hugging Face and organize them as shown below:
- Bee Collections: Bee-Training-Data-Stage1, Bee-Training-Data-Stage2, Honey-Data-15M
- LLaVA-OneVision-1.5-Instruct-Data
- 🤗 MSALab/PerceptionDLM-Data — region mask/caption annotations for PerceptionDLM
./
├── datasets/ # PerceptionDLM-Base (4-stage) training data
│ ├── Bee-Training-Data-Stage1/
│ ├── Bee-Training-Data-Stage2/
│ ├── LLaVA-OneVision-1.5-Instruct-Data/
│ └── Honey-Data-15M/
├── annotations/ # region mask/caption annotations (PerceptionDLM)
│ ├── dam_dataset.json
│ ├── coconut_dataset.json
│ └── sam_dataset.json
└── images/ # corresponding image files
📄 For detailed dataset formats and config structures, see datasets.md.
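Before launching training, a quick check that the layout above is in place can save a failed job. The script below is a sketch that only tests for the directory and file names listed in the tree. `missing_entries` is a hypothetical helper, and it does not validate the contents or format of any dataset.

```python
from pathlib import Path
from typing import List

# Paths taken from the directory tree above, relative to the repository root.
EXPECTED_ENTRIES = [
    "datasets/Bee-Training-Data-Stage1",
    "datasets/Bee-Training-Data-Stage2",
    "datasets/LLaVA-OneVision-1.5-Instruct-Data",
    "datasets/Honey-Data-15M",
    "annotations/dam_dataset.json",
    "annotations/coconut_dataset.json",
    "annotations/sam_dataset.json",
    "images",
]


def missing_entries(root: str = ".") -> List[str]:
    """Return the expected entries that do not exist under `root`."""
    base = Path(root)
    return [rel for rel in EXPECTED_ENTRIES if not (base / rel).exists()]


if __name__ == "__main__":
    missing = missing_entries(".")
    if missing:
        print("Missing entries:")
        for rel in missing:
            print(f"  {rel}")
    else:
        print("Dataset layout looks complete.")
```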
To train from the original LLaDA weights, convert them to our format first (or simply use the pre-converted MSALab/LLaDA-8B-Instruct-HF on Hugging Face, which is the default in all configs):
python scripts/convert.py \
--model_path /path/to/LLaDA-8B-Instruct \
--output /path/to/LLaDA-8B-Instruct-HF
Training configurations (data configs and train configs) live in the configs/ directory.
export WANDB_API_KEY="your_wandb_api_key_here"
# Example: 8 GPUs per node
bash scripts/dmllm_multi_run.sh train <data_config> <training_config> 8
Reproducing our training setup
- PerceptionDLM-Base: full 4-stage pipeline on 32× NVIDIA H100 (80GB) GPUs (~3 weeks total).
- PerceptionDLM: initialized from PerceptionDLM-Base and trained on the full ParaCaption corpus in ~2 days on 32× H100.
See datasets.md and the paper appendix for the exact per-stage hyper-parameters.
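For budgeting purposes, the figures above translate into rough GPU-hour estimates. This is back-of-the-envelope arithmetic only: it assumes "~3 weeks" means 21 full days and "~2 days" means 48 hours of continuous use on all 32 GPUs, which the source states only approximately.

```python
HOURS_PER_DAY = 24


def gpu_hours(num_gpus: int, days: float) -> float:
    """GPU-hours for `num_gpus` running continuously for `days` days."""
    return num_gpus * days * HOURS_PER_DAY


base_hours = gpu_hours(32, 21)   # PerceptionDLM-Base: ~3 weeks on 32 GPUs
region_hours = gpu_hours(32, 2)  # PerceptionDLM: ~2 days on 32 GPUs

print(f"PerceptionDLM-Base: ~{base_hours:,.0f} GPU-hours")
print(f"PerceptionDLM:      ~{region_hours:,.0f} GPU-hours")
```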
We provide a comprehensive evaluation suite covering both Multimodal Benchmarks (via VLMEvalKit) and Dense Grounded Captioning (ParaDLC-Bench & DLC-Bench). Our ParaDLC-Bench is available on Hugging Face: 🤗 MSALab/ParaDLC-Bench.
👉 See the dedicated Evaluation Guide for setup, commands, and judge configuration.
PerceptionDLM-Base establishes a strong open diffusion VLM baseline, outperforming LLaDA-V on 15 / 16 benchmarks and staying competitive with leading AR VLMs at the same scale.
Bold = best score in each row. "–" = not reported; $^\dagger$ / $^\star$ = re-evaluated by us (official scripts / VLMEvalKit). See the paper for the full protocol.
PerceptionDLM achieves a strong accuracy–efficiency trade-off on multi-region captioning:
| Method | Type | ParaDLC-Bench Avg (%) | TPF ↑ | Time (s) ↓ |
|---|---|---|---|---|
| GAR-8B | AR (sequential) | 69.5 | 1.0 | 479 |
| LLaDA-V-8B | Diffusion | 35.2 | 1.0 | 3241 |
| PerceptionDLM-8B | Diffusion (parallel) | 62.4 | 2.9 | 276 |
TPF = Tokens Per Forward, i.e. the average number of tokens generated per model forward pass (higher means more parallel). Full tables are reported in the paper.
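The trade-off in the table can be read off as simple ratios. The snippet below copies the three rows verbatim and computes the wall-clock speedup and the accuracy difference relative to each baseline. The `speedup` and `accuracy_gap` helpers are illustrative names. These ratios compare the reported total times, so they are a different quantity from the "up to 3.4×" throughput figure quoted for dense multi-region scenarios.

```python
# Rows copied from the table above: ParaDLC-Bench Avg (%), TPF, Time (s).
RESULTS = {
    "GAR-8B": {"avg": 69.5, "tpf": 1.0, "time_s": 479},
    "LLaDA-V-8B": {"avg": 35.2, "tpf": 1.0, "time_s": 3241},
    "PerceptionDLM-8B": {"avg": 62.4, "tpf": 2.9, "time_s": 276},
}


def speedup(baseline: str, method: str = "PerceptionDLM-8B") -> float:
    """Ratio of baseline time to method time (values above 1 mean the method is faster)."""
    return RESULTS[baseline]["time_s"] / RESULTS[method]["time_s"]


def accuracy_gap(baseline: str, method: str = "PerceptionDLM-8B") -> float:
    """Method average minus baseline average, in percentage points."""
    return RESULTS[method]["avg"] - RESULTS[baseline]["avg"]


for name in ("GAR-8B", "LLaDA-V-8B"):
    print(f"vs {name}: {speedup(name):.2f}x faster, {accuracy_gap(name):+.1f} points")
```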
This project builds upon the excellent work of LLaDA, LLaVA, VLMEvalKit, DAM, and GAR. We thank the authors of the Open-Bee and LLaVA-OneVision-1.5 datasets for their open contributions.
If you find PerceptionDLM useful for your research, please consider citing:
@article{sun2026perceptiondlm,
title = {PerceptionDLM: Parallel Region Perception with Multimodal Diffusion Language Models},
author = {Sun, Yueyi and Wang, Yuhao and Li, Jason and Tian, Ye and Zhang, Tao and Mai, Jacky and Wang, Yihan and Wang, Haochen and Bai, Jinbin and Yang, Ling and Tong, Yunhai},
journal = {arXiv preprint arXiv:2606.19534},
year = {2026},
eprint = {2606.19534},
archivePrefix = {arXiv},
primaryClass = {cs.CV},
url = {https://arxiv.org/abs/2606.19534}
}
This project is released under the Apache License 2.0.

