PerceptionDLM

Parallel Region Perception with Multimodal Diffusion Language Models


Highlights · News · Installation · Models & Datasets · Quick Start · Training · Evaluation · Citation

[Figure: PerceptionDLM teaser]

🎬 Demos

Parallel region captioning in action — given multiple masks, PerceptionDLM describes all regions simultaneously in a single denoising pass:

demo_1.mp4
demo_0.mp4

If the videos do not play inline, click to view: demo 0 · demo 1.


PerceptionDLM is a multimodal diffusion language model optimized for efficient parallel region perception. Built upon a strong foundational baseline (PerceptionDLM-Base), it fully leverages the parallel decoding nature of diffusion language models (DLMs): given an image and multiple region masks, it generates descriptions for all regions simultaneously within a single denoising process — avoiding the linear latency growth of autoregressive (AR) region captioners.

✨ Highlights

  • 🧩 Parallel region captioning. Describe many masked regions in a single denoising pass, achieving up to 3.4× throughput speedup in dense multi-region scenarios.
  • 🏆 Strong diffusion VLM baseline. PerceptionDLM-Base outperforms LLaDA-V on 15 / 16 multimodal benchmarks, establishing a new state of the art among open discrete diffusion VLMs.
  • 📊 New benchmark — ParaDLC-Bench. A multi-region localized captioning benchmark that jointly evaluates caption quality and inference efficiency.
  • 🔁 Fully open. Code, model weights, training data recipe, and evaluation suite are released.

[Figure: PerceptionDLM architecture]

📰 News

  • [2026-06] 🎉 PerceptionDLM is released! Paper, code, models, and ParaDLC-Bench are now public.

📦 Installation

We use uv for fast and reproducible Python environment management.

# 1. Install uv
curl -LsSf https://astral.sh/uv/install.sh | sh

# 2. Clone the repository
git clone https://github.com/MSALab-PKU/PerceptionDLM.git
cd PerceptionDLM

# 3. Sync the environment for the model you want to use (run one of the two commands)
uv sync --extra=dmllm    # for PerceptionDLM-Base
uv sync --extra=pdmllm   # for PerceptionDLM (parallel region perception)

After syncing, activate the virtual environment (e.g., source .venv/bin/activate).
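To confirm that the environment is active before running the demos, a small check along these lines may help. It is a sketch that relies only on the standard library; `in_virtualenv` is a hypothetical helper name, not part of the repository.

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the interpreter runs inside a virtual environment."""
    # Inside a venv, sys.prefix points at the environment directory while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


if __name__ == "__main__":
    if in_virtualenv():
        print(f"Virtual environment active: {sys.prefix}")
    else:
        print("No virtual environment detected; run `source .venv/bin/activate` first.")
```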

🤗 Models & Datasets

| Type | Name | Link |
| --- | --- | --- |
| Model | PerceptionDLM-Base (8B) | 🤗 MSALab/PerceptionDLM-Base |
| Model | PerceptionDLM (8B) | 🤗 MSALab/PerceptionDLM |
| Backbone | LLaDA-8B-Instruct (HF format) | 🤗 MSALab/LLaDA-8B-Instruct-HF |
| Data | PerceptionDLM-Data (training data) | 🤗 MSALab/PerceptionDLM-Data |
| Benchmark | ParaDLC-Bench | 🤗 MSALab/ParaDLC-Bench · evaluation/ParaDLC-Bench |

🚀 Quick Start

Run direct inference on the provided sample images in assets/.

PerceptionDLM-Base (image-level understanding)

python demo/infer_dmllm.py \
  --model-path MSALab/PerceptionDLM-Base \
  --image assets/demo.jpg \
  --prompt "What color shirt is the man in the picture wearing?" \
  --gen-length 64 --block-length 64 --steps 64
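In LLaDA's reference sampler, generation proceeds block by block: the generation length must be a multiple of the block length, and the step budget is divided evenly across blocks. Assuming demo/infer_dmllm.py follows the same convention (this README does not confirm it), a pre-flight check on the three flags might look like the sketch below. `steps_per_block` is a hypothetical helper, not a function from the repository.

```python
def steps_per_block(gen_length: int, block_length: int, steps: int) -> int:
    """Return the denoising steps spent on each block.

    ASSUMPTION: mirrors the constraints of LLaDA's reference sampler; the
    demo script in this repository may validate its flags differently.
    """
    if gen_length <= 0 or block_length <= 0 or steps <= 0:
        raise ValueError("gen_length, block_length and steps must be positive")
    if gen_length % block_length != 0:
        raise ValueError("gen_length must be a multiple of block_length")
    num_blocks = gen_length // block_length
    if steps % num_blocks != 0:
        raise ValueError("steps must be a multiple of the number of blocks")
    return steps // num_blocks


if __name__ == "__main__":
    # Values from the command above: one 64-token block denoised in 64 steps.
    print(steps_per_block(gen_length=64, block_length=64, steps=64))
```

With the values in the command above, the whole response is a single block, so every step operates on all 64 positions.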

PerceptionDLM (parallel region captioning)

Generate captions for one or more binary masks in parallel:

python demo/infer_pdmllm.py \
  --model-path MSALab/PerceptionDLM \
  --image assets/demo.jpg \
  --masks assets/demo_mask_0.jpg \
          assets/demo_mask_1.jpg \
          assets/demo_mask_2.jpg \
  --gen-length 32 --steps 32 --temperature 0.0 --top-p 1.0
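When the number of masks varies per image, it can be convenient to assemble the same command programmatically. The sketch below uses only the flags shown above; `build_pdmllm_command` is a hypothetical helper, and it only builds the argument list without launching inference.

```python
import shlex
from typing import List, Sequence


def build_pdmllm_command(
    image: str,
    masks: Sequence[str],
    model_path: str = "MSALab/PerceptionDLM",
    gen_length: int = 32,
    steps: int = 32,
    temperature: float = 0.0,
    top_p: float = 1.0,
) -> List[str]:
    """Build the argv list for demo/infer_pdmllm.py (hypothetical helper)."""
    if not masks:
        raise ValueError("at least one mask path is required")
    return [
        "python", "demo/infer_pdmllm.py",
        "--model-path", model_path,
        "--image", image,
        "--masks", *masks,  # one path per region, all captioned in one run
        "--gen-length", str(gen_length),
        "--steps", str(steps),
        "--temperature", str(temperature),
        "--top-p", str(top_p),
    ]


if __name__ == "__main__":
    masks = [f"assets/demo_mask_{i}.jpg" for i in range(3)]
    # Print a shell-safe command line; pass the list to subprocess.run to execute it.
    print(shlex.join(build_pdmllm_command("assets/demo.jpg", masks)))
```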

💡 A web demo is also available under demo/gradio.

📚 Data Preparation

Download the datasets from Hugging Face and organize them as shown below:

./
├── datasets/                              # PerceptionDLM-Base (4-stage) training data
│   ├── Bee-Training-Data-Stage1/
│   ├── Bee-Training-Data-Stage2/
│   ├── LLaVA-OneVision-1.5-Instruct-Data/
│   └── Honey-Data-15M/
├── annotations/                           # region mask/caption annotations (PerceptionDLM)
│   ├── dam_dataset.json
│   ├── coconut_dataset.json
│   └── sam_dataset.json
└── images/                                # corresponding image files

📄 For detailed dataset formats and config structures, see datasets.md.
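Before launching training, it may be worth verifying that the layout above is in place. The following sketch checks only the paths listed in the tree; `EXPECTED_PATHS` and `missing_paths` are illustrative names, and the contents of each dataset directory are not inspected.

```python
from pathlib import Path
from typing import List

# Paths copied from the directory tree above, relative to the repository root.
EXPECTED_PATHS = [
    "datasets/Bee-Training-Data-Stage1",
    "datasets/Bee-Training-Data-Stage2",
    "datasets/LLaVA-OneVision-1.5-Instruct-Data",
    "datasets/Honey-Data-15M",
    "annotations/dam_dataset.json",
    "annotations/coconut_dataset.json",
    "annotations/sam_dataset.json",
    "images",
]


def missing_paths(root: str = ".") -> List[str]:
    """Return the expected paths that do not exist under `root`."""
    base = Path(root)
    return [rel for rel in EXPECTED_PATHS if not (base / rel).exists()]


if __name__ == "__main__":
    missing = missing_paths(".")
    if missing:
        print("Missing entries:")
        for rel in missing:
            print(f"  {rel}")
    else:
        print("Data layout looks complete.")
```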

Model Conversion (optional)

To train from the original LLaDA weights, convert them to our format first (or simply use the pre-converted MSALab/LLaDA-8B-Instruct-HF on Hugging Face, which is the default in all configs):

python scripts/convert.py \
  --model_path /path/to/LLaDA-8B-Instruct \
  --output /path/to/LLaDA-8B-Instruct-HF

🏋️ Training

Training configurations (data configs and train configs) live in the configs/ directory.

export WANDB_API_KEY="your_wandb_api_key_here"

# Example: 8 GPUs per node
bash scripts/dmllm_multi_run.sh train <data_config> <training_config> 8

Reproducing our training setup
  • PerceptionDLM-Base: full 4-stage pipeline on 32× NVIDIA H100 (80GB) GPUs (~3 weeks total).
  • PerceptionDLM: initialized from PerceptionDLM-Base and trained on the full ParaCaption corpus in ~2 days on 32× H100.

See datasets.md and the paper appendix for the exact per-stage hyper-parameters.

📈 Evaluation

We provide a comprehensive evaluation suite covering both Multimodal Benchmarks (via VLMEvalKit) and Dense Grounded Captioning (ParaDLC-Bench & DLC-Bench). Our ParaDLC-Bench is available on Hugging Face: 🤗 MSALab/ParaDLC-Bench.

👉 See the dedicated Evaluation Guide for setup, commands, and judge configuration.

📊 Main Results

PerceptionDLM-Base — General Multimodal Understanding

PerceptionDLM-Base establishes a strong open diffusion VLM baseline, outperforming LLaDA-V on 15 / 16 benchmarks and staying competitive with leading AR VLMs at the same scale.

[Figure: PerceptionDLM-Base multimodal benchmark comparison]

Bold = best score in each row. "–" = not reported; $^\dagger$ / $^\star$ = re-evaluated by us (official scripts / VLMEvalKit). See the paper for the full protocol.

PerceptionDLM — Parallel Region Perception

PerceptionDLM achieves a strong accuracy–efficiency trade-off on multi-region captioning:

| Method | Type | ParaDLC-Bench Avg (%) | TPF ↑ | Time (s) ↓ |
| --- | --- | --- | --- | --- |
| GAR-8B | AR (sequential) | 69.5 | 1.0 | 479 |
| LLaDA-V-8B | Diffusion | 35.2 | 1.0 | 3241 |
| PerceptionDLM-8B | Diffusion (parallel) | 62.4 | 2.9 | 276 |

TPF = Tokens Per Forward (higher means more parallel). Full tables are reported in the paper.
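As a reading aid, the wall-clock ratios implied by the table can be worked out directly. The sketch below copies the three rows from the table; `tokens_per_forward` assumes TPF is simply generated tokens divided by forward passes, which matches the name but is not spelled out here, and the 96-token example is hypothetical.

```python
# Rows copied from the table above.
RESULTS = {
    "GAR-8B": {"avg": 69.5, "tpf": 1.0, "time_s": 479},
    "LLaDA-V-8B": {"avg": 35.2, "tpf": 1.0, "time_s": 3241},
    "PerceptionDLM-8B": {"avg": 62.4, "tpf": 2.9, "time_s": 276},
}


def tokens_per_forward(num_tokens: int, num_forward_passes: int) -> float:
    """ASSUMED definition: generated tokens divided by model forward passes."""
    if num_forward_passes <= 0:
        raise ValueError("num_forward_passes must be positive")
    return num_tokens / num_forward_passes


def wall_clock_speedup(baseline: str, method: str = "PerceptionDLM-8B") -> float:
    """Ratio of baseline time to method time (greater than 1 means faster)."""
    return RESULTS[baseline]["time_s"] / RESULTS[method]["time_s"]


if __name__ == "__main__":
    for name in ("GAR-8B", "LLaDA-V-8B"):
        print(f"PerceptionDLM-8B vs {name}: {wall_clock_speedup(name):.2f}x faster")
    # Hypothetical example: 96 tokens produced in 32 forward passes.
    print(f"Example TPF: {tokens_per_forward(96, 32):.1f}")
```

On these numbers, PerceptionDLM-8B finishes in about 1.74× less time than GAR-8B and about 11.74× less than LLaDA-V-8B, while scoring 7.1 points below GAR-8B on the benchmark average.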

🙏 Acknowledgements

This project builds upon the excellent work of LLaDA, LLaVA, VLMEvalKit, DAM, and GAR. We thank the authors of the Open-Bee and LLaVA-OneVision-1.5 datasets for their open contributions.

📝 Citation

If you find PerceptionDLM useful for your research, please consider citing:

@article{sun2026perceptiondlm,
  title   = {PerceptionDLM: Parallel Region Perception with Multimodal Diffusion Language Models},
  author  = {Sun, Yueyi and Wang, Yuhao and Li, Jason and Tian, Ye and Zhang, Tao and Mai, Jacky and Wang, Yihan and Wang, Haochen and Bai, Jinbin and Yang, Ling and Tong, Yunhai},
  journal = {arXiv preprint arXiv:2606.19534},
  year    = {2026},
  eprint  = {2606.19534},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CV},
  url     = {https://arxiv.org/abs/2606.19534}
}

📄 License

This project is released under the Apache License 2.0.
