Peng Sun1,3,* · Jun Xie1,2,3,* · Tao Lin3
1Zhejiang University 2Shanghai Innovation Institute 3Westlake University
Official PyTorch implementation of IOMM: 🏆 A data-efficient training (both pre-training and fine-tuning) paradigm for Unified Multimodal Models.
[✅] Release the paper.
[ ] Release IOMM-B.
[ ] Release inference code.
[ ] Update the paper.
[ ] Release training code.
🚀 Image-only Pre-training:
- ✅ No need for high-quality text-image pair datasets
- ✅ Achieves GenEval = 0.89 after 10 epochs on 11 million images
- ✅ Gains editing ability in zero-shot settings
⚡ Mixed Data Fine-tuning:
- ✅ GenEval = 0.89 and WISE = 0.63 for Qwen-Image-20B
- 📊 Extended results for additional tuned models are available here
📖 Check more detailed features in our paper!
Our mixed-data fine-tuning paradigm also delivers notable performance gains on open-source Unified Multimodal Models (UMMs), even on the strong Qwen-Image-20B baseline.
| METHOD | Res. | NFE | GenEval | WISE |
|---|---|---|---|---|
| OpenUni-L | 512 | 20 | 0.85 | 0.52 |
| | 512 | 20 | 0.88 | 0.62 |
| | 512 | 20 | 0.88 | 0.59 |
| Qwen-Image-20B | 512 | 50 | 0.85 | - |
| | 512 | 50 | 0.88 | 0.63 |
| | 512 | 50 | 0.89 | 0.63 |
| Qwen-Image-20B | 1024 | 50 | 0.87 | 0.62 |
| | 1024 | 50 | 0.88 | 0.63 |
| | 1024 | 50 | 0.89 | 0.63 |
If you find this repository helpful for your project, please consider citing our work:
```bibtex
@inproceedings{sun2026rethinking,
  title={Rethinking UMM Visual Generation: Masked Modeling for Efficient Image-Only Pre-training},
  author={Sun, Peng and Xie, Jun and Lin, Tao},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2026}
}
```
Apache License 2.0 - See LICENSE for details.
