Peng Sun1,3,* · Jun Xie1,2,3,* · Tao Lin3
1Zhejiang University 2Shanghai Innovation Institute 3Westlake University
Official PyTorch implementation of IOMM: 🏆 A data-efficient training (both pre-training and fine-tuning) paradigm for Unified Multimodal Models.
[✅] Release the paper.
[ ] Release IOMM-B.
[ ] Release inference code.
[ ] Update the paper.
[ ] Release training code.
🚀 Image-only Pre-training:
- ✅ No need for high-quality text-image pair datasets
- ✅ Achieves GenEval = 0.89 after 10 epochs on 11 million images
- ✅ Gains editing ability in zero-shot settings
⚡ Mixed Data Fine-tuning:
- ✅ GenEval = 0.89 and WISE = 0.63 for Qwen-Image-20B
- 📊 Extended results for additional tuned models are available here
📖 Check more detailed features in our paper!
Our mixed-data fine-tuning paradigm also delivers notable performance gains on open-source Unified Multimodal Models (UMMs), even on the strong Qwen-Image-20B baseline.
| METHOD | Res. | NFE | GenEval | WISE |
|---|---|---|---|---|
| OpenUni-L | 512 | 20 | 0.85 | 0.52 |
| | 512 | 20 | 0.88 | 0.62 |
| | 512 | 20 | 0.88 | 0.59 |
| Qwen-Image-20B | 512 | 50 | 0.85 | - |
| | 512 | 50 | 0.88 | 0.63 |
| | 512 | 50 | 0.89 | 0.63 |
| Qwen-Image-20B | 1024 | 50 | 0.87 | 0.62 |
| | 1024 | 50 | 0.88 | 0.63 |
| | 1024 | 50 | 0.89 | 0.63 |
If you find this repository helpful for your project, please consider citing our work:
```bibtex
@inproceedings{sun2026rethinking,
  title={Rethinking UMM Visual Generation: Masked Modeling for Efficient Image-Only Pre-training},
  author={Sun, Peng and Xie, Jun and Lin, Tao},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2026}
}
```
Apache License 2.0 - See LICENSE for details.
