Rethinking UMM Visual Generation: Masked Modeling for Efficient Image-Only Pre-training

Peng Sun<sup>1,3,*</sup> · Jun Xie<sup>1,2,3,*</sup> · Tao Lin<sup>3</sup>

<sup>1</sup>Zhejiang University   <sup>2</sup>Shanghai Innovation Institute   <sup>3</sup>Westlake University

📄 Paper · 🏷️ BibTeX

Official PyTorch implementation of IOMM: 🏆 a data-efficient training paradigm (for both pre-training and fine-tuning) for Unified Multimodal Models.

Generation results of our IOMM-XL

🚧 TODOs

- [x] Release the paper.
- [ ] Release IOMM-B.
- [ ] Release inference code.
- [ ] Update the paper.
- [ ] Release training code.

✨ Features

🚀 Image-only Pre-training:

  • No need for high-quality text-image pair datasets
  • Achieves GenEval = 0.89 after 10 epochs on only 11 million images
  • Gains editing ability in zero-shot settings

Mixed Data Fine-tuning:

  • GenEval = 0.89 and WISE = 0.63 for Qwen-Image-20B

  • 📊 Extended results for additional tuned models are available here

📖 Check more detailed features in our paper!
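Since the training code is not yet released (see TODOs above), here is a minimal, hypothetical PyTorch sketch of the general masked-modeling idea behind image-only pre-training: randomly mask a subset of image patch tokens and train the model to reconstruct them, with the loss computed only on masked positions. All names, shapes, masking strategy, and the toy backbone below are illustrative assumptions, not the actual IOMM implementation.

```python
import torch
import torch.nn as nn

def random_mask(tokens: torch.Tensor, mask_ratio: float):
    """Randomly mask a fraction of patch tokens.

    tokens: (B, N, D) patch embeddings.
    Returns the masked tokens and a boolean mask of shape (B, N).
    NOTE: illustrative only -- not the IOMM masking scheme.
    """
    B, N, _ = tokens.shape
    n_mask = int(N * mask_ratio)
    noise = torch.rand(B, N)                    # random score per token
    ids = noise.argsort(dim=1)                  # shuffled token indices
    mask = torch.zeros(B, N, dtype=torch.bool)
    mask.scatter_(1, ids[:, :n_mask], True)     # mark n_mask tokens as masked
    masked = tokens.clone()
    masked[mask] = 0.0                          # e.g. zero out masked tokens
    return masked, mask

# Toy stand-in for the UMM visual backbone (assumption, not IOMM's model).
model = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True),
    num_layers=2,
)

patches = torch.randn(2, 16, 64)                # (batch, patches, dim), e.g. from an image tokenizer
masked, mask = random_mask(patches, mask_ratio=0.75)
pred = model(masked)                            # reconstruct all token positions
loss = ((pred - patches)[mask] ** 2).mean()     # supervise only the masked tokens
loss.backward()
```

Because the reconstruction target comes from the image itself, this kind of objective needs no text-image pairs, which is the property the features above highlight.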

🚀 Generalization to open-source UMMs

Our mixed-data fine-tuning paradigm also delivers notable performance gains on open-source Unified Multimodal Models (UMMs), even over the strong Qwen-Image-20B baseline.

| Method | Res. | NFE | GenEval | WISE |
| --- | --- | --- | --- | --- |
| OpenUni-L | 512 | 20 × 2 | 0.85 | 0.52 |
| ⊕ Pair fine-tuning | 512 | 20 × 2 | 0.88 | 0.62 |
| ⊕ Mix fine-tuning | 512 | 20 × 2 | 0.88 | 0.59 |
| Qwen-Image-20B | 512 | 50 × 2 | 0.85 | – |
| ⊕ Pair fine-tuning | 512 | 50 × 2 | 0.88 | 0.63 |
| ⊕ Mix fine-tuning | 512 | 50 × 2 | 0.89 | 0.63 |
| Qwen-Image-20B | 1024 | 50 × 2 | 0.87 | 0.62 |
| ⊕ Pair fine-tuning | 1024 | 50 × 2 | 0.88 | 0.63 |
| ⊕ Mix fine-tuning | 1024 | 50 × 2 | 0.89 | 0.63 |

🏷️ Bibliography

If you find this repository helpful for your project, please consider citing our work:

@inproceedings{sun2026rethinking,
  title={Rethinking UMM Visual Generation: Masked Modeling for Efficient Image-Only Pre-training},
  author={Sun, Peng and Xie, Jun and Lin, Tao},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2026}
}

📄 License

Apache License 2.0 - See LICENSE for details.

About

[CVPR 2026] IOMM: Fast Pre-training of Unified Multimodal Models without Text-Image Pairs
