
HITsz-TMG/Uni-MoE


🚀 Welcome to the repo of Uni-MoE

Uni-MoE is a MoE-based omnimodal large model that can understand and generate content across modalities.

🤗 Hugging Face | Project Page | Demo | Paper


If you appreciate our project, please consider giving us a star ⭐ on GitHub to stay updated with the latest developments.

🔥 News

  • [2025/11/24] 🔥 We have integrated our model Uni-MoE-2.0-Omni for evaluation within the Lmms-eval framework; see here.

  • [2025/11/13] 🔥 We release the second version, Uni-MoE-2.0-Omni. It achieves a significant leap in language-centric multimodal understanding, reasoning, and generation, while efficiently supporting cross-modal interaction across ten-plus modalities such as images, text, and speech through its dynamic MoE architecture and progressive training strategy.

  • [2025/10/16] 🔥 We release an improved UniMoE-Audio, the first audio generation model to unify speech and music generation.

  • [2025/8/6] 🔥 We release an improved Uni-MoE v1.5 on ModelScope here, with a unified speech encoding approach.

  • [2025/1/9] 🔥 Our paper has been accepted by IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2025.

  • [2024/8/28] 🔥 We release our video evaluation benchmark VideoVista and the automatically generated video instruction-tuning data VideoVista-Train.

  • [2024/5/31] 🔥 The checkpoint of Uni-MoE with 8 experts is now available for download and inference. For more details, please refer to the Uni_MoE_8e table.

  • [2024/4/28] 🔥 We have upgraded the Uni-MoE codebase to support training across multiple nodes and GPUs; explore this in our revamped fine-tuning script. We have also introduced a version that integrates distributed MoE modules, allowing the model to be trained with parallel processing at both the expert and modality levels for better efficiency and scalability. For more details, please refer to the Uni_MoE_v2 documentation.

  • [2024/3/7] 🔥 We released Uni-MoE: Scaling Unified Multimodal LLMs with Mixture of Experts, a unified multimodal LLM (MLLM) built on the MoE framework that can process diverse modalities, including audio, image, text, and video. Check out the paper and demo.

📀 Demo Video

👀 Uni-MoE-2.0-Omni

models_intro.mp4

👀 UniMoE-Audio

final-UniMoE_Audio.mp4

👀 Uni-MoE 1.0

Demo 2 shows real-time speech understanding (starting from 30 s).

demo1.mp4
demo2.mp4

🌟 Model Structure

🚀 Uni-MoE 2.0

We present Uni-MoE 2.0 from the Lychee family. As a fully open-source omnimodal large model (OLM), it substantially advances the capabilities of Lychee's Uni-MoE series in language-centric multimodal understanding, reasoning, and generation. Built on the Qwen2.5-7B dense architecture, Uni-MoE 2.0 is trained from scratch through three core contributions: a dynamic-capacity Mixture-of-Experts (MoE) design, a progressive training strategy enhanced with an iterative reinforcement strategy, and a carefully curated multimodal data matching technique. Uni-MoE 2.0 is capable of cross- and tri-modality understanding, as well as generating images, text, and speech.

🚀 UniMoE-Audio

UniMoE-Audio introduces a dynamic-capacity routing mechanism based on Top-P sampling for adaptive expert allocation, together with a hybrid expert design that separates domain-specific computation (dynamic experts) from universal representations (shared experts). To address data imbalance and task conflicts, UniMoE-Audio adopts a structured three-stage training curriculum. From voice cloning and text-to-speech (TTS) to text-to-music (T2M) and video-to-music (V2M), UniMoE-Audio supports diverse creative workflows. Extensive experiments confirm its state-of-the-art performance and superior cross-task synergy, paving the way toward universal audio generation.
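
The idea behind Top-P routing can be sketched as follows: instead of always picking a fixed Top-K set of experts, each token activates the smallest set of experts whose cumulative gate probability exceeds a threshold p. This is an illustrative sketch only; the function name, threshold value, and renormalization details are assumptions, not taken from the UniMoE-Audio codebase.

```python
import numpy as np

def top_p_expert_routing(gate_logits, p=0.7):
    """Route a token to the smallest expert set whose cumulative gate
    probability reaches p (hypothetical dynamic-capacity Top-P sketch)."""
    # Softmax over the expert gate logits.
    probs = np.exp(gate_logits - gate_logits.max())
    probs /= probs.sum()
    # Sort experts by descending gate probability.
    order = np.argsort(-probs)
    cum = np.cumsum(probs[order])
    # Smallest prefix of experts whose cumulative probability reaches p.
    k = int(np.searchsorted(cum, p)) + 1
    chosen = order[:k]
    # Renormalize the weights of the selected experts.
    weights = probs[chosen] / probs[chosen].sum()
    return chosen, weights

# A confident gate activates few experts; a flat gate activates more,
# so compute scales with routing uncertainty rather than being fixed.
experts, w = top_p_expert_routing(np.array([2.0, 1.0, 0.5, -1.0]), p=0.7)
```

The dynamic capacity falls out of the cumulative cutoff: peaked gate distributions select one or two experts, while near-uniform ones select more.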

🚀 Uni-MoE 1.0

The model architecture of Uni-MoE is shown below. Training proceeds in three stages: 1) use paired data from different modalities and languages to build connectors that map these inputs into a unified language space, establishing a foundation for multimodal understanding; 2) develop modality-specific experts using cross-modal data to ensure deep understanding, preparing for a cohesive multi-expert model; 3) incorporate the trained experts into the LLM and refine the unified multimodal model with the LoRA technique on mixed multimodal data.
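
Stage 3 refines the model with LoRA, which freezes the base weights and learns only a low-rank update. A minimal sketch of that idea, assuming illustrative dimensions and hyperparameters (r, alpha, and the class name are hypothetical, not Uni-MoE's actual configuration):

```python
import numpy as np

class LoRALinear:
    """Frozen base weight W plus a trainable low-rank update (alpha/r) * B @ A.
    Illustrative sketch only; shapes and hyperparameters are hypothetical."""
    def __init__(self, in_dim, out_dim, r=4, alpha=8, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.standard_normal((out_dim, in_dim)) * 0.02  # frozen during fine-tuning
        self.A = rng.standard_normal((r, in_dim)) * 0.02        # trainable down-projection
        self.B = np.zeros((out_dim, r))                         # trainable up-projection, zero-init
        self.scale = alpha / r

    def forward(self, x):
        # Base output plus the scaled low-rank correction.
        return self.W @ x + self.scale * (self.B @ (self.A @ x))

layer = LoRALinear(in_dim=16, out_dim=8)
x = np.ones(16)
y = layer.forward(x)  # equals W @ x at initialization, since B starts at zero
```

Because B is zero-initialized, fine-tuning starts exactly from the pretrained model's behavior, and only the small A and B matrices need gradients, which is what makes refining the unified multi-expert model on mixed multimodal data affordable.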

πŸ™ Star History

Star History Chart

❤️ Citation

If you find Uni-MoE useful for your research and applications, please cite using this BibTeX:

@article{li2025uni2omni,
  title={Uni-MoE-2.0-Omni: Scaling Language-Centric Omnimodal Large Model with Advanced MoE, Training and Data},
  author={Li, Yunxin and Chen, Xinyu and Jiang, Shenyuan and Shi, Haoyuan and Liu, Zhenyu and Zhang, Xuanyu and Deng, Nanhao and Xu, Zhenran and Ma, Yicheng and Zhang, Meishan and others},
  journal={arXiv preprint arXiv:2511.12609},
  year={2025}
}
@article{li_unimoe,
  title={Uni-MoE: Scaling Unified Multimodal LLMs With Mixture of Experts},
  author={Li, Yunxin and Jiang, Shenyuan and Hu, Baotian and Wang, Longyue and Zhong, Wanqi and Luo, Wenhan and Ma, Lin and Zhang, Min},
  journal={IEEE Transactions on Pattern Analysis and Machine Intelligence},
  year={2025},
  volume={47},
  number={5},
  pages={3424-3439},
  doi={10.1109/TPAMI.2025.3532688}
}
@article{liu2025unimoe,
  title={UniMoE-Audio: Unified Speech and Music Generation with Dynamic-Capacity MoE},
  author={Liu, Zhenyu and Li, Yunxin and Zhang, Xuanyu and Teng, Qixun and Jiang, Shenyuan and Chen, Xinyu and Shi, Haoyuan and Li, Jinchao and Wang, Qi and Chen, Haolan and others},
  journal={arXiv preprint arXiv:2510.13344},
  year={2025}
}

About

Uni-MoE: Lychee's Large Multimodal Model Family.
