Stars
Video-R1: Reinforcing Video Reasoning in MLLMs [🔥the first work to systematically explore R1 for video]
A curated list of Awesome Personalized Large Multimodal Models resources
CUDA Python: Performance meets Productivity
Code for "How far can we go with ImageNet for Text-to-Image generation?" paper
Official repository of "GoT: Unleashing Reasoning Capability of Multimodal Large Language Model for Visual Generation and Editing"
xDiT: A Scalable Inference Engine for Diffusion Transformers (DiTs) with Massive Parallelism
ELLA: Equip Diffusion Models with LLM for Enhanced Semantic Alignment
GenEval: An object-focused framework for evaluating text-to-image alignment
A Unified Tokenizer for Visual Generation and Understanding
Reflective Planning: Vision-Language Models for Multi-Stage Long-Horizon Robotic Manipulation
[ICLR 2025] VILA-U: a Unified Foundation Model Integrating Visual Understanding and Generation
High-Resolution 3D Assets Generation with Large Scale Hunyuan3D Diffusion Models
Cosmos is a world model development platform consisting of world foundation models, tokenizers, and a video processing pipeline to accelerate the development of Physical AI at Robotics & AV labs. C…
A suite of image and video neural tokenizers
High-performance Image Tokenizers for VAR and AR
SEED-Voken: A Series of Powerful Visual Tokenizers
[ICLR 2025][arXiv:2406.07548] Image and Video Tokenization with Binary Spherical Quantization
SimVQ: Addressing Representation Collapse in Vector Quantized Models with One Linear Layer
[CVPR 2025] Official code of "DiTCtrl: Exploring Attention Control in Multi-Modal Diffusion Transformer for Tuning-Free Multi-Prompt Longer Video Generation"
Official PyTorch implementation for LARP: Tokenizing Videos with a Learned Autoregressive Generative Prior (ICLR 2025 Oral)
[CVPR'25] Official PyTorch implementation of Lumos: Learning Visual Generative Priors without Text
[CVPR 2025] 🔥 Official impl. of "TokenFlow: Unified Image Tokenizer for Multimodal Understanding and Generation".
This repo contains evaluation code for the paper "AV-Odyssey: Can Your Multimodal LLMs Really Understand Audio-Visual Information?"
[ICLR 2025] SOTA discrete acoustic codec models with 40/75 tokens per second for audio language modeling
📖 This is a repository for organizing papers, codes and other resources related to unified multimodal models.
[ICLR 2025] Repository for Show-o, One Single Transformer to Unify Multimodal Understanding and Generation.