Stars
LangBridge: Interpreting Image as a Combination of Language Embeddings
[CVPR 2025] Official inference code for the paper "BizGen: Advancing Article-level Visual Text Rendering for Infographics Generation". Project page: https://bizgen-msra.github.io/
A suite of image and video neural tokenizers
Cosmos is a world model development platform that consists of world foundation models, tokenizers, and a video processing pipeline to accelerate the development of Physical AI at Robotics & AV labs. C…
TokenBridge: Bridging Continuous and Discrete Tokens for Autoregressive Visual Generation
[NeurIPS'24 Spotlight] Visual CoT: Advancing Multi-Modal Language Models with a Comprehensive Dataset and Benchmark for Chain-of-Thought Reasoning
Clean, minimal, accessible reproduction of DeepSeek R1-Zero
The code for creating the iGSM datasets in the papers "Physics of Language Models Part 2.1, Grade-School Math and the Hidden Reasoning Process" (arXiv 2407.20311) and "Physics of Language Models Part 2…
WideRange4D: Enabling High-Quality 4D Reconstruction with Wide-Range Movements and Scenes
A Token-level Text Image Foundation Model for Document Understanding
This is the first paper to explore how to effectively use RL for MLLMs and introduce Vision-R1, a reasoning MLLM that leverages cold-start initialization and RL training to incentivize reasoning ca…
Official repository of "GoT: Unleashing Reasoning Capability of Multimodal Large Language Model for Visual Generation and Editing"
Official code for the paper "Two Effects, One Trigger: On the Modality Gap, Object Bias, and Information Imbalance in Contrastive Vision-Language Models" (ICLR 2025 Oral)
[ICLR'25] Official code for the paper "MLLMs Know Where to Look: Training-free Perception of Small Visual Details with Multimodal LLMs"
UniToken is an auto-regressive generation model that combines discrete and continuous representations to process visual inputs, making it easy to integrate both visual understanding and image gener…
[ECCV 2024] API: Attention Prompting on Image for Large Vision-Language Models
Official code for "F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching"
🔥CVPR 2025 Multimodal Large Language Models Paper List
An easy-to-use, scalable, and high-performance RLHF framework designed for multimodal models.
Awesome-Long2short-on-LRMs is a collection of state-of-the-art, novel, and exciting long2short methods for large reasoning models. It contains papers, code, datasets, evaluations, and analyses.
WISE: A World Knowledge-Informed Semantic Evaluation for Text-to-Image Generation
[NeurIPS 2024] Code for the paper "Diffusion of Thoughts: Chain-of-Thought Reasoning in Diffusion Language Models"
MM-EUREKA: Exploring Visual Aha Moment with Rule-based Large-scale Reinforcement Learning