Contributions are most welcome! If you have any suggestions or improvements, feel free to open an issue or submit a pull request.
| Date | Project | Task |
|---|---|---|
| 24.05 | M3CoT: A Novel Benchmark for Multi-Domain Multi-step Multi-modal Chain-of-Thought [📑Paper] | M3CoT |
| 24.10 | MME-CoT: Benchmarking Chain-of-Thought in LMMs for Reasoning Quality, Robustness, and Efficiency [📑Paper][🖥️Code] | MME-CoT |
| 24.11 | VLRewardBench: A Challenging Benchmark for Vision-Language Generative Reward Models [📑Paper] | VLRewardBench |
| 25.01 | LlamaV-o1: Rethinking Step-By-Step Visual Reasoning in LLMs [📑Paper][🖥️Code] | VRCBench |
| 25.02 | OmniAlign-V: Towards Enhanced Alignment of MLLMs with Human Preference [📑Paper][🖥️Code] | MM-AlignBench |
| 25.02 | MM-RLHF: The Next Step Forward in Multimodal LLM Alignment [📑Paper] | MM-RLHF-RewardBench, MM-RLHF-SafetyBench |
| Date | Project | SFT | RL | Task |
|---|---|---|---|---|
| 24.03 | Visual CoT: Advancing Multi-Modal Language Models with a Comprehensive Dataset and Benchmark for Chain-of-Thought Reasoning [📑Paper][🖥️Code] | visual chain-of-thought dataset comprising 438k data items | - | Various VQA |
| 24.10 | Improve Vision Language Model Chain-of-thought Reasoning [📑Paper][🖥️Code] | 193k CoT SFT examples generated by GPT-4o | DPO | Various VQA |
| 24.11 | LLaVA-CoT: Let Vision Language Models Reason Step-by-Step [📑Paper][🖥️Code] | LLaVA-CoT-100k generated by GPT-4o | - | Various VQA |
| 24.11 | Insight-V: Exploring Long-Chain Visual Reasoning with Multimodal Large Language Models [📑Paper][🖥️Code] | SFT for the agent system | Iterative DPO | Various VQA |
| 24.11 | Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization [📑Paper] | - | MPO | Various VQA |
| 25.01 | Virgo: A Preliminary Exploration on Reproducing o1-like MLLM [📑Paper][🖥️Code] | 2k text data from R1/QwQ and visual data from QvQ/SD | - | Math & MMMU |
| 25.01 | InternLM-XComposer2.5-Reward: A Simple Yet Effective Multi-Modal Reward Model [📑Paper][🖥️Code] | - | PPO | Reward & Various VQA |
| 25.01 | LlamaV-o1: Rethinking Step-By-Step Visual Reasoning in LLMs [📑Paper][🖥️Code] | LLaVA-CoT-100k & PixMo [13] subset | - | VRC-Bench & Various VQA |
| 25.02 | MM-RLHF: The Next Step Forward in Multimodal LLM Alignment [📑Paper][🖥️Code] | - | DPO with 120k fine-grained, human-annotated preference comparison pairs | Reward & Various VQA |
| 25.02 | OmniAlign-V: Towards Enhanced Alignment of MLLMs with Human Preference [📑Paper][🖥️Code] | 200k SFT examples | DPO | Alignment & Various VQA |
| 25.02 | Multimodal Open R1 [🖥️Code] | - | GRPO | Mathvista-mini, MMMU |
| 25.02 | VLM-R1: A stable and generalizable R1-style Large Vision-Language Model [🖥️Code] | - | GRPO | Referring Expression Comprehension |
| 25.02 | R1-V: Reinforcing Super Generalization Ability in Vision Language Models with Less Than $3 [🖥️Code] | - | GRPO | Item Counting, Number-Related Reasoning, and Geometry Reasoning |
| 25.03 | EasyR1: An Efficient, Scalable, Multi-Modality RL Training Framework [🖥️Code] | - | GRPO | Geometry3K |
| 25.03 | Unified Reward Model for Multimodal Understanding and Generation [📑Paper][🖥️Code] | - | DPO | Various VQA & Generation |
| 25.03 | MM-EUREKA: Exploring Visual Aha Moment with Rule-based Large-scale Reinforcement Learning [📑Paper][🖥️Code] | - | RLOO | Math |
| 25.03 | R1-Zero’s “Aha Moment” in Visual Reasoning on a 2B Non-SFT Model [📑Paper][🖥️Code] | - | GRPO | CVBench |
| 25.03 | Seg-Zero: Reasoning-Chain Guided Segmentation via Cognitive Reinforcement [📑Paper][🖥️Code] | - | GRPO | RefCOCO & ReasonSeg |
| 25.03 | Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models [📑Paper][🖥️Code] | - | GRPO | Math |
| 25.03 | Boosting the Generalization and Reasoning of Vision Language Models with Curriculum Reinforcement Learning [📑Paper] | Self-Improvement Training | GRPO | Detection, Classification, Math |
| 25.03 | LMM-R1: Empowering 3B LMMs with Strong Reasoning Abilities Through Two-Stage Rule-Based RL [📑Paper][🖥️Code] | - | PPO | Math, Sokoban-Global, Football-Online |
| 25.03 | Visual-RFT: Visual Reinforcement Fine-Tuning [📑Paper][🖥️Code] | - | GRPO | Detection, Grounding, Classification |
| 25.03 | VisRL: Intention-Driven Visual Perception via Reinforced Reasoning [📑Paper][🖥️Code] | warm up | DPO | Various VQA |
| 25.03 (CVPR2025) | GFlowVLM: Enhancing Multi-step Reasoning in Vision-Language Models with Generative Flow Networks [📑Paper] | - | GFlowNets | NumberLine (NL) and BlackJack (BJ) |
| 25.03 | MMR1: Advancing the Frontiers of Multimodal Reasoning [🖥️Code] | - | GRPO | Math |
| 25.03 | R1-Onevision: Advancing Generalized Multimodal Reasoning through Cross-Modal Formalization [📑Paper][🖥️Code] | ongoing | ongoing | ongoing |
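Many entries in the table above rely on DPO for the RL stage. As a rough, generic illustration of the objective these projects optimize (not any specific project's implementation), the DPO loss for a single preference pair can be sketched as:

```python
import math

def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one preference pair (illustrative sketch).

    Inputs are summed sequence log-probabilities of the chosen and
    rejected responses under the trained policy and the frozen
    reference model; beta controls the strength of the KL constraint.
    """
    margin = (policy_logp_chosen - ref_logp_chosen) \
           - (policy_logp_rejected - ref_logp_rejected)
    # -log(sigmoid(beta * margin)); small when the policy already
    # prefers the chosen response more strongly than the reference does
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

print(round(dpo_loss(-10.0, -20.0, -15.0, -18.0), 4))  # → 0.4032
```

With zero margin the loss is log 2, and it decreases as the policy's relative preference for the chosen response grows, which is what drives the preference-alignment stage in these pipelines.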
| Date | Project | SFT | RL | Task |
|---|---|---|---|---|
| 25.01 | Temporal Preference Optimization for Long-Form Video Understanding [📑Paper][🖥️Code] | - | DPO | various video QA |
| 25.01 | Tarsier2: Advancing Large Vision-Language Models from Detailed Video Description to Comprehensive Video Understanding [📑Paper][🖥️Code] | main training | DPO | Video caption & QA |
| 25.02 | video-SALMONN-o1: Reasoning-enhanced Audio-visual Large Language Model [📑Paper] | cold start | DPO | various video QA |
| 25.02 | Open-R1-Video[🖥️Code] | - | GRPO | LongVideoBench |
| 25.02 | Video-R1: Towards Super Reasoning Ability in Video Understanding [🖥️Code] | - | GRPO | DVD-counting |
| 25.03 | R1-Omni: Explainable Omni-Multimodal Emotion Recognition with Reinforcement Learning [📑Paper][🖥️Code] | cold start | GRPO | Emotion recognition |
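GRPO, used by most of the R1-style projects listed above, replaces PPO's learned value critic with group-relative advantages: several responses are sampled per prompt and each reward is normalized against the group's statistics. A minimal sketch of that normalization step (assumptions: verifier rewards are already computed; the zero-spread guard is a common but implementation-specific choice):

```python
import statistics

def grpo_advantages(rewards):
    """Group-relative advantages in the style of GRPO: normalize each
    sampled response's reward by the group mean and standard deviation,
    so no value network is needed."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against zero spread
    return [(r - mean) / std for r in rewards]

# A rule-based verifier might score 4 rollouts of one prompt as 1/0:
print(grpo_advantages([1.0, 0.0, 0.0, 1.0]))  # → [1.0, -1.0, -1.0, 1.0]
```

These advantages then weight the token-level policy-gradient update, typically alongside a KL penalty toward the reference model.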
| Date | Proj | Comment |
|---|---|---|
| 25.02 | C-Drag: Chain-of-Thought Driven Motion Controller for Video Generation | Calculates simple motion vectors with an LLM. |
| 25.01 | Can We Generate Images with CoT? Let's Verify and Reinforce Image Generation Step by Step | Potential Assessment Reward Model for AR Image Generation. |
| 25.01 | Imagine while Reasoning in Space: Multimodal Visualization-of-Thought | Visualization-of-Thought |
| 25.01 | ReFocus: Visual Editing as a Chain of Thought for Structured Image Understanding | Draw something! |
| 24.12 | EVLM: Self-Reflective Multimodal Reasoning for Cross-Dimensional Visual Editing | Thinks in text space via a captioning model. |
| Date | Project | Comment |
|---|---|---|
| 23.02 | Multimodal Chain-of-Thought Reasoning in Language Models [🖥️Code] | - |
| Date | Project | Comment |
|---|---|---|
| 24.11 | VideoEspresso: A Large-Scale Chain-of-Thought Dataset for Fine-Grained Video Reasoning via Core Frame Selection [📑Paper][🖥️Code] | various video QA |
| 25.03 | Integrating Chain-of-Thought for Multimodal Alignment: A Study on 3D Vision-Language Learning [📑Paper][Data] | 3D-CoT Benchmark |