# Awesome-Multimodal-Reasoning

Contributions are most welcome. If you have any suggestions or improvements, feel free to open an issue or raise a pull request.

## Contents

- [Benchmark](#benchmark)
- [Model](#model)
  - [Image MLLM](#image-mllm)
  - [Video MLLM](#video-mllm)
  - [Image/Video Generation](#imagevideo-generation)
  - [LLM](#llm)
- [Data](#data)

## Benchmark

| Date | Project | Benchmark |
| --- | --- | --- |
| 24.05 | M3CoT: A Novel Benchmark for Multi-Domain Multi-step Multi-modal Chain-of-Thought [📑Paper] | M3CoT |
| 24.10 | MME-CoT: Benchmarking Chain-of-Thought in LMMs for Reasoning Quality, Robustness, and Efficiency [📑Paper][🖥️Code] | MME-CoT |
| 24.11 | VLRewardBench: A Challenging Benchmark for Vision-Language Generative Reward Models [📑Paper] | VLRewardBench |
| 25.01 | LlamaV-o1: Rethinking Step-By-Step Visual Reasoning in LLMs [📑Paper][🖥️Code] | VRC-Bench |
| 25.02 | OmniAlign-V: Towards Enhanced Alignment of MLLMs with Human Preference [📑Paper][🖥️Code] | MM-AlignBench |
| 25.02 | MM-RLHF: The Next Step Forward in Multimodal LLM Alignment [📑Paper] | MM-RLHF-RewardBench, MM-RLHF-SafetyBench |

## Model

### Image MLLM

| Date | Project | SFT | RL | Task |
| --- | --- | --- | --- | --- |
| 24.03 | Visual CoT: Advancing Multi-Modal Language Models with a Comprehensive Dataset and Benchmark for Chain-of-Thought Reasoning [📑Paper][🖥️Code] | Visual chain-of-thought dataset comprising 438k data items | - | Various VQA |
| 24.10 | Improve Vision Language Model Chain-of-thought Reasoning [📑Paper][🖥️Code] | 193k CoT SFT data generated by GPT-4o | DPO | Various VQA |
| 24.11 | LLaVA-CoT: Let Vision Language Models Reason Step-by-Step [📑Paper][🖥️Code] | LLaVA-CoT-100k generated by GPT-4o | - | Various VQA |
| 24.11 | Insight-V: Exploring Long-Chain Visual Reasoning with Multimodal Large Language Models [📑Paper][🖥️Code] | SFT for agents | Iterative DPO | Various VQA |
| 24.11 | Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization [📑Paper] | - | MPO | Various VQA |
| 25.01 | Virgo: A Preliminary Exploration on Reproducing o1-like MLLM [📑Paper][🖥️Code] | 2k text data from R1/QwQ and visual data from QvQ/SD | - | Math & MMMU |
| 25.01 | InternLM-XComposer2.5-Reward: A Simple Yet Effective Multi-Modal Reward Model [📑Paper][🖥️Code] | - | PPO | Reward & Various VQA |
| 25.01 | LlamaV-o1: Rethinking Step-By-Step Visual Reasoning in LLMs [📑Paper][🖥️Code] | LLaVA-CoT-100k & PixMo subset | - | VRC-Bench & Various VQA |
| 25.02 | MM-RLHF: The Next Step Forward in Multimodal LLM Alignment [📑Paper][🖥️Code] | - | DPO with 120k fine-grained, human-annotated preference comparison pairs | Reward & Various VQA |
| 25.02 | OmniAlign-V: Towards Enhanced Alignment of MLLMs with Human Preference [📑Paper][🖥️Code] | 200k SFT data | DPO | Alignment & Various VQA |
| 25.02 | Multimodal Open R1 [🖥️Code] | - | GRPO | MathVista-mini, MMMU |
| 25.02 | VLM-R1: A stable and generalizable R1-style Large Vision-Language Model [🖥️Code] | - | GRPO | Referring Expression Comprehension |
| 25.02 | R1-V: Reinforcing Super Generalization Ability in Vision Language Models with Less Than $3 [🖥️Code] | - | GRPO | Item Counting, Number-Related Reasoning, and Geometry Reasoning |
| 25.03 | EasyR1: An Efficient, Scalable, Multi-Modality RL Training Framework [🖥️Code] | - | GRPO | Geometry3K |
| 25.03 | Unified Reward Model for Multimodal Understanding and Generation [📑Paper][🖥️Code] | - | DPO | Various VQA & Generation |
| 25.03 | MM-EUREKA: Exploring Visual Aha Moment with Rule-based Large-scale Reinforcement Learning [📑Paper][🖥️Code] | - | RLOO | Math |
| 25.03 | R1-Zero’s “Aha Moment” in Visual Reasoning on a 2B Non-SFT Model [📑Paper][🖥️Code] | - | GRPO | CVBench |
| 25.03 | Seg-Zero: Reasoning-Chain Guided Segmentation via Cognitive Reinforcement [📑Paper][🖥️Code] | - | GRPO | RefCOCO & ReasonSeg |
| 25.03 | Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models [📑Paper][🖥️Code] | - | GRPO | Math |
| 25.03 | Boosting the Generalization and Reasoning of Vision Language Models with Curriculum Reinforcement Learning [📑Paper] | Self-improvement training | GRPO | Detection, Classification, Math |
| 25.03 | LMM-R1: Empowering 3B LMMs with Strong Reasoning Abilities Through Two-Stage Rule-Based RL [📑Paper][🖥️Code] | - | PPO | Math, Sokoban-Global, Football-Online |
| 25.03 | Visual-RFT: Visual Reinforcement Fine-Tuning [📑Paper][🖥️Code] | - | GRPO | Detection, Grounding, Classification |
| 25.03 | VisRL: Intention-Driven Visual Perception via Reinforced Reasoning [📑Paper][🖥️Code] | Warm-up | DPO | Various VQA |
| 25.03 | (CVPR 2025) GFlowVLM: Enhancing Multi-step Reasoning in Vision-Language Models with Generative Flow Networks [📑Paper] | - | GFlowNets | NumberLine (NL) and BlackJack (BJ) |
| 25.03 | MMR1: Advancing the Frontiers of Multimodal Reasoning [🖥️Code] | - | GRPO | Math |
| 25.03 | R1-Onevision: Advancing Generalized Multimodal Reasoning through Cross-Modal Formalization [📑Paper][🖥️Code] | Ongoing | Ongoing | Ongoing |

### Video MLLM

| Date | Project | SFT | RL | Task |
| --- | --- | --- | --- | --- |
| 25.01 | Temporal Preference Optimization for Long-Form Video Understanding [📑Paper][🖥️Code] | - | DPO | Various video QA |
| 25.01 | Tarsier2: Advancing Large Vision-Language Models from Detailed Video Description to Comprehensive Video Understanding [📑Paper][🖥️Code] | Main training | DPO | Video caption & QA |
| 25.02 | video-SALMONN-o1: Reasoning-enhanced Audio-visual Large Language Model [📑Paper] | Cold start | DPO | Various video QA |
| 25.02 | Open-R1-Video [🖥️Code] | - | GRPO | LongVideoBench |
| 25.02 | Video-R1: Towards Super Reasoning Ability in Video Understanding [🖥️Code] | - | GRPO | DVD-counting |
| 25.03 | R1-Omni: Explainable Omni-Multimodal Emotion Recognition with Reinforcement Learning [📑Paper][🖥️Code] | Cold start | GRPO | Emotion recognition |

### Image/Video Generation

| Date | Project | Comment |
| --- | --- | --- |
| 25.02 | C-Drag: Chain-of-Thought Driven Motion Controller for Video Generation | Computes simple motion vectors with an LLM. |
| 25.01 | Can We Generate Images with CoT? Let's Verify and Reinforce Image Generation Step by Step | Potential Assessment Reward Model for AR image generation. |
| 25.01 | Imagine while Reasoning in Space: Multimodal Visualization-of-Thought | Visualization-of-Thought. |
| 25.01 | ReFocus: Visual Editing as a Chain of Thought for Structured Image Understanding | Edits (draws on) the image as intermediate reasoning steps. |
| 24.12 | EVLM: Self-Reflective Multimodal Reasoning for Cross-Dimensional Visual Editing | Thinks in text space with a caption model. |

### LLM

| Date | Project | Comment |
| --- | --- | --- |
| - | Multimodal Chain-of-Thought Reasoning in Language Models [🖥️Code] | - |

## Data

| Date | Project | Comment |
| --- | --- | --- |
| 24.11 | VideoEspresso: A Large-Scale Chain-of-Thought Dataset for Fine-Grained Video Reasoning via Core Frame Selection [📑Paper][🖥️Code] | Various video QA |
| 25.03 | Integrating Chain-of-Thought for Multimodal Alignment: A Study on 3D Vision-Language Learning [📑Paper][Data] | 3D-CoT Benchmark |
