With the release of OpenAI o1 and DeepSeek-R1, reasoning models have yielded remarkably promising results and garnered significant attention from the research community. This development signals that reasoning models represent a critical advancement toward Artificial General Intelligence (AGI). The standard reasoning paradigm can be formally defined as:
- Standard Reasoning: The model conducts a comprehensive intermediate reasoning phase prior to generating the final response. This intermediate reasoning typically manifests as unstructured textual content, with the entire inference process constituting a single atomic operation.
Recently, the introduction of OpenAI o3, Deep research, Zochi, and BAGEL has established an alternative reasoning formulation, which we designate as Interleaving Reasoning. In contrast to standard reasoning, Interleaving Reasoning is characterized by multi-turn interactions and exhibits more sophisticated reasoning dynamics. This reasoning modality has empirically demonstrated superior accuracy on complex problems. Consequently, we posit that Interleaving Reasoning may constitute the next generation of reasoning systems for AGI. We propose a taxonomy of Interleaving Reasoning that encompasses the following categories (a minimal control-flow sketch contrasting it with standard reasoning follows the list):
- Multimodal Interleaving Reasoning: The model's inference process operates on diverse information modalities (e.g., textual, visual, auditory, video). This involves an intricately interleaved execution of modality-specific information processing and cross-modal reasoning. Examples: OpenAI o3, DeepEyes.
- Multi-Round Acting Interleaving Reasoning: The system achieves task completion through iterative interactions (actions) with the environment. Each action is either predicated upon or performed in conjunction with a reasoning-driven inference step, establishing an interleaved execution of action and inference processes. Examples: Deep research, Search-R1, ReTool, UI-TARS, ReAct.
- Multi-Agent Interleaving Reasoning: In a multi-agent system, multiple agents, such as LLMs and MLLMs, engage in collaborative or competitive dynamics via a paradigm of interleaved reasoning. Agents may alternate in contributing discrete reasoning steps, share intermediate conclusions to establish a common cognitive state and build upon it, or otherwise exert mutual influence on one another's inference processes. Examples: Society of Minds, Zochi, MetaGPT.
- Unified Understanding and Generation Interleaving Reasoning: The model's reasoning capabilities are not confined to producing solely unimodal outputs. Instead, it strategically generates multimodal content (e.g., textual and visual elements) as an integral intermediate step within its intrinsic processes of comprehension and problem-solving. Examples: GoT, T2I-R1, BAGEL.
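
To make the distinction concrete, the minimal Python sketch below contrasts the two control flows. It is illustrative only: `generate` and `run_tool` are hypothetical stand-ins for a reasoning model and an environment/tool interface, not the API of any specific system listed here.

```python
def generate(prompt: str) -> str:
    """Hypothetical stand-in for a single LLM/MLLM generation call."""
    return "<think>...</think><answer>42</answer>"

def run_tool(step: str) -> str:
    """Hypothetical stand-in for one environment step (search, code run, UI action, ...)."""
    return f"observation for: {step}"

def standard_reasoning(question: str) -> str:
    # One atomic pass: all intermediate reasoning is produced in a single
    # generation before the final answer; no external feedback enters mid-stream.
    return generate(question)

def interleaving_reasoning(question: str, max_turns: int = 5) -> str:
    # Multi-turn loop: reasoning steps alternate with external feedback
    # (tool results, retrieved text, rendered images, other agents' views),
    # which is appended to the context before the next reasoning step.
    context = question
    for _ in range(max_turns):
        step = generate(context)
        if "<answer>" in step:                          # the model decides it is done
            return step
        context += "\n" + step + "\n" + run_tool(step)  # fold the observation back in
    return generate(context + "\nGive the final answer now.")
```

The essential difference is where external feedback enters: standard reasoning finishes its entire chain of thought in one generation, whereas interleaving reasoning folds observations back into the context between reasoning steps.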
It is imperative to establish precise categorical boundaries:
- While Multimodal Interleaving Reasoning could conceivably be subsumed within the Multi-Round Acting Interleaving Reasoning paradigm, we formally define Multimodal Interleaving Reasoning as necessitating the direct incorporation of multi-modal information streams during the reasoning process. This information typically derives from the processing of input modalities, as exemplified by OpenAI o3, which extracts visual information and integrates it into text-based reasoning workflows.
- The fundamental distinction between Multi-Round Acting Interleaving Reasoning and Multi-Agent Interleaving Reasoning lies in their architectural composition: Multi-Round Acting Interleaving Reasoning typically employs a single LLM/MLLM to perform reasoning and determine subsequent actions. Conversely, Multi-Agent Interleaving Reasoning leverages multiple LLM/MLLM entities that collaboratively contribute to reasoning steps.
- The differentiation between Unified Understanding and Generation Interleaving Reasoning and Multimodal Interleaving Reasoning resides in their information processing mechanisms. Unified Understanding and Generation Interleaving Reasoning employs a unified understanding and generation model capable of directly generating multimodal outputs during the reasoning process. In contrast, Multimodal Interleaving Reasoning typically sources its multimodal information from external systems or processes.
We aim to provide the community with a comprehensive and timely synthesis of this fascinating and promising field, along with our insights into it. We hope this repository serves as a valuable reference for researchers working on Interleaving Reasoning. Please start your exploration!
This work is in progress!
Table of Contents
- Our Group
- Our Activities
- Standard Reasoning Examples
- Awesome Interleaving Reasoning Papers
- Awesome Datasets
- Wenxuan Huang (ECNU & CUHK)
- Zhenfei Yin (USYD & Oxford)
🔥🔥🔥 ICCV 2025 Workshop on Multi-Modal Reasoning for Agentic Intelligence (MMRAgi-2025)
Submission deadlines: Proceeding Track, 24 June 2025, 23:59 AoE; Non-Proceeding Track, 24 July 2025, 23:59 AoE.
🔥🔥🔥 Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models
🔥🔥🔥 DeepEyes: Incentivizing “Thinking with Images” via Reinforcement Learning
- [OpenAI o1] Introducing OpenAI o1
- [DeepSeek-R1] DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning [🤗Models] [💻Code]
- [Kimi-k1.5] Kimi k1.5: Scaling Reinforcement Learning with LLMs [💻Code]
- [QVQ-Max] QVQ-Max: Think with Evidence
- [Vision-R1] Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models [🤗Models] [🤗Datasets] [💻Code]
PR Template
- [RL] [2505] [DeepEyes] DeepEyes: Incentivizing "Thinking with Images" via Reinforcement Learning [🌐Project] [🤗Models] [🤗Datasets] [💻Code]
You can select your category from [Pretrain, SFT, RL, Prompt, Position paper, Survey paper], among others. Categories can also be combined, for example, SFT+RL.
Definition: The model's inference process operates on diverse information modalities (e.g., textual, visual, auditory, video). This involves an intricately interleaved execution of modality-specific information processing and cross-modal reasoning.
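As a rough illustration of this definition, the sketch below shows an o3/DeepEyes-style "thinking with images" loop in which the model may request a crop of the input image and the cropped view is fed back into its context. The `<crop>` tag protocol and the `vlm_generate` stub are assumptions made for illustration, not the interface of any listed system.

```python
import re
from PIL import Image

def vlm_generate(messages) -> str:
    """Hypothetical stand-in for a vision-language model call over interleaved text/images."""
    return "<answer>done</answer>"

def solve(image: Image.Image, question: str, max_turns: int = 4) -> str:
    messages = [question, image]
    for _ in range(max_turns):
        step = vlm_generate(messages)
        match = re.search(r"<crop>(\d+),(\d+),(\d+),(\d+)</crop>", step)
        if match is None:
            return step  # no further visual operation requested; treat as final
        # The model asked to inspect a region: crop/zoom and feed the new view
        # back into its context, interleaving pixel evidence with text reasoning.
        left, top, right, bottom = map(int, match.groups())
        messages += [step, image.crop((left, top, right, bottom))]
    return vlm_generate(messages + ["Answer now."])

# Usage with a blank placeholder image:
print(solve(Image.new("RGB", (640, 480)), "What is written on the sign?"))
```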
- [SFT+RL] [2505] [CoF] Chain-of-Focus: Adaptive Visual Search and Zooming for Multimodal Reasoning via RL [🌐Project] [💻Code] [🤗Models] [🤗Datasets]
- [SFT+RL] [2505] [Pixel Reasoner] Pixel Reasoner: Incentivizing Pixel-Space Reasoning with Curiosity-Driven Reinforcement Learning [🌐Project] [💻Code] [🤗Models] [🤗Datasets]
- [SFT+RL] [2505] [V-Triune] One RL to See Them All: Visual Triple Unified Reinforcement Learning [💻Code] [🤗Models] [🤗Datasets]
- [SFT+RL] [2505] [ViGoRL] Grounded Reinforcement Learning for Visual Reasoning [🌐Project]
- [SFT+RL] [2505] [VLM-R³] VLM-R³: Region Recognition, Reasoning, and Refinement for Enhanced Multimodal Chain-of-Thought
- [RL] [2505] [Ground-R1] Ground-R1: Incentivizing Grounded Visual Reasoning via Reinforcement Learning
- [RL] [2505] [GRIT] GRIT: Teaching MLLMs to Think with Images [🌐Project] [💻Code] [🤗Models]
- [SFT+RL] [2505] [Visual-ARFT] Visual Agentic Reinforcement Fine-Tuning [💻Code] [🤗Models] [🤗Datasets]
- [Prompt] [2505] [VAT] Visual Abstract Thinking Empowers Multimodal Reasoning [💻Code]
- [SFT] [2505] [MathCoder-VL] MathCoder-VL: Bridging Vision and Code for Enhanced Multimodal Mathematical Reasoning [💻Code] [🤗Models] [🤗Datasets]
- [SFT] [2505] [v1] Don’t Look Only Once: Towards Multimodal Interactive Reasoning with Selective Visual Revisitation [💻Code] [🤗Models]
- [SFT+RL] [2505] [OpenThinkIMG] OpenThinkIMG: Learning to Think with Images via Visual Tool Reinforcement Learning [💻Code]
- [RL] [2505] [DeepEyes] DeepEyes: Incentivizing "Thinking with Images" via Reinforcement Learning [🌐Project] [🤗Models] [🤗Datasets] [💻Code]
- [2504] [OpenAI o3] Introducing OpenAI o3 and o4-mini
- [SFT] [2503] [CoT-VLA] CoT-VLA: Visual Chain-of-Thought Reasoning for Vision-Language-Action Models [🌐Project]
- [SFT] [2501] [MVoT] Imagine while Reasoning in Space: Multimodal Visualization-of-Thought [🌐Project] [💻Code] [🤗Models] [🤗Datasets]
- [Prompt] [2406] [Sketchpad] Visual Sketchpad: Sketching as a Visual Chain of Thought for Multimodal Language Models [🌐Project] [💻Code]
- [SFT] [2403] [Visual CoT] Visual CoT: Advancing Multi-Modal Language Models with a Comprehensive Dataset and Benchmark for Chain-of-Thought Reasoning [🌐Project] [💻Code] [🤗Models] [🤗Datasets]
- [SFT] [2312] [V*] V*: Guided Visual Search as a Core Mechanism in Multimodal LLMs [🌐Project] [💻Code]
- [Prompt] [2211] [VISPROG] Visual Programming: Compositional visual reasoning without training [🌐Project] [💻Code]
Definition: The system achieves task completion through iterative interactions (actions) with the environment. Each action is either predicated upon or performed in conjunction with a reasoning-driven inference step, establishing an interleaved execution of action and inference processes.
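A minimal sketch of this action-reasoning interleaving, loosely following the ReAct / Search-R1 style of emitting a search action between reasoning steps, is shown below. The `<search>`/`<information>` tag protocol and the `generate`/`retrieve` stubs are illustrative assumptions rather than any paper's exact interface.

```python
import re

def generate(context: str) -> str:
    """Hypothetical stand-in for one LLM generation turn."""
    return "<think>...</think><answer>Paris</answer>"

def retrieve(query: str, k: int = 3) -> str:
    """Hypothetical stand-in for a search engine or retriever."""
    return f"top-{k} documents for '{query}'"

def act_and_reason(question: str, max_steps: int = 6) -> str:
    context = question
    for _ in range(max_steps):
        step = generate(context)
        answer = re.search(r"<answer>(.*?)</answer>", step, re.S)
        if answer:
            return answer.group(1)
        query = re.search(r"<search>(.*?)</search>", step, re.S)
        if query:
            # Action: query the environment, then interleave the observation
            # back into the context so the next reasoning step can use it.
            context += step + f"\n<information>{retrieve(query.group(1))}</information>\n"
        else:
            context += step  # a pure reasoning step with no action
    return "no answer within the step budget"

print(act_and_reason("What is the capital of France?"))
```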
- [RL] [2506] [ReasoningSearch] Reinforcement Fine-Tuning for Reasoning towards Multi-Step Multi-Source Search in Large Language Models [💻Code]
- [RL] [2506] [R-Search] R-Search: Empowering LLM Reasoning with Search via Multi-Reward Reinforcement Learning [💻Code] [🤗Models]
- [RL] [2505] [Search-R1] An Empirical Study on Reinforcement Learning for Reasoning-Search Interleaved LLM Agents [💻Code]
- [RL] [2505] [InForage] Scent of Knowledge: Optimizing Search-Enhanced Reasoning with Information Foraging
- [RL] [2505] [ManuSearch] ManuSearch: Democratizing Deep Search in Large Language Models with a Transparent and Open Multi-Agent Framework [💻Code]
- [RL] [2505] [O2-Searcher] O2-Searcher: A Searching-based Agent Model for Open-Domain Open-Ended Question Answering [💻Code]
- [RL] [2505] [R3-RAG] R3-RAG: Learning Step-by-Step Reasoning and Retrieval for LLMs via Reinforcement Learning [💻Code]
- [SFT] [2505] [SimpleDeepSearcher] SimpleDeepSearcher: Deep Information Seeking via Web-Powered Reasoning Trajectory Synthesis [💻Code] [🤗Models] [🤗Datasets]
- [SFT+RL] [2505] [EvolveSearch] EvolveSearch: An Iterative Self-Evolving Search Agent
- [RL] [2505] [Pangu DeepDiver] Pangu DeepDiver: Adaptive Search Intensity Scaling via Open-Web Reinforcement Learning
- [RL] [2505] [ZeroSearch] ZeroSearch: Incentivize the Search Capability of LLMs without Searching [🌐Project] [💻Code] [🤗Datasets] [🤗Models]
- [SFT+RL] [2505] [R1-Searcher++] R1-Searcher++: Incentivizing the Dynamic Knowledge Acquisition of LLMs via Reinforcement Learning [💻Code]
- [RL] [2505] [AutoRefine] Search and Refine During Think: Autonomous Retrieval-Augmented Reasoning of LLMs [💻Code] [🤗Models]
- [RL] [2504] [DeepResearcher] DeepResearcher: Scaling Deep Research via Reinforcement Learning in Real-world Environments [💻Code]
- [RL] [2504] [WebThinker] WebThinker: Empowering Large Reasoning Models with Deep Research Capability [💻Code] [🤗Models]
- [RL] [2503] [ReSearch] ReSearch: Learning to Reason with Search for LLMs via Reinforcement Learning [🤗Models] [🤗Datasets] [💻Code]
- [RL] [2503] [R1-Searcher] R1-Searcher: Incentivizing the Search Capability in LLMs via Reinforcement Learning [🤗Models] [🤗Datasets] [💻Code]
- [RL] [2503] [Search-R1] Search-R1: Training LLMs to Reason and Leverage Search Engines with Reinforcement Learning [💻Code]
- [2502] [Deep research] Introducing deep research
- [Prompt] [2501] [Search-o1] Search-o1: Agentic Search-Enhanced Large Reasoning Models [💻Code]
- [RL] [2112] [WebGPT] WebGPT: Browser-assisted question-answering with human feedback [🤗Datasets]
- [RL] [2506] [Hint-Engineering] CoRT: Code-integrated Reasoning within Thinking [💻Code] [🤗Models]
- [RL] [2506] [CTM] Computational Thinking Reasoning in Large Language Models
- [Prompt] [2506] [AUTOMIND] AUTOMIND: Adaptive Knowledgeable Agent for Automated Data Science [💻Code]
- [SFT] [2506] [KnowCoder-V2] KnowCoder-V2: Deep Knowledge Analysis
- [RL] [2505] [VTool-R1] VTool-R1: VLMs Learn to Think with Images via Reinforcement Learning on Multimodal Tool Use [💻Code]
- [RL] [2505] [Tool-Star] Tool-Star: Empowering LLM-Brained Multi-Tool Reasoner via Reinforcement Learning [💻Code] [🤗Models]
- [SFT+RL] [2504] [ReTool] ReTool: Reinforcement Learning for Strategic Tool Use in LLMs [🌐Project] [💻Code] [🤗Models] [🤗Datasets]
- [RL] [2504] [SQL-R1] SQL-R1: Training Natural Language to SQL Reasoning Model By Reinforcement Learning [💻Code] [🤗Models]
- [Prompt] [2502] [Agentic Reasoning] Agentic Reasoning: Reasoning LLMs with Tools for the Deep Research
- [SFT] [2412] [CoinMath] CoinMath: Harnessing the Power of Coding Instruction for Math LLMs
- [SFT] [2410] [MathCoder2] MathCoder2: Better Math Reasoning from Continued Pretraining on Model-translated Mathematical Code [🤗Models] [🤗Datasets] [💻Code]
- [SFT] [2408] [SIAM] SIAM: Self-Improving Code-Assisted Mathematical Reasoning of Large Language Models [💻Code]
- [SFT] [2312] [VPD] Visual Program Distillation: Distilling Tools and Programmatic Reasoning into Vision-Language Models
- [SFT] [2310] [MathCoder] MathCoder: Seamless Code Integration in LLMs for Enhanced Mathematical Reasoning [🤗Models] [🤗Datasets] [💻Code]
- [SFT] [2309] [ToRA] ToRA: A Tool-Integrated Reasoning Agent for Mathematical Problem Solving [🌐Project] [🤗Models] [🤗Datasets] [💻Code]
- [SFT] [2506] [GUI-Reflection] GUI-Reflection: Empowering Multimodal GUI Models with Self-Reflection Behavior [🌐Project] [🤗Models] [🤗Datasets] [💻Code]
- [SFT+RL] [2506] [ComfyUI-R1] ComfyUI-R1: Exploring Reasoning Models for Workflow Generation [💻Code]
- [RL] [2506] [TTI] Thinking vs. Doing: Agents that Reason by Scaling Test-Time Interaction [🌐Project] [🤗Models] [💻Code]
- [RL] [2506] [WebAgent-R1] WebAgent-R1: Training Web Agents via End-to-End Multi-Turn Reinforcement Learning [💻Code]
- [RL] [2505] [ZeroGUI] ZeroGUI: Automating Online GUI Learning at Zero Human Cost [💻Code] [🤗Models]
- [RL] [2504] [GUI-R1] GUI-R1: A Generalist R1-Style Vision-Language Action Model For GUI Agents [💻Code] [🤗Models]
- [RL] [2504] [TongUI] TongUI: Building Generalized GUI Agents by Learning from Multimodal Web Tutorials [🌐Project] [💻Code] [🤗Models] [🤗Datasets]
- [RL] [2504] [InfiGUI-R1] InfiGUI-R1: Advancing Multimodal GUI Agents from Reactive Actors to Deliberative Reasoners
- [Prompt] [2504] [UFO2] UFO2: The Desktop AgentOS [💻Code]
- [RL] [2502] [Explorer] Explorer: Scaling Exploration-driven Web Trajectory Synthesis for Multimodal Web Agents [🌐Project]
- [Pretrain+SFT+RL] [2501] [UI-TARS] UI-TARS: Pioneering Automated GUI Interaction with Native Agents [🌐Project] [🤗Models] [💻Code]
- [SFT] [2412] [AGUVIS] AGUVIS: Unified Pure Vision Agents for Autonomous GUI Interaction [🌐Project] [💻Code]
- [Prompt] [2304] [DroidBot-GPT] DroidBot-GPT: GPT-powered UI Automation for Android [💻Code]
- [RL] [2505] [GiGPO] Group-in-Group Policy Optimization for LLM Agent Training [💻Code]
- [RL] [2505] [ToolN1] Nemotron-Research-Tool-N1: Exploring Tool-Using Language Models with Reinforced Reasoning [💻Code]
- [RL] [2505] [ARTIST] Agentic Reasoning and Tool Integration for LLMs via Reinforcement Learning
- [RL] [2504] [ToolRL] ToolRL: Reward is All Tool Learning Needs [💻Code] [🤗Models] [🤗Datasets]
- [Prompt] [2504] [DwT] Draw with Thought: Unleashing Multimodal Reasoning for Scientific Diagram Generation
- [RL] [2503] [TORL] TORL: Scaling Tool-Integrated RL [💻Code] [🤗Models] [🤗Datasets]
- [Prompt] [2409] [AWM] Agent Workflow Memory [🌐Project] [💻Code]
- [Prompt] [2210] [ReAct] ReAct: Synergizing Reasoning and Acting in Language Models [🌐Project] [💻Code]
Definition: In a multi-agent system, multiple agents, such as LLMs and MLLMs, engage in collaborative or competitive dynamics via a paradigm of interleaved reasoning. Agents may alternate in contributing discrete reasoning steps, share intermediate conclusions to establish a common cognitive state and build upon it, or otherwise exert mutual influence on one another's inference processes.
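The sketch below illustrates one common instantiation of this definition, a Society-of-Minds-style multi-agent debate in which agents exchange intermediate answers over several rounds. The `agent_generate` stub and the majority-vote aggregation are simplifying assumptions; real systems add judges, roles, or learned critics.

```python
from collections import Counter

def agent_generate(agent_id: int, prompt: str) -> str:
    """Hypothetical stand-in for one agent's LLM call."""
    return f"answer from agent {agent_id}"

def debate(question: str, n_agents: int = 3, n_rounds: int = 2) -> str:
    answers = [agent_generate(i, question) for i in range(n_agents)]
    for _ in range(n_rounds):
        revised = []
        for i in range(n_agents):
            # Each agent reads the other agents' current answers (a shared
            # intermediate state) and produces a revised answer.
            peers = "\n".join(a for j, a in enumerate(answers) if j != i)
            prompt = f"{question}\nOther agents answered:\n{peers}\nRevise your answer."
            revised.append(agent_generate(i, prompt))
        answers = revised
    # Aggregate the final round by a simple majority vote.
    return Counter(answers).most_common(1)[0][0]

print(debate("Is 17077 prime?"))
```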
- [Prompt] [2502] [S2-MAD] S2-MAD: Breaking the Token Barrier to Enhance Multi-Agent Debate Efficiency
- [Position paper] [2502] [-] If Multi-Agent Debate is the Answer, What is the Question?
- [RL] [2411] [ACC-Collab] ACC-Collab: An Actor-Critic Approach to Multi-Agent LLM Collaboration [💻Code]
- [Prompt] [2409] [GroupDebate] GroupDebate: Enhancing the Efficiency of Multi-Agent Debate Using Group Discussion
- [Prompt] [2408] [-] Can LLMs Beat Humans in Debating? A Dynamic Multi-agent Framework for Competitive Debate
- [Position paper] [2311] [-] Should we be going MAD? A Look at Multi-Agent Debate Strategies for LLMs [💻Code]
- [Prompt] [2305] [Society of Minds] Improving Factuality and Reasoning in Language Models through Multiagent Debate [💻Code]
- [Prompt] [2305] [MAD] Encouraging Divergent Thinking in Large Language Models through Multi-Agent Debate [💻Code]
- [RL] [2506] [HARIS] Coordinating Search-Informed Reasoning and Reasoning-Guided Search in Claim Verification
- [Prompt] [2505] [WORKFORCE] OWL: Optimized Workforce Learning for General Multi-Agent Assistance in Real-World Task Automation [💻Code]
- [Prompt] [2505] [MacNet] Scaling Large Language Model-based Multi-Agent Collaboration [💻Code]
- [Prompt] [2505] [AgentNet] AgentNet: Decentralized Evolutionary Coordination for LLM-based Multi-Agent Systems
- [Prompt] [2503] [ReAgent] ReAgent: Reversible Multi-Agent Reasoning for Knowledge-Enhanced Multi-Hop QA [💻Code]
- [2503] [Zochi] Zochi Technical Report [🌐Project] [💻Code]
- [Prompt] [2406] [Croto] Multi-Agent Collaboration via Cross-Team Orchestration [💻Code]
- [Prompt] [2402] [AgentScope] AgentScope: A Flexible yet Robust Multi-Agent Platform [💻Code]
- [Prompt] [2312] [Co-Learning] Experiential Co-Learning of Software-Developing Agents [💻Code]
- [Prompt] [2310] [MachineSoM] Exploring Collaboration Mechanisms for LLM Agents: A Social Psychology View [💻Code]
- [Prompt] [2309] [Agents] Agents: An Open-source Framework for Autonomous Language Agents [🌐Project] [💻Code]
- [Prompt] [2308] [AgentVerse] AgentVerse: Facilitating Multi-Agent Collaboration and Exploring Emergent Behaviors [💻Code]
- [Prompt] [2308] [MetaGPT] MetaGPT: Meta Programming for A Multi-Agent Collaborative Framework [💻Code]
- [Prompt] [2307] [ChatDev] Communicative Agents for Software Development [💻Code]
Definition: The model's reasoning capabilities are not confined to producing solely unimodal outputs. Instead, it strategically generates multimodal content (e.g., textual and visual elements) as an integral intermediate step within its intrinsic processes of comprehension and problem-solving.
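As a hedged illustration, the sketch below shows how a unified model might interleave text planning steps with intermediate image generation, in the spirit of GoT, Thinking with Generated Images, and BAGEL. The `<imagine>`/`<final>` tags, `unified_generate`, and `text_to_image` are hypothetical placeholders rather than any model's actual interface.

```python
from PIL import Image

def unified_generate(context) -> str:
    """Hypothetical stand-in for the unified model's next text step."""
    return "<imagine>a red cube on a blue table</imagine>"

def text_to_image(prompt: str) -> Image.Image:
    """Hypothetical stand-in for the same model's image-generation head."""
    return Image.new("RGB", (512, 512))

def generate_with_visual_thoughts(instruction: str, max_steps: int = 3):
    context = [instruction]
    for _ in range(max_steps):
        step = unified_generate(context)
        context.append(step)
        if "<imagine>" in step:
            # Produce an intermediate image as a "visual thought" and interleave
            # it back into the context for subsequent reasoning or refinement.
            prompt = step.split("<imagine>")[1].split("</imagine>")[0]
            context.append(text_to_image(prompt))
        if "<final>" in step:
            break
    return context  # an interleaved trace of text steps and generated images

trace = generate_with_visual_thoughts("Draw a red cube on a blue table, then verify the colors.")
print([type(item).__name__ for item in trace])
```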
- [SFT+RL] [2506] [ControlThinker] ControlThinker: Unveiling Latent Semantics for Controllable Image Generation through Visual Reasoning [💻Code]
- [SFT+RL] [2505] [MindOmni] MindOmni: Unleashing Reasoning Generation in Vision Language Models with RGPO [🤗Models] [💻Code]
- [SFT+RL] [2505] [GoT-R1] GoT-R1: Unleashing Reasoning Capability of MLLM for Visual Generation with Reinforcement Learning [🤗Models] [💻Code]
- [Prompt] [2505] [ComfyMind] ComfyMind: Toward General-Purpose Generation via Tree-Based Planning and Reactive Feedback [💻Code]
- [Pretrain+SFT+RL] [2505] [UniGen] UniGen: Enhanced Training & Test-Time Strategies for Unified Multimodal Understanding and Generation
- [SFT] [2505] [TwGI-Anole] Thinking with Generated Images [🤗Models] [💻Code]
- [Pretrain+SFT] [2505] [BAGEL] Emerging Properties in Unified Multimodal Pretraining [🌐Project] [🤗Models] [🤗Datasets] [💻Code]
- [RL] [2505] [T2I-R1] T2I-R1: Reinforcing Image Generation with Collaborative Semantic-level and Token-level CoT [🤗Models] [💻Code]
- [SFT] [2503] [GoT] GoT: Unleashing Reasoning Capability of Multimodal Large Language Model for Visual Generation and Editing [🤗Models] [🤗Datasets] [💻Code]