With the release of OpenAI o1 and DeepSeek-R1, reasoning models have yielded remarkably promising results and garnered significant attention from the research community. This development signals that reasoning models represent a critical advancement toward Artificial General Intelligence (AGI). The standard reasoning paradigm can be formally defined as:
- Standard Reasoning: The model conducts a comprehensive intermediate reasoning phase prior to generating the final response. This intermediate reasoning typically manifests as unstructured textual content, with the entire inference process constituting a single atomic operation.
Recently, the introduction of OpenAI o3, Deep research, Zochi, and BAGEL has established an alternative reasoning formulation, which we designate as Interleaving Reasoning. In contrast to standard reasoning, Interleaving Reasoning is characterized by multi-turn interactions and exhibits more sophisticated reasoning dynamics. This reasoning modality has empirically demonstrated superior accuracy on complex problems. Consequently, we posit that Interleaving Reasoning may constitute the next generation of reasoning systems for AGI. We propose a taxonomy of Interleaving Reasoning that encompasses the following categories (a minimal control-flow sketch contrasting it with standard reasoning follows the list):
- Multimodal Interleaving Reasoning: The model's inference process operates on diverse information modalities (e.g., textual, visual, auditory, video). This involves an intricately interleaved execution of modality-specific information processing and cross-modal reasoning. Examples: OpenAI o3, DeepEyes.
- Multi-Round Acting Interleaving Reasoning: The system achieves task completion through iterative interactions (actions) with the environment. Each action is either predicated upon or performed in conjunction with a reasoning-driven inference step, establishing an interleaved execution of action and inference processes. Examples: Deep research, Search-R1, ReTool, UI-TARS, ReAct.
- Multi-Agent Interleaving Reasoning: In a multi-agent system, multiple agents, such as LLMs and MLLMs, engage in collaborative or competitive dynamics via a paradigm of interleaved reasoning. Agents may alternate in contributing discrete reasoning steps, share intermediate conclusions to establish a common cognitive state and build upon it, or otherwise exert mutual influence on one another's inference processes. Examples: Society of Minds, Zochi, MetaGPT.
- Unified Understanding and Generation Interleaving Reasoning: The model's reasoning capabilities are not confined to producing solely unimodal outputs. Instead, it strategically generates multimodal content (e.g., textual and visual elements) as an integral intermediate step within its intrinsic processes of comprehension and problem-solving. Examples: GoT, T2I-R1, BAGEL.
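
To make the distinction concrete, the minimal Python sketch below contrasts the two control flows. It is illustrative only: `generate` and `run_tool` are hypothetical stand-ins for a reasoning model and an environment/tool interface, not the API of any specific system listed here.

```python
def generate(prompt: str) -> str:
    """Hypothetical stand-in for a single LLM/MLLM generation call."""
    return "<think>...</think><answer>42</answer>"

def run_tool(step: str) -> str:
    """Hypothetical stand-in for one environment step (search, code run, UI action, ...)."""
    return f"observation for: {step}"

def standard_reasoning(question: str) -> str:
    # One atomic pass: all intermediate reasoning is produced in a single
    # generation before the final answer; no external feedback enters mid-stream.
    return generate(question)

def interleaving_reasoning(question: str, max_turns: int = 5) -> str:
    # Multi-turn loop: reasoning steps alternate with external feedback
    # (tool results, retrieved text, rendered images, other agents' views),
    # which is appended to the context before the next reasoning step.
    context = question
    for _ in range(max_turns):
        step = generate(context)
        if "<answer>" in step:                          # the model decides it is done
            return step
        context += "\n" + step + "\n" + run_tool(step)  # fold the observation back in
    return generate(context + "\nGive the final answer now.")
```

The essential difference is where external feedback enters: standard reasoning finishes its entire chain of thought in one generation, whereas interleaving reasoning folds observations back into the context between reasoning steps.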
It is imperative to establish precise categorical boundaries:
- While Multimodal Interleaving Reasoning could conceivably be subsumed within the Multi-Round Acting Interleaving Reasoning paradigm, we formally define Multimodal Interleaving Reasoning as necessitating the direct incorporation of multi-modal information streams during the reasoning process. This information typically derives from the processing of input modalities, as exemplified by OpenAI o3, which extracts visual information and integrates it into text-based reasoning workflows.
- The fundamental distinction between Multi-Round Acting Interleaving Reasoning and Multi-Agent Interleaving Reasoning lies in their architectural composition: Multi-Round Acting Interleaving Reasoning typically employs a single LLM/MLLM to perform reasoning and determine subsequent actions. Conversely, Multi-Agent Interleaving Reasoning leverages multiple LLM/MLLM entities that collaboratively contribute to reasoning steps.
- The differentiation between Unified Understanding and Generation Interleaving Reasoning and Multimodal Interleaving Reasoning resides in their information processing mechanisms. Unified Understanding and Generation Interleaving Reasoning employs a unified understanding and generation model capable of directly generating multimodal outputs during the reasoning process. In contrast, Multimodal Interleaving Reasoning typically sources its multimodal information from external systems or processes.
We aim to provide the community with a comprehensive and timely synthesis of this fascinating and promising field, along with our insights into it. We hope this repository serves as a valuable reference for researchers working on Interleaving Reasoning. Please start your exploration!
This work is in progress!
Table of Contents
- Our Group
- Our Activities
- Standard Reasoning Examples
- Awesome Interleaving Reasoning Papers
- Awesome Datasets
- Wenxuan Huang (ECNU & CUHK)
- Zhenfei Yin (USYD & Oxford)
🔥🔥🔥 ICCV 2025 Workshop on Multi-Modal Reasoning for Agentic Intelligence (MMRAgi-2025)
Submission deadlines: Proceeding Track, 24 June 2025, 23:59 AoE; Non-Proceeding Track, 24 July 2025, 23:59 AoE.
🔥🔥🔥 Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models
🔥🔥🔥 DeepEyes: Incentivizing “Thinking with Images” via Reinforcement Learning
- [OpenAI o1] Introducing OpenAI o1
- [DeepSeek-R1] DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning [🤗Models] [💻Code]
- [Kimi-k1.5] Kimi k1.5: Scaling Reinforcement Learning with LLMs [💻Code]
- [QVQ-Max] QVQ-Max: Think with Evidence
- [Vision-R1] Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models [🤗Models] [🤗Datasets] [💻Code]
PR Template
- [RL] [2505] [DeepEyes] DeepEyes: Incentivizing "Thinking with Images" via Reinforcement Learning [🌐Project] [🤗Models] [🤗Datasets] [💻Code]
You can select your category from [Pretrain, SFT, RL, Prompt, Position paper, Survey paper], among others. Categories can also be combined, for example, SFT+RL.
Definition: The model's inference process operates on diverse information modalities (e.g., textual, visual, auditory, video). This involves an intricately interleaved execution of modality-specific information processing and cross-modal reasoning.
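As a rough illustration of this definition, the sketch below shows an o3/DeepEyes-style "thinking with images" loop in which the model may request a crop of the input image and the cropped view is fed back into its context. The `<crop>` tag protocol and the `vlm_generate` stub are assumptions made for illustration, not the interface of any listed system.

```python
import re
from PIL import Image

def vlm_generate(messages) -> str:
    """Hypothetical stand-in for a vision-language model call over interleaved text/images."""
    return "<answer>done</answer>"

def solve(image: Image.Image, question: str, max_turns: int = 4) -> str:
    messages = [question, image]
    for _ in range(max_turns):
        step = vlm_generate(messages)
        match = re.search(r"<crop>(\d+),(\d+),(\d+),(\d+)</crop>", step)
        if match is None:
            return step  # no further visual operation requested; treat as final
        # The model asked to inspect a region: crop/zoom and feed the new view
        # back into its context, interleaving pixel evidence with text reasoning.
        left, top, right, bottom = map(int, match.groups())
        messages += [step, image.crop((left, top, right, bottom))]
    return vlm_generate(messages + ["Answer now."])

# Usage with a blank placeholder image:
print(solve(Image.new("RGB", (640, 480)), "What is written on the sign?"))
```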
- [SFT+RL] [2505] [CoF] Chain-of-Focus: Adaptive Visual Search and Zooming for Multimodal Reasoning via RL [🌐Project] [💻Code] [🤗Models] [🤗Datasets]
- [SFT+RL] [2505] [Pixel Reasoner] Pixel Reasoner: Incentivizing Pixel-Space Reasoning with Curiosity-Driven Reinforcement Learning [🌐Project] [💻Code] [🤗Models] [🤗Datasets]
- [SFT+RL] [2505] [V-Triune] One RL to See Them All: Visual Triple Unified Reinforcement Learning [💻Code] [🤗Models] [🤗Datasets]
- [SFT+RL] [2505] [ViGoRL] Grounded Reinforcement Learning for Visual Reasoning [🌐Project]
- [SFT+RL] [2505] [VLM-R³] VLM-R³: Region Recognition, Reasoning, and Refinement for Enhanced Multimodal Chain-of-Thought
- [RL] [2505] [Ground-R1] Ground-R1: Incentivizing Grounded Visual Reasoning via Reinforcement Learning
- [RL] [2505] [GRIT] GRIT: Teaching MLLMs to Think with Images [🌐Project] [💻Code] [🤗Models]
- [SFT+RL] [2505] [Visual-ARFT] Visual Agentic Reinforcement Fine-Tuning [💻Code] [🤗Models] [🤗Datasets]
- [Prompt] [2505] [VAT] Visual Abstract Thinking Empowers Multimodal Reasoning [💻Code]
- [SFT] [2505] [MathCoder-VL] MathCoder-VL: Bridging Vision and Code for Enhanced Multimodal Mathematical Reasoning [💻Code] [🤗Models] [🤗Datasets]
- [SFT] [2505] [v1] Don’t Look Only Once: Towards Multimodal Interactive Reasoning with Selective Visual Revisitation [💻Code] [🤗Models]
- [SFT+RL] [2505] [OpenThinkIMG] OpenThinkIMG: Learning to Think with Images via Visual Tool Reinforcement Learning [💻Code]
- [RL] [2505] [DeepEyes] DeepEyes: Incentivizing "Thinking with Images" via Reinforcement Learning [🌐Project] [🤗Models] [🤗Datasets] [💻Code]
- [2504] [OpenAI o3] Introducing OpenAI o3 and o4-mini
- [SFT] [2503] [CoT-VLA] CoT-VLA: Visual Chain-of-Thought Reasoning for Vision-Language-Action Models [🌐Project]
- [SFT] [2501] [MVoT] Imagine while Reasoning in Space: Multimodal Visualization-of-Thought [🌐Project] [💻Code] [🤗Models] [🤗Datasets]
- [Prompt] [2406] [Sketchpad] Visual Sketchpad: Sketching as a Visual Chain of Thought for Multimodal Language Models [🌐Project] [💻Code]
- [SFT] [2403] [Visual CoT] Visual CoT: Advancing Multi-Modal Language Models with a Comprehensive Dataset and Benchmark for Chain-of-Thought Reasoning [🌐Project] [💻Code] [🤗Models] [🤗Datasets]
- [SFT] [2312] [V*] V*: Guided Visual Search as a Core Mechanism in Multimodal LLMs [🌐Project] [💻Code]
- [Prompt] [2211] [VISPROG] Visual Programming: Compositional visual reasoning without training [🌐Project] [💻Code]
Definition: The system achieves task completion through iterative interactions (actions) with the environment. Each action is either predicated upon or performed in conjunction with a reasoning-driven inference step, establishing an interleaved execution of action and inference processes.
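A minimal sketch of this action-reasoning interleaving, loosely following the ReAct / Search-R1 style of emitting a search action between reasoning steps, is shown below. The `<search>`/`<information>` tag protocol and the `generate`/`retrieve` stubs are illustrative assumptions rather than any paper's exact interface.

```python
import re

def generate(context: str) -> str:
    """Hypothetical stand-in for one LLM generation turn."""
    return "<think>...</think><answer>Paris</answer>"

def retrieve(query: str, k: int = 3) -> str:
    """Hypothetical stand-in for a search engine or retriever."""
    return f"top-{k} documents for '{query}'"

def act_and_reason(question: str, max_steps: int = 6) -> str:
    context = question
    for _ in range(max_steps):
        step = generate(context)
        answer = re.search(r"<answer>(.*?)</answer>", step, re.S)
        if answer:
            return answer.group(1)
        query = re.search(r"<search>(.*?)</search>", step, re.S)
        if query:
            # Action: query the environment, then interleave the observation
            # back into the context so the next reasoning step can use it.
            context += step + f"\n<information>{retrieve(query.group(1))}</information>\n"
        else:
            context += step  # a pure reasoning step with no action
    return "no answer within the step budget"

print(act_and_reason("What is the capital of France?"))
```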
- [RL] [2506] [ReasoningSearch] Reinforcement Fine-Tuning for Reasoning towards Multi-Step Multi-Source Search in Large Language Models [💻Code]
- [RL] [2506] [R-Search] R-Search: Empowering LLM Reasoning with Search via Multi-Reward Reinforcement Learning [💻Code] [🤗Models]
- [RL] [2505] [Search-R1] An Empirical Study on Reinforcement Learning for Reasoning-Search Interleaved LLM Agents [💻Code]
- [RL] [2505] [InForage] Scent of Knowledge: Optimizing Search-Enhanced Reasoning with Information Foraging
- [RL] [2505] [ManuSearch] ManuSearch: Democratizing Deep Search in Large Language Models with a Transparent and Open Multi-Agent Framework [💻Code]
- [RL] [2505] [O2-Searcher] O2-Searcher: A Searching-based Agent Model for Open-Domain Open-Ended Question Answering [💻Code]
- [RL] [2505] [R3-RAG] R3-RAG: Learning Step-by-Step Reasoning and Retrieval for LLMs via Reinforcement Learning [💻Code]
- [SFT] [2505] [SimpleDeepSearcher] SimpleDeepSearcher: Deep Information Seeking via Web-Powered Reasoning Trajectory Synthesis [💻Code] [🤗Models] [🤗Datasets]
- [SFT+RL] [2505] [EvolveSearch] EvolveSearch: An Iterative Self-Evolving Search Agent
- [RL] [2505] [Pangu DeepDiver] Pangu DeepDiver: Adaptive Search Intensity Scaling via Open-Web Reinforcement Learning
- [RL] [2505] [ZeroSearch] ZeroSearch: Incentivize the Search Capability of LLMs without Searching [🌐Project] [💻Code] [🤗Datasets] [🤗Models]
- [SFT+RL] [2505] [R1-Searcher++] R1-Searcher++: Incentivizing the Dynamic Knowledge Acquisition of LLMs via Reinforcement Learning [💻Code]
- [RL] [2505] [AutoRefine] Search and Refine During Think: Autonomous Retrieval-Augmented Reasoning of LLMs [💻Code] [🤗Models]
- [RL] [2504] [DeepResearcher] DeepResearcher: Scaling Deep Research via Reinforcement Learning in Real-world Environments [💻Code]
- [RL] [2504] [WebThinker] WebThinker: Empowering Large Reasoning Models with Deep Research Capability [💻Code] [🤗Models]
- [RL] [2503] [ReSearch] ReSearch: Learning to Reason with Search for LLMs via Reinforcement Learning [🤗Models] [🤗Datasets] [💻Code]
- [RL] [2503] [R1-Searcher] R1-Searcher: Incentivizing the Search Capability in LLMs via Reinforcement Learning [🤗Models] [🤗Datasets] [💻Code]
- [RL] [2503] [Search-R1] Search-R1: Training LLMs to Reason and Leverage Search Engines with Reinforcement Learning [💻Code]
- [2502] [Deep research] Introducing deep research
- [Prompt] [2501] [Search-o1] Search-o1: Agentic Search-Enhanced Large Reasoning Models [💻Code]
- [RL] [2112] [WebGPT] WebGPT: Browser-assisted question-answering with human feedback [🤗Datasets]
- [RL] [2506] [Hint-Engineering] CoRT: Code-integrated Reasoning within Thinking [💻Code] [🤗Models]
- [RL] [2506] [CTM] Computational Thinking Reasoning in Large Language Models
- [Prompt] [2506] [AUTOMIND] AUTOMIND: Adaptive Knowledgeable Agent for Automated Data Science [💻Code]
- [SFT] [2506] [KnowCoder-V2] KnowCoder-V2: Deep Knowledge Analysis
- [RL] [2505] [VTool-R1] VTool-R1: VLMs Learn to Think with Images via Reinforcement Learning on Multimodal Tool Use [💻Code]
- [RL] [2505] [Tool-Star] Tool-Star: Empowering LLM-Brained Multi-Tool Reasoner via Reinforcement Learning [💻Code] [🤗Models]
- [SFT+RL] [2504] [ReTool] ReTool: Reinforcement Learning for Strategic Tool Use in LLMs [🌐Project] [💻Code] [🤗Models] [🤗Datasets]
- [RL] [2504] [SQL-R1] SQL-R1: Training Natural Language to SQL Reasoning Model By Reinforcement Learning [💻Code] [🤗Models]
- [Prompt] [2502] [Agentic Reasoning] Agentic Reasoning: Reasoning LLMs with Tools for the Deep Research
- [SFT] [2412] [CoinMath] CoinMath: Harnessing the Power of Coding Instruction for Math LLMs
- [SFT] [2410] [MathCoder2] MathCoder2: Better Math Reasoning from Continued Pretraining on Model-translated Mathematical Code [🤗Models] [🤗Datasets] [💻Code]
- [SFT] [2408] [SIAM] SIAM: Self-Improving Code-Assisted Mathematical Reasoning of Large Language Models [💻Code]
- [SFT] [2312] [VPD] Visual Program Distillation: Distilling Tools and Programmatic Reasoning into Vision-Language Models
- [SFT] [2310] [MathCoder] MathCoder: Seamless Code Integration in LLMs for Enhanced Mathematical Reasoning [🤗Models] [🤗Datasets] [💻Code]
- [SFT] [2309] [ToRA] ToRA: A Tool-Integrated Reasoning Agent for Mathematical Problem Solving [🌐Project] [🤗Models] [🤗Datasets] [💻Code]
- [SFT] [2506] [GUI-Reflection] GUI-Reflection: Empowering Multimodal GUI Models with Self-Reflection Behavior [🌐Project] [🤗Models] [🤗Datasets] [💻Code]
- [SFT+RL] [2506] [ComfyUI-R1] ComfyUI-R1: Exploring Reasoning Models for Workflow Generation [💻Code]
- [RL] [2506] [TTI] Thinking vs. Doing: Agents that Reason by Scaling Test-Time Interaction [🌐Project] [🤗Models] [💻Code]
- [RL] [2506] [WebAgent-R1] WebAgent-R1: Training Web Agents via End-to-End Multi-Turn Reinforcement Learning [💻Code]
- [RL] [2505] [ZeroGUI] ZeroGUI: Automating Online GUI Learning at Zero Human Cost [💻Code] [🤗Models]
- [RL] [2504] [GUI-R1] GUI-R1: A Generalist R1-Style Vision-Language Action Model For GUI Agents [💻Code] [🤗Models]
- [RL] [2504] [TongUI] TongUI: Building Generalized GUI Agents by Learning from Multimodal Web Tutorials [🌐Project] [💻Code] [🤗Models] [🤗Datasets]
- [RL] [2504] [InfiGUI-R1] InfiGUI-R1: Advancing Multimodal GUI Agents from Reactive Actors to Deliberative Reasoners
- [Prompt] [2504] [UFO2] UFO2: The Desktop AgentOS [💻Code]
- [RL] [2502] [Explorer] Explorer: Scaling Exploration-driven Web Trajectory Synthesis for Multimodal Web Agents [🌐Project]
- [Pretrain+SFT+RL] [2501] [UI-TARS] UI-TARS: Pioneering Automated GUI Interaction with Native Agents [🌐Project] [🤗Models] [💻Code]
- [SFT] [2412] [AGUVIS] AGUVIS: Unified Pure Vision Agents for Autonomous GUI Interaction [🌐Project] [💻Code]
- [Prompt] [2304] [DroidBot-GPT] DroidBot-GPT: GPT-powered UI Automation for Android [💻Code]
- [RL] [2505] [GiGPO] Group-in-Group Policy Optimization for LLM Agent Training [💻Code]
- [RL] [2505] [ToolN1] Nemotron-Research-Tool-N1: Exploring Tool-Using Language Models with Reinforced Reasoning [💻Code]
- [RL] [2505] [ARTIST] Agentic Reasoning and Tool Integration for LLMs via Reinforcement Learning
- [RL] [2504] [ToolRL] ToolRL: Reward is All Tool Learning Needs [💻Code] [🤗Models] [🤗Datasets]
- [Prompt] [2504] [DwT] Draw with Thought: Unleashing Multimodal Reasoning for Scientific Diagram Generation
- [RL] [2503] [TORL] TORL: Scaling Tool-Integrated RL [💻Code] [🤗Models] [🤗Datasets]
- [Prompt] [2409] [AWM] Agent Workflow Memory [🌐Project] [💻Code]
- [Prompt] [2210] [ReAct] ReAct: Synergizing Reasoning and Acting in Language Models [🌐Project] [💻Code]
Definition: In a multi-agent system, multiple agents, such as LLMs and MLLMs, engage in collaborative or competitive dynamics via a paradigm of interleaved reasoning. Agents may alternate in contributing discrete reasoning steps, share intermediate conclusions to establish a common cognitive state and build upon it, or otherwise exert mutual influence on one another's inference processes.
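The sketch below illustrates one common instantiation of this definition, a Society-of-Minds-style multi-agent debate in which agents exchange intermediate answers over several rounds. The `agent_generate` stub and the majority-vote aggregation are simplifying assumptions; real systems add judges, roles, or learned critics.

```python
from collections import Counter

def agent_generate(agent_id: int, prompt: str) -> str:
    """Hypothetical stand-in for one agent's LLM call."""
    return f"answer from agent {agent_id}"

def debate(question: str, n_agents: int = 3, n_rounds: int = 2) -> str:
    answers = [agent_generate(i, question) for i in range(n_agents)]
    for _ in range(n_rounds):
        revised = []
        for i in range(n_agents):
            # Each agent reads the other agents' current answers (a shared
            # intermediate state) and produces a revised answer.
            peers = "\n".join(a for j, a in enumerate(answers) if j != i)
            prompt = f"{question}\nOther agents answered:\n{peers}\nRevise your answer."
            revised.append(agent_generate(i, prompt))
        answers = revised
    # Aggregate the final round by a simple majority vote.
    return Counter(answers).most_common(1)[0][0]

print(debate("Is 17077 prime?"))
```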
- [Prompt] [2502] [S2-MAD] S2-MAD: Breaking the Token Barrier to Enhance Multi-Agent Debate Efficiency
- [Position paper] [2502] [-] If Multi-Agent Debate is the Answer, What is the Question?
- [RL] [2411] [ACC-Collab] ACC-Collab: An Actor-Critic Approach to Multi-Agent LLM Collaboration [💻Code]
- [Prompt] [2409] [GroupDebate] GroupDebate: Enhancing the Efficiency of Multi-Agent Debate Using Group Discussion
- [Prompt] [2408] [-] Can LLMs Beat Humans in Debating? A Dynamic Multi-agent Framework for Competitive Debate
- [Position paper] [2311] [-] Should we be going MAD? A Look at Multi-Agent Debate Strategies for LLMs [💻Code]
- [Prompt] [2305] [Society of Minds] Improving Factuality and Reasoning in Language Models through Multiagent Debate [💻Code]
- [Prompt] [2305] [MAD] Encouraging Divergent Thinking in Large Language Models through Multi-Agent Debate [💻Code]
- [RL] [2506] [HARIS] Coordinating Search-Informed Reasoning and Reasoning-Guided Search in Claim Verification
- [Prompt] [2505] [WORKFORCE] OWL: Optimized Workforce Learning for General Multi-Agent Assistance in Real-World Task Automation [💻Code]
- [Prompt] [2505] [MacNet] Scaling Large Language Model-based Multi-Agent Collaboration [💻Code]
- [Prompt] [2505] [AgentNet] AgentNet: Decentralized Evolutionary Coordination for LLM-based Multi-Agent Systems
- [Prompt] [2503] [ReAgent] ReAgent: Reversible Multi-Agent Reasoning for Knowledge-Enhanced Multi-Hop QA [💻Code]
- [2503] [Zochi] Zochi Technical Report [🌐Project] [💻Code]
- [Prompt] [2406] [Croto] Multi-Agent Collaboration via Cross-Team Orchestration [💻Code]
- [Prompt] [2402] [AgentScope] AgentScope: A Flexible yet Robust Multi-Agent Platform [💻Code]
- [Prompt] [2312] [Co-Learning] Experiential Co-Learning of Software-Developing Agents [💻Code]
- [Prompt] [2310] [MachineSoM] Exploring Collaboration Mechanisms for LLM Agents: A Social Psychology View [💻Code]
- [Prompt] [2309] [Agents] Agents: An Open-source Framework for Autonomous Language Agents [🌐Project] [💻Code]
- [Prompt] [2308] [AgentVerse] AgentVerse: Facilitating Multi-Agent Collaboration and Exploring Emergent Behaviors [💻Code]
- [Prompt] [2308] [MetaGPT] MetaGPT: Meta Programming for A Multi-Agent Collaborative Framework [💻Code]
- [Prompt] [2307] [ChatDev] Communicative Agents for Software Development [💻Code]
Definition: The model's reasoning capabilities are not confined to producing solely unimodal outputs. Instead, it strategically generates multimodal content (e.g., textual and visual elements) as an integral intermediate step within its intrinsic processes of comprehension and problem-solving.
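As a hedged illustration, the sketch below shows how a unified model might interleave text planning steps with intermediate image generation, in the spirit of GoT, Thinking with Generated Images, and BAGEL. The `<imagine>`/`<final>` tags, `unified_generate`, and `text_to_image` are hypothetical placeholders rather than any model's actual interface.

```python
from PIL import Image

def unified_generate(context) -> str:
    """Hypothetical stand-in for the unified model's next text step."""
    return "<imagine>a red cube on a blue table</imagine>"

def text_to_image(prompt: str) -> Image.Image:
    """Hypothetical stand-in for the same model's image-generation head."""
    return Image.new("RGB", (512, 512))

def generate_with_visual_thoughts(instruction: str, max_steps: int = 3):
    context = [instruction]
    for _ in range(max_steps):
        step = unified_generate(context)
        context.append(step)
        if "<imagine>" in step:
            # Produce an intermediate image as a "visual thought" and interleave
            # it back into the context for subsequent reasoning or refinement.
            prompt = step.split("<imagine>")[1].split("</imagine>")[0]
            context.append(text_to_image(prompt))
        if "<final>" in step:
            break
    return context  # an interleaved trace of text steps and generated images

trace = generate_with_visual_thoughts("Draw a red cube on a blue table, then verify the colors.")
print([type(item).__name__ for item in trace])
```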
- [SFT+RL] [2506] [ControlThinker] ControlThinker: Unveiling Latent Semantics for Controllable Image Generation through Visual Reasoning [💻Code]
- [SFT+RL] [2505] [MindOmni] MindOmni: Unleashing Reasoning Generation in Vision Language Models with RGPO [🤗Models] [💻Code]
- [SFT+RL] [2505] [GoT-R1] GoT-R1: Unleashing Reasoning Capability of MLLM for Visual Generation with Reinforcement Learning [🤗Models] [💻Code]
- [Prompt] [2505] [ComfyMind] ComfyMind: Toward General-Purpose Generation via Tree-Based Planning and Reactive Feedback [💻Code]
- [Pretrain+SFT+RL] [2505] [UniGen] UniGen: Enhanced Training & Test-Time Strategies for Unified Multimodal Understanding and Generation
- [SFT] [2505] [TwGI-Anole] Thinking with Generated Images [🤗Models] [💻Code]
- [Pretrain+SFT] [2505] [BAGEL] Emerging Properties in Unified Multimodal Pretraining [🌐Project] [🤗Models] [🤗Datasets] [💻Code]
- [RL] [2505] [T2I-R1] T2I-R1: Reinforcing Image Generation with Collaborative Semantic-level and Token-level CoT [🤗Models] [💻Code]
- [SFT] [2503] [GoT] GoT: Unleashing Reasoning Capability of Multimodal Large Language Model for Visual Generation and Editing [🤗Models] [🤗Datasets] [💻Code]