Awesome-Long-Chain-of-Thought-Reasoning

🔥 News

2025.04: 🎉🎉🎉 We have updated the number of reviewed papers to over 900. Additionally, we have enhanced the presentation with more engaging teaser figure.
2025.03: 🎉🎉🎉 We have published a survey paper titled "Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models". Please feel free to cite or open pull requests for your awesome studies.

🌟 Introduction

Welcome to the repository associated with our survey paper, "Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models". This repository contains resources and updates related to our ongoing Long CoT research. For a detailed introduction, please refer to our survey paper.

Recent advancements in reasoning with large language models (RLLMs), such as OpenAI-O1 and DeepSeek-R1, have demonstrated their impressive capabilities in complex domains like mathematics and coding. A central factor in their success lies in the application of long chain-of-thought (Long CoT) characteristics, which enhance reasoning abilities and enable the solution of intricate problems.

However, despite these developments, a comprehensive survey on Long CoT is still lacking, limiting our understanding of its distinctions from traditional short chain-of-thought (Short CoT) and complicating ongoing debates on issues like "overthinking" and "test-time scaling." This survey seeks to fill this gap by offering a unified perspective on Long CoT. (1) We first distinguish Long CoT from Short CoT and introduce a novel taxonomy to categorize current reasoning paradigms. (2) Next, we explore the key characteristics of Long CoT: deep reasoning, extensive exploration, and feasible reflection, which enable models to handle more complex tasks and produce more efficient, coherent outcomes compared to the shallower Short CoT. (3) We then investigate key phenomena such as the emergence of Long CoT with these characteristics, including overthinking, and test-time scaling, offering insights into how these processes manifest in practice. (4) Finally, we identify significant research gaps and highlight promising future directions, including the integration of multi-modal reasoning, efficiency improvements, and enhanced knowledge frameworks. By providing a structured overview, this survey aims to inspire future research and further the development of logical reasoning in artificial intelligence.

🔮 Contents

Awesome-Long-CoT

1. Analysis and Evaluation

1.1 Analysis & Explanation for Long CoT

Explainable AI in Large Language Models: A Review, Sauhandikaa et al.,
Xai meets llms: A survey of the relation between explainable ai and large language models, Cambria et al.,
When a language model is optimized for reasoning, does it still show embers of autoregression? An analysis of OpenAI o1, McCoy et al.,
Think or Step-by-Step? UnZIPping the Black Box in Zero-Shot Prompts, Sadr et al.,

1.1.1 Long CoT External Behavior Analysis

Language Models Can Predict Their Own Behavior, Ashok et al.,

Aha Moment Phenomenon

There May Not be Aha Moment in R1-Zero-like Training — A Pilot Study, Liu et al.,
Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning, Guo et al.,
Open R1, Team et al.,
Logic-RL: Unleashing LLM Reasoning with Rule-Based Reinforcement Learning, Xie et al.,
R1-Zero's" Aha Moment" in Visual Reasoning on a 2B Non-SFT Model, Zhou et al.,
MM-Eureka: Exploring Visual Aha Moment with Rule-based Large-scale Reinforcement Learning, Meng et al.,
Understanding Aha Moments: from External Observations to Internal Mechanisms, Yang et al.,

Inference Test-Time Scaling Phenomenon

Greedy Policy Search: A Simple Baseline for Learnable Test-Time Augmentation, Lyzhov et al.,
Large language monkeys: Scaling inference compute with repeated sampling, Brown et al.,
The Impact of Reasoning Step Length on Large Language Models, Jin et al.,
Inference scaling laws: An empirical analysis of compute-optimal inference for problem-solving with language models, Wu et al.,
Unlocking the Capabilities of Thought: A Reasoning Boundary Framework to Quantify and Optimize Chain-of-Thought, Chen et al.,
From Decoding to Meta-Generation: Inference-time Algorithms for Large Language Models, Welleck et al.,
Openai o1 system card, Jaech et al.,
Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning, Guo et al.,
When More is Less: Understanding Chain-of-Thought Length in LLMs, Wu et al.,
Inference-Time Computations for LLM Reasoning and Planning: A Benchmark and Insights, Parashar et al.,
Examining False Positives under Inference Scaling for Mathematical Reasoning, Wang et al.,
ECM: A Unified Electronic Circuit Model for Explaining the Emergence of In-Context Learning and Chain-of-Thought in Large Language Model, Chen et al.,
PhD Knowledge Not Required: A Reasoning Challenge for Large Language Models, Anderson et al.,
Scaling Test-Time Compute Without Verification or RL is Suboptimal, Setlur et al.,
The Relationship Between Reasoning and Performance in Large Language Models--o3 (mini) Thinks Harder, Not Longer, Ballon et al.,
Inference-Time Scaling for Complex Tasks: Where We Stand and What Lies Ahead, Balachandran et al.,

Long CoT Emergence Phenomenon

Star: Bootstrapping reasoning with reasoning, Zelikman et al.,
Towards Understanding Chain-of-Thought Prompting: An Empirical Study of What Matters, Wang et al.,
LAMBADA: Backward Chaining for Automated Reasoning in Natural Language, Kazemi et al.,
What Makes Chain-of-Thought Prompting Effective? A Counterfactual Study, Madaan et al.,
Automatic Prompt Augmentation and Selection with Chain-of-Thought from Labeled Data, Shum et al.,
MoT: Memory-of-Thought Enables ChatGPT to Self-Improve, Li et al.,
The llama 3 herd of models, Dubey et al.,
Do Large Language Models Latently Perform Multi-Hop Reasoning?, Yang et al.,
Chain of Thoughtlessness? An Analysis of CoT in Planning, Stechly et al.,
Chain-of-Thought Reasoning Without Prompting, Wang et al.,
Qwen2.5 technical report, Yang et al.,
Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning, Guo et al.,
Towards System 2 Reasoning in LLMs: Learning How to Think With Meta Chain-of-Though, Xiang et al.,
Think or Step-by-Step? UnZIPping the Black Box in Zero-Shot Prompts, Sadr et al.,
Logic-RL: Unleashing LLM Reasoning with Rule-Based Reinforcement Learning, Xie et al.,
Problem-Solving Logic Guided Curriculum In-Context Learning for LLMs Complex Reasoning, Ma et al.,
Cognitive Behaviors that Enable Self-Improving Reasoners, or, Four Habits of Highly Effective STaRs, Gandhi et al.,
Style over Substance: Distilled Language Models Reason Via Stylistic Replication, Lippmann et al.,
Do Larger Language Models Imply Better Reasoning? A Pretraining Scaling Law for Reasoning, Wang et al.,

Overthinking Phenomenon

The Impact of Reasoning Step Length on Large Language Models, Jin et al.,
Unlocking the Capabilities of Thought: A Reasoning Boundary Framework to Quantify and Optimize Chain-of-Thought, Chen et al.,
Compositional Hardness of Code in Large Language Models--A Probabilistic Perspective, Wolf et al.,
DynaThink: Fast or Slow? A Dynamic Decision-Making Framework for Large Language Models, Pan et al.,
What Are Step-Level Reward Models Rewarding? Counterintuitive Findings from MCTS-Boosted Mathematical Reasoning, Ma et al.,
Do not think that much for 2+ 3=? on the overthinking of o1-like llms, Chen et al.,
Rethinking External Slow-Thinking: From Snowball Errors to Probability of Correct Reasoning, Gan et al.,
Complexity Control Facilitates Reasoning-Based Compositional Generalization in Transformers, Zhang et al.,
Logic-RL: Unleashing LLM Reasoning with Rule-Based Reinforcement Learning, Xie et al.,
When More is Less: Understanding Chain-of-Thought Length in LLMs, Wu et al.,
ECM: A Unified Electronic Circuit Model for Explaining the Emergence of In-Context Learning and Chain-of-Thought in Large Language Model, Chen et al.,
The Danger of Overthinking: Examining the Reasoning-Action Dilemma in Agentic Tasks, Cuadron et al.,
OVERTHINKING: Slowdown Attacks on Reasoning LLMs, Kumar et al.,

PRM v.s. ORM Phenomenon

Concrete problems in AI safety, Amodei et al.,
The effects of reward misspecification: Mapping and mitigating misaligned models, Pan et al.,
Goal misgeneralization in deep reinforcement learning, Di Langosco et al.,
Can language models learn from explanations in context?, Lampinen et al.,
Causal Abstraction for Chain-of-Thought Reasoning in Arithmetic Word Problems, Tan et al.,
Processbench: Identifying process errors in mathematical reasoning, Zheng et al.,
Towards Large Reasoning Models: A Survey of Reinforced Reasoning with Large Language Models, Xu et al.,
PRMBench: A Fine-grained and Challenging Benchmark for Process-Level Reward Models, Song et al.,
Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning, Guo et al.,
Do We Need to Verify Step by Step? Rethinking Process Supervision from a Theoretical Perspective, Jia et al.,
Unveiling and Causalizing CoT: A Causal Pespective, Fu et al.,
Logic-RL: Unleashing LLM Reasoning with Rule-Based Reinforcement Learning, Xie et al.,
Agentic Reward Modeling: Integrating Human Preferences with Verifiable Correctness Signals for Reliable Reward Systems, Peng et al.,
Process-based Self-Rewarding Language Models, Zhang et al.,
Monitoring Reasoning Models for Misbehavior and the Risks of Promoting Obfuscation, Baker et al.,
Rewarding Curse: Analyze and Mitigate Reward Modeling Issues for LLM Reasoning, Li et al.,

Reasoning Boundary Phenomenon

The Expressive Power of Transformers with Chain of Thought, Merrill et al.,
Chain of Thought Empowers Transformers to Solve Inherently Serial Problems, Li et al.,
Mathprompter: Mathematical reasoning using large language models, Imani et al.,
Towards Revealing the Mystery behind Chain of Thought: A Theoretical Perspective, Feng et al.,
When Do Program-of-Thought Works for Reasoning?, Bi et al.,
MuSR: Testing the Limits of Chain-of-thought with Multistep Soft Reasoning, Sprague et al.,
How Far Are We on the Decision-Making of LLMs? Evaluating LLMs' Gaming Ability in Multi-Agent Environments, Huang et al.,
Unlocking the Capabilities of Thought: A Reasoning Boundary Framework to Quantify and Optimize Chain-of-Thought, Chen et al.,
Not All LLM Reasoners Are Created Equal, Hosseini et al.,
Exploring the Compositional Deficiency of Large Language Models in Mathematical Reasoning Through Trap Problems, Zhao et al.,
GSM-Infinite: How Do Your LLMs Behave over Infinitely Increasing Context Length and Reasoning Complexity?, Zhou et al.,
Lower Bounds for Chain-of-Thought Reasoning in Hard-Attention Transformers, Amiri et al.,
The Lookahead Limitation: Why Multi-Operand Addition is Hard for LLMs, Baeumel et al.,
Reasoning Beyond Limits: Advances and Open Problems for LLMs, Ferrag et al.,

1.1.2 Long CoT Internal Mechanism Analysis

Knowledge Incorporating Mechanism

Why think step by step? Reasoning emerges from the locality of experience, Prystawski et al.,
Thinking llms: General instruction following with thought generation, Wu et al.,
On the reasoning capacity of ai models and how to quantify it, Radha et al.,
Exploring Concept Depth: How Large Language Models Acquire Knowledge and Concept at Different Layers?, Jin et al.,
How Do LLMs Acquire New Knowledge? A Knowledge Circuits Perspective on Continual Pre-Training, Ou et al.,
Logic-RL: Unleashing LLM Reasoning with Rule-Based Reinforcement Learning, Xie et al.,
Enhancing llm reliability via explicit knowledge boundary modeling, Zheng et al.,

Reasoning Interal Mechanism

How Large Language Models Implement Chain-of-Thought?, Wang et al.,
How does GPT-2 compute greater-than?: Interpreting mathematical abilities in a pre-trained language model, Hanna et al.,
System 2 Attention (is something you might need too), Weston et al.,
How to think step-by-step: A mechanistic understanding of chain-of-thought reasoning, Dutta et al.,
An Investigation of Neuron Activation as a Unified Lens to Explain Chain-of-Thought Eliciting Arithmetic Reasoning of LLMs, Rai et al.,
What Happened in LLMs Layers when Trained for Fast vs. Slow Thinking: A Gradient Perspective, Li et al.,
Finite State Automata Inside Transformers with Chain-of-Thought: A Mechanistic Study on State Tracking, Zhang et al.,
The Validation Gap: A Mechanistic Analysis of How Language Models Compute Arithmetic but Fail to Validate It, Bertolazzi et al.,
Layer by Layer: Uncovering Hidden Representations in Language Models, Skean et al.,
Back Attention: Understanding and Enhancing Multi-Hop Reasoning in Large Language Models, Yu et al.,

1.2 Long CoT Evaluations

1.2.3 Advanced Evaluation

AI for Research

ScienceWorld: Is your Agent Smarter than a 5th Grader?, Wang et al.,
Can llms generate novel research ideas? a large-scale human study with 100+ nlp researchers, Si et al.,
Mle-bench: Evaluating machine learning agents on machine learning engineering, Chan et al.,
Chain of ideas: Revolutionizing research via novel idea development with llm agents, Li et al.,
HardML: A Benchmark For Evaluating Data Science And Machine Learning knowledge and reasoning in AI, Pricope et al.,
DeepSolution: Boosting Complex Engineering Solution Design via Tree-based Exploration and Bi-point Thinking, Li et al.,
Large Language Models Penetration in Scholarly Writing and Peer Review, Zhou et al.,
Towards an AI co-scientist, Gottweis et al.,
Agentic Reasoning: Reasoning LLMs with Tools for the Deep Research, Wu et al.,
Open Deep Research, Team et al.,
Enabling AI Scientists to Recognize Innovation: A Domain-Agnostic Algorithm for Assessing Novelty, Wang et al.,

Agentic & Embodied Reasoning

WebShop: Towards Scalable Real-World Web Interaction with Grounded Language Agents, Yao et al.,
ScienceWorld: Is your Agent Smarter than a 5th Grader?, Wang et al.,
WebArena: A Realistic Web Environment for Building Autonomous Agents, Zhou et al.,
How Far Are We on the Decision-Making of LLMs? Evaluating LLMs' Gaming Ability in Multi-Agent Environments, Huang et al.,
CogAgent: A Visual Language Model for GUI Agents, Hong et al.,
OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments, Xie et al.,
ToolComp: A Multi-Tool Reasoning & Process Supervision Benchmark, Nath et al.,
Mobile-Agent-E: Self-Evolving Mobile Assistant for Complex Tasks, Wang et al.,
PhysReason: A Comprehensive Benchmark towards Physics-Based Reasoning, Zhang et al.,
Text2World: Benchmarking Large Language Models for Symbolic World Model Generation, Hu et al.,
WebGames: Challenging General-Purpose Web-Browsing AI Agents, Thomas et al.,
VEM: Environment-Free Exploration for Training GUI Agent with Value Environment Model, Zheng et al.,
Mobile-Agent-V: Learning Mobile Device Operation Through Video-Guided Multi-Agent Collaboration, Wang et al.,
Generating Symbolic World Models via Test-time Scaling of Large Language Models, Yu et al.,
UI-R1: Enhancing Action Prediction of GUI Agents by Reinforcement Learning, Lu et al.,

Multimodal Reasoning

Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering, Lu et al.,
A Multi-Modal Neural Geometric Solver with Textual Clauses Parsed from Diagram, Zhang et al.,
MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts, Lu et al.,
Plot2code: A comprehensive benchmark for evaluating multi-modal large language models in code generation from scientific plots, Wu et al.,
M³CoT: A Novel Benchmark for Multi-Domain Multi-step Multi-modal Chain-of-Thought, Chen et al.,
PuzzleVQA: Diagnosing Multimodal Reasoning Challenges of Language Models with Abstract Visual Patterns, Chia et al.,
Can LLMs Solve Molecule Puzzles? A Multimodal Benchmark for Molecular Structure Elucidation, Guo et al.,
Measuring Multimodal Mathematical Reasoning with MATH-Vision Dataset, Wang et al.,
Mathverse: Does your multi-modal llm truly see the diagrams in visual math problems?, Zhang et al.,
HumanEval-V: Evaluating Visual Understanding and Reasoning Abilities of Large Multimodal Models Through Coding Tasks, Zhang et al.,
A Survey of Mathematical Reasoning in the Era of Multimodal Large Language Model: Benchmark, Method & Challenges, Yan et al.,
CoMT: A Novel Benchmark for Chain of Multi-modal Thought on Large Vision-Language Models, Cheng et al.,
CMMaTH: A Chinese Multi-modal Math Skill Evaluation Benchmark for Foundation Models, Li et al.,
ChartMimic: Evaluating LMM's Cross-Modal Reasoning Capability via Chart-to-Code Generation, Yang et al.,
Can Large Language Models Unveil the Mysteries? An Exploration of Their Ability to Unlock Information in Complex Scenarios, Wang et al.,
EnigmaEval: A Benchmark of Long Multimodal Reasoning Challenges, Wang et al.,
Code-Vision: Evaluating Multimodal LLMs Logic Understanding and Code Generation Capabilities, Wang et al.,
Exploring and Evaluating Multimodal Knowledge Reasoning Consistency of Multimodal Large Language Models, Jia et al.,
MMSciBench: Benchmarking Language Models on Multimodal Scientific Problems, Ye et al.,
LEGO-Puzzles: How Good Are MLLMs at Multi-Step Spatial Reasoning?, Tang et al.,

1.2.4 Benchmarks

Outcome Benchmarks

On the measure of intelligence, Chollet et al.,
What Disease Does This Patient Have? A Large-Scale Open Domain Question Answering Dataset from Medical Exams, Jin et al.,
Training verifiers to solve math word problems, Cobbe et al.,
Measuring Mathematical Problem Solving With the MATH Dataset, Hendrycks et al.,
Competition-Level Code Generation with AlphaCode, Li et al.,
Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them, Suzgun et al.,
Assessing and Enhancing the Robustness of Large Language Models with Task Structure Variations for Logical Reasoning, Bao et al.,
AI for Math or Math for AI? On the Generalization of Learning Mathematical Problem Solving, Zhou et al.,
OlympicArena: Benchmarking Multi-discipline Cognitive Reasoning for Superintelligent AI, Huang et al.,
Putnam-AXIOM: A Functional and Static Benchmark for Measuring Higher Level Mathematical Reasoning, Gulati et al.,
Let's verify step by step, Lightman et al.,
SWE-bench: Can Language Models Resolve Real-world Github Issues?, Jimenez et al.,
Benchmarking large language models on answering and explaining challenging medical questions, Chen et al.,
Achieving> 97% on GSM8K: Deeply Understanding the Problems Makes LLMs Better Solvers for Math Word Problems, Zhong et al.,
Mhpp: Exploring the capabilities and limitations of language models beyond basic code generation, Dai et al.,
AIME 2024, AI-MO et al.,
AMC 2023, AI-MO et al.,
GPQA: A Graduate-Level Google-Proof Q&A Benchmark, Rein et al.,
OlympiadBench: A Challenging Benchmark for Promoting AGI with Olympiad-Level Bilingual Multimodal Scientific Problems, He et al.,
MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark, Wang et al.,
Frontiermath: A benchmark for evaluating advanced mathematical reasoning in ai, Glazer et al.,
HumanEval Pro and MBPP Pro: Evaluating Large Language Models on Self-invoking Code Generation, Yu et al.,
LiveBench: A Challenging, Contamination-Limited LLM Benchmark, White et al.,
LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code, Jain et al.,
JustLogic: A Comprehensive Benchmark for Evaluating Deductive Reasoning in Large Language Models, Chen et al.,
Humanity's Last Exam, Phan et al.,
MedXpertQA: Benchmarking Expert-Level Medical Reasoning and Understanding, Zuo et al.,
Theoretical Physics Benchmark (TPBench)--a Dataset and Study of AI Reasoning Capabilities in Theoretical Physics, Chung et al.,
AIME 2025, OpenCompass et al.,
ThinkBench: Dynamic Out-of-Distribution Evaluation for Robust LLM Reasoning, Huang et al.,
MATH-Perturb: Benchmarking LLMs' Math Reasoning Abilities against Hard Perturbations, Huang et al.,
ProBench: Benchmarking Large Language Models in Competitive Programming, Yang et al.,
EquiBench: Benchmarking Code Reasoning Capabilities of Large Language Models via Equivalence Checking, Wei et al.,
ZebraLogic: On the Scaling Limits of LLMs for Logical Reasoning, Lin et al.,
DivIL: Unveiling and Addressing Over-Invariance for Out-of-Distribution Generalization, WANG et al.,
SuperGPQA: Scaling LLM Evaluation across 285 Graduate Disciplines, Du et al.,
DeepSeek-R1 Outperforms Gemini 2.0 Pro, OpenAI o1, and o3-mini in Bilingual Complex Ophthalmology Reasoning, Xu et al.,
QuestBench: Can LLMs ask the right question to acquire information in reasoning tasks?, Li et al.,
Proof or Bluff? Evaluating LLMs on 2025 USA Math Olympiad, Petrov et al.,
Benchmarking Reasoning Robustness in Large Language Models, Yu et al.,
From Code to Courtroom: LLMs as the New Software Judges, He et al.,
Interacting with AI Reasoning Models: Harnessing" Thoughts" for AI-Driven Software Engineering, Treude et al.,
Can Frontier LLMs Replace Annotators in Biomedical Text Mining? Analyzing Challenges and Exploring Solutions, Zhao et al.,
An evaluation of DeepSeek Models in Biomedical Natural Language Processing, Zhan et al.,
Cognitive-Mental-LLM: Leveraging Reasoning in Large Language Models for Mental Health Prediction via Online Text, Patil et al.,

1.2.5 Process Evaluations

Deep Reasoning Benchmarks

ROSCOE: A Suite of Metrics for Scoring Step-by-Step Reasoning, Golovneva et al.,
Making Language Models Better Reasoners with Step-Aware Verifier, Li et al.,
ReCEval: Evaluating Reasoning Chains via Correctness and Informativeness, Prasad et al.,
Unlocking the Capabilities of Thought: A Reasoning Boundary Framework to Quantify and Optimize Chain-of-Thought, Chen et al.,
ZebraLogic: On the Scaling Limits of LLMs for Logical Reasoning, Lin et al.,
Evaluating Step-by-step Reasoning Traces: A Survey, Lee et al.,
Mathematical Reasoning in Large Language Models: Assessing Logical and Arithmetic Errors across Wide Numerical Ranges, Shrestha et al.,
Landscape of Thoughts: Visualizing the Reasoning Process of Large Language Models, Zhou et al.,

Exploration Benchmarks

EVOLvE: Evaluating and Optimizing LLMs For Exploration, Nie et al.,
Evaluating the Systematic Reasoning Abilities of Large Language Models through Graph Coloring, Heyman et al.,
Inference-Time Computations for LLM Reasoning and Planning: A Benchmark and Insights, Parashar et al.,

Reflection Benchmarks

Rewardbench: Evaluating reward models for language modeling, Lambert et al.,
MR-Ben: A Meta-Reasoning Benchmark for Evaluating System-2 Thinking in LLMs, Zeng et al.,
Evaluating LLMs at Detecting Errors in LLM Responses, Kamoi et al.,
CriticBench: Benchmarking LLMs for Critique-Correct Reasoning, Lin et al.,
Judgebench: A benchmark for evaluating llm-based judges, Tan et al.,
Errorradar: Benchmarking complex mathematical reasoning of multimodal large language models via error detection, Yan et al.,
Processbench: Identifying process errors in mathematical reasoning, Zheng et al.,
Medec: A benchmark for medical error detection and correction in clinical notes, Abacha et al.,
PRMBench: A Fine-grained and Challenging Benchmark for Process-Level Reward Models, Song et al.,
Multimodal RewardBench: Holistic Evaluation of Reward Models for Vision Language Models, Yasunaga et al.,
CodeCriticBench: A Holistic Code Critique Benchmark for Large Language Models, Zhang et al.,
Can Large Language Models Detect Errors in Long Chain-of-Thought Reasoning?, He et al.,
FINEREASON: Evaluating and Improving LLMs' Deliberate Reasoning through Reflective Puzzle Solving, Chen et al.,

2. Deep Reasoning

Thoughts Are All Over the Place: On the Underthinking of o1-Like LLMs, Wang et al.,
Don't Get Lost in the Trees: Streamlining LLM Reasoning by Overcoming Tree Search Exploration Pitfalls, Wang et al.,

2.1 Deep Reasoning Format

2.1.1 Latent Space Deep Reasoning

Guiding language model reasoning with planning tokens, Wang et al.,
MuSR: Testing the Limits of Chain-of-thought with Multistep Soft Reasoning, Sprague et al.,
Quiet-star: Language models can teach themselves to think before speaking, Zelikman et al.,
From explicit cot to implicit cot: Learning to internalize cot step by step, Deng et al.,
Training large language models to reason in a continuous latent space, Hao et al.,
Efficient Reasoning with Hidden Thinking, Shen et al.,
Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach, Geiping et al.,
Reasoning with Latent Thoughts: On the Power of Looped Transformers, Saunshi et al.,
Self-Enhanced Reasoning Training: Activating Latent Reasoning in Small Models for Enhanced Reasoning Distillation, Zhang et al.,
LLM Pretraining with Continuous Concepts, Tack et al.,
Scalable Language Models with Posterior Inference of Latent Thought Vectors, Kong et al.,
Inner Thinking Transformer: Leveraging Dynamic Depth Scaling to Foster Adaptive Internal Thinking, Chen et al.,
Reasoning to Learn from Latent Thoughts, Ruan et al.,

2.1.2 Natural Language Deep Reasoning

Reflection of thought: Inversely eliciting numerical reasoning in language models via solving linear systems, Zhou et al.,
Chain-of-Thought Prompting Elicits Reasoning in Large Language Models, Wei et al.,
Mathprompter: Mathematical reasoning using large language models, Imani et al.,
Deductive Verification of Chain-of-Thought Reasoning, Ling et al.,
Cross-lingual Prompting: Improving Zero-shot Chain-of-Thought Reasoning across Languages, Qin et al.,
AutoCAP: Towards Automatic Cross-lingual Alignment Planning for Zero-shot Chain-of-Thought, Zhang et al.,
Large language models are not strong abstract reasoners, Gendron et al.,
Planning in Natural Language Improves LLM Search for Code Generation, Wang et al.,
CodeI/O: Condensing Reasoning Patterns via Code Input-Output Prediction, Li et al.,

2.1.3 Structured Language Deep Reasoning

Generative language modeling for automated theorem proving, Polu et al.,
Multi-step deductive reasoning over natural language: An empirical study on out-of-distribution generalisation, Bao et al.,
PAL: Program-aided Language Models, Gao et al.,
Program of Thoughts Prompting: Disentangling Computation from Reasoning for Numerical Reasoning Tasks, Chen et al.,
Tinygsm: achieving> 80% on gsm8k with small language models, Liu et al.,
ChatLogic: Integrating Logic Programming with Large Language Models for Multi-step Reasoning, Wang et al.,
Brain-Inspired Two-Stage Approach: Enhancing Mathematical Reasoning by Imitating Human Thought Processes, Chen et al.,
MathDivide: Improved mathematical reasoning by large language models, Srivastava et al.,
Certified Deductive Reasoning with Language Models, Poesia et al.,
Interactive Evolution: A Neural-Symbolic Self-Training Framework For Large Language Models, Xu et al.,
Lean-star: Learning to interleave thinking and proving, Lin et al.,
Chain of Code: Reasoning with a Language Model-Augmented Code Emulator, Li et al.,
Siam: Self-improving code-assisted mathematical reasoning of large language models, Yu et al.,
Formal mathematical reasoning: A new frontier in ai, Yang et al.,
SKIntern: Internalizing Symbolic Knowledge for Distilling Better CoT Capabilities into Small Language Models, Liao et al.,
CodePlan: Unlocking Reasoning Potential in Large Language Models by Scaling Code-form Planning, Wen et al.,
Improving Chain-of-Thought Reasoning via Quasi-Symbolic Abstractions, Ranaldi et al.,
Towards Better Understanding of Program-of-Thought Reasoning in Cross-Lingual and Multilingual Environments, Payoungkhamdee et al.,
Beyond Limited Data: Self-play LLM Theorem Provers with Iterative Conjecturing and Proving, Dong et al.,
Theorem Prover as a Judge for Synthetic Data Generation, Leang et al.,
Code-Driven Inductive Synthesis: Enhancing Reasoning Abilities of Large Language Models with Sequences, Chen et al.,

2.2 Deep Reasoning Learning

Instruction tuning for large language models: A survey, Zhang et al.,
On memorization of large language models in logical reasoning, Xie et al.,
Thoughts Are All Over the Place: On the Underthinking of o1-Like LLMs, Wang et al.,
Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning, Guo et al.,
Sft memorizes, rl generalizes: A comparative study of foundation model post-training, Chu et al.,
Don't Get Lost in the Trees: Streamlining LLM Reasoning by Overcoming Tree Search Exploration Pitfalls, Wang et al.,

2.2.4 Deep Reasoning Imitation

Training verifiers to solve math word problems, Cobbe et al.,
Chain of Thought Imitation with Procedure Cloning, Yang et al.,
Large Language Models Are Reasoning Teachers, Ho et al.,
The CoT Collection: Improving Zero-shot and Few-shot Learning of Language Models via Chain-of-Thought Fine-Tuning, Kim et al.,
Brain-Inspired Two-Stage Approach: Enhancing Mathematical Reasoning by Imitating Human Thought Processes, Chen et al.,
Exploring Iterative Enhancement for Improving Learnersourced Multiple-Choice Question Explanations with Large Language Models, Bao et al.,
Qwen2.5-math technical report: Toward mathematical expert model via self-improvement, Yang et al.,
Enhancing Reasoning Capabilities of LLMs via Principled Synthetic Logic Corpus, Morishita et al.,
DART-Math: Difficulty-Aware Rejection Tuning for Mathematical Problem-Solving, Tong et al.,
O1 Replication Journey--Part 2: Surpassing O1-preview through Simple Distillation, Big Progress or Bitter Lesson?, Huang et al.,
System-2 Mathematical Reasoning via Enriched Instruction Tuning, Cai et al.,
Acemath: Advancing frontier math reasoning with post-training and reward modeling, Liu et al.,
Imitate, explore, and self-improve: A reproduction report on slow-thinking reasoning systems, Min et al.,
Openai o1 system card, Jaech et al.,
Smaller, Weaker, Yet Better: Training LLM Reasoners via Compute-Optimal Sampling, Bansal et al.,
Advancing Math Reasoning in Language Models: The Impact of Problem-Solving Data, Data Synthesis Methods, and Training Stages, Chen et al.,
Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning, Guo et al.,
Agent-R: Training Language Model Agents to Reflect via Iterative Self-Training, Yuan et al.,
s1: Simple test-time scaling, Muennighoff et al.,
RedStar: Does Scaling Long-CoT Data Unlock Better Slow-Reasoning Systems?, Xu et al.,
FastMCTS: A Simple Sampling Strategy for Data Synthesis, Li et al.,
LLMs Can Teach Themselves to Better Predict the Future, Turtel et al.,
SoS1: O1 and R1-Like Reasoning LLMs are Sum-of-Square Solvers, Li et al.,
Distillation Scaling Laws, Busbridge et al.,
Unveiling the Mechanisms of Explicit CoT Training: How Chain-of-Thought Enhances Reasoning Generalization, Yao et al.,
CoT2Align: Cross-Chain of Thought Distillation via Optimal Transport Alignment for Language Models with Different Tokenizers, Le et al.,
Unveiling the Key Factors for Distilling Chain-of-Thought Reasoning, Chen et al.,
Chain-of-Thought Matters: Improving Long-Context Language Models with Reasoning Path Supervision, Zhu et al.,
Demystifying Long Chain-of-Thought Reasoning in LLMs, Yeo et al.,
LIMO: Less is More for Reasoning, Ye et al.,
PromptCoT: Synthesizing Olympiad-level Problems for Mathematical Reasoning in Large Language Models, Zhao et al.,
Rewarding Graph Reasoning Process makes LLMs more Generalized Reasoners, Peng et al.,
OpenCodeReasoning: Advancing Data Distillation for Competitive Coding, Ahmad et al.,

2.2.5 Deep Reasoning Self-Learning

Thinking fast and slow with deep learning and tree search, Anthony et al.,
Star: Bootstrapping reasoning with reasoning, Zelikman et al.,
Synthetic Prompting: Generating Chain-of-Thought Demonstrations for Large Language Models, Shao et al.,
Reinforced self-training (rest) for language modeling, Gulcehre et al.,
Training Chain-of-Thought via Latent-Variable Inference, Hoffman et al.,
Tree of Thoughts: Deliberate Problem Solving with Large Language Models, Yao et al.,
RAFT: Reward rAnked FineTuning for Generative Foundation Model Alignment, Dong et al.,
Training large language models for reasoning through reverse curriculum reinforcement learning, Xi et al.,
Beyond Human Data: Scaling Self-Training for Problem-Solving with Language Models, Singh et al.,
V-STaR: Training Verifiers for Self-Taught Reasoners, Hosseini et al.,
ReAct Meets ActRe: Autonomous Annotation of Agent Trajectories for Contrastive Self-Training, Yang et al.,
Fine-Tuning with Divergent Chains of Thought Boosts Reasoning Through Self-Correction in Language Models, Puerto et al.,
Direct Large Language Model Alignment Through Self-Rewarding Contrastive Prompt Distillation, Liu et al.,
Iterative Reasoning Preference Optimization, Pang et al.,
Chain of Preference Optimization: Improving Chain-of-Thought Reasoning in LLMs, Zhang et al.,
AlphaMath Almost Zero: Process Supervision without Process, Chen et al.,
Cream: Consistency Regularized Self-Rewarding Language Models, Wang et al.,
TPO: Aligning Large Language Models with Multi-branch & Multi-step Preference Trees, Liao et al.,
Towards Self-Improvement of LLMs via MCTS: Leveraging Stepwise Knowledge with Curriculum Preference Learning, Wang et al.,
On the impact of fine-tuning on chain-of-thought reasoning, Lobo et al.,
Weak-to-Strong Reasoning, Yang et al.,
Self-Explore: Enhancing Mathematical Reasoning in Language Models with Fine-grained Rewards, Hwang et al.,
OpenRFT: Adapting Reasoning Foundation Model for Domain-specific Tasks with Reinforcement Fine-Tuning, Zhang et al.,
Proposing and solving olympiad geometry with guided tree search, Zhang et al.,
Enhancing Reasoning through Process Supervision with Monte Carlo Tree Search, Li et al.,
Policy Guided Tree Search for Enhanced LLM Reasoning, Li et al.,
Self-Improvement Towards Pareto Optimality: Mitigating Preference Conflicts in Multi-Objective Alignment, Li et al.,
BOLT: Bootstrap Long Chain-of-Thought in Language Models without Distillation, Pang et al.,
Process-based Self-Rewarding Language Models, Zhang et al.,
Entropy-Based Adaptive Weighting for Self-Training, Wang et al.,
Entropy-based Exploration Conduction for Multi-step Reasoning, Zhang et al.,

3. Feasible Reflection

3.1 Feedback

When is Tree Search Useful for LLM Planning? It Depends on the Discriminator, Chen et al.,
From generation to judgment: Opportunities and challenges of llm-as-a-judge, Li et al.,
Search, Verify and Feedback: Towards Next Generation Post-training Paradigm of Foundation Models via Verifier Engineering, Guan et al.,
Llms-as-judges: a comprehensive survey on llm-based evaluation methods, Li et al.,
A Survey on Feedback-based Multi-step Reasoning for Large Language Models on Mathematics, Wei et al.,

3.1.1 Hybrid Feedbacks

The lessons of developing process reward models in mathematical reasoning, Zhang et al.,
Step-KTO: Optimizing Mathematical Reasoning through Stepwise Binary Feedback, Lin et al.,

3.1.2 Overall Feedback

Overall Feedback from Outcome Reward Model

Training verifiers to solve math word problems, Cobbe et al.,
Towards Mitigating LLM Hallucination via Self Reflection, Ji et al.,
Deepseekmath: Pushing the limits of mathematical reasoning in open language models, Shao et al.,
Generative verifiers: Reward modeling as next-token prediction, Zhang et al.,
Self-generated critiques boost reward modeling for language models, Yu et al.,
Self-Consistency of the Internal Reward Models Improves Self-Rewarding Language Models, Zhou et al.,

Overall Feedback from RLLMs

Self-critiquing models for assisting human evaluators, Saunders et al.,
Language models (mostly) know what they know, Kadavath et al.,
Constitutional AI: Harmlessness from AI Feedback, Bai et al.,
Contrastive learning with logic-driven data augmentation for logical reasoning over text, Bao et al.,
Self-verification improves few-shot clinical information extraction, Gero et al.,
Shepherd: A critic for language model generation, Wang et al.,
Large Language Models are Better Reasoners with Self-Verification, Weng et al.,
Large Language Models Cannot Self-Correct Reasoning Yet, Huang et al.,
SelfCheck: Using LLMs to Zero-Shot Check Their Own Step-by-Step Reasoning, Miao et al.,
Agent-Pro: Learning to Evolve via Policy-Level Reflection and Optimization, Zhang et al.,
Self-reflection in llm agents: Effects on problem-solving performance, Renze et al.,
Llm critics help catch llm bugs, McAleese et al.,
LLM Reasoners: New Evaluation, Library, and Analysis of Step-by-Step Reasoning with Large Language Models, Hao et al.,
Promptbreeder: Self-Referential Self-Improvement via Prompt Evolution, Fernando et al.,
Self-Contrast: Better Reflection Through Inconsistent Solving Perspectives, Zhang et al.,
Critic-cot: Boosting the reasoning abilities of large language model via chain-of-thoughts critic, Zheng et al.,
Small Language Models Need Strong Verifiers to Self-Correct Reasoning, Zhang et al.,
Abstract Meaning Representation-Based Logic-Driven Data Augmentation for Logical Reasoning, Bao et al.,
Reversal of Thought: Enhancing Large Language Models with Preference-Guided Reverse Reasoning Warm-up, Yuan et al.,
What Makes Large Language Models Reason in (Multi-Turn) Code Generation?, Zheng et al.,
Attention-guided Self-reflection for Zero-shot Hallucination Detection in Large Language Models, Liu et al.,
Learning to Plan & Reason for Evaluation with Thinking-LLM-as-a-Judge, Saha et al.,
Physics of Language Models: Part 2.2, How to Learn From Mistakes on Grade-School Math Problems, Ye et al.,
Training an LLM-as-a-Judge Model: Pipeline, Insights, and Practical Lessons, Hu et al.,
A Study on Leveraging Search and Self-Feedback for Agent Reasoning, Yuan et al.,
RefineCoder: Iterative Improving of Large Language Models via Adaptive Critique Refinement for Code Generation, Zhou et al.,

Overall Feedback from Rule Extraction

Star: Bootstrapping reasoning with reasoning, Zelikman et al.,
Critic: Large language models can self-correct with tool-interactive critiquing, Gou et al.,
LEVER: Learning to Verify Language-to-Code Generation with Execution, Ni et al.,
Reinforced self-training (rest) for language modeling, Gulcehre et al.,
Tree of Thoughts: Deliberate Problem Solving with Large Language Models, Yao et al.,
Reasoning with Language Model is Planning with World Model, Hao et al.,
Solving Challenging Math Word Problems Using GPT-4 Code Interpreter with Code-based Self-Verification, Zhou et al.,
VerMCTS: Synthesizing Multi-Step Programs using a Verifier, a Large Language Model, and Tree Search, Brandfonbrener et al.,
ReFT: Reasoning with Reinforced Fine-Tuning, Trung et al.,
OpenCodeInterpreter: Integrating Code Generation with Execution and Refinement, Zheng et al.,
On designing effective rl reward at training time for llm reasoning, Gao et al.,
o1-coder: an o1 replication for coding, Zhang et al.,
Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning, Guo et al.,
Dynamic Scaling of Unit Tests for Code Reward Modeling, Ma et al.,
Logic-RL: Unleashing LLM Reasoning with Rule-Based Reinforcement Learning, Xie et al.,
SoRFT: Issue Resolving with Subtask-oriented Reinforced Fine-Tuning, Ma et al.,
ACECODER: Acing Coder RL via Automated Test-Case Synthesis, Zeng et al.,
Monitoring Reasoning Models for Misbehavior and the Risks of Promoting Obfuscation, Baker et al.,

3.1.3 Process Feedback

Solving math word problems with process- and outcome-based feedback, Uesato et al.,
Vineppo: Unlocking rl potential for llm reasoning through refined credit assignment, Kazemnejad et al.,

Process Feedback from Process Rewarded Model

Let's reward step by step: Step-Level reward model as the Navigators for Reasoning, Ma et al.,
Let's verify step by step, Lightman et al.,
Uncertainty of Thoughts: Uncertainty-Aware Planning Enhances Information Seeking in Large Language Models, Hu et al.,
OVM, Outcome-supervised Value Models for Planning in Mathematical Reasoning, Yu et al.,
Token-Supervised Value Models for Enhancing Mathematical Reasoning Capabilities of Large Language Models, Lee et al.,
Tlcr: Token-level continuous reward for fine-grained reinforcement learning from human feedback, Yoon et al.,
Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations, Wang et al.,
Selective Preference Optimization via Token-Level Reward Function Estimation, Yang et al.,
Step-level Value Preference Optimization for Mathematical Reasoning, Chen et al.,
Skywork-o1 open series, Team et al.,
Entropy-Regularized Process Reward Model, Zhang et al.,
Hunyuanprover: A scalable data synthesis framework and guided tree search for automated theorem proving, Li et al.,
Acemath: Advancing frontier math reasoning with post-training and reward modeling, Liu et al.,
Free process rewards without process labels, Yuan et al.,
AutoPSV: Automated Process-Supervised Verifier, Lu et al.,
Scaling Autonomous Agents via Automatic Reward Modeling And Planning, Chen et al.,
Advancing LLM Reasoning Generalists with Preference Trees, Yuan et al.,
Learning to Plan & Reason for Evaluation with Thinking-LLM-as-a-Judge, Saha et al.,
Rewarding Progress: Scaling Automated Process Verifiers for LLM Reasoning, Setlur et al.,
Capturing Nuanced Preferences: Preference-Aligned Distillation for Small Language Models, Gu et al.,
Distill Not Only Data but Also Rewards: Can Smaller Language Models Surpass Larger Ones?, Zhang et al.,
Adaptivestep: Automatically dividing reasoning step through model confidence, Liu et al.,
Process Reward Models for LLM Agents: Practical Framework and Directions, Choudhury et al.,
Process reinforcement through implicit rewards, Cui et al.,
Full-Step-DPO: Self-Supervised Preference Optimization with Step-wise Rewards for Mathematical Reasoning, Xu et al.,
VersaPRM: Multi-Domain Process Reward Model via Synthetic Reasoning Data, Zeng et al.,
Direct Value Optimization: Improving Chain-of-Thought Reasoning in LLMs with Refined Values, Zhang et al.,
Teaching Language Models to Critique via Reinforcement Learning, Xie et al.,
Uncertainty-Aware Search and Value Models: Mitigating Search Scaling Flaws in LLMs, Yu et al.,
AURORA: Automated Training Framework of Universal Process Reward Models via Ensemble Prompting and Reverse Verification, Tan et al.,
Visualprm: An effective process reward model for multimodal reasoning, Wang et al.,
QwQ: Reflect Deeply on the Boundaries of the Unknown, Team et al.,

Process Feedback from RLLMs

ReAct: Synergizing Reasoning and Acting in Language Models, Yao et al.,
Reflexion: language agents with verbal reinforcement learning, Shinn et al.,
Can We Verify Step by Step for Incorrect Answer Detection?, Xu et al.,
Monte carlo tree search boosts reasoning via iterative preference learning, Xie et al.,
Interactive Evolution: A Neural-Symbolic Self-Training Framework For Large Language Models, Xu et al.,
Llm critics help catch bugs in mathematics: Towards a better mathematical verifier with natural language feedback, Gao et al.,
Step-dpo: Step-wise preference optimization for long-chain reasoning of llms, Lai et al.,
Reasoning in Flux: Enhancing Large Language Models Reasoning through Uncertainty-aware Adaptive Guidance, Yin et al.,
Advancing Process Verification for Large Language Models via Tree-Based Preference Learning, He et al.,
Outcome-Refining Process Supervision for Code Generation, Yu et al.,
Error Classification of Large Language Models on Math Word Problems: A Dynamically Adaptive Framework, Sun et al.,
Zero-Shot Verification-guided Chain of Thoughts, Chowdhury et al.,
Robotic Programmer: Video Instructed Policy Code Generation for Robotic Manipulation, Xie et al.,
Unveiling and Causalizing CoT: A Causal Pespective, Fu et al.,
Diverse Inference and Verification for Advanced Reasoning, Drori et al.,
Mathematical Reasoning in Large Language Models: Assessing Logical and Arithmetic Errors across Wide Numerical Ranges, Shrestha et al.,
Uncertainty-Aware Step-wise Verification with Generative Reward Models, Ye et al.,
JudgeLRM: Large Reasoning Models as a Judge, Chen et al.,

3.2 Refinement

3.2.4 Prompt-based Refinement Generation

Self-critiquing models for assisting human evaluators, Saunders et al.,
Self-Refine: Iterative Refinement with Self-Feedback, Madaan et al.,
Grace: Discriminator-guided chain-of-thought reasoning, Khalifa et al.,
Automatically correcting large language models: Surveying the landscape of diverse self-correction strategies, Pan et al.,
Reflexion: language agents with verbal reinforcement learning, Shinn et al.,
Towards Mitigating LLM Hallucination via Self Reflection, Ji et al.,
SelfCheck: Using LLMs to Zero-Shot Check Their Own Step-by-Step Reasoning, Miao et al.,
Learning to check: Unleashing potentials for self-correction in large language models, Zhang et al.,
REFINER: Reasoning Feedback on Intermediate Representations, Paul et al.,
GLoRe: When, Where, and How to Improve LLM Reasoning via Global and Local Refinements, Havrilla et al.,
Enhancing Zero-Shot Chain-of-Thought Reasoning in Large Language Models through Logic, Zhao et al.,
General purpose verification for chain of thought prompting, Vacareanu et al.,
Large language models have intrinsic self-correction ability, Liu et al.,
Progressive-Hint Prompting Improves Reasoning in Large Language Models, Zheng et al.,
Accessing gpt-4 level mathematical olympiad solutions via monte carlo tree self-refine with llama-3 8b, Zhang et al.,
Toward Adaptive Reasoning in Large Language Models with Thought Rollback, Chen et al.,
CoT Rerailer: Enhancing the Reliability of Large Language Models in Complex Reasoning Tasks through Error Detection and Correction, Wan et al.,
Pride and Prejudice: LLM Amplifies Self-Bias in Self-Refinement, Xu et al.,
A Theoretical Understanding of Self-Correction through In-context Alignment, Wang et al.,
ReST-MCTS*: LLM Self-Training via Process Reward Guided Tree Search, Zhang et al.,
Enhancing Mathematical Reasoning in LLMs by Stepwise Correction, Wu et al.,
LLM Self-Correction with DeCRIM: Decompose, Critique, and Refine for Enhanced Following of Instructions with Multiple Constraints, Ferraz et al.,
Advancing Large Language Model Attribution through Self-Improving, Huang et al.,
Confidence vs Critique: A Decomposition of Self-Correction Capability for LLMs, Yang et al.,
LLM2: Let Large Language Models Harness System 2 Reasoning, Yang et al.,
Understanding the Dark Side of LLMs' Intrinsic Self-Correction, Zhang et al.,
Enhancing LLM Reasoning with Multi-Path Collaborative Reactive and Reflection agents, He et al.,
BackMATH: Towards Backward Reasoning for Solving Math Problems Step by Step, Zhang et al.,
ReARTeR: Retrieval-Augmented Reasoning with Trustworthy Process Rewarding, Sun et al.,
Step Back to Leap Forward: Self-Backtracking for Boosting Reasoning of Language Models, Yang et al.,
Optimizing generative AI by backpropagating language model feedback, Yuksekgonul et al.,
DLPO: Towards a Robust, Efficient, and Generalizable Prompt Optimization Framework from a Deep-Learning Perspective, Peng et al.,
Instruct-of-Reflection: Enhancing Large Language Models Iterative Reflection Capabilities via Dynamic-Meta Instruction, Liu et al.,
The Lighthouse of Language: Enhancing LLM Agents via Critique-Guided Improvement, Yang et al.,

3.2.5 RL-based Refinement Learning

Training language models to self-correct via reinforcement learning, Kumar et al.,
Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning, Guo et al.,
7B Model and 8K Examples: Emerging Reasoning with Reinforcement Learning is Both Effective and Efficient, Zeng et al.,
S²R: Teaching LLMs to Self-verify and Self-correct via Reinforcement Learning, Ma et al.,
ReasonFlux: Hierarchical LLM Reasoning via Scaling Thought Templates, Yang et al.,
ReVISE: Learning to Refine at Test-Time via Intrinsic Self-Verification, Lee et al.,
ARIES: Stimulating Self-Refinement of Large Language Models by Iterative Preference Optimization, Zeng et al.,

3.2.6 SFT-based Refinement Imitation

Learning from mistakes makes llm better reasoner, An et al.,
Reflection-tuning: Data recycling improves llm instruction-tuning, Li et al.,
Teaching Large Language Models to Self-Debug, Chen et al.,
Enhancing visual-language modality alignment in large vision language models via self-improvement, Wang et al.,
Llm critics help catch bugs in mathematics: Towards a better mathematical verifier with natural language feedback, Gao et al.,
Mutual reasoning makes smaller llms stronger problem-solvers, Qi et al.,
S ³ c-Math: Spontaneous Step-level Self-correction Makes Large Language Models Better Mathematical Reasoners, Yan et al.,
Recursive Introspection: Teaching Language Model Agents How to Self-Improve, Qu et al.,
O1 Replication Journey: A Strategic Progress Report--Part 1, Qin et al.,
Enhancing llm reasoning via critique models with test-time and training-time supervision, Xi et al.,
Vision-language models can self-improve reasoning via reflection, Cheng et al.,
CoT-based Synthesizer: Enhancing LLM Performance through Answer Synthesis, Zhang et al.,
Critique fine-tuning: Learning to critique is more effective than learning to imitate, Wang et al.,
RealCritic: Towards Effectiveness-Driven Evaluation of Language Model Critiques, Tang et al.,
ProgCo: Program Helps Self-Correction of Large Language Models, Song et al.,
URSA: Understanding and Verifying Chain-of-thought Reasoning in Multimodal Mathematics, Luo et al.,
Iterative Deepening Sampling for Large Language Models, Chen et al.,
LLMs Can Easily Learn to Reason from Demonstrations Structure, not content, is what matters!, Li et al.,
MM-Verify: Enhancing Multimodal Reasoning with Chain-of-Thought Verification, Sun et al.,

4. Extensive Exploration

Improving Policies via Search in Cooperative Partially Observable Games, Lerer et al.,
On The Planning Abilities of OpenAI's o1 Models: Feasibility, Optimality, and Generalizability, Wang et al.,
LLMs Still Can't Plan; Can LRMs? A Preliminary Evaluation of OpenAI's o1 on PlanBench, Valmeekam et al.,
Scaling of search and learning: A roadmap to reproduce o1 from reinforcement learning perspective, Zeng et al.,
Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning, Guo et al.,

4.1 Exploration Scaling

Greedy Policy Search: A Simple Baseline for Learnable Test-Time Augmentation, Lyzhov et al.,
Scaling scaling laws with board games, Jones et al.,
Large language monkeys: Scaling inference compute with repeated sampling, Brown et al.,
From Decoding to Meta-Generation: Inference-time Algorithms for Large Language Models, Welleck et al.,
From medprompt to o1: Exploration of run-time strategies for medical challenge problems and beyond, Nori et al.,
ECM: A Unified Electronic Circuit Model for Explaining the Emergence of In-Context Learning and Chain-of-Thought in Large Language Model, Chen et al.,
(Mis) Fitting: A Survey of Scaling Laws, Li et al.,

4.1.1 Parallel Scaling

Self-Consistency Improves Chain of Thought Reasoning in Language Models, Wang et al.,
Large language monkeys: Scaling inference compute with repeated sampling, Brown et al.,
Inference scaling laws: An empirical analysis of compute-optimal inference for problem-solving with language models, Wu et al.,
Inference Scaling vs Reasoning: An Empirical Analysis of Compute-Optimal LLM Problem-Solving, AbdElhameed et al.,
Metascale: Test-time scaling with evolving meta-thoughts, Liu et al.,

Sampling Optimization

Cross-lingual Prompting: Improving Zero-shot Chain-of-Thought Reasoning across Languages, Qin et al.,
Scaling llm inference with optimized sample compute allocation, Zhang et al.,
Planning in Natural Language Improves LLM Search for Code Generation, Wang et al.,
Python is Not Always the Best Choice: Embracing Multilingual Program of Thoughts, Luo et al.,
Chain-of-Reasoning: Towards Unified Mathematical Reasoning in Large Language Models via a Multi-Paradigm Perspective, Yu et al.,
Revisiting the Test-Time Scaling of o1-like Models: Do they Truly Possess Test-Time Scaling Capabilities?, Zeng et al.,
Optimizing Temperature for Language Models with Multi-Sample Inference, Du et al.,
Bag of Tricks for Inference-time Computation of LLM Reasoning, Liu et al.,
Towards Thinking-Optimal Scaling of Test-Time Compute for LLM Reasoning, Yang et al.,
Is Depth All You Need? An Exploration of Iterative Reasoning in LLMs, Wu et al.,
Do We Truly Need So Many Samples? Multi-LLM Repeated Sampling Efficiently Scale Test-Time Compute, Chen et al.,

Verification Optimization

Show Your Work: Scratchpads for Intermediate Computation with Language Models, Nye et al.,
Making Language Models Better Reasoners with Step-Aware Verifier, Li et al.,
Deductive Verification of Chain-of-Thought Reasoning, Ling et al.,
Don't Trust: Verify -- Grounding LLM Quantitative Reasoning with Autoformalization, Zhou et al.,
Multi-step problem solving through a verifier: An empirical analysis on model-induced process supervision, Wang et al.,
Stepwise self-consistent mathematical reasoning with large language models, Zhao et al.,
General purpose verification for chain of thought prompting, Vacareanu et al.,
Improve Mathematical Reasoning in Language Models by Automated Process Supervision, Luo et al.,
Scaling llm test-time compute optimally can be more effective than scaling model parameters, Snell et al.,
Inference scaling laws: An empirical analysis of compute-optimal inference for problem-solving with language models, Wu et al.,
Learning to Reason via Program Generation, Emulation, and Search, Weir et al.,
What are the essential factors in crafting effective long context multi-hop instruction datasets? insights and best practices, Chen et al.,
MAgICoRe: Multi-Agent, Iterative, Coarse-to-Fine Refinement for Reasoning, Chen et al.,
Rlef: Grounding code llms in execution feedback with reinforcement learning, Gehring et al.,
Beyond examples: High-level automated reasoning paradigm in in-context learning via mcts, Wu et al.,
Wrong-of-Thought: An Integrated Reasoning Framework with Multi-Perspective Verification and Wrong Information, Zhang et al.,
A simple and provable scaling law for the test-time compute of large language models, Chen et al.,
Lachesis: Predicting LLM Inference Accuracy using Structural Properties of Reasoning Paths, Kim et al.,
Seed-cts: Unleashing the power of tree search for superior performance in competitive coding tasks, Wang et al.,
From Drafts to Answers: Unlocking LLM Potential via Aggregation Fine-Tuning, Li et al.,
SETS: Leveraging Self-Verification and Self-Correction for Improved Test-Time Scaling, Chen et al.,
Instantiation-based Formalization of Logical Reasoning Tasks using Language Models and Logical Solvers, Raza et al.,
The lessons of developing process reward models in mathematical reasoning, Zhang et al.,
ExACT: Teaching AI Agents to Explore with Reflective-MCTS and Exploratory Learning, Yu et al.,
Scalable Best-of-N Selection for Large Language Models via Self-Certainty, Kang et al.,
Can 1B LLM Surpass 405B LLM? Rethinking Compute-Optimal Test-Time Scaling, Liu et al.,
ECM: A Unified Electronic Circuit Model for Explaining the Emergence of In-Context Learning and Chain-of-Thought in Large Language Model, Chen et al.,
Sample, Scrutinize and Scale: Effective Inference-Time Search by Scaling Verification, Zhao et al.,
TestNUC: Enhancing Test-Time Computing Approaches through Neighboring Unlabeled Data Consistency, Zou et al.,
Confidence Improves Self-Consistency in LLMs, Taubenfeld et al.,
S*: Test Time Scaling for Code Generation, Li et al.,
Bridging Internal Probability and Self-Consistency for Effective and Efficient LLM Reasoning, Zhou et al.,
Multidimensional Consistency Improves Reasoning in Language Models, Lai et al.,
Efficient test-time scaling via self-calibration, Huang et al.,

4.1.2 Vertical Scaling

Complexity-Based Prompting for Multi-step Reasoning, Fu et al.,
Openai o1 system card, Jaech et al.,
s1: Simple test-time scaling, Muennighoff et al.,
Test-time Computing: from System-1 Thinking to System-2 Thinking, Ji et al.,
Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach, Geiping et al.,
Inner Thinking Transformer: Leveraging Dynamic Depth Scaling to Foster Adaptive Internal Thinking, Chen et al.,
METAL: A Multi-Agent Framework for Chart Generation with Test-Time Scaling, Li et al.,
Reasoning-as-Logic-Units: Scaling Test-Time Reasoning in Large Language Models Through Logic Unit Alignment, Li et al.,
Think Twice: Enhancing LLM Reasoning by Scaling Multi-round Test-time Thinking, Tian et al.,
What, How, Where, and How Well? A Survey on Test-Time Scaling in Large Language Models, Zhang et al.,

4.2 External Exploration

4.2.3 Human-driven Exploration

Least-to-Most Prompting Enables Complex Reasoning in Large Language Models, Zhou et al.,
Tree of Thoughts: Deliberate Problem Solving with Large Language Models, Yao et al.,
PATHFINDER: Guided Search over Multi-Step Reasoning Paths, Golovneva et al.,
Demystifying chains, trees, and graphs of thoughts, Besta et al.,
Graph of Thoughts: Solving Elaborate Problems with Large Language Models, Besta et al.,
Tree of Uncertain Thoughts Reasoning for Large Language Models, Mo et al.,
GraphReason: Enhancing Reasoning Capabilities of Large Language Models through A Graph-Based Verification Approach, Cao et al.,
Toward Self-Improvement of LLMs via Imagination, Searching, and Criticizing, Tian et al.,
On the diagram of thought, Zhang et al.,
Understanding When Tree of Thoughts Succeeds: Larger Models Excel in Generation, Not Discrimination, Chen et al.,
Treebon: Enhancing inference-time alignment with speculative tree-search and best-of-n sampling, Qiu et al.,
Scattered Forest Search: Smarter Code Space Exploration with LLMs, Light et al.,
On the Empirical Complexity of Reasoning and Planning in LLMs, Kang et al.,
CodeTree: Agent-guided Tree Search for Code Generation with Large Language Models, Li et al.,
SPaR: Self-Play with Tree-Search Refinement to Improve Instruction-Following in Large Language Models, Cheng et al.,
Forest-of-thought: Scaling test-time compute for enhancing LLM reasoning, Bi et al.,
Tree-of-Code: A Tree-Structured Exploring Framework for End-to-End Code Generation and Execution in Complex Task Handling, Ni et al.,
A Roadmap to Guide the Integration of LLMs in Hierarchical Planning, Puerta-Merino et al.,
Atom of Thoughts for Markov LLM Test-Time Scaling, Teng et al.,
START: Self-taught Reasoner with Tools, Li et al.,

4.2.4 Model-driven Exploration

Enhancing Exploration Logics

Self-Evaluation Guided Beam Search for Reasoning, Xie et al.,
No train still gain. unleash mathematical reasoning of large language models with monte carlo tree search guided by energy function, Xu et al.,
Mindstar: Enhancing math reasoning in pre-trained llms at inference time, Kang et al.,
Strategist: Learning Strategic Skills by LLMs via Bi-Level Tree Search, Light et al.,
Deductive Beam Search: Decoding Deducible Rationale for Chain-of-Thought Reasoning, Zhu et al.,
Beyond A*: Better Planning with Transformers via Search Dynamics Bootstrapping, Lehnert et al.,
Tree search for language model agents, Koh et al.,
Agent q: Advanced reasoning and learning for autonomous ai agents, Putta et al.,
RethinkMCTS: Refining Erroneous Thoughts in Monte Carlo Tree Search for Code Generation, Li et al.,
Fine-Tuning Large Vision-Language Models as Decision-Making Agents via Reinforcement Learning, Zhai et al.,
Aflow: Automating agentic workflow generation, Zhang et al.,
Cooperative Strategic Planning Enhances Reasoning Capabilities in Large Language Models, Wang et al.,
Deliberate reasoning for llms as structure-aware planning with accurate world model, Xiong et al.,
Marco-o1: Towards open reasoning models for open-ended solutions, Zhao et al.,
Technical report: Enhancing llm reasoning with reward-guided tree search, Jiang et al.,
SRA-MCTS: Self-driven Reasoning Aurmentation with Monte Carlo Tree Search for Enhanced Code Generation, Xu et al.,
GPT-Guided Monte Carlo Tree Search for Symbolic Regression in Financial Fraud Detection, Kadam et al.,
MC-NEST--Enhancing Mathematical Reasoning in Large Language Models with a Monte Carlo Nash Equilibrium Self-Refine Tree, Rabby et al.,
Mulberry: Empowering mllm with o1-like reasoning and reflection via collective monte carlo tree search, Yao et al.,
Towards Intrinsic Self-Correction Enhancement in Monte Carlo Tree Search Boosted Reasoning via Iterative Preference Learning, Jiang et al.,
Ensembling Large Language Models with Process Reward-Guided Tree Search for Better Complex Reasoning, Park et al.,
Monte Carlo Tree Search for Comprehensive Exploration in LLM-Based Automatic Heuristic Design, Zheng et al.,
Leveraging Constrained Monte Carlo Tree Search to Generate Reliable Long Chain-of-Thought for Mathematical Reasoning, Lin et al.,
Reasoning with Reinforced Functional Token Tuning, Zhang et al.,
CoAT: Chain-of-Associated-Thoughts Framework for Enhancing Large Language Models Reasoning, Pan et al.,
MAPoRL: Multi-Agent Post-Co-Training for Collaborative Large Language Models with Reinforcement Learning, Park et al.,
Better Process Supervision with Bi-directional Rewarding Signals, Chen et al.,

Exploration-Path Feedback

Don't throw away your value model! Generating more preferable text with Value-Guided Monte-Carlo Tree Search decoding, Liu et al.,
Making PPO even better: Value-Guided Monte-Carlo Tree Search decoding, Liu et al.,
Llama-berry: Pairwise optimization for o1-like olympiad-level mathematical reasoning, Zhang et al.,
AtomThink: A Slow Thinking Framework for Multimodal Mathematical Reasoning, Xiang et al.,
A Probabilistic Inference Approach to Inference-Time Scaling of LLMs using Particle-Based Monte Carlo Methods, Puri et al.,

Unified Improvements

Enhancing multi-step reasoning abilities of language models through direct q-function optimization, Liu et al.,
Process reward model with q-value rankings, Li et al.,
rStar-Math: Small LLMs Can Master Math Reasoning with Self-Evolved Deep Thinking, Guan et al.,
Evolving Deeper LLM Thinking, Lee et al.,
Hypothesis-Driven Theory-of-Mind Reasoning for Large Language Models, Kim et al.,
SIFT: Grounding LLM Reasoning in Contexts via Stickers, Zeng et al.,
QLASS: Boosting Language Agent Inference via Q-Guided Stepwise Search, Lin et al.,
CritiQ: Mining Data Quality Criteria from Human Preferences, Guo et al.,

4.3 Internal Exploration

RL on Incorrect Synthetic Data Scales the Efficiency of LLM Math Reasoning by Eight-Fold, Setlur et al.,
Sft memorizes, rl generalizes: A comparative study of foundation model post-training, Chu et al.,
Satori: Reinforcement Learning with Chain-of-Action-Thought Enhances LLM Reasoning via Autoregressive Search, Shen et al.,
Demystifying Long Chain-of-Thought Reasoning in LLMs, Yeo et al.,
LLM Post-Training: A Deep Dive into Reasoning Large Language Models, Kumar et al.,

4.3.5 RL Strategies

Policy Gradient Methods for Reinforcement Learning with Function Approximation, Sutton et al.,
Proximal policy optimization algorithms, Schulman et al.,
Deepseekmath: Pushing the limits of mathematical reasoning in open language models, Shao et al.,
ReMax: A Simple, Effective, and Efficient Reinforcement Learning Method for Aligning Large Language Models, Li et al.,
CPL: Critical Plan Step Learning Boosts LLM Generalization in Reasoning Tasks, Wang et al.,
Unpacking DPO and PPO: Disentangling Best Practices for Learning from Preference Feedback, Ivison et al.,
A Small Step Towards Reproducing OpenAI o1: Progress Report on the Steiner Open Source Models, Ji et al.,
A Comprehensive Survey of Direct Preference Optimization: Datasets, Theories, Variants, and Applications, Xiao et al.,
Critical Tokens Matter: Token-Level Contrastive Estimation Enhence LLM's Reasoning Capability, Lin et al.,
Improving Multi-Step Reasoning Abilities of Large Language Models with Direct Advantage Policy Optimization, Liu et al.,
Offline Reinforcement Learning for LLM Multi-Step Reasoning, Wang et al.,
REINFORCE++: A Simple and Efficient Approach for Aligning Large Language Models, Hu et al.,
Diverse Preference Optimization, Lanchantin et al.,
COS (M+ O) S: Curiosity and RL-Enhanced MCTS for Exploring Story Space via Language Models, Materzok et al.,
7B Model and 8K Examples: Emerging Reasoning with Reinforcement Learning is Both Effective and Efficient, Zeng et al.,
LIMR: Less is More for RL Scaling, Li et al.,
Ignore the KL Penalty! Boosting Exploration on Critical Tokens to Enhance RL Fine-Tuning, Vassoyan et al.,
Lean and Mean: Decoupled Value Policy Optimization with Global Value Guidance, Huang et al.,
Process reinforcement through implicit rewards, Cui et al.,
SPPD: Self-training with Process Preference Learning Using Dynamic Value Margin, Yi et al.,
Reward-aware Preference Optimization: A Unified Mathematical Framework for Model Alignment, Sun et al.,
Focused-DPO: Enhancing Code Generation Through Focused Preference Optimization on Error-Prone Points, Zhang et al.,
Reasoning with Reinforced Functional Token Tuning, Zhang et al.,
Qsharp: Provably Optimal Distributional RL for LLM Post-Training, Zhou et al.,
Thinking Preference Optimization, Yang et al.,
Trajectory Balance with Asynchrony: Decoupling Exploration and Learning for Fast, Scalable LLM Post-Training, Bartoldson et al.,
Dapo: An open-source llm reinforcement learning system at scale, Yu et al.,
Open-Reasoner-Zero: An Open Source Approach to Scaling Up Reinforcement Learning on the Base Model, Hu et al.,
Optimizing Test-Time Compute via Meta Reinforcement Finetuning, Qu et al.,
Reinforcement Learning for Reasoning in Small LLMs: What Works and What Doesn't, Dang et al.,
SWEET-RL: Training Multi-Turn LLM Agents on Collaborative Reasoning Tasks, Zhou et al.,
VAPO: Efficient and Reliable Reinforcement Learning for Advanced Reasoning Tasks, YuYue et al.,

4.3.6 Reward Strategies

Model-rewarded RL

Training verifiers to solve math word problems, Cobbe et al.,
AlphaZero-Like Tree-Search can Guide Large Language Model Decoding and Training, Wan et al.,
ReST-MCTS*: LLM Self-Training via Process Reward Guided Tree Search, Zhang et al.,
Advancing Language Model Reasoning through Reinforcement Learning and Inference Scaling, Hou et al.,
Kimi k1. 5: Scaling reinforcement learning with llms, Team et al.,
DeepScaleR: Surpassing O1-Preview with a 1.5B Model by Scaling RL, Luo et al.,
STeCa: Step-level Trajectory Calibration for LLM Agent Learning, Wang et al.,
Expanding RL with Verifiable Rewards Across Diverse Domains, Su et al.,
R-PRM: Reasoning-Driven Process Reward Modeling, She et al.,

Rule-rewarded RL

Stepcoder: Improve code generation with reinforcement learning from compiler feedback, Dou et al.,
Beyond Human Data: Scaling Self-Training for Problem-Solving with Language Models, Singh et al.,
Building math agents with multi-turn iterative preference learning, Xiong et al.,
Unlocking the Capabilities of Thought: A Reasoning Boundary Framework to Quantify and Optimize Chain-of-Thought, Chen et al.,
o1-coder: an o1 replication for coding, Zhang et al.,
Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning, Guo et al.,
Kimi k1. 5: Scaling reinforcement learning with llms, Team et al.,
Logic-RL: Unleashing LLM Reasoning with Rule-Based Reinforcement Learning, Xie et al.,
Towards Thinking-Optimal Scaling of Test-Time Compute for LLM Reasoning, Yang et al.,
Training Language Models to Reason Efficiently, Arora et al.,
Exploring the Limit of Outcome Reward for Learning Mathematical Reasoning, Lyu et al.,
Competitive Programming with Large Reasoning Models, El-Kishky et al.,
SWE-RL: Advancing LLM Reasoning via Reinforcement Learning on Open Software Evolution, Wei et al.,
Inference-Time Computations for LLM Reasoning and Planning: A Benchmark and Insights, Parashar et al.,
Metastable Dynamics of Chain-of-Thought Reasoning: Provable Benefits of Search, RL and Distillation, Kim et al.,
On the Emergence of Thinking in LLMs I: Searching for the Right Intuition, Ye et al.,
The Danger of Overthinking: Examining the Reasoning-Action Dilemma in Agentic Tasks, Cuadron et al.,
Z1: Efficient Test-time Scaling with Code, Yu et al.,

5. Future and Frontiers

5.1 Agentic & Embodied Long CoT

Solving Math Word Problems via Cooperative Reasoning induced Language Models, Zhu et al.,
Reasoning with Language Model is Planning with World Model, Hao et al.,
Large language models as commonsense knowledge for large-scale task planning, Zhao et al.,
Robotic Control via Embodied Chain-of-Thought Reasoning, Zawalski et al.,
Tree-Planner: Efficient Close-loop Task Planning with Large Language Models, Hu et al.,
Language Agent Tree Search Unifies Reasoning, Acting, and Planning in Language Models, Zhou et al.,
Strategist: Learning Strategic Skills by LLMs via Bi-Level Tree Search, Light et al.,
Mixture-of-agents enhances large language model capabilities, Wang et al.,
ADaPT: As-Needed Decomposition and Planning with Language Models, Prasad et al.,
Tree search for language model agents, Koh et al.,
Hiagent: Hierarchical working memory management for solving long-horizon agent tasks with large language model, Hu et al.,
S3 agent: Unlocking the power of VLLM for zero-shot multi-modal sarcasm detection, Wang et al.,
MACM: Utilizing a Multi-Agent System for Condition Mining in Solving Complex Mathematical Problems, Lei et al.,
Fine-Tuning Large Vision-Language Models as Decision-Making Agents via Reinforcement Learning, Zhai et al.,
EVOLvE: Evaluating and Optimizing LLMs For Exploration, Nie et al.,
Agents Thinking Fast and Slow: A Talker-Reasoner Architecture, Christakopoulou et al.,
Robotic Programmer: Video Instructed Policy Code Generation for Robotic Manipulation, Xie et al.,
Titans: Learning to memorize at test time, Behrouz et al.,
Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success, Kim et al.,
World Modeling Makes a Better Planner: Dual Preference Optimization for Embodied Task Planning, Wang et al.,
Embodied-Reasoner: Synergizing Visual Search, Reasoning, and Action for Embodied Interactive Tasks, Zhang et al.,
Cosmos-reason1: From physical common sense to embodied reasoning, Azzolini et al.,
Improving Retrospective Language Agents via Joint Policy Gradient Optimization, Feng et al.,
Haste Makes Waste: Evaluating Planning Abilities of LLMs for Efficient and Feasible Multitasking with Time Constraints Between Actions, Wu et al.,
MultiAgentBench: Evaluating the Collaboration and Competition of LLM agents, Zhu et al.,
ReMA: Learning to Meta-think for LLMs with Multi-Agent Reinforcement Learning, Wan et al.,
MAS-GPT: Training LLMs To Build LLM-Based Multi-Agent Systems, Ye et al.,
Advances and Challenges in Foundation Agents: From Brain-Inspired Intelligence to Evolutionary, Collaborative, and Safe Systems, Liu et al.,

5.2 Efficient Long CoT

Guiding language model reasoning with planning tokens, Wang et al.,
Synergy-of-thoughts: Eliciting efficient reasoning in hybrid language models, Shang et al.,
Distilling system 2 into system 1, Yu et al.,
Concise thoughts: Impact of output length on llm reasoning and cost, Nayab et al.,
Litesearch: Efficacious tree search for llm, Wang et al.,
Uncertainty-Guided Optimization on Large Language Model Search Trees, Grosse et al.,
CPL: Critical Plan Step Learning Boosts LLM Generalization in Reasoning Tasks, Wang et al.,
Unlocking the Capabilities of Thought: A Reasoning Boundary Framework to Quantify and Optimize Chain-of-Thought, Chen et al.,
Kvsharer: Efficient inference via layer-wise dissimilar KV cache sharing, Yang et al.,
Interpretable contrastive monte carlo tree search reasoning, Gao et al.,
Dualformer: Controllable fast and slow thinking by learning with randomized reasoning traces, Su et al.,
DynaThink: Fast or Slow? A Dynamic Decision-Making Framework for Large Language Models, Pan et al.,
Language models are hidden reasoners: Unlocking latent reasoning capabilities via self-rewarding, Chen et al.,
Token-budget-aware llm reasoning, Han et al.,
B-STaR: Monitoring and Balancing Exploration and Exploitation in Self-Taught Reasoners, Zeng et al.,
C3oT: Generating Shorter Chain-of-Thought without Compromising Effectiveness, Kang et al.,
Training large language models to reason in a continuous latent space, Hao et al.,
CoMT: A Novel Benchmark for Chain of Multi-modal Thought on Large Vision-Language Models, Cheng et al.,
Kimi k1. 5: Scaling reinforcement learning with llms, Team et al.,
O1-Pruner: Length-Harmonizing Fine-Tuning for O1-Like Reasoning Pruning, Luo et al.,
Reward-Guided Speculative Decoding for Efficient LLM Reasoning, Liao et al.,
Think Smarter not Harder: Adaptive Reasoning with Inference Aware Optimization, Yu et al.,
Efficient Reasoning with Hidden Thinking, Shen et al.,
On the Query Complexity of Verifier-Assisted Language Generation, Botta et al.,
TokenSkip: Controllable Chain-of-Thought Compression in LLMs, Xia et al.,
Boost, Disentangle, and Customize: A Robust System2-to-System1 Pipeline for Code Generation, Du et al.,
Jakiro: Boosting Speculative Decoding with Decoupled Multi-Head via MoE, Huang et al.,
Towards Reasoning Ability of Small Language Models, Srivastava et al.,
Towards Economical Inference: Enabling DeepSeek's Multi-Head Latent Attention in Any Transformer-based LLMs, Ji et al.,
Portable Reward Tuning: Towards Reusable Fine-Tuning across Different Pretrained Models, Chijiwa et al.,
MM-Verify: Enhancing Multimodal Reasoning with Chain-of-Thought Verification, Sun et al.,
Language Models Can Predict Their Own Behavior, Ashok et al.,
On the Convergence Rate of MCTS for the Optimal Value Estimation in Markov Decision Processes, Chang et al.,
CoT-Valve: Length-Compressible Chain-of-Thought Tuning, Ma et al.,
Training Language Models to Reason Efficiently, Arora et al.,
Chain of Draft: Thinking Faster by Writing Less, Xu et al.,
Learning to Stop Overthinking at Test Time, Bao et al.,
Self-Training Elicits Concise Reasoning in Large Language Models, Munkhbat et al.,
Length-Controlled Margin-Based Preference Optimization without Reference Model, Li et al.,
Reasoning on a Spectrum: Aligning LLMs to System 1 and System 2 Thinking, Ziabari et al.,
Dynamic Parallel Tree Search for Efficient LLM Reasoning, Ding et al.,
Stepwise Perplexity-Guided Refinement for Efficient Chain-of-Thought Reasoning in Large Language Models, Cui et al.,
Dynamic Chain-of-Thought: Towards Adaptive Deep Reasoning, Wang et al.,
SoftCoT: Soft Chain-of-Thought for Efficient Reasoning with LLMs, Xu et al.,
LightThinker: Thinking Step-by-Step Compression, Zhang et al.,
Position: Multimodal Large Language Models Can Significantly Advance Scientific Reasoning, Yan et al.,
Stepwise Informativeness Search for Improving LLM Reasoning, Wang et al.,
Adaptive Group Policy Optimization: Towards Stable Training and Token-Efficient Reasoning, Li et al.,
Innate Reasoning is Not Enough: In-Context Learning Enhances Reasoning Large Language Models with Less Overthinking, Ge et al.,
Understanding r1-zero-like training: A critical perspective, Liu et al.,
The First Few Tokens Are All You Need: An Efficient and Effective Unsupervised Prefix Fine-Tuning Method for Reasoning Models, Ji et al.,
L1: Controlling How Long A Reasoning Model Thinks With Reinforcement Learning, Aggarwal et al.,
DAST: Difficulty-Adaptive Slow-Thinking for Large Reasoning Models, Shen et al.,
ThinkPrune: Pruning Long Chain-of-Thought of LLMs via Reinforcement Learning, Hou et al.,
Long-Short Chain-of-Thought Mixture Supervised Fine-Tuning Eliciting Efficient Reasoning in Large Language Models, Yu et al.,
Can Pruning Improve Reasoning? Revisiting Long-CoT Compression with Capability in Mind for Better Reasoning, Shang et al.,

5.3 Knowledge-Augmented Long CoT

Best of Both Worlds: Harmonizing LLM Capabilities in Decision-Making and Question-Answering for Treatment Regimes, Liu et al.,
Understanding Reasoning Ability of Language Models From the Perspective of Reasoning Paths Aggregation, Wang et al.,
Stream of search (sos): Learning to search in language, Gandhi et al.,
CoPS: Empowering LLM Agents with Provable Cross-Task Experience Sharing, Yang et al.,
Disentangling memory and reasoning ability in large language models, Jin et al.,
Huatuogpt-o1, towards medical complex reasoning with llms, Chen et al.,
RAG-Star: Enhancing Deliberative Reasoning with Retrieval Augmented Verification and Refinement, Jiang et al.,
O1 Replication Journey--Part 3: Inference-time Scaling for Medical Reasoning, Huang et al.,
MedS ³: Towards Medical Small Language Models with Self-Evolved Slow Thinking, Jiang et al.,
Search-o1: Agentic search-enhanced large reasoning models, Li et al.,
Chain-of-Retrieval Augmented Generation, Wang et al.,
Evaluating Large Language Models through Role-Guide and Self-Reflection: A Comparative Study, Zhao et al.,
Citrus: Leveraging Expert Cognitive Pathways in a Medical Language Model for Advanced Medical Decision Support, Wang et al.,
ECM: A Unified Electronic Circuit Model for Explaining the Emergence of In-Context Learning and Chain-of-Thought in Large Language Model, Chen et al.,
Large Language Models for Recommendation with Deliberative User Preference Alignment, Fang et al.,
ChineseEcomQA: A Scalable E-commerce Concept Evaluation Benchmark for Large Language Models, Chen et al.,
DeepRAG: Thinking to Retrieval Step by Step for Large Language Models, Guan et al.,
Open Deep Research, Team et al.,
HopRAG: Multi-Hop Reasoning for Logic-Aware Retrieval-Augmented Generation, Liu et al.,
O1 Embedder: Let Retrievers Think Before Action, Yan et al.,
MedVLM-R1: Incentivizing Medical Reasoning Capability of Vision-Language Models (VLMs) via Reinforcement Learning, Pan et al.,
Towards Robust Legal Reasoning: Harnessing Logical LLMs in Law, Kant et al.,
OctoTools: An Agentic Framework with Extensible Tools for Complex Reasoning, Lu et al.,
R1-Searcher: Incentivizing the Search Capability in LLMs via Reinforcement Learning, Song et al.,
RARE: Retrieval-Augmented Reasoning Modeling, Wang et al.,
Graph-Augmented Reasoning: Evolving Step-by-Step Knowledge Graph Retrieval for LLM Reasoning, Wu et al.,
Learning to Reason with Search for LLMs via Reinforcement Learning, Chen et al.,
Fin-R1: A Large Language Model for Financial Reasoning through Reinforcement Learning, Liu et al.,
m1: Unleash the Potential of Test-Time Scaling for Medical Reasoning with Large Language Models, Huang et al.,

5.4 Multilingual Long CoT

Cross-lingual Prompting: Improving Zero-shot Chain-of-Thought Reasoning across Languages, Qin et al.,
Not All Languages Are Created Equal in LLMs: Improving Multilingual Capability by Cross-Lingual-Thought Prompting, Huang et al.,
xcot: Cross-lingual instruction tuning for cross-lingual chain-of-thought reasoning, Chai et al.,
Multilingual large language model: A survey of resources, taxonomy and frontiers, Qin et al.,
A Tree-of-Thoughts to Broaden Multi-step Reasoning across Languages, Ranaldi et al.,
AutoCAP: Towards Automatic Cross-lingual Alignment Planning for Zero-shot Chain-of-Thought, Zhang et al.,
Enhancing Advanced Visual Reasoning Ability of Large Language Models, Li et al.,
DRT-o1: Optimized Deep Reasoning Translation via Long Chain-of-Thought, Wang et al.,
A survey of multilingual large language models, Qin et al.,
Demystifying Multilingual Chain-of-Thought in Process Reward Modeling, Wang et al.,
The Multilingual Mind: A Survey of Multilingual Reasoning in Language Models, Ghosh et al.,

5.5 Multimodal Long CoT

Large Language Models Can Self-Correct with Minimal Effort, Wu et al.,
Multimodal Chain-of-Thought Reasoning in Language Models, Zhang et al.,
Q*: Improving multi-step reasoning for llms with deliberative planning, Wang et al.,
M³CoT: A Novel Benchmark for Multi-Domain Multi-step Multi-modal Chain-of-Thought, Chen et al.,
A survey on evaluation of multimodal large language models, Huang et al.,
Fine-Tuning Large Vision-Language Models as Decision-Making Agents via Reinforcement Learning, Zhai et al.,
What factors affect multi-modal in-context learning? an in-depth exploration, Qin et al.,
Enhancing Advanced Visual Reasoning Ability of Large Language Models, Li et al.,
Insight-v: Exploring long-chain visual reasoning with multimodal large language models, Dong et al.,
Llava-o1: Let vision language models reason step-by-step, Xu et al.,
AtomThink: A Slow Thinking Framework for Multimodal Mathematical Reasoning, Xiang et al.,
ARES: Alternating Reinforcement Learning and Supervised Fine-Tuning for Enhanced Multi-Modal Chain-of-Thought Reasoning Through Diverse AI Feedback, Byun et al.,
Enhancing the reasoning ability of multimodal large language models via mixed preference optimization, Wang et al.,
Slow Perception: Let's Perceive Geometric Figures Step-by-step, Wei et al.,
Diving into Self-Evolving Training for Multimodal Reasoning, Liu et al.,
Scaling inference-time search with vision value model for improved visual comprehension, Xiyao et al.,
CoMT: A Novel Benchmark for Chain of Multi-modal Thought on Large Vision-Language Models, Cheng et al.,
Inference Retrieval-Augmented Multi-Modal Chain-of-Thoughts Reasoning for Language Models, He et al.,
Audio-CoT: Exploring Chain-of-Thought Reasoning in Large Audio Language Model, Ma et al.,
BoostStep: Boosting mathematical capability of Large Language Models via improved single-step reasoning, Zhang et al.,
InternLM-XComposer2.5-Reward: A Simple Yet Effective Multi-Modal Reward Model, Zang et al.,
Can MLLMs Reason in Multimodality? EMMA: An Enhanced MultiModal ReAsoning Benchmark, Hao et al.,
Visual Agents as Fast and Slow Thinkers, Sun et al.,
Virgo: A Preliminary Exploration on Reproducing o1-like MLLM, Du et al.,
Llamav-o1: Rethinking step-by-step visual reasoning in llms, Thawakar et al.,
Inference-time scaling for diffusion models beyond scaling denoising steps, Ma et al.,
Can We Generate Images with CoT? Let's Verify and Reinforce Image Generation Step by Step, Guo et al.,
Imagine while Reasoning in Space: Multimodal Visualization-of-Thought, Li et al.,
Monte Carlo Tree Diffusion for System 2 Planning, Yoon et al.,
Boosting Multimodal Reasoning with MCTS-Automated Structured Thinking, Wu et al.,
Audio-Reasoner: Improving Reasoning Capability in Large Audio Language Models, Xie et al.,
Visual-RFT: Visual Reinforcement Fine-Tuning, Liu et al.,
Qwen2. 5-Omni Technical Report, Xu et al.,
Vision-r1: Incentivizing reasoning capability in multimodal large language models, Huang et al.,
Lmm-r1: Empowering 3b lmms with strong reasoning abilities through two-stage rule-based rl, Peng et al.,
Reason-RFT: Reinforcement Fine-Tuning for Visual Reasoning, Tan et al.,
OThink-MR1: Stimulating multimodal generalized reasoning capabilities through dynamic reinforcement learning, Liu et al.,
Grounded Chain-of-Thought for Multimodal Large Language Models, Wu et al.,
Test-Time View Selection for Multi-Modal Decision Making, Jain et al.,
Rethinking RL Scaling for Vision Language Models: A Transparent, From-Scratch Framework and Comprehensive Evaluation Scheme, Ma et al.,

5.6 Safety and Stability for Long CoT

Larger and more instructable language models become less reliable, Zhou et al.,
On the Hardness of Faithful Chain-of-Thought Reasoning in Large Language Models, Tanneru et al.,
The Impact of Reasoning Step Length on Large Language Models, Jin et al.,
Unlocking the Capabilities of Thought: A Reasoning Boundary Framework to Quantify and Optimize Chain-of-Thought, Chen et al.,
Can Large Language Models Understand You Better? An MBTI Personality Detection Dataset Aligned with Population Traits, Li et al.,
o3-mini vs DeepSeek-R1: Which One is Safer?, Arrieta et al.,
Efficient Reasoning with Hidden Thinking, Shen et al.,
Think More, Hallucinate Less: Mitigating Hallucinations via Dual Process of Fast and Slow Thinking, Cheng et al.,
Explicit vs. Implicit: Investigating Social Bias in Large Language Models through Self-Reflection, Zhao et al.,
Challenges in Ensuring AI Safety in DeepSeek-R1 Models: The Shortcomings of Reinforcement Learning Strategies, Parmar et al.,
Early External Safety Testing of OpenAI's o3-mini: Insights from the Pre-Deployment Evaluation, Arrieta et al.,
International AI Safety Report, Bengio et al.,
GuardReasoner: Towards Reasoning-based LLM Safeguards, Liu et al.,
OVERTHINKING: Slowdown Attacks on Reasoning LLMs, Kumar et al.,
A Mousetrap: Fooling Large Reasoning Models for Jailbreak with Chain of Iterative Chaos, Yao et al.,
MetaSC: Test-Time Safety Specification Optimization for Language Models, Gallego et al.,
Leveraging Reasoning with Guidelines to Elicit and Utilize Knowledge for Enhancing Safety Alignment, Wang et al.,
The Hidden Risks of Large Reasoning Models: A Safety Assessment of R1, Zhou et al.,
Reasoning and the Trusting Behavior of DeepSeek and GPT: An Experiment Revealing Hidden Fault Lines in Large Language Models, Lu et al.,
Superintelligent Agents Pose Catastrophic Risks: Can Scientist AI Offer a Safer Path?, Bengio et al.,
Emergent Response Planning in LLM, Dong et al.,
Investigating the Impact of Quantization Methods on the Safety and Reliability of Large Language Models, Kharinaev et al.,
Safety Evaluation of DeepSeek Models in Chinese Contexts, Zhang et al.,
Reasoning Does Not Necessarily Improve Role-Playing Ability, Feng et al.,
H-CoT: Hijacking the Chain-of-Thought Safety Reasoning Mechanism to Jailbreak Large Reasoning Models, Including OpenAI o1/o3, DeepSeek-R1, and Gemini 2.0 Flash Thinking, Kuo et al.,
BoT: Breaking Long Thought Processes of o1-like Large Language Models through Backdoor Attack, Zhu et al.,
" Nuclear Deployed!": Analyzing Catastrophic Risks in Decision-making of Autonomous LLM Agents, Xu et al.,
SafeChain: Safety of Language Models with Long Chain-of-Thought Reasoning Capabilities, Jiang et al.,
Reasoning-to-Defend: Safety-Aware Reasoning Can Defend Large Language Models from Jailbreaking, Zhu et al.,
CER: Confidence Enhanced Reasoning in LLMs, Razghandi et al.,
Measuring Faithfulness of Chains of Thought by Unlearning Reasoning Steps, Tutek et al.,
The Hidden Dimensions of LLM Alignment: A Multi-Dimensional Safety Analysis, Pan et al.,
Policy Frameworks for Transparent Chain-of-Thought Reasoning in Large Language Models, Chen et al.,
Do Chains-of-Thoughts of Large Language Models Suffer from Hallucinations, Cognitive Biases, or Phobias in Bayesian Reasoning?, Araya et al.,
Process or Result? Manipulated Ending Tokens Can Mislead Reasoning LLMs to Ignore the Correct Reasoning Steps, Cui et al.,
Safety Tax: Safety Alignment Makes Your Large Reasoning Models Less Reasonable, Huang et al.,
Recitation over Reasoning: How Cutting-Edge Language Models Can Fail on Elementary School-Level Reasoning Problems?, Yan et al.,
Reasoning Models Don’t Always Say What They Think, Chen et al.,

6. Resources

6.1 Open-Sourced Training Framework

OpenRLHF: An Easy-to-use, Scalable and High-performance RLHF Framework, Hu et al.,
LLM Reasoners: New Evaluation, Library, and Analysis of Step-by-Step Reasoning with Large Language Models, Hao et al.,
OpenR: An Open Source Framework for Advanced Reasoning with Large Language Models, Wang et al.,
TinyZero, Pan et al.,
R1-V: Reinforcing Super Generalization Ability in Vision-Language Models with Less Than 3, Chen et al.,
VL-Thinking: An R1-Derived Visual Instruction Tuning Dataset for Thinkable LVLMs, Chen et al.,
VLM-R1: A stable and generalizable R1-style Large Vision-Language Model, Shen et al.,
7B Model and 8K Examples: Emerging Reasoning with Reinforcement Learning is Both Effective and Efficient, Zeng et al.,
Open R1, Team et al.,
DeepScaleR: Surpassing O1-Preview with a 1.5B Model by Scaling RL, Luo et al.,
X-R1, Team et al.,
Open-Reasoner-Zero: An Open Source Approach to Scaling Reinforcement Learning on the Base Model, Jingcheng Hu and Yinmin Zhang and Qi Han and Daxin Jiang and Xiangyu Zhang et al.,
Logic-RL: Unleashing LLM Reasoning with Rule-Based Reinforcement Learning, Xie et al.,
R1-Multimodal-Journey, Shao et al.,
Open-R1-Multimodal, Lab et al.,
Video-R1, Team et al.,
Dapo: An open-source llm reinforcement learning system at scale, Yu et al.,
VAPO: Efficient and Reliable Reinforcement Learning for Advanced Reasoning Tasks, YuYue et al.,

🎁 Citation

If you find this work useful, welcome to cite us.

@misc{chen2025reasoning,
      title={Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models}, 
      author={Qiguang Chen and Libo Qin and Jinhao Liu and Dengyun Peng and Jiannan Guan and Peng Wang and Mengkang Hu and Yuhang Zhou and Te Gao and Wanxiang Che},
      year={2025},
      eprint={2503.09567},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2503.09567}, 
}

Name		Name	Last commit message	Last commit date
Latest commit History 19 Commits
assets/images		assets/images
README.md		README.md

LightChen233/Awesome-Long-Chain-of-Thought-Reasoning

Folders and files

Latest commit

History

Repository files navigation