
Knowledge Flow: Scaling Reasoning Beyond the Context Limit

Yufan Zhuang¹,², Liyuan Liu¹, Dinghuai Zhang¹, Chandan Singh¹, Yelong Shen¹, Jingbo Shang², Jianfeng Gao¹

¹Microsoft Research  ²UC San Diego

We explore how Knowledge Flow Prompting, which iteratively updates a knowledge list between rollouts, overcomes the context limit of LLMs in test-time scaling. The iterative refinement mirrors human deliberation: the model progressively distills insights from its attempts into a knowledge list that informs subsequent rollouts. Knowledge Flow enables both gpt-oss-120b and Qwen3-235B-A22B-Thinking to reach 100% accuracy on AIME25 without any training, tools, or external feedback.

(Figure: Knowledge Flow overview)

For detailed insights into the methodology and results, please refer to our blog post.
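
To make the procedure concrete, here is a minimal sketch of the loop described above. It assumes a generic generate(prompt) -> str callable for the model and a hypothetical distill(solution) -> list[str] step for extracting insights; the actual implementation lives in the scripts under scripts/.

from typing import Callable, List, Tuple

def knowledge_flow(question: str,
                   generate: Callable[[str], str],
                   distill: Callable[[str], List[str]],
                   reflex_size: int = 64) -> Tuple[List[str], List[str]]:
    """Run `reflex_size` rollouts, carrying a knowledge list between them."""
    knowledge: List[str] = []   # insights distilled from earlier attempts
    solutions: List[str] = []
    for _ in range(reflex_size):
        # Each rollout sees the question plus the current knowledge list,
        # so useful insights persist even though no single context window
        # contains the full history of attempts.
        prompt = question
        if knowledge:
            prompt += "\n\nKnowledge from previous attempts:\n" + "\n".join(
                f"- {item}" for item in knowledge)
        solution = generate(prompt)
        solutions.append(solution)
        # Distill new insights from this attempt and fold them into the list.
        knowledge.extend(distill(solution))
    return solutions, knowledge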


🚀 Quick Start

Environment Setup

1. Prerequisites

  • CUDA >= 12.8
  • vLLM == 0.10.2

2. Installation

Option A: Minimal Installation (recommended for quickstarts)

pip install -r requirements.txt

This installs the core dependencies.

Option B: Full Installation (includes all dependency versions)

pip install -r requirements_full.txt

This adds packages for extended functionality, such as CUDA libraries and evaluation tools.


📂 Project Structure

Main Implementation

The core Knowledge Flow implementation is available in two variants:

For GPT-OSS Models: scripts/vllm_kflow_oss.py (launched via bash_scripts/gpt_oss.sh)

For Qwen Models: see the corresponding script under scripts/

Key Parameters

Both main scripts support the following arguments:

--model_name        # Hugging Face model identifier
--max_new_tokens    # Maximum generation length (e.g., 131072 for GPT-OSS, 262144 for Qwen)
--temperature       # Sampling temperature (default: 0.6)
--reflex_size       # Number of iterations (default: 64)
--split             # Dataset split: train/test/validation
--output_postfix    # Custom postfix for output files
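
For orientation, below is a sketch of how these flags might be declared with argparse; the actual parsers in the scripts may use different defaults (the split default here is an assumption) and carry additional options.

import argparse

parser = argparse.ArgumentParser(description="Knowledge Flow runner (illustrative sketch)")
parser.add_argument("--model_name", type=str, required=True,
                    help="Hugging Face model identifier")
parser.add_argument("--max_new_tokens", type=int, default=131072,
                    help="Maximum generation length (e.g., 131072 for GPT-OSS, 262144 for Qwen)")
parser.add_argument("--temperature", type=float, default=0.6,
                    help="Sampling temperature")
parser.add_argument("--reflex_size", type=int, default=64,
                    help="Number of Knowledge Flow iterations")
parser.add_argument("--split", type=str, default="test",
                    choices=["train", "test", "validation"],
                    help="Dataset split")
parser.add_argument("--output_postfix", type=str, default="",
                    help="Custom postfix for output files")
args = parser.parse_args()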

🔬 Ablation Studies

We provide several ablation variants to study the impact of different components:

1. Knowledge Regeneration Variants

These scripts test the effect of regenerating knowledge at each step:

Key Difference: Uses NLTK sentence tokenization to segment and regenerate knowledge descriptions more systematically.
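
As an illustration of the regeneration step, the sketch below segments the accumulated knowledge into sentences with NLTK before rewriting each one; rewrite stands in for the model call and is a placeholder, not the exact function used in the ablation scripts.

import nltk
from nltk.tokenize import sent_tokenize

nltk.download("punkt", quiet=True)       # classic sentence tokenizer models
nltk.download("punkt_tab", quiet=True)   # required by newer NLTK releases

def regenerate_knowledge(knowledge_text: str, rewrite) -> list:
    """Split the knowledge into sentences and regenerate each one."""
    sentences = sent_tokenize(knowledge_text)
    # Rewriting sentence by sentence keeps the regenerated knowledge
    # systematic instead of reproducing the whole list verbatim.
    return [rewrite(sentence) for sentence in sentences]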

2. Positive Knowledge Variants

These scripts explore using positive reinforcement (correct solutions) instead of mistake tracking:

Key Difference: Builds a knowledge base from successful solutions rather than learning from mistakes.
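
A hedged sketch of this update rule is given below: only attempts judged successful contribute to the knowledge list. How success is determined and how a solution is summarized (summarize) are placeholders, not the exact logic of the ablation scripts.

from typing import Callable, List

def update_positive_knowledge(knowledge: List[str],
                              solution: str,
                              is_successful: bool,
                              summarize: Callable[[str], str]) -> List[str]:
    """Accumulate insights only from solutions judged successful."""
    if is_successful:
        # Keep a distilled summary of what worked, rather than a mistake log.
        knowledge.append(summarize(solution))
    return knowledge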

3. Markovian Thinking Baseline

These scripts implement a baseline that doesn't maintain long-term knowledge:

Key Difference: Tests a Markovian Thinking setup in which each iteration depends only on the immediately previous step, with no accumulated knowledge.
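
The contrast with Knowledge Flow can be seen in the sketch below: each rollout is conditioned only on the immediately preceding attempt, so nothing accumulates across steps. As before, generate is a generic model callable rather than the actual script code.

from typing import Callable, List

def markovian_baseline(question: str,
                       generate: Callable[[str], str],
                       reflex_size: int = 64) -> List[str]:
    """Each iteration sees only the previous attempt, with no knowledge list."""
    previous = ""
    solutions: List[str] = []
    for _ in range(reflex_size):
        prompt = question if not previous else (
            question + "\n\nPrevious attempt:\n" + previous)
        previous = generate(prompt)     # condition only on the last step
        solutions.append(previous)
    return solutions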


💻 Usage Examples

Running Knowledge Flow on GPT-OSS

cd bash_scripts
bash gpt_oss.sh

Or run directly with Python:

model_name=openai/gpt-oss-120b
BACKEND=TRITON_ATTN_VLLM_V1
export HF_HUB_ENABLE_HF_TRANSFER=1
export VLLM_FLASHINFER_ALLREDUCE_FUSION_THRESHOLDS_MB='{"2":32,"4":32,"8":8}'
export VLLM_USE_FLASHINFER_MOE_MXFP4_MXFP8=1

VLLM_ATTENTION_BACKEND=$BACKEND TOKENIZERS_PARALLELISM=false PYTHONPATH="../":"$PYTHONPATH" python3 ../scripts/vllm_kflow_oss.py \
   --model_name "$model_name" \
   --max_new_tokens 131072 \
   --temperature 0.6 \
   --reflex_size 64 

📊 Output Format

Results are saved in ./results/vllm_{max_new_tokens}_{output_postfix}/{model_name}/:

  • predictions_{step}.json - Predictions at each step
  • Includes: questions, generated solutions, extracted answers, gold answers, and correctness
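
To inspect a run, a predictions file can be loaded as in the sketch below. The concrete path (with an illustrative output_postfix of "demo") and the JSON field names (a list of per-question records with a boolean correct flag) are assumptions based on the description above and may differ from the actual output schema.

import json

# Illustrative path: max_new_tokens=131072, output_postfix="demo",
# model_name="openai/gpt-oss-120b", step 0.
path = "./results/vllm_131072_demo/openai/gpt-oss-120b/predictions_0.json"

with open(path) as f:
    records = json.load(f)

# Assuming each record carries a boolean correctness flag.
n_correct = sum(1 for r in records if r.get("correct"))
print(f"step 0 accuracy: {n_correct / len(records):.2%}")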

🔧 Technical Details

vLLM Backend Configuration

We used these settings on B200 GPUs in our experiments; adjust them for your compute platform.

For GPT-OSS:

export VLLM_ATTENTION_BACKEND=TRITON_ATTN_VLLM_V1
export HF_HUB_ENABLE_HF_TRANSFER=1
export VLLM_FLASHINFER_ALLREDUCE_FUSION_THRESHOLDS_MB='{"2":32,"4":32,"8":8}'
export VLLM_USE_FLASHINFER_MOE_MXFP4_MXFP8=1

For Qwen:

export VLLM_ATTENTION_BACKEND=FLASH_ATTN
export HF_HUB_ENABLE_HF_TRANSFER=1

📚 Citation

If you find this work helpful, please cite us:

@misc{zhuang2025knowledgeflow,
  title = {Knowledge Flow: Scaling Reasoning Beyond the Context Limit},
  url = {https://yufanzhuang.notion.site/knowledge-flow},
  author = {Zhuang, Yufan and Liu, Liyuan and Zhang, Dinghuai and Singh, Chandan and Shen, Yelong and Shang, Jingbo and Gao, Jianfeng},
  journal = {Notion Blog},
  year = {2025},
  month = oct,
}

📄 License

We release our code under the Apache 2.0 license; see the LICENSE file for details.
