Yufan Zhuang1,2, Liyuan Liu1, Dinghuai Zhang1, Chandan Singh1, Yelong Shen1, Jingbo Shang2, Jianfeng Gao1
1Microsoft Research 2UC San Diego
We explore how Knowledge Flow Prompting, which iteratively updates a knowledge list between rollouts, overcomes the context limit of LLMs in test-time scaling. This iterative refinement mirrors human deliberation: the model progressively distills insights from its attempts into a refined knowledge list, which in turn empowers subsequent rollouts. Knowledge Flow enables both gpt-oss-120b and Qwen3-235B-A22B-Thinking to achieve 100% accuracy on AIME25 without any training, tools, or external feedback.
For detailed insights into the methodology and results, please refer to our blog post.
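At its core, each iteration samples a rollout, distills insights from it, and folds them into a persistent knowledge list that is prepended to the next prompt. The sketch below is a minimal illustration of that loop, not the exact implementation in the scripts; `generate` and `extract_insights` are hypothetical helpers.

```python
# Minimal sketch of the Knowledge Flow loop (hypothetical helpers, not the repo's API).
def knowledge_flow(question, generate, extract_insights, reflex_size=64):
    knowledge = []   # persistent list of distilled insights, carried across rollouts
    solutions = []
    for _ in range(reflex_size):
        # Prepend the accumulated knowledge so each rollout starts from prior lessons.
        prompt = "\n".join(
            ["Known insights:"] + [f"- {k}" for k in knowledge] + ["Question:", question]
        )
        solution = generate(prompt)                              # one full reasoning rollout
        solutions.append(solution)
        knowledge.extend(extract_insights(solution, knowledge))  # refine the list
    return solutions, knowledge
```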
- CUDA >= 12.8
- VLLM == 0.10.2
Option A: Minimal Installation (recommended for quickstarts)
pip install -r requirements.txt
This installs the core dependencies.
Option B: Full Installation (includes all dependency versions)
pip install -r requirements_full.txt
This includes additional packages for extended functionality, including CUDA libraries, evaluation tools, and more.
The core Knowledge Flow implementation is available in two variants:
GPT-OSS variant:
- Main Script: scripts/vllm_kflow_oss.py
- Bash Script: bash_scripts/gpt_oss.sh
- Model: openai/gpt-oss-120b
Qwen variant:
- Main Script: scripts/vllm_kflow_qwen.py
- Bash Script: bash_scripts/qwen3.sh
- Model: Qwen/Qwen3-235B-A22B-Thinking-2507
Both main scripts support the following arguments:
--model_name # Hugging Face model identifier
--max_new_tokens # Maximum generation length (e.g., 131072 for GPT-OSS, 262144 for Qwen)
--temperature # Sampling temperature (default: 0.6)
--reflex_size # Number of iterations (default: 64)
--split # Dataset split: train/test/validation
--output_postfix # Custom postfix for output files
We provide several ablation variants to study the impact of different components:
These scripts test the effect of regenerating knowledge at each step:
Key Difference: Uses NLTK sentence tokenization to segment and regenerate knowledge descriptions more systematically.
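As a rough illustration of that difference, the regeneration variant can split the running knowledge text into sentences with NLTK and rebuild the list at each step rather than only appending to it. The snippet below is a hedged sketch of the segmentation step only, not the exact code in the ablation scripts.

```python
import nltk
from nltk.tokenize import sent_tokenize

nltk.download("punkt", quiet=True)  # sentence tokenizer model used by sent_tokenize

def segment_knowledge(knowledge_text):
    # Split accumulated knowledge into individual sentences so each one can be
    # kept, rewritten, or dropped when the list is regenerated at the next step.
    return [s.strip() for s in sent_tokenize(knowledge_text) if s.strip()]
```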
These scripts explore using positive reinforcement (correct solutions) instead of mistake tracking:
Key Difference: Builds a knowledge base from successful solutions rather than learning from mistakes.
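To make the contrast with mistake tracking concrete, here is a hedged sketch of the positive variant: only insights from attempts judged successful are added to the knowledge base (`is_successful` is a hypothetical placeholder for whatever criterion the scripts use).

```python
# Hedged sketch: reinforce what worked instead of recording mistakes.
def update_knowledge_positive(knowledge, solution, insight, is_successful):
    if is_successful(solution):      # hypothetical success criterion
        knowledge.append(insight)    # keep lessons from successful solutions
    return knowledge                 # failed attempts contribute nothing
```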
These scripts implement a baseline that doesn't maintain long-term knowledge:
Key Difference: Tests Markovian Thinking, where each iteration depends only on the immediately previous step, without accumulated knowledge.
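Concretely, this baseline conditions each rollout only on the preceding attempt rather than on a growing knowledge list; a minimal sketch, assuming a hypothetical generate helper:

```python
# Hedged sketch of the Markovian baseline: no long-term knowledge,
# each step sees only the immediately previous attempt.
def markovian_rollouts(question, generate, reflex_size=64):
    previous = ""
    solutions = []
    for _ in range(reflex_size):
        prompt = f"Previous attempt:\n{previous}\n\nQuestion:\n{question}"
        previous = generate(prompt)   # overwrite rather than accumulate
        solutions.append(previous)
    return solutions
```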
cd bash_scripts
bash gpt_oss.sh
Or run directly with Python:
model_name=openai/gpt-oss-120b
BACKEND=TRITON_ATTN_VLLM_V1
export HF_HUB_ENABLE_HF_TRANSFER=1
export VLLM_FLASHINFER_ALLREDUCE_FUSION_THRESHOLDS_MB='{"2":32,"4":32,"8":8}'
export VLLM_USE_FLASHINFER_MOE_MXFP4_MXFP8=1
VLLM_ATTENTION_BACKEND=$BACKEND TOKENIZERS_PARALLELISM=false PYTHONPATH="../":"$PYTHONPATH" python3 ../scripts/vllm_kflow_oss.py \
--model_name "$model_name" \
--max_new_tokens 131072 \
--temperature 0.6 \
--reflex_size 64
Results are saved in ./results/vllm_{max_new_tokens}_{output_postfix}/{model_name}/:
- predictions_{step}.json - Predictions at each step; includes questions, generated solutions, extracted answers, gold answers, and correctness
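For example, a small script along these lines can report per-step accuracy from those files; the output directory and the "correct" field name are assumptions based on the description above and may differ from the actual JSON layout.

```python
import glob
import json

# Hedged sketch: aggregate accuracy across Knowledge Flow steps.
# The results directory and the "correct" field are assumed, not taken from the repo.
for path in sorted(glob.glob("./results/vllm_131072_demo/openai/gpt-oss-120b/predictions_*.json")):
    with open(path) as f:
        records = json.load(f)
    accuracy = sum(bool(r["correct"]) for r in records) / len(records)
    print(f"{path}: accuracy = {accuracy:.3f}")
```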
We used the following settings on B200 GPUs in our experiments; adjust them for your compute platform.
For GPT-OSS:
export VLLM_ATTENTION_BACKEND=TRITON_ATTN_VLLM_V1
export HF_HUB_ENABLE_HF_TRANSFER=1
export VLLM_FLASHINFER_ALLREDUCE_FUSION_THRESHOLDS_MB='{"2":32,"4":32,"8":8}'
export VLLM_USE_FLASHINFER_MOE_MXFP4_MXFP8=1

For Qwen:
export VLLM_ATTENTION_BACKEND=FLASH_ATTN
export HF_HUB_ENABLE_HF_TRANSFER=1

If you find this work helpful, please cite us:
@misc{zhuang2025knowledgeflow,
title = {Knowledge Flow: Scaling Reasoning Beyond the Context Limit},
url = {https://yufanzhuang.notion.site/knowledge-flow},
author = {Zhuang, Yufan and Liu, Liyuan and Zhang, Dinghuai and Singh, Chandan and Shen, Yelong and Shang, Jingbo and Gao, Jianfeng},
journal = {Notion Blog},
year = {2025},
month = Oct,
}

We release our code under Apache 2.0; see the LICENSE file for details.