Paper: Explore-Execute Chain: Towards an Efficient Structured Reasoning Paradigm
Models: TingheOliver/Explore-Execute-Chain-Qwen
Datasets: TingheOliver/Explore-Execute-Chain-Datasets
E2C separates reasoning into two phases inside a single model:
- Exploration (~1k tokens): sketch a high-level plan -- enumerate approaches, identify the most promising one.
- Execution (~10k tokens): carry out the plan with full step-by-step reasoning.
Because test-time search targets only the short exploration phase, scaling compute is ~8x cheaper than searching over full reasoning chains. Because only exploration segments need domain-specific fine-tuning, adapting to a new domain (e.g., medical QA) uses ~3.5% of the tokens required by standard SFT.
Qwen3-8B, Pass@1 averaged over 8 samples:
| Model | AIME'24 | AIME'25 | MATH500 | AMC23 | Avg |
|---|---|---|---|---|---|
| Qwen3-8B + GRPO | 36.9 | 34.4 | 88.2 | 79.3 | 59.6 |
| Qwen3-8B + E2C-(SFT+RL) | 40.6 | 33.8 | 87.7 | 80.3 | 61.5 |
AIME 2024, K/N = 32:
| Method | Accuracy | Tokens (k) |
|---|---|---|
| Self-Consistency | 50.0% | 86.2 |
| Tree-of-Thoughts | 50.0% | 71.3 |
| E2C-ReAct Loop | 53.3% | 12.4 |
E2C-ReAct Loop matches or beats standard methods while using 7x fewer tokens.
- Python 3.10+
- CUDA-capable GPU. Minimum VRAM:
- Inference / evaluation: 16 GB (single GPU)
- SFT training: 4x 40 GB GPUs recommended
- RL training: 8x 40 GB GPUs recommended
- PyTorch 2.1+
git clone https://github.com/OliverZ-dot/Explore-Execute-Chain-main.git
cd Explore-Execute-Chain-main
pip install -r verl/requirements.txtRun the released Qwen3-8B checkpoint on a single problem:
python example_inference.py \
--model_path TingheOliver/Explore-Execute-Chain-Qwen \
--subfolder Qwen3-8B-E2C-SFT-RL \
--problem "Find all prime numbers p such that p^2 + 2 is also prime."Use the 4B model instead:
python example_inference.py \
--model_path TingheOliver/Explore-Execute-Chain-Qwen \
--subfolder Qwen3-4B-E2C-SFT-RL \
--problem "Your problem here"All options:
python example_inference.py
--model_path TingheOliver/Explore-Execute-Chain-Qwen # HF model ID or local path
--subfolder Qwen3-8B-E2C-SFT-RL # subfolder in HF repo (omit for local paths)
--problem "Your problem here" # omit to use built-in example
--max_tokens 4096 # default: 2048
--temperature 0.7 # default: 0.7
The model outputs two clearly delimited sections:
EXPLORATION PHASE:
<high-level plan, ~1k tokens>
EXECUTION PHASE:
<step-by-step solution, ~10k tokens>
Select from eight built-in problems (math, medical, code) or enter your own:
python example_interactive.pyNote:
example_interactive.pyusesTingheOliver/Explore-Execute-Chain-Qwen(8B) by default. To switch models, editmodel_path_strandsubfolder_strat the top ofmain(). For a local path, setsubfolder_str = None.
Download and preprocess all training and evaluation data in one step:
bash scripts/prepare_all_data.sh # standard
bash scripts/prepare_all_data.sh --mirror # use hf-mirror.com (recommended in China)Partial downloads:
bash scripts/prepare_all_data.sh --skip-rl # SFT data only
bash scripts/prepare_all_data.sh --skip-download # reprocess already-downloaded files
bash scripts/download_datasets.sh --dataset eval # 16 evaluation benchmarks onlyWhat gets downloaded from TingheOliver/Explore-Execute-Chain-Datasets:
| Split | File | Size | Description |
|---|---|---|---|
| SFT train | e2c-sft.parquet |
77.7 MB | 58k exploration-execution pairs |
| RL train | e2c-rl.parquet |
19.4 MB | 14k GRPO rollout prompts |
| RL valid | e2c-rl-valid.parquet |
706 KB | validation split |
| Eval (math) | 8 datasets | varies | AIME'24/25, AMC23, GSM8K, MATH500, Minerva, OlympiadBench, MATH-Algebra |
| Eval (medical) | 8 datasets | varies | MedQA, MedMCQA, Anatomy, Clinical Knowledge, College Biology, College Medicine, Medical Genetics, Professional Medicine |
Processed files land in data/processed/.
Supervised fine-tuning on exploration-execution pairs.
bash scripts/e2c_sft.shOverride defaults with environment variables:
export MODEL_PATH="Qwen/Qwen3-8B"
export TRAIN_DATA="data/processed/sft/e2c-sft-train.parquet"
export OUTPUT_DIR="models/checkpoints/sft"
bash scripts/e2c_sft.shKey env vars: MODEL_PATH, TRAIN_DATA, VAL_DATA, OUTPUT_DIR, TOTAL_TRAINING_STEPS, CUDA_VISIBLE_DEVICES. GPUs are auto-detected when CUDA_VISIBLE_DEVICES is unset.
Checkpoint: models/checkpoints/sft/
Two-stage GRPO. Stage 1 warms up with high diversity (rollout=32, temp=1.3, 1 epoch). Stage 2 sharpens execution with elevated advantage weight on exploration tokens (rollout=8, temp=1.0, adv_coeff=2.0, 2 epochs).
export MODEL_PATH="models/checkpoints/sft/final"
export N_GPUS=8
bash scripts/e2c_rl.sh # both stages
bash scripts/e2c_rl.sh --stage 1 # warm-up only
bash scripts/e2c_rl.sh --stage 2 # main stage onlyCheckpoints: models/checkpoints/rl/stage1-warmup/, models/checkpoints/rl/stage2-main/
Fine-tunes only the exploration segments mixed with a small E2C regularization fraction. Uses ~3.5% of the tokens required by full SFT.
export MODEL_PATH="models/checkpoints/rl/stage2-main/final"
export TRAIN_DATA="data/processed/ef_sft/medical-train.parquet"
bash scripts/ef-sft.shCheckpoint: models/checkpoints/ef_sft/
export MODEL_PATH="models/checkpoints/sft/final"
bash scripts/e2c_dapo.sh # both stages
bash scripts/e2c_dapo.sh --stage 2 # main stage onlybash scripts/prepare_all_data.sh
bash scripts/e2c_sft.sh
export MODEL_PATH="models/checkpoints/sft/final" && bash scripts/e2c_rl.sh
bash scripts/eval.sh --model models/checkpoints/rl/stage2-main/final --dataset all --sample 8bash scripts/eval.sh # GSM8K, released 8B model
bash scripts/eval.sh --dataset all --sample 8 # all math benchmarks
bash scripts/eval.sh --dataset med --sample 4 # all medical benchmarks
bash scripts/eval.sh --model path/to/ckpt --dataset aime24 # local checkpoint (no subfolder needed)
bash scripts/eval.sh --subfolder Qwen3-4B-E2C-SFT-RL --dataset all # use 4B model insteadAll options:
| Flag | Default | Description |
|---|---|---|
--model PATH |
TingheOliver/Explore-Execute-Chain-Qwen |
HF ID or local path |
--subfolder NAME |
Qwen3-8B-E2C-SFT-RL |
subfolder in HF repo; omit or leave empty for local paths |
--dataset NAME |
gsm8k |
gsm8k, math, aime24, aime25, amc23, all, medqa, medmcqa, med |
--sample N |
1 |
samples per question (use 8 for reported results) |
--gpus N |
auto | number of GPUs |
--temp T |
1.0 |
sampling temperature |
--save-path PATH |
evaluation/e2c-eval |
where to write predictions |
Reproduces Table 3 of the paper. See tts/README.md for full details.
cd tts
pip install -r requirements.txt
python scripts/download_aime2024.py
# Run all methods at all budgets
python run_tts.py
# Specific methods and budgets
python run_tts.py --methods e2c_react_loop e2c_tot --budgets 4 8 16 32
# Quick debug run (first 5 problems only)
python run_tts.py --limit 5Set model_path (and subfolder if using the HF repo) in tts/config/tts.yaml before running. Results and plots are written to tts/outputs/.
Available methods: greedy_cot, self_consistency, e2c_sc, e2c_rp, e2c_react_loop, e2c_tot, e2c_select_lm_judge, e2c_select_semantic_cluster, tree_of_thoughts, forest_of_thought -- see tts/config/tts.yaml for the full list.
.
+-- e2c/
| +-- inference/ generation and evaluation scripts
| +-- util/ reward, dataset, and model utilities
| +-- config/ YAML configs for generation and evaluation
+-- scripts/
| +-- prepare_all_data.sh
| +-- download_datasets.sh
| +-- e2c_sft.sh
| +-- e2c_rl.sh
| +-- e2c_dapo.sh
| +-- ef-sft.sh
| +-- eval.sh
| +-- README.md per-script documentation
+-- tts/ test-time scaling experiments
| +-- run_tts.py
| +-- tts_methods.py
| +-- config/tts.yaml
| +-- README.md
+-- data/ data preparation scripts
+-- verl/ training framework (volcengine/verl, vendored)
+-- example_inference.py single-problem inference demo
+-- example_interactive.py interactive multi-problem demo
+-- example_problems.json built-in example problems
@misc{yang2025e2c,
title={Explore-Execute Chain: Towards an Efficient Structured Reasoning Paradigm},
author={Kaisen Yang and Tinghe Zhang and Rushi Shah and Kaicheng Yang and
Qinwei Ma and Dianbo Liu and Alex Lamb},
year={2025},
eprint={2509.23946},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2509.23946}
}The RL training pipeline is built on verl (vendored with minor additions).
MIT. See verl/LICENSE.