# LEAD: Length-Efficient Adaptive and Dynamic Reasoning

Reference implementation for the paper "LEAD: Length-Efficient Adaptive and Dynamic Reasoning for Large Language Models" (arXiv:2605.09806).
LEAD is a self-calibrating multi-reward RL framework for efficient reasoning, built on top of verl. It replaces the two static knobs of GRPO-style efficient-reasoning recipes (fixed reward weights and a global length budget) with two online, self-calibrating mechanisms:
- **Dynamic reward weighting with decoupled group normalization:** each reward is normalized within its own rollout group before combination, and the combination weights are updated online via a Potential-Scaled Instability (PSI) controller. No hand-tuned schedule.
- **Per-problem online target-length calibration:** the global length budget is replaced by a per-prompt target $L^*_q$ estimated from the model's own correct rollouts, with a symmetric efficiency reward around $L^*_q$ that penalizes both overthinking and over-compression.
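To make these two mechanisms concrete, here is a minimal Python sketch. The z-score normalization, the linear reward shape, and all helper names are illustrative assumptions for exposition, not the paper's exact formulas:

```python
import statistics

def group_normalize(rewards):
    """Normalize one reward signal within its own rollout group
    (zero mean, unit std) before combining it with other rewards,
    in the spirit of decoupled group normalization."""
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards) or 1.0  # guard against zero spread
    return [(r - mu) / sigma for r in rewards]

def target_length(lengths, correct_mask):
    """Estimate a per-prompt target L*_q from the model's own correct
    rollouts (here: their mean). Returns None if nothing was correct."""
    correct = [l for l, c in zip(lengths, correct_mask) if c]
    return statistics.mean(correct) if correct else None

def efficiency_reward(length, l_star):
    """Symmetric efficiency reward around L*_q: peaks at L*_q and decays
    on both sides, so both overthinking (too long) and over-compression
    (too short) are penalized. The linear shape is an assumption."""
    return max(0.0, 1.0 - abs(length - l_star) / l_star)
```

With a target of 4000 tokens, a 4000-token rollout scores 1.0, a 2000-token one 0.5, and an 8000-token one 0.0, illustrating the two-sided penalty.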
LEAD is a drop-in replacement for the GRPO advantage estimator; just set `algorithm.adv_estimator=lead` in the verl config.
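For example, assuming verl's usual Hydra-style CLI entrypoint (the exact entrypoint and the remaining flags used by this repo's training scripts may differ):

```bash
# Illustrative override only; the shipped training scripts set the full
# flag set. The entrypoint and data path here are assumptions.
python -m verl.trainer.main_ppo \
    algorithm.adv_estimator=lead \
    data.train_files=data/math/train.parquet
```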
- Quick Start
- Requirements
- Installation
- Configure Environment
- Data Preparation
- Training
- Evaluation
- Released Checkpoints
- Repository Layout
- Key Configuration Knobs
- Reproducing Paper Tables
- Troubleshooting
- Citation
- License
## Quick Start

```bash
# 1. Install
conda create -n lead python=3.10 -y && conda activate lead
pip install torch==2.4.0 --index-url https://download.pytorch.org/whl/cu121
pip install -e .
pip install flash-attn --no-build-isolation
pip install "vllm>=0.6.0"

# 2. Configure (HF token for DeepSeek-R1-Distill download)
cp .env.example .env
$EDITOR .env

# 3. Train LEAD on DeepSeek-R1-Distill-Qwen-1.5B (4K budget)
bash train_math_lead_deepseek-r1-1.5b.sh
```

## Requirements

| Component | Tested version |
|---|---|
| OS | Ubuntu 22.04 |
| Python | 3.10 |
| CUDA | 12.1 |
| PyTorch | 2.4+ |
| GPU | 4× NVIDIA L40S (44 GB) or A6000 (48 GB) for 1.5B; 8× A6000 for 7B |
| RAM | 256 GB recommended |
| Disk | NVMe SSD; ~150 GB for model weights + 50 GB per checkpoint |
## Installation

```bash
# Clone
git clone https://github.com/CrazyMint/LEAD.git LEAD-release
cd LEAD-release

# Create env
conda create -n lead python=3.10 -y
conda activate lead

# PyTorch with CUDA 12.1
pip install torch==2.4.0 torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

# verl + LEAD (this repo)
pip install -e .

# Required runtime extras (not pulled in by setup.py)
pip install flash-attn --no-build-isolation
pip install "vllm>=0.6.0"
```

If `pip install -e .` fails on `requirements.txt`, install the dependencies directly: `pip install -r requirements.txt`.
## Configure Environment

The training scripts run `source "$(dirname $0)/.env"` on startup, so this file is required. Create it from the template:

```bash
cp .env.example .env
```

Edit `.env` and set at minimum:

```bash
export WANDB_API_KEY="..."          # optional; logs go to wandb
export HF_TOKEN="..."               # required to download DeepSeek-R1-Distill weights
export HF_HOME="/path/to/hf-cache"  # optional (defaults to ~/.cache/huggingface)
```

Sanity-check that the HF token works:

```bash
huggingface-cli whoami  # should print your username
```

## Data Preparation

The training scripts read `data/math/{train,test}.parquet`. These files are checked into the repo, so you can skip this step on first run.
To regenerate them from scratch:

```bash
bash scripts/data/download_math.sh
# or directly:
python scripts/data/prepare_math.py --local_dir data/math
```

The shipped files are the MATH level 3–5 split (8,521 problems) used in the paper.
## Training

```bash
bash train_math_lead_deepseek-r1-1.5b.sh
```

On first run this downloads the base model (~7 GB) into `$HF_HOME`, launches Ray + vLLM, and trains for 7 epochs (462 steps). On 4× L40S 44 GB, the run takes ~30 wall-clock hours. Checkpoints are written to `${OUTPUT_ROOT:-./results}/math_lead_4k_deepseek-r1-1.5b/`.

To run on a different GPU count, override before invocation:

```bash
N_GPUS=2 CUDA_VISIBLE_DEVICES=0,1 bash train_math_lead_deepseek-r1-1.5b.sh
```

⚠️ Note: the paper used 4 GPUs; results may shift slightly with a smaller world size due to per-batch group statistics.

To train the GRPO baseline instead:

```bash
bash train_math_grpo_deepseek-r1-1.5b.sh
```

## Evaluation

The paper uses Sober Reasoning settings (temperature 0.8, top-p 0.9, pass@n with n = 3 for MATH-500/Olympiad and n = 10 for AIME 24/25 and AMC 23). Any verl- or lighteval-compatible eval harness will work. This release does not yet ship a turn-key evaluation script.
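For reference, pass@n in such settings is typically the unbiased estimator of Chen et al. (2021). A minimal sketch (not code shipped in this repo):

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k: probability that at least one of k samples,
    drawn without replacement from n rollouts of which c are correct,
    is correct. Standard estimator: 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # fewer than k incorrect rollouts: every draw of k hits a correct one
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For instance, with 2 rollouts of which 1 is correct, pass@1 is 0.5; with any correct rollout and k = n, pass@n is 1.0.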
## Released Checkpoints

| Paper row | HuggingFace repo |
|---|---|
| Table 1, LEAD 1.5B-4K (Acc 53.36 / Len 3714 / AES 0.68) | `Kotom1/math_lead_4k_deepseek-r1-1.5b` |
Quick load:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

m = AutoModelForCausalLM.from_pretrained("Kotom1/math_lead_4k_deepseek-r1-1.5b")
t = AutoTokenizer.from_pretrained("Kotom1/math_lead_4k_deepseek-r1-1.5b")
```

## Repository Layout

```
LEAD-release/
├── verl/                                 # patched verl (adv_estimator='lead')
│   ├── trainer/ppo/ray_trainer.py        # LEAD branch (Algorithm 1 in the paper)
│   ├── trainer/config/ppo_trainer.yaml
│   └── utils/reward_score/deepscale.py
├── train_math_lead_deepseek-r1-1.5b.sh   # paper Table 1 (LEAD)
├── train_math_grpo_deepseek-r1-1.5b.sh   # paper Table 1 (GRPO baseline)
├── ablations/
│   ├── math_grpo_lambda_sweep/           # paper Table 2 (GRPO column)
│   ├── math_lead_lambda_sweep/           # paper Table 2 (LEAD-static column)
│   └── math_lead_aggregator_ablation/    # paper Table 3
├── data/
│   ├── math/                             # MATH (shipped)
│   ├── deepscaler/                       # alt training set (optional)
│   └── ...
└── scripts/
    └── data/                             # download + prepare scripts
```
## Key Configuration Knobs

In `verl/trainer/config/ppo_trainer.yaml` (override on the command line via `algorithm.<key>=<value>`):

| Flag | Default | Description |
|---|---|---|
| `algorithm.adv_estimator` | `gae` | Set to `lead` to enable LEAD |
| `algorithm.lead_alpha` | `1.0` | Potential-decay exponent of the PSI controller |
| `algorithm.lead_beta` | `0.95` | EMA momentum for weight smoothing |
| `algorithm.lead_lambda_min` | `0.3` | Floor on the dynamic reward weight |
| `algorithm.lead_bmax` | `8000` | Sentinel for unsolved prompts; matches training-time max length |
| `algorithm.lead_lstar_mode` | `max_asym` | `max_sym` (paper), `max_asym`, or `upper_only` |
| `algorithm.lead_aggregator` | `mean_correct` | `mean_correct` (paper), `min_correct`, `median_correct`, `mean_all` |
| `algorithm.lead_static_lambda_corr` | `null` | If set, bypass dynamic weights and use a fixed weight instead |
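As a rough illustration of how `lead_beta` and `lead_lambda_min` interact (the actual PSI update lives in `verl/trainer/ppo/ray_trainer.py`; the raw weight signal below is a placeholder):

```python
def update_weight(lambda_prev, lambda_raw, beta=0.95, lambda_min=0.3):
    """Sketch of the smoothing-and-floor plumbing: the raw weight
    proposed by the controller is EMA-smoothed with momentum beta
    (lead_beta), then clamped to the floor lambda_min (lead_lambda_min).
    This is not the PSI update rule itself."""
    smoothed = beta * lambda_prev + (1.0 - beta) * lambda_raw
    return max(lambda_min, smoothed)
```

With the defaults, even a raw signal of 0 cannot pull the weight below 0.3, and large swings in the raw signal move the effective weight by at most 5% per step.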
## Reproducing Paper Tables

| Table | Command |
|---|---|
| Table 1, LEAD 1.5B-4K | `bash train_math_lead_deepseek-r1-1.5b.sh` |
| Table 2, GRPO ratio sweep | `bash ablations/math_grpo_lambda_sweep/run_sweep.sh` |
| Table 2, LEAD-static ratio sweep | `METHODS=lead bash ablations/math_grpo_lambda_sweep/run_sweep.sh` |
| Table 3, aggregator ablation | `bash ablations/math_lead_aggregator_ablation/run_sweep.sh` |
## Troubleshooting

| Symptom | Fix |
|---|---|
| `source .env: No such file or directory` | Run the configure step (`cp .env.example .env`) and fill it in. |
| HF gated repo error when downloading DeepSeek-R1-Distill | Accept the model's terms on its HuggingFace page and ensure `HF_TOKEN` in `.env` is valid: `huggingface-cli whoami` should print your username. |
| `flash-attn` build fails | Ensure the CUDA toolkit and `nvcc` are on `PATH` and you have at least 16 GB of build RAM. |
| Only one GPU available | Set `N_GPUS=1` and `CUDA_VISIBLE_DEVICES=0` before running. Results may shift slightly versus the 4-GPU paper setting. |
| vLLM OOM during rollout | Lower `actor_rollout_ref.rollout.gpu_memory_utilization` from 0.65 to 0.5 in the training script. |
| Ray complains about a port already in use | Kill any stray Ray cluster (`ray stop --force`) and re-run. |
## Citation

If LEAD is useful in your research, please cite our paper:

```bibtex
@misc{wei2026lead,
      title={LEAD: Length-Efficient Adaptive and Dynamic Reasoning for Large Language Models},
      author={Songtao Wei and Yi Li and Zhikai Li and Xu Hu and Yuede Ji and Guanpeng Li and Feng Chen and Carl Yang and Zhichun Guo and Bingzhe Li},
      year={2026},
      eprint={2605.09806},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2605.09806}
}
```

## License

Apache 2.0 (inherited from upstream verl). See `LICENSE`.