Caijun Xu, Changyi Xiao, Zhongyuan Peng, Yixin Cao
Fudan University · Shanghai Innovation Institute
Figure 1. DenoiseRL conditions the policy on a truncated incorrect prefix produced by a weak model and trains it, via verifiable-reward RL, to denoise the corrupted reasoning state and recover the correct solution path.
This repository contains the official implementation of DenoiseRL: Bootstrapping Reasoning Models to Recover from Noisy Prefixes. DenoiseRL is a recovery-oriented reinforcement learning framework that replaces stronger-teacher supervision with structured perturbations derived from weak-model failures. Rather than imitating a stronger model or curating harder data, the policy is conditioned on incorrect reasoning prefixes and explicitly optimized to revise mistakes and reach a verified answer.
State-of-the-art reasoning RL pipelines (e.g., GRPO and DAPO) are typically constrained along two axes:
- Supervisory ceiling. Performance gains often hinge on access to a stronger teacher model, capping further progress when such teachers are unavailable.
- Data engineering cost. Capability scaling commonly relies on heavy hard-data curation, adversarial synthesis, or trajectory filtering.
DenoiseRL departs from both directions. We invert the role of weak models: instead of treating them as imperfect supervisors, we exploit them as low-cost generators of structured corruptions. The policy is conditioned on truncated incorrect prefixes and trained — under standard verifiable rewards — to denoise the corrupted state and arrive at a verified solution. This casts reasoning RL as a denoising problem, drawing a conceptual parallel to denoising autoencoders and BART-style pretraining.
Each training step samples, per problem, a mixture of two rollout types and updates the policy with a single GRPO-style group baseline shared across the mixture:
- Main rollouts (N per problem): standard on-policy generations conditioned on the prompt.
- Denoise rollouts (K per problem): generations conditioned on a truncated weak-model wrong prefix. Given a wrong solution `w`, we retain its first `p = max(1, ⌊rho · |w|⌋)` tokens as an assistant-side prefix; the policy continues from this corrupted state.
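The prefix-length rule for denoise rollouts can be illustrated with a minimal sketch. The function name `truncate_wrong_prefix` and the token ids are hypothetical and are not taken from the repository; only the formula `p = max(1, ⌊rho · |w|⌋)` comes from the text.

```python
import math
from typing import List, Sequence


def truncate_wrong_prefix(wrong_tokens: Sequence[int], rho: float) -> List[int]:
    """Keep the first p = max(1, floor(rho * |w|)) tokens of a wrong solution.

    Hypothetical helper for illustration; not the repository's implementation.
    """
    if not wrong_tokens:
        raise ValueError("the wrong solution must contain at least one token")
    if not 0.0 <= rho <= 1.0:
        raise ValueError("rho must lie in [0, 1]")
    p = max(1, math.floor(rho * len(wrong_tokens)))
    return list(wrong_tokens[:p])


# A 10-token wrong solution with made-up token ids.
w = list(range(100, 110))
print(truncate_wrong_prefix(w, 0.2))   # p = floor(2.0) = 2 tokens
print(truncate_wrong_prefix(w, 0.05))  # floor(0.5) = 0, so the max(1, ...) floor applies
```

The `max(1, ...)` term guarantees that even very short wrong solutions contribute at least one corrupted token.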
Three design choices stabilize and amplify the recovery signal:
- Length-fair folding. The visible response is `ỹ = [w_{1:p}, y_{p+1:p+L}]` with `p + L ≤ R`, preserving a comparable response budget against main rollouts.
- Continuation-only optimization. PPO/GRPO gradients flow primarily through the model-generated continuation; the heavily off-policy prefix is verifier-visible but excluded from the loss, avoiding the high-variance importance ratios documented in prior PPO-style off-policy literature.
- Shared group baseline. Main and denoise trajectories of the same problem share a single advantage baseline, so denoise rollouts naturally provide negative or contrastive signal for problems that are otherwise saturated.
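Length-fair folding and continuation-only optimization can be sketched together as follows. This is an illustrative sketch under stated assumptions, not the repository's trainer code: `fold_response` and `masked_mean` are hypothetical names, and the continuation is truncated after the fact here, whereas a real rollout engine would more likely cap generation at `R - p` new tokens.

```python
from typing import List, Sequence, Tuple


def fold_response(
    prefix: List[int], continuation: List[int], max_response_len: int
) -> Tuple[List[int], List[int]]:
    """Build the visible response [prefix, continuation] with p + L <= R.

    Returns (visible_tokens, loss_mask); the mask is 0 on the off-policy
    prefix and 1 on the model-generated continuation.
    """
    p = len(prefix)
    if p > max_response_len:
        raise ValueError("prefix alone exceeds the response budget R")
    kept = continuation[: max_response_len - p]  # enforce p + L <= R
    visible = prefix + kept
    loss_mask = [0] * p + [1] * len(kept)
    return visible, loss_mask


def masked_mean(per_token_loss: Sequence[float], mask: Sequence[int]) -> float:
    """Average the loss over continuation tokens only."""
    denom = sum(mask)
    if denom == 0:
        return 0.0
    return sum(l * m for l, m in zip(per_token_loss, mask)) / denom


visible, mask = fold_response([1, 2, 3], [4, 5, 6, 7, 8], max_response_len=6)
print(visible, mask)
print(masked_mean([9.0, 9.0, 9.0, 1.0, 2.0, 3.0], mask))
```

The verifier still scores the full visible response, but the prefix tokens contribute nothing to the loss, which is what keeps the off-policy importance ratios out of the update.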
The joint objective can be written as a mixture
J(θ) = N/(N+K) · J_main(θ) + K/(N+K) · J_denoise(θ),
which is interpretable as optimizing the policy under a mixture of solving-from-scratch and recovering-from-corruption distributions.
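A small numerical check makes the mixture weights concrete. The sketch below uses per-rollout rewards as a stand-in for the per-rollout objective terms (an assumption made for illustration only) and shows that a uniform average over all `N + K` rollouts equals the weighted mixture of the two per-type averages.

```python
import math
from typing import Sequence


def mixture_objective(
    main_values: Sequence[float], denoise_values: Sequence[float]
) -> float:
    """Compute N/(N+K) * J_main + K/(N+K) * J_denoise from per-rollout values."""
    n, k = len(main_values), len(denoise_values)
    if n == 0 or k == 0:
        raise ValueError("both rollout types must be present")
    j_main = sum(main_values) / n
    j_denoise = sum(denoise_values) / k
    return n / (n + k) * j_main + k / (n + k) * j_denoise


# Hypothetical rewards: N = 12 main rollouts (9 correct), K = 4 denoise rollouts (1 correct).
main = [1.0] * 9 + [0.0] * 3
denoise = [1.0, 0.0, 0.0, 0.0]

pooled = sum(main + denoise) / len(main + denoise)
assert math.isclose(mixture_objective(main, denoise), pooled)
print(mixture_objective(main, denoise))
```

With `N = 12` and `K = 4`, the weights are 0.75 for solving from scratch and 0.25 for recovering from corruption.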
Results are reported in the paper for the Qwen3-4B and Qwen3-8B policy backbones (training corpus: MATH-7.5K; weak model: Qwen2.5-1.5B-Instruct). For AMC23, AIME24, and AIME25 we report AVG@16; for MATH500 and BBEH we report AVG@1.
Qwen3-4B-Base
| Method | MATH500 | AMC23 | AIME24 | AIME25 | BBEH | Avg. |
|---|---|---|---|---|---|---|
| Base | 70.0 | 43.1 | 8.3 | 7.7 | 4.1 | 26.6 |
| GRPO | 83.6 | 63.1 | 22.1 | 18.1 | 11.1 | 39.6 |
| DAPO | 83.8 | 62.5 | 20.6 | 21.5 | 10.4 | 39.8 |
| DenoiseRL-GRPO | 85.8 | 61.4 | 24.8 | 23.3 | 14.8 | 42.0 |
| DenoiseRL-DAPO | 84.6 | 63.6 | 21.9 | 21.7 | 15.7 | 41.5 |
Qwen3-8B-Base
| Method | MATH500 | AMC23 | AIME24 | AIME25 | BBEH | Avg. |
|---|---|---|---|---|---|---|
| Base | 70.4 | 49.2 | 11.9 | 10.8 | 4.1 | 29.3 |
| GRPO | 87.8 | 69.7 | 24.0 | 22.9 | 10.6 | 43.0 |
| DAPO | 87.0 | 69.7 | 23.8 | 21.7 | 11.7 | 42.8 |
| DenoiseRL-GRPO | 87.2 | 70.3 | 24.6 | 23.1 | 11.5 | 43.3 |
| DenoiseRL-DAPO | 88.2 | 71.4 | 27.0 | 24.8 | 12.6 | 44.8 |
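As a sanity check, the Avg. column can be recomputed as the unweighted mean of the five benchmark scores. The snippet below copies the rows from the two tables and assumes (this is not stated explicitly in the text) that Avg. is a plain arithmetic mean rounded to one decimal.

```python
from statistics import mean

# (backbone, method) -> ([MATH500, AMC23, AIME24, AIME25, BBEH], reported Avg.)
REPORTED = {
    ("Qwen3-4B-Base", "Base"): ([70.0, 43.1, 8.3, 7.7, 4.1], 26.6),
    ("Qwen3-4B-Base", "GRPO"): ([83.6, 63.1, 22.1, 18.1, 11.1], 39.6),
    ("Qwen3-4B-Base", "DAPO"): ([83.8, 62.5, 20.6, 21.5, 10.4], 39.8),
    ("Qwen3-4B-Base", "DenoiseRL-GRPO"): ([85.8, 61.4, 24.8, 23.3, 14.8], 42.0),
    ("Qwen3-4B-Base", "DenoiseRL-DAPO"): ([84.6, 63.6, 21.9, 21.7, 15.7], 41.5),
    ("Qwen3-8B-Base", "Base"): ([70.4, 49.2, 11.9, 10.8, 4.1], 29.3),
    ("Qwen3-8B-Base", "GRPO"): ([87.8, 69.7, 24.0, 22.9, 10.6], 43.0),
    ("Qwen3-8B-Base", "DAPO"): ([87.0, 69.7, 23.8, 21.7, 11.7], 42.8),
    ("Qwen3-8B-Base", "DenoiseRL-GRPO"): ([87.2, 70.3, 24.6, 23.1, 11.5], 43.3),
    ("Qwen3-8B-Base", "DenoiseRL-DAPO"): ([88.2, 71.4, 27.0, 24.8, 12.6], 44.8),
}


def avg_matches(scores, reported, tol=0.05 + 1e-9):
    """True if the reported value agrees with the mean up to one-decimal rounding."""
    return abs(mean(scores) - reported) <= tol


for (backbone, method), (scores, reported) in REPORTED.items():
    status = "ok" if avg_matches(scores, reported) else "MISMATCH"
    print(f"{backbone:14s} {method:15s} mean={mean(scores):.2f} reported={reported} {status}")
```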
Additional takeaways from the ablation studies:
- Recovery intensity matters. Sweeping `K ∈ {1, 4, 8}` at `rho = 0.2` shows that `K = 4` provides the best trade-off; over-emphasized recovery (`K = 8`) hurts the primary solving objective.
- Off-policy prefix updates are unstable. Directly backpropagating through prefix tokens leads to validation collapse and runaway response length, consistent with prior observations on PPO sensitivity to heavily off-policy tokens.
- Length-fair folding helps. Removing the `p + L ≤ R` cap weakens the 4B average by ~1.8 points (42.0 → 40.2).
- Throughput overhead is modest. Per-step training time on Qwen3-4B-Base is 49.7s for DenoiseRL vs. 43.8s for GRPO under matched rollout budgets.
We refer readers to the paper for full ablations, the case-study analyses of recovery behavior, and the length / overthinking dynamics under varying rho.
DenoiseRL/
├── recipe/denoise/ # DenoiseRL recipe (entrypoints, trainer, configs)
│ ├── main_dapo.py # training entrypoint
│ ├── dapo_ray_trainer.py # Ray-based DAPO/GRPO trainer with denoise rollouts
│ ├── data_prepare.py # weak-model wrong-prefix construction
│ ├── config/ # Hydra training configs
│ ├── denoise_qwen3-{1.7b,4b,8b}_v1.0.sh
│ └── dapo_denoise_qwen3-{1.7b,4b,8b}_v1.0.sh
├── verl/ # local fork of the verl RL framework (editable install)
├── img/DenoiseRL.png # overview figure
├── paper/ # paper PDF
└── requirements.txt
DenoiseRL builds on a customized fork of verl. Two steps in particular are mandatory: installing the pinned dependencies and registering the local verl package in editable mode.
# (1) Create an isolated environment and install pinned dependencies.
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
# (2) Register the local verl fork in editable mode, without re-resolving its dependencies.
pip install --no-deps -e ./verl

Note. The `--no-deps` flag is intentional: dependency resolution is already pinned via `requirements.txt`, and re-resolving from `verl/setup.py` can silently override critical versions (e.g., `vllm`, `transformers`, `flash-attn`). The editable install ensures that any local modification to the framework propagates at runtime without re-installation.
Hardware-sensitive components (flash-attn, vllm, cupy-cuda12x, torch_npu, etc.) should be installed against the CUDA/driver stack of the target cluster.
recipe/denoise/data_prepare.py constructs the per-problem pool W(q) of incorrect-but-well-formed weak-model rollouts. It runs the weak model with vLLM, scores each rollout against the ground truth, and augments the source parquet with a wrong_answer_with_boxed column storing the wrong rollouts that nevertheless emit a parseable \boxed{...}.
python recipe/denoise/data_prepare.py \
--model /path/to/weak-model \
--dataset /path/to/train.parquet \
--rollout-n 8 \
--output-dir ./data

The resulting `*.with_wrong_boxed.parquet` is consumed directly by `TRAIN_FILE` in the training scripts. Problems with an empty wrong-rollout pool fall back to standard main rollouts, as described in the paper.
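The "incorrect but well-formed" filter can be sketched as below. This is a simplified illustration, not the logic of `data_prepare.py`: the helper names are hypothetical, and exact string comparison stands in for the repository's answer verifier, which is likely more lenient about equivalent forms.

```python
from typing import List, Optional


def last_boxed(text: str) -> Optional[str]:
    """Return the content of the last \\boxed{...}, or None if it cannot be parsed."""
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    depth = 1
    chars = []
    for ch in text[start + len(marker):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(chars).strip()
        chars.append(ch)
    return None  # unbalanced braces: not a parseable answer


def wrong_but_boxed(rollouts: List[str], ground_truth: str) -> List[str]:
    """Keep rollouts that emit a parseable boxed answer that is not the ground truth."""
    target = ground_truth.strip()
    kept = []
    for rollout in rollouts:
        answer = last_boxed(rollout)
        if answer is not None and answer != target:
            kept.append(rollout)
    return kept


rollouts = [
    "2 + 2 = 4, so the answer is \\boxed{4}.",   # correct: dropped
    "2 + 2 = 5, so the answer is \\boxed{5}.",   # wrong and parseable: kept
    "I am not sure what the answer is.",          # no boxed answer: dropped
    "The answer is \\boxed{\\frac{1}{2}",         # unbalanced braces: dropped
]
print(wrong_but_boxed(rollouts, "4"))
```

Problems for which this pool comes back empty are the ones that fall back to standard main rollouts.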
Each model scale ships with two recipes: a GRPO-style backbone and a DAPO variant.
# GRPO backbone
bash recipe/denoise/denoise_qwen3-1.7b_v1.0.sh
bash recipe/denoise/denoise_qwen3-4b_v1.0.sh
bash recipe/denoise/denoise_qwen3-8b_v1.0.sh
# DAPO variant
bash recipe/denoise/dapo_denoise_qwen3-1.7b_v1.0.sh
bash recipe/denoise/dapo_denoise_qwen3-4b_v1.0.sh
bash recipe/denoise/dapo_denoise_qwen3-8b_v1.0.sh

The DenoiseRL-specific knobs are exposed at the top of each script:
| Knob | Symbol | Description |
|---|---|---|
| `n_resp_per_prompt` | N | number of main on-policy rollouts per problem |
| `sub_rollout_k` | K | number of denoise rollouts per problem |
| `part_response_ratio_strategy` | - | fixed / normal / uniform sampler for rho |
| `part_response_ratio_fixed` | rho | prefix ratio under the fixed strategy |
| `part_response_ratio_{mean,std,low,high}` | - | parameters for normal / uniform strategies |
| `partial_mode` | - | cutdown (mask prefix, length-fair), shift (gradient on prefix), none (no length cap) |
| `use_problem_id_as_uid` | - | share a single GRPO baseline across all N + K rollouts of one problem |
Cluster / path settings — MODEL_PATH, TRAIN_FILE, TEST_FILE, num_gpus, tensor_model_parallel_size — are likewise configured at the top of each script.
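One plausible reading of the three `part_response_ratio_strategy` options is sketched below. The function `sample_rho`, its default values, and the choice to clip normal draws to `[low, high]` are assumptions made for illustration; the scripts and trainer define the actual behavior.

```python
import random
from typing import Optional


def sample_rho(
    strategy: str,
    *,
    fixed: float = 0.2,
    mean: float = 0.2,
    std: float = 0.05,
    low: float = 0.1,
    high: float = 0.3,
    rng: Optional[random.Random] = None,
) -> float:
    """Draw a prefix ratio rho under the fixed / normal / uniform strategies."""
    rng = rng or random.Random()
    if strategy == "fixed":
        return fixed
    if strategy == "uniform":
        return rng.uniform(low, high)
    if strategy == "normal":
        # Assumption: clip Gaussian draws so rho stays inside [low, high].
        return min(high, max(low, rng.gauss(mean, std)))
    raise ValueError(f"unknown strategy: {strategy!r}")


rng = random.Random(0)
for name in ("fixed", "uniform", "normal"):
    print(name, round(sample_rho(name, rng=rng), 3))
```

Under the fixed strategy every denoise rollout sees the same fraction of the wrong solution, while the other two strategies vary the corruption depth from rollout to rollout.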
To reproduce the headline numbers reported in the paper, we recommend:
- Rollout composition: `N = 12, K = 4` per problem.
- Prefix intensity: `part_response_ratio_strategy=fixed` with `part_response_ratio_fixed=0.2`.
- Folding policy: `partial_mode=cutdown` (length-fair; prefix masked from PPO loss).
- Response budget: `max_response_length` consistent across main and denoise rollouts.
- Optimization: continuation-only gradient flow; do not enable gradients on the off-policy prefix.
- Group baseline: `use_problem_id_as_uid=True` to share advantages across the full `N + K` group.
Deviating from any of the above (in particular enabling gradient on the prefix or removing the length-fair cap) is documented in the paper as a source of instability.
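The effect of sharing one baseline per problem can be seen in a small GRPO-style advantage computation. This is a minimal sketch, assuming mean/standard-deviation normalization within each group and a population standard deviation; `group_advantages` is a hypothetical name, and the trainer's exact normalization may differ.

```python
from collections import defaultdict
from statistics import mean, pstdev
from typing import Hashable, List, Sequence


def group_advantages(
    uids: Sequence[Hashable], rewards: Sequence[float], eps: float = 1e-6
) -> List[float]:
    """Normalize each reward by the mean and std of all rollouts sharing its uid."""
    groups = defaultdict(list)
    for uid, reward in zip(uids, rewards):
        groups[uid].append(reward)
    stats = {uid: (mean(rs), pstdev(rs)) for uid, rs in groups.items()}
    return [(r - stats[uid][0]) / (stats[uid][1] + eps) for uid, r in zip(uids, rewards)]


# Saturated problem: every main rollout is correct, so main-only advantages vanish.
main_only = group_advantages(["q1"] * 4, [1.0, 1.0, 1.0, 1.0])
print(main_only)

# Same problem with two denoise rollouts under the same uid, one of which fails.
shared = group_advantages(["q1"] * 6, [1.0, 1.0, 1.0, 1.0, 1.0, 0.0])
print([round(a, 3) for a in shared])
```

With the problem id as the shared uid, the failed denoise rollout receives a negative advantage and the successful rollouts a positive one, so a problem that is saturated under main rollouts alone still yields a learning signal.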