Han Wang1, Xiaodong Yu2, Jialian Wu2, Jiang Liu2, Ximeng Sun2, Mohit Bansal1, Zicheng Liu2
1UNC Chapel Hill, 2AMD
- Create a conda environment

```bash
conda create -n verl_sas python==3.10
conda activate verl_sas
```

- Clone the repository

```bash
git clone https://github.com/hannight/SAS.git
cd SAS
```

- Install the dependencies for verl

```bash
cd verl
# Make sure you have activated the verl_sas conda env
# If you need to run with Megatron
bash scripts/install_vllm_sglang_mcore.sh
# Or if you simply need to run with FSDP
USE_MEGATRON=0 bash scripts/install_vllm_sglang_mcore.sh
```
- Install verl

```bash
cd verl
pip install --no-deps -e .
```
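If the editable install succeeded, importing the package should work. A quick smoke test (nothing assumed beyond the package name `verl`):

```python
# Smoke test: a clean import confirms the editable install.
import verl
print(verl.__file__)  # should point into the cloned repository
```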
We provide the training and evaluation data in the `data` folder.

We use the DeepScaleR-Preview-Dataset as the training data, provided in `data/train.parquet`. You can also use your own training data; please follow the format of `data/train.parquet`.

We evaluate on five math reasoning datasets: AIME2024 (`data/aime.parquet`), AIME2025 (`data/aime2025.parquet`), MATH (`data/math.parquet`), AMC (`data/amc.parquet`), and Olympiad-Bench (`data/olympiad_bench.parquet`). We also include three general reasoning benchmarks to test generalization to out-of-domain data: GPQA-Diamond (`data/gpqa.parquet`), LSAT (`data/lsat.parquet`), and MMLU (a 500-instance subset, `data/mmlu_500.parquet`).

You can also use your own evaluation data; please follow the format of the provided evaluation data files.
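To match the expected format with your own files, it can help to inspect the schema of the provided ones first. A minimal sketch using pandas (which reads parquet via pyarrow); no column names are assumed beyond what the files themselves contain:

```python
import pandas as pd

# Inspect the provided training file so your own data can mirror
# its schema; the same approach works for the evaluation files.
df = pd.read_parquet("data/train.parquet")
print(df.columns.tolist())   # column names expected by the training script
print(df.iloc[0].to_dict())  # one full example row
```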
Train DeepScaleR-1.5B-Preview with SAS:

```bash
bash deepscaler_grpo_sas.sh
```

You can also train with other models by modifying `actor_rollout_ref.model.path` in the `deepscaler_grpo_sas.sh` script.

We explain the important arguments in `deepscaler_grpo_sas.sh` below (a sketch of the step-masking logic follows the list):

- `data.train_files`: The path to the training data.
- `data.val_files`: The path to the validation data.
- `trainer.sas`: Whether to use SAS.
- `trainer.sas_strategy`: The strategy for applying SAS. Available options: `correct_only` (apply SAS only to correct rollouts), `wrong_only` (apply SAS only to wrong rollouts), and `both` (apply SAS to both correct and wrong rollouts). Defaults to `both`.
- `trainer.mask_steps_ratio`: The fraction of steps whose advantages are set to 0 (range 0 to 1; defaults to 0.3).
- `trainer.random_mask`: Whether to select the masked steps at random (for the ablation study).
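To make `trainer.sas_strategy`, `trainer.mask_steps_ratio`, and `trainer.random_mask` concrete, here is a minimal, self-contained sketch of step-level advantage masking. It is illustrative only: the function name and the lowest-|advantage| criterion in the non-random branch are our assumptions, not the paper's selection rule; the actual logic lives in the verl trainer code.

```python
import numpy as np

def select_and_mask_advantages(step_advantages, is_correct, strategy="both",
                               mask_steps_ratio=0.3, random_mask=False, rng=None):
    """Zero out the advantages of a fraction of steps in one rollout (sketch)."""
    adv = np.asarray(step_advantages, dtype=float).copy()
    # Apply SAS only to the rollouts selected by the strategy.
    if strategy == "correct_only" and not is_correct:
        return adv
    if strategy == "wrong_only" and is_correct:
        return adv
    n_mask = int(mask_steps_ratio * len(adv))
    if n_mask == 0:
        return adv
    if random_mask:
        # Ablation: pick steps uniformly at random.
        rng = rng or np.random.default_rng()
        masked = rng.choice(len(adv), size=n_mask, replace=False)
    else:
        # Placeholder criterion for illustration: mask the steps with the
        # lowest-magnitude advantages. The real rule is in the verl trainer.
        masked = np.argsort(np.abs(adv))[:n_mask]
    adv[masked] = 0.0
    return adv

# Example: mask 30% of the steps of a correct rollout.
advantages = np.array([0.8, 0.1, -0.05, 0.6, 0.9, 0.02, 0.4, -0.3, 0.7, 0.5])
print(select_and_mask_advantages(advantages, is_correct=True))
```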
You can train DeepScaleR-1.5B-Preview with standard GRPO post-training under a 4K training context, without any additional RL techniques:

```bash
bash deepscaler_grpo_4k.sh
```

Note: before evaluation, you need to merge the checkpoints from the FSDP and Megatron backends. Please refer to the verl documentation for more details. Example command:
```bash
python scripts/model_merger.py merge \
    --backend fsdp \
    --local_dir /path/to/the/saved/model/checkpoints \
    --target_dir /path/to/the/merged/hf/model
```
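As a quick sanity check that the merge produced a loadable Hugging Face checkpoint, you can try loading it with transformers (the path below is the same placeholder as above):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Both calls should succeed on a correctly merged checkpoint.
merged_path = "/path/to/the/merged/hf/model"
tokenizer = AutoTokenizer.from_pretrained(merged_path)
model = AutoModelForCausalLM.from_pretrained(merged_path)
print(model.config.model_type)
```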
Evaluate the model on all the evaluation datasets:

```bash
MODEL_PATH=/path/to/the/model/checkpoint
OUTPUT_DIR=/path/to/the/output/directory
bash eval_model.sh --model ${MODEL_PATH} --num-tokens 8192 --datasets aime aime2025 amc olympiad_bench gpqa lsat mmlu_500 --output-dir ${OUTPUT_DIR}
```

We sincerely thank the authors of verl and DeepScaleR for their public code and data release.
If you find this work useful, please cite:

```bibtex
@article{wang2026stabilizing,
  title={Stabilizing Efficient Reasoning with Step-Level Advantage Selection},
  author={Han Wang and Xiaodong Yu and Jialian Wu and Jiang Liu and Ximeng Sun and Mohit Bansal and Zicheng Liu},
  year={2026},
  journal={arXiv preprint arXiv:2604.24003}
}
```