Han Wang1, Xiaodong Yu2, Jialian Wu2, Jiang Liu2, Ximeng Sun2, Mohit Bansal1, Zicheng Liu2
1UNC Chapel Hill, 2AMD
- Create a conda environment

```bash
conda create -n verl_sas python==3.10
conda activate verl_sas
```

- Clone the repository

```bash
git clone https://github.com/hannight/SAS.git
cd SAS
```

- Install the dependencies for verl

```bash
cd verl
# Make sure you have activated the verl_sas conda env
# If you need to run with Megatron
bash scripts/install_vllm_sglang_mcore.sh
# Or if you simply need to run with FSDP
USE_MEGATRON=0 bash scripts/install_vllm_sglang_mcore.sh
```
- Install verl

```bash
cd verl
pip install --no-deps -e .
```
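If the editable install succeeded, importing the package should work. A quick smoke test (nothing assumed beyond the package name `verl`):

```python
# Smoke test: a clean import confirms the editable install.
import verl
print(verl.__file__)  # should point into the cloned repository
```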
We provide the training and evaluation data in the `data` folder.

We use the DeepScaleR-Preview-Dataset as the training data, provided in `data/train.parquet`. You can also use your own training data; please follow the format of `data/train.parquet`.

We evaluate on five math reasoning datasets: AIME2024 (`data/aime.parquet`), AIME2025 (`data/aime2025.parquet`), MATH (`data/math.parquet`), AMC (`data/amc.parquet`), and Olympiad-Bench (`data/olympiad_bench.parquet`). We also include three general reasoning benchmarks to test generalization to out-of-domain data: GPQA-Diamond (`data/gpqa.parquet`), LSAT (`data/lsat.parquet`), and MMLU (a 500-instance subset, `data/mmlu_500.parquet`).

You can also use your own evaluation data; please follow the format of the provided evaluation data files.
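To match the expected format with your own files, it can help to inspect the schema of the provided ones first. A minimal sketch using pandas (which reads parquet via pyarrow); no column names are assumed beyond what the files themselves contain:

```python
import pandas as pd

# Inspect the provided training file so your own data can mirror
# its schema; the same approach works for the evaluation files.
df = pd.read_parquet("data/train.parquet")
print(df.columns.tolist())   # column names expected by the training script
print(df.iloc[0].to_dict())  # one full example row
```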
Train DeepScaleR-1.5B-Preview with SAS:

```bash
bash deepscaler_grpo_sas.sh
```

You can also train with other models by modifying `actor_rollout_ref.model.path` in the `deepscaler_grpo_sas.sh` script.

We explain the important arguments in `deepscaler_grpo_sas.sh` below (a sketch of the step-masking logic follows the list):

- `data.train_files`: The path to the training data.
- `data.val_files`: The path to the validation data.
- `trainer.sas`: Whether to use SAS.
- `trainer.sas_strategy`: The strategy for applying SAS. Available options: `correct_only` (apply SAS only to correct rollouts), `wrong_only` (apply SAS only to wrong rollouts), and `both` (apply SAS to both correct and wrong rollouts). Defaults to `both`.
- `trainer.mask_steps_ratio`: The fraction of steps whose advantages are set to 0 (range 0 to 1; defaults to 0.3).
- `trainer.random_mask`: Whether to select the masked steps at random (for the ablation study).
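To make `trainer.sas_strategy`, `trainer.mask_steps_ratio`, and `trainer.random_mask` concrete, here is a minimal, self-contained sketch of step-level advantage masking. It is illustrative only: the function name and the lowest-|advantage| criterion in the non-random branch are our assumptions, not the paper's selection rule; the actual logic lives in the verl trainer code.

```python
import numpy as np

def select_and_mask_advantages(step_advantages, is_correct, strategy="both",
                               mask_steps_ratio=0.3, random_mask=False, rng=None):
    """Zero out the advantages of a fraction of steps in one rollout (sketch)."""
    adv = np.asarray(step_advantages, dtype=float).copy()
    # Apply SAS only to the rollouts selected by the strategy.
    if strategy == "correct_only" and not is_correct:
        return adv
    if strategy == "wrong_only" and is_correct:
        return adv
    n_mask = int(mask_steps_ratio * len(adv))
    if n_mask == 0:
        return adv
    if random_mask:
        # Ablation: pick steps uniformly at random.
        rng = rng or np.random.default_rng()
        masked = rng.choice(len(adv), size=n_mask, replace=False)
    else:
        # Placeholder criterion for illustration: mask the steps with the
        # lowest-magnitude advantages. The real rule is in the verl trainer.
        masked = np.argsort(np.abs(adv))[:n_mask]
    adv[masked] = 0.0
    return adv

# Example: mask 30% of the steps of a correct rollout.
advantages = np.array([0.8, 0.1, -0.05, 0.6, 0.9, 0.02, 0.4, -0.3, 0.7, 0.5])
print(select_and_mask_advantages(advantages, is_correct=True))
```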
You can train DeepScaleR-1.5B-Preview with standard GRPO post-training under a 4K training context, without any additional RL techniques:

```bash
bash deepscaler_grpo_4k.sh
```

Note: before evaluation, you need to merge the checkpoints from the FSDP and Megatron backends. Please refer to the verl documentation for more details. Example command:
```bash
python scripts/model_merger.py merge \
    --backend fsdp \
    --local_dir /path/to/the/saved/model/checkpoints \
    --target_dir /path/to/the/merged/hf/model
```
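As a quick sanity check that the merge produced a loadable Hugging Face checkpoint, you can try loading it with transformers (the path below is the same placeholder as above):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Both calls should succeed on a correctly merged checkpoint.
merged_path = "/path/to/the/merged/hf/model"
tokenizer = AutoTokenizer.from_pretrained(merged_path)
model = AutoModelForCausalLM.from_pretrained(merged_path)
print(model.config.model_type)
```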
Evaluate the model on all the evaluation datasets:

```bash
MODEL_PATH=/path/to/the/model/checkpoint
OUTPUT_DIR=/path/to/the/output/directory
bash eval_model.sh --model ${MODEL_PATH} --num-tokens 8192 --datasets aime aime2025 amc olympiad_bench gpqa lsat mmlu_500 --output-dir ${OUTPUT_DIR}
```

We sincerely thank the authors of verl and DeepScaleR for their public code and data release.
If you find this work useful, please cite:

```bibtex
@article{wang2026stabilizing,
  title={Stabilizing Efficient Reasoning with Step-Level Advantage Selection},
  author={Han Wang and Xiaodong Yu and Jialian Wu and Jiang Liu and Ximeng Sun and Mohit Bansal and Zicheng Liu},
  year={2026},
  journal={arXiv preprint arXiv:2604.24003}
}
```