SAS: Stabilizing Efficient Reasoning with Step-Level Advantage Selection
Han Wang1, Xiaodong Yu2, Jialian Wu2, Jiang Liu2, Ximeng Sun2, Mohit Bansal1, Zicheng Liu2

1UNC Chapel Hill, 2AMD


Installation

  1. Create a conda environment
conda create -n verl_sas python==3.10
conda activate verl_sas
  2. Clone the repository
git clone https://github.com/hannight/SAS.git
cd SAS
  3. Install the dependencies for verl
cd verl
# Make sure you have activated the verl_sas conda env
# If you need to run with Megatron
bash scripts/install_vllm_sglang_mcore.sh
# Or, if you only need to run with FSDP
USE_MEGATRON=0 bash scripts/install_vllm_sglang_mcore.sh
  4. Install verl (from the verl directory)
pip install --no-deps -e .

Dataset

We provide the training and evaluation data in the data folder.

Training Data

We use DeepScaleR-Preview-Dataset as the training data, which is in the data/train.parquet file.

You can also use your own training data; please follow the format of data/train.parquet.

Evaluation Data

We evaluate on five math reasoning datasets: AIME2024 (data/aime.parquet), AIME2025 (data/aime2025.parquet), MATH (data/math.parquet), AMC (data/amc.parquet), and Olympiad-Bench (data/olympiad_bench.parquet). In addition, we include GPQA-Diamond (data/gpqa.parquet), LSAT (data/lsat.parquet), and MMLU (a 500-instance subset, data/mmlu_500.parquet), three general reasoning benchmarks that test the ability to generalize to out-of-domain data.

You can also use your own evaluation data; please follow the format of the evaluation data files.

Training

Train DeepScaleR-1.5B-Preview with SAS:

bash deepscaler_grpo_sas.sh

You can also train other models by modifying actor_rollout_ref.model.path in the deepscaler_grpo_sas.sh script.

We explain the important arguments in the deepscaler_grpo_sas.sh as follows:

  • data.train_files: The path to the training data.
  • data.val_files: The path to the validation data.
  • trainer.sas: Whether to use SAS.
  • trainer.sas_strategy: The SAS strategy. Available options: correct_only (apply SAS only to correct rollouts), wrong_only (apply SAS only to wrong rollouts), both (apply SAS to both correct and wrong rollouts). Defaults to both.
  • trainer.mask_steps_ratio: The ratio of steps whose advantages are set to 0 (between 0 and 1; defaults to 0.3).
  • trainer.random_mask: Whether to use random selection (for ablation study).
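The step-level masking controlled by trainer.mask_steps_ratio and trainer.random_mask can be sketched roughly as follows. This is a conceptual illustration, not the actual verl implementation; in particular, the selection criterion for the non-random case (masking the steps with the smallest absolute advantage) is an assumption for this sketch.

```python
import random

def mask_step_advantages(advantages, mask_ratio=0.3, random_mask=False, seed=0):
    """Zero out a fraction of per-step advantages.

    advantages: list of floats, one per reasoning step.
    mask_ratio: fraction of steps to mask (0 to 1).
    random_mask: if True, pick steps uniformly at random (ablation);
                 otherwise mask the steps with the smallest |advantage|
                 (an assumed criterion for this sketch).
    """
    n_mask = int(len(advantages) * mask_ratio)
    idx = list(range(len(advantages)))
    if random_mask:
        random.Random(seed).shuffle(idx)
    else:
        idx.sort(key=lambda i: abs(advantages[i]))
    masked = set(idx[:n_mask])
    return [0.0 if i in masked else a for i, a in enumerate(advantages)]

# With mask_ratio=0.3 and 10 steps, the 3 steps with the smallest
# |advantage| (indices 1, 3, 7) are zeroed out.
adv = [0.9, -0.1, 0.5, 0.05, -0.7, 0.3, 0.2, -0.02, 0.6, 0.4]
masked_adv = mask_step_advantages(adv, mask_ratio=0.3)
print(masked_adv)
```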

You can train DeepScaleR-1.5B-Preview with standard GRPO post-training under a 4K training context, without any additional RL techniques:

bash deepscaler_grpo_4k.sh

Note: Before evaluation, you need to merge the checkpoints from FSDP and Megatron backends. Please refer to the verl documentation for more details. Example command:

python scripts/model_merger.py merge \
    --backend fsdp \
    --local_dir /path/to/the/saved/model/checkpoints \
    --target_dir /path/to/the/merged/hf/model

Evaluation

Evaluate the model on all the evaluation datasets:

MODEL_PATH=/path/to/the/model/checkpoint
OUTPUT_DIR=/path/to/the/output/directory
bash eval_model.sh --model ${MODEL_PATH} --num-tokens 8192 --datasets aime aime2025 amc olympiad_bench gpqa lsat mmlu_500 --output-dir ${OUTPUT_DIR}

Acknowledgement

We sincerely thank the authors of verl and DeepScaleR for their public code and data release.

Citation

@article{wang2026stabilizing,
  title={Stabilizing Efficient Reasoning with Step-Level Advantage Selection},
  author={Han Wang and Xiaodong Yu and Jialian Wu and Jiang Liu and Ximeng Sun and Mohit Bansal and Zicheng Liu},
  year={2026},
  journal={arXiv preprint arXiv:2604.24003}
}

About

Code for ACL 2026 (Findings) paper "Stabilizing Efficient Reasoning with Step-Level Advantage Selection"
