ACPO

This repository contains the implementation for Adaptive Clip Policy Optimization (ACPO) from the paper "What are Key Factors for Updates in RL for LLM Reasoning?"

ACPO is built on top of verl. It adds variance-adaptive clipping for GRPO/RLVR updates, together with the baselines used in the paper: DAPO, CISPO, AR-Lopti, and entropy-based selective token updates.

Installation

git clone https://github.com/Control-derek/ACPO.git
cd ACPO
bash setup.sh
conda activate verl

setup.sh creates a conda environment, installs this package in editable mode, and installs the training dependencies used by the released scripts.

Data

The math scripts expect ORZ-style parquet files under DATA_DIR:

train.parquet
test1.parquet
test2.parquet
test_amc.parquet
test_minerva.parquet
test_aime2025.parquet

By default DATA_DIR=~/data/open_reasoner_zero_nochat. You can prepare these files with the included preprocessing script after downloading the public ORZ/evaluation data sources:

git clone https://github.com/Open-Reasoner-Zero/Open-Reasoner-Zero.git ~/Open-Reasoner-Zero
git clone https://github.com/sail-sg/understand-r1-zero.git ~/understand-r1-zero
python examples/data_preprocess/open_reasoner_zero_nochat.py
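Before launching a run, it can help to confirm that preprocessing produced every file listed above. The following is a minimal sketch, not part of the released scripts: the helper name `missing_parquet_files` is hypothetical, and the file names and default `DATA_DIR` are taken from this README.

```python
import os
from pathlib import Path

# File names listed in the Data section of this README.
EXPECTED_FILES = [
    "train.parquet",
    "test1.parquet",
    "test2.parquet",
    "test_amc.parquet",
    "test_minerva.parquet",
    "test_aime2025.parquet",
]


def missing_parquet_files(data_dir):
    """Return the expected file names that are not present under data_dir.

    Hypothetical helper for illustration; the repository does not ship it.
    """
    root = Path(data_dir).expanduser()
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


if __name__ == "__main__":
    # Fall back to the default location documented in this README.
    data_dir = os.environ.get("DATA_DIR", "~/data/open_reasoner_zero_nochat")
    missing = missing_parquet_files(data_dir)
    if missing:
        print(f"Missing under {data_dir}: {', '.join(missing)}")
    else:
        print(f"All {len(EXPECTED_FILES)} parquet files found under {data_dir}")
```

Running it with `DATA_DIR` set to the directory you pass to the training scripts checks that same directory.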

Training

Main ACPO runs for Qwen2.5-7B on ORZ-57K:

# Near on-policy: 2 updates per rollout
bash scripts/acpo/math_qwen_7b_acpo_near_onpolicy.sh

# Off-policy: 16 updates per rollout
bash scripts/acpo/math_qwen_7b_acpo_offpolicy.sh

The scripts default to console logging only. To enable W&B, pass your own project settings explicitly:

LOGGER="['console','wandb']" WANDB_PROJECT=acpo WANDB_ENTITY=<your-entity> \
  bash scripts/acpo/math_qwen_7b_acpo_offpolicy.sh

Common overrides:

DATA_DIR=/path/to/open_reasoner_zero_nochat \
OUTPUT_ROOT=/path/to/outputs \
N_GPUS=8 \
bash scripts/acpo/math_qwen_7b_acpo_offpolicy.sh

Main Configs

The paper uses two update regimes, near on-policy and off-policy, that share the same base setup:

| Regime | Updates/rollout | Train batch | PPO mini batch |
| --- | --- | --- | --- |
| Near on-policy | 2 | 128 | 64 |
| Off-policy | 16 | 256 | 16 |
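The "Updates/rollout" column is consistent with dividing the train batch by the PPO mini batch. The sketch below reproduces that arithmetic under the assumption that each rollout batch is passed over once (`ppo_epochs=1` is an assumed default, not a value stated in this README); the function name is hypothetical.

```python
def updates_per_rollout(train_batch, ppo_mini_batch, ppo_epochs=1):
    """Number of optimizer updates performed on one rollout batch.

    Assumes the rollout batch is split into equal mini batches and that
    ppo_epochs passes are made over it (assumed to be 1 here).
    """
    if train_batch % ppo_mini_batch != 0:
        raise ValueError("train_batch must be divisible by ppo_mini_batch")
    return (train_batch // ppo_mini_batch) * ppo_epochs


# (train batch, PPO mini batch) pairs from the regime table.
REGIMES = {
    "near_onpolicy": (128, 64),
    "offpolicy": (256, 16),
}

for name, (train_batch, mini_batch) in REGIMES.items():
    print(name, updates_per_rollout(train_batch, mini_batch))
```

With the table's values this gives 2 updates for the near on-policy regime and 16 for the off-policy regime.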

Shared settings: Qwen/Qwen2.5-7B, learning rate 1e-6, rollout n=10, validation n=8, max prompt length 1024, max response length 3072, total epochs 15, no KL loss, no entropy bonus, seq-mean-token-mean loss aggregation, DAPO clipping eps_low=0.2, eps_high=0.3.
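For readers unfamiliar with the asymmetric clipping referred to above, here is a minimal per-token sketch of a PPO-style clipped surrogate with separate lower and upper bounds, using eps_low=0.2 and eps_high=0.3. It is an illustration only: the function name is hypothetical, it works on scalars rather than token tensors, and it does not include loss aggregation or ACPO's variance-adaptive adjustment of the clip range.

```python
def clipped_surrogate(ratio, advantage, eps_low=0.2, eps_high=0.3):
    """Per-token clipped surrogate objective (to be maximized).

    ratio     : new-policy probability divided by old-policy probability
    advantage : advantage estimate for the token
    The ratio is clipped to [1 - eps_low, 1 + eps_high], and the more
    pessimistic of the clipped and unclipped terms is kept.
    """
    clipped_ratio = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
    return min(ratio * advantage, clipped_ratio * advantage)


if __name__ == "__main__":
    # Positive advantage: gains stop growing once ratio exceeds 1 + eps_high.
    print(clipped_surrogate(1.5, 1.0))
    # Negative advantage: the penalty is floored once ratio drops below 1 - eps_low.
    print(clipped_surrogate(0.5, -1.0))
```

Because eps_high is larger than eps_low, the ratio may rise further above 1 than it may fall below 1 before clipping takes effect.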

ACPO main setting: use_variance_adaptive_clip=True, variance_alpha=3.0, variance_base_clip=0.2, variance_clip_min=0.0, variance_clip_max=3.0, variance_num_bins=5.

Baseline scripts:

bash scripts/acpo/math_qwen_7b_dapo_offpolicy.sh
bash scripts/acpo/math_qwen_7b_cispo_offpolicy.sh
bash scripts/acpo/math_qwen_7b_ar_lopti_offpolicy.sh
bash scripts/acpo/math_qwen_7b_entropy_top_offpolicy.sh
bash scripts/acpo/math_qwen_7b_entropy_bottom_offpolicy.sh

Near on-policy versions of each baseline are provided under the same names, with the _offpolicy.sh suffix replaced by _near_onpolicy.sh.

Acknowledgement

This codebase is a research fork of verl. We thank the verl authors and contributors for the open-source RLHF/RLVR infrastructure. The original code is licensed under Apache-2.0, and this repository keeps the same license.

Citation

If you use this code, please cite:

@misc{wang2026keyfactorsrlreasoning,
  title  = {What are Key Factors for Updates in RL for LLM Reasoning?},
  author = {Peidong Wang and Demi Wang and Xufang Luo and Jiahang Xu and Xiaocui Yang and Shi Feng and Yuqing Yang and Dongsheng Li},
  year   = {2026}
}
