Policy Split

Incentivizing Dual-Mode Exploration in LLM Reinforcement with Dual-Mode Entropy Regularization

Entropy-guided RL for LLMs faces a fundamental exploration–exploitation conflict: pushing for higher entropy often degrades accuracy, especially when continually post-training low-entropy large reasoning models (LRMs) that are already proficient. Policy Split resolves this by bifurcating a single model into two collaborating policies — a normal mode optimized for correctness and a high-entropy mode that prefers exploration — trained together via dual-mode entropy regularization.

(a) Naively preferring high entropy can drop accuracy. (b) Policy Split bifurcates the policy into a normal mode and a high-entropy mode (identified by a high-entropy system prompt) that share parameters and rollouts. (c) Dual-mode entropy regularization drives the two modes to distinct entropy behaviors and a growing KL divergence.

Key Idea

Instantiate two policies inside one model that benefit from each other's experience:

Normal mode π_θ(·|q) — trained with vanilla GRPO advantages for correctness only.
High-entropy mode π_θ^HE(·|q) = π_θ(·|s,q) — same parameters, activated by a high-entropy system prompt s, and trained with an entropy-regularized advantage that adds (i) an entropy preference and (ii) a KL term pushing it away from the normal mode, with clamping for stability.
Inter-policy collaboration via rollout sharing: the normal mode learns from newly discovered correct rollouts, while the high-entropy mode is kept from collapsing by stable correct responses — at the cost of ~1.4× GRPO training time.

Note: simply prepending a high-entropy prompt is not enough — the output distribution barely shifts (KL ≈ 0.01). Explicit training is required to establish a real high-entropy mode.

The Policy Split training framework: rollouts are sampled from both modes and shared, then each is assigned its corresponding correctness-only or entropy-regularized advantage.

Highlights

Consistent gains. Outperforms strong entropy-guided RL baselines in average accuracy across Qwen3-1.7B/4B/8B (all improvements over the base model pass significance testing at p < 0.01), while reviving entropy in low-entropy LRMs.
Two artifacts, one model. The normal mode is more rigorous and excels at general tasks; the high-entropy mode excels at creative writing — switchable at inference time by a simple prompt.
Genuine dual-mode split. Policy Split greatly enlarges the inter-mode KL divergence and behavioral gap, whereas prompting the original model or vanilla GRPO barely changes behavior.
Higher Best-of-N. The high-entropy mode finds more unique correct rollouts (higher best-of-8), providing distinct learning signals during training.
Some prompt generalization. A rewritten high-entropy prompt still induces high entropy, though following an unseen low-entropy instruction remains a bottleneck.

Method at a Glance

Component	Normal mode	High-entropy mode
Activation	no system prompt	high-entropy system prompt
Advantage	vanilla GRPO (correctness only)	GRPO + clamped entropy & KL term
Objective	preserve accuracy	encourage novel exploration
Best use	general / precision tasks	creative tasks

Both modes share parameters and rollouts and are always trained with their own advantage type, regardless of which mode produced the rollout.

Citation

@article{yao2026policy,
  title={Policy Split: Incentivizing Dual-Mode Exploration in LLM Reinforcement with Dual-Mode Entropy Regularization},
  author={Yao, Jiashu and Huang, Heyan and Wu, Daiqing and Liu, Zeming and Guo, Yuhang},
  journal={arXiv preprint arXiv:2604.11510},
  year={2026}
}

Name		Name	Last commit message	Last commit date
Latest commit History 2 Commits
assets		assets
verl		verl
README.md		README.md
README.zh-CN.md		README.zh-CN.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Policy Split

Key Idea

Highlights

Method at a Glance

Citation

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Uh oh!

Folders and files

Latest commit

History

Repository files navigation

Policy Split

Key Idea

Highlights

Method at a Glance

Citation

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages