Beyond Mode Collapse: Distribution Matching for Diverse Reasoning

[📃Paper] [🌐Project Page] [🤗Hugging Face]

📣 What's New

[2025.5.21] We have released our DMPO algorithm in verl: DMPO!
[2025.5.21] We have released our data at OliverLeeXZ/NP-MM and OliverLeeXZ/NP. 🎉🎉🎉
[2026.5.19] Our DMPO paper is released! Check it out on 📃 arXiv: DMPO!
[2026.5.1] Our DMPO has been accepted at ICML 2026! See you in Seoul! 🎉🎉🎉
[2026.5.1] Our NPMM-Bench is now integrated into VLMEvalKit via PR #1463.

🌟 Highlights

We show that on-policy RL methods suffer from mode collapse because of the mode-seeking behavior of reverse KL, and we propose DMPO, a simple, practical solution that approximates forward KL minimization at the group level and achieves 9-12% relative improvements on optimization tasks.
We introduce MM-NP-Bench, a benchmark for vision-language models built on visual representations of 10 NP-hard tasks. It features dual-metric evaluation (Success Rate, SR, and Quality Ratio, QR) that makes mode collapse observable: high SR with low QR reveals a policy that finds solutions but does not optimize them. We provide complete infrastructure, including parametric generators, rule-based verifiers, and heuristic solvers, that supports both evaluation and RLVR training.
Extensive experiments show that DMPO outperforms five strong baselines by 3.8%-4.7% on optimization tasks, 2% on mathematical reasoning, and 2.3% on out-of-domain tasks, with evidence that diversity-preserving training transfers to general reasoning capabilities.
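
The mode-seeking versus mass-covering contrast mentioned in the first highlight can be illustrated with a toy discrete example. The sketch below is not the DMPO algorithm: the three-outcome target and the two candidate policies are made-up numbers, chosen only to show how the two KL directions rank a collapsed policy against a spread-out one.

```python
import math


def kl(p, q):
    """KL(p || q) for discrete distributions; terms with p_i == 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Hypothetical target with two equally good modes and a low-probability valley.
target = [0.495, 0.01, 0.495]

# Two illustrative policies (assumed numbers, not taken from the paper).
policies = {
    "collapsed": [0.98, 0.01, 0.01],  # commits almost entirely to one mode
    "covering": [0.30, 0.40, 0.30],   # reaches both modes, but wastes mass in the valley
}

# Reverse KL puts the policy first: it penalizes mass where the target is small.
reverse_kl = {name: kl(q, target) for name, q in policies.items()}

# Forward KL puts the target first: it penalizes missing any mode of the target.
forward_kl = {name: kl(target, q) for name, q in policies.items()}

for name in policies:
    print(f"{name:9s}  reverse KL = {reverse_kl[name]:.3f}  forward KL = {forward_kl[name]:.3f}")
```

In this toy setting, reverse KL assigns the lower divergence to the collapsed policy, while forward KL assigns the lower divergence to the policy that covers both modes. This is the general property that motivates a forward-KL-style objective when diverse solutions matter.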

Quick Start

Environment Setup

We recommend following the official verl installation guide: Install verl.

NPMM Training Setup

You can either integrate the vision NP task, NP_MM, into verl, or directly use the existing NP task in verl: NP Task.

Latest verl Recipe for DMPO

To run DMPO, we recommend using the latest official verl framework together with the recipe from the official codebase:

verl recipe: verl-project/verl-recipe#105

🖊️ Citation

If you find this work helpful, please consider starring 🌟 this repo and citing this paper. Thanks for your support!

@misc{li2026modecollapsedistributionmatching,
      title={Beyond Mode Collapse: Distribution Matching for Diverse Reasoning}, 
      author={Xiaozhe Li and Yang Li and Xinyu Fang and Shengyuan Ding and Peiji Li and Yongkang Chen and Yichuan Ma and Tianyi Lyu and Linyang Li and Dahua Lin and Qipeng Guo and Qingwen Liu and Kai Chen},
      year={2026},
      eprint={2605.19461},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2605.19461}, 
}

🙏 Acknowledgement

DMPO is built on the excellent RL framework verl, and NPMM-Bench is built on the widely used VLM evaluation framework VLMEvalKit. We thank the authors and contributors of these projects for their valuable work.
