OliverLeeXZ/DMPO

Beyond Mode Collapse: Distribution Matching for Diverse Reasoning

[📃Paper] [🌐Project Page] [🤗Hugging Face]

📣 What's New

  • [2025.5.21] We have released our DMPO algorithm in verl DMPO!
  • [2025.5.21] We have released data in OliverLeeXZ/NP-MM and OliverLeeXZ/NP. 🎉🎉🎉
  • [2026.5.19] Our DMPO paper is released! Check it out at 📃arXiv: DMPO!
  • [2026.5.1] Our DMPO has been accepted at ICML 2026! See you in Seoul! 🎉🎉🎉
  • [2026.5.1] Our NPMM-Bench is now integrated into VLMEvalKit via PR #1463.

🌟 Highlights

  1. We show that on-policy RL methods suffer from mode collapse due to reverse KL's mode-seeking behavior, and propose DMPO, a simple, practical solution that approximates forward KL minimization at the group level, achieving 9-12% relative improvements on optimization tasks.
  2. We introduce MM-NP-Bench, a benchmark for vision-language models built on visual representations of 10 NP-hard tasks. The benchmark features dual-metric evaluation (Success Rate and Quality Ratio) that makes mode collapse observable: a high SR with a low QR reveals a policy that finds solutions but does not optimize them. We provide complete infrastructure, including parametric generators, rule-based verifiers, and heuristic solvers, which supports both evaluation and RLVR training.
  3. Extensive experiments show that DMPO outperforms five strong baselines by 3.8%-4.7% on optimization tasks, 2% on mathematical reasoning, and 2.3% on out-of-domain tasks, with evidence that diversity-preserving training transfers to general reasoning capabilities.
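
To make the reverse-KL versus forward-KL contrast in the first highlight concrete, here is a minimal toy sketch. It is not the DMPO algorithm or its group-level approximation; the four-outcome target and the two candidate policies (`collapsed`, `covering`) are made-up numbers chosen only to illustrate mode-seeking versus mass-covering behavior.

```python
import math


def kl(a, b):
    """KL(a || b) for two discrete distributions given as equal-length lists."""
    return sum(x * math.log(x / y) for x, y in zip(a, b) if x > 0)


# Hypothetical target with two equally good modes (indices 0 and 3).
target = [0.48, 0.02, 0.02, 0.48]

# Two hypothetical policies: one sits on a single mode, one spreads its mass.
collapsed = [0.97, 0.01, 0.01, 0.01]
covering = [0.25, 0.25, 0.25, 0.25]

for name, policy in [("collapsed", collapsed), ("covering", covering)]:
    reverse_kl = kl(policy, target)   # KL(policy || target): mode-seeking
    forward_kl = kl(target, policy)   # KL(target || policy): mass-covering
    print(f"{name:10s} reverse KL = {reverse_kl:.3f}   forward KL = {forward_kl:.3f}")
```

In this toy setting the collapsed policy has the lower reverse KL, while the covering policy has the lower forward KL, because forward KL heavily penalizes assigning almost no probability to the second mode.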

Quick Start

Environment Setup

  • We recommend following the official verl installation guide: Install verl.

NPMM Training Setup

  • You can integrate the vision NP task (NP_MM) into verl, or directly use the existing NP task in verl (NP Task).
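
As a rough illustration of the dual-metric idea (Success Rate and Quality Ratio) behind these NP tasks, the sketch below scores answers to a toy 0/1 knapsack instance. Everything in it is an assumption for illustration: the function names (`verify`, `greedy_reference`, `evaluate`), the choice of knapsack, and the exact metric definitions (SR as the fraction of feasible answers, QR as the mean value relative to a greedy heuristic over feasible answers). It does not reproduce the repository's verifiers or any verl API.

```python
from typing import Sequence, Tuple

Item = Tuple[int, int]  # (weight, value); weights are assumed positive


def verify(items: Sequence[Item], capacity: int, chosen: Sequence[int]) -> Tuple[bool, int]:
    """Rule-based check: indices must be valid, distinct, and fit the capacity."""
    if len(set(chosen)) != len(chosen):
        return False, 0
    if any(i < 0 or i >= len(items) for i in chosen):
        return False, 0
    if sum(items[i][0] for i in chosen) > capacity:
        return False, 0
    return True, sum(items[i][1] for i in chosen)


def greedy_reference(items: Sequence[Item], capacity: int) -> int:
    """Heuristic solver: take items by value/weight ratio while they fit."""
    remaining, total = capacity, 0
    for weight, value in sorted(items, key=lambda it: it[1] / it[0], reverse=True):
        if weight <= remaining:
            remaining -= weight
            total += value
    return total


def evaluate(items: Sequence[Item], capacity: int,
             answers: Sequence[Sequence[int]]) -> Tuple[float, float]:
    """Return (SR, QR) under the assumed definitions described above."""
    reference = greedy_reference(items, capacity)
    ratios = []
    for chosen in answers:
        feasible, value = verify(items, capacity, chosen)
        if feasible:
            ratios.append(value / reference)
    sr = len(ratios) / len(answers)
    qr = sum(ratios) / len(ratios) if ratios else 0.0
    return sr, qr


# Hypothetical instance: every sampled answer is feasible but far from the best.
items = [(2, 3), (3, 4), (4, 5), (5, 6)]
answers = [[0], [0], [0], [1]]
sr, qr = evaluate(items, 5, answers)
print(f"SR = {sr:.2f}, QR = {qr:.2f}")
```

The sampled answers here are all valid yet low in value, which is the "high SR but low QR" pattern that the Highlights describe as a symptom of mode collapse.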

Latest verl Recipe for DMPO

To run DMPO, we recommend using the latest official verl framework and recipe from the official codebase:

Latest verl recipe:

πŸ–ŠοΈ Citation

If you find this work helpful, please consider to star🌟 this repo and cite this paper. Thanks for your support!

@misc{li2026modecollapsedistributionmatching,
      title={Beyond Mode Collapse: Distribution Matching for Diverse Reasoning}, 
      author={Xiaozhe Li and Yang Li and Xinyu Fang and Shengyuan Ding and Peiji Li and Yongkang Chen and Yichuan Ma and Tianyi Lyu and Linyang Li and Dahua Lin and Qipeng Guo and Qingwen Liu and Kai Chen},
      year={2026},
      eprint={2605.19461},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2605.19461}, 
}

πŸ™ Acknowledgement

DMPO is built on the excellent RL framework verl, and NPMM-Bench is built on the widely used VLM evaluation framework VLMEvalKit. We thank the authors and contributors of these projects for their valuable work.

About

[ICML 2026] Official implementation of 'Beyond Mode Collapse: Distribution Matching for Diverse Reasoning'
