- [2026, Mar 17]: 🎉🎉 We release the dataset and the code. Feel free to use them!
- [2026, Jan 14]: 🎉🎉Our arXiv paper is available! Check it out for more details.
This work addresses the diversity collapse problem in RL fine-tuning for LLMs by introducing a framework based on a semi-structured long Chain-of-Thought. We propose Diverse Planning Branching to introduce divergence at the planning stage based on diversity variation, along with a group-aware diversity reward to encourage distinct generation trajectories. Experiments on creative writing tasks show that our method significantly improves output diversity while maintaining generation quality.
Abstract (Click me)
Reinforcement learning (RL)-based enhancement of large language models (LLMs) often leads to reduced output diversity, undermining their utility in open-ended tasks like creative writing. Current methods lack explicit mechanisms for guiding diverse exploration and instead prioritize optimization efficiency and performance over diversity. This paper proposes an RL framework structured around a semi-structured long Chain-of-Thought (CoT), in which the generation process is decomposed into explicitly planned intermediate steps. We introduce a Diverse Planning Branching method that strategically introduces divergence at the planning phase based on diversity variation, alongside a group-aware diversity reward to encourage distinct trajectories. Experimental results on creative writing benchmarks demonstrate that our approach significantly improves output diversity without compromising generation quality, consistently outperforming existing baselines.

Figure 1. Comparison among three generation paradigms. Our semi-structured reasoning paradigm introduces global planning before reasoning, providing high-level guidance while maintaining higher quality.
Create a new virtual environment:
git clone https://github.com/Aman-4-Real/DPWriter.git
cd DPWriter/
conda create -n dpwriter python=3.12
conda activate dpwriter
Our code is based on verl. Thanks for their great work!
Install verl==0.4.1 along with the corresponding Python packages.
Change the cudatoolkit version according to your environment if necessary.
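For reference, an install sequence along these lines should work (the exact wheel choice is an assumption; follow verl's own installation guide for the torch/CUDA combination matching your environment):

```shell
# Install verl 0.4.1 from PyPI; pick the torch / CUDA wheel that
# matches your driver and toolkit versions (see verl's install docs).
pip install verl==0.4.1
```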
For training, first download the initial checkpoint you need and the DPWriterData we used. Note that the data we provide is the full set. You can run SFT on it first and then use the SFT model to filter out overly easy samples to improve RL efficiency (for details, refer to our paper).
However, due to certain copyright policies, we removed some publicly crawled data.
Make sure to update all YOUR_PATH fields in the config as needed:
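To locate every remaining placeholder before launching, a quick search like this helps (the `training_scripts/` location is an assumption about where the configs live):

```shell
# List every YOUR_PATH placeholder still present in the training scripts
grep -rn "YOUR_PATH" training_scripts/
```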
### use ngram as diversity metric
bash training_scripts/train_dpwriter_k32_ngram_lambda06.sh
### use emb as diversity metric
bash training_scripts/train_dpwriter_k32_emb_lambda06.sh
For the core implementation of DPB in our paper, please refer to verl/div_branching/branching_strategy.py.
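The actual branching logic lives in `branching_strategy.py`; purely as a conceptual sketch (the function names, diversity signal, threshold, and resampling policy below are all assumptions, not the paper's exact formulation), the idea of injecting divergence at the planning stage when group diversity drops could look like:

```python
def distinct_ratio(plans: list[str]) -> float:
    """Fraction of unique plans in the group; a crude stand-in for the
    paper's diversity-variation signal."""
    return len(set(plans)) / len(plans) if plans else 0.0

def diverse_planning_branching(sample_plan, k: int = 4,
                               threshold: float = 0.75,
                               max_rounds: int = 5) -> list[str]:
    """Sample k candidate plans; while group diversity stays below the
    threshold, resample duplicated plans to branch the planning stage."""
    plans = [sample_plan() for _ in range(k)]
    for _ in range(max_rounds):
        if distinct_ratio(plans) >= threshold:
            break
        seen = set()
        for i, plan in enumerate(plans):
            if plan in seen:
                # Duplicate plan: branch here by drawing a fresh plan
                plans[i] = sample_plan()
            else:
                seen.add(plan)
    return plans
```

In the real method, branching is driven by diversity variation over semi-structured CoT plans rather than exact-string duplication; this sketch only illustrates the control flow of "measure group diversity, then resample where it collapses."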
For the rewarding code, please refer to dpwriter_rewards/reward_skywork_w_div_norm.py.
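As an illustration of the group-aware diversity reward (not the exact code in `reward_skywork_w_div_norm.py`; the Jaccard-over-n-grams metric, averaging scheme, and the `lam` blending below are assumptions), a minimal sketch might be:

```python
def ngram_set(text: str, n: int = 3) -> set:
    """Collect the set of word-level n-grams in a text."""
    tokens = text.split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def pairwise_ngram_diversity(group: list[str], n: int = 3) -> list[float]:
    """For each completion, average (1 - Jaccard overlap) of its n-grams
    against every other completion in the group. Higher = more distinct."""
    sets = [ngram_set(t, n) for t in group]
    scores = []
    for i, si in enumerate(sets):
        dists = []
        for j, sj in enumerate(sets):
            if i == j:
                continue
            union = si | sj
            jaccard = len(si & sj) / len(union) if union else 1.0
            dists.append(1.0 - jaccard)
        scores.append(sum(dists) / len(dists) if dists else 0.0)
    return scores

def combined_reward(quality: list[float], diversity: list[float],
                    lam: float = 0.6) -> list[float]:
    """Blend a per-sample quality reward with the group diversity score;
    lam=0.6 mirrors the lambda06 naming in the training scripts."""
    return [q + lam * d for q, d in zip(quality, diversity)]
```

The embedding-based variant (`emb` in the training scripts) would replace the n-gram overlap with cosine distance between sentence embeddings, keeping the same group-wise averaging.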
We respect and uphold the usage terms of the original data providers. If you believe that any part of this dataset affects your legal rights or raises other concerns, please reach out to us. We will carefully review your request and respond without delay.
@article{cao2026dpwriter,
title={DPWriter: Reinforcement Learning with Diverse Planning Branching for Creative Writing},
author={Cao, Qian and Liu, Yahui and Bi, Wei and Zhao, Yi and Song, Ruihua and Wang, Xiting and Tang, Ruiming and Zhou, Guorui and Li, Han},
journal={arXiv preprint arXiv:2601.09609},
year={2026}
}
For any questions, please feel free to contact me at caoqian4real@ruc.edu.cn.