We present Writing-RL: an Adaptive Curriculum Reinforcement Learning framework to advance long-form writing capabilities beyond traditional supervised fine-tuning (SFT) approaches.
The Writing-RL framework consists of three key components:
- Margin-aware Data Selection strategy: prioritizes samples with high learning potential
- Pairwise Comparison Reward mechanism: provides discriminative learning signals in the absence of verifiable rewards
- Dynamic Reference Scheduling approach: adaptively adjusting task difficulty based on evolving model performance.
๐ซ Advancing Long-form Writing Capabilities: Our resulting writer models show that our RL framework largely improves long-form writing performance over strong SFT baselines.
๐ซ Generalization from Long-Output to Long-Input: We observe that models trained with long-output RL generalize surprisingly well to long-input reasoning tasks, potentially offering a promising perspective for rethinking long-context training.
[2025.09.18] ๐ We are delighted to release the training code of our paper. The training dataset and the resulting writer models are coming soon. Stay tuned!
[2025.06.06] ๐ Our initial paper is available in Arxiv.
Intuition ๐งฉ: Samples with higher learning potential enhances learning efficiency.
We propose Margin-aware Data Selection, a method that prioritizes training samples with high learning potential. In our view, an estimate of sample-wise learning potential is defined as the performance gap between the policy modelโs response and the best response from a set of competitive LLMs. Afterwards, we use this metric for sample selection.
Intuition ๐งฉ: For writing, it is easier to evaluate response quality when comparing with references.
We introduce Pairwise Comparison Reward that leverages LLM-as-a-Judge to provide discriminative, reference-based feedback for long-form writing. Instead of relying on noisy pointwise scores, our reward function compares the policyโs response against a high-quality reference: it awards 1 if the response is judged superior, 0.5 for equivalence, and 0 if inferior.
Intuition ๐งฉ: We aim to increasingly challenge the model as it evolves.
We propose Dynamic Reference Scheduling, a curriculum learning strategy that adaptively escalates task difficulty by progressively replacing defeated references with stronger ones during training. As the policy improves and wins pairwise comparisons against its current reference, it โlevels upโ to a marginally harder one, creating a personalized, asynchronous learning path per sample. This eliminates static difficulty assumptions and inference overhead while continuously aligning challenge with model capability, driving steady improvement without overwhelming the learner.
Our Writing-RL trained models exhibit long-form writing capabilities that match or even surpass those of proprietary models, positioning them as strong open-source alternatives for long-form generation tasks.
Our Writing-RL trained models exhibit surprising generalization to long-context reasoning tasks with substantially longer inputs (evaluated on LongBench v2), while the SFT-trained counterparts show slight performance degradation in this regime. (Training input + output: 1k - 8k; Evaluation input: 8k - 2M)
conda create -n verl python==3.10 -y
conda activate verl
pip3 install torch==2.6.0
pip3 install flash-attn==2.7.4.post1 --no-build-isolation
pip3 install -e .
pip3 install vllm==0.8.2
For more specific environment issues, you can refer to VeRL and TinyZero.
The training dataset should be placed in data directory, consists of train.parquet and test.parquet. The training samples should be in the following format:
{
'data_source': 'writing',
'prompt': [
{
'content': "You are a helpful assistant.",
'role': "system"
},
{
'content': prompt[str],
'role': 'user'
}
],
'ability': 'write',
'reward_model': {
'ground_truth': { ### the ground truth is used for constructing judge prompt
'prompt': prompt[str],
'target': references[list[str]], ### pre-generated list of references with increasing quality
'criteria': criteria[str], ### could be None
'id': id[int] ### id should be an integer range(0, len(references)), indicating the index of current reference, increasing during training process.
},
'style': 'rule'
},
'extra_info': {
'index': id,
'split': train ###
}
}
Our training dataset will be available soon, stay tuned!
We recommend using a stronger writer model as the RL base model, such as Writing-Model-Qwen-7b.
We include the environment configurations and the training commands in scripts/train_writing_rl.sh. Most configurations are similar with the original VeRL config and you can adjust them based on your hardware conditions. We would like to elaborate on some critical parameters introduced by our framework.
Since we use pairwise LLM-as-Judge methods to provide rewards, we implement a ParallelRewardManager to call LLM APIs in parallel after completing rollouts of a batch. The writing-oriented reward function is implemented in compute_score and the judging prompts are included in judging_prompts. You can customize them to meet your needs.
In the training scripts, you can adjust the following environment variables.
- JUDGE_MODEL: the name of the judge model. In our experiments, we use qwen-plus to balance performance and costs. NOTE: the training process can be costly.
- DASHSCOPE_API_KEY: your dashscope api key here when using qwen models as judge model.
- OPENAI_API_KEY: your openai api key here when using gpt models as judge model.
- REWARD_STRATEGY: should be chosen from
position-disadvantageorposition-advantage. As shown in previous work, the pairwise LLM-Judge has certain level of positional bias (mainly favors the former response).- position-disadvantage: When tuning stronger writer models, we recommend to set it as
position-disadvantageto place the model answer as the latter response in the comparison and introduce more pressure for the policy model. In this configuration, the reward function isLLM-Judge(reference, model response). - position-advantage: When tuning less competent writer models, we recommend to set it as
position-advantageto place the model answer as the former response in the comparison and let the model receive more positive reward signals. In this configuration, the reward function isLLM-Judge(model response, reference).
- position-disadvantage: When tuning stronger writer models, we recommend to set it as
- API_PARALLEL: the parallel rate of judge model api and you need to adjust it based on your api rate limit.
- We recommend using wandb to log your experiments and you need to configure your wandb api key in the script.
- To dynamically monitor the current reference indexes of each training sample, we first execute
scripts/create_index_file.pyto createindex_{experiment_name}.jsonin therunning_filedirectory. The file contains a mapping ofsample_idtocurrent_reference_indexand it will be modified in each trainning step. At the begining of reward computation, the program will read the file to find the current index of the questions and usereference_list[current_reference_index]for each sample to construct pairwise comparisons. Afterwards, when some of the model responses beat the current reference, the correspondingcurrent_reference_indexwill add 1 (step towards a better reference) and is finally written to the file. In this way, we implement ourDynamic Reference Schedulingframework and we use an external file to log the information to maintain minimal invasive modifications to the original VeRL codebase and can be beneficial for intermediate checking and mannual intervention.
After the preparations, you can launch the traning process by our one-stop script.
bash scripts/train_writing_rl.sh
We would like to thank Deepseek R1, VeRL, TinyZero projects for early explorations in LLM RL and vllm for their infra suppot. We also thank Qwen and LLaMa models to provide a strong foundation for our experiments.
If you find our work helpful, please consider citing our work.
@article{lei2025writing,
title={Writing-RL: Advancing Long-form Writing via Adaptive Curriculum Reinforcement Learning},
author={Lei, Xuanyu and Li, Chenliang and Wu, Yuning and Liu, Kaiming and Shen, Weizhou and Li, Peng and Yan, Ming and Zhang, Ji and Huang, Fei and Liu, Yang},
journal={arXiv preprint arXiv:2506.05760},
year={2025}
}


