[Paper] • [Github Repo] • [Critic Model] • [Writing Model]
WritingBench is a comprehensive benchmark for evaluating LLMs' writing capabilities across 1,239 real-world queries, spanning:
- 6 primary domains
- 100 fine-grained subdomains
- 3 core writing requirements: Style / Format / Length
- 1,546 avg. tokens per query
WritingBench integrates materials from diverse sources. Each query is paired with 5 instance-specific criteria, scored either by LLM evaluators or by a fine-tuned critic model.
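To make the pairing concrete, a single benchmark record might look like the sketch below; the field names and contents are illustrative assumptions, not the exact WritingBench schema.

# Hypothetical example of a benchmark record pairing a query with its
# 5 instance-specific criteria (field names are illustrative only).
import json

record = {
    "index": 1,
    "domain1": "Finance & Business",   # primary domain
    "domain2": "Quarterly Report",     # fine-grained subdomain
    "query": "Draft a 500-word executive summary of the attached Q3 financial statements ...",
    "checklist": [                     # 5 instance-specific criteria
        "Covers revenue, profit, and cash-flow highlights from the materials",
        "Maintains a formal, investor-facing tone",
        "Stays within the 500-word limit",
        "Follows an executive-summary structure (overview, key results, outlook)",
        "Avoids introducing figures not present in the source materials",
    ],
}
print(json.dumps(record, ensure_ascii=False))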
WritingBench is built through a hybrid pipeline combining Model-Augmented Query Generation and Human-in-the-Loop Refinement, ensuring both diversity and real-world applicability. The construction process involves two key phases:
Leverage LLMs to generate queries from a two-tiered domain pool grounded in real-world writing scenarios, consisting of 6 primary domains and 100 secondary subdomains (a sketch of this step follows the list below), covering:
- Academic & Engineering
- Finance & Business
- Politics & Law
- Literature & Art
- Education
- Advertising & Marketing
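A minimal sketch of this generation step, assuming a generic call_llm helper (a placeholder for any chat-completion API) and an abridged domain pool; neither the prompt wording nor the pool entries are the benchmark's actual ones.

# Sketch of model-augmented query generation from the two-tiered domain pool.
import random

DOMAIN_POOL = {
    "Academic & Engineering": ["Paper Outline", "Patent"],
    "Finance & Business": ["Quarterly Report", "Market Analysis"],
    # ... remaining primary domains and their subdomains
}

def generate_query(call_llm) -> dict:
    domain = random.choice(list(DOMAIN_POOL))
    subdomain = random.choice(DOMAIN_POOL[domain])
    prompt = (
        f"You are helping build a writing benchmark. Write one realistic "
        f"user request for a '{subdomain}' writing task in the '{domain}' "
        f"domain. Describe the scenario concretely."
    )
    return {"domain1": domain, "domain2": subdomain, "query": call_llm(prompt)}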
Enhance the diversity and practical applicability of queries by applying randomly selected strategies from the Query Refinement Guidance Pool (a sketch follows the list below), covering:
- Style Adjustments (e.g., kid-friendly tone)
- Format Specifications (e.g., IEEE template)
- Length Constraints (e.g., 500-word summary)
- Personalization (e.g., educator's perspective)
- Content Specificity (e.g., 2023 Q3 metrics)
- Expression Optimization (query rewriting)
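The refinement step could be sketched as follows; the strategy wordings and the call_llm helper are again illustrative assumptions, not the benchmark's actual guidance pool.

# Sketch of query refinement: randomly pick strategies from the
# Query Refinement Guidance Pool and ask an LLM to rewrite the query.
import random

REFINEMENT_POOL = {
    "style": "Add an explicit style requirement (e.g., a kid-friendly tone).",
    "format": "Add a format specification (e.g., follow the IEEE template).",
    "length": "Add a length constraint (e.g., a 500-word summary).",
    "personalization": "Add a persona or perspective (e.g., an educator).",
    "content": "Add concrete content requirements (e.g., 2023 Q3 metrics).",
    "expression": "Rewrite the query for clearer, more natural phrasing.",
}

def refine_query(query: str, call_llm, k: int = 2) -> str:
    strategies = random.sample(list(REFINEMENT_POOL.values()), k)
    prompt = (
        "Rewrite the following writing request, applying these refinements:\n"
        + "\n".join(f"- {s}" for s in strategies)
        + f"\n\nOriginal request:\n{query}"
    )
    return call_llm(prompt)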
30 trained annotators collect necessary open-source materials (e.g., public financial statements or legal templates), guided by material requirements generated by LLMs.
5 experts conduct a careful two-stage filtering process:
- Query adaptation: ambiguous or unrealistic queries are revised to better align with the provided materials and practical scenarios
- Material pruning: redundant or irrelevant content is removed from the collected materials
Given a query, the evaluator loads its 5 instance-specific criteria.
For each criterion, the response is scored independently on a 10-point scale, and the final score is the average across the criteria.
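In code, the criterion-wise scoring could look roughly like the sketch below; the evaluator callable and the function signature are assumptions, and the actual logic lives in evaluate_benchmark.py and evaluator/.

# Sketch of criterion-wise scoring, assuming an evaluator callable that
# returns an integer score from 1 to 10 for a (query, criterion, response) triple.
from statistics import mean

def score_response(query: str, criteria: list[str], response: str, evaluator) -> float:
    per_criterion = [
        evaluator(query=query, criterion=c, response=response)  # 1-10 score
        for c in criteria
    ]
    return mean(per_criterion)  # final score: average over the 5 criteria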
git clone https://github.com/yourusername/WritingBench.git
.
├── evaluate_benchmark.py       # Evaluation script
├── prompt.py                   # Prompt templates
├── evaluator/
│   ├── __init__.py
│   ├── critic.py               # Critic model evaluation interface
│   └── llm.py                  # LLM evaluation interface
└── benchmark_query/
    ├── benchmark_all.jsonl     # Full dataset (1,239 queries)
    └── requirement/
        ├── style/              # Style-specific subsets
        │   ├── style_subset.jsonl
        │   └── style_subset_C.jsonl
        ├── format/             # Format-specific subsets
        │   ├── format_subset.jsonl
        │   └── format_subset_C.jsonl
        └── length/             # Length-specific subsets
            ├── length_subset.jsonl
            └── length_subset_C.jsonl
- Add your API credentials:
- For LLM-as-a-Judge, see evaluator/llm.py
self.api_key = "your_api_key_here"
self.url = "Your API endpoint"
self.model = "Chose your model name"
- For critic model, see evaluator/critic.py
self.model = LLM(
    model="",  # Local path to the critic model; download it from https://huggingface.co/AQuarterMile/WritingBench-Critic-Model-Qwen-7B
    tensor_parallel_size=1,  # Tensor parallel size; defaults to 1 (no parallelism)
)
- Choose appropriate evaluation sets from benchmark_query/
python evaluate_benchmark.py \
  --evaluator critic \
  --query_criteria_file query_set.jsonl \
  --input_file samples.jsonl \
  --output_file scores.jsonl
Use --evaluator claude instead to score with the LLM-as-a-Judge.
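The resulting scores.jsonl can then be aggregated with a few lines of Python; the per-record field name "score" is an assumption about the output schema, so adjust it to match the actual output of evaluate_benchmark.py.

# Aggregate per-sample scores from the evaluation output (field name assumed).
import json

scores = []
with open("scores.jsonl", encoding="utf-8") as f:
    for line in f:
        record = json.loads(line)
        scores.append(record["score"])  # assumed per-sample score field

print(f"Evaluated {len(scores)} samples, mean score {sum(scores) / len(scores):.2f}")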
@misc{wu2025writingbench,
title={WritingBench: A Comprehensive Benchmark for Generative Writing},
author={Yuning Wu and Jiahao Mei and Ming Yan and Chenliang Li and Shaopeng Lai and Yuran Ren and Zijia Wang and Ji Zhang and Mengyue Wu and Qin Jin and Fei Huang},
year={2025},
url={https://arxiv.org/abs/2503.05244},
}