WritingBench: A Comprehensive Benchmark for Generative Writing

📃 [Paper] • 🚀 [Github Repo] • 🏆 [Leaderboard] • 📏 [Critic Model] • ✍️ [Writing Model]

🚀 What's New

`2025-04-29`

🏆 Leaderboard Launch: Explore evaluation results on the Hugging Face Leaderboard and the ModelScope Leaderboard. Evaluations of the latest LLMs (Claude-3-7-Sonnet, o3, grok-3, etc.) have been added.
- Parameters for response generation: top_p: 0.8; top_k: 20; temperature: 0.7; max_length: 16000 (or maximum allowed if less than 16000)
- Parameters for scoring: top_p: 0.95; top_k: (empty); temperature: 1.0; max_length: 2048
- Leaderboard scores are converted from the 10-point scale to a 100-point scale by multiplying by 10 for easier viewing.
‼️ Updated benchmark queries and criteria for improved assessment; the benchmark now includes 1,000 queries and requirement dimension subsets.
‼️ Updated the evaluation prompt for better scoring, and switched to Claude-3-7-Sonnet for evaluation.
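
The settings listed above can be collected into a small configuration sketch. The dictionary names and the `to_leaderboard_scale` helper are hypothetical and do not come from the repository; only the parameter values and the multiply-by-10 rule are taken from the notes above.

```python
# Hypothetical names; only the values come from the release notes above.
GENERATION_PARAMS = {
    "top_p": 0.8,
    "top_k": 20,
    "temperature": 0.7,
    "max_length": 16000,  # or the model's maximum if it is lower
}
SCORING_PARAMS = {
    "top_p": 0.95,
    "top_k": None,  # left empty in the release notes
    "temperature": 1.0,
    "max_length": 2048,
}


def to_leaderboard_scale(score: float) -> float:
    """Convert a 10-point score to the 100-point leaderboard scale."""
    return score * 10


print(to_leaderboard_scale(7.5))  # 75.0
```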

`2025-03-10`

We release the first version of WritingBench, including 1,239 writing queries and style/format/length dimension subsets.

📖 Overview

WritingBench is a comprehensive benchmark for evaluating LLMs' writing capabilities across 1,000 real-world queries, spanning:

6 primary domains
100 fine-grained subdomains
1,500+ avg. tokens per query

WritingBench integrates diverse sources of materials. Each query is paired with 5 instance-specific criteria, and responses are scored against them either by an LLM evaluator or by a fine-tuned critic model.

🏗️ Benchmark Construction

WritingBench is built through a hybrid pipeline combining Model-Augmented Query Generation and Human-in-the-Loop Refinement, ensuring both diversity and real-world applicability. The construction process involves two key phases:

🤖 Model-Augmented Query Generation

Phase 1: Initial Query Generation

Leverage LLMs to generate queries from a two-tiered domain pool grounded in real-world writing scenarios, consisting of 6 primary domains and 100 secondary subdomains, covering:

🔬 Academic & Engineering
💼 Finance & Business
⚖️ Politics & Law
🎨 Literature & Art
🎓 Education
📢 Advertising & Marketing

Phase 2: Query Diversification

Enhance the diversity and practical applicability of queries by applying randomly selected strategies from the Query Refinement Guidance Pool, covering:

Style Adjustments (e.g., kid-friendly tone)
Format Specifications (e.g., IEEE template)
Length Constraints (e.g., 500-word summary)
Personalization (e.g., educator's perspective)
Content Specificity (e.g., 2023 Q3 metrics)
Expression Optimization (query rewriting)
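
As a rough illustration of random strategy selection, the sketch below draws a few distinct strategies from the pool listed above. The function name, the list representation of the pool, and the number of strategies drawn per query are all assumptions; the source does not specify how many strategies are applied to each query.

```python
import random

# Strategy names mirror the list above; storing them as a flat list is an assumption.
REFINEMENT_POOL = [
    "Style Adjustments",
    "Format Specifications",
    "Length Constraints",
    "Personalization",
    "Content Specificity",
    "Expression Optimization",
]


def pick_strategies(rng: random.Random, max_strategies: int = 3) -> list[str]:
    """Pick between 1 and max_strategies distinct strategies at random."""
    upper = min(max_strategies, len(REFINEMENT_POOL))
    count = rng.randint(1, upper)
    return rng.sample(REFINEMENT_POOL, count)


# A seeded generator keeps the selection reproducible across runs.
print(pick_strategies(random.Random(0)))
```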

✍️ Human-in-the-Loop Refinement

Phase 1: Material Collection

30 trained annotators collect necessary open-source materials (e.g., public financial statements or legal templates), guided by material requirements generated by LLMs.

Phase 2: Expert Screening & Optimization

5 experts conduct a careful two-stage filtering process:

query adaptation: ambiguous or unrealistic queries are revised to better align with the provided materials and practical scenarios
material pruning: redundant or irrelevant content is eliminated from the collected materials

📈 Evaluation Framework

Phase 1: Dynamic Criteria Generation

Given a query $q$ in WritingBench, the LLM is prompted to automatically generate a set of five evaluation criteria, $C_q = \{c_1, \ldots, c_5\}$. Each criterion comprises three components: a concise name summarizing the criterion, an extended description elaborating on the evaluation focus, and detailed scoring rubrics.
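
One way to picture a criterion set is as five small records with the three components described above. This is only a sketch: the class name, field names, and validation helper are hypothetical and may not match the fields used in the benchmark files.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    name: str         # concise name summarizing the criterion
    description: str  # extended description of the evaluation focus
    rubric: str       # detailed scoring rubrics


def validate_criteria(criteria: list[Criterion]) -> list[Criterion]:
    """Check that a query carries exactly five fully populated criteria."""
    if len(criteria) != 5:
        raise ValueError(f"expected 5 criteria, got {len(criteria)}")
    for criterion in criteria:
        if not (criterion.name and criterion.description and criterion.rubric):
            raise ValueError(f"incomplete criterion: {criterion!r}")
    return criteria


# Placeholder content, for illustration only.
example = [
    Criterion(f"criterion {i}", "what to look for", "how to score it")
    for i in range(1, 6)
]
print(len(validate_criteria(example)))  # 5
```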

Phase 2: Rubric-based Scoring

For each criterion $c_i \in C_q$, the evaluator independently scores a response $r$ on a 10-point scale and provides a justification for the score.
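
The sketch below shows one plausible way to combine the per-criterion results into a single score for a response. Averaging the five scores and treating the scale as the integers 1 to 10 are assumptions made for illustration; the names `CriterionScore` and `response_score` are hypothetical.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class CriterionScore:
    score: int          # assumed to lie in 1..10
    justification: str  # evaluator's explanation for the score


def response_score(results: list[CriterionScore]) -> float:
    """Average the per-criterion scores (averaging is an assumption)."""
    if not results:
        raise ValueError("no criterion scores provided")
    for result in results:
        if not 1 <= result.score <= 10:
            raise ValueError(f"score out of range: {result.score}")
    return mean(result.score for result in results)


# Placeholder scores, for illustration only.
results = [CriterionScore(s, "placeholder") for s in (8, 7, 9, 6, 8)]
print(response_score(results))  # 7.6
```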

🛠 Installation

git clone https://github.com/X-PLUG/WritingBench.git

📂 Repository Structure

.
├── evaluate_benchmark.py     # Evaluation script
├── prompt.py                 # Prompt templates
├── evaluator/
│   ├── __int__.py
│   ├── critic.py             # Critic model evaluation interface
│   └── llm.py                # LLM evaluation interface
└── benchmark_query/
    ├── benchmark_all.jsonl   # Full dataset (1,000 queries)
    └── requirement/
        ├── style/           
        │   ├── style_subset.jsonl    # requirement-involved subset for style
        │   └── style_subset_C.jsonl  # category-specific subset for style
        ├── format/          
        │   ├── format_subset.jsonl    # requirement-involved subset for format
        │   └── format_subset_C.jsonl  # category-specific subset for format
        └── length/         
            ├── length_subset.jsonl    # requirement-involved subset for length
            └── length_subset_C.jsonl  # category-specific subset for length
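
The query files are stored as JSON Lines, so they can be read one record per line. The helper below is a minimal sketch with a hypothetical name (`load_jsonl`); it makes no assumption about which fields each record contains.

```python
import json
from pathlib import Path


def load_jsonl(path: str) -> list[dict]:
    """Read a JSON Lines file into a list of dicts, skipping blank lines."""
    records = []
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records
```

For example, `load_jsonl("benchmark_query/benchmark_all.jsonl")` would return the full query set when run from the repository root.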

🚀 Quick Start

Add your API credentials:

For LLM-as-a-Judge, see evaluator/llm.py. We recommend using Claude-3-7-Sonnet for evaluation.

  self.api_key = "your_api_key_here"
  self.url = "Your API endpoint"
  self.model = "Choose your model name"

For the critic model, see evaluator/critic.py.

  self.model = LLM(
      model="", # Your local path. Please download critic model from https://huggingface.co/AQuarterMile/WritingBench-Critic-Model-Qwen-7B.
      tensor_parallel_size=1, # Your tensor parallel size setting. Defaults to 1, indicating no parallelism
  )

Choose appropriate evaluation sets from benchmark_query/

# --evaluator accepts critic or claude
# --query_criteria_file takes a file under benchmark_query/
python evaluate_benchmark.py \
  --evaluator critic \
  --query_criteria_file query_set.jsonl \
  --input_file samples.jsonl \
  --output_file scores.jsonl

An example of samples.jsonl used to store responses generated by the evaluated LLMs:

{"index": i, "response": "xxx"}
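
A file in this shape can be produced with a few lines of Python. The sketch below assumes that `index` identifies the benchmark query each response answers; the function name `write_samples` is hypothetical.

```python
import json


def write_samples(responses: dict[int, str], path: str) -> None:
    """Write one {"index": ..., "response": ...} object per line."""
    with open(path, "w", encoding="utf-8") as handle:
        for index, response in responses.items():
            record = {"index": index, "response": response}
            # ensure_ascii=False keeps non-English responses readable.
            handle.write(json.dumps(record, ensure_ascii=False) + "\n")
```

Calling `write_samples({1: "first response", 2: "second response"}, "samples.jsonl")` writes a two-line file that can be passed to `--input_file`.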

📝 Citation

@misc{wu2025writingbench,
      title={WritingBench: A Comprehensive Benchmark for Generative Writing}, 
      author={Yuning Wu and Jiahao Mei and Ming Yan and Chenliang Li and Shaopeng Lai and Yuran Ren and Zijia Wang and Ji Zhang and Mengyue Wu and Qin Jin and Fei Huang},
      year={2025},
      url={https://arxiv.org/abs/2503.05244}, 
}
