This repository contains the official implementation of Rulers (Rubric Unification, Locking, and Evidence-anchored Robust Scoring), a framework designed to align frozen LLM judges with human grading standards.
The framework reframes evaluation not as an open-ended generation task, but as a Compiler-Executor protocol. By locking rubrics into executable checklists and enforcing verbatim evidence extraction, Rulers significantly improves stability and human agreement across diverse tasks.
The Rulers pipeline consists of three distinct phases designed to eliminate stochasticity and enforce auditability:
- Phase I: Rubric Unification (The Compiler)
  - Compiles natural language guidelines into a Locked Rubric Bundle.
  - Freezes criteria definitions into a fixed taxonomy and deterministic checklist to eliminate interpretation drift.
  - Output: A hashed, immutable JSON specification.
- Phase II: Evidence-Anchored Scoring (The Executor)
  - The model executes the checklist under a strict constraint: every high score must be supported by `MIN_EV` verbatim quotes extracted from the input (see the sketch after this list).
  - Includes deterministic verification to prevent hallucinated justifications and anti-halo boundary checks.
  - Output: Structured evidence logs and raw checklist decisions.
- Phase III: Robust Scoring Alignment (WGR)
  - Applies Wasserstein Generative Regression (WGR) to map the model's internal score distribution to the specific granularity of human labels (e.g., 0.5-step intervals) without fine-tuning model parameters.
  - Output: Final calibrated scores aligned with human distributions.
We provide three standalone, standardized scripts. Each script is self-contained and tailored to a specific dataset's characteristics while sharing the core Rulers architecture.
| Script Name | Target Dataset | Domain | Key Traits |
|---|---|---|---|
| `rulers_asap_standardized.py` | ASAP 2.0 | Argumentative Writing | Content, Evidence, Organization, Language |
| `rulers_summ_standardized.py` | SummHF | Text Summarization | Coherence, Accuracy, Coverage |
| `rulers_dress_standardized.py` | DREsS | EFL Student Writing | Content, Organization, Language (0.5-step precision) |
This code is benchmarked on the three datasets discussed in the paper:
- ASAP 2.0 (Argumentative Essay Scoring)
  - Type: Student argumentative writing.
  - Focus: Structural and linguistic quality evaluation.
  - Setting: Standard 1-6 holistic or multi-trait scoring.
- SummHF (Summarization Quality)
  - Type: Summaries derived from human feedback (OpenAI).
  - Focus: Hallucination detection and factual consistency in high-compression texts.
  - Setting: 1-7 Likert scale.
- DREsS (Deep Rubric for EFL Student Scoring)
  - Type: EFL (English as a Foreign Language) student essays.
  - Focus: Large-scale rubric-based assessment of classroom essays scored by experts.
  - Setting: High-precision multi-trait scoring (Content, Organization, Language) with 0.5-point increments (scale 1.0–5.0).
```bash
pip install openai pandas numpy scikit-learn scipy
```

The scripts are designed to be run independently. You must provide your OpenAI API key and the path to your data.
Note on Hyperparameters: The default values for `checklist_n` (checklist size) and `wgr_alpha` (calibration strength) in the scripts are generic placeholders. As noted in the paper, these should be tuned on your specific validation set to achieve optimal alignment.
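One possible way to do that tuning is sketched below. It assumes a wrapper `run_rulers` that scores a validation split for a given `wgr_alpha` (this is a placeholder, not a function exported by this repository) and measures human agreement with quadratically weighted kappa from scikit-learn.

```python
# Hypothetical tuning loop for wgr_alpha on a held-out validation split.
# `run_rulers` stands in for your own wrapper around one of the scripts.
import numpy as np
from sklearn.metrics import cohen_kappa_score

def qwk(human, predicted, step=1.0):
    """Quadratically weighted kappa on scores discretized to the rubric step
    (e.g. step=0.5 for DREsS)."""
    to_bins = lambda s: np.round(np.asarray(s, dtype=float) / step).astype(int)
    return cohen_kappa_score(to_bins(human), to_bins(predicted), weights="quadratic")

def tune_wgr_alpha(run_rulers, val_texts, val_human_scores, alphas=(0.5, 1.0, 2.5, 5.0)):
    """Return (best_alpha, all_scores) by maximizing QWK against human labels."""
    scores = {a: qwk(val_human_scores, run_rulers(val_texts, wgr_alpha=a)) for a in alphas}
    return max(scores, key=scores.get), scores
```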
Targeting structural argumentation with strict evidence requirements.
```bash
python rulers_asap_standardized.py \
  --data_path ./data/asap_test.csv \
  --api_key "YOUR_OPENAI_API_KEY" \
  --model gpt-4o \
  --checklist_n 15 \
  --min_ev 2 \
  --wgr_alpha 1.0
```

Targeting factual consistency with a lower evidence threshold (due to short text length).
```bash
python rulers_summ_standardized.py \
  --data_path ./data/summhf_test.jsonl \
  --api_key "YOUR_OPENAI_API_KEY" \
  --model gpt-4o \
  --checklist_n 12 \
  --min_ev 1 \
  --wgr_alpha 2.5
```

Targeting high-precision (0.5-step) scoring for language learners.
```bash
python rulers_dress_standardized.py \
  --data_path ./data/dress_full.tsv \
  --api_key "YOUR_OPENAI_API_KEY" \
  --model gpt-4o \
  --checklist_n 18 \
  --min_ev 2 \
  --wgr_alpha 0.5
```

Upon execution, the scripts will generate the following outputs, corresponding to the three phases of the Rulers framework:
- The Locked Rubric Bundle (Phase I)
  - Console Output: `[*] Bundle Locked. Hash: a1b2c3...`
  - Content: A JSON object containing the compiled taxonomy and the deterministic checklist (e.g., `C01` to `C15`).
  - Significance: This hash guarantees reproducibility. Any change in the prompt or model interpretation would alter this hash.
- Evidence-Anchored Logs (Phase II)
  - Console Output: `[*] Phase II Scoring started...`
  - Content: A structured log for every input sample (an illustrative entry is sketched after this list), containing:
    - `checklist_decisions`: Binary/ternary vectors for each checklist item.
    - `evidence_quotes`: Verbatim substrings extracted from the text to support scores (enforcing the Evidence Support constraint).
    - `boundary_justification`: A compact rationale for why the score is not higher or lower.
- Calibrated Scores (Phase III)
  - Console Output: `[*] Calibration complete. Final scores generated.`
  - Content: A final set of scores processed by the Wasserstein Generative Regression (WGR) engine.
  - Significance: These scores are distributionally aligned with human grading standards (e.g., matching the 0.5-step granularity of the DREsS dataset) without fine-tuning the model parameters.
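For orientation, a single Phase II log entry might look roughly like the following. The field names come from the list above; the sample identifier, item IDs, and values are invented for illustration and do not reflect the scripts' exact output schema.

```python
# Hypothetical Phase II log entry. Field names follow the description above;
# the sample_id key, checklist item IDs, and values are illustrative only.
sample_log = {
    "sample_id": "essay_0001",
    "checklist_decisions": {"C01": 1, "C02": 0},   # binary/ternary per checklist item
    "evidence_quotes": {
        "C01": ["The author argues that school uniforms reduce distraction."],
    },                                             # verbatim substrings of the input
    "boundary_justification": "Satisfies C01 but misses C02, so the score stays below the top band.",
}
```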
- `--checklist_n`: The number of granular checklist items generated in Phase I.
  - Tip: Complex rubrics (like DREsS) often benefit from a higher N (e.g., 18-20), while simpler tasks may require fewer.
- `--min_ev`: The minimum number of verbatim quotes required to support a high checklist decision.
  - Default: `2` for essays, `1` for short summaries.
- `--wgr_alpha`: The regularization strength for the WGR calibration layer.
  - Tip: Adjust this to prevent overfitting when fitting the calibrator on small subsets.
If you use this codebase, any part of the RULERS framework, or derived artifacts (including rubric bundles, evidence-anchored scoring logs, or calibration procedures) in your research, please cite our paper:
@article{hong2026rulers,
title = {RULERS: Locked Rubrics and Evidence-Anchored Scoring for Robust LLM Evaluation},
author = {Hong, Yihan and Yao, Huaiyuan and Shen, Bolin and Xu, Wanpeng and Wei, Hua and Dong, Yushun},
journal = {arXiv preprint arXiv:2601.08654},
year = {2026},
archivePrefix = {arXiv},
eprint = {2601.08654},
primaryClass = {cs.CL},
doi = {10.48550/arXiv.2601.08654},
url = {https://arxiv.org/abs/2601.08654}
}