An APO-native agent that grades plans, PRDs, and roadmaps on priority-definition quality — with line-cited evidence and a tightened rewrite. Runs on Gemini, configured for Cursor.
Latest score on a 6-plan, 3-repeat calibration set:
9887 / 10000 (perfect rank correlation with the human grader, 96.5% citation precision, 99.6% test-retest stability). See BENCHMARK.md.
priorityjudge exists because the highest-leverage skill in the AI era is no
longer writing the plan — it's deciding which plan, in what order, and
why. Cursor's default Claude can write a 50-item roadmap in one shot.
priorityjudge reads that roadmap and tells you, with citations, where it
stops being a plan and starts being a wishlist.
Given any markdown plan, priorityjudge writes three files next to it:
| Output | Purpose |
|---|---|
| `<plan>.score.json` | Machine-readable scorecard, one entry per rubric dimension. |
| `<plan>.critique.md` | Human-readable critique with verified line citations. |
| `<plan>.tightened.md` | Rewrite of the plan that fixes the highest-ROI issues. |
Five dimensions, each capped at 2000 points, summing to 10000:
- Actionability — could a competent person start work tomorrow?
- Priority Clarity — is the order justified? Are trade-offs named?
- Risk Coverage — are failure modes and assumptions surfaced?
- Measurability — are success criteria observable and falsifiable?
- Dependency Awareness — are blockers and sequencing constraints identified?
The full definitions live in src/priorityjudge/rubric.py
and are the canonical source — every prompt under prompts/ derives from
that file.
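As a rough illustration only (the canonical definitions live in `src/priorityjudge/rubric.py`; the names and structure below are hypothetical), the rubric can be thought of as five fixed dimensions with equal point caps:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Dimension:
    """One rubric dimension with its guiding question and point cap."""
    name: str
    question: str
    max_points: int = 2000

# Hypothetical sketch, not the real module contents.
RUBRIC = [
    Dimension("actionability", "Could a competent person start work tomorrow?"),
    Dimension("priority_clarity", "Is the order justified? Are trade-offs named?"),
    Dimension("risk_coverage", "Are failure modes and assumptions surfaced?"),
    Dimension("measurability", "Are success criteria observable and falsifiable?"),
    Dimension("dependency_awareness", "Are blockers and sequencing constraints identified?"),
]

# The five caps sum to the 10000-point ceiling.
assert sum(d.max_points for d in RUBRIC) == 10000
```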
```
plan.md
    |
    v
[1] extractor     structured items + line ranges
    |
    v
[2] scorers x 5   one per rubric dimension, with cited evidence
    |
    v
[3] citer         deterministically verifies every quote
    |
    v
[4] synthesizer   scorecard + critique + tightened rewrite
```
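The four stages compose like a straight pipeline. A minimal sketch, with hypothetical stand-in functions rather than the real priorityjudge API:

```python
def run_pipeline(plan_text, extract, score_one, verify, synthesize, dimensions):
    """Sketch of the four-stage flow. All stage functions are injected
    stand-ins here, not priorityjudge's actual internals."""
    items = extract(plan_text)                                # [1] structured items
    scores = {d: score_one(d, items) for d in dimensions}     # [2] one call per dimension
    verified = {d: verify(s, plan_text) for d, s in scores.items()}  # [3] audit quotes
    return synthesize(verified)                               # [4] combine outputs
```

The key design point is that stage [2] fans out into one scorer per dimension, and stage [3] runs after every LLM call returns.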
- Structured extraction first: scorers operate on a fixed inventory of items, not on free-recall of the document. This is the single biggest defense against the "single-shot judge" failure mode where an LLM hallucinates items the plan never contained.
- Five independent per-criterion scorers: rubric adherence is forced because each call sees only one dimension's definition. A single-shot judge regresses to vibes; this can't.
- Deterministic citation verifier: every quote and line-range is re-checked in pure Python after the LLM returns. Hallucinated citations fail the audit, and citation precision is a published component of the 1-10000 metric.
See BENCHMARK.md for the head-to-head against Cursor's default Claude on a 30-plan calibration set.
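The deterministic citation check needs no LLM at all. A minimal sketch of the idea, assuming 1-indexed inclusive line ranges (the real citer may differ):

```python
def verify_citation(plan_lines, quote, start, end):
    """Return True if `quote` appears verbatim within lines
    start..end (1-indexed, inclusive) of the plan. Pure Python."""
    if not (1 <= start <= end <= len(plan_lines)):
        return False  # out-of-range citations fail the audit
    window = "\n".join(plan_lines[start - 1:end])
    return quote in window
```

A scorer that cites a quote the plan never contained, or points at the wrong lines, fails this check and is rejected.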
```bash
git clone <this-repo> priorityjudge
cd priorityjudge
python3 -m venv .venv && source .venv/bin/activate
pip install -e ".[dev]"
cp .env.example .env
# Open .env and paste your Gemini API key into GEMINI_API_KEY

priorityjudge rubric                         # print the canonical rubric
priorityjudge score samples/sample_plan.md
```

You should see a scorecard table, plus three new files next to `samples/sample_plan.md`.
```
priorityjudge/
  README.md            # you are here
  PROBLEM.md           # why this problem, why #1
  METRICS.md           # how the 1-10000 score is calculated
  BENCHMARK.md         # vs Cursor's default Claude
  SELF_REVIEW.md       # 1-page self-review (+ appendix/)
  AGENTS.md            # behavior guidance for any AI agent on this repo
  .cursor/rules/       # tailored Cursor configuration
  src/priorityjudge/   # extractor, scorers, citer, synthesizer, llm, cli
  prompts/             # versioned prompt templates
  calibration/         # 10 real + 20 synthetic plans + expected.json
  benchmarks/          # head-to-head harness and results
  tests/               # rubric integrity, citer, prompt regression, no-secrets
  samples/             # sample plan to demo on
  appendix/            # full thought process behind the 1-page review
```
```
final = 5000 * agreement_with_humans
      + 3000 * citation_precision
      + 2000 * test_retest_stability
```
All three components are normalised to [0, 1] before weighting. See
METRICS.md for the exact calculation, the calibration protocol,
and the latest measured score.
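The weighting is simple enough to sketch directly (a hypothetical helper, not the METRICS.md implementation):

```python
def final_score(agreement, precision, stability):
    """Combine the three normalised components into the 0-10000 score."""
    for c in (agreement, precision, stability):
        if not 0.0 <= c <= 1.0:
            raise ValueError("components must be normalised to [0, 1]")
    return round(5000 * agreement + 3000 * precision + 2000 * stability)
```

Plugging in the headline numbers from the top of this README (1.0 agreement, 0.965 citation precision, 0.996 stability) reproduces the reported 9887.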
| Env var | Default | What it does |
|---|---|---|
| `GEMINI_API_KEY` | (required) | Google AI Studio key. |
| `PRIORITYJUDGE_MODEL` | `gemini-3.1-pro-preview` | Override the model. |
| `PRIORITYJUDGE_MAX_TOKENS` | `4000` | Output token cap per call. |
| `PRIORITYJUDGE_THINKING_BUDGET` | `8192` | Extended-thinking budget for 3.x models. Set `0` to disable. |
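One plausible way these variables get read at startup (a hypothetical `load_settings` sketch; the real config code may differ):

```python
import os

def load_settings():
    """Read priorityjudge configuration from the environment,
    applying the documented defaults."""
    key = os.environ.get("GEMINI_API_KEY")
    if not key:
        raise RuntimeError("GEMINI_API_KEY is required; see .env.example")
    return {
        "api_key": key,
        "model": os.environ.get("PRIORITYJUDGE_MODEL", "gemini-3.1-pro-preview"),
        "max_tokens": int(os.environ.get("PRIORITYJUDGE_MAX_TOKENS", "4000")),
        "thinking_budget": int(os.environ.get("PRIORITYJUDGE_THINKING_BUDGET", "8192")),
    }
```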
The repo ships with .cursor/rules/ and a top-level .cursorrules so any
Cursor session opened inside this folder picks up project conventions
automatically. See AGENTS.md for the behavior contract.
- API keys are read from environment variables only.
- `.env` is gitignored; `.env.example` is the only env file committed.
- `tests/test_no_secrets.py` scans the repo for common secret patterns and fails CI if any match.
- LLM calls never log key material and never template keys into prompts.
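The secret scan boils down to pattern matching over file contents. A minimal sketch, with illustrative patterns that are assumptions rather than the actual contents of `tests/test_no_secrets.py`:

```python
import re

# Hypothetical patterns; the real test may check more shapes.
SECRET_PATTERNS = [
    re.compile(r"AIza[0-9A-Za-z_\-]{35}"),                      # Google API key shape
    re.compile(r"(?i)api[_-]?key\s*=\s*['\"][^'\"]{16,}['\"]"), # inlined key literal
]

def scan_text(text):
    """Return True if any secret-like pattern appears in `text`."""
    return any(p.search(text) for p in SECRET_PATTERNS)
```

A placeholder like `GEMINI_API_KEY=` with no literal value passes; a hard-coded key fails the build.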
MIT.