
priorityjudge

An APO-native agent that grades plans, PRDs, and roadmaps on priority-definition quality — with line-cited evidence and a tightened rewrite. Runs on Gemini, configured for Cursor.

Latest score on a 6-plan, 3-repeat calibration set: 9887 / 10000 (perfect rank correlation with the human grader, 96.5% citation precision, 99.6% test-retest stability). See BENCHMARK.md.

priorityjudge exists because the highest-leverage skill in the AI era is no longer writing the plan: it's deciding which plan, in what order, and why. Cursor's default Claude can write a 50-item roadmap in one shot. priorityjudge reads that roadmap and tells you, with citations, where it stops being a plan and starts being a wishlist.

What it does

Given any markdown plan, priorityjudge writes three files next to it:

| Output | Purpose |
| --- | --- |
| `<plan>.score.json` | Machine-readable scorecard, one entry per rubric dimension. |
| `<plan>.critique.md` | Human-readable critique with verified line-citations. |
| `<plan>.tightened.md` | Rewrite of the plan that fixes the highest-ROI issues. |
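
If you want to consume the scorecard programmatically, a loop like the one below works; the field names here ("dimensions", "name", "points", "total") are assumptions for illustration, not a documented schema, so check a generated file first.

```python
import json
from pathlib import Path

# Illustrative only: the field names below are assumptions, not the documented
# schema -- inspect a real <plan>.score.json before relying on them.
scorecard = json.loads(Path("samples/sample_plan.score.json").read_text())

for dim in scorecard["dimensions"]:
    print(f'{dim["name"]:<22} {dim["points"]:>5} / 2000')
print(f'{"total":<22} {scorecard["total"]:>5} / 10000')
```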

The rubric

Five dimensions, each capped at 2000 points, summing to 10000:

  1. Actionability — could a competent person start work tomorrow?
  2. Priority Clarity — is the order justified? Are trade-offs named?
  3. Risk Coverage — are failure modes and assumptions surfaced?
  4. Measurability — are success criteria observable and falsifiable?
  5. Dependency Awareness — are blockers and sequencing constraints identified?

The full definitions live in src/priorityjudge/rubric.py and are the canonical source — every prompt under prompts/ derives from that file.
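
As a rough sketch of the shape those definitions might take (the names, wording, and structure below are illustrative, not the real module):

```python
from dataclasses import dataclass

# Illustrative sketch only -- the canonical definitions live in
# src/priorityjudge/rubric.py and may be structured differently.
@dataclass(frozen=True)
class Dimension:
    key: str
    question: str
    max_points: int = 2000

RUBRIC = (
    Dimension("actionability", "Could a competent person start work tomorrow?"),
    Dimension("priority_clarity", "Is the order justified? Are trade-offs named?"),
    Dimension("risk_coverage", "Are failure modes and assumptions surfaced?"),
    Dimension("measurability", "Are success criteria observable and falsifiable?"),
    Dimension("dependency_awareness", "Are blockers and sequencing constraints identified?"),
)

assert sum(d.max_points for d in RUBRIC) == 10_000
```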

How it works (the part that beats single-shot Cursor)

plan.md
   |
   v
[1] extractor       structured items + line ranges
   |
   v
[2] scorers x 5     one per rubric dimension, with cited evidence
   |
   v
[3] citer           deterministically verifies every quote
   |
   v
[4] synthesizer     scorecard + critique + tightened rewrite

  • Structured extraction first: scorers operate on a fixed inventory of items, not on free-recall of the document. This is the single biggest defense against the "single-shot judge" failure mode where an LLM hallucinates items the plan never contained.
  • Five independent per-criterion scorers: rubric adherence is forced because each call sees only one dimension's definition. A single-shot judge regresses to vibes; this can't.
  • Deterministic citation verifier: every quote and line-range is re-checked in pure Python after the LLM returns. Hallucinated citations fail the audit, and citation precision is a published component of the 1-10000 metric.
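
To make step [3] concrete, a deterministic quote check can be as small as the function below. This is a minimal sketch of the idea, not the actual citer module: a citation passes only if the quoted text literally appears inside the cited line range of the source plan.

```python
def citation_is_valid(plan_lines: list[str], quote: str, start: int, end: int) -> bool:
    """True only if `quote` appears verbatim within lines start..end (1-indexed). Sketch only."""
    if not (1 <= start <= end <= len(plan_lines)):
        return False  # the cited range points outside the document
    cited_span = " ".join(plan_lines[start - 1 : end])
    # Whitespace-normalise both sides so a quote that spans a line break still matches.
    normalise = lambda s: " ".join(s.split())
    return normalise(quote) in normalise(cited_span)
```

Hallucinated quotes and off-by-one line ranges both fail a check like this, which is what makes citation precision measurable rather than a matter of vibes.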

See BENCHMARK.md for the head-to-head against Cursor's default Claude on a 30-plan calibration set.

Quickstart

git clone <this-repo> priorityjudge
cd priorityjudge

python3 -m venv .venv && source .venv/bin/activate
pip install -e ".[dev]"

cp .env.example .env
# Open .env and paste your Gemini API key into GEMINI_API_KEY

priorityjudge rubric                   # print the canonical rubric
priorityjudge score samples/sample_plan.md

You should see a scorecard table, plus three new files next to samples/sample_plan.md.

Repo layout

priorityjudge/
  README.md                  # you are here
  PROBLEM.md                 # why this problem, why #1
  METRICS.md                 # how the 1-10000 score is calculated
  BENCHMARK.md               # vs Cursor's default Claude
  SELF_REVIEW.md             # 1-page self-review (+ appendix/)
  AGENTS.md                  # behavior guidance for any AI agent on this repo
  .cursor/rules/             # tailored Cursor configuration
  src/priorityjudge/         # extractor, scorers, citer, synthesizer, llm, cli
  prompts/                   # versioned prompt templates
  calibration/               # 10 real + 20 synthetic plans + expected.json
  benchmarks/                # head-to-head harness and results
  tests/                     # rubric integrity, citer, prompt regression, no-secrets
  samples/                   # sample plan to demo on
  appendix/                  # full thought process behind the 1-page review

Performance score (1-10000)

final = 5000 * agreement_with_humans
      + 3000 * citation_precision
      + 2000 * test_retest_stability

All three components are normalised to [0, 1] before weighting. See METRICS.md for the exact calculation, the calibration protocol, and the latest measured score.
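
Plugging in the components reported above (perfect rank agreement, 96.5% citation precision, 99.6% test-retest stability) reproduces the published score, assuming agreement_with_humans maps the rank correlation straight onto [0, 1]; METRICS.md has the exact mapping.

```python
# Components as reported in the latest calibration run (see METRICS.md).
agreement_with_humans = 1.000   # perfect rank correlation with the human grader
citation_precision    = 0.965
test_retest_stability = 0.996

final = (5000 * agreement_with_humans
         + 3000 * citation_precision
         + 2000 * test_retest_stability)
print(round(final))  # 9887
```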

Configuration

| Env var | Default | What it does |
| --- | --- | --- |
| GEMINI_API_KEY | (required) | Google AI Studio key. |
| PRIORITYJUDGE_MODEL | gemini-3.1-pro-preview | Override the model. |
| PRIORITYJUDGE_MAX_TOKENS | 4000 | Output token cap per call. |
| PRIORITYJUDGE_THINKING_BUDGET | 8192 | Extended-thinking budget for 3.x models. Set 0 to disable. |
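
A minimal sketch of how these variables could be read in Python; the actual loading code lives under src/priorityjudge/ and may differ.

```python
import os

# Illustrative only -- the real configuration handling is in src/priorityjudge/.
GEMINI_API_KEY  = os.environ["GEMINI_API_KEY"]  # required; raises KeyError if unset
MODEL           = os.getenv("PRIORITYJUDGE_MODEL", "gemini-3.1-pro-preview")
MAX_TOKENS      = int(os.getenv("PRIORITYJUDGE_MAX_TOKENS", "4000"))
THINKING_BUDGET = int(os.getenv("PRIORITYJUDGE_THINKING_BUDGET", "8192"))  # 0 disables extended thinking
```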

Cursor setup

The repo ships with .cursor/rules/ and a top-level .cursorrules so any Cursor session opened inside this folder picks up project conventions automatically. See AGENTS.md for the behavior contract.

Security

  • API keys are read from environment variables only. .env is gitignored; .env.example is the only env file committed.
  • tests/test_no_secrets.py scans the repo for common secret patterns and fails CI if any match.
  • LLM calls never log key material and never template keys into prompts.
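
A check of that kind can be as small as a single pytest that greps git-tracked files for common key shapes. The sketch below is illustrative; the real tests/test_no_secrets.py may use different patterns and exclusions.

```python
import re
import subprocess
from pathlib import Path

# Sketch only -- the real check is tests/test_no_secrets.py; its patterns may differ.
SECRET_PATTERNS = [
    re.compile(r"AIza[0-9A-Za-z\-_]{35}"),                       # Google API key shape
    re.compile(r"(?i)api[_-]?key\s*=\s*['\"][^'\"]{16,}['\"]"),  # hard-coded key assignment
]

def test_no_secrets_in_tracked_files():
    tracked = subprocess.run(
        ["git", "ls-files"], capture_output=True, text=True, check=True
    ).stdout.splitlines()
    for rel in tracked:
        if rel == ".env.example":
            continue  # placeholder values are expected here
        text = Path(rel).read_text(errors="ignore")
        for pattern in SECRET_PATTERNS:
            assert not pattern.search(text), f"possible secret in {rel}"
```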

License

MIT.
