
priorityjudge

An APO-native agent that grades plans, PRDs, and roadmaps on priority-definition quality — with line-cited evidence and a tightened rewrite. Runs on Gemini, configured for Cursor.

Latest score on a 6-plan, 3-repeat calibration set: 9887 / 10000 (perfect rank correlation with the human grader, 96.5% citation precision, 99.6% test-retest stability). See BENCHMARK.md.

priorityjudge exists because the highest-leverage skill in the AI era is no longer writing the plan: it's deciding which plan, in what order, and why. Cursor's default Claude can write a 50-item roadmap in one shot. priorityjudge reads that roadmap and tells you, with citations, where it stops being a plan and starts being a wishlist.

What it does

Given any markdown plan, priorityjudge writes three files next to it:

| Output | Purpose |
| --- | --- |
| `<plan>.score.json` | Machine-readable scorecard, one entry per rubric dimension. |
| `<plan>.critique.md` | Human-readable critique with verified line-citations. |
| `<plan>.tightened.md` | Rewrite of the plan that fixes the highest-ROI issues. |
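
If you want to consume the scorecard programmatically, a loop like the one below works; the field names here ("dimensions", "name", "points", "total") are assumptions for illustration, not a documented schema, so check a generated file first.

```python
import json
from pathlib import Path

# Illustrative only: the field names below are assumptions, not the documented
# schema -- inspect a real <plan>.score.json before relying on them.
scorecard = json.loads(Path("samples/sample_plan.score.json").read_text())

for dim in scorecard["dimensions"]:
    print(f'{dim["name"]:<22} {dim["points"]:>5} / 2000')
print(f'{"total":<22} {scorecard["total"]:>5} / 10000')
```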

The rubric

Five dimensions, each capped at 2000 points, summing to 10000:

  1. Actionability — could a competent person start work tomorrow?
  2. Priority Clarity — is the order justified? Are trade-offs named?
  3. Risk Coverage — are failure modes and assumptions surfaced?
  4. Measurability — are success criteria observable and falsifiable?
  5. Dependency Awareness — are blockers and sequencing constraints identified?

The full definitions live in src/priorityjudge/rubric.py and are the canonical source — every prompt under prompts/ derives from that file.
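
As a rough sketch of the shape those definitions might take (the names, wording, and structure below are illustrative, not the real module):

```python
from dataclasses import dataclass

# Illustrative sketch only -- the canonical definitions live in
# src/priorityjudge/rubric.py and may be structured differently.
@dataclass(frozen=True)
class Dimension:
    key: str
    question: str
    max_points: int = 2000

RUBRIC = (
    Dimension("actionability", "Could a competent person start work tomorrow?"),
    Dimension("priority_clarity", "Is the order justified? Are trade-offs named?"),
    Dimension("risk_coverage", "Are failure modes and assumptions surfaced?"),
    Dimension("measurability", "Are success criteria observable and falsifiable?"),
    Dimension("dependency_awareness", "Are blockers and sequencing constraints identified?"),
)

assert sum(d.max_points for d in RUBRIC) == 10_000
```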

How it works (the part that beats single-shot Cursor)

plan.md
   |
   v
[1] extractor       structured items + line ranges
   |
   v
[2] scorers x 5     one per rubric dimension, with cited evidence
   |
   v
[3] citer           deterministically verifies every quote
   |
   v
[4] synthesizer     scorecard + critique + tightened rewrite

  • Structured extraction first: scorers operate on a fixed inventory of items, not on free-recall of the document. This is the single biggest defense against the "single-shot judge" failure mode where an LLM hallucinates items the plan never contained.
  • Five independent per-criterion scorers: rubric adherence is forced because each call sees only one dimension's definition. A single-shot judge regresses to vibes; this can't.
  • Deterministic citation verifier: every quote and line-range is re-checked in pure Python after the LLM returns. Hallucinated citations fail the audit, and citation precision is a published component of the 1-10000 metric.
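
To make step [3] concrete, a deterministic quote check can be as small as the function below. This is a minimal sketch of the idea, not the actual citer module: a citation passes only if the quoted text literally appears inside the cited line range of the source plan.

```python
def citation_is_valid(plan_lines: list[str], quote: str, start: int, end: int) -> bool:
    """True only if `quote` appears verbatim within lines start..end (1-indexed). Sketch only."""
    if not (1 <= start <= end <= len(plan_lines)):
        return False  # the cited range points outside the document
    cited_span = " ".join(plan_lines[start - 1 : end])
    # Whitespace-normalise both sides so a quote that spans a line break still matches.
    normalise = lambda s: " ".join(s.split())
    return normalise(quote) in normalise(cited_span)
```

Hallucinated quotes and off-by-one line ranges both fail a check like this, which is what makes citation precision measurable rather than a matter of vibes.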

See BENCHMARK.md for the head-to-head against Cursor's default Claude on a 30-plan calibration set.

Quickstart

git clone <this-repo> priorityjudge
cd priorityjudge

python3 -m venv .venv && source .venv/bin/activate
pip install -e ".[dev]"

cp .env.example .env
# Open .env and paste your Gemini API key into GEMINI_API_KEY

priorityjudge rubric                   # print the canonical rubric
priorityjudge score samples/sample_plan.md

You should see a scorecard table, plus three new files next to samples/sample_plan.md.

Repo layout

priorityjudge/
  README.md                  # you are here
  PROBLEM.md                 # why this problem, why #1
  METRICS.md                 # how the 1-10000 score is calculated
  BENCHMARK.md               # vs Cursor's default Claude
  SELF_REVIEW.md             # 1-page self-review (+ appendix/)
  AGENTS.md                  # behavior guidance for any AI agent on this repo
  .cursor/rules/             # tailored Cursor configuration
  src/priorityjudge/         # extractor, scorers, citer, synthesizer, llm, cli
  prompts/                   # versioned prompt templates
  calibration/               # 10 real + 20 synthetic plans + expected.json
  benchmarks/                # head-to-head harness and results
  tests/                     # rubric integrity, citer, prompt regression, no-secrets
  samples/                   # sample plan to demo on
  appendix/                  # full thought process behind the 1-page review

Performance score (1-10000)

final = 5000 * agreement_with_humans
      + 3000 * citation_precision
      + 2000 * test_retest_stability

All three components are normalised to [0, 1] before weighting. See METRICS.md for the exact calculation, the calibration protocol, and the latest measured score.
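
Plugging in the components reported above (perfect rank agreement, 96.5% citation precision, 99.6% test-retest stability) reproduces the published score, assuming agreement_with_humans maps the rank correlation straight onto [0, 1]; METRICS.md has the exact mapping.

```python
# Components as reported in the latest calibration run (see METRICS.md).
agreement_with_humans = 1.000   # perfect rank correlation with the human grader
citation_precision    = 0.965
test_retest_stability = 0.996

final = (5000 * agreement_with_humans
         + 3000 * citation_precision
         + 2000 * test_retest_stability)
print(round(final))  # 9887
```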

Configuration

| Env var | Default | What it does |
| --- | --- | --- |
| GEMINI_API_KEY | (required) | Google AI Studio key. |
| PRIORITYJUDGE_MODEL | gemini-3.1-pro-preview | Override the model. |
| PRIORITYJUDGE_MAX_TOKENS | 4000 | Output token cap per call. |
| PRIORITYJUDGE_THINKING_BUDGET | 8192 | Extended-thinking budget for 3.x models. Set 0 to disable. |
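
A minimal sketch of how these variables could be read in Python; the actual loading code lives under src/priorityjudge/ and may differ.

```python
import os

# Illustrative only -- the real configuration handling is in src/priorityjudge/.
GEMINI_API_KEY  = os.environ["GEMINI_API_KEY"]  # required; raises KeyError if unset
MODEL           = os.getenv("PRIORITYJUDGE_MODEL", "gemini-3.1-pro-preview")
MAX_TOKENS      = int(os.getenv("PRIORITYJUDGE_MAX_TOKENS", "4000"))
THINKING_BUDGET = int(os.getenv("PRIORITYJUDGE_THINKING_BUDGET", "8192"))  # 0 disables extended thinking
```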

Cursor setup

The repo ships with .cursor/rules/ and a top-level .cursorrules so any Cursor session opened inside this folder picks up project conventions automatically. See AGENTS.md for the behavior contract.

Security

  • API keys are read from environment variables only. .env is gitignored; .env.example is the only env file committed.
  • tests/test_no_secrets.py scans the repo for common secret patterns and fails CI if any match.
  • LLM calls never log key material and never template keys into prompts.
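
A check of that kind can be as small as a single pytest that greps git-tracked files for common key shapes. The sketch below is illustrative; the real tests/test_no_secrets.py may use different patterns and exclusions.

```python
import re
import subprocess
from pathlib import Path

# Sketch only -- the real check is tests/test_no_secrets.py; its patterns may differ.
SECRET_PATTERNS = [
    re.compile(r"AIza[0-9A-Za-z\-_]{35}"),                       # Google API key shape
    re.compile(r"(?i)api[_-]?key\s*=\s*['\"][^'\"]{16,}['\"]"),  # hard-coded key assignment
]

def test_no_secrets_in_tracked_files():
    tracked = subprocess.run(
        ["git", "ls-files"], capture_output=True, text=True, check=True
    ).stdout.splitlines()
    for rel in tracked:
        if rel == ".env.example":
            continue  # placeholder values are expected here
        text = Path(rel).read_text(errors="ignore")
        for pattern in SECRET_PATTERNS:
            assert not pattern.search(text), f"possible secret in {rel}"
```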

License

MIT.
