Did the agent actually build the spec?
spec-verify takes the original spec/requirements document an AI agent was given
and the finished codebase it produced, and emits a per-criterion
PASS / FAIL / UNVERIFIABLE verdict for every acceptance criterion. It's a
CI-friendly acceptance gate for AI-built software: wire it into a pipeline and it
exits non-zero the moment a required criterion isn't actually met.
The trick is to gather deterministic evidence first (file existence, content grep, route/export presence, "does the test suite actually pass") and only fall back to an LLM judge for the genuinely subjective remainder. Deterministic checks are fast, free, and not subject to a model's optimism; the LLM only adjudicates what a regex can't.
LLM agents are confident. They will tell you the feature is "done" and the tests
"pass." spec-verify doesn't take their word for it: it re-derives the verdict
from the artifacts on disk. A criterion only PASSes when the evidence shows it.
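The "deterministic first, judge as fallback" ordering described above can be sketched roughly as follows. This is a minimal illustration, not spec-verify's internals: `resolveVerdict`, the `check` callback, and the `'judge'`/`'none'` values for `decidedBy` are assumptions made for the example.

```javascript
// Hypothetical sketch: names and shapes are illustrative, not spec-verify's internals.

/**
 * Decide one criterion: deterministic evidence first, LLM judge only as fallback.
 * @param {{ text: string, check?: () => boolean }} criterion
 * @param {((text: string) => 'PASS' | 'FAIL' | null) | null} judge
 */
function resolveVerdict(criterion, judge) {
  if (typeof criterion.check === 'function') {
    // A deterministic check settles the verdict; the judge is never consulted.
    return { verdict: criterion.check() ? 'PASS' : 'FAIL', decidedBy: 'deterministic' };
  }
  if (judge) {
    const opinion = judge(criterion.text);
    if (opinion === 'PASS' || opinion === 'FAIL') {
      return { verdict: opinion, decidedBy: 'judge' };
    }
  }
  // No directive and no usable judge answer: nothing proves the criterion either way.
  return { verdict: 'UNVERIFIABLE', decidedBy: 'none' };
}

const withCheck = resolveVerdict({ text: 'README exists', check: () => true }, null);
const subjective = resolveVerdict({ text: 'UI feels polished' }, null);
console.log(withCheck.verdict, subjective.verdict);
```

Note that a failing deterministic check cannot be overridden by an optimistic judge in this sketch, which is the property the paragraph above relies on.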
```bash
npm install
# or globally (the CLI command stays `spec-verify`):
npm i -g github:Martello-Systems/spec-verify
# or as a dependency:
npm install spec-verify
```

Requires Node 18+.
Annotate your spec's acceptance criteria with inline machine-check directives
(optional but recommended), then run check:
```bash
spec-verify check --spec SPEC.md --src ./build
```

- Exit code 0 if every criterion passes (or is only unverifiable).
- Exit code 1 if any criterion fails.
- Exit code 2 on a usage/IO error.
Add --strict to also fail on UNVERIFIABLE (see below) — recommended for CI.
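As a rough illustration of how the exit codes above relate to the verdict counts, here is a minimal sketch. `exitCodeFor` is a hypothetical helper written for this example, not a function exported by spec-verify, and exit code 2 (usage/IO errors) is deliberately out of scope because it is raised before any criterion is evaluated.

```javascript
// Illustrative only: mirrors the documented exit codes, not the tool's source.
function exitCodeFor(summary, { strict = false } = {}) {
  if (summary.fail > 0) return 1;                       // any FAIL fails the gate
  if (strict && summary.unverifiable > 0) return 1;     // --strict: UNVERIFIABLE also fails
  return 0;                                             // everything passed (or was tolerated)
}

const summary = { total: 6, pass: 5, fail: 0, unverifiable: 1 };
console.log(exitCodeFor(summary));                   // 0
console.log(exitCodeFor(summary, { strict: true })); // 1
```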
| Flag | Description |
|---|---|
| `-s, --spec <file>` | Path to the markdown spec (required) |
| `-d, --src <dir>` | Path to the finished codebase (required) |
| `--json` | Emit machine-readable JSON instead of the table |
| `--report <file>` | Also write a full markdown report to `<file>` |
| `--model <id>` | LLM judge model (default `claude-haiku-4-5-20251001`) |
| `--smart` | Use the smarter judge model (`claude-sonnet-4-6`) |
| `--no-run-scripts` | Don't execute npm scripts referenced by directives |
| `--no-judge` | Skip the LLM judge; undecided criteria become UNVERIFIABLE |
| `--strict` | Treat UNVERIFIABLE criteria as failures (exit 1); recommended for CI |
| `--require-modal` | Only treat bullets containing must/shall/should as criteria |
By default UNVERIFIABLE does not fail the gate: a criterion the tool could
not check (for example, a subjective one with no API key for the judge) is
reported but tolerated. That is convenient locally but dangerous in CI, where a
pipeline with no ANTHROPIC_API_KEY would otherwise go green simply because
nothing could be judged.
Pass --strict to count every UNVERIFIABLE as a failure (exit 1). This makes
CI honest: the gate only passes when every criterion was actually verified.
```bash
spec-verify check --spec SPEC.md --src ./build --strict
```

The judge uses the Anthropic SDK and reads `ANTHROPIC_API_KEY` from the environment: no key is ever hardcoded.
If the key is absent, criteria that need the judge are reported as UNVERIFIABLE
(which does not fail the gate unless `--strict` is set) and a warning is printed. The default model is the cheap
`claude-haiku-4-5-20251001`; pass `--smart` for `claude-sonnet-4-6`.
```bash
export ANTHROPIC_API_KEY=sk-ant-... # never commit this
spec-verify check --spec SPEC.md --src ./build --smart
```

Use a GitHub-style checklist under an "Acceptance Criteria" heading. Attach an
inline `<!-- check: ... -->` directive to make a criterion deterministically
verifiable:
```markdown
## Acceptance Criteria
- [ ] A README documents how to run the service <!-- check: file-exists path="README.md" -->
- [ ] The service exposes a health endpoint at /health <!-- check: route-exists path="/health" -->
- [ ] A `createWidget` function is exported <!-- check: export-exists name="createWidget" -->
- [ ] The codebase references "widget" <!-- check: grep pattern="widget" flags="i" -->
- [ ] The test suite passes <!-- check: npm-script name="test" -->
- [ ] The UI feels polished and on-brand <!-- no directive → judged by the LLM -->
```

Criteria without a directive are still parsed; they get best-effort keyword evidence and are handed to the LLM judge.
| Directive | Args | Determines verdict by |
|---|---|---|
| `file-exists` | `path` or `glob` | A matching file exists |
| `grep` | `pattern`, `glob?`, `flags?`, `min?` | The regex matches ≥ `min` times total across files (default 1) |
| `export-exists` | `name`, `glob?` | A JS/TS module exports the named symbol |
| `route-exists` | `path`, `method?`, `glob?` | A route declaration references the path |
| `npm-script` | `name`, `timeoutMs?` | `npm run <name>` exits 0 (the suite actually runs) |
Aliases: file/contains/script/test/export/route map to the above.
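To make the directive syntax concrete, here is a minimal sketch of how an inline `<!-- check: ... -->` comment could be split into a directive name and its arguments. `parseDirective` is a hypothetical helper for illustration; spec-verify's actual parser may accept more forms (for example unquoted values) or resolve aliases differently.

```javascript
// Hypothetical parser for illustration; spec-verify's real parser may differ.
function parseDirective(line) {
  // Match: <!-- check: NAME key="value" key="value" -->
  const comment = /<!--\s*check:\s*([\w-]+)((?:\s+[\w-]+="[^"]*")*)\s*-->/.exec(line);
  if (!comment) return null; // no directive on this line

  const [, name, rawArgs] = comment;
  const args = {};
  for (const [, key, value] of rawArgs.matchAll(/([\w-]+)="([^"]*)"/g)) {
    args[key] = value;
  }
  return { name, args };
}

const parsed = parseDirective(
  '- [ ] The codebase references "widget" <!-- check: grep pattern="widget" flags="i" -->'
);
console.log(parsed.name, parsed.args.pattern, parsed.args.flags);

// A plain HTML comment without `check:` is not a directive.
console.log(parseDirective('- [ ] The UI feels polished <!-- no directive -->'));
```

A line for which this returns `null` corresponds to the "no directive" case, which the text above says is handed to the LLM judge.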
route-exists detects routes two ways, and a match via either is a PASS:
- Code routes (Express / Fastify / Koa, config-driven routers, raw `http`): the path must appear as a routing construct: a method call (`app.get('/x')`), a `.route('/x')` chain, a config object that also carries a `method:` key, or a `req.url === '/x'` comparison. A path that only shows up in a comment, a nav menu, or a test constant does not count.
- Next.js file-convention routes (App + Pages Router): the path encoded in the file path is matched against the file tree, e.g. `app/widgets/route.ts`, `app/widgets/page.tsx`, `pages/widgets.tsx`, `pages/api/widgets.ts`. The `src/` prefix, route groups `(group)`, the root path, and dynamic/catch-all segments (`/users/123` → `app/users/[id]/route.ts`) are handled; deeply-dynamic catch-all matching is best-effort.
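The file-convention matching described in the second bullet can be sketched as below. This is an assumption-laden illustration, not spec-verify's implementation: `routeSegmentsFromFile` and `nextRouteMatches` are hypothetical names, and optional catch-alls (`[[...slug]]`) and special files such as `pages/_app.tsx` are not handled.

```javascript
// Illustrative sketch of file-convention matching; not spec-verify's implementation.
function routeSegmentsFromFile(filePath) {
  let parts = filePath.split('/');
  if (parts[0] === 'src') parts = parts.slice(1);            // optional src/ prefix
  const router = parts.shift();                               // 'app' or 'pages'
  if (router !== 'app' && router !== 'pages') return null;
  const last = parts.pop();
  if (!last) return null;
  const file = last.replace(/\.(tsx?|jsx?)$/, '');            // drop the extension
  if (router === 'app') {
    if (file !== 'route' && file !== 'page') return null;     // only route/page files define routes
  } else if (file !== 'index') {
    parts.push(file);                                         // pages/widgets.tsx -> /widgets
  }
  // Route groups like (marketing) do not appear in the URL.
  return parts.filter((p) => !/^\(.*\)$/.test(p));
}

function nextRouteMatches(requestPath, filePath) {
  const fileSegs = routeSegmentsFromFile(filePath);
  if (!fileSegs) return false;
  const urlSegs = requestPath.split('/').filter(Boolean);
  for (let i = 0; i < fileSegs.length; i++) {
    const seg = fileSegs[i];
    if (/^\[\.\.\..+\]$/.test(seg)) return urlSegs.length > i; // [...slug] swallows the rest
    if (i >= urlSegs.length) return false;
    if (/^\[.+\]$/.test(seg)) continue;                        // [id] matches any one segment
    if (seg !== urlSegs[i]) return false;
  }
  return fileSegs.length === urlSegs.length;
}

console.log(nextRouteMatches('/users/123', 'src/app/users/[id]/route.ts')); // true
console.log(nextRouteMatches('/health', 'app/widgets/route.ts'));           // false
```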
```text
ID   VERDICT       CRITERION
------------------------------------------------------------
C1   PASS          A README documents how to run the service
C2   FAIL          The service exposes a health endpoint at /he…
C3   PASS          A createWidget function is exported from the…
C4   PASS          The codebase references the word "widget" so…
C5   PASS          The project ships a passing test suite
C6   UNVERIFIABLE  The service should be friendly and well docu…
------------------------------------------------------------
Total 6 | PASS 4 | FAIL 1 | UNVERIFIABLE 1
RESULT: FAILED
```
With --json:
```json
{
  "summary": { "total": 6, "pass": 4, "fail": 1, "unverifiable": 1, "passed": false, "exitCode": 1 },
  "results": [
    { "id": "C2", "verdict": "FAIL", "reason": "route \"/health\" not found", "decidedBy": "deterministic", "...": "..." }
  ]
}
```

```js
import fs from 'node:fs';
import { verify, createAnthropicJudge, formatTable } from 'spec-verify';

const { results, summary } = await verify({
  spec: fs.readFileSync('SPEC.md', 'utf8'),
  srcDir: './build',
  judge: createAnthropicJudge(), // reads ANTHROPIC_API_KEY
});

console.log(formatTable({ results, summary }));
process.exit(summary.exitCode);
```

Swap the judge for a mock in tests with `createMockJudge(fn)`.
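Because `results` is plain data, it can be post-processed without any extra API. The sketch below filters a report down to the criteria that need attention. It uses only the fields visible in the `--json` sample (`id`, `verdict`, `reason`, `decidedBy`); `attentionLines` is a hypothetical helper and the second result entry is made up for illustration.

```javascript
// Sample report using only fields shown in the JSON output; the C1 entry is invented for illustration.
const report = {
  summary: { total: 6, pass: 4, fail: 1, unverifiable: 1, passed: false, exitCode: 1 },
  results: [
    { id: 'C2', verdict: 'FAIL', reason: 'route "/health" not found', decidedBy: 'deterministic' },
    { id: 'C1', verdict: 'PASS', reason: 'illustrative placeholder', decidedBy: 'deterministic' },
  ],
};

// Keep only the criteria that need attention and format one line per criterion.
function attentionLines(results, { strict = false } = {}) {
  return results
    .filter((r) => r.verdict === 'FAIL' || (strict && r.verdict === 'UNVERIFIABLE'))
    .map((r) => `${r.id} ${r.verdict}: ${r.reason}`);
}

console.log(attentionLines(report.results).join('\n'));
```

The same function would work on the `results` returned by `verify(...)`, assuming the programmatic result objects carry the same fields as the JSON output.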
```yaml
- run: npx spec-verify check --spec SPEC.md --src . --strict
  env:
    ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
```

`--strict` is recommended in CI so the gate fails (rather than silently passes)
if any criterion could not be verified.
- Source-text heuristics, not an AST. `export-exists`/`route-exists` match common ESM/CJS and Express/Fastify forms via regex (comment-stripped for routes), not a full parser. An exotic re-export or a route built from a computed string can be missed. Prefer a `grep` directive for anything unusual.
- Next.js file-convention routes are matched against the file tree, but deeply-dynamic catch-all mapping is best-effort: a literal request segment is allowed to match a `[param]`/`[...catchall]` directory at that position.
- The LLM judge is a fallback, not the moat. Deterministic directives decide the verdict whenever they can; the judge only adjudicates criteria with no directive. Lean on directives for anything you want CI to enforce hard.
- `npm-script` runs the script in the target dir: only enable it on builds you trust, or pass `--no-run-scripts`.
- Binary files are skipped (NUL-byte heuristic), so grep/export/route checks apply to text sources only.
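For readers unfamiliar with the NUL-byte heuristic mentioned in the last bullet, a minimal sketch follows. `looksBinary` and the 8000-byte sample size are assumptions for this example; the source does not say how many bytes spec-verify inspects.

```javascript
// Sketch of a NUL-byte heuristic; the sample size is an assumption, not spec-verify's value.
function looksBinary(buffer, sampleSize = 8000) {
  const end = Math.min(buffer.length, sampleSize);
  for (let i = 0; i < end; i++) {
    if (buffer[i] === 0) return true; // text files almost never contain a NUL byte
  }
  return false;
}

console.log(looksBinary(Buffer.from('export const x = 1;\n')));              // false
console.log(looksBinary(Buffer.from([0x89, 0x50, 0x4e, 0x47, 0x00, 0x0d]))); // true
```

The heuristic is cheap but imperfect: UTF-16 text contains NUL bytes and would be skipped, which is consistent with the "text sources only" caveat above.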
Demo recording placeholder: a short asciinema/GIF of `spec-verify check` catching a silently-skipped criterion in CI will live here.
```bash
npm install
npm test      # node:test, deterministic detection + a fixtured LLM-judge seam
npm run lint  # ESLint (flat config, plain ESM)
```

The LLM judge has a clean seam: `buildJudgePrompt` (prompt assembly) and
parseJudgeResponse (response parsing) are pure and unit-tested against
recorded Anthropic responses (well-formed, malformed, partial, refusal, and
empty) with no API key and no network. A single live smoke test runs only
when ANTHROPIC_API_KEY is set, and skips cleanly otherwise. Detection is
proven against two spec/build fixture pairs: a good build where every criterion
passes, and a build that silently skips a criterion, which must be flagged.
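To show why a pure parsing step is easy to test offline, here is a hypothetical stand-in for a response parser. It is not spec-verify's `parseJudgeResponse`: the real function's input and output shapes are not documented here, and this sketch simply assumes the judge replies with a JSON object carrying `verdict` and `reason`.

```javascript
// Hypothetical stand-in, NOT spec-verify's parseJudgeResponse; the JSON reply shape is an assumption.
function parseVerdictText(raw) {
  if (typeof raw !== 'string' || raw.trim() === '') {
    return { verdict: 'UNVERIFIABLE', reason: 'empty response' };
  }
  let data;
  try {
    data = JSON.parse(raw);
  } catch {
    return { verdict: 'UNVERIFIABLE', reason: 'malformed response' };
  }
  if (data === null || typeof data !== 'object') {
    return { verdict: 'UNVERIFIABLE', reason: 'malformed response' };
  }
  if (data.verdict !== 'PASS' && data.verdict !== 'FAIL') {
    // Covers partial replies and refusals that carry no usable verdict.
    return { verdict: 'UNVERIFIABLE', reason: 'no usable verdict' };
  }
  return { verdict: data.verdict, reason: typeof data.reason === 'string' ? data.reason : '' };
}

// Recorded-style inputs can be replayed with no API key and no network.
console.log(parseVerdictText('{"verdict":"PASS","reason":"README covers setup"}').verdict);
console.log(parseVerdictText('I cannot help with that.').verdict);
```

Every unusable reply degrades to UNVERIFIABLE rather than PASS in this sketch, so a flaky judge can never turn the gate green on its own.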
Apache-2.0 © 2026 Martello Systems. See LICENSE.
spec-verify is part of the open-source toolkit from Martello Systems. We ship AI-built software, spec to delivery in days. If this saved you time, come see what we do.