spec-verify

Did the agent actually build the spec?

spec-verify takes the original spec/requirements document an AI agent was given and the finished codebase it produced, and emits a per-criterion PASS / FAIL / UNVERIFIABLE verdict for every acceptance criterion. It's a CI-friendly acceptance gate for AI-built software: wire it into a pipeline and it exits non-zero the moment a required criterion isn't actually met.

The trick is to gather deterministic evidence first (file existence, content grep, route/export presence, "does the test suite actually pass") and only fall back to an LLM judge for the genuinely subjective remainder. Deterministic checks are fast, free, and not subject to a model's optimism; the LLM only adjudicates what a regex can't.

Why

LLM agents are confident. They will tell you the feature is "done" and the tests "pass." spec-verify doesn't take their word for it: it re-derives the verdict from the artifacts on disk. A criterion only PASSes when the evidence shows it.

Install

npm install
# or globally (the CLI command stays `spec-verify`):
npm i -g github:Martello-Systems/spec-verify
# or as a dependency:
npm install spec-verify

Requires Node 18+.

Usage

Annotate your spec's acceptance criteria with inline machine-check directives (optional but recommended), then run check:

spec-verify check --spec SPEC.md --src ./build

Exit code 0 if every criterion passes (or is only unverifiable).
Exit code 1 if any criterion fails.
Exit code 2 on a usage/IO error.

Add --strict to also fail on UNVERIFIABLE (see below) — recommended for CI.

Options

Flag	Description
`-s, --spec <file>`	Path to the markdown spec (required)
`-d, --src <dir>`	Path to the finished codebase (required)
`--json`	Emit machine-readable JSON instead of the table
`--report <file>`	Also write a full markdown report to `<file>`
`--model <id>`	LLM judge model (default `claude-haiku-4-5-20251001`)
`--smart`	Use the smarter judge model (`claude-sonnet-4-6`)
`--no-run-scripts`	Don't execute npm scripts referenced by directives
`--no-judge`	Skip the LLM judge; undecided criteria become UNVERIFIABLE
`--strict`	Treat `UNVERIFIABLE` criteria as failures (exit 1) — recommended for CI
`--require-modal`	Only treat bullets containing must/shall/should as criteria

Strict mode (recommended for CI)

By default UNVERIFIABLE does not fail the gate: a criterion the tool could not check (for example, a subjective one with no API key for the judge) is reported but tolerated. That is convenient locally but dangerous in CI, where a pipeline with no ANTHROPIC_API_KEY would otherwise go green simply because nothing could be judged.

Pass --strict to count every UNVERIFIABLE as a failure (exit 1). This makes CI honest: the gate only passes when every criterion was actually verified.

spec-verify check --spec SPEC.md --src ./build --strict

The LLM judge (bring your own key)

The judge uses the Anthropic SDK and reads ANTHROPIC_API_KEY from the environment: no key is ever hardcoded. If the key is absent, criteria that need the judge are reported as UNVERIFIABLE (which does not fail the gate) and a warning is printed. Default model is the cheap claude-haiku-4-5-20251001; pass --smart for claude-sonnet-4-6.

export ANTHROPIC_API_KEY=sk-ant-...   # never commit this
spec-verify check --spec SPEC.md --src ./build --smart

Writing checkable criteria

Use a GitHub-style checklist under an "Acceptance Criteria" heading. Attach an inline  directive to make a criterion deterministically verifiable:

## Acceptance Criteria

- [ ] A README documents how to run the service <!-- check: file-exists path="README.md" -->
- [ ] The service exposes a health endpoint at /health <!-- check: route-exists path="/health" -->
- [ ] A `createWidget` function is exported <!-- check: export-exists name="createWidget" -->
- [ ] The codebase references "widget" <!-- check: grep pattern="widget" flags="i" -->
- [ ] The test suite passes <!-- check: npm-script name="test" -->
- [ ] The UI feels polished and on-brand   <!-- no directive → judged by the LLM -->

Criteria without a directive are still parsed; they get best-effort keyword evidence and are handed to the LLM judge.

Supported directives

Directive	Args	Determines verdict by
`file-exists`	`path` or `glob`	A matching file exists
`grep`	`pattern`, `glob?`, `flags?`, `min?`	The regex matches ≥ `min` times total across files (default 1)
`export-exists`	`name`, `glob?`	A JS/TS module exports the named symbol
`route-exists`	`path`, `method?`, `glob?`	A route declaration references the path
`npm-script`	`name`, `timeoutMs?`	`npm run <name>` exits 0 (the suite actually runs)

Aliases: file/contains/script/test/export/route map to the above.

route-exists detects routes two ways, and a match via either is a PASS:

Code routes (Express / Fastify / Koa, config-driven routers, raw http): the path must appear as a routing construct — a method call (app.get('/x')), a .route('/x') chain, a config object that also carries a method: key, or a req.url === '/x' comparison. A path that only shows up in a comment, a nav menu, or a test constant does not count.
Next.js file-convention routes (App + Pages Router): the path encoded in the file path is matched against the file tree, e.g. app/widgets/route.ts, app/widgets/page.tsx, pages/widgets.tsx, pages/api/widgets.ts. The src/ prefix, route groups (group), the root path, and dynamic/catch-all segments (/users/123 → app/users/[id]/route.ts) are handled; deeply-dynamic catch-all matching is best-effort.

Example output

ID    VERDICT      CRITERION
------------------------------------------------------------
C1    PASS         A README documents how to run the service
C2    FAIL         The service exposes a health endpoint at /he…
C3    PASS         A createWidget function is exported from the…
C4    PASS         The codebase references the word "widget" so…
C5    PASS         The project ships a passing test suite
C6    UNVERIFIABLE The service should be friendly and well docu…
------------------------------------------------------------
Total 6 | PASS 4 | FAIL 1 | UNVERIFIABLE 1
RESULT: FAILED

With --json:

{
  "summary": { "total": 6, "pass": 4, "fail": 1, "unverifiable": 1, "passed": false, "exitCode": 1 },
  "results": [
    { "id": "C2", "verdict": "FAIL", "reason": "route \"/health\" not found", "decidedBy": "deterministic", "...": "..." }
  ]
}

Programmatic API

import fs from 'node:fs';
import { verify, createAnthropicJudge, formatTable } from 'spec-verify';

const { results, summary } = await verify({
  spec: fs.readFileSync('SPEC.md', 'utf8'),
  srcDir: './build',
  judge: createAnthropicJudge(),   // reads ANTHROPIC_API_KEY
});

console.log(formatTable({ results, summary }));
process.exit(summary.exitCode);

Swap the judge for a mock in tests with createMockJudge(fn).

CI example (GitHub Actions)

- run: npx spec-verify check --spec SPEC.md --src . --strict
  env:
    ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}

--strict is recommended in CI so the gate fails (rather than silently passes) if any criterion could not be verified.

Limitations

Source-text heuristics, not an AST. export-exists / route-exists match common ESM/CJS and Express/Fastify forms via regex (comment-stripped for routes), not a full parser. An exotic re-export or a route built from a computed string can be missed. Prefer a grep directive for anything unusual.
Next.js file-convention routes are matched against the file tree, but deeply-dynamic catch-all mapping is best-effort: a literal request segment is allowed to match a [param]/[...catchall] directory at that position.
The LLM judge is a fallback, not the moat. Deterministic directives decide the verdict whenever they can; the judge only adjudicates criteria with no directive. Lean on directives for anything you want CI to enforce hard.
npm-script runs the script in the target dir: only enable it on builds you trust, or pass --no-run-scripts.
Binary files are skipped (NUL-byte heuristic), so grep/export/route checks apply to text sources only.

Demo

Demo recording placeholder: a short asciinema/GIF of spec-verify check catching a silently-skipped criterion in CI will live here.

Development

npm install
npm test     # node:test, deterministic detection + a fixtured LLM-judge seam
npm run lint # ESLint (flat config, plain ESM)

The LLM judge has a clean seam: buildJudgePrompt (prompt assembly) and parseJudgeResponse (response parsing) are pure and unit-tested against recorded Anthropic responses (well-formed, malformed, partial, refusal, and empty) with no API key and no network. A single live smoke test runs only when ANTHROPIC_API_KEY is set, and skips cleanly otherwise. Detection is proven against two spec/build fixture pairs: a good build where every criterion passes, and a build that silently skips a criterion, which must be flagged.

License

Built by Martello Systems

spec-verify is part of the open-source toolkit from Martello Systems. We ship AI-built software, spec to delivery in days. If this saved you time, come see what we do.

Licensed under the Apache License 2.0.

Name		Name	Last commit message	Last commit date
Latest commit History 14 Commits
.github/workflows		.github/workflows
bin		bin
src		src
test		test
.env.example		.env.example
.gitignore		.gitignore
CONTRIBUTING.md		CONTRIBUTING.md
LICENSE		LICENSE
NOTICE		NOTICE
README.md		README.md
eslint.config.js		eslint.config.js
package-lock.json		package-lock.json
package.json		package.json

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

spec-verify

Why

Install

Usage

Options

Strict mode (recommended for CI)

The LLM judge (bring your own key)

Writing checkable criteria

Supported directives

Example output

Programmatic API

CI example (GitHub Actions)

Limitations

Demo

Development

License

Built by Martello Systems

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Uh oh!

Folders and files

Latest commit

History

Repository files navigation

spec-verify

Why

Install

Usage

Options

Strict mode (recommended for CI)

The LLM judge (bring your own key)

Writing checkable criteria

Supported directives

Example output

Programmatic API

CI example (GitHub Actions)

Limitations

Demo

Development

License

Built by Martello Systems

About

Topics

Resources

License

Contributing

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages