Eval Methodology

How Diff Drift's evaluation works, exactly what the numbers mean, and — just as important — what they do not mean. Commands and file locations live in Development; this page is the methodology contract.

Two Different Measurements

	Engine eval	Blind-agent scorecard
Question	Does the engine produce exactly the expected flags, counts, and exit code?	Can a reviewer who sees only Diff Drift's output reach the right decision and cite the right evidence?
Gate	CI blocker (`npm run eval:engine`)	Advisory, local only
Scoring	Binary pass/fail per case against the oracle	0–100 rubric per case
Measures	The tool	The tool's usefulness to a reviewer — not the tool's detection rate

Cases

Each case in eval/cases/*.case.mjs defines a before/after repo state and an oracle: expected exit code, changed-file and risk counts, required flags (type, severity, file), forbidden flags, and per-file summaries. Fixtures are built as real temporary git repos and run through the real diff-drift check binary — nothing is mocked.
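
To make the oracle concrete, here is a minimal sketch of what a case and its binary pass/fail check could look like. Every field name (before, after, oracle, requiredFlags, and so on), the exit code value, and the passesOracle helper are assumptions made for illustration; the real schema in eval/cases/*.case.mjs may differ, and per-file summaries are left out for brevity.

```javascript
// Hypothetical case definition. Every field name here is an assumption made
// for illustration; the real schema in eval/cases/*.case.mjs may differ.
const exampleCase = {
  id: "hardcoded-secret-basic",
  // Repo state before and after the change, as path -> file contents.
  before: { "src/config.js": "const key = process.env.API_KEY;\n" },
  after: { "src/config.js": 'const key = "not-a-real-key-0000";\n' },
  oracle: {
    exitCode: 1, // assumed: non-zero when a risk is flagged
    changedFiles: 1,
    riskCount: 1,
    requiredFlags: [
      { type: "Hardcoded secret", severity: "high", file: "src/config.js" },
    ],
    forbiddenFlags: [{ type: "Dynamic code execution", file: "src/config.js" }],
  },
};

// True when a produced flag has the same type and file as `wanted`,
// and the same severity whenever `wanted` specifies one.
function hasFlag(flags, wanted) {
  return flags.some(
    (f) =>
      f.type === wanted.type &&
      f.file === wanted.file &&
      (wanted.severity === undefined || f.severity === wanted.severity)
  );
}

// Binary pass/fail: every oracle expectation must hold at once.
function passesOracle(result, oracle) {
  return (
    result.exitCode === oracle.exitCode &&
    result.changedFiles === oracle.changedFiles &&
    result.flags.length === oracle.riskCount &&
    oracle.requiredFlags.every((w) => hasFlag(result.flags, w)) &&
    !oracle.forbiddenFlags.some((w) => hasFlag(result.flags, w))
  );
}
```

The point of the sketch is the all-or-nothing shape: a result with the right flag but the wrong severity, or with one extra forbidden flag, fails the whole case.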

The suite currently covers: one case per high-severity rule family (secrets, eval, child_process, TLS, CORS, cookies), the multi-flag payments-api regression, dependency/script drift, TSX/JSX/.mjs files, a test-fixture suppression case, two benign cases (formatting-only; a rename-heavy noisy refactor), and the oversized-file skip guard.

Per-rule coverage

Rules with at least one dedicated engine case: Hardcoded secret, Dynamic code execution, Child process execution, Disabled TLS verification, Broadened CORS, Weakened cookie flags, Loose regex pattern, Crypto downgrade, Undeclared import, Disabled guard, Removed sanitization, Permissive logging config, Dependency not in lockfile, New dependency, npm script changed, Dependency version changed (the last three inside the dependency-drift case). Known gaps worth future cases: scoped-package slopsquatting, lockfile-format edge cases per package manager, deeper TSX surface (hooks, props spreading).

Limitations of the case suite: every fixture is synthetic and small (a handful of files). There is no real-world corpus yet, so no claim is made about recall or precision on production diffs. To measure the false-positive side of that on your own repos, use npm run eval:fp-replay.

Blind-Agent Rubric

A blind packet contains the Diff Drift markdown report, the raw git diff, and a prompt. It never contains the oracle. The reviewer returns a decision (approve | investigate | block), a findings list, and (since benchmark v2) an optional notes list for benign observations and report feedback. Notes are validated for shape but ignored entirely by scoring: they earn no credit and incur no penalty. eval/lib/score.mjs scores the rest:

score = recall × 60
      + (decision accepted ? 20 : 0)
      + (top risk ranked first ? 10 : 0)
      + localization × 10
      − false positives × 5
      − (wrong decision on a benign case ? 50 : 0)
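
The formula above can be sketched as a single function. This is an illustration, not the code in eval/lib/score.mjs: the parameter names are assumptions, recall and localization are assumed to arrive as fractions between 0 and 1, and no clamping is applied because the formula as written does not mention any.

```javascript
// Sketch of the rubric formula. Parameter names are illustrative
// assumptions; the real implementation lives in eval/lib/score.mjs.
function rubricScore({
  recall, // severity-weighted recall, assumed to be a fraction in 0..1
  decisionAccepted, // true when the decision is one the case accepts
  topRiskFirst, // true when the top expected risk is ranked first
  localization, // assumed to be a fraction in 0..1
  falsePositives, // count of findings that match nothing expected
  wrongDecisionOnBenign, // true when a benign case got a wrong decision
}) {
  return (
    recall * 60 +
    (decisionAccepted ? 20 : 0) +
    (topRiskFirst ? 10 : 0) +
    localization * 10 -
    falsePositives * 5 -
    (wrongDecisionOnBenign ? 50 : 0)
  );
}

// A perfect answer reaches the 100-point ceiling: 60 + 20 + 10 + 10.
const perfect = rubricScore({
  recall: 1,
  decisionAccepted: true,
  topRiskFirst: true,
  localization: 1,
  falsePositives: 0,
  wrongDecisionOnBenign: false,
});
console.log(perfect);
```

Under these assumptions a single false positive on an otherwise perfect answer yields 95, and halving recall while keeping everything else perfect yields 70.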

Recall is severity-weighted: low = 1, medium = 2, high = 3. Missing a high-severity flag costs three times as much as missing a low-severity one.
Finding shape is enforced: severity, filePath, riskType, and evidence are mandatory.
Severity matching is scored in benchmark v3: the reported severity must meet or exceed the expected severity. Understating severity misses the required finding; conservative escalation is allowed.
Compound findings can satisfy distinct expected risk types when one reported finding clearly names both risks, but one finding cannot satisfy duplicate same-type expectations.
Localization: a matched finding must name the expected file path.
Related extras: an unmatched finding that names the expected file and risk area is tracked separately from false positives. It gets no recall credit, but it is not treated as speculative noise. Repeating an already matched finding is not a related extra; duplicate copies are penalized as unmatched findings.
False positives: every reported finding that matches no expected flag or related expected risk costs 5 points. On benign cases (zero required flags), any finding zeroes recall — always-block, always-find strategies cannot win.
Decision calibration: some cases accept more than one decision (e.g. investigate or block for a low-severity XSS-shaped change) when both are defensible review behavior. Accepted decisions are declared in the case file.
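
The recall and severity rules above can be sketched as follows. The function names and the input shape are assumptions for illustration. In particular, treating a benign case with no findings as recall 1 is an inference (a benign case could not otherwise reach full marks), not something the rules state outright.

```javascript
// Weights from the rubric. In this sketch they double as the severity
// ordering used for the meet-or-exceed check.
const SEVERITY_WEIGHT = { low: 1, medium: 2, high: 3 };

// Understating severity fails the match; conservative escalation passes.
function severitySatisfies(reported, expected) {
  return SEVERITY_WEIGHT[reported] >= SEVERITY_WEIGHT[expected];
}

// expected: [{ severity, matched }], where `matched` records whether some
// reported finding satisfied that flag (hypothetical shape).
// reportedCount: total number of findings the reviewer reported.
function weightedRecall(expected, reportedCount) {
  const total = expected.reduce((sum, f) => sum + SEVERITY_WEIGHT[f.severity], 0);
  if (total === 0) {
    // Benign case: any finding zeroes recall. Returning 1 for an empty
    // findings list is an assumption of this sketch.
    return reportedCount === 0 ? 1 : 0;
  }
  const hit = expected
    .filter((f) => f.matched)
    .reduce((sum, f) => sum + SEVERITY_WEIGHT[f.severity], 0);
  return hit / total;
}

// One high flag matched, one low flag missed: 3 of 4 weight units recovered.
console.log(
  weightedRecall(
    [
      { severity: "high", matched: true },
      { severity: "low", matched: false },
    ],
    1
  )
);
```

Swapping which flag is missed shows the asymmetry: missing the high flag instead leaves only 1 of 4 weight units, so recall drops to 0.25.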

Semantic aliases (current rubric)

Finding text is matched against the expected flag type plus these aliases (global table in eval/lib/score.mjs; cases may add per-flag aliases, declared in the case file):

Flag type	Global aliases
Hardcoded secret	hardcoded secret, credential, secret, access key, aws access key
Dependency not in lockfile	dependency not in lockfile, lockfile, dependency drift
npm script changed	npm script, install script, postinstall
Weakened cookie flags	weakened cookie flags, cookie flags, httponly, secure, samesite
Permissive logging config	permissive logging, logger redaction, redaction removed
Undeclared import	undeclared import, undeclared dependency, not declared
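
A minimal sketch of alias matching, assuming a case-insensitive substring test. The table contents are copied from above; the matching strategy and the mentionsFlagType name are assumptions, and the real scorer in eval/lib/score.mjs may match differently (for example, on word boundaries).

```javascript
// Global alias table, copied from the table above.
const GLOBAL_ALIASES = {
  "Hardcoded secret": ["hardcoded secret", "credential", "secret", "access key", "aws access key"],
  "Dependency not in lockfile": ["dependency not in lockfile", "lockfile", "dependency drift"],
  "npm script changed": ["npm script", "install script", "postinstall"],
  "Weakened cookie flags": ["weakened cookie flags", "cookie flags", "httponly", "secure", "samesite"],
  "Permissive logging config": ["permissive logging", "logger redaction", "redaction removed"],
  "Undeclared import": ["undeclared import", "undeclared dependency", "not declared"],
};

// Assumed strategy: lowercase both sides and test for a substring hit
// against the flag type, its global aliases, and any per-flag case aliases.
function mentionsFlagType(findingText, flagType, caseAliases = []) {
  const haystack = findingText.toLowerCase();
  const terms = [flagType, ...(GLOBAL_ALIASES[flagType] || []), ...caseAliases];
  return terms.some((term) => haystack.includes(term.toLowerCase()));
}

console.log(mentionsFlagType("AWS access key committed in upload.tsx", "Hardcoded secret"));
console.log(mentionsFlagType("Unused variable in helper", "Hardcoded secret"));
```

One consequence worth noting under this assumption: short aliases such as secure would also hit inside longer words like insecure, which is why the matching strategy matters as much as the alias list.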

Frozen-rubric policy: rubric weights, aliases, and accepted decisions are calibrated before answers are generated and are never tuned afterward to improve a score. If a defensible answer scores badly, that is reported as a rubric limitation. Changing the rubric starts a new benchmark version instead of rewriting older numbers.

Benchmark Versions

Every instrument change starts a new version; old scores are never recomputed under new rules, and new answers are always regenerated from scratch.

v1 (2026-06-09): original prompt, 15 synthetic cases, one blind model evaluator. Scored 72/100 — decision accuracy 15/15 and 100% per-rule recall, but precision was 43%: the prompt didn't say where benign observations belong, so the reviewer put commentary in findings and the zero-oracle cases (any finding zeroes recall there) collapsed to 15–20 points. That's a prompt-contract ambiguity, not a detection failure — but 72 is the honest v1 number and stands.
v2 (2026-06-10): the packet prompt defines the contract explicitly — findings are actionable trust risks only, benign observations go in a new scoring-ignored notes field, approve normally means empty findings, and a size-skipped file with a clean diff is a note, not a finding. The JSX secret fixture also became realistic (non-docs-example key, actually wired into the upload call) so reviewers have nothing legitimate to caveat. All 15 answers regenerated blind under the new packets. Scored 98/100. Clarification (same version, no score impact): the validator now enforces the full finding shape the prompt always required — severity, filePath, riskType, and evidence are mandatory, so a title-only finding cannot collect evidence-and-location credit. Every recorded v2 answer already satisfied the shape; the rescore is unchanged.
v3 (2026-06-10): severity became a scored part of the answer contract instead of only a required field. The prompt now says severity is scored; the scorer rejects severity understatements, permits conservative escalation, supports compound findings for distinct risk types, penalizes duplicate same-type reuse, and tracks non-duplicate related extra findings separately from false positives. All 15 answers were regenerated blind from packet-only context by three model-evaluator batches. Scored 100/100. The score remains model-only internal evidence with independent external validation pending.

Evaluators and Honesty Constraints

Every answer file records who produced it (evaluator: { id, kind: "model" | "human", external?: true, note }). The scorecard lists all evaluators and renders a standing banner — "independent external validation pending" — until at least two evaluators exist and at least one is a human outside the project with external: true. Internal maintainer answers can be recorded as kind: "human", but they do not clear the banner. That banner is computed by the harness, not hand-written, so it cannot be quietly dropped.
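
The banner rule can be sketched as a pure function over the recorded evaluators. The function name is hypothetical, and counting evaluators by distinct id is an assumption; the record fields (id, kind, external) come from the answer-file shape described above.

```javascript
const BANNER = "independent external validation pending";

// Returns the banner text while it still applies, or null once cleared.
// Cleared only when at least two evaluators exist AND at least one of them
// is a human marked external: true.
function validationBanner(evaluators) {
  const distinctIds = new Set(evaluators.map((e) => e.id));
  const hasExternalHuman = evaluators.some(
    (e) => e.kind === "human" && e.external === true
  );
  return distinctIds.size >= 2 && hasExternalHuman ? null : BANNER;
}

// An internal maintainer recorded as kind: "human" does not clear the banner.
console.log(
  validationBanner([
    { id: "model-a", kind: "model" },
    { id: "maintainer", kind: "human" },
  ])
);
```

Computing the banner from the evaluator records, rather than storing it as text, is what keeps it from being dropped by hand.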

What the current scorecard therefore is: an internal product-quality signal on a small synthetic suite. What it is not: third-party validation, a detection-rate claim, or a comparison against other tools. Treat any headline number accordingly, and check the scorecard itself for case count and evaluator list.

Reproducing the Published Score

Every published benchmark's raw answers and scorecard are committed under eval/benchmarks/<version>/ — the working .eval/ directory stays gitignored, but the published evidence does not. From a fresh clone:

npm install
npm run eval:score-agent -- eval/benchmarks/v3/answers

That rescores the exact recorded v3 answers through the current rubric and must print the current published number: 100/100. Older benchmark folders preserve their raw answers and original scorecards for historical comparison; scorer changes start a new version rather than restating old numbers.

Reported Metrics

The scorecard reports, per run: overall score, decision accuracy, severity-weighted recall, localization, precision ((matched reported findings + related extras) ÷ all reported findings), total related extras, total false positives, and per-rule recall (matched/required per flag type across all cases that require it). Per-case rows include misses, mislocalizations, related extras, and unmatched findings verbatim.
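
Two of these metrics are simple enough to sketch directly. The function names and input shapes are assumptions. How the real scorecard reports precision when nothing was reported is not stated here, so this sketch returns null in that case rather than guessing a number.

```javascript
// Precision as defined above:
// (matched reported findings + related extras) / all reported findings.
function precision({ matched, relatedExtras, reported }) {
  if (reported === 0) return null; // undefined here; real behavior not stated
  return (matched + relatedExtras) / reported;
}

// Per-rule recall: matched/required per flag type across all cases that
// require it. cases: [{ required: [{ type, matched }] }] (hypothetical shape).
function perRuleRecall(cases) {
  const tally = {};
  for (const c of cases) {
    for (const flag of c.required) {
      tally[flag.type] = tally[flag.type] || { matched: 0, required: 0 };
      tally[flag.type].required += 1;
      if (flag.matched) tally[flag.type].matched += 1;
    }
  }
  return tally;
}

// 3 matched findings and 1 related extra out of 8 reported.
console.log(precision({ matched: 3, relatedExtras: 1, reported: 8 }));
```

Because related extras count toward the numerator, a reviewer is not penalized in precision for an on-target but unmatched observation, only for findings that match nothing expected.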

FP-Replay: Measuring Noise on Your Own Repos

Synthetic cases cannot tell you the triage burden on your codebase. npm run eval:fp-replay runs diff-drift check over a list of local repos and baselines you configure (fp-replay.config.json, see fp-replay.config.example.json) and aggregates active flags per rule type and per changed file into .eval/results/fp-replay/latest.md. Point it at a few recent merged branches you consider benign: every flag it reports is a false positive for your code, which is the number that predicts real triage cost. Nothing is bundled or uploaded; it only reads repos you name.

Pending Work (tracked honestly)

Independent human evaluators for the blind suite (clears the banner).
A real-world labeled corpus to measure recall/precision beyond synthetic fixtures.
The packet-vs-raw-diff A/B study — pre-registered design in A/B Study Design, no results yet.
