Eval Methodology
How Diff Drift's evaluation works, exactly what the numbers mean, and — just as important — what they do not mean. Commands and file locations live in Development; this page is the methodology contract.
| | Engine eval | Blind-agent scorecard |
|---|---|---|
| Question | Does the engine produce exactly the expected flags, counts, and exit code? | Can a reviewer who sees only Diff Drift's output reach the right decision and cite the right evidence? |
| Gate | CI blocker (`npm run eval:engine`) | Advisory, local only |
| Scoring | Binary pass/fail per case against the oracle | 0–100 rubric per case |
| Measures | The tool | The tool's usefulness to a reviewer, not the tool's detection rate |
Each case in eval/cases/*.case.mjs defines a before/after repo state and an oracle: expected exit code, changed-file and risk counts, required flags (type, severity, file), forbidden flags, and per-file summaries. Fixtures are built as real temporary git repos and run through the real diff-drift check binary — nothing is mocked.
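
As a rough illustration of the oracle contract described above, a case's expectations and the binary pass/fail check might look like the sketch below. Every field name (`exitCode`, `changedFiles`, `riskCount`, `requiredFlags`, `forbiddenFlags`), the helpers `flagMatches` and `passesOracle`, and the example values are hypothetical; the real shape is whatever the files in eval/cases/ and the engine harness define. Per-file summaries are omitted for brevity.

```javascript
// Hypothetical oracle for one case; field names and values are illustrative only.
const oracle = {
  exitCode: 1, // placeholder value, not a documented exit code
  changedFiles: 1,
  riskCount: 1,
  requiredFlags: [{ type: "Hardcoded secret", severity: "high", file: "src/upload.js" }],
  forbiddenFlags: [{ type: "Dynamic code execution" }],
};

// True when `flag` agrees with every field that `wanted` specifies.
function flagMatches(flag, wanted) {
  return Object.keys(wanted).every((key) => flag[key] === wanted[key]);
}

// Binary pass/fail: every expectation must hold exactly.
function passesOracle(result, expected) {
  return (
    result.exitCode === expected.exitCode &&
    result.changedFiles === expected.changedFiles &&
    result.flags.length === expected.riskCount &&
    expected.requiredFlags.every((w) => result.flags.some((f) => flagMatches(f, w))) &&
    !expected.forbiddenFlags.some((w) => result.flags.some((f) => flagMatches(f, w)))
  );
}
```

Because the check is a single conjunction, one wrong severity or one extra flag fails the whole case, which is the behavior a CI blocker needs.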
The suite currently covers: one case per high-severity rule family (secrets, eval, child_process, TLS, CORS, cookies), the multi-flag payments-api regression, dependency/script drift, TSX/JSX/.mjs files, a test-fixture suppression case, two benign cases (formatting-only; a rename-heavy noisy refactor), and the oversized-file skip guard.
Rules with at least one dedicated engine case: Hardcoded secret, Dynamic code execution, Child process execution, Disabled TLS verification, Broadened CORS, Weakened cookie flags, Loose regex pattern, Crypto downgrade, Undeclared import, Disabled guard, Removed sanitization, Permissive logging config, Dependency not in lockfile, New dependency, npm script changed, Dependency version changed (the last three inside the dependency-drift case). Known gaps worth future cases: scoped-package slopsquatting, lockfile-format edge cases per package manager, deeper TSX surface (hooks, props spreading).
Limitations of the case suite: every fixture is synthetic and small (a handful of files). There is no real-world corpus yet, so no claim is made about recall or precision on production diffs — measuring that on your own repos is what npm run eval:fp-replay exists for.
A blind packet contains the Diff Drift markdown report, the raw git diff, and a prompt. It never contains the oracle. The reviewer returns a decision (approve | investigate | block), a findings list, and (since benchmark v2) an optional notes list for benign observations and report feedback. Notes are validated for shape but ignored entirely by scoring: they earn no credit and incur no penalty. eval/lib/score.mjs scores the rest:
score = recall × 60
+ (decision accepted ? 20 : 0)
+ (top risk ranked first ? 10 : 0)
+ localization × 10
− false positives × 5
− (wrong decision on a benign case ? 50 : 0)
- Recall is severity-weighted: low = 1, medium = 2, high = 3. Missing a High costs three times a Low.
- Finding shape is enforced: `severity`, `filePath`, `riskType`, and `evidence` are mandatory.
- Severity matching is scored in benchmark v3: the reported severity must meet or exceed the expected severity. Understating severity misses the required finding; conservative escalation is allowed.
- Compound findings can satisfy distinct expected risk types when one reported finding clearly names both risks, but one finding cannot satisfy duplicate same-type expectations.
- Localization: a matched finding must name the expected file path.
- Related extras: an unmatched finding that names the expected file and risk area is tracked separately from false positives. It gets no recall credit, but it is not treated as speculative noise. Repeating an already matched finding is not a related extra; duplicate copies are penalized as unmatched findings.
- False positives: every reported finding that matches no expected flag or related expected risk costs 5 points. On benign cases (zero required flags), any finding zeroes recall — always-block, always-find strategies cannot win.
- Decision calibration: some cases accept more than one decision (e.g. `investigate` or `block` for a low-severity XSS-shaped change) when both are defensible review behavior. Accepted decisions are declared in the case file.
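
The rubric above can be sketched as a small function. This is not the code in eval/lib/score.mjs: the input field names, the treatment of localization as a fraction between 0 and 1, the recall of 1 for a benign case with no findings, and the clamp to the 0–100 range are all assumptions made so the example runs on its own.

```javascript
// Severity weights as stated in the rubric.
const SEVERITY_WEIGHT = { low: 1, medium: 2, high: 3 };

// Severity-weighted recall. On a benign case (no expected flags),
// any reported finding zeroes recall; an empty answer is assumed to score 1.
function weightedRecall(expected, matchedIds, reportedCount) {
  if (expected.length === 0) return reportedCount === 0 ? 1 : 0;
  let total = 0;
  let hit = 0;
  for (const flag of expected) {
    const weight = SEVERITY_WEIGHT[flag.severity];
    total += weight;
    if (matchedIds.has(flag.id)) hit += weight;
  }
  return hit / total;
}

// Hypothetical per-case score; `c` carries pre-computed matching results.
function scoreCase(c) {
  const benign = c.expected.length === 0;
  const recall = weightedRecall(c.expected, c.matchedIds, c.reportedCount);
  const raw =
    recall * 60 +
    (c.decisionAccepted ? 20 : 0) +
    (c.topRiskFirst ? 10 : 0) +
    c.localization * 10 -
    c.falsePositives * 5 -
    (benign && !c.decisionAccepted ? 50 : 0);
  return Math.max(0, Math.min(100, raw)); // clamp is an assumption
}
```

Under these assumptions, a reviewer who catches a high-severity flag but misses a low one gets a recall of 3/4, and a reviewer who reports a single speculative finding on a benign case loses the entire 60-point recall term plus 5 points for the false positive.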
Finding text is matched against the expected flag type plus these aliases (global table in eval/lib/score.mjs; cases may add per-flag aliases, declared in the case file):
| Flag type | Global aliases |
|---|---|
| Hardcoded secret | hardcoded secret, credential, secret, access key, aws access key |
| Dependency not in lockfile | dependency not in lockfile, lockfile, dependency drift |
| npm script changed | npm script, install script, postinstall |
| Weakened cookie flags | weakened cookie flags, cookie flags, httponly, secure, samesite |
| Permissive logging config | permissive logging, logger redaction, redaction removed |
| Undeclared import | undeclared import, undeclared dependency, not declared |
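
A minimal sketch of how alias matching and the v3 severity rule could combine is shown below. The alias entries are copied from two rows of the table above; everything else is an assumption: case-insensitive substring matching, the choice to search `riskType` and `evidence`, and the names `GLOBAL_ALIASES`, `SEVERITY_RANK`, and `matchesFlag`. Localization is scored separately and is not checked here.

```javascript
// Two rows of the global alias table, for illustration.
const GLOBAL_ALIASES = {
  "Hardcoded secret": ["hardcoded secret", "credential", "secret", "access key", "aws access key"],
  "Weakened cookie flags": ["weakened cookie flags", "cookie flags", "httponly", "secure", "samesite"],
};

const SEVERITY_RANK = { low: 1, medium: 2, high: 3 };

// Hypothetical matcher: the finding must name the risk (flag type or an alias)
// and report a severity that meets or exceeds the expected one.
function matchesFlag(finding, expected, caseAliases = []) {
  const text = `${finding.riskType} ${finding.evidence}`.toLowerCase();
  const aliases = [
    expected.type.toLowerCase(),
    ...(GLOBAL_ALIASES[expected.type] ?? []),
    ...caseAliases, // per-flag aliases declared in the case file
  ];
  const namesRisk = aliases.some((alias) => text.includes(alias));
  const severeEnough = SEVERITY_RANK[finding.severity] >= SEVERITY_RANK[expected.severity];
  return namesRisk && severeEnough;
}
```

With this shape, understating severity fails the match even when the risk is named correctly, while escalating a low-severity expectation to high still matches.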
Frozen-rubric policy: rubric weights, aliases, and accepted decisions are calibrated before answers are generated and are never tuned afterward to improve a score. If a defensible answer scores badly, that is reported as a rubric limitation. Changing the rubric starts a new benchmark version instead of rewriting older numbers.
Every instrument change starts a new version; old scores are never recomputed under new rules, and new answers are always regenerated from scratch.
- v1 (2026-06-09): original prompt, 15 synthetic cases, one blind model evaluator. Scored 72/100: decision accuracy was 15/15 and per-rule recall was 100%, but precision was 43%. The prompt didn't say where benign observations belong, so the reviewer put commentary in `findings` and the zero-oracle cases (any finding zeroes recall there) collapsed to 15–20 points. That's a prompt-contract ambiguity, not a detection failure, but 72 is the honest v1 number and stands.
- v2 (2026-06-10): the packet prompt now spells out the contract. Findings are actionable trust risks only, benign observations go in a new scoring-ignored `notes` field, `approve` normally means empty findings, and a size-skipped file with a clean diff is a note, not a finding. The JSX secret fixture also became realistic (non-docs-example key, actually wired into the upload call) so reviewers have nothing legitimate to caveat. All 15 answers regenerated blind under the new packets. Scored 98/100. Clarification (same version, no score impact): the validator now enforces the full finding shape the prompt always required. `severity`, `filePath`, `riskType`, and `evidence` are mandatory, so a title-only finding cannot collect evidence-and-location credit. Every recorded v2 answer already satisfied the shape; the rescore is unchanged.
- v3 (2026-06-10): severity became a scored part of the answer contract instead of only a required field. The prompt now says severity is scored; the scorer rejects severity understatements, permits conservative escalation, supports compound findings for distinct risk types, penalizes duplicate same-type reuse, and tracks non-duplicate related extra findings separately from false positives. All 15 answers were regenerated blind from packet-only context by three model-evaluator batches. Scored 100/100. The score remains model-only internal evidence with independent external validation pending.
Every answer file records who produced it (evaluator: { id, kind: "model" | "human", external?: true, note }). The scorecard lists all evaluators and renders a standing banner — "independent external validation pending" — until at least two evaluators exist and at least one is a human outside the project with external: true. Internal maintainer answers can be recorded as kind: "human", but they do not clear the banner. That banner is computed by the harness, not hand-written, so it cannot be quietly dropped.
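
The banner rule stated above reduces to a short predicate. The sketch below assumes evaluators are counted by distinct `id` and uses a hypothetical function name, `bannerPending`; the harness's actual implementation may differ.

```javascript
// Returns true while "independent external validation pending" must be shown.
// Clearing it takes at least two evaluators, one of them an external human.
function bannerPending(evaluators) {
  const distinctCount = new Set(evaluators.map((e) => e.id)).size;
  const hasExternalHuman = evaluators.some(
    (e) => e.kind === "human" && e.external === true
  );
  return !(distinctCount >= 2 && hasExternalHuman);
}
```

An internal maintainer recorded as `kind: "human"` without `external: true` leaves the banner in place, matching the rule that internal answers do not clear it.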
What the current scorecard therefore is: an internal product-quality signal on a small synthetic suite. What it is not: third-party validation, a detection-rate claim, or a comparison against other tools. Treat any headline number accordingly, and check the scorecard itself for case count and evaluator list.
Every published benchmark's raw answers and scorecard are committed under eval/benchmarks/<version>/ — the working .eval/ directory stays gitignored, but the published evidence does not. From a fresh clone:
npm install
npm run eval:score-agent -- eval/benchmarks/v3/answers

That rescores the exact recorded v3 answers through the current rubric and must print the current published number: 100/100. Older benchmark folders preserve their raw answers and original scorecards for historical comparison; scorer changes start a new version rather than restating old numbers.
The scorecard reports, per run: overall score, decision accuracy, severity-weighted recall, localization, precision ((matched reported findings + related extras) ÷ all reported findings), total related extras, total false positives, and per-rule recall (matched/required per flag type across all cases that require it). Per-case rows include misses, mislocalizations, related extras, and unmatched findings verbatim.
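
Two of these metrics are simple enough to sketch directly. The input shapes and the function names `precision` and `perRuleRecall` are hypothetical, and returning `null` when nothing was reported is an assumption, since the source does not say how an empty answer set is displayed.

```javascript
// Precision as defined above: related extras count toward the numerator.
function precision({ matched, relatedExtras, reported }) {
  if (reported === 0) return null; // undefined ratio; display choice is an assumption
  return (matched + relatedExtras) / reported;
}

// Per-rule recall: matched/required per flag type across all cases.
// `requirements` is a flat list of { type, matched } entries, one per required flag.
function perRuleRecall(requirements) {
  const byType = {};
  for (const { type, matched } of requirements) {
    byType[type] ??= { matched: 0, required: 0 };
    byType[type].required += 1;
    if (matched) byType[type].matched += 1;
  }
  return byType;
}
```

Counting related extras in the numerator means a reviewer is not marked imprecise for an additional on-target observation in the right file, only for findings that match nothing expected.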
Synthetic cases cannot tell you the triage burden on your codebase. npm run eval:fp-replay runs diff-drift check over a list of local repos and baselines you configure (fp-replay.config.json, see fp-replay.config.example.json) and aggregates active flags per rule type and per changed file into .eval/results/fp-replay/latest.md. Point it at a few recent merged branches you consider benign: every flag it reports is a false positive for your code, which is the number that predicts real triage cost. Nothing is bundled or uploaded; it only reads repos you name.
- Independent human evaluators for the blind suite (clears the banner).
- A real-world labeled corpus to measure recall/precision beyond synthetic fixtures.
- The packet-vs-raw-diff A/B study — pre-registered design in A/B Study Design, no results yet.