# A/B Study Design: Diff Drift Packet vs Raw Git Diff **Status: pre-registered design only. No results exist yet.** This page is committed before any study runs so the design cannot drift to fit an outcome. If the design changes, the change history of this file is the audit trail. ## The Claim Under Test Diff Drift's core product claim is not "the rules catch everything" — it is that a reviewer using a Diff Drift packet reaches a correct trust decision more reliably (and cites better evidence) than the same reviewer using a raw `git diff` alone. The blind-agent scorecard measures the packet arm only; this study adds the control arm that turns the number into evidence. ## Design - **Unit**: one (case, reviewer, arm) triple. - **Arms**: - **A (packet)**: Diff Drift markdown report + raw git diff + standard prompt. - **B (control)**: raw git diff + the same prompt with Diff Drift references removed. - **Cases**: every case in `eval/cases/`, plus any added before the study starts. Case set is frozen at study start. - **Reviewers**: minimum 3 independent evaluators per arm. At least one human reviewer who has not contributed to Diff Drift; model evaluators must span at least two model families. No evaluator sees both arms of the same case (between-subjects per case) — assignment is alternated so each reviewer gets a mix of A and B cases. - **Blinding**: reviewers never see oracles, other reviewers' answers, or scores. Arm B prompts must not mention Diff Drift at all. ## Metrics (same rubric, both arms) Scored by the existing frozen rubric in `eval/lib/score.mjs`: 1. **Decision accuracy** (primary) — accepted-decision rate per arm. 2. **Severity-weighted finding recall** (primary). 3. **Precision / false positives** (secondary) — does the packet reduce speculative findings? 4. **Localization** (secondary). 5. **Time-to-decision** (secondary, when measurable) — wall-clock from packet open to answer; recorded only for human reviewers. ## Analysis Plan - Report per-arm means with per-case score pairs (A−B delta per case, aggregated across reviewers). - With the small expected n, report exact counts and a sign-test-style summary of per-case deltas rather than implying statistical power that does not exist; state n everywhere a delta is stated. - Publish raw (anonymized) answers alongside scores so the scoring is reproducible with `npm run eval:score-agent`. - Negative or null results are published with the same prominence as positive ones. If arm B matches arm A, that is a finding about the product, not a reason to rerun. ## Integrity Rules - Rubric, aliases, accepted decisions, and the case set are frozen before the first answer is generated (see [Eval Methodology](Eval-Methodology.md)). - No fabricated evaluators: every answer's `evaluator` metadata names the model or person, and human participation requires their consent to be named or pseudonymized. - Reviewer answers are collected once; no retries, no best-of-k selection. - The study runs locally with the existing harness (packets via `npm run eval:packets`; a control packet variant is generated by omitting the report). No code or repo content leaves the machine. ## Prerequisites Before Running 1. A control-packet mode in `eval/lib/packets.mjs` (diff + neutral prompt, no report) — small harness addition. 2. Recruited evaluators meeting the independence requirements above. 3. Case set review: enough benign cases that always-block cannot win either arm (currently 3 of 15).