AB Study Design
Statusnone420 edited this page Jun 10, 2026 · 1 revision
Status: pre-registered design only. No results exist yet. This page is committed before any study runs so the design cannot drift to fit an outcome. If the design changes, the change history of this file is the audit trail.
Diff Drift's core product claim is not "the rules catch everything". It is that a reviewer using a Diff Drift packet reaches a correct trust decision more reliably (and cites better evidence) than the same reviewer using a raw git diff alone. The blind-agent scorecard measures the packet arm only; this study adds the control arm, without which that score has no baseline to be compared against.
- Unit: one (case, reviewer, arm) triple.
- Arms:
  - A (packet): Diff Drift markdown report + raw git diff + standard prompt.
  - B (control): raw git diff + the same prompt with Diff Drift references removed.
- Cases: every case in `eval/cases/`, plus any added before the study starts. The case set is frozen at study start.
- Reviewers: minimum 3 independent evaluators per arm. At least one must be a human reviewer who has not contributed to Diff Drift; model evaluators must span at least two model families. No evaluator sees both arms of the same case (between-subjects per case); assignment is alternated so each reviewer gets a mix of A and B cases.
- Blinding: reviewers never see oracles, other reviewers' answers, or scores. Arm B prompts must not mention Diff Drift at all.
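The alternating assignment described above can be sketched as follows. This is a minimal illustration, not the harness's actual code: `assignArms` and `armCounts` are hypothetical helpers, and the parity rule assumes an even number of reviewers (at least six) so that every case ends up with three evaluators per arm.

```javascript
// Hypothetical helper (not part of the existing harness).
// Each (case, reviewer) pair gets exactly one arm, so no reviewer can see
// both arms of the same case.
function assignArms(caseIds, reviewerIds) {
  const assignments = [];
  caseIds.forEach((caseId, caseIndex) => {
    reviewerIds.forEach((reviewer, reviewerIndex) => {
      // The parity of (case index + reviewer index) alternates the arm along
      // both axes: across reviewers within a case, and across cases for a reviewer.
      const arm = (caseIndex + reviewerIndex) % 2 === 0 ? "A" : "B";
      assignments.push({ caseId, reviewer, arm });
    });
  });
  return assignments;
}

// Count evaluators per arm for each case, to check the "minimum 3 per arm" rule.
function armCounts(assignments) {
  const counts = {};
  for (const { caseId, arm } of assignments) {
    if (!counts[caseId]) counts[caseId] = { A: 0, B: 0 };
    counts[caseId][arm] += 1;
  }
  return counts;
}

// Usage with placeholder case and reviewer identifiers.
const plan = assignArms(
  ["case-01", "case-02"],
  ["r1", "r2", "r3", "r4", "r5", "r6"]
);
console.log(armCounts(plan));
```

With an odd reviewer count the arms become unbalanced per case, so a real implementation would need to validate the pool size before the case set is frozen.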
Scored by the existing frozen rubric in eval/lib/score.mjs:
- Decision accuracy (primary) — accepted-decision rate per arm.
- Severity-weighted finding recall (primary).
- Precision / false positives (secondary) — does the packet reduce speculative findings?
- Localization (secondary).
- Time-to-decision (secondary, when measurable) — wall-clock from packet open to answer; recorded only for human reviewers.
- Report per-arm means with per-case score pairs (A−B delta per case, aggregated across reviewers).
- With the small expected n, report exact counts and a sign-test-style summary of per-case deltas rather than implying statistical power that does not exist; state n everywhere a delta is stated.
- Publish raw (anonymized) answers alongside scores so the scoring is reproducible with `npm run eval:score-agent`.
- Negative or null results are published with the same prominence as positive ones. If arm B matches arm A, that is a finding about the product, not a reason to rerun.
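One way to produce the per-case deltas and the sign-test-style summary is sketched below. It assumes a hypothetical input shape of one `{ caseId, arm, score }` record per (case, reviewer, arm) triple; the actual output of `eval/lib/score.mjs` may look different. The p-value is the standard two-sided exact sign test, with ties dropped, and is reported next to the raw counts rather than in place of them.

```javascript
// Assumed input: [{ caseId, arm: "A" | "B", score: number }, ...] (hypothetical shape).
// Returns one A−B delta per case, using the mean score across reviewers in each arm.
function perCaseDeltas(scores) {
  const byCase = new Map();
  for (const { caseId, arm, score } of scores) {
    if (!byCase.has(caseId)) byCase.set(caseId, { A: [], B: [] });
    byCase.get(caseId)[arm].push(score);
  }
  const mean = (xs) => xs.reduce((sum, x) => sum + x, 0) / xs.length;
  const deltas = [];
  for (const [caseId, arms] of byCase) {
    if (arms.A.length === 0 || arms.B.length === 0) continue; // no pair to compare
    deltas.push({
      caseId,
      delta: mean(arms.A) - mean(arms.B),
      nA: arms.A.length,
      nB: arms.B.length,
    });
  }
  return deltas;
}

// n choose k via the multiplicative formula (exact for the small n expected here).
function binomialCoefficient(n, k) {
  let result = 1;
  for (let i = 1; i <= k; i++) result = (result * (n - k + i)) / i;
  return result;
}

// Exact counts plus a two-sided exact sign test over the non-tied cases.
function signSummary(deltas) {
  const positive = deltas.filter((d) => d.delta > 0).length;
  const negative = deltas.filter((d) => d.delta < 0).length;
  const ties = deltas.length - positive - negative;
  const n = positive + negative;
  let tail = 0;
  for (let i = 0; i <= Math.min(positive, negative); i++) {
    tail += binomialCoefficient(n, i);
  }
  const pTwoSided = n === 0 ? 1 : Math.min(1, (2 * tail) / 2 ** n);
  return { nCases: deltas.length, positive, negative, ties, pTwoSided };
}

// Usage with made-up scores, purely to show the shape of the output.
const exampleDeltas = perCaseDeltas([
  { caseId: "case-01", arm: "A", score: 1 },
  { caseId: "case-01", arm: "B", score: 0 },
  { caseId: "case-02", arm: "A", score: 0.5 },
  { caseId: "case-02", arm: "B", score: 0.5 },
]);
console.log(signSummary(exampleDeltas));
```

Returning `nCases`, `positive`, `negative`, and `ties` alongside the p-value keeps n visible wherever a delta is quoted.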
- Rubric, aliases, accepted decisions, and the case set are frozen before the first answer is generated (see Eval Methodology).
- No fabricated evaluators: every answer's `evaluator` metadata names the model or person, and human participation requires their consent to be named or pseudonymized.
- Reviewer answers are collected once; no retries, no best-of-k selection.
- The study runs locally with the existing harness (packets via `npm run eval:packets`; a control packet variant is generated by omitting the report). No code or repo content leaves the machine.
- A control-packet mode in `eval/lib/packets.mjs` (diff + neutral prompt, no report), which is a small harness addition.
- Recruited evaluators meeting the independence requirements above.
- Case set review: enough benign cases that always-block cannot win either arm (currently 3 of 15).
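The control-packet mode listed above might look roughly like the sketch below. The packet shape `{ caseId, prompt, diff, report }` and the names `toControlPacket` and `assertBlind` are assumptions for illustration; the real structures in `eval/lib/packets.mjs` may differ. The neutral prompt is passed in by the caller rather than derived automatically, and a guard rejects any arm B prompt that still mentions Diff Drift.

```javascript
// Hypothetical guard: arm B prompts must not mention Diff Drift at all.
// Only the prompt is checked; the raw diff is repo content and is left untouched.
function assertBlind(packet) {
  if (/diff[\s_-]*drift/i.test(packet.prompt)) {
    throw new Error(`Control prompt for ${packet.caseId} mentions Diff Drift`);
  }
}

// Hypothetical helper: build an arm B packet from an arm A packet by dropping
// the report and swapping in a neutral prompt supplied by the caller.
function toControlPacket(packet, neutralPrompt) {
  const { report, ...rest } = packet; // omit the Diff Drift report
  const control = { ...rest, prompt: neutralPrompt, arm: "B" };
  assertBlind(control);
  return control;
}

// Usage with placeholder content.
const control = toControlPacket(
  {
    caseId: "case-01",
    prompt: "Review the change using the Diff Drift report and the diff.",
    diff: "--- a/file.js\n+++ b/file.js\n",
    report: "# Diff Drift report",
  },
  "Review the change using the diff."
);
console.log(Object.keys(control));
```

Failing loudly on a leaked product name is a deliberate choice here: a silently unblinded control prompt would invalidate the comparison without leaving any trace in the scores.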