AB Study Design

A/B Study Design: Diff Drift Packet vs Raw Git Diff

Status: pre-registered design only. No results exist yet. This page is committed before any study runs so the design cannot drift to fit an outcome. If the design changes, the change history of this file is the audit trail.

The Claim Under Test

Diff Drift's core product claim is not "the rules catch everything". It is that a reviewer using a Diff Drift packet reaches a correct trust decision more reliably (and cites better evidence) than the same reviewer using a raw git diff alone. The blind-agent scorecard measures the packet arm only; this study adds the control arm, without which that number has no baseline and cannot count as evidence for the claim.

Design

Unit: one (case, reviewer, arm) triple.
Arms:
- A (packet): Diff Drift markdown report + raw git diff + standard prompt.
- B (control): raw git diff + the same prompt with Diff Drift references removed.
Cases: every case in eval/cases/, plus any added before the study starts. Case set is frozen at study start.
Reviewers: minimum 3 independent evaluators per arm. At least one must be a human reviewer who has not contributed to Diff Drift; model evaluators must span at least two model families. No evaluator sees both arms of the same case (between-subjects per case); assignment is alternated so that each reviewer gets a mix of A and B cases.
Blinding: reviewers never see oracles, other reviewers' answers, or scores. Arm B prompts must not mention Diff Drift at all.
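
The alternation rule above can be sketched as a simple counterbalancing scheme. This is an illustration only, not part of the existing harness: `assignArms` and the shape of its output are hypothetical, and the sketch assumes an even number of reviewers so that every case receives equal coverage in both arms (for example, six reviewers yield three answers per arm for each case).

```javascript
// Hypothetical helper (not in the repo): alternate arms so that each
// (case, reviewer) pair gets exactly one arm and reviewers see a mix of A and B.
function assignArms(caseIds, reviewerIds) {
  const assignments = [];
  reviewerIds.forEach((reviewer, r) => {
    caseIds.forEach((caseId, c) => {
      // Offsetting by the reviewer index flips the pattern between reviewers,
      // so each case is answered under both arms across the reviewer pool.
      const arm = (r + c) % 2 === 0 ? "A" : "B";
      assignments.push({ caseId, reviewer, arm });
    });
  });
  return assignments;
}
```

Because each reviewer appears once per case, no reviewer can see both arms of the same case; a real assignment might also shuffle case order per reviewer before alternating.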

Metrics (same rubric, both arms)

Scored by the existing frozen rubric in eval/lib/score.mjs:

Decision accuracy (primary) — accepted-decision rate per arm.
Severity-weighted finding recall (primary).
Precision / false positives (secondary) — does the packet reduce speculative findings?
Localization (secondary).
Time-to-decision (secondary, when measurable) — wall-clock from packet open to answer; recorded only for human reviewers.
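
To make the second primary metric concrete, here is a minimal sketch of a severity-weighted recall. The authoritative weights and finding-matching logic are whatever the frozen rubric in eval/lib/score.mjs defines and are not reproduced here; the weight table, the `severityWeightedRecall` name, and the finding shape below are assumptions for illustration.

```javascript
// Assumed weights for illustration; the frozen rubric may use different values.
const SEVERITY_WEIGHTS = { high: 3, medium: 2, low: 1 };

// expectedFindings: [{ id, severity }] from a case oracle (hypothetical shape).
// matchedIds: ids of expected findings the reviewer's answer was credited with.
function severityWeightedRecall(expectedFindings, matchedIds) {
  const matched = new Set(matchedIds);
  let total = 0;
  let found = 0;
  for (const finding of expectedFindings) {
    const weight = SEVERITY_WEIGHTS[finding.severity];
    if (typeof weight !== "number") {
      throw new Error("unknown severity: " + finding.severity);
    }
    total += weight;
    if (matched.has(finding.id)) found += weight;
  }
  // A benign case has no expected findings, so recall is undefined there.
  return total === 0 ? null : found / total;
}
```

Under these assumed weights, missing a low-severity finding costs less recall than missing a high-severity one, which is the property the metric name implies.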

Analysis Plan

Report per-arm means with per-case score pairs (A−B delta per case, aggregated across reviewers).
With the small expected n, report exact counts and a sign-test-style summary of per-case deltas rather than implying statistical power that does not exist; state n everywhere a delta is stated.
Publish raw (anonymized) answers alongside scores so the scoring is reproducible with npm run eval:score-agent.
Negative or null results are published with the same prominence as positive ones. If arm B matches arm A, that is a finding about the product, not a reason to rerun.
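
The per-case deltas and sign-test-style summary described above could be computed along these lines. This is a sketch under stated assumptions: the function names and the `{ caseId, arm, score }` record shape are hypothetical, and the exact two-sided binomial p-value is included only as a companion to the raw counts, not as a claim of statistical power.

```javascript
// scores: [{ caseId, arm: "A" | "B", score }] (hypothetical record shape).
function perCaseDeltas(scores) {
  const byCase = new Map();
  for (const { caseId, arm, score } of scores) {
    if (!byCase.has(caseId)) byCase.set(caseId, { A: [], B: [] });
    byCase.get(caseId)[arm].push(score);
  }
  const mean = (xs) => xs.reduce((sum, x) => sum + x, 0) / xs.length;
  const deltas = [];
  for (const [caseId, arms] of byCase) {
    // A delta needs at least one answer in each arm.
    if (arms.A.length === 0 || arms.B.length === 0) continue;
    deltas.push({
      caseId,
      delta: mean(arms.A) - mean(arms.B),
      nA: arms.A.length, // n is carried with every delta
      nB: arms.B.length,
    });
  }
  return deltas;
}

// Binomial coefficient; multiplying before dividing keeps every step an integer.
function choose(n, k) {
  let c = 1;
  for (let i = 1; i <= k; i++) c = (c * (n - k + i)) / i;
  return c;
}

// Sign test: count cases where A beats B, B beats A, and ties (ties are dropped).
function signTest(deltas) {
  const wins = deltas.filter((d) => d.delta > 0).length;
  const losses = deltas.filter((d) => d.delta < 0).length;
  const ties = deltas.length - wins - losses;
  const n = wins + losses;
  let tail = 0;
  for (let i = 0; i <= Math.min(wins, losses); i++) tail += choose(n, i);
  const pTwoSided = n === 0 ? 1 : Math.min(1, (2 * tail) / 2 ** n);
  return { wins, losses, ties, n, pTwoSided };
}
```

With only a handful of cases the p-value will rarely be small, so the wins, losses, and ties counts are the figures to report first.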

Integrity Rules

Rubric, aliases, accepted decisions, and the case set are frozen before the first answer is generated (see Eval Methodology).
No fabricated evaluators: every answer's evaluator metadata names the model or person, and a human reviewer takes part only after consenting to be named or pseudonymized.
Reviewer answers are collected once; no retries, no best-of-k selection.
The study runs locally with the existing harness (packets via npm run eval:packets; a control packet variant is generated by omitting the report). No code or repo content leaves the machine.
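
One way the control packet variant might be derived from an existing packet is sketched below. The packet shape (`caseId`, `prompt`, `diff`, `report`) and the `buildControlPacket` helper are assumptions, not the actual structure in eval/lib/packets.mjs; the point is only that the report is dropped and the arm B prompt is checked for any mention of Diff Drift before it is used.

```javascript
// Hypothetical sketch: build an arm B packet from an arm A packet.
// Assumed packet shape: { caseId, prompt, diff, report }.
function buildControlPacket(packet, neutralPrompt) {
  // Arm B prompts must not mention Diff Drift at all, in any common spelling.
  if (/diff[\s_-]*drift/i.test(neutralPrompt)) {
    throw new Error("control prompt mentions Diff Drift");
  }
  // The report is omitted on purpose; only the raw diff and prompt remain.
  return {
    caseId: packet.caseId,
    arm: "B",
    prompt: neutralPrompt,
    diff: packet.diff,
  };
}
```

Failing loudly on a leaked reference is a deliberate choice here: silently rewriting the prompt could leave the two arms with prompts that differ in more than the Diff Drift wording.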

Prerequisites Before Running

A control-packet mode in eval/lib/packets.mjs (diff + neutral prompt, no report) — small harness addition.
Recruited evaluators meeting the independence requirements above.
Case set review: confirm the case set contains enough benign cases that an always-block strategy cannot win either arm (currently 3 of 15 cases are benign).
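
The always-block concern can be checked mechanically before the case set is frozen. The sketch below scores a constant strategy against each case's accepted decisions; the `acceptedDecisions` field and the decision labels are assumptions about the oracle format, not confirmed names from the harness.

```javascript
// Hypothetical check: how often would a reviewer who always gives the same
// decision be scored as correct? cases: [{ id, acceptedDecisions: [...] }].
function constantStrategyAccuracy(cases, decision) {
  const hits = cases.filter((c) => c.acceptedDecisions.includes(decision)).length;
  return { hits, n: cases.length, rate: cases.length === 0 ? null : hits / cases.length };
}
```

If the resulting rate for a constant "block" answer is close to what a careful reviewer could achieve, decision accuracy alone cannot separate the arms, and more benign cases are needed.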
