Cross-model debate harness research, policy, and scenarios by EdgeCaser · Pull Request #6 · EdgeCaser/shipwright

EdgeCaser · 2026-04-15T01:25:28Z

Summary

This PR brings over the cross-model conflict-harness work from codex/cross-model-debate-harness-spec.

It includes:

Gemini replay judging hardening on Windows
richer verdict schema and repair telemetry
weighted-total validation tightening
judge-bias and repeatability analysis memos
orchestrator guidance for judge routing and escalation
scenario taxonomy for judge-analysis grouping
six new recent real-world strategy scenarios
committed benchmark/result artifacts already captured on the branch

Why PR Instead Of Local Merge

The local OneDrive-backed worktree hit repeated filesystem permission issues while trying to fast-forward and merge main, even after the branch was safely pushed. Opening a PR keeps the branch history intact and avoids further local filesystem risk.

Notes

docs/outreach/ is intentionally ignored and not part of this PR.
There is a local stash in the OneDrive worktree (temp-main-switch-state-before-merge) that preserves interrupted local-only state, but the branch contents for merge live safely on GitHub.

Recommended Review Focus

whether benchmark/result artifacts should land in this PR as-is or be split
orchestrator routing policy and default-model guidance
judge-schema / repair-path changes
new scenario framing quality

…ults

Co-authored spec between Claude and Codex (3 review rounds to sign-off).

Phase 1 implementation: head-to-head mode with sealed first-pass, structured critique exchange, blind adjudication, and subscription CLI adapters.

Pilot results on prd-hidden-scope-creep confirm judge family affinity: Claude judge picks Claude (margin 1.2), GPT judge picks GPT (margin 0.8). This empirically validates the spec's Phase 2 requirement for multi-judge calibration before publishing benchmark conclusions.

Includes: spec, 3 JSON schemas, case packet builder, conflict runner, 8 passing tests, full review exchange, and 2 pilot run transcripts.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…ion gaps

Batch runner loops all scenarios with both Claude and GPT as judge, computes judge agreement rate, and gates publishability interpretation on full coverage. Unknown --scenario IDs now fail closed instead of being silently dropped.

Fixes both findings from Codex code review round 2:
- Partial batch coverage no longer produces false "usable" conclusions
- Typo'd scenario names throw immediately with available IDs listed

5 passing tests, including partial-coverage and unknown-ID regression tests.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
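For reviewers who want the fail-closed and coverage-gating behavior in concrete terms, here is a minimal Python sketch. It is not the harness code: `select_scenarios`, `summarize`, `BatchSummary`, and the verdict labels are hypothetical names chosen for illustration, and the real runner may structure this differently.

```python
from dataclasses import dataclass
from typing import Optional


def select_scenarios(requested, available):
    """Fail closed: raise on any unknown scenario ID instead of dropping it."""
    unknown = [s for s in requested if s not in available]
    if unknown:
        raise ValueError(
            f"unknown scenario IDs: {unknown}; available: {sorted(available)}"
        )
    return list(requested)


@dataclass
class BatchSummary:
    agreement_rate: Optional[float]  # None when the batch is not interpretable
    full_coverage: bool


def summarize(scenarios, verdicts_by_judge):
    """Compute judge agreement only when every judge covered every scenario.

    verdicts_by_judge maps a judge name to {scenario_id: verdict}, where the
    verdict is a label such as "side_a", "side_b", or "tie" (assumed labels).
    """
    judges = list(verdicts_by_judge)
    full = bool(scenarios) and all(
        s in verdicts_by_judge[j] for j in judges for s in scenarios
    )
    if not full:
        # Partial coverage: refuse to report an agreement rate at all.
        return BatchSummary(agreement_rate=None, full_coverage=False)
    agree = sum(
        1 for s in scenarios
        if len({verdicts_by_judge[j][s] for j in judges}) == 1
    )
    return BatchSummary(agreement_rate=agree / len(scenarios), full_coverage=True)
```

Returning `None` rather than a rate computed over the covered subset is the point of the gate: a partial batch cannot be mistaken for a usable result downstream.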

…it reasoning effort

- Fix verdict schema: rubric dimension scores accept number (not just integer)
- Harden batch runner against EPIPE: child.on('error') calls finalize() instead of reject(), stdin error handler, process-level uncaughtException/unhandledRejection
- Add --output-last-message file fallback for Codex CLI stdout fragility
- Add explicit reasoning effort (medium) for all sides and judges (Codex contribution)
- Add 4 real-world evidence-rich scenarios (Blockbuster Total Access, Netflix Qwikster, Zillow Offers, Yahoo/Microsoft) to test whether judge family affinity is prompt-conditioned
- Add 2 additional scenarios from Codex: meta-muse-spark, supermicro-export-controls
- Include interim calibration results and Claude-Codex review exchange
- Codex sign-off on batch runner (code review round 3)

Preliminary finding: judge family affinity observed on synthetic scenarios (judges pick own family's side) may weaken on evidence-rich real-world cases. Full wave 2 batch (22 runs) in progress to confirm.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
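To make the schema change concrete, and to tie it to the weighted-total validation mentioned in the PR summary, here is a rough Python sketch of what such a check could look like. Everything in it is an assumption: the function name `validate_verdict`, the dimension names in the usage note, the weights, and the tolerance are illustrative and are not taken from the actual verdict schema.

```python
import math
from numbers import Real


def validate_verdict(scores, weights, weighted_total, tol=1e-6):
    """Return a list of problems; an empty list means the verdict passes.

    scores and weights map rubric dimension name -> number (hypothetical shape).
    """
    problems = []
    for dim in weights:
        score = scores.get(dim)
        # Accept any real number (int or float), not just integers.
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(score, bool) or not isinstance(score, Real):
            problems.append(f"{dim}: score must be a number, got {score!r}")
    extra = sorted(set(scores) - set(weights))
    if extra:
        problems.append(f"unexpected dimensions: {extra}")
    if isinstance(weighted_total, bool) or not isinstance(weighted_total, Real):
        problems.append(f"weighted_total must be a number, got {weighted_total!r}")
    if problems:
        return problems
    # The reported total must match the total recomputed from the parts.
    expected = sum(weights[d] * scores[d] for d in weights)
    if not math.isclose(expected, weighted_total, abs_tol=tol):
        problems.append(
            f"weighted_total {weighted_total} does not match recomputed {expected}"
        )
    return problems
```

For example, with hypothetical weights `{"clarity": 0.5, "evidence": 0.5}`, scores of `4` and `3.5` pass with a reported total of `3.75`, and a reported total of `4.0` is flagged as a mismatch.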

… affinity

Full 11-scenario dual-judge calibration with explicit medium reasoning effort. Zero errors across 22 runs (EPIPE hardening + --output-last-message working).

Key finding: Claude judge picked Side A (Claude) in 10/11 scenarios (1 tie). GPT judge showed no consistent pattern (4 side_a, 3 side_b, 4 ties). Judge agreement rate: 27% (3/11).

The core result is this asymmetry, not symmetric family affinity. Side-swap runs needed to decompose family affinity from position bias.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

… decomposed

Swap test (5 scenarios, sides flipped): Claude judge shows 3/5 family affinity, 2/5 position bias. GPT judge shows 2/5 family affinity, 1/5 position bias. Neither pure effect explains all data; it's a scenario-dependent interaction.

Current-events scenarios (meta-muse-spark, supermicro-export-controls): same pattern as wave 2. Claude judge picks Claude, GPT judge variable.

Key revision: GPT judge's apparent neutrality in wave 2 was partially an artifact of always being in the disadvantaged Side B position. Swap test reveals GPT also shows family preference when placed in Side A.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…decomposed

Full 2x2 matrix: 11 scenarios x 2 judges x 2 orientations = 44 runs, 43 complete.

Claude judge: 70% family affinity (follows Claude output across position swap), 30% position bias. Almost never ties (1/21 verdicts).

GPT judge: 36% family, 27% position, 27% consistent ties, 10% mixed. More calibrated about uncertainty: ties are stable across orientations.

Neither judge is reliable as sole evaluator. Swap-test replication is required to distinguish quality differences from systematic judge artifacts.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
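One way to read the decomposition above is as a per-scenario classification of a judge's two verdicts, one per orientation. The Python sketch below is an interpretation and not the harness's actual metric code: the function names, the category labels, and the rule that one tie plus one non-tie counts as "mixed" are assumptions.

```python
from collections import Counter


def classify_swap(original, swapped):
    """Classify one judge's verdicts on one scenario across a side swap.

    original and swapped are "side_a", "side_b", or "tie". The two outputs
    trade sides between orientations, so picking different sides means the
    judge picked the same output both times.
    """
    if original == "tie" and swapped == "tie":
        return "consistent_tie"
    if "tie" in (original, swapped):
        return "mixed"  # assumed rule: a tie in only one orientation
    if original == swapped:
        return "position_bias"  # same side regardless of which output is there
    return "follows_output"  # same output regardless of which side it is on


def tally(pairs):
    """Return the share of each category over (original, swapped) pairs."""
    counts = Counter(classify_swap(o, s) for o, s in pairs)
    total = sum(counts.values())
    return {category: n / total for category, n in counts.items()}
```

Whether a "follows_output" case is family affinity or a real quality difference depends on which model's output was followed, which is why this sketch keeps that label neutral.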

Outreach drafts (Medium, LinkedIn, Reddit) are personal publishing artifacts, not part of the project surface. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Adds the tier-1 results memo and the shared letter to Claude/Gemini summarizing the first three-scenario replay sample. Updates the judge-feedback-request status to Implemented. Ignores slack-agent logs and tmp debug files. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…udges, new real-world scenarios

Adds the raw artifact outputs and summaries backing the recent analysis memos:
- Gemini 2.5 Pro and flash-lite replay judgments across 13 scenarios
- Repeatability replays (5x) on prd-hidden-scope-creep, handoff-contradiction, event-automation-boundary
- Artifact-matched three-family comparison outputs
- New real-world scenario runs (google-adtech-breakup-remedies, nissan-honda-merger-collapse, openai-nonprofit-control)
- 6-scenario batch summary and default-GPT batch summary

These are the raw evidence for claims in docs/review/ and should travel with the analysis.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

EdgeCaser · 2026-04-15T01:27:25Z

This PR is not in a reviewable, mergeable shape as a single unit.

Current size from the branch diff:

23 commits
6,707 changed files
565,777 additions
dominated by generated benchmark/results artifacts

Recommendation: do not merge this draft as-is.

Suggested split order:

implementation/policy PR

Gemini replay hardening
verdict schema / repair telemetry / weighted-total validation
orchestrator routing guidance
scenario taxonomy
new real-world scenario definitions
tests and docs that directly explain those code changes

analysis/docs PR

judge-bias memos
repeatability and cross-model analysis writeups
model-role and orchestrator proposal docs that are primarily research-facing

benchmark evidence PR(s)

generated benchmarks/results/**
rejudge outputs
large run artifacts
only if we explicitly want those in git

Why this is the right move:

keeps the code reviewable
avoids burying schema/policy changes under generated files
lets us decide intentionally whether benchmark artifacts belong in the main repo history

Operational note:

the branch itself is safely pushed and remains the source of truth
local merge attempts in the OneDrive worktree hit filesystem/permission problems, so the safest path forward is branch/PR splitting rather than more local merge surgery in that worktree

If we want to keep momentum, I’d start by extracting the implementation/policy slice into a fresh branch/PR from main, and leave this draft PR as the archival umbrella for the full research corpus.

EdgeCaser · 2026-04-15T01:32:55Z

Superseded for merge purposes by PR #7.

Use this draft PR as the umbrella/archive for the full research corpus.

Recommended order now:

Merge PR #7 (Extract conflict harness implementation and policy) first.
Re-evaluate whether any result artifacts or research memos from this PR should be split into follow-up PRs.

Reason:

PR #7 (Extract conflict harness implementation and policy) is the reviewable implementation/policy slice.
This PR remains far too large to merge safely as a first landing unit.

EdgeCaser and others added 23 commits April 13, 2026 12:25

Add replication analysis and swap-aware harness metrics

a4fec6c

Add Gemini replay judging and richer verdict rationale

30ce4ee

Pin Gemini effort via project-local aliases

af4e62b

Harden Gemini replay judging on Windows

86dee45

Track decisive judge dimensions and repair telemetry

c97de44

Tighten weighted total validation for judge verdicts

4758f46

Add Gemini full-pass and cross-model analysis memos

9f28b47

Add scenario taxonomy for judge analysis

3e5bef0

Add judge-bias rationale and test plan notes

8d13e3f

Harden replay studies and add alignment findings

a5b4554

Add orchestrator policy for judge escalation

27742ee

Add six recent real-world strategy scenarios

e9bbe7b

Clarify orchestrator model routing guidance

13f0772

Refine outreach drafts and model guidance

4687d6d

Ignore docs/outreach/ and untrack draft posts

b52b889


EdgeCaser mentioned this pull request Apr 15, 2026

Extract conflict harness implementation and policy #7

Merged

EdgeCaser wants to merge 23 commits into main from codex/cross-model-debate-harness-spec
