
Cross-model debate harness research, policy, and scenarios #6

Draft
EdgeCaser wants to merge 23 commits into main from codex/cross-model-debate-harness-spec


Conversation

@EdgeCaser
Owner

Summary

This PR brings over the cross-model conflict-harness work from codex/cross-model-debate-harness-spec.

It includes:

  • Gemini replay judging hardening on Windows
  • richer verdict schema and repair telemetry
  • weighted-total validation tightening
  • judge-bias and repeatability analysis memos
  • orchestrator guidance for judge routing and escalation
  • scenario taxonomy for judge-analysis grouping
  • six new recent real-world strategy scenarios
  • committed benchmark/result artifacts already captured on the branch

Why PR Instead Of Local Merge

The local OneDrive-backed worktree hit repeated filesystem permission issues while trying to fast-forward and merge main, even after the branch was safely pushed. Opening a PR keeps the branch history intact and avoids further local filesystem risk.

Notes

  • docs/outreach/ is intentionally ignored and not part of this PR.
  • There is a local stash in the OneDrive worktree (temp-main-switch-state-before-merge) that preserves interrupted local-only state, but the branch contents for merge live safely on GitHub.

Recommended Review Focus

  • whether benchmark/result artifacts should land in this PR as-is or be split
  • orchestrator routing policy and default-model guidance
  • judge-schema / repair-path changes
  • new scenario framing quality

EdgeCaser and others added 23 commits April 13, 2026 12:25
…ults

Co-authored spec between Claude and Codex (3 review rounds to sign-off).
Phase 1 implementation: head-to-head mode with sealed first-pass, structured
critique exchange, blind adjudication, and subscription CLI adapters.

Pilot results on prd-hidden-scope-creep confirm judge family affinity:
Claude judge picks Claude (margin 1.2), GPT judge picks GPT (margin 0.8).
This empirically validates the spec's Phase 2 requirement for multi-judge
calibration before publishing benchmark conclusions.

Includes: spec, 3 JSON schemas, case packet builder, conflict runner,
8 passing tests, full review exchange, and 2 pilot run transcripts.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ion gaps

Batch runner loops all scenarios with both Claude and GPT as judge, computes
judge agreement rate, and gates publishability interpretation on full coverage.
Unknown --scenario IDs now fail-closed instead of silently dropping.

Fixes both findings from Codex code review round 2:
- Partial batch coverage no longer produces false "usable" conclusions
- Typo'd scenario names throw immediately with available IDs listed

5 passing tests including partial-coverage and unknown-ID regression tests.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
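The two fixes described in this commit (fail-closed scenario IDs and coverage-gated interpretation) can be illustrated with a minimal sketch. It is written in Python for brevity and is not the harness's actual code; `KNOWN_SCENARIOS`, `resolve_scenarios`, and `interpret_batch` are hypothetical names, and the registry below only reuses three scenario IDs that appear elsewhere in this PR.

```python
# Hypothetical registry; the real harness presumably loads scenario definitions from disk.
KNOWN_SCENARIOS = {
    "prd-hidden-scope-creep",
    "handoff-contradiction",
    "event-automation-boundary",
}


def resolve_scenarios(requested, known):
    """Fail closed: any unknown ID raises instead of being silently dropped."""
    unknown = sorted(set(requested) - set(known))
    if unknown:
        raise ValueError(
            f"Unknown scenario id(s): {', '.join(unknown)}. "
            f"Available: {', '.join(sorted(known))}"
        )
    return list(requested)


def interpret_batch(expected_runs, completed_runs):
    """Only call a batch 'usable' when every expected (scenario, judge) run completed."""
    missing = sorted(set(expected_runs) - set(completed_runs))
    if missing:
        return {"status": "incomplete", "missing": missing}
    return {"status": "usable", "missing": []}
```

The design point is that both checks refuse to produce a conclusion from a partial or mistyped input, which matches the regression tests named in the commit message.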
…it reasoning effort

- Fix verdict schema: rubric dimension scores accept number (not just integer)
- Harden batch runner against EPIPE: child.on('error') calls finalize() instead
  of reject(), stdin error handler, process-level uncaughtException/unhandledRejection
- Add --output-last-message file fallback for Codex CLI stdout fragility
- Add explicit reasoning effort (medium) for all sides and judges (Codex contribution)
- Add 4 real-world evidence-rich scenarios: Blockbuster Total Access, Netflix
  Qwikster, Zillow Offers, Yahoo/Microsoft — to test whether judge family
  affinity is prompt-conditioned
- Add 2 additional scenarios from Codex: meta-muse-spark, supermicro-export-controls
- Include interim calibration results and Claude-Codex review exchange
- Codex sign-off on batch runner (code review round 3)

Preliminary finding: judge family affinity observed on synthetic scenarios
(judges pick own family's side) may weaken on evidence-rich real-world cases.
Full wave 2 batch (22 runs) in progress to confirm.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
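Because rubric dimension scores may now be non-integer numbers, any check on the weighted total has to compare floats with a tolerance. The following is a rough Python sketch of what such a check could look like, not the schema or validator from this branch; the field names (`scores`, `weights`, `reported_total`), the dimension names, and the tolerance are all assumptions.

```python
import math


def validate_weighted_total(scores, weights, reported_total, tol=1e-6):
    """Recompute the weighted total and reject verdicts whose reported total disagrees.

    scores and weights are dicts keyed by rubric dimension name (hypothetical layout).
    """
    if scores.keys() != weights.keys():
        raise ValueError("scores and weights must cover the same rubric dimensions")
    for name, value in scores.items():
        # Accept int or float, mirroring a JSON Schema "number"; bool is excluded explicitly
        # because Python treats bool as a subclass of int.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            raise TypeError(f"score for {name!r} must be a number")
    expected = sum(scores[name] * weights[name] for name in scores)
    if not math.isclose(expected, reported_total, abs_tol=tol):
        raise ValueError(
            f"weighted total mismatch: reported {reported_total}, recomputed {expected}"
        )
    return expected
```

For example, scores of 4 and 3.5 with equal weights of 0.5 recompute to 3.75, so a verdict reporting 3.75 passes and one reporting 4 is rejected.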
… affinity

Full 11-scenario dual-judge calibration with explicit medium reasoning effort.
Zero errors across 22 runs (EPIPE hardening + --output-last-message working).

Key finding: Claude judge picked Side A (Claude) in 10/11 scenarios (1 tie).
GPT judge showed no consistent pattern (4 side_a, 3 side_b, 4 ties).
Judge agreement rate: 27% (3/11).

The asymmetry — not symmetric family affinity — is the core result.
Side-swap runs needed to decompose family affinity from position bias.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
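For reference, the agreement rate quoted above is a simple ratio over scenarios that both judges scored. A minimal Python sketch follows, assuming each judge's verdicts are stored as a dict from scenario ID to one of `"side_a"`, `"side_b"`, or `"tie"` (a hypothetical layout), and assuming that two ties count as agreement, which the commit message does not specify.

```python
def agreement_rate(verdicts_a, verdicts_b):
    """Return (agreements, shared_scenarios, rate) for two judges' verdict dicts."""
    shared = verdicts_a.keys() & verdicts_b.keys()
    if not shared:
        raise ValueError("judges share no scenarios; agreement is undefined")
    agreements = sum(1 for sid in shared if verdicts_a[sid] == verdicts_b[sid])
    return agreements, len(shared), agreements / len(shared)


# Arithmetic check on the reported figure: 3 agreements over 11 scenarios is about 27%.
reported_percent = round(100 * 3 / 11)
```

Restricting the denominator to shared scenarios keeps a partially completed batch from inflating or deflating the rate.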
… decomposed

Swap test (5 scenarios, sides flipped): Claude judge shows 3/5 family affinity,
2/5 position bias. GPT judge shows 2/5 family affinity, 1/5 position bias.
Neither pure effect explains all data — it's a scenario-dependent interaction.

Current-events scenarios (meta-muse-spark, supermicro-export-controls): same
pattern as wave 2 — Claude judge picks Claude, GPT judge variable.

Key revision: GPT judge's apparent neutrality in wave 2 was partially an artifact
of always being in the disadvantaged Side B position. Swap test reveals GPT also
shows family preference when placed in Side A.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…decomposed

Full 2x2 matrix: 11 scenarios x 2 judges x 2 orientations = 43/44 runs complete.

Claude judge: 70% family affinity (follows Claude output across position swap),
30% position bias. Almost never ties (1/21 verdicts).

GPT judge: 36% family, 27% position, 27% consistent ties, 10% mixed.
More calibrated about uncertainty — ties are stable across orientations.

Neither judge reliable as sole evaluator. Swap-test replication required to
distinguish quality differences from systematic judge artifacts.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
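The family/position/tie/mixed breakdown above can be read as a per-scenario classification of two verdicts from the same judge, one per orientation. The Python sketch below shows one plausible rule set; the exact rules used in the analysis are not stated in this PR, so the function name, the default side assignments, and the handling of edge cases are assumptions.

```python
def classify_swap(original, swapped, judge_family,
                  side_a_family="claude", side_b_family="gpt"):
    """Classify one judge's verdict pair for a scenario.

    original: verdict with side_a_family on Side A ("side_a", "side_b", or "tie").
    swapped:  verdict for the same scenario with the two families' sides exchanged.
    """
    if original == "tie" and swapped == "tie":
        return "consistent_tie"
    if "tie" in (original, swapped):
        return "mixed"  # a tie in only one orientation fits no pure pattern
    if original == swapped:
        return "position"  # same side wins regardless of which family sits there
    # Different sides won, so the same family's output won in both orientations.
    winner = side_a_family if original == "side_a" else side_b_family
    # Assumption: following the *other* family's output is lumped into "mixed" here;
    # it could equally be tracked as its own bucket.
    return "family" if winner == judge_family else "mixed"
```

A "family" label only says the verdict followed the judge's own family across the swap. As the commit message notes, replication is still needed to separate that from a real quality difference.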
Outreach drafts (Medium, LinkedIn, Reddit) are personal publishing artifacts, not part of the project surface.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Adds the tier-1 results memo and the shared letter to Claude/Gemini summarizing the first three-scenario replay sample. Updates the judge-feedback-request status to Implemented. Ignores slack-agent logs and tmp debug files.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…udges, new real-world scenarios

Adds the raw artifact outputs and summaries backing the recent analysis memos:
- Gemini 2.5 Pro and flash-lite replay judgments across 13 scenarios
- Repeatability replays (5x) on prd-hidden-scope-creep, handoff-contradiction, event-automation-boundary
- Artifact-matched three-family comparison outputs
- New real-world scenario runs (google-adtech-breakup-remedies, nissan-honda-merger-collapse, openai-nonprofit-control)
- 6-scenario batch summary and default-GPT batch summary

These are the raw evidence for claims in docs/review/ and should travel with the analysis.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Owner Author

This PR is not in mergeable-review shape as a single unit.

Current size from the branch diff:

  • 23 commits
  • 6,707 changed files
  • 565,777 additions
  • dominated by generated benchmark/results artifacts

Recommendation: do not merge this draft as-is.

Suggested split order:

  1. implementation/policy PR
     • Gemini replay hardening
     • verdict schema / repair telemetry / weighted-total validation
     • orchestrator routing guidance
     • scenario taxonomy
     • new real-world scenario definitions
     • tests and docs that directly explain those code changes
  2. analysis/docs PR
     • judge-bias memos
     • repeatability and cross-model analysis writeups
     • model-role and orchestrator proposal docs that are primarily research-facing
  3. benchmark evidence PR(s)
     • generated benchmarks/results/**
     • rejudge outputs
     • large run artifacts
     • only if we explicitly want those in git

Why this is the right move:

  • keeps the code reviewable
  • avoids burying schema/policy changes under generated files
  • lets us decide intentionally whether benchmark artifacts belong in the main repo history

Operational note:

  • the branch itself is safely pushed and remains the source of truth
  • local merge attempts in the OneDrive worktree hit filesystem/permission problems, so the safest path forward is branch/PR splitting rather than more local merge surgery in that worktree

If we want to keep momentum, I’d start by extracting the implementation/policy slice into a fresh branch/PR from main, and leave this draft PR as the archival umbrella for the full research corpus.

Owner Author

Superseded for merge purposes by PR #7.

Use this draft PR as the umbrella/archive for the full research corpus.

Recommended order now:

  1. Merge PR #7 (Extract conflict harness implementation and policy) first.
  2. Re-evaluate whether any result artifacts or research memos from this PR should be split into follow-up PRs.

Reason:
