Cross-model debate harness research, policy, and scenarios #6
Draft
EdgeCaser wants to merge 23 commits into
…ults
Co-authored spec between Claude and Codex (3 review rounds to sign-off).
Phase 1 implementation: head-to-head mode with sealed first-pass, structured critique exchange, blind adjudication, and subscription CLI adapters.
Pilot results on prd-hidden-scope-creep confirm judge family affinity: Claude judge picks Claude (margin 1.2), GPT judge picks GPT (margin 0.8). This empirically validates the spec's Phase 2 requirement for multi-judge calibration before publishing benchmark conclusions.
Includes: spec, 3 JSON schemas, case packet builder, conflict runner, 8 passing tests, full review exchange, and 2 pilot run transcripts.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ion gaps
Batch runner loops all scenarios with both Claude and GPT as judge, computes judge agreement rate, and gates publishability interpretation on full coverage. Unknown --scenario IDs now fail-closed instead of silently dropping.
Fixes both findings from Codex code review round 2:
- Partial batch coverage no longer produces false "usable" conclusions
- Typo'd scenario names throw immediately with available IDs listed
5 passing tests including partial-coverage and unknown-ID regression tests.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
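For reviewers, the fail-closed behaviour and the coverage gate described in this commit could look roughly like the sketch below. It is not the code in the branch: `resolveScenarioIds`, `isFullCoverage`, and the `{ scenario, judge }` run shape are hypothetical names chosen for illustration.

```javascript
// Hypothetical helper: accept only known scenario IDs and fail closed on anything else.
function resolveScenarioIds(requested, available) {
  const known = new Set(available);
  const unknown = requested.filter((id) => !known.has(id));
  if (unknown.length > 0) {
    // Throw immediately and list the valid IDs, rather than silently dropping typos.
    throw new Error(
      `Unknown scenario ID(s): ${unknown.join(', ')}. Available: ${available.join(', ')}`
    );
  }
  return requested;
}

// Hypothetical gate: a batch counts as fully covered only when every
// scenario has a completed run for every judge.
function isFullCoverage(completedRuns, scenarioIds, judges) {
  const done = new Set(completedRuns.map((run) => `${run.scenario}::${run.judge}`));
  return scenarioIds.every((scenario) =>
    judges.every((judge) => done.has(`${scenario}::${judge}`))
  );
}
```

Under these assumptions, a partial batch returns `false` from `isFullCoverage`, so the caller can decline to label the results as usable.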
…it reasoning effort
- Fix verdict schema: rubric dimension scores accept number (not just integer)
- Harden batch runner against EPIPE: child.on('error') calls finalize() instead
of reject(), stdin error handler, process-level uncaughtException/unhandledRejection
- Add --output-last-message file fallback for Codex CLI stdout fragility
- Add explicit reasoning effort (medium) for all sides and judges (Codex contribution)
- Add 4 real-world evidence-rich scenarios: Blockbuster Total Access, Netflix
Qwikster, Zillow Offers, Yahoo/Microsoft — to test whether judge family
affinity is prompt-conditioned
- Add 2 additional scenarios from Codex: meta-muse-spark, supermicro-export-controls
- Include interim calibration results and Claude-Codex review exchange
- Codex sign-off on batch runner (code review round 3)
Preliminary finding: judge family affinity observed on synthetic scenarios
(judges pick own family's side) may weaken on evidence-rich real-world cases.
Full wave 2 batch (22 runs) in progress to confirm.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… affinity
Full 11-scenario dual-judge calibration with explicit medium reasoning effort. Zero errors across 22 runs (EPIPE hardening + --output-last-message working).
Key finding: Claude judge picked Side A (Claude) in 10/11 scenarios (1 tie). GPT judge showed no consistent pattern (4 side_a, 3 side_b, 4 ties). Judge agreement rate: 27% (3/11).
The asymmetry, not symmetric family affinity, is the core result. Side-swap runs needed to decompose family affinity from position bias.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
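One way to read the agreement figure above is as the share of scenarios where both judges return the same verdict. The sketch below assumes that definition, including that a tie from both judges counts as agreement, which the commit message does not state. `judgeAgreementRate` and the `{ claude, gpt }` verdict shape are hypothetical.

```javascript
// Hypothetical helper: fraction of scenarios where both judges gave the same verdict.
// Verdicts are assumed to be one of 'side_a', 'side_b', or 'tie'.
function judgeAgreementRate(verdictsByScenario) {
  const entries = Object.values(verdictsByScenario);
  if (entries.length === 0) return null; // no scenarios, no rate
  const agreed = entries.filter((v) => v.claude === v.gpt).length;
  return { agreed, total: entries.length, rate: agreed / entries.length };
}

// Format a rate as a whole-number percentage string.
function asPercent(rate) {
  return `${Math.round(rate * 100)}%`;
}
```

With 3 agreements across 11 scenarios the rate is 3/11 ≈ 0.273, which `asPercent` renders as the 27% reported above.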
… decomposed
Swap test (5 scenarios, sides flipped): Claude judge shows 3/5 family affinity, 2/5 position bias. GPT judge shows 2/5 family affinity, 1/5 position bias. Neither pure effect explains all data; it's a scenario-dependent interaction.
Current-events scenarios (meta-muse-spark, supermicro-export-controls): same pattern as wave 2. Claude judge picks Claude, GPT judge variable.
Key revision: GPT judge's apparent neutrality in wave 2 was partially an artifact of always being in the disadvantaged Side B position. Swap test reveals GPT also shows family preference when placed in Side A.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…decomposed
Full 2x2 matrix: 11 scenarios x 2 judges x 2 orientations = 44 runs, 43 of them complete.
Claude judge: 70% family affinity (follows Claude output across position swap), 30% position bias. Almost never ties (1/21 verdicts).
GPT judge: 36% family, 27% position, 27% consistent ties, 10% mixed. More calibrated about uncertainty: its ties are stable across orientations.
Neither judge is reliable as sole evaluator. Swap-test replication is required to distinguish quality differences from systematic judge artifacts.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
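The decomposition above can be sketched as a classification of each judge's verdict pair (original orientation, swapped orientation). This is an illustrative reading rather than the analysis code in the branch. It assumes Claude's output sits on Side A in the original orientation and Side B after the swap. `classifySwapPair` and the label `opposite_family` are hypothetical, and the branch may fold that case into "mixed".

```javascript
// Assumed original orientation: Claude's output is Side A, GPT's is Side B.
const ORIGINAL_SIDES = { side_a: 'claude', side_b: 'gpt' };

// Hypothetical classifier for one judge on one scenario.
// judgeFamily: 'claude' or 'gpt'; verdicts: 'side_a', 'side_b', or 'tie'.
function classifySwapPair(judgeFamily, originalVerdict, swappedVerdict) {
  if (originalVerdict === 'tie' && swappedVerdict === 'tie') return 'consistent_tie';
  if (originalVerdict === 'tie' || swappedVerdict === 'tie') return 'mixed';
  // Same side wins in both orientations: the verdict tracks position, not content.
  if (originalVerdict === swappedVerdict) return 'position_bias';
  // Different sides win, so the same family's output won both times.
  const winner = ORIGINAL_SIDES[originalVerdict];
  return winner === judgeFamily ? 'family_affinity' : 'opposite_family';
}
```

For example, a Claude judge that picks `side_a` originally and `side_b` after the swap has followed Claude's output both times, so the pair is classified as `family_affinity`.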
Outreach drafts (Medium, LinkedIn, Reddit) are personal publishing artifacts, not part of the project surface. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Adds the tier-1 results memo and the shared letter to Claude/Gemini summarizing the first three-scenario replay sample. Updates the judge-feedback-request status to Implemented. Ignores slack-agent logs and tmp debug files. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…udges, new real-world scenarios
Adds the raw artifact outputs and summaries backing the recent analysis memos:
- Gemini 2.5 Pro and flash-lite replay judgments across 13 scenarios
- Repeatability replays (5x) on prd-hidden-scope-creep, handoff-contradiction, event-automation-boundary
- Artifact-matched three-family comparison outputs
- New real-world scenario runs (google-adtech-breakup-remedies, nissan-honda-merger-collapse, openai-nonprofit-control)
- 6-scenario batch summary and default-GPT batch summary
These are the raw evidence for claims in docs/review/ and should travel with the analysis.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Owner
Author
This PR is not in mergeable-review shape as a single unit. Current size from the branch diff:
Recommendation: do not merge this draft as-is. Suggested split order:
Why this is the right move:
Operational note:
If we want to keep momentum, I’d start by extracting the implementation/policy slice into a fresh branch/PR from |
Owner
Author
Superseded for merge purposes by PR #7. Use this draft PR as the umbrella/archive for the full research corpus. Recommended order now:
Reason:
Summary
This PR brings over the cross-model conflict-harness work from codex/cross-model-debate-harness-spec.
It includes:
Why PR Instead Of Local Merge
The local OneDrive-backed worktree hit repeated filesystem permission issues while trying to fast-forward and merge main, even after the branch was safely pushed. Opening a PR keeps the branch history intact and avoids further local filesystem risk.
Notes
docs/outreach/ is intentionally ignored and not part of this PR.
temp-main-switch-state-before-merge) that preserves interrupted local-only state, but the branch contents for merge live safely on GitHub.
Recommended Review Focus