Summary
Artifacts generated by the conflict harness regularly contain fabricated specifics — numbers, benchmarks, and citations invented by the model to make arguments feel more grounded than the evidence supports. The structural layer (decision frames, evidence gap lists, pass/fail gates) is reliable. The specificity layer is not.
Evidence
Example 1 — Fabricated citation (retention-vs-growth-tradeoff)
The scenario has zero context files. Side A's final artifact cites:
"Retention investment generates higher ROI vs. new acquisition spend (internal modeling, ctx-1)"
ctx-1 does not exist. The model invented a citation to a non-existent source document. The judge did not flag it.
Example 2 — Fabricated industry benchmarks (multi-tenant-shared-learning)
"Enterprise customers lose 0.3–1.8% of GMV to fraud."
"Near-parity with centralized training at sufficient tenant count (≥20–30 active tenants)"
Neither figure is in the case packet. The judge did catch this one and penalized Side A for it — but the numbers are still in the committed artifact.
Example 3 — Fabricated cost estimate (mobile-first-single-survey)
Scenario states only "6 months of engineering effort." Side A adds:
"A 6-month engineering commitment represents $600K–$1.2M in fully-loaded cost"
Not in the scenario. Invented to strengthen the evidence-gate argument.
Example 4 — Fabricated segment size (churn-correlation-causation)
"closing the 18-percentage-point churn gap on even 20% of the inactive-collaboration segment"
The 20% figure is invented. The structural point is correct; the number is not.
Pattern
The model generates a structurally correct argument, then populates it with invented specifics to make the argument feel grounded. More pronounced in executive_ambiguity and historical_strategy scenarios (sparse evidence, room to fill). Less pronounced in evidence_fragile_prd scenarios (the prompt contract makes fabrication less available as a rhetorical move).
What Is and Isn't Broken
| Artifact element |
Reliability |
| Decision frame |
High |
| Evidence gap list |
High |
| Pass/fail gates |
High |
| Recommended next artifact |
High |
| Specific numbers, thresholds, benchmarks |
Low — treat as invented until verified |
| Citations to context files (ctx-N) |
Low — must be verified; may point to non-existent documents |
| Industry statistics |
Low — drawn from model training data, not the packet |
The harness is working correctly as a reasoning quality evaluator. It is not a factual accuracy engine and should not be positioned as one.
Recommended Fixes
1. Add fabrication detection to the judge prompt
Instruct the judge to explicitly flag any quantitative claim not traceable to the case packet:
"For each piece of quantitative evidence cited by either side, note whether it appears in the case packet. Flag any claim that introduces specificity not present in the packet."
2. Validate ctx-N references before generation
The harness knows which context files exist (case_packet.evidence). Inject an explicit constraint into the first-pass prompt:
"The only evidence available to you is contained in the case packet above. Do not introduce statistics, benchmarks, or citations not present in the packet."
Post-generation, scan for ctx-N references and verify each against the actual evidence array. Flag or reject artifacts with phantom citations before they reach the judge.
3. Add unsupported_claim_count to the verdict schema
The run schema already tracks disagreement_rate, declared_adoption_rate, and substantive_revision_rate. An unsupported_claim_count field — populated by the judge's explicit count of ungrounded claims — would make fabrication a first-class measurable output.
Full Write-Up
docs/review/fabricated-evidence-findings-2026-04-16.md
Summary
Artifacts generated by the conflict harness regularly contain fabricated specifics — numbers, benchmarks, and citations invented by the model to make arguments feel more grounded than the evidence supports. The structural layer (decision frames, evidence gap lists, pass/fail gates) is reliable. The specificity layer is not.
Evidence
Example 1 — Fabricated citation (
retention-vs-growth-tradeoff)The scenario has zero context files. Side A's final artifact cites:
ctx-1does not exist. The model invented a citation to a non-existent source document. The judge did not flag it.Example 2 — Fabricated industry benchmarks (
multi-tenant-shared-learning)Neither figure is in the case packet. The judge did catch this one and penalized Side A for it — but the numbers are still in the committed artifact.
Example 3 — Fabricated cost estimate (
mobile-first-single-survey)Scenario states only "6 months of engineering effort." Side A adds:
Not in the scenario. Invented to strengthen the evidence-gate argument.
Example 4 — Fabricated segment size (
churn-correlation-causation)The 20% figure is invented. The structural point is correct; the number is not.
Pattern
The model generates a structurally correct argument, then populates it with invented specifics to make the argument feel grounded. More pronounced in
executive_ambiguityandhistorical_strategyscenarios (sparse evidence, room to fill). Less pronounced inevidence_fragile_prdscenarios (the prompt contract makes fabrication less available as a rhetorical move).What Is and Isn't Broken
The harness is working correctly as a reasoning quality evaluator. It is not a factual accuracy engine and should not be positioned as one.
Recommended Fixes
1. Add fabrication detection to the judge prompt
Instruct the judge to explicitly flag any quantitative claim not traceable to the case packet:
2. Validate ctx-N references before generation
The harness knows which context files exist (
case_packet.evidence). Inject an explicit constraint into the first-pass prompt:Post-generation, scan for
ctx-Nreferences and verify each against the actual evidence array. Flag or reject artifacts with phantom citations before they reach the judge.3. Add
unsupported_claim_countto the verdict schemaThe run schema already tracks
disagreement_rate,declared_adoption_rate, andsubstantive_revision_rate. Anunsupported_claim_countfield — populated by the judge's explicit count of ungrounded claims — would make fabrication a first-class measurable output.Full Write-Up
docs/review/fabricated-evidence-findings-2026-04-16.md