fix(summarize-insights): drop reader-engagement filler, reframe audience as downstream AI #708

Merged
neoneye merged 3 commits into main from fix/napkin-math-insights-no-filler
May 16, 2026


@neoneye neoneye commented May 16, 2026

Summary

The user pointed out that the line "If you read nothing else, read this." in insights.md is wasted tokens:

The planexe report is big, and humans spend on average 7 seconds before they navigate away. It's for AI consumption, not humans. So the "If you read nothing else, read this." is wasting tokens.

The structural markers (## Bad news first, ### Likely deal-breakers, the verdict labels) already do the work that prefix tried to do. The reader-hook framing came from when we were optimising for a project-manager audience; the actual primary consumer is downstream AI.

What changed

summarize_insights.py: removed the prefix from render_bad_news_first. The substantive sentence that explains what items are in the section stays:

Every item below is a signal the plan does not survive its own assumptions.
Items are ordered by severity.
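
The shape of the renderer after this change can be sketched roughly as follows. This is a minimal illustration, not the code in summarize_insights.py: the function name `render_bad_news_first_sketch`, the `(severity, text)` item layout, and the bullet formatting are all assumptions made for the example. Only the heading and the two retained sentences come from the PR.

```python
def render_bad_news_first_sketch(items):
    """Render the section without the reader-engagement prefix.

    items: list of (severity, text) tuples; a higher severity number is
    assumed to mean a worse finding (hypothetical layout).
    """
    lines = [
        "## Bad news first",
        "",
        # The substantive explanation stays; the engagement hook is gone.
        "Every item below is a signal the plan does not survive its own assumptions.",
        "Items are ordered by severity.",
        "",
    ]
    # Worst items first.
    for _, text in sorted(items, key=lambda item: item[0], reverse=True):
        lines.append(f"- {text}")
    return "\n".join(lines)


section = render_bad_news_first_sketch([
    (1, "minor schedule slip"),
    (3, "funding gate rarely passes"),
])
```

The output opens directly with the structural heading, which is what a downstream AI consumer keys on.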

summarize-insights/SKILL.md: new ## Audience and tone section codifying the principle so it survives future edits:

  • No reader-engagement prefixes
  • No filler sentences whose only job is to motivate the next sentence
  • Keep substantive explanations (what a verdict label means, what a column shows) — those are signal, not filler
  • Don't apologise for the bad news

The no-sycophancy rule also had "the reader" replaced with "the downstream consumer" to match the audience reframe.

Test plan

  • All three reference plans (Nuuk v31, Cross-Border Rail v33, Faraday v33) regenerate clean insights.md files with the prefix gone
  • Smoke 8/8 (tests/run_smoke.py)
  • Unittest 45/45 (tests/test_run_monte_carlo.py)

🤖 Generated with Claude Code

neoneye added 3 commits May 16, 2026 17:28
…nce as downstream AI

'If you read nothing else, read this.' was a reader hook for humans, but the primary consumer of insights.md is downstream AI (another agent, a planning loop, a follow-on extractor) — token-density of useful signal matters more than engagement. The structural markers (## Bad news first, ### Likely deal-breakers, verdict labels) already do the work the prefix tried to do; restating it in prose burns tokens.

Removed the prefix from render_bad_news_first. The substantive sentence ('Every item below is a signal the plan does not survive its own assumptions. Items are ordered by severity.') stays because it explains what's in the section.

SKILL.md gains a new 'Audience and tone' section codifying the principle: no reader-engagement prefixes, no filler sentences whose only job is to motivate the next sentence, keep substantive explanations (signal, not filler). Replaced 'reader' with 'downstream consumer' in the no-sycophancy rule to reflect the audience reframe.

Verified: all three reference plans regenerate insights.md cleanly with the prefix gone. Smoke 8/8, unittest 45/45.
ChatGPT's review of v33 raised 15 items; this commit ships the five 'quick win' ones that fit on top of the existing per-run state. All five are runner-side analyses plus matching insights.md sections — no schema changes, no LLM prompt edits.

§14 binding-gate frequency tracking. For every min() aggregate, the runner records which dependency provided the min value in each run, then aggregates conditional on the aggregate failing its threshold. Faraday demonstration: the weakest_program_gate fails in 9,826 of 10,000 runs; mil_std_cert_funding is the binder in 67% of those, cash_flow_trigger in 32%, inventory_overhang in 0.5%. That tells the reader which sub-gate to fix first; the previous output only knew it failed.
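
A minimal sketch of the binding-gate tally described above, under stated assumptions: each run is a dict of dependency name to value, the aggregate is `min()` over those values, and it passes when the minimum is at or above the threshold. The function name, the data layout, and the toy values are illustrative only. The dependency names are borrowed from the Faraday example.

```python
from collections import Counter


def binding_gate_frequency(runs, threshold):
    """Tally which dependency supplied the min() value in failing runs.

    runs: list of dicts, dependency name -> value for one run (assumed layout).
    threshold: the aggregate passes when min(values) >= threshold (assumed).
    Returns (number of failing runs, {dependency: share of failing runs}).
    """
    binders = Counter()
    failures = 0
    for run in runs:
        binder = min(run, key=run.get)  # dependency that provided the min value
        if run[binder] < threshold:     # condition on the aggregate failing
            failures += 1
            binders[binder] += 1
    if failures == 0:
        return 0, {}
    return failures, {name: count / failures for name, count in binders.items()}


# Toy data, not the real Faraday runs.
runs = [
    {"mil_std_cert_funding": 0.2, "cash_flow_trigger": 0.6, "inventory_overhang": 0.9},
    {"mil_std_cert_funding": 0.7, "cash_flow_trigger": 0.3, "inventory_overhang": 0.8},
    {"mil_std_cert_funding": 0.1, "cash_flow_trigger": 0.4, "inventory_overhang": 0.9},
    {"mil_std_cert_funding": 0.9, "cash_flow_trigger": 0.8, "inventory_overhang": 0.7},
]
failures, shares = binding_gate_frequency(runs, threshold=0.5)
```

Passing runs are excluded from the tally on purpose: the binder in a passing run says nothing about what to fix.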

§7 quartile pass-rates. For each threshold × driver, P(threshold passes | driver in bottom quartile) vs P(threshold passes | driver in top quartile). The delta in percentage points is much more actionable than Pearson r — 'P(coverage 99%) goes from 18% in worst-quartile satellite-failure runs to 74% in best-quartile' is a directly usable lever.
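
The quartile comparison can be sketched as below. This assumes quartiles are taken by sorting runs on the raw driver value (lowest quarter versus highest quarter) and that any remainder runs in the middle are simply ignored. The function name and the toy data are hypothetical.

```python
def quartile_pass_rates(driver_values, passed):
    """Compare threshold pass rates in the bottom and top driver quartiles.

    driver_values: sampled driver value per run.
    passed: bool per run, whether the threshold passed.
    Returns (P(pass | bottom quartile), P(pass | top quartile), delta in pp).
    """
    paired = sorted(zip(driver_values, passed), key=lambda pair: pair[0])
    q = len(paired) // 4
    if q == 0:
        raise ValueError("need at least 4 runs to form quartiles")
    p_bottom = sum(ok for _, ok in paired[:q]) / q
    p_top = sum(ok for _, ok in paired[-q:]) / q
    return p_bottom, p_top, (p_top - p_bottom) * 100.0


# Toy data: eight runs, driver values 1..8.
p_bottom, p_top, delta_pp = quartile_pass_rates(
    [1, 2, 3, 4, 5, 6, 7, 8],
    [False, True, False, False, True, False, True, True],
)
```

Whether "bottom" means the worst or the best quartile depends on the driver's direction, so a real implementation would need that sign from the model.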

§13 required-input thresholds. For each FAILING gate (P < 80%), find the input-bound restriction that would lift conditional pass rate to >= 80%. Empty list means no single-input restriction is enough — Faraday's weakest_program_gate gets an empty list, correctly diagnosing it as structurally unreachable.
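
One way such a search could work is sketched below. The commit message does not say how the restriction is found, so the scan strategy here is an assumption: for each input, try one-sided cutoffs from loosest to tightest and keep the first one whose conditional pass rate reaches the target. The function name, the `min_runs` guard, and the toy data are hypothetical.

```python
def required_input_thresholds(samples, passed, target=0.80, min_runs=3):
    """Find single-input restrictions that lift the conditional pass rate.

    samples: dict, input name -> list of sampled values (one per run).
    passed: list of bools (one per run) for a gate that currently fails.
    Returns a list of (input name, op, cutoff, conditional pass rate);
    an empty list means no single-input restriction reaches the target.
    """
    results = []
    for name, values in samples.items():
        for op in (">=", "<="):
            # Loosest cutoff first, so the first hit keeps the most runs.
            cutoffs = sorted(set(values), reverse=(op == "<="))
            for cutoff in cutoffs:
                kept = [ok for v, ok in zip(values, passed)
                        if (v >= cutoff if op == ">=" else v <= cutoff)]
                if len(kept) < min_runs:
                    break  # too few runs left to trust the rate
                rate = sum(kept) / len(kept)
                if rate >= target:
                    results.append((name, op, cutoff, rate))
                    break
    return results


# Toy data: the gate passes only when "budget" is high; "noise" is unrelated.
samples = {
    "budget": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    "noise": [5, 1, 9, 3, 7, 2, 10, 4, 8, 6],
}
passed = [False] * 6 + [True] * 4
restrictions = required_input_thresholds(samples, passed)
unreachable = required_input_thresholds(samples, [False] * 10)
```

In the toy data only `budget >= 6` reaches the 80% target. A gate that never passes yields an empty list, which mirrors the empty result reported for Faraday's weakest_program_gate.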

§8 missing-value priority. Rank missing_values entries by |delta_pp on worst gate| * (1 - pass_prob) * bound_width_ratio. The highest-scoring entries are the ones most worth replacing with real data instead of an assumed range.
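
The ranking formula can be sketched directly. The dict keys below mirror the terms in the formula, but the entry layout, the entry names, and the numbers are hypothetical; `delta_pp` is assumed to be the entry's percentage-point delta on the worst gate and `pass_prob` that gate's pass probability.

```python
def priority_score(entry):
    """|delta_pp on worst gate| * (1 - pass_prob) * bound_width_ratio."""
    return (abs(entry["delta_pp"])
            * (1.0 - entry["pass_prob"])
            * entry["bound_width_ratio"])


def rank_missing_values(entries):
    """Return entries sorted so the most valuable data to collect comes first."""
    return sorted(entries, key=priority_score, reverse=True)


# Hypothetical entries, for illustration only.
entries = [
    {"name": "permit_lead_time", "delta_pp": 25.0, "pass_prob": 0.9, "bound_width_ratio": 1.0},
    {"name": "unit_cost", "delta_pp": -40.0, "pass_prob": 0.2, "bound_width_ratio": 1.5},
    {"name": "staff_turnover", "delta_pp": 10.0, "pass_prob": 0.5, "bound_width_ratio": 2.0},
]
ranked = rank_missing_values(entries)
```

The product form means an entry scores high only when it moves the gate a lot, the gate is far from passing, and the assumed range is wide. A near-zero value in any factor pushes the entry down the list.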

§10 model confidence grades. Per output, grade HIGH/MEDIUM/LOW based on the fraction of upstream input bounds anchored in 'data' vs 'assumption' and the average bound-width-to-base ratio. Cutoffs: data >= 70% AND width < 0.5 -> HIGH; data < 30% OR width > 1.5 -> LOW; else MEDIUM. The reasons array names the specific evidence.
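
The cutoffs above translate into a small grading function. This is a sketch: the function name and the wording of the reason strings are assumptions, while the thresholds are the ones stated in the commit message.

```python
def confidence_grade(data_fraction, avg_width_ratio):
    """Grade one output from its upstream input bounds.

    data_fraction: share of upstream bounds anchored in 'data' (0..1).
    avg_width_ratio: average bound-width-to-base ratio.
    Returns (grade, reasons).
    """
    reasons = [
        f"{data_fraction:.0%} of upstream bounds anchored in data",
        f"average bound-width-to-base ratio {avg_width_ratio:.2f}",
    ]
    if data_fraction >= 0.70 and avg_width_ratio < 0.5:
        grade = "HIGH"
    elif data_fraction < 0.30 or avg_width_ratio > 1.5:
        grade = "LOW"
    else:
        grade = "MEDIUM"
    return grade, reasons


grade, reasons = confidence_grade(0.8, 0.3)
```

The HIGH and LOW conditions cannot both hold, so the order of the two checks does not change the result. Boundary values such as a width of exactly 0.5 fall through to MEDIUM.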

Five new render functions in summarize_insights.py emit these as separate sections after the existing verdict table. Five new unittest.TestCase methods (TestNewAnalysisBlocks) cover each block end-to-end against a small synthetic fixture. Smoke 8/8, unittest 50/50. Reference runs regenerated for Nuuk, Cross-Border Rail, Faraday, India Census.
* main:
  prompts: add Hauts-de-France hyperscale AI datacenter test case
@neoneye neoneye merged commit 2b969bd into main May 16, 2026
3 checks passed
@neoneye neoneye deleted the fix/napkin-math-insights-no-filler branch May 16, 2026 22:09
neoneye added a commit that referenced this pull request May 16, 2026
Two small wording changes from ChatGPT's v38 feedback (the reviewer called the format 'production-candidate; freeze the structure', and these are the only follow-ups).

1) DOOM verdict band 'almost certainly fails' -> 'rarely passes under current bounds'. Avoids the epistemic overclaim of 'certainly' on a model-relative pass rate.

2) Decision implications intro line reworded from 'The actual plan revision is for human or LLM interpretation against the source report' to 'This section identifies the affected planning lever; concrete revisions should be derived by reading the source report and the relevant intermediary artifacts.' Less self-referential.

Status doc renamed 20260516_claude.md -> 20260517_claude.md and rewritten for the current state: PR #708 (merged, five analysis blocks) and PR #710 (in flight, this branch, the v34->v38 insights-format iteration). Schema v3 frozen. Cross-plan validation extended to five distinct domains. Open issues mostly carry over from 0516 plus a new manifest-regression-test gap.