
napkin-math(compress): paraphrase-tolerant quote verification #744

Merged: neoneye merged 2 commits into main from napkin-math/loosen-quote-match on May 21, 2026
@neoneye
Member

@neoneye neoneye commented May 21, 2026

Summary

  • Problem. quote_is_in_source was a strict substring check after normalisation. When the compress LLM paraphrased (dropped intermediate words, reordered the noun phrase), the check failed even though every content token came from the source. Because composite_score adds a +10 verified-quote bonus on a 1-25 e × r base, one missed verification can drop a load-bearing tripwire below the MAX_ITEMS_PER_BUCKET = 6 cutoff. This is the ranking-layer gap that PR #743 ("napkin-math(compress): second-pass shifts the variance failure from emission to ranking") left open: in the v52 inspection, the paperclip $75,000 OPC UA bid had qv=False because of a reordered noun phrase ("lowest qualified middleware bid" vs the source's "lowest qualified bid for OPC UA middleware") and lost to verified explicits with lower e×r.
  • Fix. Keep the substring check as the fast path. On miss, fall back to token-overlap that requires every quote token to appear in the source token set, after a min-3-token gate. Tokenisation is Unicode word boundaries (\w+), language- and domain-neutral. The all-tokens rule subsumes any digit-bearing anchor (numbers can't be substituted) and rejects any content-word substitution ("highest" for "lowest") regardless of quote length.
  • Out of scope. The emission-layer gap (LLM not producing a relevant item in either compress pass) is not addressed here; this PR only changes verification of items the LLM does emit.
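
The Fix bullet above can be sketched roughly as follows. This is a minimal illustration, not the merged code: the normalisation step (casefold plus whitespace collapse), the constant name MIN_FALLBACK_TOKENS, and the handling of the "NOT IN SOURCE" sentinel are assumptions made for the example. Only the function name quote_is_in_source, the \w+ tokenisation, the min-3-token gate, and the all-tokens rule come from the description.

```python
import re

# Hypothetical name for the min-3-token gate described in the PR.
MIN_FALLBACK_TOKENS = 3

# \w+ on str patterns is Unicode-aware by default in Python 3.
_TOKEN_RE = re.compile(r"\w+")


def normalise(text: str) -> str:
    """Assumed normalisation: casefold and collapse whitespace runs."""
    return " ".join(text.casefold().split())


def tokenise(text: str) -> list[str]:
    """Split normalised text into word tokens."""
    return _TOKEN_RE.findall(normalise(text))


def quote_is_in_source(quote: str, source: str) -> bool:
    norm_quote = normalise(quote)
    # Assumption: empty quotes and the "NOT IN SOURCE" sentinel never verify.
    if not norm_quote or norm_quote == "not in source":
        return False
    # Fast path: strict substring match after normalisation.
    if norm_quote in normalise(source):
        return True
    # Fallback: every quote token must appear in the source token set.
    quote_tokens = tokenise(quote)
    if len(quote_tokens) < MIN_FALLBACK_TOKENS:
        return False
    source_tokens = set(tokenise(source))
    return all(token in source_tokens for token in quote_tokens)
```

Under these assumptions, "lowest qualified middleware bid" verifies against a source containing "lowest qualified bid for OPC UA middleware", while "highest qualified middleware bid" does not, because "highest" is absent from the source token set. A figure such as "$75,000" tokenises to "75" and "000", so a substituted amount fails on its leading digits.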

Empirical posture

These are same-LLM (Gemini Flash Lite), same-session re-runs, so the results are a regression check on the deterministic verifier and a behavioural probe, not an improvement claim.

  • Unit tests: 28 pass (21 prior + 7 new). New cases cover substring fast-path (regression), NOT IN SOURCE and empty, paraphrase via reorder/drop (intended True), hallucinated number (intended False), short-quote substitution (intended False), short unrelated quote (intended False), and a 14-token long-quote highest/lowest substitution (intended False — locks in the all-tokens rule against any future loosening).
  • Threshold-tightening cost: simulated 90% vs 100% overlap against the v53 outputs (1366 scored candidates, 8 compress runs across 6 plans): 0 items lose qv=True under tightening. The 90% rule was never functionally looser than the all-tokens rule on observed data; LLM paraphrases either reorder/elide (full token-overlap) or hit the substring fast path.
  • Paperclip acceptance (3× compress):
    • v53a: $75k bid in public gates_and_thresholds top-6 at position 2, qv=True
    • v53b: $75k bid in public gates_and_thresholds top-6 at position 3, qv=True
    • v53c: $75k bid emitted with qv=True, but in the risks_and_shocks/premortem bucket (candidate Staging #12, e=5 r=4 stress_test, score ~30). Three other score-30 items beat it on tie-break for the last 3 public slots. Net: not in any public top-6 in v53c. This is bucket-categorisation variance (LLM placed the gate under "risk") plus tie-break, not emission failure or quote-match failure.
    • Net: 2 of 3 paperclip runs have $75k in public top-6 (vs 1 of 3 before); 3 of 3 verified-when-emitted.
  • Regression sweep across 5 other probes (euro_adoption, yellowstone, crate, mars_gtld, datacenter_france): 165 fallback flips (13.7% of 1206 qv=True). Random-30 audit shows all legitimate paraphrases (dropped intermediate words, em-dash/hyphen variants, ellipsis-elided clauses) — no hallucinations, no content-word substitutions.
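
The score arithmetic behind the Problem bullet and the v53c tie-break can be sketched as below. The names composite_score and MAX_ITEMS_PER_BUCKET and the +10 bonus come from this PR; the Candidate class, the VERIFIED_QUOTE_BONUS constant name, the assumption that e and r each range over 1 to 5, and the tie-break (Python's stable sort, so ties keep input order) are illustrative guesses, not the project's actual code.

```python
from dataclasses import dataclass

MAX_ITEMS_PER_BUCKET = 6
VERIFIED_QUOTE_BONUS = 10  # hypothetical constant name for the +10 bonus


@dataclass(frozen=True)
class Candidate:
    """Hypothetical container for one scored compress item."""
    label: str
    e: int    # assumed range 1..5
    r: int    # assumed range 1..5
    qv: bool  # quote verified


def composite_score(c: Candidate) -> int:
    """e x r base (1..25) plus the verified-quote bonus."""
    return c.e * c.r + (VERIFIED_QUOTE_BONUS if c.qv else 0)


def top_items(candidates: list[Candidate]) -> list[Candidate]:
    """Keep the highest-scoring items; ties keep their input order."""
    ranked = sorted(candidates, key=composite_score, reverse=True)
    return ranked[:MAX_ITEMS_PER_BUCKET]
```

With e=5 and r=4 the bid scores 20 unverified and 30 verified. Six verified items at e=4, r=3 score 22 each, so the unverified bid falls outside the top 6 while the verified one ranks first. In v53c the bid did reach 30 but tied with other score-30 items, which is why the tie-break order mattered there.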

Test plan

  • pytest worker_plan/worker_plan_internal/parameter_extraction/tests/test_compress_report_section.py (28 pass)
  • Paperclip 3× run with $75k bid survival check (2/3 in public top-6 with qv=True; 1/3 emitted-and-verified-but-tie-broken in a different bucket)
  • 5-other-plan regression sweep with flip audit (165 flips, 30 sampled, all legitimate)
  • Threshold-tightening cost simulation (0 lost qv=True items under 90% → 100%)
  • CI green (tests, lint, typecheck)

🤖 Generated with Claude Code

neoneye and others added 2 commits May 21, 2026 15:03
…or LLM paraphrase

The strict substring check rejected verbatim claims whose source_quote dropped or reordered words even though every content token was present in the source. That made quote_verified a fragile signal — and because composite_score adds a +10 verified-quote bonus on a 1-25 base, a single missed verification can drop a load-bearing tripwire (e.g. paperclip's $75,000 OPC UA bid threshold) below the MAX_ITEMS_PER_BUCKET cutoff in roughly two-thirds of runs.

Fallback path: when the substring check misses, tokenize quote and source on Unicode word boundaries, require every digit-bearing quote token to appear in the source (no hallucinated numbers), and require >=90% of all quote tokens to appear (blocks single-word content substitutions like "highest" for "lowest"). Minimum 3 quote tokens to avoid trivial overlap on a large source. Substring fast-path is preserved as the primary signal.

Empirically (6 plans, 8 compress runs, 1361 candidate items): qv=True total rises from ~1027 to 1192. 165 items (13.8% of qv=True) now verify only via the fallback; spot-checking 30 random flips shows all are legitimate paraphrases (dropped intermediate words, em-dash variants, ellipsis-elided clauses), no hallucinations.

Paperclip acceptance probe (3 runs): $75,000 OPC UA bid now survives the public top-6 with qv=True in 2/3 (vs 1/3 before). The remaining 1/3 is an emission-layer miss (LLM does not produce the bid in either pass) and is out of scope for this PR — it was the same gap PR #743 explicitly left open.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…y quote token (was 90%)

Code review on PR #744 noted the 90% threshold lets a long quote pass with one substituted content word (13/14 overlap), so 'highest'/'lowest' inversions in a long quote could verify even though they invert meaning. Tighten to require all quote tokens to appear in the source — the digit-bearing anchor is subsumed by the all-tokens rule, so it is removed.

Empirically (1366 scored candidates across 6 plans, 8 compress runs): 0 items lose qv=True under the tightening. The 90% rule was never functionally looser than the 100% rule on observed data; the LLM's paraphrases are either full token-overlap (reordering/elision) or hit the substring fast path. Tightening removes the theoretical false-positive surface at no empirical cost.

Adds a long-quote substitution test (14 tokens with one 'highest'/'lowest' swap) to lock the new bound in.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@neoneye
Member Author

neoneye commented May 21, 2026

Addressed review feedback:

  1. Matcher tightened from ≥90% overlap to all-tokens-required (9521b68a). The 90% rule allowed long-quote single-word substitutions (e.g. highest for lowest over 14 tokens) to verify; the all-tokens rule rejects any quote with a missing token. Added a long-quote highest/lowest substitution test. Threshold-tightening cost simulated against the existing v53 outputs (1366 candidates): 0 items lose qv=True. The 90% rule was never functionally looser than 100% on observed data, so the tightening is empirically free.
  2. Mars_gtld expert_criticism/gates_and_thresholds 0-public: investigated via a fresh rerun. LLM emits 0 candidates in both the first and second pass for this one bucket-section combo (other 4 buckets emit normally). This is Gemini Flash Lite emission variance, not anything the matcher can affect. PR notes updated to call this out as LLM run variance rather than "no regression."
  3. Paperclip v53c $75k framing corrected: the bid is emitted with qv=True, but in the risks_and_shocks/premortem bucket (candidate Staging #12, score ~30). Three other score-30 items beat it on tie-break for the last 3 public slots. So it's bucket-categorisation + tie-break, not emission or quote-match. PR notes now state "2 of 3 paperclip runs have $75k in any public top-6" (vs 1 of 3 before).
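
For item 1, the gap between the two thresholds can be shown with a small overlap calculation. The source and quote strings below are invented for illustration and are not the fixture used in the unit test; the tokenisation (casefold plus \w+) is the same assumption as in the description.

```python
import re


def overlap_ratio(quote: str, source: str) -> float:
    """Fraction of quote tokens that also occur in the source token set."""
    quote_tokens = re.findall(r"\w+", quote.casefold())
    source_tokens = set(re.findall(r"\w+", source.casefold()))
    if not quote_tokens:
        return 0.0
    hits = sum(token in source_tokens for token in quote_tokens)
    return hits / len(quote_tokens)


# Illustrative strings only (assumption, not taken from the test suite).
SOURCE = (
    "Award the contract to the lowest qualified bid for OPC UA middleware "
    "before the review deadline"
)
# 14 tokens, with "lowest" swapped for "highest" and one "the" dropped.
QUOTE = (
    "award the contract to the highest qualified bid for OPC UA middleware "
    "before review"
)

ratio = overlap_ratio(QUOTE, SOURCE)
passes_90_percent_rule = ratio >= 0.90  # 13 of 14 tokens match
passes_all_tokens_rule = ratio == 1.0   # "highest" is missing from the source
```

Here 13 of 14 quote tokens occur in the source, so the quote clears a 90% threshold even though it inverts the meaning; the all-tokens rule rejects it.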

PR description updated. 28 unit tests still pass. Ready for re-review.

@neoneye neoneye merged commit 57b9924 into main May 21, 2026
3 checks passed
@neoneye neoneye deleted the napkin-math/loosen-quote-match branch May 21, 2026 13:24
neoneye added a commit that referenced this pull request May 21, 2026
docs(napkin-math): record PR #743 and PR #744 in 20260520 plan
neoneye added a commit that referenced this pull request May 21, 2026
…promoter) in 20260520 plan

Per user direction, the plan-status update lands in PR #750 (not a separate doc PR).

PR #749 marked merged (was previously open). PR #750 added to the landed-on-main section with the honest 'shipped after two reverted iterations' process note — first attempt was a risks-side prompt rule with the wrong causal model, second attempt only detected canonical if/then form and missed the actual v53c declarative phrasing, third commit extended the detector to both shapes with acronym-preserving rewrite. 44 unit tests including the literal v53c regression on the historical line.

Phase 1 status row updated to reference PR #750 as the cross-bucket promoter backstop on top of #737/#743/#744.

Next-likely-move list re-ordered: bucket-categorisation no longer item 1 (now covered by #750). Proposal 141 takes item 1, Phase 5 verify-bounds-citations takes item 2, different-LLM behavioural validation takes item 3, prompt-hygiene takes item 4.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
huangyingting pushed a commit to repomesh/PlanExe that referenced this pull request May 22, 2026
…tus in 20260520 plan

Adds PR PlanExeOrg#743 (compress emission-layer second pass, merged) and PR PlanExeOrg#744 (compress ranking-layer paraphrase-tolerant quote verification, open and CI green) to the landed-on-main section with honest framing of what each closes and what each does not.

Refreshes the Compress-LLM run-to-run variance known-limitation entry: the emission side is closed by PlanExeOrg#743 and the verification side by PlanExeOrg#744. Residual modes are bucket-categorisation variance and second-pass-also-misses cases, both at the LLM's emission layer.

Updates Phase 1 status row and the Next-likely-move list — compress-variance handling is no longer item 1; bucket-categorisation discipline takes its place, then proposal 141, then different-LLM validation, then prompt hygiene.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
huangyingting pushed a commit to repomesh/PlanExe that referenced this pull request May 22, 2026
…rg#747 (Phase 4 runtime) in 20260520 plan

Marks PR PlanExeOrg#744 as merged (previously open), and adds PR PlanExeOrg#746 (Phase 3 validate-parameters — aggregate_not_bounded + requirement_has_margin) and PR PlanExeOrg#747 (Phase 4 runtime + schema readiness — calculation-output strip, lognormal/pert reserved, correlations key reserved, reason-branched warning text) to the landed-on-main section.

Phase 1 status row now references all three compress PRs (PlanExeOrg#737, PlanExeOrg#743, PlanExeOrg#744). Phase 3 row marks DONE via PR PlanExeOrg#746 with a note that the sampling_discipline enum bullet was routed to Phase 4. Phase 4 row marks the code-side DONE via PR PlanExeOrg#747 and lists the deferred prompt-side LLM-rule changes.

Next-likely-move list re-ordered: the Phase 4 prompt-side follow-up takes item 1 (was deferred from the previous update). Bucket-categorisation discipline, proposal 141 implementation, different-LLM validation, and prompt hygiene shift down to items 2-5.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
huangyingting pushed a commit to repomesh/PlanExe that referenced this pull request May 22, 2026
…tes_and_thresholds via the risks-side rule

Addresses the residual paperclip v53c failure mode: the LLM occasionally files a '$X exceeds threshold, then <downside>' tripwire under risks_and_shocks instead of gates_and_thresholds. The 'do not restate' guard in the risks prompt didn't catch the case because the LLM put the item in risks first, not as a restatement.

Change: adds a structural-priority paragraph to the risks_and_shocks bucket prompt that tells the LLM NOT to emit a sentence here when its source side has the 'If <metric> <comparator> <numeric threshold>, then <consequence>' shape — that shape belongs in gates_and_thresholds even when the then-clause is downside-flavoured (cost, schedule, scope, penalty, vendor switch).
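
For illustration only, the sentence shape the paragraph above describes could be approximated with a regular expression. The change in this commit is a prompt edit, not code, so nothing below exists in the repository: the pattern name, the comparator list, and the threshold syntax are all assumptions, and the sketch covers only the numeric if/then shape, not deadline or categorical gates.

```python
import re

# Hypothetical pattern for
# "If <metric> <comparator> <numeric threshold>, then <consequence>".
GATE_SHAPE_RE = re.compile(
    r"""^\s*if\s+
        (?P<metric>.+?)\s+
        (?P<comparator>exceeds|falls\s+below|drops\s+below|
            is\s+(?:above|below|over|under)|>=|<=|>|<)\s+
        (?P<threshold>[$€£]?\d[\d,.]*\s*%?\w*)
        \s*,\s*then\s+
        (?P<consequence>.+)$""",
    re.IGNORECASE | re.VERBOSE,
)


def has_gate_shape(sentence: str) -> bool:
    """True when the sentence matches the assumed numeric if/then shape."""
    return GATE_SHAPE_RE.match(sentence) is not None
```

A sentence such as "If the lowest qualified bid exceeds $75,000, then switch vendor." matches, with the threshold group capturing "$75,000". A plain risk statement with no numeric threshold does not match.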

Why ONLY the risks side and not also a parallel paragraph in the gates prompt: I tried both sides in v54 and found a clear over-narrowing regression — the LLM became more conservative about what counted as a gate, with paperclip expert_criticism dropping from 6 gates to 2, yellowstone selected_scenario from 6 to 3, and similar shrinkages elsewhere. Adding a long structural-shape paragraph to the gates prompt implicitly raised the bar for what counted as a gate (numeric thresholds only), excluding legitimate deadline/categorical gates. The risks-side rule alone is enough to claim the if/then numeric sentences for gates without narrowing gates from the other direction.

Empirical posture (regression check, NOT improvement claim; same-LLM same-session Gemini Flash Lite reruns):

Paperclip 3x (v55a/b/c): $75k OPC UA bid lands in gates_and_thresholds public top in ALL THREE runs (vs 2/3 before in v53). This is the focal v53c case the change targets.

5-other-plan cross-probe regression (euro_adoption, yellowstone, crate, mars_gtld, datacenter): bucket counts mostly unchanged. Modest 1-2-item shrinkages in datacenter selected_scenario and yellowstone selected_scenario are within typical LLM run variance. Two sections produced 0 gates in v55 (crate premortem and mars_gtld expert_criticism) — these are LLM run variance unrelated to this change: (a) the gates bucket prompt is unchanged so a risks-side rule cannot affect first-pass gate emission; (b) mars_gtld expert_criticism gates is a known-flaky combo (saw 0 candidates on PR PlanExeOrg#744 rerun); (c) crate premortem produced 6 gates fine in v54 with an even more aggressive prompt, ruling out the v55 risks-side change as the cause.

Unit tests: 28 pass (no test changes; the prompt edit is structural language, not new code paths).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>