napkin-math(compress): paraphrase-tolerant quote verification #744
Merged
…or LLM paraphrase

The strict substring check rejected verbatim claims whose source_quote dropped or reordered words even though every content token was present in the source. That made quote_verified a fragile signal, and because composite_score adds a +10 verified-quote bonus on a 1-25 base, a single missed verification can drop a load-bearing tripwire (e.g. paperclip's $75,000 OPC UA bid threshold) below the MAX_ITEMS_PER_BUCKET cutoff in roughly two-thirds of runs.

Fallback path: when the substring check misses, tokenize quote and source on Unicode word boundaries, require every digit-bearing quote token to appear in the source (no hallucinated numbers), and require >=90% of all quote tokens to appear (blocks single-word content substitutions like "highest" for "lowest"). Minimum 3 quote tokens to avoid trivial overlap on a large source. Substring fast-path is preserved as the primary signal.

Empirically (6 plans, 8 compress runs, 1361 candidate items): qv=True total rises from ~1027 to 1192. 165 items (13.8% of qv=True) now verify only via the fallback; spot-checking 30 random flips shows all are legitimate paraphrases (dropped intermediate words, em-dash variants, ellipsis-elided clauses), no hallucinations.

Paperclip acceptance probe (3 runs): $75,000 OPC UA bid now survives the public top-6 with qv=True in 2/3 (vs 1/3 before). The remaining 1/3 is an emission-layer miss (LLM does not produce the bid in either pass) and is out of scope for this PR; it was the same gap PR #743 explicitly left open.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
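To make the ranking effect described in this commit message concrete, here is a minimal sketch. It assumes the 1-25 base is the product e × r of two 1-5 ratings (as the PR description states) and that the public list is simply the top MAX_ITEMS_PER_BUCKET items by score. The Candidate class, the public_top helper, and the competing items are hypothetical; the real composite_score may include terms not shown here.

```python
from dataclasses import dataclass
from typing import List

MAX_ITEMS_PER_BUCKET = 6    # cutoff value given in the PR description
VERIFIED_QUOTE_BONUS = 10   # the +10 verified-quote bonus named in the commit


@dataclass
class Candidate:
    """Hypothetical stand-in for a scored compress item."""
    name: str
    e: int                  # assumed 1..5 rating
    r: int                  # assumed 1..5 rating
    quote_verified: bool


def composite_score(c: Candidate) -> int:
    # Base score in 1..25, plus the bonus only when the quote verified.
    return c.e * c.r + (VERIFIED_QUOTE_BONUS if c.quote_verified else 0)


def public_top(candidates: List[Candidate]) -> List[str]:
    # Keep the highest-scoring items; sorted() is stable, so ties keep input order.
    ranked = sorted(candidates, key=composite_score, reverse=True)
    return [c.name for c in ranked[:MAX_ITEMS_PER_BUCKET]]


# Six verified items with a lower base (4 * 3 + 10 = 22 each).
others = [Candidate(f"other_{i}", e=4, r=3, quote_verified=True) for i in range(6)]
# The same bid, once with a missed verification (20) and once verified (30).
bid_missed = Candidate("opc_ua_bid", e=5, r=4, quote_verified=False)
bid_verified = Candidate("opc_ua_bid", e=5, r=4, quote_verified=True)

print(public_top(others + [bid_missed]))
print(public_top(others + [bid_verified]))
```

In this made-up bucket the bid has the highest base score, yet it falls to seventh place when its quote fails verification, which is the failure mode the fallback path targets.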
…y quote token (was 90%)

Code review on PR #744 noted the 90% threshold lets a long quote pass with one substituted content word (13/14 overlap), so 'highest'/'lowest' inversions in a long quote could verify even though they invert meaning. Tighten to require all quote tokens to appear in the source; the digit-bearing anchor is subsumed by the all-tokens rule, so it is removed.

Empirically (1366 scored candidates across 6 plans, 8 compress runs): 0 items lose qv=True under the tightening. The 90% rule was never functionally looser than the 100% rule on observed data; the LLM's paraphrases are either full token-overlap (reordering/elision) or hit the substring fast path. Tightening removes the theoretical false-positive surface at no empirical cost.

Adds a long-quote substitution test (14 tokens with one 'highest'/'lowest' swap) to lock the new bound in.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Addressed review feedback:
PR description updated. 28 unit tests still pass. Ready for re-review.
This was referenced May 21, 2026
neoneye added a commit that referenced this pull request on May 21, 2026
…promoter) in 20260520 plan Per user direction, the plan-status update lands in PR #750 (not a separate doc PR). PR #749 marked merged (was previously open). PR #750 added to the landed-on-main section with the honest 'shipped after two reverted iterations' process note — first attempt was a risks-side prompt rule with the wrong causal model, second attempt only detected canonical if/then form and missed the actual v53c declarative phrasing, third commit extended the detector to both shapes with acronym-preserving rewrite. 44 unit tests including the literal v53c regression on the historical line. Phase 1 status row updated to reference PR #750 as the cross-bucket promoter backstop on top of #737/#743/#744. Next-likely-move list re-ordered: bucket-categorisation no longer item 1 (now covered by #750). Proposal 141 takes item 1, Phase 5 verify-bounds-citations takes item 2, different-LLM behavioural validation takes item 3, prompt-hygiene takes item 4. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
huangyingting pushed a commit to repomesh/PlanExe that referenced this pull request on May 22, 2026
…tus in 20260520 plan Adds PR PlanExeOrg#743 (compress emission-layer second pass, merged) and PR PlanExeOrg#744 (compress ranking-layer paraphrase-tolerant quote verification, open and CI green) to the landed-on-main section with honest framing of what each closes and what each does not. Refreshes the Compress-LLM run-to-run variance known-limitation entry: the emission side is closed by PlanExeOrg#743 and the verification side by PlanExeOrg#744. Residual modes are bucket-categorisation variance and second-pass-also-misses cases, both at the LLM's emission layer. Updates Phase 1 status row and the Next-likely-move list — compress-variance handling is no longer item 1; bucket-categorisation discipline takes its place, then proposal 141, then different-LLM validation, then prompt hygiene. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
huangyingting pushed a commit to repomesh/PlanExe that referenced this pull request on May 22, 2026
…rg#747 (Phase 4 runtime) in 20260520 plan Marks PR PlanExeOrg#744 as merged (previously open), and adds PR PlanExeOrg#746 (Phase 3 validate-parameters — aggregate_not_bounded + requirement_has_margin) and PR PlanExeOrg#747 (Phase 4 runtime + schema readiness — calculation-output strip, lognormal/pert reserved, correlations key reserved, reason-branched warning text) to the landed-on-main section. Phase 1 status row now references all three compress PRs (PlanExeOrg#737, PlanExeOrg#743, PlanExeOrg#744). Phase 3 row marks DONE via PR PlanExeOrg#746 with a note that the sampling_discipline enum bullet was routed to Phase 4. Phase 4 row marks the code-side DONE via PR PlanExeOrg#747 and lists the deferred prompt-side LLM-rule changes. Next-likely-move list re-ordered: the Phase 4 prompt-side follow-up takes item 1 (was deferred from the previous update). Bucket-categorisation discipline, proposal 141 implementation, different-LLM validation, and prompt hygiene shift down to items 2-5. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
huangyingting pushed a commit to repomesh/PlanExe that referenced this pull request on May 22, 2026
…tes_and_thresholds via the risks-side rule Addresses the residual paperclip v53c failure mode: the LLM occasionally files a '$X exceeds threshold, then <downside>' tripwire under risks_and_shocks instead of gates_and_thresholds. The 'do not restate' guard in the risks prompt didn't catch the case because the LLM put the item in risks first, not as a restatement. Change: adds a structural-priority paragraph to the risks_and_shocks bucket prompt that tells the LLM NOT to emit a sentence here when its source side has the 'If <metric> <comparator> <numeric threshold>, then <consequence>' shape — that shape belongs in gates_and_thresholds even when the then-clause is downside-flavoured (cost, schedule, scope, penalty, vendor switch). Why ONLY the risks side and not also a parallel paragraph in the gates prompt: I tried both sides in v54 and found a clear over-narrowing regression — the LLM became more conservative about what counted as a gate, with paperclip expert_criticism dropping from 6 gates to 2, yellowstone selected_scenario from 6 to 3, and similar shrinkages elsewhere. Adding a long structural-shape paragraph to the gates prompt implicitly raised the bar for what counted as a gate (numeric thresholds only), excluding legitimate deadline/categorical gates. The risks-side rule alone is enough to claim the if/then numeric sentences for gates without narrowing gates from the other direction. Empirical posture (regression check, NOT improvement claim; same-LLM same-session Gemini Flash Lite reruns): Paperclip 3x (v55a/b/c): $75k OPC UA bid lands in gates_and_thresholds public top in ALL THREE runs (vs 2/3 before in v53). This is the focal v53c case the change targets. 5-other-plan cross-probe regression (euro_adoption, yellowstone, crate, mars_gtld, datacenter): bucket counts mostly unchanged. Modest 1-2-item shrinkages in datacenter selected_scenario and yellowstone selected_scenario are within typical LLM run variance. 
Two sections produced 0 gates in v55 (crate premortem and mars_gtld expert_criticism) — these are LLM run variance unrelated to this change: (a) the gates bucket prompt is unchanged so a risks-side rule cannot affect first-pass gate emission; (b) mars_gtld expert_criticism gates is a known-flaky combo (saw 0 candidates on PR PlanExeOrg#744 rerun); (c) crate premortem produced 6 gates fine in v54 with an even more aggressive prompt, ruling out the v55 risks-side change as the cause. Unit tests: 28 pass (no test changes; the prompt edit is structural language, not new code paths). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
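The sentence shape this commit message names, 'If <metric> <comparator> <numeric threshold>, then <consequence>', can also be recognised mechanically. The following is a hypothetical sketch only: the change described above is a prompt edit, not code, and this regex is not the detector from any PR. The comparator list and the threshold syntax are assumptions, and only the canonical if/then form is covered (declarative phrasings are not).

```python
import re
from typing import Dict, Optional

# Assumed comparator vocabulary; a real detector would likely need more.
_IF_THEN_NUMERIC = re.compile(
    r"^\s*if\s+(?P<metric>.+?)\s+"
    r"(?P<comparator>exceeds|is above|is below|falls below|>=|<=|>|<)\s+"
    r"(?P<threshold>[$€£]?\d[\d,.]*%?)\s*,\s*"
    r"then\s+(?P<consequence>.+)$",
    re.IGNORECASE,
)


def match_numeric_gate(sentence: str) -> Optional[Dict[str, str]]:
    """Return the named parts of a canonical if/then numeric gate, else None."""
    m = _IF_THEN_NUMERIC.match(sentence)
    return m.groupdict() if m else None


gate = match_numeric_gate(
    "If the lowest qualified bid for OPC UA middleware exceeds $75,000, "
    "then switch to the fallback vendor."
)
print(gate)
print(match_numeric_gate("Vendor lock-in may raise integration costs."))
```

A deadline or categorical gate has no numeric comparator, so this pattern deliberately returns None for it; that mirrors the commit's concern that a numeric-only shape must not be used to decide what counts as a gate.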
Summary
quote_is_in_source was a strict substring check after normalisation. When the compress LLM paraphrased (dropped intermediate words, reordered the noun phrase), the check failed even though every content token came from the source. Because composite_score adds a +10 verified-quote bonus on a 1-25 e × r base, one missed verification can drop a load-bearing tripwire below the MAX_ITEMS_PER_BUCKET = 6 cutoff. This is the ranking-layer gap PR #743 ("napkin-math(compress): second-pass shifts the variance failure from emission to ranking") left open: in v52 inspection, the paperclip $75,000 OPC UA bid had qv=False due to a reordered noun phrase ("lowest qualified middleware bid" vs source's "lowest qualified bid for OPC UA middleware") and lost to lower-e×r verified explicits.
…\w+), language- and domain-neutral. The all-tokens rule subsumes any digit-bearing anchor (numbers can't be substituted) and rejects any content-word substitution ("highest" for "lowest") regardless of quote length.
Empirical posture
Same-LLM (Gemini Flash Lite) same-session re-runs, so this is a regression check on the deterministic verifier and a behavioural probe — not a same-LLM improvement claim.
…NOT IN SOURCE and empty, paraphrase via reorder/drop (intended True), hallucinated number (intended False), short-quote substitution (intended False), short unrelated quote (intended False), and a 14-token long-quote highest/lowest substitution (intended False; locks in the all-tokens rule against any future loosening).
…qv=True under tightening. The 90% rule was never functionally looser than the all-tokens rule on observed data; LLM paraphrases either reorder/elide (full token-overlap) or hit the substring fast path.
…$75k bid in public gates_and_thresholds top-6 at position 2, qv=True ✓
…$75k bid in public gates_and_thresholds top-6 at position 3, qv=True ✓
…$75k bid emitted with qv=True, but in the risks_and_shocks/premortem bucket (candidate #12, e=5 r=4 stress_test, score ~30). Three other score-30 items beat it on tie-break for the last 3 public slots. Net: not in any public top-6 in v53c. This is bucket-categorisation variance (LLM placed the gate under "risk") plus tie-break, not emission failure or quote-match failure.
…$75k in public top-6 (vs 1 of 3 before); 3 of 3 verified-when-emitted.
…qv=True). Random-30 audit shows all legitimate paraphrases (dropped intermediate words, em-dash/hyphen variants, ellipsis-elided clauses), with no hallucinations and no content-word substitutions.
Test plan
pytest worker_plan/worker_plan_internal/parameter_extraction/tests/test_compress_report_section.py (28 pass)
…$75k bid survival check (2/3 in public top-6 with qv=True; 1/3 emitted-and-verified-but-tie-broken in a different bucket)
…qv=True items under 90% → 100%)
🤖 Generated with Claude Code