Harden analysis-gate + renderer against BLUF-leak and duplicate-card title regressions #2530
…eak regressions) Agent-Logs-Url: https://github.com/Hack23/riksdagsmonitor/sessions/55d20942-b164-429c-a8d7-4ffeb25c99ba Co-authored-by: pethers <1726836+pethers@users.noreply.github.com>
…d-aggregation duplicate-card guard Agent-Logs-Url: https://github.com/Hack23/riksdagsmonitor/sessions/54808640-bb83-4031-b74a-089773718be0 Co-authored-by: pethers <1726836+pethers@users.noreply.github.com>
…build noise, not code changes) Agent-Logs-Url: https://github.com/Hack23/riksdagsmonitor/sessions/54808640-bb83-4031-b74a-089773718be0 Co-authored-by: pethers <1726836+pethers@users.noreply.github.com>
🏷️ Automatic Labeling Summary
This PR has been automatically labeled based on the files changed and PR metadata.
Applied Labels: documentation, testing, refactor, size-l
🔍 Lighthouse Performance Audit
📥 Download full Lighthouse report Budget Compliance: Performance budgets enforced via |
Pull request overview
This PR hardens the executive-brief headline pipeline so bad/boilerplate H1s can’t silently degrade into BLUF-derived fragments in rendered SERP/card titles, and adds a “duplicate headline across days” guard for period/aggregation subfolders.
Changes:
- Tightens `analysis-gate` Family C checks for executive-brief H1 validity (missing H1, placeholder/boilerplate collapse via renderer-equivalent cleanup, date-in-title, dangling punctuation/connectors).
- Adds an across-days uniqueness check (last 7 sibling daily folders in the same subfolder) to prevent repeated headlines on `/news/index.html`.
- Improves renderer title hygiene (`cleanArticleTitle` trailing punctuation strip; `titleFromBluf` safer truncation; expanded date-prefix stripping) and adds regression tests.
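The "last 7 sibling daily folders" window in the list above can be illustrated with a small pure function. This is a sketch only: `recentSiblingDates` is a hypothetical name, and it assumes the window means the seven most recent dated folders before the current one (rather than seven calendar days), which the PR text does not spell out.

```typescript
// Hypothetical helper, not taken from the PR.
// Given the folder names found under analysis/daily/, return the most
// recent `window` ISO-dated folders that precede `today`.
function recentSiblingDates(
  folderNames: string[],
  today: string,
  window = 7,
): string[] {
  return folderNames
    // Keep only YYYY-MM-DD names strictly earlier than today.
    // ISO dates compare correctly as plain strings.
    .filter((name) => /^\d{4}-\d{2}-\d{2}$/.test(name) && name < today)
    .sort()
    .reverse()
    .slice(0, window);
}

console.log(
  recentSiblingDates(
    ['2026-05-16', '2026-05-15', '2026-05-01', 'notes', '2026-05-17'],
    '2026-05-16',
  ),
);
```

Each returned date would then have its `executive-brief.md` H1 compared against the current brief.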
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| `scripts/agentic/analysis-gate.ts` | Adds stricter executive-brief H1 validation and a sibling-days normalised-duplicate check. |
| `.github/prompts/05-analysis-gate.md` | Mirrors the new gate checks in bash as a fast pre-flight (date/connector/punctuation/duplicate checks). |
| `scripts/render-lib/aggregator/seo/title.ts` | Strengthens title cleanup + BLUF-title synthesis (punctuation strip, truncation guard, extra date-prefix patterns). |
| `tests/agentic-analysis-gate.test.ts` | Adds regression coverage for new gate failure modes incl. across-days duplicates. |
| `tests/render-lib-leaf-modules.test.ts` | Adds renderer-level regression tests for new title cleanup and BLUF truncation behavior. |
| checkId: 'family-c-structure', | ||
| passed: false, | ||
| message: | ||
| `executive-brief.md: H1 is byte-identical to analysis/daily/${siblingDate}/${subfolder}/executive-brief.md — reword to surface the day-specific angle (period-aggregation briefs must not ship duplicate cards on the news index)`, |
| it('fails when H1 is byte-identical to a prior day brief in the same subfolder', async () => { | ||
| // Reproduces the Phase-2 regression: period-aggregation briefs | ||
| // ("Tidö Current Mandate" × 2 days, "Sweden Year-Ahead → +365" | ||
| // × 2 days) shipped duplicate cards on /news/index.html because | ||
| // the workflow scope barely changes day-to-day. Build a | ||
| // synthetic `analysis/daily/<date>/<subfolder>/` layout with | ||
| // two sibling dates whose brief H1s normalise to the same | ||
| // string. | ||
| const fakeRoot = join(tmpdir(), `agentic-gate-test-acrossdays-${Date.now()}-${Math.random().toString(36).slice(2)}`); | ||
| const subfolder = 'forecast-year-ahead'; | ||
| const today = join(fakeRoot, 'analysis', 'daily', '2026-05-16', subfolder); | ||
| const yesterday = join(fakeRoot, 'analysis', 'daily', '2026-05-15', subfolder); | ||
| mkdirSync(today, { recursive: true }); | ||
| mkdirSync(yesterday, { recursive: true }); | ||
| try { | ||
| const h1Line = '# Sweden Year-Ahead Forecast: Tidö Coalition Faces Election Sprint\n'; | ||
| const body = '\n## 🎯 BLUF\n\nSummary.\n\n## 🧭 Decisions\n\n1. A\n'; | ||
| writeArtifact(today, 'executive-brief.md', h1Line + body); | ||
| writeArtifact(yesterday, 'executive-brief.md', h1Line + body); | ||
| const results = await checkFamilyCStructure(today); | ||
| const failures = results.filter( | ||
| (r) => !r.passed && r.artifact === 'executive-brief.md' && /byte-identical/i.test(r.message ?? ''), | ||
| ); | ||
| expect(failures.length).toBeGreaterThan(0); | ||
| expect(failures[0]?.message).toMatch(/2026-05-15/); |
| [ -n "$EB_SIB_H1" ] || continue | ||
| EB_SIB_NORM="$(printf '%s' "$EB_SIB_H1" | tr '[:upper:]' '[:lower:]' | sed -E 's/[0-9]{4}-[0-9]{2}-[0-9]{2}//g' | tr -s '[:space:][:punct:]' ' ' | sed -E 's/^[[:space:]]+|[[:space:]]+$//g')" | ||
| if [ "$EB_SIB_NORM" = "$EB_CURR_NORM" ]; then | ||
| echo "❌ executive-brief.md: H1 is byte-identical to analysis/daily/$EB_SIBLING/$EB_SUBFOLDER/executive-brief.md — reword to surface the day-specific angle (period-aggregation briefs must not ship duplicate cards on the news index)" |
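The bash pipeline above lowercases the H1, removes ISO dates, collapses whitespace and punctuation into single spaces, and trims the result before comparing. A rough TypeScript equivalent is sketched below. `normaliseH1` and `findDuplicateSibling` are hypothetical names chosen for illustration, and the punctuation class (ASCII punctuation plus en and em dashes) is an assumption that may differ from what `tr '[:punct:]'` matches in a given locale.

```typescript
// Hypothetical helpers, not taken from the PR.
// Approximates: lowercase, strip YYYY-MM-DD, squeeze whitespace and
// punctuation to one space, trim both ends.
function normaliseH1(h1: string): string {
  return h1
    .toLowerCase()
    .replace(/\d{4}-\d{2}-\d{2}/g, '')
    // ASCII punctuation ranges plus en dash (U+2013) and em dash (U+2014).
    .replace(/[\s!-\/:-@\[-\x60{-~\u2013\u2014]+/g, ' ')
    .trim();
}

// Returns the sibling date whose H1 normalises to the same string,
// or undefined when the current H1 is unique.
function findDuplicateSibling(
  currentH1: string,
  siblings: Array<[string, string]>, // [date, H1 line]
): string | undefined {
  const current = normaliseH1(currentH1);
  if (current === '') return undefined; // nothing meaningful to compare
  for (const [date, h1] of siblings) {
    if (normaliseH1(h1) === current) return date;
  }
  return undefined;
}

console.log(
  findDuplicateSibling('# Executive Brief: Realtime Pulse 2026-05-16', [
    ['2026-05-15', '# Executive Brief: Realtime Pulse 2026-05-15'],
  ]),
);
```

Because dates and punctuation are stripped before comparison, two H1s can match here without being identical byte for byte.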
| // Dangling-tail guard — if the cut ends on a ≤ 3-char word (`Tidö`, | ||
| // `two`, `the`, `on`, …) it reads as a truncated fragment. Step back | ||
| // to the previous word boundary so the title ends on a substantive | ||
| // word. Live cases: `… converging on the Tidö` (weekly-review), | ||
| // `… has advanced two` (committeeReports), `… of` (interpellations). | ||
| // Word lengths ≤ 3 are the empirical cutoff: 4-char words like | ||
| // `bill`, `cuts`, `vote` are substantive enough to end a title. | ||
| let safetyCounter = 0; | ||
| while (safetyCounter < 5) { | ||
| const tail = cut.match(/(\S+)$/); | ||
| if (!tail) break; | ||
| const tailWord = tail[1]!; | ||
| // Allow short tails only when followed by punctuation (e.g. `EU.` | ||
| // is already trimmed off above, but defensive). Also allow tails | ||
| // containing digits (numbers like `12` or `7` are substantive). | ||
| if (tailWord.length > 3 || /\d/.test(tailWord)) break; |
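The excerpt above stops at the loop's exit condition. A self-contained sketch of the whole dangling-tail guard might look like the following. `trimDanglingTail` is a hypothetical name, and the way the loop steps back (dropping the tail word and any whitespace before it, and refusing to reduce the title to an empty string) is an assumption, since that part of the diff is not shown.

```typescript
// Hypothetical helper, not taken from the PR.
// Repeatedly drops a trailing word of 3 characters or fewer, unless the
// word contains a digit, so the title ends on a substantive word.
function trimDanglingTail(cut: string, maxSteps = 5): string {
  let result = cut.replace(/\s+$/, '');
  for (let step = 0; step < maxSteps; step++) {
    const tail = result.match(/(\S+)$/);
    if (!tail) break;
    const tailWord = tail[1];
    // Long words and anything numeric are treated as substantive.
    if (tailWord.length > 3 || /\d/.test(tailWord)) break;
    const stepped = result
      .slice(0, result.length - tailWord.length)
      .replace(/\s+$/, '');
    if (stepped === '') break; // assumption: never reduce to an empty title
    result = stepped;
  }
  return result;
}

console.log(trimDanglingTail('The committee has advanced two'));
console.log(trimDanglingTail('Budget grows by 12'));
```

The bounded loop mirrors the `safetyCounter < 5` cap in the excerpt, so a title made up entirely of short words cannot be stripped indefinitely.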
| # Date-in-H1 guard (seo-metadata-contract.md §2.1) — title must not | ||
| # contain a literal publication date. Catches ISO YYYY-MM-DD and | ||
| # English + Swedish long-form months. | ||
| EB_H1_TEXT="$(printf '%s' "$EB_H1" \ | ||
| | sed -E 's/^#[[:space:]]+//' \ | ||
| | sed -E 's/<[^>]+>//g')" | ||
| if printf '%s' "$EB_H1_TEXT" | grep -qE '[0-9]{4}[-/][0-9]{1,2}[-/][0-9]{1,2}'; then | ||
| echo "❌ executive-brief.md: H1 contains a literal ISO date (YYYY-MM-DD) — dates belong in article:published_time, not the SERP <title>" | ||
| FAIL=1 | ||
| elif printf '%s' "$EB_H1_LOWER" | grep -qE '[0-9]{1,2}[[:space:]]+(january|february|march|april|may|june|july|august|september|october|november|december)[[:space:]]+[0-9]{4}'; then | ||
| echo "❌ executive-brief.md: H1 contains a literal English long-form date — dates belong in article:published_time, not the SERP <title>" | ||
| FAIL=1 | ||
| elif printf '%s' "$EB_H1_LOWER" | grep -qE '[0-9]{1,2}[[:space:]]+(januari|februari|mars|april|maj|juni|juli|augusti|september|oktober|november|december)[[:space:]]+[0-9]{4}'; then | ||
| echo "❌ executive-brief.md: H1 contains a literal Swedish long-form date — dates belong in article:published_time, not the SERP <title>" | ||
| FAIL=1 | ||
| fi |
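For comparison with the bash pre-flight above, the same three-way date check can be sketched in TypeScript. `findDateInH1` and its return labels are hypothetical. The month lists are copied from the bash patterns, and the English check runs first, so months shared by both languages (april, september, november, december) are reported as English, as in the `elif` chain.

```typescript
// Hypothetical helper, not taken from the PR.
const EN_MONTHS =
  'january|february|march|april|may|june|july|august|september|october|november|december';
const SV_MONTHS =
  'januari|februari|mars|april|maj|juni|juli|augusti|september|oktober|november|december';

type DateKind = 'iso' | 'english' | 'swedish' | null;

function findDateInH1(h1Line: string): DateKind {
  // Drop the leading "# " marker and any inline HTML tags.
  const text = h1Line.replace(/^#\s+/, '').replace(/<[^>]+>/g, '');
  const lower = text.toLowerCase();
  if (/\d{4}[-\/]\d{1,2}[-\/]\d{1,2}/.test(text)) return 'iso';
  if (new RegExp('\\d{1,2}\\s+(' + EN_MONTHS + ')\\s+\\d{4}').test(lower)) {
    return 'english';
  }
  if (new RegExp('\\d{1,2}\\s+(' + SV_MONTHS + ')\\s+\\d{4}').test(lower)) {
    return 'swedish';
  }
  return null;
}

console.log(findDateInH1('# Executive Brief: Realtime Pulse 2026-05-16'));
console.log(findDateInH1('# Riksdag Approves FiU48 Fuel-Tax Cut'));
```

A non-null result corresponds to one of the three failure messages in the bash version.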
…tle BLUF-leak fix) Agent-Logs-Url: https://github.com/Hack23/riksdagsmonitor/sessions/271ae15e-b960-45d0-b3b0-0d93505b766a Co-authored-by: pethers <1726836+pethers@users.noreply.github.com>
| /** | ||
| * Trailing connector punctuation / words left behind when the | ||
| * word-boundary truncation in {@link truncateAtWord} cuts a long | ||
| * brief H1 at a coordinating connector. Mirror of the rule in | ||
| * `aggregator/seo/title.ts § TRAILING_CONNECTOR_RE` — applied here too | ||
| * because the renderer's `<title>` budget (70 chars) is tighter than | ||
| * the brief H1 and can truncate a perfectly clean H1 mid-connector. | ||
| * | ||
| * Live case: brief H1 | ||
| * "Riksdag Enshrines Constitutional Protection for Abortion — and | ||
| * Expands the Security State's Toolkit" (99 chars) | ||
| * → without this strip the SERP `<title>` ships as | ||
| * "Riksdag Enshrines Constitutional Protection for Abortion — and…" | ||
| * which reads as a dangling connector to readers and search engines. | ||
| * With this strip the SERP `<title>` ships as | ||
| * "Riksdag Enshrines Constitutional Protection for Abortion…" | ||
| * which is clean prose. | ||
| */ | ||
| const TRAILING_CONNECTOR_RE = | ||
| /[\s,;:—–-]+(?:and|or|but|with|as|in|of|to|for|on|at|by|from|that|which|who|when|where|while|after|before|the|a|an|have|has|had|is|are|was|were|will|would|can|may|might|should|must|och|men|eller|med|som|av|till|för|på|i|att|der|die|das|und|oder|aber|mit|als|für|in|auf|et|ou|mais|avec|comme|de|à|pour|en|sur)$/iu; |
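To make the live case in the doc comment concrete, the sketch below truncates the 99-character H1 to a 70-character budget and strips the dangling connector. It is a simplified illustration under stated assumptions: `CONNECTOR_TAIL` is an abbreviated English-only subset of the list above, and `truncateAtWordSketch` is a hypothetical stand-in that reserves one character for the ellipsis, which may not match how the real `truncateAtWord` budgets its output.

```typescript
// Abbreviated, English-only subset of the connector list (illustration only).
const CONNECTOR_TAIL =
  /[\s,;:\u2014\u2013-]+(?:and|or|but|with|of|to|for|on|the|a|an)$/i;

// Strips trailing connectors repeatedly, then any leftover separator.
function trimTrailingConnectors(text: string): string {
  let out = text;
  while (CONNECTOR_TAIL.test(out)) {
    out = out.replace(CONNECTOR_TAIL, '');
  }
  return out.replace(/[\s,;:\u2014\u2013-]+$/, '');
}

// Hypothetical stand-in for the renderer's word-boundary truncation.
function truncateAtWordSketch(title: string, budget: number): string {
  if (title.length <= budget) return title;
  const room = title.slice(0, budget - 1); // leave space for the ellipsis
  const lastSpace = room.lastIndexOf(' ');
  const cut = lastSpace > 0 ? room.slice(0, lastSpace) : room;
  return trimTrailingConnectors(cut) + '\u2026';
}

const h1 =
  'Riksdag Enshrines Constitutional Protection for Abortion \u2014 and ' +
  "Expands the Security State's Toolkit";
console.log(truncateAtWordSketch(h1, 70));
```

Without the `trimTrailingConnectors` call, the cut would end on the dash and "and", which is the dangling form the doc comment describes.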
| it('strips trailing connector before site suffix when H1 already fits', () => { | ||
| // Short H1 ending with a connector should never happen post Phase-1 | ||
| // gate, but defensive: even if it does, the site suffix flow must | ||
| // not preserve the dangling connector. | ||
| const shortH1 = 'Riksdag Approves FiU48 Fuel-Tax Cut and'; | ||
| const result = buildSeoTitle({ ...baseInput, title: shortH1 }); | ||
| // Short H1 + suffix flow doesn't go through truncateAtWord, so the | ||
| // connector is preserved here — Phase-1 gate is the guard for this | ||
| // case. Document the contract by asserting the suffix appears. | ||
| expect(result).toMatch(/Riksdagsmonitor$/); | ||
| }); |
| it('keeps a complete H1 unchanged when it already fits the budget', () => { | ||
| const h1 = 'Russia Legalises Aggression — Sweden Faces Three Deadlines'; | ||
| const result = buildSeoTitle({ ...baseInput, title: h1 }); | ||
| // 58 chars + " — Riksdagsmonitor" (18) = 76 → does NOT fit, so | ||
| // truncateAtWord runs. But truncation should land cleanly, not on | ||
| // a connector. | ||
| expect(result).not.toMatch(/\b(and|or|with|the|a|an)…$/i); | ||
| }); |
@copilot apply changes based on the comments in this thread and fix Quality Checks / typescript-lint (pull_request)
…ording, multilingual comment, test names) Agent-Logs-Url: https://github.com/Hack23/riksdagsmonitor/sessions/078057a9-583a-4071-af61-bea7ffccdf9c Co-authored-by: pethers <1726836+pethers@users.noreply.github.com>
Done in 4211ea9:
ESLint clean on all touched files; 104 affected vitest tests pass.
| '# Executive Brief — Realtime Pulse 2026-05-16\n\n## 🎯 BLUF\n\nSwedish parliamentary activity.\n\n## 🧭 Decisions\n\n1. A\n'); | ||
| const results = await checkFamilyCStructure(subfolderDir); | ||
| const failures = results.filter( | ||
| (r) => !r.passed && r.artifact === 'executive-brief.md', |
| * `aggregator/seo/title.ts § TRAILING_CONNECTOR_RE`: the aggregator | ||
| * helper is English-only (it operates on BLUF sentences that have | ||
| * already been normalised to English by the analysis pipeline), | ||
| * whereas the renderer here must also strip Swedish / German / | ||
| * French connectors because executive-brief H1s ship in all 14 | ||
| * languages. If you update either list, update both — keep this | ||
| * regex strictly a superset of the aggregator one (drift in the | ||
| * EN-only subset would let dangling EN connectors leak through in |
@copilot apply changes based on the comments in this thread
…ify article-seo.ts doc comment re aggregator multilingual scope Agent-Logs-Url: https://github.com/Hack23/riksdagsmonitor/sessions/b803dadf-cfa2-4e39-9ef1-838a0ede1faf Co-authored-by: pethers <1726836+pethers@users.noreply.github.com>
Applied both review comments in 97160cb:
ESLint clean; 104 tests pass.
| // Swedish "Den <day> <månad> <year>" (e.g. `Den 13 maj 2026 antog …`). | ||
| // Swedish month names: januari, februari, mars, april, maj, juni, | ||
| // juli, augusti, september, oktober, november, december. | ||
| /^Den\s+\d{1,2}\s+(?:januari|februari|mars|april|maj|juni|juli|augusti|september|oktober|november|december)\s+\d{4}\s*[,—–-]?\s+/i, | ||
| // German "Am <day>. <Monat> <year>" — January…December plus | ||
| // Mai/Mär/Juni/Juli for native forms. | ||
| /^Am\s+\d{1,2}\.?\s+(?:Januar|Februar|März|Marz|April|Mai|Juni|Juli|August|September|Oktober|November|Dezember)\s+\d{4}\s*[,—–-]?\s+/i, | ||
| // French "Le <day> <mois> <year>" (e.g. `Le 13 mai 2026, …`). | ||
| /^Le\s+\d{1,2}\s+(?:janvier|février|fevrier|mars|avril|mai|juin|juillet|août|aout|septembre|octobre|novembre|décembre|decembre)\s+\d{4}\s*[,—–-]?\s+/i, |
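How such prefix patterns are applied to a BLUF sentence can be sketched as below. `stripDatePrefix` and `DATE_PREFIX_PATTERNS` are hypothetical names. Only the Swedish and French patterns from the excerpt are included, and the first-match-wins loop is an assumption about how the real list is consumed.

```typescript
// Hypothetical helper, not taken from the PR. Patterns copied from the
// excerpt above, with dashes written as \u2014 (em) and \u2013 (en).
const DATE_PREFIX_PATTERNS: RegExp[] = [
  /^Den\s+\d{1,2}\s+(?:januari|februari|mars|april|maj|juni|juli|augusti|september|oktober|november|december)\s+\d{4}\s*[,\u2014\u2013-]?\s+/i,
  /^Le\s+\d{1,2}\s+(?:janvier|février|fevrier|mars|avril|mai|juin|juillet|août|aout|septembre|octobre|novembre|décembre|decembre)\s+\d{4}\s*[,\u2014\u2013-]?\s+/i,
];

// Removes a leading "<article> <day> <month> <year>[,]" date phrase, if any.
function stripDatePrefix(sentence: string): string {
  for (const pattern of DATE_PREFIX_PATTERNS) {
    if (pattern.test(sentence)) {
      return sentence.replace(pattern, '');
    }
  }
  return sentence;
}

console.log(stripDatePrefix('Den 13 maj 2026 antog riksdagen budgeten'));
console.log(stripDatePrefix('Le 13 mai 2026, le Parlement a adopté la loi'));
```

In Swedish the remainder can begin with a verb (`antog riksdagen …`), so date stripping alone does not guarantee a usable title fragment.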
@copilot apply changes based on the comments in this thread
… regex in startsWithVerb Agent-Logs-Url: https://github.com/Hack23/riksdagsmonitor/sessions/33adcefe-9921-4463-88d1-40ce5dcffa97 Co-authored-by: pethers <1726836+pethers@users.noreply.github.com>
Applied in 389b6e5:
All 126 leaf-module tests pass; ESLint clean.
…dened title gates Agent-Logs-Url: https://github.com/Hack23/riksdagsmonitor/sessions/d14d65f3-593c-442d-9dca-41dbe674d59b Co-authored-by: pethers <1726836+pethers@users.noreply.github.com>
- `analysis-gate` Check 7 hardening (TS + bash parity)
- `aggregator/seo/title.ts`
- `<title>` truncation: `truncateAtWord` applies `trimTrailingConnectors` (EN + SV + DE + FR)
- `no-useless-escape` ESLint error
- `/collapses to nothing/`; clarify article-seo.ts doc comment distinguishing aggregator's EN-only trailing-connector list from its multilingual date-prefix patterns
- `VERB_LEADING_TOKENS` with Swedish/German V2 verbs; Unicode `\p{L}+` regex in `startsWithVerb`
- `article.md` exists (`npx tsx scripts/render-articles.ts --all --lang all`); 2408 HTML files refreshed across 172 subfolders so the hardened title gates (BLUF-leak strip, multilingual date-prefix strip + verb-leading guard, trailing-connector strip) propagate to all live SERP/card titles
- `## Bottom Line` / `## So What` section exists in templates; BLUF is the canonical lede