Harden analysis-gate + renderer against BLUF-leak and duplicate-card title regressions #2530

Merged: pethers merged 11 commits into main from copilot/improve-news-titles-and-descriptions on May 16, 2026

Copilot AI commented May 16, 2026

  • Phase 1: analysis-gate Check 7 hardening (TS + bash parity)
  • Phase 2 — across-days H1 uniqueness check (normalised-identical, prior 7 sibling daily folders)
  • Phase 3 — renderer safety net in aggregator/seo/title.ts
  • Phase 6 — renderer SERP <title> truncation: truncateAtWord applies trimTrailingConnectors (EN + SV + DE + FR)
  • Review follow-up (4211ea9): "byte-identical" → "normalised-identical"; multilingual-superset doc comment; renamed misleading test cases; fixed no-useless-escape ESLint error
  • Review follow-up (085e46f): corrected dangling-tail guard comment; added US-order date regex to bash pre-flight
  • Review follow-up (1de57af): fix normaliseH1 date-strip ordering; narrow alreadyFlagged to H1-related only; update BLUF_DATE_PREFIX_PATTERNS comment to multilingual
  • Review follow-up (97160cb): tighten collapse-test assertion to /collapses to nothing/; clarify article-seo.ts doc comment distinguishing aggregator's EN-only trailing-connector list from its multilingual date-prefix patterns
  • Review follow-up (389b6e5): extend VERB_LEADING_TOKENS with Swedish/German V2 verbs; Unicode \p{L}+ regex in startsWithVerb
  • Phase 5 (now in scope) — re-rendered all news HTML articles for all 14 languages where article.md exists (npx tsx scripts/render-articles.ts --all --lang all); 2408 HTML files refreshed across 172 subfolders so the hardened title gates (BLUF-leak strip, multilingual date-prefix strip + verb-leading guard, trailing-connector strip) propagate to all live SERP/card titles
  • Phase 4 (description quality) deferred — no ## Bottom Line / ## So What section exists in templates; BLUF is the canonical lede

Copilot AI and others added 3 commits May 16, 2026 15:39
…d-aggregation duplicate-card guard

Agent-Logs-Url: https://github.com/Hack23/riksdagsmonitor/sessions/54808640-bb83-4031-b74a-089773718be0

Co-authored-by: pethers <1726836+pethers@users.noreply.github.com>
…build noise, not code changes)

Agent-Logs-Url: https://github.com/Hack23/riksdagsmonitor/sessions/54808640-bb83-4031-b74a-089773718be0

Co-authored-by: pethers <1726836+pethers@users.noreply.github.com>
pethers marked this pull request as ready for review May 16, 2026 16:04
Copilot AI review requested due to automatic review settings May 16, 2026 16:04
github-actions bot added the documentation, testing, refactor, and size-l labels May 16, 2026
@github-actions

🏷️ Automatic Labeling Summary

This PR has been automatically labeled based on the files changed and PR metadata.

Applied Labels: documentation, testing, refactor, size-l

Label Categories

  • 🗳️ Content: news, dashboard, visualization, intelligence
  • 💻 Technology: html-css, javascript, workflow, security
  • 📊 Data: cia-data, riksdag-data, data-pipeline, schema
  • 🌍 I18n: i18n, translation, rtl
  • 🔒 ISMS: isms, iso-27001, nist-csf, cis-controls
  • 🏗️ Infrastructure: ci-cd, deployment, performance, monitoring
  • 🔄 Quality: testing, accessibility, documentation, refactor
  • 🤖 AI: agent, skill, agentic-workflow

For more information, see .github/labeler.yml.

@github-actions

🔍 Lighthouse Performance Audit

Category Score Status
Performance 85/100 🟡
Accessibility 95/100 🟢
Best Practices 90/100 🟢
SEO 95/100 🟢

📥 Download full Lighthouse report

Budget Compliance: Performance budgets enforced via budget.json

Copilot AI left a comment

Pull request overview

This PR hardens the executive-brief headline pipeline so bad/boilerplate H1s can’t silently degrade into BLUF-derived fragments in rendered SERP/card titles, and adds a “duplicate headline across days” guard for period/aggregation subfolders.

Changes:

  • Tightens analysis-gate Family C checks for executive-brief H1 validity (missing H1, placeholder/boilerplate collapse via renderer-equivalent cleanup, date-in-title, dangling punctuation/connectors).
  • Adds an across-days uniqueness check (last 7 sibling daily folders in the same subfolder) to prevent repeated headlines on /news/index.html.
  • Improves renderer title hygiene (cleanArticleTitle trailing punctuation strip; titleFromBluf safer truncation; expanded date-prefix stripping) and adds regression tests.
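
The across-days uniqueness check described above can be sketched roughly as follows. This is a minimal illustration, not the code in analysis-gate.ts: it mirrors the normalisation steps visible in the bash pre-flight (lowercase, strip ISO dates, collapse whitespace and punctuation, trim), and the names `normaliseH1Sketch` and `findDuplicateSibling`, as well as the in-memory map of sibling H1s, are assumptions made for the example.

```typescript
// Hypothetical sketch: normalise an H1 so that case, punctuation and ISO
// dates cannot hide a duplicate headline.
function normaliseH1Sketch(h1: string): string {
  return h1
    .replace(/^#\s+/, '')                 // drop the Markdown H1 marker
    .toLowerCase()
    .replace(/\d{4}-\d{2}-\d{2}/g, ' ')   // strip ISO dates before collapsing punctuation
    .replace(/[\s\p{P}\p{S}]+/gu, ' ')    // collapse whitespace, punctuation and symbols
    .trim();
}

// siblings maps a folder date (YYYY-MM-DD) to the raw H1 line of that day's
// executive brief. Returns the date of the first normalised-identical
// sibling among the most recent `maxSiblings` days, or null if none matches.
function findDuplicateSibling(
  currentH1: string,
  siblings: ReadonlyMap<string, string>,
  maxSiblings = 7,
): string | null {
  const current = normaliseH1Sketch(currentH1);
  if (current === '') return null; // an empty H1 is a separate failure mode
  const recentDates = Array.from(siblings.keys()).sort().reverse().slice(0, maxSiblings);
  for (const date of recentDates) {
    if (normaliseH1Sketch(siblings.get(date) ?? '') === current) return date;
  }
  return null;
}
```

Comparing normalised strings rather than raw bytes is what makes a headline that differs only in case, punctuation, or an embedded date count as a duplicate.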

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 5 comments.

File Description
scripts/agentic/analysis-gate.ts Adds stricter executive-brief H1 validation and a sibling-days normalised-duplicate check.
.github/prompts/05-analysis-gate.md Mirrors the new gate checks in bash as a fast pre-flight (date/connector/punctuation/duplicate checks).
scripts/render-lib/aggregator/seo/title.ts Strengthens title cleanup + BLUF-title synthesis (punctuation strip, truncation guard, extra date-prefix patterns).
tests/agentic-analysis-gate.test.ts Adds regression coverage for new gate failure modes incl. across-days duplicates.
tests/render-lib-leaf-modules.test.ts Adds renderer-level regression tests for new title cleanup and BLUF truncation behavior.

Comment thread scripts/agentic/analysis-gate.ts Outdated
checkId: 'family-c-structure',
passed: false,
message:
`executive-brief.md: H1 is byte-identical to analysis/daily/${siblingDate}/${subfolder}/executive-brief.md — reword to surface the day-specific angle (period-aggregation briefs must not ship duplicate cards on the news index)`,
Comment thread tests/agentic-analysis-gate.test.ts Outdated
Comment on lines +846 to +870
it('fails when H1 is byte-identical to a prior day brief in the same subfolder', async () => {
  // Reproduces the Phase-2 regression: period-aggregation briefs
  // ("Tidö Current Mandate" × 2 days, "Sweden Year-Ahead → +365"
  // × 2 days) shipped duplicate cards on /news/index.html because
  // the workflow scope barely changes day-to-day. Build a
  // synthetic `analysis/daily/<date>/<subfolder>/` layout with
  // two sibling dates whose brief H1s normalise to the same
  // string.
  const fakeRoot = join(tmpdir(), `agentic-gate-test-acrossdays-${Date.now()}-${Math.random().toString(36).slice(2)}`);
  const subfolder = 'forecast-year-ahead';
  const today = join(fakeRoot, 'analysis', 'daily', '2026-05-16', subfolder);
  const yesterday = join(fakeRoot, 'analysis', 'daily', '2026-05-15', subfolder);
  mkdirSync(today, { recursive: true });
  mkdirSync(yesterday, { recursive: true });
  try {
    const h1Line = '# Sweden Year-Ahead Forecast: Tidö Coalition Faces Election Sprint\n';
    const body = '\n## 🎯 BLUF\n\nSummary.\n\n## 🧭 Decisions\n\n1. A\n';
    writeArtifact(today, 'executive-brief.md', h1Line + body);
    writeArtifact(yesterday, 'executive-brief.md', h1Line + body);
    const results = await checkFamilyCStructure(today);
    const failures = results.filter(
      (r) => !r.passed && r.artifact === 'executive-brief.md' && /byte-identical/i.test(r.message ?? ''),
    );
    expect(failures.length).toBeGreaterThan(0);
    expect(failures[0]?.message).toMatch(/2026-05-15/);
Comment thread .github/prompts/05-analysis-gate.md Outdated
[ -n "$EB_SIB_H1" ] || continue
EB_SIB_NORM="$(printf '%s' "$EB_SIB_H1" | tr '[:upper:]' '[:lower:]' | sed -E 's/[0-9]{4}-[0-9]{2}-[0-9]{2}//g' | tr -s '[:space:][:punct:]' ' ' | sed -E 's/^[[:space:]]+|[[:space:]]+$//g')"
if [ "$EB_SIB_NORM" = "$EB_CURR_NORM" ]; then
echo "❌ executive-brief.md: H1 is byte-identical to analysis/daily/$EB_SIBLING/$EB_SUBFOLDER/executive-brief.md — reword to surface the day-specific angle (period-aggregation briefs must not ship duplicate cards on the news index)"
Comment on lines +295 to +310
// Dangling-tail guard — if the cut ends on a ≤ 3-char word (`Tidö`,
// `two`, `the`, `on`, …) it reads as a truncated fragment. Step back
// to the previous word boundary so the title ends on a substantive
// word. Live cases: `… converging on the Tidö` (weekly-review),
// `… has advanced two` (committeeReports), `… of` (interpellations).
// Word lengths ≤ 3 are the empirical cutoff: 4-char words like
// `bill`, `cuts`, `vote` are substantive enough to end a title.
let safetyCounter = 0;
while (safetyCounter < 5) {
const tail = cut.match(/(\S+)$/);
if (!tail) break;
const tailWord = tail[1]!;
// Allow short tails only when followed by punctuation (e.g. `EU.`
// is already trimmed off above, but defensive). Also allow tails
// containing digits (numbers like `12` or `7` are substantive).
if (tailWord.length > 3 || /\d/.test(tailWord)) break;
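
The excerpt above stops partway through the loop body. As a rough, self-contained illustration of the same idea (step back over short trailing words, keep numeric tails, bound the number of steps), one possible shape is sketched below. The function name `stepBackShortTail` and the rule of never collapsing the title to an empty string are assumptions for the example, not the code in title.ts.

```typescript
// Hypothetical sketch of a dangling-tail guard: while the last word is at
// most 3 characters long and contains no digit, drop it and retry, up to
// `maxSteps` times.
function stepBackShortTail(cut: string, maxSteps = 5): string {
  let out = cut.trimEnd();
  for (let step = 0; step < maxSteps; step++) {
    const tail = out.match(/(\S+)$/);
    if (!tail) break;
    const tailWord = tail[1] ?? '';
    // Longer words and tails containing digits are treated as substantive.
    if (tailWord.length > 3 || /\d/.test(tailWord)) break;
    const shortened = out.slice(0, out.length - tailWord.length).trimEnd();
    if (shortened === '') break; // assumption: never reduce the title to nothing
    out = shortened;
  }
  return out;
}
```

The step bound keeps the loop from eating an entire title made of short words, at the cost of occasionally leaving a short tail in place.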
Comment on lines +208 to +223
# Date-in-H1 guard (seo-metadata-contract.md §2.1) — title must not
# contain a literal publication date. Catches ISO YYYY-MM-DD and
# English + Swedish long-form months.
EB_H1_TEXT="$(printf '%s' "$EB_H1" \
| sed -E 's/^#[[:space:]]+//' \
| sed -E 's/<[^>]+>//g')"
if printf '%s' "$EB_H1_TEXT" | grep -qE '[0-9]{4}[-/][0-9]{1,2}[-/][0-9]{1,2}'; then
echo "❌ executive-brief.md: H1 contains a literal ISO date (YYYY-MM-DD) — dates belong in article:published_time, not the SERP <title>"
FAIL=1
elif printf '%s' "$EB_H1_LOWER" | grep -qE '[0-9]{1,2}[[:space:]]+(january|february|march|april|may|june|july|august|september|october|november|december)[[:space:]]+[0-9]{4}'; then
echo "❌ executive-brief.md: H1 contains a literal English long-form date — dates belong in article:published_time, not the SERP <title>"
FAIL=1
elif printf '%s' "$EB_H1_LOWER" | grep -qE '[0-9]{1,2}[[:space:]]+(januari|februari|mars|april|maj|juni|juli|augusti|september|oktober|november|december)[[:space:]]+[0-9]{4}'; then
echo "❌ executive-brief.md: H1 contains a literal Swedish long-form date — dates belong in article:published_time, not the SERP <title>"
FAIL=1
fi
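
For comparison with the bash pre-flight above, the same three date checks can be sketched in TypeScript. This is an illustrative mirror under stated assumptions, not the gate's actual implementation: the function name `findLiteralDate` and its return labels are made up for the example, and only the ISO, English long-form, and Swedish long-form patterns shown in the bash excerpt are covered.

```typescript
type DateKind = 'iso' | 'english-long-form' | 'swedish-long-form';

const EN_MONTHS =
  'january|february|march|april|may|june|july|august|september|october|november|december';
const SV_MONTHS =
  'januari|februari|mars|april|maj|juni|juli|augusti|september|oktober|november|december';

// Hypothetical helper: report which kind of literal date an H1 contains,
// or null when the headline is date-free.
function findLiteralDate(h1: string): DateKind | null {
  const text = h1
    .replace(/^#\s+/, '')      // drop the Markdown H1 marker
    .replace(/<[^>]+>/g, '')   // drop inline HTML tags
    .toLowerCase();
  if (/\d{4}[-/]\d{1,2}[-/]\d{1,2}/.test(text)) return 'iso';
  if (new RegExp(`\\d{1,2}\\s+(?:${EN_MONTHS})\\s+\\d{4}`).test(text)) return 'english-long-form';
  if (new RegExp(`\\d{1,2}\\s+(?:${SV_MONTHS})\\s+\\d{4}`).test(text)) return 'swedish-long-form';
  return null;
}
```

Months spelled identically in both languages (april, september, november, december) are reported as English here simply because that branch runs first, matching the order of the bash `elif` chain.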
Copilot AI requested a review from pethers May 16, 2026 16:21
pethers requested a review from Copilot May 16, 2026 16:22

Copilot AI left a comment

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 5 comments.

Comment on lines +49 to +68
/**
* Trailing connector punctuation / words left behind when the
* word-boundary truncation in {@link truncateAtWord} cuts a long
* brief H1 at a coordinating connector. Mirror of the rule in
* `aggregator/seo/title.ts § TRAILING_CONNECTOR_RE` — applied here too
* because the renderer's `<title>` budget (70 chars) is tighter than
* the brief H1 and can truncate a perfectly clean H1 mid-connector.
*
* Live case: brief H1
* "Riksdag Enshrines Constitutional Protection for Abortion — and
* Expands the Security State's Toolkit" (99 chars)
* → without this strip the SERP `<title>` ships as
* "Riksdag Enshrines Constitutional Protection for Abortion — and…"
* which reads as a dangling connector to readers and search engines.
* With this strip the SERP `<title>` ships as
* "Riksdag Enshrines Constitutional Protection for Abortion…"
* which is clean prose.
*/
const TRAILING_CONNECTOR_RE =
/[\s,;:—–-]+(?:and|or|but|with|as|in|of|to|for|on|at|by|from|that|which|who|when|where|while|after|before|the|a|an|have|has|had|is|are|was|were|will|would|can|may|might|should|must|och|men|eller|med|som|av|till|för|på|i|att|der|die|das|und|oder|aber|mit|als|für|in|auf|et|ou|mais|avec|comme|de|à|pour|en|sur)$/iu;
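
To make the live case in the doc comment above concrete, here is a minimal sketch of word-boundary truncation followed by a trailing-connector strip. It is not the code in article-seo.ts: the `Sketch` suffixes mark both functions as illustrative, the connector list is a shortened subset of the regex above, and reserving one character of the budget for the ellipsis is an assumption.

```typescript
// Shortened, illustrative subset of the connector list quoted above.
const CONNECTORS_SKETCH = ['and', 'or', 'but', 'with', 'of', 'to', 'for', 'the', 'a', 'an',
  'och', 'men', 'eller', 'med', 'und', 'oder', 'et', 'ou'];
const TRAILING_CONNECTOR_SKETCH_RE =
  new RegExp(`[\\s,;:—–-]+(?:${CONNECTORS_SKETCH.join('|')})$`, 'iu');

// Repeatedly strip a trailing connector word and any separator left behind.
function trimTrailingConnectorsSketch(text: string): string {
  let out = text.trimEnd();
  let previous: string;
  do {
    previous = out;
    out = out.replace(TRAILING_CONNECTOR_SKETCH_RE, '').replace(/[\s,;:—–-]+$/u, '');
  } while (out !== previous);
  return out;
}

// Cut at the last word boundary that fits the budget, then clean the tail.
function truncateAtWordSketch(title: string, budget: number): string {
  if (title.length <= budget) return title;
  const room = budget - 1; // assumption: one character is reserved for the ellipsis
  let cut = title.slice(0, room);
  const cutMidWord = !/\s/.test(title.charAt(room));
  const lastSpace = cut.lastIndexOf(' ');
  if (cutMidWord && lastSpace > 0) cut = cut.slice(0, lastSpace);
  return `${trimTrailingConnectorsSketch(cut)}…`;
}
```

With a 70-character budget, the 99-character H1 from the doc comment is first cut back to the word boundary after "and", and the strip then removes the dash and connector, so the result ends on "Abortion…".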
Comment on lines +55 to +65
it('strips trailing connector before site suffix when H1 already fits', () => {
  // Short H1 ending with a connector should never happen post Phase-1
  // gate, but defensive: even if it does, the site suffix flow must
  // not preserve the dangling connector.
  const shortH1 = 'Riksdag Approves FiU48 Fuel-Tax Cut and';
  const result = buildSeoTitle({ ...baseInput, title: shortH1 });
  // Short H1 + suffix flow doesn't go through truncateAtWord, so the
  // connector is preserved here — Phase-1 gate is the guard for this
  // case. Document the contract by asserting the suffix appears.
  expect(result).toMatch(/Riksdagsmonitor$/);
});
Comment on lines +81 to +88
it('keeps a complete H1 unchanged when it already fits the budget', () => {
const h1 = 'Russia Legalises Aggression — Sweden Faces Three Deadlines';
const result = buildSeoTitle({ ...baseInput, title: h1 });
// 58 chars + " — Riksdagsmonitor" (18) = 76 → does NOT fit, so
// truncateAtWord runs. But truncation should land cleanly, not on
// a connector.
expect(result).not.toMatch(/\b(and|or|with|the|a|an)…$/i);
});
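
The arithmetic in the two test comments above (does the H1 plus the site suffix fit the budget?) can be checked with a few lines. This is only a sketch of the decision, assuming the 70-character `<title>` budget and the " — Riksdagsmonitor" suffix mentioned in this thread; `fitsWithSuffix` is a hypothetical name, and how buildSeoTitle actually makes the comparison is not shown in the excerpts.

```typescript
const TITLE_BUDGET = 70;                  // budget quoted in the doc comment in this thread
const SITE_SUFFIX = ' — Riksdagsmonitor'; // 18 characters

// Hypothetical helper: true when the H1 and the site suffix fit together.
function fitsWithSuffix(title: string): boolean {
  return title.length + SITE_SUFFIX.length <= TITLE_BUDGET;
}

const shortH1 = 'Riksdag Approves FiU48 Fuel-Tax Cut and';
const mediumH1 = 'Russia Legalises Aggression — Sweden Faces Three Deadlines';

console.log(fitsWithSuffix(shortH1));  // suffix-fits path, no truncation
console.log(fitsWithSuffix(mediumH1)); // 58 + 18 = 76 > 70, so truncation runs
```

This is why the first test only asserts the suffix, while the second asserts that the truncated tail does not end on a connector.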

pethers commented May 16, 2026

@copilot apply changes based on the comments in this thread and fix Quality Checks / typescript-lint (pull_request), which is failing after 2m

…ording, multilingual comment, test names)

Agent-Logs-Url: https://github.com/Hack23/riksdagsmonitor/sessions/078057a9-583a-4071-af61-bea7ffccdf9c

Co-authored-by: pethers <1726836+pethers@users.noreply.github.com>

Copilot AI commented May 16, 2026

@copilot apply changes based on the comments in this thread and fix [Quality Checks / typescript-lin...

Done in 4211ea9:

  • Reworded "byte-identical" → "normalised-identical (case/punctuation/date stripped)" in analysis-gate.ts failure message, the 05-analysis-gate.md bash pre-flight, and the two matching test assertions in agentic-analysis-gate.test.ts.
  • Clarified the TRAILING_CONNECTOR_RE doc comment in article-seo.ts as an expanded multilingual superset of the EN-only aggregator rule, with explicit sync guidance.
  • Renamed the two misleading test cases in article-seo-title-truncation.test.ts to describe the contract they actually verify (suffix-fits path preserves the dangling connector — Phase-1 gate is the guard; medium-H1 path runs truncateAtWord and must land on a substantive word).
  • Fixed the no-useless-escape ESLint error (\- inside a character class on analysis-gate.ts:815) — unblocks the failing Quality Checks / typescript-lint job.

ESLint clean on all touched files; 104 affected vitest tests pass.

Copilot AI left a comment

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 2 comments.

Comment thread tests/agentic-analysis-gate.test.ts Outdated
'# Executive Brief — Realtime Pulse 2026-05-16\n\n## 🎯 BLUF\n\nSwedish parliamentary activity.\n\n## 🧭 Decisions\n\n1. A\n');
const results = await checkFamilyCStructure(subfolderDir);
const failures = results.filter(
(r) => !r.passed && r.artifact === 'executive-brief.md',
Comment thread scripts/render-lib/article-seo.ts Outdated
Comment on lines +55 to +62
* `aggregator/seo/title.ts § TRAILING_CONNECTOR_RE`: the aggregator
* helper is English-only (it operates on BLUF sentences that have
* already been normalised to English by the analysis pipeline),
* whereas the renderer here must also strip Swedish / German /
* French connectors because executive-brief H1s ship in all 14
* languages. If you update either list, update both — keep this
* regex strictly a superset of the aggregator one (drift in the
* EN-only subset would let dangling EN connectors leak through in

pethers commented May 16, 2026

@copilot apply changes based on the comments in this thread

…ify article-seo.ts doc comment re aggregator multilingual scope

Agent-Logs-Url: https://github.com/Hack23/riksdagsmonitor/sessions/b803dadf-cfa2-4e39-9ef1-838a0ede1faf

Co-authored-by: pethers <1726836+pethers@users.noreply.github.com>

Copilot AI commented May 16, 2026

@copilot apply changes based on the comments in this thread

Applied both review comments in 97160cb:

  1. Collapse-test assertion tightened — now matches /collapses to nothing/i instead of just checking any failure exists, so regressions where a different H1 guard fires first won't mask the collapse guard.
  2. article-seo.ts doc comment clarified — now distinguishes the aggregator's EN-only trailing-connector list from its multilingual date-prefix patterns (BLUF_DATE_PREFIX_PATTERNS covers EN + SV + DE + FR).

ESLint clean; 104 tests pass.
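
Why the tightened assertion matters can be shown with plain data. In the sketch below, the `GateResult` shape is inferred from the fields used in the test excerpts (`passed`, `artifact`, `message`), and the sample result is invented for illustration: only a date guard has fired, yet the loose check would still pass.

```typescript
interface GateResult {
  passed: boolean;
  artifact: string;
  message?: string;
}

// Loose check: any failure on the brief counts.
function anyBriefFailure(results: readonly GateResult[]): boolean {
  return results.some((r) => !r.passed && r.artifact === 'executive-brief.md');
}

// Tight check: the failure must come from the collapse guard specifically.
function collapseGuardFired(results: readonly GateResult[]): boolean {
  return results.some(
    (r) =>
      !r.passed &&
      r.artifact === 'executive-brief.md' &&
      /collapses to nothing/i.test(r.message ?? ''),
  );
}

// Invented sample: a different H1 guard fired, the collapse guard did not.
const sampleResults: GateResult[] = [
  { passed: false, artifact: 'executive-brief.md', message: 'H1 contains a literal ISO date (YYYY-MM-DD)' },
];
```

With this sample, a test built on `anyBriefFailure` stays green even if the collapse guard were removed, whereas one built on `collapseGuardFired` would fail.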

Copilot AI left a comment

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 1 comment.

Comment on lines +154 to +162
// Swedish "Den <day> <månad> <year>" (e.g. `Den 13 maj 2026 antog …`).
// Swedish month names: januari, februari, mars, april, maj, juni,
// juli, augusti, september, oktober, november, december.
/^Den\s+\d{1,2}\s+(?:januari|februari|mars|april|maj|juni|juli|augusti|september|oktober|november|december)\s+\d{4}\s*[,—–-]?\s+/i,
// German "Am <day>. <Monat> <year>" — January…December plus
// Mai/Mär/Juni/Juli for native forms.
/^Am\s+\d{1,2}\.?\s+(?:Januar|Februar|März|Marz|April|Mai|Juni|Juli|August|September|Oktober|November|Dezember)\s+\d{4}\s*[,—–-]?\s+/i,
// French "Le <day> <mois> <year>" (e.g. `Le 13 mai 2026, …`).
/^Le\s+\d{1,2}\s+(?:janvier|février|fevrier|mars|avril|mai|juin|juillet|août|aout|septembre|octobre|novembre|décembre|decembre)\s+\d{4}\s*[,—–-]?\s+/i,

pethers commented May 16, 2026

@copilot apply changes based on the comments in this thread

… regex in startsWithVerb

Agent-Logs-Url: https://github.com/Hack23/riksdagsmonitor/sessions/33adcefe-9921-4463-88d1-40ce5dcffa97

Co-authored-by: pethers <1726836+pethers@users.noreply.github.com>

Copilot AI commented May 16, 2026

@copilot apply changes based on the comments in this thread

Applied in 389b6e5:

  • startsWithVerb regex updated from /^([A-Za-z]+)/ to /^(\p{L}+)/u — Unicode property escape correctly captures Swedish (ä/å/ö) and German (ä/ö/ü) verb starts (e.g. röstade, godkände, veröffentlichte).
  • VERB_LEADING_TOKENS extended with Swedish V2 past-tense verbs (antog, beslutade, röstade, godkände, avslog, föreslog, …) and modals (ska, vill, måste, bör), plus German V2 past-tense verbs (beschloss, verabschiedete, stimmte, lehnte, kündigte, …) and modals (wird, soll, muss, kann).
  • VERB_LEADING_TOKENS comment updated to explain the multilingual scope (EN + SV + DE) mirrors BLUF_DATE_PREFIX_PATTERNS, and the V2 word-order mechanism behind the guard.
  • Tests: the old "strips Swedish date prefix" test is replaced with four new tests — Swedish V2 verb-leading → date kept (antog), Swedish comma-separated subject-first → date stripped, German V2 verb-leading → date kept (beschloss), German noun-leading → date stripped.

All 126 leaf-module tests pass; ESLint clean.
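
The interaction between the date-prefix patterns and the verb-leading guard described above can be sketched as follows. The two regexes are copied from the excerpt quoted earlier in this thread; everything else is an assumption for illustration: the token set is a small subset of the verbs listed in the comment, and `stripDatePrefixSketch` and `startsWithVerbSketch` are hypothetical stand-ins rather than the functions in title.ts.

```typescript
const DATE_PREFIXES_SKETCH: RegExp[] = [
  /^Den\s+\d{1,2}\s+(?:januari|februari|mars|april|maj|juni|juli|augusti|september|oktober|november|december)\s+\d{4}\s*[,—–-]?\s+/i,
  /^Am\s+\d{1,2}\.?\s+(?:Januar|Februar|März|Marz|April|Mai|Juni|Juli|August|September|Oktober|November|Dezember)\s+\d{4}\s*[,—–-]?\s+/i,
];

// Small illustrative subset of the verb tokens listed in the comment above.
const VERB_TOKENS_SKETCH = new Set(['antog', 'beslutade', 'röstade', 'godkände', 'beschloss', 'stimmte']);

// \p{L}+ captures the whole first word, including ä/å/ö/ü.
function startsWithVerbSketch(text: string): boolean {
  const first = text.match(/^(\p{L}+)/u);
  return first !== null && VERB_TOKENS_SKETCH.has((first[1] ?? '').toLowerCase());
}

// Strip a leading date only when the remainder does not start with a verb;
// in V2 word order, stripping would otherwise leave a verb-leading fragment.
function stripDatePrefixSketch(bluf: string): string {
  for (const pattern of DATE_PREFIXES_SKETCH) {
    const match = bluf.match(pattern);
    if (!match) continue;
    const rest = bluf.slice(match[0].length);
    return startsWithVerbSketch(rest) ? bluf : rest;
  }
  return bluf;
}
```

With the earlier ASCII-only pattern `/^([A-Za-z]+)/`, a word such as `röstade` would be captured only as `r`, so the guard would miss it and the date would be stripped.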

github-actions bot added the html-css, translation, rtl, news, and size-xl labels May 16, 2026
pethers merged commit 63ca729 into main May 16, 2026
18 checks passed
pethers deleted the copilot/improve-news-titles-and-descriptions branch May 16, 2026 19:48