
Strip Riksdag raw-dump prefix and embedded CSS in extractKeyPassage#1831

Merged
pethers merged 3 commits into main from copilot/fix-html-css-display-issues on Apr 18, 2026


Copilot AI commented Apr 18, 2026

get_dokument_innehall returns a text field whose first ~500 bytes are a metadata dump followed by embedded CSS rule blocks. extractKeyPassage only stripped HTML tags and URLs, so every per-document summary in the generic/weekly/monthly review generators emitted raw metadata and CSS as visible prose (see the two cleaned articles in news/).

Example of the leaked content, now stripped at the source:

5287561 HD03242 2025/26 242 prop prop prop Proposition 2025/26:242
Proposition Proposition Landsbygds- och infrastrukturdepartementet MJU 242 0
2026-04-16 00:00:00 … Ett tydligt regelverk för aktivt skogsbruk
html-ec prop-RIM 76066c92-4400-457b-ac3a-a0f403e9bdfc
body {margin-top: 0px;margin-left: 0px;}
#page_1 {position:relative; overflow: hidden; …}

Changes

  • scripts/data-transformers/helpers.ts — new stripRiksdagRawDump() called from extractKeyPassage, so all generators using it (generic, weekly-review, monthly-review, propositions, committee, motions, policy-analysis) inherit the fix:
    • CSS block stripper: bounded [^{}]{0,300}\{[^{}]{0,1000}\} (fixed upper bounds and brace-free character classes, so no catastrophic backtracking) + named CSS_PROPERTY_SIGNATURE constant tested only against the brace body via capture group, so prose like "prisökning: 10 procent { … }" with an unrelated brace pair is preserved.
    • Metadata prefix stripper: triggers on ^\d{6,}\s+HD\S+\s+\d{4}/\d{2} and strips everything up to a boundary, trying each in order until one matches: html-ec …-RIM <uuid>, then the first bare UUID, then the last YYYY-MM-DD HH:MM:SS timestamp.
  • news/2026-04-18-weekly-review-{en,sv}.html — removed the 64 corrupted <span lang="sv">DUMP</span> prefixes per file; legitimate <span lang="sv"> title wrappers are untouched.
  • tests/extract-key-passage.test.ts — 7 new cases: html-ec+UUID path, bare-UUID fallback, timestamp-only fallback, and false-positive guards for Swedish prose containing braces / :digit / px.
  • .github/aw/SHARED_PROMPT_PATTERNS.md — added a 🚨 rule next to the get_dokument_innehall field mapping: agents must never paste the raw text field (or a prefix of it) into article HTML.
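The tiered boundary search described for the metadata prefix stripper can be sketched roughly as follows. This is a minimal illustration, not the code in helpers.ts: the function name stripMetadataPrefixSketch, the constant names, and the exact marker regexes are assumptions; only the trigger pattern and the 1500-character header window come from the PR text and review excerpts.

```typescript
// Illustrative sketch only: names and structure are assumptions, not the PR's code.
const UUID_SRC = "[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}";
// Trigger: numeric doc id, HD<dok_id>, riksmöte (YYYY/YY), as quoted in the PR.
const META_PREFIX = /^\s*\d{6,}\s+HD\S+\s+\d{4}\/\d{2}\s+/;
// Tier 1 (assumed shape): "html-ec <doktype>-RIM <uuid>".
const HTML_EC_MARKER = new RegExp("html-ec\\s+\\S+-RIM\\s+" + UUID_SRC + "\\s*", "i");
// Tier 2: first bare UUID.
const BARE_UUID = new RegExp(UUID_SRC + "\\s*", "i");
// Tier 3: last "YYYY-MM-DD HH:MM:SS" timestamp.
const TIMESTAMP_SRC = "\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2}\\s*";
const HEADER_WINDOW = 1500;

function stripMetadataPrefixSketch(text: string): string {
  // Unknown layout: leave the text untouched rather than risk truncating prose.
  if (!META_PREFIX.test(text)) return text;
  const head = text.slice(0, HEADER_WINDOW);

  const marker = HTML_EC_MARKER.exec(head);
  if (marker) return text.slice(marker.index + marker[0].length);

  const uuid = BARE_UUID.exec(head);
  if (uuid) return text.slice(uuid.index + uuid[0].length);

  // Walk all timestamps in the header window and cut after the last one.
  const timestamp = new RegExp(TIMESTAMP_SRC, "g");
  let lastEnd = -1;
  let m: RegExpExecArray | null;
  while ((m = timestamp.exec(head)) !== null) lastEnd = m.index + m[0].length;
  return lastEnd > -1 ? text.slice(lastEnd) : text;
}
```

Input that does not start with the ID/HD/riksmöte triple is returned unchanged, which matches the conservative behaviour the PR describes.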

Residual risk

The stripper is conservative — if Riksdag introduces a new dump layout without the current ID/HD/riksmöte triple, content falls through unchanged rather than being wrongly truncated. Add a new tier if that happens.

@pethers pethers marked this pull request as ready for review April 18, 2026 18:37
Copilot AI review requested due to automatic review settings April 18, 2026 18:37
@pethers pethers merged commit 7af654f into main Apr 18, 2026
13 checks passed
@github-actions (bot) added the labels documentation (Documentation updates), html-css (HTML/CSS changes), translation (Translation updates), testing (Test coverage), refactor (Code refactoring), news (News articles and content generation), agentic-workflow (Agentic workflow changes), and size-l (Large change, 250-1000 lines) on Apr 18, 2026
@github-actions

🏷️ Automatic Labeling Summary

This PR has been automatically labeled based on the files changed and PR metadata.

Applied Labels: documentation,html-css,translation,testing,refactor,size-l,news,agentic-workflow

Label Categories

  • 🗳️ Content: news, dashboard, visualization, intelligence
  • 💻 Technology: html-css, javascript, workflow, security
  • 📊 Data: cia-data, riksdag-data, data-pipeline, schema
  • 🌍 I18n: i18n, translation, rtl
  • 🔒 ISMS: isms, iso-27001, nist-csf, cis-controls
  • 🏗️ Infrastructure: ci-cd, deployment, performance, monitoring
  • 🔄 Quality: testing, accessibility, documentation, refactor
  • 🤖 AI: agent, skill, agentic-workflow

For more information, see .github/labeler.yml.

@github-actions

🔍 Lighthouse Performance Audit

Category | Score | Status
Performance | 85/100 | 🟡
Accessibility | 95/100 | 🟢
Best Practices | 90/100 | 🟢
SEO | 95/100 | 🟢

📥 Download full Lighthouse report

Budget Compliance: Performance budgets enforced via budget.json

Copilot AI left a comment

Pull request overview

This PR hardens the document-excerpt pipeline by stripping the Riksdag MCP get_dokument_innehall.text raw-dump prefix (metadata tokens + embedded CSS) so generators using extractKeyPassage() no longer leak dump/CSS into rendered article prose, and it retroactively cleans the affected 2026-04-18 weekly review pages.

Changes:

  • Add stripRiksdagRawDump() and call it from extractKeyPassage() to remove dump headers and CSS blocks at the source.
  • Clean the already-published weekly review EN/SV HTML pages to remove leaked dump fragments.
  • Add Vitest coverage for several dump layouts and false-positive guards; add an agent prompt rule to avoid pasting raw MCP text into HTML.

Reviewed changes

Copilot reviewed 4 out of 5 changed files in this pull request and generated 5 comments.

File | Description
scripts/data-transformers/helpers.ts | Introduces stripRiksdagRawDump() and integrates it into extractKeyPassage() so downstream generators inherit sanitization.
tests/extract-key-passage.test.ts | Adds regression tests for dump/CSS stripping and false-positive protection.
news/2026-04-18-weekly-review-sv.html | Removes leaked dump/CSS text from per-document entries in the published Swedish weekly review.
news/2026-04-18-weekly-review-en.html | Removes leaked dump/CSS text from per-document entries in the published English weekly review.
.github/aw/SHARED_PROMPT_PATTERNS.md | Adds an explicit rule warning agents not to paste raw MCP text into article HTML and to use the sanitized excerpt path.

Comment on lines +396 to +403
// Fallback 1: strip up to and including the first bare UUID if it is
// still within the metadata header window (first ~1.5k chars).
const uuidIdx = s.search(/[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}/i);
if (uuidIdx > -1 && uuidIdx < 1500) {
  s = s
    .slice(uuidIdx)
    .replace(/^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}\s*/i, '');
} else {
Copilot AI Apr 18, 2026

In the UUID fallback path, the code strips the UUID only when it is followed by whitespace. Some MCP dump tokens appear to append a language code directly after the UUID (e.g. ...<uuid>sv), which would leave a leading sv (or similar) at the start of the remaining text. Consider stripping an optional trailing language suffix immediately after the UUID as part of the removal step.
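One way to handle the suffix is sketched below. It is an illustration under assumptions: the function name stripThroughUuidSketch is hypothetical, and the set of language codes (sv, en) is a guess, since the source only shows sv. The lookahead requires whitespace or end of input after the code, so Swedish words that merely begin with "sv" are not clipped.

```typescript
// Illustrative sketch; the language-code list (sv|en) is an assumption.
const UUID = /[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}/i;
const LEADING_UUID_WITH_LANG =
  /^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}(?:(?:sv|en)(?=\s|$))?\s*/i;
const HEADER_WINDOW = 1500;

function stripThroughUuidSketch(s: string): string {
  const idx = s.search(UUID);
  // Only treat the UUID as a header boundary when it sits in the header window.
  if (idx === -1 || idx >= HEADER_WINDOW) return s;
  // Drop everything before the UUID, then the UUID plus an optional
  // language code that is glued directly onto it.
  return s.slice(idx).replace(LEADING_UUID_WITH_LANG, "");
}
```

A whitespace lookahead is used instead of `\b` because JavaScript's `\b` treats "å", "ä" and "ö" as non-word characters unless Unicode-aware matching is set up, which would wrongly strip the "sv" from a word such as "svår".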

Comment on lines +382 to +383
const CSS_PROPERTY_SIGNATURE = /(?::\s*-?\d|\b(?:px|em|rem)\b|%\s*;|#[0-9a-f]{3,6}\b|position\s*:|margin\s*:|padding\s*:|overflow\s*:|width\s*:|height\s*:|top\s*:|left\s*:|z-index\s*:|display\s*:|font-|border\s*:)/i;
s = s.replace(/[^{}]{0,300}(\{[^{}]{0,1000}\})/g, (m, body: string) =>
Copilot AI Apr 18, 2026

CSS_PROPERTY_SIGNATURE (and the other regexes) are created on every stripRiksdagRawDump() call. Since extractKeyPassage() is used per-document and can run many times per article, consider moving these regex constants to module scope to avoid repeated compilation and reduce per-document overhead.
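A rough sketch of the hoisting, assuming a hypothetical helper name stripCssBlocksSketch and a hypothetical constant CSS_BLOCK; the two regexes themselves are copied from the quoted lines. The actual saving may be small, since engines typically cache compiled patterns, so this is as much about readability as speed.

```typescript
// Module scope: each RegExp object is created once when the module loads.
const CSS_PROPERTY_SIGNATURE = /(?::\s*-?\d|\b(?:px|em|rem)\b|%\s*;|#[0-9a-f]{3,6}\b|position\s*:|margin\s*:|padding\s*:|overflow\s*:|width\s*:|height\s*:|top\s*:|left\s*:|z-index\s*:|display\s*:|font-|border\s*:)/i;
// Hypothetical name for the block pattern quoted in the review.
const CSS_BLOCK = /[^{}]{0,300}(\{[^{}]{0,1000}\})/g;

// Hypothetical helper: only the CSS-stripping step of the real function.
function stripCssBlocksSketch(s: string): string {
  // CSS_PROPERTY_SIGNATURE has no g/y flag, so .test() keeps no state
  // between calls. String.prototype.replace resets lastIndex on a global
  // regex before matching, so sharing CSS_BLOCK across calls is safe here.
  return s.replace(CSS_BLOCK, (m: string, body: string) =>
    CSS_PROPERTY_SIGNATURE.test(body) ? " " : m
  );
}
```

The caveat is that a shared regex with the g or y flag becomes stateful when used with .test() or .exec(), so those calls would need lastIndex reset or a non-global pattern.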

Comment on lines +71 to +118
it('strips Riksdag proposition raw-dump header + embedded CSS (html-ec … -RIM UUID marker)', () => {
  const dump =
    '5287561 HD03242 2025/26 242 prop prop prop Proposition 2025/26:242 ' +
    'Proposition Proposition Landsbygds- och infrastrukturdepartementet MJU 242 0 ' +
    '2026-04-16 00:00:00 2026-04-16 15:24:08 2026-04-16 00:00:00 ' +
    'Ett tydligt regelverk för aktivt skogsbruk html-ec prop-RIM ' +
    '76066c92-4400-457b-ac3a-a0f403e9bdfc ' +
    'body {margin-top: 0px;margin-left: 0px;} ' +
    '#page_1 {position:relative; overflow: hidden;margin:10px 0px 21px 10px;padding:0px;border:none;width:766px;} ' +
    'Propositionens huvudsakliga innehåll är att tydliggöra regelverket för aktivt skogsbruk i Sverige.';
  const result = extractKeyPassage(dump, 500);
  expect(result).not.toContain('5287561');
  expect(result).not.toContain('HD03242');
  expect(result).not.toContain('html-ec');
  expect(result).not.toContain('prop-RIM');
  expect(result).not.toContain('76066c92-4400-457b-ac3a-a0f403e9bdfc');
  expect(result).not.toContain('body {');
  expect(result).not.toContain('#page_1');
  expect(result).not.toContain('margin-top');
  expect(result).toContain('Propositionens huvudsakliga innehåll');
});

it('strips Riksdag committee-report raw-dump header without html-ec marker (bare UUID fallback)', () => {
  const dump =
    '5286898 HD01KU44 2025/26 KU44 bet bet bet Betänkande 2025/26:KU44 ' +
    'Betänkande Debatt om förslag KU 44 0 2026-04-13 00:00:00 2026-04-13 11:05:12 ' +
    '2026-04-13 11:04:18 Uppskov med behandlingen av vissa ärenden planerat Brus ' +
    'aeda74fa-73bd-4ad7-bf56-f02b41bc64e8 ' +
    'Konstitutionsutskottet föreslår att riksdagen medger att behandlingen av vissa ärenden skjuts upp.';
  const result = extractKeyPassage(dump, 500);
  expect(result).not.toContain('5286898');
  expect(result).not.toContain('HD01KU44');
  expect(result).not.toContain('aeda74fa-73bd-4ad7-bf56-f02b41bc64e8');
  expect(result).toContain('Konstitutionsutskottet');
});

it('strips Riksdag motion raw-dump header when no UUID is present (timestamp fallback)', () => {
  const dump =
    '5287820 HD024098 2025/26 4098 mot Kommittémotion mot Motion 2025/26:4098 ' +
    'av Janine Alm Ericson m.fl. (MP) Motion Motion 088 FiU 4098 0 ' +
    '2026-04-17 00:00:00 2026-04-17 16:23:34 2026-04-17 00:00:00 ' +
    'med anledning av prop. 2025/26:236 Extra ändringsbudget för 2026.';
  const result = extractKeyPassage(dump, 500);
  expect(result).not.toContain('5287820');
  expect(result).not.toContain('HD024098');
  expect(result).not.toMatch(/2025\/26\s+4098\s+mot\s+mot/);
  expect(result).toContain('med anledning av prop. 2025/26:236');
});
Copilot AI Apr 18, 2026

Tests cover the html-ec … -RIM <uuid> and bare-UUID/timestamp fallbacks, but they don't cover the calendar/CMS dump variant seen in the weekly-review incident (e.g. 0 HDC… 2025/26 … html cms2023 cms<uuid>sv …). Adding a regression case for a short numeric id + HDC... prefix and a UUID with an immediate language suffix would help ensure the new stripper actually prevents that class of leak.

Comment on lines +372 to +391
// only against the block body — not the selector / surrounding prose — which
// avoids false positives on Swedish text like `"prisökning: 10 procent"`
// that precedes an unrelated brace pair.
//
// `CSS_PROPERTY_SIGNATURE` recognises common CSS property syntax patterns:
// - `: <digit>` (numeric value assignments, e.g. `top: 0`, `z-index: -1`)
// - CSS length units (`px`, `em`, `rem`) as whole words
// - `%;` (percent value terminator)
// - CSS hex colours (`#abc` or `#aabbcc`)
// - Known CSS property names followed by `:`
const CSS_PROPERTY_SIGNATURE = /(?::\s*-?\d|\b(?:px|em|rem)\b|%\s*;|#[0-9a-f]{3,6}\b|position\s*:|margin\s*:|padding\s*:|overflow\s*:|width\s*:|height\s*:|top\s*:|left\s*:|z-index\s*:|display\s*:|font-|border\s*:)/i;
s = s.replace(/[^{}]{0,300}(\{[^{}]{0,1000}\})/g, (m, body: string) =>
  CSS_PROPERTY_SIGNATURE.test(body) ? ' ' : m
);

// 2. Detect Riksdag metadata prefix. Always begins with: numeric doc-id,
// HD-<dok_id>, and a riksmöte (YYYY/YY).
const metaPrefix = /^\s*\d{6,}\s+HD\S+\s+\d{4}\/\d{2}\s+/;
if (metaPrefix.test(s)) {
// Preferred boundary: "html-ec <doktype>-RIM <UUID>" marker — strip up to and including it.
Copilot AI Apr 18, 2026

The CSS stripping step runs unconditionally and the regex consumes up to 300 chars before the {...} block. When a CSS-like block is present, returning ' ' drops not only the selector+body but also any preceding prose within that 300-char window (e.g., it would delete the end of a legitimate paragraph that happens to contain body {…}). Consider (a) only attempting CSS stripping once you've positively identified a Riksdag dump header, and (b) adjusting the pattern so it removes only the selector + {…} (not arbitrary preceding text).

Suggested change
// 2. Detect Riksdag metadata prefix. Always begins with: numeric doc-id,
// HD-<dok_id>, and a riksmöte (YYYY/YY).
const metaPrefix = /^\s*\d{6,}\s+HD\S+\s+\d{4}\/\d{2}\s+/;
if (metaPrefix.test(s)) {
// 1. Remove inline CSS-like rule blocks that sometimes appear in the
// metadata/header area of Riksdag HTML dumps.
//
// Restrict matching to line-oriented selector + body pairs so we do not
// consume arbitrary prose preceding a brace block. The CSS test runs
// only against the block body — not the selector / surrounding prose —
// which avoids false positives on Swedish text like
// `"prisökning: 10 procent"` that precedes an unrelated brace pair.
//
// `CSS_PROPERTY_SIGNATURE` recognises common CSS property syntax patterns:
// - `: <digit>` (numeric value assignments, e.g. `top: 0`, `z-index: -1`)
// - CSS length units (`px`, `em`, `rem`) as whole words
// - `%;` (percent value terminator)
// - CSS hex colours (`#abc` or `#aabbcc`)
// - Known CSS property names followed by `:`
const CSS_PROPERTY_SIGNATURE = /(?::\s*-?\d|\b(?:px|em|rem)\b|%\s*;|#[0-9a-f]{3,6}\b|position\s*:|margin\s*:|padding\s*:|overflow\s*:|width\s*:|height\s*:|top\s*:|left\s*:|z-index\s*:|display\s*:|font-|border\s*:)/i;
s = s.replace(/(^|[\r\n])(\s*[^{}\r\n]{1,200}?)(\{[^{}]{0,1000}\})/g, (m, prefix: string, selector: string, body: string) =>
  CSS_PROPERTY_SIGNATURE.test(body) ? prefix : m
);
// Preferred boundary: "html-ec <doktype>-RIM <UUID>" marker — strip up to and including it.
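The difference between the two patterns can be seen in a small, self-contained comparison. The helper names stripWide and stripLineAnchored and the sample strings are made up for illustration; the regexes are copied from the quoted code and the suggestion.

```typescript
const CSS_PROPERTY_SIGNATURE = /(?::\s*-?\d|\b(?:px|em|rem)\b|%\s*;|#[0-9a-f]{3,6}\b|position\s*:|margin\s*:|padding\s*:|overflow\s*:|width\s*:|height\s*:|top\s*:|left\s*:|z-index\s*:|display\s*:|font-|border\s*:)/i;

// Pattern as merged: up to 300 preceding characters are part of the match.
function stripWide(s: string): string {
  return s.replace(/[^{}]{0,300}(\{[^{}]{0,1000}\})/g, (m: string, body: string) =>
    CSS_PROPERTY_SIGNATURE.test(body) ? " " : m
  );
}

// Pattern from the suggestion: the selector must start at a line start.
function stripLineAnchored(s: string): string {
  return s.replace(
    /(^|[\r\n])(\s*[^{}\r\n]{1,200}?)(\{[^{}]{0,1000}\})/g,
    (m: string, prefix: string, _selector: string, body: string) =>
      CSS_PROPERTY_SIGNATURE.test(body) ? prefix : m
  );
}

// Made-up samples: the same content with and without line breaks.
const multiLine =
  "Regeringen föreslår nya regler.\nbody {margin-top: 0px;}\nLagen träder i kraft 2027.";
const oneLine =
  "Regeringen föreslår nya regler. body {margin-top: 0px;} Lagen träder i kraft 2027.";

console.log(JSON.stringify(stripWide(multiLine)));         // first sentence is lost
console.log(JSON.stringify(stripLineAnchored(multiLine))); // both sentences survive
console.log(JSON.stringify(stripLineAnchored(oneLine)));   // first sentence is lost again
```

The line-anchored variant only protects preceding prose when the input still contains line breaks. On text flattened to a single line, like the dumps in the test fixtures, the selector group can still absorb up to 200 characters of prose, and a block more than 200 characters from the line start is not matched at all, so this trade-off seems worth testing before adopting the suggestion.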

Comment on lines +387 to +389
// 2. Detect Riksdag metadata prefix. Always begins with: numeric doc-id,
// HD-<dok_id>, and a riksmöte (YYYY/YY).
const metaPrefix = /^\s*\d{6,}\s+HD\S+\s+\d{4}\/\d{2}\s+/;
Copilot AI Apr 18, 2026

The metadata-prefix detector requires a 6+ digit leading id (^\d{6,}), but the corrupted weekly-review examples include prefixes like 0 HDC... 2025/26 ... (calendar/CMS items). With the current guard, those headers will not be stripped and can still leak into extractKeyPassage output. Consider loosening the leading numeric-id constraint (while keeping the HD... + YYYY/YY guard) so these shorter-id dumps are also cleaned.

Suggested change
// 2. Detect Riksdag metadata prefix. Always begins with: a numeric id,
// HD-<dok_id>, and a riksmöte (YYYY/YY). Some corrupted calendar/CMS
// dumps use very short ids (for example `0 HDC... 2025/26 ...`), so the
// numeric prefix must not require 6+ digits here.
const metaPrefix = /^\s*\d+\s+HD\S+\s+\d{4}\/\d{2}\s+/;
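The effect of loosening the guard can be checked with a small comparison. The two regexes are the merged and suggested versions quoted above; the constant and function names are hypothetical, and the calendar sample uses a made-up HDC identifier, because the real prefix is elided in the source.

```typescript
const STRICT_META_PREFIX = /^\s*\d{6,}\s+HD\S+\s+\d{4}\/\d{2}\s+/; // as merged
const LOOSE_META_PREFIX = /^\s*\d+\s+HD\S+\s+\d{4}\/\d{2}\s+/;     // as suggested

interface PrefixCheck {
  strict: boolean;
  loose: boolean;
}

function checkPrefix(sample: string): PrefixCheck {
  return {
    strict: STRICT_META_PREFIX.test(sample),
    loose: LOOSE_META_PREFIX.test(sample),
  };
}

const samples: Record<string, string> = {
  // Taken from the PR's proposition example.
  proposition: "5287561 HD03242 2025/26 242 prop prop prop Proposition 2025/26:242",
  // Made-up identifier standing in for the elided "0 HDC… 2025/26 …" variant.
  calendar: "0 HDC0EXAMPLE 2025/26 html cms2023",
  // Ordinary prose that happens to start with a number.
  prose: "2026 blir ett viktigt år.",
};

for (const [name, text] of Object.entries(samples)) {
  console.log(name, checkPrefix(text));
}
```

Because the HD token and the riksmöte are still required, prose that merely begins with a number does not trigger either guard in this sketch.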

@pethers pethers deleted the copilot/fix-html-css-display-issues branch April 21, 2026 12:13

Labels

agentic-workflow (Agentic workflow changes), documentation (Documentation updates), html-css (HTML/CSS changes), news (News articles and content generation), refactor (Code refactoring), size-l (Large change, 250-1000 lines), testing (Test coverage), translation (Translation updates)


3 participants