Strip Riksdag raw-dump prefix and embedded CSS in extractKeyPassage #1831
Agent-Logs-Url: https://github.com/Hack23/riksdagsmonitor/sessions/1cdc4a76-7470-45d6-bee6-da84b6554a3f Co-authored-by: pethers <1726836+pethers@users.noreply.github.com>
Agent-Logs-Url: https://github.com/Hack23/riksdagsmonitor/sessions/a02fe4dc-2748-456d-ae34-401a841bb35e Co-authored-by: pethers <1726836+pethers@users.noreply.github.com>
…ositives Agent-Logs-Url: https://github.com/Hack23/riksdagsmonitor/sessions/a02fe4dc-2748-456d-ae34-401a841bb35e Co-authored-by: pethers <1726836+pethers@users.noreply.github.com>
🏷️ Automatic Labeling Summary
This PR has been automatically labeled based on the files changed and PR metadata.
Applied Labels: documentation, html-css, translation, testing, refactor, size-l, news, agentic-workflow
Pull request overview
This PR hardens the document-excerpt pipeline by stripping the Riksdag MCP get_dokument_innehall.text raw-dump prefix (metadata tokens + embedded CSS) so generators using extractKeyPassage() no longer leak dump/CSS into rendered article prose, and it retroactively cleans the affected 2026-04-18 weekly review pages.
Changes:
- Add `stripRiksdagRawDump()` and call it from `extractKeyPassage()` to remove dump headers and CSS blocks at the source.
- Clean the already-published weekly review EN/SV HTML pages to remove leaked dump fragments.
- Add Vitest coverage for several dump layouts and false-positive guards; add an agent prompt rule to avoid pasting raw MCP `text` into HTML.
Reviewed changes
Copilot reviewed 4 out of 5 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| scripts/data-transformers/helpers.ts | Introduces stripRiksdagRawDump() and integrates it into extractKeyPassage() so downstream generators inherit sanitization. |
| tests/extract-key-passage.test.ts | Adds regression tests for dump/CSS stripping and false-positive protection. |
| news/2026-04-18-weekly-review-sv.html | Removes leaked dump/CSS text from per-document entries in the published Swedish weekly review. |
| news/2026-04-18-weekly-review-en.html | Removes leaked dump/CSS text from per-document entries in the published English weekly review. |
| .github/aw/SHARED_PROMPT_PATTERNS.md | Adds an explicit rule warning agents not to paste raw MCP text into article HTML and to use the sanitized excerpt path. |
| // Fallback 1: strip up to and including the first bare UUID if it is | ||
| // still within the metadata header window (first ~1.5k chars). | ||
| const uuidIdx = s.search(/[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}/i); | ||
| if (uuidIdx > -1 && uuidIdx < 1500) { | ||
| s = s | ||
| .slice(uuidIdx) | ||
| .replace(/^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}\s*/i, ''); | ||
| } else { |
In the UUID fallback path, the code strips the UUID only when it is followed by whitespace. Some MCP dump tokens appear to append a language code directly after the UUID (e.g. ...<uuid>sv), which would leave a leading sv (or similar) at the start of the remaining text. Consider stripping an optional trailing language suffix immediately after the UUID as part of the removal step.
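One way to read this suggestion is sketched below. It is a minimal sketch, not the PR's code: the helper name `stripLeadingUuid` and the constant `LEADING_UUID_WITH_LANG` are hypothetical, and it assumes the language suffix is exactly two letters followed by whitespace or end of input.

```typescript
// Hypothetical helper, not part of the PR. Removes a leading UUID together
// with an optional two-letter language code fused onto it (e.g. "<uuid>sv").
// The lookahead keeps the first two letters of real prose from being eaten
// when a word follows the UUID without a separating space.
const LEADING_UUID_WITH_LANG =
  /^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}(?:[a-z]{2}(?=\s|$))?\s*/i;

function stripLeadingUuid(s: string): string {
  return s.replace(LEADING_UUID_WITH_LANG, '');
}
```

The suffix is only consumed when it stands alone, so a UUID followed directly by a longer word leaves that word intact.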
| const CSS_PROPERTY_SIGNATURE = /(?::\s*-?\d|\b(?:px|em|rem)\b|%\s*;|#[0-9a-f]{3,6}\b|position\s*:|margin\s*:|padding\s*:|overflow\s*:|width\s*:|height\s*:|top\s*:|left\s*:|z-index\s*:|display\s*:|font-|border\s*:)/i; | ||
| s = s.replace(/[^{}]{0,300}(\{[^{}]{0,1000}\})/g, (m, body: string) => |
CSS_PROPERTY_SIGNATURE (and the other regexes) are created on every stripRiksdagRawDump() call. Since extractKeyPassage() is used per-document and can run many times per article, consider moving these regex constants to module scope to avoid repeated compilation and reduce per-document overhead.
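A rough illustration of hoisting the patterns to module scope follows. The two regexes are copied from the snippets quoted in this review; the helper names `looksLikeRiksdagDump` and `firstUuidIndex` are hypothetical and exist only to show the constants being reused.

```typescript
// Created once when the module is loaded instead of on every call.
// Neither pattern uses the `g` or `y` flag, so `.test()` and `.search()` keep
// no `lastIndex` state between calls and sharing the objects is safe.
const META_PREFIX = /^\s*\d{6,}\s+HD\S+\s+\d{4}\/\d{2}\s+/;
const UUID = /[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}/i;

// Hypothetical helpers; the PR's own entry point is stripRiksdagRawDump().
function looksLikeRiksdagDump(text: string): boolean {
  return META_PREFIX.test(text);
}

function firstUuidIndex(text: string): number {
  return text.search(UUID);
}
```

If any hoisted pattern later gains the `g` flag, calling `.test()` on it would become stateful, so such patterns are better kept for `replace()` only.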
| it('strips Riksdag proposition raw-dump header + embedded CSS (html-ec … -RIM UUID marker)', () => { | ||
| const dump = | ||
| '5287561 HD03242 2025/26 242 prop prop prop Proposition 2025/26:242 ' + | ||
| 'Proposition Proposition Landsbygds- och infrastrukturdepartementet MJU 242 0 ' + | ||
| '2026-04-16 00:00:00 2026-04-16 15:24:08 2026-04-16 00:00:00 ' + | ||
| 'Ett tydligt regelverk för aktivt skogsbruk html-ec prop-RIM ' + | ||
| '76066c92-4400-457b-ac3a-a0f403e9bdfc ' + | ||
| 'body {margin-top: 0px;margin-left: 0px;} ' + | ||
| '#page_1 {position:relative; overflow: hidden;margin:10px 0px 21px 10px;padding:0px;border:none;width:766px;} ' + | ||
| 'Propositionens huvudsakliga innehåll är att tydliggöra regelverket för aktivt skogsbruk i Sverige.'; | ||
| const result = extractKeyPassage(dump, 500); | ||
| expect(result).not.toContain('5287561'); | ||
| expect(result).not.toContain('HD03242'); | ||
| expect(result).not.toContain('html-ec'); | ||
| expect(result).not.toContain('prop-RIM'); | ||
| expect(result).not.toContain('76066c92-4400-457b-ac3a-a0f403e9bdfc'); | ||
| expect(result).not.toContain('body {'); | ||
| expect(result).not.toContain('#page_1'); | ||
| expect(result).not.toContain('margin-top'); | ||
| expect(result).toContain('Propositionens huvudsakliga innehåll'); | ||
| }); | ||
|
|
||
| it('strips Riksdag committee-report raw-dump header without html-ec marker (bare UUID fallback)', () => { | ||
| const dump = | ||
| '5286898 HD01KU44 2025/26 KU44 bet bet bet Betänkande 2025/26:KU44 ' + | ||
| 'Betänkande Debatt om förslag KU 44 0 2026-04-13 00:00:00 2026-04-13 11:05:12 ' + | ||
| '2026-04-13 11:04:18 Uppskov med behandlingen av vissa ärenden planerat Brus ' + | ||
| 'aeda74fa-73bd-4ad7-bf56-f02b41bc64e8 ' + | ||
| 'Konstitutionsutskottet föreslår att riksdagen medger att behandlingen av vissa ärenden skjuts upp.'; | ||
| const result = extractKeyPassage(dump, 500); | ||
| expect(result).not.toContain('5286898'); | ||
| expect(result).not.toContain('HD01KU44'); | ||
| expect(result).not.toContain('aeda74fa-73bd-4ad7-bf56-f02b41bc64e8'); | ||
| expect(result).toContain('Konstitutionsutskottet'); | ||
| }); | ||
|
|
||
| it('strips Riksdag motion raw-dump header when no UUID is present (timestamp fallback)', () => { | ||
| const dump = | ||
| '5287820 HD024098 2025/26 4098 mot Kommittémotion mot Motion 2025/26:4098 ' + | ||
| 'av Janine Alm Ericson m.fl. (MP) Motion Motion 088 FiU 4098 0 ' + | ||
| '2026-04-17 00:00:00 2026-04-17 16:23:34 2026-04-17 00:00:00 ' + | ||
| 'med anledning av prop. 2025/26:236 Extra ändringsbudget för 2026.'; | ||
| const result = extractKeyPassage(dump, 500); | ||
| expect(result).not.toContain('5287820'); | ||
| expect(result).not.toContain('HD024098'); | ||
| expect(result).not.toMatch(/2025\/26\s+4098\s+mot\s+mot/); | ||
| expect(result).toContain('med anledning av prop. 2025/26:236'); | ||
| }); |
Tests cover the html-ec … -RIM <uuid> and bare-UUID/timestamp fallbacks, but they don't cover the calendar/CMS dump variant seen in the weekly-review incident (e.g. 0 HDC… 2025/26 … html cms2023 cms<uuid>sv …). Adding a regression case for a short numeric id + HDC... prefix and a UUID with an immediate language suffix would help ensure the new stripper actually prevents that class of leak.
| // only against the block body — not the selector / surrounding prose — which | ||
| // avoids false positives on Swedish text like `"prisökning: 10 procent"` | ||
| // that precedes an unrelated brace pair. | ||
| // | ||
| // `CSS_PROPERTY_SIGNATURE` recognises common CSS property syntax patterns: | ||
| // - `: <digit>` (numeric value assignments, e.g. `top: 0`, `z-index: -1`) | ||
| // - CSS length units (`px`, `em`, `rem`) as whole words | ||
| // - `%;` (percent value terminator) | ||
| // - CSS hex colours (`#abc` or `#aabbcc`) | ||
| // - Known CSS property names followed by `:` | ||
| const CSS_PROPERTY_SIGNATURE = /(?::\s*-?\d|\b(?:px|em|rem)\b|%\s*;|#[0-9a-f]{3,6}\b|position\s*:|margin\s*:|padding\s*:|overflow\s*:|width\s*:|height\s*:|top\s*:|left\s*:|z-index\s*:|display\s*:|font-|border\s*:)/i; | ||
| s = s.replace(/[^{}]{0,300}(\{[^{}]{0,1000}\})/g, (m, body: string) => | ||
| CSS_PROPERTY_SIGNATURE.test(body) ? ' ' : m | ||
| ); | ||
|
|
||
| // 2. Detect Riksdag metadata prefix. Always begins with: numeric doc-id, | ||
| // HD-<dok_id>, and a riksmöte (YYYY/YY). | ||
| const metaPrefix = /^\s*\d{6,}\s+HD\S+\s+\d{4}\/\d{2}\s+/; | ||
| if (metaPrefix.test(s)) { | ||
| // Preferred boundary: "html-ec <doktype>-RIM <UUID>" marker — strip up to and including it. |
The CSS stripping step runs unconditionally and the regex consumes up to 300 chars before the {...} block. When a CSS-like block is present, returning ' ' drops not only the selector+body but also any preceding prose within that 300-char window (e.g., it would delete the end of a legitimate paragraph that happens to contain body {…}). Consider (a) only attempting CSS stripping once you've positively identified a Riksdag dump header, and (b) adjusting the pattern so it removes only the selector + {…} (not arbitrary preceding text).
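Point (b) could be approached roughly as follows. This is a sketch under assumptions, not the PR's implementation: `CSS_BODY_HINT` is a simplified stand-in for `CSS_PROPERTY_SIGNATURE`, `stripCssRules` is a hypothetical name, and only a single simple selector token before the brace is recognised.

```typescript
// Simplified stand-in for the PR's CSS_PROPERTY_SIGNATURE (assumption: a few
// property names are enough to illustrate the body check).
const CSS_BODY_HINT =
  /\b(?:margin|padding|position|overflow|width|height|display|border)(?:-[a-z]+)?\s*:/i;

// One simple selector token (e.g. `body`, `#page_1`, `.ft0`) directly before a
// brace block. The leading boundary is captured so it can be put back.
const CSS_RULE = /(^|\s)[#.]?[A-Za-z_][\w-]*\s*\{([^{}]{0,1000})\}/g;

// Hypothetical helper: removes only "selector { body }" when the body looks
// like CSS, and leaves any prose in front of the selector untouched.
function stripCssRules(text: string): string {
  return text
    .replace(CSS_RULE, (match: string, lead: string, body: string) =>
      CSS_BODY_HINT.test(body) ? lead : match)
    .replace(/\s+/g, ' ')
    .trim();
}
```

A known limitation of this sketch is that compound selectors such as `div.page > p` would lose only their last token, so a real implementation would need a wider selector pattern or the header-gated approach from point (a).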
| // 2. Detect Riksdag metadata prefix. Always begins with: numeric doc-id, | |
| // HD-<dok_id>, and a riksmöte (YYYY/YY). | |
| const metaPrefix = /^\s*\d{6,}\s+HD\S+\s+\d{4}\/\d{2}\s+/; | |
| if (metaPrefix.test(s)) { | |
| // 1. Remove inline CSS-like rule blocks that sometimes appear in the | |
| // metadata/header area of Riksdag HTML dumps. | |
| // | |
| // Restrict matching to line-oriented selector + body pairs so we do not | |
| // consume arbitrary prose preceding a brace block. The CSS test runs | |
| // only against the block body — not the selector / surrounding prose — | |
| // which avoids false positives on Swedish text like | |
| // `"prisökning: 10 procent"` that precedes an unrelated brace pair. | |
| // | |
| // `CSS_PROPERTY_SIGNATURE` recognises common CSS property syntax patterns: | |
| // - `: <digit>` (numeric value assignments, e.g. `top: 0`, `z-index: -1`) | |
| // - CSS length units (`px`, `em`, `rem`) as whole words | |
| // - `%;` (percent value terminator) | |
| // - CSS hex colours (`#abc` or `#aabbcc`) | |
| // - Known CSS property names followed by `:` | |
| const CSS_PROPERTY_SIGNATURE = /(?::\s*-?\d|\b(?:px|em|rem)\b|%\s*;|#[0-9a-f]{3,6}\b|position\s*:|margin\s*:|padding\s*:|overflow\s*:|width\s*:|height\s*:|top\s*:|left\s*:|z-index\s*:|display\s*:|font-|border\s*:)/i; | |
| s = s.replace(/(^|[\r\n])(\s*[^{}\r\n]{1,200}?)(\{[^{}]{0,1000}\})/g, (m, prefix: string, selector: string, body: string) => | |
| CSS_PROPERTY_SIGNATURE.test(body) ? prefix : m | |
| ); | |
| // Preferred boundary: "html-ec <doktype>-RIM <UUID>" marker — strip up to and including it. | |
| // 2. Detect Riksdag metadata prefix. Always begins with: numeric doc-id, | ||
| // HD-<dok_id>, and a riksmöte (YYYY/YY). | ||
| const metaPrefix = /^\s*\d{6,}\s+HD\S+\s+\d{4}\/\d{2}\s+/; |
The metadata-prefix detector requires a 6+ digit leading id (^\d{6,}), but the corrupted weekly-review examples include prefixes like 0 HDC... 2025/26 ... (calendar/CMS items). With the current guard, those headers will not be stripped and can still leak into extractKeyPassage output. Consider loosening the leading numeric-id constraint (while keeping the HD... + YYYY/YY guard) so these shorter-id dumps are also cleaned.
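The difference between the two guards can be checked in isolation. In the sketch below the long-id header is taken from the PR's proposition test case, while the short-id header is invented for illustration (the characters after `HDC` are an assumption, since the real value is elided above).

```typescript
// Guard as currently written: requires a 6+ digit leading id.
const STRICT_META_PREFIX = /^\s*\d{6,}\s+HD\S+\s+\d{4}\/\d{2}\s+/;
// Loosened guard: any run of digits, still anchored by HD... and YYYY/YY.
const LOOSE_META_PREFIX = /^\s*\d+\s+HD\S+\s+\d{4}\/\d{2}\s+/;

// Header from the proposition regression test in this PR.
const longIdHeader = '5287561 HD03242 2025/26 242 prop prop prop';
// Hypothetical calendar/CMS-style header; "HDC0000" is made up.
const shortIdHeader = '0 HDC0000 2025/26 html cms2023';
// Ordinary prose that starts with a number must still be left alone.
const proseSample = '2026 blir ett valår enligt HD och 2025/26 nämns.';

const results = {
  strictLong: STRICT_META_PREFIX.test(longIdHeader),
  strictShort: STRICT_META_PREFIX.test(shortIdHeader),
  looseLong: LOOSE_META_PREFIX.test(longIdHeader),
  looseShort: LOOSE_META_PREFIX.test(shortIdHeader),
  looseProse: LOOSE_META_PREFIX.test(proseSample),
};
```

Because the `HD\S+` token and the riksmöte pattern must still follow the id directly, loosening the digit count alone does not make the prose sample match.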
| // 2. Detect Riksdag metadata prefix. Always begins with: a numeric id, | |
| // HD-<dok_id>, and a riksmöte (YYYY/YY). Some corrupted calendar/CMS | |
| // dumps use very short ids (for example `0 HDC... 2025/26 ...`), so the | |
| // numeric prefix must not require 6+ digits here. | |
| const metaPrefix = /^\s*\d+\s+HD\S+\s+\d{4}\/\d{2}\s+/; |
`get_dokument_innehall` returns a `text` field whose first ~500 bytes are a metadata dump followed by embedded CSS rule blocks. `extractKeyPassage` only stripped HTML tags and URLs, so every per-document summary in the generic/weekly/monthly review generators emitted raw metadata and CSS as visible prose (see the two cleaned articles in `news/`).

Example of the leaked content, now stripped at the source:
Changes
- `scripts/data-transformers/helpers.ts`: new `stripRiksdagRawDump()` called from `extractKeyPassage`, so all generators using it (generic, weekly-review, monthly-review, propositions, committee, motions, policy-analysis) inherit the fix:
  - CSS rule blocks are matched with the bounded pattern `[^{}]{0,300}\{[^{}]{0,1000}\}` (no backtracking), and the named `CSS_PROPERTY_SIGNATURE` constant is tested only against the brace body via a capture group, so prose like `"prisökning: 10 procent { … }"` with an unrelated brace pair is preserved.
  - The metadata prefix is detected with `^\d{6,}\s+HD\S+\s+\d{4}/\d{2}`, and the stripper removes up to whichever boundary appears first: `html-ec …-RIM <uuid>` → bare UUID → last `YYYY-MM-DD HH:MM:SS` timestamp.
- `news/2026-04-18-weekly-review-{en,sv}.html`: removed the 64 corrupted `<span lang="sv">DUMP</span>` prefixes per file; legitimate `<span lang="sv">` title wrappers are untouched.
- `tests/extract-key-passage.test.ts`: 7 new cases covering the html-ec+UUID path, the bare-UUID fallback, the timestamp-only fallback, and false-positive guards for Swedish prose containing braces / `:digit` / `px`.
- `.github/aw/SHARED_PROMPT_PATTERNS.md`: added a 🚨 rule next to the `get_dokument_innehall` field mapping: agents must never paste the raw `text` field (or a prefix of it) into article HTML.

Residual risk
The stripper is conservative — if Riksdag introduces a new dump layout without the current ID/HD/riksmöte triple, content falls through unchanged rather than being wrongly truncated. Add a new tier if that happens.
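The tiered boundary search and the fall-through behaviour described in this PR can be sketched as below. This is an illustrative approximation assembled from the description and the quoted snippets, not the actual `stripRiksdagRawDump()` body; the function name `stripHeader` and the exact marker pattern are assumptions, and CSS removal is left out.

```typescript
const UUID_SRC = '[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}';
const UUID_RE = new RegExp(UUID_SRC, 'i');
// Assumed shape of the "html-ec <doktype>-RIM <UUID>" marker.
const RIM_MARKER_RE = new RegExp(`html-ec\\s+\\S+-RIM\\s+${UUID_SRC}\\s*`, 'i');
const TIMESTAMP_SRC = '\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2}\\s*';
// Guard as given in the PR description (a review comment suggests \d+ instead).
const META_PREFIX_RE = /^\s*\d{6,}\s+HD\S+\s+\d{4}\/\d{2}\s+/;

// Hypothetical helper: strips the metadata header using three tiers.
function stripHeader(text: string): string {
  // Unknown layout: fall through unchanged rather than risk truncation.
  if (!META_PREFIX_RE.test(text)) return text;

  // Tier 1: preferred boundary, the html-ec ... -RIM <uuid> marker.
  const rim = RIM_MARKER_RE.exec(text);
  if (rim) return text.slice(rim.index + rim[0].length);

  // Tier 2: first bare UUID inside the header window (~1.5k chars).
  const uuid = UUID_RE.exec(text);
  if (uuid && uuid.index < 1500) {
    return text.slice(uuid.index + uuid[0].length).trimStart();
  }

  // Tier 3: everything up to and including the last timestamp.
  const timestamps = new RegExp(TIMESTAMP_SRC, 'g');
  let end = -1;
  let m: RegExpExecArray | null;
  while ((m = timestamps.exec(text)) !== null) end = m.index + m[0].length;
  return end > -1 ? text.slice(end) : text;
}
```

Adding a new tier for a future layout would mean inserting another boundary check before the final fall-through, leaving the existing tiers untouched.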