Strip Riksdag raw-dump prefix and embedded CSS in extractKeyPassage by Copilot · Pull Request #1831 · Hack23/riksdagsmonitor

Copilot · 2026-04-18T18:37:17Z

get_dokument_innehall returns a text field whose first ~500 bytes are a metadata dump followed by embedded CSS rule blocks. extractKeyPassage only stripped HTML tags and URLs, so every per-document summary in the generic/weekly/monthly review generators emitted raw metadata and CSS as visible prose (see the two cleaned articles in news/).

Example of the leaked content, now stripped at the source:

5287561 HD03242 2025/26 242 prop prop prop Proposition 2025/26:242
Proposition Proposition Landsbygds- och infrastrukturdepartementet MJU 242 0
2026-04-16 00:00:00 … Ett tydligt regelverk för aktivt skogsbruk
html-ec prop-RIM 76066c92-4400-457b-ac3a-a0f403e9bdfc
body {margin-top: 0px;margin-left: 0px;}
#page_1 {position:relative; overflow: hidden; …}

Changes

scripts/data-transformers/helpers.ts — new stripRiksdagRawDump() called from extractKeyPassage, so all generators using it (generic, weekly-review, monthly-review, propositions, committee, motions, policy-analysis) inherit the fix:
- CSS block stripper: bounded [^{}]{0,300}\{[^{}]{0,1000}\} (bounded quantifiers, so no catastrophic backtracking) + named CSS_PROPERTY_SIGNATURE constant tested only against the brace body via capture group, so prose like "prisökning: 10 procent { … }" with an unrelated brace pair is preserved.
- Metadata prefix stripper: triggers on ^\d{6,}\s+HD\S+\s+\d{4}/\d{2} and removes everything up to the first boundary that applies, tried in this order: html-ec …-RIM <uuid>, then a bare UUID, then the last YYYY-MM-DD HH:MM:SS timestamp.
news/2026-04-18-weekly-review-{en,sv}.html — removed the 64 corrupted <span lang="sv">DUMP</span> prefixes per file; legitimate <span lang="sv"> title wrappers are untouched.
tests/extract-key-passage.test.ts — 7 new cases: html-ec+UUID path, bare-UUID fallback, timestamp-only fallback, and false-positive guards for Swedish prose containing braces / :digit / px.
.github/aw/SHARED_PROMPT_PATTERNS.md — added a 🚨 rule next to the get_dokument_innehall field mapping: agents must never paste the raw text field (or a prefix of it) into article HTML.
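
The tiered boundary search described for the metadata prefix stripper can be sketched roughly as follows. This is a hypothetical reconstruction from the description and the quoted diff fragments, not the merged code: the function name stripMetadataPrefix, the constant names, and the use of one 1500-character window for both fallback tiers are assumptions.

```typescript
// Hypothetical reconstruction; names and the shared window size are assumptions.
const UUID = '[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}';
const META_PREFIX = /^\s*\d{6,}\s+HD\S+\s+\d{4}\/\d{2}\s+/;
const RIM_MARKER = new RegExp(`html-ec\\s+\\S+-RIM\\s+${UUID}\\s*`, 'i');
const BARE_UUID = new RegExp(UUID, 'i');
const HEADER_WINDOW = 1500; // the PR states this only for the UUID fallback

function stripMetadataPrefix(s: string): string {
  // Conservative guard: text without the ID/HD/riksmöte triple is returned unchanged.
  if (!META_PREFIX.test(s)) return s;

  // Tier 1: "html-ec <doktype>-RIM <uuid>" marker, stripped up to and including it.
  const rim = RIM_MARKER.exec(s);
  if (rim) return s.slice(rim.index + rim[0].length).trimStart();

  // Tier 2: first bare UUID, only if it still sits inside the header window.
  const uuid = BARE_UUID.exec(s);
  if (uuid && uuid.index < HEADER_WINDOW) {
    return s.slice(uuid.index + uuid[0].length).trimStart();
  }

  // Tier 3: cut after the last "YYYY-MM-DD HH:MM:SS" timestamp in the window.
  const head = s.slice(0, HEADER_WINDOW);
  const timestamp = /\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}/g;
  let end = -1;
  let m: RegExpExecArray | null;
  while ((m = timestamp.exec(head)) !== null) end = m.index + m[0].length;
  return end > -1 ? s.slice(end).trimStart() : s;
}
```

A new dump layout would be handled by adding another tier before the final return, which keeps the fall-through behaviour for unknown layouts intact.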

Residual risk

The stripper is conservative — if Riksdag introduces a new dump layout without the current ID/HD/riksmöte triple, content falls through unchanged rather than being wrongly truncated. Add a new tier if that happens.

Agent-Logs-Url: https://github.com/Hack23/riksdagsmonitor/sessions/1cdc4a76-7470-45d6-bee6-da84b6554a3f Co-authored-by: pethers <1726836+pethers@users.noreply.github.com>

Agent-Logs-Url: https://github.com/Hack23/riksdagsmonitor/sessions/a02fe4dc-2748-456d-ae34-401a841bb35e Co-authored-by: pethers <1726836+pethers@users.noreply.github.com>

github-actions · 2026-04-18T18:37:53Z

🏷️ Automatic Labeling Summary

This PR has been automatically labeled based on the files changed and PR metadata.

Applied Labels: documentation, html-css, translation, testing, refactor, size-l, news, agentic-workflow

Label Categories

🗳️ Content: news, dashboard, visualization, intelligence
💻 Technology: html-css, javascript, workflow, security
📊 Data: cia-data, riksdag-data, data-pipeline, schema
🌍 I18n: i18n, translation, rtl
🔒 ISMS: isms, iso-27001, nist-csf, cis-controls
🏗️ Infrastructure: ci-cd, deployment, performance, monitoring
🔄 Quality: testing, accessibility, documentation, refactor
🤖 AI: agent, skill, agentic-workflow

For more information, see .github/labeler.yml.

github-actions · 2026-04-18T18:38:20Z

🔍 Lighthouse Performance Audit

Category	Score	Status
Performance	85/100	🟡
Accessibility	95/100	🟢
Best Practices	90/100	🟢
SEO	95/100	🟢

Budget Compliance: Performance budgets enforced via budget.json

Copilot

Pull request overview

This PR hardens the document-excerpt pipeline by stripping the Riksdag MCP get_dokument_innehall.text raw-dump prefix (metadata tokens + embedded CSS) so generators using extractKeyPassage() no longer leak dump/CSS into rendered article prose, and it retroactively cleans the affected 2026-04-18 weekly review pages.

Changes:

Add stripRiksdagRawDump() and call it from extractKeyPassage() to remove dump headers and CSS blocks at the source.
Clean the already-published weekly review EN/SV HTML pages to remove leaked dump fragments.
Add Vitest coverage for several dump layouts and false-positive guards; add an agent prompt rule to avoid pasting raw MCP text into HTML.

Reviewed changes

Copilot reviewed 4 out of 5 changed files in this pull request and generated 5 comments.

File	Description
`scripts/data-transformers/helpers.ts`	Introduces `stripRiksdagRawDump()` and integrates it into `extractKeyPassage()` so downstream generators inherit sanitization.
`tests/extract-key-passage.test.ts`	Adds regression tests for dump/CSS stripping and false-positive protection.
`news/2026-04-18-weekly-review-sv.html`	Removes leaked dump/CSS text from per-document entries in the published Swedish weekly review.
`news/2026-04-18-weekly-review-en.html`	Removes leaked dump/CSS text from per-document entries in the published English weekly review.
`.github/aw/SHARED_PROMPT_PATTERNS.md`	Adds an explicit rule warning agents not to paste raw MCP `text` into article HTML and to use the sanitized excerpt path.

Copilot · 2026-04-18T18:42:21Z

+      // Fallback 1: strip up to and including the first bare UUID if it is
+      // still within the metadata header window (first ~1.5k chars).
+      const uuidIdx = s.search(/[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}/i);
+      if (uuidIdx > -1 && uuidIdx < 1500) {
+        s = s
+          .slice(uuidIdx)
+          .replace(/^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}\s*/i, '');
+      } else {


In the UUID fallback path, the code strips the UUID only when it is followed by whitespace. Some MCP dump tokens appear to append a language code directly after the UUID (e.g. ...<uuid>sv), which would leave a leading sv (or similar) at the start of the remaining text. Consider stripping an optional trailing language suffix immediately after the UUID as part of the removal step.
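
One way to handle the suffix case is to make the removal pattern accept an optional language code glued to the UUID. The sketch below is illustrative only: the helper name dropLeadingUuid is hypothetical, and the assumption that the suffix is exactly two letters followed by whitespace or end of input is inferred from the single `sv` example above.

```typescript
// Hypothetical helper; the two-letter suffix shape is an assumption based on "<uuid>sv".
const UUID_WITH_OPTIONAL_LANG =
  /^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}(?:[a-z]{2}(?=\s|$))?\s*/i;

function dropLeadingUuid(s: string): string {
  // The lookahead keeps real prose intact when it starts directly after the UUID:
  // only a two-letter token that ends at whitespace is treated as a language code.
  return s.replace(UUID_WITH_OPTIONAL_LANG, '');
}
```

Called on text that starts with `<uuid>sv `, this removes both the UUID and the `sv` token; called on `<uuid> Konstitutionsutskottet …`, it removes only the UUID and the following whitespace.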

Copilot · 2026-04-18T18:42:21Z

+  const CSS_PROPERTY_SIGNATURE = /(?::\s*-?\d|\b(?:px|em|rem)\b|%\s*;|#[0-9a-f]{3,6}\b|position\s*:|margin\s*:|padding\s*:|overflow\s*:|width\s*:|height\s*:|top\s*:|left\s*:|z-index\s*:|display\s*:|font-|border\s*:)/i;
+  s = s.replace(/[^{}]{0,300}(\{[^{}]{0,1000}\})/g, (m, body: string) =>


CSS_PROPERTY_SIGNATURE (and the other regexes) are created on every stripRiksdagRawDump() call. Since extractKeyPassage() is used per-document and can run many times per article, consider moving these regex constants to module scope to avoid repeated compilation and reduce per-document overhead.
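
A minimal sketch of the hoisting, assuming the two patterns quoted above are otherwise kept as they are. The names CSS_BLOCK and stripCssBlocks are hypothetical and exist only to make the example self-contained.

```typescript
// Module scope: each regex object is created once when the module is loaded.
const CSS_PROPERTY_SIGNATURE =
  /(?::\s*-?\d|\b(?:px|em|rem)\b|%\s*;|#[0-9a-f]{3,6}\b|position\s*:|margin\s*:|padding\s*:|overflow\s*:|width\s*:|height\s*:|top\s*:|left\s*:|z-index\s*:|display\s*:|font-|border\s*:)/i;
const CSS_BLOCK = /[^{}]{0,300}(\{[^{}]{0,1000}\})/g;

// Hypothetical wrapper standing in for the CSS step of stripRiksdagRawDump().
function stripCssBlocks(s: string): string {
  return s.replace(CSS_BLOCK, (m: string, body: string) =>
    // Only the brace body is tested, so prose before an unrelated brace pair survives.
    CSS_PROPERTY_SIGNATURE.test(body) ? ' ' : m
  );
}
```

Sharing the objects across calls is safe here: CSS_PROPERTY_SIGNATURE has no `g` or `y` flag, so `test()` keeps no state, and `String.prototype.replace` resets `lastIndex` on a global regex before it starts matching.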

Copilot · 2026-04-18T18:42:21Z

+  it('strips Riksdag proposition raw-dump header + embedded CSS (html-ec … -RIM UUID marker)', () => {
+    const dump =
+      '5287561 HD03242 2025/26 242 prop prop prop Proposition 2025/26:242 ' +
+      'Proposition Proposition Landsbygds- och infrastrukturdepartementet MJU 242 0 ' +
+      '2026-04-16 00:00:00 2026-04-16 15:24:08 2026-04-16 00:00:00 ' +
+      'Ett tydligt regelverk för aktivt skogsbruk html-ec prop-RIM ' +
+      '76066c92-4400-457b-ac3a-a0f403e9bdfc ' +
+      'body {margin-top: 0px;margin-left: 0px;} ' +
+      '#page_1 {position:relative; overflow: hidden;margin:10px 0px 21px 10px;padding:0px;border:none;width:766px;} ' +
+      'Propositionens huvudsakliga innehåll är att tydliggöra regelverket för aktivt skogsbruk i Sverige.';
+    const result = extractKeyPassage(dump, 500);
+    expect(result).not.toContain('5287561');
+    expect(result).not.toContain('HD03242');
+    expect(result).not.toContain('html-ec');
+    expect(result).not.toContain('prop-RIM');
+    expect(result).not.toContain('76066c92-4400-457b-ac3a-a0f403e9bdfc');
+    expect(result).not.toContain('body {');
+    expect(result).not.toContain('#page_1');
+    expect(result).not.toContain('margin-top');
+    expect(result).toContain('Propositionens huvudsakliga innehåll');
+  });
+
+  it('strips Riksdag committee-report raw-dump header without html-ec marker (bare UUID fallback)', () => {
+    const dump =
+      '5286898 HD01KU44 2025/26 KU44 bet bet bet Betänkande 2025/26:KU44 ' +
+      'Betänkande Debatt om förslag KU 44 0 2026-04-13 00:00:00 2026-04-13 11:05:12 ' +
+      '2026-04-13 11:04:18 Uppskov med behandlingen av vissa ärenden planerat Brus ' +
+      'aeda74fa-73bd-4ad7-bf56-f02b41bc64e8 ' +
+      'Konstitutionsutskottet föreslår att riksdagen medger att behandlingen av vissa ärenden skjuts upp.';
+    const result = extractKeyPassage(dump, 500);
+    expect(result).not.toContain('5286898');
+    expect(result).not.toContain('HD01KU44');
+    expect(result).not.toContain('aeda74fa-73bd-4ad7-bf56-f02b41bc64e8');
+    expect(result).toContain('Konstitutionsutskottet');
+  });
+
+  it('strips Riksdag motion raw-dump header when no UUID is present (timestamp fallback)', () => {
+    const dump =
+      '5287820 HD024098 2025/26 4098 mot Kommittémotion mot Motion 2025/26:4098 ' +
+      'av Janine Alm Ericson m.fl. (MP) Motion Motion 088 FiU 4098 0 ' +
+      '2026-04-17 00:00:00 2026-04-17 16:23:34 2026-04-17 00:00:00 ' +
+      'med anledning av prop. 2025/26:236 Extra ändringsbudget för 2026.';
+    const result = extractKeyPassage(dump, 500);
+    expect(result).not.toContain('5287820');
+    expect(result).not.toContain('HD024098');
+    expect(result).not.toMatch(/2025\/26\s+4098\s+mot\s+mot/);
+    expect(result).toContain('med anledning av prop. 2025/26:236');
+  });


Tests cover the html-ec … -RIM <uuid> and bare-UUID/timestamp fallbacks, but they don't cover the calendar/CMS dump variant seen in the weekly-review incident (e.g. 0 HDC… 2025/26 … html cms2023 cms<uuid>sv …). Adding a regression case for a short numeric id + HDC... prefix and a UUID with an immediate language suffix would help ensure the new stripper actually prevents that class of leak.

Copilot · 2026-04-18T18:42:22Z

+  //    only against the block body — not the selector / surrounding prose — which
+  //    avoids false positives on Swedish text like `"prisökning: 10 procent"`
+  //    that precedes an unrelated brace pair.
+  //
+  //    `CSS_PROPERTY_SIGNATURE` recognises common CSS property syntax patterns:
+  //      - `: <digit>` (numeric value assignments, e.g. `top: 0`, `z-index: -1`)
+  //      - CSS length units (`px`, `em`, `rem`) as whole words
+  //      - `%;` (percent value terminator)
+  //      - CSS hex colours (`#abc` or `#aabbcc`)
+  //      - Known CSS property names followed by `:`
+  const CSS_PROPERTY_SIGNATURE = /(?::\s*-?\d|\b(?:px|em|rem)\b|%\s*;|#[0-9a-f]{3,6}\b|position\s*:|margin\s*:|padding\s*:|overflow\s*:|width\s*:|height\s*:|top\s*:|left\s*:|z-index\s*:|display\s*:|font-|border\s*:)/i;
+  s = s.replace(/[^{}]{0,300}(\{[^{}]{0,1000}\})/g, (m, body: string) =>
+    CSS_PROPERTY_SIGNATURE.test(body) ? ' ' : m
+  );
+
+  // 2. Detect Riksdag metadata prefix. Always begins with: numeric doc-id,
+  //    HD-<dok_id>, and a riksmöte (YYYY/YY).
+  const metaPrefix = /^\s*\d{6,}\s+HD\S+\s+\d{4}\/\d{2}\s+/;
+  if (metaPrefix.test(s)) {
+    // Preferred boundary: "html-ec <doktype>-RIM <UUID>" marker — strip up to and including it.


The CSS stripping step runs unconditionally and the regex consumes up to 300 chars before the {...} block. When a CSS-like block is present, returning ' ' drops not only the selector+body but also any preceding prose within that 300-char window (e.g., it would delete the end of a legitimate paragraph that happens to contain body {…}). Consider (a) only attempting CSS stripping once you've positively identified a Riksdag dump header, and (b) adjusting the pattern so it removes only the selector + {…} (not arbitrary preceding text).

Suggested change

-  //    only against the block body — not the selector / surrounding prose — which
-  //    avoids false positives on Swedish text like `"prisökning: 10 procent"`
-  //    that precedes an unrelated brace pair.
-  //
-  //    `CSS_PROPERTY_SIGNATURE` recognises common CSS property syntax patterns:
-  //      - `: <digit>` (numeric value assignments, e.g. `top: 0`, `z-index: -1`)
-  //      - CSS length units (`px`, `em`, `rem`) as whole words
-  //      - `%;` (percent value terminator)
-  //      - CSS hex colours (`#abc` or `#aabbcc`)
-  //      - Known CSS property names followed by `:`
-  const CSS_PROPERTY_SIGNATURE = /(?::\s*-?\d|\b(?:px|em|rem)\b|%\s*;|#[0-9a-f]{3,6}\b|position\s*:|margin\s*:|padding\s*:|overflow\s*:|width\s*:|height\s*:|top\s*:|left\s*:|z-index\s*:|display\s*:|font-|border\s*:)/i;
-  s = s.replace(/[^{}]{0,300}(\{[^{}]{0,1000}\})/g, (m, body: string) =>
-    CSS_PROPERTY_SIGNATURE.test(body) ? ' ' : m
-  );
-  // 2. Detect Riksdag metadata prefix. Always begins with: numeric doc-id,
-  //    HD-<dok_id>, and a riksmöte (YYYY/YY).
-  const metaPrefix = /^\s*\d{6,}\s+HD\S+\s+\d{4}\/\d{2}\s+/;
-  if (metaPrefix.test(s)) {
-    // Preferred boundary: "html-ec <doktype>-RIM <UUID>" marker — strip up to and including it.
+  // 2. Detect Riksdag metadata prefix. Always begins with: numeric doc-id,
+  //    HD-<dok_id>, and a riksmöte (YYYY/YY).
+  const metaPrefix = /^\s*\d{6,}\s+HD\S+\s+\d{4}\/\d{2}\s+/;
+  if (metaPrefix.test(s)) {
+    // 1. Remove inline CSS-like rule blocks that sometimes appear in the
+    //    metadata/header area of Riksdag HTML dumps.
+    //
+    //    Restrict matching to line-oriented selector + body pairs so we do not
+    //    consume arbitrary prose preceding a brace block. The CSS test runs
+    //    only against the block body — not the selector / surrounding prose —
+    //    which avoids false positives on Swedish text like
+    //    `"prisökning: 10 procent"` that precedes an unrelated brace pair.
+    //
+    //    `CSS_PROPERTY_SIGNATURE` recognises common CSS property syntax patterns:
+    //      - `: <digit>` (numeric value assignments, e.g. `top: 0`, `z-index: -1`)
+    //      - CSS length units (`px`, `em`, `rem`) as whole words
+    //      - `%;` (percent value terminator)
+    //      - CSS hex colours (`#abc` or `#aabbcc`)
+    //      - Known CSS property names followed by `:`
+    const CSS_PROPERTY_SIGNATURE = /(?::\s*-?\d|\b(?:px|em|rem)\b|%\s*;|#[0-9a-f]{3,6}\b|position\s*:|margin\s*:|padding\s*:|overflow\s*:|width\s*:|height\s*:|top\s*:|left\s*:|z-index\s*:|display\s*:|font-|border\s*:)/i;
+    s = s.replace(/(^|[\r\n])(\s*[^{}\r\n]{1,200}?)(\{[^{}]{0,1000}\})/g, (m, prefix: string, selector: string, body: string) =>
+      CSS_PROPERTY_SIGNATURE.test(body) ? prefix : m
+    );
+    // Preferred boundary: "html-ec <doktype>-RIM <UUID>" marker — strip up to and including it.
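
To make the over-consumption concern concrete, the sketch below runs the reviewed pattern next to a narrower alternative on a single-line sample. The NARROW_CSS_BLOCK pattern is a hypothetical illustration that treats only the last whitespace-free token before the brace as the selector; it is neither the pattern from the suggestion above nor the merged code, and it does not cover multi-token selectors.

```typescript
const CSS_PROPERTY_SIGNATURE =
  /(?::\s*-?\d|\b(?:px|em|rem)\b|%\s*;|#[0-9a-f]{3,6}\b|position\s*:|margin\s*:|padding\s*:|overflow\s*:|width\s*:|height\s*:|top\s*:|left\s*:|z-index\s*:|display\s*:|font-|border\s*:)/i;

// Pattern under review: up to 300 characters before the brace count as "selector".
const WIDE_CSS_BLOCK = /[^{}]{0,300}(\{[^{}]{0,1000}\})/g;
// Hypothetical alternative: a single selector token such as `body`, `#page_1`, `.ft0`.
const NARROW_CSS_BLOCK = /[#.]?[\w-]+\s*(\{[^{}]{0,1000}\})/g;

function stripWith(pattern: RegExp, s: string): string {
  return s.replace(pattern, (m: string, body: string) =>
    CSS_PROPERTY_SIGNATURE.test(body) ? ' ' : m
  );
}

const sample = 'Regeringen föreslår nya regler. body {margin-top: 0px;} Mer text.';
const wide = stripWith(WIDE_CSS_BLOCK, sample);     // loses the sentence before `body`
const narrow = stripWith(NARROW_CSS_BLOCK, sample); // keeps it, drops only `body {…}`
console.log(JSON.stringify({ wide, narrow }));
```

Because the narrow pattern needs no line anchor, it also applies to dumps that arrive as one long line, as in the test fixtures quoted earlier in this review.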

Copilot · 2026-04-18T18:42:22Z

+  // 2. Detect Riksdag metadata prefix. Always begins with: numeric doc-id,
+  //    HD-<dok_id>, and a riksmöte (YYYY/YY).
+  const metaPrefix = /^\s*\d{6,}\s+HD\S+\s+\d{4}\/\d{2}\s+/;


The metadata-prefix detector requires a 6+ digit leading id (^\d{6,}), but the corrupted weekly-review examples include prefixes like 0 HDC... 2025/26 ... (calendar/CMS items). With the current guard, those headers will not be stripped and can still leak into extractKeyPassage output. Consider loosening the leading numeric-id constraint (while keeping the HD... + YYYY/YY guard) so these shorter-id dumps are also cleaned.

Suggested change

-  // 2. Detect Riksdag metadata prefix. Always begins with: numeric doc-id,
-  //    HD-<dok_id>, and a riksmöte (YYYY/YY).
-  const metaPrefix = /^\s*\d{6,}\s+HD\S+\s+\d{4}\/\d{2}\s+/;
+  // 2. Detect Riksdag metadata prefix. Always begins with: a numeric id,
+  //    HD-<dok_id>, and a riksmöte (YYYY/YY). Some corrupted calendar/CMS
+  //    dumps use very short ids (for example `0 HDC... 2025/26 ...`), so the
+  //    numeric prefix must not require 6+ digits here.
+  const metaPrefix = /^\s*\d+\s+HD\S+\s+\d{4}\/\d{2}\s+/;
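
The difference between the two guards can be checked in isolation. In the sketch below the calendar fixture is invented for illustration: the review elides the real dok_id after `HDC`, so `HDC0001` and the trailing text are placeholders, not observed data.

```typescript
const STRICT_META_PREFIX = /^\s*\d{6,}\s+HD\S+\s+\d{4}\/\d{2}\s+/; // guard under review
const LOOSE_META_PREFIX = /^\s*\d+\s+HD\S+\s+\d{4}\/\d{2}\s+/;     // suggested guard

const propositionDump = '5287561 HD03242 2025/26 242 prop prop prop Proposition 2025/26:242';
// Hypothetical calendar/CMS-style header; the id and text are placeholders.
const calendarDump = '0 HDC0001 2025/26 Exempel på kalenderpost';
const prose = '10 procent av budgeten 2025/26 går till vägar.';

const results = {
  strict: [propositionDump, calendarDump, prose].map((s) => STRICT_META_PREFIX.test(s)),
  loose: [propositionDump, calendarDump, prose].map((s) => LOOSE_META_PREFIX.test(s)),
};
console.log(results);
```

The HD token and the riksmöte still have to follow the number, so ordinary prose that merely begins with digits is not mistaken for a dump header under either guard.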

Copilot AI and others added 3 commits April 18, 2026 18:13

Remove corrupted metadata/CSS dumps from weekly review articles

83ca589

Agent-Logs-Url: https://github.com/Hack23/riksdagsmonitor/sessions/1cdc4a76-7470-45d6-bee6-da84b6554a3f Co-authored-by: pethers <1726836+pethers@users.noreply.github.com>

Strip Riksdag raw-dump prefix + embedded CSS in extractKeyPassage

6ce9aa2

Agent-Logs-Url: https://github.com/Hack23/riksdagsmonitor/sessions/a02fe4dc-2748-456d-ae34-401a841bb35e Co-authored-by: pethers <1726836+pethers@users.noreply.github.com>

Address review feedback: extract CSS signature, prevent prose false-positives

c46f4ec

Agent-Logs-Url: https://github.com/Hack23/riksdagsmonitor/sessions/a02fe4dc-2748-456d-ae34-401a841bb35e Co-authored-by: pethers <1726836+pethers@users.noreply.github.com>

Copilot AI assigned Copilot and pethers Apr 18, 2026

Copilot created this pull request from a session on behalf of pethers April 18, 2026 18:37

pethers marked this pull request as ready for review April 18, 2026 18:37

Copilot AI review requested due to automatic review settings April 18, 2026 18:37

pethers merged commit 7af654f into main Apr 18, 2026
13 checks passed

Copilot started reviewing on behalf of pethers April 18, 2026 18:38

Copilot AI reviewed Apr 18, 2026

pethers deleted the copilot/fix-html-css-display-issues branch April 21, 2026 12:13
