Add MCP data persistence layer with collision-free sidecar design #1440

Merged: pethers merged 16 commits into main from copilot/organize-analysis-folder-structure on Mar 28, 2026

Copilot AI commented Mar 28, 2026

  • Fix log message in pre-article-analysis.ts:544 — say "documents" not "files"
  • Fix pdf-converter.ts module docs — remove false fallback claim
  • Fix pdf-converter.ts:112 — use os.tmpdir() instead of /tmp
  • Fix populate-analysis-data.ts shebang — #!/usr/bin/env tsx
  • Add date validation in populate-analysis-data.ts parseArgs
  • Add collision avoidance in data-persistence.ts persistDownloadedData (suffix -1, -2 for duplicate IDs)
  • Fix README.md:63 — remove government/ from documented tree (stored in mcp-responses)
  • Fix PersistenceResult docs — clarify logical records counted once (votes may produce multiple file pairs)
  • Sanitize path segments in persistMCPResponse to prevent path traversal
  • Remove analysis/data/documents/government/.gitkeep (government docs use mcp-responses)
  • Remove dot from path segment sanitizer whitelist per code review
  • Add optional riksmote param to persistMCPResponse + derive from call.params.rm
  • Use crypto.randomUUID() fallback in persistMCPResponse for unique IDs
  • Remove unused 'fallback' from ConversionResult.tool union type
  • Use crypto.randomUUID() for temp PDF filenames + suppress console.debug on cleanup
  • Fix gov doc provenance: include dateFrom, dateTo, limit in sidecar params
  • Fix voting group provenance: include limit in sidecar params
  • Fix date validation: use strict parse-and-roundtrip to reject overflow dates like 2026-02-30
  • Fix pdf-converter docs: clarify callers handle persistence, not this module
  • Fix data-persistence module docs: clarify sidecar schemas differ by data source
  • Add 5 new tests: collision avoidance, path traversal × 2, riksmote derivation × 2
  • All 4599 tests pass, CodeQL: 0 alerts
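Two of the checklist items above (path-segment sanitizing and the `-1`, `-2` suffixes for duplicate IDs) can be illustrated with a minimal sketch. The helper names `sanitizeSegment` and `assignUniqueIds` are hypothetical and not taken from `data-persistence.ts`; the sketch also assumes that duplicates are detected within a single batch, which the checklist does not state.

```typescript
// Hypothetical helper: keep only letters, digits, underscore and hyphen.
// Dots are deliberately not whitelisted, so ".." can never survive as a segment.
function sanitizeSegment(segment: string): string {
  const cleaned = segment.replace(/[^A-Za-z0-9_-]/g, '_');
  return cleaned.length > 0 ? cleaned : '_';
}

// Hypothetical helper: the first occurrence keeps its ID, later duplicates
// in the same batch get "-1", "-2", ... appended (assumed behaviour).
function assignUniqueIds(ids: string[]): string[] {
  const seen = new Set<string>();
  return ids.map((id) => {
    let candidate = id;
    for (let n = 1; seen.has(candidate); n++) {
      candidate = `${id}-${n}`;
    }
    seen.add(candidate);
    return candidate;
  });
}

console.log(sanitizeSegment('../etc/passwd'));
console.log(assignUniqueIds(['H901FiU1', 'H901FiU1', 'H901FiU1']));
```

Replacing disallowed characters rather than rejecting the input keeps the write path simple; a stricter variant could throw instead.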

@github-actions github-actions bot added the size-xs Extra small change (< 10 lines) label Mar 28, 2026
@github-actions

🏷️ Automatic Labeling Summary

This PR has been automatically labeled based on the files changed and PR metadata.

Applied Labels: size-xs

Label Categories

  • 🗳️ Content: news, dashboard, visualization, intelligence
  • 💻 Technology: html-css, javascript, workflow, security
  • 📊 Data: cia-data, riksdag-data, data-pipeline, schema
  • 🌍 I18n: i18n, translation, rtl
  • 🔒 ISMS: isms, iso-27001, nist-csf, cis-controls
  • 🏗️ Infrastructure: ci-cd, deployment, performance, monitoring
  • 🔄 Quality: testing, accessibility, documentation, refactor
  • 🤖 AI: agent, skill, agentic-workflow

For more information, see .github/labeler.yml.

@github-actions

🔍 Lighthouse Performance Audit

| Category | Score | Status |
|---|---|---|
| Performance | 85/100 | 🟡 |
| Accessibility | 95/100 | 🟢 |
| Best Practices | 90/100 | 🟢 |
| SEO | 95/100 | 🟢 |

📥 Download full Lighthouse report

Budget Compliance: Performance budgets enforced via budget.json

- Create analysis/data/ directory structure for persistent MCP data storage
  (documents/propositions, motions, committeeReports, votes, speeches,
  questions, interpellations, government; votes by date; events by date; mps)
- Add data-persistence.ts module with persistDownloadedData(), persistEvents(),
  persistMPs() for consistent ID-based filenames and provenance metadata
- Add pdf-converter.ts utility for PDF-to-text/markdown conversion using
  poppler-utils pdftotext
- Integrate data persistence into pre-article-analysis.ts pipeline (Step 1a)
- Update all 12 agentic workflow safe-outputs create-pull-request config with
  labels, draft:false, and expires:14 for analysis data PRs
- Update analysis/README.md with new data directory documentation
- Add comprehensive tests (16 new tests, all passing)
- All 108 existing pre-article-analysis tests pass

Agent-Logs-Url: https://github.com/Hack23/riksdagsmonitor/sessions/71e2d9ad-ee6f-4cb9-815f-7a455f4d2848

Co-authored-by: pethers <1726836+pethers@users.noreply.github.com>
@github-actions github-actions bot added the labels documentation (Documentation updates), workflow (GitHub Actions workflows), ci-cd (CI/CD pipeline changes), testing (Test coverage), refactor (Code refactoring), news (News articles and content generation), agentic-workflow (Agentic workflow changes), and size-l (Large change, 250-1000 lines) on Mar 28, 2026

…uplicate regex

- Extract MIN_HEADING_LENGTH/MAX_HEADING_LENGTH constants in pdf-converter.ts
- Extract ISO_DATE_RE constant in data-persistence.ts to eliminate regex duplication
- Fix test type safety by using delete instead of undefined cast

Agent-Logs-Url: https://github.com/Hack23/riksdagsmonitor/sessions/71e2d9ad-ee6f-4cb9-815f-7a455f4d2848

Co-authored-by: pethers <1726836+pethers@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Organize data and tools for mcp under analysis folder" to "Add MCP data persistence layer with structured analysis/data/ folder" on Mar 28, 2026

…add MCP response storage

- Rewrite data-persistence.ts to use collision-free design:
  - Data files ({id}.json) contain ONLY raw source data (no _metadata injection)
  - Metadata in separate sidecar files ({id}.meta.json) — safe to overwrite
  - Parallel workflows writing same document produce byte-identical output
- Add persistMCPResponse() for storing ANY MCP tool call/response
- Add persistWorldBankData() for World Bank economic indicators
- Add persistSCBData() for Statistics Sweden table data
- Add analysis/data/worldbank/, scb/, mcp-responses/ directories
- Export resolveDocId() for testing
- Update tests: 25 tests covering collision-free design, MCP storage, resolveDocId
- Update analysis/data/README.md with v2 collision-free design documentation
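The collision-free layout described in this commit can be sketched as follows. This is an illustration under stated assumptions, not the module's actual code: `persistWithSidecar` and the `SidecarMeta` shape are hypothetical names, and the real functions may serialize differently.

```typescript
import { mkdirSync, mkdtempSync, readFileSync, writeFileSync } from 'node:fs';
import { tmpdir } from 'node:os';
import { join } from 'node:path';

// Hypothetical sidecar shape: provenance only, never mixed into the data file.
interface SidecarMeta {
  fetchedAt: string;
  mcpTool: string;
  [key: string]: unknown;
}

// Hypothetical helper: write {id}.json with raw source data only, and
// {id}.meta.json with provenance that is safe to overwrite on every run.
function persistWithSidecar(dir: string, id: string, data: unknown, meta: SidecarMeta): void {
  mkdirSync(dir, { recursive: true });
  writeFileSync(join(dir, `${id}.json`), JSON.stringify(data, null, 2) + '\n');
  writeFileSync(join(dir, `${id}.meta.json`), JSON.stringify(meta, null, 2) + '\n');
}

// Two runs at different times: only the sidecar differs between them.
const demoDir = mkdtempSync(join(tmpdir(), 'sidecar-demo-'));
const doc = { dok_id: 'H901FiU1', titel: 'Example' };
persistWithSidecar(demoDir, 'H901FiU1', doc, { fetchedAt: '2026-03-28T10:00:00Z', mcpTool: 'get_dokument' });
const firstRun = readFileSync(join(demoDir, 'H901FiU1.json'), 'utf8');
persistWithSidecar(demoDir, 'H901FiU1', doc, { fetchedAt: '2026-03-28T11:00:00Z', mcpTool: 'get_dokument' });
console.log(firstRun === readFileSync(join(demoDir, 'H901FiU1.json'), 'utf8'));
```

Because the timestamp lives only in the sidecar, two workflows that fetch the same document produce identical data files, which is the property the commit relies on to avoid merge conflicts. The tool name `get_dokument` is used here purely as a placeholder.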

Agent-Logs-Url: https://github.com/Hack23/riksdagsmonitor/sessions/bb68f8b6-e5d2-4da1-83e5-483a9f58073e

Co-authored-by: pethers <1726836+pethers@users.noreply.github.com>
@github-actions github-actions bot added the size-xl Extra large change (> 1000 lines) label Mar 28, 2026

…ts, test isolation

- Remove duplicate directory structure/filename convention sections from README
- Clarify ISO_DATE_RE comment (format check only, not date validity)
- Add debug logging to PDF converter temp file cleanup
- Use unique test IDs with timestamps for MCP storage test isolation

Agent-Logs-Url: https://github.com/Hack23/riksdagsmonitor/sessions/bb68f8b6-e5d2-4da1-83e5-483a9f58073e

Co-authored-by: pethers <1726836+pethers@users.noreply.github.com>

Copilot AI changed the title from "Add MCP data persistence layer with structured analysis/data/ folder" to "Collision-free MCP data persistence with sidecar metadata and universal tool response storage" on Mar 28, 2026
…ove unused 'fallback' type, suppress cleanup noise

Agent-Logs-Url: https://github.com/Hack23/riksdagsmonitor/sessions/a5090ace-74b1-4e4a-9482-d54d0f285662

Co-authored-by: pethers <1726836+pethers@users.noreply.github.com>

Copilot AI commented Mar 28, 2026

@copilot apply changes based on the comments in this thread

All 5 review comments addressed in 23fccb6 and fd22b74:

  1. PersistenceResult.written docs now clarify it counts logical records (votes counted once even when written to both documents/votes/ and votes/{date}/)
  2. persistMCPResponse accepts optional riksmote param (6th arg) and derives from call.params.rm when not explicitly provided
  3. Fallback IDs now use crypto.randomUUID() instead of Date.now() to prevent millisecond collisions
  4. Removed unused 'fallback' from ConversionResult.tool union type
  5. Temp PDF filenames use crypto.randomUUID() and cleanup errors are silently ignored instead of console.debug

Added 2 new tests for riksmote derivation. All 4599 tests pass, CodeQL: 0 alerts.
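Items 2 and 3 above could look roughly like the sketch below. The `McpCall` shape mirrors the `{ tool, params, server }` objects visible in the review hunks; the helper names `resolveRiksmote` and `resolveId`, and the empty-string default when no riksmöte is available, are assumptions for illustration only.

```typescript
import { randomUUID } from 'node:crypto';

// Shape inferred from the review hunks; the real type may carry more fields.
interface McpCall {
  tool: string;
  params: Record<string, unknown>;
  server?: string;
}

// Hypothetical helper: an explicit riksmote argument wins, otherwise fall
// back to call.params.rm when it is a string (assumed default: '').
function resolveRiksmote(call: McpCall, explicit?: string): string {
  if (explicit) return explicit;
  const rm = call.params['rm'];
  return typeof rm === 'string' ? rm : '';
}

// Hypothetical helper: use the caller's ID when present, else a random UUID.
// Unlike Date.now(), two calls in the same millisecond cannot collide.
function resolveId(id?: string): string {
  return id !== undefined && id.length > 0 ? id : randomUUID();
}

const call: McpCall = { tool: 'get_voting_group', params: { rm: '2025/26', groupBy: 'parti' } };
console.log(resolveRiksmote(call));
console.log(resolveRiksmote(call, '2024/25'));
console.log(resolveId());
```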

Copilot AI left a comment

Pull request overview

Copilot reviewed 20 out of 33 changed files in this pull request and generated 2 comments.

Comment on lines +198 to +202
persistMCPResponse(
  { tool: 'search_regering', params: { limit }, server: 'riksdag-regering' },
  doc,
  resolveDocId(doc, i),
);

Copilot AI Mar 28, 2026

The persisted MCP provenance for government documents doesn't reflect the actual fetch parameters. fetchGovernmentDocuments is called with dateFrom, dateTo, and limit, but the persistMCPResponse metadata only records { limit }, which makes the sidecar params misleading for audit/reproducibility. Include the same parameter set used for the fetch (at least dateFrom, dateTo, limit).

const id = (typeof record['parti'] === 'string' && record['parti'])
  || `group-${i + 1}`;
persistMCPResponse(
  { tool: 'get_voting_group', params: { rm, groupBy: 'parti' }, server: 'riksdag-regering' },

Copilot AI Mar 28, 2026

The persisted MCP provenance for voting groups omits limit, even though fetchVotingGroup is called with { rm, groupBy: 'parti', limit }. This makes the .meta.json params inconsistent with the actual request that produced the data. Record the full param set (including limit) in the persistMCPResponse call.

Suggested change
- { tool: 'get_voting_group', params: { rm, groupBy: 'parti' }, server: 'riksdag-regering' },
+ { tool: 'get_voting_group', params: { rm, groupBy: 'parti', limit }, server: 'riksdag-regering' },

pethers commented Mar 28, 2026

@copilot apply changes based on the comments in this thread

…t for voting groups

Agent-Logs-Url: https://github.com/Hack23/riksdagsmonitor/sessions/e86d7c55-f815-4e02-8c16-58ce9d052207

Co-authored-by: pethers <1726836+pethers@users.noreply.github.com>

Copilot AI commented Mar 28, 2026

@copilot apply changes based on the comments in this thread

Both review comments addressed in 34b7bc7:

  1. Government document provenance now includes dateFrom, dateTo, and limit in sidecar params (was only { limit })
  2. Voting group provenance now includes limit alongside rm and groupBy in sidecar params (was missing limit)

All 4599 tests pass.

Copilot AI left a comment

Pull request overview

Copilot reviewed 20 out of 33 changed files in this pull request and generated 3 comments.

Comment on lines +99 to +105
const candidate = args[i + 1];
if (!/^\d{4}-\d{2}-\d{2}$/.test(candidate) || isNaN(new Date(candidate + 'T00:00:00Z').getTime())) {
  console.error(`❌ Invalid date: "${candidate}". Expected YYYY-MM-DD format.`);
  process.exit(1);
}
date = candidate;
i++;

Copilot AI Mar 28, 2026

--date validation uses isNaN(new Date(candidate + 'T00:00:00Z').getTime()), but JS Dates accept overflow values (e.g. 2026-02-30 becomes a valid March date). This means invalid YYYY-MM-DD inputs can slip through and produce an incorrect riksmöte and date window. Consider reusing the repo’s strict parse-and-roundtrip validation pattern (construct UTC Date from Y/M/D and verify components) like parseAndValidateIsoDate in scripts/pre-article-analysis.ts.
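The parse-and-roundtrip pattern this comment refers to can be sketched as below. It is not a copy of `parseAndValidateIsoDate` from `scripts/pre-article-analysis.ts`, whose exact signature is not shown in this thread; the function name `isValidIsoDate` is a stand-in.

```typescript
// Stand-in for the repo's strict validator: build a UTC date from the parsed
// Y/M/D and accept the input only if every component survives the roundtrip.
function isValidIsoDate(value: string): boolean {
  const match = /^(\d{4})-(\d{2})-(\d{2})$/.exec(value);
  if (!match) return false;
  const year = Number(match[1]);
  const month = Number(match[2]);
  const day = Number(match[3]);
  const date = new Date(Date.UTC(year, month - 1, day));
  // An overflow such as Feb 30 rolls into March, so the month and day no longer match.
  return (
    date.getUTCFullYear() === year &&
    date.getUTCMonth() === month - 1 &&
    date.getUTCDate() === day
  );
}

console.log(isValidIsoDate('2026-02-28')); // true
console.log(isValidIsoDate('2026-02-30')); // false
```

The regex alone only checks the shape of the string; the component comparison is what rejects calendar-invalid dates.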

Comment on lines +9 to +11
* Converted content is stored alongside the original JSON metadata with a
* `.txt` or `.md` extension.
*

Copilot AI Mar 28, 2026

Module header says converted content is stored alongside original JSON metadata as .txt/.md, but this module only returns converted text and never writes those output files (other than a temporary PDF for buffer conversion). Either update the documentation to reflect that callers are responsible for persistence, or implement the described .txt/.md output behavior.

Comment on lines +13 to +15
* - **Sidecar metadata** (`{id}.meta.json`) tracks provenance (fetch
* timestamp, MCP tool name, riksmöte, document type) in a separate file
* that is safely overwritten on each run.

Copilot AI Mar 28, 2026

The module-level comment says sidecar metadata always includes riksmöte and document type for all persisted MCP data, but persistWorldBankData and persistSCBData currently write sidecars without riksmote/documentType (they only include fetchedAt/mcpTool plus indicator/country or tableId/query). Consider either (a) standardizing these sidecars to include the common fields (with riksmote: '' and documentType: 'worldbank'|'scb') or (b) updating the module docs/types to reflect that sidecar schemas differ by data source.

Suggested change
-  * - **Sidecar metadata** (`{id}.meta.json`) tracks provenance (fetch
-  *   timestamp, MCP tool name, riksmöte, document type) in a separate file
-  *   that is safely overwritten on each run.
+  * - **Sidecar metadata** (`{id}.meta.json`) tracks provenance (at minimum
+  *   fetch timestamp and MCP tool name) in a separate file that is safely
+  *   overwritten on each run. Riksdag/Riksdag-regeringen documents also
+  *   include `riksmöte` and `documentType`, while external MCP tools
+  *   (World Bank, SCB, etc.) use tool-specific fields (e.g. indicator /
+  *   country, tableId / query) instead.
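
Option (b), sidecar schemas that differ by data source, could also be expressed in types. The sketch below is one possible shape, not the module's actual types: the `source` discriminant and all interface names are assumptions, while the field names (`fetchedAt`, `mcpTool`, `riksmote`, `documentType`, `indicator`, `country`, `tableId`, `query`) come from the comment above.

```typescript
// Common provenance fields that every sidecar carries.
interface BaseSidecar {
  fetchedAt: string;
  mcpTool: string;
}

// The `source` discriminant is hypothetical; it is added here so that
// TypeScript can narrow the union in a switch statement.
interface RiksdagSidecar extends BaseSidecar {
  source: 'riksdag';
  riksmote: string;
  documentType: string;
}

interface WorldBankSidecar extends BaseSidecar {
  source: 'worldbank';
  indicator: string;
  country: string;
}

interface ScbSidecar extends BaseSidecar {
  source: 'scb';
  tableId: string;
  query: unknown;
}

type Sidecar = RiksdagSidecar | WorldBankSidecar | ScbSidecar;

// Narrowing on `source` gives access to the source-specific fields only.
function describeSidecar(meta: Sidecar): string {
  switch (meta.source) {
    case 'riksdag':
      return `${meta.documentType} from riksmöte ${meta.riksmote}`;
    case 'worldbank':
      return `indicator ${meta.indicator} for ${meta.country}`;
    case 'scb':
      return `SCB table ${meta.tableId}`;
  }
}

console.log(describeSidecar({
  source: 'worldbank',
  fetchedAt: '2026-03-28T10:00:00Z',
  mcpTool: 'get_indicator',
  indicator: 'NY.GDP.MKTP.CD',
  country: 'SE',
}));
```

The tool name `get_indicator` is a placeholder. A union like this documents the per-source differences in one place without forcing empty `riksmote` values onto World Bank or SCB sidecars.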

pethers commented Mar 28, 2026

@copilot apply changes based on the comments in this thread

Copilot AI commented Mar 28, 2026

@copilot apply changes based on the comments in this thread

All 3 review comments addressed in cb64b59:

  1. Date validation now uses strict parse-and-roundtrip (construct Date.UTC from Y/M/D and verify all components match) — rejects overflow dates like 2026-02-30
  2. pdf-converter module docs updated to clarify callers handle persistence, not this module
  3. data-persistence module docs updated to clarify sidecar schemas differ by data source (riksdag includes riksmöte/documentType; World Bank/SCB use tool-specific fields)

All 4599 tests pass.

Copilot AI left a comment

Pull request overview

Copilot reviewed 20 out of 33 changed files in this pull request and generated 1 comment.

Comment on lines +95 to +101
// No system tool available
return {
  success: false,
  text: '',
  tool: 'none',
  error: 'No PDF conversion tool available. Install poppler-utils: apt-get install poppler-utils',
};

Copilot AI Mar 28, 2026

The fallback error message hard-codes Debian/Ubuntu install instructions (apt-get install poppler-utils). Since this code can run on macOS/Windows as well (and CI runners may vary), consider making the message platform-agnostic (e.g. “install Poppler/pdftotext”) or include multiple common install commands (apt/brew/choco) or a link to Poppler docs.
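
One way to act on this suggestion is to pick the install hint from the current platform. The sketch below is an assumption about how that might look, not code from `pdf-converter.ts`; `popplerInstallHint` is a hypothetical helper, and the package names should be checked against each package manager before use.

```typescript
import { platform } from 'node:os';

// Hypothetical helper: map the Node.js platform string to a common install
// command for Poppler's pdftotext, with a generic fallback for other systems.
function popplerInstallHint(os: string = platform()): string {
  switch (os) {
    case 'darwin':
      return 'brew install poppler';
    case 'win32':
      return 'choco install poppler';
    case 'linux':
      return 'apt-get install poppler-utils (Debian/Ubuntu)';
    default:
      return 'install Poppler (pdftotext) with your package manager';
  }
}

const error = `No PDF conversion tool available. To enable it: ${popplerInstallHint()}`;
console.log(error);
```

Taking the platform as a parameter with a default keeps the helper easy to unit-test for every branch.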

@pethers pethers merged commit ba7cdb4 into main Mar 28, 2026
20 checks passed
@pethers pethers deleted the copilot/organize-analysis-folder-structure branch March 28, 2026 14:47

Labels

agentic-workflow (Agentic workflow changes), ci-cd (CI/CD pipeline changes), documentation (Documentation updates), news (News articles and content generation), refactor (Code refactoring), size-l (Large change, 250-1000 lines), size-xl (Extra large change, > 1000 lines), size-xs (Extra small change, < 10 lines), testing (Test coverage), workflow (GitHub Actions workflows)

3 participants