SEO metadata backfill — PR 2 of 5: scanner CLI + committed dry-run CSV diff report #1986
Conversation
🏷️ Automatic Labeling Summary

This PR has been automatically labeled based on the files changed and PR metadata.

Applied Labels: size-xs
🔍 Lighthouse Performance Audit

📥 Download full Lighthouse report · Budget Compliance: Performance budgets enforced via …
…of 5)
Agent-Logs-Url: https://github.com/Hack23/riksdagsmonitor/sessions/163a59a0-ae83-4cff-ab71-35d5dec87fc7
Co-authored-by: pethers <1726836+pethers@users.noreply.github.com>
Pull request overview
Note: Copilot was unable to run its full agentic suite in this review.
Adds read-only “scanner” tooling to classify existing `news/*.html` articles into Tier A/B/C buckets and emit a deterministic RFC 4180 CSV diff report of SEO-metadata contract violations, with accompanying tests and operator docs.
Changes:
- Introduces contract checker + HTML metadata inspector + tier classifier + CSV report writer modules used for backfill planning.
- Adds the `scripts/backfill-article-metadata.ts` CLI with filtering flags and exit-code semantics, wired as `npm run backfill-metadata`.
- Adds comprehensive Vitest coverage plus an operator runbook and gitignore exceptions to commit the dry-run CSV artefacts.
Reviewed changes
Copilot reviewed 9 out of 11 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| `scripts/backfill-lib/contract-checker.ts` | Implements stable, code-based contract validation and per-language length windows. |
| `scripts/backfill-lib/html-inspector.ts` | Extracts `<title>`, meta tags, JSON-LD fields, and article body preview without DOM dependencies. |
| `scripts/backfill-lib/classifier.ts` | Assigns Tier A/B/C based on analysis source presence and contract results. |
| `scripts/backfill-lib/report-writer.ts` | Serialises scan output into RFC 4180 CSV with a pinned schema. |
| `scripts/backfill-article-metadata.ts` | CLI to scan, filter, classify, and write the CSV report (dry-run/check/apply modes). |
| `tests/contract-checker.test.ts` | Unit tests covering each contract rule and language window behaviours. |
| `tests/backfill-article-metadata.test.ts` | Tests for filename parsing, tiering, HTML inspection, CSV quoting, and CLI surface/end-to-end scan. |
| `analysis/metadata-backfill/README.md` | Operator runbook for regenerating and interpreting the CSV report. |
| `.gitignore` | Allows committed backfill CSV artefacts under `analysis/metadata-backfill/`. |
| `package.json` | Adds `backfill-metadata` npm script entrypoint. |
```ts
it('maps all 14 contract languages', () => {
  for (const lang of ['en', 'sv', 'da', 'no', 'nb', 'fi', 'de', 'fr', 'es', 'nl', 'ar', 'he', 'ja', 'ko', 'zh']) {
    expect(LANG_WINDOWS[lang as keyof typeof LANG_WINDOWS]).toBeDefined();
  }
});
```
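For context, a minimal sketch of the shape this test assumes — the real `LANG_WINDOWS` table lives in `scripts/backfill-lib/contract-checker.ts` and its exact bounds are not shown in this PR, so the numbers below are placeholders:

```typescript
// Hypothetical shape of the LANG_WINDOWS table the test above exercises.
// The bounds here are placeholders, not the project's actual windows.
type LengthWindow = { min: number; max: number };

const LANG_WINDOWS: Record<string, { title: LengthWindow; description: LengthWindow }> = {
  en: { title: { min: 30, max: 60 }, description: { min: 110, max: 160 } },
  sv: { title: { min: 30, max: 60 }, description: { min: 110, max: 160 } },
  // …the remaining contract languages follow the same shape.
};

function withinWindow(value: string, w: LengthWindow): boolean {
  return value.length >= w.min && value.length <= w.max;
}

console.log(withinWindow('A sufficiently descriptive page title here', LANG_WINDOWS.en.title)); // true
```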
```ts
export function scan(options: CliOptions): ScanResult {
  const allFiles = listArticleFiles(options.newsDir);
  const rows: ReportRow[] = [];

  let filesWithViolations = 0;
  let totalViolations = 0;
  const tierCounts: Record<Tier, number> = { A: 0, B: 0, C: 0 };
  let uncategorised = 0;

  for (const abs of allFiles) {
    const relPath = path.relative(ROOT_DIR, abs);
    const fp = parseArticleFilename(relPath);

    // Apply CLI filters.
    if (options.langs && !options.langs.includes(fp.lang)) continue;
    if (options.dateFrom && (fp.date === null || fp.date < options.dateFrom)) continue;
    if (options.dateTo && (fp.date === null || fp.date > options.dateTo)) continue;

    const meta = inspectHtmlFile(abs);
    // Prefer the explicit `<html lang>` over the filename-derived code —
    // filename is authoritative for the file suffix but a mistyped lang
    // attribute is a separate class of bug worth surfacing.
    const lang = meta.lang || fp.lang;
    const contract = checkAgainstContract(
      { title: meta.title, description: meta.metaDescription },
      lang,
    );
```
```ts
      langs = raw
        .split(',')
        .map((l) => l.trim().toLowerCase())
        .filter(Boolean);
      for (const l of langs) {
        if (!isKnownLang(l)) {
          fail(`Unknown --lang value: ${l}.`);
```
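The same trim/lowercase/filter pipeline can be sketched as a standalone helper — `parseListFlag` is a hypothetical name (the real CLI inlines this logic), but the semantics match the review follow-ups: empty items are dropped, and an all-empty value means "no filter":

```typescript
// Hypothetical helper illustrating the CLI's list-flag semantics:
// empty items ("--lang=sv,,no") are dropped, and an all-empty value
// ("--lang=") falls back to "no filter" (scan everything).
function parseListFlag(raw: string): string[] | undefined {
  const items = raw
    .split(',')
    .map((item) => item.trim().toLowerCase())
    .filter(Boolean);
  return items.length > 0 ? items : undefined;
}

console.log(parseListFlag('sv, ,NO')); // ['sv', 'no']
console.log(parseListFlag(''));        // undefined
```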
|
@copilot apply changes based on the comments in this thread
…lback, empty --lang=
Agent-Logs-Url: https://github.com/Hack23/riksdagsmonitor/sessions/d032127c-f30d-4009-b37c-29c938989aa5
Co-authored-by: pethers <1726836+pethers@users.noreply.github.com>
Applied in cb1518f. All four reviewer comments + both CodeQL findings addressed.
@copilot apply changes based on the comments in this thread
…usive modes, naming
Agent-Logs-Url: https://github.com/Hack23/riksdagsmonitor/sessions/22c644ba-3904-4c3b-9182-80e15447c78a
Co-authored-by: pethers <1726836+pethers@users.noreply.github.com>
Applied in 5f85fc7. All 5 review comments addressed:

- Regression tests added for single-quoted/unquoted/reordered attribute forms.
```ts
 * | Flag                     | Purpose                                                  |
 * | ------------------------ | -------------------------------------------------------- |
 * | `--dry-run`              | Scan only; emit CSV; exit 0. Default behaviour.          |
 * | `--check`                | Scan only; emit CSV; exit non-zero on any violation.     |
 * | `--apply`                | Reserved for PRs 3-5. Fails with a pointer.              |
 * | `--tier=A\|B\|C\|all`    | Restrict classification to a tier subset.                |
 * | `--lang=sv,no`           | Restrict scan to comma-separated lang codes.             |
 * | `--date-from=YYYY-MM-DD` | Lower bound (inclusive) on article date.                 |
 * | `--date-to=YYYY-MM-DD`   | Upper bound (inclusive) on article date.                 |
 * | `--output=<path>`        | CSV output path (defaults to the committed diff report). |
 * | `--news-dir=<path>`      | Override the `news/` directory (for tests).              |
```
```ts
 * Each row represents **one** contract violation. A file with three
 * violations produces three rows. `after` is intentionally blank in
 * PR 2 — PRs 3/4/5 will populate it with the planned post-backfill
 * value when they run. An un-violating file still produces one row per
 * tier it qualifies for, with `field`, `violation_code`, and `before`
 * all blank, so reviewers can audit the tier classification
 * independently of the violation set.
```
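The emission rule above can be sketched as a small function. Column and helper names here are assumptions for illustration — the pinned schema lives in `scripts/backfill-lib/report-writer.ts`:

```typescript
// Hypothetical sketch of the row-emission rule: one row per violation,
// and one all-blank row per qualifying tier for a "green" article.
// Column names (file_path, tier, …) are assumed, not the pinned schema.
type Row = { file_path: string; tier: string; field: string; violation_code: string; before: string; after: string };

function rowsFor(filePath: string, tier: string, violations: { field: string; code: string; before: string }[]): Row[] {
  if (violations.length === 0) {
    // A green article still emits one row per qualifying tier, with the
    // violation columns blank, so tier classification stays auditable.
    return [{ file_path: filePath, tier, field: '', violation_code: '', before: '', after: '' }];
  }
  // One row per violation; `after` stays blank until PRs 3–5 plan values.
  return violations.map((v) => ({
    file_path: filePath, tier, field: v.field, violation_code: v.code, before: v.before, after: '',
  }));
}

console.log(rowsFor('news/2026-04-01-sv.html', 'B', []).length); // 1
console.log(rowsFor('news/2026-04-01-sv.html', 'B', [
  { field: 'title', code: 'TITLE_TOO_SHORT', before: 'Kort' },
  { field: 'description', code: 'DESC_MISSING', before: '' },
]).length); // 2
```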
```sh
# 1. Scan every news/*.html and emit today's CSV.
npm run backfill-metadata -- --dry-run \
  --output=analysis/metadata-backfill/diff-report-$(date +%F).csv

# 2. Commit the file so reviewers can inspect tier classification.
git add analysis/metadata-backfill/diff-report-$(date +%F).csv
```
| Flag | Effect |
| ---- | ------ |
| `--dry-run` | Scan only; emit CSV; exit `0`. **Default behaviour.** |
| `--check` | Scan only; emit CSV; exit `1` when any article violates the contract. |
| `--apply` | **Reserved for PRs 3–5.** Fails fast with exit `2` in PR 2. |
| `--tier=A\|B\|C\|all` | Restrict classification to a tier subset. `all` is the default. |
| `--lang=sv,no,de` | Restrict scan to comma-separated language codes. |
| `--date-from=YYYY-MM-DD` | Lower bound (inclusive) on article date. |
| `--date-to=YYYY-MM-DD` | Upper bound (inclusive) on article date. |
| `--output=<path>` | CSV output path. |
| `--news-dir=<path>` | Override the `news/` directory (test-only). |
| `--quiet` | Suppress the summary line on stderr. |
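The exit-code contract in the table can be sketched as follows — the `Mode` type and `exitCodeFor` name are illustrative, not the CLI's actual internals:

```typescript
// Hypothetical sketch of the documented exit-code semantics.
type Mode = 'dry-run' | 'check' | 'apply';

function exitCodeFor(mode: Mode, violationCount: number): number {
  if (mode === 'apply') return 2;                       // reserved for PRs 3–5: fail fast
  if (mode === 'check' && violationCount > 0) return 1; // --check surfaces violations
  return 0;                                             // --dry-run always exits 0
}

console.log(exitCodeFor('check', 3)); // 1
```

Flag misuse (`CliUsageError`) maps to exit code `2` as well, per the PR description.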
|
@copilot apply changes based on the comments in this thread
…w emission, and UTC date defaults
Agent-Logs-Url: https://github.com/Hack23/riksdagsmonitor/sessions/61a77415-7915-4287-84fd-982e2e474ff7
Co-authored-by: pethers <1726836+pethers@users.noreply.github.com>
Applied in 51a244f. All 4 doc-only review comments addressed.
```ts
  const header = CSV_COLUMNS.join(',');
  const body = rows.map(serialiseRow).join('\n');
  const csv = rows.length > 0 ? `${header}\n${body}\n` : `${header}\n`;
  fs.writeFileSync(outputPath, csv, 'utf8');
```
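The final review round replaced this buffered write with a streaming one. A minimal sketch of that approach, with placeholder `CSV_COLUMNS`/`serialiseRow` standing in for the module's real exports:

```typescript
import fs from 'node:fs';
import os from 'node:os';
import path from 'node:path';

// Placeholder schema and serialiser — the real ones live in report-writer.ts.
const CSV_COLUMNS = ['file_path', 'tier', 'field', 'violation_code'];
const serialiseRow = (row: string[]): string => row.join(',');

// Streaming sketch: header first, then one writeSync per row, so the full
// CSV never has to be assembled in memory; output bytes are unchanged.
function writeReportStreaming(outputPath: string, rows: string[][]): void {
  const fd = fs.openSync(outputPath, 'w');
  try {
    fs.writeSync(fd, `${CSV_COLUMNS.join(',')}\n`);
    for (const row of rows) {
      fs.writeSync(fd, `${serialiseRow(row)}\n`);
    }
  } finally {
    fs.closeSync(fd);
  }
}

const demoPath = path.join(os.tmpdir(), 'backfill-demo.csv');
writeReportStreaming(demoPath, [['news/a-sv.html', 'A', '', '']]);
console.log(fs.readFileSync(demoPath, 'utf8'));
```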
```ts
export function inspectHtmlContent(html: string, filePath: string = ''): ArticleMetadata {
  const lang = extractHtmlLang(html);
  const title = match1(html, REGEXES.title);
  const metaDescription = extractMetaContent(html, 'name', 'description');
  const ogTitle = extractMetaContent(html, 'property', 'og:title');
  const ogDescription = extractMetaContent(html, 'property', 'og:description');
  const twitterTitle = extractMetaContent(html, 'name', 'twitter:title');
  const twitterDescription = extractMetaContent(html, 'name', 'twitter:description');

  const jsonLdHeadline = extractJsonLdField(html, 'headline');
  const jsonLdAlternativeHeadline = extractJsonLdField(html, 'alternativeHeadline');
  const jsonLdDescription = extractJsonLdField(html, 'description');
```
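The last review round consolidated these per-field regex passes into a single `<meta>` index. A simplified sketch of that idea (quoted attributes only — the real inspector also tolerates unquoted and reordered forms):

```typescript
// Hypothetical single-pass <meta> index: all tags are parsed once into a
// Map keyed by "name:…" / "property:…", so each field lookup becomes an
// O(1) read instead of a fresh regex pass over the whole document.
function buildMetaIndex(html: string): Map<string, string> {
  const index = new Map<string, string>();
  for (const [tag] of html.matchAll(/<meta\b[^>]*>/gi)) {
    const name = /\b(?:name|property)\s*=\s*["']([^"']+)["']/i.exec(tag);
    const content = /\bcontent\s*=\s*["']([^"']*)["']/i.exec(tag);
    const kind = /\bproperty\s*=/i.test(tag) ? 'property' : 'name';
    if (name && content) index.set(`${kind}:${name[1]}`, content[1]);
  }
  return index;
}

const index = buildMetaIndex(
  '<meta name="description" content="Hello"><meta property="og:title" content="Title">',
);
console.log(index.get('name:description'));  // Hello
console.log(index.get('property:og:title')); // Title
```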
```ts
  let uncategorised = 0;

  for (const abs of allFiles) {
    const relPath = path.relative(ROOT_DIR, abs);
```
```ts
    return { relPath, date: null, subfolder: null, lang: 'en' };
  }
  return {
    relPath,
    date: m[1] ?? null,
    subfolder: m[2] ?? null,
    lang: (m[3] ?? 'en').toLowerCase(),
```
```ts
      tiers = raw.split(',').map((t) => t.trim().toUpperCase() as Tier);
      for (const t of tiers) {
        if (t !== 'A' && t !== 'B' && t !== 'C') {
          fail(`Invalid --tier value: ${t}. Expected A | B | C | all.`);
        }
```
```json
    "download-data:weekly": "npx tsx scripts/download-parliamentary-data.ts --aggregate weekly",
    "backfill-metadata": "tsx scripts/backfill-article-metadata.ts"
```
|
@copilot apply changes based on the comments in this thread
…s, list-flag consistency, streaming CSV
Agent-Logs-Url: https://github.com/Hack23/riksdagsmonitor/sessions/f4044f59-432a-4e9c-bfd5-4e2351fbe927
Co-authored-by: pethers <1726836+pethers@users.noreply.github.com>
Applied in 9f3ed1f. All 6 review comments from threads 4175219627–4175241013 addressed.
PR 2 of 5 in the SEO metadata backfill series. Adds a read-only "scanner" CLI that classifies existing `news/*.html` articles into Tier A/B/C buckets and emits a deterministic RFC 4180 CSV diff report of SEO-metadata contract violations, plus the supporting library modules, Vitest coverage, operator runbook, and committed dry-run CSV artefact.

Scope

- `scripts/backfill-article-metadata.ts` CLI wired as `npm run backfill-metadata`, with `--dry-run`/`--check`/`--apply` mode flags, `--lang`/`--date-*`/`--tier` filters, and `CliUsageError` → exit-code-2 mapping for misuse.
- Library modules: `scripts/backfill-lib/{contract-checker,html-inspector,classifier,report-writer}.ts`.
- Operator runbook (`analysis/metadata-backfill/README.md`) and gitignore exceptions to commit dry-run CSV artefacts.

Review thread follow-ups (latest first)
Review 4175241013

- `report-writer.ts` — `writeReport` now streams rows via a synchronous file descriptor (header → row-by-row → close) instead of building the entire CSV in memory; output bytes unchanged.
- `html-inspector.ts` — `<meta>` tags are parsed once into a `Map` keyed by `name:`/`property:` and JSON-LD blocks once into an array; `inspectHtmlContent` reads all 5 meta + 3 JSON-LD fields from those cached structures (down from 5 + 3 sequential passes per file).
- `backfill-article-metadata.ts` — `relPath` is normalised to POSIX separators (`split(path.sep).join('/')`) so `file_path` columns stay deterministic on Windows runners.
- `classifier.ts` — `parseArticleFilename` returns `lang: ''` for unparseable filenames instead of mis-labelling them as English; the CLI applies the existing `meta.lang || fp.lang || 'en'` fallback only for that empty case.
- `backfill-article-metadata.ts` — `--tier` parsing now mirrors `--lang`: empty list items (`--tier=`, `--tier=,`, `--tier=A,,C`) are dropped, and an all-empty result resets to "all".
- … `-sv.html` articles.

Review 4175219627 (doc-only)
- The `--tier` flag table in the CLI doc block and the runbook CLI reference both document the comma-separated list syntax (`A,B,C` | `all`).
- `report-writer.ts` module doc rewritten to describe the actual emission rule — one row per (tier, violation) pair (`T * V` rows; `T` rows for green articles).
- `analysis/metadata-backfill/README.md` regeneration commands switched to `date -u +%F` so the suggested filename matches the CLI's UTC `isoToday()` default.

Review 4175159264
- `<html lang>` regex replaced with `parseAttributes`-based extraction; accepts single-quoted, unquoted, and reordered attributes.
- `<script\b…>` matching tolerates extra attributes (`defer`), reordered attrs, and single-quoted `type`.
- `rawTitle` local renamed to `title` in `inspectHtmlContent` (both call sites).
- `--dry-run`/`--check`/`--apply` mode flags throw `CliUsageError` (exit code 2) when conflicting; repeated identical flags remain a no-op.
- Regression tests: single-quoted/unquoted `<html lang>`, reordered/single-quoted JSON-LD `type`, and conflicting mode flags.

Earlier reviews (4173159135, 4175112750)
- `htmlDecode` rewritten as a single-pass replace keyed on a named-entity table.
- … (`</script\s*>`/`</style\s*>` end-tag stripping).
- `totals.filesMatched` added (counted after filters); `filesScanned` retains its on-disk meaning; stderr summary shows both.
- … `tiersToEmit` when `--tier=...` is active.
- `parseFlags()` throws `CliUsageError` from parsing; `main()` maps it to exit code 2.
- `quoteField` accepts `string | null | undefined` directly.
- `TITLE_ENDS_WITH_BRAND` no longer treats `| Riksdagsmonitor` as a dash-based double-brand violation.
- `<meta>` extraction parses attributes independent of order and tolerates entity-bearing unquoted values.
- "`nb` alias for Norwegian"; `nb` added to the `isKnownLang` test list for symmetry.

Validation
- `tsc --noEmit -p tsconfig.scripts.json` — clean
- `eslint` — clean
- `vitest run` — 2086/2086 passing
- Committed dry-run CSV (`analysis/metadata-backfill/diff-report-2026-04-24.csv`) regenerated and tracked in the PR.