feat(blog): link integrity - audit, redirects, multilingual rewrite, CI guards #38

Merged: miquelmatoses merged 2 commits into main from feat/blog-link-integrity on May 28, 2026

@miquelmatoses (Collaborator)

Summary

Phase 17.10. Three layers so internal blog links never break silently.

Layer A - one-shot audit (scripts/audit_blog_links.py)

Reads every published article from the API, extracts links from each language body, classifies internal/external, validates internal targets against the live slug set, probes externals (broken vs flaky, no false positives for 403/429/5xx/timeout), and proposes redirects via slug/title similarity. Report at docs/seo/links-audit-20260528.md.
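
The broken-vs-flaky split could look roughly like the sketch below. It is not the code in scripts/audit_blog_links.py: the function name `classify_probe`, the three labels, and the handling of 410 and of other 4xx codes are assumptions made for illustration. Only the treatment of 403/429/5xx/timeout as "not broken" comes from the description above.

```python
from typing import Optional

# Assumption: 410 Gone is treated like 404. The PR text only names 404s.
BROKEN_STATUSES = {404, 410}
# Named in the PR as statuses that must not count as broken.
FLAKY_STATUSES = {403, 429}


def classify_probe(status: Optional[int]) -> str:
    """Classify one external probe result.

    `status` is the HTTP status code, or None when the request timed out
    or the connection failed (hypothetical convention for this sketch).
    """
    if status is None:
        return "flaky"  # timeout: the link may still be fine
    if status in FLAKY_STATUSES or 500 <= status <= 599:
        return "flaky"  # bot blocking, rate limiting, server trouble
    if status in BROKEN_STATUSES:
        return "broken"
    if 200 <= status <= 399:
        return "ok"
    # Assumption: any other 4xx is not reported as broken.
    return "flaky"
```

Erring toward "flaky" keeps the report free of false positives at the cost of missing some genuinely dead links behind unusual status codes.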

Findings: 6 dead internal slugs across 60 link instances (5 left doubled by an old rename script e.g. ...-more-honest-data-more-honest-data, 1 genuine rename), 100 external 404s (mostly stale DOIs), 0 multilingual mismatches.
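
Doubled slugs of this kind are an easy case for similarity matching. A minimal sketch using the standard library's difflib follows; the function name `propose_redirect`, the 0.6 cutoff, and the slugs in the usage example are hypothetical, and the real audit also compares titles, which this sketch does not.

```python
import difflib
from typing import List, Optional


def propose_redirect(
    dead_slug: str, live_slugs: List[str], cutoff: float = 0.6
) -> Optional[str]:
    """Return the live slug most similar to `dead_slug`, or None.

    The cutoff is an assumed threshold; a proposal is only a suggestion
    for a human to confirm, not an automatic redirect.
    """
    matches = difflib.get_close_matches(dead_slug, live_slugs, n=1, cutoff=cutoff)
    return matches[0] if matches else None


if __name__ == "__main__":
    live = ["sample-post-more-honest-data", "unrelated-topic"]
    # A slug whose suffix was appended twice still matches its original.
    print(propose_redirect("sample-post-more-honest-data-more-honest-data", live))
```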

Markdown parsing handles balanced parens so DOIs like 10.1016/S0092-6566(03)00046-1 are not truncated into false 404s.
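
One way to get this behavior is to scan for the closing parenthesis while tracking nesting depth, rather than stopping at the first `)`. The sketch below illustrates that idea only; it is not the contents of api/blog_links.py, and the function name `extract_link_targets` is hypothetical. It handles inline links and ignores reference-style links and escaped parentheses.

```python
from typing import List


def extract_link_targets(markdown: str) -> List[str]:
    """Return the destinations of inline Markdown links, keeping balanced parens."""
    targets: List[str] = []
    i = 0
    while True:
        start = markdown.find("](", i)
        if start == -1:
            break
        j = start + 2
        depth = 1  # we are inside the opening "(" of the link
        while j < len(markdown):
            ch = markdown[j]
            if ch == "(":
                depth += 1
            elif ch == ")":
                depth -= 1
                if depth == 0:
                    break
            j += 1
        if depth == 0:
            raw = markdown[start + 2 : j].strip()
            if raw:
                # Drop an optional link title: keep only the first token.
                targets.append(raw.split(maxsplit=1)[0])
            i = j + 1
        else:
            i = start + 2  # unbalanced: skip this candidate and move on
    return targets


if __name__ == "__main__":
    text = (
        "See [paper](https://doi.org/10.1016/S0092-6566(03)00046-1) "
        "and [post](/blog/some-slug)."
    )
    print(extract_link_targets(text))
```

A pattern that stops at the first `)` would cut the DOI above right after `(03`, and the truncated URL would then be probed and reported as a 404.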

Layer C - automatic resilience

  • db/migrations/016_blog_slug_redirects.sql: blog_slug_redirects(slug_old PK -> slug_new, created_at, reason), idempotent, seeded with the 6 audited redirects plus the known big-five-vs-disc-vs-belbin case. That row is currently dormant because the slug still resolves to a live article; the spec's "no longer exists" premise is outdated, as documented in the row comment.
  • GET /blog/<slug> returns 308 to /blog/<slug_new> for a dead slug with a redirect. Single hop; target must be a live post, so chains and A->B->A cycles never loop or land on a 404. Degrades to 404 if the table is unmigrated, so the code deploy is safe before the SQL is applied.
  • BlogArticlePage rewrites internal links to the active locale only when the target has that locale (else English fallback), using a new languages[] field on the /blog list endpoint carried in window.__BLOG_ARTICLES__. Replaces the previous blind prefix that mislabeled every link /<lang>/blog/....
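
The redirect rules in the second bullet can be summarized as a small pure function. This is a sketch of the decision logic only, assuming the live slugs and the redirect rows are already loaded; the name `resolve_slug`, the tuple return shape, and the use of `None` to stand for an unmigrated table are all hypothetical and not taken from the handler itself.

```python
from typing import Mapping, Optional, Set, Tuple


def resolve_slug(
    slug: str,
    live_slugs: Set[str],
    redirects: Optional[Mapping[str, str]],
) -> Tuple[int, Optional[str]]:
    """Return (status, location) for GET /blog/<slug>.

    `redirects` maps slug_old -> slug_new; None models the case where
    the blog_slug_redirects table has not been migrated yet.
    """
    if slug in live_slugs:
        return 200, None  # a live article wins; any redirect row stays dormant
    if redirects is None:
        return 404, None  # table missing: degrade to the old behavior
    target = redirects.get(slug)
    # Single hop: the target itself must be live, so a chain through a dead
    # slug or an A->B->A cycle ends in 404 instead of looping.
    if target is not None and target in live_slugs:
        return 308, f"/blog/{target}"
    return 404, None
```

Because the target is never looked up in the redirect table a second time, no request can trigger more than one redirect.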

Layer B - permanent guards

  • api/tests/test_internal_links_integrity.py: scans prerendered dist, fails if any internal /blog link resolves to neither a live article nor a redirect. Verified non-vacuous (catches all 6 without the seed).
  • api/jobs/external_links_check.py + cron + DDL 07: weekly probe into cercol_seo.external_links_status, emails a digest of links newly broken since last week (LINKS_ALERT_EMAIL).
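
The core check of the integrity guard amounts to a set difference. The sketch below is not the test in api/tests/test_internal_links_integrity.py: the function name `dangling_blog_slugs`, the href pattern (an optional two-letter locale prefix, lowercase slugs), and the sample HTML are assumptions for illustration.

```python
import re
from typing import List, Set

# Assumed URL shape: /blog/<slug> or /<two-letter-lang>/blog/<slug>,
# optionally followed by a trailing slash, fragment, or query string.
BLOG_HREF = re.compile(r'href="(?:/[a-z]{2})?/blog/([a-z0-9-]+)/?(?:[#?][^"]*)?"')


def dangling_blog_slugs(
    html: str, live_slugs: Set[str], redirect_slugs: Set[str]
) -> List[str]:
    """Return linked slugs that are neither a live article nor a redirect source."""
    linked = set(BLOG_HREF.findall(html))
    return sorted(linked - live_slugs - redirect_slugs)


if __name__ == "__main__":
    page = (
        '<a href="/blog/alive">a</a>'
        '<a href="/ca/blog/renamed-old">b</a>'
        '<a href="/blog/gone#section">c</a>'
        '<a href="https://example.com/blog/elsewhere">d</a>'
    )
    # An empty result means the page passes the guard.
    print(dangling_blog_slugs(page, {"alive"}, {"renamed-old"}))
```

Running the same check with an empty redirect set is what makes the guard demonstrably non-vacuous: slugs covered only by redirects show up as dangling again.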

All markdown link parsing lives in one place (api/blog_links.py), shared by the audit, the cron job, and the CI guard.

Test plan

  • pytest api/ - 192 passed (incl. 6 redirect, 2 external-job, 1 integrity)
  • integrity test passes against full build:full dist; proven to fail without the redirect seed
  • vitest run - 211 passed (incl. 7 localizeBlogLinks)
  • CI frontend job: build:full + integrity guard against complete dist
  • Post-merge: apply migration 016 via SSH; curl api.cercol.team/blog/big-five-vs-disc-vs-belbin (dormant -> 200; a truly dead slug -> 308)

Deploy notes

No migration runner in this repo: migration 016 is applied manually via sudo -u postgres psql cercol -f db/migrations/016_blog_slug_redirects.sql after the backend deploy (runbook updated). The new cron and BigQuery DDL 07 are installed per the runbook (not auto-installed).

Known follow-up (content debt, not in scope)

The 60 broken links are resolved by redirects but still live in the published article bodies. Fixing the source content (and the dead external DOIs) is tracked separately.

🤖 Generated with Claude Code

…guards

Phase 17.10. Three layers so internal blog links never break silently.

Layer A - one-shot audit (scripts/audit_blog_links.py):
Reads every published article from the API, extracts links from each
language body (markdown-aware, balanced parens so DOIs are not
truncated), classifies internal vs external, validates internal targets
against the live slug set, probes externals (broken vs flaky), and
proposes redirects by slug/title similarity. Report:
docs/seo/links-audit-20260528.md. It found 6 dead internal slugs (5
left doubled by an old rename script, 1 genuine rename) across 60 link
instances, plus 100 external 404s and no multilingual mismatches.

Layer C - automatic resilience:
- db/migrations/016_blog_slug_redirects.sql: new blog_slug_redirects
  table (slug_old PK -> slug_new), idempotent, seeded with the 6 audited
  redirects plus the known big-five-vs-disc-vs-belbin case (currently
  dormant since that slug still resolves to a live article).
- GET /blog/<slug> answers 308 to /blog/<slug_new> for a dead slug that
  has a redirect; single hop only, target must be a live post, so chains
  and A->B->A cycles cannot loop or 404-land. Degrades to 404 if the
  table is not migrated yet, so the code deploy is safe before the SQL.
- BlogArticlePage rewrites internal links to the active locale only when
  the target has content in that locale (else English fallback), using
  the new languages[] field added to the /blog list endpoint and carried
  in window.__BLOG_ARTICLES__. Replaces the previous blind prefix that
  sent every link to /<lang>/blog/... regardless.

Layer B - permanent guards:
- api/tests/test_internal_links_integrity.py: scans prerendered dist and
  fails if any internal /blog link resolves to neither a live article
  nor a redirect (verified non-vacuous: catches all 6 without the seed).
- api/jobs/external_links_check.py + cron + DDL 07: weekly external-link
  probe into cercol_seo.external_links_status, emails a digest of links
  newly broken since last week (LINKS_ALERT_EMAIL).

All markdown link parsing lives in one place (api/blog_links.py), shared
by the audit, the cron job, and the CI guard.

Migration applied manually on the server post-deploy (no runner in this
repo); see docs/ops/runbook.md.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…chee

docs/seo/links-audit-*.md is generated by scripts/audit_blog_links.py and
its purpose is to list broken links, so link-checking it always fails by
design and its machine-generated tables trip MD012. Exclude the report
family from both doc linters.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
miquelmatoses merged commit 8491693 into main on May 28, 2026
7 checks passed
miquelmatoses deleted the feat/blog-link-integrity branch on May 28, 2026, 21:40