feat(blog): link integrity - audit, redirects, multilingual rewrite, CI guards #38
Merged
…guards

Phase 17.10. Three layers so internal blog links never break silently.

Layer A - one-shot audit (scripts/audit_blog_links.py):
Reads every published article from the API, extracts links from each language body (markdown-aware, balanced parens so DOIs are not truncated), classifies internal vs external, validates internal targets against the live slug set, probes externals (broken vs flaky), and proposes redirects by slug/title similarity. Report: docs/seo/links-audit-20260528.md. It found 6 dead internal slugs (5 left doubled by an old rename script, 1 genuine rename) across 60 link instances, plus 100 external 404s and 0 multilingual mismatches.

Layer C - automatic resilience:
- db/migrations/016_blog_slug_redirects.sql: new blog_slug_redirects table (slug_old PK -> slug_new), idempotent, seeded with the 6 audited redirects plus the known big-five-vs-disc-vs-belbin case (currently dormant since that slug still resolves to a live article).
- GET /blog/<slug> answers 308 to /blog/<slug_new> for a dead slug that has a redirect; single hop only, target must be a live post, so chains and A->B->A cycles cannot loop or 404-land. Degrades to 404 if the table is not migrated yet, so the code deploy is safe before the SQL.
- BlogArticlePage rewrites internal links to the active locale only when the target has content in that locale (else English fallback), using the new languages[] field added to the /blog list endpoint and carried in window.__BLOG_ARTICLES__. Replaces the previous blind prefix that sent every link to /<lang>/blog/... regardless.

Layer B - permanent guards:
- api/tests/test_internal_links_integrity.py: scans prerendered dist and fails if any internal /blog link resolves to neither a live article nor a redirect (verified non-vacuous: catches all 6 without the seed).
- api/jobs/external_links_check.py + cron + DDL 07: weekly external-link probe into cercol_seo.external_links_status, emails a digest of links newly broken since last week (LINKS_ALERT_EMAIL).

All markdown link parsing lives in one place (api/blog_links.py), shared by the audit, the cron job, and the CI guard.

Migration applied manually on the server post-deploy (no runner in this repo); see docs/ops/runbook.md.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…chee

docs/seo/links-audit-*.md is generated by scripts/audit_blog_links.py and its purpose is to list broken links, so link-checking it always fails by design and its machine-generated tables trip MD012. Exclude the report family from both doc linters.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
This was referenced May 28, 2026
Summary
Phase 17.10. Three layers so internal blog links never break silently.
Layer A - one-shot audit (scripts/audit_blog_links.py)

- Reads every published article from the API, extracts links from each language body, classifies internal/external, validates internal targets against the live slug set, probes externals (broken vs flaky, no false positives for 403/429/5xx/timeout), and proposes redirects via slug/title similarity. Report at docs/seo/links-audit-20260528.md.
- Findings: 6 dead internal slugs across 60 link instances (5 left doubled by an old rename script, e.g. ...-more-honest-data-more-honest-data; 1 genuine rename), 100 external 404s (mostly stale DOIs), 0 multilingual mismatches.
- Markdown parsing handles balanced parens so DOIs like 10.1016/S0092-6566(03)00046-1 are not truncated into false 404s.

Layer C - automatic resilience
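A minimal sketch of the single-hop redirect rule this layer describes, for illustration only. The function name `resolve_blog_slug` and the in-memory `live_slugs` / `redirects` arguments are hypothetical stand-ins for the real handler and the blog_slug_redirects table; passing `redirects=None` models the table not being migrated yet.

```python
from __future__ import annotations


def resolve_blog_slug(
    slug: str,
    live_slugs: set[str],
    redirects: dict[str, str] | None,
) -> tuple[int, str | None]:
    """Return (status, location) for a hypothetical GET /blog/<slug>."""
    # A live article always wins, so a "dormant" redirect row never fires.
    if slug in live_slugs:
        return 200, None
    # Table not migrated yet: degrade to a plain 404.
    if redirects is None:
        return 404, None
    target = redirects.get(slug)
    # Single hop only: the target must itself be a live post. A chain
    # (A -> B -> C) or a cycle (A -> B -> A) fails this check and yields
    # 404 instead of looping or redirecting to a dead page.
    if target is not None and target in live_slugs:
        return 308, f"/blog/{target}"
    return 404, None
```

Because the lookup never follows a second hop, no visited-set or hop counter is needed to rule out loops.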
- db/migrations/016_blog_slug_redirects.sql: blog_slug_redirects (slug_old PK -> slug_new, created_at, reason), idempotent, seeded with the 6 audited redirects + the known big-five-vs-disc-vs-belbin case (currently dormant: that slug still resolves to a live article; the spec's "no longer exists" premise is outdated, documented in the row comment).
- GET /blog/<slug> returns 308 to /blog/<slug_new> for a dead slug with a redirect. Single hop; target must be a live post, so chains and A->B->A cycles never loop or land on a 404. Degrades to 404 if the table is unmigrated, so the code deploy is safe before the SQL is applied.
- BlogArticlePage rewrites internal links to the active locale only when the target has that locale (else English fallback), using a new languages[] field on the /blog list endpoint carried in window.__BLOG_ARTICLES__. Replaces the previous blind prefix that mislabeled every link as /<lang>/blog/....

Layer B - permanent guards
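A rough sketch of the kind of check the integrity guard in this layer performs, not the actual test. The function name `find_unresolved_links`, the `href` regex, and the assumptions that locale prefixes are two lowercase letters and that slugs contain only letters, digits, and hyphens are all hypothetical.

```python
import re
from pathlib import Path

# Assumed link shape: /blog/<slug> or /<xx>/blog/<slug>, optionally followed
# by a trailing slash, fragment, or query string.
BLOG_HREF = re.compile(
    r'href="(?:/[a-z]{2})?/blog/([A-Za-z0-9-]+)/?(?:[#?][^"]*)?"'
)


def find_unresolved_links(dist_dir, live_slugs, redirects):
    """Return (file name, slug) pairs that are neither live nor redirected."""
    unresolved = []
    for html in sorted(Path(dist_dir).rglob("*.html")):
        text = html.read_text(encoding="utf-8")
        for slug in BLOG_HREF.findall(text):
            if slug in live_slugs:
                continue  # resolves to a live article
            if redirects.get(slug) in live_slugs:
                continue  # dead slug, but a redirect lands on a live article
            unresolved.append((html.name, slug))
    return unresolved
```

A test built on this would assert that the returned list is empty. Running it once with an empty `redirects` mapping is one way to confirm the check is not vacuous.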
- api/tests/test_internal_links_integrity.py: scans prerendered dist, fails if any internal /blog link resolves to neither a live article nor a redirect. Verified non-vacuous (catches all 6 without the seed).
- api/jobs/external_links_check.py + cron + DDL 07: weekly probe into cercol_seo.external_links_status, emails a digest of links newly broken since last week (LINKS_ALERT_EMAIL).

All markdown link parsing lives in one place (api/blog_links.py), shared by the audit, the cron job, and the CI guard.

Test plan
- pytest api/ - 192 passed (incl. 6 redirect, 2 external-job, 1 integrity)
- build:full dist; proven to fail without the redirect seed
- vitest run - 211 passed (incl. 7 localizeBlogLinks)
- build:full + integrity guard against complete dist
- curl api.cercol.team/blog/big-five-vs-disc-vs-belbin (dormant -> 200; a truly dead slug -> 308)

Deploy notes
No migration runner in this repo: migration 016 is applied manually via sudo -u postgres psql cercol -f db/migrations/016_blog_slug_redirects.sql after the backend deploy (runbook updated). The new cron and BigQuery DDL 07 are installed per the runbook (not auto-installed).

Known follow-up (content debt, not in scope)
The 60 broken links are resolved by redirects but still live in the published article bodies. Fixing the source content (and the dead external DOIs) is tracked separately.
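For reference, the balanced-paren handling mentioned under Layer A could look roughly like the sketch below. It is an illustration, not the contents of api/blog_links.py: the function name `extract_markdown_links` is hypothetical, and the sketch ignores link titles, reference-style links, and the distinction between links and images.

```python
def extract_markdown_links(text):
    """Return (label, url) pairs for inline [label](url) links.

    Parentheses inside the URL are tracked by depth, so a DOI such as
    10.1016/S0092-6566(03)00046-1 is kept whole instead of being cut
    at its first ")".
    """
    links = []
    pos = 0
    while True:
        mid = text.find("](", pos)
        if mid == -1:
            return links
        label_start = text.rfind("[", pos, mid)
        depth = 1
        j = mid + 2
        while j < len(text) and depth > 0:
            ch = text[j]
            if ch == "(":
                depth += 1
            elif ch == ")":
                depth -= 1
            elif ch == "\n":
                break  # unterminated destination: give up on this candidate
            j += 1
        if depth == 0 and label_start != -1:
            # j now points just past the closing ")" of the link
            links.append((text[label_start + 1:mid], text[mid + 2:j - 1]))
            pos = j
        else:
            pos = mid + 2
```

A naive pattern such as `\[([^\]]*)\]\(([^)]*)\)` stops at the first `)` and would report the DOI above as `...S0092-6566(03`, which is how a valid link turns into a false 404.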
🤖 Generated with Claude Code