SEO AI tab: llms.txt + AI crawler controls (functional spike) #49869
angelablake wants to merge 14 commits into
Adds two settings sections to the Jetpack > SEO AI tab, gated behind the existing rsm_jetpack_seo flag (+ surface visibility + seo-tools module) and wired end to end:

- llms.txt: a toggle that serves a generated /llms.txt (site identity + published pages/posts) via a new Llms_Txt front-end handler.
- AI crawler access: per-bot allow/block toggles that emit per-user-agent Disallow rules through a new Ai_Crawlers robots_txt filter.

Persistence reuses the existing /jetpack/v4/settings endpoint: two new seo-tools options (jetpack_seo_llms_txt_enabled, jetpack_seo_blocked_ai_crawlers) are registered in the plugin whitelist, and the AI store/useAiForm/get_ai_data bootstrap are extended to round-trip them like the SEO Enhancer toggle.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
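To make the "per-user-agent Disallow rules" idea concrete, here is a minimal sketch of how a list of blocked crawler slugs could turn into robots.txt text. It is written in Python for illustration only; the PR's actual Ai_Crawlers filter is PHP and is not shown in this conversation. The `CATALOG` mapping, its slugs, and both function names are assumptions made for the example.

```python
from typing import Iterable

# Hypothetical catalog: slug -> user-agent token. The real catalog lives in
# the PR's Ai_Crawlers class; these slugs are illustrative assumptions.
CATALOG = {
    "gptbot": "GPTBot",
    "bytespider": "Bytespider",
    "amazonbot": "Amazonbot",
}


def robots_directives(blocked_slugs: Iterable[str]) -> str:
    """Return one User-agent/Disallow block per known blocked crawler."""
    blocks = []
    for slug in blocked_slugs:
        token = CATALOG.get(slug)
        if token is None:
            continue  # skip unknown slugs rather than emit a bogus rule
        blocks.append(f"User-agent: {token}\nDisallow: /\n")
    return "\n".join(blocks)


def append_to_robots(existing: str, blocked_slugs: Iterable[str]) -> str:
    """Mimic a robots_txt-style filter: take the current output, append rules."""
    extra = robots_directives(blocked_slugs)
    if not extra:
        return existing  # nothing blocked, leave robots.txt untouched
    return existing.rstrip("\n") + "\n\n" + extra
```

Each blocked bot gets its own group, so a rule for one crawler never affects another crawler or the generic `User-agent: *` group.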
Code Coverage Summary
Coverage changed in 2 files. 2 files are newly checked for coverage.
…ler toggles

- Add AiCrawlersTest + LlmsTxtTest (35 package PHP tests green): robots directive generation, blocked-slug sanitization, llms.txt enable/identity and empty-state behavior.
- Show an Allowed/Blocked state label under each AI-crawler toggle so the on/off meaning is unambiguous (named vars to avoid the i18n minifier fold).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Replace the full-width stack with a two-column grid — AI SEO Enhancer + llms.txt on the left, the taller AI-crawler list on the right — so the settings don't stretch awkwardly across the page. Collapses to a single column on narrow screens. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The monolith plugin's changelogger vocabulary is major/enhancement/compat/bugfix/other; "added" is only valid for packages. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Left column is now AI crawler access (the broadest control: can AI bots reach the site at all); the right column is llms.txt then the AI SEO Enhancer, reading broad → specific left-to-right, top-to-bottom. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Revert to a single column with the Settings tab's 660px centered width. Cards stack broad → specific: AI crawler access, llms.txt, then the AI SEO Enhancer. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
It's the longest section and sits at the top; collapsing it by default keeps llms.txt and the Enhancer visible without scrolling. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Order is now llms.txt, AI SEO enhancer, then AI crawler access last (the longest, least-used section, still collapsed by default). Title "AI SEO Enhancer" → "AI SEO enhancer" to match the sentence-case of the other section titles. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The link sat under the switch; indent it 40px to align with the toggle's label/help text instead — same treatment as the Settings tab's sitemap link. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…uncrawlable

Split the single AI-crawler section into two: "Answer engines" (allowed by default, so the site stays citable in AI answers) and "Training crawlers" (blocked by default). Each crawler now carries a `type` in the catalog, and the default blocked list is every training crawler (an explicit empty array still means "allow all").

Gate the controls when the site can't be crawled: show an explanation instead of toggles when search-engine indexing is off (with a link to the Settings visibility section) or the site is on a *.wpcomstaging.com staging subdomain, and warn when a static robots.txt at the web root would override the directives.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Amazon's own docs state Amazonbot "may be used to train Amazon AI models", while Amzn-SearchBot powers Alexa search and does not train. Move Amazonbot to Training (blocked by default) and add Amzn-SearchBot as the Amazon answer-engine bot (allowed by default), so the designations shown to users are accurate. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
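The default rule described in these commits (an unset list blocks every training crawler, while an explicit empty array means "allow all") is easy to get wrong, so here is a small Python sketch of the distinction. It is an illustration under assumptions, not the PR's PHP: the catalog shape, the slugs, and the function names are hypothetical; only the `type` field and the two Amazon designations come from the commit messages.

```python
from typing import Dict, List, Optional

# Hypothetical catalog shape. Only the `type` field and the Amazonbot /
# Amzn-SearchBot designations are taken from the commit messages.
CATALOG: Dict[str, Dict[str, str]] = {
    "amazonbot": {"user_agent": "Amazonbot", "type": "training"},
    "amzn-searchbot": {"user_agent": "Amzn-SearchBot", "type": "answer"},
}


def default_blocked() -> List[str]:
    """Default blocked list: every crawler whose type is 'training'."""
    return [slug for slug, bot in CATALOG.items() if bot["type"] == "training"]


def effective_blocked(saved: Optional[List[str]]) -> List[str]:
    """Resolve the blocked list from a saved option value.

    None models "option never saved" and falls back to the default.
    A saved list, even an empty one, always wins.
    """
    if saved is None:
        return default_blocked()
    cleaned: List[str] = []
    for slug in saved:
        # Sanitize: keep known slugs only, drop duplicates, preserve order.
        if slug in CATALOG and slug not in cleaned:
            cleaned.append(slug)
    return cleaned
```

The key design point is that "unset" and "empty" are different states, so the fallback must test for absence of the option rather than for an empty list.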
Each crawler toggle now shows a link to the operator's own documentation for that bot (e.g. "Learn what GPTBot does"), opening in a new tab. Catalog gains a doc_url per bot and exposes the user-agent token to the UI for the link text. Bytespider has no official operator doc, so it shows no link. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…or docs

For a crawler with no operator documentation (e.g. Bytespider), show a muted caption in place of the doc link so the missing link doesn't read as a bug.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Add a third status card to the Overview top row (the auto-fit grid becomes 3-up) summarizing the AI tab's state: llms.txt generated, answer engines allowed, training privacy configured, and AI-enhanced SEO enabled — each as a factual status row, not a graded score. A "Manage AI" button links to the AI tab. Reads from the AI store so saves on the AI tab reflect here on navigation. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Correction on the crawler gating: the search-engine-visibility gating described in the PR body is implemented in the AI tab, but in spike testing it did not reliably trigger. Treat it as a finding to build + verify properly for the Phase 1 feature, not as working. Not investigated here (this is exploration). Tracked in JETPACK-1762.
Fixes # N/A — exploration spike. Tracked in Linear: JETPACK-1761 (llms.txt) and JETPACK-1762 (AI crawler management), under the "Explore: AI SEO Settings" project.
Proposed changes
A functional spike building out the Jetpack > SEO → AI tab plus an Overview summary card, to review how these settings behave and reuse the patterns for the Phase 1 features (llms.txt Generation, AI Crawler Management). Everything is gated behind the existing `rsm_jetpack_seo` flag (+ surface visibility + seo-tools module), mirroring `Schema_Builder`.

AI tab (single centered column, Settings-tab width):

- llms.txt: a toggle that serves a generated `/llms.txt` (site identity + published pages/posts), with a "View your llms.txt" link.
- AI crawler access: per-bot allow/block toggles that emit per-user-agent `Disallow` rules via an `Ai_Crawlers` `robots_txt` filter.
- Gating: the crawler controls are replaced by an explanation when search-engine indexing is off or the site is on a `*.wpcomstaging.com` staging subdomain, and a warning is shown when a static robots.txt would override the directives.

Overview: a new AI readiness card (third card in the top row) summarizing the four AI settings as factual status rows, with a "Manage AI" button to the AI tab. Reads from the AI store so AI-tab saves reflect on navigation.
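For reviewers unfamiliar with llms.txt, the following Python sketch shows what a generated file built from "site identity + published pages/posts" might look like. The layout (an H1 title, a blockquote summary, then H2 sections of Markdown links) follows the public llms.txt proposal; whether the PR's Llms_Txt handler emits exactly this shape is an assumption, and the function name and parameters are hypothetical.

```python
from typing import List, Tuple

Link = Tuple[str, str]  # (title, url)


def render_llms_txt(
    site_title: str, tagline: str, pages: List[Link], posts: List[Link]
) -> str:
    """Render a minimal llms.txt body: identity first, then link sections."""
    lines = [f"# {site_title}"]
    if tagline:
        lines += ["", f"> {tagline}"]
    for heading, links in (("Pages", pages), ("Posts", posts)):
        if not links:
            continue  # omit empty sections entirely
        lines += ["", f"## {heading}"]
        lines += [f"- [{title}]({url})" for title, url in links]
    return "\n".join(lines) + "\n"


example = render_llms_txt(
    "Example Site",
    "Just another site",
    [("About", "https://example.com/about/")],
    [],
)
```

With no published content the sketch still returns the identity header, which is one possible reading of the "empty-state behavior" the tests mention.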
Persistence reuses `/jetpack/v4/settings`: `jetpack_seo_llms_txt_enabled` (bool) + `jetpack_seo_blocked_ai_crawlers` (array) registered in the plugin whitelist; unset crawler list defaults to "training blocked, answer allowed".

Spike findings (for Phase 1)

- `robots_txt` filter is defeated by a static/host robots.txt (JN test sites, `*.wpcomstaging.com`, any physical robots.txt); detect + warn in production. (JETPACK-1762)

Tests

- `/llms.txt` renders.

Does this pull request change what data or activity we track or use?
No new Tracks events or data collection. Adds public front-end outputs (`/llms.txt`, AI-crawler robots.txt directives) generated from already-public content, controlled by site admins.

Testing instructions
On a site where WordPress serves robots.txt (local/Studio or a public custom-domain Atomic site, NOT a JN site or `*.wpcomstaging.com`, which intercept robots.txt), with the SEO surface enabled (`rsm_jetpack_seo` on, surface visible, SEO Tools active) and search engines allowed:

- With the llms.txt toggle on, `/llms.txt` lists site title + Pages/Posts. Toggle off → no longer served.
- For a blocked crawler, open `/robots.txt` and confirm its `User-agent` / `Disallow: /` block appears. Click a "Learn what … does" link (e.g. "Learn what GPTBot does") → opens the operator's docs in a new tab.

Note: a site that configured crawler settings on an earlier build keeps that saved list; `wp option delete jetpack_seo_blocked_ai_crawlers` resets it to the default.
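The robots.txt check in the steps above can also be done programmatically. This is a small sketch using Python's standard-library `urllib.robotparser`; the `is_blocked` helper and the sample robots.txt text are assumptions for illustration, not output captured from a test site.

```python
from urllib.robotparser import RobotFileParser


def is_blocked(robots_txt: str, user_agent: str, path: str = "/") -> bool:
    """Return True if robots_txt disallows user_agent from fetching path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, path)


# Assumed sample: a generic group plus one per-bot block like the PR emits.
SAMPLE = (
    "User-agent: *\n"
    "Disallow: /wp-admin/\n"
    "\n"
    "User-agent: GPTBot\n"
    "Disallow: /\n"
)

gptbot_blocked = is_blocked(SAMPLE, "GPTBot")
searchbot_blocked = is_blocked(SAMPLE, "Amzn-SearchBot")
```

To check a live site instead of a string, `RobotFileParser` also offers `set_url()` and `read()`, which fetch the file before `can_fetch()` is called.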