SEO AI tab: llms.txt + AI crawler controls (functional spike)#49869

Draft
angelablake wants to merge 14 commits into trunk from add/seo-ai-tab-llms-and-crawler-controls

Conversation

angelablake commented Jun 23, 2026

Contributor

Fixes # N/A — exploration spike. Tracked in Linear: JETPACK-1761 (llms.txt) and JETPACK-1762 (AI crawler management), under the "Explore: AI SEO Settings" project.

Proposed changes

A functional spike building out the Jetpack > SEO → AI tab plus an Overview summary card, to review how these settings behave and reuse the patterns for the Phase 1 features (llms.txt Generation, AI Crawler Management). Everything is gated behind the existing rsm_jetpack_seo flag (+ surface visibility + seo-tools module), mirroring Schema_Builder.

AI tab (single centered column, Settings-tab width):

  • llms.txt — toggle that serves a generated root-level /llms.txt (site identity + published pages/posts), with a "View your llms.txt" link.
  • AI crawler access — two sections with informed defaults:
    • Answer engines (OAI-SearchBot, Claude-SearchBot, PerplexityBot, Amzn-SearchBot) — allowed by default, so the site stays citable in AI answers.
    • Training crawlers (GPTBot, ClaudeBot, Google-Extended, Applebot-Extended, Meta-ExternalAgent, Bytespider, CCBot, Amazonbot) — blocked by default.
    • Each crawler shows an Allowed/Blocked state and a "Learn what [bot] does" link (e.g. "Learn what GPTBot does") to the operator's own docs (Bytespider has no official doc → shows "No documentation available"). Classifications verified against each operator's crawler docs (e.g. Amazon's docs put Amazonbot in training, Amzn-SearchBot in answers).
    • Blocking emits per-user-agent Disallow rules via an Ai_Crawlers robots_txt filter.
  • AI SEO enhancer — the existing plan-gated toggle.
  • Crawler gating: when the site can't be crawled, the toggles are replaced by an explanation — search-engine indexing off (with a link to Settings → visibility), or a *.wpcomstaging.com staging subdomain — and a warning is shown when a static robots.txt would override the directives.
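
As a rough illustration of the "per-user-agent Disallow rules" bullet above, the directive generation can be sketched as a pure function. The real logic lives in the PHP Ai_Crawlers class and is not shown in this PR description; the catalog shape, the `slug` values, and the function name below are assumptions made for the example. Only the user-agent tokens come from the description.

```javascript
// Illustrative catalog. User-agent tokens are taken from the PR description;
// the `slug` values and the object shape are assumptions, not the package's data.
const CRAWLER_CATALOG = [
  { slug: 'oai-searchbot', userAgent: 'OAI-SearchBot', type: 'answer' },
  { slug: 'perplexitybot', userAgent: 'PerplexityBot', type: 'answer' },
  { slug: 'gptbot', userAgent: 'GPTBot', type: 'training' },
  { slug: 'ccbot', userAgent: 'CCBot', type: 'training' },
];

// Hypothetical helper: returns the robots.txt text to append for the blocked
// crawlers, one "User-agent / Disallow: /" group per bot, in catalog order.
// Unknown slugs are ignored.
function buildAiCrawlerDirectives(blockedSlugs, catalog = CRAWLER_CATALOG) {
  const blocked = new Set(blockedSlugs);
  return catalog
    .filter((crawler) => blocked.has(crawler.slug))
    .map((crawler) => `User-agent: ${crawler.userAgent}\nDisallow: /\n`)
    .join('\n');
}

console.log(buildAiCrawlerDirectives(['gptbot', 'ccbot']));
```

Emitting one group per user agent keeps each bot's rule independent of any existing `User-agent: *` group in the site's robots.txt.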

Overview: a new AI readiness card (third card in the top row) summarizing the four AI settings as factual status rows, with a "Manage AI" button to the AI tab. Reads from the AI store so AI-tab saves reflect on navigation.
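
One way the card's four rows could be derived from the AI store state is sketched below. The row labels follow the commit message for the card; the state field names, the catalog shape, and the exact conditions (for instance, treating "training privacy configured" as "every training crawler is blocked") are assumptions for illustration, not the implemented rules.

```javascript
// Hypothetical derivation of the readiness rows. Field names
// (llmsTxtEnabled, blockedCrawlers, seoEnhancerEnabled) are assumed.
function getAiReadinessRows(state, catalog) {
  const blocked = new Set(state.blockedCrawlers);
  const slugsOf = (type) =>
    catalog.filter((crawler) => crawler.type === type).map((crawler) => crawler.slug);

  return [
    { label: 'llms.txt generated', ok: state.llmsTxtEnabled === true },
    // Assumption: "allowed" means no answer-engine crawler is blocked.
    { label: 'Answer engines allowed', ok: slugsOf('answer').every((slug) => !blocked.has(slug)) },
    // Assumption: "configured" means every training crawler is blocked.
    { label: 'Training privacy configured', ok: slugsOf('training').every((slug) => blocked.has(slug)) },
    { label: 'AI-enhanced SEO enabled', ok: state.seoEnhancerEnabled === true },
  ];
}

const exampleCatalog = [
  { slug: 'gptbot', type: 'training' },
  { slug: 'oai-searchbot', type: 'answer' },
];
console.log(
  getAiReadinessRows(
    { llmsTxtEnabled: true, blockedCrawlers: ['gptbot'], seoEnhancerEnabled: false },
    exampleCatalog
  )
);
```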

Persistence reuses /jetpack/v4/settings: jetpack_seo_llms_txt_enabled (bool) + jetpack_seo_blocked_ai_crawlers (array) registered in the plugin whitelist; unset crawler list defaults to "training blocked, answer allowed".
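
The default rule can be sketched as follows: an unset option falls back to "every training crawler blocked", while an explicitly saved list, including an empty one, is taken literally. The function name, the catalog shape, and the slugs are hypothetical; the PHP sanitization in the package may differ in detail.

```javascript
// Hypothetical resolver for the jetpack_seo_blocked_ai_crawlers value.
// `saved` is whatever the option returned; anything that is not an array
// is treated as "never configured".
function resolveBlockedCrawlers(saved, catalog) {
  if (!Array.isArray(saved)) {
    // Unconfigured site: block training crawlers, allow answer engines.
    return catalog.filter((crawler) => crawler.type === 'training').map((crawler) => crawler.slug);
  }
  // Configured site: keep only known slugs, de-duplicated. An empty array
  // passes through unchanged and means "allow all".
  const known = new Set(catalog.map((crawler) => crawler.slug));
  return [...new Set(saved.filter((slug) => typeof slug === 'string' && known.has(slug)))];
}

const catalog = [
  { slug: 'gptbot', type: 'training' },
  { slug: 'ccbot', type: 'training' },
  { slug: 'oai-searchbot', type: 'answer' },
];
console.log(resolveBlockedCrawlers(undefined, catalog)); // defaults apply
console.log(resolveBlockedCrawlers([], catalog)); // explicit "allow all"
```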

Spike findings (for Phase 1)

  • robots_txt filter is defeated by a static/host robots.txt (JN test sites, *.wpcomstaging.com, any physical robots.txt) — detect + warn in production. (JETPACK-1762)
  • Default-blocking is "unconfigured-only": once a site saves crawler settings, the default no longer applies — so a training bot added later silently isn't blocked on already-configured sites. Decide whether to store a literal list (+ migrate) or store intent. (JETPACK-1762)
  • Readiness card still needs visibility-gating: it reads crawler state even when indexing is off, where those controls don't apply. (follow-up, JETPACK-1762)
  • llms.txt is hygiene, not a ranking lever (independent studies show near-zero fetches; Google says it won't use it). Competitor scan recorded in JETPACK-1759.
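
To make the "literal list vs. intent" finding concrete, here is a small sketch of the intent option: store only the admin's per-crawler overrides and derive the blocked list from the current catalog on every read. The storage shape, the function name, and the `futuretrainbot` entry are hypothetical; this is one possible design, not a decision recorded in this PR.

```javascript
// Catalog as it was when the admin saved, and after a later release adds a
// new training bot ("futuretrainbot" is made up for the example).
const catalogAtSave = [
  { slug: 'gptbot', type: 'training' },
  { slug: 'oai-searchbot', type: 'answer' },
];
const catalogLater = [...catalogAtSave, { slug: 'futuretrainbot', type: 'training' }];

// Literal storage: the saved list never learns about the new bot.
const savedLiteralList = ['gptbot'];

// Intent storage (assumed shape): only explicit per-crawler choices are saved.
const savedIntent = { overrides: { 'oai-searchbot': 'blocked' } };

function resolveBlockedFromIntent(intent, catalog) {
  const overrides = intent.overrides || {};
  return catalog
    .filter((crawler) => {
      if (overrides[crawler.slug] === 'blocked') return true;
      if (overrides[crawler.slug] === 'allowed') return false;
      return crawler.type === 'training'; // default policy for untouched bots
    })
    .map((crawler) => crawler.slug);
}

console.log(savedLiteralList.includes('futuretrainbot')); // literal list misses the new bot
console.log(resolveBlockedFromIntent(savedIntent, catalogLater)); // intent picks it up
```

The trade-off is that intent storage changes behaviour on already-configured sites whenever the catalog changes, which may or may not be what site admins expect.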

Tests

  • PHP (40 package tests): robots directive generation, blocked-slug sanitization/defaults, crawler detection helpers (search-engine visibility, staging subdomain, static robots.txt), llms.txt generation + enable state.
  • JS (70 tests): AI store slices seed/update independently.
  • Manual: verified end to end on a public custom-domain Atomic site — blocking crawlers produced the expected robots.txt directives; /llms.txt renders.

Does this pull request change what data or activity we track or use?

No new Tracks events or data collection. Adds public front-end outputs (/llms.txt, AI-crawler robots.txt directives) generated from already-public content, controlled by site admins.

Testing instructions

On a site where WordPress serves robots.txt (local/Studio or a public custom-domain Atomic site — NOT a JN site or *.wpcomstaging.com, which intercept robots.txt), with the SEO surface enabled (rsm_jetpack_seo on, surface visible, SEO Tools active) and search engines allowed:

  • AI tab → llms.txt: toggle on, click "View your llms.txt", confirm /llms.txt lists site title + Pages/Posts. Toggle off → no longer served.
  • AI tab → crawlers: confirm two sections (answer engines default-allowed, training default-blocked). Block a crawler, visit /robots.txt, and confirm a User-agent group for that crawler with Disallow: / appears. Click a "Learn what [bot] does" link (e.g. "Learn what GPTBot does") → opens the operator's docs in a new tab.
  • Gating: turn off "Allow search engines to index this site" (Settings → visibility) → the crawler section shows the explanation + link instead of toggles.
  • Overview: confirm the AI readiness card appears as the third top-row card, its four rows reflect the AI-tab state, and "Manage AI" opens the AI tab.
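
The intended gating behaviour exercised in the "Gating" step can be summarized as a small decision function. This is a sketch of the intent described in this PR, not of the shipped code; the function name, input shape, and reason strings are assumptions.

```javascript
// Hypothetical decision helper: should the crawler toggles be shown, and if
// not, why? Inputs are assumed to come from server-side detection helpers.
function getCrawlerGate({ searchEnginesAllowed, hostname, hasStaticRobotsTxt }) {
  if (!searchEnginesAllowed) {
    // Indexing is off site-wide, so per-crawler rules would be moot.
    return { showToggles: false, reason: 'indexing-off', warnStaticRobots: false };
  }
  if (String(hostname).toLowerCase().endsWith('.wpcomstaging.com')) {
    // Staging subdomains serve their own robots.txt.
    return { showToggles: false, reason: 'staging-subdomain', warnStaticRobots: false };
  }
  // Toggles are usable; warn if a physical robots.txt would override them.
  return { showToggles: true, reason: null, warnStaticRobots: Boolean(hasStaticRobotsTxt) };
}

console.log(getCrawlerGate({ searchEnginesAllowed: false, hostname: 'example.com', hasStaticRobotsTxt: false }));
console.log(getCrawlerGate({ searchEnginesAllowed: true, hostname: 'example.com', hasStaticRobotsTxt: true }));
```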

Note: a site that configured crawler settings on an earlier build keeps that saved list; wp option delete jetpack_seo_blocked_ai_crawlers resets it to the default.

Adds two settings sections to the Jetpack > SEO AI tab, gated behind the
existing rsm_jetpack_seo flag (+ surface visibility + seo-tools module) and
wired end to end:

- llms.txt: a toggle that serves a generated /llms.txt (site identity +
  published pages/posts) via a new Llms_Txt front-end handler.
- AI crawler access: per-bot allow/block toggles that emit per-user-agent
  Disallow rules through a new Ai_Crawlers robots_txt filter.
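
For the llms.txt bullet above, a minimal sketch of what generating the file body might look like. The Llms_Txt handler's actual output format is not shown in this PR; the sketch assumes the Markdown conventions of the llms.txt proposal (an H1 title, a blockquote summary, then H2 sections of links), and the function name and input shapes are hypothetical.

```javascript
// Hypothetical generator: `site` carries the identity, `pages` and `posts`
// are arrays of { title, url } for published content only.
function buildLlmsTxt(site, pages, posts) {
  const lines = [`# ${site.title}`];
  if (site.description) {
    lines.push('', `> ${site.description}`);
  }
  for (const [heading, items] of [['Pages', pages], ['Posts', posts]]) {
    if (items.length === 0) continue; // skip empty sections entirely
    lines.push('', `## ${heading}`, '');
    for (const item of items) {
      lines.push(`- [${item.title}](${item.url})`);
    }
  }
  return lines.join('\n') + '\n';
}

console.log(
  buildLlmsTxt(
    { title: 'Example Site', description: 'Just another example.' },
    [{ title: 'About', url: 'https://example.com/about/' }],
    []
  )
);
```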

Persistence reuses the existing /jetpack/v4/settings endpoint: two new
seo-tools options (jetpack_seo_llms_txt_enabled, jetpack_seo_blocked_ai_crawlers)
are registered in the plugin whitelist, and the AI store/useAiForm/get_ai_data
bootstrap are extended to round-trip them like the SEO Enhancer toggle.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

github-actions Bot commented Jun 23, 2026

Contributor

Are you an Automattician? Please test your changes on all WordPress.com environments to help mitigate accidental explosions.

  • To test on WoA, go to the Plugins menu on a WoA dev site. Click on the "Upload" button and follow the upgrade flow to be able to upload, install, and activate the Jetpack Beta plugin. Once the plugin is active, go to Jetpack > Jetpack Beta, select your plugin (Jetpack), and enable the add/seo-ai-tab-llms-and-crawler-controls branch.
  • To test on Simple, run the following command on your sandbox:
bin/jetpack-downloader test jetpack add/seo-ai-tab-llms-and-crawler-controls

Interested in more tips and information?

  • In your local development environment, use the jetpack rsync command to sync your changes to a WoA dev blog.
  • Read more about our development workflow here: PCYsg-eg0-p2
  • Figure out when your changes will be shipped to customers here: PCYsg-eg5-p2

github-actions Bot added the [Package] Seo, [Plugin] Jetpack, and [Tests] Includes Tests labels Jun 23, 2026

github-actions Bot commented Jun 23, 2026

Contributor

Thank you for your PR!

When contributing to Jetpack, we have a few suggestions that can help us test and review your patch:

  • ✅ Include a description of your PR changes.
  • ✅ Add a "[Status]" label (In Progress, Needs Review, ...).
  • ✅ Add testing instructions.
  • ✅ Specify whether this PR includes any changes to data or privacy.
  • ✅ Add changelog entries to affected projects

This comment will be updated as you work on your PR and make changes. If you think that some of those checks are not needed for your PR, please explain why you think so. Thanks for your cooperation 🤖


Follow this PR Review Process:

  1. Ensure all required checks appearing at the bottom of this PR are passing.
  2. Make sure to test your changes on all platforms that it applies to. You're responsible for the quality of the code you ship.
  3. You can use GitHub's Reviewers functionality to request a review.
  4. When it's reviewed and merged, you will be pinged in Slack to deploy the changes to WordPress.com simple once the build is done.

If you have questions about anything, reach out in #jetpack-developers for guidance!


Jetpack plugin:

The Jetpack plugin has different release cadences depending on the platform:

  • WordPress.com Simple releases happen as soon as you deploy your changes after merging this PR (PCYsg-Jjm-p2).
  • WoA releases happen weekly.
  • Releases to self-hosted sites happen monthly:
    • Scheduled release: July 7, 2026
    • Code freeze: July 6, 2026

If you have any questions about the release process, please ask in the #jetpack-releases channel on Slack.


jp-launch-control Bot commented Jun 23, 2026


Code Coverage Summary

Coverage changed in 2 files.

| File | Coverage | Δ% | Δ Uncovered |
| --- | --- | --- | --- |
| projects/plugins/jetpack/_inc/lib/class.core-rest-api-endpoints.php | 1706/2636 (64.72%) | 0.22% | 1 ❤️‍🩹 |
| projects/packages/seo/src/class-initializer.php | 165/245 (67.35%) | 3.22% | 0 💚 |

2 files are newly checked for coverage.

| File | Coverage |
| --- | --- |
| projects/packages/seo/src/class-llms-txt.php | 35/63 (55.56%) 💚 |
| projects/packages/seo/src/class-ai-crawlers.php | 101/103 (98.06%) 💚 |

Full summary · PHP report · JS report

Angela Blake and others added 13 commits June 23, 2026 17:29
…ler toggles

- Add AiCrawlersTest + LlmsTxtTest (35 package PHP tests green): robots
  directive generation, blocked-slug sanitization, llms.txt enable/identity
  and empty-state behavior.
- Show an Allowed/Blocked state label under each AI-crawler toggle so the
  on/off meaning is unambiguous (named vars to avoid the i18n minifier fold).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Replace the full-width stack with a two-column grid — AI SEO Enhancer +
llms.txt on the left, the taller AI-crawler list on the right — so the
settings don't stretch awkwardly across the page. Collapses to a single
column on narrow screens.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The monolith plugin's changelogger vocabulary is major/enhancement/compat/
bugfix/other; "added" is only valid for packages.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Left column is now AI crawler access (the broadest control: can AI bots
reach the site at all); the right column is llms.txt then the AI SEO
Enhancer, reading broad → specific left-to-right, top-to-bottom.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Revert to a single column with the Settings tab's 660px centered width.
Cards stack broad → specific: AI crawler access, llms.txt, then the AI SEO
Enhancer.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
It's the longest section and sits at the top; collapsing it by default
keeps llms.txt and the Enhancer visible without scrolling.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Order is now llms.txt, AI SEO enhancer, then AI crawler access last (the
longest, least-used section, still collapsed by default). Title "AI SEO
Enhancer" → "AI SEO enhancer" to match the sentence-case of the other
section titles.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The link sat under the switch; indent it 40px to align with the toggle's
label/help text instead — same treatment as the Settings tab's sitemap link.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…uncrawlable

Split the single AI-crawler section into two: "Answer engines" (allowed by
default, so the site stays citable in AI answers) and "Training crawlers"
(blocked by default). Each crawler now carries a `type` in the catalog, and the
default blocked list is every training crawler (an explicit empty array still
means "allow all").

Gate the controls when the site can't be crawled: show an explanation instead
of toggles when search-engine indexing is off (with a link to the Settings
visibility section) or the site is on a *.wpcomstaging.com staging subdomain,
and warn when a static robots.txt at the web root would override the directives.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Amazon's own docs state Amazonbot "may be used to train Amazon AI models",
while Amzn-SearchBot powers Alexa search and does not train. Move Amazonbot to
Training (blocked by default) and add Amzn-SearchBot as the Amazon answer-engine
bot (allowed by default), so the designations shown to users are accurate.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Each crawler toggle now shows a link to the operator's own documentation for
that bot (e.g. "Learn what GPTBot does"), opening in a new tab. Catalog gains a
doc_url per bot and exposes the user-agent token to the UI for the link text.
Bytespider has no official operator doc, so it shows no link.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…or docs

For a crawler with no operator documentation (e.g. Bytespider), show a muted
caption in place of the doc link so the missing link doesn't read as a bug.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Add a third status card to the Overview top row (the auto-fit grid becomes
3-up) summarizing the AI tab's state: llms.txt generated, answer engines
allowed, training privacy configured, and AI-enhanced SEO enabled — each as a
factual status row, not a graded score. A "Manage AI" button links to the AI
tab. Reads from the AI store so saves on the AI tab reflect here on navigation.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@angelablake

Contributor Author

Correction on the crawler gating: the search-engine-visibility gating described in the PR body is implemented in the AI tab, but in spike testing it did not reliably trigger. Treat it as a finding to build + verify properly for the Phase 1 feature, not as working. Not investigated here (this is exploration). Tracked in JETPACK-1762.


Labels

[Package] Seo, [Plugin] Jetpack, [Status] In Progress, [Tests] Includes Tests


1 participant