SEO AI tab: llms.txt + AI crawler controls (functional spike) by angelablake · Pull Request #49869 · Automattic/jetpack

angelablake · 2026-06-23T21:40:09Z

Fixes # N/A — exploration spike. Tracked in Linear: JETPACK-1761 (llms.txt) and JETPACK-1762 (AI crawler management), under the "Explore: AI SEO Settings" project.

Proposed changes

A functional spike building out the Jetpack > SEO → AI tab plus an Overview summary card, to review how these settings behave and reuse the patterns for the Phase 1 features (llms.txt Generation, AI Crawler Management). Everything is gated behind the existing rsm_jetpack_seo flag (+ surface visibility + seo-tools module), mirroring Schema_Builder.

AI tab (single centered column, Settings-tab width):

- llms.txt: a toggle that serves a generated root-level /llms.txt (site identity + published pages/posts), with a "View your llms.txt" link.
- AI crawler access: two sections with informed defaults:
  - Answer engines (OAI-SearchBot, Claude-SearchBot, PerplexityBot, Amzn-SearchBot) are allowed by default, so the site stays citable in AI answers.
  - Training crawlers (GPTBot, ClaudeBot, Google-Extended, Applebot-Extended, Meta-ExternalAgent, Bytespider, CCBot, Amazonbot) are blocked by default.
  - Each crawler shows an Allowed/Blocked state and a "Learn what [crawler] does" link (e.g. "Learn what GPTBot does") to the operator's own docs. Bytespider has no official doc, so it shows "No documentation available" instead. Classifications were verified against each operator's crawler docs (e.g. Amazon's docs put Amazonbot in training and Amzn-SearchBot in answers).
  - Blocking emits per-user-agent Disallow rules via an Ai_Crawlers robots_txt filter.
- AI SEO enhancer: the existing plan-gated toggle.
- Crawler gating: when the site can't be crawled, the toggles are replaced by an explanation, either search-engine indexing off (with a link to Settings → visibility) or a *.wpcomstaging.com staging subdomain. A warning is shown when a static robots.txt would override the directives.
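
To make the "per-user-agent Disallow rules" concrete, here is a minimal TypeScript sketch of the directive generation. It is an illustration only: the actual Ai_Crawlers class is PHP and is not shown in this description. The function name `buildRobotsDirectives` is hypothetical, and keying the blocked list by user-agent token (rather than by whatever slug the option really stores) is an assumption.

```typescript
// Illustrative only: the PR's Ai_Crawlers filter is PHP and is not reproduced here.
// User-agent tokens are taken from the PR description; keying the blocked
// list by token instead of a stored slug is an assumption.
const KNOWN_CRAWLERS: string[] = [
  "OAI-SearchBot", "Claude-SearchBot", "PerplexityBot", "Amzn-SearchBot",
  "GPTBot", "ClaudeBot", "Google-Extended", "Applebot-Extended",
  "Meta-ExternalAgent", "Bytespider", "CCBot", "Amazonbot",
];

// Returns one "User-agent / Disallow: /" group per blocked crawler,
// in catalog order. Unknown tokens are ignored.
function buildRobotsDirectives(blocked: string[]): string {
  return KNOWN_CRAWLERS
    .filter((ua) => blocked.indexOf(ua) !== -1)
    .map((ua) => `User-agent: ${ua}\nDisallow: /\n`)
    .join("\n");
}

console.log(buildRobotsDirectives(["GPTBot", "CCBot"]));
```

In WordPress the resulting text would be appended to the output passed through the robots_txt filter; that wiring is omitted here.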

Overview: a new AI readiness card (third card in the top row) summarizing the four AI settings as factual status rows, with a "Manage AI" button to the AI tab. Reads from the AI store so AI-tab saves reflect on navigation.
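
As a rough sketch of how the four status rows could be derived from the AI store state: the `AiState` shape, the `readinessRows` name, and the exact rule for each row (for example, treating "answer engines allowed" as "no answer engine is blocked") are assumptions for illustration, not the card's actual logic.

```typescript
// Hypothetical store shape; the real AI store slices are not shown in this PR description.
interface AiState {
  llmsTxtEnabled: boolean;
  blockedCrawlers: string[]; // user-agent tokens (assumption)
  enhancerEnabled: boolean;
}

interface StatusRow {
  label: string;
  ok: boolean;
}

const ANSWER_ENGINES = ["OAI-SearchBot", "Claude-SearchBot", "PerplexityBot", "Amzn-SearchBot"];
const TRAINING_CRAWLERS = [
  "GPTBot", "ClaudeBot", "Google-Extended", "Applebot-Extended",
  "Meta-ExternalAgent", "Bytespider", "CCBot", "Amazonbot",
];

// Builds factual status rows (not a graded score) from the current state.
function readinessRows(state: AiState): StatusRow[] {
  const isBlocked = (ua: string): boolean => state.blockedCrawlers.indexOf(ua) !== -1;
  return [
    { label: "llms.txt generated", ok: state.llmsTxtEnabled },
    // Assumption: the row is positive only when no answer engine is blocked.
    { label: "Answer engines allowed", ok: ANSWER_ENGINES.every((ua) => !isBlocked(ua)) },
    // Assumption: the row is positive only when every training crawler is blocked.
    { label: "Training privacy configured", ok: TRAINING_CRAWLERS.every(isBlocked) },
    { label: "AI-enhanced SEO enabled", ok: state.enhancerEnabled },
  ];
}
```

Because the rows are a pure function of the store state, a save on the AI tab would be reflected on the Overview as soon as the store updates.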

Persistence reuses /jetpack/v4/settings: jetpack_seo_llms_txt_enabled (bool) + jetpack_seo_blocked_ai_crawlers (array) registered in the plugin whitelist; unset crawler list defaults to "training blocked, answer allowed".
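
The default rule can be sketched as follows: an unset option resolves to "every training crawler blocked", while an explicitly saved list (including an empty one, which means "allow all") is used as saved after sanitization. This is a TypeScript illustration under assumptions: `resolveBlockedCrawlers` is a hypothetical name, and the list is keyed by user-agent token rather than the slugs the PHP code sanitizes.

```typescript
type CrawlerType = "answer" | "training";

// Classification from the PR description; keying by user-agent token is an assumption.
const CRAWLER_TYPES: Record<string, CrawlerType> = {
  "OAI-SearchBot": "answer",
  "Claude-SearchBot": "answer",
  "PerplexityBot": "answer",
  "Amzn-SearchBot": "answer",
  "GPTBot": "training",
  "ClaudeBot": "training",
  "Google-Extended": "training",
  "Applebot-Extended": "training",
  "Meta-ExternalAgent": "training",
  "Bytespider": "training",
  "CCBot": "training",
  "Amazonbot": "training",
};

function resolveBlockedCrawlers(saved: unknown): string[] {
  // Unset (or malformed) option: fall back to "training blocked, answer allowed".
  if (!Array.isArray(saved)) {
    return Object.keys(CRAWLER_TYPES).filter((ua) => CRAWLER_TYPES[ua] === "training");
  }
  // Saved list: keep only known crawlers, without duplicates.
  const result: string[] = [];
  for (const value of saved) {
    const known =
      typeof value === "string" &&
      Object.prototype.hasOwnProperty.call(CRAWLER_TYPES, value);
    if (known && result.indexOf(value) === -1) {
      result.push(value);
    }
  }
  return result;
}
```

Note that `resolveBlockedCrawlers([])` returns an empty list rather than the default, which is what makes "allow all" expressible.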

Spike findings (for Phase 1)

- robots_txt filter is defeated by a static/host robots.txt (JN test sites, *.wpcomstaging.com, any physical robots.txt): detect + warn in production. (JETPACK-1762)
- Default-blocking is "unconfigured-only": once a site saves crawler settings, the default no longer applies, so a training bot added later silently isn't blocked on already-configured sites. Decide whether to store a literal list (+ migrate) or store intent. (JETPACK-1762)
- Readiness card still needs visibility-gating: it reads crawler state even when indexing is off, where those controls don't apply. (follow-up, JETPACK-1762)
- llms.txt is hygiene, not a ranking lever (independent studies show near-zero fetches; Google says it won't use it). Competitor scan recorded in JETPACK-1759.
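
The "literal list vs. intent" decision can be illustrated with a small TypeScript sketch. Everything here is hypothetical: the `CrawlerIntent` shape, both function names, and the `NewTrainingBot` token are made up to show the drift, not taken from the implementation.

```typescript
type CrawlerType = "answer" | "training";
type Catalog = Record<string, CrawlerType>;

const has = (obj: object, key: string): boolean =>
  Object.prototype.hasOwnProperty.call(obj, key);

// Option A: store the literal list of blocked crawlers.
function blockedFromLiteral(savedList: string[], catalog: Catalog): string[] {
  return savedList.filter((ua) => has(catalog, ua));
}

// Option B: store intent (which types to block) plus per-crawler overrides.
interface CrawlerIntent {
  blockTypes: CrawlerType[];
  overrides: Record<string, boolean>; // true = blocked, false = allowed
}

function blockedFromIntent(intent: CrawlerIntent, catalog: Catalog): string[] {
  return Object.keys(catalog).filter((ua) =>
    has(intent.overrides, ua)
      ? intent.overrides[ua]
      : intent.blockTypes.indexOf(catalog[ua]) !== -1
  );
}

// A later catalog version gains a (hypothetical) training bot.
const catalogV2: Catalog = {
  GPTBot: "training",
  PerplexityBot: "answer",
  NewTrainingBot: "training",
};

// Literal list saved under the old catalog: the new bot is silently not blocked.
console.log(blockedFromLiteral(["GPTBot"], catalogV2));
// Stored intent: the new bot is picked up automatically.
console.log(blockedFromIntent({ blockTypes: ["training"], overrides: {} }, catalogV2));
```

Storing intent avoids a migration each time the catalog grows, at the cost of a slightly more complex option shape; the literal list keeps the option simple but needs a migration step per catalog change.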

Tests

- PHP (40 package tests): robots directive generation, blocked-slug sanitization/defaults, crawler detection helpers (search-engine visibility, staging subdomain, static robots.txt), llms.txt generation + enable state.
- JS (70 tests): AI store slices seed/update independently.
- Manual: verified end to end on a public custom-domain Atomic site. Blocking crawlers produced the expected robots.txt directives; /llms.txt renders.
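
The crawler detection helpers named above could be modelled roughly like this. It is a TypeScript sketch under assumptions: the `SiteContext` shape, the function names, and the precedence between the two gating reasons are hypothetical, and the PHP helpers covered by the package tests may differ.

```typescript
type GateReason = "indexing-off" | "staging-subdomain" | null;

// Hypothetical inputs; in the plugin these would come from site options and the filesystem.
interface SiteContext {
  searchEnginesAllowed: boolean;
  host: string;
  hasStaticRobotsTxt: boolean;
}

// True for subdomains of wpcomstaging.com, not for the bare domain.
function isStagingHost(host: string): boolean {
  const suffix = ".wpcomstaging.com";
  const h = host.toLowerCase();
  return h.length > suffix.length && h.slice(-suffix.length) === suffix;
}

// Decides whether to replace the toggles with an explanation, and whether
// to warn that a static robots.txt would override the generated directives.
function crawlerGate(ctx: SiteContext): { reason: GateReason; warnStaticRobots: boolean } {
  if (!ctx.searchEnginesAllowed) {
    return { reason: "indexing-off", warnStaticRobots: false };
  }
  if (isStagingHost(ctx.host)) {
    return { reason: "staging-subdomain", warnStaticRobots: false };
  }
  return { reason: null, warnStaticRobots: ctx.hasStaticRobotsTxt };
}
```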

Does this pull request change what data or activity we track or use?

No new Tracks events or data collection. Adds public front-end outputs (/llms.txt, AI-crawler robots.txt directives) generated from already-public content, controlled by site admins.

Testing instructions

On a site where WordPress serves robots.txt (local/Studio or a public custom-domain Atomic site — NOT a JN site or *.wpcomstaging.com, which intercept robots.txt), with the SEO surface enabled (rsm_jetpack_seo on, surface visible, SEO Tools active) and search engines allowed:

- AI tab → llms.txt: toggle on, click "View your llms.txt", confirm /llms.txt lists site title + Pages/Posts. Toggle off → no longer served.
- AI tab → crawlers: confirm two sections (answer engines default-allowed, training default-blocked). Block a crawler, visit /robots.txt, and confirm its User-agent / Disallow: / block appears. Click a "Learn what [crawler] does" link → opens the operator's docs in a new tab.
- Gating: turn off "Allow search engines to index this site" (Settings → visibility) → the crawler section shows the explanation + link instead of toggles.
- Overview: confirm the AI readiness card appears as the third top-row card, its four rows reflect the AI-tab state, and "Manage AI" opens the AI tab.
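
For the llms.txt check, it may help to know roughly what output to expect. The sketch below renders a file from a site title plus Pages/Posts using the layout of the public llms.txt proposal (H1 title, optional blockquote summary, H2 sections with link lists). The exact output of the Llms_Txt handler is not shown in this PR description, so the function name, input types, and section headings are assumptions.

```typescript
interface Entry {
  title: string;
  url: string;
}

interface SiteIdentity {
  title: string;
  description: string; // may be empty
}

// Renders a minimal llms.txt document: title, optional summary, then link sections.
function renderLlmsTxt(site: SiteIdentity, pages: Entry[], posts: Entry[]): string {
  const lines: string[] = [`# ${site.title}`];
  if (site.description) {
    lines.push("", `> ${site.description}`);
  }
  const addSection = (heading: string, entries: Entry[]): void => {
    if (entries.length === 0) {
      return; // skip empty sections entirely
    }
    lines.push("", `## ${heading}`, "");
    for (const entry of entries) {
      lines.push(`- [${entry.title}](${entry.url})`);
    }
  };
  addSection("Pages", pages);
  addSection("Posts", posts);
  return lines.join("\n") + "\n";
}

console.log(
  renderLlmsTxt(
    { title: "Example Site", description: "A demo site." },
    [{ title: "About", url: "https://example.com/about/" }],
    [{ title: "Hello world", url: "https://example.com/hello-world/" }]
  )
);
```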

Note: a site that configured crawler settings on an earlier build keeps that saved list; wp option delete jetpack_seo_blocked_ai_crawlers resets it to the default.

Adds two settings sections to the Jetpack > SEO AI tab, gated behind the existing rsm_jetpack_seo flag (+ surface visibility + seo-tools module) and wired end to end:
- llms.txt: a toggle that serves a generated /llms.txt (site identity + published pages/posts) via a new Llms_Txt front-end handler.
- AI crawler access: per-bot allow/block toggles that emit per-user-agent Disallow rules through a new Ai_Crawlers robots_txt filter.

Persistence reuses the existing /jetpack/v4/settings endpoint: two new seo-tools options (jetpack_seo_llms_txt_enabled, jetpack_seo_blocked_ai_crawlers) are registered in the plugin whitelist, and the AI store/useAiForm/get_ai_data bootstrap are extended to round-trip them like the SEO Enhancer toggle.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

github-actions · 2026-06-23T21:41:16Z

Are you an Automattician? Please test your changes on all WordPress.com environments to help mitigate accidental explosions.

To test on WoA, go to the Plugins menu on a WoA dev site. Click on the "Upload" button and follow the upgrade flow to be able to upload, install, and activate the Jetpack Beta plugin. Once the plugin is active, go to Jetpack > Jetpack Beta, select your plugin (Jetpack), and enable the add/seo-ai-tab-llms-and-crawler-controls branch.
To test on Simple, run the following command on your sandbox:

bin/jetpack-downloader test jetpack add/seo-ai-tab-llms-and-crawler-controls

Interested in more tips and information?

In your local development environment, use the jetpack rsync command to sync your changes to a WoA dev blog.
Read more about our development workflow here: PCYsg-eg0-p2
Figure out when your changes will be shipped to customers here: PCYsg-eg5-p2

github-actions · 2026-06-23T21:43:07Z

Thank you for your PR!

When contributing to Jetpack, we have a few suggestions that can help us test and review your patch:

✅ Include a description of your PR changes.
✅ Add a "[Status]" label (In Progress, Needs Review, ...).
✅ Add testing instructions.
✅ Specify whether this PR includes any changes to data or privacy.
✅ Add changelog entries to affected projects

This comment will be updated as you work on your PR and make changes. If you think that some of those checks are not needed for your PR, please explain why you think so. Thanks for cooperation 🤖

Follow this PR Review Process:

Ensure all required checks appearing at the bottom of this PR are passing.
Make sure to test your changes on all platforms that it applies to. You're responsible for the quality of the code you ship.
You can use GitHub's Reviewers functionality to request a review.
When it's reviewed and merged, you will be pinged in Slack to deploy the changes to WordPress.com simple once the build is done.

If you have questions about anything, reach out in #jetpack-developers for guidance!

Jetpack plugin:

The Jetpack plugin has different release cadences depending on the platform:

WordPress.com Simple releases happen as soon as you deploy your changes after merging this PR (PCYsg-Jjm-p2).
WoA releases happen weekly.
Releases to self-hosted sites happen monthly:
- Scheduled release: July 7, 2026
- Code freeze: July 6, 2026

If you have any questions about the release process, please ask in the #jetpack-releases channel on Slack.

jp-launch-control · 2026-06-23T21:50:56Z

Code Coverage Summary

Coverage changed in 2 files.

File	Coverage	Δ%	Δ Uncovered
projects/plugins/jetpack/_inc/lib/class.core-rest-api-endpoints.php	1706/2636 (64.72%)	0.22%	1 ❤️‍🩹
projects/packages/seo/src/class-initializer.php	165/245 (67.35%)	3.22%	0 💚

2 files are newly checked for coverage.

File	Coverage
projects/packages/seo/src/class-llms-txt.php	35/63 (55.56%) 💚
projects/packages/seo/src/class-ai-crawlers.php	101/103 (98.06%) 💚

Full summary · PHP report · JS report

…ler toggles
- Add AiCrawlersTest + LlmsTxtTest (35 package PHP tests green): robots directive generation, blocked-slug sanitization, llms.txt enable/identity and empty-state behavior.
- Show an Allowed/Blocked state label under each AI-crawler toggle so the on/off meaning is unambiguous (named vars to avoid the i18n minifier fold).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Replace the full-width stack with a two-column grid — AI SEO Enhancer + llms.txt on the left, the taller AI-crawler list on the right — so the settings don't stretch awkwardly across the page. Collapses to a single column on narrow screens. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The monolith plugin's changelogger vocabulary is major/enhancement/compat/bugfix/other; "added" is only valid for packages. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Left column is now AI crawler access (the broadest control: can AI bots reach the site at all); the right column is llms.txt then the AI SEO Enhancer, reading broad → specific left-to-right, top-to-bottom. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Revert to a single column with the Settings tab's 660px centered width. Cards stack broad → specific: AI crawler access, llms.txt, then the AI SEO Enhancer. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

It's the longest section and sits at the top; collapsing it by default keeps llms.txt and the Enhancer visible without scrolling. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Order is now llms.txt, AI SEO enhancer, then AI crawler access last (the longest, least-used section, still collapsed by default). Title "AI SEO Enhancer" → "AI SEO enhancer" to match the sentence-case of the other section titles. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The link sat under the switch; indent it 40px to align with the toggle's label/help text instead — same treatment as the Settings tab's sitemap link. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…uncrawlable Split the single AI-crawler section into two: "Answer engines" (allowed by default, so the site stays citable in AI answers) and "Training crawlers" (blocked by default). Each crawler now carries a `type` in the catalog, and the default blocked list is every training crawler (an explicit empty array still means "allow all"). Gate the controls when the site can't be crawled: show an explanation instead of toggles when search-engine indexing is off (with a link to the Settings visibility section) or the site is on a *.wpcomstaging.com staging subdomain, and warn when a static robots.txt at the web root would override the directives. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Amazon's own docs state Amazonbot "may be used to train Amazon AI models", while Amzn-SearchBot powers Alexa search and does not train. Move Amazonbot to Training (blocked by default) and add Amzn-SearchBot as the Amazon answer-engine bot (allowed by default), so the designations shown to users are accurate. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Each crawler toggle now shows a link to the operator's own documentation for that bot (e.g. "Learn what GPTBot does"), opening in a new tab. Catalog gains a doc_url per bot and exposes the user-agent token to the UI for the link text. Bytespider has no official operator doc, so it shows no link. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…or docs For a crawler with no operator documentation (e.g. Bytespider), show a muted caption in place of the doc link so the missing link doesn't read as a bug. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Add a third status card to the Overview top row (the auto-fit grid becomes 3-up) summarizing the AI tab's state: llms.txt generated, answer engines allowed, training privacy configured, and AI-enhanced SEO enabled — each as a factual status row, not a graded score. A "Manage AI" button links to the AI tab. Reads from the AI store so saves on the AI tab reflect here on navigation. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

angelablake · 2026-06-24T06:51:18Z

Correction on the crawler gating: the search-engine-visibility gating described in the PR body is implemented in the AI tab, but in spike testing it did not reliably trigger. Treat it as a finding to build + verify properly for the Phase 1 feature, not as working. Not investigated here (this is exploration). Tracked in JETPACK-1762.

angelablake added the [Status] In Progress label Jun 23, 2026

angelablake self-assigned this Jun 23, 2026

github-actions Bot added the [Package] Seo, [Plugin] Jetpack, and [Tests] Includes Tests labels Jun 23, 2026

Angela Blake and others added 13 commits June 23, 2026 17:29

Fix jetpack plugin changelog type (enhancement, not added)

beb8cc5

The monolith plugin's changelogger vocabulary is major/enhancement/compat/bugfix/other; "added" is only valid for packages. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

SEO AI tab: collapse AI crawler access by default

3a469fe

It's the longest section and sits at the top; collapsing it by default keeps llms.txt and the Enhancer visible without scrolling. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

angelablake wants to merge 14 commits into trunk from add/seo-ai-tab-llms-and-crawler-controls
