Add llms.txt and llms-full.txt generation for LLM-friendly docs #920
Add a build script that generates llms.txt (lightweight index) and llms-full.txt (full documentation content) from the Starlight doc sources. These files follow the llms.txt specification, making the documentation easily consumable by LLMs and indexable by services like Context7 with minimal token usage.

- llms.txt: structured index with title, description, and URL per page
- llms-full.txt: all doc content as clean markdown (MDX/HTML stripped)
- Runs automatically before each build via package.json scripts

https://claude.ai/code/session_01Jj2MZELm7URFgydFbwwA8m
Replace the custom build script with the purpose-built starlight-llms-txt plugin, which generates llms.txt, llms-full.txt, and llms-small.txt from the rendered Starlight documentation at build time. This makes the docs easily accessible for LLMs and indexable by services like Context7 with minimal token usage.

- Remove production guard so Starlight builds docs in all environments
- Add starlight-llms-txt plugin with RocketSim project name/description
- Remove custom generate-llms-txt.mjs script (replaced by plugin)
- Revert package.json build script and .gitignore changes

https://claude.ai/code/session_01Jj2MZELm7URFgydFbwwA8m
98a6fe7 to bb2d730
Pull request overview
This PR adds LLM-friendly documentation generation by integrating the starlight-llms-txt plugin and creating a custom post-processing integration. The implementation generates two text file variants (llms-full.txt and llms-small.txt) that follow the llms.txt specification, making documentation easily consumable by LLMs and indexable by services like Context7.
Changes:
- Integrated starlight-llms-txt plugin (v0.7.0) to generate base llms documentation files
- Created custom post-processing integration to clean and transform generated content (removes home/404 pages, converts JSX components to markdown links, handles directives)
- Fixed escaped markdown formatting in multiple documentation files
- Added missing status bar image file and corrected broken image reference
Reviewed changes
Copilot reviewed 8 out of 10 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| docs/astro.config.ts | Added starlight-llms-txt plugin configuration and custom post-process integration |
| docs/src/integrations/llms-txt-post-process.ts | New integration that post-processes generated llms files with content transformations and cleanup |
| docs/package.json | Added starlight-llms-txt dependency (v0.7.0) |
| docs/package-lock.json | Dependency lock updates for starlight-llms-txt and its transitive dependencies |
| docs/src/content/docs/docs/features/app-actions/user-defaults-editor.mdx | Fixed escaped markdown formatting in link text |
| docs/src/content/docs/docs/features/app-actions/network-speed-control-and-simulator-airplane-mode.mdx | Fixed escaped markdown formatting in link text |
| docs/src/content/docs/docs/features/app-actions/general-app-actions.mdx | Fixed escaped markdown formatting in multiple link texts |
| docs/src/content/docs/docs/features/capturing/statusbar-appearance.md | Fixed broken image reference |
| docs/src/content/docs/docs/features/capturing/statusbar-appearance/status_bar_override_9_41-1024x416.jpg | Added missing status bar screenshot image |
| docs/src/styles/starlight-custom.css | Increased search modal max-width from 40rem to 45rem |
Files not reviewed (1)
- docs/package-lock.json: Language not supported

    /^:::(tip|note|caution|danger)\n([\s\S]*?)^:::/gm,
    (_match, type: string, content: string) => {
      const label = type.charAt(0).toUpperCase() + type.slice(1);
      const quoted = content

The regex pattern anchors both delimiters with ^ under the multiline flag m, so the opening and closing ::: must each start a line, and the non-greedy [\s\S]*? in the middle stops at the first closing delimiter that does. If the closing ::: is not at the start of a line (for example, it is indented or follows other content on the same line), the pattern won't match correctly. Consider whether the closing delimiter should strictly be at the start of a line, or whether the pattern should be adjusted to tolerate indentation or other content.
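One way to relax the anchoring discussed in this comment is sketched below. It is not the code from the PR: `convertAsides` and `ASIDE_PATTERN` are hypothetical names, and the sketch assumes that leading indentation and trailing spaces around the delimiters should be tolerated. It also requires the closing `:::` to stand alone on its line, so a following `:::note` opener is not mistaken for a closer.

```typescript
// Hypothetical helper (not from the PR): converts :::tip / :::note / :::caution /
// :::danger blocks into Markdown blockquotes.
// Assumption: delimiters may be indented and may carry trailing spaces or tabs.
const ASIDE_PATTERN =
  /^[ \t]*:::(tip|note|caution|danger)[ \t]*\n([\s\S]*?)^[ \t]*:::[ \t]*$/gm;

function convertAsides(markdown: string): string {
  return markdown.replace(
    ASIDE_PATTERN,
    (_match: string, type: string, content: string): string => {
      // "tip" -> "Tip"
      const label = type.charAt(0).toUpperCase() + type.slice(1);
      // The first content line carries the bold label; the rest are plain quotes.
      const [firstLine = "", ...restLines] = content.trimEnd().split("\n");
      const quoted = [`> **${label}:** ${firstLine}`];
      for (const line of restLines) {
        quoted.push(`> ${line}`);
      }
      return quoted.join("\n");
    },
  );
}

// Usage: an aside whose delimiters are indented and have trailing spaces.
console.log(convertAsides("  :::tip\nUse it.\n  :::  \nAfter"));
```

Whether indentation should be accepted at all depends on how the plugin emits asides, so treat the looser pattern as an option to evaluate rather than a required fix.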

      border-radius: 8px;
      background: var(--sl-color-gray-6);
    - max-width: 40rem;
    + max-width: 45rem;

This CSS change increases the max-width from 40rem to 45rem, but it's not mentioned in the PR description and doesn't seem related to the llms.txt generation feature. If this is an intentional improvement to the search modal width, it should be mentioned in the PR description. If it's an accidental change, consider reverting it or creating a separate PR for this UI adjustment.

    // Filter out unwanted pages
    const filteredPages = pages.filter((page) => {
      const titleMatch = page.match(/^# (.+)/);
      if (!titleMatch) return true;

The split regex uses a positive lookahead (?=^# ) with the multiline flag m, which splits on positions just before lines starting with "# ". However, this could potentially create empty strings in the array if there are consecutive matches. While the filter on line 58 checks for titleMatch, if a page section doesn't have a title (titleMatch is null), it's still kept (returns true on line 60). Consider whether pages without titles should be kept or filtered out, and add explicit handling for edge cases like empty sections or malformed content.
Suggested change:

    - // Filter out unwanted pages
    - const filteredPages = pages.filter((page) => {
    -   const titleMatch = page.match(/^# (.+)/);
    -   if (!titleMatch) return true;
    + // Filter out unwanted pages and empty/malformed sections
    + const filteredPages = pages.filter((page) => {
    +   const trimmed = page.trim();
    +   // Drop empty or whitespace-only sections (can be created by the split regex)
    +   if (!trimmed) return false;
    +   const titleMatch = page.match(/^# (.+)/);
    +   if (!titleMatch) {
    +     // Keep non-empty sections without an explicit "# " title
    +     return true;
    +   }

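For context, the split-then-filter flow that this suggestion touches can be sketched end to end as follows. This is an illustration rather than the integration's actual code: `filterPages` and `EXCLUDED_TITLES` are hypothetical names, and the titles "Home" and "404" are assumed placeholders for the pages the integration removes.

```typescript
// Assumption: placeholder titles for the home and 404 pages; the real
// integration may identify them differently.
const EXCLUDED_TITLES = new Set<string>(["Home", "404"]);

// Hypothetical helper (not from the PR): splits the combined document into
// per-page sections and drops empty sections and excluded pages.
function filterPages(fullText: string): string[] {
  // The lookahead splits just before every line that starts with "# ",
  // so each section keeps its own heading.
  const pages = fullText.split(/(?=^# )/m);

  return pages.filter((page) => {
    const trimmed = page.trim();
    // Drop empty or whitespace-only sections.
    if (!trimmed) return false;

    const titleMatch = trimmed.match(/^# (.+)/);
    // Keep non-empty sections that have no top-level title.
    if (!titleMatch) return true;

    const title = titleMatch[1]?.trim() ?? "";
    return !EXCLUDED_TITLES.has(title);
  });
}

// Usage: the untitled intro and the "Guide" page survive.
const sample = "Intro text\n# Home\nWelcome\n# Guide\nText\n# 404\nNot found\n";
console.log(filterPages(sample));
```

One caveat of splitting on `^# ` is that a line starting with "# " inside a code sample (for example a shell comment) also starts a new section, which is one reason untitled or oddly titled sections can appear.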

    const quoted = content
      .trimEnd()
      .split("\n")
      .map((line) => `> ${line}`)
      .join("\n");
    return `> **${label}:** ${quoted.slice(2)}`;

The blockquote generation logic on line 114 uses quoted.slice(2) to remove the first "> " from the quoted string. However, this assumes that the quoted string always starts with "> ", which is true because line 112 maps each line to add the "> " prefix. But the logic is a bit confusing because it adds "> " to all lines including the first one, then slices off the first 2 characters. Consider refactoring to make the intent clearer, such as handling the first line separately or using a more explicit approach like: > **${label}:** concatenated with the content lines properly quoted.
Suggested change:

    - const quoted = content
    -   .trimEnd()
    -   .split("\n")
    -   .map((line) => `> ${line}`)
    -   .join("\n");
    - return `> **${label}:** ${quoted.slice(2)}`;
    + const lines = content.trimEnd().split("\n");
    + const [firstLine, ...restLines] = lines;
    + const quotedLines: string[] = [];
    + if (firstLine !== undefined) {
    +   quotedLines.push(`> **${label}:** ${firstLine}`);
    + }
    + for (const line of restLines) {
    +   quotedLines.push(`> ${line}`);
    + }
    + return quotedLines.join("\n");


    plugins: [
      starlightLlmsTxt({
        projectName: "RocketSim",
        description:
          "RocketSim enhances the iOS Simulator with features for capturing screenshots and recordings, comparing designs, testing push notifications, deep links, location simulation, network speed control, accessibility toggles, and more.",
        rawContent: true,
      }),
    ],

The PR description mentions generating "llms.txt (lightweight index)" but the actual implementation generates "llms-small.txt" and "llms-full.txt" files. Consider updating either the PR description to match the implementation, or ensure that the starlight-llms-txt plugin actually generates a file named "llms.txt" as mentioned in the description. The discrepancy could confuse users trying to understand what files are being generated.

    result = result.replace(/<Tweet[\s\S]*?\/>/g, (match) => {
      const idMatch = match.match(/id="([^"]+)"/);
      const id = idMatch?.[1];
      return id ? `https://x.com/x/status/${id}` : match;

The generated X/Twitter link URL appears to be incorrect. The pattern "https://x.com/x/status/${id}" includes an extra "/x/" in the path. X/Twitter status URLs should follow the format "https://x.com/[username]/status/[id]" or simply "https://x.com/i/web/status/[id]". The current implementation will generate invalid URLs like "https://x.com/x/status/123456" which won't work. Consider using a format like "https://x.com/i/web/status/${id}" instead, or extract the username from the Tweet component if available.
Suggested change:

    - return id ? `https://x.com/x/status/${id}` : match;
    + return id ? `https://x.com/i/web/status/${id}` : match;

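Wrapped into a standalone function, the suggested replacement could look like the sketch below. `replaceTweetEmbeds` is a hypothetical name, and the sketch assumes the `<Tweet />` component is self-closing and carries its ID in a double-quoted `id` attribute, as in the reviewed snippet.

```typescript
// Hypothetical helper (not from the PR): replaces self-closing <Tweet ... />
// components with a plain status URL.
function replaceTweetEmbeds(markdown: string): string {
  return markdown.replace(/<Tweet[\s\S]*?\/>/g, (match: string): string => {
    // Assumption: the tweet ID is given as id="..." with double quotes.
    const id = match.match(/id="([^"]+)"/)?.[1];
    // Leave the component untouched when no ID can be extracted.
    return id ? `https://x.com/i/web/status/${id}` : match;
  });
}

// Usage
console.log(replaceTweetEmbeds('See <Tweet id="123" /> for details.'));
```

Returning the original match when no `id` is found keeps malformed embeds visible in the output instead of silently dropping them.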