Unify CLI agent on a single pi-based runtime by youknowriad · Pull Request #3337 · Automattic/studio

youknowriad · 2026-05-04T20:31:55Z

Related issues

Related to #3328, "Add GPT 5.5 support without modifying the system prompt" (which landed GPT 5.5 via the pi runtime; this PR retires the parallel Claude Agent SDK runtime so both families share one path)

How AI was used in this PR

Built end-to-end with Claude Code. Each step was driven by an explicit prompt and validated against the live wpcom proxy with a real Sonnet 4.6 tool call before moving on. The unification was settled with a small spike test (apps/cli/ai/runtimes/pi/tests/anthropic-pi-spike.test.ts, opt-in via STUDIO_LIVE_SPIKE=1) before touching the runtime.

What to verify carefully:

- The wpcom proxy auth path (Bearer via injected Anthropic client). The spike covers it; manual `studio code` + `/login` is the most realistic check.
- `studio mcp` still serves Studio's tools to external clients (Claude Desktop / Cursor) after the rewrite to the low-level MCP server API.
- The Skill tool is registered (look for Skill in the tool list when running with a local site). The bundled-build static-copy path was a real bug discovered during this work: `dist/cli/skills/site-spec/SKILL.md` must exist after `npm run cli:build`.
- Reasoning is on for both families: Anthropic uses adaptive thinking with effort 'high'; GPT-5.5 sends reasoning_effort: 'high'.

Proposed Changes

- Single runtime, both families. `apps/cli/ai/runtimes/pi/` now serves Anthropic Messages and OpenAI Chat Completions via `@mariozechner/pi-ai`, dispatched off the model's family. The custom Anthropic runtime (`apps/cli/ai/runtimes/anthropic/`) is deleted; `pickRuntime`, the cross-family resume guard, and the SDK-cleanup `unhandledRejection` handler are gone.
- wpcom Bearer auth. pi-ai's default Anthropic path uses `x-api-key`, which the wpcom proxy rejects. We inject a pre-built Anthropic SDK client via a custom `streamFn` only on that path (Anthropic + `ANTHROPIC_AUTH_TOKEN`).
- Native pi tools. `apps/cli/ai/tools/define-tool.ts` returns `AgentTool` directly. Tool input schemas are typebox (no zod, no runtime adapter); the result contract is `{ content }` plus throw on failure. `pi-tool-adapter.ts` is deleted.
- `@anthropic-ai/claude-agent-sdk` is dropped from `apps/cli/package.json` and from the vendor-binary cleanup in `apps/studio/forge.config.ts`. `@anthropic-ai/sdk` is added (matches pi-ai's range, hoists once); it is required to construct the pre-built client for the wpcom Bearer-auth path.
- MCP server rewrite. `apps/cli/ai/mcp-server.ts` now uses `@modelcontextprotocol/sdk/server`'s low-level `Server` + `setRequestHandler(ListToolsRequestSchema/CallToolRequestSchema, ...)` so tools defined in typebox JSON Schema can be exposed without a zod conversion.
- Local SDK message types. `apps/cli/ai/runtimes/messages.ts` defines `SDKMessage` / `TodoWriteInput` / content-block types locally. `ui.ts`, `recorder.ts`, `replay.ts`, `eval-runner.ts`, `output-adapter.ts`, and `todo-stream.ts` all switch to the local imports.
- Skills directory restructure. `apps/cli/ai/plugin/skills/` → `apps/cli/ai/skills/`. The vite static-copy dest is `'.'` so the bundled CLI puts files at `dist/cli/skills/`, where the loader resolves them. A `console.warn` fires at startup if the directory is missing, since silent failure is what hid this bug initially.
- Reasoning enabled on both families. `Agent.state.thinkingLevel: 'high'` in `pi/index.ts`. For Anthropic this maps to adaptive thinking with `effort: 'high'` (matching the SDK's previous default); for GPT-5.5 it sends `reasoning_effort: 'high'` on Chat Completions. `model.reasoning: true` is set for both so pi-ai actually attaches the field.
- Cleaner JSON Schema for enum-shaped tool inputs. Switched 5 tools (`Skill`, `take_screenshot`, `share_screenshot`, `site_export`, `wpcom_request`) from `Type.Union([Type.Literal(...)])` to `Type.Enum([...])`. This emits `{ enum: [...] }` instead of `{ anyOf: [{ const: ... }] }`, which most LLMs handle more reliably.
- Comment cleanup. Stripped multi-paragraph narration comments across the runtime, tools, and tests; removed stale trunk #NNNN PR-number references.
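To make the tool contract and the enum-shaped schema above more concrete, here is a minimal sketch. It does not use the real `AgentTool` type or typebox: `SketchTool`, `ToolResult`, `screenshotSketch`, and the viewport values are hypothetical names invented for illustration, `execute` is synchronous only for brevity, and the `parameters` object is hand-written JSON Schema standing in for what a schema library would emit.

```typescript
// Hypothetical local types for illustration only; the real AgentTool type
// lives in the pi packages and may have a different shape.
type TextBlock = { type: 'text'; text: string };
type ToolResult = { content: TextBlock[] };

interface SketchTool {
  name: string;
  // Plain JSON Schema describing the tool input.
  parameters: Record<string, unknown>;
  execute(args: Record<string, unknown>): ToolResult;
}

// Assumed example values, not taken from the PR.
const VIEWPORTS = ['desktop', 'mobile'] as const;

const screenshotSketch: SketchTool = {
  name: 'take_screenshot_sketch',
  parameters: {
    type: 'object',
    // Enum-shaped input expressed as { enum: [...] }
    // rather than { anyOf: [{ const: ... }, ...] }.
    properties: { viewport: { enum: [...VIEWPORTS] } },
    required: ['viewport'],
  },
  execute(args) {
    const viewport = args.viewport;
    const allowed = VIEWPORTS as readonly string[];
    if (typeof viewport !== 'string' || allowed.indexOf(viewport) === -1) {
      // Failure is signalled by throwing, not by an error flag in the result.
      throw new Error(`Unsupported viewport: ${String(viewport)}`);
    }
    // Success returns only { content }.
    return { content: [{ type: 'text', text: `Captured ${viewport} screenshot` }] };
  },
};
```

With this shape the caller needs no adapter layer: a returned value is always a success, and any thrown error can be reported back to the model as a failed tool call.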

Testing Instructions

- `npm test` passes (1529 passing, 1 skipped; the skipped test is the live spike, opt-in via `STUDIO_LIVE_SPIKE=1`).
- `npm run typecheck` passes across all workspaces.
- `npm run cli:build` produces `dist/cli/skills/site-spec/SKILL.md` (verifies the bundled-build path fix).
- Run `studio code` against a local site and ask it to "build me a one-page site for X"; the agent should call the Skill tool with site-spec before creating the site.
- Run `studio code` with Sonnet 4.6 and with GPT-5.5 (`/model`), and confirm both work and produce thinking-style responses.
- Run `studio mcp` (or wire the launch command into Claude Desktop / Cursor): `tools/list` returns Studio's tools and a tool call round-trips.
- Optional: run `STUDIO_LIVE_SPIKE=1 npm test -- apps/cli/ai/runtimes/pi/tests/anthropic-pi-spike` while logged in via `studio auth login`; this round-trips a tool call against Sonnet 4.6 through the wpcom proxy in ~3-4s.

Pre-merge Checklist

- TypeScript / lint clean (`npm run typecheck`, `npx eslint`).
- Tests passing (`npm test`).
- Live spike passing against the wpcom proxy with Sonnet 4.6.
- Manual smoke against `studio code` with both families.
- Manual smoke against `studio mcp` from an external MCP client.

🤖 Generated with Claude Code

Drops the dual-runtime split (Claude Agent SDK for Anthropic, pi for OpenAI) and routes both families through one pi-agent-core runtime that dispatches to the right `@mariozechner/pi-ai` provider per model family. Anthropic's wpcom proxy needs Bearer auth, which pi-ai's default code path doesn't do for non-OAuth tokens — we inject a pre-built `Anthropic` SDK client through a custom `streamFn` only on that path. Tools are now native pi `AgentTool` definitions (typebox schemas, throw on error, no runtime adapter). The Claude Agent SDK dependency is removed; the `studio mcp` command is rewired to the low-level `@modelcontextprotocol/sdk/server` API. Skill files move from `ai/plugin/skills/` to `ai/skills/`. Reasoning is enabled on both families with `thinkingLevel: 'high'`, matching the SDK's previous adaptive default and giving GPT-5.5 a real `reasoning_effort` budget on Chat Completions. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
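The auth distinction described in this commit message can be sketched as follows. This is not pi-ai's or the Anthropic SDK's code: `AuthConfig`, `buildAuthHeaders`, and `needsCustomStream` are hypothetical helpers, written only to show the difference between the two header styles and the condition (Anthropic family plus `ANTHROPIC_AUTH_TOKEN`) under which the custom path applies.

```typescript
// Hypothetical helper types; not part of pi-ai or the Anthropic SDK.
type Family = 'anthropic' | 'openai';
type AuthConfig = { apiKey?: string; authToken?: string };

// Builds the auth header for a request: Bearer when a token is configured,
// otherwise x-api-key.
function buildAuthHeaders(config: AuthConfig): Record<string, string> {
  if (config.authToken) {
    return { Authorization: `Bearer ${config.authToken}` };
  }
  if (config.apiKey) {
    return { 'x-api-key': config.apiKey };
  }
  throw new Error('No credentials configured');
}

// Only the Anthropic family with ANTHROPIC_AUTH_TOKEN set takes the
// custom-stream path; every other combination keeps the default path.
function needsCustomStream(
  family: Family,
  env: Record<string, string | undefined>
): boolean {
  return family === 'anthropic' && Boolean(env['ANTHROPIC_AUTH_TOKEN']);
}
```

Keeping the condition this narrow means OpenAI models and API-key-based Anthropic setups are unaffected by the injected client.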

wpmobilebot · 2026-05-04T20:56:07Z

📊 Performance Test Results

Comparing 914a5d6 vs trunk

app-size

| Metric | trunk | `914a5d6` | Diff | Change |
| --- | --- | --- | --- | --- |
| App Size (Mac) | 1667.61 MB | 1454.02 MB | -213.59 MB | 🟢 -12.8% |

site-editor

| Metric | trunk | `914a5d6` | Diff | Change |
| --- | --- | --- | --- | --- |
| load | 1470 ms | 1530 ms | +60 ms | 🔴 +4.1% |

site-startup

| Metric | trunk | `914a5d6` | Diff | Change |
| --- | --- | --- | --- | --- |
| siteCreation | 8083 ms | 8082 ms | -1 ms | ⚪ 0.0% |
| siteStartup | 4949 ms | 4949 ms | 0 ms | ⚪ 0.0% |

Results are median values from multiple test runs.

Legend: 🟢 Improvement (faster) | 🔴 Regression (slower) | ⚪ No change (<50ms diff)

`model.reasoning: true` on Chat Completions causes pi-ai to send reasoning_effort='high' (driven by Agent.state.thinkingLevel='high'), which starves visible output on GPT-5.5 — the model spends its budget on internal reasoning and emits a single-word completion. Leave reasoning off for OpenAI; server-side defaults handle it. Anthropic keeps adaptive thinking with effort='high', which doesn't have the same starvation behavior. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
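A rough sketch of the behavior after this commit, under stated assumptions: `ModelSketch`, `describeModel`, and `reasoningFields` are hypothetical names, the model ids are placeholders, and the returned field names are illustrative rather than the exact wire format pi-ai produces. The point is only that the reasoning flag gates whether any effort field is attached at all.

```typescript
// Hypothetical types and helpers for illustration; not pi-ai's API.
type Family = 'anthropic' | 'openai';
type ThinkingLevel = 'off' | 'high';

interface ModelSketch {
  id: string;
  family: Family;
  reasoning: boolean;
}

// After this commit only Anthropic models opt in to reasoning;
// OpenAI models rely on server-side defaults.
function describeModel(id: string, family: Family): ModelSketch {
  return { id, family, reasoning: family === 'anthropic' };
}

// Returns the extra request fields implied by the thinking level.
// Nothing is attached when the model's reasoning flag is off.
function reasoningFields(
  model: ModelSketch,
  level: ThinkingLevel
): Record<string, string> {
  if (!model.reasoning || level === 'off') {
    return {};
  }
  return model.family === 'anthropic'
    ? { effort: level }
    : { reasoning_effort: level };
}
```

Under these assumptions a GPT-family model with `thinkingLevel: 'high'` sends no `reasoning_effort` at all, which is the change this commit describes.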

When validate_blocks reports invalid blocks the agent has been treating each Expected/Actual diff as a literal text swap, but core block styles are tied to className/structure — adding or removing classes pulls in or strips core CSS that drives layout, spacing, and color, so naive substitutions visibly break the design while validation passes. Add explicit guidance both in the system prompt (LOCAL_CONTENT_GUIDELINES) and inline in the tool's invalid-blocks output so the agent reads it exactly when it is about to apply fixes: diff markup as a structural change, update style.css selectors that target the old class or nesting in the same batch, preserve intentional className hooks, and take a screenshot to verify the design survived. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Inline guidance fires right when the agent is about to apply fixes, while the system-prompt copy is general advice that drifts after compaction and many turns. Try inline-only first; we can always add the prompt back if the agent still misses it. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The agent has been writing block markup that omits `align: "full"` and trying to fix the resulting narrow sections in custom CSS, which loses to core's layout selectors. Add a layout primer to the block content guidelines covering the three patterns that handle 95% of cases (full-bleed with constrained inner, full-bleed with full-bleed inner, plain constrained), and extend the workflow's "Check the result" step with an explicit full-width verification — if a section meant to be full-width only spans ~700px in the desktop screenshot, the fix is in markup (`align: "full"` on the outer group, matching inner layout type), not CSS. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The display-name dictionary was keyed on `mcp__studio__*` — the names the SDK produced when it wrapped studio tools as an MCP server. The unified pi runtime exposes tools with bare names (`site_create`, `wp_cli`, etc.), so every studio tool was falling through to the raw name fallback in the UI. Re-key the dictionary on bare names and strip the legacy `mcp__studio__` prefix at lookup so older session JSONL still renders correctly on replay. Also covers tools the SDK era never had labels for (push, pull, import, export, audits, AskUserQuestion, Ls). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
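The re-keyed lookup described here might look roughly like the following. The dictionary entries and the `displayNameFor` helper are hypothetical (the actual labels and function names in the PR are not shown in this page); only the `mcp__studio__` prefix and the bare tool names `site_create` and `wp_cli` come from the commit message.

```typescript
// Legacy prefix used when studio tools were wrapped as an MCP server.
const LEGACY_PREFIX = 'mcp__studio__';

// Hypothetical labels keyed on bare tool names.
const DISPLAY_NAMES: Record<string, string> = {
  site_create: 'Create site',
  wp_cli: 'Run WP-CLI',
};

// Resolves a display name, accepting both bare names and legacy prefixed
// names so that older recorded sessions still render on replay.
function displayNameFor(toolName: string): string {
  const bare = toolName.startsWith(LEGACY_PREFIX)
    ? toolName.slice(LEGACY_PREFIX.length)
    : toolName;
  // Fall back to the raw name when no label is known.
  return DISPLAY_NAMES[bare] ?? toolName;
}
```

Stripping the prefix at lookup time, rather than duplicating every key, keeps a single source of labels while staying compatible with session files written before the migration.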

The typebox migration grew the type graph enough that ESLint's single process across all apps trips Node's default heap limit on the Buildkite mac agent. Bump --max-old-space-size to 8GB in the lint script so the job has headroom; the script still runs locally with no observable cost on smaller machines (Node only allocates what it needs). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Surfaced now that the lint heap bump lets the job finish — `@studio/common` type imports must come before `cli/*` ones. Fallout from rewriting the SDKMessage import in this PR; --fix sorted it. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…-028488 # Conflicts: # package.json

`/annotate`, `/taxonomist`, `/need-for-speed`, and `/rank-me-up` slash commands all dispatch through `runAgentTurn(buildSkillInvocationPrompt)` which asks the model to load the skill via the `Skill` tool — but the tool's name enum was hardcoded to `site-spec` only, so the model bailed out in prose ("I don't have an /annotate skill available"). Parse `user-invokable: true` from each SKILL.md's frontmatter and use that as the visibility filter, plus `site-spec` which the model loads autonomously at the start of a build. The slash-command dispatcher only invokes Skill when the user explicitly types `/<name>`, so exposing user-invokable skills in the enum is safe — no autonomous-audit risk. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

All skills are now exposed via the Skill tool. Removes the dead userInvokable parsing too — nothing else read it. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

youknowriad · 2026-05-05T15:18:20Z

I tried to get this tested, but unfortunately everyone is busy. I ran a lot of tests myself, and I think this improves stability and speed, simplifies and unifies the architecture, and doesn't regress site quality.

github-actions Bot assigned youknowriad May 4, 2026

youknowriad and others added 11 commits May 5, 2026 10:24

Merge remote-tracking branch 'origin/trunk' into claude/sharp-murdock…

33857fe

…-028488 # Conflicts: # package.json

Drop the skill visibility filter

4db0e0c

Fix import-x/order in list-connected-remote-sites tool

914a5d6

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

youknowriad merged commit 6bc9242 into trunk May 5, 2026
10 checks passed

youknowriad deleted the claude/sharp-murdock-028488 branch May 5, 2026 15:17

youknowriad mentioned this pull request May 6, 2026

AI sessions: adopt pi-coding-agent SessionManager end-to-end #3360 (Open, 11 tasks)

youknowriad merged 12 commits into trunk from claude/sharp-murdock-028488
