Unify CLI agent on a single pi-based runtime #3337
Merged
youknowriad merged 12 commits into trunk on May 5, 2026
Drops the dual-runtime split (Claude Agent SDK for Anthropic, pi for OpenAI) and routes both families through one pi-agent-core runtime that dispatches to the right `@mariozechner/pi-ai` provider per model family. Anthropic's wpcom proxy needs Bearer auth, which pi-ai's default code path doesn't do for non-OAuth tokens — we inject a pre-built `Anthropic` SDK client through a custom `streamFn` only on that path. Tools are now native pi `AgentTool` definitions (typebox schemas, throw on error, no runtime adapter). The Claude Agent SDK dependency is removed; the `studio mcp` command is rewired to the low-level `@modelcontextprotocol/sdk/server` API. Skill files move from `ai/plugin/skills/` to `ai/skills/`. Reasoning is enabled on both families with `thinkingLevel: 'high'`, matching the SDK's previous adaptive default and giving GPT-5.5 a real `reasoning_effort` budget on Chat Completions. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
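The tool contract mentioned above (return content on success, throw on error) can be sketched without any library. `SketchTool`, `echoTool`, and `callTool` are hypothetical names for illustration only and are not pi's actual `AgentTool` interface; the plain JSON Schema object stands in for a typebox schema, and `execute` is synchronous here purely for brevity.

```typescript
// Hypothetical, simplified stand-in for a native tool definition.
// The real pi `AgentTool` type has its own shape; this only illustrates
// the "return { content } or throw" contract described in the PR.
type TextContent = { type: 'text'; text: string };

interface SketchTool<Args> {
  name: string;
  description: string;
  // Plain JSON Schema object, standing in for a typebox schema.
  parameters: Record<string, unknown>;
  execute(args: Args): { content: TextContent[] };
}

const echoTool: SketchTool<{ message: string }> = {
  name: 'echo',
  description: 'Echo a message back (illustrative only).',
  parameters: {
    type: 'object',
    properties: { message: { type: 'string' } },
    required: ['message'],
  },
  execute(args) {
    if (args.message.length === 0) {
      // Failure is signalled by throwing, not by an error-shaped result.
      throw new Error('message must not be empty');
    }
    return { content: [{ type: 'text', text: args.message }] };
  },
};

// A caller (for example an MCP-style call handler) can translate the
// thrown error into whatever error envelope its protocol expects.
function callTool<Args>(
  tool: SketchTool<Args>,
  args: Args
): { content: TextContent[]; isError: boolean } {
  try {
    const result = tool.execute(args);
    return { content: result.content, isError: false };
  } catch (error) {
    const text = error instanceof Error ? error.message : String(error);
    return { content: [{ type: 'text', text }], isError: true };
  }
}
```

Keeping error handling in the caller means each tool body stays free of protocol-specific result shapes, which is one plausible reason a separate runtime adapter is no longer needed.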
📊 Performance Test Results
Comparing 914a5d6 vs trunk. Metrics: app-size, site-editor, site-startup (result tables not captured).
Results are median values from multiple test runs. Legend: 🟢 Improvement (faster) | 🔴 Regression (slower) | ⚪ No change (<50ms diff)
`model.reasoning: true` on Chat Completions causes pi-ai to send reasoning_effort='high' (driven by Agent.state.thinkingLevel='high'), which starves visible output on GPT-5.5 — the model spends its budget on internal reasoning and emits a single-word completion. Leave reasoning off for OpenAI; server-side defaults handle it. Anthropic keeps adaptive thinking with effort='high', which doesn't have the same starvation behavior. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
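The per-family decision in this commit can be sketched as a small pure function. `ModelFamily`, `ReasoningConfig`, and `reasoningFor` are hypothetical names chosen for illustration; the actual runtime sets these values on pi's model and agent state, which is not reproduced here.

```typescript
// Hypothetical sketch of the per-family reasoning choice described above.
type ModelFamily = 'anthropic' | 'openai';

interface ReasoningConfig {
  // Mirrors the `model.reasoning` flag mentioned in the commit message.
  reasoning: boolean;
  // Only meaningful when `reasoning` is true.
  thinkingLevel?: 'high';
}

function reasoningFor(family: ModelFamily): ReasoningConfig {
  switch (family) {
    case 'anthropic':
      // Adaptive thinking with high effort does not starve visible output.
      return { reasoning: true, thinkingLevel: 'high' };
    case 'openai':
      // Leave reasoning off so no reasoning_effort field is attached and
      // server-side defaults apply.
      return { reasoning: false };
  }
}
```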
When validate_blocks reports invalid blocks the agent has been treating each Expected/Actual diff as a literal text swap, but core block styles are tied to className/structure — adding or removing classes pulls in or strips core CSS that drives layout, spacing, and color, so naive substitutions visibly break the design while validation passes. Add explicit guidance both in the system prompt (LOCAL_CONTENT_GUIDELINES) and inline in the tool's invalid-blocks output so the agent reads it exactly when it is about to apply fixes: diff markup as a structural change, update style.css selectors that target the old class or nesting in the same batch, preserve intentional className hooks, and take a screenshot to verify the design survived. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Inline guidance fires right when the agent is about to apply fixes, while the system-prompt copy is general advice that drifts after compaction and many turns. Try inline-only first; we can always add the prompt back if the agent still misses it. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The agent has been writing block markup that omits `align: "full"` and trying to fix the resulting narrow sections in custom CSS, which loses to core's layout selectors. Add a layout primer to the block content guidelines covering the three patterns that handle 95% of cases (full-bleed with constrained inner, full-bleed with full-bleed inner, plain constrained), and extend the workflow's "Check the result" step with an explicit full-width verification — if a section meant to be full-width only spans ~700px in the desktop screenshot, the fix is in markup (`align: "full"` on the outer group, matching inner layout type), not CSS. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The display-name dictionary was keyed on `mcp__studio__*` — the names the SDK produced when it wrapped studio tools as an MCP server. The unified pi runtime exposes tools with bare names (`site_create`, `wp_cli`, etc.), so every studio tool was falling through to the raw name fallback in the UI. Re-key the dictionary on bare names and strip the legacy `mcp__studio__` prefix at lookup so older session JSONL still renders correctly on replay. Also covers tools the SDK era never had labels for (push, pull, import, export, audits, AskUserQuestion, Ls). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
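A minimal sketch of the lookup described above, assuming a dictionary keyed on bare tool names. The labels and the names `TOOL_DISPLAY_NAMES`, `LEGACY_PREFIX`, and `displayNameFor` are invented for illustration; only the `mcp__studio__` prefix and the tool names `site_create` and `wp_cli` come from the commit message.

```typescript
// Hypothetical label table keyed on bare tool names; the labels are invented.
const TOOL_DISPLAY_NAMES: Record<string, string> = {
  site_create: 'Create site',
  wp_cli: 'Run WP-CLI',
};

// Prefix the SDK-era sessions used when studio tools were wrapped as an MCP server.
const LEGACY_PREFIX = 'mcp__studio__';

function displayNameFor(toolName: string): string {
  // Strip the legacy prefix so older session JSONL resolves to the same key.
  const bareName = toolName.startsWith(LEGACY_PREFIX)
    ? toolName.slice(LEGACY_PREFIX.length)
    : toolName;
  // Fall back to the raw name when no label is registered.
  return TOOL_DISPLAY_NAMES[bareName] ?? toolName;
}
```

Normalizing at lookup time, rather than duplicating every key under both names, keeps the dictionary small while still rendering replayed sessions correctly.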
The typebox migration grew the type graph enough that ESLint's single process across all apps trips Node's default heap on the Buildkite mac agent. Bumping --max-old-space-size to 8GB in the lint script so the job has headroom; runs locally too with no observable cost on smaller machines (Node only allocates what it needs). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Surfaced now that the lint heap bump lets the job finish — `@studio/common` type imports must come before `cli/*` ones. Fallout from rewriting the SDKMessage import in this PR; --fix sorted it. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…-028488 # Conflicts: # package.json
`/annotate`, `/taxonomist`, `/need-for-speed`, and `/rank-me-up` slash
commands all dispatch through `runAgentTurn(buildSkillInvocationPrompt)`
which asks the model to load the skill via the `Skill` tool — but the
tool's name enum was hardcoded to `site-spec` only, so the model bailed
out in prose ("I don't have an /annotate skill available").
Parse `user-invokable: true` from each SKILL.md's frontmatter and use
that as the visibility filter, plus `site-spec` which the model loads
autonomously at the start of a build. The slash-command dispatcher only
invokes Skill when the user explicitly types `/<name>`, so exposing
user-invokable skills in the enum is safe — no autonomous-audit risk.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
All skills are now exposed via the Skill tool. Removes the dead userInvokable parsing too — nothing else read it. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
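One way to avoid a hardcoded name enum is to derive the `Skill` tool's input schema from the skill names found on disk. In this sketch `buildSkillParameters`, the `name` property, and the example name list are assumptions for illustration; the actual loader and schema-library calls are not shown.

```typescript
// Hypothetical helper: builds a JSON Schema for the Skill tool's input
// from a list of skill names (for example, directory names under ai/skills/).
interface SkillParameters {
  type: 'object';
  properties: { name: { type: 'string'; enum: string[] } };
  required: ['name'];
}

function buildSkillParameters(skillNames: string[]): SkillParameters {
  // De-duplicate and sort so the emitted schema is stable across runs.
  const names = Array.from(new Set(skillNames)).sort();
  if (names.length === 0) {
    // An empty enum would make every call invalid, so fail loudly instead.
    throw new Error('No skills found; refusing to emit an empty enum');
  }
  return {
    type: 'object',
    properties: { name: { type: 'string', enum: names } },
    required: ['name'],
  };
}

// Example input using skill names mentioned in the commits above.
const skillSchema = buildSkillParameters(['site-spec', 'annotate', 'taxonomist']);
```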
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
I tried to get others to test this, but unfortunately everyone is busy. I ran a lot of tests myself, and I think this improves stability and speed, simplifies and unifies the architecture, and doesn't regress site quality.
Related issues

How AI was used in this PR

Built end-to-end with Claude Code. Each step was driven by an explicit prompt and validated against the live wpcom proxy with a real Sonnet 4.6 tool call before moving on. The unification was settled with a small spike test (`apps/cli/ai/runtimes/pi/tests/anthropic-pi-spike.test.ts`, opt-in via `STUDIO_LIVE_SPIKE=1`) before touching the runtime.

What to verify carefully:
- […] `Anthropic` client). The spike covers it; manual `studio code` + `/login` is the most realistic check.
- `studio mcp` still serves Studio's tools to external clients (Claude Desktop / Cursor) after the rewrite to the low-level MCP server API.
- […] `Skill` in the tool list when running with a local site). The bundled-build static-copy path was a real bug discovered during this work: `dist/cli/skills/site-spec/SKILL.md` must exist after `npm run cli:build`.
- […] `'high'`; GPT-5.5 sends `reasoning_effort: 'high'`.

Proposed Changes

- […] `@mariozechner/pi-ai`, dispatched off the model's family. The custom Anthropic runtime (`apps/cli/ai/runtimes/anthropic/`) is deleted; `pickRuntime`, the cross-family resume guard, and the SDK-cleanup `unhandledRejection` handler are gone.
- […] `x-api-key`, which the wpcom proxy rejects. We inject a pre-built `Anthropic` SDK client via a custom `streamFn` only on that path (Anthropic + `ANTHROPIC_AUTH_TOKEN`).
- […] `AgentTool` directly. Tool input schemas are typebox (no zod, no runtime adapter); the result contract is `{ content }` plus throw on failure. `pi-tool-adapter.ts` is deleted.
- `@anthropic-ai/claude-agent-sdk` is dropped from `apps/cli/package.json` and from the vendor-binary cleanup in `apps/studio/forge.config.ts`. `@anthropic-ai/sdk` is added (matches pi-ai's range, hoists once); it is required to construct the pre-built client for the wpcom Bearer-auth path.
- […] `@modelcontextprotocol/sdk/server`'s low-level `Server` + `setRequestHandler(ListToolsRequestSchema / CallToolRequestSchema, ...)` so tools defined in typebox JSON Schema can be exposed without a zod conversion.
- […] `SDKMessage` / `TodoWriteInput` / content-block types locally. `ui.ts`, `recorder.ts`, `replay.ts`, `eval-runner.ts`, `output-adapter.ts`, and `todo-stream.ts` all switch to the local imports.
- […] `apps/cli/ai/plugin/skills/` → `apps/cli/ai/skills/`. The vite static-copy `dest` is `'.'` so the bundled CLI puts files at `dist/cli/skills/`, where the loader resolves them. A `console.warn` fires at startup if the directory is missing; silent failure is what hid this bug initially.
- […] `Agent.state.thinkingLevel: 'high'` in `pi/index.ts`. For Anthropic this maps to adaptive thinking with `effort: 'high'` (matching the SDK's previous default); for GPT-5.5 it sends `reasoning_effort: 'high'` on Chat Completions. `model.reasoning: true` is set for both so pi-ai actually attaches the field.
- […] (`Skill`, `take_screenshot`, `share_screenshot`, `site_export`, `wpcom_request`) from `Type.Union([Type.Literal(...)])` to `Type.Enum([...])`. Emits `{ enum: [...] }` instead of `{ anyOf: [{ const: ... }] }`, which most LLMs handle more reliably.
- […] trunk `#NNNN` PR-number references.

Testing Instructions

- `npm test` passes (1529 passing, 1 skipped; the skipped test is the live spike, opt-in via `STUDIO_LIVE_SPIKE=1`).
- `npm run typecheck` passes across all workspaces.
- `npm run cli:build` produces `dist/cli/skills/site-spec/SKILL.md` (verifies the bundled-build path fix).
- […] `studio code` against a local site, ask it to "build me a one-page site for X"; the agent should call the `Skill` tool with `site-spec` before creating the site.
- […] `studio code` with Sonnet 4.6 and with GPT-5.5 (`/model`), confirm both work and produce thinking-style responses.
- […] `studio mcp` (or wire the launch command into Claude Desktop / Cursor): `tools/list` returns Studio's tools and a tool call round-trips.
- `STUDIO_LIVE_SPIKE=1 npm test -- apps/cli/ai/runtimes/pi/tests/anthropic-pi-spike` while logged in via `studio auth login`: round-trips a tool call against Sonnet 4.6 through the wpcom proxy in ~3-4s.

Pre-merge Checklist

- […] `npm run typecheck`, `npx eslint`).
- […] `npm test`).
- […] `studio code` with both families.
- […] `studio mcp` from an external MCP client.

🤖 Generated with Claude Code