Unify CLI agent on a single pi-based runtime #3337

Merged
youknowriad merged 12 commits into trunk from claude/sharp-murdock-028488
May 5, 2026

@youknowriad
Contributor

Related issues

How AI was used in this PR

Built end-to-end with Claude Code. Each step was driven by an explicit prompt and validated against the live wpcom proxy with a real Sonnet 4.6 tool call before moving on. The unification was settled with a small spike test (apps/cli/ai/runtimes/pi/tests/anthropic-pi-spike.test.ts, opt-in via STUDIO_LIVE_SPIKE=1) before touching the runtime.

What to verify carefully:

  • The wpcom proxy auth path (Bearer via injected Anthropic client). The spike covers it; manual studio code + /login is the most realistic check.
  • studio mcp still serves Studio's tools to external clients (Claude Desktop / Cursor) after the rewrite to the low-level MCP server API.
  • The Skill tool is registered (look for Skill in the tool list when running with a local site). The bundled-build static-copy path was a real bug discovered during this work — dist/cli/skills/site-spec/SKILL.md must exist after npm run cli:build.
  • Reasoning is on for both families: Anthropic uses adaptive thinking with effort 'high'; GPT-5.5 sends reasoning_effort: 'high'.

Proposed Changes

  • Single runtime, both families. apps/cli/ai/runtimes/pi/ now serves Anthropic Messages and OpenAI Chat Completions via @mariozechner/pi-ai, dispatched off the model's family. The custom Anthropic runtime (apps/cli/ai/runtimes/anthropic/) is deleted; pickRuntime, the cross-family resume guard, and the SDK-cleanup unhandledRejection handler are gone.
  • wpcom Bearer auth. pi-ai's default Anthropic path uses x-api-key, which the wpcom proxy rejects. We inject a pre-built Anthropic SDK client via a custom streamFn only on that path (Anthropic + ANTHROPIC_AUTH_TOKEN).
  • Native pi tools. apps/cli/ai/tools/define-tool.ts returns AgentTool directly. Tool input schemas are typebox (no zod, no runtime adapter), result contract is { content } + throw on failure. pi-tool-adapter.ts deleted.
  • @anthropic-ai/claude-agent-sdk dropped from apps/cli/package.json and the vendor-binary cleanup in apps/studio/forge.config.ts. @anthropic-ai/sdk is added (matches pi-ai's range, hoists once) — required to construct the pre-built client for the wpcom Bearer-auth path.
  • MCP server rewrite. apps/cli/ai/mcp-server.ts now uses @modelcontextprotocol/sdk/server's low-level Server + setRequestHandler(ListToolsRequestSchema/CallToolRequestSchema, ...) so tools defined in typebox JSON Schema can be exposed without a zod conversion.
  • Local SDK message types. apps/cli/ai/runtimes/messages.ts defines SDKMessage / TodoWriteInput / content-block types locally. ui.ts, recorder.ts, replay.ts, eval-runner.ts, output-adapter.ts, todo-stream.ts all switch to the local imports.
  • Skills directory restructure. apps/cli/ai/plugin/skills/ → apps/cli/ai/skills/. The vite static-copy dest is '.' so the bundled CLI puts files at dist/cli/skills/, where the loader resolves them. A console.warn fires at startup if the directory is missing; silent failure is what hid this bug initially.
  • Reasoning enabled on both families. Agent.state.thinkingLevel: 'high' in pi/index.ts. For Anthropic this maps to adaptive thinking with effort: 'high' (matching the SDK's previous default); for GPT-5.5 it sends reasoning_effort: 'high' on Chat Completions. model.reasoning: true is set for both so pi-ai actually attaches the field.
  • Cleaner JSON Schema for enum-shaped tool inputs. Switched 5 tools (Skill, take_screenshot, share_screenshot, site_export, wpcom_request) from Type.Union([Type.Literal(...)]) to Type.Enum([...]). Emits { enum: [...] } instead of { anyOf: [{ const: ... }] }, which most LLMs handle more reliably.
  • Comment cleanup. Stripped multi-paragraph narration comments across the runtime, tools, tests; removed stale trunk #NNNN PR-number references.
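
To make the enum-schema bullet above concrete, here is a minimal sketch in plain TypeScript that builds the two JSON Schema shapes by hand. It does not call typebox; `unionOfLiterals` and `flatEnum` are hypothetical helpers, and the sample values are illustrative rather than taken from the PR.

```typescript
// Hypothetical helpers: they only mirror the two JSON Schema shapes
// described in the bullet above and do not use typebox itself.
type JsonSchema = Record<string, unknown>;

// Union-of-literals shape: { anyOf: [{ const: ... }, ...] }.
function unionOfLiterals(values: readonly string[]): JsonSchema {
  return { anyOf: values.map((value) => ({ const: value, type: 'string' })) };
}

// Flat shape: { enum: [...] }.
function flatEnum(values: readonly string[]): JsonSchema {
  return { type: 'string', enum: [...values] };
}

// Illustrative values only; the real tool options are not listed in the PR text.
const sampleValues = ['png', 'jpeg'] as const;

console.log(JSON.stringify(unionOfLiterals(sampleValues)));
console.log(JSON.stringify(flatEnum(sampleValues)));
```

Both schemas accept the same set of strings; the flat form is simply shorter and has one less level of nesting for a model to read.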

Testing Instructions

  • npm test passes (1529 passing, 1 skipped — the skipped test is the live spike, opt-in via STUDIO_LIVE_SPIKE=1).
  • npm run typecheck passes across all workspaces.
  • npm run cli:build produces dist/cli/skills/site-spec/SKILL.md (verifies the bundled-build path fix).
  • studio code against a local site, ask it to "build me a one-page site for X" — the agent should call the Skill tool with site-spec before creating the site.
  • studio code with Sonnet 4.6 and with GPT-5.5 (/model), confirm both work and produce thinking-style responses.
  • studio mcp (or wire the launch command into Claude Desktop / Cursor) — tools/list returns Studio's tools and a tool call round-trips.
  • Optional: STUDIO_LIVE_SPIKE=1 npm test -- apps/cli/ai/runtimes/pi/tests/anthropic-pi-spike while logged in via studio auth login — round-trips a tool call against Sonnet 4.6 through the wpcom proxy in ~3-4s.
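
The tools/list and tool-call round trip from the studio mcp step above can be pictured with a small in-memory stand-in. This is a sketch under assumptions: it does not import @modelcontextprotocol/sdk, the `echo` tool and every function name are hypothetical, and mapping a thrown error to an `isError` result is one plausible way to honor the "{ content } + throw on failure" contract, not necessarily what the PR does.

```typescript
// Hypothetical stand-in for the two request handlers; no SDK involved.
type TextContent = { type: 'text'; text: string };
type ToolResult = { content: TextContent[] };
type CallResult = ToolResult & { isError: boolean };

interface ToolDef {
  name: string;
  description: string;
  inputSchema: Record<string, unknown>; // plain JSON Schema, passed through as-is
  execute(args: Record<string, unknown>): Promise<ToolResult>;
}

const tools: ToolDef[] = [
  {
    name: 'echo', // illustrative tool, not one of Studio's
    description: 'Returns the given text',
    inputSchema: { type: 'object', properties: { text: { type: 'string' } }, required: ['text'] },
    async execute(args): Promise<ToolResult> {
      if (typeof args.text !== 'string') throw new Error('text must be a string');
      return { content: [{ type: 'text', text: args.text }] };
    },
  },
];

// tools/list: expose name, description and schema without any conversion step.
function listTools() {
  return { tools: tools.map(({ name, description, inputSchema }) => ({ name, description, inputSchema })) };
}

// tools/call: find the tool, run it, and turn a thrown error into an error result.
async function callTool(name: string, args: Record<string, unknown>): Promise<CallResult> {
  const tool = tools.find((candidate) => candidate.name === name);
  if (!tool) return { isError: true, content: [{ type: 'text', text: `Unknown tool: ${name}` }] };
  try {
    const result = await tool.execute(args);
    return { isError: false, content: result.content };
  } catch (error) {
    const text = error instanceof Error ? error.message : String(error);
    return { isError: true, content: [{ type: 'text', text }] };
  }
}
```

Calling `callTool('echo', { text: 'hi' })` resolves with the text echoed back, while `callTool('echo', {})` resolves with `isError: true` instead of rejecting.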

Pre-merge Checklist

  • TypeScript / lint clean (npm run typecheck, npx eslint).
  • Tests passing (npm test).
  • Live spike passing against the wpcom proxy with Sonnet 4.6.
  • Manual smoke against studio code with both families.
  • Manual smoke against studio mcp from an external MCP client.

🤖 Generated with Claude Code

Drops the dual-runtime split (Claude Agent SDK for Anthropic, pi for
OpenAI) and routes both families through one pi-agent-core runtime that
dispatches to the right `@mariozechner/pi-ai` provider per model family.
Anthropic's wpcom proxy needs Bearer auth, which pi-ai's default code
path doesn't do for non-OAuth tokens — we inject a pre-built `Anthropic`
SDK client through a custom `streamFn` only on that path.

Tools are now native pi `AgentTool` definitions (typebox schemas, throw
on error, no runtime adapter). The Claude Agent SDK dependency is
removed; the `studio mcp` command is rewired to the low-level
`@modelcontextprotocol/sdk/server` API. Skill files move from
`ai/plugin/skills/` to `ai/skills/`.

Reasoning is enabled on both families with `thinkingLevel: 'high'`,
matching the SDK's previous adaptive default and giving GPT-5.5 a real
`reasoning_effort` budget on Chat Completions.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
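
The auth decision described in this commit message can be sketched as a pair of pure functions. This is an illustration, not the PR's code: `needsInjectedClient` and `anthropicAuthHeaders` are hypothetical names, and `ANTHROPIC_API_KEY` is assumed here as the usual non-proxy variable (only `ANTHROPIC_AUTH_TOKEN` appears in the PR text).

```typescript
// Hypothetical sketch of the auth decision; not the PR's actual code.
type Family = 'anthropic' | 'openai';

interface Env {
  ANTHROPIC_AUTH_TOKEN?: string; // named in the PR description
  ANTHROPIC_API_KEY?: string; // assumption: the usual direct-API variable
}

// True only on the one path that needs the injected, pre-built client.
function needsInjectedClient(family: Family, env: Env): boolean {
  return family === 'anthropic' && Boolean(env.ANTHROPIC_AUTH_TOKEN);
}

// Headers an Anthropic request would carry in each case.
function anthropicAuthHeaders(env: Env): Record<string, string> {
  if (env.ANTHROPIC_AUTH_TOKEN) {
    // Bearer auth, which the wpcom proxy expects.
    return { Authorization: `Bearer ${env.ANTHROPIC_AUTH_TOKEN}` };
  }
  // Default path: API key header, which the proxy rejects.
  return env.ANTHROPIC_API_KEY ? { 'x-api-key': env.ANTHROPIC_API_KEY } : {};
}
```

Keeping the check this narrow means every other combination of family and credentials stays on the provider library's default code path.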
@wpmobilebot
Collaborator

wpmobilebot commented May 4, 2026

📊 Performance Test Results

Comparing 914a5d6 vs trunk

app-size

| Metric | trunk | 914a5d6 | Diff | Change |
|---|---|---|---|---|
| App Size (Mac) | 1667.61 MB | 1454.02 MB | 213.59 MB | 🟢 -12.8% |

site-editor

| Metric | trunk | 914a5d6 | Diff | Change |
|---|---|---|---|---|
| load | 1470 ms | 1530 ms | +60 ms | 🔴 4.1% |

site-startup

| Metric | trunk | 914a5d6 | Diff | Change |
|---|---|---|---|---|
| siteCreation | 8083 ms | 8082 ms | 1 ms | ⚪ 0.0% |
| siteStartup | 4949 ms | 4949 ms | 0 ms | ⚪ 0.0% |

Results are median values from multiple test runs.

Legend: 🟢 Improvement (faster) | 🔴 Regression (slower) | ⚪ No change (<50ms diff)

youknowriad and others added 11 commits May 5, 2026 10:24
`model.reasoning: true` on Chat Completions causes pi-ai to send
reasoning_effort='high' (driven by Agent.state.thinkingLevel='high'),
which starves visible output on GPT-5.5 — the model spends its budget
on internal reasoning and emits a single-word completion. Leave
reasoning off for OpenAI; server-side defaults handle it. Anthropic
keeps adaptive thinking with effort='high', which doesn't have the
same starvation behavior.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
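
The per-family reasoning behavior this commit describes can be sketched as follows. Everything here is hypothetical except `reasoning_effort` and the `thinkingLevel` idea, which come from the PR text: the `adaptiveThinkingEffort` key is a placeholder rather than a real wire field, and the model ids are made up.

```typescript
// Hypothetical sketch; field names other than reasoning_effort are placeholders.
type Family = 'anthropic' | 'openai';
type ThinkingLevel = 'off' | 'low' | 'medium' | 'high';

interface ModelConfig {
  id: string;
  family: Family;
  reasoning: boolean;
}

// After this commit only Anthropic opts in; OpenAI is left to server-side defaults.
function modelConfig(id: string, family: Family): ModelConfig {
  return { id, family, reasoning: family === 'anthropic' };
}

// Extra request fields attached for a given model and thinking level.
function requestExtras(model: ModelConfig, level: ThinkingLevel): Record<string, unknown> {
  if (!model.reasoning || level === 'off') return {};
  return model.family === 'anthropic'
    ? { adaptiveThinkingEffort: level } // placeholder key, not a real API field
    : { reasoning_effort: level };
}

const anthropicModel = modelConfig('example-anthropic-model', 'anthropic');
const openaiModel = modelConfig('example-openai-model', 'openai');

console.log(requestExtras(anthropicModel, 'high'));
console.log(requestExtras(openaiModel, 'high'));
```

With the flag off for OpenAI, the shared `thinkingLevel` of 'high' no longer produces a `reasoning_effort` field on Chat Completions requests.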
When validate_blocks reports invalid blocks the agent has been treating
each Expected/Actual diff as a literal text swap, but core block styles
are tied to className/structure — adding or removing classes pulls in
or strips core CSS that drives layout, spacing, and color, so naive
substitutions visibly break the design while validation passes.

Add explicit guidance both in the system prompt (LOCAL_CONTENT_GUIDELINES)
and inline in the tool's invalid-blocks output so the agent reads it
exactly when it is about to apply fixes: diff markup as a structural
change, update style.css selectors that target the old class or nesting
in the same batch, preserve intentional className hooks, and take a
screenshot to verify the design survived.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Inline guidance fires right when the agent is about to apply fixes, while
the system-prompt copy is general advice that drifts after compaction and
many turns. Try inline-only first; we can always add the prompt back if
the agent still misses it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The agent has been writing block markup that omits `align: "full"` and
trying to fix the resulting narrow sections in custom CSS, which loses
to core's layout selectors. Add a layout primer to the block content
guidelines covering the three patterns that handle 95% of cases
(full-bleed with constrained inner, full-bleed with full-bleed inner,
plain constrained), and extend the workflow's "Check the result" step
with an explicit full-width verification — if a section meant to be
full-width only spans ~700px in the desktop screenshot, the fix is in
markup (`align: "full"` on the outer group, matching inner layout type),
not CSS.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The display-name dictionary was keyed on `mcp__studio__*`, the names
the SDK produced when it wrapped studio tools as an MCP server. The
unified pi runtime exposes tools with bare names (`site_create`,
`wp_cli`, etc.), so every studio tool was falling through to the raw
name fallback in the UI.

Re-key the dictionary on bare names and strip the legacy `mcp__studio__`
prefix at lookup so older session JSONL still renders correctly on
replay. Also covers tools the SDK era never had labels for (push, pull,
import, export, audits, AskUserQuestion, Ls).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
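
The re-keyed lookup with legacy-prefix stripping can be sketched like this. The labels and the `displayName` function are hypothetical; only the `mcp__studio__` prefix and the bare tool names `site_create` and `wp_cli` come from the commit message.

```typescript
// Prefix the SDK-era sessions used for Studio tools.
const LEGACY_PREFIX = 'mcp__studio__';

// Illustrative labels; the wording of the real dictionary is not shown in the PR.
const DISPLAY_NAMES = new Map<string, string>([
  ['site_create', 'Create site'],
  ['wp_cli', 'Run WP-CLI'],
]);

// Look up by bare name, stripping the legacy prefix first so older
// session JSONL still resolves; fall back to the raw name otherwise.
function displayName(toolName: string): string {
  const bare = toolName.startsWith(LEGACY_PREFIX)
    ? toolName.slice(LEGACY_PREFIX.length)
    : toolName;
  return DISPLAY_NAMES.get(bare) ?? toolName;
}

console.log(displayName('site_create'));
console.log(displayName('mcp__studio__wp_cli'));
console.log(displayName('some_unlabelled_tool'));
```

Stripping at lookup time keeps a single dictionary instead of duplicating every entry under both the old and new names.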
The typebox migration grew the type graph enough that ESLint's single
process across all apps trips Node's default heap on the Buildkite mac
agent. Bumping --max-old-space-size to 8GB in the lint script so the
job has headroom; runs locally too with no observable cost on smaller
machines (Node only allocates what it needs).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Surfaced now that the lint heap bump lets the job finish: `@studio/common`
type imports must come before `cli/*` ones. Fallout from rewriting the
SDKMessage import in this PR; --fix sorted it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

`/annotate`, `/taxonomist`, `/need-for-speed`, and `/rank-me-up` slash
commands all dispatch through `runAgentTurn(buildSkillInvocationPrompt)`
which asks the model to load the skill via the `Skill` tool — but the
tool's name enum was hardcoded to `site-spec` only, so the model bailed
out in prose ("I don't have an /annotate skill available").

Parse `user-invokable: true` from each SKILL.md's frontmatter and use
that as the visibility filter, plus `site-spec` which the model loads
autonomously at the start of a build. The slash-command dispatcher only
invokes Skill when the user explicitly types `/<name>`, so exposing
user-invokable skills in the enum is safe — no autonomous-audit risk.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
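
The visibility filter this commit describes, as it stood at this point in the branch, might look roughly like the sketch below. It is an assumption-laden illustration: `isUserInvokable` and `visibleSkills` are hypothetical names, the frontmatter is matched with a simple regular expression rather than a YAML parser, and the skill contents are invented.

```typescript
// Returns true when the leading frontmatter block has `user-invokable: true`.
function isUserInvokable(skillMd: string): boolean {
  const match = /^---\r?\n([\s\S]*?)\r?\n---/.exec(skillMd);
  if (!match) return false;
  return (match[1] ?? '')
    .split(/\r?\n/)
    .some((line) => /^user-invokable:\s*true\s*$/.test(line));
}

// Names allowed in the Skill tool's enum: user-invokable skills plus
// site-spec, which the model loads on its own at the start of a build.
function visibleSkills(files: Record<string, string>): string[] {
  return Object.entries(files)
    .filter(([name, body]) => name === 'site-spec' || isUserInvokable(body))
    .map(([name]) => name)
    .sort();
}

// Invented SKILL.md contents for illustration.
const sampleSkills: Record<string, string> = {
  'site-spec': '---\nname: site-spec\n---\n# Site spec',
  annotate: '---\nname: annotate\nuser-invokable: true\n---\n# Annotate',
  internal: '---\nname: internal\n---\n# Internal only',
};

console.log(visibleSkills(sampleSkills));
```

Deriving the enum from the files means a newly added skill becomes selectable without touching the tool definition.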
All skills are now exposed via the Skill tool. Removes the dead
userInvokable parsing too — nothing else read it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@youknowriad youknowriad merged commit 6bc9242 into trunk May 5, 2026
10 checks passed
@youknowriad youknowriad deleted the claude/sharp-murdock-028488 branch May 5, 2026 15:17
@youknowriad
Contributor Author

I tried to get this tested by others, but unfortunately everyone is busy. I ran a lot of tests myself, and I think this improves stability and speed, simplifies and unifies the architecture, and doesn't regress site quality.
