perf(ai): anthropic prompt cache + cache token accounting #53
Merged
Conversation
System prompt + tools were rebuilt and re-sent uncached on every
request. Brain content index alone can grow past 10K tokens, so a
typical 10-turn session was paying for the same prefix ten times.
Anthropic's prompt cache cuts that prefix to ~10% of base input price
when reused within 5 minutes; this PR wires the markers up.
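Back-of-envelope with the numbers above: a 10K-token prefix over 10 turns is ~100K billed input tokens uncached; cached, it is one write at 1.25x plus nine reads at ~0.1x, i.e. 10K × 1.25 + 90K × 0.1 ≈ 21.5K input-token-equivalents, roughly a 78% cut on that prefix.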
The naive shape — wrap the existing `buildSystemPrompt(...)` string
in a cached block — would have miscached because `buildSchemaSection`
embeds the active-model marker and `buildRulesSection` injects the
out-of-scope rule based on intent. Cache markers placed over content
that varies per request bust the prefix and pay the 1.25x creation
penalty on every turn. The prompt builder is now actually split:
- `buildStaticBody`: role, architecture, config, schema (NO active-model marker), relations, vocab, permissions, base rules, custom instructions
- `buildDynamicBody`: UI context (the active-model annotation lives here), inferred intent, project state, intent-specific rules (off-topic, etc.)
- `buildContentIndex`: already separate (brain cache); rendered as its own cached block
`buildSystemPromptBlocks(...)` returns the three pieces; `toSystemBlocks(...)`
materializes them as an `AISystemBlock[]` with `cache_control` markers
on the static body and the content index (2 of Anthropic's 4 explicit
breakpoints). The Studio chat handler and the Conversation API handler both compose the prompt this way and additionally tag the last `AITool` with `cache_control`, so the tools array gets the third breakpoint — tools rarely change within a session, so that breakpoint sees a very high hit rate (~95%).
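For orientation, a minimal sketch of the materialization step. The `AISystemBlock` fields and the builder output shape are illustrative assumptions, not the exact types in this repo; only the breakpoint placement is what this PR describes:

```ts
// Hypothetical shapes for illustration — the real AISystemBlock and
// builder output live in the AI provider package.
interface AISystemBlock {
  type: 'text';
  text: string;
  cache_control?: { type: 'ephemeral' }; // Anthropic's explicit breakpoint marker
}

interface SystemPromptBlocks {
  staticBody: string;    // request-invariant prefix
  dynamicBody: string;   // varies per request, must stay uncached
  contentIndex?: string; // brain index: large, but stable between content edits
}

function toSystemBlocks(parts: SystemPromptBlocks): AISystemBlock[] {
  const blocks: AISystemBlock[] = [
    // Breakpoint 1: the static body. Any byte of drift here busts the cache.
    { type: 'text', text: parts.staticBody, cache_control: { type: 'ephemeral' } },
  ];
  if (parts.contentIndex) {
    // Breakpoint 2: the content index gets its own marker, so a brain-content
    // change busts only this block's cache, not the static body's.
    blocks.push({ type: 'text', text: parts.contentIndex, cache_control: { type: 'ephemeral' } });
  }
  // Dynamic body last and unmarked: everything after the final
  // cache_control marker is simply not cached.
  blocks.push({ type: 'text', text: parts.dynamicBody });
  return blocks;
}

// Breakpoint 3: tag only the last tool, which caches the whole tools array.
function tagToolsForCache<T extends object>(tools: T[]): (T & { cache_control?: { type: 'ephemeral' } })[] {
  return tools.map((tool, i) =>
    i === tools.length - 1 ? { ...tool, cache_control: { type: 'ephemeral' as const } } : tool,
  );
}
```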
`AIProvider.system` accepts `string | AISystemBlock[]`; string callers
(legacy paths, tests) get a single uncached block automatically. The
Anthropic provider also captures `cache_creation_input_tokens` and
`cache_read_input_tokens` from `message_start`/`message_delta` and
surfaces them through a 4-field `AIUsage` shape. The engine
accumulates all four buckets across the tool loop and forwards them
on the `done` event.
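Roughly how the capture works. The event field names follow Anthropic's Messages streaming API; `AIUsage` and the surrounding types are illustrative, not the exact shapes in this PR:

```ts
// 4-field usage shape forwarded on the `done` event (names illustrative).
interface AIUsage {
  inputTokens: number;              // non-cached input only; semantic unchanged
  outputTokens: number;
  cacheCreationInputTokens: number; // billed at 1.25x base input price
  cacheReadInputTokens: number;     // billed at ~0.1x base input price
}

// Minimal shapes for the two stream events that carry usage.
type UsageEvent =
  | { type: 'message_start'; message: { usage: { input_tokens: number; cache_creation_input_tokens?: number; cache_read_input_tokens?: number } } }
  | { type: 'message_delta'; usage: { output_tokens: number } };

async function captureUsage(stream: AsyncIterable<UsageEvent>): Promise<AIUsage> {
  const usage: AIUsage = { inputTokens: 0, outputTokens: 0, cacheCreationInputTokens: 0, cacheReadInputTokens: 0 };
  for await (const event of stream) {
    if (event.type === 'message_start') {
      // Input-side buckets arrive once, up front.
      const u = event.message.usage;
      usage.inputTokens = u.input_tokens;
      usage.cacheCreationInputTokens = u.cache_creation_input_tokens ?? 0;
      usage.cacheReadInputTokens = u.cache_read_input_tokens ?? 0;
    } else {
      // Output count is cumulative; the last delta wins.
      usage.outputTokens = event.usage.output_tokens;
    }
  }
  return usage;
}
```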
Persistence:
Migration 008 ships as additive `_v2` RPCs and new columns:
- `agent_usage` / `api_message_usage` / `messages` each gain `cache_creation_input_tokens` and `cache_read_input_tokens`
- new RPCs: `increment_agent_usage_tokens_v2`, `increment_api_usage_tokens_v2`
`_v1` RPCs stay registered so a rolling deploy doesn't have to
coordinate schema-and-app cutover. App code calls `_v2` exclusively;
`_v1` becomes legacy and can be dropped in a future cleanup migration.
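Call-site shape with supabase-js; the parameter names below are illustrative guesses, not the actual RPC signature:

```ts
// Assuming a configured supabase-js client in scope.
// App code talks to _v2 only; _v1 stays registered so app instances
// still running old code keep working mid-rolling-deploy.
const { error } = await supabase.rpc('increment_agent_usage_tokens_v2', {
  // Hypothetical parameter names:
  p_project_id: projectId,
  p_input_tokens: usage.inputTokens,
  p_output_tokens: usage.outputTokens,
  p_cache_creation_input_tokens: usage.cacheCreationInputTokens,
  p_cache_read_input_tokens: usage.cacheReadInputTokens,
});
if (error) throw error;
```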
`saveChatResult` / `saveApiChatResult` switched to object-form
arguments — the positional list had grown unwieldy with four
token fields plus ten other params.
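A sketch of the object-form call, with illustrative field names:

```ts
// Before (positional; unwieldy at ~14 params):
// await saveChatResult(conversationId, userId, content, model, inputTokens, ...);

// After (object-form; the four token buckets travel together):
await saveChatResult({
  conversationId,
  userId,
  content,
  usage: {
    inputTokens,
    outputTokens,
    cacheCreationInputTokens,
    cacheReadInputTokens,
  },
  // ...remaining fields (model, finish reason, etc.); names illustrative
});
```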
Business semantic preserved: cache is a Contentrain-side cost win.
Plan quotas stay message-based; cache_read tokens DO NOT earn extra
messages. `input_tokens` semantic is unchanged (= non-cached input),
so existing dashboard queries summing it stay correct.
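So only the internal cost view (never the quota view) folds the cache buckets together. A sketch using the multipliers cited above, with `baseInputRate` standing in for the model's base $/MTok input price:

```ts
// Internal spend per request. Quota accounting ignores all of this:
// it stays message-based, and cache_read tokens never mint messages.
function effectiveInputCostUSD(u: AIUsage, baseInputRate: number): number {
  const tokenEquivalents =
    u.inputTokens +                     // 1.0x: non-cached input
    1.25 * u.cacheCreationInputTokens + // cache write premium
    0.1 * u.cacheReadInputTokens;       // ~10% of base price on hits
  return (tokenEquivalents / 1_000_000) * baseInputRate;
}
```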
Tests:
- anthropic-ai: three-bucket stream-event capture; system-block +
tools `cache_control` mapping to SDK shape.
- agent-system-prompt-cache: static body is byte-identical across UI-context / intent / state changes (the actual cache-hit invariant); `contentIndex` stays a separate block; `toSystemBlocks` emits at most 2 cached blocks so the tools breakpoint stays available.
- db: cache tokens propagate through `saveChatResult` to both
`agent_usage` and the `messages` row.
- chat-route / overage-soft-cap integration mocks updated to the
new helper names and object-form save signature.
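For flavor, the shape of the cache-hit invariant test. The import path and fixtures are invented; only the invariant itself comes from this PR:

```ts
import { expect, it } from 'vitest';

// Hypothetical import path and context fixture.
import { buildSystemPromptBlocks } from '../src/agent/system-prompt';
import { baseCtx } from './fixtures';

it('static body stays byte-identical across per-request variation', () => {
  const a = buildSystemPromptBlocks({ ...baseCtx, uiContext: { activeModelId: 'posts' } });
  const b = buildSystemPromptBlocks({ ...baseCtx, uiContext: { activeModelId: 'pages' }, intent: 'out_of_scope' });

  // One byte of drift in the static body busts the Anthropic prefix
  // cache and turns every turn into a 1.25x cache write.
  expect(a.staticBody).toBe(b.staticBody);

  // The dynamic body is where the variation is allowed to land.
  expect(a.dynamicBody).not.toBe(b.dynamicBody);
});
```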
Out of scope (separate follow-ups):
- Message-level cache breakpoints (history mutates per turn —
needs prefix-stability analysis).
- 1-hour cache TTL beta.
- Cache hit-rate dashboard UI.
- History budget increase — defer until we have observed hit rates
from this PR's accounting.
Summary
System prompt + tools were rebuilt and re-sent uncached on every request. Brain content index alone can grow past 10K tokens, so a typical 10-turn session was paying for the same prefix ten times. Anthropic's prompt cache cuts that prefix to ~10% of base input price when reused within 5 minutes; this PR wires the markers up.
Why this PR needed real surgery
The naive shape — wrap the existing `buildSystemPrompt(...)` string in a cached block — would have miscached because:
- `buildSchemaSection` embedded the active-model marker (`### ▶ Posts` vs `### Posts`) keyed on `uiContext.activeModelId`.
- `buildRulesSection` injected the `out_of_scope` rule based on `intent`.
Cache markers placed over content that varies per request bust the prefix and pay the 1.25x creation penalty on every turn — net negative. So the prompt builder is actually split:
- `buildStaticBody` — role, architecture, config, schema (no active-model marker), relations, vocab, permissions, base rules, custom instructions
- `buildDynamicBody` — UI context (active-model annotation here), inferred intent, project state, intent-specific rules
- `buildContentIndex` — already separate (brain cache); its own cached block

`buildSystemPromptBlocks(...)` returns the three pieces; `toSystemBlocks(...)` materializes them as `AISystemBlock[]` with `cache_control` on the static body + content index (2 of 4 breakpoints). The last `AITool` gets `cache_control` too (3rd breakpoint, ~95% hit rate within a session).

Provider surface
The Anthropic provider captures all three input buckets from `message_start`/`message_delta` and yields a normalized `message_end` once. The engine accumulates across the tool loop and forwards on `done`.

Migration 008 — additive `_v2` RPCs
Per review feedback, RPCs ship as `_v2` rather than mutating signatures in place. Mid-deploy schema-and-app skew on Supabase is a real failure mode; `_v1` stays registered, unused, and can be dropped in a future cleanup. `saveChatResult` / `saveApiChatResult` switched to object-form args (the positional list grew unwieldy).

Business semantic — UNCHANGED
Cache is a Contentrain-side cost win, not a customer-facing quota expansion:
- Plan quotas stay message-based (`ai.messages_per_month`, `api.messages_per_month`).
- `cache_read_input_tokens` do NOT earn extra messages.
- `input_tokens` semantic unchanged (= non-cached input only).
- Existing dashboard queries summing `input_tokens` stay correct.

Test plan
- `pnpm typecheck` clean
- `pnpm lint` — 0 errors
- `pnpm test` — 618 passed (608 + 10 new)
- anthropic-ai: three-bucket stream capture, system-block + tools `cache_control` mapping
- agent-system-prompt-cache: static body byte-identical across UI/intent/state changes (the actual cache-hit invariant), contentIndex separation, `toSystemBlocks` ≤ 2 cached blocks
- db: cache tokens propagate through `saveChatResult` to `agent_usage` + `messages` row

Out of scope (separate PRs)
Sources