
perf(ai): anthropic prompt cache + cache token accounting #53

Merged

ABB65 merged 1 commit into main from feat/ai-prompt-cache on May 15, 2026

Conversation

ABB65 (Member) commented on May 15, 2026

Summary

System prompt + tools were rebuilt and re-sent uncached on every request. Brain content index alone can grow past 10K tokens, so a typical 10-turn session was paying for the same prefix ten times. Anthropic's prompt cache cuts that prefix to ~10% of base input price when reused within 5 minutes; this PR wires the markers up.
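Back-of-envelope on why this matters, using the numbers above (illustrative only; Anthropic's published multipliers are 1.25x base input price for cache writes and ~0.1x for cache reads):

// Illustrative arithmetic, prices normalized to 1.0 per full prefix send.
const turns = 10
const uncachedCost = turns * 1.0                 // prefix re-sent at full price every turn
const cachedCost = 1 * 1.25 + (turns - 1) * 0.1  // one cache write, nine cache reads
// uncachedCost = 10.0, cachedCost = 2.15 → ~78% cheaper on the shared prefix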

Why this PR needed real surgery

The naive shape — wrap the existing buildSystemPrompt(...) string in a cached block — would have miscached because:

  • buildSchemaSection embedded the active-model marker (### ▶ Posts vs ### Posts) keyed on uiContext.activeModelId.
  • buildRulesSection injected the out_of_scope rule based on intent.

Cache markers placed over content that varies per request bust the prefix and pay the 1.25x creation penalty on every turn — net negative. So the prompt builder is actually split (sketched below):

  • buildStaticBody — role, architecture, config, schema (no active-model marker), relations, vocab, permissions, base rules, custom instructions
  • buildDynamicBody — UI context (active-model annotation here), inferred intent, project state, intent-specific rules
  • buildContentIndex — already separate (brain cache); its own cached block
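A minimal sketch of the resulting split. Only the build* function names come from this PR; SystemPromptBlocks, PromptContext, and the parameter shapes are invented for illustration:

// Hypothetical shapes — parameter types are stand-ins, not the PR's exact ones.
type PromptContext = {
  project: unknown; uiContext: unknown; intent: unknown
  projectState: unknown; brain: unknown
}

interface SystemPromptBlocks {
  staticBody: string    // must stay byte-identical across requests to cache-hit
  dynamicBody: string   // varies per turn; never carries a cache marker
  contentIndex: string  // brain content index, cached as its own block
}

declare function buildStaticBody(project: unknown): string
declare function buildDynamicBody(ui: unknown, intent: unknown, state: unknown): string
declare function buildContentIndex(brain: unknown): string

function buildSystemPromptBlocks(ctx: PromptContext): SystemPromptBlocks {
  return {
    staticBody: buildStaticBody(ctx.project),  // no uiContext or intent inputs allowed here
    dynamicBody: buildDynamicBody(ctx.uiContext, ctx.intent, ctx.projectState),
    contentIndex: buildContentIndex(ctx.brain),
  }
}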

buildSystemPromptBlocks(...) returns the three pieces; toSystemBlocks(...) materializes them as AISystemBlock[] with cache_control on the static body + content index (2 of Anthropic's 4 explicit breakpoints). The Studio chat handler and the Conversation API handler both compose the prompt this way and tag the last AITool with cache_control (3rd breakpoint; tools rarely change within a session, ~95% hit rate).
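Concretely, the mapping might look like this (a sketch reusing the shapes above; the 'text' block type and the block ordering are assumptions):

// A cache_control marker caches everything up to and including its block,
// so order matters: static prefix first, cached index second, and the
// per-request dynamic tail after the final marker.
function toSystemBlocks(b: SystemPromptBlocks): AISystemBlock[] {
  return [
    { type: 'text', text: b.staticBody, cacheControl: { type: 'ephemeral' } },
    { type: 'text', text: b.contentIndex, cacheControl: { type: 'ephemeral' } },
    { type: 'text', text: b.dynamicBody }, // varies per turn → never cached
  ]
}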

Provider surface

// ai.ts
system: string | AISystemBlock[]         // string callers (legacy paths, tests) auto-wrapped to a single uncached block
AISystemBlock.cacheControl?: { type: 'ephemeral' }
AITool.cacheControl?: { type: 'ephemeral' }
AIUsage: { inputTokens, outputTokens, cacheCreationInputTokens, cacheReadInputTokens }

Anthropic provider captures all three input buckets from message_start / message_delta and yields normalized message_end once. Engine accumulates across the tool loop and forwards on done.
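Roughly, the capture looks like this (a sketch; the snake_case event fields follow Anthropic's documented streaming API, where message_start carries the input-side buckets and message_delta carries output_tokens — AIUsage is the shape from ai.ts above):

// Sketch: fold Anthropic stream events into the 4-field AIUsage shape.
async function captureUsage(stream: AsyncIterable<any>): Promise<AIUsage> {
  const usage: AIUsage = {
    inputTokens: 0, outputTokens: 0,
    cacheCreationInputTokens: 0, cacheReadInputTokens: 0,
  }
  for await (const event of stream) {
    if (event.type === 'message_start') {
      const u = event.message.usage
      usage.inputTokens = u.input_tokens
      usage.cacheCreationInputTokens = u.cache_creation_input_tokens ?? 0
      usage.cacheReadInputTokens = u.cache_read_input_tokens ?? 0
    } else if (event.type === 'message_delta') {
      usage.outputTokens = event.usage.output_tokens
    }
  }
  return usage // yielded once as the normalized message_end usage
}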

Migration 008 — additive _v2 RPCs

Per review feedback, RPCs ship as _v2 rather than mutating signatures in place. Mid-deploy schema-and-app skew on Supabase is a real failure mode; _v1 stays registered, unused, and can be dropped in a future cleanup.

ALTER TABLE agent_usage / api_message_usage / messages
  ADD COLUMN cache_creation_input_tokens
  ADD COLUMN cache_read_input_tokens

CREATE FUNCTION increment_agent_usage_tokens_v2(... + 2 cache params)
CREATE FUNCTION increment_api_usage_tokens_v2(... + 2 cache params)

saveChatResult / saveApiChatResult switched to object-form args — the positional list had grown unwieldy with four token fields plus ten other params.
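Roughly (a hypothetical signature; field names beyond the four token buckets are illustrative, not the PR's exact list):

// Object-form args: invented shape for illustration only.
interface SaveChatResultArgs {
  projectId: string
  messageId: string
  usage: AIUsage          // all four buckets; persisted via the _v2 RPCs
  // ...the remaining ~ten params elided
}
declare function saveChatResult(args: SaveChatResultArgs): Promise<void>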

Business semantic — UNCHANGED

Cache is a Contentrain-side cost win, not a customer-facing quota expansion:

  • Plan quotas stay message-based (ai.messages_per_month, api.messages_per_month).
  • cache_read_input_tokens do NOT earn extra messages.
  • input_tokens semantic unchanged (= non-cached input only).
  • Existing dashboard queries summing input_tokens stay correct.
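In code terms, the accounting invariant is (a sketch; the 1.25x/0.1x multipliers are Anthropic's published pricing ratios, and this covers the prompt side only):

// inputTokens keeps its old meaning (non-cached input only), so existing
// dashboard queries that SUM(input_tokens) are unaffected; the two cache
// buckets are additive columns that only feed Contentrain-side cost views.
function promptCostUSD(u: AIUsage, basePerMTok: number): number {
  return (u.inputTokens
    + 1.25 * u.cacheCreationInputTokens
    + 0.1 * u.cacheReadInputTokens) * basePerMTok / 1_000_000
}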

Test plan

  • pnpm typecheck clean
  • pnpm lint — 0 errors
  • pnpm test — 618 passed (608 + 10 new)
    • 3 new anthropic-ai: three-bucket stream capture, system-block + tools cache_control mapping
    • 6 new agent-system-prompt-cache: static body byte-identical across UI/intent/state changes (the actual cache-hit invariant), contentIndex separation, toSystemBlocks emits ≤ 2 cached blocks so the tools breakpoint stays available
    • 1 new db: cache tokens propagate through saveChatResult to agent_usage + messages row
  • chat-route / overage-soft-cap integration mocks updated to the new helper names and object-form save signature
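The core invariant test, roughly (a vitest-style sketch; baseCtx and the fixture values are invented):

import { expect, it } from 'vitest'

// If this assertion holds, the cached prefix survives UI-context changes.
it('static body is byte-identical across activeModelId changes', () => {
  const a = buildSystemPromptBlocks({ ...baseCtx, uiContext: { activeModelId: 'posts' } })
  const b = buildSystemPromptBlocks({ ...baseCtx, uiContext: { activeModelId: 'pages' } })
  expect(a.staticBody).toBe(b.staticBody)         // cache-hit invariant
  expect(a.dynamicBody).not.toBe(b.dynamicBody)   // active-model marker lives here now
})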

Out of scope (separate PRs)

  • Message-level cache breakpoint — history mutates per turn, needs prefix-stability analysis.
  • 1-hour cache TTL beta — defer until we observe hit rates from this PR.
  • Cache hit-rate dashboard UI — accounting columns ship here; UI is a separate concern.
  • History budget increase — once we have observed hit rates, the conservative Sonnet/Opus values from PR #52 (refactor(chat): shared history builder with model/plan/source-aware budgets) can grow safely.

ABB65 merged commit 7580e4e into main on May 15, 2026 (1 check passed).
ABB65 deleted the feat/ai-prompt-cache branch on May 15, 2026 at 22:22.