System prompt forwarded with per-request x-anthropic-billing-header line — defeats upstream prompt cache #61

@ajcasagrande

Summary

When converting an Anthropic /v1/messages request to an OpenAI Responses API payload, the proxy has two independent issues that defeat the upstream prompt cache:

  1. The inbound system array is concatenated into instructions with no filter for non-semantic Anthropic header lines (whose hashes vary per request).
  2. The Codex adapter overwrites any inbound prompt_cache_key with a fresh uuid.uuid4() per request.

Either alone breaks the cache. Together, every turn presents Codex with a unique prefix AND a unique cache key, so prompt caching is effectively disabled.

Background

Claude Code prefixes its system prompt with a line of the form:

x-anthropic-billing-header: cc_version=2.1.117.48f; cc_entrypoint=cli; cch=71fea;

The cch=<hash> portion regenerates on every request. The line carries no semantic value to the model — it's a billing/telemetry header. When forwarded into the upstream instructions field unchanged, every turn presents a brand-new prefix to the backend.

The OpenAI Responses API exposes a prompt_cache_key field that lets a client pin cache routing across requests with otherwise-equivalent prefixes. Setting it to a stable value (e.g. a session/conversation identifier) is part of the cache contract.
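
For illustration, the outbound payload the proxy builds would carry a stable key alongside the instructions; the variable names and the session identifier below are hypothetical stand-ins, not the proxy's actual code:

# sketch of an outbound Responses payload with a stable cache key (illustrative)
payload_data = {
    "model": template_model,            # whatever model the Codex template selects
    "instructions": cleaned_system,     # byte-identical across turns once Fix #1 lands
    "input": responses_input,           # the converted message list
    "prompt_cache_key": "session-7f3c2a",  # stable per conversation, not per request
}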

Root cause #1 — billing header forwarded raw

ccproxy/llms/formatters/anthropic_to_openai/requests.py:

if request.system:
    if isinstance(request.system, str):
        instructions_text = request.system
        payload_data["instructions"] = request.system
    else:
        joined = "".join(block.text for block in request.system if block.text)
        instructions_text = joined or None
        if joined:
            payload_data["instructions"] = joined

Every text block is concatenated as-is. There's no filter for the x-anthropic-billing-header: line (or any other non-semantic Anthropic headers), so it propagates verbatim into the upstream instructions.
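
To make the breakage concrete, here is a minimal sketch (the second cch value and the trailing prompt text are made up) showing that two otherwise-identical system prompts join into byte-different instructions, so any prefix-based cache treats them as distinct:

import hashlib

turn_1 = (
    "x-anthropic-billing-header: cc_version=2.1.117.48f; cc_entrypoint=cli; cch=71fea;\n"
    "You are Claude Code, an agentic coding assistant."
)
turn_2 = turn_1.replace("cch=71fea;", "cch=9b0d1;")  # only the per-request hash differs

assert turn_1 != turn_2  # different bytes -> different prefix for the upstream cache
print(hashlib.sha256(turn_1.encode()).hexdigest()[:12])
print(hashlib.sha256(turn_2.encode()).hexdigest()[:12])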

Root cause #2 — prompt_cache_key randomized per request

ccproxy/plugins/codex/adapter.py, around line 644:

if "prompt_cache_key" not in merged:
    prompt_cache_key = template.get("prompt_cache_key")
    if isinstance(prompt_cache_key, str) and prompt_cache_key:
        merged["prompt_cache_key"] = str(uuid.uuid4())

When the merged request lacks a prompt_cache_key and the template has one configured, the adapter discards the template value and assigns a fresh uuid.uuid4(), so each request presents a brand-new cache key. (As a separate concern, this branch only fires when the template carries a non-empty prompt_cache_key; when it doesn't, nothing runs and the field may be missing from the outbound payload entirely.)

Impact

Each issue alone breaks the cache. Together:

  • Cost. Long multi-turn sessions re-bill the full input context (often a large CLAUDE.md plus tool catalog and accumulated history) on every turn. Typical multiplier vs. correct cache reuse: 5–10×.
  • Latency. Cache-miss prefills are slower than cache hits, so every turn after the first feels sluggish.
  • Subscription throughput. Per-account or per-plan rate limits exhaust faster than they should because effective input-token throughput is lower.

Suggested fix

Fix #1 — strip non-semantic Anthropic header lines

In ccproxy/llms/formatters/anthropic_to_openai/requests.py, run system text through a stripper before joining:

def _strip_nonsemantic_system_lines(text: str) -> str:
    return "\n".join(
        line for line in text.splitlines()
        if not line.strip().lower().startswith("x-anthropic-billing-header:")
    ).strip()


# in the request builder:
if request.system:
    if isinstance(request.system, str):
        cleaned = _strip_nonsemantic_system_lines(request.system)
        if cleaned:
            payload_data["instructions"] = cleaned
            instructions_text = cleaned
    else:
        cleaned_blocks = (
            _strip_nonsemantic_system_lines(block.text or "")
            for block in request.system
            if block.text
        )
        joined = "\n\n".join(b for b in cleaned_blocks if b)
        if joined:
            payload_data["instructions"] = joined
            instructions_text = joined
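
For reference, a quick check of the stripper against a representative system block (the prompt body here is a stand-in for the real Claude Code system prompt):

system_text = (
    "x-anthropic-billing-header: cc_version=2.1.117.48f; cc_entrypoint=cli; cch=71fea;\n"
    "You are Claude Code, an agentic coding assistant."
)
print(_strip_nonsemantic_system_lines(system_text))
# -> You are Claude Code, an agentic coding assistant.
# The result no longer depends on the per-request cch hash, so the joined
# instructions stay byte-identical across turns with the same prompt.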

Fix #2 — derive prompt_cache_key deterministically

Instead of str(uuid.uuid4()), derive the key from a stable identifier that survives across requests in the same conversation. Two reasonable approaches:

# (a) keep the template-configured key as-is when present:
if "prompt_cache_key" not in merged:
    template_key = template.get("prompt_cache_key")
    if isinstance(template_key, str) and template_key:
        merged["prompt_cache_key"] = template_key   # stable; not randomized

# (b) derive from a stable session identifier (preferred when no template key exists):
if "prompt_cache_key" not in merged:
    session_id = ...  # e.g. the conversation/session id available in the request context
    if session_id:
        merged["prompt_cache_key"] = session_id

Generating a fresh UUID per request is the pessimal case — it should only happen when the caller explicitly opts into a cold cache; otherwise the proxy should pass through or compute a deterministic key.
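
One possible shape for the combined fallback, as a sketch: it assumes the adapter can see a per-conversation identifier (session_id here is hypothetical), and only hashes the already-cleaned instructions as a last resort so identical prefixes still map to the same key.

import hashlib


def derive_prompt_cache_key(merged: dict, template: dict, session_id: str | None) -> str | None:
    # 1. respect an explicit inbound key from the caller
    existing = merged.get("prompt_cache_key")
    if isinstance(existing, str) and existing:
        return existing
    # 2. reuse the template-configured key verbatim (approach (a) above)
    template_key = template.get("prompt_cache_key")
    if isinstance(template_key, str) and template_key:
        return template_key
    # 3. fall back to a stable conversation identifier (approach (b) above)
    if session_id:
        return session_id
    # 4. last resort: key the cache on the stable instruction prefix itself
    instructions = merged.get("instructions")
    if isinstance(instructions, str) and instructions:
        return hashlib.sha256(instructions.encode()).hexdigest()
    return None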

Reproduction

  1. Send two consecutive Anthropic /v1/messages requests, each with a system array beginning with a text block containing x-anthropic-billing-header: cc_version=...; cch=<hash>; (real Claude Code traffic does this automatically; the two cch values will differ).
  2. Capture the outbound Codex Responses payload from each.
  3. Observe:
    • instructions differs in the first line (different cch=<hash>)
    • prompt_cache_key differs entirely (fresh uuid.uuid4() each time)
  4. Compare upstream-reported cache hit metrics — both will be cold prefills.

After applying both fixes, identical-prefix conversations will share a prompt_cache_key and the instructions text will be byte-identical across calls, allowing the upstream cache to hit on the shared prefix.
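
For steps 3 and 4, a small check over the two captured payloads (saved to JSON files; the file names are hypothetical) makes the before/after difference obvious:

import json

with open("turn1_payload.json") as f1, open("turn2_payload.json") as f2:
    p1, p2 = json.load(f1), json.load(f2)

# before the fixes both comparisons are False; afterwards both should be True
print("instructions identical: ", p1.get("instructions") == p2.get("instructions"))
print("prompt_cache_key stable:", p1.get("prompt_cache_key") == p2.get("prompt_cache_key"))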
