Problem
ACP agents that don't report costs natively (e.g. Gemini CLI, Codex) currently rely on token-count-based cost estimation using LiteLLM's pricing database. This works but is an approximation — it doesn't account for tiered pricing, cached token discounts, or pricing changes.
Since all ACP agents route through the LiteLLM proxy (LLM_BASE_URL), the proxy already tracks actual per-request costs in its spend logs. We should query these instead of estimating.
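To make the gap concrete, here is a minimal sketch comparing a flat token-count estimate with a bill that discounts cached input tokens. All prices and token counts are hypothetical placeholders chosen for easy arithmetic, not real provider rates, and the function names are illustrative only.

```python
# Hypothetical per-million-token prices in USD; NOT real provider rates.
INPUT_PRICE = 3.00
CACHED_INPUT_PRICE = 0.30  # assumed discounted rate for cache reads
OUTPUT_PRICE = 15.00


def flat_estimate(input_tokens: int, output_tokens: int) -> float:
    """Estimate cost from raw token counts, ignoring any cache discount."""
    return (input_tokens * INPUT_PRICE + output_tokens * OUTPUT_PRICE) / 1_000_000


def billed_cost(input_tokens: int, cached_tokens: int, output_tokens: int) -> float:
    """Cost when `cached_tokens` of the input are billed at the cached rate."""
    uncached_tokens = input_tokens - cached_tokens
    return (
        uncached_tokens * INPUT_PRICE
        + cached_tokens * CACHED_INPUT_PRICE
        + output_tokens * OUTPUT_PRICE
    ) / 1_000_000


estimate = flat_estimate(input_tokens=1_000_000, output_tokens=100_000)
actual = billed_cost(input_tokens=1_000_000, cached_tokens=800_000, output_tokens=100_000)
print(f"estimate=${estimate:.2f} actual=${actual:.2f}")
```

With these made-up numbers the flat estimate is roughly double the discounted bill, which is the kind of drift that per-key spend from the proxy would avoid.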
Proposed Solution: Virtual key per instance
Use LiteLLM virtual keys — one per eval instance. The proxy tracks spend per key automatically, so we get exact costs without any ACP protocol or server changes.
How it works:
- Before each instance, create a virtual key via the LiteLLM admin API:
    resp = httpx.post(
        f"{LLM_BASE_URL}/key/generate",
        headers={"Authorization": f"Bearer {LITELLM_MASTER_KEY}"},
        json={
            "metadata": {"instance_id": "django__django-12155", "run_id": "23764348286"},
            "max_budget": 50.0,  # safety limit per instance
        },
    )
    resp.raise_for_status()  # fail fast if the proxy rejects the request
    virtual_key = resp.json()["key"]
- Pass the virtual key to the ACP agent as ANTHROPIC_API_KEY / OPENAI_API_KEY (depending on agent type). No ACP server changes are needed, since the servers already read these env vars.
- After the instance completes, query the actual cost:
    resp = httpx.get(f"{LLM_BASE_URL}/key/info", params={"key": virtual_key},
                     headers={"Authorization": f"Bearer {LITELLM_MASTER_KEY}"})
    resp.raise_for_status()
    actual_cost = resp.json()["info"]["spend"]  # USD spend as tracked by the proxy
- Store the cost in the instance output and delete the virtual key:
    httpx.post(f"{LLM_BASE_URL}/key/delete", json={"keys": [virtual_key]},
               headers={"Authorization": f"Bearer {LITELLM_MASTER_KEY}"})
Why this works for all agents:
- Every API call goes through the LiteLLM proxy regardless of agent type
- The proxy calculates exact per-request cost (including tiered pricing, cache discounts)
- Spend is tracked per key — no metadata injection or header forwarding needed
- Works for Claude Code, Codex, Gemini CLI, and any future ACP server
Implementation
New utility module (benchmarks/utils/litellm_proxy.py):
- create_virtual_key(instance_id, run_id) -> str
- get_key_spend(key) -> float
- delete_key(key)
- Uses LLM_BASE_URL (existing) and LITELLM_MASTER_KEY (new secret)
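A minimal sketch of what this module could look like, built on the three admin endpoints already used in the steps above (/key/generate, /key/info, /key/delete). The `_request` helper, the `Sender` type, the `send` and `max_budget` parameters, and the use of the standard library's `urllib` in place of `httpx` are assumptions made to keep the example dependency-free and testable; they are not part of the proposal.

```python
import json
import os
import urllib.parse
import urllib.request
from typing import Any, Callable, Optional

# A sender takes (method, url, headers, body) and returns the decoded JSON reply.
Sender = Callable[[str, str, dict, Optional[bytes]], Any]


def _urllib_send(method: str, url: str, headers: dict, body: Optional[bytes]) -> Any:
    """Default sender; urlopen raises HTTPError on non-2xx responses."""
    req = urllib.request.Request(url, data=body, headers=headers, method=method)
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.loads(resp.read().decode("utf-8"))


def _request(method: str, path: str, *, params: Optional[dict] = None,
             payload: Optional[dict] = None, send: Sender = _urllib_send) -> Any:
    """Call the proxy admin API, authenticating with the master key."""
    url = os.environ["LLM_BASE_URL"].rstrip("/") + path
    if params:
        url += "?" + urllib.parse.urlencode(params)
    headers = {
        "Authorization": f"Bearer {os.environ['LITELLM_MASTER_KEY']}",
        "Content-Type": "application/json",
    }
    body = json.dumps(payload).encode("utf-8") if payload is not None else None
    return send(method, url, headers, body)


def create_virtual_key(instance_id: str, run_id: str, *, max_budget: float = 50.0,
                       send: Sender = _urllib_send) -> str:
    """Create a per-instance virtual key tagged with instance and run IDs."""
    payload = {
        "metadata": {"instance_id": instance_id, "run_id": run_id},
        "max_budget": max_budget,
    }
    return _request("POST", "/key/generate", payload=payload, send=send)["key"]


def get_key_spend(key: str, *, send: Sender = _urllib_send) -> float:
    """Return the spend the proxy has recorded for `key`."""
    info = _request("GET", "/key/info", params={"key": key}, send=send)["info"]
    return float(info.get("spend") or 0.0)


def delete_key(key: str, *, send: Sender = _urllib_send) -> None:
    """Delete the virtual key once its spend has been stored."""
    _request("POST", "/key/delete", payload={"keys": [key]}, send=send)
```

Passing a fake `send` lets the functions be unit-tested without a running proxy; in the harness the default sender would be used.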
Benchmarks harness changes (per-benchmark run_infer.py):
- Before instance: create virtual key
- Pass the virtual key to the agent instead of the shared LLM_API_KEY
- After instance: query spend, store in output, delete key
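The per-instance flow in the list above can be sketched as a small wrapper. This is an illustration under assumptions: `run_instance_with_cost`, the `run_agent` callable, and the `proxy_cost_usd` output field are hypothetical names, and the three key helpers are passed in as parameters so the sketch stays self-contained.

```python
from typing import Any, Callable, Dict


def run_instance_with_cost(
    instance_id: str,
    run_id: str,
    run_agent: Callable[[str], Dict[str, Any]],
    create_virtual_key: Callable[[str, str], str],
    get_key_spend: Callable[[str], float],
    delete_key: Callable[[str], None],
) -> Dict[str, Any]:
    """Run one instance under its own virtual key and record proxy-reported spend."""
    key = create_virtual_key(instance_id, run_id)
    try:
        # The agent receives the virtual key instead of the shared LLM_API_KEY.
        output = run_agent(key)
        # Hypothetical output field holding the spend reported by the proxy.
        output["proxy_cost_usd"] = get_key_spend(key)
        return output
    finally:
        # Always clean up, even if the agent or the spend query raised.
        delete_key(key)
```

The `try`/`finally` keeps failed instances from leaving stray keys behind. A real harness might also want to read the spend before deleting the key when the agent raises, so that the cost of failed runs is not lost.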
ACP env forwarding (benchmarks/utils/acp.py):
- Override ANTHROPIC_API_KEY / OPENAI_API_KEY with the virtual key
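One possible shape for the override, as a sketch: `build_agent_env` and `KEY_ENV_VARS` are hypothetical names, and setting both variables regardless of agent type is a simplifying assumption (the proposal selects the variable by agent type).

```python
from typing import Dict, Mapping

# Env vars named in this proposal that carry the provider API key.
KEY_ENV_VARS = ("ANTHROPIC_API_KEY", "OPENAI_API_KEY")


def build_agent_env(base_env: Mapping[str, str], virtual_key: str) -> Dict[str, str]:
    """Return a copy of `base_env` with the API key vars set to the virtual key."""
    env = dict(base_env)  # copy so the caller's environment is not mutated
    for name in KEY_ENV_VARS:
        env[name] = virtual_key
    return env
```

Returning a copy avoids leaking one instance's key into the environment used for the next instance.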
New infrastructure:
- LITELLM_MASTER_KEY secret in eval K8s jobs
- LiteLLM proxy must have spend tracking enabled (database backend)
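Because the harness would now depend on a new secret, a fail-fast check at job start may help surface a missing value before any instance runs. The sketch below is an assumption about how that could look; `missing_proxy_config` and `REQUIRED_VARS` are hypothetical names.

```python
import os
from typing import List, Mapping

# Variables the cost-tracking path needs, per this proposal.
REQUIRED_VARS = ("LLM_BASE_URL", "LITELLM_MASTER_KEY")


def missing_proxy_config(env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the required variables that are unset or empty in `env`."""
    return [name for name in REQUIRED_VARS if not env.get(name)]
```

A job could call this once at startup and abort with the returned names if the list is non-empty. It does not verify that spend tracking is enabled on the proxy itself.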
Additional Context
ACPAgent._record_usage() as a stopgap