feat: query LiteLLM proxy for per-instance ACP costs via virtual keys #595
simonrosenberg merged 21 commits into main from feat/litellm-virtual-key-costs
Conversation
all-hands-bot left a comment
🟡 Acceptable - Solid engineering that solves a real problem, but missing unit tests.
What's Good:
- Solves a real problem: exact cost tracking from proxy vs token-count estimates
- Thread-safe design using `threading.local` instead of global state mutation
- Backward compatible: all functions no-op when `LITELLM_MASTER_KEY` unset
- Clean data structure: virtual key in thread-local → `acp_env` injection (no env pollution)
- Pragmatic error handling: logs warnings, doesn't crash evaluation
- Evidence provided: validation shows $0.014112 tracked vs $0.00 from SDK
Main Gap:
The new `benchmarks/utils/litellm_proxy.py` module has no unit tests. The validation script is useful but not a substitute for tests in the test suite. When this codebase adds new modules with public functions (`create_virtual_key`, `get_key_spend`, `delete_key`, thread-local accessors), those should have corresponding tests.
Verdict: Worth merging - the design is sound and solves a real problem. Tests can be added in a follow-up, but they shouldn't be skipped indefinitely.
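For the thread-local accessors specifically, such a test could look like the sketch below. The function names mirror the review's description but are illustrative stand-ins, not the module's actual API:

```python
import threading

# Illustrative stand-ins for the PR's thread-local accessors; the real
# module's names and signatures may differ.
_local = threading.local()

def set_current_virtual_key(key):
    _local.virtual_key = key

def get_current_virtual_key():
    return getattr(_local, "virtual_key", None)

def test_thread_local_isolation():
    # The main thread's key must not leak into a worker thread.
    set_current_virtual_key("sk-main")
    seen = {}

    def worker():
        seen["before"] = get_current_virtual_key()  # fresh thread: no key yet
        set_current_virtual_key("sk-worker")
        seen["after"] = get_current_virtual_key()

    t = threading.Thread(target=worker)
    t.start()
    t.join()
    assert seen["before"] is None
    assert seen["after"] == "sk-worker"
    assert get_current_virtual_key() == "sk-main"  # main thread unchanged
```

The same pattern extends to HTTP-facing functions by stubbing the client, so no test ever touches the real proxy.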
Force-pushed from 65efe0b to 2df7d7c
all-hands-bot left a comment
🟡 Acceptable - Pragmatic design that solves a real problem, but still missing unit tests after previous review.
What's Good:
- Thread-local storage is the right data structure for this problem
- Clean fail-open error handling (logs warnings, doesn't crash evals)
- Backward compatible by design (all functions no-op when unconfigured)
- Evidence shows it works end-to-end ($0.014 tracked from real Codex run)
What's Not:
You're adding ~160 lines of infrastructure code with HTTP requests, error handling, and thread-local state management. The previous review already flagged missing unit tests. They're still missing.
Force-pushed from e096fd7 to df73d23
all-hands-bot left a comment
🟡 Acceptable - Comprehensive tests added (addresses previous review), solid thread-safe design. However, the PR description lists changes to models.py, report_costs.py, and validate_virtual_keys.py that aren't in this diff. Also, clarify the LITELLM_MASTER_KEY vs LLM_API_KEY naming.
> so that concurrent worker threads (asyncio.to_thread) each track their own
> key without global state mutation. ``build_acp_agent`` in ``acp.py`` reads
> this thread-local to inject the key via ``acp_env``.
> """
🟡 Suggestion: The PR description says infrastructure needs LITELLM_MASTER_KEY, but the code expects LLM_API_KEY. Either:
- Update the docstring to clarify: "LLM_API_KEY should be set to the LiteLLM master key with admin permissions", or
- Update the PR description to say the K8s secret should be exposed as LLM_API_KEY

This prevents deployment confusion when someone reads the PR description but doesn't see the env var they expect in the code.
This was fixed in a subsequent commit — the code now uses LLM_API_MASTER_KEY (a dedicated SOPS secret), and the module docstring matches:
Requires:
- LLM_BASE_URL: The LiteLLM proxy URL (existing env var)
- LLM_API_MASTER_KEY: Admin key for virtual key management (SOPS secret)
The Helm chart change `add-llm-api-master-key` in OpenHands/evaluation injects LLM_API_MASTER_KEY from the SOPS-encrypted K8s secret into the pod env.
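The injection step the quoted docstring describes (read the thread-local virtual key, then override provider credentials in `acp_env`) can be sketched as below. The specific provider variable names are assumptions for illustration, not the PR's actual list:

```python
def layer_virtual_key(provider_env, virtual_key):
    """Build the env for the ACP subprocess: provider credentials first,
    then the per-instance virtual key layered on top when one exists."""
    env = dict(provider_env)
    if virtual_key is not None:
        # Hypothetical variable names; the real agent code may target others.
        for name in ("OPENAI_API_KEY", "GEMINI_API_KEY"):
            if name in env:
                env[name] = virtual_key
    return env
```

When no virtual key was created (proxy unconfigured), the provider env passes through untouched, which is what keeps the feature backward compatible.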
ACP agents (Codex, Gemini CLI) don't report costs natively through the ACP protocol, so accumulated_cost is $0. This adds per-instance virtual key tracking through the LiteLLM proxy to get exact costs per trajectory.

How it works:
- Before each instance, create a LiteLLM virtual key via /key/generate
- Pass the key to ACP agents via acp_env (thread-safe, no os.environ mutation)
- After instance completes, query exact spend via /key/info
- Store as proxy_cost on EvalOutput, delete the key

Fully backward compatible: all functions are no-ops when LITELLM_MASTER_KEY is not set.

Closes #592

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
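The key-creation step above can be sketched as a request builder. The payload field names follow LiteLLM's `/key/generate` key-management API, but treat them (and the `acp-` alias prefix) as assumptions; actually sending the request is left out so the sketch stays network-free:

```python
import os

def build_key_generate_request(instance_id, max_budget=50.0):
    """Assemble the POST /key/generate call described above.

    Returns (url, headers, json_body). The env var names match this
    commit (LITELLM_MASTER_KEY; later renamed in the PR).
    """
    base_url = os.environ["LLM_BASE_URL"].rstrip("/")
    master_key = os.environ["LITELLM_MASTER_KEY"]
    return (
        f"{base_url}/key/generate",
        {"Authorization": f"Bearer {master_key}"},
        # The budget cap bounds spend if a key ever leaks or a run loops.
        {"key_alias": f"acp-{instance_id}", "max_budget": max_budget},
    )
```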
… directly

Instead of adding a new field that every consumer must learn about, just set metrics.accumulated_cost with the proxy-reported value. Drop report_costs.py changes and validation script accordingly.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The CI eval jobs already have LLM_API_KEY (from secrets.LLM_API_KEY_EVAL) which has admin permissions on the LiteLLM proxy. No new secret needed. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The eval key already has admin permissions on the proxy — no need for a separate master key env var. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ed_cost

Keep both values in the output JSON — the SDK's token-count estimate in metrics.accumulated_cost and the exact proxy cost in test_result.proxy_cost — so they're both available in GCS.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Covers success paths, HTTP errors, connection errors, timeouts, config detection, and thread-local isolation. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When the LiteLLM proxy is configured but key creation fails, raise RuntimeError instead of returning None. This prevents silent fallback to real provider keys in acp_env. Also moves get_key_spend from GET (key in URL query param) to POST (key in request body) to avoid logging secrets in access logs. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Separates the admin key (virtual key CRUD) from the shared eval key to limit blast radius if LLM_API_KEY is ever exposed. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Includes:
- feat(acp): add Gemini CLI as supported ACP server (#2619)
- fix(workspace): avoid logging forwarded env values (#2630)
- fix(conversation): sanitize remote error logging (#2631)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Force-pushed from fd76dc9 to 2d7dede
The content hash was added to eval-base image tags in #599 to auto-rebuild when the SDK Dockerfile changes, but assembly (agent-layer) tags still used only the SDK SHA. This meant stale assembly images built on top of old base images (without ACP CLIs) were reused even after base images were rebuilt. Include the content hash in assembly tags so Dockerfile changes invalidate both base AND assembly image caches. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The build side now includes the Dockerfile content hash in assembly image tags (2396c60), but the pull side (IMAGE_TAG_PREFIX) still only used the SDK short SHA. This mismatch would cause consumers to look for images with the old tag format and fail to find the newly built images. Update IMAGE_TAG_PREFIX to include the same content hash so that run_infer.py and modal_patches.py construct tags that match what build_base_images.py produces. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The default IMAGE_TAG_PREFIX now includes both the SDK short SHA and the Dockerfile content hash. Update the test assertion accordingly. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The previous version.py change added the content hash to IMAGE_TAG_PREFIX globally, which would break legacy-build benchmarks (gaia, commit0, multiswebench, etc.) whose images don't include the content hash.

Instead, introduce get_phased_image_tag_prefix() which includes the content hash, and only use it in phased-build consumers:
- swebench/run_infer.py
- swebenchmultimodal/run_infer.py
- swtbench/run_infer.py
- modal_patches.py (swebench Modal path)

Legacy benchmarks continue using IMAGE_TAG_PREFIX (SDK SHA only).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
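Under the assumption that the content hash is a short digest of the Dockerfile bytes, the phased/legacy split described here can be sketched as follows (function names mirror the commit, but the details are illustrative):

```python
import hashlib
from pathlib import Path

def dockerfile_content_hash(dockerfile: Path, length: int = 8) -> str:
    # Short, stable digest of the Dockerfile so tags change whenever its
    # contents do (the digest length here is an assumption).
    return hashlib.sha256(dockerfile.read_bytes()).hexdigest()[:length]

def phased_image_tag_prefix(sdk_short_sha: str, dockerfile: Path) -> str:
    # Phased-build consumers: SDK short SHA plus Dockerfile content hash.
    return f"{sdk_short_sha}-{dockerfile_content_hash(dockerfile)}"

def legacy_image_tag_prefix(sdk_short_sha: str) -> str:
    # Legacy benchmarks (gaia, commit0, ...) keep the SDK SHA only.
    return sdk_short_sha
```

Any edit to the Dockerfile changes the digest, which invalidates both the base and assembly image caches on the phased path while leaving legacy tags stable.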
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…efix

run_eval.sh sets SDK_SHORT_SHA, which caused get_phased_image_tag_prefix() to short-circuit and return the SHA without content hash. Only an explicit IMAGE_TAG_PREFIX override should bypass the content hash computation.

Caught by validation runs: phased builds (swebench, swtbench) produced images with content hash in tags, but inference pulled images without it.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
LiteLLM's /key/info endpoint expects GET with a query param, not POST with JSON body. The 405 Method Not Allowed was preventing proxy cost tracking from working. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
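With the method fixed, the spend query looks like the sketch below: the key travels as a query parameter on a GET. The response shape (`info.spend`) follows LiteLLM's `/key/info` endpoint but should be treated as an assumption; the PR itself uses httpx, while this sketch injects the HTTP call so it stays dependency-free and testable offline:

```python
import json
import os
import urllib.parse
import urllib.request

def get_key_spend(virtual_key, http_get=None):
    """Query exact spend for a virtual key via GET /key/info.

    ``http_get`` is injectable so tests can stub the network call; the
    default uses urllib with a 10s timeout.
    """
    base_url = os.environ["LLM_BASE_URL"].rstrip("/")
    url = f"{base_url}/key/info?" + urllib.parse.urlencode({"key": virtual_key})
    headers = {"Authorization": f"Bearer {os.environ['LLM_API_MASTER_KEY']}"}

    if http_get is None:
        def http_get(url, headers):
            req = urllib.request.Request(url, headers=headers)
            with urllib.request.urlopen(req, timeout=10) as resp:
                return json.loads(resp.read())

    payload = http_get(url, headers)
    # "spend" lives under "info" in LiteLLM's response (assumed shape).
    return float(payload["info"]["spend"])
```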
…into feat/litellm-virtual-key-costs
…nges

- Fix test mocks: get_key_spend now uses httpx.get (not post) for /key/info endpoint (LiteLLM returns 405 for POST)
- Merge content-hash assembly tag changes from PR #602 to prevent stale image cache bugs when rebuilding

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
all-hands-bot left a comment
🟢 Good taste - Clean thread-safe design solving a real production problem.
Data Structure: Thread-local storage is the correct choice for isolated worker state. Clear ownership and lifecycle (create → use → query → delete). No shared mutable state.
Evidence: Strong production validation (10/10 instances with exact costs, $2.35 actual vs $6.53 SDK estimate). Real problem, real solution.
Pragmatism: Backward compatible (no-op when unconfigured), fail-hard where critical (key creation), fail-soft where acceptable (cleanup/query). Tests are comprehensive.
VERDICT: ✅ Worth merging - solid engineering with minor notes below.
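The headline validation numbers in this review can be sanity-checked with one line of arithmetic:

```python
# Figures quoted in the review above: SDK token-count estimate vs the
# exact spend reported by the LiteLLM proxy across 10 instances.
sdk_estimate = 6.53   # USD, metrics.accumulated_cost total
proxy_actual = 2.35   # USD, proxy-reported spend total
overestimate = sdk_estimate / proxy_actual
print(f"SDK over-reports by {overestimate:.1f}x")  # prints "SDK over-reports by 2.8x"
```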
> their tags so that Dockerfile changes invalidate cached assemblies.
> The tag format is: ``{sdk_short_sha}-{content_hash}-{custom_tag}-{target}``.
>
> Benchmarks on the legacy build path (gaia, commit0, etc.) should
🟡 Suggestion: Importing from swebench.build_base_images creates an architectural dependency from the general utils/version module to a benchmark-specific module. This is pragmatic for the documented use case ("phased-build benchmarks") and the lazy import avoids loading it unnecessarily, but it's worth noting this couples the modules. Acceptable trade-off given the clear documentation.
> logger.info("[litellm-proxy] Created virtual key for instance %s", instance_id)
> return key
> except Exception as e:
> raise RuntimeError(
🟢 Acceptable: Fail-hard error handling on key creation is the RIGHT choice - if you can't create a virtual key when the proxy is configured, that's a critical failure. The asymmetric design (fail-hard create, fail-soft cleanup/query) is intentional and correct.
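The asymmetry the reviewer describes can be sketched as two thin wrappers; `create_fn` and `delete_fn` stand in for hypothetical HTTP helpers and are not the module's actual API:

```python
import logging

logger = logging.getLogger("litellm-proxy")

def create_virtual_key_or_die(create_fn, instance_id):
    # Fail hard: without a virtual key, the agent would silently fall
    # back to real provider keys and costs would go untracked.
    try:
        return create_fn(instance_id)
    except Exception as e:
        raise RuntimeError(
            f"Failed to create virtual key for {instance_id}: {e}"
        ) from e

def delete_key_best_effort(delete_fn, key):
    # Fail soft: a leaked-but-budget-capped key is not worth failing an
    # otherwise completed evaluation over.
    try:
        delete_fn(key)
    except Exception as e:
        logger.warning("Could not delete virtual key: %s", e)
```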
> # The key is stored in thread-local so build_acp_agent() can inject
> # it into the ACP subprocess env. No-op when LITELLM_MASTER_KEY unset.
> run_id = os.getenv("UNIQUE_EVAL_NAME")
> virtual_key = create_virtual_key(instance.id, run_id=run_id)
🟢 Acceptable: Creating the virtual key before workspace prep means if prep fails, you create/delete a key unnecessarily. Previous review noted this - it's a pragmatic choice that keeps the code simple. The cost is negligible (one extra HTTP round-trip on failure path), and the clarity is worth it.
Co-authored-by: Vasco Schiavo <115561717+VascoSch92@users.noreply.github.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Summary
Adds per-instance LiteLLM virtual key tracking to get exact costs from the proxy for ACP agents (Codex, Gemini CLI) that don't report costs natively through the ACP protocol.
Also includes changes from #602 (content-hash assembly tags) to prevent stale image bugs when rebuilding.
How it works
1. `POST /key/generate` (with $50 safety cap)
2. Pass the key via `acp_env`: overrides the provider API key from `_get_acp_env()` so the proxy tracks spend independently
3. `GET /key/info?key=...`, store as `proxy_cost` in `test_result`
4. `POST /key/delete`

Design decisions
- `threading.local` for virtual key storage: each worker thread tracks its own key without global state mutation
- Fail hard on key creation: raise `RuntimeError` instead of silently falling back to real provider keys
- No-op (return `None`) when `LLM_API_MASTER_KEY`/`LLM_BASE_URL` are unset
- `_get_acp_env()`: starts with provider credentials from the merged `forward_env` removal, then overrides API keys with the virtual key when available

Files changed
- `benchmarks/utils/litellm_proxy.py`
- `benchmarks/utils/acp.py`: `build_acp_agent()` injects virtual key via `acp_env`, layered on `_get_acp_env()`
- `benchmarks/utils/evaluation.py`: `_execute_single_attempt()` hooks key lifecycle
- `tests/test_litellm_proxy.py`
- `benchmarks/utils/version.py`
- `tests/test_version.py`
- `benchmarks/swebench/run_infer.py`: `get_phased_image_tag_prefix()`
- `benchmarks/swtbench/run_infer.py`: `get_phased_image_tag_prefix()`
- `benchmarks/swebenchmultimodal/run_infer.py`: `get_phased_image_tag_prefix()`

Companion PRs
- `add-llm-api-master-key`: injects `LLM_API_MASTER_KEY` from SOPS secret into K8s pod env via Helm chart

Validated: proxy_cost works end-to-end
Validation run: swebench n=10 with acp-gemini
- Run: `eval-23879947950-gemini-3-f` (Complete, 50m)
- Results: `gs://openhands-evaluation-results/swebench/litellm_proxy-gemini-3-flash-preview/23879947950/`

All 10 instances have `proxy_cost` in `test_result`.

Why proxy_cost differs from accumulated_cost
The SDK's `accumulated_cost` for ACP agents is unreliable for two reasons:

1. Wrong pricing: the SDK prices `"acp-managed"` at $0.50/M input + $3.00/M output (generic fallback). Gemini Flash actual pricing is $0.075/M input + $0.30/M output, i.e. 6.7x to 10x overpriced per token.
2. Under-reported tokens: ACP only reports final-response token counts. The gemini CLI internally uses ~2.5x more tokens (thinking, tool calls, retries) that it doesn't report back.

These errors partially cancel out (overpriced × under-counted), but the SDK estimate is still 2.8x too high. The `proxy_cost` from the LiteLLM virtual key is the only reliable cost metric for ACP agents.

Previous validation runs
- `gemini` CLI (motivating fix: include content hash in assembly image tags #602)
- `pytest tests/` passes
- `test_litellm_proxy.py`
- `test_version.py`
- `proxy_cost` in output
- no-op when `LLM_API_MASTER_KEY` is unset

Closes #592
🤖 Generated with Claude Code