feat: query LiteLLM proxy for per-instance ACP costs via virtual keys #595
simonrosenberg merged 21 commits into main from feat/litellm-virtual-key-costs
Conversation
all-hands-bot left a comment
🟡 Acceptable - Solid engineering that solves a real problem, but missing unit tests.
What's Good:
- Solves a real problem: exact cost tracking from proxy vs token-count estimates
- Thread-safe design using `threading.local` instead of global state mutation
- Backward compatible: all functions no-op when `LITELLM_MASTER_KEY` unset
- Clean data structure: virtual key in thread-local → `acp_env` injection (no env pollution)
- Pragmatic error handling: logs warnings, doesn't crash evaluation
- Evidence provided: validation shows $0.014112 tracked vs $0.00 from SDK
Main Gap:
The new `benchmarks/utils/litellm_proxy.py` module has no unit tests. The validation script is useful but not a substitute for tests in the test suite. When this codebase adds new modules with public functions (`create_virtual_key`, `get_key_spend`, `delete_key`, thread-local accessors), those should have corresponding tests.
Verdict: Worth merging - the design is sound and solves a real problem. Tests can be added in a follow-up, but they shouldn't be skipped indefinitely.
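For the thread-local accessors specifically, such a test could look like the sketch below. The function names mirror the review's description but are illustrative stand-ins, not the module's actual API:

```python
import threading

# Illustrative stand-ins for the PR's thread-local accessors; the real
# module's names and signatures may differ.
_local = threading.local()

def set_current_virtual_key(key):
    _local.virtual_key = key

def get_current_virtual_key():
    return getattr(_local, "virtual_key", None)

def test_thread_local_isolation():
    # The main thread's key must not leak into a worker thread.
    set_current_virtual_key("sk-main")
    seen = {}

    def worker():
        seen["before"] = get_current_virtual_key()  # fresh thread: no key yet
        set_current_virtual_key("sk-worker")
        seen["after"] = get_current_virtual_key()

    t = threading.Thread(target=worker)
    t.start()
    t.join()
    assert seen["before"] is None
    assert seen["after"] == "sk-worker"
    assert get_current_virtual_key() == "sk-main"  # main thread unchanged
```

The same pattern extends to HTTP-facing functions by stubbing the client, so no test ever touches the real proxy.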
Force-pushed from 65efe0b to 2df7d7c
all-hands-bot left a comment
🟡 Acceptable - Pragmatic design that solves a real problem, but still missing unit tests after previous review.
What's Good:
- Thread-local storage is the right data structure for this problem
- Clean fail-open error handling (logs warnings, doesn't crash evals)
- Backward compatible by design (all functions no-op when unconfigured)
- Evidence shows it works end-to-end ($0.014 tracked from real Codex run)
What's Not:
You're adding ~160 lines of infrastructure code with HTTP requests, error handling, and thread-local state management. The previous review already flagged missing unit tests. They're still missing.
Force-pushed from e096fd7 to df73d23
all-hands-bot left a comment
🟡 Acceptable - Comprehensive tests added (addresses previous review), solid thread-safe design. However, the PR description lists changes to models.py, report_costs.py, and validate_virtual_keys.py that aren't in this diff. Also, clarify the LITELLM_MASTER_KEY vs LLM_API_KEY naming.
> so that concurrent worker threads (asyncio.to_thread) each track their own
> key without global state mutation. ``build_acp_agent`` in ``acp.py`` reads
> this thread-local to inject the key via ``acp_env``.
> """
🟡 Suggestion: The PR description says infrastructure needs LITELLM_MASTER_KEY, but the code expects LLM_API_KEY. Either:
- Update the docstring to clarify: "LLM_API_KEY should be set to the LiteLLM master key with admin permissions", or
- Update the PR description to say the K8s secret should be exposed as LLM_API_KEY

This prevents deployment confusion when someone reads the PR description but doesn't see the env var they expect in the code.
This was fixed in a subsequent commit — the code now uses LLM_API_MASTER_KEY (a dedicated SOPS secret), and the module docstring matches:
Requires:
- LLM_BASE_URL: The LiteLLM proxy URL (existing env var)
- LLM_API_MASTER_KEY: Admin key for virtual key management (SOPS secret)
The Helm chart change `add-llm-api-master-key` in OpenHands/evaluation injects LLM_API_MASTER_KEY from the SOPS-encrypted K8s secret into the pod env.
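The injection step the quoted docstring describes (read the thread-local virtual key, then override provider credentials in `acp_env`) can be sketched as below. The specific provider variable names are assumptions for illustration, not the PR's actual list:

```python
def layer_virtual_key(provider_env, virtual_key):
    """Build the env for the ACP subprocess: provider credentials first,
    then the per-instance virtual key layered on top when one exists."""
    env = dict(provider_env)
    if virtual_key is not None:
        # Hypothetical variable names; the real agent code may target others.
        for name in ("OPENAI_API_KEY", "GEMINI_API_KEY"):
            if name in env:
                env[name] = virtual_key
    return env
```

When no virtual key was created (proxy unconfigured), the provider env passes through untouched, which is what keeps the feature backward compatible.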
ACP agents (Codex, Gemini CLI) don't report costs natively through the ACP protocol, so accumulated_cost is $0. This adds per-instance virtual key tracking through the LiteLLM proxy to get exact costs per trajectory.

How it works:
- Before each instance, create a LiteLLM virtual key via /key/generate
- Pass the key to ACP agents via acp_env (thread-safe, no os.environ mutation)
- After instance completes, query exact spend via /key/info
- Store as proxy_cost on EvalOutput, delete the key

Fully backward compatible: all functions are no-ops when LITELLM_MASTER_KEY is not set.

Closes #592

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
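The key-creation step above can be sketched as a request builder. The payload field names follow LiteLLM's `/key/generate` key-management API, but treat them (and the `acp-` alias prefix) as assumptions; actually sending the request is left out so the sketch stays network-free:

```python
import os

def build_key_generate_request(instance_id, max_budget=50.0):
    """Assemble the POST /key/generate call described above.

    Returns (url, headers, json_body). The env var names match this
    commit (LITELLM_MASTER_KEY; later renamed in the PR).
    """
    base_url = os.environ["LLM_BASE_URL"].rstrip("/")
    master_key = os.environ["LITELLM_MASTER_KEY"]
    return (
        f"{base_url}/key/generate",
        {"Authorization": f"Bearer {master_key}"},
        # The budget cap bounds spend if a key ever leaks or a run loops.
        {"key_alias": f"acp-{instance_id}", "max_budget": max_budget},
    )
```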
… directly

Instead of adding a new field that every consumer must learn about, just set metrics.accumulated_cost with the proxy-reported value. Drop report_costs.py changes and validation script accordingly.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The CI eval jobs already have LLM_API_KEY (from secrets.LLM_API_KEY_EVAL) which has admin permissions on the LiteLLM proxy. No new secret needed. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The eval key already has admin permissions on the proxy — no need for a separate master key env var. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ed_cost

Keep both values in the output JSON — the SDK's token-count estimate in metrics.accumulated_cost and the exact proxy cost in test_result.proxy_cost — so they're both available in GCS.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Covers success paths, HTTP errors, connection errors, timeouts, config detection, and thread-local isolation. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When the LiteLLM proxy is configured but key creation fails, raise RuntimeError instead of returning None. This prevents silent fallback to real provider keys in acp_env. Also moves get_key_spend from GET (key in URL query param) to POST (key in request body) to avoid logging secrets in access logs. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Separates the admin key (virtual key CRUD) from the shared eval key to limit blast radius if LLM_API_KEY is ever exposed. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Includes:
- feat(acp): add Gemini CLI as supported ACP server (#2619)
- fix(workspace): avoid logging forwarded env values (#2630)
- fix(conversation): sanitize remote error logging (#2631)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Force-pushed from fd76dc9 to 2d7dede
The content hash was added to eval-base image tags in #599 to auto-rebuild when the SDK Dockerfile changes, but assembly (agent-layer) tags still used only the SDK SHA. This meant stale assembly images built on top of old base images (without ACP CLIs) were reused even after base images were rebuilt. Include the content hash in assembly tags so Dockerfile changes invalidate both base AND assembly image caches. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The build side now includes the Dockerfile content hash in assembly image tags (2396c60), but the pull side (IMAGE_TAG_PREFIX) still only used the SDK short SHA. This mismatch would cause consumers to look for images with the old tag format and fail to find the newly built images. Update IMAGE_TAG_PREFIX to include the same content hash so that run_infer.py and modal_patches.py construct tags that match what build_base_images.py produces. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The default IMAGE_TAG_PREFIX now includes both the SDK short SHA and the Dockerfile content hash. Update the test assertion accordingly. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The previous version.py change added the content hash to IMAGE_TAG_PREFIX globally, which would break legacy-build benchmarks (gaia, commit0, multiswebench, etc.) whose images don't include the content hash.

Instead, introduce get_phased_image_tag_prefix() which includes the content hash, and only use it in phased-build consumers:
- swebench/run_infer.py
- swebenchmultimodal/run_infer.py
- swtbench/run_infer.py
- modal_patches.py (swebench Modal path)

Legacy benchmarks continue using IMAGE_TAG_PREFIX (SDK SHA only).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
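Under the assumption that the content hash is a short digest of the Dockerfile bytes, the phased/legacy split described here can be sketched as follows (function names mirror the commit, but the details are illustrative):

```python
import hashlib
from pathlib import Path

def dockerfile_content_hash(dockerfile: Path, length: int = 8) -> str:
    # Short, stable digest of the Dockerfile so tags change whenever its
    # contents do (the digest length here is an assumption).
    return hashlib.sha256(dockerfile.read_bytes()).hexdigest()[:length]

def phased_image_tag_prefix(sdk_short_sha: str, dockerfile: Path) -> str:
    # Phased-build consumers: SDK short SHA plus Dockerfile content hash.
    return f"{sdk_short_sha}-{dockerfile_content_hash(dockerfile)}"

def legacy_image_tag_prefix(sdk_short_sha: str) -> str:
    # Legacy benchmarks (gaia, commit0, ...) keep the SDK SHA only.
    return sdk_short_sha
```

Any edit to the Dockerfile changes the digest, which invalidates both the base and assembly image caches on the phased path while leaving legacy tags stable.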
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…efix

run_eval.sh sets SDK_SHORT_SHA, which caused get_phased_image_tag_prefix() to short-circuit and return the SHA without content hash. Only an explicit IMAGE_TAG_PREFIX override should bypass the content hash computation.

Caught by validation runs: phased builds (swebench, swtbench) produced images with content hash in tags, but inference pulled images without it.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
LiteLLM's /key/info endpoint expects GET with a query param, not POST with JSON body. The 405 Method Not Allowed was preventing proxy cost tracking from working. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
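With the method fixed, the spend query looks like the sketch below: the key travels as a query parameter on a GET. The response shape (`info.spend`) follows LiteLLM's `/key/info` endpoint but should be treated as an assumption; the PR itself uses httpx, while this sketch injects the HTTP call so it stays dependency-free and testable offline:

```python
import json
import os
import urllib.parse
import urllib.request

def get_key_spend(virtual_key, http_get=None):
    """Query exact spend for a virtual key via GET /key/info.

    ``http_get`` is injectable so tests can stub the network call; the
    default uses urllib with a 10s timeout.
    """
    base_url = os.environ["LLM_BASE_URL"].rstrip("/")
    url = f"{base_url}/key/info?" + urllib.parse.urlencode({"key": virtual_key})
    headers = {"Authorization": f"Bearer {os.environ['LLM_API_MASTER_KEY']}"}

    if http_get is None:
        def http_get(url, headers):
            req = urllib.request.Request(url, headers=headers)
            with urllib.request.urlopen(req, timeout=10) as resp:
                return json.loads(resp.read())

    payload = http_get(url, headers)
    # "spend" lives under "info" in LiteLLM's response (assumed shape).
    return float(payload["info"]["spend"])
```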
…into feat/litellm-virtual-key-costs
…nges

- Fix test mocks: get_key_spend now uses httpx.get (not post) for /key/info endpoint (LiteLLM returns 405 for POST)
- Merge content-hash assembly tag changes from PR #602 to prevent stale image cache bugs when rebuilding

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
all-hands-bot left a comment
🟢 Good taste - Clean thread-safe design solving a real production problem.
Data Structure: Thread-local storage is the correct choice for isolated worker state. Clear ownership and lifecycle (create → use → query → delete). No shared mutable state.
Evidence: Strong production validation (10/10 instances with exact costs, $2.35 actual vs $6.53 SDK estimate). Real problem, real solution.
Pragmatism: Backward compatible (no-op when unconfigured), fail-hard where critical (key creation), fail-soft where acceptable (cleanup/query). Tests are comprehensive.
VERDICT: ✅ Worth merging - solid engineering with minor notes below.
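The headline validation numbers in this review can be sanity-checked with one line of arithmetic:

```python
# Figures quoted in the review above: SDK token-count estimate vs the
# exact spend reported by the LiteLLM proxy across 10 instances.
sdk_estimate = 6.53   # USD, metrics.accumulated_cost total
proxy_actual = 2.35   # USD, proxy-reported spend total
overestimate = sdk_estimate / proxy_actual
print(f"SDK over-reports by {overestimate:.1f}x")  # prints "SDK over-reports by 2.8x"
```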
> their tags so that Dockerfile changes invalidate cached assemblies.
> The tag format is: ``{sdk_short_sha}-{content_hash}-{custom_tag}-{target}``.
>
> Benchmarks on the legacy build path (gaia, commit0, etc.) should
🟡 Suggestion: Importing from swebench.build_base_images creates an architectural dependency from the general utils/version module to a benchmark-specific module. This is pragmatic for the documented use case ("phased-build benchmarks") and the lazy import avoids loading it unnecessarily, but it's worth noting this couples the modules. Acceptable trade-off given the clear documentation.
> logger.info("[litellm-proxy] Created virtual key for instance %s", instance_id)
> return key
> except Exception as e:
> raise RuntimeError(
🟢 Acceptable: Fail-hard error handling on key creation is the RIGHT choice - if you can't create a virtual key when the proxy is configured, that's a critical failure. The asymmetric design (fail-hard create, fail-soft cleanup/query) is intentional and correct.
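The asymmetry the reviewer describes can be sketched as two thin wrappers; `create_fn` and `delete_fn` stand in for hypothetical HTTP helpers and are not the module's actual API:

```python
import logging

logger = logging.getLogger("litellm-proxy")

def create_virtual_key_or_die(create_fn, instance_id):
    # Fail hard: without a virtual key, the agent would silently fall
    # back to real provider keys and costs would go untracked.
    try:
        return create_fn(instance_id)
    except Exception as e:
        raise RuntimeError(
            f"Failed to create virtual key for {instance_id}: {e}"
        ) from e

def delete_key_best_effort(delete_fn, key):
    # Fail soft: a leaked-but-budget-capped key is not worth failing an
    # otherwise completed evaluation over.
    try:
        delete_fn(key)
    except Exception as e:
        logger.warning("Could not delete virtual key: %s", e)
```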
> # The key is stored in thread-local so build_acp_agent() can inject
> # it into the ACP subprocess env. No-op when LITELLM_MASTER_KEY unset.
> run_id = os.getenv("UNIQUE_EVAL_NAME")
> virtual_key = create_virtual_key(instance.id, run_id=run_id)
🟢 Acceptable: Creating the virtual key before workspace prep means if prep fails, you create/delete a key unnecessarily. Previous review noted this - it's a pragmatic choice that keeps the code simple. The cost is negligible (one extra HTTP round-trip on failure path), and the clarity is worth it.
Co-authored-by: Vasco Schiavo <115561717+VascoSch92@users.noreply.github.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Summary
Adds per-instance LiteLLM virtual key tracking to get exact costs from the proxy for ACP agents (Codex, Gemini CLI) that don't report costs natively through the ACP protocol.
Also includes changes from #602 (content-hash assembly tags) to prevent stale image bugs when rebuilding.
How it works
1. `POST /key/generate` (with $50 safety cap)
2. Pass the key via `acp_env`: overrides the provider API key from `_get_acp_env()` so the proxy tracks spend independently
3. `GET /key/info?key=...`, store as `proxy_cost` in `test_result`
4. `POST /key/delete`

Design decisions
- `threading.local` for virtual key storage: each worker thread tracks its own key without global state mutation
- Fail hard on key creation: raise `RuntimeError` instead of silently falling back to real provider keys
- No-op (return `None`) when `LLM_API_MASTER_KEY`/`LLM_BASE_URL` are unset
- `_get_acp_env()`: starts with provider credentials from the merged `forward_env` removal, then overrides API keys with the virtual key when available

Files changed
- `benchmarks/utils/litellm_proxy.py`
- `benchmarks/utils/acp.py`: `build_acp_agent()` injects virtual key via `acp_env`, layered on `_get_acp_env()`
- `benchmarks/utils/evaluation.py`: `_execute_single_attempt()` hooks key lifecycle
- `tests/test_litellm_proxy.py`
- `benchmarks/utils/version.py`
- `tests/test_version.py`
- `benchmarks/swebench/run_infer.py`: `get_phased_image_tag_prefix()`
- `benchmarks/swtbench/run_infer.py`: `get_phased_image_tag_prefix()`
- `benchmarks/swebenchmultimodal/run_infer.py`: `get_phased_image_tag_prefix()`

Companion PRs
- `add-llm-api-master-key`: injects `LLM_API_MASTER_KEY` from SOPS secret into K8s pod env via Helm chart

Validated: proxy_cost works end-to-end
Validation run: swebench n=10 with acp-gemini
- Run: `eval-23879947950-gemini-3-f` (Complete, 50m)
- Results: `gs://openhands-evaluation-results/swebench/litellm_proxy-gemini-3-flash-preview/23879947950/`

All 10 instances have `proxy_cost` in `test_result`.

Why proxy_cost differs from accumulated_cost
The SDK's `accumulated_cost` for ACP agents is unreliable for two reasons:

1. Wrong pricing: the SDK prices `"acp-managed"` at $0.50/M input + $3.00/M output (generic fallback). Gemini Flash actual pricing is $0.075/M input + $0.30/M output, i.e. 6.7x to 10x overpriced per token.
2. Under-reported tokens: ACP only reports final-response token counts. The gemini CLI internally uses ~2.5x more tokens (thinking, tool calls, retries) that it doesn't report back.

These errors partially cancel out (overpriced × under-counted), but the SDK estimate is still 2.8x too high. The `proxy_cost` from the LiteLLM virtual key is the only reliable cost metric for ACP agents.

Previous validation runs
- `gemini` CLI (motivating fix: include content hash in assembly image tags #602)
- `pytest tests/` passes
- `test_litellm_proxy.py`
- `test_version.py`
- `proxy_cost` in output
- no-op when `LLM_API_MASTER_KEY` is unset

Closes #592
🤖 Generated with Claude Code