feat: content-hash eval-base image tags to auto-rebuild on Dockerfile changes by simonrosenberg · Pull Request #599 · OpenHands/benchmarks

simonrosenberg · 2026-04-01T18:08:06Z

Summary

Include a 7-char SHA-256 content hash of the SDK Dockerfile in eval-base image tags (eval-base:{hash}-{instance_tag}) so that Dockerfile changes automatically invalidate cached images and trigger rebuilds
Extract dockerfile_content_hash() as a public helper, compute it once per phase in the orchestrator, and pass the result through to ProcessPoolExecutor workers — base_image_tag() stays a pure string formatter
content_hash is a required keyword-only arg — no silent fallback to file I/O
Covers swebench, swebench-multimodal, and swtbench (all go through the same 3-phase pipeline via benchmarks/swebench/build_images.py)

Closes #593

Validation

Triggered real CI builds for django__django-11099 on the feature branch:

Run	Scenario	Base image	Result
#23869153028	Cold build (new tag format)	`eval-base:245a238-sweb.eval.x86_64.django_1776_django-11099`	Built and pushed
#23869164900	Warm (same Dockerfile)	`eval-base:245a238-sweb.eval.x86_64.django_1776_django-11099`	Skipped (`already exists`)

Run 1 built the base image with content hash 245a238 in the tag
Run 2 found the same tag in GHCR and skipped all 3 phases (builder, base, assembly)
Both phases (1 and 2) produce matching tags — assembly correctly references the base image pushed in phase 1

Test plan

All 22 test_phased_build.py tests pass (including new test_dockerfile_content_hash_format)
Pre-commit hooks pass (ruff format, ruff lint, pycodestyle, pyright strict)
CI validation: cold build succeeds and pushes content-hashed tag
CI validation: warm build skips when Dockerfile unchanged

🤖 Generated with Claude Code

… changes Include a 7-char SHA-256 hash of the SDK Dockerfile content in eval-base image tags so that Dockerfile changes automatically invalidate cached images. The hash is computed once per process via lru_cache to avoid redundant file reads across 500+ images. Closes #593 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

all-hands-bot

🟡 Acceptable - Solves the real problem (Dockerfile changes auto-invalidate images) but adds unnecessary complexity. Core logic is sound, tests are comprehensive, but the caching approach creates test pollution.

benchmarks/swebench/build_base_images.py

tests/test_phased_build.py

Remove _dockerfile_content_hash() and its lru_cache. Reading one small file 500 times is ~5-50ms — not worth the test complexity of cache clearing in setup_method across multiple test classes. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Deduplicate the mock decorator and trim verbose assertions. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

all-hands-bot

🟡 Acceptable - Solves the real problem (Dockerfile changes auto-invalidate cached images), but the design creates unnecessary complexity and I/O overhead. The PR description is misleading about caching.

all-hands-bot · 2026-04-01T18:21:23Z

benchmarks/swebench/build_base_images.py

+    dockerfile = _get_sdk_dockerfile().read_text()
+    content_hash = hashlib.sha256(dockerfile.encode()).hexdigest()[:7]
+    return f"{image}:{content_hash}-{custom_tag}"


🟠 Important: This reads and hashes the Dockerfile on every call to base_image_tag(). With 500+ images mentioned in the description, that's 500+ redundant file reads and hash computations.

Linus-style critique: Wrong data structure. You're treating a constant (Dockerfile hash) as a computed value. This should be calculated once at module load time:

Suggested change

dockerfile = _get_sdk_dockerfile().read_text()

content_hash = hashlib.sha256(dockerfile.encode()).hexdigest()[:7]

return f"{image}:{content_hash}-{custom_tag}"

# At module level (after _get_sdk_dockerfile definition):

_DOCKERFILE_HASH = hashlib.sha256(

_get_sdk_dockerfile().read_text().encode()

).hexdigest()[:7]

def base_image_tag(custom_tag: str, image: str = EVAL_BASE_IMAGE) -> str:

"""Compute the full registry tag for a pre-built base image.

The tag includes a short content hash of the SDK Dockerfile so that

any change to the Dockerfile automatically invalidates cached images.

"""

return f"{image}:{_DOCKERFILE_HASH}-{custom_tag}"

This would:

Eliminate 500+ file reads (compute once per process)

Simplify all tests (no mocking needed - tests can just verify tag format)

Make the hidden I/O explicit and obvious

Follow the principle: "Data structures, not code"

Alternatively, if you prefer lazy evaluation, use @functools.cache on a helper function (Python 3.9+).

all-hands-bot · 2026-04-01T18:21:24Z

tests/test_phased_build.py

+    @patch(
+        "benchmarks.swebench.build_base_images._get_sdk_dockerfile",
+        return_value=MagicMock(read_text=MagicMock(return_value="FROM ubuntu:22.04\n")),
+    )


🟡 Suggestion: This mocking pattern repeats across 6+ tests. It's test pollution caused by the design choice to compute the hash on every call.

If the hash were a module-level constant, these tests wouldn't need any mocking - they'd just verify the tag format includes a 7-char hex prefix. Simpler tests are easier to maintain and less brittle.

all-hands-bot · 2026-04-01T18:21:24Z

tests/test_phased_build.py

+    def test_hash_changes_with_dockerfile_content(self):
+        import hashlib
+
+        from benchmarks.swebench.build_base_images import base_image_tag
+
+        content_a = "FROM ubuntu:22.04\n"
+        content_b = "FROM ubuntu:22.04\nRUN npm install\n"
+
+        with patch(
+            "benchmarks.swebench.build_base_images._get_sdk_dockerfile",
+            return_value=MagicMock(read_text=MagicMock(return_value=content_a)),
+        ):
+            tag1 = base_image_tag("inst")
+
+        with patch(
+            "benchmarks.swebench.build_base_images._get_sdk_dockerfile",
+            return_value=MagicMock(read_text=MagicMock(return_value=content_b)),
+        ):
+            tag2 = base_image_tag("inst")
+
+        # Same instance but different Dockerfile content → different tags.
+        assert tag1 != tag2
+        assert tag1.endswith("-inst")
+        assert tag2.endswith("-inst")
+        # Verify the hashes match expected SHA-256 prefixes.
+        expected_a = hashlib.sha256(content_a.encode()).hexdigest()[:7]
+        expected_b = hashlib.sha256(content_b.encode()).hexdigest()[:7]
+        assert f":{expected_a}-inst" in tag1
+        assert f":{expected_b}-inst" in tag2


🟢 Acceptable: This test validates real behavior (hash changes with content). Good test that exercises actual logic rather than just mocking.

However, note that it's testing a property that could be proven by inspection if the hash were a module constant.

all-hands-bot

🟡 Acceptable - Core logic is sound, but you're not testing what you document.

[TESTING GAPS]
The docstring promises "7-char SHA-256 content hash" but no test validates this format. If someone changes [:7] to [:8] or removes it, tests still pass.

VERDICT: ✅ Worth merging - but add format validation.

KEY INSIGHT: Tests validate behavior (hashes differ with content) but not the contract (7-char hex format).

all-hands-bot · 2026-04-01T19:00:36Z

tests/test_phased_build.py

+        ):
+            tag2 = base_image_tag("x")
+
+        assert tag1 != tag2


🟠 Important: This test validates that different content produces different hashes, which is good. But the docstring in base_image_tag() claims "7-char content hash" and nothing validates that format.

Add an assertion to verify the documented guarantee:

import re assert tag1 != tag2 # Validate the documented format: {image}:{7-hex-chars}-{custom_tag} assert re.match(r'^.+:[0-9a-f]{7}-.+$', tag1), f"Tag format mismatch: {tag1}"

Without this, if someone changes [:7] to [:8] or [:6], your tests won't catch it. You're testing behavior, not the contract.

all-hands-bot

🟡 Acceptable - Solves the real problem (Dockerfile changes auto-invalidate cached images) but uses inefficient data structure. See inline comments for details.

benchmarks/swebench/build_base_images.py

all-hands-bot · 2026-04-01T19:22:49Z

tests/test_phased_build.py

+        ):
+            tag2 = base_image_tag("x")
+
+        assert tag1 != tag2


🟠 Important - Testing Gap: The docstring promises "7-char SHA-256 content hash" but no test validates this format. The test correctly verifies that different content produces different hashes, but doesn't check the documented contract.

Add format validation:

import re assert tag1 != tag2 assert re.match(r'^.+:[a-f0-9]{7}-', tag1), f"Expected 7-char hex prefix in tag: {tag1}" assert re.match(r'^.+:[a-f0-9]{7}-', tag2), f"Expected 7-char hex prefix in tag: {tag2}"

tests/test_phased_build.py

base_image_tag() was spawning `git rev-parse` + reading the Dockerfile on every call — 1000+ subprocess spawns across Phase 1 and Phase 2 inside ProcessPoolExecutor workers. Extract dockerfile_content_hash() as a public helper, compute it once in the orchestrator (build_all_base_images / assemble_all_agent_images), and pass the result through to workers via a content_hash parameter. base_image_tag() falls back to computing the hash itself when called without one (e.g. in build_base_image), keeping the API simple. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Phase 1 workers (build_base_image via ProcessPoolExecutor) were each recomputing the dockerfile hash, spawning git rev-parse per worker. Compute it once in build_all_base_images and pass it through _build_base_with_logging → build_base_image, matching what Phase 2 already does. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

all-hands-bot

🟡 Acceptable - Solves the real problem, but API has a footgun and tests don't validate the documented contract.

benchmarks/swebench/build_base_images.py

tests/test_phased_build.py

Address review feedback: - Make content_hash a required keyword-only arg on base_image_tag() so callers can't silently fall back to file I/O in hot loops. - Add test_dockerfile_content_hash_format validating the 7-char hex contract. - Replace nested MagicMock with Mock in test fixtures. - TestBaseImageTag tests now call base_image_tag directly with content_hash — no mocking needed. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

all-hands-bot

🔴 Needs improvement - API footgun allows malformed tags

[CRITICAL ISSUE]
The PR description claims content_hash is required, but build_base_image() has a default "" that creates broken tags like eval-base:-custom-tag. Either remove defaults everywhere or auto-compute the hash when missing.

VERDICT: ❌ Needs rework before merge

KEY INSIGHT: You've made it optional at the wrong layer - the helper functions allow broken state that the formatter would reject.

benchmarks/swebench/build_base_images.py

tests/test_phased_build.py

build_base_image, _build_base_with_logging, and _assemble_with_logging still had content_hash="" defaults. Make them keyword-only with no default so callers can't silently trigger file I/O. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

all-hands-bot

🟢 Good taste - Clean solution to a real problem.

This fixes the cache invalidation issue correctly: hash the Dockerfile, embed it in the tag, compute once per phase. The API is safe (required keyword-only arg), tests validate the contract, and the implementation has no special cases. The unresolved threads are bikeshedding about saving ~1ms with module-level constants - not worth the complexity.

Verdict: ✅ Ship it.

The content hash was added to eval-base image tags in #599 to auto-rebuild when the SDK Dockerfile changes, but assembly (agent-layer) tags still used only the SDK SHA. This meant stale assembly images built on top of old base images (without ACP CLIs) were reused even after base images were rebuilt. Include the content hash in assembly tags so Dockerfile changes invalidate both base AND assembly image caches. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: include content hash in assembly image tags The content hash was added to eval-base image tags in #599 to auto-rebuild when the SDK Dockerfile changes, but assembly (agent-layer) tags still used only the SDK SHA. This meant stale assembly images built on top of old base images (without ACP CLIs) were reused even after base images were rebuilt. Include the content hash in assembly tags so Dockerfile changes invalidate both base AND assembly image caches. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * fix: include content hash in IMAGE_TAG_PREFIX for pull-side tag matching The build side now includes the Dockerfile content hash in assembly image tags (2396c60), but the pull side (IMAGE_TAG_PREFIX) still only used the SDK short SHA. This mismatch would cause consumers to look for images with the old tag format and fail to find the newly built images. Update IMAGE_TAG_PREFIX to include the same content hash so that run_infer.py and modal_patches.py construct tags that match what build_base_images.py produces. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * fix: update test to match new IMAGE_TAG_PREFIX format The default IMAGE_TAG_PREFIX now includes both the SDK short SHA and the Dockerfile content hash. Update the test assertion accordingly. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * fix: scope content hash to phased-build consumers only The previous version.py change added the content hash to IMAGE_TAG_PREFIX globally, which would break legacy-build benchmarks (gaia, commit0, multiswebench, etc.) whose images don't include the content hash. Instead, introduce get_phased_image_tag_prefix() which includes the content hash, and only use it in phased-build consumers: - swebench/run_infer.py - swebenchmultimodal/run_infer.py - swtbench/run_infer.py - modal_patches.py (swebench Modal path) Legacy benchmarks continue using IMAGE_TAG_PREFIX (SDK SHA only). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * style: apply ruff formatting fixes Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * fix: don't let SDK_SHORT_SHA env var bypass content hash in phased prefix run_eval.sh sets SDK_SHORT_SHA, which caused get_phased_image_tag_prefix() to short-circuit and return the SHA without content hash. Only an explicit IMAGE_TAG_PREFIX override should bypass the content hash computation. Caught by validation runs: phased builds (swebench, swtbench) produced images with content hash in tags, but inference pulled images without it. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> --------- Co-authored-by: Debug Agent <debug@example.com> Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>

…#595) * feat: query LiteLLM proxy for per-instance ACP costs via virtual keys ACP agents (Codex, Gemini CLI) don't report costs natively through the ACP protocol, so accumulated_cost is $0. This adds per-instance virtual key tracking through the LiteLLM proxy to get exact costs per trajectory. How it works: - Before each instance, create a LiteLLM virtual key via /key/generate - Pass the key to ACP agents via acp_env (thread-safe, no os.environ mutation) - After instance completes, query exact spend via /key/info - Store as proxy_cost on EvalOutput, delete the key Fully backward compatible: all functions are no-ops when LITELLM_MASTER_KEY is not set. Closes #592 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * refactor: simplify — drop proxy_cost field, write to accumulated_cost directly Instead of adding a new field that every consumer must learn about, just set metrics.accumulated_cost with the proxy-reported value. Drop report_costs.py changes and validation script accordingly. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix: fall back to LLM_API_KEY when LITELLM_MASTER_KEY is not set The CI eval jobs already have LLM_API_KEY (from secrets.LLM_API_KEY_EVAL) which has admin permissions on the LiteLLM proxy. No new secret needed. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * refactor: use LLM_API_KEY only, remove LITELLM_MASTER_KEY The eval key already has admin permissions on the proxy — no need for a separate master key env var. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix: store proxy cost in test_result instead of overwriting accumulated_cost Keep both values in the output JSON — the SDK's token-count estimate in metrics.accumulated_cost and the exact proxy cost in test_result.proxy_cost — so they're both available in GCS. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * test: add unit tests for litellm_proxy module Covers success paths, HTTP errors, connection errors, timeouts, config detection, and thread-local isolation. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix: raise on virtual key creation failure instead of silent fallback When the LiteLLM proxy is configured but key creation fails, raise RuntimeError instead of returning None. This prevents silent fallback to real provider keys in acp_env. Also moves get_key_spend from GET (key in URL query param) to POST (key in request body) to avoid logging secrets in access logs. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * fix: resolve pyright error on _get_config() return type narrowing Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * fix: use LLM_API_MASTER_KEY instead of LLM_API_KEY for proxy admin ops Separates the admin key (virtual key CRUD) from the shared eval key to limit blast radius if LLM_API_KEY is ever exposed. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * chore: update vendor/software-agent-sdk to latest main Includes: - feat(acp): add Gemini CLI as supported ACP server (#2619) - fix(workspace): avoid logging forwarded env values (#2630) - fix(conversation): sanitize remote error logging (#2631) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * fix: include content hash in assembly image tags The content hash was added to eval-base image tags in #599 to auto-rebuild when the SDK Dockerfile changes, but assembly (agent-layer) tags still used only the SDK SHA. This meant stale assembly images built on top of old base images (without ACP CLIs) were reused even after base images were rebuilt. Include the content hash in assembly tags so Dockerfile changes invalidate both base AND assembly image caches. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * fix: include content hash in IMAGE_TAG_PREFIX for pull-side tag matching The build side now includes the Dockerfile content hash in assembly image tags (2396c60), but the pull side (IMAGE_TAG_PREFIX) still only used the SDK short SHA. This mismatch would cause consumers to look for images with the old tag format and fail to find the newly built images. Update IMAGE_TAG_PREFIX to include the same content hash so that run_infer.py and modal_patches.py construct tags that match what build_base_images.py produces. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * fix: update test to match new IMAGE_TAG_PREFIX format The default IMAGE_TAG_PREFIX now includes both the SDK short SHA and the Dockerfile content hash. Update the test assertion accordingly. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * fix: scope content hash to phased-build consumers only The previous version.py change added the content hash to IMAGE_TAG_PREFIX globally, which would break legacy-build benchmarks (gaia, commit0, multiswebench, etc.) whose images don't include the content hash. Instead, introduce get_phased_image_tag_prefix() which includes the content hash, and only use it in phased-build consumers: - swebench/run_infer.py - swebenchmultimodal/run_infer.py - swtbench/run_infer.py - modal_patches.py (swebench Modal path) Legacy benchmarks continue using IMAGE_TAG_PREFIX (SDK SHA only). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * style: apply ruff formatting fixes Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * fix: don't let SDK_SHORT_SHA env var bypass content hash in phased prefix run_eval.sh sets SDK_SHORT_SHA, which caused get_phased_image_tag_prefix() to short-circuit and return the SHA without content hash. Only an explicit IMAGE_TAG_PREFIX override should bypass the content hash computation. Caught by validation runs: phased builds (swebench, swtbench) produced images with content hash in tags, but inference pulled images without it. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * fix: use GET for /key/info endpoint (LiteLLM returns 405 for POST) LiteLLM's /key/info endpoint expects GET with a query param, not POST with JSON body. The 405 Method Not Allowed was preventing proxy cost tracking from working. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * fix: update get_key_spend tests to use GET and merge content-hash changes - Fix test mocks: get_key_spend now uses httpx.get (not post) for /key/info endpoint (LiteLLM returns 405 for POST) - Merge content-hash assembly tag changes from PR #602 to prevent stale image cache bugs when rebuilding Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * Update benchmarks/utils/litellm_proxy.py Co-authored-by: Vasco Schiavo <115561717+VascoSch92@users.noreply.github.com> --------- Co-authored-by: Debug Agent <debug@example.com> Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Co-authored-by: Vasco Schiavo <115561717+VascoSch92@users.noreply.github.com>