Skip to content

feat: content-hash eval-base image tags to auto-rebuild on Dockerfile changes#599

Merged
simonrosenberg merged 7 commits intomainfrom
feat/content-hash-base-image-tags
Apr 1, 2026
Merged

feat: content-hash eval-base image tags to auto-rebuild on Dockerfile changes#599
simonrosenberg merged 7 commits intomainfrom
feat/content-hash-base-image-tags

Conversation

@simonrosenberg
Copy link
Copy Markdown
Collaborator

@simonrosenberg simonrosenberg commented Apr 1, 2026

Summary

  • Include a 7-char SHA-256 content hash of the SDK Dockerfile in eval-base image tags (eval-base:{hash}-{instance_tag}) so that Dockerfile changes automatically invalidate cached images and trigger rebuilds
  • Extract dockerfile_content_hash() as a public helper, compute it once per phase in the orchestrator, and pass the result through to ProcessPoolExecutor workers — base_image_tag() stays a pure string formatter
  • content_hash is a required keyword-only arg — no silent fallback to file I/O
  • Covers swebench, swebench-multimodal, and swtbench (all go through the same 3-phase pipeline via benchmarks/swebench/build_images.py)

Closes #593

Validation

Triggered real CI builds for django__django-11099 on the feature branch:

Run Scenario Base image Result
#23869153028 Cold build (new tag format) eval-base:245a238-sweb.eval.x86_64.django_1776_django-11099 Built and pushed
#23869164900 Warm (same Dockerfile) eval-base:245a238-sweb.eval.x86_64.django_1776_django-11099 Skipped (already exists)
  • Run 1 built the base image with content hash 245a238 in the tag
  • Run 2 found the same tag in GHCR and skipped all 3 phases (builder, base, assembly)
  • Both phases (1 and 2) produce matching tags — assembly correctly references the base image pushed in phase 1

Test plan

  • All 22 test_phased_build.py tests pass (including new test_dockerfile_content_hash_format)
  • Pre-commit hooks pass (ruff format, ruff lint, pycodestyle, pyright strict)
  • CI validation: cold build succeeds and pushes content-hashed tag
  • CI validation: warm build skips when Dockerfile unchanged

🤖 Generated with Claude Code

… changes

Include a 7-char SHA-256 hash of the SDK Dockerfile content in eval-base
image tags so that Dockerfile changes automatically invalidate cached
images. The hash is computed once per process via lru_cache to avoid
redundant file reads across 500+ images.

Closes #593

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown
Collaborator

@all-hands-bot all-hands-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Acceptable - Solves the real problem (Dockerfile changes auto-invalidate images) but adds unnecessary complexity. Core logic is sound, tests are comprehensive, but the caching approach creates test pollution.

Remove _dockerfile_content_hash() and its lru_cache. Reading one small
file 500 times is ~5-50ms — not worth the test complexity of cache
clearing in setup_method across multiple test classes.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Deduplicate the mock decorator and trim verbose assertions.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown
Collaborator

@all-hands-bot all-hands-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Acceptable - Solves the real problem (Dockerfile changes auto-invalidate cached images), but the design creates unnecessary complexity and I/O overhead. The PR description is misleading about caching.

Comment on lines +88 to +90
dockerfile = _get_sdk_dockerfile().read_text()
content_hash = hashlib.sha256(dockerfile.encode()).hexdigest()[:7]
return f"{image}:{content_hash}-{custom_tag}"
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟠 Important: This reads and hashes the Dockerfile on every call to base_image_tag(). With 500+ images mentioned in the description, that's 500+ redundant file reads and hash computations.

Linus-style critique: Wrong data structure. You're treating a constant (Dockerfile hash) as a computed value. This should be calculated once at module load time:

Suggested change
dockerfile = _get_sdk_dockerfile().read_text()
content_hash = hashlib.sha256(dockerfile.encode()).hexdigest()[:7]
return f"{image}:{content_hash}-{custom_tag}"
# At module level (after _get_sdk_dockerfile definition):
_DOCKERFILE_HASH = hashlib.sha256(
_get_sdk_dockerfile().read_text().encode()
).hexdigest()[:7]
def base_image_tag(custom_tag: str, image: str = EVAL_BASE_IMAGE) -> str:
"""Compute the full registry tag for a pre-built base image.
The tag includes a short content hash of the SDK Dockerfile so that
any change to the Dockerfile automatically invalidates cached images.
"""
return f"{image}:{_DOCKERFILE_HASH}-{custom_tag}"

This would:

  • Eliminate 500+ file reads (compute once per process)
  • Simplify all tests (no mocking needed - tests can just verify tag format)
  • Make the hidden I/O explicit and obvious
  • Follow the principle: "Data structures, not code"

Alternatively, if you prefer lazy evaluation, use @functools.cache on a helper function (Python 3.9+).

Comment on lines +42 to +45
@patch(
"benchmarks.swebench.build_base_images._get_sdk_dockerfile",
return_value=MagicMock(read_text=MagicMock(return_value="FROM ubuntu:22.04\n")),
)
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Suggestion: This mocking pattern repeats across 6+ tests. It's test pollution caused by the design choice to compute the hash on every call.

If the hash were a module-level constant, these tests wouldn't need any mocking - they'd just verify the tag format includes a 7-char hex prefix. Simpler tests are easier to maintain and less brittle.

Comment on lines +515 to +543
def test_hash_changes_with_dockerfile_content(self):
import hashlib

from benchmarks.swebench.build_base_images import base_image_tag

content_a = "FROM ubuntu:22.04\n"
content_b = "FROM ubuntu:22.04\nRUN npm install\n"

with patch(
"benchmarks.swebench.build_base_images._get_sdk_dockerfile",
return_value=MagicMock(read_text=MagicMock(return_value=content_a)),
):
tag1 = base_image_tag("inst")

with patch(
"benchmarks.swebench.build_base_images._get_sdk_dockerfile",
return_value=MagicMock(read_text=MagicMock(return_value=content_b)),
):
tag2 = base_image_tag("inst")

# Same instance but different Dockerfile content → different tags.
assert tag1 != tag2
assert tag1.endswith("-inst")
assert tag2.endswith("-inst")
# Verify the hashes match expected SHA-256 prefixes.
expected_a = hashlib.sha256(content_a.encode()).hexdigest()[:7]
expected_b = hashlib.sha256(content_b.encode()).hexdigest()[:7]
assert f":{expected_a}-inst" in tag1
assert f":{expected_b}-inst" in tag2
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟢 Acceptable: This test validates real behavior (hash changes with content). Good test that exercises actual logic rather than just mocking.

However, note that it's testing a property that could be proven by inspection if the hash were a module constant.

Copy link
Copy Markdown
Collaborator

@all-hands-bot all-hands-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Acceptable - Core logic is sound, but you're not testing what you document.

[TESTING GAPS]
The docstring promises "7-char SHA-256 content hash" but no test validates this format. If someone changes [:7] to [:8] or removes it, tests still pass.

VERDICT: ✅ Worth merging - but add format validation.

KEY INSIGHT: Tests validate behavior (hashes differ with content) but not the contract (7-char hex format).

):
tag2 = base_image_tag("x")

assert tag1 != tag2
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟠 Important: This test validates that different content produces different hashes, which is good. But the docstring in base_image_tag() claims "7-char content hash" and nothing validates that format.

Add an assertion to verify the documented guarantee:

import re

assert tag1 != tag2
# Validate the documented format: {image}:{7-hex-chars}-{custom_tag}
assert re.match(r'^.+:[0-9a-f]{7}-.+$', tag1), f"Tag format mismatch: {tag1}"

Without this, if someone changes [:7] to [:8] or [:6], your tests won't catch it. You're testing behavior, not the contract.

Copy link
Copy Markdown
Collaborator

@all-hands-bot all-hands-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Acceptable - Solves the real problem (Dockerfile changes auto-invalidate cached images) but uses inefficient data structure. See inline comments for details.

):
tag2 = base_image_tag("x")

assert tag1 != tag2
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟠 Important - Testing Gap: The docstring promises "7-char SHA-256 content hash" but no test validates this format. The test correctly verifies that different content produces different hashes, but doesn't check the documented contract.

Add format validation:

import re

assert tag1 != tag2
assert re.match(r'^.+:[a-f0-9]{7}-', tag1), f"Expected 7-char hex prefix in tag: {tag1}"
assert re.match(r'^.+:[a-f0-9]{7}-', tag2), f"Expected 7-char hex prefix in tag: {tag2}"

Debug Agent and others added 2 commits April 1, 2026 16:23
base_image_tag() was spawning `git rev-parse` + reading the Dockerfile
on every call — 1000+ subprocess spawns across Phase 1 and Phase 2
inside ProcessPoolExecutor workers.

Extract dockerfile_content_hash() as a public helper, compute it once
in the orchestrator (build_all_base_images / assemble_all_agent_images),
and pass the result through to workers via a content_hash parameter.
base_image_tag() falls back to computing the hash itself when called
without one (e.g. in build_base_image), keeping the API simple.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Phase 1 workers (build_base_image via ProcessPoolExecutor) were each
recomputing the dockerfile hash, spawning git rev-parse per worker.
Compute it once in build_all_base_images and pass it through
_build_base_with_logging → build_base_image, matching what Phase 2
already does.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown
Collaborator

@all-hands-bot all-hands-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Acceptable - Solves the real problem, but API has a footgun and tests don't validate the documented contract.

Address review feedback:
- Make content_hash a required keyword-only arg on base_image_tag()
  so callers can't silently fall back to file I/O in hot loops.
- Add test_dockerfile_content_hash_format validating the 7-char hex
  contract.
- Replace nested MagicMock with Mock in test fixtures.
- TestBaseImageTag tests now call base_image_tag directly with
  content_hash — no mocking needed.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown
Collaborator

@all-hands-bot all-hands-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 Needs improvement - API footgun allows malformed tags

[CRITICAL ISSUE]
The PR description claims content_hash is required, but build_base_image() has a default "" that creates broken tags like eval-base:-custom-tag. Either remove defaults everywhere or auto-compute the hash when missing.

VERDICT: ❌ Needs rework before merge

KEY INSIGHT: You've made it optional at the wrong layer - the helper functions allow broken state that the formatter would reject.

build_base_image, _build_base_with_logging, and _assemble_with_logging
still had content_hash="" defaults. Make them keyword-only with no
default so callers can't silently trigger file I/O.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown
Collaborator

@all-hands-bot all-hands-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟢 Good taste - Clean solution to a real problem.

This fixes the cache invalidation issue correctly: hash the Dockerfile, embed it in the tag, compute once per phase. The API is safe (required keyword-only arg), tests validate the contract, and the implementation has no special cases. The unresolved threads are bikeshedding about saving ~1ms with module-level constants - not worth the complexity.

Verdict: ✅ Ship it.

@simonrosenberg simonrosenberg merged commit 5a1d2b2 into main Apr 1, 2026
3 checks passed
@simonrosenberg simonrosenberg deleted the feat/content-hash-base-image-tags branch April 1, 2026 21:16
simonrosenberg pushed a commit that referenced this pull request Apr 1, 2026
The content hash was added to eval-base image tags in #599 to
auto-rebuild when the SDK Dockerfile changes, but assembly (agent-layer)
tags still used only the SDK SHA. This meant stale assembly images
built on top of old base images (without ACP CLIs) were reused even
after base images were rebuilt.

Include the content hash in assembly tags so Dockerfile changes
invalidate both base AND assembly image caches.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
simonrosenberg added a commit that referenced this pull request Apr 2, 2026
* fix: include content hash in assembly image tags

The content hash was added to eval-base image tags in #599 to
auto-rebuild when the SDK Dockerfile changes, but assembly (agent-layer)
tags still used only the SDK SHA. This meant stale assembly images
built on top of old base images (without ACP CLIs) were reused even
after base images were rebuilt.

Include the content hash in assembly tags so Dockerfile changes
invalidate both base AND assembly image caches.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: include content hash in IMAGE_TAG_PREFIX for pull-side tag matching

The build side now includes the Dockerfile content hash in assembly image
tags (2396c60), but the pull side (IMAGE_TAG_PREFIX) still only used the
SDK short SHA. This mismatch would cause consumers to look for images
with the old tag format and fail to find the newly built images.

Update IMAGE_TAG_PREFIX to include the same content hash so that
run_infer.py and modal_patches.py construct tags that match what
build_base_images.py produces.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: update test to match new IMAGE_TAG_PREFIX format

The default IMAGE_TAG_PREFIX now includes both the SDK short SHA and
the Dockerfile content hash. Update the test assertion accordingly.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: scope content hash to phased-build consumers only

The previous version.py change added the content hash to
IMAGE_TAG_PREFIX globally, which would break legacy-build benchmarks
(gaia, commit0, multiswebench, etc.) whose images don't include the
content hash.

Instead, introduce get_phased_image_tag_prefix() which includes the
content hash, and only use it in phased-build consumers:
- swebench/run_infer.py
- swebenchmultimodal/run_infer.py
- swtbench/run_infer.py
- modal_patches.py (swebench Modal path)

Legacy benchmarks continue using IMAGE_TAG_PREFIX (SDK SHA only).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* style: apply ruff formatting fixes

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: don't let SDK_SHORT_SHA env var bypass content hash in phased prefix

run_eval.sh sets SDK_SHORT_SHA, which caused get_phased_image_tag_prefix()
to short-circuit and return the SHA without content hash. Only an explicit
IMAGE_TAG_PREFIX override should bypass the content hash computation.

Caught by validation runs: phased builds (swebench, swtbench) produced
images with content hash in tags, but inference pulled images without it.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Debug Agent <debug@example.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
simonrosenberg added a commit that referenced this pull request Apr 2, 2026
…#595)

* feat: query LiteLLM proxy for per-instance ACP costs via virtual keys

ACP agents (Codex, Gemini CLI) don't report costs natively through the ACP
protocol, so accumulated_cost is $0. This adds per-instance virtual key
tracking through the LiteLLM proxy to get exact costs per trajectory.

How it works:
- Before each instance, create a LiteLLM virtual key via /key/generate
- Pass the key to ACP agents via acp_env (thread-safe, no os.environ mutation)
- After instance completes, query exact spend via /key/info
- Store as proxy_cost on EvalOutput, delete the key

Fully backward compatible: all functions are no-ops when LITELLM_MASTER_KEY
is not set.

Closes #592

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* refactor: simplify — drop proxy_cost field, write to accumulated_cost directly

Instead of adding a new field that every consumer must learn about,
just set metrics.accumulated_cost with the proxy-reported value.
Drop report_costs.py changes and validation script accordingly.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: fall back to LLM_API_KEY when LITELLM_MASTER_KEY is not set

The CI eval jobs already have LLM_API_KEY (from secrets.LLM_API_KEY_EVAL)
which has admin permissions on the LiteLLM proxy. No new secret needed.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* refactor: use LLM_API_KEY only, remove LITELLM_MASTER_KEY

The eval key already has admin permissions on the proxy — no need for
a separate master key env var.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: store proxy cost in test_result instead of overwriting accumulated_cost

Keep both values in the output JSON — the SDK's token-count estimate
in metrics.accumulated_cost and the exact proxy cost in
test_result.proxy_cost — so they're both available in GCS.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* test: add unit tests for litellm_proxy module

Covers success paths, HTTP errors, connection errors, timeouts,
config detection, and thread-local isolation.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: raise on virtual key creation failure instead of silent fallback

When the LiteLLM proxy is configured but key creation fails, raise
RuntimeError instead of returning None. This prevents silent fallback
to real provider keys in acp_env.

Also moves get_key_spend from GET (key in URL query param) to POST
(key in request body) to avoid logging secrets in access logs.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: resolve pyright error on _get_config() return type narrowing

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: use LLM_API_MASTER_KEY instead of LLM_API_KEY for proxy admin ops

Separates the admin key (virtual key CRUD) from the shared eval key
to limit blast radius if LLM_API_KEY is ever exposed.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* chore: update vendor/software-agent-sdk to latest main

Includes:
- feat(acp): add Gemini CLI as supported ACP server (#2619)
- fix(workspace): avoid logging forwarded env values (#2630)
- fix(conversation): sanitize remote error logging (#2631)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: include content hash in assembly image tags

The content hash was added to eval-base image tags in #599 to
auto-rebuild when the SDK Dockerfile changes, but assembly (agent-layer)
tags still used only the SDK SHA. This meant stale assembly images
built on top of old base images (without ACP CLIs) were reused even
after base images were rebuilt.

Include the content hash in assembly tags so Dockerfile changes
invalidate both base AND assembly image caches.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: include content hash in IMAGE_TAG_PREFIX for pull-side tag matching

The build side now includes the Dockerfile content hash in assembly image
tags (2396c60), but the pull side (IMAGE_TAG_PREFIX) still only used the
SDK short SHA. This mismatch would cause consumers to look for images
with the old tag format and fail to find the newly built images.

Update IMAGE_TAG_PREFIX to include the same content hash so that
run_infer.py and modal_patches.py construct tags that match what
build_base_images.py produces.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: update test to match new IMAGE_TAG_PREFIX format

The default IMAGE_TAG_PREFIX now includes both the SDK short SHA and
the Dockerfile content hash. Update the test assertion accordingly.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: scope content hash to phased-build consumers only

The previous version.py change added the content hash to
IMAGE_TAG_PREFIX globally, which would break legacy-build benchmarks
(gaia, commit0, multiswebench, etc.) whose images don't include the
content hash.

Instead, introduce get_phased_image_tag_prefix() which includes the
content hash, and only use it in phased-build consumers:
- swebench/run_infer.py
- swebenchmultimodal/run_infer.py
- swtbench/run_infer.py
- modal_patches.py (swebench Modal path)

Legacy benchmarks continue using IMAGE_TAG_PREFIX (SDK SHA only).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* style: apply ruff formatting fixes

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: don't let SDK_SHORT_SHA env var bypass content hash in phased prefix

run_eval.sh sets SDK_SHORT_SHA, which caused get_phased_image_tag_prefix()
to short-circuit and return the SHA without content hash. Only an explicit
IMAGE_TAG_PREFIX override should bypass the content hash computation.

Caught by validation runs: phased builds (swebench, swtbench) produced
images with content hash in tags, but inference pulled images without it.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: use GET for /key/info endpoint (LiteLLM returns 405 for POST)

LiteLLM's /key/info endpoint expects GET with a query param, not POST
with JSON body. The 405 Method Not Allowed was preventing proxy cost
tracking from working.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: update get_key_spend tests to use GET and merge content-hash changes

- Fix test mocks: get_key_spend now uses httpx.get (not post) for
  /key/info endpoint (LiteLLM returns 405 for POST)
- Merge content-hash assembly tag changes from PR #602 to prevent
  stale image cache bugs when rebuilding

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Update benchmarks/utils/litellm_proxy.py

Co-authored-by: Vasco Schiavo <115561717+VascoSch92@users.noreply.github.com>

---------

Co-authored-by: Debug Agent <debug@example.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: Vasco Schiavo <115561717+VascoSch92@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Tech Debt]: Content-hash eval-base image tags to auto-rebuild on Dockerfile changes

2 participants