Fleet release: publish the 5-agent combo fleet + base-image CVE fixes #189

Merged
elronbandel merged 8 commits into main from elron/fleet-release on Jun 18, 2026


elronbandel commented Jun 17, 2026


What this does

Adds one workflow that releases the whole eval-container fleet to GHCR (multi-arch), plus the security fixes that make it pass the CVE gate.

Fleet release (release-images.yml) — folds the old per-task publisher in and adds the combo + per-task publish. A vX.Y.Z tag (or a manual run) builds, in one pipeline:

  • the shared base images once per arch (frozen), then every benchmark / agent / model leaf in parallel;
  • eval combos for the 5 released agents — claude-code, codex, gemini-cli, openclaw, zerostack — × every benchmark, including the per-task ones (all 500 swe-bench tasks + all terminal-bench tasks), plus a single-container -standalone bundle for each combo (lean base + in-process gateway/otelcol/process-compose — the laptop / --mode container artifact);
  • the per-task base images (swe-bench via bake, terminal-bench via its build script);
  • combo + per-task matrices are sharded to fit under GitHub's 256-job cap;
  • builds retry with jittered backoff so Hugging Face rate-limiting on the runner IP can't sink a leaf;
  • combos + compose survive a partial leaf build — one failed leaf can't skip every combo/bundle;
  • CVE gate on the base images, then :tag → :latest promotion + Helm chart;
  • an end-of-run durable size report (fleet-size-report artifact: per-agent/-benchmark size + build time as JSON + CSV) for the audits + README tables.
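The retry bullet above can be sketched as follows. This is a minimal Python illustration of retry with jittered exponential backoff, not the workflow's actual shell `retry()` helper; the function names, the default base delay, and the 60-second cap are assumptions made for the example.

```python
import random
import time
from typing import Callable, List, Optional, TypeVar

T = TypeVar("T")


def backoff_delays(attempts: int, base: float = 2.0, cap: float = 60.0,
                   rng: Optional[random.Random] = None) -> List[float]:
    """Return the sleep before each retry (attempts - 1 values).

    Uses "full jitter": each delay is drawn uniformly from
    [0, min(cap, base * 2**i)], so parallel builds that fail together
    do not all retry at the same instant.
    """
    rng = rng or random.Random()
    return [rng.uniform(0, min(cap, base * 2 ** i)) for i in range(attempts - 1)]


def retry(fn: Callable[[], T], attempts: int = 5, base: float = 2.0,
          cap: float = 60.0, sleep: Callable[[float], None] = time.sleep) -> T:
    """Call fn until it succeeds or the attempts run out."""
    if attempts < 1:
        raise ValueError("attempts must be >= 1")
    delays = backoff_delays(attempts, base, cap)
    for i in range(attempts):
        try:
            return fn()
        except Exception:
            if i == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(delays[i])
    raise AssertionError("unreachable")
```

Passing `sleep` as a parameter keeps the sketch testable without real waiting; a rate-limited download would be wrapped as `retry(lambda: download(url))`, where `download` is whatever the build step calls.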

Multi-arch (amd64 + arm64) — the fleet publishes multi-arch manifest lists via native-per-arch (NOT QEMU — 41 of 103 benchmarks still use pyarrow, which segfaults under emulation): bases / leaves / per-task each carry an arch: [amd64, arm64] matrix and build on their own runner (arm64 → ubuntu-24.04-arm, override with the FLEET_RUNNER_ARM repo Variable), each pushing :TAG-<arch>; a new merge job stitches them into the :TAG manifest list (imagetools create), skipping any image with no per-arch tag so a single-arch failure stays isolated. Combos build multi-arch directly with buildx --platform amd64,arm64 (thin overlays are QEMU-safe). swe-bench FROMs Epoch's per-arch base via EVAL_BASE_ARCH. The Dockerfiles + the per-arch tag flow are validated locally; the arm64 CI path is verified on the release run.
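A rough sketch of the merge step described above, in Python rather than the workflow's shell. It assumes the set of pushed `:TAG-<arch>` tags is already known, and it reads "skipping any image with no per-arch tag" as: an image with one arch becomes a single-arch manifest list, an image with none is left out. The function names and the example registry paths are hypothetical.

```python
from typing import Dict, Iterable, List, Set

ARCHES = ("amd64", "arm64")


def merge_plan(images: Iterable[str], pushed_tags: Set[str],
               tag: str) -> Dict[str, List[str]]:
    """Map each :TAG manifest list to the per-arch tags that exist.

    Images with no per-arch tag at all are omitted, so one failed
    image cannot fail the whole merge.
    """
    plan: Dict[str, List[str]] = {}
    for image in images:
        sources = [f"{image}:{tag}-{arch}" for arch in ARCHES
                   if f"{image}:{tag}-{arch}" in pushed_tags]
        if sources:
            plan[f"{image}:{tag}"] = sources
    return plan


def imagetools_argv(target: str, sources: List[str]) -> List[str]:
    """Render the `docker buildx imagetools create` call for one image."""
    return ["docker", "buildx", "imagetools", "create", "-t", target, *sources]
```

For example, with only `a:v1-amd64`, `a:v1-arm64` and `b:v1-amd64` pushed, the plan stitches `a:v1` from both arches, publishes `b:v1` as amd64-only, and skips any image `c` that has neither.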

Base-image CVE fixes — the fleet passes the HIGH/CRITICAL trivy gate:

  • litellm: install the four genuinely-needed patched deps (starlette / python-multipart / PyJWT / cryptography) into the base's pip-less uv venv, guarded by a functional RS256 sign+verify smoke (cryptography is forced past litellm's <47 cap);
  • runtime-bundle: build gosu + process-compose from source with a current Go (clears a CRITICAL + the Go-stdlib CVEs the upstream release binaries still carry);
  • otel: build the collector with a current Go toolchain (collector version unchanged → traces stay byte-identical); the grpc CRITICAL (CVE-2026-33186) is fixed via a builder-config grpc→1.79.3 bump, not deferred;
  • gateways/litellm: bumped 1.83.3 → 1.89.1 with the same dep patches;
  • the rest: OS-package upgrades. Remaining findings are no-fix upstream and documented in .trivyignore.

HF dataset auth — build-time Hugging Face downloads now authenticate (Bearer token), so cold parallel builds aren't throttled to anonymous rate limits.

Cleanup — removes frontiermath (a private dataset that can't be built from a clean checkout) and refreshes the counts/docs/tests.

Pre-release review

An independent multi-agent review (CVEs, workflow logic, multi-arch/build, regression) ran over the branch; every confirmed finding is folded in and validated locally: the grpc CRITICAL is now genuinely fixed (it had been a wrong-justification deferral); otel/runtime-bundle/duckdb no longer break on the classic build path (BuildKit-only $TARGETARCH was empty there → native build); combos + compose survive a partial leaf build; the gateway litellm was bumped; npm@latest was pinned; the .secrets.baseline was corrected (it had been regenerated against the pre-#171 layout).

Scope note

swe-bench-pro + swe-lancer per-task publishing is intentionally dropped vs the old publish-per-task.yml (which published skills-bench, terminal-bench, swe-bench-pro, swe-lancer). The unified workflow publishes terminal-bench + skills-bench + swe-bench per-task — matching the release scope (terminal-bench + swe-bench).

Validation

  • CVE gate green across all base images (incl. the duckdb/slim bases); core/litellm rebuilt + trivy-scanned clean with the trimmed pin set (0 non-ignored HIGH/CRITICAL).
  • otel / runtime-bundle / duckdb build + run as native arm64 on the classic DOCKER_BUILDKIT=0 path, and amd64; the per-arch TAG flows into both the image tag and the frozen-base context (bake --print).
  • The standalone bake graph (eval + eval-standalone) resolves; the size-report JSON/CSV generation is validated.
  • Full-fleet config resolves at scale (sharded matrices stay under the 256-job cap, ×2 arch accounted for).
  • The full multi-arch publish (which runs for hours) and the arm64 runner path are verified only on the release tag (no local arm64 runner; Actions is billing-locked).

Releasing

After merge, push a v0.1.0 tag (matches the current Cargo/Chart version, so the version guard passes) — it fires this workflow + the CLI release.

elronbandel force-pushed the elron/fleet-release branch 4 times, most recently from 19fe6b8 to db8deaf on June 17, 2026 15:20
litellm installs its patched transitive deps into the base's pip-less uv venv;
runtime-bundle builds gosu + process-compose from source (go1.26.4) to clear
Go-stdlib CVEs incl. a CRITICAL instead of shipping the lagging release
binaries; otel bumps its OCB toolchain go1.22->1.26.4 (collector stays pinned
0.105.0 for byte-identical traces); the rest apt-upgrade for OS patches.
.trivyignore documents the remaining no-fix deferrals.

Signed-off-by: Elron Bandel <elron.bandel@ibm.com>
One workflow publishes the whole fleet: frozen base + per-leaf matrix, plus
sharded combo and per-task matrices (past GitHub's 256-job cap) for the released
agents (claude-code codex gemini-cli openclaw zerostack) x every benchmark,
including the per-task ones (swe-bench via bake, terminal-bench via build.sh).
Builds retry with jittered backoff to ride out HF rate-limiting; CVE gate then
:latest promotion. A vX.Y.Z tag publishes the full fleet.

Signed-off-by: Elron Bandel <elron.bandel@ibm.com>
frontiermath is a private Epoch dataset (unbuildable from a clean checkout);
remove it and refresh the benchmark count, audit rollup, affected docs, and
test fixtures.

Signed-off-by: Elron Bandel <elron.bandel@ibm.com>
Drop the 16 `--platform=linux/amd64` pins (every base image is multi-arch; the
scratch carriers are arch-neutral) and parameterize the 3 arch-specific spots
with $TARGETARCH: the OCB otel build (GOARCH), the runtime-bundle gosu/process-
compose source builds (GOARCH), and the duckdb CLI download. swe-bench rides
Epoch's per-arch base via EVAL_BASE_ARCH (set per build-arch in the workflow).
Verified otel, runtime-bundle, and the duckdb base all build natively on arm64.

Signed-off-by: Elron Bandel <elron.bandel@ibm.com>
… combos)

Pre-release review across CVEs, workflow logic, and multi-arch surfaced the
following; each fix validated locally.

core/otel: the CRITICAL grpc auth-bypass CVE-2026-33186 (on the live OTLP
:4317 receiver) was deferred with a wrong "can't patch" rationale. It is now
genuinely FIXED — builder-config.yaml `replaces` grpc => v1.79.3
(transport-only, traces byte-identical; collector 0.105.0 compiles and the
binary embeds 1.79.3, verified). Only the two Darwin/BSD-only otel-sdk
path-hijack CVEs stay deferred, now with a correct reachability justification.
Added `apk upgrade` to the (gated) alpine final.

core/otel, runtime-bundle, benchmark-base-duckdb: dropped BuildKit-only
$TARGETARCH/$BUILDPLATFORM, which are EMPTY under the classic DOCKER_BUILDKIT=0
path the local Apple-Silicon harness uses (GOARCH= → `go install` failed). Now
native build (GOARCH follows the builder) + a dpkg-derived arch for duckdb.
Verified: all three build + run as native arm64 under DOCKER_BUILDKIT=0.

core/litellm: trimmed 4 of 8 force-pins that were no-ops vs the real v1.89.1
base venv (orjson/urllib3 already fixed, the tornado pin was a downgrade,
pillow isn't installed). Kept starlette/python-multipart/PyJWT/cryptography and
added a functional RS256 sign+verify smoke guarding the cryptography-48 (past
litellm's <47 cap) override. Rebuilt + trivy-scanned: 0 non-ignored HIGH/CRITICAL.

gateways/litellm: bumped 1.83.3 → 1.89.1 with the same dep force-patches (the
core remediation had skipped the gateway flavor) + uv 0.5.14 → 0.11.21.

agent-base-node: pinned npm@latest → npm@11.17.0 (rule 9; verified it bundles
the patched minimatch/glob/tar tree).

release-images.yml: combos + compose now run on a partial build (success OR
failure) so one failed leaf among 124 can't skip every combo + compose bundle;
the per-item loop / per-benchmark matrix isolates a missing parent. release-gate
stays strict (:latest promotes only on a clean build).
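The gating described above can be sketched as three small predicates. This is a Python illustration of the decision logic only, not the workflow's `if:` expressions; the function names and the `agent-<name>` / `benchmark-<name>` identifiers for published leaves are assumptions.

```python
from typing import Iterable, List, Sequence, Set, Tuple


def combos_should_run(leaf_build_result: str) -> bool:
    """Run combos after a clean OR partially failed leaf build,
    but not when the build was cancelled or skipped."""
    return leaf_build_result in {"success", "failure"}


def buildable_combos(agents: Sequence[str], benchmarks: Sequence[str],
                     published: Set[str]) -> List[Tuple[str, str]]:
    """Keep only the combos whose two parent images were published,
    so a missing parent drops that combo and nothing else."""
    return [(a, b) for a in agents for b in benchmarks
            if f"agent-{a}" in published and f"benchmark-{b}" in published]


def may_promote_latest(job_results: Iterable[str]) -> bool:
    """The release gate stays strict: every job must have succeeded."""
    results = list(job_results)
    return bool(results) and all(r == "success" for r in results)
```

The asymmetry is the point: the combo stage tolerates a failed leaf so the rest of the fleet still builds, while `:latest` promotion requires a fully clean run.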

.trivyignore: corrected the inaccurate "apk upgrade in every base" header.

.secrets.baseline: restored to main's correct state (the PR had regenerated it
against the pre-#171 layout, pointing sk-proxy at services.yaml instead of
runner.yaml); only the legitimate release-images.yml line-shift remains.

docs/podman-on-apple-silicon: dropped the stale frontiermath ref from the
current gated list (the dated changelog row is kept as history).

Signed-off-by: Elron Bandel <elron.bandel@ibm.com>
standalone: the combos job now bakes eval-standalone alongside eval for every
combo, gated by a new include_standalone input (default true, so a tag release
publishes them). eval-standalone is the single-container bundle — lean base +
in-process gateway/otelcol/process-compose, the `--mode container` / laptop
artifact; bake builds eval once and layers standalone onto it via the eval-base
in-graph context (validated via `bake --print`). Name suffix `-standalone`,
same :tag as the lean base (principle 9).

size report: the report job now emits a durable, machine-readable
size-report.json + size-report.csv (per-leaf kind / name / image /
build_seconds / size_bytes, amd64 compressed) and uploads them as the
fleet-size-report artifact — the input for the AUDIT.md size lane + the README
agent/benchmark size tables. Measured once, reused for the existing
step-summary table.

(.secrets.baseline: line-number resync for the HF_TOKEN finding the workflow
edits shifted — no new secret.)

Signed-off-by: Elron Bandel <elron.bandel@ibm.com>
The fleet now publishes amd64 + arm64 manifest lists. Native-per-arch, NOT QEMU
— pyarrow segfaults under emulation and 41 of 103 benchmarks still use it, so
every heavy build runs on its own metal:

- bases / build (leaves) / per-task each gain an `arch: [amd64, arm64]` matrix
  and run on the native runner (arm64 → ubuntu-24.04-arm, overridable via the
  FLEET_RUNNER_ARM repo Variable). Each pushes :TAG-<arch>; the frozen-base
  context + per-leaf cache are per-arch so the two arches never collide.
- swe-bench (kind=bake) passes EVAL_BASE_ARCH per arch (amd64→x86_64,
  arm64→arm64) so it FROMs Epoch's matching per-task base.
- a new `merge` job stitches :TAG-<arch> into the :TAG manifest list for every
  base / leaf / per-task image (imagetools create); an image with no per-arch
  tag (a failed arch, or no arm64 upstream) is skipped, isolating the failure.
- combos build multi-arch directly with buildx --platform amd64,arm64 (thin
  overlays are QEMU-safe; setup-qemu added) FROM the merged manifest-list parents.
- compose / release-gate / report consume the :TAG manifest lists.

Matrix guards: the leaf matrix now fans out ×2 arch, so the leaf cap drops to
128 (×2 ≤ 256) and per-task shards ×2 must fit 256.
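The cap arithmetic can be checked with a short sketch. Assuming a matrix fans out as entries × arches and the 256-job cap cited above, the per-matrix entry limit is 256 // 2 = 128; the `shard` helper below is a hypothetical illustration of how a 500-task list could be split to stay under it, not the workflow's actual enumerate step.

```python
import math
from typing import List, Sequence

GITHUB_MATRIX_CAP = 256  # per-matrix job cap cited in this PR


def max_entries(arches: int, cap: int = GITHUB_MATRIX_CAP) -> int:
    """Largest number of matrix entries once each fans out per arch."""
    return cap // arches


def shard(items: Sequence[str], arches: int = 2,
          cap: int = GITHUB_MATRIX_CAP) -> List[List[str]]:
    """Split items into at most cap // arches contiguous shards."""
    limit = max_entries(arches, cap)
    if limit < 1:
        raise ValueError("cap is too small for this many arches")
    if not items:
        return []
    size = math.ceil(len(items) / limit)  # items handled by one job
    return [list(items[i:i + size]) for i in range(0, len(items), size)]
```

With 500 tasks and 2 arches, each shard holds ceil(500 / 128) = 4 tasks, giving 125 shards and 250 jobs, which fits under 256.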

Validated locally: the per-arch TAG flows into both the image tag and the
frozen-base context (bake --print); otel/runtime-bundle/duckdb build native
arm64; the combo Dockerfiles are pure layering (QEMU-safe). The arm64 CI path
itself is verified only on the release run (no local arm64 runner; Actions is
billing-locked).

Signed-off-by: Elron Bandel <elron.bandel@ibm.com>
elronbandel force-pushed the elron/fleet-release branch from 96b20cd to 5f118cb on June 18, 2026 11:08
An independent simplicity review (orchestration + shell lenses) confirmed the
architecture is sound and not over-engineered; these are the polish wins it
surfaced. No behavior change:

- runs-on: replace the 3× duplicated nested arch ternary with a matrix `include`
  that maps arch→runner, so `runs-on: ${{ matrix.runner }}` is a plain lookup.
- enumerate: `jq -R . | jq -sc` → a single `jq -Rsc 'split("\n")|…'` (3×).
- per-task / combos loops: 3 per-iteration `jq -r` calls → one `[...]|@tsv` + read.
- merge / report: `IFS="$(printf '\t')"` and `sort -t"$(printf '\t')"` → `$'\t'`.
- report CSV: hand-rolled awk → `jq … |@csv` from the JSON we already build
  (single source; @csv quotes fields, closing a latent comma gap).
- report: drop a redundant `size_of` double-guard.
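To illustrate the "latent comma gap" in the CSV item above: a hand-rolled join breaks as soon as a field contains a comma, while a real CSV writer quotes it. The sketch below uses Python's `csv` module as a stand-in for jq's `@csv`; the column names come from the size-report description in this PR, and the row values are invented. Note that Python quotes only fields that need it, whereas `@csv` quotes every string.

```python
import csv
import io
from typing import Dict, List

FIELDS = ["kind", "name", "image", "build_seconds", "size_bytes"]


def naive_csv(rows: List[Dict[str, object]]) -> str:
    """Hand-rolled join: a comma inside a field shifts every later column."""
    return "\n".join(",".join(str(r[f]) for f in FIELDS) for r in rows)


def to_csv(rows: List[Dict[str, object]]) -> str:
    """Writer-based output: fields containing commas or quotes are quoted."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()
```

A row whose name is `odd,name` comes out of `naive_csv` with six columns instead of five, while `to_csv` emits `benchmark,"odd,name",...` and reads back intact.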

Declined (would add complexity, not remove it): a composite action for the
retry()/checkout/login preamble (cross-file indirection; the backoffs differ on
purpose), and deduping the 2-line target→dir map via an enumerate output.

Signed-off-by: Elron Bandel <elron.bandel@ibm.com>
elronbandel merged commit 7704c7a into main on Jun 18, 2026
4 of 5 checks passed
elronbandel added a commit that referenced this pull request Jun 18, 2026
…nch) (#194)

terminal-bench + skills-bench build.sh forced `--platform linux/amd64`, so the
per-task arm64 matrix job built amd64 content and tagged it :TAG-arm64 — broken
multi-arch. Verified the upstreams ARE arm64-capable: all 89 terminal-bench task
environments and ~82/87 skills-bench are FROM standard multi-arch bases
(ubuntu/python/debian), and a terminal-bench task builds clean on native arm64.

- build.sh (both): drop the --platform pin → builds native on whichever runner
  (amd64 or arm64) the per-task job lands on.
- release-images.yml (kind=script): push the per-task image only if its built
  architecture matches the runner's. ~5 skills-bench tasks are upstream-amd64-
  pinned (2 `FROM --platform=linux/amd64`, plus bugswarm/oss-fuzz amd64-only
  bases); on the arm64 runner those build amd64, so the check skips the push and
  the merge keeps them single-arch (amd64) rather than mislabeling.
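The push guard above amounts to comparing two architecture names after normalizing their spellings. A minimal Python sketch follows, assuming the built image's architecture and the runner's machine name are already available as strings; the alias table and function names are illustrative, not taken from the workflow.

```python
# Common spellings of the two release architectures (assumed alias table).
_ALIASES = {
    "x86_64": "amd64",
    "amd64": "amd64",
    "aarch64": "arm64",
    "arm64": "arm64",
}


def normalize_arch(name: str) -> str:
    """Map uname-style and Docker-style names onto amd64 / arm64."""
    key = name.strip().lower()
    if key not in _ALIASES:
        raise ValueError(f"unsupported architecture: {name!r}")
    return _ALIASES[key]


def should_push(built_arch: str, runner_arch: str) -> bool:
    """Push a per-task image only if it was built for the runner's arch.

    An amd64-pinned task built on the arm64 runner returns False, so it
    is never tagged :TAG-arm64 and stays single-arch after the merge.
    """
    return normalize_arch(built_arch) == normalize_arch(runner_arch)
```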

swe-lancer/swe-bench-pro carry the same pin but are out of the release scope
(dropped in #189); left for when they're re-added.

Signed-off-by: Elron Bandel <elron.bandel@ibm.com>
elronbandel added a commit that referenced this pull request Jun 18, 2026
… dropped (#198)

The dry_run smoke caught this: the `report` job's size-report jq crashed
(`jq: error … Expected JSON value (while parsing '')`) on a leaf whose `:latest`
size resolved to empty. #189's simplicity pass dropped `sz=${sz:-0}` as
"redundant" — but it guards `size_of`'s empty-output path (the trailing `// 0`
only fires when jq gets input; on empty input jq emits nothing, so `sz` stays
""). Restored it, and made the JSON build defensive (`tonumber? // 0`) so a
missing/odd size can never crash the report again. The publish path was
unaffected — every --print job passed; only the end-of-run report died.
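The failure mode above has a close Python analogue: parsing an empty string as JSON raises, so the default has to be applied before the value reaches the parser. The sketch below mirrors the two guards (`sz=${sz:-0}` and `tonumber? // 0`) in Python; it is an illustration under that analogy, and `size_or_zero` / `report_entry` are hypothetical names, not functions from the repo.

```python
import json
from typing import Optional


def size_or_zero(raw: Optional[str]) -> int:
    """Return the size in bytes, or 0 for empty, missing, or odd input.

    The empty-string default plays the role of `sz=${sz:-0}`; the
    except branch plays the role of `tonumber? // 0`.
    """
    text = (raw or "").strip()
    try:
        return int(text)
    except ValueError:
        return 0


def report_entry(kind: str, name: str, raw_size: Optional[str]) -> str:
    """Build one size-report record that cannot crash on a bad size."""
    return json.dumps({"kind": kind, "name": name,
                       "size_bytes": size_or_zero(raw_size)})
```

Without the guard, `json.loads("")` raises `json.JSONDecodeError`, which corresponds to the jq crash the dry run caught.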

Signed-off-by: Elron Bandel <elron.bandel@ibm.com>
elronbandel added a commit that referenced this pull request Jun 18, 2026
…efault

Write the guide as current state — no #189 mentions, no 'since X' framing.
Rosetta removed from TL;DR: normal builds and evals are native arm64 and
need no Rosetta. Rosetta stays in §1 as optional, scoped to amd64-only
images and test suites (DOCKER_BUILDKIT=0 path).
elronbandel added a commit that referenced this pull request Jun 21, 2026
…uilds

`docker buildx bake` builds native arm64 by default on Apple Silicon. The guide
still described QEMU avoidance as the default path, contradicting current
behavior.

- Reframe default: `docker buildx bake` is the normal local build (arm64, no QEMU)
- Demote `DOCKER_BUILDKIT=0` / classic-build to edge-case: test suites that pin
  linux/amd64 for testcontainers compatibility, and genuinely amd64-only images
- Demote Rosetta from "REQUIRED for everything" to "needed for amd64-only images"
- Add §5a historical note explaining the pre-#189 QEMU/Rosetta limitation
- Add Docker Hub 429 rate-limit note with `docker login` / ECR mirror fix
- Update troubleshooting table accordingly

Closes #196
elronbandel added a commit that referenced this pull request Jun 21, 2026
…189 (#199)

* docs(podman): update Apple Silicon guide for post-#189 native arm64 builds

`docker buildx bake` builds native arm64 by default on Apple Silicon. The guide
still described QEMU avoidance as the default path, contradicting current
behavior.

- Reframe default: `docker buildx bake` is the normal local build (arm64, no QEMU)
- Demote `DOCKER_BUILDKIT=0` / classic-build to edge-case: test suites that pin
  linux/amd64 for testcontainers compatibility, and genuinely amd64-only images
- Demote Rosetta from "REQUIRED for everything" to "needed for amd64-only images"
- Add §5a historical note explaining the pre-#189 QEMU/Rosetta limitation
- Add Docker Hub 429 rate-limit note with `docker login` / ECR mirror fix
- Update troubleshooting table accordingly

Closes #196

Signed-off-by: Elron Bandel <elron.bandel@ibm.com>

* docs(podman): DOCKER_BUILDKIT=0 is still required for test suites (not 'may')

testcontainers uses .with_platform("linux/amd64") in agents/gateways tests,
so the test harness always builds linux/amd64 images — QEMU path, pyarrow
segfault — making DOCKER_BUILDKIT=0 a hard requirement, not a maybe.

Signed-off-by: Elron Bandel <elron.bandel@ibm.com>

* docs(podman): remove historical references, Rosetta is optional not default

Write the guide as current state — no #189 mentions, no 'since X' framing.
Rosetta removed from TL;DR: normal builds and evals are native arm64 and
need no Rosetta. Rosetta stays in §1 as optional, scoped to amd64-only
images and test suites (DOCKER_BUILDKIT=0 path).

Signed-off-by: Elron Bandel <elron.bandel@ibm.com>

* tests: remove linux/amd64 platform forcing — fleet is now multi-arch

All testcontainers `.with_platform("linux/amd64")` calls and the
`*.platform=linux/amd64` bake overrides in the test harness are stale
after the fleet went multi-arch. Tests now build and run native arm64
on Apple Silicon — no DOCKER_BUILDKIT=0, no QEMU, no pyarrow segfaults.

Remove §5a and the stale DOCKER_BUILDKIT=0 requirement from the
podman-on-apple-silicon guide accordingly.

Closes #196

Signed-off-by: Elron Bandel <elron.bandel@ibm.com>

* cli: remove --platform linux/amd64 from oracle runner

Benchmark images are now multi-arch after the fleet migration.
The hardcoded platform pin is stale — run natively on the host arch.

Signed-off-by: Elron Bandel <elron.bandel@ibm.com>

* cli,tests: make platform configurable, not forced

oracle: add --platform flag (omit = native)
tests: read TEST_PLATFORM env var for bake and classic-build overrides
Signed-off-by: Elron Bandel <elron.bandel@ibm.com>

* tests: simplify TEST_PLATFORM override construction in bake_targets

Signed-off-by: Elron Bandel <elron.bandel@ibm.com>

* tests: trim over-documentation in common/mod.rs

Signed-off-by: Elron Bandel <elron.bandel@ibm.com>

* scripts: bake-plan.sh builds multi-arch (amd64 + arm64)

Stale since the fleet went multi-arch.

Signed-off-by: Elron Bandel <elron.bandel@ibm.com>

* fix(replay): add --no-pull to build eval to skip arm64 registry miss

combination.Dockerfile uses ARG-based FROM instructions (FROM ${AGENT_IMAGE},
FROM ${BENCHMARK_IMAGE}). Each `cargo run -- build` call is a separate
BuildKit session, so the eval session checks the remote registry manifest
for those images. On arm64 the registry has only linux/amd64 entries — the
check fails.

Add --no-pull to BuildTarget::Eval in the CLI. When set, it pushes
eval.pull=false as a bake override, telling BuildKit to use the content
store (populated by the preceding `build bench` and `build agent` calls)
instead of fetching the remote manifest.

The replay test's bake path calls the CLI end-to-end for all three build
steps and passes --no-pull to the eval call, keeping it a true CLI
black-box (RULES.md R-2). The classic/podman path is unchanged.

Note: the principled fix is named contexts in the bake file so the
dependency is explicit in the build graph (tracked as a follow-up).

Closes #196

Signed-off-by: Elron Bandel <elron.bandel@ibm.com>

* containers: remove stale --platform=linux/amd64 pin from benchmark-base-python-slim

Signed-off-by: Elron Bandel <elron.bandel@ibm.com>

* tests: add --no-pull to agents smoke eval build (arm64 fix)

Same arm64 registry-miss that replay had: bench and agent images are
just built locally, so the eval bake must use the BuildKit content
store rather than attempting a registry manifest check that returns
only amd64.

Signed-off-by: Elron Bandel <elron.bandel@ibm.com>

* fix(eval): use eval-local target for --no-pull builds to avoid arm64 registry miss

Replace the ineffective `eval.pull=false` override with a proper `eval-local`
bake target that wires bench+agent as named contexts. With docker-container
driver (Mac), `--load` does not put images in BuildKit's content store, so
`pull=false` still triggers registry manifest checks when images are absent —
failing on arm64 (no arm64 manifest for most benchmarks in the registry).

Named contexts bypass the pull path entirely: BuildKit treats them as build
graph nodes, not images to fetch. The `eval-local` target inherits from `eval`
and adds `${BENCHMARK_IMAGE}` → `target:benchmark-${EVAL_BENCHMARK}` and
`${AGENT_IMAGE}` → `target:agent-${EVAL_AGENT}` contexts.

To resolve the HCL context keys (which use the HCL variable, not the build
arg), BENCHMARK_IMAGE and AGENT_IMAGE are now passed as environment variables
to bake, not only as `--set *.args.*` overrides.

Per-task builds (task_id.is_some()) keep using the plain `eval` target because
task-specific BENCHMARK_IMAGE URLs cannot map to a named bake target.

Signed-off-by: Elron Bandel <elron@exgentic.com>

---------

Signed-off-by: Elron Bandel <elron.bandel@ibm.com>
Signed-off-by: Elron Bandel <elron@exgentic.com>