Fleet release: publish the 5-agent combo fleet + base-image CVE fixes #189
Merged
litellm installs its patched transitive deps into the base's pip-less uv venv; runtime-bundle builds gosu + process-compose from source (go1.26.4) to clear Go-stdlib CVEs incl. a CRITICAL instead of shipping the lagging release binaries; otel bumps its OCB toolchain go1.22->1.26.4 (collector stays pinned 0.105.0 for byte-identical traces); the rest apt-upgrade for OS patches. .trivyignore documents the remaining no-fix deferrals. Signed-off-by: Elron Bandel <elron.bandel@ibm.com>
One workflow publishes the whole fleet: frozen base + per-leaf matrix, plus sharded combo and per-task matrices (past GitHub's 256-job cap) for the released agents (claude-code codex gemini-cli openclaw zerostack) x every benchmark, including the per-task ones (swe-bench via bake, terminal-bench via build.sh). Builds retry with jittered backoff to ride out HF rate-limiting; CVE gate then :latest promotion. A vX.Y.Z tag publishes the full fleet. Signed-off-by: Elron Bandel <elron.bandel@ibm.com>
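The retry-with-jittered-backoff behaviour mentioned above can be sketched as a small bash helper. This is a minimal sketch, not the workflow's actual `retry()`: the `RETRY_MAX` and `RETRY_BASE` knobs and the exact delay formula are assumptions made for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: rerun a command with exponential backoff plus jitter.
# RETRY_MAX (attempts) and RETRY_BASE (seconds) are illustrative names only.
retry() {
  local max="${RETRY_MAX:-5}" base="${RETRY_BASE:-2}" attempt=1 delay
  while true; do
    "$@" && return 0
    if [ "$attempt" -ge "$max" ]; then
      echo "retry: giving up after ${attempt} attempts: $*" >&2
      return 1
    fi
    # base * 2^(attempt-1), plus 0..base seconds of random jitter so that
    # parallel jobs hitting the same rate limit do not retry in lockstep.
    delay=$(( base * (1 << (attempt - 1)) + RANDOM % (base + 1) ))
    echo "retry: attempt ${attempt} failed; sleeping ${delay}s" >&2
    sleep "$delay"
    attempt=$(( attempt + 1 ))
  done
}
```

A build step could then be wrapped as `retry docker buildx bake <target>`, where the target is whatever the job builds; the jitter matters most when many matrix jobs start at once.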
frontiermath is a private Epoch dataset (unbuildable from a clean checkout); remove it and refresh the benchmark count, audit rollup, affected docs, and test fixtures. Signed-off-by: Elron Bandel <elron.bandel@ibm.com>
Drop the 16 `--platform=linux/amd64` pins (every base image is multi-arch; the scratch carriers are arch-neutral) and parameterize the 3 arch-specific spots with $TARGETARCH: the OCB otel build (GOARCH), the runtime-bundle gosu/process-compose source builds (GOARCH), and the duckdb CLI download. swe-bench rides Epoch's per-arch base via EVAL_BASE_ARCH (set per build-arch in the workflow). Verified otel, runtime-bundle, and the duckdb base all build natively on arm64. Signed-off-by: Elron Bandel <elron.bandel@ibm.com>
… combos)
Pre-release review across CVEs, workflow logic, and multi-arch surfaced the following; each fix validated locally.
- core/otel: the CRITICAL grpc auth-bypass CVE-2026-33186 (on the live OTLP :4317 receiver) was deferred with a wrong "can't patch" rationale. It is now genuinely FIXED: builder-config.yaml `replaces` grpc => v1.79.3 (transport-only, traces byte-identical; collector 0.105.0 compiles and the binary embeds 1.79.3, verified). Only the two Darwin/BSD-only otel-sdk path-hijack CVEs stay deferred, now with a correct reachability justification. Added `apk upgrade` to the (gated) alpine final.
- core/otel, runtime-bundle, benchmark-base-duckdb: dropped BuildKit-only $TARGETARCH/$BUILDPLATFORM, which are EMPTY under the classic DOCKER_BUILDKIT=0 path the local Apple-Silicon harness uses (GOARCH= → `go install` failed). Now native build (GOARCH follows the builder) + a dpkg-derived arch for duckdb. Verified: all three build + run as native arm64 under DOCKER_BUILDKIT=0.
- core/litellm: trimmed 4 of 8 force-pins that were no-ops vs the real v1.89.1 base venv (orjson/urllib3 already fixed, the tornado pin was a downgrade, pillow isn't installed). Kept starlette/python-multipart/PyJWT/cryptography and added a functional RS256 sign+verify smoke guarding the cryptography-48 (past litellm's <47 cap) override. Rebuilt + trivy-scanned: 0 non-ignored HIGH/CRITICAL.
- gateways/litellm: bumped 1.83.3 → 1.89.1 with the same dep force-patches (the core remediation had skipped the gateway flavor) + uv 0.5.14 → 0.11.21.
- agent-base-node: pinned npm@latest → npm@11.17.0 (rule 9; verified it bundles the patched minimatch/glob/tar tree).
- release-images.yml: combos + compose now run on a partial build (success OR failure) so one failed leaf among 124 can't skip every combo + compose bundle; the per-item loop / per-benchmark matrix isolates a missing parent. release-gate stays strict (:latest promotes only on a clean build).
- .trivyignore: corrected the inaccurate "apk upgrade in every base" header.
- .secrets.baseline: restored to main's correct state (the PR had regenerated it against the pre-#171 layout, pointing sk-proxy at services.yaml instead of runner.yaml); only the legitimate release-images.yml line-shift remains.
- docs/podman-on-apple-silicon: dropped the stale frontiermath ref from the current gated list (the dated changelog row is kept as history).
Signed-off-by: Elron Bandel <elron.bandel@ibm.com>
standalone: the combos job now bakes eval-standalone alongside eval for every combo, gated by a new include_standalone input (default true, so a tag release publishes them). eval-standalone is the single-container bundle — lean base + in-process gateway/otelcol/process-compose, the `--mode container` / laptop artifact; bake builds eval once and layers standalone onto it via the eval-base in-graph context (validated via `bake --print`). Name suffix `-standalone`, same :tag as the lean base (principle 9). size report: the report job now emits a durable, machine-readable size-report.json + size-report.csv (per-leaf kind / name / image / build_seconds / size_bytes, amd64 compressed) and uploads them as the fleet-size-report artifact — the input for the AUDIT.md size lane + the README agent/benchmark size tables. Measured once, reused for the existing step-summary table. (.secrets.baseline: line-number resync for the HF_TOKEN finding the workflow edits shifted — no new secret.) Signed-off-by: Elron Bandel <elron.bandel@ibm.com>
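To make the CSV half of the size report concrete, here is a minimal pure-bash sketch of one row in the column order listed above (kind / name / image / build_seconds / size_bytes). It is an illustration only: the function names are hypothetical, and how the report job itself emits the file is not shown here.

```shell
#!/usr/bin/env bash
# Quote one CSV field: wrap it in double quotes and double any embedded quote.
csv_field() {
  local s="$1" q='"'
  s="${s//$q/$q$q}"
  printf '"%s"' "$s"
}

# Emit one row: text columns are quoted, numeric columns are left bare.
# Column order follows the commit message; the helper names are hypothetical.
csv_row() {
  local kind="$1" name="$2" image="$3" build_seconds="$4" size_bytes="$5"
  printf '%s,%s,%s,%s,%s\n' \
    "$(csv_field "$kind")" "$(csv_field "$name")" "$(csv_field "$image")" \
    "$build_seconds" "$size_bytes"
}

printf 'kind,name,image,build_seconds,size_bytes\n'
csv_row leaf agent-example ghcr.io/example/agent-example 42 123456
```

Quoting every text field is what keeps a name or image reference containing a comma from splitting into extra columns.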
The fleet now publishes amd64 + arm64 manifest lists. Native-per-arch, NOT QEMU: pyarrow segfaults under emulation and 41 of 103 benchmarks still use it, so every heavy build runs on its own metal:
- bases / build (leaves) / per-task each gain an `arch: [amd64, arm64]` matrix and run on the native runner (arm64 → ubuntu-24.04-arm, overridable via the FLEET_RUNNER_ARM repo Variable). Each pushes :TAG-<arch>; the frozen-base context + per-leaf cache are per-arch so the two arches never collide.
- swe-bench (kind=bake) passes EVAL_BASE_ARCH per arch (amd64→x86_64, arm64→arm64) so it FROMs Epoch's matching per-task base.
- a new `merge` job stitches :TAG-<arch> into the :TAG manifest list for every base / leaf / per-task image (imagetools create); an image with no per-arch tag (a failed arch, or no arm64 upstream) is skipped, isolating the failure.
- combos build multi-arch directly with buildx --platform amd64,arm64 (thin overlays are QEMU-safe; setup-qemu added) FROM the merged manifest-list parents.
- compose / release-gate / report consume the :TAG manifest lists.
Matrix guards: the leaf matrix now fans out ×2 arch, so the leaf cap drops to 128 (×2 ≤ 256) and per-task shards ×2 must fit 256.
Validated locally: the per-arch TAG flows into both the image tag and the frozen-base context (bake --print); otel/runtime-bundle/duckdb build native arm64; the combo Dockerfiles are pure layering (QEMU-safe). The arm64 CI path itself is verified only on the release run (no local arm64 runner; Actions is billing-locked).
Signed-off-by: Elron Bandel <elron.bandel@ibm.com>
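The per-arch tag flow and the matrix guard described in this commit can be sketched in a few lines of bash. This is a sketch under assumptions: the function names and the `ghcr.io/example/...` image are hypothetical, and `merge_cmd` only prints the `docker buildx imagetools create` invocation rather than running it, so the block works without Docker.

```shell
#!/usr/bin/env bash
# Map a Docker arch to the EVAL_BASE_ARCH value the commit gives
# (amd64 -> x86_64, arm64 -> arm64).
eval_base_arch() {
  case "$1" in
    amd64) echo x86_64 ;;
    arm64) echo arm64 ;;
    *) echo "unsupported arch: $1" >&2; return 1 ;;
  esac
}

# Print the command that would stitch IMAGE:TAG-<arch> into IMAGE:TAG.
merge_cmd() {
  local image="$1" tag="$2" srcs="" arch
  shift 2
  for arch in "$@"; do
    srcs="${srcs} ${image}:${tag}-${arch}"
  done
  echo "docker buildx imagetools create -t ${image}:${tag}${srcs}"
}

# Matrix guard: leaves x arches must fit the job cap (256 in the commit).
fits_matrix() {
  local leaves="$1" arches="$2" cap="${3:-256}"
  [ $(( leaves * arches )) -le "$cap" ]
}

merge_cmd ghcr.io/example/leaf v0.1.0 amd64 arm64
fits_matrix 128 2 && echo "128 leaves x 2 arches fits"
```

Passing only the arches that actually produced a tag to `merge_cmd` is one way to get the "skip what is missing" behaviour: an image built for a single arch would simply yield a one-entry manifest list.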
An independent simplicity review (orchestration + shell lenses) confirmed the
architecture is sound and not over-engineered; these are the polish wins it
surfaced. No behavior change:
- runs-on: replace the 3× duplicated nested arch ternary with a matrix `include`
that maps arch→runner, so `runs-on: ${{ matrix.runner }}` is a plain lookup.
- enumerate: `jq -R . | jq -sc` → a single `jq -Rsc 'split("\n")|…'` (3×).
- per-task / combos loops: 3 per-iteration `jq -r` calls → one `[...]|@tsv` + read.
- merge / report: `IFS="$(printf '\t')"` and `sort -t"$(printf '\t')"` → `$'\t'`.
- report CSV: hand-rolled awk → `jq … |@csv` from the JSON we already build
(single source; @csv quotes fields, closing a latent comma gap).
- report: drop a redundant `size_of` double-guard.
Declined (would add complexity, not remove it): a composite action for the
retry()/checkout/login preamble (cross-file indirection; the backoffs differ on
purpose), and deduping the 2-line target→dir map via an enumerate output.
Signed-off-by: Elron Bandel <elron.bandel@ibm.com>
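The `@tsv` + `read` and `$'\t'` items above boil down to one pattern: emit each record once as a tab-separated line, then split it with a single `read`. A minimal sketch, with hypothetical sample rows standing in for the jq output:

```shell
#!/usr/bin/env bash
# Rows shaped like jq's @tsv output: one tab-separated record per line.
# The values are hypothetical sample data, not taken from the workflow.
rows=$'leaf\tagent-a\tghcr.io/example/agent-a\nleaf\tbench-b\tghcr.io/example/bench-b'

# Split each record with IFS=$'\t' instead of IFS="$(printf '\t')".
parse_rows() {
  local kind name image
  while IFS=$'\t' read -r kind name image; do
    printf '%s|%s|%s\n' "$kind" "$name" "$image"
  done
}

printf '%s\n' "$rows" | parse_rows
```

One caveat worth knowing: tab counts as IFS whitespace in bash, so consecutive tabs collapse and an empty middle field would shift the later columns. The pattern is safest when every field is guaranteed to be non-empty.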
elronbandel added a commit that referenced this pull request on Jun 18, 2026
…nch) (#194)
terminal-bench + skills-bench build.sh forced `--platform linux/amd64`, so the per-task arm64 matrix job built amd64 content and tagged it :TAG-arm64, which is broken multi-arch. Verified the upstreams ARE arm64-capable: all 89 terminal-bench task environments and ~82/87 skills-bench are FROM standard multi-arch bases (ubuntu/python/debian), and a terminal-bench task builds clean on native arm64.
- build.sh (both): drop the --platform pin → builds native on whichever runner (amd64 or arm64) the per-task job lands on.
- release-images.yml (kind=script): push the per-task image only if its built architecture matches the runner's. ~5 skills-bench tasks are upstream-amd64-pinned (2 `FROM --platform=linux/amd64`, plus bugswarm/oss-fuzz amd64-only bases); on the arm64 runner those build amd64, so the check skips the push and the merge keeps them single-arch (amd64) rather than mislabeling.
swe-lancer/swe-bench-pro carry the same pin but are out of the release scope (dropped in #189); left for when they're re-added.
Signed-off-by: Elron Bandel <elron.bandel@ibm.com>
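The "push only if the built architecture matches the runner's" check can be sketched as below. The function names are hypothetical and the workflow's real check may differ. Both values are passed in as arguments so the sketch runs without Docker; in a job they could come from `docker image inspect --format '{{.Architecture}}' "$image"` and `uname -m`.

```shell
#!/usr/bin/env bash
# Normalize a `uname -m` machine name to Docker's architecture naming.
docker_arch() {
  case "$1" in
    x86_64|amd64) echo amd64 ;;
    aarch64|arm64) echo arm64 ;;
    *) echo "$1" ;;
  esac
}

# Succeed only when the image's architecture matches the runner's.
should_push() {
  local built_arch="$1" runner_machine="$2"
  [ "$built_arch" = "$(docker_arch "$runner_machine")" ]
}

# An upstream-amd64-pinned task built on an arm64 runner: skip the push,
# so the image is never tagged :TAG-arm64 with amd64 content.
if should_push amd64 aarch64; then echo "push"; else echo "skip"; fi
# A native build on the same runner: push as usual.
if should_push arm64 aarch64; then echo "push"; else echo "skip"; fi
```

Skipping rather than failing keeps the job green for the amd64-only tasks, which is what lets the later merge step publish them as single-arch images.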
This was referenced Jun 18, 2026
elronbandel added a commit that referenced this pull request on Jun 18, 2026
… dropped (#198) The dry_run smoke caught this: the `report` job's size-report jq crashed (`jq: error … Expected JSON value (while parsing '')`) on a leaf whose `:latest` size resolved to empty. #189's simplicity pass dropped `sz=${sz:-0}` as "redundant" — but it guards `size_of`'s empty-output path (the trailing `// 0` only fires when jq gets input; on empty input jq emits nothing, so `sz` stays ""). Restored it, and made the JSON build defensive (`tonumber? // 0`) so a missing/odd size can never crash the report again. The publish path was unaffected — every --print job passed; only the end-of-run report died. Signed-off-by: Elron Bandel <elron.bandel@ibm.com>
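The failure mode described here is easy to reproduce without jq: a helper that prints nothing leaves the variable empty, and only a shell-level default catches it. A minimal sketch, where `size_of` is a hypothetical stub standing in for the report job's helper of the same name:

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for size_of: prints a size in bytes, or prints
# nothing at all (exit status still 0) when the size cannot be resolved.
size_of() {
  case "$1" in
    known) echo 12345 ;;
    *) : ;;
  esac
}

# Normalize whatever size_of printed into a plain non-negative integer.
safe_size() {
  local sz
  sz="$(size_of "$1")"
  sz="${sz:-0}"            # the restored guard: empty output becomes 0
  case "$sz" in
    *[!0-9]*) sz=0 ;;      # odd, non-numeric values also fall back to 0
  esac
  echo "$sz"
}

safe_size known      # prints 12345
safe_size missing    # prints 0
```

The second `case` plays the same defensive role in shell that the commit's `tonumber? // 0` plays inside jq: whatever arrives, the report only ever sees a number.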
elronbandel added a commit that referenced this pull request on Jun 18, 2026
…efault Write the guide as current state — no #189 mentions, no 'since X' framing. Rosetta removed from TL;DR: normal builds and evals are native arm64 and need no Rosetta. Rosetta stays in §1 as optional, scoped to amd64-only images and test suites (DOCKER_BUILDKIT=0 path).
This was referenced Jun 21, 2026
elronbandel added a commit that referenced this pull request on Jun 21, 2026
…uilds
`docker buildx bake` builds native arm64 by default on Apple Silicon. The guide still described QEMU avoidance as the default path, contradicting current behavior.
- Reframe default: `docker buildx bake` is the normal local build (arm64, no QEMU)
- Demote `DOCKER_BUILDKIT=0` / classic-build to edge-case: test suites that pin linux/amd64 for testcontainers compatibility, and genuinely amd64-only images
- Demote Rosetta from "REQUIRED for everything" to "needed for amd64-only images"
- Add §5a historical note explaining the pre-#189 QEMU/Rosetta limitation
- Add Docker Hub 429 rate-limit note with `docker login` / ECR mirror fix
- Update troubleshooting table accordingly
Closes #196
elronbandel added a commit that referenced this pull request on Jun 21, 2026
…uilds
`docker buildx bake` builds native arm64 by default on Apple Silicon. The guide still described QEMU avoidance as the default path, contradicting current behavior.
- Reframe default: `docker buildx bake` is the normal local build (arm64, no QEMU)
- Demote `DOCKER_BUILDKIT=0` / classic-build to edge-case: test suites that pin linux/amd64 for testcontainers compatibility, and genuinely amd64-only images
- Demote Rosetta from "REQUIRED for everything" to "needed for amd64-only images"
- Add §5a historical note explaining the pre-#189 QEMU/Rosetta limitation
- Add Docker Hub 429 rate-limit note with `docker login` / ECR mirror fix
- Update troubleshooting table accordingly
Closes #196
Signed-off-by: Elron Bandel <elron.bandel@ibm.com>
elronbandel added a commit that referenced this pull request on Jun 21, 2026
…efault Write the guide as current state — no #189 mentions, no 'since X' framing. Rosetta removed from TL;DR: normal builds and evals are native arm64 and need no Rosetta. Rosetta stays in §1 as optional, scoped to amd64-only images and test suites (DOCKER_BUILDKIT=0 path). Signed-off-by: Elron Bandel <elron.bandel@ibm.com>
elronbandel added a commit that referenced this pull request on Jun 21, 2026
…189 (#199)
* docs(podman): update Apple Silicon guide for post-#189 native arm64 builds
  `docker buildx bake` builds native arm64 by default on Apple Silicon. The guide still described QEMU avoidance as the default path, contradicting current behavior.
  - Reframe default: `docker buildx bake` is the normal local build (arm64, no QEMU)
  - Demote `DOCKER_BUILDKIT=0` / classic-build to edge-case: test suites that pin linux/amd64 for testcontainers compatibility, and genuinely amd64-only images
  - Demote Rosetta from "REQUIRED for everything" to "needed for amd64-only images"
  - Add §5a historical note explaining the pre-#189 QEMU/Rosetta limitation
  - Add Docker Hub 429 rate-limit note with `docker login` / ECR mirror fix
  - Update troubleshooting table accordingly
  Closes #196
  Signed-off-by: Elron Bandel <elron.bandel@ibm.com>
* docs(podman): DOCKER_BUILDKIT=0 is still required for test suites (not 'may')
  testcontainers uses .with_platform("linux/amd64") in agents/gateways tests, so the test harness always builds linux/amd64 images (QEMU path, pyarrow segfault), making DOCKER_BUILDKIT=0 a hard requirement, not a maybe.
  Signed-off-by: Elron Bandel <elron.bandel@ibm.com>
* docs(podman): remove historical references, Rosetta is optional not default
  Write the guide as current state: no #189 mentions, no 'since X' framing. Rosetta removed from TL;DR: normal builds and evals are native arm64 and need no Rosetta. Rosetta stays in §1 as optional, scoped to amd64-only images and test suites (DOCKER_BUILDKIT=0 path).
  Signed-off-by: Elron Bandel <elron.bandel@ibm.com>
* tests: remove linux/amd64 platform forcing; fleet is now multi-arch
  All testcontainers `.with_platform("linux/amd64")` calls and the `*.platform=linux/amd64` bake overrides in the test harness are stale after the fleet went multi-arch. Tests now build and run native arm64 on Apple Silicon: no DOCKER_BUILDKIT=0, no QEMU, no pyarrow segfaults. Remove §5a and the stale DOCKER_BUILDKIT=0 requirement from the podman-on-apple-silicon guide accordingly.
  Closes #196
  Signed-off-by: Elron Bandel <elron.bandel@ibm.com>
* cli: remove --platform linux/amd64 from oracle runner
  Benchmark images are now multi-arch after the fleet migration. The hardcoded platform pin is stale; run natively on the host arch.
  Signed-off-by: Elron Bandel <elron.bandel@ibm.com>
* cli,tests: make platform configurable, not forced
  oracle: add --platform flag (omit = native)
  tests: read TEST_PLATFORM env var for bake and classic-build overrides
  Signed-off-by: Elron Bandel <elron.bandel@ibm.com>
* tests: simplify TEST_PLATFORM override construction in bake_targets
  Signed-off-by: Elron Bandel <elron.bandel@ibm.com>
* tests: trim over-documentation in common/mod.rs
  Signed-off-by: Elron Bandel <elron.bandel@ibm.com>
* scripts: bake-plan.sh builds multi-arch (amd64 + arm64)
  Stale since the fleet went multi-arch.
  Signed-off-by: Elron Bandel <elron.bandel@ibm.com>
* fix(replay): add --no-pull to build eval to skip arm64 registry miss
  combination.Dockerfile uses ARG-based FROM instructions (FROM ${AGENT_IMAGE}, FROM ${BENCHMARK_IMAGE}). Each `cargo run -- build` call is a separate BuildKit session, so the eval session checks the remote registry manifest for those images. On arm64 the registry has only linux/amd64 entries, so the check fails.
  Add --no-pull to BuildTarget::Eval in the CLI. When set, it pushes eval.pull=false as a bake override, telling BuildKit to use the content store (populated by the preceding `build bench` and `build agent` calls) instead of fetching the remote manifest.
  The replay test's bake path calls the CLI end-to-end for all three build steps and passes --no-pull to the eval call, keeping it a true CLI black-box (RULES.md R-2). The classic/podman path is unchanged.
  Note: the principled fix is named contexts in the bake file so the dependency is explicit in the build graph (tracked as a follow-up).
  Closes #196
  Signed-off-by: Elron Bandel <elron.bandel@ibm.com>
* containers: remove stale --platform=linux/amd64 pin from benchmark-base-python-slim
  Signed-off-by: Elron Bandel <elron.bandel@ibm.com>
* tests: add --no-pull to agents smoke eval build (arm64 fix)
  Same arm64 registry-miss that replay had: bench and agent images are just built locally, so the eval bake must use the BuildKit content store rather than attempting a registry manifest check that returns only amd64.
  Signed-off-by: Elron Bandel <elron.bandel@ibm.com>
* fix(eval): use eval-local target for --no-pull builds to avoid arm64 registry miss
  Replace the ineffective `eval.pull=false` override with a proper `eval-local` bake target that wires bench+agent as named contexts. With docker-container driver (Mac), `--load` does not put images in BuildKit's content store, so `pull=false` still triggers registry manifest checks when images are absent, failing on arm64 (no arm64 manifest for most benchmarks in the registry).
  Named contexts bypass the pull path entirely: BuildKit treats them as build graph nodes, not images to fetch. The `eval-local` target inherits from `eval` and adds `${BENCHMARK_IMAGE}` → `target:benchmark-${EVAL_BENCHMARK}` and `${AGENT_IMAGE}` → `target:agent-${EVAL_AGENT}` contexts. To resolve the HCL context keys (which use the HCL variable, not the build arg), BENCHMARK_IMAGE and AGENT_IMAGE are now passed as environment variables to bake, not only as `--set *.args.*` overrides.
  Per-task builds (task_id.is_some()) keep using the plain `eval` target because task-specific BENCHMARK_IMAGE URLs cannot map to a named bake target.
  Signed-off-by: Elron Bandel <elron@exgentic.com>
---------
Signed-off-by: Elron Bandel <elron.bandel@ibm.com>
Signed-off-by: Elron Bandel <elron@exgentic.com>
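The "configurable, not forced" platform behaviour can be sketched in bash: the override is added only when `TEST_PLATFORM` is set, and an unset variable means a native build. This is an illustration only. The actual harness is Rust (`common/mod.rs`), the function name here is hypothetical, and the command is printed rather than executed.

```shell
#!/usr/bin/env bash
# Print the bake command for a target, adding the `*.platform=` override
# only when TEST_PLATFORM is set (unset or empty means native).
bake_cmd() {
  local target="$1"
  local -a cmd=(docker buildx bake)
  if [ -n "${TEST_PLATFORM:-}" ]; then
    cmd+=(--set "*.platform=${TEST_PLATFORM}")
  fi
  cmd+=("$target")
  printf '%s\n' "${cmd[*]}"
}

bake_cmd eval
TEST_PLATFORM=linux/amd64 bake_cmd eval
```

Building the argument list in an array keeps the optional flag out of the command entirely when it is not wanted, instead of passing an empty `--set` value.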
What this does
Adds one workflow that releases the whole eval-container fleet to GHCR (multi-arch), plus the security fixes that make it pass the CVE gate.
Fleet release (`release-images.yml`): folds the old per-task publisher in and adds the combo + per-task publish. A `vX.Y.Z` tag (or a manual run) builds, in one pipeline:
- … `-standalone` bundle for each combo (lean base + in-process gateway/otelcol/process-compose; the laptop / `--mode container` artifact);
- … `:tag → :latest` promotion + Helm chart;
- … `fleet-size-report` artifact: per-agent/-benchmark size + build time as JSON + CSV) for the audits + README tables.
Multi-arch (amd64 + arm64): the fleet publishes multi-arch manifest lists via native-per-arch builds (NOT QEMU: 41 of 103 benchmarks still use pyarrow, which segfaults under emulation). Bases / leaves / per-task each carry an `arch: [amd64, arm64]` matrix and build on their own runner (arm64 → `ubuntu-24.04-arm`, override with the `FLEET_RUNNER_ARM` repo Variable), each pushing `:TAG-<arch>`; a new `merge` job stitches them into the `:TAG` manifest list (`imagetools create`), skipping any image with no per-arch tag so a single-arch failure stays isolated. Combos build multi-arch directly with `buildx --platform amd64,arm64` (thin overlays are QEMU-safe). swe-bench FROMs Epoch's per-arch base via `EVAL_BASE_ARCH`. The Dockerfiles + the per-arch tag flow are validated locally; the arm64 CI path is verified on the release run.
Base-image CVE fixes: the fleet passes the HIGH/CRITICAL trivy gate:
- … `<47` cap);
- … `.trivyignore`.
HF dataset auth: build-time Hugging Face downloads now authenticate (Bearer token), so cold parallel builds aren't throttled to anonymous rate limits.
Cleanup: removes `frontiermath` (a private dataset that can't be built from a clean checkout) and refreshes the counts/docs/tests.
Pre-release review
An independent multi-agent review (CVEs, workflow logic, multi-arch/build, regression) ran over the branch; every confirmed finding is folded in and validated locally: the grpc CRITICAL is now genuinely fixed (it had been a wrong-justification deferral); otel/runtime-bundle/duckdb no longer break on the classic build path (BuildKit-only `$TARGETARCH` was empty there → native build); combos + compose survive a partial leaf build; the gateway litellm was bumped; `npm@latest` was pinned; the `.secrets.baseline` was corrected (it had been regenerated against the pre-#171 layout).
Scope note
swe-bench-pro + swe-lancer per-task publishing is intentionally dropped vs the old `publish-per-task.yml` (which published skills-bench, terminal-bench, swe-bench-pro, swe-lancer). The unified workflow publishes terminal-bench + skills-bench + swe-bench per-task, matching the release scope (terminal-bench + swe-bench).
Validation
- … `DOCKER_BUILDKIT=0` path, and amd64; the per-arch `TAG` flows into both the image tag and the frozen-base context (`bake --print`).
- … `eval` + `eval-standalone`) resolves; the size-report JSON/CSV generation is validated.
Releasing
After merge, push a `v0.1.0` tag (matches the current Cargo/Chart version, so the version guard passes); it fires this workflow + the CLI release.
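A version guard of the kind mentioned in the release note can be sketched as a tag-versus-manifest comparison. The workflow's real guard is not shown in this PR, so this is an assumption: the function names are hypothetical and the `sed` pattern assumes a plain `version = "X.Y.Z"` line in Cargo.toml-style text.

```shell
#!/usr/bin/env bash
# Extract the first `version = "..."` value from Cargo.toml-style text on stdin.
cargo_version() {
  sed -n 's/^version *= *"\([^"]*\)".*/\1/p' | head -n 1
}

# Succeed only when the tag, minus its leading "v", equals the version.
version_guard() {
  local tag="$1" version="$2"
  [ "${tag#v}" = "$version" ]
}

# Hypothetical manifest text standing in for the real Cargo.toml.
manifest=$'[package]\nname = "example"\nversion = "0.1.0"'
version="$(printf '%s\n' "$manifest" | cargo_version)"

if version_guard v0.1.0 "$version"; then
  echo "tag matches version ${version}"
else
  echo "tag does not match version ${version}" >&2
fi
```

Failing fast on a mismatched tag is cheaper than discovering it after the whole fleet has been built and pushed under the wrong version.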