This release focuses on recipe health scoring, improved deployment validation and snapshot/discovery, and extended software supply chain capabilities for enterprise users.
Highlights
Recipe Structural Health - New pkg/health engine computes per-recipe health signals (chart_pinned, constraints_wellformed, declared_coverage) and rolls them up into a recipe-health matrix. aicr recipe list surfaces structural-health columns (with a --no-health opt-out), a tools/health generator and weekly recipe-health-refresh workflow keep the matrix current, and a lint guard now requires healthCheck.assertFile.
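As a rough illustration of how per-recipe signals could roll up into one matrix row, here is a minimal sketch. It is not the pkg/health implementation: the `signal` type, the `rollup` function, and the pass/fail status strings are hypothetical, and only the signal names come from this release.

```go
package main

import "fmt"

// signal is a hypothetical per-recipe health signal; only the names used
// below (chart_pinned, constraints_wellformed) come from the release notes.
type signal struct {
	name string
	pass bool
}

// rollup returns "pass" when every signal passes; otherwise it returns
// "fail" together with the names of the failing signals.
// The status strings are an assumption made for this sketch.
func rollup(signals []signal) (string, []string) {
	var failing []string
	for _, s := range signals {
		if !s.pass {
			failing = append(failing, s.name)
		}
	}
	if len(failing) == 0 {
		return "pass", nil
	}
	return "fail", failing
}

func main() {
	status, failing := rollup([]signal{
		{"chart_pinned", true},
		{"constraints_wellformed", false},
	})
	fmt.Println(status, failing)
}
```

Returning the failing names alongside the status keeps a rolled-up row explainable without recomputing the individual signals.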
Improved Deployment Validation - The chainsaw deployment-phase runner is now an in-process executor rather than a shelled-out binary. aicr validate runs all phases by default with a --fail-fast opt-in, fails closed on evaluator errors, and is nil-safe across health checks.
Snapshot/Discovery - The collector now discovers GPU SKUs without nvidia-smi, which removes the CUDA base image dependency, and matches SKUs on token boundaries instead of substrings.
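To show why token-boundary matching matters, the sketch below contrasts it with a plain substring check. It is a simplified illustration, not the collector's code: `tokenize`, `matchSKU`, and the sample device name are assumptions made for the example.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// tokenize splits a device name into upper-cased alphanumeric tokens,
// treating every other character (spaces, dashes, slashes) as a separator.
func tokenize(name string) []string {
	return strings.FieldsFunc(strings.ToUpper(name), func(r rune) bool {
		return !unicode.IsLetter(r) && !unicode.IsDigit(r)
	})
}

// matchSKU reports whether sku appears as a whole token in name,
// ignoring case. A sku that is only a prefix of a token does not match.
func matchSKU(name, sku string) bool {
	want := strings.ToUpper(sku)
	for _, tok := range tokenize(name) {
		if tok == want {
			return true
		}
	}
	return false
}

func main() {
	name := "NVIDIA A100-SXM4-80GB" // sample device name, assumed for the example
	fmt.Println(strings.Contains(name, "A10")) // substring check: true, a false positive
	fmt.Println(matchSKU(name, "A10"))         // token check: false
	fmt.Println(matchSKU(name, "A100"))        // token check: true
}
```

With a substring check, a shorter SKU such as A10 is found inside A100; comparing whole tokens avoids that collision.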
Closed Supply Chain - Signing and verification now work end-to-end in air-gapped and enterprise environments. aicr bundle supports KMS-backed signing (--signing-key) and private Sigstore deployments (--fulcio-url, --rekor-url); aicr verify --key validates bundles against a KMS or public key; and aicr evidence publish signs recipe evidence off-network. The recipe catalog itself now ships signed provenance for the V1 closed supply chain, and keyless signing warns before publishing identity to the public transparency log.
New Recipes & Overlays
- A100 training Kubeflow overlay chains for EKS, AKS, GKE COS, and OKE
- GB300 concrete EKS service-bound overlays (#1319; reverted in #1328 over unresolved issues)
- OKE GB200 and AKS H100 Dynamo performance checks
CLI & Bundling
- aicr recipe list subcommand for catalog enumeration
- Gatekeeper added as an optional component
Inference Performance & Validation
- Inference-performance validation enhanced and tuned; gated on all worker services Ready
- nccl-all-reduce-bw gates wired for EKS + H200; GKE NCCL node selector made dynamic
- Bounded absent-resource retries in deployment-phase health checks
Thanks to @atif1996, @cdesiniotis, @dims, @haarchri, @JaydipGabani, @lalitadithya, @lockwobr, @njhensley, @pdmack, @pedjak, @rsd-darshan, @sttts, @xdu31, @yuanchen8911, and @mchmarny.
Changelog
New Features
- cfb0cb0: feat(agentgateway): scope inference-gateway LB to allowed source ranges (#1138) (@yuanchen8911)
- 3b8b1f3: feat(bundle): KMS-backed signing via --signing-key (#407) (#1205) (@lockwobr)
- 5316d3c: feat(bundle): private Sigstore via --fulcio-url and --rekor-url (#1158) (@lockwobr)
- b847f96: feat(bundler): retry sign.Bundle on transient Sigstore failures (#1251) (@mchmarny)
- 3c1f525: feat(bundler): warn on open agentgateway inference-gateway exposure (#1163) (@yuanchen8911)
- cd30fe5: feat(ci): add weekly recipe-health-refresh workflow (#1320) (@njhensley)
- 5f8647d: feat(cli): add --no-health opt-out to recipe list (#1314) (@njhensley)
- f0490fb: feat(cli): add --set-json/--set-file for list and object bundle overrides (#1162) (@yuanchen8911)
- b401339: feat(cli): add structural-health columns to recipe list (#1302) (@njhensley)
- a47a53f: feat(cli): warn before keyless signing publishes identity to public log (#1300) (@njhensley)
- 97934ad: feat(collector): driver-free GPU SKU discovery; remove nvidia-smi + CUDA base (#1352) (@mchmarny)
- eb8728d: feat(coverage): generated CUJ/CLI coverage matrix (RQ3) (#1316) (@mchmarny)
- 1674ea4: feat(evidence): add aicr evidence publish for off-network signing (#1140) (@njhensley)
- e1b0160: feat(health): add tools/health generator and recipe-health matrix (#1304) (@njhensley)
- 8d0da78: feat(health): chart_pinned signal + declared_coverage descriptor (#1293) (@mchmarny)
- fa92b43: feat(health): constraints_wellformed signal (parse-only, hermetic) (#1301) (@njhensley)
- 10b08f1: feat(health): pkg/health core Compute loop, resolves signal, rollup (#1291) (@mchmarny)
- 3e2f823: feat(recipe): add AKS H100 Dynamo perf check (#1232) (@yuanchen8911)
- 8f8bc56: feat(recipe): add OKE GB200 perf check (#1233) (@yuanchen8911)
- ec95e20: feat(recipe): add aicr recipe list subcommand for catalog enumeration (#1208) (@rsd-darshan)
- 57fbed0: feat(recipe): hydrate healthCheck.assertFile + suppression sentinel (#1231) (@mchmarny)
- ae6819c: feat(recipe): lint guard requiring healthCheck.assertFile + allowlist (#1244) (@mchmarny)
- 0bf2267: feat(recipe): signed catalog provenance for V1 closed supply chain (#1216) (@mchmarny)
- 463d6a1: feat(recipes): add A100 AKS training Kubeflow overlay chain (#1295) (@yuanchen8911)
- d8d3070: feat(recipes): add A100 EKS training Kubeflow overlay chain (#1305) (@yuanchen8911)
- fd64dd7: feat(recipes): add A100 GKE COS training Kubeflow overlay chain (#1306) (@yuanchen8911)
- 6eb85ac: feat(recipes): add A100 OKE training Kubeflow overlay chain (#1294) (@yuanchen8911)
- 4b817ce: feat(recipes): add concrete GB300 EKS service-bound overlays (#1319) (@yuanchen8911)
- cad0142: feat(recipes): backfill chainsaw health checks for 5 missing components (#1243) (@mchmarny)
- 81daab3: feat(recipes): deepen 21 chainsaw health checks; close epic #660 (#1245) (@mchmarny)
- 7bb7059: feat(recipes): migrate nvidia-dra-driver-gpu to registry.k8s.io v0.4.0 (#1285) (@mchmarny)
- e848e1f: feat(tests): KMS bundle-signing e2e against MiniStack over TLS (#1298) (@lockwobr)
- 81c3fb0: feat(tests): private-Sigstore bundle-signing e2e via Helm scaffold (#1321) (@lockwobr)
- 25a6cdd: feat(validator): ship chainsaw binary; activate deployment-phase runner (#1235) (@mchmarny)
- 4d5ac90: feat(validators): enhance inference performance validation (#1133) (@yuanchen8911)
- f6cb3cd: feat(validators): replace chainsaw binary with in-process executor (#1252) (@mchmarny)
- e3460fc: feat(verify): KMS/public-key bundle verification (aicr verify --key) (#1238) (@lockwobr)
Bug Fixes
- e3aa6b4: fix(bundler): disable kataSandboxDevicePlugin in gpu-operator values (#1343) (@atif1996)
- 5f51fe8: fix(ci): actually tear down AWS UAT cluster (destroy → apply) (#1213) (@njhensley)
- 484c61a: fix(ci): always upload recipe-evidence report so comment gate works (#1292) (@njhensley)
- eddc075: fix(ci): build patched nvkind with --config-source=file (#1237) (#1258) (@mchmarny)
- 1ef5bdb: fix(ci): resolve fork PRs for recipe-evidence sticky comment (#1297) (@njhensley)
- 7068779: fix(ci): shard Tier 3 KWOK matrix to stay under 256-config cap (#1173) (@njhensley)
- ff8a756: fix(ci): stamp publish with resolved tag instead of releases/latest API (#1136) (@pdmack)
- 85daf65: fix(ci): suppress chainsaw CVEs + apply VEX on release scan (#1366) (@mchmarny)
- 4e8f778: fix(ci): unblock build-attested workflow on missing HOMEBREW_DEPLOY_KEY (#1296) (@lockwobr)
- 4e86524: fix(docs): catch bare tags in MDX safety check (#1170) (@pedjak)
- c606d17: fix(docs): cover contributor docs in MDX check; catch autolinks (#1151) (@mchmarny)
- 915ed66: fix(docs): escape bare < for Fern MDX + harden MDX checker (#1367) (@mchmarny)
- 4b2864a: fix(docs): keep --set-json code span on single line for MDX check (@mchmarny)
- de2d9dd: fix(docs): use MDX comments in recipe-health.md so Fern publish parses (#1365) (@mchmarny)
- 833a2e5: fix(evidence): emit recipe-evidence pointer.yaml at 2-space indent (#1165) (@yuanchen8911)
- af563b2: fix(evidence): pull by digest and auto-tag pushes instead of :v1 (#1168) (@njhensley)
- 410f5a3: fix(fingerprint): match GPU SKUs on token boundaries not substrings (#1350) (@mchmarny)
- 6e95906: fix(health): exempt manifest-only Helm components from chart_pinned (#1303) (@njhensley)
- e32375c: fix(recipes): pin nvidia-dra-driver-gpu to 0.4.1-rc.1 for strict-YAML fix (#1341) (@yuanchen8911)
- 032b707: fix(scan): match VEX PURL to grype's image PURL + surface CVE IDs (@mchmarny)
- 909629b: fix(security): VEX-suppress CVE-2026-45447 (OpenSSL UAF) on aicr image (@mchmarny)
- f93e36f: fix(snapshotter): auto-target GPU nodes and warn on placement mismatch (#1199) (@njhensley)
- 9dd3726: fix(toolkit): remove ACCEPT_NVIDIA_VISIBLE_DEVICES overrides from all recipes (#1353) (@atif1996)
- 767f17d: fix(validator): address PR #1231 review on chainsaw hydration runtime (#1234) (@mchmarny)
- 225ff43: fix(validator): bound absent-resource retries in health checks of deployment validation (#1324) (@yuanchen8911)
- fa14e51: fix(validator): fail fast on eval errors; nil-safe health checks (#1261) (@yuanchen8911)
- 53d0a7d: fix(validator): gate inference-perf on all workers Ready (#1182) (@yuanchen8911)
- e8defec: fix(validator): honor tag override for main-container pull policy (#1180) (@yuanchen8911)
- ea5c37e: fix(validator): inference performance validation requires all worker services Ready (#1187) (@yuanchen8911)
- 054769f: fix(validator): run all phases by default; add --fail-fast opt-in (#1198) (@njhensley)
- d68eada: fix(validator): stop unenforced nccl-all-reduce-bw gates; wire eks+h200 (#1338) (@yuanchen8911)
- 19dece7: fix(validators): bump dynamo runtime image 0.9.0 -> 1.0.2 (fixes #1192) (#1193) (@yuanchen8911)
- 90b1292: fix(validators): make GKE NCCL node selector dynamic (#1262) (@xdu31)
- cf2a870: fix(validators): reorder validation phases and fix stale paths and documentation issues (#1201) (@yuanchen8911)
- 2d5259b: fix(validators): update and tune inference performance validation (#1196) (@yuanchen8911)
- cc78e97: fix(vex): suppress CVE-2026-7210 (cpython xml hash-flooding) for aiperf-bench (@mchmarny)
- 090eda2: fix: drop removed dynamo-operator sshKeygen override (#1333) (@mchmarny)
Other Tasks
- 1066d45: Add opt-in component readiness gates (--readiness-hooks) (#1110) (@atif1996)
- 00299ba: chore(deps): Update dependency anchore/grype to v0.114.0 (#1241) (@github-actions[bot])
- 8cf2b51: chore(deps): Update dependency go to v1.26.4 (#1204) (@github-actions[bot])
- 11073a6: chore(deps): Update kindest/node Docker tag to v1.36.1 (#1210) (@github-actions[bot])
- e22d4c0: chore(deps): Update module golang.org/x/sync to v0.21.0 (#1242) (@github-actions[bot])
- 40c3331: chore(deps): Update supply-chain (#1211) (@github-actions[bot])
- c2c85dc: chore(deps): Update supply-chain (#1358) (@github-actions[bot])
- 7132559: chore(deps): Update testing-tools (#1212) (@github-actions[bot])
- fa15962: chore(deps): Update testing-tools (#1359) (@github-actions[bot])
- f65d7b0: chore(deps): Update ubuntu:24.04 Docker digest to 786a8b5 (#1169) (@github-actions[bot])
- d54e315: chore(deps): bump GPU Operator chart to v26.3.2 (#1190) (@yuanchen8911)
- d4342ae: chore(deps): bump version of nvkind to 24c190e (#1260) (@cdesiniotis)
- 0a6b21a: chore(fern): register v0.14.0 (#1131) (@github-actions[bot])
- bbd0849: chore(recipes): update nodewright to v0.17.0 (OCI chart) (#1348) (@lockwobr)
- b3b9402: chore(scan): add VEX debug steps to aiperf-bench scan job (@mchmarny)
- 3f7584d: chore(scan): per-CVE OpenVEX evidence for aiperf-bench suppressions (#1137) (@mchmarny)
- 5e89201: chore(scan): scope VEX-suppressed count to vex namespace + add skill (@mchmarny)
- 7f018db: chore: add license header to slide-deck skeleton.html (#1344) (@yuanchen8911)
- 7e4fdc3: chore: add overview images (@mchmarny)
- 2bff62b: chore: add trust image (@mchmarny)
- f26e2fc: chore: align go.mod go directive with .go-version (1.26.4) (#1206) (@lockwobr)
- c1e658a: chore: dep upgrade (@mchmarny)
- e8e11c2: chore: deps: bump actions/checkout from 6.0.2 to 6.0.3 (#1185) (@dependabot[bot])
- d621b78: chore: deps: bump actions/checkout from 6.0.2 to 6.0.3 (#1202) (@dependabot[bot])
- 2fee33b: chore: deps: bump actions/setup-go from 6.2.0 to 6.4.0 (#1331) (@dependabot[bot])
- 713b5d1: chore: deps: bump aws-actions/configure-aws-credentials from 6.1.3 to 6.2.0 (#1184) (@dependabot[bot])
- ed0f201: chore: deps: bump github.com/sigstore/sigstore-go from 1.1.4 to 1.2.0 (#1183) (@dependabot[bot])
- 8f71165: chore: deps: bump github.com/sigstore/sigstore-go from 1.2.0 to 1.2.1 (#1287) (@dependabot[bot])
- 566ce08: chore: deps: bump github.com/urfave/cli/v3 from 3.9.0 to 3.9.1 (#1307) (@dependabot[bot])
- 31a0f00: chore: deps: bump github.com/urfave/cli/v3 from 3.9.1 to 3.10.0 (#1356) (@dependabot[bot])
- 41982d7: chore: deps: bump github/codeql-action from 4.36.0 to 4.36.1 (#1186) (@dependabot[bot])
- a2cfd68: chore: deps: bump github/codeql-action from 4.36.1 to 4.36.2 (#1203) (@dependabot[bot])
- 213dff9: chore: deps: bump golang.org/x/term from 0.43.0 to 0.44.0 in the golang-x group across 1 directory (#1250) (@dependabot[bot])
- 4cfc814: chore: deps: bump goreleaser/goreleaser-action from 7.0.0 to 7.2.2 (#1330) (@dependabot[bot])
- 73a3c71: chore: deps: bump oras.land/oras-go/v2 from 2.6.0 to 2.6.1 (#1240) (@dependabot[bot])
- c0234b5: chore: deps: bump renovatebot/github-action from 46.1.14 to 46.1.15 (#1209) (@dependabot[bot])
- 1c32395: chore: deps: bump sigstore/cosign-installer from 4.0.0 to 4.1.2 (#1329) (@dependabot[bot])
- d98c5c1: chore: deps: bump the kubernetes group with 3 updates (#1355) (@dependabot[bot])
- 90fced4: chore: fix stale label guidance, add issue-type policy (#1315) (@njhensley)
- 6d2c2bf: chore: overview updates (@mchmarny)
- 3fadaf6: chore: trim agent rules under context limit, add review patterns (#1322) (@mchmarny)
- 1ba27a1: chore: update overview image (@mchmarny)
- fec4f3d: chore: upgrade to latest nvsentinel (#1332) (@lalitadithya)
- 4ddf77f: ci(deps): bump Node 20 actions to Node 24 + refresh Go deps (#1309) (@mchmarny)
- 91c009a: ci(kwok): add flux-git lane with in-cluster gitea (#1290) (@haarchri)
- 8c26c1b: ci(vuln-scan): add notify_slack toggle to skip Slack on manual runs (@mchmarny)
- b99fbe6: ci: add DevTrace PR contributor trust scoring (#1346) (@mchmarny)
- 13769fc: ci: drop documentation label from automated-PR workflows (#1325) (@njhensley)
- 98639ba: ci: flag pointer-only changes in recipe-evidence gate (#1166) (@njhensley)
- 3b0750e: docs(contributor): address PR #1143 review feedback (#1145) (@mchmarny)
- 1864f05: docs(contributor): define "leaf" in recipe layered model (@mchmarny)
- 48cbb31: docs(contributor): rewrite, consolidate, and trim the contributor guide (#1143) (@mchmarny)
- 3409390: docs(demos): add recipe-evidence slide deck and demo script (#1188) (@njhensley)
- d913abb: docs(demos): split s3c into provenance + bundle-attestation demos (#1195) (@mchmarny)
- 180a596: docs(integrator): document Dynamo 1.2 NATS SG symptoms on EKS (#1369) (@yuanchen8911)
- 6c14530: docs(network-operator): fix 101c/101e NIC comments to ConnectX-6/mlx5Gen VFs (#1299) (@yuanchen8911)
- cf2545f: docs(readme): link Apache 2.0 LICENSE file (#1218) (@dims)
- 031e4bc: docs(roadmap): tighten v1 acceptance criteria per pillar (@mchmarny)
- a57aca8: docs(site): trim docs/README.md to a persona-driven landing page (#1144) (@mchmarny)
- f71b23c: docs(skills): add creating-guided-demos skill (#1189) (@njhensley)
- 4c74b04: docs(skills): add creating-slide-decks skill (#1191) (@njhensley)
- 02c6f7a: docs(validator): correct main/edge tag-publishing caveats (#1176) (@yuanchen8911)
- b5a6d64: docs(validator): document image tag mechanism (:edge vs :latest) (#1175) (@yuanchen8911)
- 76558b4: recipe: add gatekeeper as optional component (#821) (@JaydipGabani)
- 14c3259: recipes/dynamo: bump platform to 1.2 (#1308) (@sttts)
- f1cf50f: recipes/dynamo: wire NATS storage class (#1360) (@sttts)
- 8d17940: refactor(health): share declared-coverage Compact() across matrix and CLI (#1313) (@njhensley)
- 32ff487: revert: GB300 EKS overlays (#1319) — unresolved issues (#1328) (@yuanchen8911)