fix(validator): resolve dev-build images to SHA-tagged images #655
When version is a non-release (dev, -next) and a valid commit SHA is available, ResolveImage now resolves :latest to :sha-<commit> matching the tags on-push.yaml already pushes. This allows validation from main to use its own validator images without requiring a release or manual override. The commit SHA is threaded from the CLI ldflags through Validator, catalog.Load, and the deployer env (AICR_CLI_COMMIT) so inner validators (e.g. inference-perf) also resolve correctly. Release builds are unaffected — they continue resolving :latest to the release version tag. Fixes #654
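The threading described above can be pictured with a small sketch. Only `WithCommit`, the `Commit` field, and the `AICR_CLI_COMMIT` / `AICR_CLI_VERSION` variable names come from this PR; `WithVersion`, `New`, and `deployerEnv` are hypothetical names used for illustration, and the real `job.NewDeployer` signature is not reproduced here.

```go
package main

import "fmt"

// Validator carries the build metadata that image resolution depends on.
type Validator struct {
	Version string
	Commit  string
}

// Option configures a Validator (functional-option pattern).
type Option func(*Validator)

// WithCommit records the CLI build commit on the Validator.
func WithCommit(commit string) Option {
	return func(v *Validator) { v.Commit = commit }
}

// WithVersion is a hypothetical companion option for the CLI version.
func WithVersion(version string) Option {
	return func(v *Validator) { v.Version = version }
}

// New builds a Validator from the supplied options (hypothetical constructor).
func New(opts ...Option) *Validator {
	v := &Validator{}
	for _, opt := range opts {
		opt(v)
	}
	return v
}

// deployerEnv is a hypothetical helper that returns the KEY=value pairs a
// deployer could inject into the validator container, so inner validators
// see the same version and commit as the CLI.
func deployerEnv(v *Validator) []string {
	return []string{
		"AICR_CLI_VERSION=" + v.Version,
		"AICR_CLI_COMMIT=" + v.Commit,
	}
}

func main() {
	v := New(WithVersion("dev"), WithCommit("abc1234"))
	for _, kv := range deployerEnv(v) {
		fmt.Println(kv)
	}
}
```

Passing the commit as an option rather than a required constructor argument keeps existing callers compiling, which may be why the PR chose `WithCommit`.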
Handles edge case where commit SHA contains uppercase hex characters by lowercasing at the ResolveImage entry point before validation and tag construction.
Add tests for too-short, too-long, and non-hex commit inputs. Exercise WithCommit in TestNewWithOptions. Catalog coverage 94.4% → 97.2%, isValidCommit and WithCommit now at 100%.
Replace blocklist approach (reject "dev" and "-next") with a regex allowlist (^v?\d+\.\d+\.\d+$) so snapshot strings like v0.0.0-12-gabc1234 and pre-release tags like v1.0.0-rc1 correctly fall through to SHA-based image resolution instead of producing non-existent version tags.
Use properly sized all-hex strings: 40-char (valid full SHA) and 41-char (over max). Adds upper boundary coverage for isValidCommit.
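Taken together, the commits above suggest a resolution routine along the following lines. This is a minimal sketch, not the code in the PR: the regex and the 7-40 hex rule are taken from the text, while the handling of non-`:latest` images (returned unchanged) and the example image name are assumptions.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// releaseRe is the allowlist quoted above: only plain vX.Y.Z (or X.Y.Z)
// versions count as releases; snapshots and pre-releases do not match.
var releaseRe = regexp.MustCompile(`^v?\d+\.\d+\.\d+$`)

// isValidCommit accepts 7-40 lowercase hex characters. The empty string is
// too short and "unknown" contains non-hex characters, so both are rejected.
func isValidCommit(commit string) bool {
	if len(commit) < 7 || len(commit) > 40 {
		return false
	}
	for _, r := range commit {
		isDigit := r >= '0' && r <= '9'
		isLowerHex := r >= 'a' && r <= 'f'
		if !isDigit && !isLowerHex {
			return false
		}
	}
	return true
}

// ResolveImage rewrites a ":latest" reference according to the build
// metadata. Assumption: images with any other tag are returned unchanged.
func ResolveImage(image, version, commit string) string {
	const latest = ":latest"
	if !strings.HasSuffix(image, latest) {
		return image
	}
	base := strings.TrimSuffix(image, latest)
	// Lowercase once at the entry point, before validation and tag construction.
	commit = strings.ToLower(commit)
	switch {
	case releaseRe.MatchString(version):
		return base + ":v" + strings.TrimPrefix(version, "v")
	case isValidCommit(commit):
		return base + ":sha-" + commit
	default:
		return image
	}
}

func main() {
	const img = "example.test/validator:latest" // hypothetical image name
	fmt.Println(ResolveImage(img, "v1.2.3", "abc1234"))
	fmt.Println(ResolveImage(img, "v0.0.0-12-gabc1234", "ABC1234DEF"))
	fmt.Println(ResolveImage(img, "dev", "unknown"))
}
```

Because the version check is an allowlist, any string that is not exactly a release tag falls through to the commit check, and only then to `:latest`.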
Review summary

Overall: The shape of the fix (thread a build-commit SHA through … However, there is one blocker: the SHA the CLI has at runtime does not match the SHA the registry is tagged with, so this PR as-is will produce images that …

Blocker: short-commit vs full-commit mismatch

The PR's test suite doesn't catch this because it checks resolution logic in isolation ( …

Suggested fix: simplest is to change both occurrences in …

A consistency check worth adding: an integration-style test or CI guard that asserts …

Non-blocking observations

Verdict

Request changes: blocker in the short/full-commit mismatch. Fix is small (one-word change in two places in …
Hm, wonder why the test didn't catch that. Thanks for the heads up, @njhensley. Looking into this.
Running `aicr validate` from a feature-branch dev build fails with ImagePullBackOff on every validator pod because on-push.yaml only pushes `:sha-<commit>` images for commits merged to `main`, while PR NVIDIA#655 (rightly) made non-release builds resolve to `:sha-<commit>` instead of `:latest`. Contributors dog-fooding a PR before merge hit a "NotFound" dead-end with no escape hatch.

This adds AICR_VALIDATOR_IMAGE_TAG as an opt-in override. When set, the resolved tag is replaced on every validator image, including explicit catalog tags like `:v1.2.3`, so a feature-branch build can point at a known-published tag:

    AICR_VALIDATOR_IMAGE_TAG=latest aicr validate --phase performance ...

Default behavior is unchanged: release builds still resolve to `:v<version>`, main-branch dev builds still resolve to `:sha-<commit>`, and reproducibility for CI paths is preserved. The override is strictly additive and strictly opt-in.

The override is forwarded from the CLI invocation into the validator container (alongside AICR_CLI_VERSION, AICR_CLI_COMMIT, and AICR_VALIDATOR_IMAGE_REGISTRY), so validators that resolve inner workload images at runtime (inference-perf's AIPerf benchmark Job) apply the same semantics as catalog.Load. Without this, the outer validator pod would get `:latest` while the inner benchmark pod would still resolve to the same unpublished `:sha-<commit>` and hit ImagePullBackOff, defeating the motivating feature-branch workflow.

Digest-pinned references (`name@sha256:…`) are preserved verbatim. A tag override is meaningless against a content-addressable pin, and naive last-colon splitting would corrupt the digest hash into the tag slot. replaceTag detects the `@` separator and returns the image unchanged; the registry override still applies to digest refs.

Tests:

- catalog_test.go: 10 new cases covering the tag-override escape hatch, composition with registry override, the no-tag-append case, the empty-env-var no-op, the localhost:5001 port-preservation edge case, and the digest-only and mixed `name:tag@digest` forms
- deployer_test.go: 2 new cases asserting the env var is forwarded into the validator container when set, and strictly omitted when unset (so the default release / main-branch paths are untouched)
- docs/contributor/validator.md updated with the resolution order, digest behavior, and env-var forwarding
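The tag-replacement rules in this commit message can be sketched as follows. The function name `replaceTag` comes from the text, but the body is an assumption written from the described behavior rather than a copy of the real implementation. In particular, appending the tag to an untagged image is a guess at what "the no-tag-append case" covers, and the image names are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// replaceTag swaps the tag on an image reference for the given tag.
// An empty tag is a no-op, and digest-pinned references (anything containing
// "@") are returned unchanged so the digest is never split at its colon.
func replaceTag(image, tag string) string {
	if tag == "" || strings.Contains(image, "@") {
		return image
	}
	// A tag separator can only appear after the last "/". A colon before
	// that point belongs to a registry port such as localhost:5001.
	slash := strings.LastIndex(image, "/")
	if colon := strings.LastIndex(image, ":"); colon > slash {
		return image[:colon] + ":" + tag
	}
	// No existing tag: append the override (assumed behavior).
	return image + ":" + tag
}

func main() {
	refs := []string{
		"example.test/validator:sha-abc1234",
		"localhost:5001/validator",
		"example.test/validator@sha256:0123abcd",
	}
	for _, ref := range refs {
		fmt.Println(replaceTag(ref, "latest"))
	}
}
```

Comparing the position of the last colon with the last slash is what keeps a registry port intact without parsing the full reference grammar.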
Summary
When the version is a non-release (dev, -next) and a valid commit SHA is available, `ResolveImage` now resolves `:latest` to `:sha-<commit>`, matching the tags `on-push.yaml` already pushes. This allows validation from main to use its own validator images without requiring a release or manual override.

Motivation / Context

Validator images on main are unreachable by default: `catalog.yaml` hardcodes `:latest`, but `on-push.yaml` only pushes SHA-tagged images and `:latest` is reserved for stable releases. Running validation from main always pulls the last released image, not what was just built.

Fixes: #654
Related: N/A
Type of Change
Component(s) Affected
- `cmd/aicr`, `pkg/cli`
- `cmd/aicrd`, `pkg/api`, `pkg/server`
- `pkg/recipe`
- `pkg/bundler`, `pkg/component/*`
- `pkg/collector`, `pkg/snapshotter`
- `pkg/validator`
- `pkg/errors`, `pkg/k8s`
- `docs/`, `examples/`

Implementation Notes
Resolution order in `ResolveImage(image, version, commit)`:

1. Release version → `:vX.Y.Z` (unchanged)
2. Non-release version with a valid commit → `:sha-<commit>` (new)
3. Otherwise → `:latest` (unchanged fallback)

The commit SHA is threaded from CLI ldflags (`pkg/cli.commit`) through:

- `validator.WithCommit(commit)` option → `Validator.Commit` field
- `catalog.Load(version, commit)` → `ResolveImage(image, version, commit)`
- `job.NewDeployer(..., cliCommit, ...)` → `AICR_CLI_COMMIT` env var
- Inner validators read `AICR_CLI_COMMIT` for their own image resolution

`isValidCommit()` accepts 7-40 hex chars, rejects `""` and `"unknown"` (the ldflags default).

Testing
    go test -race ./pkg/validator/catalog/... ./pkg/validator/job/... ./pkg/validator/... ./validators/performance/...
    golangci-lint run -c .golangci.yaml ./pkg/validator/... ./pkg/cli/... ./validators/performance/...

All tests pass, zero lint issues. New test cases cover:

- … `:latest`
- `-next` + valid commit → SHA tag
- `AICR_CLI_COMMIT` env injection in deployer

Risk Assessment
Rollout notes: No migration needed. Release builds are completely unaffected: the new code path only activates for non-release versions with a valid commit SHA. Existing dev builds without a commit SHA (`"unknown"`) continue using `:latest` as before.

Checklist

- `make test` with `-race`
- `make lint`
- `git commit -S`