Skip to content

fix(validator): resolve dev-build images to SHA-tagged images#655

Merged
mchmarny merged 5 commits intomainfrom
fix/validator-sha-image-tag
Apr 23, 2026
Merged

fix(validator): resolve dev-build images to SHA-tagged images#655
mchmarny merged 5 commits intomainfrom
fix/validator-sha-image-tag

Conversation

@mchmarny
Copy link
Copy Markdown
Member

Summary

When version is a non-release (dev, -next) and a valid commit SHA is available, ResolveImage now resolves :latest to :sha-<commit> matching the tags on-push.yaml already pushes. This allows validation from main to use its own validator images without requiring a release or manual override.

Motivation / Context

Validator images on main are unreachable by default — catalog.yaml hardcodes :latest, but on-push.yaml only pushes SHA-tagged images and :latest is reserved for stable releases. Running validation from main always pulls the last released image, not what was just built.

Fixes: #654
Related: N/A

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation update
  • Refactoring (no functional changes)
  • Build/CI/tooling

Component(s) Affected

  • CLI (cmd/aicr, pkg/cli)
  • API server (cmd/aicrd, pkg/api, pkg/server)
  • Recipe engine / data (pkg/recipe)
  • Bundlers (pkg/bundler, pkg/component/*)
  • Collectors / snapshotter (pkg/collector, pkg/snapshotter)
  • Validator (pkg/validator)
  • Core libraries (pkg/errors, pkg/k8s)
  • Docs/examples (docs/, examples/)
  • Other: ____________

Implementation Notes

Resolution order in ResolveImage(image, version, commit):

  1. Release version (vX.Y.Z) → :vX.Y.Z (unchanged)
  2. Non-release + valid commit SHA → :sha-<commit> (new)
  3. Non-release + no commit → :latest (unchanged fallback)
  4. Registry override applied last (unchanged)

The commit SHA is threaded from CLI ldflags (pkg/cli.commit) through:

  • validator.WithCommit(commit) option → Validator.Commit field
  • catalog.Load(version, commit)ResolveImage(image, version, commit)
  • job.NewDeployer(..., cliCommit, ...)AICR_CLI_COMMIT env var
  • Inner validators (e.g. inference-perf) read AICR_CLI_COMMIT for their own image resolution

isValidCommit() accepts 7-40 hex chars, rejects "" and "unknown" (the ldflags default).

Testing

go test -race ./pkg/validator/catalog/... ./pkg/validator/job/... ./pkg/validator/... ./validators/performance/...
golangci-lint run -c .golangci.yaml ./pkg/validator/... ./pkg/cli/... ./validators/performance/...

All tests pass, zero lint issues. New test cases cover:

  • dev + valid commit → SHA tag
  • dev + unknown/empty commit → stays :latest
  • -next + valid commit → SHA tag
  • release + commit → release takes precedence
  • explicit tag + commit → not modified
  • dev + commit + registry override → both compose
  • AICR_CLI_COMMIT env injection in deployer

Risk Assessment

  • Low — Isolated change, well-tested, easy to revert

Rollout notes: No migration needed. Release builds are completely unaffected — the new code path only activates for non-release versions with a valid commit SHA. Existing dev builds without a commit SHA ("unknown") continue using :latest as before.

Checklist

  • Tests pass locally (make test with -race)
  • Linter passes (make lint)
  • I did not skip/disable tests to make CI green
  • I added/updated tests for new functionality
  • I updated docs if user-facing behavior changed
  • Changes follow existing patterns in the codebase
  • Commits are cryptographically signed (git commit -S)

When version is a non-release (dev, -next) and a valid commit SHA is
available, ResolveImage now resolves :latest to :sha-<commit> matching
the tags on-push.yaml already pushes. This allows validation from main
to use its own validator images without requiring a release or manual
override.

The commit SHA is threaded from the CLI ldflags through Validator,
catalog.Load, and the deployer env (AICR_CLI_COMMIT) so inner
validators (e.g. inference-perf) also resolve correctly.

Release builds are unaffected — they continue resolving :latest to
the release version tag.

Fixes #654
@coderabbitai

This comment was marked as resolved.

coderabbitai[bot]

This comment was marked as resolved.

@github-actions

This comment was marked as outdated.

Handles edge case where commit SHA contains uppercase hex characters
by lowercasing at the ResolveImage entry point before validation and
tag construction.
@mchmarny mchmarny requested a review from njhensley April 23, 2026 17:24
Add tests for too-short, too-long, and non-hex commit inputs.
Exercise WithCommit in TestNewWithOptions. Catalog coverage
94.4% → 97.2%, isValidCommit and WithCommit now at 100%.
coderabbitai[bot]

This comment was marked as resolved.

Replace blocklist approach (reject "dev" and "-next") with a regex
allowlist (^v?\d+\.\d+\.\d+$) so snapshot strings like
v0.0.0-12-gabc1234 and pre-release tags like v1.0.0-rc1 correctly
fall through to SHA-based image resolution instead of producing
non-existent version tags.
coderabbitai[bot]

This comment was marked as resolved.

coderabbitai[bot]

This comment was marked as resolved.

Use properly sized all-hex strings: 40-char (valid full SHA) and
41-char (over max). Adds upper boundary coverage for isValidCommit.
Copy link
Copy Markdown
Contributor

@yuanchen8911 yuanchen8911 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@mchmarny mchmarny merged commit 997589a into main Apr 23, 2026
41 of 42 checks passed
@mchmarny mchmarny deleted the fix/validator-sha-image-tag branch April 23, 2026 18:44
@njhensley
Copy link
Copy Markdown
Contributor

Review summary

Overall: The shape of the fix — thread a build-commit SHA through catalog.Load / Deployer so dev builds resolve to their own images — is correct. The regex tightening of isReleaseVersion, the lowercase normalization, and the isValidCommit bounds are all nice touches, and the tests cover the new branches thoroughly.

However there is one blocker: the SHA the CLI has at runtime does not match the SHA the registry is tagged with, so this PR as-is will produce images that ImagePullBackOff.

Blocker: short-commit vs full-commit mismatch

  • .goreleaser.yaml:34 injects the CLI build commit via:

    -X github.com/NVIDIA/aicr/pkg/cli.commit={{.ShortCommit}}
    

    {{.ShortCommit}} is the 7-character abbreviated SHA. make build goes through goreleaser build --snapshot (see Makefile:249-252), so even local dev binaries see a 7-char commit.

  • .github/workflows/on-push.yaml:117-118,161,167-169,179 publishes validator images tagged:

    ${IMAGE}:sha-${{ github.sha }}    # 40-char SHA
    ${IMAGE}:edge                     # moving alias
    

    It never publishes a sha-<7char> variant.

  • After this PR, pkg/validator/catalog/catalog.go will produce:

    ghcr.io/nvidia/aicr-validators/aiperf-bench:sha-abc1234
    

    but the tag on the registry is :sha-abcdef1234...40-chars. The image pull will fail — which is exactly the class of error fix(validator): resolve dev builds to SHA-tagged images #654 is trying to fix.

The PR's test suite doesn't catch this because it checks resolution logic in isolation (commit: "abc1234"sha-abc1234), not the end-to-end contract with what on-push.yaml actually pushes.

Suggested fix — simplest is to change both occurrences in .goreleaser.yaml (lines 34 and 69) from {{.ShortCommit}} to {{.FullCommit}}. isValidCommit already accepts 40-char SHAs (the test "dev version with 40-char full SHA resolves" proves it), so no other code needs to change. The downside is cosmetic (aicr --version now prints a 40-char SHA instead of 7); if that matters, keep both and pass FullCommit as a separate ldflag used only for image resolution.

A consistency check worth adding: an integration-style test or CI guard that asserts ResolveImage(image, cli.version, cli.commit) for a dev build matches some tag that on-push.yaml is configured to push, so future drift between the ldflag and the workflow is caught mechanically.

Non-blocking observations

  1. ResolveImage / Load / NewDeployer signature drift. Each now takes (version, commit) positionally, and NewDeployer already had ~8 positional args before this PR. Every added field breaks every call site and hurts readability of test setups (see the many "run1", "", "", chains in deployer_test.go). If another build-time value needs to thread through later (build date? registry auth?), this gets worse fast. Not for this PR, but the codebase's WithX option pattern (already used on Validator) would fit naturally on Deployer / Load too.

  2. AICR_CLI_COMMIT env var is undocumented. AICR_CLI_VERSION is also undocumented, so this is consistent with existing convention — but CLAUDE.md calls out env vars as requiring doc updates. If treating both as "internal, subject to change" is the intent, a one-line comment in deployer.go beside the buildEnvApply block stating so would be useful so reviewers don't have to re-derive it.

  3. isValidCommit accepts anything 7–40 hex. Fine as a shape check, but the function name implies more than it delivers — it doesn't verify the commit is on main, reachable, or that an image for it was actually pushed. A name like looksLikeCommitSHA would be more honest. Minor.

Verdict

Request changes — blocker in the short/full-commit mismatch. Fix is small (one-word change in two places in .goreleaser.yaml, plus ideally a contract test). Everything else is solid.

@mchmarny
Copy link
Copy Markdown
Member Author

Hm, wonder why the test didn't catch that. Thanks for the heads up, @njhensley. Looking into this.

yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request Apr 24, 2026
Running `aicr validate` from a feature-branch dev build fails with
ImagePullBackOff on every validator pod because on-push.yaml only
pushes `:sha-<commit>` images for commits merged to `main`, while
PR NVIDIA#655 (rightly) made non-release builds resolve to `:sha-<commit>`
instead of `:latest`. Contributors dog-fooding a PR before merge hit
a "NotFound" dead-end with no escape hatch.

This adds AICR_VALIDATOR_IMAGE_TAG as an opt-in override. When set,
the resolved tag is replaced on every validator image — including
explicit catalog tags like `:v1.2.3` — so a feature-branch build can
point at a known-published tag:

    AICR_VALIDATOR_IMAGE_TAG=latest aicr validate --phase performance ...

Default behavior is unchanged: release builds still resolve to
`:v<version>`, main-branch dev builds still resolve to `:sha-<commit>`,
and reproducibility for CI paths is preserved. The override is strictly
additive and strictly opt-in.

The override is forwarded from the CLI invocation into the validator
container (alongside AICR_CLI_VERSION, AICR_CLI_COMMIT, and
AICR_VALIDATOR_IMAGE_REGISTRY), so validators that resolve inner
workload images at runtime (inference-perf's AIPerf benchmark Job)
apply the same semantics as catalog.Load. Without this, the outer
validator pod would get `:latest` while the inner benchmark pod
would still resolve to the same unpublished `:sha-<commit>` and
ImagePullBackOff — defeating the motivating feature-branch workflow.

Digest-pinned references (`name@sha256:…`) are preserved verbatim. A
tag override is meaningless against a content-addressable pin, and
naive last-colon splitting would corrupt the digest hash into the
tag slot. replaceTag detects the `@` separator and returns the image
unchanged; the registry override still applies to digest refs.

Tests:
 - catalog_test.go: 10 new cases covering the tag-override escape
   hatch, composition with registry override, the no-tag-append case,
   the empty-env-var no-op, the localhost:5001 port-preservation
   edge case, digest-only and mixed `name:tag@digest` forms
 - deployer_test.go: 2 new cases asserting the env var is forwarded
   into the validator container when set, and strictly omitted when
   unset (so the default release / main-branch paths are untouched)
 - docs/contributor/validator.md updated with the resolution order,
   digest behavior, and env-var forwarding
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request Apr 24, 2026
Running `aicr validate` from a feature-branch dev build fails with
ImagePullBackOff on every validator pod because on-push.yaml only
pushes `:sha-<commit>` images for commits merged to `main`, while
PR NVIDIA#655 (rightly) made non-release builds resolve to `:sha-<commit>`
instead of `:latest`. Contributors dog-fooding a PR before merge hit
a "NotFound" dead-end with no escape hatch.

This adds AICR_VALIDATOR_IMAGE_TAG as an opt-in override. When set,
the resolved tag is replaced on every validator image — including
explicit catalog tags like `:v1.2.3` — so a feature-branch build can
point at a known-published tag:

    AICR_VALIDATOR_IMAGE_TAG=latest aicr validate --phase performance ...

Default behavior is unchanged: release builds still resolve to
`:v<version>`, main-branch dev builds still resolve to `:sha-<commit>`,
and reproducibility for CI paths is preserved. The override is strictly
additive and strictly opt-in.

The override is forwarded from the CLI invocation into the validator
container (alongside AICR_CLI_VERSION, AICR_CLI_COMMIT, and
AICR_VALIDATOR_IMAGE_REGISTRY), so validators that resolve inner
workload images at runtime (inference-perf's AIPerf benchmark Job)
apply the same semantics as catalog.Load. Without this, the outer
validator pod would get `:latest` while the inner benchmark pod
would still resolve to the same unpublished `:sha-<commit>` and
ImagePullBackOff — defeating the motivating feature-branch workflow.

Digest-pinned references (`name@sha256:…`) are preserved verbatim. A
tag override is meaningless against a content-addressable pin, and
naive last-colon splitting would corrupt the digest hash into the
tag slot. replaceTag detects the `@` separator and returns the image
unchanged; the registry override still applies to digest refs.

Tests:
 - catalog_test.go: 10 new cases covering the tag-override escape
   hatch, composition with registry override, the no-tag-append case,
   the empty-env-var no-op, the localhost:5001 port-preservation
   edge case, digest-only and mixed `name:tag@digest` forms
 - deployer_test.go: 2 new cases asserting the env var is forwarded
   into the validator container when set, and strictly omitted when
   unset (so the default release / main-branch paths are untouched)
 - docs/contributor/validator.md updated with the resolution order,
   digest behavior, and env-var forwarding
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request Apr 24, 2026
Running `aicr validate` from a feature-branch dev build fails with
ImagePullBackOff on every validator pod because on-push.yaml only
pushes `:sha-<commit>` images for commits merged to `main`, while
PR NVIDIA#655 (rightly) made non-release builds resolve to `:sha-<commit>`
instead of `:latest`. Contributors dog-fooding a PR before merge hit
a "NotFound" dead-end with no escape hatch.

This adds AICR_VALIDATOR_IMAGE_TAG as an opt-in override. When set,
the resolved tag is replaced on every validator image — including
explicit catalog tags like `:v1.2.3` — so a feature-branch build can
point at a known-published tag:

    AICR_VALIDATOR_IMAGE_TAG=latest aicr validate --phase performance ...

Default behavior is unchanged: release builds still resolve to
`:v<version>`, main-branch dev builds still resolve to `:sha-<commit>`,
and reproducibility for CI paths is preserved. The override is strictly
additive and strictly opt-in.

The override is forwarded from the CLI invocation into the validator
container (alongside AICR_CLI_VERSION, AICR_CLI_COMMIT, and
AICR_VALIDATOR_IMAGE_REGISTRY), so validators that resolve inner
workload images at runtime (inference-perf's AIPerf benchmark Job)
apply the same semantics as catalog.Load. Without this, the outer
validator pod would get `:latest` while the inner benchmark pod
would still resolve to the same unpublished `:sha-<commit>` and
ImagePullBackOff — defeating the motivating feature-branch workflow.

Digest-pinned references (`name@sha256:…`) are preserved verbatim. A
tag override is meaningless against a content-addressable pin, and
naive last-colon splitting would corrupt the digest hash into the
tag slot. replaceTag detects the `@` separator and returns the image
unchanged; the registry override still applies to digest refs.

Tests:
 - catalog_test.go: 10 new cases covering the tag-override escape
   hatch, composition with registry override, the no-tag-append case,
   the empty-env-var no-op, the localhost:5001 port-preservation
   edge case, digest-only and mixed `name:tag@digest` forms
 - deployer_test.go: 2 new cases asserting the env var is forwarded
   into the validator container when set, and strictly omitted when
   unset (so the default release / main-branch paths are untouched)
 - docs/contributor/validator.md updated with the resolution order,
   digest behavior, and env-var forwarding
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request Apr 24, 2026
Running `aicr validate` from a feature-branch dev build fails with
ImagePullBackOff on every validator pod because on-push.yaml only
pushes `:sha-<commit>` images for commits merged to `main`, while
PR NVIDIA#655 (rightly) made non-release builds resolve to `:sha-<commit>`
instead of `:latest`. Contributors dog-fooding a PR before merge hit
a "NotFound" dead-end with no escape hatch.

This adds AICR_VALIDATOR_IMAGE_TAG as an opt-in override. When set,
the resolved tag is replaced on every validator image — including
explicit catalog tags like `:v1.2.3` — so a feature-branch build can
point at a known-published tag:

    AICR_VALIDATOR_IMAGE_TAG=latest aicr validate --phase performance ...

Default behavior is unchanged: release builds still resolve to
`:v<version>`, main-branch dev builds still resolve to `:sha-<commit>`,
and reproducibility for CI paths is preserved. The override is strictly
additive and strictly opt-in.

The override is forwarded from the CLI invocation into the validator
container (alongside AICR_CLI_VERSION, AICR_CLI_COMMIT, and
AICR_VALIDATOR_IMAGE_REGISTRY), so validators that resolve inner
workload images at runtime (inference-perf's AIPerf benchmark Job)
apply the same semantics as catalog.Load. Without this, the outer
validator pod would get `:latest` while the inner benchmark pod
would still resolve to the same unpublished `:sha-<commit>` and
ImagePullBackOff — defeating the motivating feature-branch workflow.

Digest-pinned references (`name@sha256:…`) are preserved verbatim. A
tag override is meaningless against a content-addressable pin, and
naive last-colon splitting would corrupt the digest hash into the
tag slot. replaceTag detects the `@` separator and returns the image
unchanged; the registry override still applies to digest refs.

Tests:
 - catalog_test.go: 10 new cases covering the tag-override escape
   hatch, composition with registry override, the no-tag-append case,
   the empty-env-var no-op, the localhost:5001 port-preservation
   edge case, digest-only and mixed `name:tag@digest` forms
 - deployer_test.go: 2 new cases asserting the env var is forwarded
   into the validator container when set, and strictly omitted when
   unset (so the default release / main-branch paths are untouched)
 - docs/contributor/validator.md updated with the resolution order,
   digest behavior, and env-var forwarding
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request Apr 24, 2026
Running `aicr validate` from a feature-branch dev build fails with
ImagePullBackOff on every validator pod because on-push.yaml only
pushes `:sha-<commit>` images for commits merged to `main`, while
PR NVIDIA#655 (rightly) made non-release builds resolve to `:sha-<commit>`
instead of `:latest`. Contributors dog-fooding a PR before merge hit
a "NotFound" dead-end with no escape hatch.

This adds AICR_VALIDATOR_IMAGE_TAG as an opt-in override. When set,
the resolved tag is replaced on every validator image — including
explicit catalog tags like `:v1.2.3` — so a feature-branch build can
point at a known-published tag:

    AICR_VALIDATOR_IMAGE_TAG=latest aicr validate --phase performance ...

Default behavior is unchanged: release builds still resolve to
`:v<version>`, main-branch dev builds still resolve to `:sha-<commit>`,
and reproducibility for CI paths is preserved. The override is strictly
additive and strictly opt-in.

The override is forwarded from the CLI invocation into the validator
container (alongside AICR_CLI_VERSION, AICR_CLI_COMMIT, and
AICR_VALIDATOR_IMAGE_REGISTRY), so validators that resolve inner
workload images at runtime (inference-perf's AIPerf benchmark Job)
apply the same semantics as catalog.Load. Without this, the outer
validator pod would get `:latest` while the inner benchmark pod
would still resolve to the same unpublished `:sha-<commit>` and
ImagePullBackOff — defeating the motivating feature-branch workflow.

Digest-pinned references (`name@sha256:…`) are preserved verbatim. A
tag override is meaningless against a content-addressable pin, and
naive last-colon splitting would corrupt the digest hash into the
tag slot. replaceTag detects the `@` separator and returns the image
unchanged; the registry override still applies to digest refs.

Tests:
 - catalog_test.go: 10 new cases covering the tag-override escape
   hatch, composition with registry override, the no-tag-append case,
   the empty-env-var no-op, the localhost:5001 port-preservation
   edge case, digest-only and mixed `name:tag@digest` forms
 - deployer_test.go: 2 new cases asserting the env var is forwarded
   into the validator container when set, and strictly omitted when
   unset (so the default release / main-branch paths are untouched)
 - docs/contributor/validator.md updated with the resolution order,
   digest behavior, and env-var forwarding
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request Apr 24, 2026
Running `aicr validate` from a feature-branch dev build fails with
ImagePullBackOff on every validator pod because on-push.yaml only
pushes `:sha-<commit>` images for commits merged to `main`, while
PR NVIDIA#655 (rightly) made non-release builds resolve to `:sha-<commit>`
instead of `:latest`. Contributors dog-fooding a PR before merge hit
a "NotFound" dead-end with no escape hatch.

This adds AICR_VALIDATOR_IMAGE_TAG as an opt-in override. When set,
the resolved tag is replaced on every validator image — including
explicit catalog tags like `:v1.2.3` — so a feature-branch build can
point at a known-published tag:

    AICR_VALIDATOR_IMAGE_TAG=latest aicr validate --phase performance ...

Default behavior is unchanged: release builds still resolve to
`:v<version>`, main-branch dev builds still resolve to `:sha-<commit>`,
and reproducibility for CI paths is preserved. The override is strictly
additive and strictly opt-in.

The override is forwarded from the CLI invocation into the validator
container (alongside AICR_CLI_VERSION, AICR_CLI_COMMIT, and
AICR_VALIDATOR_IMAGE_REGISTRY), so validators that resolve inner
workload images at runtime (inference-perf's AIPerf benchmark Job)
apply the same semantics as catalog.Load. Without this, the outer
validator pod would get `:latest` while the inner benchmark pod
would still resolve to the same unpublished `:sha-<commit>` and
ImagePullBackOff — defeating the motivating feature-branch workflow.

Digest-pinned references (`name@sha256:…`) are preserved verbatim. A
tag override is meaningless against a content-addressable pin, and
naive last-colon splitting would corrupt the digest hash into the
tag slot. replaceTag detects the `@` separator and returns the image
unchanged; the registry override still applies to digest refs.

Tests:
 - catalog_test.go: 10 new cases covering the tag-override escape
   hatch, composition with registry override, the no-tag-append case,
   the empty-env-var no-op, the localhost:5001 port-preservation
   edge case, digest-only and mixed `name:tag@digest` forms
 - deployer_test.go: 2 new cases asserting the env var is forwarded
   into the validator container when set, and strictly omitted when
   unset (so the default release / main-branch paths are untouched)
 - docs/contributor/validator.md updated with the resolution order,
   digest behavior, and env-var forwarding
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

fix(validator): resolve dev builds to SHA-tagged images

3 participants