feat(validator): add AICR_VALIDATOR_IMAGE_TAG env-var override#666
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughIntroduces AICR_VALIDATOR_IMAGE_TAG as an unconditional tag override applied during validator image resolution (skipping digest-pinned refs). ResolveImage now applies dev-release/commit mapping, then tag override via replaceTag, then AICR_VALIDATOR_IMAGE_REGISTRY prefix replacement. Adds catalog.ImagePullPolicy(image) and uses it to determine ImagePullPolicy for orchestrator and inner validator Jobs. The deployer forwards AICR_VALIDATOR_IMAGE_TAG (alongside existing CLI envs) into validator pods. Tests and docs updated to verify tag override, digest immutability, registry composition, pull-policy rules, and env forwarding. Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Possibly related PRs
Suggested reviewers
🚥 Pre-merge checks | ✅ 4✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
d95a99c to
77f25a3
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pkg/validator/job/deployer_test.go`:
- Around line 305-345: Combine TestDeployJobForwardsImageTagOverride and
TestDeployJobOmitsImageTagOverrideWhenUnset into a single table-driven test:
define a slice of cases with fields like name, envValue (e.g. "latest" or ""),
wantPresent (bool) and wantValue (string), iterate cases with t.Run, inside each
subtest set the env via t.Setenv("AICR_VALIDATOR_IMAGE_TAG", tc.envValue),
create namespace and call deployAndGet(NewDeployer(...)) as before, then examine
job.Spec.Template.Spec.Containers[0].Env to assert presence/absence of
"AICR_VALIDATOR_IMAGE_TAG" and when present that its Value equals tc.wantValue;
keep existing helper calls (createUniqueNamespace, deployAndGet, NewDeployer,
testFactory, testEntry) and preserve error messages on assertion failures.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: d115f959-1523-47c6-bc0b-c4ad385221ac
📒 Files selected for processing (5)
docs/contributor/validator.mdpkg/validator/catalog/catalog.gopkg/validator/catalog/catalog_test.gopkg/validator/job/deployer.gopkg/validator/job/deployer_test.go
0bea785 to
dc3e29d
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@validators/performance/inference_perf_constraint.go`:
- Around line 1222-1237: Call resolveAiperfImage() once and reuse the result
instead of invoking it twice: compute a local variable (e.g., image :=
resolveAiperfImage()) before constructing the pod spec and then set Name/Image
to that variable and ImagePullPolicy to catalog.ImagePullPolicy(image); update
usages around the Image and ImagePullPolicy fields so both refer to the same
cached image value to avoid redundant env reads and possible inconsistency.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: b24647df-cbdd-49d8-9608-c966b62da014
📒 Files selected for processing (7)
docs/contributor/validator.mdpkg/validator/catalog/catalog.gopkg/validator/catalog/catalog_test.gopkg/validator/job/deployer.gopkg/validator/job/deployer_test.govalidators/performance/inference_perf_constraint.govalidators/performance/inference_perf_test.go
49f9088 to
b9eba99
Compare
There was a problem hiding this comment.
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@docs/contributor/validator.md`:
- Line 240: Update the paragraph to accurately document that
catalog.ImagePullPolicy() does not always return Always when
AICR_VALIDATOR_IMAGE_TAG is set: call out the explicit exceptions—refs matching
ko.local/* or kind.local/* resolve to Never, and digest-pinned image refs
resolve to IfNotPresent—so the docs mirror the implementation and note that only
mutable non-digest tags (e.g., latest, edge, main) trigger Always; reference
catalog.ImagePullPolicy(), AICR_VALIDATOR_IMAGE_TAG, ko.local/*, kind.local/*,
and digest-pinned refs in the description.
In `@pkg/validator/catalog/catalog.go`:
- Around line 186-187: The current check using strings.HasPrefix(image,
"ko.local") / "kind.local" is too broad; instead extract the registry segment
(the substring before the first '/') from the image reference and compare it for
exact equality to "ko.local" or "kind.local" so only images whose registry is
exactly those values use corev1.PullNever; update the logic in the function
containing that check (the block that currently reads strings.HasPrefix(image,
"...")) to split the image with strings.SplitN(image, "/", 2) or use a proper
image reference parser, then compare registry == "ko.local" || registry ==
"kind.local" and return corev1.PullNever only in that case.
In `@validators/performance/inference_perf_test.go`:
- Around line 449-459: TestBuildAIPerfJob_PrebuiltImageAndSentinel's isolation
misses AICR_CLI_COMMIT which resolveAiperfImage() also reads; clear that env var
in the same setup by calling t.Setenv("AICR_CLI_COMMIT", "") so :latest won't be
rewritten to :sha-<commit>. Apply the same change to the other isolated test
setups that already clear AICR_CLI_VERSION, AICR_VALIDATOR_IMAGE_REGISTRY, and
AICR_VALIDATOR_IMAGE_TAG (the blocks around the other tests referencing
resolveAiperfImage()).
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: d37a48d1-0ee2-4330-b814-dbf7eff452d5
📒 Files selected for processing (7)
docs/contributor/validator.mdpkg/validator/catalog/catalog.gopkg/validator/catalog/catalog_test.gopkg/validator/job/deployer.gopkg/validator/job/deployer_test.govalidators/performance/inference_perf_constraint.govalidators/performance/inference_perf_test.go
Running `aicr validate` from a feature-branch dev build fails with ImagePullBackOff on every validator pod because on-push.yaml only pushes `:sha-<commit>` images for commits merged to `main`, while PR NVIDIA#655 (rightly) made non-release builds resolve to `:sha-<commit>` instead of `:latest`. Contributors dog-fooding a PR before merge hit a "NotFound" dead-end with no escape hatch. This adds AICR_VALIDATOR_IMAGE_TAG as an opt-in override. When set, the resolved tag is replaced on every validator image — including explicit catalog tags like `:v1.2.3` — so a feature-branch build can point at a known-published tag: AICR_VALIDATOR_IMAGE_TAG=latest aicr validate --phase performance ... Default behavior is unchanged: release builds still resolve to `:v<version>`, main-branch dev builds still resolve to `:sha-<commit>`, and reproducibility for CI paths is preserved. The override is strictly additive and strictly opt-in. The override is forwarded from the CLI invocation into the validator container (alongside AICR_CLI_VERSION, AICR_CLI_COMMIT, and AICR_VALIDATOR_IMAGE_REGISTRY), so validators that resolve inner workload images at runtime (inference-perf's AIPerf benchmark Job) apply the same semantics as catalog.Load. Without this, the outer validator pod would get `:latest` while the inner benchmark pod would still resolve to the same unpublished `:sha-<commit>` and ImagePullBackOff — defeating the motivating feature-branch workflow. Digest-pinned references (`name@sha256:…`) are preserved verbatim. A tag override is meaningless against a content-addressable pin, and naive last-colon splitting would corrupt the digest hash into the tag slot. replaceTag detects the `@` separator and returns the image unchanged; the registry override still applies to digest refs. Tests: - catalog_test.go: 10 new cases covering the tag-override escape hatch, composition with registry override, the no-tag-append case, the empty-env-var no-op, the localhost:5001 port-preservation edge case, digest-only and mixed `name:tag@digest` forms - deployer_test.go: 2 new cases asserting the env var is forwarded into the validator container when set, and strictly omitted when unset (so the default release / main-branch paths are untouched) - docs/contributor/validator.md updated with the resolution order, digest behavior, and env-var forwarding
b9eba99 to
bb7af36
Compare
Summary
Adds
AICR_VALIDATOR_IMAGE_TAG— an opt-in env-var override that replacesthe resolved tag on every validator image, regardless of what the default
resolution logic (release version or
:sha-<commit>) produced. Defaultbehavior is unchanged; users only set this when they need a published tag
for a commit that has not been through
on-push.yamlyet.Motivation / Context
Running
aicr validatefrom a feature-branch dev build currently failswith
ImagePullBackOffon every validator pod:This is a direct consequence of PR #655: non-release dev builds resolve
to
:sha-<commit>, buton-push.yamlonly pushes:sha-<commit>forcommits merged to
main. Contributors dog-fooding a PR before mergehit a dead-end with no escape hatch. Repro:
git checkout <feature>; make build; ./aicr validate --phase performance ...→ NotFound.This adds a single-env-var override so the dev-build workflow can point
at a known-published tag:
Release and main-branch paths are untouched.
Fixes: N/A (dev-workflow UX, not a user-visible regression)
Related: #655
Type of Change
Component(s) Affected
pkg/validator)docs/,examples/)Implementation Notes
1. Single env-var read in
ResolveImageMirrors the existing
AICR_VALIDATOR_IMAGE_REGISTRYpattern. AppliedAFTER the existing auto-resolution step (release or SHA), BEFORE the
registry override, so all four layers compose cleanly. When unset,
ResolveImagebehaves exactly as before.2. Tag-replacement helper
replaceTaglocates the tag separator as the last:that sits afterthe last
/, solocalhost:5001/aicr-validators/…:sha-abcis rewrittencorrectly without clobbering the
5001registry port. Images without atag get the override appended.
3. Precedence
The override intentionally replaces ALL tags — including explicit
catalog tags like
:v1.2.3— because the whole point of the flag is"force everything to a tag I know is published." A less aggressive
"only override :latest" scheme would silently leave inference-perf's
inner AIPerf benchmark image on its catalog-pinned tag, defeating the
purpose on feature branches that have bumped that tag.
Testing
go test -race ./pkg/validator/catalog/...— pass, 30 ResolveImage cases(23 existing + 7 new covering the override)
make qualify— passaicr-cuj2:AICR_VALIDATOR_IMAGE_TAG=lateston a feature-branch build resolves the performance validator image to
the published
:latesttag;aicr validate --phase performanceruns the full inference-perf check end-to-end (throughput + TTFT p99)
and returns
passedRisk Assessment
When the env var is unset,
ResolveImageis byte-identical to thepre-change path (same resolution order, same outputs). The new code is
reached only when a user explicitly sets
AICR_VALIDATOR_IMAGE_TAG.Rollout notes: No migration. Contributors running
aicr validateoff a feature branch can opt in starting with the next dev build.
Checklist
make testwith-race)make lint)git commit -S)