Skip to content

test(recipe): enforce per-intent validation phase floor on overlay leaves#989

Merged
mchmarny merged 6 commits into
mainfrom
feat/970-overlay-validation-floor
May 20, 2026
Merged

test(recipe): enforce per-intent validation phase floor on overlay leaves#989
mchmarny merged 6 commits into
mainfrom
feat/970-overlay-validation-floor

Conversation

@mchmarny
Copy link
Copy Markdown
Member

Summary

Adds a Go test that walks each leaf overlay's spec.base chain, resolves the merged ValidationConfig per pkg/recipe/metadata.go Merge semantics, and asserts the resolved validation contains the per-intent required phases. Closes the loophole that let leaf overlays drift to conformance-only without a CI gate.

Motivation / Context

Today there is no test that asserts a leaf overlay's resolved validation meets a per-intent contract — so a new overlay can inherit only the service-root conformance: block and ship with no GPU operator-health check, no NCCL gate, and no inference-perf gate. 27 of 41 GPU overlays already drifted this way.

This is the static gate from #970. The runtime gate (signed evidence bundles from #751) only proves the declared validators ran — it can't catch a missing declaration. #970 forces the declaration to be complete; #751 forces it to actually run.

Fixes: #970
Related: #969, #751

Type of Change

  • New feature (non-breaking change that adds functionality)
  • Build/CI/tooling

Component(s) Affected

  • Recipe engine / data (pkg/recipe)

Implementation Notes

Per-intent floor enforced today:

Intent Required (blocking) Recommended (warn-only)
Training, non-Kind deployment + conformance performance (NCCL)
Inference Dynamo / NIM, non-Kind deployment + conformance performance (inference-perf)
Inference (plain) deployment + conformance
Any intent, Kind deployment + conformance

Strict mode: AICR_VALIDATION_FLOOR_STRICT=1 globally promotes the recommended performance phase from warn-only to required. Default OFF until #969 closes the data gap and Azure/OCI performance testbeds land. When ready, flip this in CI to tighten enforcement.

Known-gap allowlist: the test ships with a knownGaps map listing the 7 overlays that fail the floor today, so the contract can land independently of the data fix in #969. Entries downgrade Errorf to a Logf prefixed with KNOWN GAP (#969):. As #969 lands per-overlay fixes, drain this map. New overlays added without the floor will fail because they are not allowlisted.

Wildcard fragment exclusion: overlays with criteria.intent == any or criteria.service == any (b200-any-training, gb200-any-training, monitoring-hpa) are cross-cutting fragments — not user-selectable leaves — and are excluded from the gate. See docs/contributor/data.md#criteria-wildcard-overlays.

No new CI wiring needed: runs under make test, which make qualify already invokes; CI's existing pkg/recipe/... test job picks it up on every PR.

Deferred (per scope):

  • Per-phase validation: skip-deployment escape hatch — wait for first legitimate case
  • YAML-side metadata.annotations opt-out — would require modifying RecipeMetadataHeader; knownGaps serves the same purpose for now
  • aicr recipe lint CLI surface — not requested

Testing

make qualify     # passed
# default mode: TestOverlayValidationPhaseFloor — PASS (17 leaves, 7 known gaps logged)
AICR_VALIDATION_FLOOR_STRICT=1 go test ./pkg/recipe/ -run TestOverlayValidationPhaseFloor
# strict mode: PASS for allowlisted overlays + 1 non-allowlisted Azure/Dynamo gap caught

Coverage: pkg/recipe: 89.8% → 89.8% (test-only addition).

Risk Assessment

  • Low — Isolated test addition, no production code touched, easy to revert. knownGaps allowlist absorbs the existing 7 failures so CI stays green; the gate triggers only on new overlays that ship below the floor.

Rollout notes: N/A. To enable strict performance enforcement after #969 closes and testbeds exist, set AICR_VALIDATION_FLOOR_STRICT=1 in the CI test step.

Checklist

  • Tests pass locally (make test with -race)
  • Linter passes (make lint)
  • I did not skip/disable tests to make CI green
  • I added/updated tests for new functionality
  • I updated docs if user-facing behavior changed (no user-facing change)
  • Changes follow existing patterns in the codebase (mirrors conformance_test.go and the leaf-discovery pattern at metadata_store_test.go:512)
  • Commits are cryptographically signed (git commit -S)

Walk each leaf overlay's spec.base chain, resolve the merged
ValidationConfig per pkg/recipe/metadata.go Merge semantics, classify
by intent/service/platform, and assert the resolved validation contains
the required phases.

Per-intent floor:
  - Training (non-Kind)               : deployment + conformance
  - Inference Dynamo/NIM (non-Kind)   : deployment + conformance
  - Inference (plain)                 : deployment + conformance
  - Kind (any intent)                 : deployment + conformance
  - Performance is warn-only globally until #969 closes the data gap
    and Azure/OCI testbeds exist; AICR_VALIDATION_FLOOR_STRICT=1
    promotes it to required.

A knownGaps allowlist tracks the 7 overlays that fail today so the
contract can land independently of #969. New overlays added without
the floor will fail because they are not allowlisted. Criteria-wildcard
overlays (intent: any or service: any) are excluded — they are
cross-cutting fragments, not user-selectable leaves.

Refs #970
Refs #969
@mchmarny mchmarny requested a review from a team as a code owner May 20, 2026 16:17
@coderabbitai

This comment was marked as resolved.

coderabbitai[bot]

This comment was marked as resolved.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 20, 2026

Coverage Report ✅

Metric Value
Coverage 76.2%
Threshold 75%
Status Pass
Coverage Badge
![Coverage](https://img.shields.io/badge/coverage-76.2%25-green)

Merging this branch will decrease overall coverage

Impacted Packages Coverage Δ 🤖
github.com/NVIDIA/aicr/pkg/recipe 89.42% (-0.34%) 👎

Coverage by file

Changed files (no unit tests)

Changed File Coverage Δ Total Covered Missed 🤖
github.com/NVIDIA/aicr/pkg/recipe/metadata.go 96.99% (ø) 266 258 8
github.com/NVIDIA/aicr/pkg/recipe/metadata_store.go 82.97% (ø) 364 302 62
github.com/NVIDIA/aicr/pkg/recipe/validation.go 63.64% (+63.64%) 22 (+22) 14 (+14) 8 (+8) 🌟

Please note that the "Total", "Covered", and "Missed" counts above refer to code statements instead of lines of code. The value in brackets refers to the test coverage of that file in the old version of the code.

@mchmarny mchmarny self-assigned this May 20, 2026
Address review feedback on #989:

- Track which knownGaps entries actually downgrade a failure during the
  run. Fail the test if any entry never fires — that means the overlay
  has been fixed but the allowlist entry was not removed, which would
  silently mask future regressions.

- Hard-fail when discoverLeafOverlays returns zero leaves. Without this
  guard, a regression in discovery (e.g., refactor breaks leaf detection)
  would let the gate pass trivially.

Verified by injecting a bogus knownGaps entry — test fails with the
expected 'stale knownGaps entries' message and lists the entry to remove.

Refs #970
@mchmarny mchmarny enabled auto-merge (squash) May 20, 2026 18:40
Copy link
Copy Markdown
Contributor

@yuanchen8911 yuanchen8911 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cross-review by Claude Code + Codex + CodeRabbit. Three reviewers plus an integration-analysis pass converged on the same set of findings; cross-review verified each one against the PR head independently. Inline comments below cover the test-file findings. The headline issue is a production bug exposed by this test — since pkg/recipe/metadata.go is outside the diff, I'm raising it here.

1. Production bug: RecipeMetadataSpec.Merge aliases other.Validation and writes phase pointers through that alias

pkg/recipe/metadata.go#L460-L477

When s.Validation == nil, the function assigns s.Validation = other.Validation (pointer alias), then on later iterations the per-phase writes (s.Validation.Deployment = other.Validation.Deployment, etc.) go through that alias. Because the metadata store is cached and reused across recipe builds (metadata_store.go:65, BuildRecipeResultmergeOverlayChains at metadata_store.go:674), resolving one recipe mutates an ancestor overlay for every later resolution.

Reproduction: building h100-gke-cos-inference-dynamo mutates cached gke-cos validation from conformance to deployment, performance, conformance. That pollution masks the missing deployment phase on h100-gke-cos-training-kubeflow, which is not in knownGaps. Running just go test -run 'TestOverlayValidationPhaseFloor/h100-gke-cos-training-kubeflow' shows the unpolluted result: performance, conformance — failing the floor.

Suggested fix: in Merge, deep-copy other.Validation (or copy each non-nil phase pointer into a newly allocated ValidationConfig) instead of aliasing. Fixing this in production removes the test order dependence as a side effect.

Severity ordering

  • Blocker: production Merge aliasing bug (above)
  • Should-fix in this PR: chain-only resolver (L128), global leaf discovery (L148), per-overlay knownGaps granularity (L222)
  • Latent: wildcard filter narrowness (L164)

Comment thread pkg/recipe/validation_phase_floor_test.go Outdated
Comment thread pkg/recipe/validation_phase_floor_test.go Outdated
Comment thread pkg/recipe/validation_phase_floor_test.go Outdated
Comment thread pkg/recipe/validation_phase_floor_test.go
Comment thread pkg/recipe/validation_phase_floor_test.go
Copy link
Copy Markdown
Contributor

@yuanchen8911 yuanchen8911 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are some issues. PTAL

mchmarny and others added 4 commits May 20, 2026 12:41
…llution

RecipeMetadataSpec.Merge aliased other.Validation when the destination
was nil, then later wrote phase pointers through that alias. Across
successive calls on the cached MetadataStore, this mutated whichever
overlay's cached ValidationConfig the alias happened to point at — so
`BuildRecipeResult` was order-dependent, and an overlay that genuinely
lacked a phase could appear compliant after a prior query planted that
phase via pollution.

Fix:

  - Add cloneValidationConfig + cloneValidationPhase helpers that deep-
    copy phases and their backing slices / NodeSelection.
  - Clone in Merge whenever a phase pointer is assigned (both the
    nil-destination branch and the per-phase overwrite branch).
  - Clone in initBaseMergedSpec so the merged spec never aliases
    s.Base.Spec.Validation.

Regression test in metadata_test.go: two sequential merges into a fresh
destination must not mutate the first source. Without the fix the test
detects Deployment leaking from the second merge into the first source's
ValidationConfig.

Refs #970
Address review feedback on PR #989:

- Use store.BuildRecipeResult (the production code path) instead of an
  ad-hoc base-chain walk. The previous helper missed wildcard overlay
  contributions and mixins, and its leaf detection excluded concrete
  intermediate overlays (h100-gke-cos-training, eks-training, etc.)
  that production returns as maximal-leaf candidates for queries that
  do not narrow further.

- Gateable set expanded from 17 to 56 overlays. Per-overlay queries
  exercise the same resolver path the CLI/API use. Wildcard-fragment
  overlays (criteria.intent or criteria.service set to 'any') remain
  excluded because their criteria are not meaningful user queries.

- knownGaps is now map[string]map[string]bool keyed by (overlay, phase).
  An overlay allowlisted for one missing phase no longer silently
  suppresses a regression in a different phase.

- Allowlist repopulated under production semantics: 39 (overlay, phase)
  pairs missing deployment. The companion fix in 5d4f4cad eliminates
  cross-subtest pollution, so the failure set is stable across full and
  isolated runs.

Refs #970
Refs #969
Yuan's #991 added the deployment phase to h100-gke-cos-training; the
fix cascades through the base chain to h100-gke-cos-training-kubeflow.
Both overlays now meet the floor on their own — drop both entries so a
future regression is no longer silently masked.

Refs #970
Refs #969
Copy link
Copy Markdown
Contributor

@yuanchen8911 yuanchen8911 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All 6 review points are addressed with concrete code changes that go beyond the minimum.

@mchmarny mchmarny merged commit ebb0168 into main May 20, 2026
32 checks passed
@mchmarny mchmarny deleted the feat/970-overlay-validation-floor branch May 20, 2026 20:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add CI guard to enforce per-intent validation phase coverage on recipe overlays

2 participants