Add CI guard to enforce per-intent validation phase coverage on recipe overlays #970

@yuanchen8911

Summary

Add a CI test that resolves every overlay's base: chain and rejects any new or modified overlay whose resolved validation falls below the per-intent floor. Closes the loophole that let 27 of 41 GPU overlays drift to conformance-only without anyone noticing (see companion issue on closing the gaps).

Background

The validation block merges per-phase across the base: chain (pkg/recipe/metadata.go:459-477). No test today asserts that a leaf overlay's resolved validation meets a per-intent contract, so a new overlay can inherit only the service-root conformance: block and ship with no GPU operator-health check and no NCCL or inference-perf gate.

Scope

For each overlay in recipes/overlays/ (excluding intermediates like eks-training.yaml that are not user-selectable):

  1. Walk the base: chain and resolve the merged validation block per pkg/recipe/metadata.go Merge semantics.
  2. Classify the overlay by criteria.intent and criteria.service plus filename heuristics (Dynamo / NIM / plain inference, Kind vs non-Kind).
  3. Assert that the resolved validation contains the required phases for that classification.

Per-intent floor

| Intent | Required (blocking) | Recommended (warn-only) |
| --- | --- | --- |
| Training (non-Kind) | deployment + conformance | performance (NCCL) |
| Inference Dynamo / NIM (non-Kind) | deployment + conformance | performance (inference-perf) |
| Inference (plain) | deployment + conformance | |
| Kind (any intent) | deployment + conformance | |

Why performance is recommended-only initially: some service overlays (notably Azure AKS and Oracle OKE) lack a performance testbed today (see companion issue). Tighten to required once testbeds exist.

Implementation notes

  • Wire the check into make qualify so local runs catch it before push.
  • On failure, print the overlay name, its classification, its resolved phases, and the missing phase, so the message is self-explanatory.
  • Consider adding a validation: skip-deployment (or similar) escape hatch only if a legitimate use case appears; default behavior should be strict.
  • Future: when Azure / OCI performance testbeds exist, tighten performance from recommended to required for those services.
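One possible shape for the failure message described in the notes above is sketched below. The `Violation` type, its fields, and the exact wording are hypothetical; the only requirement taken from this issue is that the message names the overlay, its classification, the resolved phases, and the missing phase.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// Violation is a hypothetical error type carrying everything the message needs.
type Violation struct {
	Overlay  string   // overlay file name
	Class    string   // classification the overlay was assigned
	Resolved []string // phases present after resolving the base chain
	Missing  []string // required phases that are absent
}

// Error renders a single self-explanatory line. Inputs are copied before
// sorting so the caller's slices are not reordered.
func (v Violation) Error() string {
	resolved := append([]string(nil), v.Resolved...)
	missing := append([]string(nil), v.Missing...)
	sort.Strings(resolved)
	sort.Strings(missing)
	return fmt.Sprintf(
		"overlay %s: classified as %q; resolved phases [%s]; missing required phases [%s]",
		v.Overlay, v.Class, strings.Join(resolved, ", "), strings.Join(missing, ", "),
	)
}

func main() {
	var err error = Violation{
		Overlay:  "example-training.yaml",
		Class:    "training",
		Resolved: []string{"conformance"},
		Missing:  []string{"deployment"},
	}
	fmt.Println(err)
}
```

Sorting the phase lists keeps the message stable across runs, which makes CI logs easier to compare.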

Done when

  • New overlay added without required phases → test fails with a clear message.
  • Existing overlay modified to drop a required phase → test fails.
  • make qualify runs the check by default; CI workflow runs it on every PR that touches recipes/overlays/.

Related

  • Companion issue: close the existing 27-overlay phase coverage gap.
