Summary
Add a CI test that resolves every overlay's base: chain and rejects any new or modified overlay whose resolved validation falls below the per-intent floor. Closes the loophole that let 27 of 41 GPU overlays drift to conformance-only without anyone noticing (see companion issue on closing the gaps).
Background
The validation block merges per-phase across the base: chain (pkg/recipe/metadata.go:459-477). No test today asserts that a leaf overlay's resolved validation meets a per-intent contract, so a new overlay can be added that inherits only the service-root conformance: block and ships with no GPU operator-health check and no NCCL or inference-perf gate.
Scope
For each overlay in recipes/overlays/ (excluding intermediates like eks-training.yaml that are not user-selectable):
- Walk the base: chain and resolve the merged validation block per pkg/recipe/metadata.go Merge semantics.
- Classify the overlay by criteria.intent and criteria.service plus filename heuristics (Dynamo / NIM / plain inference, Kind vs non-Kind).
- Assert that the resolved validation contains the required phases for that classification.
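The chain walk above can be sketched roughly as follows. This is a minimal illustration, not the code in pkg/recipe/metadata.go: the Overlay type, the resolveValidation and phaseNames helpers, and the overlay names are all hypothetical, and the merge rule (nearest definition of a phase wins, omitted phases are inherited) is an assumption about what "merges per-phase" means.

```go
package main

import (
	"fmt"
	"sort"
)

// Overlay is a hypothetical, simplified stand-in for a parsed overlay file.
// Validation maps a phase name (e.g. "conformance") to its check names.
type Overlay struct {
	Name       string
	Base       string // name of the parent overlay; empty at the root
	Validation map[string][]string
}

// resolveValidation walks the base chain from leaf to root and merges the
// validation blocks per phase. Assumption: the definition nearest the leaf
// wins, and phases a child omits are inherited from its ancestors.
func resolveValidation(overlays map[string]Overlay, leaf string) (map[string][]string, error) {
	resolved := map[string][]string{}
	seen := map[string]bool{}
	for name := leaf; name != ""; {
		if seen[name] {
			return nil, fmt.Errorf("cycle in base chain at %q", name)
		}
		seen[name] = true
		o, ok := overlays[name]
		if !ok {
			return nil, fmt.Errorf("overlay %q not found in base chain of %q", name, leaf)
		}
		for phase, checks := range o.Validation {
			if _, set := resolved[phase]; !set {
				resolved[phase] = checks
			}
		}
		name = o.Base
	}
	return resolved, nil
}

// phaseNames returns the resolved phase names in sorted order.
func phaseNames(v map[string][]string) []string {
	names := make([]string, 0, len(v))
	for p := range v {
		names = append(names, p)
	}
	sort.Strings(names)
	return names
}

func main() {
	// Hypothetical three-level chain: only the root defines a phase.
	overlays := map[string]Overlay{
		"service-root": {Name: "service-root", Validation: map[string][]string{"conformance": {"cluster-conformance"}}},
		"intermediate": {Name: "intermediate", Base: "service-root"},
		"gpu-leaf":     {Name: "gpu-leaf", Base: "intermediate"},
	}
	v, err := resolveValidation(overlays, "gpu-leaf")
	if err != nil {
		panic(err)
	}
	fmt.Println("resolved phases:", phaseNames(v))
}
```

In this example the leaf resolves to conformance only, which is the drift case the test is meant to catch. The real test should call the existing merge code rather than reimplement it, so the check cannot disagree with runtime behavior.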
Per-intent floor
| Intent | Required (blocking) | Recommended (warn-only) |
| --- | --- | --- |
| Training (non-Kind) | deployment + conformance | performance (NCCL) |
| Inference Dynamo / NIM (non-Kind) | deployment + conformance | performance (inference-perf) |
| Inference (plain) | deployment + conformance | none |
| Kind (any intent) | deployment + conformance | none |
Why performance is recommended-only initially: some service overlays (notably Azure AKS and Oracle OKE) lack a performance testbed today (see companion issue). Tighten to required once testbeds exist.
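One way to encode the floor table and the classification step is sketched below. Everything here is illustrative: the Class and Floor types, classify, checkFloor, and hasToken are hypothetical names, and the intent value "training", the service value "kind", and the filename tokens "kind", "dynamo", and "nim" are assumptions about how overlays are labeled.

```go
package main

import (
	"fmt"
	"strings"
)

// Class is a hypothetical classification bucket, one per row of the floor table.
type Class string

const (
	ClassTraining        Class = "training (non-Kind)"
	ClassInferenceServed Class = "inference Dynamo / NIM (non-Kind)"
	ClassInferencePlain  Class = "inference (plain)"
	ClassKind            Class = "Kind (any intent)"
)

// Floor lists the blocking and warn-only phases for one classification.
type Floor struct {
	Required    []string
	Recommended []string
}

var floors = map[Class]Floor{
	ClassTraining:        {Required: []string{"deployment", "conformance"}, Recommended: []string{"performance"}},
	ClassInferenceServed: {Required: []string{"deployment", "conformance"}, Recommended: []string{"performance"}},
	ClassInferencePlain:  {Required: []string{"deployment", "conformance"}},
	ClassKind:            {Required: []string{"deployment", "conformance"}},
}

// hasToken reports whether filename contains token as a whole segment when
// split on '-', '_' and '.'. Whole-segment matching avoids false hits such
// as "nim" inside "minimal".
func hasToken(filename, token string) bool {
	segments := strings.FieldsFunc(strings.ToLower(filename), func(r rune) bool {
		return r == '-' || r == '_' || r == '.'
	})
	for _, s := range segments {
		if s == token {
			return true
		}
	}
	return false
}

// classify maps an overlay to a floor row. Kind is checked first because it
// applies to any intent; unknown intents fall through to plain inference.
func classify(intent, service, filename string) Class {
	switch {
	case service == "kind" || hasToken(filename, "kind"):
		return ClassKind
	case intent == "training":
		return ClassTraining
	case hasToken(filename, "dynamo") || hasToken(filename, "nim"):
		return ClassInferenceServed
	default:
		return ClassInferencePlain
	}
}

// checkFloor returns the required phases that are missing (blocking) and the
// recommended phases that are missing (warn-only).
func checkFloor(c Class, resolved map[string]bool) (missing, warnings []string) {
	f := floors[c]
	for _, p := range f.Required {
		if !resolved[p] {
			missing = append(missing, p)
		}
	}
	for _, p := range f.Recommended {
		if !resolved[p] {
			warnings = append(warnings, p)
		}
	}
	return missing, warnings
}

func main() {
	// Hypothetical overlay that resolved to conformance only.
	class := classify("training", "eks", "example-eks-training.yaml")
	missing, warnings := checkFloor(class, map[string]bool{"conformance": true})
	fmt.Println("class:", class)
	fmt.Println("missing required:", missing)
	fmt.Println("missing recommended:", warnings)
}
```

Keeping the floor in a single table means that tightening performance from recommended to required later is a one-line move between the two slices.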
Implementation notes
- Wire the check into make qualify so local runs catch it before push.
- On failure, print the overlay name, its classification, resolved phases, and which phase is missing — so the message is self-explanatory.
- Consider adding a validation: skip-deployment (or similar) escape hatch only if a legitimate use case appears; default behavior should be strict.
- Future: when Azure / OCI performance testbeds exist, tighten performance from recommended to required for those services.
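A possible shape for the failure message described in the notes above, as a rough sketch. The floorViolation type, its fields, and the exact wording are hypothetical; the only requirement taken from this issue is that the message names the overlay, its classification, the resolved phases, and the missing phase.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// floorViolation is a hypothetical error type carrying everything the
// message needs to be understood without opening the overlay file.
type floorViolation struct {
	Overlay  string   // overlay file name
	Class    string   // classification the overlay was assigned
	Resolved []string // phases present after resolving the base chain
	Missing  []string // required phases that are absent
}

// joinSorted renders a phase list deterministically, or "none" if empty.
func joinSorted(phases []string) string {
	if len(phases) == 0 {
		return "none"
	}
	sorted := append([]string(nil), phases...) // copy so the caller's slice is untouched
	sort.Strings(sorted)
	return strings.Join(sorted, ", ")
}

// Error implements the error interface with a single self-explanatory line.
func (v floorViolation) Error() string {
	return fmt.Sprintf("overlay %s [%s]: resolved phases: %s; missing required: %s",
		v.Overlay, v.Class, joinSorted(v.Resolved), joinSorted(v.Missing))
}

func main() {
	var err error = floorViolation{
		Overlay:  "example-leaf.yaml",
		Class:    "training (non-Kind)",
		Resolved: []string{"conformance"},
		Missing:  []string{"deployment"},
	}
	fmt.Println(err)
}
```

Sorting the phase lists keeps the output stable across runs, which makes failures easier to compare in CI logs.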
Done when
- New overlay added without required phases → test fails with a clear message.
- Existing overlay modified to drop a required phase → test fails.
- make qualify runs the check by default; the CI workflow runs it on every PR that touches recipes/overlays/.
Related
- Companion issue: close the existing 27-overlay phase coverage gap.