fix(recipes): add deployment-phase checks to h100-gke-cos-training #991
The h100-gke-cos-training overlay defined `validation.performance` and
`validation.conformance` blocks but no `validation.deployment` block.
As a result, `aicr validate --phase deployment` against a GKE-COS training
recipe reported `catalog=4 selected=0` and the phase was skipped, even
though the four deployment-catalog checks (operator-health,
expected-resources, gpu-operator-version, check-nvidia-smi) are
intent-agnostic and apply to training just as they do to inference.
The same `validation.deployment` block already exists on
`h100-eks-training.yaml` and `h100-aks-training.yaml`; this brings GKE
to parity at the same intent layer ("Defined at the intent layer, not
OS-specific, so all OS variants inherit them").
Verified: re-generated the kubeflow training recipe with --data and ran
`aicr validate --phase deployment` end-to-end against a live GKE H100
cluster — 3 passed, 0 failed, 1 skipped (check-nvidia-smi).
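To illustrate the skip described above, here is a minimal sketch of how a phase could end up with `selected=0`. It is not aicr's implementation: `select_checks`, `summarize`, and the `checks` key are hypothetical names chosen for illustration, and only the four check names and the `catalog=4 selected=0` counts come from this PR.

```python
# Hypothetical model of phase selection; names are illustrative, not aicr's API.
DEPLOYMENT_CATALOG = [
    "operator-health",
    "expected-resources",
    "gpu-operator-version",
    "check-nvidia-smi",
]


def select_checks(catalog, validation, phase):
    """Return the catalog checks that the recipe's validation block requests for a phase."""
    requested = validation.get(phase, {}).get("checks", [])
    return [name for name in catalog if name in requested]


def summarize(catalog, validation, phase):
    """Report catalog and selection counts; a phase with nothing selected is skipped."""
    selected = select_checks(catalog, validation, phase)
    status = "skipped" if not selected else "run"
    return f"catalog={len(catalog)} selected={len(selected)} -> {status}"


# Before this change: performance and conformance blocks only (assumed shape).
before = {"performance": {}, "conformance": {}}
# After this change: the deployment block lists the four catalog checks.
after = {**before, "deployment": {"checks": list(DEPLOYMENT_CATALOG)}}

print(summarize(DEPLOYMENT_CATALOG, before, "deployment"))  # catalog=4 selected=0 -> skipped
print(summarize(DEPLOYMENT_CATALOG, after, "deployment"))   # catalog=4 selected=4 -> run
```

The point of the sketch is that the catalog itself is unchanged; only the overlay's request for the phase was missing.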
Summary
Add the standard `validation.deployment` block to `recipes/overlays/h100-gke-cos-training.yaml` so `aicr validate --phase deployment` runs against GKE-COS training recipes instead of skipping the phase.
Motivation / Context
The `h100-gke-cos-training` overlay defined `validation.performance` and `validation.conformance` blocks but no `validation.deployment` block. As a result, `aicr validate --phase deployment` against a recipe derived from this overlay reports `catalog=4 selected=0` and skips the phase.
The four deployment-catalog checks (`operator-health`, `expected-resources`, `gpu-operator-version`, `check-nvidia-smi`) are intent- and OS-agnostic: they verify GPU operator and node-level GPU readiness, which matters for training as much as for inference.
The same block already exists on:
- `recipes/overlays/h100-eks-training.yaml`
- `recipes/overlays/h100-aks-training.yaml`
- `recipes/overlays/gb200-eks-training.yaml`
This change brings GKE to parity at the same H100-service-intent layer. (Comment in `h100-eks-training.yaml`: "Defined at the intent layer (not OS-specific) so all OS variants inherit them.")
Fixes: N/A
Related: N/A
Type of Change
Component(s) Affected
- `cmd/aicr`, `pkg/cli`
- `pkg/recipe`
- `pkg/validator`
Implementation Notes
Mirrors the `validation.deployment` block of `h100-eks-training.yaml` exactly: same four checks and the same `Deployment.gpu-operator.version >= v24.6.0` constraint. No new validator catalog entries, no new constraint kinds.
Testing
`make qualify` passed.
End-to-end validation on a live GKE H100 cluster (`gke_eidosx_us-central1_aicr-25910389703`, k8s v1.35.3-gke.1234000, gpu-operator v26.3.1): 3 passed, 0 failed, 1 skipped (`check-nvidia-smi`).
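For reviewers who want to sanity-check the `Deployment.gpu-operator.version >= v24.6.0` constraint against the test cluster's gpu-operator v26.3.1, the comparison can be sketched as below. This assumes plain numeric, component-wise version comparison; how aicr actually evaluates constraints is not shown in this PR, and `parse_version` and `satisfies` are hypothetical helpers. The `v23.9.2` input is a made-up counterexample, not a version from the test run.

```python
def parse_version(text):
    """Turn 'v24.6.0' into (24, 6, 0); a leading 'v' is optional."""
    return tuple(int(part) for part in text.lstrip("v").split("."))


def satisfies(actual, constraint):
    """Evaluate an '<op> <version>' constraint such as '>= v24.6.0'."""
    op, required = constraint.split()
    a, r = parse_version(actual), parse_version(required)
    # Tuples compare component by component, so 26.3.1 > 24.6.0.
    return {">=": a >= r, ">": a > r, "==": a == r, "<=": a <= r, "<": a < r}[op]


print(satisfies("v26.3.1", ">= v24.6.0"))  # True: version reported for the test cluster
print(satisfies("v23.9.2", ">= v24.6.0"))  # False: hypothetical older operator
```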
Risk Assessment
Rollout notes: No migration needed. Recipes regenerated from the updated overlay simply gain a populated `validation.deployment` block.
Checklist
- `make qualify` covers `make test -race` + lint + e2e + scan
- `cli-validate-phases` and `cli-recipe-overlays`
- `h100-eks-training.yaml`, `h100-aks-training.yaml`
- `git commit -S`