fix(recipes): add deployment-phase checks to h100-gke-cos-training #991

Merged
yuanchen8911 merged 1 commit into NVIDIA:main from yuanchen8911:fix/recipes-gke-cos-training-deployment-checks
May 20, 2026

Conversation

@yuanchen8911
Contributor

Summary

Add the standard validation.deployment block to recipes/overlays/h100-gke-cos-training.yaml so aicr validate --phase deployment runs against GKE COS training recipes instead of skipping the phase.

Motivation / Context

The h100-gke-cos-training overlay defined validation.performance and validation.conformance blocks but no validation.deployment block. As a result, aicr validate --phase deployment against a recipe derived from this overlay reports:

phase requested but no checks defined in recipe; phase will be empty: phase=deployment
running validation phase: phase=deployment catalog=4 selected=0
phase completed: phase=deployment status=skipped validators=0 passed=0 failed=0

The four deployment-catalog checks (operator-health, expected-resources, gpu-operator-version, check-nvidia-smi) are intent- and OS-agnostic — they verify GPU operator + node-level GPU readiness, which matters for training as much as for inference.
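
The log lines above (catalog=4 selected=0, status=skipped) suggest that a phase only runs the catalog checks a recipe names. A minimal sketch of that selection logic, assuming a recipe is a nested mapping with a `checks` list per phase; the function names, the `checks` key, and the `pending` status are illustrative assumptions, not aicr's actual code:

```python
# Hypothetical sketch: names and data shapes are assumptions, not aicr's real code.
DEPLOYMENT_CATALOG = [
    "operator-health",
    "expected-resources",
    "gpu-operator-version",
    "check-nvidia-smi",
]


def select_checks(catalog, recipe, phase):
    """Return the catalog checks that the recipe requests for a phase."""
    requested = recipe.get("validation", {}).get(phase, {}).get("checks", [])
    return [name for name in catalog if name in requested]


def phase_summary(catalog, recipe, phase):
    """Summarize selection in the same terms the log lines use."""
    selected = select_checks(catalog, recipe, phase)
    # With nothing selected, the phase has no work to do and is skipped.
    status = "skipped" if not selected else "pending"
    return {"catalog": len(catalog), "selected": len(selected), "status": status}


# Overlay before the change: performance and conformance blocks only.
before = {"validation": {"performance": {}, "conformance": {}}}
# Overlay after the change: the deployment block lists the four checks.
after = {"validation": {"deployment": {"checks": list(DEPLOYMENT_CATALOG)}}}

summary_before = phase_summary(DEPLOYMENT_CATALOG, before, "deployment")
summary_after = phase_summary(DEPLOYMENT_CATALOG, after, "deployment")
```

Under these assumptions, `summary_before` has `selected` equal to 0 and a `skipped` status, while `summary_after` selects all four catalog entries.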

The same block already exists on:

  • recipes/overlays/h100-eks-training.yaml
  • recipes/overlays/h100-aks-training.yaml
  • recipes/overlays/gb200-eks-training.yaml

This change brings GKE to parity at the same H100-service-intent layer. (Comment in h100-eks-training.yaml: "Defined at the intent layer (not OS-specific) so all OS variants inherit them.")
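
To illustrate why defining the block at one layer lets derived overlays inherit it, here is a rough sketch of overlay resolution by recursive merge. The `base` and `spec` keys, the `platform` field, and the `resolve` and `deep_merge` helpers are hypothetical; the real recipe engine may resolve overlays differently:

```python
import copy


def deep_merge(base, override):
    """Return a new dict where override wins and nested dicts merge recursively."""
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


def resolve(name, overlays):
    """Follow the hypothetical 'base' chain, then merge from root to leaf."""
    chain = []
    seen = set()
    while name is not None:
        if name in seen:
            raise ValueError(f"cycle in base chain at {name!r}")
        seen.add(name)
        chain.append(overlays[name])
        name = overlays[name].get("base")
    resolved = {}
    for overlay in reversed(chain):
        resolved = deep_merge(resolved, overlay.get("spec", {}))
    return resolved


# Illustrative data only: a parent overlay that defines deployment checks
# and a child overlay that adds a platform without repeating them.
overlays = {
    "h100-gke-cos-training": {
        "base": None,
        "spec": {"validation": {"deployment": {"checks": ["operator-health"]}}},
    },
    "h100-gke-cos-training-kubeflow": {
        "base": "h100-gke-cos-training",
        "spec": {"platform": "kubeflow"},
    },
}
resolved = resolve("h100-gke-cos-training-kubeflow", overlays)
```

In this sketch the child overlay never mentions `validation`, yet the resolved result carries the parent's deployment checks alongside the child's own field.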

Fixes: N/A
Related: N/A

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature
  • Breaking change
  • Documentation update
  • Refactoring
  • Build/CI/tooling

Component(s) Affected

  • CLI (cmd/aicr, pkg/cli)
  • API server
  • Recipe engine / data (pkg/recipe)
  • Bundlers
  • Collectors / snapshotter
  • Validator (pkg/validator)
  • Core libraries
  • Docs/examples
  • Other

Implementation Notes

Mirrors h100-eks-training.yaml's validation.deployment block exactly — same four checks and the same Deployment.gpu-operator.version >= v24.6.0 constraint. No new validator catalog entries, no new constraint kinds.
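
For readers unfamiliar with version constraints such as `>= v24.6.0`, the comparison can be sketched as follows. This is an illustrative stand-in that handles only plain `vMAJOR.MINOR.PATCH` strings and a handful of operators; the `satisfies` and `parse_version` helpers are assumptions, and aicr's real constraint grammar may be richer:

```python
import operator
import re

# Operators this sketch understands (an assumption, not aicr's grammar).
_OPS = {
    ">=": operator.ge,
    "<=": operator.le,
    "==": operator.eq,
    ">": operator.gt,
    "<": operator.lt,
}


def parse_version(text):
    """Turn 'v24.6.0' or '24.6.0' into a comparable tuple such as (24, 6, 0)."""
    match = re.fullmatch(r"v?(\d+)\.(\d+)\.(\d+)", text.strip())
    if match is None:
        raise ValueError(f"unsupported version string: {text!r}")
    return tuple(int(part) for part in match.groups())


def satisfies(actual, constraint):
    """Evaluate a constraint such as '>= v24.6.0' against an actual version."""
    match = re.fullmatch(r"\s*(>=|<=|==|>|<)\s*(\S+)\s*", constraint)
    if match is None:
        raise ValueError(f"unsupported constraint: {constraint!r}")
    op, wanted = match.groups()
    # Compare numeric tuples, not strings, so 'v9.0.0' sorts below 'v24.6.0'.
    return _OPS[op](parse_version(actual), parse_version(wanted))
```

For example, `satisfies("v26.3.1", ">= v24.6.0")` evaluates to true, which is consistent with a cluster running gpu-operator v26.3.1 meeting the constraint.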

Testing

make qualify   # passed

End-to-end validation on a live GKE H100 cluster (gke_eidosx_us-central1_aicr-25910389703, k8s v1.35.3-gke.1234000, gpu-operator v26.3.1):

$ aicr recipe --data ./recipes --service gke --accelerator h100 \
    --intent training --os cos --platform kubeflow -o recipe.yaml
$ AICR_VALIDATOR_IMAGE_TAG=latest aicr validate \
    --recipe recipe.yaml --phase deployment

running validation phase: phase=deployment catalog=4 selected=4
validator completed: name=operator-health status=passed
validator completed: name=expected-resources status=passed
validator completed: name=gpu-operator-version status=passed
validator completed: name=check-nvidia-smi status=skipped
phase completed: phase=deployment status=passed validators=4 passed=3 failed=0
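
The summary line reports validators=4 passed=3 failed=0, which leaves one validator unaccounted for unless the skipped check is counted separately. A small sketch that tallies the per-validator lines from the output above; the regular expression assumes the exact line format shown here and is not an official log schema:

```python
import re
from collections import Counter

# Log lines copied from the validation output above.
LOG = """\
validator completed: name=operator-health status=passed
validator completed: name=expected-resources status=passed
validator completed: name=gpu-operator-version status=passed
validator completed: name=check-nvidia-smi status=skipped
phase completed: phase=deployment status=passed validators=4 passed=3 failed=0
"""

# Matches only per-validator lines; the phase summary line is ignored.
VALIDATOR_LINE = re.compile(r"^validator completed: name=(\S+) status=(\S+)$")


def tally(log_text):
    """Count validator results by status."""
    counts = Counter()
    for line in log_text.splitlines():
        match = VALIDATOR_LINE.match(line.strip())
        if match:
            counts[match.group(2)] += 1
    return counts


counts = tally(LOG)
```

The tally gives three passed, one skipped, and no failed validators, which adds up to the four validators in the summary line. The output does not state why check-nvidia-smi was skipped.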

Risk Assessment

  • Low — Isolated YAML-only change in a single overlay. Adds checks where none existed; cannot regress existing behavior. Easy to revert.

Rollout notes: No migration needed. Recipes regenerated from the updated overlay simply gain a populated validation.deployment block.

Checklist

  • Tests pass locally (make qualify covers make test -race + lint + e2e + scan)
  • Linter passes
  • I did not skip/disable tests to make CI green
  • I added/updated tests for new functionality — N/A, mirrors existing overlay shape exercised by chainsaw cli-validate-phases and cli-recipe-overlays
  • I updated docs if user-facing behavior changed — N/A
  • Changes follow existing patterns in the codebase (h100-eks-training.yaml, h100-aks-training.yaml)
  • Commits are cryptographically signed (git commit -S)

The h100-gke-cos-training overlay defined `validation.performance` and
`validation.conformance` blocks but no `validation.deployment` block.
As a result, `aicr validate --phase deployment` against a GKE COS training
recipe reported `catalog=4 selected=0` and the phase was skipped, even
though the four deployment-catalog checks (operator-health,
expected-resources, gpu-operator-version, check-nvidia-smi) are
intent-agnostic and apply to training just as they do to inference.

The same `validation.deployment` block already exists on
`h100-eks-training.yaml` and `h100-aks-training.yaml`; this brings GKE
to parity at the same intent layer ("Defined at the intent layer, not
OS-specific, so all OS variants inherit them").

Verified: re-generated the kubeflow training recipe with --data and ran
`aicr validate --phase deployment` end-to-end against a live GKE H100
cluster — 3 passed, 0 failed, 1 skipped (check-nvidia-smi).
@yuanchen8911 requested a review from a team as a code owner on May 20, 2026, 18:18
@yuanchen8911 added the enhancement and area/recipes labels on May 20, 2026
@coderabbitai

coderabbitai Bot commented May 20, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 0448cda9-50b8-4e86-b7f1-cb4a748ea751

📥 Commits

Reviewing files that changed from the base of the PR and between f7bebf0 and b50a63d.

📒 Files selected for processing (1)
  • recipes/overlays/h100-gke-cos-training.yaml

Walkthrough

This PR adds a spec.validation.deployment section to the H100 GKE COS training recipe overlay. The new configuration introduces four deployment-level validation checks: operator-health, expected-resources, gpu-operator-version, and check-nvidia-smi. A constraint is also enforced to require gpu-operator version >= v24.6.0. The change affects only the overlay file with no modifications to exported entities.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~8 minutes

Possibly related PRs

  • NVIDIA/aicr#895: Adds similar validation.deployment overlay sections with the same set of deployment checks and gpu-operator version constraints to corresponding recipe files.

Suggested labels

size/S, bug

Suggested reviewers

  • mchmarny
  • njhensley

🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)

Check name | Status | Explanation
Title check | ✅ Passed | The title accurately and specifically describes the main change: adding deployment-phase validation checks to the h100-gke-cos-training recipe overlay.
Description check | ✅ Passed | The description is comprehensive and directly related to the changeset, explaining the motivation, implementation details, testing, and risk assessment.
Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request.


@yuanchen8911 merged commit c2b15c4 into NVIDIA:main on May 20, 2026
298 of 325 checks passed
mchmarny added a commit that referenced this pull request May 20, 2026
Yuan's #991 added the deployment phase to h100-gke-cos-training; the
fix cascades through the base chain to h100-gke-cos-training-kubeflow.
Both overlays now meet the floor on their own — drop both entries so a
future regression is no longer silently masked.

Refs #970
Refs #969