fix(recipes): add deployment-phase checks to h100-gke-cos-training by yuanchen8911 · Pull Request #991 · NVIDIA/aicr

yuanchen8911 · 2026-05-20T18:18:04Z

Summary

Add the standard validation.deployment block to recipes/overlays/h100-gke-cos-training.yaml so aicr validate --phase deployment runs its checks against GKE COS training recipes instead of skipping the phase.

Motivation / Context

The h100-gke-cos-training overlay defined validation.performance and validation.conformance blocks but no validation.deployment block. As a result, aicr validate --phase deployment against a recipe derived from this overlay reports:

phase requested but no checks defined in recipe; phase will be empty: phase=deployment
running validation phase: phase=deployment catalog=4 selected=0
phase completed: phase=deployment status=skipped validators=0 passed=0 failed=0
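
A script that gates on this phase could detect the empty selection from the log itself. The sketch below assumes the `catalog=N selected=M` line format shown above stays stable; `phase_selected_count` is a hypothetical helper, not part of aicr.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of aicr): extract the selected= count for a
# given phase from validator log text read on stdin.
phase_selected_count() {
  local phase="$1"
  # Match the "running validation phase" line for this phase, then pull out
  # the number after "selected=". Prints nothing if no such line exists.
  grep "running validation phase: phase=${phase} " \
    | sed -n 's/.*selected=\([0-9][0-9]*\).*/\1/p' \
    | head -n 1
}

# Sample input copied from the log excerpt above.
log='phase requested but no checks defined in recipe; phase will be empty: phase=deployment
running validation phase: phase=deployment catalog=4 selected=0
phase completed: phase=deployment status=skipped validators=0 passed=0 failed=0'

selected="$(printf '%s\n' "$log" | phase_selected_count deployment)"
if [ "$selected" = "0" ]; then
  echo "deployment phase selected no checks; recipe lacks validation.deployment"
fi
```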

The four deployment-catalog checks (operator-health, expected-resources, gpu-operator-version, check-nvidia-smi) are intent- and OS-agnostic — they verify GPU operator + node-level GPU readiness, which matters for training as much as for inference.

The same block already exists on:

recipes/overlays/h100-eks-training.yaml
recipes/overlays/h100-aks-training.yaml
recipes/overlays/gb200-eks-training.yaml

This change brings GKE to parity at the same H100-service-intent layer. (Comment in h100-eks-training.yaml: "Defined at the intent layer (not OS-specific) so all OS variants inherit them.")

Fixes: N/A
Related: N/A

Type of Change

Component(s) Affected

Implementation Notes

Mirrors h100-eks-training.yaml's validation.deployment block exactly — same four checks and the same Deployment.gpu-operator.version >= v24.6.0 constraint. No new validator catalog entries, no new constraint kinds.
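
To illustrate what a `>= v24.6.0` floor means for the gpu-operator v26.3.1 reported under Testing, here is a minimal sketch of a version comparison. It assumes a `sort` that supports `-V` (version sort, as in GNU coreutils) and is not how aicr itself evaluates the constraint; `version_at_least` is a hypothetical helper.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed if version $1 is >= version $2.
# Relies on sort -V (version sort); a leading "v" is stripped first.
version_at_least() {
  local have="${1#v}" want="${2#v}"
  # The smaller of the two sorts first; the floor is met when that is $want.
  [ "$(printf '%s\n%s\n' "$have" "$want" | sort -V | head -n 1)" = "$want" ]
}

if version_at_least v26.3.1 v24.6.0; then
  echo "gpu-operator v26.3.1 satisfies >= v24.6.0"
fi
```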

Testing

make qualify   # passed

End-to-end validation on a live GKE H100 cluster (gke_eidosx_us-central1_aicr-25910389703, k8s v1.35.3-gke.1234000, gpu-operator v26.3.1):

$ aicr recipe --data ./recipes --service gke --accelerator h100 \
    --intent training --os cos --platform kubeflow -o recipe.yaml
$ AICR_VALIDATOR_IMAGE_TAG=latest aicr validate \
    --recipe recipe.yaml --phase deployment

running validation phase: phase=deployment catalog=4 selected=4
validator completed: name=operator-health status=passed
validator completed: name=expected-resources status=passed
validator completed: name=gpu-operator-version status=passed
validator completed: name=check-nvidia-smi status=skipped
phase completed: phase=deployment status=passed validators=4 passed=3 failed=0
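
The summary line counts four validators with three passed and none failed, so the remaining one is the skipped check-nvidia-smi. A small sketch of that arithmetic follows, assuming the `key=value` summary format shown above and that any validator neither passed nor failed was skipped; `skipped_from_summary` is a hypothetical helper, not an aicr command.

```shell
#!/usr/bin/env bash
# Hypothetical helper: given a "phase completed" summary line, print how many
# validators were neither passed nor failed (assumed here to mean skipped).
skipped_from_summary() {
  local line="$1" validators passed failed
  validators="$(printf '%s\n' "$line" | sed -n 's/.*validators=\([0-9][0-9]*\).*/\1/p')"
  # The leading space avoids matching the "passed" in "status=passed".
  passed="$(printf '%s\n' "$line" | sed -n 's/.* passed=\([0-9][0-9]*\).*/\1/p')"
  failed="$(printf '%s\n' "$line" | sed -n 's/.*failed=\([0-9][0-9]*\).*/\1/p')"
  echo $(( validators - passed - failed ))
}

# Prints 1 for the summary line from the run above (4 - 3 - 0).
skipped_from_summary "phase completed: phase=deployment status=passed validators=4 passed=3 failed=0"
```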

Risk Assessment

Low — Isolated YAML-only change in a single overlay. Adds checks where none existed; cannot regress existing behavior. Easy to revert.

Rollout notes: No migration needed. Recipes regenerated from the updated overlay simply gain a populated validation.deployment block.

Checklist

Tests pass locally (make qualify covers make test -race + lint + e2e + scan)
Linter passes
I did not skip/disable tests to make CI green
I added/updated tests for new functionality — N/A, mirrors existing overlay shape exercised by chainsaw cli-validate-phases and cli-recipe-overlays
I updated docs if user-facing behavior changed — N/A
Changes follow existing patterns in the codebase (h100-eks-training.yaml, h100-aks-training.yaml)
Commits are cryptographically signed (git commit -S)

The h100-gke-cos-training overlay defined `validation.performance` and `validation.conformance` blocks but no `validation.deployment` block. As a result, `aicr validate --phase deployment` against a GKE COS training recipe reported `catalog=4 selected=0` and the phase was skipped, even though the four deployment-catalog checks (operator-health, expected-resources, gpu-operator-version, check-nvidia-smi) are intent-agnostic and apply to training just as they do to inference. The same `validation.deployment` block already exists on `h100-eks-training.yaml` and `h100-aks-training.yaml`; this brings GKE to parity at the same intent layer ("Defined at the intent layer, not OS-specific, so all OS variants inherit them"). Verified: regenerated the kubeflow training recipe with `aicr recipe --data ./recipes` and ran `aicr validate --phase deployment` end-to-end against a live GKE H100 cluster, with 3 passed, 0 failed, and 1 skipped (check-nvidia-smi).

coderabbitai · 2026-05-20T18:19:14Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 0448cda9-50b8-4e86-b7f1-cb4a748ea751

📥 Commits

Reviewing files that changed from the base of the PR and between f7bebf0 and b50a63d.

📒 Files selected for processing (1)

recipes/overlays/h100-gke-cos-training.yaml

📝 Walkthrough

This PR adds a spec.validation.deployment section to the H100 GKE COS training recipe overlay. The new configuration introduces four deployment-level validation checks: operator-health, expected-resources, gpu-operator-version, and check-nvidia-smi. A constraint is also enforced to require gpu-operator version >= v24.6.0. The change affects only the overlay file with no modifications to exported entities.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~8 minutes

Possibly related issues

Close deployment-phase validation coverage gaps for accelerator-bound GPU recipes #969: Addresses the gap where h100-gke-cos-training was missing deployment validation checks.
Add CI guard to enforce per-intent validation phase coverage on recipe overlays #970: Aligns with CI guard requirements to ensure overlays include a resolved deployment validation phase.

Possibly related PRs

NVIDIA/aicr#895: Adds similar validation.deployment overlay sections with the same set of deployment checks and gpu-operator version constraints to corresponding recipe files.

Suggested labels

size/S, bug

Suggested reviewers

mchmarny
njhensley

🚥 Pre-merge checks | ✅ 4

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title accurately and specifically describes the main change: adding deployment-phase validation checks to the h100-gke-cos-training recipe overlay.
Description check	✅ Passed	The description is comprehensive and directly related to the changeset, explaining the motivation, implementation details, testing, and risk assessment.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

Yuan's #991 added the deployment phase to h100-gke-cos-training; the fix cascades through the base chain to h100-gke-cos-training-kubeflow. Both overlays now meet the floor on their own — drop both entries so a future regression is no longer silently masked. Refs #970 Refs #969

yuanchen8911 requested a review from a team as a code owner May 20, 2026 18:18

yuanchen8911 added the enhancement and area/recipes labels May 20, 2026

github-actions Bot added the size/XS label May 20, 2026

mchmarny assigned yuanchen8911 May 20, 2026

mchmarny approved these changes May 20, 2026

yuanchen8911 merged commit c2b15c4 into NVIDIA:main May 20, 2026
298 of 325 checks passed

coderabbitai Bot mentioned this pull request May 21, 2026

feat(recipe): deliver deployment-phase floor at per-accelerator wildcards #1001

Open

24 tasks

yuanchen8911 merged 1 commit into NVIDIA:main from yuanchen8911:fix/recipes-gke-cos-training-deployment-checks
