feat(recipes): add nodewright h100 tuning to H200 EKS recipes #1102
H200 EKS had no <accelerator>-<service>-<intent> overlays (only the h200-any wildcard) and applied no nodewright node tuning, unlike h100/gb200/b200.
Add h200-eks-inference and h200-eks-training, mirroring the h100 EKS overlays, and reuse the h100 nodewright tuning (tuning.yaml with accelerator=h100). H200 is the same Hopper HGX/DGX-class platform as H100 (NVLink + InfiniBand), and nvidia-setup/nvidia-tuned ship no h200 target — only eks-h100/eks-gb200. The recipe criteria stays accelerator=h200; only the tuning profile selector is h100.
Verified: the h200/eks/{inference,training} recipes now include nodewright-customizations, and the rendered Skyhook CR uses the supported eks-h100 nvidia-setup + h100 nvidia-tuned combo. yamllint + pkg/recipe tests pass.
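The split between the recipe criteria (`h200`) and the tuning profile selector (`h100`) can be sketched as below. This is a minimal illustration, not the recipe engine's code: `SUPPORTED_SETUP_TARGETS`, `TUNING_ALIAS`, and `resolve_tuning_target` are hypothetical names, and only the `eks-h100`/`eks-gb200` targets and the h200-to-h100 reuse are taken from the description.

```python
# Hypothetical sketch: recipe criteria accelerator vs. tuning profile selector.
# None of these names exist in the repository; they only illustrate the idea.

# nvidia-setup targets named in the PR description.
SUPPORTED_SETUP_TARGETS = {"eks-h100", "eks-gb200"}

# Assumed alias table: accelerators that reuse another accelerator's tuning.
TUNING_ALIAS = {"h200": "h100"}


def resolve_tuning_target(service: str, accelerator: str) -> str:
    """Return the nvidia-setup target for a recipe, or raise if unsupported."""
    # The recipe criteria keep the real accelerator; only the selector is aliased.
    tuning_accelerator = TUNING_ALIAS.get(accelerator, accelerator)
    target = f"{service}-{tuning_accelerator}"
    if target not in SUPPORTED_SETUP_TARGETS:
        raise ValueError(f"no nvidia-setup target for {service}/{accelerator}")
    return target


if __name__ == "__main__":
    print(resolve_tuning_target("eks", "h200"))  # prints "eks-h100"
```

If an `h200` target is added upstream later, removing the alias entry in a sketch like this corresponds to "flipping the override" in the overlay.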
No actionable comments were generated in the recent review.
Walkthrough
This PR adds two new recipe overlays for H200 accelerator support on EKS.
Estimated code review effort: 2 (Simple), ~12 minutes
Pre-merge checks: 4 passed
mchmarny left a comment
Clean mirror of the h100-eks-* pattern. The accelerator: h100 selector inside the nodewright-customizations override is well-documented inline (and again in the PR description) — exactly where a future maintainer will look when nodewright eventually gains a real h200 target. CI is green across the board (Tier 1 EKS renders on all four deployer formats, KWOK summary, grype, ClamAV, gate). One non-blocking nit on the NCCL floor — H200 has 1.4× H100's HBM bandwidth, so the copied >= 300 is conservative; safe today, worth tightening once empirical numbers land.
- name: Deployment.gpu-operator.version
  value: ">= v24.6.0"
performance:
  checks:
nit: >= 300 is copied directly from h100-eks-training.yaml, but H200 has HBM3e (~4.8 TB/s) vs H100's HBM3 (~3.35 TB/s), so achievable NCCL all-reduce BW on EFA is meaningfully higher. The h100-sourced floor is safe (it's a floor, not a target) and matches the "reuse h100 baseline" framing in the PR description, but worth a follow-up to tighten once empirical H200/EFA numbers are in — otherwise this gate stops catching regressions well before the platform's real performance envelope.
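The bandwidth figures quoted in this comment can be checked with a few lines of arithmetic. The sketch below uses only the approximate numbers from the review; the linearly scaled floor is an illustrative assumption, not a proposed value, since the review asks for empirical H200/EFA measurements before tightening anything.

```python
# Approximate HBM bandwidth figures quoted in the review comment.
H100_HBM_TBPS = 3.35  # HBM3
H200_HBM_TBPS = 4.8   # HBM3e

# The ">= 300" floor copied from h100-eks-training.yaml (units as in the recipe).
H100_FLOOR = 300

ratio = H200_HBM_TBPS / H100_HBM_TBPS

# ASSUMPTION for illustration only: the floor scales linearly with HBM
# bandwidth. Real NCCL all-reduce throughput also depends on the network
# fabric, so this is not a substitute for measured numbers.
scaled_floor = H100_FLOOR * ratio

print(f"HBM ratio: {ratio:.2f}x")                  # HBM ratio: 1.43x
print(f"linearly scaled floor: {scaled_floor:.0f}")  # linearly scaled floor: 430
```

The ratio matches the "1.4×" figure in the review summary; the gap between 300 and the scaled value shows how much headroom the inherited floor could leave under that assumption.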
The rtx-pro-6000 EKS inference overlays deployed nodewright-operator but no nodewright-customizations, so no tuning profile was applied. h100/gb200/b200 all wire it at the <accelerator>-<service>-<intent> level; rtx-pro-6000 was the gap.
The shared tuning.yaml can't be reused here: it renders nvidia-setup, which only ships eks-h100/eks-gb200 configs and fails on any other (service, accelerator) combo, so accelerator=generic would fail nodewright-customizations and block gpu-operator. Instead, add a dedicated tuning-generic.yaml that runs ONLY nvidia-tuned with the generic profile (no nvidia-setup), and wire it into rtx-pro-6000-eks-inference (ubuntu/dynamo/nim leaves inherit it). nvidia-tuned's generic profile is self-contained baseline GPU tuning; nvidia-setup's kernel/EFA steps are unsupported for generic and unneeded on these PCIe RTX PRO 6000 nodes (no NVLink fabric; platform provisions the kernel; EFA unused).
Also align the additionalTolerations fallback across all nodewright manifests (tuning.yaml, tuning-gke.yaml, no-op.yaml, tuning-generic.yaml) to the tolerate-all default 'operator: Exists' (no key), matching ParseTolerations()/DefaultTolerations() and assert-tuning-defaults.yaml. This fallback only fires when no acceleratedTolerations are injected; the default/flag path is unchanged.
Document the inherited NCCL all-reduce floor in h200-eks-training.yaml (review nit from NVIDIA#1102): >= 300 is h100-sourced and loose for H200's HBM3e bandwidth; it is flagged to tighten once empirical H200/EFA numbers exist.
Verified: rendered Skyhook CR contains only nvidia-tuned (0.3.0) with accelerator=generic and no nvidia-setup; injected toleration path unchanged; yamllint + pkg/recipe tests pass.
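The toleration fallback described in this commit message can be sketched as follows. This is a hedged illustration in Python, not the repository's Go helpers: `TOLERATE_ALL` and `effective_tolerations` are hypothetical names, and the only behaviour taken from the text is that the tolerate-all default (`operator: Exists`, no key) applies solely when no tolerations are injected.

```python
from typing import Dict, List, Optional

Toleration = Dict[str, str]

# Tolerate-all default: "operator: Exists" with no key. In Kubernetes, an
# empty key combined with operator Exists matches every taint.
TOLERATE_ALL: List[Toleration] = [{"operator": "Exists"}]


def effective_tolerations(injected: Optional[List[Toleration]]) -> List[Toleration]:
    """Return injected tolerations if any were given, else the tolerate-all default."""
    if injected:
        # Injected path: pass through unchanged (copied to avoid aliasing).
        return [dict(t) for t in injected]
    # Fallback path: fires only when nothing was injected.
    return [dict(t) for t in TOLERATE_ALL]


if __name__ == "__main__":
    print(effective_tolerations(None))
    print(effective_tolerations([{"key": "nvidia.com/gpu", "operator": "Exists"}]))
```

Returning copies keeps callers from mutating the shared default, which matters if a fallback like this is reused across several manifests.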
Summary
Add h200-eks-inference and h200-eks-training overlays so H200 EKS recipes apply nodewright node tuning, reusing the existing h100 profile.
Motivation / Context
H200 had only the h200-any wildcard overlay and no <accelerator>-<service>-<intent> EKS overlays, so it applied no nodewright tuning, unlike h100/gb200/b200, which all wire nodewright-customizations at that level. H200 is the same Hopper, HGX/DGX-class platform as H100 (NVLink + InfiniBand), and nvidia-setup/nvidia-tuned ship no h200 target (only eks-h100/eks-gb200), so reusing the h100 tuning is the correct choice.
Fixes: N/A
Related: N/A
Type of Change
Component(s) Affected
pkg/recipe: recipes/overlays
Implementation Notes
- h200-eks-inference.yaml / h200-eks-training.yaml mirror the h100-eks-* overlays (same gpu-operator deps, nfd, constraints, and, for training, the full validation block).
- criteria.accelerator is h200; only the nodewright-customizations tuning override selects accelerator: h100 (with intent: inference/multiNodeTraining), so it renders the supported eks-h100 nvidia-setup + h100 nvidia-tuned combo via the shared tuning.yaml.
- eks-h100 is a fully supported nvidia-setup target, so tuning.yaml works as-is.
Testing
- h200/eks/inference and h200/eks/training recipes: both now apply the h200-eks-* overlay and include nodewright-customizations.
- nvidia-setup-kernel + nvidia-tuned + nvidia-setup-full, all service: eks / accelerator: h100 (the supported combo).
- yamllint clean; go test ./pkg/recipe/... pass. CI (KWOK + Tier-1) covers rendering/deploy.
Risk Assessment
Rollout notes: New H200 EKS recipes will apply the h100 nvidia-tuned profile + nvidia-setup (eks-h100) via nodewright. No migration. If H200-specific tuning is later desired, add an h200 target upstream in nodewright-packages and flip the override.
Checklist
- go test ./pkg/recipe/...
- yamllint
- git commit -S