feat(recipes): add nodewright h100 tuning to H200 EKS recipes by yuanchen8911 · Pull Request #1102 · NVIDIA/aicr

yuanchen8911 · 2026-05-29T16:53:14Z

Summary

Add h200-eks-inference and h200-eks-training overlays so H200 EKS recipes apply nodewright node tuning, reusing the existing h100 profile.

Motivation / Context

H200 had only the h200-any wildcard overlay and no <accelerator>-<service>-<intent> EKS overlays, so it applied no nodewright tuning — unlike h100/gb200/b200, which all wire nodewright-customizations at that level. H200 is the same Hopper, HGX/DGX-class platform as H100 (NVLink + InfiniBand), and nvidia-setup/nvidia-tuned ship no h200 target (only eks-h100/eks-gb200), so reusing the h100 tuning is the correct choice.

Fixes: N/A
Related: N/A

Type of Change

New feature (non-breaking change that adds functionality)

Component(s) Affected

Recipe engine / data (pkg/recipe) — recipes/overlays

Implementation Notes

New h200-eks-inference.yaml / h200-eks-training.yaml mirror the h100-eks-* overlays (same gpu-operator deps, nfd, constraints, and — for training — the full validation block).
criteria.accelerator is h200; only the nodewright-customizations tuning override selects accelerator: h100 (with intent: inference / multiNodeTraining), so it renders the supported eks-h100 nvidia-setup + h100 nvidia-tuned combo via the shared tuning.yaml.
No new manifest needed (unlike the rtx-pro-6000/generic case): eks-h100 is a fully supported nvidia-setup target, so tuning.yaml works as-is.
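
A minimal sketch of the split described in the notes above: the overlay's own criteria stay on h200, and only the nodewright tuning override selects h100. The key names and nesting (`spec`, `componentRefs`, `overrides`) are assumptions for illustration and are not taken from the actual overlay files. Only the values `h200`, `h100`, `eks`, `inference`, and the component name `nodewright-customizations` come from this description.

```yaml
# Hypothetical sketch of recipes/overlays/h200-eks-inference.yaml.
# Key names and nesting are assumed; see the h100-eks-inference overlay
# in the repository for the real schema.
spec:
  criteria:
    service: eks
    accelerator: h200        # recipe selection still matches H200
    intent: inference
  componentRefs:
    - name: nodewright-customizations
      overrides:
        # Only the tuning selector is h100: nvidia-setup/nvidia-tuned ship
        # eks-h100 and eks-gb200 targets and no h200 target.
        service: eks
        accelerator: h100
        intent: inference
```

Under the same assumptions, the training overlay would differ only in using `intent: multiNodeTraining` in the override and in carrying the validation block.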

Testing

Generated h200/eks/inference and h200/eks/training recipes: both now apply the h200-eks-* overlay and include nodewright-customizations.
Inspected the rendered Skyhook CR: nvidia-setup-kernel + nvidia-tuned + nvidia-setup-full, all service: eks / accelerator: h100 (the supported combo).
yamllint clean; go test ./pkg/recipe/... pass. CI (KWOK + Tier-1) covers rendering/deploy.

Risk Assessment

Low — Two additive overlays following the established h100 pattern; reuses a supported tuning target; easy to revert.

Rollout notes: New H200 EKS recipes will apply the h100 nvidia-tuned profile + nvidia-setup (eks-h100) via nodewright. No migration. If H200-specific tuning is later desired, add an h200 target upstream in nodewright-packages and flip the override.

Checklist

Tests pass locally (go test ./pkg/recipe/...)
Linter passes (yamllint)
I did not skip/disable tests to make CI green
I added/updated tests for new functionality (N/A — overlay change; covered by recipe-resolution + CI render tests)
I updated docs if user-facing behavior changed (N/A — no new component/flag)
Changes follow existing patterns in the codebase
Commits are cryptographically signed (git commit -S)

H200 EKS had no <accelerator>-<service>-<intent> overlays (only the h200-any wildcard) and applied no nodewright node tuning, unlike h100/gb200/b200.

Add h200-eks-inference and h200-eks-training, mirroring the h100 EKS overlays, and reuse the h100 nodewright tuning (tuning.yaml with accelerator=h100). H200 is the same Hopper HGX/DGX-class platform as H100 (NVLink + InfiniBand), and nvidia-setup/nvidia-tuned ship no h200 target, only eks-h100/eks-gb200. The recipe criteria stays accelerator=h200; only the tuning profile selector is h100.

Verified: the h200/eks/{inference,training} recipes now include nodewright-customizations, and the rendered Skyhook CR uses the supported eks-h100 nvidia-setup + h100 nvidia-tuned combo. yamllint + pkg/recipe tests pass.

coderabbitai · 2026-05-29T16:56:45Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: c074d163-9fc7-4254-9614-636e32f5432a

📥 Commits

Reviewing files that changed from the base of the PR and between 8396a5c and 5741e08.

📒 Files selected for processing (2)

recipes/overlays/h200-eks-inference.yaml
recipes/overlays/h200-eks-training.yaml

📝 Walkthrough

This PR adds two new recipe overlays for H200 accelerator support on EKS. The h200-eks-inference overlay configures GPU operator, node feature discovery, and node tuning for inference workloads. The h200-eks-training overlay extends the same component wiring with intent-layer validation including Kubernetes version constraints, GPU operator version checks, NCCL bandwidth performance validation (>= 300), and conformance checks covering health, metrics, autoscaling, gang scheduling, robust controller, and secure access requirements. Both recipes inherit from existing EKS base recipes and add a minimum K8s version requirement of 1.32.4.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

Possibly related PRs

NVIDIA/aicr#1091: Adds first-class H200 accelerator support with overlay patterns like h200-any; this PR builds on that by adding specific EKS recipe overlays that select and constrain accelerator: h200.

Suggested labels

area/recipes, size/M

Suggested reviewers

xdu31
lockwobr
mchmarny

🚥 Pre-merge checks | ✅ 4

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title accurately describes the main change: adding h100 tuning support to H200 EKS recipes through new overlays.
Description check	✅ Passed	The description is comprehensive and directly related to the changeset, explaining the rationale for reusing h100 tuning for H200 and detailing implementation, testing, and risk assessment.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.


mchmarny

Clean mirror of the h100-eks-* pattern. The accelerator: h100 selector inside the nodewright-customizations override is well-documented inline (and again in the PR description) — exactly where a future maintainer will look when nodewright eventually gains a real h200 target. CI is green across the board (Tier 1 EKS renders on all four deployer formats, KWOK summary, grype, ClamAV, gate). One non-blocking nit on the NCCL floor — H200 has 1.4× H100's HBM bandwidth, so the copied >= 300 is conservative; safe today, worth tightening once empirical numbers land.

mchmarny · 2026-05-29T18:04:14Z

+        - name: Deployment.gpu-operator.version
+          value: ">= v24.6.0"
+    performance:
+      checks:


nit: >= 300 is copied directly from h100-eks-training.yaml, but H200 has HBM3e (~4.8 TB/s) vs H100's HBM3 (~3.35 TB/s), so achievable NCCL all-reduce BW on EFA is meaningfully higher. The h100-sourced floor is safe (it's a floor, not a target) and matches the "reuse h100 baseline" framing in the PR description, but worth a follow-up to tighten once empirical H200/EFA numbers are in — otherwise this gate stops catching regressions well before the platform's real performance envelope.
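
One way the follow-up could be recorded is an inline comment next to the inherited floor. The fragment below is a sketch only: the check name `nccl-all-reduce-bw` and the `name`/`value` layout under `checks` are placeholders modeled on the constraint entries in the diff excerpt, not the real identifiers in h200-eks-training.yaml.

```yaml
# Hypothetical fragment of the training overlay's validation block.
# The check name and field layout are placeholders, not the real schema.
performance:
  checks:
    - name: nccl-all-reduce-bw
      # Floor inherited from h100-eks-training.yaml. H200's HBM3e (~4.8 TB/s)
      # exceeds H100's HBM3 (~3.35 TB/s), so this is loose for H200.
      # Tighten once empirical H200/EFA numbers exist.
      value: ">= 300"
```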

The rtx-pro-6000 EKS inference overlays deployed nodewright-operator but no nodewright-customizations, so no tuning profile was applied. h100/gb200/b200 all wire it at the <accelerator>-<service>-<intent> level; rtx-pro-6000 was the gap.

The shared tuning.yaml can't be reused here: it renders nvidia-setup, which only ships eks-h100/eks-gb200 configs and fails on any other (service, accelerator) combo, so accelerator=generic would fail nodewright-customizations and block gpu-operator. Instead add a dedicated tuning-generic.yaml that runs ONLY nvidia-tuned with the generic profile (no nvidia-setup), and wire it into rtx-pro-6000-eks-inference (ubuntu/dynamo/nim leaves inherit it). nvidia-tuned's generic profile is self-contained baseline GPU tuning; nvidia-setup's kernel/EFA steps are unsupported for generic and unneeded on these PCIe RTX PRO 6000 nodes (no NVLink fabric; platform provisions the kernel; EFA unused).

Also align the additionalTolerations fallback across all nodewright manifests (tuning.yaml, tuning-gke.yaml, no-op.yaml, tuning-generic.yaml) to the tolerate-all default 'operator: Exists' (no key), matching ParseTolerations()/DefaultTolerations() and assert-tuning-defaults.yaml. This fallback only fires when no acceleratedTolerations are injected; the default/flag path is unchanged.

Document the inherited NCCL all-reduce floor in h200-eks-training.yaml (review nit from NVIDIA#1102): >= 300 is h100-sourced and loose for H200's HBM3e bandwidth; flagged to tighten once empirical H200/EFA numbers exist.

Verified: rendered Skyhook CR contains only nvidia-tuned (0.3.0) with accelerator=generic and no nvidia-setup; injected toleration path unchanged; yamllint + pkg/recipe tests pass.
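
For reference, the tolerate-all default mentioned above corresponds to the standard Kubernetes toleration form shown below. Where exactly this list sits inside the nodewright manifests is not shown in this conversation, so the top-level `tolerations` key here is only illustrative.

```yaml
# A toleration with operator Exists and no key matches every taint
# (all keys, values, and effects), so pods carrying it can be scheduled
# onto any tainted node.
tolerations:
  - operator: Exists
```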

yuanchen8911 requested a review from a team as a code owner May 29, 2026 16:53

yuanchen8911 added the enhancement New feature or request label May 29, 2026

github-actions Bot added area/recipes size/M labels May 29, 2026

yuanchen8911 requested review from ayuskauskas, lockwobr and mchmarny May 29, 2026 17:39

ayuskauskas approved these changes May 29, 2026

mchmarny approved these changes May 29, 2026

mchmarny assigned yuanchen8911 May 29, 2026

yuanchen8911 merged commit 81c0789 into NVIDIA:main May 29, 2026
120 of 121 checks passed

yuanchen8911 mentioned this pull request May 29, 2026

feat(recipes): apply nodewright generic tuning to rtx-pro-6000 EKS #1101

Merged

10 tasks

mchmarny mentioned this pull request May 29, 2026

feat(cli,api): migrate CLI and REST server to consume aicr.Client #1104

Closed

23 tasks

coderabbitai Bot mentioned this pull request May 29, 2026

feat(recipes): add nodewright to bcm with reapply-on-reboot #1105

Merged

5 tasks

yuanchen8911 merged 1 commit into NVIDIA:main from yuanchen8911:feat/h200-eks-nodewright-h100-tuning
