feat(recipes): add nodewright h100 tuning to H200 EKS recipes #1102

Merged
yuanchen8911 merged 1 commit into NVIDIA:main from yuanchen8911:feat/h200-eks-nodewright-h100-tuning
May 29, 2026

Conversation

@yuanchen8911
Contributor

Summary

Add h200-eks-inference and h200-eks-training overlays so H200 EKS recipes apply nodewright node tuning, reusing the existing h100 profile.

Motivation / Context

H200 had only the h200-any wildcard overlay and no <accelerator>-<service>-<intent> EKS overlays, so it applied no nodewright tuning — unlike h100/gb200/b200, which all wire nodewright-customizations at that level. H200 is the same Hopper, HGX/DGX-class platform as H100 (NVLink + InfiniBand), and nvidia-setup/nvidia-tuned ship no h200 target (only eks-h100/eks-gb200), so reusing the h100 tuning is the correct choice.

Fixes: N/A
Related: N/A

Type of Change

  • New feature (non-breaking change that adds functionality)

Component(s) Affected

  • Recipe engine / data (pkg/recipe) — recipes/overlays

Implementation Notes

  • New h200-eks-inference.yaml / h200-eks-training.yaml mirror the h100-eks-* overlays (same gpu-operator deps, nfd, constraints, and — for training — the full validation block).
  • criteria.accelerator is h200; only the nodewright-customizations tuning override selects accelerator: h100 (with intent: inference / multiNodeTraining), so it renders the supported eks-h100 nvidia-setup + h100 nvidia-tuned combo via the shared tuning.yaml.
  • No new manifest needed (unlike the rtx-pro-6000/generic case): eks-h100 is a fully supported nvidia-setup target, so tuning.yaml works as-is.
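
The split between the recipe's matching criteria and the tuning selector can be sketched as follows. This is a minimal illustration, not the aicr recipe engine: the function name, the argument names, and the `SUPPORTED_SETUP_TARGETS` set are hypothetical, and the set only encodes what this PR states (nvidia-setup ships eks-h100 and eks-gb200).

```python
# Hypothetical sketch: names and data shapes are illustrative, not the aicr API.

# (service, accelerator) pairs that nvidia-setup ships configs for,
# as stated in the PR description.
SUPPORTED_SETUP_TARGETS = {("eks", "h100"), ("eks", "gb200")}


def resolve_setup_target(service, criteria_accelerator, tuning_accelerator=None):
    """Return the nvidia-setup target name used when rendering tuning.

    criteria_accelerator is what the recipe matches on (h200 in this PR).
    tuning_accelerator, when given, overrides only the tuning selector.
    """
    accelerator = tuning_accelerator or criteria_accelerator
    if (service, accelerator) not in SUPPORTED_SETUP_TARGETS:
        raise ValueError(f"no nvidia-setup target for {service}-{accelerator}")
    return f"{service}-{accelerator}"
```

Under these assumptions, `resolve_setup_target("eks", "h200", tuning_accelerator="h100")` yields `eks-h100`, while omitting the override raises an error because no eks-h200 target exists. That is the situation the override in the new overlays is meant to avoid.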

Testing

  • Generated h200/eks/inference and h200/eks/training recipes: both now apply the h200-eks-* overlay and include nodewright-customizations.
  • Inspected the rendered Skyhook CR: nvidia-setup-kernel + nvidia-tuned + nvidia-setup-full, all service: eks / accelerator: h100 (the supported combo).
  • yamllint clean; go test ./pkg/recipe/... pass. CI (KWOK + Tier-1) covers rendering/deploy.
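
The manual inspection of the rendered Skyhook CR could be expressed as a small check like the one below. The package names come from the Testing notes above; the flattened list-of-dicts shape and the `uses_only` helper are assumptions made for illustration, not the actual Skyhook CR schema.

```python
# Hypothetical flattened view of the packages in the rendered Skyhook CR.
# The real CR schema is not shown in this PR; only the names and the
# service/accelerator values are taken from the Testing notes.
RENDERED_PACKAGES = [
    {"name": "nvidia-setup-kernel", "service": "eks", "accelerator": "h100"},
    {"name": "nvidia-tuned", "service": "eks", "accelerator": "h100"},
    {"name": "nvidia-setup-full", "service": "eks", "accelerator": "h100"},
]


def uses_only(packages, service, accelerator):
    """True if there is at least one package and all target the given combo."""
    return bool(packages) and all(
        p.get("service") == service and p.get("accelerator") == accelerator
        for p in packages
    )
```

With this shape, `uses_only(RENDERED_PACKAGES, "eks", "h100")` holds, and the same check against `h200` fails, which mirrors the "supported combo" observation.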

Risk Assessment

  • Low — Two additive overlays following the established h100 pattern; reuses a supported tuning target; easy to revert.

Rollout notes: New H200 EKS recipes will apply the h100 nvidia-tuned profile + nvidia-setup (eks-h100) via nodewright. No migration. If H200-specific tuning is later desired, add an h200 target upstream in nodewright-packages and flip the override.

Checklist

  • Tests pass locally (go test ./pkg/recipe/...)
  • Linter passes (yamllint)
  • I did not skip/disable tests to make CI green
  • I added/updated tests for new functionality (N/A — overlay change; covered by recipe-resolution + CI render tests)
  • I updated docs if user-facing behavior changed (N/A — no new component/flag)
  • Changes follow existing patterns in the codebase
  • Commits are cryptographically signed (git commit -S)

H200 EKS had no <accelerator>-<service>-<intent> overlays (only the h200-any wildcard) and applied no nodewright node tuning, unlike h100/gb200/b200.

Add h200-eks-inference and h200-eks-training, mirroring the h100 EKS overlays, and reuse the h100 nodewright tuning (tuning.yaml with accelerator=h100). H200 is the same Hopper HGX/DGX-class platform as H100 (NVLink + InfiniBand), and nvidia-setup/nvidia-tuned ship no h200 target — only eks-h100/eks-gb200. The recipe criteria stays accelerator=h200; only the tuning profile selector is h100.

Verified: the h200/eks/{inference,training} recipes now include nodewright-customizations, and the rendered Skyhook CR uses the supported eks-h100 nvidia-setup + h100 nvidia-tuned combo. yamllint + pkg/recipe tests pass.
@yuanchen8911 yuanchen8911 requested a review from a team as a code owner May 29, 2026 16:53
@yuanchen8911 yuanchen8911 added the enhancement New feature or request label May 29, 2026
@coderabbitai

coderabbitai Bot commented May 29, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: c074d163-9fc7-4254-9614-636e32f5432a

📥 Commits

Reviewing files that changed from the base of the PR and between 8396a5c and 5741e08.

📒 Files selected for processing (2)
  • recipes/overlays/h200-eks-inference.yaml
  • recipes/overlays/h200-eks-training.yaml

📝 Walkthrough

This PR adds two new recipe overlays for H200 accelerator support on EKS. The h200-eks-inference overlay configures GPU operator, node feature discovery, and node tuning for inference workloads. The h200-eks-training overlay extends the same component wiring with intent-layer validation including Kubernetes version constraints, GPU operator version checks, NCCL bandwidth performance validation (>= 300), and conformance checks covering health, metrics, autoscaling, gang scheduling, robust controller, and secure access requirements. Both recipes inherit from existing EKS base recipes and add a minimum K8s version requirement of 1.32.4.
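
The minimum-version constraints mentioned here (`>= v24.6.0` for the GPU operator, 1.32.4 for Kubernetes) are compared numerically, not as strings. The sketch below shows one way to evaluate such a constraint. It assumes a plain `>= X.Y.Z` form with an optional leading `v` and does not handle build suffixes or other operators. It is not the validator used by the repository.

```python
import re


def parse_version(text):
    """Turn 'v24.6.0' or '1.32.4' into a tuple of ints such as (24, 6, 0)."""
    match = re.fullmatch(r"v?(\d+)\.(\d+)\.(\d+)", text.strip())
    if match is None:
        raise ValueError(f"unrecognized version: {text!r}")
    return tuple(int(part) for part in match.groups())


def satisfies_minimum(actual, constraint):
    """Evaluate a constraint of the assumed form '>= X.Y.Z'."""
    operator, _, minimum = constraint.partition(" ")
    if operator != ">=":
        raise ValueError(f"only '>=' is supported in this sketch: {constraint!r}")
    return parse_version(actual) >= parse_version(minimum)
```

Tuple comparison matters here: as strings, `"1.9.0"` sorts after `"1.32.4"`, but as versions 1.9.0 is older, and `satisfies_minimum("1.9.0", ">= 1.32.4")` correctly returns `False`.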

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

Possibly related PRs

  • NVIDIA/aicr#1091: Adds first-class H200 accelerator support with overlay patterns like h200-any; this PR builds on that by adding specific EKS recipe overlays that select and constrain accelerator: h200.

Suggested labels

area/recipes, size/M

Suggested reviewers

  • xdu31
  • lockwobr
  • mchmarny
🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title accurately describes the main change: adding h100 tuning support to H200 EKS recipes through new overlays. |
| Description check | ✅ Passed | The description is comprehensive and directly related to the changeset, explaining the rationale for reusing h100 tuning for H200 and detailing implementation, testing, and risk assessment. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

Member

@mchmarny mchmarny left a comment

Clean mirror of the h100-eks-* pattern. The accelerator: h100 selector inside the nodewright-customizations override is well-documented inline (and again in the PR description) — exactly where a future maintainer will look when nodewright eventually gains a real h200 target. CI is green across the board (Tier 1 EKS renders on all four deployer formats, KWOK summary, grype, ClamAV, gate). One non-blocking nit on the NCCL floor — H200 has 1.4× H100's HBM bandwidth, so the copied >= 300 is conservative; safe today, worth tightening once empirical numbers land.

- name: Deployment.gpu-operator.version
value: ">= v24.6.0"
performance:
checks:
Member

nit: >= 300 is copied directly from h100-eks-training.yaml, but H200 has HBM3e (~4.8 TB/s) vs H100's HBM3 (~3.35 TB/s), so achievable NCCL all-reduce BW on EFA is meaningfully higher. The h100-sourced floor is safe (it's a floor, not a target) and matches the "reuse h100 baseline" framing in the PR description, but it is worth a follow-up to tighten it once empirical H200/EFA numbers are in. Otherwise the gate will miss regressions that still land above 300 but well below what the platform can actually deliver.
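
The arithmetic behind this nit can be sketched as below. The two bandwidth figures and the 300 floor come from the comment itself; the function names are made up for illustration. The ratio covers HBM bandwidth only, and how much of that headroom shows up in NCCL all-reduce on EFA is what the empirical numbers would have to settle.

```python
# Approximate HBM bandwidth figures quoted in the review comment (TB/s).
H100_HBM_TBPS = 3.35
H200_HBM_TBPS = 4.8

# Floor copied from h100-eks-training.yaml, per the review comment.
NCCL_FLOOR = 300


def hbm_ratio():
    """H200-to-H100 HBM bandwidth ratio, roughly 1.43."""
    return H200_HBM_TBPS / H100_HBM_TBPS


def passes_floor(measured, floor=NCCL_FLOOR):
    """A floor gate passes any measurement at or above the threshold."""
    return measured >= floor
```

As a purely hypothetical illustration, if an H200 cluster normally measured 450 and a regression dropped it to 310, `passes_floor(310)` would still return `True`, so the gate would stay green. That is the gap the follow-up is meant to close.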

@yuanchen8911 yuanchen8911 merged commit 81c0789 into NVIDIA:main May 29, 2026
120 of 121 checks passed
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request May 29, 2026
The rtx-pro-6000 EKS inference overlays deployed nodewright-operator but no nodewright-customizations, so no tuning profile was applied. h100/gb200/b200 all wire it at the <accelerator>-<service>-<intent> level; rtx-pro-6000 was the gap.

The shared tuning.yaml can't be reused here: it renders nvidia-setup, which only ships eks-h100/eks-gb200 configs and fails on any other (service, accelerator) combo — so accelerator=generic would fail nodewright-customizations and block gpu-operator. Instead add a dedicated tuning-generic.yaml that runs ONLY nvidia-tuned with the generic profile (no nvidia-setup), and wire it into rtx-pro-6000-eks-inference (ubuntu/dynamo/nim leaves inherit it). nvidia-tuned's generic profile is self-contained baseline GPU tuning; nvidia-setup's kernel/EFA steps are unsupported for generic and unneeded on these PCIe RTX PRO 6000 nodes (no NVLink fabric; platform provisions the kernel; EFA unused).

Also align the additionalTolerations fallback across all nodewright manifests (tuning.yaml, tuning-gke.yaml, no-op.yaml, tuning-generic.yaml) to the tolerate-all default 'operator: Exists' (no key), matching ParseTolerations()/DefaultTolerations() and assert-tuning-defaults.yaml. This fallback only fires when no acceleratedTolerations are injected; the default/flag path is unchanged.
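
For context on why `operator: Exists` with no key acts as a tolerate-all default, the sketch below follows the basic Kubernetes toleration matching rules: an empty effect matches every effect, an empty key matches every key, and `Exists` ignores the value. It is a simplified model written for illustration (plain dicts, only the `Exists` and `Equal` operators), not code from this repository or from Kubernetes.

```python
def tolerates(toleration, taint):
    """Simplified Kubernetes-style check: does the toleration match the taint?"""
    # An empty effect on the toleration matches all taint effects.
    effect = toleration.get("effect", "")
    if effect and effect != taint.get("effect", ""):
        return False
    # An empty key on the toleration matches all taint keys.
    key = toleration.get("key", "")
    if key and key != taint.get("key", ""):
        return False
    # The operator defaults to Equal when it is not specified.
    operator = toleration.get("operator", "Equal")
    if operator == "Exists":
        return True
    if operator == "Equal":
        return toleration.get("value", "") == taint.get("value", "")
    return False
```

With this model, `{"operator": "Exists"}` matches any taint, while a keyed toleration such as `{"key": "nvidia.com/gpu", "operator": "Exists"}` matches only taints with that key. The taint key here is an example chosen for the sketch.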

Document the inherited NCCL all-reduce floor in h200-eks-training.yaml (review nit from NVIDIA#1102): >= 300 is h100-sourced and loose for H200's HBM3e bandwidth — flagged to tighten once empirical H200/EFA numbers exist.

Verified: rendered Skyhook CR contains only nvidia-tuned (0.3.0) with accelerator=generic and no nvidia-setup; injected toleration path unchanged; yamllint + pkg/recipe tests pass.

3 participants