
feat(recipes): apply nodewright generic tuning to rtx-pro-6000 EKS #1101

Merged
yuanchen8911 merged 1 commit into NVIDIA:main from yuanchen8911:feat/rtx-pro-6000-nodewright-generic-tuning
May 29, 2026

Conversation

Contributor

@yuanchen8911 yuanchen8911 commented May 29, 2026

Summary

Wire nodewright-customizations with accelerator=generic into the rtx-pro-6000 EKS inference overlay, so RTX PRO 6000 nodes get a baseline GPU node-tuning profile — closing a gap vs h100/gb200/b200, which already wire tuning.

Motivation / Context

The rtx-pro-6000 EKS overlays deployed only nodewright-operator (the controller) with no nodewright-customizations, so aicr applied no tuning profile and node tuning was left entirely to the platform. nvidia-tuned has no rtx-pro-6000 profile; the generic profile is purpose-built for single-GPU cloud VMs and deliberately omits the NVLink/InfiniBand/hugepage/CPU-isolation settings the h100/gb200 profiles carry. It already ships in nvidia-tuned 0.3.0, the version the tuning manifest pins.

Fixes: N/A
Related: NVIDIA/nodewright-packages#37 (adds the generic profile)

Type of Change

  • New feature (non-breaking change that adds functionality)

Component(s) Affected

  • Recipe engine / data (pkg/recipe) — recipes/overlays

Implementation Notes

  • Added the nodewright-customizations componentRef to rtx-pro-6000-eks-inference.yaml — the <accelerator>-<service>-<intent> level, mirroring h100-eks-inference/gb200-eks-inference. The -ubuntu-inference, -dynamo, and -nim leaves inherit it via overlay merge (the established convention; leaves never re-declare tuning).
  • Set overrides.accelerator=generic (the recipe criteria remain rtx-pro-6000) and added nodewright-customizations to the gpu-operator dependencyRefs for ordering, same as h100.
  • eks rtx-pro-6000 is inference-only, so this is a single-file change (no training counterpart).
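
The inheritance described in the first bullet can be pictured with a minimal Python sketch. The dictionary layout, the `merge_overlay` helper, and the leaf component name `dynamo-platform` are hypothetical and do not reflect aicr's actual overlay schema or merge code; the sketch only illustrates why declaring the componentRef once at the parent level is enough for the leaves.

```python
from copy import deepcopy


def merge_overlay(parent: dict, leaf: dict) -> dict:
    """Hypothetical merge: componentRefs are keyed by name; the leaf adds
    or overrides entries, and everything else is inherited from the parent."""
    merged = deepcopy(parent)
    refs = {ref["name"]: ref for ref in merged.get("componentRefs", [])}
    for ref in leaf.get("componentRefs", []):
        # Leaf fields win over inherited fields for the same component.
        refs[ref["name"]] = {**refs.get(ref["name"], {}), **deepcopy(ref)}
    merged["componentRefs"] = list(refs.values())
    return merged


# Parent level (<accelerator>-<service>-<intent>): tuning is declared here once.
parent = {
    "componentRefs": [
        {"name": "nodewright-operator"},
        {
            "name": "nodewright-customizations",
            "overrides": {"service": "eks", "accelerator": "generic", "intent": "inference"},
        },
    ]
}

# Leaf level: declares only its own component (hypothetical name), no tuning.
leaf = {"componentRefs": [{"name": "dynamo-platform"}]}

resolved = merge_overlay(parent, leaf)
names = [ref["name"] for ref in resolved["componentRefs"]]
print(names)
```

Under these assumptions the resolved leaf carries nodewright-customizations with accelerator=generic even though the leaf never mentions it.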

Alternatives Considered

Option 1 — no-op manifest (no-op.yaml): wire nodewright-customizations as a no-op placeholder (a shellscript package that just echoes). This satisfies the gpu-operator dependency and keeps the "customization present" pattern, but applies no tuning — leaving node tuning entirely to the platform/NCP (the status quo on b40). Rejected in favor of Option 2 (tuning-generic.yaml, which actually applies the nvidia-tuned generic profile). The no-op remains the safe fallback if the nodewright team determines RTX PRO 6000 should not run nvidia-tuned standalone — e.g. if it must run nvidia-setup's kernel/EFA steps first (which today support only eks-h100/eks-gb200).

Additional Changes (cross-cutting — disclosed for the next reader)

Beyond the rtx-pro-6000 tuning, this PR includes two changes that touch shared/other files:

  1. Toleration-default consistency fix (affects all recipes, not just rtx-pro-6000). The additionalTolerations fallback in tuning.yaml, tuning-gke.yaml, and no-op.yaml (and the new tuning-generic.yaml) changes from [{ key: dedicated, operator: Exists }] to the tolerate-all [{ operator: Exists }]. This matches ParseTolerations() / DefaultTolerations() and the assert-tuning-defaults.yaml chainsaw fixture; the previous templates were inconsistent (subset matching masked it), so this closes a real gap. It affects every recipe rendering these manifests — h100, gb200, b200, b40, GKE — but only the fallback branch (when no acceleratedTolerations are injected from scheduling defaults/flags); the default/flag path is unchanged. (Raised by CodeRabbit and @mchmarny.)

  2. h200-eks-training NCCL-floor doc. A comment noting the inherited nccl-all-reduce-bw >= 300 floor is h100-sourced and loose for H200's HBM3e bandwidth — flagged to tighten once empirical H200/EFA numbers exist. (Addresses a review nit from feat(recipes): add nodewright h100 tuning to H200 EKS recipes #1102.)
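
To make the toleration change in item 1 concrete, here is a small Python sketch of Kubernetes toleration matching. It relies on the documented Kubernetes rule that a toleration with an empty key and operator Exists tolerates every taint. The `effective_tolerations` helper is a hypothetical model of the template's conditional, and the example taints are illustrative, not taken from any aicr recipe.

```python
def tolerates(toleration: dict, taint: dict) -> bool:
    """Simplified model of Kubernetes toleration/taint matching."""
    effect = toleration.get("effect")
    if effect and effect != taint.get("effect"):
        return False
    key = toleration.get("key")
    operator = toleration.get("operator", "Equal")
    if not key:
        # An empty key is only valid with Exists and matches every taint key.
        return operator == "Exists"
    if key != taint.get("key"):
        return False
    if operator == "Exists":
        return True
    return toleration.get("value", "") == taint.get("value", "")


def effective_tolerations(accelerated: list, fallback: list) -> list:
    """Hypothetical model: injected tolerations win, otherwise the fallback applies."""
    return accelerated if accelerated else fallback


def all_tolerated(tolerations: list, taints: list) -> bool:
    return all(any(tolerates(t, taint) for t in tolerations) for taint in taints)


OLD_FALLBACK = [{"key": "dedicated", "operator": "Exists"}]
NEW_FALLBACK = [{"operator": "Exists"}]

# Illustrative node taints (assumed for the example).
taints = [
    {"key": "dedicated", "value": "gpu", "effect": "NoSchedule"},
    {"key": "nvidia.com/gpu", "value": "present", "effect": "NoSchedule"},
]

old_ok = all_tolerated(effective_tolerations([], OLD_FALLBACK), taints)
new_ok = all_tolerated(effective_tolerations([], NEW_FALLBACK), taints)
print(old_ok, new_ok)  # False True

# When tolerations are injected, the fallback is never consulted.
injected = [{"key": "nvidia.com/gpu", "operator": "Exists"}]
print(effective_tolerations(injected, NEW_FALLBACK) == injected)  # True
```

The old fallback covers only the `dedicated` taint, while the keyless form tolerates both, and the injected path returns the same list regardless of which fallback is configured.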

Testing

  • yamllint clean; go test ./pkg/recipe/... passes; no golden fixtures to regenerate.
  • Generated the eks/rtx-pro-6000/ubuntu/inference/dynamo recipe + bundle: the recipe gains nodewright-customizations, and the rendered Skyhook CR sets accelerator: generic against nvidia-tuned 0.3.0 (the version that ships the generic profile).
  • Render-verified only; not yet applied on a live RTX PRO 6000 cluster. CI (KWOK + Tier-1 deploy matrices) covers recipe rendering and deployment.

Risk Assessment

  • Low — Additive overlay change following an established pattern; easy to revert.

Rollout notes: New deploys of rtx-pro-6000 EKS recipes will now apply the nvidia-tuned generic profile via nodewright. No migration.

Checklist

  • Tests pass locally (go test ./pkg/recipe/...)
  • Linter passes (yamllint)
  • I did not skip/disable tests to make CI green
  • I added/updated tests for new functionality (N/A — overlay change; covered by recipe-resolution + CI render tests)
  • I updated docs if user-facing behavior changed (N/A — no new component/flag; nodewright-customizations already documented)
  • Changes follow existing patterns in the codebase
  • Commits are cryptographically signed (git commit -S)

@yuanchen8911 yuanchen8911 requested a review from a team as a code owner May 29, 2026 16:24
@yuanchen8911 yuanchen8911 added the enhancement New feature or request label May 29, 2026

coderabbitai Bot commented May 29, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

No actionable comments were generated in the recent review. 🎉

📥 Commits

Reviewing files that changed from the base of the PR and between 7ad52fa and 2861d23.

📒 Files selected for processing (6)
  • recipes/components/nodewright-customizations/manifests/no-op.yaml
  • recipes/components/nodewright-customizations/manifests/tuning-generic.yaml
  • recipes/components/nodewright-customizations/manifests/tuning-gke.yaml
  • recipes/components/nodewright-customizations/manifests/tuning.yaml
  • recipes/overlays/h200-eks-training.yaml
  • recipes/overlays/rtx-pro-6000-eks-inference.yaml

Walkthrough

Adds a Helm template recipes/components/nodewright-customizations/manifests/tuning-generic.yaml that conditionally renders a Skyhook tuning CR with a pinned nvidia-tuned package and templated configMap (intent, accelerator, optional service). Changes default additionalTolerations fallback in tuning, tuning-gke, and no-op templates to a tolerate-all toleration. Updates recipes/overlays/rtx-pro-6000-eks-inference.yaml to add gpu-operator -> nodewright-customizations dependency and a nodewright-customizations componentRef pinned to the tuning manifest with overrides service: eks, accelerator: generic, intent: inference and dependency on nodewright-operator. Also adds explanatory comments in h200-eks-training.yaml.
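
The dependency chain named in the walkthrough (gpu-operator depends on nodewright-customizations, which depends on nodewright-operator) implies a deployment order. The sketch below derives that order with Python's standard-library `graphlib`. Treating dependencyRefs as a plain dependency graph is an assumption for illustration; aicr's actual ordering logic is not shown in this PR.

```python
from graphlib import TopologicalSorter

# Map each component to the set of components it depends on,
# using the relationships stated in the walkthrough.
deps = {
    "gpu-operator": {"nodewright-customizations"},
    "nodewright-customizations": {"nodewright-operator"},
    "nodewright-operator": set(),
}

# static_order() yields dependencies before the components that need them.
order = list(TopologicalSorter(deps).static_order())
print(order)
# ['nodewright-operator', 'nodewright-customizations', 'gpu-operator']
```

Because tuning sits between the operator and gpu-operator in this chain, a failing nodewright-customizations step would hold back gpu-operator.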

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

  • NVIDIA/aicr#1102: Updates EKS overlays to wire nodewright-customizations into gpu-operator and configure the Skyhook tuning manifests for inference/training.
  • NVIDIA/aicr#1053: Related changes to tuning-gke.yaml tolerations default that affect GKE overlays wiring nodewright-customizations.
  • NVIDIA/aicr#1046: Prior work that created/modified the rtx-pro-6000-eks-inference overlay which this PR extends.

Suggested reviewers

  • mchmarny
  • lockwobr
  • ayuskauskas
🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title clearly and concisely describes the main change: wiring nodewright generic tuning into the rtx-pro-6000 EKS overlay. It matches the primary objective of the PR. |
| Description check | ✅ Passed | The description thoroughly details the motivation, implementation, testing, and risk assessment. It directly relates to the changeset and provides context for why nodewright-customizations with generic tuning is being added to rtx-pro-6000 EKS. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

@yuanchen8911 yuanchen8911 marked this pull request as draft May 29, 2026 16:31
@yuanchen8911 yuanchen8911 changed the title from "feat(recipes): apply nodewright generic tuning to rtx-pro-6000 EKS" to "WIP: feat(recipes): apply nodewright generic tuning to rtx-pro-6000 EKS" May 29, 2026
@yuanchen8911 yuanchen8911 force-pushed the feat/rtx-pro-6000-nodewright-generic-tuning branch from 9011b89 to 6daedaa May 29, 2026 16:36
@github-actions github-actions Bot added size/M and removed size/S labels May 29, 2026
@yuanchen8911 yuanchen8911 force-pushed the feat/rtx-pro-6000-nodewright-generic-tuning branch from 6daedaa to ee25f9b May 29, 2026 16:44
@yuanchen8911 yuanchen8911 marked this pull request as ready for review May 29, 2026 16:47
@yuanchen8911 yuanchen8911 changed the title from "WIP: feat(recipes): apply nodewright generic tuning to rtx-pro-6000 EKS" to "feat(recipes): apply nodewright generic tuning to rtx-pro-6000 EKS" May 29, 2026

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@recipes/components/nodewright-customizations/manifests/tuning-generic.yaml`:
- Around line 54-60: The fallback toleration in tuning-generic.yaml under
additionalTolerations uses "- key: dedicated, operator: Exists" which differs
from the compiled default tolerations used by
DefaultTolerations()/ParseTolerations() and expected by
assert-tuning-defaults.yaml; update the Helm template block that checks
$cust.acceleratedTolerations (the conditional in tuning-generic.yaml referencing
additionalTolerations and $cust.acceleratedTolerations) to emit a toleration
with only "operator: Exists" (no key) as the else/fallback so the template
matches the "tolerate all" default behavior.

📥 Commits

Reviewing files that changed from the base of the PR and between 6daedaa and ee25f9b.

📒 Files selected for processing (2)
  • recipes/components/nodewright-customizations/manifests/tuning-generic.yaml
  • recipes/overlays/rtx-pro-6000-eks-inference.yaml

@yuanchen8911 yuanchen8911 force-pushed the feat/rtx-pro-6000-nodewright-generic-tuning branch from ee25f9b to 7ad52fa May 29, 2026 17:03
ayuskauskas previously approved these changes May 29, 2026
The rtx-pro-6000 EKS inference overlays deployed nodewright-operator but no nodewright-customizations, so no tuning profile was applied. h100/gb200/b200 all wire it at the <accelerator>-<service>-<intent> level; rtx-pro-6000 was the gap.

The shared tuning.yaml can't be reused here: it renders nvidia-setup, which only ships eks-h100/eks-gb200 configs and fails on any other (service, accelerator) combo — so accelerator=generic would fail nodewright-customizations and block gpu-operator. Instead add a dedicated tuning-generic.yaml that runs ONLY nvidia-tuned with the generic profile (no nvidia-setup), and wire it into rtx-pro-6000-eks-inference (ubuntu/dynamo/nim leaves inherit it). nvidia-tuned's generic profile is self-contained baseline GPU tuning; nvidia-setup's kernel/EFA steps are unsupported for generic and unneeded on these PCIe RTX PRO 6000 nodes (no NVLink fabric; platform provisions the kernel; EFA unused).
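
The failure mode described above can be modeled in a few lines of Python. This is a sketch under stated assumptions: the `packages_for` helper is hypothetical, the supported (service, accelerator) pairs are the ones named in this commit message, and the assumption that tuning.yaml renders both nvidia-setup and nvidia-tuned is inferred from the wording rather than confirmed by the manifest.

```python
# (service, accelerator) pairs that nvidia-setup ships configs for, per the text above.
NVIDIA_SETUP_CONFIGS = {("eks", "h100"), ("eks", "gb200")}


def packages_for(manifest: str, service: str, accelerator: str) -> list:
    """Hypothetical model of which packages each manifest would run."""
    if manifest == "tuning.yaml":
        # Assumed: the shared manifest always includes nvidia-setup,
        # which has no config for unsupported combinations.
        if (service, accelerator) not in NVIDIA_SETUP_CONFIGS:
            raise ValueError(f"nvidia-setup has no config for {service}-{accelerator}")
        return ["nvidia-setup", "nvidia-tuned"]
    if manifest == "tuning-generic.yaml":
        # The dedicated manifest runs only nvidia-tuned with the generic profile.
        return ["nvidia-tuned"]
    raise ValueError(f"unknown manifest: {manifest}")


print(packages_for("tuning-generic.yaml", "eks", "generic"))  # ['nvidia-tuned']

try:
    packages_for("tuning.yaml", "eks", "generic")
except ValueError as err:
    print(f"would fail: {err}")
```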

Also align the additionalTolerations fallback across all nodewright manifests (tuning.yaml, tuning-gke.yaml, no-op.yaml, tuning-generic.yaml) to the tolerate-all default 'operator: Exists' (no key), matching ParseTolerations()/DefaultTolerations() and assert-tuning-defaults.yaml. This fallback only fires when no acceleratedTolerations are injected; the default/flag path is unchanged.

Document the inherited NCCL all-reduce floor in h200-eks-training.yaml (review nit from NVIDIA#1102): >= 300 is h100-sourced and loose for H200's HBM3e bandwidth — flagged to tighten once empirical H200/EFA numbers exist.

Verified: rendered Skyhook CR contains only nvidia-tuned (0.3.0) with accelerator=generic and no nvidia-setup; injected toleration path unchanged; yamllint + pkg/recipe tests pass.
@yuanchen8911 yuanchen8911 force-pushed the feat/rtx-pro-6000-nodewright-generic-tuning branch from 7ad52fa to 2861d23 May 29, 2026 18:13
Member

@mchmarny mchmarny left a comment

The substance is good: tuning-generic.yaml is the right factoring (nvidia-setup only ships eks-h100 / eks-gb200 configs, so reusing tuning.yaml for accelerator=generic would fail), the inline justifications at both the call site and the manifest header are clear, and the "Alternatives Considered" section in the body is well-reasoned. CI green across all 118 checks. nvidia-tuned:0.3.0 already in the BOM, no regen needed.

One medium-priority observation inline on tuning.yaml: the default-toleration change in three existing manifests (tuning.yaml, tuning-gke.yaml, no-op.yaml) is correct but isn't disclosed in the PR description. It affects every recipe rendering these manifests — h100, gb200, b200, b40, GKE — not just rtx-pro-6000. Asking for either a one-sentence callout in the description or a split into a follow-up cleanup PR. Not blocking the substance.

Comment thread recipes/components/nodewright-customizations/manifests/tuning.yaml
@yuanchen8911 yuanchen8911 requested a review from mchmarny May 29, 2026 18:14
@yuanchen8911 yuanchen8911 enabled auto-merge (squash) May 29, 2026 18:17
@yuanchen8911 yuanchen8911 merged commit 4daf2af into NVIDIA:main May 29, 2026
120 checks passed

3 participants