feat(recipes): apply nodewright generic tuning to rtx-pro-6000 EKS by yuanchen8911 · Pull Request #1101 · NVIDIA/aicr

yuanchen8911 · 2026-05-29T16:24:45Z

Summary

Wire nodewright-customizations with accelerator=generic into the rtx-pro-6000 EKS inference overlay, so RTX PRO 6000 nodes get a baseline GPU node-tuning profile — closing a gap vs h100/gb200/b200, which already wire tuning.

Motivation / Context

The rtx-pro-6000 EKS overlays deployed only nodewright-operator (the controller) with no nodewright-customizations, so aicr applied no tuning profile and node tuning was left entirely to the platform. nvidia-tuned has no rtx-pro-6000 profile; its generic profile is purpose-built for single-GPU cloud VMs and deliberately omits the NVLink/InfiniBand/hugepage/CPU-isolation settings that the h100/gb200 profiles carry. The generic profile already ships in nvidia-tuned 0.3.0, the version the tuning manifest pins.

Fixes: N/A
Related: NVIDIA/nodewright-packages#37 (adds the generic profile)

Type of Change

New feature (non-breaking change that adds functionality)

Component(s) Affected

Recipe engine / data (pkg/recipe) — recipes/overlays

Implementation Notes

Added the nodewright-customizations componentRef to rtx-pro-6000-eks-inference.yaml — the <accelerator>-<service>-<intent> level, mirroring h100-eks-inference/gb200-eks-inference. The -ubuntu-inference, -dynamo, and -nim leaves inherit it via overlay merge (the established convention; leaves never re-declare tuning).
Set overrides.accelerator=generic (the recipe criteria stay rtx-pro-6000) and added nodewright-customizations to the gpu-operator dependencyRefs for ordering, same as h100.
EKS rtx-pro-6000 is inference-only, so only one overlay file changes (there is no training counterpart).
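
The inheritance convention in the notes above can be pictured as a parent-to-leaf merge of component references. The following is only a sketch under stated assumptions: the `Overlay` and `ComponentRef` types, the `Base` field, the `resolve` helper, and the leaf and leaf-component names are hypothetical illustrations, not aicr's actual pkg/recipe API or file names.

```go
package main

import (
	"fmt"
	"sort"
)

// ComponentRef is a hypothetical, simplified component reference.
type ComponentRef struct {
	Name      string
	Overrides map[string]string
}

// Overlay is a hypothetical overlay: its own refs plus an optional parent.
type Overlay struct {
	Name string
	Base *Overlay
	Refs []ComponentRef
}

// resolve merges refs from the root ancestor down to the given overlay.
// A ref declared lower in the chain replaces a same-named inherited ref.
func resolve(o *Overlay) map[string]ComponentRef {
	merged := map[string]ComponentRef{}
	if o == nil {
		return merged
	}
	for name, ref := range resolve(o.Base) {
		merged[name] = ref
	}
	for _, ref := range o.Refs {
		merged[ref.Name] = ref
	}
	return merged
}

func main() {
	parent := &Overlay{
		Name: "rtx-pro-6000-eks-inference",
		Refs: []ComponentRef{
			{Name: "nodewright-operator"},
			{Name: "nodewright-customizations", Overrides: map[string]string{
				"service": "eks", "accelerator": "generic", "intent": "inference",
			}},
		},
	}
	// The leaf (hypothetical name) declares only its own component
	// and never re-declares tuning.
	leaf := &Overlay{
		Name: "hypothetical-leaf",
		Base: parent,
		Refs: []ComponentRef{{Name: "hypothetical-leaf-component"}},
	}

	merged := resolve(leaf)
	names := make([]string, 0, len(merged))
	for name := range merged {
		names = append(names, name)
	}
	sort.Strings(names)
	fmt.Println(names)
	fmt.Println(merged["nodewright-customizations"].Overrides["accelerator"])
}
```

In this sketch the leaf ends up with nodewright-customizations and accelerator=generic without declaring either, which is the behavior the convention relies on.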

Alternatives Considered

Option 1 — no-op manifest (no-op.yaml): wire nodewright-customizations as a no-op placeholder (a shellscript package that just echoes). This satisfies the gpu-operator dependency and keeps the "customization present" pattern, but applies no tuning — leaving node tuning entirely to the platform/NCP (the status quo on b40). Rejected in favor of Option 2 (tuning-generic.yaml, which actually applies the nvidia-tuned generic profile). The no-op remains the safe fallback if the nodewright team determines RTX PRO 6000 should not run nvidia-tuned standalone — e.g. if it must run nvidia-setup's kernel/EFA steps first (which today support only eks-h100/eks-gb200).

Additional Changes (cross-cutting — disclosed for the next reader)

Beyond the rtx-pro-6000 tuning, this PR includes two changes that touch shared/other files:

Toleration-default consistency fix (affects all recipes, not just rtx-pro-6000). The additionalTolerations fallback in tuning.yaml, tuning-gke.yaml, and no-op.yaml (and the new tuning-generic.yaml) changes from [{ key: dedicated, operator: Exists }] to the tolerate-all [{ operator: Exists }]. This matches ParseTolerations() / DefaultTolerations() and the assert-tuning-defaults.yaml chainsaw fixture; the previous templates were inconsistent (subset matching masked it), so this closes a real gap. It affects every recipe rendering these manifests — h100, gb200, b200, b40, GKE — but only the fallback branch (when no acceleratedTolerations are injected from scheduling defaults/flags); the default/flag path is unchanged. (Raised by CodeRabbit and @mchmarny.)
h200-eks-training NCCL-floor doc. Adds a comment noting that the inherited nccl-all-reduce-bw >= 300 floor is h100-sourced and loose for H200's HBM3e bandwidth; it is flagged to be tightened once empirical H200/EFA numbers exist. (Addresses a review nit from feat(recipes): add nodewright h100 tuning to H200 EKS recipes #1102.)
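
To make the toleration change concrete, here is a small sketch of why the old keyed fallback is only a subset of the tolerate-all default. It models the documented Kubernetes matching rules in simplified form with local types. The `tolerates` and `effectiveTolerations` helpers and the example taints are assumptions for illustration; they are not the repository's ParseTolerations() / DefaultTolerations() implementations.

```go
package main

import "fmt"

// Taint and Toleration are simplified local stand-ins for the Kubernetes types.
type Taint struct{ Key, Value, Effect string }
type Toleration struct{ Key, Operator, Value, Effect string }

// tolerates applies the Kubernetes matching rules in simplified form.
func tolerates(t Toleration, taint Taint) bool {
	// An empty toleration effect matches every taint effect.
	if t.Effect != "" && t.Effect != taint.Effect {
		return false
	}
	// An empty key with operator Exists matches every taint key and value.
	if t.Key == "" {
		return t.Operator == "Exists"
	}
	if t.Key != taint.Key {
		return false
	}
	switch t.Operator {
	case "Exists":
		return true
	case "", "Equal":
		return t.Value == taint.Value
	}
	return false
}

// effectiveTolerations mirrors the fallback branch described above:
// injected tolerations win; otherwise the tolerate-all default applies.
func effectiveTolerations(injected []Toleration) []Toleration {
	if len(injected) > 0 {
		return injected
	}
	return []Toleration{{Operator: "Exists"}}
}

func main() {
	// Example taints are assumptions chosen for illustration.
	taints := []Taint{
		{Key: "dedicated", Value: "gpu", Effect: "NoSchedule"},
		{Key: "nvidia.com/gpu", Value: "present", Effect: "NoSchedule"},
	}
	oldFallback := Toleration{Key: "dedicated", Operator: "Exists"}
	newFallback := effectiveTolerations(nil)[0]
	for _, taint := range taints {
		fmt.Printf("%-15s old=%v new=%v\n", taint.Key,
			tolerates(oldFallback, taint), tolerates(newFallback, taint))
	}
}
```

In this sketch the old fallback tolerates only the dedicated taint, while the new fallback tolerates both. When tolerations are injected, `effectiveTolerations` returns them unchanged, which corresponds to the default/flag path staying the same.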

Testing

yamllint clean; go test ./pkg/recipe/... passes; no golden fixtures to regenerate.
Generated the eks/rtx-pro-6000/ubuntu/inference/dynamo recipe + bundle: the recipe gains nodewright-customizations, and the rendered Skyhook CR sets accelerator: generic against nvidia-tuned 0.3.0 (the profile that ships generic).
Render-verified only; not yet applied on a live RTX PRO 6000 cluster. CI (KWOK + Tier-1 deploy matrices) covers recipe rendering and deployment.

Risk Assessment

Low — Additive overlay change following an established pattern; easy to revert.

Rollout notes: New deploys of rtx-pro-6000 EKS recipes will now apply the nvidia-tuned generic profile via nodewright. No migration is needed.

Checklist

Tests pass locally (go test ./pkg/recipe/...)
Linter passes (yamllint)
I did not skip/disable tests to make CI green
I added/updated tests for new functionality (N/A — overlay change; covered by recipe-resolution + CI render tests)
I updated docs if user-facing behavior changed (N/A — no new component/flag; nodewright-customizations already documented)
Changes follow existing patterns in the codebase
Commits are cryptographically signed (git commit -S)

coderabbitai · 2026-05-29T16:29:22Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: f8e33e03-5643-42c5-9620-e4f410b34515

📥 Commits

Reviewing files that changed from the base of the PR and between 7ad52fa and 2861d23.

📒 Files selected for processing (6)

recipes/components/nodewright-customizations/manifests/no-op.yaml
recipes/components/nodewright-customizations/manifests/tuning-generic.yaml
recipes/components/nodewright-customizations/manifests/tuning-gke.yaml
recipes/components/nodewright-customizations/manifests/tuning.yaml
recipes/overlays/h200-eks-training.yaml
recipes/overlays/rtx-pro-6000-eks-inference.yaml

📝 Walkthrough

Adds a Helm template recipes/components/nodewright-customizations/manifests/tuning-generic.yaml that conditionally renders a Skyhook tuning CR with a pinned nvidia-tuned package and templated configMap (intent, accelerator, optional service). Changes default additionalTolerations fallback in tuning, tuning-gke, and no-op templates to a tolerate-all toleration. Updates recipes/overlays/rtx-pro-6000-eks-inference.yaml to add gpu-operator -> nodewright-customizations dependency and a nodewright-customizations componentRef pinned to the tuning manifest with overrides service: eks, accelerator: generic, intent: inference and dependency on nodewright-operator. Also adds explanatory comments in h200-eks-training.yaml.
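
The dependency wiring in this walkthrough implies a deploy order in which nodewright-operator precedes nodewright-customizations, which in turn precedes gpu-operator. A minimal sketch of how such an order can be derived with a depth-first topological sort is shown below. The `deployOrder` function and the map-based input are hypothetical and are not how aicr necessarily resolves dependencyRefs.

```go
package main

import (
	"fmt"
	"sort"
)

// deployOrder returns components so that every dependency precedes its
// dependents. deps maps a component to the components it depends on.
func deployOrder(deps map[string][]string) ([]string, error) {
	const (
		visiting = 1
		done     = 2
	)
	state := map[string]int{}
	var order []string

	var visit func(name string) error
	visit = func(name string) error {
		switch state[name] {
		case done:
			return nil
		case visiting:
			return fmt.Errorf("dependency cycle at %q", name)
		}
		state[name] = visiting
		for _, dep := range deps[name] {
			if err := visit(dep); err != nil {
				return err
			}
		}
		state[name] = done
		order = append(order, name)
		return nil
	}

	// Sort the starting points so the result is deterministic.
	names := make([]string, 0, len(deps))
	for name := range deps {
		names = append(names, name)
	}
	sort.Strings(names)
	for _, name := range names {
		if err := visit(name); err != nil {
			return nil, err
		}
	}
	return order, nil
}

func main() {
	deps := map[string][]string{
		"gpu-operator":              {"nodewright-customizations"},
		"nodewright-customizations": {"nodewright-operator"},
		"nodewright-operator":       nil,
	}
	order, err := deployOrder(deps)
	if err != nil {
		panic(err)
	}
	fmt.Println(order) // [nodewright-operator nodewright-customizations gpu-operator]
}
```

The same sketch also shows why a failing nodewright-customizations step would block gpu-operator: gpu-operator can only be placed after its dependency has been handled.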

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

NVIDIA/aicr#1102: Updates EKS overlays to wire nodewright-customizations into gpu-operator and configure the Skyhook tuning manifests for inference/training.
NVIDIA/aicr#1053: Related changes to tuning-gke.yaml tolerations default that affect GKE overlays wiring nodewright-customizations.
NVIDIA/aicr#1046: Prior work that created/modified the rtx-pro-6000-eks-inference overlay which this PR extends.

Suggested reviewers

mchmarny
lockwobr
ayuskauskas

🚥 Pre-merge checks | ✅ 4

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title clearly and concisely describes the main change: wiring nodewright generic tuning into the rtx-pro-6000 EKS overlay. It matches the primary objective of the PR.
Description check	✅ Passed	The description thoroughly details the motivation, implementation, testing, and risk assessment. It directly relates to the changeset and provides context for why nodewright-customizations with generic tuning is being added to rtx-pro-6000 EKS.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@recipes/components/nodewright-customizations/manifests/tuning-generic.yaml`:
- Around line 54-60: The fallback toleration in tuning-generic.yaml under
additionalTolerations uses "- key: dedicated, operator: Exists" which differs
from the compiled default tolerations used by
DefaultTolerations()/ParseTolerations() and expected by
assert-tuning-defaults.yaml; update the Helm template block that checks
$cust.acceleratedTolerations (the conditional in tuning-generic.yaml referencing
additionalTolerations and $cust.acceleratedTolerations) to emit a toleration
with only "operator: Exists" (no key) as the else/fallback so the template
matches the "tolerate all" default behavior.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: b39bfcdf-b5c0-425b-97fe-9322fb924a7c

📥 Commits

Reviewing files that changed from the base of the PR and between 6daedaa and ee25f9b.

📒 Files selected for processing (2)

recipes/components/nodewright-customizations/manifests/tuning-generic.yaml
recipes/overlays/rtx-pro-6000-eks-inference.yaml

The rtx-pro-6000 EKS inference overlays deployed nodewright-operator but no nodewright-customizations, so no tuning profile was applied. h100/gb200/b200 all wire it at the <accelerator>-<service>-<intent> level; rtx-pro-6000 was the gap.

The shared tuning.yaml can't be reused here: it renders nvidia-setup, which only ships eks-h100/eks-gb200 configs and fails on any other (service, accelerator) combo, so accelerator=generic would fail nodewright-customizations and block gpu-operator. Instead add a dedicated tuning-generic.yaml that runs ONLY nvidia-tuned with the generic profile (no nvidia-setup), and wire it into rtx-pro-6000-eks-inference (ubuntu/dynamo/nim leaves inherit it). nvidia-tuned's generic profile is self-contained baseline GPU tuning; nvidia-setup's kernel/EFA steps are unsupported for generic and unneeded on these PCIe RTX PRO 6000 nodes (no NVLink fabric; platform provisions the kernel; EFA unused).

Also align the additionalTolerations fallback across all nodewright manifests (tuning.yaml, tuning-gke.yaml, no-op.yaml, tuning-generic.yaml) to the tolerate-all default 'operator: Exists' (no key), matching ParseTolerations()/DefaultTolerations() and assert-tuning-defaults.yaml. This fallback only fires when no acceleratedTolerations are injected; the default/flag path is unchanged.

Document the inherited NCCL all-reduce floor in h200-eks-training.yaml (review nit from NVIDIA#1102): >= 300 is h100-sourced and loose for H200's HBM3e bandwidth, flagged to tighten once empirical H200/EFA numbers exist.

Verified: rendered Skyhook CR contains only nvidia-tuned (0.3.0) with accelerator=generic and no nvidia-setup; injected toleration path unchanged; yamllint + pkg/recipe tests pass.

mchmarny

The substance is good: tuning-generic.yaml is the right factoring (nvidia-setup only ships eks-h100 / eks-gb200 configs, so reusing tuning.yaml for accelerator=generic would fail), the inline justifications at both the call site and the manifest header are clear, and the "Alternatives Considered" section in the body is well-reasoned. CI green across all 118 checks. nvidia-tuned:0.3.0 already in the BOM, no regen needed.

One medium-priority observation inline on tuning.yaml: the default-toleration change in three existing manifests (tuning.yaml, tuning-gke.yaml, no-op.yaml) is correct but isn't disclosed in the PR description. It affects every recipe rendering these manifests — h100, gb200, b200, b40, GKE — not just rtx-pro-6000. Asking for either a one-sentence callout in the description or a split into a follow-up cleanup PR. Not blocking the substance.

yuanchen8911 requested a review from a team as a code owner May 29, 2026 16:24

yuanchen8911 added the enhancement New feature or request label May 29, 2026

github-actions Bot added area/recipes size/S labels May 29, 2026

yuanchen8911 marked this pull request as draft May 29, 2026 16:31

yuanchen8911 changed the title from "feat(recipes): apply nodewright generic tuning to rtx-pro-6000 EKS" to "WIP: feat(recipes): apply nodewright generic tuning to rtx-pro-6000 EKS" May 29, 2026

yuanchen8911 force-pushed the feat/rtx-pro-6000-nodewright-generic-tuning branch from 9011b89 to 6daedaa Compare May 29, 2026 16:36

github-actions Bot added size/M and removed size/S labels May 29, 2026

yuanchen8911 force-pushed the feat/rtx-pro-6000-nodewright-generic-tuning branch from 6daedaa to ee25f9b Compare May 29, 2026 16:44

yuanchen8911 marked this pull request as ready for review May 29, 2026 16:47

yuanchen8911 changed the title from "WIP: feat(recipes): apply nodewright generic tuning to rtx-pro-6000 EKS" to "feat(recipes): apply nodewright generic tuning to rtx-pro-6000 EKS" May 29, 2026

yuanchen8911 requested review from ayuskauskas and mchmarny May 29, 2026 16:47

coderabbitai Bot reviewed May 29, 2026

Comment thread recipes/components/nodewright-customizations/manifests/tuning-generic.yaml

yuanchen8911 force-pushed the feat/rtx-pro-6000-nodewright-generic-tuning branch from ee25f9b to 7ad52fa Compare May 29, 2026 17:03

ayuskauskas previously approved these changes May 29, 2026

yuanchen8911 requested a review from lockwobr May 29, 2026 17:39

mchmarny assigned yuanchen8911 May 29, 2026

yuanchen8911 dismissed ayuskauskas’s stale review via 2861d23 May 29, 2026 18:13

yuanchen8911 force-pushed the feat/rtx-pro-6000-nodewright-generic-tuning branch from 7ad52fa to 2861d23 Compare May 29, 2026 18:13

mchmarny reviewed May 29, 2026

Comment thread recipes/components/nodewright-customizations/manifests/tuning.yaml

yuanchen8911 requested a review from mchmarny May 29, 2026 18:14

yuanchen8911 enabled auto-merge (squash) May 29, 2026 18:17

mchmarny approved these changes May 29, 2026

yuanchen8911 merged commit 4daf2af into NVIDIA:main May 29, 2026
120 checks passed

mchmarny mentioned this pull request May 29, 2026

feat(cli,api): migrate CLI and REST server to consume aicr.Client #1104

Closed

23 tasks

coderabbitai Bot mentioned this pull request May 29, 2026

feat(recipes): add nodewright to bcm with reapply-on-reboot #1105

Merged

5 tasks

yuanchen8911 merged 1 commit into NVIDIA:main from yuanchen8911:feat/rtx-pro-6000-nodewright-generic-tuning
