feat(recipes): apply nodewright generic tuning to rtx-pro-6000 EKS#1101
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: ASSERTIVE Plan: Enterprise Run ID: 📒 Files selected for processing (6)
📝 WalkthroughWalkthroughAdds a Helm template recipes/components/nodewright-customizations/manifests/tuning-generic.yaml that conditionally renders a Skyhook tuning CR with a pinned nvidia-tuned package and templated configMap (intent, accelerator, optional service). Changes default additionalTolerations fallback in tuning, tuning-gke, and no-op templates to a tolerate-all toleration. Updates recipes/overlays/rtx-pro-6000-eks-inference.yaml to add gpu-operator -> nodewright-customizations dependency and a nodewright-customizations componentRef pinned to the tuning manifest with overrides service: eks, accelerator: generic, intent: inference and dependency on nodewright-operator. Also adds explanatory comments in h200-eks-training.yaml. Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Possibly related PRs
Suggested reviewers
🚥 Pre-merge checks | ✅ 4✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
9011b89 to
6daedaa
Compare
6daedaa to
ee25f9b
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@recipes/components/nodewright-customizations/manifests/tuning-generic.yaml`:
- Around line 54-60: The fallback toleration in tuning-generic.yaml under
additionalTolerations uses "- key: dedicated, operator: Exists" which differs
from the compiled default tolerations used by
DefaultTolerations()/ParseTolerations() and expected by
assert-tuning-defaults.yaml; update the Helm template block that checks
$cust.acceleratedTolerations (the conditional in tuning-generic.yaml referencing
additionalTolerations and $cust.acceleratedTolerations) to emit a toleration
with only "operator: Exists" (no key) as the else/fallback so the template
matches the "tolerate all" default behavior.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: b39bfcdf-b5c0-425b-97fe-9322fb924a7c
📒 Files selected for processing (2)
recipes/components/nodewright-customizations/manifests/tuning-generic.yamlrecipes/overlays/rtx-pro-6000-eks-inference.yaml
ee25f9b to
7ad52fa
Compare
The rtx-pro-6000 EKS inference overlays deployed nodewright-operator but no nodewright-customizations, so no tuning profile was applied. h100/gb200/b200 all wire it at the <accelerator>-<service>-<intent> level; rtx-pro-6000 was the gap. The shared tuning.yaml can't be reused here: it renders nvidia-setup, which only ships eks-h100/eks-gb200 configs and fails on any other (service, accelerator) combo — so accelerator=generic would fail nodewright-customizations and block gpu-operator. Instead add a dedicated tuning-generic.yaml that runs ONLY nvidia-tuned with the generic profile (no nvidia-setup), and wire it into rtx-pro-6000-eks-inference (ubuntu/dynamo/nim leaves inherit it). nvidia-tuned's generic profile is self-contained baseline GPU tuning; nvidia-setup's kernel/EFA steps are unsupported for generic and unneeded on these PCIe RTX PRO 6000 nodes (no NVLink fabric; platform provisions the kernel; EFA unused). Also align the additionalTolerations fallback across all nodewright manifests (tuning.yaml, tuning-gke.yaml, no-op.yaml, tuning-generic.yaml) to the tolerate-all default 'operator: Exists' (no key), matching ParseTolerations()/DefaultTolerations() and assert-tuning-defaults.yaml. This fallback only fires when no acceleratedTolerations are injected; the default/flag path is unchanged. Document the inherited NCCL all-reduce floor in h200-eks-training.yaml (review nit from NVIDIA#1102): >= 300 is h100-sourced and loose for H200's HBM3e bandwidth — flagged to tighten once empirical H200/EFA numbers exist. Verified: rendered Skyhook CR contains only nvidia-tuned (0.3.0) with accelerator=generic and no nvidia-setup; injected toleration path unchanged; yamllint + pkg/recipe tests pass.
7ad52fa to
2861d23
Compare
mchmarny
left a comment
There was a problem hiding this comment.
The substance is good: tuning-generic.yaml is the right factoring (nvidia-setup only ships eks-h100 / eks-gb200 configs, so reusing tuning.yaml for accelerator=generic would fail), the inline justifications at both the call site and the manifest header are clear, and the "Alternatives Considered" section in the body is well-reasoned. CI green across all 118 checks. nvidia-tuned:0.3.0 already in the BOM, no regen needed.
One medium-priority observation inline on tuning.yaml: the default-toleration change in three existing manifests (tuning.yaml, tuning-gke.yaml, no-op.yaml) is correct but isn't disclosed in the PR description. It affects every recipe rendering these manifests — h100, gb200, b200, b40, GKE — not just rtx-pro-6000. Asking for either a one-sentence callout in the description or a split into a follow-up cleanup PR. Not blocking the substance.
Summary
Wire
nodewright-customizationswithaccelerator=genericinto the rtx-pro-6000 EKS inference overlay, so RTX PRO 6000 nodes get a baseline GPU node-tuning profile — closing a gap vsh100/gb200/b200, which already wire tuning.Motivation / Context
The
rtx-pro-6000EKS overlays deployed onlynodewright-operator(the controller) with nonodewright-customizations, so aicr applied no tuning profile (left entirely to the platform).nvidia-tunedhas nortx-pro-6000profile; thegenericprofile is purpose-built for single-GPU cloud VMs and deliberately omits the NVLink/InfiniBand/hugepage/CPU-isolation settings theh100/gb200profiles carry. It already ships innvidia-tuned0.3.0 — the version the tuning manifest pins.Fixes: N/A
Related: NVIDIA/nodewright-packages#37 (adds the
genericprofile)Type of Change
Component(s) Affected
pkg/recipe) —recipes/overlaysImplementation Notes
nodewright-customizationscomponentRef tortx-pro-6000-eks-inference.yaml— the<accelerator>-<service>-<intent>level, mirroringh100-eks-inference/gb200-eks-inference. The-ubuntu-inference,-dynamo, and-nimleaves inherit it via overlay merge (the established convention; leaves never re-declare tuning).overrides.accelerator=generic(the recipe criteria staysrtx-pro-6000); addednodewright-customizationsto thegpu-operatordependencyRefsfor ordering, same as h100.Alternatives Considered
Option 1 — no-op manifest (
no-op.yaml): wirenodewright-customizationsas a no-op placeholder (ashellscriptpackage that just echoes). This satisfies thegpu-operatordependency and keeps the "customization present" pattern, but applies no tuning — leaving node tuning entirely to the platform/NCP (the status quo on b40). Rejected in favor of Option 2 (tuning-generic.yaml, which actually applies thenvidia-tunedgenericprofile). The no-op remains the safe fallback if the nodewright team determines RTX PRO 6000 should not runnvidia-tunedstandalone — e.g. if it must runnvidia-setup's kernel/EFA steps first (which today support onlyeks-h100/eks-gb200).Additional Changes (cross-cutting — disclosed for the next reader)
Beyond the rtx-pro-6000 tuning, this PR includes two changes that touch shared/other files:
Toleration-default consistency fix (affects all recipes, not just rtx-pro-6000). The
additionalTolerationsfallback intuning.yaml,tuning-gke.yaml, andno-op.yaml(and the newtuning-generic.yaml) changes from[{ key: dedicated, operator: Exists }]to the tolerate-all[{ operator: Exists }]. This matchesParseTolerations()/DefaultTolerations()and theassert-tuning-defaults.yamlchainsaw fixture; the previous templates were inconsistent (subset matching masked it), so this closes a real gap. It affects every recipe rendering these manifests — h100, gb200, b200, b40, GKE — but only the fallback branch (when noacceleratedTolerationsare injected from scheduling defaults/flags); the default/flag path is unchanged. (Raised by CodeRabbit and @mchmarny.)h200-eks-trainingNCCL-floor doc. A comment noting the inheritednccl-all-reduce-bw >= 300floor is h100-sourced and loose for H200's HBM3e bandwidth — flagged to tighten once empirical H200/EFA numbers exist. (Addresses a review nit from feat(recipes): add nodewright h100 tuning to H200 EKS recipes #1102.)Testing
yamllintclean;go test ./pkg/recipe/...pass; no golden fixtures to regenerate.eks/rtx-pro-6000/ubuntu/inference/dynamorecipe + bundle: the recipe gainsnodewright-customizations, and the rendered Skyhook CR setsaccelerator: genericagainstnvidia-tuned0.3.0 (the profile that shipsgeneric).Risk Assessment
Rollout notes: New deploys of rtx-pro-6000 EKS recipes will now apply the
nvidia-generictuned profile via nodewright. No migration.Checklist
go test ./pkg/recipe/...)yamllint)git commit -S)