feat(recipes): add inference-perf to gb200-eks-ubuntu-inference-dynamo by yuanchen8911 · Pull Request #977 · NVIDIA/aicr

yuanchen8911 · 2026-05-19T21:58:29Z

Summary

Add an inference-perf performance validator (with placeholder thresholds) to recipes/overlays/gb200-eks-ubuntu-inference-dynamo.yaml so aicr validate --phase performance actually runs against GB200 EKS Dynamo deployments. Today the phase resolves to zero validators and silently no-ops.

Motivation / Context

The H100 inference-dynamo siblings (h100-eks-ubuntu-inference-dynamo.yaml, h100-gke-cos-inference-dynamo.yaml) ship a validation.performance.checks: [inference-perf] block; the GB200 siblings (gb200-eks-ubuntu-inference-dynamo.yaml, gb200-oke-ubuntu-inference-dynamo.yaml) do not. Running aicr validate --phase performance against a GB200 EKS Dynamo cluster prints:

[cli] phase requested but no checks defined in recipe; phase will be empty: phase=performance
[cli] running validation phase: phase=performance catalog=4 selected=0
[cli] phase completed: phase=performance status=skipped validators=0 passed=0 failed=0

This PR closes that gap for the GB200 EKS leaf. (OKE sibling is out of scope — separate PR.)
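The silent no-op shown in the log above can be caught mechanically. The following is a minimal sketch, assuming the `phase completed` line keeps the `key=value` layout quoted in this PR; the helper names `parse_phase_summary` and `is_silent_noop` are hypothetical and not part of aicr.

```python
import re


def parse_phase_summary(line: str) -> dict[str, str]:
    """Collect every key=value pair found on a '[cli] phase completed: ...' line."""
    return dict(re.findall(r"(\w+)=(\S+)", line))


def is_silent_noop(line: str) -> bool:
    """True when the phase was skipped because no validators were selected."""
    fields = parse_phase_summary(line)
    return fields.get("status") == "skipped" and fields.get("validators") == "0"


# Line copied from the output quoted in this PR description.
skipped_line = (
    "[cli] phase completed: phase=performance status=skipped "
    "validators=0 passed=0 failed=0"
)
print(is_silent_noop(skipped_line))
```

A wrapper script could use a check like this to fail loudly when a requested phase selects zero validators, instead of treating `skipped` as success.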

Type of Change

New feature (non-breaking change that adds functionality)

Component(s) Affected

Recipe engine / data (pkg/recipe)

Implementation Notes

Mirrored shape from h100-eks-ubuntu-inference-dynamo.yaml: validation.performance.checks: [inference-perf] plus inference-throughput / inference-ttft-p99 constraints.
Thresholds are explicit placeholders, intentionally loose:
- inference-throughput >= 10000 tok/s
- inference-ttft-p99 <= 500 ms
Measured baseline on a live cluster (p6e-gb200.36xlarge, 1× GPU node = 4 GB200 GPUs, 4 vllmDecodeWorker replicas pinned to that node, Qwen/Qwen3-0.6B, concurrency=16/GPU): throughput ≈ 18,093 tok/s, TTFT p99 ≈ 219 ms. The throughput floor sits ~45% below the measured value, and the measured TTFT p99 sits ~56% below the 500 ms ceiling, deliberately leaving headroom for run-to-run variance and not-yet-tuned configurations.
Inline comment captures both the measured numbers and a non-obvious caveat: on small models like Qwen3-0.6B the per-GPU compute advantage of GB200 over H100 isn't exercised — TTFT lands higher than naive Blackwell-vs-Hopper intuition would suggest. Future maintainers tightening these thresholds should not assume "GB200 ≥ H100" without empirical re-tuning.
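The headroom percentages quoted above can be reproduced with a few lines of arithmetic. This is an illustrative sketch only: the function names are hypothetical, and the inputs are the rounded baseline figures from this description. Note that the TTFT figure is expressed relative to the 500 ms ceiling, since that threshold is an upper bound rather than a floor.

```python
def throughput_headroom(measured: float, floor: float) -> float:
    """Fraction by which the floor sits below the measured throughput."""
    return (measured - floor) / measured


def latency_headroom(measured: float, ceiling: float) -> float:
    """Fraction by which the measured latency sits below the ceiling."""
    return (ceiling - measured) / ceiling


# Rounded baseline numbers from the PR description.
tput = throughput_headroom(measured=18093.0, floor=10000.0)
ttft = latency_headroom(measured=219.0, ceiling=500.0)

print(f"throughput floor is {tput:.1%} below the measured value")
print(f"measured TTFT p99 is {ttft:.1%} below the ceiling")
```

Tightening the thresholds later amounts to choosing smaller headroom fractions once more reference runs are available.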

Testing

make qualify   # PASS — tests + lint + e2e + scan + repo-specific checks

Live cluster validation on an EKS GB200 cluster:

[cli] validator completed: name=inference-perf status=passed
[cli] Inference throughput: 18093.35 tokens/sec
[cli] Inference TTFT p99: 219.28 ms
[cli] phase completed: phase=performance status=passed validators=1 passed=1 failed=0 duration=2m20.553723083s

Both metrics are well inside the placeholder thresholds, so the validator reports status=passed.
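As a cross-check of the result above, the reported metrics can be compared against the placeholder thresholds outside of aicr. The sketch below assumes the two metric lines keep the wording shown in the log; `THRESHOLDS`, `extract_metrics`, and `check` are hypothetical names and do not reflect how the inference-perf validator is implemented.

```python
import re

# Placeholder thresholds from this PR: (comparison, bound).
THRESHOLDS = {
    "throughput": (">=", 10000.0),  # tokens/sec
    "ttft_p99": ("<=", 500.0),      # milliseconds
}

# Metric lines copied from the live-cluster output quoted above.
LOG = """\
[cli] validator completed: name=inference-perf status=passed
[cli] Inference throughput: 18093.35 tokens/sec
[cli] Inference TTFT p99: 219.28 ms
"""


def extract_metrics(log: str) -> dict[str, float]:
    """Pull the throughput and TTFT p99 values out of the CLI output."""
    metrics: dict[str, float] = {}
    match = re.search(r"Inference throughput:\s*([\d.]+)\s*tokens/sec", log)
    if match:
        metrics["throughput"] = float(match.group(1))
    match = re.search(r"Inference TTFT p99:\s*([\d.]+)\s*ms", log)
    if match:
        metrics["ttft_p99"] = float(match.group(1))
    return metrics


def check(metrics: dict[str, float]) -> dict[str, bool]:
    """Evaluate each metric against its threshold."""
    results: dict[str, bool] = {}
    for name, (op, bound) in THRESHOLDS.items():
        value = metrics[name]
        results[name] = value >= bound if op == ">=" else value <= bound
    return results


print(check(extract_metrics(LOG)))
```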

Risk Assessment

Low — Isolated change, single leaf overlay, ~17 added lines. No effect on recipes that don't resolve to gb200-eks-ubuntu-inference-dynamo. Reversible by removing the performance: block.

Rollout notes: None. New validator coverage; doesn't modify any existing constraint or deployment.

Out of scope

Adding the same block to gb200-oke-ubuntu-inference-dynamo.yaml: this should be a separate PR once an OKE GB200 test bed is accessible.
Tightening thresholds beyond the placeholder floors — gated on more reference runs being captured and published.
The DRA pre-allocation refresh observation from the cluster-side investigation (DRA's containerEdits came back empty before a DRA-pod restart) — captured separately as a follow-up to chore(recipes): make DRA-rollout trigger durable (don't rely on manual annotation bump) #973.

Checklist

Tests pass locally (make qualify)
Linter passes (make lint — included in qualify)
I did not skip/disable tests to make CI green
I added/updated tests for new functionality (the new inference-perf validator catalog entry exists; recipe-engine validation only)
Changes follow existing patterns in the codebase (mirrors H100 sibling)
Commits are cryptographically signed (git commit -S)

coderabbitai · 2026-05-19T22:00:09Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 0eda4c63-ca0d-44d5-b0a9-947f6d2eadfe

📥 Commits

Reviewing files that changed from the base of the PR and between 1ccd96f and 33b462f.

📒 Files selected for processing (1)

recipes/overlays/gb200-eks-ubuntu-inference-dynamo.yaml

📝 Walkthrough

This PR adds a validation.performance section to the GB200 EKS Ubuntu inference Dynamo overlay recipe. The change introduces an inference-perf check with placeholder performance guardrails: inference throughput constrained to >= 10000 and TTFT p99 constrained to <= 500. Inline comments note that these threshold values are awaiting empirical tuning on GB200 hardware with a Qwen/Qwen3 workload.

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~3 minutes

Possibly related issues

Close deployment-phase validation coverage gaps for accelerator-bound GPU recipes #969: Adds the missing performance gate for the gb200-eks-ubuntu-inference-dynamo overlay by introducing the inference-perf validation check referenced in that issue.

Possibly related PRs

NVIDIA/aicr#952: Both PRs align validation.performance wiring for the inference-perf check in overlay YAMLs, using identical constraint keys (inference-throughput and inference-ttft-p99) with placeholder thresholds.

Suggested labels

size/S

Suggested reviewers

mchmarny
njhensley

🚥 Pre-merge checks | ✅ 4

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title accurately summarizes the main change: adding an inference-perf performance validator to the gb200-eks-ubuntu-inference-dynamo recipe.
Description check	✅ Passed	The description provides comprehensive context about the change, including motivation, implementation details, testing results, and risk assessment.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.


coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@recipes/overlays/gb200-eks-ubuntu-inference-dynamo.yaml`:
- Around line 65-73: Update the inline placeholder comment to reconcile it with
the PR objectives by replacing the measured baseline values (throughput ≈ 16,941
tok/s, TTFT p99 ≈ 224 ms) with the PR objective numbers (throughput ≈ 18,093
tok/s, TTFT p99 ≈ 219 ms) or vice versa so both places match; additionally,
append a short note in that same comment block stating which test run or dataset
produced the chosen numbers (e.g., "measured run X" or "PR objective run Y") and
the hardware/config used (GB200, Qwen3-0.6B, concurrency=16, 1×
p6e-gb200.36xlarge, 4 vllmDecodeWorkers) to make the source of the numbers
explicit.


ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: f76ea715-4a88-4bbe-bbe1-3122b317c56a

📥 Commits

Reviewing files that changed from the base of the PR and between 2dfa5e3 and 1ccd96f.

📒 Files selected for processing (1)

recipes/overlays/gb200-eks-ubuntu-inference-dynamo.yaml

njhensley

LGTM

Add a performance-phase validator block (inference-perf) to the gb200-eks-ubuntu-inference-dynamo leaf overlay so that `aicr validate --phase performance` produces results for GB200 EKS inference deployments. Previously this leaf had only deployment and conformance phases; the performance phase resolved to zero selected validators and was a no-op.

Mirrors the existing performance block on h100-eks-ubuntu-inference-dynamo and h100-gke-cos-inference-dynamo siblings.

Thresholds are explicit placeholders, deliberately loose:
- inference-throughput >= 10000 tok/s
- inference-ttft-p99 <= 500 ms

First measured baseline on a real GB200 cluster (1× p6e-gb200.36xlarge, 4 vllmDecodeWorkers, Qwen/Qwen3-0.6B, concurrency=16/GPU): throughput ~18,000 tok/s, TTFT p99 ~220 ms. Floors are well below those to act as a smoke-test gate rather than a perf SLO, with comment guidance to tighten once reference runs are published.

The inline comment notes that on small models like Qwen3-0.6B the per-GPU compute advantage of GB200 over H100 isn't exercised: TTFT lands higher than naive Blackwell-vs-Hopper intuition suggests, so maintainers shouldn't assume "GB200 ≥ H100" thresholds before empirical tuning.

yuanchen8911 added enhancement New feature or request area/recipes labels May 19, 2026

github-actions Bot added the size/S label May 19, 2026

coderabbitai Bot reviewed May 19, 2026

Comment thread recipes/overlays/gb200-eks-ubuntu-inference-dynamo.yaml Outdated

yuanchen8911 requested review from mchmarny, njhensley and xdu31 May 19, 2026 22:04

yuanchen8911 marked this pull request as ready for review May 19, 2026 22:05

yuanchen8911 requested a review from a team as a code owner May 19, 2026 22:05

njhensley previously approved these changes May 19, 2026

yuanchen8911 dismissed njhensley’s stale review via 33b462f May 19, 2026 22:07

yuanchen8911 force-pushed the feat/gb200-eks-inference-dynamo-perf-validation branch from 1ccd96f to 33b462f Compare May 19, 2026 22:07

mchmarny approved these changes May 19, 2026

mchmarny merged commit 8b49397 into NVIDIA:main May 19, 2026
102 checks passed

yuanchen8911 mentioned this pull request May 19, 2026

chore(recipes): bump gpu-operator chart to v26.3.1 and driver to 580.126.20 #965

Merged

12 tasks

njhensley mentioned this pull request May 27, 2026

ci(evidence): warning-only recipe-evidence drift gate for PRs #1065

Merged

11 tasks
