feat(recipes): add inference-perf to gb200-eks-ubuntu-inference-dynamo#977

Merged
mchmarny merged 1 commit into NVIDIA:main from yuanchen8911:feat/gb200-eks-inference-dynamo-perf-validation
May 19, 2026

@yuanchen8911
Contributor

Summary

Add an inference-perf performance validator (with placeholder thresholds) to recipes/overlays/gb200-eks-ubuntu-inference-dynamo.yaml so aicr validate --phase performance actually runs against GB200 EKS Dynamo deployments. Today the phase resolves to zero validators and silently no-ops.

Motivation / Context

The H100 inference-dynamo siblings (h100-eks-ubuntu-inference-dynamo.yaml, h100-gke-cos-inference-dynamo.yaml) ship a validation.performance.checks: [inference-perf] block; the GB200 siblings (gb200-eks-ubuntu-inference-dynamo.yaml, gb200-oke-ubuntu-inference-dynamo.yaml) do not. Running aicr validate --phase performance against a GB200 EKS Dynamo cluster prints:

[cli] phase requested but no checks defined in recipe; phase will be empty: phase=performance
[cli] running validation phase: phase=performance catalog=4 selected=0
[cli] phase completed: phase=performance status=skipped validators=0 passed=0 failed=0

This PR closes that gap for the GB200 EKS leaf. (OKE sibling is out of scope — separate PR.)

Type of Change

  • New feature (non-breaking change that adds functionality)

Component(s) Affected

  • Recipe engine / data (pkg/recipe)

Implementation Notes

  • Mirrored shape from h100-eks-ubuntu-inference-dynamo.yaml: validation.performance.checks: [inference-perf] plus inference-throughput / inference-ttft-p99 constraints.
  • Thresholds are explicit placeholders, intentionally loose:
    • inference-throughput >= 10000 tok/s
    • inference-ttft-p99 <= 500 ms
  • Measured baseline on a live cluster (p6e-gb200.36xlarge, 1× GPU node = 4 GB200 GPUs, 4 vllmDecodeWorker replicas pinned to that node, Qwen/Qwen3-0.6B, concurrency=16/GPU): throughput ≈ 18,093 tok/s, TTFT p99 ≈ 219 ms. The throughput floor sits ~45% below the measured value, and the measured TTFT p99 sits ~56% below the 500 ms ceiling, deliberately leaving headroom for run-to-run variance and not-yet-tuned configurations.
  • Inline comment captures both the measured numbers and a non-obvious caveat: on small models like Qwen3-0.6B the per-GPU compute advantage of GB200 over H100 isn't exercised — TTFT lands higher than naive Blackwell-vs-Hopper intuition would suggest. Future maintainers tightening these thresholds should not assume "GB200 ≥ H100" without empirical re-tuning.
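The headroom percentages quoted in the notes above can be reproduced with a few lines of arithmetic. The following is a minimal sketch, not part of the recipe or of aicr; the helper name `percent_below` is hypothetical. Because the TTFT threshold is an upper bound, the ~56% figure is the distance of the measurement below the 500 ms ceiling, while the ~45% figure is the distance of the throughput floor below the measurement.

```python
# Measured baseline and placeholder thresholds, copied from the notes above.
measured_throughput = 18093.35  # tok/s
measured_ttft_p99 = 219.28      # ms
throughput_floor = 10000.0      # tok/s, constraint: inference-throughput >= 10000
ttft_ceiling = 500.0            # ms, constraint: inference-ttft-p99 <= 500


def percent_below(value: float, reference: float) -> float:
    """Return how far `value` sits below `reference`, as a percentage of `reference`.

    Hypothetical helper for illustration only.
    """
    return (reference - value) / reference * 100.0


# Throughput: the floor is compared against the measured value.
floor_vs_measured = percent_below(throughput_floor, measured_throughput)

# TTFT: the measured value is compared against the ceiling.
measured_vs_ceiling = percent_below(measured_ttft_p99, ttft_ceiling)

print(f"throughput floor is {floor_vs_measured:.1f}% below the measured value")
print(f"measured TTFT p99 is {measured_vs_ceiling:.1f}% below the ceiling")
```

Re-running this with fresh reference numbers is a quick way to see how much slack remains before tightening either threshold.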

Testing

make qualify   # PASS — tests + lint + e2e + scan + repo-specific checks

Live cluster validation on an EKS GB200 cluster:

[cli] validator completed: name=inference-perf status=passed
[cli] Inference throughput: 18093.35 tokens/sec
[cli] Inference TTFT p99: 219.28 ms
[cli] phase completed: phase=performance status=passed validators=1 passed=1 failed=0 duration=2m20.553723083s

Both metrics are well inside the placeholder thresholds, and the validator status is passed.
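To make the pass condition concrete, here is a small sketch that extracts the two metrics from the log lines above and compares them with the placeholder constraints. It is an illustration only and is not how aicr evaluates constraints internally; the `THRESHOLDS` table and the functions `parse_metrics` and `check` are hypothetical names introduced for this example.

```python
import re

# Log lines copied from the live-cluster run above.
LOG = """\
[cli] validator completed: name=inference-perf status=passed
[cli] Inference throughput: 18093.35 tokens/sec
[cli] Inference TTFT p99: 219.28 ms
"""

# Placeholder constraints from this PR, expressed as (operator, limit).
# The dictionary layout is an assumption made for this sketch.
THRESHOLDS = {
    "inference-throughput": (">=", 10000.0),  # tok/s
    "inference-ttft-p99": ("<=", 500.0),      # ms
}

# One regular expression per metric, keyed by constraint name.
PATTERNS = {
    "inference-throughput": r"Inference throughput: ([\d.]+) tokens/sec",
    "inference-ttft-p99": r"Inference TTFT p99: ([\d.]+) ms",
}


def parse_metrics(log: str) -> dict:
    """Pull the numeric metric values out of the CLI log text."""
    metrics = {}
    for name, pattern in PATTERNS.items():
        match = re.search(pattern, log)
        if match:
            metrics[name] = float(match.group(1))
    return metrics


def check(metrics: dict, thresholds: dict) -> dict:
    """Return a pass/fail flag per constraint; a missing metric counts as a failure."""
    results = {}
    for name, (op, limit) in thresholds.items():
        value = metrics.get(name)
        if value is None:
            results[name] = False
        elif op == ">=":
            results[name] = value >= limit
        else:
            results[name] = value <= limit
    return results


metrics = parse_metrics(LOG)
results = check(metrics, THRESHOLDS)
print(results)
```

Treating a missing metric as a failure is a design choice of this sketch, so that a log without measurements does not pass unnoticed.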

Risk Assessment

  • Low — Isolated change, single leaf overlay, ~17 added lines. No effect on recipes that don't resolve to gb200-eks-ubuntu-inference-dynamo. Reversible by removing the performance: block.

Rollout notes: None. New validator coverage; doesn't modify any existing constraint or deployment.

Out of scope

  • Adding the same block to gb200-oke-ubuntu-inference-dynamo.yaml — should be a separate PR after access to an OKE GB200 test bed.
  • Tightening thresholds beyond the placeholder floors — gated on more reference runs being captured and published.
  • The DRA pre-allocation refresh observation from the cluster-side investigation (DRA's containerEdits came back empty before a DRA-pod restart) — captured separately as a follow-up to chore(recipes): make DRA-rollout trigger durable (don't rely on manual annotation bump) #973.

Checklist

  • Tests pass locally (make qualify)
  • Linter passes (make lint — included in qualify)
  • I did not skip/disable tests to make CI green
  • I added/updated tests for new functionality (the new inference-perf validator catalog entry exists; recipe-engine validation only)
  • Changes follow existing patterns in the codebase (mirrors H100 sibling)
  • Commits are cryptographically signed (git commit -S)

@yuanchen8911 yuanchen8911 added enhancement New feature or request area/recipes labels May 19, 2026
@coderabbitai

coderabbitai Bot commented May 19, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 0eda4c63-ca0d-44d5-b0a9-947f6d2eadfe

📥 Commits

Reviewing files that changed from the base of the PR and between 1ccd96f and 33b462f.

📒 Files selected for processing (1)
  • recipes/overlays/gb200-eks-ubuntu-inference-dynamo.yaml

📝 Walkthrough

This PR adds a validation.performance section to the GB200 EKS Ubuntu inference Dynamo overlay recipe. The change introduces an inference-perf check with placeholder performance guardrails: inference throughput constrained to >= 10000 and TTFT p99 constrained to <= 500. Inline comments note that these threshold values are awaiting empirical tuning on GB200 hardware with a Qwen/Qwen3 workload.

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~3 minutes

Possibly related PRs

  • NVIDIA/aicr#952: Both PRs align validation.performance wiring for the inference-perf check in overlay YAMLs, using identical constraint keys (inference-throughput and inference-ttft-p99) with placeholder thresholds.

Suggested labels

size/S

Suggested reviewers

  • mchmarny
  • njhensley

🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title accurately summarizes the main change: adding an inference-perf performance validator to the gb200-eks-ubuntu-inference-dynamo recipe. |
| Description check | ✅ Passed | The description provides comprehensive context about the change, including motivation, implementation details, testing results, and risk assessment. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@recipes/overlays/gb200-eks-ubuntu-inference-dynamo.yaml`:
- Around line 65-73: Update the inline placeholder comment to reconcile it with
the PR objectives by replacing the measured baseline values (throughput ≈ 16,941
tok/s, TTFT p99 ≈ 224 ms) with the PR objective numbers (throughput ≈ 18,093
tok/s, TTFT p99 ≈ 219 ms) or vice versa so both places match; additionally,
append a short note in that same comment block stating which test run or dataset
produced the chosen numbers (e.g., "measured run X" or "PR objective run Y") and
the hardware/config used (GB200, Qwen3-0.6B, concurrency=16, 1×
p6e-gb200.36xlarge, 4 vllmDecodeWorkers) to make the source of the numbers
explicit.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: f76ea715-4a88-4bbe-bbe1-3122b317c56a

📥 Commits

Reviewing files that changed from the base of the PR and between 2dfa5e3 and 1ccd96f.

📒 Files selected for processing (1)
  • recipes/overlays/gb200-eks-ubuntu-inference-dynamo.yaml

Comment thread recipes/overlays/gb200-eks-ubuntu-inference-dynamo.yaml Outdated
@yuanchen8911 yuanchen8911 marked this pull request as ready for review May 19, 2026 22:05
@yuanchen8911 yuanchen8911 requested a review from a team as a code owner May 19, 2026 22:05
njhensley
njhensley previously approved these changes May 19, 2026
Member

@njhensley njhensley left a comment

LGTM

Add a performance-phase validator block (inference-perf) to the
gb200-eks-ubuntu-inference-dynamo leaf overlay so that
`aicr validate --phase performance` produces results for GB200 EKS
inference deployments. Previously this leaf had only deployment and
conformance phases; the performance phase resolved to zero selected
validators and was a no-op.

Mirrors the existing performance block on the
h100-eks-ubuntu-inference-dynamo and h100-gke-cos-inference-dynamo
siblings. Thresholds are explicit placeholders, deliberately loose:

  inference-throughput >= 10000 tok/s
  inference-ttft-p99   <= 500 ms

First measured baseline on a real GB200 cluster (1× p6e-gb200.36xlarge,
4 vllmDecodeWorkers, Qwen/Qwen3-0.6B, concurrency=16/GPU): throughput
~18,000 tok/s, TTFT p99 ~220 ms. Floors are well below those to act
as a smoke-test gate rather than a perf SLO, with comment guidance to
tighten once reference runs are published.

The inline comment notes that on small models like Qwen3-0.6B the
per-GPU compute advantage of GB200 over H100 isn't exercised — TTFT
lands higher than naive Blackwell-vs-Hopper intuition suggests, so
maintainers shouldn't assume "GB200 ≥ H100" thresholds before
empirical tuning.
@yuanchen8911 yuanchen8911 force-pushed the feat/gb200-eks-inference-dynamo-perf-validation branch from 1ccd96f to 33b462f Compare May 19, 2026 22:07
@mchmarny mchmarny merged commit 8b49397 into NVIDIA:main May 19, 2026
102 checks passed