feat(recipes): add inference-perf to gb200-eks-ubuntu-inference-dynamo #977
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@recipes/overlays/gb200-eks-ubuntu-inference-dynamo.yaml`:
- Around line 65-73: Update the inline placeholder comment to reconcile it with
the PR objectives by replacing the measured baseline values (throughput ≈ 16,941
tok/s, TTFT p99 ≈ 224 ms) with the PR objective numbers (throughput ≈ 18,093
tok/s, TTFT p99 ≈ 219 ms) or vice versa so both places match; additionally,
append a short note in that same comment block stating which test run or dataset
produced the chosen numbers (e.g., "measured run X" or "PR objective run Y") and
the hardware/config used (GB200, Qwen3-0.6B, concurrency=16, 1×
p6e-gb200.36xlarge, 4 vllmDecodeWorkers) to make the source of the numbers
explicit.
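As a quick sanity check on the discrepancy flagged above, the sketch below compares the two sets of figures quoted in this thread against the placeholder thresholds from the PR description. The dictionary layout, labels, and function names are illustrative assumptions, not part of `aicr` or the validator.

```python
# Placeholder thresholds as stated in the PR description.
THROUGHPUT_FLOOR_TOK_S = 10_000
TTFT_P99_CEILING_MS = 500

# The two sets of figures quoted in this thread (labels are illustrative).
runs = {
    "inline comment": {"throughput_tok_s": 16_941, "ttft_p99_ms": 224},
    "PR description": {"throughput_tok_s": 18_093, "ttft_p99_ms": 219},
}


def passes(run):
    """True when a run clears both placeholder thresholds."""
    return (
        run["throughput_tok_s"] >= THROUGHPUT_FLOOR_TOK_S
        and run["ttft_p99_ms"] <= TTFT_P99_CEILING_MS
    )


def relative_gap(a, b):
    """Difference of a from b, expressed as a fraction of b."""
    return (a - b) / b


results = {label: passes(run) for label, run in runs.items()}

throughput_gap = relative_gap(
    runs["inline comment"]["throughput_tok_s"],
    runs["PR description"]["throughput_tok_s"],
)
ttft_gap = relative_gap(
    runs["inline comment"]["ttft_p99_ms"],
    runs["PR description"]["ttft_p99_ms"],
)

for label, ok in results.items():
    print(f"{label}: passes={ok}")
print(f"throughput gap: {throughput_gap:+.1%}, TTFT p99 gap: {ttft_gap:+.1%}")
```

Both sets of figures clear the placeholder thresholds and differ by roughly 6% in throughput and 2% in TTFT p99, so the mismatch is a documentation-consistency issue rather than one that changes the gate outcome.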
Add a performance-phase validator block (inference-perf) to the gb200-eks-ubuntu-inference-dynamo leaf overlay so that `aicr validate --phase performance` produces results for GB200 EKS inference deployments. Previously this leaf had only deployment and conformance phases; the performance phase resolved to zero selected validators and was a no-op.

Mirrors the existing performance block on the h100-eks-ubuntu-inference-dynamo and h100-gke-cos-inference-dynamo siblings.

Thresholds are explicit placeholders, deliberately loose:

    inference-throughput >= 10000 tok/s
    inference-ttft-p99 <= 500 ms

First measured baseline on a real GB200 cluster (1× p6e-gb200.36xlarge, 4 vllmDecodeWorkers, Qwen/Qwen3-0.6B, concurrency=16/GPU): throughput ~18,000 tok/s, TTFT p99 ~220 ms. The throughput floor is well below the measured value and the TTFT ceiling is well above it, so the thresholds act as a smoke-test gate rather than a perf SLO, with comment guidance to tighten once reference runs are published.

The inline comment notes that on small models like Qwen3-0.6B the per-GPU compute advantage of GB200 over H100 isn't exercised. TTFT lands higher than naive Blackwell-vs-Hopper intuition suggests, so maintainers shouldn't assume "GB200 ≥ H100" thresholds before empirical tuning.
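The margin between the placeholder thresholds and the approximate baseline quoted in this commit message can be worked out as follows. This is a minimal arithmetic sketch; the variable names are illustrative, and the inputs are the rounded figures from the message above.

```python
# Placeholder thresholds from the commit message.
throughput_floor = 10_000    # tok/s, lower bound
ttft_p99_ceiling = 500       # ms, upper bound

# Approximate baseline from the commit message (rounded values).
measured_throughput = 18_000  # tok/s
measured_ttft_p99 = 220       # ms

# How far the throughput floor sits below the measured throughput.
throughput_margin = 1 - throughput_floor / measured_throughput

# How far the measured TTFT p99 sits below the ceiling.
ttft_margin = 1 - measured_ttft_p99 / ttft_p99_ceiling

print(f"throughput floor is {throughput_margin:.0%} below the baseline")
print(f"baseline TTFT p99 is {ttft_margin:.0%} below the ceiling")
```

Note the asymmetry: throughput is gated by a lower bound and TTFT by an upper bound, so the two margins are measured in opposite directions.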
1ccd96f to 33b462f
Summary

Add an `inference-perf` performance validator (with placeholder thresholds) to `recipes/overlays/gb200-eks-ubuntu-inference-dynamo.yaml` so `aicr validate --phase performance` actually runs against GB200 EKS Dynamo deployments. Today the phase resolves to zero validators and silently no-ops.

Motivation / Context

The H100 inference-dynamo siblings (`h100-eks-ubuntu-inference-dynamo.yaml`, `h100-gke-cos-inference-dynamo.yaml`) ship a `validation.performance.checks: [inference-perf]` block; the GB200 siblings (`gb200-eks-ubuntu-inference-dynamo.yaml`, `gb200-oke-ubuntu-inference-dynamo.yaml`) do not. Running `aicr validate --phase performance` against a GB200 EKS Dynamo cluster prints:

This PR closes that gap for the GB200 EKS leaf. (The OKE sibling is out of scope and belongs in a separate PR.)
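The coverage gap described above can be illustrated with a small sketch that looks up the `validation.performance.checks` key path in each overlay. The dictionaries here are hypothetical stand-ins that mirror only that key path, not the real overlay contents, and `performance_checks` is an illustrative helper, not an `aicr` function.

```python
def performance_checks(overlay):
    """Return the list under validation.performance.checks, or [] if absent."""
    validation = overlay.get("validation") or {}
    performance = validation.get("performance") or {}
    return performance.get("checks") or []


# Hypothetical overlay contents before this PR; only the key path is modeled.
overlays = {
    "h100-eks-ubuntu-inference-dynamo": {
        "validation": {"performance": {"checks": ["inference-perf"]}}
    },
    "h100-gke-cos-inference-dynamo": {
        "validation": {"performance": {"checks": ["inference-perf"]}}
    },
    "gb200-eks-ubuntu-inference-dynamo": {"validation": {}},
    "gb200-oke-ubuntu-inference-dynamo": {"validation": {}},
}

# Overlays whose performance phase would select zero validators.
missing = sorted(
    name for name, overlay in overlays.items() if not performance_checks(overlay)
)

for name in missing:
    print(f"{name}: no performance checks")
```

Under these assumptions, only the two GB200 overlays are reported; this PR would remove the EKS one from that list and leave the OKE one for a later change.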
Type of Change
Component(s) Affected
pkg/recipe)

Implementation Notes
- Mirrors `h100-eks-ubuntu-inference-dynamo.yaml`: `validation.performance.checks: [inference-perf]` plus `inference-throughput` / `inference-ttft-p99` constraints.
- Thresholds are explicit placeholders:
  - `inference-throughput >= 10000 tok/s`
  - `inference-ttft-p99 <= 500 ms`
- First measured baseline (`p6e-gb200.36xlarge`, 1× GPU node = 4 GB200 GPUs, 4 `vllmDecodeWorker` replicas pinned to that node, `Qwen/Qwen3-0.6B`, `concurrency=16`/GPU): throughput ≈ 18,093 tok/s, TTFT p99 ≈ 219 ms. The throughput floor sits ~45% below the measured throughput, and the measured TTFT p99 sits ~56% below the 500 ms ceiling, deliberately giving headroom for run-to-run variance and not-yet-tuned configurations.

Testing

    make qualify   # PASS: tests + lint + e2e + scan + repo-specific checks

Live cluster validation on an EKS GB200 cluster:

Both metrics are well inside the placeholder thresholds; status `passed`.

Risk Assessment

gb200-eks-ubuntu-inference-dynamo. Reversible by removing the `performance:` block.

Rollout notes: None. New validator coverage; doesn't modify any existing constraint or deployment.
Out of scope

- `gb200-oke-ubuntu-inference-dynamo.yaml`: should be a separate PR after access to an OKE GB200 test bed.
- `containerEdits` came back empty before a DRA-pod restart), captured separately as a follow-up to chore(recipes): make DRA-rollout trigger durable (don't rely on manual annotation bump) #973.

Checklist

- `make qualify`)
- `make lint`, included in qualify)
- `inference-perf` validator catalog entry exists; recipe-engine validation only)
- `git commit -S`)