fix(validators): update and tune inference performance validation #1196
📝 Walkthrough
Adds reproducibility flags and constants to the AIPerf inference benchmark runner (fixed random seed, pinned synthetic dataset size, zero input/output token stddev, forced greedy decoding) and documents the determinism methodology. In parallel, relaxes the inference TTFT p99 validation ceiling from <= 1000 ms to <= 2000 ms in three recipe overlays and three documentation examples/sections; adds a contributor investigation doc describing observed validator fluctuations and mitigations.
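The reproducibility flags summarized above could be assembled roughly as follows. This is a minimal sketch, not the PR's actual code: the flag names are the ones listed in the PR description, while the function name `deterministicAIPerfArgs` and the seed and dataset-size values are hypothetical placeholders.

```go
package main

import (
	"fmt"
	"strconv"
)

// Hypothetical values for illustration only; the PR text does not state
// the seed or dataset size that the validator actually pins.
const (
	randomSeed        = 42
	numDatasetEntries = 1000
)

// deterministicAIPerfArgs returns the workload-pinning flags named in the
// PR description so that two runs generate the same synthetic workload.
func deterministicAIPerfArgs() []string {
	return []string{
		"--random-seed", strconv.Itoa(randomSeed),
		// Zero stddev pins every prompt and response to the mean length.
		"--prompt-input-tokens-stddev", "0",
		"--prompt-output-tokens-stddev", "0",
		"--num-dataset-entries", strconv.Itoa(numDatasetEntries),
		// temperature:0 requests greedy decoding.
		"--extra-inputs", "temperature:0",
	}
}

func main() {
	fmt.Println(deterministicAIPerfArgs())
}
```

Keeping the pinned flags in one function makes it easy to unit-test that none of them is dropped when the benchmark command line changes.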
⚠️ Outside diff range comments (1)
docs/user/validation.md (1)
205-217: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win
Update the sample CTRF TTFT constraint to match the new documented gate.
Line 216 still shows `<= 1000`, which now conflicts with the updated `<= 2000` examples and can confuse operators comparing outputs. In the sample block containing the "stdout" array, change "TTFT p99 constraint: <= 1000 → PASS" to "TTFT p99 constraint: <= 2000 → PASS".
📒 Files selected for processing (7)
- docs/contributor/validator.md
- docs/integrator/recipe-development.md
- docs/user/validation.md
- recipes/overlays/h100-eks-ubuntu-inference-dynamo.yaml
- recipes/overlays/h100-gke-cos-inference-dynamo.yaml
- recipes/overlays/rtx-pro-6000-eks-ubuntu-inference-dynamo.yaml
- validators/performance/inference_perf_constraint.go
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@docs/contributor/inference-perf-fluctuation.md`:
- Around line 110-114: Add language identifiers to the two fenced code blocks:
the block beginning with "worker Running Waiting GPU-util clock
throttle power" and the block beginning with "Node CPU PSI (some avg10): 0.00
– 1.54 ≈ 0 (no CPU stall)"; replace the opening triple backticks with
language-tagged fences (e.g., ```text) for both occurrences (also apply the same
change to the other occurrence at lines 147-151) so the markdown linter (MD040)
recognizes the blocks' language.
📒 Files selected for processing (9)
- docs/contributor/inference-perf-fluctuation.md
- docs/contributor/validator.md
- docs/index.yml
- docs/integrator/recipe-development.md
- docs/user/validation.md
- recipes/overlays/h100-eks-ubuntu-inference-dynamo.yaml
- recipes/overlays/h100-gke-cos-inference-dynamo.yaml
- recipes/overlays/rtx-pro-6000-eks-ubuntu-inference-dynamo.yaml
- validators/performance/inference_perf_constraint.go
…t fix

The inference-perf gate produced false negatives on healthy clusters from two unrelated causes: TTFT-p99 knee jitter, and a serve-readiness probe timeout that was shorter than cold-start first-token latency. This makes the verdict a function of deployment health, not noise or warmup timing.

1. Relax the TTFT p99 target to "<= 2000" across the inference overlays (h100-eks, h100-gke, rtx-pro-6000-eks; gb200-eks already 2000). 2000 ms passes healthy runs (708-1670 ms observed) while still catching genuine stalls (9-45 s) by 5-20x, so it stays discriminating -- matching the precedent gb200-eks already set.

2. Pin AIPerf workload generation for run-to-run reproducibility: --random-seed, --prompt-input-tokens-stddev 0, --prompt-output-tokens-stddev 0, --num-dataset-entries, and --extra-inputs temperature:0 (greedy decoding -- AIPerf's recommended way to get deterministic output without ignore_eos).

3. Raise the serve-readiness probe timeout 30s -> 120s (InferenceEndpointProbeTimeout). A fresh worker's first inference captures CUDA graphs / JIT-warms kernels (~42 s measured on RTX PRO 6000); the generic 30 s HTTPClientTimeout cancelled that legitimate first request mid-warmup, so the probe never succeeded and the phase failed with "timed out waiting for inference endpoint to serve requests" -- the same outer symptom as the (fixed) NVIDIA#1192 discovery panic but a different root cause. Validated on RTX PRO 6000: 74,833 tok/s / 459 ms PASS with the fix. AIPerf's own warmup absorbs steady-state once the probe passes.

Also adds AICR_INFERENCE_PERF_NO_CLEANUP=1 (debug, shell-env forwarded to the inference-perf pod): leaves the namespace/DGD/workers/frontend/AIPerf Job in place for post-mortem of a failed run.

Docs:
- New contributor investigation report docs/contributor/inference-perf-fluctuation.md (symptom, full test setup + results, key findings, mitigations incl. rejected ones, the serve-wait cold-start finding (4.8), proposed long-term solution), wired into docs/index.yml.
- validator.md: "Methodology" subsection + cold-start probe timeout + no-cleanup.
- validation.md: determinism note, TTFT example -> 2000, AICR_INFERENCE_PERF_NO_CLEANUP.
- TTFT-example consistency: recipe-development.md, data-flow.md, cli-reference.md.
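To make the margin argument for the relaxed ceiling concrete, the sketch below evaluates both the old and new targets against the latencies quoted in the commit message (healthy 708-1670 ms, stalls 9-45 s). It is illustrative only: the helper `passes` is hypothetical, handles only the `<=` operator, and is not the validator's actual constraint parser.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// passes reports whether observedMs satisfies a "<= N" ceiling.
// Hypothetical helper: only the "<=" operator is supported here.
func passes(constraint string, observedMs float64) (bool, error) {
	c := strings.TrimSpace(constraint)
	if !strings.HasPrefix(c, "<=") {
		return false, fmt.Errorf("unsupported constraint %q", constraint)
	}
	limit, err := strconv.ParseFloat(strings.TrimSpace(strings.TrimPrefix(c, "<=")), 64)
	if err != nil {
		return false, fmt.Errorf("bad limit in %q: %w", constraint, err)
	}
	return observedMs <= limit, nil
}

func main() {
	// Latencies quoted in the commit message, in milliseconds.
	observed := []float64{708, 1670, 9000, 45000}
	for _, constraint := range []string{"<= 1000", "<= 2000"} {
		for _, ms := range observed {
			ok, err := passes(constraint, ms)
			if err != nil {
				fmt.Println("error:", err)
				continue
			}
			fmt.Printf("%-8s observed=%6.0f ms pass=%t\n", constraint, ms, ok)
		}
	}
}
```

With the old ceiling the upper end of the healthy range fails, while the new ceiling passes the whole healthy range and still rejects both stall values.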
Summary
Make the `inference-perf` validation gate reproducible and robust to cold start, and document the investigation behind it. Three changes remove distinct false negatives, plus a debug aid and a contributor report.
Motivation / Context
The gate failed on healthy clusters from two unrelated causes:
- TTFT-p99 knee jitter. The `1000 ms` target left almost no margin (baseline ≈688 ms) → healthy deployments failed on latency jitter. Inputs were also non-deterministic (no seed, variable output length), so the verdict partly reflected RNG.
- Serve-readiness probe timeout shorter than cold-start first-token latency. A fresh worker's first inference captures CUDA graphs / JIT-warms kernels (~42 s measured on RTX PRO 6000). The probe used the generic 30 s `HTTPClientTimeout`, which cancelled that legitimate first request mid-warmup, so the probe never succeeded and the phase failed with `timed out waiting for inference endpoint to serve requests`. This is the same outer symptom as the (fixed) #1192 discovery panic (inference-perf: validator times out on healthy cluster when dynamo frontend discovery bootstrap races (false negative)) but a different root cause: discovery was healthy (verified: genuine 1.0.2 frontend by image digest, `/v1/models` populated, 24 discovery instances, no `Unfold` panic). It is intermittent because RTX cold-start straddles 30 s; H100/GB200 stay under it.

Related: #1192 (discovery panic, fixed by #1193), #1193 (dynamo 0.9.0→1.0.2 runtime bump), #1194 (image-reference SSOT).
Changes
Gate / benchmark:
1. Relax the TTFT p99 target to `<= 2000 ms` on `h100-eks`, `h100-gke`, `rtx-pro-6000-eks` (`gb200-eks` already 2000). 2000 ms passes the healthy 708–1670 ms range and still catches genuine stalls (9–45 s) by 5–20×.
2. Pin AIPerf workload generation for run-to-run reproducibility: `--random-seed`, `--prompt-input-tokens-stddev 0`, `--prompt-output-tokens-stddev 0`, `--num-dataset-entries`, `--extra-inputs temperature:0` (greedy decoding, AIPerf's documented determinism path).
3. Raise the serve-readiness probe timeout 30 s → 120 s (`InferenceEndpointProbeTimeout`). This clears the observed cold start (~42 s) with margin while fitting several polls inside the 5-min window; AIPerf's own warmup then absorbs steady-state. Validated on RTX PRO 6000: 74,833 tok/s / 459 ms PASS (vs repeated serve-timeouts before).

Debug aid:
4. `AICR_INFERENCE_PERF_NO_CLEANUP=1`: shell-env forwarded to the inference-perf pod (like `HF_TOKEN`); leaves the namespace/DGD/workers/frontend/AIPerf Job in place for post-mortem of a failed run.

Documentation:
5. New contributor investigation report `docs/contributor/inference-perf-fluctuation.md` (wired into `docs/index.yml`): symptom, full test setup + per-experiment results, the worker-stall capture, the CPU-contention measurement that refuted the contention hypothesis, the GKE/GB200/RTX cross-cluster runs, §4.8 the serve-wait cold-start finding, 11 key findings, mitigations (incl. rejected), and the proposed long-term solution.
6. `validator.md` methodology subsection + cold-start probe-timeout + no-cleanup notes; `validation.md` determinism note + `AICR_INFERENCE_PERF_NO_CLEANUP` doc; TTFT-example consistency across `recipe-development.md` / `data-flow.md` / `cli-reference.md`.

Type of Change
Component(s) Affected
- (`pkg/validator`, `validators/performance`): AIPerf flags, probe timeout, no-cleanup
- `recipes/overlays/*-inference-dynamo.yaml` (TTFT target)

Implementation Notes
- The probe uses a dedicated timeout constant (`InferenceEndpointProbeTimeout = 120s`), not the shared `HTTPClientTimeout`, so other bounded HTTP ops keep their 30 s cap.
- `AICR_INFERENCE_PERF_NO_CLEANUP` is forwarded in `buildEnv`, scoped to the inference-perf entry (mirrors `HF_TOKEN` / gateway-toggle), and read via `strconv.ParseBool`. Unit-tested (`TestBuildJobPlan_ForwardsInferencePerfNoCleanupEnv`).
- `/health` + `/v1/models` 200 with 24 registered endpoints, first `/v1/chat/completions` 200 in ~42 s, warm requests fast → model healthy, first-token warmup is the cost. Distinct from #1192 (inference-perf: validator times out on healthy cluster when dynamo frontend discovery bootstrap races (false negative)): no `Unfold` panic / bucket missing.

Testing
End-to-end, per GPU platform (combined #1193 + this PR image):
Risk Assessment
Checklist
- (`pkg/defaults`, `pkg/validator/v1`, `validators/performance`) + new forwarding test
- (`golangci-lint`, 0 issues)
- (`-S`)