fix(validators): update and tune inference performance validation by yuanchen8911 · Pull Request #1196 · NVIDIA/aicr

yuanchen8911 · 2026-06-04T15:35:41Z

Summary

Make the inference-perf validation gate reproducible and robust to cold start, and document the investigation behind it. Three changes remove distinct false-negatives, plus a debug aid and a contributor report.

Motivation / Context

The gate failed on healthy clusters from two unrelated causes:

TTFT-p99 knee jitter. At the 2048 saturation knee, TTFT p99 is inherently noisy (708–1670 ms run-to-run on H100-EKS) and the 1000 ms target left almost no margin (baseline ≈688 ms) → healthy deployments failed on latency jitter. Inputs were also non-deterministic (no seed, variable output length), so the verdict partly reflected RNG.
Serve-readiness cold-start timeout. A fresh worker's first inference captures CUDA graphs / JIT-warms kernels (~42 s measured on RTX PRO 6000). The readiness probe used the generic 30 s HTTPClientTimeout, which cancelled that legitimate first request mid-warmup, so the probe never succeeded and the phase failed with "timed out waiting for inference endpoint to serve requests". This is the same outer symptom as the (fixed) #1192 discovery panic but a different root cause: discovery was healthy (verified: genuine 1.0.2 frontend by image digest, /v1/models populated, 24 discovery instances, no Unfold panic). The failure is intermittent because RTX cold start straddles 30 s; H100/GB200 stay under it.

Related: #1192 (discovery panic, fixed by #1193), #1193 (dynamo 0.9.0→1.0.2 runtime bump), #1194 (image-reference SSOT).

Changes

Gate / benchmark:

1. Relax TTFT p99 → <= 2000 ms on h100-eks, h100-gke, rtx-pro-6000-eks (gb200-eks already 2000). 2000 ms passes the healthy 708–1670 ms range and still catches genuine stalls (9–45 s) by 5–20×.
2. Pin AIPerf workload generation: --random-seed, --prompt-input-tokens-stddev 0, --prompt-output-tokens-stddev 0, --num-dataset-entries, --extra-inputs temperature:0 (greedy decoding, AIPerf's documented determinism path).
3. Serve-readiness probe timeout 30 s → 120 s (InferenceEndpointProbeTimeout). Clears observed cold-start (~42 s) with margin while fitting several polls inside the 5-min window; AIPerf's own warmup then absorbs steady-state. Validated on RTX PRO 6000: 74,833 tok/s / 459 ms PASS (vs repeated serve-timeouts before).

Debug aid:
4. AICR_INFERENCE_PERF_NO_CLEANUP=1 — shell-env forwarded to the inference-perf pod (like HF_TOKEN); leaves the namespace/DGD/workers/frontend/AIPerf Job in place for post-mortem of a failed run.

Documentation:
5. New contributor investigation report docs/contributor/inference-perf-fluctuation.md (wired into docs/index.yml): symptom, full test setup + per-experiment results, the worker-stall capture, the CPU-contention measurement that refuted the contention hypothesis, the GKE/GB200/RTX cross-cluster runs, §4.8 the serve-wait cold-start finding, 11 key findings, mitigations (incl. rejected), and the proposed long-term solution.
6. validator.md methodology subsection + cold-start probe-timeout + no-cleanup notes; validation.md determinism note + AICR_INFERENCE_PERF_NO_CLEANUP doc; TTFT-example consistency across recipe-development.md / data-flow.md / cli-reference.md.
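To make the threshold choice in item 1 concrete, here is a minimal sketch that evaluates the old and new TTFT p99 limits against the figures quoted in this description. passesTTFT is a hypothetical helper, not the validator's constraint evaluator; the sample values are the endpoints of the healthy range (708–1670 ms) and the stall range (9–45 s) reported above.

```go
package main

import "fmt"

// passesTTFT reports whether a measured TTFT p99 satisfies a "<= limit"
// constraint. Hypothetical helper for illustration.
func passesTTFT(p99Ms, limitMs float64) bool {
	return p99Ms <= limitMs
}

func main() {
	// Endpoints of the ranges quoted in the PR description, in milliseconds.
	samples := []struct {
		label string
		p99Ms float64
	}{
		{"healthy, low end", 708},
		{"healthy, high end", 1670},
		{"stall, low end", 9000},
		{"stall, high end", 45000},
	}

	// Compare the previous 1000 ms target with the relaxed 2000 ms target.
	for _, limit := range []float64{1000, 2000} {
		for _, s := range samples {
			fmt.Printf("limit <= %4.0f ms  %-18s %6.0f ms  pass=%-5v  %.1fx limit\n",
				limit, s.label, s.p99Ms, passesTTFT(s.p99Ms, limit), s.p99Ms/limit)
		}
	}
}
```

The last column shows how far each sample sits from the limit, which is the margin the description refers to when it says stalls are still caught by a wide factor.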

Type of Change

Bug fix (gate false-negatives: knee jitter + cold-start timeout)
Documentation (investigation report + methodology)
Build/CI/tooling (benchmark determinism)

Component(s) Affected

Validator (pkg/validator, validators/performance) — AIPerf flags, probe timeout, no-cleanup
Recipe engine / data — recipes/overlays/*-inference-dynamo.yaml (TTFT target)
Docs/examples — new report + 5 existing pages

Implementation Notes

The serve-wait probe timeout is a dedicated constant (InferenceEndpointProbeTimeout = 120s), not the shared HTTPClientTimeout, so other bounded HTTP ops keep their 30 s cap.
AICR_INFERENCE_PERF_NO_CLEANUP is forwarded in buildEnv scoped to the inference-perf entry (mirrors HF_TOKEN/gateway-toggle); read via strconv.ParseBool. Unit-tested (TestBuildJobPlan_ForwardsInferencePerfNoCleanupEnv).
Root-cause confirmed by live probing (RTX): /health + /v1/models 200 with 24 registered endpoints, first /v1/chat/completions 200 in ~42 s, warm requests fast → model healthy, first-token warmup is the cost. Distinct from #1192 (no Unfold panic / bucket missing).
Deliberately NOT included: worker CPU/memory requests (CPU PSI ≈ 0, no contention measured); AIPerf node-pinning (a test-only instrument, not product behavior).
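The no-cleanup toggle described above can be sketched as follows. This is an assumption-laden illustration, not the code in buildEnv: forwardNoCleanup and skipCleanup are hypothetical helpers, and treating an unparsable value as "false" is a choice made for this sketch. The environment variable name and the use of strconv.ParseBool come from the notes above.

```go
package main

import (
	"fmt"
	"strconv"
)

// Variable name as documented in this PR.
const noCleanupEnv = "AICR_INFERENCE_PERF_NO_CLEANUP"

// forwardNoCleanup copies the variable from the caller's environment into the
// pod environment when it is set and non-empty. Hypothetical stand-in for the
// forwarding step; lookup has the same shape as os.LookupEnv.
func forwardNoCleanup(lookup func(string) (string, bool), podEnv map[string]string) {
	if v, ok := lookup(noCleanupEnv); ok && v != "" {
		podEnv[noCleanupEnv] = v
	}
}

// skipCleanup interprets the forwarded value with strconv.ParseBool.
// Assumption for this sketch: values ParseBool rejects mean "clean up as usual".
func skipCleanup(raw string) bool {
	b, err := strconv.ParseBool(raw)
	return err == nil && b
}

func main() {
	podEnv := map[string]string{}
	forwardNoCleanup(func(string) (string, bool) { return "1", true }, podEnv)
	fmt.Println("forwarded:", podEnv)

	for _, raw := range []string{"1", "true", "0", "", "yes"} {
		fmt.Printf("%q -> skip cleanup: %v\n", raw, skipCleanup(raw))
	}
}
```

Passing the lookup function in, rather than calling os.LookupEnv directly, keeps the sketch testable without mutating the process environment.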

Testing

golangci-lint run -c .golangci.yaml ./pkg/defaults/... ./pkg/validator/v1/... ./validators/performance/...  # 0 issues
go test ./pkg/defaults/... ./pkg/validator/v1/... ./validators/performance/...                              # ok

End-to-end, per GPU platform (combined #1193 + this PR image):

GKE H100 3/3 PASS (304–420 ms, balanced)
GB200 (4-GPU node, 1024) PASS — 79,400 tok/s / 902 ms; consistent with prior runs
EKS RTX PRO 6000 PASS — 74,833 tok/s / 459 ms after the 120 s probe fix (previously serve-timed-out 2× on genuine 1.0.2)
EKS H100 clean when balanced (~119 k); the residual stochastic stall at 2048 is tracked separately in #1197 (inference-perf: stochastic worker-stall / throughput degradation on EKS H100 at 2048 concurrency), not this PR's scope

Risk Assessment

Low — constraint-value + benchmark-flag + timeout-constant + debug-env changes; no change to validator control flow or deployment topology. Probe timeout only lengthens a wait; no-cleanup is opt-in.

Checklist

Tests pass locally (pkg/defaults, pkg/validator/v1, validators/performance) + new forwarding test
Linter passes (golangci-lint, 0 issues)
I did not skip/disable tests to make CI green
Docs updated (new report + methodology + determinism + cold-start + no-cleanup)
Internal cluster names / account IDs scrubbed from repo docs (EKS H100 / GKE H100 / RTX PRO 6000 / GB200)
Commit cryptographically signed (-S)

github-actions · 2026-06-04T15:36:45Z

🌿 Preview your docs: https://nvidia-preview-fix-perf-gate-hardening.docs.buildwithfern.com/aicr

coderabbitai · 2026-06-04T15:41:07Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

📝 Walkthrough

Adds reproducibility flags and constants to the AIPerf inference benchmark runner (fixed random seed, pinned synthetic dataset size, zero input/output token stddev, forced greedy decoding) and documents the determinism methodology. In parallel, relaxes the inference TTFT p99 validation ceiling from <= 1000 ms to <= 2000 ms in three recipe overlays and three documentation examples/sections; adds a contributor investigation doc describing observed validator fluctuations and mitigations.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

Possibly related PRs

NVIDIA/aicr#1133: Refactors AIPerf job templating/injection logic and touches the same inference_perf_constraint path modified here.

Suggested labels

area/validator, documentation, size/M

Suggested reviewers

mchmarny
njhensley
lalitadithya

🚥 Pre-merge checks | ✅ 4

✅ Passed checks (4 passed)

Check name	Status	Explanation
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Description check	✅ Passed	The description is comprehensive and directly related to the changeset, detailing the motivation, specific changes, testing, and risk assessment for the inference-perf gate hardening and cold-start fixes.
Title check	✅ Passed	The title 'fix(validators): update and tune inference performance validation' accurately summarizes the main change—relaxing and tuning the inference-perf validation gate with updated constraints and deterministic AIPerf settings.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

coderabbitai

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

docs/user/validation.md (1)
205-217: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Update the sample CTRF TTFT constraint to match the new documented gate.

Line 216 still shows <= 1000, which now conflicts with the updated <= 2000 examples and can confuse operators comparing outputs.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/user/validation.md` around lines 205 - 217, The sample CTRF JSON in the
docs shows a mismatched TTFT constraint; update the "TTFT p99 constraint: <=
1000 → PASS" line in the sample stdout to "TTFT p99 constraint: <= 2000 → PASS"
so the example matches the new documented gate; look for the sample block
containing the "stdout" array and the "TTFT p99 constraint" string and change
the numeric threshold accordingly.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 2f3e76b7-208f-4f56-9fea-e7e47b02198e

📥 Commits

Reviewing files that changed from the base of the PR and between 0db2291 and d6a97bb.

📒 Files selected for processing (7)

docs/contributor/validator.md
docs/integrator/recipe-development.md
docs/user/validation.md
recipes/overlays/h100-eks-ubuntu-inference-dynamo.yaml
recipes/overlays/h100-gke-cos-inference-dynamo.yaml
recipes/overlays/rtx-pro-6000-eks-ubuntu-inference-dynamo.yaml
validators/performance/inference_perf_constraint.go

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/contributor/inference-perf-fluctuation.md`:
- Around line 110-114: Add language identifiers to the two fenced code blocks:
the block beginning with "worker        Running  Waiting  GPU-util  clock    
throttle  power" and the block beginning with "Node CPU PSI (some avg10):  0.00
– 1.54   ≈ 0   (no CPU stall)"; replace the opening triple backticks with
language-tagged fences (e.g., ```text) for both occurrences (also apply the same
change to the other occurrence at lines 147-151) so the markdown linter (MD040)
recognizes the blocks' language.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: deb50034-ee54-4e25-8f50-067a275fd2c9

📥 Commits

Reviewing files that changed from the base of the PR and between d6a97bb and d2b7daa.

📒 Files selected for processing (9)

docs/contributor/inference-perf-fluctuation.md
docs/contributor/validator.md
docs/index.yml
docs/integrator/recipe-development.md
docs/user/validation.md
recipes/overlays/h100-eks-ubuntu-inference-dynamo.yaml
recipes/overlays/h100-gke-cos-inference-dynamo.yaml
recipes/overlays/rtx-pro-6000-eks-ubuntu-inference-dynamo.yaml
validators/performance/inference_perf_constraint.go

…t fix

The inference-perf gate produced false negatives on healthy clusters from two unrelated causes: TTFT-p99 knee jitter, and a serve-readiness probe timeout that was shorter than cold-start first-token latency. This makes the verdict a function of deployment health, not noise or warmup timing.

1. Relax the TTFT p99 target to "<= 2000" across the inference overlays (h100-eks, h100-gke, rtx-pro-6000-eks; gb200-eks already 2000). 2000 ms passes healthy runs (708-1670 ms observed) while still catching genuine stalls (9-45 s) by 5-20x, so it stays discriminating -- matching the precedent gb200-eks already set.
2. Pin AIPerf workload generation for run-to-run reproducibility: --random-seed, --prompt-input-tokens-stddev 0, --prompt-output-tokens-stddev 0, --num-dataset-entries, and --extra-inputs temperature:0 (greedy decoding -- AIPerf's recommended way to get deterministic output without ignore_eos).
3. Raise the serve-readiness probe timeout 30s -> 120s (InferenceEndpointProbeTimeout). A fresh worker's first inference captures CUDA graphs / JIT-warms kernels (~42 s measured on RTX PRO 6000); the generic 30 s HTTPClientTimeout cancelled that legitimate first request mid-warmup, so the probe never succeeded and the phase failed with "timed out waiting for inference endpoint to serve requests" -- the same outer symptom as the (fixed) NVIDIA#1192 discovery panic but a different root cause. Validated on RTX PRO 6000: 74,833 tok/s / 459 ms PASS with the fix. AIPerf's own warmup absorbs steady-state once the probe passes.

Also adds AICR_INFERENCE_PERF_NO_CLEANUP=1 (debug, shell-env forwarded to the inference-perf pod): leaves the namespace/DGD/workers/frontend/AIPerf Job in place for post-mortem of a failed run.

Docs:
- New contributor investigation report docs/contributor/inference-perf-fluctuation.md (symptom, full test setup + results, key findings, mitigations incl. rejected ones, the serve-wait cold-start finding (4.8), proposed long-term solution), wired into docs/index.yml.
- validator.md: "Methodology" subsection + cold-start probe timeout + no-cleanup.
- validation.md: determinism note, TTFT example -> 2000, AICR_INFERENCE_PERF_NO_CLEANUP.
- TTFT-example consistency: recipe-development.md, data-flow.md, cli-reference.md.

yuanchen8911 added area/recipes area/validator area/docs bug labels Jun 4, 2026

github-actions Bot added size/S and removed area/validator labels Jun 4, 2026

yuanchen8911 mentioned this pull request Jun 4, 2026

inference-perf: validator times out on healthy cluster when dynamo frontend discovery bootstrap races (false negative) #1192

Closed

4 tasks

yuanchen8911 force-pushed the fix/perf-gate-hardening branch from 0db2291 to d6a97bb Compare June 4, 2026 15:41

github-actions Bot added size/M and removed size/S labels Jun 4, 2026

coderabbitai Bot reviewed Jun 4, 2026

yuanchen8911 changed the title ~~fix(validators): reproducible inference-perf gate — TTFT <= 2000ms + pinned AIPerf inputs~~ fix(validators): tweak inference-perf validation Jun 4, 2026

yuanchen8911 force-pushed the fix/perf-gate-hardening branch from d6a97bb to d2b7daa Compare June 4, 2026 16:05

github-actions Bot added size/L and removed size/M labels Jun 4, 2026

coderabbitai Bot reviewed Jun 4, 2026

Comment thread docs/contributor/inference-perf-fluctuation.md Outdated

yuanchen8911 force-pushed the fix/perf-gate-hardening branch from d2b7daa to 770a499 Compare June 4, 2026 16:29

yuanchen8911 changed the title ~~fix(validators): tweak inference-perf validation~~ fix(validators): inference-perf gate hardening + investigation report Jun 4, 2026

mchmarny assigned yuanchen8911 Jun 4, 2026

yuanchen8911 changed the title ~~fix(validators): inference-perf gate hardening + investigation report~~ WIP fix(validators): inference-perf gate hardening + investigation report Jun 4, 2026

yuanchen8911 force-pushed the fix/perf-gate-hardening branch from 770a499 to 9c597da Compare June 4, 2026 17:10

yuanchen8911 mentioned this pull request Jun 4, 2026

inference-perf: stochastic worker-stall / throughput degradation on EKS H100 at 2048 concurrency #1197

Open

yuanchen8911 force-pushed the fix/perf-gate-hardening branch 3 times, most recently from 42607f8 to f8b74e7 Compare June 4, 2026 22:07

github-actions Bot added area/validator size/XL and removed size/L labels Jun 4, 2026

yuanchen8911 changed the title ~~WIP fix(validators): inference-perf gate hardening + investigation report~~ fix(validators): inference-perf gate hardening + serve-wait cold-start fix Jun 4, 2026

yuanchen8911 changed the title ~~fix(validators): inference-perf gate hardening + serve-wait cold-start fix~~ fix(validators): update and tune inference performance validation Jun 4, 2026

yuanchen8911 force-pushed the fix/perf-gate-hardening branch 2 times, most recently from 53406c0 to 2042311 Compare June 4, 2026 22:20

yuanchen8911 force-pushed the fix/perf-gate-hardening branch from 2042311 to 860bcc7 Compare June 4, 2026 22:28

yuanchen8911 marked this pull request as ready for review June 4, 2026 22:29

yuanchen8911 requested review from a team as code owners June 4, 2026 22:29

yuanchen8911 enabled auto-merge (squash) June 4, 2026 22:31

yuanchen8911 requested a review from mchmarny June 4, 2026 22:31

mchmarny approved these changes Jun 4, 2026

yuanchen8911 merged commit 2d5259b into NVIDIA:main Jun 4, 2026
120 checks passed

yuanchen8911 merged 1 commit into NVIDIA:main from yuanchen8911:fix/perf-gate-hardening
