fix(validators): update and tune inference performance validation #1196

Merged
yuanchen8911 merged 1 commit into NVIDIA:main from yuanchen8911:fix/perf-gate-hardening
Jun 4, 2026

@yuanchen8911 yuanchen8911 commented Jun 4, 2026

Summary

Make the inference-perf validation gate reproducible and robust to cold start, and document the investigation behind it. Three changes remove distinct false-negatives, plus a debug aid and a contributor report.

Motivation / Context

The gate failed on healthy clusters from two unrelated causes:

  1. TTFT-p99 knee jitter. At the 2048 saturation knee, TTFT p99 is inherently noisy (708–1670 ms run-to-run on H100-EKS) and the 1000 ms target left almost no margin (baseline ≈688 ms) → healthy deployments failed on latency jitter. Inputs were also non-deterministic (no seed, variable output length), so the verdict partly reflected RNG.
  2. Serve-readiness cold-start timeout. A fresh worker's first inference captures CUDA graphs / JIT-warms kernels, which took ~42 s measured on RTX PRO 6000. The readiness probe used the generic 30 s HTTPClientTimeout, which cancelled that legitimate first request mid-warmup, so the probe never succeeded and the phase failed with "timed out waiting for inference endpoint to serve requests". This is the same outer symptom as the (fixed) #1192 discovery panic ("inference-perf: validator times out on healthy cluster when dynamo frontend discovery bootstrap races (false negative)") but a different root cause: discovery was healthy (verified: genuine 1.0.2 frontend by image digest, /v1/models populated, 24 discovery instances, no Unfold panic). The failure is intermittent because RTX cold-start straddles 30 s; H100/GB200 stay under it.
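
To make the knee-jitter failure in item 1 concrete, the following minimal sketch evaluates a "<= target" gate against the TTFT p99 figures quoted in this PR (688 ms baseline, 708–1670 ms healthy range, 9–45 s stalls) for both the old 1000 ms and the new 2000 ms target. The `passes` helper and the sample labels are hypothetical names for illustration, not the validator's actual constraint code.

```go
package main

import "fmt"

// passes reports whether an observed TTFT p99 (in ms) satisfies a
// "<= target" gate. Hypothetical helper, not the validator's API.
func passes(observedMs, targetMs float64) bool {
	return observedMs <= targetMs
}

func main() {
	// Figures quoted in the PR description, in milliseconds.
	samples := []struct {
		name string
		ms   float64
	}{
		{"baseline", 688},
		{"healthy low", 708},
		{"healthy high", 1670},
		{"stall low", 9000},
		{"stall high", 45000},
	}
	for _, target := range []float64{1000, 2000} {
		for _, s := range samples {
			fmt.Printf("target<=%v %-12s %6.0f ms pass=%v\n",
				target, s.name, s.ms, passes(s.ms, target))
		}
	}
}
```

Under the old target the upper end of the healthy range fails, while under the new target the whole healthy range passes and both stall figures still fail.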

Related: #1192 (discovery panic, fixed by #1193), #1193 (dynamo 0.9.0→1.0.2 runtime bump), #1194 (image-reference SSOT).

Changes

Gate / benchmark:

  1. Relax TTFT p99 → <= 2000 ms on h100-eks, h100-gke, rtx-pro-6000-eks (gb200-eks already 2000). 2000 ms passes the healthy 708–1670 ms range and still catches genuine stalls (9–45 s) by 5–20×.
  2. Pin AIPerf workload generation: --random-seed, --prompt-input-tokens-stddev 0, --prompt-output-tokens-stddev 0, --num-dataset-entries, --extra-inputs temperature:0 (greedy — AIPerf's documented determinism path).
  3. Serve-readiness probe timeout 30 s → 120 s (InferenceEndpointProbeTimeout). Clears observed cold-start (~42 s) with margin while fitting several polls inside the 5-min window; AIPerf's own warmup then absorbs steady-state. Validated on RTX PRO 6000: 74,833 tok/s / 459 ms PASS (vs repeated serve-timeouts before).
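
The pinned workload flags from item 2 could be assembled roughly as below. This is a sketch under stated assumptions: the flag names are the ones listed in this PR, but the `determinismArgs` function and the seed and dataset-size values (42 and 1000) are hypothetical, since the PR text does not state the values the validator uses.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// determinismArgs returns the AIPerf flags named in this PR.
// The function itself is hypothetical; seed and datasetEntries are
// caller-supplied because the PR text does not give concrete values.
func determinismArgs(seed, datasetEntries int) []string {
	return []string{
		"--random-seed", strconv.Itoa(seed),
		"--prompt-input-tokens-stddev", "0", // fixed input length
		"--prompt-output-tokens-stddev", "0", // fixed output length
		"--num-dataset-entries", strconv.Itoa(datasetEntries),
		"--extra-inputs", "temperature:0", // greedy decoding
	}
}

func main() {
	// 42 and 1000 are placeholder values for illustration only.
	fmt.Println(strings.Join(determinismArgs(42, 1000), " "))
}
```

Keeping the flags in one function means a single unit test can check that every determinism knob is present on the command line.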

Debug aid:

  4. AICR_INFERENCE_PERF_NO_CLEANUP=1: shell-env forwarded to the inference-perf pod (like HF_TOKEN); leaves the namespace/DGD/workers/frontend/AIPerf Job in place for post-mortem of a failed run.
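
As an illustration of the opt-in toggle, the sketch below reads the variable with strconv.ParseBool, which the implementation notes name as the parsing path. The `noCleanupRequested` helper and the choice to treat unset or unparsable values as false are assumptions made for this example, not confirmed behavior of the validator.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

const noCleanupEnv = "AICR_INFERENCE_PERF_NO_CLEANUP"

// noCleanupRequested reports whether the debug toggle is set to a true
// value. Hypothetical helper: unset or unparsable values are treated
// as false here, which is an assumption of this sketch.
func noCleanupRequested(lookup func(string) (string, bool)) bool {
	raw, ok := lookup(noCleanupEnv)
	if !ok {
		return false
	}
	v, err := strconv.ParseBool(raw)
	return err == nil && v
}

func main() {
	if noCleanupRequested(os.LookupEnv) {
		fmt.Println("skipping cleanup; resources left for post-mortem")
		return
	}
	fmt.Println("cleaning up benchmark resources")
}
```

Passing the lookup function in, rather than calling os.LookupEnv directly, keeps the helper testable without mutating the process environment.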

Documentation:

  5. New contributor investigation report docs/contributor/inference-perf-fluctuation.md (wired into docs/index.yml): symptom, full test setup + per-experiment results, the worker-stall capture, the CPU-contention measurement that refuted the contention hypothesis, the GKE/GB200/RTX cross-cluster runs, §4.8 the serve-wait cold-start finding, 11 key findings, mitigations (incl. rejected), and the proposed long-term solution.
  6. validator.md methodology subsection + cold-start probe-timeout + no-cleanup notes; validation.md determinism note + AICR_INFERENCE_PERF_NO_CLEANUP doc; TTFT-example consistency across recipe-development.md / data-flow.md / cli-reference.md.

Type of Change

  • Bug fix (gate false-negatives: knee jitter + cold-start timeout)
  • Documentation (investigation report + methodology)
  • Build/CI/tooling (benchmark determinism)

Component(s) Affected

  • Validator (pkg/validator, validators/performance) — AIPerf flags, probe timeout, no-cleanup
  • Recipe engine / data — recipes/overlays/*-inference-dynamo.yaml (TTFT target)
  • Docs/examples — new report + 5 existing pages

Implementation Notes

  • The serve-wait probe timeout is a dedicated constant (InferenceEndpointProbeTimeout = 120s), not the shared HTTPClientTimeout, so other bounded HTTP ops keep their 30 s cap.
  • AICR_INFERENCE_PERF_NO_CLEANUP is forwarded in buildEnv scoped to the inference-perf entry (mirrors HF_TOKEN/gateway-toggle); read via strconv.ParseBool. Unit-tested (TestBuildJobPlan_ForwardsInferencePerfNoCleanupEnv).
  • Root-cause confirmed by live probing (RTX): /health + /v1/models 200 with 24 registered endpoints, first /v1/chat/completions 200 in ~42 s, warm requests fast → model healthy, first-token warmup is the cost. Distinct from #1192 ("inference-perf: validator times out on healthy cluster when dynamo frontend discovery bootstrap races (false negative)"): no Unfold panic / bucket missing.
  • Deliberately NOT included: worker CPU/memory requests (CPU PSI ≈ 0, no contention measured); AIPerf node-pinning (a test-only instrument, not product behavior).

Testing

golangci-lint run -c .golangci.yaml ./pkg/defaults/... ./pkg/validator/v1/... ./validators/performance/...  # 0 issues
go test ./pkg/defaults/... ./pkg/validator/v1/... ./validators/performance/...                              # ok

End-to-end, per GPU platform (combined #1193 + this PR image):

Risk Assessment

  • Low — constraint-value + benchmark-flag + timeout-constant + debug-env changes; no change to validator control flow or deployment topology. Probe timeout only lengthens a wait; no-cleanup is opt-in.

Checklist

  • Tests pass locally (pkg/defaults, pkg/validator/v1, validators/performance) + new forwarding test
  • Linter passes (golangci-lint, 0 issues)
  • I did not skip/disable tests to make CI green
  • Docs updated (new report + methodology + determinism + cold-start + no-cleanup)
  • Internal cluster names / account IDs scrubbed from repo docs (EKS H100 / GKE H100 / RTX PRO 6000 / GB200)
  • Commit cryptographically signed (-S)

github-actions Bot commented Jun 4, 2026

coderabbitai Bot commented Jun 4, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Walkthrough

Adds reproducibility flags and constants to the AIPerf inference benchmark runner (fixed random seed, pinned synthetic dataset size, zero input/output token stddev, forced greedy decoding) and documents the determinism methodology. In parallel, relaxes the inference TTFT p99 validation ceiling from <= 1000 ms to <= 2000 ms in three recipe overlays and three documentation examples/sections; adds a contributor investigation doc describing observed validator fluctuations and mitigations.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

Possibly related PRs

  • NVIDIA/aicr#1133: Refactors AIPerf job templating/injection logic and touches the same inference_perf_constraint path modified here.

Suggested labels

area/validator, documentation, size/M

Suggested reviewers

  • mchmarny
  • njhensley
  • lalitadithya
🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Description check | ✅ Passed | The description is comprehensive and directly related to the changeset, detailing the motivation, specific changes, testing, and risk assessment for the inference-perf gate hardening and cold-start fixes. |
| Title check | ✅ Passed | The title 'fix(validators): update and tune inference performance validation' accurately summarizes the main change: relaxing and tuning the inference-perf validation gate with updated constraints and deterministic AIPerf settings. |

@yuanchen8911 yuanchen8911 force-pushed the fix/perf-gate-hardening branch from 0db2291 to d6a97bb Compare June 4, 2026 15:41
@github-actions github-actions Bot added size/M and removed size/S labels Jun 4, 2026
@coderabbitai coderabbitai Bot left a comment

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
docs/user/validation.md (1)

205-217: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Update the sample CTRF TTFT constraint to match the new documented gate.

Line 216 still shows <= 1000, which now conflicts with the updated <= 2000 examples and can confuse operators comparing outputs.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/user/validation.md` around lines 205 - 217, The sample CTRF JSON in the
docs shows a mismatched TTFT constraint; update the "TTFT p99 constraint: <=
1000 → PASS" line in the sample stdout to "TTFT p99 constraint: <= 2000 → PASS"
so the example matches the new documented gate; look for the sample block
containing the "stdout" array and the "TTFT p99 constraint" string and change
the numeric threshold accordingly.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 2f3e76b7-208f-4f56-9fea-e7e47b02198e

📥 Commits

Reviewing files that changed from the base of the PR and between 0db2291 and d6a97bb.

📒 Files selected for processing (7)
  • docs/contributor/validator.md
  • docs/integrator/recipe-development.md
  • docs/user/validation.md
  • recipes/overlays/h100-eks-ubuntu-inference-dynamo.yaml
  • recipes/overlays/h100-gke-cos-inference-dynamo.yaml
  • recipes/overlays/rtx-pro-6000-eks-ubuntu-inference-dynamo.yaml
  • validators/performance/inference_perf_constraint.go

@yuanchen8911 yuanchen8911 changed the title fix(validators): reproducible inference-perf gate — TTFT <= 2000ms + pinned AIPerf inputs fix(validators): tweak inference-perf validation Jun 4, 2026
@yuanchen8911 yuanchen8911 force-pushed the fix/perf-gate-hardening branch from d6a97bb to d2b7daa Compare June 4, 2026 16:05
@github-actions github-actions Bot added size/L and removed size/M labels Jun 4, 2026
@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/contributor/inference-perf-fluctuation.md`:
- Around line 110-114: Add language identifiers to the two fenced code blocks:
the block beginning with "worker        Running  Waiting  GPU-util  clock    
throttle  power" and the block beginning with "Node CPU PSI (some avg10):  0.00
– 1.54   ≈ 0   (no CPU stall)"; replace the opening triple backticks with
language-tagged fences (e.g., ```text) for both occurrences (also apply the same
change to the other occurrence at lines 147-151) so the markdown linter (MD040)
recognizes the blocks' language.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: deb50034-ee54-4e25-8f50-067a275fd2c9

📥 Commits

Reviewing files that changed from the base of the PR and between d6a97bb and d2b7daa.

📒 Files selected for processing (9)
  • docs/contributor/inference-perf-fluctuation.md
  • docs/contributor/validator.md
  • docs/index.yml
  • docs/integrator/recipe-development.md
  • docs/user/validation.md
  • recipes/overlays/h100-eks-ubuntu-inference-dynamo.yaml
  • recipes/overlays/h100-gke-cos-inference-dynamo.yaml
  • recipes/overlays/rtx-pro-6000-eks-ubuntu-inference-dynamo.yaml
  • validators/performance/inference_perf_constraint.go

Comment thread docs/contributor/inference-perf-fluctuation.md Outdated
@yuanchen8911 yuanchen8911 force-pushed the fix/perf-gate-hardening branch from d2b7daa to 770a499 Compare June 4, 2026 16:29
@yuanchen8911 yuanchen8911 changed the title from "fix(validators): tweak inference-perf validation" to "fix(validators): inference-perf gate hardening + investigation report" Jun 4, 2026
@yuanchen8911 yuanchen8911 changed the title from "fix(validators): inference-perf gate hardening + investigation report" to "WIP fix(validators): inference-perf gate hardening + investigation report" Jun 4, 2026
@yuanchen8911 yuanchen8911 force-pushed the fix/perf-gate-hardening branch from 770a499 to 9c597da Compare June 4, 2026 17:10
@yuanchen8911 yuanchen8911 force-pushed the fix/perf-gate-hardening branch 3 times, most recently from 42607f8 to f8b74e7 Compare June 4, 2026 22:07
@yuanchen8911 yuanchen8911 changed the title from "WIP fix(validators): inference-perf gate hardening + investigation report" to "fix(validators): inference-perf gate hardening + serve-wait cold-start fix" Jun 4, 2026
@yuanchen8911 yuanchen8911 changed the title from "fix(validators): inference-perf gate hardening + serve-wait cold-start fix" to "fix(validators): update and tune inference performance validation" Jun 4, 2026
@yuanchen8911 yuanchen8911 force-pushed the fix/perf-gate-hardening branch 2 times, most recently from 53406c0 to 2042311 Compare June 4, 2026 22:20
…t fix

The inference-perf gate produced false negatives on healthy clusters from two
unrelated causes: TTFT-p99 knee jitter, and a serve-readiness probe timeout that
was shorter than cold-start first-token latency. This makes the verdict a
function of deployment health, not noise or warmup timing.

1. Relax the TTFT p99 target to "<= 2000" across the inference overlays
   (h100-eks, h100-gke, rtx-pro-6000-eks; gb200-eks already 2000). 2000 ms
   passes healthy runs (708-1670 ms observed) while still catching genuine
   stalls (9-45 s) by 5-20x, so it stays discriminating -- matching the
   precedent gb200-eks already set.

2. Pin AIPerf workload generation for run-to-run reproducibility:
   --random-seed, --prompt-input-tokens-stddev 0, --prompt-output-tokens-stddev 0,
   --num-dataset-entries, and --extra-inputs temperature:0 (greedy decoding --
   AIPerf's recommended way to get deterministic output without ignore_eos).

3. Raise the serve-readiness probe timeout 30s -> 120s
   (InferenceEndpointProbeTimeout). A fresh worker's first inference captures
   CUDA graphs / JIT-warms kernels (~42 s measured on RTX PRO 6000); the generic
   30 s HTTPClientTimeout cancelled that legitimate first request mid-warmup, so
   the probe never succeeded and the phase failed with "timed out waiting for
   inference endpoint to serve requests" -- the same outer symptom as the (fixed)
   NVIDIA#1192 discovery panic but a different root cause. Validated on RTX PRO 6000:
   74,833 tok/s / 459 ms PASS with the fix. AIPerf's own warmup absorbs
   steady-state once the probe passes.

Also adds AICR_INFERENCE_PERF_NO_CLEANUP=1 (debug, shell-env forwarded to the
inference-perf pod): leaves the namespace/DGD/workers/frontend/AIPerf Job in
place for post-mortem of a failed run.

Docs:
- New contributor investigation report docs/contributor/inference-perf-fluctuation.md
  (symptom, full test setup + results, key findings, mitigations incl. rejected
  ones, the serve-wait cold-start finding (4.8), proposed long-term solution),
  wired into docs/index.yml.
- validator.md: "Methodology" subsection + cold-start probe timeout + no-cleanup.
- validation.md: determinism note, TTFT example -> 2000, AICR_INFERENCE_PERF_NO_CLEANUP.
- TTFT-example consistency: recipe-development.md, data-flow.md, cli-reference.md.
@yuanchen8911 yuanchen8911 force-pushed the fix/perf-gate-hardening branch from 2042311 to 860bcc7 Compare June 4, 2026 22:28
@yuanchen8911 yuanchen8911 marked this pull request as ready for review June 4, 2026 22:29
@yuanchen8911 yuanchen8911 requested review from a team as code owners June 4, 2026 22:29
@yuanchen8911 yuanchen8911 enabled auto-merge (squash) June 4, 2026 22:31
@yuanchen8911 yuanchen8911 requested a review from mchmarny June 4, 2026 22:31
@yuanchen8911 yuanchen8911 merged commit 2d5259b into NVIDIA:main Jun 4, 2026
120 checks passed

2 participants