Skip to content

fix(performance): add GKE inference performance validation + cold-start readiness probe#952

Merged
mchmarny merged 1 commit into
NVIDIA:mainfrom
yuanchen8911:fix/937-gke-inference-perf-validation
May 19, 2026
Merged

fix(performance): add GKE inference performance validation + cold-start readiness probe#952
mchmarny merged 1 commit into
NVIDIA:mainfrom
yuanchen8911:fix/937-gke-inference-perf-validation

Conversation

@yuanchen8911
Copy link
Copy Markdown
Contributor

@yuanchen8911 yuanchen8911 commented May 18, 2026

Summary

Two related fixes that together unblock aicr validate --phase performance against the GKE H100 COS Dynamo recipe:

  1. Wire the GKE Dynamo overlay to the inference-perf validator.
    Without this, the phase is a no-op (validators=0, status=skipped)
    — the validator and EKS sibling overlay already exist; the GKE
    overlay was never extended.
  2. Replace the inference-perf validator's GET /health probe with
    a real POST /v1/chat/completions probe.
    Dynamo's frontend
    returns 200 from /health as soon as the HTTP server is up — well
    before backend workers register or the model finishes loading.
    Hitting that window with AIPerf produced an "all requests completed,
    zero tokens" benchmark result that masqueraded as a regression. The
    chat-completion probe only accepts the endpoint once the response
    carries a non-empty completion — the only signal both necessary and
    sufficient to know AIPerf will produce real numbers. Affects both
    EKS and GKE Dynamo overlays (single shared codepath).

Motivation / Context

aicr validate --phase performance against the GKE Dynamo recipe was
silently a no-op:

[cli] phase requested but no checks defined in recipe;
      phase will be empty: phase=performance
[cli] phase completed: phase=performance status=skipped

Wiring the overlay to the inference-perf validator exposed a
pre-existing cold-start flake in the validator itself — first runs
against a fresh workload reported throughput=0 tok/s, while a third
run several minutes later reported 33,982 tok/s. Root-caused to the
/health probe accepting an endpoint that's not actually serving yet
(Dynamo's frontend reports instances: [] until backend workers
register, even though /health already returns 200). The fix replaces
the probe with a real chat-completion request that demands a non-empty
response before AIPerf launches.

Fixes: #937
Related: #641

Type of Change

  • Bug fix (non-breaking change that fixes an issue)

Component(s) Affected

  • Recipe engine / data (pkg/recipe)
  • Validator (pkg/validator, validators/performance)
  • Core libraries (pkg/defaults)

Implementation Notes

  • recipes/overlays/h100-gke-cos-inference-dynamo.yaml: mirrors the
    EKS overlay's validation.performance block one-for-one
    (inference-perf check + placeholder >= 5000 tok/s /
    <= 200 ms p99 thresholds, both empirically validated below).
  • validators/performance/inference_perf_constraint.go:
    • Replaces waitForEndpointHealth (a GET /health poll) with
      waitForEndpointReady (a POST /v1/chat/completions poll that
      requires a non-empty choices[0].message.content).
    • Bounds the probe response body with io.LimitReader against a
      new inferenceProbeBodyLimit const (per the repo's "no unbounded
      io.ReadAll" rule).
    • Splits an internal waitForEndpointReadyWithInterval(..., pollInterval) seam so unit tests can drive the loop in
      milliseconds without changing the production 10 s poll.
  • pkg/defaults/timeouts.go: refreshes the InferenceHealthTimeout
    godoc to reflect the new probe shape; the 5 min / 10 s budget is
    unchanged.
  • validators/performance/inference_perf_test.go: two new
    httptest-backed tests cover the "503 → 200-empty → 200-real"
    accept path and the persistent-empty timeout path.

Testing

make qualify
  • make lint (yamllint + golangci-lint + license headers + agents sync + BOM): clean
  • make test-coverage: pass (project-wide floor maintained at 76.7%)
  • make e2e (chainsaw): 21/21 pass
  • make scan (Grype): no vulnerabilities

End-to-end run against a real GKE H100-mega-80gb COS Dynamo cluster
with the fix in place:

[cli] validator completed: name=inference-perf status=passed
[cli] Inference throughput: 33982.54 tokens/sec
[cli] Inference TTFT p99: 119.79 ms
[cli] phase completed: phase=performance status=passed validators=1 passed=1 failed=0

33,982 tok/s is comfortably above the >= 5000 placeholder; 119.79 ms
sits under the <= 200 placeholder. Per-platform threshold tuning is
not required at this time — the placeholders carried over from the EKS
overlay turn out to fit GKE H100-mega-80gb fine.

New unit tests:

=== RUN   TestWaitForEndpointReady_AcceptsOnFirstRealCompletion
--- PASS: TestWaitForEndpointReady_AcceptsOnFirstRealCompletion (0.01s)
=== RUN   TestWaitForEndpointReady_TimesOutWhenAlwaysEmpty
--- PASS: TestWaitForEndpointReady_TimesOutWhenAlwaysEmpty (0.05s)

Risk Assessment

  • Low — Isolated change, well-tested, easy to revert

Rollout notes: The validator-side change replaces one readiness
probe with another stricter one; existing EKS Dynamo runs that relied
on the /health probe will now wait for a real chat completion before
launching AIPerf — the same behavior every EKS run should already
have had, so this is a tightening, not a behavior break. No
migrations, feature flags, or backwards-compat shims required. Reverts
cleanly.

Checklist

  • Tests pass locally (make test with -race)
  • Linter passes (make lint)
  • I did not skip/disable tests to make CI green
  • I added/updated tests for new functionality (two httptest-backed cases for the new probe)
  • I updated docs if user-facing behavior changed (N/A — no user-facing surface added)
  • Changes follow existing patterns in the codebase
  • Commits are cryptographically signed (git commit -S)

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 18, 2026

Review Change Stack

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

This PR wires the inference-perf check into the GKE H100 COS inference-dynamo recipe, updates timeout docs to reference a model-backed readiness probe, replaces the /health poll with a POST /v1/chat/completions probe that requires a non-empty completion, and adds tests verifying successful readiness detection and timeout behavior.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Suggested labels

area/tests, size/L

Suggested reviewers

  • mchmarny
🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly summarizes the two main changes: adding GKE inference performance validation and replacing the cold-start readiness probe from /health to /v1/chat/completions.
Description check ✅ Passed The description comprehensively explains the motivation, implementation, and testing of the changes, directly relating to the changeset modifications across recipe, validator, and timeout files.
Linked Issues check ✅ Passed All coding requirements from issue #937 are met: the GKE overlay now has a validation.performance block with inference-perf check and placeholder constraints (>=5000 tok/s, <=200ms p99), mirroring the EKS overlay as requested.
Out of Scope Changes check ✅ Passed All changes are within scope: GKE overlay wiring, readiness probe replacement, test additions, and timeout documentation updates directly address the linked issue #937 requirements.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

@yuanchen8911 yuanchen8911 marked this pull request as draft May 18, 2026 21:10
@yuanchen8911 yuanchen8911 force-pushed the fix/937-gke-inference-perf-validation branch from 365a514 to ce0298d Compare May 18, 2026 21:21
@github-actions github-actions Bot added size/M and removed size/S labels May 18, 2026
@yuanchen8911 yuanchen8911 changed the title fix(recipes): wire inference perf validation into GKE Dynamo overlay fix(performance): wire GKE Dynamo perf validation + cold-start readiness probe May 18, 2026
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@validators/performance/inference_perf_test.go`:
- Around line 981-983: The test calls waitForEndpointReadyWithInterval with
context.Background(), which can hang; replace that with a bounded context using
context.WithTimeout (e.g., a few seconds) and defer cancel() so the success-path
probe fails fast; update the call site that passes the context to
waitForEndpointReadyWithInterval (the invocation using srv.URL and "test-model")
to use the new timeout context and ensure cancel() is deferred.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 2e226bff-4388-42f8-8988-e3c0ac09ceb3

📥 Commits

Reviewing files that changed from the base of the PR and between 365a514 and ce0298d.

📒 Files selected for processing (5)
  • pkg/defaults/timeouts.go
  • pkg/defaults/timeouts_test.go
  • recipes/overlays/h100-gke-cos-inference-dynamo.yaml
  • validators/performance/inference_perf_constraint.go
  • validators/performance/inference_perf_test.go

Comment thread validators/performance/inference_perf_test.go Outdated
@yuanchen8911 yuanchen8911 force-pushed the fix/937-gke-inference-perf-validation branch from ce0298d to f28de1b Compare May 18, 2026 21:26
@yuanchen8911 yuanchen8911 marked this pull request as ready for review May 18, 2026 21:30
@yuanchen8911 yuanchen8911 requested a review from a team as a code owner May 18, 2026 21:30
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@validators/performance/inference_perf_constraint.go`:
- Around line 1015-1019: The file-local constant inferenceProbeBodyLimit should
be moved into the central defaults package and consumed from there; create a
named constant (e.g., Defaults.InferenceProbeBodyLimit or
InferenceProbeBodyLimit) in pkg/defaults, replace the local
inferenceProbeBodyLimit with that exported default in
inference_perf_constraint.go, and update any other usages around the other
occurrences (lines referencing the probe body cap at 1057-1063) to import and
use the pkg/defaults value instead of the magic literal.

In `@validators/performance/inference_perf_test.go`:
- Around line 958-963: The readiness-probe success test's HTTP handler (created
via httptest.NewServer in inference_perf_test.go, where srv :=
httptest.NewServer(http.HandlerFunc(...)) and calls.Add is used) only asserts
the request path; add an assertion that r.Method == "POST" inside that handler
(alongside the existing path check) so the test fails if the probe uses GET or
any other method.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 4f405cab-cb4a-4d61-a35c-b439f88c4ffb

📥 Commits

Reviewing files that changed from the base of the PR and between ce0298d and f28de1b.

📒 Files selected for processing (5)
  • pkg/defaults/timeouts.go
  • pkg/defaults/timeouts_test.go
  • recipes/overlays/h100-gke-cos-inference-dynamo.yaml
  • validators/performance/inference_perf_constraint.go
  • validators/performance/inference_perf_test.go

Comment thread validators/performance/inference_perf_constraint.go Outdated
Comment thread validators/performance/inference_perf_test.go
@yuanchen8911 yuanchen8911 force-pushed the fix/937-gke-inference-perf-validation branch from f28de1b to cb32810 Compare May 18, 2026 21:41
@yuanchen8911 yuanchen8911 changed the title fix(performance): wire GKE Dynamo perf validation + cold-start readiness probe fix(performance): add GKE inference performance validation + cold-start readiness probe May 18, 2026
@yuanchen8911 yuanchen8911 force-pushed the fix/937-gke-inference-perf-validation branch from cb32810 to c4203c1 Compare May 18, 2026 23:31
…ld-start readiness probe

Two related fixes:

1. Wire the GKE H100 COS Dynamo overlay's performance phase to the
   inference-perf validator. Without this, aicr validate --phase
   performance against the GKE recipe is a no-op (validators=0,
   status=skipped) — the validator is in the catalog, the EKS overlay
   subscribes (NVIDIA#641), but the GKE sibling was never extended.

2. Replace the inference-perf validator's GET /health readiness probe
   with a real POST /v1/chat/completions probe. Dynamo's frontend
   returns 200 from /health as soon as the HTTP server is up — well
   before backend workers register or the model finishes loading.
   Hitting that window with AIPerf produced an "all requests completed,
   zero tokens" failure that looked like a benchmark regression. The
   chat-completion probe only accepts the endpoint once the response
   carries a non-empty completion, which is the only signal both
   necessary and sufficient to know AIPerf will produce real numbers.
   Affects both EKS and GKE Dynamo overlays (single shared codepath).

Validation: ran end-to-end aicr validate --phase performance against
GKE H100 COS Dynamo with the fix in place. Throughput 33,982 tok/s
(>= 5000 PASS); TTFT p99 119.79 ms (<= 200 PASS). The placeholder
thresholds carried over from the EKS overlay sit comfortably inside
both observed values, so no per-platform tuning is required at this
time.

Fixes NVIDIA#937
@yuanchen8911 yuanchen8911 force-pushed the fix/937-gke-inference-perf-validation branch from c4203c1 to 167b079 Compare May 19, 2026 03:13
@mchmarny mchmarny merged commit 047123f into NVIDIA:main May 19, 2026
51 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add inference performance validation to GKE inference-dynamo overlay

2 participants