fix(performance): add GKE inference performance validation + cold-start readiness probe by yuanchen8911 · Pull Request #952 · NVIDIA/aicr

yuanchen8911 · 2026-05-18T20:24:27Z

Summary

Two related fixes that together unblock aicr validate --phase performance against the GKE H100 COS Dynamo recipe:

Wire the GKE Dynamo overlay to the inference-perf validator.
Without this, the phase is a no-op (validators=0, status=skipped)
— the validator and EKS sibling overlay already exist; the GKE
overlay was never extended.
Replace the inference-perf validator's GET /health probe with
a real POST /v1/chat/completions probe. Dynamo's frontend
returns 200 from /health as soon as the HTTP server is up — well
before backend workers register or the model finishes loading.
Hitting that window with AIPerf produced an "all requests completed,
zero tokens" benchmark result that masqueraded as a regression. The
chat-completion probe only accepts the endpoint once the response
carries a non-empty completion — the only signal both necessary and
sufficient to know AIPerf will produce real numbers. Affects both
EKS and GKE Dynamo overlays (single shared codepath).

Motivation / Context

aicr validate --phase performance against the GKE Dynamo recipe was
silently a no-op:

[cli] phase requested but no checks defined in recipe;
      phase will be empty: phase=performance
[cli] phase completed: phase=performance status=skipped

Wiring the overlay to the inference-perf validator exposed a
pre-existing cold-start flake in the validator itself — first runs
against a fresh workload reported throughput=0 tok/s, while a third
run several minutes later reported 33,982 tok/s. Root-caused to the
/health probe accepting an endpoint that's not actually serving yet
(Dynamo's frontend reports instances: [] until backend workers
register, even though /health already returns 200). The fix replaces
the probe with a real chat-completion request that demands a non-empty
response before AIPerf launches.

Fixes: #937
Related: #641

Type of Change

Bug fix (non-breaking change that fixes an issue)

Component(s) Affected

Recipe engine / data (pkg/recipe)
Validator (pkg/validator, validators/performance)
Core libraries (pkg/defaults)

Implementation Notes

recipes/overlays/h100-gke-cos-inference-dynamo.yaml: mirrors the
EKS overlay's validation.performance block one-for-one
(inference-perf check + placeholder >= 5000 tok/s /
<= 200 ms p99 thresholds, both empirically validated below).
validators/performance/inference_perf_constraint.go:
- Replaces waitForEndpointHealth (a GET /health poll) with
  waitForEndpointReady (a POST /v1/chat/completions poll that
  requires a non-empty choices[0].message.content).
- Bounds the probe response body with io.LimitReader against a
  new inferenceProbeBodyLimit const (per the repo's "no unbounded
  io.ReadAll" rule).
- Splits an internal waitForEndpointReadyWithInterval(..., pollInterval) seam so unit tests can drive the loop in
  milliseconds without changing the production 10 s poll.
pkg/defaults/timeouts.go: refreshes the InferenceHealthTimeout
godoc to reflect the new probe shape; the 5 min / 10 s budget is
unchanged.
validators/performance/inference_perf_test.go: two new
httptest-backed tests cover the "503 → 200-empty → 200-real"
accept path and the persistent-empty timeout path.

Testing

make qualify

make lint (yamllint + golangci-lint + license headers + agents sync + BOM): clean
make test-coverage: pass (project-wide floor maintained at 76.7%)
make e2e (chainsaw): 21/21 pass
make scan (Grype): no vulnerabilities

End-to-end run against a real GKE H100-mega-80gb COS Dynamo cluster
with the fix in place:

[cli] validator completed: name=inference-perf status=passed
[cli] Inference throughput: 33982.54 tokens/sec
[cli] Inference TTFT p99: 119.79 ms
[cli] phase completed: phase=performance status=passed validators=1 passed=1 failed=0

33,982 tok/s is comfortably above the >= 5000 placeholder; 119.79 ms
sits under the <= 200 placeholder. Per-platform threshold tuning is
not required at this time — the placeholders carried over from the EKS
overlay turn out to fit GKE H100-mega-80gb fine.

New unit tests:

=== RUN   TestWaitForEndpointReady_AcceptsOnFirstRealCompletion
--- PASS: TestWaitForEndpointReady_AcceptsOnFirstRealCompletion (0.01s)
=== RUN   TestWaitForEndpointReady_TimesOutWhenAlwaysEmpty
--- PASS: TestWaitForEndpointReady_TimesOutWhenAlwaysEmpty (0.05s)

Risk Assessment

Low — Isolated change, well-tested, easy to revert

Rollout notes: The validator-side change replaces one readiness
probe with another stricter one; existing EKS Dynamo runs that relied
on the /health probe will now wait for a real chat completion before
launching AIPerf — the same behavior every EKS run should already
have had, so this is a tightening, not a behavior break. No
migrations, feature flags, or backwards-compat shims required. Reverts
cleanly.

Checklist

Tests pass locally (make test with -race)
Linter passes (make lint)
I did not skip/disable tests to make CI green
I added/updated tests for new functionality (two httptest-backed cases for the new probe)
I updated docs if user-facing behavior changed (N/A — no user-facing surface added)
Changes follow existing patterns in the codebase
Commits are cryptographically signed (git commit -S)

coderabbitai · 2026-05-18T20:27:24Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

▶️ Resume reviews
🔍 Trigger review

📝 Walkthrough

Walkthrough

This PR wires the inference-perf check into the GKE H100 COS inference-dynamo recipe, updates timeout docs to reference a model-backed readiness probe, replaces the /health poll with a POST /v1/chat/completions probe that requires a non-empty completion, and adds tests verifying successful readiness detection and timeout behavior.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Suggested labels

area/tests, size/L

Suggested reviewers

mchmarny

🚥 Pre-merge checks | ✅ 4

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title clearly summarizes the two main changes: adding GKE inference performance validation and replacing the cold-start readiness probe from /health to /v1/chat/completions.
Description check	✅ Passed	The description comprehensively explains the motivation, implementation, and testing of the changes, directly relating to the changeset modifications across recipe, validator, and timeout files.
Linked Issues check	✅ Passed	All coding requirements from issue `#937` are met: the GKE overlay now has a validation.performance block with inference-perf check and placeholder constraints (>=5000 tok/s, <=200ms p99), mirroring the EKS overlay as requested.
Out of Scope Changes check	✅ Passed	All changes are within scope: GKE overlay wiring, readiness probe replacement, test additions, and timeout documentation updates directly address the linked issue `#937` requirements.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@validators/performance/inference_perf_test.go`:
- Around line 981-983: The test calls waitForEndpointReadyWithInterval with
context.Background(), which can hang; replace that with a bounded context using
context.WithTimeout (e.g., a few seconds) and defer cancel() so the success-path
probe fails fast; update the call site that passes the context to
waitForEndpointReadyWithInterval (the invocation using srv.URL and "test-model")
to use the new timeout context and ensure cancel() is deferred.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 2e226bff-4388-42f8-8988-e3c0ac09ceb3

📥 Commits

Reviewing files that changed from the base of the PR and between 365a514 and ce0298d.

📒 Files selected for processing (5)

pkg/defaults/timeouts.go
pkg/defaults/timeouts_test.go
recipes/overlays/h100-gke-cos-inference-dynamo.yaml
validators/performance/inference_perf_constraint.go
validators/performance/inference_perf_test.go

coderabbitai

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@validators/performance/inference_perf_constraint.go`:
- Around line 1015-1019: The file-local constant inferenceProbeBodyLimit should
be moved into the central defaults package and consumed from there; create a
named constant (e.g., Defaults.InferenceProbeBodyLimit or
InferenceProbeBodyLimit) in pkg/defaults, replace the local
inferenceProbeBodyLimit with that exported default in
inference_perf_constraint.go, and update any other usages around the other
occurrences (lines referencing the probe body cap at 1057-1063) to import and
use the pkg/defaults value instead of the magic literal.

In `@validators/performance/inference_perf_test.go`:
- Around line 958-963: The readiness-probe success test's HTTP handler (created
via httptest.NewServer in inference_perf_test.go, where srv :=
httptest.NewServer(http.HandlerFunc(...)) and calls.Add is used) only asserts
the request path; add an assertion that r.Method == "POST" inside that handler
(alongside the existing path check) so the test fails if the probe uses GET or
any other method.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 4f405cab-cb4a-4d61-a35c-b439f88c4ffb

📥 Commits

Reviewing files that changed from the base of the PR and between ce0298d and f28de1b.

📒 Files selected for processing (5)

pkg/defaults/timeouts.go
pkg/defaults/timeouts_test.go
recipes/overlays/h100-gke-cos-inference-dynamo.yaml
validators/performance/inference_perf_constraint.go
validators/performance/inference_perf_test.go

…ld-start readiness probe Two related fixes: 1. Wire the GKE H100 COS Dynamo overlay's performance phase to the inference-perf validator. Without this, aicr validate --phase performance against the GKE recipe is a no-op (validators=0, status=skipped) — the validator is in the catalog, the EKS overlay subscribes (NVIDIA#641), but the GKE sibling was never extended. 2. Replace the inference-perf validator's GET /health readiness probe with a real POST /v1/chat/completions probe. Dynamo's frontend returns 200 from /health as soon as the HTTP server is up — well before backend workers register or the model finishes loading. Hitting that window with AIPerf produced an "all requests completed, zero tokens" failure that looked like a benchmark regression. The chat-completion probe only accepts the endpoint once the response carries a non-empty completion, which is the only signal both necessary and sufficient to know AIPerf will produce real numbers. Affects both EKS and GKE Dynamo overlays (single shared codepath). Validation: ran end-to-end aicr validate --phase performance against GKE H100 COS Dynamo with the fix in place. Throughput 33,982 tok/s (>= 5000 PASS); TTFT p99 119.79 ms (<= 200 PASS). The placeholder thresholds carried over from the EKS overlay sit comfortably inside both observed values, so no per-platform tuning is required at this time. Fixes NVIDIA#937

yuanchen8911 requested a review from a team as a code owner May 18, 2026 20:24

yuanchen8911 added area/recipes area/validator bug labels May 18, 2026

github-actions Bot added size/S and removed area/validator labels May 18, 2026

mchmarny assigned yuanchen8911 May 18, 2026

yuanchen8911 marked this pull request as draft May 18, 2026 21:10

yuanchen8911 force-pushed the fix/937-gke-inference-perf-validation branch from 365a514 to ce0298d Compare May 18, 2026 21:21

github-actions Bot added size/M and removed size/S labels May 18, 2026

yuanchen8911 changed the title ~~fix(recipes): wire inference perf validation into GKE Dynamo overlay~~ fix(performance): wire GKE Dynamo perf validation + cold-start readiness probe May 18, 2026

coderabbitai Bot reviewed May 18, 2026

View reviewed changes

Comment thread validators/performance/inference_perf_test.go Outdated

yuanchen8911 force-pushed the fix/937-gke-inference-perf-validation branch from ce0298d to f28de1b Compare May 18, 2026 21:26

yuanchen8911 marked this pull request as ready for review May 18, 2026 21:30

yuanchen8911 requested a review from a team as a code owner May 18, 2026 21:30

coderabbitai Bot reviewed May 18, 2026

View reviewed changes

Comment thread validators/performance/inference_perf_constraint.go Outdated

Comment thread validators/performance/inference_perf_test.go

yuanchen8911 requested review from mchmarny and njhensley May 18, 2026 21:30

yuanchen8911 force-pushed the fix/937-gke-inference-perf-validation branch from f28de1b to cb32810 Compare May 18, 2026 21:41

yuanchen8911 changed the title ~~fix(performance): wire GKE Dynamo perf validation + cold-start readiness probe~~ fix(performance): add GKE inference performance validation + cold-start readiness probe May 18, 2026

yuanchen8911 force-pushed the fix/937-gke-inference-perf-validation branch from cb32810 to c4203c1 Compare May 18, 2026 23:31

yuanchen8911 force-pushed the fix/937-gke-inference-perf-validation branch from c4203c1 to 167b079 Compare May 19, 2026 03:13

mchmarny approved these changes May 19, 2026

View reviewed changes

mchmarny merged commit 047123f into NVIDIA:main May 19, 2026
51 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(performance): add GKE inference performance validation + cold-start readiness probe#952

fix(performance): add GKE inference performance validation + cold-start readiness probe#952
mchmarny merged 1 commit into
NVIDIA:mainfrom
yuanchen8911:fix/937-gke-inference-perf-validation

yuanchen8911 commented May 18, 2026 •

edited

Loading

Uh oh!

coderabbitai Bot commented May 18, 2026 •

edited

Loading

Reviews paused

Walkthrough

Estimated code review effort

Suggested labels

Suggested reviewers

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

yuanchen8911 commented May 18, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Motivation / Context

Type of Change

Component(s) Affected

Implementation Notes

Testing

Risk Assessment

Checklist

Uh oh!

coderabbitai Bot commented May 18, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Reviews paused

Walkthrough

Estimated code review effort

Suggested labels

Suggested reviewers

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

yuanchen8911 commented May 18, 2026 •

edited

Loading

coderabbitai Bot commented May 18, 2026 •

edited

Loading