fix(performance): add GKE inference performance validation + cold-start readiness probe#952
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughThis PR wires the inference-perf check into the GKE H100 COS inference-dynamo recipe, updates timeout docs to reference a model-backed readiness probe, replaces the /health poll with a POST /v1/chat/completions probe that requires a non-empty completion, and adds tests verifying successful readiness detection and timeout behavior. Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Suggested labels
Suggested reviewers
🚥 Pre-merge checks | ✅ 4✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
365a514 to
ce0298d
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@validators/performance/inference_perf_test.go`:
- Around line 981-983: The test calls waitForEndpointReadyWithInterval with
context.Background(), which can hang; replace that with a bounded context using
context.WithTimeout (e.g., a few seconds) and defer cancel() so the success-path
probe fails fast; update the call site that passes the context to
waitForEndpointReadyWithInterval (the invocation using srv.URL and "test-model")
to use the new timeout context and ensure cancel() is deferred.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: 2e226bff-4388-42f8-8988-e3c0ac09ceb3
📒 Files selected for processing (5)
pkg/defaults/timeouts.gopkg/defaults/timeouts_test.gorecipes/overlays/h100-gke-cos-inference-dynamo.yamlvalidators/performance/inference_perf_constraint.govalidators/performance/inference_perf_test.go
ce0298d to
f28de1b
Compare
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@validators/performance/inference_perf_constraint.go`:
- Around line 1015-1019: The file-local constant inferenceProbeBodyLimit should
be moved into the central defaults package and consumed from there; create a
named constant (e.g., Defaults.InferenceProbeBodyLimit or
InferenceProbeBodyLimit) in pkg/defaults, replace the local
inferenceProbeBodyLimit with that exported default in
inference_perf_constraint.go, and update any other usages around the other
occurrences (lines referencing the probe body cap at 1057-1063) to import and
use the pkg/defaults value instead of the magic literal.
In `@validators/performance/inference_perf_test.go`:
- Around line 958-963: The readiness-probe success test's HTTP handler (created
via httptest.NewServer in inference_perf_test.go, where srv :=
httptest.NewServer(http.HandlerFunc(...)) and calls.Add is used) only asserts
the request path; add an assertion that r.Method == "POST" inside that handler
(alongside the existing path check) so the test fails if the probe uses GET or
any other method.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: 4f405cab-cb4a-4d61-a35c-b439f88c4ffb
📒 Files selected for processing (5)
pkg/defaults/timeouts.gopkg/defaults/timeouts_test.gorecipes/overlays/h100-gke-cos-inference-dynamo.yamlvalidators/performance/inference_perf_constraint.govalidators/performance/inference_perf_test.go
f28de1b to
cb32810
Compare
cb32810 to
c4203c1
Compare
…ld-start readiness probe Two related fixes: 1. Wire the GKE H100 COS Dynamo overlay's performance phase to the inference-perf validator. Without this, aicr validate --phase performance against the GKE recipe is a no-op (validators=0, status=skipped) — the validator is in the catalog, the EKS overlay subscribes (NVIDIA#641), but the GKE sibling was never extended. 2. Replace the inference-perf validator's GET /health readiness probe with a real POST /v1/chat/completions probe. Dynamo's frontend returns 200 from /health as soon as the HTTP server is up — well before backend workers register or the model finishes loading. Hitting that window with AIPerf produced an "all requests completed, zero tokens" failure that looked like a benchmark regression. The chat-completion probe only accepts the endpoint once the response carries a non-empty completion, which is the only signal both necessary and sufficient to know AIPerf will produce real numbers. Affects both EKS and GKE Dynamo overlays (single shared codepath). Validation: ran end-to-end aicr validate --phase performance against GKE H100 COS Dynamo with the fix in place. Throughput 33,982 tok/s (>= 5000 PASS); TTFT p99 119.79 ms (<= 200 PASS). The placeholder thresholds carried over from the EKS overlay sit comfortably inside both observed values, so no per-platform tuning is required at this time. Fixes NVIDIA#937
c4203c1 to
167b079
Compare
Summary
Two related fixes that together unblock
aicr validate --phase performanceagainst the GKE H100 COS Dynamo recipe:inference-perfvalidator.Without this, the phase is a no-op (
validators=0, status=skipped)— the validator and EKS sibling overlay already exist; the GKE
overlay was never extended.
inference-perfvalidator'sGET /healthprobe witha real
POST /v1/chat/completionsprobe. Dynamo's frontendreturns 200 from
/healthas soon as the HTTP server is up — wellbefore backend workers register or the model finishes loading.
Hitting that window with AIPerf produced an "all requests completed,
zero tokens" benchmark result that masqueraded as a regression. The
chat-completion probe only accepts the endpoint once the response
carries a non-empty completion — the only signal both necessary and
sufficient to know AIPerf will produce real numbers. Affects both
EKS and GKE Dynamo overlays (single shared codepath).
Motivation / Context
aicr validate --phase performanceagainst the GKE Dynamo recipe wassilently a no-op:
Wiring the overlay to the
inference-perfvalidator exposed apre-existing cold-start flake in the validator itself — first runs
against a fresh workload reported
throughput=0 tok/s, while a thirdrun several minutes later reported
33,982 tok/s. Root-caused to the/healthprobe accepting an endpoint that's not actually serving yet(Dynamo's frontend reports
instances: []until backend workersregister, even though
/healthalready returns 200). The fix replacesthe probe with a real chat-completion request that demands a non-empty
response before AIPerf launches.
Fixes: #937
Related: #641
Type of Change
Component(s) Affected
pkg/recipe)pkg/validator,validators/performance)pkg/defaults)Implementation Notes
recipes/overlays/h100-gke-cos-inference-dynamo.yaml: mirrors theEKS overlay's
validation.performanceblock one-for-one(
inference-perfcheck + placeholder>= 5000 tok/s/<= 200 ms p99thresholds, both empirically validated below).validators/performance/inference_perf_constraint.go:waitForEndpointHealth(aGET /healthpoll) withwaitForEndpointReady(aPOST /v1/chat/completionspoll thatrequires a non-empty
choices[0].message.content).io.LimitReaderagainst anew
inferenceProbeBodyLimitconst (per the repo's "no unboundedio.ReadAll" rule).waitForEndpointReadyWithInterval(..., pollInterval)seam so unit tests can drive the loop inmilliseconds without changing the production 10 s poll.
pkg/defaults/timeouts.go: refreshes theInferenceHealthTimeoutgodoc to reflect the new probe shape; the 5 min / 10 s budget is
unchanged.
validators/performance/inference_perf_test.go: two newhttptest-backed tests cover the "503 → 200-empty → 200-real"accept path and the persistent-empty timeout path.
Testing
make lint(yamllint + golangci-lint + license headers + agents sync + BOM): cleanmake test-coverage: pass (project-wide floor maintained at 76.7%)make e2e(chainsaw): 21/21 passmake scan(Grype): no vulnerabilitiesEnd-to-end run against a real GKE H100-mega-80gb COS Dynamo cluster
with the fix in place:
33,982 tok/sis comfortably above the>= 5000placeholder;119.79 mssits under the
<= 200placeholder. Per-platform threshold tuning isnot required at this time — the placeholders carried over from the EKS
overlay turn out to fit GKE H100-mega-80gb fine.
New unit tests:
Risk Assessment
Rollout notes: The validator-side change replaces one readiness
probe with another stricter one; existing EKS Dynamo runs that relied
on the
/healthprobe will now wait for a real chat completion beforelaunching AIPerf — the same behavior every EKS run should already
have had, so this is a tightening, not a behavior break. No
migrations, feature flags, or backwards-compat shims required. Reverts
cleanly.
Checklist
make testwith-race)make lint)httptest-backed cases for the new probe)git commit -S)