fix(ci,kwok): retry ensure_kwok_context label check to ride out apiserver visibility race by njhensley · Pull Request #975 · NVIDIA/aicr

njhensley · 2026-05-19T18:27:19Z

Summary

ensure_kwok_context now retries the kubectl get nodes -l type=kwok check up to 10× at 0.5 s intervals (5 s budget) instead of failing on the first empty result.

Motivation / Context

The strict check added in #956 races against the kube-apiserver's label-index visibility path. apply-nodes.sh's kubectl wait --for=condition=Ready -l type=kwok returns via the watch cache, then validate-scheduling.sh is launched in a fresh subshell ~100 ms later — and the follow-up kubectl get -l type=kwok from that subshell sees an empty list, even though the nodes exist (the subsequent force-delete cleanup finds them by name). This was breaking unrelated PR-gate runs on Tier-1 KWOK lanes.

Fixes: #974
Related: #956

Type of Change

Bug fix (non-breaking change that fixes an issue)
Build/CI/tooling

Component(s) Affected

Other: KWOK CI scripts (kwok/scripts/lib/cleanup.sh)

Implementation Notes

Picked option 1 from the issue's suggested fixes — bounded retry on the existing strict check. Loose context-name guard runs first and is unchanged; only the node-label probe is now retry-wrapped. 5 s is well past the observed ~100 ms race window but still tight enough to surface a genuinely empty cluster within a normal CI step. The error message now states (checked 10× over 5s) so future failures are visibly post-retry rather than first-shot.

The alternative options (name-prefix check instead of label selector; pre-flight in apply-nodes.sh) would have either lost the label-typed safety property or pushed the wait to a different script — option 1 keeps the fix at the call site that needs it.

Testing

bash -n syntax check passes; behaviour is exercised on the next KWOK Tier-1 run on this PR. No unit-testable Go change.

Risk Assessment

Low — Isolated change, well-tested, easy to revert

Rollout notes: N/A — pure CI shell change. The retry only adds tolerance; a truly missing-nodes failure still surfaces with the same error message and exit code, just up to 5 s later.

Checklist

Tests pass locally (make test with -race) — N/A, shell-only change
Linter passes (make lint) — bash -n clean
I did not skip/disable tests to make CI green
I added/updated tests for new functionality — N/A, defensive retry on existing check
I updated docs if user-facing behavior changed — N/A
Changes follow existing patterns in the codebase
Commits are cryptographically signed (git commit -S)

kwok/scripts/lib/cleanup.sh — ensure_kwok_context now retries the kubectl get nodes -l type=kwok check up to 10× at 0.5s intervals (5s total) instead of failing on the first empty result. This rides out the sub-second visibility gap between kubectl wait's watch path (which saw the labels) and the follow-up kubectl get from a fresh subshell (which didn't). Comment captures the race-window rationale and the chosen budget. Bash syntax checked clean.

coderabbitai · 2026-05-19T18:30:12Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: c27c9f67-be55-423c-bda3-c85c27da8eb7

📥 Commits

Reviewing files that changed from the base of the PR and between 291439a and 38f08fa.

📒 Files selected for processing (1)

kwok/scripts/lib/cleanup.sh

📝 Walkthrough

Walkthrough

The ensure_kwok_context function in kwok/scripts/lib/cleanup.sh is modified to implement a bounded retry mechanism for detecting kwok-typed nodes. Instead of a single immediate kubectl get nodes -l type=kwok query that fails on label-index visibility races, the function now retries up to 10 times with 0.5s sleep between attempts. Documentation is expanded to explain the kube-apiserver label-selector visibility gap between kubectl wait's watch path and subsequent get queries. The change preserves the existing error-exit behavior after all retries are exhausted.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~8 minutes

🚥 Pre-merge checks | ✅ 4

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title accurately summarizes the main change: implementing a retry mechanism for the ensure_kwok_context label check to handle a kube-apiserver visibility race condition.
Description check	✅ Passed	The description provides clear motivation (races against kube-apiserver's label-index visibility), implementation details (10 retries, 0.5s intervals, 5s budget), and context linking to issues `#974` and `#956`.
Linked Issues check	✅ Passed	The PR fully addresses the primary objective from `#974`: implementing bounded retry on the strict label-selector check (`#974` option 1) to tolerate the ~100ms visibility gap while preserving label-typed safety.
Out of Scope Changes check	✅ Passed	The changes are scoped solely to ensure_kwok_context in kwok/scripts/lib/cleanup.sh, directly addressing the visibility race. No unrelated modifications to other files or components are present.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

njhensley requested a review from a team as a code owner May 19, 2026 18:27

github-actions Bot added the size/S label May 19, 2026

njhensley self-assigned this May 19, 2026

Merge branch 'main' into ci/kwok-ensure-context-retry

c8b5a90

mchmarny approved these changes May 19, 2026

View reviewed changes

mchmarny enabled auto-merge (squash) May 19, 2026 19:35

mchmarny merged commit b34eca2 into NVIDIA:main May 19, 2026
102 checks passed

njhensley deleted the ci/kwok-ensure-context-retry branch May 19, 2026 19:50

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(ci,kwok): retry ensure_kwok_context label check to ride out apiserver visibility race#975

fix(ci,kwok): retry ensure_kwok_context label check to ride out apiserver visibility race#975
mchmarny merged 2 commits into
NVIDIA:mainfrom
njhensley:ci/kwok-ensure-context-retry

njhensley commented May 19, 2026

Uh oh!

coderabbitai Bot commented May 19, 2026

Walkthrough

Estimated code review effort

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

njhensley commented May 19, 2026

Summary

Motivation / Context

Type of Change

Component(s) Affected

Implementation Notes

Testing

Risk Assessment

Checklist

Uh oh!

coderabbitai Bot commented May 19, 2026

Walkthrough

Estimated code review effort

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants