Skip to content

fix(ci,kwok): retry ensure_kwok_context label check to ride out apiserver visibility race#975

Merged
mchmarny merged 2 commits into
NVIDIA:mainfrom
njhensley:ci/kwok-ensure-context-retry
May 19, 2026
Merged

fix(ci,kwok): retry ensure_kwok_context label check to ride out apiserver visibility race#975
mchmarny merged 2 commits into
NVIDIA:mainfrom
njhensley:ci/kwok-ensure-context-retry

Conversation

@njhensley
Copy link
Copy Markdown
Member

Summary

ensure_kwok_context now retries the kubectl get nodes -l type=kwok check up to 10× at 0.5 s intervals (5 s budget) instead of failing on the first empty result.

Motivation / Context

The strict check added in #956 races against the kube-apiserver's label-index visibility path. apply-nodes.sh's kubectl wait --for=condition=Ready -l type=kwok returns via the watch cache, then validate-scheduling.sh is launched in a fresh subshell ~100 ms later — and the follow-up kubectl get -l type=kwok from that subshell sees an empty list, even though the nodes exist (the subsequent force-delete cleanup finds them by name). This was breaking unrelated PR-gate runs on Tier-1 KWOK lanes.

Fixes: #974
Related: #956

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • Build/CI/tooling

Component(s) Affected

  • Other: KWOK CI scripts (kwok/scripts/lib/cleanup.sh)

Implementation Notes

Picked option 1 from the issue's suggested fixes — bounded retry on the existing strict check. Loose context-name guard runs first and is unchanged; only the node-label probe is now retry-wrapped. 5 s is well past the observed ~100 ms race window but still tight enough to surface a genuinely empty cluster within a normal CI step. The error message now states (checked 10× over 5s) so future failures are visibly post-retry rather than first-shot.

The alternative options (name-prefix check instead of label selector; pre-flight in apply-nodes.sh) would have either lost the label-typed safety property or pushed the wait to a different script — option 1 keeps the fix at the call site that needs it.

Testing

bash -n syntax check passes; behaviour is exercised on the next KWOK Tier-1 run on this PR. No unit-testable Go change.

Risk Assessment

  • Low — Isolated change, well-tested, easy to revert

Rollout notes: N/A — pure CI shell change. The retry only adds tolerance; a truly missing-nodes failure still surfaces with the same error message and exit code, just up to 5 s later.

Checklist

  • Tests pass locally (make test with -race) — N/A, shell-only change
  • Linter passes (make lint) — bash -n clean
  • I did not skip/disable tests to make CI green
  • I added/updated tests for new functionality — N/A, defensive retry on existing check
  • I updated docs if user-facing behavior changed — N/A
  • Changes follow existing patterns in the codebase
  • Commits are cryptographically signed (git commit -S)

kwok/scripts/lib/cleanup.sh — ensure_kwok_context now retries     the
kubectl get nodes -l type=kwok check up to 10× at 0.5s intervals (5s
  total) instead of failing on the first empty result. This rides out
the         sub-second visibility gap between kubectl wait's watch path
(which saw the
  labels) and the follow-up kubectl get from a fresh subshell (which
didn't).
  Comment captures the race-window rationale and the chosen budget. Bash
syntax   checked clean.
@njhensley njhensley requested a review from a team as a code owner May 19, 2026 18:27
@njhensley njhensley self-assigned this May 19, 2026
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 19, 2026

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: c27c9f67-be55-423c-bda3-c85c27da8eb7

📥 Commits

Reviewing files that changed from the base of the PR and between 291439a and 38f08fa.

📒 Files selected for processing (1)
  • kwok/scripts/lib/cleanup.sh

📝 Walkthrough

Walkthrough

The ensure_kwok_context function in kwok/scripts/lib/cleanup.sh is modified to implement a bounded retry mechanism for detecting kwok-typed nodes. Instead of a single immediate kubectl get nodes -l type=kwok query that fails on label-index visibility races, the function now retries up to 10 times with 0.5s sleep between attempts. Documentation is expanded to explain the kube-apiserver label-selector visibility gap between kubectl wait's watch path and subsequent get queries. The change preserves the existing error-exit behavior after all retries are exhausted.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~8 minutes

🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately summarizes the main change: implementing a retry mechanism for the ensure_kwok_context label check to handle a kube-apiserver visibility race condition.
Description check ✅ Passed The description provides clear motivation (races against kube-apiserver's label-index visibility), implementation details (10 retries, 0.5s intervals, 5s budget), and context linking to issues #974 and #956.
Linked Issues check ✅ Passed The PR fully addresses the primary objective from #974: implementing bounded retry on the strict label-selector check (#974 option 1) to tolerate the ~100ms visibility gap while preserving label-typed safety.
Out of Scope Changes check ✅ Passed The changes are scoped solely to ensure_kwok_context in kwok/scripts/lib/cleanup.sh, directly addressing the visibility race. No unrelated modifications to other files or components are present.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

@mchmarny mchmarny enabled auto-merge (squash) May 19, 2026 19:35
@mchmarny mchmarny merged commit b34eca2 into NVIDIA:main May 19, 2026
102 checks passed
@njhensley njhensley deleted the ci/kwok-ensure-context-retry branch May 19, 2026 19:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

bug(kwok): ensure_kwok_context races against kubectl wait — label index visibility gap

2 participants