fix(ci,kwok): retry ensure_kwok_context label check to ride out apiserver visibility race #975
kwok/scripts/lib/cleanup.sh — ensure_kwok_context now retries the kubectl get nodes -l type=kwok check up to 10× at 0.5s intervals (5s total) instead of failing on the first empty result. This rides out the sub-second visibility gap between kubectl wait's watch path (which saw the labels) and the follow-up kubectl get from a fresh subshell (which didn't). Comment captures the race-window rationale and the chosen budget. Bash syntax checked clean.
Summary
ensure_kwok_context now retries the kubectl get nodes -l type=kwok check up to 10× at 0.5 s intervals (5 s budget) instead of failing on the first empty result.
Motivation / Context
The strict check added in #956 races against the kube-apiserver's label-index visibility path.
apply-nodes.sh's kubectl wait --for=condition=Ready -l type=kwok returns via the watch cache, then validate-scheduling.sh is launched in a fresh subshell ~100 ms later, and the follow-up kubectl get -l type=kwok from that subshell sees an empty list even though the nodes exist (the subsequent force-delete cleanup finds them by name). This was breaking unrelated PR-gate runs on Tier-1 KWOK lanes.
Fixes: #974
Related: #956
Type of Change
Component(s) Affected
kwok/scripts/lib/cleanup.sh
Implementation Notes
Picked option 1 from the issue's suggested fixes: a bounded retry on the existing strict check. The loose context-name guard runs first and is unchanged; only the node-label probe is now retry-wrapped. 5 s is well past the observed ~100 ms race window but still tight enough to surface a genuinely empty cluster within a normal CI step. The error message now states (checked 10× over 5s) so future failures are visibly post-retry rather than first-shot.
The alternative options (name-prefix check instead of label selector; pre-flight in apply-nodes.sh) would have either lost the label-typed safety property or pushed the wait to a different script. Option 1 keeps the fix at the call site that needs it.
Testing
bash -n syntax check passes; behaviour is exercised on the next KWOK Tier-1 run on this PR. No unit-testable Go change.
Risk Assessment
Rollout notes: N/A — pure CI shell change. The retry only adds tolerance; a truly missing-nodes failure still surfaces with the same error message and exit code, just up to 5 s later.
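For readers who want to see the shape of the change, the bounded retry can be sketched roughly as below. This is an illustrative sketch only, not the actual contents of kwok/scripts/lib/cleanup.sh: the helper names retry_until_nonempty, probe_kwok_nodes and check_kwok_nodes, the -o name output flag, and the exact error wording are assumptions made for the example. The sketch sleeps after every failed attempt so that 10 attempts at 0.5 s add up to the 5 s budget.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of a bounded retry; names below are illustrative.

# retry_until_nonempty ATTEMPTS INTERVAL CMD [ARGS...]
# Runs CMD up to ATTEMPTS times. Succeeds (and prints CMD's output) as soon
# as CMD exits 0 with non-empty output; otherwise sleeps INTERVAL seconds
# and tries again. Returns 1 once all attempts are used up.
retry_until_nonempty() {
  local attempts=$1 interval=$2
  shift 2
  local i out
  for ((i = 1; i <= attempts; i++)); do
    if out=$("$@" 2>/dev/null) && [[ -n "$out" ]]; then
      printf '%s\n' "$out"
      return 0
    fi
    # Worst case: ATTEMPTS x INTERVAL seconds of waiting in total.
    sleep "$interval"
  done
  return 1
}

# Assumed form of the label probe (the -o name flag is a choice made here).
probe_kwok_nodes() {
  kubectl get nodes -l type=kwok -o name
}

# Assumed wrapper: 10 attempts at 0.5 s, then the post-retry error message.
check_kwok_nodes() {
  if ! retry_until_nonempty 10 0.5 probe_kwok_nodes >/dev/null; then
    echo "ERROR: no nodes labelled type=kwok (checked 10x over 5s)" >&2
    return 1
  fi
}
```

Passing the probe as a command keeps the retry loop testable without a cluster: a stub that prints nothing for the first few calls stands in for the visibility race.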
Checklist
make test with -race: N/A, shell-only change
make lint: bash -n clean
git commit -S