Fix: idle TEs leak from scheduler pool after disable + reconnect + expiry#850
Merged
Conversation

james-lubin approved these changes on Apr 22, 2026
Summary
Resource-cluster scheduling can enter a state where a TaskExecutor is reported as available by the metrics endpoint and by `/getAvailableTaskExecutors`, yet new reservations for its SKU keep failing with `NoResourceAvailableException`. This PR closes the gap in `ExecutorStateManagerActor` that produces that drift and ships a deterministic repro test covering both the by-id and the attribute-based disable paths.

Symptom
From an impacted master log:
After these two lines, the TE:

- still counts toward `ResourceOverview.numAvailableTaskExecutors`,
- still appears in `GetAvailableTaskExecutorsRequest` responses,
- is invisible to `findBestFit`: assignment requests for its SKU raise `NoResourceAvailableException` with the accompanying scheduler logs `"Cannot find any matching sku for request"` / `"Not enough available TEs found for scheduling constraints"`.

Root cause
The three views read different data structures:

- `ResourceOverview.numAvailable` metric: `taskExecutorStateMap.values().filter(isAvailable)` (`ExecutorStateManagerImpl.java:253`)
- `GetAvailableTaskExecutorsRequest` API: `taskExecutorStateMap.entrySet().filter(isAvailable)` (`ExecutorStateManagerActor.java:410`)
- Scheduling (`findBestFit`): `executorsByGroup: Map<TaskExecutorGroupKey, NavigableSet<TaskExecutorHolder>>` (`ExecutorStateManagerImpl.java:124,381-417`)

`executorsByGroup` is populated only by `tryMarkAvailable(id, state)`, which requires `state.isAvailable() && state.getRegistration() != null` at the moment of the call.

Drift sequence that produces the stuck state (trace for
`2af3d069-…`):

1. The TE is disabled (`man-nf` mantis's `UpgradeClusterChildWorkflowImpl` drain). `state.onNodeDisabled()` sets `disabled=true`. The scheduler's filter at `:407` correctly excludes the TE; so far, so good.
2. `onDisconnection()` clears `registration` and `availabilityState`; `archive()` moves the state out of `taskExecutorStateMap`. A stale holder may linger in `executorsByGroup`, but it is dropped by the `!taskExecutorStateMap.containsKey(...)` guard during assignment.
3. On reconnect, `trackIfAbsent` puts a fresh `TaskExecutorState` into the map. `tryMarkAvailable(id, new state)` is a no-op because the new state has `registration=null, availabilityState=null`.
4. `onHeartbeat` loads the registration, calls `onRegistration` (returns `true`), then sees the TE is still in `disabledTaskExecutors` and calls `onNodeDisabled()`, setting `disabled=true` again. This is the `"Reconnected task executor ... was already marked for disabling"` log.
5. `state.onHeartbeat(hb)` flips `availabilityState=Pending`, but the `if (stateChange && state.isAvailable())` gate fails because `isAvailable()` evaluates `Pending && !disabled` to `false`. `tryMarkAvailable` is never called; the TE never enters `executorsByGroup`.
6. `onDisableTaskExecutorsRequestExpiry` calls `state.onNodeEnabled()`, which clears the flag, but it does not re-index the TE into `executorsByGroup`. After this, `state.isAvailable()` is `true`, so the metric and the API both report the TE as available, while `findBestFit` still cannot see it.

The symmetric gap exists in `onCheckDisabledTaskExecutors`: when a TE is disabled while present in `executorsByGroup`, `state.onNodeDisabled()` is called without a matching `tryMarkUnavailable`. The filter at `:407` happens to rescue the assignment side, but the stale holder adds noise to the generation-first ordering in `findBestGroupBySizeNameMatch`.

Fix
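Before the edits below, the sequence above leaves the availability flags and the group index permanently out of sync. A toy replay of the drift (field names and flag semantics are simplified assumptions reconstructed from the trace, not the real Mantis classes):

```java
// Toy replay of the drift sequence; flag semantics are simplified
// assumptions reconstructed from the trace, not the real Mantis code.
public class DriftReplay {
    static boolean disabled;
    static boolean registered;
    static String availabilityState;              // null, "Pending", "Available"
    static boolean inExecutorsByGroup;            // stand-in for the group index
    static boolean pendingDisableRequest = true;  // id is in disabledTaskExecutors

    static boolean isAvailable() { return !disabled && availabilityState != null; }

    // Mirrors tryMarkAvailable's precondition: available AND registered.
    static void tryMarkAvailable() {
        if (isAvailable() && registered) inExecutorsByGroup = true;
    }

    public static void main(String[] args) {
        // 1. Disable request: disabled=true, scheduler filter excludes the TE.
        disabled = true;

        // 2. Disconnect + archive: registration/availability cleared, state
        //    leaves the map; stale holders are guarded out of assignment.
        registered = false;
        availabilityState = null;
        inExecutorsByGroup = false;
        disabled = false; // fresh TaskExecutorState on reconnect (step 3)

        // 3. Reconnect: tryMarkAvailable on the fresh state is a no-op.
        tryMarkAvailable();

        // 4. Heartbeat re-registers, then re-applies the pending disable.
        registered = true;
        if (pendingDisableRequest) disabled = true;

        // 5. availabilityState flips to Pending, but the
        //    "stateChange && isAvailable()" gate fails while disabled.
        availabilityState = "Pending";
        boolean stateChange = true;
        if (stateChange && isAvailable()) tryMarkAvailable(); // never runs

        // 6. Expiry re-enables the node but does not re-index it.
        pendingDisableRequest = false;
        disabled = false;

        System.out.println("isAvailable=" + isAvailable());             // true
        System.out.println("inExecutorsByGroup=" + inExecutorsByGroup); // false
    }
}
```

The final two prints show the stuck state: the map-derived views (metric, API) see an available TE while the scheduler's index never regains it.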
Three edits in `ExecutorStateManagerActor.java`; all keep `executorsByGroup` in sync with `isAvailable()`:

1. `onDisableTaskExecutorsRequestExpiry`, by-id path: after `state.onNodeEnabled()` flips the flag to `false`, call `delegate.tryMarkAvailable(taskExecutorID)`. Idempotent: a no-op if `isAvailable()==false` or if the holder is already present.
2. `onDisableTaskExecutorsRequestExpiry`, attribute path: the same fix for the by-attributes branch.
3. `onCheckDisabledTaskExecutors`: after `state.onNodeDisabled()` flips the flag to `true`, call `delegate.tryMarkUnavailable(id)`. The by-id branch also guards on `state.getRegistration() != null` to avoid `tryMarkUnavailable`'s `log.warn` firing when a disable request arrives for a TE that is currently disconnected.

All calls are guarded by the `onNodeDisabled`/`onNodeEnabled` return value, so they only run on an actual flag flip: no double-remove or double-add. `tryMarkAvailable`/`tryMarkUnavailable` are already safe and idempotent under these preconditions. No behavior change in any other path.
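The shape of all three edits can be sketched with stand-in types; this is a simplified illustration of the flip-guard plus idempotent mark calls, not the actual patch:

```java
// Sketch of the fix shape: re-index only on an actual flag flip, and
// rely on the idempotence of the mark calls. Stand-in types, not the
// real Mantis classes.
import java.util.HashSet;
import java.util.Set;

public class FixSketch {
    static class State {
        boolean disabled = true;   // start in the disabled, stuck state
        boolean registered = true;
        boolean available = true;  // stand-in for availabilityState != null

        boolean isAvailable() { return !disabled && available; }

        // Returns true only on an actual flip, mirroring onNodeEnabled().
        boolean onNodeEnabled() {
            if (!disabled) return false;
            disabled = false;
            return true;
        }

        boolean onNodeDisabled() {
            if (disabled) return false;
            disabled = true;
            return true;
        }
    }

    static final Set<String> executorsByGroup = new HashSet<>();

    static void tryMarkAvailable(String id, State s) {
        if (s.isAvailable() && s.registered) executorsByGroup.add(id); // idempotent
    }

    static void tryMarkUnavailable(String id) {
        executorsByGroup.remove(id);                                   // idempotent
    }

    // Fix sites 1 and 2: expiry re-enables, then re-indexes.
    static void onExpiry(String id, State s) {
        if (s.onNodeEnabled()) tryMarkAvailable(id, s);
    }

    // Fix site 3: disable de-indexes, guarded on a live registration.
    static void onDisable(String id, State s) {
        if (s.onNodeDisabled() && s.registered) tryMarkUnavailable(id);
    }

    public static void main(String[] args) {
        State s = new State();
        onExpiry("te-1", s);
        System.out.println(executorsByGroup.contains("te-1")); // true
        onExpiry("te-1", s);  // second expiry: onNodeEnabled() false, no double-add
        onDisable("te-1", s);
        System.out.println(executorsByGroup.contains("te-1")); // false
    }
}
```

The flip-guard keeps the mark calls from running on repeated expiry or disable events, and the mark calls themselves tolerate being called when the index is already in the desired state.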
New `DisableReconnectExpireStuckReproTest` exercises the full sequence end-to-end via `ResourceCluster`:

- `reconnectDuringTargetedDisablePlusExpirePreservesSchedulability`: by-id `DisableTaskExecutorsRequest`.
- `reconnectDuringAttributeBasedDisablePlusExpirePreservesSchedulability`: attribute-matching `DisableTaskExecutorsRequest` (exercises the other fix site; confirmed by the `re-enable TE:` log line).

Each test registers a TE, heartbeats `Occupied` (so it drops from `executorsByGroup` just like a running TE during an ASG drain), disables it, disconnects it, re-registers it with an `Available` heartbeat before the expiry, waits past the expiry, and then asserts that `getTaskExecutorsFor` returns the TE. On unfixed master both tests fail with `"NoResourceAvailableException despite TE being reported as available…"`; with the fix, both pass.
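The asserted test flow maps onto the same kind of toy model; a sketch of the sequence with the expiry behaving as patched (simplified assumptions throughout; the real tests drive `ResourceCluster` against the actual actors):

```java
// Toy end-to-end replay of the repro sequence with the patched expiry
// behavior (re-enable AND re-index). Simplified assumptions throughout;
// not the real test harness.
import java.util.HashSet;
import java.util.Set;

public class ReproReplay {
    static boolean disabled, registered, occupied;
    static boolean pendingDisable;                 // id in disabledTaskExecutors
    static final Set<String> executorsByGroup = new HashSet<>();

    static boolean isAvailable() { return !disabled && registered && !occupied; }

    static void tryMarkAvailable(String id) {
        if (isAvailable()) executorsByGroup.add(id);
    }

    public static void main(String[] args) {
        String id = "te-1";

        // Register + Available heartbeat: TE is schedulable.
        registered = true;
        tryMarkAvailable(id);

        // Occupied heartbeat: drops from executorsByGroup like a running TE.
        occupied = true;
        executorsByGroup.remove(id);

        // Disable request, then disconnect.
        pendingDisable = true;
        disabled = true;
        registered = false;
        occupied = false;

        // Re-register with an Available heartbeat before expiry: the pending
        // disable re-applies, so the TE is still not indexed.
        registered = true;
        disabled = pendingDisable;
        tryMarkAvailable(id);                              // no-op while disabled
        System.out.println(executorsByGroup.contains(id)); // false

        // Expiry with the fix: re-enable and re-index on the flag flip.
        pendingDisable = false;
        if (disabled) {
            disabled = false;
            tryMarkAvailable(id);
        }

        // getTaskExecutorsFor now sees the TE again.
        System.out.println(executorsByGroup.contains(id)); // true
    }
}
```

On unfixed behavior the final re-index step would be absent, leaving the last print `false`, which is the failure mode the tests assert against.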