fix(control-plane): rescue nodes stuck in lifecycle_status=starting #487
Merged
Agents that register and then send heartbeats indefinitely with status="starting" (notably the Python SDK, whose _current_status is initialized to STARTING and only ever transitions to OFFLINE on shutdown) were left wedged in lifecycle_status="starting" forever:

- needsReconciliation() only fired for stuck-starting agents when their heartbeat was ALSO stale, which never happens for a healthy agent heartbeating every 2s.
- reconcileAgentStatus() only promoted empty/offline → ready; it preserved "starting" even when the heartbeat was fresh.
- The UpdateAgentStatus auto-sync also only promoted offline/empty → ready when state flipped to Active, so a successful HTTP health check couldn't pull an agent out of "starting" either.
- Every "starting" heartbeat from the SDK re-asserted lifecycle_status="starting" via UpdateFromHeartbeat, clobbering any promotion.

This patch:

- Adds a reconciliation rule for agents stuck in "starting" past MaxTransitionTime since RegisteredAt with a FRESH heartbeat: the fresh heartbeat proves liveness, registration age proves startup is done.
- Promotes "starting" → "ready" in reconcileAgentStatus when the heartbeat is fresh.
- Promotes "starting" → "ready" in the UpdateAgentStatus auto-sync when state transitions to Active (e.g. successful HTTP health check).
- Guards UpdateFromHeartbeat so "starting" heartbeats don't regress an already-promoted "ready"/"degraded" agent.

Adds three tests covering the full scenario end-to-end: reconciliation rescues the stuck node, repeated "starting" heartbeats do not regress it, and health-check-driven Active state also promotes "starting" → "ready".

Fixes #484
Contributor
📊 Coverage gate
Thresholds from
✅ Gate passed
No surface regressed past the allowed threshold and the aggregate stayed above the floor.
Contributor
📐 Patch coverage gate
Threshold: 80% on lines this PR touches vs
✅ Patch gate passed
Every surface whose lines were touched by this PR has patch coverage at or above the threshold.
Summary
Fixes #484. Nodes that register with the control plane and then send heartbeats indefinitely with `status="starting"` (notably the Python SDK, whose `_current_status` is initialized to `STARTING` and only ever moves to `OFFLINE` on shutdown) were left wedged in `lifecycle_status="starting"` forever, despite fresh heartbeats, a healthy `/health` endpoint, and successful executions.

Four cooperating bugs in `status_manager.go` created this:

- `needsReconciliation()` only flagged stuck-starting agents whose heartbeat was ALSO stale. If the agent is healthy and heartbeating every 2s, that branch never fires.
- `reconcileAgentStatus()` only promoted `empty/offline` → `ready`; it preserved `starting` even when the heartbeat was fresh, so even if reconciliation did run the agent stayed stuck.
- The `UpdateAgentStatus` auto-sync had the same blind spot: when the health monitor marked an agent `AgentStateActive`, it only advanced `lifecycle_status` out of `offline/empty`, not out of `starting`.
- `UpdateFromHeartbeat` honored the SDK's regressive `status="starting"` signal, meaning any promotion would get clobbered by the next heartbeat.

Fix
- Reconcile `starting` agents with a fresh heartbeat whose `RegisteredAt` is older than `MaxTransitionTime` (default 2m). Fresh heartbeat proves liveness; registration age proves startup is done.
- Promote `starting` → `ready` in `reconcileAgentStatus` when the heartbeat is fresh.
- Promote `starting` → `ready` in the `UpdateAgentStatus` auto-sync when state transitions to `Active` (e.g. after a successful HTTP `/status` check).
- In `UpdateFromHeartbeat`, ignore a `starting` lifecycle signal if the agent is already `ready/degraded`. Liveness side-effects (`LastSeen`, state) are still applied; only the regressive lifecycle downgrade is dropped.

No schema or config changes. No behavior change for agents that already transition to `ready` explicitly (e.g. the Go SDK's `markReady`).

Test plan
Three new tests, all using real SQLite storage + real reconciliation flow:

- `TestStatusManager_Reconciliation_UsesConfiguredThreshold`: extended to cover the "stuck starting + fresh heartbeat + past grace period" case that previously returned `false` from `needsReconciliation`.
- `TestStatusManager_StuckStartingIsReconciledToReady`: end-to-end reproduction. Register a node 10 minutes ago in `starting` with a fresh heartbeat, call `performReconciliation()`, confirm promotion to `ready`, then blast 5 `UpdateFromHeartbeat` calls with `status="starting"` and confirm the agent stays `ready`.
- `TestStatusManager_UpdateAgentStatus_ActivePromotesStarting`: simulate the health monitor marking the agent `AgentStateActive` and confirm `lifecycle_status` advances from `starting` → `ready`.

- […] `false` for stuck-starting + fresh heartbeat past the grace period).
- `go test ./internal/services/`: all pass.
- `go test ./internal/handlers/ ./pkg/types/`: all pass.
- `go test ./...`: the only failure is `TestDevServiceRunDev` in `internal/core/services`, which is a pre-existing environmental failure (agent port discovery times out in WSL); confirmed to fail identically on `main` with this change stashed.

Follow-up (not in this PR)
The Python SDK never transitions `_current_status` out of `STARTING`: it only goes `STARTING → OFFLINE` on shutdown (`sdk/python/agentfield/agent.py:559`, `sdk/python/agentfield/agent.py:3489`). The control-plane fix makes third-party SDK behavior irrelevant, but the SDK should still set `_current_status = AgentStatus.READY` once its FastAPI server is accepting requests. Worth a separate issue.