
fix(control-plane): rescue nodes stuck in lifecycle_status=starting #487

Merged
AbirAbbas merged 1 commit into main from fix/node-stuck-starting-lifecycle on Apr 20, 2026
Conversation

@AbirAbbas
Contributor

Summary

Fixes #484. Nodes that register with the control plane and then send heartbeats indefinitely with status="starting" (notably the Python SDK, whose _current_status is initialized to STARTING and only ever moves to OFFLINE on shutdown) were left wedged in lifecycle_status="starting" forever — despite fresh heartbeats, a healthy /health endpoint, and successful executions.

Four cooperating bugs in status_manager.go created this:

  1. needsReconciliation() only flagged stuck-starting agents whose heartbeat was ALSO stale. If the agent is healthy and heartbeating every 2s, that branch never fires.
  2. reconcileAgentStatus() only promoted empty/offline → ready; it preserved starting even when the heartbeat was fresh, so even if reconciliation did run the agent stayed stuck.
  3. The UpdateAgentStatus auto-sync had the same blind spot: when the health monitor marked an agent AgentStateActive, it only advanced lifecycle_status out of offline/empty, not out of starting.
  4. UpdateFromHeartbeat honored the SDK's regressive status="starting" signal, meaning any promotion would get clobbered by the next heartbeat.

Fix

  • Reconciliation detection — detect starting agents with a fresh heartbeat whose RegisteredAt is older than MaxTransitionTime (default 2m). Fresh heartbeat proves liveness; registration age proves startup is done.
  • Reconciliation action — promote starting → ready in reconcileAgentStatus when the heartbeat is fresh.
  • Health-check promotion — promote starting → ready in the UpdateAgentStatus auto-sync when state transitions to Active (e.g. after a successful HTTP /status check).
  • Regression guard — in UpdateFromHeartbeat, ignore a starting lifecycle signal if the agent is already ready/degraded. Liveness side-effects (LastSeen, state) are still applied — only the regressive lifecycle downgrade is dropped.

No schema or config changes. No behavior change for agents that already transition to ready explicitly (e.g. the Go SDK's markReady).

Test plan

New and updated tests, all using real SQLite storage + the real reconciliation flow:

  • TestStatusManager_Reconciliation_UsesConfiguredThreshold — extended to cover the "stuck starting + fresh heartbeat + past grace period" case that previously returned false from needsReconciliation.
  • TestStatusManager_StuckStartingIsReconciledToReady — end-to-end reproduction: register a node 10 minutes ago in starting with a fresh heartbeat, call performReconciliation(), confirm promotion to ready, then blast 5 UpdateFromHeartbeat calls with status="starting" and confirm the agent stays ready.
  • TestStatusManager_UpdateAgentStatus_ActivePromotesStarting — simulate the health monitor marking the agent AgentStateActive and confirm lifecycle_status advances from starting → ready.
  • Existing reconciliation test was updated because it encoded the buggy behavior (asserted false for stuck-starting + fresh heartbeat past the grace period).
  • go test ./internal/services/ — all pass.
  • go test ./internal/handlers/ ./pkg/types/ — all pass.
  • Full go test ./... — the only failure is TestDevServiceRunDev in internal/core/services, which is a pre-existing environmental failure (agent port discovery times out in WSL); confirmed to fail identically on main with this change stashed.

Follow-up (not in this PR)

The Python SDK never transitions _current_status out of STARTING — it only goes STARTING → OFFLINE on shutdown (sdk/python/agentfield/agent.py:559, sdk/python/agentfield/agent.py:3489). The control-plane fix makes third-party SDK behavior irrelevant, but the SDK should still set _current_status = AgentStatus.READY once its FastAPI server is accepting requests. Worth a separate issue.

Agents that register and then send heartbeats indefinitely with
status="starting" (notably the Python SDK, whose _current_status is
initialized to STARTING and only ever transitions to OFFLINE on shutdown)
were left wedged in lifecycle_status="starting" forever:

- needsReconciliation() only fired for stuck-starting agents when their
  heartbeat was ALSO stale, which never happens for a healthy agent
  heartbeating every 2s.
- reconcileAgentStatus() only promoted empty/offline → ready; it preserved
  "starting" even when the heartbeat was fresh.
- The UpdateAgentStatus auto-sync also only promoted offline/empty → ready
  when state flipped to Active, so a successful HTTP health check couldn't
  pull an agent out of "starting" either.
- Every "starting" heartbeat from the SDK re-asserted lifecycle_status=
  "starting" via UpdateFromHeartbeat, clobbering any promotion.

This patch:

- Adds a reconciliation rule for agents stuck in "starting" past
  MaxTransitionTime since RegisteredAt with a FRESH heartbeat — the
  fresh heartbeat proves liveness, registration age proves startup is done.
- Promotes "starting" → "ready" in reconcileAgentStatus when the heartbeat
  is fresh.
- Promotes "starting" → "ready" in the UpdateAgentStatus auto-sync when
  state transitions to Active (e.g. successful HTTP health check).
- Guards UpdateFromHeartbeat so "starting" heartbeats don't regress an
  already-promoted "ready"/"degraded" agent.

Adds three tests covering the full scenario end-to-end: reconciliation
rescues the stuck node, repeated "starting" heartbeats do not regress it,
and health-check-driven Active state also promotes "starting" → "ready".

Fixes #484
@github-actions
Contributor

📊 Coverage gate

Thresholds from .coverage-gate.toml: per-surface ≥ 86%, aggregate ≥ 88%, max per-surface regression ≤ 1.0 pp, max aggregate regression ≤ 0.50 pp.

| Surface        | Current | Baseline | Δ             |
| -------------- | ------- | -------- | ------------- |
| control-plane  | 87.20%  | 87.30%   | ↓ -0.10 pp 🟡 |
| sdk-go         | 90.70%  | 90.70%   | → +0.00 pp 🟢 |
| sdk-python     | 93.63%  | 93.63%   | → +0.00 pp 🟢 |
| sdk-typescript | 92.63%  | 92.56%   | ↑ +0.07 pp 🟢 |
| web-ui         | 90.03%  | 90.01%   | ↑ +0.02 pp 🟢 |
| aggregate      | 88.99%  | 89.01%   | ↓ -0.02 pp 🟡 |

✅ Gate passed

No surface regressed past the allowed threshold and the aggregate stayed above the floor.

@github-actions
Contributor

📐 Patch coverage gate

Threshold: 80% on lines this PR touches vs origin/main (from .coverage-gate.toml:thresholds.min_patch).

| Surface        | Touched lines | Patch coverage | Status     |
| -------------- | ------------- | -------------- | ---------- |
| control-plane  | 28            | 100.00%        | ✅         |
| sdk-go         | 0             | ➖             | no changes |
| sdk-python     | 0             | ➖             | no changes |
| sdk-typescript | 0             | ➖             | no changes |
| web-ui         | 0             | ➖             | no changes |

✅ Patch gate passed

Every surface whose lines were touched by this PR has patch coverage at or above the threshold.

@AbirAbbas AbirAbbas merged commit 81a3ac1 into main Apr 20, 2026
22 checks passed

Development

Successfully merging this pull request may close these issues.

Nodes stuck at lifecycle_status: starting with recent heartbeats and registered capabilities
