
fix(control-plane): rescue nodes stuck in lifecycle_status=starting #487

Merged
AbirAbbas merged 1 commit into main from fix/node-stuck-starting-lifecycle on Apr 20, 2026
Conversation

@AbirAbbas
Contributor

Summary

Fixes #484. Nodes that register with the control plane and then send heartbeats indefinitely with status="starting" (notably the Python SDK, whose _current_status is initialized to STARTING and only ever moves to OFFLINE on shutdown) were left wedged in lifecycle_status="starting" forever — despite fresh heartbeats, a healthy /health endpoint, and successful executions.

Four cooperating bugs in status_manager.go created this:

  1. needsReconciliation() only flagged stuck-starting agents whose heartbeat was ALSO stale. If the agent is healthy and heartbeating every 2s, that branch never fires.
  2. reconcileAgentStatus() only promoted empty/offline → ready; it preserved starting even when the heartbeat was fresh, so even if reconciliation did run the agent stayed stuck.
  3. The UpdateAgentStatus auto-sync had the same blind spot: when the health monitor marked an agent AgentStateActive, it only advanced lifecycle_status out of offline/empty, not out of starting.
  4. UpdateFromHeartbeat honored the SDK's regressive status="starting" signal, meaning any promotion would get clobbered by the next heartbeat.

Fix

  • Reconciliation detection — detect starting agents with a fresh heartbeat whose RegisteredAt is older than MaxTransitionTime (default 2m). Fresh heartbeat proves liveness; registration age proves startup is done.
  • Reconciliation action — promote starting → ready in reconcileAgentStatus when the heartbeat is fresh.
  • Health-check promotion — promote starting → ready in the UpdateAgentStatus auto-sync when state transitions to Active (e.g. after a successful HTTP /status check).
  • Regression guard — in UpdateFromHeartbeat, ignore a starting lifecycle signal if the agent is already ready/degraded. Liveness side-effects (LastSeen, state) are still applied — only the regressive lifecycle downgrade is dropped.

No schema or config changes. No behavior change for agents that already transition to ready explicitly (e.g. the Go SDK's markReady).

Test plan

New and updated tests, all using real SQLite storage + the real reconciliation flow:

  • TestStatusManager_Reconciliation_UsesConfiguredThreshold — extended to cover the "stuck starting + fresh heartbeat + past grace period" case that previously returned false from needsReconciliation.
  • TestStatusManager_StuckStartingIsReconciledToReady — end-to-end reproduction: register a node 10 minutes ago in starting with a fresh heartbeat, call performReconciliation(), confirm promotion to ready, then blast 5 UpdateFromHeartbeat calls with status="starting" and confirm the agent stays ready.
  • TestStatusManager_UpdateAgentStatus_ActivePromotesStarting — simulate the health monitor marking the agent AgentStateActive and confirm lifecycle_status advances from starting → ready.
  • Existing reconciliation test was updated because it encoded the buggy behavior (asserted false for stuck-starting + fresh heartbeat past the grace period).
  • go test ./internal/services/ — all pass.
  • go test ./internal/handlers/ ./pkg/types/ — all pass.
  • Full go test ./... — the only failure is TestDevServiceRunDev in internal/core/services, which is a pre-existing environmental failure (agent port discovery times out in WSL); confirmed to fail identically on main with this change stashed.

Follow-up (not in this PR)

The Python SDK never transitions _current_status out of STARTING — it only goes STARTING → OFFLINE on shutdown (sdk/python/agentfield/agent.py:559, sdk/python/agentfield/agent.py:3489). The control-plane fix makes third-party SDK behavior irrelevant, but the SDK should still set _current_status = AgentStatus.READY once its FastAPI server is accepting requests. Worth a separate issue.

Agents that register and then send heartbeats indefinitely with
status="starting" (notably the Python SDK, whose _current_status is
initialized to STARTING and only ever transitions to OFFLINE on shutdown)
were left wedged in lifecycle_status="starting" forever:

- needsReconciliation() only fired for stuck-starting agents when their
  heartbeat was ALSO stale, which never happens for a healthy agent
  heartbeating every 2s.
- reconcileAgentStatus() only promoted empty/offline → ready; it preserved
  "starting" even when the heartbeat was fresh.
- The UpdateAgentStatus auto-sync also only promoted offline/empty → ready
  when state flipped to Active, so a successful HTTP health check couldn't
  pull an agent out of "starting" either.
- Every "starting" heartbeat from the SDK re-asserted lifecycle_status=
  "starting" via UpdateFromHeartbeat, clobbering any promotion.

This patch:

- Adds a reconciliation rule for agents stuck in "starting" past
  MaxTransitionTime since RegisteredAt with a FRESH heartbeat — the
  fresh heartbeat proves liveness, registration age proves startup is done.
- Promotes "starting" → "ready" in reconcileAgentStatus when the heartbeat
  is fresh.
- Promotes "starting" → "ready" in the UpdateAgentStatus auto-sync when
  state transitions to Active (e.g. successful HTTP health check).
- Guards UpdateFromHeartbeat so "starting" heartbeats don't regress an
  already-promoted "ready"/"degraded" agent.

Adds three tests covering the full scenario end-to-end: reconciliation
rescues the stuck node, repeated "starting" heartbeats do not regress it,
and health-check-driven Active state also promotes "starting" → "ready".

Fixes #484
@github-actions
Contributor

📊 Coverage gate

Thresholds from .coverage-gate.toml: per-surface ≥ 86%, aggregate ≥ 88%, max per-surface regression ≤ 1.0 pp, max aggregate regression ≤ 0.50 pp.

| Surface        | Current | Baseline | Δ             |
| -------------- | ------- | -------- | ------------- |
| control-plane  | 87.20%  | 87.30%   | ↓ -0.10 pp 🟡 |
| sdk-go         | 90.70%  | 90.70%   | → +0.00 pp 🟢 |
| sdk-python     | 93.63%  | 93.63%   | → +0.00 pp 🟢 |
| sdk-typescript | 92.63%  | 92.56%   | ↑ +0.07 pp 🟢 |
| web-ui         | 90.03%  | 90.01%   | ↑ +0.02 pp 🟢 |
| aggregate      | 88.99%  | 89.01%   | ↓ -0.02 pp 🟡 |

✅ Gate passed

No surface regressed past the allowed threshold and the aggregate stayed above the floor.

@github-actions
Contributor

📐 Patch coverage gate

Threshold: 80% on lines this PR touches vs origin/main (from .coverage-gate.toml:thresholds.min_patch).

| Surface        | Touched lines | Patch coverage | Status     |
| -------------- | ------------- | -------------- | ---------- |
| control-plane  | 28            | 100.00%        | ✅         |
| sdk-go         | 0             | ➖             | no changes |
| sdk-python     | 0             | ➖             | no changes |
| sdk-typescript | 0             | ➖             | no changes |
| web-ui         | 0             | ➖             | no changes |

✅ Patch gate passed

Every surface whose lines were touched by this PR has patch coverage at or above the threshold.

@AbirAbbas AbirAbbas merged commit 81a3ac1 into main Apr 20, 2026
22 checks passed

Development

Successfully merging this pull request may close these issues.

Nodes stuck at lifecycle_status: starting with recent heartbeats and registered capabilities
