Description
Some long-running agent nodes never transition from lifecycle_status: "starting" to "ready" despite being fully operational. The issue persists across restarts of both the agent container and the control plane.
Observed Behavior
- Node registers successfully with the control plane
- Heartbeats are sent every 2s and accepted (HTTP 200)
- /health endpoint returns {"status": "healthy"}
- Capabilities (reasoners and/or skills) are registered and callable
- Execution via /api/v1/execute/{node_id}.{capability} works correctly
- But lifecycle_status remains "starting" and health_status remains "unknown" indefinitely
Affected Node Patterns
The issue affects two distinct node profiles:
- Skills-only nodes: 0 reasoners, 4+ skills. Never promoted.
- Nodes with reasoners: 1+ reasoners, 3+ skills. Also stuck at starting in some environments.
Other nodes with identical configurations (same SDK version, same registration pattern) transition to ready/active as expected. The issue appears environment-specific rather than code-specific.
Timeline
- Nodes registered as early as February 2026 are still stuck at starting
- Restarting the agent container does not resolve the issue
- Restarting the AgentField control plane does not resolve it either (nodes re-register but remain at starting)
Environment
- AgentField SDK: 0.1.68
- Deployment: Docker containers (long_running)
- Control plane: single instance, healthy
Example Node State
{
  "id": "example-agent",
  "health_status": "unknown",
  "lifecycle_status": "starting",
  "last_heartbeat": "2026-04-19T14:03:55.066758Z",
  "registered_at": "2026-02-19T11:27:00.102Z",
  "reasoners": [...],
  "skills": [...]
}
Note: last_heartbeat is always recent (within seconds), confirming the agent is actively communicating.
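For anyone trying to confirm the same symptom, here is a rough sketch of how we flag stuck nodes from node state objects shaped like the example above. The function name and both thresholds (10 minutes since registration, 30 seconds heartbeat age) are our own assumptions and are not part of the AgentField SDK.

```python
from datetime import datetime, timedelta, timezone


def _parse_ts(value):
    # Normalise the trailing "Z" so fromisoformat also works on older Python versions.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def find_stuck_nodes(nodes, now=None,
                     min_registered_age=timedelta(minutes=10),
                     max_heartbeat_age=timedelta(seconds=30)):
    """Return ids of nodes that still report "starting" even though they
    registered long ago and are heartbeating right now.

    Both thresholds are illustrative assumptions, not SDK defaults.
    """
    now = now or datetime.now(timezone.utc)
    stuck = []
    for node in nodes:
        if node.get("lifecycle_status") != "starting":
            continue
        registered_age = now - _parse_ts(node["registered_at"])
        heartbeat_age = now - _parse_ts(node["last_heartbeat"])
        if registered_age >= min_registered_age and heartbeat_age <= max_heartbeat_age:
            stuck.append(node["id"])
    return stuck
```

Feeding it the example node with a current timestamp a few seconds after last_heartbeat reports example-agent as stuck, which matches what we see in the control plane.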
Expected Behavior
Nodes with recent heartbeats, registered capabilities, and healthy /health endpoints should transition to lifecycle_status: "ready" and health_status: "active".
Workaround
We've implemented a client-side workaround that treats nodes as healthy if they have a recent heartbeat (<120s) and registered capabilities, regardless of lifecycle_status. This works but bypasses the intended lifecycle model.
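A minimal sketch of that workaround, for reference. It only relies on the fields shown in the example node state; the function and constant names are ours and hypothetical, not part of the SDK.

```python
from datetime import datetime, timedelta, timezone

# Heartbeat freshness window used by the workaround (<120s).
MAX_HEARTBEAT_AGE = timedelta(seconds=120)


def parse_timestamp(value):
    # Normalise the trailing "Z" so fromisoformat also works on older Python versions.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def is_effectively_healthy(node, now=None):
    """Treat a node as healthy if it has a recent heartbeat and at least one
    registered capability, ignoring lifecycle_status entirely."""
    now = now or datetime.now(timezone.utc)
    heartbeat = node.get("last_heartbeat")
    if not heartbeat:
        return False
    heartbeat_age = now - parse_timestamp(heartbeat)
    has_capabilities = bool(node.get("reasoners")) or bool(node.get("skills"))
    return heartbeat_age < MAX_HEARTBEAT_AGE and has_capabilities
```

Because the check never reads lifecycle_status, it would also report a node as healthy while it is legitimately still starting, which is why we would prefer a fix on the control-plane side.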
Questions
- What triggers the starting → ready transition? Is there a handshake or confirmation step?
- Could this be related to stale state in the control plane for nodes registered long ago?
- Is there a recommended way to force-reconcile node lifecycle state?