Nodes stuck at lifecycle_status: starting with recent heartbeats and registered capabilities #484

@rtdean93

Description

Some long-running agent nodes never transition from lifecycle_status: "starting" to "ready" despite being fully operational. The issue persists across restarts of both the agent container and the control plane.

Observed Behavior

  • Node registers successfully with the control plane
  • Heartbeats are sent every 2s and accepted (HTTP 200)
  • /health endpoint returns {"status": "healthy"}
  • Capabilities (reasoners and/or skills) are registered and callable
  • Execution via /api/v1/execute/{node_id}.{capability} works correctly
  • But lifecycle_status remains "starting" and health_status remains "unknown" indefinitely

Affected Node Patterns

The issue affects two distinct node profiles:

  1. Skills-only nodes — 0 reasoners, 4+ skills. Never promoted.
  2. Nodes with reasoners — 1+ reasoners, 3+ skills. Also stuck at starting in some environments.

Other nodes with identical configurations (same SDK version, same registration pattern) transition to ready/active as expected. The issue appears environment-specific rather than code-specific.

Timeline

  • Nodes registered as early as February 2026 are still stuck at starting
  • Restarting the agent container does not resolve the issue
  • Restarting the AgentField control plane does not resolve it either (nodes re-register but remain at starting)

Environment

  • AgentField SDK: 0.1.68
  • Deployment: Docker containers (long_running)
  • Control plane: single instance, healthy

Example Node State

{
  "id": "example-agent",
  "health_status": "unknown",
  "lifecycle_status": "starting",
  "last_heartbeat": "2026-04-19T14:03:55.066758Z",
  "registered_at": "2026-02-19T11:27:00.102Z",
  "reasoners": [...],
  "skills": [...]
}

Note: last_heartbeat is always recent (within seconds), confirming the agent is actively communicating.

Expected Behavior

Nodes with recent heartbeats, registered capabilities, and healthy /health endpoints should transition to lifecycle_status: "ready" and health_status: "active".

Workaround

We've implemented a client-side workaround that treats nodes as healthy if they have a recent heartbeat (<120s) and registered capabilities, regardless of lifecycle_status. This works but bypasses the intended lifecycle model.
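
For anyone hitting the same problem, here is a minimal sketch of that check. It is not our exact code. It assumes the node record is a dict shaped like the example above; the function and constant names are made up for illustration and are not part of the AgentField SDK.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Threshold from the workaround described above (heartbeat newer than 120s).
MAX_HEARTBEAT_AGE = timedelta(seconds=120)


def parse_timestamp(value: str) -> datetime:
    """Parse an ISO 8601 timestamp such as "2026-04-19T14:03:55.066758Z"."""
    # datetime.fromisoformat only accepts a trailing "Z" on Python 3.11+,
    # so normalise it to an explicit UTC offset first.
    if value.endswith("Z"):
        value = value[:-1] + "+00:00"
    return datetime.fromisoformat(value)


def is_effectively_healthy(node: dict, now: Optional[datetime] = None) -> bool:
    """Hypothetical helper: ignore lifecycle_status and look at liveness only."""
    now = now or datetime.now(timezone.utc)

    heartbeat = node.get("last_heartbeat")
    if not heartbeat:
        return False

    # A node counts as healthy only if it has at least one reasoner or skill
    # and its last heartbeat is newer than the threshold.
    has_capabilities = bool(node.get("reasoners") or node.get("skills"))
    heartbeat_age = now - parse_timestamp(heartbeat)
    return has_capabilities and heartbeat_age < MAX_HEARTBEAT_AGE
```

The check deliberately never reads lifecycle_status or health_status, which is why it bypasses the lifecycle model: a node that is legitimately still starting but already heartbeating would also pass.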

Questions

  1. What triggers the starting → ready transition? Is there a handshake or confirmation step?
  2. Could this be related to stale state in the control plane for nodes registered long ago?
  3. Is there a recommended way to force-reconcile node lifecycle state?
