Nodes stuck at lifecycle_status: starting with recent heartbeats and registered capabilities

## Description

Some long-running agent nodes never transition from `lifecycle_status: "starting"` to `"ready"` despite being fully operational. The issue persists across restarts of both the agent container and the control plane.

## Observed Behavior

- Node registers successfully with the control plane
- Heartbeats are sent every 2s and accepted (HTTP 200)
- `/health` endpoint returns `{"status": "healthy"}`
- Capabilities (reasoners and/or skills) are registered and callable
- Execution via `/api/v1/execute/{node_id}.{capability}` works correctly
- **But** `lifecycle_status` remains `"starting"` and `health_status` remains `"unknown"` indefinitely

## Affected Node Patterns

The issue affects two distinct node profiles:

1. **Skills-only nodes** — 0 reasoners, 4+ skills. Never promoted.
2. **Nodes with reasoners** — 1+ reasoners, 3+ skills. Also stuck at `starting` in some environments.

Other nodes with identical configurations (same SDK version, same registration pattern) transition to `ready/active` as expected. The issue appears environment-specific rather than code-specific.

## Timeline

- Nodes registered as early as **February 2026** are still stuck at `starting`
- Restarting the agent container does not resolve
- Restarting the AgentField control plane does not resolve (nodes re-register but remain at `starting`)

## Environment

- AgentField SDK: `0.1.68`
- Deployment: Docker containers (long_running)
- Control plane: single instance, healthy

## Example Node State

```json
{
  "id": "example-agent",
  "health_status": "unknown",
  "lifecycle_status": "starting",
  "last_heartbeat": "2026-04-19T14:03:55.066758Z",
  "registered_at": "2026-02-19T11:27:00.102Z",
  "reasoners": [...],
  "skills": [...]
}
```

Note: `last_heartbeat` is always recent (within seconds), confirming the agent is actively communicating.

## Expected Behavior

Nodes with recent heartbeats, registered capabilities, and healthy `/health` endpoints should transition to `lifecycle_status: "ready"` and `health_status: "active"`.

## Workaround

We've implemented a client-side workaround that treats nodes as healthy if they have a recent heartbeat (<120s) and registered capabilities, regardless of `lifecycle_status`. This works but bypasses the intended lifecycle model.

## Questions

1. What triggers the `starting` → `ready` transition? Is there a handshake or confirmation step?
2. Could this be related to stale state in the control plane for nodes registered long ago?
3. Is there a recommended way to force-reconcile node lifecycle state?

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Nodes stuck at lifecycle_status: starting with recent heartbeats and registered capabilities #484

Description

Observed Behavior

Affected Node Patterns

Timeline

Environment

Example Node State

Expected Behavior

Workaround

Questions

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Nodes stuck at lifecycle_status: starting with recent heartbeats and registered capabilities #484

Description

Description

Observed Behavior

Affected Node Patterns

Timeline

Environment

Example Node State

Expected Behavior

Workaround

Questions

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions