Bug Report
Versions
- effect: 3.21.2
- @effect/cluster: 0.58.2
- @effect/workflow: 0.18.1
- @effect/platform-node: 0.106.0
- Runtime: Node.js
- Storage: PostgreSQL
- Cluster layer: NodeClusterSocket.layer({ storage: "sql" })
Description
We observed a production incident where SQL-backed workflow consumption stopped after a worker runner transition.
The worker processes stayed alive and runners continued to appear healthy, but rows in cluster_messages stopped being consumed. Existing workflow messages remained unprocessed and last_read stopped advancing.
The relevant log emitted by @effect/cluster was:
Could not find entity manager for address, retrying
The affected entity type was a workflow entity. Entity names and IDs are omitted because they come from a private application.
Observed Behavior
After a rolling deployment / scale-out event:
- New runners registered successfully.
- Runners continued to be reported as healthy.
- The runner socket started listening.
- cluster_messages contained unprocessed due messages.
- cluster_messages.last_read stopped advancing.
- Worker-level health checks stayed green because the process was alive.
- No workflow handlers were invoked after the stall.
This created a state where the cluster looked healthy externally, but SQL message consumption was effectively dead.
Expected Behavior
A runner should not remain healthy while its SQL storage read loop is no longer making progress.
Also, if the storage read loop sees a message for an entity type whose entity manager is not registered yet, it should not permanently reserve or stall that message.
Any of the following alternatives would be acceptable:
- do not read/reserve messages until the relevant entity type is registered;
- leave last_read untouched for messages whose entity manager does not exist yet;
- fail the runner if it cannot recover from this state;
- expose a cluster-level health signal when SQL message consumption is stalled.
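The last alternative could be approximated on the application side until the cluster exposes such a signal. The following is a minimal sketch, not part of @effect/cluster: DrainSnapshot, assessDrainHealth, and the maxIdleMs threshold are hypothetical names, and the snapshot is assumed to be filled by the application's own queries against cluster_messages.

```typescript
// Hypothetical snapshot built by the application from cluster_messages.
// None of these names come from @effect/cluster.
interface DrainSnapshot {
  /** Epoch ms at which the oldest unprocessed message became due, or null if nothing is due. */
  readonly oldestDueAtMs: number | null
  /** Most recent last_read value in epoch ms, or null if nothing was ever read. */
  readonly latestLastReadMs: number | null
}

type DrainHealth = "healthy" | "stalled"

// Consumption is treated as stalled when due work has been waiting longer than
// maxIdleMs and last_read has not advanced within that same window.
function assessDrainHealth(
  snapshot: DrainSnapshot,
  nowMs: number,
  maxIdleMs: number
): DrainHealth {
  if (snapshot.oldestDueAtMs === null) return "healthy"
  const waitingMs = nowMs - snapshot.oldestDueAtMs
  const idleMs =
    snapshot.latestLastReadMs === null
      ? Number.POSITIVE_INFINITY
      : nowMs - snapshot.latestLastReadMs
  return waitingMs > maxIdleMs && idleMs > maxIdleMs ? "stalled" : "healthy"
}
```

A worker-level health check could call assessDrainHealth on each probe and report unhealthy on "stalled", so that the orchestrator restarts the process instead of leaving it alive but idle.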
Why This Looks Like a Cluster Lifecycle Issue
The cluster runner can start listening and begin SQL storage polling before all workflow entities are registered.
With existing backlog during a runner transition, the storage read loop can encounter messages for workflow entity types before their entity managers are available.
The log comes from Sharding.ts:
Could not find entity manager for address, retrying
After that, the runner can remain healthy while message consumption no longer progresses.
Related Issue
This looks related to:
That issue also describes SQL-backed cluster runners that are registered/healthy but non-functional after runner transitions.
Suggested Fix Direction
The SQL storage read loop should avoid making durable progress on messages that cannot be dispatched to an entity manager.
Potential fix directions:
- delay SQL polling until entity registration has completed;
- do not update last_read for messages whose entityType is not registered;
- synchronously release/reset messages that were read before an entity manager existed;
- reopen the storage read latch when registerEntity completes;
- expose cluster storage-drain health so applications can fail fast.
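The second direction can be illustrated with a simplified model. This is a sketch under assumptions, not the actual read loop in @effect/cluster: PendingMessage, ReadPlan, and planRead are hypothetical, and a real row in cluster_messages carries more fields than shown here.

```typescript
// Simplified stand-in for a row read from cluster_messages (assumed shape).
interface PendingMessage {
  readonly id: string
  readonly entityType: string
}

interface ReadPlan {
  /** Messages whose entity type is registered; safe to reserve and dispatch. */
  readonly dispatch: ReadonlyArray<PendingMessage>
  /** Messages left untouched (no last_read update) so a later poll retries them. */
  readonly defer: ReadonlyArray<PendingMessage>
}

// Split a polled batch by whether an entity manager exists for each message.
function planRead(
  batch: ReadonlyArray<PendingMessage>,
  registeredEntityTypes: ReadonlySet<string>
): ReadPlan {
  const dispatch: PendingMessage[] = []
  const defer: PendingMessage[] = []
  for (const message of batch) {
    if (registeredEntityTypes.has(message.entityType)) {
      dispatch.push(message)
    } else {
      defer.push(message)
    }
  }
  return { dispatch, defer }
}
```

Only the dispatch half would have last_read updated; the defer half stays eligible for the next poll, which is what keeps a late registerEntity call from leaving messages stranded.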
Suggested Regression Test
A regression test could:
- Seed SQL storage with an unprocessed message for a clustered entity.
- Start a runner with SQL storage.
- Delay registration of the entity manager.
- Assert that the message is not left stuck with stale last_read.
- Register the entity.
- Assert that the message is eventually consumed.
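The steps above can be sketched against an in-memory stand-in. Everything here is hypothetical scaffolding (StoredMessage, FakeRunner, runScenario, and the "ExampleWorkflow" entity type); a real regression test would run the actual runner against SQL storage, and this model only shows the assertions such a test would make.

```typescript
// In-memory stand-in for one cluster_messages row (assumed shape).
interface StoredMessage {
  readonly id: string
  readonly entityType: string
  lastRead: number | null
  processed: boolean
}

// Hypothetical runner model showing the desired behavior: a message is only
// reserved (lastRead set) once a handler for its entity type is registered.
class FakeRunner {
  private readonly handlers = new Map<string, (id: string) => void>()
  private readonly store: StoredMessage[]

  constructor(store: StoredMessage[]) {
    this.store = store
  }

  registerEntity(entityType: string, handler: (id: string) => void): void {
    this.handlers.set(entityType, handler)
  }

  // One polling pass over the store.
  poll(nowMs: number): void {
    for (const message of this.store) {
      if (message.processed) continue
      const handler = this.handlers.get(message.entityType)
      if (handler === undefined) continue // leave lastRead untouched
      message.lastRead = nowMs
      handler(message.id)
      message.processed = true
    }
  }
}

function runScenario() {
  // Seed storage with an unprocessed message.
  const message: StoredMessage = {
    id: "m1",
    entityType: "ExampleWorkflow",
    lastRead: null,
    processed: false
  }
  const runner = new FakeRunner([message])
  const handled: string[] = []

  // Poll before the entity is registered: the message must not be reserved.
  runner.poll(1000)
  const lastReadBeforeRegistration = message.lastRead

  // Register late, then poll again: the message must be consumed.
  runner.registerEntity("ExampleWorkflow", (id) => handled.push(id))
  runner.poll(2000)

  return { lastReadBeforeRegistration, handled, processed: message.processed }
}
```

With the behavior described in this report, the first assertion (lastReadBeforeRegistration staying null) or the last one (the message being consumed after registration) would be expected to fail.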