[Bug] Queue workers use stale cached flow code after flow is updated — jobs hang until visibility timeout #13

@suphakonb

What happened?

After updating a queue-bound flow's code (via UI "Save Changes" or PUT /api/flows/{id}),
queue workers continue executing the old stale code on every subsequent job.

The old code contained a publish step that called an unreachable internal host
(http://runloop-engine:8080). This causes Node.js http.request() to hang
indefinitely — because req.setTimeout() does NOT fire during TCP connection
establishment (only on idle connected sockets).

The queue's visibility timeout (300s) eventually kills the hung job,
producing the misleading error: "Execution cancelled".

Run Now (manual trigger) always uses fresh code from DB → completes in < 10s ✅
Queue-triggered jobs use cached old code → hang for exactly 5m 0s ❌

The new code was saved to the DB correctly, as the API confirms:

  • GET /runloop/rl/api/flows/{id} returns the updated code
  • updatedAt: 2026-05-07T09:49:16.693Z matches the last save

Disabling and re-enabling the queue (Workers=0 → Workers=1) does NOT fix it.
New goroutines still load from cache, not fresh from DB.

Steps to reproduce

  1. Create a queue bound to a flow containing Node.js code that calls an
    unreachable host (e.g. http://runloop-engine:8080)
  2. Enqueue a job → observe it hang for exactly the visibility timeout (300s), then fail
    with "Execution cancelled"
  3. Update the flow code via UI (SAVE CHANGES) to remove the broken call
  4. Confirm via GET /api/flows/{id} that DB has the new code
  5. Click Run Now on the flow → ✅ completes in < 10 seconds (uses new DB code)
  6. Enqueue a new queue job → ❌ still hangs 5 minutes (still running old code)
  7. Edit queue → disable (Enabled=false) → Save → re-enable → Save
  8. Enqueue again → ❌ still hangs 5 minutes (cache survives worker restart)

Only confirmed workaround:
Create a brand-new flow via POST /api/flows with the same code,
then rebind with PATCH /api/queues/{name} using the new flowId.
Fresh workers have no cache → load code correctly from DB ✓

Version / commit

  • RunLoop version: v0.1.0 BETA
  • Engine: ONLINE (observed at community.oneweb.tech)
  • Node.js runtime inside executor: v24.14.1
  • Queue backend: PostgreSQL
  • Concurrency: 1 | max_attempts: 3 | visibility: 300s

How are you running RunLoop?

Local dev (npm run dev)

Logs / errors

Anything else?

Summary

Queue workers cache the bound flow's code when they start up.
After updating the flow code via the UI ("Save Changes") or the PUT /api/flows/{id} API,
queue workers continue executing the old (stale) code until the engine process is restarted.

"Run Now" (manual trigger) always uses the latest code from DB — working correctly.
Only queue-triggered executions are affected.


Environment

  • RunLoop version: v0.1.0 BETA
  • Queue backend: PostgreSQL
  • Affected queue: pipeline-tasks
  • Flow: Pipeline Executor (o6m9d5kp6yxgn0pqalg2qxhhx)

Steps to Reproduce

  1. Create a queue bound to flow FlowA
  2. FlowA has Node.js code that calls http://internal-host:8080 (unreachable)
  3. Enqueue a job → it hangs for exactly the visibility timeout (300s) → fails with "Execution cancelled"
  4. Update FlowA code via UI (Save Changes) to remove the broken call
  5. Confirm via GET /api/flows/{id} that DB has new code (updatedAt is recent)
  6. Click "Run Now" → ✅ completes in < 10 seconds (uses new code from DB)
  7. Enqueue another queue job → ❌ still hangs for 5 minutes (still using old cached code)
  8. Disable queue (set Enabled=false), re-enable (Enabled=true) → ❌ still same behavior
  9. Only fix found: create a brand new flow with POST /api/flows, then PATCH /api/queues/{name} with new flowId

Expected Behavior

When flow code is updated (via UI or API), queue workers should use the new code
on the next job pickup — without requiring engine restart or queue recreation.

The PATCH /api/queues/{name} API docs state:

"Changes apply on next worker pickup"

This should also apply to flow code changes, not just config parameters.
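
One way to get this behavior while keeping a cache is to revalidate the cached flow against its `updatedAt` on every job pickup. The sketch below only illustrates that idea. It is written in JavaScript to match the snippet later in this report, it is not the engine's actual worker code, and `FlowCache`, `loadMeta`, and `loadFlow` are hypothetical names.

```javascript
// Hypothetical sketch: a flow cache that revalidates on every job pickup.
// Assumptions: loadMeta(flowId) is a cheap lookup returning { updatedAt };
// loadFlow(flowId) returns the full definition { updatedAt, code }.
class FlowCache {
  constructor({ loadMeta, loadFlow }) {
    this.loadMeta = loadMeta;
    this.loadFlow = loadFlow;
    this.entries = new Map(); // flowId -> { updatedAt, code }
  }

  // Call once per job pickup, before executing the flow.
  getForPickup(flowId) {
    const { updatedAt } = this.loadMeta(flowId);
    const cached = this.entries.get(flowId);
    if (cached && cached.updatedAt === updatedAt) {
      return cached; // still current, skip the full load
    }
    const fresh = this.loadFlow(flowId); // stale or missing: reload
    this.entries.set(flowId, fresh);
    return fresh;
  }

  // Optional push path: could be called from the PUT /api/flows/{id} handler.
  invalidate(flowId) {
    this.entries.delete(flowId);
  }
}
```

With this shape, a save that bumps `updatedAt` is picked up by the very next job, and an unchanged flow costs only the metadata lookup.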


Actual Behavior

Queue workers continue running stale cached code indefinitely.
The only workarounds are:

  • Restart the RunLoop engine process
  • Create a new flow + rebind queue via PATCH /api/queues/{name} with new flowId

Root Cause Analysis

The RunLoop engine appears to load the flow definition (including Node.js code)
into the worker goroutine's memory at queue startup or first job pickup,
then caches it for all subsequent jobs.

Evidence:

  • GET /runloop/rl/api/flows/{id} confirms DB has correct code
  • updatedAt: 2026-05-07T09:49:16.693Z on flow matches our last save
  • "Run Now" executions use fresh code (no caching)
  • Queue executions use old code (with stale publish call to unreachable host)
  • Last error on job after queue disable: "load flow: context cancelled" — confirms worker loads flow code at pickup time, but context was cancelled due to disable

Impact

  • All queue jobs fail after exactly the visibility timeout (300s default)
  • Error appears as "Execution cancelled" — misleading, actual cause is stale code hang
  • System appears to work (no startup error) but silently degrades

Suggested Fix

Option 1 (Preferred): Hot-reload on flow update
When PUT /api/flows/{id} or PATCH /api/queues/{name} is called,
signal active workers to reload the flow definition from DB on next job pickup.

Option 2: Force-reload button
Add a "Reload Workers" button on the Queue detail page that signals workers to reload flow code.

Option 3 (Workaround, already works):
PATCH /api/queues/{name} with a new flowId — forces workers to use updated definition.
Document this as the official workaround until hot-reload is implemented.
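
A sketch of how this workaround could be scripted until hot-reload exists. The two endpoints are the ones named in this issue. Everything else is an assumption for illustration: the `baseUrl`, the JSON payload fields, the `id` field in the create response, and the helper name `rebindQueueToFreshFlow`. `fetchImpl` is injectable so the sequence can be tried against a stub before pointing it at a real instance.

```javascript
// Hypothetical sketch of the workaround: clone the flow, then rebind the queue.
// Payload and response field names are assumptions, not confirmed RunLoop API.
async function rebindQueueToFreshFlow({ baseUrl, queueName, flow, fetchImpl = fetch }) {
  const headers = { 'Content-Type': 'application/json' };

  // 1. Create a brand-new flow carrying the corrected code.
  const created = await fetchImpl(`${baseUrl}/api/flows`, {
    method: 'POST',
    headers,
    body: JSON.stringify(flow),
  });
  if (!created.ok) throw new Error(`create flow failed: HTTP ${created.status}`);
  const { id: flowId } = await created.json(); // assumes the response includes `id`

  // 2. Point the queue at the new flow.
  const queueUrl = `${baseUrl}/api/queues/${encodeURIComponent(queueName)}`;
  const patched = await fetchImpl(queueUrl, {
    method: 'PATCH',
    headers,
    body: JSON.stringify({ flowId }),
  });
  if (!patched.ok) throw new Error(`rebind queue failed: HTTP ${patched.status}`);

  return flowId;
}
```

Authentication is omitted here; add whatever headers the deployment requires.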


Additional: req.setTimeout() Warning

The affected flow used Node.js req.setTimeout(30000, ...) expecting it to protect against hangs.
However, req.setTimeout() is a socket idle timeout — it does NOT fire during TCP connection establishment.

If a target host drops TCP SYN packets (unreachable, firewall, wrong hostname),
req.setTimeout() will never fire, causing an infinite hang.

Recommendation: Add a documentation note or example in the Node.js executor
showing proper TCP connect timeout using the socket event:

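// connectTimeoutMs: caller-chosen limit in milliseconds (e.g. 5000)
req.on('socket', function(socket) {
  // Arm the timer when the socket is assigned, i.e. before it connects
  socket.setTimeout(connectTimeoutMs);
  socket.on('timeout', function() { req.destroy(new Error('Connect timeout')); });
});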
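
A slightly fuller sketch of the same idea, offered as one possible shape for such a documentation example rather than a definitive implementation. `armConnectTimeout` and `publish` are hypothetical names, and the request path and method in `publish` are assumptions. The helper uses a plain timer that is cleared as soon as the socket connects, so it covers only connection establishment and leaves idle handling to `req.setTimeout()`.

```javascript
const http = require('node:http');

// Hypothetical helper: abort `req` if its socket has not finished connecting
// within connectTimeoutMs. Idle timeouts after connect stay with req.setTimeout().
function armConnectTimeout(req, connectTimeoutMs) {
  req.on('socket', (socket) => {
    // A reused keep-alive socket is already connected; nothing to guard.
    if (!socket.connecting) return;

    const timer = setTimeout(() => {
      req.destroy(new Error(`Connect timeout after ${connectTimeoutMs} ms`));
    }, connectTimeoutMs);

    // Stop the timer once the connection succeeds or the socket closes.
    socket.once('connect', () => clearTimeout(timer));
    socket.once('close', () => clearTimeout(timer));
  });
  return req;
}

// Hypothetical usage against the unreachable host from this report.
// The path and method are assumptions for illustration.
function publish(body, onDone) {
  const req = armConnectTimeout(
    http.request({ host: 'runloop-engine', port: 8080, method: 'POST', path: '/' }),
    5000
  );
  req.setTimeout(30000, () => req.destroy(new Error('Idle timeout')));
  req.on('error', (err) => onDone(err));
  req.on('response', (res) => {
    res.resume(); // drain the body so 'end' fires
    res.on('end', () => onDone(null, res.statusCode));
  });
  req.end(body);
}
```

Because the timer is cleared on `connect`, a slow but reachable host is not cut off by the connect limit, while a host that never answers produces an explicit error well before the queue's visibility timeout.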

Labels: bug
