[Bug] Queue workers use stale cached flow code after flow is updated — jobs hang until visibility timeout

### What happened?

After updating a queue-bound flow's code (via UI "Save Changes" or PUT /api/flows/{id}),
queue workers **continue executing the old stale code** on every subsequent job.

The old code contained a `publish` step that called an unreachable internal host
(`http://runloop-engine:8080`). That call makes Node.js `http.request()` hang
indefinitely, because `req.setTimeout()` does NOT fire during TCP connection
establishment (only on idle connected sockets).

The queue's `visibility` timeout (300s) eventually kills the hung job,
producing the misleading error: `"Execution cancelled"`.

**Run Now (manual trigger)** always uses fresh code from DB → completes in < 10s ✅  
**Queue-triggered jobs** use cached old code → hang for exactly 5m 0s ❌

The DB confirmed the new code was saved correctly:
- `GET /runloop/rl/api/flows/{id}` returns the updated code
- `updatedAt: 2026-05-07T09:49:16.693Z` matches the last save

Disabling and re-enabling the queue (Workers=0 → Workers=1) does NOT fix it.
New worker goroutines still appear to load the flow from cache rather than from the DB.

### Steps to reproduce

1. Create a queue bound to a flow containing Node.js code that calls an
   unreachable host (e.g. `http://runloop-engine:8080`)
2. Enqueue a job → observe it hang for exactly `visibility` seconds → fails
   with "Execution cancelled"
3. Update the flow code via UI (SAVE CHANGES) to remove the broken call
4. Confirm via `GET /api/flows/{id}` that DB has the new code
5. Click **Run Now** on the flow → ✅ completes in < 10 seconds (uses new DB code)
6. Enqueue a new queue job → ❌ still hangs 5 minutes (still running old code)
7. Edit queue → disable (Enabled=false) → Save → re-enable → Save
8. Enqueue again → ❌ still hangs 5 minutes (cache survives worker restart)

**Only confirmed workaround:**
Create a brand-new flow via `POST /api/flows` with the same code,
then rebind with `PATCH /api/queues/{name}` using the new `flowId`.
The new `flowId` presumably has no cache entry, so workers load its code correctly from the DB ✓
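
The workaround can be scripted. The following is a minimal sketch, not an official client: it assumes JSON request and response bodies, that the `GET` response carries top-level `id` and `updatedAt` fields that should not be re-posted, that the `POST` response carries the new flow's `id`, and that no auth header is required. The helper name `rebindQueueToFreshFlow` and the `baseUrl` parameter are hypothetical.

```javascript
// Sketch of the workaround: clone the flow under a new id, then rebind the queue.
// Body shapes and the `id` field in the POST response are assumptions.
async function rebindQueueToFreshFlow(baseUrl, flowId, queueName) {
  const jsonHeaders = { 'Content-Type': 'application/json' };

  // 1. Read the current (already updated) flow definition.
  const current = await fetch(`${baseUrl}/api/flows/${flowId}`);
  if (!current.ok) throw new Error(`GET flow failed: ${current.status}`);
  // Drop server-managed fields (assumed names) before re-posting.
  const { id, updatedAt, ...definition } = await current.json();

  // 2. Create a brand-new flow with the same definition.
  const created = await fetch(`${baseUrl}/api/flows`, {
    method: 'POST',
    headers: jsonHeaders,
    body: JSON.stringify(definition),
  });
  if (!created.ok) throw new Error(`POST flow failed: ${created.status}`);
  const newFlow = await created.json();

  // 3. Point the queue at the new flow id.
  const queueUrl = `${baseUrl}/api/queues/${encodeURIComponent(queueName)}`;
  const patched = await fetch(queueUrl, {
    method: 'PATCH',
    headers: jsonHeaders,
    body: JSON.stringify({ flowId: newFlow.id }),
  });
  if (!patched.ok) throw new Error(`PATCH queue failed: ${patched.status}`);
  return newFlow.id;
}
```

Usage would look like `await rebindQueueToFreshFlow('http://localhost:3000', oldFlowId, 'pipeline-tasks')`, where the base URL is a placeholder for wherever the API is served. It relies on the global `fetch` available in Node.js 18 and later.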

### Version / commit

- RunLoop version: v0.1.0 BETA
- Engine: ONLINE (observed at community.oneweb.tech)
- Node.js runtime inside executor: v24.14.1
- Queue backend: PostgreSQL
- Concurrency: 1 | max_attempts: 3 | visibility: 300s

### How are you running RunLoop?

Local dev (npm run dev)

### Logs / errors

```shell

```

### Anything else?

## Summary

Queue workers cache the bound flow's code when they start up.  
After updating the flow code via the UI ("Save Changes") or the `PUT /api/flows/{id}` API,  
**queue workers continue executing the old (stale) code** until the engine process is restarted.

"Run Now" (manual trigger) always uses the latest code from DB — working correctly.  
Only **queue-triggered** executions are affected.

---

## Environment

- RunLoop version: v0.1.0 BETA
- Queue backend: PostgreSQL
- Affected queue: `pipeline-tasks`
- Flow: `Pipeline Executor` (`o6m9d5kp6yxgn0pqalg2qxhhx`)

---

## Steps to Reproduce

1. Create a queue bound to flow `FlowA`
2. `FlowA` has Node.js code that calls `http://internal-host:8080` (unreachable)
3. Enqueue a job → it hangs for exactly `visibility` seconds → fails with `"Execution cancelled"`
4. Update `FlowA` code via UI (Save Changes) to remove the broken call
5. Confirm via `GET /api/flows/{id}` that DB has new code (`updatedAt` is recent)
6. Click **"Run Now"** → ✅ completes in < 10 seconds (uses new code from DB)
7. Enqueue another queue job → ❌ still hangs for 5 minutes (still using old cached code)
8. Disable queue (set Enabled=false), re-enable (Enabled=true) → ❌ still same behavior
9. Only fix found: create a **brand new flow** with `POST /api/flows`, then `PATCH /api/queues/{name}` with new `flowId`

---

## Expected Behavior

When flow code is updated (via UI or API), queue workers should use the new code  
on the **next job pickup** — without requiring engine restart or queue recreation.

The `PATCH /api/queues/{name}` API docs state:  
> *"Changes apply on next worker pickup"*  

This should also apply to flow code changes, not just config parameters.

---

## Actual Behavior

Queue workers continue running **stale cached code** indefinitely.  
The only workarounds are:
- Restart the RunLoop engine process
- Create a new flow + rebind queue via `PATCH /api/queues/{name}` with new `flowId`

---

## Root Cause Analysis

The RunLoop engine appears to load the flow definition (including Node.js code)  
into the worker goroutine's memory at **queue startup** or **first job pickup**,  
then caches it for all subsequent jobs.

Evidence:
- `GET /runloop/rl/api/flows/{id}` confirms DB has correct code
- `updatedAt: 2026-05-07T09:49:16.693Z` on flow matches our last save
- "Run Now" executions use fresh code (no caching)
- Queue executions use old code (with stale `publish` call to unreachable host)
- Last error on a job after disabling the queue: `"load flow: context cancelled"`. This suggests the worker loads the flow at pickup time and that the context was cancelled by the disable.
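
One pattern that would be consistent with all of the evidence above is a process-wide cache keyed by flow id that is filled on first use and never invalidated. That would explain why restarting workers does not help while a new `flowId` does. The snippet below only illustrates that hypothesis. It is not RunLoop's code, and every name in it (`db`, `flowCache`, `loadFresh`, `loadCached`) is made up.

```javascript
// Illustration only: NOT RunLoop's actual code.
// `db` stands in for the flows table, `flowCache` for an in-process cache.
const db = new Map([['flow-a', { code: 'old code', updatedAt: 1 }]]);
const flowCache = new Map();

// "Run Now" path: always reads the DB.
function loadFresh(flowId) {
  return db.get(flowId);
}

// Queue path: returns whatever was cached at the first pickup.
function loadCached(flowId) {
  if (!flowCache.has(flowId)) flowCache.set(flowId, db.get(flowId));
  return flowCache.get(flowId);
}

loadCached('flow-a');                                  // first pickup fills the cache
db.set('flow-a', { code: 'new code', updatedAt: 2 });  // "Save Changes"

console.log(loadFresh('flow-a').code);   // new code
console.log(loadCached('flow-a').code);  // old code (stale)

db.set('flow-b', { code: 'new code', updatedAt: 2 });  // brand-new flow id
console.log(loadCached('flow-b').code);  // new code (no cache entry yet)
```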

---

## Impact

- All jobs on the affected queue fail at exactly `visibility` seconds (300s default)
- The error appears as `"Execution cancelled"`, which is misleading: the actual cause is the stale code hanging
- The system appears to work (no startup error) but silently degrades

---

## Suggested Fix

Option 1 (Preferred): **Hot-reload on flow update**  
When `PUT /api/flows/{id}` or `PATCH /api/queues/{name}` is called,  
signal active workers to reload the flow definition from DB on next job pickup.

Option 2: **Force-reload button**  
Add a "Reload Workers" button on the Queue detail page that signals workers to reload flow code.

Option 3 (Workaround, already works):  
`PATCH /api/queues/{name}` with a new `flowId` — forces workers to use updated definition.  
Document this as the official workaround until hot-reload is implemented.
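
Option 1 could also be approximated without explicit signalling by revalidating the cached flow on every job pickup. The sketch below is written in JavaScript to match the snippet in this issue. The engine itself appears to be written in Go, so this shows the idea only. The `FlowCache` class and the `store` interface (`getUpdatedAt`, `getFlow`) are hypothetical.

```javascript
// Hypothetical sketch: a flow cache that revalidates on every pickup by
// comparing the cached `updatedAt` with the value currently in the DB.
class FlowCache {
  constructor(store) {
    // Assumed interface: store.getUpdatedAt(flowId) and store.getFlow(flowId),
    // both returning promises.
    this.store = store;
    this.entries = new Map();
  }

  async get(flowId) {
    // Cheap freshness check before each job.
    const updatedAt = await this.store.getUpdatedAt(flowId);
    const cached = this.entries.get(flowId);
    if (cached && cached.updatedAt === updatedAt) {
      return cached.flow; // still current, skip the full load
    }
    // Missing or stale: reload the full definition and remember its version.
    const flow = await this.store.getFlow(flowId);
    this.entries.set(flowId, { updatedAt: flow.updatedAt, flow });
    return flow;
  }
}
```

The trade-off is one small query per pickup in exchange for never running code older than the last save.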

---

## Additional: req.setTimeout() Warning

The affected flow used Node.js `req.setTimeout(30000, ...)` expecting it to protect against hangs.  
However, `req.setTimeout()` is a **socket idle timeout** — it does NOT fire during TCP connection establishment.

If a target host drops TCP SYN packets (unreachable, firewall, wrong hostname),  
`req.setTimeout()` will never fire, causing an infinite hang.
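
A wall-clock deadline is one way to bound the whole request regardless of which phase stalls (DNS, TCP connect, or waiting for the response). The following is a minimal sketch. `requestWithDeadline` is a hypothetical helper, not part of Node.js or RunLoop.

```javascript
const http = require('node:http');

// Hypothetical helper: rejects if the request has not completed within
// deadlineMs, no matter whether it is stuck connecting or waiting for data.
function requestWithDeadline(url, deadlineMs) {
  return new Promise((resolve, reject) => {
    const req = http.request(url, (res) => {
      const chunks = [];
      res.on('data', (chunk) => chunks.push(chunk));
      res.on('error', reject);
      res.on('end', () => {
        clearTimeout(timer);
        resolve({ status: res.statusCode, body: Buffer.concat(chunks).toString() });
      });
    });

    // Plain timer, independent of socket activity.
    const timer = setTimeout(() => {
      const err = new Error(`Deadline of ${deadlineMs} ms exceeded`);
      req.destroy(err); // tears down the socket, even mid-connect
      reject(err);
    }, deadlineMs);

    req.on('error', (err) => {
      clearTimeout(timer);
      reject(err);
    });
    req.end();
  });
}
```

For example, `await requestWithDeadline('http://runloop-engine:8080', 10000)` would reject after 10 seconds instead of waiting for the queue's `visibility` timeout. Node's `http.request()` also accepts a `timeout` option that, per the Node.js documentation, is set before the socket is connected. The request still has to be destroyed manually in the `'timeout'` handler.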

**Recommendation**: Add a documentation note or example in the Node.js executor  
showing proper TCP connect timeout using the `socket` event:

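```javascript
// `req` is the http.ClientRequest; `connectTimeoutMs` is defined by the caller.
req.on('socket', function (socket) {
  // Arm the timer as soon as the socket is assigned, before the TCP handshake completes.
  socket.setTimeout(connectTimeoutMs);
  socket.on('timeout', function () {
    req.destroy(new Error('Connect timeout'));
  });
});
```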
