Skip to content

Fix 10 delegation/worker bugs in mini-a-subtask.js and mini-a-worker.yaml#131

Merged
nmaguiar merged 2 commits intomainfrom
copilot/fix-delegation-worker-functionality
Mar 31, 2026
Merged

Fix 10 delegation/worker bugs in mini-a-subtask.js and mini-a-worker.yaml#131
nmaguiar merged 2 commits intomainfrom
copilot/fix-delegation-worker-functionality

Conversation

Copy link
Copy Markdown
Contributor

Copilot AI commented Mar 30, 2026

A set of correctness, thread-safety, and resource-leak bugs in the delegation and worker subsystems. Fixes range from a broken timeout accumulation to a thread-unsafe global variable swap to CIDR notation that was documented but never actually parsed.

mini-a-subtask.js

  • waitForAll timeout (#1): timeoutMs - (results.length > 0 ? 0 : 0) always subtracted zero — each waitFor received the full budget, allowing up to N × timeoutMs total block. Now tracks startAll and passes Math.max(1, timeoutMs - elapsed) each iteration.
  • workerMaxFailures (#4): _recordWorkerFailure used defaultMaxAttempts (default 2) as the per-worker death threshold — one transient error on two tasks killed the worker permanently. Added dedicated workerMaxFailures option (default 5).
  • _failOrRetrySubtask blanket kill (#5): Replaced _markWorkerDead with _recordWorkerFailure so logic failures (bad goal, LLM refusal) don't immediately evict the worker; threshold from #4 now applies.
  • _processQueue inconsistency (#6): Item was only removed via shift() in the catch branch; success relied on start()'s internal indexOf + splice. Now always shift() before start() with a status !== "pending" guard for stale entries.
  • Dead code (#7): Removed the unused _nextWorker method (superseded by _nextWorkerForSubtask).
  • Watchdog stop mechanism (#10): while (true) loop had no exit path. Added this._running = true in the constructor, while (parent._running) in the watchdog, and a destroy() method.

mini-a-worker.yaml

  • Thread-safe shared task submission (#2): /message:send reassigned the shared request global and copy-pasted the entire /task body verbatim — not thread-safe under concurrent requests. Extracted global.__worker_submitTask(postData, wargs) called from both handlers:
    // Both /task and /message:send now call:
    var result = global.__worker_submitTask(postData, global.__worker_args);
  • Accurate /task response status (#3): Response hardcoded status: "queued" even after global.__worker_tasks[taskId].status had already been set to "running". Now returns the actual status.
  • CIDR allowlist (#8): apiallow documented CIDR support (e.g. 192.168.1.0/24) but used a naive indexOf(...) === 0 prefix match — 192.168.1.100 would not match. Added global.__worker_cidrMatch(ip, cidr) with bitwise subnet arithmetic and full input validation (octet range, prefix 0–32, NaN guards).
  • Eager SubtaskManager init (#9): Manager was lazily created on the first request using that request's merged args (including maxsteps, format, etc.) — all subsequent tasks then shared a manager misconfigured by the first caller. Now initialized eagerly in Init with clean base wargs.
Original prompt

Overview

Fix all identified bugs and apply significant improvements to the delegation/worker functionality across mini-a-subtask.js and mini-a-worker.yaml.


Bug Fixes

1. waitForAll — Timeout never decrements (mini-a-subtask.js line ~1106)

The expression timeoutMs - (results.length > 0 ? 0 : 0) always subtracts zero, so each waitFor call in the loop gets the full timeoutMs. The aggregate waitForAll can block for up to N × timeoutMs.

Fix: Track a start timestamp before the loop and subtract elapsed time for each iteration:

SubtaskManager.prototype.waitForAll = function(subtaskIds, timeoutMs) {
  if (!isArray(subtaskIds)) {
    throw new Error("subtaskIds must be an array")
  }
  timeoutMs = _$(timeoutMs, "timeoutMs").isNumber().default(300000)
  var results = []
  var startAll = new Date().getTime()
  for (var i = 0; i < subtaskIds.length; i++) {
    var elapsed = new Date().getTime() - startAll
    var remainingTime = Math.max(1, timeoutMs - elapsed)
    results.push(this.waitFor(subtaskIds[i], remainingTime))
  }
  return results
}

2. Non-thread-safe request variable swap in /message:send (mini-a-worker.yaml lines ~246–308)

The /message:send handler reassigns the shared global request variable to re-use the /task logic inline. This is not thread-safe: concurrent HTTP requests can corrupt each other's request reference.

Additionally, the entire task-submission logic is copy-pasted verbatim from the /task handler — a significant maintenance burden and source of bugs.

Fix: Extract the shared task submission logic into a reusable JavaScript helper function (e.g., global.__worker_submitTask(postData, wargs)) called from both /task and /message:send, eliminating the unsafe request swap and the duplication. The helper should:

  • Accept parsed postData and wargs as arguments
  • Validate the goal (trim, length check)
  • Merge allowed remote args
  • Parse and clamp timeout
  • Lazily init SubtaskManager if needed (but see fix Consolidate Kubernetes read tools #9 — ideally done in Init)
  • Register the task in global.__worker_tasks and start it
  • Return { taskId, createdAt }

3. /task response always says "queued" even when task is already "running" (mini-a-worker.yaml line ~206)

After calling global.__worker_taskManager.start(subtaskId), the task status is updated to "running", but the HTTP response body still hardcodes status: "queued".

Fix: Use the actual task status in the response:

return ow.server.httpd.reply(stringify({ taskId: taskId, status: global.__worker_tasks[taskId].status, createdAt: now }, __, ""), 202, ow.server.httpd.mimes.JSON)

4. Worker is killed using task retry threshold (mini-a-subtask.js line ~373)

_recordWorkerFailure marks a worker dead when failures >= this.defaultMaxAttempts. defaultMaxAttempts is a per-subtask retry count (default: 2), not a meaningful per-worker failure budget. A single transient error on two tasks immediately kills the worker.

Fix: Introduce a dedicated workerMaxFailures option (default: 5) and use it in _recordWorkerFailure:

  • Add this.workerMaxFailures = _$(opts.workerMaxFailures, "opts.workerMaxFailures").isNumber().default(5) in the constructor.
  • Change the check to if (failures >= this.workerMaxFailures).

5. Worker marked dead on any task exhaustion, not just transport errors (mini-a-subtask.js lines ~632–633)

In _failOrRetrySubtask, when a subtask exhausts all retries, the worker that ran it is immediately marked dead — even if the task failed due to a bad goal, LLM refusal, or logic error (not a worker fault).

Fix: Only mark the worker dead if there is evidence of a transport/connectivity problem. Remove the blanket _markWorkerDead call from _failOrRetrySubtask. Worker health should be tracked exclusively through _recordWorkerFailure (which uses the new workerMaxFailures threshold from fix #4).


6. Inconsistent queue removal in _processQueue (mini-a-subtask.js lines ~1202–1215)

The queue item is only removed via pendingQueue.shift() in the catch branch. In the success path, start() removes it internally via indexOf + splice. This is inconsistent and fragile — if the queue changes between pendingQueue[0] and the error-path shift(), the wrong item could be removed.

Fix: Always shift the item off the queue before calling start(), and handle rollback on failure:

SubtaskManager.prototype._processQueue = function() {
  while (this.runningCount < this.maxConcurrent && this.pendingQueue.length > 0) {
    var nextId = this.pendingQueue.shift()  // remove first
    var subtask = this.subtasks[nextId]
    if (!isDef(subtask) || subtask.status !== "pending") continue
    try {
      this.start(nextId)
    } catch(e) {
      if (isDef(subtask)) {
        subtask.status = "failed"
        subtask.error = "Failed to start: " + (isDef(e) && isString(e.message) ? e.m...

</details>



<!-- START COPILOT CODING AGENT SUFFIX -->

*This pull request was created from Copilot chat.*
>

Copilot AI changed the title [WIP] Fix identified bugs and improve delegation functionality Fix 10 delegation/worker bugs in mini-a-subtask.js and mini-a-worker.yaml Mar 30, 2026
Copilot AI requested a review from nmaguiar March 30, 2026 13:19
@nmaguiar nmaguiar merged commit 6bad4bc into main Mar 31, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants