Conversation
Agent-Logs-Url: https://github.com/OpenAF/mini-a/sessions/c0190ccb-ed30-46ee-a90e-3a3430a8f32f
Co-authored-by: nmaguiar <11761746+nmaguiar@users.noreply.github.com>
Copilot changed the title from "[WIP] Fix identified bugs and improve delegation functionality" to "Fix 10 delegation/worker bugs in mini-a-subtask.js and mini-a-worker.yaml" on Mar 30, 2026.
This PR fixes a set of correctness, thread-safety, and resource-leak bugs in the delegation and worker subsystems. The fixes range from a broken timeout accumulation, to a thread-unsafe global variable swap, to CIDR notation that was documented but never actually parsed.
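For the CIDR fix (#8), the matcher could look roughly like the following standalone sketch. The actual `global.__worker_cidrMatch` lives inside `mini-a-worker.yaml`; this version is only an illustration of the bitwise subnet arithmetic and input validation described in the summary, not the shipped code:

```javascript
// Sketch of an IPv4 CIDR matcher with input validation (illustrative only).
function cidrMatch(ip, cidr) {
  // Parse a dotted-quad address into an unsigned 32-bit integer, or null if invalid.
  const toInt = (addr) => {
    const parts = String(addr).split(".");
    if (parts.length !== 4) return null;
    let n = 0;
    for (const p of parts) {
      const o = Number(p);
      // octet range + NaN guards: digits only, 0-255
      if (!/^\d+$/.test(p) || isNaN(o) || o < 0 || o > 255) return null;
      n = (n * 256) + o;
    }
    return n;
  };
  const [net, prefixStr] = String(cidr).split("/");
  const prefix = Number(prefixStr);
  if (isNaN(prefix) || prefix < 0 || prefix > 32) return false; // prefix 0-32 only
  const ipN = toInt(ip), netN = toInt(net);
  if (ipN === null || netN === null) return false;
  // Mask keeps the top `prefix` bits; JS shift counts are taken modulo 32,
  // so /0 needs a special case, and >>> 0 forces unsigned 32-bit arithmetic.
  const mask = prefix === 0 ? 0 : (0xFFFFFFFF << (32 - prefix)) >>> 0;
  return ((ipN & mask) >>> 0) === ((netN & mask) >>> 0);
}
```

With this shape, `192.168.1.100` correctly matches `192.168.1.0/24`, which the old `indexOf(...) === 0` prefix comparison could not do.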
mini-a-subtask.js:

- `waitForAll` timeout (#1): `timeoutMs - (results.length > 0 ? 0 : 0)` always subtracted zero — each `waitFor` received the full budget, allowing up to N × `timeoutMs` of total blocking. Now tracks `startAll` and passes `Math.max(1, timeoutMs - elapsed)` on each iteration.
- `workerMaxFailures` (#4): `_recordWorkerFailure` used `defaultMaxAttempts` (default 2) as the per-worker death threshold — one transient error on two tasks killed the worker permanently. Added a dedicated `workerMaxFailures` option (default 5).
- `_failOrRetrySubtask` blanket kill (#5): Replaced `_markWorkerDead` with `_recordWorkerFailure` so logic failures (bad goal, LLM refusal) don't immediately evict the worker; the threshold from #4 now applies.
- `_processQueue` inconsistency (#6): The item was only removed via `shift()` in the `catch` branch; the success path relied on `start()`'s internal `indexOf + splice`. Now always `shift()`s before `start()`, with a `status !== "pending"` guard for stale entries.
- Dead code (#7): Removed the unused `_nextWorker` method (superseded by `_nextWorkerForSubtask`).
- Watchdog leak (#10): The `while (true)` loop had no exit path. Added `this._running = true` in the constructor, `while (parent._running)` in the watchdog, and a `destroy()` method.

mini-a-worker.yaml:

- Thread-unsafe `request` swap (#2): `/message:send` reassigned the shared `request` global and copy-pasted the entire `/task` body verbatim — not thread-safe under concurrent requests. Extracted `global.__worker_submitTask(postData, wargs)`, called from both handlers.
- `/task` response status (#3): The response hardcoded `status: "queued"` even after `global.__worker_tasks[taskId].status` had already been set to `"running"`. Now returns the actual status.
- CIDR matching (#8): `apiallow` documented CIDR support (e.g. `192.168.1.0/24`) but used a naive `indexOf(...) === 0` prefix match — `192.168.1.100` would not match. Added `global.__worker_cidrMatch(ip, cidr)` with bitwise subnet arithmetic and full input validation (octet range, prefix 0–32, NaN guards).
- `SubtaskManager` init (#9): The manager was lazily created on the first request using that request's merged `args` (including `maxsteps`, `format`, etc.) — all subsequent tasks then shared a manager misconfigured by the first caller. Now initialized eagerly in `Init` with a clean base `wargs`.

Original prompt
Overview

Fix all identified bugs and apply significant improvements to the delegation/worker functionality across `mini-a-subtask.js` and `mini-a-worker.yaml`.

Bug Fixes
1. `waitForAll` — timeout never decrements (`mini-a-subtask.js`, line ~1106)

The expression `timeoutMs - (results.length > 0 ? 0 : 0)` always subtracts zero, so each `waitFor` call in the loop gets the full `timeoutMs`. The aggregate `waitForAll` can block for up to N × `timeoutMs`.

Fix: Track a start timestamp before the loop and subtract elapsed time on each iteration:
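A minimal sketch of that fix, assuming each pending subtask exposes a blocking `waitFor(timeoutMs)` as described above (the loop shape is illustrative, not the exact file contents):

```javascript
// Give each waitFor only the remaining budget, so the aggregate wait
// is bounded by timeoutMs instead of N * timeoutMs.
function waitForAll(subtasks, timeoutMs) {
  const startAll = Date.now(); // when the aggregate wait began
  const results = [];
  for (const st of subtasks) {
    const elapsed = Date.now() - startAll;
    // never pass 0 or a negative budget; 1ms floor keeps waitFor well-defined
    results.push(st.waitFor(Math.max(1, timeoutMs - elapsed)));
  }
  return results;
}
```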
2. Non-thread-safe `request` variable swap in `/message:send` (`mini-a-worker.yaml`, lines ~246–308)

The `/message:send` handler reassigns the shared global `request` variable to re-use the `/task` logic inline. This is not thread-safe: concurrent HTTP requests can corrupt each other's `request` reference. Additionally, the entire task-submission logic is copy-pasted verbatim from the `/task` handler — a significant maintenance burden and source of bugs.

Fix: Extract the shared task-submission logic into a reusable JavaScript helper function (e.g., `global.__worker_submitTask(postData, wargs)`) called from both `/task` and `/message:send`, eliminating the unsafe `request` swap and the duplication. The helper should:

- accept `postData` and `wargs` as arguments
- create the `SubtaskManager` if needed (but see fix #9 — ideally done in `Init`)
- register the task in `global.__worker_tasks` and start it
- return `{ taskId, createdAt }`

3. `/task` response always says `"queued"` even when the task is already `"running"` (`mini-a-worker.yaml`, line ~206)

After calling `global.__worker_taskManager.start(subtaskId)`, the task status is updated to `"running"`, but the HTTP response body still hardcodes `status: "queued"`.

Fix: Use the actual task status in the response:
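A minimal standalone sketch of the behaviour, with an in-memory table standing in for `global.__worker_tasks` (handler and manager plumbing omitted; names are illustrative):

```javascript
const tasks = {};

// Stand-in for the manager's start(): may flip status synchronously.
function start(taskId) {
  tasks[taskId].status = "running";
}

function submitTask(taskId) {
  tasks[taskId] = { status: "queued", createdAt: Date.now() };
  start(taskId);
  // BUG (before): the response hardcoded { status: "queued" }.
  // FIX (after): report whatever the status actually is now.
  return { taskId: taskId, status: tasks[taskId].status };
}

console.log(submitTask("t1").status); // "running", not the stale "queued"
```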
4. Worker is killed using the task retry threshold (`mini-a-subtask.js`, line ~373)

`_recordWorkerFailure` marks a worker dead when `failures >= this.defaultMaxAttempts`. `defaultMaxAttempts` is a per-subtask retry count (default: 2), not a meaningful per-worker failure budget. A single transient error on two tasks immediately kills the worker.

Fix: Introduce a dedicated `workerMaxFailures` option (default: 5) and use it in `_recordWorkerFailure`:

- add `this.workerMaxFailures = _$(opts.workerMaxFailures, "opts.workerMaxFailures").isNumber().default(5)` in the constructor
- change the death check to `if (failures >= this.workerMaxFailures)`

5. Worker marked dead on any task exhaustion, not just transport errors (`mini-a-subtask.js`, lines ~632–633)

In `_failOrRetrySubtask`, when a subtask exhausts all retries, the worker that ran it is immediately marked dead — even if the task failed due to a bad goal, LLM refusal, or logic error (not a worker fault).

Fix: Only mark the worker dead if there is evidence of a transport/connectivity problem. Remove the blanket `_markWorkerDead` call from `_failOrRetrySubtask`. Worker health should be tracked exclusively through `_recordWorkerFailure` (which uses the new `workerMaxFailures` threshold from fix #4).

6. Inconsistent queue removal in `_processQueue` (`mini-a-subtask.js`, lines ~1202–1215)

The queue item is only removed via `pendingQueue.shift()` in the `catch` branch. In the success path, `start()` removes it internally via `indexOf + splice`. This is inconsistent and fragile — if the queue changes between `pendingQueue[0]` and the error-path `shift()`, the wrong item could be removed.

Fix: Always shift the item off the queue before calling `start()`, and handle rollback on failure:
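A minimal sketch of the corrected flow (names follow the PR description; the surrounding manager code and the real `start()` are assumed, and the stale-entry guard comes from fix #6):

```javascript
// Deterministically remove the next queue item before starting it,
// and put it back at the front if start() throws.
function processNext(pendingQueue, startFn) {
  while (pendingQueue.length > 0) {
    const item = pendingQueue.shift();       // always remove here, not inside start()
    if (item.status !== "pending") continue; // skip stale entries (cancelled/already started)
    try {
      startFn(item);
    } catch (e) {
      pendingQueue.unshift(item); // rollback: re-queue for a later retry
      throw e;
    }
    break;
  }
}
```

Because removal now happens in exactly one place, the success and error paths can no longer disagree about which item left the queue.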