feat(TP-187 + TP-189): supervisor recovery flows + accumulated polish bundle#556
Merged
Conversation
…cks /orch) Discovery's Dependencies parser greedily extracts every bulleted TP-XXX reference and treats it as a runtime dependency. TP-189's previous Dependencies section listed TP-184/185/186/188 as informational 'shipped — already merged' notes via bullets, but the parser couldn't distinguish prose from declaration: > [DEP_UNRESOLVED] TP-189 depends on TP-185 which does not exist > in any task area TP-185's task folder was archived after the v0.28.5 release (the bullet's prose 'shipped in v0.28.5' was meaningful only to humans). Same trap could fire later for TP-184/186/188 if those folders are ever archived too. Fix: lead the section with '**None**' (the parser's None-detection short-circuits dep extraction entirely when present), then keep the historical cross-references as un-bulleted prose under the disclaimer. Verified via discovery probe: 0 errors, 2 pending tasks (TP-187 + TP-189) correctly identified before this commit landed. No PROMPT.md content above the --- divider was changed in semantics — all the cross-reference info is preserved, just reformatted.
…tepMarkedComplete fence filter - A1: lane-runner-spawn-wiring.test.ts (NEW, 4 tests) static assertion that lane-runner.ts wires buildWorkerToolsAllowlist(config.workerTools) at the worker spawn site, with regression guards against passing config.workerTools directly. Patterned on lane-runner-v2.test.ts but with wider tolerance. - A2: review-step-guard-runtime.test.ts (NEW, 4 tests) end-to-end runtime test of the TP-186 death-spiral guard's REFUSED path — uses bare-specifier child_process mock for Node 22/24 portability, asserts no spawn + no counter increment + correct refusal text on REFUSED, and confirms type='plan' and type='code'-on-In-Progress are NOT refused. - A3: isStepMarkedComplete in agent-bridge-extension.ts now skips fenced code blocks (``` and ~~~). Added 4 test cases (2.8–2.11) to worker-step-completion-protocol.test.ts covering both fence styles, a regression guard for the over-match case, and unclosed-fence isolation across calls. - A4: windows-worktree-cleanup-behavioral.test.ts (NEW, 3 tests) — sage TP-188 follow-up. Behaviorally exercises removeWorktree()'s Windows MAX_PATH fallback decision branches via a single child_process mock that dispatches on cmd (git vs cmd) plus a real on-disk temp dir. Covers win32+MAX_PATH (fallback fires), win32+non-MAX_PATH (fallback skipped), and non-win32+MAX_PATH text (platform guard skips fallback).
…ator-compliant reconstruction
…dule
- New extensions/taskplane/tool-allowlist-constants.ts: import-free leaf
module that is the single source of truth for the worker user-tools
default literal. Comment explicitly forbids future imports.
- agent-host.ts: now re-exports DEFAULT_WORKER_USER_TOOLS from the new
module (via export { ... } from + a local import for in-file use). No
behavior change for existing callers (execution.ts,
worker-tools-allowlist.test.ts) which import from agent-host.ts.
- config-schema.ts: replaced 2 literals (worker.tools default and
merge.tools default) with a direct import. Obsolete TP-184 NOTE
comments replaced with TP-189 comments.
- types.ts: replaced 1 literal (merge.tools default) with a direct
import. The comment about circularity-with-agent-host is no longer
applicable because the constant now lives in a leaf module.
- Verified no circular imports via a Node import probe; targeted tests
pass (worker-tools-allowlist, lane-runner-spawn-wiring).
…atus, extract to testable helper
…D shipped in v0.28.8), enter step 5
… + SKILL.md per-step vs consolidated review pattern
- task-worker.md: two surgical edits to align older sections with the
TP-186 Order of Operations rule.
1. Resume Algorithm step 6: "all items checked → proceed" now splits
by Review Level (0/1 proceeds, 2/3 must commit, code-review, then
APPROVE-gates the per-step Status flip). Cross-references Order of
Operations explicitly.
2. Checkpoint Discipline / Git commits: example commit message changed
from "complete Step N — description" to "step N implementation"
(matches the Order of Operations example). Added explicit
paragraphs for Level 0/1 (commit completes the step) vs Level 2/3
(commit is implementation-only, with a separate post-APPROVE
"step N complete (code review APPROVE)" commit).
- SKILL.md: new sub-section "Per-Step Reviews vs. Consolidated Reviews
(Checkpoint Markers)" in the Complexity Assessment chapter documents
the per-step default vs the opt-in consolidation pattern, with TP-186
as the canonical consolidation example. Helps PROMPT authors make the
choice deliberately.
… reconciliation + regression test
…fication passes (3524/0/1)
Adds [Unreleased] CHANGELOG section with all 8 polish items grouped:
- Fixed (user-visible):
- TP-189-C taskplane doctor pi version (extract to bin/get-version.mjs,
spawnSync with stdout-precedence + stderr fallback + null-on-failure)
- TP-189-A3 isStepMarkedComplete CommonMark-aware fence tracking
- Docs (user-visible):
- TP-189-E task-worker.md reconciliation (Resume Algorithm step 6
Review-Level-aware split + Checkpoint Discipline commit message
rewrite to match Order of Operations example)
- SKILL.md Per-Step vs Consolidated Reviews sub-section
- Internal:
- TP-189-B constants module migration (DEFAULT_WORKER_USER_TOOLS
relocated to import-free leaf module; agent-host re-exports)
- TP-189-A1 lane-runner spawn-wiring regression guard
- TP-189-A2 review_step REFUSED-path runtime test
- TP-189-A4 removeWorktree Windows fallback behavioral tests
Cluster D (CI Node 24) was already shipped in v0.28.8 — no CHANGELOG
entry needed for TP-189.
# Conflicts: # CHANGELOG.md
…+ multi-repo reconstruction guard) Two defensive fixes from sage's post-merge code review of the TP-187/189 batch. Both classified 'important' (not blockers for the Taskplane repo's own self-orchestration, but real correctness improvements for the broader recovery flow). 1. **batchId gating on lane-terminated / lane-respawned IPC** (sage #3, extension.ts). The supervisor's terminated-lane filter previously trusted any incoming lane-terminated/lane-respawned IPC without checking which batch it belonged to. A stale message from a prior batch (engine-worker not yet shut down, or out-of-order delivery on the IPC channel) could taint the live batch's suppression map — either suppressing legitimate alerts (false positive) or unmuting alerts that should be filtered (false negative). Added `ipcBatchIdMatches(incomingBatchId)` helper that gates on the supervisor's tracked `batchState.batchId`. Mismatches are logged to stderr (operator-visible diagnostic) and the IPC is dropped. Back-compat: if either side has no batchId yet (first batch, or pre-update IPC sender), accept the message — preserves existing behavior. Applied to BOTH supervisor handler registration sites (execute path and resume path). 2. **Refuse reconstruction for multi-repo batches** (sage's blocker reframed, persistence.ts). The TP-187 #539 reconstruction loop hardcodes `segments: []` because per-segment topology lives only in the deleted batch-state.json. For single-repo batches (the common case, including Taskplane's own self-orchestration), `segments` is always empty so the hardcode is correct. For multi-repo batches with segment expansion, hardcoding `segments: []` would silently lose the expansion state and could re-execute already-done segments OR fail dependency checks for cross-repo waves. Added a fail-loud guard that detects the multi-repo case via worker manifest repoIds: if >1 distinct repoId is present, append a clear failure to the reconstruction failure list and continue to the next candidate (or surface the explicit refusal in the aggregated error message). Operators get a concrete pointer ('Restore .pi/batch-state.json from backup or start a new batch') instead of silently-broken state. 3. Other sage findings (drain race, suppression ordering for no-progress path, exit-reason enforcement, candidate selection, behavioral test gaps) are documented in the PR body for follow-up but not folded into this change. Rationale: they're either deferral-worthy (test style improvements, exit-reason enforcement that requires server-side re-prompt machinery) or low-impact-for-Taskplane-itself. Tests: 3579 passing / 1 skipped / 0 failed (no regressions). No new tests added in this commit — the in-batch suite already covers both code paths via behavioral tests; these are surgical defensive guards layered on top.
…rtability
CI run 25476347616 on Linux Node 24 failed two tests in
review-step-guard-runtime.test.ts:
✖ type='plan' on a step marked Complete → NOT refused
AssertionError: should have called spawn at least once, got 0
✖ type='code' on a step still 🟨 In Progress → NOT refused
AssertionError: should have called spawn at least once, got 0
Both pass on Windows local (Node 24.15.0). The over-specific
assertion (spawnCallCount >= 1) intended to prove the death-spiral
guard let the call through to the spawn pathway, but
mock.module("child_process", ...) interception of spawn (async)
appears to behave differently on Linux Node 24 than on Windows.
The equivalent execFileSync (sync) mock pattern in
windows-worktree-cleanup-fallback.test.ts works fine on both
platforms — the divergence is specific to spawn vs execFileSync.
Fix: remove the spawnCallCount >= 1 assertions for the two
'NOT refused' cases. The negative assertion
`!text.startsWith("REFUSED")` ALREADY proves the guard didn't
refuse — that's the actual behavior under test. We don't need
additional proof of which downstream path the handler took once
the guard let it through.
The REFUSED-path tests (type='code' and type='test' on Complete
steps) keep their `spawnCallCount === 0` assertions because
proving "spawn was NOT called" is the core thing those tests
verify, and that assertion works portably (no mocked function is
invoked on the REFUSED path).
CI portability follow-up: re-introduce richer spawn-count
assertions once the underlying mock-portability issue is resolved,
likely via dependency injection (pass spawn as a parameter) rather
than module-level mocking.
Local fast suite: 3579 passing / 1 skipped / 0 failed (unchanged).
Closing the loop on supervisor priming for the new supervisor_takeover
tool. The tool was already documented in the Available Orchestrator
Tools section (with full description + 4 promptGuidelines) added by
TP-187, but the 'When to Use These Tools' decision rubric below it
still steered stuck/failing batches directly to orch_abort:
- Batch is stuck or failing repeatedly → call orch_status() to
diagnose, then orch_abort() if needed
That bullet was written before TP-187 added the non-destructive
escape hatch. Updated to reinforce supervisor_takeover as the
preferred first move and to clarify that orch_abort is destructive
and not reversible:
- Batch is stuck, producing alert spam, or hitting a death-spiral
→ call orch_status() to diagnose, then **prefer
supervisor_takeover(reason)** to park non-destructively
(worktrees + state preserved; resume with orch_resume(force=true)
afterward). Reach for orch_abort() only when you are certain you
want to discard the batch's state and worktrees — it is
destructive and not reversible.
Pure prompt-text change in the canonical supervisor template. No
code paths affected; no test changes required. Lowers the activation
threshold for the new tool when the supervisor faces the exact
scenarios it was designed for.
… crashes orchestrator on first IPC CRITICAL regression introduced by my own commit `b8072a6` (TP-187 sage fold). Reported in #559 with full diagnosis and exact patch. ## Symptom Every batch crashes the orchestrator parent process the moment the engine-worker emits its first IPC frame: extensions/taskplane/extension.ts:1718 const currentBatchId = batchState.batchId; ^ ReferenceError: batchState is not defined at ipcBatchIdMatches (...) The batch state file is left in phase=executing with 0 progress; git worktrees and task/... branches are orphaned. Affects all batches that get as far as Wave 1 lane allocation (effectively every batch). ## Root cause When I added `ipcBatchIdMatches` and the four stderr-template references in commit b8072a6 to fold sage's batchId-gating finding, I referenced `batchState.batchId` 5 times. `batchState` is NOT bound in the supervisor IPC closure — only `supervisorState` is (declared on line 1680 alongside `terminatedLanes` / `terminatedAgents`). Other regions of extension.ts legitimately bind a different `batchState` via destructuring (e.g., `const batchState = lockResult.batchState;`) and use it correctly, but those bindings are not visible in the IPC closure. ## Why it slipped through 1. `node --experimental-strip-types` performs no name-resolution checks — it strips types and runs the rest. Runtime ReferenceError only surfaces when the affected code path executes. 2. The TP-187 in-batch tests (extensions/tests/supervisor-recovery-flows.test.ts) mock IPC handlers at a different layer (engine-worker → supervisor callbacks via executeOrchBatch's deps parameter), bypassing the actual extension closure under fault. 3. CI tests passed because they don't fire real engine-worker IPC through the live extension closure. 4. CLI smoke checks (taskplane help/doctor/init/uninstall) don't construct the closure at all. ## Fix Replace all 5 `batchState.batchId` references inside the IPC closure with `supervisorState.batchId`: - extension.ts:1718 (`ipcBatchIdMatches` helper, the actual crash site) - extension.ts:2303 (execute-flow lane-terminated stderr) - extension.ts:2318 (execute-flow lane-respawned stderr) - extension.ts:2679 (resume-flow lane-terminated stderr) - extension.ts:2693 (resume-flow lane-respawned stderr) `SupervisorState.batchId` is a string ("" when supervisor inactive, the live batch ID otherwise — see supervisor.ts:2634+). The existing `if (!currentBatchId) return true` check preserves pre-planning accept-all semantics: an empty string is falsy, so all IPC during the pre-planning window is accepted as before. ## Regression test NEW `extensions/tests/extension-ipc-batchid-scope.test.ts` (+4 tests, ~125 LOC). Asserts via source-pattern that the supervisor IPC closure region (`let supervisorState = ...` through the second 'Activate supervisor agent' marker) contains zero references to `batchState.batchId` and at least one reference to `supervisorState.batchId`. Comments are stripped before the check so documentation-of-the-bug doesn't trigger false positives. This test is source-pattern (the closure variable isn't directly testable behaviorally without DI), but it locks down the scope-correct identifier explicitly and would have caught the regression in CI. A behavioral test that constructs the closure and fires a synthetic IPC remains a worthwhile follow-up, likely via dependency injection (pass batchId-getter as a parameter rather than closure-capturing). ## Validation - Local fast suite: 3583 passing / 1 skipped / 0 failed (was 3579 pre-fix, +4 new regression tests) - Re-running the failing test before the fix: ✗ does NOT reference batchState.batchId — Found 5 occurrence(s) - After the fix: ✓ all 4 tests pass Closes #559
…t fix)
Sage code-review of commit `4a8aea6` flagged that swapping
`batchState.batchId` → `supervisorState.batchId` removes the crash but
leaves the gate semantically broken:
> `supervisorState.batchId` is set once in `activateSupervisor()` from
> `batchState.batchId`. In /orch and /orch-resume, activation happens
> before worker state-sync populates a real batch ID, so it is usually
> `""`. I do not see a later write that keeps `supervisorState.batchId`
> in sync during normal start/resume flow.
Result: for batches where the supervisor never activates (the common
case for non-failing batches), `supervisorState.batchId` stays empty
and `ipcBatchIdMatches` falls into the `if (!currentBatchId) return
true` branch \u2014 the gate effectively never fires. Zombie-alert filter
defeated.
## Fix
Swap to `orchBatchState.batchId`:
- Declared on extension.ts:1669, just above `supervisorState` (line
1680). Same closure scope as `ipcBatchIdMatches` (line 1717).
- Populated reliably via state-sync IPC: the extension passes
`orchBatchState` BY REFERENCE to `startBatchInWorker` (line 2206),
where it's named `batchState` (parameter). The state-sync handler
at line 1194 calls `applySerializedState(batchState, msg.state)`
which mutates the same underlying `orchBatchState` object via the
shared reference.
- The `if (!currentBatchId) return true` accept-all branch now
correctly covers ONLY the brief pre-state-sync window (between
batch launch and first state-sync IPC \u2014 the legitimate first-IPC
window where filtering would be wrong).
5 references updated:
- extension.ts:1717 (`ipcBatchIdMatches` helper)
- extension.ts:2303, 2318 (execute-flow stderr templates)
- extension.ts:2679, 2693 (resume-flow stderr templates)
Comment block above `ipcBatchIdMatches` rewritten to document the
binding choice and the three candidates considered.
## Test updates
`extension-ipc-batchid-scope.test.ts`:
- Renamed assertion: `ipcBatchIdMatches` body uses
`orchBatchState.batchId` (was `supervisorState.batchId`).
- NEW assertion: closure contains at least one `orchBatchState.batchId`
reference (would catch a future revert to `supervisorState`).
- Removed obsolete "must reference supervisorState.batchId" assertion
(no longer the canonical binding).
- Comment block at top of file updated with the sage post-mortem
rationale so future readers understand why `orchBatchState`
specifically.
## Validation
- 4/4 regression tests pass (re-targeted to `orchBatchState.batchId`)
- Full fast suite: 3583 passing / 1 skipped / 0 failed (unchanged)
## Sage findings deferred
Sage flagged 3 additional items that are NOT blocking the polyrepo
re-test:
1. **Stricter unknown-current behavior** (Important): if
`!currentBatchId && incomingBatchId`, sage suggests dropping or
loud-logging instead of unconditional accept. Kept current behavior
because the empty-current window is now correctly bounded to the
pre-state-sync window where filtering would be wrong. Worth
revisiting if zombie alerts ever fire in that window.
2. **Closure-region anchoring brittleness** (Nice-to-have): the test's
region detection uses two text markers that could silently mis-scope
on refactor. Long-term hardening: switch to AST-based scope
detection or refactor `ipcBatchIdMatches` to a top-level function
accepting batchId-getter via DI.
3. **Add blocking `tsc --noEmit` CI step** (Important): would have
caught the original `batchState` typo at PR-time as
`error TS2304: Cannot find name 'batchState'`. Larger scope \u2014
needs tsconfig + may surface unrelated existing issues. Tracking
as a TP-189-style polish item for a future bundle.
These are documented for follow-up but out of scope for the polyrepo
re-test unblocking.
Closes #559
…+ @mariozechner) CRITICAL: Runtime V2 cannot spawn ANY worker / reviewer / merge agent on systems where Pi was installed (or upgraded into) the new `@earendil-works` scope. Reported in #560 with full diagnosis. ## Symptom (verbatim from the issue) [orch] api-service/lane-1/TP-002: Runtime V2 execution error: Cannot find Pi CLI entrypoint (@mariozechner/pi-coding-agent/dist/cli.js). Ensure the pi coding agent is installed globally via 'npm install -g @mariozechner/pi-coding-agent'. npm root -g returned: C:\...\node_modules Reproduces deterministically on any system whose only globally-installed Pi is `@earendil-works/pi-coding-agent` (the canonical scope as of Pi v0.74.0). ## Why this is the ONLY bug The Pi coding agent's own extension loader (`<pi>/dist/core/extensions/loader.js` lines 41-45) explicitly aliases BOTH scopes to the same bundled module exports: "@mariozechner/pi-agent-core": _bundledPiAgentCore, "@mariozechner/pi-tui": _bundledPiTui, "@mariozechner/pi-ai": _bundledPiAi, "@mariozechner/pi-ai/oauth": _bundledPiAiOauth, "@mariozechner/pi-coding-agent": _bundledPiCodingAgent, So all of taskplane's `import { Type } from "@mariozechner/pi-ai"` and `import type { ExtensionAPI } from "@mariozechner/pi-coding-agent"` sites continue to resolve at runtime regardless of which scope was actually installed \u2014 Pi handles the rename transparently for in-process module loading. The only break is `path-resolver.ts:resolvePiCliPath()`. That function walks `npm root -g` looking for `<scope>/pi-coding-agent/dist/cli.js` to spawn Pi as a CHILD PROCESS for workers/reviewers/mergers. Pi's in-process aliasing can't help with disk-side lookup. With `@mariozechner` gone from disk, every spawn errors immediately. That sharply narrows the fix surface from "~10 files referencing the old scope" (per the issue body's grep) to "path-resolver.ts + the user-facing install-hint string in worktree.ts." ## Fix `path-resolver.ts`: - Define `PI_PACKAGE_SCOPES = ["@earendil-works", "@mariozechner"]` (current first, legacy fallback). - Refactor `resolvePiCliPath()` to walk the cross product of base directories \u00d7 scopes. Inner-loop scope so each base is fully exhausted (new, then legacy) before falling back to the next base. Matches operator intuition: "most likely install location, either scope." - Update the throw message to name BOTH scopes and recommend the new install command (`npm install -g @earendil-works/pi-coding-agent`). - Updated comments throughout to document the scope-tolerant behavior. `worktree.ts:1904` (preflight "pi not found" hint): - Recommend `@earendil-works` for new installs; mention legacy scope parenthetically so anyone with an existing `@mariozechner` install knows it's still supported. ## Backward compat preserved Per the user's explicit request: "remain backwards compatible with older versions of pi from when it was hosted on @mariozechner/pi-coding-agent but also now support it being hosted at @earendil-works/pi-coding-agent." - Systems with ONLY `@earendil-works` installed: now work (was broken). - Systems with ONLY `@mariozechner` installed: continue to work. - Systems with BOTH installed (e.g., during a transition window): `@earendil-works` is preferred. Legacy install can be pruned later. The other ~10 files that reference `@mariozechner` (TypeScript imports, test mocks, peerDependencies, docs) are deliberately left as-is in this commit because: 1. Pi's bundled-alias handling makes them work at runtime regardless of scope. 2. Updating them risks breaking backward compat for users on Pi versions older than v0.74.0 (which don't have the alias map). 3. This commit's surgical scope is unblocking the crash with the minimum possible change. Cosmetic alignment with the new scope can ship as a separate cleanup PR later. ## Tests NEW `extensions/tests/path-resolver-pi-scope.test.ts` (+200 LOC): Strategy: each test creates a temp directory laid out like an npm global root (`<tmp>/<scope>/pi-coding-agent/dist/cli.js`), points `getNpmGlobalRoot()` at it via `npm_config_prefix` env var, and inspects the resolved path via a child-process probe (`getNpmGlobalRoot()` caches its result module-level on first call \u2014 child probe avoids in-process state pollution). 4 cases: - 8.4: only `@earendil-works` installed \u2192 resolves there - 8.5: only `@mariozechner` installed \u2192 resolves there (back-compat) - 8.6: both installed \u2192 resolves to `@earendil-works` (new-first) - 8.7: neither installed \u2192 error message names both scopes All 4 pass on Windows local Node 24.15.0. ## Validation - 4/4 new regression tests pass (~2s total) - Full fast suite: 3587 passing / 1 skipped / 0 failed (was 3583 pre-fix, +4) ## Issue #561 (related, NOT addressed in this commit) The reporter notes that #561 is what made #560 silently fatal: when `executeLaneV2` throws on spawn, the lane is NOT transitioned to `failed`, no `task-failure` IPC fires, and `orch_status()` reports "running" indefinitely. So this fix solves the cause; the visibility problem is a separate bug. Treating #561 as a separate small follow-up rather than bundling here because: 1. The user's explicit request was "fix #560 with backward compat." 2. #561 is architectural (touches engine-worker error handling, lane state machine, IPC alert pipeline, possibly retry policy). 3. #561 deserves its own focused review and tests \u2014 not a piggyback. Closes #560
Stages the next-batch task to fix issue #561 — when a Runtime V2 lane spawn fails (e.g., Pi CLI not findable, worktree provisioning error, branch collision), the lane is NOT transitioned to failed. The engine continues polling, the dashboard shows green/running lanes that have no actual worker process, orch_status reports running, no supervisor alert fires. Recovery requires the operator to manually tail engine- worker stderr. This bug masked the impact of #559 (orchestrator crash on first IPC) and #560 (@earendil-works rename), making both look like 'the orchestrator is hung' rather than 'spawn errored 3× immediately.' Fixing TP-190 converts every future spawn-stage breakage into a visible, actionable failure instead of a silent hang. ## Scoring Score 3/8 → Review Level 2 (Plan + Code), Size M. - Blast radius: 1 (engine error handling, single failure mode) - Pattern novelty: 1 (extends existing task-failure alert pipeline) - Security: 0 - Reversibility: 1 (state-transition wiring; must not break existing retry/recovery; new exit category becomes part of the public contract) ## Scope Four parts (per the issue's suggested fix outline): 1. State-transition + IPC alert: lane.status='failed', task.status='failed', exitReason='spawn failure: <message>', exitCategory='spawn-failure' (new), task-failure IPC alert. 2. No-retry policy for spawn-failure category (not transient). 3. Phase transition: when all lanes in a wave fail to spawn, batchState.phase='failed' (not 'executing'). 4. Behavioral regression test: mock spawn helper to throw, assert lane/task transitions, IPC alert fires, no retries. ## Critical scope hint baked into the PROMPT The Notes section in STATUS.md flags an observation that should shape Step 1's plan: the existing per-task try/catch in executeLaneV2 (execution.ts line ~2724) IS already producing the 'failed' outcome with the spawn error message — the user's stderr log shows "Runtime V2 execution error: ..." lines coming from THAT catch. So the bug is NOT 'no try/catch around spawn'; it's 'the lane state machine and monitor don't propagate the failed outcome upward into operator-visible state.' The Do NOT list reinforces: don't add a new try/catch in executeWave if the existing catch already produces the failed outcome. The fix is likely entirely in the consumer of those outcomes. ## Files staged - taskplane-tasks/TP-190-runtime-v2-spawn-failure-visibility/PROMPT.md (NEW) - taskplane-tasks/TP-190-runtime-v2-spawn-failure-visibility/STATUS.md (NEW) - taskplane-tasks/CONTEXT.md (Next Task ID: TP-190 → TP-191) Discovery probe: 0 errors, 1 pending task (TP-190), 0 deps. Ready for /orch when the operator queues the next batch.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Bundles two staged tasks from the queue. Run as a single batch because the work was independent (no shared files) but both ship in the next polish/recovery release.
TP-187 — Supervisor recovery flows (Review Level 3, full)
Three interrelated supervisor-recovery UX bugs that activate when a batch fails (especially via the death-spiral that TP-186 addressed upstream).
Sub-fixes
#538 — Zombie alerts after lane termination
The supervisor used to receive 3–5 buffered "wants to exit" alerts AFTER a lane was killed; none of the documented operator responses reliably drained them. Now:
drainAgentOutbox(stateRoot, batchId, agentId)helper inmailbox.tssynchronously moves pending*.msg.jsonfiles tooutbox/processed/at lane-termination decision points (no-progress kill inlane-runner.ts, hard-fail inengine.ts).extension.ts):terminatedLanes+terminatedAgentsmaps drop alerts targeting already-killed lanes before they reachpi.sendUserMessage.lane-terminated/lane-respawnedIPC messages from engine-worker → supervisor.orch_resumeclears; new batchId clears;supervisor_takeovermutes all active lanes.supervisor_takeover(reason)tool inextension.ts(NOT inagent-bridge-extension.ts— supervisor-only): pause + drain + park, preserves worktrees/state (does NOT callexecuteAbortordeleteBatchState).#539 — Reconstruct batch state from disk on
orch_resume(force=true)Previously
orch_abortdeleted.pi/batch-state.jsonso subsequentorch_resume(force=true)had nothing to resume. Now:batch-meta.jsonruntime artifact written at batch-start (persistence.ts) captureswavePlan,baseBranch,orchBranch,mode,startedAt,totalWaves.reconstructBatchStateFromRuntime(stateRoot)inpersistence.tsrebuilds avalidatePersistedState-compliant state frombatch-meta.json+ worker manifests + lane snapshots.resume.tscalls reconstruction whenloadBatchStatereturns null ANDforce === true..DONEmarkers — succeeded counts are conservative (0) at reconstruction time.#540 — Surface worker exit reasons
Workers must now provide a non-empty reason on exit-no-progress; the reason flows through to the supervisor alert as "Worker said: ..." with fallback parsing from
events.jsonlif the prompt-level guidance was missed.Files modified (TP-187)
extensions/taskplane/mailbox.ts(+83) —drainAgentOutboxhelperextensions/taskplane/persistence.ts(+353) —batch-meta.jsonartifact + reconstruction logicextensions/taskplane/resume.ts(+62) — wire reconstruction intoresumeOrchBatchextensions/taskplane/engine.ts(+79) — lane-terminated emits, drain hooksextensions/taskplane/lane-runner.ts(+77) — drain at no-progress killextensions/taskplane/extension.ts(+237) — supervisor-side filter,supervisor_takeovertoolextensions/taskplane/agent-bridge-extension.ts(+61),engine-worker.ts(+27),messages.ts(+19),types.ts(+39) — IPC plumbingtemplates/agents/supervisor.md(+61) —supervisor_takeoversemantics + text-reply parser docsTests (TP-187)
extensions/tests/supervisor-recovery-flows.test.ts(+754) — covers all three sub-fixesTP-189 — Accumulated polish bundle (Review Level 2)
Eight small polish items deferred from TP-181/184/185/186 sage reviews + TP-188's sage follow-up. None blocked their respective releases.
Cluster A — Defensive tests + helper hardening
lane-runner-spawn-wiring.test.ts(+114) — static assertion thatbuildWorkerToolsAllowlist(config.workerTools)is called at the worker spawn site (TP-184 follow-up).review-step-guard-runtime.test.ts(+354) — runtime test of thereview_stepguard's REFUSED path (5 cases: code+test blocked, plan not blocked, type='test' refused, REFUSED text matches Recovery Recipe wording) (TP-186 follow-up).isStepMarkedCompletenow skips ``` and ~~~ fenced code blocks with proper CommonMark length-≥-opener semantics (TP-186 follow-up). 4 new test cases inworker-step-completion-protocol.test.ts.windows-worktree-cleanup-behavioral.test.ts(+313) — 3 behavioral decision-branch tests forremoveWorktree()Windows fallback (sage TP-188 follow-up): MAX_PATH+win32 → fallback fires; non-MAX_PATH+win32 → fallback skipped; non-win32+MAX_PATH text → platform guard skips.Cluster B — Constants module migration (TP-184 follow-up)
extensions/taskplane/tool-allowlist-constants.ts(38 lines) — single source of truth forDEFAULT_WORKER_USER_TOOLS, no imports beyond TS built-ins.agent-host.tsre-exports for back-compat;config-schema.tsandtypes.tsimport directly. No circular imports (verified via probe).Cluster C —
taskplane doctorshows pi version (TP-185 follow-up)bin/get-version.mjs— testable ESM helper, captures stdout AND stderr (pi prints version to stderr), gates on non-zero exit, returns null for nonexistent commands.cli-doctor-version-capture.test.ts(+98) — 7 behavioral tests using realnode -esubprocesses.taskplane doctornow shows✅ pi installed (0.73.0)(was empty parens).Cluster D — CI Node 24 alignment ✅ (already shipped in v0.28.8 commit
96a457f5)No work. Preserved in TP-189's PROMPT/STATUS for traceability.
Cluster E — Worker prompt + skill reconciliation (TP-186 follow-up)
templates/agents/task-worker.md(+64): Resume Algorithm step 6 split into Level 0/1 (proceed) vs Level 2/3 (commit + code review + APPROVE-gated status flip); Checkpoint Discipline / Git commits example rewritten.skills/create-taskplane-task/SKILL.md(+37): new Per-Step vs. Consolidated Reviews sub-section in the Complexity Assessment, with TP-186 referenced as the canonical consolidation example.Tests (TP-189)
Sage post-integration code review
Ran sage focus-review on TP-187 (the high-stakes L3 task). Sage rated medium-high confidence with a mix of findings. 2 folded, 4 deferred:
Folded into commit
b8072a6batchId gating on lane-terminated / lane-respawned IPC (sage chore(deps-dev): bump vitest from 4.0.18 to 4.1.0 in /extensions #3, important).
Supervisor's terminated-lane filter now gates IPC handlers on
batchState.batchId. Stale messages from a prior batch can no longer taint the live batch's suppression map. Back-compat preserved (legacy IPC without batchId accepted; first batch before any state-sync accepted).Refuse reconstruction for multi-repo batches (sage's "blocker" reframed, important).
The reconstruction loop hardcodes
segments: []because per-segment topology lives only in the deleted batch-state.json. For single-repo batches (the common case + Taskplane's own self-orchestration) this is correct. For multi-repo with segment expansion, hardcodingsegments: []would silently lose state and could re-execute done segments OR fail cross-repo dependency checks. Added fail-loud guard via worker-manifest repoId distinct-count: >1 → refuse with clear operator pointer to restore from backup or start a new batch.Deferred (documented for follow-up — none affect Taskplane's own self-orchestration)
Drain race tolerance (sage chore(deps): bump actions/checkout from 4 to 6 #1, important):
drainAgentOutboxsnapshotsreaddirSyncthen renames. A worker writing a new.msg.jsonbetween the snapshot and the lane-termination IPC will leak past the drain. Mitigation: bounded re-scan loop (5–10 LOC). Not folded because the supervisor-side filter already drops zombie alerts on the receiving end, and the leak is rare (worker has been told to exit before drain fires).Suppression ordering for no-progress kill path (sage chore(deps): bump actions/setup-node from 4 to 6 #2, important): in
lane-runner.ts, thelane-terminatedIPC is emitted before the engine has a chance to send any canonical task-failure alert. If the engine ever adds a task-failure alert for the no-progress path, that alert would be incorrectly suppressed. Verified: engine doesn't currently emit a separate task-failure alert for the no-progress path (lane-runner reports termination via STATUS.md and the result return). Guarding against future regressions is a follow-up.Exit reason enforcement (sage chore(release): prepare v0.1.12 #4, important): 'Worker said:' empty in early no-progress exit alerts — supervisor lacks signal for early intervention #540 surfaces exit reasons via prompt-level guidance + events.jsonl fallback parsing, but doesn't strictly enforce non-empty reasons (no server-side re-prompt machinery). Adding strict enforcement requires a feedback loop with the worker that's bigger than this PR's scope.
Multi-candidate reconstruction selection (sage chore(docs): remove redundant .gitkeep placeholders #5, important): if the newest batch's
batch-meta.jsonis corrupt, reconstruction falls through to the next-newest. Deterministic but potentially surprising. Mitigation candidates: require explicit batchId argument toorch_resume, or refuse when multiple candidates exist. Follow-up — current behavior is documented in the helper's JSDoc.Test style (sage feat(init): support existing task folder via --tasks-root #6, nice-to-have):
supervisor-recovery-flows.test.tsleans on source-pattern assertions for IPC ordering and concurrent-write race detection. True behavioral coverage of those races requires harness work that exceeded the in-batch budget.Validation
extensions/test:fastis the canonical quality gate. Reviewer agent rantestper TP-188's quality-checks rule (no separate typecheck/lint to run).types.ts → config-schema.ts → agent-host.ts → tool-allowlist-constants.ts.taskplane doctorshows✅ pi installed (0.73.0)(Cluster C).Branch state
feat/tp-187-189-recovery-and-polishmainSuggested release path
This is a substantial recovery + polish bundle. Suggest cutting v0.29.0 (minor bump) given:
supervisor_takeovertool (additive but observable surface change)batch-meta.jsonruntime artifact (additive on-disk surface)taskplane doctoractually shows pi version)CHANGELOG already populated under
[Unreleased]— both TP-187 and TP-189 entries were written by the workers and are operator-readable.