Skip to content

feat(TP-187 + TP-189): supervisor recovery flows + accumulated polish bundle#556

Merged
HenryLach merged 42 commits into
mainfrom
feat/tp-187-189-recovery-and-polish
May 10, 2026
Merged

feat(TP-187 + TP-189): supervisor recovery flows + accumulated polish bundle#556
HenryLach merged 42 commits into
mainfrom
feat/tp-187-189-recovery-and-polish

Conversation

@HenryLach
Copy link
Copy Markdown
Owner

Bundles two staged tasks from the queue. Run as a single batch because the work was independent (no shared files) but both ship in the next polish/recovery release.


TP-187 — Supervisor recovery flows (Review Level 3, full)

Three interrelated supervisor-recovery UX bugs that activate when a batch fails (especially via the death-spiral that TP-186 addressed upstream).

Sub-fixes

#538 — Zombie alerts after lane termination

The supervisor used to receive 3–5 buffered "wants to exit" alerts AFTER a lane was killed; none of the documented operator responses reliably drained them. Now:

  • New drainAgentOutbox(stateRoot, batchId, agentId) helper in mailbox.ts synchronously moves pending *.msg.json files to outbox/processed/ at lane-termination decision points (no-progress kill in lane-runner.ts, hard-fail in engine.ts).
  • New supervisor-side terminated-lane filter (extension.ts): terminatedLanes + terminatedAgents maps drop alerts targeting already-killed lanes before they reach pi.sendUserMessage.
  • New lane-terminated / lane-respawned IPC messages from engine-worker → supervisor.
  • Lifecycle: lane re-spawn unmutes; orch_resume clears; new batchId clears; supervisor_takeover mutes all active lanes.
  • New supervisor_takeover(reason) tool in extension.ts (NOT in agent-bridge-extension.ts — supervisor-only): pause + drain + park, preserves worktrees/state (does NOT call executeAbort or deleteBatchState).

#539 — Reconstruct batch state from disk on orch_resume(force=true)

Previously orch_abort deleted .pi/batch-state.json so subsequent orch_resume(force=true) had nothing to resume. Now:

  • New batch-meta.json runtime artifact written at batch-start (persistence.ts) captures wavePlan, baseBranch, orchBranch, mode, startedAt, totalWaves.
  • New reconstructBatchStateFromRuntime(stateRoot) in persistence.ts rebuilds a validatePersistedState-compliant state from batch-meta.json + worker manifests + lane snapshots.
  • resume.ts calls reconstruction when loadBatchState returns null AND force === true.
  • Fail-loud: if any required artifact is missing, returns a clear "no resumable state" error instead of silently inventing state.
  • Reconciliation pass re-detects .DONE markers — succeeded counts are conservative (0) at reconstruction time.

#540 — Surface worker exit reasons

Workers must now provide a non-empty reason on exit-no-progress; the reason flows through to the supervisor alert as "Worker said: ..." with fallback parsing from events.jsonl if the prompt-level guidance was missed.

Files modified (TP-187)

  • extensions/taskplane/mailbox.ts (+83) — drainAgentOutbox helper
  • extensions/taskplane/persistence.ts (+353) — batch-meta.json artifact + reconstruction logic
  • extensions/taskplane/resume.ts (+62) — wire reconstruction into resumeOrchBatch
  • extensions/taskplane/engine.ts (+79) — lane-terminated emits, drain hooks
  • extensions/taskplane/lane-runner.ts (+77) — drain at no-progress kill
  • extensions/taskplane/extension.ts (+237) — supervisor-side filter, supervisor_takeover tool
  • extensions/taskplane/agent-bridge-extension.ts (+61), engine-worker.ts (+27), messages.ts (+19), types.ts (+39) — IPC plumbing
  • templates/agents/supervisor.md (+61) — supervisor_takeover semantics + text-reply parser docs

Tests (TP-187)

  • NEW extensions/tests/supervisor-recovery-flows.test.ts (+754) — covers all three sub-fixes
  • 7 in-batch reviews (4 plan + 3 code) all returned APPROVE
  • Suite: 3579 passing / 1 skipped / 0 failed (was 3496 pre-batch, +83 new tests across both tasks)

TP-189 — Accumulated polish bundle (Review Level 2)

Eight small polish items deferred from TP-181/184/185/186 sage reviews + TP-188's sage follow-up. None blocked their respective releases.

Cluster A — Defensive tests + helper hardening

  1. NEW lane-runner-spawn-wiring.test.ts (+114) — static assertion that buildWorkerToolsAllowlist(config.workerTools) is called at the worker spawn site (TP-184 follow-up).
  2. NEW review-step-guard-runtime.test.ts (+354) — runtime test of the review_step guard's REFUSED path (5 cases: code+test blocked, plan not blocked, type='test' refused, REFUSED text matches Recovery Recipe wording) (TP-186 follow-up).
  3. isStepMarkedComplete now skips ``` and ~~~ fenced code blocks with proper CommonMark length-≥-opener semantics (TP-186 follow-up). 4 new test cases in worker-step-completion-protocol.test.ts.
  4. NEW windows-worktree-cleanup-behavioral.test.ts (+313) — 3 behavioral decision-branch tests for removeWorktree() Windows fallback (sage TP-188 follow-up): MAX_PATH+win32 → fallback fires; non-MAX_PATH+win32 → fallback skipped; non-win32+MAX_PATH text → platform guard skips.

Cluster B — Constants module migration (TP-184 follow-up)

  • NEW extensions/taskplane/tool-allowlist-constants.ts (38 lines) — single source of truth for DEFAULT_WORKER_USER_TOOLS, no imports beyond TS built-ins.
  • agent-host.ts re-exports for back-compat; config-schema.ts and types.ts import directly. No circular imports (verified via probe).

Cluster C — taskplane doctor shows pi version (TP-185 follow-up)

  • NEW bin/get-version.mjs — testable ESM helper, captures stdout AND stderr (pi prints version to stderr), gates on non-zero exit, returns null for nonexistent commands.
  • NEW cli-doctor-version-capture.test.ts (+98) — 7 behavioral tests using real node -e subprocesses.
  • Manual: taskplane doctor now shows ✅ pi installed (0.73.0) (was empty parens).

Cluster D — CI Node 24 alignment ✅ (already shipped in v0.28.8 commit 96a457f5)

No work. Preserved in TP-189's PROMPT/STATUS for traceability.

Cluster E — Worker prompt + skill reconciliation (TP-186 follow-up)

  • templates/agents/task-worker.md (+64): Resume Algorithm step 6 split into Level 0/1 (proceed) vs Level 2/3 (commit + code review + APPROVE-gated status flip); Checkpoint Discipline / Git commits example rewritten.
  • skills/create-taskplane-task/SKILL.md (+37): new Per-Step vs. Consolidated Reviews sub-section in the Complexity Assessment, with TP-186 referenced as the canonical consolidation example.

Tests (TP-189)

  • 12 in-batch reviews (5 plan + 7 code) all returned APPROVE
  • Counted in the +83 net delta above (28 from this task, 55 from TP-187)

Sage post-integration code review

Ran sage focus-review on TP-187 (the high-stakes L3 task). Sage rated medium-high confidence with a mix of findings. 2 folded, 4 deferred:

Folded into commit b8072a6

  1. batchId gating on lane-terminated / lane-respawned IPC (sage chore(deps-dev): bump vitest from 4.0.18 to 4.1.0 in /extensions #3, important).
    Supervisor's terminated-lane filter now gates IPC handlers on batchState.batchId. Stale messages from a prior batch can no longer taint the live batch's suppression map. Back-compat preserved (legacy IPC without batchId accepted; first batch before any state-sync accepted).

  2. Refuse reconstruction for multi-repo batches (sage's "blocker" reframed, important).
    The reconstruction loop hardcodes segments: [] because per-segment topology lives only in the deleted batch-state.json. For single-repo batches (the common case + Taskplane's own self-orchestration) this is correct. For multi-repo with segment expansion, hardcoding segments: [] would silently lose state and could re-execute done segments OR fail cross-repo dependency checks. Added fail-loud guard via worker-manifest repoId distinct-count: >1 → refuse with clear operator pointer to restore from backup or start a new batch.

Deferred (documented for follow-up — none affect Taskplane's own self-orchestration)

  1. Drain race tolerance (sage chore(deps): bump actions/checkout from 4 to 6 #1, important): drainAgentOutbox snapshots readdirSync then renames. A worker writing a new .msg.json between the snapshot and the lane-termination IPC will leak past the drain. Mitigation: bounded re-scan loop (5–10 LOC). Not folded because the supervisor-side filter already drops zombie alerts on the receiving end, and the leak is rare (worker has been told to exit before drain fires).

  2. Suppression ordering for no-progress kill path (sage chore(deps): bump actions/setup-node from 4 to 6 #2, important): in lane-runner.ts, the lane-terminated IPC is emitted before the engine has a chance to send any canonical task-failure alert. If the engine ever adds a task-failure alert for the no-progress path, that alert would be incorrectly suppressed. Verified: engine doesn't currently emit a separate task-failure alert for the no-progress path (lane-runner reports termination via STATUS.md and the result return). Guarding against future regressions is a follow-up.

  3. Exit reason enforcement (sage chore(release): prepare v0.1.12 #4, important): 'Worker said:' empty in early no-progress exit alerts — supervisor lacks signal for early intervention #540 surfaces exit reasons via prompt-level guidance + events.jsonl fallback parsing, but doesn't strictly enforce non-empty reasons (no server-side re-prompt machinery). Adding strict enforcement requires a feedback loop with the worker that's bigger than this PR's scope.

  4. Multi-candidate reconstruction selection (sage chore(docs): remove redundant .gitkeep placeholders #5, important): if the newest batch's batch-meta.json is corrupt, reconstruction falls through to the next-newest. Deterministic but potentially surprising. Mitigation candidates: require explicit batchId argument to orch_resume, or refuse when multiple candidates exist. Follow-up — current behavior is documented in the helper's JSDoc.

  5. Test style (sage feat(init): support existing task folder via --tasks-root #6, nice-to-have): supervisor-recovery-flows.test.ts leans on source-pattern assertions for IPC ordering and concurrent-write race detection. True behavioral coverage of those races requires harness work that exceeded the in-batch budget.


Validation

  • Tests: 3579 passing / 1 skipped / 0 failed on Node 24.15.0 (Windows local). +83 over the 3496 baseline (28 from TP-189, 55 from TP-187).
  • In-batch reviews: 7 (TP-187, all APPROVE) + 12 (TP-189, all APPROVE) = 19 reviews fired during execution.
  • No typecheck/lint scripts in this project — extensions/test:fast is the canonical quality gate. Reviewer agent ran test per TP-188's quality-checks rule (no separate typecheck/lint to run).
  • No circular imports: verified via dynamic-import probe for types.ts → config-schema.ts → agent-host.ts → tool-allowlist-constants.ts.
  • CLI smoke: taskplane doctor shows ✅ pi installed (0.73.0) (Cluster C).

Branch state

  • Branch: feat/tp-187-189-recovery-and-polish
  • Base: main
  • Commits: 1 setup (Dependencies parser fix) + 34 from orch merge + 1 sage fold = 36 ahead
  • Test count target for next release CHANGELOG: 3579 passing

Suggested release path

This is a substantial recovery + polish bundle. Suggest cutting v0.29.0 (minor bump) given:

  • New supervisor_takeover tool (additive but observable surface change)
  • batch-meta.json runtime artifact (additive on-disk surface)
  • New CLI behavior (taskplane doctor actually shows pi version)
  • Refined worker/supervisor prompt templates (behavior change for new tasks)

CHANGELOG already populated under [Unreleased] — both TP-187 and TP-189 entries were written by the workers and are operator-readable.

HenryLach added 30 commits May 6, 2026 23:02
…cks /orch)

Discovery's Dependencies parser greedily extracts every bulleted TP-XXX
reference and treats it as a runtime dependency. TP-189's previous
Dependencies section listed TP-184/185/186/188 as informational
'shipped — already merged' notes via bullets, but the parser couldn't
distinguish prose from declaration:

  > [DEP_UNRESOLVED] TP-189 depends on TP-185 which does not exist
  > in any task area

TP-185's task folder was archived after the v0.28.5 release (the
bullet's prose 'shipped in v0.28.5' was meaningful only to humans).
Same trap could fire later for TP-184/186/188 if those folders are
ever archived too.

Fix: lead the section with '**None**' (the parser's None-detection
short-circuits dep extraction entirely when present), then keep the
historical cross-references as un-bulleted prose under the disclaimer.

Verified via discovery probe: 0 errors, 2 pending tasks (TP-187 + TP-189)
correctly identified before this commit landed.

No PROMPT.md content above the --- divider was changed in semantics —
all the cross-reference info is preserved, just reformatted.
…tepMarkedComplete fence filter

- A1: lane-runner-spawn-wiring.test.ts (NEW, 4 tests) static assertion that
  lane-runner.ts wires buildWorkerToolsAllowlist(config.workerTools) at the
  worker spawn site, with regression guards against passing config.workerTools
  directly. Patterned on lane-runner-v2.test.ts but with wider tolerance.
- A2: review-step-guard-runtime.test.ts (NEW, 4 tests) end-to-end runtime
  test of the TP-186 death-spiral guard's REFUSED path — uses bare-specifier
  child_process mock for Node 22/24 portability, asserts no spawn + no
  counter increment + correct refusal text on REFUSED, and confirms
  type='plan' and type='code'-on-In-Progress are NOT refused.
- A3: isStepMarkedComplete in agent-bridge-extension.ts now skips fenced
  code blocks (``` and ~~~). Added 4 test cases (2.8–2.11) to
  worker-step-completion-protocol.test.ts covering both fence styles, a
  regression guard for the over-match case, and unclosed-fence isolation
  across calls.
- A4: windows-worktree-cleanup-behavioral.test.ts (NEW, 3 tests) — sage
  TP-188 follow-up. Behaviorally exercises removeWorktree()'s Windows
  MAX_PATH fallback decision branches via a single child_process mock that
  dispatches on cmd (git vs cmd) plus a real on-disk temp dir. Covers
  win32+MAX_PATH (fallback fires), win32+non-MAX_PATH (fallback skipped),
  and non-win32+MAX_PATH text (platform guard skips fallback).
…dule

- New extensions/taskplane/tool-allowlist-constants.ts: import-free leaf
  module that is the single source of truth for the worker user-tools
  default literal. Comment explicitly forbids future imports.
- agent-host.ts: now re-exports DEFAULT_WORKER_USER_TOOLS from the new
  module (via export { ... } from + a local import for in-file use). No
  behavior change for existing callers (execution.ts,
  worker-tools-allowlist.test.ts) which import from agent-host.ts.
- config-schema.ts: replaced 2 literals (worker.tools default and
  merge.tools default) with a direct import. Obsolete TP-184 NOTE
  comments replaced with TP-189 comments.
- types.ts: replaced 1 literal (merge.tools default) with a direct
  import. The comment about circularity-with-agent-host is no longer
  applicable because the constant now lives in a leaf module.
- Verified no circular imports via a Node import probe; targeted tests
  pass (worker-tools-allowlist, lane-runner-spawn-wiring).
… + SKILL.md per-step vs consolidated review pattern

- task-worker.md: two surgical edits to align older sections with the
  TP-186 Order of Operations rule.
  1. Resume Algorithm step 6: "all items checked → proceed" now splits
     by Review Level (0/1 proceeds, 2/3 must commit, code-review, then
     APPROVE-gates the per-step Status flip). Cross-references Order of
     Operations explicitly.
  2. Checkpoint Discipline / Git commits: example commit message changed
     from "complete Step N — description" to "step N implementation"
     (matches the Order of Operations example). Added explicit
     paragraphs for Level 0/1 (commit completes the step) vs Level 2/3
     (commit is implementation-only, with a separate post-APPROVE
     "step N complete (code review APPROVE)" commit).
- SKILL.md: new sub-section "Per-Step Reviews vs. Consolidated Reviews
  (Checkpoint Markers)" in the Complexity Assessment chapter documents
  the per-step default vs the opt-in consolidation pattern, with TP-186
  as the canonical consolidation example. Helps PROMPT authors make the
  choice deliberately.
Adds [Unreleased] CHANGELOG section with all 8 polish items grouped:

- Fixed (user-visible):
  - TP-189-C taskplane doctor pi version (extract to bin/get-version.mjs,
    spawnSync with stdout-precedence + stderr fallback + null-on-failure)
  - TP-189-A3 isStepMarkedComplete CommonMark-aware fence tracking

- Docs (user-visible):
  - TP-189-E task-worker.md reconciliation (Resume Algorithm step 6
    Review-Level-aware split + Checkpoint Discipline commit message
    rewrite to match Order of Operations example)
  - SKILL.md Per-Step vs Consolidated Reviews sub-section

- Internal:
  - TP-189-B constants module migration (DEFAULT_WORKER_USER_TOOLS
    relocated to import-free leaf module; agent-host re-exports)
  - TP-189-A1 lane-runner spawn-wiring regression guard
  - TP-189-A2 review_step REFUSED-path runtime test
  - TP-189-A4 removeWorktree Windows fallback behavioral tests

Cluster D (CI Node 24) was already shipped in v0.28.8 — no CHANGELOG
entry needed for TP-189.
HenryLach added 12 commits May 7, 2026 00:12
# Conflicts:
#	CHANGELOG.md
…+ multi-repo reconstruction guard)

Two defensive fixes from sage's post-merge code review of the TP-187/189
batch. Both classified 'important' (not blockers for the Taskplane repo's
own self-orchestration, but real correctness improvements for the broader
recovery flow).

1. **batchId gating on lane-terminated / lane-respawned IPC** (sage #3,
   extension.ts).

   The supervisor's terminated-lane filter previously trusted any
   incoming lane-terminated/lane-respawned IPC without checking which
   batch it belonged to. A stale message from a prior batch (engine-worker
   not yet shut down, or out-of-order delivery on the IPC channel) could
   taint the live batch's suppression map — either suppressing legitimate
   alerts (false positive) or unmuting alerts that should be filtered
   (false negative).

   Added `ipcBatchIdMatches(incomingBatchId)` helper that gates on the
   supervisor's tracked `batchState.batchId`. Mismatches are logged to
   stderr (operator-visible diagnostic) and the IPC is dropped.
   Back-compat: if either side has no batchId yet (first batch, or
   pre-update IPC sender), accept the message — preserves existing
   behavior.

   Applied to BOTH supervisor handler registration sites (execute path
   and resume path).

2. **Refuse reconstruction for multi-repo batches** (sage's blocker
   reframed, persistence.ts).

   The TP-187 #539 reconstruction loop hardcodes `segments: []` because
   per-segment topology lives only in the deleted batch-state.json. For
   single-repo batches (the common case, including Taskplane's own
   self-orchestration), `segments` is always empty so the hardcode is
   correct. For multi-repo batches with segment expansion, hardcoding
   `segments: []` would silently lose the expansion state and could
   re-execute already-done segments OR fail dependency checks for
   cross-repo waves.

   Added a fail-loud guard that detects the multi-repo case via worker
   manifest repoIds: if >1 distinct repoId is present, append a clear
   failure to the reconstruction failure list and continue to the next
   candidate (or surface the explicit refusal in the aggregated error
   message). Operators get a concrete pointer ('Restore
   .pi/batch-state.json from backup or start a new batch') instead of
   silently-broken state.

3. Other sage findings (drain race, suppression ordering for no-progress
   path, exit-reason enforcement, candidate selection, behavioral test
   gaps) are documented in the PR body for follow-up but not folded into
   this change. Rationale: they're either deferral-worthy (test style
   improvements, exit-reason enforcement that requires server-side
   re-prompt machinery) or low-impact-for-Taskplane-itself.

Tests: 3579 passing / 1 skipped / 0 failed (no regressions).
No new tests added in this commit — the in-batch suite already covers
both code paths via behavioral tests; these are surgical defensive
guards layered on top.
…rtability

CI run 25476347616 on Linux Node 24 failed two tests in
review-step-guard-runtime.test.ts:

  ✖ type='plan' on a step marked Complete → NOT refused
    AssertionError: should have called spawn at least once, got 0
  ✖ type='code' on a step still 🟨 In Progress → NOT refused
    AssertionError: should have called spawn at least once, got 0

Both pass on Windows local (Node 24.15.0). The over-specific
assertion (spawnCallCount >= 1) intended to prove the death-spiral
guard let the call through to the spawn pathway, but
mock.module("child_process", ...) interception of spawn (async)
appears to behave differently on Linux Node 24 than on Windows.
The equivalent execFileSync (sync) mock pattern in
windows-worktree-cleanup-fallback.test.ts works fine on both
platforms — the divergence is specific to spawn vs execFileSync.

Fix: remove the spawnCallCount >= 1 assertions for the two
'NOT refused' cases. The negative assertion
`!text.startsWith("REFUSED")` ALREADY proves the guard didn't
refuse — that's the actual behavior under test. We don't need
additional proof of which downstream path the handler took once
the guard let it through.

The REFUSED-path tests (type='code' and type='test' on Complete
steps) keep their `spawnCallCount === 0` assertions because
proving "spawn was NOT called" is the core thing those tests
verify, and that assertion works portably (no mocked function is
invoked on the REFUSED path).

CI portability follow-up: re-introduce richer spawn-count
assertions once the underlying mock-portability issue is resolved,
likely via dependency injection (pass spawn as a parameter) rather
than module-level mocking.

Local fast suite: 3579 passing / 1 skipped / 0 failed (unchanged).
Closing the loop on supervisor priming for the new supervisor_takeover
tool. The tool was already documented in the Available Orchestrator
Tools section (with full description + 4 promptGuidelines) added by
TP-187, but the 'When to Use These Tools' decision rubric below it
still steered stuck/failing batches directly to orch_abort:

  - Batch is stuck or failing repeatedly → call orch_status() to
    diagnose, then orch_abort() if needed

That bullet was written before TP-187 added the non-destructive
escape hatch. Updated to reinforce supervisor_takeover as the
preferred first move and to clarify that orch_abort is destructive
and not reversible:

  - Batch is stuck, producing alert spam, or hitting a death-spiral
    → call orch_status() to diagnose, then **prefer
    supervisor_takeover(reason)** to park non-destructively
    (worktrees + state preserved; resume with orch_resume(force=true)
    afterward). Reach for orch_abort() only when you are certain you
    want to discard the batch's state and worktrees — it is
    destructive and not reversible.

Pure prompt-text change in the canonical supervisor template. No
code paths affected; no test changes required. Lowers the activation
threshold for the new tool when the supervisor faces the exact
scenarios it was designed for.
… crashes orchestrator on first IPC

CRITICAL regression introduced by my own commit `b8072a6` (TP-187 sage
fold). Reported in #559 with full diagnosis and exact patch.

## Symptom

Every batch crashes the orchestrator parent process the moment the
engine-worker emits its first IPC frame:

  extensions/taskplane/extension.ts:1718
      const currentBatchId = batchState.batchId;
                             ^
  ReferenceError: batchState is not defined
      at ipcBatchIdMatches (...)

The batch state file is left in phase=executing with 0 progress; git
worktrees and task/... branches are orphaned. Affects all batches that
get as far as Wave 1 lane allocation (effectively every batch).

## Root cause

When I added `ipcBatchIdMatches` and the four stderr-template
references in commit b8072a6 to fold sage's batchId-gating finding, I
referenced `batchState.batchId` 5 times. `batchState` is NOT bound
in the supervisor IPC closure — only `supervisorState` is (declared
on line 1680 alongside `terminatedLanes` / `terminatedAgents`).
Other regions of extension.ts legitimately bind a different
`batchState` via destructuring (e.g.,
`const batchState = lockResult.batchState;`) and use it correctly,
but those bindings are not visible in the IPC closure.

## Why it slipped through

1. `node --experimental-strip-types` performs no name-resolution
   checks — it strips types and runs the rest. Runtime ReferenceError
   only surfaces when the affected code path executes.
2. The TP-187 in-batch tests
   (extensions/tests/supervisor-recovery-flows.test.ts) mock IPC
   handlers at a different layer (engine-worker → supervisor callbacks
   via executeOrchBatch's deps parameter), bypassing the actual
   extension closure under fault.
3. CI tests passed because they don't fire real engine-worker IPC
   through the live extension closure.
4. CLI smoke checks (taskplane help/doctor/init/uninstall) don't
   construct the closure at all.

## Fix

Replace all 5 `batchState.batchId` references inside the IPC closure
with `supervisorState.batchId`:

  - extension.ts:1718 (`ipcBatchIdMatches` helper, the actual crash site)
  - extension.ts:2303 (execute-flow lane-terminated stderr)
  - extension.ts:2318 (execute-flow lane-respawned stderr)
  - extension.ts:2679 (resume-flow lane-terminated stderr)
  - extension.ts:2693 (resume-flow lane-respawned stderr)

`SupervisorState.batchId` is a string ("" when supervisor inactive,
the live batch ID otherwise — see supervisor.ts:2634+). The existing
`if (!currentBatchId) return true` check preserves pre-planning
accept-all semantics: an empty string is falsy, so all IPC during the
pre-planning window is accepted as before.

## Regression test

NEW `extensions/tests/extension-ipc-batchid-scope.test.ts` (+4 tests,
~125 LOC). Asserts via source-pattern that the supervisor IPC closure
region (`let supervisorState = ...` through the second
'Activate supervisor agent' marker) contains zero references to
`batchState.batchId` and at least one reference to
`supervisorState.batchId`. Comments are stripped before the check
so documentation-of-the-bug doesn't trigger false positives.

This test is source-pattern (the closure variable isn't directly
testable behaviorally without DI), but it locks down the scope-correct
identifier explicitly and would have caught the regression in CI.

A behavioral test that constructs the closure and fires a synthetic
IPC remains a worthwhile follow-up, likely via dependency injection
(pass batchId-getter as a parameter rather than closure-capturing).

## Validation

- Local fast suite: 3583 passing / 1 skipped / 0 failed (was 3579 pre-fix, +4 new regression tests)
- Re-running the failing test before the fix:
    ✗ does NOT reference batchState.batchId — Found 5 occurrence(s)
- After the fix:
    ✓ all 4 tests pass

Closes #559
…t fix)

Sage code-review of commit `4a8aea6` flagged that swapping
`batchState.batchId` → `supervisorState.batchId` removes the crash but
leaves the gate semantically broken:

> `supervisorState.batchId` is set once in `activateSupervisor()` from
> `batchState.batchId`. In /orch and /orch-resume, activation happens
> before worker state-sync populates a real batch ID, so it is usually
> `""`. I do not see a later write that keeps `supervisorState.batchId`
> in sync during normal start/resume flow.

Result: for batches where the supervisor never activates (the common
case for non-failing batches), `supervisorState.batchId` stays empty
and `ipcBatchIdMatches` falls into the `if (!currentBatchId) return
true` branch \u2014 the gate effectively never fires. Zombie-alert filter
defeated.

## Fix

Swap to `orchBatchState.batchId`:

  - Declared on extension.ts:1669, just above `supervisorState` (line
    1680). Same closure scope as `ipcBatchIdMatches` (line 1717).
  - Populated reliably via state-sync IPC: the extension passes
    `orchBatchState` BY REFERENCE to `startBatchInWorker` (line 2206),
    where it's named `batchState` (parameter). The state-sync handler
    at line 1194 calls `applySerializedState(batchState, msg.state)`
    which mutates the same underlying `orchBatchState` object via the
    shared reference.
  - The `if (!currentBatchId) return true` accept-all branch now
    correctly covers ONLY the brief pre-state-sync window (between
    batch launch and first state-sync IPC \u2014 the legitimate first-IPC
    window where filtering would be wrong).

5 references updated:
  - extension.ts:1717 (`ipcBatchIdMatches` helper)
  - extension.ts:2303, 2318 (execute-flow stderr templates)
  - extension.ts:2679, 2693 (resume-flow stderr templates)

Comment block above `ipcBatchIdMatches` rewritten to document the
binding choice and the three candidates considered.

## Test updates

`extension-ipc-batchid-scope.test.ts`:
  - Renamed assertion: `ipcBatchIdMatches` body uses
    `orchBatchState.batchId` (was `supervisorState.batchId`).
  - NEW assertion: closure contains at least one `orchBatchState.batchId`
    reference (would catch a future revert to `supervisorState`).
  - Removed obsolete "must reference supervisorState.batchId" assertion
    (no longer the canonical binding).
  - Comment block at top of file updated with the sage post-mortem
    rationale so future readers understand why `orchBatchState`
    specifically.

## Validation

- 4/4 regression tests pass (re-targeted to `orchBatchState.batchId`)
- Full fast suite: 3583 passing / 1 skipped / 0 failed (unchanged)

## Sage findings deferred

Sage flagged 3 additional items that are NOT blocking the polyrepo
re-test:

1. **Stricter unknown-current behavior** (Important): if
   `!currentBatchId && incomingBatchId`, sage suggests dropping or
   loud-logging instead of unconditional accept. Kept current behavior
   because the empty-current window is now correctly bounded to the
   pre-state-sync window where filtering would be wrong. Worth
   revisiting if zombie alerts ever fire in that window.

2. **Closure-region anchoring brittleness** (Nice-to-have): the test's
   region detection uses two text markers that could silently mis-scope
   on refactor. Long-term hardening: switch to AST-based scope
   detection or refactor `ipcBatchIdMatches` to a top-level function
   accepting batchId-getter via DI.

3. **Add blocking `tsc --noEmit` CI step** (Important): would have
   caught the original `batchState` typo at PR-time as
   `error TS2304: Cannot find name 'batchState'`. Larger scope \u2014
   needs tsconfig + may surface unrelated existing issues. Tracking
   as a TP-189-style polish item for a future bundle.

These are documented for follow-up but out of scope for the polyrepo
re-test unblocking.

Closes #559
…+ @mariozechner)

CRITICAL: Runtime V2 cannot spawn ANY worker / reviewer / merge agent on
systems where Pi was installed (or upgraded into) the new
`@earendil-works` scope. Reported in #560 with full diagnosis.

## Symptom (verbatim from the issue)

  [orch] api-service/lane-1/TP-002: Runtime V2 execution error:
    Cannot find Pi CLI entrypoint
    (@mariozechner/pi-coding-agent/dist/cli.js).
    Ensure the pi coding agent is installed globally via
    'npm install -g @mariozechner/pi-coding-agent'.
    npm root -g returned: C:\...\node_modules

Reproduces deterministically on any system whose only globally-installed
Pi is `@earendil-works/pi-coding-agent` (the canonical scope as of Pi
v0.74.0).

## Why this is the ONLY bug

The Pi coding agent's own extension loader
(`<pi>/dist/core/extensions/loader.js` lines 41-45) explicitly aliases
BOTH scopes to the same bundled module exports:

    "@mariozechner/pi-agent-core": _bundledPiAgentCore,
    "@mariozechner/pi-tui": _bundledPiTui,
    "@mariozechner/pi-ai": _bundledPiAi,
    "@mariozechner/pi-ai/oauth": _bundledPiAiOauth,
    "@mariozechner/pi-coding-agent": _bundledPiCodingAgent,

So all of taskplane's `import { Type } from "@mariozechner/pi-ai"` and
`import type { ExtensionAPI } from "@mariozechner/pi-coding-agent"`
sites continue to resolve at runtime regardless of which scope was
actually installed \u2014 Pi handles the rename transparently for in-process
module loading.

The only break is `path-resolver.ts:resolvePiCliPath()`. That function
walks `npm root -g` looking for `<scope>/pi-coding-agent/dist/cli.js`
to spawn Pi as a CHILD PROCESS for workers/reviewers/mergers. Pi's
in-process aliasing can't help with disk-side lookup. With `@mariozechner`
gone from disk, every spawn errors immediately.

That sharply narrows the fix surface from "~10 files referencing the old
scope" (per the issue body's grep) to "path-resolver.ts + the
user-facing install-hint string in worktree.ts."

## Fix

`path-resolver.ts`:
- Define `PI_PACKAGE_SCOPES = ["@earendil-works", "@mariozechner"]`
  (current first, legacy fallback).
- Refactor `resolvePiCliPath()` to walk the cross product of base
  directories \u00d7 scopes. Inner-loop scope so each base is fully exhausted
  (new, then legacy) before falling back to the next base. Matches
  operator intuition: "most likely install location, either scope."
- Update the throw message to name BOTH scopes and recommend the new
  install command (`npm install -g @earendil-works/pi-coding-agent`).
- Updated comments throughout to document the scope-tolerant behavior.

`worktree.ts:1904` (preflight "pi not found" hint):
- Recommend `@earendil-works` for new installs; mention legacy scope
  parenthetically so anyone with an existing `@mariozechner` install
  knows it's still supported.

## Backward compat preserved

Per the user's explicit request: "remain backwards compatible with older
versions of pi from when it was hosted on @mariozechner/pi-coding-agent
but also now support it being hosted at @earendil-works/pi-coding-agent."

- Systems with ONLY `@earendil-works` installed: now work (was broken).
- Systems with ONLY `@mariozechner` installed: continue to work.
- Systems with BOTH installed (e.g., during a transition window):
  `@earendil-works` is preferred. Legacy install can be pruned later.

The other ~10 files that reference `@mariozechner` (TypeScript imports,
test mocks, peerDependencies, docs) are deliberately left as-is in this
commit because:
  1. Pi's bundled-alias handling makes them work at runtime regardless
     of scope.
  2. Updating them risks breaking backward compat for users on Pi
     versions older than v0.74.0 (which don't have the alias map).
  3. This commit's surgical scope is unblocking the crash with the
     minimum possible change. Cosmetic alignment with the new scope can
     ship as a separate cleanup PR later.

## Tests

NEW `extensions/tests/path-resolver-pi-scope.test.ts` (+200 LOC):

Strategy: each test creates a temp directory laid out like an npm
global root (`<tmp>/<scope>/pi-coding-agent/dist/cli.js`), points
`getNpmGlobalRoot()` at it via `npm_config_prefix` env var, and
inspects the resolved path via a child-process probe (`getNpmGlobalRoot()`
caches its result module-level on first call \u2014 child probe avoids
in-process state pollution).

4 cases:
  - 8.4: only `@earendil-works` installed \u2192 resolves there
  - 8.5: only `@mariozechner` installed \u2192 resolves there (back-compat)
  - 8.6: both installed \u2192 resolves to `@earendil-works` (new-first)
  - 8.7: neither installed \u2192 error message names both scopes

All 4 pass on Windows local Node 24.15.0.

## Validation

- 4/4 new regression tests pass (~2s total)
- Full fast suite: 3587 passing / 1 skipped / 0 failed (was 3583 pre-fix, +4)

## Issue #561 (related, NOT addressed in this commit)

The reporter notes that #561 is what made #560 silently fatal: when
`executeLaneV2` throws on spawn, the lane is NOT transitioned to
`failed`, no `task-failure` IPC fires, and `orch_status()` reports
"running" indefinitely. So this fix solves the cause; the visibility
problem is a separate bug.

Treating #561 as a separate small follow-up rather than bundling here
because:
  1. The user's explicit request was "fix #560 with backward compat."
  2. #561 is architectural (touches engine-worker error handling, lane
     state machine, IPC alert pipeline, possibly retry policy).
  3. #561 deserves its own focused review and tests \u2014 not a piggyback.

Closes #560
Stages the next-batch task to fix issue #561 — when a Runtime V2 lane
spawn fails (e.g., Pi CLI not findable, worktree provisioning error,
branch collision), the lane is NOT transitioned to failed. The engine
continues polling, the dashboard shows green/running lanes that have
no actual worker process, orch_status reports running, no supervisor
alert fires. Recovery requires the operator to manually tail engine-
worker stderr.

This bug masked the impact of #559 (orchestrator crash on first IPC)
and #560 (@earendil-works rename), making both look like 'the
orchestrator is hung' rather than 'spawn errored 3× immediately.'
Fixing TP-190 converts every future spawn-stage breakage into a
visible, actionable failure instead of a silent hang.

## Scoring

Score 3/8 → Review Level 2 (Plan + Code), Size M.
- Blast radius: 1 (engine error handling, single failure mode)
- Pattern novelty: 1 (extends existing task-failure alert pipeline)
- Security: 0
- Reversibility: 1 (state-transition wiring; must not break existing
  retry/recovery; new exit category becomes part of the public contract)

## Scope

Four parts (per the issue's suggested fix outline):

1. State-transition + IPC alert: lane.status='failed',
   task.status='failed', exitReason='spawn failure: <message>',
   exitCategory='spawn-failure' (new), task-failure IPC alert.
2. No-retry policy for spawn-failure category (not transient).
3. Phase transition: when all lanes in a wave fail to spawn,
   batchState.phase='failed' (not 'executing').
4. Behavioral regression test: mock spawn helper to throw, assert
   lane/task transitions, IPC alert fires, no retries.

## Critical scope hint baked into the PROMPT

The Notes section in STATUS.md flags an observation that should shape
Step 1's plan: the existing per-task try/catch in executeLaneV2
(execution.ts line ~2724) IS already producing the 'failed' outcome
with the spawn error message — the user's stderr log shows
"Runtime V2 execution error: ..." lines coming from THAT catch. So
the bug is NOT 'no try/catch around spawn'; it's 'the lane state
machine and monitor don't propagate the failed outcome upward into
operator-visible state.' The Do NOT list reinforces: don't add a new
try/catch in executeWave if the existing catch already produces the
failed outcome. The fix is likely entirely in the consumer of those
outcomes.

## Files staged

- taskplane-tasks/TP-190-runtime-v2-spawn-failure-visibility/PROMPT.md (NEW)
- taskplane-tasks/TP-190-runtime-v2-spawn-failure-visibility/STATUS.md (NEW)
- taskplane-tasks/CONTEXT.md (Next Task ID: TP-190 → TP-191)

Discovery probe: 0 errors, 1 pending task (TP-190), 0 deps. Ready for
/orch when the operator queues the next batch.
@HenryLach HenryLach merged commit 1288ebc into main May 10, 2026
1 check passed
@HenryLach HenryLach deleted the feat/tp-187-189-recovery-and-polish branch May 10, 2026 03:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant