Skip to content

fix: audit reliability follow-up — ppid walk, auditStatus flag, subclaude hook isolation#7

Merged
George-iam merged 10 commits intomainfrom
feat/audit-reliability-fixes-20260405
Apr 5, 2026
Merged

fix: audit reliability follow-up — ppid walk, auditStatus flag, subclaude hook isolation#7
George-iam merged 10 commits intomainfrom
feat/audit-reliability-fixes-20260405

Conversation

@George-iam
Copy link
Copy Markdown
Contributor

Summary

Follow-up PR fixing 4 critical bugs surfaced during PR#6 E2E verification, plus a pre-existing force-push detection false positive.

Bugs fixed

A — /proc ppid walk for hooks. Claude Code wraps hook commands via sh -c 'axme-code hook ...', so process.ppid inside a hook subprocess is the (usually already-dead) sh wrapper PID, not the Claude Code PID. Hooks were recording the wrong ownerPpid on active-sessions/ mapping files and the wrong pid on SessionMeta, so live sessions were being misclassified as orphans and prematurely audited/closed. New helper getClaudeCodePid() reads /proc/<ppid>/stat and walks one step up to return the grandparent. Used in writeClaudeSessionMapping and createSession. The MCP server still uses process.ppid directly — it is spawned by Claude Code without the sh wrapper.

C — Parallel audit race (simplified, no file locks). Two MCP servers could auto-audit the same session in parallel, doubling LLM cost. Introduce auditStatus: "pending" | "done" | "failed" on SessionMeta plus auditStartedAt/auditFinishedAt. runSessionCleanup now refuses to re-enter a session whose auditStatus=pending is fresher than 15 minutes, and also enforces the MAX_AUDIT_ATTEMPTS retry cap inline (not only via orphan cleanup). Stale pending state (>15 min) auto-releases on the next attempt, so crashes never leave sessions permanently stuck. The race window is safe because saveScopedMemories/saveScopedDecisions already dedup by slug. The entire old pending-audits/ directory API is removed — all state now lives on SessionMeta, and a startup cleanup deletes any leftover legacy files.

F — Subclaude audit loading project hooks. The session-auditor spawns a sub-Claude (via claude-agent-sdk) with cwd=sessionOrigin. Every tool call the sub-Claude made would fire the workspace's PreToolUse/PostToolUse/SessionEnd hooks and create short-lived "ghost" AXME sessions via ensureAxmeSessionForClaude. The auditor now passes env: { ...process.env, AXME_SKIP_HOOKS: \"1\" } to the SDK query, and every hook handler returns immediately when it sees that env var. Belt-and-braces with existing settingSources: [].

E — English-only output. Tighten the AUDIT_PROMPT language section to forbid transliteration of non-English words in slug and keywords (translate, never romanize). Prompt-only — no hardcoded Cyrillic/Chinese/whatever filter in the parser, so the code stays language-agnostic.

Bonus fix (found while trying to push this PR)

checkGit force-push false positive. The old check did stripped.includes(\"-f\") on the full command line, which triggered on any branch name containing -f as a substring (e.g. feat/my-fixes matched the -f inside -fixes). Switched to token-level matching that recognizes -f, --force, --force-with-lease, --force=..., --force-with-lease=..., and the +refspec form — but not substrings inside branch names. 14 unit cases cover the regression.

Test plan

Self-verification performed in-session via tsx scripts under /tmp/ (not committed):

  • Build + typecheck green (npm run lint + npm run build)
  • Bug A: /proc walk verified — a node process invoked via sh -c returns its grandparent PID (skipping the sh wrapper)
  • Bug C: runSessionCleanup skip-logic covers concurrent-audit, retry-cap, already-audited, and stale-pending fall-through paths
  • Bug F: all three hooks (pre-tool-use, post-tool-use, session-end) exit cleanly with AXME_SKIP_HOOKS=1 and create no AXME sessions; baseline "without the env var" still creates sessions normally
  • parseAuditOutput regression: memories + ADR-style decisions + safety + handoff still parse correctly after removing the language-specific branches
  • checkGit force-push detection: 6 false-positive branch names now pass; 6 genuine force-push forms still denied; protected-branch check unchanged

Cannot self-verify (tracked for next session)

  • Full cleanupAndExit flow on MCP server transport close across two independent VS Code windows. Needs a real second window; plan is to verify manually after merge.

George-iam and others added 10 commits April 5, 2026 16:06
…aude hook isolation

Fixes 4 critical bugs surfaced by PR#6 E2E verification:

Bug A (/proc ppid walk for hooks):
  Claude Code wraps hook commands in `sh -c 'axme-code hook ...'`, so a
  raw process.ppid from a hook subprocess is the (usually dead) sh wrapper
  pid instead of the real Claude Code pid. Live sessions were being
  classified as orphans and prematurely audited/closed. Add
  getClaudeCodePid() which reads /proc/<ppid>/stat and returns the
  grandparent, and use it in writeClaudeSessionMapping (ownerPpid) and
  createSession (pid). MCP server still uses process.ppid directly — it is
  spawned by Claude Code without the sh wrapper.

Bug C (parallel audit race, simplified — no file locks):
  Two MCP servers auto-auditing the same session in parallel could both
  write memories and decisions, doubling LLM cost. Introduce a simple
  auditStatus flag on SessionMeta (pending | done | failed) plus
  auditStartedAt, auditFinishedAt. runSessionCleanup now refuses to re-enter
  a session whose auditStatus is "pending" and fresher than 15 minutes,
  and also honors the MAX_AUDIT_ATTEMPTS retry cap inline (not only in
  orphan cleanup). Stale "pending" state (>15 min) is auto-released on the
  next attempt — crashes do not leave sessions permanently stuck. The
  accepted race window is safe because saveScopedMemories/Decisions dedup
  by slug.

  Simpler than the pending-audits/ directory API from PR#6: all audit state
  now lives on SessionMeta, so one listSessions() call gives pending/done/
  failed in one place. The old pending-audits/ directory and its write/
  update/clear/read/list API are removed entirely; a legacy cleanup runs at
  MCP server startup to delete any leftover files. axme_context's "pending
  audits" warning now derives from auditStatus instead.

Bug F (subclaude audit loading project hooks):
  The session-auditor spawns a sub-Claude (via claude-agent-sdk) with
  cwd=sessionOrigin. Every tool call that sub-Claude makes would fire the
  workspace's PreToolUse/PostToolUse/SessionEnd hooks and create "ghost"
  AXME sessions (4 ms–70 s lifetime) via ensureAxmeSessionForClaude. Fix:
  the auditor passes env: { ...process.env, AXME_SKIP_HOOKS: "1" } to the
  SDK query, and every hook handler returns immediately when it sees that
  env var. Belt and braces together with settingSources: [].

Bug E (English-only output):
  Tighten the AUDIT_PROMPT language section to forbid transliteration of
  non-English words in slug and keywords (translate, never romanize). This
  is prompt-only: no hardcoded Cyrillic/Chinese/whatever filter in the
  parser, per explicit user direction — users configure per language as
  they see fit; the code stays language-agnostic.

Verification done in-session (tsx scripts under /tmp, not committed):
  - /proc walk verified: a node process invoked via `sh -c` correctly
    returns its grandparent pid (skipping the sh wrapper).
  - runSessionCleanup skip logic: concurrent-audit, retry-cap,
    already-audited, and stale-pending paths all covered.
  - All three hooks (pre-tool-use, post-tool-use, session-end) verified
    to exit cleanly with AXME_SKIP_HOOKS=1 and create no AXME sessions.
  - parseAuditOutput regression checked with the usual ADR-style
    decision blocks — still parses correctly.

Cannot be verified without a real second VS Code window (tracked for the
next session): full cleanupAndExit flow on transport close across two
independent MCP server processes.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The previous force-push check did a naive substring match for "-f" inside
the full command line, which triggered on any branch name containing "-f"
as a fragment (e.g. `git push origin feat/audit-reliability-fixes-20260405`
matched "-f" inside "-fixes"). This blocked legitimate pushes and forced
operators to rename branches or work around their own safety net.

Switch to token-level matching: split the command on whitespace and compare
each token against the known force-push flags (`-f`, `--force`,
`--force-with-lease`, `--force=...`, `--force-with-lease=...`). Also catch
the equivalent `+refspec` form (`git push origin +main`).

Verified with 14 unit cases (6 previously-broken false-positive branch
names now pass; all 6 genuine force-push forms still denied; protected-
branch check unchanged).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Observed empirically in the PR#7 working session: after a Claude Code
SIGKILL + respawn (which happens on /compact, VS Code reload, and crash
recovery), the old MCP server's cleanupAndExit sequence ran markAudited
on the old AXME session but was terminated before reaching
clearClaudeSessionMapping. This left a stale mapping file pointing to an
already-audited session.

When the respawned Claude Code's hooks then ran ensureAxmeSessionForClaude,
the function found the stale mapping, returned the audited session id,
and every subsequent trackFileChanged/attachClaudeSession write went to a
session that would never be audited again — so the entire post-restart
session was silently untracked.

Fix: after reading an existing mapping, inspect the referenced session.
If it is already audited, or if its recorded pid is dead, treat the
mapping as stale and fall through to create a fresh AXME session (which
overwrites the mapping). Log the transition to stderr so operators can
see when a stale mapping has been replaced.

Verified with 4 unit scenarios:
  1. Live mapping → reused (baseline)
  2. Mapping → audited session → fresh session created
  3. Mapping → dead-pid session → fresh session created
  4. No mapping → fresh session created (baseline)

Also verified in the PR#7 working session itself: the stale mapping for
Claude 6dfffaab → audited AXME e27eed9f was replaced by a fresh AXME
session c2eadf04 with the correct live ownerPpid, within milliseconds of
this fix being deployed.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…fset

Previously, when the same Claude session was audited more than once (user
resumes a closed conversation, SIGKILL + respawn recovery, /compact
reopen), the second audit re-parsed the full transcript jsonl from
offset 0. The in-prompt mandatory Grep-dedup prevented double storage
writes, but the LLM still burned tokens on every already-audited turn
before rejecting them. Full-day resumed sessions would roughly double
their audit cost.

Fix: persist the byte offset reached at the end of each successful
audit, per Claude session id, in .axme-code/audited-offsets/<id>.txt.
On the next audit of the same Claude session, parseTranscriptFromOffset
reads the transcript starting from that byte position, so the LLM only
sees the turns that were actually new since last audit.

Implementation details:
- transcript-parser.ts: new parseTranscriptFromOffset returning
  { turns, endOffset, bytesRead, fileSize }. Operates on Buffer (bytes),
  not String (UTF-16 code units), so offsets are true byte positions
  safe for multibyte characters (Russian, Chinese, emoji, etc). The
  existing parseTranscript(path) is kept as a zero-offset wrapper for
  dry-run and other one-shot callers.
- transcript-parser.ts: parseAndRenderTranscripts now accepts an
  optional startOffsets map keyed by Claude session id and returns
  endOffsets + bytesRead maps. Refs without an entry read from 0.
- storage/sessions.ts: new readAuditedOffset / writeAuditedOffset /
  clearAuditedOffset helpers. Offset files contain just the number as
  ASCII; atomic writes via the existing engine. Missing/corrupt file
  is treated as offset 0 ("parse from beginning").
- session-cleanup.ts: before parsing, look up the stored offset for
  each attached Claude session. After a successful audit, persist the
  new end offset. On failure the old offset stays intact, so a retry
  re-processes the same turns (correct behavior — avoids losing data
  on transient errors).
- Truncation guard: if the stored offset exceeds current file size
  (file was rotated / truncated), reset to 0 and parse from scratch.
- Observability: log to stderr when resume-audit kicks in, showing
  the skipped prefix and the new bytes consumed.

Verified with 22 unit assertions in /tmp/test-resume-offset.mts:
- Full parse from offset 0 (first audit baseline)
- Resume parse returns only new turns after appending to jsonl
- Truncation guard resets to 0 when offset > fileSize
- Offset storage roundtrip (read/write/clear, invalid-value rejection)
- Multibyte Russian text survives byte-offset slicing intact
- parseAndRenderTranscripts threads per-ref offsets end-to-end,
  idempotent on steady state (same offset in = 0 bytes read out)

Existing PR#7 tests (bug-c skip logic, parse-regression, mcp-e2e
cleanup) re-run green — no regression.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Extends the audit-logs/<timestamp>_<id>.json record with a `resume` array:
one entry per attached Claude session, showing startOffset (loaded from
audited-offsets/ at the beginning of this audit), endOffset (written back
after success), bytesRead this call, and a `resumed: boolean` flag that
is true whenever the optimization actually skipped a previously-audited
prefix.

Previously this information was only written to stderr, which is fine
for live diagnosis but does not survive the audit run. The durable
audit log now lets operators answer "did the resume optimization fire
on this session, and where did the auditor start reading?" long after
the fact, which is exactly the observability the user asked for.

Schema:
  resume: [
    {
      claudeSessionId: "<claude-uuid>",
      startOffset: number,   // bytes already audited before this call
      endOffset:   number,   // bytes audited after this call (new floor)
      bytesRead:   number,   // endOffset - startOffset (or full file on truncation reset)
      resumed:     boolean,  // true iff startOffset > 0
    },
    ...
  ]

One entry per attached Claude session ref. Always present (even when
startOffset is 0 — that's the first-audit baseline).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
… retry

Bug: a session whose audit was killed mid-run (SIGKILL on window close, OOM,
reboot, crash) sits with auditStatus="pending" AND auditAttempts=1. Once the
pending state ages past AUDIT_STALE_TIMEOUT_MS (15 min), the concurrent-audit
check in runSessionCleanup falls through — but the next check (retry cap)
immediately skips with skipped="retry-cap" because attempts is already at
MAX_AUDIT_ATTEMPTS. The session becomes permanently un-auditable by any
automatic path. We observed c2eadf04 stuck in exactly this state today.

Fix: when stale-pending is detected (auditStatus=pending + age > stale
timeout), reset auditAttempts to 0 in the in-memory session object before
the retry-cap check. A stale-pending state means the previous attempt was
killed before it could complete, which is not a deterministic failure —
the retry cap should not apply. The cap still protects against real
repeated failures, which show up as auditStatus="failed" rather than
"pending"+stale, and are not touched by this reset.

Verified with /tmp/test-stale-pending-reset.mts (10 assertions, all pass):
  Scenario 1: stale pending (20 min) + attempts=1 → resets, runs to done
  Scenario 2: fresh pending (5 min) + attempts=1 → concurrent-audit skip
  Scenario 3: failed + attempts=1 → retry-cap skip (no reset, correct)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…on in addDecision

Two independent bugs in src/storage/decisions.ts addDecision were allowing
the auditor to accumulate duplicate decision files. Today's workspace has
109 files in decisions/ where the expected count is ~30-60 — nearly half
are duplicates, accumulated over PR#6 E2E audit runs.

Bug 1: deduplicateDecisions was never called from the scoped write path.
The function exists at line 271 of the same file and correctly normalizes
titles via stopword stripping + lowercasing, but saveScopedDecisions calls
addDecision which went straight to disk without dedup. The auditor emitting
"Use TypeScript", "use typescript", and "Using TypeScript for the project"
on successive runs would create three separate files.

Bug 2: D-NNN id generation was a read-modify-write race. Two parallel
addDecision calls (two audit workers, two windows closing simultaneously)
both read the same max-id+1 from listDecisions(). Both write D-051-<slug>.md
with different slug suffixes. Because filenames differ, atomicWrite (which
is temp-file + rename) does not catch the collision — both files land on
disk, both claim to be D-051. We observed D-001 collisions in today's
workspace.

Fix:
  1. At the top of addDecision, compute normalizeTitle(input.title) and
     search existing decisions for a match. If found, return the existing
     decision unchanged — no double-write. Callers see the existing record
     as if they had created it.
  2. Replace atomicWrite with writeFileSync(path, content, { flag: "wx" }).
     The "wx" flag maps to O_CREAT|O_EXCL at the POSIX level, which makes
     the create atomic across processes: the first writer wins, the second
     gets EEXIST. On EEXIST we bump nextNum and retry. Bounded to 50
     attempts, which is far more than any realistic race window.

Verified with two test scripts (16 assertions total, all pass):
  /tmp/test-decisions-race.mts    — 20 parallel addDecision calls produce
    exactly 20 unique contiguous D-NNN ids; a second parallel batch of 20
    extends to D-040 with no collisions.
  /tmp/test-decisions-dedup.mts   — "Use TypeScript" + "use typescript" +
    "Using TypeScript for the project" all collapse to 1 file (normalize-
    Title handles the stopword + case variations). Genuinely different
    decisions still get separate files.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
… field

Problem: an agent starting a fresh Claude Code session in a multi-repo
workspace cannot reliably tell WHERE the current session's .axme-code/
storage lives. The workspace root and each child repo each have their
own separate .axme-code/, and an agent doing "ls .axme-code/sessions/"
relative to its cwd gets a different answer depending on which
subdirectory its cwd happens to be in. In today's session a new agent
ran that exact command from inside the axme-code repo subdir and read
the (empty) repo-level storage instead of the (live) workspace-level
storage, then reported the session as broken.

This is not an agent bug — it is a missing instruction. We have the
information, we just were not surfacing it. Two fixes make the
authoritative path visible in two independent places.

Fix D (storage root header):
  1. getFullContext now prepends a "# AXME Storage Root" header block to
     every axme_context output. The block lists session origin, session
     type (single-repo / workspace), storage root, sessions dir, audit
     logs dir, audit worker logs dir — all as absolute paths. Includes a
     CRITICAL instruction telling the agent to always use ABSOLUTE paths
     rooted at the Storage root for direct .axme-code/ inspection.
  2. buildInstructions (MCP server instructions returned at initialize)
     gets a STORAGE ROOT line with the same absolute path. Shown to the
     agent at the start of every session before axme_context is even
     called.
  3. Both CLAUDE.md templates in cli.ts get a new "Storage paths
     (critical)" section with the same guidance, and a note that the
     MCP server does not hot-reload and a window restart is needed after
     axme-code code changes.

Fix D+ (per-session origin field):
  1. Add `origin: string` to SessionMeta. Populated at createSession time
     with the absolute path of the directory that contains .axme-code/.
     Never changes after creation.
  2. Storage Root header, MCP instructions, and CLAUDE.md all tell the
     agent: if you pick up a session meta.json file directly (not via
     axme_context), read the `origin` field to verify which storage root
     it belongs to. This is a per-session cross-check that lets agents
     navigate correctly even when starting from a specific session file
     instead of the context tool.

Verified with /tmp/test-storage-root-header.mts (13 assertions, all pass):
  - getFullContext output starts with "# AXME Storage Root" and contains
    the absolute session origin + storage root paths
  - Single-repo and workspace session types both render correctly
  - Uninitialized project still emits the header
  - createSession populates meta.origin with the absolute workspace path
  - Persisted meta.json has the origin field

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ifecycle

This is the architectural fix that closes the two-day loop of audit bugs.

Root cause of everything we have been chasing: the auditor runs synchronously
inside the MCP server (cleanupAndExit on transport close) and inside the
session-end hook subprocess. Both are children of Claude Code. VS Code kills
Claude Code when the window closes — mildly in side-panel mode (a few seconds
grace), harshly in center-editor-tab mode (immediate SIGKILL on tab close).
Real audits take 30-120+ seconds (LLM round-trip, possibly chunked). The
audit dies mid-run, leaves phase="started"/auditStatus="pending", cannot be
retried without additional fixes, and every workaround so far (/proc ppid
walk, stale mapping detection, resume offsets, retry caps) has been a
symptom fix on top of this fragility.

The fix: the audit never runs in any process that VS Code can kill.

  * New file src/audit-spawner.ts exposes spawnDetachedAuditWorker(workspace,
    sessionId). It spawns `node dist/cli.mjs audit-session --workspace X
    --session Y` with { detached: true, stdio: ['ignore', fd, fd] }, calls
    child.unref(), and returns. `detached: true` maps to setsid(2) — the
    child becomes its own process group leader. Signals sent to the parent's
    process group (which is what VS Code does when killing Claude Code and
    everything under it) do NOT propagate to the child. The worker survives.

  * New CLI subcommand `audit-session --workspace X --session Y` in cli.ts
    (landed with the preceding Fix D commit because server.ts/cli.ts were
    touched for both fixes in the same editing pass — the commits build
    only together). It is a thin wrapper: imports runSessionCleanup and
    calls it on the (workspace, sessionId) pair. No new audit logic. The
    command is also directly invokable from the shell for manual force
    re-audit of a specific session — a useful operator escape hatch.

  * cleanupAndExit in server.ts (also landed with Fix D commit for the
    reason above) no longer awaits runSessionCleanup. It spawns a detached
    worker per owned session and exits. Total time from stdin-end to
    process.exit is milliseconds. VS Code cannot kill the audit because the
    audit is not inside this process anymore.

  * auditOrphansInBackground in server.ts: same change — spawns a detached
    worker per orphan instead of awaiting cleanup inline.

  * src/hooks/session-end.ts: handleSessionEnd replaces `await runSession-
    Cleanup(...)` with `spawnDetachedAuditWorker(...)`. The hook subprocess
    returns in milliseconds, well under its 120s timeout. The audit
    continues in the detached worker.

  * Worker stdio is redirected to .axme-code/audit-worker-logs/<sessionId>.log
    so operators can post-mortem exactly what each worker did. The parent
    closes its own fd after spawn so a parent exit does not disturb the
    writer (the child has its own dup of the fd).

  * AXME_SKIP_HOOKS=1 is passed in the worker's env, so any subclaude the
    auditor spawns via SDK still gets the early-exit in its hooks and does
    not create ghost AXME sessions. This is the exact mechanism the PR#7
    SDK env passthrough test verified.

Concurrency: two callers (cleanupAndExit + session-end hook) may both fire
for the same session in quick succession. Both spawn workers. Both workers
call runSessionCleanup. The first to flip auditStatus=pending wins; the
second hits the existing "concurrent-audit" skip check and exits without
touching the LLM. No wasted cost, no race on storage writes.

Module cache caveat: a running MCP server's copy of spawnDetachedAuditWorker
is whatever was loaded when the server started. We expect this function to
never change again — its entire body is 20 lines of trivial spawn/unref/
closeSync. The audit LOGIC lives inside the child process, which reads a
fresh dist/cli.mjs on every spawn, so iteration on the auditor is safe
without a VS Code window reload. Only changes to the spawner itself (or to
the MCP server / hooks) require a window reload to take effect.

Verification — three test scripts (all green) plus a real LLM smoke test:

  /tmp/test-detached-worker.mts (11 assertions):
    - Direct audit-session CLI call runs runSessionCleanup and transitions
      session meta to closedAt + auditedAt + auditStatus=done.
    - Helper script spawns a detached worker and hangs. Test SIGKILLs the
      helper. After the helper is dead, the detached worker completes the
      audit and writes all expected state. This is the proof that SIGKILL
      on the spawner does not kill the worker.

  /tmp/test-mcp-cleanup-e2e.mts (updated, 13 assertions):
    - Full MCP server subprocess. Hook via sh -c creates an AXME session
      (verifies Bug A ppid walk still works). Close stdin → stdin-end
      handler → cleanupAndExit spawns detached worker → server exits.
      Test polls session meta until worker sets auditStatus=done, then
      verifies the full lifecycle transition.

  /tmp/test-real-audit-smoke.mts (23 assertions — REAL LLM call):
    - Fresh tmp workspace, fake AXME session pointing at a real 22 KB
      Claude transcript.
    - Helper script spawns detached worker and exits immediately. Worker
      is now orphaned from any test parent — simulates the exact scenario
      where VS Code kills the MCP server 5ms after cleanupAndExit fires.
    - Poll until the worker finishes (took 30.3 seconds on sonnet-4-6,
      6321 prompt tokens, $0.0962 cost).
    - Verify: session meta has all lifecycle flags set correctly. Audit
      log has phase=finished, resume[0] with startOffset=0 / endOffset=
      22204 / resumed=false, totals populated, model + duration + cost
      recorded. audited-offsets/<claude-id>.txt written with offset 22204.
      Worker log file exists with audit-session markers.

The smoke test is the empirical answer to "when I close the VS Code window,
does the audit actually run and finish". Yes. Verified on real data.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…etry cap

Gap discovered in Fix B after initial implementation: runSessionCleanup
resets auditAttempts when it sees a stale-pending session (so the retry
cap does not apply to killed-mid-run attempts). But findOrphanSessions,
which is the orphan scanner that DECIDES which sessions runSessionCleanup
gets called on, does its own retry-cap filter — and strictly skips any
session with auditAttempts >= MAX. So a stale-pending session with maxed
attempts (exactly c2eadf04's state) would never reach runSessionCleanup
through the orphan path, and Fix B could only help if someone manually
invoked runSessionCleanup on that session.

Fix: mirror the stale-pending override from runSessionCleanup into
findOrphanSessions. A session with auditAttempts=MAX is still skipped,
EXCEPT when auditStatus="pending" AND auditStartedAt is older than
AUDIT_STALE_TIMEOUT_MS — that combination uniquely identifies a killed-
mid-run attempt (SIGKILL from VS Code, crash, reboot, OOM), which is
recoverable and should be retried. auditStatus="failed" + maxed attempts
still gets filtered — that is a deterministic failure and retrying would
just fail again.

With both fixes in place, a session stuck in "pending + attempts-maxed"
is automatically picked up by the next MCP server's orphan scan and
spawned as a detached worker. No manual intervention needed for normal
lifecycle.

Verified with /tmp/test-orphan-stale-pending.mts (5 scenarios, all pass):
  1. stale pending + attempts=1 → IS orphan (stale-pending override)
  2. failed + attempts=1 → NOT orphan (retry-cap still applies)
  3. fresh pending (5 min) + attempts=1 → NOT orphan yet
  4. never-attempted + dead pid → IS orphan (baseline)
  5. auditedAt set → NOT orphan (baseline)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@George-iam George-iam merged commit 8e98b41 into main Apr 5, 2026
@George-iam George-iam deleted the feat/audit-reliability-fixes-20260405 branch April 7, 2026 08:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant