Skip to content

Rectify: Stale Detector Kills Actively Working Sessions During run_skill#3391

Merged
Trecek merged 17 commits into
developfrom
stale-detector-kills-actively-working-sessions-during-run-di/3349
May 31, 2026
Merged

Rectify: Stale Detector Kills Actively Working Sessions During run_skill#3391
Trecek merged 17 commits into
developfrom
stale-detector-kills-actively-working-sessions-during-run-di/3349

Conversation

@Trecek
Copy link
Copy Markdown
Collaborator

@Trecek Trecek commented May 31, 2026

Summary

The stale detector uses JSONL file-size growth as its sole liveness signal. When a session is blocked on a long-running MCP tool call (e.g., run_skill executing analyze-pipeline-health), the JSONL stops growing because Claude writes only on conversation turns, not while awaiting tool results. The stale detector interprets this silence as staleness and kills the session.

The architectural weakness: dispatch_food_truck creates a dispatch marker that suppresses stale kills in all three watchers (_session_log_monitor, _watch_stdout_idle, _watch_child_activity), but run_skill creates no such marker. Furthermore, the run_skill call chain through DefaultHeadlessExecutor.run()run_headless_core() never passes marker_dir or caller_session_id to _execute_claude_headless() — even though that function already accepts both parameters. This means the monitoring tasks inside run_managed_async receive marker_dir=None and skip all marker-based suppression unconditionally.

The solution: extend the execution marker protocol to run_skill sessions by (1) renaming and generalizing the marker detection function, (2) threading marker_dir and caller_session_id through the run() call chain, and (3) creating markers in run_skill before blocking on executor.run().

Requirements

Two dispatch_food_truck sessions (issue #3246, issue #3317) completed ALL substantive work — planning, implementation, code review, PR merge via merge queue, and issue release — but were killed by the stale detector during the final run_diagnostic step. The orchestrator then triggered unnecessary resume dispatches, wasting tokens on sessions whose PRs were already merged 18–70 minutes earlier.

Closes #3349

Implementation Plan

Plan file: /home/talon/projects/autoskillit-runs/remediation-20260530-174510-498188/.autoskillit/temp/rectify/rectify_stale_detector_mcp_blind_spot_2026-05-30_174510_part_a.md

🤖 Generated with Claude Code via AutoSkillit

Token Usage Summary

Step Model count uncached output cache_read peak_ctx turns cache_write time
rectify* opus[1m] 1 7.6k 26.4k 1.9M 124.2k 263 108.4k 15m 40s
review_approach* sonnet 1 9.9k 7.1k 225.0k 50.5k 105 38.6k 6m 42s
dry_walkthrough* opus 4 207 46.3k 3.8M 85.2k 498 208.0k 25m 59s
implement* sonnet 4 4.1k 112.9k 18.4M 173.2k 512 366.9k 41m 41s
audit_impl* sonnet 4 2.8k 60.5k 2.2M 86.7k 342 295.9k 29m 35s
make_plan* sonnet 2 1.3k 29.8k 1.4M 80.1k 130 142.2k 15m 22s
prepare_pr* sonnet 1 66.8k 5.1k 163.9k 27.8k 17 15.9k 1m 27s
compose_pr* sonnet 1 57.8k 3.9k 191.7k 27.8k 17 15.5k 1m 12s
review_pr* sonnet 2 388 62.4k 3.6M 128.3k 206 289.7k 22m 29s
resolve_review* opus 2 173 40.2k 7.8M 118.8k 186 228.7k 40m 39s
Total 151.0k 394.6k 39.6M 173.2k 1.7M 3h 20m

* Step used a non-Anthropic provider; caching behavior may differ.

Token Efficiency

Step LoC Changed cache_read/LoC cache_write/LoC output/LoC
rectify 0
review_approach 0
dry_walkthrough 0
implement 767 23963.4 478.4 147.1
audit_impl 0
make_plan 0
prepare_pr 0
compose_pr 0
review_pr 0
resolve_review 93 84195.7 2458.7 432.3
Total 860 46006.4 1988.0 458.8

Model Usage Breakdown

Model steps uncached output cache_read cache_write time
opus[1m] 1 7.6k 26.4k 1.9M 108.4k 15m 40s
sonnet 7 143.1k 281.7k 26.1M 1.2M 1h 58m
opus 2 380 86.5k 11.6M 436.6k 1h 6m

@Trecek Trecek force-pushed the stale-detector-kills-actively-working-sessions-during-run-di/3349 branch from af5551c to 81fc2d9 Compare May 31, 2026 05:11
Trecek and others added 17 commits May 30, 2026 22:45
Adds TDD tests that fail until the execution marker glob is generalized:
- test_stale_NOT_fired_when_execution_marker_active: session monitor must
  suppress stale when run-skill-in-progress-* marker is fresh
- test_stale_fired_when_execution_marker_expired: marker present but stale
  still fires when execution marker is expired
- test_execution_marker_suppression_bounded_by_max_suppression_seconds:
  suppression is bounded even with continuously-touched run-skill marker
- test_stdout_idle_NOT_fired_when_execution_marker_active: idle watchdog
  must suppress IDLE_STALL when run-skill-in-progress-* marker is fresh

Current dispatch-in-progress-* glob does not match run-skill-in-progress-*
markers, so tests 1a, 1c, 1d fail before the fix.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…_marker

Generalizes the marker detection function and its glob pattern to match
both dispatch-in-progress-* and run-skill-in-progress-* markers:

  Old glob: dispatch-in-progress-{session_id}-*.marker
  New glob: *-in-progress-{session_id}-*.marker

This makes the stale detector's suppression mechanism work for any
execution context, not just fleet dispatches. All callers in
_process_race.py (_watch_stdout_idle, _watch_child_activity) and
_process_monitor.py (_session_log_monitor) are updated.

Updated all test references: monkeypatch targets, imports, class names,
frozenset, arch guard predicate string, and CLAUDE.md table entry.
Adds parametrized backward-compat test confirming both prefixes match
and a wrong-session-id guard test.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds marker_dir: Path | None and caller_session_id: str | None to:
- HeadlessExecutor.run() Protocol
- DefaultHeadlessExecutor.run() concrete method
- run_headless_core() function

Both params are forwarded to _execute_claude_headless() as marker_dir
and session_id respectively, making the stale detector's marker-based
suppression available for run_skill sessions (not just dispatch_food_truck).

Adds integration tests verifying the forwarding chain:
- test_run_headless_core_forwards_marker_dir_and_caller_session_id
- test_default_executor_run_forwards_marker_dir_and_caller_session_id
- test_headless_executor_protocol_includes_marker_dir_params

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ker creation

Create core/_execution_marker.py with a shared async context manager that
handles marker write, heartbeat, and cleanup. Refactor fleet/_api.py to
use execution_marker instead of inline marker logic, and wire run_skill
in tools_execution.py to create markers before blocking on executor.run().

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…fication gap

Verifies that execution_marker context manager creates a marker file on entry
and deletes it on exit — the lifecycle unit test that was missing from the
stale-detector run_skill marker blind spot audit.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…d_dispatch

Add TestResumeSuccessGuard to test_resume_preflight.py covering the
_run_dispatch early-return path when prior dispatch already succeeded.
test_resume_rejected_when_prior_dispatch_succeeded fails until the
guard is wired in Phase 2.

Add TestHasCompletedDispatch to test_state.py and implement
has_completed_dispatch in state_recovery.py (name-specific SUCCESS
check, fail-open on missing/corrupted state).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…e-exports

Wire _build_success_short_circuit into _run_dispatch: when prior
dispatch already has DispatchStatus.SUCCESS, return the cached result
immediately without launching a subprocess.

Add has_completed_dispatch re-exports through state.py and fleet/__init__
so callers can use `from autoskillit.fleet import has_completed_dispatch`.

Add dispatch_food_truck MCP tool pre-flight guard: if campaign state
already shows SUCCESS for effective_name, return the cached result
before execute_dispatch is called (prevents semaphore acquisition and
state handle creation for duplicate calls).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…scanner depth

Add stale_threshold: 2400 to all run_diagnostic* steps in implementation,
remediation, and merge-prs recipes (10 steps total). The default 1200s
is too short for multi-subagent scanner work; 2400s matches the existing
convention for long-running steps (implement, retry_worktree).

Add scanner maxTurns and wall-clock soft-deadline guidance to
analyze-pipeline-health SKILL.md to bound investigation per scanner
and prevent the stale detector misfiring window.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Move DispatchStateHandle.open_continued inside the try block so that
FileNotFoundError (a subclass of OSError) is caught by the existing
except handler. Add a create_fresh fallback in the except so handle is
always bound after the try/except, enabling fail-open resume when the
prior state file is absent.

Extract recipe_snapshot before the if/else so it is available in both
the except fallback and the else branch.

Add test_resume_proceeds_when_prior_state_missing to
TestResumeSuccessGuard to verify the fail-open behaviour.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replace duplicate label suffix in marker filename with uuid4 hex to
prevent collisions between concurrent executions sharing the same
label + session_id. Replace asyncio.create_task with anyio task group
for structured concurrency and proper exception visibility.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add find_completed_dispatch helper that returns the full DispatchRecord
for a completed dispatch. Use it in the dispatch_food_truck MCP tool
pre-flight guard to forward actual dispatch_id and dispatched_session_id
instead of hardcoded empty strings.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ch fields

Add monitor_status emptiness assertion in
test_stale_NOT_fired_when_execution_marker_active to distinguish STALE
from other early-exit conditions. Add dispatch_id, dispatched_session_id,
and reason field assertions in test_resume_rejected_when_prior_dispatch_succeeded.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Revert from anyio task group (yield inside task group body causes
cleanup issues with @asynccontextmanager) to asyncio.create_task with
split exception handling: CancelledError silenced, other exceptions
logged. Update tests to match new _touch_marker 2-arg signature.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ppression test

Adds assert elapsed < 5.0 after the lower-bound check so a broken
suppression path (marker never matched) cannot silently pass when the
test happens to run slowly.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…checks

Adds assert "marker_dir" in sig.parameters and assert
"caller_session_id" in sig.parameters before accessing .default,
so a missing parameter produces an actionable assertion message
instead of an uninformative KeyError.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
New core/ module _execution_marker was missing from MODULE_CASCADE_CORE,
causing test_all_core_stems_classified to fail. Maps it to the four
consuming packages: core, execution, fleet, server.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@Trecek Trecek force-pushed the stale-detector-kills-actively-working-sessions-during-run-di/3349 branch from 81fc2d9 to e2edc1b Compare May 31, 2026 05:45
@Trecek Trecek added this pull request to the merge queue May 31, 2026
Merged via the queue into develop with commit 1c665ae May 31, 2026
3 checks passed
@Trecek Trecek deleted the stale-detector-kills-actively-working-sessions-during-run-di/3349 branch May 31, 2026 05:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant