Bug Report
Severity: High — resume is completely broken for food truck dispatches. Every resumed session is killed in ~14s with 0 tokens before Claude Code can make any API call.
Discovered: 2026-05-30, during dispatch of issue #99 in TalonT-Org/api-simulator.
Related issues: #2400 (staged — cross-session JSONL recovery for resume dispatches) addresses result-parsing after the kill but does NOT prevent the kill itself. #2940 (staged — stream parser captures wrong UUID on resume) is a separate resume bug in session ID resolution. #2538 (staged — food truck timeout→resume gap) covers different resume gaps. This issue is the root cause that renders all resume attempts DOA regardless of those fixes.
Symptom
When dispatch_food_truck is called with resume_session_id to resume a previously-failed food truck session, the result is:
- lifespan_started: false
- input_tokens: 0, output_tokens: 0
- exit_code: 143 (SIGTERM)
- kill_reason: kill_after_completion
- duration_seconds: ~14 (just startup overhead)
- cli_subtype: unparseable
The session is killed by autoskillit's own process monitor before Claude Code reaches the API.
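The reported exit code matches the common POSIX shell convention of 128 plus the signal number. A minimal sketch of that arithmetic follows; `shell_exit_code` is a hypothetical helper, and it is an assumption that autoskillit derives the value it reports the same way.

```python
import signal


def shell_exit_code(signum: int) -> int:
    """Exit status a POSIX shell conventionally reports for a signal-killed process."""
    return 128 + int(signum)


# SIGTERM is signal 15, so a SIGTERM-killed process is reported as 143.
print(shell_exit_code(signal.SIGTERM))
```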
Root Cause
Channel B's _session_log_monitor has no resume-boundary awareness. Phase 2 unconditionally initializes scan_pos = 0 and reads the entire JSONL file on its first poll — including content from the original session that already contains the completion marker.
Causal Chain
- Marker reuse: _run_dispatch reconstructs DispatchIdentity from prior_dispatch_id via DispatchStateHandle.open_continued → DispatchIdentity.from_dispatch_id() (core/types/_type_dispatch_identity.py:60-68). This deterministically produces the same completion_marker (%%L3_DONE::{dispatch_id[:8]}%%) that the original session already emitted.
- Same JSONL file: claude --resume <session_id> appends to the existing session JSONL at ~/.claude/projects/<project-dir>/<session_id>.jsonl. The file already contains the old completion marker in an assistant-type record.
- Phase 1 discovery: _session_log_monitor Phase 1 (_process_monitor.py:234-238) filters JSONL files by st_ctime > spawn_time. The resumed session's JSONL passes this filter because Claude Code updates st_ctime when it appends new records. (Confirmed: on Linux, appending to a file updates both st_ctime and st_mtime.)
- Phase 2 false-fire: Phase 2 initializes scan_pos = 0 (_process_monitor.py:289). First poll: current_size > last_size (since last_size=0 and file has pre-existing content). Reads content[0:], which is the entire file. _jsonl_contains_marker finds the old %%L3_DONE::...%% in a historical assistant record (only assistant-type records are scanned per session_record_types=frozenset({"assistant"}) in CLAUDE_CODE_CAPABILITIES). Returns ChannelBStatus.COMPLETION immediately (_process_monitor.py:324).
- Kill: resolve_termination → COMPLETED → DRAIN_THEN_KILL_IF_ALIVE → SIGTERM after drain window. Process dies with exit code 143. Claude Code never made an API call.
Channel A is NOT affected: The stdout heartbeat (_heartbeat) reads from a fresh tempfile.NamedTemporaryFile created per dispatch (_process_io.py:36-38), so there is no pre-existing content to false-fire on.
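The causal chain can be reproduced in miniature. The sketch below is not autoskillit's code: `completion_marker` is a hypothetical stand-in that only mirrors the marker format quoted above, the dispatch ID suffix is made up (only the first 8 characters come from this report), and record-type filtering is left out for brevity.

```python
import tempfile
from pathlib import Path


def completion_marker(dispatch_id: str) -> str:
    # Marker format quoted in this report: %%L3_DONE::{dispatch_id[:8]}%%
    return f"%%L3_DONE::{dispatch_id[:8]}%%"


# Only the first 8 characters are from the report; the rest is invented.
dispatch_id = "b2fc2669-0000-0000-0000-000000000000"

original_marker = completion_marker(dispatch_id)
# On resume the identity is rebuilt from the same ID, so the marker is identical.
resume_marker = completion_marker(dispatch_id)

with tempfile.TemporaryDirectory() as tmp:
    session = Path(tmp) / "session.jsonl"
    # Original session ended normally and emitted the marker.
    session.write_text('{"type": "assistant", "text": "' + original_marker + '"}\n')
    # The resumed run appends to the same file.
    with session.open("a") as fh:
        fh.write('{"type": "user", "text": "resume"}\n')

    scan_pos = 0  # Phase 2's unconditional starting offset
    first_poll = session.read_text()[scan_pos:]
    false_fire = resume_marker in first_poll

print(original_marker == resume_marker, false_fire)
```

Because the first poll covers the whole file, the stale marker is indistinguishable from a fresh one.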
Evidence from the failing session
| Field | Original session (67156089) | Resume attempt |
| --- | --- | --- |
| Session ID | 67156089-92d3-429c-a5f2-cfeb63860041 | Same (resumed) |
| Duration | 1881s | 14s |
| Tokens | 130 in / 8201 out | 0 / 0 |
| Exit code | 0 | 143 |
| Kill reason | natural_exit | kill_after_completion |
| JSONL marker | %%L3_DONE::b2fc2669%% at line 72 | Same file, marker already present |
The JSONL file at ~/.claude/projects/-home-talon-projects-api-simulator/67156089-92d3-429c-a5f2-cfeb63860041.jsonl shows the resume user message was written (lines 74-76) but no assistant response was ever generated — the process was killed first. The proc_trace.jsonl for the resume session shows Claude Code was alive and working at +5s (89 ESTABLISHED connections, 19 threads, 18.6% CPU, 110KB I/O) — it was killed externally, not self-terminated.
The completion marker also appears verbatim in the resume prompt injected into queue-operation (line 74) and user (line 76) records, but these are correctly filtered out by session_record_types=frozenset({"assistant"}). The false-fire is exclusively from the original session's assistant record at line 72.
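The record-type filtering described above can be sketched as follows. This is an illustrative reimplementation, not the actual _jsonl_contains_marker; it assumes each JSONL record carries a top-level "type" field, and the sample records are invented.

```python
import json


def jsonl_contains_marker(text, marker, record_types=frozenset({"assistant"})):
    """Return True if `marker` appears in a record whose type is in `record_types`."""
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip a partially written or malformed line
        if (
            isinstance(record, dict)
            and record.get("type") in record_types
            and marker in line
        ):
            return True
    return False


MARKER = "%%L3_DONE::b2fc2669%%"

# Records written by the resume itself: the marker is present but filtered out.
resume_only = "\n".join([
    json.dumps({"type": "queue-operation", "text": f"emit {MARKER} when done"}),
    json.dumps({"type": "user", "text": f"emit {MARKER} when done"}),
])

# The original session's assistant record is what trips the monitor.
with_original = json.dumps({"type": "assistant", "text": MARKER}) + "\n" + resume_only

print(jsonl_contains_marker(resume_only, MARKER))
print(jsonl_contains_marker(with_original, MARKER))
```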
Why prior_completion_markers doesn't help
The prior_completion_markers parameter is threaded through _run_dispatch → dispatch_food_truck → _execute_claude_headless → _build_skill_result. However, it is used only in post-hoc result adjudication (_headless_result.py, _session_content.py:140-146,210-211), never in Channel B's real-time monitor. Confirmed: prior_completion_markers does not appear in _process_monitor.py, _process_race.py, or the run_managed_async function signature. The fix was applied at the wrong layer: the result parser can tolerate old markers, but the process monitor kills the session before the result parser ever runs.
Affected Code
| File | Line(s) | Role |
| --- | --- | --- |
| src/autoskillit/execution/process/_process_monitor.py | 289 | scan_pos = 0: no resume offset |
| src/autoskillit/execution/process/_process_jsonl.py | 39-73 | _jsonl_contains_marker: no time-boundary awareness |
| src/autoskillit/execution/process/_process_race.py | 426-496 | resolve_termination: cannot distinguish stale vs fresh markers |
| src/autoskillit/core/types/_type_dispatch_identity.py | 60-68 | from_dispatch_id: deterministically reproduces same marker |
| src/autoskillit/fleet/_api.py | 497 | completion_marker = identity.completion_marker: reuses original |
Recommended Fix
Initialize scan_pos to the file's existing byte length when Phase 1 discovers the JSONL file.
After Phase 1 selects the session file (_process_monitor.py:284), read the file's current size and set scan_pos (and last_size) to that value before entering Phase 2. This ensures Phase 2 scans only content appended after the file was discovered, skipping all historical records, including stale completion markers.
This is the minimal correct fix — it doesn't require changes to the marker identity chain, the race resolution logic, or the prior_completion_markers threading. It works for all resume scenarios (food truck, skill, campaign) because the root cause is universal: Phase 2 starts at byte 0 regardless of file history.
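A minimal sketch of the offset initialization, under stated assumptions: `initial_scan_pos` and `read_new_bytes` are hypothetical helpers, and scan_pos is treated as a byte offset into the file. The real monitor may index decoded content instead, in which case the offset would need to be computed in the same units.

```python
import os
import tempfile


def initial_scan_pos(path: str) -> int:
    """Byte length of the file at discovery time; scanning would start here."""
    return os.stat(path).st_size


def read_new_bytes(path: str, scan_pos: int) -> tuple[bytes, int]:
    """Return the bytes appended since scan_pos and the advanced offset."""
    with open(path, "rb") as fh:
        fh.seek(scan_pos)
        chunk = fh.read()
    return chunk, scan_pos + len(chunk)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "session.jsonl")
    # Pre-existing content from the original session, stale marker included.
    with open(path, "wb") as fh:
        fh.write(b'{"type": "assistant", "text": "%%L3_DONE::b2fc2669%%"}\n')

    scan_pos = initial_scan_pos(path)  # instead of scan_pos = 0
    stale, scan_pos = read_new_bytes(path, scan_pos)

    # Output from the resumed session arrives later.
    with open(path, "ab") as fh:
        fh.write(b'{"type": "assistant", "text": "fresh work"}\n')
    fresh, scan_pos = read_new_bytes(path, scan_pos)

print(stale, fresh)
```

Anything already in the file at discovery time is skipped, so a real implementation would also need to handle a final line that is only partially written when the size is sampled.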
A regression test should create a JSONL file with a pre-existing completion marker, spawn a monitored process against it, and assert that Channel B does NOT fire on the stale marker.
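One possible shape for that regression test is sketched below. `SessionLogMonitorStub` is a hypothetical stand-in for Channel B's Phase 2 loop, not the real _session_log_monitor; a real test would drive the actual monitor and a spawned process. The `skip_existing` flag exists only to contrast the current behavior with the proposed one.

```python
import json
import os
import tempfile

MARKER = "%%L3_DONE::b2fc2669%%"


class SessionLogMonitorStub:
    """Hypothetical stand-in for Channel B's Phase 2 polling."""

    def __init__(self, path, marker, skip_existing):
        self.path = path
        self.marker = marker
        # Proposed fix: start at the current end of file rather than byte 0.
        self.scan_pos = os.path.getsize(path) if skip_existing else 0

    def poll(self) -> bool:
        with open(self.path, "rb") as fh:
            fh.seek(self.scan_pos)
            chunk = fh.read()
        self.scan_pos += len(chunk)
        for raw in chunk.decode("utf-8").splitlines():
            record = json.loads(raw)
            if record.get("type") == "assistant" and self.marker in raw:
                return True
        return False


def _append(path, record_type, text):
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps({"type": record_type, "text": text}) + "\n")


def test_stale_marker_does_not_fire():
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "session.jsonl")
        _append(path, "assistant", f"done {MARKER}")  # original session
        monitor = SessionLogMonitorStub(path, MARKER, skip_existing=True)
        _append(path, "user", f"resume; emit {MARKER} when finished")
        assert monitor.poll() is False  # stale marker is ignored
        _append(path, "assistant", f"finished {MARKER}")
        assert monitor.poll() is True  # fresh marker still fires


def test_scan_from_zero_reproduces_bug():
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "session.jsonl")
        _append(path, "assistant", f"done {MARKER}")
        monitor = SessionLogMonitorStub(path, MARKER, skip_existing=False)
        assert monitor.poll() is True  # current behavior: false-fire
```

The second test documents the bug itself, which makes it easy to confirm the first test would have failed before the fix.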
Test Gap
No existing test exercises the resume path where Channel B monitors a JSONL file that already contains a completion marker from a prior session.