Channel B false-fires on stale completion marker when resuming sessions #3360

@Trecek

Bug Report

Severity: High — resume is completely broken for food truck dispatches. Every resumed session is killed in ~14s with 0 tokens before Claude Code can make any API call.

Discovered: 2026-05-30, during dispatch of issue #99 in TalonT-Org/api-simulator.

Related issues:

  • #2400 (staged; cross-session JSONL recovery for resume dispatches) addresses result parsing after the kill but does NOT prevent the kill itself.
  • #2940 (staged; stream parser captures wrong UUID on resume) is a separate resume bug in session ID resolution.
  • #2538 (staged; food truck timeout→resume gap) covers different resume gaps.

This issue is the root cause that renders all resume attempts DOA regardless of those fixes.

Symptom

When dispatch_food_truck is called with resume_session_id to resume a previously-failed food truck session, the result is:

  • lifespan_started: false
  • input_tokens: 0, output_tokens: 0
  • exit_code: 143 (SIGTERM)
  • kill_reason: kill_after_completion
  • duration_seconds: ~14 (just startup overhead)
  • cli_subtype: unparseable

The session is killed by autoskillit's own process monitor before Claude Code reaches the API.

Root Cause

Channel B's _session_log_monitor has no resume-boundary awareness. Phase 2 unconditionally initializes scan_pos = 0 and reads the entire JSONL file on its first poll — including content from the original session that already contains the completion marker.

Causal Chain

  1. Marker reuse: _run_dispatch reconstructs DispatchIdentity from prior_dispatch_id via DispatchStateHandle.open_continued → DispatchIdentity.from_dispatch_id() (core/types/_type_dispatch_identity.py:60-68). This deterministically produces the same completion_marker (%%L3_DONE::{dispatch_id[:8]}%%) that the original session already emitted.

  2. Same JSONL file: claude --resume <session_id> appends to the existing session JSONL at ~/.claude/projects/<project-dir>/<session_id>.jsonl. The file already contains the old completion marker in an assistant-type record.

  3. Phase 1 discovery: _session_log_monitor Phase 1 (_process_monitor.py:234-238) filters JSONL files by st_ctime > spawn_time. The resumed session's JSONL passes this filter because Claude Code updates st_ctime when it appends new records. (Confirmed: on Linux, appending to a file updates both st_ctime and st_mtime.)

  4. Phase 2 false-fire: Phase 2 initializes scan_pos = 0 (_process_monitor.py:289). First poll: current_size > last_size (since last_size=0 and file has pre-existing content). Reads content[0:] — the entire file. _jsonl_contains_marker finds the old %%L3_DONE::...%% in a historical assistant record (only assistant-type records are scanned per session_record_types=frozenset({"assistant"}) in CLAUDE_CODE_CAPABILITIES). Returns ChannelBStatus.COMPLETION immediately (_process_monitor.py:324).

  5. Kill: resolve_termination → COMPLETED → DRAIN_THEN_KILL_IF_ALIVE → SIGTERM after drain window. The process exits with code 143. Claude Code never made an API call.
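
The chain above can be reproduced in isolation. The following is a minimal sketch, not the autoskillit code: `completion_marker` uses the marker format quoted in step 1, while `scan_assistant_records`, the flat `{"type", "text"}` record schema, and the `b2fc2669-example` dispatch ID are simplified assumptions made for illustration.

```python
import json
import tempfile
from pathlib import Path


def completion_marker(dispatch_id: str) -> str:
    # Marker format quoted in step 1: %%L3_DONE::{dispatch_id[:8]}%%
    return f"%%L3_DONE::{dispatch_id[:8]}%%"


def scan_assistant_records(path: Path, marker: str, scan_pos: int) -> bool:
    """Hypothetical stand-in for one Phase 2 poll: read from scan_pos to EOF."""
    with path.open("rb") as fh:
        fh.seek(scan_pos)
        chunk = fh.read().decode("utf-8")
    for line in chunk.splitlines():
        if not line.strip():
            continue
        record = json.loads(line)
        # Only assistant-type records are scanned (assumed flat schema).
        if record.get("type") == "assistant" and marker in record.get("text", ""):
            return True
    return False


def reproduce() -> tuple[bool, bool]:
    # Same dispatch ID on resume, so the marker is byte-for-byte identical.
    original = completion_marker("b2fc2669-example")
    resumed = completion_marker("b2fc2669-example")
    assert original == resumed

    with tempfile.TemporaryDirectory() as tmp:
        log = Path(tmp) / "session.jsonl"
        # Original session: the assistant already emitted the marker.
        with log.open("w", encoding="utf-8") as fh:
            fh.write(json.dumps({"type": "assistant", "text": f"done {original}"}) + "\n")
        size_before_resume = log.stat().st_size
        # Resume: only a user record is appended before the first poll.
        with log.open("a", encoding="utf-8") as fh:
            fh.write(json.dumps({"type": "user", "text": f"resume; emit {resumed}"}) + "\n")

        fired_from_zero = scan_assistant_records(log, resumed, 0)
        fired_from_offset = scan_assistant_records(log, resumed, size_before_resume)
    return fired_from_zero, fired_from_offset
```

Under these assumptions, scanning from byte 0 reports completion on the first poll, while scanning from the pre-resume size does not.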

Channel A is NOT affected: The stdout heartbeat (_heartbeat) reads from a fresh tempfile.NamedTemporaryFile created per dispatch (_process_io.py:36-38), so there is no pre-existing content to false-fire on.

Evidence from the failing session

| Field | Original session (67156089) | Resume attempt |
|---|---|---|
| Session ID | 67156089-92d3-429c-a5f2-cfeb63860041 | Same (resumed) |
| Duration | 1881s | 14s |
| Tokens | 130 in / 8201 out | 0 / 0 |
| Exit code | 0 | 143 |
| Kill reason | natural_exit | kill_after_completion |
| JSONL marker | %%L3_DONE::b2fc2669%% at line 72 | Same file, marker already present |

The JSONL file at ~/.claude/projects/-home-talon-projects-api-simulator/67156089-92d3-429c-a5f2-cfeb63860041.jsonl shows the resume user message was written (lines 74-76) but no assistant response was ever generated — the process was killed first. The proc_trace.jsonl for the resume session shows Claude Code was alive and working at +5s (89 ESTABLISHED connections, 19 threads, 18.6% CPU, 110KB I/O) — it was killed externally, not self-terminated.

The completion marker also appears verbatim in the resume prompt injected into queue-operation (line 74) and user (line 76) records, but these are correctly filtered out by session_record_types=frozenset({"assistant"}). The false-fire is exclusively from the original session's assistant record at line 72.
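
To illustrate why the prompt records at lines 74 and 76 are harmless while the record at line 72 is not, here is a rough sketch of record-type filtering. It is not the real `_jsonl_contains_marker`: the function name `contains_marker` and the flat `{"type", "text"}` schema are assumptions, and only the `frozenset({"assistant"})` value comes from this report.

```python
import json

# Value reported for session_record_types in CLAUDE_CODE_CAPABILITIES.
ASSISTANT_ONLY = frozenset({"assistant"})


def contains_marker(
    jsonl_text: str,
    marker: str,
    record_types: frozenset[str] = ASSISTANT_ONLY,
) -> bool:
    """Return True if any record of an allowed type carries the marker."""
    for line in jsonl_text.splitlines():
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # tolerate a partially written trailing line
        if not isinstance(record, dict):
            continue
        if record.get("type") not in record_types:
            continue  # queue-operation and user records are skipped here
        if marker in str(record.get("text", "")):
            return True
    return False
```

The filter distinguishes records by type only. It has no notion of when a record was written, so an assistant record from the original session matches exactly like a fresh one.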

Why prior_completion_markers doesn't help

The prior_completion_markers parameter is threaded through _run_dispatch → dispatch_food_truck → _execute_claude_headless → _build_skill_result. However, it is used only in post-hoc result adjudication (_headless_result.py, _session_content.py:140-146,210-211), never in Channel B's real-time monitor. Confirmed: prior_completion_markers does not appear in _process_monitor.py, _process_race.py, or the run_managed_async function signature. The fix was applied at the wrong layer: the result parser can tolerate old markers, but the process monitor kills the session before the result parser ever runs.

Affected Code

| File | Line(s) | Role |
|---|---|---|
| src/autoskillit/execution/process/_process_monitor.py | 289 | scan_pos = 0: no resume offset |
| src/autoskillit/execution/process/_process_jsonl.py | 39-73 | _jsonl_contains_marker: no time-boundary awareness |
| src/autoskillit/execution/process/_process_race.py | 426-496 | resolve_termination: cannot distinguish stale vs fresh markers |
| src/autoskillit/core/types/_type_dispatch_identity.py | 60-68 | from_dispatch_id: deterministically reproduces same marker |
| src/autoskillit/fleet/_api.py | 497 | completion_marker = identity.completion_marker: reuses original |

Recommended Fix

Initialize scan_pos to the file's existing byte length when Phase 1 discovers the JSONL file.

After Phase 1 selects the session file (_process_monitor.py:284), read the file's current size in bytes and set scan_pos (and last_size) to that value before entering Phase 2. This ensures Phase 2 only scans bytes appended after discovery, skipping all historical records including stale completion markers.

This is the minimal correct fix — it doesn't require changes to the marker identity chain, the race resolution logic, or the prior_completion_markers threading. It works for all resume scenarios (food truck, skill, campaign) because the root cause is universal: Phase 2 starts at byte 0 regardless of file history.
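
One possible shape for the offset seeding is sketched below. `TailCursor` and its methods are hypothetical names, not part of `_process_monitor.py`. The sketch only assumes that the monitor tracks a byte offset (`scan_pos`) and polls the file size, as described above.

```python
import os
from dataclasses import dataclass


@dataclass
class TailCursor:
    """Hypothetical cursor that yields only lines appended after discovery."""

    path: str
    scan_pos: int = 0
    _partial: bytes = b""

    @classmethod
    def at_end_of(cls, path: str) -> "TailCursor":
        # Seed the offset with the existing byte length, so records from
        # the original session are never scanned.
        return cls(path=path, scan_pos=os.stat(path).st_size)

    def read_new_lines(self) -> list[str]:
        size = os.stat(self.path).st_size
        if size <= self.scan_pos:
            return []  # nothing appended since the last poll
        with open(self.path, "rb") as fh:
            fh.seek(self.scan_pos)
            data = self._partial + fh.read(size - self.scan_pos)
        self.scan_pos = size
        # Hold back an unterminated trailing line until its newline arrives.
        *complete, self._partial = data.split(b"\n")
        return [line.decode("utf-8") for line in complete if line.strip()]
```

Seeding from the size at discovery rather than at spawn means any bytes Claude Code appends between spawn and discovery are also skipped. In the failing session those were only queue-operation and user records, which Channel B ignores anyway, but that window is worth checking against the real Phase 1 timing.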

A regression test should create a JSONL file with a pre-existing completion marker, spawn a monitored process against it, and assert that Channel B does NOT fire on the stale marker.

Test Gap

No existing test exercises the resume path where Channel B monitors a JSONL file that already contains a completion marker from a prior session.

Metadata

Labels: recipe:remediation (Bug fix / broken behavior, routed to remediation recipe); staged (Implementation staged and waiting for promotion to main)
