
Rectify: PTY Wrapper Tracer PID — Target-Resolution & Self-Identifying Snapshots (Issue #806) #809

Merged
Trecek merged 11 commits into integration from
linux-process-tracer-monitors-script-1-pty-wrapper-instead-o/806
Apr 13, 2026

Conversation


@Trecek Trecek commented Apr 13, 2026

Summary

anyio.open_process([script, -qefc, "<claude cmd>", /dev/null]) returns the PID of the script(1) PTY wrapper, not the PID of the intended claude binary. That wrapper PID was used as the subject of observation at ten call sites in process.py and downstream through the entire telemetry pipeline (tracer, anomaly detection, peak aggregation, sessions.jsonl, GitHub issue bodies). The subsystem has been silently wrong since PTY wrapping was introduced, and two prior investigations (#771/#776) saw the same ~2 MB RSS / 1 thread / do_sys_poll fingerprint without identifying the PTY wrapper as the root cause.

This rectify provides immunity against the entire bug class, not just the script(1) instance, through four changes. It introduces a TraceTarget newtype that can only be produced by a resolve_trace_target() resolver, which walks descendants and returns the workload process rather than the spawn process. It adds a comm field to ProcSnapshot so every row in proc_trace.jsonl self-identifies the process it describes. It changes start_linux_tracing(pid: int) to start_linux_tracing(target: TraceTarget) so the raw-int path becomes unrepresentable. Finally, it adds an AST architectural test (ARCH-008) that forbids passing a raw Attribute PID node as the target argument to start_linux_tracing.
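An architectural rule in the spirit of ARCH-008 could be sketched with the standard `ast` module. This is an illustration only, not the project's actual test: the helper name `find_raw_pid_targets` is hypothetical, and the matching criteria (first positional argument or `target=` keyword that is an `ast.Attribute` named `pid`) are an assumption about what "raw Attribute PID node" means.

```python
import ast


def find_raw_pid_targets(source: str) -> list[int]:
    """Return line numbers of start_linux_tracing calls that receive a raw `.pid`.

    Hypothetical helper: the real ARCH-008 test may match differently.
    """
    offenders: list[int] = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call):
            continue
        func = node.func
        # Accept both `start_linux_tracing(...)` and `module.start_linux_tracing(...)`.
        name = func.id if isinstance(func, ast.Name) else getattr(func, "attr", None)
        if name != "start_linux_tracing":
            continue
        # Inspect the first positional argument and any `target=` keyword.
        candidates = list(node.args[:1])
        candidates += [kw.value for kw in node.keywords if kw.arg == "target"]
        if any(isinstance(c, ast.Attribute) and c.attr == "pid" for c in candidates):
            offenders.append(node.lineno)
    return offenders


BAD = "handle = start_linux_tracing(proc.pid, config, tg)\n"
GOOD = "handle = start_linux_tracing(target, config, tg)\n"
```

With these inputs, `find_raw_pid_targets(BAD)` flags line 1 and `find_raw_pid_targets(GOOD)` returns an empty list. A static check like this complements the runtime type guard because it fails at test time rather than when the call is first executed.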

Architecture Impact

Concurrency Diagram

%%{init: {'flowchart': {'nodeSpacing': 40, 'rankSpacing': 50, 'curve': 'basis'}}}%%
flowchart TB
    %% CLASS DEFINITIONS %%
    classDef terminal fill:#1a237e,stroke:#7986cb,stroke-width:2px,color:#fff;
    classDef stateNode fill:#004d40,stroke:#4db6ac,stroke-width:2px,color:#fff;
    classDef handler fill:#e65100,stroke:#ffb74d,stroke-width:2px,color:#fff;
    classDef phase fill:#6a1b9a,stroke:#ba68c8,stroke-width:2px,color:#fff;
    classDef newComponent fill:#2e7d32,stroke:#81c784,stroke-width:2px,color:#fff;
    classDef output fill:#00695c,stroke:#4db6ac,stroke-width:2px,color:#fff;
    classDef detector fill:#b71c1c,stroke:#ef5350,stroke-width:2px,color:#fff;
    classDef gap fill:#ff6f00,stroke:#ffa726,stroke-width:2px,color:#000;

    START([● run_headless_core / DefaultHeadlessExecutor])

    subgraph PreSpawn ["SEQUENTIAL PRE-SPAWN  —  ● headless.py · ● process.py"]
        direction TB
        PTY_WRAP["● pty_wrap_command()<br/>━━━━━━━━━━<br/>Rewrites cmd: script -qefc<br/>Stores _workload_basename BEFORE rewrite"]
        SPAWN["● anyio.open_process()<br/>━━━━━━━━━━<br/>Returns script(1) wrapper PID<br/>(not the workload PID)"]
        PTY_DEC{"PTY<br/>mode?"}
    end

    subgraph PIDRect ["PID RECTIFICATION  —  ● linux_tracing.py  (BLOCKING SYNC before task group)"]
        direction TB
        RESOLVE["● resolve_trace_target()<br/>━━━━━━━━━━<br/>Blocks event loop: time.sleep(0.05) × ≤40<br/>Walks /proc descendants from wrapper PID<br/>Matches expected_basename (workload)"]
        RESOLVE_ERR["● TraceTargetResolutionError<br/>━━━━━━━━━━<br/>Hard fail — never silently falls back<br/>Carries .root_pid + .expected_basename"]
        DIRECT["● trace_target_from_pid()<br/>━━━━━━━━━━<br/>Single /proc read<br/>Non-PTY direct path"]
        TRACE_TARGET["● TraceTarget (frozen dataclass)<br/>━━━━━━━━━━<br/>pid · comm · cmdline<br/>starttime_ticks (PID-recycle guard)<br/>boot_id collision-resistant triple"]
        ENROLL["● _write_enrollment_atomic()<br/>━━━━━━━━━━<br/>tempfile.mkstemp + os.replace<br/>Crash-recovery identity record"]
    end

    FORK["● anyio.create_task_group() — FORK<br/>━━━━━━━━━━<br/>process.py: starts 5–6 concurrent tasks"]

    subgraph TaskGroup ["CONCURRENT ASYNC WATCHERS  —  anyio cooperative multitasking (single thread)"]
        direction LR
        W_PROC["_watch_process<br/>━━━━━━━━━━<br/>Awaits subprocess exit<br/>CHANNEL: natural exit"]
        W_HEART["_watch_heartbeat<br/>━━━━━━━━━━<br/>Polls stdout JSONL<br/>Completion marker records<br/>CHANNEL A"]
        W_SESS["_watch_session_log<br/>━━━━━━━━━━<br/>Polls Claude session JSONL dir<br/>CHANNEL B"]
        W_IDLE["_watch_stdout_idle<br/>━━━━━━━━━━<br/>Fires if stdout silent<br/>idle_output_timeout guard"]
        MONITOR["● _run_monitor / proc_monitor<br/>━━━━━━━━━━<br/>anyio.sleep(proc_interval)<br/>Reads /proc at each tick<br/>Accumulates ProcSnapshot list<br/>Injected into same TaskGroup"]
    end

    subgraph RaceSignals ["RACE SIGNALS  —  ● process.py (shared, anyio-safe)"]
        direction TB
        TRIGGER["● anyio.Event: trigger<br/>━━━━━━━━━━<br/>First watcher to fire wins"]
        CB_READY["anyio.Event: channel_b_ready<br/>━━━━━━━━━━<br/>Symmetric Channel B drain guard"]
        SID_READY["anyio.Event: stdout_session_id_ready<br/>━━━━━━━━━━<br/>Session ID from stdout JSONL"]
    end

    BARRIER["trigger.wait() + move_on_after(timeout)<br/>━━━━━━━━━━<br/>● process.py — awaits first race signal<br/>Drain window: move_on_after(completion_drain_timeout)"]
    CANCEL["tg.cancel_scope.cancel()<br/>━━━━━━━━━━<br/>Terminates all remaining tasks<br/>Including _run_monitor CancelScope"]

    subgraph PostHoc ["POST-HOC ANALYSIS  —  sequential, after task group exits"]
        direction TB
        STOP["● tracing_handle.stop()<br/>━━━━━━━━━━<br/>Cancels LinuxTracingHandle._monitor_cancel_scope<br/>Returns accumulated ProcSnapshot list"]
        RESULT["● SubprocessResult<br/>━━━━━━━━━━<br/>proc_snapshots: list[dict] | None<br/>tracked_comm: str|None  ← Issue #806 fix<br/>session_id · channel_confirmation · pid"]
        FLUSH["● flush_session_log()<br/>━━━━━━━━━━<br/>Writes proc_trace.jsonl to session dir"]
        ANOMALY["● detect_anomalies()<br/>━━━━━━━━━━<br/>Pure sync · post-hoc only<br/>OOM spike/critical · zombie · D-state<br/>high CPU · RSS growth · FDs"]
        DRIFT["● detect_identity_drift()<br/>━━━━━━━━━━<br/>Pure sync · checks each snapshot<br/>comm matches TraceTarget.comm"]
    end

    subgraph NewTests ["★ NEW INTEGRATION TESTS"]
        direction LR
        T_RESOLVER["★ test_trace_target_resolver.py<br/>━━━━━━━━━━<br/>test_resolve_trace_target_walks_from_wrapper_to_workload<br/>Asserts target.pid ≠ wrapper_pid<br/>test_resolve_trace_target_raises_on_miss<br/>Asserts TraceTargetResolutionError raised"]
        T_PTY["★ test_linux_tracing_pty_integration.py<br/>━━━━━━━━━━<br/>test_pty_wrapped_command_is_traced_as_grandchild<br/>Asserts peak RSS > 30,000 KB (not script wrapper 2 MB)<br/>test_pty_wrapped_tracing_produces_no_script_snapshots<br/>Asserts no comm='script' in proc_trace.jsonl"]
    end

    %% MAIN SEQUENTIAL FLOW %%
    START --> PTY_WRAP
    PTY_WRAP --> SPAWN
    SPAWN --> PTY_DEC
    PTY_DEC -->|"pty_mode=True"| RESOLVE
    PTY_DEC -->|"pty_mode=False"| DIRECT
    RESOLVE -->|"found within timeout"| TRACE_TARGET
    RESOLVE -->|"timeout exceeded"| RESOLVE_ERR
    DIRECT --> TRACE_TARGET
    TRACE_TARGET --> ENROLL
    ENROLL --> FORK

    %% FORK TO PARALLEL WATCHERS %%
    FORK --> W_PROC
    FORK --> W_HEART
    FORK --> W_SESS
    FORK --> W_IDLE
    FORK --> MONITOR

    %% WATCHERS FIRE TRIGGER %%
    W_PROC --> TRIGGER
    W_HEART --> TRIGGER
    W_SESS --> TRIGGER
    W_IDLE --> TRIGGER

    %% MONITOR ACCUMULATES (does not fire trigger) %%
    MONITOR -.->|"ProcSnapshot accumulation"| STOP

    %% RACE BARRIER %%
    TRIGGER --> BARRIER
    CB_READY -.->|"drain guard"| BARRIER
    SID_READY -.->|"drain guard"| BARRIER

    BARRIER --> CANCEL
    CANCEL --> STOP

    %% POST-HOC CHAIN %%
    STOP --> RESULT
    RESULT --> FLUSH
    FLUSH --> ANOMALY
    FLUSH --> DRIFT

    %% TEST COVERAGE (dashed) %%
    T_RESOLVER -.->|"exercises"| RESOLVE
    T_PTY -.->|"exercises"| MONITOR
    T_PTY -.->|"exercises"| FLUSH

    %% CLASS ASSIGNMENTS %%
    class START terminal;
    class PTY_WRAP,SPAWN handler;
    class PTY_DEC stateNode;
    class RESOLVE phase;
    class RESOLVE_ERR detector;
    class DIRECT phase;
    class TRACE_TARGET stateNode;
    class ENROLL output;
    class FORK detector;
    class W_PROC,W_HEART,W_SESS,W_IDLE handler;
    class MONITOR phase;
    class TRIGGER,CB_READY,SID_READY stateNode;
    class BARRIER,CANCEL detector;
    class STOP,RESULT output;
    class FLUSH,ANOMALY,DRIFT phase;
    class T_RESOLVER,T_PTY newComponent;
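The fork, trigger, and cancel sequence in the concurrency diagram can be sketched as follows. This uses the standard library's `asyncio` in place of anyio so the example runs without third-party packages; the function name `race_watchers` and the delays are illustrative assumptions, and each watcher body is reduced to a sleep.

```python
import asyncio


async def race_watchers(delays: dict[str, float]) -> str:
    """Run one watcher per entry; the first to finish sets the trigger and wins."""
    trigger = asyncio.Event()
    winner: list[str] = []

    async def watcher(name: str, delay: float) -> None:
        # The sleep stands in for "await subprocess exit", "poll JSONL", and so on.
        await asyncio.sleep(delay)
        if not trigger.is_set():  # only the first watcher records itself
            winner.append(name)
            trigger.set()

    tasks = [asyncio.create_task(watcher(n, d)) for n, d in delays.items()]
    try:
        await trigger.wait()  # barrier: wait for the first race signal
    finally:
        for task in tasks:  # rough equivalent of cancelling the task group scope
            task.cancel()
        await asyncio.gather(*tasks, return_exceptions=True)
    return winner[0]


if __name__ == "__main__":
    print(asyncio.run(race_watchers({"_watch_process": 0.05, "_watch_stdout_idle": 0.5})))
```

The monitor task from the diagram is deliberately absent here: it accumulates snapshots but never sets the trigger, so it only needs to be cancelled along with the losing watchers.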

Process Flow Diagram

%%{init: {'flowchart': {'nodeSpacing': 40, 'rankSpacing': 50, 'curve': 'basis'}}}%%
flowchart TB
    %% CLASS DEFINITIONS %%
    classDef terminal fill:#1a237e,stroke:#7986cb,stroke-width:2px,color:#fff;
    classDef stateNode fill:#004d40,stroke:#4db6ac,stroke-width:2px,color:#fff;
    classDef handler fill:#e65100,stroke:#ffb74d,stroke-width:2px,color:#fff;
    classDef phase fill:#6a1b9a,stroke:#ba68c8,stroke-width:2px,color:#fff;
    classDef output fill:#00695c,stroke:#4db6ac,stroke-width:2px,color:#fff;
    classDef detector fill:#b71c1c,stroke:#ef5350,stroke-width:2px,color:#fff;

    %% TERMINALS %%
    START([START])
    COMPLETE([COMPLETE])
    ERROR([ERROR — no fallback])

    %% ── PHASE 1: SESSION ORCHESTRATION ── %%
    subgraph Orchestration ["Phase 1 — Session Orchestration (● headless.py)"]
        direction TB
        RHC["● run_headless_core<br/>━━━━━━━━━━<br/>Entry: skill + LinuxTracingConfig"]
        PTYMode["pty_mode = True<br/>━━━━━━━━━━<br/>Hardcoded for all headless sessions"]
    end

    %% ── PHASE 2: PTY PROCESS SPAWN ── %%
    subgraph Spawn ["Phase 2 — PTY Process Spawn (● process.py)"]
        direction TB
        CaptureBasename["● Capture _workload_basename<br/>━━━━━━━━━━<br/>Path(cmd[0]).name BEFORE PTY wrap<br/>(#806 guard — preserves real target name)"]
        WrapCmd["pty_wrap_command(cmd)<br/>━━━━━━━━━━<br/>Rewrites cmd → script(1) wrapper<br/>cmd[0] is now the PTY shim"]
        OpenProcess["anyio.open_process(cmd)<br/>━━━━━━━━━━<br/>Spawns; proc.pid = script(1) PID<br/>(NOT the claude workload)"]
    end

    %% ── PHASE 3: PID RECTIFICATION ── %%
    subgraph Rectification ["Phase 3 — PTY PID Rectification (● process.py + ● linux_tracing.py)"]
        direction TB
        TracingEnabled{"linux_tracing_config<br/>not None?"}
        PtyLinux{"pty_mode AND<br/>LINUX_TRACING_AVAILABLE?"}
        ResolveTarget["● resolve_trace_target<br/>━━━━━━━━━━<br/>root_pid=proc.pid, basename=workload<br/>Poll children() every 50 ms, timeout=2s<br/>Match name or cmdline[0] basename"]
        FoundTarget{"workload child<br/>found in time?"}
        TraceTargetDirect["● trace_target_from_pid<br/>━━━━━━━━━━<br/>Non-PTY: direct /proc read<br/>Builds TraceTarget; never raises"]
        RawPID["_observed_pid = proc.pid<br/>━━━━━━━━━━<br/>Tracing disabled path<br/>Raw wrapper PID used"]
        RectifiedPID["● _observed_pid = target.pid<br/>━━━━━━━━━━<br/>Rectified workload PID<br/>_tracked_comm = target.comm"]
        RTError["● TraceTargetResolutionError<br/>━━━━━━━━━━<br/>No silent fallback to wrapper PID<br/>(fallback would recreate #806)"]
    end

    %% ── PHASE 4: EXECUTION RACE ── %%
    subgraph ExecRace ["Phase 4 — Execution Race (● process.py)"]
        direction TB
        StartTracing["start_linux_tracing(target)<br/>━━━━━━━━━━<br/>target must be TraceTarget<br/>TypeError guard blocks raw int PIDs"]
        Race["Race Watchers<br/>━━━━━━━━━━<br/>_watch_process | _watch_heartbeat<br/>_watch_session_log | _watch_stdout_idle<br/>First to signal trigger wins"]
        ResolveTerm{"resolve_termination<br/>NATURAL_EXIT?"}
        KillTree["async_kill_process_tree<br/>━━━━━━━━━━<br/>Reaps PTY wrapper + workload<br/>for TIMED_OUT / STALL / STALE"]
        SubprocResult["● SubprocessResult<br/>━━━━━━━━━━<br/>pid=_observed_pid (rectified)<br/>tracked_comm=_tracked_comm"]
    end

    %% ── PHASE 5: SESSION LOGGING ── %%
    subgraph SessionLog ["Phase 5 — Session Logging (● session_log.py + ● anomaly_detection.py)"]
        direction TB
        FlushLog["● flush_session_log<br/>━━━━━━━━━━<br/>tracked_comm=result.tracked_comm<br/>Propagated from SubprocessResult"]
        ModalComm["● Modal comm resolution<br/>━━━━━━━━━━<br/>Compute from snapshots if absent<br/>Sets tracked_comm_drift flag"]
        DetectAnomalies["● detect_anomalies<br/>━━━━━━━━━━<br/>OOM / zombie / D-state<br/>CPU / FD / RSS thresholds"]
        IdentityDrift["● detect_identity_drift<br/>━━━━━━━━━━<br/>Emit IDENTITY_DRIFT CRITICAL<br/>if any snap.comm != expected_comm<br/>(immunity check for #806 regression)"]
        WriteArtifacts["● Write session artifacts<br/>━━━━━━━━━━<br/>summary.json (tracer_version=2)<br/>sessions.jsonl (tracked_comm)<br/>anomalies.jsonl (if non-empty)"]
    end

    %% ── PHASE 6: CRASH RECOVERY ── %%
    subgraph CrashRecovery ["Phase 6 — Crash Recovery (● session_log.py)"]
        direction TB
        RecoverScan["recover_crashed_sessions<br/>━━━━━━━━━━<br/>Scan tmpfs sidecars (age > 30s)"]
        Gate1{"Gate 1: enrollment<br/>sidecar exists?"}
        Gate2{"Gate 2: boot_id<br/>matches current boot?"}
        Gate3{"Gate 3: PID dead or<br/>starttime_ticks mismatch?"}
        Gate4{"● Gate 4: snap[0].comm<br/>non-empty AND != 'claude'?"}
        AlienReject["● Alien file rejection<br/>━━━━━━━━━━<br/>PTY wrapper trace detected<br/>Delete sidecar files, skip recovery<br/>(comm = script/pty shim name)"]
        RecoverFlush["flush_session_log<br/>━━━━━━━━━━<br/>subtype='crashed', exit_code=-1<br/>termination='CRASHED'"]
    end

    %% ── MAIN FLOW ── %%
    START --> RHC
    RHC --> PTYMode
    PTYMode --> CaptureBasename
    CaptureBasename --> WrapCmd
    WrapCmd --> OpenProcess
    OpenProcess --> TracingEnabled
    TracingEnabled -->|"YES"| PtyLinux
    TracingEnabled -->|"NO"| RawPID
    PtyLinux -->|"YES: PTY + Linux"| ResolveTarget
    PtyLinux -->|"NO: non-PTY or unavailable"| TraceTargetDirect
    ResolveTarget --> FoundTarget
    FoundTarget -->|"YES: TraceTarget built"| RectifiedPID
    FoundTarget -->|"NO: deadline exceeded"| RTError
    TraceTargetDirect --> RectifiedPID
    RawPID --> StartTracing
    RTError --> ERROR
    RectifiedPID --> StartTracing
    StartTracing --> Race
    Race --> ResolveTerm
    ResolveTerm -->|"YES: process exited cleanly"| SubprocResult
    ResolveTerm -->|"NO: TIMED_OUT / STALL / STALE"| KillTree
    KillTree --> SubprocResult
    SubprocResult --> FlushLog
    FlushLog --> ModalComm
    ModalComm --> DetectAnomalies
    DetectAnomalies --> IdentityDrift
    IdentityDrift --> WriteArtifacts
    WriteArtifacts --> COMPLETE

    %% ── CRASH RECOVERY FLOW (parallel operational path) ── %%
    RecoverScan --> Gate1
    Gate1 -->|"NO"| SKIP_G1["skip"]
    Gate1 -->|"YES"| Gate2
    Gate2 -->|"NO: stale boot"| SKIP_G2["delete + skip"]
    Gate2 -->|"YES"| Gate3
    Gate3 -->|"NO: still alive"| SKIP_G3["still running, skip"]
    Gate3 -->|"YES: process gone"| Gate4
    Gate4 -->|"YES: alien comm"| AlienReject
    Gate4 -->|"NO: claude or empty comm"| RecoverFlush
    AlienReject --> SKIP_G4["deleted, skipped"]
    RecoverFlush --> COMPLETE

    %% ── CLASS ASSIGNMENTS ── %%
    class START,COMPLETE,ERROR terminal;
    class SKIP_G1,SKIP_G2,SKIP_G3,SKIP_G4 terminal;
    class RHC,CaptureBasename,WrapCmd,OpenProcess phase;
    class PTYMode stateNode;
    class TracingEnabled,PtyLinux,FoundTarget,ResolveTerm stateNode;
    class ResolveTarget,TraceTargetDirect handler;
    class RectifiedPID,RawPID stateNode;
    class RTError detector;
    class StartTracing,Race,KillTree handler;
    class SubprocResult stateNode;
    class FlushLog,ModalComm,DetectAnomalies handler;
    class IdentityDrift detector;
    class WriteArtifacts output;
    class RecoverScan phase;
    class Gate1,Gate2,Gate3,Gate4 stateNode;
    class AlienReject detector;
    class RecoverFlush handler;
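The "TypeError guard blocks raw int PIDs" step in Phase 4 can be illustrated with a frozen dataclass and an `isinstance` check. This is a simplified sketch: the field set follows the diagrams, while `start_tracing_sketch` is a hypothetical stand-in whose signature and return value are not those of the real `start_linux_tracing`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TraceTarget:
    """Identity of the workload process (fields as named in the diagrams)."""

    pid: int
    comm: str
    cmdline: tuple[str, ...]
    starttime_ticks: int


def start_tracing_sketch(target: TraceTarget) -> str:
    """Hypothetical stand-in: reject anything that is not a resolved TraceTarget."""
    if not isinstance(target, TraceTarget):
        raise TypeError(
            f"expected TraceTarget, got {type(target).__name__}; "
            "resolve the workload process first"
        )
    return f"tracing pid={target.pid} comm={target.comm}"


target = TraceTarget(pid=4242, comm="claude", cmdline=("claude", "-p"), starttime_ticks=123456)
```

Calling `start_tracing_sketch(target)` succeeds, while `start_tracing_sketch(4242)` raises `TypeError`. Because the dataclass is frozen, assigning to `target.pid` after construction also fails, which keeps the resolved identity from being swapped for a wrapper PID later.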

State Lifecycle Diagram

%%{init: {'flowchart': {'nodeSpacing': 50, 'rankSpacing': 65, 'curve': 'basis'}}}%%
flowchart TB
    %% CLASS DEFINITIONS %%
    classDef cli fill:#1a237e,stroke:#7986cb,stroke-width:2px,color:#fff;
    classDef stateNode fill:#004d40,stroke:#4db6ac,stroke-width:2px,color:#fff;
    classDef handler fill:#e65100,stroke:#ffb74d,stroke-width:2px,color:#fff;
    classDef phase fill:#6a1b9a,stroke:#ba68c8,stroke-width:2px,color:#fff;
    classDef newComponent fill:#2e7d32,stroke:#81c784,stroke-width:2px,color:#fff;
    classDef output fill:#00695c,stroke:#4db6ac,stroke-width:2px,color:#fff;
    classDef detector fill:#b71c1c,stroke:#ef5350,stroke-width:2px,color:#fff;
    classDef gap fill:#ff6f00,stroke:#ffa726,stroke-width:2px,color:#000;
    classDef terminal fill:#1a237e,stroke:#7986cb,stroke-width:2px,color:#fff;

    %% TERMINALS %%
    START([PROCESS SPAWN])
    CRASH_END([CRASH RECOVERY COMPLETE])
    STOP_END([SESSION STOP])

    %% ──────────────────────────────────────── %%
    subgraph PIDResolution ["● PTY WRAPPER PID RESOLUTION  ·  linux_tracing.py"]
        PTYGate{"PTY mode?"}
        ResolvePTY["● resolve_trace_target<br/>━━━━━━━━━━<br/>Walk descendants at 50ms polls<br/>Match expected_basename<br/>Capture comm + cmdline + starttime_ticks<br/>HARD FAIL on timeout — NO wrapper fallback"]
        ResolveDirect["trace_target_from_pid<br/>━━━━━━━━━━<br/>Direct /proc read<br/>Returns empty strings on OSError<br/>Never raises"]
        ResolutionError["TraceTargetResolutionError<br/>━━━━━━━━━━<br/>root_pid + expected_basename carried<br/>Cannot trace: wrapper PID forbidden<br/>Abort — recreates #806 otherwise"]
    end

    %% ──────────────────────────────────────── %%
    subgraph InitOnlyFields ["INIT_ONLY FIELDS  ·  frozen=True  ·  linux_tracing.py ● / _type_subprocess.py ●"]
        TraceTarget["● TraceTarget  INIT_ONLY<br/>━━━━━━━━━━<br/>pid: workload PID only (not script wrapper)<br/>comm: /proc/pid/comm (max 15 chars)<br/>cmdline: tuple[str,...] immutable<br/>starttime_ticks: PID-reuse guard<br/>resolved_at: UTC datetime"]
        EnrollRecord["● TraceEnrollmentRecord  INIT_ONLY<br/>━━━━━━━━━━<br/>schema_version: 2 (post-#806)<br/>pid, boot_id, starttime_ticks<br/>session_id, enrolled_at, kitchen_id<br/>comm: new field — alien file rejection"]
        ProcSnap["● ProcSnapshot  INIT_ONLY<br/>━━━━━━━━━━<br/>comm: /proc/pid/comm  NEW field (#806)<br/>captured_at: unique UTC str (monotonic)<br/>vm_rss_kb, threads, fd_count, fd_soft_limit<br/>sig_pnd/blk/cgt, oom_score, wchan<br/>ctx_switches_voluntary/involuntary, cpu_percent"]
    end

    %% ──────────────────────────────────────── %%
    subgraph ContractGates ["CONTRACT ENFORCEMENT GATES  ·  linux_tracing.py ●"]
        TypeCheck{"isinstance(target,<br/>TraceTarget)?"}
        TypeFail["TypeError  HARD FAIL<br/>━━━━━━━━━━<br/>ARCH-008 / issue #806<br/>Raw int PID rejected explicitly<br/>No recovery path"]
        PlatformGate{"Linux + enabled<br/>+ tg not None?"}
        NoopReturn["Returns None<br/>━━━━━━━━━━<br/>Platform/config gate<br/>Tracing silently disabled"]
        MonoGate["Monotonic Timestamp Gate<br/>━━━━━━━━━━<br/>proc_monitor ensures captured_at<br/>strictly increases per snapshot<br/>+1μs applied on NTP/WSL2 regression"]
    end

    %% ──────────────────────────────────────── %%
    subgraph MutableHandle ["MUTABLE STATE  ·  LinuxTracingHandle  ·  linux_tracing.py ●"]
        HandleInit["Handle constructed<br/>━━━━━━━━━━<br/>_snapshots: []<br/>_trace_path: None<br/>_trace_file: None<br/>_enrollment_path: None<br/>_monitor_cancel_scope: None"]
        PostInit["Post-init mutation<br/>━━━━━━━━━━<br/>_monitor_cancel_scope set by start()<br/>_trace_path, _trace_file opened in tmpfs<br/>_enrollment_path written atomically<br/>All set inside start_linux_tracing()"]
        TraceDegraded["Degraded mode<br/>━━━━━━━━━━<br/>tmpfs unavailable → _trace_path=None<br/>write error → _trace_file closed/None<br/>Snapshots accumulate in-memory only"]
        HandleStop["stop() teardown<br/>━━━━━━━━━━<br/>Cancel scope cancelled<br/>_trace_file flushed + closed → None<br/>trace JSONL deleted from tmpfs<br/>enrollment sidecar deleted from tmpfs"]
    end

    %% ──────────────────────────────────────── %%
    subgraph RaceTransition ["MUTABLE → FROZEN TRANSITION  ·  process.py ●"]
        RaceAcc["● RaceAccumulator  MUTABLE<br/>━━━━━━━━━━<br/>process_exited, process_returncode<br/>channel_a_confirmed, channel_b_status<br/>channel_b_session_id, stdout_session_id, idle_stall<br/>Each field written exactly once by one coroutine"]
        RaceSignals["RaceSignals  INIT_ONLY after freeze<br/>━━━━━━━━━━<br/>Produced by to_race_signals()<br/>Immutable result consumed by resolve_termination()<br/>basis for SubprocessResult construction"]
    end

    %% ──────────────────────────────────────── %%
    subgraph ResultFields ["● SubprocessResult  MIXED-LIFECYCLE  ·  _type_subprocess.py ●"]
        ResultInit["Required at construction<br/>━━━━━━━━━━<br/>returncode, stdout, stderr<br/>termination, pid (workload PID post-#806)"]
        ResultPostSet["● Set post-construction by headless.py<br/>━━━━━━━━━━<br/>start_ts, end_ts: ISO timestamps<br/>elapsed_seconds: monotonic pre-computed<br/>proc_snapshots: list[dict] or None<br/>tracked_comm: TraceTarget.comm propagated<br/>channel_confirmation, session_id"]
    end

    %% ──────────────────────────────────────── %%
    subgraph CrashRecovery ["RESUME DETECTION  ·  recover_crashed_sessions()  ·  session_log.py ●"]
        AgeGate["Gate 1: Age Guard<br/>━━━━━━━━━━<br/>mtime within 30s → SKIP<br/>Active session protection"]
        EnrollGate["Gate 2: Enrollment Sidecar<br/>━━━━━━━━━━<br/>read_enrollment() = None → SKIP<br/>Alien/test file rejection"]
        BootGate["Gate 3: Boot ID Match<br/>━━━━━━━━━━<br/>boot_id mismatch → DELETE BOTH<br/>Pre-reboot stale rejection"]
        PIDGate["Gate 4: PID Liveness + starttime_ticks<br/>━━━━━━━━━━<br/>alive + ticks match → SKIP (still running)<br/>alive + ticks differ → CRASH (PID recycled)<br/>not alive → CRASH"]
        CommGate["● Gate 5: comm Alien Rejection  NEW<br/>━━━━━━━━━━<br/>snapshots[0].comm non-empty AND != claude<br/>→ DELETE trace + enrollment<br/>Pre-fix (no comm field) → RECOVER as-is"]
        CrashFlush["flush_session_log  CRASH path<br/>━━━━━━━━━━<br/>subtype=crashed, exit_code=-1<br/>detect_identity_drift() appended<br/>Deletes trace + enrollment after write"]
    end

    %% ──────────────────────────────────────── %%
    subgraph WriteOnlyArtifacts ["WRITE-ONLY SESSION ARTIFACTS  ·  session_log.py ●"]
        SummaryJson["summary.json<br/>━━━━━━━━━━<br/>Atomic write (tempfile+replace)<br/>tracer_target_resolution_version=2<br/>tracked_comm_drift flag<br/>peak_rss_kb, peak_oom_score, peak_fd_ratio"]
        ProcTrace["proc_trace.jsonl<br/>━━━━━━━━━━<br/>Every ProcSnapshot line-buffered<br/>Survives crash (stop() not called)<br/>Deleted after recovery or stop()"]
        SessionsJsonl["sessions.jsonl<br/>━━━━━━━━━━<br/>Append-only index (known race)<br/>Retention sweeps at 500 sessions<br/>WRITE-ONLY for logic decisions"]
        AnomaliesJsonl["● anomalies.jsonl<br/>━━━━━━━━━━<br/>detect_anomalies() output<br/>detect_identity_drift() — IDENTITY_DRIFT kind<br/>WRITE-ONLY — never read for decisions"]
    end

    %% ──────────────────────────────────────── %%
    %% NEW TEST NODES %%
    PTYIntegTest["★ test_linux_tracing_pty_integration.py<br/>━━━━━━━━━━<br/>No snapshot has comm=script<br/>max(vm_rss_kb) > 30,000 KB<br/>proc_trace.jsonl: comm on every row"]
    ResolverTest["★ test_trace_target_resolver.py<br/>━━━━━━━━━━<br/>resolved pid != wrapper pid<br/>TraceTargetResolutionError on miss<br/>error carries root_pid + expected_basename"]

    %% ──────────────────────────────────────── %%
    %% FLOW CONNECTIONS %%

    START --> PTYGate
    PTYGate -- "PTY=True" --> ResolvePTY
    PTYGate -- "PTY=False" --> ResolveDirect
    ResolvePTY -- "timeout" --> ResolutionError
    ResolvePTY -- "match found" --> TraceTarget
    ResolveDirect --> TraceTarget

    TraceTarget --> EnrollRecord
    TraceTarget --> TypeCheck

    TypeCheck -- "raw int PID" --> TypeFail
    TypeCheck -- "TraceTarget OK" --> PlatformGate
    PlatformGate -- "disabled/non-Linux" --> NoopReturn
    PlatformGate -- "enabled" --> HandleInit
    HandleInit --> PostInit
    PostInit -- "tmpfs ok" --> MonoGate
    PostInit -- "tmpfs missing" --> TraceDegraded
    MonoGate --> ProcSnap
    TraceDegraded --> ProcSnap

    ProcSnap --> RaceAcc
    ProcSnap --> ProcTrace
    RaceAcc --> RaceSignals
    RaceSignals --> ResultInit
    ResultInit --> ResultPostSet

    PostInit -- "enrolled_at" --> EnrollRecord
    ResultPostSet --> HandleStop
    HandleStop --> STOP_END

    %% Crash recovery path %%
    ProcTrace -- "process crashes (stop not called)" --> AgeGate
    AgeGate -- "age > 30s" --> EnrollGate
    EnrollGate -- "sidecar exists" --> BootGate
    BootGate -- "boot_id match" --> PIDGate
    PIDGate -- "crash detected" --> CommGate
    CommGate -- "comm=claude or legacy" --> CrashFlush
    CrashFlush --> SummaryJson
    CrashFlush --> SessionsJsonl
    CrashFlush --> AnomaliesJsonl
    CrashFlush --> CRASH_END

    %% Normal session log flush %%
    HandleStop --> SummaryJson
    HandleStop --> SessionsJsonl

    %% Test coverage arrows %%
    ResolvePTY -. "tested by" .-> PTYIntegTest
    ProcSnap -. "comm invariant" .-> PTYIntegTest
    TraceTarget -. "tested by" .-> ResolverTest
    ResolutionError -. "hard-fail contract" .-> ResolverTest

    %% CLASS ASSIGNMENTS %%
    class START,CRASH_END,STOP_END terminal;
    class PTYGate,TypeCheck,PlatformGate phase;
    class ResolvePTY,ResolveDirect handler;
    class ResolutionError,TypeFail detector;
    class TraceTarget,EnrollRecord,ProcSnap stateNode;
    class HandleInit,PostInit,TraceDegraded,HandleStop,MonoGate handler;
    class RaceAcc handler;
    class RaceSignals stateNode;
    class ResultInit,ResultPostSet stateNode;
    class AgeGate,EnrollGate,BootGate,PIDGate,CommGate detector;
    class CrashFlush handler;
    class SummaryJson,ProcTrace,SessionsJsonl,AnomaliesJsonl output;
    class PTYIntegTest,ResolverTest newComponent;
    class NoopReturn,TraceDegraded gap;
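The five recovery gates in the state lifecycle diagram amount to an ordered decision function. The sketch below expresses that ordering as pure Python; `SidecarState` and `classify_sidecar` are hypothetical names, and the real `recover_crashed_sessions()` reads these inputs from tmpfs files and `/proc` rather than from a prepared record.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SidecarState:
    """Inputs to the recovery decision (hypothetical container for this sketch)."""

    age_seconds: float
    enrolled: bool  # enrollment sidecar could be read
    boot_id_matches: bool
    pid_alive: bool
    starttime_ticks_match: bool
    first_snapshot_comm: Optional[str]  # None or "" for pre-fix traces without comm


def classify_sidecar(state: SidecarState, expected_comm: str = "claude") -> str:
    """Return 'skip', 'delete', or 'recover', applying the five gates in order."""
    if state.age_seconds <= 30:
        return "skip"  # Gate 1: possibly an active session
    if not state.enrolled:
        return "skip"  # Gate 2: alien or test file without enrollment
    if not state.boot_id_matches:
        return "delete"  # Gate 3: stale trace from before a reboot
    if state.pid_alive and state.starttime_ticks_match:
        return "skip"  # Gate 4: the same process is still running
    comm = state.first_snapshot_comm
    if comm and comm != expected_comm:
        return "delete"  # Gate 5: trace describes another process, e.g. the PTY wrapper
    return "recover"  # crashed session; legacy traces without comm are kept
```

For example, a stale sidecar whose PID is dead and whose first snapshot has `comm="script"` is classified as `"delete"`, while the same sidecar with `comm="claude"` or with no comm at all is classified as `"recover"`.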

Closes #806

Implementation Plan

Plan file: /home/talon/projects/autoskillit-runs/remediation-806-20260413-075454-098046/.autoskillit/temp/rectify/rectify_pty_wrapper_tracer_pid_806_2026-04-13_110443.md

🤖 Generated with Claude Code via AutoSkillit

Token Usage Summary

| Step | uncached | output | cache_read | cache_write | count | time |
|---|---|---|---|---|---|---|
| investigate | 2.3k | 19.9k | 1.3M | 113.8k | 1 | 31m 1s |
| rectify | 3.4k | 24.7k | 540.5k | 76.0k | 1 | 9m 1s |
| review | 78 | 10.2k | 289.0k | 48.7k | 1 | 17m 29s |
| dry_walkthrough | 168 | 16.0k | 1.0M | 56.1k | 1 | 6m 33s |
| implement | 454 | 60.8k | 6.4M | 166.7k | 1 | 18m 30s |
| retry_worktree | 624 | 34.8k | 6.3M | 103.9k | 1 | 19m 7s |
| prepare_pr | 100 | 8.1k | 350.0k | 27.8k | 1 | 2m 29s |
| run_arch_lenses | 3.1k | 35.2k | 598.4k | 115.0k | 3 | 13m 45s |
| compose_pr | 67 | 12.8k | 251.8k | 40.6k | 1 | 3m 14s |
| Total | 10.3k | 222.6k | 17.1M | 748.6k | | 2h 1m |


@Trecek Trecek left a comment


AutoSkillit PR Review — Verdict: changes_requested

)

# Skip the entire module when script(1) is absent; no stub needed.
pytestmark_script = pytest.mark.skipif(

[critical] tests: Double pytestmark assignment — the module-level pytestmark on line 21 (Linux platform guard) is silently overwritten by the second pytestmark = pytest.mark.skipif(shutil.which('script') is None, ...) assignment here. Only the script(1) availability check survives; the Linux-only guard is lost. Tests in this module will execute on non-Linux platforms and fail. Fix: use a list: pytestmark = [pytest.mark.skipif(sys.platform != 'linux', reason='Linux only — tests PTY wrapping + /proc tracing'), pytest.mark.skipif(shutil.which('script') is None, reason='script(1) not available on this system')]


Investigated — this is intentional. Line 27 assigns to pytestmark_script (a distinct variable name), NOT pytestmark. The module-level pytestmark on line 21 (Linux platform guard) is NOT overwritten. pytestmark_script is a separate per-test decorator applied via @pytestmark_script at lines 40 and 94. Both guards coexist: the Linux guard applies module-wide, the script(1) guard applies per-test via decorator.

Comment thread src/autoskillit/execution/session_log.py Outdated
Comment thread src/autoskillit/execution/session_log.py
Comment thread tests/execution/test_session_log.py Outdated

@pytest.mark.anyio
@pytest.mark.skipif(
__import__("shutil").which("script") is None,

[warning] tests: test_peak_rss_kb_above_sanity_floor skips when script(1) is absent but has no Linux-platform guard. On macOS/Windows, LINUX_TRACING_AVAILABLE is False so proc_snapshots will be None, causing the assert result.proc_snapshots is not None to fire with a confusing message rather than a clean skip. Add @pytest.mark.skipif(sys.platform != 'linux', reason='Linux only') alongside the existing script check.


Investigated — this is intentional. The file has pytestmark = pytest.mark.skipif(sys.platform != 'linux', reason='Linux only') at line 14 (module-level), which applies to ALL tests in the module including test_peak_rss_kb_above_sanity_floor. On macOS/Windows, the entire module is skipped before any test body executes, so the assert result.proc_snapshots is not None line is never reached.

Comment thread src/autoskillit/execution/anomaly_detection.py Outdated
root_proc = psutil.Process(root_pid)
children = root_proc.children(recursive=True)
except (psutil.NoSuchProcess, psutil.AccessDenied, psutil.ZombieProcess):
break

[warning] defense: In resolve_trace_target, when psutil raises NoSuchProcess/AccessDenied/ZombieProcess on root_proc.children(), the loop breaks immediately and raises TraceTargetResolutionError. A single transient OS error causes permanent resolution failure with no retry. Replace break with time.sleep(0.05); continue so the polling loop can recover within the deadline.


Investigated — this is intentional. The except (NoSuchProcess, AccessDenied, ZombieProcess): break fires when root_proc.children() fails because the root process (root_pid, i.e. the script(1) wrapper) has disappeared. If the root is gone, all its descendants are gone too — continue would immediately re-fail on the next psutil.Process(root_pid) call, consuming the polling window uselessly. break + raise TraceTargetResolutionError is correct when the root disappears. Individual child errors are already handled by the inner except at line 174 with continue.
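The distinction drawn in this thread, between `break` when the root disappears and `continue` when a single child cannot be read, can be sketched as a polling loop. This is a minimal illustration, not the project's `resolve_trace_target`: the descendant lookup is injected as a callable so the example needs no psutil, and `RootGone`, `ResolutionError`, and `resolve_workload_pid` are hypothetical names.

```python
import time
from typing import Callable, Iterable, Optional


class RootGone(Exception):
    """Raised by list_descendants when the root (wrapper) process has vanished."""


class ResolutionError(Exception):
    """Stand-in for TraceTargetResolutionError in this sketch."""


def resolve_workload_pid(
    root_pid: int,
    expected_basename: str,
    list_descendants: Callable[[int], Iterable[tuple[int, Optional[str]]]],
    timeout: float = 2.0,
    interval: float = 0.05,
) -> int:
    """Poll descendants of root_pid until one is named expected_basename."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            descendants = list(list_descendants(root_pid))
        except RootGone:
            break  # root wrapper vanished: stop polling, as the reply above argues
        for pid, name in descendants:
            if name is None:
                continue  # this child could not be read; others may still match
            if name == expected_basename:
                return pid
        time.sleep(interval)
    # Hard failure: never fall back to the wrapper PID.
    raise ResolutionError(f"no '{expected_basename}' descendant under pid {root_pid}")
```

An empty descendant list is treated as "not spawned yet" and retried until the deadline, which is a different condition from the root having exited.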

Comment thread tests/execution/test_session_log_integration.py Outdated

# Anomaly detection
# Compute effective tracked_comm from snapshots if not provided by caller
if _effective_tracked_comm is None:
Copy link
Copy Markdown
Collaborator Author

[warning] cohesion: Drift detection uses two parallel mechanisms in the same flush path: (1) inline modal-comm computation + set-cardinality check produces _tracked_comm_drift boolean (lines 164-178), and (2) detect_identity_drift() produces IDENTITY_DRIFT anomaly records. Both detect mismatched comm values but through different algorithms with different outputs. Consider collapsing the boolean drift flag into the anomaly detection path.

Collaborator Author

Valid observation — flagged for design decision. The inline modal-comm computation (lines 164-178) and detect_identity_drift() serve different consumers: the boolean _tracked_comm_drift feeds into the session summary JSON, while IDENTITY_DRIFT anomaly records go to anomalies.jsonl. Consolidating them requires deciding whether the boolean flag is redundant given anomaly records, which is a design trade-off for a future pass.
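To make the two mechanisms concrete, here is a rough sketch of how they could differ. The function names, record fields, and exact logic are assumptions for illustration; only the general shape (a modal-comm boolean versus per-snapshot IDENTITY_DRIFT records) comes from the thread above.

```python
from collections import Counter
from typing import Optional


def modal_comm_and_drift(comms: list[str]) -> tuple[Optional[str], bool]:
    """Boolean-flag path: the most common comm, plus whether more than one comm appears."""
    observed = [c for c in comms if c]
    if not observed:
        return None, False
    modal = Counter(observed).most_common(1)[0][0]
    return modal, len(set(observed)) > 1


def identity_drift_records(comms: list[str], expected: str) -> list[dict]:
    """Anomaly-record path: one record per snapshot whose comm differs from expected."""
    return [
        {"kind": "IDENTITY_DRIFT", "index": i, "comm": c, "expected": expected}
        for i, c in enumerate(comms)
        if c and c != expected
    ]
```

Under these assumptions the two can disagree: a trace whose snapshots all report `script` has no set-cardinality drift, yet every row differs from an expected comm of `claude`. That is one reason consolidation needs a deliberate decision rather than a mechanical merge.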

Comment thread src/autoskillit/execution/process.py Outdated
Collaborator Author

@Trecek Trecek left a comment

AutoSkillit review found 11 blocking issues (3 critical, 8 warning). See inline comments.

@Trecek Trecek added this pull request to the merge queue Apr 13, 2026
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to a conflict with the base branch Apr 13, 2026
Trecek and others added 11 commits April 13, 2026 16:07
Adds 11 new tests and the ARCH-008 rule that currently fail against HEAD:

Test 1.1: PTY-wrapped run_managed_async is traced as workload, not script(1)
Test 1.2: ProcSnapshot.comm field populated from /proc/{pid}/comm
Test 1.3: resolve_trace_target walks descendants to find workload
Test 1.4: resolve_trace_target raises TraceTargetResolutionError on miss
Test 1.5: start_linux_tracing signature requires TraceTarget, not raw int
Test 1.6: peak_rss_kb > 30_000 sanity floor for 60 MB workload
Test 1.7: anomaly detection liveness canary fires on OOM_CRITICAL stream
Test 1.8: detect_identity_drift fires when comm != expected_comm
Test 1.9: ARCH-008 AST rule forbids proc.pid as start_linux_tracing target
Test 1.10: proc_trace.jsonl rows include comm field for self-identification
Test 1.11: recover_crashed_sessions excludes alien (non-claude) trace files

ARCH-008 (no-raw-pid-to-start-linux-tracing) added to _rules.py, detection
logic added to ArchitectureViolationVisitor.visit_Call in _helpers.py, and
expected_ids updated in test_registry.py.
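
As an illustration of what an AST check like ARCH-008 might look like, the sketch below flags calls to start_linux_tracing whose target is a bare `.pid` attribute. It is a simplified stand-in, not the visitor in _helpers.py; `RawPidVisitor` and `find_raw_pid_calls` are hypothetical names.

```python
import ast


class RawPidVisitor(ast.NodeVisitor):
    """Collect line numbers where start_linux_tracing receives a raw `.pid` attribute."""

    def __init__(self) -> None:
        self.violations: list[int] = []

    def visit_Call(self, node: ast.Call) -> None:
        func = node.func
        # Handle both `start_linux_tracing(...)` and `module.start_linux_tracing(...)`.
        name = func.id if isinstance(func, ast.Name) else getattr(func, "attr", None)
        if name == "start_linux_tracing":
            # The target is either the first positional argument or `target=`.
            candidates = node.args[:1] + [
                kw.value for kw in node.keywords if kw.arg == "target"
            ]
            for arg in candidates:
                if isinstance(arg, ast.Attribute) and arg.attr == "pid":
                    self.violations.append(node.lineno)
        self.generic_visit(node)


def find_raw_pid_calls(source: str) -> list[int]:
    """Parse source and return the line numbers of violating calls."""
    visitor = RawPidVisitor()
    visitor.visit(ast.parse(source))
    return visitor.violations
```

A call such as `start_linux_tracing(proc.pid)` would be reported, while a call whose argument is itself a call expression would not.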

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Introduces TraceTarget newtype, resolve_trace_target() resolver, comm field
on ProcSnapshot, and ARCH-008 AST guard to eliminate silent wrong-process
observation when anyio.open_process() returns the script(1) wrapper PID.

- Add TraceTarget frozen dataclass (pid, comm, cmdline, starttime_ticks)
- Add resolve_trace_target() to walk descendants and find workload by basename
- Add trace_target_from_pid() for non-PTY mode (direct child = workload)
- Add TraceTargetResolutionError — no silent fallback to wrapper PID
- Change start_linux_tracing(pid: int) → start_linux_tracing(target: TraceTarget)
- Add comm field to ProcSnapshot for self-identifying snapshots
- Add detect_identity_drift() anomaly detector for post-hoc PTY drift
- Wire resolve_trace_target into run_managed_async after open_process
- Extend SubprocessResult with tracked_comm; propagate through headless.py
- Extend flush_session_log with tracked_comm, tracked_comm_drift, schema v2
- Extend _format_diagnostics_section to surface tracked_comm in GitHub bodies
- Add comm-based alien file rejection in recover_crashed_sessions
- Update all existing tests to use TraceTarget via trace_target_from_pid
- Fix rfind-based starttime_ticks parsing for comm containing ")"
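
Two items in the list above, the frozen TraceTarget dataclass and the rfind-based parsing, can be sketched together. The field names come from the commit message; the field types and the `parse_stat_line` helper are assumptions for illustration. The key idea is that comm in /proc/<pid>/stat is wrapped in parentheses and may itself contain ")", so the parser locates the last ")" instead of the first.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TraceTarget:
    """Immutable description of the process to trace (field types are assumed)."""

    pid: int
    comm: str
    cmdline: tuple[str, ...]
    starttime_ticks: int


def parse_stat_line(stat: str) -> tuple[str, int]:
    """Return (comm, starttime_ticks) from the content of /proc/<pid>/stat."""
    open_idx = stat.index("(")
    close_idx = stat.rfind(")")  # last ")" so a comm containing ")" stays intact
    comm = stat[open_idx + 1 : close_idx]
    rest = stat[close_idx + 1 :].split()
    # rest[0] is field 3 (state); starttime is field 22, so its index is 22 - 3 = 19.
    return comm, int(rest[19])
```

Splitting the whole line on whitespace would shift every later field for a comm such as `my (weird) proc`, which is the failure mode the rfind approach avoids.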

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ien-file rejection

Addresses reviewer comments #3075871672 and #3075871675: Gate 4 in
recover_crashed_sessions now uses enrollment.comm (available for
schema_version=2 records) as the expected comm rather than the hardcoded
string 'claude'. Pre-fix schema_version=1 records with empty comm="" still
skip the alien check, preserving recovery of legitimate crash data.
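
The Gate 4 rule described here can be sketched as a small predicate. The dictionary keys and the name `is_alien_trace` are hypothetical; the actual enrollment record layout in recover_crashed_sessions may differ.

```python
def is_alien_trace(enrollment: dict, first_snapshot_comm: str) -> bool:
    """Return True when the trace's first comm contradicts the enrolled comm."""
    expected = enrollment.get("comm") or ""
    if enrollment.get("schema_version", 1) < 2 or not expected:
        # Legacy record: there is no enrolled comm to compare against, so keep it.
        return False
    return first_snapshot_comm != expected
```

With this shape, a schema_version=2 record enrolled as `claude` rejects a trace that starts with `sleep`, while a schema_version=1 record with an empty comm is always kept.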
…nvariant

Addresses reviewer comment #3075871682: `assert claude_sessions or count >= 1`
could pass even if the alien trace was recovered and claude's wasn't. Tightened
to `assert claude_sessions` to directly validate the intended invariant.
…ract

Addresses reviewer comment #3075871693: changes `_target: object = None` to
`_target: TraceTarget | None = None` with a TYPE_CHECKING import, adds an
`assert _target is not None` guard, and removes three `# type: ignore` suppressions.

Addresses reviewer comment #3075871697: replaces bare `int(snap.get('pid') or 0)`
with `_safe_int(value, default=0)` which catches ValueError/TypeError from corrupt
snapshot fields instead of letting them propagate uncaught.
…onftest.py

Addresses reviewer comment #3075871706: the identical 60 MB allocation script
was defined in both test_linux_tracing_pty_integration.py and
test_session_log_integration.py. Extracted to tests/execution/conftest.py
as a shared module-level constant and imported in both test files.
… process.py

Addresses reviewer comment #3075871713: the import was annotated with
`# noqa: F401 — used by raise below` but no raise appears in process.py scope.
The exception propagates naturally from resolve_trace_target(). Removed the
unused import.
…'claude'

_write_old_trace_with_comm now writes schema_version=2 enrollment records with
comm='claude', matching the production behavior where autoskillit always enrolls
its own binary comm. Gate 4 then correctly rejects traces where first_comm
(e.g. 'sleep') != enrollment.comm ('claude').
@Trecek Trecek force-pushed the linux-process-tracer-monitors-script-1-pty-wrapper-instead-o/806 branch from fa261ea to ad5507b Compare April 13, 2026 23:15
@Trecek Trecek added this pull request to the merge queue Apr 13, 2026
Merged via the queue into integration with commit 80271dc Apr 13, 2026
2 checks passed
@Trecek Trecek deleted the linux-process-tracer-monitors-script-1-pty-wrapper-instead-o/806 branch April 13, 2026 23:26