Rectify: PTY Wrapper Tracer PID — Target-Resolution & Self-Identifying Snapshots (Issue #806)#809
Conversation
Trecek
left a comment
AutoSkillit PR Review — Verdict: changes_requested
)

# Skip the entire module when script(1) is absent; no stub needed.
pytestmark_script = pytest.mark.skipif(
[critical] tests: Double pytestmark assignment — the module-level pytestmark on line 21 (Linux platform guard) is silently overwritten by the second pytestmark = pytest.mark.skipif(shutil.which('script') is None, ...) assignment here. Only the script(1) availability check survives; the Linux-only guard is lost. Tests in this module will execute on non-Linux platforms and fail. Fix: use a list: pytestmark = [pytest.mark.skipif(sys.platform != 'linux', reason='Linux only — tests PTY wrapping + /proc tracing'), pytest.mark.skipif(shutil.which('script') is None, reason='script(1) not available on this system')]
Investigated — this is intentional. Line 27 assigns to pytestmark_script (a distinct variable name), NOT pytestmark. The module-level pytestmark on line 21 (Linux platform guard) is NOT overwritten. pytestmark_script is a separate per-test decorator applied via @pytestmark_script at lines 40 and 94. Both guards coexist: the Linux guard applies module-wide, the script(1) guard applies per-test via decorator.
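A minimal sketch of the two-guard layout described in this reply, assuming pytest is installed. The test name `test_example` is hypothetical; only `pytestmark` and `pytestmark_script` come from the thread. pytest reads the module-level name `pytestmark` automatically, while any other name is just a reusable decorator.

```python
import shutil
import sys

import pytest

# Module-wide guard: pytest applies the name `pytestmark` to every test here.
pytestmark = pytest.mark.skipif(
    sys.platform != "linux",
    reason="Linux only: tests PTY wrapping + /proc tracing",
)

# Distinct name, so it does NOT overwrite the module-level guard above.
# It only takes effect on tests that are explicitly decorated with it.
pytestmark_script = pytest.mark.skipif(
    shutil.which("script") is None,
    reason="script(1) not available on this system",
)


@pytestmark_script
def test_example():  # hypothetical test name, for illustration only
    assert True
```

A test decorated this way is skipped if either condition holds; an undecorated test in the same module is still covered by the Linux guard.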
@pytest.mark.anyio
@pytest.mark.skipif(
    __import__("shutil").which("script") is None,
[warning] tests: test_peak_rss_kb_above_sanity_floor skips when script(1) is absent but has no Linux-platform guard. On macOS/Windows, LINUX_TRACING_AVAILABLE is False so proc_snapshots will be None, causing the assert result.proc_snapshots is not None to fire with a confusing message rather than a clean skip. Add @pytest.mark.skipif(sys.platform != 'linux', reason='Linux only') alongside the existing script check.
Investigated — this is intentional. The file has pytestmark = pytest.mark.skipif(sys.platform != 'linux', reason='Linux only') at line 14 (module-level), which applies to ALL tests in the module including test_peak_rss_kb_above_sanity_floor. On macOS/Windows, the entire module is skipped before any test body executes, so the assert result.proc_snapshots is not None line is never reached.
    root_proc = psutil.Process(root_pid)
    children = root_proc.children(recursive=True)
except (psutil.NoSuchProcess, psutil.AccessDenied, psutil.ZombieProcess):
    break
[warning] defense: In resolve_trace_target, when psutil raises NoSuchProcess/AccessDenied/ZombieProcess on root_proc.children(), the loop breaks immediately and raises TraceTargetResolutionError. A single transient OS error causes permanent resolution failure with no retry. Replace break with time.sleep(0.05); continue so the polling loop can recover within the deadline.
Investigated — this is intentional. The except (NoSuchProcess, AccessDenied, ZombieProcess): break fires when root_proc.children() fails because the root process (root_pid, i.e. the script(1) wrapper) has disappeared. If the root is gone, all its descendants are gone too — continue would immediately re-fail on the next psutil.Process(root_pid) call, consuming the polling window uselessly. break + raise TraceTargetResolutionError is correct when the root disappears. Individual child errors are already handled by the inner except at line 174 with continue.
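The control flow argued for in this reply can be sketched roughly as follows. This is not the code under review: `resolve_workload_pid`, `RootGone`, and the injected `list_descendants` callable are hypothetical stand-ins (the real code uses psutil), and the sketch matches only on the `cmdline[0]` basename. The 50 ms poll and 2 s timeout follow the PR description.

```python
import time
from pathlib import Path
from typing import Callable, Iterable, Sequence, Tuple


class RootGone(Exception):
    """Hypothetical stand-in for psutil.NoSuchProcess raised on the root PID."""


class TraceTargetResolutionError(RuntimeError):
    def __init__(self, root_pid: int, expected_basename: str) -> None:
        super().__init__(
            f"no descendant of pid {root_pid} matches {expected_basename!r}"
        )
        self.root_pid = root_pid
        self.expected_basename = expected_basename


def resolve_workload_pid(
    root_pid: int,
    expected_basename: str,
    list_descendants: Callable[[int], Iterable[Tuple[int, Sequence[str]]]],
    timeout: float = 2.0,
    poll_interval: float = 0.05,
) -> int:
    """Poll the wrapper's descendants until one matches the workload basename."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            descendants = list(list_descendants(root_pid))
        except RootGone:
            # The wrapper itself vanished, so retrying cannot succeed.
            break
        for pid, cmdline in descendants:
            if cmdline and Path(cmdline[0]).name == expected_basename:
                return pid
        # Workload not spawned yet: wait and poll again.
        time.sleep(poll_interval)
    # No silent fallback to root_pid; the caller must handle the failure.
    raise TraceTargetResolutionError(root_pid, expected_basename)
```

The distinction is between "child not there yet", which is worth another poll, and "root is gone", which ends the loop at once.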
# Anomaly detection
# Compute effective tracked_comm from snapshots if not provided by caller
if _effective_tracked_comm is None:
[warning] cohesion: Drift detection uses two parallel mechanisms in the same flush path: (1) inline modal-comm computation + set-cardinality check produces _tracked_comm_drift boolean (lines 164-178), and (2) detect_identity_drift() produces IDENTITY_DRIFT anomaly records. Both detect mismatched comm values but through different algorithms with different outputs. Consider collapsing the boolean drift flag into the anomaly detection path.
Valid observation — flagged for design decision. The inline modal-comm computation (lines 164-178) and detect_identity_drift() serve different consumers: the boolean _tracked_comm_drift feeds into the session summary JSON, while IDENTITY_DRIFT anomaly records go to anomalies.jsonl. Consolidating them requires deciding whether the boolean flag is redundant given anomaly records, which is a design trade-off for a future pass.
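For reference, the inline mechanism under discussion might look roughly like the sketch below. The function name `summarize_tracked_comm` is hypothetical, and treating "more than one distinct comm" as drift is an assumption based on the "set-cardinality check" wording in the review comment.

```python
from collections import Counter
from typing import Optional, Tuple


def summarize_tracked_comm(
    snapshots: list, tracked_comm: Optional[str] = None
) -> Tuple[Optional[str], bool]:
    """Return (effective tracked_comm, drift flag) for the session summary."""
    comms = [snap["comm"] for snap in snapshots if snap.get("comm")]
    if tracked_comm is None and comms:
        # Modal comm: the value that appears in the most snapshots.
        tracked_comm = Counter(comms).most_common(1)[0][0]
    # Assumed drift rule: more than one distinct comm was observed.
    drift = len(set(comms)) > 1
    return tracked_comm, drift
```

A boolean like this fits a one-line summary field, whereas per-snapshot anomaly records carry the detail of which snapshot drifted.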
Trecek
left a comment
AutoSkillit review found 11 blocking issues (3 critical, 8 warning). See inline comments.
Adds 11 new tests and ARCH-008 rule that currently fail against HEAD:
Test 1.1: PTY-wrapped run_managed_async is traced as workload, not script(1)
Test 1.2: ProcSnapshot.comm field populated from /proc/{pid}/comm
Test 1.3: resolve_trace_target walks descendants to find workload
Test 1.4: resolve_trace_target raises TraceTargetResolutionError on miss
Test 1.5: start_linux_tracing signature requires TraceTarget, not raw int
Test 1.6: peak_rss_kb > 30_000 sanity floor for 60 MB workload
Test 1.7: anomaly detection liveness canary fires on OOM_CRITICAL stream
Test 1.8: detect_identity_drift fires when comm != expected_comm
Test 1.9: ARCH-008 AST rule forbids proc.pid as start_linux_tracing target
Test 1.10: proc_trace.jsonl rows include comm field for self-identification
Test 1.11: recover_crashed_sessions excludes alien (non-claude) trace files
ARCH-008 (no-raw-pid-to-start-linux-tracing) added to _rules.py, detection
logic added to ArchitectureViolationVisitor.visit_Call in _helpers.py, and
expected_ids updated in test_registry.py.
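A rough sketch of what an ARCH-008 style check could look like, using only the standard `ast` module. The class `RawPidVisitor` and helper `find_raw_pid_calls` are hypothetical names, and the sketch assumes the target is either the first positional argument or the `target` keyword; the actual `ArchitectureViolationVisitor.visit_Call` logic may differ.

```python
import ast


class RawPidVisitor(ast.NodeVisitor):
    """Collect line numbers where start_linux_tracing receives `<expr>.pid`."""

    def __init__(self) -> None:
        self.violations = []

    def visit_Call(self, node: ast.Call) -> None:
        func = node.func
        # Handle both `start_linux_tracing(...)` and `mod.start_linux_tracing(...)`.
        if isinstance(func, ast.Name):
            name = func.id
        elif isinstance(func, ast.Attribute):
            name = func.attr
        else:
            name = None
        if name == "start_linux_tracing":
            # Assumption: target is the first positional arg or `target=`.
            candidates = list(node.args[:1])
            candidates += [kw.value for kw in node.keywords if kw.arg == "target"]
            for arg in candidates:
                if isinstance(arg, ast.Attribute) and arg.attr == "pid":
                    self.violations.append(node.lineno)
        self.generic_visit(node)


def find_raw_pid_calls(source: str) -> list:
    visitor = RawPidVisitor()
    visitor.visit(ast.parse(source))
    return visitor.violations
```

A purely syntactic rule like this only catches the direct `proc.pid` form; a PID laundered through a local variable would need the runtime type check to catch it.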
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Introduces TraceTarget newtype, resolve_trace_target() resolver, comm field on ProcSnapshot, and ARCH-008 AST guard to eliminate silent wrong-process observation when anyio.open_process() returns the script(1) wrapper PID.

- Add TraceTarget frozen dataclass (pid, comm, cmdline, starttime_ticks)
- Add resolve_trace_target() to walk descendants and find workload by basename
- Add trace_target_from_pid() for non-PTY mode (direct child = workload)
- Add TraceTargetResolutionError (no silent fallback to wrapper PID)
- Change start_linux_tracing(pid: int) → start_linux_tracing(target: TraceTarget)
- Add comm field to ProcSnapshot for self-identifying snapshots
- Add detect_identity_drift() anomaly detector for post-hoc PTY drift
- Wire resolve_trace_target into run_managed_async after open_process
- Extend SubprocessResult with tracked_comm; propagate through headless.py
- Extend flush_session_log with tracked_comm, tracked_comm_drift, schema v2
- Extend _format_diagnostics_section to surface tracked_comm in GitHub bodies
- Add comm-based alien file rejection in recover_crashed_sessions
- Update all existing tests to use TraceTarget via trace_target_from_pid
- Fix rfind-based starttime_ticks parsing for comm containing ")"

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
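The last bullet concerns a known pitfall when parsing `/proc/<pid>/stat`: the comm field is wrapped in parentheses and may itself contain spaces or `)`. A minimal sketch of the rfind approach follows; the function name `parse_starttime_ticks` is hypothetical, and the field position (starttime is field 22) follows the proc(5) layout.

```python
def parse_starttime_ticks(stat_line: str) -> int:
    """Extract starttime (field 22) from a /proc/<pid>/stat line."""
    # Split at the LAST ')' so a comm such as "weird) name" cannot shift fields.
    close = stat_line.rfind(")")
    if close == -1:
        raise ValueError("malformed stat line: no closing parenthesis")
    rest = stat_line[close + 1:].split()
    # rest[0] is field 3 (state), so field 22 sits at index 22 - 3 = 19.
    return int(rest[19])
```

Splitting the whole line on whitespace, or searching for the first `)`, would misalign every later field for such a process name.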
…ien-file rejection

Addresses reviewer comments #3075871672 and #3075871675: Gate 4 in recover_crashed_sessions now uses enrollment.comm (available for schema_version=2 records) as the expected comm rather than the hardcoded string 'claude'. Pre-fix schema_version=1 records with empty comm="" still skip the alien check, preserving recovery of legitimate crash data.
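The Gate 4 decision described in this commit can be sketched as a small predicate. The name `is_alien_trace` and its two string parameters are hypothetical; the real gate reads these values from the enrollment record and the first snapshot.

```python
def is_alien_trace(first_snapshot_comm: str, enrollment_comm: str) -> bool:
    """Return True when a trace should be rejected as belonging to another process."""
    # Legacy schema_version=1 enrollments carry comm="": skip the check
    # so that legitimate pre-fix crash data is still recovered.
    if not enrollment_comm:
        return False
    # Snapshots written before the comm field existed cannot be judged either.
    if not first_snapshot_comm:
        return False
    return first_snapshot_comm != enrollment_comm
```

Comparing against the enrolled comm rather than a literal keeps the gate correct if the workload binary is ever renamed.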
…nvariant

Addresses reviewer comment #3075871682: `assert claude_sessions or count >= 1` could pass even if the alien trace was recovered and claude's wasn't. Tightened to `assert claude_sessions` to directly validate the intended invariant.
…ract

Addresses reviewer comment #3075871693: changes `_target: object = None` to `_target: TraceTarget | None = None` with a TYPE_CHECKING import, adds an `assert _target is not None` guard, and removes three `# type: ignore` suppressions.
Addresses reviewer comment #3075871697: replaces bare `int(snap.get('pid') or 0)`
with `_safe_int(value, default=0)` which catches ValueError/TypeError from corrupt
snapshot fields instead of letting them propagate uncaught.
…onftest.py

Addresses reviewer comment #3075871706: the identical 60 MB allocation script was defined in both test_linux_tracing_pty_integration.py and test_session_log_integration.py. Extracted to tests/execution/conftest.py as a shared module-level constant and imported in both test files.
… process.py

Addresses reviewer comment #3075871713: the import was annotated with `# noqa: F401 — used by raise below` but no raise appears in process.py scope. The exception propagates naturally from resolve_trace_target(). Removed the unused import.
…'claude'
_write_old_trace_with_comm now writes schema_version=2 enrollment records with
comm='claude', matching the production behavior where autoskillit always enrolls
its own binary comm. Gate 4 then correctly rejects traces where first_comm
(e.g. 'sleep') != enrollment.comm ('claude').
fa261ea to
ad5507b
Summary
`anyio.open_process([script, -qefc, "<claude cmd>", /dev/null])` returns the PID of the `script(1)` PTY wrapper, not the intended claude binary. That wrapper PID was used as the subject of observation across ten call sites in `process.py` and downstream through the entire telemetry pipeline (tracer, anomaly detection, peak aggregation, `sessions.jsonl`, GitHub issue bodies). The broken subsystem has been silently wrong since PTY wrapping was introduced, and two prior investigations (#771/#776) saw the same ~2 MB RSS / 1 thread / `do_sys_poll` fingerprint without identifying the PTY wrapper as the root cause.

This rectify produces immunity against the entire bug class, not just the `script(1)` instance, by introducing a `TraceTarget` newtype that can only be produced by a `resolve_trace_target()` resolver (which walks descendants and returns the workload process, not the spawn process), adding a `comm` field to `ProcSnapshot` so every row in `proc_trace.jsonl` self-identifies the process it describes, changing `start_linux_tracing(pid: int)` to `start_linux_tracing(target: TraceTarget)` so the raw-int path becomes unrepresentable, and adding an AST architectural test (ARCH-008) that forbids passing a raw `Attribute` PID node as the `target` argument to `start_linux_tracing`.

Architecture Impact
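As a rough illustration of the type-level contract from the Summary: the field list follows the commit message (pid, comm, cmdline, starttime_ticks), while the body of `start_linux_tracing` is a placeholder that only demonstrates the guard and is not the real tracer.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class TraceTarget:
    """Identity of the workload process, as opposed to the spawn wrapper."""

    pid: int
    comm: str
    cmdline: Tuple[str, ...]
    starttime_ticks: int  # guards against PID reuse


def start_linux_tracing(target: TraceTarget) -> int:
    # Runtime guard: a raw int PID is rejected rather than silently traced.
    if not isinstance(target, TraceTarget):
        raise TypeError(
            f"start_linux_tracing requires a TraceTarget, got {type(target).__name__}"
        )
    # Placeholder: the real function would start the /proc monitor here.
    return target.pid
```

With `frozen=True`, a resolved target cannot be repointed at another PID after construction, and the `isinstance` check backs up the static annotation for callers that bypass type checking.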
Concurrency Diagram
%%{init: {'flowchart': {'nodeSpacing': 40, 'rankSpacing': 50, 'curve': 'basis'}}}%% flowchart TB %% CLASS DEFINITIONS %% classDef terminal fill:#1a237e,stroke:#7986cb,stroke-width:2px,color:#fff; classDef stateNode fill:#004d40,stroke:#4db6ac,stroke-width:2px,color:#fff; classDef handler fill:#e65100,stroke:#ffb74d,stroke-width:2px,color:#fff; classDef phase fill:#6a1b9a,stroke:#ba68c8,stroke-width:2px,color:#fff; classDef newComponent fill:#2e7d32,stroke:#81c784,stroke-width:2px,color:#fff; classDef output fill:#00695c,stroke:#4db6ac,stroke-width:2px,color:#fff; classDef detector fill:#b71c1c,stroke:#ef5350,stroke-width:2px,color:#fff; classDef gap fill:#ff6f00,stroke:#ffa726,stroke-width:2px,color:#000; START([● run_headless_core / DefaultHeadlessExecutor]) subgraph PreSpawn ["SEQUENTIAL PRE-SPAWN — ● headless.py · ● process.py"] direction TB PTY_WRAP["● pty_wrap_command()<br/>━━━━━━━━━━<br/>Rewrites cmd: script -qefc<br/>Stores _workload_basename BEFORE rewrite"] SPAWN["● anyio.open_process()<br/>━━━━━━━━━━<br/>Returns script(1) wrapper PID<br/>(not the workload PID)"] PTY_DEC{"PTY<br/>mode?"} end subgraph PIDRect ["PID RECTIFICATION — ● linux_tracing.py (BLOCKING SYNC before task group)"] direction TB RESOLVE["● resolve_trace_target()<br/>━━━━━━━━━━<br/>Blocks event loop: time.sleep(0.05) × ≤40<br/>Walks /proc descendants from wrapper PID<br/>Matches expected_basename (workload)"] RESOLVE_ERR["● TraceTargetResolutionError<br/>━━━━━━━━━━<br/>Hard fail — never silently falls back<br/>Carries .root_pid + .expected_basename"] DIRECT["● trace_target_from_pid()<br/>━━━━━━━━━━<br/>Single /proc read<br/>Non-PTY direct path"] TRACE_TARGET["● TraceTarget (frozen dataclass)<br/>━━━━━━━━━━<br/>pid · comm · cmdline<br/>starttime_ticks (PID-recycle guard)<br/>boot_id collision-resistant triple"] ENROLL["● _write_enrollment_atomic()<br/>━━━━━━━━━━<br/>tempfile.mkstemp + os.replace<br/>Crash-recovery identity record"] end FORK["● anyio.create_task_group() — 
FORK<br/>━━━━━━━━━━<br/>process.py: starts 5–6 concurrent tasks"] subgraph TaskGroup ["CONCURRENT ASYNC WATCHERS — anyio cooperative multitasking (single thread)"] direction LR W_PROC["_watch_process<br/>━━━━━━━━━━<br/>Awaits subprocess exit<br/>CHANNEL: natural exit"] W_HEART["_watch_heartbeat<br/>━━━━━━━━━━<br/>Polls stdout JSONL<br/>Completion marker records<br/>CHANNEL A"] W_SESS["_watch_session_log<br/>━━━━━━━━━━<br/>Polls Claude session JSONL dir<br/>CHANNEL B"] W_IDLE["_watch_stdout_idle<br/>━━━━━━━━━━<br/>Fires if stdout silent<br/>idle_output_timeout guard"] MONITOR["● _run_monitor / proc_monitor<br/>━━━━━━━━━━<br/>anyio.sleep(proc_interval)<br/>Reads /proc at each tick<br/>Accumulates ProcSnapshot list<br/>Injected into same TaskGroup"] end subgraph RaceSignals ["RACE SIGNALS — ● process.py (shared, anyio-safe)"] direction TB TRIGGER["● anyio.Event: trigger<br/>━━━━━━━━━━<br/>First watcher to fire wins"] CB_READY["anyio.Event: channel_b_ready<br/>━━━━━━━━━━<br/>Symmetric Channel B drain guard"] SID_READY["anyio.Event: stdout_session_id_ready<br/>━━━━━━━━━━<br/>Session ID from stdout JSONL"] end BARRIER["trigger.wait() + move_on_after(timeout)<br/>━━━━━━━━━━<br/>● process.py — awaits first race signal<br/>Drain window: move_on_after(completion_drain_timeout)"] CANCEL["tg.cancel_scope.cancel()<br/>━━━━━━━━━━<br/>Terminates all remaining tasks<br/>Including _run_monitor CancelScope"] subgraph PostHoc ["POST-HOC ANALYSIS — sequential, after task group exits"] direction TB STOP["● tracing_handle.stop()<br/>━━━━━━━━━━<br/>Cancels LinuxTracingHandle._monitor_cancel_scope<br/>Returns accumulated ProcSnapshot list"] RESULT["● SubprocessResult<br/>━━━━━━━━━━<br/>proc_snapshots: list[dict] | None<br/>tracked_comm: str|None ← Issue #806 fix<br/>session_id · channel_confirmation · pid"] FLUSH["● flush_session_log()<br/>━━━━━━━━━━<br/>Writes proc_trace.jsonl to session dir"] ANOMALY["● detect_anomalies()<br/>━━━━━━━━━━<br/>Pure sync · post-hoc only<br/>OOM 
spike/critical · zombie · D-state<br/>high CPU · RSS growth · FDs"] DRIFT["● detect_identity_drift()<br/>━━━━━━━━━━<br/>Pure sync · checks each snapshot<br/>comm matches TraceTarget.comm"] end subgraph NewTests ["★ NEW INTEGRATION TESTS"] direction LR T_RESOLVER["★ test_trace_target_resolver.py<br/>━━━━━━━━━━<br/>test_resolve_trace_target_walks_from_wrapper_to_workload<br/>Asserts target.pid ≠ wrapper_pid<br/>test_resolve_trace_target_raises_on_miss<br/>Asserts TraceTargetResolutionError raised"] T_PTY["★ test_linux_tracing_pty_integration.py<br/>━━━━━━━━━━<br/>test_pty_wrapped_command_is_traced_as_grandchild<br/>Asserts peak RSS > 30,000 KB (not script wrapper 2 MB)<br/>test_pty_wrapped_tracing_produces_no_script_snapshots<br/>Asserts no comm='script' in proc_trace.jsonl"] end %% MAIN SEQUENTIAL FLOW %% START --> PTY_WRAP PTY_WRAP --> SPAWN SPAWN --> PTY_DEC PTY_DEC -->|"pty_mode=True"| RESOLVE PTY_DEC -->|"pty_mode=False"| DIRECT RESOLVE -->|"found within timeout"| TRACE_TARGET RESOLVE -->|"timeout exceeded"| RESOLVE_ERR DIRECT --> TRACE_TARGET TRACE_TARGET --> ENROLL ENROLL --> FORK %% FORK TO PARALLEL WATCHERS %% FORK --> W_PROC FORK --> W_HEART FORK --> W_SESS FORK --> W_IDLE FORK --> MONITOR %% WATCHERS FIRE TRIGGER %% W_PROC --> TRIGGER W_HEART --> TRIGGER W_SESS --> TRIGGER W_IDLE --> TRIGGER %% MONITOR ACCUMULATES (does not fire trigger) %% MONITOR -.->|"ProcSnapshot accumulation"| STOP %% RACE BARRIER %% TRIGGER --> BARRIER CB_READY -.->|"drain guard"| BARRIER SID_READY -.->|"drain guard"| BARRIER BARRIER --> CANCEL CANCEL --> STOP %% POST-HOC CHAIN %% STOP --> RESULT RESULT --> FLUSH FLUSH --> ANOMALY FLUSH --> DRIFT %% TEST COVERAGE (dashed) %% T_RESOLVER -.->|"exercises"| RESOLVE T_PTY -.->|"exercises"| MONITOR T_PTY -.->|"exercises"| FLUSH %% CLASS ASSIGNMENTS %% class START terminal; class PTY_WRAP,SPAWN handler; class PTY_DEC stateNode; class RESOLVE phase; class RESOLVE_ERR detector; class DIRECT phase; class TRACE_TARGET stateNode; class ENROLL 
output; class FORK detector; class W_PROC,W_HEART,W_SESS,W_IDLE handler; class MONITOR phase; class TRIGGER,CB_READY,SID_READY stateNode; class BARRIER,CANCEL detector; class STOP,RESULT output; class FLUSH,ANOMALY,DRIFT phase; class T_RESOLVER,T_PTY newComponent;
Process Flow Diagram
%%{init: {'flowchart': {'nodeSpacing': 40, 'rankSpacing': 50, 'curve': 'basis'}}}%% flowchart TB %% CLASS DEFINITIONS %% classDef terminal fill:#1a237e,stroke:#7986cb,stroke-width:2px,color:#fff; classDef stateNode fill:#004d40,stroke:#4db6ac,stroke-width:2px,color:#fff; classDef handler fill:#e65100,stroke:#ffb74d,stroke-width:2px,color:#fff; classDef phase fill:#6a1b9a,stroke:#ba68c8,stroke-width:2px,color:#fff; classDef output fill:#00695c,stroke:#4db6ac,stroke-width:2px,color:#fff; classDef detector fill:#b71c1c,stroke:#ef5350,stroke-width:2px,color:#fff; %% TERMINALS %% START([START]) COMPLETE([COMPLETE]) ERROR([ERROR — no fallback]) %% ── PHASE 1: SESSION ORCHESTRATION ── %% subgraph Orchestration ["Phase 1 — Session Orchestration (● headless.py)"] direction TB RHC["● run_headless_core<br/>━━━━━━━━━━<br/>Entry: skill + LinuxTracingConfig"] PTYMode["pty_mode = True<br/>━━━━━━━━━━<br/>Hardcoded for all headless sessions"] end %% ── PHASE 2: PTY PROCESS SPAWN ── %% subgraph Spawn ["Phase 2 — PTY Process Spawn (● process.py)"] direction TB CaptureBasename["● Capture _workload_basename<br/>━━━━━━━━━━<br/>Path(cmd[0]).name BEFORE PTY wrap<br/>(#806 guard — preserves real target name)"] WrapCmd["pty_wrap_command(cmd)<br/>━━━━━━━━━━<br/>Rewrites cmd → script(1) wrapper<br/>cmd[0] is now the PTY shim"] OpenProcess["anyio.open_process(cmd)<br/>━━━━━━━━━━<br/>Spawns; proc.pid = script(1) PID<br/>(NOT the claude workload)"] end %% ── PHASE 3: PID RECTIFICATION ── %% subgraph Rectification ["Phase 3 — PTY PID Rectification (● process.py + ● linux_tracing.py)"] direction TB TracingEnabled{"linux_tracing_config<br/>not None?"} PtyLinux{"pty_mode AND<br/>LINUX_TRACING_AVAILABLE?"} ResolveTarget["● resolve_trace_target<br/>━━━━━━━━━━<br/>root_pid=proc.pid, basename=workload<br/>Poll children() every 50 ms, timeout=2s<br/>Match name or cmdline[0] basename"] FoundTarget{"workload child<br/>found in time?"} TraceTargetDirect["● trace_target_from_pid<br/>━━━━━━━━━━<br/>Non-PTY: 
direct /proc read<br/>Builds TraceTarget; never raises"] RawPID["_observed_pid = proc.pid<br/>━━━━━━━━━━<br/>Tracing disabled path<br/>Raw wrapper PID used"] RectifiedPID["● _observed_pid = target.pid<br/>━━━━━━━━━━<br/>Rectified workload PID<br/>_tracked_comm = target.comm"] RTError["● TraceTargetResolutionError<br/>━━━━━━━━━━<br/>No silent fallback to wrapper PID<br/>(fallback would recreate #806)"] end %% ── PHASE 4: EXECUTION RACE ── %% subgraph ExecRace ["Phase 4 — Execution Race (● process.py)"] direction TB StartTracing["start_linux_tracing(target)<br/>━━━━━━━━━━<br/>target must be TraceTarget<br/>TypeError guard blocks raw int PIDs"] Race["Race Watchers<br/>━━━━━━━━━━<br/>_watch_process | _watch_heartbeat<br/>_watch_session_log | _watch_stdout_idle<br/>First to signal trigger wins"] ResolveTerm{"resolve_termination<br/>NATURAL_EXIT?"} KillTree["async_kill_process_tree<br/>━━━━━━━━━━<br/>Reaps PTY wrapper + workload<br/>for TIMED_OUT / STALL / STALE"] SubprocResult["● SubprocessResult<br/>━━━━━━━━━━<br/>pid=_observed_pid (rectified)<br/>tracked_comm=_tracked_comm"] end %% ── PHASE 5: SESSION LOGGING ── %% subgraph SessionLog ["Phase 5 — Session Logging (● session_log.py + ● anomaly_detection.py)"] direction TB FlushLog["● flush_session_log<br/>━━━━━━━━━━<br/>tracked_comm=result.tracked_comm<br/>Propagated from SubprocessResult"] ModalComm["● Modal comm resolution<br/>━━━━━━━━━━<br/>Compute from snapshots if absent<br/>Sets tracked_comm_drift flag"] DetectAnomalies["● detect_anomalies<br/>━━━━━━━━━━<br/>OOM / zombie / D-state<br/>CPU / FD / RSS thresholds"] IdentityDrift["● detect_identity_drift<br/>━━━━━━━━━━<br/>Emit IDENTITY_DRIFT CRITICAL<br/>if any snap.comm != expected_comm<br/>(immunity check for #806 regression)"] WriteArtifacts["● Write session artifacts<br/>━━━━━━━━━━<br/>summary.json (tracer_version=2)<br/>sessions.jsonl (tracked_comm)<br/>anomalies.jsonl (if non-empty)"] end %% ── PHASE 6: CRASH RECOVERY ── %% subgraph CrashRecovery ["Phase 6 — 
Crash Recovery (● session_log.py)"] direction TB RecoverScan["recover_crashed_sessions<br/>━━━━━━━━━━<br/>Scan tmpfs sidecars (age > 30s)"] Gate1{"Gate 1: enrollment<br/>sidecar exists?"} Gate2{"Gate 2: boot_id<br/>matches current boot?"} Gate3{"Gate 3: PID dead or<br/>starttime_ticks mismatch?"} Gate4{"● Gate 4: snap[0].comm<br/>non-empty AND != 'claude'?"} AlienReject["● Alien file rejection<br/>━━━━━━━━━━<br/>PTY wrapper trace detected<br/>Delete sidecar files, skip recovery<br/>(comm = script/pty shim name)"] RecoverFlush["flush_session_log<br/>━━━━━━━━━━<br/>subtype='crashed', exit_code=-1<br/>termination='CRASHED'"] end %% ── MAIN FLOW ── %% START --> RHC RHC --> PTYMode PTYMode --> CaptureBasename CaptureBasename --> WrapCmd WrapCmd --> OpenProcess OpenProcess --> TracingEnabled TracingEnabled -->|"YES"| PtyLinux TracingEnabled -->|"NO"| RawPID PtyLinux -->|"YES: PTY + Linux"| ResolveTarget PtyLinux -->|"NO: non-PTY or unavailable"| TraceTargetDirect ResolveTarget --> FoundTarget FoundTarget -->|"YES: TraceTarget built"| RectifiedPID FoundTarget -->|"NO: deadline exceeded"| RTError TraceTargetDirect --> RectifiedPID RawPID --> StartTracing RTError --> ERROR RectifiedPID --> StartTracing StartTracing --> Race Race --> ResolveTerm ResolveTerm -->|"YES: process exited cleanly"| SubprocResult ResolveTerm -->|"NO: TIMED_OUT / STALL / STALE"| KillTree KillTree --> SubprocResult SubprocResult --> FlushLog FlushLog --> ModalComm ModalComm --> DetectAnomalies DetectAnomalies --> IdentityDrift IdentityDrift --> WriteArtifacts WriteArtifacts --> COMPLETE %% ── CRASH RECOVERY FLOW (parallel operational path) ── %% RecoverScan --> Gate1 Gate1 -->|"NO"| SKIP_G1["skip"] Gate1 -->|"YES"| Gate2 Gate2 -->|"NO: stale boot"| SKIP_G2["delete + skip"] Gate2 -->|"YES"| Gate3 Gate3 -->|"NO: still alive"| SKIP_G3["still running, skip"] Gate3 -->|"YES: process gone"| Gate4 Gate4 -->|"YES: alien comm"| AlienReject Gate4 -->|"NO: claude or empty comm"| RecoverFlush AlienReject --> 
SKIP_G4["deleted, skipped"] RecoverFlush --> COMPLETE %% ── CLASS ASSIGNMENTS ── %% class START,COMPLETE,ERROR terminal; class SKIP_G1,SKIP_G2,SKIP_G3,SKIP_G4 terminal; class RHC,CaptureBasename,WrapCmd,OpenProcess phase; class PTYMode stateNode; class TracingEnabled,PtyLinux,FoundTarget,ResolveTerm stateNode; class ResolveTarget,TraceTargetDirect handler; class RectifiedPID,RawPID stateNode; class RTError detector; class StartTracing,Race,KillTree handler; class SubprocResult stateNode; class FlushLog,ModalComm,DetectAnomalies handler; class IdentityDrift detector; class WriteArtifacts output; class RecoverScan phase; class Gate1,Gate2,Gate3,Gate4 stateNode; class AlienReject detector; class RecoverFlush handler;
State Lifecycle Diagram
%%{init: {'flowchart': {'nodeSpacing': 50, 'rankSpacing': 65, 'curve': 'basis'}}}%% flowchart TB %% CLASS DEFINITIONS %% classDef cli fill:#1a237e,stroke:#7986cb,stroke-width:2px,color:#fff; classDef stateNode fill:#004d40,stroke:#4db6ac,stroke-width:2px,color:#fff; classDef handler fill:#e65100,stroke:#ffb74d,stroke-width:2px,color:#fff; classDef phase fill:#6a1b9a,stroke:#ba68c8,stroke-width:2px,color:#fff; classDef newComponent fill:#2e7d32,stroke:#81c784,stroke-width:2px,color:#fff; classDef output fill:#00695c,stroke:#4db6ac,stroke-width:2px,color:#fff; classDef detector fill:#b71c1c,stroke:#ef5350,stroke-width:2px,color:#fff; classDef gap fill:#ff6f00,stroke:#ffa726,stroke-width:2px,color:#000; classDef terminal fill:#1a237e,stroke:#7986cb,stroke-width:2px,color:#fff; %% TERMINALS %% START([PROCESS SPAWN]) CRASH_END([CRASH RECOVERY COMPLETE]) STOP_END([SESSION STOP]) %% ──────────────────────────────────────── %% subgraph PIDResolution ["● PTY WRAPPER PID RESOLUTION · linux_tracing.py"] PTYGate{"PTY mode?"} ResolvePTY["● resolve_trace_target<br/>━━━━━━━━━━<br/>Walk descendants at 50ms polls<br/>Match expected_basename<br/>Capture comm + cmdline + starttime_ticks<br/>HARD FAIL on timeout — NO wrapper fallback"] ResolveDirect["trace_target_from_pid<br/>━━━━━━━━━━<br/>Direct /proc read<br/>Returns empty strings on OSError<br/>Never raises"] ResolutionError["TraceTargetResolutionError<br/>━━━━━━━━━━<br/>root_pid + expected_basename carried<br/>Cannot trace: wrapper PID forbidden<br/>Abort — recreates #806 otherwise"] end %% ──────────────────────────────────────── %% subgraph InitOnlyFields ["INIT_ONLY FIELDS · frozen=True · linux_tracing.py ● / _type_subprocess.py ●"] TraceTarget["● TraceTarget INIT_ONLY<br/>━━━━━━━━━━<br/>pid: workload PID only (not script wrapper)<br/>comm: /proc/pid/comm (max 15 chars)<br/>cmdline: tuple[str,...] 
immutable<br/>starttime_ticks: PID-reuse guard<br/>resolved_at: UTC datetime"] EnrollRecord["● TraceEnrollmentRecord INIT_ONLY<br/>━━━━━━━━━━<br/>schema_version: 2 (post-#806)<br/>pid, boot_id, starttime_ticks<br/>session_id, enrolled_at, kitchen_id<br/>comm: new field — alien file rejection"] ProcSnap["● ProcSnapshot INIT_ONLY<br/>━━━━━━━━━━<br/>comm: /proc/pid/comm NEW field (#806)<br/>captured_at: unique UTC str (monotonic)<br/>vm_rss_kb, threads, fd_count, fd_soft_limit<br/>sig_pnd/blk/cgt, oom_score, wchan<br/>ctx_switches_voluntary/involuntary, cpu_percent"] end %% ──────────────────────────────────────── %% subgraph ContractGates ["CONTRACT ENFORCEMENT GATES · linux_tracing.py ●"] TypeCheck{"isinstance(target,<br/>TraceTarget)?"} TypeFail["TypeError HARD FAIL<br/>━━━━━━━━━━<br/>ARCH-008 / issue #806<br/>Raw int PID rejected explicitly<br/>No recovery path"] PlatformGate{"Linux + enabled<br/>+ tg not None?"} NoopReturn["Returns None<br/>━━━━━━━━━━<br/>Platform/config gate<br/>Tracing silently disabled"] MonoGate["Monotonic Timestamp Gate<br/>━━━━━━━━━━<br/>proc_monitor ensures captured_at<br/>strictly increases per snapshot<br/>+1μs applied on NTP/WSL2 regression"] end %% ──────────────────────────────────────── %% subgraph MutableHandle ["MUTABLE STATE · LinuxTracingHandle · linux_tracing.py ●"] HandleInit["Handle constructed<br/>━━━━━━━━━━<br/>_snapshots: []<br/>_trace_path: None<br/>_trace_file: None<br/>_enrollment_path: None<br/>_monitor_cancel_scope: None"] PostInit["Post-init mutation<br/>━━━━━━━━━━<br/>_monitor_cancel_scope set by start()<br/>_trace_path, _trace_file opened in tmpfs<br/>_enrollment_path written atomically<br/>All set inside start_linux_tracing()"] TraceDegraded["Degraded mode<br/>━━━━━━━━━━<br/>tmpfs unavailable → _trace_path=None<br/>write error → _trace_file closed/None<br/>Snapshots accumulate in-memory only"] HandleStop["stop() teardown<br/>━━━━━━━━━━<br/>Cancel scope cancelled<br/>_trace_file flushed + closed → None<br/>trace 
JSONL deleted from tmpfs<br/>enrollment sidecar deleted from tmpfs"] end %% ──────────────────────────────────────── %% subgraph RaceTransition ["MUTABLE → FROZEN TRANSITION · process.py ●"] RaceAcc["● RaceAccumulator MUTABLE<br/>━━━━━━━━━━<br/>process_exited, process_returncode<br/>channel_a_confirmed, channel_b_status<br/>channel_b_session_id, stdout_session_id, idle_stall<br/>Each field written exactly once by one coroutine"] RaceSignals["RaceSignals INIT_ONLY after freeze<br/>━━━━━━━━━━<br/>Produced by to_race_signals()<br/>Immutable result consumed by resolve_termination()<br/>basis for SubprocessResult construction"] end %% ──────────────────────────────────────── %% subgraph ResultFields ["● SubprocessResult MIXED-LIFECYCLE · _type_subprocess.py ●"] ResultInit["Required at construction<br/>━━━━━━━━━━<br/>returncode, stdout, stderr<br/>termination, pid (workload PID post-#806)"] ResultPostSet["● Set post-construction by headless.py<br/>━━━━━━━━━━<br/>start_ts, end_ts: ISO timestamps<br/>elapsed_seconds: monotonic pre-computed<br/>proc_snapshots: list[dict] or None<br/>tracked_comm: TraceTarget.comm propagated<br/>channel_confirmation, session_id"] end %% ──────────────────────────────────────── %% subgraph CrashRecovery ["RESUME DETECTION · recover_crashed_sessions() · session_log.py ●"] AgeGate["Gate 1: Age Guard<br/>━━━━━━━━━━<br/>mtime within 30s → SKIP<br/>Active session protection"] EnrollGate["Gate 2: Enrollment Sidecar<br/>━━━━━━━━━━<br/>read_enrollment() = None → SKIP<br/>Alien/test file rejection"] BootGate["Gate 3: Boot ID Match<br/>━━━━━━━━━━<br/>boot_id mismatch → DELETE BOTH<br/>Pre-reboot stale rejection"] PIDGate["Gate 4: PID Liveness + starttime_ticks<br/>━━━━━━━━━━<br/>alive + ticks match → SKIP (still running)<br/>alive + ticks differ → CRASH (PID recycled)<br/>not alive → CRASH"] CommGate["● Gate 5: comm Alien Rejection NEW<br/>━━━━━━━━━━<br/>snapshots[0].comm non-empty AND != claude<br/>→ DELETE trace + enrollment<br/>Pre-fix (no comm 
field) → RECOVER as-is"] CrashFlush["flush_session_log CRASH path<br/>━━━━━━━━━━<br/>subtype=crashed, exit_code=-1<br/>detect_identity_drift() appended<br/>Deletes trace + enrollment after write"] end %% ──────────────────────────────────────── %% subgraph WriteOnlyArtifacts ["WRITE-ONLY SESSION ARTIFACTS · session_log.py ●"] SummaryJson["summary.json<br/>━━━━━━━━━━<br/>Atomic write (tempfile+replace)<br/>tracer_target_resolution_version=2<br/>tracked_comm_drift flag<br/>peak_rss_kb, peak_oom_score, peak_fd_ratio"] ProcTrace["proc_trace.jsonl<br/>━━━━━━━━━━<br/>Every ProcSnapshot line-buffered<br/>Survives crash (stop() not called)<br/>Deleted after recovery or stop()"] SessionsJsonl["sessions.jsonl<br/>━━━━━━━━━━<br/>Append-only index (known race)<br/>Retention sweeps at 500 sessions<br/>WRITE-ONLY for logic decisions"] AnomaliesJsonl["● anomalies.jsonl<br/>━━━━━━━━━━<br/>detect_anomalies() output<br/>detect_identity_drift() — IDENTITY_DRIFT kind<br/>WRITE-ONLY — never read for decisions"] end %% ──────────────────────────────────────── %% %% NEW TEST NODES %% PTYIntegTest["★ test_linux_tracing_pty_integration.py<br/>━━━━━━━━━━<br/>No snapshot has comm=script<br/>max(vm_rss_kb) > 30,000 KB<br/>proc_trace.jsonl: comm on every row"] ResolverTest["★ test_trace_target_resolver.py<br/>━━━━━━━━━━<br/>resolved pid != wrapper pid<br/>TraceTargetResolutionError on miss<br/>error carries root_pid + expected_basename"] %% ──────────────────────────────────────── %% %% FLOW CONNECTIONS %% START --> PTYGate PTYGate -- "PTY=True" --> ResolvePTY PTYGate -- "PTY=False" --> ResolveDirect ResolvePTY -- "timeout" --> ResolutionError ResolvePTY -- "match found" --> TraceTarget ResolveDirect --> TraceTarget TraceTarget --> EnrollRecord TraceTarget --> TypeCheck TypeCheck -- "raw int PID" --> TypeFail TypeCheck -- "TraceTarget OK" --> PlatformGate PlatformGate -- "disabled/non-Linux" --> NoopReturn PlatformGate -- "enabled" --> HandleInit HandleInit --> PostInit PostInit -- "tmpfs ok" 
--> MonoGate PostInit -- "tmpfs missing" --> TraceDegraded MonoGate --> ProcSnap TraceDegraded --> ProcSnap ProcSnap --> RaceAcc ProcSnap --> ProcTrace RaceAcc --> RaceSignals RaceSignals --> ResultInit ResultInit --> ResultPostSet PostInit -- "enrolled_at" --> EnrollRecord ResultPostSet --> HandleStop HandleStop --> STOP_END %% Crash recovery path %% ProcTrace -- "process crashes (stop not called)" --> AgeGate AgeGate -- "age > 30s" --> EnrollGate EnrollGate -- "sidecar exists" --> BootGate BootGate -- "boot_id match" --> PIDGate PIDGate -- "crash detected" --> CommGate CommGate -- "comm=claude or legacy" --> CrashFlush CrashFlush --> SummaryJson CrashFlush --> SessionsJsonl CrashFlush --> AnomaliesJsonl CrashFlush --> CRASH_END %% Normal session log flush %% HandleStop --> SummaryJson HandleStop --> SessionsJsonl %% Test coverage arrows %% ResolvePTY -. "tested by" .-> PTYIntegTest ProcSnap -. "comm invariant" .-> PTYIntegTest TraceTarget -. "tested by" .-> ResolverTest ResolutionError -. "hard-fail contract" .-> ResolverTest %% CLASS ASSIGNMENTS %% class START,CRASH_END,STOP_END terminal; class PTYGate,TypeCheck,PlatformGate phase; class ResolvePTY,ResolveDirect handler; class ResolutionError,TypeFail detector; class TraceTarget,EnrollRecord,ProcSnap stateNode; class HandleInit,PostInit,TraceDegraded,HandleStop,MonoGate handler; class RaceAcc handler; class RaceSignals stateNode; class ResultInit,ResultPostSet stateNode; class AgeGate,EnrollGate,BootGate,PIDGate,CommGate detector; class CrashFlush handler; class SummaryJson,ProcTrace,SessionsJsonl,AnomaliesJsonl output; class PTYIntegTest,ResolverTest newComponent; class NoopReturn,TraceDegraded gap;
Closes #806
Implementation Plan
Plan file:
/home/talon/projects/autoskillit-runs/remediation-806-20260413-075454-098046/.autoskillit/temp/rectify/rectify_pty_wrapper_tracer_pid_806_2026-04-13_110443.md
🤖 Generated with Claude Code via AutoSkillit
Token Usage Summary