Rectify: Stdout Idle Watchdog + Bounded Suppression for Stall Detection#735
Conversation
Trecek
left a comment
There was a problem hiding this comment.
AutoSkillit PR Review — Verdict: changes_requested (posted as COMMENT — cannot REQUEST_CHANGES on own PR)
| stdout, _ = await proc.communicate() | ||
| io_task = asyncio.ensure_future(proc.communicate()) | ||
| try: | ||
| await asyncio.wait_for(proc.wait(), timeout=15.0) |
There was a problem hiding this comment.
[warning] defense: wait_for(proc.wait()) races with concurrent proc.communicate() future. wait() does not drain stdout pipe; if subprocess fills the pipe buffer before exiting, this can deadlock. Use asyncio.wait_for(proc.communicate(), timeout=15.0) directly.
There was a problem hiding this comment.
Investigated — this is intentional. The code creates io_task = asyncio.ensure_future(proc.communicate()) at line 54 BEFORE calling wait_for(proc.wait()). The communicate() future runs concurrently on the event loop and actively drains stdout, so the pipe buffer deadlock scenario cannot occur here. wait() alone would deadlock, but the concurrent communicate() task prevents it.
| input_data: str | None = None, | ||
| completion_drain_timeout: float = 5.0, | ||
| linux_tracing_config: Any | None = None, | ||
| idle_output_timeout: float | None = None, |
There was a problem hiding this comment.
[warning] arch: idle_output_timeout and max_suppression_seconds are execution-tuning params (L1 concerns) added to the HeadlessExecutor protocol in L0 core/. This couples L0 to L1 operational details. Consider kwargs escape hatch or moving protocol to L1.
There was a problem hiding this comment.
Investigated — this is intentional. The SubprocessRunner protocol already contained multiple execution-tuning params at L0 before this PR: timeout (float), stale_threshold (float=1200), completion_drain_timeout (float=5.0), pty_mode (bool), linux_tracing_config (Any|None). The new idle_output_timeout and max_suppression_seconds follow the identical pre-existing pattern. This is the established design for this protocol — it serves as the structural contract mirroring run_managed_async.
| Orthogonal to Channel A/B: NOT suppressed by active API connections. | ||
| Monitors raw byte count (st_size), not JSONL record structure. | ||
| """ | ||
| import time as _time |
There was a problem hiding this comment.
[info] cohesion: import time as _time is a local import inside _watch_stdout_idle. The rest of the module uses module-level imports. Promote to module level.
| await anyio.sleep(_poll_interval) | ||
| try: | ||
| current_size = stdout_path.stat().st_size | ||
| except OSError: |
There was a problem hiding this comment.
[info] defense: except OSError: continue silently swallows all OS errors on stdout_path.stat(). ENOENT is expected, but other errors (permission denied, I/O error) should be logged at DEBUG.
| _phase2_poll=_phase2_poll, | ||
| _phase1_timeout=_phase1_timeout, | ||
| expected_session_id=acc.stdout_session_id, | ||
| **_monitor_kwargs, # type: ignore[arg-type] |
There was a problem hiding this comment.
[info] defense: # type: ignore[arg-type] suppresses type error without rationale. dict[str, object] is too wide for the target signature. Consider TypedDict or add explanatory comment.
| completion_drain_timeout: float = 5.0 | ||
| exit_after_stop_delay_ms: int = 120000 | ||
| idle_output_timeout: int = 600 | ||
| max_suppression_seconds: int = 1800 |
There was a problem hiding this comment.
[info] defense: max_suppression_seconds and idle_output_timeout have no range validation. Zero or negative values would cause silent logic bugs (immediate expiry). Consider post_init assertion.
| is_error=True, | ||
| exit_code=-1, | ||
| needs_retry=True, | ||
| retry_reason=RetryReason.STALE, |
There was a problem hiding this comment.
[info] defense: RetryReason.STALE reused for IDLE_STALL termination. Downstream retry telemetry cannot distinguish the two failure modes.
…ss timeouts Addresses 82-minute hung SSE stream where ESTABLISHED TCP connection suppressed stale-kill indefinitely. Three layers of fix: 1. Stdout idle watchdog (_watch_stdout_idle): independent coroutine in the anyio task group that monitors stdout temp file byte count. Fires IDLE_STALL if no growth for idle_output_timeout seconds. NOT suppressed by active API connections — orthogonal to Channel A/B. 2. Bounded suppression: max_suppression_seconds caps _has_active_api_connection and _has_active_child_processes suppression gates so they cannot defer stale kills indefinitely. Counter resets on genuine JSONL activity. 3. Subprocess timeouts: add timeout= to all unguarded subprocess.run() and asyncio subprocess calls in remote_resolver, smoke_utils, clone, and token_summary_appender.
Add io_task.cancel() in the TimeoutError handler to prevent a task leak when proc.communicate() future is abandoned after timeout. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Split shared suppression_start into suppression_start_api and suppression_start_child to prevent elapsed time carry-over when sessions alternate between API connection and child process conditions. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ynaconf The contract test test_every_subconfig_field_referenced_in_from_dynaconf requires every RunSkillConfig field to have a val() line in from_dynaconf. These two fields were added to the dataclass but not wired into the config builder, causing merge-queue CI failure. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
deeff77 to
4584ee1
Compare
## Summary `clone_repo` collapses three distinct subprocess outcomes (`TimeoutExpired`, non-zero `returncode`, genuinely absent origin) into a single silent fallback — `detected_url = ""` → `_resolve_clone_source` → `git clone <local-path>` via git's local transport. The return dict's `remote_url` is an untagged `str`, so four downstream recipes that splice `context.remote_url` into shell commands and `push_to_remote` cannot distinguish a healthy clone from a silently-corrupted one. The same `""` value causes the `if effective_url:` guard at `clone.py:333` to skip the `#377` origin-isolation rewrite, compounding the regression. **Bug class:** "subprocess fails → value collapses to empty string → downstream consumers silently use the empty value → invariant violation goes undetected." Five prior PRs (#368, #377, #424, #544, #735) each patched an adjacent sub-path but left this collapse site untouched. **Part A architectural immunity (three phases, sequentially layered):** 1. **Typed probe result.** Replace the one-line `_resolve_clone_source` collapse with `_probe_origin_url` returning a `CloneSourceResolution` dataclass carrying `reason` (`"ok" | "no_origin" | "timeout" | "error"`), `url`, and `stderr`. The call site must switch on `reason`; no path can collapse the four cases back into a single string. 2. **Typed return contract.** Introduce `CloneResult` — a `TypedDict` union with a `clone_source_type: Literal["remote", "local"]` discriminator on the success shape. mypy statically rejects any return branch that omits the discriminator, so provenance is impossible to lose. 3. **Unconditional origin isolation.** Extract `_ensure_origin_isolated` and call it for every successful clone regardless of URL availability. Removes the `if effective_url:` guard that silently skipped the `#377` rewrite. Together these three phases make the `git clone <local-path>` failure mode structurally unreachable from the default strategy: the only way to get a local-source clone is the explicit `strategy="clone_local"` branch (`shutil.copytree`), which now also produces a typed success result with `clone_source_type="local"` and a `file://` origin. ## Architecture Impact ### Error/Resilience Diagram ```mermaid %%{init: {'flowchart': {'nodeSpacing': 40, 'rankSpacing': 55, 'curve': 'basis'}}}%% flowchart TB %% CLASS DEFINITIONS %% classDef terminal fill:#1a237e,stroke:#7986cb,stroke-width:2px,color:#fff; classDef stateNode fill:#004d40,stroke:#4db6ac,stroke-width:2px,color:#fff; classDef handler fill:#e65100,stroke:#ffb74d,stroke-width:2px,color:#fff; classDef phase fill:#6a1b9a,stroke:#ba68c8,stroke-width:2px,color:#fff; classDef newComponent fill:#2e7d32,stroke:#81c784,stroke-width:2px,color:#fff; classDef output fill:#00695c,stroke:#4db6ac,stroke-width:2px,color:#fff; classDef detector fill:#b71c1c,stroke:#ef5350,stroke-width:2px,color:#fff; classDef gap fill:#ff6f00,stroke:#ffa726,stroke-width:2px,color:#000; START([START]) %% ── RECIPE-TIME VALIDATION ───────────────────────────────────────────── subgraph RecipeValidation ["★ RECIPE-TIME VALIDATION (rules_clone.py)"] direction TB RuleGate["★ clone-local-strategy-with-<br/>remote-url-capture<br/>━━━━━━━━━━<br/>fired at open_kitchen / load_recipe<br/>inspects strategy + capture_list"] RuleErr["★ RuleFinding ERROR<br/>━━━━━━━━━━<br/>strategy='clone_local'<br/>+ remote_url capture"] RuleWarn["★ RuleFinding WARNING<br/>━━━━━━━━━━<br/>templated / unknown strategy<br/>+ remote_url capture"] end T_RECIPE_BLOCKED(["RECIPE_BLOCKED<br/>recipe invalid at load time"]) %% ── CLONE ENTRY GATES ────────────────────────────────────────────────── subgraph CloneGates ["● CLONE ENTRY GATES (clone_repo)"] direction TB SourceCheck{"● source_dir<br/>exists?"} UncommittedGate{"● detect_uncommitted_changes<br/>━━━━━━━━━━<br/>only when strategy=''<br/>never raises"} UnpublishedGate{"● detect_unpublished_branch<br/>━━━━━━━━━━<br/>15 s timeout — fail-open<br/>no origin → False"} end T_VAL_ERR(["ValueError<br/>source_dir missing"]) T_UNCOMMITTED(["★ CloneGateUncommitted<br/>━━━━━━━━━━<br/>soft stop: dirty tree<br/>returns dict, no raise"]) T_UNPUBLISHED(["★ CloneGateUnpublished<br/>━━━━━━━━━━<br/>soft stop: branch not on remote<br/>returns dict, no raise"]) %% ── ORIGIN PROBE ─────────────────────────────────────────────────────── subgraph OriginProbe ["★ ORIGIN PROBE (_probe_origin_url)"] direction LR Probe["★ git remote get-url origin<br/>━━━━━━━━━━<br/>30 s timeout<br/>always runs (all strategies)"] ProbeOK["★ reason='ok'<br/>URL captured"] ProbeNoOrigin["★ reason='no_origin'<br/>no remote configured"] ProbeTimeout["★ reason='timeout'<br/>subprocess hung ≥ 30 s"] ProbeError["★ reason='error'<br/>non-zero / unexpected stderr"] end %% ── STRATEGY ROUTING ─────────────────────────────────────────────────── subgraph StrategyRouter ["● STRATEGY ROUTING (clone_repo)"] direction TB StratDecide{"● strategy?"} ProbeOKCheck{"probe.reason<br/>== 'ok'?"} LocalCopy["clone_local:<br/>━━━━━━━━━━<br/>shutil.copytree<br/>remote_url always ''"] RemoteClone["● git clone {url}<br/>━━━━━━━━━━<br/>300 s timeout<br/>--branch if set"] end T_PROBE_ERR(["RuntimeError<br/>clone_origin_probe_failed<br/>use strategy='clone_local'<br/>for local-only repos"]) T_CLONE_FAIL(["RuntimeError<br/>git clone non-zero exit"]) %% ── POST-CLONE ISOLATION ─────────────────────────────────────────────── subgraph PostClone ["POST-CLONE ORIGIN ISOLATION (_ensure_origin_isolated)"] direction TB IsolateOrigin["_ensure_origin_isolated<br/>━━━━━━━━━━<br/>origin → file://clone_path<br/>upstream → real remote URL<br/>(if effective_url truthy)"] Decontaminate["decontaminate:<br/>━━━━━━━━━━<br/>git rm --cached GENERATED_FILES<br/>silent on failure (warns only)"] end T_ISOLATE_ERR(["RuntimeError<br/>origin isolation failed<br/>git remote add/set-url"]) T_SUCCESS_LOCAL(["★ CloneSuccessResult<br/>━━━━━━━━━━<br/>clone_source_type='local'<br/>remote_url=''"]) T_SUCCESS_REMOTE(["★ CloneSuccessResult<br/>━━━━━━━━━━<br/>clone_source_type='remote'<br/>remote_url=probed URL"]) %% ── PUSH RESILIENCE ──────────────────────────────────────────────────── subgraph PushPath ["● PUSH RESILIENCE (push_to_remote)"] direction TB ProtectedCheck{"protected<br/>branch?"} URLResolve{"● remote_url param<br/>provided?<br/>━━━━━━━━━━<br/>fall back to source_dir<br/>origin lookup"} BareCheck{"classify_remote_url<br/>== 'nonbare_local'?"} PushExec["● git push -u upstream {branch}<br/>━━━━━━━━━━<br/>120 s timeout<br/>--force-with-lease if force=True"] end T_PUSH_PROT(["success=False<br/>error_type='protected_branch_push'"]) T_PUSH_NOURL(["success=False<br/>neither remote_url<br/>nor source_dir"]) T_PUSH_NONBARE(["success=False<br/>error_type='local_non_bare_remote'"]) T_PUSH_FAIL(["success=False<br/>stderr=git output<br/>error_type if force"]) T_PUSH_OK(["success=True"]) %% ── HOOK FORMATTER ───────────────────────────────────────────────────── subgraph FmtLayer ["★ HOOK OUTPUT FORMATTER (_fmt_clone_repo)"] FmtDetect{"★ payload keys?"} FmtWarn["★ ⚠ WARNING<br/>uncommitted_changes<br/>or unpublished_branch"] FmtFail["★ ✗ FAIL<br/>'error' key present"] FmtOK["★ ✓ OK<br/>clone succeeded"] end %% ══ CONNECTIONS ════════════════════════════════════════════════════════ START -.->|"static analysis<br/>before any runtime call"| RuleGate RuleGate -->|"clone_local + remote_url capture"| RuleErr RuleGate -->|"templated/unknown + remote_url capture"| RuleWarn RuleErr --> T_RECIPE_BLOCKED RuleWarn --> T_RECIPE_BLOCKED RuleGate -->|"safe strategies / no capture"| SourceCheck SourceCheck -->|"missing"| T_VAL_ERR SourceCheck -->|"exists"| UncommittedGate UncommittedGate -->|"dirty (strategy='')"| T_UNCOMMITTED UncommittedGate -->|"clean / strategy≠''"| UnpublishedGate UnpublishedGate -->|"unpublished (exit 2)"| T_UNPUBLISHED UnpublishedGate -->|"ok / timeout / no origin"| Probe Probe -->|"returncode=0, url≠''"| ProbeOK Probe -->|"'no such remote' in stderr"| ProbeNoOrigin Probe -->|"≥30 s elapsed"| ProbeTimeout Probe -->|"other non-zero exit"| ProbeError ProbeOK --> StratDecide ProbeNoOrigin --> StratDecide ProbeTimeout --> StratDecide ProbeError --> StratDecide StratDecide -->|"strategy='clone_local'"| LocalCopy StratDecide -->|"proceed / default"| ProbeOKCheck ProbeOKCheck -->|"no (no_origin / timeout / error)"| T_PROBE_ERR ProbeOKCheck -->|"yes"| RemoteClone RemoteClone -->|"returncode≠0"| T_CLONE_FAIL RemoteClone -->|"success"| IsolateOrigin LocalCopy --> IsolateOrigin IsolateOrigin -->|"git error (not already-exists)"| T_ISOLATE_ERR IsolateOrigin -->|"ok"| Decontaminate Decontaminate -->|"clone_local path"| T_SUCCESS_LOCAL Decontaminate -->|"remote clone path"| T_SUCCESS_REMOTE T_SUCCESS_REMOTE -.->|"later in pipeline"| ProtectedCheck ProtectedCheck -->|"yes"| T_PUSH_PROT ProtectedCheck -->|"no"| URLResolve URLResolve -->|"neither provided"| T_PUSH_NOURL URLResolve -->|"resolved"| BareCheck BareCheck -->|"nonbare_local"| T_PUSH_NONBARE BareCheck -->|"network / bare_local / none"| PushExec PushExec -->|"returncode≠0"| T_PUSH_FAIL PushExec -->|"success"| T_PUSH_OK T_SUCCESS_LOCAL -.->|"result piped to hook"| FmtDetect T_SUCCESS_REMOTE -.->|"result piped to hook"| FmtDetect T_UNCOMMITTED -.->|"result piped to hook"| FmtDetect T_UNPUBLISHED -.->|"result piped to hook"| FmtDetect FmtDetect -->|"uncommitted_changes / unpublished_branch"| FmtWarn FmtDetect -->|"error key"| FmtFail FmtDetect -->|"neither"| FmtOK %% CLASS ASSIGNMENTS class START terminal; class RuleGate,RuleErr,RuleWarn newComponent; class T_RECIPE_BLOCKED,T_VAL_ERR,T_UNCOMMITTED,T_UNPUBLISHED,T_PROBE_ERR,T_CLONE_FAIL,T_ISOLATE_ERR terminal; class T_SUCCESS_LOCAL,T_SUCCESS_REMOTE,T_PUSH_OK terminal; class T_PUSH_PROT,T_PUSH_NOURL,T_PUSH_NONBARE,T_PUSH_FAIL terminal; class SourceCheck,UncommittedGate,UnpublishedGate detector; class Probe,ProbeOK,ProbeNoOrigin,ProbeTimeout,ProbeError newComponent; class StratDecide,ProbeOKCheck phase; class LocalCopy,RemoteClone handler; class IsolateOrigin,Decontaminate output; class ProtectedCheck,URLResolve,BareCheck detector; class PushExec handler; class FmtDetect,FmtWarn,FmtFail,FmtOK newComponent; ``` ### State Lifecycle Diagram ```mermaid %%{init: {'flowchart': {'nodeSpacing': 50, 'rankSpacing': 60, 'curve': 'basis'}}}%% flowchart TB %% CLASS DEFINITIONS %% classDef cli fill:#1a237e,stroke:#7986cb,stroke-width:2px,color:#fff; classDef stateNode fill:#004d40,stroke:#4db6ac,stroke-width:2px,color:#fff; classDef handler fill:#e65100,stroke:#ffb74d,stroke-width:2px,color:#fff; classDef phase fill:#6a1b9a,stroke:#ba68c8,stroke-width:2px,color:#fff; classDef newComponent fill:#2e7d32,stroke:#81c784,stroke-width:2px,color:#fff; classDef output fill:#00695c,stroke:#4db6ac,stroke-width:2px,color:#fff; classDef detector fill:#b71c1c,stroke:#ef5350,stroke-width:2px,color:#fff; classDef gap fill:#ff6f00,stroke:#ffa726,stroke-width:2px,color:#000; classDef terminal fill:#1a237e,stroke:#7986cb,stroke-width:2px,color:#fff; %% ENTRY %% CALL(["clone_repo(remote_url='', strategy=?)"]) subgraph RuntimeGates ["RUNTIME VALIDATION GATES — workspace/clone.py ●"] direction TB SRC_CHECK["G1 · source.is_dir()<br/>━━━━━━━━━━<br/>Raises ValueError<br/>if not a directory"] STRAT_CHECK{"strategy<br/>== '' ?"} UNCOMMIT["G2 · detect_uncommitted_changes<br/>━━━━━━━━━━<br/>git status --porcelain<br/>Returns gate result — no clone created"] UNPUB["G3 · detect_unpublished_branch<br/>━━━━━━━━━━<br/>git ls-remote --exit-code origin<br/>Returns gate result — no clone created"] PROBE["G4 · _probe_origin_url ●<br/>━━━━━━━━━━<br/>Raises RuntimeError if reason != ok<br/>NO silent local-path fallback"] end subgraph GateResults ["GATE RESULT TYPES — core/_type_results.py ●"] direction LR RES_UNCOMMIT["CloneGateUncommitted<br/>━━━━━━━━━━<br/>uncommitted_changes: Literal['true']<br/>changed_files: str<br/>total_changed: str<br/>branch, source_dir<br/>— NO clone_path — NO remote_url —"] RES_UNPUB["CloneGateUnpublished<br/>━━━━━━━━━━<br/>unpublished_branch: Literal['true']<br/>branch, source_dir<br/>— NO clone_path — NO remote_url —"] end subgraph RemoteResolution ["remote_url RESOLUTION — workspace/clone.py ●"] direction TB INPUT_URL["Input: remote_url = '' (optional)<br/>━━━━━━━━━━<br/>Caller may supply or leave blank"] LOCAL_STRAT{"strategy ==<br/>'clone_local' ?"} USE_PROBE["effective_url = resolution.url<br/>━━━━━━━━━━<br/>Real remote (from G4 probe)"] USE_CALLER["effective_url = caller_url OR resolution.url<br/>━━━━━━━━━━<br/>Caller override takes priority"] ISOLATE["_ensure_origin_isolated<br/>━━━━━━━━━━<br/>origin → file://clone_path (unique)<br/>upstream → effective_url"] end subgraph SuccessResult ["SUCCESS CONTRACT — core/_type_results.py ●"] direction TB CLONE_PATH["clone_path: str [INIT_ONLY]<br/>━━━━━━━━━━<br/><run_name>-YYYYMMDD-HHMMSS-ffffff<br/>Unique per call — never reused"] SRC_DIR["source_dir: str [INIT_ONLY]<br/>━━━━━━━━━━<br/>expanduser().resolve() — fixed at entry"] REMOTE_URL["remote_url: str [RESOLVED · required]<br/>━━━━━━━━━━<br/>Non-Optional on CloneSuccessResult<br/>'' for clone_local · real URL for proceed"] SRC_TYPE["clone_source_type: Literal['remote','local'] [INIT_ONLY]<br/>━━━━━━━━━━<br/>'local' when strategy='clone_local'"] SRC_REASON["clone_source_reason: str [INIT_ONLY]<br/>━━━━━━━━━━<br/>'strategy_clone_local' or 'ok'"] end subgraph RecipeGate ["★ RECIPE SEMANTIC GATE — recipe/rules_clone.py ●"] direction TB RULE["clone-local-strategy-with-remote-url-capture<br/>━━━━━━━━━━<br/>Fires at recipe-load time (open_kitchen / load_recipe)<br/>Checks: python=clone_repo AND capture references result.remote_url<br/>Inspects template VALUE not capture key name"] STRAT_EVAL{"strategy<br/>value?"} PASS_RULE["PASS — strategy='' / 'proceed'<br/>━━━━━━━━━━<br/>Part A guarantees non-empty remote_url<br/>Safe to capture downstream"] ERROR_RULE["ERROR — strategy='clone_local' literal<br/>━━━━━━━━━━<br/>remote_url always '' for local transport<br/>Capturing it corrupts downstream push"] WARN_RULE["WARN — strategy=${{template}}<br/>━━━━━━━━━━<br/>Cannot determine statically<br/>May produce empty remote_url"] end subgraph Formatter ["★ DISPLAY LAYER — hooks/_fmt_status.py (new tests)"] FMT["_fmt_clone_repo<br/>━━━━━━━━━━<br/>All scalar fields rendered unconditionally<br/>remote_url: '' rendered as-is for local clones<br/>Sentinel key PRESENCE → WARNING mark (not value)"] end %% RUNTIME FLOW %% CALL --> SRC_CHECK SRC_CHECK -->|invalid dir| ERR_VAL(["ValueError"]) SRC_CHECK -->|valid| STRAT_CHECK STRAT_CHECK -->|yes · check clean| UNCOMMIT UNCOMMIT -->|changes found| RES_UNCOMMIT UNCOMMIT -->|clean| UNPUB UNPUB -->|branch absent| RES_UNPUB UNPUB -->|published| PROBE STRAT_CHECK -->|no| PROBE PROBE -->|reason != ok| ERR_PROBE(["RuntimeError no_origin"]) PROBE -->|reason == ok| INPUT_URL %% RESOLUTION FLOW %% INPUT_URL --> LOCAL_STRAT LOCAL_STRAT -->|yes — shutil.copytree| USE_CALLER LOCAL_STRAT -->|no — git clone| USE_PROBE USE_PROBE --> USE_CALLER USE_CALLER --> ISOLATE ISOLATE --> CLONE_PATH ISOLATE --> SRC_DIR ISOLATE --> REMOTE_URL ISOLATE --> SRC_TYPE ISOLATE --> SRC_REASON %% DISPLAY FLOW %% CLONE_PATH --> FMT SRC_DIR --> FMT REMOTE_URL --> FMT SRC_TYPE --> FMT SRC_REASON --> FMT RES_UNCOMMIT --> FMT RES_UNPUB --> FMT %% RECIPE GATE FLOW %% RULE --> STRAT_EVAL STRAT_EVAL -->|safe strategy| PASS_RULE STRAT_EVAL -->|clone_local literal| ERROR_RULE STRAT_EVAL -->|template expr| WARN_RULE %% CLASS ASSIGNMENTS %% class CALL terminal; class SRC_CHECK,UNCOMMIT,UNPUB,PROBE detector; class STRAT_CHECK,LOCAL_STRAT,STRAT_EVAL phase; class RES_UNCOMMIT,RES_UNPUB gap; class INPUT_URL,USE_PROBE,USE_CALLER,ISOLATE handler; class CLONE_PATH,SRC_DIR,SRC_TYPE,SRC_REASON stateNode; class REMOTE_URL gap; class RULE detector; class PASS_RULE output; class ERROR_RULE detector; class WARN_RULE gap; class FMT output; class ERR_VAL,ERR_PROBE terminal; ``` ### Process Flow Diagram ```mermaid %%{init: {'flowchart': {'nodeSpacing': 40, 'rankSpacing': 50, 'curve': 'basis'}}}%% flowchart TB %% CLASS DEFINITIONS %% classDef terminal fill:#1a237e,stroke:#7986cb,stroke-width:2px,color:#fff; classDef stateNode fill:#004d40,stroke:#4db6ac,stroke-width:2px,color:#fff; classDef handler fill:#e65100,stroke:#ffb74d,stroke-width:2px,color:#fff; classDef phase fill:#6a1b9a,stroke:#ba68c8,stroke-width:2px,color:#fff; classDef detector fill:#b71c1c,stroke:#ef5350,stroke-width:2px,color:#fff; classDef output fill:#00695c,stroke:#4db6ac,stroke-width:2px,color:#fff; classDef newComponent fill:#2e7d32,stroke:#81c784,stroke-width:2px,color:#fff; %% TERMINALS %% START([START]) GATED([GATED — early return]) FAIL([ERROR — raises RuntimeError]) COMPLETE([COMPLETE]) subgraph Entry ["Entry & Resolution"] direction TB A["● clone_repo<br/>━━━━━━━━━━<br/>workspace/clone.py<br/>args: source_dir, run_name,<br/>branch, strategy, remote_url"] B["Resolve source_dir<br/>━━━━━━━━━━<br/>detect_source_dir(cwd)<br/>if source_dir empty"] C["Detect branch<br/>━━━━━━━━━━<br/>detect_branch(source_dir)<br/>blank if HEAD / detached"] end subgraph Gates ["Strategy Gates — bypassed when strategy='clone_local' or 'proceed'"] direction TB D{"strategy<br/>bypasses gates?"} E["detect_uncommitted_changes<br/>━━━━━━━━━━<br/>git status --porcelain<br/>reads: working tree"] F{"any uncommitted<br/>changes?"} G["detect_unpublished_branch<br/>━━━━━━━━━━<br/>git ls-remote origin<br/>15 s timeout — fail-open"] H{"branch<br/>unpublished?"} GU["CloneGateUncommitted<br/>━━━━━━━━━━<br/>writes: {uncommitted_changes,<br/>changed_files, total_changed}"] GP["CloneGateUnpublished<br/>━━━━━━━━━━<br/>writes: {unpublished_branch,<br/>branch, source_dir}"] end subgraph Dispatch ["Clone Dispatch"] direction TB K{"strategy ==<br/>'clone_local'?"} L["● _probe_origin_url<br/>━━━━━━━━━━<br/>git remote get-url origin<br/>30 s timeout<br/>reads: git remote config<br/>writes: CloneSourceResolution"] M{"resolution.reason<br/>== 'ok'?"} N["shutil.copytree<br/>━━━━━━━━━━<br/>full worktree copy<br/>source_type = 'local'<br/>reason = 'strategy_clone_local'"] O["git clone --branch<br/>━━━━━━━━━━<br/>resolution.url → clone_path<br/>300 s timeout<br/>source_type = 'remote'"] end subgraph Isolation ["● Origin Isolation Fix — #807 / #377"] direction TB P["● _ensure_origin_isolated<br/>━━━━━━━━━━<br/>UNCONDITIONAL on both paths<br/>writes: origin = file://clone_path<br/>(self-referential — breaks repo aliasing)"] Q{"effective_url<br/>truthy?"} R["_add_or_set_upstream<br/>━━━━━━━━━━<br/>writes: upstream = real remote URL<br/>preserves provenance for push_to_remote"] S["Decontaminate clone<br/>━━━━━━━━━━<br/>git rm --cached GENERATED_FILES<br/>git commit --no-verify<br/>os.unlink disk copies"] end subgraph Result ["● Result Assembly"] direction TB T["● CloneSuccessResult<br/>━━━━━━━━━━<br/>clone_path, source_dir<br/>remote_url (effective_url)<br/>★ clone_source_type: 'remote'|'local'<br/>★ clone_source_reason: str<br/>writes: _type_results.py"] end subgraph Downstream ["Downstream Consumers"] direction LR U["★ _fmt_clone_repo<br/>━━━━━━━━━━<br/>hooks/_fmt_status.py<br/>renders clone_source_type<br/>+ clone_source_reason<br/>as flat KV output"] V["● Recipe Validator<br/>━━━━━━━━━━<br/>run_semantic_rules()<br/>reads: Recipe steps + captures"] end subgraph RuleLogic ["● New Rule: clone-local-strategy-with-remote-url-capture"] direction TB W{"step = clone_repo<br/>AND captures<br/>result.remote_url?"} X{"strategy value?"} RE["★ RuleFinding ERROR<br/>━━━━━━━━━━<br/>strategy='clone_local'<br/>remote_url always ''\n→ recipe INVALID<br/>fixture: invalid.yaml"] RW["★ RuleFinding WARNING<br/>━━━━━━━━━━<br/>template / unknown strategy<br/>cannot resolve statically<br/>→ recipe VALID, flagged"] RN["No Finding<br/>━━━━━━━━━━<br/>safe/proceed/empty strategy<br/>or no remote_url capture<br/>fixture: valid.yaml"] end %% MAIN FLOW %% START --> A A --> B --> C C --> D D -->|"yes — skip gates"| K D -->|"no"| E E --> F F -->|"yes"| GU --> GATED F -->|"no"| G G --> H H -->|"yes"| GP --> GATED H -->|"no"| K K -->|"yes (clone_local)"| N K -->|"no"| L L --> M M -->|"no — no usable origin"| FAIL M -->|"yes"| O N --> P O --> P P --> Q Q -->|"yes"| R --> S Q -->|"no"| S S --> T T --> U T --> V V --> W W -->|"no"| RN W -->|"yes"| X X -->|"'clone_local'"| RE X -->|"template / unknown"| RW X -->|"'' / 'proceed'"| RN RE --> COMPLETE RW --> COMPLETE RN --> COMPLETE %% CLASS ASSIGNMENTS %% class START,GATED,FAIL,COMPLETE terminal; class A,N,O,P,R,S handler; class B,C,L phase; class D,F,H,K,M,Q,W,X stateNode; class GU,GP detector; class T output; class U newComponent; class V,RE,RW detector; class RN phase; ``` Closes #807 ## Implementation Plan Plan file: `/home/talon/projects/autoskillit-runs/remediation-807-20260413-075530-622615/.autoskillit/temp/rectify/rectify_clone_repo_silent_fallback_2026-04-13_110616_part_a.md` 🤖 Generated with [Claude Code](https://claude.com/claude-code) via AutoSkillit ## Token Usage Summary | Step | uncached | output | cache_read | cache_write | count | time | |------|----------|--------|------------|-------------|-------|------| | investigate | 120 | 14.5k | 477.8k | 78.4k | 1 | 5m 38s | | rectify | 4.1k | 68.6k | 2.2M | 233.2k | 2 | 25m 4s | | review | 2.9k | 8.5k | 226.5k | 55.4k | 1 | 6m 55s | | dry_walkthrough | 3.2k | 45.6k | 2.7M | 137.3k | 2 | 15m 50s | | implement | 2.4k | 49.1k | 7.0M | 146.4k | 2 | 19m 12s | | assess | 1.2k | 25.3k | 2.6M | 77.2k | 1 | 11m 51s | | prepare_pr | 60 | 6.7k | 212.6k | 30.7k | 1 | 1m 43s | | run_arch_lenses | 3.2k | 31.5k | 622.1k | 118.9k | 3 | 9m 49s | | compose_pr | 67 | 11.6k | 243.5k | 38.1k | 1 | 2m 46s | | **Total** | 17.2k | 261.3k | 16.3M | 915.6k | | 1h 38m | --------- Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Summary
A headless Claude subprocess hung for 82 minutes on a silently stalled HTTPS SSE stream. The TCP connection remained ESTABLISHED with zero traffic, so the
_has_active_api_connectionsuppression gate in_process_monitor.pykept resetting the stale-kill timer every 2-second poll cycle. Neither Channel A nor Channel B detected the stall because Channel A only checks for valid JSONL result records (not raw byte growth), and Channel B's stale detection was suppressed by the ESTABLISHED port-443 connection.Three layers of fix provide immunity:
Stdout idle watchdog (primary) — new
_watch_stdout_idlecoroutine in the anyio task group monitors stdout temp file byte count. FiresTerminationReason.IDLE_STALLif no growth foridle_output_timeoutseconds. NOT suppressed by active API connections — orthogonal to Channel A/B.Bounded suppression (defense-in-depth) —
max_suppression_secondscaps the_has_active_api_connectionand_has_active_child_processessuppression gates so they cannot defer stale kills indefinitely. Counter resets on genuine JSONL activity.Subprocess timeouts —
timeout=added to all unguardedsubprocess.run()andasynciosubprocess calls inremote_resolver,smoke_utils,clone, andtoken_summary_appender.Changed Files
New (★)
tests/execution/test_process_idle_watchdog.py— watchdog coroutine unit teststests/workspace/test_clone_timeouts.py— static analysis: git network commands have timeoutsModified (●)
src/autoskillit/core/_type_enums.py—IDLE_STALLmembers onTerminationReasonandCliSubtypesrc/autoskillit/config/settings.py—idle_output_timeoutandmax_suppression_secondsonRunSkillConfigsrc/autoskillit/execution/_process_race.py—_watch_stdout_idlecoroutine,idle_stallfield onRaceAccumulator/RaceSignals, priority update inresolve_terminationsrc/autoskillit/execution/_process_monitor.py— bounded suppression logic withsuppression_starttrackersrc/autoskillit/execution/process.py— wire watchdog into task group,IDLE_STALLkill dispatchsrc/autoskillit/execution/headless.py—IDLE_STALLrouting in_build_skill_result(retriable)src/autoskillit/execution/session.py— exhaustive match arms forIDLE_STALLsrc/autoskillit/core/_type_subprocess.py— protocol updated with new paramssrc/autoskillit/execution/recording.py— runner implementations updatedsrc/autoskillit/execution/remote_resolver.py— 15s timeout viaasyncio.wait_forsrc/autoskillit/smoke_utils.py—timeout=60on all subprocess callssrc/autoskillit/workspace/clone.py—timeout=300(clone),timeout=120(push)src/autoskillit/hooks/token_summary_appender.py—timeout=30on gh api callsTest Plan
test_watch_stdout_idle_fires_on_silence— watchdog fires IDLE_STALL when stdout stops growingtest_watch_stdout_idle_resets_on_continuous_output— watchdog does NOT fire when output is steadytest_watch_stdout_idle_handles_missing_file— tolerates missing file until it appearstest_resolve_termination_idle_stall_priority— idle stall outranks stale in priority chaintest_resolve_termination_process_exit_beats_idle_stall— process exit still highest prioritytest_resolve_termination_idle_stall_beats_stale— idle stall outranks staletest_race_signals_field_count— sentinel: 7 fields on RaceSignalstest_build_skill_result_idle_stall_is_retriable— IDLE_STALL produces retriable SkillResulttest_run_managed_async_idle_stall_kills_hanging_process— end-to-end integrationtest_stale_suppression_bounded_by_max_duration— bounded suppression fires after captest_stale_suppression_resets_on_genuine_activity— counter resets on real JSONL activitytest_stale_suppression_logs_warning_on_bounded_kill— warning log emittedtimeout=Implementation Plan
Plan file:
.autoskillit/temp/rectify/rectify_idle_stall_watchdog_2026-04-11_140500.md🤖 Generated with Claude Code via AutoSkillit