Implementation Plan: resolve-failures — Test Polling Cascade + Output Context Bloat#862
Conversation
Trecek
left a comment
There was a problem hiding this comment.
AutoSkillit PR Review — Verdict: changes_requested
| step3 = _extract_step(text, "Step 3") | ||
| # Must mention extracting/retaining only the key failure info | ||
| has_extract = bool( | ||
| re.search(r"extract|retain only|only.{0,40}(fail|count|name)", step3, re.IGNORECASE) |
There was a problem hiding this comment.
[warning] bugs: Duplicate line number [L89] in the diff annotation — both the re.search(...) call and its closing ) carry the L89 marker; L90 is skipped entirely. In the actual source file, every line from 91 onward maps to one line number lower than the diff label, causing off-by-one navigation errors in editors and coverage tools. The annotation tooling should be fixed to emit unique sequential markers.
There was a problem hiding this comment.
Investigated — this is intentional. The test file at line 89 is functionally correct. The duplicate L89 annotation label is an artifact of how diff_annotator.py formats diff hunks, not a bug in this test. The actual source lines 89 (re.search call) and 90 (closing paren) are correctly sequenced. Any fix to the annotation tooling belongs in a separate issue against diff_annotator.py, not in this test file.
| SKILL_MD = pkg_root() / "skills_extended" / "resolve-failures" / "SKILL.md" | ||
|
|
||
|
|
||
| def _skill_text() -> str: |
There was a problem hiding this comment.
[warning] defense: _skill_text() calls SKILL_MD.read_text() with no error handling. A missing or unreadable file surfaces as a bare FileNotFoundError/PermissionError with no context about which path failed. Wrap in a try/except that re-raises with the resolved path so CI diagnostics are self-explanatory.
There was a problem hiding this comment.
Investigated — this is intentional. Python's Path.read_text() raises FileNotFoundError with the full path already embedded in the OS-level error message (e.g., [Errno 2] No such file or directory: '/path/to/SKILL.md'). The reviewer's concern about 'no context about which path failed' does not apply here — the path is included automatically. Additionally, CLAUDE.md guidelines prohibit adding error handling for scenarios that cannot occur in a correctly installed package. No code change applied.
|
|
||
| # --- Polling-cascade fix --- | ||
|
|
||
| def test_resolve_failures_step2_prohibits_bash_test_run() -> None: |
There was a problem hiding this comment.
[info] tests: test_resolve_failures_step2_prohibits_bash_test_run and test_resolve_failures_step2_prescribes_test_check_mcp both call _extract_step(text, 'Step 2') and verify the same invariant. Separate tests add no isolation value; consider collapsing into a single test with two assertions.
|
|
||
| from autoskillit.core.paths import pkg_root | ||
|
|
||
| SKILL_MD = pkg_root() / "skills_extended" / "resolve-failures" / "SKILL.md" |
There was a problem hiding this comment.
[info] tests: SKILL_MD is computed at module import time via pkg_root(). If pkg_root() raises, the error surfaces as a pytest collection error rather than a clear test failure. Deferring the computation inside _skill_text() would make failures more transparent.
| """ | ||
| text = _skill_text() | ||
| step3 = _extract_step(text, "Step 3") | ||
| # Must mention extracting/retaining only the key failure info |
There was a problem hiding this comment.
[info] slop: The inline comment # Must mention extracting/retaining only the key failure info restates exactly what the docstring and assertion message already say. Remove it.
| ``` | ||
| test_check(worktree_path="{worktree_path}") | ||
| ``` | ||
| Do NOT re-run via Bash — see Step 2 rationale. |
There was a problem hiding this comment.
[info] slop: "Do NOT re-run via Bash — see Step 2 rationale." duplicates the rationale already stated in full in Step 2. The cross-reference alone is sufficient; the redundant prose can be dropped.
Trecek
left a comment
There was a problem hiding this comment.
AutoSkillit review found 3 blocking issues (warnings). See inline comments. (Note: REQUEST_CHANGES is not available on own PRs — verdict is changes_requested.)
…t fixes
Six new tests that verify:
- Step 2 and Step 3 no longer instruct running tests via Bash (cd {worktree_path} && {test_command})
- Both steps prescribe test_check MCP tool instead
- Step 3 instructs extracting only the failure signal and discarding full stdout
…lures skill Eliminates polling cascade (22+ wasted API calls) caused by Bash auto-backgrounding long-running test commands. Replaces both Step 2 and Step 3 cd/test_command patterns with synchronous test_check MCP calls. Adds Step 3 instruction to extract and discard full pytest stdout, preventing ~17K token context accumulation per session.
… anchor on pytest/stdout context
Replaces the overly broad `discard` alternative with `discard.{0,60}(pytest|stdout|full output)`,
which requires the keyword to appear in the context of discarding test output — preventing false
passes on unrelated uses of "discard" in SKILL.md prose.
327ae5d to
a865416
Compare
Summary
Two compounding inefficiencies in
resolve-failures/SKILL.mdwaste tokens and API calls:Polling cascade (22+ wasted API calls per session): Steps 2 and 3 instruct test
execution via
`cd {worktree_path} && {test_command}`through the Bash tool. Whentests exceed the Bash timeout (~5–7 min), the command auto-backgrounds and the LLM
enters a manual polling loop of 18+ Read/Bash calls to wait for completion.
Output context bloat (~17K tokens per session): Each test run injects 3–10K chars
of raw pytest stdout (progress dots, install lines, timing) into conversation context.
Across 5+ runs this accumulates to ~68K chars carried in every subsequent API call.
The fix is entirely in
resolve-failures/SKILL.md: replace both Bash-based testinvocations with
test_checkMCP tool calls (which block synchronously, returning in onecall), and add an instruction in the fix loop to extract only the failure signal from each
result and discard the full stdout.
Architecture Impact
Process Flow Diagram
%%{init: {'flowchart': {'nodeSpacing': 40, 'rankSpacing': 50, 'curve': 'basis'}}}%% flowchart TB %% CLASS DEFINITIONS %% classDef terminal fill:#1a237e,stroke:#7986cb,stroke-width:2px,color:#fff; classDef stateNode fill:#004d40,stroke:#4db6ac,stroke-width:2px,color:#fff; classDef handler fill:#e65100,stroke:#ffb74d,stroke-width:2px,color:#fff; classDef phase fill:#6a1b9a,stroke:#ba68c8,stroke-width:2px,color:#fff; classDef newComponent fill:#2e7d32,stroke:#81c784,stroke-width:2px,color:#fff; classDef output fill:#00695c,stroke:#4db6ac,stroke-width:2px,color:#fff; classDef detector fill:#b71c1c,stroke:#ef5350,stroke-width:2px,color:#fff; %% TERMINALS %% START(["START<br/>run_skill resolve-failures"]) COMPLETE(["COMPLETE<br/>fixes_applied = N emitted"]) ERROR(["ERROR<br/>max iterations reached"]) subgraph Setup ["Setup & Validation — Steps 0 through 0.7"] direction TB S0["Step 0: Validate Args<br/>━━━━━━━━━━<br/>Path-detect worktree_path,<br/>plan_path, base_branch tokens"] S03["Step 0.3: Code-Index Init<br/>━━━━━━━━━━<br/>set_project_path repo root<br/>enables find_files / search_code"] S05["Step 0.5: Commit Dirty Tree<br/>━━━━━━━━━━<br/>git status --porcelain<br/>auto-commit if non-empty"] S07["Step 0.7: Switch Code-Index<br/>━━━━━━━━━━<br/>set_project_path worktree<br/>isolates queries to worktree"] end subgraph Context ["Context Understanding — Steps 1 and 2a"] direction TB S1["Step 1: Read Plan + Git History<br/>━━━━━━━━━━<br/>Plan intent, git log --oneline,<br/>git diff --stat from merge-base"] S2a["Step 2a: Reproduce CI Env<br/>━━━━━━━━━━<br/>Read CI workflow; run pre-test<br/>generation steps if present"] end subgraph TestRun ["● Step 2: Test Run — Single Synchronous MCP Call"] direction TB S2["● test_check MCP<br/>━━━━━━━━━━<br/>test_check(worktree_path)<br/>Blocks until suite completes<br/>Returns passed true or false<br/>in exactly 1 API call<br/><br/>BEFORE: Bash auto-backgrounds<br/>after timeout; LLM polls<br/>via 20+ Read/Bash calls"] D_PASS{"passed?"} end subgraph FixLoop ["● Step 3: Fix Loop — max 3 iterations"] direction TB S3A["Analyze Failures vs Plan<br/>━━━━━━━━━━<br/>Root cause vs implementation intent"] S3B["Apply Targeted Fixes<br/>━━━━━━━━━━<br/>pre-commit run --all-files<br/>git commit: fix description"] S3C["Write Fix Log<br/>━━━━━━━━━━<br/>AUTOSKILLIT_TEMP/resolve-failures/<br/>fix_log_N_ts.md per iteration"] S3D["● test_check MCP re-run<br/>━━━━━━━━━━<br/>test_check(worktree_path)<br/>Blocks synchronously<br/>BEFORE: Bash re-run, same<br/>polling-cascade risk"] S3E["● Extract Failure Signal<br/>━━━━━━━━━━<br/>Retain: counts, test names,<br/>error msgs 10-15 lines each<br/>Discard: full pytest stdout,<br/>progress dots, timing lines"] D_GREEN{"green?"} D_ITER{"iter >= 3?"} end subgraph GuardSuite ["★ tests/skills/test_resolve_failures_guards.py — new file"] direction TB G12["★ Step 2 Poll-Cascade Guards<br/>━━━━━━━━━━<br/>prohibits_bash_test_run<br/>prescribes_test_check_mcp"] G34["★ Step 3 Poll-Cascade Guards<br/>━━━━━━━━━━<br/>prohibits_bash_rerun<br/>prescribes_test_check_mcp"] G56["★ Step 3 Output-Bloat Guards<br/>━━━━━━━━━━<br/>instructs_output_summarization<br/>instructs_stdout_discard"] end %% MAIN FLOW %% START --> S0 S0 --> S03 S03 --> S05 S05 --> S07 S07 --> S1 S1 --> S2a S2a --> S2 S2 --> D_PASS D_PASS -->|"passed: true"| COMPLETE D_PASS -->|"passed: false"| S3A S3A --> S3B S3B --> S3C S3C --> S3D S3D --> S3E S3E --> D_GREEN D_GREEN -->|"green"| COMPLETE D_GREEN -->|"still red"| D_ITER D_ITER -->|"yes — report failure"| ERROR D_ITER -->|"no — next iteration"| S3A %% GUARD CONNECTIONS (build-time validation of SKILL.md text) %% G12 -. "validates SKILL.md Step 2" .-> S2 G34 -. "validates SKILL.md Step 3 re-run" .-> S3D G56 -. "validates SKILL.md Step 3 extract" .-> S3E %% CLASS ASSIGNMENTS %% class START,COMPLETE,ERROR terminal; class S0,S03,S05,S07 phase; class S1,S2a handler; class S2,S3D stateNode; class D_PASS,D_GREEN,D_ITER stateNode; class S3A,S3B,S3C handler; class S3E output; class G12,G34,G56 newComponent;Error/Resilience Diagram
%%{init: {'flowchart': {'nodeSpacing': 42, 'rankSpacing': 54, 'curve': 'basis'}}}%% flowchart TB %% CLASS DEFINITIONS %% classDef cli fill:#1a237e,stroke:#7986cb,stroke-width:2px,color:#fff; classDef stateNode fill:#004d40,stroke:#4db6ac,stroke-width:2px,color:#fff; classDef handler fill:#e65100,stroke:#ffb74d,stroke-width:2px,color:#fff; classDef phase fill:#6a1b9a,stroke:#ba68c8,stroke-width:2px,color:#fff; classDef newComponent fill:#2e7d32,stroke:#81c784,stroke-width:2px,color:#fff; classDef output fill:#00695c,stroke:#4db6ac,stroke-width:2px,color:#fff; classDef detector fill:#b71c1c,stroke:#ef5350,stroke-width:2px,color:#fff; classDef gap fill:#ff6f00,stroke:#ffa726,stroke-width:2px,color:#000; classDef integration fill:#c62828,stroke:#ef9a9a,stroke-width:2px,color:#fff; classDef terminal fill:#1a237e,stroke:#7986cb,stroke-width:2px,color:#fff; %% ENTRY POINT %% INVOKE(["run_skill → resolve-failures<br/>━━━━━━━━━━<br/>Orchestrator triggers after merge FAIL"]) subgraph OLD ["❌ ELIMINATED FAILURE PATH (old Bash invocation)"] direction TB BASH["Bash: cd {worktree} &&<br/>task test-all<br/>━━━━━━━━━━<br/>Blocks on 5–7 min test suite"] TIMEOUT["Bash Auto-Background<br/>━━━━━━━━━━<br/>Built-in timeout exceeded<br/>Command silently backgrounded"] LOST_SIGNAL["Lost DONE Signal<br/>━━━━━━━━━━<br/>LLM has no completion event<br/>Process runs detached"] CASCADE["Polling Cascade<br/>━━━━━━━━━━<br/>18–22+ API calls (Read/Bash)<br/>Full context resent each call"] BLOAT["Output Bloat<br/>━━━━━━━━━━<br/>3–10K chars pytest stdout/run<br/>~17K tokens accumulated/session<br/>cache_read cost inflated"] end subgraph STEP2 ["● STEP 2 — test_check BLOCKING PATH (resolve-failures/SKILL.md)"] direction TB TC_CALL["● test_check MCP Tool<br/>━━━━━━━━━━<br/>server/tools_workspace.py<br/>Single blocking async call"] RUNNER["DefaultTestRunner.run()<br/>━━━━━━━━━━<br/>execution/testing.py<br/>Configures cmd + timeout + env"] MANAGED["run_managed_async()<br/>━━━━━━━━━━<br/>execution/process.py<br/>• Temp-file I/O (not pipes)<br/>• start_new_session=True<br/>• stdin=DEVNULL<br/>• anyio task-group watchers"] PROC_DONE["Process Exits / Timeout<br/>━━━━━━━━━━<br/>On timeout: SIGTERM → SIGKILL<br/>process tree killed"] TC_RESULT["Test Result<br/>━━━━━━━━━━<br/>passed: bool, stdout, stderr<br/>Single atomic MCP response"] end subgraph STEP3 ["● STEP 3 — FIX LOOP (circuit-breaker: max 3 iterations)"] direction TB APPLY_FIX["Apply Code Fix<br/>━━━━━━━━━━<br/>Read/Edit/Write on failing tests<br/>No unrelated changes"] RERUN["● test_check MCP Re-run<br/>━━━━━━━━━━<br/>Same blocking path as Step 2<br/>NOT Bash — identical isolation"] EXTRACT["● Output Extraction<br/>━━━━━━━━━━<br/>Retain ONLY:<br/>• pass/fail counts<br/>• failing test names<br/>• first 10–15 lines/error"] DISCARD["● Stdout Discard<br/>━━━━━━━━━━<br/>Drop: full pytest stdout<br/>progress dots, timing lines<br/>install-worktree output"] CB{"Circuit Breaker<br/>━━━━━━━━━━<br/>iteration < 3?"} end subgraph GUARDS ["★ REGRESSION GUARDS (tests/skills/test_resolve_failures_guards.py)"] direction LR G1["★ prohibits_bash_test_run<br/>━━━━━━━━━━<br/>Step 2: 'cd {worktree}' absent"] G2["★ prescribes_test_check_mcp<br/>━━━━━━━━━━<br/>Step 2: 'test_check' present"] G3["★ fix_loop_prohibits_bash_rerun<br/>━━━━━━━━━━<br/>Step 3: 'cd {worktree}' absent"] G4["★ fix_loop_prescribes_test_check<br/>━━━━━━━━━━<br/>Step 3: 'test_check' present"] G5["★ instructs_output_summarization<br/>━━━━━━━━━━<br/>Step 3: extract|retain only<br/>failure signal present"] G6["★ instructs_stdout_discard<br/>━━━━━━━━━━<br/>Step 3: discard|do not retain<br/>full output present"] end %% TERMINAL STATES %% T_GREEN(["WORKTREE GREEN<br/>━━━━━━━━━━<br/>passed=true → merge-ready"]) T_RED(["FAILURE — MAX ITERATIONS<br/>━━━━━━━━━━<br/>Worktree preserved for inspection<br/>Orchestrator notified"]) T_CASCADE(["UNCONTROLLED CASCADE<br/>━━━━━━━━━━<br/>Session cost explodes"]) %% ENTRY ROUTING %% INVOKE -.->|"OLD: eliminated"| BASH INVOKE -->|"NEW: active"| TC_CALL %% OLD FAILURE PATH (dead end) %% BASH -->|"timeout fires"| TIMEOUT TIMEOUT -->|"silent background"| LOST_SIGNAL LOST_SIGNAL -->|"LLM polls"| CASCADE CASCADE -->|"accumulates"| BLOAT BLOAT --> T_CASCADE %% NEW EXECUTION PATH %% TC_CALL --> RUNNER RUNNER --> MANAGED MANAGED -->|"process lifecycle"| PROC_DONE PROC_DONE --> TC_RESULT %% RESULT ROUTING %% TC_RESULT -->|"passed=true"| T_GREEN TC_RESULT -->|"passed=false"| APPLY_FIX %% FIX LOOP %% APPLY_FIX --> RERUN RERUN --> EXTRACT EXTRACT --> DISCARD DISCARD --> CB CB -->|"yes — another iteration"| APPLY_FIX CB -->|"no — exhausted"| T_RED %% GUARD ENFORCEMENT (static analysis, dashed) %% G1 -.->|"asserts absent"| TC_CALL G2 -.->|"asserts present"| TC_CALL G3 -.->|"asserts absent"| RERUN G4 -.->|"asserts present"| RERUN G5 -.->|"asserts present"| EXTRACT G6 -.->|"asserts present"| DISCARD %% CLASS ASSIGNMENTS %% class INVOKE terminal; class BASH,TIMEOUT integration; class LOST_SIGNAL,CASCADE,BLOAT,T_CASCADE gap; class TC_CALL,RERUN handler; class RUNNER,MANAGED phase; class PROC_DONE,APPLY_FIX handler; class TC_RESULT stateNode; class EXTRACT,DISCARD output; class CB stateNode; class G1,G2,G3,G4,G5,G6 detector; class T_GREEN,T_RED terminal;Scenarios Diagram
%%{init: {'flowchart': {'nodeSpacing': 40, 'rankSpacing': 50, 'curve': 'basis'}}}%% flowchart LR %% CLASS DEFINITIONS %% classDef cli fill:#1a237e,stroke:#7986cb,stroke-width:2px,color:#fff; classDef stateNode fill:#004d40,stroke:#4db6ac,stroke-width:2px,color:#fff; classDef handler fill:#e65100,stroke:#ffb74d,stroke-width:2px,color:#fff; classDef phase fill:#6a1b9a,stroke:#ba68c8,stroke-width:2px,color:#fff; classDef newComponent fill:#2e7d32,stroke:#81c784,stroke-width:2px,color:#fff; classDef output fill:#00695c,stroke:#4db6ac,stroke-width:2px,color:#fff; classDef detector fill:#b71c1c,stroke:#ef5350,stroke-width:2px,color:#fff; classDef gap fill:#ff6f00,stroke:#ffa726,stroke-width:2px,color:#000; classDef terminal fill:#1a237e,stroke:#7986cb,stroke-width:2px,color:#fff; subgraph S1 ["S1 (BEFORE): Bash Polling Cascade — Mitigated"] direction LR S1_SKILL["● SKILL.md Step 2<br/>━━━━━━━━━━<br/>Old: run tests via Bash"] S1_BASH["Bash Tool<br/>━━━━━━━━━━<br/>task test-all"] S1_BG["Auto-Background<br/>━━━━━━━━━━<br/>test suite over 5 min<br/>command detaches"] S1_POLL["LLM Polling Loop<br/>━━━━━━━━━━<br/>20+ API calls<br/>to detect completion"] end subgraph S2 ["S2 (AFTER): test_check Single Blocking Call"] direction LR S2_SKILL["● SKILL.md Step 2<br/>━━━━━━━━━━<br/>Now: use test_check MCP<br/>not Bash"] S2_MCP["test_check MCP<br/>━━━━━━━━━━<br/>tools_workspace.py<br/>Blocking async call"] S2_PROC["subprocess<br/>━━━━━━━━━━<br/>task test-all<br/>awaited to completion"] S2_ADJ["check_test_passed()<br/>━━━━━━━━━━<br/>Cross-validates exit code<br/>and pytest summary lines"] S2_RESULT["Structured Result<br/>━━━━━━━━━━<br/>passed: true/false<br/>One call — no polling"] end subgraph S3 ["S3 (AFTER): Fix Loop — Summarized Output (max 3 iterations)"] direction LR S3_SKILL["● SKILL.md Step 3<br/>━━━━━━━━━━<br/>Fix loop: test_check only<br/>Do NOT re-run via Bash"] S3_FIX["Apply Fix<br/>━━━━━━━━━━<br/>Edit source files<br/>in worktree"] S3_MCP["test_check MCP<br/>━━━━━━━━━━<br/>Re-run tests<br/>per iteration"] S3_EXTRACT["Extract Signal<br/>━━━━━━━━━━<br/>pass/fail counts<br/>failing test names<br/>first 10-15 error lines"] S3_DISCARD["Discard stdout<br/>━━━━━━━━━━<br/>No dots / timing<br/>No install output"] S3_CTX["Bounded Context<br/>━━━━━━━━━━<br/>Minimal token growth<br/>across all iterations"] end subgraph S4 ["S4: Guard Enforcement — Structural SKILL.md Validation"] direction LR S4_CI["CI / task test-check<br/>━━━━━━━━━━<br/>Runs test suite in parallel"] S4_GUARD["★ test_resolve_failures_guards.py<br/>━━━━━━━━━━<br/>6 regex guard tests<br/>xdist-safe: pure text assertions"] S4_LOAD["SKILL.md loaded<br/>━━━━━━━━━━<br/>pkg_root() path lookup<br/>read once at module scope"] S4_SCOPE["_extract_step()<br/>━━━━━━━━━━<br/>Regex scope to Step 2 or 3<br/>lookahead: Step 2 not Step 2a"] S4_POLLING["4 Polling-Cascade Guards<br/>━━━━━━━━━━<br/>Bash pattern ABSENT<br/>test_check PRESENT<br/>in Steps 2 and 3"] S4_BLOAT["2 Output-Bloat Guards<br/>━━━━━━━━━━<br/>summarize instruction present<br/>discard instruction present<br/>in Step 3"] S4_PASS["CI PASS<br/>━━━━━━━━━━<br/>SKILL.md structurally<br/>compliant — both fixes hold"] end %% S1: Bash polling cascade flow S1_SKILL -->|"LLM invokes Bash"| S1_BASH S1_BASH -->|"auto-backgrounds"| S1_BG S1_BG -->|"LLM must detect completion"| S1_POLL %% S2: test_check blocking flow S2_SKILL -->|"LLM invokes MCP"| S2_MCP S2_MCP -->|"awaits subprocess"| S2_PROC S2_PROC -->|"exit code + stdout"| S2_ADJ S2_ADJ -->|"single JSON response"| S2_RESULT %% S3: fix loop summarization flow S3_SKILL --> S3_FIX S3_FIX -->|"test_check MCP call"| S3_MCP S3_MCP -->|"full stdout returned"| S3_EXTRACT S3_EXTRACT -->|"retain failure signal only"| S3_DISCARD S3_DISCARD -->|"loop if still failing"| S3_CTX %% S4: guard enforcement flow S4_CI --> S4_GUARD S4_GUARD -->|"reads"| S4_LOAD S4_LOAD -->|"scopes text"| S4_SCOPE S4_SCOPE -->|"Steps 2 + 3"| S4_POLLING S4_SCOPE -->|"Step 3 only"| S4_BLOAT S4_POLLING -->|"assert"| S4_PASS S4_BLOAT -->|"assert"| S4_PASS %% CLASS ASSIGNMENTS %% class S1_SKILL,S2_SKILL,S3_SKILL phase; class S1_BASH,S1_BG,S1_POLL gap; class S2_MCP,S3_MCP handler; class S2_PROC,S2_ADJ phase; class S2_RESULT,S3_CTX output; class S3_FIX handler; class S3_EXTRACT phase; class S3_DISCARD detector; class S4_CI cli; class S4_GUARD newComponent; class S4_LOAD stateNode; class S4_SCOPE phase; class S4_POLLING,S4_BLOAT detector; class S4_PASS output;Closes #815
Implementation Plan
Plan file:
/home/talon/projects/autoskillit-runs/impl-20260413-185024-166119/.autoskillit/temp/make-plan/resolve-failures-test-polling-cascade_plan_2026-04-13_185340.md🤖 Generated with Claude Code via AutoSkillit
Token Usage Summary