Rectify: TestRunner Protocol Returns Lossy Bare Tuple; Non-Pytest Runners Produce False Negatives #520
Conversation
- Add TestResult dataclass to core/_type_results.py (passed, stdout, stderr)
- Export TestResult from core/__init__.py and types.py
- Update TestRunner Protocol return type from tuple[bool, str] to TestResult
- Invert check_test_passed() CWA: no pytest summary → trust exit code (non-pytest runners)
- Add stderr parameter to check_test_passed(); combine stdout+stderr for pytest detection
- DefaultTestRunner.run() now returns TestResult with full stdout and stderr
- test_check MCP handler returns {passed, stdout, stderr} instead of {passed, output}
- perform_merge() uses TestResult attribute access at both test gate call sites
- Update StatefulMockTester to accept list[TestResult] and return TestResult
- Update all StatefulMockTester construction sites in test_git.py to use TestResult
- Flip two CWA test assertions (empty output / log-only now expect True)
- Add 9 new tests: non-pytest runner paths, TestResult fields, schema, cargo nextest style
- Add test_test_runner_return_type_is_test_result() contract test
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
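The first five bullets can be sketched as follows (a hedged sketch: the field names and the Protocol return-type change come from the list above, while `frozen=True` and the `cwd` parameter name are illustrative assumptions):

```python
from dataclasses import dataclass
from typing import Protocol


# Named result type replacing the bare tuple[bool, str].
# frozen=True is an assumption, not confirmed by the change list.
@dataclass(frozen=True)
class TestResult:
    passed: bool
    stdout: str
    stderr: str


class TestRunner(Protocol):
    # Previously: async def run(self, cwd) -> tuple[bool, str]
    async def run(self, cwd: str) -> TestResult: ...
```

Because every consumer now names the fields it reads (`result.stderr` rather than unpacking a positional tuple), dropping stderr again would require an explicit code change rather than a silent omission.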
Trecek left a comment
AutoSkillit PR Review — Verdict: changes_requested
6 blocking findings (1 critical, 5 warnings). See inline comments.
```python
if returncode != 0:
    return False
counts = parse_pytest_summary(stdout)
combined = stdout + ("\n" + stderr if stderr else "")
```
[warning] bugs: stderr concatenation may cause incorrect regex match across the join boundary — if stdout does not end with a newline, the concatenated string could form a spurious pytest summary line. Safer to call parse_pytest_summary on stdout and stderr independently and merge the resulting count dicts.
Investigated — this is intentional. The concatenation always inserts a literal `"\n"` between stdout and stderr when stderr is non-empty (`stdout + ("\n" + stderr if stderr else "")`). Since `parse_pytest_summary` processes lines via `splitlines()`, the inserted `"\n"` guarantees stdout and stderr always land on separate lines, so a cross-boundary spurious match cannot form. Pass 1 requires a line that both starts and ends with `=`; Pass 2 requires an `in N.Ns` timing suffix. Neither can be synthesized by concatenating the tail of stdout with the head of stderr across a newline boundary.
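The newline-boundary argument can be demonstrated with a toy detector (an illustrative stand-in: the real `parse_pytest_summary` regexes are not reproduced here, only the `splitlines()`-based line scanning it is described as using):

```python
import re

# Toy summary detector: a line that starts with "=" and contains
# "N passed" before a trailing run of "=". Not the real parser.
SUMMARY = re.compile(r"^=+\s*\d+ passed.*=+$")


def has_summary_line(text: str) -> bool:
    return any(SUMMARY.match(line) for line in text.splitlines())


# Adversarial halves: stdout ends with what looks like a summary prefix,
# stderr begins with what looks like its suffix.
stdout = "collected 5 items\n== 5"
stderr = " passed in 1.2s =="

# Naive concatenation would fuse the halves into a full summary line...
assert has_summary_line(stdout + stderr) is True
# ...but the PR's concatenation inserts "\n", keeping them on separate
# lines, so no spurious match can form across the join boundary.
combined = stdout + ("\n" + stderr if stderr else "")
assert has_summary_line(combined) is False
```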
```python
assert check_test_passed(1, "", "") is False


def test_check_test_passed_parses_pytest_summary_in_stderr() -> None:
```
[warning] tests: test_check_test_passed_parses_pytest_summary_in_stderr overlaps with test_check_test_passed_true_when_no_summary_stderr_only (L283) — both call `check_test_passed(0, "", …)` with only a stderr payload and assert the result is True. Consider merging into a parametrized test to avoid silent divergence.
Investigated — this is intentional. These two tests cover distinct code branches: test_check_test_passed_true_when_no_summary_stderr_only (L283) passes "test result: ok. 42 passed; 0 failed\n" (cargo/nextest-style, no pytest summary pattern) — it verifies the non-pytest fallback path where no summary is found and exit code is trusted. test_check_test_passed_parses_pytest_summary_in_stderr (L304) passes "= 5 passed in 1.2s =\n" — a canonical pytest summary line, verifying that pytest summary detection works when the summary appears in stderr. Merging them would collapse coverage of two separate code paths into one test.
```python
assert "PASS [0.5s] 5 tests" in result["stderr"]


@pytest.mark.anyio
async def test_response_schema_includes_stderr(self, tool_ctx):
```
[info] tests: test_response_schema_includes_stderr and test_does_not_expose_summary (L154) both assert set(result.keys()) == {"passed", "stdout", "stderr"}. The schema assertion is now duplicated across two tests.
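One way to deduplicate the flagged assertion (a sketch only; the helper name `assert_schema` and the module-level constant are hypothetical, not part of the PR):

```python
# Single source of truth for the test_check response contract,
# shared by both tests that currently repeat the key set inline.
EXPECTED_KEYS = {"passed", "stdout", "stderr"}


def assert_schema(result: dict) -> None:
    # Raises AssertionError if the response schema drifts.
    assert set(result.keys()) == EXPECTED_KEYS
```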
```python
)


@pytest.mark.anyio
async def test_cargo_nextest_style_passes(self, tool_ctx):
```
[info] tests: test_cargo_nextest_style_passes checks result["stderr"] with `in` rather than equality. If the implementation appends extra text, the assertion still passes. Use `==` for deterministic output validation.
Summary
The `TestRunner` Protocol (`core/_type_protocols.py`) returned a bare positional `tuple[bool, str]` — the only Protocol in the codebase that used a positional tuple for complex return data rather than a named dataclass. This structural gap made it impossible to carry `stderr` through the testing pipeline without a breaking change, which is why `DefaultTestRunner.run()` silently discarded `result.stderr`. The downstream consequence was twofold: (1) `check_test_passed()` operated on stdout-only data, and (2) its closed-world assumption (CWA) — returning `False` when no pytest summary was found — fired unconditionally for all non-pytest runners (cargo nextest, `go test`, custom scripts) that pass via exit code 0 with no pytest-format output.

The fix introduces a named `TestResult` dataclass (matching the established pattern of `SkillResult`, `CleanupResult`, `CIRunScope` in `_type_results.py`) and promotes it to the Protocol return type. This makes it structurally impossible to silently drop stderr, forces every caller to name the fields it consumes, and eliminates the implicit "bare tuple = stdout-only" assumption. The adjudication logic is inverted to be runner-agnostic: trust the exit code when no recognized parser (pytest) matches.

Closes #513
Architecture Impact
Process Flow Diagram
```mermaid
%%{init: {'flowchart': {'nodeSpacing': 40, 'rankSpacing': 50, 'curve': 'basis'}}}%%
flowchart TB
    %% CLASS DEFINITIONS %%
    classDef terminal fill:#1a237e,stroke:#7986cb,stroke-width:2px,color:#fff;
    classDef stateNode fill:#004d40,stroke:#4db6ac,stroke-width:2px,color:#fff;
    classDef handler fill:#e65100,stroke:#ffb74d,stroke-width:2px,color:#fff;
    classDef phase fill:#6a1b9a,stroke:#ba68c8,stroke-width:2px,color:#fff;
    classDef newComponent fill:#2e7d32,stroke:#81c784,stroke-width:2px,color:#fff;
    classDef output fill:#00695c,stroke:#4db6ac,stroke-width:2px,color:#fff;
    classDef detector fill:#b71c1c,stroke:#ef5350,stroke-width:2px,color:#fff;

    START([test_check MCP call])
    PASS([PASS])
    FAIL([FAIL])

    subgraph Entry ["MCP Entry — ● tools_workspace.py"]
        TESTER{"tester<br/>configured?"}
        NOCFG["return error<br/>━━━━━━━━━━<br/>not configured"]
    end

    subgraph Execution ["Subprocess Execution — ● testing.py"]
        DTR["● DefaultTestRunner.run(cwd)<br/>━━━━━━━━━━<br/>reads config.test_check.command<br/>reads config.test_check.timeout"]
        RUNNER["SubprocessRunner<br/>━━━━━━━━━━<br/>executes command in cwd<br/>captures stdout + stderr"]
        SPR["SubprocessResult<br/>━━━━━━━━━━<br/>{returncode, stdout, stderr}"]
    end

    subgraph Adjudication ["● Adjudication — check_test_passed()"]
        RC{"returncode != 0?"}
        COMBINE["build combined<br/>━━━━━━━━━━<br/>stdout + '\n' + stderr"]
        PPS["● parse_pytest_summary(combined)<br/>━━━━━━━━━━<br/>2-pass: =-delimited + bare -q<br/>searches stdout AND stderr"]
        MATCH{"pytest summary<br/>found?"}
        TRUST["● Trust exit code<br/>━━━━━━━━━━<br/>returncode == 0 → True<br/>runner-agnostic path"]
        COUNTS{"failed > 0<br/>or error > 0?"}
    end

    subgraph Result ["● TestResult Contract — core/_type_results.py"]
        TR["● TestResult<br/>━━━━━━━━━━<br/>{passed: bool,<br/>stdout: str,<br/>stderr: str}"]
    end

    subgraph Response ["● MCP Response — tools_workspace.py"]
        RESP["● test_check response<br/>━━━━━━━━━━<br/>{passed, stdout, stderr}<br/>truncated text"]
    end

    subgraph MergeGate ["perform_merge() Test Gates — ● git.py"]
        GATE{"test_gate_on_merge?"}
        PRETEST["● Pre-rebase gate<br/>━━━━━━━━━━<br/>tester.run(worktree_path)<br/>blocks on failure"]
        REBASE["git rebase<br/>━━━━━━━━━━<br/>onto base branch"]
        POSTTEST["● Post-rebase gate<br/>━━━━━━━━━━<br/>tester.run(worktree_path)<br/>blocks on regression"]
        MERGE["git merge --ff-only<br/>━━━━━━━━━━<br/>fast-forward merge"]
    end

    MERGEPASS([MERGE SUCCESS])
    MERGEFAIL([MERGE BLOCKED])

    START --> TESTER
    TESTER -->|"yes"| DTR
    TESTER -->|"no"| NOCFG
    DTR --> RUNNER
    RUNNER --> SPR
    SPR -->|"stdout + stderr forwarded"| TR
    TR -->|"passed=?"| RC
    RC -->|"yes"| FAIL
    RC -->|"no"| COMBINE
    COMBINE --> PPS
    PPS --> MATCH
    MATCH -->|"no match"| TRUST
    MATCH -->|"pytest found"| COUNTS
    TRUST -->|"rc == 0"| PASS
    TRUST -->|"rc != 0"| FAIL
    COUNTS -->|"clean"| PASS
    COUNTS -->|"failures"| FAIL
    PASS --> TR
    FAIL --> TR
    TR --> RESP
    GATE -->|"yes"| PRETEST
    GATE -->|"no"| MERGE
    PRETEST -->|"passed"| REBASE
    PRETEST -->|"failed"| MERGEFAIL
    REBASE --> POSTTEST
    POSTTEST -->|"passed"| MERGE
    POSTTEST -->|"failed"| MERGEFAIL
    MERGE --> MERGEPASS

    class START terminal;
    class PASS,FAIL terminal;
    class MERGEPASS,MERGEFAIL terminal;
    class NOCFG detector;
    class DTR handler;
    class RUNNER,SPR stateNode;
    class TR newComponent;
    class RC,MATCH,TESTER,GATE,COUNTS stateNode;
    class COMBINE,PPS phase;
    class TRUST newComponent;
    class RESP output;
    class PRETEST,POSTTEST handler;
    class REBASE,MERGE phase;
```

Module Dependency Diagram
```mermaid
%%{init: {'flowchart': {'nodeSpacing': 50, 'rankSpacing': 70, 'curve': 'basis'}}}%%
graph TB
    %% CLASS DEFINITIONS %%
    classDef cli fill:#1a237e,stroke:#7986cb,stroke-width:2px,color:#fff;
    classDef phase fill:#6a1b9a,stroke:#ba68c8,stroke-width:2px,color:#fff;
    classDef handler fill:#e65100,stroke:#ffb74d,stroke-width:2px,color:#fff;
    classDef stateNode fill:#004d40,stroke:#4db6ac,stroke-width:2px,color:#fff;
    classDef newComponent fill:#2e7d32,stroke:#81c784,stroke-width:2px,color:#fff;
    classDef output fill:#00695c,stroke:#4db6ac,stroke-width:2px,color:#fff;
    classDef detector fill:#b71c1c,stroke:#ef5350,stroke-width:2px,color:#fff;
    classDef integration fill:#c62828,stroke:#ef9a9a,stroke-width:2px,color:#fff;
    classDef terminal fill:#1a237e,stroke:#7986cb,stroke-width:2px,color:#fff;

    subgraph L0 ["L0 — CORE (Zero autoskillit imports)"]
        direction LR
        RESULTS["● core/_type_results.py<br/>━━━━━━━━━━<br/>TestResult dataclass added<br/>SkillResult, CleanupResult, etc."]
        PROTOCOLS["● core/_type_protocols.py<br/>━━━━━━━━━━<br/>TestRunner Protocol<br/>returns TestResult (was tuple)"]
        INIT["● core/__init__.py<br/>━━━━━━━━━━<br/>re-exports TestResult<br/>public surface"]
    end

    subgraph L1 ["L1 — EXECUTION / PIPELINE"]
        direction LR
        TESTING["● execution/testing.py<br/>━━━━━━━━━━<br/>DefaultTestRunner<br/>check_test_passed()"]
        CTX["pipeline/context.py<br/>━━━━━━━━━━<br/>tester: TestRunner | None<br/>DI container"]
    end

    subgraph L3 ["L3 — SERVER"]
        direction LR
        WORKSPACE["● server/tools_workspace.py<br/>━━━━━━━━━━<br/>test_check MCP handler<br/>returns {passed, stdout, stderr}"]
        GIT["● server/git.py<br/>━━━━━━━━━━<br/>perform_merge()<br/>pre + post rebase gates"]
        FACTORY["server/_factory.py<br/>━━━━━━━━━━<br/>wires DefaultTestRunner<br/>into ToolContext.tester"]
    end

    subgraph Tests ["TESTS (cross-layer)"]
        direction LR
        CONFTEST["● tests/conftest.py<br/>━━━━━━━━━━<br/>StatefulMockTester<br/>uses TestResult"]
        TTEST["● tests/execution/test_testing.py<br/>━━━━━━━━━━<br/>check_test_passed tests<br/>non-pytest runner tests"]
        CONTRACTS["● tests/contracts/test_protocol_satisfaction.py<br/>━━━━━━━━━━<br/>TestRunner return type<br/>verified as TestResult"]
        WTEST["● tests/server/test_tools_workspace.py<br/>━━━━━━━━━━<br/>schema: {passed,stdout,stderr}<br/>cargo nextest style test"]
        GTEST["● tests/server/test_git.py<br/>━━━━━━━━━━<br/>StatefulMockTester calls<br/>updated to TestResult"]
        TGTEST["● tests/server/test_tools_git.py<br/>━━━━━━━━━━<br/>ls-files result added<br/>merge gate tests"]
    end

    RESULTS -->|"imported by"| PROTOCOLS
    RESULTS -->|"re-exported via"| INIT
    RESULTS -->|"TestResult"| TESTING
    PROTOCOLS -->|"TestRunner Protocol"| CTX
    CTX -->|"ToolContext.tester"| WORKSPACE
    CTX -->|"ToolContext.tester"| GIT
    TESTING -->|"wired as tester"| FACTORY
    FACTORY -->|"injects"| CTX
    RESULTS -->|"TestResult"| CONFTEST
    PROTOCOLS -->|"TestRunner"| CONTRACTS
    TESTING -->|"check_test_passed"| TTEST
    CONFTEST -->|"StatefulMockTester"| WTEST
    CONFTEST -->|"StatefulMockTester"| GTEST
    GIT -->|"perform_merge"| TGTEST

    class RESULTS newComponent;
    class PROTOCOLS,INIT handler;
    class TESTING handler;
    class CTX stateNode;
    class WORKSPACE,GIT phase;
    class FACTORY phase;
    class CONFTEST,TTEST,CONTRACTS,WTEST,GTEST,TGTEST output;
```

State Lifecycle Diagram
```mermaid
%%{init: {'flowchart': {'nodeSpacing': 50, 'rankSpacing': 60, 'curve': 'basis'}}}%%
flowchart TB
    %% CLASS DEFINITIONS %%
    classDef cli fill:#1a237e,stroke:#7986cb,stroke-width:2px,color:#fff;
    classDef stateNode fill:#004d40,stroke:#4db6ac,stroke-width:2px,color:#fff;
    classDef handler fill:#e65100,stroke:#ffb74d,stroke-width:2px,color:#fff;
    classDef phase fill:#6a1b9a,stroke:#ba68c8,stroke-width:2px,color:#fff;
    classDef newComponent fill:#2e7d32,stroke:#81c784,stroke-width:2px,color:#fff;
    classDef output fill:#00695c,stroke:#4db6ac,stroke-width:2px,color:#fff;
    classDef detector fill:#b71c1c,stroke:#ef5350,stroke-width:2px,color:#fff;
    classDef gap fill:#ff6f00,stroke:#ffa726,stroke-width:2px,color:#000;

    subgraph SubprocessContract ["SUBPROCESS RESULT CONTRACT (immutable source)"]
        direction LR
        RC["SubprocessResult.returncode<br/>━━━━━━━━━━<br/>int — exit code<br/>INIT_ONLY: read once"]
        STDOUT["SubprocessResult.stdout<br/>━━━━━━━━━━<br/>str — process stdout<br/>INIT_ONLY: forwarded as-is"]
        STDERR["SubprocessResult.stderr<br/>━━━━━━━━━━<br/>str — process stderr<br/>INIT_ONLY: was DROPPED; now forwarded"]
    end

    subgraph AdjudicationGate ["● ADJUDICATION GATE — check_test_passed()"]
        direction TB
        RCGATE{"returncode != 0?<br/>━━━━━━━━━━<br/>FAIL-FAST gate"}
        COMBINE["combined = stdout + stderr<br/>━━━━━━━━━━<br/>DERIVED: ephemeral union<br/>used for parsing only"]
        PARSEGATE{"parse_pytest_summary(combined)<br/>━━━━━━━━━━<br/>counts found?"}
        TRUSTGATE["● trust exit code path<br/>━━━━━━━━━━<br/>no counts → rc==0 means PASS<br/>replaces old CWA: return False"]
        COUNTGATE{"failed > 0 or error > 0?<br/>━━━━━━━━━━<br/>pytest count validation"}
    end

    subgraph TestResultContract ["● TESTRESULT CONTRACT — core/_type_results.py"]
        direction LR
        TR_PASSED["● TestResult.passed<br/>━━━━━━━━━━<br/>bool: INIT_ONLY<br/>set by adjudication gate"]
        TR_STDOUT["● TestResult.stdout<br/>━━━━━━━━━━<br/>str: INIT_ONLY<br/>raw subprocess stdout"]
        TR_STDERR["● TestResult.stderr<br/>━━━━━━━━━━<br/>str: INIT_ONLY (was ABSENT)<br/>raw subprocess stderr — now carried"]
    end

    subgraph ProtocolGate ["● PROTOCOL CONFORMANCE GATE — core/_type_protocols.py"]
        PROTO["● TestRunner Protocol<br/>━━━━━━━━━━<br/>async run(cwd) → TestResult<br/>(was tuple[bool, str])"]
        MYPY["mypy type checker<br/>━━━━━━━━━━<br/>enforces return type<br/>at static analysis time"]
    end

    subgraph ConsumerContracts ["● CONSUMER FIELD CONTRACTS"]
        direction LR
        TC_RESP["● test_check response<br/>━━━━━━━━━━<br/>{passed, stdout, stderr}<br/>stderr: was ABSENT in response"]
        MERGE_PRE["● perform_merge pre-gate<br/>━━━━━━━━━━<br/>{test_stdout, test_stderr}<br/>stderr: was ABSENT in error dict"]
        MERGE_POST["● perform_merge post-gate<br/>━━━━━━━━━━<br/>test_result.passed<br/>(was: passed, _ = ...)"]
    end

    PASS_STATE(["passed = True"])
    FAIL_STATE(["passed = False"])

    RC -->|"read once"| RCGATE
    STDOUT -->|"forwarded"| COMBINE
    STDERR -->|"NOW forwarded"| COMBINE
    RCGATE -->|"yes: fail-fast"| FAIL_STATE
    RCGATE -->|"no"| COMBINE
    COMBINE --> PARSEGATE
    PARSEGATE -->|"no match"| TRUSTGATE
    PARSEGATE -->|"pytest found"| COUNTGATE
    TRUSTGATE -->|"rc == 0"| PASS_STATE
    TRUSTGATE -->|"rc != 0"| FAIL_STATE
    COUNTGATE -->|"clean"| PASS_STATE
    COUNTGATE -->|"failures"| FAIL_STATE
    PASS_STATE -->|"passed=True"| TR_PASSED
    FAIL_STATE -->|"passed=False"| TR_PASSED
    STDOUT -->|"stdout="| TR_STDOUT
    STDERR -->|"stderr="| TR_STDERR
    TR_PASSED --> PROTO
    TR_STDOUT --> PROTO
    TR_STDERR --> PROTO
    PROTO --> MYPY
    TR_PASSED -->|"test_result.passed"| TC_RESP
    TR_STDOUT -->|"truncated"| TC_RESP
    TR_STDERR -->|"truncated (NEW)"| TC_RESP
    TR_PASSED -->|"pre-gate check"| MERGE_PRE
    TR_STDOUT -->|"test_stdout"| MERGE_PRE
    TR_STDERR -->|"test_stderr (NEW)"| MERGE_PRE
    TR_PASSED -->|"post-gate check"| MERGE_POST

    class RC,STDOUT stateNode;
    class STDERR gap;
    class RCGATE,PARSEGATE,COUNTGATE stateNode;
    class COMBINE phase;
    class TRUSTGATE newComponent;
    class TR_PASSED,TR_STDOUT newComponent;
    class TR_STDERR newComponent;
    class PROTO,MYPY detector;
    class TC_RESP,MERGE_PRE,MERGE_POST output;
    class PASS_STATE,FAIL_STATE cli;
```

Implementation Plan
Plan file:
`/home/talon/projects/autoskillit-runs/remediation-20260326-162603-381072/.autoskillit/temp/rectify/rectify_test_runner_protocol_lossy_tuple_2026-03-26_000000.md`

🤖 Generated with Claude Code via AutoSkillit