
Rectify: TestRunner Protocol Returns Lossy Bare Tuple; Non-Pytest Runners Produce False Negatives #520

Merged
Trecek merged 6 commits into integration from
test-check-returns-false-negative-for-non-pytest-test-runner/513
Mar 27, 2026

Conversation

Collaborator

@Trecek Trecek commented Mar 27, 2026

Summary

The TestRunner Protocol (core/_type_protocols.py) returned a bare positional tuple[bool, str] — the only Protocol in the codebase that used a positional tuple for complex return data rather than a named dataclass. This structural gap made it impossible to carry stderr through the testing pipeline without a breaking change, which is why DefaultTestRunner.run() silently discarded result.stderr. The downstream consequence was twofold: (1) check_test_passed() operated on stdout-only data, and (2) its closed-world assumption (CWA) — returning False when no pytest summary was found — fired unconditionally for all non-pytest runners (cargo nextest, go test, custom scripts) that pass via exit code 0 with no pytest-format output.
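The pre-fix shape of the Protocol can be sketched as follows (the method body and the commented call site are illustrative assumptions, not code from the repository; only the `tuple[bool, str]` return type comes from the PR description):

```python
from typing import Protocol


class TestRunner(Protocol):
    """Pre-fix contract: a bare positional tuple forces callers to
    remember what the str slot holds (stdout only; stderr was dropped)."""

    async def run(self, cwd: str) -> tuple[bool, str]: ...


# Illustrative pre-fix call site: positional unpacking invites silent loss.
# passed, output = await tester.run(worktree_path)  # stderr never surfaces
```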

The fix introduces a named TestResult dataclass (matching the established pattern of SkillResult, CleanupResult, CIRunScope in _type_results.py) and promotes it to the Protocol return type. This makes it structurally impossible to silently drop stderr, forces every caller to name the fields they consume, and eliminates the implicit "bare tuple = stdout-only" assumption. The adjudication logic is inverted to be runner-agnostic: trust exit code when no recognized parser (pytest) matches.
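A minimal sketch of the new contract and the inverted adjudication. The `TestResult` fields and the `check_test_passed(returncode, stdout, stderr)` signature come from this PR; the regex here is a simplified stand-in for the real two-pass `parse_pytest_summary`, which is stricter:

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class TestResult:
    """Post-fix contract: named fields make dropping stderr impossible."""
    passed: bool
    stdout: str
    stderr: str


# Stand-in for parse_pytest_summary() (assumption: the real parser is
# two-pass and stricter; only the branch structure is modeled here).
_SUMMARY = re.compile(r"(\d+) (passed|failed|error)")


def check_test_passed(returncode: int, stdout: str, stderr: str) -> bool:
    """Runner-agnostic adjudication, per the PR description (sketch)."""
    if returncode != 0:  # fail-fast: non-zero exit always fails
        return False
    combined = stdout + ("\n" + stderr if stderr else "")
    counts = dict.fromkeys(("passed", "failed", "error"), 0)
    matched = False
    for n, kind in _SUMMARY.findall(combined):
        counts[kind] += int(n)
        matched = True
    if not matched:  # no recognized parser output: trust the exit code
        return True
    return counts["failed"] == 0 and counts["error"] == 0
```

The old closed-world branch returned False at the "no recognized parser output" point; trusting the exit code there is what makes cargo nextest, go test, and custom scripts pass.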

Closes #513

Architecture Impact

Process Flow Diagram

%%{init: {'flowchart': {'nodeSpacing': 40, 'rankSpacing': 50, 'curve': 'basis'}}}%%
flowchart TB
    %% CLASS DEFINITIONS %%
    classDef terminal fill:#1a237e,stroke:#7986cb,stroke-width:2px,color:#fff;
    classDef stateNode fill:#004d40,stroke:#4db6ac,stroke-width:2px,color:#fff;
    classDef handler fill:#e65100,stroke:#ffb74d,stroke-width:2px,color:#fff;
    classDef phase fill:#6a1b9a,stroke:#ba68c8,stroke-width:2px,color:#fff;
    classDef newComponent fill:#2e7d32,stroke:#81c784,stroke-width:2px,color:#fff;
    classDef output fill:#00695c,stroke:#4db6ac,stroke-width:2px,color:#fff;
    classDef detector fill:#b71c1c,stroke:#ef5350,stroke-width:2px,color:#fff;

    START([test_check MCP call])
    PASS([PASS])
    FAIL([FAIL])

    subgraph Entry ["MCP Entry — ● tools_workspace.py"]
        TESTER{"tester<br/>configured?"}
        NOCFG["return error<br/>━━━━━━━━━━<br/>not configured"]
    end

    subgraph Execution ["Subprocess Execution — ● testing.py"]
        DTR["● DefaultTestRunner.run(cwd)<br/>━━━━━━━━━━<br/>reads config.test_check.command<br/>reads config.test_check.timeout"]
        RUNNER["SubprocessRunner<br/>━━━━━━━━━━<br/>executes command in cwd<br/>captures stdout + stderr"]
        SPR["SubprocessResult<br/>━━━━━━━━━━<br/>{returncode, stdout, stderr}"]
    end

    subgraph Adjudication ["● Adjudication — check_test_passed()"]
        RC{"returncode != 0?"}
        COMBINE["build combined<br/>━━━━━━━━━━<br/>stdout + '\n' + stderr"]
        PPS["● parse_pytest_summary(combined)<br/>━━━━━━━━━━<br/>2-pass: =-delimited + bare -q<br/>searches stdout AND stderr"]
        MATCH{"pytest summary<br/>found?"}
        TRUST["● Trust exit code<br/>━━━━━━━━━━<br/>returncode == 0 → True<br/>runner-agnostic path"]
        COUNTS{"failed > 0<br/>or error > 0?"}
    end

    subgraph Result ["● TestResult Contract — core/_type_results.py"]
        TR["● TestResult<br/>━━━━━━━━━━<br/>{passed: bool,<br/>stdout: str,<br/>stderr: str}"]
    end

    subgraph Response ["● MCP Response — tools_workspace.py"]
        RESP["● test_check response<br/>━━━━━━━━━━<br/>{passed, stdout, stderr}<br/>truncated text"]
    end

    subgraph MergeGate ["perform_merge() Test Gates — ● git.py"]
        GATE{"test_gate_on_merge?"}
        PRETEST["● Pre-rebase gate<br/>━━━━━━━━━━<br/>tester.run(worktree_path)<br/>blocks on failure"]
        REBASE["git rebase<br/>━━━━━━━━━━<br/>onto base branch"]
        POSTTEST["● Post-rebase gate<br/>━━━━━━━━━━<br/>tester.run(worktree_path)<br/>blocks on regression"]
        MERGE["git merge --ff-only<br/>━━━━━━━━━━<br/>fast-forward merge"]
    end

    MERGEPASS([MERGE SUCCESS])
    MERGEFAIL([MERGE BLOCKED])

    START --> TESTER
    TESTER -->|"yes"| DTR
    TESTER -->|"no"| NOCFG
    DTR --> RUNNER
    RUNNER --> SPR
    SPR -->|"stdout + stderr forwarded"| TR
    TR -->|"passed=?"| RC
    RC -->|"yes"| FAIL
    RC -->|"no"| COMBINE
    COMBINE --> PPS
    PPS --> MATCH
    MATCH -->|"no match"| TRUST
    MATCH -->|"pytest found"| COUNTS
    TRUST -->|"rc == 0"| PASS
    TRUST -->|"rc != 0"| FAIL
    COUNTS -->|"clean"| PASS
    COUNTS -->|"failures"| FAIL
    PASS --> TR
    FAIL --> TR
    TR --> RESP

    GATE -->|"yes"| PRETEST
    GATE -->|"no"| MERGE
    PRETEST -->|"passed"| REBASE
    PRETEST -->|"failed"| MERGEFAIL
    REBASE --> POSTTEST
    POSTTEST -->|"passed"| MERGE
    POSTTEST -->|"failed"| MERGEFAIL
    MERGE --> MERGEPASS

    class START terminal;
    class PASS,FAIL terminal;
    class MERGEPASS,MERGEFAIL terminal;
    class NOCFG detector;
    class DTR handler;
    class RUNNER,SPR stateNode;
    class TR newComponent;
    class RC,MATCH,TESTER,GATE,COUNTS stateNode;
    class COMBINE,PPS phase;
    class TRUST newComponent;
    class RESP output;
    class PRETEST,POSTTEST handler;
    class REBASE,MERGE phase;

Module Dependency Diagram

%%{init: {'flowchart': {'nodeSpacing': 50, 'rankSpacing': 70, 'curve': 'basis'}}}%%
graph TB
    %% CLASS DEFINITIONS %%
    classDef cli fill:#1a237e,stroke:#7986cb,stroke-width:2px,color:#fff;
    classDef phase fill:#6a1b9a,stroke:#ba68c8,stroke-width:2px,color:#fff;
    classDef handler fill:#e65100,stroke:#ffb74d,stroke-width:2px,color:#fff;
    classDef stateNode fill:#004d40,stroke:#4db6ac,stroke-width:2px,color:#fff;
    classDef newComponent fill:#2e7d32,stroke:#81c784,stroke-width:2px,color:#fff;
    classDef output fill:#00695c,stroke:#4db6ac,stroke-width:2px,color:#fff;
    classDef detector fill:#b71c1c,stroke:#ef5350,stroke-width:2px,color:#fff;
    classDef integration fill:#c62828,stroke:#ef9a9a,stroke-width:2px,color:#fff;
    classDef terminal fill:#1a237e,stroke:#7986cb,stroke-width:2px,color:#fff;

    subgraph L0 ["L0 — CORE (Zero autoskillit imports)"]
        direction LR
        RESULTS["● core/_type_results.py<br/>━━━━━━━━━━<br/>TestResult dataclass added<br/>SkillResult, CleanupResult, etc."]
        PROTOCOLS["● core/_type_protocols.py<br/>━━━━━━━━━━<br/>TestRunner Protocol<br/>returns TestResult (was tuple)"]
        INIT["● core/__init__.py<br/>━━━━━━━━━━<br/>re-exports TestResult<br/>public surface"]
    end

    subgraph L1 ["L1 — EXECUTION / PIPELINE"]
        direction LR
        TESTING["● execution/testing.py<br/>━━━━━━━━━━<br/>DefaultTestRunner<br/>check_test_passed()"]
        CTX["pipeline/context.py<br/>━━━━━━━━━━<br/>tester: TestRunner | None<br/>DI container"]
    end

    subgraph L3 ["L3 — SERVER"]
        direction LR
        WORKSPACE["● server/tools_workspace.py<br/>━━━━━━━━━━<br/>test_check MCP handler<br/>returns {passed, stdout, stderr}"]
        GIT["● server/git.py<br/>━━━━━━━━━━<br/>perform_merge()<br/>pre + post rebase gates"]
        FACTORY["server/_factory.py<br/>━━━━━━━━━━<br/>wires DefaultTestRunner<br/>into ToolContext.tester"]
    end

    subgraph Tests ["TESTS (cross-layer)"]
        direction LR
        CONFTEST["● tests/conftest.py<br/>━━━━━━━━━━<br/>StatefulMockTester<br/>uses TestResult"]
        TTEST["● tests/execution/test_testing.py<br/>━━━━━━━━━━<br/>check_test_passed tests<br/>non-pytest runner tests"]
        CONTRACTS["● tests/contracts/test_protocol_satisfaction.py<br/>━━━━━━━━━━<br/>TestRunner return type<br/>verified as TestResult"]
        WTEST["● tests/server/test_tools_workspace.py<br/>━━━━━━━━━━<br/>schema: {passed,stdout,stderr}<br/>cargo nextest style test"]
        GTEST["● tests/server/test_git.py<br/>━━━━━━━━━━<br/>StatefulMockTester calls<br/>updated to TestResult"]
        TGTEST["● tests/server/test_tools_git.py<br/>━━━━━━━━━━<br/>ls-files result added<br/>merge gate tests"]
    end

    RESULTS -->|"imported by"| PROTOCOLS
    RESULTS -->|"re-exported via"| INIT
    RESULTS -->|"TestResult"| TESTING
    PROTOCOLS -->|"TestRunner Protocol"| CTX
    CTX -->|"ToolContext.tester"| WORKSPACE
    CTX -->|"ToolContext.tester"| GIT
    TESTING -->|"wired as tester"| FACTORY
    FACTORY -->|"injects"| CTX
    RESULTS -->|"TestResult"| CONFTEST
    PROTOCOLS -->|"TestRunner"| CONTRACTS
    TESTING -->|"check_test_passed"| TTEST
    CONFTEST -->|"StatefulMockTester"| WTEST
    CONFTEST -->|"StatefulMockTester"| GTEST
    GIT -->|"perform_merge"| TGTEST

    class RESULTS newComponent;
    class PROTOCOLS,INIT handler;
    class TESTING handler;
    class CTX stateNode;
    class WORKSPACE,GIT phase;
    class FACTORY phase;
    class CONFTEST,TTEST,CONTRACTS,WTEST,GTEST,TGTEST output;

State Lifecycle Diagram

%%{init: {'flowchart': {'nodeSpacing': 50, 'rankSpacing': 60, 'curve': 'basis'}}}%%
flowchart TB
    %% CLASS DEFINITIONS %%
    classDef cli fill:#1a237e,stroke:#7986cb,stroke-width:2px,color:#fff;
    classDef stateNode fill:#004d40,stroke:#4db6ac,stroke-width:2px,color:#fff;
    classDef handler fill:#e65100,stroke:#ffb74d,stroke-width:2px,color:#fff;
    classDef phase fill:#6a1b9a,stroke:#ba68c8,stroke-width:2px,color:#fff;
    classDef newComponent fill:#2e7d32,stroke:#81c784,stroke-width:2px,color:#fff;
    classDef output fill:#00695c,stroke:#4db6ac,stroke-width:2px,color:#fff;
    classDef detector fill:#b71c1c,stroke:#ef5350,stroke-width:2px,color:#fff;
    classDef gap fill:#ff6f00,stroke:#ffa726,stroke-width:2px,color:#000;

    subgraph SubprocessContract ["SUBPROCESS RESULT CONTRACT (immutable source)"]
        direction LR
        RC["SubprocessResult.returncode<br/>━━━━━━━━━━<br/>int — exit code<br/>INIT_ONLY: read once"]
        STDOUT["SubprocessResult.stdout<br/>━━━━━━━━━━<br/>str — process stdout<br/>INIT_ONLY: forwarded as-is"]
        STDERR["SubprocessResult.stderr<br/>━━━━━━━━━━<br/>str — process stderr<br/>INIT_ONLY: was DROPPED; now forwarded"]
    end

    subgraph AdjudicationGate ["● ADJUDICATION GATE — check_test_passed()"]
        direction TB
        RCGATE{"returncode != 0?<br/>━━━━━━━━━━<br/>FAIL-FAST gate"}
        COMBINE["combined = stdout + stderr<br/>━━━━━━━━━━<br/>DERIVED: ephemeral union<br/>used for parsing only"]
        PARSEGATE{"parse_pytest_summary(combined)<br/>━━━━━━━━━━<br/>counts found?"}
        TRUSTGATE["● trust exit code path<br/>━━━━━━━━━━<br/>no counts → rc==0 means PASS<br/>replaces old CWA: return False"]
        COUNTGATE{"failed > 0 or error > 0?<br/>━━━━━━━━━━<br/>pytest count validation"}
    end

    subgraph TestResultContract ["● TESTRESULT CONTRACT — core/_type_results.py"]
        direction LR
        TR_PASSED["● TestResult.passed<br/>━━━━━━━━━━<br/>bool: INIT_ONLY<br/>set by adjudication gate"]
        TR_STDOUT["● TestResult.stdout<br/>━━━━━━━━━━<br/>str: INIT_ONLY<br/>raw subprocess stdout"]
        TR_STDERR["● TestResult.stderr<br/>━━━━━━━━━━<br/>str: INIT_ONLY (was ABSENT)<br/>raw subprocess stderr — now carried"]
    end

    subgraph ProtocolGate ["● PROTOCOL CONFORMANCE GATE — core/_type_protocols.py"]
        PROTO["● TestRunner Protocol<br/>━━━━━━━━━━<br/>async run(cwd) → TestResult<br/>(was tuple[bool, str])"]
        MYPY["mypy type checker<br/>━━━━━━━━━━<br/>enforces return type<br/>at static analysis time"]
    end

    subgraph ConsumerContracts ["● CONSUMER FIELD CONTRACTS"]
        direction LR
        TC_RESP["● test_check response<br/>━━━━━━━━━━<br/>{passed, stdout, stderr}<br/>stderr: was ABSENT in response"]
        MERGE_PRE["● perform_merge pre-gate<br/>━━━━━━━━━━<br/>{test_stdout, test_stderr}<br/>stderr: was ABSENT in error dict"]
        MERGE_POST["● perform_merge post-gate<br/>━━━━━━━━━━<br/>test_result.passed<br/>(was: passed, _ = ...)"]
    end

    PASS_STATE(["passed = True"])
    FAIL_STATE(["passed = False"])

    RC -->|"read once"| RCGATE
    STDOUT -->|"forwarded"| COMBINE
    STDERR -->|"NOW forwarded"| COMBINE
    RCGATE -->|"yes: fail-fast"| FAIL_STATE
    RCGATE -->|"no"| COMBINE
    COMBINE --> PARSEGATE
    PARSEGATE -->|"no match"| TRUSTGATE
    PARSEGATE -->|"pytest found"| COUNTGATE
    TRUSTGATE -->|"rc == 0"| PASS_STATE
    TRUSTGATE -->|"rc != 0"| FAIL_STATE
    COUNTGATE -->|"clean"| PASS_STATE
    COUNTGATE -->|"failures"| FAIL_STATE
    PASS_STATE -->|"passed=True"| TR_PASSED
    FAIL_STATE -->|"passed=False"| TR_PASSED
    STDOUT -->|"stdout="| TR_STDOUT
    STDERR -->|"stderr="| TR_STDERR
    TR_PASSED --> PROTO
    TR_STDOUT --> PROTO
    TR_STDERR --> PROTO
    PROTO --> MYPY
    TR_PASSED -->|"test_result.passed"| TC_RESP
    TR_STDOUT -->|"truncated"| TC_RESP
    TR_STDERR -->|"truncated (NEW)"| TC_RESP
    TR_PASSED -->|"pre-gate check"| MERGE_PRE
    TR_STDOUT -->|"test_stdout"| MERGE_PRE
    TR_STDERR -->|"test_stderr (NEW)"| MERGE_PRE
    TR_PASSED -->|"post-gate check"| MERGE_POST

    class RC,STDOUT stateNode;
    class STDERR gap;
    class RCGATE,PARSEGATE,COUNTGATE stateNode;
    class COMBINE phase;
    class TRUSTGATE newComponent;
    class TR_PASSED,TR_STDOUT newComponent;
    class TR_STDERR newComponent;
    class PROTO,MYPY detector;
    class TC_RESP,MERGE_PRE,MERGE_POST output;
    class PASS_STATE,FAIL_STATE cli;

Implementation Plan

Plan file: /home/talon/projects/autoskillit-runs/remediation-20260326-162603-381072/.autoskillit/temp/rectify/rectify_test_runner_protocol_lossy_tuple_2026-03-26_000000.md

🤖 Generated with Claude Code via AutoSkillit

Trecek and others added 2 commits March 26, 2026 17:16
- Add TestResult dataclass to core/_type_results.py (passed, stdout, stderr)
- Export TestResult from core/__init__.py and types.py
- Update TestRunner Protocol return type from tuple[bool, str] to TestResult
- Invert check_test_passed() CWA: no pytest summary → trust exit code (non-pytest runners)
- Add stderr parameter to check_test_passed(); combine stdout+stderr for pytest detection
- DefaultTestRunner.run() now returns TestResult with full stdout and stderr
- test_check MCP handler returns {passed, stdout, stderr} instead of {passed, output}
- perform_merge() uses TestResult attribute access at both test gate call sites
- Update StatefulMockTester to accept list[TestResult] and return TestResult
- Update all StatefulMockTester construction sites in test_git.py to use TestResult
- Flip two CWA test assertions (empty output / log-only now expect True)
- Add 9 new tests: non-pytest runner paths, TestResult fields, schema, cargo nextest style
- Add test_test_runner_return_type_is_test_result() contract test

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Collaborator Author

@Trecek Trecek left a comment


AutoSkillit PR Review — Verdict: changes_requested

6 blocking findings (1 critical, 5 warnings). See inline comments.

Comment thread: src/autoskillit/execution/testing.py

```python
if returncode != 0:
    return False
counts = parse_pytest_summary(stdout)
combined = stdout + ("\n" + stderr if stderr else "")
```
Collaborator Author


[warning] bugs: stderr concatenation may cause incorrect regex match across the join boundary — if stdout does not end with a newline, the concatenated string could form a spurious pytest summary line. Safer to call parse_pytest_summary on stdout and stderr independently and merge the resulting count dicts.

Collaborator Author


Investigated — this is intentional. The concatenation always inserts a literal "\n" between stdout and stderr when stderr is non-empty (`stdout + ("\n" + stderr if stderr else "")`). Since `parse_pytest_summary` processes lines via `splitlines()`, the inserted "\n" guarantees stdout and stderr always land on separate lines — a cross-boundary spurious match cannot form. Pass 1 requires a line that both starts and ends with `=`; Pass 2 requires an `in N.Ns` timing suffix. Neither can be synthesized by concatenating the tail of stdout with the head of stderr across a newline boundary.
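The boundary argument can be demonstrated directly. This sketch uses deliberately adversarial fragments (the literal strings are invented for illustration): the stdout tail could start a summary line and the stderr head could finish one, yet the inserted newline keeps them on separate lines for any line-oriented parser:

```python
# Adversarial fragments around the join (illustrative values).
stdout = "===== 5 passed"   # tail that *could* start a pytest summary line
stderr = "in 1.2s ====="    # head that *could* finish one

combined = stdout + ("\n" + stderr if stderr else "")
lines = combined.splitlines()

# The fragments stay on separate lines, so a line-oriented parser
# (like parse_pytest_summary) never sees a merged spurious summary.
assert lines == ["===== 5 passed", "in 1.2s ====="]
# No single line both starts and ends with "=" (Pass-1 requirement).
assert not any(l.startswith("=") and l.endswith("=") for l in lines)
```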

Comment thread: src/autoskillit/server/git.py
Comment thread: tests/execution/test_testing.py (Outdated)

```python
assert check_test_passed(1, "", "") is False


def test_check_test_passed_parses_pytest_summary_in_stderr() -> None:
```
Collaborator Author


[warning] tests: test_check_test_passed_parses_pytest_summary_in_stderr overlaps with test_check_test_passed_true_when_no_summary_stderr_only (L283) — both assert `check_test_passed(0, "", …) is True`. Consider merging into a parametrized test to avoid silent divergence.

Collaborator Author


Investigated — this is intentional. These two tests cover distinct code branches: test_check_test_passed_true_when_no_summary_stderr_only (L283) passes "test result: ok. 42 passed; 0 failed\n" (cargo/nextest-style, no pytest summary pattern) — it verifies the non-pytest fallback path where no summary is found and exit code is trusted. test_check_test_passed_parses_pytest_summary_in_stderr (L304) passes "= 5 passed in 1.2s =\n" — a canonical pytest summary line, verifying that pytest summary detection works when the summary appears in stderr. Merging them would collapse coverage of two separate code paths into one test.
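The distinction between the two inputs can be made concrete. This sketch uses a simplified stand-in regex for the `=`-delimited summary pass (an assumption — the real `parse_pytest_summary` is stricter), but the branch split is the same:

```python
import re

# Simplified stand-in for the "=-delimited" pytest summary pass
# (assumption: the real parser is a stricter two-pass matcher).
PYTEST_SUMMARY = re.compile(r"^=+ .* =+$", re.MULTILINE)

cargo_style = "test result: ok. 42 passed; 0 failed\n"  # no pytest summary
pytest_style = "= 5 passed in 1.2s =\n"                  # canonical summary

# cargo/nextest output takes the trust-exit-code branch...
assert PYTEST_SUMMARY.search(cargo_style) is None
# ...while the pytest-format line takes the count-validation branch.
assert PYTEST_SUMMARY.search(pytest_style) is not None
```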

Comment thread: tests/execution/test_testing.py (Outdated)

```python
assert "PASS [0.5s] 5 tests" in result["stderr"]

@pytest.mark.anyio
async def test_response_schema_includes_stderr(self, tool_ctx):
```
Collaborator Author


[info] tests: test_response_schema_includes_stderr and test_does_not_expose_summary (L154) both assert set(result.keys()) == {"passed", "stdout", "stderr"}. The schema assertion is now duplicated across two tests.

```python
)

@pytest.mark.anyio
async def test_cargo_nextest_style_passes(self, tool_ctx):
```
Collaborator Author


[info] tests: test_cargo_nextest_style_passes checks result["stderr"] with `in` rather than equality. If the implementation appends extra text, the assertion still passes. Use `==` for deterministic output validation.

@Trecek Trecek added this pull request to the merge queue Mar 27, 2026
Merged via the queue into integration with commit 604a860 Mar 27, 2026
2 checks passed
@Trecek Trecek deleted the test-check-returns-false-negative-for-non-pytest-test-runner/513 branch March 27, 2026 01:24