Fix Misleading Test Names and Stale Logic (groupF) #560

Merged
Trecek merged 3 commits into integration from test-audit/552/groupF
Mar 28, 2026

@Trecek Trecek commented Mar 28, 2026

Summary

Eight targeted fixes across seven test files:

- Four test/class renames that resolve name–assertion mismatches (2.19, 2.20, 2.21, 2.23).
- A rewrite of SmokeExecutor._run_with_retry and its routing logic to use the current flat retries: / on_exhausted: schema (2.24/2.30), with matching updates to the two tests that exercised the old schema.
- Deletion of a single stale set-equality assertion (2.25).
- Stricter assertions in two contract tests that previously relied on overly permissive string matching (2.27, 2.28).

No production code changes; only test code.

Architecture Impact

Process Flow Diagram

%%{init: {'flowchart': {'nodeSpacing': 40, 'rankSpacing': 50, 'curve': 'basis'}}}%%
flowchart TB
    %% CLASS DEFINITIONS %%
    classDef terminal fill:#1a237e,stroke:#7986cb,stroke-width:2px,color:#fff;
    classDef stateNode fill:#004d40,stroke:#4db6ac,stroke-width:2px,color:#fff;
    classDef handler fill:#e65100,stroke:#ffb74d,stroke-width:2px,color:#fff;
    classDef phase fill:#6a1b9a,stroke:#ba68c8,stroke-width:2px,color:#fff;
    classDef detector fill:#b71c1c,stroke:#ef5350,stroke-width:2px,color:#fff;
    classDef output fill:#00695c,stroke:#4db6ac,stroke-width:2px,color:#fff;

    START([SmokeExecutor.run])

    subgraph Loop ["Step Execution Loop (max_steps=30)"]
        direction TB
        ACTION{"action?<br/>━━━━━━━━━━<br/>stop / route / python / default"}
        STOP["stop<br/>━━━━━━━━━━<br/>return terminal_step + message"]
        ROUTE_ACTION["route (action)<br/>━━━━━━━━━━<br/>jump to on_success"]
        PYTHON["python:<br/>━━━━━━━━━━<br/>_execute_python → run_python MCP"]
    end

    subgraph Dispatch ["● _execute (test_smoke_pipeline.py)"]
        direction TB
        HAS_RETRIES{"retries in step_def?<br/>━━━━━━━━━━<br/>flat schema check"}
        DIRECT_CALL["direct call<br/>━━━━━━━━━━<br/>_TOOL_MAP[tool](**args)"]
        DELEGATE["delegate<br/>━━━━━━━━━━<br/>→ _run_with_retry"]
    end

    subgraph Retry ["● _run_with_retry (test_smoke_pipeline.py)"]
        direction TB
        FIRST_CALL["initial attempt<br/>━━━━━━━━━━<br/>tool_fn(**args)"]
        SUCCESS_CHECK{"● _is_success?<br/>━━━━━━━━━━<br/>tool-specific check"}
        RETRY_LOOP["retry loop<br/>━━━━━━━━━━<br/>for _ in range(max_retries)"]
        RETRY_SUCCESS{"_is_success?<br/>━━━━━━━━━━<br/>after re-attempt"}
        EXHAUSTED["● _retries_exhausted<br/>━━━━━━━━━━<br/>result[_retries_exhausted]=True"]
    end

    subgraph Routing ["● _route (test_smoke_pipeline.py)"]
        direction TB
        ON_RESULT{"on_result block?<br/>━━━━━━━━━━<br/>field-value dispatch"}
        EXHAUSTION{"● on_exhausted<br/>━━━━━━━━━━<br/>_retries_exhausted flagged?"}
        SF_CHECK{"_is_success?<br/>━━━━━━━━━━<br/>success / failure"}
    end

    CAPTURE["_capture<br/>━━━━━━━━━━<br/>extract key=value into context"]
    NEXT_STEP["advance to next_step<br/>━━━━━━━━━━<br/>loop iteration"]

    DONE([DONE / terminal_step])

    START --> ACTION
    ACTION -->|"stop"| STOP
    ACTION -->|"route"| ROUTE_ACTION
    ACTION -->|"python:"| PYTHON
    ACTION -->|"default"| HAS_RETRIES
    STOP --> DONE
    ROUTE_ACTION --> NEXT_STEP
    PYTHON --> CAPTURE

    HAS_RETRIES -->|"no"| DIRECT_CALL
    HAS_RETRIES -->|"yes"| DELEGATE
    DIRECT_CALL --> CAPTURE
    DELEGATE --> FIRST_CALL

    FIRST_CALL --> SUCCESS_CHECK
    SUCCESS_CHECK -->|"pass"| CAPTURE
    SUCCESS_CHECK -->|"fail"| RETRY_LOOP
    RETRY_LOOP --> RETRY_SUCCESS
    RETRY_SUCCESS -->|"pass"| CAPTURE
    RETRY_SUCCESS -->|"fail → loop"| RETRY_LOOP
    RETRY_LOOP -->|"exhausted"| EXHAUSTED
    EXHAUSTED --> CAPTURE

    CAPTURE --> ON_RESULT
    ON_RESULT -->|"match → target step"| NEXT_STEP
    ON_RESULT -->|"no match"| EXHAUSTION
    EXHAUSTION -->|"yes → on_exhausted"| NEXT_STEP
    EXHAUSTION -->|"no"| SF_CHECK
    SF_CHECK -->|"pass → on_success"| NEXT_STEP
    SF_CHECK -->|"fail → on_failure"| NEXT_STEP
    NEXT_STEP --> ACTION

    class START,DONE terminal;
    class ACTION,HAS_RETRIES,SUCCESS_CHECK,RETRY_SUCCESS,ON_RESULT,EXHAUSTION,SF_CHECK stateNode;
    class STOP,ROUTE_ACTION,PYTHON,DIRECT_CALL,DELEGATE,FIRST_CALL,RETRY_LOOP,NEXT_STEP handler;
    class EXHAUSTED,CAPTURE phase;

Color Legend:

| Color | Category | Description |
|---|---|---|
| Dark Blue | Terminal | Start and completion states |
| Teal | Decision | Routing and branching decisions |
| Orange | Handler | Tool invocation and step execution |
| Purple | Phase | State mutation (capture, exhaustion sentinel) |
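
The retry subgraph above can be read as the following minimal sketch. It is not the code from test_smoke_pipeline.py: it is synchronous for brevity, `run_with_retry`, `is_success`, and `flaky_tool` are hypothetical stand-ins, and the success check is reduced to the `success` field. Only the flat `retries` key and the `_retries_exhausted` sentinel are taken from this PR.

```python
import json
from typing import Any, Callable


def is_success(result: dict[str, Any]) -> bool:
    # Stand-in for the tool-specific _is_success check (assumption).
    return bool(result.get("success", True))


def run_with_retry(
    tool_fn: Callable[..., str],
    args: dict[str, Any],
    step_def: dict[str, Any],
) -> dict[str, Any]:
    """Make one initial attempt, then up to step_def['retries'] re-attempts."""
    max_retries = int(step_def.get("retries", 0))
    result = json.loads(tool_fn(**args))  # initial attempt
    if is_success(result):
        return result
    for _ in range(max_retries):
        result = json.loads(tool_fn(**args))  # re-attempt
        if is_success(result):
            return result
    # Retry budget spent without a passing result: flag it for the router.
    result["_retries_exhausted"] = True
    return result


calls: list[dict[str, Any]] = []


def flaky_tool(**kwargs: Any) -> str:
    # Hypothetical tool: fails on the first call, succeeds afterwards.
    calls.append(kwargs)
    return json.dumps({"success": len(calls) >= 2})


outcome = run_with_retry(
    flaky_tool, {"name": "demo"}, {"retries": 3, "on_exhausted": "escalate"}
)
```

With this stand-in tool the second call succeeds, so the loop exits early and the sentinel is never set.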

Scenarios Diagram

%%{init: {'flowchart': {'nodeSpacing': 40, 'rankSpacing': 50, 'curve': 'basis'}}}%%
flowchart LR
    %% CLASS DEFINITIONS %%
    classDef cli fill:#1a237e,stroke:#7986cb,stroke-width:2px,color:#fff;
    classDef stateNode fill:#004d40,stroke:#4db6ac,stroke-width:2px,color:#fff;
    classDef handler fill:#e65100,stroke:#ffb74d,stroke-width:2px,color:#fff;
    classDef phase fill:#6a1b9a,stroke:#ba68c8,stroke-width:2px,color:#fff;
    classDef output fill:#00695c,stroke:#4db6ac,stroke-width:2px,color:#fff;
    classDef detector fill:#b71c1c,stroke:#ef5350,stroke-width:2px,color:#fff;

    subgraph S1 ["SCENARIO 1: Test Name Accuracy (REQ 2.19–2.23)"]
        direction LR
        S1A["● test_core.py<br/>━━━━━━━━━━<br/>test_atomic_write_<br/>docstring_contains_<br/>atomic_keyword"]
        S1B["● test_config.py<br/>━━━━━━━━━━<br/>class TestRunSkill<br/>ConfigFields"]
        S1C["● test_skills.py<br/>━━━━━━━━━━<br/>test_58_skills_in_<br/>skills_extended"]
        S1D["● test_headless_add_dirs.py<br/>━━━━━━━━━━<br/>test_raw_skills_extended_<br/>excluded_from_run_skill_<br/>add_dirs"]
        S1E["pytest assertion<br/>━━━━━━━━━━<br/>name == assertion content"]
    end

    subgraph S2 ["SCENARIO 2: SmokeExecutor Retry — Flat Schema (REQ 2.24, 2.25, 2.30)"]
        direction LR
        S2A["step_def<br/>━━━━━━━━━━<br/>retries: 3<br/>on_exhausted: escalate"]
        S2B["● _execute<br/>━━━━━━━━━━<br/>checks 'retries'<br/>in step_def"]
        S2C["● _run_with_retry<br/>━━━━━━━━━━<br/>initial attempt<br/>+ retry loop"]
        S2D["● _is_success<br/>━━━━━━━━━━<br/>tool-specific<br/>pass check"]
        S2E["● _retries_exhausted<br/>━━━━━━━━━━<br/>sentinel flag<br/>in result"]
        S2F["● _route<br/>━━━━━━━━━━<br/>on_exhausted path<br/>or on_success/failure"]
    end

    subgraph S3 ["SCENARIO 3: Contract Assertion Strength (REQ 2.27, 2.28)"]
        direction LR
        S3A["● test_pr_traceability<br/>_contracts.py<br/>━━━━━━━━━━<br/>reads SKILL.md text"]
        S3B["● heading check<br/>━━━━━━━━━━<br/>'## requirements'<br/>in normalized"]
        S3C["● test_triage<br/>_contracts.py<br/>━━━━━━━━━━<br/>reads SKILL.md text"]
        S3D["● scoped bypass<br/>━━━━━━━━━━<br/>'already has' AND<br/>'## requirements' in lower"]
        S3E["assert has_req<br/>━━━━━━━━━━<br/>heading-structure<br/>validated"]
    end

    %% SCENARIO 1 FLOW %%
    S1A --> S1E
    S1B --> S1E
    S1C --> S1E
    S1D --> S1E

    %% SCENARIO 2 FLOW %%
    S2A --> S2B
    S2B -->|"yes"| S2C
    S2C --> S2D
    S2D -->|"fail"| S2C
    S2D -->|"exhausted"| S2E
    S2D -->|"pass"| S2F
    S2E --> S2F

    %% SCENARIO 3 FLOW %%
    S3A --> S3B
    S3B --> S3E
    S3C --> S3D
    S3D --> S3E

    %% CLASS ASSIGNMENTS %%
    class S1A,S1B,S1C,S1D handler;
    class S1E output;
    class S2A cli;
    class S2B,S2C phase;
    class S2D stateNode;
    class S2E detector;
    class S2F output;
    class S3A,S3C handler;
    class S3B,S3D phase;
    class S3E output;

Color Legend:

| Color | Category | Description |
|---|---|---|
| Dark Blue | Entry | Step definition inputs |
| Orange | Handler | Modified test files |
| Purple | Process | Modified execution/assertion logic |
| Teal | Decision | Success/state checks |
| Red | Sentinel | Exhaustion detection |
| Dark Teal | Output | Assertion outcomes |
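
For Scenario 3, the scoped bypass in test_triage_contracts.py can be pictured roughly as below. This is a sketch under assumptions: `bypass_is_scoped` and both sample strings are hypothetical, and the real test may combine the two conditions differently. The only detail taken from the diagram is that 'already has' must now appear together with '## requirements' in the lowercased text.

```python
def bypass_is_scoped(skill_text: str) -> bool:
    """Return True only when the 'already has' bypass mentions the heading too."""
    lower = skill_text.lower()
    return "already has" in lower and "## requirements" in lower


# Hypothetical SKILL.md fragments, not taken from the repository.
unscoped = "Skip this step if the issue already has a triage label."
scoped = "Skip extraction when the issue already has a ## Requirements section."

unscoped_ok = bypass_is_scoped(unscoped)
scoped_ok = bypass_is_scoped(scoped)
```

The unscoped fragment no longer satisfies the check, which is the tightening REQ 2.28 asks for.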

Implementation Plan

Plan file: /home/talon/projects/autoskillit-runs/impl-groupF-20260328-084624-134986/.autoskillit/temp/make-plan/fix_misleading_tests_groupF_plan_2026-03-28_084900.md

🤖 Generated with Claude Code via AutoSkillit

Token Usage Summary

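| Step | input | output | cached | count | time |
|---|---|---|---|---|---|
| group | 2.7k | 25.0k | 721.4k | 1 | 9m 3s |
| plan | 462 | 84.7k | 14.8M | 9 | 41m 2s |
| verify | 98 | 55.7k | 3.3M | 6 | 18m 12s |
| implement | 536 | 45.2k | 10.3M | 9 | 25m 25s |
| fix | 60 | 10.0k | 1.6M | 2 | 11m 53s |
| Total | 3.8k | 220.7k | 30.7M | | 1h 45m |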

@Trecek Trecek left a comment

AutoSkillit PR Review — Verdict: changes_requested (cannot REQUEST_CHANGES on own PR, posting as COMMENT)

  result = json.loads(raw_result)
- if not result.get(retry_field, False):
-     return result # succeeded on first try
+ if self._is_success(step_def, result):

[warning] tests: _is_success called in _run_with_retry but its implementation is not shown in the diff. If _is_success has a trivial/always-true impl, retry logic will not be exercised meaningfully. Verify _is_success actually checks failure conditions.

Investigated — this is intentional. _is_success is fully defined at lines 178-185: returns result.get('passed', False) for test_check, 'error' not in result for merge_worktree/classify_fix, and result.get('success', True) as fallback. The test_executor_retry_logic test (line 324) mocks run_skill to return success:False on the first call and success:True on the second, confirming retry logic is exercised. The implementation was present but not in the diff hunk shown to the reviewer (false_positive_intentional_pattern).
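
For readers without the file open, the three branches described in this reply look roughly like the sketch below. It is written from the prose description only: the standalone function name and the assumption that the tool name lives under a `tool` key in step_def are mine, not confirmed by the diff.

```python
from typing import Any


def is_success(step_def: dict[str, Any], result: dict[str, Any]) -> bool:
    """Tool-specific success check, mirroring the branches described above."""
    tool = step_def.get("tool")  # assumed key name
    if tool == "test_check":
        # Missing 'passed' counts as failure.
        return bool(result.get("passed", False))
    if tool in ("merge_worktree", "classify_fix"):
        # Any 'error' key means the step failed.
        return "error" not in result
    # Fallback: a missing 'success' field counts as success.
    return bool(result.get("success", True))


first_call = is_success({"tool": "run_skill"}, {"success": False})
second_call = is_success({"tool": "run_skill"}, {"success": True})
```

The two sample calls correspond to the mocked run_skill responses mentioned above: the first reports failure, so a retry is triggered, and the second reports success.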

  text = _read("pipeline-summary")
- has_req = "## Requirements" in text or "requirements" in text.lower()
+ normalized = text.lower()
+ has_req = "## requirements" in normalized or "# requirements" in normalized

[warning] tests: Requirement check tightened to require a markdown heading (## requirements or # requirements). This could produce false negatives if the skill documents requirements without a heading. Confirm the actual skill content matches this stricter expectation.

Investigated — this is intentional. pipeline-summary/SKILL.md uses ## Requirements as an explicit heading in three distinct locations: line 28 (argument description referencing extraction), line 135 (Step 5b: 'Extract the ## Requirements section'), and line 167 (PR body template includes '## Requirements'). The normalized heading check ('## requirements' in normalized) correctly matches all occurrences. Commit f1b5644 intentionally replaced the bare 'requirements' in text.lower() fallback with this stricter heading check, and the skill content was already conformant. No false negative risk exists — skill documents requirements exclusively via the ## heading format (false_positive_intentional_pattern).
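
To make the difference between the two checks concrete, here is a small side-by-side sketch. The two predicates copy the expressions from the diff hunk above; the function names and the sample strings are hypothetical and are not the contents of SKILL.md.

```python
def has_req_old(text: str) -> bool:
    # Pre-change check: any mention of the word "requirements" passes.
    return "## Requirements" in text or "requirements" in text.lower()


def has_req_new(text: str) -> bool:
    # Post-change check: the lowercased text must contain a heading marker.
    normalized = text.lower()
    return "## requirements" in normalized or "# requirements" in normalized


prose_only = "This skill forwards the requirements it finds to the PR body."
with_heading = "Intro\n\n## Requirements\n- REQ-001\n"

old_prose, new_prose = has_req_old(prose_only), has_req_new(prose_only)
old_heading, new_heading = has_req_old(with_heading), has_req_new(with_heading)
```

Two properties of the new expression are worth knowing when reading the test: "# requirements" is a substring of "## requirements", so the first disjunct is redundant, and because this is a substring match rather than a line-anchored one, deeper headings such as "### Requirements" also satisfy it.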

@@ -376,12 +348,13 @@ async def mock_run_skill(**kwargs: object) -> str:
assert result["success"] is True

async def test_executor_max_attempts_zero_routes_to_on_exhausted(self) -> None:

[warning] tests: test_executor_max_attempts_zero_routes_to_on_exhausted: the diff shows test setup but not the assertions. If the test does not assert the route returned is 'retry_wt' (on_exhausted), the test title is misleading and the behavior is untested.

Investigated — this is intentional. Lines 388-391 contain explicit assertions: assert len(call_log) == 1 (verifying exactly one call with retries=0) and assert terminal_step == 'retry_wt' (verifying the on_exhausted route 'retry_wt' is returned). The reviewer claimed assertions were missing, which is factually incorrect — they are present in the actual file at lines 388-391 (stale_comment — assertions were present but outside the diff hunk shown to the reviewer).
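
The shape of that test can be illustrated with a self-contained sketch. Everything here is a simplified stand-in: `run_step` collapses execution and routing into one hypothetical coroutine, the mock takes no arguments, and the `done` / `fail` targets are invented. The `retries: 0` setting, the `retry_wt` target, and the two assertions are the parts taken from this thread.

```python
import asyncio
import json
from typing import Any, Awaitable, Callable


async def run_step(tool_fn: Callable[[], Awaitable[str]], step_def: dict[str, Any]) -> str:
    """Run one step with flat retries, then return the name of the next step."""
    result = json.loads(await tool_fn())  # initial attempt
    exhausted = False
    if not result.get("success", True):
        for _ in range(step_def.get("retries", 0)):
            result = json.loads(await tool_fn())
            if result.get("success", True):
                break
        else:
            # Loop finished without a success; with retries=0 it never ran at all.
            exhausted = True
    if exhausted and "on_exhausted" in step_def:
        return step_def["on_exhausted"]
    return step_def["on_success"] if result.get("success", True) else step_def["on_failure"]


call_log: list[str] = []


async def mock_run_skill() -> str:
    call_log.append("call")
    return json.dumps({"success": False})


step_def = {"retries": 0, "on_exhausted": "retry_wt", "on_success": "done", "on_failure": "fail"}
terminal_step = asyncio.run(run_step(mock_run_skill, step_def))
```

With `retries: 0` the loop body never executes, so the only call is the initial attempt and the step goes straight to the on_exhausted target.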

  return result

- return result # exhausted — always defined
+ result["_retries_exhausted"] = True

[warning] cohesion: _retries_exhausted sentinel uses a private-style underscore prefix but is consumed in _route() outside _run_with_retry. Rename to 'retries_exhausted' to match the flat public schema convention used by 'retries' and 'on_exhausted'.

Valid observation — flagged for design decision. _retries_exhausted (line 126) is set in _run_with_retry and consumed only in _route (line 171). The underscore prefix distinguishes it as an internal inter-method sentinel injected into the result dict, not a user-facing recipe schema field. The reviewer's point about naming consistency with flat schema fields ('retries', 'on_exhausted') is valid but debatable: those are recipe YAML keys, while _retries_exhausted is a runtime dict annotation. Whether to expose it without underscore or keep it private-style is a deliberate design choice that warrants human review before changing.
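
The distinction between recipe keys and the runtime annotation may be easier to weigh with the consumer side in view. The sketch below is an assumption-laden reduction of _route: the on_result dispatch that runs first is omitted because its shape is not shown in this PR, the success check is reduced to the `success` field, and the `next` / `fix` targets are invented.

```python
from typing import Any


def route(step_def: dict[str, Any], result: dict[str, Any]) -> str:
    """Pick the next step: exhaustion first, then plain success or failure."""
    # step_def holds recipe keys; result holds tool output plus the sentinel.
    if result.get("_retries_exhausted") and step_def.get("on_exhausted"):
        return step_def["on_exhausted"]
    if result.get("success", True):
        return step_def["on_success"]
    return step_def["on_failure"]


step_def = {"retries": 3, "on_exhausted": "escalate", "on_success": "next", "on_failure": "fix"}

after_exhaustion = route(step_def, {"success": False, "_retries_exhausted": True})
after_plain_failure = route(step_def, {"success": False})
after_success = route(step_def, {"success": True})
```

In this reading the underscore keeps the sentinel visibly apart from whatever keys a tool might return in its own result, which is the argument for leaving the name as it is; dropping the underscore would instead align it with the recipe keys it is routed against.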

@Trecek Trecek left a comment

AutoSkillit review found 3 blocking issues. See inline comments. Verdict: changes_requested

@Trecek Trecek enabled auto-merge March 28, 2026 17:24
Trecek added 3 commits March 28, 2026 10:34
- Rename test_core_io_module_has_docstring → test_atomic_write_docstring_contains_atomic_keyword
- Rename class TestRunSkillRetryConfigFields → TestRunSkillConfigFields
- Rename test_57_skills_in_skills_extended → test_58_skills_in_skills_extended; fix docstring count
- Rename test_run_skill_passes_ephemeral_session_dir_as_add_dir → test_raw_skills_extended_excluded_from_run_skill_add_dirs
…quality assertion (REQ 2.24, 2.25, 2.30)

- _execute: check "retries" in step_def instead of "retry"
- _run_with_retry: rewrite using flat retries:/on_exhausted with _retries_exhausted sentinel
- _route: use step_def.get("on_exhausted") + result["_retries_exhausted"] instead of nested retry block
- test_executor_retry_logic: update step_def to flat retries: 3 / on_exhausted format
- test_executor_max_attempts_zero_routes_to_on_exhausted: update to retries: 0; fix docstring
- Remove expected_steps set definition and set-equality assertion; retain all tool-type assertions
- test_pr_traceability_contracts: replace bare "requirements" in text.lower() fallback
  with heading-structure check ("## requirements" or "# requirements" in normalized)
- test_triage_contracts: scope "already has" bypass to require "## requirements" context
@Trecek Trecek force-pushed the test-audit/552/groupF branch from f1b5644 to 0976de5 March 28, 2026 17:36
@Trecek Trecek added this pull request to the merge queue Mar 28, 2026
Merged via the queue into integration with commit 7a2fd4f Mar 28, 2026
2 checks passed
@Trecek Trecek deleted the test-audit/552/groupF branch March 28, 2026 17:41