Implementation Plan: Ensure Research Experiments Include Test Infrastructure by Trecek · Pull Request #797 · TalonT-Org/AutoSkillit

Trecek · 2026-04-13T08:18:23Z

Summary

The research recipe's test_check gate runs after generate_report and expects pytest-discoverable tests to exist in the experiment worktree. Currently neither plan-experiment nor implement-experiment directs the agent to create test files — so test_check reliably fails on research projects, routing through fix_tests → retest → escalate_stop.

The fix is entirely upstream of test_check. Two SKILL.md files need additions:

plan-experiment/SKILL.md — add a tests/ folder to the directory layout template, add pytest to the environment specification, and add a "Phase 5: Test Infrastructure" to the Implementation Phases template.
implement-experiment/SKILL.md — add test-file creation to the ALWAYS list and to Step 4, and allow pytest --collect-only as a verification step (but not the full suite).

No recipe YAML changes are needed — the test_check gate is already correctly placed.

Architecture Impact

Scenarios Diagram

%%{init: {'flowchart': {'nodeSpacing': 40, 'rankSpacing': 55, 'curve': 'basis'}}}%%
flowchart LR
    %% CLASS DEFINITIONS %%
    classDef cli fill:#1a237e,stroke:#7986cb,stroke-width:2px,color:#fff;
    classDef stateNode fill:#004d40,stroke:#4db6ac,stroke-width:2px,color:#fff;
    classDef handler fill:#e65100,stroke:#ffb74d,stroke-width:2px,color:#fff;
    classDef phase fill:#6a1b9a,stroke:#ba68c8,stroke-width:2px,color:#fff;
    classDef newComponent fill:#2e7d32,stroke:#81c784,stroke-width:2px,color:#fff;
    classDef output fill:#00695c,stroke:#4db6ac,stroke-width:2px,color:#fff;
    classDef detector fill:#b71c1c,stroke:#ef5350,stroke-width:2px,color:#fff;
    classDef terminal fill:#1a237e,stroke:#7986cb,stroke-width:2px,color:#fff;

    %% ─────────────────────────────────────────── %%
    %% SCENARIO 1: Plan Experiment                 %%
    %% ─────────────────────────────────────────── %%
    subgraph S1 ["SCENARIO 1: Plan Experiment → Test Infrastructure Plan"]
        direction LR
        S1_IN["Scope Report<br/>━━━━━━━━━━<br/>Input to skill"]
        S1_SKILL["● plan-experiment<br/>━━━━━━━━━━<br/>SKILL.md"]
        S1_ASSESS["Feasibility<br/>Subagents<br/>━━━━━━━━━━<br/>measure · data · env"]
        S1_LAYOUT["Dir Layout<br/>━━━━━━━━━━<br/>tests/<br/>conftest.py<br/>test_{script}.py"]
        S1_PHASE["Test Infrastructure<br/>Phase<br/>━━━━━━━━━━<br/>create · write · collect"]
        S1_ENV["environment.yml<br/>━━━━━━━━━━<br/>pytest required"]
        S1_OUT["experiment_plan<br/>━━━━━━━━━━<br/>temp/ token emitted"]
    end

    S1_IN -->|reads| S1_SKILL
    S1_SKILL -->|launches| S1_ASSESS
    S1_ASSESS -->|informs plan| S1_LAYOUT
    S1_LAYOUT -->|includes| S1_PHASE
    S1_PHASE -->|declares dep| S1_ENV
    S1_ENV -->|saves| S1_OUT

    %% ─────────────────────────────────────────── %%
    %% SCENARIO 2: Implement Experiment with Tests %%
    %% ─────────────────────────────────────────── %%
    subgraph S2 ["SCENARIO 2: Implement Experiment → Test File Creation"]
        direction LR
        S2_PLAN["experiment_plan<br/>━━━━━━━━━━<br/>from Scenario 1"]
        S2_SKILL["● implement-experiment<br/>━━━━━━━━━━<br/>SKILL.md"]
        S2_WT["Git Worktree<br/>━━━━━━━━━━<br/>isolated branch"]
        S2_SCRIPT["Scripts<br/>━━━━━━━━━━<br/>analysis.py etc."]
        S2_TESTS["test_{script}.py<br/>━━━━━━━━━━<br/>per script in tests/"]
        S2_COLLECT["pytest --collect-only<br/>━━━━━━━━━━<br/>verify discovery"]
        S2_COMMIT["Commit<br/>━━━━━━━━━━<br/>test suite + scripts"]
    end

    S2_PLAN -->|inputs to| S2_SKILL
    S2_SKILL -->|creates| S2_WT
    S2_WT -->|implements| S2_SCRIPT
    S2_SCRIPT -->|also creates| S2_TESTS
    S2_TESTS -->|validated by| S2_COLLECT
    S2_COLLECT -->|on success| S2_COMMIT

    %% ─────────────────────────────────────────── %%
    %% SCENARIO 3: Contract Guard — plan-experiment%%
    %% ─────────────────────────────────────────── %%
    subgraph S3 ["SCENARIO 3: Contract Guard — plan-experiment"]
        direction LR
        S3_PYTEST["pytest<br/>━━━━━━━━━━<br/>task test-check"]
        S3_CONTRACT["● test_plan_experiment<br/>_contracts.py<br/>━━━━━━━━━━<br/>6 contract tests"]
        S3_SKILL["plan-experiment<br/>━━━━━━━━━━<br/>SKILL.md (on disk)"]
        S3_CHECKS["Assertions<br/>━━━━━━━━━━<br/>data_manifest · tests/<br/>Test Infra phase · pytest"]
        S3_RESULT["PASS / FAIL<br/>━━━━━━━━━━<br/>contract enforced"]
    end

    S3_PYTEST -->|runs| S3_CONTRACT
    S3_CONTRACT -->|reads| S3_SKILL
    S3_SKILL -->|content checked by| S3_CHECKS
    S3_CHECKS -->|adjudicates| S3_RESULT

    %% ───────────────────────────────────────────────── %%
    %% SCENARIO 4: Contract Guard — implement-experiment %%
    %% ───────────────────────────────────────────────── %%
    subgraph S4 ["SCENARIO 4: Contract Guard — implement-experiment (NEW)"]
        direction LR
        S4_PYTEST["pytest<br/>━━━━━━━━━━<br/>task test-check"]
        S4_CONTRACT["★ test_implement_experiment<br/>_contracts.py<br/>━━━━━━━━━━<br/>3 contract tests"]
        S4_SKILL["implement-experiment<br/>━━━━━━━━━━<br/>SKILL.md (on disk)"]
        S4_CHECKS["Assertions<br/>━━━━━━━━━━<br/>ALWAYS write tests<br/>test_ naming<br/>--collect-only"]
        S4_RESULT["PASS / FAIL<br/>━━━━━━━━━━<br/>contract enforced"]
    end

    S4_PYTEST -->|runs| S4_CONTRACT
    S4_CONTRACT -->|reads| S4_SKILL
    S4_SKILL -->|content checked by| S4_CHECKS
    S4_CHECKS -->|adjudicates| S4_RESULT

    %% CLASS ASSIGNMENTS %%
    class S1_IN,S2_PLAN,S3_PYTEST,S4_PYTEST cli;
    class S1_SKILL,S2_SKILL handler;
    class S1_ASSESS,S1_PHASE,S2_COLLECT phase;
    class S1_LAYOUT,S1_ENV,S2_WT,S2_SCRIPT stateNode;
    class S1_OUT,S2_TESTS,S2_COMMIT output;
    class S3_CONTRACT,S4_CHECKS,S3_CHECKS detector;
    class S3_SKILL,S4_SKILL stateNode;
    class S3_RESULT,S4_RESULT output;
    class S4_CONTRACT newComponent;

State Lifecycle Diagram

%%{init: {'flowchart': {'nodeSpacing': 50, 'rankSpacing': 60, 'curve': 'basis'}}}%%
flowchart TB
    %% CLASS DEFINITIONS %%
    classDef cli fill:#1a237e,stroke:#7986cb,stroke-width:2px,color:#fff;
    classDef stateNode fill:#004d40,stroke:#4db6ac,stroke-width:2px,color:#fff;
    classDef handler fill:#e65100,stroke:#ffb74d,stroke-width:2px,color:#fff;
    classDef phase fill:#6a1b9a,stroke:#ba68c8,stroke-width:2px,color:#fff;
    classDef newComponent fill:#2e7d32,stroke:#81c784,stroke-width:2px,color:#fff;
    classDef output fill:#00695c,stroke:#4db6ac,stroke-width:2px,color:#fff;
    classDef detector fill:#b71c1c,stroke:#ef5350,stroke-width:2px,color:#fff;
    classDef gap fill:#ff6f00,stroke:#ffa726,stroke-width:2px,color:#000;
    classDef terminal fill:#1a237e,stroke:#7986cb,stroke-width:2px,color:#fff;

    START([SKILL INVOCATION])

    subgraph PlanDirectives ["● PLAN-EXPERIMENT SKILL.MD — DIRECTIVE STATE (INIT_ONLY)"]
        direction TB
        TestsDir["● tests/ in Directory Layout<br/>━━━━━━━━━━<br/>INIT_ONLY: planned once<br/>enforce: test_plan_experiment_includes_tests_directory_in_layout"]
        TestPhase["● Test Infrastructure Phase<br/>━━━━━━━━━━<br/>INIT_ONLY: planned once<br/>enforce: test_plan_experiment_has_test_infrastructure_phase"]
        PytestEnv["● pytest in Environment Spec<br/>━━━━━━━━━━<br/>INIT_ONLY: environment.yml<br/>enforce: test_plan_experiment_environment_mentions_pytest"]
        DataManifest["● data_manifest Schema<br/>━━━━━━━━━━<br/>INIT_ONLY: source_type, acquisition<br/>verification, hypothesis<br/>enforce: test_data_manifest_required_fields"]
    end

    subgraph PlanGates ["PLAN VALIDATION GATES (V1–V9)"]
        direction TB
        V1V4["V1–V4 ERROR Gates<br/>━━━━━━━━━━<br/>baseline present, contrast set<br/>statistical_plan present, spec_path set<br/>FAIL → suppress frontmatter, emit errors"]
        V5V8["V5–V8 WARNING Gates<br/>━━━━━━━━━━<br/>primary metric uniqueness<br/>NEW canonical names, numeric thresholds<br/>CONTINUE → inline YAML comments"]
        V9["V9 ERROR Gate<br/>━━━━━━━━━━<br/>data_manifest completeness<br/>hypothesis coverage<br/>FAIL → suppress frontmatter"]
    end

    subgraph PlanOutput ["PLAN OUTPUT STATE"]
        direction TB
        FrontmatterBlock["Frontmatter Block<br/>━━━━━━━━━━<br/>MUTABLE → written iff V1-V4/V9 pass<br/>INIT_ONLY once written to file"]
        ExperimentPlanFile["experiment_plan token<br/>━━━━━━━━━━<br/>INIT_PRESERVE: last line of output<br/>pipeline captures for handoff"]
    end

    subgraph ImplDirectives ["● IMPLEMENT-EXPERIMENT SKILL.MD — DIRECTIVE STATE"]
        direction TB
        AlwaysWrite["● ALWAYS: write test_*.py<br/>━━━━━━━━━━<br/>INIT_ONLY directive<br/>enforce: test_implement_experiment_always_includes_test_creation"]
        Step4Tests["● Step 4: test_*.py per script<br/>━━━━━━━━━━<br/>MUTABLE: one file per script created<br/>enforce: test_implement_experiment_step4_mentions_test_files"]
        CollectOnly["● pytest --collect-only gate<br/>━━━━━━━━━━<br/>MUTABLE: runs after each test file<br/>enforce: test_implement_experiment_allows_pytest_collect_only"]
    end

    subgraph CheckpointState ["IMPLEMENT RESUME CHECKPOINT STATE"]
        direction TB
        EarlyTokens["worktree_path + branch_name tokens<br/>━━━━━━━━━━<br/>INIT_PRESERVE: emitted Step 1 cont.<br/>captured by pipeline before context limit"]
        BaseBranchFile["base-branch checkpoint file<br/>━━━━━━━━━━<br/>INIT_PRESERVE: written immediately<br/>survives context exhaustion"]
        WorktreeOnDisk["Worktree on disk<br/>━━━━━━━━━━<br/>INIT_PRESERVE: never deleted by skill<br/>orchestrator discovers via tokens"]
    end

    subgraph TestsState ["APPEND_ONLY TEST ARTIFACT STATE"]
        direction TB
        TestsDirectory["tests/ directory<br/>━━━━━━━━━━<br/>APPEND_ONLY: grows per script<br/>NEVER removed or overwritten"]
        GitCommits["Git commit history<br/>━━━━━━━━━━<br/>APPEND_ONLY: per-phase commits<br/>NEVER force-pushed by skill"]
    end

    subgraph ImplGates ["IMPLEMENT VALIDATION GATES"]
        direction TB
        DiscoveryGate["pytest --collect-only<br/>━━━━━━━━━━<br/>MUST pass before each commit<br/>FAIL → fix imports, re-run"]
        PrecommitGate["pre-commit run --all-files<br/>━━━━━━━━━━<br/>Step 6 final gate<br/>FAIL → fix, re-stage, retry"]
    end

    subgraph ContractEnforcement ["CONTRACT ENFORCEMENT LAYER"]
        direction TB
        PlanContracts["● test_plan_experiment_contracts.py<br/>━━━━━━━━━━<br/>6 tests: data_manifest schema<br/>tests/ layout, test phase, pytest env"]
        ImplContracts["★ test_implement_experiment_contracts.py<br/>━━━━━━━━━━<br/>3 tests: ALWAYS write tests<br/>test_ filenames, collect-only gate"]
    end

    END([HANDOFF REPORT])

    %% FLOW %%
    START --> PlanDirectives
    PlanDirectives --> PlanGates
    V1V4 --> FrontmatterBlock
    V5V8 --> FrontmatterBlock
    V9 --> FrontmatterBlock
    FrontmatterBlock --> ExperimentPlanFile
    ExperimentPlanFile --> ImplDirectives

    AlwaysWrite --> EarlyTokens
    Step4Tests --> TestsDirectory
    CollectOnly --> DiscoveryGate
    EarlyTokens --> BaseBranchFile
    BaseBranchFile --> WorktreeOnDisk
    DiscoveryGate --> TestsDirectory
    TestsDirectory --> GitCommits
    GitCommits --> PrecommitGate
    PrecommitGate --> END

    %% CONTRACT LAYER VALIDATES DIRECTIVE STATE %%
    PlanContracts -.->|validates directives present| PlanDirectives
    ImplContracts -.->|validates directives present| ImplDirectives

    %% CLASS ASSIGNMENTS %%
    class START,END terminal;
    class TestsDir,TestPhase,PytestEnv,DataManifest gap;
    class V1V4,V9 detector;
    class V5V8 handler;
    class FrontmatterBlock,ExperimentPlanFile stateNode;
    class AlwaysWrite,Step4Tests,CollectOnly gap;
    class EarlyTokens,BaseBranchFile,WorktreeOnDisk phase;
    class TestsDirectory,GitCommits handler;
    class DiscoveryGate,PrecommitGate detector;
    class PlanContracts output;
    class ImplContracts newComponent;

Closes #786

Implementation Plan

Plan file: /home/talon/projects/autoskillit-runs/impl-786-20260413-002705-648780/.autoskillit/temp/make-plan/ensure_research_experiments_test_infra_plan_2026-04-13_000000.md

🤖 Generated with Claude Code via AutoSkillit

Token Usage Summary

Step	uncached	output	cache_read	cache_write	count	time
plan	199	11.4k	1.1M	76.9k	1	3m 44s
verify	60	7.8k	213.5k	51.0k	1	2m 53s
implement	298	15.0k	1.7M	49.7k	1	4m 51s
prepare_pr	68	5.0k	200.5k	22.0k	1	2m 20s
run_arch_lenses	2.0k	13.0k	319.9k	48.4k	2	5m 22s
compose_pr	59	7.2k	188.1k	27.4k	1	1m 53s
Total	2.7k	59.4k	3.7M	275.5k		21m 5s

Trecek · 2026-04-13T08:29:33Z

+    """plan-experiment/SKILL.md Experiment Directory Layout must include a tests/ folder."""
+    content = SKILL_PATH.read_text()
+    # The directory layout template must show a tests/ directory
+    assert "tests/" in content, (


[info] tests: assert 'tests/' in content will pass trivially if the string 'tests/' appears anywhere in the file (prose, path reference, or prior section), not necessarily in the Experiment Directory Layout template. Consider scoping to the layout section by splitting on a known heading.

Trecek · 2026-04-13T08:29:34Z

+def test_plan_experiment_environment_mentions_pytest() -> None:
+    """plan-experiment/SKILL.md must mention pytest in the environment specification."""
+    content = SKILL_PATH.read_text()
+    assert "pytest" in content, (


[info] tests: assert 'pytest' in content is document-wide and passes as long as 'pytest' appears anywhere in the file. The docstring says this validates the environment specification section, but the test does not enforce that. Scope to the environment section for the assertion to be meaningful.

Trecek · 2026-04-13T08:29:35Z

+def test_implement_experiment_allows_pytest_collect_only() -> None:
+    """implement-experiment must allow running pytest --collect-only as a verification step."""
+    text = SKILL_PATH.read_text()
+    assert "collect-only" in text or "collect_only" in text, (


[info] tests: assert 'collect-only' in text or 'collect_only' in text checks two spellings, but the SKILL.md only uses the canonical CLI spelling 'collect-only'. The 'collect_only' branch is dead code that can never be triggered by the current SKILL.md content.

Trecek · 2026-04-13T08:29:38Z

 - Verify metrics are collected correctly
 - Confirm end-to-end pipeline works before committing to full runs

+### Phase 5: Test Infrastructure


[info] cohesion: plan-experiment labels test work 'Phase 5: Test Infrastructure' (a separate dedicated phase), but implement-experiment's detail section calls it 'Test creation (required alongside each script phase)' and ties it to Phases 2 and 3. The two skills use different mental models for when test infrastructure is produced (separate phase vs. inline with each script phase), creating a planning/implementation mismatch.

Trecek · 2026-04-13T08:29:39Z

 - Put all experiment artifacts in one self-contained `research/` subfolder
 - Commit per phase with descriptive messages
 - Leave the worktree intact when done
+- Write `tests/test_{script_name}.py` alongside each experiment script created in Step 4


[info] slop: The ALWAYS bullet at L62-L63 duplicates instructions restated in full detail in the 'Test creation' block at L192-L205. Both deliver the same directive; the ALWAYS bullet is redundant given the expanded block below.

Trecek · 2026-04-13T08:29:40Z

+def test_plan_experiment_includes_tests_directory_in_layout() -> None:
+    """plan-experiment/SKILL.md Experiment Directory Layout must include a tests/ folder."""
+    content = SKILL_PATH.read_text()
+    # The directory layout template must show a tests/ directory


[info] slop: Inline comment '# The directory layout template must show a tests/ directory' restates exactly what the assertion and the function name already communicate. It adds no information.

Trecek · 2026-04-13T08:29:41Z

+def test_implement_experiment_always_includes_test_creation() -> None:
+    """implement-experiment ALWAYS list must include a directive to write test files."""
+    text = SKILL_PATH.read_text()
+    # The ALWAYS block must reference writing test files


[info] slop: Inline comment '# The ALWAYS block must reference writing test files' restates the obvious — the assertion immediately below it checks exactly that. The comment is redundant.

Trecek · 2026-04-13T08:29:42Z

+    )
+
+
+def test_plan_experiment_environment_mentions_pytest() -> None:


[info] cohesion: test_plan_experiment_environment_mentions_pytest checks that 'pytest' appears anywhere in the file. The parallel implement-experiment contract test_implement_experiment_allows_pytest_collect_only checks for 'collect-only' specifically. The two tests use different assertion granularity for analogous coverage, which is asymmetric across the paired skills.

Trecek · 2026-04-13T08:29:43Z

+    )
+
+
+def test_implement_experiment_step4_mentions_test_files() -> None:


[info] cohesion: test_implement_experiment_step4_mentions_test_files asserts 'test_' appears anywhere in the SKILL.md. This does not verify the string is in or near Step 4, contrary to the test name and docstring. The plan-experiment tests use more precise section-scoped assertions.

Trecek · 2026-04-13T08:29:44Z

 - Merge the worktree branch into any branch
 - Delete or remove the worktree
- Run the full test suite (the orchestrator handles testing)
+- Run the full test suite — `pytest` with no args or targeting the entire repo


[info] slop: The NEVER list item at L48-L49 rewording adds a parenthetical that duplicates reasoning stated at L205 ('Do NOT run the full test suite — that is the orchestrator's responsibility via test_check'). The constraint is now stated in two places within close proximity.

Trecek · 2026-04-13T08:29:45Z

 **Option A — No custom environment needed:**
 {The project's existing toolchain is sufficient because {reason}. No
 environment.yml will be created.}
+Verify that `pytest` is available in the existing toolchain (`pytest --version`). If not, note that test_check requires pytest to pass.


[info] slop: The sentence 'Verify that pytest is available in the existing toolchain (pytest --version). If not, note that test_check requires pytest to pass.' is appended inline to the Option A prose block without a line break. The same requirement is already covered by Phase 5 (L264-L270) and the Option B dependency list. The inline sentence is a redundant clarification.

Trecek

AutoSkillit review found 4 blocking issues (verdict: changes_requested). See inline comments for details.

Actionable findings (requires fix):

tests/skills_extended/test_implement_experiment_contracts.py L21: weak compound assertion in test_implement_experiment_always_includes_test_creation
tests/skills_extended/test_implement_experiment_contracts.py L30: vacuous assert 'test_' in text in test_implement_experiment_step4_mentions_test_files
tests/skills_extended/test_implement_experiment_contracts.py L1: _contracts naming convention mismatched to tests/skills_extended/ placement
src/autoskillit/skills_extended/implement-experiment/SKILL.md L205: 'Do NOT run the full test suite' stated three times (NEVER/ALWAYS/inline)

Decision findings (human review requested):

tests/contracts/test_plan_experiment_contracts.py L34: asymmetric placement — plan-experiment contracts in tests/contracts/, implement-experiment contracts in tests/skills_extended/
src/autoskillit/skills_extended/plan-experiment/SKILL.md L264: planning/implementation mismatch — plan-experiment models tests as Phase 5 (separate), implement-experiment models as inline with Phases 2-3

Add test_implement_experiment_contracts.py (3 tests) and 3 new functions to test_plan_experiment_contracts.py. These verify that both SKILL.md files include test infrastructure directives — tests directory layout, pytest env mention, test infrastructure phase, ALWAYS list directive, Step 4 test_ references, and pytest --collect-only verification. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Add tests/ folder to the Experiment Directory Layout template (with conftest.py and test_{script}.py stubs), add pytest to the Option B environment.yml dependencies and an availability note to Option A, and add Phase 5: Test Infrastructure to the Implementation Phases template. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…L.md Add two ALWAYS bullets directing the agent to write test_*.py files alongside each experiment script and verify discovery with pytest --collect-only. Add Step 4 test creation section with concrete instructions for tests/ layout, conftest.py, per-script test files, and collect-only verification. Narrow the NEVER bullet from "full test suite" to clarify pytest --collect-only and per-directory verification runs are permitted. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…tracts/ SKILL.md contract tests belong in tests/contracts/ per CLAUDE.md convention. Move from tests/skills_extended/ to tests/contracts/ to match placement of test_plan_experiment_contracts.py and align with the established pattern. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…contracts Replace broad token-membership checks with specific pattern assertions: - test_implement_experiment_always_includes_test_creation: assert 'tests/test_' in the ALWAYS section instead of loose 'test'+'write/creat' compound check - test_implement_experiment_step4_mentions_test_files: scope to Step 4 text and assert 'tests/test_' instead of document-wide 'test_' check Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…ment-experiment SKILL.md The constraint appeared three times (NEVER list, Step 4 footer, Step 6 footer). Remove the two redundant inline repetitions; the NEVER list at the top is the single authoritative statement, eliminating divergence risk. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…own register