Implementation Plan: Ensure Research Experiments Include Test Infrastructure#797
Conversation
| """plan-experiment/SKILL.md Experiment Directory Layout must include a tests/ folder.""" | ||
| content = SKILL_PATH.read_text() | ||
| # The directory layout template must show a tests/ directory | ||
| assert "tests/" in content, ( |
There was a problem hiding this comment.
[info] tests: assert 'tests/' in content will pass trivially if the string 'tests/' appears anywhere in the file (prose, path reference, or prior section), not necessarily in the Experiment Directory Layout template. Consider scoping to the layout section by splitting on a known heading.
| def test_plan_experiment_environment_mentions_pytest() -> None: | ||
| """plan-experiment/SKILL.md must mention pytest in the environment specification.""" | ||
| content = SKILL_PATH.read_text() | ||
| assert "pytest" in content, ( |
There was a problem hiding this comment.
[info] tests: assert 'pytest' in content is document-wide and passes as long as 'pytest' appears anywhere in the file. The docstring says this validates the environment specification section, but the test does not enforce that. Scope to the environment section for the assertion to be meaningful.
| def test_implement_experiment_allows_pytest_collect_only() -> None: | ||
| """implement-experiment must allow running pytest --collect-only as a verification step.""" | ||
| text = SKILL_PATH.read_text() | ||
| assert "collect-only" in text or "collect_only" in text, ( |
There was a problem hiding this comment.
[info] tests: assert 'collect-only' in text or 'collect_only' in text checks two spellings, but the SKILL.md only uses the canonical CLI spelling 'collect-only'. The 'collect_only' branch is dead code that can never be triggered by the current SKILL.md content.
| - Verify metrics are collected correctly | ||
| - Confirm end-to-end pipeline works before committing to full runs | ||
|
|
||
| ### Phase 5: Test Infrastructure |
There was a problem hiding this comment.
[info] cohesion: plan-experiment labels test work 'Phase 5: Test Infrastructure' (a separate dedicated phase), but implement-experiment's detail section calls it 'Test creation (required alongside each script phase)' and ties it to Phases 2 and 3. The two skills use different mental models for when test infrastructure is produced (separate phase vs. inline with each script phase), creating a planning/implementation mismatch.
| - Put all experiment artifacts in one self-contained `research/` subfolder | ||
| - Commit per phase with descriptive messages | ||
| - Leave the worktree intact when done | ||
| - Write `tests/test_{script_name}.py` alongside each experiment script created in Step 4 |
There was a problem hiding this comment.
[info] slop: The ALWAYS bullet at L62-L63 duplicates instructions restated in full detail in the 'Test creation' block at L192-L205. Both deliver the same directive; the ALWAYS bullet is redundant given the expanded block below.
| def test_plan_experiment_includes_tests_directory_in_layout() -> None: | ||
| """plan-experiment/SKILL.md Experiment Directory Layout must include a tests/ folder.""" | ||
| content = SKILL_PATH.read_text() | ||
| # The directory layout template must show a tests/ directory |
There was a problem hiding this comment.
[info] slop: Inline comment '# The directory layout template must show a tests/ directory' restates exactly what the assertion and the function name already communicate. It adds no information.
| def test_implement_experiment_always_includes_test_creation() -> None: | ||
| """implement-experiment ALWAYS list must include a directive to write test files.""" | ||
| text = SKILL_PATH.read_text() | ||
| # The ALWAYS block must reference writing test files |
There was a problem hiding this comment.
[info] slop: Inline comment '# The ALWAYS block must reference writing test files' restates the obvious — the assertion immediately below it checks exactly that. The comment is redundant.
| ) | ||
|
|
||
|
|
||
| def test_plan_experiment_environment_mentions_pytest() -> None: |
There was a problem hiding this comment.
[info] cohesion: test_plan_experiment_environment_mentions_pytest checks that 'pytest' appears anywhere in the file. The parallel implement-experiment contract test_implement_experiment_allows_pytest_collect_only checks for 'collect-only' specifically. The two tests use different assertion granularity for analogous coverage, which is asymmetric across the paired skills.
| ) | ||
|
|
||
|
|
||
| def test_implement_experiment_step4_mentions_test_files() -> None: |
There was a problem hiding this comment.
[info] cohesion: test_implement_experiment_step4_mentions_test_files asserts 'test_' appears anywhere in the SKILL.md. This does not verify the string is in or near Step 4, contrary to the test name and docstring. The plan-experiment tests use more precise section-scoped assertions.
| - Merge the worktree branch into any branch | ||
| - Delete or remove the worktree | ||
| - Run the full test suite (the orchestrator handles testing) | ||
| - Run the full test suite — `pytest` with no args or targeting the entire repo |
There was a problem hiding this comment.
[info] slop: The NEVER list item at L48-L49 rewording adds a parenthetical that duplicates reasoning stated at L205 ('Do NOT run the full test suite — that is the orchestrator's responsibility via test_check'). The constraint is now stated in two places within close proximity.
| **Option A — No custom environment needed:** | ||
| {The project's existing toolchain is sufficient because {reason}. No | ||
| environment.yml will be created.} | ||
| Verify that `pytest` is available in the existing toolchain (`pytest --version`). If not, note that test_check requires pytest to pass. |
There was a problem hiding this comment.
[info] slop: The sentence 'Verify that pytest is available in the existing toolchain (pytest --version). If not, note that test_check requires pytest to pass.' is appended inline to the Option A prose block without a line break. The same requirement is already covered by Phase 5 (L264-L270) and the Option B dependency list. The inline sentence is a redundant clarification.
Trecek
left a comment
There was a problem hiding this comment.
AutoSkillit review found 4 blocking issues (verdict: changes_requested). See inline comments for details.
Actionable findings (requires fix):
tests/skills_extended/test_implement_experiment_contracts.pyL21: weak compound assertion intest_implement_experiment_always_includes_test_creationtests/skills_extended/test_implement_experiment_contracts.pyL30: vacuousassert 'test_' in textintest_implement_experiment_step4_mentions_test_filestests/skills_extended/test_implement_experiment_contracts.pyL1:_contractsnaming convention mismatched totests/skills_extended/placementsrc/autoskillit/skills_extended/implement-experiment/SKILL.mdL205: 'Do NOT run the full test suite' stated three times (NEVER/ALWAYS/inline)
Decision findings (human review requested):
tests/contracts/test_plan_experiment_contracts.pyL34: asymmetric placement — plan-experiment contracts intests/contracts/, implement-experiment contracts intests/skills_extended/src/autoskillit/skills_extended/plan-experiment/SKILL.mdL264: planning/implementation mismatch — plan-experiment models tests as Phase 5 (separate), implement-experiment models as inline with Phases 2-3
Add test_implement_experiment_contracts.py (3 tests) and 3 new functions to test_plan_experiment_contracts.py. These verify that both SKILL.md files include test infrastructure directives — tests directory layout, pytest env mention, test infrastructure phase, ALWAYS list directive, Step 4 test_ references, and pytest --collect-only verification. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add tests/ folder to the Experiment Directory Layout template (with conftest.py
and test_{script}.py stubs), add pytest to the Option B environment.yml
dependencies and an availability note to Option A, and add Phase 5: Test
Infrastructure to the Implementation Phases template.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…L.md Add two ALWAYS bullets directing the agent to write test_*.py files alongside each experiment script and verify discovery with pytest --collect-only. Add Step 4 test creation section with concrete instructions for tests/ layout, conftest.py, per-script test files, and collect-only verification. Narrow the NEVER bullet from "full test suite" to clarify pytest --collect-only and per-directory verification runs are permitted. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…tracts/ SKILL.md contract tests belong in tests/contracts/ per CLAUDE.md convention. Move from tests/skills_extended/ to tests/contracts/ to match placement of test_plan_experiment_contracts.py and align with the established pattern. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…contracts Replace broad token-membership checks with specific pattern assertions: - test_implement_experiment_always_includes_test_creation: assert 'tests/test_' in the ALWAYS section instead of loose 'test'+'write/creat' compound check - test_implement_experiment_step4_mentions_test_files: scope to Step 4 text and assert 'tests/test_' instead of document-wide 'test_' check Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ment-experiment SKILL.md The constraint appeared three times (NEVER list, Step 4 footer, Step 6 footer). Remove the two redundant inline repetitions; the NEVER list at the top is the single authoritative statement, eliminating divergence risk. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
1dbecec to
443b057
Compare
Summary
The research recipe's
test_checkgate runs aftergenerate_reportand expects pytest-discoverable tests to exist in the experiment worktree. Currently neitherplan-experimentnorimplement-experimentdirects the agent to create test files — sotest_checkreliably fails on research projects, routing throughfix_tests → retest → escalate_stop.The fix is entirely upstream of
test_check. Two SKILL.md files need additions:plan-experiment/SKILL.md— add atests/folder to the directory layout template, add pytest to the environment specification, and add a "Phase 5: Test Infrastructure" to the Implementation Phases template.implement-experiment/SKILL.md— add test-file creation to the ALWAYS list and to Step 4, and allowpytest --collect-onlyas a verification step (but not the full suite).No recipe YAML changes are needed — the
test_checkgate is already correctly placed.Architecture Impact
Scenarios Diagram
%%{init: {'flowchart': {'nodeSpacing': 40, 'rankSpacing': 55, 'curve': 'basis'}}}%% flowchart LR %% CLASS DEFINITIONS %% classDef cli fill:#1a237e,stroke:#7986cb,stroke-width:2px,color:#fff; classDef stateNode fill:#004d40,stroke:#4db6ac,stroke-width:2px,color:#fff; classDef handler fill:#e65100,stroke:#ffb74d,stroke-width:2px,color:#fff; classDef phase fill:#6a1b9a,stroke:#ba68c8,stroke-width:2px,color:#fff; classDef newComponent fill:#2e7d32,stroke:#81c784,stroke-width:2px,color:#fff; classDef output fill:#00695c,stroke:#4db6ac,stroke-width:2px,color:#fff; classDef detector fill:#b71c1c,stroke:#ef5350,stroke-width:2px,color:#fff; classDef terminal fill:#1a237e,stroke:#7986cb,stroke-width:2px,color:#fff; %% ─────────────────────────────────────────── %% %% SCENARIO 1: Plan Experiment %% %% ─────────────────────────────────────────── %% subgraph S1 ["SCENARIO 1: Plan Experiment → Test Infrastructure Plan"] direction LR S1_IN["Scope Report<br/>━━━━━━━━━━<br/>Input to skill"] S1_SKILL["● plan-experiment<br/>━━━━━━━━━━<br/>SKILL.md"] S1_ASSESS["Feasibility<br/>Subagents<br/>━━━━━━━━━━<br/>measure · data · env"] S1_LAYOUT["Dir Layout<br/>━━━━━━━━━━<br/>tests/<br/>conftest.py<br/>test_{script}.py"] S1_PHASE["Test Infrastructure<br/>Phase<br/>━━━━━━━━━━<br/>create · write · collect"] S1_ENV["environment.yml<br/>━━━━━━━━━━<br/>pytest required"] S1_OUT["experiment_plan<br/>━━━━━━━━━━<br/>temp/ token emitted"] end S1_IN -->|reads| S1_SKILL S1_SKILL -->|launches| S1_ASSESS S1_ASSESS -->|informs plan| S1_LAYOUT S1_LAYOUT -->|includes| S1_PHASE S1_PHASE -->|declares dep| S1_ENV S1_ENV -->|saves| S1_OUT %% ─────────────────────────────────────────── %% %% SCENARIO 2: Implement Experiment with Tests %% %% ─────────────────────────────────────────── %% subgraph S2 ["SCENARIO 2: Implement Experiment → Test File Creation"] direction LR S2_PLAN["experiment_plan<br/>━━━━━━━━━━<br/>from Scenario 1"] S2_SKILL["● implement-experiment<br/>━━━━━━━━━━<br/>SKILL.md"] S2_WT["Git Worktree<br/>━━━━━━━━━━<br/>isolated branch"] S2_SCRIPT["Scripts<br/>━━━━━━━━━━<br/>analysis.py etc."] S2_TESTS["test_{script}.py<br/>━━━━━━━━━━<br/>per script in tests/"] S2_COLLECT["pytest --collect-only<br/>━━━━━━━━━━<br/>verify discovery"] S2_COMMIT["Commit<br/>━━━━━━━━━━<br/>test suite + scripts"] end S2_PLAN -->|inputs to| S2_SKILL S2_SKILL -->|creates| S2_WT S2_WT -->|implements| S2_SCRIPT S2_SCRIPT -->|also creates| S2_TESTS S2_TESTS -->|validated by| S2_COLLECT S2_COLLECT -->|on success| S2_COMMIT %% ─────────────────────────────────────────── %% %% SCENARIO 3: Contract Guard — plan-experiment%% %% ─────────────────────────────────────────── %% subgraph S3 ["SCENARIO 3: Contract Guard — plan-experiment"] direction LR S3_PYTEST["pytest<br/>━━━━━━━━━━<br/>task test-check"] S3_CONTRACT["● test_plan_experiment<br/>_contracts.py<br/>━━━━━━━━━━<br/>6 contract tests"] S3_SKILL["plan-experiment<br/>━━━━━━━━━━<br/>SKILL.md (on disk)"] S3_CHECKS["Assertions<br/>━━━━━━━━━━<br/>data_manifest · tests/<br/>Test Infra phase · pytest"] S3_RESULT["PASS / FAIL<br/>━━━━━━━━━━<br/>contract enforced"] end S3_PYTEST -->|runs| S3_CONTRACT S3_CONTRACT -->|reads| S3_SKILL S3_SKILL -->|content checked by| S3_CHECKS S3_CHECKS -->|adjudicates| S3_RESULT %% ───────────────────────────────────────────────── %% %% SCENARIO 4: Contract Guard — implement-experiment %% %% ───────────────────────────────────────────────── %% subgraph S4 ["SCENARIO 4: Contract Guard — implement-experiment (NEW)"] direction LR S4_PYTEST["pytest<br/>━━━━━━━━━━<br/>task test-check"] S4_CONTRACT["★ test_implement_experiment<br/>_contracts.py<br/>━━━━━━━━━━<br/>3 contract tests"] S4_SKILL["implement-experiment<br/>━━━━━━━━━━<br/>SKILL.md (on disk)"] S4_CHECKS["Assertions<br/>━━━━━━━━━━<br/>ALWAYS write tests<br/>test_ naming<br/>--collect-only"] S4_RESULT["PASS / FAIL<br/>━━━━━━━━━━<br/>contract enforced"] end S4_PYTEST -->|runs| S4_CONTRACT S4_CONTRACT -->|reads| S4_SKILL S4_SKILL -->|content checked by| S4_CHECKS S4_CHECKS -->|adjudicates| S4_RESULT %% CLASS ASSIGNMENTS %% class S1_IN,S2_PLAN,S3_PYTEST,S4_PYTEST cli; class S1_SKILL,S2_SKILL handler; class S1_ASSESS,S1_PHASE,S2_COLLECT phase; class S1_LAYOUT,S1_ENV,S2_WT,S2_SCRIPT stateNode; class S1_OUT,S2_TESTS,S2_COMMIT output; class S3_CONTRACT,S4_CHECKS,S3_CHECKS detector; class S3_SKILL,S4_SKILL stateNode; class S3_RESULT,S4_RESULT output; class S4_CONTRACT newComponent;State Lifecycle Diagram
%%{init: {'flowchart': {'nodeSpacing': 50, 'rankSpacing': 60, 'curve': 'basis'}}}%% flowchart TB %% CLASS DEFINITIONS %% classDef cli fill:#1a237e,stroke:#7986cb,stroke-width:2px,color:#fff; classDef stateNode fill:#004d40,stroke:#4db6ac,stroke-width:2px,color:#fff; classDef handler fill:#e65100,stroke:#ffb74d,stroke-width:2px,color:#fff; classDef phase fill:#6a1b9a,stroke:#ba68c8,stroke-width:2px,color:#fff; classDef newComponent fill:#2e7d32,stroke:#81c784,stroke-width:2px,color:#fff; classDef output fill:#00695c,stroke:#4db6ac,stroke-width:2px,color:#fff; classDef detector fill:#b71c1c,stroke:#ef5350,stroke-width:2px,color:#fff; classDef gap fill:#ff6f00,stroke:#ffa726,stroke-width:2px,color:#000; classDef terminal fill:#1a237e,stroke:#7986cb,stroke-width:2px,color:#fff; START([SKILL INVOCATION]) subgraph PlanDirectives ["● PLAN-EXPERIMENT SKILL.MD — DIRECTIVE STATE (INIT_ONLY)"] direction TB TestsDir["● tests/ in Directory Layout<br/>━━━━━━━━━━<br/>INIT_ONLY: planned once<br/>enforce: test_plan_experiment_includes_tests_directory_in_layout"] TestPhase["● Test Infrastructure Phase<br/>━━━━━━━━━━<br/>INIT_ONLY: planned once<br/>enforce: test_plan_experiment_has_test_infrastructure_phase"] PytestEnv["● pytest in Environment Spec<br/>━━━━━━━━━━<br/>INIT_ONLY: environment.yml<br/>enforce: test_plan_experiment_environment_mentions_pytest"] DataManifest["● data_manifest Schema<br/>━━━━━━━━━━<br/>INIT_ONLY: source_type, acquisition<br/>verification, hypothesis<br/>enforce: test_data_manifest_required_fields"] end subgraph PlanGates ["PLAN VALIDATION GATES (V1–V9)"] direction TB V1V4["V1–V4 ERROR Gates<br/>━━━━━━━━━━<br/>baseline present, contrast set<br/>statistical_plan present, spec_path set<br/>FAIL → suppress frontmatter, emit errors"] V5V8["V5–V8 WARNING Gates<br/>━━━━━━━━━━<br/>primary metric uniqueness<br/>NEW canonical names, numeric thresholds<br/>CONTINUE → inline YAML comments"] V9["V9 ERROR Gate<br/>━━━━━━━━━━<br/>data_manifest completeness<br/>hypothesis coverage<br/>FAIL → suppress frontmatter"] end subgraph PlanOutput ["PLAN OUTPUT STATE"] direction TB FrontmatterBlock["Frontmatter Block<br/>━━━━━━━━━━<br/>MUTABLE → written iff V1-V4/V9 pass<br/>INIT_ONLY once written to file"] ExperimentPlanFile["experiment_plan token<br/>━━━━━━━━━━<br/>INIT_PRESERVE: last line of output<br/>pipeline captures for handoff"] end subgraph ImplDirectives ["● IMPLEMENT-EXPERIMENT SKILL.MD — DIRECTIVE STATE"] direction TB AlwaysWrite["● ALWAYS: write test_*.py<br/>━━━━━━━━━━<br/>INIT_ONLY directive<br/>enforce: test_implement_experiment_always_includes_test_creation"] Step4Tests["● Step 4: test_*.py per script<br/>━━━━━━━━━━<br/>MUTABLE: one file per script created<br/>enforce: test_implement_experiment_step4_mentions_test_files"] CollectOnly["● pytest --collect-only gate<br/>━━━━━━━━━━<br/>MUTABLE: runs after each test file<br/>enforce: test_implement_experiment_allows_pytest_collect_only"] end subgraph CheckpointState ["IMPLEMENT RESUME CHECKPOINT STATE"] direction TB EarlyTokens["worktree_path + branch_name tokens<br/>━━━━━━━━━━<br/>INIT_PRESERVE: emitted Step 1 cont.<br/>captured by pipeline before context limit"] BaseBranchFile["base-branch checkpoint file<br/>━━━━━━━━━━<br/>INIT_PRESERVE: written immediately<br/>survives context exhaustion"] WorktreeOnDisk["Worktree on disk<br/>━━━━━━━━━━<br/>INIT_PRESERVE: never deleted by skill<br/>orchestrator discovers via tokens"] end subgraph TestsState ["APPEND_ONLY TEST ARTIFACT STATE"] direction TB TestsDirectory["tests/ directory<br/>━━━━━━━━━━<br/>APPEND_ONLY: grows per script<br/>NEVER removed or overwritten"] GitCommits["Git commit history<br/>━━━━━━━━━━<br/>APPEND_ONLY: per-phase commits<br/>NEVER force-pushed by skill"] end subgraph ImplGates ["IMPLEMENT VALIDATION GATES"] direction TB DiscoveryGate["pytest --collect-only<br/>━━━━━━━━━━<br/>MUST pass before each commit<br/>FAIL → fix imports, re-run"] PrecommitGate["pre-commit run --all-files<br/>━━━━━━━━━━<br/>Step 6 final gate<br/>FAIL → fix, re-stage, retry"] end subgraph ContractEnforcement ["CONTRACT ENFORCEMENT LAYER"] direction TB PlanContracts["● test_plan_experiment_contracts.py<br/>━━━━━━━━━━<br/>6 tests: data_manifest schema<br/>tests/ layout, test phase, pytest env"] ImplContracts["★ test_implement_experiment_contracts.py<br/>━━━━━━━━━━<br/>3 tests: ALWAYS write tests<br/>test_ filenames, collect-only gate"] end END([HANDOFF REPORT]) %% FLOW %% START --> PlanDirectives PlanDirectives --> PlanGates V1V4 --> FrontmatterBlock V5V8 --> FrontmatterBlock V9 --> FrontmatterBlock FrontmatterBlock --> ExperimentPlanFile ExperimentPlanFile --> ImplDirectives AlwaysWrite --> EarlyTokens Step4Tests --> TestsDirectory CollectOnly --> DiscoveryGate EarlyTokens --> BaseBranchFile BaseBranchFile --> WorktreeOnDisk DiscoveryGate --> TestsDirectory TestsDirectory --> GitCommits GitCommits --> PrecommitGate PrecommitGate --> END %% CONTRACT LAYER VALIDATES DIRECTIVE STATE %% PlanContracts -.->|validates directives present| PlanDirectives ImplContracts -.->|validates directives present| ImplDirectives %% CLASS ASSIGNMENTS %% class START,END terminal; class TestsDir,TestPhase,PytestEnv,DataManifest gap; class V1V4,V9 detector; class V5V8 handler; class FrontmatterBlock,ExperimentPlanFile stateNode; class AlwaysWrite,Step4Tests,CollectOnly gap; class EarlyTokens,BaseBranchFile,WorktreeOnDisk phase; class TestsDirectory,GitCommits handler; class DiscoveryGate,PrecommitGate detector; class PlanContracts output; class ImplContracts newComponent;Closes #786
Implementation Plan
Plan file:
/home/talon/projects/autoskillit-runs/impl-786-20260413-002705-648780/.autoskillit/temp/make-plan/ensure_research_experiments_test_infra_plan_2026-04-13_000000.md🤖 Generated with Claude Code via AutoSkillit
Token Usage Summary