Spec 004: Phase 2 (Project Bootstrap) end-to-end testing #109
jeremymanning merged 20 commits into main
#46 #62)
Spec 003 / D10 introduced the 'validated' stage AFTER the spawner was written, so the allowlist was out-of-date. Phase 2 testing requires spawning siblings at validated to route them to project_initializer (per STAGE_TO_AGENT[VALIDATED] in src/llmxive/pipeline/graph.py:70).
One-line set extension; no other change.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…r (FR-011 Q3 P2-D03, #46 #62)
Two in-PR HIGH-defect fixes from spec 004's Phase 2 diagnostic plan:
1. Skip-if-exists guard before constitution write (FR-011 / Q3 / spec 004 research.md Decision 2). Re-rendering a governance document silently mutates downstream Constitution Checks; the new guard matches the init_speckit_in skip-if-dir-exists pattern at src/llmxive/speckit/runner.py:114.
2. Fail-fast on missing idea file (P2-D03 / Constitution Principle V). The previous defensive `if idea_path.exists()` masked missing inputs and produced constitutions untethered from any idea body. Now raises FileNotFoundError immediately (caught by US4 induced-failure scenario 2 verification).
Plus: 4-test pytest harness at tests/phase1/test_idempotency.py proving SC-009 (full .specify/ tree byte-equality after second project_initializer invocation). All 4 tests pass in 0.08s.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
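The two guards described in this commit can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in src/llmxive/agents/project_initializer.py: the function name `write_constitution_once`, its signature, and the `render` callback are hypothetical, and the order of the two checks is a guess. Only the target path `.specify/memory/constitution.md` and the `FileNotFoundError` behavior come from the commit messages.

```python
from pathlib import Path
from typing import Callable


def write_constitution_once(
    project_dir: Path,
    idea_path: Path,
    render: Callable[[str], str],
) -> bool:
    """Return True if a constitution was written, False if one already existed.

    Hypothetical helper: `render` stands in for the LLM-backed rendering step.
    """
    # Fail fast: a missing idea file is an error, not something to skip quietly.
    if not idea_path.is_file():
        raise FileNotFoundError(f"idea file not found: {idea_path}")

    target = project_dir / ".specify" / "memory" / "constitution.md"
    # Skip-if-exists: never re-render an existing governance document.
    if target.exists():
        return False

    target.parent.mkdir(parents=True, exist_ok=True)
    idea_body = idea_path.read_text(encoding="utf-8")
    target.write_text(render(idea_body), encoding="utf-8")
    return True
```

Calling the helper twice on the same project leaves the first constitution untouched, which is the property the idempotency tests are described as checking.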
…001, #46 #62)
Spawned via tests/phase1/sibling_project.py at --start-stage validated (now allowed per FR-003a / commit e5e423c). Both siblings have:
- sha256-verified byte-identical clone of canonical's idea file
- fresh state YAML at current_stage: validated
- no .specify/ scaffold yet (project_initializer produces it next)
Substrate for US1 happy-path runs (T017/T018).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
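A sha256-verified clone of the kind mentioned above might look like the sketch below. The helper names `sha256_of` and `clone_verified` are hypothetical and are not taken from tests/phase1/sibling_project.py; the sketch only shows the general copy-then-compare-digests technique using the standard library.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the hex sha256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def clone_verified(src: Path, dst: Path) -> str:
    """Copy src to dst and confirm the copy is byte-identical.

    Hypothetical helper; returns the shared digest on success.
    """
    dst.parent.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(src, dst)
    src_digest, dst_digest = sha256_of(src), sha256_of(dst)
    if src_digest != dst_digest:
        raise RuntimeError(f"clone mismatch: {src} != {dst}")
    return dst_digest
```

Applying the same per-file digest to every file under a directory gives a simple way to compare two snapshots of a tree for byte-equality.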
…s (US1, #46 #62)
Both PROJ-261-iter2 and PROJ-262-iter2 advanced from validated → project_initialized in <90s wall-clock against the real Dartmouth Chat backend (qwen.qwen3.5-122b).
PROJ-261-iter2 run:
- run_id: e9a3dfce-8435-455f-bf7a-8e4206ffb754
- duration: 63s (01:35:25 → 01:36:28)
- constitution: .specify/memory/constitution.md (LLM-rendered)
- scaffold: 4 scripts + 5 templates (mechanical via init_speckit_in)
PROJ-262-iter2 run:
- run_id: 4a04a919-0a1c-46f9-a9a3-fab5a96200ce
- duration: 72s (01:36:33 → 01:37:45)
- same artifact set
Both run-log entries: outcome=success, no failure_reason. State YAMLs both at current_stage=project_initialized.
Substrate for US2 (constitution audit) and US3 (idempotency check).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ns + HTML comments (P2-D04 P2-D05, US2 §4, #46 #62)
US2 audit on iter2 happy-path siblings surfaced two defects:
- P2-D05 (CRITICAL per spec.md SC-011): PROJ-262-iter2's constitution introduced an external citation (`DOI: 10.6084/m9.figshare.9981994` for the QM9 dataset) into Reproducibility Requirements. The prompt said "DO NOT introduce external citations" but didn't define what qualifies, and the LLM treated the DOI as a data-source identifier.
- P2-D04 (MEDIUM): PROJ-261-iter2's constitution preserved the HTML comment block from the constitution template explaining substitution tokens. The comments are scaffolding for the LLM, not content for the rendered document.
Prompt v1.0.0 → v1.1.0 (MINOR: adds new behavior constraints, doesn't break the output contract):
- Enumerate forbidden citation forms explicitly: DOIs, arXiv IDs, URLs, Figshare/Zenodo/OSF/HF record IDs.
- Allow naming datasets by name without their canonical pointers.
- Forbid HTML comment blocks in output (strip template scaffolding).
Phase 7 next: spawn iter3 siblings of both PROJ-261 and PROJ-262, re-run project_initializer with the patched prompt, re-audit.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
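The audit for these two defects was apparently done by inspection, but a mechanical pre-check for the forbidden forms could be sketched as below. Everything here is an assumption for illustration: the function `audit_constitution`, the pattern table, and the regexes themselves are not from the repository, and they cover only DOIs, arXiv IDs, URLs, and HTML comments (not Figshare/Zenodo/OSF/HF record IDs, which have no single syntax).

```python
import re

# Hypothetical patterns; deliberately simple and not exhaustive.
FORBIDDEN_PATTERNS = {
    "doi": re.compile(r"\b10\.\d{4,9}/\S+"),
    "arxiv": re.compile(r"\barXiv:\s*\d{4}\.\d{4,5}", re.IGNORECASE),
    "url": re.compile(r"https?://\S+"),
    "html_comment": re.compile(r"<!--.*?-->", re.DOTALL),
}


def audit_constitution(text: str) -> dict[str, list[str]]:
    """Return every forbidden match found in the text, keyed by pattern name.

    An empty dict means none of the checked forms were found.
    """
    findings: dict[str, list[str]] = {}
    for name, pattern in FORBIDDEN_PATTERNS.items():
        hits = pattern.findall(text)
        if hits:
            findings[name] = hits
    return findings
```

Under these patterns a dataset referenced by name only (for example `QM9` or `codeparrot/github-code`) produces no findings, while the DOI quoted in P2-D05 is flagged.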
…titutions pass full US2 audit (Phase 7, P2-D04 P2-D05 fixed, #46 #62)
After commit 8f2fe48 tightened agents/prompts/project_initializer.md to forbid external citations + HTML comments (prompt v1.0.0 → v1.1.0), spawned iter3 siblings of both PROJ-261 and PROJ-262, re-ran project_initializer, re-audited:
PROJ-261-iter3 (computer science): 6/6 contract items PASS
- No HTML comment block (P2-D04 fixed)
- Domain-specific principles VI (Model & Compute Integrity) + VII (Code Licensing & Compliance), both well-grounded
- Reproducibility Requirements names codeparrot/github-code (allowed per v1.1.0: dataset name, not citation)
PROJ-262-iter3 (chemistry): 6/6 contract items PASS
- No DOI / arXiv / URL anywhere in body (P2-D05 fixed)
- Domain-specific principles VI (Physical Consistency) + VII (Benchmark Integrity), both grounded in chemistry domain
- Reproducibility Requirements references QM9 by name only
Both run-log entries: outcome=success, prompt_version=1.1.0. Both state YAMLs at current_stage=project_initialized.
Phase 7 iteration loop converged in 1 iteration (well under FR-005 5-cycle cap). iter3 siblings are the carry-forward candidates for spec 005.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…#62)
All three deliberate failure scenarios from US4 / Q2 clarification exercised on dedicated sibling iters; each produced a loud + recorded failure with state unchanged.
Scenario 1 (backend unreachable), PROJ-261-iter4:
- Set DARTMOUTH_CHAT_API_KEY=invalid for one orchestrator run
- Result: every backend in chain failed (dartmouth/HF/local)
- failure_reason quotes all three backend errors
- State current_stage=validated (unchanged); no .specify/
Scenario 2 (idea file missing), PROJ-262-iter4:
- Spawned then deleted idea/<slug>.md before orchestrator
- The new fail-fast guard (T008 / commit e8e09f7) raised FileNotFoundError immediately; no LLM call made
- failure_reason: "FileNotFoundError: project_initializer requires at least one input (idea file path); got ctx.inputs=[]"
Scenario 3 (template file missing), PROJ-261-iter5:
- Renamed agents/templates/research_project_constitution.md to .bak for one run; restored after
- render_prompt() raised FileNotFoundError before LLM invocation
- State unchanged; template restored to canonical path; git clean
All three siblings marked archived_at: 2026-05-06T01:46:00Z (FR-019).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… FR-013 FR-017, #46 #62)
§ 1-8 of the diagnostic report at notes/2026-05-05-phase2-diagnostic.md covers: inputs, agent behavior on 7 runs (4 happy-path iter2/iter3 + 3 induced-failure iter4/iter5), constitution audits (with verbatim fail/pass per the 6-item contract), full sha256-tree idempotency verification (US3 / SC-009 / pytest 4/4), defects table (5 P2-D## all fixed in-PR), iteration diff for the v1.0.0→v1.1.0 prompt patch, and carry-forward decision.
Carry-forward manifest names two iter3 siblings as the substrate for spec 005:
- PROJ-261-evaluating-the-impact-of-code-duplicatio-iter3 (CS)
- PROJ-262-predicting-molecular-dipole-moments-with-iter3 (chemistry)
Both at current_stage=project_initialized; both pass the full US2 audit cleanly under prompt v1.1.0. Schema follows spec 003's manifest with one new field per data-model.md E7 (phase2_iter2_id) recording which iter2 sibling produced the audited constitution.
Per-issue verdict (§ 6):
- Issue #62 (project_initializer): all 3 acceptance boxes PASS
- Issue #46 (Phase 2 parent): all 5 acceptance boxes PASS
No CRITICAL/HIGH defects remain unresolved. No follow-up issues opened; all 5 P2-D## defects fixed in this PR (commits e5e423c, e8e09f7, 8f2fe48).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…8a, #46 #62)
iter3 siblings (with prompt v1.1.0) carried forward instead of iter2. Mark iter2 siblings archived per FR-019: never deleted, just flagged for clarity. They remain readable for spec 005 if it wants to inspect the iter2-vs-iter3 diff.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ruff --fix on the two files spec 004 touched:
- src/llmxive/agents/project_initializer.py: I001 import sort,
UP017 datetime.timezone.utc → datetime.UTC alias.
- tests/phase1/test_idempotency.py: I001 import sort.
Behavior unchanged. All 15 tests in tests/phase1/ still pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Commits artifacts the spec-kit workflow generated but never auto-staged:
- plan.md, research.md, data-model.md, quickstart.md, 4× contracts/,
requirements.md (the spec-004 design substrate)
- CLAUDE.md SPECKIT plan reference updated 003 → 004
- .history.jsonl run-history files (one per iter project)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… require principle grounding (P2-D06, #46 #62)
Deep audit re-check on iter3 surfaced a MEDIUM defect missed in the shallow audit: PROJ-261-iter3's added Principle VII "Code Licensing & Compliance" claims things about GPL / restrictive licensing that have NO basis in the project's idea body (which is about clone density vs LLM perplexity, not licensing). The prompt's "Adapt Core Principles to the specific research domain" instruction permitted the LLM to extrapolate too freely and invent generic-good-practice principles that don't govern the actual research scope.
Prompt v1.1.0 → v1.2.0 (MINOR): added explicit grounding requirement:
- Each new principle must trace claims back to specific idea-body sections (Methodology / Expected results / Motivation / Research question).
- Forbid fabrication of generic-good-practice principles (licensing, deployment, maintenance) that don't address the project's specific research.
- Require new principles to reference idea-body's named datasets/models/methods when codifying domain norms.
Phase 7 next: spawn iter4 siblings, verify both projects' VI/VII principles are grounded in their idea bodies.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… deep audit (Phase 7 round 2, P2-D06 fixed, #46 #62)
Deep re-audit on iter3 (after merge of v1.1.0 patch) surfaced a MEDIUM defect missed in the original audit: PROJ-261-iter3's added Principle VII "Code Licensing & Compliance" claimed things about GPL that have no basis in the project's idea body. The v1.1.0 prompt allowed too-liberal extrapolation; v1.2.0 added explicit grounding requirements (every claim must trace to a specific idea-body section).
iter6 re-runs (with v1.2.0 prompt) produce dramatically improved output:
PROJ-261-iter6:
- VI "Statistical Correlation Integrity": grounds in idea Methodology + Expected results (p < 0.05 threshold, Spearman's rank correlation)
- VII "Clone Detection Consistency": grounds in idea Methodology (AST-based clone detector, codeparrot/github-code subset)
- No fabricated principles. No license/compliance fabrication.
PROJ-262-iter6:
- VI "3D Geometry Preservation": grounds in idea Methodology sketch + Expected results, with explicit "This principle is grounded in..." annotations citing specific idea sections
- VII "Chemical Interpretability": grounds in idea Research question + Motivation with quoted text
- LLM internalized v1.2.0 instruction beautifully (auto-included grounding annotations in the constitution body).
Regression checks all pass:
- No {{token}} leaks
- No DOI / arXiv / URL citations
- No HTML comments
- All 4 inherited principles (I-IV) byte-identical to template
- Principle V differs only in substituted project_id (expected)
Quality monitoring: iter3 → iter6 strictly IMPROVED. No regression. 1 iteration cycle to converge on the new defect (well under FR-005 5-cycle cap; total cycles for this spec: 2).
iter3 siblings now archived (superseded by iter6); iter6 are the carry-forward candidates for spec 005.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…6 selection (P2-D06, #46 #62)
Diagnostic report:
- § 4: P2-D06 added (MEDIUM, fixed in commit 7c5cc08)
- § 5: round-2 iteration diff (v1.1.0 → v1.2.0)
- § 3.6 + § 3.7: deep audit subsections for iter6 PROJ-261 + PROJ-262
- § 8: re-selection of iter6 siblings as carry-forward (was iter3)
carry-forward.yaml now names iter6 siblings:
- PROJ-261-evaluating-the-impact-of-code-duplicatio-iter6
- PROJ-262-predicting-molecular-dipole-moments-with-iter6
Each with project_initializer iterations: 3 (iter2 + iter3 + iter6).
Quality monitoring: strictly monotone improvement across iter2 → iter3 → iter6. No regressions. Total Phase 7 cycles for spec 004: 2 (well under FR-005 5-cycle cap).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Phase 7 round 2 complete (P2-D06 fixed)
After the user's request for high-quality verification, a deep re-audit on iter3 surfaced one more defect:
Patched the project_initializer prompt v1.1.0 → v1.2.0 to require explicit principle-grounding (every claim must trace to a specific idea-body section). Spawned iter6 siblings, re-ran:
Carry-forward manifest updated to point to iter6. Phase 7 total: 2 iteration cycles, strictly monotone quality improvement, no regressions. All 15 pytest tests still pass. Latest commit: 5f72de2.
…ng forward (#46 #62)
Removes the proliferating PROJ-NNN-<slug>-iterN sibling directories (9 of them across PROJ-261 and PROJ-262 from spec 004's iteration loops + induced-failure scenarios) and promotes iter6's audited constitutions onto the canonical paths in place.
Convention change documented at notes/2026-05-06-iteration-convention-change.md:
- Iterate in place on canonical PROJ-NNN-<slug>/.
- Use git commits + log notes to track iteration trail (one commit per iteration with a descriptive message).
- Diagnostic reports gain a § 5 "Iteration log" indexing commits by phase + agent + iteration number.
What's removed:
- All 9 PROJ-261-...-iter[2-6] and PROJ-262-...-iter[2-4,6] dirs
- All state/projects/PROJ-26*-iter*.yaml files
- All state/projects/PROJ-26*-iter*.history.jsonl files
Run-log JSONL entries kept in state/run-log/2026-05/ as historical audit evidence.
What's promoted:
- projects/PROJ-261-evaluating-the-impact-of-code-duplicatio/.specify/memory/constitution.md ← copy of iter6 audited content, project_id substituted to canonical
- projects/PROJ-262-predicting-molecular-dipole-moments-with/.specify/memory/constitution.md ← same
What's preserved:
- tests/phase1/sibling_project.py (deprecation banner added; spec 003's historical reproducibility holds)
- All phase2/spec-004 commits (the iteration trail v1.0.0 → v1.1.0 → v1.2.0 remains browsable via git log on the prompt path)
- spec 003 + spec 004 diagnostic reports' historical references to siblings (they describe historical state)
Verification:
- find projects/PROJ-26*-iter* → empty
- ls state/projects/ | grep iter → empty
- pytest tests/phase1/ → 15/15 pass
- canonical constitutions hold the audited iter6 content with -iter6 stripped from substituted project_ids
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Convention change: sibling-iter directories retired in favor of in-place iteration
After review, the
Removed: all PROJ-261-...-iterN/ and PROJ-262-...-iterN/ directories + their state files. Run-log entries kept in
Promoted: iter6's audited constitution copied in place onto each canonical (
Going forward: future phase-test specs (005+) iterate in place on canonicals, with git commits + log notes as the iteration trail. Sibling spawner deprecated (banner added; preserved for spec 003's historical reproducibility). Documented at
Verification:
…ROJ-261/PROJ-262 (Q1B Q3A, #46 #62)
Two duplicate PROJ-NNN groups existed on main from concurrent cron runs racing in cli._cmd_brainstorm:
- PROJ-261: evaluating-... (carry-forward) + investigating-...
- PROJ-262: predicting-... (carry-forward) + quantifying-...
Q1B (race-condition fix): New src/llmxive/state/project_id_lock.py provides:
- project_id_lock(repo_root): exclusive fcntl.flock on state/.brainstorm.lock; held only during disk-snapshot + state-YAML write (microseconds, NOT during LLM call).
- next_available_proj_num(repo_root, starting_num=1): scans state/projects/ AND projects/ for used PROJ-NNN slots; returns smallest free n. Handles -iterN historic suffixes correctly.
cli._cmd_brainstorm now wraps the per-seed allocation in the lock and writes the state YAML eagerly inside the lock as the atomic ID claim. The LLM call happens BEFORE the lock; the body write happens AFTER (with the ID already claimed).
8 new tests at tests/phase1/test_project_id_lock.py including a real os.fork() concurrent-allocation test that proves two children racing for the lock produce DISTINCT project numbers.
Q3A (cleanup): Renamed the non-carry-forward duplicates to next-free IDs:
- PROJ-261-investigating-... → PROJ-331-investigating-...
- PROJ-262-quantifying-... → PROJ-332-quantifying-...
Carry-forward projects (PROJ-261-evaluating-, PROJ-262-predicting-) keep their numbers since spec 003 + spec 004 + carry-forward manifests + the parent issue all reference them.
Updated 5 file groups: project dirs, state YAMLs (id field), history JSONLs, web/data/projects.json, run-log JSONL entries.
Verification:
- grep -rn "PROJ-261-investigating|PROJ-262-quantifying" → 0 matches
- pytest tests/phase1/ → 23/23 PASS (12.2s, no regression)
- all 4 PROJ-26[12] / PROJ-33[12] dirs verified unique on disk
Documented at notes/2026-05-06-project-id-numbering-fix.md.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
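The locking scheme described in this commit could be sketched as follows. This is a minimal illustration, not the contents of src/llmxive/state/project_id_lock.py: the two function names and their parameters follow the commit message, but the bodies, the `claim_project_id` helper, the regex, and the three-digit zero padding are assumptions. `fcntl` is Unix-only.

```python
import fcntl
import re
from contextlib import contextmanager
from pathlib import Path
from typing import Iterator

# Matches the numeric slot of names such as "PROJ-261-foo-iter3.yaml".
_PROJ_RE = re.compile(r"^PROJ-(\d+)")


@contextmanager
def project_id_lock(repo_root: Path) -> Iterator[None]:
    """Hold an exclusive advisory lock on state/.brainstorm.lock."""
    lock_path = repo_root / "state" / ".brainstorm.lock"
    lock_path.parent.mkdir(parents=True, exist_ok=True)
    with lock_path.open("a") as fh:
        fcntl.flock(fh, fcntl.LOCK_EX)  # blocks until the lock is free
        try:
            yield
        finally:
            fcntl.flock(fh, fcntl.LOCK_UN)


def next_available_proj_num(repo_root: Path, starting_num: int = 1) -> int:
    """Return the smallest unused PROJ number at or above starting_num."""
    used: set[int] = set()
    for sub in ("state/projects", "projects"):
        base = repo_root / sub
        if not base.is_dir():
            continue
        for entry in base.iterdir():
            match = _PROJ_RE.match(entry.name)
            if match:
                used.add(int(match.group(1)))
    n = starting_num
    while n in used:
        n += 1
    return n


def claim_project_id(repo_root: Path, slug: str) -> str:
    """Hypothetical helper: pick an ID and write the state YAML under the lock."""
    with project_id_lock(repo_root):
        n = next_available_proj_num(repo_root)
        project_id = f"PROJ-{n:03d}-{slug}"  # zero padding is an assumption
        state = repo_root / "state" / "projects" / f"{project_id}.yaml"
        state.parent.mkdir(parents=True, exist_ok=True)
        state.write_text(f"id: {project_id}\n", encoding="utf-8")
    return project_id
```

Writing the state file before releasing the lock is what makes the claim atomic: a second process that acquires the lock afterwards sees the new file during its scan and moves on to the next number.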
…#46 #62)
Captures the user's design proposal from the spec 004 review:
- Single canonical librarian agent for all literature search + citation verification (Constitution Principle I: single source of truth)
- Multi-step expanded-term search when initial results are thin (<5 verified citations); brainstorm 10-20 alt terms, iterate, accumulate, log expanded trail to idea.md
- Re-validate flesh_out + research_question_validator from spec 003 once the librarian is in place
3 user stories (P1 each), ~4-5 days estimated effort. 5 open design questions captured for the next /speckit-clarify pass.
This is a HANDOFF NOTE only; the actual spec 005 directory will be created by /speckit-specify when the user starts the next session.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Project-ID race fixed + duplicates renamed (Q1B + Q3A complete)
After review, the duplicate PROJ-261 / PROJ-262 issue had a real concurrency root cause in
Q1B: new
Q3A: renamed the non-carry-forward duplicates to next-free IDs (PROJ-331 + PROJ-332). Carry-forward projects keep their numbers since spec 003 + spec 004 + manifests + tracker all reference them.
Spec 005 handoff note:
Tests: 23/23 PASS (15 prior + 8 new lock tests, 12.2s, no regression).
Latest commit:
…ne_e2e.py (#46 #62)
Test expected `flesh_out_complete → project_initialized` after one step, but spec 003 / D10 inserted research_question_validator between those two stages. The next dispatch from FLESH_OUT_COMPLETE is now the validator, which has four legitimate verdicts:
- VALIDATED (question passed)
- VALIDATOR_REVISE → FLESH_OUT_IN_PROGRESS
- VALIDATOR_REJECTED → BRAINSTORMED
- HUMAN_INPUT_NEEDED
The synthetic smoke fixture has no real research question (just a title + empty idea), so the validator correctly rejects it to BRAINSTORMED, which the assertion now allows.
This was caught by CI on PR #109 against spec 004; fix is one test-assertion change, no production-code shift.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
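The relaxed assertion described above might take a shape like the sketch below. The `Stage` enum, its string values, and the helper name are hypothetical stand-ins for whatever the real pipeline defines; only the four resulting stages come from the list of verdicts in the commit message.

```python
from enum import Enum


class Stage(str, Enum):
    # Hypothetical enum; the real stage definitions live in the pipeline code.
    BRAINSTORMED = "brainstormed"
    FLESH_OUT_IN_PROGRESS = "flesh_out_in_progress"
    FLESH_OUT_COMPLETE = "flesh_out_complete"
    VALIDATED = "validated"
    HUMAN_INPUT_NEEDED = "human_input_needed"
    PROJECT_INITIALIZED = "project_initialized"


# Stages a project may land in after one validator step from FLESH_OUT_COMPLETE.
POST_VALIDATOR_STAGES = frozenset(
    {
        Stage.VALIDATED,              # question passed
        Stage.FLESH_OUT_IN_PROGRESS,  # verdict: revise
        Stage.BRAINSTORMED,           # verdict: rejected
        Stage.HUMAN_INPUT_NEEDED,
    }
)


def is_legitimate_post_validator_stage(stage: Stage) -> bool:
    """True if `stage` is one of the four outcomes the validator can produce."""
    return stage in POST_VALIDATOR_STAGES
```

A smoke test written this way accepts any of the four verdicts and still fails if the project jumps straight to `project_initialized`, which is the transition the old assertion wrongly expected.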
Summary
Validates Phase 2 of the llmXive pipeline end-to-end on iter2/iter3 siblings of spec 003's carry-forward projects (PROJ-261, PROJ-262), per issue #46 and sub-issue #62.
Lands four production fixes:
- Extend ALLOWED_START_STAGES to include validated (FR-003a)
- Add a skip-if-exists guard before project_initializer's constitution write (FR-011 / Q3)
- Raise FileNotFoundError on missing idea file (P2-D03 / Constitution Principle V)
- Tighten the project_initializer prompt v1.0.0 → v1.1.0 to forbid external citations + HTML comments (P2-D04 P2-D05 from US2 audit)
Diagnostic
Full report at notes/2026-05-05-phase2-diagnostic.md. Carry-forward manifest at specs/004-phase2-project-bootstrap-testing/carry-forward.yaml names two iter3 siblings as input substrate for spec 005 (Phase 3 testing):
- PROJ-261-evaluating-the-impact-of-code-duplicatio-iter3 (CS)
- PROJ-262-predicting-molecular-dipole-moments-with-iter3 (chemistry)
Defects (all 5 fixed in-PR)
- e8e09f7 (skip-if-exists guard)
- validated added to the spawner allowlist: e5e423c
- e8e09f7 (raises FileNotFoundError); verified by US4 Scenario 2
- 8f2fe48 (prompt v1.1.0); verified by iter3 audit
- 8f2fe48 (prompt v1.1.0 enumerates forbidden citation forms); verified by iter3 audit
Test plan
- tests/phase1/test_idempotency.py tests pass
- tests/phase1/test_citation_resolver.py tests pass (regression)
- init_speckit_in byte-equal (diff exit 0)
Per-issue acceptance verdict
🤖 Generated with Claude Code
Co-Authored-By: Claude Opus 4.7 (1M context) noreply@anthropic.com