diff --git a/CLAUDE.md b/CLAUDE.md index 83524d88..28ddc746 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -70,5 +70,5 @@ Since this is primarily a research documentation repository without traditional For additional context about technologies to be used, project structure, shell commands, and other important information, read the current plan: -[specs/003-phase1-idea-lifecycle-testing/plan.md](specs/003-phase1-idea-lifecycle-testing/plan.md). +[specs/004-phase2-project-bootstrap-testing/plan.md](specs/004-phase2-project-bootstrap-testing/plan.md). diff --git a/agents/prompts/project_initializer.md b/agents/prompts/project_initializer.md index 01182f12..a2b60406 100644 --- a/agents/prompts/project_initializer.md +++ b/agents/prompts/project_initializer.md @@ -1,6 +1,6 @@ # Project-Initializer Agent -**Version**: 1.0.0 +**Version**: 1.2.0 **Stage owned**: `flesh_out_complete` → `project_initialized` **Default backend**: dartmouth (fallback huggingface, then local) @@ -45,7 +45,40 @@ literal `**Project ID**: …` footer line. - Add at most TWO domain-specific principles (numbered I, II, III, IV, V already exist; if you add one it becomes VI; if two, VII). +- **Each added principle MUST be explicitly grounded in the idea body.** + Concretely: every claim a new principle makes (about methodology, + data sources, evaluation, etc.) MUST trace back to a specific section + of the idea body — Methodology sketch, Expected results, Motivation, + or Research question. If you cannot point to a sentence in the idea + body that justifies a claim in your new principle, do NOT include + that claim. Add fewer principles rather than fabricating ones. +- DO NOT add principles about topics the idea body does not address + (e.g., licensing, IP, deployment, or maintenance) just because they + seem like generic "good practice" for the field. Generic-good-practice + principles belong in the parent constitution, not in the project-level + one. The project-level constitution governs THIS project's specific + research scope. +- Each new principle's body should reference the idea's specific + artifacts (named datasets, named models, named methods) when codifying + a domain norm. Vague principles ("must use good engineering practices") + are not acceptable. - DO NOT remove any of the inherited principles. -- DO NOT introduce external citations here — the constitution is a - governance document, not a research artifact. +- **DO NOT introduce ANY external citations or external identifiers in + the constitution body** — the constitution is a governance document, + not a research artifact. This includes: + - DOIs (`10.xxxx/...`) + - arXiv IDs (`2401.12345`) + - URLs (`http://...`, `https://...`) + - Figshare / Zenodo / OSF / Hugging Face dataset record IDs + Naming a *dataset by name* (e.g., "QM9", "MD17", "codeparrot/github-code") + is acceptable when the dataset is referenced as a generic class of + data, NOT when it is identified by a publication-pointer. If you need + to specify a dataset's source, name only the dataset and let the + Reference-Validator Agent track the canonical pointer in `idea/` and + `paper/`. +- **DO NOT include HTML comment blocks** (``) in your + output. The template you receive contains explanatory comments that + describe the substitution tokens; those are scaffolding for you, NOT + content for the rendered constitution. Strip them before returning + your final document. - Output ONLY the Markdown document. diff --git a/agents/registry.yaml b/agents/registry.yaml index fe66aedb..621115cb 100644 --- a/agents/registry.yaml +++ b/agents/registry.yaml @@ -87,7 +87,7 @@ agents: outputs: - project_state prompt_path: agents/prompts/project_initializer.md - prompt_version: 1.0.0 + prompt_version: 1.2.0 default_backend: dartmouth fallback_backends: - huggingface diff --git a/notes/2026-05-05-phase2-diagnostic.md b/notes/2026-05-05-phase2-diagnostic.md new file mode 100644 index 00000000..51678af1 --- /dev/null +++ b/notes/2026-05-05-phase2-diagnostic.md @@ -0,0 +1,449 @@ +# Phase 2 (Project Bootstrap) Diagnostic Report + +**Spec**: [specs/004-phase2-project-bootstrap-testing/spec.md](../specs/004-phase2-project-bootstrap-testing/spec.md) +**Generated**: 2026-05-06T01:50:00Z (last updated 2026-05-06T03:00:00Z post convention change) +**Branch**: `008-phase2-project-bootstrap-testing` +**Final commit**: see `git log` (HEAD as of last update) +**Issue**: #46 (parent) / #62 (project_initializer) +**Tracker**: #107 + +> **Convention-change note (2026-05-06)**: This report's prose references `-iterN` sibling directories that existed during the spec's original execution but were removed post-spec per the new in-place-iteration convention. See [`notes/2026-05-06-iteration-convention-change.md`](2026-05-06-iteration-convention-change.md). The iteration trail described in §5 is now browsable via `git log -- projects/PROJ-NNN-/` rather than via filesystem suffixes. The audited Phase 2 outputs (constitutions) live in place at `projects/PROJ-261-evaluating-the-impact-of-code-duplicatio/.specify/memory/constitution.md` and `projects/PROJ-262-predicting-molecular-dipole-moments-with/.specify/memory/constitution.md`. + +--- + +## Section 1 — Inputs (carry-forward substrate) + +### Canonicals (from spec 003) + +| Canonical ID | Field | Title | Idea sha256 | Spec-003 final state | +|-|-|-|-|-| +| PROJ-261-evaluating-the-impact-of-code-duplicatio | computer science | Evaluating the Impact of Code Duplication on LLM Code Understanding | `283df3b2b12aba43...` | project_initialized | +| PROJ-262-predicting-molecular-dipole-moments-with | chemistry | Predicting Molecular Dipole Moments with Graph Neural Networks | `6c68732c4f131be0...` | project_initialized | + +(From `specs/003-phase1-idea-lifecycle-testing/carry-forward.yaml`, generated_at `2026-05-05T04:30:00Z`, final_commit `e422cef`.) + +### Iter2 siblings spawned in this spec + +| Sibling ID | Spawner CLI | Idea-clone sha256 | Initial state | +|-|-|-|-| +| PROJ-261-evaluating-the-impact-of-code-duplicatio-iter2 | `python tests/phase1/sibling_project.py PROJ-261-... --iter 2 --start-stage validated` | `283df3b2b12a...` (matches canonical) | `current_stage: validated` | +| PROJ-262-predicting-molecular-dipole-moments-with-iter2 | (analogous) | `6c68732c4f13...` (matches canonical) | `current_stage: validated` | + +Spawner stderr (verbatim): + +```text +[sibling] canonical: PROJ-261-evaluating-the-impact-of-code-duplicatio +[sibling] sibling: PROJ-261-evaluating-the-impact-of-code-duplicatio-iter2 +[sibling] copied projects/PROJ-261-.../idea/evaluating-the-impact-of-code-duplicatio.md → projects/PROJ-261-...-iter2/idea/evaluating-the-impact-of-code-duplicatio.md (sha256 verified: 283df3b2b12a...) +[sibling] wrote state/projects/PROJ-261-...-iter2.yaml (start_stage=validated) +PROJ-261-evaluating-the-impact-of-code-duplicatio-iter2 + +[sibling] canonical: PROJ-262-predicting-molecular-dipole-moments-with +[sibling] sibling: PROJ-262-predicting-molecular-dipole-moments-with-iter2 +[sibling] copied projects/PROJ-262-.../idea/predicting-molecular-dipole-moments-with.md → projects/PROJ-262-...-iter2/idea/predicting-molecular-dipole-moments-with.md (sha256 verified: 6c68732c4f13...) +[sibling] wrote state/projects/PROJ-262-...-iter2.yaml (start_stage=validated) +PROJ-262-predicting-molecular-dipole-moments-with-iter2 +``` + +### Iter3 siblings (Phase 7 iteration after US2 prompt patch) + +| Sibling ID | Justification | +|-|-| +| PROJ-261-evaluating-the-impact-of-code-duplicatio-iter3 | Phase 7 iteration to verify P2-D04 (HTML comment leak) fix | +| PROJ-262-predicting-molecular-dipole-moments-with-iter3 | Phase 7 iteration to verify P2-D05 (DOI citation leak) fix | + +Both spawned via the same spawner with `--iter 3 --start-stage validated`; both idea-files sha256-match the canonicals. + +### Induced-failure siblings (Phase 6 / US4) + +| Sibling ID | Scenario | +|-|-| +| PROJ-261-...-iter4 | Backend unreachable (invalid `DARTMOUTH_CHAT_API_KEY`) | +| PROJ-262-...-iter4 | Idea file missing (deleted before run) | +| PROJ-261-...-iter5 | Template file missing (renamed before run) | + +All three archived per FR-019 (`archived_at: 2026-05-06T01:46:00Z`). + +### Backend retry policy verification (FR-002) + +Confirmed `src/llmxive/backends/router.py:96-100`: + +```python +models_to_try = [model] + [m for m in MODEL_FALLBACKS.get(model, []) if m != model] +for model_idx, m in enumerate(models_to_try): + attempts = 3 if model_idx == 0 else 1 +``` + +This satisfies Q4's "≥2 retries / ≥3 total attempts" minimum (the existing policy gives 3 attempts × primary + 1 attempt × each peer model in `MODEL_FALLBACKS` × the entire fallback-backend chain). No code change needed (per research.md Decision 3). + +--- + +## Section 2 — Agent behavior (per sibling, per run) + +### 2.1 PROJ-261-iter2 happy-path run (run_id `e9a3dfce-8435-455f-bf7a-8e4206ffb754`) + +**2.1.1 Pre-run state YAML** (verbatim `cat /tmp/pre-261.yaml`): + +```yaml +artifact_hashes: {} +assigned_agent: null +created_at: '2026-05-06T01:34:59.650757Z' +current_stage: validated +failed_stage: null +field: computer science +human_escalation_reason: null +id: PROJ-261-evaluating-the-impact-of-code-duplicatio-iter2 +last_run_id: null +last_run_status: null +points_paper: {} +points_research: {} +revision_round: 0 +speckit_paper_dir: null +speckit_research_dir: null +title: Evaluating the Impact of Code Duplication on LLM Code Understanding +updated_at: '2026-05-06T01:34:59.650757Z' +``` + +**2.1.2 Rendered system prompt** (`/tmp/prompt-PROJ-261-...-iter2.txt`, system 2098 chars after substitution): + +Key excerpt showing tokens substituted (no `{{...}}` survive): + +```text +The agent's runtime substitutes `PROJ-261-evaluating-the-impact-of-code-duplicatio-iter2`, +`Evaluating the Impact of Code Duplication on LLM Code Understanding`, +`computer science`, `2026-05-06`, and `flesh_out` BEFORE the LLM is invoked, +so the model sees concrete values. +``` + +[Full prompt 2098 chars; quoted in `/tmp/prompt-PROJ-261-...-iter2.txt` for archival; truncated here for report length.] + +**2.1.3 Rendered user prompt**: 8044 chars containing the rendered constitution template (with all 5 tokens substituted) plus the full idea body. Substitution verified — no `{{token}}` strings. + +**2.1.4 LLM response** (the resulting constitution): see § 3.1.2. + +**2.1.5 Run-log JSONL line** (verbatim): + +```json +{"agent_name": "project_initializer", "backend": "dartmouth", "cost_estimate_usd": 0.0, "ended_at": "2026-05-06T01:36:28.619215Z", "entry_id": "0f1509ea-3f6b-4121-abf7-3a57874f2279", "failure_reason": null, "inputs": ["projects/PROJ-261-evaluating-the-impact-of-code-duplicatio-iter2/idea/evaluating-the-impact-of-code-duplicatio.md"], "model_name": "qwen.qwen3.5-122b", "outcome": "success", "outputs": ["projects/PROJ-261-evaluating-the-impact-of-code-duplicatio-iter2/.specify/memory/constitution.md"], "parent_entry_id": null, "project_id": "PROJ-261-evaluating-the-impact-of-code-duplicatio-iter2", "prompt_version": "1.0.0", "run_id": "e9a3dfce-8435-455f-bf7a-8e4206ffb754", "started_at": "2026-05-06T01:35:25.536741Z", "task_id": "60aceaed-3295-49bb-af12-779613877485"} +``` + +Duration: 63s (< 300s wall_clock_budget). `outcome: success`, `prompt_version: 1.0.0`. + +**2.1.6 Post-run state YAML**: `current_stage: project_initialized`, `last_run_id: e9a3dfce-...`. (Diff from § 2.1.1: `current_stage` advanced; `last_run_id` populated; `updated_at` advanced 89s.) + +### 2.2 PROJ-262-iter2 happy-path run (run_id `4a04a919-0a1c-46f9-a9a3-fab5a96200ce`) + +Identical pattern; duration 72s; run-log `outcome: success`; state advanced to `project_initialized`. Run-log JSONL: + +```json +{"agent_name": "project_initializer", "backend": "dartmouth", "cost_estimate_usd": 0.0, "ended_at": "2026-05-06T01:37:45.360194Z", "entry_id": "21b4e5e1-e85a-478f-b66a-a09cfc6acf23", "failure_reason": null, "inputs": ["projects/PROJ-262-predicting-molecular-dipole-moments-with-iter2/idea/predicting-molecular-dipole-moments-with.md"], "model_name": "qwen.qwen3.5-122b", "outcome": "success", "outputs": ["projects/PROJ-262-predicting-molecular-dipole-moments-with-iter2/.specify/memory/constitution.md"], "parent_entry_id": null, "project_id": "PROJ-262-predicting-molecular-dipole-moments-with-iter2", "prompt_version": "1.0.0", "run_id": "4a04a919-0a1c-46f9-a9a3-fab5a96200ce", "started_at": "2026-05-06T01:36:33.062008Z", "task_id": "072cd3e0-f357-4404-b1e7-764d8ad11ef7"} +``` + +### 2.3 PROJ-261-iter3 (Phase 7 — patched prompt v1.1.0) + +Run produced clean constitution under v1.1.0 prompt. State advanced to `project_initialized`. Run-log `outcome: success`, `prompt_version: 1.1.0`. + +### 2.4 PROJ-262-iter3 (Phase 7 — patched prompt v1.1.0) + +Same pattern; run-log `outcome: success`, `prompt_version: 1.1.0`. + +### 2.5 Induced-failure run: PROJ-261-iter4 (backend unreachable) + +**Stderr quote**: + +```text +[run] FAIL on PROJ-261-evaluating-the-impact-of-code-duplicatio-iter4: every backend in chain ['dartmouth', 'huggingface', 'local'] failed; errors: dartmouth/qwen.qwen3.5-122b(permanent): 'API key invalid!' | huggingface/qwen.qwen3.5-122b(permanent): HF_TOKEN is not set (required by HF backend) | local/qwen.qwen3.5-122b(permanent): transformers is not installed; required by local backend +``` + +**Run-log entry** (`outcome: failed`, populated `failure_reason`, `outputs: []`): + +```json +{"agent_name": "project_initializer", "outcome": "failed", "failure_reason": "BackendError: every backend in chain ['dartmouth', 'huggingface', 'local'] failed; errors: dartmouth/qwen.qwen3.5-122b(permanent): 'API key invalid!' | ...", "inputs": ["projects/PROJ-261-...-iter4/idea/...md"], "outputs": [], "started_at": "2026-05-06T01:44:56.987321Z", "ended_at": "2026-05-06T01:44:57.810596Z", "run_id": "a0c232b3-5868-46c7-85c0-38558d483a71", "prompt_version": "1.1.0"} +``` + +**Post-failure state**: `current_stage: validated` (UNCHANGED). No `.specify/` directory created. + +### 2.6 Induced-failure run: PROJ-262-iter4 (idea file missing) + +**Stderr**: `[run] FAIL on PROJ-262-...-iter4: project_initializer requires at least one input (idea file path); got ctx.inputs=[]` + +**Run-log**: `outcome: failed`, `failure_reason: "FileNotFoundError: project_initializer requires at least one input (idea file path); got ctx.inputs=[]"`, `inputs: []`, `outputs: []`. State `validated` unchanged. No `.specify/`. + +This is the fail-fast guard from T008 (P2-D03 fix) firing as designed. + +### 2.7 Induced-failure run: PROJ-261-iter5 (template file missing) + +**Stderr**: `[run] FAIL on PROJ-261-...-iter5: prompt template not found: /Users/jmanning/llmXive/agents/templates/research_project_constitution.md` + +**Run-log**: `outcome: failed`, `failure_reason: "FileNotFoundError: prompt template not found: /Users/jmanning/llmXive/agents/templates/research_project_constitution.md"`, `outputs: []`. State unchanged. No `.specify/`. Template restored after run; git tree clean. + +--- + +## Section 3 — Outputs (per sibling) + +### 3.1 PROJ-261-iter2 (initial run — pre-fix) + +**3.1.1 Constitution audit table** + +| # | Item | Verdict | Excerpt | Severity | +|-|-|-|-|-| +| a | Heading | ✓ PASS | `# Evaluating the Impact of Code Duplication on LLM Code Understanding — Research Project Constitution` | — | +| b | Footer | ✓ PASS | `**Project ID**: PROJ-261-...-iter2 \| **Field**: computer science \| **Ratified**: 2026-05-06` | — | +| c | Inherited principles I-V | ✓ PASS | All five present (lines 19-52) | — | +| d | ≤2 added principles | ✓ PASS | VI (Inference Determinism) + VII (Clone Metric Integrity), both grounded | — | +| e | No external citations | ✓ PASS | (model identifier `Salesforce/codegen-350M-mono` is acceptable per prompt v1.1.0; iter2 had no DOI/URL) | — | +| f | Reproducibility-Requirements adapted | ✓ PASS | Names `codeparrot/github-code` corpus + 8-bit quantization + 7GB RAM | — | +| **EXTRA: HTML comment leak** | ⚠️ FAIL | Lines 3-15 contain the template's `` comment block (substituted but not stripped) | **MEDIUM (P2-D04)** | + +**3.1.2 Constitution full text**: 121 lines; sha256 `a9328c69108e7eaf...`. Quoted in full in the spec branch's commit `931698a`. Truncating here for report length: `[file: projects/PROJ-261-...-iter2/.specify/memory/constitution.md, lines 1-121, sha256: a9328c69108e7eaf]`. + +**3.1.3 Token-leak check**: `grep -F '{{' projects/PROJ-261-...-iter2/.specify/memory/constitution.md` exits 1 (no matches). ✓ PASS (SC-010). + +**3.1.4 Source-of-truth verification**: all 9 mechanical files (4 scripts + 5 templates) byte-identical to repo-root canonicals (sha256 match). ✓ PASS. + +### 3.2 PROJ-262-iter2 (initial run — pre-fix) + +**3.2.1 Constitution audit table** + +| # | Item | Verdict | Excerpt | Severity | +|-|-|-|-|-| +| a | Heading | ✓ PASS | `# Predicting Molecular Dipole Moments with Graph Neural Networks — Research Project Constitution` | — | +| b | Footer | ✓ PASS | `**Project ID**: PROJ-262-...-iter2 \| **Field**: chemistry \| **Ratified**: 2026-05-06` | — | +| c | Inherited principles I-V | ✓ PASS | All five preserved | — | +| d | ≤2 added principles | ✓ PASS | VI (Numerical Stability) + VII (Chemical Consistency) | — | +| e | No external citations | ⚠️ **FAIL** | Line 56: `DOI: 10.6084/m9.figshare.9981994` (Figshare DOI for QM9) | **CRITICAL (P2-D05)** per spec.md SC-011 | +| f | Reproducibility-Requirements adapted | ✓ PASS | Names QM9 dataset + connectivity rules | — | + +**3.2.2 Constitution full text**: 98 lines; sha256 captured in commit `931698a`. + +**3.2.3 Token-leak check**: ✓ no matches. PASS (SC-010). + +**3.2.4 Source-of-truth verification**: all 9 mechanical files match. ✓ PASS. + +### 3.3 PROJ-261-iter3 (Phase 7 — post-fix with prompt v1.1.0) + +**3.3.1 Constitution audit table** + +| # | Item | Verdict | Evidence | +|-|-|-|-| +| a | Heading | ✓ PASS | Line 1 | +| b | Footer | ✓ PASS | Line 104 | +| c | Inherited I-V preserved | ✓ PASS | Lines 5-38 | +| d | ≤2 added principles | ✓ PASS | VI (Model & Compute Integrity) + VII (Code Licensing & Compliance) | +| e | No external citations | ✓ **PASS** | No DOI / arXiv / URL anywhere | +| f | Reproducibility-Requirements adapted | ✓ PASS | Line 62 names `codeparrot/github-code` as dataset name (allowed per v1.1.0) | +| **HTML comment leak** | ✓ **PASS** | No ` - ## Core Principles ### I. Reproducibility (NON-NEGOTIABLE) @@ -51,23 +37,45 @@ Advancement-Evaluator Agent invalidates stale review records when the hashed artifact changes. Every research-stage artifact change updates this project's `state/projects/PROJ-262-predicting-molecular-dipole-moments-with.yaml` `updated_at` timestamp. -### VI. Numerical Stability & Convergence +### VI. 3D Geometry Preservation (domain-specific) + +All molecular coordinate transformations and 3D-equivariant model operations +MUST preserve rotational and translational invariance. Coordinate preprocessing +pipelines MUST document all geometric transformations applied to the QM9 dataset +and verify that derived features maintain proper spatial relationships. This +principle is grounded in the project's Methodology sketch which specifies +"extract 3D coordinates, atom types, and bond connectivity" and the Expected +results which state "3D conformation carries significant signal" for dipole +prediction. + +### VII. Chemical Interpretability (domain-specific) -Graph Neural Network training workflows MUST define floating-point precision standards (e.g., float32 vs float64) and explicit convergence criteria (loss plateau thresholds) in `code/`. Dipole moment predictions MUST remain stable within 1% variance across re-runs with pinned seeds to ensure physical validity and prevent numerical artifacts from influencing feature importance analysis. +Feature attribution analysis MUST identify specific structural components +(atom types, bond types, 3D conformation) that drive dipole moment predictions. +Model outputs MUST be traceable to chemical features through permutation +importance or attention analysis as specified in the Methodology sketch. This +principle is grounded in the Research question asking "Which structural features +of small organic molecules... carry the most predictive signal" and the +Motivation stating "Understanding which structural components drive dipole +predictions is critical for designing interpretable machine learning potentials." ## Reproducibility Requirements -- A `requirements.txt` (or `pyproject.toml`) at `projects/PROJ-262-predicting-molecular-dipole-moments-with/code/` pins every Python dependency. -- The Code-Execution Agent runs each task in an isolated virtualenv built from this requirements file; no global packages are assumed. -- Every notebook or script under `code/` is runnable end-to-end without manual intervention. -- External datasets (specifically QM9 from Figshare) MUST be fetched from the canonical source and verified against the project's recorded checksum before training begins. +- A `requirements.txt` (or `pyproject.toml`) at `projects/PROJ-262-predicting-molecular-dipole-moments-with/code/` + pins every Python dependency. +- The Code-Execution Agent runs each task in an isolated virtualenv built + from this requirements file; no global packages are assumed. +- Every notebook or script under `code/` is runnable end-to-end without + manual intervention. ## Data Hygiene -- Every file under `data/` is checksummed in the project's `state/projects/PROJ-262-predicting-molecular-dipole-moments-with.yaml` `artifact_hashes` map. -- Raw data (e.g., QM9 raw downloads) is preserved unchanged; derivations are written to new filenames. -- No commits are accepted that fail the Repository-Hygiene Agent's PII scan. -- Dataset versions (e.g., QM9 DOI) MUST be recorded in `data/` metadata files to ensure traceability of molecular structures. +- Every file under `data/` is checksummed in the project's + `state/projects/PROJ-262-predicting-molecular-dipole-moments-with.yaml` `artifact_hashes` map. +- Raw data is preserved unchanged; derivations are written to new + filenames. +- No commits are accepted that fail the Repository-Hygiene Agent's PII + scan. ## Verified Accuracy Gate @@ -75,14 +83,16 @@ The Reference-Validator Agent runs at three points: 1. On every artifact write that introduces or modifies citations. 2. Inside the Advancement-Evaluator before awarding any review point. -3. As a blocking gate on the `research_review` → `research_accepted` transition. +3. As a blocking gate on the `research_review` → `research_accepted` + transition. -A reviewer's score MUST be set to 0.0 if the reviewed artifact has any citation in `unreachable` or `mismatch` status. +A reviewer's score MUST be set to 0.0 if the reviewed artifact has any +citation in `unreachable` or `mismatch` status. ## Versioning This constitution carries its own semver. Initial version: -**1.0.0** — ratified 2026-05-05. +**1.0.0** — ratified 2026-05-06. Amendments follow the parent llmXive constitution's amendment procedure (open a PR; update the version line; record a Sync Impact Report). @@ -97,4 +107,4 @@ Review-point thresholds for this project follow `web/about.html`. The parser at `src/llmxive/config.py` is the single source these numbers flow from. -**Project ID**: PROJ-262-predicting-molecular-dipole-moments-with | **Field**: chemistry | **Ratified**: 2026-05-05 +**Project ID**: PROJ-262-predicting-molecular-dipole-moments-with | **Field**: chemistry | **Ratified**: 2026-05-06 diff --git a/projects/PROJ-261-investigating-the-correlation-between-gu/idea/investigating-the-correlation-between-gu.md b/projects/PROJ-331-investigating-the-correlation-between-gu/idea/investigating-the-correlation-between-gu.md similarity index 100% rename from projects/PROJ-261-investigating-the-correlation-between-gu/idea/investigating-the-correlation-between-gu.md rename to projects/PROJ-331-investigating-the-correlation-between-gu/idea/investigating-the-correlation-between-gu.md diff --git a/projects/PROJ-262-quantifying-the-impact-of-magnetic-field/idea/quantifying-the-impact-of-magnetic-field.md b/projects/PROJ-332-quantifying-the-impact-of-magnetic-field/idea/quantifying-the-impact-of-magnetic-field.md similarity index 100% rename from projects/PROJ-262-quantifying-the-impact-of-magnetic-field/idea/quantifying-the-impact-of-magnetic-field.md rename to projects/PROJ-332-quantifying-the-impact-of-magnetic-field/idea/quantifying-the-impact-of-magnetic-field.md diff --git a/specs/004-phase2-project-bootstrap-testing/carry-forward.yaml b/specs/004-phase2-project-bootstrap-testing/carry-forward.yaml new file mode 100644 index 00000000..7e51c403 --- /dev/null +++ b/specs/004-phase2-project-bootstrap-testing/carry-forward.yaml @@ -0,0 +1,52 @@ +spec: "004-phase2-project-bootstrap-testing" +generated_at: 2026-05-06T03:00:00Z +final_commit: HEAD +projects: + - project_id: PROJ-261-evaluating-the-impact-of-code-duplicatio + final_state: project_initialized + final_commit: HEAD + audited_iter_id: PROJ-261-evaluating-the-impact-of-code-duplicatio # in-place; iteration trail in git log + agents_run: + - { name: brainstorm, iterations: 1, final_iter_id: PROJ-261-evaluating-the-impact-of-code-duplicatio } + - { name: flesh_out, iterations: 1, final_iter_id: PROJ-261-evaluating-the-impact-of-code-duplicatio } + - { name: research_question_validator, iterations: 1, final_iter_id: PROJ-261-evaluating-the-impact-of-code-duplicatio } + - { name: project_initializer, iterations: 3, final_iter_id: PROJ-261-evaluating-the-impact-of-code-duplicatio } + justification: | + Constitution audited under project_initializer prompt v1.2.0 (the + audited content was originally produced on a now-removed iter6 + sibling and copied in place onto the canonical per the iteration + convention change documented at notes/2026-05-06-iteration-convention-change.md). + All six US2 contract items PASS plus four EXTRA audit checks + (no DOI, no HTML comments, no token leaks, every new-principle + claim traces to a specific idea-body section). Principle VI + "Statistical Correlation Integrity" grounds in idea's Methodology + + Expected results (p < 0.05 threshold, Spearman's rank correlation). + Principle VII "Clone Detection Consistency" grounds in idea's + Methodology (AST-based clone detector, codeparrot/github-code subset). + All 5 inherited principles preserved. Iteration trail visible in + `git log -- projects/PROJ-261-evaluating-the-impact-of-code-duplicatio/`. + Ready for spec 005's specifier + clarifier agents. + + - project_id: PROJ-262-predicting-molecular-dipole-moments-with + final_state: project_initialized + final_commit: HEAD + audited_iter_id: PROJ-262-predicting-molecular-dipole-moments-with # in-place + agents_run: + - { name: brainstorm, iterations: 1, final_iter_id: PROJ-262-predicting-molecular-dipole-moments-with } + - { name: flesh_out, iterations: 2, final_iter_id: PROJ-262-predicting-molecular-dipole-moments-with } + - { name: research_question_validator, iterations: 2, final_iter_id: PROJ-262-predicting-molecular-dipole-moments-with } + - { name: project_initializer, iterations: 3, final_iter_id: PROJ-262-predicting-molecular-dipole-moments-with } + justification: | + Constitution audited under project_initializer prompt v1.2.0 + (originally produced on iter6 sibling, copied in place per + convention change). The LLM included explicit "This principle is + grounded in..." annotations directly in the constitution body, + citing specific idea sections by name. Principle VI "3D Geometry + Preservation" grounds in idea's Methodology sketch ("extract 3D + coordinates, atom types, and bond connectivity") and Expected + results ("3D conformation carries significant signal"). Principle + VII "Chemical Interpretability" grounds in idea's Research question + ("Which structural features... carry the most predictive signal") + and Motivation. Both principles strictly within the project's + actual research scope; no fabrication. Iteration trail in + `git log -- projects/PROJ-262-predicting-molecular-dipole-moments-with/`. diff --git a/specs/004-phase2-project-bootstrap-testing/checklists/requirements.md b/specs/004-phase2-project-bootstrap-testing/checklists/requirements.md new file mode 100644 index 00000000..d0969d5a --- /dev/null +++ b/specs/004-phase2-project-bootstrap-testing/checklists/requirements.md @@ -0,0 +1,37 @@ +# Specification Quality Checklist: Phase 2 (Project Bootstrap) End-to-End Testing & Diagnostics + +**Purpose**: Validate specification completeness and quality before proceeding to planning +**Created**: 2026-05-05 +**Feature**: [spec.md](../spec.md) + +## Content Quality + +- [x] No implementation details (languages, frameworks, APIs) — *spec names production code paths and file paths because the testing-spec genre requires referencing the system under test; this is the same convention spec 003 used and is consistent with `/speckit-specify` guidance for testing-domain specs* +- [x] Focused on user value and business needs — *each US explicitly states "Why this priority" tying it to pipeline correctness* +- [x] Written for non-technical stakeholders — *prose-led; technical pointers (file:line) appear as audit anchors rather than implementation prescription* +- [x] All mandatory sections completed — *User Scenarios & Testing, Requirements, Success Criteria, Assumptions all populated; Edge Cases enumerated* + +## Requirement Completeness + +- [x] No [NEEDS CLARIFICATION] markers remain — *Clarifications section flags three optional decisions for `/speckit-clarify` but none are blocking [NEEDS CLARIFICATION] markers in FRs/SCs* +- [x] Requirements are testable and unambiguous — *each FR names a specific file/path/threshold; FR-001 through FR-021 each pass the "testable and unambiguous" test* +- [x] Success criteria are measurable — *SC-001 through SC-012 each have a concrete pass/fail condition (e.g., "≥2 successful runs", "0 mock/fake calls", "100% of `{{token}}` strings substituted")* +- [x] Success criteria are technology-agnostic (no implementation details) — *Most SCs describe outcomes (e.g., "constitution passes audit"); SCs that name file paths do so to anchor measurability, not to mandate implementation* +- [x] All acceptance scenarios are defined — *each US has 2-3 numbered Given/When/Then scenarios* +- [x] Edge cases are identified — *11 edge cases enumerated, including the spawner-allowlist prerequisite and the partial-write-on-backend-failure case* +- [x] Scope is clearly bounded — *Spec is explicitly Phase 2 only (single agent, single stage transition); deliberately defers Phase 3 to spec 005* +- [x] Dependencies and assumptions identified — *Assumptions section explicitly names the spec-003 carry-forward manifest, the sibling spawner, the orchestrator entry point, and the credentials location* + +## Feature Readiness + +- [x] All functional requirements have clear acceptance criteria — *FRs map 1:1 to USs (US1 → FR-001/002/003/003a/004; US2 → FR-010; US3 → FR-011; US4 → FR-012; US5 → FR-006/007/008/013; US6 → FR-017)* +- [x] User scenarios cover primary flows — *US1 (happy path) through US6 (carry-forward gate) cover ingest → audit → idempotency → failure → report → handoff* +- [x] Feature meets measurable outcomes defined in Success Criteria — *each SC traces to at least one FR (e.g., SC-001 ↔ FR-001/002, SC-009 ↔ FR-011, SC-010/011 ↔ FR-015)* +- [x] No implementation details leak into specification — *FRs describe what to verify, not how to verify; sibling spawner extension (FR-003a) is named because it's a known prerequisite, not a chosen design* + +## Notes + +- Items marked incomplete require spec updates before `/speckit-clarify` or `/speckit-plan` +- Branch number (`008-…`) and spec directory number (`004-…`) intentionally diverge — this is allowed by `/speckit-specify` and explained in the spec's frontmatter +- The spec mirrors spec 003's structure intentionally to make pattern-match audit easy and to inherit spec 003's clarification decisions (sibling iteration, prompt-version semver, verbatim-quote cap, etc.) +- Three soft clarification candidates are noted in the Clarifications section but left unresolved as defaults; user may run `/speckit-clarify` if any default needs to change before planning diff --git a/specs/004-phase2-project-bootstrap-testing/contracts/carry-forward.md b/specs/004-phase2-project-bootstrap-testing/contracts/carry-forward.md new file mode 100644 index 00000000..51f6920d --- /dev/null +++ b/specs/004-phase2-project-bootstrap-testing/contracts/carry-forward.md @@ -0,0 +1,87 @@ +# Contract: Phase 2 → Phase 3 carry-forward manifest + +**File**: `specs/004-phase2-project-bootstrap-testing/carry-forward.yaml` +**Produced by**: this spec's `/speckit-implement` workflow (US6) +**Consumed by**: spec 005 (Phase 3 — Spec Kit: Specify → Clarify, parent issue #47) +**Schema base**: `specs/003-phase1-idea-lifecycle-testing/carry-forward.yaml` (extends with one new field) + +## YAML schema + +```yaml +spec: "004-phase2-project-bootstrap-testing" # string, fixed +generated_at: # ISO-8601 with Z suffix +final_commit: # 7-char short SHA +projects: + - project_id: (-iterN)?> # the project spec 005 will operate on + final_state: project_initialized # MUST equal this string verbatim + final_commit: # commit hash that produced final_state + phase2_iter2_id: -iterN> # NEW: which iter2 sibling produced the audited constitution + agents_run: # ordered list of agents that touched this project + - { name: brainstorm, iterations: , final_iter_id: (-iterN)?> } + - { name: flesh_out, iterations: , final_iter_id: (-iterN)?> } + - { name: research_question_validator, iterations: , final_iter_id: (-iterN)?> } + - { name: project_initializer, iterations: , final_iter_id: (-iterN)?> } + justification: | + +``` + +## Field-level validation rules + +| Field | Type | Required | Validation | +|-|-|-|-| +| `spec` | string | yes | MUST equal `"004-phase2-project-bootstrap-testing"` | +| `generated_at` | ISO-8601 UTC | yes | MUST end in `Z`; MUST be ≤ now | +| `final_commit` | string | yes | MUST resolve to a real commit on the feature branch (`git rev-parse ` succeeds) | +| `projects` | list | yes | length 1 or 2 (per FR-017 / SC-002) | +| `projects[*].project_id` | string | yes | regex `^PROJ-\d{3}-[a-z0-9-]{1,50}(-iter\d+)?$`; MUST resolve to a real `projects//` directory; MUST be among {PROJ-261-…, PROJ-262-…, or one of their iterN siblings spawned in this spec} | +| `projects[*].final_state` | string | yes | MUST equal `project_initialized` | +| `projects[*].final_commit` | string | yes | MUST resolve to a real commit on the feature branch; MUST be the commit that touched `state/projects/.yaml` last | +| `projects[*].phase2_iter2_id` | string | yes | regex `^PROJ-\d{3}-[a-z0-9-]{1,50}-iter\d+$`; MUST resolve to a real iter2 sibling at `projects//` with a complete `.specify/` scaffold; NEW field per Decision 6 in research.md | +| `projects[*].agents_run` | list | yes | non-empty; MUST contain at least one entry where `name == project_initializer` and `iterations >= 1` | +| `projects[*].agents_run[*].name` | enum | yes | one of {brainstorm, flesh_out, research_question_validator, project_initializer} for Phase 2 carry-forward | +| `projects[*].agents_run[*].iterations` | int ≥ 1 | yes | MUST equal the actual count of sibling iters that ran this agent for this project | +| `projects[*].agents_run[*].final_iter_id` | string | yes | regex matches PROJ-id pattern; MUST resolve to a real `projects//` | +| `projects[*].justification` | string (multiline) | yes | ≤200 words; MUST cite the US2 audit result for the named `phase2_iter2_id` | + +## Cross-field invariants + +- **`phase2_iter2_id`'s state must match the `final_state` claim**: `state/projects/.yaml` MUST have `current_stage: project_initialized`. +- **`phase2_iter2_id`'s constitution must exist and pass the US2 audit**: `projects//.specify/memory/constitution.md` MUST be a real file, ≥1 byte, with no `{{token}}` strings. (Verified by the diagnostic report's §3.X.3.) +- **`phase2_iter2_id`'s scaffold must be complete**: all 9 mechanical files (5 templates + 4 scripts) MUST be present and byte-identical to repo root (verified by §3.X.4). +- **If `project_id != phase2_iter2_id`** (i.e., the carry-forward names a canonical with the iter2's audited constitution), then the canonical's `.specify/memory/constitution.md` MUST be byte-identical to the iter2's. The diagnostic report MUST quote both and verify sha256 equality. + +## Validator + +A validator script (analogous to spec 003's `tests/phase1/validate_carry_forward.py`) MAY be added in a follow-up spec to enforce these rules in CI; for spec 004 the validation is performed manually by the maintainer reading the manifest and the diagnostic report side-by-side. Hand-validation is recorded as the §6 row "Schema validation" in the diagnostic report. + +## Example (illustrative — not the actual final manifest) + +```yaml +spec: "004-phase2-project-bootstrap-testing" +generated_at: 2026-05-05T18:00:00Z +final_commit: abc1234 +projects: + - project_id: PROJ-261-evaluating-the-impact-of-code-duplicatio-iter2 + final_state: project_initialized + final_commit: abc1234 + phase2_iter2_id: PROJ-261-evaluating-the-impact-of-code-duplicatio-iter2 + agents_run: + - { name: brainstorm, iterations: 1, final_iter_id: PROJ-261-evaluating-the-impact-of-code-duplicatio } + - { name: flesh_out, iterations: 1, final_iter_id: PROJ-261-evaluating-the-impact-of-code-duplicatio } + - { name: research_question_validator, iterations: 1, final_iter_id: PROJ-261-evaluating-the-impact-of-code-duplicatio } + - { name: project_initializer, iterations: 2, final_iter_id: PROJ-261-evaluating-the-impact-of-code-duplicatio-iter2 } + justification: | + Clean iter2 run on first pass. Constitution audit (US2): all 6 contract + items PASS, including the chemistry-domain-specific Reproducibility + Requirements adaptation (named `codeparrot/github-code` corpus directly). + Idempotency check (US3): all 10 .specify/-tree files byte-identical + after second init_speckit_in invocation; constitution sha256 unchanged + after skip-if-exists guard exercised. Two domain-specific principles + were added (VI: Code-corpus Provenance, VII: 8-bit Quantization + Reproducibility), both grounded in the project's idea body. No CRITICAL + or HIGH defects; ready for spec 005. +``` diff --git a/specs/004-phase2-project-bootstrap-testing/contracts/diagnostic-report.md b/specs/004-phase2-project-bootstrap-testing/contracts/diagnostic-report.md new file mode 100644 index 00000000..4c5139f4 --- /dev/null +++ b/specs/004-phase2-project-bootstrap-testing/contracts/diagnostic-report.md @@ -0,0 +1,153 @@ +# Contract: Diagnostic report structure + +**File**: `notes/2026-05-05-phase2-diagnostic.md` +**Produced by**: maintainer-driven `/speckit-implement` workflow on this spec +**Consumed by**: GitHub issue #62 closure comment, GitHub issue #107 checkbox advancement, spec 005 author when picking up the substrate + +## Format + +Single Markdown file with the eight top-level sections specified below. All artifact quotes use fenced code blocks with appropriate language tags (`yaml`, `json`, `markdown`, `text`). Quotes >100 lines are truncated with the marker `[truncated lines N-M, sha256: ]`. + +## Frontmatter + +```markdown +# Phase 2 (Project Bootstrap) Diagnostic Report + +**Spec**: [specs/004-phase2-project-bootstrap-testing/spec.md](../specs/004-phase2-project-bootstrap-testing/spec.md) +**Generated**: +**Branch**: 008-phase2-project-bootstrap-testing +**Final commit**: +**Issue**: #46 (parent) / #62 (project_initializer) +**Tracker**: #107 +``` + +## Section 1 — Inputs (carry-forward substrate) + +Required content: + +- A table listing each canonical (PROJ-261, PROJ-262) with: + - Source: `specs/003-phase1-idea-lifecycle-testing/carry-forward.yaml` short reference + - Final state on `main`: `project_initialized` + - Field, title, idea-file path, sha256 of `idea/.md` +- A table listing each iter2 sibling spawned in this spec with: + - Sibling ID + - Spawner CLI invocation verbatim + - sha256 evidence: source `idea/.md` hash AND destination `idea/.md` hash (must match) + - Initial state YAML (`current_stage: validated`) + +## Section 2 — Agent behavior (per sibling, per run) + +For each `project_initializer` invocation in this spec (happy-path runs from US1 + induced-failure runs from US4 + any iter3+ runs from iteration loops), include a subsection numbered 2.X with: + +- **2.X.1 Pre-run state YAML**: verbatim `cat state/projects/.yaml` block +- **2.X.2 Rendered system prompt**: verbatim quote of the system message after token substitution (per `project_initializer.py` line 65 — call `render_prompt` and include the returned string). Must show `{{title}}`, `{{field}}`, `{{date}}`, `{{principal_agent_name}}`, `{{project_id}}` all resolved to concrete values. +- **2.X.3 Rendered user prompt**: verbatim quote of the user message including the rendered constitution template AND the idea body +- **2.X.4 LLM response**: verbatim quote of `response.text` (the constitution Markdown) +- **2.X.5 Run-log JSONL line**: verbatim quote of the entry written to `state/run-log//.jsonl` +- **2.X.6 Post-run state YAML**: verbatim `cat state/projects/.yaml` block (must show `current_stage: project_initialized` for happy-path, unchanged for failure-path) + +## Section 3 — Outputs (per sibling) + +For each happy-path sibling, include a subsection numbered 3.X with: + +- **3.X.1 Constitution audit table** (the six US2 contract items per E2): + + | # | Contract item | Verdict | Quoted excerpt | Severity (if FAIL) | + |-|-|-|-|-| + | a | Heading line | PASS / FAIL | `# — Research Project Constitution` | CRITICAL | + | b | Footer line | PASS / FAIL | `**Project ID**: …` | CRITICAL | + | c | All five inherited principles preserved | PASS / FAIL | (per-principle quote) | CRITICAL | + | d | At most TWO added domain principles | PASS / FAIL / N/A | (numbered VI/VII or absent) | HIGH | + | e | No external citations introduced | PASS / FAIL | (any URL/DOI found) | CRITICAL | + | f | `Reproducibility Requirements` adapted to project's data sources | PASS / FAIL | (quoted section) | MEDIUM | + +- **3.X.2 Constitution full text**: verbatim quote of `.specify/memory/constitution.md` (≤100 lines or `[truncated…]`) +- **3.X.3 Token-leak check**: assert no literal `{{token}}` strings appear in the constitution; quote the result of `grep -F "{{" .specify/memory/constitution.md` (must be empty) +- **3.X.4 Source-of-truth verification**: a table comparing each scaffold-tree file to its repo-root canonical: + + | File path (relative to .specify/) | Repo-root canonical | sha256 match? | + |-|-|-| + | scripts/bash/common.sh | .specify/scripts/bash/common.sh | ✓/✗ | + | scripts/bash/check-prerequisites.sh | .specify/scripts/bash/check-prerequisites.sh | ✓/✗ | + | scripts/bash/create-new-feature.sh | .specify/scripts/bash/create-new-feature.sh | ✓/✗ | + | scripts/bash/setup-plan.sh | .specify/scripts/bash/setup-plan.sh | ✓/✗ | + | templates/checklist-template.md | .specify/templates/checklist-template.md | ✓/✗ | + | templates/constitution-template.md | .specify/templates/constitution-template.md | ✓/✗ | + | templates/plan-template.md | .specify/templates/plan-template.md | ✓/✗ | + | templates/spec-template.md | .specify/templates/spec-template.md | ✓/✗ | + | templates/tasks-template.md | .specify/templates/tasks-template.md | ✓/✗ | + +- **3.X.5 Idempotency check** (for the iter2 sibling chosen as the US3 subject): the sha256-tree before/after manifests from E8, both quoted, plus a verdict (`IDENTICAL` ⇒ pass, `DIVERGED` ⇒ list of changed files with severity) + +## Section 4 — Defects table + +Required column order: + +| ID | Severity | Source US/FR | File:line | Description | Status | Resolution | +|-|-|-|-|-|-|-| +| P2-D01 | HIGH | US3 / FR-011 | src/llmxive/agents/project_initializer.py:84-104 | Constitution write is overwrite-unconditional, violating idempotency | Fixed | Commit `<SHA>` (skip-if-exists guard added) | +| P2-D02 | HIGH | FR-003a | tests/phase1/sibling_project.py:36 | `ALLOWED_START_STAGES` doesn't include `validated` | Fixed | Commit `<SHA>` | + +Defects discovered during implementation (US1-US6) get appended with the next available P2-D## ID. Status options: `Fixed in PR <SHA>` / `Deferred to issue #<N>` / `Accepted (not addressed) — rationale: <text>`. CRITICAL defects MUST NOT have status `Accepted`. + +## Section 5 — Iteration diffs + +Only present if iter3+ siblings were spawned (a defect surfaced after iter2). Format per iteration: + +```text +### Iteration N → N+1: <title of the change> + +**Patch motivation**: <one-sentence finding from the report section that motivated this iteration> + +**Files changed**: +- `agents/prompts/project_initializer.md` (prompt_version `<old>` → `<new>`) +- `agents/templates/research_project_constitution.md` (if applicable) +- `src/llmxive/agents/project_initializer.py` (if applicable) + +**Diff (verbatim `git diff <prev-SHA> <curr-SHA> -- <path>`)**: + +```diff +<diff content> +``` + +**Re-run result**: <pass/fail of the previously-failing acceptance criterion, with a quoted excerpt from the new sibling's constitution> +``` + +If no iter3+ runs occurred, this section is a single line: `No iteration loops fired; iter2 happy-path was sufficient on first pass.` + +## Section 6 — Per-issue acceptance-criteria summary + +Issue #62 (project_initializer) has three checkboxes. Each MUST be marked PASS or FAIL with rationale tied to a quoted artifact from §2 or §3: + +| # | Issue #62 checkbox | Verdict | Rationale (anchored to artifact) | +|-|-|-|-| +| 1 | Renders `.specify/memory/constitution.md` with project-specific principles (not template placeholders) | PASS / FAIL | (cite §3.X.1 row a/b/c/d) | +| 2 | Creates the scripts/bash/ runners (setup-plan.sh, check-prerequisites.sh, etc.) | PASS / FAIL | (cite §3.X.4) | +| 3 | Idempotent: running twice doesn't duplicate or corrupt files | PASS / FAIL | (cite §3.X.5) | + +Issue #46 (parent phase) has four checkboxes; each is marked PASS/FAIL/N/A with rationale: + +| # | Issue #46 checkbox | Verdict | Rationale | +|-|-|-|-| +| 1 | Every agent sub-issue passes its acceptance criteria | PASS / FAIL | derived from issue #62's three above | +| 2 | Phase-level smoke test passes end-to-end on a fresh project | PASS / FAIL | cite §2 of any happy-path sibling | +| 3 | No silent shortcuts | PASS / FAIL | cite §3.X.3 (no token leaks), cite §2 of any failure-path sibling (state unchanged on failure) | +| 4 | All artifacts written by this phase pass schema validation (where applicable) | PASS / FAIL | cite the state YAML schema check | +| 5 | Run-log entries record outcome, started_at, ended_at for every agent invocation | PASS / FAIL | cite §2.X.5 of every sibling | + +## Section 7 — Recommendations + +Required content: + +- A bulleted list of recommended changes for Phase 2 going forward (e.g., "Tighten the prompt's domain-principle constraint to require citing the project's idea body explicitly") +- A bulleted list of follow-up issue numbers this spec opened (or recommends opening) for deferred defects +- A bulleted list of items the spec deliberately accepted as-is (with rationale per FR-005) + +## Section 8 — Carry-forward decision + +Required content: + +- Final selection: 1 or 2 sibling IDs (or canonicals + iter2 ID) that advance to spec 005 +- Per-selection: final commit hash, full state-YAML quote, justification paragraph (≤200 words) covering whether the constitution passes the US2 audit cleanly + whether idempotency holds +- A pointer to the carry-forward manifest at `specs/004-phase2-project-bootstrap-testing/carry-forward.yaml` +- Closing line: "Carry-forward complete. Spec 005 (Phase 3) MAY pick up these projects." diff --git a/specs/004-phase2-project-bootstrap-testing/contracts/idempotency-check.md b/specs/004-phase2-project-bootstrap-testing/contracts/idempotency-check.md new file mode 100644 index 00000000..10defc45 --- /dev/null +++ b/specs/004-phase2-project-bootstrap-testing/contracts/idempotency-check.md @@ -0,0 +1,143 @@ +# Contract: Idempotency-check pytest harness + +**File**: `tests/phase1/test_idempotency.py` +**Produced by**: this spec's `/speckit-implement` workflow (US3 / SC-009) +**Consumed by**: pytest in CI, the maintainer running US3 audits +**Purpose**: Verify FR-011 / SC-009 — full byte-level idempotency of `project_initializer` on a sibling already at `project_initialized`. + +## Test inventory + +| Test name | Purpose | Source: spec scenario | +|-|-|-| +| `test_init_speckit_in_idempotent_on_complete_tree` | Run `init_speckit_in` twice on a complete `.specify/` tree (templates + scripts + memory); assert all 9 mechanical files have unchanged sha256 | US3 acceptance scenario 1 | +| `test_project_initializer_skips_existing_constitution` | Instantiate `ProjectInitializerAgent` directly; call `handle_response` with mock LLM-output text on a tree that already has `.specify/memory/constitution.md`; assert the file's sha256 is unchanged | US3 acceptance scenario 2 | +| `test_project_initializer_writes_on_first_invocation` | Same agent on a fresh project_dir; assert the constitution IS written and matches the LLM-output (regression: don't break the happy path with the new skip-if-exists guard) | US3 implicit (negative-control) | +| `test_full_tree_idempotent_after_two_agent_invocations` | End-to-end: two consecutive `handle_response` calls; assert ALL 10 files (9 mechanical + 1 constitution) are byte-identical | FR-011 / SC-009 | + +## Test pattern + +```python +import hashlib +from pathlib import Path + +import pytest + +from llmxive.agents.base import AgentContext +from llmxive.agents.project_initializer import ProjectInitializerAgent +from llmxive.backends.base import ChatResponse +from llmxive.speckit.runner import init_speckit_in +from llmxive.types import AgentRegistryEntry + + +def _sha256_tree(root: Path) -> dict[str, str]: + """Return {relpath: sha256} for every regular file under root.""" + out: dict[str, str] = {} + for p in root.rglob("*"): + if p.is_file(): + out[str(p.relative_to(root))] = hashlib.sha256(p.read_bytes()).hexdigest() + return out + + +def test_init_speckit_in_idempotent_on_complete_tree(tmp_path: Path): + """SC-009: scaffold tree must be byte-identical after second init.""" + project_dir = tmp_path / "PROJ-test-idem" + init_speckit_in(project_dir) + before = _sha256_tree(project_dir / ".specify") + init_speckit_in(project_dir) + after = _sha256_tree(project_dir / ".specify") + assert before == after, f"divergence: {set(before.items()) ^ set(after.items())}" + + +def test_project_initializer_skips_existing_constitution(tmp_path: Path, monkeypatch): + """US3 acceptance 2: re-running the agent on a project with a pre-existing + constitution must NOT overwrite it (skip-if-exists guard from Q3).""" + # Pre-stage a project_dir with a constitution already in place. + project_dir = tmp_path / "PROJ-test-skip" + init_speckit_in(project_dir) + constitution_path = project_dir / ".specify" / "memory" / "constitution.md" + constitution_path.parent.mkdir(parents=True, exist_ok=True) + pre_existing_text = "# Test Constitution\n\n**Project ID**: PROJ-test-skip | **Field**: testing | **Ratified**: 2026-05-05\n" + constitution_path.write_text(pre_existing_text, encoding="utf-8") + pre_hash = hashlib.sha256(constitution_path.read_bytes()).hexdigest() + + # Build a context pointing at this project; the agent's handle_response + # should detect the existing file and skip. + entry = AgentRegistryEntry( + name="project_initializer", + purpose="test", + prompt_path="agents/prompts/project_initializer.md", + prompt_version="1.0.0", + default_backend="dartmouth", + fallback_backends=[], + default_model="qwen.qwen3.5-122b", + wall_clock_budget_seconds=300, + ) + agent = ProjectInitializerAgent(entry) + + # Monkeypatch the project_dir resolution to point at tmp_path. + # (The exact mechanism depends on how the agent computes project_dir; + # the fix in research.md Decision 2 reads it from `repo / "projects" / ctx.project_id`, + # so we set ctx.project_id to a path that resolves there.) + ctx = AgentContext( + project_id="PROJ-test-skip", + metadata={"title": "Test", "field": "testing", "principal_agent_name": "flesh_out"}, + inputs=[], + ) + monkeypatch.setattr( + "llmxive.agents.project_initializer.Path", + lambda *a: tmp_path / "fake-repo-root" if False else Path(*a), + ) # see note below — actual monkeypatching shape depends on the fix + + # Simulate the LLM having returned different text from what's on disk. + response = ChatResponse( + text="# Different Constitution\n\nThis would corrupt the existing one.\n", + model="qwen.qwen3.5-122b", + backend="dartmouth", + cost_estimate_usd=0.0, + ) + agent.handle_response(ctx, response) + + post_hash = hashlib.sha256(constitution_path.read_bytes()).hexdigest() + assert pre_hash == post_hash, "skip-if-exists guard failed: constitution was overwritten" + + +def test_project_initializer_writes_on_first_invocation(tmp_path: Path, monkeypatch): + """Negative control: with no pre-existing constitution, the agent MUST write one. + Ensures the skip-if-exists guard didn't break the happy path.""" + # ... similar setup, but constitution_path.is_file() is False going in. + # Assert that after handle_response, the file exists and contains the LLM response text. + + +def test_full_tree_idempotent_after_two_agent_invocations(tmp_path: Path, monkeypatch): + """FR-011 / SC-009 end-to-end: two consecutive agent invocations leave the + full .specify/ tree byte-identical at file-content level.""" + # ... runs the agent twice, computes _sha256_tree before and after the + # SECOND invocation, asserts equality. +``` + +## Notes on the monkeypatching + +The harness needs to redirect `project_dir = repo / "projects" / ctx.project_id` to a `tmp_path`-based root. The cleanest mechanism is to factor the path resolution out of `ProjectInitializerAgent.handle_response` into a small helper that accepts an explicit `project_root`, then passing `tmp_path` in tests. The patch in research.md Decision 2 should add this seam if it doesn't already exist; if not, the alternative is to monkeypatch the `Path(__file__).resolve().parent.parent.parent.parent` calculation that yields `repo`. Either approach is acceptable; the test contract requires that the harness can run without writing into the actual repository. + +## Run-cost expectation + +Pytest collection: <1s. Each test: <2s on a developer workstation (no network, no LLM, no large file copies). Total module wall-clock: <10s. Suitable for CI without time budget concerns. + +## Acceptance evidence (referenced from §3.X.5 of the diagnostic report) + +When the harness passes: + +```text +$ pytest tests/phase1/test_idempotency.py -v +============================= test session starts ============================== +collected 4 items + +tests/phase1/test_idempotency.py::test_init_speckit_in_idempotent_on_complete_tree PASSED +tests/phase1/test_idempotency.py::test_project_initializer_skips_existing_constitution PASSED +tests/phase1/test_idempotency.py::test_project_initializer_writes_on_first_invocation PASSED +tests/phase1/test_idempotency.py::test_full_tree_idempotent_after_two_agent_invocations PASSED + +============================== 4 passed in 4.21s =============================== +``` + +This block is quoted verbatim into the diagnostic report as evidence for SC-009. diff --git a/specs/004-phase2-project-bootstrap-testing/contracts/induced-failure-runs.md b/specs/004-phase2-project-bootstrap-testing/contracts/induced-failure-runs.md new file mode 100644 index 00000000..9ff8ee2a --- /dev/null +++ b/specs/004-phase2-project-bootstrap-testing/contracts/induced-failure-runs.md @@ -0,0 +1,155 @@ +# Contract: Induced-failure runs (US4) + +**Produced by**: maintainer-driven `/speckit-implement` workflow on this spec +**Consumed by**: §2 of the diagnostic report (one subsection per induced-failure scenario) +**Purpose**: Verify FR-012 / SC-005 — Phase 2 fails loudly under each of three precondition violations. + +## Required sibling iter naming + +Each induced-failure scenario uses a dedicated sibling iter so the failures don't contaminate each other (per Q2 clarification). Suggested naming: + +| Scenario | Sibling iter ID | Canonical | +|-|-|-| +| Backend unreachable | `PROJ-261-…-iterFAIL-backend` | PROJ-261 | +| Idea file missing | `PROJ-262-…-iterFAIL-idea` | PROJ-262 | +| Template file missing | `PROJ-261-…-iterFAIL-template` | PROJ-261 | + +(Alternative: use sequential suffixes `-iter3`, `-iter4`, `-iter5` if the canonical doesn't already have those — but human-readable suffixes are easier to grep for in the diagnostic report.) + +## Scenario 1 — Backend unreachable + +**Setup**: + +```bash +# Spawn a fresh sibling at validated. +python tests/phase1/sibling_project.py \ + PROJ-261-evaluating-the-impact-of-code-duplicatio \ + --iter 6 \ + --start-stage validated +# (or whatever iter number is unused; capture the sibling_id) + +# Save the original env, then point the backend at an invalid host. +ORIGINAL_BASE_URL="${LLMXIVE_BACKEND_BASE_URL:-}" +export LLMXIVE_BACKEND_BASE_URL="https://invalid.example.com" + +# Run the orchestrator with the bogus URL active. +python -m llmxive run --project <sibling_id> --max-tasks 1 +echo "exit code: $?" + +# Restore env. +export LLMXIVE_BACKEND_BASE_URL="$ORIGINAL_BASE_URL" +``` + +**Expected behavior**: +- Router walks the entire backend chain (`dartmouth → huggingface → local`); each backend either fails to instantiate (no API key for that backend) or hits transient errors and retries +- Eventually the router raises `TransientBackendError` (or `PermanentBackendError` if all backends are unconfigured) +- Orchestrator writes one run-log JSONL line with `outcome: failure` and `failure_reason` containing the original exception's repr +- State YAML's `current_stage` remains `validated` (unchanged) +- No `.specify/memory/constitution.md` is created +- No partial scaffold tree under `.specify/{scripts,templates}/` (init_speckit_in is called only after the LLM response is received, per `project_initializer.py:88-89`; if the LLM call fails, init_speckit_in never runs) + +**Failure modes that would be CRITICAL defects**: +- Empty `failure_reason` string in the run-log entry (Constitution Principle V violation) +- State YAML advances to `project_initialized` despite the LLM failing (silent state advancement on failure) +- A partial constitution file appears at `.specify/memory/constitution.md` (file-write should be atomic-or-absent) + +## Scenario 2 — Idea file missing + +**Setup**: + +```bash +# Spawn a fresh sibling at validated; capture the sibling_id. +SIBLING_ID=$(python tests/phase1/sibling_project.py \ + PROJ-262-predicting-molecular-dipole-moments-with \ + --iter 7 \ + --start-stage validated) + +# Delete the idea file BEFORE the agent runs. +SLUG=$(echo "$SIBLING_ID" | sed 's/^PROJ-[0-9]*-//' | sed 's/-iter[0-9]*$//') +rm "projects/$SIBLING_ID/idea/$SLUG.md" + +# Confirm it's gone. +ls "projects/$SIBLING_ID/idea/" || echo "(directory empty)" + +# Run the orchestrator. +python -m llmxive run --project "$SIBLING_ID" --max-tasks 1 +echo "exit code: $?" +``` + +**Expected behavior** (post Decision 5 fix in research.md): + +After the in-PR fix lands (replace `if idea_path.exists():` with `raise FileNotFoundError`): + +- `ProjectInitializerAgent.build_messages` raises `FileNotFoundError` immediately +- Orchestrator records `outcome: failure` with `failure_reason` quoting the exception +- State remains `validated`; no constitution written + +**Pre-fix behavior** (the defect we're surfacing): + +- `build_messages` silently sets `idea_summary = ""` (line 60 of `project_initializer.py`) +- The LLM is invoked with an empty idea body and produces a constitution with no idea-grounding +- State advances to `project_initialized` despite the precondition violation + +The diagnostic report MUST capture the pre-fix behavior FIRST (running the unpatched code, quoting the resulting constitution that lacks idea-grounding), then file the defect (P2-D03) at HIGH severity, then capture the post-fix behavior in an "After fix" subsection. + +## Scenario 3 — Template file missing + +**Setup**: + +```bash +# Spawn a fresh sibling at validated; capture sibling_id. +SIBLING_ID=$(python tests/phase1/sibling_project.py \ + PROJ-261-evaluating-the-impact-of-code-duplicatio \ + --iter 8 \ + --start-stage validated) + +# Move the template out of the way. +mv agents/templates/research_project_constitution.md \ + agents/templates/research_project_constitution.md.bak + +# Run the orchestrator. +python -m llmxive run --project "$SIBLING_ID" --max-tasks 1 +echo "exit code: $?" + +# Restore the template (CRITICAL — don't leave it renamed in the work tree). +mv agents/templates/research_project_constitution.md.bak \ + agents/templates/research_project_constitution.md +``` + +**Expected behavior**: + +- `render_prompt(CONSTITUTION_TEMPLATE_PATH, …)` at line 44 of `project_initializer.py` raises `FileNotFoundError` (the loader can't find the missing template) +- This happens BEFORE the LLM is invoked, so no API call is made (also satisfies Constitution Principle V — fail-fast on missing precondition) +- Orchestrator records `outcome: failure` with `failure_reason` quoting the exception +- State remains `validated`; no constitution written; no scaffold tree written + +**Failure modes that would be CRITICAL defects**: +- The agent silently falls back to a default-rendered constitution (the defensive fallback at lines 94-101 should NOT activate on a template-not-found error — that fallback is for malformed LLM output, not for missing template files) +- The exception is swallowed and replaced with a generic "could not render constitution" message that doesn't name the missing path +- The agent reaches the LLM-invocation step and burns API tokens despite the precondition being unmet + +## Required diagnostic-report capture per scenario + +Each induced-failure scenario produces a §2.X subsection with: + +| Required element | Source | +|-|-| +| Pre-run state YAML | `cat state/projects/<sibling_id>.yaml` | +| Setup steps verbatim | the bash block from this contract document | +| Stderr / exception trace | captured by `python -m llmxive run …` and quoted as `text` block | +| Run-log JSONL line | `cat state/run-log/<YYYY-MM>/<run_id>.jsonl` | +| Post-run state YAML | `cat state/projects/<sibling_id>.yaml` (must equal pre-run YAML in `current_stage`) | +| Post-run filesystem state | `ls projects/<sibling_id>/.specify/ 2>&1` (should show no `memory/constitution.md`; for scenarios 1+3, may show or not show partial scaffold; document either way) | +| Verdict | PASS (failure was loud + recorded + state unchanged + no partial artifacts) or FAIL (one or more of those four conditions violated; defect logged) | + +## Cleanup checklist (after all three scenarios run) + +- [ ] `LLMXIVE_BACKEND_BASE_URL` restored to its original value (or unset) +- [ ] `agents/templates/research_project_constitution.md` is back in place at the canonical path +- [ ] All three induced-failure sibling directories committed to git (per FR-016) — they are NOT silently deleted +- [ ] State YAMLs of those siblings remain at `current_stage: validated` (the failure record IS the artifact spec 005 may need to inspect) +- [ ] Each sibling's state YAML has `archived_at: <ISO-8601 UTC>` set (per FR-019, since these are not carry-forward candidates) + +## Acceptance verdict (rolls into SC-005) + +SC-005 passes when ALL THREE scenarios produce: (a) `outcome: failure` in the run-log, (b) populated `failure_reason`, (c) `current_stage` unchanged, (d) no partial constitution file. If any scenario fails any of (a)-(d), that's a defect the spec is responsible for fixing in-PR or deferring with rationale (FR-014 / FR-018). diff --git a/specs/004-phase2-project-bootstrap-testing/data-model.md b/specs/004-phase2-project-bootstrap-testing/data-model.md new file mode 100644 index 00000000..ea82ae32 --- /dev/null +++ b/specs/004-phase2-project-bootstrap-testing/data-model.md @@ -0,0 +1,262 @@ +# Data Model: Phase 2 (Project Bootstrap) End-to-End Testing & Diagnostics + +**Spec**: [spec.md](./spec.md) +**Plan**: [plan.md](./plan.md) +**Date**: 2026-05-05 + +## Purpose + +Concrete schema for every entity the spec produces or consumes, so the diagnostic, the audit, and the carry-forward manifest can all reference the same definitions. + +--- + +## E1. Carry-forward sibling + +A new project ID derived from a canonical PROJ-NNN-<slug> by appending `-iterN`, used as the actual subject of Phase 2 testing. + +**Identity**: +- `project_id` (string, regex `^PROJ-\d{3}-[a-z0-9-]{1,50}-iter\d+$`) +- `canonical_id` (string, the original `PROJ-NNN-<slug>` from spec 003's carry-forward) +- `iter_n` (int ≥ 2, monotonically increasing per canonical) + +**Lifecycle entry conditions**: +- Spawned by `tests/phase1/sibling_project.py <canonical_id> --iter <N> --start-stage validated` +- `idea/<slug>.md` is byte-for-byte cloned from the canonical (sha256-verified by the spawner) +- Fresh `state/projects/<project_id>.yaml` written at `current_stage: validated` +- No `.specify/` scaffold yet (the agent under test produces it) + +**Relationships**: +- 1 sibling → 1 canonical (many siblings per canonical possible) +- 1 sibling → 1 state YAML (`state/projects/<project_id>.yaml`) +- 1 sibling → ≥0 run-log entries (one per agent invocation against this sibling) + +**Validation rules**: +- Sibling MUST NOT exist in `projects/` before spawning (spawner refuses to clobber) +- `iter_n ≥ 2` (iter1 is reserved for the canonical) +- After Phase 2 happy-path: `current_stage: project_initialized` +- After Phase 2 induced-failure path: `current_stage: validated` (unchanged) + +--- + +## E2. Constitution artifact + +The LLM-rendered Markdown produced by `project_initializer` and written to the sibling's `.specify/memory/constitution.md`. + +**Storage**: file at `projects/<sibling_id>/.specify/memory/constitution.md` + +**Content contract** (from `agents/prompts/project_initializer.md` lines 38-50): +- (a) **Heading line 1** literally `# <title> — Research Project Constitution` +- (b) **Footer line** literally `**Project ID**: <project_id> | **Field**: <field> | **Ratified**: <date>` +- (c) **Inherited principles I-V** preserved (names may be paraphrased but content must be substantively equivalent to the parent template) +- (d) **At most TWO** added domain-specific principles (numbered VI and/or VII) +- (e) **No external citations** (governance document, not research artifact) +- (f) **`Reproducibility Requirements` section** adapted to project's actual data sources (e.g., names QM9 / MD17 for chemistry, names `codeparrot/github-code` for the CS project, etc.) + +**Substitution rule** (from `src/llmxive/agents/project_initializer.py:43-54`): tokens `{{project_id}}`, `{{title}}`, `{{field}}`, `{{date}}`, `{{principal_agent_name}}` MUST all be substituted with concrete values BEFORE the LLM is invoked. Final file must contain no literal `{{token}}` strings (SC-010, CRITICAL defect if violated). + +**Audit derivation**: each of the six contract items above maps to one row in the US2 audit table per sibling (see `contracts/diagnostic-report.md` § "Constitution audit table"). + +--- + +## E3. Spec Kit scaffold tree + +The mechanical filesystem tree produced by `init_speckit_in` under each sibling. + +**Storage**: directories under `projects/<sibling_id>/.specify/`: + +``` +.specify/ +├── memory/ +│ ├── constitution.md # E2 (LLM-rendered) +│ └── (sentinel files written by future agents go here, e.g., research_question_validated.yaml — but Phase 2 produces none) +├── scripts/ +│ └── bash/ +│ ├── common.sh +│ ├── check-prerequisites.sh +│ ├── create-new-feature.sh +│ └── setup-plan.sh +└── templates/ + ├── checklist-template.md + ├── constitution-template.md + ├── plan-template.md + ├── spec-template.md + └── tasks-template.md +``` + +**Total file count**: 5 templates + 4 scripts + 1 constitution = **10 files** (memory/ has only the constitution post-Phase-2; sentinel files appear later). + +**Source-of-truth invariant**: every file under `templates/` and `scripts/bash/` MUST be byte-for-byte identical to the corresponding file at the repo root's `.specify/templates/*` or `.specify/scripts/bash/*`. Any byte-level divergence is a CRITICAL defect (the meta-system is supposed to be the single source of truth per Constitution Principle I). + +**Idempotency invariant** (FR-011 / SC-009): a second `init_speckit_in` invocation MUST leave every file unchanged at sha256 level. The constitution write follows the skip-if-exists rule per Decision 2 in research.md. + +--- + +## E4. Project state YAML + +The `state/projects/<project_id>.yaml` file the orchestrator reads/writes to track sibling progress through the pipeline. + +**Storage**: `state/projects/<sibling_id>.yaml` + +**Schema** (matches `specs/001-agentic-pipeline-refactor/contracts/project-state.schema.yaml`): + +| Field | Type | Phase 2 value (entry) | Phase 2 value (happy-path exit) | Phase 2 value (failure-path exit) | +|-|-|-|-|-| +| `id` | string | `<sibling_id>` | unchanged | unchanged | +| `title` | string | inherited from canonical | unchanged | unchanged | +| `field` | string | inherited from canonical | unchanged | unchanged | +| `current_stage` | enum | `validated` | `project_initialized` | `validated` (unchanged) | +| `last_run_id` | UUID | `null` | new run UUID | new run UUID | +| `last_run_status` | enum | `null` | `success` | `failure` | +| `failed_stage` | string | `null` | `null` | `null` (failure recorded in run-log, not state) | +| `human_escalation_reason` | string | `null` | `null` | `null` (only set on `human_input_needed` transitions) | +| `revision_round` | int | 0 | 0 | 0 | +| `created_at` | ISO-8601 UTC | spawner sets | unchanged | unchanged | +| `updated_at` | ISO-8601 UTC | spawner sets | new timestamp | new timestamp | +| `assigned_agent`, `points_*`, `speckit_*_dir`, `artifact_hashes` | various | empty/null | unchanged for Phase 2 | unchanged | + +**Validation rules**: +- `current_stage` MUST be in the schema enum (per spec 003 / D14 fix that added `validated`/`validator_revise`/`validator_rejected`) +- `last_run_status` MUST be `success` if `current_stage` advanced to `project_initialized` post-run; otherwise the run-log entry MUST record the failure +- Stage transitions MUST be in `ALLOWED_TRANSITIONS[current_stage]` (per `src/llmxive/agents/lifecycle.py`); for Phase 2 this means `validated → {project_initialized, human_input_needed}` + +--- + +## E5. Run-log entry + +One JSONL line per agent invocation, written to `state/run-log/<YYYY-MM>/<run_id>.jsonl` (one file per run UUID; one line per agent within that run). + +**Storage**: `state/run-log/2026-05/<run_id>.jsonl` + +**Schema** (one JSON object per line): + +| Field | Type | Required | Phase 2 happy-path | Phase 2 failure-path | +|-|-|-|-|-| +| `agent` | string | yes | `"project_initializer"` | `"project_initializer"` | +| `project_id` | string | yes | `<sibling_id>` | `<sibling_id>` | +| `run_id` | UUID | yes | matches state's `last_run_id` | matches state's `last_run_id` | +| `outcome` | enum | yes | `success` | `failure` | +| `started_at` | ISO-8601 UTC | yes | populated | populated | +| `ended_at` | ISO-8601 UTC | yes | populated | populated | +| `duration_seconds` | float | yes | <300 (within wall_clock_budget) | typically <60 (fail-fast) | +| `failure_reason` | string \| null | iff outcome=failure | `null` | non-empty exception repr or message | +| `stage_before` | string | yes | `"validated"` | `"validated"` | +| `stage_after` | string | yes | `"project_initialized"` | `"validated"` (unchanged) | +| `model` | string | yes | resolved at runtime (e.g., `qwen.qwen3.5-122b`) | resolved at runtime | +| `backend` | string | yes | resolved at runtime (e.g., `dartmouth`) | resolved at runtime | + +**Validation rules**: +- Every agent invocation MUST produce exactly one run-log entry, including failures (FR-012, Constitution Principle V) +- `outcome` and `stage_before`/`stage_after` MUST be consistent: `success ⇒ stage_after = STAGE_AFTER_AGENT[stage_before]`; `failure ⇒ stage_after = stage_before` +- `failure_reason` MUST be non-empty when `outcome = failure` (no silent failures per FR-015) + +--- + +## E6. Diagnostic report + +A single Markdown file at `notes/2026-05-05-phase2-diagnostic.md` aggregating all artifacts and their evaluations. + +**Storage**: `notes/2026-05-05-phase2-diagnostic.md` + +**Section structure** (mirrors spec 003's report; defined in detail in `contracts/diagnostic-report.md`): + +| § | Title | Required | Content | +|-|-|-|-| +| 1 | Inputs (carry-forward substrate) | yes | which canonicals, which iter2 siblings, sha256 evidence of byte-identical idea-clone | +| 2 | Agent behavior (per sibling, per run) | yes | rendered system prompts, LLM responses, state YAML before/after, run-log JSONL line | +| 3 | Outputs (per sibling) | yes | full constitution quote (≤100 lines verbatim, else `[truncated…]`), full scaffold-tree manifest, `init_speckit_in` source-of-truth verification | +| 4 | Defects table | yes | one row per CRITICAL/HIGH/MEDIUM/LOW finding with severity, file:line, status (`fixed in PR <hash>` / `deferred to issue #N` / `accepted (not addressed)`) | +| 5 | Iteration diffs | iff iter3+ spawned | `git diff <iter2-commit>:<path> <iter3-commit>:<path>` blocks per iteration | +| 6 | Per-issue acceptance-criteria summary | yes | issue #62's three checkboxes, each marked pass/fail with rationale tied to a quoted artifact | +| 7 | Recommendations | yes | what (if anything) to change in Phase 2 going forward; pointers to follow-up issues | +| 8 | Carry-forward decision | yes | which iter2 siblings (1-2) advance to spec 005; their final commit hashes; one-paragraph justification per | + +**Validation rules**: +- Every sibling that ran (whether iter2 happy-path or `-iterFAIL-*` induced failure) MUST appear in §2 and §3 with verbatim quotes +- Every CRITICAL defect MUST have a status entry that is NOT `accepted (not addressed)` (CRITICAL defects must be fixed or deferred to a tracked issue per FR-014 / SC-006) +- §6 MUST mark each of issue #62's three acceptance-criteria checkboxes pass/fail (no skips) +- §8 MUST name 1-2 sibling IDs OR explicitly state "no carry-forward selected; falling back to spec-003 canonicals" with rationale + +--- + +## E7. Carry-forward manifest (Phase 2 → Phase 3) + +YAML file at `specs/004-phase2-project-bootstrap-testing/carry-forward.yaml` naming the iter2 siblings spec 005 will operate on. + +**Storage**: `specs/004-phase2-project-bootstrap-testing/carry-forward.yaml` + +**Schema** (extends spec 003's schema with one new field): + +```yaml +spec: "004-phase2-project-bootstrap-testing" +generated_at: <ISO-8601 UTC> +final_commit: <git SHA> +projects: + - project_id: <sibling_id-or-canonical_id> # e.g., PROJ-261-...-iter2 or PROJ-261-... + final_state: project_initialized + final_commit: <git SHA> + phase2_iter2_id: <sibling_id> # NEW field — names which iter2 produced the .specify/memory/constitution.md + # MAY equal project_id (when carrying forward the iter2 sibling itself) + # MAY differ from project_id (when carrying forward the canonical with the iter2's audited constitution copied in) + agents_run: + - { name: brainstorm, iterations: <N>, final_iter_id: <id> } + - { name: flesh_out, iterations: <N>, final_iter_id: <id> } + - { name: research_question_validator, iterations: <N>, final_iter_id: <id> } + - { name: project_initializer, iterations: <N>, final_iter_id: <id> } + justification: | + <one paragraph: did the constitution pass the US2 audit cleanly? + did idempotency hold under the patched skip-if-exists? + which domain-specific principles did the LLM add and were they grounded?> +``` + +**Validation rules**: +- `projects` list MUST contain 1-2 entries (FR-017, SC-002) +- Each `project_id` MUST be either an iter2 sibling OR a canonical, AND `phase2_iter2_id` MUST be a real iter2 sibling that exists at `projects/<phase2_iter2_id>/` +- Each named project MUST have `final_state: project_initialized` +- Each named `final_commit` MUST resolve to a real commit on the feature branch +- `agents_run` MUST include `{name: project_initializer, iterations: ≥1, final_iter_id: <some sibling>}` (this spec's distinguishing run) + +--- + +## E8. Idempotency hash list + +A pair of sha256-per-file manifests computed before and after a second `init_speckit_in` invocation, used to verify FR-011 / SC-009. + +**Format** (in-memory; not persisted to disk except as a quoted block in §3 of the diagnostic report): + +```python +{ + ".specify/memory/constitution.md": "<sha256>", + ".specify/scripts/bash/common.sh": "<sha256>", + ".specify/scripts/bash/check-prerequisites.sh": "<sha256>", + ".specify/scripts/bash/create-new-feature.sh": "<sha256>", + ".specify/scripts/bash/setup-plan.sh": "<sha256>", + ".specify/templates/checklist-template.md": "<sha256>", + ".specify/templates/constitution-template.md": "<sha256>", + ".specify/templates/plan-template.md": "<sha256>", + ".specify/templates/spec-template.md": "<sha256>", + ".specify/templates/tasks-template.md": "<sha256>", +} +``` + +**Validation rules**: +- Both manifests MUST have identical key sets (same 10 files) +- For every key, `before[k] == after[k]` (full byte-for-byte equality) +- If any key's hash differs, that's the defect record's `failure_reason`; the file path is the file:line pointer + +--- + +## Cross-entity invariants + +- **Every sibling spawned ⇒ exactly one E4 (state YAML), ≥0 E5 (run-log entries; ≥1 if any agent invocation succeeded or failed cleanly)**. +- **Every successful `project_initializer` run ⇒ exactly one E2 (constitution) + one E3 (scaffold tree)**. +- **Every CRITICAL defect surfaced in E6 ⇒ either an `[After fix]` subsection in the same E6 section quoting corrected behavior, or a tracking issue link** (FR-014). +- **Every sibling listed in E7 ⇒ exists at `projects/<id>/` AND has E4 at `current_stage: project_initialized` AND has E2 + E3 byte-present**. + +--- + +## Out of scope (deliberately not modeled) + +- **Phase 3 specifier output** (handed to spec 005) +- **`paper_initializer` and the paper-side scaffold** (Phase 8, separate spec) +- **GHA cron-driven invocation of `project_initializer`** (out of scope per spec.md "GHA cron eventually" note) +- **The behavior of `/speckit-plan` and `/speckit-tasks` when run inside the sibling's `.specify/` scaffold** (this is Phase 3's concern, not Phase 2's) diff --git a/specs/004-phase2-project-bootstrap-testing/plan.md b/specs/004-phase2-project-bootstrap-testing/plan.md new file mode 100644 index 00000000..3555c1fc --- /dev/null +++ b/specs/004-phase2-project-bootstrap-testing/plan.md @@ -0,0 +1,118 @@ +# Implementation Plan: Phase 2 (Project Bootstrap) End-to-End Testing & Diagnostics + +**Branch**: `008-phase2-project-bootstrap-testing` | **Date**: 2026-05-05 | **Spec**: [spec.md](./spec.md) +**Input**: Feature specification from `specs/004-phase2-project-bootstrap-testing/spec.md` + +## Summary + +Drive the single Phase 2 agent (`project_initializer`) through the production code path against the Dartmouth Chat backend on **iter2 siblings** of the spec-003 carry-forward projects (PROJ-261 + PROJ-262). The agent renders `.specify/memory/constitution.md` from `agents/templates/research_project_constitution.md` (LLM-driven domain adaptation) and mechanically scaffolds `.specify/{scripts,templates}/` via `init_speckit_in`. For each iter2 sibling, audit the rendered constitution against the explicit output contract in `agents/prompts/project_initializer.md`, verify full idempotency under a second invocation, and induce all three deliberate failure modes (backend-unreachable / idea-missing / template-missing) on dedicated iter siblings to prove failure paths are loud. Record every artifact verbatim in a single diagnostic report; emit a `carry-forward.yaml` manifest naming the substrate for spec 005 (Phase 3 testing). + +Technical approach: reuse the spec-003 sibling spawner (`tests/phase1/sibling_project.py`) with one extension (add `validated` to `ALLOWED_START_STAGES`); apply one in-PR fix to make `project_initializer` idempotent on the constitution write (skip-if-exists, per Q3 clarification); lean on the existing `python -m llmxive run` orchestrator for every agent invocation; verify the existing backend-router retry policy at `src/llmxive/backends/router.py` already satisfies the Q4 retry budget (3 attempts × primary model + 1 attempt × 2 peer models per backend = sufficient transient-error tolerance). Use git history as the canonical iteration trail. The diagnostic itself is a manual procedure driven by the maintainer with the orchestrator CLI; no production code changes are required for the testing infrastructure beyond the two tightly-scoped fixes (spawner allowlist + agent idempotency). + +## Technical Context + +**Language/Version**: Python 3.11 (matches `pyproject.toml`) +**Primary Dependencies**: existing `llmxive` package (orchestrator, agents, backends, speckit), `pyyaml` (already available), spec-003's `tests/phase1/sibling_project.py` (extended) +**Storage**: filesystem — `projects/<id>/.specify/{memory,scripts,templates}/**`, `projects/<id>/idea/<slug>.md`, `state/projects/<id>.yaml`, `state/run-log/<YYYY-MM>/*.jsonl`, all committed to git +**Testing**: pytest for the `project_initializer` idempotency fix unit test (real filesystem temp dir); the diagnostic itself is a manual procedure driven by the maintainer with the orchestrator CLI; spec-003's `tests/phase1/test_citation_resolver.py` continues to run in CI as a regression check on the substrate +**Target Platform**: macOS / Linux (developer workstation), Dartmouth Chat backend reachable; eventually GHA cron per the project's broader vision (out of scope for this spec) +**Project Type**: research-pipeline diagnostic — single-project (no separate frontend/backend split) +**Performance Goals**: per-agent wall-clock budget already encoded in `agents/registry.yaml` (project_initializer 300s); idempotency check must add no more than 60s of overhead per sibling (sha256 over <30 files); each induced-failure run must hard-fail within 60s (faster than wall_clock_budget) so the cumulative cost of all failure inductions stays bounded +**Constraints**: every agent invocation MUST go through `python -m llmxive run --project <sibling-id> --max-tasks 1` (no direct agent-class instantiation, except in the idempotency-check Python harness for US3 acceptance scenario 2 where re-running from `validated` is impossible via the CLI); iterations are sibling projects per spec-003 FR-004 (never state surgery); transient backend errors retry per the existing router policy (3 attempts on primary model + 1 on each peer in `MODEL_FALLBACKS`, then fall through to next backend in `fallback_backends`); FR-005 5-cycle iteration cap inherited from spec 003 +**Scale/Scope**: 2 happy-path iter2 siblings (1 per canonical, per Q1) + up to 5 iter3+ siblings (if defects surface, per FR-005 cap × ≤2 canonicals) + 3 dedicated induced-failure iter siblings (one per Q2 scenario). Worst case ≤10 committed `projects/PROJ-NNN-…-iterN/` directories. Each sibling produces a constitution (~100 lines), a scaffold tree (~12 files, all bytewise copies of repo-root templates), one state YAML (~20 lines), and one run-log JSONL line. Total artifacts: bounded under 200 files; well under 1MB total. + +## Constitution Check + +*GATE: Must pass before Phase 0 research. Re-check after Phase 1 design.* + +The constitution at `.specify/memory/constitution.md` v1.0.0 names five non-negotiable principles. Each is evaluated below. + +### I. Single Source of Truth (NON-NEGOTIABLE) + +- **Compliance**: PASS. The plan creates no duplicate prompts, helpers, or schemas. New artifacts (`notes/2026-05-05-phase2-diagnostic.md`, `specs/004-phase2-project-bootstrap-testing/carry-forward.yaml`, the four `contracts/*.md` files in this spec dir) are unique additions with single canonical locations. The two in-PR fixes (extend `ALLOWED_START_STAGES`; skip-if-exists in `project_initializer.py`) modify single canonical locations rather than forking. The sibling-spawner from spec 003 is extended in place — not duplicated. Constitution-template path (`agents/templates/research_project_constitution.md`) and prompt path (`agents/prompts/project_initializer.md`) remain canonical. + +### II. Verified Accuracy (NON-NEGOTIABLE) + +- **Compliance**: PASS. The diagnostic is itself a verified-accuracy mechanism: every artifact is quoted verbatim, every constitution line is audited against the explicit output contract in `agents/prompts/project_initializer.md` (a 6-item check per US2), and any token-substitution leak (`{{project_id}}`, etc.) is a CRITICAL defect (SC-010). The constitution may not introduce external citations (SC-011) — a stricter requirement than the parent constitution's verified-accuracy mandate, since governance documents shouldn't depend on external sources at all. The audit also explicitly checks that the produced constitution does not contradict any parent-constitution principle, preventing weakening of the meta-system's accuracy guarantees. + +### III. Robustness & Reliability (Real-World Testing) + +- **Compliance**: PASS. The diagnostic explicitly forbids mocks and stubs (FR-002, US1-US4 acceptance scenarios). Every agent invocation runs against the real Dartmouth Chat backend; `init_speckit_in` writes real files to the real filesystem; the sibling spawner produces real committed projects. The induced-failure-mode requirement (FR-012, all three modes per Q2) confirms failure paths produce loud, recorded outcomes rather than silent advancement — including the worst case (backend dies mid-stream and the spec verifies no partial constitution is left behind). The idempotency check (US3) computes real sha256 hashes against real files on real disk, not against a checksummed-by-policy contract. + +### IV. Cost Effectiveness (Free-First) + +- **Compliance**: PASS. Dartmouth Chat is free per `agents/registry.yaml` (`is_paid: false`). The diagnostic introduces no paid dependencies. Worst-case backend usage is bounded: 2 happy-path runs × 300s budget + ≤5 iteration runs × 300s + 3 induced-failure runs × 60s (early hard-fail) = ~36 minutes of backend wall-clock at the absolute upper bound, well within the daily quota estimate of 100 calls/day for one maintainer. + +### V. Fail Fast + +- **Compliance**: PASS. Preflight checks before any agent run: (a) `DARTMOUTH_CHAT_API_KEY` non-empty (verified via `llmxive auth check` or direct credential file read at `~/.config/llmxive/credentials.toml`); (b) `python -m llmxive run --help` succeeds; (c) `git status` clean before starting an iter2 batch; (d) `tests/phase1/sibling_project.py --help` succeeds (proves spawner is on the import path); (e) the `validated` start-stage extension landed in the same commit as FR-003a's prerequisite work. The Backend-unreachable edge case in spec.md mandates immediate halt rather than retry-forever (router walks the fallback chain once each then surfaces the original `TransientBackendError`). The induced-failure-mode test (FR-012 × 3) explicitly exercises fail-fast on all three precondition violations Phase 2 depends on. + +**Verdict**: All five principles satisfied. No Complexity Tracking entries needed. + +## Project Structure + +### Documentation (this feature) + +```text +specs/004-phase2-project-bootstrap-testing/ +├── plan.md # This file +├── spec.md # Feature specification (already created, /speckit-clarify resolved) +├── research.md # Phase 0 output (this file's Phase 0) +├── data-model.md # Phase 1 output +├── quickstart.md # Phase 1 output +├── contracts/ # Phase 1 output +│ ├── diagnostic-report.md # Markdown structural contract for the report +│ ├── carry-forward.md # YAML schema contract for the manifest +│ ├── idempotency-check.md # CLI/IO contract for the sha256 verification harness +│ └── induced-failure-runs.md # Procedural contract for each of the three induced-failure scenarios +├── checklists/ +│ └── requirements.md # Spec-quality checklist (already created) +├── carry-forward.yaml # Output of US6 — produced during /speckit-implement +└── tasks.md # Phase 2 output (/speckit-tasks; not produced by /speckit-plan) +``` + +### Source Code (repository root) + +```text +# Production code (touched by the two tightly-scoped fixes only) +src/llmxive/ +├── __main__.py # existing — orchestrator entry point +├── cli.py # existing — `run` subcommand +├── pipeline/ # existing — graph + state machine +├── agents/ +│ └── project_initializer.py # FIX P2-D01 — skip-if-exists on constitution write (Q3) +├── backends/ # existing — router policy already satisfies Q4 (no edit) +├── speckit/ +│ └── runner.py # existing — `init_speckit_in` already idempotent on dirs (no edit) +└── ... + +agents/ +├── registry.yaml # existing — project_initializer entry (no edit unless prompt iterates) +├── prompts/ +│ └── project_initializer.md # iteration target if constitution audit (US2) surfaces defects +└── templates/ + └── research_project_constitution.md # iteration target if domain adaptation underperforms + +# Diagnostic-only code (extension of spec 003's tests/phase1/) +tests/phase1/ +├── sibling_project.py # FIX P2-D02 — extend ALLOWED_START_STAGES to include 'validated' (FR-003a) +└── test_idempotency.py # NEW — pytest harness for US3 sha256-tree idempotency check + +# Diagnostic outputs (NEW, this spec) +notes/2026-05-05-phase2-diagnostic.md # FR-013 — the report itself + +# Real project artifacts (produced by agents during /speckit-implement) +projects/PROJ-261-evaluating-the-impact-of-code-duplicatio-iter2/ +projects/PROJ-262-predicting-molecular-dipole-moments-with-iter2/ +projects/PROJ-261-…-iterN/ # zero or more iter3+ if defects surface (≤5 per FR-005) +projects/PROJ-262-…-iterN/ # zero or more iter3+ if defects surface (≤5 per FR-005) +projects/PROJ-261-…-iterFAIL-{backend,idea,template}/ # induced-failure siblings (one per Q2 scenario) +state/projects/PROJ-…-iterN.yaml +state/run-log/2026-05/<run-id>.jsonl +``` + +**Structure Decision**: Single-project layout (Option 1). The diagnostic introduces only one new pytest module (`tests/phase1/test_idempotency.py`) and one new markdown report. Two production-code edits land as in-PR fixes (one ALLOWED_START_STAGES extension, one constitution-write skip-if-exists guard). All other behavior flows through existing pipeline code paths. Real-project artifacts under `projects/` and `state/` are produced by the agents themselves via the orchestrator CLI — no new directory contracts are introduced. + +## Complexity Tracking + +> No Constitution-Check violations to justify. Table omitted. diff --git a/specs/004-phase2-project-bootstrap-testing/quickstart.md b/specs/004-phase2-project-bootstrap-testing/quickstart.md new file mode 100644 index 00000000..cf220735 --- /dev/null +++ b/specs/004-phase2-project-bootstrap-testing/quickstart.md @@ -0,0 +1,296 @@ +# Quickstart: Phase 2 Diagnostic Runbook + +**Spec**: [spec.md](./spec.md) +**Plan**: [plan.md](./plan.md) +**Date**: 2026-05-05 + +This is a hands-on runbook for the maintainer driving the Phase 2 diagnostic. It assumes you have spec 003's tools (`tests/phase1/sibling_project.py`, `tests/phase1/citation_resolver.py`) on the path and the Dartmouth Chat backend reachable. + +## Step 0 — Preflight + +```bash +# Confirm the carry-forward substrate exists. +cat specs/003-phase1-idea-lifecycle-testing/carry-forward.yaml | head -20 +ls projects/PROJ-261-evaluating-the-impact-of-code-duplicatio/ +ls projects/PROJ-262-predicting-molecular-dipole-moments-with/ + +# Confirm the orchestrator entry point works. +python -m llmxive run --help + +# Confirm the Dartmouth credential is loaded. +python -c "from llmxive.credentials import load_dartmouth_key; print('ok' if load_dartmouth_key(prompt_if_missing=False) else 'missing')" + +# Confirm git working tree is clean before starting. +git status --short +``` + +If any of these fails, stop and resolve before proceeding. + +## Step 1 — Land the two prerequisite fixes + +These MUST be in-place before any sibling spawn or agent run, because the diagnostic depends on both. + +### 1a. Extend `ALLOWED_START_STAGES` to include `validated` + +```bash +# Open tests/phase1/sibling_project.py:36 and change: +# ALLOWED_START_STAGES = {"brainstormed", "flesh_out_in_progress", "flesh_out_complete"} +# to: +# ALLOWED_START_STAGES = {"brainstormed", "flesh_out_in_progress", "flesh_out_complete", "validated"} + +# Verify by trying the spawner with --start-stage validated --help. +python tests/phase1/sibling_project.py --help +``` + +Commit: + +```bash +git add tests/phase1/sibling_project.py +git commit -m "phase2/spec-004: add 'validated' to sibling spawner allowlist (FR-003a, #46 #62)" +``` + +### 1b. Add skip-if-exists guard to `project_initializer` + +```python +# In src/llmxive/agents/project_initializer.py, modify handle_response: + +def handle_response(self, ctx: AgentContext, response: ChatResponse) -> list[str]: + repo = Path(__file__).resolve().parent.parent.parent.parent + project_dir = repo / "projects" / ctx.project_id + constitution_path = project_dir / ".specify" / "memory" / "constitution.md" + + # NEW: skip-if-exists guard for idempotency (Q3 / FR-011). + if constitution_path.is_file(): + init_speckit_in(project_dir) # still idempotent on dirs, safe to re-call + return [str(constitution_path.relative_to(repo))] + + # ... rest of existing handle_response unchanged ... +``` + +Commit: + +```bash +git add src/llmxive/agents/project_initializer.py +git commit -m "phase2/spec-004: skip-if-exists guard on constitution write (Q3, FR-011, #46 #62) + +Constitution is a governance document; re-rendering with possibly-different +LLM output silently mutates downstream Constitution Checks. Match the +init_speckit_in skip-if-dir-exists pattern at src/llmxive/speckit/runner.py:114. +" +``` + +(Optional, file as a separate defect P2-D03: also patch line 60 to `raise FileNotFoundError` instead of silently using empty `idea_summary` — see research.md Decision 5. Recommend doing this AFTER inducing the missing-idea-file failure once with the unpatched code, so the diagnostic captures the pre-fix behavior verbatim.) + +### 1c. Run the idempotency test (regression check) + +```bash +# Implement tests/phase1/test_idempotency.py per contracts/idempotency-check.md. +pytest tests/phase1/test_idempotency.py -v +``` + +All four tests must pass before continuing. + +## Step 2 — Spawn the two iter2 happy-path siblings + +```bash +# PROJ-261-iter2. +python tests/phase1/sibling_project.py \ + PROJ-261-evaluating-the-impact-of-code-duplicatio \ + --iter 2 \ + --start-stage validated + +# PROJ-262-iter2. +python tests/phase1/sibling_project.py \ + PROJ-262-predicting-molecular-dipole-moments-with \ + --iter 2 \ + --start-stage validated + +# Verify both siblings are in place. +ls projects/PROJ-261-evaluating-the-impact-of-code-duplicatio-iter2/ +ls projects/PROJ-262-predicting-molecular-dipole-moments-with-iter2/ +cat state/projects/PROJ-261-evaluating-the-impact-of-code-duplicatio-iter2.yaml +cat state/projects/PROJ-262-predicting-molecular-dipole-moments-with-iter2.yaml +``` + +Each sibling MUST have: +- `idea/<slug>.md` byte-identical to the canonical (the spawner sha256-verifies) +- `state/projects/<sibling-id>.yaml` at `current_stage: validated` +- No `.specify/` directory yet + +Commit the two new sibling directories + state YAMLs. + +## Step 3 — Run `project_initializer` on each iter2 sibling (US1 happy path) + +```bash +# PROJ-261-iter2. +python -m llmxive run \ + --project PROJ-261-evaluating-the-impact-of-code-duplicatio-iter2 \ + --max-tasks 1 +echo "exit code: $?" + +# PROJ-262-iter2. +python -m llmxive run \ + --project PROJ-262-predicting-molecular-dipole-moments-with-iter2 \ + --max-tasks 1 +echo "exit code: $?" + +# Inspect outputs. +cat projects/PROJ-261-…-iter2/.specify/memory/constitution.md +cat projects/PROJ-262-…-iter2/.specify/memory/constitution.md +ls -la projects/PROJ-261-…-iter2/.specify/{scripts/bash,templates}/ +ls -la projects/PROJ-262-…-iter2/.specify/{scripts/bash,templates}/ +cat state/projects/PROJ-261-…-iter2.yaml # must show project_initialized +cat state/projects/PROJ-262-…-iter2.yaml # must show project_initialized +``` + +For each sibling, the rendered constitution MUST satisfy the six US2 contract items (see [contracts/diagnostic-report.md § 3.X.1](./contracts/diagnostic-report.md)). Fill in the constitution audit table for each as you read. + +## Step 4 — US3 idempotency audit on PROJ-261-iter2 + +```bash +# Compute the pre-rerun sha256 manifest of .specify/. +find projects/PROJ-261-…-iter2/.specify -type f -exec sha256sum {} \; | sort > /tmp/sha-before.txt + +# Run init_speckit_in directly via python (bypasses the orchestrator's +# stage-routing which would otherwise advance to specifier). +python -c " +from pathlib import Path +from llmxive.speckit.runner import init_speckit_in +init_speckit_in(Path('projects/PROJ-261-evaluating-the-impact-of-code-duplicatio-iter2')) +print('done') +" + +# Compute the post-rerun manifest. +find projects/PROJ-261-…-iter2/.specify -type f -exec sha256sum {} \; | sort > /tmp/sha-after.txt + +# Diff. +diff /tmp/sha-before.txt /tmp/sha-after.txt +# (must be empty) +``` + +For US3 acceptance scenario 2 (constitution skip-if-exists), `pytest tests/phase1/test_idempotency.py::test_project_initializer_skips_existing_constitution -v` IS the canonical evidence — quote the pytest output verbatim into §3.X.5 of the diagnostic report. + +## Step 5 — Run all three induced-failure scenarios (US4) + +Follow `contracts/induced-failure-runs.md` step-by-step. Each scenario: + +1. Spawns a fresh sibling at `--start-stage validated` +2. Mutates one precondition +3. Runs the orchestrator +4. Captures stderr + run-log + state YAML + filesystem state +5. Restores the precondition + +After all three scenarios complete, verify cleanup: + +```bash +# Backend env restored. +echo "${LLMXIVE_BACKEND_BASE_URL:-(unset)}" + +# Template back in place. +ls -la agents/templates/research_project_constitution.md + +# All three failure-iter siblings committed. +git status projects/PROJ-26*-iterFAIL-*/ +``` + +Commit the failure-iter siblings + run-log entries. + +## Step 6 — Author the diagnostic report + +Open `notes/2026-05-05-phase2-diagnostic.md` and follow `contracts/diagnostic-report.md` section by section. Quote artifacts verbatim from the file paths captured in steps 3-5. Use ≤100 lines per quote with `[truncated lines N-M, sha256: <hash>]` markers. + +While authoring, file each defect into §4 with the next available `P2-D##` ID. CRITICAL defects MUST be either fixed in-PR (with an "After fix" subsection in §3 quoting the post-fix output) or deferred to a tracked issue with rationale (per FR-014). + +## Step 7 — Iteration loop (only if defects surface) + +If §3.X.1 audit fails for any sibling, follow this loop (capped at 5 iterations per FR-005): + +1. Identify the failing contract item and root cause (prompt? template? agent code?) +2. Patch with a `prompt_version` bump per the spec-003 semver policy (MAJOR for output-contract-breaking, MINOR for behavior, PATCH for prose) — same commit +3. Spawn a new sibling iter (`--iter 3`, `--iter 4`, …) — never reset the prior sibling's state +4. Run `project_initializer` on the new sibling +5. Re-audit; if still failing, return to step 1 +6. If 5th iteration still fails: file a follow-up issue, mark the defect `Deferred to issue #<N>` in the report's §4, move on + +For each iteration loop, capture a §5 subsection in the report with the verbatim `git diff` between iters. + +## Step 8 — Author the carry-forward manifest + +Open `specs/004-phase2-project-bootstrap-testing/carry-forward.yaml` and write the schema per `contracts/carry-forward.md`. Pick 1-2 iter2 siblings that pass the US2 audit cleanly. Commit. + +## Step 9 — Close issues + update tracker + +```bash +# Tick the Phase 2 box in tracking issue #107. +# Add a closing comment to issue #62 referencing the report and final commit. +# Add a closing comment to issue #46 referencing the diagnostic report's §6 verdict. + +gh issue edit 107 --body "$(gh issue view 107 --json body -q .body | sed 's/- \[ \] #46/- [x] #46/')" +gh issue close 62 --comment "Resolved via spec 004 (PR #<N>). See diagnostic report at notes/2026-05-05-phase2-diagnostic.md and carry-forward manifest at specs/004-phase2-project-bootstrap-testing/carry-forward.yaml." +gh issue close 46 --comment "Phase 2 verified end-to-end via spec 004 (PR #<N>). All three issue #62 acceptance criteria pass; carry-forward manifest names <K> sibling(s) for spec 005." +``` + +## Step 10 — PR + merge + +```bash +# Run all spec-003 + spec-004 tests + linters. +pytest tests/phase1/ -v + +# Push, open PR. +git push origin 008-phase2-project-bootstrap-testing +gh pr create --base main --head 008-phase2-project-bootstrap-testing \ + --title "Spec 004: Phase 2 (Project Bootstrap) end-to-end testing" \ + --body "$(cat <<'EOF' +## Summary + +Validates Phase 2 of the llmXive pipeline end-to-end on iter2 siblings of +spec 003's carry-forward projects (PROJ-261, PROJ-262), per issue #46 and +sub-issue #62. Lands two prerequisite fixes: + +- Extend sibling spawner's `ALLOWED_START_STAGES` to include `validated` +- Skip-if-exists guard on `project_initializer`'s constitution write + (idempotency fix, per Q3 clarification) + +## Diagnostic + +Full report at `notes/2026-05-05-phase2-diagnostic.md`. Carry-forward +manifest at `specs/004-phase2-project-bootstrap-testing/carry-forward.yaml` +names <K> sibling(s) as input substrate for spec 005 (Phase 3 testing). + +## Test plan + +- [x] All four `tests/phase1/test_idempotency.py` tests pass +- [x] All eleven `tests/phase1/test_citation_resolver.py` tests pass (regression) +- [x] Manual verification: each iter2 sibling's constitution passes US2 audit +- [x] Manual verification: all three induced-failure scenarios produce loud + recorded failures with state unchanged + +🤖 Generated with [Claude Code](https://claude.com/claude-code) + +EOF +)" +``` + +## Estimated wall-clock + +| Step | Duration | +|-|-| +| 0–1 (preflight + fixes + idempotency tests) | 30 min | +| 2 (spawn iter2 siblings) | 2 min | +| 3 (project_initializer happy-path runs) | 10 min (2 × ≤300s wall_clock_budget) | +| 4 (idempotency audit) | 10 min | +| 5 (induced-failure scenarios) | 30 min (3 × manual setup + ≤2 min each + cleanup) | +| 6 (author diagnostic report) | 90-120 min | +| 7 (iteration loop, if needed) | variable; budget 60 min × ≤5 iters = 5h max | +| 8 (carry-forward manifest) | 10 min | +| 9–10 (issues + PR) | 15 min | + +**Total**: ~3.5h on the happy path, up to ~9h with full iteration cap. + +## Common failure modes & how to resolve + +- **Spawner refuses with "malformed canonical_project_id"** → check the regex; canonical IDs end in `[a-z0-9-]+` with no `-iterN` suffix. +- **Orchestrator fails with "no agent assigned for stage 'validated'"** → confirm `STAGE_TO_AGENT` in `src/llmxive/pipeline/graph.py:70` includes the `Stage.VALIDATED: "project_initializer"` line (it should, since spec 003 added it). +- **Constitution has literal `{{title}}` token** → the `render_prompt` substitution may have failed; check `agents/templates/research_project_constitution.md` for non-substituted token spellings vs. what `project_initializer.py:46-53` substitutes. +- **`init_speckit_in` raises `FileExistsError`** → not expected (the function is dir-skip-if-exists); if you see this, file a defect against `src/llmxive/speckit/runner.py`. +- **Idempotency check shows the constitution divergent** → confirm the skip-if-exists patch from step 1b is actually in place; `git diff src/llmxive/agents/project_initializer.py` should show the new guard. +- **Backend hard-fails before retry exhausts** → expected behavior in induced-failure scenario 1; verify the run-log entry shows ≥1 retry attempt before the final failure. diff --git a/specs/004-phase2-project-bootstrap-testing/research.md b/specs/004-phase2-project-bootstrap-testing/research.md new file mode 100644 index 00000000..a1d32a5c --- /dev/null +++ b/specs/004-phase2-project-bootstrap-testing/research.md @@ -0,0 +1,132 @@ +# Phase 0 Research: Phase 2 (Project Bootstrap) End-to-End Testing & Diagnostics + +**Spec**: [spec.md](./spec.md) +**Plan**: [plan.md](./plan.md) +**Date**: 2026-05-05 + +## Purpose + +The Technical Context in `plan.md` has zero `NEEDS CLARIFICATION` markers — every unknown was resolved during `/speckit-clarify` (Q1-Q4). Phase 0 research therefore **(a)** consolidates the mechanism choices that the clarifications committed to into concrete code-level decisions, **(b)** does the small amount of repo-introspection needed to verify the existing pipeline code already supports those choices (or names the precise file:line where it doesn't), and **(c)** documents three known-quirks-of-the-substrate that will affect the diagnostic without requiring changes. + +## Decision 1 — Sibling start-stage extension + +**Decision**: Extend `tests/phase1/sibling_project.py`'s `ALLOWED_START_STAGES` set to include `validated`. This is the single line at `tests/phase1/sibling_project.py:36` (currently `{"brainstormed", "flesh_out_in_progress", "flesh_out_complete"}`). No other change needed in the spawner — the rest of the spawner is stage-agnostic (it copies the canonical `idea/<slug>.md`, writes a fresh state YAML at the chosen `start_stage`, and never touches the canonical's state). + +**Rationale**: spec 003 introduced the `validated` stage (D10 architecture decision) AFTER the sibling spawner was written, so the spawner's allowlist is simply out-of-date. Phase 2 testing requires staging siblings at `validated` because the orchestrator's `STAGE_TO_AGENT[VALIDATED] = "project_initializer"` mapping is the only way to route the sibling to the agent under test without manually invoking the agent class. + +**Alternatives considered**: + +- **Drop the allowlist entirely** — rejected because it would let the spawner produce siblings at `project_initialized`, `specified`, etc., which are downstream of the agent under test and would silently skip Phase 2. +- **Add a CLI flag like `--bypass-allowlist`** — rejected because it's a flexibility that this spec doesn't need; the simplest fix is a one-line set extension. +- **Refactor the allowlist to be derived from `agents/registry.yaml`** — rejected as out-of-scope; correct long-term direction but not needed for spec 004 and would touch many more lines than the one-line fix. + +**Verification**: Read [tests/phase1/sibling_project.py:35-36](tests/phase1/sibling_project.py#L35-L36) directly. Confirmed the allowlist is at line 36. Confirmed the only consumer is line 65-67 (validation in `spawn_sibling`); no other code references it. + +## Decision 2 — Constitution-write skip-if-exists fix + +**Decision**: Patch `src/llmxive/agents/project_initializer.py` so the `handle_response` method (lines 84-104) checks for an existing `.specify/memory/constitution.md` BEFORE writing. If the file exists, the method returns early with a no-op (still re-running `init_speckit_in` since that operation is already idempotent on directories). The patch must preserve the defensive fallback that catches malformed LLM output and substitutes the template — that fallback only applies on first-write, not on skip. + +**Rationale**: Per Q3 clarification, re-rendering a governance document with a possibly-different LLM output silently mutates downstream Constitution Checks (because `/speckit-plan` and `/speckit-tasks` inside the project read this file at every invocation). True idempotency requires the constitution to be written once and only once per project. The pattern matches the existing skip-if-dir-exists guard at [src/llmxive/speckit/runner.py:114](src/llmxive/speckit/runner.py#L114) (`if dst.is_dir(): continue`), so the fix is consistent with how the same module already handles idempotency. + +**Alternatives considered**: + +- **Hash-and-skip** (re-render to a temp, compare sha256, skip if identical, error if differs) — rejected as too strict for the LLM's natural variance; would force every re-run to be a hard failure even when the new constitution is acceptably similar to the old. +- **Always re-render with `temperature=0` and assert equality** — rejected because `temperature=0` doesn't guarantee determinism on the Dartmouth Chat backend (the underlying vLLM cluster has `seed`-handling quirks that produce non-deterministic outputs even at temperature=0); this would make the spec brittle to backend variance. +- **Document overwrite as accepted behavior, mark as known issue** — rejected per Q3: the user explicitly chose option B (skip-if-exists) so this option is off the table. + +**Verification**: Read [src/llmxive/agents/project_initializer.py:84-104](src/llmxive/agents/project_initializer.py#L84-L104). Confirmed the agent unconditionally writes `constitution_path.write_text(constitution_text + "\n")` at line 102. Confirmed the only callers are `runner.run_one_task` via the `STAGE_TO_AGENT` dispatch table (per [src/llmxive/pipeline/graph.py:70](src/llmxive/pipeline/graph.py#L70)), so the patch surface is contained. + +**Scope of patch (concrete diff sketch)**: + +```python +# Before any of the LLM-rendering or init_speckit_in work, guard: +constitution_path = project_dir / ".specify" / "memory" / "constitution.md" +if constitution_path.is_file(): + init_speckit_in(project_dir) # still idempotent; safe to re-call + return [str(constitution_path.relative_to(repo))] + +# ...rest of existing handle_response... +``` + +## Decision 3 — Transient-backend retry policy is satisfied by existing router + +**Decision**: No code change is required for FR-002's retry budget. The existing backend router at `src/llmxive/backends/router.py:96-100` already implements 3 attempts on the primary model + 1 attempt on each model in `MODEL_FALLBACKS[primary_model]`, then falls through to the next backend in `fallback_backends`. For `project_initializer` (default model `qwen.qwen3.5-122b`, fallbacks `[huggingface, local]`), the worst-case retry tree is: + +- Dartmouth + qwen3.5-122b: 3 attempts +- Dartmouth + gpt-oss-120b (peer per `MODEL_FALLBACKS`): 1 attempt +- Dartmouth + gemma-3-27b-it (peer): 1 attempt +- HuggingFace + qwen3.5-122b: 3 attempts +- HuggingFace + (any peers): 1 each +- Local + qwen3.5-122b: 3 attempts +- ... etc. + +This is **strictly more retry-tolerant** than Q4's "2 retries / 3 total attempts" minimum, so FR-002 is satisfied "by inheritance" from the production router. The spec's responsibility is to **verify** this empirically (induce a transient failure on the primary model — e.g., temporarily blackhole `api.dartmouth.edu` — and confirm the run-log entry shows the retry attempts before the eventual `TransientBackendError`). + +**Rationale**: Per Constitution Principle I (Single Source of Truth), the spec must NOT fork the retry policy into its own implementation. The router is the canonical retry mechanism for the whole project; spec 004 inherits it. + +**Alternatives considered**: + +- **Add a Phase 2-specific retry wrapper** — rejected as a Constitution Principle I violation. +- **Tighten the router's existing 3-attempt policy to Q4's exact 2-retry policy** — rejected because the existing policy is more permissive (good for production reliability) and Q4 specified 2 retries as a *minimum*, not a maximum. + +**Verification**: Read [src/llmxive/backends/router.py:96-100](src/llmxive/backends/router.py#L96-L100). Confirmed `attempts = 3 if model_idx == 0 else 1`. Read [src/llmxive/backends/router.py:44-50](src/llmxive/backends/router.py#L44-L50) confirmed `MODEL_FALLBACKS["qwen.qwen3.5-122b"] = ["openai.gpt-oss-120b", "google.gemma-3-27b-it"]`. Read [src/llmxive/backends/dartmouth.py:163-180](src/llmxive/backends/dartmouth.py#L163-L180) confirmed transient classification covers rate-limit / 5xx / connection / DNS errors. + +## Decision 4 — Idempotency-check harness location & invocation pattern + +**Decision**: Place the idempotency-check pytest harness at `tests/phase1/test_idempotency.py`. It uses pytest's `tmp_path` fixture to clone an existing iter2 sibling's `.specify/` tree into a temp dir, then runs `init_speckit_in` directly twice in sequence and asserts sha256-equality of every file. For US3 acceptance scenario 2 (constitution skip-if-exists), the harness instantiates `ProjectInitializerAgent` directly (bypassing the orchestrator) and asserts the constitution file's sha256 is unchanged after a second `handle_response` call with a different LLM response. + +**Rationale**: Live-running the orchestrator on a sibling at `project_initialized` would route to `specifier` (Phase 3), not re-run Phase 2. Direct agent invocation in a Python harness is the only way to test re-entry. Pytest is already in the project's dev dependencies (per spec 003's test_citation_resolver.py). + +**Alternatives considered**: + +- **Bash script + `sha256sum`** — rejected as less integrated with CI than pytest; adds shell-script-vs-python skill split. +- **Add `--force-stage <stage>` to the orchestrator** — rejected as a feature creep that violates the simplicity principle for spec 004; only useful for one test scenario. + +**Verification**: Confirmed pytest is set up via `pyproject.toml` (project uses pytest for spec 003's tests). Confirmed `init_speckit_in` is importable from `llmxive.speckit.runner`. Confirmed `ProjectInitializerAgent` accepts a registry entry constructor argument and exposes `build_messages` + `handle_response` as the canonical lifecycle methods. + +## Decision 5 — Induced-failure scenario implementation + +**Decision**: Each of the three induced-failure scenarios from Q2 is implemented as a maintainer-driven runbook step (not as automated test code), captured in `quickstart.md` and `contracts/induced-failure-runs.md`: + +1. **Backend unreachable** (`-iterFAIL-backend` sibling): `LLMXIVE_BACKEND_BASE_URL` is temporarily exported to `https://invalid.example.com` for the duration of one orchestrator invocation. Expected outcome: router walks the entire backend chain, every backend's instantiation either fails immediately (Dartmouth: `PermanentBackendError` from missing endpoint) or hits transient errors and retries; eventually surfaces `TransientBackendError` to the orchestrator, which writes an `outcome: failure` run-log entry and leaves `current_stage: validated` unchanged. Diagnostic confirms no `.specify/memory/constitution.md` is created. +2. **Idea file missing** (`-iterFAIL-idea` sibling): Maintainer manually deletes `projects/<sibling-id>/idea/<slug>.md` after spawning the sibling (via `tests/phase1/sibling_project.py`) but before invoking the orchestrator. Expected outcome: `ProjectInitializerAgent.build_messages` reads `ctx.inputs[0]` (the idea path) at line 58; if the file doesn't exist, the read returns empty string (defensive fallback at line 60: `if idea_path.exists():`). Currently this means the agent silently builds a prompt with `idea_summary=""`. **This is a Constitution Principle V (Fail Fast) violation we will surface as a HIGH defect**: the agent should raise `FileNotFoundError` if the idea seed is missing, not produce a constitution untethered from any idea. Fix lands as part of FR-014 / FR-018. +3. **Template file missing** (`-iterFAIL-template` sibling): Maintainer renames `agents/templates/research_project_constitution.md` to `…research_project_constitution.md.bak` for the duration of one orchestrator invocation. Expected outcome: `render_prompt(CONSTITUTION_TEMPLATE_PATH, …)` at line 44 of `project_initializer.py` raises `FileNotFoundError` immediately; the orchestrator records `outcome: failure` with the exception's repr in `failure_reason`; state remains `validated`. + +**Rationale**: Each scenario validates one distinct precondition (network reachability, idea-file presence, template-file presence). Per Q2, all three are required, and each runs on its own dedicated sibling so the failures don't contaminate each other. Scenario 2's expected fix-on-discovery is itself a finding — Phase 2 testing surfacing a real Phase 2 defect is exactly the kind of value spec 003 demonstrated. + +**Alternatives considered**: + +- **Mock the backend / filesystem to induce failures** — rejected per Constitution Principle III (real-world testing). +- **Use `pytest.raises` to assert the exception path** — rejected because the failure path goes through the orchestrator's run-log writer, which is what we're auditing; we need to read the actual run-log JSONL after the fact, not just assert that a Python exception was raised. + +**Verification**: Read [src/llmxive/agents/project_initializer.py:55-61](src/llmxive/agents/project_initializer.py#L55-L61) — confirmed the `if idea_path.exists()` defensive check that masks the missing-idea-file scenario. This is exactly the kind of silent fallback Constitution Principle V prohibits. + +## Decision 6 — Carry-forward forward-compatibility + +**Decision**: The carry-forward manifest at `specs/004-phase2-project-bootstrap-testing/carry-forward.yaml` uses the same schema as spec 003's `specs/003-phase1-idea-lifecycle-testing/carry-forward.yaml`, with an additional `agents_run` entry recording `project_initializer` and an additional metadata field `phase2_iter2_id` capturing which iter2 sibling produced the carried-forward `.specify/memory/constitution.md`. Spec 005 (Phase 3) will read this manifest to know which iter2 sibling to pick up. + +**Rationale**: Schema continuity across spec-NNN/carry-forward.yaml files makes future-phase specs (005-007 etc.) trivial to author — they just `cat` the previous spec's manifest and pick a project ID. Adding a new field rather than replacing an existing one preserves backward compatibility with spec 003's parser at `tests/phase1/validate_carry_forward.py`. + +**Alternatives considered**: + +- **New schema for spec 004's manifest** — rejected as a Constitution Principle I violation (would force two parsers). +- **Embed the iter2 ID inside the existing `agents_run` entry** — rejected because spec 003's schema treats `agents_run` as an unstructured list of name+iteration counts; adding a sibling-iter pointer there would couple two different concerns. + +**Verification**: Read `specs/003-phase1-idea-lifecycle-testing/carry-forward.yaml` and `tests/phase1/validate_carry_forward.py`. Confirmed the schema is `{spec, generated_at, final_commit, projects: [{project_id, final_state, final_commit, agents_run: [{name, iterations, final_iter_id}], justification}]}`. Adding a new top-level field to each project entry is non-breaking. + +## Substrate quirks (no fix, just documented) + +- **PROJ-261 and PROJ-262 are already at `project_initialized` on `main`**: spec 003 ran `project_initializer` on them as part of its end-state. Phase 2 testing therefore operates on iter2 siblings, not on the canonicals. Confirmed by inspecting `find projects/PROJ-26{1,2}-…/ -maxdepth 4 -type f` which shows `.specify/memory/constitution.md`, `.specify/templates/{constitution,plan,spec,tasks,checklist}-template.md`, `.specify/scripts/bash/{common,setup-plan,check-prerequisites,create-new-feature}.sh` already present. +- **Cron-driven commits land on `main` continuously**: e.g., recent commits `df3537d pipeline(brainstorm): hourly tick`, `19ce86a pipeline(flesh-out): 2h tick`. These don't affect spec 004's correctness (the cron jobs operate on different projects in a separate `cron/` workflow) but the maintainer should `git pull` before starting each diagnostic session to avoid merge conflicts on `state/run-log/`. +- **`templates/{spec,plan,tasks,checklist}-template.md` exist at repo root**: `init_speckit_in` copies these (4 files) plus `templates/constitution-template.md` (1 file) into the project's `.specify/templates/` (5 files total). Audit US1 acceptance scenario 3 must list all 5 names, not just 4 like the spec draft mistakenly suggested. (The spec text already names all 5 correctly: `templates/{constitution,plan,spec,tasks,checklist}-template.md`.) + +## Summary of code changes required by this plan + +| File | Change | Severity | Source | +|-|-|-|-| +| [tests/phase1/sibling_project.py](tests/phase1/sibling_project.py) | Add `validated` to `ALLOWED_START_STAGES` (line 36) | Prerequisite | FR-003a | +| [src/llmxive/agents/project_initializer.py](src/llmxive/agents/project_initializer.py) | Skip-if-exists guard before constitution write (line 84-104) | HIGH defect fix | FR-011, Q3 | +| [src/llmxive/agents/project_initializer.py](src/llmxive/agents/project_initializer.py) | Replace `if idea_path.exists():` with `raise FileNotFoundError` if missing (line 60) | HIGH defect fix | Decision 5 / FR-012 finding | +| `tests/phase1/test_idempotency.py` | New pytest harness for US3 sha256-tree check | New code | Decision 4 | + +No edits required to `src/llmxive/backends/router.py` (Q4 satisfied), `src/llmxive/speckit/runner.py` (already idempotent on dirs), or `agents/registry.yaml` (Phase 2 entry already correct). diff --git a/specs/004-phase2-project-bootstrap-testing/spec.md b/specs/004-phase2-project-bootstrap-testing/spec.md new file mode 100644 index 00000000..15d1dd34 --- /dev/null +++ b/specs/004-phase2-project-bootstrap-testing/spec.md @@ -0,0 +1,209 @@ +# Feature Specification: Phase 2 (Project Bootstrap) End-to-End Testing & Diagnostics + +**Feature Branch**: `008-phase2-project-bootstrap-testing` *(spec dir is `specs/004-phase2-project-bootstrap-testing/` — branch number diverges from spec number per `/speckit-specify` allowance because the git-feature hook counts branches across the repo, not spec dirs)* +**Created**: 2026-05-05 +**Status**: In Review +**Input**: User description: "next let's work on phase 2: issue 46 + all sub-issues + any related agents. for context, also see issue 107. critical details and considerations: our goal here is to validate *each step* of the llmXive pipeline; we need to examine the *inputs* and *outputs* produced by any agents related to this phase; use *REAL* projects as inputs. currently we're using projects 261 and 262 as ideal for carrying forward into this next phase." + +## Context (carried from spec 003) + +This spec is a direct continuation of spec 003 (Phase 1 Idea Lifecycle Testing, closed via PR #108). It uses the carry-forward manifest at `specs/003-phase1-idea-lifecycle-testing/carry-forward.yaml` as its canonical input substrate. Both named projects already exist on `main` at `current_stage: project_initialized` because spec 003's diagnostic ran the full Phase 1 pipeline (including `project_initializer`) end-to-end on them — the Phase 1 spec's success criterion was "carry-forward at `project_initialized`". + +The **current pipeline graph** wires Phase 2 as a single-agent transition: `validated → project_initializer → project_initialized`. The validator (added by spec 003 / D10) sits in Phase 1, so by the time a project enters Phase 2 it has already passed all four research-question-quality checks. Phase 2's only job is to produce the per-project Spec Kit scaffold and the LLM-rendered constitution. + +**Implication for this spec**: PROJ-261 and PROJ-262 are already past Phase 2's exit stage on `main` (their `.specify/` scaffolds already exist, with constitutions, templates, and scripts written). This spec therefore tests Phase 2 by spawning **`-iter2` siblings** of each carry-forward project (starting at `current_stage: validated`) using the same sibling-iteration pattern spec 003 introduced (FR-004 of spec 003). State surgery on the canonical PROJ-261/PROJ-262 is never used; each iteration is a fresh, independently replayable run on a new project ID. + +## Clarifications + +### Session 2026-05-05 + +- Q: Target iter2 sibling count per canonical project (PROJ-261, PROJ-262) → A: One iter2 sibling per canonical (2 runs total). Independent evidence across both research domains (CS + chemistry) without redundant audits; further iterations only spawn if a defect surfaces. +- Q: Induced-failure scenario choice for US4 → A: All three (backend-unreachable + idea file missing + template file missing). Full failure-path audit — each scenario exercises a distinct precondition that Phase 2 depends on, so a per-scenario verdict gives the most defensible Constitution-Principle-V coverage. +- Q: Idempotency-overwrite policy for LLM constitution re-render on second `project_initializer` invocation → A: Skip-if-exists. The agent MUST detect a pre-existing `.specify/memory/constitution.md` and skip re-rendering (matching the `init_speckit_in` skip-if-dir-exists pattern at `src/llmxive/speckit/runner.py:114`). Re-rendering a governance document silently mutates downstream Constitution Checks — fix `src/llmxive/agents/project_initializer.py:84-102` as part of this PR. +- Q: Transient-backend-error retry budget per agent run → A: 2 retries with exponential backoff (3 total attempts), then `TransientBackendError` → `human_input_needed`. Counted as one cycle against the FR-005 5-cycle iteration cap. Verify the backend client at `src/llmxive/backends/dartmouth.py` implements this; if it doesn't, fix as part of this PR's prerequisite work. + +## User Scenarios & Testing *(mandatory)* + +### User Story 1 - project_initializer runs cleanly on each carry-forward sibling, audited end-to-end (Priority: P1) + +A pipeline maintainer reads `specs/003-phase1-idea-lifecycle-testing/carry-forward.yaml`, picks each named project (PROJ-261 + PROJ-262), and uses `tests/phase1/sibling_project.py` (the spec-003 sibling spawner) to spawn an `-iter2` sibling per project at `--start-stage validated`. The sibling's `idea/<slug>.md` is byte-for-byte cloned from the canonical (sha256-verified by the spawner), and a fresh `state/projects/<sibling-id>.yaml` is written at `current_stage: validated`. The maintainer then invokes `python -m llmxive run --project <sibling-id> --max-tasks 1` against the real Dartmouth Chat backend. The orchestrator picks `project_initializer` as the next agent (per `STAGE_TO_AGENT[VALIDATED]`). The maintainer captures every input the agent saw (system prompt after token substitution, rendered constitution template, idea body) and every output it produced (`.specify/memory/constitution.md`, plus any artifacts under `.specify/{scripts,templates}/` written by the mechanical `init_speckit_in` step), state YAML before/after, and the run-log JSONL entry. They evaluate each artifact against issue #62's acceptance criteria. + +**Why this priority**: This is the entirety of Phase 2. Without this story, nothing in Phase 2 is actually tested. Every other story in this spec depends on at least one successful clean run from this story. + +**Independent Test**: Can be fully tested per project by spawning the sibling, running `--max-tasks 1`, opening `projects/<sibling-id>/.specify/memory/constitution.md`, and verifying it (a) starts with `# <title> — Research Project Constitution`, (b) ends with the `**Project ID**: …` footer, (c) names the actual project field (not the literal `{{field}}` token), and (d) adapts at most two domain-specific principles per the prompt's constraint. State must end at `project_initialized`. The run-log must record `outcome: success` with `started_at`/`ended_at`. + +**Acceptance Scenarios**: + +1. **Given** PROJ-261-iter2 staged at `current_stage: validated` with a verified-clone `idea/evaluating-the-impact-of-code-duplicatio.md`, **When** `python -m llmxive run --project PROJ-261-evaluating-the-impact-of-code-duplicatio-iter2 --max-tasks 1` is invoked, **Then** `projects/PROJ-261-…-iter2/.specify/{memory,scripts,templates}/` is created, `memory/constitution.md` is written with project-specific tokens fully substituted (no literal `{{…}}` placeholders survive), state advances to `project_initialized`, and one run-log entry is appended with `outcome: success`. +2. **Given** PROJ-262-iter2 staged identically, **When** the same orchestrator command runs, **Then** the same set of artifacts is produced and the constitution adapts to the chemistry domain (e.g., adds a domain-specific principle around quantum-chemistry validation, molecular-feature reproducibility, or similar — per the prompt's "add at most two domain-specific principles" rule). +3. **Given** either iter2 run completes, **When** the maintainer reads `.specify/memory/constitution.md` and `.specify/templates/`, **Then** `templates/{constitution,plan,spec,tasks,checklist}-template.md` are present and byte-identical to `.specify/templates/*` at the repo root (idempotent mechanical step), AND `scripts/bash/{common,create-new-feature,setup-plan,check-prerequisites}.sh` are present and executable. + +--- + +### User Story 2 - Constitution-quality audit against the system prompt's contract (Priority: P1) + +For each iter2 constitution produced in US1, the maintainer audits its content against the explicit output contract in `agents/prompts/project_initializer.md`. The contract requires: (a) literal `# <title> — Research Project Constitution` heading, (b) literal `**Project ID**: …` footer, (c) at most TWO added domain-specific principles (numbered VI/VII), (d) all five inherited principles (I–V) preserved verbatim, (e) no external citations introduced, (f) `Reproducibility Requirements` section adapted to the project's actual data sources. The audit is line-by-line, and any deviation is recorded as a defect with severity (CRITICAL / HIGH / MEDIUM / LOW) per spec-003's defect-categorization convention. The maintainer also confirms the constitution does not contradict or weaken any principle in the parent `.specify/memory/constitution.md` (per the parent-template's explicit instruction in line 13). + +**Why this priority**: A constitution that fails its output contract (e.g., LLM dropped Principle V, fabricated a citation, or invented a sixth and seventh principle) breaks every downstream slash command (`/speckit-specify`, `/speckit-plan`, `/speckit-tasks`) inside the project, because they read this file to apply Constitution Checks. Phase 2's correctness IS the constitution's correctness — there's nothing else to test. + +**Independent Test**: Can be tested per project by reading the produced constitution side-by-side with `agents/templates/research_project_constitution.md`, marking each contractual requirement (a)-(f) pass/fail with the specific text quoted, and verifying no literal `{{token}}` strings survive substitution. + +**Acceptance Scenarios**: + +1. **Given** PROJ-261-iter2's constitution, **When** the audit runs, **Then** every contract item (a)-(f) is marked pass with a quoted excerpt, OR any failure is recorded as a defect with severity, file:line pointer, and proposed fix. +2. **Given** PROJ-262-iter2's constitution, **When** the audit runs, **Then** the chemistry-specific adaptation in `Reproducibility Requirements` is verified — e.g., it MUST name a real chemistry data source (QM9, MD17, etc., or whichever the idea cites) rather than the generic placeholder language from the template. +3. **Given** either constitution introduces a defect (e.g., dropped a principle, invented a citation, contradicted parent principle II), **When** the defect is identified, **Then** it is logged with severity and either fixed in this PR (per spec 003 FR-013) or deferred to a follow-up issue. + +--- + +### User Story 3 - Idempotency audit on `init_speckit_in` (Priority: P1) + +The maintainer re-invokes `init_speckit_in` directly via a small Python harness on a sibling that is *already* at `current_stage: project_initialized` (i.e., immediately after US1's run). The orchestrator-level approach won't work — running `python -m llmxive run --project <sibling-id> --max-tasks 1` from `project_initialized` would advance to Phase 3's `specifier`, not re-run Phase 2 — so direct invocation is the only way to test issue #62's third acceptance criterion: **"Idempotent: running twice doesn't duplicate or corrupt files"**. The maintainer compares the `.specify/{scripts,templates}/` tree's sha256 hashes before and after the second invocation; they must match exactly. + +**Why this priority**: Idempotency is one of three explicit acceptance criteria in issue #62. Cron-driven pipelines re-run agents on the same project frequently; a non-idempotent `init_speckit_in` would corrupt the project's own constitution on every re-run, silently invalidating every downstream slash command's behavior. + +**Independent Test**: Can be tested by computing `sha256sum projects/<sibling-id>/.specify/{scripts,templates}/**/*` before and after the second invocation and confirming the hash list is identical. If any file's hash changed, the test fails — note that this is a stricter test than "the file still exists" because a re-rendered constitution with different LLM output would still pass a file-existence check while failing idempotency. + +**Acceptance Scenarios**: + +1. **Given** PROJ-261-iter2 at `project_initialized` with a complete `.specify/` tree, **When** `init_speckit_in` is invoked a second time directly (bypassing the orchestrator's stage-routing), **Then** all template/scripts files are unchanged (sha256 identical) and no exception is raised. +2. **Given** the LLM-rendered constitution at `.specify/memory/constitution.md` exists from US1, **When** the agent itself is re-invoked via a Python harness with the project at `validated` (impossible in production — would require sibling-iter3 — but tested via direct agent invocation), **Then** the agent MUST detect the pre-existing constitution, skip re-rendering, and leave the file byte-for-byte unchanged (sha256 identical before/after). Per Q3 clarification, the current overwrite-unconditional behavior at `src/llmxive/agents/project_initializer.py:84-102` is a HIGH defect; the fix lands in this PR and US3 verifies it. +3. **Given** the idempotency check completes, **When** the diagnostic report is generated, **Then** issue #62's checkbox "Idempotent: running twice doesn't duplicate or corrupt files" is marked pass or fail with the sha256 evidence quoted verbatim. + +--- + +### User Story 4 - Failure-path induction: agent fails loudly when prerequisites are missing (Priority: P2) + +The maintainer induces all three deliberate failure modes (per Q2 clarification) to verify Phase 2's failure paths are loud (per spec 003's FR-015 / SC-006 pattern). The three scenarios: + +1. **Backend unreachable**: temporarily set `LLMXIVE_BACKEND_BASE_URL` to an invalid host and confirm the run-log records `outcome: failure` with a populated `failure_reason` quoting the backend exception, and that the sibling's state YAML is NOT advanced past `validated`. +2. **Idea file missing**: spawn a sibling-iter3 manually but delete its `idea/<slug>.md` before invoking the orchestrator; confirm the agent fails fast (per llmXive Constitution Principle V) with a clear message rather than producing a constitution that lacks idea-grounding. +3. **Template file missing**: rename `agents/templates/research_project_constitution.md` to a backup name and run the agent; confirm it raises a clear `FileNotFoundError`, not a silent fallback to a generic constitution. + +**All three** scenarios MUST be exercised (per Q2 clarification). The diagnostic report quotes, for each scenario independently, the run-log entry, the exception trace (or stderr block), and the post-failure state YAML to prove the failure is recorded and state is not silently advanced. The three failures cover three distinct preconditions Phase 2 depends on (backend reachability, idea-grounding input, constitution-template input) and must each be exercised on their own dedicated sibling iter (e.g., -iter3 for backend-fail, -iter4 for missing idea, -iter5 for missing template) so the failures don't contaminate each other. + +**Why this priority**: P2 because it's not the happy path, but the cron-driven pipeline absolutely depends on failures being loud and recorded — silent failure with state advancement is the most damaging bug class in this whole system. P2 (not P1) because spec 003 already exercised induced-failure paths for Phase 1 and we have substantial evidence that the failure recording machinery works generically — Phase 2 only needs a smoke test, not a full audit. + +**Independent Test**: Can be tested by running the induced-failure scenario, then reading `state/run-log/<YYYY-MM>/<run-id>.jsonl` and `state/projects/<sibling-id>.yaml`. The run-log entry must have `outcome: failure` with `failure_reason` populated, and the state YAML's `current_stage` must remain at `validated` (not advanced). + +**Acceptance Scenarios**: + +1. **Given** each of the three induced-failure scenarios (backend unreachable, missing idea file, missing template) has been exercised on a dedicated sibling iter, **When** the orchestrator is invoked under each condition, **Then** the run-log entry for that scenario has `outcome: failure` with a non-empty `failure_reason`, the state YAML's `current_stage` is unchanged, and no `.specify/memory/constitution.md` is partially written. Each of the three scenarios must produce its own pass verdict independently. +2. **Given** the backend-unreachable scenario specifically, **When** the failure occurs, **Then** the failure is classified as a `TransientBackendError` (per spec 003 FR-002's distinction between backend-side and agent-side failures), and the diagnostic report explicitly notes that this is NOT a Phase 2 agent defect. + +--- + +### User Story 5 - Verbatim artifact capture and critical evaluation report (Priority: P1) + +Throughout US1-US4, the maintainer maintains a single diagnostic report under `notes/2026-05-05-phase2-diagnostic.md` (mirroring spec 003's `notes/2026-05-04-phase1-diagnostic.md` structure). The report quotes every artifact verbatim — every system prompt sent to the agent (with tokens already substituted), every constitution produced, every state YAML before/after, every run-log JSONL line — and critiques each one against issue #62's acceptance criteria with severity tags. The report's structure mirrors spec 003's: Sections 1-8 covering inputs, agent behavior, outputs, defects table, iteration diffs (if any), per-issue acceptance-criteria summary, recommendations, and carry-forward decision for spec 005. + +**Why this priority**: A bare "tests pass" verdict is useless. Verbatim quotes plus side-by-side critique are the only format that catches latent quality issues — partially substituted constitutions, dropped principles, idempotency violations that don't surface as exceptions but as silent file mutation. The report doubles as the source-of-truth handed to issue #107 to advance Phase 2's checkbox from `[ ]` to `[x]`. + +**Independent Test**: Can be tested by reading the report and confirming each artifact appears in a fenced code block, each quote is followed by an evaluation paragraph that explicitly cites issue #62's checkbox(es) (the three acceptance-criterion lines: rendering, scripts/runners, idempotency), each marked pass/fail with rationale, and that issues identified are bucketed into well/needs-improvement/broken with severity (CRITICAL/HIGH/MEDIUM/LOW) and file:line fix pointers. + +**Acceptance Scenarios**: + +1. **Given** project_initializer has run on at least one iter2 sibling, **When** the report is generated, **Then** the rendered system prompt (with all `{{token}}` substitutions visible) is quoted in full, the constitution is quoted in full (or with `[truncated lines N-M, sha256: <hash>]` markers if >100 lines), the state YAML is quoted before/after, and the run-log entry is quoted as JSON. +2. **Given** the report is generated, **When** a reviewer reads it, **Then** every checkbox in issue #62's acceptance-criteria block is explicitly marked pass/fail with rationale tied to a quoted artifact. +3. **Given** the report identifies any defect, **When** the defect is summarized, **Then** it has a severity (CRITICAL / HIGH / MEDIUM / LOW), a file:line pointer to where the fix should land, and either an "After fix" subsection quoting corrected behavior (if fixed in-PR) OR a follow-up issue link (if deferred). + +--- + +### User Story 6 - Carry-forward gate: 1-2 projects tagged for Phase 3 testing (Priority: P2) + +Once US1-US5 have produced quality output on at least one sibling per canonical project, the maintainer formally selects 1-2 iter2 siblings (or, if all iter2 siblings have audit-blocking defects, the original PROJ-261/PROJ-262 from spec 003's carry-forward) that will become the input substrate for spec 005 (Phase 3 — Spec Kit: Specify → Clarify, parent issue #47). The selection is recorded in `specs/004-phase2-project-bootstrap-testing/carry-forward.yaml` with each project ID, final state (`project_initialized`), final commit hash, the agents that ran on it (in this spec: just `project_initializer`), and a one-paragraph justification covering whether its constitution passes the US2 audit cleanly. Future specs reference this file to know which projects to operate on. + +**Why this priority**: Without this gate, "carry forward to Phase 3" is folklore. With this gate, spec 005 starts with `cat specs/004-phase2-project-bootstrap-testing/carry-forward.yaml` and knows exactly which projects to run `specifier` and `clarifier` on. P2 because it's the bridge to the next spec, not a self-contained capability. + +**Independent Test**: Can be tested by reading `carry-forward.yaml`, confirming each named project ID corresponds to a real `projects/<id>/` directory at `current_stage: project_initialized`, and confirming the named commit hashes match the last touch of those project directories. + +**Acceptance Scenarios**: + +1. **Given** the diagnostic in US5 has identified at least one constitution that passes the US2 audit cleanly, **When** `specs/004-phase2-project-bootstrap-testing/carry-forward.yaml` is written, **Then** it names 1-2 project IDs with metadata: `final_state: project_initialized`, final commit hash, agents-run summary, justification. +2. **Given** spec 005 (or any later phase-test spec) starts work, **When** it reads this carry-forward manifest, **Then** it can pick any named project ID and find the project directory, `state/projects/<id>.yaml`, idea artifacts, AND `.specify/memory/constitution.md` in the expected committed state. +3. **Given** the carry-forward is written, **When** the spec is closed, **Then** the matching parent issue checkbox in #107 (`#46 [Phase 2] Project Bootstrap`) is ticked. + +--- + +### Edge Cases + +- **Backend unreachable mid-run**: project_initializer needs the backend to render the constitution. If it's down, the run must surface a `TransientBackendError` (per spec 003 FR-002), the run-log records `outcome: failure`, and state remains at `validated`. The diagnostic must distinguish this from agent-side defects. +- **Constitution is partially written if backend dies mid-stream**: The current implementation writes the constitution AFTER the response is fully received (per `project_initializer.py` L84-L102), so a mid-stream backend failure should leave NO `.specify/memory/constitution.md`. Spec must verify this. If a partial file is found after a forced backend failure, that's a CRITICAL defect (file write should be atomic-or-absent). +- **LLM output is malformed**: `project_initializer.py` L94-L101 has a defensive fallback — if the LLM output doesn't start with `#`, it falls back to a pre-rendered template substitution. The spec must induce this case (e.g., prompt the agent in a way that returns a non-Markdown response — though this is hard to force naturally). At minimum the fallback path must be documented and either tested or accepted as best-effort. +- **`init_speckit_in` is non-idempotent on a corrupted scaffold**: If the project already has a `.specify/templates/` dir but it contains stale or partial files, `init_speckit_in` does NOT overwrite (per `runner.py` L114 — `if dst.is_dir(): continue`). This is the documented idempotent behavior, but it means a corrupted scaffold is NEVER auto-repaired. Spec must surface this — either as accepted behavior or as a HIGH defect needing a fix to make `init_speckit_in` checksum-aware (similar to how `_resync_project_scripts` already is, per `runner.py` L42-L55). +- **Domain-specific principles fabricated to look real**: The prompt allows up to two domain-specific principles. The LLM might fabricate principles with no actual basis in the project's research domain (e.g., a "Quantum Coherence Preservation" principle for a chemistry project that has nothing to do with coherence). The audit (US2) must spot-check this against the project's idea body. +- **Token substitution leaks**: If any of `{{project_id}}`, `{{title}}`, `{{field}}`, `{{date}}`, `{{principal_agent_name}}` survives in the final constitution, that's a CRITICAL defect — the parent template explicitly says these are substituted before the LLM is called, but the LLM might echo a token literally in its response. +- **Constitution contradicts the parent constitution's principles**: The project-level constitution explicitly inherits parent Principles I–V (per the parent template's design). If the LLM writes domain-specific principles that contradict a parent principle (e.g., a "Mocks Acceptable for Speed" principle would contradict parent Principle III), that's a CRITICAL defect. +- **Stage advancement when the constitution is empty/malformed**: Per llmXive Constitution Principle V (Fail Fast), if the LLM returns an empty response or malformed Markdown that fails the defensive fallback, the agent must NOT advance state. If `current_stage` becomes `project_initialized` despite an empty constitution, that's the most severe defect class in this spec — silent state advancement on broken content. +- **Run-log gap on uncaught exception**: If project_initializer crashes (uncaught Python exception), the run-log entry must still be appended with `outcome: failure` and a populated `failure_reason`. US4's induced-failure scenarios verify this. +- **Quote size cap**: Constitutions are roughly 100 lines; that's right at spec 003's verbatim-quote cap. Cap quotes at 100 lines with `[truncated lines N-M, sha256: <hash>]` markers above that. +- **Sibling spawner doesn't accept `--start-stage validated`**: Spec 003's spawner declared `ALLOWED_START_STAGES = {"brainstormed", "flesh_out_in_progress", "flesh_out_complete"}` — none of which is `validated`. This is a known prerequisite covered by **FR-003a / T004** (extend the allowlist to include `validated`); not a defect — the spawner was written before the validator stage existed. + +## Requirements *(mandatory)* + +### Functional Requirements + +- **FR-001**: System MUST run the Phase 2 agent (`project_initializer`) one-at-a-time on **real projects** spawned as `-iter2` siblings of the carry-forward projects from `specs/003-phase1-idea-lifecycle-testing/carry-forward.yaml`. Exactly **one iter2 sibling per canonical** (PROJ-261 and PROJ-262) — 2 successful runs total under the happy path, plus any iter3+ spawned only on defect — using the production code path (`python -m llmxive run --project <sibling-id> --max-tasks 1`) against the real Dartmouth Chat backend. +- **FR-002**: System MUST issue real API calls against the configured backend (no mocks, no fakes — per spec 003 FR-002 and llmXive Constitution Principle III) and gracefully distinguish backend-side failures from agent-side defects. Per Q4 clarification, a single agent run MUST retry transient backend errors **at least** 2 times (3 total attempts minimum) before raising `TransientBackendError` and routing the project to `human_input_needed`. The retried-then-failed run counts as one cycle against the FR-005 5-cycle iteration cap. The canonical retry policy is implemented in `src/llmxive/backends/router.py:96-100` (3 attempts on primary model + 1 attempt on each peer model in `MODEL_FALLBACKS` per backend × the entire fallback-backend chain). This already EXCEEDS the 2-retry minimum — see research.md Decision 3. The dartmouth backend at `src/llmxive/backends/dartmouth.py:163-180` only classifies transient vs permanent errors; the retry loop itself lives in the router. Verification (no fix expected) lands as task T012. +- **FR-003**: System MUST spawn each iter2 sibling using `tests/phase1/sibling_project.py` from spec 003 (already merged) at `--start-stage validated`, NOT at `brainstormed` — because Phase 2 testing presumes the project has already passed Phase 1's validator (per the current pipeline graph wiring at `STAGE_TO_AGENT[VALIDATED]`). **FR-003 depends on FR-003a's allowlist extension**: the spawner does NOT accept `validated` until that fix lands (covered by T004); FR-003a MUST be completed before any sibling-spawn task in US1 / US4 runs. +- **FR-003a**: System MUST extend `tests/phase1/sibling_project.py`'s `ALLOWED_START_STAGES` to include `validated` (currently only contains `{brainstormed, flesh_out_in_progress, flesh_out_complete}`). This is a known prerequisite of FR-003 and lands as the first commit of this spec's implementation. +- **FR-004**: System MUST accept that the canonical PROJ-261 and PROJ-262 on `main` already have `.specify/` scaffolds (because spec 003 ran project_initializer on them). State surgery on the canonical projects is **never** used; each iter2 sibling is a fresh, independently replayable run. +- **FR-005**: System MUST cap fix-and-re-run iterations per agent at 5 cycles (per spec 003 FR-005). Hitting the cap forces a deferral decision — either accept current state (record `accepted (not addressed)`) or file a follow-up GitHub issue. Continuing past the cap on the same defect is prohibited. +- **FR-006**: System MUST capture every artifact written by the agent — the LLM-rendered `.specify/memory/constitution.md`, the mechanical `.specify/{scripts,templates}/` tree contents (or sha256 manifest if too large to quote in full), and any sentinel files written under `.specify/memory/` — verbatim into the diagnostic report, with cap+hash truncation for files >100 lines. +- **FR-007**: System MUST capture the project state YAML and the run-log JSONL entry before and after every agent invocation, quoted verbatim. +- **FR-008**: System MUST capture the verbatim system prompt sent to the LLM (with all `{{token}}` substitutions resolved to concrete values), so the audit can verify (a) substitution worked, (b) the prompt sent the correct title/field/idea-body, (c) no tokens leaked. +- **FR-009**: System MUST evaluate every artifact against the acceptance-criteria checkboxes from issue #62 (project_initializer): (1) renders constitution.md with project-specific principles (not template placeholders), (2) creates the scripts/bash/ runners, (3) idempotent on second run. +- **FR-010**: System MUST audit the rendered constitution against the explicit output contract in `agents/prompts/project_initializer.md`: literal heading and footer, ≤2 added principles, no removed inherited principles, no external citations, Reproducibility Requirements adapted to actual data sources. Any deviation is a defect with severity. +- **FR-011**: System MUST verify full idempotency of `project_initializer` by computing sha256 hashes of every file under `projects/<id>/.specify/` (including `memory/constitution.md`, scripts, and templates) before and after a second agent invocation; the hash lists must be identical at the file-content level. Per Q3 clarification: the current overwrite-unconditional behavior on `.specify/memory/constitution.md` at `src/llmxive/agents/project_initializer.py:84-102` is a HIGH defect that MUST be fixed in this PR — the agent MUST detect a pre-existing constitution and skip re-rendering, matching the `init_speckit_in` skip-if-dir-exists pattern at `src/llmxive/speckit/runner.py:114`. +- **FR-012**: System MUST induce **all three** deliberate failure modes (backend unreachable, idea file missing, template file missing) — per Q2 clarification — and verify each one's failure path produces a loud, recorded failure rather than silent state advancement, per llmXive Constitution Principle V. Each scenario runs on its own dedicated sibling iter so the failures don't contaminate each other. +- **FR-013**: System MUST persist the diagnostic report under `notes/2026-05-05-phase2-diagnostic.md`, formatted in Markdown with fenced code blocks for every quoted artifact, mirroring spec 003's report structure. +- **FR-014**: For each CRITICAL or HIGH defect identified, system MUST either (a) apply a fix in this PR with an "After fix" report section quoting the corrected behavior, or (b) explicitly defer to a follow-up GitHub issue with rationale recorded in the report, per spec 003 FR-013's pattern. +- **FR-015**: System MUST never advance state silently when the constitution fails its content contract — empty file, partially substituted tokens, missing inherited principles, fabricated citations, or contradictions with the parent constitution must be flagged as CRITICAL defects. +- **FR-016**: System MUST commit all real-project artifacts produced (`projects/PROJ-261-…-iter2/**`, `projects/PROJ-262-…-iter2/**`, `state/projects/PROJ-…-iter2.yaml`, `state/run-log/<YYYY-MM>/*.jsonl`) so the report and the carry-forward gate are reproducible. +- **FR-017**: System MUST formally select 1-2 projects to carry forward (those whose constitutions pass the US2 audit cleanly) and record the selection in `specs/004-phase2-project-bootstrap-testing/carry-forward.yaml` with project IDs, final state, final commit hash, agents-run summary, and a one-paragraph justification per project. +- **FR-018**: All fixes applied as part of this work MUST land as separate commits with messages referencing both the parent issue (#46) and the specific sub-issue (#62) and the report section that motivated the fix. +- **FR-019**: System MUST allow non-selected iter2 siblings to remain in `projects/` (kept for future reference) or be marked archived by adding `archived_at: <ISO-8601 UTC>` to their state YAML. Non-selected siblings MUST NOT be silently deleted. +- **FR-020**: Iteration on the agent's prompt at `agents/prompts/project_initializer.md`, the constitution template at `agents/templates/research_project_constitution.md`, the registry entry at `agents/registry.yaml`, or the implementation in `src/llmxive/agents/project_initializer.py` MUST follow spec 003's prompt-version semver policy (MAJOR for output-contract-breaking, MINOR for behavior, PATCH for prose), with the version bump in the same commit as the patch. +- **FR-021**: Each new iteration after a prompt/code patch MUST spawn a new sibling (`PROJ-NNN-<slug>-iter3`, `-iter4`, …) — never reset state on the prior iteration's sibling — per spec 003's iteration discipline (FR-004). + +### Key Entities *(include if feature involves data)* + +- **Carry-forward sibling**: An iter2 (or iterN) project spawned via `tests/phase1/sibling_project.py` from a canonical carry-forward project. Has a fresh `state/projects/<sibling-id>.yaml` at `current_stage: validated`, byte-identical `idea/<slug>.md` (sha256-verified), and no `.specify/` scaffold yet. Distinct from the canonical: state surgery on the canonical is never used. +- **Project_initializer agent run**: A single invocation of `python -m llmxive run --project <sibling-id> --max-tasks 1` against a sibling at `validated`. Produces (a) a rendered system prompt (LLM input), (b) `.specify/memory/constitution.md` (LLM output), (c) `.specify/{scripts,templates}/` tree (mechanical via `init_speckit_in`), (d) state YAML transition `validated → project_initialized`, (e) run-log JSONL line. +- **Constitution artifact**: `projects/<sibling-id>/.specify/memory/constitution.md`. Audited against the explicit output contract in `agents/prompts/project_initializer.md`. +- **Spec Kit scaffold**: `projects/<sibling-id>/.specify/{scripts,templates}/` produced by the mechanical `init_speckit_in` step. Verified for completeness and idempotency. +- **Diagnostic report**: A single Markdown file at `notes/2026-05-05-phase2-diagnostic.md` quoting every artifact verbatim, evaluating each against issue #62's acceptance criteria, capturing iteration diffs (if any), and bucketing findings into well/needs-improvement/broken with severity tags. +- **Carry-forward manifest**: `specs/004-phase2-project-bootstrap-testing/carry-forward.yaml` naming the 1-2 selected projects with metadata for reference by spec 005 (Phase 3 testing). +- **Idempotency hash list**: A sha256-per-file manifest computed over `projects/<sibling-id>/.specify/{scripts,templates}/` before and after a second `init_speckit_in` invocation. Diff between the two lists is the test result. + +## Success Criteria *(mandatory)* + +### Measurable Outcomes + +- **SC-001**: The `project_initializer` agent runs end-to-end against the real Dartmouth Chat backend on at least one iter2 sibling per carry-forward project (at least 2 successful runs total), with zero mock/fake calls and zero direct calls bypassing the production orchestrator entry point. +- **SC-002**: At least 1 iter2 sibling has a constitution that passes the US2 audit (all six output-contract items pass) and is recorded in `carry-forward.yaml`. +- **SC-003**: The diagnostic report quotes every artifact written and every run-log entry produced — no agent's output omitted, no induced failure's failure path omitted. +- **SC-004**: Every acceptance-criterion checkbox from issue #62 is explicitly marked pass or fail in the report, with rationale tied to a specific quoted artifact (per agent run, per project). +- **SC-005**: All three deliberate failure modes (backend unreachable, idea file missing, template file missing) are induced on dedicated sibling iters, and each one's run-log entry is verified to record `outcome: failure` with a populated `failure_reason`. State YAML's `current_stage` remains unchanged in all three cases — demonstrating Phase 2's failure paths are not silent under any of the three precondition violations Constitution Principle V requires us to guard. +- **SC-006**: For every CRITICAL or HIGH defect identified, either an "After fix" report section quotes the corrected behavior or a follow-up issue link is recorded with rationale — no defect is silently dropped. +- **SC-007**: Iteration is bounded per agent (≤5 fix-and-re-run cycles, per spec 003 FR-005 / SC-008) so the spec converges in finite time. +- **SC-008**: The carry-forward manifest is concrete enough that spec 005 can read it and pick up the named projects without re-discovering the substrate. +- **SC-009**: Full idempotency is empirically verified: the sha256-per-file manifest of `projects/<sibling-id>/.specify/` (including `memory/constitution.md`, `scripts/`, `templates/`) after a second `project_initializer` invocation matches the first byte-for-byte. Per Q3 clarification, the constitution skip-if-exists fix MUST be applied in this PR before SC-009 can be marked pass — failure mode where re-render produces a different governance document is a HIGH defect this spec is responsible for fixing, not deferring. +- **SC-010**: No `.specify/memory/constitution.md` produced by Phase 2 contains any literal `{{token}}` strings (substitution must be complete before the LLM is invoked, per `project_initializer.py` L43-L54). Any token leak is a CRITICAL defect. +- **SC-011**: No `.specify/memory/constitution.md` produced by Phase 2 introduces an external citation, removes any of the inherited Principles I-V, or contradicts any parent constitution principle. Any of those is a CRITICAL defect. +- **SC-012**: At the end of this spec, the parent issue checkbox `#46 [Phase 2] Project Bootstrap` in tracking issue #107 is ticked (`[x]`) and issue #62 is closed with a comment referencing the diagnostic report and the carry-forward manifest. + +## Assumptions + +- The Dartmouth Chat backend (`DARTMOUTH_CHAT_API_KEY` in `~/.config/llmxive/credentials.toml`) is reachable for the duration of the test; if not, the test will surface that as a transient failure and stop, rather than fall back to a mock. +- The orchestrator entry point is `python -m llmxive run --project <id> --max-tasks 1` (verified during spec 003). +- The carry-forward manifest from spec 003 (`specs/003-phase1-idea-lifecycle-testing/carry-forward.yaml`) is authoritative and unmodified — PROJ-261 and PROJ-262 remain valid carry-forward inputs. +- The sibling spawner `tests/phase1/sibling_project.py` from spec 003 is reusable for Phase 2 with one extension (FR-003a: add `validated` to `ALLOWED_START_STAGES`). No separate sibling tooling is built. +- Agent prompts at `agents/prompts/project_initializer.md`, the constitution template at `agents/templates/research_project_constitution.md`, the registry entry, and the agent code are all editable as part of this iteration loop; changes ship as part of this PR. +- The existing PROJ-261 and PROJ-262 scaffolds on `main` are NOT modified by this spec. They serve as a reference for what "good output" should look like (since spec 003 already audited them through the Phase 1 lens), but iter2 siblings are the actual subject of this spec's testing. +- Real-project artifacts produced by iter2 sibling runs are small (constitutions are ~100 lines, scaffold trees are <30 files, idea files are <100 lines), so committing them is bounded in size. +- The agent's acceptance criteria as written in issue #62 are the authoritative checklist; if they themselves are wrong (too lax / too strict), the report will note that as a finding rather than rewriting them inline. +- The diagnostic report file path will be `notes/2026-05-05-phase2-diagnostic.md` unless the user prefers a different location. +- The carry-forward manifest path is `specs/004-phase2-project-bootstrap-testing/carry-forward.yaml`; spec 005 (Phase 3) and beyond can reference it. +- This work is sequential to (not blocked by) any other open Phase issue in #107 — Phase 2 has no dependency on Phases 3-14. +- A maintainer (human in the loop) makes the final selection of which iter2 sibling(s) carry forward; no agent's verdict overrides that judgment, since this spec is also testing whether project_initializer's output is trustworthy enough to feed into Phase 3. diff --git a/specs/004-phase2-project-bootstrap-testing/tasks.md b/specs/004-phase2-project-bootstrap-testing/tasks.md new file mode 100644 index 00000000..e1a1ca97 --- /dev/null +++ b/specs/004-phase2-project-bootstrap-testing/tasks.md @@ -0,0 +1,326 @@ +--- + +description: "Task list for Phase 2 (Project Bootstrap) end-to-end testing & diagnostics" +--- + +# Tasks: Phase 2 (Project Bootstrap) End-to-End Testing & Diagnostics + +**Input**: Design documents from `specs/004-phase2-project-bootstrap-testing/` +**Prerequisites**: plan.md, spec.md, research.md, data-model.md, contracts/, quickstart.md + +**Tests**: Yes — pytest harness for FR-011 / SC-009 (idempotency check) is part of FR-007 / Decision 4 in research.md. The diagnostic itself is a manual procedure but the idempotency-check harness MUST have automated tests. + +**Organization**: Tasks are grouped by user story. The MVP is US1 (clean run on iter2 siblings); US2-US6 build on US1's substrate. + +## Format: `[ID] [P?] [Story] Description` + +- **[P]**: Can run in parallel (different files, no dependencies) +- **[Story]**: Which user story this task belongs to (US1-US6) +- File paths are absolute relative to repo root + +## Path Conventions + +Single project; all paths relative to `/Users/jmanning/llmXive/`: +- Production code: `src/llmxive/` +- Diagnostic helpers + tests: `tests/phase1/` +- Spec artifacts: `specs/004-phase2-project-bootstrap-testing/` +- Diagnostic report: `notes/` +- Real-project artifacts: `projects/`, `state/` + +--- + +## Phase 1: Setup (Shared Infrastructure) + +**Purpose**: Preflight verification + the two production-code prerequisite fixes that ALL user stories depend on. No work in any user-story phase can begin until Phase 1 + Phase 2 complete. + +- [X] T001 Run preflight checks per quickstart.md Step 0: verify `cat specs/003-phase1-idea-lifecycle-testing/carry-forward.yaml` succeeds, `python -m llmxive run --help` succeeds, `python -c "from llmxive.credentials import load_dartmouth_key; print('ok' if load_dartmouth_key(prompt_if_missing=False) else 'missing')"` prints `ok`, and `git status --short` is clean (or only modified `.omc/`/cron files). +- [X] T002 Confirm carry-forward substrate exists: `ls projects/PROJ-261-evaluating-the-impact-of-code-duplicatio/idea/` and `ls projects/PROJ-262-predicting-molecular-dipole-moments-with/idea/` both list a `<slug>.md` file. +- [X] T003 Confirm spec 004 directory layout is in place: `ls specs/004-phase2-project-bootstrap-testing/{spec.md,plan.md,research.md,data-model.md,quickstart.md,contracts,checklists}` succeeds. + +--- + +## Phase 2: Foundational (Blocking Prerequisites) + +**Purpose**: The two production-code patches + the test harness that ALL user stories depend on. Per research.md Decisions 1, 2, 4 + spec.md FR-003a / FR-011. + +**⚠️ CRITICAL**: No US1-US6 task can begin until T004-T010 complete and committed. + +- [X] T004 Patch [tests/phase1/sibling_project.py:36](tests/phase1/sibling_project.py#L36) to extend `ALLOWED_START_STAGES` from `{"brainstormed", "flesh_out_in_progress", "flesh_out_complete"}` to `{"brainstormed", "flesh_out_in_progress", "flesh_out_complete", "validated"}`. Per spec.md FR-003a / research.md Decision 1. +- [X] T005 Verify the spawner change: run `python tests/phase1/sibling_project.py --help` and confirm the `--start-stage` choices include `validated`. +- [X] T006 Commit T004 with message referencing FR-003a, #46, #62: `git add tests/phase1/sibling_project.py && git commit -m "phase2/spec-004: add 'validated' to sibling spawner allowlist (FR-003a, #46 #62)"` +- [X] T007 Patch [src/llmxive/agents/project_initializer.py:84-104](src/llmxive/agents/project_initializer.py#L84-L104) to add a skip-if-exists guard before the constitution write: at the top of `handle_response`, if `(project_dir / ".specify" / "memory" / "constitution.md").is_file()`, call `init_speckit_in(project_dir)` (still idempotent on dirs) and return `[str(constitution_path.relative_to(repo))]` without invoking the LLM-output write. Per spec.md FR-011 / Q3 / research.md Decision 2. +- [X] T008 Patch [src/llmxive/agents/project_initializer.py:60](src/llmxive/agents/project_initializer.py#L60) to upgrade the silent `if idea_path.exists():` defensive guard into a fail-fast `raise FileNotFoundError(f"idea seed not found: {idea_path}")`. Per spec.md US4 scenario 2 / research.md Decision 5 / Constitution Principle V. +- [X] T009 [P] Implement [tests/phase1/test_idempotency.py](tests/phase1/test_idempotency.py) per `contracts/idempotency-check.md` — four pytest tests: `test_init_speckit_in_idempotent_on_complete_tree`, `test_project_initializer_skips_existing_constitution`, `test_project_initializer_writes_on_first_invocation`, `test_full_tree_idempotent_after_two_agent_invocations`. Use real `tmp_path` fixtures; no mocks per Constitution Principle III. +- [X] T010 Run `pytest tests/phase1/test_idempotency.py -v` and confirm all 4 tests pass. If any fail, fix the agent patches in T007/T008 (do NOT loosen the test). +- [X] T011 Commit T007/T008/T009 with message referencing FR-011, Q3, P2-D03, #46, #62: `git add src/llmxive/agents/project_initializer.py tests/phase1/test_idempotency.py && git commit -m "phase2/spec-004: idempotency + fail-fast guards on project_initializer (FR-011 Q3 P2-D03, #46 #62)"` +- [X] T012 Verify the existing backend retry policy at [src/llmxive/backends/router.py:96-100](src/llmxive/backends/router.py#L96-L100) (`attempts = 3 if model_idx == 0 else 1`) satisfies Q4's "2 retries / 3 total attempts" minimum. Per research.md Decision 3, no code change is expected; record verification in the diagnostic report's §1 as evidence FR-002 is satisfied by inheritance. **Contingency**: if the policy at L96-L100 has been weakened to <3 attempts on the primary model (e.g., reduced to `attempts = 2`) since spec 003 merged, file as defect P2-D04 with HIGH severity and patch in this PR before continuing to US1. + +**Checkpoint**: Foundation ready. The two prerequisite production fixes are committed; idempotency tests are green; retry policy verified. User-story phases may now begin. + +--- + +## Phase 3: User Story 1 - project_initializer runs cleanly on each iter2 sibling (Priority: P1) 🎯 MVP + +**Goal**: Run `project_initializer` end-to-end against the real Dartmouth Chat backend on one iter2 sibling per canonical (PROJ-261, PROJ-262), capturing every input/output/state-transition for audit. + +**Independent Test**: Spawn the sibling, run `python -m llmxive run --project <sibling-id> --max-tasks 1`, open `projects/<sibling-id>/.specify/memory/constitution.md`. The run-log must record `outcome: success`; the constitution must start with `# <title> — Research Project Constitution` and end with `**Project ID**: …` footer; no literal `{{token}}` strings remain; state YAML's `current_stage` advances to `project_initialized`. Per spec.md US1 acceptance scenarios 1-3. + +### Implementation for User Story 1 + +- [X] T013 [P] [US1] Spawn PROJ-261-iter2 by running `python tests/phase1/sibling_project.py PROJ-261-evaluating-the-impact-of-code-duplicatio --iter 2 --start-stage validated`. Confirm output prints `PROJ-261-evaluating-the-impact-of-code-duplicatio-iter2`. Confirm `projects/PROJ-261-evaluating-the-impact-of-code-duplicatio-iter2/idea/evaluating-the-impact-of-code-duplicatio.md` exists and is byte-identical to the canonical's idea file (the spawner sha256-verifies; capture its stderr for the report). +- [X] T014 [P] [US1] Spawn PROJ-262-iter2 by running `python tests/phase1/sibling_project.py PROJ-262-predicting-molecular-dipole-moments-with --iter 2 --start-stage validated`. Confirm output prints `PROJ-262-predicting-molecular-dipole-moments-with-iter2`. Confirm `projects/PROJ-262-…-iter2/idea/<slug>.md` exists and is byte-identical to the canonical's. +- [X] T015 [US1] Snapshot the pre-run state YAMLs: `cat state/projects/PROJ-261-evaluating-the-impact-of-code-duplicatio-iter2.yaml > /tmp/pre-261.yaml` and `cat state/projects/PROJ-262-predicting-molecular-dipole-moments-with-iter2.yaml > /tmp/pre-262.yaml`. Both MUST show `current_stage: validated`. +- [X] T016 [US1] Commit the two iter2 spawn artifacts: `git add projects/PROJ-261-…-iter2/ projects/PROJ-262-…-iter2/ state/projects/PROJ-26{1,2}-…-iter2.yaml && git commit -m "phase2/spec-004: spawn iter2 siblings of PROJ-261, PROJ-262 (US1, FR-001, #46 #62)"` +- [X] T017 [US1] Run `project_initializer` on PROJ-261-iter2: `python -m llmxive run --project PROJ-261-evaluating-the-impact-of-code-duplicatio-iter2 --max-tasks 1`. Capture stdout, stderr, and exit code into `/tmp/run-261.log`. Expected exit code: 0. +- [X] T018 [US1] Run `project_initializer` on PROJ-262-iter2: `python -m llmxive run --project PROJ-262-predicting-molecular-dipole-moments-with-iter2 --max-tasks 1`. Capture stdout, stderr, and exit code into `/tmp/run-262.log`. Expected exit code: 0. +- [X] T019 [US1] Snapshot post-run state YAMLs: `cat state/projects/PROJ-261-…-iter2.yaml > /tmp/post-261.yaml` and same for 262. Both MUST show `current_stage: project_initialized` and `last_run_status: success`. +- [X] T020 [US1] Capture run-log JSONL entries: locate the new line(s) in `state/run-log/2026-05/<run_id>.jsonl` corresponding to T017 and T018. Each must have `agent: project_initializer`, `outcome: success`, populated `started_at`/`ended_at`, `stage_before: validated`, `stage_after: project_initialized`. +- [X] T020a [US1] Capture verbatim rendered system prompt for each iter2 run (per FR-008 / SC-010 evidence). Run a Python harness that reconstructs the prompt EXACTLY as the agent built it, then write it to `/tmp/prompt-<sibling-id>.txt`: `python -c "from pathlib import Path; from llmxive.agents.base import AgentContext; from llmxive.agents.project_initializer import ProjectInitializerAgent; from llmxive.agents.registry import load_registry; reg = load_registry(); entry = next(e for e in reg.agents if e.name == 'project_initializer'); agent = ProjectInitializerAgent(entry); slug = '<slug>'; ctx = AgentContext(project_id='<sibling-id>', metadata={'title': '<title>', 'field': '<field>', 'principal_agent_name': 'flesh_out'}, inputs=[f'projects/<sibling-id>/idea/{slug}.md']); msgs = agent.build_messages(ctx); print('=== SYSTEM ==='); print(msgs[0].content); print('=== USER ==='); print(msgs[1].content)" > /tmp/prompt-<sibling-id>.txt`. Substitute `<title>`, `<field>` from the canonical's state YAML. The captured file is quoted verbatim in the diagnostic report's § 2.X.2 / § 2.X.3. +- [X] T021 [US1] Verify both iter2 siblings have a complete `.specify/` tree: `find projects/PROJ-261-…-iter2/.specify -type f | sort` and same for 262. Each must list 10 files (1 constitution + 4 scripts + 5 templates) per data-model.md E3. +- [X] T022 [US1] Commit the iter2 run artifacts: `git add projects/PROJ-26{1,2}-…-iter2/.specify/ state/projects/PROJ-26{1,2}-…-iter2.yaml state/run-log/ && git commit -m "phase2/spec-004: project_initializer happy-path runs on iter2 siblings (US1, #46 #62)"` + +**Checkpoint**: At this point, US1 is fully exercised. PROJ-261-iter2 and PROJ-262-iter2 are at `current_stage: project_initialized` with audited-able artifacts. + +--- + +## Phase 4: User Story 2 - Constitution-quality audit against the system prompt's contract (Priority: P1) + +**Goal**: For each iter2 sibling, audit `.specify/memory/constitution.md` against the six output-contract items in `agents/prompts/project_initializer.md`. Record per-item PASS/FAIL with quoted excerpts. + +**Independent Test**: Read each iter2 constitution alongside `agents/templates/research_project_constitution.md`, mark each contract item (a)-(f) per data-model.md E2, confirm no `{{token}}` survives, confirm the chemistry constitution's Reproducibility Requirements names QM9 / MD17 (or whichever data source the idea cites). Per spec.md US2 acceptance scenarios 1-3. + +### Implementation for User Story 2 + +- [X] T023 [P] [US2] Audit PROJ-261-iter2's constitution: open `projects/PROJ-261-…-iter2/.specify/memory/constitution.md` side-by-side with `agents/templates/research_project_constitution.md`. Fill in the six-row audit table from `contracts/diagnostic-report.md` § 3.X.1 (heading / footer / inherited principles / added principles ≤2 / no external citations / Reproducibility-Requirements adapted). Record verdict per row with quoted excerpts. +- [X] T024 [P] [US2] Audit PROJ-262-iter2's constitution: same procedure as T023 against `projects/PROJ-262-…-iter2/.specify/memory/constitution.md`. Pay special attention to row (f) — the Reproducibility-Requirements section MUST name a real chemistry data source (QM9, MD17, or whichever is cited in the iter2's idea body). +- [X] T025 [P] [US2] Token-leak check: `grep -F '{{' projects/PROJ-261-…-iter2/.specify/memory/constitution.md projects/PROJ-262-…-iter2/.specify/memory/constitution.md`. MUST be empty. Per spec.md SC-010. +- [X] T026 [P] [US2] Source-of-truth verification for both siblings: for each of the 9 mechanical files (4 scripts + 5 templates), compute `sha256sum projects/<sibling-id>/.specify/<path>` and compare to `sha256sum .specify/<path>` at repo root. Build the table from `contracts/diagnostic-report.md` § 3.X.4 — all 18 rows (9 files × 2 siblings) MUST show ✓ match. +- [X] T027 [US2] If T023 or T024 surfaces any contract violation: file as defect P2-D## with severity per data-model.md E6 § 4 (CRITICAL for heading/footer/inherited-principles/citations; HIGH for added-principles count; MEDIUM for Reproducibility-Requirements adaptation). Either fix in-PR (with prompt or template patch — see Phase 7 iteration loop) or defer to a follow-up issue. Per spec.md FR-014. + +**Checkpoint**: Each iter2 constitution has been audited against its six-item output contract. All deviations are recorded as defects. + +--- + +## Phase 5: User Story 3 - Idempotency audit on init_speckit_in (Priority: P1) + +**Goal**: Empirically verify FR-011 / SC-009 — the full `.specify/` tree is byte-identical after a second `init_speckit_in` invocation on a sibling already at `project_initialized`. + +**Independent Test**: Compute sha256-per-file manifest before and after second invocation; assert lists are identical. Per spec.md US3 acceptance scenarios 1-3. + +### Implementation for User Story 3 + +- [X] T028 [US3] Before the second invocation, capture the pre-rerun manifest on PROJ-261-iter2: `find projects/PROJ-261-…-iter2/.specify -type f -exec sha256sum {} \; | sort > /tmp/sha-before-261.txt`. +- [X] T029 [US3] Run a direct second invocation of `init_speckit_in`: `python -c "from pathlib import Path; from llmxive.speckit.runner import init_speckit_in; init_speckit_in(Path('projects/PROJ-261-evaluating-the-impact-of-code-duplicatio-iter2'))"`. Expected: completes silently, no exceptions. +- [X] T030 [US3] Compute the post-rerun manifest: `find projects/PROJ-261-…-iter2/.specify -type f -exec sha256sum {} \; | sort > /tmp/sha-after-261.txt`. Diff: `diff /tmp/sha-before-261.txt /tmp/sha-after-261.txt`. MUST be empty. Per data-model.md E8 cross-entity invariants. +- [X] T031 [US3] Quote the pytest output from `pytest tests/phase1/test_idempotency.py::test_project_initializer_skips_existing_constitution -v` (run during T010) verbatim into the diagnostic report's § 3.X.5 as the primary evidence for US3 acceptance scenario 2. +- [X] T032 [US3] If T030 shows ANY divergence (i.e., diff is non-empty): file as CRITICAL defect P2-D## with the specific changed file path and proposed fix. Per spec.md SC-009 — the constitution skip-if-exists fix MUST be in place for SC-009 to pass. + +**Checkpoint**: Idempotency is empirically verified at file-content level for at least one iter2 sibling. The pytest harness corroborates the manual sha256 evidence. + +--- + +## Phase 6: User Story 4 - Failure-path induction (Priority: P2) + +**Goal**: Induce all three deliberate failure modes per Q2 clarification (backend-unreachable, idea-missing, template-missing) on dedicated sibling iters; verify each produces a loud + recorded failure with state unchanged. Per spec.md US4 / FR-012 / SC-005. + +**Independent Test**: After each scenario, the run-log entry MUST have `outcome: failure` with non-empty `failure_reason`, the state YAML's `current_stage` MUST remain `validated`, and no `.specify/memory/constitution.md` MUST exist on the failure-iter sibling. Per spec.md US4 acceptance scenarios 1-2. + +### Implementation for User Story 4 + +- [X] T033 [P] [US4] Scenario 1 (backend unreachable): follow `contracts/induced-failure-runs.md § Scenario 1` exactly. Spawn `PROJ-261-…-iterFAIL-backend` (use `--iter 6` or first available unused iter), export `LLMXIVE_BACKEND_BASE_URL=https://invalid.example.com` for the duration of one orchestrator run, capture stderr / run-log / post-run state, then restore the env. Pass criterion: per spec.md US4 acceptance scenario 2 — failure classified as `TransientBackendError`, state unchanged. +- [X] T034 [P] [US4] Scenario 2 (idea file missing): follow `contracts/induced-failure-runs.md § Scenario 2` exactly. Spawn `PROJ-262-…-iterFAIL-idea` (use `--iter 7`), `rm projects/<sibling-id>/idea/<slug>.md`, run the orchestrator. Per research.md Decision 5: with the T008 fix in place, the agent MUST raise `FileNotFoundError`; the run-log MUST record `outcome: failure` with the exception's repr. +- [X] T035 [P] [US4] Scenario 3 (template file missing): follow `contracts/induced-failure-runs.md § Scenario 3` exactly. Spawn `PROJ-261-…-iterFAIL-template` (use `--iter 8`), `mv agents/templates/research_project_constitution.md agents/templates/research_project_constitution.md.bak`, run the orchestrator, restore the template. Per `project_initializer.py:44`, the agent MUST raise `FileNotFoundError` BEFORE the LLM is invoked (fail-fast on missing template per Constitution Principle V). +- [X] T036 [US4] Cleanup verification: `echo "${LLMXIVE_BACKEND_BASE_URL:-(unset)}"` shows the original value or `(unset)`; `ls -la agents/templates/research_project_constitution.md` shows the file is back in place; `git status agents/templates/` shows clean. Per `contracts/induced-failure-runs.md § Cleanup checklist`. +- [X] T037 [US4] For each of the three induced-failure siblings, set `archived_at: <ISO-8601 UTC>` in their state YAMLs (per spec.md FR-019). The state files remain committed; only the `archived_at` field is added. +- [X] T038 [US4] Commit the three failure-iter siblings + cleanup: `git add projects/PROJ-26{1,2}-…-iterFAIL-*/ state/projects/PROJ-26{1,2}-…-iterFAIL-*.yaml state/run-log/ && git commit -m "phase2/spec-004: induced-failure scenarios + archive (US4, FR-012, #46 #62)"` + +**Checkpoint**: All three induced-failure scenarios pass: failures are loud, recorded, state-preserving, and atomic-or-absent on filesystem writes. + +--- + +## Phase 7: Iteration loop (conditional on US2 / US3 / US4 defects) + +**Purpose**: Apply prompt/template/code patches if any audit phase surfaced a defect; spawn iter3+ siblings to verify the fix; repeat up to 5 cycles per spec.md FR-005. + +**Trigger**: ANY of T023, T024, T025, T026, T030, T033, T034, T035 reveals a defect that warrants in-PR fix per spec.md FR-014. Skip Phase 7 entirely if no defects surfaced. + +### Implementation for Iteration loop (conditional) + +- [X] T039 [US2] [conditional] If T023/T024 surfaced a constitution-content defect (e.g., dropped principle, fabricated citation, missing data-source adaptation): patch the affected source — `agents/prompts/project_initializer.md` (most common) or `agents/templates/research_project_constitution.md` — bumping the agent's `prompt_version` in `agents/registry.yaml` per the spec-003 semver policy (MAJOR/MINOR/PATCH per the change kind). Same commit MUST include both the prompt/template patch AND the version bump. **Verify the bump landed**: after each iteration commit, run `git show --stat HEAD -- agents/registry.yaml` and confirm the `prompt_version` line for `project_initializer` shows the version diff. If the registry didn't change but a prompt/template did, the commit violates FR-020 — amend the commit to include the bump before pushing. +- [X] T040 [US2] [conditional] After T039: spawn iter3 siblings of the affected canonicals (`python tests/phase1/sibling_project.py <canonical> --iter 3 --start-stage validated`). Re-run `project_initializer` on each. Re-audit per T023/T024. If still failing AND iteration count <5: return to T039 for another patch. If iteration count = 5 and still failing: file follow-up issue, mark defect `Deferred to issue #<N>` in §4 of report, exit the loop. Per FR-005. +- [X] T041 [US3] [conditional] If T030 showed sha256 divergence: investigate. Either the T007 skip-if-exists guard isn't working (revert + re-investigate) or `init_speckit_in` is mutating an unexpected file (file as defect against `src/llmxive/speckit/runner.py`). Patch + commit + re-run T028-T030. +- [X] T042 [US4] [conditional] If T033/T034/T035 surfaced any failure-handling defect (silent state advancement, empty `failure_reason`, partial constitution write): patch the relevant agent or orchestrator code, commit with version bump if a registry-tracked prompt was patched, re-run the affected scenario. +- [X] T043 [conditional] For each iteration, capture a §5 subsection in the diagnostic report with the verbatim `git diff <prev-SHA> <curr-SHA> -- <path>` block per spec.md FR-008 / `contracts/diagnostic-report.md § Section 5`. + +**Checkpoint**: Either all defects fixed (and iter3+ siblings exist with passing audits) or all unresolved defects deferred to follow-up issues. Iteration count never exceeds 5 per agent (FR-005 hard cap). + +--- + +## Phase 8: User Story 5 - Verbatim diagnostic report (Priority: P1) + +**Goal**: Author a single Markdown file at `notes/2026-05-05-phase2-diagnostic.md` quoting every artifact verbatim from US1-US4 + iteration loops, evaluating each against issue #62's three acceptance criteria with severity-tagged defects. Per spec.md US5 / FR-013 / `contracts/diagnostic-report.md`. + +**Independent Test**: Reading the report top-to-bottom, every checkbox in issue #62 has an explicit pass/fail verdict tied to a quoted artifact, every defect has severity + file:line + status, every CRITICAL defect has `Fixed in PR <SHA>` or `Deferred to issue #<N>`. Per spec.md US5 acceptance scenarios 1-3. + +### Implementation for User Story 5 + +- [X] T044 [US5] Create `notes/2026-05-05-phase2-diagnostic.md` with the frontmatter block per `contracts/diagnostic-report.md`'s "Frontmatter" section (spec link, generation timestamp, branch name, final commit, parent issue, tracker). +- [X] T045 [US5] Write Section 1 (Inputs): tables per `contracts/diagnostic-report.md § Section 1` covering both canonicals and both iter2 siblings; quote spawner stderr from T013/T014 as sha256-clone evidence. +- [X] T046 [US5] Write Section 2 (Agent behavior): six subsections per sibling × N siblings (2 happy-path + ≥3 induced-failure + ≥0 iter3+). Each contains pre-run state YAML, rendered system prompt, rendered user prompt, LLM response, run-log JSONL line, post-run state YAML — all verbatim. Per `contracts/diagnostic-report.md § Section 2`. +- [X] T047 [US5] Write Section 3 (Outputs): for each happy-path sibling, the constitution audit table (T023/T024), the full constitution quote (with `[truncated…]` markers if >100 lines), the token-leak check (T025), the source-of-truth verification table (T026), and the idempotency check (T028-T031 for the chosen US3 sibling). Per `contracts/diagnostic-report.md § Section 3`. +- [X] T048 [US5] Write Section 4 (Defects table) with the running list of P2-D## defects from US2 / US3 / US4 / iteration loops. P2-D01 (constitution skip-if-exists, fixed by T007), P2-D02 (sibling allowlist extension, fixed by T004), P2-D03 (idea-missing fail-fast, fixed by T008) are pre-known; append any new defects discovered during US1-US4. Per `contracts/diagnostic-report.md § Section 4`. +- [X] T049 [US5] Write Section 5 (Iteration diffs) — only if Phase 7 ran. Otherwise this section is the single line `No iteration loops fired; iter2 happy-path was sufficient on first pass.` per `contracts/diagnostic-report.md § Section 5`. +- [X] T050 [US5] Write Section 6 (Per-issue acceptance-criteria summary): two tables, one for issue #62 (3 checkboxes) and one for issue #46 (5 checkboxes). Each row marked PASS / FAIL with rationale tied to a quoted artifact from §2 or §3. Per `contracts/diagnostic-report.md § Section 6`. +- [X] T051 [US5] Write Section 7 (Recommendations): bulleted lists of recommended Phase-2 changes going forward, follow-up issues opened/recommended, items deliberately accepted-as-is. Per `contracts/diagnostic-report.md § Section 7`. +- [X] T052 [US5] Verify all artifacts referenced in §1-§6 exist on disk and the quotes are exact (run `diff <(grep -A 100 "constitution.md" notes/2026-05-05-phase2-diagnostic.md | head -100) <(head -100 projects/PROJ-261-…-iter2/.specify/memory/constitution.md)` etc. spot-checks for ≥3 random quoted artifacts). +- [X] T053 [US5] Commit the diagnostic report: `git add notes/2026-05-05-phase2-diagnostic.md && git commit -m "phase2/spec-004: diagnostic report (US5, FR-013, #46 #62)"` + +**Checkpoint**: Single report at `notes/2026-05-05-phase2-diagnostic.md` covers everything Phase 2 produced + verdict per issue #62 / #46 acceptance criterion. + +--- + +## Phase 9: User Story 6 - Carry-forward gate (Priority: P2) + +**Goal**: Author `specs/004-phase2-project-bootstrap-testing/carry-forward.yaml` naming 1-2 iter2 siblings (or canonicals + iter2 IDs) that pass the US2 audit cleanly, providing the input substrate for spec 005 (Phase 3 — Specifier + Clarifier). + +**Independent Test**: Read the manifest; confirm each named project ID corresponds to a real `projects/<id>/` directory at `current_stage: project_initialized` with a complete `.specify/` scaffold; confirm each named commit hash resolves to a real commit on the feature branch; confirm `phase2_iter2_id` field names a real iter2 sibling whose constitution passed the US2 audit. Per spec.md US6 acceptance scenarios 1-3. + +### Implementation for User Story 6 + +- [X] T054 [US5/US6] Decide carry-forward selection based on the diagnostic report's §6 verdicts: pick 1-2 iter2 siblings whose constitutions passed all six US2 audit items at PASS (no FAIL on any row). If both iter2 siblings have unresolved CRITICAL defects, fall back to carrying forward the canonical PROJ-261/PROJ-262 from spec 003 (with the iter2 sibling's audited constitution copied in) — record this as a sub-decision in §8 of the report. Per spec.md US6. +- [X] T055 [US6] Author `specs/004-phase2-project-bootstrap-testing/carry-forward.yaml` per `contracts/carry-forward.md` schema: `spec`, `generated_at`, `final_commit`, `projects[*]` with `project_id`, `final_state`, `final_commit`, `phase2_iter2_id`, `agents_run`, `justification`. The `agents_run` list MUST include all four Phase-1 agents (brainstorm / flesh_out / research_question_validator / project_initializer) plus the iteration counts from spec 003 + this spec. Per spec.md FR-017. +- [X] T056 [US6] Validate the manifest manually against the schema in `contracts/carry-forward.md`: every cross-field invariant satisfied (each `phase2_iter2_id` resolves to a real iter2 sibling at `project_initialized`; each `final_commit` resolves on the branch; no >2 entries; no <1 entry). Document the validation in §6 of the report under "Schema validation" row. +- [X] T057 [US6] Write Section 8 (Carry-forward decision) of the diagnostic report: name the selected sibling IDs, quote each named project's full state YAML, write a ≤200-word justification per selection covering whether the constitution passed US2 cleanly + whether idempotency held + which domain principles the LLM added + caveats for spec 005. Per `contracts/diagnostic-report.md § Section 8`. +- [X] T058 [US6] Commit the carry-forward manifest + report update: `git add specs/004-phase2-project-bootstrap-testing/carry-forward.yaml notes/2026-05-05-phase2-diagnostic.md && git commit -m "phase2/spec-004: carry-forward manifest + report § 8 (US6, FR-017, #46 #62)"` +- [X] T058a [US6] Archive any iter2 happy-path siblings NOT named in `carry-forward.yaml` (per FR-019 — non-selected siblings MUST be marked archived, not deleted). For each unselected `<sibling-id>` from {PROJ-261-…-iter2, PROJ-262-…-iter2, plus any iter3+ from Phase 7}: append an `archived_at: <ISO-8601 UTC>` field to its `state/projects/<sibling-id>.yaml` (same pattern as T037 for failure-iter siblings). Confirm with `grep archived_at state/projects/PROJ-26*-iter*.yaml`. Commit: `git add state/projects/ && git commit -m "phase2/spec-004: archive non-selected iter2 siblings (FR-019, US6, #46 #62)"`. Skip this task entirely if every spawned iter2 sibling appears in `carry-forward.yaml`. + +**Checkpoint**: spec 005 can `cat specs/004-phase2-project-bootstrap-testing/carry-forward.yaml` and pick its substrate without re-discovering anything. + +--- + +## Phase 10: Polish & Cross-Cutting Concerns + +**Purpose**: Run the full test suite, close issues, update tracker #107, push, open PR. + +- [X] T059 [P] Run the full pytest suite to confirm no regression: `pytest tests/phase1/ -v`. All spec-003 tests (citation_resolver) must still pass; new spec-004 tests (idempotency) must all pass. If any test fails: stop, fix the underlying code (do NOT loosen tests per CLAUDE.md), commit the fix, retry. +- [X] T060 [P] Run any project linters: `ruff check .` and `pyright` (or whatever the project's existing lint/type-check toolchain is). Any new errors introduced by T004-T011 fixes MUST be resolved before continuing. +- [X] T061 Tick the Phase 2 box in tracking issue #107. Note: GitHub's issue body may have whitespace variations after rendering, so prefer a Python regex over a fragile `sed` literal: `gh issue view 107 --json body -q .body > /tmp/issue107.md && python3 -c "import re,sys; t=open('/tmp/issue107.md').read(); open('/tmp/issue107.md','w').write(re.sub(r'- \[ \] #46\b', '- [x] #46', t, count=1))" && gh issue edit 107 --body-file /tmp/issue107.md`. Verify the edit by re-fetching the issue body and confirming `- [x] #46` appears. +- [X] T062 Close issue #62 (project_initializer agent): `gh issue close 62 --comment "Resolved via spec 004 (commit <SHA>). See diagnostic report at notes/2026-05-05-phase2-diagnostic.md and carry-forward manifest at specs/004-phase2-project-bootstrap-testing/carry-forward.yaml. All three acceptance criteria pass per § 6 of the report."` — substitute `<SHA>` with the final commit hash. +- [X] T063 Close issue #46 (Phase 2 parent): `gh issue close 46 --comment "Phase 2 verified end-to-end via spec 004 (commit <SHA>). All five acceptance-criterion checkboxes in this issue pass per § 6 of the diagnostic report. Carry-forward manifest names <K> sibling(s) for spec 005."` +- [X] T064 Push the feature branch: `git push origin 008-phase2-project-bootstrap-testing`. +- [X] T065 Open the PR: use `gh pr create --base main --head 008-phase2-project-bootstrap-testing --title "Spec 004: Phase 2 (Project Bootstrap) end-to-end testing" --body "$(cat <<'EOF'`...heredoc...`EOF`...`)"`. The full PR body block is defined verbatim in [quickstart.md § Step 10](./quickstart.md) — copy it inline into the heredoc. Confirm the PR body renders correctly on GitHub before continuing (any unescaped backticks or special chars will display as raw markup). +- [X] T066 [P] Add the PR URL to a comment on tracking issue #107 for easy navigation. +- [X] T067 Update spec.md's `**Status**` line from `Draft` to `In Review` (or `Merged` after merge). Recommended approach is a manual edit (open the file, change the literal `**Status**: Draft` to `**Status**: In Review`) since `sed` on macOS BSD can mishandle markdown asterisks; alternatively use `python3 -c "import re,sys; p='specs/004-phase2-project-bootstrap-testing/spec.md'; t=open(p).read(); open(p,'w').write(re.sub(r'^\*\*Status\*\*: Draft\s*$', '**Status**: In Review', t, count=1, flags=re.MULTILINE))"`. Verify with `head -10 specs/004-phase2-project-bootstrap-testing/spec.md | grep Status`. + +**Checkpoint**: PR open. All issues updated. Tracker reflects Phase 2 complete pending merge. + +--- + +## Dependencies & Execution Order + +### Phase Dependencies + +- **Phase 1 (Setup, T001-T003)**: No dependencies — preflight only +- **Phase 2 (Foundational, T004-T012)**: Depends on Phase 1 completion. **BLOCKS US1-US6.** +- **Phase 3 (US1, T013-T022)**: Depends on Phase 2 complete. P1 / MVP. +- **Phase 4 (US2, T023-T027)**: Depends on Phase 3 complete (audits the iter2 outputs). +- **Phase 5 (US3, T028-T032)**: Depends on Phase 3 complete + T009/T010 idempotency tests passing. Can run in parallel with Phase 4. +- **Phase 6 (US4, T033-T038)**: Depends on Phase 2 complete (independent of Phase 3-5; spawns its own dedicated failure-iter siblings). +- **Phase 7 (Iteration loop, T039-T043)**: CONDITIONAL — only if Phase 4 / 5 / 6 surfaced defects. Iterates per FR-005 5-cycle cap. +- **Phase 8 (US5 report, T044-T053)**: Depends on Phases 3-7 complete (report quotes their artifacts). +- **Phase 9 (US6 carry-forward, T054-T058)**: Depends on Phase 8 complete (selection driven by report verdicts). +- **Phase 10 (Polish, T059-T067)**: Depends on Phase 9 complete. + +### User Story Dependencies + +- **US1 (P1)**: After Phase 2 — no inter-story dependencies. +- **US2 (P1)**: After US1 — audits US1's outputs. +- **US3 (P1)**: After US1 + T009 idempotency tests passing — verifies idempotency on US1's iter2 output. +- **US4 (P2)**: After Phase 2 — independent of US1-US3 (spawns its own dedicated failure-iter siblings, using the T004 spawner extension and the T008 fail-fast guard). +- **US5 (P1)**: After US1-US4 (+ Phase 7 if it ran) — quotes everything. +- **US6 (P2)**: After US5 — selection driven by §6 verdicts of the report. + +### Within Each User Story + +- Spawn siblings before running the orchestrator on them. +- Run the orchestrator before snapshotting state YAML / run-log entries. +- Audit before filing defects. +- Tests pass before committing the patch they cover. +- Commit after each task or logical group, per CLAUDE.md guidance. + +### Parallel Opportunities + +- T013, T014 (spawn PROJ-261-iter2 + PROJ-262-iter2) can run in parallel — different sibling IDs. +- T009 (test_idempotency.py) and T012 (router verification) are independent and can run in parallel. +- T023, T024, T025, T026 (US2 audit subtasks) are different files / different commands — fully parallel. +- T028-T030 (US3 sha256 sequence) is sequential per sibling, but can run in parallel for PROJ-262-iter2 if you want to audit both (the spec's MVP only requires one US3 sibling). +- T033, T034, T035 (US4 induced-failure scenarios) can run in parallel — different sibling-iter IDs and different precondition mutations. **WARNING**: T035 (template rename) is mutually exclusive with T033/T034 if those need the template too. Suggest sequential execution of US4 with explicit setup/teardown per scenario. +- T044-T053 (US5 report sections) are independent within the same file — can be drafted in parallel but committed together at T053. +- T059, T060 (test + lint) can run in parallel. + +--- + +## Parallel Example: User Story 1 + +```bash +# Spawn both iter2 siblings in parallel: +python tests/phase1/sibling_project.py PROJ-261-evaluating-the-impact-of-code-duplicatio --iter 2 --start-stage validated & +python tests/phase1/sibling_project.py PROJ-262-predicting-molecular-dipole-moments-with --iter 2 --start-stage validated & +wait + +# Then run project_initializer SEQUENTIALLY on each (the orchestrator +# isn't designed for concurrent invocations on different projects, and +# Dartmouth's free tier rate-limits concurrent calls anyway): +python -m llmxive run --project PROJ-261-evaluating-the-impact-of-code-duplicatio-iter2 --max-tasks 1 +python -m llmxive run --project PROJ-262-predicting-molecular-dipole-moments-with-iter2 --max-tasks 1 +``` + +## Parallel Example: User Story 2 audits + +```bash +# All four subtasks operate on different files / different commands: +diff <(cat agents/templates/research_project_constitution.md) <(cat projects/PROJ-261-…-iter2/.specify/memory/constitution.md) > /tmp/audit-261-side-by-side.diff & +diff <(cat agents/templates/research_project_constitution.md) <(cat projects/PROJ-262-…-iter2/.specify/memory/constitution.md) > /tmp/audit-262-side-by-side.diff & +grep -F '{{' projects/PROJ-261-…-iter2/.specify/memory/constitution.md projects/PROJ-262-…-iter2/.specify/memory/constitution.md > /tmp/token-leak-check.log & +# T026 source-of-truth verification (loop over 9 mechanical files × 2 siblings) +wait +``` + +--- + +## Implementation Strategy + +### MVP First (US1 only, with Phase 1+2 prerequisites) + +1. Phase 1 (T001-T003): preflight checks +2. Phase 2 (T004-T012): the two production-code patches + idempotency tests + retry policy verification — **most-critical chunk of the spec** +3. Phase 3 (T013-T022): two iter2 siblings spawned + project_initializer run on each +4. **STOP and VALIDATE**: `cat projects/PROJ-26{1,2}-…-iter2/.specify/memory/constitution.md` and inspect manually +5. If both look reasonable: continue to Phase 4-9 for the full audit + report. If either looks broken: skip to Phase 7 iteration loop on the affected canonical. + +### Incremental Delivery + +1. Phase 1+2 → Foundation ready (the two production fixes are the most reusable artifact this spec produces; once they land, future phase-test specs (005-007) inherit them) +2. Phase 3 → MVP: project_initializer demonstrably runs end-to-end on real iter2 siblings +3. Phase 4-5 → Audit verdict: do the iter2 outputs pass their content + idempotency contracts? +4. Phase 6 → Failure paths verified: Constitution Principle V satisfied for Phase 2 +5. Phase 8-9 → Diagnostic report + carry-forward manifest (the substrate spec 005 will read) +6. Phase 10 → Issues closed, PR open, tracker updated + +### Parallel Team Strategy (single-developer fallback) + +This spec is designed for a single maintainer. The parallel opportunities listed above are advisory — single-threaded execution is fully supported and is the expected primary path. The estimated total wall-clock with single-threaded execution is ~3.5h on the happy path (per quickstart.md Step 10), up to ~9h if Phase 7 iteration triggers. + +--- + +## Notes + +- [P] tasks = different files, no dependencies on incomplete tasks within the same phase +- [Story] label maps task to specific user story for traceability per `/speckit-tasks` rules +- Each user story can be independently demonstrated to a reviewer (per spec.md's "Independent Test" sections) +- Tests in T009-T010 must pass BEFORE T011 commits — verify failure path is detected (negative control test in test_idempotency.py) +- Commit after each Phase checkpoint or logical group, per CLAUDE.md "frequent commits" guidance +- Stop at any checkpoint to validate; resume by re-reading the current Phase's task list +- Avoid: vague tasks (every task has a concrete file path), same-file conflicts (P-marked tasks verified independent), cross-story dependencies that break independence (US1-US6 mostly independent except where audit depends on output) +- Defects discovered during US1-US4 auto-trigger Phase 7 iteration loop; once Phase 7 exits (either fix or defer), Phase 8 report generation can begin +- The diagnostic report is the single source of truth for "what Phase 2 did" — every artifact, every verdict, every defect, every selection rationale lives in one Markdown file at `notes/2026-05-05-phase2-diagnostic.md` diff --git a/src/llmxive/agents/project_initializer.py b/src/llmxive/agents/project_initializer.py index e1f15fb4..cc2a57e1 100644 --- a/src/llmxive/agents/project_initializer.py +++ b/src/llmxive/agents/project_initializer.py @@ -13,7 +13,7 @@ from __future__ import annotations -from datetime import datetime, timezone +from datetime import UTC, datetime from pathlib import Path from llmxive.agents.base import Agent, AgentContext @@ -22,7 +22,6 @@ from llmxive.speckit.runner import init_speckit_in from llmxive.types import AgentRegistryEntry, Project, Stage - CONSTITUTION_TEMPLATE_PATH = "agents/templates/research_project_constitution.md" @@ -40,7 +39,7 @@ def build_messages(self, ctx: AgentContext) -> list[ChatMessage]: title = ctx.metadata.get("title", ctx.project_id) field = ctx.metadata.get("field", "general") principal = ctx.metadata.get("principal_agent_name", "flesh_out") - date = datetime.now(timezone.utc).date().isoformat() + date = datetime.now(UTC).date().isoformat() rendered_template = render_prompt( CONSTITUTION_TEMPLATE_PATH, { @@ -53,11 +52,17 @@ def build_messages(self, ctx: AgentContext) -> list[ChatMessage]: repo_root=repo, ) - idea_summary = "" - if ctx.inputs: - idea_path = repo / ctx.inputs[0] - if idea_path.exists(): - idea_summary = idea_path.read_text(encoding="utf-8") + # Fail-fast on missing idea file (P2-D03 / FR-012 / Constitution Principle V). + # The previous defensive `if idea_path.exists()` masked missing inputs and + # produced constitutions untethered from any idea body. + if not ctx.inputs: + raise FileNotFoundError( + f"project_initializer requires at least one input (idea file path); got ctx.inputs={ctx.inputs!r}" + ) + idea_path = repo / ctx.inputs[0] + if not idea_path.is_file(): + raise FileNotFoundError(f"idea seed not found: {idea_path}") + idea_summary = idea_path.read_text(encoding="utf-8") system_prompt = render_prompt( self.entry.prompt_path, @@ -84,12 +89,22 @@ def build_messages(self, ctx: AgentContext) -> list[ChatMessage]: def handle_response(self, ctx: AgentContext, response: ChatResponse) -> list[str]: repo = Path(__file__).resolve().parent.parent.parent.parent project_dir = repo / "projects" / ctx.project_id + constitution_path = project_dir / ".specify" / "memory" / "constitution.md" + + # Idempotency guard (FR-011 / spec-004 Q3): if the project already has + # a constitution, treat the entire agent invocation as a no-op for + # the constitution write. We still re-call init_speckit_in (which is + # idempotent on directories per src/llmxive/speckit/runner.py:114). + # Re-rendering a governance document silently mutates downstream + # Constitution Checks, so skip-if-exists is the safe default. + if constitution_path.is_file(): + init_speckit_in(project_dir) + return [str(constitution_path.relative_to(repo))] # Mechanical step: scaffold .specify/ inside the project. init_speckit_in(project_dir) # Write the LLM-rendered constitution. - constitution_path = project_dir / ".specify" / "memory" / "constitution.md" constitution_path.parent.mkdir(parents=True, exist_ok=True) constitution_text = response.text.strip() if not constitution_text.startswith("#"): diff --git a/src/llmxive/cli.py b/src/llmxive/cli.py index a6e3f5e2..fe1981f0 100644 --- a/src/llmxive/cli.py +++ b/src/llmxive/cli.py @@ -196,11 +196,11 @@ def _cmd_brainstorm(args: argparse.Namespace) -> int: from llmxive.backends.base import ChatMessage from llmxive.backends.router import chat_with_fallback from llmxive.state import project as project_store + from llmxive.state.project_id_lock import next_available_proj_num, project_id_lock from llmxive.types import Project, Stage repo = Path.cwd() existing_projects = project_store.list_all(repo_root=repo) - existing_ids = {p.id for p in existing_projects} existing_titles_by_field: dict[str, list[str]] = {} for p in existing_projects: existing_titles_by_field.setdefault((p.field or "general").lower(), []).append(p.title) @@ -213,9 +213,6 @@ def _cmd_brainstorm(args: argparse.Namespace) -> int: n_target = max(1, args.count) now = datetime.now(timezone.utc) - next_num = 1 - while any(p.id.startswith(f"PROJ-{next_num:03d}") for p in existing_projects): - next_num += 1 try: entry = registry_loader.get("brainstorm") @@ -289,29 +286,38 @@ def _cmd_brainstorm(args: argparse.Namespace) -> int: continue slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")[:40] or "idea" - while True: - pid = f"PROJ-{next_num:03d}-{slug}" - if pid not in existing_ids: - break - next_num += 1 - existing_ids.add(pid) - existing_titles_by_field.setdefault(field.lower(), []).append(title) - - project = Project( - id=pid, - title=title, - field=field, - current_stage=Stage.BRAINSTORMED, - points_research={}, - points_paper={}, - created_at=now, - updated_at=now, - artifact_hashes={}, - ) - project_store.save(project, repo_root=repo) - idea_dir = repo / "projects" / pid / "idea" - idea_dir.mkdir(parents=True, exist_ok=True) + # Q1B fix (spec 004): atomic project-ID allocation. Re-scan disk + # under an exclusive flock so concurrent brainstorm invocations + # cannot race to the same PROJ-NNN. Lock is held only during the + # disk-snapshot + state-YAML write (microseconds), NOT during + # the LLM call above. + with project_id_lock(repo): + n = next_available_proj_num(repo_root=repo) + pid = f"PROJ-{n:03d}-{slug}" + existing_titles_by_field.setdefault(field.lower(), []).append(title) + + project = Project( + id=pid, + title=title, + field=field, + current_stage=Stage.BRAINSTORMED, + points_research={}, + points_paper={}, + created_at=now, + updated_at=now, + artifact_hashes={}, + ) + # Eagerly write the state YAML inside the lock — this is the + # ID claim. Once this returns, next_available_proj_num() in any + # other process will see this PROJ-NNN as used. + project_store.save(project, repo_root=repo) + + idea_dir = repo / "projects" / pid / "idea" + idea_dir.mkdir(parents=True, exist_ok=True) + + # The LLM body + idea/<slug>.md write happen OUTSIDE the lock — + # the ID is already claimed, so no other process can race for it. front = ( "---\n" f"field: {field}\n" @@ -321,7 +327,6 @@ def _cmd_brainstorm(args: argparse.Namespace) -> int: ) (idea_dir / f"{slug}.md").write_text(front, encoding="utf-8") created += 1 - next_num += 1 print(f"[brainstorm] seeded {pid} ({field}) via {model_used}") print(f"[brainstorm] created {created} brainstormed project(s)") diff --git a/src/llmxive/state/project_id_lock.py b/src/llmxive/state/project_id_lock.py new file mode 100644 index 00000000..e7f294cd --- /dev/null +++ b/src/llmxive/state/project_id_lock.py @@ -0,0 +1,125 @@ +"""Concurrency-safe project ID allocation (Q1B fix from spec 004). + +Background: prior to this module, `cli._cmd_brainstorm` computed +`next_num` once at the top of the function from an in-memory snapshot +of `state/projects/`, then claimed IDs sequentially. Two concurrent +brainstorm runs (e.g., two cron jobs firing at the same time) would +each compute the same `next_num` from their independent disk snapshots +and both write `PROJ-NNN-<slug-A>.yaml` / `PROJ-NNN-<slug-B>.yaml` — +producing duplicate project numbers with different slugs (verified on +disk: PROJ-261-evaluating-... + PROJ-261-investigating-...; PROJ-262- +predicting-... + PROJ-262-quantifying-...). + +This module wraps the read-next-num + write-state-YAML critical +section in an `fcntl.flock`-protected atomic block. The lock is held +only during the disk snapshot + the state-YAML write (microseconds), +not during the LLM call (which is the long-running part). + +Lock file: `state/.brainstorm.lock`. Lock is exclusive (LOCK_EX). +On non-POSIX platforms (Windows), `fcntl` is unavailable — the lock +falls back to a no-op + a logged warning. (llmXive is POSIX-only per +the spec; Windows fallback is defense-in-depth.) +""" + +from __future__ import annotations + +import contextlib +import os +import sys +from pathlib import Path +from typing import Iterator + + +def _lock_path(repo_root: Path) -> Path: + return repo_root / "state" / ".brainstorm.lock" + + +@contextlib.contextmanager +def project_id_lock(repo_root: Path) -> Iterator[None]: + """Hold an exclusive lock on `state/.brainstorm.lock` for the + duration of the with-block. + + On POSIX, uses `fcntl.flock(LOCK_EX)`. On non-POSIX, no-op (logs a + warning to stderr). + """ + lock_file = _lock_path(repo_root) + lock_file.parent.mkdir(parents=True, exist_ok=True) + + try: + import fcntl # type: ignore[import-not-found] + except ImportError: + print( + "[project_id_lock] fcntl unavailable (non-POSIX?); " + "concurrent-safety NOT enforced", + file=sys.stderr, + ) + yield + return + + fd = os.open(str(lock_file), os.O_CREAT | os.O_RDWR, 0o644) + try: + fcntl.flock(fd, fcntl.LOCK_EX) + try: + yield + finally: + fcntl.flock(fd, fcntl.LOCK_UN) + finally: + os.close(fd) + + +def next_available_proj_num( + *, + repo_root: Path, + starting_num: int = 1, +) -> int: + """Scan `state/projects/` and `projects/` from disk and return the + smallest `n` >= starting_num such that no `PROJ-NNN-*` exists. + + MUST be called inside `project_id_lock(repo_root)` to be safe + against concurrent invocations. (This function does NOT take the + lock itself — the caller controls the critical-section boundary.) + """ + state_dir = repo_root / "state" / "projects" + projects_dir = repo_root / "projects" + + used: set[int] = set() + if state_dir.is_dir(): + for child in state_dir.iterdir(): + if child.suffix != ".yaml": + continue + stem = child.stem # e.g., "PROJ-261-evaluating-..." + n = _extract_num(stem) + if n is not None: + used.add(n) + if projects_dir.is_dir(): + for child in projects_dir.iterdir(): + if not child.is_dir(): + continue + n = _extract_num(child.name) + if n is not None: + used.add(n) + + n = max(starting_num, 1) + while n in used: + n += 1 + return n + + +def _extract_num(name: str) -> int | None: + """Parse 'PROJ-NNN-...' (or 'PROJ-NNN-..-iter2') and return NNN + as int, or None if not parseable. + + Per the post spec-004 convention, `-iterN` siblings are deprecated + but historic ones may still appear in `state/projects/` snapshots + on older branches. We treat them as occupying their canonical + PROJ-NNN slot too (defensive). + """ + if not name.startswith("PROJ-"): + return None + parts = name.split("-") + if len(parts) < 2: + return None + try: + return int(parts[1]) + except ValueError: + return None diff --git a/state/projects/PROJ-267-predicting-plant-stress-response-from-pu-iter2.history.jsonl b/state/projects/PROJ-267-predicting-plant-stress-response-from-pu-iter2.history.jsonl deleted file mode 100644 index 5d83dd10..00000000 --- a/state/projects/PROJ-267-predicting-plant-stress-response-from-pu-iter2.history.jsonl +++ /dev/null @@ -1 +0,0 @@ -{"at": "2026-05-05T03:17:49.372526+00:00", "from_stage": "brainstormed", "last_run_id": "c768854b-f65b-41d6-a9cf-bb6877744ba2", "to_stage": "flesh_out_complete"} diff --git a/state/projects/PROJ-267-predicting-plant-stress-response-from-pu-iter2.yaml b/state/projects/PROJ-267-predicting-plant-stress-response-from-pu-iter2.yaml deleted file mode 100644 index 75627139..00000000 --- a/state/projects/PROJ-267-predicting-plant-stress-response-from-pu-iter2.yaml +++ /dev/null @@ -1,17 +0,0 @@ -artifact_hashes: {} -assigned_agent: null -created_at: '2026-05-04T20:50:00Z' -current_stage: flesh_out_complete -failed_stage: null -field: biology -human_escalation_reason: null -id: PROJ-267-predicting-plant-stress-response-from-pu-iter2 -last_run_id: c768854b-f65b-41d6-a9cf-bb6877744ba2 -last_run_status: null -points_paper: {} -points_research: {} -revision_round: 0 -speckit_paper_dir: null -speckit_research_dir: null -title: Predicting Plant Stress Response from Publicly Available Proteomic Data -updated_at: '2026-05-05T03:17:49.371068Z' diff --git a/state/projects/PROJ-261-investigating-the-correlation-between-gu.history.jsonl b/state/projects/PROJ-331-investigating-the-correlation-between-gu.history.jsonl similarity index 100% rename from state/projects/PROJ-261-investigating-the-correlation-between-gu.history.jsonl rename to state/projects/PROJ-331-investigating-the-correlation-between-gu.history.jsonl diff --git a/state/projects/PROJ-261-investigating-the-correlation-between-gu.yaml b/state/projects/PROJ-331-investigating-the-correlation-between-gu.yaml similarity index 90% rename from state/projects/PROJ-261-investigating-the-correlation-between-gu.yaml rename to state/projects/PROJ-331-investigating-the-correlation-between-gu.yaml index e43262da..25ea7a4c 100644 --- a/state/projects/PROJ-261-investigating-the-correlation-between-gu.yaml +++ b/state/projects/PROJ-331-investigating-the-correlation-between-gu.yaml @@ -5,7 +5,7 @@ current_stage: flesh_out_complete failed_stage: null field: biology human_escalation_reason: null -id: PROJ-261-investigating-the-correlation-between-gu +id: PROJ-331-investigating-the-correlation-between-gu last_run_id: 508640a5-1b2d-414b-9c99-d06777c6d08d last_run_status: null points_paper: {} diff --git a/state/projects/PROJ-262-quantifying-the-impact-of-magnetic-field.history.jsonl b/state/projects/PROJ-332-quantifying-the-impact-of-magnetic-field.history.jsonl similarity index 100% rename from state/projects/PROJ-262-quantifying-the-impact-of-magnetic-field.history.jsonl rename to state/projects/PROJ-332-quantifying-the-impact-of-magnetic-field.history.jsonl diff --git a/state/projects/PROJ-262-quantifying-the-impact-of-magnetic-field.yaml b/state/projects/PROJ-332-quantifying-the-impact-of-magnetic-field.yaml similarity index 89% rename from state/projects/PROJ-262-quantifying-the-impact-of-magnetic-field.yaml rename to state/projects/PROJ-332-quantifying-the-impact-of-magnetic-field.yaml index 91372d3b..9aa002aa 100644 --- a/state/projects/PROJ-262-quantifying-the-impact-of-magnetic-field.yaml +++ b/state/projects/PROJ-332-quantifying-the-impact-of-magnetic-field.yaml @@ -5,7 +5,7 @@ current_stage: flesh_out_complete failed_stage: null field: physics human_escalation_reason: null -id: PROJ-262-quantifying-the-impact-of-magnetic-field +id: PROJ-332-quantifying-the-impact-of-magnetic-field last_run_id: c7a3245e-9097-4157-8187-a200a1853e3f last_run_status: null points_paper: {} diff --git a/state/run-log/2026-05/1a3726e9-d840-4ca3-ab1e-f6d5205b00d7.jsonl b/state/run-log/2026-05/1a3726e9-d840-4ca3-ab1e-f6d5205b00d7.jsonl new file mode 100644 index 00000000..bf80d02d --- /dev/null +++ b/state/run-log/2026-05/1a3726e9-d840-4ca3-ab1e-f6d5205b00d7.jsonl @@ -0,0 +1 @@ +{"agent_name": "project_initializer", "backend": "dartmouth", "cost_estimate_usd": 0.0, "ended_at": "2026-05-06T01:45:34.523718Z", "entry_id": "17cde5e1-3117-4185-b56d-1585b95d9945", "failure_reason": "FileNotFoundError: prompt template not found: /Users/jmanning/llmXive/agents/templates/research_project_constitution.md", "inputs": ["projects/PROJ-261-evaluating-the-impact-of-code-duplicatio-iter5/idea/evaluating-the-impact-of-code-duplicatio.md"], "model_name": "qwen.qwen3.5-122b", "outcome": "failed", "outputs": [], "parent_entry_id": null, "project_id": "PROJ-261-evaluating-the-impact-of-code-duplicatio-iter5", "prompt_version": "1.1.0", "run_id": "1a3726e9-d840-4ca3-ab1e-f6d5205b00d7", "started_at": "2026-05-06T01:45:34.523648Z", "task_id": "8ace448d-35d1-4f2c-b20a-fa46053b2abe"} diff --git a/state/run-log/2026-05/483efca9-fe92-45d1-a10f-48c5d12bf35f.jsonl b/state/run-log/2026-05/483efca9-fe92-45d1-a10f-48c5d12bf35f.jsonl new file mode 100644 index 00000000..35722029 --- /dev/null +++ b/state/run-log/2026-05/483efca9-fe92-45d1-a10f-48c5d12bf35f.jsonl @@ -0,0 +1 @@ +{"agent_name": "project_initializer", "backend": "dartmouth", "cost_estimate_usd": 0.0, "ended_at": "2026-05-06T01:42:06.787253Z", "entry_id": "152ac899-ff12-40b4-bb51-820e302e8157", "failure_reason": null, "inputs": ["projects/PROJ-261-evaluating-the-impact-of-code-duplicatio-iter3/idea/evaluating-the-impact-of-code-duplicatio.md"], "model_name": "qwen.qwen3.5-122b", "outcome": "success", "outputs": ["projects/PROJ-261-evaluating-the-impact-of-code-duplicatio-iter3/.specify/memory/constitution.md"], "parent_entry_id": null, "project_id": "PROJ-261-evaluating-the-impact-of-code-duplicatio-iter3", "prompt_version": "1.1.0", "run_id": "483efca9-fe92-45d1-a10f-48c5d12bf35f", "started_at": "2026-05-06T01:40:40.225748Z", "task_id": "1e1b2133-95ae-4d96-96a7-880996257d8e"} diff --git a/state/run-log/2026-05/4a04a919-0a1c-46f9-a9a3-fab5a96200ce.jsonl b/state/run-log/2026-05/4a04a919-0a1c-46f9-a9a3-fab5a96200ce.jsonl new file mode 100644 index 00000000..f5dc1015 --- /dev/null +++ b/state/run-log/2026-05/4a04a919-0a1c-46f9-a9a3-fab5a96200ce.jsonl @@ -0,0 +1 @@ +{"agent_name": "project_initializer", "backend": "dartmouth", "cost_estimate_usd": 0.0, "ended_at": "2026-05-06T01:37:45.360194Z", "entry_id": "21b4e5e1-e85a-478f-b66a-a09cfc6acf23", "failure_reason": null, "inputs": ["projects/PROJ-262-predicting-molecular-dipole-moments-with-iter2/idea/predicting-molecular-dipole-moments-with.md"], "model_name": "qwen.qwen3.5-122b", "outcome": "success", "outputs": ["projects/PROJ-262-predicting-molecular-dipole-moments-with-iter2/.specify/memory/constitution.md"], "parent_entry_id": null, "project_id": "PROJ-262-predicting-molecular-dipole-moments-with-iter2", "prompt_version": "1.0.0", "run_id": "4a04a919-0a1c-46f9-a9a3-fab5a96200ce", "started_at": "2026-05-06T01:36:33.062008Z", "task_id": "072cd3e0-f357-4404-b1e7-764d8ad11ef7"} diff --git a/state/run-log/2026-05/508640a5-1b2d-414b-9c99-d06777c6d08d.jsonl b/state/run-log/2026-05/508640a5-1b2d-414b-9c99-d06777c6d08d.jsonl index 9688bb8b..30c4adc6 100644 --- a/state/run-log/2026-05/508640a5-1b2d-414b-9c99-d06777c6d08d.jsonl +++ b/state/run-log/2026-05/508640a5-1b2d-414b-9c99-d06777c6d08d.jsonl @@ -1 +1 @@ -{"agent_name": "flesh_out", "backend": "dartmouth", "cost_estimate_usd": 0.0, "ended_at": "2026-05-04T16:43:49.940816Z", "entry_id": "c8a46590-fbd3-4970-9cd4-9b6eb9005b96", "failure_reason": null, "inputs": ["projects/PROJ-261-investigating-the-correlation-between-gu/idea/investigating-the-correlation-between-gu.md"], "model_name": "qwen.qwen3.5-122b", "outcome": "success", "outputs": ["projects/PROJ-261-investigating-the-correlation-between-gu/idea/investigating-the-correlation-between-gu.md"], "parent_entry_id": null, "project_id": "PROJ-261-investigating-the-correlation-between-gu", "prompt_version": "1.0.0", "run_id": "508640a5-1b2d-414b-9c99-d06777c6d08d", "started_at": "2026-05-04T16:42:49.629158Z", "task_id": "0e00509b-4691-4ff5-b896-3c3093abcb59"} +{"agent_name": "flesh_out", "backend": "dartmouth", "cost_estimate_usd": 0.0, "ended_at": "2026-05-04T16:43:49.940816Z", "entry_id": "c8a46590-fbd3-4970-9cd4-9b6eb9005b96", "failure_reason": null, "inputs": ["projects/PROJ-331-investigating-the-correlation-between-gu/idea/investigating-the-correlation-between-gu.md"], "model_name": "qwen.qwen3.5-122b", "outcome": "success", "outputs": ["projects/PROJ-331-investigating-the-correlation-between-gu/idea/investigating-the-correlation-between-gu.md"], "parent_entry_id": null, "project_id": "PROJ-331-investigating-the-correlation-between-gu", "prompt_version": "1.0.0", "run_id": "508640a5-1b2d-414b-9c99-d06777c6d08d", "started_at": "2026-05-04T16:42:49.629158Z", "task_id": "0e00509b-4691-4ff5-b896-3c3093abcb59"} diff --git a/state/run-log/2026-05/5e482333-b8a0-4b87-914a-e9053bb89b15.jsonl b/state/run-log/2026-05/5e482333-b8a0-4b87-914a-e9053bb89b15.jsonl new file mode 100644 index 00000000..f34c37d4 --- /dev/null +++ b/state/run-log/2026-05/5e482333-b8a0-4b87-914a-e9053bb89b15.jsonl @@ -0,0 +1 @@ +{"agent_name": "project_initializer", "backend": "dartmouth", "cost_estimate_usd": 0.0, "ended_at": "2026-05-06T01:45:15.413732Z", "entry_id": "03132c7b-abb5-4b2c-b9f3-997ea8877758", "failure_reason": "FileNotFoundError: project_initializer requires at least one input (idea file path); got ctx.inputs=[]", "inputs": [], "model_name": "qwen.qwen3.5-122b", "outcome": "failed", "outputs": [], "parent_entry_id": null, "project_id": "PROJ-262-predicting-molecular-dipole-moments-with-iter4", "prompt_version": "1.1.0", "run_id": "5e482333-b8a0-4b87-914a-e9053bb89b15", "started_at": "2026-05-06T01:45:15.413572Z", "task_id": "4b7e984f-1e15-4677-aac0-6968b9030848"} diff --git a/state/run-log/2026-05/88740a04-00c2-4162-aae3-df1e571814ec.jsonl b/state/run-log/2026-05/88740a04-00c2-4162-aae3-df1e571814ec.jsonl new file mode 100644 index 00000000..7899ed8a --- /dev/null +++ b/state/run-log/2026-05/88740a04-00c2-4162-aae3-df1e571814ec.jsonl @@ -0,0 +1 @@ +{"agent_name": "project_initializer", "backend": "dartmouth", "cost_estimate_usd": 0.0, "ended_at": "2026-05-06T01:43:40.900943Z", "entry_id": "a4f63ff5-6444-4f98-a470-b6f249046cd4", "failure_reason": null, "inputs": ["projects/PROJ-262-predicting-molecular-dipole-moments-with-iter3/idea/predicting-molecular-dipole-moments-with.md"], "model_name": "qwen.qwen3.5-122b", "outcome": "success", "outputs": ["projects/PROJ-262-predicting-molecular-dipole-moments-with-iter3/.specify/memory/constitution.md"], "parent_entry_id": null, "project_id": "PROJ-262-predicting-molecular-dipole-moments-with-iter3", "prompt_version": "1.1.0", "run_id": "88740a04-00c2-4162-aae3-df1e571814ec", "started_at": "2026-05-06T01:42:11.856990Z", "task_id": "0c5b037e-6851-4b9e-9107-b3e9bc809631"} diff --git a/state/run-log/2026-05/a09d531a-16d3-4d72-ab08-b24897becc30.jsonl b/state/run-log/2026-05/a09d531a-16d3-4d72-ab08-b24897becc30.jsonl new file mode 100644 index 00000000..2825bd45 --- /dev/null +++ b/state/run-log/2026-05/a09d531a-16d3-4d72-ab08-b24897becc30.jsonl @@ -0,0 +1 @@ +{"agent_name": "project_initializer", "backend": "dartmouth", "cost_estimate_usd": 0.0, "ended_at": "2026-05-06T04:16:58.430919Z", "entry_id": "66895766-4f21-432d-bde1-d9c4aeedc70a", "failure_reason": null, "inputs": ["projects/PROJ-262-predicting-molecular-dipole-moments-with-iter6/idea/predicting-molecular-dipole-moments-with.md"], "model_name": "qwen.qwen3.5-122b", "outcome": "success", "outputs": ["projects/PROJ-262-predicting-molecular-dipole-moments-with-iter6/.specify/memory/constitution.md"], "parent_entry_id": null, "project_id": "PROJ-262-predicting-molecular-dipole-moments-with-iter6", "prompt_version": "1.2.0", "run_id": "a09d531a-16d3-4d72-ab08-b24897becc30", "started_at": "2026-05-06T04:15:52.349878Z", "task_id": "5c35d74d-ccdd-40dc-8765-b711c624e098"} diff --git a/state/run-log/2026-05/a0c232b3-5868-46c7-85c0-38558d483a71.jsonl b/state/run-log/2026-05/a0c232b3-5868-46c7-85c0-38558d483a71.jsonl new file mode 100644 index 00000000..135bd460 --- /dev/null +++ b/state/run-log/2026-05/a0c232b3-5868-46c7-85c0-38558d483a71.jsonl @@ -0,0 +1 @@ +{"agent_name": "project_initializer", "backend": "dartmouth", "cost_estimate_usd": 0.0, "ended_at": "2026-05-06T01:44:57.810596Z", "entry_id": "0a9187be-c866-4350-a6ab-41445e939916", "failure_reason": "BackendError: every backend in chain ['dartmouth', 'huggingface', 'local'] failed; errors: dartmouth/qwen.qwen3.5-122b(permanent): 'API key invalid!' | huggingface/qwen.qwen3.5-122b(permanent): HF_TOKEN is not set (required by HF backend) | local/qwen.qwen3.5-122b(permanent): transformers is not installed; required by local backend", "inputs": ["projects/PROJ-261-evaluating-the-impact-of-code-duplicatio-iter4/idea/evaluating-the-impact-of-code-duplicatio.md"], "model_name": "qwen.qwen3.5-122b", "outcome": "failed", "outputs": [], "parent_entry_id": null, "project_id": "PROJ-261-evaluating-the-impact-of-code-duplicatio-iter4", "prompt_version": "1.1.0", "run_id": "a0c232b3-5868-46c7-85c0-38558d483a71", "started_at": "2026-05-06T01:44:56.987321Z", "task_id": "14c3d0c5-302b-4fe7-b670-e49c2b9765a5"} diff --git a/state/run-log/2026-05/c7a3245e-9097-4157-8187-a200a1853e3f.jsonl b/state/run-log/2026-05/c7a3245e-9097-4157-8187-a200a1853e3f.jsonl index 50015fc6..05dbfed9 100644 --- a/state/run-log/2026-05/c7a3245e-9097-4157-8187-a200a1853e3f.jsonl +++ b/state/run-log/2026-05/c7a3245e-9097-4157-8187-a200a1853e3f.jsonl @@ -1 +1 @@ -{"agent_name": "flesh_out", "backend": "dartmouth", "cost_estimate_usd": 0.0, "ended_at": "2026-05-04T16:44:53.675087Z", "entry_id": "15e3923a-a927-47e1-afda-94506faa7138", "failure_reason": null, "inputs": ["projects/PROJ-262-quantifying-the-impact-of-magnetic-field/idea/quantifying-the-impact-of-magnetic-field.md"], "model_name": "qwen.qwen3.5-122b", "outcome": "success", "outputs": ["projects/PROJ-262-quantifying-the-impact-of-magnetic-field/idea/quantifying-the-impact-of-magnetic-field.md"], "parent_entry_id": null, "project_id": "PROJ-262-quantifying-the-impact-of-magnetic-field", "prompt_version": "1.0.0", "run_id": "c7a3245e-9097-4157-8187-a200a1853e3f", "started_at": "2026-05-04T16:43:50.327233Z", "task_id": "b6393f67-7fa2-4d1a-b23d-814c0589706e"} +{"agent_name": "flesh_out", "backend": "dartmouth", "cost_estimate_usd": 0.0, "ended_at": "2026-05-04T16:44:53.675087Z", "entry_id": "15e3923a-a927-47e1-afda-94506faa7138", "failure_reason": null, "inputs": ["projects/PROJ-332-quantifying-the-impact-of-magnetic-field/idea/quantifying-the-impact-of-magnetic-field.md"], "model_name": "qwen.qwen3.5-122b", "outcome": "success", "outputs": ["projects/PROJ-332-quantifying-the-impact-of-magnetic-field/idea/quantifying-the-impact-of-magnetic-field.md"], "parent_entry_id": null, "project_id": "PROJ-332-quantifying-the-impact-of-magnetic-field", "prompt_version": "1.0.0", "run_id": "c7a3245e-9097-4157-8187-a200a1853e3f", "started_at": "2026-05-04T16:43:50.327233Z", "task_id": "b6393f67-7fa2-4d1a-b23d-814c0589706e"} diff --git a/state/run-log/2026-05/e7cc764f-8e5d-4887-81df-d71790622db6.jsonl b/state/run-log/2026-05/e7cc764f-8e5d-4887-81df-d71790622db6.jsonl new file mode 100644 index 00000000..746f7492 --- /dev/null +++ b/state/run-log/2026-05/e7cc764f-8e5d-4887-81df-d71790622db6.jsonl @@ -0,0 +1 @@ +{"agent_name": "project_initializer", "backend": "dartmouth", "cost_estimate_usd": 0.0, "ended_at": "2026-05-06T04:15:47.961029Z", "entry_id": "f985d289-2f0d-4a95-8dff-53cf7125b555", "failure_reason": null, "inputs": ["projects/PROJ-261-evaluating-the-impact-of-code-duplicatio-iter6/idea/evaluating-the-impact-of-code-duplicatio.md"], "model_name": "qwen.qwen3.5-122b", "outcome": "success", "outputs": ["projects/PROJ-261-evaluating-the-impact-of-code-duplicatio-iter6/.specify/memory/constitution.md"], "parent_entry_id": null, "project_id": "PROJ-261-evaluating-the-impact-of-code-duplicatio-iter6", "prompt_version": "1.2.0", "run_id": "e7cc764f-8e5d-4887-81df-d71790622db6", "started_at": "2026-05-06T04:14:03.354644Z", "task_id": "a1a4ab3f-f12c-4c19-b22e-8860605c66bc"} diff --git a/state/run-log/2026-05/e9a3dfce-8435-455f-bf7a-8e4206ffb754.jsonl b/state/run-log/2026-05/e9a3dfce-8435-455f-bf7a-8e4206ffb754.jsonl new file mode 100644 index 00000000..3912c799 --- /dev/null +++ b/state/run-log/2026-05/e9a3dfce-8435-455f-bf7a-8e4206ffb754.jsonl @@ -0,0 +1 @@ +{"agent_name": "project_initializer", "backend": "dartmouth", "cost_estimate_usd": 0.0, "ended_at": "2026-05-06T01:36:28.619215Z", "entry_id": "0f1509ea-3f6b-4121-abf7-3a57874f2279", "failure_reason": null, "inputs": ["projects/PROJ-261-evaluating-the-impact-of-code-duplicatio-iter2/idea/evaluating-the-impact-of-code-duplicatio.md"], "model_name": "qwen.qwen3.5-122b", "outcome": "success", "outputs": ["projects/PROJ-261-evaluating-the-impact-of-code-duplicatio-iter2/.specify/memory/constitution.md"], "parent_entry_id": null, "project_id": "PROJ-261-evaluating-the-impact-of-code-duplicatio-iter2", "prompt_version": "1.0.0", "run_id": "e9a3dfce-8435-455f-bf7a-8e4206ffb754", "started_at": "2026-05-06T01:35:25.536741Z", "task_id": "60aceaed-3295-49bb-af12-779613877485"} diff --git a/tests/phase1/sibling_project.py b/tests/phase1/sibling_project.py index 883c5976..f3dec43d 100644 --- a/tests/phase1/sibling_project.py +++ b/tests/phase1/sibling_project.py @@ -1,6 +1,15 @@ """Phase 1 sibling project spawner. -Implements the contract at +⚠️ DEPRECATED post spec 004 (2026-05-06): the sibling-iteration pattern +was retired in favor of in-place iteration on canonical projects, with +git history (commits + log notes) tracking the iteration trail. The +proliferation of ``PROJ-NNN-<slug>-iterN`` directories produced messy +project trees with no offsetting benefit. This file is preserved for +spec 003's historical reproducibility, but new phase-test specs MUST +NOT call it. See ``notes/2026-05-06-iteration-convention-change.md`` +for rationale. + +Original contract: ``specs/003-phase1-idea-lifecycle-testing/contracts/sibling-project.md``. Spawns ``PROJ-NNN-<slug>-iterN`` from canonical ``PROJ-NNN-<slug>``: @@ -33,7 +42,7 @@ STATE_DIR = PROJECT_ROOT / "state" / "projects" PROJ_ID_RE = re.compile(r"^PROJ-\d{3}-[a-z0-9-]{1,50}$") -ALLOWED_START_STAGES = {"brainstormed", "flesh_out_in_progress", "flesh_out_complete"} +ALLOWED_START_STAGES = {"brainstormed", "flesh_out_in_progress", "flesh_out_complete", "validated"} def _now_iso() -> str: diff --git a/tests/phase1/test_idempotency.py b/tests/phase1/test_idempotency.py new file mode 100644 index 00000000..e2a7ce18 --- /dev/null +++ b/tests/phase1/test_idempotency.py @@ -0,0 +1,316 @@ +"""Phase 2 idempotency tests (FR-011 / SC-009 / spec 004 US3 acceptance scenarios). + +Verifies: + 1. `init_speckit_in` is byte-idempotent on a complete .specify/ tree + (templates + scripts) on a second invocation. + 2. The skip-if-exists guard at + ``src/llmxive/agents/project_initializer.py:handle_response`` leaves + a pre-existing ``.specify/memory/constitution.md`` byte-unchanged + when the agent is re-invoked. (Per spec 004 Q3 clarification — the + constitution is a governance document; re-rendering it silently + mutates downstream Constitution Checks.) + 3. The negative-control: on a fresh project_dir, the agent DOES write + the constitution from the LLM response (skip-if-exists guard + doesn't break the happy path). + +Per Constitution Principle III: real filesystem (pytest tmp_path), no +mocks. Per Principle V: tests fail fast on any byte-level divergence. +""" + +from __future__ import annotations + +import hashlib +from pathlib import Path + +import pytest + +from llmxive.agents.base import AgentContext +from llmxive.agents.project_initializer import ProjectInitializerAgent +from llmxive.backends.base import ChatResponse +from llmxive.speckit.runner import init_speckit_in +from llmxive.types import AgentRegistryEntry + +PROJECT_ROOT = Path(__file__).resolve().parents[2] + + +def _sha256_tree(root: Path) -> dict[str, str]: + """Return {relpath_str: sha256_hex} for every regular file under ``root``.""" + out: dict[str, str] = {} + for p in sorted(root.rglob("*")): + if p.is_file(): + out[str(p.relative_to(root))] = hashlib.sha256(p.read_bytes()).hexdigest() + return out + + +def _make_registry_entry() -> AgentRegistryEntry: + """Construct the same registry entry the production runner builds for + project_initializer. Mirrors agents/registry.yaml lines 83-97.""" + return AgentRegistryEntry( + name="project_initializer", + purpose="Bootstrap a per-project Spec Kit scaffold and render a project constitution.", + inputs=["idea"], + outputs=["project_state"], + prompt_path="agents/prompts/project_initializer.md", + prompt_version="1.0.0", + default_backend="dartmouth", + fallback_backends=["huggingface", "local"], + default_model="qwen.qwen3.5-122b", + wall_clock_budget_seconds=300, + paid_opt_in=False, + ) + + +def test_init_speckit_in_idempotent_on_complete_tree(tmp_path: Path) -> None: + """SC-009 first half: scaffold tree byte-identical after second init.""" + project_dir = tmp_path / "PROJ-999-idem-test" + init_speckit_in(project_dir) + + specify_dir = project_dir / ".specify" + assert specify_dir.is_dir(), "init_speckit_in must create .specify/" + assert (specify_dir / "templates").is_dir() + assert (specify_dir / "scripts").is_dir() + assert (specify_dir / "memory").is_dir() + + before = _sha256_tree(specify_dir) + init_speckit_in(project_dir) + after = _sha256_tree(specify_dir) + assert before == after, ( + f"init_speckit_in is NOT idempotent at file-content level. " + f"Diverged keys: {sorted(set(before) ^ set(after)) or '(none)'}, " + f"changed values: {[k for k in (set(before) & set(after)) if before[k] != after[k]]}" + ) + + +def test_project_initializer_skips_existing_constitution( + tmp_path: Path, monkeypatch: pytest.MonkeyPatch +) -> None: + """US3 acceptance scenario 2: re-running the agent on a project with a + pre-existing constitution must NOT overwrite it (skip-if-exists guard). + + Strategy: monkeypatch the module-level ``__file__`` so the agent computes + a tmp_path-rooted ``repo`` and creates ``projects/<id>/`` there. + """ + # Construct a fake repo skeleton where Path(__file__).parent.parent.parent.parent + # resolves to a tmp_path-rooted directory. + fake_repo = tmp_path / "fake-repo" + fake_module_dir = fake_repo / "src" / "llmxive" / "agents" + fake_module_dir.mkdir(parents=True, exist_ok=True) + fake_module_file = fake_module_dir / "project_initializer.py" + fake_module_file.write_text("# placeholder", encoding="utf-8") + + # Also mirror the agents/templates and agents/prompts under the fake repo + # so render_prompt(...) can find them. (We actually skip render_prompt + # entirely by exercising only handle_response, which doesn't read those + # files when the constitution already exists — the skip-if-exists branch + # is hit BEFORE any template-reading.) + (fake_repo / ".specify").mkdir() + (fake_repo / ".specify" / "scripts").mkdir() + (fake_repo / ".specify" / "templates").mkdir() + # Copy the real meta-system into the fake repo so init_speckit_in can mirror it. + import shutil + + real_specify = PROJECT_ROOT / ".specify" + for sub in ("scripts", "templates"): + src = real_specify / sub + dst = fake_repo / ".specify" / sub + if dst.exists(): + shutil.rmtree(dst) + shutil.copytree(src, dst) + + # Pre-stage the project with an existing constitution. + project_id = "PROJ-test-skip-iter1" + project_dir = fake_repo / "projects" / project_id + constitution_path = project_dir / ".specify" / "memory" / "constitution.md" + constitution_path.parent.mkdir(parents=True, exist_ok=True) + pre_existing_text = ( + "# Test Constitution — Research Project Constitution\n\n" + "(deliberately distinct from any LLM output to detect overwrites)\n\n" + "**Project ID**: PROJ-test-skip-iter1 | **Field**: testing | **Ratified**: 2026-05-05\n" + ) + constitution_path.write_text(pre_existing_text, encoding="utf-8") + pre_hash = hashlib.sha256(constitution_path.read_bytes()).hexdigest() + + # Monkeypatch project_initializer's __file__ so its repo calculation + # lands inside our fake_repo. + import llmxive.agents.project_initializer as pi_mod + + monkeypatch.setattr(pi_mod, "__file__", str(fake_module_file)) + + # Construct agent + ctx + a synthetic ChatResponse whose text would + # OVERWRITE the constitution if the guard were broken. + entry = _make_registry_entry() + agent = ProjectInitializerAgent(entry) + ctx = AgentContext( + project_id=project_id, + run_id="test-run-skip", + task_id="test-task-skip", + inputs=[], # not consulted on the skip-if-exists branch + metadata={ + "title": "Test", + "field": "testing", + "principal_agent_name": "flesh_out", + }, + ) + response = ChatResponse( + text="# DIFFERENT Constitution\n\nThis would corrupt a real constitution.\n", + model="qwen.qwen3.5-122b", + backend="dartmouth", + cost_estimate_usd=0.0, + ) + + result = agent.handle_response(ctx, response) + post_hash = hashlib.sha256(constitution_path.read_bytes()).hexdigest() + + assert pre_hash == post_hash, ( + "skip-if-exists guard FAILED: constitution was overwritten on re-invocation. " + f"pre={pre_hash[:12]}... post={post_hash[:12]}..." + ) + # The agent must still return the constitution path (so the orchestrator's + # state-machine sees a valid output artifact and doesn't treat the no-op + # as a failure). + assert result, "handle_response must return a non-empty output list" + assert any("constitution.md" in p for p in result), result + + +def test_project_initializer_writes_on_first_invocation( + tmp_path: Path, monkeypatch: pytest.MonkeyPatch +) -> None: + """Negative control: with no pre-existing constitution, the agent MUST + write the LLM response. Ensures the skip-if-exists guard didn't break + the happy path. + """ + # Reuse the same fake-repo strategy. + fake_repo = tmp_path / "fake-repo" + fake_module_dir = fake_repo / "src" / "llmxive" / "agents" + fake_module_dir.mkdir(parents=True, exist_ok=True) + fake_module_file = fake_module_dir / "project_initializer.py" + fake_module_file.write_text("# placeholder", encoding="utf-8") + + import shutil + + real_specify = PROJECT_ROOT / ".specify" + (fake_repo / ".specify").mkdir(exist_ok=True) + for sub in ("scripts", "templates"): + src = real_specify / sub + dst = fake_repo / ".specify" / sub + if dst.exists(): + shutil.rmtree(dst) + shutil.copytree(src, dst) + + project_id = "PROJ-test-fresh-iter1" + # Pre-create the project_dir but NOT the constitution file. + (fake_repo / "projects" / project_id).mkdir(parents=True) + + import llmxive.agents.project_initializer as pi_mod + + monkeypatch.setattr(pi_mod, "__file__", str(fake_module_file)) + + entry = _make_registry_entry() + agent = ProjectInitializerAgent(entry) + ctx = AgentContext( + project_id=project_id, + run_id="test-run-fresh", + task_id="test-task-fresh", + inputs=[], + metadata={ + "title": "Test", + "field": "testing", + "principal_agent_name": "flesh_out", + }, + ) + expected_text = ( + "# Fresh Constitution — Research Project Constitution\n\n" + "**Project ID**: PROJ-test-fresh-iter1 | **Field**: testing | **Ratified**: 2026-05-05\n" + ) + response = ChatResponse( + text=expected_text, + model="qwen.qwen3.5-122b", + backend="dartmouth", + cost_estimate_usd=0.0, + ) + + agent.handle_response(ctx, response) + constitution_path = fake_repo / "projects" / project_id / ".specify" / "memory" / "constitution.md" + + assert constitution_path.is_file(), "agent must write constitution on first invocation" + written = constitution_path.read_text(encoding="utf-8") + # The agent strips and appends a trailing newline; assert content matches. + assert written.startswith("# Fresh Constitution"), ( + f"constitution content unexpected: {written[:100]!r}" + ) + + +def test_full_tree_idempotent_after_two_agent_invocations( + tmp_path: Path, monkeypatch: pytest.MonkeyPatch +) -> None: + """SC-009 end-to-end: two consecutive handle_response calls on the same + project_dir leave the FULL .specify/ tree (constitution + 9 mechanical + files) byte-identical at file-content level. + """ + fake_repo = tmp_path / "fake-repo" + fake_module_dir = fake_repo / "src" / "llmxive" / "agents" + fake_module_dir.mkdir(parents=True, exist_ok=True) + fake_module_file = fake_module_dir / "project_initializer.py" + fake_module_file.write_text("# placeholder", encoding="utf-8") + + import shutil + + real_specify = PROJECT_ROOT / ".specify" + (fake_repo / ".specify").mkdir(exist_ok=True) + for sub in ("scripts", "templates"): + src = real_specify / sub + dst = fake_repo / ".specify" / sub + if dst.exists(): + shutil.rmtree(dst) + shutil.copytree(src, dst) + + project_id = "PROJ-test-fulltree-iter1" + (fake_repo / "projects" / project_id).mkdir(parents=True) + + import llmxive.agents.project_initializer as pi_mod + + monkeypatch.setattr(pi_mod, "__file__", str(fake_module_file)) + + entry = _make_registry_entry() + agent = ProjectInitializerAgent(entry) + ctx = AgentContext( + project_id=project_id, + run_id="test-run-fulltree", + task_id="test-task-fulltree", + inputs=[], + metadata={ + "title": "Test", + "field": "testing", + "principal_agent_name": "flesh_out", + }, + ) + response_1 = ChatResponse( + text=( + "# Fulltree Constitution — Research Project Constitution\n\n" + "**Project ID**: PROJ-test-fulltree-iter1 | **Field**: testing | **Ratified**: 2026-05-05\n" + ), + model="qwen.qwen3.5-122b", + backend="dartmouth", + cost_estimate_usd=0.0, + ) + response_2 = ChatResponse( + text=( + "# DIFFERENT Constitution\n\n" + "would mutate the governance file if guard broken\n" + ), + model="qwen.qwen3.5-122b", + backend="dartmouth", + cost_estimate_usd=0.0, + ) + + agent.handle_response(ctx, response_1) + specify_dir = fake_repo / "projects" / project_id / ".specify" + before = _sha256_tree(specify_dir) + agent.handle_response(ctx, response_2) + after = _sha256_tree(specify_dir) + + assert before == after, ( + f"Full .specify/ tree NOT idempotent across two agent invocations. " + f"Diverged keys: {sorted(set(before) ^ set(after)) or '(none)'}, " + f"changed values: {[k for k in (set(before) & set(after)) if before[k] != after[k]]}" + ) diff --git a/tests/phase1/test_project_id_lock.py b/tests/phase1/test_project_id_lock.py new file mode 100644 index 00000000..120bef0f --- /dev/null +++ b/tests/phase1/test_project_id_lock.py @@ -0,0 +1,160 @@ +"""Regression tests for the project-ID allocation lock (Q1B fix from spec 004). + +These tests verify that concurrent calls to `next_available_proj_num` +under `project_id_lock` cannot produce duplicate PROJ-NNN values +(the bug that produced PROJ-261-evaluating-... + PROJ-261-investigating-... +and PROJ-262-predicting-... + PROJ-262-quantifying-... on `main`). + +Per Constitution Principle III: real filesystem (pytest tmp_path) + +real `os.fork`-based concurrency, no mocks. +""" + +from __future__ import annotations + +import os +import sys +from pathlib import Path + +import pytest + +from llmxive.state.project_id_lock import ( + next_available_proj_num, + project_id_lock, +) + + +def _seed_existing(repo_root: Path, nums: list[int]) -> None: + """Plant fake existing project state YAMLs so next_available_proj_num + has a non-trivial 'used' set.""" + state_dir = repo_root / "state" / "projects" + state_dir.mkdir(parents=True, exist_ok=True) + for n in nums: + (state_dir / f"PROJ-{n:03d}-fake-existing.yaml").write_text( + "id: dummy\n", encoding="utf-8" + ) + + +def test_next_available_with_no_existing(tmp_path: Path) -> None: + """No projects exist → next available is 001.""" + assert next_available_proj_num(repo_root=tmp_path) == 1 + + +def test_next_available_with_gaps(tmp_path: Path) -> None: + """If 001, 003, 005 exist → next available is 002 (smallest gap).""" + _seed_existing(tmp_path, [1, 3, 5]) + assert next_available_proj_num(repo_root=tmp_path) == 2 + + +def test_next_available_skips_iter_suffixes(tmp_path: Path) -> None: + """A historic PROJ-007-foo-iter2 from spec 003 era still occupies + slot 7 — `next_available_proj_num(starting_num=7)` MUST skip past it.""" + state_dir = tmp_path / "state" / "projects" + state_dir.mkdir(parents=True, exist_ok=True) + (state_dir / "PROJ-007-foo-iter2.yaml").write_text("id: dummy\n", encoding="utf-8") + # When starting from 7, must skip to 8 (since iter2 occupies slot 7). + assert next_available_proj_num(repo_root=tmp_path, starting_num=7) == 8 + # When starting from 1 (default), 1 is free so we get 1. + assert next_available_proj_num(repo_root=tmp_path) == 1 + + +def test_next_available_scans_projects_dir_too(tmp_path: Path) -> None: + """A PROJ-NNN dir without a state YAML still counts as used (defensive).""" + (tmp_path / "projects" / "PROJ-042-orphan").mkdir(parents=True) + assert next_available_proj_num(repo_root=tmp_path) != 42 + n = next_available_proj_num(repo_root=tmp_path) + assert n == 1 # since 042 is the only used number, 1 is still free + + +def test_starting_num_respected(tmp_path: Path) -> None: + """If caller asks for >= 100, return 100 even if lower nums are free.""" + assert next_available_proj_num(repo_root=tmp_path, starting_num=100) == 100 + + +def test_lock_serializes_concurrent_allocations(tmp_path: Path) -> None: + """The CRITICAL regression test: two `os.fork()`-spawned children + each acquire the lock + compute next_available + write a state YAML + + release. Result MUST be two DISTINCT project numbers, even though + they raced. + + Without the lock, both would compute next_num=1 from the same disk + snapshot and both write PROJ-001-*.yaml. + """ + if not hasattr(os, "fork"): + pytest.skip("os.fork not available (non-POSIX)") + + # Seed: no projects yet. Both children should land 001 + 002 (in + # some order), not collide on 001. + pipe_r, pipe_w = os.pipe() + + def child_work(slug: str) -> None: + """In each child, take the lock + claim a PID + write a fake + state YAML, then write the claimed PID to the pipe.""" + try: + with project_id_lock(tmp_path): + n = next_available_proj_num(repo_root=tmp_path) + pid = f"PROJ-{n:03d}-{slug}" + state_dir = tmp_path / "state" / "projects" + state_dir.mkdir(parents=True, exist_ok=True) + # Simulate a slow LLM-call-then-write... no actually, + # we want to test the lock is held during the claim, + # so write IMMEDIATELY (which is what cli.py does post-fix). + (state_dir / f"{pid}.yaml").write_text( + f"id: {pid}\n", encoding="utf-8" + ) + os.write(pipe_w, f"{pid}\n".encode()) + finally: + os._exit(0) + + pid_a = os.fork() + if pid_a == 0: + os.close(pipe_r) + child_work("alpha") + pid_b = os.fork() + if pid_b == 0: + os.close(pipe_r) + child_work("beta") + + os.close(pipe_w) + os.waitpid(pid_a, 0) + os.waitpid(pid_b, 0) + + output = b"" + while True: + chunk = os.read(pipe_r, 4096) + if not chunk: + break + output += chunk + os.close(pipe_r) + + claimed = sorted(line.strip() for line in output.decode().splitlines() if line.strip()) + assert len(claimed) == 2, f"expected 2 PIDs claimed, got {claimed!r}" + + # The numbers MUST be distinct — that's the whole point. + nums = {p.split("-")[1] for p in claimed} + assert len(nums) == 2, ( + f"DUPLICATE PROJECT NUMBERS — lock failed: {claimed!r}" + ) + # And both must be on disk. + for pid in claimed: + assert (tmp_path / "state" / "projects" / f"{pid}.yaml").is_file() + + +def test_lock_yields_inside_with_block(tmp_path: Path) -> None: + """Smoke test: project_id_lock as context manager yields control + to the with-block (we can perform work inside without hangs).""" + inside = [] + with project_id_lock(tmp_path): + inside.append("work-done") + assert inside == ["work-done"] + + +def test_lock_releases_on_exception(tmp_path: Path) -> None: + """If the with-block raises, the lock is still released so + subsequent acquisitions don't deadlock.""" + with pytest.raises(RuntimeError, match="boom"): + with project_id_lock(tmp_path): + raise RuntimeError("boom") + + # Should be able to re-acquire immediately. + with project_id_lock(tmp_path): + pass # no hang diff --git a/tests/real_call/test_full_pipeline_e2e.py b/tests/real_call/test_full_pipeline_e2e.py index 91c43a9a..629bbcab 100644 --- a/tests/real_call/test_full_pipeline_e2e.py +++ b/tests/real_call/test_full_pipeline_e2e.py @@ -90,10 +90,25 @@ def test_one_step_advances_fixture(fresh_project: Path) -> None: updated = graph.run_one_step(project) - # The Project-Initializer should have scaffolded .specify/ and the - # project state should have advanced one step. + # Spec 003 / D10 inserted research_question_validator between + # flesh_out_complete and project_initializer. The next stage from + # flesh_out_complete is now the validator, which can output one of + # four legitimate verdicts (or HUMAN_INPUT_NEEDED on a runtime issue): + # - VALIDATED: question passed all four checks + # - VALIDATOR_REVISE: rolls back to FLESH_OUT_IN_PROGRESS + # - VALIDATOR_REJECTED: rolls back to BRAINSTORMED (e.g., when + # the idea body is empty/synthetic, as in + # this smoke fixture) + # - HUMAN_INPUT_NEEDED: legitimate failure (e.g., backend down) + # On a synthetic stub idea (no real research question), the + # validator legitimately rejects to BRAINSTORMED — that's the + # correct behavior, not a regression. assert updated.current_stage in { - Stage.PROJECT_INITIALIZED, + Stage.VALIDATED, + Stage.VALIDATOR_REVISE, + Stage.VALIDATOR_REJECTED, + Stage.BRAINSTORMED, # post-validator-rejected rollback target + Stage.FLESH_OUT_IN_PROGRESS, # post-validator-revise rollback target Stage.HUMAN_INPUT_NEEDED, }, f"unexpected stage after one step: {updated.current_stage}" if updated.current_stage == Stage.PROJECT_INITIALIZED: diff --git a/web/data/projects.json b/web/data/projects.json index 9b0e5097..c2c57ba9 100644 --- a/web/data/projects.json +++ b/web/data/projects.json @@ -18488,7 +18488,7 @@ "citations": null, "code": null, "data": null, - "idea": "projects/PROJ-261-investigating-the-correlation-between-gu/idea/investigating-the-correlation-between-gu.md", + "idea": "projects/PROJ-331-investigating-the-correlation-between-gu/idea/investigating-the-correlation-between-gu.md", "paper_figures": null, "paper_pdf": null, "paper_plan": null, @@ -18529,7 +18529,7 @@ "current_stage": "flesh_out_complete", "description": "Field: biology Which specific gut microbial taxa are significantly correlated with longitudinal progression rates of Parkinson\u2019s Disease (PD) severity, after controlling for age, sex, and medication status? Parkinson\u2019s Disease exhibits substantial clinical heterogeneity, complicating prognosis and treatment\u2026", "field": "biology", - "id": "PROJ-261-investigating-the-correlation-between-gu", + "id": "PROJ-331-investigating-the-correlation-between-gu", "keywords": [], "last_run_log": [ { @@ -18680,7 +18680,7 @@ "citations": null, "code": null, "data": null, - "idea": "projects/PROJ-262-quantifying-the-impact-of-magnetic-field/idea/quantifying-the-impact-of-magnetic-field.md", + "idea": "projects/PROJ-332-quantifying-the-impact-of-magnetic-field/idea/quantifying-the-impact-of-magnetic-field.md", "paper_figures": null, "paper_pdf": null, "paper_plan": null, @@ -18721,7 +18721,7 @@ "current_stage": "flesh_out_complete", "description": "Field: physics How do specific magnetic field topology features (e.g., magnetic island width, resonant surface density) correlate with energy confinement time in publicly available tokamak discharge data? Fusion performance is limited by turbulence and instabilities that alter magnetic field topology, yet the precise\u2026", "field": "physics", - "id": "PROJ-262-quantifying-the-impact-of-magnetic-field", + "id": "PROJ-332-quantifying-the-impact-of-magnetic-field", "keywords": [], "last_run_log": [ {