ContextLab · jeremymanning · May 6, 2026
diff --git a/CLAUDE.md b/CLAUDE.md
@@ -70,5 +70,5 @@ Since this is primarily a research documentation repository without traditional
 <!-- SPECKIT START -->
 For additional context about technologies to be used, project structure,
 shell commands, and other important information, read the current plan:
-[specs/003-phase1-idea-lifecycle-testing/plan.md](specs/003-phase1-idea-lifecycle-testing/plan.md).
+[specs/004-phase2-project-bootstrap-testing/plan.md](specs/004-phase2-project-bootstrap-testing/plan.md).
 <!-- SPECKIT END -->
diff --git a/agents/prompts/project_initializer.md b/agents/prompts/project_initializer.md
@@ -1,6 +1,6 @@
 # Project-Initializer Agent
 
-**Version**: 1.0.0
+**Version**: 1.2.0
 **Stage owned**: `flesh_out_complete` → `project_initialized`
 **Default backend**: dartmouth (fallback huggingface, then local)
 
@@ -45,7 +45,40 @@ literal `**Project ID**: …` footer line.
 
 - Add at most TWO domain-specific principles (numbered I, II, III,
   IV, V already exist; if you add one it becomes VI; if two, VII).
+- **Each added principle MUST be explicitly grounded in the idea body.**
+  Concretely: every claim a new principle makes (about methodology,
+  data sources, evaluation, etc.) MUST trace back to a specific section
+  of the idea body — Methodology sketch, Expected results, Motivation,
+  or Research question. If you cannot point to a sentence in the idea
+  body that justifies a claim in your new principle, do NOT include
+  that claim. Add fewer principles rather than fabricating ones.
+- DO NOT add principles about topics the idea body does not address
+  (e.g., licensing, IP, deployment, or maintenance) just because they
+  seem like generic "good practice" for the field. Generic-good-practice
+  principles belong in the parent constitution, not in the project-level
+  one. The project-level constitution governs THIS project's specific
+  research scope.
+- Each new principle's body should reference the idea's specific
+  artifacts (named datasets, named models, named methods) when codifying
+  a domain norm. Vague principles ("must use good engineering practices")
+  are not acceptable.
 - DO NOT remove any of the inherited principles.
-- DO NOT introduce external citations here — the constitution is a
-  governance document, not a research artifact.
+- **DO NOT introduce ANY external citations or external identifiers in
+  the constitution body** — the constitution is a governance document,
+  not a research artifact. This includes:
+   - DOIs (`10.xxxx/...`)
+   - arXiv IDs (`2401.12345`)
+   - URLs (`http://...`, `https://...`)
+   - Figshare / Zenodo / OSF / Hugging Face dataset record IDs
+  Naming a *dataset by name* (e.g., "QM9", "MD17", "codeparrot/github-code")
+  is acceptable when the dataset is referenced as a generic class of
+  data, NOT when it is identified by a publication-pointer. If you need
+  to specify a dataset's source, name only the dataset and let the
+  Reference-Validator Agent track the canonical pointer in `idea/` and
+  `paper/`.
+- **DO NOT include HTML comment blocks** (`<!-- ... -->`) in your
+  output. The template you receive contains explanatory comments that
+  describe the substitution tokens; those are scaffolding for you, NOT
+  content for the rendered constitution. Strip them before returning
+  your final document.
 - Output ONLY the Markdown document.
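
The new prompt rules above (no DOIs, arXiv IDs, URLs, or HTML comment blocks in the constitution body) could be checked mechanically after the agent returns. The following is a minimal sketch, not part of this PR: the function name `find_forbidden_identifiers` and the exact regexes are assumptions, and the patterns are heuristics (they do not detect Figshare / Zenodo / OSF / Hugging Face record IDs, which have no single shape).

```python
import re

# Hypothetical patterns: the prompt names the categories but not exact regexes.
FORBIDDEN_PATTERNS = {
    "doi": re.compile(r"\b10\.\d{4,9}/\S+"),
    "arxiv_id": re.compile(r"\b\d{4}\.\d{4,5}(?:v\d+)?\b"),
    "url": re.compile(r"https?://\S+"),
    "html_comment": re.compile(r"<!--.*?-->", re.DOTALL),
}


def find_forbidden_identifiers(constitution_text):
    """Return forbidden matches grouped by category (empty dict if clean)."""
    hits = {}
    for name, pattern in FORBIDDEN_PATTERNS.items():
        matches = [m.group(0) for m in pattern.finditer(constitution_text)]
        if matches:
            hits[name] = matches
    return hits
```

A caller might reject or re-prompt when the returned dict is non-empty. Plain dataset names such as "QM9" or "MD17" pass through untouched, which matches the exception the prompt allows.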
diff --git a/agents/registry.yaml b/agents/registry.yaml
@@ -87,7 +87,7 @@ agents:
   outputs:
   - project_state
   prompt_path: agents/prompts/project_initializer.md
-  prompt_version: 1.0.0
+  prompt_version: 1.2.0
   default_backend: dartmouth
   fallback_backends:
   - huggingface

diff --git a/notes/2026-05-05-phase2-diagnostic.md b/notes/2026-05-05-phase2-diagnostic.md
diff --git a/notes/2026-05-06-iteration-convention-change.md b/notes/2026-05-06-iteration-convention-change.md
@@ -0,0 +1,61 @@
+# Iteration convention change — sibling spawning retired
+
+**Date**: 2026-05-06
+**Triggered by**: spec 004 (Phase 2 testing) Phase 7 cleanup
+**Affects**: specs 003 + 004 (old convention; reports kept as-is for historical reproducibility), spec 005+ (new convention)
+
+## What changed
+
+**Old convention (specs 003 + 004)**: each iteration of an agent spawned a new sibling project with an `-iterN` suffix (e.g., `PROJ-261-...-iter2`, `-iter3`, `-iterFAIL-backend`, etc.). Old iters were left in `projects/` with `archived_at` markers.
+
+**New convention (spec 005+)**: iteration happens **in place** on the canonical `PROJ-NNN-<slug>/` directory. Each iteration is a separate **git commit** on the feature branch with a descriptive message; the iteration trail is browsable via `git log -- projects/PROJ-NNN-<slug>/`. No suffix-named sibling directories.
+
+## Why it changed
+
+The sibling pattern produced messy project trees:
+- After spec 004 alone, just two canonicals had **8 sibling directories** between them (iter2, iter3, iter4, iter5, iter6 for PROJ-261; iter2, iter3, iter4, iter6 for PROJ-262), most archived.
+- Each sibling carried a duplicate `state/projects/<id>.yaml`, a duplicate `.specify/` scaffold, a duplicate `idea/<slug>.md`, plus `.history.jsonl` files: roughly 70 redundant files for just two carry-forward projects' worth of testing.
+- Spec 005 (Phase 3) and beyond would compound the proliferation: each project under test would acquire its own `-iterN` family.
+
+The original justification was "every iteration is independently replayable from a clean state" — but git already provides this via `git checkout <commit>` on the canonical's path, with the additional benefit of structured commit messages explaining what changed.
+
+## What's preserved
+
+- **`tests/phase1/sibling_project.py`** is preserved as-is with a deprecation banner (so spec 003's historical reproducibility holds).
+- **The Phase 1 commit history** (e5e423c, 8f2fe48, 7c5cc08, etc.) remains the audit trail for spec 004's iteration trajectory v1.0.0 → v1.1.0 → v1.2.0.
+- **Spec 003 / spec 004 diagnostic reports** retain their original sibling-iter references in the prose (because that IS what happened at the time).
+
+## What's removed
+
+- All `projects/PROJ-261-...-iterN/` and `projects/PROJ-262-...-iterN/` directories from spec 004.
+- All `state/projects/PROJ-26*-iter*.yaml` files.
+- All `state/projects/PROJ-26*-iter*.history.jsonl` files.
+- Not removed: run-log JSONL entries for iter siblings stay in `state/run-log/2026-05/` (they are timestamped historical evidence; deleting them would erase auditability).
+
+## What's promoted
+
+- **Iter6's audited constitution** for each carry-forward project is copied onto its canonical path (`projects/PROJ-NNN-<slug>/.specify/memory/constitution.md`), with the `-iter6` suffix stripped from the substituted project_id references.
+- The spec 003 → spec 004 carry-forward trajectory is now: `PROJ-261-evaluating-the-impact-of-code-duplicatio` and `PROJ-262-predicting-molecular-dipole-moments-with` are the **canonical** carry-forward targets, holding the latest audited Phase 2 outputs.
+
+## Convention going forward
+
+For any future-phase spec (005+):
+
+1. **Iterate in place** on the canonical project directory. Edit `idea/`, `.specify/memory/constitution.md`, `state/projects/<id>.yaml`, etc., directly.
+2. **One commit per iteration**, with messages like `phaseN/spec-NNN: <agent_name> iter K — <what changed and why>`. The iteration count is in the commit message, not the directory name.
+3. **Add an iteration log section to the diagnostic report** (`§ 5 — Iteration log`) summarizing each iteration's commit hash + scope + outcome. The git log is the source of truth; the report's § 5 is a curated index.
+4. **Don't spawn sibling-iter projects** unless absolutely necessary (e.g., to deliberately exercise a multi-state-machine path that requires two independently-evolving projects). If you do spawn one, name it explicitly with rationale, not just `-iterN`.
+
+## Backwards compatibility
+
+- Spec 003 + spec 004 reports retain their sibling-iter references in prose (they describe historical state).
+- The `tests/phase1/sibling_project.py` deprecation banner points future readers here.
+- `tests/phase1/test_idempotency.py` doesn't depend on siblings (it uses pytest `tmp_path` fixtures); regression test still passes.
+- `tests/phase1/test_citation_resolver.py` is unaffected.
+
+## Verification
+
+- `find projects/PROJ-26*-iter* 2>/dev/null` → empty.
+- `ls state/projects/ | grep iter` → empty.
+- `pytest tests/phase1/` → 15/15 passing.
+- `sha256sum projects/PROJ-26{1,2}-*/.specify/memory/constitution.md` → matches the audited iter6 content (with the -iter6 suffix stripped).
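
The commit-message convention in the note above (`phaseN/spec-NNN: <agent_name> iter K`, an em-dash separator, then a summary) is what makes the git log usable as an iteration trail. A minimal sketch of a formatter and parser for that subject line follows; the helper names are hypothetical and only the template itself comes from the note.

```python
import re

EM_DASH = "\u2014"  # separator character used in the note's message template

# Assumes agent names contain no whitespace (e.g. "project_initializer").
_SUBJECT_RE = re.compile(
    r"^phase(?P<phase>\d+)/spec-(?P<spec>\d{3}): "
    r"(?P<agent>\S+) iter (?P<iteration>\d+) " + EM_DASH + r" (?P<summary>.+)$"
)


def format_iteration_subject(phase, spec, agent, iteration, summary):
    """Build a commit subject that follows the iterate-in-place convention."""
    return f"phase{phase}/spec-{spec:03d}: {agent} iter {iteration} {EM_DASH} {summary}"


def parse_iteration_subject(subject):
    """Parse a commit subject; return None if it does not follow the convention."""
    match = _SUBJECT_RE.match(subject)
    if match is None:
        return None
    return {
        "phase": int(match["phase"]),
        "spec": int(match["spec"]),
        "agent": match["agent"],
        "iteration": int(match["iteration"]),
        "summary": match["summary"],
    }
```

Feeding the output lines of `git log --format=%s -- projects/PROJ-NNN-<slug>/` through `parse_iteration_subject` would give one possible starting point for the curated § 5 iteration log, with non-conforming commits returning `None`.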
diff --git a/notes/2026-05-06-project-id-numbering-fix.md b/notes/2026-05-06-project-id-numbering-fix.md
@@ -0,0 +1,87 @@
+# Project-ID numbering race fix + duplicate cleanup
+
+**Date**: 2026-05-06
+**Triggered by**: user observation that two PROJ-261s and two PROJ-262s existed with different topics
+**Tracked in**: spec 004 / PR #109
+
+## Root cause
+
+`src/llmxive/cli.py:_cmd_brainstorm` computed `next_num` once at the
+top of the function from an in-memory snapshot of `state/projects/`,
+then claimed IDs sequentially. The inner allocation loop only re-checked
+against the local `existing_ids` set, never against disk. Two
+concurrent invocations (e.g., two cron-driven `python -m llmxive
+brainstorm` calls firing at the same time) would each compute the
+same `next_num` from independent disk snapshots, then both write
+`PROJ-NNN-<slug-A>.yaml` / `PROJ-NNN-<slug-B>.yaml` — duplicate
+project numbers with different slugs.
+
+This had already manifested on `main`:
+
+| Duplicate group | Slug A | Slug B |
+|-|-|-|
+| PROJ-261 | `evaluating-the-impact-of-code-duplicatio` (carry-forward, computer science) | `investigating-the-correlation-between-gu` (biology) |
+| PROJ-262 | `predicting-molecular-dipole-moments-with` (carry-forward, chemistry) | `quantifying-the-impact-of-magnetic-field` (physics) |
+
+## Fix (Q1B from user dialog)
+
+New module `src/llmxive/state/project_id_lock.py` with two helpers:
+
+- `project_id_lock(repo_root)` — context manager that takes an
+  exclusive `fcntl.flock` on `state/.brainstorm.lock` for the duration
+  of the with-block. The lock is held for only microseconds (it covers
+  the read-disk + write-state-YAML window, not the LLM call).
+- `next_available_proj_num(*, repo_root, starting_num=1)` — scans
+  `state/projects/` AND `projects/` directories from disk and returns
+  the smallest free `n`. Works correctly with `-iterN` suffixes
+  (treats them as occupying the canonical slot).
+
+`cli._cmd_brainstorm` now wraps the per-seed allocation in the lock,
+and writes the state YAML eagerly inside the lock (acting as the ID
+claim) before releasing.
+
+Regression test at `tests/phase1/test_project_id_lock.py` — 8 tests,
+including an `os.fork()`-based concurrent-allocation test that
+confirms two children racing for the lock produce DISTINCT project
+numbers.
+
+## Cleanup (Q3A from user dialog)
+
+Renamed the two non-carry-forward duplicates to next-available IDs
+(331 + 332) so each PROJ-NNN is unique on the branch:
+
+| Old ID | New ID |
+|-|-|
+| `PROJ-261-investigating-the-correlation-between-gu` | `PROJ-331-investigating-the-correlation-between-gu` |
+| `PROJ-262-quantifying-the-impact-of-magnetic-field` | `PROJ-332-quantifying-the-impact-of-magnetic-field` |
+
+The carry-forward projects (`PROJ-261-evaluating-...` and
+`PROJ-262-predicting-...`) keep their numbers, since spec 003 + spec
+004 reports + carry-forward manifests + the parent issue/tracker all
+reference them.
+
+Files updated:
+- Project directories renamed under `projects/`.
+- State YAMLs renamed under `state/projects/`; internal `id:` field
+  updated.
+- `.history.jsonl` files renamed.
+- `web/data/projects.json` IDs replaced.
+- Run-log JSONL entries (2 files) updated to use the new IDs.
+
+## Verification
+
+- `grep -rn "PROJ-261-investigating\|PROJ-262-quantifying" --include="*.md" --include="*.yaml" --include="*.json" --include="*.jsonl"` → 0 matches (clean).
+- `pytest tests/phase1/test_project_id_lock.py -v` → 8/8 PASS.
+- `pytest tests/phase1/` (full regression) → all PASS.
+- `ls projects/` shows each PROJ-NNN unique.
+
+## Forward-looking note
+
+This fix is a narrow defensive patch on the brainstorm allocation
+path. A future spec (likely the librarian-agent spec) should consider
+whether other places that allocate project-ID-shaped strings
+(`paper_initializer`, `task_atomizer`, etc.) also need the lock.
+
+The lock pattern (`project_id_lock` + `next_available_proj_num`) is
+reusable — any agent that needs to claim a fresh PROJ-NNN should
+import these helpers rather than implementing its own allocation.
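
For readers who want the shape of the two helpers without opening `src/llmxive/state/project_id_lock.py`, here is a minimal sketch written from the description in the note above. It is not the PR's implementation: the helper signatures follow the note, but the bodies, the `PROJ-(\d+)-` filename regex, the zero-padded ID format, and the `claim_project_id` wrapper are assumptions. `fcntl` is Unix-only.

```python
import fcntl
import re
from contextlib import contextmanager
from pathlib import Path

_PROJ_RE = re.compile(r"^PROJ-(\d+)-")  # assumed naming pattern


@contextmanager
def project_id_lock(repo_root):
    """Hold an exclusive advisory lock on state/.brainstorm.lock for the with-block."""
    lock_path = Path(repo_root) / "state" / ".brainstorm.lock"
    lock_path.parent.mkdir(parents=True, exist_ok=True)
    with open(lock_path, "w") as handle:
        fcntl.flock(handle, fcntl.LOCK_EX)  # blocks until other holders release
        try:
            yield
        finally:
            fcntl.flock(handle, fcntl.LOCK_UN)


def next_available_proj_num(*, repo_root, starting_num=1):
    """Return the smallest unused PROJ number >= starting_num, read fresh from disk."""
    used = set()
    for sub in ("state/projects", "projects"):
        directory = Path(repo_root) / sub
        if not directory.is_dir():
            continue
        for entry in directory.iterdir():
            match = _PROJ_RE.match(entry.name)
            if match:
                # "-iterN" siblings share the prefix, so they occupy the same slot.
                used.add(int(match.group(1)))
    n = starting_num
    while n in used:
        n += 1
    return n


def claim_project_id(repo_root, slug):
    """Hypothetical wrapper: allocate and write the state YAML inside the lock."""
    with project_id_lock(repo_root):
        n = next_available_proj_num(repo_root=repo_root)
        project_id = f"PROJ-{n:03d}-{slug}"
        state_dir = Path(repo_root) / "state" / "projects"
        state_dir.mkdir(parents=True, exist_ok=True)
        # Writing the YAML before releasing the lock is what claims the ID.
        (state_dir / f"{project_id}.yaml").write_text(f"id: {project_id}\n")
    return project_id
```

The key design point is that the disk scan and the claiming write happen inside the same critical section, so a second process cannot read the directory between the first process's scan and its write.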
diff --git a/notes/2026-05-06-spec-005-librarian-outline.md b/notes/2026-05-06-spec-005-librarian-outline.md
@@ -0,0 +1,110 @@
+# Spec 005 outline — Librarian agent + Phase 1 re-validation
+
+**Status**: Outline only (handoff note for next session). Not yet a Spec Kit feature.
+**Date**: 2026-05-06
+**Triggered by**: user observation in spec 004 that (a) literature-search behavior is duplicated across `flesh_out`, `reference_validator`, and the spec-003 citation resolver; (b) the gap-analysis fallback when no relevant papers are found should instead trigger a multi-step expanded search; (c) Single-Source-of-Truth Constitutional Principle I says these duplicated implementations should be consolidated.
+**Predecessors**: spec 003 (Phase 1 testing) + spec 004 (Phase 2 testing)
+
+## Goal
+
+Build a `librarian` agent that consolidates literature search + citation verification into a single canonical implementation, then re-validate all Phase 1 agents that depend on lit search behavior.
+
+## Scope (3 user stories, P1 each)
+
+### US1 — `librarian` agent: validated literature search, single source of truth
+
+A `librarian` agent that takes a search term (or list of terms) and returns a list of verified citations. The agent's contract:
+
+**Input**: a search term plus optional context (project field, idea body, etc.).
+
+**Output**: per citation:
+- DOI / arXiv ID / HTTPS URL (the canonical pointer)
+- bibliographic info (title, authors, venue, year)
+- summary of content (1-3 sentences) grounded in the actual fetched content (not hallucinated)
+- verification verdict: pointer resolves, content matches the bibliographic claim, summary is faithful
+
+**Internals**:
+1. **Web search** (Semantic Scholar API, arXiv API, Google Scholar via `scholarly`, etc.) for the term.
+2. **Download** each candidate's PDF / HTML / abstract.
+3. **Verify**: (a) URL/address resolves, (b) bibliographic info from search matches primary source, (c) summary derived from actual content (not the search snippet).
+4. **Return** structured JSON.
+
+Re-uses the spec-003 citation resolver pattern for verification (or extracts shared code into a utility).
+
+### US2 — Multi-step expanded search when results are thin
+
+When an initial search returns < N (default 5) verified citations, the librarian:
+
+**Step 1**: brainstorms an expanded list of terms accommodating alternative naming, ranked by relevance to the originating query. The LLM is asked: "What are 10-20 alternative phrasings, related concepts, or sub-area terms that might surface relevant papers if the original search missed them?"
+
+**Step 2**: iterates over those terms, performing at least 10 distinct searches, accumulating verified citations until ≥ N (default 5) are found OR the term list is exhausted.
+
+**Step 3**: returns the verified citations PLUS a record of which expanded terms were searched (for the log + the idea's `.md` file).
+
+The agent updates:
+- run-log entry with expanded terms used + per-term hit count
+- the calling project's idea.md (if applicable) with a "Search trail" subsection naming the expanded terms.
+
+### US3 — Re-validate Phase 1 agents under librarian-backed lit search
+
+After the librarian agent is built, re-validate two Phase 1 agents whose behavior may shift:
+
+**research_question_validator** — its 4-check audit (phenomenon-vs-method, circularity, triviality, narrowing) may rely indirectly on lit-search-driven evidence. Re-run the full Phase 1 pipeline (brainstorm → flesh_out → validator → project_initializer) on the carry-forward canonicals (PROJ-261-evaluating-, PROJ-262-predicting-) and confirm the validator's verdicts still hold.
+
+**flesh_out** — its citation-fetching behavior was a primary trigger for spec 005. Re-run flesh_out on the canonicals; confirm:
+- `idea.md` now includes a "Search trail" subsection per US2
+- citations are now librarian-validated (not just regex-resolved)
+- the previously-empty Literature gap analysis sections are now populated with real cited literature (the gap-analysis-as-feature path from spec 003 should only fire when the librarian's expanded search ALSO returns nothing — a much stricter trigger)
+
+If validator or flesh_out's behavior on the canonicals materially changes, the spec 003 + spec 004 reports gain an addendum noting the shift and the new audit verdicts.
+
+## Touch points
+
+| File | Change |
+|-|-|
+| `src/llmxive/agents/librarian.py` | NEW — agent class |
+| `agents/prompts/librarian.md` | NEW — prompt |
+| `agents/registry.yaml` | NEW entry; existing `lit_search` tool → DEPRECATE or refactor |
+| `src/llmxive/tools/lit_search.py` | refactor to call librarian, OR remove (callers go to librarian directly) |
+| `src/llmxive/agents/flesh_out.py` (or its prompt) | change: call librarian instead of lit_search |
+| `src/llmxive/agents/reference_validator.py` (or its prompt) | change: call librarian for verification step |
+| `tests/phase1/citation_resolver.py` | refactor: re-export librarian's verification logic (or deprecate, since librarian subsumes it) |
+| `tests/phase2/test_librarian.py` | NEW; extensive tests across many domains, covering every project we've brainstormed thus far (CS, chemistry, biology, physics, neuroscience, etc.) |
+| `notes/2026-05-NN-spec-005-librarian-diagnostic.md` | NEW — diagnostic report mirroring spec 003 / spec 004 structure |
+
+## Test substrate (US3 input)
+
+The carry-forward projects from spec 004's manifest:
+- PROJ-261-evaluating-the-impact-of-code-duplicatio (CS)
+- PROJ-262-predicting-molecular-dipole-moments-with (chemistry)
+
+Plus optional broader-domain coverage drawing from the larger pool of brainstormed projects in `projects/` (the cron-driven runs have produced ~40+ projects across all default fields). For US1 testing the librarian on diverse terms, sample one project per field.
+
+## Open design questions for `/speckit-clarify`
+
+1. **Web-search backend choice** — Semantic Scholar API + arXiv API only (free, no rate-limit issues for this scale)? Or also Google Scholar via `scholarly` (richer but rate-limited)? Or a real web-search service via DARTMOUTH_CHAT_API_KEY?
+2. **Verification depth** — does "verify summary matches content" require downloading full PDFs (slow), or is the abstract enough? PDF gives more truth-grounding; abstract is faster.
+3. **Caching** — librarian queries can be expensive; should results be cached on disk (e.g., keyed on `sha256(term)`)? If yes, where + how does cache invalidation work?
+4. **Failure mode** — what does the librarian do if EVEN the expanded multi-step search finds <5 verified citations? Surface an explicit `librarian_inconclusive.yaml` sentinel; let the caller decide whether to gap-analyze, escalate to human, or fail.
+5. **Re-validation scope of US3** — re-run Phase 1 from brainstorm forward, OR re-run only flesh_out + validator on the existing canonical idea bodies? The latter is cheaper but doesn't catch cascading shifts.
+
+## Anticipated effort
+
+- US1 (librarian implementation + tests): ~2-3 days (the verification protocol is the hardest part, especially with PDFs)
+- US2 (expanded search): ~0.5 day given US1's substrate
+- US3 (re-validation): ~1 day for flesh_out + validator on 2 canonicals
+- **Total**: ~3.5-4.5 days
+
+## Carry-forward to spec 006+
+
+If spec 005 closes cleanly, the librarian becomes available to all paper-side agents (paper_writing, paper_implementer, reference_validator) and downstream phase-test specs (006-007 etc. — Phase 3-4 testing). This is the highest-leverage piece of infrastructure across the whole pipeline after the four Phase 1 agents.
+
+## Suggested workflow
+
+When the user is ready to start spec 005:
+
+1. `/speckit-specify` with the bullet "build a librarian agent per `notes/2026-05-06-spec-005-librarian-outline.md`; re-validate flesh_out + research_question_validator on the carry-forward projects from spec 004"
+2. `/speckit-clarify` — resolve the 5 open design questions above
+3. `/speckit-plan` → `/speckit-tasks` → `/speckit-analyze` → `/speckit-implement` (mirror spec 004's flow)
+
+This work will produce a separate PR. Spec 005's diagnostic report will mirror spec 003's structure.
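
The US2 flow in the outline above (initial search, then expanded terms while verified results are thin, plus a recorded search trail) can be sketched as a small loop. This is an illustration only, since the librarian does not exist yet: the function name, the injected `search_verified` and `brainstorm_terms` callables, and the `"pointer"` key on each citation dict are all hypothetical. The sketch also assumes the loop stops as soon as the threshold is met, which is one reading of the outline's "at least 10 distinct searches" wording.

```python
def expanded_search(query, search_verified, brainstorm_terms, min_citations=5):
    """Run the initial search, then try expanded terms while results are thin.

    search_verified(term) -> list of citation dicts, each with a "pointer" key.
    brainstorm_terms(query) -> iterable of alternative search terms.
    Returns (citations, trail), where trail is a list of (term, new_hit_count).
    """
    found = {}  # canonical pointer -> citation, de-duplicated across terms
    trail = []  # which terms were searched and how many new hits each produced

    def run(term):
        new_hits = 0
        for citation in search_verified(term):
            pointer = citation["pointer"]
            if pointer not in found:
                found[pointer] = citation
                new_hits += 1
        trail.append((term, new_hits))

    run(query)
    if len(found) < min_citations:
        for term in brainstorm_terms(query):
            if len(found) >= min_citations:
                break  # enough verified citations; stop expanding
            run(term)
    return list(found.values()), trail
```

The returned `trail` is the piece that would feed both the run-log entry (per-term hit counts) and the "Search trail" subsection in the idea file. If the result is still below the threshold after the term list is exhausted, the caller can see that from the length of the returned list and decide how to handle the failure mode raised in the open design questions.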