Merged
20 commits
e5e423c
phase2/spec-004: add 'validated' to sibling spawner allowlist (FR-003…
jeremymanning May 6, 2026
e8e09f7
phase2/spec-004: idempotency + fail-fast guards on project_initialize…
jeremymanning May 6, 2026
3c83c99
phase2/spec-004: spawn iter2 siblings of PROJ-261, PROJ-262 (US1, FR-…
jeremymanning May 6, 2026
931698a
phase2/spec-004: project_initializer happy-path runs on iter2 sibling…
jeremymanning May 6, 2026
8f2fe48
phase2/spec-004: tighten project_initializer prompt to forbid citatio…
jeremymanning May 6, 2026
fce9ebf
phase2/spec-004: iter3 siblings re-run with v1.1.0 prompt — both cons…
jeremymanning May 6, 2026
0eafcd8
phase2/spec-004: induced-failure scenarios + archive (US4, FR-012, #4…
jeremymanning May 6, 2026
0e12b44
phase2/spec-004: diagnostic report + carry-forward manifest (US5 US6,…
jeremymanning May 6, 2026
495c8db
phase2/spec-004: archive non-selected iter2 siblings (FR-019, US6 T05…
jeremymanning May 6, 2026
6803e4e
phase2/spec-004: lint cleanup on touched files (T060, #46 #62)
jeremymanning May 6, 2026
5061004
phase2/spec-004: tasks.md ticked off + spec.md status In Review (T067…
jeremymanning May 6, 2026
53af543
phase2/spec-004: spec-kit design artifacts + state history (#46 #62)
jeremymanning May 6, 2026
d42cc5a
phase2/spec-004: tasks.md final tick + tracker update + PR open (#46 …
jeremymanning May 6, 2026
7c5cc08
phase2/spec-004: tighten project_initializer prompt v1.1.0 → v1.2.0 —…
jeremymanning May 6, 2026
7da5bd1
phase2/spec-004: iter6 siblings re-run with v1.2.0 prompt — both pass…
jeremymanning May 6, 2026
5f72de2
phase2/spec-004: report § 4/5/8 + carry-forward.yaml updated for iter…
jeremymanning May 6, 2026
30aa5a8
phase2/spec-004: retire sibling-iter convention; iterate in place goi…
jeremymanning May 6, 2026
9820567
phase2/spec-004: fix project-ID numbering race + clean up duplicate P…
jeremymanning May 6, 2026
bae345a
spec-005-handoff: outline for librarian agent + Phase 1 re-validation…
jeremymanning May 6, 2026
49647c5
phase2/spec-004: fix stale CI assertion in real_call/test_full_pipeli…
jeremymanning May 6, 2026
2 changes: 1 addition & 1 deletion CLAUDE.md
@@ -70,5 +70,5 @@ Since this is primarily a research documentation repository without traditional
<!-- SPECKIT START -->
For additional context about technologies to be used, project structure,
shell commands, and other important information, read the current plan:
-[specs/003-phase1-idea-lifecycle-testing/plan.md](specs/003-phase1-idea-lifecycle-testing/plan.md).
+[specs/004-phase2-project-bootstrap-testing/plan.md](specs/004-phase2-project-bootstrap-testing/plan.md).
<!-- SPECKIT END -->
39 changes: 36 additions & 3 deletions agents/prompts/project_initializer.md
@@ -1,6 +1,6 @@
# Project-Initializer Agent

-**Version**: 1.0.0
+**Version**: 1.2.0
**Stage owned**: `flesh_out_complete` → `project_initialized`
**Default backend**: dartmouth (fallback huggingface, then local)

@@ -45,7 +45,40 @@ literal `**Project ID**: …` footer line.

- Add at most TWO domain-specific principles (numbered I, II, III,
IV, V already exist; if you add one it becomes VI; if two, VII).
+- **Each added principle MUST be explicitly grounded in the idea body.**
+Concretely: every claim a new principle makes (about methodology,
+data sources, evaluation, etc.) MUST trace back to a specific section
+of the idea body — Methodology sketch, Expected results, Motivation,
+or Research question. If you cannot point to a sentence in the idea
+body that justifies a claim in your new principle, do NOT include
+that claim. Add fewer principles rather than fabricating ones.
+- DO NOT add principles about topics the idea body does not address
+(e.g., licensing, IP, deployment, or maintenance) just because they
+seem like generic "good practice" for the field. Generic-good-practice
+principles belong in the parent constitution, not in the project-level
+one. The project-level constitution governs THIS project's specific
+research scope.
+- Each new principle's body should reference the idea's specific
+artifacts (named datasets, named models, named methods) when codifying
+a domain norm. Vague principles ("must use good engineering practices")
+are not acceptable.
- DO NOT remove any of the inherited principles.
-- DO NOT introduce external citations here — the constitution is a
-governance document, not a research artifact.
+- **DO NOT introduce ANY external citations or external identifiers in
+the constitution body** — the constitution is a governance document,
+not a research artifact. This includes:
+- DOIs (`10.xxxx/...`)
+- arXiv IDs (`2401.12345`)
+- URLs (`http://...`, `https://...`)
+- Figshare / Zenodo / OSF / Hugging Face dataset record IDs
+Naming a *dataset by name* (e.g., "QM9", "MD17", "codeparrot/github-code")
+is acceptable when the dataset is referenced as a generic class of
+data, NOT when it is identified by a publication-pointer. If you need
+to specify a dataset's source, name only the dataset and let the
+Reference-Validator Agent track the canonical pointer in `idea/` and
+`paper/`.
+- **DO NOT include HTML comment blocks** (`<!-- ... -->`) in your
+output. The template you receive contains explanatory comments that
+describe the substitution tokens; those are scaffolding for you, NOT
+content for the rendered constitution. Strip them before returning
+your final document.
- Output ONLY the Markdown document.
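
The citation and HTML-comment rules added to the prompt above could also be checked mechanically on the rendered constitution. The following is a minimal sketch, assuming a hypothetical `find_violations` helper and hand-written regexes; the prompt names the forbidden categories but does not define any patterns, and dataset-record IDs (Figshare, Zenodo, OSF, Hugging Face) are not covered here.

```python
import re

# Hypothetical patterns: the prompt lists the forbidden categories,
# but these regexes are illustrative assumptions, not part of the PR.
FORBIDDEN_PATTERNS = {
    "doi": re.compile(r"\b10\.\d{4,9}/\S+"),
    "arxiv_id": re.compile(r"\b\d{4}\.\d{4,5}(?:v\d+)?\b"),
    "url": re.compile(r"https?://\S+"),
    "html_comment": re.compile(r"<!--.*?-->", re.DOTALL),
}


def find_violations(constitution_text: str) -> list[tuple[str, str]]:
    """Return (category, matched_text) pairs for every forbidden pointer found."""
    violations = []
    for label, pattern in FORBIDDEN_PATTERNS.items():
        for match in pattern.finditer(constitution_text):
            violations.append((label, match.group(0)))
    return violations
```

A dataset referenced by name only (for example "QM9") produces no matches, which is the behaviour the prompt asks for.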
2 changes: 1 addition & 1 deletion agents/registry.yaml
@@ -87,7 +87,7 @@ agents:
outputs:
- project_state
prompt_path: agents/prompts/project_initializer.md
-prompt_version: 1.0.0
+prompt_version: 1.2.0
default_backend: dartmouth
fallback_backends:
- huggingface
449 changes: 449 additions & 0 deletions notes/2026-05-05-phase2-diagnostic.md

Large diffs are not rendered by default.

61 changes: 61 additions & 0 deletions notes/2026-05-06-iteration-convention-change.md
@@ -0,0 +1,61 @@
# Iteration convention change — sibling spawning retired

**Date**: 2026-05-06
**Triggered by**: spec 004 (Phase 2 testing) Phase 7 cleanup
**Affects**: spec 003 (retroactive: sibling helper kept for historical reproducibility), spec 004 (sibling directories removed), spec 005+ (new convention)

## What changed

**Old convention (specs 003 + 004)**: each iteration of an agent spawned a new sibling project with an `-iterN` suffix (e.g., `PROJ-261-...-iter2`, `-iter3`, `-iterFAIL-backend`, etc.). Old iters were left in `projects/` with `archived_at` markers.

**New convention (spec 005+)**: iteration happens **in place** on the canonical `PROJ-NNN-<slug>/` directory. Each iteration is a separate **git commit** on the feature branch with a descriptive message; the iteration trail is browsable via `git log -- projects/PROJ-NNN-<slug>/`. No suffix-named sibling directories.

## Why it changed

The sibling pattern produced messy project trees:
- After spec 004 alone, just two canonicals had **9 sibling directories** between them (iter2, iter3, iter4, iter5, iter6 for PROJ-261; iter2, iter3, iter4, iter6 for PROJ-262), most archived.
- Each sibling carried a duplicate `state/projects/<id>.yaml`, a duplicate `.specify/` scaffold, a duplicate `idea/<slug>.md`, plus `.history.jsonl` files: roughly 70 redundant files for just two carry-forward projects' worth of testing.
- Spec 005 (Phase 3) and beyond would compound the proliferation: each project under test would acquire its own `-iterN` family.

The original justification was "every iteration is independently replayable from a clean state" — but git already provides this via `git checkout <commit>` on the canonical's path, with the additional benefit of structured commit messages explaining what changed.

## What's preserved

- **`tests/phase1/sibling_project.py`** is preserved as-is with a deprecation banner (so spec 003's historical reproducibility holds).
- **The Phase 2 commit history** (e5e423c, 8f2fe48, 7c5cc08, etc.) remains the audit trail for spec 004's iteration trajectory v1.0.0 → v1.1.0 → v1.2.0.
- **Spec 003 / spec 004 diagnostic reports** retain their original sibling-iter references in the prose (because that IS what happened at the time).

## What's removed

- All `projects/PROJ-261-...-iterN/` and `projects/PROJ-262-...-iterN/` directories from spec 004.
- All `state/projects/PROJ-26*-iter*.yaml` files.
- All `state/projects/PROJ-26*-iter*.history.jsonl` files.
- Not removed: run-log JSONL entries for iter siblings stay in `state/run-log/2026-05/` (they are timestamped historical evidence; deleting them would erase auditability).

## What's promoted

- **Iter6's audited constitution** for each carry-forward project is copied onto its canonical path (`projects/PROJ-NNN-<slug>/.specify/memory/constitution.md`), with the `-iter6` suffix stripped from the substituted project_id references.
- The spec 003 → spec 004 carry-forward trajectory is now: `PROJ-261-evaluating-the-impact-of-code-duplicatio` and `PROJ-262-predicting-molecular-dipole-moments-with` are the **canonical** carry-forward targets, holding the latest audited Phase 2 outputs.

## Convention going forward

For any future-phase spec (005+):

1. **Iterate in place** on the canonical project directory. Edit `idea/`, `.specify/memory/constitution.md`, `state/projects/<id>.yaml`, etc., directly.
2. **One commit per iteration**, with messages like `phaseN/spec-NNN: <agent_name> iter K — <what changed and why>`. The iteration count is in the commit message, not the directory name.
3. **Add an iteration log section to the diagnostic report** (`§ 5 — Iteration log`) summarizing each iteration's commit hash + scope + outcome. The git log is the source of truth; the report's § 5 is a curated index.
4. **Don't spawn sibling-iter projects** unless absolutely necessary (e.g., to deliberately exercise a multi-state-machine path that requires two independently-evolving projects). If you do spawn one, name it explicitly with rationale, not just `-iterN`.
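
The commit-message template in item 2 lends itself to a small helper that both builds the subject line and parses it back out of `git log` output when assembling the § 5 index. This is a sketch under stated assumptions: the function names are hypothetical, the three-digit spec padding is inferred from the `spec-NNN` placeholder, and the separator is the em dash shown in the template.

```python
import re
from typing import Optional

EM_DASH = "\u2014"  # separator character used in the convention's template

# Assumed shape: phaseN/spec-NNN: <agent_name> iter K <em dash> <summary>
ITER_RE = re.compile(
    r"^phase(?P<phase>\d+)/spec-(?P<spec>\d{3}): (?P<agent>\S+) "
    r"iter (?P<iteration>\d+) " + EM_DASH + r" (?P<summary>.+)$"
)


def iteration_commit_message(
    phase: int, spec: int, agent: str, iteration: int, summary: str
) -> str:
    """Build a commit subject that follows the in-place iteration convention."""
    return f"phase{phase}/spec-{spec:03d}: {agent} iter {iteration} {EM_DASH} {summary}"


def parse_iteration_subject(subject: str) -> Optional[dict]:
    """Parse a subject line; return None for commits that are not iterations."""
    match = ITER_RE.match(subject)
    if match is None:
        return None
    fields = match.groupdict()
    for key in ("phase", "spec", "iteration"):
        fields[key] = int(fields[key])
    return fields
```

Commits that do not follow the template (lint cleanups, tracker updates) parse to `None`, so a report generator can skip them when building the curated index.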

## Backwards compatibility

- Spec 003 + spec 004 reports retain their sibling-iter references in prose (they describe historical state).
- The `tests/phase1/sibling_project.py` deprecation banner points future readers here.
- `tests/phase1/test_idempotency.py` doesn't depend on siblings (it uses pytest `tmp_path` fixtures); regression test still passes.
- `tests/phase1/test_citation_resolver.py` is unaffected.

## Verification

- `find projects/PROJ-26*-iter* 2>/dev/null` → empty.
- `ls state/projects/ | grep iter` → empty.
- `pytest tests/phase1/` → 15/15 passing.
- `sha256sum projects/PROJ-26{1,2}-*/.specify/memory/constitution.md` → matches the audited iter6 content (with the -iter6 suffix stripped).
87 changes: 87 additions & 0 deletions notes/2026-05-06-project-id-numbering-fix.md
@@ -0,0 +1,87 @@
# Project-ID numbering race fix + duplicate cleanup

**Date**: 2026-05-06
**Triggered by**: user observation that two PROJ-261s and two PROJ-262s existed with different topics
**Tracked in**: spec 004 / PR #109

## Root cause

`src/llmxive/cli.py:_cmd_brainstorm` computed `next_num` once at the
top of the function from an in-memory snapshot of `state/projects/`,
then claimed IDs sequentially. The inner allocation loop only re-checked
against the local `existing_ids` set, never against disk. Two
concurrent invocations (e.g., two cron-driven `python -m llmxive
brainstorm` calls firing at the same time) would each compute the
same `next_num` from independent disk snapshots, then both write
`PROJ-NNN-<slug-A>.yaml` / `PROJ-NNN-<slug-B>.yaml` — duplicate
project numbers with different slugs.

This had already manifested on `main`:

| Duplicate group | Slug A | Slug B |
|-|-|-|
| PROJ-261 | `evaluating-the-impact-of-code-duplicatio` (carry-forward, computer science) | `investigating-the-correlation-between-gu` (biology) |
| PROJ-262 | `predicting-molecular-dipole-moments-with` (carry-forward, chemistry) | `quantifying-the-impact-of-magnetic-field` (physics) |
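
The race described in the root cause can be reproduced deterministically without real concurrency. The sketch below is illustrative only: `next_num_from_snapshot` and `simulate_race` are hypothetical names, and the "highest existing number plus one" rule is an assumption about how `next_num` was derived, since the note does not show the original code.

```python
def next_num_from_snapshot(existing_nums: set[int]) -> int:
    """Assumed allocation rule: one past the highest number in the snapshot."""
    return max(existing_nums, default=0) + 1


def simulate_race(disk_nums: set[int]) -> tuple[int, int]:
    """Two invocations snapshot the same disk state before either one writes."""
    snapshot_a = set(disk_nums)  # invocation A reads state/projects/
    snapshot_b = set(disk_nums)  # invocation B reads before A has written
    num_a = next_num_from_snapshot(snapshot_a)
    num_b = next_num_from_snapshot(snapshot_b)
    return num_a, num_b


# Both invocations pick the same number, so two different slugs share one ID.
print(simulate_race({259, 260}))  # (261, 261)
```

Because each invocation re-checks only its own in-memory set, nothing stops both from writing a YAML file under the same number.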

## Fix (Q1B from user dialog)

New module `src/llmxive/state/project_id_lock.py` with two helpers:

- `project_id_lock(repo_root)`: a context manager that takes an
exclusive `fcntl.flock` on `state/.brainstorm.lock` for the duration
of the `with` block. The lock is held only briefly: it covers the
read-disk + write-state-YAML window, not the LLM call.
- `next_available_proj_num(*, repo_root, starting_num=1)` — scans
`state/projects/` AND `projects/` directories from disk and returns
the smallest free `n`. Works correctly with `-iterN` suffixes
(treats them as occupying the canonical slot).

`cli._cmd_brainstorm` now wraps the per-seed allocation in the lock,
and writes the state YAML eagerly inside the lock (acting as the ID
claim) before releasing.

Regression test at `tests/phase1/test_project_id_lock.py`: 8 tests,
including an `os.fork()`-based concurrent-allocation test that
confirms two children racing for the lock produce DISTINCT project
numbers.
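
For readers without the module at hand, the lock-then-claim pattern described above can be sketched roughly as follows. The two helper signatures come from this note; the bodies, the `PROJ_RE` regex, the three-digit padding, and the `claim_project_id` wrapper are assumptions for illustration and may differ from the real `project_id_lock.py`. `fcntl` is POSIX-only.

```python
import fcntl
import re
from contextlib import contextmanager
from pathlib import Path

# Matches PROJ-261-some-slug and PROJ-261-some-slug-iter2 alike, so a
# sibling with an -iterN suffix still occupies the canonical number.
PROJ_RE = re.compile(r"^PROJ-(\d+)-")


@contextmanager
def project_id_lock(repo_root: Path):
    """Hold an exclusive advisory lock on state/.brainstorm.lock."""
    lock_path = repo_root / "state" / ".brainstorm.lock"
    lock_path.parent.mkdir(parents=True, exist_ok=True)
    with open(lock_path, "w") as handle:
        fcntl.flock(handle, fcntl.LOCK_EX)  # blocks until the lock is free
        try:
            yield
        finally:
            fcntl.flock(handle, fcntl.LOCK_UN)


def next_available_proj_num(*, repo_root: Path, starting_num: int = 1) -> int:
    """Scan both directories on disk and return the smallest free number."""
    used = set()
    for base in (repo_root / "state" / "projects", repo_root / "projects"):
        if not base.is_dir():
            continue
        for entry in base.iterdir():
            match = PROJ_RE.match(entry.name)
            if match:
                used.add(int(match.group(1)))
    n = starting_num
    while n in used:
        n += 1
    return n


def claim_project_id(repo_root: Path, slug: str) -> str:
    """Hypothetical wrapper: allocate and write the state YAML under the lock."""
    with project_id_lock(repo_root):
        n = next_available_proj_num(repo_root=repo_root)
        project_id = f"PROJ-{n:03d}-{slug}"
        state_dir = repo_root / "state" / "projects"
        state_dir.mkdir(parents=True, exist_ok=True)
        # Writing the YAML before releasing the lock is what claims the ID.
        (state_dir / f"{project_id}.yaml").write_text(f"id: {project_id}\n")
    return project_id
```

The key design point is that the disk scan and the state write happen inside the same critical section, so a second process always sees the first process's claim.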

## Cleanup (Q3A from user dialog)

Renamed the two non-carry-forward duplicates to next-available IDs
(331 + 332) so each PROJ-NNN is unique on the branch:

| Old ID | New ID |
|-|-|
| `PROJ-261-investigating-the-correlation-between-gu` | `PROJ-331-investigating-the-correlation-between-gu` |
| `PROJ-262-quantifying-the-impact-of-magnetic-field` | `PROJ-332-quantifying-the-impact-of-magnetic-field` |

The carry-forward projects (`PROJ-261-evaluating-...` and
`PROJ-262-predicting-...`) keep their numbers, since spec 003 + spec
004 reports + carry-forward manifests + the parent issue/tracker all
reference them.

Files updated:
- Project directories renamed under `projects/`.
- State YAMLs renamed under `state/projects/`; internal `id:` field
updated.
- `.history.jsonl` files renamed.
- `web/data/projects.json` IDs replaced.
- Run-log JSONL entries (2 files) updated to use the new IDs.

## Verification

- `grep -rn "PROJ-261-investigating\|PROJ-262-quantifying" --include="*.md" --include="*.yaml" --include="*.json" --include="*.jsonl"` → 0 matches (clean).
- `pytest tests/phase1/test_project_id_lock.py -v` → 8/8 PASS.
- `pytest tests/phase1/` (full regression) → all PASS.
- `ls projects/` shows each PROJ-NNN unique.

## Forward-looking note

This fix is a defensive narrow patch on the brainstorm allocation
path. A future spec (likely the librarian-agent spec) should consider
whether other places that allocate project-ID-shaped strings
(`paper_initializer`, `task_atomizer`, etc.) also need the lock.

The lock pattern (`project_id_lock` + `next_available_proj_num`) is
reusable — any agent that needs to claim a fresh PROJ-NNN should
import these helpers rather than implementing its own allocation.
110 changes: 110 additions & 0 deletions notes/2026-05-06-spec-005-librarian-outline.md
@@ -0,0 +1,110 @@
# Spec 005 outline — Librarian agent + Phase 1 re-validation

**Status**: Outline only (handoff note for next session). Not yet a Spec Kit feature.
**Date**: 2026-05-06
**Triggered by**: user observation in spec 004 that (a) literature-search behavior is duplicated across `flesh_out`, `reference_validator`, and the spec-003 citation resolver; (b) the gap-analysis fallback when no relevant papers are found should instead trigger a multi-step expanded search; (c) Single-Source-of-Truth Constitutional Principle I says these duplicated implementations should be consolidated.
**Predecessors**: spec 003 (Phase 1 testing) + spec 004 (Phase 2 testing)

## Goal

Build a `librarian` agent that consolidates literature search + citation verification into a single canonical implementation, then re-validate all Phase 1 agents that depend on lit search behavior.

## Scope (3 user stories, P1 each)

### US1 — `librarian` agent: validated literature search, single source of truth

A `librarian` agent that takes a search term (or list of terms) and returns a list of verified citations. The agent's contract:

**Input**: a search term plus optional context (project field, idea body, etc.).

**Output**: per citation:
- DOI / arXiv ID / HTTPS URL (the canonical pointer)
- bibliographic info (title, authors, venue, year)
- summary of content (1-3 sentences) grounded in the actual fetched content (not hallucinated)
- verification verdict: pointer resolves, content matches the bibliographic claim, summary is faithful

**Internals**:
1. **Web search** (Semantic Scholar API, arXiv API, Google Scholar via `scholarly`, etc.) for the term.
2. **Download** each candidate's PDF / HTML / abstract.
3. **Verify**: (a) URL/address resolves, (b) bibliographic info from search matches primary source, (c) summary derived from actual content (not the search snippet).
4. **Return** structured JSON.

Re-uses the spec-003 citation resolver pattern for verification (or extracts shared code into a utility).
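
One possible shape for the per-citation output contract, as a sketch only: the class and field names below are hypothetical (the outline lists what each citation carries, not a schema), and the values in any instance would come from the librarian's fetch-and-verify steps.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class Verification:
    """The three checks named in the outline's verification verdict."""
    pointer_resolves: bool
    bibliography_matches: bool
    summary_faithful: bool

    @property
    def passed(self) -> bool:
        return (
            self.pointer_resolves
            and self.bibliography_matches
            and self.summary_faithful
        )


@dataclass
class VerifiedCitation:
    pointer: str  # DOI, arXiv ID, or HTTPS URL (the canonical pointer)
    title: str
    authors: list[str]
    venue: str
    year: int
    summary: str  # 1-3 sentences grounded in the fetched content
    verification: Verification


def to_json(citations: list[VerifiedCitation]) -> str:
    """Serialize citations to the structured JSON the agent returns."""
    return json.dumps([asdict(c) for c in citations], indent=2)
```

Keeping the verdict as three separate booleans, rather than one flag, lets a caller see which check failed when a candidate is rejected.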

### US2 — Multi-step expanded search when results are thin

When an initial search returns < N (default 5) verified citations, the librarian:

**Step 1**: brainstorms an expanded list of terms accommodating alternative naming, ranked by relevance to the originating query. The LLM is asked: "What are 10-20 alternative phrasings, related concepts, or sub-area terms that might surface relevant papers if the original search missed them?"

**Step 2**: iterates over those terms, performing at least 10 distinct searches, accumulating verified citations until ≥ N (default 5) are found OR the term list is exhausted.

**Step 3**: returns the verified citations PLUS a record of which expanded terms were searched (for the log + the idea's `.md` file).

The agent updates:
- run-log entry with expanded terms used + per-term hit count
- the calling project's idea.md (if applicable) with a "Search trail" subsection naming the expanded terms.
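
The three steps above can be sketched as a loop with the search and term-expansion backends injected as callables. Everything here is hypothetical naming (`expanded_search`, `search`, `expand_terms`); the sketch applies the stop condition from Step 2 and does not enforce the "at least 10 distinct searches" minimum, which would need a separate policy decision.

```python
from typing import Callable, Iterable


def expanded_search(
    query: str,
    search: Callable[[str], list[str]],
    expand_terms: Callable[[str], Iterable[str]],
    min_verified: int = 5,
) -> tuple[list[str], dict[str, int]]:
    """Return (verified pointers, search trail of term -> new-hit count)."""
    verified: list[str] = []
    trail: dict[str, int] = {}

    def run(term: str) -> None:
        new_hits = 0
        for pointer in search(term):
            if pointer not in verified:  # de-duplicate across terms
                verified.append(pointer)
                new_hits += 1
        trail[term] = new_hits

    run(query)  # initial search
    if len(verified) >= min_verified:
        return verified, trail

    # Thin results: walk the expanded term list (assumed ranked by relevance).
    for term in expand_terms(query):
        if term in trail:
            continue
        run(term)
        if len(verified) >= min_verified:
            break
    return verified, trail
```

The returned trail is the record that would feed both the run-log entry and the "Search trail" subsection of the idea file.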

### US3 — Re-validate Phase 1 agents under librarian-backed lit search

After the librarian agent is built, re-validate two Phase 1 agents whose behavior may shift:

**research_question_validator** — its 4-check audit (phenomenon-vs-method, circularity, triviality, narrowing) may rely indirectly on lit-search-driven evidence. Re-run the full Phase 1 pipeline (brainstorm → flesh_out → validator → project_initializer) on the carry-forward canonicals (PROJ-261-evaluating-, PROJ-262-predicting-) and confirm the validator's verdicts still hold.

**flesh_out** — its citation-fetching behavior was a primary trigger for spec 005. Re-run flesh_out on the canonicals; confirm:
- `idea.md` now includes a "Search trail" subsection per US2
- citations are now librarian-validated (not just regex-resolved)
- the previously-empty Literature gap analysis sections are now populated with real cited literature (the gap-analysis-as-feature path from spec 003 should only fire when the librarian's expanded search ALSO returns nothing — a much stricter trigger)

If validator or flesh_out's behavior on the canonicals materially changes, the spec 003 + spec 004 reports gain an addendum noting the shift and the new audit verdicts.

## Touch points

| File | Change |
|-|-|
| `src/llmxive/agents/librarian.py` | NEW — agent class |
| `agents/prompts/librarian.md` | NEW — prompt |
| `agents/registry.yaml` | NEW entry; existing `lit_search` tool → DEPRECATE or refactor |
| `src/llmxive/tools/lit_search.py` | refactor to call librarian, OR remove (callers go to librarian directly) |
| `src/llmxive/agents/flesh_out.py` (or its prompt) | change: call librarian instead of lit_search |
| `src/llmxive/agents/reference_validator.py` (or its prompt) | change: call librarian for verification step |
| `tests/phase1/citation_resolver.py` | refactor: re-export librarian's verification logic (or deprecate, since librarian subsumes it) |
| `tests/phase2/test_librarian.py` (NEW) | extensive tests covering many domains: every project we've brainstormed thus far (CS, chemistry, biology, physics, neuroscience, etc.) |
| `notes/2026-05-NN-spec-005-librarian-diagnostic.md` | NEW — diagnostic report mirroring spec 003 / spec 004 structure |

## Test substrate (US3 input)

The carry-forward projects from spec 004's manifest:
- PROJ-261-evaluating-the-impact-of-code-duplicatio (CS)
- PROJ-262-predicting-molecular-dipole-moments-with (chemistry)

Plus optional broader-domain coverage drawing from the larger pool of brainstormed projects in `projects/` (the cron-driven runs have produced ~40+ projects across all default fields). For US1 testing the librarian on diverse terms, sample one project per field.

## Open design questions for `/speckit-clarify`

1. **Web-search backend choice** — Semantic Scholar API + arXiv API only (free, no rate-limit issues for this scale)? Or also Google Scholar via `scholarly` (richer but rate-limited)? Or a real web-search service via DARTMOUTH_CHAT_API_KEY?
2. **Verification depth** — does "verify summary matches content" require downloading full PDFs (slow), or is the abstract enough? PDF gives more truth-grounding; abstract is faster.
3. **Caching** — librarian queries can be expensive; should results be cached on disk (e.g., keyed on `sha256(term)`)? If yes, where + how does cache invalidation work?
4. **Failure mode** — what does the librarian do if EVEN the expanded multi-step search finds <5 verified citations? Surface an explicit `librarian_inconclusive.yaml` sentinel; let the caller decide whether to gap-analyze, escalate to human, or fail.
5. **Re-validation scope of US3** — re-run Phase 1 from brainstorm forward, OR re-run only flesh_out + validator on the existing canonical idea bodies? The latter is cheaper but doesn't catch cascading shifts.
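
To make question 3 concrete, here is one way a `sha256(term)`-keyed disk cache could look. It is a sketch of a single option, not a decision: `cache_path` and `load_or_search` are hypothetical names, whitespace and case normalization is an added assumption, and cache invalidation is deliberately left open.

```python
import hashlib
import json
from pathlib import Path
from typing import Callable


def cache_path(cache_dir: Path, term: str) -> Path:
    """Map a search term to a cache file named after sha256(normalized term)."""
    normalized = " ".join(term.lower().split())  # assumption: case-insensitive keys
    digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
    return cache_dir / f"{digest}.json"


def load_or_search(
    cache_dir: Path, term: str, search: Callable[[str], list]
) -> list:
    """Return cached results when present; otherwise search and store them."""
    path = cache_path(cache_dir, term)
    if path.exists():
        return json.loads(path.read_text())
    results = search(term)
    cache_dir.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(results))
    return results
```

Hashing the term keeps filenames safe for arbitrary queries, at the cost of making the cache directory opaque to a human reader.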

## Anticipated effort

- US1 (librarian implementation + tests): ~2-3 days (the verification protocol is the hardest part, especially with PDFs)
- US2 (expanded search): ~0.5 day given US1's substrate
- US3 (re-validation): ~1 day for flesh_out + validator on 2 canonicals
- **Total**: ~3.5-4.5 days

## Carry-forward to spec 006+

If spec 005 closes cleanly, the librarian becomes available to all paper-side agents (paper_writing, paper_implementer, reference_validator) and downstream phase-test specs (006-007 etc. — Phase 3-4 testing). This is the highest-leverage piece of infrastructure across the whole pipeline after the four Phase 1 agents.

## Suggested workflow

When the user is ready to start spec 005:

1. `/speckit-specify` with the bullet "build a librarian agent per `notes/2026-05-06-spec-005-librarian-outline.md`; re-validate flesh_out + research_question_validator on the carry-forward projects from spec 004"
2. `/speckit-clarify` — resolve the 5 open design questions above
3. `/speckit-plan` → `/speckit-tasks` → `/speckit-analyze` → `/speckit-implement` (mirror spec 004's flow)

This will produce a separate PR. Spec 005's diagnostic report will mirror spec 003's structure.