ai-partner: build embeddings pipeline script

**Parent epic:** #1446 (Amicus — AI Study Partner v1)
**Phase:** 1 · **Size:** M

Build the one-time and incremental pipeline that chunks all Companion Study content, generates vector embeddings with OpenAI's `text-embedding-3-small` model, and writes them to `embeddings.db`, ready to be merged into `scripture.db` (see #1448).

---

## Files to create

- `_tools/build_embeddings.py` — main orchestrator script (new)
- `_tools/build_embeddings_chunks.py` — chunker module, one function per source type (new)
- `_tools/embedding_manifest.json` — generated file tracking chunk hashes for incremental rebuilds (new, gitignored)

## Files to modify

- `_tools/content_writer.py` — `save_chapter()` writes the affected `chapter_id` to `_tools/embedding_manifest.json` under a `dirty_chapters` array so the next incremental build picks it up
- `.gitignore` — add `_tools/embedding_manifest.json`, `embeddings.db`
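
The `save_chapter()` hook described above could be sketched as follows. This is only an illustration: `mark_chapter_dirty` is a hypothetical helper name, the `chapter_id` format shown in the usage note is assumed, and the real manifest may carry other keys (such as chunk hashes) that this sketch simply preserves.

```python
import json
from pathlib import Path


def mark_chapter_dirty(manifest_path: Path, chapter_id: str) -> None:
    """Record chapter_id under "dirty_chapters" in the manifest (hypothetical helper)."""
    if manifest_path.exists():
        manifest = json.loads(manifest_path.read_text(encoding="utf-8"))
    else:
        manifest = {}
    dirty = manifest.setdefault("dirty_chapters", [])
    if chapter_id not in dirty:  # keep the array free of duplicates
        dirty.append(chapter_id)
    # Write to a temp file and swap it in, so a crash cannot leave a half-written manifest.
    tmp_path = manifest_path.with_name(manifest_path.name + ".tmp")
    tmp_path.write_text(json.dumps(manifest, indent=2), encoding="utf-8")
    tmp_path.replace(manifest_path)
```

`save_chapter()` would call something like `mark_chapter_dirty(ROOT / "_tools" / "embedding_manifest.json", "genesis-1")` after a successful write. Building the path with `Path` keeps it separator-agnostic on Windows.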

## Conventions to follow

- Match the orchestrator + loaders pattern of `_tools/build_sqlite.py` (imports from `build_sqlite_schema.py` and `build_sqlite_loaders.py`). `build_embeddings.py` is the orchestrator; `build_embeddings_chunks.py` holds one chunker function per source type.
- File header: include the UTF-8 stdout preamble (`sys.stdout.reconfigure(encoding='utf-8')`), matching `build_sqlite.py` lines 15–17.
- Path resolution: `ROOT = Path(__file__).resolve().parent.parent` — match `build_sqlite.py` line 19.
- Output: `[OK]` print markers on successful steps — match existing `_tools/` convention.
- Windows compatibility: `python`, not `python3`. Never assume Unix path separators.

---

## Chunker strategy

Chunk sources and approximate counts. Each chunker function reads content JSON files and yields `(chunk_id, source_type, source_id, metadata_dict, text)` tuples.

| Source type | Per | Approx count |
|---|---|---|
| `section_panel` | one per (section, panel_type) | ~16,000 |
| `chapter_panel` | one per (chapter, panel_type) | ~7,500 |
| `word_study` | one per entry | 46 |
| `lexicon_entry` | one per entry (both Greek + Hebrew) | 13,655 |
| `debate_topic` | one per topic | 308 |
| `cross_ref_thread_note` | one per thread note | ~200 |
| `journey_stop` | one per stop including connective text | ~existing |
| `meta_faq` | one per article (see #1449) | ~50 initial |

**Chunk IDs** are deterministic, in the format `{source_type}:{source_id}` (e.g., `section_panel:genesis-1-s1-sarna`, `lexicon_entry:heb-H7225`). Never use UUIDs or hashes as IDs: stable, readable IDs enable diffing between builds.

**Metadata captured per chunk** (stored alongside embedding for retrieval filtering):
```
{
  "scholar_id": str | None,
  "tradition": str | None,
  "book_id": str | None,
  "chapter_num": int | None,
  "verse_start": int | None,
  "verse_end": int | None,
  "panel_type": str | None
}
```
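
As a rough illustration of the tuple shape and ID format above, one chunker might look like the sketch below. The input layout (`section_id`, `panels`, and the per-panel `text`, `scholar_id`, `tradition` keys) is an assumption made for the example, not the actual content JSON schema, and `make_chunk` is a hypothetical helper.

```python
from typing import Any, Dict, Iterator, List, Tuple

METADATA_KEYS = ("scholar_id", "tradition", "book_id", "chapter_num",
                 "verse_start", "verse_end", "panel_type")
Chunk = Tuple[str, str, str, Dict[str, Any], str]


def make_chunk(source_type: str, source_id: str, text: str, **metadata: Any) -> Chunk:
    """Build (chunk_id, source_type, source_id, metadata_dict, text) with a deterministic ID."""
    unknown = set(metadata) - set(METADATA_KEYS)
    if unknown:
        raise ValueError(f"unknown metadata keys: {sorted(unknown)}")
    # Every chunk carries all seven keys; anything not supplied is None.
    full_metadata = {key: metadata.get(key) for key in METADATA_KEYS}
    return (f"{source_type}:{source_id}", source_type, source_id, full_metadata, text)


def chunk_section_panels(sections: List[dict]) -> Iterator[Chunk]:
    """Yield one chunk per (section, panel_type). The input layout is assumed."""
    for section in sections:
        # Sort panel types so the output order is stable across runs.
        for panel_type, panel in sorted(section["panels"].items()):
            yield make_chunk(
                "section_panel",
                f"{section['section_id']}-{panel_type}",
                panel["text"],
                scholar_id=panel.get("scholar_id"),
                tradition=panel.get("tradition"),
                book_id=section.get("book_id"),
                chapter_num=section.get("chapter_num"),
                verse_start=section.get("verse_start"),
                verse_end=section.get("verse_end"),
                panel_type=panel_type,
            )
```

Routing every source type through one helper keeps the ID format and the metadata key set identical across chunkers, which is what makes build-to-build diffs meaningful.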

---

## Embedding API

- Model: `text-embedding-3-small` (1536 dimensions, $0.02/1M tokens)
- Batch size: 100 chunks per API call (OpenAI limit is higher but 100 keeps retry cost bounded)
- Retry policy: exponential backoff, up to 3 retries, on HTTP 429 and 5xx responses
- Checkpointing: after each successful batch, append results to `embeddings.db` and update `embedding_manifest.json` with latest chunk hashes. On crash/Ctrl-C, resume from last checkpoint.

**API key source:** `OPENAI_API_KEY` environment variable. Script must fail early with a clear message if the variable is missing. Never read from a file or hardcode.
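
The key check, batching, and retry rules above might be wired together roughly as follows. This is a sketch under stated assumptions: `embed_fn` stands in for whatever function actually calls the embeddings endpoint, `RetryableError` is a hypothetical exception that wrapper would raise on 429/5xx, "3×" is read as three retries after the first attempt, and the 1-second base delay is illustrative.

```python
import os
import time
from typing import Callable, Iterator, List, Sequence

BATCH_SIZE = 100
MAX_RETRIES = 3


class RetryableError(Exception):
    """Hypothetical stand-in for an HTTP 429 or 5xx response from the API."""


def require_api_key(env=os.environ) -> str:
    """Fail early with a clear message if OPENAI_API_KEY is missing or blank."""
    key = env.get("OPENAI_API_KEY", "").strip()
    if not key:
        raise SystemExit("[ERROR] OPENAI_API_KEY is not set; set it before running build_embeddings.py")
    return key


def batched(items: Sequence, size: int = BATCH_SIZE) -> Iterator[Sequence]:
    """Split items into consecutive batches of at most `size`."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def embed_with_retry(texts: Sequence[str],
                     embed_fn: Callable[[Sequence[str]], List[List[float]]],
                     sleep: Callable[[float], None] = time.sleep,
                     base_delay: float = 1.0) -> List[List[float]]:
    """Call embed_fn, retrying up to MAX_RETRIES times with delays of 1s, 2s, 4s."""
    for attempt in range(MAX_RETRIES + 1):
        try:
            return embed_fn(texts)
        except RetryableError:
            if attempt == MAX_RETRIES:
                raise  # retries exhausted; let the caller stop at the last checkpoint
            sleep(base_delay * 2 ** attempt)
    raise AssertionError("unreachable")
```

Passing `embed_fn` and `sleep` in as parameters lets the retry logic be tested without network access or real delays.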

## Output database (embeddings.db)

`build_embeddings.py` writes to a standalone `embeddings.db` SQLite file (NOT directly into `scripture.db`). The `build_sqlite.py` orchestrator (updated in #1448) merges `embeddings.db` into `scripture.db` during its build step.

```sql
CREATE TABLE embedding_chunks (
  chunk_id       TEXT PRIMARY KEY,
  source_type    TEXT NOT NULL,
  source_id      TEXT NOT NULL,
  text           TEXT NOT NULL,
  metadata_json  TEXT NOT NULL,
  content_hash   TEXT NOT NULL,       -- SHA256 of text + metadata; used for incremental dedup
  embedding      BLOB NOT NULL        -- 1536 floats, packed as little-endian float32
);

CREATE INDEX idx_source ON embedding_chunks(source_type, source_id);
```
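
To make the `embedding` and `content_hash` columns concrete, here is a minimal sketch of packing, hashing, and incremental selection. How text and metadata are combined before hashing is an assumption (text, a newline, then key-sorted metadata JSON), and all function names are hypothetical.

```python
import hashlib
import json
import sqlite3
import struct

EMBEDDING_DIM = 1536  # output size of text-embedding-3-small


def pack_embedding(vector):
    """Pack 1536 floats as little-endian float32 (6144 bytes)."""
    if len(vector) != EMBEDDING_DIM:
        raise ValueError(f"expected {EMBEDDING_DIM} dimensions, got {len(vector)}")
    return struct.pack(f"<{EMBEDDING_DIM}f", *vector)


def unpack_embedding(blob):
    """Inverse of pack_embedding; returns a list of Python floats."""
    return list(struct.unpack(f"<{EMBEDDING_DIM}f", blob))


def content_hash(text, metadata):
    # Assumption: sorting metadata keys keeps the hash independent of dict order.
    canonical = json.dumps(metadata, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(f"{text}\n{canonical}".encode("utf-8")).hexdigest()


def chunks_to_embed(chunks, known_hashes):
    """Return chunks that are new or whose hash differs from known_hashes[chunk_id]."""
    return [c for c in chunks if known_hashes.get(c[0]) != content_hash(c[4], c[3])]


def upsert_chunk(conn, chunk, vector):
    """Insert or overwrite one row; the primary key makes re-runs idempotent."""
    chunk_id, source_type, source_id, metadata, text = chunk
    conn.execute(
        "INSERT OR REPLACE INTO embedding_chunks "
        "(chunk_id, source_type, source_id, text, metadata_json, content_hash, embedding) "
        "VALUES (?, ?, ?, ?, ?, ?, ?)",
        (chunk_id, source_type, source_id, text,
         json.dumps(metadata, sort_keys=True, ensure_ascii=False),
         content_hash(text, metadata), pack_embedding(vector)),
    )
```

Because `chunk_id` is the primary key, `INSERT OR REPLACE` lets a resumed or incremental run overwrite a stale row without creating duplicates.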

## CLI interface

```
python _tools/build_embeddings.py              # full rebuild (re-embeds everything)
python _tools/build_embeddings.py --incremental  # only re-embeds chunks whose content_hash changed
python _tools/build_embeddings.py --dry-run     # prints chunk count and estimated cost, no API calls
python _tools/build_embeddings.py --source section_panel  # restrict to one source type (for testing)
```
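
The flags above map onto `argparse` fairly directly. A minimal sketch follows; whether flags may be combined (for example `--dry-run --source section_panel`) is not specified in this issue, so this version simply allows it.

```python
import argparse

SOURCE_TYPES = (
    "section_panel", "chapter_panel", "word_study", "lexicon_entry",
    "debate_topic", "cross_ref_thread_note", "journey_stop", "meta_faq",
)


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(
        prog="build_embeddings.py",
        description="Chunk Companion Study content and write embeddings.db.",
    )
    parser.add_argument("--incremental", action="store_true",
                        help="only re-embed chunks whose content_hash changed")
    parser.add_argument("--dry-run", action="store_true",
                        help="print chunk count and estimated cost; make no API calls")
    parser.add_argument("--source", choices=SOURCE_TYPES, default=None,
                        help="restrict the run to one source type (for testing)")
    return parser
```

Using `choices` makes a mistyped source type fail at argument parsing rather than silently producing zero chunks.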

## Cost guardrails

- `--dry-run` must print: total chunks, total estimated tokens, estimated cost in USD
- Full rebuild without `--dry-run` prompts `Continue? [y/N]` if estimated cost > $0.50
- `--incremental` does not prompt (expected to be cheap)
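
The guardrail arithmetic can be sketched as below. The 4-characters-per-token ratio is a rough heuristic assumed here for estimation only; a real tokenizer would give tighter numbers. The price and threshold come from this issue, and the function names are hypothetical.

```python
import math

PRICE_PER_MILLION_TOKENS_USD = 0.02  # text-embedding-3-small, per this issue
PROMPT_THRESHOLD_USD = 0.50
CHARS_PER_TOKEN = 4  # heuristic assumption, not a tokenizer


def estimate_tokens(texts) -> int:
    """Approximate total tokens as ceil(len(text) / 4) summed over all chunks."""
    return sum(math.ceil(len(text) / CHARS_PER_TOKEN) for text in texts)


def estimate_cost_usd(tokens: int) -> float:
    return tokens / 1_000_000 * PRICE_PER_MILLION_TOKENS_USD


def needs_confirmation(cost_usd: float, incremental: bool, dry_run: bool) -> bool:
    """Only a real full rebuild above the threshold should prompt Continue? [y/N]."""
    return not incremental and not dry_run and cost_usd > PROMPT_THRESHOLD_USD


def dry_run_report(texts) -> str:
    tokens = estimate_tokens(texts)
    return (f"chunks: {len(texts)}  est. tokens: {tokens}  "
            f"est. cost: ${estimate_cost_usd(tokens):.4f}")
```

At $0.02 per million tokens, the $0.50 threshold corresponds to roughly 25 million estimated tokens.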

---

## Acceptance criteria

- [ ] `python _tools/build_embeddings.py --dry-run` prints chunk counts for every source type matching expected approximate counts (±10%)
- [ ] Full run generates `embeddings.db` with row count equal to dry-run total
- [ ] Every embedding is exactly 1536 dimensions; no nulls; content_hash populated
- [ ] Chunk IDs are deterministic (two full rebuilds produce identical chunk_id set)
- [ ] `--incremental` after a `save_chapter()` call re-embeds only that chapter's chunks
- [ ] Resume-from-checkpoint works (Ctrl-C mid-run, then a re-run completes without re-embedding batches that were already checkpointed)
- [ ] Cost prompt appears correctly for full rebuilds > $0.50
- [ ] Missing `OPENAI_API_KEY` fails immediately with a clear message
- [ ] `_tools/embedding_manifest.json` updated on every successful batch
- [ ] Works on Windows (uses `python` not `python3`, no Unix-only path assumptions)

## Out of scope

- Merging `embeddings.db` into `scripture.db` — that's #1448
- Runtime vector search — that's #1451
- Embedding user queries at runtime — handled by the proxy in #1450 (separate concern)

