Parent epic: #1446 (Amicus — AI Study Partner v1)
Phase: 1 · Size: M
Build the one-time and incremental pipeline that chunks all Companion Study content, generates vector embeddings via OpenAI's text-embedding-3-small API, and writes them to embeddings.db ready to be merged into scripture.db (see #1448).
Files to create
_tools/build_embeddings.py — main orchestrator script (new)
_tools/build_embeddings_chunks.py — chunker module, one function per source type (new)
_tools/embedding_manifest.json — generated file tracking chunk hashes for incremental rebuilds (new, gitignored)
Files to modify
_tools/content_writer.py — save_chapter() writes the affected chapter_id to _tools/embedding_manifest.json under a dirty_chapters array so the next incremental build picks it up
.gitignore — add _tools/embedding_manifest.json, embeddings.db
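A minimal sketch of the manifest update that save_chapter() could call, assuming the manifest is a single JSON object with a dirty_chapters list. The helper name mark_chapter_dirty and the example chapter IDs are hypothetical, not part of the existing content_writer.py.

```python
import json
from pathlib import Path


def mark_chapter_dirty(manifest_path: Path, chapter_id: str) -> None:
    """Record chapter_id under dirty_chapters, creating the manifest if needed."""
    if manifest_path.exists():
        manifest = json.loads(manifest_path.read_text(encoding="utf-8"))
    else:
        manifest = {}
    dirty = manifest.setdefault("dirty_chapters", [])
    # Keep the list free of duplicates so repeated saves stay idempotent.
    if chapter_id not in dirty:
        dirty.append(chapter_id)
    manifest_path.write_text(json.dumps(manifest, indent=2), encoding="utf-8")
```

Other keys in the manifest (such as the chunk hashes) pass through untouched, so the writer and the embedding build can share the file.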
Conventions to follow
- Match the orchestrator + loaders pattern of _tools/build_sqlite.py (imports from build_sqlite_schema.py and build_sqlite_loaders.py). build_embeddings.py is the orchestrator; build_embeddings_chunks.py holds one chunker function per source type.
- File header: include UTF-8 stdout preamble (sys.stdout.reconfigure(encoding='utf-8')); match build_sqlite.py lines 15–17.
- Path resolution: ROOT = Path(__file__).resolve().parent.parent; match build_sqlite.py line 19.
- Output: [OK] print markers on successful steps; match existing _tools/ convention.
- Windows compatibility: python, not python3. Never assume Unix path separators.
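The header conventions above might look roughly like this. It is a sketch only: the function names configure_stdout, repo_root, and ok are hypothetical wrappers added so the pieces can be exercised in isolation, and the real build_sqlite.py may inline these lines instead.

```python
import sys
from pathlib import Path


def configure_stdout() -> None:
    """Force UTF-8 output so non-ASCII text prints on Windows consoles."""
    # Guarded because a replaced stdout (e.g. StringIO) has no reconfigure().
    if hasattr(sys.stdout, "reconfigure"):
        sys.stdout.reconfigure(encoding="utf-8")


def repo_root(script_path) -> Path:
    """Repository root, assuming the script lives in <root>/_tools/."""
    return Path(script_path).resolve().parent.parent


def ok(message: str) -> None:
    """Print a success marker in the [OK] style."""
    print(f"[OK] {message}")
```

In the script itself this would be called as repo_root(__file__), and every path would be built with the / operator on Path objects rather than string concatenation, which keeps separators portable.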
Chunker strategy
Chunk sources and approximate counts. Each chunker function reads content JSON files and yields (chunk_id, source_type, source_id, metadata_dict, text) tuples.
| Source type | Per | Approx count |
|---|---|---|
| section_panel | one per (section, panel_type) | ~16,000 |
| chapter_panel | one per (chapter, panel_type) | ~7,500 |
| word_study | one per entry | 46 |
| lexicon_entry | one per entry (both Greek + Hebrew) | 13,655 |
| debate_topic | one per topic | 308 |
| cross_ref_thread_note | one per thread note | ~200 |
| journey_stop | one per stop including connective text | ~existing |
| meta_faq | one per article (see #1449) | ~50 initial |
Chunk IDs are deterministic — format: {source_type}:{source_id} (e.g., section_panel:genesis-1-s1-sarna, lexicon_entry:heb-H7225). Never use UUIDs or hashes — deterministic IDs enable diffing between builds.
Metadata captured per chunk (stored alongside embedding for retrieval filtering):
{
"scholar_id": str | None,
"tradition": str | None,
"book_id": str | None,
"chapter_num": int | None,
"verse_start": int | None,
"verse_end": int | None,
"panel_type": str | None
}
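One possible shape for a chunker function, given as a sketch. It takes already-parsed entries rather than reading JSON files, and the entry field names "id" and "text", the helper make_metadata, and the function name chunk_lexicon_entries are all assumptions made for illustration.

```python
from typing import Any, Iterator

# (chunk_id, source_type, source_id, metadata_dict, text)
Chunk = tuple[str, str, str, dict[str, Any], str]

METADATA_KEYS = (
    "scholar_id", "tradition", "book_id", "chapter_num",
    "verse_start", "verse_end", "panel_type",
)


def make_metadata(**known: Any) -> dict[str, Any]:
    """Return a dict with every metadata key present, defaulting to None."""
    unknown = set(known) - set(METADATA_KEYS)
    if unknown:
        raise KeyError(f"unexpected metadata keys: {sorted(unknown)}")
    return {key: known.get(key) for key in METADATA_KEYS}


def chunk_lexicon_entries(entries: list[dict]) -> Iterator[Chunk]:
    """Yield one chunk per lexicon entry with a deterministic chunk ID."""
    source_type = "lexicon_entry"
    for entry in entries:
        source_id = entry["id"]  # hypothetical field name
        chunk_id = f"{source_type}:{source_id}"
        yield (chunk_id, source_type, source_id, make_metadata(), entry["text"])
```

Because the ID is built only from the source type and source ID, running the chunker twice over unchanged content yields identical IDs, which is what makes build-to-build diffing possible.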
Embedding API
- Model: text-embedding-3-small (1536 dimensions, $0.02/1M tokens)
- Batch size: 100 chunks per API call (OpenAI's limit is higher, but 100 keeps retry cost bounded)
- Retry policy: exponential backoff 3× on 429/5xx
- Checkpointing: after each successful batch, append results to embeddings.db and update embedding_manifest.json with latest chunk hashes. On crash/Ctrl-C, resume from last checkpoint.
API key source: OPENAI_API_KEY environment variable. Script must fail early with a clear message if the variable is missing. Never read from a file or hardcode.
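The key check, batching, and retry rules above can be sketched without touching the network by injecting the embedding call. Everything named here is illustrative: RetryableError stands in for whatever exception the real client raises on 429/5xx, embed_fn is a placeholder for the actual API call, "3×" is read as up to three retries after the first attempt, and the 1-second base delay is an assumption.

```python
import os
import time
from typing import Callable, Iterator, Sequence

BATCH_SIZE = 100
MAX_RETRIES = 3  # assumption: "3x" means three retries after the first attempt


class RetryableError(Exception):
    """Hypothetical stand-in for a 429 or 5xx response from the API."""


def require_api_key(env=os.environ) -> str:
    """Fail early with a clear message if OPENAI_API_KEY is missing."""
    key = env.get("OPENAI_API_KEY")
    if not key:
        raise SystemExit("OPENAI_API_KEY is not set; export it before running.")
    return key


def batched(items: Sequence, size: int = BATCH_SIZE) -> Iterator[Sequence]:
    """Split items into consecutive batches of at most `size`."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def embed_with_retry(
    embed_fn: Callable[[Sequence[str]], list],
    texts: Sequence[str],
    max_retries: int = MAX_RETRIES,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> list:
    """Call embed_fn, retrying with exponential backoff on RetryableError."""
    for attempt in range(max_retries + 1):
        try:
            return embed_fn(texts)
        except RetryableError:
            if attempt == max_retries:
                raise
            sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s with the defaults
```

Writing each batch to embeddings.db right after embed_with_retry returns gives the checkpoint: a resumed run only needs to skip chunk IDs whose stored hash already matches.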
Output database (embeddings.db)
build_embeddings.py writes to a standalone embeddings.db SQLite file (NOT directly into scripture.db). The build_sqlite.py orchestrator (updated in #1448) merges embeddings.db into scripture.db during its build step.
CREATE TABLE embedding_chunks (
chunk_id TEXT PRIMARY KEY,
source_type TEXT NOT NULL,
source_id TEXT NOT NULL,
text TEXT NOT NULL,
metadata_json TEXT NOT NULL,
content_hash TEXT NOT NULL, -- SHA256 of text + metadata; used for incremental dedup
embedding BLOB NOT NULL -- 1536 floats, packed as little-endian float32
);
CREATE INDEX idx_source ON embedding_chunks(source_type, source_id);
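A sketch of writing one row against this schema using only the standard library. The exact recipe for content_hash is an assumption (the spec only says "SHA256 of text + metadata"); here the text is concatenated with the metadata serialized as key-sorted JSON. The helper names are hypothetical.

```python
import hashlib
import json
import sqlite3
import struct

DIMENSIONS = 1536


def pack_embedding(vector: list) -> bytes:
    """Pack floats as little-endian float32 ('<' prefix, 'f' format)."""
    if len(vector) != DIMENSIONS:
        raise ValueError(f"expected {DIMENSIONS} floats, got {len(vector)}")
    return struct.pack(f"<{DIMENSIONS}f", *vector)


def unpack_embedding(blob: bytes) -> list:
    """Inverse of pack_embedding."""
    return list(struct.unpack(f"<{DIMENSIONS}f", blob))


def content_hash(text: str, metadata: dict) -> str:
    """SHA256 over text plus metadata (assumed: key-sorted JSON)."""
    payload = text + json.dumps(metadata, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def upsert_chunk(conn: sqlite3.Connection, chunk_id: str, source_type: str,
                 source_id: str, text: str, metadata: dict, vector: list) -> None:
    """Insert a chunk, replacing any existing row with the same chunk_id."""
    conn.execute(
        "INSERT OR REPLACE INTO embedding_chunks "
        "(chunk_id, source_type, source_id, text, metadata_json, "
        "content_hash, embedding) VALUES (?, ?, ?, ?, ?, ?, ?)",
        (chunk_id, source_type, source_id, text,
         json.dumps(metadata, sort_keys=True),
         content_hash(text, metadata), pack_embedding(vector)),
    )
```

Each packed embedding is 1536 × 4 = 6,144 bytes, which is a quick sanity check on stored BLOB sizes.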
CLI interface
python _tools/build_embeddings.py # full rebuild (re-embeds everything)
python _tools/build_embeddings.py --incremental # only re-embeds chunks whose content_hash changed
python _tools/build_embeddings.py --dry-run # prints chunk count and estimated cost, no API calls
python _tools/build_embeddings.py --source section_panel # restrict to one source type (for testing)
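The flags above map directly onto argparse. This is a sketch: the spec does not say whether --incremental and --dry-run may be combined, so no mutual exclusion is enforced here, and build_parser is a hypothetical function name.

```python
import argparse

SOURCE_TYPES = [
    "section_panel", "chapter_panel", "word_study", "lexicon_entry",
    "debate_topic", "cross_ref_thread_note", "journey_stop", "meta_faq",
]


def build_parser() -> argparse.ArgumentParser:
    """Build the CLI parser; with no flags the script does a full rebuild."""
    parser = argparse.ArgumentParser(
        prog="build_embeddings.py",
        description="Chunk content and write embeddings to embeddings.db.",
    )
    parser.add_argument("--incremental", action="store_true",
                        help="only re-embed chunks whose content_hash changed")
    parser.add_argument("--dry-run", action="store_true",
                        help="print chunk count and estimated cost, no API calls")
    parser.add_argument("--source", choices=SOURCE_TYPES, default=None,
                        help="restrict to one source type (for testing)")
    return parser
```

Using choices makes a mistyped source type fail at parse time with the list of valid values, before any content is read.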
Cost guardrails
- --dry-run must print: total chunks, total estimated tokens, estimated cost in USD
- Full rebuild without --dry-run prompts Continue? [y/N] if estimated cost > $0.50
- --incremental does not prompt (expected to be cheap)
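The guardrail arithmetic can be sketched as follows. The token estimate is a loud assumption: it uses a rough 4-characters-per-token heuristic rather than a real tokenizer, so the printed cost is only an approximation. All function names are hypothetical.

```python
import math
from typing import Callable, Iterable

PRICE_PER_MILLION_TOKENS_USD = 0.02
PROMPT_THRESHOLD_USD = 0.50
CHARS_PER_TOKEN = 4  # assumption: crude heuristic, not a tokenizer


def estimate_tokens(texts: Iterable[str]) -> int:
    """Approximate total tokens from character counts."""
    return sum(math.ceil(len(text) / CHARS_PER_TOKEN) for text in texts)


def estimate_cost_usd(tokens: int) -> float:
    """Cost at $0.02 per 1M tokens."""
    return tokens / 1_000_000 * PRICE_PER_MILLION_TOKENS_USD


def needs_confirmation(cost_usd: float, incremental: bool) -> bool:
    """Only a full rebuild above the threshold prompts the user."""
    return not incremental and cost_usd > PROMPT_THRESHOLD_USD


def confirm(ask: Callable[[str], str] = input) -> bool:
    """Ask Continue? [y/N]; anything other than y/Y counts as no."""
    return ask("Continue? [y/N] ").strip().lower() == "y"


def dry_run_report(texts: list) -> str:
    """Format the three figures that --dry-run must print."""
    tokens = estimate_tokens(texts)
    return (f"chunks: {len(texts)}\n"
            f"estimated tokens: {tokens}\n"
            f"estimated cost: ${estimate_cost_usd(tokens):.4f}")
```

At $0.02 per million tokens, the $0.50 threshold corresponds to 25 million estimated tokens, so the prompt should only appear for unexpectedly large builds.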
Acceptance criteria
- python _tools/build_embeddings.py --dry-run prints chunk counts for every source type matching expected approximate counts (±10%)
- Full rebuild produces embeddings.db with row count equal to dry-run total
- --incremental after a save_chapter() call re-embeds only that chapter's chunks
- Missing OPENAI_API_KEY fails immediately with a clear message
- _tools/embedding_manifest.json updated on every successful batch
- Windows compatibility (python not python3, no Unix-only path assumptions)
Out of scope
- Merging embeddings.db into scripture.db (that's ai-partner: add sqlite-vec to scripture.db build #1448)