ai-partner: build embeddings pipeline script #1447

@CraigBuckmaster

Parent epic: #1446 (Amicus — AI Study Partner v1)
Phase: 1 · Size: M

Build the one-time and incremental pipeline that chunks all Companion Study content, generates vector embeddings via OpenAI's text-embedding-3-small API, and writes them to embeddings.db ready to be merged into scripture.db (see #1448).


Files to create

  • _tools/build_embeddings.py — main orchestrator script (new)
  • _tools/build_embeddings_chunks.py — chunker module, one function per source type (new)
  • _tools/embedding_manifest.json — generated file tracking chunk hashes for incremental rebuilds (new, gitignored)

Files to modify

  • _tools/content_writer.py: save_chapter() appends the affected chapter_id to a dirty_chapters array in _tools/embedding_manifest.json so the next incremental build picks it up
  • .gitignore — add _tools/embedding_manifest.json, embeddings.db
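
The save_chapter() change above could be sketched as follows. The helper name mark_chapter_dirty, the example chapter_id values, and the write-then-rename step are assumptions for illustration; only the dirty_chapters key comes from this issue.

```python
import json
from pathlib import Path


def mark_chapter_dirty(manifest_path: Path, chapter_id: str) -> None:
    """Add chapter_id to the manifest's dirty_chapters array (hypothetical helper)."""
    manifest = {}
    if manifest_path.exists():
        manifest = json.loads(manifest_path.read_text(encoding="utf-8"))

    dirty = manifest.setdefault("dirty_chapters", [])
    if chapter_id not in dirty:  # keep the array free of duplicates
        dirty.append(chapter_id)

    # Write to a temp file first so a crash cannot leave a half-written manifest.
    tmp_path = manifest_path.with_name(manifest_path.name + ".tmp")
    tmp_path.write_text(json.dumps(manifest, indent=2), encoding="utf-8")
    tmp_path.replace(manifest_path)
```

Calling it twice with the same chapter_id leaves a single entry, so repeated saves of one chapter do not grow the manifest.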

Conventions to follow

  • Match the orchestrator + loaders pattern of _tools/build_sqlite.py (imports from build_sqlite_schema.py and build_sqlite_loaders.py). build_embeddings.py is the orchestrator; build_embeddings_chunks.py holds one chunker function per source type.
  • File header: include the UTF-8 stdout preamble (sys.stdout.reconfigure(encoding='utf-8')), matching build_sqlite.py lines 15–17.
  • Path resolution: ROOT = Path(__file__).resolve().parent.parent — match build_sqlite.py line 19.
  • Output: [OK] print markers on successful steps — match existing _tools/ convention.
  • Windows compatibility: python, not python3. Never assume Unix path separators.
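
As a rough illustration of these conventions, a script header might look like the sketch below. The project_root and ok helpers are hypothetical names introduced here; the actual code in build_sqlite.py may differ, and the hasattr guard is an added precaution rather than part of the stated convention.

```python
import sys
from pathlib import Path

# UTF-8 stdout preamble. Guarded because some captured streams lack reconfigure().
if hasattr(sys.stdout, "reconfigure"):
    sys.stdout.reconfigure(encoding="utf-8")


def project_root(script_path: str) -> Path:
    """Repo root, two levels above a script that lives in _tools/ (hypothetical helper)."""
    return Path(script_path).resolve().parent.parent


def ok(message: str) -> None:
    """Print a success marker in the [OK] style (exact wording is an assumption)."""
    print(f"[OK] {message}")


# In the real script this would be: ROOT = Path(__file__).resolve().parent.parent
# Paths are then joined with the / operator, never with hardcoded separators:
#   manifest_path = ROOT / "_tools" / "embedding_manifest.json"
```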

Chunker strategy

Chunk sources and approximate counts. Each chunker function reads content JSON files and yields (chunk_id, source_type, source_id, metadata_dict, text) tuples.

| Source type           | One chunk per                   | Approx. count |
|-----------------------|---------------------------------|---------------|
| section_panel         | (section, panel_type)           | ~16,000       |
| chapter_panel         | (chapter, panel_type)           | ~7,500        |
| word_study            | entry                           | 46            |
| lexicon_entry         | entry (both Greek + Hebrew)     | 13,655        |
| debate_topic          | topic                           | 308           |
| cross_ref_thread_note | thread note                     | ~200          |
| journey_stop          | stop, including connective text | ~existing     |
| meta_faq              | article (see #1449)             | ~50 initial   |

Chunk IDs are deterministic and human-readable, in the format {source_type}:{source_id} (e.g., section_panel:genesis-1-s1-sarna, lexicon_entry:heb-H7225). Never use UUIDs or hashes as IDs: stable, readable IDs make it possible to diff one build against another.
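
A minimal sketch of one chunker function and the ID format, assuming each word-study entry is a dict with "id" and "text" keys. Those field names, and the function names make_chunk_id and chunk_word_studies, are hypothetical; the real content JSON may be shaped differently.

```python
from typing import Iterable, Iterator

# (chunk_id, source_type, source_id, metadata_dict, text)
Chunk = tuple[str, str, str, dict, str]


def make_chunk_id(source_type: str, source_id: str) -> str:
    """Build the deterministic ID: no UUIDs, no hashing, no randomness."""
    return f"{source_type}:{source_id}"


def chunk_word_studies(entries: Iterable[dict]) -> Iterator[Chunk]:
    """Yield one chunk per word-study entry (assumed keys: "id", "text")."""
    # Sort by ID so two builds over the same content emit the same sequence.
    for entry in sorted(entries, key=lambda e: e["id"]):
        metadata = {
            "scholar_id": None,
            "tradition": None,
            "book_id": None,
            "chapter_num": None,
            "verse_start": None,
            "verse_end": None,
            "panel_type": None,
        }
        yield (
            make_chunk_id("word_study", entry["id"]),
            "word_study",
            entry["id"],
            metadata,
            entry["text"],
        )
```

Because the ID depends only on source_type and source_id, reordering the input files or rerunning the build cannot change it.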

Metadata captured per chunk (stored alongside embedding for retrieval filtering):

{
  "scholar_id": str | None,
  "tradition": str | None,
  "book_id": str | None,
  "chapter_num": int | None,
  "verse_start": int | None,
  "verse_end": int | None,
  "panel_type": str | None
}
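
One way to keep every chunk's metadata in this exact shape is a small builder that fills missing keys with None and rejects unknown ones. This is a sketch only; make_metadata is a hypothetical helper, and rejecting unknown keys is a design choice not stated in this issue.

```python
METADATA_KEYS = (
    "scholar_id",
    "tradition",
    "book_id",
    "chapter_num",
    "verse_start",
    "verse_end",
    "panel_type",
)


def make_metadata(**fields) -> dict:
    """Return a dict with all seven metadata keys, defaulting to None."""
    unknown = set(fields) - set(METADATA_KEYS)
    if unknown:
        # Fail loudly on typos such as "chapter" instead of "chapter_num".
        raise KeyError(f"unknown metadata keys: {sorted(unknown)}")
    return {key: fields.get(key) for key in METADATA_KEYS}
```

A uniform key set means retrieval filters can rely on every key being present, with None marking "not applicable".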

Embedding API

  • Model: text-embedding-3-small (1536 dimensions, $0.02/1M tokens)
  • Batch size: 100 chunks per API call (OpenAI limit is higher but 100 keeps retry cost bounded)
  • Retry policy: retry with exponential backoff, up to 3 times, on HTTP 429 and 5xx responses
  • Checkpointing: after each successful batch, append results to embeddings.db and update embedding_manifest.json with latest chunk hashes. On crash/Ctrl-C, resume from last checkpoint.
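
The batching and retry rules above can be sketched without tying the code to a particular client. Here embed_batch is any callable that takes a list of texts and returns a list of vectors, and RetryableError is a hypothetical stand-in for an HTTP 429 or 5xx failure; the 1-second base delay and the reading of "3×" as three retries after the first attempt are assumptions.

```python
import time
from typing import Callable, Iterator, Sequence


class RetryableError(Exception):
    """Stand-in for an HTTP 429 or 5xx failure (map the real client's errors onto it)."""


def batched(items: Sequence, size: int = 100) -> Iterator[Sequence]:
    """Split items into consecutive batches of at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def embed_with_retry(
    embed_batch: Callable[[Sequence[str]], list],
    texts: Sequence[str],
    retries: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> list:
    """Call embed_batch, retrying with exponential backoff on RetryableError."""
    for attempt in range(retries + 1):
        try:
            return embed_batch(texts)
        except RetryableError:
            if attempt == retries:
                raise  # retries exhausted; let the caller stop at the last checkpoint
            sleep(base_delay * 2 ** attempt)  # waits 1s, 2s, 4s with the defaults
```

Passing sleep as a parameter keeps the function testable without real delays. A caller would checkpoint after each batch that returns successfully.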

API key source: OPENAI_API_KEY environment variable. Script must fail early with a clear message if the variable is missing. Never read from a file or hardcode.
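
A minimal sketch of the fail-early check. The function name require_api_key and the wording of the message are assumptions; the optional environ parameter exists only to make the check testable.

```python
import os


def require_api_key(environ=None) -> str:
    """Return OPENAI_API_KEY, or exit with a clear message if it is missing."""
    env = os.environ if environ is None else environ
    key = env.get("OPENAI_API_KEY", "").strip()
    if not key:
        # SystemExit with a string prints it to stderr and exits with status 1.
        raise SystemExit(
            "[ERROR] OPENAI_API_KEY is not set. "
            "Set the environment variable before running build_embeddings.py."
        )
    return key
```

Calling this before any chunking starts means a missing key costs seconds rather than a full chunking pass.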

Output database (embeddings.db)

build_embeddings.py writes to a standalone embeddings.db SQLite file (NOT directly into scripture.db). The build_sqlite.py orchestrator (updated in #1448) merges embeddings.db into scripture.db during its build step.

CREATE TABLE embedding_chunks (
  chunk_id       TEXT PRIMARY KEY,
  source_type    TEXT NOT NULL,
  source_id      TEXT NOT NULL,
  text           TEXT NOT NULL,
  metadata_json  TEXT NOT NULL,
  content_hash   TEXT NOT NULL,       -- SHA256 of text + metadata; used for incremental dedup
  embedding      BLOB NOT NULL        -- 1536 floats, packed as little-endian float32
);

CREATE INDEX idx_source ON embedding_chunks(source_type, source_id);
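
To make the column comments concrete, here is a sketch of packing a vector as little-endian float32, computing content_hash, and writing one row. How text and metadata are combined before hashing is an assumption (text, a newline, then metadata as sorted-key JSON), as are the function names.

```python
import hashlib
import json
import sqlite3
import struct

DIMENSIONS = 1536


def pack_embedding(vector: list) -> bytes:
    """Pack 1536 floats as little-endian float32 (6,144 bytes)."""
    if len(vector) != DIMENSIONS:
        raise ValueError(f"expected {DIMENSIONS} dimensions, got {len(vector)}")
    return struct.pack(f"<{DIMENSIONS}f", *vector)


def unpack_embedding(blob: bytes) -> list:
    """Inverse of pack_embedding."""
    return list(struct.unpack(f"<{DIMENSIONS}f", blob))


def content_hash(text: str, metadata: dict) -> str:
    """SHA256 over text + metadata; sort_keys keeps the hash stable across runs."""
    payload = text + "\n" + json.dumps(metadata, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def upsert_chunk(conn: sqlite3.Connection, chunk_id: str, source_type: str,
                 source_id: str, text: str, metadata: dict, vector: list) -> None:
    """Insert a chunk, replacing any existing row with the same chunk_id."""
    conn.execute(
        "INSERT OR REPLACE INTO embedding_chunks "
        "(chunk_id, source_type, source_id, text, metadata_json, content_hash, embedding) "
        "VALUES (?, ?, ?, ?, ?, ?, ?)",
        (chunk_id, source_type, source_id, text,
         json.dumps(metadata, sort_keys=True),
         content_hash(text, metadata),
         pack_embedding(vector)),
    )
```

INSERT OR REPLACE keyed on chunk_id means a re-run after a crash overwrites rows instead of duplicating them.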

CLI interface

python _tools/build_embeddings.py              # full rebuild (re-embeds everything)
python _tools/build_embeddings.py --incremental  # only re-embeds chunks whose content_hash changed
python _tools/build_embeddings.py --dry-run     # prints chunk count and estimated cost, no API calls
python _tools/build_embeddings.py --source section_panel  # restrict to one source type (for testing)
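
The flags above map directly onto argparse. This is a sketch, not the required implementation; the help strings and the use of choices to validate --source are assumptions, while the flag names and source types come from this issue.

```python
import argparse

SOURCE_TYPES = [
    "section_panel",
    "chapter_panel",
    "word_study",
    "lexicon_entry",
    "debate_topic",
    "cross_ref_thread_note",
    "journey_stop",
    "meta_faq",
]


def build_parser() -> argparse.ArgumentParser:
    """Build the CLI parser for build_embeddings.py."""
    parser = argparse.ArgumentParser(
        prog="build_embeddings.py",
        description="Chunk Companion Study content and write embeddings.db.",
    )
    parser.add_argument("--incremental", action="store_true",
                        help="only re-embed chunks whose content_hash changed")
    parser.add_argument("--dry-run", action="store_true",
                        help="print chunk count and estimated cost, make no API calls")
    parser.add_argument("--source", choices=SOURCE_TYPES, default=None,
                        help="restrict the build to one source type (for testing)")
    return parser
```

With no flags the parsed namespace has incremental=False, dry_run=False, and source=None, which corresponds to the full rebuild.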

Cost guardrails

  • --dry-run must print: total chunks, total estimated tokens, estimated cost in USD
  • Full rebuild without --dry-run prompts Continue? [y/N] if estimated cost > $0.50
  • --incremental does not prompt (expected to be cheap)
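
At $0.02 per 1M tokens, the $0.50 threshold corresponds to 25 million tokens. The guardrail logic might be sketched as below; the 4-characters-per-token estimate is a rough assumption for illustration (a real tokenizer would be more accurate), and the function names are hypothetical.

```python
PRICE_PER_MILLION_TOKENS_USD = 0.02  # text-embedding-3-small, per this issue
PROMPT_THRESHOLD_USD = 0.50
CHARS_PER_TOKEN = 4  # ASSUMPTION: crude heuristic, not a real tokenizer


def estimate_tokens(texts) -> int:
    """Rough token estimate: at least one token per chunk."""
    return sum(max(1, len(text) // CHARS_PER_TOKEN) for text in texts)


def estimate_cost_usd(total_tokens: int) -> float:
    """Estimated embedding cost in USD for the given token count."""
    return total_tokens / 1_000_000 * PRICE_PER_MILLION_TOKENS_USD


def needs_confirmation(cost_usd: float, incremental: bool) -> bool:
    """Prompt only for full rebuilds whose estimate exceeds the threshold."""
    return not incremental and cost_usd > PROMPT_THRESHOLD_USD
```

A --dry-run would print the chunk count, estimate_tokens(...), and estimate_cost_usd(...), then exit before any API call.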

Acceptance criteria

  • python _tools/build_embeddings.py --dry-run prints chunk counts for every source type matching expected approximate counts (±10%)
  • Full run generates embeddings.db with row count equal to dry-run total
  • Every embedding is exactly 1536 dimensions; no nulls; content_hash populated
  • Chunk IDs are deterministic (two full rebuilds produce identical chunk_id set)
  • --incremental after a save_chapter() call re-embeds only that chapter's chunks
  • Resume-from-checkpoint works (after Ctrl-C mid-batch, a re-run completes without repeating API calls for batches already checkpointed)
  • Cost prompt appears correctly for full rebuilds > $0.50
  • Missing OPENAI_API_KEY fails immediately with a clear message
  • _tools/embedding_manifest.json updated on every successful batch
  • Works on Windows (uses python not python3, no Unix-only path assumptions)
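
The incremental and resume criteria can both be checked by comparing content hashes. The sketch below assumes two dicts mapping chunk_id to content_hash: one from the current chunker output and one from what is already stored. The function names are hypothetical, and removing stale rows is an assumption this issue does not specify.

```python
def chunks_to_embed(current: dict, stored: dict) -> list:
    """Chunk IDs that are new or whose content_hash differs from the stored one."""
    return sorted(cid for cid, digest in current.items() if stored.get(cid) != digest)


def chunks_to_delete(current: dict, stored: dict) -> list:
    """Chunk IDs that are stored but no longer produced by any chunker (assumed cleanup)."""
    return sorted(set(stored) - set(current))
```

For example, with current = {"a": "1", "b": "2", "c": "3"} and stored = {"a": "1", "b": "x", "d": "4"}, only "b" (changed) and "c" (new) need an API call, and "d" is stale. After an interrupted run, already-checkpointed chunks appear in stored with matching hashes and are skipped.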

Out of scope
