Skip to content

ai-partner: add sqlite-vec to scripture.db build #1448

@CraigBuckmaster

Description

@CraigBuckmaster

Parent epic: #1446 (Amicus — AI Study Partner v1)
Phase: 1 · Size: S · Depends on: #1447

Integrate the sqlite-vec extension into the scripture.db build pipeline so vector search runs on-device with no extra setup for the user. Merge embeddings.db (output of #1447) into scripture.db as a vec0 virtual table.


Files to modify

  • _tools/build_sqlite.py — add new phase after schema creation that loads sqlite-vec and populates the embeddings vec0 virtual table from embeddings.db
  • _tools/build_sqlite_schema.py — add the vec0 table DDL and the companion chunk_text + chunk_metadata tables
  • _tools/build_sqlite_loaders.py — add populate_embeddings(conn) loader function matching the existing loader signature pattern
  • _tools/validate_sqlite.py — add integrity checks (see acceptance)
  • app/package.json — add native deps needed for sqlite-vec (see "Client-side loading")

Files to create

  • _tools/sqlite_vec_loader.py — small helper that wraps sqlite_vec.loadable_path() and handles cross-platform loading for the build machine

Schema additions (in build_sqlite_schema.py)

Add these to the SCHEMA string. The vec0 table is declared as a virtual table; companion tables hold text + metadata keyed by chunk_id.

-- Vector embeddings for Amicus retrieval
CREATE VIRTUAL TABLE embeddings USING vec0(
  embedding FLOAT[1536]
);

CREATE TABLE chunk_text (
  chunk_id     TEXT PRIMARY KEY,
  text         TEXT NOT NULL
);

CREATE TABLE chunk_metadata (
  chunk_id      TEXT PRIMARY KEY,
  source_type   TEXT NOT NULL,
  source_id     TEXT NOT NULL,
  scholar_id    TEXT,
  tradition     TEXT,
  book_id       TEXT,
  chapter_num   INTEGER,
  verse_start   INTEGER,
  verse_end     INTEGER,
  panel_type    TEXT
);

CREATE INDEX idx_chunk_metadata_source ON chunk_metadata(source_type, source_id);
CREATE INDEX idx_chunk_metadata_chapter ON chunk_metadata(book_id, chapter_num);

The vec0 virtual table has an implicit rowid that maps to chunk_id via the order of insertion. populate_embeddings() inserts into vec0 and chunk_text/chunk_metadata in matching order so rowid ↔ chunk_id is consistent.

Build-time loader (populate_embeddings)

Match the signature + style of existing loaders (populate_books, populate_verses, etc.). Pseudocode:

def populate_embeddings(conn: sqlite3.Connection) -> None:
    embeddings_db = ROOT / 'embeddings.db'
    if not embeddings_db.exists():
        print("  [WARN] embeddings.db not found — skipping embeddings table")
        print("         Run: python _tools/build_embeddings.py")
        return

    # Load sqlite-vec extension into the target conn
    conn.enable_load_extension(True)
    import sqlite_vec
    sqlite_vec.load(conn)
    conn.enable_load_extension(False)

    # Attach embeddings.db read-only; copy rows into scripture.db target tables
    src = sqlite3.connect(f"file:{embeddings_db}?mode=ro", uri=True)
    cur = conn.cursor()
    rows = src.execute("SELECT chunk_id, source_type, source_id, text, metadata_json, embedding FROM embedding_chunks").fetchall()

    for chunk_id, src_type, src_id, text, meta_json, emb_blob in rows:
        meta = json.loads(meta_json)
        cur.execute("INSERT INTO embeddings(embedding) VALUES (?)", (emb_blob,))
        cur.execute("INSERT INTO chunk_text VALUES (?, ?)", (chunk_id, text))
        cur.execute("""INSERT INTO chunk_metadata VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?)""",
            (chunk_id, src_type, src_id, meta.get('scholar_id'), meta.get('tradition'),
             meta.get('book_id'), meta.get('chapter_num'), meta.get('verse_start'),
             meta.get('verse_end'), meta.get('panel_type')))

    conn.commit()
    src.close()
    print(f"  [OK] Embeddings: {len(rows)} chunks populated")

Client-side loading (app/)

React Native needs a native module for sqlite-vec. Add to app/package.json:

"expo-sqlite": "(latest matching Expo SDK 54)",
"@op-engineering/op-sqlite": "(if needed for extension loading, eval at implementation time)"

Verify at implementation which sqlite-vec approach works with Expo SDK 54. Document the chosen path in a code comment in app/src/db/index.ts where the scripture.db connection is opened.

Validation (_tools/validate_sqlite.py)

Add a new section 6. EMBEDDINGS that verifies:

  • embeddings virtual table exists and has rows
  • Row counts match across embeddings, chunk_text, chunk_metadata
  • No orphan chunk_ids (every chunk_text row has matching chunk_metadata)
  • Sample vector search returns results (smoke test with a canned query vector)
  • Total DB size under 150MB (hard cap before we need further optimization)

Acceptance criteria

  • python _tools/build_sqlite.py completes successfully after build_embeddings.py has run
  • scripture.db contains populated embeddings, chunk_text, chunk_metadata tables
  • python _tools/validate_sqlite.py passes with new section 6 EMBEDDINGS green
  • Smoke test query via sqlite CLI: SELECT rowid, distance FROM embeddings WHERE embedding MATCH ? ORDER BY distance LIMIT 5 returns results
  • DB size under 150MB
  • Skip-with-warning behavior works when embeddings.db absent (doesn't break existing build)
  • Client-side sqlite-vec loading documented in app/src/db/index.ts
  • Works on Windows build machine

Out of scope

Metadata

Metadata

Assignees

No one assigned

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions