Add incremental vector-db and weekly CI workflow #4
Conversation
Pull request overview
This PR introduces incremental vector database indexing (to avoid re-embedding already-indexed declarations) and adds a scheduled GitHub Actions workflow to run the indexing pipeline weekly and persist the updated ChromaDB state via the Heroku git remote.
Changes:
- Update `database/vector_db.py` to reuse an existing ChromaDB collection and skip vectors whose IDs already exist.
- Add `.github/workflows/weekly-index.yml` to run the incremental pipeline weekly (and on manual dispatch), then push updated `chroma/` plus a PhysLib SHA tracker to Heroku and redeploy.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| `database/vector_db.py` | Reuses existing Chroma collection and skips embedding for pre-existing IDs; adds summary logging. |
| `.github/workflows/weekly-index.yml` | Adds weekly CI pipeline to detect PhysLib changes, run indexing, persist ChromaDB state to Heroku, and redeploy. |
```python
collection.add(embeddings=batch_embedding, ids=batch_id)
added += len(batch_id)

logger.warning("vector-db: added %d new vectors, %d already existed", added, len(existing_ids))
```
The final log line reports len(existing_ids) as "already existed", but that is the total pre-existing vector count, not the number actually skipped during this run. If you want accurate reporting, track a separate skipped counter (or compute skipped as total candidates - added) and log that instead.
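A minimal sketch of what the reviewer is asking for: count skips per run rather than reporting the collection's total pre-existing size. The `candidates` iterable and the minimal `add(embeddings=..., ids=...)` collection interface are assumptions for illustration, not the PR's actual code.

```python
import logging

logger = logging.getLogger("vector_db")

def add_new_vectors(collection, candidates, existing_ids):
    """Add vectors whose IDs are not yet indexed.

    Tracks a per-run `skipped` counter, so the final log line reports how
    many candidates were skipped *this run* rather than len(existing_ids),
    which is the total number of vectors that already existed beforehand.
    """
    added = skipped = 0
    for vec_id, embedding in candidates:
        if vec_id in existing_ids:
            skipped += 1  # skipped during this run, unlike len(existing_ids)
            continue
        collection.add(embeddings=[embedding], ids=[vec_id])
        added += 1
    logger.warning("vector-db: added %d new vectors, skipped %d already indexed",
                   added, skipped)
    return added, skipped
```

Equivalently, `skipped` could be computed once at the end as `total candidates - added`; an explicit counter just keeps the accounting obvious.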
```yaml
env:
  HEROKU_API_KEY: ${{ secrets.HEROKU_API_KEY }}
run: |
  git remote set-url heroku "https://heroku:$HEROKU_API_KEY@git.heroku.com/physlibsearch.git"
```
git remote set-url heroku ... will fail on a fresh Actions checkout because there is no heroku remote by default. Add the remote if it doesn't exist (e.g., git remote add heroku ... || git remote set-url heroku ...) so subsequent git fetch heroku main works reliably.
Suggested change:

```diff
-git remote set-url heroku "https://heroku:$HEROKU_API_KEY@git.heroku.com/physlibsearch.git"
+git remote add heroku "https://heroku:$HEROKU_API_KEY@git.heroku.com/physlibsearch.git" \
+  || git remote set-url heroku "https://heroku:$HEROKU_API_KEY@git.heroku.com/physlibsearch.git"
```
```yaml
GEMINI_API_KEY: ${{ secrets.GEMINI_API_KEY }}
GEMINI_MODEL: ${{ vars.GEMINI_MODEL || 'gemini-2.5-flash-preview-04-17' }}
GEMINI_FAST_MODEL: ${{ vars.GEMINI_FAST_MODEL || 'gemini-2.5-flash-preview-04-17' }}
GEMINI_EMBEDDING_MODEL: ${{ vars.GEMINI_EMBEDDING_MODEL || 'gemini-embedding-2-preview' }}
```
This workflow runs python3 -m database ..., which executes database/__main__.py. That module currently requires LOG_FILENAME, LOG_FILEMODE, and LOG_LEVEL to be present in the environment (it uses os.environ[...]), but they are not set here, so the job will crash with KeyError. Either set these env vars in the workflow (e.g., LOG_LEVEL: WARNING, etc.) or make database/__main__.py use os.environ.get(...) defaults.
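A sketch of the second option: have `database/__main__.py` read the logging variables with `os.environ.get(...)` defaults so an unset environment no longer raises `KeyError`. The specific default values here (`weekly-index.log`, `a`, `WARNING`) are assumptions taken from the reviewer's suggested workflow env, not the module's current behavior.

```python
import os

# Fall back to defaults instead of crashing with KeyError when the
# workflow (or a local shell) does not export these variables.
LOG_FILENAME = os.environ.get("LOG_FILENAME", "weekly-index.log")
LOG_FILEMODE = os.environ.get("LOG_FILEMODE", "a")
LOG_LEVEL = os.environ.get("LOG_LEVEL", "WARNING")
```

Either fix works on its own; doing both (workflow env vars plus `get` defaults) makes the module robust for local runs too.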
Suggested change:

```diff
 GEMINI_EMBEDDING_MODEL: ${{ vars.GEMINI_EMBEDDING_MODEL || 'gemini-embedding-2-preview' }}
+LOG_FILENAME: weekly-index.log
+LOG_FILEMODE: a
+LOG_LEVEL: WARNING
```
```shell
git fetch heroku main
git checkout heroku/main -- chroma/ .last-physlib-sha 2>/dev/null \
  || echo "No prior index on heroku/main — starting fresh."
```
git fetch heroku main will hard-fail the workflow if the Heroku remote/branch doesn't exist yet (e.g., first run or app not initialized), even though later steps try to handle a missing prior index. Consider making the fetch/checkout resilient (e.g., tolerate missing branch and proceed with an empty chroma/ and no .last-physlib-sha) so the workflow can bootstrap from scratch.
Suggested change:

```diff
-git fetch heroku main
-git checkout heroku/main -- chroma/ .last-physlib-sha 2>/dev/null \
-  || echo "No prior index on heroku/main — starting fresh."
+if git fetch heroku main; then
+  git checkout heroku/main -- chroma/ .last-physlib-sha 2>/dev/null || {
+    echo "No prior index files on heroku/main — starting fresh."
+    mkdir -p "$CHROMA_PATH"
+    rm -f .last-physlib-sha
+  }
+else
+  echo "No heroku/main branch available yet — starting fresh."
+  mkdir -p "$CHROMA_PATH"
+  rm -f .last-physlib-sha
+fi
```
```python
vec_id = " ".join(str(x) for x in name)
if vec_id in existing_ids:
    continue
batch_doc.append(f"{kind} {name} {signature}\n{informal_name}: {informal_description}")
```
Skipping any vec_id already present means you will never refresh embeddings for declarations whose content/signature/description changed across PhysLib commits, so the vector DB can become stale even though the workflow reruns on new commits. If updated embeddings are required, consider switching to upsert (always re-embed) or storing a content hash in Chroma metadata and only re-embedding when the hash differs.
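A sketch of the content-hash variant, assuming the hash is stored per vector (e.g. in Chroma metadata, which supports per-ID dicts) and that the embedded text is built the same way as in the diff above. The helper names are hypothetical:

```python
import hashlib

def content_hash(kind, name, signature, informal_name, informal_description):
    """Stable digest of everything that feeds the embedding text."""
    text = f"{kind} {name} {signature}\n{informal_name}: {informal_description}"
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def needs_embedding(vec_id, new_hash, stored_hashes):
    """Re-embed when the ID is new OR its stored content hash differs.

    `stored_hashes` maps vec_id -> previously stored hash (e.g. pulled
    from Chroma metadata); a plain ID-membership check cannot detect
    that a declaration's signature or description changed.
    """
    return stored_hashes.get(vec_id) != new_hash
```

Vectors flagged by `needs_embedding` can then be written with `collection.upsert(...)` (Chroma supports upsert) so changed declarations replace their stale embeddings instead of being skipped forever.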
```diff
 try:
-    client.delete_collection("physlibsearch")
-    logger.info("deleted existing physlibsearch collection")
+    collection = client.get_collection(name="physlibsearch", embedding_function=None)
+    existing_ids = set(collection.get(include=[])["ids"])
+    logger.warning("using existing physlibsearch collection (%d vectors)", len(existing_ids))
 except Exception:
-    pass
-collection = client.create_collection(
-    name="physlibsearch",
-    metadata={"hnsw:space": "cosine"},
-    embedding_function=None,
-)
+    collection = client.create_collection(
+        name="physlibsearch",
+        metadata={"hnsw:space": "cosine"},
+        embedding_function=None,
+    )
```
except Exception here is too broad: if get_collection(...) succeeds but collection.get(...) fails (e.g., transient I/O / corruption / API change), the code falls into the except block and then calls create_collection(...), which will typically fail because the collection already exists. Split this into separate error handling (only create the collection when the specific "collection not found" error is raised), and surface unexpected errors instead of masking them.
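A sketch of the narrow-exception pattern, using a stub client so it is self-contained. `CollectionNotFound` is a stand-in name: catch whatever your chromadb version actually raises for a missing collection, and let everything else propagate. (Chroma also offers `client.get_or_create_collection(...)`, which sidesteps the race entirely.)

```python
class CollectionNotFound(Exception):
    """Stand-in for the client library's 'collection does not exist' error."""

def open_collection(client, name):
    """Create the collection only when it is genuinely missing.

    Unexpected failures (transient I/O, corruption, API change) are NOT
    swallowed here, so they surface instead of triggering a doomed
    create_collection call on an already-existing collection.
    """
    try:
        return client.get_collection(name)
    except CollectionNotFound:
        return client.create_collection(name)
```

With the stub below, the first call creates the collection and the second reuses it; any other exception raised inside `get_collection` would escape to the caller.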
```python
existing_ids = set(collection.get(include=[])["ids"])
logger.warning("using existing physlibsearch collection (%d vectors)", len(existing_ids))
```
Building existing_ids via set(collection.get(include=[])["ids"]) loads every ID into memory up front; for large collections this can be slow and memory-intensive. Consider avoiding a full scan by checking existence in batch (e.g., per DB batch: call collection.get(ids=batch_ids, include=[]) and filter missing IDs) or by paging through IDs if you must precompute the set.
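A sketch of the per-batch existence check, relying on Chroma's `collection.get(ids=..., include=[])` returning only the requested IDs that actually exist. The helper name is hypothetical:

```python
def filter_missing(collection, batch_ids):
    """Return the subset of batch_ids not yet present in the collection.

    Queries only the current batch's IDs instead of preloading every ID
    in the collection into a set, which keeps memory bounded by the
    batch size rather than the collection size.
    """
    found = set(collection.get(ids=batch_ids, include=[])["ids"])
    return [i for i in batch_ids if i not in found]
```

Each DB batch then embeds only `filter_missing(collection, batch_ids)`, trading one small `get` round-trip per batch for the up-front full scan.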
Summary
- `vector_db.py` incremental: skips declarations already in ChromaDB instead of deleting and rebuilding the entire collection on every run
- `.github/workflows/weekly-index.yml`: a weekly cron job (Mondays 03:00 UTC) that detects new PhysLib commits, runs the incremental indexing pipeline, persists updated ChromaDB to the Heroku git remote, and redeploys the Docker image

How incrementality works
- `informalize.py` was already incremental (skips symbols with existing `informal` rows)
- `vector_db.py` now fetches existing ChromaDB IDs at startup and only embeds declarations not already present — saves Gemini API costs on weekly runs where most theorems haven't changed

Test plan
- Ran `python3 -m database vector-db --batch-size 8` locally — should log `using existing physlibsearch collection (N vectors)` and `added 0 new vectors`
- Set `DATABASE_URL`, `GEMINI_API_KEY`, `HEROKU_API_KEY` as repository secrets