Add incremental vector-db and weekly CI workflow #4
Conversation
Pull request overview
This PR introduces incremental vector database indexing (to avoid re-embedding already-indexed declarations) and adds a scheduled GitHub Actions workflow to run the indexing pipeline weekly and persist the updated ChromaDB state via the Heroku git remote.
Changes:
- Update `database/vector_db.py` to reuse an existing ChromaDB collection and skip vectors whose IDs already exist.
- Add `.github/workflows/weekly-index.yml` to run the incremental pipeline weekly (and on manual dispatch), then push updated `chroma/` plus a PhysLib SHA tracker to Heroku and redeploy.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| `database/vector_db.py` | Reuses existing Chroma collection and skips embedding for pre-existing IDs; adds summary logging. |
| `.github/workflows/weekly-index.yml` | Adds weekly CI pipeline to detect PhysLib changes, run indexing, persist ChromaDB state to Heroku, and redeploy. |
```python
collection.add(embeddings=batch_embedding, ids=batch_id)
added += len(batch_id)

logger.warning("vector-db: added %d new vectors, %d already existed", added, len(existing_ids))
```
The final log line reports len(existing_ids) as "already existed", but that is the total pre-existing vector count, not the number actually skipped during this run. If you want accurate reporting, track a separate skipped counter (or compute skipped as total candidates - added) and log that instead.
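A minimal sketch of what the reviewer is asking for: count skips per run rather than reporting the collection's total pre-existing size. The `candidates` iterable and the minimal `add(embeddings=..., ids=...)` collection interface are assumptions for illustration, not the PR's actual code.

```python
import logging

logger = logging.getLogger("vector_db")

def add_new_vectors(collection, candidates, existing_ids):
    """Add vectors whose IDs are not yet indexed.

    Tracks a per-run `skipped` counter, so the final log line reports how
    many candidates were skipped *this run* rather than len(existing_ids),
    which is the total number of vectors that already existed beforehand.
    """
    added = skipped = 0
    for vec_id, embedding in candidates:
        if vec_id in existing_ids:
            skipped += 1  # skipped during this run, unlike len(existing_ids)
            continue
        collection.add(embeddings=[embedding], ids=[vec_id])
        added += 1
    logger.warning("vector-db: added %d new vectors, skipped %d already indexed",
                   added, skipped)
    return added, skipped
```

Equivalently, `skipped` could be computed once at the end as `total candidates - added`; an explicit counter just keeps the accounting obvious.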
```yaml
env:
  HEROKU_API_KEY: ${{ secrets.HEROKU_API_KEY }}
run: |
  git remote set-url heroku "https://heroku:$HEROKU_API_KEY@git.heroku.com/physlibsearch.git"
```
git remote set-url heroku ... will fail on a fresh Actions checkout because there is no heroku remote by default. Add the remote if it doesn't exist (e.g., git remote add heroku ... || git remote set-url heroku ...) so subsequent git fetch heroku main works reliably.
Suggested change:

```diff
-git remote set-url heroku "https://heroku:$HEROKU_API_KEY@git.heroku.com/physlibsearch.git"
+git remote add heroku "https://heroku:$HEROKU_API_KEY@git.heroku.com/physlibsearch.git" \
+  || git remote set-url heroku "https://heroku:$HEROKU_API_KEY@git.heroku.com/physlibsearch.git"
```
```yaml
GEMINI_API_KEY: ${{ secrets.GEMINI_API_KEY }}
GEMINI_MODEL: ${{ vars.GEMINI_MODEL || 'gemini-2.5-flash-preview-04-17' }}
GEMINI_FAST_MODEL: ${{ vars.GEMINI_FAST_MODEL || 'gemini-2.5-flash-preview-04-17' }}
GEMINI_EMBEDDING_MODEL: ${{ vars.GEMINI_EMBEDDING_MODEL || 'gemini-embedding-2-preview' }}
```
This workflow runs python3 -m database ..., which executes database/__main__.py. That module currently requires LOG_FILENAME, LOG_FILEMODE, and LOG_LEVEL to be present in the environment (it uses os.environ[...]), but they are not set here, so the job will crash with KeyError. Either set these env vars in the workflow (e.g., LOG_LEVEL: WARNING, etc.) or make database/__main__.py use os.environ.get(...) defaults.
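A sketch of the second option: have `database/__main__.py` read the logging variables with `os.environ.get(...)` defaults so an unset environment no longer raises `KeyError`. The specific default values here (`weekly-index.log`, `a`, `WARNING`) are assumptions taken from the reviewer's suggested workflow env, not the module's current behavior.

```python
import os

# Fall back to defaults instead of crashing with KeyError when the
# workflow (or a local shell) does not export these variables.
LOG_FILENAME = os.environ.get("LOG_FILENAME", "weekly-index.log")
LOG_FILEMODE = os.environ.get("LOG_FILEMODE", "a")
LOG_LEVEL = os.environ.get("LOG_LEVEL", "WARNING")
```

Either fix works on its own; doing both (workflow env vars plus `get` defaults) makes the module robust for local runs too.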
Suggested change:

```diff
 GEMINI_EMBEDDING_MODEL: ${{ vars.GEMINI_EMBEDDING_MODEL || 'gemini-embedding-2-preview' }}
+LOG_FILENAME: weekly-index.log
+LOG_FILEMODE: a
+LOG_LEVEL: WARNING
```
```shell
git fetch heroku main
git checkout heroku/main -- chroma/ .last-physlib-sha 2>/dev/null \
  || echo "No prior index on heroku/main — starting fresh."
```
git fetch heroku main will hard-fail the workflow if the Heroku remote/branch doesn't exist yet (e.g., first run or app not initialized), even though later steps try to handle a missing prior index. Consider making the fetch/checkout resilient (e.g., tolerate missing branch and proceed with an empty chroma/ and no .last-physlib-sha) so the workflow can bootstrap from scratch.
Suggested change:

```diff
-git fetch heroku main
-git checkout heroku/main -- chroma/ .last-physlib-sha 2>/dev/null \
-  || echo "No prior index on heroku/main — starting fresh."
+if git fetch heroku main; then
+  git checkout heroku/main -- chroma/ .last-physlib-sha 2>/dev/null || {
+    echo "No prior index files on heroku/main — starting fresh."
+    mkdir -p "$CHROMA_PATH"
+    rm -f .last-physlib-sha
+  }
+else
+  echo "No heroku/main branch available yet — starting fresh."
+  mkdir -p "$CHROMA_PATH"
+  rm -f .last-physlib-sha
+fi
```
```python
vec_id = " ".join(str(x) for x in name)
if vec_id in existing_ids:
    continue
batch_doc.append(f"{kind} {name} {signature}\n{informal_name}: {informal_description}")
```
Skipping any vec_id already present means you will never refresh embeddings for declarations whose content/signature/description changed across PhysLib commits, so the vector DB can become stale even though the workflow reruns on new commits. If updated embeddings are required, consider switching to upsert (always re-embed) or storing a content hash in Chroma metadata and only re-embedding when the hash differs.
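A sketch of the content-hash variant, assuming the hash is stored per vector (e.g. in Chroma metadata, which supports per-ID dicts) and that the embedded text is built the same way as in the diff above. The helper names are hypothetical:

```python
import hashlib

def content_hash(kind, name, signature, informal_name, informal_description):
    """Stable digest of everything that feeds the embedding text."""
    text = f"{kind} {name} {signature}\n{informal_name}: {informal_description}"
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def needs_embedding(vec_id, new_hash, stored_hashes):
    """Re-embed when the ID is new OR its stored content hash differs.

    `stored_hashes` maps vec_id -> previously stored hash (e.g. pulled
    from Chroma metadata); a plain ID-membership check cannot detect
    that a declaration's signature or description changed.
    """
    return stored_hashes.get(vec_id) != new_hash
```

Vectors flagged by `needs_embedding` can then be written with `collection.upsert(...)` (Chroma supports upsert) so changed declarations replace their stale embeddings instead of being skipped forever.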
```diff
 try:
-    client.delete_collection("physlibsearch")
-    logger.info("deleted existing physlibsearch collection")
+    collection = client.get_collection(name="physlibsearch", embedding_function=None)
+    existing_ids = set(collection.get(include=[])["ids"])
+    logger.warning("using existing physlibsearch collection (%d vectors)", len(existing_ids))
 except Exception:
-    pass
-collection = client.create_collection(
-    name="physlibsearch",
-    metadata={"hnsw:space": "cosine"},
-    embedding_function=None,
-)
+    collection = client.create_collection(
+        name="physlibsearch",
+        metadata={"hnsw:space": "cosine"},
+        embedding_function=None,
+    )
```
except Exception here is too broad: if get_collection(...) succeeds but collection.get(...) fails (e.g., transient I/O / corruption / API change), the code falls into the except block and then calls create_collection(...), which will typically fail because the collection already exists. Split this into separate error handling (only create the collection when the specific "collection not found" error is raised), and surface unexpected errors instead of masking them.
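A sketch of the narrow-exception pattern, using a stub client so it is self-contained. `CollectionNotFound` is a stand-in name: catch whatever your chromadb version actually raises for a missing collection, and let everything else propagate. (Chroma also offers `client.get_or_create_collection(...)`, which sidesteps the race entirely.)

```python
class CollectionNotFound(Exception):
    """Stand-in for the client library's 'collection does not exist' error."""

def open_collection(client, name):
    """Create the collection only when it is genuinely missing.

    Unexpected failures (transient I/O, corruption, API change) are NOT
    swallowed here, so they surface instead of triggering a doomed
    create_collection call on an already-existing collection.
    """
    try:
        return client.get_collection(name)
    except CollectionNotFound:
        return client.create_collection(name)
```

With the stub below, the first call creates the collection and the second reuses it; any other exception raised inside `get_collection` would escape to the caller.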
```python
existing_ids = set(collection.get(include=[])["ids"])
logger.warning("using existing physlibsearch collection (%d vectors)", len(existing_ids))
```
Building existing_ids via set(collection.get(include=[])["ids"]) loads every ID into memory up front; for large collections this can be slow and memory-intensive. Consider avoiding a full scan by checking existence in batch (e.g., per DB batch: call collection.get(ids=batch_ids, include=[]) and filter missing IDs) or by paging through IDs if you must precompute the set.
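A sketch of the per-batch existence check, relying on Chroma's `collection.get(ids=..., include=[])` returning only the requested IDs that actually exist. The helper name is hypothetical:

```python
def filter_missing(collection, batch_ids):
    """Return the subset of batch_ids not yet present in the collection.

    Queries only the current batch's IDs instead of preloading every ID
    in the collection into a set, which keeps memory bounded by the
    batch size rather than the collection size.
    """
    found = set(collection.get(ids=batch_ids, include=[])["ids"])
    return [i for i in batch_ids if i not in found]
```

Each DB batch then embeds only `filter_missing(collection, batch_ids)`, trading one small `get` round-trip per batch for the up-front full scan.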
Summary
- `vector_db.py` incremental: skips declarations already in ChromaDB instead of deleting and rebuilding the entire collection on every run
- `.github/workflows/weekly-index.yml`: a weekly cron job (Mondays 03:00 UTC) that detects new PhysLib commits, runs the incremental indexing pipeline, persists updated ChromaDB to the Heroku git remote, and redeploys the Docker image

How incrementality works
- `informalize.py` was already incremental (skips symbols with existing `informal` rows)
- `vector_db.py` now fetches existing ChromaDB IDs at startup and only embeds declarations not already present — saves Gemini API costs on weekly runs where most theorems haven't changed

Test plan
- Ran `python3 -m database vector-db --batch-size 8` locally — should log `using existing physlibsearch collection (N vectors)` and `added 0 new vectors`
- Set `DATABASE_URL`, `GEMINI_API_KEY`, `HEROKU_API_KEY` as repository secrets