Add incremental vector-db and weekly CI workflow#4

Merged
Gabrielebattimelli merged 2 commits into main from add-agent-skill
Apr 22, 2026

Conversation

@Gabrielebattimelli
Member

Summary

  • Makes vector_db.py incremental: skips declarations already in ChromaDB instead of deleting and rebuilding the entire collection on every run
  • Adds .github/workflows/weekly-index.yml: a weekly cron job (Mondays 03:00 UTC) that detects new PhysLib commits, runs the incremental indexing pipeline, persists updated ChromaDB to the Heroku git remote, and redeploys the Docker image
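The schedule described above corresponds to a trigger section along these lines (a sketch, not the file's verbatim contents; the cron expression encodes Mondays at 03:00 UTC):

```yaml
on:
  schedule:
    - cron: "0 3 * * 1"   # minute 0, hour 3, any day-of-month, any month, Monday
  workflow_dispatch: {}   # also allow manual runs from the Actions tab
```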

How incrementality works

  • informalize.py was already incremental (skips symbols with existing informal rows)
  • vector_db.py now fetches existing ChromaDB IDs at startup and only embeds declarations not already present — saves Gemini API costs on weekly runs where most theorems haven't changed
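The skip logic above can be sketched as follows; this is a minimal illustration in which a plain dict stands in for the ChromaDB collection and a lambda for the Gemini embedding call (all names here are illustrative, not the PR's actual code):

```python
def index_incrementally(declarations, collection, embed):
    """Embed only declarations whose IDs are not already stored."""
    existing_ids = set(collection)            # IDs already in the store
    added = 0
    for decl_id, text in declarations:
        if decl_id in existing_ids:
            continue                          # already embedded: no API call
        collection[decl_id] = embed(text)     # embed and store the new vector
        added += 1
    return added, len(declarations) - added   # (added, skipped this run)

store = {"Thm.old": [0.0]}                    # pretend one vector already exists
added, skipped = index_incrementally(
    [("Thm.old", "x"), ("Thm.new", "yz")], store, embed=lambda t: [float(len(t))]
)
# added == 1, skipped == 1, and store now also holds "Thm.new"
```

On a weekly run where most theorems are unchanged, the loop falls through to `continue` almost every time, which is where the embedding-API savings come from.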

Test plan

  • Run python3 -m database vector-db --batch-size 8 locally — it should log "using existing physlibsearch collection (N vectors)" and "added 0 new vectors"
  • Add DATABASE_URL, GEMINI_API_KEY, HEROKU_API_KEY as repository secrets
  • Trigger the workflow manually via Actions → Weekly Index → Run workflow and confirm it completes

Copilot AI review requested due to automatic review settings April 22, 2026 19:07
@Gabrielebattimelli Gabrielebattimelli merged commit ab11047 into main Apr 22, 2026
3 checks passed

Copilot AI left a comment


Pull request overview

This PR introduces incremental vector database indexing (to avoid re-embedding already-indexed declarations) and adds a scheduled GitHub Actions workflow to run the indexing pipeline weekly and persist the updated ChromaDB state via the Heroku git remote.

Changes:

  • Update database/vector_db.py to reuse an existing ChromaDB collection and skip vectors whose IDs already exist.
  • Add .github/workflows/weekly-index.yml to run the incremental pipeline weekly (and on manual dispatch), then push updated chroma/ + a PhysLib SHA tracker to Heroku and redeploy.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 7 comments.

File: database/vector_db.py — Reuses the existing Chroma collection and skips embedding for pre-existing IDs; adds summary logging.
File: .github/workflows/weekly-index.yml — Adds the weekly CI pipeline to detect PhysLib changes, run indexing, persist ChromaDB state to Heroku, and redeploy.


Comment thread database/vector_db.py
collection.add(embeddings=batch_embedding, ids=batch_id)
added += len(batch_id)

logger.warning("vector-db: added %d new vectors, %d already existed", added, len(existing_ids))

Copilot AI Apr 22, 2026


The final log line reports len(existing_ids) as "already existed", but that is the total pre-existing vector count, not the number actually skipped during this run. If you want accurate reporting, track a separate skipped counter (or compute skipped as total candidates - added) and log that instead.
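One way to implement the suggested per-run counter is to split each batch of candidate IDs up front (a sketch; the helper name is illustrative):

```python
def partition_ids(candidate_ids, existing_ids):
    """Split this run's candidates into IDs still to embed and a count
    of IDs skipped because they are already indexed."""
    to_add = [i for i in candidate_ids if i not in existing_ids]
    skipped = len(candidate_ids) - len(to_add)
    return to_add, skipped

to_add, skipped = partition_ids(["a", "b", "c"], existing_ids={"b"})
# to_add == ["a", "c"], skipped == 1
```

Logging the accumulated skipped count then reports what this run actually skipped rather than the collection's total pre-existing size.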

env:
  HEROKU_API_KEY: ${{ secrets.HEROKU_API_KEY }}
run: |
  git remote set-url heroku "https://heroku:$HEROKU_API_KEY@git.heroku.com/physlibsearch.git"

Copilot AI Apr 22, 2026


git remote set-url heroku ... will fail on a fresh Actions checkout because there is no heroku remote by default. Add the remote if it doesn't exist (e.g., git remote add heroku ... || git remote set-url heroku ...) so subsequent git fetch heroku main works reliably.

Suggested change
git remote set-url heroku "https://heroku:$HEROKU_API_KEY@git.heroku.com/physlibsearch.git"
git remote add heroku "https://heroku:$HEROKU_API_KEY@git.heroku.com/physlibsearch.git" \
|| git remote set-url heroku "https://heroku:$HEROKU_API_KEY@git.heroku.com/physlibsearch.git"

GEMINI_API_KEY: ${{ secrets.GEMINI_API_KEY }}
GEMINI_MODEL: ${{ vars.GEMINI_MODEL || 'gemini-2.5-flash-preview-04-17' }}
GEMINI_FAST_MODEL: ${{ vars.GEMINI_FAST_MODEL || 'gemini-2.5-flash-preview-04-17' }}
GEMINI_EMBEDDING_MODEL: ${{ vars.GEMINI_EMBEDDING_MODEL || 'gemini-embedding-2-preview' }}

Copilot AI Apr 22, 2026


This workflow runs python3 -m database ..., which executes database/__main__.py. That module currently requires LOG_FILENAME, LOG_FILEMODE, and LOG_LEVEL to be present in the environment (it uses os.environ[...]), but they are not set here, so the job will crash with KeyError. Either set these env vars in the workflow (e.g., LOG_LEVEL: WARNING, etc.) or make database/__main__.py use os.environ.get(...) defaults.

Suggested change
GEMINI_EMBEDDING_MODEL: ${{ vars.GEMINI_EMBEDDING_MODEL || 'gemini-embedding-2-preview' }}
GEMINI_EMBEDDING_MODEL: ${{ vars.GEMINI_EMBEDDING_MODEL || 'gemini-embedding-2-preview' }}
LOG_FILENAME: weekly-index.log
LOG_FILEMODE: a
LOG_LEVEL: WARNING
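The other fix the comment mentions, reading the variables with defaults on the Python side, could look like this (a sketch; the default values are illustrative):

```python
import logging
import os

# Fall back to defaults instead of os.environ[...] lookups, which raise
# KeyError whenever one of the variables is unset.
logging.basicConfig(
    filename=os.environ.get("LOG_FILENAME", "database.log"),
    filemode=os.environ.get("LOG_FILEMODE", "a"),
    level=os.environ.get("LOG_LEVEL", "WARNING"),
)
```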

Comment on lines +40 to +42
git fetch heroku main
git checkout heroku/main -- chroma/ .last-physlib-sha 2>/dev/null \
|| echo "No prior index on heroku/main — starting fresh."

Copilot AI Apr 22, 2026


git fetch heroku main will hard-fail the workflow if the Heroku remote/branch doesn't exist yet (e.g., first run or app not initialized), even though later steps try to handle a missing prior index. Consider making the fetch/checkout resilient (e.g., tolerate missing branch and proceed with an empty chroma/ and no .last-physlib-sha) so the workflow can bootstrap from scratch.

Suggested change
git fetch heroku main
git checkout heroku/main -- chroma/ .last-physlib-sha 2>/dev/null \
|| echo "No prior index on heroku/main — starting fresh."
if git fetch heroku main; then
  git checkout heroku/main -- chroma/ .last-physlib-sha 2>/dev/null || {
    echo "No prior index files on heroku/main — starting fresh."
    mkdir -p "$CHROMA_PATH"
    rm -f .last-physlib-sha
  }
else
  echo "No heroku/main branch available yet — starting fresh."
  mkdir -p "$CHROMA_PATH"
  rm -f .last-physlib-sha
fi

Comment thread database/vector_db.py
Comment on lines +47 to +50
vec_id = " ".join(str(x) for x in name)
if vec_id in existing_ids:
continue
batch_doc.append(f"{kind} {name} {signature}\n{informal_name}: {informal_description}")

Copilot AI Apr 22, 2026


Skipping any vec_id already present means you will never refresh embeddings for declarations whose content/signature/description changed across PhysLib commits, so the vector DB can become stale even though the workflow reruns on new commits. If updated embeddings are required, consider switching to upsert (always re-embed) or storing a content hash in Chroma metadata and only re-embedding when the hash differs.
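The content-hash alternative could be sketched like this (helper names are illustrative, not code from the PR):

```python
import hashlib

def content_hash(kind, name, signature, informal_name, informal_description):
    """Stable digest of the exact text that gets embedded; storing it in
    the vector's metadata makes changed declarations detectable."""
    text = f"{kind} {name} {signature}\n{informal_name}: {informal_description}"
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def needs_reembed(vec_id, new_hash, existing_hashes):
    """Re-embed when the ID is new or when its stored hash differs."""
    return existing_hashes.get(vec_id) != new_hash
```

Pairing this check with an upsert (rather than add) would then replace the stale vector in place.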

Comment thread database/vector_db.py
Comment on lines 16 to +25
 try:
-    client.delete_collection("physlibsearch")
-    logger.info("deleted existing physlibsearch collection")
+    collection = client.get_collection(name="physlibsearch", embedding_function=None)
+    existing_ids = set(collection.get(include=[])["ids"])
+    logger.warning("using existing physlibsearch collection (%d vectors)", len(existing_ids))
 except Exception:
-    pass
-collection = client.create_collection(
-    name="physlibsearch",
-    metadata={"hnsw:space": "cosine"},
-    embedding_function=None,
-)
+    collection = client.create_collection(
+        name="physlibsearch",
+        metadata={"hnsw:space": "cosine"},
+        embedding_function=None,
+    )

Copilot AI Apr 22, 2026


except Exception here is too broad: if get_collection(...) succeeds but collection.get(...) fails (e.g., transient I/O / corruption / API change), the code falls into the except block and then calls create_collection(...), which will typically fail because the collection already exists. Split this into separate error handling (only create the collection when the specific "collection not found" error is raised), and surface unexpected errors instead of masking them.
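One shape for the narrower handling is to catch only a dedicated not-found error; the sketch below uses a small dict-backed fake client so the control flow is self-contained (with the real chromadb client, get_or_create_collection achieves the same effect without any try/except):

```python
class CollectionNotFound(Exception):
    """Stand-in for the client's specific 'collection not found' error."""

class FakeClient:
    """Dict-backed stand-in for the ChromaDB client, for illustration."""
    def __init__(self):
        self._collections = {}

    def get_collection(self, name):
        if name not in self._collections:
            raise CollectionNotFound(name)
        return self._collections[name]

    def create_collection(self, name):
        return self._collections.setdefault(name, {"ids": []})

def open_collection(client, name="physlibsearch"):
    try:
        return client.get_collection(name)     # reuse when it exists
    except CollectionNotFound:
        return client.create_collection(name)  # create only on "not found"
    # any other exception propagates instead of being silently masked

client = FakeClient()
first = open_collection(client)    # created fresh
first["ids"].append("x")
second = open_collection(client)   # reused, still holds "x"
```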

Comment thread database/vector_db.py
Comment on lines +18 to +19
existing_ids = set(collection.get(include=[])["ids"])
logger.warning("using existing physlibsearch collection (%d vectors)", len(existing_ids))

Copilot AI Apr 22, 2026


Building existing_ids via set(collection.get(include=[])["ids"]) loads every ID into memory up front; for large collections this can be slow and memory-intensive. Consider avoiding a full scan by checking existence in batch (e.g., per DB batch: call collection.get(ids=batch_ids, include=[]) and filter missing IDs) or by paging through IDs if you must precompute the set.
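A per-batch existence check in that spirit might look like this (a sketch; collection_get stands in for the collection's get method so the helper stays self-contained):

```python
def missing_from_batch(collection_get, batch_ids):
    """Return only the IDs in this batch that are not yet stored,
    avoiding a full up-front scan of every ID in the collection."""
    found = set(collection_get(ids=batch_ids, include=[])["ids"])
    return [i for i in batch_ids if i not in found]

# Fake stand-in: pretend the store already holds "a" and "c".
fake_get = lambda ids, include: {"ids": [i for i in ids if i in {"a", "c"}]}
new_ids = missing_from_batch(fake_get, ["a", "b", "c", "d"])
# new_ids == ["b", "d"]
```

Memory then scales with the batch size rather than with the total collection size.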
