Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use ChromaDB for our embeddings database and similarity search #460

Merged
merged 11 commits into from
Jan 12, 2024

Conversation

granawkins
Copy link
Member

No description provided.

@@ -19,7 +19,7 @@ async def score(
self.query, features, self.loading_multiplier
)
features_scored = zip(features, sim_scores)
return sorted(features_scored, key=lambda x: x[1], reverse=True)
return sorted(features_scored, key=lambda x: x[1])
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We were using cosine_similarity, which is larger-is-better, but Chroma's default similarity func is 'l2', which is lower-is-better.

tests/conftest.py Outdated Show resolved Hide resolved
tests/license_check.py Outdated Show resolved Hide resolved
tests/license_check.py Outdated Show resolved Hide resolved
mentat/embeddings.py Outdated Show resolved Hide resolved
@granawkins
Copy link
Member Author

Now we're passing our llm_api_handler.call_embeddings_api directly to ChromaDB. Only complication was that it takes a non-async func.

Comment on lines 80 to 90
def migrate_old_db(self):
"""Temporary helper function to migrate sqlite3 to chromadb

Prior to January 2024, embeddings were fetched directly from the OpenAI API in
batches and saved to a db. We're currently using the same embeddings (ada-2) with
ChromaDB, so we might as well save the effort of re-fetching them. One drawback
is that ChromaDB saves the actual text, while our old schema did not, so migrated
records will have an empty documents field. This shouldn't be a problem. If it is,
we can just update the 'exists' method to require a non-empty "document" field.

TODO: erase this method/call after a few months
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is a nice optimization, but I'm not sure the extra complexity is worth saving users a couple cents to rebuild. It might be unlikely, but some bug from migrated records being slightly different could be a big headache and not worth the risk

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok make sense. I was just looking at the size of my sqlite file but (1) mine is probably bigger than anyone's and (2) it is indeed really fast and cheap to reload.

Comment on lines -359 to +386
async def call_embedding_api(
def call_embedding_api(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm a little too bad Chroma doesn't support async. It's probably not an issue now but in the future if were doing embeddings in the background it could make mentat unresponsive

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Look like some other people on Discord also want async, but not dice yet: https://discord.com/channels/1073293645303795742/1073293646083919946/1191885640120414218

tests/license_check.py Outdated Show resolved Hide resolved
@granawkins
Copy link
Member Author

I messed up - I was running license_check locally outside a venv, so it was scanning my entire pip library. Chroma only requires is 1 skip/UNKNOWN, and some different wordings of the Apache license.

@granawkins granawkins merged commit 414cfc3 into main Jan 12, 2024
16 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants