Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add QueryWithNegative #80

Open
wants to merge 4 commits into
base: main
Choose a base branch
from

Conversation

erikdubbelboer
Copy link
Contributor

QueryWithNegative allows you to pass a string that will be excluded from the search results.

There are tree ways to implement this:

  1. Find the document matching the normal query. Then re-order the result by multiplying the similarity by the dot product with the negative vector and some constant.
  2. Find the document matching the normal query. Exclude documents where the dot product with the negative vector are above a constant.
  3. The simpler method I implemented which just subtracts the negative vector from the positive one and re-normalizes the result.

I have done some simple tests and the results look good.

I'm not sure if the extra function is a nice API. It could also be added as extra argument to Query. Or maybe Query should get a struct for it's argument as the number of arguments seems to keep increasing.

@philippgille
Copy link
Owner

philippgille commented May 21, 2024

That's an interesting feature!

  1. The simpler method I implemented which just subtracts the negative vector from the positive one and re-normalizes the result.

I have done some simple tests and the results look good.

Did you test with positive and negative queries that were similar (e.g. a use case where a result is likely to appear but you want to exclude), as well also where they were entirely different (e.g. for use cases where some result is already unlikely, but you want to be even more certain to exclude it)?

I'm worried that the resulting vector C=A-B might be too different from A in some cases. E.g.

image

What about implementing all three and letting the user choose between them?

I'm not sure if the extra function is a nice API. It could also be added as extra argument to Query. Or maybe Query should get a struct for it's argument as the number of arguments seems to keep increasing.

WDYT about keeping the Query for now (for backward compatibility and catering to simple / most common use cases), but introducing a QueryWithOptions that already takes a struct as parameter with the where, whereDocument and negativeQuery fields?

Related to above proposal with three implementations, one field could also be an enum for picking the way to exclude the negative query.


Maybe this PR could just introduce the QueryWithOptions and the enum with two values (unspecified, which validates to an error, and the substract exclusion), and follow-up PRs can introduce the other two ways?

@erikdubbelboer
Copy link
Contributor Author

I rewrote it to QueryWithOptions. Can you please have a look.

@erikdubbelboer erikdubbelboer force-pushed the QueryWithNegative branch 2 times, most recently from bb1407b to 7529c5d Compare May 26, 2024 07:40
Copy link
Owner

@philippgille philippgille left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the rewrite, plus adding the other two modes! 👍

Please consider my new comments more as questions for now, not yet as requests for change.

Regarding the reorder mode, if we don't find a good solution, one option can be to exclude that mode for now, and add it in a follow-up PR. That way users can already benefit from the simpler modes and it gives more time to discuss the reorder.

collection.go Outdated Show resolved Hide resolved
collection.go Outdated Show resolved Hide resolved
collection.go Outdated Show resolved Hide resolved
query_test.go Outdated Show resolved Hide resolved
collection.go Outdated

negativeMode := options.NegativeMode
if negativeMode == "" {
negativeMode = NEGATIVE_MODE_SUBTRACT
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your original PR implementation was with the subtract because it was simple to implement. Now that you've implemented all three (:muscle:!), isn't the filter one the best?

For the subtract my understanding is that it can lead to the following: You do a similarity search among product reviews, and when using "it's great" as query, "it's bad" as negative query, it can end up finding all "it's so-so" reviews. What I assume is more common is to use the filter to make sure no negative review is part of the results, in case they slipped through due to some additional words in the review that somehow make it similar to the positive one in the eyes of the embedding model.

Or is my understanding wrong?

An alternative could be to not define a default and always require the caller to pass an explicit value.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I removed the default mode so you're force to pick one now. I agree that for most the filter mode would make most sense, I have put that on top of the constants.

I have also reworked the filtering code to return more results. Instead of filtering the N results you get back, it now filters at query time to still try to guarantee that you get N results (unless it filters more than that).

QueryWithNegative allows you to pass a string that will be excluded from
the search results.

There are tree ways to implement this:
1. Find the document matching the normal query. Then re-order the result
   by multiplying the similarity by the dot product with the negative
   vector and some constant.
2. Find the document matching the normal query. Exclude documents where
   the dot product with the negative vector are above a constant.
3. The simpler method I implemented which just subtracts the negative
   vector from the positive one and re-normalizes the result.

I have done some simple tests and the results look good.

I'm not sure if the extra function is a nice API. It could also be added
as extra argument to Query. Or maybe Query should get a struct for it's
argument as the number of arguments seems to keep increasing.
Removed NEGATIVE_MODE_REORDER as it didn't make much sense.

NEGATIVE_MODE_FILTER now filters at query time instead of the result.
This means that when you want 10 results, you can get 10 results.
Instead of those 10 results then being filtered and resulting in less.

Removed default negative mode. You now always have to pick one when
providing negative text or embeddings.
@erikdubbelboer
Copy link
Contributor Author

I'm not using this feature in production yet. I'm still playing around with it, trying to improve it. Since embedding models can't encode something being negative, I'm currently using a fast model like Phi3 to split a search query users write into a positive and a negative part (see below). And then passing these parts to my index in chromem to find the results.

Break down the following search query in it's positive and negative parts. For example for
```
football not manager
```
you should return:
```
positive: football
negative: manager
```

Or another example, for:
```
village builder but not idle
```
you should return:
```
positive: village builder
negative: idle
```

Here is the query:
```
...
```

@philippgille
Copy link
Owner

Thank you for the changes/improvements! I should be able to review again before the end of the week.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants