
First version of smart search #61

Merged — 4 commits merged into NeumTry:main on Jan 15, 2024

Conversation
Conversation

Contributor

@sky-2002 sky-2002 commented Jan 7, 2024

Adds an initial basic version of smart search.

Description - As discussed in issue #43, we need a method to route queries to the appropriate/relevant collection from the PipelineCollection interface. Using collection representatives and their similarity with the query, we decide which collection/pipe to search in.

Approach - For each pipe in the PipelineCollection, compute the representative vector of its sink. Then use each pipe's embed method to embed the query and compute the query's similarity with that pipe's representative. This gives a similarity score for each pipe. Then search in descending order of similarity to get the most relevant results.
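As a rough sketch of the approach described above (the `embed` and `get_representative_vector` calls mirror the method names mentioned in this PR; `cosine_similarity` is a plain helper written here for illustration, not part of the library):

```python
import math

def cosine_similarity(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_pipes(pipes, query):
    # For each pipe: embed the query with that pipe's own embed model,
    # compare against the sink's representative vector, and rank
    # pipes by similarity, highest first.
    scores = {}
    for pipe in pipes:
        representative = pipe.sink.get_representative_vector()
        query_vector = pipe.embed(query)
        scores[pipe] = cosine_similarity(query_vector, representative)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```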

Note - Currently supported on Marqo and LanceDB sinks. Support for other sinks is coming soon.

Demo - I have prepared a notebook that I have tested on my local system. It includes a custom embedding class for quick testing. Make sure to spin up your Marqo instance before running the notebook.

Looking for review and feedback.

pipe_to_similarity = {}
for pipe in self.pipelines:
    pipe_representative = pipe.sink.get_representative_vector()
Contributor

This seems good. In the future, the representative vector could be extended so that the collection description is used as the representative in cases where it is not possible to compute one from the sink, or where you would prefer to use the description instead.

# Currently, we only select the single pipeline whose
# representative is most similar to the query.
pipe_to_similarity = dict(sorted(pipe_to_similarity.items(), key=lambda x: x[1], reverse=True)[:1])
Contributor

We could probably make the number of pipelines to use configurable. I assume in most cases you would want to pick the highest. We might also want to consider making it configurable to use any pipeline with a similarity score above X.
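A configurable selection step along these lines might look like the following sketch (function and parameter names are illustrative, not part of the codebase):

```python
def select_pipes(pipe_to_similarity, top_n=1, min_similarity=None):
    # Sort pipes by similarity, highest first.
    ranked = sorted(pipe_to_similarity.items(), key=lambda kv: kv[1], reverse=True)
    if min_similarity is not None:
        # Threshold mode: keep every pipe scoring at or above the cutoff.
        return [(p, s) for p, s in ranked if s >= min_similarity]
    # Top-N mode: keep the N highest-scoring pipes (default N=1).
    return ranked[:top_n]
```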

Contributor

The more I think about it, making it configurable to use similarities above a certain threshold seems best.

Contributor Author

@ddematheu Yep, sounds good. But deciding that threshold is difficult. Even the most similar pipe can have a low-sounding similarity score, say 0.4. We need some way to decide the threshold based on the computed similarities.

Contributor

100% agree with this. Maybe keeping just the top result to start is fine. My thinking is that we could provide different modalities: one where we pass results for the N highest-scoring pipelines (default N = 1), and another, threshold-based, where we pass results for pipelines whose similarity is higher than N (default N = 0.5 or whatever).

One additional modality might be to look at the distribution, where N is the maximum distance or percentage away from the highest score.

Btw, these are just thoughts; happy to commit to starting with the highest and adding other modalities based on feedback.

Contributor Author

@sky-2002 sky-2002 commented Jan 8, 2024

@ddematheu These are interesting ideas, especially the distribution one. It sounds similar to (though not exactly like) top-p sampling in language models. Maybe we can use that idea: consider all pipelines whose similarities, taken in descending order, add up to at least a given value. But yes, it's up to the user feedback that you get.
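The top-p-style idea could be sketched like this (an illustrative helper, not part of the PR; it keeps pipes in descending similarity order until their normalized scores sum to a fraction `p` of the total):

```python
def select_by_cumulative_mass(pipe_to_similarity, p=0.8):
    # Analogue of top-p (nucleus) sampling: normalize similarities,
    # then take pipes in descending order until their normalized
    # scores sum to at least p.
    ranked = sorted(pipe_to_similarity.items(), key=lambda kv: kv[1], reverse=True)
    total = sum(s for _, s in ranked)
    selected, cumulative = [], 0.0
    for pipe, score in ranked:
        selected.append((pipe, score))
        cumulative += score / total
        if cumulative >= p:
            break
    return selected
```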


Contributor Author

sky-2002 commented Jan 10, 2024

@ddematheu one issue still remains here: when we use the get_representative_vector method, it computes the vector even if the sink has not been updated. This is unnecessary computation. I can think of the following way to fix it:

  • We need some way to check the status of a sink: if the status says it has been updated, we compute the representative; otherwise we reuse the earlier one. A simple solution is to update the representative whenever we add new vectors.
  • We can also save computation time here. Since we are averaging anyway, there is no need to re-embed the vectors already present; something like representative_vector = (representative_vector * n_old + sum(new_vectors)) / (n_old + n_new) can be used, i.e., only the newly added vectors are touched. This can be done as part of adding vectors to a sink, updating the representative while adding.

Open to discussion on this.
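The incremental update in the second bullet can be sketched as follows (a hypothetical helper, not part of the PR; it assumes the representative is a running mean over all vectors added so far):

```python
def update_representative(representative, n_old, new_vectors):
    # Incrementally update a mean vector: the new mean equals
    # (old_mean * n_old + sum(new_vectors)) / (n_old + n_new),
    # so only the newly added vectors need to be processed.
    n_new = len(new_vectors)
    if n_new == 0:
        return representative, n_old
    dim = len(new_vectors[0])
    sums = [sum(v[i] for v in new_vectors) for i in range(dim)]
    total = n_old + n_new
    updated = [
        (representative[i] * n_old + sums[i]) / total
        for i in range(dim)
    ]
    return updated, total
```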

@ddematheu (Contributor)
> @ddematheu one issue still remains here: when we use the get_representative_vector method, it computes the vector even if the sink has not been updated. This is unnecessary computation. I can think of the following way to fix it:
>
>   • We need some way to check the status of a sink: if the status says it has been updated, we compute the representative; otherwise we reuse the earlier one. A simple solution is to update the representative whenever we add new vectors.
>   • We can also save computation time here. Since we are averaging anyway, there is no need to re-embed the vectors already present; something like representative_vector = (representative_vector * n_old + sum(new_vectors)) / (n_old + n_new) can be used, i.e., only the newly added vectors are touched. This can be done as part of adding vectors to a sink, updating the representative while adding.
>
> Open to discussion on this.

In the case where we reuse the old one, it still needs to be stored somewhere. Initially it might just be local, but I'm thinking about whether we could store it back into the vector DB (caveat: we would need to be careful not to return it in a normal search, and it feels a bit hacky, but it means there is no additional storage requirement). If we store it back into the sink, we can also check whether a sink already has a vector calculated and avoid recalculating.

Recomputing with just the newly added vectors makes sense.

@sky-2002 (Contributor Author)
@ddematheu Can't we just store it as a property? We have a limited number of sinks anyway, so would storing the representatives really become a problem?

@ddematheu (Contributor) commented Jan 12, 2024 via email

@ddematheu (Contributor)
@sky-2002 let me know if you think it's ready to merge. I still have a couple of high-level questions, but I think it is at a working point.

@sky-2002 (Contributor Author)
@ddematheu I think we can merge it. Just a reminder: it currently only supports Marqo and LanceDB, but it works well. We can keep improving it PR by PR as we get feedback.

@ddematheu ddematheu merged commit 9a23e02 into NeumTry:main Jan 15, 2024