
First version of smart search #61

Merged — 4 commits merged into NeumTry:main on Jan 15, 2024

Conversation
Conversation

Contributor

@sky-2002 sky-2002 commented Jan 7, 2024

Adds an initial basic version of smart search.

Description - As discussed in issue #43, we need a method to route queries to the appropriate/relevant collection from the PipelineCollection interface. Using collection representatives and their similarity with the query, we decide which collection/pipe to search in.

Approach - For each pipe in the PipelineCollection, compute the representative vector of its sink. Then use each pipe's embed method to embed the query and compute the query's similarity with that pipe's representative. This gives a similarity score for each pipe. Then search in descending order of similarity to get the most relevant results.
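As a rough sketch of the approach described above (the `embed` and `get_representative_vector` calls mirror the method names mentioned in this PR; `cosine_similarity` is a plain helper written here for illustration, not part of the library):

```python
import math

def cosine_similarity(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_pipes(pipes, query):
    # For each pipe: embed the query with that pipe's own embed model,
    # compare against the sink's representative vector, and rank
    # pipes by similarity, highest first.
    scores = {}
    for pipe in pipes:
        representative = pipe.sink.get_representative_vector()
        query_vector = pipe.embed(query)
        scores[pipe] = cosine_similarity(query_vector, representative)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```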

Note - Currently supported on Marqo and LanceDB sinks. Support for other sinks is coming soon.

Demo - I have prepared a notebook that I have tested on my local system. It includes a custom embedding class for quick testing. Make sure to spin up your Marqo instance before running the notebook.

Looking for review and feedback.

pipe_to_similarity = {}
for pipe in self.pipelines:
    pipe_representative = pipe.sink.get_representative_vector()
Contributor

This seems good. In the future, the representative vector could be extended so that the collection description is used as the representative in cases where it is not possible to compute one from the sink, or where you would prefer to use the description instead.

# Currently, we only select the single pipeline whose
# representative is most similar to the query.
pipe_to_similarity = dict(sorted(pipe_to_similarity.items(), key=lambda x: x[1], reverse=True)[:1])
Contributor

We could probably make the number of pipelines to use configurable. I assume in most cases you would want to pick the highest. We might also want to consider making it configurable to use any pipeline with a similarity score above X.
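A configurable selection step along these lines might look like the following sketch (function and parameter names are illustrative, not part of the codebase):

```python
def select_pipes(pipe_to_similarity, top_n=1, min_similarity=None):
    # Sort pipes by similarity, highest first.
    ranked = sorted(pipe_to_similarity.items(), key=lambda kv: kv[1], reverse=True)
    if min_similarity is not None:
        # Threshold mode: keep every pipe scoring at or above the cutoff.
        return [(p, s) for p, s in ranked if s >= min_similarity]
    # Top-N mode: keep the N highest-scoring pipes (default N=1).
    return ranked[:top_n]
```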

Contributor

The more I think about it, making it configurable to use similarities above a certain threshold seems best.

Contributor Author

@ddematheu Yep, sounds good. But deciding that threshold is difficult. Even the most similar pipe can have a low-sounding similarity score, say 0.4. We need some way to decide the threshold based on the computed similarities.

Contributor

100% agree with this. Maybe keeping just the top result to start is fine. My thinking is that we could provide different modalities: one where we pass results for the N highest-scoring pipelines (default N = 1), and another, threshold-based, where we pass results for pipelines whose similarity is higher than N (default N = 0.5 or whatever).

One additional modality might be to look at the distribution, where N is the maximum distance or percentage away from the highest score.

Btw, these are just thoughts; happy to commit to starting with the highest and adding other modalities based on feedback.

Contributor Author

@sky-2002 sky-2002 commented Jan 8, 2024

@ddematheu These are interesting ideas, especially the distribution one. It sounds similar to (though not exactly like) top-p sampling in language models. Maybe we can use that idea: consider all pipelines whose similarities, taken in descending order, add up to at least a given value. But yes, it's up to the user feedback that you get.
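The top-p-style idea could be sketched like this (an illustrative helper, not part of the PR; it keeps pipes in descending similarity order until their normalized scores sum to a fraction `p` of the total):

```python
def select_by_cumulative_mass(pipe_to_similarity, p=0.8):
    # Analogue of top-p (nucleus) sampling: normalize similarities,
    # then take pipes in descending order until their normalized
    # scores sum to at least p.
    ranked = sorted(pipe_to_similarity.items(), key=lambda kv: kv[1], reverse=True)
    total = sum(s for _, s in ranked)
    selected, cumulative = [], 0.0
    for pipe, score in ranked:
        selected.append((pipe, score))
        cumulative += score / total
        if cumulative >= p:
            break
    return selected
```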


Contributor Author

sky-2002 commented Jan 10, 2024

@ddematheu one issue still remains here: when we use the get_representative_vector method, it computes the vector even if the sink has not been updated. This is unnecessary computation. I can think of the following way to fix it:

  • We need some way to check the status of a sink: if the status says it has been updated, we compute the representative; otherwise we reuse the earlier one. A simple solution is to update the representative whenever we add new vectors.
  • We can also save computation time here. Since we are averaging anyway, there is no need to re-embed the vectors already present; something like representative_vector = (representative_vector * n_old + sum(new_vectors)) / (n_old + n_new) can be used, i.e., only the newly added vectors are touched. This can be done as part of adding vectors to a sink, updating the representative while adding.

Open to discussion on this.
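The incremental update in the second bullet can be sketched as follows (a hypothetical helper, not part of the PR; it assumes the representative is a running mean over all vectors added so far):

```python
def update_representative(representative, n_old, new_vectors):
    # Incrementally update a mean vector: the new mean equals
    # (old_mean * n_old + sum(new_vectors)) / (n_old + n_new),
    # so only the newly added vectors need to be processed.
    n_new = len(new_vectors)
    if n_new == 0:
        return representative, n_old
    dim = len(new_vectors[0])
    sums = [sum(v[i] for v in new_vectors) for i in range(dim)]
    total = n_old + n_new
    updated = [
        (representative[i] * n_old + sums[i]) / total
        for i in range(dim)
    ]
    return updated, total
```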

@ddematheu (Contributor)
> @ddematheu one issue still remains here: when we use the get_representative_vector method, it computes the vector even if the sink has not been updated. This is unnecessary computation. I can think of the following way to fix it:
>
>   • We need some way to check the status of a sink: if the status says it has been updated, we compute the representative; otherwise we reuse the earlier one. A simple solution is to update the representative whenever we add new vectors.
>   • We can also save computation time here. Since we are averaging anyway, there is no need to re-embed the vectors already present; something like representative_vector = (representative_vector * n_old + sum(new_vectors)) / (n_old + n_new) can be used, i.e., only the newly added vectors are touched. This can be done as part of adding vectors to a sink, updating the representative while adding.
>
> Open to discussion on this.

In the case where we reuse the old one, it still needs to be stored somewhere. Initially it might just be local, but I'm thinking about whether we could store it back into the vector DB (caveat: we would need to be careful not to return it in a normal search, and it feels a bit hacky, but it means there is no additional storage requirement). If we store it back into the sink, we can also check whether a sink already has a vector calculated and avoid recalculating.

Recomputing with just the newly added vectors makes sense.

@sky-2002 (Contributor Author)
@ddematheu Can't we just store it as a property? We have a limited number of sinks anyway, so would storing the representatives really become a problem?

@ddematheu (Contributor) commented Jan 12, 2024 via email

@ddematheu (Contributor)
@sky-2002 let me know if you think it's ready to merge. I still have a couple of high-level questions, but I think it is at a working point.

@sky-2002 (Contributor Author)
@ddematheu I think we can merge it. Just a reminder: it currently only supports Marqo and LanceDB, but it works well. We can keep improving it PR by PR as we get feedback.

@ddematheu ddematheu merged commit 9a23e02 into NeumTry:main Jan 15, 2024