Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Doc's example fails #16

Open
vargonis opened this issue May 26, 2021 · 5 comments
Open

Doc's example fails #16

vargonis opened this issue May 26, 2021 · 5 comments

Comments

@vargonis
Copy link

From the docs, one could do:

using VPTrees
using StringDistances

data = ["bla", "blub", "asdf", ":assd", "ast", "baube"]
vptree = VPTree(data, Levenshtein()))
query = "blau"
n_neighbors = 3
data[find_nearest(vptree, query, n_neighbors)]

However, that fails with MethodError: no method matching iterate(::DataStructures.BinaryMaxHeap{Tuple{Int64, Int64}}).

@altre
Copy link
Collaborator

altre commented Jun 15, 2021

Thanks for the issue! I can't reproduce this. I'm guessing this is a problem with the DataStructures version compatibility restriction. Could you share which version you have installed?

@vargonis
Copy link
Author

Sorry about this, I should have saved/posted the environment in order to reproduce later. With my current environment it works. But it does seem like a missing lower/upper bound compat problem with DataStructures.

By the way, I was trying to build some large trees but that took too long. I ended up breaking them into smaller pieces, but in the process I modified the construction algorithm to use threads to parallelize distance computations, as opposed to spawning for children branches. I got a speed-up of 2-3x, so you might consider switching to that approach. I apologize for not offering myself to make a pull request in case you'd be interested; the problem is that I had to do that under time pressure and for the sake of simplicity I just got rid of all the logic that deals with the situation where there are no threads available.

@altre
Copy link
Collaborator

altre commented Jun 17, 2021

I'll keep this issue open for now and try to fix the version problem when I get around to it.

Thanks for the idea for a performance improvement. I'm guessing it depends on the expense of the distance metric used. Perhaps you could share something I could use as a benchmark?

@altre altre reopened this Jun 17, 2021
@vargonis
Copy link
Author

Let me ask if I can share (it's a client's dataset). The metric is indeed expensive, custom distance for dataframe rows with some text columns, maybe similar to TokenMax in StringDistances.

@vargonis
Copy link
Author

Sorry for the delay. So, I asked and for this particular dataset we are under NDA. However, I think it will not make much of a difference to just use a random string dataset and TokenMax.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants