Doc's example fails #16

vargonis · 2021-05-26T10:41:45Z

From the docs, one could do:

using VPTrees
using StringDistances

data = ["bla", "blub", "asdf", ":assd", "ast", "baube"]
vptree = VPTree(data, Levenshtein()))
query = "blau"
n_neighbors = 3
data[find_nearest(vptree, query, n_neighbors)]

However, that fails with MethodError: no method matching iterate(::DataStructures.BinaryMaxHeap{Tuple{Int64, Int64}}).

The text was updated successfully, but these errors were encountered:

altre · 2021-06-15T06:59:32Z

Thanks for the issue! I can't reproduce this. I'm guessing this is a problem with the DataStructures version compatibility restriction. Could you share which version you have installed?

vargonis · 2021-06-17T06:46:02Z

Sorry about this, I should have saved/posted the environment in order to reproduce later. With my current environment it works. But it does seem like a missing lower/upper bound compat problem with DataStructures.

By the way, I was trying to build some large trees but that took too long. I ended up breaking them into smaller pieces, but in the process I modified the construction algorithm to use threads to parallelize distance computations, as opposed to spawning for children branches. I got a speed-up of 2-3x, so you might consider switching to that approach. I apologize for not offering myself to make a pull request in case you'd be interested; the problem is that I had to do that under time pressure and for the sake of simplicity I just got rid of all the logic that deals with the situation where there are no threads available.

altre · 2021-06-17T08:08:08Z

I'll keep this issue open for now and try to fix the version problem when I get around to it.

Thanks for the idea for a performance improvement. I'm guessing it depends on the expense of the distance metric used. Perhaps you could share something I could use as a benchmark?

vargonis · 2021-06-17T08:28:02Z

Let me ask if I can share (it's a client's dataset). The metric is indeed expensive, custom distance for dataframe rows with some text columns, maybe similar to TokenMax in StringDistances.

vargonis · 2021-06-25T07:34:13Z

Sorry for the delay. So, I asked and for this particular dataset we are under NDA. However, I think it will not make much of a difference to just use a random string dataset and TokenMax.

vargonis closed this as completed Jun 17, 2021

altre reopened this Jun 17, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Doc's example fails #16

Doc's example fails #16

vargonis commented May 26, 2021

altre commented Jun 15, 2021

vargonis commented Jun 17, 2021

altre commented Jun 17, 2021

vargonis commented Jun 17, 2021

vargonis commented Jun 25, 2021

Doc's example fails #16

Doc's example fails #16

Comments

vargonis commented May 26, 2021

altre commented Jun 15, 2021

vargonis commented Jun 17, 2021

altre commented Jun 17, 2021

vargonis commented Jun 17, 2021

vargonis commented Jun 25, 2021