Neighbor Search - Incorrect bound. #642
Comments
Give me a day or two to get back to you on this. I think that there is a reason why we can't make the change you've suggested, but I need to think on it. |
About the situation you mentioned where the child's B2 bound is calculated using the tighter bound: you mentioned that rho(c) + lambda(c) < 2 * lambda(c). I don't think this can happen. If I understand correctly, lambda is the maximum distance from the centroid of the convex subset to the points held in the node, whereas rho is the maximum distance from the centroid of the convex subset to all descendant points of that node, which also include the points held in that node. So clearly rho(c) >= lambda(c). This can be proved by contradiction: if rho(c) < lambda(c), then some point held by the node is farther from the centroid than rho(c), so rho(c) is incorrect and should instead equal lambda(c). So, assuming I have not made any incorrect assumption while reading the paper :), the current implementation stands correct. However, your observation is correct that both bounds are used while deriving the recursive definition. I will have to go through that derivation and see the reasons behind it, or whether a better bound can be achieved. |
Hi, thanks for your responses!
So always: rho(c) <= lambda(c). |
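To make the two quantities concrete, here is a minimal sketch using the convention settled on above (rho for the points held directly in a node, lambda for all of its descendant points). The function names and the sample points are hypothetical and are not taken from mlpack.

```python
import math

def furthest_point_distance(center, held_points):
    """rho: largest distance from the center to a point held directly in the node."""
    return max((math.dist(center, p) for p in held_points), default=0.0)

def furthest_descendant_distance(center, descendant_points):
    """lambda: largest distance from the center to any descendant point of the node."""
    return max((math.dist(center, p) for p in descendant_points), default=0.0)

# Hypothetical node: one held point, plus two more points held by its children.
center = (0.0, 0.0)
held = [(1.0, 0.0)]
descendants = held + [(0.0, 3.0), (-2.0, 0.0)]

rho = furthest_point_distance(center, held)
lam = furthest_descendant_distance(center, descendants)

# The held points are a subset of the descendant points, so rho <= lambda.
assert rho <= lam
```

Because the maximum for rho is taken over a subset of the points used for lambda, the inequality holds for any node, whatever the shape of its bound.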
Sorry for that symbol confusion. In that case I agree that some error may occur when B2 is computed with strict bound. The auxiliary function is somewhat a recursive version of B2 - 2 * lambda. We add tighter bound to that and we get the proposed equation. |
Yes, B_aux would be a recursive function. We can cache previous calculations as we did with B2. So the implementation will be similar to the actual code, with a little modification. |
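As a rough illustration of what such a cached recursive function could look like, here is a sketch. It assumes, based only on the description above, that B_aux(N) is the smallest k-th candidate distance over all descendant points of N, and that the final bound is the minimum of B_aux(N) + 2 * lambda(N) and the tighter held-point bound using rho(N) + lambda(N). The exact definition is in the attached PDF, which is not reproduced here, so the `Node` class and all of its fields are hypothetical.

```python
import math
from dataclasses import dataclass, field

@dataclass
class Node:
    kth_dists: list                 # current k-th candidate distance of each held point
    rho: float                      # furthest held-point distance
    lam: float                      # furthest descendant distance
    children: list = field(default_factory=list)
    aux_cache: float = None         # last computed B_aux value (naive cache)

def b_aux(node):
    """Assumed B_aux: minimum k-th candidate distance over all descendant points."""
    best = min(node.kth_dists, default=math.inf)
    for child in node.children:
        best = min(best, b_aux(child))
    node.aux_cache = best           # stored so a traversal could reuse it later
    return best

def b2(node):
    """Assumed combined bound: adjustment applied once, at the node being bounded."""
    loose = b_aux(node) + 2 * node.lam
    tight = min(node.kth_dists, default=math.inf) + node.rho + node.lam
    return min(loose, tight)

# Hypothetical cover-tree-like node holding one point at its center (rho = 0).
leaf1 = Node(kth_dists=[5.0, 7.0], rho=1.0, lam=1.0)
leaf2 = Node(kth_dists=[4.0], rho=0.5, lam=0.5)
root = Node(kth_dists=[6.0], rho=0.0, lam=3.0, children=[leaf1, leaf2])

bound = b2(root)
```

The cache here is deliberately naive: it is overwritten on every call. A real traversal would need to decide when cached child values are still valid as candidate distances shrink.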
Okay, I spent a long time thinking about it, and I think B_2() is correct, so I have attached a quick writeup of a proof. Let me know if there are any errors in it... |
Hi Ryan, I am not sure about: However, with the current KDTree implementation, we do not have any problem, because points are only included in leaf nodes, so the optimization of using rho instead of lambda is never considered. How are R-Trees implemented? Do they include points in non-leaf nodes? I will read about them. I have an idea of a space tree where the B2 bound could fail. I will try to think about it and I will let you know. Thanks! PS: A small detail about the proof: in equations (8) and (9), it should be (lambda(N_q) - lambda(N_c)) instead of (lambda(N_q) + lambda(N_c)), in case you want to use it in the future. |
Yes, if you think of a space tree that breaks everything, post it. :) The proof works even for non-ball bound trees: regardless of the shape of the bound, lambda represents the furthest distance of any descendant point to the center. So even if the bound is a hyperrectangle, all of the points in that bound are contained in a ball of radius lambda (sorry, I don't know how to add Unicode characters on my phone!). This applies to the child also, and it must be true that B_c is contained in B_q because all descendant points of N_c are contained in the set of all descendant points of N_q. Even if the hyperrectangle bound is loose for a kd-tree node, it doesn't end up mattering for the sake of the bound B_2, because for the calculation of B_2 we are using the implicit ball bounds B_q and B_c. I hope I've written that clearly. Let me know if not. :) I will fix the paper writeup tomorrow; thanks for pointing out the error. |
Hi Ryan, |
You are right, I failed to consider that a child hyperrectangle's implicit bounding ball may not lie entirely within the parent hyperrectangle's implicit bounding ball. It's easy to rework the proof I wrote to be correct, and that comes out with the error correction 2 \lambda(N_q) - \lambda(N_c) (that is, we are subtracting just one times the furthest descendant distance of N_c, not two), but this is a less tight bound than the idea you suggested with B_{aux}(.). So I guess that I can conclude that my bound is incorrect, and there isn't a better choice I can see than to go with your solution. Thanks for pointing all of this out. When you apply this bound, it would be interesting to see how much it helps performance; that could easily be tracked by running |
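The containment failure acknowledged in the comment above can be checked numerically. The sketch below assumes that each node's center is the midpoint of its bounding hyperrectangle and that the implicit ball has radius lambda (the furthest descendant distance). The points and function names are made up for illustration.

```python
import math

def box_center(points):
    """Midpoint of the axis-aligned bounding box of the points (assumed node center)."""
    dims = range(len(points[0]))
    return tuple((min(p[d] for p in points) + max(p[d] for p in points)) / 2
                 for d in dims)

def implicit_ball(points):
    """Center and radius (lambda) of the ball implied by a hyperrectangle node."""
    center = box_center(points)
    return center, max(math.dist(center, p) for p in points)

# Hypothetical kd-tree split: the child holds the two right-most points.
child_points = [(8.0, 0.0), (10.0, 4.0)]
parent_points = [(0.0, 0.0), (1.0, 3.0)] + child_points

c_q, lam_q = implicit_ball(parent_points)   # parent ball B_q
c_c, lam_c = implicit_ball(child_points)    # child ball B_c

# B_c lies inside B_q only if the center offset plus lambda(N_c) fits within lambda(N_q).
contained = math.dist(c_q, c_c) + lam_c <= lam_q
```

Here the parent center is (5, 2) and the child center is (9, 2), so the center offset is 4. Since lambda(N_c) is about 2.24 and lambda(N_q) is about 5.39, the child ball sticks out of the parent ball even though every child point is also a parent point.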
Hi Ryan,
The current KDTree implementation does not show any problem, because points are only included in leaf nodes, so the optimization of using rho instead of lambda is never considered. The value of the original B_2 bound and of the modification that uses B_{aux} will be the same, so we will have exactly the same number of BaseCases. CoverTrees are the interesting case: they include a point in each non-leaf node. So, if you agree, I can do this:
|
We could avoid the difference for cover trees if we refactored the bound like this:
This is still the same bound as in the original paper for ball-bound trees, but it is the fixed looser bound for hyperrectangle (and other weird) trees. I like the elegant solution in B_aux of not applying the adjustment until the highest level, instead of applying the adjustment at every level like in the current B_2 definition. But I didn't see an easy way to not apply the adjustment until the highest level if we are considering a tighter bound when ball trees are used. This should give the exact same performance as the existing implementation, so as long as we test that it's the same on one or two datasets there is no need for a big test on lots of datasets, I think. What do you think? Have I overlooked something (again)? :) |
Hi @rcurtin,
As you can see there, ALLKNN in mlpack and mlpack-aux shows the same number of base cases for all datasets. Maybe this happens because of the characteristics of Cover Trees. I do not have an in-depth understanding of them... So, I was wondering if we could simply use the b_aux modification without taking into account the tighter bound... |
The reason that we are seeing no change between the bounds is that the modifications to B_2() actually never come into play in the current traversals that we have. As you pointed out, for kd-trees and ball trees and any other tree that only holds points in the leaves, rho(N_q) = lambda(N_q), so there is no problem. But for cover trees, the traversal is somewhat odd: it is depth-first in the query points and breadth-first in the references. This, along with the structure of the tree, means that for any query node N_q, we visit some set of reference nodes before any node combinations containing a descendant of N_q are visited, and then we never visit a node combination with N_q again. (I hope that explanation makes sense.) Therefore, B_aux actually goes unused, as does the part of B_2 that concerns child nodes, because no distances have been calculated for any descendant points of N_q except the one point held by N_q (which will give the tightest bound for B_2). So, I am tempted to say that we shouldn't change anything here: it is already working as-is; the problem you pointed out (which is a valid problem) doesn't actually surface in the code that we have. The modifications you made, while correct, do slow down the calculation a bit, by making What do you think? I guess it's not a problem to merge in the changes you made, with the hope of coming up with a cleaner set of bounds sometime later. |
Hi @rcurtin , |
Ah, nice catch with the unused members. There is certainly a lot of interesting open work here; when I have some free research time I think I will devote it towards enumerating all the different types of bounds that could be used for dual-tree nearest neighbor search (or even single-tree search possibly?) and revamp the code. Open a PR for the two commits you made, and I'll merge it in. Thanks again for all the time you've put towards this. 👍 |
I think this issue is done, so I'll go ahead and close it. |
Hi, @sumedhghaisas @rcurtin
I have been reading the paper "Tree-Independent Dual-Tree Algorithms" in detail, and I found an error in the definition of bound B2. I attach a PDF file with an explanation written in LaTeX, so it is easier to understand what I mean.
I think it won't be difficult to update the code to fix this.
If you agree with this modification, I can work on it and make a pull request when it is fixed.
Thanks,
Marcos
bounds.pdf