Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Neighbor Search - Incorrect bound. #642

Closed
MarcosPividori opened this issue May 23, 2016 · 17 comments
Closed

Neighbor Search - Incorrect bound. #642

MarcosPividori opened this issue May 23, 2016 · 17 comments

Comments

@MarcosPividori
Copy link
Contributor

Hi, @sumedhghaisas @rcurtin
I have been reading the paper: "Tree-Independent Dual-Tree Algorithms" in detail, and I found an error in the definition of bound B2. I attach a pdf file with an explanation using latex, so it is easier to understand what I mean.
I think it won't be difficult to update the code to fix this.
If you agree on this modification, I can work on this and make pull request when it is fixed.
Thanks,
Marcos
bounds.pdf

@rcurtin
Copy link
Member

rcurtin commented May 23, 2016

Give me a day or two to get back to you on this. I think that there is a reason why we can't make the change you've suggested, but I need to think on it.

@sumedhghaisas
Copy link
Member

So about the situation you mentioned where the child's B2 bound is calculated using the tighter bound. You mentioned that rho(c) + lambda(c) < 2 * lambda(c). I don't think this can happen. So if I understand correctly lambda is a max distance from centroid of convex subset to points held in the node where as rho is a max distance from centroid of convex subset to all descendant points of that node which also includes the points held in that node. So clearly we get that rho(c) >= lambda(c). This can be proved through contradiction. If rho(c) is < lambda(c) then we know that there is some point in the points held by the node who's distance from centroid id more hence rho(c) is indeed incorrect and will become equal to lambda(c). So assuming I have not made any incorrect assumption while reading the paper :) the current implementation will stand correct. However, your observation is correct, that while deriving the recursive definition both bounds are used. I will have to go through that derivation and see the reasons behind it or if better bound can be achieved.

@MarcosPividori
Copy link
Contributor Author

MarcosPividori commented May 23, 2016

Hi, thanks for your responses!
@sumedhghaisas I think you confused rho and lambda:

  • rho: the maximum distance between the centroid Ci of the convex subset and each point in the node.
  • lambda: the maximum distance between the centroid Ci of the convex subset and all descendant points of that node.

So always: rho(c) <= lambda(c).

@sumedhghaisas
Copy link
Member

Sorry for that symbol confusion. In that case I agree that some error may occur when B2 is computed with strict bound. The auxiliary function is somewhat a recursive version of B2 - 2 * lambda. We add tighter bound to that and we get the proposed equation.

@MarcosPividori
Copy link
Contributor Author

Yes, B_aux would be a recursive function. We can cache previous calculations as we did with B2. So the implementation will be similar to the actual code, with a little modification.

@rcurtin
Copy link
Member

rcurtin commented May 25, 2016

Okay, I spent a long time thinking about it, and I think B_2() is correct, so I have attached a quick writeup of a proof. Let me know if there are any errors in it...
notes.pdf

@MarcosPividori
Copy link
Contributor Author

Hi Ryan,
Thank you very much for taking the time to write that proof. This is becoming really interesting!

I am not sure about:
"The ball of radius lambda(N_c) centered at the center of the node N_c lies entirely within the ball of radius lambda(N_q) centered at the center of the node N_q."
I agree that this is true for trees that consider "ballbound", but that is not true for different bounds such as "hrectbound" used in KDTrees.
The ball of radius (diagonal_of_rectangle / 2) centered at the center of a rectangle will not include the balls of the subrectangles resulting of splitting the original rectangle at the middle.

However, with actual KDTree implementation, we do not have any problem because points are only included in leaf nodes, so the optimization of using rho instead of lambda is never considered.

How are R-Trees implemented? Do they include points in non-leaf nodes? I will read about them.

I have an idea of a space tree where the B2 bound could fail. I will try to think about it and I let you know.

Thanks!

PD: A simple detail about the proof, in ecuations (8) and (9), should be (lambda(N_q) - lambda(N_c)) instead of (lambda(N_q) + lambda(N_c)). In case you want to use it in the future.

@rcurtin
Copy link
Member

rcurtin commented May 25, 2016

Yes, if you think of a space tree that breaks everything post it. :)

The proof works even for non-ball bound trees: regardless of the shape of the bound, lambda represents the furthest distance of any descendant point to the center. So even if the bound is a hyperrectangle, all of the points in that bound are contained in a ball of radius lambda (sorry I don't know how to add Unicode characters on my phone!). This applies for the child also, and it must be true that B_c is contained in B_q because all descendant points of N_c are contained in the set of all descendant points of N_q.

Even if the hyperrectangle bound is loose for a kd-tree node it doesn't end up mattering for the sake of the bound B_2, because for the calculation of B_2 we are using the implicit ball bounds B_q and B_c.

I hope I've written that clearly. let me know if not. :)

I will fix the paper write up tomorrow, thanks for pointing out the error.

@MarcosPividori
Copy link
Contributor Author

Hi Ryan,
I have been working on an example where b2 fails, using hyperrectangle bounds. I attach the explanation.
Thanks,
example.pdf

@rcurtin
Copy link
Member

rcurtin commented May 27, 2016

You are right, I failed to consider that it is possible that a child hyperrectangle's implicit bounding ball can lie not entirely withing the parent hyperrectangle's implicit bounding ball. It's easy to rework the proof I wrote to be correct and that comes out with the error correction 2 \lambda(N_q) - \lambda(N_c) (that is, we are subtracting just one times the furthest descendant distance of N_c, not two), but this is a less tight bound that the idea you suggested with B_{aux}(.). So I guess that I can conclude that my bound is incorrect and there isn't a better choice I can see than to go with your solution. Thanks for pointing all of this out.

When you apply this bound, it would be interesting to see how much it helps performance; that could easily be tracked by running mlpack_knn with the -v option, and looking at the number of base cases on a handful of datasets before and after the change. If you want to do this, feel free; if not, that's no issue, I'll do it. :)

@MarcosPividori
Copy link
Contributor Author

Hi Ryan,
Thanks for your reply. Ok, I agree with you.
I think It would be important to fix this and analyze performance before concentrating on aprox. nearest neighbor, because we need to work over this code indeed. So, if you agree, I would like to work on this!
I can see that now we accept these different tree options for Neighbor Search:

  • KDTree (default)
  • R-Tree
  • R*Tree
  • X-Tree
  • CoverTree
  • BallTree

Actual KDTree implementation doesn't not show any problem because points are only included in leaf nodes, so the optimization of using rho instead of lambda is never considered. The value of the original B_2 bound and the modification to use B_{aux} will be the same. So, we will have exactly the same number of BaseCases.
The same applies to R-Trees/R*-Trees/X-Trees (as far as I understand, they don't include points in non-leaf nodes), and for BallTrees too.

CoverTrees are the interesting case. They include a point in each non-leaf node.
So, if we change actual B_2 implementation to use B_{aux}, we will have a less tight bound, which will probably result in more BaseCases.
However, we know the original B2 bound is correct for trees that consider ball bounds, as you have proved before.

So, if you agree, I can do this:

  • Implement the modification to use B_{aux}.
  • Compare the performance after this modification with previous implementation.
    We shouldn't see any difference for all trees except for cover trees.
  • If the difference is relevant, we can modify the code to use previous B2 bound only when it is safe to do that. For example, we could add a new field in TreeTraits to decide.

MarcosPividori added a commit to MarcosPividori/mlpack that referenced this issue May 30, 2016
@rcurtin
Copy link
Member

rcurtin commented May 31, 2016

We could avoid the difference for cover trees if we refactored the bound like this:

  • Add HasBallBound() to TreeTraits; this is true for the ball tree and the cover tree but none of the other trees we currently have implemented.
  • When TreeTraits<TreeType>::HasBallBound() is true, we use the existing adjusted point bound of D_p[k] + rho(N_q) + lambda(N_q)
  • Otherwise, we use the correct point bound for non-ball-bounds of D_p[k] + 2 lambda(N_q)

This is still the same bound as in the original paper for ball-bound trees, but it is the fixed looser bound for hyperrectangle (and other weird) trees.

I like the elegant solution in B_aux of not applying the adjustment until the highest level, instead of applying the adjustment at every level like in the current B_2 definition. But I didn't see an easy way to not apply the adjustment until the highest level if we are considering a tighter bound when ball trees are used.

This should give the exact same performance as the existing implementation, so as long as we test that it's the same on one or two datasets there is no need for a big test on lots of datasets, I think. What do you think? Have I overlooked something (again)? :)

@MarcosPividori
Copy link
Contributor Author

Hi @rcurtin,
Thanks for your comments. I agree with the proposal.
I have implemented B_aux, in the commit MarcosPividori@558a4fd , and tested it with some datasets. It doesn't show any difference in the number of base cases for Cover Trees.
I modified the benchmarking system to add "BaseCases" as a new metric, and I have uploaded the results to: http://marcospividori.github.io/mlpack-aux/

  • mlpack is the original library using b_2.
  • mlpack-aux includes the modification to use b_aux for all trees. It doesn't differentiate between ball trees and the rest.

As you can see there, ALLKNN in mlpack and mlpack-aux shows the same number of base cases for all datasets.

Maybe this happens because of the characteristics of Cover Trees. I have not an indepth understanding of them...

So, I was wondering if we could simply use the b_aux modification without taking into account the tighter bound...

MarcosPividori added a commit to MarcosPividori/mlpack that referenced this issue Jun 1, 2016
@rcurtin
Copy link
Member

rcurtin commented Jun 2, 2016

The reason that we are seeing no change between the bounds is because the modifications to B_2() actually never comes into play in the current traversals that we have. As you pointed out, for kd-trees and ball trees and any other tree that only holds points in the leaves, rho(N_q) = lambda(N_q) so there is no problem. But for cover trees, the traversal is somewhat odd: it is depth-first in the query points and breadth-first in the references. This, along with the structure of the tree, means that for any query node N_q, we visit some set of reference nodes before any node combinations containing a descendant of N_q are visited, and then we never visit a node combination with N_q again. (I hope that explanation makes sense.) Therefore, B_aux actually goes unused, as does the part of B_2 that concerns child nodes, because no distances have been calculated for any descendant points of N_q except the one point held by N_q (which will give the tightest bound for B_2).

So, I am tempted to say that we shouldn't change anything here: it is already working as-is; the problem you pointed out (which is a valid problem) doesn't actually surface in the code that we have. The modifications you made, while correct, do slow down the calculation a bit, by making NeighborSearchStat larger. I think a better way to go, and maybe this is not feasible for right now, is to use TreeTraits (or maybe a new TraversalTraits) to determine which bounding strategies are valid for which types of trees and traversals. Then we can simply avoid calculating certain bounds in certain cases.

What do you think? I guess it's not a problem to merge in the changes you made, with the hope of coming up with a cleaner set of bounds sometime later.

@MarcosPividori
Copy link
Contributor Author

Hi @rcurtin ,
Yes, that makes sense. I think I understand why we didn't see any difference with cover trees. Thank you for the explanation!
I think the slow down won't be really significative. In fact, after adding the member "auxBound" to NeighborSearchStat, I realized that there were 2 unused members: "lastDistanceNode" and "bound", which I removed in MarcosPividori@ba4c5bb . So, at the end, the size of NeighborSearchStat is smaller than before. :)
Definitely, it would be useful to have something like TraversalTraits apart from TreeTraits, because the way the tree is traversed can make a huge difference when reasoning about the best bound!
I hope I get some time to continue working on this in the near future,
Thanks

@rcurtin
Copy link
Member

rcurtin commented Jun 2, 2016

Ah, nice catch with the unused members. There is certainly a lot of interesting open work here; when I have some free research time I think I will devote it towards enumerating all the different types of bounds that could be used for dual-tree nearest neighbor search (or even single-tree search possibly?) and revamp the code. Open a PR for the two commits you made, and I'll merge it in. Thanks again for all the time you've put towards this. 👍

rcurtin added a commit that referenced this issue Jun 2, 2016
@rcurtin
Copy link
Member

rcurtin commented Jul 5, 2016

I think this issue is done, so I'll go ahead and close it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants