
Spill Trees implementation. #728

Closed

MarcosPividori opened this issue Jul 20, 2016 · 6 comments

@MarcosPividori
Contributor

MarcosPividori commented Jul 20, 2016

Hi! @rcurtin @sumedhghaisas
I have been working on the implementation of spill trees in the branch: spill-trees
I have summarized what I have implemented in a PDF file: spilltrees.pdf
I would be grateful if you could let me know your opinion so I can continue with the implementation :)
Thanks!

Example:

[Figure: example3]

  • p_q: represents a query point.
  • p_r1 and p_r2: represent two reference points.
  • the green area: represents the area containing all the points included in the left node, i.e. the "real" overlap considering the "decision boundary".

As can be seen there, the overlap is not the same for all sections of the "decision boundary":

  • p_r2 is at a distance far greater than tau from the decision boundary, and will be included in the left node.
  • p_r1 is at a distance greater than tau/2 but less than tau from the decision boundary, and won't be included in the left node.

Only the points at a distance less than tau/2 from the decision boundary are guaranteed to be included in the left node.

So p_r2 is included in the left node, but p_r1 is not.

The query point p_q will traverse to the left and will consider the point p_r2, but the point p_r1 won't be considered.

@rcurtin
Member

rcurtin commented Jul 22, 2016

I took a look through the document; thanks for writing it up. I agree with your proof and the results. (Also as a side note if you don't want leftbound and rightbound rendered as math text you can use \mbox{leftbound} or possibly \operatorname{leftbound}, and if you want numbered equations then enter math mode with \begin{equation} instead of $$.)

You are right that the tighter bounds and MinDistance()/MaxDistance() are used to enhance pruning for backtracking, but with defeatist search, backtracking is not necessary. One big issue is that MinDistance() and MaxDistance() each take O(d) time to calculate, whereas determining which side of a splitting hyperplane a point is on is an O(1) calculation. This can be a big difference in higher-dimensional spaces.

I also think that it's not necessarily reasonable to use NeighborSearchRules, because there is a lot of overhead there, with calculating bounds, and even during the recursion. For BinarySpaceTree, the rules will calculate the minimum distance to both children (two O(d) calculations) and then decide which is best to go into first. There we need both minimum distance values to inform pruning decisions. But for defeatist search, we only need to solve the decision version of the problem: which node is better to recurse into? We can do this in O(d) if we don't use splitting hyperplanes and in O(1) if we do use splitting hyperplanes. That's a significant acceleration for such an integral part of the search.
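To make the contrast concrete, here is a minimal sketch of the two ways to decide which child to descend into; the accessors (Left(), Right(), MinDistance(), SplitDimension(), SplitValue()) are hypothetical names used for illustration, not the actual spill tree API:

```c++
// O(d) decision: compare the minimum distance from the query to each
// child's bound (each MinDistance() call costs O(d)).
template<typename TreeType, typename VecType>
TreeType* BestChildByBound(TreeType& node, const VecType& query)
{
  const double leftDist = node.Left()->MinDistance(query);
  const double rightDist = node.Right()->MinDistance(query);
  return (leftDist <= rightDist) ? node.Left() : node.Right();
}

// O(1) decision: check which side of the axis-orthogonal splitting
// hyperplane the query point falls on.
template<typename TreeType, typename VecType>
TreeType* BestChildByHyperplane(TreeType& node, const VecType& query)
{
  return (query[node.SplitDimension()] < node.SplitValue())
      ? node.Left() : node.Right();
}
```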

But this would mean that

  • We can't use the SingleTreeTraverser or DualTreeTraverser; we have to write a generic defeatist traverser. I think that's fairly straightforward.
  • We need to make the splitting dimension accessible for quick calculation. I think this could be done by a custom StatisticType.
  • We need a function that can calculate which child is best. I think this too could be done in StatisticType, as some function like size_t GetBestChild(VecType& v) or size_t GetBestChild(TreeType& n). Alternatively, we could add that to the TreeType API, but I am not so enthusiastic about that, since I want to keep that API as small as is feasible.

Still, I think that approach (which maps to the second option in your paper) is going to be the option that produces faster searches with only slightly worse results.

Some comments from IRC:

16:56 < sumedhghaisas> second... I am assuming spill trees are used widely ....
16:57 < sumedhghaisas> So it won't be a good decision to implement an improvised version of spill trees but not the original...

I agree with this. If we say we are implementing spill trees, and there is only one well-known way to do that, we should try and stick with the original as much as is reasonably possible. For something like the kd-trees themselves, we already deviate from Bentley's original formulation by having only leaves hold points, but there are so many variations of kd-trees that I think it is okay. But I think the case is different with spill trees.

17:03 < sumedhghaisas> Its difficult to gauge their effectiveness... like which one is better... the only way to prove that hrect bounds are better is to compare it to the actual implementation...

I agree. I thought about this for a while. Although we can show in the proof that you gave that we inspect every reference point within tau/2 of the query, and with the original spill tree, we inspect every reference point within tau of the query, we have no guarantees on what happens when the nearest neighbor is further than tau from the query. (Like if the query point is very far from any of the reference points.) It might be interesting to investigate that point, but I am not sure if it is worth the time, because I think any difference is probably going to be marginal at best.

17:08 < marcosirc> Ok, yes I agree that the difference in the implementations is a problem if we want to benchmark and make assertions about spill trees in general.

That's true, but we must keep in mind that if we are benchmarking and comparing algorithms, it's not possible for us to say "spill trees are better than kd-trees for approximate nearest neighbor search", even on a single dataset. All we can say is "this implementation of spill trees is better than this implementation of kd-trees for approximate nearest neighbor search"; we can't separate implementation quality from our results. The best we can do is try to make both as fast as possible. :)

@rcurtin
Member

rcurtin commented Jul 22, 2016

I should add, also, that I see you commented yesterday that you were going to start reimplementing the spill tree for the hyperplane split. So maybe you've already come up with better ideas than the ones I had on how to do it, but hopefully what I wrote is helpful nonetheless.

Another thought I had is that I think (but am not sure) that it should be possible to build spill trees in such a way that you don't need to duplicate any points in the dataset or hold a set of "spill indices" in each leaf, by cleverly arranging the points in the dataset. Like for a left child you would hold the spill points at the end of the submatrix and for a right child you would hold the spill points at the beginning of the submatrix.
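To illustrate that idea (this is only a sketch of the suggested layout, under the assumption that the spill points sit in the middle of each submatrix; it is not something that is implemented), both children would then be contiguous column ranges sharing the middle block:

```c++
// Hypothetical column layout of the dataset at one level of the tree:
//
//   [ 0, countLeft )                       points only in the left child
//   [ countLeft, countLeft + spillCount )  spill (overlapping) points
//   [ countLeft + spillCount, count )      points only in the right child
//
// Left child  = columns [0, countLeft + spillCount)  -> spill points at its end.
// Right child = columns [countLeft, count)           -> spill points at its beginning.
// The two ranges overlap on the spill block, so no point needs to be duplicated.
```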

@MarcosPividori
Contributor Author

MarcosPividori commented Jul 23, 2016

Hi @rcurtin,
Thanks for your response! I agree with what was proposed. So I will continue implementing the original approach with splitting hyperplanes.

I have been thinking about this. I list some ideas that I need to refine:

  • We can decide which child node to traverse in O(1) if we consider axis-orthogonal splitting hyperplanes.
    As I mentioned in the pdf file, another option is to consider splitting hyperplanes that are not necessarily axis-orthogonal. For example, as suggested in Ting Liu's paper, we can approximate the pair of furthest points p1, p2, in the current set of points and split the set of points according to the projection over the vector defined by p1 and p2. We could record that vector in each node, so we can project the query node and decide which child node to traverse when doing dfs.
    The advantage is that we have more flexibility; I think it would improve the classification of points. However, the disadvantage is that it takes O(d) to calculate the projection of the query point when deciding which node to traverse.
    Also, depending on this, I can see 2 different approaches for dual tree search:
    • If the reference spill tree is built with axis-orthogonal hyperplanes, we can build a query tree with hrects bounds (KDTree) and use the information of the hrects bounds to know which reference node to traverse. Given a hrect bound and an axis-orthogonal hyperplane, we can know in O(1) time which side to traverse (or both).
    • If the reference spill tree is built with non-axis-orthogonal hyperplanes, we can build a query tree with ball bounds (BallTree), project the ball bound's center and use the radius to know which side to traverse in O(d) time (see the sketch after this list).
  • I think I could record the splitting dimension and value (or the projection vector in the case of non-axis-orthogonal hyperplanes) in each node instead of using the StatisticType, because this information won't change after building the tree.
  • I don't think it is possible to rearrange the points in the dataset instead of holding a set of indices in the leaf nodes.
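As a rough sketch of that second dual-tree approach (the names and the hyperplane parameterization here are assumptions for illustration, not code from the branch), deciding which side of a non-axis-orthogonal hyperplane a ball bound lies on costs one O(d) dot product:

```c++
#include <armadillo>

// For a hyperplane defined by dot(w, x) = b (w is the projection vector
// recorded in the reference node), return -1 if the query node's ball bound
// lies entirely on the left side, +1 if entirely on the right side, and 0 if
// it intersects the hyperplane (so both children must be considered).
inline int BallSide(const arma::vec& w, const double b,
                    const arma::vec& center, const double radius)
{
  // Signed distance from the ball's center to the hyperplane: O(d).
  const double signedDist = (arma::dot(w, center) - b) / arma::norm(w, 2);
  if (signedDist < -radius) return -1;  // Entirely on the left side.
  if (signedDist >  radius) return +1;  // Entirely on the right side.
  return 0;                             // The ball straddles the hyperplane.
}
```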

Thanks

@MarcosPividori
Contributor Author

MarcosPividori commented Aug 1, 2016

Hi @sumedhghaisas @rcurtin,

I have implemented Spill trees with axis-orthogonal splitting hyperplanes. I have made an effort to avoid duplicating existing code for neighbor search.

I created a new class SpillSearch that provides an interface similar to NeighborSearch but with an extension to properly set the tau parameter. It encapsulates an instance of NeighborSearch.

Also, I have implemented a new version of NeighborSearchRules specialized for Spill Trees, because I needed to modify the methods:

  • Score() to consider splitting hyperplanes for overlapping nodes.
  • CalculateBounds() to ignore the B_2 bound (we cannot use the B_2 bound for Spill Trees).

Single Tree Search:

The SingleTreeTraverser is similar to the implementation for BinarySpaceTree.
The difference is in the implementation of NeighborSearchRules for SpillTrees.
When calculating the score of a query point and a reference node, I consider 2 cases:

  • If the reference node is non-overlapping, I calculate the score the same as before.
  • If the reference node is overlapping, I analyze the reference node's half space. If it contains the given query point, I return 0 (best score). Else, I return DBL_MAX (prune). (A sketch of this follows the list.)
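A rough sketch of that scoring logic (Overlapping(), HalfSpaceContains() and MinDistance() are placeholder names for illustration, not the exact methods in the pull request):

```c++
#include <cfloat>  // For DBL_MAX.

// Score a (query point, reference node) combination: non-overlapping nodes
// keep the usual distance-based score; overlapping nodes get a defeatist
// decision, either "always recurse" (0) or "prune" (DBL_MAX).
template<typename VecType, typename TreeType>
double Score(const VecType& queryPoint, TreeType& referenceNode)
{
  if (!referenceNode.Overlapping())
    return referenceNode.MinDistance(queryPoint);  // Same scoring as before.

  if (referenceNode.HalfSpaceContains(queryPoint))
    return 0.0;       // Best score: the query falls in this node's half space.
  else
    return DBL_MAX;   // Prune: the sibling node will handle this query.
}
```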

Dual Tree Search:

The Query tree is built without overlapping.

When calculating the score of a query node and a reference node, I consider 2 cases:

  • If the reference node is a non-overlapping node, I calculate the score the same as before.
  • If the reference node is an overlapping node, I analyze the query node's bounding box. If it intersects the reference node's half space, I return 0 (best score). Else, I return DBL_MAX (prune).

The DualTreeTraverser is slightly different from the implementation for BinarySpaceTree. When referenceNode is an overlapping node and we can't decide which child node to traverse (this means that queryNode is on both sides of the splitting hyperplane), we analyze the queryNode:

  • If queryNode is a non-leaf node, I recurse down the query node.
  • If queryNode is a leaf node, I do single tree search for each point in the query node. (A sketch of this case follows the list.)
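A rough sketch of that part of the traverser (SingleTreeSearch() is a placeholder for whatever actually performs the defeatist single-tree search; the tree accessors are the usual mlpack-style names, and the whole function is only an illustration):

```c++
#include <cstddef>  // For size_t.

// Called when referenceNode is overlapping and queryNode's bound intersects
// the splitting hyperplane, so neither reference child can be chosen outright.
template<typename TreeType>
void HandleStraddlingQuery(TreeType& queryNode, TreeType& referenceNode)
{
  if (queryNode.NumChildren() > 0)
  {
    // Non-leaf query node: recurse down the query tree; the smaller query
    // nodes may fall entirely on one side of the hyperplane.
    HandleStraddlingQuery(*queryNode.Left(), referenceNode);
    HandleStraddlingQuery(*queryNode.Right(), referenceNode);
  }
  else
  {
    // Leaf query node: run single-tree (defeatist) search for each of its
    // points against the reference node.
    for (size_t i = 0; i < queryNode.NumPoints(); ++i)
      SingleTreeSearch(queryNode.Point(i), referenceNode);  // Placeholder.
  }
}
```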

The DualTreeTraverser is faster than the SingleTreeTraverser, especially when the value of tau increases, because we will have more non-overlapping nodes, which implies more time spent backtracking.

The extension was incorporated into the existing mlpack_knn. With the current implementation, we can use "-t spill" to use spill trees and "--tau 0.1" to set different values for the overlap size (default value is tau=0).

I have made a pull request with this implementation in: #747.

If you agree, I plan to work in the coming days on these topics:

  • Implement another version of SpillTrees to consider non-axis-orthogonal hyperplanes, using BallBound instead of HrectBound, and holding a projection vector in each node.
  • Add a command line flag "--get_real_error" to compare the approximate neighbor search against the naive method and print the real relative error, so we can test with different values of tau and see the difference.
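As a sketch of the kind of comparison that flag could do (the exact error definition is an assumption; here it is the average relative error of the returned distances against the exact ones):

```c++
#include <algorithm>  // For std::max.
#include <armadillo>

// Average relative error of approximate k-NN distances with respect to the
// exact (naive) distances; both matrices are k x numQueries.
double AverageRelativeError(const arma::mat& approxDistances,
                            const arma::mat& exactDistances)
{
  double total = 0.0;
  for (arma::uword i = 0; i < exactDistances.n_elem; ++i)
    total += (approxDistances[i] - exactDistances[i]) /
             std::max(exactDistances[i], 1e-10);  // Guard against zero distances.
  return total / exactDistances.n_elem;
}
```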

Any feedback is welcome! :)

Thanks

@MarcosPividori MarcosPividori self-assigned this Aug 2, 2016
@sumedhghaisas
Member

Hey Marcos,

I will look at your code tonight. I will also be available online so we can
discuss it. Will you be free tonight?

Regards,
Sumedh Ghaisas


@MarcosPividori
Contributor Author

Hi Sumedh,
Yes, perfect. I will be online.
Thanks,
Marcos
