Spill Trees implementation. #728
Comments
I took a look through the document; thanks for writing it up. I agree with your proof and the results. (Also as a side note if you don't want You are right that the tighter bounds and I also think that it's not necessarily reasonable to use But this would mean that
Still, I think that approach (which maps to the second option in your paper) is going to be the option that produces faster results with only slightly worse results. Some comments from IRC:
I agree with this. If we say we are implementing spill trees, and there is only one well-known way to do that, we should try and stick with the original as much as is reasonably possible. For something like the kd-trees themselves, we already deviate from Bentley's original formulation by having only leaves hold points, but there are so many variations of kd-trees that I think it is okay. But I think the case is different with spill trees.
I agree. I thought about this for a while. Although we can show in the proof that you gave that we inspect every reference point within tau/2 of the query, and with the original spill tree, we inspect every reference point within tau of the query, we have no guarantees on what happens when the nearest neighbor is further than tau from the query. (Like if the query point is very far from any of the reference points.) It might be interesting to investigate that point, but I am not sure if it is worth the time, because I think any difference is probably going to be marginal at best.
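To make the tau guarantee concrete, here is a minimal sketch of a single axis-orthogonal split on scalar values, assuming the original spill tree rule of a buffer of width tau on each side of the split value. All names (`SpillSplit`, `SplitWithOverlap`, `ChildForQuery`, `InspectsAllWithin`) are hypothetical and are not mlpack API; this only illustrates one split, not a full tree.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical container for the two (possibly overlapping) children.
struct SpillSplit
{
  std::vector<double> left;
  std::vector<double> right;
};

// Assumed rule: a point goes left if it is at most tau past the split value,
// and right if it is at most tau before it. Points in the buffer go to both.
SpillSplit SplitWithOverlap(const std::vector<double>& points,
                            const double splitValue,
                            const double tau)
{
  assert(tau >= 0.0);
  SpillSplit split;
  for (const double p : points)
  {
    if (p <= splitValue + tau) split.left.push_back(p);
    if (p >= splitValue - tau) split.right.push_back(p);
  }
  return split;
}

// Defeatist descent: visit only the child on the query's side, no backtracking.
const std::vector<double>& ChildForQuery(const SpillSplit& split,
                                         const double splitValue,
                                         const double query)
{
  return (query <= splitValue) ? split.left : split.right;
}

// Returns true if every point within `radius` of the query is in `inspected`.
bool InspectsAllWithin(const std::vector<double>& points,
                       const std::vector<double>& inspected,
                       const double query,
                       const double radius)
{
  for (const double p : points)
  {
    if (std::fabs(p - query) > radius) continue;
    if (std::find(inspected.begin(), inspected.end(), p) == inspected.end())
      return false;
  }
  return true;
}
```

With this rule, a query on the left side of the split sees every point within tau of it, because any such point is at most tau past the split value. Nothing in the sketch says anything about points farther than tau from the query, which is the open case mentioned above.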
That's true, but we must keep in mind that if we are benchmarking and comparing algorithms, it's not possible for us to say "spill trees are better than kd-trees for approximate nearest neighbor search", even on a single dataset. All we can say is "this implementation of spill trees is better than this implementation of kd-trees for approximate nearest neighbor search"; we can't separate implementation quality from our results. The best we can do is try to make both as fast as possible. :)
I should add, also, I see you commented yesterday you were going to start reimplementing the spill tree for the hyperplane split. So maybe you've already come up with better ideas than the ones I had on how to do it, but hopefully what I wrote is helpful nonetheless. Another thought I had is that I think (but am not sure) that it should be possible to build spill trees in such a way that you don't need to duplicate any points in the dataset or hold a set of "spill indices" in each leaf, by cleverly arranging the points in the dataset. Like for a left child you would hold the spill points at the end of the submatrix and for a right child you would hold the spill points at the beginning of the submatrix.
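A rough sketch of that layout idea for one split, assuming scalar points and the same tau buffer on each side of the split value. The names `SpillLayout` and `ArrangeForSpill` are made up for illustration and are not part of mlpack. The points are reordered in place as [left only | spill | right only], so each child is a contiguous range and the spill points are shared rather than copied.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical description of the two children as half-open index ranges
// into the same reordered array.
struct SpillLayout
{
  std::size_t leftBegin;
  std::size_t leftEnd;    // One past the last spill point.
  std::size_t rightBegin; // Index of the first spill point.
  std::size_t rightEnd;
};

SpillLayout ArrangeForSpill(std::vector<double>& points,
                            const double splitValue,
                            const double tau)
{
  assert(tau >= 0.0);

  // First pass: move the points that belong only to the left child to the
  // front of the array.
  const auto spillBegin = std::partition(points.begin(), points.end(),
      [&](const double p) { return p < splitValue - tau; });

  // Second pass: among the rest, move the spill points (inside the buffer)
  // ahead of the points that belong only to the right child.
  const auto spillEnd = std::partition(spillBegin, points.end(),
      [&](const double p) { return p <= splitValue + tau; });

  SpillLayout layout;
  layout.leftBegin = 0;
  layout.leftEnd = static_cast<std::size_t>(spillEnd - points.begin());
  layout.rightBegin = static_cast<std::size_t>(spillBegin - points.begin());
  layout.rightEnd = points.size();
  return layout;
}
```

The left child's range ends with the spill points and the right child's range starts with them, matching the arrangement described above. This only covers a single split; whether the layout can be kept consistent when both overlapping children are split again is not addressed by the sketch.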
Hi @rcurtin, I have been thinking about this. Here are some ideas that I still need to refine:
Thanks
I have implemented Spill trees with axis-orthogonal splitting hyperplanes. I have made an effort to avoid duplicating existing code for neighbor search. I created a new class Also, I have implemented a new version of
Single Tree Search: The
Dual Tree Search: The query tree is built without overlapping. When calculating the score of a query node and a reference node, I consider 2 cases:
The
The The extension was incorporated into the existing mlpack_knn. With the current implementation, we can use "-t spill" to select spill trees and "--tau 0.1" to set the overlap size (the default is tau=0). I have opened a pull request with this implementation: #747. If you agree, I plan to work on these topics over the next few days:
Any feedback is welcome! :) Thanks
Hey Marcos, I will look at your code tonight. I will also be available online so we can Regards, On 01-Aug-2016 10:46 PM, "MarcosPividori" notifications@github.com wrote:
Hi Sumedh,
Hi! @rcurtin @sumedhghaisas
I have been working on the implementation of spill tree in the branch: spill-trees
I have summarized what I have implemented in a pdf file: spilltrees.pdf
I would be grateful if you could let me know your opinion so I can continue with the implementation :)
Thanks!
Example:
As can be seen there, the overlapping is not the same for all the sections of the "decision boundary":
Only the points at a distance less than tau/2 from the decision boundary are guaranteed to be included in the left node.
p_r2 is included in the left node, but p_r1 is not included.
The query point p_q will traverse to the left and will consider the point p_r2, but the point p_r1 won't be considered.