New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
approaching 1.0 #4
Comments
A nice description of the "simple accelerant" is in Ben Langmead's notes here. While this doesn't guarantee O(p + log n), you usually get about this in practice. Further, since this approach doesn't require building and storing the LCP table, it's a bit more lightweight than the approach that's guaranteed to give you the O(p + log n). Ben also provides some code demonstrating the simple accelerant search here. |
@rob-p Ooo, excellent! Can't wait to read those links. Thank you so much for sharing. :-) |
Did you consider making the implementation type-generic for 1.0? If one wants to search e.g. at a token level rather than the character level, it'd make more sense to use a perfect hash automaton and an array of integers/unsigneds as the data array. I think the readme hints at trying this before, though. |
@danieldk The README hints at something called a generalized suffix array, which is a suffix array that can store multiple strings. As for making a suffix table type generic, no, I haven't considered that. It's not even clear to me that it would work at all. In any case, now isn't the time to design an entirely new API for 1.0. Which means we should take one of two paths:
As far as I'm concerned, we've been in state (2) for quite some time and I don't see that changing in the immediate future, hence (1). |
@BurntSushi, one recommendation I would make for the generalized suffix array is to replace the idea of keeping a table of offsets and doing a binary search with that of keeping a bitvector (or a compressed bit vector) and doing a rank operation. Basically, you build a generalized string, by concatenating together all strings with a separator (e.g. |
Done in 70fcea1 |
This crate enjoys a small API and has had some minor breaking changes over its lifetime, but given its modest functionality, I don't foresee a major API refactoring in its near future. Therefore, I'd like to move this to 1.0.
When I initially built this crate, I did have a grander vision for building a more complete suffix array implementation. In particular, while this crate provides a nearly optimal construction algorithm implementation, it does not provide an optimal search algorithm implementation. Search should be implementable in
O(p+logn)
time (wherep ~ len(query)
andn ~ len(haystack)
), but is currentlyO(p*logn)
. Improving the bound requires some sophisticated implementation tricks that have proved difficult to extract from the literature. I offered a bounty on a StackOverflow question and got an answer, but I haven't digested it yet. Nevertheless, a plain suffix table is plenty useful in its own right, so a 1.0 release shouldn't be blocked on further improvements even if it requires a rethink of the public API.The text was updated successfully, but these errors were encountered: