Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

approaching 1.0 #4

Closed
BurntSushi opened this issue Aug 1, 2016 · 6 comments
Closed

approaching 1.0 #4

BurntSushi opened this issue Aug 1, 2016 · 6 comments

Comments

@BurntSushi
Copy link
Owner

This crate enjoys a small API and has had some minor breaking changes over its lifetime, but given its modest functionality, I don't foresee a major API refactoring in its near future. Therefore, I'd like to move this to 1.0.

When I initially built this crate, I did have a grander vision for building a more complete suffix array implementation. In particular, while this crate provides a nearly optimal construction algorithm implementation, it does not provide an optimal search algorithm implementation. Search should be implementable in O(p+logn) time (where p ~ len(query) and n ~ len(haystack)), but is currently O(p*logn). Improving the bound requires some sophisticated implementation tricks that have proved difficult to extract from the literature. I offered a bounty on a StackOverflow question and got an answer, but I haven't digested it yet. Nevertheless, a plain suffix table is plenty useful in its own right, so a 1.0 release shouldn't be blocked on further improvements even if it requires a rethink of the public API.

@rob-p
Copy link

rob-p commented Aug 1, 2016

A nice description of the "simple accelerant" is in Ben Langmead's notes here. While this doesn't guarantee O(p + log n), you usually get about this in practice. Further, since this approach doesn't require building and storing the LCP table, it's a bit more lightweight than the approach that's guaranteed to give you the O(p + log n). Ben also provides some code demonstrating the simple accelerant search here.

@BurntSushi
Copy link
Owner Author

@rob-p Ooo, excellent! Can't wait to read those links. Thank you so much for sharing. :-)

@danieldk
Copy link
Contributor

Did you consider making the implementation type-generic for 1.0? If one wants to search e.g. at a token level rather than the character level, it'd make more sense to use a perfect hash automaton and an array of integers/unsigneds as the data array.

I think the readme hints at trying this before, though.

@BurntSushi
Copy link
Owner Author

@danieldk The README hints at something called a generalized suffix array, which is a suffix array that can store multiple strings. As for making a suffix table type generic, no, I haven't considered that. It's not even clear to me that it would work at all.

In any case, now isn't the time to design an entirely new API for 1.0. Which means we should take one of two paths:

  1. Bring the current API to 1.0 (modulo small changes) and reserve experimentation for a 2.0 (or another crate).
  2. Drop the push to 1.0 until we can do more experimentation.

As far as I'm concerned, we've been in state (2) for quite some time and I don't see that changing in the immediate future, hence (1).

@rob-p
Copy link

rob-p commented Aug 15, 2016

@BurntSushi, one recommendation I would make for the generalized suffix array is to replace the idea of keeping a table of offsets and doing a binary search with that of keeping a bitvector (or a compressed bit vector) and doing a rank operation. Basically, you build a generalized string, by concatenating together all strings with a separator (e.g. $), and keep a bitvector with the same number of characters as the generalized string. The bit vector contains a 0 in every position with a normal character and a 1 in every position with the separator. There exist methods to compute the rank of any position in the bitvector (the cumulative number of 1s up through a given position) in constant time. This is what we do in RapMap, and it is both theoretically (rank is O(1)) and practically fast. For any position in the generalized suffix array, a rank of the corresponding position in the bit array immediately tells you which string you're in. I don't know the status of good bitvector rank implementations in Rust, but I'd consider going this way for efficiency purposes.

@BurntSushi
Copy link
Owner Author

Done in 70fcea1

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants