approaching 1.0 #4

BurntSushi · 2016-08-01T02:07:57Z

This crate enjoys a small API and has had some minor breaking changes over its lifetime, but given its modest functionality, I don't foresee a major API refactoring in its near future. Therefore, I'd like to move this to 1.0.

When I initially built this crate, I did have a grander vision for building a more complete suffix array implementation. In particular, while this crate provides a nearly optimal construction algorithm implementation, it does not provide an optimal search algorithm implementation. Search should be implementable in O(p + log n) time (where p ~ len(query) and n ~ len(haystack)), but is currently O(p log n). Improving the bound requires some sophisticated implementation tricks that have proved difficult to extract from the literature. I offered a bounty on a StackOverflow question and got an answer, but I haven't digested it yet. Nevertheless, a plain suffix table is plenty useful in its own right, so a 1.0 release shouldn't be blocked on further improvements, even if those improvements end up requiring a rethink of the public API.
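
For readers unfamiliar with where the O(p log n) bound comes from, here is a minimal sketch of plain binary search over a suffix table. It is not this crate's implementation: `naive_suffix_table`, `prefix` and `find_range` are hypothetical names, and the table is built by naive sorting purely to keep the example self-contained. Each of the O(log n) probes compares up to p bytes from scratch, which gives the product in the bound.

```rust
/// Build a suffix table by sorting suffix start offsets.
/// Naive construction, for illustration only.
fn naive_suffix_table(text: &[u8]) -> Vec<usize> {
    let mut sa: Vec<usize> = (0..text.len()).collect();
    sa.sort_by(|&a, &b| text[a..].cmp(&text[b..]));
    sa
}

/// The first `len` bytes of the suffix starting at `start` (fewer if the suffix is shorter).
fn prefix(text: &[u8], start: usize, len: usize) -> &[u8] {
    let end = (start + len).min(text.len());
    &text[start..end]
}

/// Half-open range of indices into `sa` whose suffixes start with `query`.
/// Every probe re-compares up to `query.len()` bytes, hence O(p log n).
fn find_range(text: &[u8], sa: &[usize], query: &[u8]) -> (usize, usize) {
    let p = query.len();
    let start = sa.partition_point(|&i| prefix(text, i, p) < query);
    let end = sa.partition_point(|&i| prefix(text, i, p) <= query);
    (start, end)
}
```

The offsets `sa[start..end]` are then the positions in the haystack where the query occurs.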

rob-p · 2016-08-01T21:57:43Z

A nice description of the "simple accelerant" is in Ben Langmead's notes here. While this doesn't guarantee O(p + log n), you usually get about this in practice. Further, since this approach doesn't require building and storing the LCP table, it's a bit more lightweight than the approach that's guaranteed to give you the O(p + log n). Ben also provides some code demonstrating the simple accelerant search here.
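
A rough sketch of the idea, as I understand it from the description above (this is not the linked code, and `lower_bound_accel` is a made-up name): while binary searching, remember how many leading query bytes matched the suffixes just outside the current bounds, and skip the smaller of the two counts when comparing against the midpoint. Every suffix between the bounds is guaranteed to share at least that many bytes with the query.

```rust
/// Index of the first suffix in `sa` that is >= `query` (a suffix that has
/// `query` as a prefix counts as >=). `sa` must be the sorted suffix table of `text`.
fn lower_bound_accel(text: &[u8], sa: &[usize], query: &[u8]) -> usize {
    // Invariant: suffixes at indices < lo are < query; suffixes at indices >= hi are >= query.
    let (mut lo, mut hi) = (0usize, sa.len());
    // Query bytes known to match the suffix just below `lo` and the suffix at `hi`.
    // Zero is a safe value while a bound has not moved yet.
    let (mut lcp_lo, mut lcp_hi) = (0usize, 0usize);
    while lo < hi {
        let mid = lo + (hi - lo) / 2;
        let suffix = &text[sa[mid]..];
        // Skip the bytes that are already known to match.
        let mut k = lcp_lo.min(lcp_hi);
        while k < query.len() && k < suffix.len() && suffix[k] == query[k] {
            k += 1;
        }
        if k < query.len() && (k == suffix.len() || suffix[k] < query[k]) {
            // The suffix sorts before the query.
            lo = mid + 1;
            lcp_lo = k;
        } else {
            // The suffix sorts at or after the query.
            hi = mid;
            lcp_hi = k;
        }
    }
    lo
}
```

The worst case is still O(p log n), since the skip can be zero at every step, but on typical inputs most probes only look at a few new bytes.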

BurntSushi · 2016-08-01T22:15:07Z

@rob-p Ooo, excellent! Can't wait to read those links. Thank you so much for sharing. :-)

danieldk · 2016-08-13T12:08:58Z

Did you consider making the implementation type-generic for 1.0? If one wants to search e.g. at a token level rather than the character level, it'd make more sense to use a perfect hash automaton and an array of integers/unsigneds as the data array.

I think the readme hints at trying this before, though.

BurntSushi · 2016-08-15T10:39:18Z

@danieldk The README hints at something called a generalized suffix array, which is a suffix array that can store multiple strings. As for making a suffix table type generic, no, I haven't considered that. It's not even clear to me that it would work at all.

In any case, now isn't the time to design an entirely new API for 1.0. Which means we should take one of two paths:

1. Bring the current API to 1.0 (modulo small changes) and reserve experimentation for a 2.0 (or another crate).
2. Drop the push to 1.0 until we can do more experimentation.

As far as I'm concerned, we've been in state (2) for quite some time and I don't see that changing in the immediate future, hence (1).

rob-p · 2016-08-15T12:00:37Z

@BurntSushi, one recommendation I would make for the generalized suffix array is to replace the idea of keeping a table of offsets and doing a binary search with that of keeping a bitvector (or a compressed bit vector) and doing a rank operation. Basically, you build a generalized string, by concatenating together all strings with a separator (e.g. $), and keep a bitvector with the same number of characters as the generalized string. The bit vector contains a 0 in every position with a normal character and a 1 in every position with the separator. There exist methods to compute the rank of any position in the bitvector (the cumulative number of 1s up through a given position) in constant time. This is what we do in RapMap, and it is both theoretically (rank is O(1)) and practically fast. For any position in the generalized suffix array, a rank of the corresponding position in the bit array immediately tells you which string you're in. I don't know the status of good bitvector rank implementations in Rust, but I'd consider going this way for efficiency purposes.
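
To make the bitvector idea concrete, here is a small sketch under a few assumptions: `RankBits`, `from_text` and `rank1` are hypothetical names (not RapMap's code or an existing Rust crate), and the index stores one running count per 64-bit word, which is simpler but less compact than a proper succinct rank structure. Note that `rank1` here counts separators strictly before a position, so a separator is attributed to the string it terminates.

```rust
/// Bitvector with a 1 at every separator position, plus running popcounts.
struct RankBits {
    words: Vec<u64>,
    /// cum[i] = number of 1 bits in words[..i]; has words.len() + 1 entries.
    cum: Vec<usize>,
}

impl RankBits {
    /// Mark every occurrence of `sep` in the concatenated text.
    fn from_text(text: &[u8], sep: u8) -> RankBits {
        let mut words = vec![0u64; (text.len() + 63) / 64];
        for (i, &b) in text.iter().enumerate() {
            if b == sep {
                words[i / 64] |= 1u64 << (i % 64);
            }
        }
        let mut cum = Vec::with_capacity(words.len() + 1);
        let mut total = 0usize;
        cum.push(0);
        for w in &words {
            total += w.count_ones() as usize;
            cum.push(total);
        }
        RankBits { words, cum }
    }

    /// Number of separators at positions strictly before `pos` (pos <= text length).
    /// One table lookup plus one popcount, so constant time.
    fn rank1(&self, pos: usize) -> usize {
        let (w, b) = (pos / 64, pos % 64);
        let mut r = self.cum[w];
        if b > 0 {
            r += (self.words[w] & ((1u64 << b) - 1)).count_ones() as usize;
        }
        r
    }
}
```

With the generalized string `ab$cde$f$`, `rank1` of any suffix start offset gives the zero-based index of the string that suffix belongs to, with no binary search over an offsets table.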

BurntSushi · 2016-12-30T16:59:21Z

Done in 70fcea1

BurntSushi added the help wanted label Aug 1, 2016

BurntSushi closed this as completed Dec 30, 2016
