Roadmap / TODO #6

rob-p · 2021-07-26T19:33:59Z

This issue will provide a roadmap for the library, along with specific tasks (TODOs). Ideally we should break these into short-term and long-term tasks and, as the library becomes more mature, tie individual tasks to specific release candidates.

d-cameron · 2021-10-10T05:32:46Z

It would be great if we defined the scope of the library. Specifically:

Is this a genomics kmer library, or a generic string kmer library?

If it's a genomics library, what's in scope:

2-bit compression of sequences, or just kmers?
Support for ambiguous bases/4-bit encoding scheme?
de Bruijn graphs?
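To make the 4-bit option concrete, here is a minimal sketch of one common scheme: each base gets one bit, and an IUPAC ambiguity code is the bitwise OR of the bases it can stand for. The function names `iupac_mask` and `compatible` are hypothetical and not part of any existing crate; this only illustrates what supporting ambiguous bases might involve.

```rust
/// Map an ASCII IUPAC nucleotide code to a 4-bit mask (A=1, C=2, G=4, T=8).
/// Ambiguity codes are the OR of the bases they can represent.
/// Hypothetical helper for illustration only.
fn iupac_mask(b: u8) -> Option<u8> {
    const A: u8 = 1;
    const C: u8 = 2;
    const G: u8 = 4;
    const T: u8 = 8;
    Some(match b.to_ascii_uppercase() {
        b'A' => A,
        b'C' => C,
        b'G' => G,
        b'T' | b'U' => T,
        b'R' => A | G,
        b'Y' => C | T,
        b'S' => C | G,
        b'W' => A | T,
        b'K' => G | T,
        b'M' => A | C,
        b'B' => C | G | T,
        b'D' => A | G | T,
        b'H' => A | C | T,
        b'V' => A | C | G,
        b'N' => A | C | G | T,
        _ => return None,
    })
}

/// Two codes are compatible if they share at least one possible base.
/// Returns None if either byte is not an IUPAC nucleotide code.
fn compatible(x: u8, y: u8) -> Option<bool> {
    Some((iupac_mask(x)? & iupac_mask(y)?) != 0)
}
```

With this layout a single AND answers "could these two positions be the same base?", at the cost of twice the storage of a 2-bit encoding.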

I've been implementing a Rust OLC assembler and I've found that there are a lot of 2-bit sequence functions I need that aren't in other Rust libraries (such as the 10X Genomics debruijn library). They're not kmer-based functions per se, but they can generally be decomposed into kmer-based ones (e.g. Hamming distance between sequences).
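As a rough illustration of that decomposition, the sketch below computes the Hamming distance between 2-bit packed sequences one 64-bit word (up to 32 bases) at a time and sums the results. The encoding (A=0, C=1, G=2, T=3, first base in the lowest bits) and all function names are assumptions made for this example, not the API of any existing crate.

```rust
/// Encode an ASCII base as 2 bits (assumed mapping: A=0, C=1, G=2, T=3).
fn encode_base(b: u8) -> Option<u64> {
    match b {
        b'A' | b'a' => Some(0),
        b'C' | b'c' => Some(1),
        b'G' | b'g' => Some(2),
        b'T' | b't' => Some(3),
        _ => None,
    }
}

/// Pack up to 32 bases into one u64, first base in the lowest two bits.
/// Unused high bits stay zero.
fn pack(seq: &[u8]) -> Option<u64> {
    if seq.len() > 32 {
        return None;
    }
    let mut word = 0u64;
    for (i, &b) in seq.iter().enumerate() {
        word |= encode_base(b)? << (2 * i);
    }
    Some(word)
}

/// Hamming distance between two packed words holding the same number of bases.
fn hamming_2bit(a: u64, b: u64) -> u32 {
    let x = a ^ b;
    // A base differs if either of its two bits differs: fold the high bit of
    // each pair onto the low bit, keep only the low bits, then count them.
    ((x | (x >> 1)) & 0x5555_5555_5555_5555).count_ones()
}

/// Hamming distance between two equal-length packed sequences:
/// the sum of the per-word distances.
fn hamming_packed(a: &[u64], b: &[u64]) -> u32 {
    a.iter().zip(b).map(|(&x, &y)| hamming_2bit(x, y)).sum()
}
```

Because padding bits are zero in both operands, they never contribute to the count, so the last partially filled word needs no special handling as long as both sequences have the same length.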

rob-p · 2021-10-10T14:32:43Z

Hi @d-cameron,

Thanks for bringing these up. I think it's a great point. I am certainly not envisioning this as a general string k-mer library. However, I would like to get input from others on whether we should support something in addition to the standard DNA alphabet. Specifically, I think there could be legitimate uses for a code path that supports, e.g., a protein alphabet.

The use cases I am most interested in, however, are in the standard 4-nucleotide alphabet. Regarding the encoding scheme, @Daniel-Liu-c0deb0t raised the issue in #3, and there was a bit of discussion of the relative merits of different schemes. I'd certainly be interested in any input you have on this.

Finally, while I intend for the focus of this library to be efficient k-mer creation, storage, manipulation and processing, I am absolutely open to having relevant functionality incorporated as either part of this library or as part of a sister crate.

--Rob

d-cameron · 2021-10-10T17:52:43Z

> I intend for the focus of this library to be efficient k-mer creation

rust-debruijn appears to have a similar scope with specialised structs for small(ish) kmers. One consequence of this is that they have a 2-bit encoding for genomic sequences to enable efficient kmer extraction (e.g. sequence.kmer(offset)). Unfortunately, since it's a de Bruijn graph targeted crate, there's not a lot of support for doing stuff on these sequences other than extracting kmers.
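For readers unfamiliar with the idea, here is a minimal sketch of what kmer extraction from a 2-bit packed sequence can look like. It is not rust-debruijn's implementation or API: the `PackedSeq` type, its methods, and the bit layout (A=0, C=1, G=2, T=3, 32 bases per `u64`, first base in the lowest bits) are all assumptions chosen for illustration.

```rust
/// Hypothetical 2-bit packed DNA sequence: 32 bases per u64 word.
struct PackedSeq {
    words: Vec<u64>,
    len: usize, // number of bases
}

impl PackedSeq {
    /// Pack an ASCII sequence; returns None on any non-ACGT byte.
    fn from_ascii(seq: &[u8]) -> Option<Self> {
        let mut words = vec![0u64; (seq.len() + 31) / 32];
        for (i, &b) in seq.iter().enumerate() {
            let code: u64 = match b {
                b'A' | b'a' => 0,
                b'C' | b'c' => 1,
                b'G' | b'g' => 2,
                b'T' | b't' => 3,
                _ => return None,
            };
            words[i / 32] |= code << (2 * (i % 32));
        }
        Some(PackedSeq { words, len: seq.len() })
    }

    /// Extract the k-mer (1 <= k <= 32) starting at `offset` as a packed u64,
    /// with the first base of the k-mer in the lowest two bits.
    fn kmer(&self, offset: usize, k: usize) -> Option<u64> {
        if k == 0 || k > 32 || offset.checked_add(k)? > self.len {
            return None;
        }
        let word = offset / 32;
        let shift = 2 * (offset % 32);
        let mut bits = self.words[word] >> shift;
        // If the window is not word-aligned, pull in the low bits of the
        // next word to fill the vacated high bits.
        if shift != 0 && word + 1 < self.words.len() {
            bits |= self.words[word + 1] << (64 - shift);
        }
        let mask = if k == 32 { u64::MAX } else { (1u64 << (2 * k)) - 1 };
        Some(bits & mask)
    }
}
```

Extraction costs a couple of shifts and a mask regardless of where the window starts, which is the main appeal of keeping the whole sequence packed rather than re-encoding each kmer from ASCII.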

If this library wants to take a similar approach, that's fine, but if it does, it would be great if it supported/integrated with sequences encoded by crates that have more comprehensive feature sets. I believe support for the various encodings of [u8], [u64] slices should be sufficient.

rob-p pinned this issue Jul 26, 2021

rob-p added the documentation and enhancement labels Jul 26, 2021
