Roadmap / TODO #6

Open
rob-p opened this issue Jul 26, 2021 · 3 comments
Labels
documentation Improvements or additions to documentation enhancement New feature or request

Comments

@rob-p
Contributor

rob-p commented Jul 26, 2021

This issue will provide a roadmap for the library, along with specific tasks (TODOs). Ideally we should break these tasks into short and long term tasks and, as the library becomes more mature, tie individual tasks to specific release candidates.

@rob-p rob-p pinned this issue Jul 26, 2021
@rob-p rob-p added documentation Improvements or additions to documentation enhancement New feature or request labels Jul 26, 2021
@d-cameron

It would be great if we defined the scope of the library. Specifically:

  • Is this a genomics kmer library, or a generic string kmer library?

If it's a genomics library, what's in scope:

  • 2-bit compression of sequences, or just kmers?
  • Support for ambiguous bases/4-bit encoding scheme?
  • de Bruijn graphs?

I've been implementing a Rust OLC assembler and I've found that there are a whole lot of 2-bit sequence functions that I need that aren't in other Rust libraries (such as the 10X Genomics debruijn library). They're not kmer-based functions per se, but they are generally decomposable into kmer-based ones (e.g. Hamming distance between sequences).
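To illustrate the kind of 2-bit sequence function meant here, below is a minimal sketch of a Hamming distance over packed sequences that works one 32-base word at a time. The names `pack_2bit` and `hamming_2bit`, the A=0/C=1/G=2/T=3 codes, and the "first base in the lowest bits" layout are all assumptions made for this example; they are not taken from this library or from any other crate.

```rust
/// Pack an ASCII DNA string into 2 bits per base, 32 bases per u64 word,
/// with the first base in the lowest bits of the first word.
/// Hypothetical encoding for illustration: A=0, C=1, G=2, T=3.
/// Returns None if any byte is not one of A, C, G, T (either case).
fn pack_2bit(seq: &[u8]) -> Option<Vec<u64>> {
    let mut words = vec![0u64; (seq.len() + 31) / 32];
    for (i, &b) in seq.iter().enumerate() {
        let code: u64 = match b {
            b'A' | b'a' => 0,
            b'C' | b'c' => 1,
            b'G' | b'g' => 2,
            b'T' | b't' => 3,
            _ => return None,
        };
        words[i / 32] |= code << (2 * (i % 32));
    }
    Some(words)
}

/// Hamming distance between two packed sequences that both hold `len` bases.
/// Panics if either slice has fewer than ceil(len / 32) words.
fn hamming_2bit(a: &[u64], b: &[u64], len: usize) -> u32 {
    // Mask selecting the low bit of every 2-bit base slot.
    const LOW_BITS: u64 = 0x5555_5555_5555_5555;
    let n_words = (len + 31) / 32;
    let mut dist = 0;
    for w in 0..n_words {
        let x = a[w] ^ b[w];
        // A base differs if either of its two bits differs, so fold the
        // high bit of each slot onto the low bit and keep only low bits.
        let mut diff = (x | (x >> 1)) & LOW_BITS;
        // Ignore unused slots in the final, partially filled word.
        let used = len - w * 32;
        if used < 32 {
            diff &= (1u64 << (2 * used)) - 1;
        }
        dist += diff.count_ones();
    }
    dist
}
```

For example, packing `ACGTACGT` and `ACGAACGG` with `pack_2bit` and calling `hamming_2bit(&a, &b, 8)` compares all eight bases with a single XOR and popcount.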

@rob-p
Contributor Author

rob-p commented Oct 10, 2021

Hi @d-cameron,

Thanks for bringing these up. I think it's a great point. I certainly am not envisioning this as a general string k-mer library. However, I would like to get input from others on whether we should support something in addition to the standard DNA alphabet. Specifically, I think there could be legitimate uses for having a code path that supports e.g. a protein alphabet.

The use cases I am most interested in, however, are in the standard 4-nucleotide alphabet. Regarding the encoding scheme, @Daniel-Liu-c0deb0t raised the issue in #3, and there was a bit of discussion of the relative merits of different schemes. I'd certainly be interested in any input you have on this.

Finally, while I intend for the focus of this library to be efficient k-mer creation, storage, manipulation and processing, I am absolutely open to having relevant functionality incorporated as either part of this library or as part of a sister crate.

--Rob

@d-cameron

I intend for the focus of this library to be efficient k-mer creation

rust-debruijn appears to have a similar scope with specialised structs for small(ish) kmers. One consequence of this is that they have a 2-bit encoding for genomic sequences to enable efficient kmer extraction (e.g. sequence.kmer(offset)). Unfortunately, since it's a de Bruijn graph targeted crate, there's not a lot of support for doing stuff on these sequences other than extracting kmers.
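As a rough sketch of why a 2-bit sequence encoding makes kmer extraction at an offset cheap, the function below pulls a kmer of up to 32 bases out of a packed sequence with a couple of shifts and a mask. This is not the rust-debruijn implementation: the name `kmer_at`, its signature, and the layout (32 bases per `u64`, first base in the lowest bits, A=0, C=1, G=2, T=3) are assumptions made for illustration only.

```rust
/// Extract the kmer of `k` bases starting at base `offset` from a 2-bit
/// packed sequence holding `len` bases (hypothetical layout: 32 bases per
/// u64, first base in the lowest bits). The kmer is returned in the low
/// 2*k bits of a u64, first base lowest.
/// Returns None if k is not in 1..=32, if the kmer would run past `len`,
/// or if `words` is too short to hold `len` bases.
fn kmer_at(words: &[u64], len: usize, offset: usize, k: usize) -> Option<u64> {
    if k == 0 || k > 32 {
        return None;
    }
    if offset.checked_add(k)? > len || words.len() < (len + 31) / 32 {
        return None;
    }
    let word = offset / 32;
    let slot = offset % 32;
    let shift = 2 * slot;
    // Bases available from the first word, moved down to bit 0.
    let mut kmer = words[word] >> shift;
    // If the kmer straddles a word boundary, take the rest from the next word.
    if slot + k > 32 {
        kmer |= words[word + 1] << (64 - shift);
    }
    // Drop any bases beyond the requested k.
    if k < 32 {
        kmer &= (1u64 << (2 * k)) - 1;
    }
    Some(kmer)
}
```

Under this layout the sequence `ACGT` packs to the single word `0xE4`, and `kmer_at(&[0xE4], 4, 1, 2)` yields the kmer `CG` as `Some(9)` (C=1 in bits 0-1, G=2 in bits 2-3).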

If this library wants to take a similar approach, that's fine, but if it does, it would be great if it supported/integrated with sequences encoded by crates that have more comprehensive feature sets. I believe support for the various encodings of [u8], [u64] slices should be sufficient.
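One possible shape for that kind of interoperability, sketched under stated assumptions: if a byte-oriented crate stores 4 bases per `u8` and a word-oriented one stores 32 bases per `u64`, both with the first base in the lowest bits, then converting between them is just a little-endian reinterpretation. The function name `bytes_to_words` and both layouts are hypothetical; real crates may order bases or bits differently, in which case a per-byte or per-word shuffle would be needed as well.

```rust
/// Repack 2-bit bases stored 4 per byte into 32 per u64 word.
/// Assumes (hypothetically) that both layouts put the first base in the
/// lowest bits, so byte i of the input becomes bits 8*i..8*i+8 of a word.
/// A trailing partial chunk is zero-padded.
fn bytes_to_words(bytes: &[u8]) -> Vec<u64> {
    bytes
        .chunks(8)
        .map(|chunk| {
            let mut buf = [0u8; 8];
            buf[..chunk.len()].copy_from_slice(chunk);
            u64::from_le_bytes(buf)
        })
        .collect()
}
```

Because the conversion does not depend on the number of bases, the caller still has to carry the sequence length alongside the packed data.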
