One-hot encoding of sequences #130

Closed
cossio opened this issue Feb 17, 2021 · 2 comments

Comments

@cossio
Contributor

cossio commented Feb 17, 2021

This is a feature request.

Machine learning models of biological sequences often represent sequences with a one-hot encoding. I think it would be nice to add support for one-hot encoding and decoding of biological sequences to this package.

@jakobnissen
Member

Hello @cossio

I think that's outside the scope of BioSequences. One-hot encoding pertains to the specific machine learning model that uses DNA sequences, not to the sequences themselves. There is no way we can predict what kind of embedding or transformation machine learning people will want to apply to biological sequences, so that is better left to the people who need it. For example: should it include Ns or other ambiguous nucleotides? Should it be a bit matrix or an integer matrix? Should amino acids use a reduced alphabet?

@jakobnissen
Member

jakobnissen commented Feb 17, 2021

I'm not aware of any Julia package using machine learning on biological sequences. But it's fairly easy to add a function that one-hot encodes sequences. Here's one:
onehot(s::BioSequence{A}) where A = reduce(vcat, [reshape(s .== i, 1, :) for i in A()])
Not the most efficient, but probably good enough for most use cases.
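
For readers who also want the decoding direction mentioned in the request, here is a minimal sketch in plain Base Julia. It is not part of BioSequences and does not use its types: the names `onehot`, `onehot_decode`, the `alphabet` argument, and the `unknown` keyword are all hypothetical, and the sequence is assumed to be an ordinary string. It picks one answer to the questions raised above (a bit matrix, with symbols outside the alphabet such as `N` encoded as an all-false column); other models may want different choices.

```julia
# Hypothetical helpers in plain Base Julia; no BioSequences dependency.

# Encode `seq` against `alphabet`: one row per symbol, one column per position.
# Symbols not in `alphabet` (e.g. 'N') give an all-false column. This is an
# assumption made for illustration, not a convention of any package.
function onehot(seq::AbstractString, alphabet::AbstractString = "ACGT")
    symbols = collect(alphabet)
    chars = collect(seq)
    m = falses(length(symbols), length(chars))  # BitMatrix, initially all false
    for (j, c) in enumerate(chars)
        i = findfirst(==(c), symbols)
        i === nothing || (m[i, j] = true)
    end
    return m
end

# Decode a Bool matrix back to a string. Columns with no bit set become `unknown`.
function onehot_decode(m::AbstractMatrix{Bool}, alphabet::AbstractString = "ACGT";
                       unknown::Char = 'N')
    symbols = collect(alphabet)
    size(m, 1) == length(symbols) ||
        throw(ArgumentError("row count must match alphabet size"))
    out = Vector{Char}(undef, size(m, 2))
    for j in axes(m, 2)
        i = findfirst(m[:, j])  # index of the first set bit, or nothing
        out[j] = i === nothing ? unknown : symbols[i]
    end
    return String(out)
end

m = onehot("ACGTN")       # 4×5 BitMatrix; the last column is all false
s = onehot_decode(m)      # round-trips to "ACGTN"
```

Passing a different `alphabet` string (for example one that includes `N` as its own row) changes the matrix shape without touching the rest of the code, which is roughly why this kind of choice is easier to make in the model code than in a general sequence package.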
