Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add an ispermitted for Alphabets? #219

Open
TransGirlCodes opened this issue Jan 20, 2022 · 3 comments
Open

add an ispermitted for Alphabets? #219

TransGirlCodes opened this issue Jan 20, 2022 · 3 comments

Comments

@TransGirlCodes
Copy link
Member

Can we add an ispermitted method for Alphabets?

So the background context for this is I'm doing some of the internals for Kmers.jl - iterators being the last thing to move over and make nice. An recall when kmers only supported 2 bit alphabets, iterating over kmers for the generic method where a Kmer and the sequence kmers are generated from having differing alphabets, then gaps or anything that is not in the 2 bit alphabet, would be skipped over. In constructors it would throw an error - makes sense - the user is asking to make a kmer from something very specific, whereas we want the iterator to just drop the weird symbols and keep going. With the encode and decode methods as they are, they throw, but for my purpose of writing Kmer.jl internals, it would be really useful to have encode and decode
basically work like encode(::A, x) = ispermitted(A(), x) ? unsafe_encode(A(), x) : throw_n_stuff, so I can use the ispermitted and unsafe_encode methods manually and have the iterator do the right thing rather than cause a throw.

@CiaranOMara
Copy link
Member

I accept that there could be weird symbols. However, I'm not sure I understand the encoding and decoding of a weird symbol, the necessary constraints that would allow valid symbols to continue functioning, and what KMER is constructed from in your scenario. I think there are two construction scenarios, one from a structure with bit packing and one without (Vector).

I think there are more constraints to consider when there is bit packing. For example, would there be a constraint on unsafe_encode such that it must encode with the same number of bits? Though I assume, there are not necessarily enough bits to represent all the uniquely weird symbols that may be encountered. So I think this assumption pushes us towards having something in BioSequences that represents the presence of an invalid symbol, which would drop information. What do you think?

In terms of the skipping over that you spoke of, does that mean that the next valid symbol gets included in the KMER?

@jakobnissen
Copy link
Member

So if I understand it correctly, you want a tryencode(::Alphabet, x)::Union{eltype(A), nothing}? That sounds reasonable.
Should the title of this issue be changed?

jakobnissen added a commit to jakobnissen/BioSequences.jl that referenced this issue Jun 16, 2022
The encode and decode methods are not allowed to produce invalid data. Instead,
they throw an error when encountering invalid input data.
This can lead to some frustration when checking if a symbol is permitted in an
alphabet.
One way to solve it is by checking `symbol in symbols(A)`, but this is not
particularly effective.

This PR adds a tryencode and trydecode method to existing alphabets. These
methods return nothing when given invalid data.
Methods encode and decode now internally call their try-variants.

May solve BioJulia#219
@jakobnissen
Copy link
Member

If what you are literally asking is a way to check if some symbol is allowed in an Alphabet, this can already be done with x in symbols(A). This is not optimised, though, so this method could be possible if tryencode is not sufficient (although it really should be!)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants