add an `ispermitted` for Alphabets? #219

TransGirlCodes · 2022-01-20T03:36:20Z

Can we add an ispermitted method for Alphabets?

The background is that I'm working on some of the internals for Kmers.jl, with iterators being the last thing to move over and make nice. As I recall, when kmers only supported 2-bit alphabets, the generic iteration method handled the case where the Kmer and the sequence the kmers are generated from have differing alphabets by skipping over gaps or anything else not in the 2-bit alphabet. Constructors would throw an error instead, which makes sense: the user is asking to make a kmer from something very specific, whereas we want the iterator to just drop the weird symbols and keep going. The encode and decode methods as they are throw on invalid input. For my purpose of writing Kmers.jl internals, it would be really useful to have encode and decode basically work like `encode(::A, x) = ispermitted(A(), x) ? unsafe_encode(A(), x) : throw_n_stuff`, so I can use the ispermitted and unsafe_encode methods manually and have the iterator do the right thing rather than throw.
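To make the request concrete, here is a minimal sketch of that pattern on a toy 2-bit alphabet. Everything in it is an assumption for illustration: `ToyDNAAlphabet`, `encode_permitted`, and these definitions of `ispermitted` and `unsafe_encode` are hypothetical and are not the BioSequences.jl implementation.

```julia
# Toy stand-in for a 2-bit alphabet; NOT the BioSequences.jl types.
struct ToyDNAAlphabet end

# Hypothetical check: can this symbol be represented in the alphabet?
ispermitted(::ToyDNAAlphabet, x::Char) = x in ('A', 'C', 'G', 'T')

# Hypothetical unchecked encoder. The caller must check ispermitted first;
# any symbol other than A, C or G is silently encoded as 0x03.
function unsafe_encode(::ToyDNAAlphabet, x::Char)
    return x == 'A' ? 0x00 : x == 'C' ? 0x01 : x == 'G' ? 0x02 : 0x03
end

# Checked encoder composed from the two pieces: throws on invalid input.
function encode(A::ToyDNAAlphabet, x::Char)
    ispermitted(A, x) || throw(ArgumentError("symbol $(repr(x)) is not permitted in $(typeof(A))"))
    return unsafe_encode(A, x)
end

# A consumer that wants to skip rather than throw uses the pieces directly.
function encode_permitted(A::ToyDNAAlphabet, seq::AbstractString)
    out = UInt8[]
    for x in seq
        ispermitted(A, x) && push!(out, unsafe_encode(A, x))
    end
    return out
end
```

With these definitions, `encode(ToyDNAAlphabet(), 'N')` throws, while `encode_permitted(ToyDNAAlphabet(), "AC-GNT")` drops the gap and the `N`. The sketch only filters symbols; it does not model how a kmer window should behave around a dropped symbol.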


CiaranOMara · 2022-06-13T02:11:33Z

I accept that there could be weird symbols. However, I'm not sure I understand the encoding and decoding of a weird symbol, the constraints needed for valid symbols to continue functioning, or what the kmer is constructed from in your scenario. I think there are two construction scenarios: one from a structure with bit packing and one without (Vector).

I think there are more constraints to consider when there is bit packing. For example, would there be a constraint on unsafe_encode such that it must encode with the same number of bits? I assume, though, that there are not necessarily enough bits to represent all the distinct weird symbols that may be encountered. That assumption pushes us towards having something in BioSequences that represents the presence of an invalid symbol, which would drop information. What do you think?

In terms of the skipping over that you spoke of, does that mean that the next valid symbol gets included in the KMER?

jakobnissen · 2022-06-16T08:26:12Z

So if I understand it correctly, you want a `tryencode(::Alphabet, x)::Union{eltype(A), Nothing}`? That sounds reasonable.
Should the title of this issue be changed?

The encode and decode methods are not allowed to produce invalid data. Instead, they throw an error when encountering invalid input data. This can lead to some frustration when checking if a symbol is permitted in an alphabet. One way to solve it is by checking `symbol in symbols(A)`, but this is not particularly effective. This PR adds a tryencode and trydecode method to existing alphabets. These methods return nothing when given invalid data. Methods encode and decode now internally call their try-variants. May solve BioJulia#219
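As an illustration of the shape described here, a minimal sketch on a toy alphabet follows. The type `ToyDNAAlphabet` and the method bodies are assumptions made for the example; they are not the code from the PR or from BioSequences.jl.

```julia
# Toy stand-in alphabet; NOT the BioSequences.jl types.
struct ToyDNAAlphabet end

# Hypothetical tryencode: the encoded value, or nothing for invalid input.
function tryencode(::ToyDNAAlphabet, x::Char)
    i = findfirst(==(x), "ACGT")
    return i === nothing ? nothing : UInt8(i - 1)
end

# Hypothetical trydecode: the symbol, or nothing for invalid encoded data.
function trydecode(::ToyDNAAlphabet, y::UInt8)
    return y < 0x04 ? "ACGT"[y + 1] : nothing
end

# The throwing variants call the try-variants internally.
function encode(A::ToyDNAAlphabet, x::Char)
    y = tryencode(A, x)
    y === nothing && throw(ArgumentError("cannot encode $(repr(x)) in $(typeof(A))"))
    return y
end

function decode(A::ToyDNAAlphabet, y::UInt8)
    x = trydecode(A, y)
    x === nothing && throw(ArgumentError("cannot decode $(repr(y)) in $(typeof(A))"))
    return x
end
```

A caller that only wants to know whether a symbol is permitted can then write `tryencode(A, x) !== nothing` instead of scanning the symbol list.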

jakobnissen · 2022-06-16T09:49:11Z

If what you are literally asking is a way to check if some symbol is allowed in an Alphabet, this can already be done with x in symbols(A). This is not optimised, though, so this method could be possible if tryencode is not sufficient (although it really should be!)

jakobnissen mentioned this issue Jun 16, 2022

Add tryencode and trydecode #238

Open

9 tasks
