Check interfaces, remove NotImplemented errors #140

Closed
7 tasks done
jakobnissen opened this issue Mar 11, 2021 · 6 comments

@jakobnissen
Member

jakobnissen commented Mar 11, 2021

Consider errors such as this one:

"Get the length of a biological sequence."
@inline function Base.length(seq::BioSequence)
    error(
        string(
            "Base.length has not been defined for BioSequence type: ",
            typeof(seq),
            ". Any compatible concrete BioSequence subtype must have this method implemented."
        )
    )
end

This is no more useful than a MethodError, and it hampers static analysis by tools such as JET, which will probably be a reality before the end of this year. Much better to simply remove it. (Also see this blogpost.)
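A minimal sketch of the point, using only Base. The names `ToySequence`, `MySeq`, `seqlength` and `seqlength_with_fallback` are hypothetical stand-ins, not part of BioSequences. Without a throwing fallback, a missing method surfaces as an ordinary MethodError and `hasmethod` can report the gap without running anything; with the fallback, the gap is hidden.

```julia
# Hypothetical stand-ins for BioSequence and a concrete subtype.
abstract type ToySequence end
struct MySeq <: ToySequence end

# Generic function declared with no methods: no throwing fallback.
function seqlength end

# The missing implementation can be detected without calling the function.
has_impl = hasmethod(seqlength, Tuple{MySeq})

# Calling it anyway raises a plain MethodError.
err = try
    seqlength(MySeq())
catch e
    e
end

# With an error-throwing fallback, a method formally exists for every subtype,
# so the same query can no longer tell that MySeq lacks a real implementation.
seqlength_with_fallback(seq::ToySequence) = error("not implemented for $(typeof(seq))")
masked = hasmethod(seqlength_with_fallback, Tuple{MySeq})
```

Here `has_impl` is `false` and `masked` is `true`, which is the kind of information a static analyzer loses when the fallback exists.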

Instead, we should go through and verify we have covered and documented interfaces for:

  • A new BioSequence subtype
  • A new Alphabet subtype
  • An alphabet having the ASCIIAlphabet trait

Also, the tests should include custom biosequence and alphabet types to make sure some basic behaviour is covered by fallbacks.

  • Add custom alphabet type to tests
  • Add a separate alphabet with ASCIIAlphabet trait for print tests
  • Add custom mutable biosequence subtype to tests
  • Make sure all relevant tests include custom biosequence and/or custom alphabet
@jakobnissen jakobnissen added this to the v3.0.0 milestone Mar 11, 2021
@kescobo
Member

kescobo commented Mar 11, 2021

This seems like a good idea, should we spitball what those interfaces actually are?

@jakobnissen
Member Author

jakobnissen commented Mar 11, 2021

Okay, so I decided to not make another issue. Here is my proposal for the interface for BioSequence and Alphabet. There are already interfaces stated in the documentation, but if you actually try to implement a new sequence/alphabet type, you will find that they are insufficient (cc @benjward)

The following types are assumed to have the listed properties. You may create subtypes that violate these properties, but then any fallback functions may be excessively slow or wrong.

What is an Alphabet?

  • Subtypes of Alphabet are singleton structs that may or may not be parameterized.
  • Alphabets determine a finite list of biological symbols that can be encoded in the given molecule type
  • The alphabet controls the encoding/decoding between a decoded element type and an internal data representation type.
  • An Alphabet must never encode (using encode) or decode (using decode) invalid data, but must error in cases where invalid data could have been produced. Other methods for check-free encoding/decoding may be added.

What to implement?

Every subtype A of Alphabet must implement:

  • Base.eltype(::Type{A})::Type{E} for some eltype E
  • symbols(::A)::Tuple{Vararg{E}}. This gives an ordered tuple of all elements of A.
  • encode(::A, ::E)::X encodes an element to the internal data eltype X
  • decode(::A, ::X)::E decodes an X to an element E.
  • Except for eltype which must follow Base conventions, all functions operating on Alphabet should operate on instances of the alphabet, not the type.

If you want interoperation with existing subtypes of BioSequence, the encoded data type X must be UInt, and you must also implement:

  • BitsPerSymbol(::A)::BitsPerSymbol{N}, where N must be zero or a power of two in [1, 8*sizeof(UInt)].
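The constraint on N can be written as a small predicate. This is only a sketch: `valid_bits_per_symbol` is a hypothetical helper, not a BioSequences function.

```julia
# Hypothetical helper: true if n is zero or a power of two
# no larger than the number of bits in a UInt.
function valid_bits_per_symbol(n::Integer)
    n == 0 && return true
    n > 0 && ispow2(n) && n <= 8 * sizeof(UInt)
end
```

For example, `valid_bits_per_symbol(4)` holds for the 16-symbol alphabet in the example that follows, while `valid_bits_per_symbol(3)` does not.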

Minimal example

In this example, the element type is AminoAcid, and the encoded data type is UInt.

struct ReducedAAAlphabet <: Alphabet end
Base.eltype(::Type{ReducedAAAlphabet}) = AminoAcid
BioSequences.BitsPerSymbol(::ReducedAAAlphabet) = BioSequences.BitsPerSymbol{4}()
function BioSequences.symbols(::ReducedAAAlphabet)
    (AA_L, AA_C, AA_A, AA_G, AA_S, AA_T, AA_P, AA_F,
     AA_W, AA_E, AA_D, AA_N, AA_Q, AA_K, AA_H, AA_M)
end

const (ENC_LUT, DEC_LUT) = let
    enc_lut = fill(0xff, length(alphabet(AminoAcid)))
    dec_lut = fill(AA_A, length(symbols(ReducedAAAlphabet())))
    for (i, aa) in enumerate(symbols(ReducedAAAlphabet()))
        enc_lut[reinterpret(UInt8, aa) + 0x01] = i - 1
        dec_lut[i] = aa
    end
    (Tuple(enc_lut), Tuple(dec_lut))
end

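function BioSequences.encode(::ReducedAAAlphabet, aa::AminoAcid)
    i = reinterpret(UInt8, aa) + 0x01
    # Reject symbols outside the lookup table or without an encoding (0xff sentinel).
    (i > length(ENC_LUT) || @inbounds ENC_LUT[i] === 0xff) && throw(DomainError(aa))
    (@inbounds ENC_LUT[i]) % UInt
end

function BioSequences.decode(::ReducedAAAlphabet, x::UInt)
    # x is zero-based, so valid values are 0:length(DEC_LUT)-1.
    x >= length(DEC_LUT) && throw(DomainError(x))
    @inbounds DEC_LUT[x + UInt(1)]
end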

What is a BioSequence?

  • It's a linear container type with random access and indices Base.OneTo(length(x)).
  • It contains zero or more coding elements of type encoded_data_eltype(typeof(x)).
  • It may or may not be mutable. Generic functions cannot assume mutability.
  • It is associated with an Alphabet A by being a subtype of BioSequence{A}.
  • The concrete BioSequence subtype, together with the index and optionally the Alphabet, and no other properties, determines how to extract the internally encoded element at that index. The Alphabet then controls how to decode it to an element.

What to implement?

Every subtype T of BioSequence{A} must implement the following methods, where x is an instance of the subtype:

  • Base.length(x)::Int
  • encoded_data_eltype(::Type{T})::Type{E}
  • extract_encoded_element(x, ::Int)::E
  • T must be able to be constructed from any iterable with length defined and with a known, compatible element type.

Furthermore, mutable sequences should implement

  • encoded_setindex!(x, ::E, ::Int)

For compatibility with existing Alphabets, the encoded data eltype must be UInt.

Minimal example

struct Codon <: BioSequence{RNAAlphabet{2}}
    x::UInt8
end

function Codon(it)
    length(it) == 3 || error("Must have length 3")
    x = zero(UInt)
    for (i, nt) in enumerate(it)
        x |= BioSequences.encode(Alphabet(Codon), convert(RNA, nt)) << (6-2i)
    end
    Codon(x % UInt8)  # truncate to the field type so the default constructor is called
end 

Base.length(::Codon) = 3
BioSequences.encoded_data_eltype(::Type{Codon}) = UInt
function BioSequences.extract_encoded_element(x::Codon, i::Int)
    ((x.x >>> (6-2i)) & 0x03) % UInt  # element i sits at bit offset 6-2i, two bits wide
end
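
The bit layout used by the Codon example can be checked in isolation. The sketch below uses only Base; `pack3` and `unpack` are hypothetical helper names, and the 2-bit codes are plain integers rather than encoded RNA symbols.

```julia
# Pack three 2-bit codes into one byte; element i lives at bit offset 6 - 2i.
function pack3(codes)
    length(codes) == 3 || error("Must have length 3")
    x = 0x00
    for (i, c) in enumerate(codes)
        0 <= c <= 3 || throw(DomainError(c))
        x |= UInt8(c) << (6 - 2i)
    end
    x
end

# Recover the i-th 2-bit code (1-based) by shifting it down and masking.
unpack(x::UInt8, i::Int) = (x >>> (6 - 2i)) & 0x03

packed = pack3((2, 0, 3))
```

Extraction has to use the same offset as construction, and the mask keeps only the two bits that belong to element i.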

@TransGirlCodes
Member

@jakobnissen So these two examples have made their way into the tests, and I can include them in the manual as example custom types if necessary. Can this close?

@jakobnissen
Member Author

jakobnissen commented Jun 23, 2021 via email

@TransGirlCodes
Member

The tests all pass. But we should probably do as the blogpost on anti-patterns suggests and provide a function, which other users can also call, that tests whether any given type conforms to the interface.
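
One possible shape for such a check, as a sketch only: the toy interface below (`ToySeq`, `seqlength`, `getcode`, `missing_methods`) is hypothetical and stands in for the real BioSequence interface. It only verifies that the required methods exist, not that they behave correctly.

```julia
# Hypothetical toy interface; names are illustrative, not the BioSequences API.
abstract type ToySeq end
function seqlength end
function getcode end

# Return the required functions that type T has no method for.
function missing_methods(::Type{T}) where {T<:ToySeq}
    required = [
        (seqlength, Tuple{T}),
        (getcode, Tuple{T,Int}),
    ]
    [f for (f, sig) in required if !hasmethod(f, sig)]
end

# A type that implements everything.
struct GoodSeq <: ToySeq end
seqlength(::GoodSeq) = 0
getcode(::GoodSeq, i::Int) = UInt(0)

# A type that forgot getcode.
struct BadSeq <: ToySeq end
seqlength(::BadSeq) = 0
```

A check like this only works if the generic functions have no error-throwing fallbacks, since a fallback would make `hasmethod` return `true` for every subtype.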

@jakobnissen
Copy link
Member Author

I basically think this is done now, as #202 is merged and I've gone through the tests. It's not true that all tests also include an operation on a custom alphabet, but code coverage of the generic code paths should be good enough.
