Check interfaces, remove NotImplemented errors #140

jakobnissen · 2021-03-11T09:59:15Z

Errors such as:

BioSequences.jl/src/biosequence/biosequence.jl

Lines 21 to 30 in 0ba3d4b

    
    "Get the length of a biological sequence."
    @inline function Base.length(seq::BioSequence)
        error(
            string(
                "Base.length has not been defined for BioSequence type: ",
                typeof(seq),
                ". Any compatible concrete BioSequence subtype must have this method implemented."
            )
        )
    end

Such an error is no more useful than a MethodError, and it hampers static analysis from e.g. JET, which will probably be a reality before the end of this year. Much better to simply remove it. (Also see this blogpost.)

Instead, we should go through and verify we have covered and documented interfaces for:

A new BioSequence subtype
A new Alphabet subtype
An alphabet having the ASCIIAlphabet trait

Also, the tests should include custom biosequence and alphabet types to make sure some basic behaviour is covered by fallbacks.

Add custom alphabet type to tests
Add a separate alphabet with ASCIIAlphabet trait for print tests
Add custom mutable biosequence subtype to tests
Make sure all relevant tests include custom biosequence and/or custom alphabet
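
To illustrate the point about MethodError above, here is a minimal sketch using hypothetical stand-in types (ToySequence and NoLengthSeq are made up for this example and are not BioSequences types). With no fallback method defined, the failing call already produces a MethodError that records the function and the offending argument, so a hand-written "not implemented" error adds little.

```julia
# Hypothetical stand-ins; these are not BioSequences types.
abstract type ToySequence end
struct NoLengthSeq <: ToySequence end   # deliberately has no Base.length method

# No `Base.length(::ToySequence)` fallback exists, so the call below
# throws a MethodError, which we capture instead of letting it propagate.
err = try
    length(NoLengthSeq())
catch e
    e
end

println(err isa MethodError)   # the error type Julia raises by itself
println(err.f === length)      # the function that had no matching method
println(err.args)              # the arguments of the failed call
```

Because no method is defined at all, a static analysis tool has a missing method to report, whereas a fallback that calls error always matches.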

kescobo · 2021-03-11T15:47:29Z

This seems like a good idea, should we spitball what those interfaces actually are?

jakobnissen · 2021-03-11T17:49:55Z

Okay, so I decided to not make another issue. Here is my proposal for the interface for BioSequence and Alphabet. There are already interfaces stated in the documentation, but if you actually try to implement a new sequence/alphabet type, you will find that they are insufficient (cc @benjward )

The following types are assumed to have the listed properties. You may create subtypes that violate these properties, but then any fallback functions may be excessively slow or wrong.

What is an Alphabet?

Subtypes of Alphabet are singleton structs that may or may not be parameterized.
Alphabets determine a finite list of biological symbols that can be encoded in the given molecule type
The alphabet controls the encoding/decoding between a decoded element type and an internal data representation type.
An Alphabet must never encode (using encode) or decode (using decode) invalid data, but must error in cases where invalid data could have been produced. Other methods for check-free encoding/decoding may be added.

What to implement?

Every subtype A of Alphabet must implement:

Base.eltype(::Type{A})::Type{E} for some eltype E
symbols(::A)::Tuple{Vararg{E}}. This gives an ordered tuple of all elements of A.
encode(::A, ::E)::X encodes an element to the internal data eltype X
decode(::A, ::X)::E decodes an X to an element E.
Except for eltype which must follow Base conventions, all functions operating on Alphabet should operate on instances of the alphabet, not the type.

If you want interoperation with existing subtypes of BioSequence, the encoded data type X must be UInt, and you must also implement:

BitsPerSymbol(::A)::BitsPerSymbol{N}, where N must be zero or a power of two in [1, 8*sizeof(UInt)].
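
The constraint on N can be written as a one-line predicate. This is only a sketch: is_valid_bits_per_symbol is a hypothetical helper invented for illustration, not a BioSequences function.

```julia
# Hypothetical helper, not part of BioSequences.
# N must be zero, or a power of two between 1 and the number of bits in a UInt.
is_valid_bits_per_symbol(n::Integer) =
    n == 0 || (ispow2(n) && 1 <= n <= 8 * sizeof(UInt))

# List the accepted values; the upper bound depends on the platform's UInt size.
valid = filter(is_valid_bits_per_symbol, 0:(8 * sizeof(UInt)))
println(valid)
```

On a 64-bit machine the upper bound is 64, so values such as 3 or 128 are rejected while 4 (used in the example below) is accepted.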

Minimal example

In this example, the element type is AminoAcid, and the encoded data type is UInt.

struct ReducedAAAlphabet <: Alphabet end
Base.eltype(::Type{ReducedAAAlphabet}) = AminoAcid
BioSequences.BitsPerSymbol(::ReducedAAAlphabet) = BioSequences.BitsPerSymbol{4}()
function BioSequences.symbols(::ReducedAAAlphabet)
    (AA_L, AA_C, AA_A, AA_G, AA_S, AA_T, AA_P, AA_F,
     AA_W, AA_E, AA_D, AA_N, AA_Q, AA_K, AA_H, AA_M)
end

const (ENC_LUT, DEC_LUT) = let
    enc_lut = fill(0xff, length(alphabet(AminoAcid)))
    dec_lut = fill(AA_A, length(symbols(ReducedAAAlphabet())))
    for (i, aa) in enumerate(symbols(ReducedAAAlphabet()))
        enc_lut[reinterpret(UInt8, aa) + 0x01] = i - 1
        dec_lut[i] = aa
    end
    (Tuple(enc_lut), Tuple(dec_lut))
end

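function BioSequences.encode(::ReducedAAAlphabet, aa::AminoAcid)
    # 1-based index into the encoding table
    i = reinterpret(UInt8, aa) + 0x01
    # Reject out-of-range values and amino acids absent from this alphabet (marked 0xff)
    (i > length(ENC_LUT) || @inbounds ENC_LUT[i] === 0xff) && throw(DomainError(aa))
    (@inbounds ENC_LUT[i]) % UInt
end

function BioSequences.decode(::ReducedAAAlphabet, x::UInt)
    # Valid encodings are 0:length(DEC_LUT)-1
    x ≥ length(DEC_LUT) && throw(DomainError(x))
    @inbounds DEC_LUT[x + UInt(1)]
end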

What is a BioSequence?

It's a linear container type with random access and indices Base.OneTo(length(x)).
It contains zero or more coding elements of type encoded_data_eltype(typeof(x)).
It may or may not be mutable. Generic functions cannot assume mutability.
It is associated with an Alphabet, A, by being a subtype of BioSequence{A}.
The concrete type, together with its Alphabet and no other properties, determines how to extract individual elements: the BioSequence subtype, given the index (and optionally the Alphabet), determines how to extract the internally encoded element, and the Alphabet then controls how to decode it to an element.

What to implement?

Every subtype T of BioSequence{A} must implement the following methods, where x is an instance of the subtype:

Base.length(x)::Int
encoded_data_eltype(::Type{T})::Type{E}
extract_encoded_element(x, ::Int)::E
T must be able to be constructed from any iterable with length defined and with a known, compatible element type.

Furthermore, mutable sequences should implement

encoded_setindex!(x, ::E, ::Int)

For compatibility with existing Alphabets, the encoded data eltype must be UInt.

Minimal example

struct Codon <: BioSequence{RNAAlphabet{2}}
    x::UInt8
end

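function Codon(it)
    length(it) == 3 || error("Must have length 3")
    x = zero(UInt)
    for (i, nt) in enumerate(it)
        # Pack each 2-bit code; the first nucleotide lands in bits 4-5
        x |= BioSequences.encode(Alphabet(Codon), convert(RNA, nt)) << (6-2i)
    end
    # Truncate to UInt8 so the inner constructor is called, not this method again
    Codon(x % UInt8)
end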

Base.length(::Codon) = 3
BioSequences.encoded_data_eltype(::Type{Codon}) = UInt
function BioSequences.extract_encoded_element(x::Codon, i::Int)
    # Undo the constructor's shift of (6-2i) and keep only the two bits of element i
    ((x.x >>> (6-2i)) & 0x03) % UInt
end
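
The 2-bit packing used by the Codon constructor (shift of 6-2i, so the first nucleotide sits in the highest occupied bits) can be checked in isolation. The sketch below uses plain integers in place of encoded nucleotides; pack_codes and extract_code are hypothetical helper names, not BioSequences functions.

```julia
# Hypothetical helpers that mirror the Codon layout with plain integers.
function pack_codes(codes)
    length(codes) == 3 || error("Must have length 3")
    x = 0x00
    for (i, c) in enumerate(codes)
        x |= UInt8(c) << (6 - 2i)   # element 1 -> bits 4-5, element 3 -> bits 0-1
    end
    return x
end

# Extraction must use the same shift as packing and mask off the other elements.
extract_code(x::UInt8, i::Int) = ((x >>> (6 - 2i)) & 0x03) % UInt

packed = pack_codes((0x02, 0x00, 0x03))
println([extract_code(packed, i) for i in 1:3])
```

Round-tripping every position this way is a quick check that extract_encoded_element and the constructor agree on the bit layout.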

TransGirlCodes · 2021-06-23T18:35:20Z

@jakobnissen So these two examples have made their way into the tests, and I can include them in the manual as example custom types if necessary. Can this close?

jakobnissen · 2021-06-23T19:03:29Z

Yes! Do they actually work? 😅 I kind of forgot if the custom types were thoroughly checked. Should I have a look tomorrow, or have you already tested the interfaces? If so, please do close this!

TransGirlCodes · 2021-06-23T22:03:36Z

The tests all pass. But we should probably do what the blogpost on anti-patterns suggests: provide a function, which other users can also call, that tests whether any given type conforms to the interface.
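
One way such a conformance check could look, as a rough sketch only. Everything here (ToyAlphabet, toy_symbols, toy_encode, toy_decode, toy_has_interface) is a hypothetical stand-in written for this example, not the BioSequences API: the check exercises the required methods on every symbol and treats a MethodError as "interface not implemented".

```julia
# Hypothetical stand-ins for an alphabet interface; not the BioSequences API.
abstract type ToyAlphabet end
function toy_symbols end
function toy_encode end
function toy_decode end

# Returns true if `A` implements the toy interface and round-trips all symbols.
function toy_has_interface(A::ToyAlphabet)
    try
        E = eltype(typeof(A))
        syms = toy_symbols(A)
        (syms isa Tuple{Vararg{E}} && !isempty(syms)) || return false
        codes = map(s -> toy_encode(A, s), syms)
        all(c -> c isa UInt, codes) || return false
        return map(c -> toy_decode(A, c), codes) == syms
    catch err
        # A missing method means the interface is incomplete; anything else is a real bug.
        err isa MethodError || rethrow()
        return false
    end
end

# A conforming alphabet over two characters.
struct BinaryAlphabet <: ToyAlphabet end
Base.eltype(::Type{BinaryAlphabet}) = Char
toy_symbols(::BinaryAlphabet) = ('0', '1')
toy_encode(::BinaryAlphabet, c::Char) =
    c == '0' ? UInt(0) : c == '1' ? UInt(1) : throw(DomainError(c))
toy_decode(::BinaryAlphabet, x::UInt) =
    x == 0 ? '0' : x == 1 ? '1' : throw(DomainError(x))

# An alphabet that forgot to implement symbols/encode/decode.
struct IncompleteAlphabet <: ToyAlphabet end
Base.eltype(::Type{IncompleteAlphabet}) = Char

println(toy_has_interface(BinaryAlphabet()))
println(toy_has_interface(IncompleteAlphabet()))
```

Checking by actually calling the methods, rather than with hasmethod, also catches implementations that exist but return the wrong types.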

jakobnissen · 2021-10-31T19:41:51Z

I basically think this is done now, as #202 is merged and I've gone through the tests. It's not true that all tests also include an operation on a custom alphabet, but code coverage of the generic code paths should be good enough.

jakobnissen added this to the v3.0.0 milestone Mar 11, 2021

jakobnissen mentioned this issue Mar 11, 2021

Spring cleanup for v3 #138

Closed

15 tasks

TransGirlCodes mentioned this issue Apr 30, 2021

Update user manual #160

Merged

17 tasks

jakobnissen mentioned this issue Apr 30, 2021

Update interface to version 3. #161

Merged

kescobo mentioned this issue May 19, 2021

Multiple alignment manipulation BioJulia/BioAlignments.jl#51

Open

jakobnissen mentioned this issue Oct 30, 2021

Add has_interface function #202

Merged

jakobnissen closed this as completed Oct 31, 2021
