Record type inference #268

cjprybol · 2023-02-25T19:03:04Z

Is there anything available for inferring the FASTA record type from the sequence?

In earlier versions of FASTX I think this was done by default, and all of the records read in by Readers were returned as variants of LongSequence rather than strings. Now the same functionality is available optionally if you specify the return type when calling FASTX.sequence({desired_return_type}, record)

What I'm looking for is something along the lines of

FASTX.sequence(BioSequences.LongSequence, record) -> Union{LongDNA, LongRNA, LongAA}

#or BioSequences would be a better home for the inference
BioSequences.LongSequence(FASTX.sequence(record)) -> Union{LongDNA, LongRNA, LongAA}

Expected Behavior

Ambiguous interpretations lead to errors

julia> x = FASTX.FASTA.Record("identifier" , "AAAA")
FASTX.FASTA.Record:
  description: "identifier"
     sequence: "AAAA"

julia> FASTX.sequence(BioSequences.LongDNA{4}, x)
4nt DNA Sequence:
AAAA

julia> FASTX.sequence(BioSequences.LongRNA{4}, x)
4nt RNA Sequence:
AAAA

julia> FASTX.sequence(BioSequences.LongAA, x)
4aa Amino Acid Sequence:
AAAA

julia> FASTX.sequence(BioSequences.LongSequence, x)
Error -> ambiguous sequence type

unambiguous interpretations lead to auto-inferred sequence types

julia> x = FASTX.FASTA.Record("identifier" , "AAAAJ")
FASTX.FASTA.Record:
  description: "identifier"
     sequence: "AAAAJ"

julia> FASTX.sequence(BioSequences.LongDNA{4}, x)
ERROR: Cannot encode 0x4a to BioSequences.DNAAlphabet{4}()
Stacktrace:
 [1] error(s::String)
   @ Base ./error.jl:33
 [2] throw_encode_error(A::BioSequences.DNAAlphabet{4}, src::SubArray{UInt8, 1, Vector{UInt8}, Tuple{UnitRange{Int64}}, true}, soff::Int64)
   @ BioSequences /opt/julia/packages/BioSequences/zp2M8/src/longsequences/copying.jl:156
 [3] encode_chunk
   @ /opt/julia/packages/BioSequences/zp2M8/src/longsequences/copying.jl:168 [inlined]
 [4] copyto!(dst::BioSequences.LongSequence{BioSequences.DNAAlphabet{4}}, doff::Int64, src::SubArray{UInt8, 1, Vector{UInt8}, Tuple{UnitRange{Int64}}, true}, soff::Int64, N::Int64, #unused#::BioSequences.AsciiAlphabet)
   @ BioSequences /opt/julia/packages/BioSequences/zp2M8/src/longsequences/copying.jl:298
 [5] copyto!
   @ /opt/julia/packages/BioSequences/zp2M8/src/longsequences/copying.jl:235 [inlined]
 [6] BioSequences.LongSequence{BioSequences.DNAAlphabet{4}}(src::SubArray{UInt8, 1, Vector{UInt8}, Tuple{UnitRange{Int64}}, true}, part::UnitRange{Int64})
   @ BioSequences /opt/julia/packages/BioSequences/zp2M8/src/longsequences/constructors.jl:85
 [7] LongSequence
   @ /opt/julia/packages/BioSequences/zp2M8/src/longsequences/constructors.jl:83 [inlined]
 [8] sequence(::Type{BioSequences.LongSequence{BioSequences.DNAAlphabet{4}}}, record::FASTX.FASTA.Record, part::UnitRange{Int64}) (repeats 2 times)
   @ FASTX.BioSequencesExt /opt/julia/packages/FASTX/gzHTk/ext/BioSequencesExt.jl:21
 [9] top-level scope
   @ REPL[21]:1

julia> FASTX.sequence(BioSequences.LongRNA{4}, x)
ERROR: Cannot encode 0x4a to BioSequences.RNAAlphabet{4}()
Stacktrace:
 [1] error(s::String)
   @ Base ./error.jl:33
 [2] throw_encode_error(A::BioSequences.RNAAlphabet{4}, src::SubArray{UInt8, 1, Vector{UInt8}, Tuple{UnitRange{Int64}}, true}, soff::Int64)
   @ BioSequences /opt/julia/packages/BioSequences/zp2M8/src/longsequences/copying.jl:156
 [3] encode_chunk
   @ /opt/julia/packages/BioSequences/zp2M8/src/longsequences/copying.jl:168 [inlined]
 [4] copyto!(dst::BioSequences.LongSequence{BioSequences.RNAAlphabet{4}}, doff::Int64, src::SubArray{UInt8, 1, Vector{UInt8}, Tuple{UnitRange{Int64}}, true}, soff::Int64, N::Int64, #unused#::BioSequences.AsciiAlphabet)
   @ BioSequences /opt/julia/packages/BioSequences/zp2M8/src/longsequences/copying.jl:298
 [5] copyto!
   @ /opt/julia/packages/BioSequences/zp2M8/src/longsequences/copying.jl:235 [inlined]
 [6] BioSequences.LongSequence{BioSequences.RNAAlphabet{4}}(src::SubArray{UInt8, 1, Vector{UInt8}, Tuple{UnitRange{Int64}}, true}, part::UnitRange{Int64})
   @ BioSequences /opt/julia/packages/BioSequences/zp2M8/src/longsequences/constructors.jl:85
 [7] LongSequence
   @ /opt/julia/packages/BioSequences/zp2M8/src/longsequences/constructors.jl:83 [inlined]
 [8] sequence(::Type{BioSequences.LongSequence{BioSequences.RNAAlphabet{4}}}, record::FASTX.FASTA.Record, part::UnitRange{Int64}) (repeats 2 times)
   @ FASTX.BioSequencesExt /opt/julia/packages/FASTX/gzHTk/ext/BioSequencesExt.jl:21
 [9] top-level scope
   @ REPL[22]:1

julia> FASTX.sequence(BioSequences.LongAA, x)
5aa Amino Acid Sequence:
AAAAJ

julia> FASTX.sequence(BioSequences.LongSequence, x) #can only be an AA seq, per the available alphabets
5aa Amino Acid Sequence:
AAAAJ

Current Behavior

Can't use a generic LongSequence for any record

julia> FASTX.sequence(BioSequences.LongSequence, x)
ERROR: MethodError: no method matching BioSequences.LongSequence(::SubArray{UInt8, 1, Vector{UInt8}, Tuple{UnitRange{Int64}}, true})
Closest candidates are:
  BioSequences.LongSequence(::Kmers.Kmer{A, K, N}) where {A, K, N} at /opt/julia/packages/Kmers/7SNBQ/src/kmer.jl:492
  BioSequences.LongSequence(::BioSequences.LongSequence, ::AbstractUnitRange{var"#s29"} where var"#s29"<:Integer) at /opt/julia/packages/BioSequences/zp2M8/src/longsequences/constructors.jl:49
  BioSequences.LongSequence(::BioSequences.LongSubSeq{A}) where A at /opt/julia/packages/BioSequences/zp2M8/src/longsequences/seqview.jl:76
  ...
Stacktrace:
 [1] sequence(::Type{BioSequences.LongSequence}, record::FASTX.FASTA.Record, part::UnitRange{Int64}) (repeats 2 times)
   @ FASTX.BioSequencesExt /opt/julia/packages/FASTX/gzHTk/ext/BioSequencesExt.jl:21
 [2] top-level scope
   @ REPL[19]:1

julia> BioSequences.LongSequence(FASTX.sequence(x))
ERROR: MethodError: no method matching BioSequences.LongSequence(::StringViews.StringView{SubArray{UInt8, 1, Vector{UInt8}, Tuple{UnitRange{Int64}}, true}})
Closest candidates are:
  BioSequences.LongSequence(::Kmers.Kmer{A, K, N}) where {A, K, N} at /opt/julia/packages/Kmers/7SNBQ/src/kmer.jl:492
  BioSequences.LongSequence(::BioSequences.LongSequence, ::AbstractUnitRange{var"#s29"} where var"#s29"<:Integer) at /opt/julia/packages/BioSequences/zp2M8/src/longsequences/constructors.jl:49
  BioSequences.LongSequence(::BioSequences.LongSubSeq{A}) where A at /opt/julia/packages/BioSequences/zp2M8/src/longsequences/seqview.jl:76
  ...
Stacktrace:
 [1] top-level scope
   @ REPL[26]:1

Context

In addition to validating whether a FASTA is valid https://biojulia.github.io/FASTX.jl/latest/fasta/#FASTX.FASTA.validate_fasta it would be useful to have functionality to auto-infer the type of records in the FASTA

I'd need to think through the most logical way to check, but I think an order of operations to infer the best alphabet match might be like

DNA{2} -> RNA{2} -> DNA{4} -> RNA{4} -> AA

The AA alphabet (letter codes, not molecules) seems to be a superset of DNA/RNA alphabet, and the T/U difference I think is enough to differentiate between DNA/RNA

link to codes https://www.ddbj.nig.ac.jp/ddbj/code-e.html

The text was updated successfully, but these errors were encountered:

cjprybol · 2023-02-25T20:38:29Z

related: https://github.com/BioJulia/Kmers.jl/blob/master/src/kmer.jl#L329

jakobnissen · 2023-02-27T08:36:15Z

This is a good idea, and I think we should have this feature, with a few caveats.

First, IMO, it belongs in BioSequences, not FASTX (hence, I transferred the issue to this repository). The reason is principal: The biological sequence type is not a feature of the FASTA format, in which the sequences really are just text that can contain anything - indeed, they often contain non-standard symbols. This is also the motivation why FASTA.sequence returns a string in FASTX v2, where it returned a BioSequence in v1 - all records really contain are strings, and interpreting them to a specific biological alphabet is a distinct process, which is done by BioSequences.

The second caveat is that autodetection of sequence type is bound to be both flaky and inefficient, no matter how we do it. That's actually why we removed autodetection for v2 (see BioJulia/FASTX.jl#59). As one of the goals of BioJulia more broadly is to allow people to use robust software, we should be wary of adding flaky functions that users might accidentally rely on, and as a result, produce unreliable software.

That doesn't mean we can't have it, but it should just be named something like guess_parse or guess_seq, which makes it clear that that's all it's doing. I think it's worth having, for convenient REPL work where it's fine if it's a little unreliable and slow. I agree in that case, we should just check each of the predefined alphabets in order, and error if it's ambiguous between RNA and DNA.

We might also want to remove the method for kmers you linked to, before Kmers.jl is released, for the same reasons.

jakobnissen · 2023-02-27T08:40:40Z

A few ideas for implementing this:

Using tryparse (see Add tryparse(::BioSequence, s::AbstractString) or similar #224), simply trying each alphabet in order
Make a length NTuple{256, UInt8} lookup table (can also fit in a length-128 table) where each byte encodes four bits: If it contains valid DNA, valid RNA, IUPAC ambiguous nucleotides, or valid AA. Then we simply check each byte and AND the bitmask together before making the guess. If no bits are set, we error. If only AA bit is set, we pick amino acid alphabet. If both RNA and DNA alphabets is set, we error. Else we pick DNA/RNA, and 2/4 bits based on the ambiguous nucleotide bit.

This function is a quick-and-dirty parser function from `AbstractString` to `LongSequence`, with autodetection of the alphabet. It's meant to be used in ephemeral REPL work, and very clearly documented to be unstable and subject to change. See BioJulia#268

jakobnissen transferred this issue from BioJulia/FASTX.jl Feb 27, 2023

jakobnissen linked a pull request Feb 27, 2023 that will close this issue

Add guessparse function #269

Draft

5 tasks

jakobnissen added the feature request A request for some desired or missing functionality label Feb 27, 2023

cjprybol mentioned this issue Feb 27, 2023

consider removing kmer/sequence type inference from strings in favor of generalized version in BioSequences BioJulia/Kmers.jl#31

Open

jakobnissen linked a pull request Oct 16, 2023 that will close this issue

[RFC]: Add functions guess_parse and guess_alphabet #292

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Record type inference #268

Record type inference #268

cjprybol commented Feb 25, 2023

cjprybol commented Feb 25, 2023

jakobnissen commented Feb 27, 2023

jakobnissen commented Feb 27, 2023

Record type inference #268

Record type inference #268

Comments

cjprybol commented Feb 25, 2023

Expected Behavior

Current Behavior

Context

cjprybol commented Feb 25, 2023

jakobnissen commented Feb 27, 2023

jakobnissen commented Feb 27, 2023