Uppercase and lowercase are commonly used in DNA sequence data to identify distinctive features, such as coding regions, repeats, or other arbitrary regions of interest.
Currently all DNA sequences are modified to uppercase in BioSequences.
Expected Behavior
Suggested change/improvement: Allow for DNA sequences with mixed upper- and lowercase.
Current Behavior
Current behaviour: All DNA is written in uppercase letters.
Possible Solution / Implementation
Context
The trivial example: Provide the reverse complement of this sequence:
GACGTCGCCAGAGAggcataTAACGATAtgacacagagagagcaGAGACAAGT
cannot be answered by using BioSequences without losing information.
The lowercase/uppercase is metadata associated with the DNA sequence, and not part of the DNA sequence itself. Therefore, it should not be contained in the DNA sequence type.
There are several reasons for this:
First, one can always encode extra information in the string representation of any object. For example, suppose I have a FASTA sequence where the line length is used to encode e.g. exons/introns. Or suppose I have an array literal where I use line breaks to signify different parts of the array. The information encoded may be arbitrary, and it's not possible to design a data structure that can contain all of this extra data.
Second, suppose we special-case lowercase/uppercase in BioSequences. Then every symbol would need one extra bit of storage, which means DNA would take either 3 or 5 bits per symbol instead of 2 or 4. Since it's convenient for the number of bits to be a power of two, this would round up to 4 or 8 bits, which means the memory footprint would double. The implementation would also be more complex.
Instead, I propose you simply extract the metadata to a separate vector:
julia> using BioSequences
julia> s = "GACGTCGCCAGAGAggcataTAACGATAtgacacagagagagcaGAGACAAGT";
julia> mask = BitVector(isuppercase(i) for i in s);
julia> dna = LongDNA{2}(s)
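To show how the separate mask could answer the reverse-complement question from the issue, here is a minimal sketch. It assumes that reversing the mask alongside the sequence is the desired behaviour, and the names `rc`, `rc_mask` and `rc_str` are introduced here purely for illustration; it is one possible way to do it, not something built into BioSequences.

```julia
using BioSequences

s = "GACGTCGCCAGAGAggcataTAACGATAtgacacagagagagcaGAGACAAGT"

# Case information kept outside the sequence: true where the input was uppercase.
mask = BitVector(isuppercase(c) for c in s)
dna = LongDNA{2}(s)

# Reverse-complement the sequence and reverse the mask so positions stay aligned.
rc = reverse_complement(dna)
rc_mask = reverse(mask)

# Re-apply the case when converting back to text (illustrative helper logic).
rc_str = join(up ? c : lowercase(c) for (c, up) in zip(String(rc), rc_mask))
println(rc_str)
```

The sequence itself stays in the compact 2-bit encoding, and the case is only restored at the point where a string is produced.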