julia 1.5: Time for views of a sequence? #102
Yes, we could make a SeqView object - or whatever we should call it. I'm not sure what the use case is. When do you create so many views that the microsecond delay caused by an allocation matters? Maybe for genome assembly purposes, where K=127 is sometimes (rarely) used. Even then, a struct wrapping four 64-bit integers may be way more efficient. Do you have a use case, maybe for your GenomeGraphs? Perhaps it doesn't matter that I don't have a use case; if you build the tool, sometimes people will figure out what to do with it. If we do it, I think we should make it unsafe and high performance: just wrap the source array and a UnitRange in a struct. If people then resize the underlying array, the segfault is on them.
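For concreteness, a minimal sketch of what such an unsafe wrapper could look like - `SeqView` and its field names are hypothetical, not anything in BioSequences.jl:

```julia
# Hypothetical sketch of an unsafe, allocation-free view type.
# No bounds re-checking after construction; resizing the parent is on the caller.
struct SeqView{T} <: AbstractVector{T}
    parent::Vector{T}       # underlying sequence data
    range::UnitRange{Int}   # the viewed span
end

Base.size(v::SeqView) = (length(v.range),)
Base.@propagate_inbounds Base.getindex(v::SeqView, i::Int) = v.parent[v.range[i]]
Base.@propagate_inbounds Base.setindex!(v::SeqView, x, i::Int) =
    (v.parent[v.range[i]] = x)  # writes through to the parent, unlike copy-on-write

data = collect("ACGTACGT")
v = SeqView(data, 3:6)   # view of positions 3-6
v[1]                     # 'G'
v[1] = 'T'               # mutates `data` itself
```

The point of the 1.5 timing is that an immutable struct like this, despite holding a reference to a heap-allocated Vector, can now often stay off the heap entirely.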
There's also all the indirection and lack of contiguous memory layout if they're put in a vector, in addition to the creation cost.

So in the group's genome assembly research, we've got an algorithm that, unlike any other assembler (even the reportedly reliable ones), really does give you reliable repeat expansion of repetitive regions between two haplotype-specific anchors, no matter the ploidy. It turns out many assemblers have diploid bias and screw up even simple triploid scenarios generated in Pseudoseq where we know the answer: even though they are not aiming for a diploid assembler or phasing or whatever, they have human genomes on the brain, if you like, so when heuristics are dreamed up, edge cases and undocumented assumptions crop up.

Problem is, it takes ages for graphs constructed with K < 200. Indeed, some other assemblers start with a low K and then map reads and expand the graph to a K=200 graph - might as well build the K=200 graph in the first place! So an efficient way of dealing with big kmers would be great. In particular, for it to be worth it, you probably need K to require more bits for the data than are required for the fields of the view (start, end, and so on). A HugeKmer or ViewKmer or something would also allow compaction of the kmers into a single vector, reducing redundant data bits stored twice, but again that's only worth it if K is high enough that the savings are significant.
Maybe it's something we need to experiment with - make a baby type that can do a few things, and run some tests vs regular sequences - mapping over vectors of seqs and vectors of views and looking at the benchmark figures. Of course, we'd only see any potential benefit whatsoever on 1.5.
Well, I s'pose we ought to set up BenchmarkCI and decide on some problems to benchmark solutions against.
Having digested this over the past few days, I believe the question is: how do we elegantly wrap
I'm not sure what you mean, @CiaranOMara - why would we use bit vectors to represent sequences? Can you explain?

I've also tinkered a little bit with it this weekend and made a prototype. Boy, is it fast! In my simple test, I summed the integer representation of the 11th basepair of every 21-mer in a 1-million-long sequence.

First, I think we should use the following data structure:
We could also parameterize with

Second, we should think about what to do with reverse complementation (or any other kind of transformation). Reverse complementing a view RCs a chunk of the parent sequence, which is quite bad. For use in Kmers, RC'ing is a pretty basic concept - imagine e.g. iterating over canonical kmers. I can see two solutions to this problem:
Of these two, I think we should go with option 2. It's bad design, I think, but I'm not sure what else to do. Also, I think we should implement some kind of
Below is roughly the sort of variation of sequence data structure I've been exploring.

```julia
struct CustomSequence{B}
    # reshape of a BitVector shares storage and returns a BitMatrix
    data::BitMatrix
    function CustomSequence{B}(raw::BitVector) where {B}
        new{B}(reshape(raw, B, :))
    end
end
```

In this data structure, the type parameter

This type of data structure, if loosened so that it can take a

As I delve into this further, I realise this is pretty much a reimplementation of
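As a self-contained illustration of the reshape trick (assumed semantics, not the prototype verbatim): a BitVector reshaped to B×N shares its storage, so column i holds the B bits of symbol i:

```julia
B = 2
raw = BitVector([0, 0, 0, 1, 1, 0, 1, 1])  # four 2-bit symbols: 00 01 10 11
cols = reshape(raw, B, :)                  # a BitMatrix sharing raw's storage
cols[:, 3]                                 # bits of the third symbol: [1, 0]
cols[1, 3] = false                         # writes through to raw itself
```

Because the reshape shares storage, no copying happens at construction; the type parameter only fixes how many rows (bits) each symbol occupies.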
I think the two relevant bits of information are the bits per symbol for iteration, and the

As for the subtypes of
I understood that as the type returned from an iteration of a 21-character-wide sliding window. You'd just return a view into the original sequence, would you not? I think

```julia
iter = Kmer{3}(seq)
second_3mer = iterate(iter, 2)
```
It'd be really useful to splice out a copy... It's a batch and order-of-operations optimisation problem? I wonder what SIMD optimisation is possible. For example, using the reshaping concept, it's possible to have the kth kmers simultaneously available as column vectors. I always thought the alphabets should be designed such that a bit toggle (a not operation) results in the complement. Then the kth kmer complements would be simultaneously available. What is the low-level detail for this? In terms of high-level structure and interface development, it might be easier to try ideas using the
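On the bit-toggle point: with the 2-bit encoding A=00, C=01, G=10, T=11, complementation really is a bitwise NOT, since A↔T is 00↔11 and C↔G is 01↔10 - so a whole register of packed bases complements in one instruction. A small sketch (the helper names are hypothetical, not BioSequences internals):

```julia
# 2-bit codes: A=0b00, C=0b01, G=0b10, T=0b11 - NOT of a code is its complement.
code(c::Char) = UInt64(findfirst(c, "ACGT") - 1)
decode(x)     = "ACGT"[Int(x & 0b11) + 1]

# Pack a short sequence into one UInt64 register, 2 bits per base.
pack(seq) = foldl(enumerate(seq); init = zero(UInt64)) do reg, (i, c)
    reg | (code(c) << (2 * (i - 1)))
end

reg  = pack("ACGT")
mask = (UInt64(1) << (2 * 4)) - 1        # keep only the 8 bits in use
comp = ~reg & mask                       # complement all 4 bases at once
String([decode(comp >> (2 * (i - 1))) for i in 1:4])  # "TGCA"
```

This is exactly why the ordering of the codes matters: a different assignment of symbols to bit patterns would turn the one-instruction complement into a table lookup.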
So for me, the direction the BioSequences API is going in - the one I kinda started to push when I did v2 - was to really start to decouple expected behaviour from the internal representation. In v1 you had a Kmer and a BioSequence. And a

Alphabets were used as type parameters for BioSequence more as a coding convenience: they control everything from how data is encoded into bits to which symbols are allowed in a sequence, and other bit-shifting internals were dispatched on Alphabet. With v2 I began to change that. I want to get to a place where:

A: Any methods defined for BioSequences are agnostic to any internal representation; rather, they can simply assume any methods they call did what they said on the tin.

B: Any concrete subtype of BioSequence can expect to have the following traits and methods, which unlike v1 have more clearly delineated (and documented! in some kind of dev/interface docs) roles.
These and a few other traits - e.g. the exact type of Unsigned used as registers holding the compacted elements, and perhaps some kind of encoder and decoder trait - should provide all we need to make a really nice internal/developer interface: given a new or custom sequence type and some traits, we can know how dispatch will go for some things, and method errors will give a really clear indication of which parts of the interface are not being fulfilled. E.g. something internal that reverses elements only has to worry about
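A hedged sketch of what such a trait-driven developer interface might look like - `bits_per_symbol`, `encoded_data`, and `TwoBitSeq` are illustrative names, not the actual BioSequences interface:

```julia
abstract type MyBioSequence end

# The interface a concrete subtype must supply; errors point at the missing piece.
bits_per_symbol(s::MyBioSequence) = error("bits_per_symbol not defined for $(typeof(s))")
encoded_data(s::MyBioSequence)    = error("encoded_data not defined for $(typeof(s))")

struct TwoBitSeq <: MyBioSequence
    data::Vector{UInt64}
    len::Int
end
bits_per_symbol(::TwoBitSeq) = 2
encoded_data(s::TwoBitSeq)   = s.data
Base.length(s::TwoBitSeq)    = s.len

# Generic internals only touch the traits, never the concrete layout:
function extract_bits(s::MyBioSequence, i::Integer)
    bps = bits_per_symbol(s)
    per_reg = 64 ÷ bps
    reg, off = divrem(i - 1, per_reg)
    (encoded_data(s)[reg + 1] >> (off * bps)) & ((UInt64(1) << bps) - 1)
end

s = TwoBitSeq([0b11100100], 4)   # ACGT packed 2 bits per base, little-endian
extract_bits(s, 4)               # 0x03, the code for T
```

If a new sequence type forgets a trait, the error message names it directly, which is the "clear indication of which parts of the interface are not fulfilled" behaviour described above.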
In terms of decoupling internal representation, there is a point at which one needs to know the internal representation. At the moment, the internal representation is inferred from a sequence-alphabet type combination at a vector level. Does it make sense to know the internal representation at an element level, or at least be able to specify an element-level representation with an element type? At the moment, it's not immediately obvious how to convert from a 4-bit alphabet to a 2-bit representation. For example,

```julia
seq = LongSequence{DNAAlphabet{4}}("TTAGC")
seq_twobit = collect(something{2}, seq)
# write seq_twobit to file.
```

The issue here is that the internal bit-level representation and bit packing strategy are unknown (

The desired flexibility could be achieved by defining the element type

At the user-level, we'd export performant aliases of some
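For what it's worth - and treat the exact call as an assumption on my part - alphabet-to-alphabet conversion can be spelled today by constructing a LongSequence with the target alphabet, which decodes and re-encodes symbols rather than exposing the bit-level packing:

```julia
using BioSequences  # assuming a BioSequences.jl v2-style API

seq = LongSequence{DNAAlphabet{4}}("TTAGC")
seq_twobit = LongSequence{DNAAlphabet{2}}(seq)  # re-encoded at 2 bits per symbol
```

Even so, the objection above stands: nothing in this spelling tells you the bit-level layout you would get when writing the result to a file.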
The packing/unpacking strategy is currently determined - in NTupleKmers - by the combination of BitsPerElem, ElementOrder, and eltype(vec/tuple), which are sufficient to correctly dispatch to appropriate bit-flipping methods for tasks where inserting/removing/moving values is required (an extra layer of customization could be obtained by combining those into a PackingStrategy trait or something). I propose the other part - taking those bits and transforming them into an actual BioSymbol type - should be done according to an encoder/decoder trait, and likewise for bit flipping that does need to care about how a value is represented in bits as well as how many bits are used per element, e.g. complement. Anyway, I'm gonna continue working on the NTupleKmers to get a working implementation that I can open a PR for. We can add LTS support and maybe benchmarks to BioSequences, and then we'll be in a good position to try any major internal or external API changes.
A round-up of ideas/messages.
Ditching the |
To elaborate on the parametric primitive scheme, the

```julia
BitsPerElem(DNA{TwoBit}) = 2
```

The encoder would then pack the important bits into a sequence whose traits are either

It occurs to me that not only could these encoders push bits into an

The decoder would place the bits into the 8 bits of the
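A minimal sketch of how a parametric primitive symbol could carry its encoding in its type - `MyDNA`, `TwoBit`, `FourBit`, and `bits_per_elem` are hypothetical names for illustration, not BioSymbols.jl:

```julia
abstract type Encoding end
struct TwoBit  <: Encoding end
struct FourBit <: Encoding end

# An 8-bit primitive whose type parameter records the intended encoding width
# (parametric primitive types are legal in Julia; Base's Ptr{T} is one).
primitive type MyDNA{E<:Encoding} 8 end

bits_per_elem(::Type{MyDNA{TwoBit}})  = 2
bits_per_elem(::Type{MyDNA{FourBit}}) = 4

# A packer would then mask only the low `bits_per_elem` bits of each symbol.
raw_bits(x::MyDNA) = reinterpret(UInt8, x)

g = reinterpret(MyDNA{TwoBit}, 0x02)   # say, the 2-bit code for G
bits_per_elem(typeof(g))               # 2
raw_bits(g)                            # 0x02
```

The upshot is that the same nucleotide can exist at different encoding widths as distinct types, with dispatch on the parameter deciding how many bits the packer takes.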
In this scheme, encoded and packed symbols are |
MWE of BioSymbols as parametric primitives: primitive_alphabet.jl.
I have a feeling that is going to be more inefficient vs what we already have. The encoding constructs a tuple of two bytes to represent two bits, where currently a single integer is used. Current internals insert N bits at a time, once per element, whereas in the MWE every single bit is set individually in that pack method. I'm not sold on the philosophy of this either. It makes the elements - BioSymbols - "aware" of their container's - which I'm not sure they should be, insofar as it grants them control over how container objects - sequence types - store them. Ints do not dictate to a Vector or BitVector how they are stored - that's up to the vectors. I'm all for rethinks of the bit-packing framework, but I think elements should just stay elements. There's an inelegance to having "adenine" possibly represented by N slightly different instantiations of a parametric type that have different effects on different containers. The current API is neat since there's a separation of concerns: you pass DNA_A to a sequence - the sequence tells you if it's allowed and decides how it will be handled - DNA_A (or its type) gets no "say" in the matter.
I agree that
Ints and Floats both encode numerical symbols. How do
The intent was to make

I've updated the MWE (revision 2) to remove that tuple distraction. I also filled out the example with more of the core features. The MWE is now able to compress, pack, unpack, and decompress symbols with 2 or 4 bits of useful encoding (revision 5). There is no low-level optimisation. It is only a high-level exploration of what the interfaces/workflow might be like if BioSymbols were aware of their type of encoding/contents.
Ok, I've thought about this more, and it makes sense to me if the symbol types become more like alphabets - say, instead of just a
Apologies, but I am not sure if
To point one - it is drifting! To point two, I can see some possible benefits, but they're not great enough to push it as a priority for me, mostly to do with the separation of concerns: if symbol types are concerned with representation at the bit level, it effectively moves encoding/alphabet stuff from BioSequences.jl to BioSymbols.jl, and a sequence type's job becomes simply to pack the bits of the symbol types - as they are - into registers, which is simpler than encoding and packing. It also means, as I describe above, you could have different symbol types/sets: DNA, IUPAC_DNA, APE_DNA (R package representation) and so on, but with well-defined promotion and conversion rules it would be as convenient as all the sequence types decoding to the same eltype - it might even be more flexible. But ultimately, to see any benefit, someone would have to try it and see how using such an interface "feels" vs the current one.
To round off the alphabet stuff, I'll open up an experimental branch to develop the suggested parametric primitive alphabets. I think LongSequence is brilliant, and any changes or reorganisation would need to offer comparable performance. I also think the delineation of BioSequences as a bit packer is useful and nicely separates the concerns of the two packages.

Onto the views: it seems to me that a "view" would index into an optimised bit vector. @jakobnissen, from your https://github.com/jakobnissen/hardware_introduction work, do you have a sense of what the upper size limit is for an allocation-free data structure? As an aside, if this thing is to be called a view, I'd expect it to implement the view interface and behave as views do in
Closed by #120.
@jakobnissen @CiaranOMara
So Julia 1.5 now means structs that contain references to heap-allocated (mutable) objects can be stack-allocated.
I'm very excited about this. As you know, every LongSequence is mutable and every LongSequence owns its data. Yes, there is a copy-on-write mechanism, but that's just an implementation detail to remove copying unless absolutely necessary.
If you create many views over a sequence - taking advantage of that mechanism - you avoid copying, but you still need to allocate many LongSequence objects on the heap and trigger the GC and all that.
I've long wanted to make proper allocation-free views of a sequence - a true view with a fixed start and end, where if you edit a base you DO indeed edit the base in the underlying sequence (contrary to the copy-on-write mechanism, which means the seqs still own their own data - even if sharing a vec temporarily for performance reasons). But there was never any point, as they would just allocate too. Perhaps 1.5 is the time to try something - what do you guys think? It might be a massive boon, for example for making kmers with K > 64.