Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

julia 1.5: Time for views of a sequence? #102

Closed
TransGirlCodes opened this issue May 29, 2020 · 21 comments
Closed

julia 1.5: Time for views of a sequence? #102

TransGirlCodes opened this issue May 29, 2020 · 21 comments

Comments

@TransGirlCodes
Copy link
Member

TransGirlCodes commented May 29, 2020

@jakobnissen @CiaranOMara

So julia 1.5 now means structs that have references to mutable are now stack-allocated.
I'm very excited about this. As you know, every LongSequence are mutable and every LongSequence owns its data.

Yes, there is a copy on write mechanism, but that's just an implementation detail to remove copying unless absolutely nessecery.

If you create many views over a sequence - taking advantage of the mechanism, you avoid copying, but you still need to allocate many LongSequence objects on the heap and trigger the gc and all that.

I've long wanted to make proper allocation free views of a sequence - a true view of fixed start and end, where if you edit a base you DO indeed edit the base in the underlying sequence (contrary to the copy on write mechanism which mean the seqs still owns its own data - even if sharing a vec temporarily for performance reasons). But there was never any point as they would just allocate to. Perhaps 1.5 is the time to try something - what do you guys think? I might be a massive boon, for example making kmers with K > 64?

@jakobnissen
Copy link
Member

Yes, se could make a SeqView object - or whatever we should call it.

I'm not sure what the use case is.When do you create so many views that the microsecond-delay caused by an allocation matters? Maybe for genome assembly purposes, where K=127 is sometimes (rarely) used. Even then, a strict wrapping four 64-bit integers may be way more efficient. Do you have a use case, maybe for your GenomeGraphs?

Perhaps it doesn't matter that I don't have a use case, if you build the tool, sometimes people will figure out what to do with it. If we do it, I think we should make it unsafe and high performance: Just wrap the source array and a UnitRange in a struct. If people then resize the underlying array, the segfault is on them.

@TransGirlCodes
Copy link
Member Author

TransGirlCodes commented May 29, 2020

There's also all the indirections and lack of contiguous memory layout if put in a vector in addition to the creation.

So in the group's genome assembly research, we've got an algorithm that unlike any other assembler (even the reportedly reliable ones), really does give you reliable repeat expansion of repetitive regions between two haplotype-specific anchors, no matter the ploidy. It turns out many assemblers have diploid bias and screw up even simple triploid scenarios generated in Pseudoseq where we know the answer, because even though they are not aiming for a diploid assembler or phasing or whatever, they have human genomes on the brain if you like, so when heuristics are dreamed up, there are edge cases and undocumented assumptions that crop up.

Problem is it takes aaaages for graphs constructed with K < 200. Indeed some other assemblers start with a low K and then map reads and expand the graph to a K=200 graph - might as well build the 200 graph in the first place! So an efficient way of dealing with big kmers would be great. In particular, for it to be worth it, you probably need K to require more bits for the data, than are required for the fields of the view (start, end, and so on). A HugeKmer or ViewKmer or something would also allow compaction of the kmers into a single vector, reducing redundant data bits stored twice, but again that's only worth it if K is high enough that the savings are significant.

@TransGirlCodes
Copy link
Member Author

TransGirlCodes commented May 29, 2020

Maybe it's something we need to experiment with - make a baby type that can do a few things, and run some tests vs regular sequences - mapping over vectors of seqs and vectors of views and looking at the benchmark figures. Of course, we'd only see any potential benefit whatsovever on 1.5.

@CiaranOMara
Copy link
Member

Well, I s'pose we ought to set up BenchmarkCI and decide on some problems to benchmark solutions against.

@CiaranOMara
Copy link
Member

CiaranOMara commented Jun 2, 2020

Having digested this over the past few days, I believe the question is, how do we elegantly wrap BitVectors and iterate over them. It seems to me that at the medium level, we only need to reshape BitVectors based on alphabet letter size, then iterate over the columns. Does this seem correct, and if so, is it more efficient to work with/from a BitVector, and is it a clearer concept of a BioSequence?

@jakobnissen
Copy link
Member

I'm not sure what you mean, @CiaranOMara, why would we use bit vectors to represent sequences? Can you explain?

I've also tinkered a little bit with it this weekend and made a prototype. Boy, is it fast! In my simple test where I summed the integer representation of the 11th basepair of every 21-mer in a 1 million long LongSequence{DNAAlphabet{4}}, using views was around 3x faster than using kmers.

First, I think we should use the following data structure:

struct SeqView{A <: Alphabet} <: BioSequence
    data::Vector{UInt64}
    part::UnitRange{Int}
    rc::Bool
end

We could also parameterize with S <: BioSequence instead of A, but I'm not sure it's of much use. For SeqView to be fast (its entire raison d´être), its implementation would have to be tightly coupled to the data structure of its underlying sequence. And what would a SeqView{DNAMer{21}} even mean? Therefore, I think we should just focus on making views of LongSequence. Of course, if in the future, a new subtype of BioSequence is created, it might bite us in the butt that we can't create a SeqView of that type. That's one decision. What do you think?

Second, we should think about what to do with reverse complemetation (or any other kind of transformation). Reverse complementing a view RCs a chunk of the parent sequence, which is quite bad. For use in Kmers, RC'ing is a pretty basic concept, imagine e.g. iterating over canonical kmers. I can see two solutions to this problem:

  1. Just don't allow RC'ing of SeqViews. If you want to RC it, you will have to RC the parent sequence. This would be very annoying for the users if they have to e.g. get all canonical kmers.

  2. Let the struct contain a rc::Bool field that is toggled upon RC'ing. The problem here is that it complicates and slows down almost all other code operating on SeqViews, including simple things like == or getindex. The downside here is that it is very annoying for the developers (us), makes eveything significantly slower, and that it only solves the problem of RC and not any other transformation.

Of these two, I think we should go with option 2. It's bad design, I think, but I'm not sure what else to do.

Also, I think wee should implement some kind of encoded_data_iterator that iterates UInt64 filled with the encoded data, and then implement hash, == etc. using this iterator. Actually, we can probably use the same code for LongSequence and speed up those ops significantly.

@CiaranOMara
Copy link
Member

CiaranOMara commented Jun 2, 2020

Below is roughly the sort variation of sequence data structure I've been exploring.

struct CustomSequence{B}
    data::Base.ReshapedArray{Bool,2,BitArray{1},Tuple{}}
    function CustomSequence{B}(raw::BitVector)
        new(reshape(raw, B, :)))
    end
end

In this data structure, the type parameter B is the number of bits per symbol (I had called this letter size). When constructing this structure, B is used to reshape the raw input, which could be directly from a read backing store. Anyhow, after reshaping the input, each column vector will correspond to a letter/symbol.

This type of data structure, if loosened so that it can take a View as an input, could work for both SeqView and Biosequence. Also, both the Reshaped and View types have links to their parents.

As I delve into this further, I realise this is pretty much a reimplementation of LongSequence using BitArrays from Base.

We could also parameterize with S <: BioSequence instead of A, but I'm not sure it's of much use.

I think the two relevant bits of information are bits per symbol for iteration, and the Alphabet type should there be a need to display something. I believe these are encompassed by the Alphabet type already.

As for the subtypes of BioSequence, the butt bites should be relatively painless if the subtypes implement the same interface. Subtypes could be interchangeable.

And what would a SeqView{DNAMer{21}} even mean?

I understood that as the type returned from an iteration of a 21-character-wide-sliding-window. You'd just return a view into the original sequence, would you not? I Think DNAMer{21} should be an iterator type. As an aside, a generic iterator could select the ith Kmer. For example

iter = Kmer{3}(seq)
second_3mer = iterate(iter, 2)

Reverse complementing a view RCs a chunk of the parent sequence, which is quite bad.

It'd be really useful to splice out a copy... It's a batch and order or operation optimisation problem?

I wonder what SIMD optimisation is possible. For example, using the reshaping concept, it's possible to have the kth Kmers simultaneously available as column vectors. I always thought the alphabets should be designed such that a bit toggle (not operation) results in the complement. Then the kth Kmer complements would be simultaneously available. What is the low-level detail for this?

In terms of high-level structure and interface development, it might be easier to try ideas using the BitArray at the middle-level as it comes with many standard conveniences.

@TransGirlCodes
Copy link
Member Author

TransGirlCodes commented Jun 2, 2020

We could also parameterize with S <: BioSequence instead of A, but I'm not sure it's of much use. For SeqView to be fast (its entire raison d´être), its implementation would have to be tightly coupled to the data structure of its underlying sequence. And what would a SeqView{DNAMer{21}} even mean? Therefore, I think we should just focus on making views of LongSequence. Of course, if in the future, a new subtype of BioSequence is created, it might bite us in the butt that we can't create a SeqView of that type. That's one decision. What do you think?

So for me, the direction the BioSequences API is going in - the one I kinda started to push in when I did v2, was to really start to decouple expected behaviour, from the internal representation.

The structure of a LongSequence is radically different from that of a Mer, and that of an NTupleKmer. It's really quite different to a ReferenceSequence what with that third bitvector.

v1 you had a Kmer and a BioSequence. And a Alphabets were used as type parameters for BioSequence more as a coding convenience because they control everything from how data was encoded into bits to which symbols are allowed in a sequence and other bit shifting internals were dispatched on Alphabet.

With v2 I began to change that. I want to get to a place where:

A: Any methods defined for BioSequence's, are agnostic to any internal representation, rather it simply can assume any methods it called did what they said on the tin.

B: Any concrete sub-type of BioSequence can expect to have the following traits and methods, which unlike V1, have more clearly delineated (and documented! in some kind of dev/interface docs) roles.

  • Alphabet: A trait that should only explicitly determine the symbols allowed in the sequence.
    So if MyPoopySeq <: BioSequence{DNAAlphabet{4}} then the only thing that should be
    assumed from MyPoopySeq is that all IUPAC DNA elements are allowed.
    Nothing about encoding or even bits used per element should be assumed (despite
    the 4 parameter.

  • BitsPerSymbol{N} (or BitsPerElem{N} in NTupleKmers.jl): This trait is actually a trait of the sequence type,
    NOT the alphabet (BUT if you are a developer if you inherit from say BioSequence{DNAAlphabet{4}}, then as a matter of convenience only then there is a fallback for BitsPerSymbol{4}. But it's only a convenience - a default. You could for example have a sequence type that mirrored that in R's ape package, in julia it would have alphabet DNAAlphabet{4}, but use 1 byte per nucleotide, and so would have trait BitsPerSymbol{8}. Because BitsPerSymbol is a trait of a sequence type, just as Alphabet is a trait, one does not outrank the other.

  • ElementOrder: So this one only exists in NTupleKmers right now, but if they are to replace Mers, then this makes sense. There are two possibilities: LeftToRight and RightToLeft. They determine how the compacted elements are laid out in a single register: LongSequence style (base 1 at the LSB end) is RightToLeft, or Mer style (base 1 at the MSB end) is LeftToRight.

  • bitindex when called on a bitindex(sequence, i), returns a constructed BitIndex, essentially a pointer to the element inside of a Tuple or Vector of some Unsigned type.

These and a few other traits - e.g. the exact type of Unsigned used as registers holding the compacted elements, perhaps some kind of encoder and decoder trait, should provide all we need to make really nice internal / developer interface: Given a new / custom sequence type and some traits we can know how the dispatch will go for some things, and method errors will give really clear inciation of which parts of the interface are not being fulfilled.

e.g. getindex(seq, i) -> checkbounds, make a idx = bitindex(seq, i), call bits = extract_encoded_element(seq.data, idx), decode (decoder trait tbd) the bits and return element.

Something internal that reverses elements only has to worry about BitsPerElem and ElementOrder since the translation of bits -> element does not matter.

@CiaranOMara
Copy link
Member

CiaranOMara commented Jun 3, 2020

In term of decoupling internal representation, there is a point at which one needs to know the internal representation. At the moment, the internal representation is inferred from a sequence-alphabet type combination at a vector level. Does it make sense to know the internal representation at an element level or at least be able to specify an element level representation with an element type?

At the moment, it's not immediately obvious how to convert from a 4-bit alphabet to a 2-bit representaion. For example,

seq = LongSequence{DNAAlphabet{4}}("TTAGC")
seq_twobit = collect(something{2}, seq)
# write seq_twobit to file.

The issue here is that the internal bit-level representation and bit packing strategy are unknown (LongSequence encompasses a packing strategy). If these were known at the element level, all of the encoding/decoding and bit packing could be achieved with the convert and collect patterns, which are really easy to work with.

The desired flexibility could be achieved by defining the element type BioSymbol{P,A,R}. Where P is the bit packing strategy, A is the alphabet constraint, and R is the bit-level character representation.

At the user-level, we'd export performant aliases of some BioSymbol PAR combination.

@TransGirlCodes
Copy link
Member Author

TransGirlCodes commented Jun 3, 2020

The packing/unpacking strategy is determined currently - in NTupleKmers by the combination of BitsPerElem, ElementOrder, and eltype(vec/tuple), which are sufficient to correctly dispatch to appropriate bit flipping methods for tasks where inserting/removing/moving values is required (an extra layer of customization could be obtained by combining those two into a PackingStrategy trait or something. I propose the other part, taking those bits and transforming them into an actual BioSymbol type, should be done according to an encoder/decoder trait, and also for bit flipping that does need to care about how a-value is represented in bits as well as how many bits are used per element e.g. complement.

Anyway, I'm gonna continue working on the NTupleKmers to get a working implementation, that I can open a PR for. We can add LTS support and maybe benchmarks to BioSequences and then we'll be in a good position to try any major internal or extrernal API changes.

@CiaranOMara
Copy link
Member

A round-up of ideas/messages.

Yes they [the BioSymbols] are primitives. It doesn't seem like for primitives that closely mirror sizes we see in julia - bytes, 64ints etc, there's much of a performance difference - so far, but it's worth noting custom primitives are not recommended anymore without great care

Right, because machine-level operations work on registers, and there is no gain in using operations and cycles to pack bits if it doesn't lead to a machine-level operation that results in some sort of batch computation. May as well skip the packing and unpacking if there is no net gain.

So, when counting Kmers or bit-streams, once the bit-stream exceeds the capacity of the register, would it be more performant to have an algorithm that compares pointers to singletons? In terms of a SeqView, that idea translates to a data structure with a pointer and a number that represents the location of the Kmer/bit-stream repetition in the larger sequence.

Maybe the unique Kmers could be represented by a radix tree, then the pointer in SeqView would point to a leaf in the radix tree. The Kmer could be retrieved by either using the pointer to traverse from the relevant leaf to the root, or by using the recorded position to retrieve the relevant slice of the original sequence. That'd give us a pointer to a singleton like thing.

@CiaranOMara
Copy link
Member

CiaranOMara commented Jun 6, 2020

Ditching the BioSymbol{P,A,R} idea, parametric primitives make more sense as a way of moving encoding information down to the element level as the bio symbol encoding is roughly analogous to the encoding of UInt64, Int64, Float64, etc.

@CiaranOMara
Copy link
Member

CiaranOMara commented Jun 6, 2020

To elaborate on the parametric primitive scheme, the BioSymbol type would define which lookup/encoding tables to use, which is pretty much the same as is but would allow for new symbol encoding to be tried or added. This additional type would also allow for the conversion between an element of a vector of type BioSymbol.

The BitsPerElem trait would derive information from the BioSymbol type to tell the encoder which bits of the primitive are important.

BitsPerElem(DNA{TwoBit}) = 2

The encoder would then pack the important bits into a sequence whose traits are either RightToLeft or LeftToRight, and another one that defines how to pack the encodings between boundaries. Where the boundaries are the limits of the sizes of the primitive types supported in Julia. Currently, only sizes that are multiples of 8 bits are supported. Maybe endianness too, if that's not captured already, which may be relevant when packing 3mer of 2-bit encoded symbols.

It occurs to me that not only could these encoders push bits into an UInt64, but they could also write bits to an IOStream - another avenue to explore.

The decoder would place the bits into the 8 bits of the BioSymbol primitive, effectively expanding a bit-stream into Vector{BioSymbol}.

DNA{TwoBit}(bs::NTuple{2,Bool}) = reinterpret(bx0<however you splice these>)

In this scheme, encoded and packed symbols are BioSequences, whether they be NTupleKmers, LongSequence, etc.

@CiaranOMara
Copy link
Member

CiaranOMara commented Jun 6, 2020

MWE of BioSymbols as parametric primitives: primitive_alphabet.jl.

@TransGirlCodes
Copy link
Member Author

I have a feeling that is going to be more inefficient vs what we already have. The encoding constructs a tuple of two bytes, to represent two bits. When currently a single integer is used. Current internals insert N bits at a time, once per element, whereas in the MWE every single bit is set individually in that pack method.

I'm not sold on the philosophy of this either. This makes the elements - BioSymbols - "aware" of their container's - which I'm not sure they should be, insofar as it grants them control over how container objects - sequence types - store them. Int's do not dictate to the Vector or BitVector how they are stored - that's up to the vectors.

I'm all for rethinks of the bit-packing framework, but I think elements should just stay elements. There's an inelegance to having to say "adenine" possibly represented by N slightly different instantiations of a parametric type, that have different effects on different containers. The current API is neat since there's a separation of concerns: You pass DNA_A to a sequence - the sequence tells you if it's allowed and decides how it will be handled - DNA_A (or it's type) gets no "say" in the matter.

@CiaranOMara
Copy link
Member

I agree that Ints don't get to dictate where they're stored, but they do get to define boundaries for their content and set expectations as to how their bits are organised. Their handlers do get to know what type of package they're dealing with, and in some cases, the handlers get to limit the packages they accept or hold - I'm thinking of Vector{UInt8}.

There's an inelegance to having to say "adenine" possibly represented by N slightly different instantiations of a parametric type

Ints and Floats both encode numerical symbols. How do UInt64, Int64, or Float64 differ when they encode the same numerical symbol?

This makes the elements - BioSymbols - "aware" of their container's - which I'm not sure they should be, insofar as it grants them control over how container objects - sequence types - store them. Int's do not dictate to the Vector or BitVector how they are stored - that's up to the vectors.

The intent was to make BioSymbols aware of the type of encoding they hold so they have the information required to convert to other encodings at the 8-bit element level and to set expectations as to the location and number of their useful bits.

I've updated the MWE (revision 2) to remove that tuple distraction. I also filled out the example with more of the core features. The MWE is now able to compress, pack, unpack, and decompress symbols with 2 or 4 bits of useful encoding (revision 5). There is no low-level optimisation. It is only a high-level exploration of what the interfaces/workflow might be like if BioSymbols were aware of their type of encoding/contents.

@TransGirlCodes
Copy link
Member Author

Ok, I've thought about this more, and it makes sense to me if, the symbol types become more like alphabets, say instead of just a DNA type you had DNA (ATCG), Ext(ended)DNA (GATCBDSW), and AmbDNA (ATCG + ambiguities and so on). These types all encode values - say 'adenine' - as they will, and the types themselves serve a similar role to Alphabets as being a set of all permitted symbols. Then by having well-defined promotion and conversion rules, Encoding/representation + values permitted is handled by symbol types, and sequences just deal with packing, and we can have well-defined promotion rules between symbol types that are compatible (e.g. promote(DNA, ExtDNA) -> (ExtDNA, ExtDNA)). However, I can see sneaky performance pitfalls if we're not careful: E.g. DNA_A .== seq::LongSequence{ExtDNA}, would it end up doing a ton of DNA_A -> ExtDNA_A conversions? Or worse: ExtDNA_A .== seq::LongSequence{DNA} - it would want to convert every element of seq to an ExtDNA, when a human can clearly see actually given the value A is present in both types, you could just turn the ExtDNA_A to a DNA_A once to avoid the conversion of every seq element, which is almost the opposite of promotion - choosing the smaller alphabet that can represent both values, which isn't really a choice you can make based only on types, it takes knowledge that the value is A, which it may not be.

@jakobnissen
Copy link
Member

Apologies, but I am not sure if

  • The conversation is drifting away from the original goal of creating allocation-free views of sequences. Is reinventing the encoding scheme really necessary to do this, and
  • Disregarding sequence views, what is the benefit of changing LongSequence encoding to the system proposed above?

@TransGirlCodes
Copy link
Member Author

To point one - it is drifting!

To point 2, I can see some possible benefits but they're not great enough to push it as a priority for me. Mostly to do with the separation of concerns: If symbol types are concerned with representation at the bit level - it effectively moves encoding/alphabet stuff from BioSequences.jl to BioSymbols.jl, and a sequence type's job is to simply pack bits of the symbol types - as they are - into registers, which is simpler than encoding and packing. It also means as I describe above you could have different symbol types/sets: DNA, IUPAC_DNA, APE_DNA(R package representation) and so on and so on, but with well-defined promotion and conversion rules it would be as convenient as all the sequence types decoding to the same eltype - it might be more flexible even. But ultimately to see any benefit someone would have to try it and see how using such an interface "feels" vs the current one.

@CiaranOMara
Copy link
Member

To round off the alphabet stuff, I'll open up an experimental branch to develop the suggested parametric primitive alphabets. I think LongSequence is brilliant and that any changes or reorganisation would need to offer comparable performance. I also think the delineation of BioSequences as a bit packer is useful and nicely separates the concerns of the two packages.

Onto the views, It seems to me that a "view" would index into an optimised bit vector.
Should the view only contain location information, or if it makes sense, is it possible for it to hold a copy of some data?

@jakobnissen, from your https://github.com/jakobnissen/hardware_introduction work, do you have a sense as to what the upper size limit is for an allocation free data-structure?

As an aside, if this thing is to be called a view, I'd expect it to implement the view interface and behave as views do in Base.

This was referenced Jul 21, 2020
@jakobnissen
Copy link
Member

Closed by #120 .
@CiaranOMara @benjward This thread has some discussion about encodings/alphabets - if you feel there are more to discuss about that, please re-open the issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants