Further generalise the bitparallel counting code #27

TransGirlCodes · 2017-09-28T13:12:14Z

There are other operations I've seen need of - computing the number of segregating sites in a population sample of sequences and so on, that is not purely counting the number of a kind of site between two sequences, but that CAN be done, with a bit of re-working, with the bit-parallel counting methods already developed here. So previously I generalised the code, to counting different kinds of sites bit-parallel, now I'm going to generalise it further to let it do much more than just pairwise site counting. Related PR to come.

timbitz · 2017-11-07T19:37:21Z

@Ward9250, @bicycle1885, I needed to calculate gc_content quickly for fastq reads and was looking into the current implementation and saw it was iterating through each nucleotide to calculate gc content or test for ambiguity (which means lots of needless masking and shifting?). I figured that some bitparallel code would be faster, so I wrote some: https://gist.github.com/timbitz/cb728a9dcddf49f3c3a29e81102f752f
The implementations in that Gist are 10-30x faster than iterating through every nucleotide, for long sequences.

julia> longseq = dna"ATGACAAGATGACAGATAGACAGATAAGACAAAAATGGGT" ^ 1_000_000
40000000nt DNA Sequence:
ATGACAAGATGACAGATAGACAGATAAGACAAAAATGGG…TGACAAGATGACAGATAGACAGATAAGACAAAAATGGGT

julia> longseq_ambig = deepcopy(longseq)
40000000nt DNA Sequence:
ATGACAAGATGACAGATAGACAGATAAGACAAAAATGGG…TGACAAGATGACAGATAGACAGATAAGACAAAAATGGGT

julia> longseq_ambig[end] = DNA_N
DNA_N

julia> @time parallel_hasambiguity(longseq)
  0.008842 seconds (4 allocations: 160 bytes)
false

julia> @time parallel_hasambiguity(longseq_ambig)
  0.008782 seconds (4 allocations: 160 bytes)
true

julia> @time hasambiguity(longseq)
  0.113598 seconds (4 allocations: 160 bytes)
false

julia> @time hasambiguity(longseq_ambig)
  0.110090 seconds (4 allocations: 160 bytes)
true

julia> @time parallel_gc_content(longseq)
  0.004346 seconds (5 allocations: 176 bytes)
0.35

julia> @time gc_content(longseq)
  0.132628 seconds (5 allocations: 176 bytes)
0.35

The parallel_hasambiguity is probably sufficient already... but the parallel_gc_content is not identical to the current implementation which handles ambiguity very specifically: (S nucleotides get counts, whereas other ambiguous nucleotides do not) and the parallel version in the gist does not handle ambiguity at all. I was hoping to open a discussion about how best to handle this in BioSequences.jl. Since I don't have ambiguous nucleotides for my uses, the above works fine... but given the performance increase it seems like a more general solution suitable for BioSequences.jl would be beneficial. I was thinking there could be a few solutions:

The speed offered by the parallel versions is so much greater than iterating each nucleotide that a simple hack would be to just run parallel_hasambiguity, then based on the result dispatch gc_content or parallel_gc_content. Obviously not the most elegant or efficient solution, but probably still faster than present.
Implement a version of parallel_gc_content that handles ambiguity properly... for example it could check each block for too many ones first, and if ambiguity is found handles them nucleotide by nucleotide, else it masks with 0x6666666666666666.
Add the parallel version with a name that underscores its lack of ambiguity handling, and test for ambiguity in each block and throw an error if found.

What do you guys think? Is there a better solution I am missing?

bicycle1885 · 2017-11-08T11:27:42Z

Thank you @timbitz. The performance gain looks great! I've tried a quick implementation of the idea and the performance has been dramatically improved (roughly x40 faster):

import BioSequences: bitindex, index, randdnaseq, gc_content

function parallel_gc_content(seq)
    gc = 0
    idx = bitindex(seq, 1)
    stop = bitindex(seq, endof(seq) + 1)
    while idx < stop
        @inbounds x = seq.data[index(idx)]
        a =  x & 0x1111111111111111
        c = (x & 0x2222222222222222) >> 1
        g = (x & 0x4444444444444444) >> 2
        t = (x & 0x8888888888888888) >> 3
        gc += count_ones((c | g) & ~(a | t))
        idx += 64
    end
    return isempty(seq) ? 0.0 : gc / length(seq)
end

julia> using Compat, BenchmarkTools

julia> srand(1234); seq = randdnaseq(2^16);

julia> @benchmark gc_content(seq)
BenchmarkTools.Trial:
  memory estimate:  16 bytes
  allocs estimate:  1
  --------------
  minimum time:     505.569 μs (0.00% GC)
  median time:      517.594 μs (0.00% GC)
  mean time:        549.456 μs (0.00% GC)
  maximum time:     4.973 ms (0.00% GC)
  --------------
  samples:          9068
  evals/sample:     1

julia> @benchmark parallel_gc_content(seq)
BenchmarkTools.Trial:
  memory estimate:  16 bytes
  allocs estimate:  1
  --------------
  minimum time:     11.673 μs (0.00% GC)
  median time:      11.794 μs (0.00% GC)
  mean time:        13.180 μs (0.00% GC)
  maximum time:     84.459 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> gc_content(seq) == parallel_gc_content(seq)
true

The implementation above is very rough (it doesn't check boundaries so the length of a sequence must be a multiple of 16) but handles ambiguity in the same way as the current gc_content.

timbitz · 2017-11-08T17:02:24Z

Very cool! @bicycle1885. The shift of each position to allow the AND/OR logic is great, a very natural solution to handle the potential ambiguity. Thanks!

TransGirlCodes self-assigned this Sep 28, 2017

bicycle1885 mentioned this issue Nov 8, 2017

implement bit-parallel GC counter #29

Merged

TransGirlCodes closed this as completed Aug 5, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Further generalise the bitparallel counting code #27

Further generalise the bitparallel counting code #27

TransGirlCodes commented Sep 28, 2017 •

edited

Loading

timbitz commented Nov 7, 2017

bicycle1885 commented Nov 8, 2017 •

edited

Loading

timbitz commented Nov 8, 2017

Further generalise the bitparallel counting code #27

Further generalise the bitparallel counting code #27

Comments

TransGirlCodes commented Sep 28, 2017 • edited Loading

timbitz commented Nov 7, 2017

bicycle1885 commented Nov 8, 2017 • edited Loading

timbitz commented Nov 8, 2017

TransGirlCodes commented Sep 28, 2017 •

edited

Loading

bicycle1885 commented Nov 8, 2017 •

edited

Loading