Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Further generalise the bitparallel counting code #27

Closed
TransGirlCodes opened this issue Sep 28, 2017 · 3 comments
Closed

Further generalise the bitparallel counting code #27

TransGirlCodes opened this issue Sep 28, 2017 · 3 comments
Assignees

Comments

@TransGirlCodes
Copy link
Member

TransGirlCodes commented Sep 28, 2017

There are other operations I've seen need of - computing the number of segregating sites in a population sample of sequences and so on, that is not purely counting the number of a kind of site between two sequences, but that CAN be done, with a bit of re-working, with the bit-parallel counting methods already developed here. So previously I generalised the code, to counting different kinds of sites bit-parallel, now I'm going to generalise it further to let it do much more than just pairwise site counting. Related PR to come.

@TransGirlCodes TransGirlCodes self-assigned this Sep 28, 2017
@timbitz
Copy link
Member

timbitz commented Nov 7, 2017

@Ward9250, @bicycle1885, I needed to calculate gc_content quickly for fastq reads and was looking into the current implementation and saw it was iterating through each nucleotide to calculate gc content or test for ambiguity (which means lots of needless masking and shifting?). I figured that some bitparallel code would be faster, so I wrote some: https://gist.github.com/timbitz/cb728a9dcddf49f3c3a29e81102f752f
The implementations in that Gist are 10-30x faster than iterating through every nucleotide, for long sequences.

julia> longseq = dna"ATGACAAGATGACAGATAGACAGATAAGACAAAAATGGGT" ^ 1_000_000
40000000nt DNA Sequence:
ATGACAAGATGACAGATAGACAGATAAGACAAAAATGGGTGACAAGATGACAGATAGACAGATAAGACAAAAATGGGT

julia> longseq_ambig = deepcopy(longseq)
40000000nt DNA Sequence:
ATGACAAGATGACAGATAGACAGATAAGACAAAAATGGGTGACAAGATGACAGATAGACAGATAAGACAAAAATGGGT

julia> longseq_ambig[end] = DNA_N
DNA_N

julia> @time parallel_hasambiguity(longseq)
  0.008842 seconds (4 allocations: 160 bytes)
false

julia> @time parallel_hasambiguity(longseq_ambig)
  0.008782 seconds (4 allocations: 160 bytes)
true

julia> @time hasambiguity(longseq)
  0.113598 seconds (4 allocations: 160 bytes)
false

julia> @time hasambiguity(longseq_ambig)
  0.110090 seconds (4 allocations: 160 bytes)
true

julia> @time parallel_gc_content(longseq)
  0.004346 seconds (5 allocations: 176 bytes)
0.35

julia> @time gc_content(longseq)
  0.132628 seconds (5 allocations: 176 bytes)
0.35

The parallel_hasambiguity is probably sufficient already... but the parallel_gc_content is not identical to the current implementation which handles ambiguity very specifically: (S nucleotides get counts, whereas other ambiguous nucleotides do not) and the parallel version in the gist does not handle ambiguity at all. I was hoping to open a discussion about how best to handle this in BioSequences.jl. Since I don't have ambiguous nucleotides for my uses, the above works fine... but given the performance increase it seems like a more general solution suitable for BioSequences.jl would be beneficial. I was thinking there could be a few solutions:

  1. The speed offered by the parallel versions is so much greater than iterating each nucleotide that a simple hack would be to just run parallel_hasambiguity, then based on the result dispatch gc_content or parallel_gc_content. Obviously not the most elegant or efficient solution, but probably still faster than present.

  2. Implement a version of parallel_gc_content that handles ambiguity properly... for example it could check each block for too many ones first, and if ambiguity is found handles them nucleotide by nucleotide, else it masks with 0x6666666666666666.

  3. Add the parallel version with a name that underscores its lack of ambiguity handling, and test for ambiguity in each block and throw an error if found.

What do you guys think? Is there a better solution I am missing?

@bicycle1885
Copy link
Member

bicycle1885 commented Nov 8, 2017

Thank you @timbitz. The performance gain looks great! I've tried a quick implementation of the idea and the performance has been dramatically improved (roughly x40 faster):

import BioSequences: bitindex, index, randdnaseq, gc_content

function parallel_gc_content(seq)
    gc = 0
    idx = bitindex(seq, 1)
    stop = bitindex(seq, endof(seq) + 1)
    while idx < stop
        @inbounds x = seq.data[index(idx)]
        a =  x & 0x1111111111111111
        c = (x & 0x2222222222222222) >> 1
        g = (x & 0x4444444444444444) >> 2
        t = (x & 0x8888888888888888) >> 3
        gc += count_ones((c | g) & ~(a | t))
        idx += 64
    end
    return isempty(seq) ? 0.0 : gc / length(seq)
end
julia> using Compat, BenchmarkTools

julia> srand(1234); seq = randdnaseq(2^16);

julia> @benchmark gc_content(seq)
BenchmarkTools.Trial:
  memory estimate:  16 bytes
  allocs estimate:  1
  --------------
  minimum time:     505.569 μs (0.00% GC)
  median time:      517.594 μs (0.00% GC)
  mean time:        549.456 μs (0.00% GC)
  maximum time:     4.973 ms (0.00% GC)
  --------------
  samples:          9068
  evals/sample:     1

julia> @benchmark parallel_gc_content(seq)
BenchmarkTools.Trial:
  memory estimate:  16 bytes
  allocs estimate:  1
  --------------
  minimum time:     11.673 μs (0.00% GC)
  median time:      11.794 μs (0.00% GC)
  mean time:        13.180 μs (0.00% GC)
  maximum time:     84.459 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> gc_content(seq) == parallel_gc_content(seq)
true

The implementation above is very rough (it doesn't check boundaries so the length of a sequence must be a multiple of 16) but handles ambiguity in the same way as the current gc_content.

@timbitz
Copy link
Member

timbitz commented Nov 8, 2017

Very cool! @bicycle1885. The shift of each position to allow the AND/OR logic is great, a very natural solution to handle the potential ambiguity. Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants