Further generalise the bitparallel counting code #27
Comments
@Ward9250, @bicycle1885, I needed to calculate gc_content quickly for FASTQ reads. Looking into the current implementation, I saw that it iterates through each nucleotide to calculate GC content or test for ambiguity (which means lots of needless masking and shifting?). I figured that some bit-parallel code would be faster, so I wrote some: https://gist.github.com/timbitz/cb728a9dcddf49f3c3a29e81102f752f
julia> longseq = dna"ATGACAAGATGACAGATAGACAGATAAGACAAAAATGGGT" ^ 1_000_000
40000000nt DNA Sequence:
ATGACAAGATGACAGATAGACAGATAAGACAAAAATGGG…TGACAAGATGACAGATAGACAGATAAGACAAAAATGGGT
julia> longseq_ambig = deepcopy(longseq)
40000000nt DNA Sequence:
ATGACAAGATGACAGATAGACAGATAAGACAAAAATGGG…TGACAAGATGACAGATAGACAGATAAGACAAAAATGGGT
julia> longseq_ambig[end] = DNA_N
DNA_N
julia> @time parallel_hasambiguity(longseq)
0.008842 seconds (4 allocations: 160 bytes)
false
julia> @time parallel_hasambiguity(longseq_ambig)
0.008782 seconds (4 allocations: 160 bytes)
true
julia> @time hasambiguity(longseq)
0.113598 seconds (4 allocations: 160 bytes)
false
julia> @time hasambiguity(longseq_ambig)
0.110090 seconds (4 allocations: 160 bytes)
true
julia> @time parallel_gc_content(longseq)
0.004346 seconds (5 allocations: 176 bytes)
0.35
julia> @time gc_content(longseq)
0.132628 seconds (5 allocations: 176 bytes)
0.35 The
What do you guys think? Is there a better solution I am missing?
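For readers who don't follow the gist link, here is a minimal sketch of how a bit-parallel ambiguity test can work on one 64-bit word at a time. It is not necessarily what the gist does. It assumes the 4-bit encoding with one bit per base (A = 0001, C = 0010, G = 0100, T = 1000), so that any nibble with two or more bits set is an ambiguity code; `word_hasambiguity` and `words_hasambiguity` are hypothetical names, not BioSequences functions.

```julia
# Assumed 4-bit encoding, one bit per base: A = 0001, C = 0010, G = 0100, T = 1000.
# A nibble with more than one bit set is an ambiguity code (e.g. N = 1111).

# Hypothetical helper: true if any of the 16 nibbles in `x` has two or more bits set.
function word_hasambiguity(x::UInt64)
    m = 0x1111111111111111
    # Per-nibble popcount: each term contributes 0 or 1 to the low bit of its nibble,
    # so the sum is at most 4 and never carries into the neighbouring nibble.
    s = (x & m) + ((x >> 1) & m) + ((x >> 2) & m) + ((x >> 3) & m)
    # A count of 2, 3 or 4 sets bit 1 or bit 2 of the nibble.
    return (s & 0x6666666666666666) != 0
end

# Hypothetical wrapper over a vector of packed words.
words_hasambiguity(data::Vector{UInt64}) = any(word_hasambiguity, data)
```

Each call inspects 16 bases with a handful of shifts, masks and adds, instead of extracting and testing every nucleotide separately.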
Thank you @timbitz. The performance gain looks great! I've tried a quick implementation of the idea and the performance improved dramatically (roughly 40x faster):
import BioSequences: bitindex, index, randdnaseq, gc_content
function parallel_gc_content(seq)
    gc = 0
    idx = bitindex(seq, 1)
    stop = bitindex(seq, endof(seq) + 1)
    while idx < stop
        # Load one 64-bit word, i.e. 16 nucleotides at 4 bits each.
        @inbounds x = seq.data[index(idx)]
        # Move each base's bit to the lowest bit of its nibble.
        a = x & 0x1111111111111111
        c = (x & 0x2222222222222222) >> 1
        g = (x & 0x4444444444444444) >> 2
        t = (x & 0x8888888888888888) >> 3
        # Count nibbles that contain C and/or G but neither A nor T.
        gc += count_ones((c | g) & ~(a | t))
        idx += 64
    end
    return isempty(seq) ? 0.0 : gc / length(seq)
end
The implementation above is very rough (it doesn't check boundaries, so the length of a sequence must be a multiple of 16) but handles ambiguity in the same way as the current
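One way the multiple-of-16 restriction could be lifted is to mask off the unused nibbles of the last word before counting. The sketch below is self-contained and does not use BioSequences internals: `pack4`, `gc_count_word` and `gc_fraction` are hypothetical helpers, and the encoding (A = 0x1, C = 0x2, G = 0x4, T = 0x8, first base in the lowest nibble) is assumed from the masks in the snippet above.

```julia
# Hypothetical helper: pack a DNA string into 4-bit nibbles, 16 bases per UInt64,
# first base in the lowest nibble. Assumed encoding: A = 0x1, C = 0x2, G = 0x4,
# T = 0x8, with S (C or G) = 0x6 and N = 0xf as example ambiguity codes.
function pack4(s::AbstractString)
    code = Dict('A' => 0x1, 'C' => 0x2, 'G' => 0x4, 'T' => 0x8, 'S' => 0x6, 'N' => 0xf)
    data = zeros(UInt64, cld(length(s), 16))
    for (i, ch) in enumerate(s)
        w, off = divrem(i - 1, 16)
        data[w + 1] |= UInt64(code[ch]) << (4 * off)
    end
    return data
end

# Count nibbles that contain C and/or G but neither A nor T.
function gc_count_word(x::UInt64)
    m = 0x1111111111111111
    a = x & m
    c = (x >> 1) & m
    g = (x >> 2) & m
    t = (x >> 3) & m
    return count_ones((c | g) & ~(a | t))
end

# GC fraction of the first `n` bases stored in `data`. Bits past base `n` in the
# last word are masked off, so `n` need not be a multiple of 16.
function gc_fraction(data::Vector{UInt64}, n::Integer)
    n == 0 && return 0.0
    nfull, r = divrem(n, 16)
    gc = 0
    for i in 1:nfull
        gc += gc_count_word(data[i])
    end
    if r > 0
        tailmask = (UInt64(1) << (4 * r)) - UInt64(1)
        gc += gc_count_word(data[nfull + 1] & tailmask)
    end
    return gc / n
end
```

With this sketch, `gc_fraction(pack4("ATGACAAGATGACAGATAGACAGATAAGACAAAAATGGGT"), 40)` gives 0.35, matching the value reported for `longseq` earlier in the thread. An `S` nibble is counted as GC, while an `N` nibble is not, because its A and T bits are set.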
Very cool, @bicycle1885! Shifting each base's bit into the same position so that plain AND/OR logic applies is great, and a very natural way to handle potential ambiguity. Thanks!
There are other operations I've seen a need for, such as computing the number of segregating sites in a population sample of sequences. These are not purely a matter of counting one kind of site between two sequences, but they CAN be done, with a bit of reworking, using the bit-parallel counting methods already developed here. I previously generalised the code to count different kinds of sites bit-parallel; now I'm going to generalise it further so that it can do much more than pairwise site counting. Related PR to come.
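To illustrate the kind of non-pairwise operation meant here, below is a rough sketch of counting segregating sites across several sequences at once. It is an assumption about how this could be done, not the design of the planned PR: `count_segregating` is a hypothetical name, each sequence is represented as a plain vector of UInt64 words holding 16 four-bit sites each, and ambiguity codes are simply treated as distinct symbols.

```julia
# Hypothetical helper: number of segregating sites among equal-length sequences,
# each given as a vector of UInt64 words holding 16 four-bit sites per word.
# Assumes unused bits in the last word are identical (e.g. zero) in every sequence.
function count_segregating(seqs::Vector{Vector{UInt64}})
    isempty(seqs) && return 0
    ref = first(seqs)
    m = 0x1111111111111111
    total = 0
    for w in eachindex(ref)
        diff = zero(UInt64)
        for s in seqs
            # Set bits mark where `s` departs from the first sequence.
            diff |= xor(s[w], ref[w])
        end
        # Collapse each nibble to its low bit: 1 if any of its four bits differed.
        sites = (diff | (diff >> 1) | (diff >> 2) | (diff >> 3)) & m
        total += count_ones(sites)
    end
    return total
end
```

Comparing every sequence against the first is enough: a site is segregating exactly when at least one sequence differs from the first one there.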