Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Performance review before release #86

Open
jakobnissen opened this issue Jan 15, 2020 · 11 comments
Open

Performance review before release #86

jakobnissen opened this issue Jan 15, 2020 · 11 comments

Comments

@jakobnissen
Copy link
Contributor

@jakobnissen jakobnissen commented Jan 15, 2020

So @BenJWard have mentioned wanting to make a new release of BioSequences that includes performance improvements. We might as well squeeze whatever we have on our performance wishlist into that release.

When #83 and #84 gets merged, I have two smaller PRs lined up which builds on top of those:

  • One that improves the performance of printing LongSequence. Currently, they are printed base-by-base, which is inefficient in 1.3 and above due to IO operations locking threads. Also, we can exploit the ASCII-related functions I created in the other PRs to make it way faster for ASCII sequences.
  • One that introduces a copy! method, such that you can in-place modify a single sequence when you are iterating over sequences, similar to what you do for FASTA records. This is just a simple wrapper around copyto!, or encode_copy! or encode_chunks, depending on source and destination types.

Benchmark

We can use the RC benchmark, created by the Seq guys. This benchmark covers almost all the included benchmarks. I'll be Julia 1.3 on my native computer instead of in a VM. This is comparing all current PRs with commit 61b8fae.

Before: 100.9 seconds
After:    9.2 seconds

TODO:

  • Improve LongSequence printing (PR #91 )
  • Add copy! method ( PR #87)
  • Add docs to copy! method (PR #95 )
  • Optimize length(::LongSequence) (PR #88 )
  • Optimize resize!(::LongSequence) (PR #92)
  • Optimize orphan!(::LongSequence) (PR #88)
  • Review bitpar counting kernels ( PR #96 )
  • Review ReferenceSequence ( PR #94 )
  • SIMD-vectorize reverse complementation ( PR #97 )
  • Consider optimizing random nucleotide generation (issue #77 )
@BenJWard

This comment has been minimized.

Copy link
Member

@BenJWard BenJWard commented Jan 16, 2020

We should add some docs for the copy! method as well.

@jakobnissen

This comment has been minimized.

Copy link
Contributor Author

@jakobnissen jakobnissen commented Jan 16, 2020

Good call. I'll add that to #88 perhaps, which has turned into a big pile of small changes anyway. Alternatively, I could just ditch #88 , make one small PR with the orphan! fix, that can then be backported to 1.0, and then a big PR with all the optimizations and the copy! docs.

One thing I'm not quite sure how to approach: The overhead from incremental writes in print(io::IO, ::LongSequence). This is destroying performance. Here's the options as far as I can see:

  1. Don't change anything. Printing sequences will stay insufferably slow, but it's the easiest thing to do.
  2. Make printing of sequences go though an IOBuffer. This is simple to implement, but it is still a little slow, and it also can unlimited capacity, meaning a hundred-million-base chromosome could be kept in memory with 1 byte per nt. Ouch.
  3. Use BufferedStreams. This adds a dependency to BioSequences but is easy to use and performant.
  4. Implement a very efficient simple buffer for internal use. This is the fastest, but it's just more code to maintain. I was thinking something like
mutable struct SimpleBuffer{T} <: IO
    arr::Vector{UInt8}
    len::Int
    output::T
end

SimpleBuffer(io::IO) = SimpleBuffer{typeof(io)}(Vector{UInt8}(undef, 1024), 0, io)

function Base.write(sb::SimpleBuffer, byte::UInt8)
    len = sb.len + 1
    len > 1024 && flush(sb)
    sb.len = len
    @inbounds sb.arr[len] = byte
end

@noinline function Base.flush(sb::SimpleBuffer)
    write(sb.output, sb.arr)
    sb.len = 0
end
@BenJWard

This comment has been minimized.

Copy link
Member

@BenJWard BenJWard commented Jan 16, 2020

I'd be tempted by option 4. BufferedStreams was designed largely with Automa in mind as well, and has since been replaced by TranscodingStreams by Kenta. I think a simple buffer type for BioSequences is acceptable.

@jakobnissen

This comment has been minimized.

Copy link
Contributor Author

@jakobnissen jakobnissen commented Jan 18, 2020

We should look at the bitpar counting functions:

  1. There is no documentation on what the two-argument count(isambiguous, seq, seq) does. I can't work it out.
  2. They don't seem to work properly. n_ambiguous(seq) dispatches to count_naive, which it shouldn't. I think there is no compiled method for a single-seq count(isambiguous, seq).
  3. Performance could maybe take a review. count(isGC, seq) (which does correctly dispatch to the bitparallel one) takes 7.8 µs on my computer for a 100,000 4-bit DNA seq. I wrote a simple function which completes in 2.6 µs.

Maybe a few simple optimizations may do a lot of work:

  • Mask bitshifts with 63 e.g. y >> (sh & 63) instead of y >> sh. Maybe even make a function shr(n::Union{Int, UInt}, sh::Base.BitInteger) = n >>> (sh & 63).
  • Replace && and || with & and |, respectively (unless short-circuiting saves time).
@BenJWard

This comment has been minimized.

Copy link
Member

@BenJWard BenJWard commented Jan 25, 2020

Thought I would checkout a branch and see about the bitcounting, as I wrote a fair amount of it, so it's my fault. I'm not sure if I can improve

@inline function gc_bitcount(x::Unsigned, ::BitsPerSymbol{2})
    msk = repeatpattern(typeof(x), 0x55)
    c = x & msk
    g = (x >> 1) & msk
    return count_ones(c ⊻ g)
end

Much furthur as it compiles to:

julia> @code_typed BioSequences.gc_bitcount(0xffffffffffffffff, BioSequences.BitsPerSymbol{2}())
CodeInfo(
1 ─ %1  = (Base.and_int)(x, 0x5555555555555555)::UInt64
│   %2  = (Base.lshr_int)(x, 0x0000000000000001)::UInt64
│   %3  = (Base.and_int)(%2, 0x5555555555555555)::UInt64
│   %4  = (Base.xor_int)(%1, %3)::UInt64
│   %5  = (Base.ctpop_int)(%4)::UInt64
│   %6  = (Core.lshr_int)(%5, 63)::UInt64
│   %7  = (Core.trunc_int)(Core.UInt8, %6)::UInt8
│   %8  = (Core.eq_int)(%7, 0x01)::Bool
└──       goto #3 if not %8
2 ─       invoke Core.throw_inexacterror(:check_top_bit::Symbol, UInt64::Any, %5::UInt64)::Union{}
└──       $(Expr(:unreachable))::Union{}
3 ┄       goto #4
4 ─ %13 = (Core.bitcast)(Core.Int64, %5)::Int64
└──       goto #5
5 ─       goto #6
6 ─       goto #7
7 ─       return %13
) => Int64

julia> @code_native BioSequences.gc_bitcount(0xffffffffffffffff, BioSequences.BitsPerSymbol{2}())
	.section	__TEXT,__text,regular,pure_instructions
; ┌ @ bit-manipulation.jl:33 within `gc_bitcount'
; │┌ @ bit-manipulation.jl:31 within `>>'
	movq	%rdi, %rax
	shrq	%rax
; │└
; │ @ bit-manipulation.jl:34 within `gc_bitcount'
; │┌ @ int.jl:321 within `xor'
	xorq	%rdi, %rax
	movabsq	$6148914691236517205, %rcx ## imm = 0x5555555555555555
	andq	%rax, %rcx
; │└
; │┌ @ int.jl:358 within `count_ones'
	popcntq	%rcx, %rax
; │└
	retq
	nopl	(%rax)
; └

Although where the UInt8 is comming from is beyond me.

@BenJWard

This comment has been minimized.

Copy link
Member

@BenJWard BenJWard commented Jan 25, 2020

This is what I get for count(isGC, s):

julia> @benchmark count(isGC, s)
BenchmarkTools.Trial: 
  memory estimate:  16 bytes
  allocs estimate:  1
  --------------
  minimum time:     840.271 ns (0.00% GC)
  median time:      854.543 ns (0.00% GC)
  mean time:        889.395 ns (0.00% GC)
  maximum time:     2.612 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     70

I'll have a look at making some optimizations shortly.

@jakobnissen

This comment has been minimized.

Copy link
Contributor Author

@jakobnissen jakobnissen commented Jan 25, 2020

Edit: Oops, I miswrote in the other post. The 7 vs 2 µs was for a sequence of 100 kbp, not 10 kbp.

I think gc.bitcount is very effective. I'm not exactly sure what goes wrong, but something is not quite optimal in the compiled one. For example, try comparing count(isGC, seq) on a long sequence, say, 10 kbp, with this:

function simple_gc(seq::LongSequence{A}) where {A <: Alphabet}
    ngc = 0
    ind = BioSequences.bitindex(seq, 1)
    stop = BioSequences.lastbitindex(seq) + BioSequences.bits_per_symbol(seq)

    while !iszero(BioSequences.offset(ind)) & (ind ≤ stop)
        ngc += isGC(BioSequences.inbounds_getindex(seq, ind))
        ind += BioSequences.bits_per_symbol(seq)
    end

    data = seq.data
    lastind = BioSequences.index(stop - BioSequences.bits_per_symbol(seq))
    lastind -= !iszero(BioSequences.offset(stop))
    @inbounds for i in BioSequences.index(ind):lastind
        ngc += BioSequences.gc_bitcount(data[i], BioSequences.BitsPerSymbol(seq))
        ind += 64
    end

    while ind ≤ stop
        ngc += isGC(BioSequences.inbounds_getindex(seq, ind))
        ind += BioSequences.bits_per_symbol(seq)
    end

    return ngc
end
@BenJWard

This comment has been minimized.

Copy link
Member

@BenJWard BenJWard commented Jan 27, 2020

julia> @code_native simple_gc(s)
	.section	__TEXT,__text,regular,pure_instructions
; ┌ @ REPL[5]:3 within `simple_gc'
; │┌ @ indexing.jl:10 within `bitindex'
; ││┌ @ REPL[5]:2 within `getproperty'
	pushq	%r15
	pushq	%r14
	pushq	%r13
	pushq	%r12
	pushq	%rbx
	movq	8(%rdi), %rdx
; │└└
; │ @ REPL[5]:4 within `simple_gc'
; │┌ @ indexing.jl:33 within `lastbitindex'
; ││┌ @ indexing.jl:109 within `lastindex'
; │││┌ @ longsequence.jl:71 within `length'
; ││││┌ @ sysimg.jl:18 within `getproperty'
	movq	16(%rdi), %rax
; │││└└
; │││┌ @ checked.jl:194 within `length'
	movq	%rax, %rsi
	subq	%rdx, %rsi
; │││└
; │││┌ @ longsequence.jl:71 within `length' @ range.jl:540
; ││││┌ @ checked.jl:223 within `checked_sub'
	jo	L804
; ││││└
; ││││┌ @ checked.jl:165 within `checked_add'
; │││││┌ @ checked.jl:132 within `add_with_overflow'
	movq	%rsi, %rax
	incq	%rax
; │││││└
; │││││ @ checked.jl:166 within `checked_add'
	jo	L831
; │└└└└
; │ @ REPL[5]:3 within `simple_gc'
; │┌ @ indexing.jl:10 within `bitindex' @ bitindex.jl:22
; ││┌ @ int.jl:450 within `<<' @ int.jl:443
	leaq	(%rdx,%rdx), %rbx
	addq	$-2, %rbx
; │└└
; │ @ REPL[5]:4 within `simple_gc'
; │┌ @ indexing.jl:33 within `lastbitindex'
; ││┌ @ indexing.jl:10 within `bitindex'
; │││┌ @ int.jl:53 within `+'
	addq	%rdx, %rax
; │└└└
; │┌ @ bitindex.jl:34 within `+' @ int.jl:53
	leaq	(%rax,%rax), %r13
	addq	$-2, %r13
; │└
; │┌ @ indexing.jl:33 within `lastbitindex'
; ││┌ @ indexing.jl:10 within `bitindex' @ bitindex.jl:22
; │││┌ @ int.jl:450 within `<<' @ int.jl:443
	leaq	(%rax,%rax), %r14
	xorl	%eax, %eax
; │└└└
; │ @ REPL[5]:11 within `simple_gc'
; │┌ @ sysimg.jl:18 within `getproperty'
	movq	(%rdi), %r8
; │└
; │ @ REPL[5]:6 within `simple_gc'
; │┌ @ operators.jl:309 within `<='
; ││┌ @ bool.jl:41 within `|'
	cmpq	%r13, %rbx
; │└└
	jg	L203
	movl	%ebx, %esi
	andl	$62, %esi
; │ @ REPL[5]:6 within `simple_gc'
	testq	%rsi, %rsi
	je	L198
	addq	%rdx, %rdx
; │ @ REPL[5]:7 within `simple_gc'
; │┌ @ indexing.jl:57 within `inbounds_getindex'
; ││┌ @ bitindex.jl:63 within `extract_encoded_element'
; │││┌ @ bitindex.jl:31 within `index'
; ││││┌ @ int.jl:448 within `>>' @ int.jl:441
	sarq	$6, %rbx
; ││└└└
; ││┌ @ array.jl:729 within `extract_encoded_element'
	movq	(%r8), %rdi
; ││└
; ││┌ @ bitindex.jl:64 within `extract_encoded_element'
; │││┌ @ int.jl:448 within `>>' @ int.jl:442
	shrxq	%rsi, (%rdi,%rbx,8), %rax
; │└└└
; │┌ @ boot.jl:611 within `inbounds_getindex'
	andb	$3, %al
; │└
; │┌ @ nucleicacid.jl:399 within `isGC'
; ││┌ @ int.jl:432 within `==' @ promotion.jl:350 @ promotion.jl:403
	movzbl	%al, %esi
	movl	$9, %ebx
	xorl	%eax, %eax
	btl	%esi, %ebx
	setae	%al
; │└└
; │ @ REPL[5]:6 within `simple_gc'
; │┌ @ operators.jl:309 within `<='
; ││┌ @ bool.jl:41 within `|'
	cmpq	%r13, %rdx
; │└└
	jg	L206
	movl	%edx, %esi
	andl	$62, %esi
; │ @ REPL[5]:6 within `simple_gc'
	testq	%rsi, %rsi
	je	L206
	movl	$9, %r9d
	nop
L144:
	movq	%rax, %rbx
; │ @ REPL[5]:7 within `simple_gc'
; │┌ @ indexing.jl:57 within `inbounds_getindex'
; ││┌ @ bitindex.jl:63 within `extract_encoded_element'
; │││┌ @ bitindex.jl:31 within `index'
; ││││┌ @ int.jl:448 within `>>' @ int.jl:441
	movq	%rdx, %rax
	sarq	$6, %rax
; │││└└
; │││ @ bitindex.jl:64 within `extract_encoded_element'
; │││┌ @ int.jl:448 within `>>' @ int.jl:442
	shrxq	%rsi, (%rdi,%rax,8), %rax
; │└└└
; │┌ @ boot.jl:611 within `inbounds_getindex'
	andb	$3, %al
; │└
; │┌ @ nucleicacid.jl:399 within `isGC'
; ││┌ @ int.jl:432 within `==' @ promotion.jl:350 @ promotion.jl:403
	movzbl	%al, %esi
	xorl	%eax, %eax
	btl	%esi, %r9d
	setae	%al
; │└└
; │┌ @ int.jl:800 within `+' @ int.jl:53
	addq	%rbx, %rax
; │└
; │ @ REPL[5]:8 within `simple_gc'
; │┌ @ bitindex.jl:34 within `+' @ int.jl:53
	addq	$2, %rdx
; │└
; │ @ REPL[5]:6 within `simple_gc'
; │┌ @ operators.jl:309 within `<='
; ││┌ @ bool.jl:41 within `|'
	cmpq	%r13, %rdx
; │└└
	jg	L206
	movl	%edx, %esi
	andl	$62, %esi
; │ @ REPL[5]:6 within `simple_gc'
	testq	%rsi, %rsi
	jne	L144
	jmp	L206
L198:
	movq	%rbx, %rdx
	jmp	L206
L203:
	movq	%rbx, %rdx
; │ @ REPL[5]:4 within `simple_gc'
; │┌ @ indexing.jl:33 within `lastbitindex'
; ││┌ @ indexing.jl:10 within `bitindex' @ bitindex.jl:22
; │││┌ @ int.jl:450 within `<<' @ int.jl:443
L206:
	addq	$-4, %r14
; │└└└
; │ @ REPL[5]:12 within `simple_gc'
; │┌ @ bitindex.jl:31 within `index'
; ││┌ @ int.jl:448 within `>>' @ int.jl:441
	sarq	$6, %r14
; │└└
; │ @ REPL[5]:13 within `simple_gc'
; │┌ @ number.jl:40 within `iszero'
; ││┌ @ promotion.jl:403 within `=='
	xorl	%esi, %esi
	testb	$62, %r13b
	setne	%sil
; │└└
; │ @ REPL[5]:12 within `simple_gc'
; │┌ @ bitindex.jl:31 within `index'
; ││┌ @ int.jl:53 within `+'
	movq	%r14, %r12
	subq	%rsi, %r12
; │└└
; │ @ REPL[5]:13 within `simple_gc'
; │┌ @ int.jl:800 within `-' @ int.jl:52
	addq	$1, %r12
; │└
; │ @ REPL[5]:14 within `simple_gc'
; │┌ @ bitindex.jl:31 within `index'
; ││┌ @ int.jl:448 within `>>' @ int.jl:441
	movq	%rdx, %r10
	sarq	$6, %r10
; │└└
; │┌ @ range.jl:5 within `Colon'
; ││┌ @ range.jl:274 within `Type'
; │││┌ @ range.jl:279 within `unitrange_last'
; ││││┌ @ operators.jl:333 within `>='
; │││││┌ @ int.jl:428 within `<='
	cmpq	%r12, %r10
; │└└└└└
	jge	L727
	negq	%rsi
	movq	(%r8), %r11
; │ @ REPL[5]:14 within `simple_gc'
	shlq	$6, %r12
	movq	%rdx, %r9
	andq	$-64, %r9
	addq	%r14, %rsi
	addq	$1, %rsi
	movq	%rsi, %r15
	subq	%r10, %r15
	cmpq	$16, %r15
	jb	L666
	movq	%r15, %r14
	andq	$-16, %r14
	vmovq	%rax, %xmm0
	vpxor	%xmm1, %xmm1, %xmm1
	leaq	(%r11,%r10,8), %rax
	addq	$96, %rax
	addq	%r14, %r10
	movabsq	$4585169312, %rdi       ## imm = 0x1114C21A0
	vpbroadcastq	(%rdi), %ymm2
	movabsq	$4585169248, %rdi       ## imm = 0x1114C2160
	vmovdqa	(%rdi), %ymm3
	movabsq	$4585169280, %rdi       ## imm = 0x1114C2180
	vmovdqa	(%rdi), %ymm4
	movq	%r14, %rbx
	vpxor	%xmm5, %xmm5, %xmm5
	vpxor	%xmm6, %xmm6, %xmm6
	vpxor	%xmm7, %xmm7, %xmm7
	nopw	(%rax,%rax)
; │ @ REPL[5]:15 within `simple_gc'
; │┌ @ array.jl:729 within `getindex'
L384:
	vmovdqu	-96(%rax), %ymm8
	vmovdqu	-64(%rax), %ymm9
	vmovdqu	-32(%rax), %ymm10
	vmovdqu	(%rax), %ymm11
; │└
; │┌ @ bit-manipulation.jl:48 within `gc_bitcount'
; ││┌ @ int.jl:448 within `>>' @ int.jl:442
	vpsrlq	$1, %ymm8, %ymm12
	vpsrlq	$1, %ymm9, %ymm13
	vpsrlq	$1, %ymm10, %ymm14
	vpsrlq	$1, %ymm11, %ymm15
; │└└
; │┌ @ int.jl:321 within `gc_bitcount'
	vpxor	%ymm8, %ymm12, %ymm8
	vpxor	%ymm9, %ymm13, %ymm9
	vpxor	%ymm10, %ymm14, %ymm10
	vpxor	%ymm11, %ymm15, %ymm11
	vpand	%ymm2, %ymm8, %ymm8
	vpand	%ymm2, %ymm9, %ymm9
	vpand	%ymm2, %ymm10, %ymm10
	vpand	%ymm2, %ymm11, %ymm11
; │└
; │┌ @ bit-manipulation.jl:49 within `gc_bitcount'
; ││┌ @ int.jl:358 within `count_ones'
	vpand	%ymm3, %ymm8, %ymm12
	vpshufb	%ymm12, %ymm4, %ymm12
	vpsrlw	$4, %ymm8, %ymm8
	vpand	%ymm3, %ymm8, %ymm8
	vpshufb	%ymm8, %ymm4, %ymm8
	vpaddb	%ymm12, %ymm8, %ymm8
	vpsadbw	%ymm1, %ymm8, %ymm8
; │└└
; │┌ @ int.jl:53 within `+'
	vpaddq	%ymm0, %ymm8, %ymm0
; │└
; │┌ @ bit-manipulation.jl:49 within `gc_bitcount'
; ││┌ @ int.jl:358 within `count_ones'
	vpand	%ymm3, %ymm9, %ymm8
	vpshufb	%ymm8, %ymm4, %ymm8
	vpsrlw	$4, %ymm9, %ymm9
	vpand	%ymm3, %ymm9, %ymm9
	vpshufb	%ymm9, %ymm4, %ymm9
	vpaddb	%ymm8, %ymm9, %ymm8
	vpsadbw	%ymm1, %ymm8, %ymm8
; │└└
; │┌ @ int.jl:53 within `+'
	vpaddq	%ymm5, %ymm8, %ymm5
; │└
; │┌ @ bit-manipulation.jl:49 within `gc_bitcount'
; ││┌ @ int.jl:358 within `count_ones'
	vpand	%ymm3, %ymm10, %ymm8
	vpshufb	%ymm8, %ymm4, %ymm8
	vpsrlw	$4, %ymm10, %ymm9
	vpand	%ymm3, %ymm9, %ymm9
	vpshufb	%ymm9, %ymm4, %ymm9
	vpaddb	%ymm8, %ymm9, %ymm8
	vpsadbw	%ymm1, %ymm8, %ymm8
; │└└
; │┌ @ int.jl:53 within `+'
	vpaddq	%ymm6, %ymm8, %ymm6
; │└
; │┌ @ bit-manipulation.jl:49 within `gc_bitcount'
; ││┌ @ int.jl:358 within `count_ones'
	vpand	%ymm3, %ymm11, %ymm8
	vpshufb	%ymm8, %ymm4, %ymm8
	vpsrlw	$4, %ymm11, %ymm9
	vpand	%ymm3, %ymm9, %ymm9
	vpshufb	%ymm9, %ymm4, %ymm9
	vpaddb	%ymm8, %ymm9, %ymm8
	vpsadbw	%ymm1, %ymm8, %ymm8
; │└└
; │┌ @ int.jl:53 within `+'
	vpaddq	%ymm7, %ymm8, %ymm7
	subq	$-128, %rax
	addq	$-16, %rbx
	jne	L384
	vpaddq	%ymm0, %ymm5, %ymm0
	vpaddq	%ymm0, %ymm6, %ymm0
	vpaddq	%ymm0, %ymm7, %ymm0
	vextracti128	$1, %ymm0, %xmm1
	vpaddq	%ymm1, %ymm0, %ymm0
	vpshufd	$78, %xmm0, %xmm1       ## xmm1 = xmm0[2,3,0,1]
	vpaddq	%ymm1, %ymm0, %ymm0
	vmovq	%xmm0, %rax
	cmpq	%r14, %r15
; │└
; │ @ REPL[5]:14 within `simple_gc'
	je	L721
L666:
	leaq	(%r11,%r10,8), %rbx
	subq	%r10, %rsi
	movabsq	$6148914691236517205, %r10 ## imm = 0x5555555555555555
	nopl	(%rax,%rax)
; │ @ REPL[5]:15 within `simple_gc'
; │┌ @ array.jl:729 within `getindex'
L688:
	movq	(%rbx), %rdi
; │└
; │┌ @ bit-manipulation.jl:48 within `gc_bitcount'
; ││┌ @ int.jl:448 within `>>' @ int.jl:442
	movq	%rdi, %rcx
	shrq	%rcx
; │└└
; │┌ @ int.jl:321 within `gc_bitcount'
	xorq	%rdi, %rcx
	andq	%r10, %rcx
; │└
; │┌ @ bit-manipulation.jl:49 within `gc_bitcount'
; ││┌ @ int.jl:358 within `count_ones'
	popcntq	%rcx, %rcx
; │└└
; │┌ @ int.jl:53 within `+'
	addq	%rcx, %rax
; │└
; │ @ REPL[5]:16 within `simple_gc'
; │┌ @ range.jl:594 within `iterate'
; ││┌ @ promotion.jl:403 within `=='
	addq	$8, %rbx
	addq	$-1, %rsi
; │└└
	jne	L688
; │ @ REPL[5]:14 within `simple_gc'
L721:
	addq	%r12, %rdx
	subq	%r9, %rdx
; │ @ REPL[5]:19 within `simple_gc'
; │┌ @ operators.jl:309 within `<='
; ││┌ @ bool.jl:41 within `|'
L727:
	cmpq	%r13, %rdx
; │└└
	jg	L791
	movq	(%r8), %rsi
	movl	$9, %edi
	nopw	%cs:(%rax,%rax)
; │ @ REPL[5]:20 within `simple_gc'
; │┌ @ indexing.jl:57 within `inbounds_getindex'
; ││┌ @ bitindex.jl:63 within `extract_encoded_element'
; │││┌ @ bitindex.jl:31 within `index'
; ││││┌ @ int.jl:448 within `>>' @ int.jl:441
L752:
	movq	%rdx, %rcx
	sarq	$6, %rcx
; ││└└└
; ││┌ @ int.jl:442 within `extract_encoded_element'
	shrxq	%rdx, (%rsi,%rcx,8), %rcx
; ││└
; ││ @ indexing.jl:58 within `inbounds_getindex'
; ││┌ @ alphabet.jl:222 within `decode'
; │││┌ @ boot.jl:735 within `Type'
; ││││┌ @ boot.jl:680 within `toUInt8'
; │││││┌ @ boot.jl:611 within `checked_trunc_uint'
	andb	$3, %cl
; │└└└└└
; │┌ @ nucleicacid.jl:399 within `isGC'
; ││┌ @ int.jl:432 within `==' @ promotion.jl:350 @ promotion.jl:403
	movzbl	%cl, %ecx
	xorl	%ebx, %ebx
	btl	%ecx, %edi
	setae	%bl
; │└└
; │┌ @ int.jl:800 within `+' @ int.jl:53
	addq	%rbx, %rax
; │└
; │ @ REPL[5]:21 within `simple_gc'
; │┌ @ bitindex.jl:34 within `+' @ int.jl:53
	addq	$2, %rdx
; │└
; │ @ REPL[5]:19 within `simple_gc'
; │┌ @ operators.jl:309 within `<='
; ││┌ @ bool.jl:41 within `|'
	cmpq	%r13, %rdx
; │└└
	jle	L752
; │ @ REPL[5]:24 within `simple_gc'
L791:
	popq	%rbx
	popq	%r12
	popq	%r13
	popq	%r14
	popq	%r15
	vzeroupper
	retq
; │ @ REPL[5]:4 within `simple_gc'
; │┌ @ indexing.jl:33 within `lastbitindex'
; ││┌ @ indexing.jl:109 within `lastindex'
; │││┌ @ longsequence.jl:71 within `length' @ range.jl:540
; ││││┌ @ checked.jl:223 within `checked_sub'
L804:
	movabsq	$throw_overflowerr_binaryop, %rcx
	movabsq	$4547875832, %rdi       ## imm = 0x10F1313F8
	movq	%rax, %rsi
	callq	*%rcx
	ud2
; ││││└
; ││││┌ @ checked.jl:166 within `checked_add'
L831:
	movabsq	$throw_overflowerr_binaryop, %rax
	movabsq	$4547875528, %rdi       ## imm = 0x10F1312C8
	movl	$1, %edx
	callq	*%rax
	ud2
	nopl	(%rax)
; └└└└└
julia> @code_native BioSequences.count_gc_bitpar(s)
	.section	__TEXT,__text,regular,pure_instructions
; ┌ @ bitpar-compiler.jl:20 within `count_gc_bitpar'
	pushq	%r15
	pushq	%r14
	pushq	%r12
	pushq	%rbx
	subq	$24, %rsp
	movq	%rdi, %r14
	vxorps	%xmm0, %xmm0, %xmm0
	vmovaps	%xmm0, (%rsp)
	movq	$0, 16(%rsp)
	movabsq	$jl_get_ptls_states_fast, %rax
	callq	*%rax
; │ @ bitpar-compiler.jl:21 within `count_gc_bitpar'
; │┌ @ indexing.jl:10 within `bitindex'
; ││┌ @ sysimg.jl:18 within `getproperty'
	movq	$2, (%rsp)
	movq	(%rax), %rcx
	movq	%rcx, 8(%rsp)
	movq	%rsp, %rcx
	movq	%rcx, (%rax)
	movq	8(%r14), %rdx
; │└└
; │ @ bitpar-compiler.jl:22 within `count_gc_bitpar'
; │┌ @ indexing.jl:109 within `lastindex'
; ││┌ @ longsequence.jl:71 within `length'
; │││┌ @ sysimg.jl:18 within `getproperty'
	movq	16(%r14), %rcx
; │││└
; │││ @ longsequence.jl:71 within `length' @ range.jl:540
; │││┌ @ checked.jl:222 within `checked_sub'
; ││││┌ @ checked.jl:194 within `sub_with_overflow'
	movq	%rcx, %rsi
	subq	%rdx, %rsi
; ││││└
; ││││ @ checked.jl:223 within `checked_sub'
	jo	L428
; │││└
; │││┌ @ checked.jl:165 within `checked_add'
; ││││┌ @ checked.jl:132 within `add_with_overflow'
	movq	%rsi, %r12
	incq	%r12
; ││││└
; ││││ @ checked.jl:166 within `checked_add'
	jo	L455
; └└└└
; ┌ @ checked.jl within `count_gc_bitpar'
	movabsq	$6148914691236517205, %r9 ## imm = 0x5555555555555555
; └
; ┌ @ bitpar-compiler.jl:21 within `count_gc_bitpar'
; │┌ @ indexing.jl:10 within `bitindex' @ bitindex.jl:22
; ││┌ @ int.jl:52 within `-'
	leaq	-1(%rdx), %rbx
; ││└
; ││┌ @ int.jl:450 within `<<' @ int.jl:443
	leaq	(%rdx,%rdx), %rsi
	addq	$-2, %rsi
; │└└
; │ @ bitpar-compiler.jl:22 within `count_gc_bitpar'
; │┌ @ indexing.jl:10 within `bitindex'
; ││┌ @ int.jl:52 within `-'
	addq	%rdx, %r12
; │└└
; │┌ @ int.jl:443 within `bitindex'
	leaq	(%r12,%r12), %r8
	addq	$-2, %r8
	xorl	%ecx, %ecx
; │└
; │ @ bitpar-compiler.jl:24 within `count_gc_bitpar'
; │┌ @ operators.jl:185 within `!='
; ││┌ @ promotion.jl:403 within `=='
	testb	$31, %bl
; │└└
	je	L299
	movq	%r8, %r15
	subq	%rsi, %r15
	jle	L282
; │ @ bitpar-compiler.jl:26 within `count_gc_bitpar'
; │┌ @ bitindex.jl:32 within `offset'
; ││┌ @ int.jl:800 within `&' @ int.jl:297
	movl	%esi, %r11d
	andl	$62, %r11d
; │└└
; │ @ bitpar-compiler.jl:29 within `count_gc_bitpar'
; │┌ @ int.jl:52 within `-'
	movl	$64, %ebx
	subq	%r11, %rbx
; │└
; │┌ @ bitindex.jl:34 within `+' @ int.jl:53
	addq	%rsi, %rbx
; │└
; │ @ bitpar-compiler.jl:30 within `count_gc_bitpar'
; │┌ @ promotion.jl:403 within `=='
	testb	$62, %bl
; │└
	jne	L484
; │ @ bitpar-compiler.jl:27 within `count_gc_bitpar'
; │┌ @ sysimg.jl:18 within `getproperty'
	movq	(%r14), %r10
; │└
; │┌ @ bitindex.jl:31 within `index'
; ││┌ @ int.jl:448 within `>>' @ int.jl:441
	sarq	$6, %rsi
; │└└
; │┌ @ array.jl:729 within `getindex'
	movq	(%r10), %rcx
; │└
; │┌ @ int.jl:448 within `>>' @ int.jl:442
	shrxq	%r11, (%rcx,%rsi,8), %r11
	xorl	%r10d, %r10d
; │└
; │┌ @ bitindex.jl:76 within `bitmask' @ bitindex.jl:71
; ││┌ @ int.jl:450 within `<<' @ int.jl:443
	cmpq	$63, %r15
	movl	$1, %esi
	shlxq	%r15, %rsi, %rcx
	cmovaq	%r10, %rcx
; │││ @ int.jl:450 within `<<'
; │││┌ @ int.jl:51 within `-'
	movq	%r15, %rdx
	negq	%rdx
; │││└
; │││ @ int.jl:442 within `<<'
	cmpq	$63, %rdx
	shrxq	%rdx, %rsi, %rdx
	cmovaq	%r10, %rdx
; │││ @ int.jl:450 within `<<'
; │││┌ @ int.jl:428 within `<='
	testq	%r15, %r15
; │││└
	cmovnsq	%rcx, %rdx
; ││└
; ││┌ @ int.jl:52 within `-'
	addq	$-1, %rdx
; │└└
; │┌ @ int.jl:297 within `&'
	andq	%r11, %rdx
; │└
; │ @ bitpar-compiler.jl:28 within `count_gc_bitpar'
; │┌ @ bit-manipulation.jl:48 within `gc_bitcount'
; ││┌ @ int.jl:448 within `>>' @ int.jl:442
	movq	%rdx, %rcx
	shrq	%rcx
; ││└
; ││ @ bit-manipulation.jl:49 within `gc_bitcount'
; ││┌ @ int.jl:321 within `xor'
	xorq	%rdx, %rcx
	andq	%r9, %rcx
; ││└
; ││┌ @ int.jl:358 within `count_ones'
	popcntq	%rcx, %rcx
	addq	%r12, %r12
; │└└
; │ @ bitpar-compiler.jl:32 within `count_gc_bitpar'
; │┌ @ bitindex.jl:35 within `-' @ int.jl:52
	addq	$-66, %r12
; │└
; │┌ @ operators.jl:309 within `<='
; ││┌ @ bool.jl:41 within `|'
	cmpq	%r12, %rbx
; │└└
	jg	L360
	jmp	L314
L282:
	movq	%rsi, %rbx
	addq	%r12, %r12
; │ @ bitpar-compiler.jl:32 within `count_gc_bitpar'
; │┌ @ bitindex.jl:35 within `-' @ int.jl:52
	addq	$-66, %r12
; │└
; │┌ @ operators.jl:309 within `<='
; ││┌ @ bool.jl:41 within `|'
	cmpq	%r12, %rbx
; │└└
	jg	L360
	jmp	L314
L299:
	movq	%rsi, %rbx
	addq	%r12, %r12
; │ @ bitpar-compiler.jl:32 within `count_gc_bitpar'
; │┌ @ bitindex.jl:35 within `-' @ int.jl:52
	addq	$-66, %r12
; │└
; │┌ @ operators.jl:309 within `<='
; ││┌ @ bool.jl:41 within `|'
	cmpq	%r12, %rbx
; │└└
	jg	L360
L314:
	movq	(%r14), %rdx
	movq	(%rdx), %rdx
; │ @ bitpar-compiler.jl:33 within `count_gc_bitpar'
; │┌ @ bitindex.jl:31 within `index'
; ││┌ @ int.jl:448 within `>>' @ int.jl:441
L320:
	movq	%rbx, %rsi
	sarq	$6, %rsi
; │└└
; │┌ @ array.jl:729 within `getindex'
	movq	(%rdx,%rsi,8), %rdi
; │└
; │ @ bitpar-compiler.jl:34 within `count_gc_bitpar'
; │┌ @ bit-manipulation.jl:48 within `gc_bitcount'
; ││┌ @ int.jl:448 within `>>' @ int.jl:442
	movq	%rdi, %rsi
	shrq	%rsi
; │└└
; │┌ @ int.jl:321 within `gc_bitcount'
	xorq	%rdi, %rsi
	andq	%r9, %rsi
; │└
; │┌ @ bit-manipulation.jl:49 within `gc_bitcount'
; ││┌ @ int.jl:358 within `count_ones'
	popcntq	%rsi, %rsi
; │└└
; │┌ @ int.jl:53 within `+'
	addq	%rsi, %rcx
; │└
; │ @ bitpar-compiler.jl:35 within `count_gc_bitpar'
; │┌ @ bitindex.jl:34 within `+' @ int.jl:53
	addq	$64, %rbx
; │└
; │ @ bitpar-compiler.jl:32 within `count_gc_bitpar'
; │┌ @ operators.jl:309 within `<='
; ││┌ @ bool.jl:41 within `|'
	cmpq	%r12, %rbx
; │└└
	jle	L320
; │ @ bitpar-compiler.jl:37 within `count_gc_bitpar'
; │┌ @ operators.jl:260 within `<'
; ││┌ @ bitindex.jl:38 within `isless' @ operators.jl:338
; │││┌ @ int.jl:49 within `<'
L360:
	cmpq	%r8, %rbx
; │└└└
	jge	L405
; │ @ bitpar-compiler.jl:38 within `count_gc_bitpar'
; │┌ @ sysimg.jl:18 within `getproperty'
	movq	(%r14), %rdx
; │└
; │┌ @ bitindex.jl:31 within `index'
; ││┌ @ int.jl:448 within `>>' @ int.jl:441
	sarq	$6, %rbx
; │└└
; │┌ @ array.jl:729 within `getindex'
	movq	(%rdx), %rdx
; │└
; │┌ @ bitindex.jl:76 within `bitmask' @ bitindex.jl:71
; ││┌ @ int.jl:450 within `<<' @ int.jl:443
	andb	$62, %r8b
; │└└
; │┌ @ int.jl:297 within `&'
	bzhiq	%r8, (%rdx,%rbx,8), %rdx
; │└
; │ @ bitpar-compiler.jl:39 within `count_gc_bitpar'
; │┌ @ bit-manipulation.jl:48 within `gc_bitcount'
; ││┌ @ int.jl:448 within `>>' @ int.jl:442
	movq	%rdx, %rsi
	shrq	%rsi
; ││└
; ││ @ bit-manipulation.jl:49 within `gc_bitcount'
; ││┌ @ int.jl:321 within `xor'
	xorq	%rdx, %rsi
	andq	%r9, %rsi
; │└└
; │┌ @ int.jl:358 within `gc_bitcount'
	popcntq	%rsi, %rdx
; │└
; │┌ @ int.jl:53 within `+'
	addq	%rdx, %rcx
; │└
; │ @ bitpar-compiler.jl:42 within `count_gc_bitpar'
L405:
	movq	8(%rsp), %rdx
	movq	%rdx, (%rax)
	movq	%rcx, %rax
	addq	$24, %rsp
	popq	%rbx
	popq	%r12
	popq	%r14
	popq	%r15
	retq
; │ @ bitpar-compiler.jl:22 within `count_gc_bitpar'
; │┌ @ indexing.jl:109 within `lastindex'
; ││┌ @ longsequence.jl:71 within `length' @ range.jl:540
; │││┌ @ checked.jl:223 within `checked_sub'
L428:
	movabsq	$throw_overflowerr_binaryop, %rax
	movabsq	$4547875832, %rdi       ## imm = 0x10F1313F8
	movq	%rcx, %rsi
	callq	*%rax
	ud2
; │││└
; │││ @ longsequence.jl:71 within `length' @ checked.jl:166
L455:
	movabsq	$throw_overflowerr_binaryop, %rax
	movabsq	$4547875528, %rdi       ## imm = 0x10F1312C8
	movl	$1, %edx
	callq	*%rax
	ud2
; │└└
; │ @ bitpar-compiler.jl:30 within `count_gc_bitpar'
; │┌ @ boot.jl:301 within `Type'
L484:
	movabsq	$jl_gc_pool_alloc, %rcx
	movl	$1616, %esi             ## imm = 0x650
	movl	$16, %edx
	movq	%rax, %rdi
	callq	*%rcx
	movabsq	$jl_system_image_data, %rcx
	movq	%rcx, -8(%rax)
	movabsq	$4606006320, %rcx       ## imm = 0x1128A1430
	movq	%rcx, (%rax)
	movq	%rax, 16(%rsp)
; │└
	movabsq	$jl_throw, %rcx
	movq	%rax, %rdi
	callq	*%rcx
	nopl	(%rax)
; └

I think I may have found it. Something about how you loop in your version is generating vectorized machine code, whereas the current version is not! I will do some further experimentation.

Perhaps that the main body loop of your version is a well defined for loop, rather than a while loop, is the main difference here.

@BenJWard

This comment has been minimized.

Copy link
Member

@BenJWard BenJWard commented Jan 27, 2020

Done some more testing, it sure is that loop that processes the main body of the data that is the issue.

@BenJWard

This comment has been minimized.

Copy link
Member

@BenJWard BenJWard commented Jan 27, 2020

PR #96 Addresses 2 and 3 of these questions:

We should look at the bitpar counting functions:

  1. There is no documentation on what the two-argument count(isambiguous, seq, seq) does. I can't work it out.
  2. They don't seem to work properly. n_ambiguous(seq) dispatches to count_naive, which it shouldn't. I think there is no compiled method for a single-seq count(isambiguous, seq).
  3. Performance could maybe take a review. count(isGC, seq) (which does correctly dispatch to the bit-parallel one) takes 7.8 µs on my computer for a 100,000 4-bit DNA seq. I wrote a simple function which completes in 2.6 µs.

For point 3. There was an issue in the single sequence compile_bitpar function which meant loop that really should be vectorized wasn't. Now it is and the performance is increased to that expected by simple_gc.

For point 2. A bit-parallel function has been defined using the modified compile_bitpar function,
a method overloads that simply return 0 for 2-bit encoded seqs are also provided since ambiguous IUPAC codes and gap codes are not possible.

For point 1. the two argument count(isambiguous, seqa, seqb) counts the number of sites at which seqa or seqb is ambiguous. Apologies for not documenting it, it's not used often but I did have occasion for it in one project once where someone wanted to know that number.

@jakobnissen

This comment has been minimized.

Copy link
Contributor Author

@jakobnissen jakobnissen commented Jan 29, 2020

I'll have a look at #96 .

Meanwhile, I got inspired by this comment on our Seq blogpost. His version is faster because it vectorizes all the way.
However, if we don't reverse in-place, BioSequences also vectorizes and become even faster. And actually, if we complement in the reversing SIMD loop, we get ~4x faster at reverse-complementing long sequences than we currently are.

Just for fun I'll pitch a PR when I get some time (a little strapped for time right now). I'm not convinced BioSequences needs faster RC'ing, but let's see how the PR looks, and if it's 100 extra lines of code, it's not worth it. If it's a simple change, maybe it is.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked pull requests

Successfully merging a pull request may close this issue.

None yet
2 participants
You can’t perform that action at this time.