Remove broken register passing optimization. #524

Closed

Conversation
maleadt commented Jan 8, 2024

Yet another alternative to a4a160f and #523, this time reverting the problematic optimization from #261. IMO this is the least controversial option, as it keeps LoopVectorization.jl functional for 1.11 users.

cc @staticfloat

@chriselrod Please approve this PR (or #523) so that at least CI can run. I would like to get either merged ASAP to get PkgEval in a better state.

Fixes #520, re-fixes #518

codecov bot commented Jan 8, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (3fbe248) 80.31% compared to head (5cf8ddb) 78.54%.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #524      +/-   ##
==========================================
- Coverage   80.31%   78.54%   -1.77%     
==========================================
  Files          39       39              
  Lines        9604     9542      -62     
==========================================
- Hits         7713     7495     -218     
- Misses       1891     2047     +156     


maleadt (Author) commented Jan 8, 2024

Forgot to include the revert of a4a160f. Updated the branch, so needs re-approval.

maleadt (Author) commented Jan 9, 2024

Turns out this isn't sufficient: on this branch, MCPhylo.jl still asserts, suggesting that the bad IR doesn't (only) come from the Expr(:new) optimization removed here. The CI failure may be related, as it indicates that other code paths (relying on ManualMemory.jl) are doing bad object introspection too:

  ArgumentError: type does not have a definite number of fields
  Stacktrace:
    [1] fieldcount(t::Any)
      @ Base ./reflection.jl:1025
    [2] store_aggregate!(q::Expr, sym::Symbol, ::Type{Core.TypeName}, offset::Int64)
      @ ManualMemory ~/.julia/packages/ManualMemory/iySrG/src/ManualMemory.jl:194
    [3] store_aggregate!(q::Expr, sym::Symbol, ::Type{DataType}, offset::Int64)
      @ ManualMemory ~/.julia/packages/ManualMemory/iySrG/src/ManualMemory.jl:202
    [4] store_aggregate!(q::Expr, sym::Symbol, ::Type{Tuple{LayoutPointers.GroupedStridedPointers{Tuple{Ptr{Float64}, Ptr{Float64}, Ptr{Float64}}, (1, 1, 1), (0, 0, 0), ((1, 2), (1, 2, 3), (1, 2)), ((1, 2), (3, 4, 5), (6, 7)), Tuple{Static.StaticInt{8}, Static.StaticInt{16}, Static.StaticInt{8}, Static.StaticInt{16}, Int64, Static.StaticInt{8}, Static.StaticInt{16}}, NTuple{7, Static.StaticInt{0}}}, DataType, DataType}}, offset::Int64)
      @ ManualMemory ~/.julia/packages/ManualMemory/iySrG/src/ManualMemory.jl:202
    [5] store_aggregate!(q::Expr, sym::Symbol, ::Type{Tuple{Tuple{CloseOpenIntervals.CloseOpen{Int64, Int64}, CloseOpenIntervals.CloseOpen{Int64, Int64}}, Tuple{LayoutPointers.GroupedStridedPointers{Tuple{Ptr{Float64}, Ptr{Float64}, Ptr{Float64}}, (1, 1, 1), (0, 0, 0), ((1, 2), (1, 2, 3), (1, 2)), ((1, 2), (3, 4, 5), (6, 7)), Tuple{Static.StaticInt{8}, Static.StaticInt{16}, Static.StaticInt{8}, Static.StaticInt{16}, Int64, Static.StaticInt{8}, Static.StaticInt{16}}, NTuple{7, Static.StaticInt{0}}}, DataType, DataType}}}, offset::Int64)
      @ ManualMemory ~/.julia/packages/ManualMemory/iySrG/src/ManualMemory.jl:202
    [6] #s13#5
      @ ~/.julia/packages/ManualMemory/iySrG/src/ManualMemory.jl:216 [inlined]
    [7] var"#s13#5"(T::Any, ::Any, p::Any, x::Any, offset::Any)
      @ ManualMemory ./none:0
    [8] (::Core.GeneratedFunctionStub)(::UInt64, ::LineNumberNode, ::Any, ::Vararg{Any})
      @ Core ./boot.jl:703
    [9] setup_turbo_threads!
      @ ~/work/LoopVectorization.jl/LoopVectorization.jl/src/codegen/lower_threads.jl:42 [inlined]
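
For reference (illustrating only the error class, not the exact types in the trace above): Base.fieldcount throws this ArgumentError for any type whose field count isn't statically known, and store_aggregate! recurses through field types (the ManualMemory.jl:202 frames above) until it reaches such a type. A minimal example:

julia> fieldcount(Complex{Float64})   # concrete type: field count is known
2

julia> fieldcount(AbstractFloat)      # abstract type: no definite field count
ERROR: ArgumentError: type does not have a definite number of fields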

As I've already spent too much time on this, I'm going to vote for #523 instead. I'd suggest that people who are actually familiar with this code take a closer look once we've fixed the immediate problem.

maleadt marked this pull request as draft January 9, 2024 11:16
chriselrod (Member) commented Jan 9, 2024

It is still annoying that we'd have to remove this optimization.

PR:

julia> @time using LoopVectorization
  0.272088 seconds (427.03 k allocations: 22.256 MiB, 5.16% gc time, 4.05% compilation time)

julia> function matmul!(C, A, B)
           @turbo for n = indices((C,B), 2), m = indices((C,A),1)
               Cmn = zero(eltype(C))
               for k = indices((A,B), (2,1))
                   Cmn += A[m,k]*B[k,n]
               end
               C[m,n] = Cmn
           end
       end
matmul! (generic function with 1 method)

julia> A = rand(3,3); B = rand(3,3); C = similar(A);

julia> @benchmark matmul!($C, $A, $B)
BenchmarkTools.Trial: 10000 samples with 997 evaluations.
 Range (min … max):  20.742 ns … 23.508 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     20.831 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   20.852 ns ±  0.155 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

   ▂▅▇██▇▅▁                                                   ▂
  ▆████████▅▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▅▆▇▇▇▇▆▆ █
  20.7 ns      Histogram: log(frequency) by time      21.8 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

Release:

julia> A = rand(3,3); B = rand(3,3); C = similar(A);

julia> @benchmark matmul!($C, $A, $B)
BenchmarkTools.Trial: 10000 samples with 997 evaluations.
 Range (min … max):  19.945 ns … 23.903 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     20.230 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   20.276 ns ±  0.226 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

           ▃▄▃▆█▃                                              
  ▂▂▃▄▅▅▆▇████████▅▄▅█▇▄▃▂▂▂▂▂▂▂▂▂▂▂▃▃▅▄▃▂▂▁▁▂▁▂▂▂▂▂▂▂▂▂▂▂▂▂▂ ▃
  19.9 ns         Histogram: frequency by time        21.2 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

It's a small but consistent benefit.

_turbo_! call assembly, PR:

"japi1_matmul!_3476":                   # @"japi1_matmul!_3476"
# %bb.0:                                # %L86.preheader
	push	rbp
	mov	rbp, rsp
	push	r15
	push	r14
	push	rbx
	and	rsp, -32
	sub	rsp, 128
	vxorps	xmm0, xmm0, xmm0
	vmovaps	ymmword ptr [rsp], ymm0
	mov	qword ptr [rsp + 32], 0
	mov	qword ptr [rsp + 112], rsi
	#APP
	mov	rax, qword ptr fs:[0]
	#NO_APP
	mov	r14, qword ptr [rax - 8]
	mov	qword ptr [rsp], 12
	mov	rax, qword ptr [r14]
	mov	qword ptr [rsp + 8], rax
	mov	rax, rsp
	mov	qword ptr [r14], rax
	mov	rax, qword ptr [rsi]
	mov	rcx, qword ptr [rsi + 8]
	mov	rdx, qword ptr [rsi + 16]
	mov	r8, qword ptr [rax + 32]
	mov	r9, qword ptr [rax]
	mov	rsi, qword ptr [rax + 24]
	mov	r11, qword ptr [rcx]
	mov	rbx, qword ptr [rcx + 32]
	lea	r10, [8*rsi]
	mov	rdi, qword ptr [rdx]
	lea	r15, [8*rbx]
	mov	qword ptr [rsp + 72], r11
	mov	qword ptr [rsp + 80], rdi
	mov	qword ptr [rsp + 88], r9
	mov	qword ptr [rsp + 96], r10
	mov	qword ptr [rsp + 104], r15
	mov	qword ptr [rsp + 48], r8
	mov	qword ptr [rsp + 56], rsi
	mov	qword ptr [rsp + 64], rbx
	mov	qword ptr [rsp + 32], rcx
	mov	qword ptr [rsp + 24], rdx
	mov	qword ptr [rsp + 16], rax
	movabs	rax, offset "j__turbo_!_3483"
	lea	rdi, [rsp + 48]
	lea	rsi, [rsp + 72]
	vzeroupper
	call	rax
	mov	rax, qword ptr [rsp + 8]
	mov	qword ptr [r14], rax
	movabs	rax, 94218903085064
	lea	rsp, [rbp - 24]
	pop	rbx
	pop	r14
	pop	r15
	pop	rbp
	ret
.Lfunc_end0:
	.size	"japi1_matmul!_3476", .Lfunc_end0-"japi1_matmul!_3476"

Release:

"japi1_matmul!_3809":                   # @"japi1_matmul!_3809"
# %bb.0:                                # %L86.preheader
	push	rbp
	mov	rbp, rsp
	push	r15
	push	r14
	push	rbx
	and	rsp, -32
	sub	rsp, 64
	vxorps	xmm0, xmm0, xmm0
	vmovaps	ymmword ptr [rsp], ymm0
	mov	qword ptr [rsp + 32], 0
	mov	qword ptr [rsp + 48], rsi
	#APP
	mov	rax, qword ptr fs:[0]
	#NO_APP
	mov	r14, qword ptr [rax - 8]
	mov	qword ptr [rsp], 12
	mov	rax, qword ptr [r14]
	mov	qword ptr [rsp + 8], rax
	mov	rax, rsp
	mov	qword ptr [r14], rax
	mov	rax, qword ptr [rsi]
	mov	rbx, qword ptr [rsi + 8]
	mov	r10, qword ptr [rsi + 16]
	mov	rdi, qword ptr [rax + 32]
	mov	r9, qword ptr [rax]
	mov	rsi, qword ptr [rax + 24]
	mov	rcx, qword ptr [rbx]
	mov	rdx, qword ptr [rbx + 32]
	lea	r11, [8*rsi]
	mov	r8, qword ptr [r10]
	lea	r15, [8*rdx]
	mov	qword ptr [rsp + 32], rbx
	mov	qword ptr [rsp + 24], r10
	mov	qword ptr [rsp + 16], rax
	movabs	rax, offset "j__turbo_!_3816"
	push	r15
	push	r11
	vzeroupper
	call	rax
	add	rsp, 16
	mov	rax, qword ptr [rsp + 8]
	mov	qword ptr [r14], rax
	movabs	rax, 94379488305160
	lea	rsp, [rbp - 24]
	pop	rbx
	pop	r14
	pop	r15
	pop	rbp
	ret
.Lfunc_end0:
	.size	"japi1_matmul!_3809", .Lfunc_end0-"japi1_matmul!_3809"

Note all the extra stores to the stack in the asm from this PR.

Mirroring this, the first block within the _turbo_! call has more stack loads in this PR:

	.text
	push	rbp
	mov	rbp, rsp
	push	r15
	push	r14
	push	r13
	push	r12
	push	rbx
	mov	rax, r13
	mov	r8, qword ptr [rdi]
	mov	rcx, qword ptr [rdi + 8]
	mov	r11, qword ptr [rdi + 16]
	mov	r9, qword ptr [rsi]
	mov	r14, qword ptr [rsi + 8]
	mov	r15, qword ptr [rsi + 16]
	mov	r13, qword ptr [rsi + 24]
	mov	rsi, qword ptr [rsi + 32]
	mov	rax, qword ptr [rax + 16]
	mov	rax, qword ptr [rax + 16]
	mov	rax, qword ptr [rax]
	dec	r11
	mov	qword ptr [rbp - 72], rcx
	neg	cl
	and	cl, 7
	mov	r12b, -1
	shr	r12b, cl
	lea	rcx, [4*r13]
	add	rcx, r13
	lea	rdi, [rsi + 2*rsi]
	lea	rax, [8*rsi]
	sub	rax, rsi
	lea	rdx, [8*r13]
	sub	rdx, r13
	mov	qword ptr [rbp - 88], rdx
	add	r8, -9
	imul	r8, rsi
	lea	r10, [rsi + 4*rsi]
	lea	rbx, [rsi + 8*rsi]
	lea	rdx, [2*r13]
	add	rdx, r13
	mov	qword ptr [rbp - 56], rdx
	lea	rdx, [r14 + r8]
	mov	qword ptr [rbp - 136], rbx
	add	rbx, rdx
	mov	qword ptr [rbp - 112], rbx
	test	r8, r8
	mov	qword ptr [rbp - 64], r11
	mov	qword ptr [rbp - 80], r9
	mov	dword ptr [rbp - 44], r12d
	js	L2378
	lea	rbx, [8*r13]
	add	rbx, r13
	mov	qword ptr [rbp - 120], rbx
	kmovd	k1, r12d
	mov	r12, r14
	mov	rbx, qword ptr [rbp - 72]
	mov	qword ptr [rbp - 128], rdx
	jmp	L255
	nop	dword ptr [rax + rax]
L224:

Compared to the release:

	.text
	push	rbp
	mov	rbp, rsp
	push	r15
	push	r14
	push	r13
	push	r12
	push	rbx
	mov	r10, r9
	mov	r9, rcx
	mov	rax, qword ptr [rbp + 24]
	mov	r11, qword ptr [rbp + 16]
	mov	rcx, qword ptr [r13 + 16]
	mov	r13, r8
	mov	r8, rdx
	mov	rcx, qword ptr [rcx + 16]
	mov	rcx, qword ptr [rcx]
	dec	r8
	mov	qword ptr [rbp - 64], rsi
	mov	ecx, esi
	neg	cl
	and	cl, 7
	mov	r14b, -1
	shr	r14b, cl
	lea	rbx, [r11 + 4*r11]
	lea	rcx, [rax + 2*rax]
	lea	r12, [8*rax]
	sub	r12, rax
	lea	rdx, [8*r11]
	sub	rdx, r11
	mov	qword ptr [rbp - 72], rdx
	add	rdi, -9
	imul	rdi, rax
	lea	r15, [rax + 4*rax]
	lea	rsi, [rax + 8*rax]
	lea	rdx, [r11 + 2*r11]
	mov	qword ptr [rbp - 56], rdx
	lea	rdx, [r13 + rdi]
	mov	qword ptr [rbp - 96], rsi
	add	rsi, rdx
	mov	qword ptr [rbp - 112], rsi
	test	rdi, rdi
	mov	dword ptr [rbp - 44], r14d
	js	L2367
	lea	rsi, [r11 + 8*r11]
	mov	qword ptr [rbp - 120], rsi
	kmovd	k1, r14d
	mov	r14, r13
	mov	rdi, qword ptr [rbp - 64]
	mov	qword ptr [rbp - 104], r9
	mov	qword ptr [rbp - 128], rdx
	jmp	L220
	nop	word ptr cs:[rax + rax]
L192:

It's not the biggest difference, but you can see all the extra stores and loads from the stack that this PR introduces.

The perfectionist in me doesn't like to see regressions like this; it's a silent, slow decay.

But it seems to be broken anyway.

maleadt (Author) commented Jan 9, 2024

Yes, that is unfortunate. It's probably possible to specialize the optimization so that it only happens for types where it's legal (i.e., not Array), but I'm not familiar enough with the codebase to work on that efficiently (again, my prime motivation right now is to get 1.11 into somewhat better shape, as we haven't been doing well lately).
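
For illustration only, one hypothetical way to express such a gate (the helper name is made up, not LoopVectorization.jl code): restrict the field-by-field decomposition to isbits types, which automatically excludes mutable heap objects like Array.

julia> safe_to_decompose(::Type{T}) where {T} = isbitstype(T);  # hypothetical helper

julia> safe_to_decompose(Tuple{Int,Float64})  # fully inline layout: OK to split into registers
true

julia> safe_to_decompose(Matrix{Float64})     # mutable heap object: pass by reference instead
false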

chriselrod (Member) commented:

It shouldn't be passing arrays to _turbo_!. Arrays should be GC.@preserve'd, and then the pointer and some strides are passed in.
It actually goes through a GroupedStridedPointers object, which consolidates shared information.
There won't be any known shared information in general, but in the example above

julia> function matmul!(C, A, B)
           @turbo for n = indices((C,B), 2), m = indices((C,A),1)
               Cmn = zero(eltype(C))
               for k = indices((A,B), (2,1))
                   Cmn += A[m,k]*B[k,n]
               end
               C[m,n] = Cmn
           end
       end
matmul! (generic function with 1 method)

because Arrays have a dense layout and indices checks that the corresponding axes match, it is known that stride(C,2) == stride(B,2), and thus this stride is shared.
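
A quick check of that claim on the 3×3 example above (dense, column-major arrays with matching axes):

julia> A = rand(3,3); B = rand(3,3); C = similar(A);

julia> stride(C, 2) == stride(B, 2)   # shared column stride, as described above
true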

My primary point here is that in the typical case, we should only need this decomposition and reconstruction for the GroupedStridedPointers object, as that is the form in which arrays are passed.
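
To make the array-passing pattern described above concrete, here is a rough hand-written sketch (the function names are made up; this is not LoopVectorization.jl's actual lowering): the arrays stay behind GC.@preserve, and only raw pointers plus strides cross the call boundary.

# Hypothetical sketch: pass dense column-major matrices as pointer + stride.
function matmul_kernel!(pC::Ptr{Float64}, pA::Ptr{Float64}, pB::Ptr{Float64},
                        sC::Int, sA::Int, sB::Int, M::Int, K::Int, N::Int)
    for n in 0:N-1, m in 0:M-1
        acc = 0.0
        for k in 0:K-1
            # element (m+1, k+1) of A lives at linear offset m + k*stride(A,2)
            acc += unsafe_load(pA, m + k*sA + 1) * unsafe_load(pB, k + n*sB + 1)
        end
        unsafe_store!(pC, acc, m + n*sC + 1)
    end
    return nothing
end

function matmul_via_pointers!(C::Matrix{Float64}, A::Matrix{Float64}, B::Matrix{Float64})
    M, N = size(C); K = size(A, 2)
    # Keep the arrays rooted while only their pointers are in flight.
    GC.@preserve C A B matmul_kernel!(pointer(C), pointer(A), pointer(B),
                                      stride(C, 2), stride(A, 2), stride(B, 2),
                                      M, K, N)
    return C
end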

For multithreaded code, it will try to store objects in either stack space or a per-thread scratch space (use of this per-thread space is locked, so we shouldn't get multiple tasks trying to use the same one). I suspect this code may be problematic.
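
A minimal sketch of that locked per-thread scratch idea, under assumed names and sizes (not LoopVectorization.jl's actual implementation):

# Hypothetical per-thread scratch buffers, each guarded by a lock so that two
# tasks never write into the same buffer at the same time.
const SCRATCH = [Vector{UInt8}(undef, 1 << 12) for _ in 1:Threads.nthreads()]
const SCRATCH_LOCKS = [ReentrantLock() for _ in 1:Threads.nthreads()]

function with_scratch(f)
    i = Threads.threadid()           # pick this thread's buffer
    lock(SCRATCH_LOCKS[i]) do        # serialize access to it
        f(SCRATCH[i])
    end
end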

Anyway, thanks for the time you have spent on this.

My original concern was that we don't use this in any of our products, while you, Jameson, and others are sporadically paying maintenance costs for this ecosystem. It seems to be doing more harm than good.
