Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Commit "Fast integer ops shouldn't wrap" slowed down loading from PtrArrays with four or more dimensions #55

Closed
ranocha opened this issue May 2, 2021 · 3 comments

Comments

@ranocha
Copy link
Contributor

ranocha commented May 2, 2021

I tracked down performance regressions observed in trixi-framework/Trixi.jl#509 to the following problem. On AVX2 (and non-AVX512) systems, the commit e2b6ccb "Fast integer ops shouldn't wrap" in VectorizationBase resulted in significant performance regressions when loading data from a PtrArray via getindex.

Using Julia v1.6.1 on an Intel i7 8700K yields the following results for a minimal working example. I'm using StrideArrays v0.1.6 with StrideArraysCore v0.1.5 and LoopVectorization v0.12.12 with the diff

~/.julia/dev/StrideArrays$ git diff
diff --git a/Project.toml b/Project.toml
index 206a58f..bb0e2b0 100644
--- a/Project.toml
+++ b/Project.toml
@@ -18,13 +18,11 @@ VectorizedRNG = "33b4df10-0173-11e9-2a0c-851a7edac40e"
 
 [compat]
 ArrayInterface = "3"
-LoopVectorization = "0.12.13"
 Octavian = "0.2.3"
 SLEEFPirates = "0.6.13"
 Static = "0.2.4"
 StrideArraysCore = "0.1.3"
 ThreadingUtilities = "0.4"
-VectorizationBase = "0.19.32"
 VectorizedRNG = "0.2.8"
 julia = "1.5"

in StrideArrays to be able to check different versions of VectorizationBase.

Last good commit

~/.julia/dev/VectorizationBase$ git checkout cb8185d6a1b4b4ad84f0c539af7b4b39d5b7bb59
Previous HEAD position was e2b6ccb Fast integer ops shouldn't wrap
HEAD is now at cb8185d non small pow2 non-AVX512 mask fixes
(StrideArrays) pkg> up
    Updating registry at `~/.julia/registries/General`
    Updating git-repo `https://github.com/JuliaRegistries/General`
  No Changes to `~/.julia/dev/StrideArrays/Project.toml`
  No Changes to `~/.julia/dev/StrideArrays/Manifest.toml`
Precompiling project...
  9 dependencies successfully precompiled in 18 seconds (15 already precompiled)

julia> using StrideArrays, BenchmarkTools

julia> function foo(a::AbstractArray{T,N}) where {T,N}
           idx = ntuple(_ -> 1, Val(N))
           @inbounds res = a[idx...]
           res
       end
foo (generic function with 1 method)

julia> for N in 1:5
           a_array = randn(ntuple(_ -> 4, N)...)
           a_stride     = StrideArray(a_array)
           a_ptr        = PtrArray(pointer(a_array), size(a_array))
           a_ptr_static = PtrArray(pointer(a_array), map(StaticInt, size(a_array)))
           @info "New round" N
           @btime foo($a_array)
           @btime foo($a_stride)
           @btime foo($a_ptr)
           @btime foo($a_ptr_static)
       end
┌ Info: New round
└   N = 1
  1.056 ns (0 allocations: 0 bytes)
  1.079 ns (0 allocations: 0 bytes)
  1.056 ns (0 allocations: 0 bytes)
  1.056 ns (0 allocations: 0 bytes)
┌ Info: New round
└   N = 2
  1.056 ns (0 allocations: 0 bytes)
  1.056 ns (0 allocations: 0 bytes)
  1.057 ns (0 allocations: 0 bytes)
  1.056 ns (0 allocations: 0 bytes)
┌ Info: New round
└   N = 3
  1.056 ns (0 allocations: 0 bytes)
  1.056 ns (0 allocations: 0 bytes)
  1.056 ns (0 allocations: 0 bytes)
  1.056 ns (0 allocations: 0 bytes)
┌ Info: New round
└   N = 4
  1.057 ns (0 allocations: 0 bytes)
  1.057 ns (0 allocations: 0 bytes)
  1.057 ns (0 allocations: 0 bytes)
  1.057 ns (0 allocations: 0 bytes)
┌ Info: New round
└   N = 5
  1.057 ns (0 allocations: 0 bytes)
  1.057 ns (0 allocations: 0 bytes)
  1.057 ns (0 allocations: 0 bytes)
  1.057 ns (0 allocations: 0 bytes)

First bad commit

~/.julia/dev/VectorizationBase$ git checkout e2b6ccbc9cea28e57b44b17dc864d46217cdd93d
Previous HEAD position was cb8185d non small pow2 non-AVX512 mask fixes
HEAD is now at e2b6ccb Fast integer ops shouldn't wrap
(StrideArrays) pkg> up
    Updating registry at `~/.julia/registries/General`
    Updating git-repo `https://github.com/JuliaRegistries/General`
  No Changes to `~/.julia/dev/StrideArrays/Project.toml`
  No Changes to `~/.julia/dev/StrideArrays/Manifest.toml`
Precompiling project...
  9 dependencies successfully precompiled in 18 seconds (15 already precompiled)

julia> using StrideArrays, BenchmarkTools

julia> function foo(a::AbstractArray{T,N}) where {T,N}
           idx = ntuple(_ -> 1, Val(N))
           @inbounds res = a[idx...]
           res
       end
foo (generic function with 1 method)

julia> for N in 1:5
           a_array = randn(ntuple(_ -> 4, N)...)
           a_stride     = StrideArray(a_array)
           a_ptr        = PtrArray(pointer(a_array), size(a_array))
           a_ptr_static = PtrArray(pointer(a_array), map(StaticInt, size(a_array)))
           @info "New round" N
           @btime foo($a_array)
           @btime foo($a_stride)
           @btime foo($a_ptr)
           @btime foo($a_ptr_static)
       end
┌ Info: New round
└   N = 1
  1.056 ns (0 allocations: 0 bytes)
  1.056 ns (0 allocations: 0 bytes)
  1.056 ns (0 allocations: 0 bytes)
  1.056 ns (0 allocations: 0 bytes)
┌ Info: New round
└   N = 2
  1.056 ns (0 allocations: 0 bytes)
  1.057 ns (0 allocations: 0 bytes)
  1.057 ns (0 allocations: 0 bytes)
  1.057 ns (0 allocations: 0 bytes)
┌ Info: New round
└   N = 3
  1.057 ns (0 allocations: 0 bytes)
  1.057 ns (0 allocations: 0 bytes)
  1.057 ns (0 allocations: 0 bytes)
  1.057 ns (0 allocations: 0 bytes)
┌ Info: New round
└   N = 4
  1.057 ns (0 allocations: 0 bytes)
  2.099 ns (0 allocations: 0 bytes)
  2.099 ns (0 allocations: 0 bytes)
  2.104 ns (0 allocations: 0 bytes)
┌ Info: New round
└   N = 5
  1.056 ns (0 allocations: 0 bytes)
  2.099 ns (0 allocations: 0 bytes)
  2.103 ns (0 allocations: 0 bytes)
  2.099 ns (0 allocations: 0 bytes)
@chriselrod
Copy link
Member

julia> using StrideArrays, BenchmarkTools

julia> function foo(a::AbstractArray{T,N}) where {T,N}
           idx = ntuple(_ -> 1, Val(N))
           @inbounds res = a[idx...]
           res
       end
foo (generic function with 1 method)

julia> for N in 1:5
           a_array = randn(ntuple(_ -> 4, N)...)
           a_stride     = StrideArray(a_array)
           a_ptr        = PtrArray(pointer(a_array), size(a_array))
           a_ptr_static = PtrArray(pointer(a_array), map(StaticInt, size(a_array)))
           @info "New round" N
           @btime foo($a_array)
           @btime foo($a_stride)
           @btime foo($a_ptr)
           @btime foo($a_ptr_static)
       end
┌ Info: New round
└   N = 1
  1.104 ns (0 allocations: 0 bytes)
  1.104 ns (0 allocations: 0 bytes)
  1.106 ns (0 allocations: 0 bytes)
  1.104 ns (0 allocations: 0 bytes)
┌ Info: New round
└   N = 2
  1.118 ns (0 allocations: 0 bytes)
  1.104 ns (0 allocations: 0 bytes)
  1.103 ns (0 allocations: 0 bytes)
  1.107 ns (0 allocations: 0 bytes)
┌ Info: New round
└   N = 3
  1.104 ns (0 allocations: 0 bytes)
  1.104 ns (0 allocations: 0 bytes)
  1.104 ns (0 allocations: 0 bytes)
  1.106 ns (0 allocations: 0 bytes)
┌ Info: New round
└   N = 4
  1.104 ns (0 allocations: 0 bytes)
  2.218 ns (0 allocations: 0 bytes)
  1.983 ns (0 allocations: 0 bytes)
  2.349 ns (0 allocations: 0 bytes)
┌ Info: New round
└   N = 5
  1.107 ns (0 allocations: 0 bytes)
  2.201 ns (0 allocations: 0 bytes)
  1.982 ns (0 allocations: 0 bytes)
  2.198 ns (0 allocations: 0 bytes)

julia> @inline function foo(a::AbstractArray{T,N}) where {T,N}
           idx = ntuple(_ -> 1, Val(N))
           @inbounds res = a[idx...]
           res
       end
foo (generic function with 1 method)

julia> for N in 1:5
           a_array = randn(ntuple(_ -> 4, N)...)
           a_stride     = StrideArray(a_array)
           a_ptr        = PtrArray(pointer(a_array), size(a_array))
           a_ptr_static = PtrArray(pointer(a_array), map(StaticInt, size(a_array)))
           @info "New round" N
           @btime foo($a_array)
           @btime foo($a_stride)
           @btime foo($a_ptr)
           @btime foo($a_ptr_static)
       end
┌ Info: New round
└   N = 1
  1.103 ns (0 allocations: 0 bytes)
  1.104 ns (0 allocations: 0 bytes)
  1.104 ns (0 allocations: 0 bytes)
  1.104 ns (0 allocations: 0 bytes)
┌ Info: New round
└   N = 2
  1.106 ns (0 allocations: 0 bytes)
  1.103 ns (0 allocations: 0 bytes)
  1.105 ns (0 allocations: 0 bytes)
  1.104 ns (0 allocations: 0 bytes)
┌ Info: New round
└   N = 3
  1.108 ns (0 allocations: 0 bytes)
  1.105 ns (0 allocations: 0 bytes)
  1.106 ns (0 allocations: 0 bytes)
  1.106 ns (0 allocations: 0 bytes)
┌ Info: New round
└   N = 4
  1.103 ns (0 allocations: 0 bytes)
  1.103 ns (0 allocations: 0 bytes)
  1.104 ns (0 allocations: 0 bytes)
  1.106 ns (0 allocations: 0 bytes)
┌ Info: New round
└   N = 5
  1.104 ns (0 allocations: 0 bytes)
  1.109 ns (0 allocations: 0 bytes)
  1.107 ns (0 allocations: 0 bytes)
  1.108 ns (0 allocations: 0 bytes)

The problem is that following that commit, foo no longer inlines, adding function call overhead.
You can work around this by manually adding @inline or @propagate_inbounds to the functions you wish to inline.

The current version might generate better code in some cases:

julia> N = 5;

julia> a_array = randn(ntuple(_ -> 4, N)...); a_stride = StrideArray(a_array);

julia> @inline inbounds_getindex(A,i::Vararg{Any,K}) where {K} = @inbounds A[i...]

julia> @btime ntuple(@inline(k -> ntuple(@inline(i -> ntuple(@inline(j -> inbounds_getindex($a_array,j,i,k,1,1)),Val(4))),Val(4))),Val(4))
  6.501 ns (0 allocations: 0 bytes)
(((-1.0845134036492494, 0.39319992091195866, -0.8895077857106304, -0.964892931202095), (2.0334187989577157, -2.35024835499007, -0.6695110129668518, 0.23273920015432767), (-1.5995251412408744, 0.48350248524295886, 1.1584175201853102, 1.2967183832634657), (0.6038235462101194, -0.8909856309920922, -0.052482763971459415, -0.5636245121330408)), ((-0.1323632823158482, 0.43906713631489525, 2.041036278214374, 1.333462654283023), (-0.590749444206323, -0.5638287312934376, -0.1600949086014227, -1.0773625948057042), (1.9468834433378406, -0.7367494587056644, 0.3168111912464907, -1.2375796945559765), (0.5612651890807325, -0.3157157212029217, -0.577796542397977, -0.025728273365417796)), ((0.2421323617521182, -0.8504194026834689, -1.952109988993807, 0.4538025683624084), (0.4579110544376671, 0.6229132137433413, 0.6251117502074499, -1.307212183077493), (-1.1277874037846727, 1.1857956647834296, 0.3390898603722827, 0.3132461216912783), (-0.3606955803358298, 0.2860279843675183, -0.6161026349564189, 0.2388840752091055)), ((-1.510597148876941, 1.408674128168656, 0.15194308350850289, 2.032177662862577), (-0.04885375760515278, 1.9821423773144886, 1.3882563062679014, 0.24915632308057115), (-2.44791611171785, 0.6001244841414322, 1.2575221138980344, 0.38201615843991393), (-0.08420300440414022, -0.7635435445809656, 0.7436804735576459, -0.19661438632061876)))

julia> @btime ntuple(@inline(k -> ntuple(@inline(i -> ntuple(@inline(j -> inbounds_getindex($a_stride,j,i,k,1,1)),Val(4))),Val(4))),Val(4))
  5.828 ns (0 allocations: 0 bytes)
(((-1.0845134036492494, 0.39319992091195866, -0.8895077857106304, -0.964892931202095), (2.0334187989577157, -2.35024835499007, -0.6695110129668518, 0.23273920015432767), (-1.5995251412408744, 0.48350248524295886, 1.1584175201853102, 1.2967183832634657), (0.6038235462101194, -0.8909856309920922, -0.052482763971459415, -0.5636245121330408)), ((-0.1323632823158482, 0.43906713631489525, 2.041036278214374, 1.333462654283023), (-0.590749444206323, -0.5638287312934376, -0.1600949086014227, -1.0773625948057042), (1.9468834433378406, -0.7367494587056644, 0.3168111912464907, -1.2375796945559765), (0.5612651890807325, -0.3157157212029217, -0.577796542397977, -0.025728273365417796)), ((0.2421323617521182, -0.8504194026834689, -1.952109988993807, 0.4538025683624084), (0.4579110544376671, 0.6229132137433413, 0.6251117502074499, -1.307212183077493), (-1.1277874037846727, 1.1857956647834296, 0.3390898603722827, 0.3132461216912783), (-0.3606955803358298, 0.2860279843675183, -0.6161026349564189, 0.2388840752091055)), ((-1.510597148876941, 1.408674128168656, 0.15194308350850289, 2.032177662862577), (-0.04885375760515278, 1.9821423773144886, 1.3882563062679014, 0.24915632308057115), (-2.44791611171785, 0.6001244841414322, 1.2575221138980344, 0.38201615843991393), (-0.08420300440414022, -0.7635435445809656, 0.7436804735576459, -0.19661438632061876)))

But every other example I tried performed the same.

A fix would be to make the indexing functions not use vload(stridedpointer(A),I), but use something like

@inline function Base.getindex(A::AbstractStrideArray{S,D,T,K}, i::Vararg{Integer,K}) where {S,D,T,K}
    b = preserve_buffer(A)
    P = PtrArray(A)
    GC.@preserve b begin
        @boundscheck checkbounds(P, i...)
        sp = stridedpointer(P) # note `strides(::AbstractStridedPointer)` returns strides in bytes, not elements
        unsafe_load(pointer(sp) + sum(map(*,map(-,i,VectorizationBase.offsets(sp)), strides(sp))))
    end
end

@chriselrod
Copy link
Member

Closing this as fixed by StrideArraysCore 0.1.6, which switches from llvmcall based versions to base +/*, etc.
The llvmcall versions can tell LLVM information like "no signed wrap". This enables some optimizations, for example:

julia> using VectorizationBase

julia> cmp_1(x,y) = x  y - 1
cmp_1 (generic function with 1 method)

julia> cmp_2(x,y) = x  VectorizationBase.vsub_fast(y, 1)
cmp_2 (generic function with 1 method)
; julia> @cl cmp_1(4,5)
define i8 @julia_cmp_1_962(i64 signext %0, i64 signext %1) #0 {
top:
  %2 = add i64 %1, -1
  %3 = icmp sge i64 %2, %0
  %4 = zext i1 %3 to i8
  ret i8 %4
}
; julia> @cl cmp_2(4,5)
define i8 @julia_cmp_2_964(i64 signext %0, i64 signext %1) #0 {
top:
  %2 = icmp sgt i64 %1, %0
  %3 = zext i1 %2 to i8
  ret i8 %3
}

It is illegal for LLVM to transform x <= y - 1 into x < y, because they are not equivalent if if y == typemin(Int).
But by telling LLVM that y - 1 will not wrap, it can make the transform, saving an instruction.

This is what the offending VectorizationBase commit enabled.
However, Julia's inliner marks all llvmcalls as expensive. Therefore, using just a few of them will make it choose not to inline your function. Even though the llvmcall vsub_fast is often effectively cheaper than - (because LLVM can delete the former in cases where it cannot delete the latter, such as the above example), vsub_fast is considered to be 40 times as expensive by Julia's inliner IIRC.
So, code using llvmcall normally forces the code author to decide whether or not it should inline; Julia's inliner can't see inside the llvmcall and thus can't judge its true cost, so it cannot make good decisions.

The StrideArraysCore fix was to get rid of the llvmcalls. vload is used mostly inside LoopVectorization._avx_!, so extra llvmcalls there are fine. StrideArrays indexing is intended to be used in generic Julia code, so I think the tradeoff there should definitely be in favor of helping Julia's compiler.
Especially because the main benefit to LLVM is for these comparison operations, e.g. specifying loop bounds. Not for indexing, meaning nsw is unlikely to ever help here anyway.

@ranocha
Copy link
Contributor Author

ranocha commented May 3, 2021

Thanks a lot for the quick response! Just adding a few more @inline annotations fixed the problem in my case.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants