Commit "Fast integer ops shouldn't wrap" slowed down loading from PtrArrays with four or more dimensions #55

ranocha · 2021-05-02T13:25:12Z

I tracked down performance regressions observed in trixi-framework/Trixi.jl#509 to the following problem. On AVX2 (and non-AVX512) systems, the commit e2b6ccb "Fast integer ops shouldn't wrap" in VectorizationBase resulted in significant performance regressions when loading data from a PtrArray via getindex.

Using Julia v1.6.1 on an Intel i7 8700K yields the following results for a minimal working example. I'm using StrideArrays v0.1.6 with StrideArraysCore v0.1.5 and LoopVectorization v0.12.12 with the diff

~/.julia/dev/StrideArrays$ git diff
diff --git a/Project.toml b/Project.toml
index 206a58f..bb0e2b0 100644
--- a/Project.toml
+++ b/Project.toml
@@ -18,13 +18,11 @@ VectorizedRNG = "33b4df10-0173-11e9-2a0c-851a7edac40e"
 
 [compat]
 ArrayInterface = "3"
-LoopVectorization = "0.12.13"
 Octavian = "0.2.3"
 SLEEFPirates = "0.6.13"
 Static = "0.2.4"
 StrideArraysCore = "0.1.3"
 ThreadingUtilities = "0.4"
-VectorizationBase = "0.19.32"
 VectorizedRNG = "0.2.8"
 julia = "1.5"

in StrideArrays to be able to check different versions of VectorizationBase.

Last good commit

~/.julia/dev/VectorizationBase$ git checkout cb8185d6a1b4b4ad84f0c539af7b4b39d5b7bb59
Previous HEAD position was e2b6ccb Fast integer ops shouldn't wrap
HEAD is now at cb8185d non small pow2 non-AVX512 mask fixes

(StrideArrays) pkg> up
    Updating registry at `~/.julia/registries/General`
    Updating git-repo `https://github.com/JuliaRegistries/General`
  No Changes to `~/.julia/dev/StrideArrays/Project.toml`
  No Changes to `~/.julia/dev/StrideArrays/Manifest.toml`
Precompiling project...
  9 dependencies successfully precompiled in 18 seconds (15 already precompiled)

julia> using StrideArrays, BenchmarkTools

julia> function foo(a::AbstractArray{T,N}) where {T,N}
           idx = ntuple(_ -> 1, Val(N))
           @inbounds res = a[idx...]
           res
       end
foo (generic function with 1 method)

julia> for N in 1:5
           a_array = randn(ntuple(_ -> 4, N)...)
           a_stride     = StrideArray(a_array)
           a_ptr        = PtrArray(pointer(a_array), size(a_array))
           a_ptr_static = PtrArray(pointer(a_array), map(StaticInt, size(a_array)))
           @info "New round" N
           @btime foo($a_array)
           @btime foo($a_stride)
           @btime foo($a_ptr)
           @btime foo($a_ptr_static)
       end
┌ Info: New round
└   N = 1
  1.056 ns (0 allocations: 0 bytes)
  1.079 ns (0 allocations: 0 bytes)
  1.056 ns (0 allocations: 0 bytes)
  1.056 ns (0 allocations: 0 bytes)
┌ Info: New round
└   N = 2
  1.056 ns (0 allocations: 0 bytes)
  1.056 ns (0 allocations: 0 bytes)
  1.057 ns (0 allocations: 0 bytes)
  1.056 ns (0 allocations: 0 bytes)
┌ Info: New round
└   N = 3
  1.056 ns (0 allocations: 0 bytes)
  1.056 ns (0 allocations: 0 bytes)
  1.056 ns (0 allocations: 0 bytes)
  1.056 ns (0 allocations: 0 bytes)
┌ Info: New round
└   N = 4
  1.057 ns (0 allocations: 0 bytes)
  1.057 ns (0 allocations: 0 bytes)
  1.057 ns (0 allocations: 0 bytes)
  1.057 ns (0 allocations: 0 bytes)
┌ Info: New round
└   N = 5
  1.057 ns (0 allocations: 0 bytes)
  1.057 ns (0 allocations: 0 bytes)
  1.057 ns (0 allocations: 0 bytes)
  1.057 ns (0 allocations: 0 bytes)

First bad commit

~/.julia/dev/VectorizationBase$ git checkout e2b6ccbc9cea28e57b44b17dc864d46217cdd93d
Previous HEAD position was cb8185d non small pow2 non-AVX512 mask fixes
HEAD is now at e2b6ccb Fast integer ops shouldn't wrap

(StrideArrays) pkg> up
    Updating registry at `~/.julia/registries/General`
    Updating git-repo `https://github.com/JuliaRegistries/General`
  No Changes to `~/.julia/dev/StrideArrays/Project.toml`
  No Changes to `~/.julia/dev/StrideArrays/Manifest.toml`
Precompiling project...
  9 dependencies successfully precompiled in 18 seconds (15 already precompiled)

julia> using StrideArrays, BenchmarkTools

julia> function foo(a::AbstractArray{T,N}) where {T,N}
           idx = ntuple(_ -> 1, Val(N))
           @inbounds res = a[idx...]
           res
       end
foo (generic function with 1 method)

julia> for N in 1:5
           a_array = randn(ntuple(_ -> 4, N)...)
           a_stride     = StrideArray(a_array)
           a_ptr        = PtrArray(pointer(a_array), size(a_array))
           a_ptr_static = PtrArray(pointer(a_array), map(StaticInt, size(a_array)))
           @info "New round" N
           @btime foo($a_array)
           @btime foo($a_stride)
           @btime foo($a_ptr)
           @btime foo($a_ptr_static)
       end
┌ Info: New round
└   N = 1
  1.056 ns (0 allocations: 0 bytes)
  1.056 ns (0 allocations: 0 bytes)
  1.056 ns (0 allocations: 0 bytes)
  1.056 ns (0 allocations: 0 bytes)
┌ Info: New round
└   N = 2
  1.056 ns (0 allocations: 0 bytes)
  1.057 ns (0 allocations: 0 bytes)
  1.057 ns (0 allocations: 0 bytes)
  1.057 ns (0 allocations: 0 bytes)
┌ Info: New round
└   N = 3
  1.057 ns (0 allocations: 0 bytes)
  1.057 ns (0 allocations: 0 bytes)
  1.057 ns (0 allocations: 0 bytes)
  1.057 ns (0 allocations: 0 bytes)
┌ Info: New round
└   N = 4
  1.057 ns (0 allocations: 0 bytes)
  2.099 ns (0 allocations: 0 bytes)
  2.099 ns (0 allocations: 0 bytes)
  2.104 ns (0 allocations: 0 bytes)
┌ Info: New round
└   N = 5
  1.056 ns (0 allocations: 0 bytes)
  2.099 ns (0 allocations: 0 bytes)
  2.103 ns (0 allocations: 0 bytes)
  2.099 ns (0 allocations: 0 bytes)

The text was updated successfully, but these errors were encountered:

chriselrod · 2021-05-02T16:39:46Z

julia> using StrideArrays, BenchmarkTools

julia> function foo(a::AbstractArray{T,N}) where {T,N}
           idx = ntuple(_ -> 1, Val(N))
           @inbounds res = a[idx...]
           res
       end
foo (generic function with 1 method)

julia> for N in 1:5
           a_array = randn(ntuple(_ -> 4, N)...)
           a_stride     = StrideArray(a_array)
           a_ptr        = PtrArray(pointer(a_array), size(a_array))
           a_ptr_static = PtrArray(pointer(a_array), map(StaticInt, size(a_array)))
           @info "New round" N
           @btime foo($a_array)
           @btime foo($a_stride)
           @btime foo($a_ptr)
           @btime foo($a_ptr_static)
       end
┌ Info: New round
└   N = 1
  1.104 ns (0 allocations: 0 bytes)
  1.104 ns (0 allocations: 0 bytes)
  1.106 ns (0 allocations: 0 bytes)
  1.104 ns (0 allocations: 0 bytes)
┌ Info: New round
└   N = 2
  1.118 ns (0 allocations: 0 bytes)
  1.104 ns (0 allocations: 0 bytes)
  1.103 ns (0 allocations: 0 bytes)
  1.107 ns (0 allocations: 0 bytes)
┌ Info: New round
└   N = 3
  1.104 ns (0 allocations: 0 bytes)
  1.104 ns (0 allocations: 0 bytes)
  1.104 ns (0 allocations: 0 bytes)
  1.106 ns (0 allocations: 0 bytes)
┌ Info: New round
└   N = 4
  1.104 ns (0 allocations: 0 bytes)
  2.218 ns (0 allocations: 0 bytes)
  1.983 ns (0 allocations: 0 bytes)
  2.349 ns (0 allocations: 0 bytes)
┌ Info: New round
└   N = 5
  1.107 ns (0 allocations: 0 bytes)
  2.201 ns (0 allocations: 0 bytes)
  1.982 ns (0 allocations: 0 bytes)
  2.198 ns (0 allocations: 0 bytes)

julia> @inline function foo(a::AbstractArray{T,N}) where {T,N}
           idx = ntuple(_ -> 1, Val(N))
           @inbounds res = a[idx...]
           res
       end
foo (generic function with 1 method)

julia> for N in 1:5
           a_array = randn(ntuple(_ -> 4, N)...)
           a_stride     = StrideArray(a_array)
           a_ptr        = PtrArray(pointer(a_array), size(a_array))
           a_ptr_static = PtrArray(pointer(a_array), map(StaticInt, size(a_array)))
           @info "New round" N
           @btime foo($a_array)
           @btime foo($a_stride)
           @btime foo($a_ptr)
           @btime foo($a_ptr_static)
       end
┌ Info: New round
└   N = 1
  1.103 ns (0 allocations: 0 bytes)
  1.104 ns (0 allocations: 0 bytes)
  1.104 ns (0 allocations: 0 bytes)
  1.104 ns (0 allocations: 0 bytes)
┌ Info: New round
└   N = 2
  1.106 ns (0 allocations: 0 bytes)
  1.103 ns (0 allocations: 0 bytes)
  1.105 ns (0 allocations: 0 bytes)
  1.104 ns (0 allocations: 0 bytes)
┌ Info: New round
└   N = 3
  1.108 ns (0 allocations: 0 bytes)
  1.105 ns (0 allocations: 0 bytes)
  1.106 ns (0 allocations: 0 bytes)
  1.106 ns (0 allocations: 0 bytes)
┌ Info: New round
└   N = 4
  1.103 ns (0 allocations: 0 bytes)
  1.103 ns (0 allocations: 0 bytes)
  1.104 ns (0 allocations: 0 bytes)
  1.106 ns (0 allocations: 0 bytes)
┌ Info: New round
└   N = 5
  1.104 ns (0 allocations: 0 bytes)
  1.109 ns (0 allocations: 0 bytes)
  1.107 ns (0 allocations: 0 bytes)
  1.108 ns (0 allocations: 0 bytes)

The problem is that following that commit, foo no longer inlines, adding function call overhead.
You can work around this by manually adding @inline or @propagate_inbounds to the functions you wish to inline.

The current version might generate better code in some cases:

julia> N = 5;

julia> a_array = randn(ntuple(_ -> 4, N)...); a_stride = StrideArray(a_array);

julia> @inline inbounds_getindex(A,i::Vararg{Any,K}) where {K} = @inbounds A[i...]

julia> @btime ntuple(@inline(k -> ntuple(@inline(i -> ntuple(@inline(j -> inbounds_getindex($a_array,j,i,k,1,1)),Val(4))),Val(4))),Val(4))
  6.501 ns (0 allocations: 0 bytes)
(((-1.0845134036492494, 0.39319992091195866, -0.8895077857106304, -0.964892931202095), (2.0334187989577157, -2.35024835499007, -0.6695110129668518, 0.23273920015432767), (-1.5995251412408744, 0.48350248524295886, 1.1584175201853102, 1.2967183832634657), (0.6038235462101194, -0.8909856309920922, -0.052482763971459415, -0.5636245121330408)), ((-0.1323632823158482, 0.43906713631489525, 2.041036278214374, 1.333462654283023), (-0.590749444206323, -0.5638287312934376, -0.1600949086014227, -1.0773625948057042), (1.9468834433378406, -0.7367494587056644, 0.3168111912464907, -1.2375796945559765), (0.5612651890807325, -0.3157157212029217, -0.577796542397977, -0.025728273365417796)), ((0.2421323617521182, -0.8504194026834689, -1.952109988993807, 0.4538025683624084), (0.4579110544376671, 0.6229132137433413, 0.6251117502074499, -1.307212183077493), (-1.1277874037846727, 1.1857956647834296, 0.3390898603722827, 0.3132461216912783), (-0.3606955803358298, 0.2860279843675183, -0.6161026349564189, 0.2388840752091055)), ((-1.510597148876941, 1.408674128168656, 0.15194308350850289, 2.032177662862577), (-0.04885375760515278, 1.9821423773144886, 1.3882563062679014, 0.24915632308057115), (-2.44791611171785, 0.6001244841414322, 1.2575221138980344, 0.38201615843991393), (-0.08420300440414022, -0.7635435445809656, 0.7436804735576459, -0.19661438632061876)))

julia> @btime ntuple(@inline(k -> ntuple(@inline(i -> ntuple(@inline(j -> inbounds_getindex($a_stride,j,i,k,1,1)),Val(4))),Val(4))),Val(4))
  5.828 ns (0 allocations: 0 bytes)
(((-1.0845134036492494, 0.39319992091195866, -0.8895077857106304, -0.964892931202095), (2.0334187989577157, -2.35024835499007, -0.6695110129668518, 0.23273920015432767), (-1.5995251412408744, 0.48350248524295886, 1.1584175201853102, 1.2967183832634657), (0.6038235462101194, -0.8909856309920922, -0.052482763971459415, -0.5636245121330408)), ((-0.1323632823158482, 0.43906713631489525, 2.041036278214374, 1.333462654283023), (-0.590749444206323, -0.5638287312934376, -0.1600949086014227, -1.0773625948057042), (1.9468834433378406, -0.7367494587056644, 0.3168111912464907, -1.2375796945559765), (0.5612651890807325, -0.3157157212029217, -0.577796542397977, -0.025728273365417796)), ((0.2421323617521182, -0.8504194026834689, -1.952109988993807, 0.4538025683624084), (0.4579110544376671, 0.6229132137433413, 0.6251117502074499, -1.307212183077493), (-1.1277874037846727, 1.1857956647834296, 0.3390898603722827, 0.3132461216912783), (-0.3606955803358298, 0.2860279843675183, -0.6161026349564189, 0.2388840752091055)), ((-1.510597148876941, 1.408674128168656, 0.15194308350850289, 2.032177662862577), (-0.04885375760515278, 1.9821423773144886, 1.3882563062679014, 0.24915632308057115), (-2.44791611171785, 0.6001244841414322, 1.2575221138980344, 0.38201615843991393), (-0.08420300440414022, -0.7635435445809656, 0.7436804735576459, -0.19661438632061876)))

But every other example I tried performed the same.

A fix would be to make the indexing functions not use vload(stridedpointer(A),I), but use something like

@inline function Base.getindex(A::AbstractStrideArray{S,D,T,K}, i::Vararg{Integer,K}) where {S,D,T,K}
    b = preserve_buffer(A)
    P = PtrArray(A)
    GC.@preserve b begin
        @boundscheck checkbounds(P, i...)
        sp = stridedpointer(P) # note `strides(::AbstractStridedPointer)` returns strides in bytes, not elements
        unsafe_load(pointer(sp) + sum(map(*,map(-,i,VectorizationBase.offsets(sp)), strides(sp))))
    end
end

chriselrod · 2021-05-03T07:06:38Z

Closing this as fixed by StrideArraysCore 0.1.6, which switches from llvmcall based versions to base +/*, etc.
The llvmcall versions can tell LLVM information like "no signed wrap". This enables some optimizations, for example:

julia> using VectorizationBase

julia> cmp_1(x,y) = x ≤ y - 1
cmp_1 (generic function with 1 method)

julia> cmp_2(x,y) = x ≤ VectorizationBase.vsub_fast(y, 1)
cmp_2 (generic function with 1 method)

; julia> @cl cmp_1(4,5)
define i8 @julia_cmp_1_962(i64 signext %0, i64 signext %1) #0 {
top:
  %2 = add i64 %1, -1
  %3 = icmp sge i64 %2, %0
  %4 = zext i1 %3 to i8
  ret i8 %4
}

; julia> @cl cmp_2(4,5)
define i8 @julia_cmp_2_964(i64 signext %0, i64 signext %1) #0 {
top:
  %2 = icmp sgt i64 %1, %0
  %3 = zext i1 %2 to i8
  ret i8 %3
}

It is illegal for LLVM to transform x <= y - 1 into x < y, because they are not equivalent if if y == typemin(Int).
But by telling LLVM that y - 1 will not wrap, it can make the transform, saving an instruction.

This is what the offending VectorizationBase commit enabled.
However, Julia's inliner marks all llvmcalls as expensive. Therefore, using just a few of them will make it choose not to inline your function. Even though the llvmcall vsub_fast is often effectively cheaper than - (because LLVM can delete the former in cases where it cannot delete the latter, such as the above example), vsub_fast is considered to be 40 times as expensive by Julia's inliner IIRC.
So, code using llvmcall normally forces the code author to decide whether or not it should inline; Julia's inliner can't see inside the llvmcall and thus can't judge its true cost, so it cannot make good decisions.

The StrideArraysCore fix was to get rid of the llvmcalls. vload is used mostly inside LoopVectorization._avx_!, so extra llvmcalls there are fine. StrideArrays indexing is intended to be used in generic Julia code, so I think the tradeoff there should definitely be in favor of helping Julia's compiler.
Especially because the main benefit to LLVM is for these comparison operations, e.g. specifying loop bounds. Not for indexing, meaning nsw is unlikely to ever help here anyway.

ranocha · 2021-05-03T07:22:12Z

Thanks a lot for the quick response! Just adding a few more @inline annotations fixed the problem in my case.

ranocha mentioned this issue May 2, 2021

Improve performance, not only when using multiple threads trixi-framework/Trixi.jl#509

Merged

10 tasks

chriselrod closed this as completed May 3, 2021

chriselrod mentioned this issue Jun 4, 2021

ArrayIndex Interface JuliaArrays/ArrayInterface.jl#159

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Commit "Fast integer ops shouldn't wrap" slowed down loading from PtrArrays with four or more dimensions #55

Commit "Fast integer ops shouldn't wrap" slowed down loading from PtrArrays with four or more dimensions #55

ranocha commented May 2, 2021

chriselrod commented May 2, 2021

chriselrod commented May 3, 2021

ranocha commented May 3, 2021

Commit "Fast integer ops shouldn't wrap" slowed down loading from PtrArrays with four or more dimensions #55

Commit "Fast integer ops shouldn't wrap" slowed down loading from PtrArrays with four or more dimensions #55

Comments

ranocha commented May 2, 2021

Last good commit

First bad commit

chriselrod commented May 2, 2021

chriselrod commented May 3, 2021

ranocha commented May 3, 2021