-
Notifications
You must be signed in to change notification settings - Fork 30
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Commit "Fast integer ops shouldn't wrap" slowed down loading from PtrArrays with four or more dimensions #55
Comments
julia> using StrideArrays, BenchmarkTools
julia> function foo(a::AbstractArray{T,N}) where {T,N}
idx = ntuple(_ -> 1, Val(N))
@inbounds res = a[idx...]
res
end
foo (generic function with 1 method)
julia> for N in 1:5
a_array = randn(ntuple(_ -> 4, N)...)
a_stride = StrideArray(a_array)
a_ptr = PtrArray(pointer(a_array), size(a_array))
a_ptr_static = PtrArray(pointer(a_array), map(StaticInt, size(a_array)))
@info "New round" N
@btime foo($a_array)
@btime foo($a_stride)
@btime foo($a_ptr)
@btime foo($a_ptr_static)
end
┌ Info: New round
└ N = 1
1.104 ns (0 allocations: 0 bytes)
1.104 ns (0 allocations: 0 bytes)
1.106 ns (0 allocations: 0 bytes)
1.104 ns (0 allocations: 0 bytes)
┌ Info: New round
└ N = 2
1.118 ns (0 allocations: 0 bytes)
1.104 ns (0 allocations: 0 bytes)
1.103 ns (0 allocations: 0 bytes)
1.107 ns (0 allocations: 0 bytes)
┌ Info: New round
└ N = 3
1.104 ns (0 allocations: 0 bytes)
1.104 ns (0 allocations: 0 bytes)
1.104 ns (0 allocations: 0 bytes)
1.106 ns (0 allocations: 0 bytes)
┌ Info: New round
└ N = 4
1.104 ns (0 allocations: 0 bytes)
2.218 ns (0 allocations: 0 bytes)
1.983 ns (0 allocations: 0 bytes)
2.349 ns (0 allocations: 0 bytes)
┌ Info: New round
└ N = 5
1.107 ns (0 allocations: 0 bytes)
2.201 ns (0 allocations: 0 bytes)
1.982 ns (0 allocations: 0 bytes)
2.198 ns (0 allocations: 0 bytes)
julia> @inline function foo(a::AbstractArray{T,N}) where {T,N}
idx = ntuple(_ -> 1, Val(N))
@inbounds res = a[idx...]
res
end
foo (generic function with 1 method)
julia> for N in 1:5
a_array = randn(ntuple(_ -> 4, N)...)
a_stride = StrideArray(a_array)
a_ptr = PtrArray(pointer(a_array), size(a_array))
a_ptr_static = PtrArray(pointer(a_array), map(StaticInt, size(a_array)))
@info "New round" N
@btime foo($a_array)
@btime foo($a_stride)
@btime foo($a_ptr)
@btime foo($a_ptr_static)
end
┌ Info: New round
└ N = 1
1.103 ns (0 allocations: 0 bytes)
1.104 ns (0 allocations: 0 bytes)
1.104 ns (0 allocations: 0 bytes)
1.104 ns (0 allocations: 0 bytes)
┌ Info: New round
└ N = 2
1.106 ns (0 allocations: 0 bytes)
1.103 ns (0 allocations: 0 bytes)
1.105 ns (0 allocations: 0 bytes)
1.104 ns (0 allocations: 0 bytes)
┌ Info: New round
└ N = 3
1.108 ns (0 allocations: 0 bytes)
1.105 ns (0 allocations: 0 bytes)
1.106 ns (0 allocations: 0 bytes)
1.106 ns (0 allocations: 0 bytes)
┌ Info: New round
└ N = 4
1.103 ns (0 allocations: 0 bytes)
1.103 ns (0 allocations: 0 bytes)
1.104 ns (0 allocations: 0 bytes)
1.106 ns (0 allocations: 0 bytes)
┌ Info: New round
└ N = 5
1.104 ns (0 allocations: 0 bytes)
1.109 ns (0 allocations: 0 bytes)
1.107 ns (0 allocations: 0 bytes)
1.108 ns (0 allocations: 0 bytes) The problem is that following that commit, The current version might generate better code in some cases: julia> N = 5;
julia> a_array = randn(ntuple(_ -> 4, N)...); a_stride = StrideArray(a_array);
julia> @inline inbounds_getindex(A,i::Vararg{Any,K}) where {K} = @inbounds A[i...]
julia> @btime ntuple(@inline(k -> ntuple(@inline(i -> ntuple(@inline(j -> inbounds_getindex($a_array,j,i,k,1,1)),Val(4))),Val(4))),Val(4))
6.501 ns (0 allocations: 0 bytes)
(((-1.0845134036492494, 0.39319992091195866, -0.8895077857106304, -0.964892931202095), (2.0334187989577157, -2.35024835499007, -0.6695110129668518, 0.23273920015432767), (-1.5995251412408744, 0.48350248524295886, 1.1584175201853102, 1.2967183832634657), (0.6038235462101194, -0.8909856309920922, -0.052482763971459415, -0.5636245121330408)), ((-0.1323632823158482, 0.43906713631489525, 2.041036278214374, 1.333462654283023), (-0.590749444206323, -0.5638287312934376, -0.1600949086014227, -1.0773625948057042), (1.9468834433378406, -0.7367494587056644, 0.3168111912464907, -1.2375796945559765), (0.5612651890807325, -0.3157157212029217, -0.577796542397977, -0.025728273365417796)), ((0.2421323617521182, -0.8504194026834689, -1.952109988993807, 0.4538025683624084), (0.4579110544376671, 0.6229132137433413, 0.6251117502074499, -1.307212183077493), (-1.1277874037846727, 1.1857956647834296, 0.3390898603722827, 0.3132461216912783), (-0.3606955803358298, 0.2860279843675183, -0.6161026349564189, 0.2388840752091055)), ((-1.510597148876941, 1.408674128168656, 0.15194308350850289, 2.032177662862577), (-0.04885375760515278, 1.9821423773144886, 1.3882563062679014, 0.24915632308057115), (-2.44791611171785, 0.6001244841414322, 1.2575221138980344, 0.38201615843991393), (-0.08420300440414022, -0.7635435445809656, 0.7436804735576459, -0.19661438632061876)))
julia> @btime ntuple(@inline(k -> ntuple(@inline(i -> ntuple(@inline(j -> inbounds_getindex($a_stride,j,i,k,1,1)),Val(4))),Val(4))),Val(4))
5.828 ns (0 allocations: 0 bytes)
(((-1.0845134036492494, 0.39319992091195866, -0.8895077857106304, -0.964892931202095), (2.0334187989577157, -2.35024835499007, -0.6695110129668518, 0.23273920015432767), (-1.5995251412408744, 0.48350248524295886, 1.1584175201853102, 1.2967183832634657), (0.6038235462101194, -0.8909856309920922, -0.052482763971459415, -0.5636245121330408)), ((-0.1323632823158482, 0.43906713631489525, 2.041036278214374, 1.333462654283023), (-0.590749444206323, -0.5638287312934376, -0.1600949086014227, -1.0773625948057042), (1.9468834433378406, -0.7367494587056644, 0.3168111912464907, -1.2375796945559765), (0.5612651890807325, -0.3157157212029217, -0.577796542397977, -0.025728273365417796)), ((0.2421323617521182, -0.8504194026834689, -1.952109988993807, 0.4538025683624084), (0.4579110544376671, 0.6229132137433413, 0.6251117502074499, -1.307212183077493), (-1.1277874037846727, 1.1857956647834296, 0.3390898603722827, 0.3132461216912783), (-0.3606955803358298, 0.2860279843675183, -0.6161026349564189, 0.2388840752091055)), ((-1.510597148876941, 1.408674128168656, 0.15194308350850289, 2.032177662862577), (-0.04885375760515278, 1.9821423773144886, 1.3882563062679014, 0.24915632308057115), (-2.44791611171785, 0.6001244841414322, 1.2575221138980344, 0.38201615843991393), (-0.08420300440414022, -0.7635435445809656, 0.7436804735576459, -0.19661438632061876))) But every other example I tried performed the same. A fix would be to make the indexing functions not use @inline function Base.getindex(A::AbstractStrideArray{S,D,T,K}, i::Vararg{Integer,K}) where {S,D,T,K}
b = preserve_buffer(A)
P = PtrArray(A)
GC.@preserve b begin
@boundscheck checkbounds(P, i...)
sp = stridedpointer(P) # note `strides(::AbstractStridedPointer)` returns strides in bytes, not elements
unsafe_load(pointer(sp) + sum(map(*,map(-,i,VectorizationBase.offsets(sp)), strides(sp))))
end
end |
Closing this as fixed by StrideArraysCore 0.1.6, which switches from julia> using VectorizationBase
julia> cmp_1(x,y) = x ≤ y - 1
cmp_1 (generic function with 1 method)
julia> cmp_2(x,y) = x ≤ VectorizationBase.vsub_fast(y, 1)
cmp_2 (generic function with 1 method) ; julia> @cl cmp_1(4,5)
define i8 @julia_cmp_1_962(i64 signext %0, i64 signext %1) #0 {
top:
%2 = add i64 %1, -1
%3 = icmp sge i64 %2, %0
%4 = zext i1 %3 to i8
ret i8 %4
} ; julia> @cl cmp_2(4,5)
define i8 @julia_cmp_2_964(i64 signext %0, i64 signext %1) #0 {
top:
%2 = icmp sgt i64 %1, %0
%3 = zext i1 %2 to i8
ret i8 %3
} It is illegal for LLVM to transform This is what the offending VectorizationBase commit enabled. The StrideArraysCore fix was to get rid of the llvmcalls. |
Thanks a lot for the quick response! Just adding a few more |
I tracked down performance regressions observed in trixi-framework/Trixi.jl#509 to the following problem. On AVX2 (and non-AVX512) systems, the commit e2b6ccb "Fast integer ops shouldn't wrap" in VectorizationBase resulted in significant performance regressions when loading data from a
PtrArray
viagetindex
.Using Julia v1.6.1 on an Intel i7 8700K yields the following results for a minimal working example. I'm using StrideArrays v0.1.6 with StrideArraysCore v0.1.5 and LoopVectorization v0.12.12 with the diff
in StrideArrays to be able to check different versions of VectorizationBase.
Last good commit
First bad commit
The text was updated successfully, but these errors were encountered: