Multidimensional algorithms built on CartesianIndices seem to have significant performance overhead when applied to StaticArrays.MArrays, but not regular Arrays. One can recover performance by writing specific functions with nested loops for each dimension, or by leveraging the macros in Base.Cartesian to automate that process.
To demonstrate, consider an N-dimensional algorithm that fills the elements of a generic input array arr using a function f(i_1, ⋯, i_N) of the indices.
```julia
function fillfn_cartesianindices!(arr::AbstractArray, f::F) where F # force specialization
    @inbounds for I ∈ CartesianIndices(arr)
        arr[I] = f(I)
    end
    return arr
end
```
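For comparison, the first workaround mentioned above, writing a dimension-specific function with explicit nested loops, would look like this in the four-dimensional case. This is only an illustrative sketch; the name fillfn_nestedloops4d! is mine and is not used in the benchmarks below.
```julia
# Hypothetical hand-written 4-D version: one explicit loop per dimension,
# with the innermost loop over the first (fastest-varying) index.
function fillfn_nestedloops4d!(arr::AbstractArray{<:Any,4}, f::F) where F
    @inbounds for l ∈ axes(arr, 4), k ∈ axes(arr, 3), j ∈ axes(arr, 2), i ∈ axes(arr, 1)
        arr[i, j, k, l] = f(i, j, k, l)
    end
    return arr
end
```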
Here is a similar @generated function using the Base.Cartesian macros.
```julia
using Base.Cartesian
@generated function fillfn_basecartesian!(arr::AbstractArray{<:Any,N}, f::F) where {N,F}
    # Generates a function body like
    #     for i_N ∈ axes(arr, N), ..., i_1 ∈ axes(arr, 1)
    #         arr[i_1, ..., i_N] = f(i_1, ..., i_N)
    #     end
    quote
        @inbounds @nloops $N i arr begin
            (@nref $N arr i) = (@ncall $N f i)
        end
        return arr
    end
end
```
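To see the loop nest these macros actually emit, one can expand them by hand for the N = 4 case. This is just an illustrative check I am adding here; it is not part of the original benchmarks, and `arr` and `f` only need to exist as symbols for the expansion.
```julia
using Base.Cartesian
# Expanding the macros for N = 4 reproduces the nested loops sketched in the
# comment above (up to generated variable names).
@macroexpand @nloops 4 i arr begin
    (@nref 4 arr i) = (@ncall 4 f i)
end
```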
Finally, here is a function of the indices to feed to either of the above functions (for the four-dimensional case).
```julia
indexfunc4d(i, j, k, l) = (i==k) & (j==l) - (i==l) & (j==k)
@inline indexfunc4d(I::CartesianIndex{4}) = indexfunc4d(Tuple(I)...)
```
For a regular Array, using CartesianIndices in the 3×3×3×3 case has no performance overhead. (In fact, it is consistently 1% to 3% faster than using explicit loops on my machine.)
```julia
using BenchmarkTools
using StaticArrays
let dims = (3,3,3,3)
    T = eltype(indexfunc4d(map(_ -> 1, dims)...)) # Element type
    println("Using `CartesianIndices`")
    out1 = @btime(
        fillfn_cartesianindices!(A, $indexfunc4d),
        setup=(A = Array{$T}(undef, $dims))
    )
    println("Using `Base.Cartesian`")
    out2 = @btime(
        fillfn_basecartesian!(A, $indexfunc4d),
        setup=(A = Array{$T}(undef, $dims))
    )
    @assert out1 == out2
end;
```
```
Using `CartesianIndices`
101.466 ns (0 allocations: 0 bytes)
Using `Base.Cartesian`
104.874 ns (0 allocations: 0 bytes)
```
A similar benchmark on an MArray shows that fillfn_basecartesian! is a good deal faster than fillfn_cartesianindices!.
```julia
let SDims = NTuple{4,3}
    dims = fieldtypes(SDims)
    T = eltype(indexfunc4d(map(_ -> 1, dims)...)) # Element type
    println("Using `CartesianIndices`")
    out1 = @btime(
        fillfn_cartesianindices!(A, $indexfunc4d),
        setup=(A = MArray{$SDims,$T}(undef))
    )
    println("Using `Base.Cartesian`")
    out2 = @btime(
        fillfn_basecartesian!(A, $indexfunc4d),
        setup=(A = MArray{$SDims,$T}(undef))
    )
end;
```
```
Using `CartesianIndices`
88.504 ns (0 allocations: 0 bytes)
Using `Base.Cartesian`
14.276 ns (0 allocations: 0 bytes)
```
The result is the same for a SizedArray.
```julia
let SDims = NTuple{4,3}
    dims = fieldtypes(SDims)
    T = eltype(indexfunc4d(map(_ -> 1, dims)...)) # Element type
    println("Using `CartesianIndices`")
    out1 = @btime(
        fillfn_cartesianindices!(A, $indexfunc4d),
        setup=(A = SizedArray{$SDims,$T}(undef))
    )
    println("Using `Base.Cartesian`")
    out2 = @btime(
        fillfn_basecartesian!(A, $indexfunc4d),
        setup=(A = SizedArray{$SDims,$T}(undef))
    )
end;
```
```
Using `CartesianIndices`
87.301 ns (0 allocations: 0 bytes)
Using `Base.Cartesian`
13.225 ns (0 allocations: 0 bytes)
```
Note that this problem does not apply to SArray (since it is not mutable); using StaticArrays.sacollect with CartesianIndices is efficient.
```julia
@btime StaticArrays.sacollect(SArray{NTuple{4,3}},
    $indexfunc4d(I) for I ∈ CartesianIndices(ntuple(_ -> SOneTo(3), Val(4))));
@btime StaticArrays.sacollect(MArray{NTuple{4,3}},
    $indexfunc4d(I) for I ∈ CartesianIndices(ntuple(_ -> SOneTo(3), Val(4))));
```
```
10.907 ns (0 allocations: 0 bytes)
51.259 ns (1 allocation: 672 bytes)
```
However, that is a different use case from mutating the elements of a preallocated MArray or SizedArray in place.
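If the goal is only to fill a preallocated MArray, one possible bridge (purely a sketch on my part, not something benchmarked here) is to build an SArray with sacollect and broadcast it into the destination; fillfn_via_sacollect! is a hypothetical name.
```julia
# Hypothetical bridge, assuming the definitions above: construct an SArray via
# sacollect and copy it into the preallocated MArray.
using StaticArrays
function fillfn_via_sacollect!(arr::MArray{S,T}, f::F) where {S,T,F}
    arr .= StaticArrays.sacollect(SArray{S,T}, f(I) for I ∈ CartesianIndices(arr))
    return arr
end
```
That sidesteps rather than answers the underlying question, though.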
Is there some special support for CartesianIndices that would be needed here but is not implemented? Is it possible to implement? I still need to look into exactly how indexing with CartesianIndices works in each case.
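For anyone investigating, a first diagnostic pass might look like the following (my own sketch; nothing here is from the report itself): check which IndexStyle each array type advertises, then look for type instabilities or failed inlining in the Cartesian loop.
```julia
# Sketch of a first diagnostic pass, assuming the definitions above are loaded.
using StaticArrays, InteractiveUtils

A_m = MArray{NTuple{4,3},Int}(undef)
A_a = Array{Int,4}(undef, 3, 3, 3, 3)

# Both array types should report IndexLinear(), so indexing with a CartesianIndex
# involves converting it to a linear index first.
@show IndexStyle(typeof(A_m))
@show IndexStyle(typeof(A_a))

# Inspect the Cartesian-loop version for type instabilities or calls that fail to inline.
@code_warntype fillfn_cartesianindices!(A_m, indexfunc4d)
```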