SharedArray performance overhead #21957

Open · dextorious opened this issue May 19, 2017 · 9 comments
Labels: domain:parallelism Parallel or distributed computation

Comments

@dextorious commented May 19, 2017

So, prior to porting some code to SharedArrays, I wanted to see what the overhead of using this data structure would be for the parts of the codebase that remain single-threaded. It turned out to be vastly higher than I expected. Consider the following (rather contrived) code example:

function f!(A::AbstractMatrix{Float64}, B::AbstractMatrix{Float64}, v::AbstractVector{Float64}, N::Integer)
  for i in 1 : N
    for j in 1 : N
      A[j,i] = v[i] * sqrt(A[j,i]^2 + B[j,i]^2)
    end
  end
  A
end

Using 1000x1000 random matrices for A and B and a 1000-element random vector for v, @benchmark yields 4.525 ms on my machine (median from 1071 samples). Copying the same data to SharedArrays and calling the same function yields 7.361 ms, so an overhead of 1.63x.
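For reference, the setup looks roughly like this (a sketch; the SA/SB/SV names are illustrative and the constructor syntax shown is the v0.6 style, which differs slightly across Julia versions):

using BenchmarkTools

A = rand(1000, 1000); B = rand(1000, 1000); v = rand(1000)

# Copy the same data into shared memory.
SA = SharedArray{Float64}(1000, 1000); copy!(SA, A)
SB = SharedArray{Float64}(1000, 1000); copy!(SB, B)
SV = SharedArray{Float64}(1000);       copy!(SV, v)

@benchmark f!($A, $B, $v, 1000)      # regular arrays: ~4.5 ms median
@benchmark f!($SA, $SB, $SV, 1000)   # shared arrays:  ~7.4 ms median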

This gets weirder still. The original code I tested (probably too large to paste here) ran for ~90 ms using normal arrays (also 1000x1000). Transferring the same data to SharedArrays produced an overhead of ~3.3x. I would have expected the overhead to be smaller for a longer running calculation, not considerably larger.

Is this a known issue? Is there a way to decrease the overhead or are SharedArrays this slow for a good reason?

EDIT: in case it matters, this was all tested on v0.5.1, on Windows 10 x64.

@amitmurthy (Contributor)

I see the same timing difference on master.

The slowdown disappears if we pass the mapped array directly, i.e., f!(SA.s, SB.s, SV.s, 500). Hence the slowdown must be in the getindex/setindex! definitions (reproduced below):

getindex(S::SharedArray, i::Real) = getindex(S.s, i)
setindex!(S::SharedArray, x, i::Real) = setindex!(S.s, x, i)

Any ideas as to what is causing the overhead?

@amitmurthy (Contributor)

A similar case was previously reported and corrected: #17060. In that case the underlying array was not being initialized at construction time, hence the slowdown.

@KristofferC (Member) commented May 19, 2017

Perhaps #9755? Manually hoisting the .s fields is exactly what had to be done in the sparse matrix code before SparseMatrixCSC was made immutable. SharedArrays are mutable.
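For readers unfamiliar with the pattern, "hoisting" means reading the field into a local once, outside the hot loop, so the compiler does not have to reload it on every access. A minimal sketch (illustrative, not the actual sparse-matrix code):

function g!(S::SharedArray{Float64,2})
  A = S.s                 # hoist the field load out of the loop
  for i in eachindex(A)
    A[i] = 2 * A[i]       # all further accesses go through the plain Array
  end
  S
end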

@amitmurthy (Contributor)

I think I found the cause in the context of this particular issue.

Defining additional indexing methods helps.

With

Base.getindex(S::SharedArray, i::Real...) = getindex(S.s, i...)
Base.setindex!(S::SharedArray, x, i::Real...) = setindex!(S.s, x, i...)

defined, timings are similar for regular arrays and shared arrays.

@timholy (Member) commented May 19, 2017

If those really solve it, they are essentially a band-aid; I think @KristofferC has probably identified the root cause. Is there any reason to keep SharedArrays mutable? For example, is pidx ever going to change after construction is finished? I know we mutate things now at the tail end of construction, but is that necessary, or just how things happen to be written?

@amitmurthy (Contributor)

> I think @KristofferC has probably identified the root cause

In general, maybe; however, I don't think it is the root cause for this issue. Mutability/immutability does not seem to make a difference with the test code below.

function f!(A::AbstractMatrix{Float64}, B::AbstractMatrix{Float64}, v::AbstractVector{Float64}, N::Integer)
  for i in 1 : N
    for j in 1 : N
      A[j,i] = v[i] * sqrt(A[j,i]^2 + B[j,i]^2)
    end
  end
  A
end

const N = 1000
const RA = rand(N,N);
const RB = rand(N,N);
const RV = rand(N);


f!(RA,RB,RV, N);

mutable struct MutS{T,N} <: DenseArray{T,N}
  s::Array{T,N}
end

Base.getindex(S::MutS, i::Real) = getindex(S.s, i)
Base.setindex!(S::MutS, x, i::Real) = setindex!(S.s, x, i)
Base.getindex(S::MutS, i::Real...) = getindex(S.s, i...)
Base.setindex!(S::MutS, x, i::Real...) = setindex!(S.s, x, i...)

const MA = MutS(rand(N,N));
const MB = MutS(rand(N,N));
const MV = MutS(rand(N));

f!(MA,MB,MV,N);

struct ImmutS{T,N} <: DenseArray{T,N}
  s::Array{T,N}
end

Base.getindex(S::ImmutS, i::Real) = getindex(S.s, i)
Base.setindex!(S::ImmutS, x, i::Real) = setindex!(S.s, x, i)
Base.getindex(S::ImmutS, i::Real...) = getindex(S.s, i...)
Base.setindex!(S::ImmutS, x, i::Real...) = setindex!(S.s, x, i...)

const IMA = ImmutS(rand(N,N));
const IMB = ImmutS(rand(N,N));
const IMV = ImmutS(rand(N));

f!(IMA,IMB,IMV,N);

@time f!(RA,RB, RV, N);
@time f!(MA,MB,MV,N);
@time f!(IMA,IMB,IMV,N);

julia> @time f!(RA,RB, RV, N);
  0.004637 seconds (83 allocations: 6.092 KiB)

julia> @time f!(MA,MB,MV,N);
  0.004843 seconds (4 allocations: 160 bytes)

julia> @time f!(IMA,IMB,IMV,N);
  0.004586 seconds (4 allocations: 160 bytes)

One other thing I observed: without the additional indexing methods, MutS/ImmutS threw "indexing not defined for ImmutS{Float64,2}", while SharedArray works fine without them, albeit a bit slower.

@timholy (Member) commented May 19, 2017

It's because you need to define IndexStyle if you want to support linear indexing. If you're defining the multiple-input version, use the i::Vararg{Real,N} syntax. And add @inline.

What happens if you use @inbounds in the loop? This seems more likely to be a bounds-checking thing, because of access to the (mutable) dims field.
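Put together, the suggestions would look roughly like this, applied to the MutS wrapper above (a sketch; the exact signatures are illustrative and assume Julia 0.6-era syntax):

# Opt the wrapper in to fast linear indexing.
Base.IndexStyle(::Type{<:MutS}) = IndexLinear()

@inline Base.getindex(S::MutS, i::Int) = getindex(S.s, i)
@inline Base.setindex!(S::MutS, x, i::Int) = setindex!(S.s, x, i)

# N-dimensional indexing via Vararg, so the argument count is known to the compiler.
@inline Base.getindex(S::MutS{T,N}, I::Vararg{Real,N}) where {T,N} = getindex(S.s, I...)
@inline Base.setindex!(S::MutS{T,N}, x, I::Vararg{Real,N}) where {T,N} = setindex!(S.s, x, I...)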

@amitmurthy (Contributor)

With @inbounds, everything speeds up equally.

Shared Arrays without additional indexing methods defined:

julia> @time f!(SA,SB,SV, N);
  0.004548 seconds (4 allocations: 160 bytes)

Shared Arrays with additional indexing methods defined:

julia> @time f!(SA,SB,SV, N);
  0.003382 seconds (4 allocations: 160 bytes)

Regular arrays:

julia> @time f!(A,B,V, N);
  0.003525 seconds (4 allocations: 160 bytes)

@biona001 commented Aug 21, 2018

Here is the complete code that does everything. It took me a while to put all the pieces together, so I might as well post it. :)

  1. Define the test function to measure the overhead of shared arrays (note the @inbounds):
using BenchmarkTools
srand(1914)

function f!(A::AbstractMatrix{Float64}, B::AbstractMatrix{Float64}, v::AbstractVector{Float64}, N::Int64)
  @inbounds for i in 1:N
    for j in 1:N
      A[j,i] = v[i] * sqrt(A[j,i]^2 + B[j,i]^2)
    end
  end
  A
end
  2. Define indexing methods that forward to S.s, marked @inline:
@inline function Base.getindex(S::SharedArray, i::Real) 
	getindex(S.s, i)
end
@inline function Base.setindex!(S::SharedArray, x, i::Real) 
	setindex!(S.s, x, i)
end
@inline function Base.getindex(S::SharedArray, i::Real...) 
	getindex(S.s, i...)
end
@inline function Base.setindex!(S::SharedArray, x, i::Real...) 
	setindex!(S.s, x, i...)
end
  3. Initialize random arrays and shared arrays:
const N = 1000
const RA = rand(N,N)
const RB = rand(N,N)
const RV = rand(N)

sa = SharedArray{Float64}((N, N), init = s -> s[localindexes(s)] = rand(length(localindexes(s))))
sb = SharedArray{Float64}((N, N), init = s -> s[localindexes(s)] = rand(length(localindexes(s))))
sv = SharedArray{Float64}((N))
  4. See the performance:
julia> @benchmark f!(RA,RB,RV,N)    #normal arrays
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.926 ms (0.00% GC)
  median time:      2.063 ms (0.00% GC)
  mean time:        2.136 ms (0.00% GC)
  maximum time:     5.971 ms (0.00% GC)
  --------------
  samples:          2326
  evals/sample:     1

julia> @benchmark f!(sa, sb, sv, N) #shared arrays
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.940 ms (0.00% GC)
  median time:      1.972 ms (0.00% GC)
  mean time:        2.021 ms (0.00% GC)
  maximum time:     3.259 ms (0.00% GC)
  --------------
  samples:          2452
  evals/sample:     1

@brenhinkeller brenhinkeller added the domain:parallelism Parallel or distributed computation label Nov 21, 2022