SharedArray performance overhead #21957

Open · dextorious opened this issue May 19, 2017 · 9 comments
Labels: domain:parallelism Parallel or distributed computation

Comments

@dextorious commented May 19, 2017

So, prior to porting some code to SharedArrays, I wanted to see what the overhead of using this data structure would be for the parts of the codebase that remain single-threaded. It turned out to be vastly higher than I expected. Consider the following (rather contrived) code example:

function f!(A::AbstractMatrix{Float64}, B::AbstractMatrix{Float64}, v::AbstractVector{Float64}, N::Integer)
  for i in 1 : N
    for j in 1 : N
      A[j,i] = v[i] * sqrt(A[j,i]^2 + B[j,i]^2)
    end
  end
  A
end

Using 1000x1000 random matrices for A and B and a 1000-element random vector for v, @benchmark yields 4.525 ms on my machine (median from 1071 samples). Copying the same data to SharedArrays and calling the same function yields 7.361 ms, so an overhead of 1.63x.
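For reference, the setup looks roughly like this (a sketch; the SA/SB/SV names are illustrative and the constructor syntax shown is the v0.6 style, which differs slightly across Julia versions):

using BenchmarkTools

A = rand(1000, 1000); B = rand(1000, 1000); v = rand(1000)

# Copy the same data into shared memory.
SA = SharedArray{Float64}(1000, 1000); copy!(SA, A)
SB = SharedArray{Float64}(1000, 1000); copy!(SB, B)
SV = SharedArray{Float64}(1000);       copy!(SV, v)

@benchmark f!($A, $B, $v, 1000)      # regular arrays: ~4.5 ms median
@benchmark f!($SA, $SB, $SV, 1000)   # shared arrays:  ~7.4 ms median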

This gets weirder still. The original code I tested (probably too large to paste here) ran for ~90 ms using normal arrays (also 1000x1000). Transferring the same data to SharedArrays produced an overhead of ~3.3x. I would have expected the overhead to be smaller for a longer running calculation, not considerably larger.

Is this a known issue? Is there a way to decrease the overhead or are SharedArrays this slow for a good reason?

EDIT: in case it matters, this was all tested on v0.5.1, on Windows 10 x64.

@amitmurthy (Contributor)

I see the same timing difference on master.

The slowdown disappears if we pass the mapped array directly, i.e., f!(SA.s, SB.s, SV.s, 500). Hence the slowdown must be in the getindex/setindex! definitions (reproduced below):

getindex(S::SharedArray, i::Real) = getindex(S.s, i)
setindex!(S::SharedArray, x, i::Real) = setindex!(S.s, x, i)

Any ideas as to what is causing the overhead?

@amitmurthy (Contributor)

A similar case was previously reported and corrected: #17060. In that case the underlying array was not being initialized at construction time, hence the slowdown.

@KristofferC (Member) commented May 19, 2017

Perhaps #9755? Manually hoisting the .s fields is exactly what had to be done in the sparse matrix code before SparseMatrixCSC was made immutable. SharedArrays are mutable.
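For readers unfamiliar with the pattern, "hoisting" means reading the field into a local once, outside the hot loop, so the compiler does not have to reload it on every access. A minimal sketch (illustrative, not the actual sparse-matrix code):

function g!(S::SharedArray{Float64,2})
  A = S.s                 # hoist the field load out of the loop
  for i in eachindex(A)
    A[i] = 2 * A[i]       # all further accesses go through the plain Array
  end
  S
end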

@amitmurthy (Contributor)

I think I found the cause in the context of this particular issue.

Defining additional indexing methods helps.

With

Base.getindex(S::SharedArray, i::Real...) = getindex(S.s, i...)
Base.setindex!(S::SharedArray, x, i::Real...) = setindex!(S.s, x, i...)

defined, timings are similar for regular arrays and shared arrays.

@timholy (Member) commented May 19, 2017

If those really solve it, they are essentially a band-aid; I think @KristofferC has probably identified the root cause. Is there any reason to keep SharedArrays mutable? For example, is pidx ever going to change after construction is finished? I know we mutate things now at the tail end of construction, but is that necessary, or just how things happen to be written?

@amitmurthy (Contributor)

> I think @KristofferC has probably identified the root cause

In general, maybe; however, I don't think it is the root cause for this issue. Mutability/immutability does not seem to make a difference with the test code below.

function f!(A::AbstractMatrix{Float64}, B::AbstractMatrix{Float64}, v::AbstractVector{Float64}, N::Integer)
  for i in 1 : N
    for j in 1 : N
      A[j,i] = v[i] * sqrt(A[j,i]^2 + B[j,i]^2)
    end
  end
  A
end

const N = 1000
const RA = rand(N,N);
const RB = rand(N,N);
const RV = rand(N);


f!(RA,RB,RV, N);

mutable struct MutS{T,N} <: DenseArray{T,N}
  s::Array{T,N}
end

Base.getindex(S::MutS, i::Real) = getindex(S.s, i)
Base.setindex!(S::MutS, x, i::Real) = setindex!(S.s, x, i)
Base.getindex(S::MutS, i::Real...) = getindex(S.s, i...)
Base.setindex!(S::MutS, x, i::Real...) = setindex!(S.s, x, i...)

const MA = MutS(rand(N,N));
const MB = MutS(rand(N,N));
const MV = MutS(rand(N));

f!(MA,MB,MV,N);

struct ImmutS{T,N} <: DenseArray{T,N}
  s::Array{T,N}
end

Base.getindex(S::ImmutS, i::Real) = getindex(S.s, i)
Base.setindex!(S::ImmutS, x, i::Real) = setindex!(S.s, x, i)
Base.getindex(S::ImmutS, i::Real...) = getindex(S.s, i...)
Base.setindex!(S::ImmutS, x, i::Real...) = setindex!(S.s, x, i...)

const IMA = ImmutS(rand(N,N));
const IMB = ImmutS(rand(N,N));
const IMV = ImmutS(rand(N));

f!(IMA,IMB,IMV,N);

@time f!(RA,RB, RV, N);
@time f!(MA,MB,MV,N);
@time f!(IMA,IMB,IMV,N);

julia> @time f!(RA,RB, RV, N);
  0.004637 seconds (83 allocations: 6.092 KiB)

julia> @time f!(MA,MB,MV,N);
  0.004843 seconds (4 allocations: 160 bytes)

julia> @time f!(IMA,IMB,IMV,N);
  0.004586 seconds (4 allocations: 160 bytes)

One other thing I observed: without the additional indexing methods, MutS/ImmutS threw "indexing not defined for ImmutS{Float64,2}", while SharedArray works fine without them, albeit a bit slower.

@timholy (Member) commented May 19, 2017

It's because you need to define IndexStyle if you want to support linear indexing. If you're defining the multiple-input version, use the i::Vararg{Real,N} syntax. And add @inline.

What happens if you use @inbounds in the loop? This seems more likely to be a bounds-checking thing, because of access to the (mutable) dims field.
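Put together, the suggestions would look roughly like this, applied to the MutS wrapper above (a sketch; the exact signatures are illustrative and assume Julia 0.6-era syntax):

# Opt the wrapper in to fast linear indexing.
Base.IndexStyle(::Type{<:MutS}) = IndexLinear()

@inline Base.getindex(S::MutS, i::Int) = getindex(S.s, i)
@inline Base.setindex!(S::MutS, x, i::Int) = setindex!(S.s, x, i)

# N-dimensional indexing via Vararg, so the argument count is known to the compiler.
@inline Base.getindex(S::MutS{T,N}, I::Vararg{Real,N}) where {T,N} = getindex(S.s, I...)
@inline Base.setindex!(S::MutS{T,N}, x, I::Vararg{Real,N}) where {T,N} = setindex!(S.s, x, I...)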

@amitmurthy (Contributor)

With @inbounds, everything speeds up equally.

Shared Arrays without additional indexing methods defined:

julia> @time f!(SA,SB,SV, N);
  0.004548 seconds (4 allocations: 160 bytes)

Shared Arrays with additional indexing methods defined:

julia> @time f!(SA,SB,SV, N);
  0.003382 seconds (4 allocations: 160 bytes)

Regular arrays:

julia> @time f!(A,B,V, N);
  0.003525 seconds (4 allocations: 160 bytes)

@biona001 commented Aug 21, 2018

Here is the complete code that does everything. It took me a while to put all the pieces together, so I might as well post it. :)

  1. Define the test function to measure the overhead of shared arrays (note the @inbounds):
using BenchmarkTools
srand(1914)

function f!(A::AbstractMatrix{Float64}, B::AbstractMatrix{Float64}, v::AbstractVector{Float64}, N::Int64)
  @inbounds for i in 1:N
    for j in 1:N
      A[j,i] = v[i] * sqrt(A[j,i]^2 + B[j,i]^2)
    end
  end
  A
end
  2. Define indexing methods that forward to S.s, marked @inline:
@inline function Base.getindex(S::SharedArray, i::Real) 
	getindex(S.s, i)
end
@inline function Base.setindex!(S::SharedArray, x, i::Real) 
	setindex!(S.s, x, i)
end
@inline function Base.getindex(S::SharedArray, i::Real...) 
	getindex(S.s, i...)
end
@inline function Base.setindex!(S::SharedArray, x, i::Real...) 
	setindex!(S.s, x, i...)
end
  3. Initialize random arrays and shared arrays:
const N = 1000
const RA = rand(N,N)
const RB = rand(N,N)
const RV = rand(N)

sa = SharedArray{Float64}((N, N), init = s -> s[localindexes(s)] = rand(length(localindexes(s))))
sb = SharedArray{Float64}((N, N), init = s -> s[localindexes(s)] = rand(length(localindexes(s))))
sv = SharedArray{Float64}((N))
  4. See the performance:
julia> @benchmark f!(RA,RB,RV,N)    #normal arrays
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.926 ms (0.00% GC)
  median time:      2.063 ms (0.00% GC)
  mean time:        2.136 ms (0.00% GC)
  maximum time:     5.971 ms (0.00% GC)
  --------------
  samples:          2326
  evals/sample:     1

julia> @benchmark f!(sa, sb, sv, N) #shared arrays
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.940 ms (0.00% GC)
  median time:      1.972 ms (0.00% GC)
  mean time:        2.021 ms (0.00% GC)
  maximum time:     3.259 ms (0.00% GC)
  --------------
  samples:          2452
  evals/sample:     1

@brenhinkeller brenhinkeller added the domain:parallelism Parallel or distributed computation label Nov 21, 2022