
Retrieving elements from an inhomogeneous tuple is slow #19850

Closed
wsshin opened this issue Jan 4, 2017 · 10 comments
Labels
compiler:codegen Generation of LLVM IR and native code performance Must go faster

Comments

@wsshin
Contributor

wsshin commented Jan 4, 2017

Consider the following function:

julia> function tuple_elems_slow(t::NTuple{3,Real})
     const i1 = 1; const i2 = 2; const i3 = 3
     t1, t2, t3 = t[i1], t[i2], t[i3]
end

The performance of this function varies greatly depending on whether t is homogeneous (e.g., t = (0,0,0) with all Int64 elements) or inhomogeneous (e.g., t = (0.,0,0) with the first element being Float64). When t is inhomogeneous, the function incurs extra allocations:

julia> using BenchmarkTools

julia> @benchmark tuple_elems_slow((0,0,0))
BenchmarkTools.Trial:
  memory estimate:  0.00 bytes
  allocs estimate:  0
  median time:      5.533 ns (0.00% GC)

julia> @benchmark tuple_elems_slow((0.,0,0))
BenchmarkTools.Trial:
  memory estimate:  112.00 bytes
  allocs estimate:  4
  median time:      116.735 ns (0.00% GC)

This is not caused by type instability: @code_warntype shows that the function is type-stable for both calls.

In comparison, if integer literals are used to index the tuple elements, the function does not experience such performance degradation even when the tuple is inhomogeneous:

julia> function tuple_elems_fast(t::NTuple{3,Real})
     t1, t2, t3 = t[1], t[2], t[3]
end
tuple_elems_fast (generic function with 1 method)

julia> @benchmark tuple_elems_fast((0.,0,0))
BenchmarkTools.Trial:
  memory estimate:  0.00 bytes
  allocs estimate:  0
  median time:      5.581 ns (0.00% GC)

julia> @benchmark tuple_elems_fast((0,0,0))
BenchmarkTools.Trial:
  memory estimate:  0.00 bytes
  allocs estimate:  0
  median time:      5.537 ns (0.00% GC)
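A common workaround (a sketch, not something suggested in this thread) is to lift the index into the type domain with Val, so that each call specializes on the index just as it would on a literal. The helper name getval is hypothetical, not a Base function, and the snippet assumes a modern Julia (0.7+) where Val(i) is a constructor:

```julia
# Hypothetical helper: the index i becomes a type parameter, so the
# compiler sees it as a compile-time constant, mirroring the literal-index
# fast path above.
getval(t::Tuple, ::Val{i}) where {i} = t[i]

t = (0.0, 0, 0)
x = getval(t, Val(1))   # specializes on i == 1; inferred as Float64
```

The trade-off is that a new method specialization is compiled per index, so this only pays off when the set of indices is small and known.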

Here is the versioninfo() result:

julia> versioninfo()
Julia Version 0.6.0-dev.1877
Commit ea73c84* (2017-01-03 20:54 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin16.0.0)
  CPU: Intel(R) Core(TM)2 Duo CPU     T9900  @ 3.06GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Penryn)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.9.1 (ORCJIT, penryn)

I could be wrong, but I believe this problem was introduced recently; I don't recall seeing it on the master branch as recently as about a month ago.

@JeffBezanson
Sponsor Member

Thanks. Would you be willing to bisect this to see if it regressed in the past month or so?

@giordano
Contributor

giordano commented Jan 4, 2017

Julia 0.5 shows the same behavior:

julia> @benchmark tuple_elems_slow((0.,0,0))
BenchmarkTools.Trial: 
  memory estimate:  112.00 bytes
  allocs estimate:  4
  --------------
  minimum time:     72.274 ns (0.00% GC)
  median time:      75.505 ns (0.00% GC)
  mean time:        89.214 ns (13.57% GC)
  maximum time:     2.915 μs (96.81% GC)
  --------------
  samples:          10000
  evals/sample:     974
  time tolerance:   5.00%
  memory tolerance: 1.00%

julia> @benchmark tuple_elems_slow((0,0,0))
BenchmarkTools.Trial: 
  memory estimate:  0.00 bytes
  allocs estimate:  0
  --------------
  minimum time:     3.271 ns (0.00% GC)
  median time:      3.375 ns (0.00% GC)
  mean time:        3.389 ns (0.00% GC)
  maximum time:     15.121 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1000
  time tolerance:   5.00%
  memory tolerance: 1.00%

Hope this helps.

@JeffBezanson
Sponsor Member

OK, I can understand why this might be slow, but I'm definitely concerned if it regressed recently.

@andyferris
Member

Why "might" this be slow? If it is type-stable I don't see why it would be allocating.

julia> @code_warntype tuple_elems_slow((0.0,0,0))
Variables:
  #self#::#tuple_elems_slow
  t::Tuple{Float64,Int64,Int64}
  i1::Int64
  i2::Int64
  i3::Int64
  t1::Float64
  t2::Int64
  t3::Int64

Body:
  begin 
      i1::Int64 = 1 # line 2:
      i2::Int64 = 2 # line 2:
      i3::Int64 = 3 # line 3:
      SSAValue(0) = (Base.getfield)(t::Tuple{Float64,Int64,Int64},i1::Int64)::Float64
      SSAValue(1) = (Base.getfield)(t::Tuple{Float64,Int64,Int64},i2::Int64)::Int64
      SSAValue(2) = (Base.getfield)(t::Tuple{Float64,Int64,Int64},i3::Int64)::Int64
      t1::Float64 = SSAValue(0)
      t2::Int64 = SSAValue(1)
      t3::Int64 = SSAValue(2)
      return (Core.tuple)(SSAValue(0),SSAValue(1),SSAValue(2))::Tuple{Float64,Int64,Int64}
  end::Tuple{Float64,Int64,Int64}

@wsshin
Contributor Author

wsshin commented Jan 4, 2017

@JeffBezanson Since @giordano reports the same behavior on v0.5, my memory may simply be wrong and the problem may have existed all along. I'll test the Julia version where I remember the problem being absent and report back.

@wsshin
Contributor Author

wsshin commented Jan 4, 2017

@JeffBezanson My memory turns out to be wrong. The problem must have existed for quite a while.

I used to use Julia v0.5, but picked up the v0.6 master branch when the "Vectorization Roadmap" (#16285) was completed, which corresponds to the commit 99b6a8c. That's also when I wrote some functions like the above in my project.

I don't remember seeing any significant performance difference between homogeneous and inhomogeneous tuples back then, so I had an impression that the commit didn't have the problem. Maybe I was passing only homogeneous tuples to the functions.

Anyway, I just built Julia fresh with the commit 99b6a8c and ran the same tests, but the problem didn't go away. Together with @giordano's report of the same problem on v0.5, I think it is safe to assume that the problem wasn't introduced recently, but has been there for quite a while.

Sorry about the false information; I hope this helps solve the problem!

@JeffBezanson JeffBezanson added compiler:codegen Generation of LLVM IR and native code performance Must go faster labels Jan 4, 2017
@chethega
Contributor

What happens is that Julia actually makes a copy of the tuple for the access. I don't understand why codegen decides to do this, but it makes retrieving an element O(N) instead of O(1).

using BenchmarkTools
f(x, i) = x[i];
for N in (2, 10, 100, 1000, 10_000)
    nnta = Any[i for i in 1:N]; nnta[2] = 1.0; nnt = (nnta...,)
    println("tuple with $N")
    @btime f($nnt, $1)
end

######

tuple with 2
  31.694 ns (1 allocation: 32 bytes)
tuple with 10
  36.790 ns (1 allocation: 96 bytes)
tuple with 100
  115.433 ns (1 allocation: 896 bytes)
tuple with 1000
  802.000 ns (1 allocation: 7.88 KiB)
tuple with 10000
  6.995 μs (1 allocation: 78.19 KiB)

@KristofferC
Sponsor Member

Ref: #24596, #15274, #15482

@ChrisRackauckas
Member

ChrisRackauckas commented Aug 25, 2018

It looks like it's fixed in v1.0?

using BenchmarkTools

function tuple_elems_slow(t::NTuple{3,Real})
     i1 = 1; i2 = 2; i3 = 3
     t1, t2, t3 = t[i1], t[i2], t[i3]
end

@btime tuple_elems_slow((0,0,0)) # 1.170 ns (0 allocations: 0 bytes)
@btime tuple_elems_slow((0.0,0,0)) # 1.463 ns (0 allocations: 0 bytes)  

Seems like constant propagation of literals (#24011) may have done it (though that is slightly different)?
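To make the constant-propagation explanation concrete, here is a minimal sketch (the names getelem and first_elem are mine, for illustration; assumes Julia 1.0+): a literal index that flows through a wrapper call still lets the compiler infer a concrete element type from a heterogeneous tuple, so no boxing or allocation is needed.

```julia
# getelem's index argument is not a literal at its own call site, but the
# literal 1 in first_elem propagates into it, so t[i] is resolved at
# compile time and the return type is inferred as Float64.
getelem(t, i) = t[i]
first_elem(t) = getelem(t, 1)

t = (0.0, 0, 0)
first_elem(t)
```

This matches the original observation that tuple_elems_fast (literal indices) was always fast; constant propagation extends the same fast path to indices bound to local constants.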

@KristofferC
Sponsor Member

Yep, looks fixed.

7 participants