Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

need efficient reinterpret for extracting array elements #31305

Open
ExpandingMan opened this issue Mar 9, 2019 · 6 comments
Open

need efficient reinterpret for extracting array elements #31305

ExpandingMan opened this issue Mar 9, 2019 · 6 comments
Labels
arrays [a, r, r, a, y, s] performance Must go faster

Comments

@ExpandingMan
Copy link
Contributor

The performance of reinterpret is impressive, however there is still one bit of functionality (which is sometimes needed for efficient random-access IO) which is a little slow. Extracting individual elements from a buffer using reinterpret is only as fast as using pointers if the size of the eltype of the buffer matches the size of the type being converted to.

See this example:

using BenchmarkTools

v = zeros(UInt8, 128)

function unsafe(::Type{T}, v::Vector, i::Integer) where {T}
    p = convert(Ptr{T}, pointer(v))
    checkbounds(v, i)
    checkbounds(v, i+sizeof(T)-1)
    unsafe_load(p, i)
end

function safe(::Type{T}, v::Vector, i::Integer) where {T}
    first(reinterpret(T, @view v[i:(i+sizeof(T)-1)]))
end
julia> @btime unsafe(Int, v, 1);
  42.647 ns (0 allocations: 0 bytes)

julia> @btime safe(Int, v, 1);
  57.300 ns (2 allocations: 80 bytes)

There are a few things going on here. First, reinterpret can only be called on an entire array, not just a slice of it, so a view of the array must be allocated before reinterpret can even be called. Then, the return value is itself an array, so if you only want a single value, you need to index it (I think what's going on is that, since the returned array is itself a view which must be allocated, the allocation of both of these views are the two allocations seen in the benchmark).

It would be really nice if we can get a version of reinterpret which would allow one to extract individual elements without allocating. It would also be nice to have a reinterpret that takes views from the middle of an array rather than the start of it, however the performance of that is a little less crucial since you'd only be eliminating 1 allocation of a view for an entire array.

I'm not sure what that function would best be called or how to structure its arguments. Were I to make a PR for this, I'd probably put in a function a bit like my unsafe method here (which ought to be safe if properly bounds checked).

Thoughts?

@Keno
Copy link
Member

Keno commented Mar 9, 2019

There's two unrelated problems here:

julia> function safe(::Type{T}, v::Vector, i::Integer) where {T}
           @inbounds reinterpret(T, v)[fld1(i, sizeof(T))]
       end

julia> @btime safe(Int, v, 1)
  50.717 ns (0 allocations: 0 bytes)
0

julia> @btime unsafe(Int, v, 1)
  49.655 ns (0 allocations: 0 bytes)
0
  1. We allocate the ReinterpretArray despite it only being used in the error path causing an allocation.
  2. LLVM doesn't combine the Int8 sized loads into an Int size load.

@ExpandingMan
Copy link
Contributor Author

Are those problems realistically resolvable? If not, would it still make sense to define a new method in Base which is safe but non-allocating?

@KristofferC
Copy link
Sponsor Member

1 sounds similar to #29867 but it's not really scalable to add that "fix" to all array wrappers.

@Keno
Copy link
Member

Keno commented Mar 10, 2019

It is legal for the optimizer to move the allocation into the error path. We should figure out how to teach it to do that.

@maleadt
Copy link
Member

maleadt commented Mar 10, 2019

LLVM doesn't combine the Int8 sized loads into an Int size load.

Would that be legal? It's an issue with PTX too, where these 1-byte loads are very costly (hence a custom unsafe_load with alignment at https://github.com/JuliaGPU/CUDAnative.jl/blob/e28c5f07cd2dae78a41d0d67ea2dfd4b374a92fe/src/device/pointer.jl#L157).

@JeffBezanson JeffBezanson added performance Must go faster arrays [a, r, r, a, y, s] labels Mar 11, 2019
@chethega
Copy link
Contributor

LLVM doesn't combine the Int8 sized loads into an Int size load.

Would that be legal?

Combining in IR should definitely be legal. And the backend knows whether unaligned access is supported by the hardware (yes on x86, no on ptx). Since pointer(v) is only byte-aligned (e.g. after popfirst!), llvm probably must emit the slow code on ptx, but could do better on x86.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrays [a, r, r, a, y, s] performance Must go faster
Projects
None yet
Development

No branches or pull requests

6 participants