`ntoh` / `bswap` are 10x slower when operating in-place on `reinterpret` array (#42227)
It should be able to be fixed by something like:

```julia
t1 = Base.@_gc_preserve_begin rawdata
real_data = unsafe_wrap(Array, pointer(reinterpret(Int32, rawdata)), 1_000_000÷4)
real_data .= ntoh.(real_data)
vov = VectorOfVectors(real_data, rawoffsets, ArraysOfArrays.no_consistency_checks)
Base.@_gc_preserve_end t1
return vov
```

but for some reason GC isn't happy...

---
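For what it's worth, here is a minimal sketch of the same idea using the public `GC.@preserve` instead of the internal `@_gc_preserve_begin`/`@_gc_preserve_end` macros (the `VectorOfVectors` part is left out, and this is my own sketch, not code from this thread). `unsafe_wrap` does not root `rawdata`, so the wrapped array must not escape the `preserve` block, which may be exactly why GC wasn't happy above:

```julia
# Byte-swap a raw byte buffer in place through an Int32 view created with
# `unsafe_wrap`. The view is a plain Array, so the broadcast vectorizes.
rawdata = rand(UInt8, 1_000_000)
GC.@preserve rawdata begin
    real_data = unsafe_wrap(Array, convert(Ptr{Int32}, pointer(rawdata)), 1_000_000 ÷ 4)
    real_data .= ntoh.(real_data)
end
```

The caveat is that `real_data` must not be used after the `GC.@preserve` block ends.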
The inefficiency comes from indexing into the `reinterpret`ed array being slower than indexing into a normal `Array`.

---
I'm thinking there's a (theoretical) shortcut, because an in-place `ntoh` doesn't actually need to care about how the underlying bytes are typed.

---
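To illustrate that shortcut: reversing each 4-byte group of the raw buffer is exactly an elementwise `bswap` of the `Int32` view of the same memory, with no `ReinterpretArray` indexing involved. A sketch (the function name is mine, not an API from this thread):

```julia
# Reverse each 4-byte group in place; equivalent to `v .= bswap.(v)` on the
# Int32 view of the same memory, using only plain Vector{UInt8} indexing.
function bswap32_bytes!(raw::AbstractVector{UInt8})
    @assert length(raw) % 4 == 0
    @inbounds for i in 1:4:length(raw)
        raw[i],   raw[i+3] = raw[i+3], raw[i]
        raw[i+1], raw[i+2] = raw[i+2], raw[i+1]
    end
    return raw
end
```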
I now see a related previous discussion: #25014

---
It may not even have anything to do directly with that issue; instead it looks like a SIMD vectorization issue.

```julia
julia> function run_ntoh2!(y, n=1000) # easier to profile or `@code_llvm` without all the tuning & inference with `@btime`
           for i = 1:n
               @inbounds @simd for j in eachindex(y)
                   y[j] = ntoh(y[j])
               end
           end
           return y
       end
run_ntoh2! (generic function with 2 methods)

julia> x = rand(Int32, 1_000_000÷4);

julia> y = reinterpret(reshape, Int32, rand(UInt8, 4, 1_000_000÷4));

julia> @btime run_ntoh2!(x);
  22.733 ms (0 allocations: 0 bytes)

julia> @btime run_ntoh2!(y);
  122.921 ms (1 allocation: 32 bytes)

julia> z = reinterpret(Int32, rand(UInt8, 1_000_000));

julia> @btime run_ntoh2!(z);
  276.921 ms (1 allocation: 32 bytes)
```

Interestingly, the difference between the two reinterpreted versions (`y` vs `z`) is large as well. And trying LoopVectorization's `@turbo` instead currently fails:

```
ERROR: MethodError: no method matching bswap(::VectorizationBase.Vec{8, Int32})
```

---
I also don't understand why `y` is so much faster than `z`.

---
Fixed: JuliaSIMD/VectorizationBase.jl@5636185

Now:

```julia
julia> using LoopVectorization
[ Info: Precompiling LoopVectorization [bdcacae8-1622-11e9-2a5c-532679323890]

julia> function run_ntoh3!(y, n=1000) # easier to profile or `@code_llvm` without all the tuning & inference with `@btime`
           for i = 1:n
               @turbo for j in eachindex(y)
                   y[j] = ntoh(y[j])
               end
           end
           return y
       end
run_ntoh3! (generic function with 2 methods)

julia> function run_ntoh2!(y, n=1000) # easier to profile or `@code_llvm` without all the tuning & inference with `@btime`
           for i = 1:n
               @inbounds @simd for j in eachindex(y)
                   y[j] = ntoh(y[j])
               end
           end
           return y
       end
run_ntoh2! (generic function with 2 methods)

julia> x = rand(Int32, 1_000_000÷4);

julia> y = reinterpret(reshape, Int32, rand(UInt8, 4, 1_000_000÷4));

julia> @btime run_ntoh2!($x);
  12.122 ms (0 allocations: 0 bytes)

julia> @btime run_ntoh3!($x);
  12.631 ms (0 allocations: 0 bytes)

julia> @btime run_ntoh2!($y);
  79.918 ms (0 allocations: 0 bytes)

julia> @btime run_ntoh3!($y);
  12.953 ms (0 allocations: 0 bytes)

julia> z = reinterpret(Int32, rand(UInt8, 1_000_000));

julia> @btime run_ntoh2!($z);
  101.174 ms (0 allocations: 0 bytes)

julia> @btime run_ntoh3!($z);
  12.000 ms (0 allocations: 0 bytes)
```

I didn't add […].

Starting Julia with the environment variable

```
JULIA_LLVM_ARGS="--pass-remarks-analysis=loop-vectorize --pass-remarks-missed=loop-vectorize --pass-remarks=loop-vectorize" julia $(MISC_JULIA_FLAGS)
```

I get:

```julia
julia> run_ntoh2!(x, 1);
remark: simdloop.jl:75:0: vectorized loop (vectorization width: 8, interleaved count: 4)

julia> run_ntoh2!(y, 1);
remark: simdloop.jl:75:0: loop not vectorized: unsafe dependent memory operations in loop. Use #pragma loop distribute(enable) to allow loop distribution to attempt to isolate the offending operations into a separate loop
remark: simdloop.jl:75:0: loop not vectorized

julia> run_ntoh2!(z, 1);
remark: simdloop.jl:75:0: the cost-model indicates that interleaving is not beneficial
remark: simdloop.jl:75:0: vectorized loop (vectorization width: 8, interleaved count: 1)
```

So the `y` loop isn't vectorized at all, and while the `z` loop nominally is, the generated code is poor.

Here is the inner loop for `x`:

```asm
.LBB0_4:                                # %vector.body
                                        #   Parent Loop BB0_3 Depth=1
                                        # =>  This Inner Loop Header: Depth=2
        vmovdqu ymm1, ymmword ptr [rsi + 4*rdx]
        vmovdqu ymm2, ymmword ptr [rsi + 4*rdx + 32]
        vmovdqu ymm3, ymmword ptr [rsi + 4*rdx + 64]
        vmovdqu ymm4, ymmword ptr [rsi + 4*rdx + 96]
        vpshufb ymm1, ymm1, ymm0
        vpshufb ymm2, ymm2, ymm0
        vpshufb ymm3, ymm3, ymm0
        vpshufb ymm4, ymm4, ymm0
        vmovdqu ymmword ptr [rsi + 4*rdx], ymm1
        vmovdqu ymmword ptr [rsi + 4*rdx + 32], ymm2
        vmovdqu ymmword ptr [rsi + 4*rdx + 64], ymm3
        vmovdqu ymmword ptr [rsi + 4*rdx + 96], ymm4
        add rdx, 32
        cmp r10, rdx
        jne .LBB0_4
```

4 loads, 4 shuffles, and 4 stores to process 4*8 = 32 integers per iteration.

While this is the nonsense LLVM produces for `z`:

```asm
.LBB0_4:                                # %vector.body
                                        #   Parent Loop BB0_3 Depth=1
                                        # =>  This Inner Loop Header: Depth=2
        vmovdqu ymm4, ymmword ptr [rcx + 4*rbx]
        vpmovdb xmm4, ymm4
        vmovdqu xmm5, xmmword ptr [rcx + 4*rbx]
        vmovdqu xmm6, xmmword ptr [rcx + 4*rbx + 16]
        vmovdqa xmm7, xmm5
        vpermt2b xmm7, xmm8, xmm6
        vmovdqa xmm0, xmm5
        vpermt2b xmm0, xmm1, xmm6
        vpermt2b ymm5, ymm2, ymm6
        vpslld ymm5, ymm5, 24
        vpmovzxbd ymm0, xmm0 # ymm0 = xmm0[0],zero,zero,zero,xmm0[1],zero,zero,zero,xmm0[2],zero,zero,zero,xmm0[3],zero,zero,zero,xmm0[4],zero,zero,zero,xmm0[5],zero,zero,zero,xmm0[6],zero,zero,zero,xmm0[7],zero,zero,zero
        vpslld ymm0, ymm0, 16
        vpor ymm0, ymm5, ymm0
        vpmovzxbd ymm5, xmm7 # ymm5 = xmm7[0],zero,zero,zero,xmm7[1],zero,zero,zero,xmm7[2],zero,zero,zero,xmm7[3],zero,zero,zero,xmm7[4],zero,zero,zero,xmm7[5],zero,zero,zero,xmm7[6],zero,zero,zero,xmm7[7],zero,zero,zero
        vpslld ymm5, ymm5, 8
        vpmovzxbd ymm4, xmm4 # ymm4 = xmm4[0],zero,zero,zero,xmm4[1],zero,zero,zero,xmm4[2],zero,zero,zero,xmm4[3],zero,zero,zero,xmm4[4],zero,zero,zero,xmm4[5],zero,zero,zero,xmm4[6],zero,zero,zero,xmm4[7],zero,zero,zero
        vpternlogd ymm4, ymm0, ymm5, 254
        vpshufb ymm0, ymm4, ymm3
        vpmovdb xmm4, ymm0
        vpsrld ymm5, ymm0, 8
        vpmovdb xmm5, ymm5
        vpunpcklbw xmm4, xmm4, xmm5 # xmm4 = xmm4[0],xmm5[0],xmm4[1],xmm5[1],xmm4[2],xmm5[2],xmm4[3],xmm5[3],xmm4[4],xmm5[4],xmm4[5],xmm5[5],xmm4[6],xmm5[6],xmm4[7],xmm5[7]
        vpsrld ymm5, ymm0, 16
        vpmovdb xmm5, ymm5
        vpsrld ymm0, ymm0, 24
        vpmovdb xmm0, ymm0
        vpunpcklbw xmm0, xmm5, xmm0 # xmm0 = xmm5[0],xmm0[0],xmm5[1],xmm0[1],xmm5[2],xmm0[2],xmm5[3],xmm0[3],xmm5[4],xmm0[4],xmm5[5],xmm0[5],xmm5[6],xmm0[6],xmm5[7],xmm0[7]
        vpunpcklwd xmm5, xmm4, xmm0 # xmm5 = xmm4[0],xmm0[0],xmm4[1],xmm0[1],xmm4[2],xmm0[2],xmm4[3],xmm0[3]
        vpunpckhwd xmm0, xmm4, xmm0 # xmm0 = xmm4[4],xmm0[4],xmm4[5],xmm0[5],xmm4[6],xmm0[6],xmm4[7],xmm0[7]
        vmovdqu xmmword ptr [rcx + 4*rbx], xmm5
        vmovdqu xmmword ptr [rcx + 4*rbx + 16], xmm0
        add rbx, 8
        cmp r10, rbx
        jne .LBB0_4
```

Processing just 8 integers per iteration, with far more instructions. We could add […].

---
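For intuition, the whole trick of the fast loop is that `vpshufb` applies a single byte permutation that reverses the bytes within each 4-byte lane. A scalar emulation of that shuffle mask (my own illustration, not code from this thread):

```julia
# Emulate the `vpshufb` byteswap mask: output byte k (0-based) comes from
# input byte 4*(k ÷ 4) + (3 - k % 4), i.e. bytes are reversed within each
# 4-byte lane, which is exactly a lane-wise bswap of Int32 elements.
function pshufb_bswap32(block::Vector{UInt8})
    @assert length(block) % 4 == 0
    return [block[4 * (k ÷ 4) + (3 - k % 4) + 1] for k in 0:length(block)-1]
end
```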
Does the LV version not work with a reinterpreted float array?

---
Yes, it works with the `reinterpret`ed arrays above.

---
But they are all reinterpreted into `Int32`; what about `Float32` and `Float64`?

---
I'll have to add the missing `bswap` methods for the float `Vec` types.

---
It's precisely because of that "additional reshape" (which is just a different kind of `ReinterpretArray`) that `y` indexes faster than `z`.

---
So we definitely could use another fix in Base such that in-place `ntoh`/`bswap` on a plain `reinterpret`ed array vectorizes properly.

---
It would be nice to know which specific operations are judged as unsafe for vectorization by LLVM.

---
Yeah, something is systematically inhibiting vectorization(?) of `ReinterpretArray`:

```julia
julia> @benchmark sum(a) setup=(a = reinterpret(Int32, rand(UInt8, 10^4*4)))
BenchmarkTools.Trial: 10000 samples with 175 evaluations.
 Range (min … max):  610.246 ns …  1.442 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     629.880 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   635.574 ns ± 35.472 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

    ▁▆█▆▂▁▂▂▁▁                                                ▂
  ▇▁██████████▇▇▇▅▄▄▅▃▅▃▅▁▁▃▃▁▁▁▁▁▁▁▃▃▁▁▁▃▃▁▁▁▃▁▁▁▅▃▄▃▃▄▄▃▁▄▄▄ █
  610 ns        Histogram: log(frequency) by time        881 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark sum(b) setup=(b = rand(Int32, 10^4))
BenchmarkTools.Trial: 10000 samples with 193 evaluations.
 Range (min … max):  507.021 ns … 840.611 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     517.041 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   519.393 ns ±  11.290 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

   ▁▂▃▅██▇▆▅▅▄▂▂▁ ▁                                            ▂
  ▄▅▆▇█████████████████▇▇▆▇███████▇▇▆▆▆▆▇▆▇▆▇▇▆▅▆▅▅▅▅▄▅▅▄▄▄▆▅▆▆ █
  507 ns        Histogram: log(frequency) by time        558 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
```

---
Don't use plain `reinterpret` here; use `reinterpret(reshape, ...)` instead. As you can see, the first and third are much closer in terms of time.

---
Yeah, but the raw bytes are of course read in as a `Vector{UInt8}`:

```julia
julia> a = rand(UInt8, 100);

julia> reinterpret(Int32, a);

julia> reinterpret(reshape, Int32, a);
ERROR: ArgumentError: `reinterpret(reshape, Int32, a)` where `eltype(a)` is UInt8 requires that `axes(a, 1)` (got Base.OneTo(100)) be equal to 1:4 (from the ratio of element sizes)
Stacktrace:
 [1] (::Base.var"#throwsize1#282")(a::Vector{UInt8}, T::Type)
   @ Base ./reinterpretarray.jl:59
 [2] reinterpret(#unused#::typeof(reshape), #unused#::Type{Int32}, a::Vector{UInt8})
   @ Base ./reinterpretarray.jl:72
 [3] top-level scope
   @ REPL[3]:1

julia> reinterpret(reshape, Int32, reshape(a, 4, 25));
```

EDIT: eh, this double reshape appears to be faster irl too...

---
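The double-reshape trick can be wrapped in a small helper (the name `int32_view` is mine, just for illustration):

```julia
# View a byte vector as Int32 via the faster `reinterpret(reshape, ...)` path,
# using the double-reshape trick: Vector{UInt8} -> 4×N Matrix -> length-N view.
function int32_view(a::Vector{UInt8})
    @assert length(a) % 4 == 0
    return reinterpret(reshape, Int32, reshape(a, 4, length(a) ÷ 4))
end
```

Both views alias the same bytes, so writes through `int32_view(a)` mutate `a` just like the plain `reinterpret` version.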
This has been fixed somehow, partially:

```julia
julia> @benchmark x .= ntoh.(x) setup=begin x = rand(UInt32,1_000_000÷4) end
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  10.630 μs … 38.643 μs  ┊ GC (min … max): 0.00% … 0.00%

julia> @benchmark begin y = reinterpret(Int32, x); map!(ntoh, y, y) end setup=begin x = rand(UInt8,1_000_000) end
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  97.315 μs … 146.138 μs  ┊ GC (min … max): 0.00% … 0.00%

julia> @benchmark begin y = reinterpret(Int32, x); y .= ntoh.(y) end setup=begin x = rand(UInt8,1_000_000) end
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  11.010 μs … 57.449 μs  ┊ GC (min … max): 0.00% … 0.00%
```

---
Does #43955 help any?

---
Looks like free speedup to me:

```julia
julia> @benchmark x .= ntoh.(x) setup=begin x = rand(UInt32,1_000_000÷4) end
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  10.239 μs … 38.322 μs  ┊ GC (min … max): 0.00% … 0.00%

julia> @benchmark begin y = reinterpret(Int32, x); map!(ntoh, y, y) end setup=begin x = rand(UInt8,1_000_000) end
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  113.273 μs … 146.015 μs  ┊ GC (min … max): 0.00% … 0.00%

julia> @benchmark begin y = reinterpret(Int32, x); y .= ntoh.(y) end setup=begin x = rand(UInt8,1_000_000) end
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  10.159 μs … 37.931 μs  ┊ GC (min … max): 0.00% … 0.00%
```

The in-place broadcast does indeed seem to be better!
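As a sanity check alongside those timings, the two in-place spellings compute identical results; only their codegen differs:

```julia
# `map!(ntoh, y, y)` and `y .= ntoh.(y)` over the same reinterpret view
# produce the same bytes; the benchmarks above differ only in speed.
x = rand(UInt8, 1024)
x1, x2 = copy(x), copy(x)
a = reinterpret(Int32, x1); map!(ntoh, a, a)
b = reinterpret(Int32, x2); b .= ntoh.(b)
x1 == x2
```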