ntoh / bswap are 10x slower when operating in-place on reinterpret array #42227

Closed
Moelf opened this issue Sep 13, 2021 · 20 comments

@Moelf
Contributor

Moelf commented Sep 13, 2021

julia> @benchmark x .= ntoh.(x) setup=begin x = rand(UInt32,1_000_000÷4) end
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  10.710 μs … 37.441 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     18.936 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   18.776 μs ±  1.804 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

                               ▃▅▄▆▇█▆▃                        
  ▂▂▂▂▁▂▁▁▂▁▁▂▁▂▁▂▂▃▃▄▆▇█▇▆▆▅▇█████████▇▄▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂ ▄
  10.7 μs         Histogram: frequency by time          26 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.


julia> @benchmark begin y = reinterpret(Int32, x); map!(ntoh, y, y) end setup=begin x = rand(UInt8,1_000_000) end
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  106.632 μs … 296.513 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     108.787 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   109.698 μs ±   3.539 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

     ▁▃ ▁▆▄ ▆█▃▂▇▆▂▁▃▁ ▄▅▁▁▂▂          ▁  ▂▁ ▁▁                 ▂
  ▆▅▄██▆███▇███████████████████▇▇▇▇█▆███▇▇█████▇▇▇▆▇▇▇▇▇▄▅▇▆▆▅▅ █
  107 μs        Histogram: log(frequency) by time        118 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark begin y = reinterpret(Int32, x); y .= ntoh.(y) end setup=begin x = rand(UInt8,1_000_000) end
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  173.369 μs … 388.629 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     181.555 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   181.340 μs ±   9.220 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

            ▃▄ ▁█▆▂      ▁ ▂▁▁▃▇▄▁▃▆▂▁▆▇▃ ▁▁ ▁▁ ▁▁▁▁ ▁▂▁▁▁▁     ▂
  ▅▃▁▁▄▁▁▇▄▁█████████▆▇▇███████████████████████████████████▇▆▆▅ █
  173 μs        Histogram: log(frequency) by time        190 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.
@Moelf
Contributor Author

Moelf commented Sep 13, 2021

It should be fixable by something like:

        t1 = Base.@_gc_preserve_begin rawdata
        real_data = unsafe_wrap(Array, pointer(reinterpret(Int32, rawdata)), 1_000_000÷4)
        real_data .= ntoh.(real_data)
        vov = VectorOfVectors(real_data, rawoffsets, ArraysOfArrays.no_consistency_checks)
        Base.@_gc_preserve_end t1
        return vov

but for some reason GC isn't happy...
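A sketch of the same idea using the public `GC.@preserve` macro instead of the internal `@_gc_preserve_begin`/`@_gc_preserve_end` pair: the wrapped `Array` merely aliases `rawdata`'s memory, so `rawdata` must stay rooted for as long as the wrapper is used. `ntoh_inplace!` is a hypothetical helper name, and the `VectorOfVectors` construction from the snippet above is omitted:

```julia
# Sketch: byte-swap a Vector{UInt8} in place by viewing it as Int32s.
# Assumes length(rawdata) is a multiple of 4.
function ntoh_inplace!(rawdata::Vector{UInt8})
    n = length(rawdata) ÷ 4
    GC.@preserve rawdata begin
        # unsafe_wrap with the default own = false aliases rawdata's memory,
        # which is why rawdata must be kept alive via GC.@preserve here.
        real_data = unsafe_wrap(Array, convert(Ptr{Int32}, pointer(rawdata)), n)
        real_data .= ntoh.(real_data)
    end
    return rawdata
end
```

The key point is that the wrapper must not outlive the `GC.@preserve` block; returning `vov` from inside the preserved region, as in the snippet above, is likely where the GC trouble comes from.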

@jakobnissen
Contributor

The inefficiency comes from indexing into the reinterpreted array being slower than indexing into a normal Array. The ntoh is not relevant. I don't think there are any open issues about the inefficiency of indexing into reinterpreted arrays; at least, I couldn't find any.

@Moelf
Contributor Author

Moelf commented Sep 13, 2021

I'm thinking there's a (theoretical) shortcut: in-place ntoh on a reinterpreted array should be as fast as on a normal array, since the byte layout is exactly the same.
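To make that observation concrete: on a little-endian host, ntoh of an Int32 is just a 4-byte reversal, so the same in-place transform can be expressed directly on the parent UInt8 vector without any ReinterpretArray indexing at all (a sketch; `bswap4!` is a hypothetical helper name):

```julia
# Sketch: reverse each 4-byte chunk of a byte vector in place.
# On a little-endian machine this is equivalent to y .= ntoh.(y)
# for y = reinterpret(Int32, raw). Assumes length(raw) % 4 == 0.
function bswap4!(raw::Vector{UInt8})
    @inbounds for i in 1:4:length(raw)
        raw[i],   raw[i+3] = raw[i+3], raw[i]
        raw[i+1], raw[i+2] = raw[i+2], raw[i+1]
    end
    return raw
end
```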

@Moelf
Contributor Author

Moelf commented Sep 13, 2021

I now see a related previous discussion: #25014

@timholy
Member

timholy commented Sep 13, 2021

It may not even have anything to do directly with that issue; instead it looks like a SIMD vectorization issue.

julia> function run_ntoh2!(y, n=1000)    # easier to profile or `@code_llvm` without all the tuning & inference with `@btime`
           for i = 1:n
               @inbounds @simd for j in eachindex(y)
                   y[j] = ntoh(y[j])
               end
           end
           return y
       end
run_ntoh2! (generic function with 2 methods)

julia> x = rand(Int32, 1_000_000÷4);

julia> y = reinterpret(reshape, Int32, rand(UInt8, 4, 1_000_000÷4));

julia> @btime run_ntoh2!(x);
  22.733 ms (0 allocations: 0 bytes)

julia> @btime run_ntoh2!(y);
  122.921 ms (1 allocation: 32 bytes)

julia> z = reinterpret(Int32, rand(UInt8, 1_000_000));

julia> @btime run_ntoh2!(z);
  276.921 ms (1 allocation: 32 bytes)

Interestingly, the difference between x and y mostly seems to be that one vectorizes and the other does not. @chriselrod, any ideas? I tried with @turbo but got

ERROR: MethodError: no method matching bswap(::VectorizationBase.Vec{8, Int32})

@Moelf
Contributor Author

Moelf commented Sep 13, 2021

I also don't understand why z is so much slower than y; shouldn't z be faster, since it doesn't have the extra reshape wrapper?

@chriselrod
Contributor

I tried with @turbo but got

Fixed: JuliaSIMD/VectorizationBase.jl@5636185

Now:

julia> using LoopVectorization
[ Info: Precompiling LoopVectorization [bdcacae8-1622-11e9-2a5c-532679323890]

julia> function run_ntoh3!(y, n=1000)    # easier to profile or `@code_llvm` without all the tuning & inference with `@btime`
           for i = 1:n
               @turbo for j in eachindex(y)
                   y[j] = ntoh(y[j])
               end
           end
           return y
       end
run_ntoh3! (generic function with 2 methods)

julia> function run_ntoh2!(y, n=1000)    # easier to profile or `@code_llvm` without all the tuning & inference with `@btime`
           for i = 1:n
               @inbounds @simd for j in eachindex(y)
                   y[j] = ntoh(y[j])
               end
           end
           return y
       end
run_ntoh2! (generic function with 2 methods)

julia> x = rand(Int32, 1_000_000÷4);

julia> y = reinterpret(reshape, Int32, rand(UInt8, 4, 1_000_000÷4));

julia> @btime run_ntoh2!($x);
  12.122 ms (0 allocations: 0 bytes)

julia> @btime run_ntoh3!($x);
  12.631 ms (0 allocations: 0 bytes)

julia> @btime run_ntoh2!($y);
  79.918 ms (0 allocations: 0 bytes)

julia> @btime run_ntoh3!($y);
  12.953 ms (0 allocations: 0 bytes)

julia> z = reinterpret(Int32, rand(UInt8, 1_000_000));

julia> @btime run_ntoh2!($z);
  101.174 ms (0 allocations: 0 bytes)

julia> @btime run_ntoh3!($z);
  12.000 ms (0 allocations: 0 bytes)

I didn't add ntoh or bswap to LoopVectorization's cost table, so it is treating them as expensive and not unrolling.

Starting Julia with the environment variable JULIA_LLVM_ARGS:

JULIA_LLVM_ARGS="--pass-remarks-analysis=loop-vectorize --pass-remarks-missed=loop-vectorize --pass-remarks=loop-vectorize" julia $(MISC_JULIA_FLAGS)

I get

julia> run_ntoh2!(x, 1);
remark: simdloop.jl:75:0: vectorized loop (vectorization width: 8, interleaved count: 4)

julia> run_ntoh2!(y, 1);
remark: simdloop.jl:75:0: loop not vectorized: unsafe dependent memory operations in loop. Use #pragma loop distribute(enable) to allow loop distribution to attempt to isolate the offending operations into a separate loop
remark: simdloop.jl:75:0: loop not vectorized

julia> run_ntoh2!(z, 1);
remark: simdloop.jl:75:0: the cost-model indicates that interleaving is not beneficial
remark: simdloop.jl:75:0: vectorized loop (vectorization width: 8, interleaved count: 1)

So x and z are vectorized, while y is not. I tried adding @simd ivdep, but y still refused to vectorize because of the "unsafe dependent memory operations in loop".
While z did vectorize, the generated code was awful.

Here is the loop for run_ntoh2!(x,1):

.LBB0_4:                                # %vector.body
                                        #   Parent Loop BB0_3 Depth=1
                                        # =>  This Inner Loop Header: Depth=2
        vmovdqu ymm1, ymmword ptr [rsi + 4*rdx]
        vmovdqu ymm2, ymmword ptr [rsi + 4*rdx + 32]
        vmovdqu ymm3, ymmword ptr [rsi + 4*rdx + 64]
        vmovdqu ymm4, ymmword ptr [rsi + 4*rdx + 96]
        vpshufb ymm1, ymm1, ymm0
        vpshufb ymm2, ymm2, ymm0
        vpshufb ymm3, ymm3, ymm0
        vpshufb ymm4, ymm4, ymm0
        vmovdqu ymmword ptr [rsi + 4*rdx], ymm1
        vmovdqu ymmword ptr [rsi + 4*rdx + 32], ymm2
        vmovdqu ymmword ptr [rsi + 4*rdx + 64], ymm3
        vmovdqu ymmword ptr [rsi + 4*rdx + 96], ymm4
        add     rdx, 32
        cmp     r10, rdx
        jne     .LBB0_4

4 loads, 4 shuffles, 4 stores to process 4*8 = 32 Int32.

While this is the nonsense LLVM produces for run_ntoh2!(z,1):

.LBB0_4:                                # %vector.body
                                        #   Parent Loop BB0_3 Depth=1
                                        # =>  This Inner Loop Header: Depth=2
        vmovdqu ymm4, ymmword ptr [rcx + 4*rbx]
        vpmovdb xmm4, ymm4
        vmovdqu xmm5, xmmword ptr [rcx + 4*rbx]
        vmovdqu xmm6, xmmword ptr [rcx + 4*rbx + 16]
        vmovdqa xmm7, xmm5
        vpermt2b        xmm7, xmm8, xmm6
        vmovdqa xmm0, xmm5
        vpermt2b        xmm0, xmm1, xmm6
        vpermt2b        ymm5, ymm2, ymm6
        vpslld  ymm5, ymm5, 24
        vpmovzxbd       ymm0, xmm0              # ymm0 = xmm0[0],zero,zero,zero,xmm0[1],zero,zero,zero,xmm0[2],zero,zero,zero,xmm0[3],zero,zero,zero,xmm0[4],zero,zero,zero,xmm0[5],zero,zero,zero,xmm0[6],zero,zero,zero,xmm0[7],zero,zero,zero
        vpslld  ymm0, ymm0, 16
        vpor    ymm0, ymm5, ymm0
        vpmovzxbd       ymm5, xmm7              # ymm5 = xmm7[0],zero,zero,zero,xmm7[1],zero,zero,zero,xmm7[2],zero,zero,zero,xmm7[3],zero,zero,zero,xmm7[4],zero,zero,zero,xmm7[5],zero,zero,zero,xmm7[6],zero,zero,zero,xmm7[7],zero,zero,zero
        vpslld  ymm5, ymm5, 8
        vpmovzxbd       ymm4, xmm4              # ymm4 = xmm4[0],zero,zero,zero,xmm4[1],zero,zero,zero,xmm4[2],zero,zero,zero,xmm4[3],zero,zero,zero,xmm4[4],zero,zero,zero,xmm4[5],zero,zero,zero,xmm4[6],zero,zero,zero,xmm4[7],zero,zero,zero
        vpternlogd      ymm4, ymm0, ymm5, 254
        vpshufb ymm0, ymm4, ymm3
        vpmovdb xmm4, ymm0
        vpsrld  ymm5, ymm0, 8
        vpmovdb xmm5, ymm5
        vpunpcklbw      xmm4, xmm4, xmm5        # xmm4 = xmm4[0],xmm5[0],xmm4[1],xmm5[1],xmm4[2],xmm5[2],xmm4[3],xmm5[3],xmm4[4],xmm5[4],xmm4[5],xmm5[5],xmm4[6],xmm5[6],xmm4[7],xmm5[7]
        vpsrld  ymm5, ymm0, 16
        vpmovdb xmm5, ymm5
        vpsrld  ymm0, ymm0, 24
        vpmovdb xmm0, ymm0
        vpunpcklbw      xmm0, xmm5, xmm0        # xmm0 = xmm5[0],xmm0[0],xmm5[1],xmm0[1],xmm5[2],xmm0[2],xmm5[3],xmm0[3],xmm5[4],xmm0[4],xmm5[5],xmm0[5],xmm5[6],xmm0[6],xmm5[7],xmm0[7]
        vpunpcklwd      xmm5, xmm4, xmm0        # xmm5 = xmm4[0],xmm0[0],xmm4[1],xmm0[1],xmm4[2],xmm0[2],xmm4[3],xmm0[3]
        vpunpckhwd      xmm0, xmm4, xmm0        # xmm0 = xmm4[4],xmm0[4],xmm4[5],xmm0[5],xmm4[6],xmm0[6],xmm4[7],xmm0[7]
        vmovdqu xmmword ptr [rcx + 4*rbx], xmm5
        vmovdqu xmmword ptr [rcx + 4*rbx + 16], xmm0
        add     rbx, 8
        cmp     r10, rbx
        jne     .LBB0_4

Processing just 8 Int32s, i.e. doing a lot more work to compute far fewer results.
Not sure what is going wrong; I'd have to spend more time looking at it to get some idea.

We could add ntoh and/or bswap to LV's cost table if we want it to unroll more. That'd probably help for much smaller arrays; at this size it's memory-bottlenecked, so it doesn't make a difference.

@Moelf
Contributor Author

Moelf commented Sep 13, 2021

Does the LV version not work with reinterpreted float arrays?

@chriselrod
Contributor

Does the LV version not work with reinterpreted float arrays?

Yes, it works with x, y, and z above, getting similar performance in each case.

@Moelf
Contributor Author

Moelf commented Sep 13, 2021

But those are all reinterpreted into Int32; what about Float32 and Float64?

@chriselrod
Contributor

chriselrod commented Sep 13, 2021

I'll have to add the missing bswap methods.
EDIT: JuliaSIMD/VectorizationBase.jl@c775fe9

@timholy
Member

timholy commented Sep 14, 2021

I also don't understand why z is so much slower than y, shouldn't z be faster because it doesn't have one more Reshape wrapper?

It's precisely because of that "additional reshape" (which is just a different kind of ReinterpretArray, not an additional wrapper): #37559. Clearly it didn't fix absolutely everything (witness this issue), but the performance gains in a few cases were as big as 20x.

@Moelf
Contributor Author

Moelf commented Sep 14, 2021

So we could definitely use another fix in Base, so that in-place ntoh on a ReinterpretArray becomes potentially 10x faster?

@timholy
Member

timholy commented Sep 15, 2021

It would be nice to know which specific operations are judged as unsafe for vectorization by LLVM.

@Moelf
Contributor Author

Moelf commented Sep 15, 2021

yeah something is systematically inhibiting vectorization(?) of ReinterpretArray:

julia> @benchmark sum(a) setup=(a = reinterpret(Int32, rand(UInt8, 10^4*4)))
BenchmarkTools.Trial: 10000 samples with 175 evaluations.
 Range (min … max):  610.246 ns … 1.442 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     629.880 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   635.574 ns ± 35.472 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

    ▁▆█▆▂▁▂▂▁▁                                                 ▂
  ▇▁██████████▇▇▇▅▄▄▅▃▅▃▅▁▁▃▃▁▁▁▁▁▁▁▃▃▁▁▁▃▃▁▁▁▃▁▁▁▅▃▄▃▃▄▄▃▁▄▄▄ █
  610 ns        Histogram: log(frequency) by time       881 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark sum(b) setup=(b = rand(Int32, 10^4))
BenchmarkTools.Trial: 10000 samples with 193 evaluations.
 Range (min … max):  507.021 ns … 840.611 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     517.041 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   519.393 ns ±  11.290 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

        ▁▂▃▅██▇▆▅▅▄▂▂▁        ▁                                 ▂
  ▄▅▆▇█████████████████▇▇▆▇███████▇▇▆▆▆▆▇▆▇▆▇▇▆▅▆▅▅▅▅▄▅▅▄▄▄▆▅▆▆ █
  507 ns        Histogram: log(frequency) by time        558 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

@timholy
Member

timholy commented Sep 16, 2021

Don't use reinterpret(T, a). Use reinterpret(reshape, T, a). The following display uses JuliaCI/BenchmarkTools.jl#255 to make it easier to compare the histograms:

(screenshot: overlaid benchmark histograms comparing the three variants)

As you can see, the first and third are much closer in terms of time.

@Moelf
Contributor Author

Moelf commented Sep 16, 2021

Yeah, but the raw bytes are of course read in as a Vector. Would you recommend:

julia> a = rand(UInt8, 100);

julia> reinterpret(Int32, a);

julia> reinterpret(reshape, Int32, a);
ERROR: ArgumentError: `reinterpret(reshape, Int32, a)` where `eltype(a)` is UInt8 requires that `axes(a, 1)` (got Base.OneTo(100)) be equal to 1:4 (from the ratio of element sizes)
Stacktrace:
 [1] (::Base.var"#throwsize1#282")(a::Vector{UInt8}, T::Type)
   @ Base ./reinterpretarray.jl:59
 [2] reinterpret(#unused#::typeof(reshape), #unused#::Type{Int32}, a::Vector{UInt8})
   @ Base ./reinterpretarray.jl:72
 [3] top-level scope
   @ REPL[3]:1

julia> reinterpret(reshape, Int32, reshape(a, 4, 25));

EDIT: eh, this double reshape appears to be faster irl too...
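For reference, the double-reshape pattern above can be wrapped in a small helper so the `axes(a, 1)` requirement is handled once; `fast_reinterpret` is a hypothetical name, not a Base function:

```julia
# Sketch: reinterpret a flat byte vector as a vector of T via the
# reshape-based ReinterpretArray, which tends to vectorize better
# than plain reinterpret(T, a).
function fast_reinterpret(::Type{T}, a::Vector{UInt8}) where {T}
    k = sizeof(T)
    @assert length(a) % k == 0 "length(a) must be a multiple of sizeof(T)"
    # reshape to k × n so the leading byte dimension can be collapsed into T
    return reinterpret(reshape, T, reshape(a, k, length(a) ÷ k))
end
```

The result is a view over the same memory, so in-place operations like `y .= ntoh.(y)` still mutate the original byte vector.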

@Moelf
Contributor Author

Moelf commented Feb 12, 2022

this has been fixed somehow, partially:

julia> @benchmark x .= ntoh.(x) setup=begin x = rand(UInt32,1_000_000÷4) end
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  10.630 μs … 38.643 μs  ┊ GC (min … max): 0.00% … 0.00%


julia> @benchmark begin y = reinterpret(Int32, x); map!(ntoh, y, y) end setup=begin x = rand(UInt8,1_000_000) end
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):   97.315 μs … 146.138 μs  ┊ GC (min … max): 0.00% … 0.00%

julia> @benchmark begin y = reinterpret(Int32, x); y .= ntoh.(y) end setup=begin x = rand(UInt8,1_000_000) end
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  11.010 μs … 57.449 μs  ┊ GC (min … max): 0.00% … 0.00%

@vtjnash
Member

vtjnash commented Feb 13, 2022

Does #43955 help any?

@Moelf
Contributor Author

Moelf commented Feb 13, 2022

julia> @time sum(length, tree.Muon_pt)
  2.881126 seconds (20.48 k allocations: 2.378 GiB, 0.93% gc time)
149322456

julia> @time sum(length, tree.Muon_pt)
  2.918528 seconds (21.00 k allocations: 2.378 GiB, 1.12% gc time)
149322456

looks like free speedup to me


julia> @benchmark x .= ntoh.(x) setup=begin x = rand(UInt32,1_000_000÷4) end
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  10.239 μs … 38.322 μs  ┊ GC (min … max): 0.00% … 0.00%


julia> @benchmark begin y = reinterpret(Int32, x); map!(ntoh, y, y) end setup=begin x = rand(UInt8,1_000_000) end
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  113.273 μs … 146.015 μs  ┊ GC (min … max): 0.00% … 0.00%

julia> @benchmark begin y = reinterpret(Int32, x); y .= ntoh.(y) end setup=begin x = rand(UInt8,1_000_000) end
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  10.159 μs … 37.931 μs  ┊ GC (min … max): 0.00% … 0.00%

the in-place seems to be better indeed!
