Improve reduce performance by passing CartesianIndices and length statically #100

maxwindiff · 2023-02-21T07:57:00Z

Improve indexing performance by passing CartesianIndices statically, using a trick similar to the one in JuliaGPU/GPUArrays.jl#454. Still slow, but not as bad as before. Helps with #46.
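For readers unfamiliar with the trick, here is a minimal CPU-only sketch of what "passing CartesianIndices statically" can look like. The function names `sum_dynamic` and `sum_static` are hypothetical and are not taken from src/mapreduce.jl; the sketch only shows the range moving from a runtime argument into a `Val` type parameter, and it makes no claim about GPU timings.

```julia
# Dynamic variant: the index range arrives as an ordinary runtime value.
function sum_dynamic(A, R::CartesianIndices)
    acc = zero(eltype(A))
    for i in 1:length(R)
        acc += A[R[i]]  # linear index i is turned into a CartesianIndex at run time
    end
    return acc
end

# Static variant (hypothetical name): the range is a type parameter,
# so its shape is known to the compiler when the method is specialized.
function sum_static(A, ::Val{R}) where {R}
    acc = zero(eltype(A))
    for i in 1:length(R)
        acc += A[R[i]]
    end
    return acc
end

A = ones(Float32, 4, 4)
R = CartesianIndices(A)   # isbits, so it can be wrapped in Val

total_dynamic = sum_dynamic(A, R)
total_static = sum_static(A, Val(R))
```

Both calls return the same value; the only difference is where the shape information lives.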

Before:

julia> a = fill(Float32(1.0), 4096 * 4096);
julia> da = MtlArray(a);
julia> b = fill(Float32(1.0), 4096, 4096);
julia> db = MtlArray(b);

julia> @btime sum(a)
  1.393 ms (1 allocation: 16 bytes)
1.6777216f7

julia> @btime sum(b)
  1.392 ms (1 allocation: 16 bytes)
1.6777216f7

julia> @btime sum(da)
  4.026 ms (868 allocations: 23.95 KiB)
1.6777216f7

julia> @btime sum(db)
  11.196 ms (873 allocations: 25.23 KiB)
1.6777216f7

After:

julia> @btime sum(da)
  1.811 ms (754 allocations: 20.80 KiB)
1.6777216f7

julia> @btime sum(db)
  2.181 ms (759 allocations: 21.33 KiB)
1.6777216f7

Passing length(Rother) as Rlen may look redundant, but without it the 2D case (sum(db)) runs about 3x slower:

julia> @btime sum(db)
  6.648 ms (759 allocations: 21.33 KiB)
1.6777216f7
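A small sketch of the Rlen idea, assuming it is used the way the kernel's `fldmod1(threadgroup_position_in_grid_1d(), Rlen)` call suggests. The helper names `split_group_dynamic` and `split_group_static` are hypothetical. One plausible reading, not confirmed in this thread, is that a divisor known at compile time is cheaper on the GPU than a runtime one.

```julia
# Runtime divisor: Rlen is an ordinary integer argument.
split_group_dynamic(idx, Rlen) = fldmod1(idx, Rlen)

# Compile-time divisor (hypothetical helper): Rlen is a type parameter.
split_group_static(idx, ::Val{Rlen}) where {Rlen} = fldmod1(idx, Rlen)

# Group 7 with 3 "other" slots maps to reduce-group 3, other-slot 1.
pair_dynamic = split_group_dynamic(7, 3)
pair_static = split_group_static(7, Val(3))
```

The two helpers compute the same pair; only the static one lets the compiler see the divisor.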

There were some test failures, but they also happen on main (complaining about a symbol not being found) and seem unrelated to this PR -- https://gist.github.com/maxwindiff/fe0480dcfd1bcd4cb28e91f2c1a0cfa6

src/mapreduce.jl

maleadt · 2023-02-21T08:43:59Z

LGTM, for now at least. This isn't something we want to apply everywhere because of the increased compile times; it would be better to find a way to encode dynamic Cartesian indices such that Metal can handle them with reasonable performance.
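The compile-time concern can be illustrated with plain Julia types. This is a sketch of the general mechanism, not a measurement of Metal.jl: two ranges of different shape share one type when passed dynamically, but become two distinct types once wrapped in `Val`, so a method taking `::Val{R}` is specialized separately for each shape.

```julia
R1 = CartesianIndices((2, 3))
R2 = CartesianIndices((3, 2))

# Passed dynamically, both shapes have the same type: one specialization serves both.
same_dynamic_type = typeof(R1) == typeof(R2)

# Passed as Val, each shape is its own type: one specialization per shape.
same_static_type = typeof(Val(R1)) == typeof(Val(R2))
```

Here `same_dynamic_type` is `true` and `same_static_type` is `false`.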

maleadt · 2023-02-21T08:49:54Z

Did you explore adding back some of the information that gets lost by @inbounds? That improved performance significantly in JuliaGPU/GPUArrays.jl#454.

maxwindiff · 2023-02-22T06:08:39Z

The linear indexing operations at https://github.com/JuliaGPU/Metal.jl/blob/main/src/mapreduce.jl#L105 and https://github.com/JuliaGPU/Metal.jl/blob/main/src/mapreduce.jl#L120 are already guarded by range checks. What other bounds info should I try?

I tried this, but it gave no improvement:

diff --git a/src/mapreduce.jl b/src/mapreduce.jl
index e878851..29010ae 100644
--- a/src/mapreduce.jl
+++ b/src/mapreduce.jl
@@ -96,6 +96,7 @@ function partial_mapreduce_device(f, op, neutral, maxthreads, ::Val{Rreduce},
     # and possibly groups if it doesn't fit) and other elements (remaining groups)
     localIdx_reduce = thread_position_in_threadgroup_1d()
     localDim_reduce = threads_per_threadgroup_1d()
+    assume(1 <= Rlen)
     groupIdx_reduce, groupIdx_other = fldmod1(threadgroup_position_in_grid_1d(), Rlen)
     groupDim_reduce = threadgroups_per_grid_1d() ÷ Rlen
 
@@ -103,6 +104,7 @@ function partial_mapreduce_device(f, op, neutral, maxthreads, ::Val{Rreduce},
     # (that means we can safely synchronize items within this group)
     iother = groupIdx_other
     @inbounds if iother <= length(Rother)
+        assume(1 <= iother <= length(Rother))
         Iother = Rother[iother]
 
         # load the neutral value
@@ -118,6 +120,7 @@ function partial_mapreduce_device(f, op, neutral, maxthreads, ::Val{Rreduce},
         # reduce serially across chunks of input vector that don't fit in a group
         ireduce = localIdx_reduce + (groupIdx_reduce - 1) * localDim_reduce
         while ireduce <= length(Rreduce)
+            assume(1 <= ireduce <= length(Rreduce))
             Ireduce = Rreduce[ireduce]
             J = max(Iother, Ireduce)
             val = op(val, f(_map_getindex(As, J)...))

maleadt · 2023-02-22T08:00:54Z

Yeah, I guess that covers all of them already.

Improve reduce performance by passing CartesianIndices and length statically

634e5f9

maleadt reviewed Feb 21, 2023

maleadt added the performance label Feb 21, 2023

maleadt mentioned this pull request Feb 21, 2023

Improve performance of Cartesian indexing #101

Open

Remove stray log

2a02d7a

Co-authored-by: Tim Besard <tim.besard@gmail.com>

maleadt merged commit 25a7930 into JuliaGPU:main Feb 22, 2023

maxwindiff deleted the reduce branch February 26, 2023 07:09
