Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Strided pattern support #118

Merged
merged 9 commits into from
Sep 16, 2023
Merged

Strided pattern support #118

merged 9 commits into from
Sep 16, 2023

Conversation

chriselrod
Copy link
Member

Adds limited support for

@batch stride=true for ...

Example usecase:

using LoopVectorization, VectorizationBase, Polyester

# using Plots 

const xn = 960
const yn = 960
const xmin = -2.0
const xmax = 0.6
const ymin = -1.5
const ymax = 1.5
const MAX_ITERS = 200

@inline function mandelbrot_kernel(c)
  z = c
  for i = 1:MAX_ITERS
    z = z * z + c
    if abs2(z) > 4
      return i
    end
  end
  return MAX_ITERS
end
@inline function mandelbrot_kernel(r, i::Vec{W}) where {W}
  z = c = complex(r,i)
  inactive = VectorizationBase.Mask(VectorizationBase.zero_mask(Val(W)))
  imask = VectorizationBase.vbroadcast(Val(W), 0)
  for i = 1:MAX_ITERS
    z = muladd(z, z, c)
    imask = ifelse(inactive, imask, i)
    inactive |= abs2(z) > 4
    VectorizationBase.vall(inactive) && return imask
  end
  return imask
end
@inline function mandelbrot_kernel(r::VecUnroll, i::VecUnroll)
VectorizationBase.VecUnroll(VectorizationBase.fmap(mandelbrot_kernel, VectorizationBase.data(r),VectorizationBase.data(i)))
end
@inline function mandelbrot_kernel(r::Vec, i::VecUnroll)
VectorizationBase.VecUnroll(VectorizationBase.fmap(mandelbrot_kernel, r, VectorizationBase.data(i)))
end
@inline function mandelbrot_kernel(r::VecUnroll, i::Vec)
VectorizationBase.VecUnroll(VectorizationBase.fmap(mandelbrot_kernel, VectorizationBase.data(r),i))
end

function compute_mandelbrot!(result)

  x_range = range(xmin, xmax, xn)
  y_range = range(ymin, ymax, xn)

  Threads.@threads for j = 1:yn
    for i = 1:xn
      x = x_range[i]
      y = y_range[j]
      result[j, i] = mandelbrot_kernel(complex(x, y))
    end
  end
  return result
end
compute_mandelbrot() = compute_mandelbrot!(zeros(yn, xn))
function compute_mandelbrot_turbo!(result)
  x_range = range(xmin, xmax, xn)
  y_range = range(ymin, ymax, yn)
  
  @tturbo for j = 1:yn
    for i = 1:xn
      x = x_range[i]
      y = y_range[j]
      result[j, i] = mandelbrot_kernel(x, y)
    end
  end
  return result
end

function compute_mandelbrot_polyturbo!(result)
  x_range = range(xmin, xmax, xn)
  y_range = range(ymin, ymax, xn)
    @batch per=thread for i = 1:xn
      c = 13
      m = xn
      a = m + 1
      x = x_range[i]
    @turbo for j = 1:yn
      y = y_range[j]
      result[j, i] = mandelbrot_kernel(x, y)
    end
  end
  return result
end
function compute_mandelbrot_polyturbo_stride!(result)
  x_range = range(xmin, xmax, xn)
  y_range = range(ymin, ymax, xn)
    @batch stride=true for i = 1:xn
      c = 13
      m = xn
      a = m + 1
      x = x_range[i]
    @turbo for j = 1:yn
      y = y_range[j]
      result[j, i] = mandelbrot_kernel(x, y)
    end
  end
  return result
end
compute_mandelbrot_turbo() = compute_mandelbrot_turbo!(zeros(yn, xn))
compute_mandelbrot_polyturbo() = compute_mandelbrot_polyturbo!(zeros(yn, xn))
compute_mandelbrot_polyturbo_stride() = compute_mandelbrot_polyturbo_stride!(zeros(yn, xn))


function compute_mandelbrot_polyturbo!(result, perm)
  x_range = range(xmin, xmax, xn)
  y_range = range(ymin, ymax, yn)
  
  @batch per=thread for _i = 1:xn
    i = perm[_i]
    x = x_range[i]
    @turbo for j = 1:yn
      result[j, i] = mandelbrot_kernel(x, y_range[j])
    end
  end
  return result
end

using Random
perm = randperm(yn);

struct ComplexSIMD{N,T}
    re::NTuple{N,T}
    im::NTuple{N,T}
end

Base.abs2(z::ComplexSIMD) = z.re .* z.re .+ z.im .* z.im
Base.:(*)(a::ComplexSIMD, b::ComplexSIMD) = ComplexSIMD(
    a.re .* b.re .- a.im .* b.im,
    a.re .* b.im .+ a.im .* b.re
)
Base.:(+)(a::ComplexSIMD, b::ComplexSIMD) = ComplexSIMD(a.re .+ b.re, a.im .+ b.im)

@inline function mandelbrot_kernel(c::ComplexSIMD{N,T}) where {N,T}
    z = c
    mask = ntuple(k -> true, N)
    iters = ntuple(k -> T(0), N)
    for i in 1:MAX_ITERS
        !any(mask) && return iters
        mask = abs2(z) .<= 4.0
        iters = iters .+ (mask .* 1)
        z = z * z + c
    end
    return iters
end

function inner(result, x_range, y_range, x, ::Val{N}) where N
    ci = x_range[x]
    @inbounds for y in 1:N:length(y_range)
        # Calculate whether the (x:x+7)-th real coordinates with
        # the y-th imaginary coordinate belong to the
        # mandelbrot set.
        c = ComplexSIMD(ntuple(k -> x_range[y+k-1], N), ntuple(k -> ci, N))
        iters = mandelbrot_kernel(c)
        foreach(enumerate(iters)) do (k, v)
            # Write the result to the output array
            result[y+k-1, x] = v
        end
    end
end
function inner_turbo(result, x_range, y_range, x, ::Val{N}) where N
  ci = x_range[x]
  @turbo for j = eachindex(y_range)
    y = y_range[j]
    result[j, x] = mandelbrot_kernel(x, y)
  end
end

function mandelbrot!(result, x_range, y_range, N)
    @sync for x in eachindex(x_range)
        # Threads.@spawn allows dynamic scheduling instead of static scheduling
        # of Threads.@threads macro.
        # See <https://github.com/JuliaLang/julia/issues/21017>.
        Threads.@spawn @inbounds begin
          inner(result, x_range, y_range, x, N)
        end
    end
    return result
end
function inner_turbo(result, x_range, y_range, x, ::Val{N}) where N
  ci = x_range[x]
  @turbo for j = eachindex(y_range)
    y = y_range[j]
    result[j, x] = mandelbrot_kernel(x, y)
  end
end
function mandelbrot_turbo!(result, x_range, y_range, N)
    @sync for x in eachindex(x_range)
        # Threads.@spawn allows dynamic scheduling instead of static scheduling
        # of Threads.@threads macro.
        # See <https://github.com/JuliaLang/julia/issues/21017>.
        Threads.@spawn @inbounds begin
          inner_turbo(result, x_range, y_range, x, N)
        end
    end
    return result
end
function mandelbrot!(result, N = Val(8))
  x_range = range(xmin, xmax, xn)
  y_range = range(ymin, ymax, xn)
  mandelbrot!(result, x_range, y_range, N)
end
function mandelbrot_turbo!(result, N = Val(8))
  x_range = range(xmin, xmax, xn)
  y_range = range(ymin, ymax, xn)
  mandelbrot_turbo!(result, x_range, y_range, N)
end


result = zeros(yn, xn);

@benchmark mandelbrot!($result, Val(8))
@benchmark mandelbrot_turbo!($result, Val(8))
@benchmark compute_mandelbrot!($result)
@benchmark compute_mandelbrot_turbo!($result)
@benchmark compute_mandelbrot_polyturbo!($result)
@benchmark compute_mandelbrot_polyturbo!($result, $perm)
@benchmark compute_mandelbrot_polyturbo_stride!($result)

Benchmarks:

julia> @benchmark mandelbrot!($result, Val(8))
BenchmarkTools.Trial: 253 samples with 1 evaluation.
 Range (min  max):  3.793 ms   13.201 ms  ┊ GC (min  max): 0.00%  67.68%
 Time  (median):     3.865 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   3.957 ms ± 719.601 μs  ┊ GC (mean ± σ):  1.79% ±  6.30%

    ▁ ▂█▆▆▅█ ▄▆                                                
  ▄▃█▇██████▇██▇▅▄▄▆▄▅▄▁▄▃▄▃▃▁▃▁▃▄▃▃▁▁▃▁▁▃▃▃▁▁▁▁▁▁▁▃▃▁▁▁▁▁▁▁▃ ▃
  3.79 ms         Histogram: frequency by time        4.25 ms <

 Memory estimate: 598.83 KiB, allocs estimate: 5027.

julia> @benchmark mandelbrot_turbo!($result, Val(8))
BenchmarkTools.Trial: 796 samples with 1 evaluation.
 Range (min  max):  1.115 ms    4.934 ms  ┊ GC (min  max): 0.00%  70.76%
 Time  (median):     1.202 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   1.253 ms ± 429.366 μs  ┊ GC (mean ± σ):  3.80% ±  8.28%

   █                                                           
  ██▅▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂ ▂
  1.11 ms         Histogram: frequency by time        4.83 ms <

 Memory estimate: 610.58 KiB, allocs estimate: 5403.

julia> @benchmark compute_mandelbrot!($result)
BenchmarkTools.Trial: 67 samples with 1 evaluation.
 Range (min  max):  14.134 ms   16.270 ms  ┊ GC (min  max): 0.00%  0.00%
 Time  (median):     14.927 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   15.003 ms ± 644.629 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▂ █  ▂▂ ▂  ▂▅          ▂                          ▂ ▅         
  █▁█▁▅██▁█▅▅██▅▁▅▅▁▁▁▁█▁█▅█▅▁▅▁█▅▅▅▁▁▁▅▁▁▁▅█▅▅▁▁▁▅████▅▅▁▁▁▁▅ ▁
  14.1 ms         Histogram: frequency by time         16.2 ms <

 Memory estimate: 18.69 KiB, allocs estimate: 177.

julia> @benchmark compute_mandelbrot_turbo!($result)
BenchmarkTools.Trial: 398 samples with 1 evaluation.
 Range (min  max):  2.490 ms   2.707 ms  ┊ GC (min  max): 0.00%  0.00%
 Time  (median):     2.497 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.516 ms ± 35.502 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

   ▂▇█▇▃                                               ▁▃▁▁   
  ▇███████▅▄▄▄▄▄▅▄█▆▄█▇▇▇▆▆▇▇▁▁▄▅▁▁▁▁▁▁▁▁▁▁▄▄▅▅▁▁▁▅▁▁▆█████▆ ▆
  2.49 ms      Histogram: log(frequency) by time      2.6 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark compute_mandelbrot_polyturbo!($result)
BenchmarkTools.Trial: 645 samples with 1 evaluation.
 Range (min  max):  1.491 ms    5.557 ms  ┊ GC (min  max): 0.00%  0.00%
 Time  (median):     1.517 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   1.550 ms ± 202.571 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▃█▅▅▂▂                                                       
  ██████▆▆█▇█▅▆▁▄▅▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄▁▁▁▁▁▄▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▅ ▇
  1.49 ms      Histogram: log(frequency) by time      2.26 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark compute_mandelbrot_polyturbo!($result, $perm)
BenchmarkTools.Trial: 1054 samples with 1 evaluation.
 Range (min  max):  867.041 μs    5.591 ms  ┊ GC (min  max): 0.00%  0.00%
 Time  (median):     871.859 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   946.894 μs ± 249.207 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █▃▃▂▁▁                                                         
  ██████▇▆▆██▇▇▃▁▁▃▃▁▁▁▁▁▄▃▃▃▁▅▆▅▆▇▃▃▃▁▆▄▅▄▃▁▃▁▁▁▃▄▃▃▅▃▁▁▃▆▅▅▆▅ ▇
  867 μs        Histogram: log(frequency) by time       1.68 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark compute_mandelbrot_polyturbo_stride!($result)
BenchmarkTools.Trial: 1181 samples with 1 evaluation.
 Range (min  max):  721.232 μs    5.919 ms  ┊ GC (min  max): 0.00%  0.00%
 Time  (median):     728.151 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   844.556 μs ± 360.679 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █▃▄▁▁▃                ▁                                        
  ███████▆▁▃▄▅▁▁▃▅▇▅▁▁▁▇█▇▅▄▃▆▄▆▇▆▇▅▄▄▅▅▄▅▄▁▁▁▁▁▁▁▃▅▅▆▇▅▅▄▃▃▁▁▄ ▇
  721 μs        Histogram: log(frequency) by time       1.81 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.

Note in particular that it was 2x faster than per=thread (note, stride=true forces per=thread), and also faster than the random permutation.

@codecov
Copy link

codecov bot commented Sep 16, 2023

Codecov Report

Patch coverage: 94.73% and project coverage change: -0.19% ⚠️

Comparison is base (ab78dea) 91.31% compared to head (8d1988f) 91.13%.

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #118      +/-   ##
==========================================
- Coverage   91.31%   91.13%   -0.19%     
==========================================
  Files           3        3              
  Lines         426      451      +25     
==========================================
+ Hits          389      411      +22     
- Misses         37       40       +3     
Files Changed Coverage Δ
src/closure.jl 89.77% <94.54%> (-0.16%) ⬇️
src/Polyester.jl 100.00% <100.00%> (ø)
src/batch.jl 93.98% <100.00%> (ø)

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@chriselrod
Copy link
Member Author

chriselrod commented Sep 16, 2023

I haven't been able to reproduce the segfault locally, starting Julia with

GITHUB_ACTIONS=true JULIA_CPU_THREADS=3 julia --project=@polyester --check-bounds=yes -t4

and then running

for i = 1:100; @show i; @time include(joinpath(homedir(), ".julia/dev/Polyester/test/runtests.jl"); end

@chriselrod chriselrod merged commit f0e3ef3 into master Sep 16, 2023
57 of 59 checks passed
@chriselrod
Copy link
Member Author

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

1 participant