Knowing array size at compile time does not always help #33927

gasagna · 2019-11-23T18:57:48Z

I have a custom array type where the size is stored as one the type parameters. I hoped to increase performance in tight looping, since the indexing limits would be known at compile time. However, I have noticed that this approach can result in variable performance, depending on the size.

I managed to get a MWE, that does not use my custom array type, but shows the same symptoms. It consists of two mysum functions, looping over a 3D Julia array. In the first mysum1 the indexing bounds are known at compile time, in the other mysum2, they are calculated at runtime.

using BenchmarkTools
using Printf

function mysum1(a, ::Val{N}) where {N}
    S = zero(eltype(a))
    for k = 1:N
        for j = 1:N
            @simd for i = 1:N
                @inbounds S += a[i, j, k]
            end
        end
    end
    return S
end

function mysum2(a)
    S = zero(eltype(a))
    for k = 1:size(a, 3)
        for j = 1:size(a, 2)
            @simd for i = 1:size(a, 1)
                @inbounds S += a[i, j, k]
            end
        end
    end
    return S
end

t1s = Float64[]
t2s = Float64[]
Ns = 10:150
for N = Ns
    a = rand(N, N, N)
    vn = Val(N)
    t1 = minimum([@elapsed mysum1(a, vn) for i = 1:1000])
    t2 = minimum([@elapsed mysum2(a)     for i = 1:1000])
    @printf "%d %.6f %.6f %.6f\n" N 1000*t1 1000*t2 t1/t2
    push!(t1s, t1)
    push!(t2s, t2)
end

using PyPlot; pygui(true)
plot(Ns, 1e9.*t1s./(Ns.^3), label=L"10^9 t1 / N^3")
plot(Ns, 1e9.*t2s./(Ns.^3), label=L"10^9 t2 / N^3")
plot(Ns, t1s./t2s,          label="t1/t2")
xlabel(L"N")
ylim(0, 1.7)
legend(loc=2)
grid()
savefig("figure.png", dpi=600)

The data is as follows:

There seem to be a threshold around N = 36 below which knowing the array size at compile time is always advantageous. Above that, the speed up with respect to not knowing the size can vary significantly.

I have examined the @code_llvm output and both functions usually produce vectorised code. However, it appears that when mysum2 is faster than mysum1, there is quite some unrolling going on.

For instance, at N=64, we have from mysum1 (slower)

vector.body:                                      ; preds = %vector.body, %L6
; │ @ simdloop.jl:74 within `macro expansion'
; │┌ @ int.jl:53 within `+'
    %index = phi i64 [ 0, %L6 ], [ %index.next, %vector.body ]
    %vec.phi = phi <4 x double> [ %14, %L6 ], [ %18, %vector.body ]
; │└
; │ @ simdloop.jl:73 within `macro expansion' @ /Users/davide/Desktop/mwe.jl:10
; │┌ @ array.jl:730 within `getindex'
    %15 = add i64 %reass.mul, %index
    %16 = getelementptr inbounds double, double addrspace(13)* %10, i64 %15
    %17 = bitcast double addrspace(13)* %16 to <4 x double> addrspace(13)*
    %wide.load = load <4 x double>, <4 x double> addrspace(13)* %17, align 8
; │└
; │┌ @ float.jl:395 within `+'
    %18 = fadd fast <4 x double> %vec.phi, %wide.load
; │└
; │ @ simdloop.jl:74 within `macro expansion'
; │┌ @ int.jl:53 within `+'
    %index.next = add i64 %index, 4
    %19 = icmp eq i64 %index.next, 64
    br i1 %19, label %middle.block, label %vector.body

while for mysum2 (faster):

vector.body:                                      ; preds = %vector.body, %vector.ph
; │ @ simdloop.jl:74 within `macro expansion'
; │┌ @ int.jl:53 within `+'
    %index = phi i64 [ 0, %vector.ph ], [ %index.next, %vector.body ]
    %vec.phi = phi <4 x double> [ %24, %vector.ph ], [ %34, %vector.body ]
    %vec.phi83 = phi <4 x double> [ zeroinitializer, %vector.ph ], [ %35, %vector.body ]
    %vec.phi84 = phi <4 x double> [ zeroinitializer, %vector.ph ], [ %36, %vector.body ]
    %vec.phi85 = phi <4 x double> [ zeroinitializer, %vector.ph ], [ %37, %vector.body ]
; │└
; │ @ simdloop.jl:73 within `macro expansion' @ /Users/davide/Desktop/mwe.jl:22
; │┌ @ array.jl:730 within `getindex'
    %25 = add i64 %reass.mul.us56, %index
    %26 = getelementptr inbounds double, double addrspace(13)* %21, i64 %25
    %27 = bitcast double addrspace(13)* %26 to <4 x double> addrspace(13)*
    %wide.load = load <4 x double>, <4 x double> addrspace(13)* %27, align 8
    %28 = getelementptr double, double addrspace(13)* %26, i64 4
    %29 = bitcast double addrspace(13)* %28 to <4 x double> addrspace(13)*
    %wide.load86 = load <4 x double>, <4 x double> addrspace(13)* %29, align 8
    %30 = getelementptr double, double addrspace(13)* %26, i64 8
    %31 = bitcast double addrspace(13)* %30 to <4 x double> addrspace(13)*
    %wide.load87 = load <4 x double>, <4 x double> addrspace(13)* %31, align 8
    %32 = getelementptr double, double addrspace(13)* %26, i64 12
    %33 = bitcast double addrspace(13)* %32 to <4 x double> addrspace(13)*
    %wide.load88 = load <4 x double>, <4 x double> addrspace(13)* %33, align 8
; │└
; │┌ @ float.jl:395 within `+'
    %34 = fadd fast <4 x double> %vec.phi, %wide.load
    %35 = fadd fast <4 x double> %vec.phi83, %wide.load86
    %36 = fadd fast <4 x double> %vec.phi84, %wide.load87
    %37 = fadd fast <4 x double> %vec.phi85, %wide.load88
; │└
; │ @ simdloop.jl:74 within `macro expansion'
; │┌ @ int.jl:53 within `+'
    %index.next = add i64 %index, 16
    %38 = icmp eq i64 %index.next, %n.vec
    br i1 %38, label %middle.block, label %vector.body

Note this is on

davide@vega - Desktop$: julia-master
               _
   _       _ _(_)_     |  Documentation: https://docs.julialang.org
  (_)     | (_) (_)    |
   _ _   _| |_  __ _   |  Type "?" for help, "]?" for Pkg help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 1.4.0-DEV.374 (2019-10-24)
 _/ |\__'_|_|_|\__'_|  |  Commit f806717255 (30 days old master)
|__/                   |

The text was updated successfully, but these errors were encountered:

stevengj added the performance Must go faster label Nov 26, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Knowing array size at compile time does not always help #33927

Knowing array size at compile time does not always help #33927

gasagna commented Nov 23, 2019 •

edited

Knowing array size at compile time does not always help #33927

Knowing array size at compile time does not always help #33927

Comments

gasagna commented Nov 23, 2019 • edited

gasagna commented Nov 23, 2019 •

edited