Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Knowing array size at compile time does not always help #33927

Open
gasagna opened this issue Nov 23, 2019 · 0 comments
Open

Knowing array size at compile time does not always help #33927

gasagna opened this issue Nov 23, 2019 · 0 comments
Labels
performance Must go faster

Comments

@gasagna
Copy link
Contributor

gasagna commented Nov 23, 2019

I have a custom array type where the size is stored as one the type parameters. I hoped to increase performance in tight looping, since the indexing limits would be known at compile time. However, I have noticed that this approach can result in variable performance, depending on the size.

I managed to get a MWE, that does not use my custom array type, but shows the same symptoms. It consists of two mysum functions, looping over a 3D Julia array. In the first mysum1 the indexing bounds are known at compile time, in the other mysum2, they are calculated at runtime.

using BenchmarkTools
using Printf

function mysum1(a, ::Val{N}) where {N}
    S = zero(eltype(a))
    for k = 1:N
        for j = 1:N
            @simd for i = 1:N
                @inbounds S += a[i, j, k]
            end
        end
    end
    return S
end

function mysum2(a)
    S = zero(eltype(a))
    for k = 1:size(a, 3)
        for j = 1:size(a, 2)
            @simd for i = 1:size(a, 1)
                @inbounds S += a[i, j, k]
            end
        end
    end
    return S
end

t1s = Float64[]
t2s = Float64[]
Ns = 10:150
for N = Ns
    a = rand(N, N, N)
    vn = Val(N)
    t1 = minimum([@elapsed mysum1(a, vn) for i = 1:1000])
    t2 = minimum([@elapsed mysum2(a)     for i = 1:1000])
    @printf "%d %.6f %.6f %.6f\n" N 1000*t1 1000*t2 t1/t2
    push!(t1s, t1)
    push!(t2s, t2)
end

using PyPlot; pygui(true)
plot(Ns, 1e9.*t1s./(Ns.^3), label=L"10^9 t1 / N^3")
plot(Ns, 1e9.*t2s./(Ns.^3), label=L"10^9 t2 / N^3")
plot(Ns, t1s./t2s,          label="t1/t2")
xlabel(L"N")
ylim(0, 1.7)
legend(loc=2)
grid()
savefig("figure.png", dpi=600)

The data is as follows:
figure

There seem to be a threshold around N = 36 below which knowing the array size at compile time is always advantageous. Above that, the speed up with respect to not knowing the size can vary significantly.

I have examined the @code_llvm output and both functions usually produce vectorised code. However, it appears that when mysum2 is faster than mysum1, there is quite some unrolling going on.

For instance, at N=64, we have from mysum1 (slower)

vector.body:                                      ; preds = %vector.body, %L6
; │ @ simdloop.jl:74 within `macro expansion'
; │┌ @ int.jl:53 within `+'
    %index = phi i64 [ 0, %L6 ], [ %index.next, %vector.body ]
    %vec.phi = phi <4 x double> [ %14, %L6 ], [ %18, %vector.body ]
; │└
; │ @ simdloop.jl:73 within `macro expansion' @ /Users/davide/Desktop/mwe.jl:10
; │┌ @ array.jl:730 within `getindex'
    %15 = add i64 %reass.mul, %index
    %16 = getelementptr inbounds double, double addrspace(13)* %10, i64 %15
    %17 = bitcast double addrspace(13)* %16 to <4 x double> addrspace(13)*
    %wide.load = load <4 x double>, <4 x double> addrspace(13)* %17, align 8
; │└
; │┌ @ float.jl:395 within `+'
    %18 = fadd fast <4 x double> %vec.phi, %wide.load
; │└
; │ @ simdloop.jl:74 within `macro expansion'
; │┌ @ int.jl:53 within `+'
    %index.next = add i64 %index, 4
    %19 = icmp eq i64 %index.next, 64
    br i1 %19, label %middle.block, label %vector.body

while for mysum2 (faster):

vector.body:                                      ; preds = %vector.body, %vector.ph
; │ @ simdloop.jl:74 within `macro expansion'
; │┌ @ int.jl:53 within `+'
    %index = phi i64 [ 0, %vector.ph ], [ %index.next, %vector.body ]
    %vec.phi = phi <4 x double> [ %24, %vector.ph ], [ %34, %vector.body ]
    %vec.phi83 = phi <4 x double> [ zeroinitializer, %vector.ph ], [ %35, %vector.body ]
    %vec.phi84 = phi <4 x double> [ zeroinitializer, %vector.ph ], [ %36, %vector.body ]
    %vec.phi85 = phi <4 x double> [ zeroinitializer, %vector.ph ], [ %37, %vector.body ]
; │└
; │ @ simdloop.jl:73 within `macro expansion' @ /Users/davide/Desktop/mwe.jl:22
; │┌ @ array.jl:730 within `getindex'
    %25 = add i64 %reass.mul.us56, %index
    %26 = getelementptr inbounds double, double addrspace(13)* %21, i64 %25
    %27 = bitcast double addrspace(13)* %26 to <4 x double> addrspace(13)*
    %wide.load = load <4 x double>, <4 x double> addrspace(13)* %27, align 8
    %28 = getelementptr double, double addrspace(13)* %26, i64 4
    %29 = bitcast double addrspace(13)* %28 to <4 x double> addrspace(13)*
    %wide.load86 = load <4 x double>, <4 x double> addrspace(13)* %29, align 8
    %30 = getelementptr double, double addrspace(13)* %26, i64 8
    %31 = bitcast double addrspace(13)* %30 to <4 x double> addrspace(13)*
    %wide.load87 = load <4 x double>, <4 x double> addrspace(13)* %31, align 8
    %32 = getelementptr double, double addrspace(13)* %26, i64 12
    %33 = bitcast double addrspace(13)* %32 to <4 x double> addrspace(13)*
    %wide.load88 = load <4 x double>, <4 x double> addrspace(13)* %33, align 8
; │└
; │┌ @ float.jl:395 within `+'
    %34 = fadd fast <4 x double> %vec.phi, %wide.load
    %35 = fadd fast <4 x double> %vec.phi83, %wide.load86
    %36 = fadd fast <4 x double> %vec.phi84, %wide.load87
    %37 = fadd fast <4 x double> %vec.phi85, %wide.load88
; │└
; │ @ simdloop.jl:74 within `macro expansion'
; │┌ @ int.jl:53 within `+'
    %index.next = add i64 %index, 16
    %38 = icmp eq i64 %index.next, %n.vec
    br i1 %38, label %middle.block, label %vector.body

Note this is on

davide@vega - Desktop$: julia-master
               _
   _       _ _(_)_     |  Documentation: https://docs.julialang.org
  (_)     | (_) (_)    |
   _ _   _| |_  __ _   |  Type "?" for help, "]?" for Pkg help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 1.4.0-DEV.374 (2019-10-24)
 _/ |\__'_|_|_|\__'_|  |  Commit f806717255 (30 days old master)
|__/                   |
@stevengj stevengj added the performance Must go faster label Nov 26, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
performance Must go faster
Projects
None yet
Development

No branches or pull requests

2 participants