Suboptimal performance in reduction #126

Open
carstenbauer opened this issue Apr 1, 2022 · 0 comments

As discussed on Slack, I see poor to suboptimal performance with FLoops for a simple reduction (a multithreaded Monte Carlo estimate of π).

using FLoops          # provides @floop and @reduce
using BenchmarkTools  # provides @btime

function estimate_pi_floop_1(attempts)
    hits = 0
    @floop for i in 1:attempts
        x = rand()
        y = rand()
        if (x^2 + y^2) <= 1
            @reduce(hits += 1)
        end
    end
    return 4.0 * (hits / attempts)
end

function estimate_pi_floop_2(attempts)
    hits = 0
    @floop for i in 1:attempts
        x = rand()
        y = rand()
        if (x^2 + y^2) <= 1
            @reduce(hits = 0 + 1)
        end
    end
    return 4.0 * (hits / attempts)
end

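function estimate_pi_threads_partitioned(attempts)
    nt = Threads.nthreads()
    # Round up so that at least `attempts` samples are drawn in total
    # (`ceil(Int, attempts ÷ nt)` was a no-op: `÷` already truncates).
    attempts_per_thread = cld(attempts, nt)
    hits = zeros(Int, nt)
    Threads.@threads for i in 1:nt
        h = 0
        for _ in 1:attempts_per_thread
            x = rand()
            y = rand()
            if (x^2 + y^2) <= 1
                h += 1
            end
        end
        # Index by the loop variable; threadid() is not a stable slot index
        # under dynamic scheduling.
        hits[i] = h
    end
    # Normalize by the number of samples actually drawn.
    return 4.0 * (sum(hits) / (attempts_per_thread * nt))
end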

Benchmarks (@btime from BenchmarkTools):

julia> @btime estimate_pi_floop_1(500_000_000)
  2.664 s (125000108 allocations: 1.86 GiB)

julia> @btime estimate_pi_floop_2(500_000_000)
  258.906 ms (64 allocations: 3.88 KiB)

julia> @btime estimate_pi_threads_partitioned(500_000_000)
  208.475 ms (42 allocations: 4.00 KiB)

So:

  • using @reduce(hits = 0 + 1) instead of @reduce(hits += 1) makes a huge difference: 258.906 ms versus 2.664 s (roughly 10x), and 64 allocations versus 125000108
  • even with @reduce(hits = 0 + 1), FLoops is still about 24% slower than estimate_pi_threads_partitioned (258.906 ms versus 208.475 ms), which, IIUC, should be similar to what FLoops produces under the hood.
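
For comparison, here is a minimal sketch of the same chunked reduction written by hand with tasks, using only Base. It is not a claim about what FLoops actually generates; the names count_hits and estimate_pi_spawn and the ntasks keyword are hypothetical, introduced only for illustration. Each task returns its own partial count, so no shared array or threadid() indexing is needed.

```julia
# Count how many of n random points in the unit square fall inside the
# quarter circle. (Hypothetical helper, not part of FLoops.)
function count_hits(n)
    h = 0
    for _ in 1:n
        x = rand()
        y = rand()
        h += (x^2 + y^2) <= 1   # Bool is added to the Int counter as 0 or 1
    end
    return h
end

# Hand-written chunked reduction: one task per chunk, partial counts are
# combined with sum. (Hypothetical name; ntasks is an assumed keyword.)
function estimate_pi_spawn(attempts; ntasks = Threads.nthreads())
    # Split the work into ntasks chunks whose sizes add up to attempts.
    base, extra = divrem(attempts, ntasks)
    tasks = map(1:ntasks) do t
        n = base + (t <= extra ? 1 : 0)
        Threads.@spawn count_hits(n)
    end
    hits = sum(fetch, tasks)
    return 4.0 * hits / attempts
end
```

Calling estimate_pi_spawn(500_000_000) under @btime would give a third baseline next to the two FLoops variants; no timing is claimed for it here.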

Thanks for taking a look!
