Issues with heuristic_workgroup #2828

Closed
glwagner opened this issue Nov 17, 2022 · 1 comment
Labels
feature 🌟 Something new and shiny performance 🏍️ So we can get the wrong answer even faster

Comments

@glwagner
Member

This is our heuristic workgroup for all kernels:

    function heuristic_workgroup(Wx, Wy, Wz=nothing)

        workgroup = Wx == 1 && Wy == 1 ?

                        # One-dimensional column models:
                        (1, 1) :

                    Wx == 1 ?

                        # Two-dimensional y-z slice models:
                        (1, min(256, Wy)) :

                    Wy == 1 ?

                        # Two-dimensional x-z slice models:
                        (1, min(256, Wx)) :

                        # Three-dimensional models
                        (16, 16)

        return workgroup
    end

For large 3D kernels on the GPU this is appropriate, because workgroup = (16, 16) yields the recommended total workgroup size of 16 × 16 = 256:

https://juliagpu.github.io/KernelAbstractions.jl/stable/quickstart/#Terminology-1
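To make the totals concrete, here is a small sketch that evaluates the heuristic for one example of each model type and prints the product of the returned tuple. The function is a compact copy of the one quoted above so the snippet runs on its own; the grid sizes are made-up examples, not values from any particular simulation.

```julia
# Compact copy of the heuristic quoted above, so this snippet is self-contained.
function heuristic_workgroup(Wx, Wy, Wz=nothing)
    workgroup = Wx == 1 && Wy == 1 ? (1, 1) :
                Wx == 1 ? (1, min(256, Wy)) :
                Wy == 1 ? (1, min(256, Wx)) :
                (16, 16)
    return workgroup
end

# Illustrative work sizes (assumed): 3D, y-z slice, x-z slice, column.
for worksize in ((128, 128, 64), (1, 128, 64), (128, 1, 64), (1, 1, 64))
    wg = heuristic_workgroup(worksize...)
    println(worksize, " => workgroup ", wg, ", total size ", prod(wg))
end
```

For these inputs only the 3D case reaches a total of 256; the two slices get 128 and the column gets 1.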

Note that "workgroup" refers to the "total size of a working group", which is the size of the chunks of data that are processed in parallel. We may want to rename "workgroup" to "workgroupsize".

Finally, note that "1" is appended to the workgroup size (all our kernels are 3D):

https://github.com/JuliaGPU/KernelAbstractions.jl/blob/ecd2c3785849334df0a3157edb5a9ca229d6565a/src/nditeration.jl#L105
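A plain-Julia sketch of what that padding amounts to, and of how many workgroups result. `pad_workgroup` is a hypothetical helper written for illustration; it is not KernelAbstractions' implementation, and the work size is an assumed example.

```julia
# Hypothetical helper (not part of KernelAbstractions): pad a workgroup
# tuple with trailing 1s until it has as many dimensions as the work size.
pad_workgroup(workgroup, worksize) =
    ntuple(i -> i <= length(workgroup) ? workgroup[i] : 1, length(worksize))

worksize = (128, 128, 64)                    # assumed 3D kernel size
padded = pad_workgroup((16, 16), worksize)   # (16, 16, 1)
ngroups = cld.(worksize, padded)             # groups needed along each dimension

println("padded workgroup: ", padded)
println("groups per dimension: ", ngroups, ", total: ", prod(ngroups))
```

With the same padding, the column-model workgroup (1, 1) becomes (1, 1, 1), which is the case the first item below is about.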

These are my issues:

  • workgroupsize for column models is (1, 1, 1). This is too small on the GPU, where it should be (1, 1, min(256, Nz)).
  • workgroupsize for slice models does not chop up the third dimension. Why?
  • According to Fix quickstart JuliaGPU/KernelAbstractions.jl#335 (comment) we want different behavior on the CPU, according to the number of threads we are using.
  • Finally, I think that workgroupsize should be a settable property of the CPU architecture so that we can opt in to effectively single-threaded behavior with workgroupsize = (Nx, Ny, Nz). With this abstraction we can also precalculate it on construction, if we want, which might make it more visible to users.
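A rough sketch of how the first and last points might look together. Everything here is hypothetical: `column_aware_workgroup`, `SketchCPU` and `choose_workgroup` are illustrative names, not existing Oceananigans or KernelAbstractions API. The slice cases are left unchanged because the question about them is still open, and the thread-count dependence from the third point is not attempted.

```julia
# Hypothetical variant of the heuristic that returns a full 3-tuple.
# Only the column case changes: it now slices the vertical dimension.
function column_aware_workgroup(Wx, Wy, Wz)
    Wx == 1 && Wy == 1 && return (1, 1, min(256, Wz))  # column models
    Wx == 1 && return (1, min(256, Wy), 1)             # y-z slices, as before
    Wy == 1 && return (1, min(256, Wx), 1)             # x-z slices, as before
    return (16, 16, 1)                                 # three-dimensional models
end

# Hypothetical CPU architecture with a settable workgroup size.
# `nothing` means "fall back to the heuristic".
struct SketchCPU{W}
    workgroupsize :: W
end

SketchCPU(; workgroupsize=nothing) = SketchCPU(workgroupsize)

choose_workgroup(::SketchCPU{Nothing}, worksize) = column_aware_workgroup(worksize...)
choose_workgroup(arch::SketchCPU, worksize) = arch.workgroupsize

Nx, Ny, Nz = 128, 128, 64   # assumed grid size

# Default: the heuristic decides, and a column model gets (1, 1, min(256, Nz)).
@show choose_workgroup(SketchCPU(), (1, 1, Nz))

# Opt-in: one workgroup spans the whole domain, i.e. effectively single threaded.
@show choose_workgroup(SketchCPU(workgroupsize=(Nx, Ny, Nz)), (Nx, Ny, Nz))
```

Storing the size on the architecture object is one way to get the "precalculate on construction" behavior; whether it belongs there or on the grid is a design choice this sketch does not settle.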

However, column and slice models are not optimal because we never slice the z-dimension, so

JuliaGPU/KernelAbstractions.jl#335 (comment)

@glwagner glwagner added feature 🌟 Something new and shiny performance 🏍️ So we can get the wrong answer even faster labels Nov 17, 2022
@glwagner
Member Author

This is stale.
