Issues with `heuristic_workgroup` #2828

glwagner · 2022-11-17T20:11:34Z

This is our heuristic workgroup for all kernels:

Oceananigans.jl/src/Utils/kernel_launching.jl

Lines 11 to 32 in 3e0781c

    
    function heuristic_workgroup(Wx, Wy, Wz=nothing)

        workgroup = Wx == 1 && Wy == 1 ?

                        # One-dimensional column models:
                        (1, 1) :

                    Wx == 1 ?

                        # Two-dimensional y-z slice models:
                        (1, min(256, Wy)) :

                    Wy == 1 ?

                        # Two-dimensional x-z slice models:
                        (1, min(256, Wx)) :

                        # Three-dimensional models
                        (16, 16)

        return workgroup
    end

For large 3D kernels on the GPU this is correct because workgroup = (16, 16) yields the recommended total workgroup size of 256:

https://juliagpu.github.io/KernelAbstractions.jl/stable/quickstart/#Terminology-1

Note that "workgroup" here refers to the "total size of a working group", which is the size of the chunks of data that are processed in parallel. We may want to rename `workgroup` to `workgroupsize`.

Finally, note that "1" is appended to the workgroup size (all our kernels are 3D):

https://github.com/JuliaGPU/KernelAbstractions.jl/blob/ecd2c3785849334df0a3157edb5a9ca229d6565a/src/nditeration.jl#L105
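To make the two points above concrete, here is a minimal plain-Julia sketch of what padding a two-element workgroup with a trailing 1 looks like, and why (16, 16) gives a total of 256. The helper `pad_to_3d` is hypothetical and only imitates the effect described above; it is not the KernelAbstractions code linked here.

```julia
# Hypothetical helper: imitates appending trailing 1s to a workgroup tuple.
# This is an illustration, not KernelAbstractions' implementation.
pad_to_3d(workgroup::Tuple) = ntuple(i -> i <= length(workgroup) ? workgroup[i] : 1, 3)

workgroup = (16, 16)            # what heuristic_workgroup returns for 3D models
padded    = pad_to_3d(workgroup) # trailing dimension filled with 1
total     = prod(padded)         # number of work items in one workgroup
```

With this padding, the column-model workgroup `(1, 1)` becomes `(1, 1, 1)`, which is the case flagged as too small below.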

These are my issues:

1. `workgroupsize` for column models is (1, 1, 1). This is too small on the GPU, where it should be (1, 1, min(256, Nz)).
2. `workgroupsize` for slice models does not chop up the third dimension. Why?
3. According to a comment on JuliaGPU/KernelAbstractions.jl#335 ("Fix quickstart"), we want different behavior on the CPU, depending on the number of threads we are using.
4. Finally, I think that `workgroupsize` should be a settable property of the CPU architecture so that we can opt in to effectively single-threaded behavior with workgroupsize = (Nx, Ny, Nz). With this abstraction we can also precalculate it on construction, if we want, which might make it more visible to users.
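A rough sketch of what a settable `workgroupsize` on a CPU architecture could look like follows. Everything here is hypothetical: `ConfigurableCPU` and `resolve_workgroupsize` are illustrative names, not Oceananigans types or functions, and the `(16, 16, 1)` fallback is only a placeholder for whatever heuristic would be used when nothing is set.

```julia
# Hypothetical architecture type carrying an optional, user-set workgroup size.
# `nothing` means "no preference: fall back to a heuristic".
struct ConfigurableCPU{W}
    workgroupsize :: W
end

# Keyword constructor so the property can be set (or omitted) on construction.
ConfigurableCPU(; workgroupsize=nothing) = ConfigurableCPU(workgroupsize)

# No user setting: placeholder fallback (assumption, stands in for a real heuristic).
resolve_workgroupsize(::ConfigurableCPU{Nothing}, worksize) = (16, 16, 1)

# User setting: use it as given.
resolve_workgroupsize(arch::ConfigurableCPU, worksize) = arch.workgroupsize

Nx, Ny, Nz = 8, 8, 4

# Opting in to one workgroup that spans the whole domain.
single  = ConfigurableCPU(workgroupsize=(Nx, Ny, Nz))
default = ConfigurableCPU()

chosen_single  = resolve_workgroupsize(single,  (Nx, Ny, Nz))
chosen_default = resolve_workgroupsize(default, (Nx, Ny, Nz))
```

Storing the value on the architecture means it is computed once at construction and is visible to anyone inspecting the object, which is the visibility benefit mentioned in the last item.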

However, the workgroup for column and slice models is not optimal because we never slice the z-dimension, as in:

Oceananigans.jl/src/Utils/kernel_launching.jl

Line 16 in 3e0781c

(1, 1) :

JuliaGPU/KernelAbstractions.jl#335 (comment)
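One possible way to slice the z-dimension for column and slice models is sketched below. This is an assumption about how the heuristic might be revised, not the change that was made in Oceananigans: `proposed_workgroupsize` is a hypothetical name, the column case follows the (1, 1, min(256, Nz)) suggestion above, and the slice cases simply spend whatever is left of the 256-item budget on z.

```julia
# Hypothetical revision of the heuristic (illustration only).
# Always returns a 3-tuple whose product does not exceed 256.
function proposed_workgroupsize(Wx, Wy, Wz)
    if Wx == 1 && Wy == 1
        # Column models: place the whole workgroup along z.
        return (1, 1, min(256, Wz))
    elseif Wx == 1
        # y-z slice models: fill y first, then use the remaining budget in z.
        wy = min(256, Wy)
        return (1, wy, min(div(256, wy), Wz))
    elseif Wy == 1
        # x-z slice models: fill x first, then use the remaining budget in z.
        wx = min(256, Wx)
        return (wx, 1, min(div(256, wx), Wz))
    else
        # Three-dimensional models: unchanged from the current heuristic.
        return (16, 16, 1)
    end
end

column = proposed_workgroupsize(1, 1, 100)
yz     = proposed_workgroupsize(1, 64, 32)
xz     = proposed_workgroupsize(128, 1, 8)
full   = proposed_workgroupsize(32, 32, 32)
```

Whether slices should prioritize the horizontal dimension or z is an open design choice; the ordering here is only one option.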


glwagner · 2023-03-22T15:38:45Z

This is stale.

glwagner added the "feature 🌟" and "performance 🏍️" labels Nov 17, 2022

glwagner closed this as completed Mar 22, 2023
