Adjust block sizes in offload kernels #3910
Merged
Proposed changes
The full loop iteration space is sliced into chunks of ChunkSizePerTeam to increase the number of CUDA blocks/workgroups on the GPU.
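For illustration, here is a minimal sketch of this chunking pattern (a hypothetical kernel, not the PR's actual code; `ChunkSizePerTeam` and `scale_in_chunks` are stand-ins). Each team, i.e. CUDA block, covers one chunk of iterations, so a smaller chunk size yields more teams:

```cpp
#include <cstddef>

// Illustrative value; this PR raises the chunk size used in the offload kernels.
constexpr size_t ChunkSizePerTeam = 512;

void scale_in_chunks(float* data, float alpha, size_t n)
{
  // Round up so a partial final chunk still gets its own team.
  const size_t num_teams = (n + ChunkSizePerTeam - 1) / ChunkSizePerTeam;

#pragma omp target teams distribute num_teams(num_teams) map(tofrom : data[:n])
  for (size_t team = 0; team < num_teams; ++team)
  {
    const size_t first = team * ChunkSizePerTeam;
    const size_t last  = (first + ChunkSizePerTeam < n) ? first + ChunkSizePerTeam : n;
    // Threads within the team (CUDA block) split the chunk.
#pragma omp parallel for
    for (size_t i = first; i < last; ++i)
      data[i] *= alpha;
  }
}
```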
Larger block sizes allow more resident computation and are thus more efficient, but they lose time to solution when the resulting number of blocks drops too low to keep the device busy. For small problem sizes, the number of blocks is not an issue and efficiency matters more. With this change I noticed much closer kernel performance between legacy CUDA and OpenMP offload.
A larger block size only affects iteration spaces larger than the block size, namely cases where the number of blocks actually changes.
The increase from 128 to 512 actually improves kernel efficiency a lot for small to medium problem sizes with smaller CUDA grid sizes. For example, NiO a32 has 288 splines, so it needs 3 blocks of size 128, 2 blocks of size 256, or 1 block of size 512. 512 should be good for both NVIDIA and AMD: 64 CUDA cores per SM and 64 stream processors per CU.
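The block counts above come from ceiling division of the iteration count by the chunk size; a small sketch of that arithmetic (not project code):

```cpp
#include <cstdio>

// Number of blocks needed to cover n iterations with a given chunk size.
constexpr unsigned num_blocks(unsigned n, unsigned chunk) { return (n + chunk - 1) / chunk; }

int main()
{
  const unsigned splines = 288; // NiO a32
  for (unsigned chunk : {128u, 256u, 512u})
    std::printf("chunk %4u -> %u block(s)\n", chunk, num_blocks(splines, chunk));
  // chunk  128 -> 3 block(s)
  // chunk  256 -> 2 block(s)
  // chunk  512 -> 1 block(s)
}
```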
Going from 128 to 256 to 512, I saw kernel time dropping, but 1024 performs the same as 512 across all problem sizes.
For large problem sizes, the number of walkers drops, so keeping more blocks makes better use of the device. I think 512 is a sweet spot.
What type(s) of changes does this code introduce?
Does this introduce a breaking change?
What systems has this change been tested on?
epyc-server
Checklist