Adjust block sizes in offload kernels #3910

Merged
merged 3 commits into from
Mar 20, 2022

Conversation

ye-luo
Contributor

@ye-luo ye-luo commented Mar 20, 2022

Proposed changes

The full loop iteration space is split into chunks of size ChunkSizePerTeam to increase the number of CUDA blocks/workgroups on the GPU.
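As a minimal sketch of this chunking pattern (not the actual QMCPACK kernels): the helper names `numTeams` and `scaleInChunks` and the element-wise scaling workload are hypothetical, chosen only for illustration. Each team handles one chunk of at most `chunk_size_per_team` iterations, and the threads within the team share that chunk. If OpenMP offload is not enabled at compile time, the pragmas are ignored and the loops run serially on the host.

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper: number of teams (CUDA blocks / workgroups) needed to
// cover n iterations in chunks of chunk_size_per_team (ceiling division).
inline int numTeams(int n, int chunk_size_per_team)
{
  assert(n >= 0 && chunk_size_per_team > 0);
  return (n + chunk_size_per_team - 1) / chunk_size_per_team;
}

// Hypothetical workload: multiply every element by `factor`,
// assigning one chunk of the iteration space to each team.
inline void scaleInChunks(std::vector<double>& values, double factor, int chunk_size_per_team)
{
  const int n         = static_cast<int>(values.size());
  const int num_teams = numTeams(n, chunk_size_per_team);
  double* data        = values.data();

  // Outer loop: one iteration per team.
  #pragma omp target teams distribute map(tofrom: data[0:n])
  for (int team_id = 0; team_id < num_teams; team_id++)
  {
    const int first = team_id * chunk_size_per_team;
    // The last chunk may be shorter than chunk_size_per_team.
    const int last = (first + chunk_size_per_team < n) ? first + chunk_size_per_team : n;

    // Inner loop: the threads of the team share the chunk.
    #pragma omp parallel for
    for (int i = first; i < last; i++)
      data[i] *= factor;
  }
}
```

For example, calling `scaleInChunks(values, 2.0, 128)` on a vector of 288 elements uses 3 teams, the last of which covers only 32 elements.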

Larger block sizes allow more resident computation per block and are thus more efficient, but they can hurt time to solution when the number of blocks becomes too small to keep the device busy. For small problem sizes, the number of blocks is not an issue and efficiency matters more. I noticed much closer kernel performance between the legacy CUDA and OpenMP offload implementations.

A larger block size only affects iteration spaces larger than the block size, namely those where the number of blocks changes.
The increase from 128 to 512 improves kernel efficiency a lot for small to medium problem sizes, which have smaller CUDA grid sizes. For example, NiO a32 has 288 splines, so it needs 3 blocks of size 128, 2 blocks of size 256, or 1 block of size 512. 512 should work well on both NVIDIA and AMD GPUs, which have 64 CUDA cores per SM and 64 stream processors per CU, respectively.
Going from 128 to 256 to 512, I saw kernel time drop, but 1024 stays the same as 512 across all problem sizes.
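The block counts quoted for NiO a32 follow from a ceiling division of the number of splines by the block size. Written out (the symbols $N_{\text{blocks}}$ and $N_{\text{splines}}$ are notation introduced here, not names from the code):

```latex
N_{\text{blocks}} = \left\lceil \frac{N_{\text{splines}}}{\text{ChunkSizePerTeam}} \right\rceil,
\qquad
\left\lceil \frac{288}{128} \right\rceil = 3,\quad
\left\lceil \frac{288}{256} \right\rceil = 2,\quad
\left\lceil \frac{288}{512} \right\rceil = 1 .
```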

For large problem sizes, the number of walkers drops, so keeping more blocks makes better use of the device. So I think 512 is a sweet spot.

What type(s) of changes does this code introduce?

  • Other (please describe): Performance tuning.

Does this introduce a breaking change?

  • No

What systems has this change been tested on?

epyc-server

Checklist

  • Yes. This PR is up to date with the current state of 'develop'

@prckent prckent enabled auto-merge March 20, 2022 19:39
@prckent
Contributor

prckent commented Mar 20, 2022

Test this please

@prckent prckent merged commit 118e3a7 into QMCPACK:develop Mar 20, 2022
@ye-luo ye-luo deleted the adjust-blocking branch March 20, 2022 21:16

2 participants