Adjust block sizes in offload kernels #3910
Merged
Proposed changes
The full loop iteration space is sliced into chunks of ChunkSizePerTeam to increase the number of CUDA blocks/workgroups on the GPU.
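For illustration, here is a minimal sketch of this chunking pattern (a hypothetical kernel, not the PR's actual code; `ChunkSizePerTeam` and `scale_in_chunks` are stand-ins). Each team, i.e. CUDA block, covers one chunk of iterations, so a smaller chunk size yields more teams:

```cpp
#include <cstddef>

// Illustrative value; this PR raises the chunk size used in the offload kernels.
constexpr size_t ChunkSizePerTeam = 512;

void scale_in_chunks(float* data, float alpha, size_t n)
{
  // Round up so a partial final chunk still gets its own team.
  const size_t num_teams = (n + ChunkSizePerTeam - 1) / ChunkSizePerTeam;

#pragma omp target teams distribute num_teams(num_teams) map(tofrom : data[:n])
  for (size_t team = 0; team < num_teams; ++team)
  {
    const size_t first = team * ChunkSizePerTeam;
    const size_t last  = (first + ChunkSizePerTeam < n) ? first + ChunkSizePerTeam : n;
    // Threads within the team (CUDA block) split the chunk.
#pragma omp parallel for
    for (size_t i = first; i < last; ++i)
      data[i] *= alpha;
  }
}
```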
Larger block sizes allow more resident computation and are thus more efficient, but they lose time to solution when the resulting number of blocks drops too low to keep the device busy. For small problem sizes, the number of blocks is not an issue and efficiency matters more. With this change I noticed much closer kernel performance between legacy CUDA and OpenMP offload.
A larger block size only affects iteration spaces larger than the block size, namely cases where the number of blocks actually changes.
The increase from 128 to 512 actually improves kernel efficiency a lot for small to medium problem sizes with smaller CUDA grid sizes. For example, NiO a32 has 288 splines, so it needs 3 blocks of size 128, 2 blocks of size 256, or 1 block of size 512. 512 should be good for both NVIDIA and AMD: 64 CUDA cores per SM and 64 stream processors per CU.
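The block counts above come from ceiling division of the iteration count by the chunk size; a small sketch of that arithmetic (not project code):

```cpp
#include <cstdio>

// Number of blocks needed to cover n iterations with a given chunk size.
constexpr unsigned num_blocks(unsigned n, unsigned chunk) { return (n + chunk - 1) / chunk; }

int main()
{
  const unsigned splines = 288; // NiO a32
  for (unsigned chunk : {128u, 256u, 512u})
    std::printf("chunk %4u -> %u block(s)\n", chunk, num_blocks(splines, chunk));
  // chunk  128 -> 3 block(s)
  // chunk  256 -> 2 block(s)
  // chunk  512 -> 1 block(s)
}
```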
Going from 128 to 256 to 512, I saw kernel time dropping, but 1024 performs the same as 512 across all problem sizes.
For large problem sizes, the number of walkers drops, so keeping more blocks makes better use of the device. I think 512 is a sweet spot.
What type(s) of changes does this code introduce?
Does this introduce a breaking change?
What systems has this change been tested on?
epyc-server
Checklist