Generalize GPU indexing to add global indexing #1334

MrBurmark · 2022-09-22T23:09:53Z

Generalize GPU indexing to add GPU global indexing

Add a global (thread and block) indexing class that abstracts indexing in a certain dimension. With this you can specify a block size or a grid size at compile time or get those values at runtime. You can also ignore blocks and index only with threads and vice versa.
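As a rough illustration of the idea, here is a host-side sketch of an indexer for one dimension. Everything in it is hypothetical and chosen for illustration only: the names `GlobalIndexer`, `LaunchCoords`, `kRuntime`, and `kIgnored` are not taken from the RAJA headers, and the struct of plain integers stands in for the GPU built-ins so the example runs on a CPU.

```cpp
// Hypothetical sentinels for illustration; not taken from the RAJA headers.
constexpr int kRuntime = 0;   // read the size from the launch at run time
constexpr int kIgnored = -1;  // leave this level out of the index entirely

// Host-side stand-in for threadIdx/blockIdx/blockDim/gridDim in one dimension.
struct LaunchCoords {
  int thread_idx;
  int block_idx;
  int block_dim;
  int grid_dim;
};

// Computes a global index from thread and block coordinates.
// BlockSize and GridSize are either a positive compile-time size,
// kRuntime (use the launch value), or kIgnored (skip that level).
template <int BlockSize, int GridSize>
struct GlobalIndexer {
  static int block_size(const LaunchCoords& c) {
    if (BlockSize == kIgnored) return 1;
    return BlockSize == kRuntime ? c.block_dim : BlockSize;
  }
  static int grid_size(const LaunchCoords& c) {
    if (GridSize == kIgnored) return 1;
    return GridSize == kRuntime ? c.grid_dim : GridSize;
  }
  // Global index of the calling thread in this dimension.
  static int index(const LaunchCoords& c) {
    const int t = (BlockSize == kIgnored) ? 0 : c.thread_idx;
    const int b = (GridSize == kIgnored) ? 0 : c.block_idx;
    return t + block_size(c) * b;
  }
  // Total number of global indices in this dimension.
  static int size(const LaunchCoords& c) {
    return block_size(c) * grid_size(c);
  }
};
```

With this sketch, `GlobalIndexer<8, kRuntime>` fixes the block size at compile time, `GlobalIndexer<kRuntime, kIgnored>` indexes with threads only, and `GlobalIndexer<kIgnored, kRuntime>` indexes with blocks only.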

The kernel and launch policies are now shared. The policy is multi-part and contains a global indexing class, a way to map those global indices (such as direct or strided loop), and a synchronization requirement. The synchronization requirement allows you to request that all threads make it through the loop even if some have no work, so you can synchronize the block.
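A minimal sketch of how such a three-part policy could be modelled, again on the host and with made-up names (`IndexPolicy`, `Mapping`, `Sync`, and `iterations_for` are illustrative assumptions, not the types in this PR). The helper shows which iterations a given global index would run under each mapping.

```cpp
#include <vector>

// Hypothetical names for illustration; not the RAJA policy types.
enum class Mapping { direct, strided_loop };
enum class Sync { none, required };

// The three parts listed in the description, bundled into one policy type.
template <typename Indexer, Mapping M, Sync S>
struct IndexPolicy {
  using indexer = Indexer;
  static constexpr Mapping mapping = M;
  static constexpr Sync sync = S;
};

// Returns the iterations in [0, n) that the thread with global index `idx`
// (out of `size` global indices) would execute under the given mapping.
inline std::vector<int> iterations_for(Mapping m, int idx, int size, int n) {
  std::vector<int> its;
  if (m == Mapping::direct) {
    // Direct: one iteration per global index, if that iteration exists.
    if (idx < n) its.push_back(idx);
  } else {
    // Strided loop: start at idx and step by the total number of indices.
    for (int i = idx; i < n; i += size) its.push_back(i);
  }
  return its;
}
```

Under the direct mapping, the launch has to supply at least `n` global indices to cover the range; the strided loop covers any `n` with a fixed number of indices.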

I added aliases for all of the pre-existing policies and deprecated some in favor of more consistently named policies.
One breaking change is that thread loop policies are no longer safe to block synchronize under. That feature still exists but can only be accessed with a custom policy.
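To make the synchronization concern concrete, here is a host-side sketch that counts loop trips per thread. It assumes the hazard is the usual one: a barrier placed inside a strided loop is only safe if every thread in the block reaches it the same number of times. The function names are hypothetical and the code only models trip counts; it does not launch anything on a GPU.

```cpp
// Trips made by thread `idx` when the loop exits as soon as that thread
// runs out of work. Threads can disagree on the count, so a block-level
// barrier inside such a loop would not be reached by every thread.
inline int trips_unsynced(int idx, int size, int n) {
  int trips = 0;
  for (int i = idx; i < n; i += size) ++trips;
  return trips;
}

// Same iteration space, but every thread makes ceil(n / size) trips and
// skips the body when it has no work. A barrier at the end of each trip
// would then be reached by all threads. `work_done` receives the number
// of trips in which this thread actually had work.
inline int trips_synced(int idx, int size, int n, int* work_done) {
  const int max_trips = (n + size - 1) / size;
  int trips = 0;
  *work_done = 0;
  for (int t = 0; t < max_trips; ++t) {
    const int i = idx + t * size;
    if (i < n) ++(*work_done);  // the loop body would run here
    ++trips;                    // a block barrier could go here on a GPU
  }
  return trips;
}
```

For example, with 4 threads and 10 iterations, thread 0 makes three trips and thread 3 makes two in the first form, while both make three trips in the second.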

This reduces duplication in the implementations of For and Tile statements in kernel and launch. It cuts the number of implementations down to just one for the direct policies and two for the loop policies, as loop policies can differ based on whether you can synchronize inside them. I'm still thinking about some slight changes that came in now that things are more uniform.

This PR is a refactoring and a feature.
It does the following:
- Refactors the For and Tile implementations for kernel and launch
- Adds GPU global indexing, at my own request

MrBurmark · 2022-09-22T23:12:39Z

I wanted to put this out here for feedback before I went too far. I'm curious what people think of the design and if anyone (@ajkunen) thought the slight differences between things like thread and block implementations were significant. I plan to do a before and after with the perf suite at some point to ensure performance but it passes the tests.

include/RAJA/policy/hip/teams.hpp

rhornung67 · 2022-09-28T15:56:42Z

> I wanted to put this out here for feedback before I went too far. I'm curious what people think of the design and if anyone (@ajkunen) thought the slight differences between things like thread and block implementations were significant. I plan to do a before and after with the perf suite at some point to ensure performance but it passes the tests.

@MrBurmark do you mean that the perf suite test passes? There are compilation issues related to global index types in CUDA builds here.

MrBurmark · 2022-09-28T16:14:45Z

I haven't tried to change CUDA yet in this branch, so that's why it's failing. I have not yet run this in the PerfSuite to look at performance; I wanted to incorporate any ideas people had about the design before I worried about that too much.

rhornung67 · 2022-09-28T16:26:15Z

@MrBurmark gotcha. I will take a closer look today and provide feedback. Is there anything in particular that you think needs deeper scrutiny?

MrBurmark · 2022-09-28T20:29:21Z

I'll add comments on some of the things that I think are worth noting/thinking about.

include/RAJA/policy/hip/kernel/For.hpp

include/RAJA/policy/hip/policy.hpp

test/functional/kernel/nested-loop/test-kernel-nested-loop.cpp.in

artv3 · 2022-11-17T22:34:49Z

Should we get this in for the patch release?

rhornung67 · 2022-11-17T22:45:03Z

> Should we get this in for the patch release?

No. This is bigger than a bugfix.

include/RAJA/policy/hip/kernel/For.hpp

include/RAJA/policy/hip/kernel/ForICount.hpp

include/RAJA/policy/hip/kernel/Tile.hpp

include/RAJA/policy/hip/kernel/TileTCount.hpp

test/functional/kernel/nested-loop/test-kernel-nested-loop.cpp.in

Add Hip Indexing classes so we can make more generic kernel For/Tile statements and launch loop implementations.

…ernelGpuGlobalIndexing

artv3

Will there be a companion PR for examples and docs? Overall, this looks pretty good.

include/RAJA/policy/cuda/MemUtils_CUDA.hpp

include/RAJA/policy/cuda/policy.hpp

include/RAJA/policy/hip/MemUtils_HIP.hpp

…ernelGpuGlobalIndexing

MrBurmark · 2023-06-21T15:19:43Z

> Will there be companion PR for examples and docs?

Yes, coming soon to a PR near you.
See #1499

include/RAJA/policy/cuda/MemUtils_CUDA.hpp

include/RAJA/policy/cuda/forall.hpp

rhornung67

A couple of questions for your consideration. Otherwise, it looks good.

in MemUtils occupancy calculator methods

MrBurmark requested review from artv3, rhornung67, CRobeck, ajkunen, mdavis36 and rchen20 September 22, 2022 23:09

artv3 reviewed Sep 27, 2022
