Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Generalize GPU indexing to add global indexing #1334

Merged
merged 55 commits into from
Jun 23, 2023

Conversation

MrBurmark
Copy link
Member

@MrBurmark MrBurmark commented Sep 22, 2022

Generalize GPU indexing to add GPU global indexing

Add a global (thread and block) indexing class that abstracts indexing in a certain dimension. With this you can specify a block size or a grid size at compile time or get those values at runtime. You can also ignore blocks and index only with threads and vice versa.

The kernel and launch policies are now shared. The policy is multi-part and contains a global indexing class, a way to map those global indices like direct or strided loop, and a synchronization requirement. The synchronizatoin requirement allows you to request that all threads make it through the even if some have no work so you can synchronize the block.

I added aliases for all of the pre-existing policies and deprecated some in favor more consistently named policies.
One breaking change is that thread loop policies are no longer safe to block synchronize under, that feature still exists but can only be accessed with a custom policy.

This reduces duplication in the implementations of For and Tile statements in kernel and launch. This cuts down the number of implementations to just one for the direct polices and two for the loop policies as loop policies can differ base on whether you can synchronize inside them. I'm still working on thinking about some slight changes that came in now that things are more uniform.

  • This PR is a refactoring, feature
  • It does the following:
    • refactors the for and tile implementations for kernel and launch
    • Adds gpu global indexing at the request of me

@MrBurmark
Copy link
Member Author

I wanted to put this out here for feedback before I went too far. I'm curious what people think of the design and if anyone (@ajkunen) thought the slight differences between things like thread and block implementations were significant. I plan to do a before and after with the perf suite at some point to ensure performance but it passes the tests.

@rhornung67
Copy link
Member

I wanted to put this out here for feedback before I went too far. I'm curious what people think of the design and if anyone (@ajkunen) thought the slight differences between things like thread and block implementations were significant. I plan to do a before and after with the perf suite at some point to ensure performance but it passes the tests.

@MrBurmark do you mean that the perf suite test passes? There are compilation issues related to global index types in CUDA builds here.

@MrBurmark
Copy link
Member Author

I haven't tried to change cuda yet in this branch so that's why its failing. I have not yet run this in the PerfSuite to look at performance, I wanted to be sure if anyone had ideas about the design that I incorporated them before I worried about that too much.

@rhornung67
Copy link
Member

@MrBurmark gotcha. I will take a closer look today and provide feedback. Is there anything in particular that you think needs deeper scrutinizing?

@MrBurmark
Copy link
Member Author

I'll add comments on some of the things that I think are worth noting/thinking about.

@artv3
Copy link
Member

artv3 commented Nov 17, 2022

Should we get this in for the patch release?

@rhornung67
Copy link
Member

Should we get this in for the patch release?

No. This is bigger than a bugfix.

include/RAJA/policy/hip/kernel/For.hpp Outdated Show resolved Hide resolved
include/RAJA/policy/hip/kernel/For.hpp Outdated Show resolved Hide resolved
include/RAJA/policy/hip/kernel/ForICount.hpp Outdated Show resolved Hide resolved
include/RAJA/policy/hip/kernel/Tile.hpp Outdated Show resolved Hide resolved
include/RAJA/policy/hip/kernel/TileTCount.hpp Outdated Show resolved Hide resolved
@MrBurmark MrBurmark force-pushed the feature/burmark1/KernelGpuGlobalIndexing branch from 20bf172 to b930573 Compare December 29, 2022 00:50
Copy link
Member

@artv3 artv3 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will there be companion PR for examples and docs? Overall looks pretty good

include/RAJA/policy/cuda/MemUtils_CUDA.hpp Outdated Show resolved Hide resolved
include/RAJA/policy/cuda/policy.hpp Show resolved Hide resolved
include/RAJA/policy/hip/MemUtils_HIP.hpp Outdated Show resolved Hide resolved
@MrBurmark
Copy link
Member Author

MrBurmark commented Jun 21, 2023

Will there be companion PR for examples and docs?

Yes coming soon to a PR near you.
See #1499

Copy link
Member

@rhornung67 rhornung67 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A couple of questions for your consideration. Otherwise, it looks good.

@MrBurmark MrBurmark enabled auto-merge June 23, 2023 15:16
@rhornung67 rhornung67 mentioned this pull request Jun 23, 2023
@MrBurmark MrBurmark merged commit 4f3d325 into develop Jun 23, 2023
15 checks passed
@MrBurmark MrBurmark deleted the feature/burmark1/KernelGpuGlobalIndexing branch June 23, 2023 17:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants