-
Notifications
You must be signed in to change notification settings - Fork 102
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Generalize GPU indexing to add global indexing #1334
Conversation
I wanted to put this out here for feedback before I went too far. I'm curious what people think of the design and if anyone (@ajkunen) thought the slight differences between things like thread and block implementations were significant. I plan to do a before and after with the perf suite at some point to ensure performance but it passes the tests. |
@MrBurmark do you mean that the perf suite test passes? There are compilation issues related to global index types in CUDA builds here. |
I haven't tried to change cuda yet in this branch so that's why its failing. I have not yet run this in the PerfSuite to look at performance, I wanted to be sure if anyone had ideas about the design that I incorporated them before I worried about that too much. |
@MrBurmark gotcha. I will take a closer look today and provide feedback. Is there anything in particular that you think needs deeper scrutinizing? |
I'll add comments on some of the things that I think are worth noting/thinking about. |
test/functional/kernel/nested-loop/test-kernel-nested-loop.cpp.in
Outdated
Show resolved
Hide resolved
Should we get this in for the patch release? |
No. This is bigger than a bugfix. |
test/functional/kernel/nested-loop/test-kernel-nested-loop.cpp.in
Outdated
Show resolved
Hide resolved
20bf172
to
b930573
Compare
Add Hip Indexing classes so we can make more generic kernel For/Tile statements and launch loop implementations.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will there be companion PR for examples and docs? Overall looks pretty good
…ernelGpuGlobalIndexing
…ernelGpuGlobalIndexing
Yes coming soon to a PR near you. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A couple of questions for your consideration. Otherwise, it looks good.
in MemUtils occupancy calculator methods
Generalize GPU indexing to add GPU global indexing
Add a global (thread and block) indexing class that abstracts indexing in a certain dimension. With this you can specify a block size or a grid size at compile time or get those values at runtime. You can also ignore blocks and index only with threads and vice versa.
The kernel and launch policies are now shared. The policy is multi-part and contains a global indexing class, a way to map those global indices like direct or strided loop, and a synchronization requirement. The synchronizatoin requirement allows you to request that all threads make it through the even if some have no work so you can synchronize the block.
I added aliases for all of the pre-existing policies and deprecated some in favor more consistently named policies.
One breaking change is that thread loop policies are no longer safe to block synchronize under, that feature still exists but can only be accessed with a custom policy.
This reduces duplication in the implementations of For and Tile statements in kernel and launch. This cuts down the number of implementations to just one for the direct polices and two for the loop policies as loop policies can differ base on whether you can synchronize inside them. I'm still working on thinking about some slight changes that came in now that things are more uniform.