
Add HIP Performance Guidelines #3455

Closed

Conversation

matyas-streamhpc

No description provided.

Comment on lines 40 to 42
:ref:`synchronization functions`) within the same kernel invocation. If they
belong to different blocks, they must use global memory with two separate
kernel invocations. The latter should be minimized as it adds overhead.


Hmm, e.g. SYCL has a guarantee that the still-executing block with the lowest global index is making progress. This allows some limited use of synchronization for later blocks in the same invocation. However, I don't remember whether that is the case for HIP, and I cannot find it quickly. Should we ask Young about it?

matyas-streamhpc (Author)

As far as I know, no. But I cannot find it either. It would be a good idea.
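
To make the two scopes concrete, here is a minimal HIP sketch (kernel and variable names are illustrative, not taken from the PR): threads within a block synchronize through shared memory and `__syncthreads()`, while blocks can only exchange results through global memory across two kernel launches.

```cpp
#include <hip/hip_runtime.h>

// Within one block, threads may synchronize with __syncthreads().
// Assumes a block size of 256 (a power of two, for the tree reduction).
__global__ void blockScopedSum(const float* in, float* partial, int n) {
    __shared__ float tile[256];
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = (gid < n) ? in[gid] : 0.0f;
    __syncthreads(); // valid: all participating threads are in the same block

    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride)
            tile[threadIdx.x] += tile[threadIdx.x + stride];
        __syncthreads();
    }
    if (threadIdx.x == 0)
        partial[blockIdx.x] = tile[0]; // per-block result to global memory
}

// Across blocks there is no in-kernel barrier: the per-block results are
// combined by a second kernel launch, communicating via global memory.
__global__ void gridScopedSum(const float* partial, float* out, int numBlocks) {
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        float sum = 0.0f;
        for (int i = 0; i < numBlocks; ++i)
            sum += partial[i];
        *out = sum;
    }
}
```

On the host this becomes two consecutive launches, e.g. `blockScopedSum<<<numBlocks, 256>>>(...)` followed by `gridScopedSum<<<1, 1>>>(...)`; the second launch is exactly the overhead the quoted guideline says should be minimized.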

Comment on lines 137 to 141
and is generally reduced when addresses are more scattered, especially in
global memory.

Device memory is accessed via 32-, 64-, or 128-byte transactions that must be
naturally aligned. Maximizing memory throughput involves coalescing memory

Melirius commented Apr 25, 2024


I think a short glossary at the beginning would be very valuable. For example, I am not sure here whether "device memory" and "global memory" refer to the same thing in these sentences, or whether they are different concepts. And again, "coalescing" appears without explanation.

matyas-streamhpc (Author)

Reference added for coalescing.
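
Since "coalescing" keeps coming up, a minimal sketch of what it means in practice may help (kernel names are hypothetical). In the first kernel, consecutive threads of a warp read consecutive addresses, which the hardware can serve with a few naturally aligned 32-, 64-, or 128-byte transactions; in the second, the stride scatters the addresses, so each warp needs many more transactions.

```cpp
#include <hip/hip_runtime.h>

// Coalesced: thread i touches element i, so a warp's accesses fall into
// a small number of contiguous, aligned memory transactions.
__global__ void copyCoalesced(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];
}

// Strided: neighboring threads touch addresses `stride` elements apart,
// so the same warp's accesses spread over many separate transactions.
__global__ void copyStrided(const float* in, float* out, int n, int stride) {
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    if (i < n)
        out[i] = in[i];
}
```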

Comment on lines 250 to 253
An alternative way to synchronize is using streams. Different streams
can execute commands out of order with respect to one another or concurrently.
This allows for more fine-grained control over the execution order of
commands, which can be beneficial in certain scenarios.


Hmm, using streams for intra-block synchronization is definite overkill. I suggest extending this paragraph to explain what level of synchronization streams provide, and adding a link to their description.
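
For context, a short sketch of the level of synchronization streams do provide (kernels and sizes are made up): commands within one stream execute in issue order, commands in different streams may run out of order or concurrently, and the host can wait on an individual stream. This is ordering between kernels and copies, not synchronization between threads or blocks inside a kernel.

```cpp
#include <hip/hip_runtime.h>

__global__ void kernelA(float* d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}

__global__ void kernelB(float* d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] += 1.0f;
}

// Assumes n is a multiple of 256 and dataA/dataB are device pointers.
void launchConcurrent(float* dataA, float* dataB, int n) {
    hipStream_t s0, s1;
    hipStreamCreate(&s0);
    hipStreamCreate(&s1);

    // Independent streams: the two kernels may overlap on the device.
    kernelA<<<dim3(n / 256), dim3(256), 0, s0>>>(dataA, n);
    kernelB<<<dim3(n / 256), dim3(256), 0, s1>>>(dataB, n);

    hipStreamSynchronize(s0); // host waits for stream s0 only
    hipStreamSynchronize(s1);

    hipStreamDestroy(s0);
    hipStreamDestroy(s1);
}
```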

MKKnorr left a comment

Most of these sections are very close to the performance guidelines of the CUDA programming guide, sometimes almost quoting it directly. I don't think that's good practice, especially as some parts don't apply to AMD's GPUs at all; on top of that, the CUDA programming guide does not have a permissive license, from what I can tell.

A better place for inspiration might be GPUOpen, which already has performance guides, e.g. for RDNA: https://gpuopen.com/learn/rdna-performance-guide/

docs/index.md Outdated
@@ -17,6 +17,7 @@ portable applications for AMD and NVIDIA GPUs from single source code.

:::{grid-item-card} Reference

* {doc}`/reference/performance_guidelines`

I would argue this document fits better in the "understand" section than in "reference".

matyas-streamhpc (Author)

It has been moved to How-to.

docs/reference/performance_guidelines.rst — five resolved review threads (outdated)
Comment on lines 224 to 227
Optimizing memory access: The efficiency of memory access can impact the speed
of arithmetic operations. Coalesced memory access, where threads in a warp
access consecutive memory locations, can improve memory throughput and thus
the speed of arithmetic operations.

Pedantic: arithmetic operations can't be "sped up"; the time until they can be scheduled for execution, however, can depend on memory accesses. I would add a reference to the memory optimizations here.

matyas-streamhpc (Author)

Added
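
To illustrate the point (a toy example, not from the PR; in practice the compiler and hardware often reorder loads like this on their own): the arithmetic in both kernels is identical, but issuing both loads before any computation lets their latencies overlap instead of stalling the ALU twice.

```cpp
#include <hip/hip_runtime.h>

// Load, wait, compute, load again: the second load is issued late,
// so the thread can stall on memory twice before it finishes.
__global__ void dependentChain(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (2 * i + 1 < n) {
        float a  = in[2 * i];
        float r0 = a * a + 1.0f;
        float b  = in[2 * i + 1];
        float r1 = b * b + 1.0f;
        out[i] = r0 + r1;
    }
}

// Both loads issued back to back: their latencies overlap, and the
// arithmetic can be scheduled as soon as the operands arrive.
__global__ void loadsUpFront(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (2 * i + 1 < n) {
        float a = in[2 * i];
        float b = in[2 * i + 1];
        out[i] = (a * a + 1.0f) + (b * b + 1.0f);
    }
}
```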

docs/reference/performance_guidelines.rst — three resolved review threads (outdated)
@Melirius

BTW, it should be mentioned somewhere that to fully utilize all SIMD lanes/possible threads in a block, the x block size should be a multiple of the warp size.

MKKnorr commented Apr 30, 2024

> BTW, it should be mentioned somewhere that to fully utilize all SIMD lanes/possible threads in a block, the x block size should be a multiple of the warp size.

When being pedantic: the block size (x*y*z) has to be a multiple of the warp size for full utilization
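
A small host-side sketch of the rule of thumb, assuming nothing beyond the standard HIP device-properties API (the block sizes are arbitrary examples):

```cpp
#include <hip/hip_runtime.h>
#include <cstdio>

int main() {
    hipDeviceProp_t prop;
    hipGetDeviceProperties(&prop, 0); // warpSize is 64 on most AMD GPUs, 32 on NVIDIA

    // The total block size x*y*z is what must be a multiple of warpSize.
    dim3 good(16, 16);  // 256 threads: e.g. 4 warps of 64, no idle lanes
    dim3 wasteful(96);  // 96 threads: with warpSize 64, half a warp idles

    unsigned total = good.x * good.y * good.z;
    printf("block of %u threads %s multiple of warpSize %d\n",
           total, (total % prop.warpSize == 0) ? "is a" : "is NOT a",
           prop.warpSize);

    total = wasteful.x * wasteful.y * wasteful.z;
    printf("block of %u threads %s multiple of warpSize %d\n",
           total, (total % prop.warpSize == 0) ? "is a" : "is NOT a",
           prop.warpSize);
    return 0;
}
```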

matyas-streamhpc (Author)

> Most of these sections are very close to the performance guidelines of the CUDA programming guide, sometimes almost quoting it directly. I don't think that's good practice, especially as some parts don't apply to AMD's GPUs at all; on top of that, the CUDA programming guide does not have a permissive license, from what I can tell.
>
> A better place for inspiration might be GPUOpen, which already has performance guides, e.g. for RDNA: https://gpuopen.com/learn/rdna-performance-guide/

I am not sure that is the best strategy either, but the concept was accepted as a first version. It does not quote the mentioned document directly, but there is overlap in the content. Personally, I appreciate every piece of recommendation, on both format and content.

Nonetheless, we always have the opportunity to improve it and make the documentation more useful for developers.

neon60 closed this May 15, 2024
neon60 commented May 15, 2024

Most of this PR's changes have been merged; the leftover is here: #3483
