Skip to content

Feature/byte array #1573

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 7 commits into from
Jun 16, 2025
Merged

Feature/byte array #1573

merged 7 commits into from
Jun 16, 2025

Conversation

maddyscientist
Copy link
Member

Small feature PR that adds a new class byte_array which is an indexable array type for 8-bit types (signed / unsigned chars) of length 8. This is useful, as it's exactly what we use for the various kernels that do path tracing around the links of the lattice.

With this class, we can use it instead of thread_array, the advantages of byte_array being that it is lower latency, and does not consume any shared memory. E.g., saving more L1 cache which can be helpful for performance. On the kernels I tested, I saw either performance neutrality or a perf uplift.

Also included here:

  • Remove use of KernelOps where possible due to removal of shared-memory storage from using byte_array instead of thread_array
  • Fix stack frame induced in DeGrandRossi contraction kernel (was due to some run-time indexing)
  • Fix stack frame induced in pure-gauge heath bath kernel (was due to run-time indexing in local array)
  • Fix stack frame in plaquette rectangle action (run-time indexing and running out of registers)

@maddyscientist maddyscientist added this to the QUDA 2.0 milestone Jun 10, 2025
@maddyscientist maddyscientist requested a review from a team as a code owner June 10, 2025 19:06
@maddyscientist
Copy link
Member Author

maddyscientist commented Jun 10, 2025

@jcosborn FYI, this PR removes the use of thread_array (and thus KernelOps) from a whole much of kernels. This approach is faster though, and takes less resources.

Copy link
Contributor

@weinbe2 weinbe2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good and passes my various tests (including correctness) with large (> 255) local dimensions, focusing on fat/long link construction. I'm seeing ~perf parity on H100-80GB-SXM5.

This has my approval conditional on csci tests passing, but (last I checked) it seems like it's having domain resolution issues...

@weinbe2
Copy link
Contributor

weinbe2 commented Jun 16, 2025

Update, I see that these issues are identical to the issues being tracked in #1572 and orthogonal to this PR. All other tests are passing. For this reason I'm going to merge this PR.

@weinbe2 weinbe2 merged commit ffc5b94 into develop Jun 16, 2025
5 of 6 checks passed
@weinbe2 weinbe2 deleted the feature/byte_array branch June 16, 2025 15:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants