Can we use systolic arrays in this implementation? #1404
I don't think we have a "nice interface" for this functionality, but it should be possible to call backend builtins for this purpose if backends have them. How exactly this would look depends on the compilation flow you want to use.

I have not yet looked at the Intel extension in depth, but in my experience Intel extensions are not always designed with the goal of being implementable by other implementations. So it's possible we might need a different interface.

If you want to work on building this "nice interface", I can provide further guidance :)
I agree that we might have to change the extension interface. Intel's joint matrix extension itself is experimental, and the API changed between the last release and this one.

Oh yes, I'd certainly love to help build this. I am somewhat new to compilers but am sure I can pick it up given some documentation. Where do I get started?
That's great :) Firstly, be aware that AdaptiveCpp supports multiple compilation flows, as outlined here: https://github.com/AdaptiveCpp/AdaptiveCpp/blob/develop/doc/compilation.md The most important one is the generic single-pass compiler, which is our default and most powerful compiler, so it would probably be a good idea to focus on that one initially.

For the generic compiler, in the best case we won't need any new compiler magic. For example, this file defines the available math builtins: https://github.com/AdaptiveCpp/AdaptiveCpp/blob/develop/include/hipSYCL/sycl/libkernel/sscp/builtins/math.hpp Ideally we could implement systolic arrays similarly. This would require first coming up with a suitable, unified builtin interface API. For this purpose, it might be useful to do a small survey of how systolic arrays work on the various backends: what do the compiler builtins/instructions look like, which assumptions do they make, etc. It might also be useful to look at how DPC++ maps this to hardware. Once we have the builtin API, we could then add implementations for the backends, and then lastly add a pretty C++ interface on top.

There are probably some nitty-gritty details down the line that we would need to discuss at some point. For example, what exactly should happen if some hardware does not support this functionality natively? DPC++ seems to throw an exception in that case, but perhaps it might be more useful to use a software fallback in order to simplify the development of portable code. But I think initially we should not overcomplicate things and can just assume that the hardware has this functionality. We need to start from somewhere :)

We might also need some runtime bits (e.g. to query hardware capabilities and expose that information to users), but that would similarly not be critical initially, I think.
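To make the idea concrete, here is a minimal sketch of what such a unified matrix builtin could look like, mirroring the style of the SSCP math builtins. Everything here is an assumption for illustration: the function name, the fixed 8x16 * 16x8 tile shape, and the float element type are all hypothetical, and the `__acpp_sscp_` prefix is merely modeled on the existing builtin naming. The portable reference implementation below is also what a software fallback might look like.

```cpp
#include <cstddef>

// Hypothetical SSCP-style builtin: D = A * B + C for a fixed 8x16 * 16x8 tile.
// Name, shape, and element type are illustrative, not part of AdaptiveCpp.
extern "C" void __acpp_sscp_joint_matrix_mad_f32(
    const float* a, const float* b, const float* c, float* d);

// Portable reference implementation (a possible software fallback); real
// backends would instead lower this builtin to tensor-core instructions.
extern "C" void __acpp_sscp_joint_matrix_mad_f32(
    const float* a, const float* b, const float* c, float* d) {
  constexpr std::size_t M = 8, K = 16, N = 8;
  for (std::size_t i = 0; i < M; ++i)
    for (std::size_t j = 0; j < N; ++j) {
      float acc = c[i * N + j];  // start from the accumulator tile
      for (std::size_t k = 0; k < K; ++k)
        acc += a[i * K + k] * b[k * N + j];
      d[i * N + j] = acc;
    }
}
```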
Let me start with the easy bit.

Nvidia tensor core programming:

```cpp
// From <mma.h>; the WMMA fragments a_frag, b_frag, acc_frag are assumed to
// be declared earlier via nvcuda::wmma::fragment<...>.
nvcuda::wmma::load_matrix_sync(a_frag, matrix_mma_a_mptr, lda);
nvcuda::wmma::load_matrix_sync(b_frag, matrix_mma_b_mptr, ldb);
// Perform the matrix multiplication
nvcuda::wmma::mma_sync(acc_frag, a_frag, b_frag, acc_frag);
```

AMD example:

Intel example:
Generally speaking, the interface is fairly standard across all three vendors. I have just done a deep dive into Intel's version, so I can explain how that works. In each execution unit there is support for both vector instructions and matrix instructions; they share the register file. We load small matrices (A = 8x16, B = 16x8 => C = 8x8) into registers and do a dot-product-accumulate. This can then be used to implement many high-level DNN-related kernels.
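For intuition, the dot-product-accumulate these instructions perform can be sketched per lane in plain C++. This is only an illustration of the arithmetic (here the common int8-input, int32-accumulator variant); the real instructions apply it across the whole tile in one operation, and the function name is made up.

```cpp
#include <cstdint>

// Per-lane 4-way dot-product-accumulate: acc += dot(a, b).
// Systolic/matrix instructions apply this across entire register tiles.
std::int32_t dp4a_sketch(const std::int8_t a[4], const std::int8_t b[4],
                         std::int32_t acc) {
  for (int i = 0; i < 4; ++i)
    acc += std::int32_t(a[i]) * std::int32_t(b[i]);
  return acc;
}
```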
I 100% agree. In the case of missing systolic arrays, we should implement a fairly high-performance version using vector instructions. The BLIS people call this a microkernel, and here's how it looks for a CPU: actual version, dumbed-down version. I will look into the DPC++ implementation and report back on details from here.
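For readers unfamiliar with the BLIS design, a dumbed-down microkernel can be sketched in plain C++ as below. This is illustrative only (the real BLIS microkernels are hand-written in assembly or intrinsics): it accumulates an MRxNR block of C in a small local array that the compiler can keep in vector registers, reading A and B from packed panels.

```cpp
#include <cstddef>

// Toy BLIS-style microkernel: C(MRxNR) += A_panel * B_panel, where the A
// panel stores K columns of MR elements and the B panel K rows of NR
// elements, both contiguously (as produced by a packing step).
template <std::size_t MR, std::size_t NR>
void microkernel(std::size_t K, const float* a_panel, const float* b_panel,
                 float* c, std::size_t ldc) {
  float acc[MR][NR] = {};  // accumulator block; ideally lives in registers
  for (std::size_t k = 0; k < K; ++k)
    for (std::size_t i = 0; i < MR; ++i)
      for (std::size_t j = 0; j < NR; ++j)
        acc[i][j] += a_panel[k * MR + i] * b_panel[k * NR + j];
  // Write the block back to C (row-major with leading dimension ldc)
  for (std::size_t i = 0; i < MR; ++i)
    for (std::size_t j = 0; j < NR; ++j)
      c[i * ldc + j] += acc[i][j];
}
```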
Here's some information about the DPC++ implementation. Let's look at each backend:
Nvidia's version (joint_matrix_mad_cuda) is here. It uses a builtin like this:
AMD's version (joint_matrix_mad_hip) is here. It uses a builtin like this:
Intel's SPIR-V version is defined in the OpenCL SPIR-V ops.
So it looks like all three versions use builtins. So we don't have to do any compiler shenanigans?
Thanks for the investigation!
Yep, looks like it :) Now we need to figure out which kind of interface will be useful to cover all three.
Our interface needs to account for the fact that different GPUs support different submatrix sizes, dtypes, and memory layouts, so we need to figure out a way to minimize code duplication there.
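One way to minimize that duplication is to describe each supported (shape, dtype, layout) combination as a compile-time configuration type, so a single generic front end can dispatch on it. The sketch below is a hypothetical design, not AdaptiveCpp code; the type names and the example configuration are assumptions (a `std::uint16_t` is used as a bit-pattern stand-in for fp16).

```cpp
#include <cstdint>

enum class matrix_layout { row_major, col_major, vnni };

// Compile-time description of one hardware-supported tile configuration.
template <class T, int M, int K, int N, matrix_layout L>
struct tile_config {
  using value_type = T;
  static constexpr int rows = M, depth = K, cols = N;
  static constexpr matrix_layout layout = L;
};

// Hypothetical example: the 8x16 * 16x8 fp16 configuration mentioned above,
// with fp16 values carried as raw 16-bit patterns.
using intel_xmx_f16 =
    tile_config<std::uint16_t, 8, 16, 8, matrix_layout::row_major>;

static_assert(intel_xmx_f16::rows == 8 && intel_xmx_f16::cols == 8,
              "C tile is 8x8 for this configuration");
```

A backend would then only specialize the few configurations it actually supports, while unsupported combinations fail at compile time or fall back to software.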
Thanks for the analysis :) I think perhaps it might be a good idea to start with the data types that the interface is supposed to operate on. Do we need an opaque data type for the matrices? Can backends at least agree on a matrix type size?
Sorry for the late reply; I have been traveling.

I think so, at least according to Intel's extension. All the supported combinations are documented here. We do need to allow for different layouts: column major, row major, and a weird one called VNNI for Intel. There might be some data types that are supported by some of the GPUs but not by builtin data types, like int4/int2.
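For context on the VNNI layout: the idea is to store small groups of consecutive K elements of each column contiguously, so the hardware can feed a multi-way dot-product-accumulate from one 32-bit load. The sketch below assumes the common int8 case with groups of 4; the function name and the exact indexing convention are illustrative, not taken from Intel's documentation.

```cpp
#include <cstdint>
#include <vector>

// Repack a row-major KxN int8 matrix B into a VNNI-style layout: groups of
// G=4 consecutive K elements of each column become contiguous, i.e.
// B[k][n] -> packed[(k/G)][n][k%G]. Assumes K is a multiple of G.
std::vector<std::int8_t> pack_vnni(const std::vector<std::int8_t>& b,
                                   int K, int N) {
  constexpr int G = 4;  // int8 elements per 32-bit dword
  std::vector<std::int8_t> packed(b.size());
  for (int k = 0; k < K; ++k)
    for (int n = 0; n < N; ++n)
      packed[(k / G) * (N * G) + n * G + (k % G)] = b[k * N + n];
  return packed;
}
```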
Congrats on your release!
I am wondering if your implementation allows me to use systolic arrays (tensor cores, XMX engines, or matrix cores in different GPU implementations)? Intel's implementation has this extension to support these from all vendors.
I think having support for these is very important for the performance of matrix multiplication and highly quantized implementations.