Hi,
I would like to run a CUTLASS GEMM (A*B+C) that uses the Tensor Cores on my Volta architecture, with:
- matrix A size = 6x123
- matrix B size = 64x6
- matrix C size = 64x123
So, it is a 64x6x123 (MxNxK) GEMM.
I am looking at the GEMM with the following properties:
Opcode Class: TensorOp
DataType: fp16 * fp16 + fp16 = fp16
Compute Capability: 75
There is no 64x8x128 instruction shape for this Opcode Class (see README > Functionality), but I can decompose the problem into multiple 8x8x128 GEMMs.
The thing is, I only have small matrices to process.
Usually, a CUTLASS GEMM decomposes a big matrix into smaller GEMMs at 3 levels (thread block, warp and instruction). Is this decomposition supported for small matrices? What is the smallest GEMM we can do? Is it possible to execute a CUTLASS GEMM directly at the instruction level? Are there any simple examples you could provide that work for small matrices?
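To make the decomposition question concrete, here is a minimal sketch of the tiling arithmetic as I understand it. It is plain C++ and does not use CUTLASS at all: `Shape3`, `ceil_div`, `tile_counts` and `utilization` are hypothetical names I made up for illustration, and the 8x8x128 tile is just the shape mentioned above, not something I have confirmed is valid for fp16. It assumes that a problem dimension that is not a multiple of the tile size is simply padded up to the next tile boundary.

```cpp
#include <cassert>

// Hypothetical helper type for illustration only; not part of CUTLASS.
struct Shape3 {
    int m, n, k;
};

// Round-up integer division: number of tiles of size `tile`
// needed to cover `extent` elements.
inline int ceil_div(int extent, int tile) {
    assert(extent >= 0 && tile > 0);
    return (extent + tile - 1) / tile;
}

// Number of tiles needed along each of the M, N and K dimensions.
inline Shape3 tile_counts(Shape3 problem, Shape3 tile) {
    return {ceil_div(problem.m, tile.m),
            ceil_div(problem.n, tile.n),
            ceil_div(problem.k, tile.k)};
}

// Fraction of the multiply-accumulate work in the padded tile grid
// that corresponds to real (non-padding) elements.
inline double utilization(Shape3 problem, Shape3 tile) {
    Shape3 c = tile_counts(problem, tile);
    double padded = static_cast<double>(c.m * tile.m) *
                    static_cast<double>(c.n * tile.n) *
                    static_cast<double>(c.k * tile.k);
    double useful = static_cast<double>(problem.m) *
                    static_cast<double>(problem.n) *
                    static_cast<double>(problem.k);
    return useful / padded;
}
```

For example, `tile_counts({64, 6, 123}, {8, 8, 128})` gives 8 tiles along M and 1 tile each along N and K, so under this assumption part of every tile would be padding. My question is whether CUTLASS handles that padding for me at each level, or whether I have to do it myself.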
Also:
Do you have a simple example in which the user passes their own matrices to the GEMM function instead of using "host_tensor.h"?
Is it possible to call a CUTLASS GEMM from within a CUDA kernel?
Thanks in advance,
Julie
Aha! Link: https://nvaiinfa.aha.io/features/CUTLASS-22