How to use CUTLASS GEMM for small matrix

Hi,

I would like to run a CUTLASS GEMM (A*B+C) that uses the Tensor Cores on my Volta architecture, with:

- matrix A size = 6x123
- matrix B size = 64x6
- matrix C size = 64x123

So, it is a 64x6x123 (MxNxK) GEMM.
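As a sanity check on the sizes listed above, here is a minimal sketch in plain C++ (no CUTLASS calls). It assumes the common convention in which the left operand is MxK, the right operand is KxN, and C is MxN. Under that assumption the listed sizes only compose as B*A + C (64x6 times 6x123), which would read as M=64, N=123, K=6. The names `ProblemSize`, `infer_problem_size` and `reference_gemm` are hypothetical helpers introduced only for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper type (not the CUTLASS API): a problem size in (M, N, K) order.
struct ProblemSize {
  int m, n, k;
};

// Infers (M, N, K) for D = X * Y + C, where X is rows_x x cols_x and
// Y is rows_y x cols_y. Returns false if the inner dimensions disagree.
bool infer_problem_size(int rows_x, int cols_x, int rows_y, int cols_y,
                        ProblemSize &out) {
  if (cols_x != rows_y) return false;
  out = ProblemSize{rows_x, cols_y, cols_x};
  return true;
}

// Plain row-major reference GEMM: returns X * Y + C, accumulating in float.
std::vector<float> reference_gemm(const ProblemSize &p,
                                  const std::vector<float> &x,    // M x K
                                  const std::vector<float> &y,    // K x N
                                  const std::vector<float> &c) {  // M x N
  assert(x.size() == static_cast<std::size_t>(p.m) * p.k);
  assert(y.size() == static_cast<std::size_t>(p.k) * p.n);
  assert(c.size() == static_cast<std::size_t>(p.m) * p.n);

  std::vector<float> d(c.size());
  for (int i = 0; i < p.m; ++i) {
    for (int j = 0; j < p.n; ++j) {
      float acc = c[i * p.n + j];
      for (int l = 0; l < p.k; ++l) {
        acc += x[i * p.k + l] * y[l * p.n + j];
      }
      d[i * p.n + j] = acc;
    }
  }
  return d;
}
```

Calling `infer_problem_size(6, 123, 64, 6, p)` (A times B) fails because the inner dimensions 123 and 64 differ, while `infer_problem_size(64, 6, 6, 123, p)` (B times A) succeeds. A reference like this can also be used to check the output of a Tensor Core GEMM on small inputs.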

I am looking at the GEMM with the following properties:

- Opcode Class: TensorOp
- DataType: fp16 * fp16 + fp16 = fp16
- Compute Capability: 75

There is no 64x8x128 instruction shape for this Opcode Class (see README > Functionality), but I can decompose it into multiple 8x8x128 GEMMs.

The thing is, I have small matrices to process.
Usually, CUTLASS GEMM decomposes a big matrix into smaller GEMMs at 3 levels (thread block, warp and instruction). Is this decomposition supported for small matrices? What is the smallest GEMM we can do? Is it possible to execute a CUTLASS GEMM directly at the instruction level? Are there some simple examples you could provide that work for small matrices?
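To make the concern about small matrices concrete, the following plain C++ sketch estimates how much of a tile is wasted when a small problem is rounded up to a tile shape. The 128x128x32 tile used in the note below is only an illustrative assumption, not a statement about which tile shapes CUTLASS supports for this configuration. `Shape3`, `ceil_div`, `tile_counts`, `padded_size` and `useful_fraction` are hypothetical names.

```cpp
#include <cassert>

// Hypothetical helper type for illustration only (not a CUTLASS type).
struct Shape3 {
  int m, n, k;
};

// Rounds up the integer division a / b (b > 0).
int ceil_div(int a, int b) { return (a + b - 1) / b; }

// Number of tiles needed along each dimension to cover `problem`.
Shape3 tile_counts(Shape3 problem, Shape3 tile) {
  return Shape3{ceil_div(problem.m, tile.m),
                ceil_div(problem.n, tile.n),
                ceil_div(problem.k, tile.k)};
}

// Problem size after rounding each dimension up to a multiple of the tile.
Shape3 padded_size(Shape3 problem, Shape3 tile) {
  Shape3 t = tile_counts(problem, tile);
  return Shape3{t.m * tile.m, t.n * tile.n, t.k * tile.k};
}

// Fraction of multiply-accumulates in the padded problem that are useful work.
double useful_fraction(Shape3 problem, Shape3 tile) {
  Shape3 p = padded_size(problem, tile);
  double useful = static_cast<double>(problem.m) * problem.n * problem.k;
  double total = static_cast<double>(p.m) * p.n * p.k;
  return useful / total;
}
```

For the 64x6x123 size quoted above and an assumed 128x128x32 tile, `tile_counts` gives 1x1x4 and `padded_size` gives 128x128x128, so only about 2% of the multiply-accumulates would be useful work. The same arithmetic applies at the warp and instruction levels with their own tile shapes.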

Also:

- Do you have a simple example in which the user passes their own matrices to the GEMM function instead of using "host_tensor.h"?
- Is it possible to use CUTLASS GEMM from a CUDA kernel?

Thanks in advance,
Julie

Aha! Link: https://nvaiinfa.aha.io/features/CUTLASS-22

How to use CUTLASS GEMM for small matrix #293
