How to use CUTLASS GEMM for small matrix #293

@ju9379

Hi,

I would like to run a CUTLASS GEMM (A*B+C) that uses the Tensor Cores on my Volta architecture, with:

  • matrix A size = 6x123
  • matrix B size = 64x6
  • matrix C size = 64x123

So, it is a 64x6x123 (MxNxK) GEMM.

I am looking at the GEMM with the following properties:
Opcode Class: TensorOp
DataType: fp16 * fp16 + fp16 = fp16
Compute Capability: 75

There is no 64x8x128 instruction shape for this Opcode Class (see README > Functionality), but I can decompose it into multiple 8x8x128 GEMMs.
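
The decomposition described above can be sketched as a small tile-count calculation. This is only an illustration using the numbers from this post (64x6x123 as MxNxK, covered by 8x8x128 tiles). The `Shape` struct and the `round_up`, `ceil_div`, `padded` and `tile_count` helpers are hypothetical names, not CUTLASS APIs.

```cpp
// Hypothetical helper type (not part of CUTLASS): a GEMM extent as M x N x K.
struct Shape {
  int m, n, k;
};

// Round x up to the next multiple of tile.
constexpr int round_up(int x, int tile) { return (x + tile - 1) / tile * tile; }

// Number of tiles of size `tile` needed to cover x elements.
constexpr int ceil_div(int x, int tile) { return (x + tile - 1) / tile; }

// Problem size after padding every dimension up to a whole number of tiles.
constexpr Shape padded(Shape problem, Shape tile) {
  return {round_up(problem.m, tile.m),
          round_up(problem.n, tile.n),
          round_up(problem.k, tile.k)};
}

// Total number of tile-sized GEMMs needed to cover the padded problem.
constexpr int tile_count(Shape problem, Shape tile) {
  return ceil_div(problem.m, tile.m) *
         ceil_div(problem.n, tile.n) *
         ceil_div(problem.k, tile.k);
}
```

Under these assumptions, `padded({64, 6, 123}, {8, 8, 128})` gives 64x8x128 and `tile_count` gives 8, so N and K would each need padding (for example with zeros) to fill whole tiles.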

The thing is, I have small matrices to process.
Usually, CUTLASS GEMM decomposes a big matrix into smaller GEMMs at 3 levels (thread block, warp and instruction). Is this decomposition supported for small matrices? What is the smallest GEMM we can do? Is it possible to execute a CUTLASS GEMM directly at the instruction level? Are there some simple examples you could provide that work for small matrices?

Also:
Do you have a simple example in which the user passes their own matrices to the GEMM function instead of using "host_tensor.h"?
Is it possible to use CUTLASS GEMM from within a CUDA kernel?
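
To make the "own matrix" request concrete, here is a minimal sketch of the kind of interface meant: a GEMM that takes caller-owned buffers as raw pointers plus leading dimensions, with no tensor wrapper. It is a plain host-side reference loop, not CUTLASS code. The name `reference_gemm`, the column-major layout and the use of `float` instead of fp16 are assumptions made for the illustration.

```cpp
// Hypothetical reference routine (not a CUTLASS API):
//   D = alpha * A * B + beta * C
// A is M x K, B is K x N, C and D are M x N. All buffers are caller-owned and
// column-major (assumed here), described by a pointer and a leading dimension.
void reference_gemm(int M, int N, int K,
                    float alpha,
                    const float* A, int lda,
                    const float* B, int ldb,
                    float beta,
                    const float* C, int ldc,
                    float* D, int ldd) {
  for (int j = 0; j < N; ++j) {
    for (int i = 0; i < M; ++i) {
      // Accumulate the dot product of row i of A and column j of B.
      float acc = 0.0f;
      for (int k = 0; k < K; ++k) {
        acc += A[i + k * lda] * B[k + j * ldb];
      }
      D[i + j * ldd] = alpha * acc + beta * C[i + j * ldc];
    }
  }
}
```

A routine of this shape can also serve as a correctness check for whatever Tensor Core path ends up being used, since matrices this small are cheap to verify on the host.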

Thanks in advance,
Julie

Aha! Link: https://nvaiinfa.aha.io/features/CUTLASS-22
