Int8 gemm weight pre-packing as CUBLASLT_ORDER_COL32_2R_4R4 #3233

@vadimkantorov

Description

As I understand it, for maximum performance of int8 GEMMs, the weights should be pre-packed in the CUBLASLT_ORDER_COL32_2R_4R4 memory format and the inputs in COL32.

In the trex graph we can see that the output of the int8 GEMM is formatted as COL32 ("C32" means the COL32 layout was used, right?), but we can't see whether TensorRT pre-packed the weights as 2R_4R4. Does it pre-pack the weights as 2R_4R4? And is TensorRT actually using COL32 for the input/output of the quantized MatMul?

[Screenshot: trex engine graph showing the quantized MatMul output formatted as C32]

Thanks :)
