As I understand it, for maximum performance of int8 GEMMs, the weights should be pre-packed in the CUBLASLT_ORDER_COL32_2R_4R4 memory format and the inputs in COL32.
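For context, here is my understanding of the COL32 indexing (based on the cuBLASLt docs: the matrix is split into vertical tiles of 32 columns, with leading dimension 32 * rows). The function name and the tiny check below are just mine, to make the layout concrete:

```python
def col32_offset(r: int, c: int, rows: int) -> int:
    """Element offset in COL32 layout, as I understand it from the cuBLASLt docs:
    offset = (c // 32) * ld + r * 32 + (c % 32), with ld = 32 * rows."""
    ld = 32 * rows  # stride to the next 32-column tile
    return (c // 32) * ld + r * 32 + (c % 32)

# Sanity check: the mapping should be a bijection for a small 4x64 matrix.
rows, cols = 4, 64
offsets = {col32_offset(r, c, rows) for r in range(rows) for c in range(cols)}
assert len(offsets) == rows * cols
assert min(offsets) == 0 and max(offsets) == rows * cols - 1
```

(I'm not attempting COL32_2R_4R4 here; its interleaving is more involved, which is exactly why I'm asking whether TensorRT handles that pre-packing itself.)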
In the trex graph we can see that the output of the int8 GEMM is formatted as COL32 ("C32" means the COL32 layout was used, right?), but I can't tell whether TensorRT pre-packed the weights as 4R4. Does it pre-pack the weights as 4R4? And is TensorRT actually using COL32 for the input/output of the quantized MatMul?
Thanks :)