As I understand it, for maximum performance of int8 GEMMs, the weights should be pre-packed in the CUBLASLT_ORDER_COL32_2R_4R4 memory format and the inputs in COL32.
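For context, here is my understanding of the COL32 indexing (based on the cuBLASLt docs: the matrix is split into vertical tiles of 32 columns, with leading dimension 32 * rows). The function name and the tiny check below are just mine, to make the layout concrete:

```python
def col32_offset(r: int, c: int, rows: int) -> int:
    """Element offset in COL32 layout, as I understand it from the cuBLASLt docs:
    offset = (c // 32) * ld + r * 32 + (c % 32), with ld = 32 * rows."""
    ld = 32 * rows  # stride to the next 32-column tile
    return (c // 32) * ld + r * 32 + (c % 32)

# Sanity check: the mapping should be a bijection for a small 4x64 matrix.
rows, cols = 4, 64
offsets = {col32_offset(r, c, rows) for r in range(rows) for c in range(cols)}
assert len(offsets) == rows * cols
assert min(offsets) == 0 and max(offsets) == rows * cols - 1
```

(I'm not attempting COL32_2R_4R4 here; its interleaving is more involved, which is exactly why I'm asking whether TensorRT handles that pre-packing itself.)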
In the trex graph we can see that the output of the int8 GEMM is formatted as COL32 ("C32" means the COL32 layout was used, right?), but I can't tell whether TensorRT pre-packed the weights as 4R4. Does it pre-pack the weights as 4R4? And is TensorRT actually using COL32 for the input/output of the quantized MatMul?
Thanks :)