"Do you use GPTQ/RPTQ": no; maybe they are experimenting with it in upstream ggml, but currently tensors are just split into fixed-size blocks of size 32 and then quantized block-wise.
"Do you use int8 @ int8 -> int32 cublas": don't know... You may check out ggml CUDA code.
"Do you use int8 @ int8 -> int32 cublas": I don't know; you may want to check out the ggml CUDA code.
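To clarify what the question is asking: "int8 @ int8 -> int32 cublas" refers to cuBLAS's integer GEMM path, where both input matrices are int8 and accumulation happens in int32. A minimal standalone sketch of that path using `cublasGemmEx` (cuBLAS 11+) follows; this says nothing about what ggml actually does on CUDA, and exact alignment/transpose restrictions for the int8 path vary by cuBLAS version, so check the returned status:

```c
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <stdint.h>
#include <stdio.h>

int main(void) {
    /* Small column-major matrices; the int8 path wants dimensions and
       leading dimensions that are multiples of 4 and 4-byte-aligned
       pointers (cudaMalloc allocations satisfy the alignment). */
    const int m = 4, n = 4, k = 4;

    int8_t  hA[16], hB[16];
    int32_t hC[16];
    for (int i = 0; i < 16; i++) { hA[i] = (int8_t)(i % 7); hB[i] = 1; }

    int8_t *dA, *dB; int32_t *dC;
    cudaMalloc((void**)&dA, sizeof hA);
    cudaMalloc((void**)&dB, sizeof hB);
    cudaMalloc((void**)&dC, sizeof hC);
    cudaMemcpy(dA, hA, sizeof hA, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB, sizeof hB, cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);

    /* alpha/beta are int32 when the compute type is CUBLAS_COMPUTE_32I. */
    const int32_t alpha = 1, beta = 0;
    cublasStatus_t st = cublasGemmEx(
        handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k,
        &alpha,
        dA, CUDA_R_8I, m,    /* int8 input A  */
        dB, CUDA_R_8I, k,    /* int8 input B  */
        &beta,
        dC, CUDA_R_32I, m,   /* int32 output C */
        CUBLAS_COMPUTE_32I, CUBLAS_GEMM_DEFAULT);

    cudaMemcpy(hC, dC, sizeof hC, cudaMemcpyDeviceToHost);
    printf("status = %d, C[0] = %d\n", (int)st, hC[0]);

    cublasDestroy(handle);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    return 0;
}
```

Build with something like `nvcc -x cu igemm_sketch.c -lcublas`.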