Closed
Hello! I saw there are several parameters for quantizing a model in MLX:
parser.add_argument(
"-q", "--quantize", help="Generate a quantized model.", action="store_true"
)
parser.add_argument(
"--q-group-size", help="Group size for quantization.", type=int, default=64
)
parser.add_argument(
"--q-bits", help="Bits per weight for quantization.", type=int, default=4
)
parser.add_argument(
"--dtype",
help="Type to save the parameters, ignored if -q is given.",
type=str,
choices=["float16", "bfloat16", "float32"],
default="float16",
)
The models I use are between 7 and 14 B parameters. Can you please explain what group size and bits per weight are, and what I should set each of them to?
Because in llama.cpp I used GGUF, where my favorites were Q3_K_L, Q4_K_M, Q5_K_M, Q6_K. So having a comparison table (if it's even possible to compare) would be nice!
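For context, one rough way to compare these settings to GGUF quants is by effective bits per weight: the raw `--q-bits` plus the per-group overhead, which shrinks as `--q-group-size` grows. The sketch below assumes each group stores one float16 scale and one float16 bias (32 bits of overhead per group), which matches how I understand MLX's affine quantization, but that assumption should be checked against the MLX docs:

```python
# Back-of-the-envelope estimate of effective bits per weight for
# group quantization. ASSUMPTION: each group carries one float16
# scale + one float16 bias = 32 bits of overhead per group.

def effective_bits(q_bits: int, group_size: int, overhead_bits: int = 32) -> float:
    """Bits per weight including per-group scale/bias overhead."""
    return q_bits + overhead_bits / group_size

def model_size_gb(n_params_b: float, q_bits: int, group_size: int) -> float:
    """Approximate quantized weight size in GB for n_params_b billion params."""
    return n_params_b * 1e9 * effective_bits(q_bits, group_size) / 8 / 1e9

# MLX defaults (--q-bits 4, --q-group-size 64) -> 4.5 effective bits/weight,
# which is in the same ballpark as a GGUF Q4 quant.
print(effective_bits(4, 64))               # 4.5
print(round(model_size_gb(7, 4, 64), 2))   # 3.94 (GB, for a 7B model)
```

By this estimate, smaller group sizes give more accurate quantization (finer-grained scales) at the cost of slightly more bits per weight, and larger `--q-bits` trades size for fidelity, similar to moving from Q4 up to Q6 in GGUF terms.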
Thanks!