Quantization Params Explanation + llama.cpp Relation

Hello! I saw that there are several parameters for quantizing a model in MLX:

```
    parser.add_argument(
        "-q", "--quantize", help="Generate a quantized model.", action="store_true"
    )
    parser.add_argument(
        "--q-group-size", help="Group size for quantization.", type=int, default=64
    )
    parser.add_argument(
        "--q-bits", help="Bits per weight for quantization.", type=int, default=4
    )
    parser.add_argument(
        "--dtype",
        help="Type to save the parameters, ignored if -q is given.",
        type=str,
        choices=["float16", "bfloat16", "float32"],
        default="float16",
    )
```

The models I use have between 7B and 14B parameters. Could you please explain what group size and bits per weight mean, and what values I should set for each?
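To make the question concrete, here is a rough sketch of how I understand the two parameters to interact. It assumes group-wise quantization in which every group of `--q-group-size` weights shares one float16 scale and one float16 bias; that storage layout is my assumption, not something stated in the snippet above. The helper names `effective_bits_per_weight` and `approx_size_gb` are hypothetical and only serve this illustration.

```python
def effective_bits_per_weight(q_bits: int, group_size: int,
                              scale_bits: int = 16, bias_bits: int = 16) -> float:
    """Bits stored per weight once per-group metadata is amortized.

    Assumption: each group of `group_size` weights carries one scale and
    one bias in addition to the `q_bits` stored for every weight.
    """
    return q_bits + (scale_bits + bias_bits) / group_size


def approx_size_gb(n_params: float, q_bits: int, group_size: int) -> float:
    """Approximate weight storage in GB (10^9 bytes), ignoring unquantized layers."""
    total_bits = n_params * effective_bits_per_weight(q_bits, group_size)
    return total_bits / 8 / 1e9


if __name__ == "__main__":
    # Compare a few (bits, group size) settings for 7B and 14B models.
    for n_params in (7e9, 14e9):
        for q_bits, group_size in [(4, 64), (4, 32), (8, 64)]:
            bpw = effective_bits_per_weight(q_bits, group_size)
            size = approx_size_gb(n_params, q_bits, group_size)
            print(f"{n_params / 1e9:.0f}B  q_bits={q_bits}  "
                  f"group_size={group_size}  ~{bpw:.2f} bpw  ~{size:.2f} GB")
```

Under that assumption, a smaller group size spends more bits on scales and biases but lets each group fit its own weight range more closely, so the defaults (4 bits, group size 64) would come out to roughly 4.5 bits per weight.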

In llama.cpp I used GGUF, where my favorites were Q3_K_L, Q4_K_M, Q5_K_M, and Q6_K. A table mapping those to the MLX settings (if the two can even be compared) would be nice!

Thanks!

Quantization Params Explanation + llama.cpp Relation #252
