
[quantization] Instability on llama quantization #656

@stamalakhov

Description

What

There is some instability in the quantization of llama-based models. For example, running the same command

python tico/quantization/wrapq/examples/quantize_full_qmodel_with_gptq.py --model unsloth/Llama-3.2-3B-Instruct --max_seq_len 2048 --linear_weight_bits 4 --gptq_mse smse --nsamples_for_qcalibration 128 --device cuda --lm_head_weight_bits 4 --save  "ptq_checkpoint"   --no_spinquant  --eval_tasks="mmlu,hellaswag,piqa,truthfulqa" --decode_calibration_steps 8 --sensitivity_path sensitivities_for_unsloth_Llama-3.2-3B-Instruct_wikitext_128_42.pt

on two different GPUs produced two different perplexities (11.86 vs 12.14).

Running the same command with --gptq_mse mse:

python tico/quantization/wrapq/examples/quantize_full_qmodel_with_gptq.py --model unsloth/Llama-3.2-3B-Instruct --max_seq_len 2048 --linear_weight_bits 4 --gptq_mse mse --nsamples_for_qcalibration 128 --device cuda --lm_head_weight_bits 4 --save  "ptq_checkpoint"   --no_spinquant  --eval_tasks="mmlu,hellaswag,piqa,truthfulqa" --decode_calibration_steps 8 

also produced two different perplexities (12.56 vs 12.52) across the two GPUs.
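For context, some level of run-to-run difference across GPUs is expected: different devices (or even different kernels on the same device) reduce floating-point sums in different orders, and fp32 addition is not associative, so calibration statistics and accumulated quantization error can drift slightly. A minimal, generic illustration (not wrapq-specific):

```python
import torch

torch.manual_seed(0)
x = torch.randn(1_000_000)

# Same numbers, different summation order: fp32 addition is not
# associative, so the two reductions typically disagree in the last bits.
s1 = x.sum()
s2 = x.sort().values.sum()
print(f"{s1.item():.6f} vs {s2.item():.6f} (diff={abs(s1 - s2).item():.2e})")
```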

Let's either:

  1. confirm that this nondeterminism is inevitable,
  2. and/or reduce the discrepancy between results (see the sketch after this list).
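One mitigation worth trying before concluding the gap is inevitable: pin every RNG and force deterministic kernels at the start of quantize_full_qmodel_with_gptq.py. A best-effort sketch using standard PyTorch knobs (note that even with all of them, bitwise identity across different GPU architectures or CUDA versions is not guaranteed):

```python
import os
import random

import numpy as np
import torch


def make_deterministic(seed: int = 42) -> None:
    # Must be set before the first cuBLAS call for deterministic GEMMs.
    os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # Warn (rather than error) on ops without a deterministic implementation.
    torch.use_deterministic_algorithms(True, warn_only=True)
    torch.backends.cudnn.benchmark = False
    # Disable TF32 so matmul precision does not vary across GPU generations.
    torch.backends.cuda.matmul.allow_tf32 = False
    torch.backends.cudnn.allow_tf32 = False
```

If the 11.86 vs 12.14 gap persists with these settings, the residual variance across different GPU models is probably the irreducible part, which would support conclusion (1).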
