[quantization] Add calibrate_seq_len #533
Conversation
Force-pushed from 17f4335 to 7f5f371
Force-pushed the calibrate_seq_len branch from 7f5f371 to dc07e8a
Could you elaborate on this change?
@mhs4670go
IMHO, just an example with
So it seems like this PR produces better results.
Force-pushed from dc07e8a to e8acbdc
@mhs4670go
Force-pushed from c73de94 to 3c29be1
self._fq(cos[:, : hidden_states.size(1), :], self.obs_cos),
self._fq(sin[:, : hidden_states.size(1), :], self.obs_sin),
Note for reviewers: this makes it possible to remove padding.
Just out of curiosity, is it necessary? Because attention_mask masks the padding positions.
@mhs4670go
Sorry for the lack of details.
attention_mask is prepared for the current seq_len as
return self.causal_mask_template[..., :seq_len, :seq_len].to(device)
so if the related causal_mask_template is larger than seq_len, everything is fine (because it's just an upper-triangular matrix filled with a constant).
We can do the same here:
prepare (rope_cos_template, rope_sin_template) for a larger seq_len and then just extract what is needed for the current seq_len.
It is assumed that calibrate_seq_len >= max_seq_len.
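A minimal sketch of the template-and-slice idea, using the rope_cos_template / rope_sin_template names mentioned above (the RoPE math below is a generic approximation for illustration, not the actual implementation):

import torch

class RopeTemplates(torch.nn.Module):
    def __init__(self, calibrate_seq_len: int, head_dim: int, base: float = 10000.0):
        super().__init__()
        # Precompute cos/sin once for the largest expected sequence length
        # (calibrate_seq_len >= max_seq_len), analogous to causal_mask_template.
        inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
        positions = torch.arange(calibrate_seq_len).float()
        freqs = torch.outer(positions, inv_freq)
        emb = torch.cat((freqs, freqs), dim=-1)
        self.register_buffer("rope_cos_template", emb.cos().unsqueeze(0))
        self.register_buffer("rope_sin_template", emb.sin().unsqueeze(0))

    def forward(self, hidden_states: torch.Tensor):
        # Extract only what the current sequence length needs; if the input
        # already has calibrate_seq_len tokens, the slicing is a no-op.
        seq_len = hidden_states.size(1)
        cos = self.rope_cos_template[:, :seq_len, :]
        sin = self.rope_sin_template[:, :seq_len, :]
        return cos, sin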
Ah, this is necessary for evaluation, where the number of tokens is not fixed to 2048 (the sequence length to be exported). If we give max_seq_len tokens when we export the model, the slicing will be a no-op.
if hasattr(mod, "wrapped"):
    mod = mod.wrapped
Note for reviewers: this is to remove calls like tico.convert(model.wrapped, ...
This change is not necessary, because PTQWrapper just runs the wrapper's forward. Did you get an error without this change?
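For illustration only, a minimal sketch of that point with a hypothetical wrapper class (the real PTQWrapper may differ): the wrapper simply delegates forward, so converting the wrapper traces the same computation as converting the wrapped module, and the unwrap check discussed here becomes optional.

import torch

class ForwardingWrapper(torch.nn.Module):
    # Hypothetical stand-in for a PTQWrapper-like class.
    def __init__(self, wrapped: torch.nn.Module):
        super().__init__()
        self.wrapped = wrapped

    def forward(self, *args, **kwargs):
        # Simply runs the wrapped module's forward.
        return self.wrapped(*args, **kwargs)

def unwrap(mod: torch.nn.Module) -> torch.nn.Module:
    # The check from the diff above: peel off the wrapper if present.
    if hasattr(mod, "wrapped"):
        mod = mod.wrapped
    return mod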
Ahhh. Yep. I got an error when the "--no_PTQ" option was set.
tico.convert(model.wrapped, ...
That was the cause of the crash.
So I moved the check into convert, but I haven't tested whether it was necessary.
I'll check it.
Thank you!
You are right. There is no need for this change. I'll remove it.
Force-pushed from 3c29be1 to 2dc6a98
Comparison of quantized results (GPTQ_MSE_w4A16 on main vs. #533, equalization with 2048 calibrate_seq_len), taken from logs:
For Llama-3.2-3B-Instruct: see the logs.
For Llama-3.2-1B-Instruct: see the logs.
For TinyLlama-1.1B-Chat-v1.0: see the logs.
However, for HuggingFaceTB/SmolLM2-135M-Instruct: see the logs.
So it seems that, at least for some models, this can improve accuracy.
)
parser.add_argument(
    "--max_seq_len",
    "--convert_seq_len",
Or we can leave it as is (max_seq_len).
Hmm... Sorry for bothering you. I think max_seq_len looks easier to understand.
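A minimal sketch of the two argument definitions being discussed, with hypothetical types, defaults, and help strings (the constraint calibrate_seq_len >= max_seq_len is from the note above):

import argparse

parser = argparse.ArgumentParser()
parser.add_argument(
    "--max_seq_len",
    type=int,
    default=2048,
    help="Sequence length the model is exported with.",
)
parser.add_argument(
    "--calibrate_seq_len",
    type=int,
    default=2048,
    help="Sequence length used for calibration; expected to be >= max_seq_len.",
)
args = parser.parse_args()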
Force-pushed the calibrate_seq_len branch from 81f7be9 to 9799f11
config = q_m.config

orig_seq_len = config.max_position_embeddings
config.max_position_embeddings = args.max_seq_len
This kind of change can be removed.
First, I think we should fix the evaluate_llm_on_tasks API.
def evaluate_llm_on_tasks(
    model: AutoModelForCausalLM, tokenizer: AutoTokenizer, tasks: str
) -> dict[str, Any]:
    model_to_evaluate = HFLM(model, "causal", tokenizer=tokenizer)
    tasks_list: list[str] = tasks.split(",")
    return evaluator.simple_evaluate(model_to_evaluate, tasks=tasks_list)

Since the accelerator has a fixed maximum sequence length, it is better to match the max sequence length during evaluation as well.
Therefore, the code will be:

def evaluate_llm_on_tasks(
    model: AutoModelForCausalLM, tokenizer: AutoTokenizer, tasks: str, max_length: int
) -> dict[str, Any]:
    # ..
    model_to_evaluate = HFLM(
        model,
        "causal",
        tokenizer=tokenizer,
        max_length=max_length,
        truncation=True,
    )

Then, you don't have to change max_position_embeddings itself repeatedly.
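A hypothetical call site for the updated API (the names q_m, tokenizer, args.eval_tasks, and args.max_seq_len are assumptions based on this thread), showing that the model config no longer needs to be patched:

# Instead of temporarily overwriting config.max_position_embeddings,
# pass the fixed maximum sequence length to the evaluation helper.
results = evaluate_llm_on_tasks(
    q_m,
    tokenizer,
    args.eval_tasks,
    max_length=args.max_seq_len,
)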
Ahh. Ok. I'll try.
@mhs4670go
Done. At least ppl is fine for SmolLM. I'll check eval_tasks (it takes around 1.5 hours).
Everything is fine, the same results.
Force-pushed from 9799f11 to 1d46dd2
This PR adds `calibrate_seq_len` to get better accuracy and adjusts relevant code accordingly.
TICO-DCO-1.0-Signed-off-by: s.malakhov <s.malakhov@partner.samsung.com>
Force-pushed from 1d46dd2 to d8d7a88
This PR adds `calibrate_seq_len` to get better results on accuracy. Please see the results here.
log of `python tico/quantization/wrapq/examples/quantize_full_qmodel_with_gptq.py --model "HuggingFaceTB/SmolLM2-135M-Instruct" --max_seq_len 256 --gptq_mse "--eval_tasks" "winogrande,arc_easy,arc_challenge,openbookqa" ...`
TICO-DCO-1.0-Signed-off-by: s.malakhov s.malakhov@partner.samsung.com