Skip to content

[quantization] Force tie embedding in example script#695

Closed
mhs4670go wants to merge 2 commits into
Samsung:mainfrom
mhs4670go:pro2
Closed

[quantization] Force tie embedding in example script#695
mhs4670go wants to merge 2 commits into
Samsung:mainfrom
mhs4670go:pro2

Conversation

@mhs4670go
Copy link
Copy Markdown
Contributor

@mhs4670go mhs4670go commented May 12, 2026

This commit forces tie embedding in example script.

I've run PPL evaluations with below command.

python tico/quantization/wrapq/examples/quantize_full_qmodel_with_gptq.py \
  --model /group-volume/Llama-3.2-3B-Instruct --max_seq_len 2048 \
  --linear_weight_bits 4 --verbose --nsamples_for_qcalibration 128 \
  --decode_calibration_steps 8 --gptq_mse smse \
  --embedding_lm_head_weight_bits 4

mse with embedding_lm_head_weight_bits 8-bit : 11.89
smse with embedding_lm_head_weight_bits 8-bit: 12.05
mse with embedding_lm_head_weight_bits 4-bit: 11.96

Related: #624
TICO-DCO-1.0-Signed-off-by: seongwoo mhs4670go@naver.com

This commit forces tie embedding in example script.

TICO-DCO-1.0-Signed-off-by: seongwoo <mhs4670go@naver.com>
@mhs4670go mhs4670go requested a review from stamalakhov May 12, 2026 04:53
@stamalakhov
Copy link
Copy Markdown
Contributor

@mhs4670go Thank you!
Have you tried to save full circle model with this PR?

@stamalakhov
Copy link
Copy Markdown
Contributor

In my personal and humble opinion we don't need GPTQ for lm_head.

@stamalakhov
Copy link
Copy Markdown
Contributor

@mhs4670go
8 bits of unshared weights for lm_head/mebeddings can't be saved to circle because of 2Gb restriction.
May be it would be better to have an option to tie weights?

@mhs4670go
Copy link
Copy Markdown
Contributor Author

In my personal and humble opinion we don't need GPTQ for lm_head.

I thought you introduced lm_head support in GPTQ because it had better accuracy than the one that doesn't apply GPTQ to lm_head.

I'll make applying lm_head optional in next PR then.

@mhs4670go
Copy link
Copy Markdown
Contributor Author

@mhs4670go 8 bits of unshared weights for lm_head/mebeddings can't be saved to circle because of 2Gb restriction. May be it would be better to have an option to tie weights?

Ah, okay. Let's go with 4-bits as a default one.

@stamalakhov
Copy link
Copy Markdown
Contributor

@mhs4670go 8 bits of unshared weights for lm_head/mebeddings can't be saved to circle because of 2Gb restriction. May be it would be better to have an option to tie weights?

Ah, okay. Let's go with 4-bits as a default one.

@mhs4670go
but what about ppl of 4 bits quantized model?

@mhs4670go
Copy link
Copy Markdown
Contributor Author

@mhs4670go 8 bits of unshared weights for lm_head/mebeddings can't be saved to circle because of 2Gb restriction. May be it would be better to have an option to tie weights?

Ah, okay. Let's go with 4-bits as a default one.

@mhs4670go but what about ppl of 4 bits quantized model?

mse with embedding_lm_head_weight_bits 4-bit: 11.96

I'm running the evaluation for smse right now. That'll be updated soon.

@stamalakhov
Copy link
Copy Markdown
Contributor

@mhs4670go
But why do we need to tie weights for quantized model?
They can not be shared right now.
Am i missing something?

/cc @Torrero

@mhs4670go
Copy link
Copy Markdown
Contributor Author

mhs4670go commented May 12, 2026

But why do we need to tie weights for quantized model?

Because we'll tie weights in final use case. Llama-3.2 model already comes with tied weights. Before supporting the export with shared weight, we can simulate the tied embedding with this approach.

I've posted a PR for qwen, too.

mse with embedding_lm_head_weight_bits 4-bit: 11.96
smse with embedding_lm_head_weight_bits 4-bit: 12.41

@stamalakhov
Copy link
Copy Markdown
Contributor

Before supporting the export with shared weight, we can simulate the tied embedding with this approach.

@mhs4670go
but this PR disables all other supported options. It's enforcement. Is not it?

@mhs4670go
Copy link
Copy Markdown
Contributor Author

mhs4670go commented May 12, 2026

but this PR disables all other supported options. It's enforcement. Is not it?

IIUC, embedding_weight_bits and lm_head_weight_bits options will be disabled with this PR. Do you think it would be better not to disable them and support tied embedding optionally?

@stamalakhov
Copy link
Copy Markdown
Contributor

Do you think it would be better not to disable them and support tied embedding optionally?

@mhs4670go
Yes.
I believe right now it would be better to have option to tie them. Later when weights sharing will be supported in circle, or in case the task with full model export will cancelled. The option to tie weights will be enforced.
But it's just my opinion.
/cc @Torrero

@stamalakhov
Copy link
Copy Markdown
Contributor

I thought you introduced lm_head support in GPTQ because it had better accuracy than the one that doesn't apply GPTQ to lm_head.

@mhs4670go
Yes.

  • In case they are untied, lm_head can be quantized using ordinary GPTQ.
  • In case they are tied, GPTQ for embedding/lm_head with calibrating Hessians for lm_head is wrong. I mean using GPTQ for tied weights is more tricky. and right now it's not supported in TICO.

IMHO

@mhs4670go
Copy link
Copy Markdown
Contributor Author

In case they are tied, GPTQ for embedding/lm_head with calibrating Hessians for lm_head is wrong. I mean using GPTQ for tied weights is more tricky. and right now it's not supported in TICO.

I got your point. I think we should skip GPTQ on lm head when it's tied. I missed this point. Thank you for the opinion!

Eventually, we will use a model whose embeddings are tied. Therefore, we might as well make current GPTQ on lm head optional.

What's your take on this?

@stamalakhov
Copy link
Copy Markdown
Contributor

What's your take on this?

I'm fine with it.
My concern is that this PR disables fully supported (for now) untied weights and forces partially supported tied weights.
So seems like i'm missing something.
Please proceed without my approval.

@stamalakhov stamalakhov requested review from a team and removed request for stamalakhov May 12, 2026 07:31
@stamalakhov
Copy link
Copy Markdown
Contributor

@mhs4670go
I've changed reviewers, please proceed without my approval.

@mhs4670go
Copy link
Copy Markdown
Contributor Author

Ah, I think this PR should be reconsidered later when we fully supports tie embedding as you said.

I'll just post a PR for making lm_head GPTQ optional.

@mhs4670go mhs4670go closed this May 12, 2026
@stamalakhov stamalakhov mentioned this pull request May 12, 2026
@mhs4670go mhs4670go deleted the pro2 branch May 13, 2026 04:01
@stamalakhov
Copy link
Copy Markdown
Contributor

@mhs4670go
Thank you very much for your effort.
may be this PR could be reopened? I've just tried it for Maykeye/TinyLLama-v0 with explicit weights equalization:

    model.lm_head.weight = model.model.embed_tokens.weight

and the resulting circle had shared weights for both input embedding and lm_head. So that we can use 8 bits for lm_head/embed_tokens quantization safely.
Just a small note:
Some llama models don't use tie_word_embeddings (e.g. Maykeye/TinyLLama-v0), so may be the option to tie them would be helpful for debugging purposes at least?

@stamalakhov
Copy link
Copy Markdown
Contributor

@mhs4670go
I mean #701 is merged, so this PR can be landed as it is, with m.b.

model.lm_head.weight = model.model.embed_tokens.weight

for forced weights equalization.

@stamalakhov
Copy link
Copy Markdown
Contributor

stamalakhov commented May 20, 2026

For HuggingFaceTB/SmolLM2-135M it also produced nice result with shared weights.

@mhs4670go
Copy link
Copy Markdown
Contributor Author

@stamalakhov Thank you for the checks! I wonder it's natural to set them tied even though they are not meant to be tied. Therefore, I'll just a PR that just validates if the model is untied when different bits are given.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants