
fix: move quantized norm to CPU instead of stale q_linear reference in smooth_quant #4352

Merged
lvhan028 merged 1 commit into InternLM:main from Mr-Neutr0n:fix/smooth-quant-norm-cpu-offload on Feb 12, 2026

Conversation

@Mr-Neutr0n (Contributor)

Summary

  • Fix a copy-paste bug in lmdeploy/lite/apis/smooth_quant.py where the norm quantization loop calls q_linear.to('cpu') instead of q_norm.to('cpu')
  • q_linear is a stale variable from the previous linear quantization loop, so the quantized QRMSNorm objects are never moved to CPU
  • This causes a VRAM leak: every quantized norm layer stays on GPU, which can lead to OOM errors on large models with hundreds of norm layers

Details

In the smooth_quant function, there are two quantization loops:

  1. Linear quantization loop -- correctly calls q_linear.to('cpu') after quantizing each linear layer
  2. Norm quantization loop -- creates q_norm via QRMSNorm.from_float() but then calls q_linear.to('cpu') (copy-pasted from the linear loop) instead of q_norm.to('cpu')

The fix is a one-line change: q_linear.to('cpu') -> q_norm.to('cpu') in the norm quantization loop.
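
For reference, a minimal schematic of the two loops (the class body and the `fcs`/`norms` dicts below are simplified stand-ins to make the pattern runnable, not lmdeploy's actual code):

```python
# Schematic reproduction of the bug and the fix. The class body and the
# fcs/norms dicts are simplified stand-ins, not lmdeploy's actual code.
import torch.nn as nn


class QRMSNorm(nn.Module):
    """Stand-in for lmdeploy's quantized RMSNorm."""

    def __init__(self, weight):
        super().__init__()
        self.weight = nn.Parameter(weight.detach().clone())

    @classmethod
    def from_float(cls, norm):
        return cls(norm.weight)


fcs = {'layer0.fc': nn.Linear(8, 8)}      # linear layers to quantize
norms = {'layer0.norm': nn.LayerNorm(8)}  # stand-in for RMSNorm modules

# Loop 1: linear layers -- offloading q_linear here is correct.
for linear in fcs.values():
    q_linear = linear  # stand-in for the quantized-linear conversion
    q_linear.to('cpu')

# Loop 2: norm layers -- where the copy-paste bug lived.
for norm in norms.values():
    q_norm = QRMSNorm.from_float(norm)
    # Buggy (copy-pasted from loop 1): q_linear.to('cpu')
    # That re-offloads the stale q_linear, so each q_norm stays on the
    # GPU and VRAM accumulates across hundreds of norm layers.
    q_norm.to('cpu')  # fixed: offload the quantized norm itself
```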

Test plan

  • Verify smooth quantization completes without OOM on a model with many norm layers
  • Confirm GPU memory is properly released after each norm quantization step
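
A rough way to check the second point with PyTorch's allocator stats (illustrative only; `run_one_norm_step` is a hypothetical callable, and the real hook point inside `smooth_quant` will differ):

```python
import torch


def measure_norm_step(run_one_norm_step):
    """Report GPU memory growth around one norm-quantization step.

    `run_one_norm_step` is a hypothetical callable wrapping a single
    iteration of the norm loop; with the fix applied, the delta should
    stay near zero instead of growing by one norm layer per call.
    """
    torch.cuda.synchronize()
    before = torch.cuda.memory_allocated()
    run_one_norm_step()
    torch.cuda.synchronize()
    after = torch.cuda.memory_allocated()
    print(f'allocated delta: {(after - before) / 2**20:.2f} MiB')
```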

@lvhan028 lvhan028 merged commit d37fed7 into InternLM:main Feb 12, 2026
5 checks passed