
fix: move quantized norm to CPU instead of stale q_linear reference in smooth_quant #4352

Merged
lvhan028 merged 1 commit into InternLM:main from Mr-Neutr0n:fix/smooth-quant-norm-cpu-offload on Feb 12, 2026

Conversation

@Mr-Neutr0n (Contributor)

Summary

  • Fix a copy-paste bug in lmdeploy/lite/apis/smooth_quant.py where the norm quantization loop calls q_linear.to('cpu') instead of q_norm.to('cpu')
  • q_linear is a stale variable from the previous linear quantization loop, so the quantized QRMSNorm objects are never moved to CPU
  • This causes a VRAM leak: every quantized norm layer stays on GPU, which can lead to OOM errors on large models with hundreds of norm layers

Details

In the smooth_quant function, there are two quantization loops:

  1. Linear quantization loop -- correctly calls q_linear.to('cpu') after quantizing each linear layer
  2. Norm quantization loop -- creates q_norm via QRMSNorm.from_float() but then calls q_linear.to('cpu') (copy-pasted from the linear loop) instead of q_norm.to('cpu')

The fix is a one-line change: q_linear.to('cpu') -> q_norm.to('cpu') in the norm quantization loop.
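
For reference, a minimal schematic of the two loops (the class body and the `fcs`/`norms` dicts below are simplified stand-ins to make the pattern runnable, not lmdeploy's actual code):

```python
# Schematic reproduction of the bug and the fix. The class body and the
# fcs/norms dicts are simplified stand-ins, not lmdeploy's actual code.
import torch.nn as nn


class QRMSNorm(nn.Module):
    """Stand-in for lmdeploy's quantized RMSNorm."""

    def __init__(self, weight):
        super().__init__()
        self.weight = nn.Parameter(weight.detach().clone())

    @classmethod
    def from_float(cls, norm):
        return cls(norm.weight)


fcs = {'layer0.fc': nn.Linear(8, 8)}      # linear layers to quantize
norms = {'layer0.norm': nn.LayerNorm(8)}  # stand-in for RMSNorm modules

# Loop 1: linear layers -- offloading q_linear here is correct.
for linear in fcs.values():
    q_linear = linear  # stand-in for the quantized-linear conversion
    q_linear.to('cpu')

# Loop 2: norm layers -- where the copy-paste bug lived.
for norm in norms.values():
    q_norm = QRMSNorm.from_float(norm)
    # Buggy (copy-pasted from loop 1): q_linear.to('cpu')
    # That re-offloads the stale q_linear, so each q_norm stays on the
    # GPU and VRAM accumulates across hundreds of norm layers.
    q_norm.to('cpu')  # fixed: offload the quantized norm itself
```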

Test plan

  • Verify smooth quantization completes without OOM on a model with many norm layers
  • Confirm GPU memory is properly released after each norm quantization step
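
A rough way to check the second point with PyTorch's allocator stats (illustrative only; `run_one_norm_step` is a hypothetical callable, and the real hook point inside `smooth_quant` will differ):

```python
import torch


def measure_norm_step(run_one_norm_step):
    """Report GPU memory growth around one norm-quantization step.

    `run_one_norm_step` is a hypothetical callable wrapping a single
    iteration of the norm loop; with the fix applied, the delta should
    stay near zero instead of growing by one norm layer per call.
    """
    torch.cuda.synchronize()
    before = torch.cuda.memory_allocated()
    run_one_norm_step()
    torch.cuda.synchronize()
    after = torch.cuda.memory_allocated()
    print(f'allocated delta: {(after - before) / 2**20:.2f} MiB')
```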

@lvhan028 lvhan028 merged commit d37fed7 into InternLM:main Feb 12, 2026
5 checks passed