Speed up torch engine w8a8 model init #1088

yinfan98 · 2024-01-31T21:01:40Z

Loading a w8a8 model takes a long time. I found that calling

QRMSNorm.from_float()
# or
QLinear.from_float()

runs per_channel_quant to compute the scale and the quantized weight, but at that point the result is useless: the parameters have not been loaded into the model yet, so the quantization is computed over all-zero weights. This PR skips that computation when the model is initialized at the inference stage.
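The idea can be sketched in plain Python. This is a simplified stand-in, not the actual lmdeploy code: `FloatLinear`, `QLinearSketch`, and the list-based `per_channel_quant` below are hypothetical, and only the `initialization` flag name comes from the diff discussed in this PR. With `initialization=True` the weights are really quantized; with `initialization=False` only placeholders of the right shape are allocated, on the assumption that the checkpoint loader overwrites them later.

```python
from dataclasses import dataclass
from typing import List


def per_channel_quant(weight, n_bits=8):
    """Symmetric per-row (per output channel) quantization of a 2-D list.

    Hypothetical pure-Python stand-in for a tensor-based implementation.
    """
    q_max = 2 ** (n_bits - 1) - 1  # 127 for int8
    q_weight, scales = [], []
    for row in weight:
        max_abs = max((abs(v) for v in row), default=0.0)
        # Fall back to 1.0 so an all-zero row does not divide by zero.
        scale = max_abs / q_max if max_abs > 0 else 1.0
        scales.append(scale)
        q_weight.append([round(v / scale) for v in row])
    return q_weight, scales


@dataclass
class FloatLinear:
    """Hypothetical float module holding a 2-D weight."""
    weight: List[List[float]]


@dataclass
class QLinearSketch:
    """Hypothetical quantized module: integer weights plus per-row scales."""
    q_weight: List[List[int]]
    scale: List[float]

    @classmethod
    def from_float(cls, mod, initialization=True):
        if initialization:
            # Real init: the float weights are meaningful, so quantize them.
            q_weight, scale = per_channel_quant(mod.weight)
        else:
            # Dummy init: allocate placeholders of the right shape only.
            # Assumption: real values are loaded from the checkpoint later.
            rows = len(mod.weight)
            cols = len(mod.weight[0]) if rows else 0
            q_weight = [[0] * cols for _ in range(rows)]
            scale = [1.0] * rows
        return cls(q_weight, scale)
```

At inference time the sketch would be called as `QLinearSketch.from_float(mod, initialization=False)`, which avoids the per-row scan entirely.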

| | time cost (s) |
|---|---|
| compute the scale and weight | 17 |
| skip the computation | 0.08 |
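A comparison like this can be measured with a small timing helper. The sketch below is illustrative only: `init_with_quant` and `init_without_quant` are hypothetical stand-ins for the two initialization paths, not lmdeploy functions, and the numbers it prints will not match the table, which was measured on real models.

```python
import time


def time_call(fn, *args, **kwargs):
    """Run fn once and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


def init_with_quant(rows, cols):
    """Hypothetical stand-in: allocate zero weights, then quantize each row."""
    weight = [[0.0] * cols for _ in range(rows)]
    q_weight, scales = [], []
    for row in weight:
        max_abs = max((abs(v) for v in row), default=0.0)
        scale = max_abs / 127 if max_abs > 0 else 1.0
        scales.append(scale)
        q_weight.append([round(v / scale) for v in row])
    return q_weight, scales


def init_without_quant(rows, cols):
    """Hypothetical stand-in: allocate placeholders only, no quantization."""
    return [[0] * cols for _ in range(rows)], [1.0] * rows


if __name__ == "__main__":
    _, t_quant = time_call(init_with_quant, 256, 256)
    _, t_skip = time_call(init_without_quant, 256, 256)
    print(f"with quantization:    {t_quant:.4f} s")
    print(f"without quantization: {t_skip:.4f} s")
```

On all-zero weights both paths produce the same placeholder values, which is why the quantization pass can be skipped without changing the result.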

I tested it on InternLM and InternLM2.

HIT-cwh · 2024-02-01T03:19:17Z

Hi @yinfan98 !
Thank you for your contribution; we truly appreciate it. Only one small step remains: please refer to this document to configure pre-commit, then complete the lint check. We look forward to seeing it finished.

HIT-cwh · 2024-02-01T03:23:39Z

lmdeploy/pytorch/models/q_modules.py

        """Class method to create a QLinear instance from a floating-point
-        module."""
+        module. initialization for dummy init."""


It seems that initialization = True means real init and initialization = False means dummy init?

@HIT-cwh, yes

lvhan028 · 2024-02-02T04:10:45Z

Hi, @yinfan98 thanks very much for the contribution.
Could you kindly provide the request throughput of w8a8?

yinfan98 · 2024-02-02T07:41:28Z

@lvhan028 Sure, I will finish the benchmark in the next few days.

yinfan98 added 2 commits February 1, 2024 04:59

Update convert_to_qmodules.py

694b629

Update q_modules.py

09fcc0e

lvhan028 requested review from HIT-cwh and grimoire February 1, 2024 03:03

HIT-cwh reviewed Feb 1, 2024

yinfan98 and others added 2 commits February 1, 2024 15:40

Merge branch 'InternLM:main' into w8a8

4378be2

fix lint

523e364

HIT-cwh reviewed Feb 1, 2024

yinfan98 and others added 3 commits February 1, 2024 15:57

Update lmdeploy/pytorch/models/q_modules.py

73b3dc2

Co-authored-by: whcao <41630003+HIT-cwh@users.noreply.github.com>

add comment for from_float

d00cfc2

Update q_modules.py

b974b33

HIT-cwh approved these changes Feb 1, 2024

grimoire approved these changes Feb 2, 2024

lvhan028 added the improvement label Feb 2, 2024

lvhan028 merged commit a15cede into InternLM:main Feb 2, 2024
4 checks passed
