
Slow inference on LLM model (CPU Bound?) #388

Closed

piratos opened this issue May 11, 2023 · 6 comments

Comments

piratos commented May 11, 2023

I tried to run the StarCoder LLM by loading it in 8-bit.
It works as expected, but inference is slow: one CPU core runs at 100%, which is odd given that everything should be loaded onto the GPU (the device_map shows {'': 0}).
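
For context, the model is loaded through the standard transformers 8-bit path. A minimal sketch (not the exact script from this issue), assuming bitsandbytes and accelerate are installed:

```python
# Minimal 8-bit loading sketch (assumes transformers with bitsandbytes/accelerate;
# "bigcode/starcoder" is the checkpoint discussed in this issue).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "bigcode/starcoder"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",   # let accelerate place the layers
    load_in_8bit=True,   # bitsandbytes LLM.int8() quantization
)
print(model.hf_device_map)  # {'': 0} when everything fits on GPU 0
```
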
Profiling with the PyTorch profiler shows the following (a sketch of how to collect such a profile follows the table):

------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  -------------------------------------------------------  
        Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls                                             Input Shapes  
------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  -------------------------------------------------------  
MatMul8bitLt        11.65%        1.214s        19.18%        1.998s     445.921us     390.375ms         9.73%     515.878ms     115.151us          4480                     [[4, 1, 6144], [6144, 6144], [6144]]  
MatMul8bitLt        11.65%        1.214s        19.11%        1.991s     444.408us     391.215ms         9.75%     516.690ms     115.333us          4480                     [[4, 1, 6144], [6400, 6144], [6400]]  
MatMul8bitLt        11.61%        1.210s        19.04%        1.983s     442.660us        1.068s        26.60%        1.197s     267.199us          4480                   [[4, 1, 6144], [24576, 6144], [24576]]  
MatMul8bitLt        11.44%        1.192s        18.88%        1.967s     439.071us        1.154s        28.76%        1.281s     286.006us          4480                   [[4, 1, 24576], [6144, 24576], [6144]]
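
A table like the one above can be collected roughly as follows with torch.profiler (the prompt and generation settings below are placeholders; `model` and `tokenizer` are from the loading sketch above):

```python
# Collect a CPU/CUDA profile of a short generation and print the hottest ops,
# grouped by input shape (placeholder prompt and max_new_tokens).
import torch
from torch.profiler import ProfilerActivity, profile

inputs = tokenizer("def hello_world():", return_tensors="pt").to("cuda:0")

with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    record_shapes=True,
) as prof:
    with torch.no_grad():
        model.generate(**inputs, max_new_tokens=64)

print(prof.key_averages(group_by_input_shape=True)
          .table(sort_by="cpu_time_total", row_limit=10))
```
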

The profile shows the CPU spending most of its time in MatMul8bitLt operations.
Another lead I am suspecting: accelerate's infer_device_map shows half of the layers on the CPU, so maybe the device map is not correct?

I am not very familiar with the internals of PyTorch, so I am looking at this issue from a generalist IT point of view; surely I am missing something.
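
One way to sanity-check the placement is sketched below; infer_auto_device_map is the accelerate helper, and the dtype argument and checkpoint ID are assumptions:

```python
# Sketch: ask accelerate where it would place each module, without loading weights.
import torch
from accelerate import infer_auto_device_map, init_empty_weights
from transformers import AutoConfig, AutoModelForCausalLM

config = AutoConfig.from_pretrained("bigcode/starcoder")
with init_empty_weights():
    empty_model = AutoModelForCausalLM.from_config(config)

# Proposed placement given the available GPU/CPU memory, assuming int8 weights.
proposed = infer_auto_device_map(empty_model, dtype=torch.int8)
print(proposed)

# The placement actually used by an already-loaded model:
# print(model.hf_device_map)
```
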

@qeternity

I have been trying to figure this out for a week now without much success. The forward function of MatMul8bitLt is definitely CPU-bound.

@BrandonKoerner

Interested in this, as I am having the same issues.

@NeonBohdan

Same for me, but for training (#465); it's slower than regular fp16.


Oxi84 commented Jun 2, 2023

Try 4-bit; it seems to be faster.
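
For reference, a minimal 4-bit loading sketch (assuming transformers >= 4.30 and bitsandbytes >= 0.39; NF4 with fp16 compute is one common configuration, not necessarily what was tested here):

```python
# 4-bit (NF4) quantized loading via bitsandbytes; model ID is the one from this thread.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

model = AutoModelForCausalLM.from_pretrained(
    "bigcode/starcoder",
    quantization_config=bnb_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("bigcode/starcoder")
```
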

piratos (author) commented Jun 5, 2023

FWIW, I played around with the CTranslate2 library; in streaming mode StarCoder feels almost real-time (on an RTX 3090).
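
A rough sketch of that CTranslate2 path (the conversion command and the generate_tokens arguments are assumptions; adjust as needed):

```python
# Assumed one-time conversion:
#   ct2-transformers-converter --model bigcode/starcoder \
#       --output_dir starcoder-ct2 --quantization int8_float16
import ctranslate2
import transformers

generator = ctranslate2.Generator("starcoder-ct2", device="cuda")
tokenizer = transformers.AutoTokenizer.from_pretrained("bigcode/starcoder")

prompt_tokens = tokenizer.convert_ids_to_tokens(tokenizer.encode("def fibonacci(n):"))

# generate_tokens streams one token per step, which is what makes it feel real-time.
for step in generator.generate_tokens(prompt_tokens, max_length=128):
    print(tokenizer.decode([step.token_id]), end="", flush=True)
```
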


This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
