
Slow inference on LLM model (CPU Bound?) #388

Closed

piratos opened this issue May 11, 2023 · 6 comments

Comments

piratos commented May 11, 2023

I tried to run the StarCoder LLM by loading it in 8-bit.
It works as expected, but inference is slow: one CPU core runs at 100%, which is odd given that everything should be loaded onto the GPU (the device_map shows {'': 0}).
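
For context, the model is loaded through the standard transformers 8-bit path. A minimal sketch (not the exact script from this issue), assuming bitsandbytes and accelerate are installed:

```python
# Minimal 8-bit loading sketch (assumes transformers with bitsandbytes/accelerate;
# "bigcode/starcoder" is the checkpoint discussed in this issue).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "bigcode/starcoder"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",   # let accelerate place the layers
    load_in_8bit=True,   # bitsandbytes LLM.int8() quantization
)
print(model.hf_device_map)  # {'': 0} when everything fits on GPU 0
```
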
Profiling with the PyTorch profiler shows the following (a sketch of how to collect such a profile follows the table):

------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  -------------------------------------------------------  
        Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls                                             Input Shapes  
------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  -------------------------------------------------------  
MatMul8bitLt        11.65%        1.214s        19.18%        1.998s     445.921us     390.375ms         9.73%     515.878ms     115.151us          4480                     [[4, 1, 6144], [6144, 6144], [6144]]  
MatMul8bitLt        11.65%        1.214s        19.11%        1.991s     444.408us     391.215ms         9.75%     516.690ms     115.333us          4480                     [[4, 1, 6144], [6400, 6144], [6400]]  
MatMul8bitLt        11.61%        1.210s        19.04%        1.983s     442.660us        1.068s        26.60%        1.197s     267.199us          4480                   [[4, 1, 6144], [24576, 6144], [24576]]  
MatMul8bitLt        11.44%        1.192s        18.88%        1.967s     439.071us        1.154s        28.76%        1.281s     286.006us          4480                   [[4, 1, 24576], [6144, 24576], [6144]]
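
A table like the one above can be collected roughly as follows with torch.profiler (the prompt and generation settings below are placeholders; `model` and `tokenizer` are from the loading sketch above):

```python
# Collect a CPU/CUDA profile of a short generation and print the hottest ops,
# grouped by input shape (placeholder prompt and max_new_tokens).
import torch
from torch.profiler import ProfilerActivity, profile

inputs = tokenizer("def hello_world():", return_tensors="pt").to("cuda:0")

with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    record_shapes=True,
) as prof:
    with torch.no_grad():
        model.generate(**inputs, max_new_tokens=64)

print(prof.key_averages(group_by_input_shape=True)
          .table(sort_by="cpu_time_total", row_limit=10))
```
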

The profile shows the CPU spending most of its time in MatMul8bitLt operations.
Another lead I am suspecting: accelerate's infer_device_map shows half of the layers on the CPU, so maybe the device map is not correct?

I am not very familiar with the internals of PyTorch, so I am looking at this issue from a generalist IT point of view; surely I am missing something.
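
One way to sanity-check the placement is sketched below; infer_auto_device_map is the accelerate helper, and the dtype argument and checkpoint ID are assumptions:

```python
# Sketch: ask accelerate where it would place each module, without loading weights.
import torch
from accelerate import infer_auto_device_map, init_empty_weights
from transformers import AutoConfig, AutoModelForCausalLM

config = AutoConfig.from_pretrained("bigcode/starcoder")
with init_empty_weights():
    empty_model = AutoModelForCausalLM.from_config(config)

# Proposed placement given the available GPU/CPU memory, assuming int8 weights.
proposed = infer_auto_device_map(empty_model, dtype=torch.int8)
print(proposed)

# The placement actually used by an already-loaded model:
# print(model.hf_device_map)
```
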

@qeternity

I have been trying to figure this out for a week now without much success. The forward function of MatMul8bitLt is definitely CPU-bound.

@BrandonKoerner

Interested in this, as I am having the same issues.

@NeonBohdan

Same for me, but for training (#465); it's slower than regular fp16.


Oxi84 commented Jun 2, 2023

Try 4-bit; it seems to be faster.
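
For reference, a minimal 4-bit loading sketch (assuming transformers >= 4.30 and bitsandbytes >= 0.39; NF4 with fp16 compute is one common configuration, not necessarily what was tested here):

```python
# 4-bit (NF4) quantized loading via bitsandbytes; model ID is the one from this thread.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

model = AutoModelForCausalLM.from_pretrained(
    "bigcode/starcoder",
    quantization_config=bnb_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("bigcode/starcoder")
```
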

piratos (author) commented Jun 5, 2023

FWIW, I played around with the CTranslate2 library; in streaming mode StarCoder feels almost real-time (on an RTX 3090).
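
A rough sketch of that CTranslate2 path (the conversion command and the generate_tokens arguments are assumptions; adjust as needed):

```python
# Assumed one-time conversion:
#   ct2-transformers-converter --model bigcode/starcoder \
#       --output_dir starcoder-ct2 --quantization int8_float16
import ctranslate2
import transformers

generator = ctranslate2.Generator("starcoder-ct2", device="cuda")
tokenizer = transformers.AutoTokenizer.from_pretrained("bigcode/starcoder")

prompt_tokens = tokenizer.convert_ids_to_tokens(tokenizer.encode("def fibonacci(n):"))

# generate_tokens streams one token per step, which is what makes it feel real-time.
for step in generator.generate_tokens(prompt_tokens, max_length=128):
    print(tokenizer.decode([step.token_id]), end="", flush=True)
```
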


This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
