Slow inference on LLM model (CPU Bound?) #388
Comments
I have been trying to figure this out for a week now without much success. The forward function of MatMul8bitLt is definitely CPU-bound.
Interested in this, as I am having the same issues.
The same for me for training; see #465.
Try 4-bit; it seems to be faster.
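For anyone who wants to try that, here is a minimal sketch of 4-bit loading via transformers and bitsandbytes; the model id, prompt, and compute dtype are illustrative choices, not taken from this thread:

```python
# Sketch: load the model in 4-bit instead of 8-bit.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,  # compute in fp16 rather than the fp32 default
)

tokenizer = AutoTokenizer.from_pretrained("bigcode/starcoder")
model = AutoModelForCausalLM.from_pretrained(
    "bigcode/starcoder",
    quantization_config=quant_config,
    device_map="auto",
)

inputs = tokenizer("def fibonacci(n):", return_tensors="pt").to(model.device)
print(tokenizer.decode(
    model.generate(**inputs, max_new_tokens=32)[0],
    skip_special_tokens=True,
))
```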
FWIW, I played around with the CTranslate2 lib; in streaming mode StarCoder feels almost realtime! (on an RTX 3090)
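A sketch of what that streaming setup can look like, assuming the checkpoint was first converted with CTranslate2's converter; the output directory name and generation settings are illustrative:

```python
# Sketch: stream tokens from a CTranslate2-converted StarCoder model.
# Assumes a prior conversion such as:
#   ct2-transformers-converter --model bigcode/starcoder \
#       --output_dir starcoder-ct2 --quantization int8_float16
import ctranslate2
from transformers import AutoTokenizer

generator = ctranslate2.Generator("starcoder-ct2", device="cuda")
tokenizer = AutoTokenizer.from_pretrained("bigcode/starcoder")

prompt = tokenizer.convert_ids_to_tokens(tokenizer.encode("def fibonacci(n):"))

# generate_tokens yields one result per decoded token, so text can be
# printed as soon as it is produced instead of after the full sequence.
for step in generator.generate_tokens(prompt, max_length=64):
    print(tokenizer.decode([step.token_id]), end="", flush=True)
print()
```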
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Original issue:
I tried to run the StarCoder LLM by loading it in 8-bit. It works as expected, but inference is slow: one CPU core runs at 100%, which is odd given that everything should be loaded onto the GPU (the device_map shows `{'': 0}`). I profiled a generation step with the PyTorch profiler, and the trace shows the CPU stuck in `MatMul8bitLt` operations.
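For reference, this is roughly how such a measurement can be reproduced; a minimal sketch, assuming a transformers build with bitsandbytes 8-bit support, where the model id, prompt, and token counts are illustrative:

```python
# Sketch: load in 8-bit, then profile one generation step.
import torch
from torch.profiler import profile, ProfilerActivity
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "bigcode/starcoder",
    load_in_8bit=True,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("bigcode/starcoder")
print(model.hf_device_map)  # {'': 0} would mean the whole model is on GPU 0

inputs = tokenizer("def fibonacci(n):", return_tensors="pt").to("cuda:0")
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    model.generate(**inputs, max_new_tokens=16)

# Sort by CPU time: a CPU-bound MatMul8bitLt would dominate this table.
print(prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=10))
```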
Another lead I am suspecting: `accelerate.infer_auto_device_map` shows half of the layers on the CPU, so maybe the device map is not correct? I am not very familiar with the internals of PyTorch, so I am looking at this issue from a generalist IT point of view; surely I am missing something.
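As a sanity check, something along these lines shows where accelerate would place each layer without loading any weights; a sketch, with purely illustrative memory limits:

```python
# Sketch: compute a device map on an empty (meta) model to see which
# layers accelerate would push to the CPU.
from accelerate import infer_auto_device_map, init_empty_weights
from transformers import AutoConfig, AutoModelForCausalLM

config = AutoConfig.from_pretrained("bigcode/starcoder")
with init_empty_weights():
    empty_model = AutoModelForCausalLM.from_config(config)

device_map = infer_auto_device_map(
    empty_model,
    max_memory={0: "20GiB", "cpu": "64GiB"},
)
# Any layer mapped to "cpu" here would be offloaded and run on the CPU.
print(device_map)
```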