Offload model head when using cuBLAS #106
Comments
You're right, there is a significant speedup here. I just tested this on my machine and I get the same sort of speedup: processing 128 tokens in sequence mode goes from, funnily enough, 70 ms to 60 ms if I offload the model head. I just slapped this

```c
if (i == 0) {
    offload(ctx->instance->model.head);
}
```

into the layer offload loop. This ends up offloading the head automatically once the first layer is offloaded, which might be the best approach with the current API.
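For context, a minimal sketch of where this sits, assuming hypothetical internals (`offload`, `offload_layer`, and the loop shape are illustrative stand-ins, not rwkv.cpp's actual code):

```c
// Illustrative sketch only: `offload`, `offload_layer`, and the loop shape
// are hypothetical stand-ins for rwkv.cpp's internal layer offload loop.
for (size_t i = 0; i < n_layers_to_offload; i++) {
    if (i == 0) {
        // Piggyback the head on the first offloaded layer, so offloading
        // anything at all also moves the head to the GPU.
        offload(ctx->instance->model.head);
    }

    offload_layer(&ctx->instance->model.layers[i]);
}
```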
This makes me want to introduce an optimization where passing NULL logits to rwkv_eval does not evaluate logits at all.
This creates a further speedup (about 20% when evaluating 128 tokens in serial mode).
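As an illustration of the caller side, here is a sketch assuming rwkv_eval keeps its current shape (ctx, token, state_in, state_out, logits_out), where a NULL state_in means a fresh state:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include "rwkv.h"

// Feed a prompt serially, requesting logits only for the final token.
// With the proposed optimization, passing NULL for logits_out skips
// evaluating the head entirely on every earlier token.
static void eval_prompt(struct rwkv_context * ctx,
                        const uint32_t * tokens, const size_t n_tokens,
                        float * state, float * logits) {
    for (size_t i = 0; i < n_tokens; i++) {
        const bool last = (i + 1 == n_tokens);

        rwkv_eval(
            ctx,
            tokens[i],
            i == 0 ? NULL : state, // NULL state_in = start from an empty state
            state,
            last ? logits : NULL   // NULL logits_out = don't evaluate the head
        );
    }
}
```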
I implemented it a little differently -- if the last layer was just offloaded, then the head gets offloaded as well.
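In other words, something along these lines inside the offload path (names hypothetical):

```c
// Hypothetical sketch of the alternative: the head rides along only once
// the *last* layer gets offloaded, i.e. with full-model offloading.
if (n_offloaded == model->n_layers) {
    offload(model->head);
}
```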
Sounds great, I support it!
Currently, only the layer matrices are offloaded to the GPU. The head, the biggest matrix in the model, stays on the CPU and is evaluated there.

On my machine, offloading the head of a 14B model in addition to offloading all layers gives 60 ms per-token latency vs. 70 ms without head offloading.

As always, the hardest question here is API design -- we need to preserve compatibility and not inflate the API with new small functions.
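For reference, a caller-side sketch of the two design options under discussion; the existing entry point is rwkv_gpu_offload_layers, and the layer count here is an assumption for illustration:

```c
// Option A (API inflation this issue wants to avoid): a new tiny function.
// bool rwkv_gpu_offload_head(struct rwkv_context * ctx);   // hypothetical
//
// Option B (what the comments above converge on): fold head offloading into
// the existing call, triggered once all layers are offloaded.
rwkv_gpu_offload_layers(ctx, 40); // 40 layers assumed for a 14B model
```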