
Offload model head when using cuBLAS #106

Closed
saharNooby opened this issue Jun 22, 2023 · 4 comments · Fixed by #131
Labels: enhancement, good first issue

Comments

@saharNooby
Collaborator

Currently, only the layer matrices are offloaded to the GPU. The head, the biggest matrix in the model, stays on the CPU and is evaluated there.

On my machine, offloading the head of a 14B model in addition to all of its layers gives 60 ms per-token latency, vs 70 ms without head offloading.

As always, the hardest question here is API design -- we need to preserve compatibility and avoid inflating the API with small new functions.

@saharNooby added the enhancement and good first issue labels on Jun 22, 2023
@LoganDark
Contributor

LoganDark commented Jun 24, 2023

On my machine, offloading the head of a 14B model in addition to all of its layers gives 60 ms per-token latency, vs 70 ms without head offloading.

You're right, there is a significant speedup here. I just tested this on my machine and I get the same sort of speedup: processing 128 tokens in sequence mode goes from, funnily enough, 70 ms to 60 ms if I offload the model head.

I just slapped this

            if (i == 0) {
                // Offload the head tensor along with the first layer.
                offload(ctx->instance->model.head);
            }

into the rwkv_gpu_offload_layers loop in order to test. It does literally shave 10ms. It also shaves nearly 50% off evaluating the same sequence in serial mode.

This ends up offloading the head automatically once the first layer is offloaded, which might be the best approach with the current API.
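
For anyone who wants to try it, here is a minimal usage sketch of the public API this touches. rwkv_init_from_file and rwkv_gpu_offload_layers are the existing rwkv.h functions; the model path, thread count, and layer count are placeholders.

    #include "rwkv.h"

    // With the change above, offloading any number of layers also offloads
    // the head, since the head offload triggers on i == 0.
    struct rwkv_context * ctx = rwkv_init_from_file("/path/to/model.bin", 4 /* threads */);
    rwkv_gpu_offload_layers(ctx, 32 /* layers; the head comes along with the first one */);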

@LoganDark
Contributor

This makes me want to introduce an optimization where passing NULL logits to rwkv_eval does not evaluate logits at all

@LoganDark
Contributor

LoganDark commented Jun 24, 2023

This makes me want to introduce an optimization where passing NULL logits to rwkv_eval does not evaluate logits at all

This creates further speedups (about 20% when evaluating 128 tokens in serial mode).
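
Roughly, the check could sit in graph construction, something like this (sketch only -- ggml_mul_mat is real ggml, but rwkv_layer_norm, ctx0, and the ln_out field names are assumptions about the internals, not the actual code):

    // Sketch only: skip the final layer norm and the head matmul entirely
    // when the caller passed NULL for logits_out.
    if (logits_out != NULL) {
        x = rwkv_layer_norm(ctx0, x, model.ln_out_weight, model.ln_out_bias); // hypothetical helper/fields
        logits = ggml_mul_mat(ctx0, model.head, x); // the biggest matmul in the model
    }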

@saharNooby
Collaborator Author

This ends up offloading the head automatically once the first layer is offloaded, which might be the best approach with the current API.

I implemented it a little differently -- if the last layer was just offloaded, then the head is offloaded too. Alternatively, we could treat the head as a separate "layer" (AFAIK llama.cpp treats the embedding/head like this) -- if a model has 24 actual layers and we ask to offload 25, then the head is offloaded too.
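
A sketch of that second convention, for the sake of discussion (every name here except model.head is assumed; this isn't necessarily what the fix will look like):

    // Sketch of "head as an extra layer": a 24-layer model accepts 25,
    // where index 24 (== n_layer) means the head.
    for (uint32_t i = 0; i < n_layers_to_offload && i <= model.header.n_layer; i++) {
        if (i < model.header.n_layer) {
            offload_layer(model.layers[i]); // hypothetical per-layer offload
        } else {
            offload(model.head);            // the extra "layer" is the head
        }
    }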

This makes me want to introduce an optimization where passing NULL logits to rwkv_eval does not evaluate logits at all
This creates further speedups (by 20% when evaluating 128 tokens using serial mode)

Sounds great, I support it!
