Support for Left Padding Mask KV? #649

Open
aciddelgado opened this issue Nov 1, 2023 · 3 comments

@aciddelgado

Are there plans, or an existing way, to support a left-padding KV attention mask? I believe right padding can be supported with the mha_fwd_kvcache API via the seqlens_k_ pointer, but will there be a similar option for left padding?
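For reference, right padding through the Python wrapper looks roughly like the sketch below. It assumes flash_attn_with_kvcache and its cache_seqlens argument (the Python-side counterpart of the seqlens_k_ pointer); treat it as an illustration rather than the exact API surface.

import torch
from flash_attn import flash_attn_with_kvcache  # assumed Python wrapper around mha_fwd_kvcache

batch, nheads, headdim, max_seqlen = 2, 8, 64, 16
device, dtype = "cuda", torch.float16

# Right-padded KV cache: valid entries for sequence b occupy positions [0, cache_seqlens[b]).
k_cache = torch.zeros(batch, max_seqlen, nheads, headdim, device=device, dtype=dtype)
v_cache = torch.zeros_like(k_cache)
cache_seqlens = torch.tensor([1, 5], dtype=torch.int32, device=device)  # per-sequence valid lengths

# One new query token per sequence (a decode step); the new K/V get appended at cache_seqlens[b].
q = torch.randn(batch, 1, nheads, headdim, device=device, dtype=dtype)
k_new = torch.randn(batch, 1, nheads, headdim, device=device, dtype=dtype)
v_new = torch.randn_like(k_new)

# Attention only reaches the first cache_seqlens[b] + 1 cache entries,
# so everything past that (the right padding) is ignored.
out = flash_attn_with_kvcache(q, k_cache, v_cache, k=k_new, v=v_new,
                              cache_seqlens=cache_seqlens, causal=True)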

@tridao
Contributor

tridao commented Nov 1, 2023

I haven't used left-padding. What's the use case of left padding instead of right padding the KV cache?

@aciddelgado
Author

@tridao We are trying to support as many different formats as possible... In our case, a lot of models are trained with left-padding, and it's useful to support it directly at the kernel level. Are there plans for left-padding support or for general masking? Thank you!

@turboderp

I'm struggling with this as well. Consider:

seq   0     1     2     3     4     5
 0    Hello
 1    Once  upon  a     time  ,

Here we can't really do any batching at all because the sequences don't line up. We could produce four tokens for seq 0 first, then begin the batched inference after that, or we could start batching right away but discard the results for seq 1 until we reach position 5. Either approach is wasteful compared to left-padding:

seq   0     1     2     3     4     5
 0    pad   pad   pad   pad   Hello
 1    Once  upon  a     time  ,

Now you can sample token 5 for both seqs in one forward pass immediately. The tradeoffs are the wasted inference on the padding tokens during prompt ingestion, and the extra VRAM allocated to the masked keys/values. Whether those are good tradeoffs depends on the circumstances.
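To make that layout concrete, here is a small illustration in plain PyTorch (the token ids and pad_id are made up, and none of this touches the FlashAttention API); the point is that a kernel would have to skip the keys/values flagged False in the mask.

import torch

pad_id = 0  # hypothetical pad token id
seqs = [[101], [7, 8, 9, 10, 11]]  # seq 0: "Hello", seq 1: "Once upon a time ,"

max_len = max(len(s) for s in seqs)
# Left-pad so every sequence ends at the same position.
input_ids = torch.tensor([[pad_id] * (max_len - len(s)) + s for s in seqs])
key_padding_mask = input_ids.ne(pad_id)  # True where the key/value is a real token

# input_ids:
# [[ 0,  0,  0,  0, 101],
#  [ 7,  8,  9, 10,  11]]
# key_padding_mask:
# [[False, False, False, False,  True],
#  [ True,  True,  True,  True,  True]]
# Token 5 can now be sampled for both sequences in a single forward pass.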

One thing I'm working on at the moment is classifier-free guidance, where two prompts of roughly equal length (but maybe differing sentiment) are evaluated in parallel to sample one token from a mix of the two sets of logits. Right-padding simply doesn't work for that. Unpadding could work, but it's also wasteful since it requires reshaping the entire K/V cache once per token.
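For context, the logit-mixing step of classifier-free guidance is just the usual linear rule sketched below (guidance_scale is the conventional name for the mixing weight, not anything from this repo):

import torch

def cfg_mix(cond_logits: torch.Tensor, uncond_logits: torch.Tensor,
            guidance_scale: float = 1.5) -> torch.Tensor:
    # guidance_scale = 1.0 recovers the conditional logits; > 1.0 pushes away from the unconditional ones.
    return uncond_logits + guidance_scale * (cond_logits - uncond_logits)

# Both prompts run as a batch of 2; their last-position logits are mixed and a single
# token is sampled, then appended to both sequences -- which is why they must stay aligned.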

If there were some way to modulate the attention weights before the softmax, that would unlock not just left-padding but some neat opportunities in speculative decoding as well.
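In a naive (non-fused) attention implementation, that kind of hook is just an additive bias on the score matrix before the softmax; a sketch follows, where attn_bias is a hypothetical tensor holding -inf at left-padded key positions (or at disallowed positions in a speculative-decoding tree mask).

import math
import torch

def attention_with_bias(q, k, v, attn_bias):
    # q, k, v: (batch, nheads, seqlen, headdim)
    # attn_bias: broadcastable to (batch, nheads, seqlen_q, seqlen_k), holding 0 or -inf per position
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
    scores = scores + attn_bias            # -inf entries contribute zero weight after the softmax
    probs = torch.softmax(scores, dim=-1)
    return probs @ v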
