Support for Left Padding Mask KV? #649
I haven't used left-padding. What's the use case of left padding instead of right padding the kv cache?
@tridao We are trying to support as many different formats as possible... In our case, a lot of models are trained with left-padding and it's useful to support it directly at the kernel level. Are there plans for left-padding support or general masking? Thank you!
I'm struggling with this as well. Consider:
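Two sequences whose prompts have very different lengths, laid out right-padded (an illustrative reconstruction; the tokens and lengths are placeholders chosen to match the positions mentioned below):

```
position:   0    1    2    3    4    5
seq 0:     a0    .    .    .    .    .    <- 1-token prompt; its next token belongs at position 1
seq 1:     b0   b1   b2   b3   b4    .    <- 5-token prompt; its next token belongs at position 5
```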
Here we can't really do any batching at all because the sequences don't line up. We could produce four tokens for seq 0 first, then begin the batched inference after that, or we could start batching right away but discard the results for seq 1 until we reach position 5. Either approach is wasteful compared to left-padding:
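With left-padding the same batch instead looks like this (again illustrative; P marks a padded position that must be masked out of attention):

```
position:   0    1    2    3    4    5
seq 0:      P    P    P    P   a0    .    <- next token at position 5
seq 1:     b0   b1   b2   b3   b4    .    <- next token at position 5
```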
Now you can sample token 5 for both seqs in one forward pass immediately. The tradeoffs are the wasted inference on the padding tokens during prompt ingestion, and the extra VRAM allocated to the masked keys/values. Whether those are good tradeoffs depends on the circumstances.

One thing I'm working on at the moment is classifier-free guidance, where two prompts of roughly equal length (but maybe differing sentiment) are evaluated in parallel to sample one token from a mix of the two sets of logits. Right-padding simply doesn't work for that. Unpadding could work, but it's also wasteful since it requires reshaping the entire K/V cache once per token. If there were some way to modulate the attention weights before the softmax, that would unlock not just left-padding but some neat opportunities in speculative decoding as well.
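To make that concrete, here is a plain PyTorch reference (not FlashAttention, and deliberately naive) of what modulating the attention weights before the softmax would need to do for a left-padded cache; the function name and the `pad_lens` argument are hypothetical:

```python
import torch
import torch.nn.functional as F

def attn_with_left_padding(q, k, v, pad_lens):
    """Naive single-step attention over a left-padded KV cache.
    q: (B, H, 1, D) query for the current decode step
    k, v: (B, H, S, D) left-padded key/value caches
    pad_lens: (B,) int tensor, number of left-pad positions per sequence
    """
    B, H, S, D = k.shape
    scores = torch.einsum("bhqd,bhkd->bhqk", q, k) / D ** 0.5   # (B, H, 1, S)
    # Positions before pad_lens[b] are padding and must receive zero attention.
    pos = torch.arange(S, device=k.device)                      # (S,)
    pad_mask = pos[None, :] < pad_lens[:, None]                 # (B, S), True = padded
    scores = scores.masked_fill(pad_mask[:, None, None, :], float("-inf"))
    probs = F.softmax(scores, dim=-1)
    return torch.einsum("bhqk,bhkd->bhqd", probs, v)            # (B, H, 1, D)
```

In the left-padded layout above, `pad_lens` would be `torch.tensor([4, 0])`. A fused kernel doing the equivalent of that `masked_fill` (or accepting per-sequence start offsets) is what's being asked for here.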
Are there plans, or a way, to support a left-padding KV attention mask? I believe right padding can be supported with the mha_fwd_kvcache API via the seqlens_k_ pointer, but will there be a similar option for left padding?
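For reference, the right-padded case already works from the Python side through the kvcache path; something roughly like the sketch below, assuming the `flash_attn_with_kvcache` wrapper and its `cache_seqlens` argument (the exact signature may differ between versions):

```python
import torch
from flash_attn import flash_attn_with_kvcache

B, H, D, S_max = 2, 8, 64, 512
q = torch.randn(B, 1, H, D, dtype=torch.float16, device="cuda")
k_cache = torch.zeros(B, S_max, H, D, dtype=torch.float16, device="cuda")
v_cache = torch.zeros(B, S_max, H, D, dtype=torch.float16, device="cuda")

# cache_seqlens gives the used length per sequence; keys/values past it are
# ignored, which is effectively right padding. A left-padding option would
# also need a per-sequence start offset (or a general pre-softmax mask).
cache_seqlens = torch.tensor([5, 1], dtype=torch.int32, device="cuda")
out = flash_attn_with_kvcache(q, k_cache, v_cache, cache_seqlens=cache_seqlens, causal=True)
```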