Questions on DotProductAttention API Usage in Flash Attention thd Mode #1409
Description
We are using Megatron-LM with TE (Transformer Engine) for Flash Attention, specifically in the THD mode, and we have some questions about the API usage.
1. What is the specific difference between `cu_seqlens_q` and `cu_seqlens_q_padded`? From the documentation and example code, it appears that both parameters are passed padded values. How are they handled differently internally, and what is meant by "sequence lengths with/without offset"?
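For context, here is our current understanding (an assumption we would like confirmed, not a statement of TE's actual internals): `cu_seqlens_q` holds cumulative sums of the *actual* token counts per sequence, while `cu_seqlens_q_padded` holds cumulative offsets into the *padded* THD buffer. A minimal sketch with hypothetical lengths:

```python
import numpy as np

# Hypothetical batch of 2 packed sequences in a THD buffer.
seq_lens = [3, 2]     # real (unpadded) tokens per sequence
padded_lens = [4, 4]  # slot size each sequence occupies after padding

# cu_seqlens_q: cumulative *actual* lengths (padding not counted)
cu_seqlens_q = np.concatenate([[0], np.cumsum(seq_lens)])

# cu_seqlens_q_padded: cumulative offsets into the padded buffer
cu_seqlens_q_padded = np.concatenate([[0], np.cumsum(padded_lens)])

print(cu_seqlens_q.tolist())         # [0, 3, 5]
print(cu_seqlens_q_padded.tolist())  # [0, 4, 8]
```

Is this the intended distinction between the two parameters?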
2. We are conducting SFT (Supervised Fine-Tuning) training and aim to construct the mask as
attention_mask = causal_inputs_mask * padding_mask * segment_mask
However, we are having difficulty ensuring the accuracy of `padding_mask` and `causal_inputs_mask` when tokens are padded for context parallelism (CP); for example, with cp=2 and a packed sequence [1, 2, 3, pad, 4, 5, pad, pad]. Currently, both `cu_seqlens_q` and `cu_seqlens_q_padded` are set to [0, 4, 8]. Our attempts to set `cu_seqlens_q` differently from `cu_seqlens_q_padded` have consistently resulted in NaN errors. How should we correctly set the attention mask to handle these padding tokens?
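To make the intent concrete, here is a sketch of the mask we are trying to build for this toy buffer. All names are illustrative (not TE's API); we are assuming the pad slots should be excluded by the padding mask and that cross-sequence attention should be blocked by the segment mask:

```python
import numpy as np

# Toy packed buffer from the question: [1, 2, 3, pad, 4, 5, pad, pad]
seq_lens = [3, 2]     # real tokens per sequence
padded_lens = [4, 4]  # each sequence padded to 4 slots
T = sum(padded_lens)  # total padded length: 8

valid = np.zeros(T, dtype=bool)  # True where a real token sits
seg = np.full(T, -1)             # segment id per slot, -1 = pad
off = 0
for i, (n, p) in enumerate(zip(seq_lens, padded_lens)):
    valid[off:off + n] = True
    seg[off:off + n] = i
    off += p

causal = np.tril(np.ones((T, T), dtype=bool))       # causal_inputs_mask
padding_mask = valid[:, None] & valid[None, :]      # block pad rows/cols
segment_mask = seg[:, None] == seg[None, :]         # block cross-sequence
attention_mask = causal & padding_mask & segment_mask
```

With this construction, token 4 (index 4, the start of the second sequence) cannot attend to tokens of the first sequence, and pad slots attend to and are attended by nothing. Is this the mask semantics that DotProductAttention expects in THD mode, and if so, how should it be expressed via `cu_seqlens_q` / `cu_seqlens_q_padded` without triggering NaNs?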