
Questions on DotProductAttention API Usage in Flash Attention thd Mode #1409

@pipSu

Description


We are using Megatron-LM with Transformer Engine (TE) for Flash Attention, specifically in THD mode, and we have some questions about the API usage.

  1. What is the specific difference between cu_seqlens_q and cu_seqlens_q_padded?
    From the documentation and example code, it seems that both parameters are passed the padded values. How are they handled differently internally? And what does "sequence lengths with/without offset" mean?

  2. We are conducting SFT (Supervised Fine-Tuning) training and aim to construct the mask

    attention_mask = causal_inputs_mask * padding_mask * segment_mask

    However, we are having trouble getting padding_mask and causal_inputs_mask right when tokens are padded so that each sequence length is divisible by 2 × CP.
    For example, with CP=2 and the packed sequence [1, 2, 3, pad, 4, 5, pad, pad], both cu_seqlens_q and cu_seqlens_q_padded are currently set to [0, 4, 8]. Our attempts to address this by setting cu_seqlens_q differently from cu_seqlens_q_padded have consistently resulted in NaN errors. How should we set the attention mask so that these padding tokens are handled correctly?
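For reference, here is a minimal sketch (plain Python, not the TE API) of how the two cumulative-length tensors would typically differ for the example above, under the assumption that cu_seqlens_q holds cumulative *unpadded* lengths while cu_seqlens_q_padded holds cumulative *padded* lengths, i.e. the token offsets of each sequence in the packed THD buffer:

```python
from itertools import accumulate

def cu_seqlens(lengths):
    # Cumulative sequence lengths, prefixed with 0.
    return [0] + list(accumulate(lengths))

# Example from the question: two sequences of actual lengths 3 and 2,
# each padded to 4 tokens (divisible by 2 * CP with CP = 2).
actual_lens = [3, 2]
padded_lens = [4, 4]

cu_seqlens_q = cu_seqlens(actual_lens)         # [0, 3, 5]  (without padding offsets)
cu_seqlens_q_padded = cu_seqlens(padded_lens)  # [0, 4, 8]  (with padding offsets)
print(cu_seqlens_q, cu_seqlens_q_padded)
```

Under this reading, setting both tensors to [0, 4, 8] tells the kernel that the padding tokens are real tokens, which is one plausible source of the confusion described above.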
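To make the intended causal × padding × segment semantics concrete, here is a minimal boolean-mask sketch for the same packed layout (True = may attend). This is plain illustrative Python with a hypothetical helper name, not the TE implementation:

```python
def build_thd_mask(padded_lens, actual_lens):
    """Combined causal * padding * segment mask for one packed THD sequence.

    padded_lens[i] / actual_lens[i]: padded / real length of sequence i.
    Returns a total x total boolean matrix; True means query q may attend key k.
    """
    total = sum(padded_lens)
    # Segment id per token position; -1 marks a padding token.
    seg = []
    for i, (p, a) in enumerate(zip(padded_lens, actual_lens)):
        seg += [i] * a + [-1] * (p - a)
    mask = [[False] * total for _ in range(total)]
    for q in range(total):
        for k in range(total):
            mask[q][k] = (
                seg[q] != -1            # padding_mask: pad queries attend nothing
                and seg[q] == seg[k]    # segment_mask: no cross-sequence attention
                and k <= q              # causal mask within the segment
            )
    return mask

# Layout from the question: [1, 2, 3, pad, 4, 5, pad, pad]
mask = build_thd_mask([4, 4], [3, 2])
```

Note that pad *keys* are excluded automatically here, since seg[k] == -1 never equals a real segment id.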
