Questions on DotProductAttention API Usage in Flash Attention thd Mode #1409
Description
We are using Megatron-LM with TE (Transformer Engine) for Flash Attention, specifically in the THD mode, and we have some questions about the API usage.
1. What is the specific difference between `cu_seqlens_q` and `cu_seqlens_q_padded`? From the documentation and example code, it appears that both parameters are passed padded values. How are they handled differently internally, and what is meant by "sequence lengths with/without offset"?
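For context, here is our current understanding (an assumption we would like confirmed, not a statement of TE's actual internals): `cu_seqlens_q` holds cumulative sums of the *actual* token counts per sequence, while `cu_seqlens_q_padded` holds cumulative offsets into the *padded* THD buffer. A minimal sketch with hypothetical lengths:

```python
import numpy as np

# Hypothetical batch of 2 packed sequences in a THD buffer.
seq_lens = [3, 2]     # real (unpadded) tokens per sequence
padded_lens = [4, 4]  # slot size each sequence occupies after padding

# cu_seqlens_q: cumulative *actual* lengths (padding not counted)
cu_seqlens_q = np.concatenate([[0], np.cumsum(seq_lens)])

# cu_seqlens_q_padded: cumulative offsets into the padded buffer
cu_seqlens_q_padded = np.concatenate([[0], np.cumsum(padded_lens)])

print(cu_seqlens_q.tolist())         # [0, 3, 5]
print(cu_seqlens_q_padded.tolist())  # [0, 4, 8]
```

Is this the intended distinction between the two parameters?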
2. We are conducting SFT (Supervised Fine-Tuning) training and aim to construct the mask as
attention_mask = causal_inputs_mask * padding_mask * segment_mask
However, we are having difficulty ensuring the accuracy of `padding_mask` and `causal_inputs_mask` when tokens are padded for context parallelism (CP); for example, with cp=2 and a packed sequence [1, 2, 3, pad, 4, 5, pad, pad]. Currently, both `cu_seqlens_q` and `cu_seqlens_q_padded` are set to [0, 4, 8]. Our attempts to set `cu_seqlens_q` differently from `cu_seqlens_q_padded` have consistently resulted in NaN errors. How should we correctly set the attention mask to handle these padding tokens?
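To make the intent concrete, here is a sketch of the mask we are trying to build for this toy buffer. All names are illustrative (not TE's API); we are assuming the pad slots should be excluded by the padding mask and that cross-sequence attention should be blocked by the segment mask:

```python
import numpy as np

# Toy packed buffer from the question: [1, 2, 3, pad, 4, 5, pad, pad]
seq_lens = [3, 2]     # real tokens per sequence
padded_lens = [4, 4]  # each sequence padded to 4 slots
T = sum(padded_lens)  # total padded length: 8

valid = np.zeros(T, dtype=bool)  # True where a real token sits
seg = np.full(T, -1)             # segment id per slot, -1 = pad
off = 0
for i, (n, p) in enumerate(zip(seq_lens, padded_lens)):
    valid[off:off + n] = True
    seg[off:off + n] = i
    off += p

causal = np.tril(np.ones((T, T), dtype=bool))       # causal_inputs_mask
padding_mask = valid[:, None] & valid[None, :]      # block pad rows/cols
segment_mask = seg[:, None] == seg[None, :]         # block cross-sequence
attention_mask = causal & padding_mask & segment_mask
```

With this construction, token 4 (index 4, the start of the second sequence) cannot attend to tokens of the first sequence, and pad slots attend to and are attended by nothing. Is this the mask semantics that DotProductAttention expects in THD mode, and if so, how should it be expressed via `cu_seqlens_q` / `cu_seqlens_q_padded` without triggering NaNs?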