It seems that Megatron does not pass the attention mask to flash attention, but when `--reset-position-ids` is set, different inputs have different attention masks.
Does Megatron support this case?
Also, can sequence parallelism be enabled together with `--reset-position-ids`?
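For context, here is a hypothetical sketch (the function and `eod_token_id` argument are mine, not Megatron's actual helpers) of the mask that packing several documents into one sequence and resetting at each end-of-document token implies: instead of a single lower-triangular causal mask, each input gets its own block-diagonal causal mask, where tokens attend only to earlier tokens within the same document.

```python
import torch

def block_diagonal_causal_mask(input_ids: torch.Tensor, eod_token_id: int) -> torch.Tensor:
    """Illustrative only: build the per-sample mask implied by resetting at EOD tokens.

    input_ids: 1-D tensor of token ids for one packed sequence.
    Returns a (seq_len, seq_len) bool mask, True = attention allowed.
    """
    seq_len = input_ids.size(0)
    # Start from a plain causal (lower-triangular) mask.
    mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
    for i in range(seq_len):
        if input_ids[i] == eod_token_id:
            # Tokens after this EOD must not attend to anything at or before it,
            # which carves the causal mask into per-document blocks.
            mask[i + 1:, : i + 1] = False
    return mask
```

Because the EOD positions differ from sample to sample, this mask is data-dependent, which is exactly what a kernel exposing only a fixed `causal` option cannot express directly.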
It seems `--reset-position-ids` doesn't work with flash attention. I checked flash attention's API; it supports only a causal attention mask, not custom attention masks.
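That matches the dense `flash_attn_func` entry point, which takes only a boolean `causal` flag. One hedged note: flash-attn 2.x also ships a variable-length entry point, `flash_attn_varlen_func`, which takes cumulative sequence lengths (`cu_seqlens`) and applies causal attention within each packed segment, effectively realizing the block-diagonal mask above without ever materializing it. A minimal sketch, assuming flash-attn 2.x is installed and a CUDA device is available (the document lengths here are made up):

```python
import torch
from flash_attn import flash_attn_varlen_func  # flash-attn 2.x

# Three documents of lengths 5, 3, and 8 packed into one 16-token sequence.
total_tokens, n_heads, head_dim = 16, 8, 64
q = torch.randn(total_tokens, n_heads, head_dim, dtype=torch.float16, device="cuda")
k, v = torch.randn_like(q), torch.randn_like(q)

# Cumulative sequence lengths mark the document boundaries.
cu_seqlens = torch.tensor([0, 5, 8, 16], dtype=torch.int32, device="cuda")

out = flash_attn_varlen_func(
    q, k, v,
    cu_seqlens_q=cu_seqlens, cu_seqlens_k=cu_seqlens,
    max_seqlen_q=8, max_seqlen_k=8,
    causal=True,  # causal only *within* each document; no cross-document attention
)
```

Whether Megatron's flash-attention integration actually plumbs `cu_seqlens` through when `--reset-position-ids` is set would still need to be verified in the code.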