Add a new SdpaFwdOp IR node for Flash Attention #2294

Merged: 19 commits from pm/sdpa into main on Jun 10, 2024
Conversation

@Priya2698 (Collaborator) commented on May 23, 2024:

Issue #2278.
This PR adds a new node with the same functionality as torch.nn.functional.scaled_dot_product_attention, and enables scheduling it through ExprEvalScheduler.

Based on the PR discussions, this PR has been repurposed to introduce a new IR node, SdpaFwdOp, for the scaled dot product flash attention forward pass (see #2278 for details).
This PR does not include changes to the scheduler.

The next PRs will:

  1. Add the producer-consumer mapping in root_domain_map and enable this op in ExprEvalScheduler.
  2. Add a Python API.
  3. Add a node for the backward pass, similar to this forward node.

After these tasks are complete, we also aim to introduce Memory Efficient Attention.
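For reference, the new node mirrors the forward semantics of PyTorch's torch.nn.functional.scaled_dot_product_attention. A minimal sketch of those reference semantics, with arbitrary illustrative shapes (not taken from this PR):

```python
import math
import torch
import torch.nn.functional as F

# Arbitrary illustrative shapes: (batch, heads, seq_len, head_dim).
q = torch.randn(2, 8, 128, 64)
k = torch.randn(2, 8, 128, 64)
v = torch.randn(2, 8, 128, 64)

# The reference op whose forward behavior SdpaFwdOp is modeled after.
out = F.scaled_dot_product_attention(q, k, v, dropout_p=0.0, is_causal=False)

# Equivalent math: softmax(Q @ K^T / sqrt(head_dim)) @ V.
scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
manual = torch.softmax(scores, dim=-1) @ v
torch.testing.assert_close(out, manual, rtol=1e-4, atol=1e-4)
```

On CPU with float32 inputs this dispatches to the math backend; the flash attention backend targeted here requires half-precision CUDA tensors.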

Priya2698 changed the title from "[WIP] Add a new SdpaOp IR node" to "Add a new SdpaOp IR node" on May 23, 2024.
Priya2698 marked this pull request as ready for review on May 23, 2024.
(Resolved review threads: csrc/ir/internal_nodes.h, tests/cpp/utils.h.)
Priya2698 changed the title back to "[WIP] Add a new SdpaOp IR node" and marked the pull request as a draft on May 29, 2024.
Priya2698 changed the title to "Add a new SdpaFwdOp IR node for Flash Attention" on Jun 3, 2024.
@Priya2698 (Collaborator, Author) commented:

!build

Priya2698 marked this pull request as ready for review on June 3, 2024.
@jacobhinkle (Collaborator) left a review comment:

Is it correct to say that before accepting this in ExprEval scheduler, we need to handle the other execution modes?

(Review threads on csrc/ir/nodes.cpp.)
@Priya2698 (Collaborator, Author) replied:

> Is it correct to say that before accepting this in ExprEval scheduler, we need to handle the other execution modes?

No. I mainly separated them so that the mapping and the ID-graph workarounds can be handled separately from the node, to reduce the scope of this PR.

As we discussed in today's meeting, at the moment we only plan on supporting Flash Attention, to support multi-GPU development. Once we support Flash Attention, we can revisit whether we need to add Memory Efficient Attention as well. There could be a few ways (see the sketch after this list for the second option):

  1. Plumb down the backend info from Thunder and use that within our nodes: while the two implementations have different function signatures, there is overlap, so one possibility is to use a superset of the inputs and outputs. An alternative design would be distinct nodes for each implementation.
  2. Make the backend decision within nvFuser using the same logic as ATen/Thunder. See: https://github.com/Lightning-AI/lightning-thunder/blob/9f0c50cc6df187cf5fd2e31240690fe2b5e9ccc1/thunder/executors/sdpaex.py#L618-L680
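A rough sketch of what the second option could look like at the Python level. The helper names (can_use_flash_attention, run_sdpa) and the specific constraints (CUDA half-precision inputs, 4-D layout, head-size cap) are illustrative assumptions only; the actual conditions are the ones in the linked Thunder/ATen code.

```python
import torch
import torch.nn.functional as F

def can_use_flash_attention(q, k, v) -> bool:
    # Illustrative gate only; the real checks live in Thunder/ATen (see link above).
    if not (q.is_cuda and k.is_cuda and v.is_cuda):
        return False
    if q.dtype not in (torch.float16, torch.bfloat16):
        return False
    if q.dim() != 4:  # expecting (batch, heads, seq_len, head_dim)
        return False
    if q.size(-1) > 128:  # assumed head-size limit, for illustration only
        return False
    return True

def run_sdpa(q, k, v, dropout_p=0.0, is_causal=False):
    if not can_use_flash_attention(q, k, v):
        # In nvFuser terms: reject the op / fall back instead of claiming it.
        raise NotImplementedError("Inputs not supported by the flash backend.")
    # Restrict PyTorch's dispatcher to the flash kernel so execution matches
    # what an SdpaFwdOp evaluation would target.
    with torch.backends.cuda.sdp_kernel(
        enable_flash=True, enable_math=False, enable_mem_efficient=False
    ):
        return F.scaled_dot_product_attention(
            q, k, v, dropout_p=dropout_p, is_causal=is_causal
        )
```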

@jacobhinkle (Collaborator) replied:

I don't understand how to do partial support. If we are given inputs that we cannot evaluate with Flash Attention, what will happen?

@Priya2698 (Collaborator, Author) replied:

> I don't understand how to do partial support. If we are given inputs that we cannot evaluate with Flash Attention, what will happen?

We will only accept the op if the backend identified by Thunder is Flash Attention: https://github.com/Lightning-AI/lightning-thunder/blob/9f0c50cc6df187cf5fd2e31240690fe2b5e9ccc1/thunder/executors/sdpaex.py#L618-L680.

Do you think this will not be sufficient?

@jacobhinkle (Collaborator) replied:

> Do you think this will not be sufficient?

Makes sense to me. The logic to dispatch to Flash Attention takes place in Thunder, which seems fine.

@naoyam (Collaborator) commented on Jun 6, 2024:

Is anyone still reviewing this PR? @jacobhinkle?

@jacobhinkle (Collaborator) left a review:

LGTM once this is addressed: #2294 (comment)

@Priya2698 (Collaborator, Author) commented:

!build

@Priya2698 (Collaborator, Author) commented:

!build

@naoyam (Collaborator) commented on Jun 7, 2024:

!build

@Priya2698 (Collaborator, Author) commented:

!build

@Priya2698 (Collaborator, Author) commented:

!build

@Priya2698 (Collaborator, Author) commented:

The failing tests look unrelated.

Priya2698 merged commit 23ee81d into main on Jun 10, 2024 (35 of 37 checks passed).
Priya2698 deleted the pm/sdpa branch on June 10, 2024.
Priya2698 added a commit that referenced this pull request on Jun 11, 2024:

Stacked on #2294.

  1. Adds the producer-consumer mapping to the root domain map.
  2. Adds `SDPAOp` to `ExprEvalScheduler`.
  3. Modifies `ExprEvalSched::canSchedule` to skip computeAt checks and only use the compile-time check, since the expression evaluator scheduler will only accept segments with a single expression of type MatmulOp / LinearOp / SdpaOp.

Co-authored-by: Jacob Hinkle <1454944+jacobhinkle@users.noreply.github.com>