Remove unnecessary concatenation using the zipping approach. #1768

Closed
wujingyue opened this issue Feb 15, 2024 · 1 comment
wujingyue commented Feb 15, 2024

A spin-off from #1502 (comment). Created for tracking progress.

Problem

Below is a common pattern in nanoGPT's backprop.

# Shapes below use B=16, S=128, H=12, D=64.
dQ, dK, dV = scaled_dot_product_attention_backprop(...)  # each bf16[16,12,128,64]
dQ = transpose(dQ, [0, 2, 1, 3])  # [16, 128, 12, 64]
dQ = reshape(dQ, [16, 128, 768])  # [16, 128, 768]
dK = reshape(transpose(dK, [0, 2, 1, 3]), [16, 128, 768])  # same transpose+reshape as dQ
dV = reshape(transpose(dV, [0, 2, 1, 3]), [16, 128, 768])  # same transpose+reshape as dQ

concatenated = cat([dQ, dK, dV], axis=-1)  # [16, 128, 2304]

dQKV_sum = sum(concatenated, ...)  # omitting a round trip to float
dQKV_view = reshape(concatenated, [B*S, H*D*3])
dQKV_permute = transpose(dQKV_view, [1, 0])

return dQKV_sum, dQKV_view, dQKV_permute

Because nvFuser doesn't take sdpa_backward, it sees three unconnected input tensors (dQ, dK, and dV) and therefore has to materialize dQKV_view and dQKV_permute.
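To see the cost concretely, here is a minimal PyTorch sketch (shapes taken from the example above) of what happens when cat's inputs are unrelated buffers:

import torch

# Three unrelated gradient buffers, as nvFuser sees them.
dQ, dK, dV = (torch.randn(16, 128, 768, dtype=torch.bfloat16) for _ in range(3))

# cat over unrelated buffers must allocate and copy: its output cannot
# alias any of its inputs.
concatenated = torch.cat([dQ, dK, dV], dim=-1)  # [16, 128, 2304]
assert concatenated.data_ptr() not in (dQ.data_ptr(), dK.data_ptr(), dV.data_ptr())

# So the downstream reshape and permute operate on the freshly
# materialized copy rather than aliasing an existing tensor.
dQKV_view = concatenated.reshape(16 * 128, 768 * 3)
dQKV_permute = dQKV_view.permute(1, 0)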

Solution

TL;DR: change Thunder's cudnnex to feed nvFuser a concatenated tensor that contains dQ, dK and dV, so nvFuser realizes that the existing cat is unnecessary and removes it.

  1. cudnnex will convert the SDPA backward op into a cudnn sdpa_backward kernel (which outputs one packed dQKV tensor) followed by a split.
  2. cudnnex will give that split to nvFuser, so nvFuser will see the following pattern:
    dQKV = fd.define_tensor([B, S, H, D*3])
    dQ = fd.ops.slice(dQKV, ...)
    dQ = fd.ops.view(dQ, ...)
    dQ = fd.ops.permute(dQ, ...)
    dK = ...  # the same slice-view-permute pattern
    dV = ...  # the same slice-view-permute pattern

    concatenated = fd.ops.cat([dQ, dK, dV], dim=-1)

    dQKV_sum = fd.ops.sum(concatenated, ...)  # omitting a round trip to float
    dQKV_view = fd.ops.view(concatenated, [B*S, H*D*3])
    dQKV_permute = fd.ops.permute(dQKV_view, [1, 0])
    
  3. nvFuser will cancel the slices and the cat and merge all the views and permutes between them, so the above will become:
    dQKV = fd.define_tensor([B, S, H, D*3])
    concatenated = fd.ops.permute(fd.ops.view(dQKV, ...), ...)
    dQKV_sum = fd.ops.sum(concatenated, ...)  # omitting a round trip to float
    dQKV_view = fd.ops.view(concatenated, [B*S, H*D*3])
    dQKV_permute = fd.ops.permute(dQKV_view, [1, 0])

    As a result, dQKV_view and dQKV_permute will become aliases of dQKV, and the fusion will boil down to a ReduceSum kernel that sums [B,S,H,D*3] to [H*D*3]. (A standalone sketch of this cancellation follows the list.)
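Below is a minimal PyTorch sketch of the cancellation in step 3. The packed layout [B, S, 3, H, D] and the slice positions are assumptions made only for this illustration; the actual slice dimensions are elided above.

import torch

B, S, H, D = 16, 128, 12, 64

# Hypothetical packed layout, assumed only for this illustration.
dQKV = torch.randn(B, S, 3, H, D, dtype=torch.bfloat16)

# The <split, cat> side: slice out dQ/dK/dV, flatten each, and concatenate.
dQ, dK, dV = (dQKV[:, :, i].reshape(B, S, H * D) for i in range(3))
concatenated = torch.cat([dQ, dK, dV], dim=-1)  # [B, S, 3*H*D]

# The zipped side: the same values are already a view of dQKV, so the
# slices and the cat cancel without copying anything.
zipped = dQKV.reshape(B, S, 3 * H * D)
assert torch.equal(concatenated, zipped)

# The remaining outputs are metadata-only ops on dQKV plus one reduction.
dQKV_view = zipped.reshape(B * S, 3 * H * D)
dQKV_permute = dQKV_view.permute(1, 0)
dQKV_sum = zipped.float().sum(dim=(0, 1))  # the ReduceSum kernel: [3*H*D]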
wujingyue added a commit that referenced this issue Feb 15, 2024
For #1768.

ghstack-source-id: 61a20eef5efa80ac2726fe5ca1c059e47ba55d58
Pull Request resolved: #1771
wujingyue added a commit that referenced this issue Feb 15, 2024
For #1768.

ghstack-source-id: 3e24dbf6e795801bb57ae562ea1133678907dceb
Pull Request resolved: #1771
wujingyue added a commit that referenced this issue Feb 16, 2024
For #1768.

`ghstack land https://github.com/NVIDIA/Fuser/pull/1771` failed for
reasons that I don't understand. I'm trying to land it again without
`ghstack`. See #1771 for review comments.
wujingyue added a commit that referenced this issue Feb 20, 2024
With this PR, MoveSplitCatPass can cancel the <split,cat> pair with
`permute`s in between and horizontally merge those `permute`s. See code
comments for details.

For #1768.
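A toy PyTorch illustration of the rewrite this pass performs (shapes are arbitrary; this mimics the transform, not the pass's implementation):

import torch

t = torch.randn(2, 3, 4, 6)

# Before: a split (three slices of t), identical permutes in between,
# and a cat that stitches the pieces back together along the split dim.
parts = [t[..., i:i + 2].permute(0, 2, 1, 3) for i in range(0, 6, 2)]
merged = torch.cat(parts, dim=-1)

# After: the <split, cat> pair cancels, and the permutes horizontally
# merge into a single permute of the unsplit tensor.
assert torch.equal(merged, t.permute(0, 2, 1, 3))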
wujingyue added a commit that referenced this issue Feb 26, 2024
wujingyue added a commit that referenced this issue Feb 27, 2024
wujingyue self-assigned this Mar 1, 2024
wujingyue added a commit that referenced this issue Mar 1, 2024
This makes it convenient to use an IdModel as a class member without having to pass it through many functions.

I examined the NVFUSER_TRACE: FusionKernelRuntime::FusionKernelRuntime is bottlenecked by "Finding valid fusion segment solutions", not by pre-segmenter passes. I added a FUSER_PERF_SCOPE for pre-segmenter passes anyway.

For #1768
wujingyue commented

This optimization has been implemented but is not turned on by default. The nvFuser part is on unconditionally; the Thunder part is behind a flag. However, even with that flag on, the optimization won't kick in, because Thunder gives cat to a special executor by default.
