
[WIP] Integration flash attention 2 #55758

Merged
merged 23 commits into PaddlePaddle:develop on Aug 7, 2023

Conversation

umiswing (Member) commented Jul 27, 2023

PR types

New features

PR changes

OPs

Description

Pcard-70459

Integrate flash-attention-2 into PaddlePaddle.

Differences from the torch integration:

  1. For head_size values not divisible by 8, the torch API applies padding; this PR does not (see the sketch below). Support will be added later, once a downstream model needs this case.
  2. The attention variants MQA/GQA are not supported yet.
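For illustration, a minimal sketch of the torch-side workaround, with a hypothetical helper name (not part of this PR): the head dimension is rounded up to the next multiple of 8 before the kernel runs, and the padded columns are dropped from the output afterwards.

// Hypothetical illustration of the padding torch applies (not implemented here).
// head_size is rounded up to the next multiple of 8 before calling the kernel;
// e.g. head_size = 60 -> 64. The extra q/k/v columns are zero-filled on input
// and sliced off the output.
inline int RoundUpToMultiple(int x, int m) { return (x + m - 1) / m * m; }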

paddle-bot (bot) commented Jul 27, 2023

Your PR has been submitted. Thanks for your contribution!
Please wait for the CI results first. See the Paddle CI Manual for details.

umiswing force-pushed the fa-2 branch 2 times, most recently from b1cd6cf to 12d4b4c on July 31, 2023
num_splits = 1;
}
bool zero_tensors = false;
const int total_q = dims[0];
Contributor: Use int64_t here (see the sketch below).
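A minimal sketch of the suggested change, assuming dims is the shape of the packed query tensor:

// Suggested: widen to int64_t so the packed token count cannot overflow a
// 32-bit int for very long batches.
const int64_t total_q = dims[0];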

@@ -55,110 +55,75 @@ void FlashAttnUnpaddedGradKernel(const Context& ctx,
ctx.template Alloc<T>(dk);
ctx.template Alloc<T>(dv);

-  cudaStream_t stream = ctx.stream();
-  bool is_bf16 = q.dtype() == DataType::BFLOAT16 ? true : false;
+  const cudaStream_t stream = ctx.stream();
Contributor: These parameter computations seem to be shared by the padded and unpadded kernels, in both forward and backward. Could we define a struct to unify this code? (Along the lines of the sketch below.)
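A minimal sketch of what such a struct might look like; the name and exact fields are hypothetical, not from this PR:

// Hypothetical helper that centralizes the shape/rounding math shared by the
// padded and unpadded flash-attention kernels, forward and backward.
struct FlashAttnParams {
  int batch_size;
  int num_heads;
  int seqlen_q;
  int seqlen_k;
  int seqlen_q_rounded;
  int seqlen_k_rounded;

  static int RoundMultiple(int x, int m) { return (x + m - 1) / m * m; }

  FlashAttnParams(int batch_size, int num_heads, int seqlen_q, int seqlen_k)
      : batch_size(batch_size),
        num_heads(num_heads),
        seqlen_q(seqlen_q),
        seqlen_k(seqlen_k),
        seqlen_q_rounded(RoundMultiple(seqlen_q, 128)),
        seqlen_k_rounded(RoundMultiple(seqlen_k, 128)) {}
};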

const int seqlen_q_rounded = round_multiple(seqlen_q, 128);
const int seqlen_k_rounded = round_multiple(seqlen_k, 128);

softmax_lse->Resize({batch_size, num_heads, seqlen_q_rounded});
Collaborator: softmax_lse->Resize({batch_size, num_heads, seqlen_q}). See https://github.com/Dao-AILab/flash-attention/blob/d30f2e1cd50185c98ed88c0684b4a603f15bee37/csrc/flash_attn/flash_api.cpp#L273 .

That said, over-allocating here should be harmless. @Xreki could you also take a look?

.gitmodules Outdated
[submodule "third_party/gtest"]
path = third_party/gtest
url = https://github.com/google/googletest.git
ignore = dirty
[submodule "third_party/flashattn"]
path = third_party/flashattn
url = https://github.com/umiswing/flash-attention.git
Contributor: Change the submodule back to Paddle's repo, e.g. something like the entry below.
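Presumably an entry pointing back at Paddle's fork, along these lines (URL assumed, not confirmed in this thread):

[submodule "third_party/flashattn"]
path = third_party/flashattn
url = https://github.com/PaddlePaddle/flash-attention.git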

DenseTensor* _softmax,
DenseTensor* _softmax_lse,
DenseTensor* _seed_offset,
const DenseTensor* const fixed_seed_offset_ptr)
Contributor: Const parameters usually come first, e.g.:
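A sketch of the suggested ordering, reusing the parameters quoted above (the function name is hypothetical):

// Const input parameters first, mutable output parameters last.
void FlashAttnKernelImpl(const DenseTensor* const fixed_seed_offset_ptr,
                         DenseTensor* _softmax,
                         DenseTensor* _softmax_lse,
                         DenseTensor* _seed_offset);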

int num_splits = 0; // 0 for an internal heuristic, which is optimal
if (FLAGS_cudnn_deterministic) {
num_splits = 1;
}
Contributor: How about keeping these deterministic-related lines? Comment them out and add a TODO, e.g.:
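A sketch of the suggestion (TODO owner assumed):

// TODO(umiswing): restore deterministic control once the fa-2 integration
// supports it.
// int num_splits = 0;  // 0 for an internal heuristic, which is optimal
// if (FLAGS_cudnn_deterministic) {
//   num_splits = 1;
// }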

@Xreki (Contributor) left a comment:

LGTM. Great work~

@sneaxiy sneaxiy merged commit 0473369 into PaddlePaddle:develop Aug 7, 2023
27 checks passed
cxxly pushed a commit to cxxly/Paddle that referenced this pull request Aug 7, 2023
* Work for fa-2 padded fwd. Code to be cleaned.

* Work for fa2 unpadded fwd.

* Work for padded-bwd, dk get small diff on np.random.seed(0)

* Anyway I pass paddle's utest, except return softmax without dropout.

* Clean code.

* Modify interface.

* Clean code and add some check.

* Easy compile for dev.

* Fix ci.

* Fix ci-build.

* Add std c++17 option again.

* Limit max job when compiling fa2.

* Remove const_cast

* Add fwd params, to be cleaned.

* Clean code.

* Add bwd params.

* Clean code.

* Add enforce.

* Use v2.0.4

* Pass RNG state to fa2 capi

* Fix review.

* Add assert

* Skip compile for sm less than 80.
umiswing added a commit to umiswing/Paddle that referenced this pull request Aug 7, 2023
umiswing added a commit to umiswing/Paddle that referenced this pull request Aug 7, 2023
sneaxiy pushed a commit that referenced this pull request Aug 7, 2023
* [FlashAttn] add flash randomness control (#52902)

* add flash randomness control

* fix VLOG undefied

* [WIP] Integration flash attention 2 (#55758)


---------

Co-authored-by: Chitsing KUI <kuizhiqing@msn.com>
hitywt added a commit to hitywt/Paddle that referenced this pull request Oct 26, 2023
hitywt added a commit to hitywt/Paddle that referenced this pull request Nov 7, 2023
hitywt added a commit to hitywt/Paddle that referenced this pull request Nov 8, 2023
hitywt added a commit to hitywt/Paddle that referenced this pull request Nov 9, 2023
hitywt added a commit to hitywt/Paddle that referenced this pull request Nov 14, 2023
hitywt pushed a commit to hitywt/Paddle that referenced this pull request Nov 22, 2023
hitywt pushed a commit to hitywt/Paddle that referenced this pull request Nov 23, 2023
hitywt pushed a commit to hitywt/Paddle that referenced this pull request Nov 25, 2023
hitywt pushed a commit to hitywt/Paddle that referenced this pull request Nov 28, 2023
hitywt pushed a commit to hitywt/Paddle that referenced this pull request Nov 28, 2023
hitywt pushed a commit to hitywt/Paddle that referenced this pull request Nov 28, 2023
hitywt pushed a commit to hitywt/Paddle that referenced this pull request Nov 28, 2023
hitywt pushed a commit to hitywt/Paddle that referenced this pull request Nov 28, 2023
hitywt pushed a commit to hitywt/Paddle that referenced this pull request Nov 30, 2023
hitywt pushed a commit to hitywt/Paddle that referenced this pull request Dec 5, 2023
hitywt pushed a commit to hitywt/Paddle that referenced this pull request Dec 5, 2023
hitywt pushed a commit to hitywt/Paddle that referenced this pull request Dec 5, 2023
hitywt pushed a commit to hitywt/Paddle that referenced this pull request Dec 5, 2023
zhiqiu pushed a commit that referenced this pull request Dec 6, 2023
* part-3 cherry from: add check for cembedding (#55621)

* part-3 fix cherry from: add check for cembedding

* part-3 fix c_embedding

* fix test_gpt_with_pir caused by pir

* part-3 cherry from: [Distributed] Support dp/sharding overlap in  virtual pp (#55651)

* Add virtual pp and dp overlap

* add sharding/dp overlap

* add dp/vpp overlap

* fix code

* fix log

* part-3 cherry from: [cherry-pick] Integration flash attention 2 (#56015)

* [FlashAttn] add flash randomness control (#52902)

* add flash randomness control

* fix VLOG undefied

* [WIP] Integration flash attention 2 (#55758)


---------

Co-authored-by: Chitsing KUI <kuizhiqing@msn.com>

* part-4 cherry from: fix codestyle (#56066)

* part-4 cherry from(no change): Add assert for static and other plateform (#56044)

* part-4 cherry-pick from: dp and sharding coexist (#56096)

* dp and sharding coexist

* dp

* part-4 cherry from: [Distributed] Add debug information for processgroupnccl (#56441)

* add debug information

* fix log

* fix log

* add detach for pp

* part-4 cherry from: [BugFix]Fix bug in paddle.device.cdua.synchronize() (#56451)

* fix bug in synchronize

* fix bug in synchronize

* part-4 cherry from: add fused gradient (#57048)

* part-4 cherry from: [Distribtued] add eager_communication_connection for eager mode in nccl (#57517)

* add eager_nccl_connection

* add eager_connection

* add eager_connection

* part-4 cherry from: Add auto growth allocator for CUDA pinned allocator (#57625)

* fix h2d bandwidth

* remove useless flags

* fix cherrry pick #56066

* part-5 cherry from: Add allocation debug FLAGS (#57797)

* Add allocation debug FLAGS

* add sync after value set

* refine flags

* part-5 cherry from: fix softmax backward (#57971)

* part-5 cherry from: [Distributed]Optimize memory in processgroup (#58299)

* optimize memory in processgroupnccl

* optimize memory in processgroupnccl

* optimize memory in processgroupnccl

* optimize memory in processgroupnccl

* part-5 cherry from: [Distributed]Add unbalance batch for virtual pp (#58383)

* add unbalanced batch for vpp

* add unbalanced batch for vpp

* add unbalanced batch for vpp

* fix

* fix comments

* fix kunlun compatibility issues

* fix test_fused_rotary_position_embedding.py

* fix allocator.h

* tinyfix

* fix conflicts

* fix new ir translator c_embedding failure

---------

Co-authored-by: ShenLiang <1422485404@qq.com>
Co-authored-by: umiswing <umiswing@foxmail.com>
Co-authored-by: Chitsing KUI <kuizhiqing@msn.com>
Co-authored-by: niuliling123 <51102941+niuliling123@users.noreply.github.com>
Co-authored-by: liuzhenhai93 <liuzhenhai93@outlook.com>
Co-authored-by: sneaxiy <32832641+sneaxiy@users.noreply.github.com>