Integrate Cutlass Fused Multihead Attention in PHI #49910

Closed
wants to merge 37 commits

Conversation

@MARD1NO (Contributor) commented on Jan 18, 2023

PR types

New features

PR changes

OPs

Describe

Integrate Cutlass fused multihead attention

A custom attention_mask can be added.

For the CUTLASS 2.11.0 compatibility issue, refer to the changes in PR #50073 (comment).

Documentation: (see attached screenshot)

Benchmark

Environment: CUDA 11.6, A100 40GB

The benchmark cases are borrowed from xformers.

FP16:

Without mask:

(b, seq_len, num_head, head_size)   Cutlass Time (ms)   Naive Time (ms)
32, 128, 16, 64                     0.05                0.25
64, 128, 16, 16                     0.07                0.27
64, 128, 16, 32                     0.08                0.32
64, 512, 16, 16                     0.91                2.75
64, 512, 16, 32                     0.92                2.97
64, 512, 16, 64                     1.04                3.36
64, 1024, 16, 128                   6.62                18.31
64, 1024, 16, 256                   15.36               22.52

With mask:

(b, seq_len, num_head, head_size)   Cutlass Time (ms)   Naive Time (ms)
32, 128, 16, 64                     0.09                0.29
64, 128, 16, 16                     0.12                0.35
64, 128, 16, 32                     0.13                0.40
64, 512, 16, 16                     1.58                3.91
64, 512, 16, 32                     1.58                4.12
64, 512, 16, 64                     1.68                4.53
64, 1024, 16, 128                   9.44                22.92
64, 1024, 16, 256                   17.97               27.12

Inference cases, FP16:

(b, q_seq, kv_seq, num_head, head_size)   Cutlass Time (us)   Naive Time (us)
1, 900, 6000, 32 (PETR FMCA)              376.51              625.10
1, 4096, 4096, 8, 32 (Diffusion FMHA)     554.87              1494.80
1, 4096, 77, 8, 40 (Diffusion FMCA)       34.81               119.14
1, 197, 197, 12, 64 (VIT FMHA)            21.70               73.08

Comparison script:

import paddle


def naive_attention_impl(query, key, value, mask, scale):
    # query/key/value: [batch, seq_len, num_head, head_size]
    # Move the head dimension forward: [batch, num_head, seq_len, head_size]
    query = paddle.transpose(query, [0, 2, 1, 3])
    key = paddle.transpose(key, [0, 2, 1, 3])
    value = paddle.transpose(value, [0, 2, 1, 3])

    # Scaled QK^T plus the (broadcastable) additive attention mask
    qk_res = paddle.matmul(query, key, transpose_y=True)
    attention = qk_res * scale
    attention = attention + mask
    softmax_result = paddle.nn.functional.softmax(attention, -1)

    # Attention-weighted values, transposed back to [batch, seq_len, num_head, head_size]
    result = paddle.matmul(softmax_result, value)
    result = paddle.transpose(result, [0, 2, 1, 3])
    return result
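
For reference, a minimal sketch of how the comparison above can be driven, using the first benchmark shape. The mask shape and the usual 1/sqrt(head_size) scale are my assumptions, and the fused kernel's Python entry point is not shown in this excerpt, so the call to it is left as a comment rather than a concrete API.

# Minimal driver for the naive reference above (shape from the first benchmark
# row). The fused cutlass op is sketched as a comment only, since its Python
# entry point is not part of this excerpt.
import paddle

batch, seq_len, num_head, head_size = 32, 128, 16, 64
scale = head_size ** -0.5  # assumed scale: 1/sqrt(head_size)

query = paddle.randn([batch, seq_len, num_head, head_size]).cast("float16")
key = paddle.randn([batch, seq_len, num_head, head_size]).cast("float16")
value = paddle.randn([batch, seq_len, num_head, head_size]).cast("float16")
# Additive mask broadcastable over query rows: [batch, num_head, 1, seq_len]
mask = paddle.zeros([batch, num_head, 1, seq_len], dtype="float16")

ref_out = naive_attention_impl(query, key, value, mask, scale)
# fused_out = <cutlass fused multihead attention op>(query, key, value, mask, scale)
# assert paddle.allclose(ref_out.cast("float32"), fused_out.cast("float32"), atol=1e-2)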

TODO List

generate.sh is "borrowed" from xformers: a shell script generates the template-specialized kernels so they can be compiled in parallel, speeding up the build.

Kernel generation could later be reimplemented as a Python script (see the sketch below).
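
As a rough illustration of that follow-up item, a sketch of what a Python-based generator could look like. The header name, macro, data types, and tile sizes below are illustrative placeholders, not the actual contents of generate.sh.

# Hypothetical Python replacement for generate.sh: emit one .cu file per
# template specialization so the build system can compile them in parallel.
# All names (header, macro, dtypes, tile sizes) are illustrative placeholders.
import itertools
from pathlib import Path

DTYPES = ["cutlass::half_t", "float"]   # assumed element types
ARCHS = [70, 75, 80]                    # assumed SM targets
QUERIES_PER_BLOCK = [32, 64]            # assumed tile sizes

TEMPLATE = """// Auto-generated file, do not edit.
#include "fused_multi_head_attention_kernel.h"  // hypothetical header

INSTANTIATE_FMHA_KERNEL({dtype}, {arch}, {qpb});
"""

def generate(out_dir="generated_kernels"):
    out = Path(out_dir)
    out.mkdir(exist_ok=True)
    for dtype, arch, qpb in itertools.product(DTYPES, ARCHS, QUERIES_PER_BLOCK):
        name = "fmha_{}_sm{}_q{}.cu".format(dtype.split("::")[-1], arch, qpb)
        (out / name).write_text(TEMPLATE.format(dtype=dtype, arch=arch, qpb=qpb))

if __name__ == "__main__":
    generate()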

@paddle-bot bot commented on Jan 18, 2023

Your PR has been submitted. Thanks for your contribution!
Please wait for the result of CI first. See the Paddle CI Manual for details.

@mnicely commented on Jan 19, 2023

@MARD1NO I wanted to make you aware of a change coming to our fMHA (773)

@hwu36 for vis

@MARD1NO (Contributor, Author) commented on Jan 19, 2023

> @MARD1NO I wanted to make you aware of a change coming to our fMHA (773)
>
> @hwu36 for vis

Thanks, I will keep following this update

@MARD1NO marked this pull request as ready for review on February 2, 2023, 02:41
@@ -7,7 +7,8 @@ exclude: |
python/paddle/utils/gast/.+|
.+_pb2\.py|
python/paddle/fluid/tests/unittests/npu/.+|
python/paddle/fluid/tests/unittests/mlu/.+
python/paddle/fluid/tests/unittests/mlu/.+|
paddle/phi/kernels/fusion/cutlass/fused_multi_head_attention/.+
@MARD1NO (Contributor, Author): The code introduced here is external xformers code, so it is excluded from formatting for now.

import os

from op_test import OpTest

# Ensure we use float type to accumulate
os.environ["FLAGS_gemm_use_half_precision_compute_type"] = "0"
@MARD1NO (Contributor, Author): This ensures that the GEMMs in the naive reference implementation used for comparison accumulate in float.
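
A small illustration (not from the PR) of why accumulating in half precision is problematic: once a partial sum is large, small fp16 contributions can be rounded away entirely.

# Illustrative only: adding a small value to a large fp16 partial sum is lost,
# because the fp16 spacing at 2048 is 2; float32 keeps the result exact.
import paddle

big = paddle.to_tensor(2048.0, dtype="float16")
small = paddle.to_tensor(1.0, dtype="float16")
print(big + small)                                  # still 2048.0 in fp16
print(big.cast("float32") + small.cast("float32"))  # 2049.0 in fp32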

# https://github.com/facebookresearch/xformers/blob/main/xformers/csrc/attention/cuda/fmha/kernels/generate_kernels.sh

#!/bin/bash
set -ex
@MARD1NO (Contributor, Author): This follows the xformers kernel template generation script so that the specialized kernels can be compiled in parallel, speeding up the build.

(tb_tile_offset.n() * MM0::Mma::WarpCount::kN) +
(my_warp_id / MM0::Mma::WarpCount::kM)};

if (kAddMask) {
@MARD1NO (Contributor, Author): If the mask is added after the QK matmul, the scale must be applied to the accumulators in registers beforehand, rather than being folded into the final tiled softmax.
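
One way to see why (s is the softmax scale, P = QK^T the matmul accumulator, M the additive mask): deferring the scale to the tiled softmax would also scale the mask.

$$
\operatorname{softmax}\bigl(s\,P + M\bigr)
\;\neq\;
\operatorname{softmax}\bigl(s\,(P + M)\bigr)
= \operatorname{softmax}\bigl(s\,P + s\,M\bigr)
$$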

cutlass::multiplies<typename MM0::Mma::FragmentC>()(p.scale, accum);
}

int32_t mask_iter_m = kMaskBroadcastRow ? 1 : problem_size_0_m;
@MARD1NO (Contributor, Author): This adds a specialization for masks that are broadcast along the row dimension.
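
A host-side illustration (shapes are my assumption based on the comment above) of the row-broadcast case the specialization targets: a mask with a single query row broadcasts across all query positions, so the kernel only needs to visit one mask row per tile (mask_iter_m == 1).

# Illustrative only: a [batch, num_head, 1, kv_seq] mask broadcasts over the
# query dimension, matching the mask_iter_m == 1 fast path above.
import paddle

batch, num_head, q_seq, kv_seq = 1, 8, 4, 6

logits = paddle.randn([batch, num_head, q_seq, kv_seq])
row_broadcast_mask = paddle.randn([batch, num_head, 1, kv_seq])

masked = logits + row_broadcast_mask                   # broadcast over q rows
full_mask = paddle.expand(row_broadcast_mask, logits.shape)
assert paddle.allclose(masked, logits + full_mask)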

@vivienfanghuagood (Contributor): LGTM

@MARD1NO marked this pull request as draft on February 7, 2023, 05:36
@MARD1NO marked this pull request as ready for review on February 22, 2023, 08:27
@zhoutianzi666 (Contributor) commented on Feb 24, 2023

(screenshot attached)

https://www.paddlepaddle.org.cn/documentation/docs/zh/api/paddle/incubate/nn/FusedMultiHeadAttention_cn.html#fusedmultiheadattention

  • Should the "fused" in cutlass_fused_multi_head attention be removed from this API's name?

@MARD1NO (Contributor, Author) commented on Feb 24, 2023

> https://www.paddlepaddle.org.cn/documentation/docs/zh/api/paddle/incubate/nn/FusedMultiHeadAttention_cn.html#fusedmultiheadattention
>
> Should the "fused" in cutlass_fused_multi_head attention be removed from this API's name?

The "fused" here refers to the three operations q matmul k, softmax, and attention matmul v. The original FusedMultiHeadAttention actually splits the computation into multiple operators, whereas this cutlass operator completes it with a single kernel launch.

heavengate previously approved these changes on Mar 8, 2023
@MARD1NO closed this on Mar 27, 2023