
Add FP32 and Bias to fulfill the functionalities required by torch.nn.attention.SDPBackend.EFFICIENT_ATTENTION #22

Merged: 24 commits into main on May 3, 2024

Conversation

@xinyazhang (Collaborator) commented Apr 30, 2024

This PR includes the following major changes:

  1. Add Bias support to the Triton kernel, for both the forward and backward passes
  2. Add fp32 datatype support, along with the corresponding tuning database entries
  3. Fix the "argument list too long" error during linking
  4. Improve table_tool.py to partially dump/load .csv files, allowing database merging
  5. Refactor the UT to use PyTorch's method to estimate ATOL/RTOL
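For context on item 5, the tolerance pairs below follow the dtype-dependent defaults documented for torch.testing.assert_close; `pick_tolerances` is a hypothetical sketch of that kind of dtype-based estimation, not the PR's actual code:

```python
# Default (rtol, atol) pairs as documented for torch.testing.assert_close.
# Sketch only: the refactored UT's exact logic lives in this PR, not here.
DEFAULT_TOLERANCES = {
    "torch.float32": (1.3e-6, 1e-5),
    "torch.float16": (1e-3, 1e-5),
    "torch.bfloat16": (1.6e-2, 1e-5),
}

def pick_tolerances(dtype_name):
    """Return (rtol, atol) for a dtype, falling back to the fp32 defaults."""
    return DEFAULT_TOLERANCES.get(dtype_name, DEFAULT_TOLERANCES["torch.float32"])
```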

Known limitations:

  1. For the backward pass, Bias assumes a real Rank 4 tensor (.expand()-ed tensors are unlikely to work). This prerequisite is not checked, so failures may be silent
  2. test_forward.py is still using the old method to estimate ATOL/RTOL
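Limitation 1 can be caught in user code by inspecting strides: an expanded tensor has stride 0 along every broadcast dimension. A minimal sketch with NumPy (PyTorch's .expand() produces zero strides the same way); `is_materialized` is a hypothetical helper, not part of this PR:

```python
import numpy as np

def is_materialized(strides):
    """Hypothetical guard: a broadcast (.expand()-style) tensor has
    stride 0 along every expanded dimension."""
    return all(s != 0 for s in strides)

# A Rank 2 bias broadcast up to Rank 4 carries zero strides in the
# broadcast dimensions -- the case the backward pass silently mishandles.
bias2d = np.zeros((16, 16), dtype=np.float32)
expanded = np.broadcast_to(bias2d[np.newaxis, np.newaxis, :, :], (2, 4, 16, 16))

assert not is_materialized(expanded.strides)   # expanded view fails the guard
materialized = np.ascontiguousarray(expanded)  # a real Rank 4 copy passes
assert is_materialized(materialized.strides)
```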

Additional notes from the squashed commits:

The GPU reference code path causes segfaults; a new option lets users select the device used to compute reference results without modifying the source file.

Example usage of table_tool.py for database merging:
```
DB=v2python/rules/tuning_database.sqlite3

python -m v2python.table_tool -k '' --action dumpcsv \
    -f $DB --table_name 'FLASH$attn_fwd' \
    --table_file 'attn_fwd.fp32mi300.csv' \
    --select_where 'inputs$Q_dtype = "torch.float32"'

git checkout another_branch -- $DB

python -m v2python.table_tool -k '' --action loadcsv \
     -f $DB --table_name 'FLASH$attn_fwd' \
     --table_file attn_fwd.fp32mi300.csv \
     --ignore_id
```

Note: for simplicity, --ignore_id does not support cases where 'id' is not the first column of the CSV file.
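To see why the id-first restriction is the simple choice, here is a hypothetical sketch of an --ignore_id style loader that just drops the first column; `load_rows_ignoring_id` is illustrative, not table_tool.py's actual code:

```python
import csv
import io

def load_rows_ignoring_id(csv_text):
    """Drop the leading 'id' column from every row. This only works when
    'id' is the first CSV column, mirroring the --ignore_id restriction."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    if header[0] != "id":
        raise ValueError("expected 'id' as the first column")
    return [row[1:] for row in reader]

rows = load_rows_ignoring_id("id,inputs$Q_dtype,value\n7,torch.float32,3\n")
# rows == [["torch.float32", "3"]]
```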
If .expand() is called on a tensor, the resulting tensor may have one or more zeros in its stride().

Otherwise, the dk result is incorrect for fp32.
1. Computed gradients were not stored for later comparison, which means the
   gradients were not actually tested.
2. bias should be created as a Rank 2 tensor and then expanded to Rank 4.
@groenenboomj (Collaborator) left a comment:


What are the perf impacts of these changes?

Resolved review threads: tritonsrc/attn_torch_function.py, v2python/rules/tuning_database.sqlite3
@xinyazhang merged commit 00ccbf3 into main on May 3, 2024