feat: add support for fp32 kernels #56

pommedeterresautee · 2022-09-16T10:20:18Z

add a new, very simple layernorm implementation from xformers (it does not work for very large tensors; its only purpose is to show what our maximum performance can be)
add fp32 tests and support for layernorm
add bf16 tests and support for the attention kernel
- fp32 does not work for attention because of a Triton bug: "Flash attention forward pass failing on fp32", triton-lang/triton#674
fix an error in the attention benchmark: every implementation except our optimized attention had to allocate its output tensor (which gave our implementation an unfair advantage)
make it easy to add a backward pass for each Triton kernel

fix #39
fix #44

autocast behavior is described in https://h-huang.github.io/tutorials/advanced/dispatcher.html#autocast and https://pytorch.org/docs/stable/amp.html

…eir output tensor

pommedeterresautee · 2022-09-19T11:03:40Z

opened an issue with reproduction code here: triton-lang/triton#674

# Conflicts: # optimizer/linear.py # test/test_linear_layer.py

pommedeterresautee · 2022-09-22T12:41:24Z

measurements taken on this branch (with autocast):

test/test_torchdynamo_bert.py::test_benchmark_implementations shape=(1, 128)
Name                                                                Median (CUDA)    Mean (CUDA)    Min (CUDA)     Max (CUDA)     Median         Mean           Min            Max
------------------------------------------------------------------  ---------------  -------------  -------------  -------------  -------------  -------------  -------------  -------------
test_benchmark_implementations[baseline-1x128]                      7.7855 (1.0)     7.7941 (1.0)   7.636 (1.0)    8.0056 (1.0)   8.1341 (1.0)   8.2391 (1.0)   8.0028 (1.0)   9.1752 (1.0)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-1x128]  2.4136 (3.23)    2.5606 (3.04)  2.4074 (3.17)  3.6762 (2.18)  2.5179 (3.23)  2.6934 (3.06)  2.476 (3.23)   3.5809 (2.56)
test_benchmark_implementations[onnx_optim_fp16-1x128]               2.9246 (2.66)    3.1849 (2.45)  2.8132 (2.71)  4.6073 (1.74)  4.0431 (2.01)  4.0954 (2.01)  3.0022 (2.67)  5.2359 (1.75)

test/test_torchdynamo_bert.py::test_benchmark_implementations shape=(1, 16)
Name                                                               Median (CUDA)    Mean (CUDA)    Min (CUDA)     Max (CUDA)     Median         Mean           Min            Max
-----------------------------------------------------------------  ---------------  -------------  -------------  -------------  -------------  -------------  -------------  -------------
test_benchmark_implementations[baseline-1x16]                      7.9821 (1.0)     7.9925 (1.0)   7.6534 (1.0)   8.4778 (1.0)   8.1546 (1.0)   8.2796 (1.0)   8.0009 (1.0)   8.9113 (1.0)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-1x16]  1.5585 (5.12)    1.5589 (5.13)  1.5555 (4.92)  1.5677 (5.41)  1.6278 (5.01)  1.6386 (5.05)  1.6181 (4.94)  1.7867 (4.99)
test_benchmark_implementations[onnx_optim_fp16-1x16]               4.8333 (1.65)    4.2276 (1.89)  2.8848 (2.65)  5.5951 (1.52)  4.8374 (1.69)  4.879 (1.7)    3.914 (2.04)   5.1452 (1.73)

test/test_torchdynamo_bert.py::test_benchmark_implementations shape=(1, 256)
Name                                                                Median (CUDA)    Mean (CUDA)    Min (CUDA)     Max (CUDA)     Median         Mean           Min            Max
------------------------------------------------------------------  ---------------  -------------  -------------  -------------  -------------  -------------  -------------  -------------
test_benchmark_implementations[baseline-1x256]                      7.7609 (1.0)     7.7855 (1.0)   7.6341 (1.0)   8.0261 (1.0)   8.0527 (1.0)   8.1408 (1.0)   7.8861 (1.0)   8.9015 (1.0)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-1x256]  2.7382 (2.83)    2.7359 (2.85)  2.7197 (2.81)  2.7433 (2.93)  2.7859 (2.89)  2.787 (2.92)   2.7582 (2.86)  2.8652 (3.11)
test_benchmark_implementations[onnx_optim_fp16-1x256]               2.6307 (2.95)    2.6026 (2.99)  2.5129 (3.04)  2.6684 (3.01)  2.6648 (3.02)  2.9824 (2.73)  2.5794 (3.06)  5.7892 (1.54)

test/test_torchdynamo_bert.py::test_benchmark_implementations shape=(1, 384)
Name                                                                Median (CUDA)    Mean (CUDA)    Min (CUDA)     Max (CUDA)     Median         Mean          Min            Max
------------------------------------------------------------------  ---------------  -------------  -------------  -------------  -------------  ------------  -------------  -------------
test_benchmark_implementations[baseline-1x384]                      8.2473 (1.0)     8.2564 (1.0)   7.716 (1.0)    9.172 (1.0)    8.9529 (1.0)   9.9168 (1.0)  8.3484 (1.0)   12.8779 (1.0)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-1x384]  2.9901 (2.76)    2.9874 (2.76)  2.9696 (2.6)   2.9972 (3.06)  3.0268 (2.96)  3.0024 (3.3)  2.9393 (2.84)  3.0772 (4.18)
test_benchmark_implementations[onnx_optim_fp16-1x384]               2.9051 (2.84)    3.0415 (2.71)  2.8641 (2.69)  4.8742 (1.88)  6.2065 (1.44)  6.1846 (1.6)  5.9793 (1.4)   6.3811 (2.02)

test/test_torchdynamo_bert.py::test_benchmark_implementations shape=(1, 512)
Name                                                                Median (CUDA)    Mean (CUDA)    Min (CUDA)     Max (CUDA)     Median         Mean           Min            Max
------------------------------------------------------------------  ---------------  -------------  -------------  -------------  -------------  -------------  -------------  -------------
test_benchmark_implementations[baseline-1x512]                      7.5213 (1.0)     7.5546 (1.0)   7.4322 (1.0)   7.7479 (1.0)   7.9151 (1.0)   7.9856 (1.0)   7.8031 (1.0)   8.7109 (1.0)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-1x512]  3.7673 (2.0)     3.7712 (2.0)   3.7304 (1.99)  3.7888 (2.04)  3.8216 (2.07)  3.8165 (2.09)  3.7454 (2.08)  4.1214 (2.11)
test_benchmark_implementations[onnx_optim_fp16-1x512]               3.9672 (1.9)     3.9957 (1.89)  3.9396 (1.89)  4.5466 (1.7)   4.0234 (1.97)  4.0321 (1.98)  3.9481 (1.98)  4.3094 (2.02)

test/test_torchdynamo_bert.py::test_benchmark_implementations shape=(32, 128)
Name                                                                 Median (CUDA)    Mean (CUDA)     Min (CUDA)      Max (CUDA)      Median          Mean            Min             Max
-------------------------------------------------------------------  ---------------  --------------  --------------  --------------  --------------  --------------  --------------  --------------
test_benchmark_implementations[baseline-32x128]                      22.7932 (1.0)    23.3685 (1.0)   22.7256 (1.0)   24.1029 (1.0)   21.2464 (1.0)   21.8454 (1.0)   21.0746 (1.0)   22.7786 (1.0)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-32x128]  15.1859 (1.5)    15.2549 (1.53)  14.6248 (1.55)  15.9386 (1.51)  14.7982 (1.44)  14.7312 (1.48)  13.8437 (1.52)  15.4613 (1.47)
test_benchmark_implementations[onnx_optim_fp16-32x128]               17.8954 (1.27)   17.8991 (1.31)  17.8913 (1.27)  17.9098 (1.35)  18.0295 (1.18)  17.7181 (1.23)  17.1224 (1.23)  18.0452 (1.26)

test/test_torchdynamo_bert.py::test_benchmark_implementations shape=(32, 16)
Name                                                                Median (CUDA)    Mean (CUDA)    Min (CUDA)     Max (CUDA)     Median         Mean           Min            Max
------------------------------------------------------------------  ---------------  -------------  -------------  -------------  -------------  -------------  -------------  -------------
test_benchmark_implementations[baseline-32x16]                      7.9964 (1.0)     8.1224 (1.0)   7.7518 (1.0)   9.461 (1.0)    8.228 (1.0)    8.275 (1.0)    8.1017 (1.0)   8.8999 (1.0)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-32x16]  3.4478 (2.32)    3.4474 (2.36)  3.4284 (2.26)  3.4529 (2.74)  3.4989 (2.35)  3.4876 (2.37)  3.441 (2.35)   3.5331 (2.52)
test_benchmark_implementations[onnx_optim_fp16-32x16]               5.9208 (1.35)    5.1951 (1.56)  3.582 (2.16)   6.826 (1.39)   3.6319 (2.27)  3.9679 (2.09)  3.2939 (2.46)  5.4588 (1.63)

test/test_torchdynamo_bert.py::test_benchmark_implementations shape=(32, 256)
Name                                                                 Median (CUDA)    Mean (CUDA)     Min (CUDA)      Max (CUDA)      Median          Mean            Min             Max
-------------------------------------------------------------------  ---------------  --------------  --------------  --------------  --------------  --------------  --------------  -------------
test_benchmark_implementations[baseline-32x256]                      46.1609 (1.0)    46.5961 (1.0)   46.1609 (1.0)   47.0313 (1.0)   46.4217 (1.0)   46.668 (1.0)    46.4217 (1.0)   46.9143 (1.0)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-32x256]  27.5026 (1.68)   27.5053 (1.69)  27.4964 (1.68)  27.5169 (1.71)  27.1399 (1.71)  26.7257 (1.75)  25.3938 (1.83)  27.6435 (1.7)
test_benchmark_implementations[onnx_optim_fp16-32x256]               36.7258 (1.26)   36.7365 (1.27)  36.7258 (1.26)  36.7473 (1.28)  34.6321 (1.34)  36.8357 (1.27)  34.6321 (1.34)  39.0393 (1.2)

test/test_torchdynamo_bert.py::test_benchmark_implementations shape=(8, 128)
Name                                                                Median (CUDA)    Mean (CUDA)    Min (CUDA)     Max (CUDA)     Median         Mean           Min            Max
------------------------------------------------------------------  ---------------  -------------  -------------  -------------  -------------  -------------  -------------  -------------
test_benchmark_implementations[baseline-8x128]                      8.0763 (1.0)     8.1834 (1.0)   7.5067 (1.0)   8.8566 (1.0)   8.6779 (1.0)   8.6912 (1.0)   8.4971 (1.0)   8.9335 (1.0)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-8x128]  5.4733 (1.48)    5.5011 (1.49)  5.4692 (1.37)  5.7477 (1.54)  5.4956 (1.58)  5.4438 (1.6)   5.2913 (1.61)  5.5435 (1.61)
test_benchmark_implementations[onnx_optim_fp16-8x128]               6.4922 (1.24)    6.5071 (1.26)  6.4788 (1.16)  6.5782 (1.35)  6.0271 (1.44)  5.9941 (1.45)  5.8692 (1.45)  6.1643 (1.45)

test/test_torchdynamo_bert.py::test_benchmark_implementations shape=(8, 16)
Name                                                               Median (CUDA)    Mean (CUDA)    Min (CUDA)     Max (CUDA)     Median         Mean           Min            Max
-----------------------------------------------------------------  ---------------  -------------  -------------  -------------  -------------  -------------  -------------  -------------
test_benchmark_implementations[baseline-8x16]                      8.3579 (1.0)     8.4 (1.0)      8.181 (1.0)    8.8279 (1.0)   8.7168 (1.0)   8.7968 (1.0)   8.5196 (1.0)   9.726 (1.0)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-8x16]  2.3337 (3.58)    2.3342 (3.6)   2.3316 (3.51)  2.3429 (3.77)  2.3919 (3.64)  2.4002 (3.66)  2.3867 (3.57)  2.609 (3.73)
test_benchmark_implementations[onnx_optim_fp16-8x16]               2.9184 (2.86)    2.9682 (2.83)  2.8436 (2.88)  3.5615 (2.48)  3.9428 (2.21)  3.7615 (2.34)  2.8695 (2.97)  4.9127 (1.98)

test/test_torchdynamo_bert.py::test_benchmark_implementations shape=(8, 256)
Name                                                                Median (CUDA)    Mean (CUDA)     Min (CUDA)      Max (CUDA)      Median          Mean            Min             Max
------------------------------------------------------------------  ---------------  --------------  --------------  --------------  --------------  --------------  --------------  --------------
test_benchmark_implementations[baseline-8x256]                      15.3569 (1.0)    15.3636 (1.0)   15.2637 (1.0)   15.4481 (1.0)   14.1396 (1.15)  14.4888 (1.06)  13.8929 (1.0)   15.6007 (1.07)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-8x256]  8.6794 (1.77)    8.6787 (1.77)   8.6723 (1.76)   8.6866 (1.78)   8.7026 (1.86)   8.9109 (1.73)   8.378 (1.66)    9.8552 (1.69)
test_benchmark_implementations[onnx_optim_fp16-8x256]               12.8174 (1.2)    12.8764 (1.19)  12.1201 (1.26)  14.6924 (1.05)  16.2258 (1.0)   15.4099 (1.0)   12.5603 (1.11)  16.6717 (1.0)

test/test_torchdynamo_bert.py::test_benchmark_implementations shape=(8, 384)
Name                                                                Median (CUDA)    Mean (CUDA)     Min (CUDA)      Max (CUDA)      Median          Mean            Min             Max
------------------------------------------------------------------  ---------------  --------------  --------------  --------------  --------------  --------------  --------------  --------------
test_benchmark_implementations[baseline-8x384]                      21.3668 (1.28)   21.9298 (1.25)  21.1978 (1.27)  23.0248 (1.22)  20.9425 (1.0)   21.2453 (1.0)   20.7567 (1.0)   21.959 (1.0)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-8x384]  11.7187 (2.33)   11.7083 (2.33)  11.6275 (2.31)  11.7248 (2.39)  11.5307 (1.82)  11.4106 (1.86)  10.9675 (1.89)  11.6932 (1.88)
test_benchmark_implementations[onnx_optim_fp16-8x384]               27.2484 (1.0)    27.3092 (1.0)   26.9005 (1.0)   28.0269 (1.0)   16.3585 (1.28)  16.555 (1.28)   15.9407 (1.3)   17.1526 (1.28)

test/test_torchdynamo_bert.py::test_benchmark_implementations shape=(8, 512)
Name                                                                Median (CUDA)    Mean (CUDA)     Min (CUDA)      Max (CUDA)      Median          Mean            Min             Max
------------------------------------------------------------------  ---------------  --------------  --------------  --------------  --------------  --------------  --------------  --------------
test_benchmark_implementations[baseline-8x512]                      29.8353 (1.0)    30.6869 (1.0)   29.5884 (1.0)   32.6369 (1.0)   27.9553 (1.0)   27.9247 (1.0)   27.7025 (1.0)   28.1161 (1.0)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-8x512]  15.4061 (1.94)   15.5945 (1.97)  15.3999 (1.92)  16.2048 (2.01)  15.4085 (1.81)  15.0448 (1.86)  14.1574 (1.96)  15.5258 (1.81)
test_benchmark_implementations[onnx_optim_fp16-8x512]               21.4405 (1.39)   21.4515 (1.43)  21.4303 (1.38)  21.4884 (1.52)  20.7453 (1.35)  21.3414 (1.31)  20.4224 (1.36)  22.1332 (1.27)
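
The value in parentheses after each timing appears to be a ratio against the slowest entry in the same column (the slowest row shows 1.0), so it reads as a speedup. A minimal sketch that reproduces it for the Median (CUDA) column of the shape=(8, 384) table; the helper name `speedup_vs_slowest` is hypothetical, and the rule is inferred from the numbers rather than taken from the benchmark code:

```python
def speedup_vs_slowest(timings: dict[str, float]) -> dict[str, float]:
    """Return slowest / value for each entry, rounded to two decimals."""
    slowest = max(timings.values())
    return {name: round(slowest / value, 2) for name, value in timings.items()}


# Median (CUDA) values copied from the shape=(8, 384) table above.
median_cuda_8x384 = {
    "baseline": 21.3668,
    "dynamo_optimized_cuda_graphs": 11.7187,
    "onnx_optim_fp16": 27.2484,
}

ratios = speedup_vs_slowest(median_cuda_8x384)
for name, ratio in ratios.items():
    print(f"{name}: {ratio}")
```

The printed ratios match the parenthesized values in that table (1.28, 2.33 and 1.0).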

pommedeterresautee · 2022-09-22T12:42:03Z

measurements taken on main (no autocast, full fp16):

test/test_torchdynamo_bert.py ....................................... [100%]
test/test_torchdynamo_bert.py::test_benchmark_implementations shape=(1, 128)
Name                                                                Median (CUDA)    Mean (CUDA)    Min (CUDA)     Max (CUDA)     Median         Mean           Min            Max
------------------------------------------------------------------  ---------------  -------------  -------------  -------------  -------------  -------------  -------------  -------------
test_benchmark_implementations[baseline-1x128]                      7.2182 (1.0)     7.3326 (1.0)   6.7269 (1.0)   8.8044 (1.0)   6.9617 (1.0)   6.9978 (1.0)   6.7992 (1.0)   7.6291 (1.0)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-1x128]  1.3701 (5.27)    1.382 (5.31)   1.3681 (4.92)  1.7418 (5.05)  1.4162 (4.92)  1.4189 (4.93)  1.4118 (4.82)  1.5048 (5.07)
test_benchmark_implementations[onnx_optim_fp16-1x128]               2.826 (2.55)     2.908 (2.52)   2.7239 (2.47)  4.4063 (2.0)   2.8097 (2.48)  2.8331 (2.47)  2.7622 (2.46)  3.1994 (2.38)

test/test_torchdynamo_bert.py::test_benchmark_implementations shape=(1, 16)
Name                                                               Median (CUDA)    Mean (CUDA)    Min (CUDA)      Max (CUDA)     Median          Mean            Min             Max
-----------------------------------------------------------------  ---------------  -------------  --------------  -------------  --------------  --------------  --------------  -------------
test_benchmark_implementations[baseline-1x16]                      6.5792 (1.0)     6.6026 (1.0)   6.4717 (1.0)    6.7318 (1.0)   6.7322 (1.0)    6.7755 (1.0)    6.5761 (1.0)    7.4069 (1.0)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-1x16]  0.5642 (11.66)   0.5641 (11.7)  0.5612 (11.53)  0.5755 (11.7)  0.6223 (10.82)  0.6291 (10.77)  0.6177 (10.65)  0.8654 (8.56)
test_benchmark_implementations[onnx_optim_fp16-1x16]               2.7372 (2.4)     2.7406 (2.41)  2.6716 (2.42)   2.8826 (2.34)  2.7818 (2.42)   2.8279 (2.4)    2.7385 (2.4)    3.2174 (2.3)

test/test_torchdynamo_bert.py::test_benchmark_implementations shape=(1, 256)
Name                                                                Median (CUDA)    Mean (CUDA)    Min (CUDA)     Max (CUDA)     Median         Mean           Min            Max
------------------------------------------------------------------  ---------------  -------------  -------------  -------------  -------------  -------------  -------------  ------------
test_benchmark_implementations[baseline-1x256]                      6.7359 (1.0)     6.7653 (1.0)   5.9217 (1.0)   7.5438 (1.0)   7.1434 (1.0)   7.3269 (1.0)   6.8743 (1.0)   8.4516 (1.0)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-1x256]  1.6701 (4.03)    1.6921 (4.0)   1.6558 (3.58)  2.0818 (3.62)  1.7132 (4.17)  1.7223 (4.25)  1.6971 (4.05)  1.961 (4.31)
test_benchmark_implementations[onnx_optim_fp16-1x256]               2.81 (2.4)       2.7943 (2.42)  2.5736 (2.3)   2.8609 (2.64)  2.5794 (2.77)  2.6425 (2.77)  2.5477 (2.7)   4.463 (1.89)

test/test_torchdynamo_bert.py::test_benchmark_implementations shape=(1, 384)
Name                                                                Median (CUDA)    Mean (CUDA)    Min (CUDA)     Max (CUDA)     Median         Mean           Min            Max
------------------------------------------------------------------  ---------------  -------------  -------------  -------------  -------------  -------------  -------------  -------------
test_benchmark_implementations[baseline-1x384]                      6.5323 (1.0)     6.5908 (1.0)   6.4246 (1.0)   6.9028 (1.0)   7.9615 (1.0)   8.1049 (1.0)   7.4326 (1.0)   9.396 (1.0)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-1x384]  1.9436 (3.36)    1.9413 (3.4)   1.8237 (3.52)  1.9487 (3.54)  1.9532 (4.08)  1.9862 (4.08)  1.8874 (3.94)  2.363 (3.98)
test_benchmark_implementations[onnx_optim_fp16-1x384]               3.1508 (2.07)    3.1709 (2.08)  3.114 (2.06)   3.4931 (1.98)  2.9463 (2.7)   3.0148 (2.69)  2.8853 (2.58)  3.6583 (2.57)

test/test_torchdynamo_bert.py::test_benchmark_implementations shape=(1, 512)
Name                                                                Median (CUDA)    Mean (CUDA)    Min (CUDA)     Max (CUDA)     Median         Mean           Min            Max
------------------------------------------------------------------  ---------------  -------------  -------------  -------------  -------------  -------------  -------------  -------------
test_benchmark_implementations[baseline-1x512]                      6.3652 (1.0)     6.3897 (1.0)   6.2158 (1.0)   6.5885 (1.0)   6.6024 (1.0)   6.6681 (1.0)   6.4829 (1.0)   7.6228 (1.0)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-1x512]  2.687 (2.37)     2.6843 (2.38)  2.6655 (2.33)  2.6911 (2.45)  2.7342 (2.41)  2.8856 (2.31)  2.6962 (2.4)   3.4223 (2.23)
test_benchmark_implementations[onnx_optim_fp16-1x512]               4.2547 (1.5)     4.2604 (1.5)   4.2476 (1.46)  4.3039 (1.53)  4.1765 (1.58)  4.1939 (1.59)  4.0118 (1.62)  4.5923 (1.66)

test/test_torchdynamo_bert.py::test_benchmark_implementations shape=(32, 128)
Name                                                                 Median (CUDA)    Mean (CUDA)     Min (CUDA)      Max (CUDA)      Median          Mean            Min             Max
-------------------------------------------------------------------  ---------------  --------------  --------------  --------------  --------------  --------------  --------------  --------------
test_benchmark_implementations[baseline-32x128]                      17.9098 (1.0)    17.6454 (1.01)  16.7045 (1.07)  18.7661 (1.0)   16.9879 (1.06)  17.3241 (1.03)  16.7234 (1.02)  18.6919 (1.0)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-32x128]  13.0755 (1.37)   13.075 (1.37)   13.0714 (1.37)  13.0806 (1.43)  13.1283 (1.37)  12.7596 (1.39)  12.0202 (1.42)  13.1405 (1.42)
test_benchmark_implementations[onnx_optim_fp16-32x128]               17.8982 (1.0)    17.8974 (1.0)   17.8913 (1.0)   17.9005 (1.05)  17.9321 (1.0)   17.7588 (1.0)   17.0673 (1.0)   18.5692 (1.01)

test/test_torchdynamo_bert.py::test_benchmark_implementations shape=(32, 16)
Name                                                                Median (CUDA)    Mean (CUDA)    Min (CUDA)     Max (CUDA)     Median         Mean           Min            Max
------------------------------------------------------------------  ---------------  -------------  -------------  -------------  -------------  -------------  -------------  -------------
test_benchmark_implementations[baseline-32x16]                      6.7881 (1.0)     6.8592 (1.0)   6.6048 (1.0)   8.0138 (1.0)   7.0459 (1.0)   7.1357 (1.0)   6.9431 (1.0)   7.8445 (1.0)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-32x16]  2.3316 (2.91)    2.4249 (2.83)  2.3122 (2.86)  2.902 (2.76)   2.3996 (2.94)  2.3778 (3.0)   2.3098 (3.01)  2.4187 (3.24)
test_benchmark_implementations[onnx_optim_fp16-32x16]               3.4058 (1.99)    3.4673 (1.98)  3.3732 (1.96)  3.6168 (2.22)  3.4226 (2.06)  3.4083 (2.09)  3.3127 (2.1)   3.74 (2.1)

test/test_torchdynamo_bert.py::test_benchmark_implementations shape=(32, 256)
Name                                                                 Median (CUDA)    Mean (CUDA)     Min (CUDA)      Max (CUDA)      Median          Mean            Min             Max
-------------------------------------------------------------------  ---------------  --------------  --------------  --------------  --------------  --------------  --------------  --------------
test_benchmark_implementations[baseline-32x256]                      36.0653 (1.02)   36.8824 (1.02)  36.0653 (1.02)  37.6996 (1.01)  36.7123 (1.06)  37.0188 (1.07)  36.7123 (1.06)  37.3252 (1.07)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-32x256]  26.5318 (1.38)   26.5209 (1.41)  26.498 (1.39)   26.5329 (1.44)  26.3806 (1.47)  25.7391 (1.53)  24.2316 (1.6)   26.6052 (1.51)
test_benchmark_implementations[onnx_optim_fp16-32x256]               36.7207 (1.0)    37.4692 (1.0)   36.7207 (1.0)   38.2177 (1.0)   38.8794 (1.0)   39.4746 (1.0)   38.8794 (1.0)   40.0698 (1.0)

test/test_torchdynamo_bert.py::test_benchmark_implementations shape=(8, 128)
Name                                                                Median (CUDA)    Mean (CUDA)    Min (CUDA)     Max (CUDA)     Median         Mean           Min            Max
------------------------------------------------------------------  ---------------  -------------  -------------  -------------  -------------  -------------  -------------  -------------
test_benchmark_implementations[baseline-8x128]                      6.6089 (1.01)    6.6722 (1.04)  6.569 (1.0)    7.0185 (1.12)  6.9639 (1.0)   7.0844 (1.0)   6.9434 (1.0)   7.7573 (1.0)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-8x128]  4.3991 (1.52)    4.3848 (1.57)  4.3295 (1.52)  4.4032 (1.78)  4.4054 (1.58)  4.3527 (1.63)  4.2177 (1.65)  4.4202 (1.75)
test_benchmark_implementations[onnx_optim_fp16-8x128]               6.6877 (1.0)     6.9057 (1.0)   6.057 (1.08)   7.8469 (1.0)   6.0324 (1.15)  6.2334 (1.14)  5.8981 (1.18)  6.9805 (1.11)

test/test_torchdynamo_bert.py::test_benchmark_implementations shape=(8, 16)
Name                                                               Median (CUDA)    Mean (CUDA)    Min (CUDA)     Max (CUDA)     Median         Mean           Min            Max
-----------------------------------------------------------------  ---------------  -------------  -------------  -------------  -------------  -------------  -------------  -------------
test_benchmark_implementations[baseline-8x16]                      6.908 (1.0)      6.9054 (1.0)   6.7901 (1.0)   7.1734 (1.0)   7.2689 (1.0)   7.305 (1.0)    7.1133 (1.0)   7.9249 (1.0)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-8x16]  1.2954 (5.33)    1.2953 (5.33)  1.2933 (5.25)  1.2984 (5.52)  1.3495 (5.39)  1.3526 (5.4)   1.3437 (5.29)  1.4379 (5.51)
test_benchmark_implementations[onnx_optim_fp16-8x16]               2.7824 (2.48)    2.8075 (2.46)  2.7506 (2.47)  2.9266 (2.45)  2.883 (2.52)   2.9713 (2.46)  2.8106 (2.53)  3.904 (2.03)

test/test_torchdynamo_bert.py::test_benchmark_implementations shape=(8, 256)
Name                                                                Median (CUDA)    Mean (CUDA)    Min (CUDA)      Max (CUDA)      Median          Mean            Min             Max
------------------------------------------------------------------  ---------------  -------------  --------------  --------------  --------------  --------------  --------------  -------------
test_benchmark_implementations[baseline-8x256]                      12.2911 (1.0)    12.1719 (1.0)  11.6808 (1.03)  12.6466 (1.0)   12.0937 (1.0)   12.072 (1.0)    11.3032 (1.0)   12.8891 (1.0)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-8x256]  7.594 (1.62)     7.5937 (1.6)   7.5899 (1.59)   7.5971 (1.66)   7.564 (1.6)     7.4167 (1.63)   6.9834 (1.62)   7.5714 (1.7)
test_benchmark_implementations[onnx_optim_fp16-8x256]               12.1149 (1.01)   12.1114 (1.0)  12.0852 (1.0)   12.1385 (1.04)  11.3902 (1.06)  11.3346 (1.07)  11.0259 (1.03)  11.522 (1.12)

test/test_torchdynamo_bert.py::test_benchmark_implementations shape=(8, 384)
Name                                                                Median (CUDA)    Mean (CUDA)     Min (CUDA)      Max (CUDA)      Median          Mean            Min             Max
------------------------------------------------------------------  ---------------  --------------  --------------  --------------  --------------  --------------  --------------  --------------
test_benchmark_implementations[baseline-8x384]                      16.2662 (1.01)   16.7644 (1.0)   15.8915 (1.03)  17.8504 (1.0)   15.4141 (1.04)  15.5253 (1.02)  14.7454 (1.04)  16.6969 (1.0)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-8x384]  10.4929 (1.56)   10.5259 (1.59)  10.4684 (1.57)  10.7295 (1.66)  10.5498 (1.52)  10.3925 (1.53)  9.8169 (1.56)   10.9313 (1.53)
test_benchmark_implementations[onnx_optim_fp16-8x384]               16.4198 (1.0)    16.419 (1.02)   16.3963 (1.0)   16.4342 (1.09)  16.0134 (1.0)   15.8778 (1.0)   15.3047 (1.0)   16.3081 (1.02)

test/test_torchdynamo_bert.py::test_benchmark_implementations shape=(8, 512)
Name                                                                Median (CUDA)    Mean (CUDA)     Min (CUDA)      Max (CUDA)      Median          Mean            Min             Max
------------------------------------------------------------------  ---------------  --------------  --------------  --------------  --------------  --------------  --------------  --------------
test_benchmark_implementations[baseline-8x512]                      23.6012 (1.0)    24.1892 (1.0)   23.296 (1.0)    24.9395 (1.0)   21.9757 (1.0)   22.3595 (1.0)   21.4901 (1.0)   23.266 (1.0)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-8x512]  14.3524 (1.64)   14.3414 (1.69)  14.3124 (1.63)  14.3667 (1.74)  14.247 (1.54)   13.8561 (1.61)  12.9897 (1.65)  14.2653 (1.63)
test_benchmark_implementations[onnx_optim_fp16-8x512]               21.3473 (1.11)   21.3571 (1.13)  21.3453 (1.09)  21.3862 (1.17)  20.5277 (1.07)  20.9151 (1.07)  20.3955 (1.05)  21.3705 (1.09)
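
To compare the two runs, the Median (CUDA) timings of `dynamo_optimized_cuda_graphs` on the autocast branch can be divided by the same figures from main. A small sketch with values copied from the tables in this thread for three shapes only; the variable and function names are illustrative:

```python
# Median (CUDA) of dynamo_optimized_cuda_graphs, copied from the tables in this thread.
autocast_branch = {"1x16": 1.5585, "1x128": 2.4136, "32x256": 27.5026}
main_fp16 = {"1x16": 0.5642, "1x128": 1.3701, "32x256": 26.5318}


def slowdown(with_autocast: dict[str, float], without_autocast: dict[str, float]) -> dict[str, float]:
    """Ratio of autocast timing to full-fp16 timing, per shape."""
    return {shape: with_autocast[shape] / without_autocast[shape] for shape in with_autocast}


overhead = slowdown(autocast_branch, main_fp16)
for shape, ratio in overhead.items():
    gap = autocast_branch[shape] - main_fp16[shape]
    print(f"{shape}: {ratio:.2f}x slower, absolute gap {gap:.2f}")
```

On these three shapes the ratio shrinks as the input grows while the absolute gap stays close to 1 in the table's time unit, which would be consistent with a roughly fixed autocast cost per call.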

pommedeterresautee · 2022-09-22T13:09:43Z

for the record, the model in full fp16 on the autocast branch (so autocast is not called):

test/test_torchdynamo_bert.py::test_benchmark_implementations shape=(1, 128)
Name                                                                Median (CUDA)    Mean (CUDA)    Min (CUDA)    Max (CUDA)    Median        Mean          Min           Max
------------------------------------------------------------------  ---------------  -------------  ------------  ------------  ------------  ------------  ------------  ------------
test_benchmark_implementations[dynamo_optimized_cuda_graphs-1x128]  1.3742 (1.0)     1.4852 (1.0)   1.366 (1.0)   1.9476 (1.0)  1.4341 (1.0)  1.4451 (1.0)  1.4202 (1.0)  1.6538 (1.0)

test/test_torchdynamo_bert.py::test_benchmark_implementations shape=(1, 16)
Name                                                               Median (CUDA)    Mean (CUDA)    Min (CUDA)    Max (CUDA)    Median        Mean          Min          Max
-----------------------------------------------------------------  ---------------  -------------  ------------  ------------  ------------  ------------  -----------  ------------
test_benchmark_implementations[dynamo_optimized_cuda_graphs-1x16]  0.5642 (1.0)     0.5713 (1.0)   0.5622 (1.0)  1.7848 (1.0)  0.6189 (1.0)  0.6233 (1.0)  0.616 (1.0)  0.8692 (1.0)

test/test_torchdynamo_bert.py::test_benchmark_implementations shape=(1, 256)
Name                                                                Median (CUDA)    Mean (CUDA)    Min (CUDA)    Max (CUDA)    Median        Mean          Min           Max
------------------------------------------------------------------  ---------------  -------------  ------------  ------------  ------------  ------------  ------------  ------------
test_benchmark_implementations[dynamo_optimized_cuda_graphs-1x256]  1.6394 (1.0)     1.7038 (1.0)   1.6364 (1.0)  2.2067 (1.0)  1.6927 (1.0)  1.7095 (1.0)  1.6795 (1.0)  2.0539 (1.0)

test/test_torchdynamo_bert.py::test_benchmark_implementations shape=(1, 384)
Name                                                                Median (CUDA)    Mean (CUDA)    Min (CUDA)    Max (CUDA)    Median        Mean          Min           Max
------------------------------------------------------------------  ---------------  -------------  ------------  ------------  ------------  ------------  ------------  ------------
test_benchmark_implementations[dynamo_optimized_cuda_graphs-1x384]  1.8924 (1.0)     2.0345 (1.0)   1.8883 (1.0)  2.5303 (1.0)  1.9386 (1.0)  1.9683 (1.0)  1.8656 (1.0)  2.3711 (1.0)

test/test_torchdynamo_bert.py::test_benchmark_implementations shape=(1, 512)
Name                                                                Median (CUDA)    Mean (CUDA)    Min (CUDA)    Max (CUDA)    Median        Mean          Min           Max
------------------------------------------------------------------  ---------------  -------------  ------------  ------------  ------------  ------------  ------------  ------------
test_benchmark_implementations[dynamo_optimized_cuda_graphs-1x512]  2.6624 (1.0)     2.7399 (1.0)   2.6419 (1.0)  3.4785 (1.0)  2.7101 (1.0)  2.8139 (1.0)  2.6397 (1.0)  3.2979 (1.0)

test/test_torchdynamo_bert.py::test_benchmark_implementations shape=(32, 128)
Name                                                                 Median (CUDA)    Mean (CUDA)    Min (CUDA)     Max (CUDA)     Median         Mean           Min            Max
-------------------------------------------------------------------  ---------------  -------------  -------------  -------------  -------------  -------------  -------------  ------------
test_benchmark_implementations[dynamo_optimized_cuda_graphs-32x128]  13.097 (1.0)     13.097 (1.0)   13.0888 (1.0)  13.1052 (1.0)  13.0454 (1.0)  12.7565 (1.0)  12.0125 (1.0)  13.058 (1.0)

test/test_torchdynamo_bert.py::test_benchmark_implementations shape=(32, 16)
Name                                                                Median (CUDA)    Mean (CUDA)    Min (CUDA)    Max (CUDA)    Median        Mean          Min           Max
------------------------------------------------------------------  ---------------  -------------  ------------  ------------  ------------  ------------  ------------  ------------
test_benchmark_implementations[dynamo_optimized_cuda_graphs-32x16]  2.3491 (1.0)     2.3462 (1.0)   2.3296 (1.0)  2.3532 (1.0)  2.3852 (1.0)  2.3689 (1.0)  2.3099 (1.0)  2.4138 (1.0)

test/test_torchdynamo_bert.py::test_benchmark_implementations shape=(32, 256)
Name                                                                 Median (CUDA)    Mean (CUDA)    Min (CUDA)     Max (CUDA)    Median         Mean           Min            Max
-------------------------------------------------------------------  ---------------  -------------  -------------  ------------  -------------  -------------  -------------  -------------
test_benchmark_implementations[dynamo_optimized_cuda_graphs-32x256]  26.0209 (1.0)    26.0198 (1.0)  26.0137 (1.0)  26.025 (1.0)  25.0071 (1.0)  25.0396 (1.0)  24.0307 (1.0)  26.0812 (1.0)

test/test_torchdynamo_bert.py::test_benchmark_implementations shape=(8, 128)
Name                                                                Median (CUDA)    Mean (CUDA)    Min (CUDA)    Max (CUDA)    Median        Mean          Min           Max
------------------------------------------------------------------  ---------------  -------------  ------------  ------------  ------------  ------------  ------------  ------------
test_benchmark_implementations[dynamo_optimized_cuda_graphs-8x128]  4.3622 (1.0)     4.3721 (1.0)   4.3356 (1.0)  4.606 (1.0)   4.3772 (1.0)  4.3444 (1.0)  4.2214 (1.0)  4.3898 (1.0)

test/test_torchdynamo_bert.py::test_benchmark_implementations shape=(8, 16)
Name                                                               Median (CUDA)    Mean (CUDA)    Min (CUDA)    Max (CUDA)    Median        Mean          Min           Max
-----------------------------------------------------------------  ---------------  -------------  ------------  ------------  ------------  ------------  ------------  ------------
test_benchmark_implementations[dynamo_optimized_cuda_graphs-8x16]  1.2974 (1.0)     1.2972 (1.0)   1.2954 (1.0)  1.2995 (1.0)  1.3512 (1.0)  1.3535 (1.0)  1.3476 (1.0)  1.4402 (1.0)

test/test_torchdynamo_bert.py::test_benchmark_implementations shape=(8, 256)
Name                                                                Median (CUDA)    Mean (CUDA)    Min (CUDA)    Max (CUDA)    Median        Mean          Min          Max
------------------------------------------------------------------  ---------------  -------------  ------------  ------------  ------------  ------------  -----------  ------------
test_benchmark_implementations[dynamo_optimized_cuda_graphs-8x256]  7.5663 (1.0)     7.5664 (1.0)   7.5643 (1.0)  7.5704 (1.0)  7.5593 (1.0)  7.4112 (1.0)  7.002 (1.0)  7.5693 (1.0)

test/test_torchdynamo_bert.py::test_benchmark_implementations shape=(8, 384)
Name                                                                Median (CUDA)    Mean (CUDA)    Min (CUDA)     Max (CUDA)     Median         Mean           Min           Max
------------------------------------------------------------------  ---------------  -------------  -------------  -------------  -------------  -------------  ------------  -------------
test_benchmark_implementations[dynamo_optimized_cuda_graphs-8x384]  10.4755 (1.0)    10.4761 (1.0)  10.4684 (1.0)  10.4827 (1.0)  10.4137 (1.0)  10.3188 (1.0)  9.8971 (1.0)  10.5238 (1.0)

test/test_torchdynamo_bert.py::test_benchmark_implementations shape=(8, 512)
Name                                                                Median (CUDA)    Mean (CUDA)    Min (CUDA)    Max (CUDA)     Median         Mean           Min            Max
------------------------------------------------------------------  ---------------  -------------  ------------  -------------  -------------  -------------  -------------  -------------
test_benchmark_implementations[dynamo_optimized_cuda_graphs-8x512]  14.21 (1.0)      14.2105 (1.0)  14.208 (1.0)  14.2152 (1.0)  14.2029 (1.0)  13.8439 (1.0)  13.0863 (1.0)  14.4994 (1.0)

This shows that the code is as fast as on main when inference is not run under the autocast context manager.

pommedeterresautee · 2022-09-22T13:11:52Z

@gaetansnl there was a bug in ONNX Runtime: on main it takes the baseline model without setting fp16 to False, so it was running in full fp16, which doesn't work. In this branch there is no such flag and the ONNX model is in mixed precision.

pommedeterresautee · 2022-09-22T17:48:08Z

measures with the weights in fp16 and the model under autocast:

test/test_torchdynamo_bert.py::test_benchmark_implementations shape=(1, 128)
Name                                                                       Median (CUDA)    Mean (CUDA)     Min (CUDA)      Max (CUDA)      Median          Mean            Min             Max
-------------------------------------------------------------------------  ---------------  --------------  --------------  --------------  --------------  --------------  --------------  --------------
test_benchmark_implementations[baseline-1x128]                             7.6605 (2.59)    7.9332 (2.51)   7.5622 (2.62)   11.0316 (1.81)  8.0965 (2.47)   8.3429 (2.41)   7.9112 (2.53)   9.9191 (2.06)
test_benchmark_implementations[dynamo-1x128]                               6.7329 (2.94)    6.7255 (2.96)   6.5987 (3.0)    6.87 (2.91)     6.9386 (2.88)   6.9729 (2.89)   6.8712 (2.91)   7.453 (2.74)
test_benchmark_implementations[dynamo_cuda_graphs-1x128]                   1.5616 (12.68)   1.6294 (12.2)   1.5391 (12.87)  1.7644 (11.35)  1.5984 (12.52)  1.6007 (12.58)  1.5955 (12.54)  1.685 (12.14)
test_benchmark_implementations[dynamo_no_dropout-1x128]                    6.2597 (3.16)    6.2689 (3.17)   6.1901 (3.2)    6.356 (3.15)    6.666 (3.0)     6.679 (3.02)    6.5346 (3.06)   6.9556 (2.94)
test_benchmark_implementations[dynamo_nvfuser_ofi-1x128]                   3.3935 (5.84)    3.3858 (5.87)   3.2553 (6.08)   3.4806 (5.75)   3.7285 (5.37)   3.7228 (5.41)   3.5896 (5.57)   4.0261 (5.08)
test_benchmark_implementations[dynamo_optimized-1x128]                     19.8042 (1.0)    19.88 (1.0)     19.8021 (1.0)   20.0204 (1.0)   20.0119 (1.0)   20.1414 (1.0)   20.0052 (1.0)   20.4569 (1.0)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-1x128]         1.5606 (12.69)   1.5597 (12.75)  1.5155 (13.07)  1.5626 (12.81)  1.4283 (14.01)  1.4303 (14.08)  1.4252 (14.04)  1.5174 (13.48)
test_benchmark_implementations[dynamo_optimizer_cuda_graphs_causal-1x128]  1.5718 (12.6)    1.4885 (13.36)  1.3722 (14.43)  1.5739 (12.72)  1.4302 (13.99)  1.4326 (14.06)  1.4263 (14.03)  1.5295 (13.37)
test_benchmark_implementations[onnx-1x128]                                 3.1877 (6.21)    3.1987 (6.21)   3.1795 (6.23)   3.3782 (5.93)   3.2336 (6.19)   3.2478 (6.2)    3.2264 (6.2)    3.5517 (5.76)
test_benchmark_implementations[onnx_optim_fp16-1x128]                      2.8242 (7.01)    2.8327 (7.02)   2.8047 (7.06)   2.8846 (6.94)   2.763 (7.24)    2.7916 (7.21)   2.7425 (7.29)   3.2298 (6.33)
test_benchmark_implementations[onnx_optim_fp32-1x128]                      3.5574 (5.57)    3.4032 (5.84)   3.1846 (6.22)   3.5932 (5.57)   3.2336 (6.19)   3.2463 (6.2)    3.2276 (6.2)    3.5254 (5.8)
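
A note on reading these tables: the figure in parentheses looks like a speedup relative to the slowest row of the same column (the row marked 1.0). The sketch below reproduces that figure for a few Median (CUDA) entries of the shape=(1, 128) table above. The helper name `relative_speedups` is hypothetical and the "slowest / row" formula is an assumption inferred from the numbers, not taken from the benchmark code.

```python
# Median (CUDA) timings copied from the shape=(1, 128) table above.
median_cuda = {
    "baseline": 7.6605,
    "dynamo_optimized": 19.8042,
    "dynamo_optimized_cuda_graphs": 1.5606,
    "onnx_optim_fp16": 2.8242,
}


def relative_speedups(timings):
    """Assumed reading of the parenthesized figures: slowest timing / row timing."""
    slowest = max(timings.values())
    return {name: round(slowest / value, 2) for name, value in timings.items()}


speedups = relative_speedups(median_cuda)
for name, ratio in speedups.items():
    print(f"{name}: {median_cuda[name]} ({ratio})")
```

Under that reading, a larger figure in parentheses means a faster implementation, and the ratios are only comparable within one column of one table, since each column has its own slowest row.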

test/test_torchdynamo_bert.py::test_benchmark_implementations shape=(1, 16)
Name                                                                      Median (CUDA)    Mean (CUDA)     Min (CUDA)      Max (CUDA)      Median          Mean            Min             Max
------------------------------------------------------------------------  ---------------  --------------  --------------  --------------  --------------  --------------  --------------  --------------
test_benchmark_implementations[baseline-1x16]                             8.5985 (2.14)    10.0206 (1.85)  7.6575 (2.39)   13.8199 (1.37)  8.2296 (2.25)   8.265 (2.26)    8.0546 (2.29)   8.8783 (2.16)
test_benchmark_implementations[dynamo-1x16]                               6.4256 (2.86)    6.4446 (2.87)   6.3754 (2.88)   6.5732 (2.87)   6.7502 (2.75)   6.774 (2.76)    6.6862 (2.76)   7.0925 (2.7)
test_benchmark_implementations[dynamo_cuda_graphs-1x16]                   1.1192 (16.44)   1.1195 (16.54)  1.1172 (16.41)  1.1223 (16.82)  1.0585 (17.53)  1.0913 (17.13)  1.0548 (17.49)  1.2606 (15.21)
test_benchmark_implementations[dynamo_no_dropout-1x16]                    6.0099 (3.06)    6.0205 (3.07)   5.9835 (3.06)   6.101 (3.09)    6.4984 (2.85)   6.5778 (2.84)   6.4217 (2.87)   7.1127 (2.7)
test_benchmark_implementations[dynamo_nvfuser_ofi-1x16]                   3.1939 (5.76)    3.1746 (5.83)   3.0771 (5.96)   3.2891 (5.74)   3.5884 (5.17)   3.5752 (5.23)   3.4215 (5.39)   3.8911 (4.93)
test_benchmark_implementations[dynamo_optimized-1x16]                     18.4013 (1.0)    18.5119 (1.0)   18.3338 (1.0)   18.8826 (1.0)   18.5509 (1.0)   18.6906 (1.0)   18.4498 (1.0)   19.1759 (1.0)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-1x16]         0.6277 (29.31)   0.6279 (29.48)  0.6257 (29.3)   0.6298 (29.98)  0.6245 (29.71)  0.6262 (29.85)  0.6215 (29.69)  0.7118 (26.94)
test_benchmark_implementations[dynamo_optimizer_cuda_graphs_causal-1x16]  0.6287 (29.27)   0.6285 (29.45)  0.6267 (29.26)  0.638 (29.6)    0.6218 (29.84)  0.6234 (29.98)  0.6187 (29.82)  0.7396 (25.93)
test_benchmark_implementations[onnx-1x16]                                 2.4669 (7.46)    2.4797 (7.47)   2.432 (7.54)    2.561 (7.37)    2.4993 (7.42)   2.5286 (7.39)   2.4755 (7.45)   2.9406 (6.52)
test_benchmark_implementations[onnx_optim_fp16-1x16]                      2.8303 (6.5)     2.8303 (6.54)   2.7771 (6.6)    2.898 (6.52)    2.8146 (6.59)   2.8801 (6.49)   2.721 (6.78)    5.2296 (3.67)
test_benchmark_implementations[onnx_optim_fp32-1x16]                      2.4402 (7.54)    2.4559 (7.54)   2.423 (7.57)    2.5518 (7.4)    2.4798 (7.48)   2.5037 (7.47)   2.4579 (7.51)   2.9093 (6.59)

test/test_torchdynamo_bert.py::test_benchmark_implementations shape=(1, 256)
Name                                                                       Median (CUDA)    Mean (CUDA)     Min (CUDA)      Max (CUDA)      Median          Mean            Min             Max
-------------------------------------------------------------------------  ---------------  --------------  --------------  --------------  --------------  --------------  --------------  --------------
test_benchmark_implementations[baseline-1x256]                             7.553 (2.59)     7.571 (2.57)    7.468 (2.52)    7.6933 (2.57)   7.9049 (2.53)   7.9921 (2.51)   7.8276 (2.54)   8.6992 (2.33)
test_benchmark_implementations[dynamo-1x256]                               6.5608 (2.99)    6.5712 (2.97)   6.4809 (2.91)   6.7236 (2.94)   6.9156 (2.89)   6.9401 (2.89)   6.8395 (2.91)   7.276 (2.79)
test_benchmark_implementations[dynamo_cuda_graphs-1x256]                   2.2589 (8.67)    2.2587 (8.63)   2.2559 (8.35)   2.262 (8.74)    2.0757 (9.63)   2.0715 (9.67)   2.0435 (9.73)   2.137 (9.49)
test_benchmark_implementations[dynamo_no_dropout-1x256]                    6.6285 (2.96)    6.6407 (2.93)   6.615 (2.85)    6.7133 (2.94)   7.0503 (2.84)   7.097 (2.82)    7.0097 (2.84)   7.4603 (2.72)
test_benchmark_implementations[dynamo_nvfuser_ofi-1x256]                   2.8959 (6.77)    2.9247 (6.66)   2.8551 (6.59)   3.1928 (6.19)   3.3429 (5.98)   3.3241 (6.02)   3.1832 (6.25)   3.6621 (5.54)
test_benchmark_implementations[dynamo_optimized-1x256]                     19.5945 (1.0)    19.4843 (1.0)   18.8273 (1.0)   19.7622 (1.0)   19.9905 (1.0)   20.027 (1.0)    19.8866 (1.0)   20.2829 (1.0)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-1x256]         1.8452 (10.62)   1.847 (10.55)   1.8432 (10.21)  1.8524 (10.67)  1.7025 (11.74)  1.7237 (11.62)  1.6744 (11.88)  1.9958 (10.16)
test_benchmark_implementations[dynamo_optimizer_cuda_graphs_causal-1x256]  1.8586 (10.54)   1.8785 (10.37)  1.8555 (10.15)  2.3122 (8.55)   1.6936 (11.8)   1.694 (11.82)   1.6775 (11.85)  1.7908 (11.33)
test_benchmark_implementations[onnx-1x256]                                 3.9199 (5.0)     3.9236 (4.97)   3.9137 (4.81)   3.9456 (5.01)   3.9413 (5.07)   3.9492 (5.07)   3.9133 (5.08)   4.2207 (4.81)
test_benchmark_implementations[onnx_optim_fp16-1x256]                      2.7904 (7.02)    2.7935 (6.97)   2.7854 (6.76)   2.818 (7.01)    2.5601 (7.81)   2.5711 (7.79)   2.5551 (7.78)   2.8745 (7.06)
test_benchmark_implementations[onnx_optim_fp32-1x256]                      4.3551 (4.5)     4.3575 (4.47)   4.3518 (4.33)   4.3868 (4.5)    3.9516 (5.06)   3.9669 (5.05)   3.9364 (5.05)   4.2545 (4.77)

test/test_torchdynamo_bert.py::test_benchmark_implementations shape=(1, 384)
Name                                                                       Median (CUDA)    Mean (CUDA)    Min (CUDA)     Max (CUDA)     Median         Mean           Min            Max
-------------------------------------------------------------------------  ---------------  -------------  -------------  -------------  -------------  -------------  -------------  -------------
test_benchmark_implementations[baseline-1x384]                             7.5704 (2.43)    7.609 (2.41)   7.5151 (2.43)  7.8592 (2.34)  8.1326 (2.28)  8.2142 (2.27)  8.0782 (2.29)  8.7769 (2.17)
test_benchmark_implementations[dynamo-1x384]                               6.5138 (2.82)    6.5357 (2.81)  6.4809 (2.82)  6.6335 (2.78)  6.8973 (2.69)  6.9176 (2.69)  6.8245 (2.71)  7.2583 (2.63)
test_benchmark_implementations[dynamo_cuda_graphs-1x384]                   2.9 (6.33)       2.9689 (6.18)  2.8621 (6.38)  3.0802 (5.98)  2.9029 (6.38)  2.8796 (6.47)  2.8273 (6.53)  2.9232 (6.53)
test_benchmark_implementations[dynamo_no_dropout-1x384]                    6.3437 (2.89)    6.3538 (2.89)  6.2228 (2.94)  6.5026 (2.83)  6.6473 (2.79)  6.6985 (2.78)  6.626 (2.79)   7.1187 (2.68)
test_benchmark_implementations[dynamo_nvfuser_ofi-1x384]                   3.2851 (5.59)    3.2801 (5.6)   3.1099 (5.88)  3.4582 (5.32)  3.595 (5.15)   3.5772 (5.21)  3.4252 (5.39)  3.941 (4.84)
test_benchmark_implementations[dynamo_optimized-1x384]                     18.3613 (1.0)    18.3575 (1.0)  18.2733 (1.0)  18.4105 (1.0)  18.5198 (1.0)  18.6274 (1.0)  18.4637 (1.0)  19.0848 (1.0)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-1x384]         2.0285 (9.05)    2.0215 (9.08)  1.9302 (9.47)  2.0337 (9.05)  1.9281 (9.61)  1.9111 (9.75)  1.8576 (9.94)  1.9673 (9.7)
test_benchmark_implementations[dynamo_optimizer_cuda_graphs_causal-1x384]  2.0449 (8.98)    2.045 (8.98)   2.0408 (8.95)  2.05 (8.98)    1.9278 (9.61)  1.9113 (9.75)  1.8463 (10.0)  1.9523 (9.78)
test_benchmark_implementations[onnx-1x384]                                 5.3535 (3.43)    5.3549 (3.43)  5.3415 (3.42)  5.3873 (3.42)  5.0719 (3.65)  5.2008 (3.58)  4.8636 (3.8)   6.0413 (3.16)
test_benchmark_implementations[onnx_optim_fp16-1x384]                      2.8887 (6.36)    2.8896 (6.35)  2.8621 (6.38)  2.9204 (6.3)   2.9264 (6.33)  2.9292 (6.36)  2.8794 (6.41)  3.1708 (6.02)
test_benchmark_implementations[onnx_optim_fp32-1x384]                      5.0514 (3.63)    5.0504 (3.63)  5.0401 (3.63)  5.0616 (3.64)  5.0965 (3.63)  5.059 (3.68)   4.9089 (3.76)  5.2053 (3.67)

test/test_torchdynamo_bert.py::test_benchmark_implementations shape=(1, 512)
Name                                                                       Median (CUDA)    Mean (CUDA)    Min (CUDA)     Max (CUDA)     Median         Mean           Min            Max
-------------------------------------------------------------------------  ---------------  -------------  -------------  -------------  -------------  -------------  -------------  -------------
test_benchmark_implementations[baseline-1x512]                             7.3913 (2.48)    7.4481 (2.46)  7.3667 (2.47)  7.713 (2.39)   7.7369 (2.58)  7.8174 (2.55)  7.7022 (2.56)  8.4882 (2.37)
test_benchmark_implementations[dynamo-1x512]                               6.3889 (2.87)    6.3807 (2.87)  6.2702 (2.91)  6.482 (2.85)   6.617 (3.02)   6.6864 (2.98)  6.5914 (2.99)  7.277 (2.76)
test_benchmark_implementations[dynamo_cuda_graphs-1x512]                   4.6848 (3.91)    4.6386 (3.95)  4.4052 (4.14)  4.6909 (3.93)  4.3883 (4.55)  4.3673 (4.57)  4.3137 (4.57)  4.4019 (4.57)
test_benchmark_implementations[dynamo_no_dropout-1x512]                    5.9894 (3.06)    6.0217 (3.04)  5.9453 (3.06)  6.2024 (2.97)  6.4405 (3.1)   6.479 (3.08)   6.3649 (3.1)   6.7992 (2.96)
test_benchmark_implementations[dynamo_nvfuser_ofi-1x512]                   3.4212 (5.36)    3.425 (5.35)   3.3372 (5.46)  3.5082 (5.26)  3.7336 (5.35)  3.7332 (5.35)  3.5726 (5.52)  4.1709 (4.82)
test_benchmark_implementations[dynamo_optimized-1x512]                     18.3265 (1.0)    18.3329 (1.0)  18.2201 (1.0)  18.4474 (1.0)  19.978 (1.0)   19.9555 (1.0)  19.7277 (1.0)  20.1152 (1.0)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-1x512]         2.9204 (6.28)    2.9207 (6.28)  2.9174 (6.25)  2.9245 (6.31)  2.6658 (7.49)  2.6642 (7.49)  2.6248 (7.52)  2.7353 (7.35)
test_benchmark_implementations[dynamo_optimizer_cuda_graphs_causal-1x512]  2.9532 (6.21)    2.9534 (6.21)  2.9491 (6.18)  2.9563 (6.24)  2.6776 (7.46)  2.6961 (7.4)   2.6347 (7.49)  3.1068 (6.47)
test_benchmark_implementations[onnx-1x512]                                 7.3779 (2.48)    7.4523 (2.46)  7.3738 (2.47)  7.935 (2.32)   7.4261 (2.69)  7.3922 (2.7)   7.2257 (2.73)  7.5067 (2.68)
test_benchmark_implementations[onnx_optim_fp16-1x512]                      3.9345 (4.66)    3.9355 (4.66)  3.9158 (4.65)  3.9546 (4.66)  3.9765 (5.02)  3.9747 (5.02)  3.9266 (5.02)  4.2046 (4.78)
test_benchmark_implementations[onnx_optim_fp32-1x512]                      7.9319 (2.31)    7.9335 (2.31)  7.9258 (2.3)   7.9411 (2.32)  7.4303 (2.69)  7.3798 (2.7)   7.2062 (2.74)  7.4877 (2.69)

test/test_torchdynamo_bert.py::test_benchmark_implementations shape=(32, 128)
Name                                                                        Median (CUDA)    Mean (CUDA)     Min (CUDA)      Max (CUDA)      Median          Mean            Min             Max
--------------------------------------------------------------------------  ---------------  --------------  --------------  --------------  --------------  --------------  --------------  --------------
test_benchmark_implementations[baseline-32x128]                             19.9885 (1.9)    19.9921 (1.9)   19.9864 (1.9)   20.0018 (1.9)   19.6734 (1.89)  19.8335 (1.9)   19.3291 (1.92)  20.1765 (1.88)
test_benchmark_implementations[dynamo-32x128]                               20.4227 (1.86)   20.4244 (1.86)  20.4206 (1.86)  20.4298 (1.86)  19.0551 (1.95)  19.563 (1.92)   19.0284 (1.96)  20.0972 (1.89)
test_benchmark_implementations[dynamo_cuda_graphs-32x128]                   19.797 (1.92)    19.7992 (1.92)  19.7939 (1.92)  19.8083 (1.92)  18.8559 (1.97)  19.524 (1.93)   18.8221 (1.98)  20.4032 (1.86)
test_benchmark_implementations[dynamo_no_dropout-32x128]                    20.4247 (1.86)   20.354 (1.87)   20.1308 (1.89)  20.4329 (1.86)  19.5796 (1.9)   19.7064 (1.91)  19.1097 (1.95)  20.1087 (1.89)
test_benchmark_implementations[dynamo_nvfuser_ofi-32x128]                   17.791 (2.14)    17.7445 (2.14)  17.5504 (2.17)  17.7961 (2.14)  17.4183 (2.14)  16.9937 (2.21)  16.3069 (2.28)  17.4253 (2.18)
test_benchmark_implementations[dynamo_optimized-32x128]                     18.1373 (2.1)    18.1337 (2.1)   18.1053 (2.1)   18.1473 (2.1)   18.592 (2.0)    18.7188 (2.01)  18.5652 (2.0)   19.1896 (1.98)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-32x128]         12.9157 (2.94)   12.914 (2.95)   12.9085 (2.94)  12.9198 (2.95)  13.0854 (2.84)  12.7066 (2.96)  11.979 (3.11)   13.0971 (2.9)
test_benchmark_implementations[dynamo_optimizer_cuda_graphs_causal-32x128]  12.9894 (2.93)   12.9894 (2.93)  12.9853 (2.93)  12.9935 (2.93)  12.8909 (2.89)  12.6881 (2.96)  12.0358 (3.09)  13.1549 (2.89)
test_benchmark_implementations[onnx-32x128]                                 37.9894 (1.0)    38.0406 (1.0)   37.9894 (1.0)   38.0918 (1.0)   37.205 (1.0)    37.614 (1.0)    37.205 (1.0)    38.0231 (1.0)
test_benchmark_implementations[onnx_optim_fp16-32x128]                      17.9159 (2.12)   17.9122 (2.12)  17.8954 (2.12)  17.921 (2.13)   17.8506 (2.08)  17.5318 (2.15)  16.8395 (2.21)  17.9537 (2.12)
test_benchmark_implementations[onnx_optim_fp32-32x128]                      38.0037 (1.0)    38.0099 (1.0)   38.0037 (1.0)   38.016 (1.0)    37.0032 (1.01)  37.4619 (1.0)   37.0032 (1.01)  37.9206 (1.0)

test/test_torchdynamo_bert.py::test_benchmark_implementations shape=(32, 16)
Name                                                                       Median (CUDA)    Mean (CUDA)    Min (CUDA)     Max (CUDA)     Median         Mean           Min            Max
-------------------------------------------------------------------------  ---------------  -------------  -------------  -------------  -------------  -------------  -------------  -------------
test_benchmark_implementations[baseline-32x16]                             7.7896 (2.36)    7.7981 (2.36)  7.6913 (2.38)  7.9741 (2.31)  8.1593 (2.29)  8.2285 (2.28)  8.0638 (2.3)   8.7488 (2.2)
test_benchmark_implementations[dynamo-32x16]                               6.7135 (2.74)    6.7336 (2.73)  6.6662 (2.74)  6.9366 (2.66)  7.0167 (2.66)  7.0663 (2.66)  6.953 (2.67)   7.6539 (2.52)
test_benchmark_implementations[dynamo_cuda_graphs-32x16]                   3.1007 (5.93)    3.1274 (5.88)  3.0986 (5.9)   3.456 (5.34)   3.1281 (5.97)  3.1278 (6.0)   3.0912 (6.01)  3.2096 (6.0)
test_benchmark_implementations[dynamo_no_dropout-32x16]                    6.3744 (2.88)    6.3898 (2.88)  6.2659 (2.92)  6.5905 (2.8)   6.7537 (2.76)  6.8027 (2.76)  6.7059 (2.77)  7.4592 (2.58)
test_benchmark_implementations[dynamo_nvfuser_ofi-32x16]                   3.6577 (5.03)    3.6874 (4.99)  3.5625 (5.13)  3.8349 (4.81)  4.034 (4.63)   4.047 (4.64)   3.9003 (4.76)  4.4634 (4.31)
test_benchmark_implementations[dynamo_optimized-32x16]                     18.388 (1.0)     18.3839 (1.0)  18.2804 (1.0)  18.4494 (1.0)  18.6591 (1.0)  18.7635 (1.0)  18.5679 (1.0)  19.2508 (1.0)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-32x16]         2.5539 (7.2)     2.5544 (7.2)   2.5518 (7.16)  2.5569 (7.22)  2.3349 (7.99)  2.333 (8.04)   2.2951 (8.09)  2.4158 (7.97)
test_benchmark_implementations[dynamo_optimizer_cuda_graphs_causal-32x16]  2.391 (7.69)     2.4274 (7.57)  2.3869 (7.66)  2.5569 (7.22)  2.3637 (7.89)  2.3968 (7.83)  2.3181 (8.01)  2.5495 (7.55)
test_benchmark_implementations[onnx-32x16]                                 5.6269 (3.27)    5.6307 (3.26)  5.6197 (3.25)  5.6525 (3.26)  5.6711 (3.29)  5.6498 (3.32)  5.5571 (3.34)  5.8432 (3.29)
test_benchmark_implementations[onnx_optim_fp16-32x16]                      3.3075 (5.56)    3.3062 (5.56)  3.2862 (5.56)  3.3352 (5.53)  3.3408 (5.59)  3.3575 (5.59)  3.3177 (5.6)   3.649 (5.28)
test_benchmark_implementations[onnx_optim_fp32-32x16]                      5.8204 (3.16)    5.8232 (3.16)  5.8143 (3.14)  5.8378 (3.16)  5.6519 (3.3)   5.673 (3.31)   5.5316 (3.36)  6.1365 (3.14)

test/test_torchdynamo_bert.py::test_benchmark_implementations shape=(32, 256)
Name                                                                        Median (CUDA)    Mean (CUDA)     Min (CUDA)      Max (CUDA)      Median          Mean            Min             Max
--------------------------------------------------------------------------  ---------------  --------------  --------------  --------------  --------------  --------------  --------------  --------------
test_benchmark_implementations[baseline-32x256]                             43.902 (1.72)    43.9101 (1.72)  43.902 (1.72)   43.9183 (1.72)  43.1339 (1.72)  43.5829 (1.7)   43.1339 (1.72)  44.0318 (1.68)
test_benchmark_implementations[dynamo-32x256]                               43.8252 (1.72)   43.8595 (1.72)  43.8252 (1.72)  43.8938 (1.72)  43.8847 (1.69)  44.0577 (1.68)  43.8847 (1.69)  44.2306 (1.68)
test_benchmark_implementations[dynamo_cuda_graphs-32x256]                   43.7996 (1.72)   43.8011 (1.72)  43.7996 (1.72)  43.8026 (1.72)  42.7952 (1.73)  43.3153 (1.71)  42.7952 (1.73)  43.8354 (1.69)
test_benchmark_implementations[dynamo_no_dropout-32x256]                    44.1385 (1.71)   44.1411 (1.71)  44.1385 (1.71)  44.1436 (1.71)  43.5725 (1.7)   43.9225 (1.69)  43.5725 (1.7)   44.2724 (1.68)
test_benchmark_implementations[dynamo_nvfuser_ofi-32x256]                   35.9804 (2.1)    36.5861 (2.06)  35.9804 (2.1)   37.1917 (2.03)  35.1051 (2.11)  35.5981 (2.08)  35.1051 (2.11)  36.0911 (2.05)
test_benchmark_implementations[dynamo_optimized-32x256]                     27.3992 (2.75)   27.4169 (2.75)  27.3971 (2.75)  27.4543 (2.75)  26.2085 (2.83)  26.107 (2.84)   25.302 (2.93)   26.8106 (2.77)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-32x256]         25.9441 (2.91)   25.9434 (2.91)  25.941 (2.91)   25.9451 (2.91)  25.1295 (2.95)  25.0927 (2.96)  23.9535 (3.1)   26.1953 (2.83)
test_benchmark_implementations[dynamo_optimizer_cuda_graphs_causal-32x256]  25.8591 (2.92)   25.857 (2.92)   25.8468 (2.92)  25.8652 (2.92)  25.3347 (2.93)  25.082 (2.96)   23.7889 (3.12)  26.1226 (2.84)
test_benchmark_implementations[onnx-32x256]                                 75.471 (1.0)     75.471 (1.0)    75.471 (1.0)    75.471 (1.0)    74.1612 (1.0)   74.1612 (1.0)   74.1612 (1.0)   74.1612 (1.0)
test_benchmark_implementations[onnx_optim_fp16-32x256]                      36.7094 (2.06)   36.713 (2.06)   36.7094 (2.06)  36.7167 (2.06)  34.473 (2.15)   35.5963 (2.08)  34.473 (2.15)   36.7196 (2.02)
test_benchmark_implementations[onnx_optim_fp32-32x256]                      75.4545 (1.0)    75.4545 (1.0)   75.4545 (1.0)   75.4545 (1.0)   73.9454 (1.0)   73.9454 (1.0)   73.9454 (1.0)   73.9454 (1.0)

test/test_torchdynamo_bert.py::test_benchmark_implementations shape=(8, 128)
Name                                                                       Median (CUDA)    Mean (CUDA)     Min (CUDA)     Max (CUDA)      Median          Mean            Min             Max
-------------------------------------------------------------------------  ---------------  --------------  -------------  --------------  --------------  --------------  --------------  --------------
test_benchmark_implementations[baseline-8x128]                             7.6882 (2.37)    7.7161 (2.37)   7.6513 (2.36)  7.8203 (2.37)   8.0509 (2.29)   8.1397 (2.27)   8.0176 (2.29)   8.6786 (2.18)
test_benchmark_implementations[dynamo-8x128]                               7.1045 (2.57)    7.1043 (2.57)   7.1004 (2.54)  7.1076 (2.61)   7.0329 (2.62)   7.083 (2.61)    6.9554 (2.64)   7.4152 (2.55)
test_benchmark_implementations[dynamo_cuda_graphs-8x128]                   6.8086 (2.68)    6.7544 (2.71)   6.2536 (2.89)  6.8147 (2.72)   6.1993 (2.97)   6.1946 (2.99)   6.1376 (2.99)   6.2532 (3.03)
test_benchmark_implementations[dynamo_no_dropout-8x128]                    7.1035 (2.57)    7.1032 (2.58)   7.0994 (2.54)  7.1066 (2.61)   6.7912 (2.71)   6.8176 (2.71)   6.7409 (2.72)   7.1907 (2.63)
test_benchmark_implementations[dynamo_nvfuser_ofi-8x128]                   5.8357 (3.12)    5.8353 (3.13)   5.8326 (3.1)   5.8401 (3.18)   5.4088 (3.4)    5.4015 (3.43)   5.3005 (3.46)   5.5612 (3.4)
test_benchmark_implementations[dynamo_optimized-8x128]                     18.2241 (1.0)    18.2907 (1.0)   18.0675 (1.0)  18.5457 (1.0)   18.4038 (1.0)   18.5019 (1.0)   18.3426 (1.0)   18.9316 (1.0)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-8x128]         4.6684 (3.9)     4.669 (3.92)    4.6653 (3.87)  4.6725 (3.97)   4.3283 (4.25)   4.3743 (4.23)   4.1687 (4.4)    4.8529 (3.9)
test_benchmark_implementations[dynamo_optimizer_cuda_graphs_causal-8x128]  4.6797 (3.89)    4.6804 (3.91)   4.6766 (3.86)  4.6838 (3.96)   4.3204 (4.26)   4.2784 (4.32)   4.1636 (4.41)   4.3304 (4.37)
test_benchmark_implementations[onnx-8x128]                                 12.0791 (1.51)   12.086 (1.51)   12.073 (1.5)   12.0986 (1.53)  11.0917 (1.66)  11.0635 (1.67)  10.9016 (1.68)  11.1618 (1.7)
test_benchmark_implementations[onnx_optim_fp16-8x128]                      6.4862 (2.81)    6.4874 (2.82)   6.4829 (2.79)  6.4972 (2.85)   6.0809 (3.03)   6.2295 (2.97)   5.902 (3.11)    6.8165 (2.78)
test_benchmark_implementations[onnx_optim_fp32-8x128]                      12.0691 (1.51)   11.7955 (1.55)  11.2886 (1.6)  12.1059 (1.53)  11.0996 (1.66)  11.1007 (1.67)  10.9196 (1.68)  11.3024 (1.68)

test/test_torchdynamo_bert.py::test_benchmark_implementations shape=(8, 16)
Name                                                                      Median (CUDA)    Mean (CUDA)     Min (CUDA)      Max (CUDA)      Median          Mean            Min             Max
------------------------------------------------------------------------  ---------------  --------------  --------------  --------------  --------------  --------------  --------------  --------------
test_benchmark_implementations[baseline-8x16]                             8.0026 (2.28)    8.0095 (2.29)   7.9422 (2.28)   8.1019 (2.34)   8.3481 (2.21)   8.474 (2.19)    8.2862 (2.22)   9.7117 (1.97)
test_benchmark_implementations[dynamo-8x16]                               6.9307 (2.63)    6.9342 (2.64)   6.8711 (2.63)   7.0113 (2.7)    7.3655 (2.5)    7.418 (2.51)    7.2169 (2.55)   7.8575 (2.43)
test_benchmark_implementations[dynamo_cuda_graphs-8x16]                   1.8 (10.12)      1.7997 (10.19)  1.7971 (10.07)  1.8033 (10.49)  1.6321 (11.3)   1.6762 (11.09)  1.6288 (11.3)   1.9426 (9.83)
test_benchmark_implementations[dynamo_no_dropout-8x16]                    7.1875 (2.53)    7.1667 (2.56)   7.0556 (2.57)   7.2233 (2.62)   7.4841 (2.46)   7.5437 (2.46)   7.4565 (2.47)   8.0487 (2.37)
test_benchmark_implementations[dynamo_nvfuser_ofi-8x16]                   3.6987 (4.92)    3.6816 (4.98)   3.5512 (5.1)    3.8103 (4.97)   3.937 (4.68)    3.9566 (4.7)    3.8619 (4.77)   4.4714 (4.27)
test_benchmark_implementations[dynamo_optimized-8x16]                     18.2159 (1.0)    18.3345 (1.0)   18.0992 (1.0)   18.9225 (1.0)   18.4395 (1.0)   18.5832 (1.0)   18.4076 (1.0)   19.1045 (1.0)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-8x16]         1.4664 (12.42)   1.4669 (12.5)   1.4633 (12.37)  1.4725 (12.85)  1.3477 (13.68)  1.3592 (13.67)  1.3445 (13.69)  1.6115 (11.86)
test_benchmark_implementations[dynamo_optimizer_cuda_graphs_causal-8x16]  1.4653 (12.43)   1.4593 (12.56)  1.2943 (13.98)  1.4694 (12.88)  1.3399 (13.76)  1.3444 (13.82)  1.3352 (13.79)  1.4652 (13.04)
test_benchmark_implementations[onnx-8x16]                                 2.9072 (6.27)    2.911 (6.3)     2.9003 (6.24)   2.9676 (6.38)   2.948 (6.26)    2.9599 (6.28)   2.9437 (6.25)   3.2374 (5.9)
test_benchmark_implementations[onnx_optim_fp16-8x16]                      2.8324 (6.43)    2.8391 (6.46)   2.816 (6.43)    2.9348 (6.45)   2.8898 (6.38)   2.9068 (6.39)   2.8696 (6.41)   3.3752 (5.66)
test_benchmark_implementations[onnx_optim_fp32-8x16]                      2.9259 (6.23)    2.9446 (6.23)   2.9194 (6.2)    3.1304 (6.04)   2.9849 (6.18)   3.0218 (6.15)   2.9654 (6.21)   3.4835 (5.48)

test/test_torchdynamo_bert.py::test_benchmark_implementations shape=(8, 256)
Name                                                                       Median (CUDA)    Mean (CUDA)     Min (CUDA)      Max (CUDA)      Median          Mean            Min             Max
-------------------------------------------------------------------------  ---------------  --------------  --------------  --------------  --------------  --------------  --------------  --------------
test_benchmark_implementations[baseline-8x256]                             14.4046 (1.62)   14.3729 (1.62)  14.1752 (1.62)  14.4251 (1.66)  13.4262 (1.67)  13.4115 (1.68)  13.2269 (1.69)  13.6626 (1.65)
test_benchmark_implementations[dynamo-8x256]                               13.4974 (1.73)   13.7875 (1.69)  13.4277 (1.71)  14.3831 (1.67)  13.3872 (1.67)  13.388 (1.68)   13.2685 (1.68)  13.4548 (1.68)
test_benchmark_implementations[dynamo_cuda_graphs-8x256]                   13.1953 (1.77)   13.1859 (1.77)  13.143 (1.75)   13.2649 (1.81)  13.1024 (1.71)  13.0303 (1.73)  12.8504 (1.74)  13.1506 (1.72)
test_benchmark_implementations[dynamo_no_dropout-8x256]                    13.4892 (1.73)   13.653 (1.71)   13.4298 (1.71)  14.3923 (1.67)  13.4651 (1.66)  13.4124 (1.68)  13.2646 (1.68)  13.4941 (1.68)
test_benchmark_implementations[dynamo_nvfuser_ofi-8x256]                   10.5984 (2.21)   10.7886 (2.16)  10.5196 (2.19)  11.2364 (2.14)  10.4592 (2.14)  10.4092 (2.16)  10.1622 (2.2)   10.5341 (2.15)
test_benchmark_implementations[dynamo_optimized-8x256]                     18.1248 (1.29)   18.1269 (1.29)  18.1065 (1.27)  18.1586 (1.32)  18.6109 (1.2)   18.6935 (1.2)   18.4981 (1.21)  19.0574 (1.19)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-8x256]         7.6616 (3.05)    7.5866 (3.07)   7.4721 (3.08)   7.6646 (3.13)   7.4286 (3.01)   7.3157 (3.07)   6.9428 (3.21)   7.5053 (3.01)
test_benchmark_implementations[dynamo_optimizer_cuda_graphs_causal-8x256]  7.3789 (3.17)    7.4421 (3.13)   7.3677 (3.12)   7.6206 (3.15)   7.3713 (3.04)   7.2102 (3.12)   6.8461 (3.26)   7.3796 (3.06)
test_benchmark_implementations[onnx-8x256]                                 22.7218 (1.03)   23.06 (1.01)    22.6567 (1.02)  23.9933 (1.0)   22.2878 (1.0)   22.4101 (1.0)   22.2321 (1.0)   22.5733 (1.0)
test_benchmark_implementations[onnx_optim_fp16-8x256]                      11.3439 (2.06)   11.3283 (2.06)  11.2681 (2.04)  11.3675 (2.11)  11.2462 (1.99)  11.2272 (2.0)   11.0091 (2.03)  11.3319 (2.0)
test_benchmark_implementations[onnx_optim_fp32-8x256]                      23.4025 (1.0)    23.3101 (1.0)   23.0216 (1.0)   23.4107 (1.02)  22.386 (1.0)    22.478 (1.0)    22.3156 (1.0)   22.6084 (1.0)

test/test_torchdynamo_bert.py::test_benchmark_implementations shape=(8, 384)
Name                                                                       Median (CUDA)    Mean (CUDA)     Min (CUDA)      Max (CUDA)      Median          Mean            Min             Max
-------------------------------------------------------------------------  ---------------  --------------  --------------  --------------  --------------  --------------  --------------  --------------
test_benchmark_implementations[baseline-8x384]                             19.202 (1.63)    19.4443 (1.61)  19.1621 (1.64)  19.9311 (1.58)  19.2175 (1.64)  19.1016 (1.63)  18.658 (1.62)   19.2375 (1.64)
test_benchmark_implementations[dynamo-8x384]                               19.9096 (1.58)   19.806 (1.59)   19.3843 (1.62)  19.9168 (1.58)  19.2918 (1.63)  19.078 (1.63)   18.5617 (1.63)  19.3054 (1.64)
test_benchmark_implementations[dynamo_cuda_graphs-8x384]                   18.9245 (1.66)   18.9348 (1.66)  18.9194 (1.66)  18.9768 (1.66)  19.0011 (1.65)  18.7891 (1.65)  18.3988 (1.64)  19.0703 (1.66)
test_benchmark_implementations[dynamo_no_dropout-8x384]                    19.926 (1.57)    19.9264 (1.58)  19.924 (1.57)   19.9291 (1.58)  19.3009 (1.63)  19.1752 (1.62)  18.8989 (1.6)   19.3352 (1.63)
test_benchmark_implementations[dynamo_nvfuser_ofi-8x384]                   15.401 (2.04)    15.4013 (2.04)  15.3979 (2.04)  15.403 (2.04)   15.4667 (2.03)  15.1685 (2.05)  14.4705 (2.09)  15.4702 (2.04)
test_benchmark_implementations[dynamo_optimized-8x384]                     18.26 (1.72)     18.2686 (1.72)  18.1647 (1.73)  18.4166 (1.71)  18.5932 (1.69)  18.7727 (1.65)  18.5629 (1.63)  19.3952 (1.63)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-8x384]         10.5236 (2.98)   10.6009 (2.96)  10.5144 (2.98)  10.7827 (2.92)  10.5006 (2.99)  10.2082 (3.04)  9.7592 (3.09)   10.5046 (3.01)
test_benchmark_implementations[dynamo_optimizer_cuda_graphs_causal-8x384]  10.3496 (3.03)   10.3823 (3.02)  10.326 (3.04)   10.6926 (2.94)  10.3932 (3.02)  10.0772 (3.08)  9.604 (3.14)    10.4152 (3.03)
test_benchmark_implementations[onnx-8x384]                                 31.3651 (1.0)    31.3952 (1.0)   31.3559 (1.0)   31.4644 (1.0)   31.4303 (1.0)   31.0518 (1.0)   30.1471 (1.0)   31.578 (1.0)
test_benchmark_implementations[onnx_optim_fp16-8x384]                      16.2304 (1.93)   16.2316 (1.93)  16.2243 (1.93)  16.2428 (1.94)  16.0021 (1.96)  15.9447 (1.95)  15.2727 (1.98)  16.283 (1.94)
test_benchmark_implementations[onnx_optim_fp32-8x384]                      31.3774 (1.0)    31.3737 (1.0)   31.3631 (1.0)   31.3805 (1.0)   31.0449 (1.01)  30.9306 (1.0)   30.1845 (1.0)   31.5624 (1.0)

test/test_torchdynamo_bert.py::test_benchmark_implementations shape=(8, 512)
Name                                                                       Median (CUDA)    Mean (CUDA)     Min (CUDA)      Max (CUDA)      Median          Mean            Min             Max
-------------------------------------------------------------------------  ---------------  --------------  --------------  --------------  --------------  --------------  --------------  --------------
test_benchmark_implementations[baseline-8x512]                             31.0692 (1.51)   31.3934 (1.5)   30.0759 (1.56)  33.0353 (1.42)  28.848 (1.58)   28.571 (1.63)   27.5792 (1.66)  29.2857 (1.62)
test_benchmark_implementations[dynamo-8x512]                               27.7504 (1.7)    27.7494 (1.7)   27.7463 (1.7)   27.7514 (1.7)   27.8712 (1.64)  27.9122 (1.67)  27.6375 (1.65)  28.2279 (1.68)
test_benchmark_implementations[dynamo_cuda_graphs-8x512]                   27.6029 (1.7)    27.6978 (1.7)   27.5907 (1.71)  27.8999 (1.69)  27.1757 (1.68)  27.1697 (1.71)  26.5793 (1.72)  27.7542 (1.71)
test_benchmark_implementations[dynamo_no_dropout-8x512]                    27.7903 (1.69)   27.7845 (1.69)  27.7565 (1.7)   27.8067 (1.69)  28.0226 (1.63)  27.9281 (1.66)  27.7133 (1.65)  28.0485 (1.69)
test_benchmark_implementations[dynamo_nvfuser_ofi-8x512]                   20.4605 (2.3)    20.4662 (2.3)   20.4575 (2.3)   20.4759 (2.3)   20.1937 (2.26)  20.2667 (2.29)  19.715 (2.32)   20.6086 (2.3)
test_benchmark_implementations[dynamo_optimized-8x512]                     18.1924 (2.59)   18.1948 (2.59)  18.1504 (2.59)  18.2446 (2.58)  18.6487 (2.45)  18.7095 (2.48)  18.5356 (2.47)  19.1187 (2.48)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-8x512]         13.9039 (3.38)   13.9055 (3.38)  13.9028 (3.38)  13.909 (3.38)   13.9589 (3.28)  13.6052 (3.42)  12.8967 (3.54)  14.019 (3.38)
test_benchmark_implementations[dynamo_optimizer_cuda_graphs_causal-8x512]  13.7093 (3.43)   13.6897 (3.44)  13.612 (3.46)   13.7114 (3.43)  13.6507 (3.35)  13.3774 (3.47)  12.6394 (3.62)  13.7645 (3.44)
test_benchmark_implementations[onnx-8x512]                                 46.8101 (1.01)   46.9161 (1.0)   46.8101 (1.01)  47.0221 (1.0)   45.5775 (1.0)   46.4806 (1.0)   45.5775 (1.0)   47.3837 (1.0)
test_benchmark_implementations[onnx_optim_fp16-8x512]                      21.3135 (2.21)   21.314 (2.21)   21.3094 (2.21)  21.3187 (2.21)  20.3842 (2.24)  20.8492 (2.23)  20.3223 (2.25)  21.3459 (2.22)
test_benchmark_implementations[onnx_optim_fp32-8x512]                      47.0578 (1.0)    47.0604 (1.0)   47.0578 (1.0)   47.063 (1.0)    45.7169 (1.0)   46.4234 (1.0)   45.7169 (1.0)   47.1299 (1.01)

gaetansnl

Mostly minor comments, but there are a few things I'm not sure I understand.

implementations/attention_masked_original.py

gaetansnl · 2022-09-23T07:53:00Z

implementations/layer_norm.py

+        return out
+
+
+def layer_norm(x: torch.Tensor, weight: torch.Tensor, bias: torch.Tensor, eps: float, implementation: JITFunction = _layer_norm_fwd_fused_single_pass):


Why doesn't this one work like the other one? The "output" argument is missing.

Maybe it's the opposite: why does attention have an output field? IMO it should be removed; I just forgot.
For attention, we create the output outside the function and pass the tensor in. The kernel is marked to convert all provided tensors to fp16, which in mixed precision includes the output. We should move the allocation inside the function to avoid this unneeded cast.

Layernorm and linear layer don't have this issue because they create the output tensor with the right type from the beginning.

The output is allocated outside because it lets outside code control allocations; not sure if that's still useful.
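To make the casting issue discussed above concrete, here is a minimal CPU-only sketch. It is not the project's code: `cast_tensor_args`, `scale_into` and `scale` are hypothetical names, and the decorator is only a stand-in for whatever marks the kernel to cast every provided tensor to fp16. It shows that a caller-provided fp32 output is copied by the cast, while an output allocated inside the function already has the right dtype.

```python
import functools

import torch


def cast_tensor_args(dtype: torch.dtype):
    """Hypothetical stand-in: cast every tensor argument to `dtype` before the call."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            args = [a.to(dtype) if isinstance(a, torch.Tensor) else a for a in args]
            kwargs = {k: v.to(dtype) if isinstance(v, torch.Tensor) else v
                      for k, v in kwargs.items()}
            return fn(*args, **kwargs)
        return wrapper
    return decorator


@cast_tensor_args(torch.float16)
def scale_into(x: torch.Tensor, output: torch.Tensor) -> torch.Tensor:
    # `output` came from the caller, so it was cast (and therefore copied) too.
    output.copy_(x * 2)
    return output


@cast_tensor_args(torch.float16)
def scale(x: torch.Tensor) -> torch.Tensor:
    # Allocated after the cast: right dtype from the start, no extra copy.
    output = torch.empty_like(x)
    output.copy_(x * 2)
    return output


x = torch.ones(4, dtype=torch.float32)
buf = torch.zeros(4, dtype=torch.float32)
out_external = scale_into(x, buf)   # result lives in an fp16 copy, not in `buf`
out_internal = scale(x)
```

In this sketch the caller's `buf` is never written to, which is one more reason to allocate inside the function unless outside control over allocations is really needed.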

gaetansnl · 2022-09-23T07:53:49Z

implementations/linear_layer.py

+        return outputs
+
+
+def linear_layer(x: torch.Tensor,


Same here: why don't we have "output"?

See the layernorm answer.

gaetansnl · 2022-09-23T07:57:26Z

test/test_attention.py


-@pytest.mark.parametrize("batch", [1, 8, 32, 64])
-@pytest.mark.parametrize("implementation", ["torch", "triton_original", "triton"])
-def test_benchmark(benchmark, batch, implementation):
    torch.manual_seed(0)


Should be moved to the beginning of the function to avoid mistakes, IMO.

Not sure I understand what you're referring to?

torch.manual_seed(0)

Moved into an annotation.
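A minimal sketch of what seeding through an annotation (decorator) could look like. The name `set_seed` is hypothetical and may differ from what the PR actually uses; the standard library `random` module stands in for `torch.manual_seed` so the snippet runs without a GPU or PyTorch.

```python
import functools
import random


def set_seed(seed: int = 0):
    """Hypothetical decorator: reseed the RNG right before the wrapped function runs."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            # The real tests would call torch.manual_seed(seed) here instead.
            random.seed(seed)
            return fn(*args, **kwargs)
        return wrapper
    return decorator


@set_seed(0)
def draw_inputs(n: int) -> list:
    # Stand-in for a test body that creates random input tensors.
    return [random.random() for _ in range(n)]
```

Because the seed is set by the wrapper, it cannot end up after the first random call by mistake, which is the concern raised in the review comment.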

test/test_attention.py

test/test_linear_layer.py

gaetansnl · 2022-09-23T08:20:23Z

test/test_linear_layer.py

@@ -61,7 +53,7 @@ def test_benchmark(benchmark, shape: Shape, bias: bool, activation: str, contigu
    else:
        raise ValueError(f"Unknown activation: {activation}")

-    torch_linear_layer = torch.nn.Linear(K, N, bias=bias, device="cuda", dtype=torch.float16)
+    torch_linear_layer = torch.nn.Linear(K, N, bias=bias, device="cuda", dtype=dtype)
    torch_linear_layer.weight.data = layer_weight

    def torch_linear_activation(x):


Same here: you don't compare to fp32?

Changed, the reference is now fp32.

test/test_layer_norm.py

test/test_torchdynamo_bert.py

gaetansnl · 2022-09-23T08:28:10Z

test/test_torchdynamo_bert.py

-                                   (32, 16), (32, 128), (32, 256),
-                                   ], ids=lambda x: f"{x[0]}x{x[1]}")
+@pytest.mark.parametrize("shape", [(bs, seq_l) for bs in [1, 8, 32] for seq_l in [16, 128, 256, 384, 512]
+                                   if bs * seq_l < 10000], ids=lambda x: f"{x[0]}x{x[1]}")
 @pytest.mark.parametrize("implementation", implementations.keys())
 def test_benchmark_implementations(benchmark, model_reference_fp32, shape: (int, int), implementation: str):
    torch.manual_seed(0)


Why do we need the assert below?

Which assert?

assert implementation in implementations, f"unknown implementation: {implementation}"

It has been removed.
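For reference, the shape list produced by the new parametrization in the diff above can be checked in isolation. The snippet below reuses the same comprehension and id format; only the variable names `batch_sizes`, `seq_lengths`, `shapes` and `ids` are introduced here for illustration.

```python
batch_sizes = [1, 8, 32]
seq_lengths = [16, 128, 256, 384, 512]

# Same filter as in the diff: drop shapes with 10000 tokens or more per batch.
shapes = [(bs, seq_l) for bs in batch_sizes for seq_l in seq_lengths
          if bs * seq_l < 10000]

# Same id format as the lambda passed to pytest.mark.parametrize.
ids = [f"{bs}x{seq_l}" for bs, seq_l in shapes]
```

The filter only removes (32, 384) and (32, 512), so the largest batch size still stops at a sequence length of 256, as in the previous hand-written list.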

pommedeterresautee · 2022-09-25T15:00:43Z

FYI, tests pass

================================================================================= 1896 passed, 88 skipped, 600 warnings in 4114.37s (1:08:34) =================================================================================

gaetansnl · 2022-09-27T08:39:38Z

test/test_attention.py

+        return attention_forward_original(*args, **kwargs)
+
+
+implementations = {


@pommedeterresautee Maybe we could use another config style to remove this from the global scope: https://docs.pytest.org/en/6.2.x/example/parametrize.html#paramexamples It makes things really hard to read IMO, and it will get worse as we add tests.

Not sure it's harder to read (it's local, and the same pattern appears in every test except one that is not yet refactored, batched matmul), but I agree that we want to control the number of implementations to test (probably with at least light and full flags), and doing it through the command line with pytest is certainly the best way. The same is true for the number of shapes/batch sizes to test.
For that reason, that part of the code would be moved out of the global context, but it's not clear to me what it should look like.

I think it should be done in a dedicated PR; are you OK with that? If so, I'll write the issue.
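One possible shape for the command-line control discussed above, as a rough sketch only. The `--full` flag, the `LIGHT`/`FULL` lists and `select_implementations` are hypothetical; the implementation names come from the attention test. The two hooks are standard pytest hooks and would live in a `conftest.py`.

```python
# conftest.py (sketch)
LIGHT = ["torch", "triton"]
FULL = LIGHT + ["triton_original"]


def select_implementations(full: bool) -> list:
    """Return the implementation names to benchmark for the chosen mode."""
    return FULL if full else LIGHT


def pytest_addoption(parser):
    # Hypothetical flag: run every implementation instead of the light subset.
    parser.addoption("--full", action="store_true", default=False,
                     help="run all implementations")


def pytest_generate_tests(metafunc):
    # Parametrize only the tests that ask for an `implementation` argument.
    if "implementation" in metafunc.fixturenames:
        full = metafunc.config.getoption("--full")
        metafunc.parametrize("implementation", select_implementations(full))
```

With something like this, `pytest test/test_attention.py` would run the light subset and `pytest test/test_attention.py --full` everything, without a module-level `implementations` dict in each test file. The same pattern could cover shapes and batch sizes.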

feat: layernorm, add fp32 test + new very simple implementation

4dbca1a

pommedeterresautee added the bug and benchmark labels Sep 16, 2022

pommedeterresautee self-assigned this Sep 16, 2022

pommedeterresautee added 7 commits September 16, 2022 15:43

feat: linear layer, add tests for fp32

3b2ca1a

fix: fix layernorm for e2e tests

ff3a531

fix: all implementation of self attention have no more to allocate th…

494d073

…eir output tensor

fix: have a single reference implementation (causal and not causal)

637a2f1

feat: refactoring attention tests

8e34d5e

feat: add support for fp32 attention

c5a6205

fix: failing tests were passing because of bug

a825b88

pommedeterresautee added 9 commits September 21, 2022 11:05

feat: add bf16 unit test for attention kernel

2a82aed

feat: add custom_fw annotation to support AMP

05a2269

feat: refactor cuda graphs tests in linear layer

3cd4fa8

fix: fix linear layer with new API

bdd7105

fix: small modif on linear layer

17081fc

Merge branch 'main' into feat/triton_fp32

838ad5c

# Conflicts: # optimizer/linear.py # test/test_linear_layer.py

fix: fix linear layer search pattern

5efbf46

feat: make easy to add to triton kernel bw function

b0344f9

feat: add autocast to tests

9d9f15a

pommedeterresautee marked this pull request as ready for review September 22, 2022 13:00

pommedeterresautee requested a review from gaetansnl September 22, 2022 13:00

fix: add dependencies

c86069f

feat: save casted weights before using them

5fdaf66

fix: change in the trick to avoid weights casting on triton kernels

293ee28

gaetansnl requested changes Sep 23, 2022


pommedeterresautee added 2 commits September 23, 2022 12:20

fix: linear layer / layer norm reference is now fp32 + refactoring

ffb6931

fix: follow comments on PR

7511e4a

pommedeterresautee requested a review from gaetansnl September 23, 2022 12:36

pommedeterresautee added 7 commits September 23, 2022 14:42

fix: move reference implementation

e6029d8

fix: remove unneeded assert

0dce772

fix: refactoring

10e482e

fix: remove cuda graphs pool

522d0b9

fix: refactoring

1e7e01d

fix: fix pytorch implementation of unit tests

490556d

feat: random seed refactoring

2d0de58

gaetansnl reviewed Sep 27, 2022


pommedeterresautee requested a review from gaetansnl September 28, 2022 07:35

gaetansnl approved these changes Sep 28, 2022


pommedeterresautee merged commit 42513c2 into main Sep 28, 2022

pommedeterresautee deleted the feat/triton_fp32 branch September 28, 2022 07:50

pommedeterresautee mentioned this pull request Sep 28, 2022

Compare FP16 outputs to PyTorch FP32 outputs #8

Closed

pommedeterresautee linked an issue Sep 28, 2022 that may be closed by this pull request

Compare FP16 outputs to PyTorch FP32 outputs #8

Closed

pommedeterresautee mentioned this pull request Sep 28, 2022

Do not hardcode kernel tensor type #33

Closed

This was linked to issues Sep 28, 2022

Do not hardcode kernel tensor type #33

Closed

check AMP is compatible with torchdynamo #57

Closed

pommedeterresautee mentioned this pull request Sep 28, 2022

check AMP is compatible with torchdynamo #57

Closed
