feat: add support for fp32 kernels #56

Merged (29 commits) on Sep 28, 2022

@pommedeterresautee (Member) commented Sep 16, 2022

  • Add a new, very simple layernorm implementation from xformers. It does not work for very large tensors; its only purpose is to show what our maximum performance can be.
  • Add fp32 tests and support for layernorm.
  • Add bf16 tests and support for the attention kernel.
  • Fix an error in the attention benchmark: every implementation except our optimized attention had to allocate its output tensor, which gave our implementation an unfair advantage.
  • Make it easy to add a backward pass for each Triton kernel.
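
To illustrate the benchmark fix in the list above: a timing comparison is only fair when every implementation pays for its output allocation inside the timed region. The following is a minimal pure-Python sketch of that idea, not the repository's benchmark code; `scale_alloc`, `scale_into`, and `median_time` are hypothetical stand-ins for the real attention implementations and timing harness.

```python
import statistics
import time
from typing import Callable, List


def scale_alloc(x: List[float]) -> List[float]:
    """Stand-in for a reference implementation: allocates its output on every call."""
    return [2.0 * v for v in x]


def scale_into(x: List[float], out: List[float]) -> List[float]:
    """Stand-in for a kernel that writes into a caller-provided output buffer."""
    for i, v in enumerate(x):
        out[i] = 2.0 * v
    return out


def median_time(fn: Callable[[], object], repeat: int = 20) -> float:
    """Median wall-clock time of `fn` over `repeat` calls, in seconds."""
    samples = []
    for _ in range(repeat):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


x = [float(i) for i in range(10_000)]

# Unfair: the buffer is allocated once, outside the timed region,
# so only `scale_alloc` pays for allocation.
preallocated = [0.0] * len(x)
t_unfair = median_time(lambda: scale_into(x, preallocated))

# Fair: both variants allocate their output inside the timed region.
t_fair = median_time(lambda: scale_into(x, [0.0] * len(x)))
t_reference = median_time(lambda: scale_alloc(x))

print(f"reference: {t_reference:.6f}s, fair: {t_fair:.6f}s, unfair: {t_unfair:.6f}s")
```

The absolute numbers from this toy are meaningless; the point is only that allocation has to be either inside the timed region for all candidates or outside it for all of them.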

fix #39
fix #44

Behavior of autocast: see https://h-huang.github.io/tutorials/advanced/dispatcher.html#autocast and https://pytorch.org/docs/stable/amp.html
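
As a rough mental model of the autocast behavior documented at those links: under CUDA autocast, some ops (for example `matmul`, `linear`, `bmm`) run in float16, others (for example `layer_norm`, `softmax`) run in float32, and most remaining ops run in the dtype of their inputs. The toy below models that policy in plain Python. It is not PyTorch code, the op lists are deliberately tiny excerpts, and `autocast_dtype` is a hypothetical name; it only shows why a model under autocast ends up feeding both fp16 and fp32 tensors to kernels.

```python
# Toy model of the CUDA autocast policy; NOT PyTorch code.
# The two sets are small excerpts of the op lists in the PyTorch AMP docs.
FP16_OPS = {"matmul", "linear", "bmm"}
FP32_OPS = {"layer_norm", "softmax"}


def autocast_dtype(op: str, input_dtype: str) -> str:
    """Return the dtype an op would run in under this simplified policy."""
    if op in FP16_OPS:
        return "float16"
    if op in FP32_OPS:
        return "float32"
    return input_dtype  # unlisted ops keep the dtype of their input


# Walk a simplified chain of ops, starting from an fp32 input.
dtype = "float32"
trace = []
for op in ["linear", "softmax", "matmul", "relu", "layer_norm"]:
    dtype = autocast_dtype(op, dtype)
    trace.append((op, dtype))

for op, dt in trace:
    print(f"{op:<10} -> {dt}")
```

In this simplified chain the layernorm step sees fp32 while the matmul-like steps see fp16, which is the situation a kernel replacement has to handle when the model runs under autocast.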

@pommedeterresautee added the bug and benchmark labels Sep 16, 2022
@pommedeterresautee self-assigned this Sep 16, 2022
@pommedeterresautee (Member, Author) commented

Opened an issue with reproduction code: triton-lang/triton#674

@pommedeterresautee (Member, Author) commented

Measurements from this branch (with autocast):

test/test_torchdynamo_bert.py::test_benchmark_implementations shape=(1, 128)
Name                                                                Median (CUDA)    Mean (CUDA)    Min (CUDA)     Max (CUDA)     Median         Mean           Min            Max
------------------------------------------------------------------  ---------------  -------------  -------------  -------------  -------------  -------------  -------------  -------------
test_benchmark_implementations[baseline-1x128]                      7.7855 (1.0)     7.7941 (1.0)   7.636 (1.0)    8.0056 (1.0)   8.1341 (1.0)   8.2391 (1.0)   8.0028 (1.0)   9.1752 (1.0)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-1x128]  2.4136 (3.23)    2.5606 (3.04)  2.4074 (3.17)  3.6762 (2.18)  2.5179 (3.23)  2.6934 (3.06)  2.476 (3.23)   3.5809 (2.56)
test_benchmark_implementations[onnx_optim_fp16-1x128]               2.9246 (2.66)    3.1849 (2.45)  2.8132 (2.71)  4.6073 (1.74)  4.0431 (2.01)  4.0954 (2.01)  3.0022 (2.67)  5.2359 (1.75)

test/test_torchdynamo_bert.py::test_benchmark_implementations shape=(1, 16)
Name                                                               Median (CUDA)    Mean (CUDA)    Min (CUDA)     Max (CUDA)     Median         Mean           Min            Max
-----------------------------------------------------------------  ---------------  -------------  -------------  -------------  -------------  -------------  -------------  -------------
test_benchmark_implementations[baseline-1x16]                      7.9821 (1.0)     7.9925 (1.0)   7.6534 (1.0)   8.4778 (1.0)   8.1546 (1.0)   8.2796 (1.0)   8.0009 (1.0)   8.9113 (1.0)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-1x16]  1.5585 (5.12)    1.5589 (5.13)  1.5555 (4.92)  1.5677 (5.41)  1.6278 (5.01)  1.6386 (5.05)  1.6181 (4.94)  1.7867 (4.99)
test_benchmark_implementations[onnx_optim_fp16-1x16]               4.8333 (1.65)    4.2276 (1.89)  2.8848 (2.65)  5.5951 (1.52)  4.8374 (1.69)  4.879 (1.7)    3.914 (2.04)   5.1452 (1.73)

test/test_torchdynamo_bert.py::test_benchmark_implementations shape=(1, 256)
Name                                                                Median (CUDA)    Mean (CUDA)    Min (CUDA)     Max (CUDA)     Median         Mean           Min            Max
------------------------------------------------------------------  ---------------  -------------  -------------  -------------  -------------  -------------  -------------  -------------
test_benchmark_implementations[baseline-1x256]                      7.7609 (1.0)     7.7855 (1.0)   7.6341 (1.0)   8.0261 (1.0)   8.0527 (1.0)   8.1408 (1.0)   7.8861 (1.0)   8.9015 (1.0)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-1x256]  2.7382 (2.83)    2.7359 (2.85)  2.7197 (2.81)  2.7433 (2.93)  2.7859 (2.89)  2.787 (2.92)   2.7582 (2.86)  2.8652 (3.11)
test_benchmark_implementations[onnx_optim_fp16-1x256]               2.6307 (2.95)    2.6026 (2.99)  2.5129 (3.04)  2.6684 (3.01)  2.6648 (3.02)  2.9824 (2.73)  2.5794 (3.06)  5.7892 (1.54)

test/test_torchdynamo_bert.py::test_benchmark_implementations shape=(1, 384)
Name                                                                Median (CUDA)    Mean (CUDA)    Min (CUDA)     Max (CUDA)     Median         Mean          Min            Max
------------------------------------------------------------------  ---------------  -------------  -------------  -------------  -------------  ------------  -------------  -------------
test_benchmark_implementations[baseline-1x384]                      8.2473 (1.0)     8.2564 (1.0)   7.716 (1.0)    9.172 (1.0)    8.9529 (1.0)   9.9168 (1.0)  8.3484 (1.0)   12.8779 (1.0)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-1x384]  2.9901 (2.76)    2.9874 (2.76)  2.9696 (2.6)   2.9972 (3.06)  3.0268 (2.96)  3.0024 (3.3)  2.9393 (2.84)  3.0772 (4.18)
test_benchmark_implementations[onnx_optim_fp16-1x384]               2.9051 (2.84)    3.0415 (2.71)  2.8641 (2.69)  4.8742 (1.88)  6.2065 (1.44)  6.1846 (1.6)  5.9793 (1.4)   6.3811 (2.02)

test/test_torchdynamo_bert.py::test_benchmark_implementations shape=(1, 512)
Name                                                                Median (CUDA)    Mean (CUDA)    Min (CUDA)     Max (CUDA)     Median         Mean           Min            Max
------------------------------------------------------------------  ---------------  -------------  -------------  -------------  -------------  -------------  -------------  -------------
test_benchmark_implementations[baseline-1x512]                      7.5213 (1.0)     7.5546 (1.0)   7.4322 (1.0)   7.7479 (1.0)   7.9151 (1.0)   7.9856 (1.0)   7.8031 (1.0)   8.7109 (1.0)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-1x512]  3.7673 (2.0)     3.7712 (2.0)   3.7304 (1.99)  3.7888 (2.04)  3.8216 (2.07)  3.8165 (2.09)  3.7454 (2.08)  4.1214 (2.11)
test_benchmark_implementations[onnx_optim_fp16-1x512]               3.9672 (1.9)     3.9957 (1.89)  3.9396 (1.89)  4.5466 (1.7)   4.0234 (1.97)  4.0321 (1.98)  3.9481 (1.98)  4.3094 (2.02)

test/test_torchdynamo_bert.py::test_benchmark_implementations shape=(32, 128)
Name                                                                 Median (CUDA)    Mean (CUDA)     Min (CUDA)      Max (CUDA)      Median          Mean            Min             Max
-------------------------------------------------------------------  ---------------  --------------  --------------  --------------  --------------  --------------  --------------  --------------
test_benchmark_implementations[baseline-32x128]                      22.7932 (1.0)    23.3685 (1.0)   22.7256 (1.0)   24.1029 (1.0)   21.2464 (1.0)   21.8454 (1.0)   21.0746 (1.0)   22.7786 (1.0)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-32x128]  15.1859 (1.5)    15.2549 (1.53)  14.6248 (1.55)  15.9386 (1.51)  14.7982 (1.44)  14.7312 (1.48)  13.8437 (1.52)  15.4613 (1.47)
test_benchmark_implementations[onnx_optim_fp16-32x128]               17.8954 (1.27)   17.8991 (1.31)  17.8913 (1.27)  17.9098 (1.35)  18.0295 (1.18)  17.7181 (1.23)  17.1224 (1.23)  18.0452 (1.26)

test/test_torchdynamo_bert.py::test_benchmark_implementations shape=(32, 16)
Name                                                                Median (CUDA)    Mean (CUDA)    Min (CUDA)     Max (CUDA)     Median         Mean           Min            Max
------------------------------------------------------------------  ---------------  -------------  -------------  -------------  -------------  -------------  -------------  -------------
test_benchmark_implementations[baseline-32x16]                      7.9964 (1.0)     8.1224 (1.0)   7.7518 (1.0)   9.461 (1.0)    8.228 (1.0)    8.275 (1.0)    8.1017 (1.0)   8.8999 (1.0)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-32x16]  3.4478 (2.32)    3.4474 (2.36)  3.4284 (2.26)  3.4529 (2.74)  3.4989 (2.35)  3.4876 (2.37)  3.441 (2.35)   3.5331 (2.52)
test_benchmark_implementations[onnx_optim_fp16-32x16]               5.9208 (1.35)    5.1951 (1.56)  3.582 (2.16)   6.826 (1.39)   3.6319 (2.27)  3.9679 (2.09)  3.2939 (2.46)  5.4588 (1.63)

test/test_torchdynamo_bert.py::test_benchmark_implementations shape=(32, 256)
Name                                                                 Median (CUDA)    Mean (CUDA)     Min (CUDA)      Max (CUDA)      Median          Mean            Min             Max
-------------------------------------------------------------------  ---------------  --------------  --------------  --------------  --------------  --------------  --------------  -------------
test_benchmark_implementations[baseline-32x256]                      46.1609 (1.0)    46.5961 (1.0)   46.1609 (1.0)   47.0313 (1.0)   46.4217 (1.0)   46.668 (1.0)    46.4217 (1.0)   46.9143 (1.0)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-32x256]  27.5026 (1.68)   27.5053 (1.69)  27.4964 (1.68)  27.5169 (1.71)  27.1399 (1.71)  26.7257 (1.75)  25.3938 (1.83)  27.6435 (1.7)
test_benchmark_implementations[onnx_optim_fp16-32x256]               36.7258 (1.26)   36.7365 (1.27)  36.7258 (1.26)  36.7473 (1.28)  34.6321 (1.34)  36.8357 (1.27)  34.6321 (1.34)  39.0393 (1.2)

test/test_torchdynamo_bert.py::test_benchmark_implementations shape=(8, 128)
Name                                                                Median (CUDA)    Mean (CUDA)    Min (CUDA)     Max (CUDA)     Median         Mean           Min            Max
------------------------------------------------------------------  ---------------  -------------  -------------  -------------  -------------  -------------  -------------  -------------
test_benchmark_implementations[baseline-8x128]                      8.0763 (1.0)     8.1834 (1.0)   7.5067 (1.0)   8.8566 (1.0)   8.6779 (1.0)   8.6912 (1.0)   8.4971 (1.0)   8.9335 (1.0)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-8x128]  5.4733 (1.48)    5.5011 (1.49)  5.4692 (1.37)  5.7477 (1.54)  5.4956 (1.58)  5.4438 (1.6)   5.2913 (1.61)  5.5435 (1.61)
test_benchmark_implementations[onnx_optim_fp16-8x128]               6.4922 (1.24)    6.5071 (1.26)  6.4788 (1.16)  6.5782 (1.35)  6.0271 (1.44)  5.9941 (1.45)  5.8692 (1.45)  6.1643 (1.45)

test/test_torchdynamo_bert.py::test_benchmark_implementations shape=(8, 16)
Name                                                               Median (CUDA)    Mean (CUDA)    Min (CUDA)     Max (CUDA)     Median         Mean           Min            Max
-----------------------------------------------------------------  ---------------  -------------  -------------  -------------  -------------  -------------  -------------  -------------
test_benchmark_implementations[baseline-8x16]                      8.3579 (1.0)     8.4 (1.0)      8.181 (1.0)    8.8279 (1.0)   8.7168 (1.0)   8.7968 (1.0)   8.5196 (1.0)   9.726 (1.0)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-8x16]  2.3337 (3.58)    2.3342 (3.6)   2.3316 (3.51)  2.3429 (3.77)  2.3919 (3.64)  2.4002 (3.66)  2.3867 (3.57)  2.609 (3.73)
test_benchmark_implementations[onnx_optim_fp16-8x16]               2.9184 (2.86)    2.9682 (2.83)  2.8436 (2.88)  3.5615 (2.48)  3.9428 (2.21)  3.7615 (2.34)  2.8695 (2.97)  4.9127 (1.98)

test/test_torchdynamo_bert.py::test_benchmark_implementations shape=(8, 256)
Name                                                                Median (CUDA)    Mean (CUDA)     Min (CUDA)      Max (CUDA)      Median          Mean            Min             Max
------------------------------------------------------------------  ---------------  --------------  --------------  --------------  --------------  --------------  --------------  --------------
test_benchmark_implementations[baseline-8x256]                      15.3569 (1.0)    15.3636 (1.0)   15.2637 (1.0)   15.4481 (1.0)   14.1396 (1.15)  14.4888 (1.06)  13.8929 (1.0)   15.6007 (1.07)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-8x256]  8.6794 (1.77)    8.6787 (1.77)   8.6723 (1.76)   8.6866 (1.78)   8.7026 (1.86)   8.9109 (1.73)   8.378 (1.66)    9.8552 (1.69)
test_benchmark_implementations[onnx_optim_fp16-8x256]               12.8174 (1.2)    12.8764 (1.19)  12.1201 (1.26)  14.6924 (1.05)  16.2258 (1.0)   15.4099 (1.0)   12.5603 (1.11)  16.6717 (1.0)

test/test_torchdynamo_bert.py::test_benchmark_implementations shape=(8, 384)
Name                                                                Median (CUDA)    Mean (CUDA)     Min (CUDA)      Max (CUDA)      Median          Mean            Min             Max
------------------------------------------------------------------  ---------------  --------------  --------------  --------------  --------------  --------------  --------------  --------------
test_benchmark_implementations[baseline-8x384]                      21.3668 (1.28)   21.9298 (1.25)  21.1978 (1.27)  23.0248 (1.22)  20.9425 (1.0)   21.2453 (1.0)   20.7567 (1.0)   21.959 (1.0)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-8x384]  11.7187 (2.33)   11.7083 (2.33)  11.6275 (2.31)  11.7248 (2.39)  11.5307 (1.82)  11.4106 (1.86)  10.9675 (1.89)  11.6932 (1.88)
test_benchmark_implementations[onnx_optim_fp16-8x384]               27.2484 (1.0)    27.3092 (1.0)   26.9005 (1.0)   28.0269 (1.0)   16.3585 (1.28)  16.555 (1.28)   15.9407 (1.3)   17.1526 (1.28)

test/test_torchdynamo_bert.py::test_benchmark_implementations shape=(8, 512)
Name                                                                Median (CUDA)    Mean (CUDA)     Min (CUDA)      Max (CUDA)      Median          Mean            Min             Max
------------------------------------------------------------------  ---------------  --------------  --------------  --------------  --------------  --------------  --------------  --------------
test_benchmark_implementations[baseline-8x512]                      29.8353 (1.0)    30.6869 (1.0)   29.5884 (1.0)   32.6369 (1.0)   27.9553 (1.0)   27.9247 (1.0)   27.7025 (1.0)   28.1161 (1.0)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-8x512]  15.4061 (1.94)   15.5945 (1.97)  15.3999 (1.92)  16.2048 (2.01)  15.4085 (1.81)  15.0448 (1.86)  14.1574 (1.96)  15.5258 (1.81)
test_benchmark_implementations[onnx_optim_fp16-8x512]               21.4405 (1.39)   21.4515 (1.43)  21.4303 (1.38)  21.4884 (1.52)  20.7453 (1.35)  21.3414 (1.31)  20.4224 (1.36)  22.1332 (1.27)
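
A note on reading the tables above: the number in parentheses appears to be the slowest value in the column divided by the row's value, so `(1.0)` marks the slowest entry, which is not always the baseline. This is inferred from the figures themselves, not stated anywhere in this thread. A minimal sketch that checks the inference against the shape=(8, 384) table; `relative_ratios` is a hypothetical helper name.

```python
from typing import Dict


def relative_ratios(timings: Dict[str, float]) -> Dict[str, float]:
    """Ratio of the slowest timing in a column to each entry, rounded to 2 decimals."""
    slowest = max(timings.values())
    return {name: round(slowest / value, 2) for name, value in timings.items()}


# Median (CUDA) column for shape=(8, 384), copied from the table above.
median_cuda_8x384 = {
    "baseline": 21.3668,
    "dynamo_optimized_cuda_graphs": 11.7187,
    "onnx_optim_fp16": 27.2484,
}

ratios_8x384 = relative_ratios(median_cuda_8x384)
print(ratios_8x384)
# {'baseline': 1.28, 'dynamo_optimized_cuda_graphs': 2.33, 'onnx_optim_fp16': 1.0}
```

These match the parenthesized values printed in that table (1.28, 2.33, 1.0), where `onnx_optim_fp16` rather than the baseline is the slowest entry in the CUDA columns.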

@pommedeterresautee (Member, Author) commented

Measurements on main (no autocast, full fp16):

test/test_torchdynamo_bert.py .......................................                                                                                                                                                   [100%]
test/test_torchdynamo_bert.py::test_benchmark_implementations shape=(1, 128)
Name                                                                Median (CUDA)    Mean (CUDA)    Min (CUDA)     Max (CUDA)     Median         Mean           Min            Max
------------------------------------------------------------------  ---------------  -------------  -------------  -------------  -------------  -------------  -------------  -------------
test_benchmark_implementations[baseline-1x128]                      7.2182 (1.0)     7.3326 (1.0)   6.7269 (1.0)   8.8044 (1.0)   6.9617 (1.0)   6.9978 (1.0)   6.7992 (1.0)   7.6291 (1.0)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-1x128]  1.3701 (5.27)    1.382 (5.31)   1.3681 (4.92)  1.7418 (5.05)  1.4162 (4.92)  1.4189 (4.93)  1.4118 (4.82)  1.5048 (5.07)
test_benchmark_implementations[onnx_optim_fp16-1x128]               2.826 (2.55)     2.908 (2.52)   2.7239 (2.47)  4.4063 (2.0)   2.8097 (2.48)  2.8331 (2.47)  2.7622 (2.46)  3.1994 (2.38)

test/test_torchdynamo_bert.py::test_benchmark_implementations shape=(1, 16)
Name                                                               Median (CUDA)    Mean (CUDA)    Min (CUDA)      Max (CUDA)     Median          Mean            Min             Max
-----------------------------------------------------------------  ---------------  -------------  --------------  -------------  --------------  --------------  --------------  -------------
test_benchmark_implementations[baseline-1x16]                      6.5792 (1.0)     6.6026 (1.0)   6.4717 (1.0)    6.7318 (1.0)   6.7322 (1.0)    6.7755 (1.0)    6.5761 (1.0)    7.4069 (1.0)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-1x16]  0.5642 (11.66)   0.5641 (11.7)  0.5612 (11.53)  0.5755 (11.7)  0.6223 (10.82)  0.6291 (10.77)  0.6177 (10.65)  0.8654 (8.56)
test_benchmark_implementations[onnx_optim_fp16-1x16]               2.7372 (2.4)     2.7406 (2.41)  2.6716 (2.42)   2.8826 (2.34)  2.7818 (2.42)   2.8279 (2.4)    2.7385 (2.4)    3.2174 (2.3)

test/test_torchdynamo_bert.py::test_benchmark_implementations shape=(1, 256)
Name                                                                Median (CUDA)    Mean (CUDA)    Min (CUDA)     Max (CUDA)     Median         Mean           Min            Max
------------------------------------------------------------------  ---------------  -------------  -------------  -------------  -------------  -------------  -------------  ------------
test_benchmark_implementations[baseline-1x256]                      6.7359 (1.0)     6.7653 (1.0)   5.9217 (1.0)   7.5438 (1.0)   7.1434 (1.0)   7.3269 (1.0)   6.8743 (1.0)   8.4516 (1.0)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-1x256]  1.6701 (4.03)    1.6921 (4.0)   1.6558 (3.58)  2.0818 (3.62)  1.7132 (4.17)  1.7223 (4.25)  1.6971 (4.05)  1.961 (4.31)
test_benchmark_implementations[onnx_optim_fp16-1x256]               2.81 (2.4)       2.7943 (2.42)  2.5736 (2.3)   2.8609 (2.64)  2.5794 (2.77)  2.6425 (2.77)  2.5477 (2.7)   4.463 (1.89)

test/test_torchdynamo_bert.py::test_benchmark_implementations shape=(1, 384)
Name                                                                Median (CUDA)    Mean (CUDA)    Min (CUDA)     Max (CUDA)     Median         Mean           Min            Max
------------------------------------------------------------------  ---------------  -------------  -------------  -------------  -------------  -------------  -------------  -------------
test_benchmark_implementations[baseline-1x384]                      6.5323 (1.0)     6.5908 (1.0)   6.4246 (1.0)   6.9028 (1.0)   7.9615 (1.0)   8.1049 (1.0)   7.4326 (1.0)   9.396 (1.0)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-1x384]  1.9436 (3.36)    1.9413 (3.4)   1.8237 (3.52)  1.9487 (3.54)  1.9532 (4.08)  1.9862 (4.08)  1.8874 (3.94)  2.363 (3.98)
test_benchmark_implementations[onnx_optim_fp16-1x384]               3.1508 (2.07)    3.1709 (2.08)  3.114 (2.06)   3.4931 (1.98)  2.9463 (2.7)   3.0148 (2.69)  2.8853 (2.58)  3.6583 (2.57)

test/test_torchdynamo_bert.py::test_benchmark_implementations shape=(1, 512)
Name                                                                Median (CUDA)    Mean (CUDA)    Min (CUDA)     Max (CUDA)     Median         Mean           Min            Max
------------------------------------------------------------------  ---------------  -------------  -------------  -------------  -------------  -------------  -------------  -------------
test_benchmark_implementations[baseline-1x512]                      6.3652 (1.0)     6.3897 (1.0)   6.2158 (1.0)   6.5885 (1.0)   6.6024 (1.0)   6.6681 (1.0)   6.4829 (1.0)   7.6228 (1.0)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-1x512]  2.687 (2.37)     2.6843 (2.38)  2.6655 (2.33)  2.6911 (2.45)  2.7342 (2.41)  2.8856 (2.31)  2.6962 (2.4)   3.4223 (2.23)
test_benchmark_implementations[onnx_optim_fp16-1x512]               4.2547 (1.5)     4.2604 (1.5)   4.2476 (1.46)  4.3039 (1.53)  4.1765 (1.58)  4.1939 (1.59)  4.0118 (1.62)  4.5923 (1.66)

test/test_torchdynamo_bert.py::test_benchmark_implementations shape=(32, 128)
Name                                                                 Median (CUDA)    Mean (CUDA)     Min (CUDA)      Max (CUDA)      Median          Mean            Min             Max
-------------------------------------------------------------------  ---------------  --------------  --------------  --------------  --------------  --------------  --------------  --------------
test_benchmark_implementations[baseline-32x128]                      17.9098 (1.0)    17.6454 (1.01)  16.7045 (1.07)  18.7661 (1.0)   16.9879 (1.06)  17.3241 (1.03)  16.7234 (1.02)  18.6919 (1.0)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-32x128]  13.0755 (1.37)   13.075 (1.37)   13.0714 (1.37)  13.0806 (1.43)  13.1283 (1.37)  12.7596 (1.39)  12.0202 (1.42)  13.1405 (1.42)
test_benchmark_implementations[onnx_optim_fp16-32x128]               17.8982 (1.0)    17.8974 (1.0)   17.8913 (1.0)   17.9005 (1.05)  17.9321 (1.0)   17.7588 (1.0)   17.0673 (1.0)   18.5692 (1.01)

test/test_torchdynamo_bert.py::test_benchmark_implementations shape=(32, 16)
Name                                                                Median (CUDA)    Mean (CUDA)    Min (CUDA)     Max (CUDA)     Median         Mean           Min            Max
------------------------------------------------------------------  ---------------  -------------  -------------  -------------  -------------  -------------  -------------  -------------
test_benchmark_implementations[baseline-32x16]                      6.7881 (1.0)     6.8592 (1.0)   6.6048 (1.0)   8.0138 (1.0)   7.0459 (1.0)   7.1357 (1.0)   6.9431 (1.0)   7.8445 (1.0)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-32x16]  2.3316 (2.91)    2.4249 (2.83)  2.3122 (2.86)  2.902 (2.76)   2.3996 (2.94)  2.3778 (3.0)   2.3098 (3.01)  2.4187 (3.24)
test_benchmark_implementations[onnx_optim_fp16-32x16]               3.4058 (1.99)    3.4673 (1.98)  3.3732 (1.96)  3.6168 (2.22)  3.4226 (2.06)  3.4083 (2.09)  3.3127 (2.1)   3.74 (2.1)

test/test_torchdynamo_bert.py::test_benchmark_implementations shape=(32, 256)
Name                                                                 Median (CUDA)    Mean (CUDA)     Min (CUDA)      Max (CUDA)      Median          Mean            Min             Max
-------------------------------------------------------------------  ---------------  --------------  --------------  --------------  --------------  --------------  --------------  --------------
test_benchmark_implementations[baseline-32x256]                      36.0653 (1.02)   36.8824 (1.02)  36.0653 (1.02)  37.6996 (1.01)  36.7123 (1.06)  37.0188 (1.07)  36.7123 (1.06)  37.3252 (1.07)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-32x256]  26.5318 (1.38)   26.5209 (1.41)  26.498 (1.39)   26.5329 (1.44)  26.3806 (1.47)  25.7391 (1.53)  24.2316 (1.6)   26.6052 (1.51)
test_benchmark_implementations[onnx_optim_fp16-32x256]               36.7207 (1.0)    37.4692 (1.0)   36.7207 (1.0)   38.2177 (1.0)   38.8794 (1.0)   39.4746 (1.0)   38.8794 (1.0)   40.0698 (1.0)

test/test_torchdynamo_bert.py::test_benchmark_implementations shape=(8, 128)
Name                                                                Median (CUDA)    Mean (CUDA)    Min (CUDA)     Max (CUDA)     Median         Mean           Min            Max
------------------------------------------------------------------  ---------------  -------------  -------------  -------------  -------------  -------------  -------------  -------------
test_benchmark_implementations[baseline-8x128]                      6.6089 (1.01)    6.6722 (1.04)  6.569 (1.0)    7.0185 (1.12)  6.9639 (1.0)   7.0844 (1.0)   6.9434 (1.0)   7.7573 (1.0)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-8x128]  4.3991 (1.52)    4.3848 (1.57)  4.3295 (1.52)  4.4032 (1.78)  4.4054 (1.58)  4.3527 (1.63)  4.2177 (1.65)  4.4202 (1.75)
test_benchmark_implementations[onnx_optim_fp16-8x128]               6.6877 (1.0)     6.9057 (1.0)   6.057 (1.08)   7.8469 (1.0)   6.0324 (1.15)  6.2334 (1.14)  5.8981 (1.18)  6.9805 (1.11)

test/test_torchdynamo_bert.py::test_benchmark_implementations shape=(8, 16)
Name                                                               Median (CUDA)    Mean (CUDA)    Min (CUDA)     Max (CUDA)     Median         Mean           Min            Max
-----------------------------------------------------------------  ---------------  -------------  -------------  -------------  -------------  -------------  -------------  -------------
test_benchmark_implementations[baseline-8x16]                      6.908 (1.0)      6.9054 (1.0)   6.7901 (1.0)   7.1734 (1.0)   7.2689 (1.0)   7.305 (1.0)    7.1133 (1.0)   7.9249 (1.0)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-8x16]  1.2954 (5.33)    1.2953 (5.33)  1.2933 (5.25)  1.2984 (5.52)  1.3495 (5.39)  1.3526 (5.4)   1.3437 (5.29)  1.4379 (5.51)
test_benchmark_implementations[onnx_optim_fp16-8x16]               2.7824 (2.48)    2.8075 (2.46)  2.7506 (2.47)  2.9266 (2.45)  2.883 (2.52)   2.9713 (2.46)  2.8106 (2.53)  3.904 (2.03)

test/test_torchdynamo_bert.py::test_benchmark_implementations shape=(8, 256)
Name                                                                Median (CUDA)    Mean (CUDA)    Min (CUDA)      Max (CUDA)      Median          Mean            Min             Max
------------------------------------------------------------------  ---------------  -------------  --------------  --------------  --------------  --------------  --------------  -------------
test_benchmark_implementations[baseline-8x256]                      12.2911 (1.0)    12.1719 (1.0)  11.6808 (1.03)  12.6466 (1.0)   12.0937 (1.0)   12.072 (1.0)    11.3032 (1.0)   12.8891 (1.0)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-8x256]  7.594 (1.62)     7.5937 (1.6)   7.5899 (1.59)   7.5971 (1.66)   7.564 (1.6)     7.4167 (1.63)   6.9834 (1.62)   7.5714 (1.7)
test_benchmark_implementations[onnx_optim_fp16-8x256]               12.1149 (1.01)   12.1114 (1.0)  12.0852 (1.0)   12.1385 (1.04)  11.3902 (1.06)  11.3346 (1.07)  11.0259 (1.03)  11.522 (1.12)

test/test_torchdynamo_bert.py::test_benchmark_implementations shape=(8, 384)
Name                                                                Median (CUDA)    Mean (CUDA)     Min (CUDA)      Max (CUDA)      Median          Mean            Min             Max
------------------------------------------------------------------  ---------------  --------------  --------------  --------------  --------------  --------------  --------------  --------------
test_benchmark_implementations[baseline-8x384]                      16.2662 (1.01)   16.7644 (1.0)   15.8915 (1.03)  17.8504 (1.0)   15.4141 (1.04)  15.5253 (1.02)  14.7454 (1.04)  16.6969 (1.0)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-8x384]  10.4929 (1.56)   10.5259 (1.59)  10.4684 (1.57)  10.7295 (1.66)  10.5498 (1.52)  10.3925 (1.53)  9.8169 (1.56)   10.9313 (1.53)
test_benchmark_implementations[onnx_optim_fp16-8x384]               16.4198 (1.0)    16.419 (1.02)   16.3963 (1.0)   16.4342 (1.09)  16.0134 (1.0)   15.8778 (1.0)   15.3047 (1.0)   16.3081 (1.02)

test/test_torchdynamo_bert.py::test_benchmark_implementations shape=(8, 512)
Name                                                                Median (CUDA)    Mean (CUDA)     Min (CUDA)      Max (CUDA)      Median          Mean            Min             Max
------------------------------------------------------------------  ---------------  --------------  --------------  --------------  --------------  --------------  --------------  --------------
test_benchmark_implementations[baseline-8x512]                      23.6012 (1.0)    24.1892 (1.0)   23.296 (1.0)    24.9395 (1.0)   21.9757 (1.0)   22.3595 (1.0)   21.4901 (1.0)   23.266 (1.0)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-8x512]  14.3524 (1.64)   14.3414 (1.69)  14.3124 (1.63)  14.3667 (1.74)  14.247 (1.54)   13.8561 (1.61)  12.9897 (1.65)  14.2653 (1.63)
test_benchmark_implementations[onnx_optim_fp16-8x512]               21.3473 (1.11)   21.3571 (1.13)  21.3453 (1.09)  21.3862 (1.17)  20.5277 (1.07)  20.9151 (1.07)  20.3955 (1.05)  21.3705 (1.09)
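
Putting the two runs above side by side gives a rough idea of what autocast costs relative to full fp16. The sketch below uses the Median (CUDA) values of the `dynamo_optimized_cuda_graphs` rows, copied from the tables for four shapes; `slowdown` is a hypothetical helper name, and the timings are in whatever unit the benchmark reports (not stated in this thread).

```python
from typing import Dict

# Median (CUDA) of dynamo_optimized_cuda_graphs, copied from the tables above.
AUTOCAST_BRANCH = {"1x16": 1.5585, "1x128": 2.4136, "8x128": 5.4733, "32x256": 27.5026}
MAIN_FULL_FP16 = {"1x16": 0.5642, "1x128": 1.3701, "8x128": 4.3991, "32x256": 26.5318}


def slowdown(autocast: Dict[str, float], fp16: Dict[str, float]) -> Dict[str, float]:
    """Per-shape ratio autocast / fp16, rounded to 2 decimals (values > 1 mean autocast is slower)."""
    return {shape: round(autocast[shape] / fp16[shape], 2) for shape in autocast}


slowdowns = slowdown(AUTOCAST_BRANCH, MAIN_FULL_FP16)
for shape, ratio in slowdowns.items():
    print(f"{shape:>7}: autocast is {ratio}x the full-fp16 time")
```

For these four shapes the ratio goes from 2.76x at 1x16 down to 1.04x at 32x256. That pattern would be consistent with a roughly fixed per-call casting cost that matters less as the input grows, although the tables alone do not establish the cause.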


@pommedeterresautee pommedeterresautee marked this pull request as ready for review September 22, 2022 13:00
@pommedeterresautee (Member, Author) commented

For the record, here is the model in full fp16 on the autocast branch (so autocast is never called):

test/test_torchdynamo_bert.py::test_benchmark_implementations shape=(1, 128)
Name                                                                Median (CUDA)    Mean (CUDA)    Min (CUDA)    Max (CUDA)    Median        Mean          Min           Max
------------------------------------------------------------------  ---------------  -------------  ------------  ------------  ------------  ------------  ------------  ------------
test_benchmark_implementations[dynamo_optimized_cuda_graphs-1x128]  1.3742 (1.0)     1.4852 (1.0)   1.366 (1.0)   1.9476 (1.0)  1.4341 (1.0)  1.4451 (1.0)  1.4202 (1.0)  1.6538 (1.0)

test/test_torchdynamo_bert.py::test_benchmark_implementations shape=(1, 16)
Name                                                               Median (CUDA)    Mean (CUDA)    Min (CUDA)    Max (CUDA)    Median        Mean          Min          Max
-----------------------------------------------------------------  ---------------  -------------  ------------  ------------  ------------  ------------  -----------  ------------
test_benchmark_implementations[dynamo_optimized_cuda_graphs-1x16]  0.5642 (1.0)     0.5713 (1.0)   0.5622 (1.0)  1.7848 (1.0)  0.6189 (1.0)  0.6233 (1.0)  0.616 (1.0)  0.8692 (1.0)

test/test_torchdynamo_bert.py::test_benchmark_implementations shape=(1, 256)
Name                                                                Median (CUDA)    Mean (CUDA)    Min (CUDA)    Max (CUDA)    Median        Mean          Min           Max
------------------------------------------------------------------  ---------------  -------------  ------------  ------------  ------------  ------------  ------------  ------------
test_benchmark_implementations[dynamo_optimized_cuda_graphs-1x256]  1.6394 (1.0)     1.7038 (1.0)   1.6364 (1.0)  2.2067 (1.0)  1.6927 (1.0)  1.7095 (1.0)  1.6795 (1.0)  2.0539 (1.0)

test/test_torchdynamo_bert.py::test_benchmark_implementations shape=(1, 384)
Name                                                                Median (CUDA)    Mean (CUDA)    Min (CUDA)    Max (CUDA)    Median        Mean          Min           Max
------------------------------------------------------------------  ---------------  -------------  ------------  ------------  ------------  ------------  ------------  ------------
test_benchmark_implementations[dynamo_optimized_cuda_graphs-1x384]  1.8924 (1.0)     2.0345 (1.0)   1.8883 (1.0)  2.5303 (1.0)  1.9386 (1.0)  1.9683 (1.0)  1.8656 (1.0)  2.3711 (1.0)

test/test_torchdynamo_bert.py::test_benchmark_implementations shape=(1, 512)
Name                                                                Median (CUDA)    Mean (CUDA)    Min (CUDA)    Max (CUDA)    Median        Mean          Min           Max
------------------------------------------------------------------  ---------------  -------------  ------------  ------------  ------------  ------------  ------------  ------------
test_benchmark_implementations[dynamo_optimized_cuda_graphs-1x512]  2.6624 (1.0)     2.7399 (1.0)   2.6419 (1.0)  3.4785 (1.0)  2.7101 (1.0)  2.8139 (1.0)  2.6397 (1.0)  3.2979 (1.0)

test/test_torchdynamo_bert.py::test_benchmark_implementations shape=(32, 128)
Name                                                                 Median (CUDA)    Mean (CUDA)    Min (CUDA)     Max (CUDA)     Median         Mean           Min            Max
-------------------------------------------------------------------  ---------------  -------------  -------------  -------------  -------------  -------------  -------------  ------------
test_benchmark_implementations[dynamo_optimized_cuda_graphs-32x128]  13.097 (1.0)     13.097 (1.0)   13.0888 (1.0)  13.1052 (1.0)  13.0454 (1.0)  12.7565 (1.0)  12.0125 (1.0)  13.058 (1.0)

test/test_torchdynamo_bert.py::test_benchmark_implementations shape=(32, 16)
Name                                                                Median (CUDA)    Mean (CUDA)    Min (CUDA)    Max (CUDA)    Median        Mean          Min           Max
------------------------------------------------------------------  ---------------  -------------  ------------  ------------  ------------  ------------  ------------  ------------
test_benchmark_implementations[dynamo_optimized_cuda_graphs-32x16]  2.3491 (1.0)     2.3462 (1.0)   2.3296 (1.0)  2.3532 (1.0)  2.3852 (1.0)  2.3689 (1.0)  2.3099 (1.0)  2.4138 (1.0)

test/test_torchdynamo_bert.py::test_benchmark_implementations shape=(32, 256)
Name                                                                 Median (CUDA)    Mean (CUDA)    Min (CUDA)     Max (CUDA)    Median         Mean           Min            Max
-------------------------------------------------------------------  ---------------  -------------  -------------  ------------  -------------  -------------  -------------  -------------
test_benchmark_implementations[dynamo_optimized_cuda_graphs-32x256]  26.0209 (1.0)    26.0198 (1.0)  26.0137 (1.0)  26.025 (1.0)  25.0071 (1.0)  25.0396 (1.0)  24.0307 (1.0)  26.0812 (1.0)

test/test_torchdynamo_bert.py::test_benchmark_implementations shape=(8, 128)
Name                                                                Median (CUDA)    Mean (CUDA)    Min (CUDA)    Max (CUDA)    Median        Mean          Min           Max
------------------------------------------------------------------  ---------------  -------------  ------------  ------------  ------------  ------------  ------------  ------------
test_benchmark_implementations[dynamo_optimized_cuda_graphs-8x128]  4.3622 (1.0)     4.3721 (1.0)   4.3356 (1.0)  4.606 (1.0)   4.3772 (1.0)  4.3444 (1.0)  4.2214 (1.0)  4.3898 (1.0)

test/test_torchdynamo_bert.py::test_benchmark_implementations shape=(8, 16)
Name                                                               Median (CUDA)    Mean (CUDA)    Min (CUDA)    Max (CUDA)    Median        Mean          Min           Max
-----------------------------------------------------------------  ---------------  -------------  ------------  ------------  ------------  ------------  ------------  ------------
test_benchmark_implementations[dynamo_optimized_cuda_graphs-8x16]  1.2974 (1.0)     1.2972 (1.0)   1.2954 (1.0)  1.2995 (1.0)  1.3512 (1.0)  1.3535 (1.0)  1.3476 (1.0)  1.4402 (1.0)

test/test_torchdynamo_bert.py::test_benchmark_implementations shape=(8, 256)
Name                                                                Median (CUDA)    Mean (CUDA)    Min (CUDA)    Max (CUDA)    Median        Mean          Min          Max
------------------------------------------------------------------  ---------------  -------------  ------------  ------------  ------------  ------------  -----------  ------------
test_benchmark_implementations[dynamo_optimized_cuda_graphs-8x256]  7.5663 (1.0)     7.5664 (1.0)   7.5643 (1.0)  7.5704 (1.0)  7.5593 (1.0)  7.4112 (1.0)  7.002 (1.0)  7.5693 (1.0)

test/test_torchdynamo_bert.py::test_benchmark_implementations shape=(8, 384)
Name                                                                Median (CUDA)    Mean (CUDA)    Min (CUDA)     Max (CUDA)     Median         Mean           Min           Max
------------------------------------------------------------------  ---------------  -------------  -------------  -------------  -------------  -------------  ------------  -------------
test_benchmark_implementations[dynamo_optimized_cuda_graphs-8x384]  10.4755 (1.0)    10.4761 (1.0)  10.4684 (1.0)  10.4827 (1.0)  10.4137 (1.0)  10.3188 (1.0)  9.8971 (1.0)  10.5238 (1.0)

test/test_torchdynamo_bert.py::test_benchmark_implementations shape=(8, 512)
Name                                                                Median (CUDA)    Mean (CUDA)    Min (CUDA)    Max (CUDA)     Median         Mean           Min            Max
------------------------------------------------------------------  ---------------  -------------  ------------  -------------  -------------  -------------  -------------  -------------
test_benchmark_implementations[dynamo_optimized_cuda_graphs-8x512]  14.21 (1.0)      14.2105 (1.0)  14.208 (1.0)  14.2152 (1.0)  14.2029 (1.0)  13.8439 (1.0)  13.0863 (1.0)  14.4994 (1.0)

These measurements show that this branch is as fast as main when inference does not run under the autocast context manager.

@pommedeterresautee
Member Author

@gaetansnl there was a bug with ONNX Runtime: on main it takes the baseline model without setting fp16 to False, so it was running in full fp16, which doesn't work. In this branch there is no such flag, and the ONNX model is in mixed precision.

@pommedeterresautee
Member Author

Measures with the weights in fp16 and the model under autocast:

test/test_torchdynamo_bert.py ...............................................................................................................................................                                                    [100%]
test/test_torchdynamo_bert.py::test_benchmark_implementations shape=(1, 128)
Name                                                                       Median (CUDA)    Mean (CUDA)     Min (CUDA)      Max (CUDA)      Median          Mean            Min             Max
-------------------------------------------------------------------------  ---------------  --------------  --------------  --------------  --------------  --------------  --------------  --------------
test_benchmark_implementations[baseline-1x128]                             7.6605 (2.59)    7.9332 (2.51)   7.5622 (2.62)   11.0316 (1.81)  8.0965 (2.47)   8.3429 (2.41)   7.9112 (2.53)   9.9191 (2.06)
test_benchmark_implementations[dynamo-1x128]                               6.7329 (2.94)    6.7255 (2.96)   6.5987 (3.0)    6.87 (2.91)     6.9386 (2.88)   6.9729 (2.89)   6.8712 (2.91)   7.453 (2.74)
test_benchmark_implementations[dynamo_cuda_graphs-1x128]                   1.5616 (12.68)   1.6294 (12.2)   1.5391 (12.87)  1.7644 (11.35)  1.5984 (12.52)  1.6007 (12.58)  1.5955 (12.54)  1.685 (12.14)
test_benchmark_implementations[dynamo_no_dropout-1x128]                    6.2597 (3.16)    6.2689 (3.17)   6.1901 (3.2)    6.356 (3.15)    6.666 (3.0)     6.679 (3.02)    6.5346 (3.06)   6.9556 (2.94)
test_benchmark_implementations[dynamo_nvfuser_ofi-1x128]                   3.3935 (5.84)    3.3858 (5.87)   3.2553 (6.08)   3.4806 (5.75)   3.7285 (5.37)   3.7228 (5.41)   3.5896 (5.57)   4.0261 (5.08)
test_benchmark_implementations[dynamo_optimized-1x128]                     19.8042 (1.0)    19.88 (1.0)     19.8021 (1.0)   20.0204 (1.0)   20.0119 (1.0)   20.1414 (1.0)   20.0052 (1.0)   20.4569 (1.0)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-1x128]         1.5606 (12.69)   1.5597 (12.75)  1.5155 (13.07)  1.5626 (12.81)  1.4283 (14.01)  1.4303 (14.08)  1.4252 (14.04)  1.5174 (13.48)
test_benchmark_implementations[dynamo_optimizer_cuda_graphs_causal-1x128]  1.5718 (12.6)    1.4885 (13.36)  1.3722 (14.43)  1.5739 (12.72)  1.4302 (13.99)  1.4326 (14.06)  1.4263 (14.03)  1.5295 (13.37)
test_benchmark_implementations[onnx-1x128]                                 3.1877 (6.21)    3.1987 (6.21)   3.1795 (6.23)   3.3782 (5.93)   3.2336 (6.19)   3.2478 (6.2)    3.2264 (6.2)    3.5517 (5.76)
test_benchmark_implementations[onnx_optim_fp16-1x128]                      2.8242 (7.01)    2.8327 (7.02)   2.8047 (7.06)   2.8846 (6.94)   2.763 (7.24)    2.7916 (7.21)   2.7425 (7.29)   3.2298 (6.33)
test_benchmark_implementations[onnx_optim_fp32-1x128]                      3.5574 (5.57)    3.4032 (5.84)   3.1846 (6.22)   3.5932 (5.57)   3.2336 (6.19)   3.2463 (6.2)    3.2276 (6.2)    3.5254 (5.8)
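
A note on reading these tables: the number in parentheses after each timing appears to be the slowest entry of that column divided by the entry itself, so the slowest implementation shows (1.0) and larger values mean faster. This reading is inferred from the numbers above, not taken from the benchmark code. A minimal sketch that reproduces the Median (CUDA) ratios for shape=(1, 128), where `relative_to_slowest` is a hypothetical helper and not part of the test suite:

```python
# Median (CUDA) timings copied from the shape=(1, 128) table above.
medians = {
    "baseline": 7.6605,
    "dynamo": 6.7329,
    "dynamo_cuda_graphs": 1.5616,
    "dynamo_optimized": 19.8042,
    "dynamo_optimized_cuda_graphs": 1.5606,
    "onnx": 3.1877,
}


def relative_to_slowest(timings):
    """Return slowest / value for each entry, rounded to 2 decimals.

    Assumption: this mirrors how the parenthesized ratios are derived;
    it is inferred from the table values only.
    """
    slowest = max(timings.values())
    return {name: round(slowest / value, 2) for name, value in timings.items()}


ratios = relative_to_slowest(medians)
for name, ratio in ratios.items():
    print(f"{name}: {medians[name]} ({ratio})")
```

Under this reading, `dynamo_optimized` is the reference (slowest) entry for this shape, and the CUDA graphs variants are roughly 12.7 times faster on the median CUDA time.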

test/test_torchdynamo_bert.py::test_benchmark_implementations shape=(1, 16)
Name                                                                      Median (CUDA)    Mean (CUDA)     Min (CUDA)      Max (CUDA)      Median          Mean            Min             Max
------------------------------------------------------------------------  ---------------  --------------  --------------  --------------  --------------  --------------  --------------  --------------
test_benchmark_implementations[baseline-1x16]                             8.5985 (2.14)    10.0206 (1.85)  7.6575 (2.39)   13.8199 (1.37)  8.2296 (2.25)   8.265 (2.26)    8.0546 (2.29)   8.8783 (2.16)
test_benchmark_implementations[dynamo-1x16]                               6.4256 (2.86)    6.4446 (2.87)   6.3754 (2.88)   6.5732 (2.87)   6.7502 (2.75)   6.774 (2.76)    6.6862 (2.76)   7.0925 (2.7)
test_benchmark_implementations[dynamo_cuda_graphs-1x16]                   1.1192 (16.44)   1.1195 (16.54)  1.1172 (16.41)  1.1223 (16.82)  1.0585 (17.53)  1.0913 (17.13)  1.0548 (17.49)  1.2606 (15.21)
test_benchmark_implementations[dynamo_no_dropout-1x16]                    6.0099 (3.06)    6.0205 (3.07)   5.9835 (3.06)   6.101 (3.09)    6.4984 (2.85)   6.5778 (2.84)   6.4217 (2.87)   7.1127 (2.7)
test_benchmark_implementations[dynamo_nvfuser_ofi-1x16]                   3.1939 (5.76)    3.1746 (5.83)   3.0771 (5.96)   3.2891 (5.74)   3.5884 (5.17)   3.5752 (5.23)   3.4215 (5.39)   3.8911 (4.93)
test_benchmark_implementations[dynamo_optimized-1x16]                     18.4013 (1.0)    18.5119 (1.0)   18.3338 (1.0)   18.8826 (1.0)   18.5509 (1.0)   18.6906 (1.0)   18.4498 (1.0)   19.1759 (1.0)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-1x16]         0.6277 (29.31)   0.6279 (29.48)  0.6257 (29.3)   0.6298 (29.98)  0.6245 (29.71)  0.6262 (29.85)  0.6215 (29.69)  0.7118 (26.94)
test_benchmark_implementations[dynamo_optimizer_cuda_graphs_causal-1x16]  0.6287 (29.27)   0.6285 (29.45)  0.6267 (29.26)  0.638 (29.6)    0.6218 (29.84)  0.6234 (29.98)  0.6187 (29.82)  0.7396 (25.93)
test_benchmark_implementations[onnx-1x16]                                 2.4669 (7.46)    2.4797 (7.47)   2.432 (7.54)    2.561 (7.37)    2.4993 (7.42)   2.5286 (7.39)   2.4755 (7.45)   2.9406 (6.52)
test_benchmark_implementations[onnx_optim_fp16-1x16]                      2.8303 (6.5)     2.8303 (6.54)   2.7771 (6.6)    2.898 (6.52)    2.8146 (6.59)   2.8801 (6.49)   2.721 (6.78)    5.2296 (3.67)
test_benchmark_implementations[onnx_optim_fp32-1x16]                      2.4402 (7.54)    2.4559 (7.54)   2.423 (7.57)    2.5518 (7.4)    2.4798 (7.48)   2.5037 (7.47)   2.4579 (7.51)   2.9093 (6.59)

test/test_torchdynamo_bert.py::test_benchmark_implementations shape=(1, 256)
Name                                                                       Median (CUDA)    Mean (CUDA)     Min (CUDA)      Max (CUDA)      Median          Mean            Min             Max
-------------------------------------------------------------------------  ---------------  --------------  --------------  --------------  --------------  --------------  --------------  --------------
test_benchmark_implementations[baseline-1x256]                             7.553 (2.59)     7.571 (2.57)    7.468 (2.52)    7.6933 (2.57)   7.9049 (2.53)   7.9921 (2.51)   7.8276 (2.54)   8.6992 (2.33)
test_benchmark_implementations[dynamo-1x256]                               6.5608 (2.99)    6.5712 (2.97)   6.4809 (2.91)   6.7236 (2.94)   6.9156 (2.89)   6.9401 (2.89)   6.8395 (2.91)   7.276 (2.79)
test_benchmark_implementations[dynamo_cuda_graphs-1x256]                   2.2589 (8.67)    2.2587 (8.63)   2.2559 (8.35)   2.262 (8.74)    2.0757 (9.63)   2.0715 (9.67)   2.0435 (9.73)   2.137 (9.49)
test_benchmark_implementations[dynamo_no_dropout-1x256]                    6.6285 (2.96)    6.6407 (2.93)   6.615 (2.85)    6.7133 (2.94)   7.0503 (2.84)   7.097 (2.82)    7.0097 (2.84)   7.4603 (2.72)
test_benchmark_implementations[dynamo_nvfuser_ofi-1x256]                   2.8959 (6.77)    2.9247 (6.66)   2.8551 (6.59)   3.1928 (6.19)   3.3429 (5.98)   3.3241 (6.02)   3.1832 (6.25)   3.6621 (5.54)
test_benchmark_implementations[dynamo_optimized-1x256]                     19.5945 (1.0)    19.4843 (1.0)   18.8273 (1.0)   19.7622 (1.0)   19.9905 (1.0)   20.027 (1.0)    19.8866 (1.0)   20.2829 (1.0)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-1x256]         1.8452 (10.62)   1.847 (10.55)   1.8432 (10.21)  1.8524 (10.67)  1.7025 (11.74)  1.7237 (11.62)  1.6744 (11.88)  1.9958 (10.16)
test_benchmark_implementations[dynamo_optimizer_cuda_graphs_causal-1x256]  1.8586 (10.54)   1.8785 (10.37)  1.8555 (10.15)  2.3122 (8.55)   1.6936 (11.8)   1.694 (11.82)   1.6775 (11.85)  1.7908 (11.33)
test_benchmark_implementations[onnx-1x256]                                 3.9199 (5.0)     3.9236 (4.97)   3.9137 (4.81)   3.9456 (5.01)   3.9413 (5.07)   3.9492 (5.07)   3.9133 (5.08)   4.2207 (4.81)
test_benchmark_implementations[onnx_optim_fp16-1x256]                      2.7904 (7.02)    2.7935 (6.97)   2.7854 (6.76)   2.818 (7.01)    2.5601 (7.81)   2.5711 (7.79)   2.5551 (7.78)   2.8745 (7.06)
test_benchmark_implementations[onnx_optim_fp32-1x256]                      4.3551 (4.5)     4.3575 (4.47)   4.3518 (4.33)   4.3868 (4.5)    3.9516 (5.06)   3.9669 (5.05)   3.9364 (5.05)   4.2545 (4.77)

test/test_torchdynamo_bert.py::test_benchmark_implementations shape=(1, 384)
Name                                                                       Median (CUDA)    Mean (CUDA)    Min (CUDA)     Max (CUDA)     Median         Mean           Min            Max
-------------------------------------------------------------------------  ---------------  -------------  -------------  -------------  -------------  -------------  -------------  -------------
test_benchmark_implementations[baseline-1x384]                             7.5704 (2.43)    7.609 (2.41)   7.5151 (2.43)  7.8592 (2.34)  8.1326 (2.28)  8.2142 (2.27)  8.0782 (2.29)  8.7769 (2.17)
test_benchmark_implementations[dynamo-1x384]                               6.5138 (2.82)    6.5357 (2.81)  6.4809 (2.82)  6.6335 (2.78)  6.8973 (2.69)  6.9176 (2.69)  6.8245 (2.71)  7.2583 (2.63)
test_benchmark_implementations[dynamo_cuda_graphs-1x384]                   2.9 (6.33)       2.9689 (6.18)  2.8621 (6.38)  3.0802 (5.98)  2.9029 (6.38)  2.8796 (6.47)  2.8273 (6.53)  2.9232 (6.53)
test_benchmark_implementations[dynamo_no_dropout-1x384]                    6.3437 (2.89)    6.3538 (2.89)  6.2228 (2.94)  6.5026 (2.83)  6.6473 (2.79)  6.6985 (2.78)  6.626 (2.79)   7.1187 (2.68)
test_benchmark_implementations[dynamo_nvfuser_ofi-1x384]                   3.2851 (5.59)    3.2801 (5.6)   3.1099 (5.88)  3.4582 (5.32)  3.595 (5.15)   3.5772 (5.21)  3.4252 (5.39)  3.941 (4.84)
test_benchmark_implementations[dynamo_optimized-1x384]                     18.3613 (1.0)    18.3575 (1.0)  18.2733 (1.0)  18.4105 (1.0)  18.5198 (1.0)  18.6274 (1.0)  18.4637 (1.0)  19.0848 (1.0)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-1x384]         2.0285 (9.05)    2.0215 (9.08)  1.9302 (9.47)  2.0337 (9.05)  1.9281 (9.61)  1.9111 (9.75)  1.8576 (9.94)  1.9673 (9.7)
test_benchmark_implementations[dynamo_optimizer_cuda_graphs_causal-1x384]  2.0449 (8.98)    2.045 (8.98)   2.0408 (8.95)  2.05 (8.98)    1.9278 (9.61)  1.9113 (9.75)  1.8463 (10.0)  1.9523 (9.78)
test_benchmark_implementations[onnx-1x384]                                 5.3535 (3.43)    5.3549 (3.43)  5.3415 (3.42)  5.3873 (3.42)  5.0719 (3.65)  5.2008 (3.58)  4.8636 (3.8)   6.0413 (3.16)
test_benchmark_implementations[onnx_optim_fp16-1x384]                      2.8887 (6.36)    2.8896 (6.35)  2.8621 (6.38)  2.9204 (6.3)   2.9264 (6.33)  2.9292 (6.36)  2.8794 (6.41)  3.1708 (6.02)
test_benchmark_implementations[onnx_optim_fp32-1x384]                      5.0514 (3.63)    5.0504 (3.63)  5.0401 (3.63)  5.0616 (3.64)  5.0965 (3.63)  5.059 (3.68)   4.9089 (3.76)  5.2053 (3.67)

test/test_torchdynamo_bert.py::test_benchmark_implementations shape=(1, 512)
Name                                                                       Median (CUDA)    Mean (CUDA)    Min (CUDA)     Max (CUDA)     Median         Mean           Min            Max
-------------------------------------------------------------------------  ---------------  -------------  -------------  -------------  -------------  -------------  -------------  -------------
test_benchmark_implementations[baseline-1x512]                             7.3913 (2.48)    7.4481 (2.46)  7.3667 (2.47)  7.713 (2.39)   7.7369 (2.58)  7.8174 (2.55)  7.7022 (2.56)  8.4882 (2.37)
test_benchmark_implementations[dynamo-1x512]                               6.3889 (2.87)    6.3807 (2.87)  6.2702 (2.91)  6.482 (2.85)   6.617 (3.02)   6.6864 (2.98)  6.5914 (2.99)  7.277 (2.76)
test_benchmark_implementations[dynamo_cuda_graphs-1x512]                   4.6848 (3.91)    4.6386 (3.95)  4.4052 (4.14)  4.6909 (3.93)  4.3883 (4.55)  4.3673 (4.57)  4.3137 (4.57)  4.4019 (4.57)
test_benchmark_implementations[dynamo_no_dropout-1x512]                    5.9894 (3.06)    6.0217 (3.04)  5.9453 (3.06)  6.2024 (2.97)  6.4405 (3.1)   6.479 (3.08)   6.3649 (3.1)   6.7992 (2.96)
test_benchmark_implementations[dynamo_nvfuser_ofi-1x512]                   3.4212 (5.36)    3.425 (5.35)   3.3372 (5.46)  3.5082 (5.26)  3.7336 (5.35)  3.7332 (5.35)  3.5726 (5.52)  4.1709 (4.82)
test_benchmark_implementations[dynamo_optimized-1x512]                     18.3265 (1.0)    18.3329 (1.0)  18.2201 (1.0)  18.4474 (1.0)  19.978 (1.0)   19.9555 (1.0)  19.7277 (1.0)  20.1152 (1.0)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-1x512]         2.9204 (6.28)    2.9207 (6.28)  2.9174 (6.25)  2.9245 (6.31)  2.6658 (7.49)  2.6642 (7.49)  2.6248 (7.52)  2.7353 (7.35)
test_benchmark_implementations[dynamo_optimizer_cuda_graphs_causal-1x512]  2.9532 (6.21)    2.9534 (6.21)  2.9491 (6.18)  2.9563 (6.24)  2.6776 (7.46)  2.6961 (7.4)   2.6347 (7.49)  3.1068 (6.47)
test_benchmark_implementations[onnx-1x512]                                 7.3779 (2.48)    7.4523 (2.46)  7.3738 (2.47)  7.935 (2.32)   7.4261 (2.69)  7.3922 (2.7)   7.2257 (2.73)  7.5067 (2.68)
test_benchmark_implementations[onnx_optim_fp16-1x512]                      3.9345 (4.66)    3.9355 (4.66)  3.9158 (4.65)  3.9546 (4.66)  3.9765 (5.02)  3.9747 (5.02)  3.9266 (5.02)  4.2046 (4.78)
test_benchmark_implementations[onnx_optim_fp32-1x512]                      7.9319 (2.31)    7.9335 (2.31)  7.9258 (2.3)   7.9411 (2.32)  7.4303 (2.69)  7.3798 (2.7)   7.2062 (2.74)  7.4877 (2.69)

test/test_torchdynamo_bert.py::test_benchmark_implementations shape=(32, 128)
Name                                                                        Median (CUDA)    Mean (CUDA)     Min (CUDA)      Max (CUDA)      Median          Mean            Min             Max
--------------------------------------------------------------------------  ---------------  --------------  --------------  --------------  --------------  --------------  --------------  --------------
test_benchmark_implementations[baseline-32x128]                             19.9885 (1.9)    19.9921 (1.9)   19.9864 (1.9)   20.0018 (1.9)   19.6734 (1.89)  19.8335 (1.9)   19.3291 (1.92)  20.1765 (1.88)
test_benchmark_implementations[dynamo-32x128]                               20.4227 (1.86)   20.4244 (1.86)  20.4206 (1.86)  20.4298 (1.86)  19.0551 (1.95)  19.563 (1.92)   19.0284 (1.96)  20.0972 (1.89)
test_benchmark_implementations[dynamo_cuda_graphs-32x128]                   19.797 (1.92)    19.7992 (1.92)  19.7939 (1.92)  19.8083 (1.92)  18.8559 (1.97)  19.524 (1.93)   18.8221 (1.98)  20.4032 (1.86)
test_benchmark_implementations[dynamo_no_dropout-32x128]                    20.4247 (1.86)   20.354 (1.87)   20.1308 (1.89)  20.4329 (1.86)  19.5796 (1.9)   19.7064 (1.91)  19.1097 (1.95)  20.1087 (1.89)
test_benchmark_implementations[dynamo_nvfuser_ofi-32x128]                   17.791 (2.14)    17.7445 (2.14)  17.5504 (2.17)  17.7961 (2.14)  17.4183 (2.14)  16.9937 (2.21)  16.3069 (2.28)  17.4253 (2.18)
test_benchmark_implementations[dynamo_optimized-32x128]                     18.1373 (2.1)    18.1337 (2.1)   18.1053 (2.1)   18.1473 (2.1)   18.592 (2.0)    18.7188 (2.01)  18.5652 (2.0)   19.1896 (1.98)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-32x128]         12.9157 (2.94)   12.914 (2.95)   12.9085 (2.94)  12.9198 (2.95)  13.0854 (2.84)  12.7066 (2.96)  11.979 (3.11)   13.0971 (2.9)
test_benchmark_implementations[dynamo_optimizer_cuda_graphs_causal-32x128]  12.9894 (2.93)   12.9894 (2.93)  12.9853 (2.93)  12.9935 (2.93)  12.8909 (2.89)  12.6881 (2.96)  12.0358 (3.09)  13.1549 (2.89)
test_benchmark_implementations[onnx-32x128]                                 37.9894 (1.0)    38.0406 (1.0)   37.9894 (1.0)   38.0918 (1.0)   37.205 (1.0)    37.614 (1.0)    37.205 (1.0)    38.0231 (1.0)
test_benchmark_implementations[onnx_optim_fp16-32x128]                      17.9159 (2.12)   17.9122 (2.12)  17.8954 (2.12)  17.921 (2.13)   17.8506 (2.08)  17.5318 (2.15)  16.8395 (2.21)  17.9537 (2.12)
test_benchmark_implementations[onnx_optim_fp32-32x128]                      38.0037 (1.0)    38.0099 (1.0)   38.0037 (1.0)   38.016 (1.0)    37.0032 (1.01)  37.4619 (1.0)   37.0032 (1.01)  37.9206 (1.0)

test/test_torchdynamo_bert.py::test_benchmark_implementations shape=(32, 16)
Name                                                                       Median (CUDA)    Mean (CUDA)    Min (CUDA)     Max (CUDA)     Median         Mean           Min            Max
-------------------------------------------------------------------------  ---------------  -------------  -------------  -------------  -------------  -------------  -------------  -------------
test_benchmark_implementations[baseline-32x16]                             7.7896 (2.36)    7.7981 (2.36)  7.6913 (2.38)  7.9741 (2.31)  8.1593 (2.29)  8.2285 (2.28)  8.0638 (2.3)   8.7488 (2.2)
test_benchmark_implementations[dynamo-32x16]                               6.7135 (2.74)    6.7336 (2.73)  6.6662 (2.74)  6.9366 (2.66)  7.0167 (2.66)  7.0663 (2.66)  6.953 (2.67)   7.6539 (2.52)
test_benchmark_implementations[dynamo_cuda_graphs-32x16]                   3.1007 (5.93)    3.1274 (5.88)  3.0986 (5.9)   3.456 (5.34)   3.1281 (5.97)  3.1278 (6.0)   3.0912 (6.01)  3.2096 (6.0)
test_benchmark_implementations[dynamo_no_dropout-32x16]                    6.3744 (2.88)    6.3898 (2.88)  6.2659 (2.92)  6.5905 (2.8)   6.7537 (2.76)  6.8027 (2.76)  6.7059 (2.77)  7.4592 (2.58)
test_benchmark_implementations[dynamo_nvfuser_ofi-32x16]                   3.6577 (5.03)    3.6874 (4.99)  3.5625 (5.13)  3.8349 (4.81)  4.034 (4.63)   4.047 (4.64)   3.9003 (4.76)  4.4634 (4.31)
test_benchmark_implementations[dynamo_optimized-32x16]                     18.388 (1.0)     18.3839 (1.0)  18.2804 (1.0)  18.4494 (1.0)  18.6591 (1.0)  18.7635 (1.0)  18.5679 (1.0)  19.2508 (1.0)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-32x16]         2.5539 (7.2)     2.5544 (7.2)   2.5518 (7.16)  2.5569 (7.22)  2.3349 (7.99)  2.333 (8.04)   2.2951 (8.09)  2.4158 (7.97)
test_benchmark_implementations[dynamo_optimizer_cuda_graphs_causal-32x16]  2.391 (7.69)     2.4274 (7.57)  2.3869 (7.66)  2.5569 (7.22)  2.3637 (7.89)  2.3968 (7.83)  2.3181 (8.01)  2.5495 (7.55)
test_benchmark_implementations[onnx-32x16]                                 5.6269 (3.27)    5.6307 (3.26)  5.6197 (3.25)  5.6525 (3.26)  5.6711 (3.29)  5.6498 (3.32)  5.5571 (3.34)  5.8432 (3.29)
test_benchmark_implementations[onnx_optim_fp16-32x16]                      3.3075 (5.56)    3.3062 (5.56)  3.2862 (5.56)  3.3352 (5.53)  3.3408 (5.59)  3.3575 (5.59)  3.3177 (5.6)   3.649 (5.28)
test_benchmark_implementations[onnx_optim_fp32-32x16]                      5.8204 (3.16)    5.8232 (3.16)  5.8143 (3.14)  5.8378 (3.16)  5.6519 (3.3)   5.673 (3.31)   5.5316 (3.36)  6.1365 (3.14)

test/test_torchdynamo_bert.py::test_benchmark_implementations shape=(32, 256)
Name                                                                        Median (CUDA)    Mean (CUDA)     Min (CUDA)      Max (CUDA)      Median          Mean            Min             Max
--------------------------------------------------------------------------  ---------------  --------------  --------------  --------------  --------------  --------------  --------------  --------------
test_benchmark_implementations[baseline-32x256]                             43.902 (1.72)    43.9101 (1.72)  43.902 (1.72)   43.9183 (1.72)  43.1339 (1.72)  43.5829 (1.7)   43.1339 (1.72)  44.0318 (1.68)
test_benchmark_implementations[dynamo-32x256]                               43.8252 (1.72)   43.8595 (1.72)  43.8252 (1.72)  43.8938 (1.72)  43.8847 (1.69)  44.0577 (1.68)  43.8847 (1.69)  44.2306 (1.68)
test_benchmark_implementations[dynamo_cuda_graphs-32x256]                   43.7996 (1.72)   43.8011 (1.72)  43.7996 (1.72)  43.8026 (1.72)  42.7952 (1.73)  43.3153 (1.71)  42.7952 (1.73)  43.8354 (1.69)
test_benchmark_implementations[dynamo_no_dropout-32x256]                    44.1385 (1.71)   44.1411 (1.71)  44.1385 (1.71)  44.1436 (1.71)  43.5725 (1.7)   43.9225 (1.69)  43.5725 (1.7)   44.2724 (1.68)
test_benchmark_implementations[dynamo_nvfuser_ofi-32x256]                   35.9804 (2.1)    36.5861 (2.06)  35.9804 (2.1)   37.1917 (2.03)  35.1051 (2.11)  35.5981 (2.08)  35.1051 (2.11)  36.0911 (2.05)
test_benchmark_implementations[dynamo_optimized-32x256]                     27.3992 (2.75)   27.4169 (2.75)  27.3971 (2.75)  27.4543 (2.75)  26.2085 (2.83)  26.107 (2.84)   25.302 (2.93)   26.8106 (2.77)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-32x256]         25.9441 (2.91)   25.9434 (2.91)  25.941 (2.91)   25.9451 (2.91)  25.1295 (2.95)  25.0927 (2.96)  23.9535 (3.1)   26.1953 (2.83)
test_benchmark_implementations[dynamo_optimizer_cuda_graphs_causal-32x256]  25.8591 (2.92)   25.857 (2.92)   25.8468 (2.92)  25.8652 (2.92)  25.3347 (2.93)  25.082 (2.96)   23.7889 (3.12)  26.1226 (2.84)
test_benchmark_implementations[onnx-32x256]                                 75.471 (1.0)     75.471 (1.0)    75.471 (1.0)    75.471 (1.0)    74.1612 (1.0)   74.1612 (1.0)   74.1612 (1.0)   74.1612 (1.0)
test_benchmark_implementations[onnx_optim_fp16-32x256]                      36.7094 (2.06)   36.713 (2.06)   36.7094 (2.06)  36.7167 (2.06)  34.473 (2.15)   35.5963 (2.08)  34.473 (2.15)   36.7196 (2.02)
test_benchmark_implementations[onnx_optim_fp32-32x256]                      75.4545 (1.0)    75.4545 (1.0)   75.4545 (1.0)   75.4545 (1.0)   73.9454 (1.0)   73.9454 (1.0)   73.9454 (1.0)   73.9454 (1.0)

test/test_torchdynamo_bert.py::test_benchmark_implementations shape=(8, 128)
Name                                                                       Median (CUDA)    Mean (CUDA)     Min (CUDA)     Max (CUDA)      Median          Mean            Min             Max
-------------------------------------------------------------------------  ---------------  --------------  -------------  --------------  --------------  --------------  --------------  --------------
test_benchmark_implementations[baseline-8x128]                             7.6882 (2.37)    7.7161 (2.37)   7.6513 (2.36)  7.8203 (2.37)   8.0509 (2.29)   8.1397 (2.27)   8.0176 (2.29)   8.6786 (2.18)
test_benchmark_implementations[dynamo-8x128]                               7.1045 (2.57)    7.1043 (2.57)   7.1004 (2.54)  7.1076 (2.61)   7.0329 (2.62)   7.083 (2.61)    6.9554 (2.64)   7.4152 (2.55)
test_benchmark_implementations[dynamo_cuda_graphs-8x128]                   6.8086 (2.68)    6.7544 (2.71)   6.2536 (2.89)  6.8147 (2.72)   6.1993 (2.97)   6.1946 (2.99)   6.1376 (2.99)   6.2532 (3.03)
test_benchmark_implementations[dynamo_no_dropout-8x128]                    7.1035 (2.57)    7.1032 (2.58)   7.0994 (2.54)  7.1066 (2.61)   6.7912 (2.71)   6.8176 (2.71)   6.7409 (2.72)   7.1907 (2.63)
test_benchmark_implementations[dynamo_nvfuser_ofi-8x128]                   5.8357 (3.12)    5.8353 (3.13)   5.8326 (3.1)   5.8401 (3.18)   5.4088 (3.4)    5.4015 (3.43)   5.3005 (3.46)   5.5612 (3.4)
test_benchmark_implementations[dynamo_optimized-8x128]                     18.2241 (1.0)    18.2907 (1.0)   18.0675 (1.0)  18.5457 (1.0)   18.4038 (1.0)   18.5019 (1.0)   18.3426 (1.0)   18.9316 (1.0)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-8x128]         4.6684 (3.9)     4.669 (3.92)    4.6653 (3.87)  4.6725 (3.97)   4.3283 (4.25)   4.3743 (4.23)   4.1687 (4.4)    4.8529 (3.9)
test_benchmark_implementations[dynamo_optimizer_cuda_graphs_causal-8x128]  4.6797 (3.89)    4.6804 (3.91)   4.6766 (3.86)  4.6838 (3.96)   4.3204 (4.26)   4.2784 (4.32)   4.1636 (4.41)   4.3304 (4.37)
test_benchmark_implementations[onnx-8x128]                                 12.0791 (1.51)   12.086 (1.51)   12.073 (1.5)   12.0986 (1.53)  11.0917 (1.66)  11.0635 (1.67)  10.9016 (1.68)  11.1618 (1.7)
test_benchmark_implementations[onnx_optim_fp16-8x128]                      6.4862 (2.81)    6.4874 (2.82)   6.4829 (2.79)  6.4972 (2.85)   6.0809 (3.03)   6.2295 (2.97)   5.902 (3.11)    6.8165 (2.78)
test_benchmark_implementations[onnx_optim_fp32-8x128]                      12.0691 (1.51)   11.7955 (1.55)  11.2886 (1.6)  12.1059 (1.53)  11.0996 (1.66)  11.1007 (1.67)  10.9196 (1.68)  11.3024 (1.68)

test/test_torchdynamo_bert.py::test_benchmark_implementations shape=(8, 16)
Name                                                                      Median (CUDA)    Mean (CUDA)     Min (CUDA)      Max (CUDA)      Median          Mean            Min             Max
------------------------------------------------------------------------  ---------------  --------------  --------------  --------------  --------------  --------------  --------------  --------------
test_benchmark_implementations[baseline-8x16]                             8.0026 (2.28)    8.0095 (2.29)   7.9422 (2.28)   8.1019 (2.34)   8.3481 (2.21)   8.474 (2.19)    8.2862 (2.22)   9.7117 (1.97)
test_benchmark_implementations[dynamo-8x16]                               6.9307 (2.63)    6.9342 (2.64)   6.8711 (2.63)   7.0113 (2.7)    7.3655 (2.5)    7.418 (2.51)    7.2169 (2.55)   7.8575 (2.43)
test_benchmark_implementations[dynamo_cuda_graphs-8x16]                   1.8 (10.12)      1.7997 (10.19)  1.7971 (10.07)  1.8033 (10.49)  1.6321 (11.3)   1.6762 (11.09)  1.6288 (11.3)   1.9426 (9.83)
test_benchmark_implementations[dynamo_no_dropout-8x16]                    7.1875 (2.53)    7.1667 (2.56)   7.0556 (2.57)   7.2233 (2.62)   7.4841 (2.46)   7.5437 (2.46)   7.4565 (2.47)   8.0487 (2.37)
test_benchmark_implementations[dynamo_nvfuser_ofi-8x16]                   3.6987 (4.92)    3.6816 (4.98)   3.5512 (5.1)    3.8103 (4.97)   3.937 (4.68)    3.9566 (4.7)    3.8619 (4.77)   4.4714 (4.27)
test_benchmark_implementations[dynamo_optimized-8x16]                     18.2159 (1.0)    18.3345 (1.0)   18.0992 (1.0)   18.9225 (1.0)   18.4395 (1.0)   18.5832 (1.0)   18.4076 (1.0)   19.1045 (1.0)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-8x16]         1.4664 (12.42)   1.4669 (12.5)   1.4633 (12.37)  1.4725 (12.85)  1.3477 (13.68)  1.3592 (13.67)  1.3445 (13.69)  1.6115 (11.86)
test_benchmark_implementations[dynamo_optimizer_cuda_graphs_causal-8x16]  1.4653 (12.43)   1.4593 (12.56)  1.2943 (13.98)  1.4694 (12.88)  1.3399 (13.76)  1.3444 (13.82)  1.3352 (13.79)  1.4652 (13.04)
test_benchmark_implementations[onnx-8x16]                                 2.9072 (6.27)    2.911 (6.3)     2.9003 (6.24)   2.9676 (6.38)   2.948 (6.26)    2.9599 (6.28)   2.9437 (6.25)   3.2374 (5.9)
test_benchmark_implementations[onnx_optim_fp16-8x16]                      2.8324 (6.43)    2.8391 (6.46)   2.816 (6.43)    2.9348 (6.45)   2.8898 (6.38)   2.9068 (6.39)   2.8696 (6.41)   3.3752 (5.66)
test_benchmark_implementations[onnx_optim_fp32-8x16]                      2.9259 (6.23)    2.9446 (6.23)   2.9194 (6.2)    3.1304 (6.04)   2.9849 (6.18)   3.0218 (6.15)   2.9654 (6.21)   3.4835 (5.48)

test/test_torchdynamo_bert.py::test_benchmark_implementations shape=(8, 256)
Name                                                                       Median (CUDA)    Mean (CUDA)     Min (CUDA)      Max (CUDA)      Median          Mean            Min             Max
-------------------------------------------------------------------------  ---------------  --------------  --------------  --------------  --------------  --------------  --------------  --------------
test_benchmark_implementations[baseline-8x256]                             14.4046 (1.62)   14.3729 (1.62)  14.1752 (1.62)  14.4251 (1.66)  13.4262 (1.67)  13.4115 (1.68)  13.2269 (1.69)  13.6626 (1.65)
test_benchmark_implementations[dynamo-8x256]                               13.4974 (1.73)   13.7875 (1.69)  13.4277 (1.71)  14.3831 (1.67)  13.3872 (1.67)  13.388 (1.68)   13.2685 (1.68)  13.4548 (1.68)
test_benchmark_implementations[dynamo_cuda_graphs-8x256]                   13.1953 (1.77)   13.1859 (1.77)  13.143 (1.75)   13.2649 (1.81)  13.1024 (1.71)  13.0303 (1.73)  12.8504 (1.74)  13.1506 (1.72)
test_benchmark_implementations[dynamo_no_dropout-8x256]                    13.4892 (1.73)   13.653 (1.71)   13.4298 (1.71)  14.3923 (1.67)  13.4651 (1.66)  13.4124 (1.68)  13.2646 (1.68)  13.4941 (1.68)
test_benchmark_implementations[dynamo_nvfuser_ofi-8x256]                   10.5984 (2.21)   10.7886 (2.16)  10.5196 (2.19)  11.2364 (2.14)  10.4592 (2.14)  10.4092 (2.16)  10.1622 (2.2)   10.5341 (2.15)
test_benchmark_implementations[dynamo_optimized-8x256]                     18.1248 (1.29)   18.1269 (1.29)  18.1065 (1.27)  18.1586 (1.32)  18.6109 (1.2)   18.6935 (1.2)   18.4981 (1.21)  19.0574 (1.19)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-8x256]         7.6616 (3.05)    7.5866 (3.07)   7.4721 (3.08)   7.6646 (3.13)   7.4286 (3.01)   7.3157 (3.07)   6.9428 (3.21)   7.5053 (3.01)
test_benchmark_implementations[dynamo_optimizer_cuda_graphs_causal-8x256]  7.3789 (3.17)    7.4421 (3.13)   7.3677 (3.12)   7.6206 (3.15)   7.3713 (3.04)   7.2102 (3.12)   6.8461 (3.26)   7.3796 (3.06)
test_benchmark_implementations[onnx-8x256]                                 22.7218 (1.03)   23.06 (1.01)    22.6567 (1.02)  23.9933 (1.0)   22.2878 (1.0)   22.4101 (1.0)   22.2321 (1.0)   22.5733 (1.0)
test_benchmark_implementations[onnx_optim_fp16-8x256]                      11.3439 (2.06)   11.3283 (2.06)  11.2681 (2.04)  11.3675 (2.11)  11.2462 (1.99)  11.2272 (2.0)   11.0091 (2.03)  11.3319 (2.0)
test_benchmark_implementations[onnx_optim_fp32-8x256]                      23.4025 (1.0)    23.3101 (1.0)   23.0216 (1.0)   23.4107 (1.02)  22.386 (1.0)    22.478 (1.0)    22.3156 (1.0)   22.6084 (1.0)

test/test_torchdynamo_bert.py::test_benchmark_implementations shape=(8, 384)
Name                                                                       Median (CUDA)    Mean (CUDA)     Min (CUDA)      Max (CUDA)      Median          Mean            Min             Max
-------------------------------------------------------------------------  ---------------  --------------  --------------  --------------  --------------  --------------  --------------  --------------
test_benchmark_implementations[baseline-8x384]                             19.202 (1.63)    19.4443 (1.61)  19.1621 (1.64)  19.9311 (1.58)  19.2175 (1.64)  19.1016 (1.63)  18.658 (1.62)   19.2375 (1.64)
test_benchmark_implementations[dynamo-8x384]                               19.9096 (1.58)   19.806 (1.59)   19.3843 (1.62)  19.9168 (1.58)  19.2918 (1.63)  19.078 (1.63)   18.5617 (1.63)  19.3054 (1.64)
test_benchmark_implementations[dynamo_cuda_graphs-8x384]                   18.9245 (1.66)   18.9348 (1.66)  18.9194 (1.66)  18.9768 (1.66)  19.0011 (1.65)  18.7891 (1.65)  18.3988 (1.64)  19.0703 (1.66)
test_benchmark_implementations[dynamo_no_dropout-8x384]                    19.926 (1.57)    19.9264 (1.58)  19.924 (1.57)   19.9291 (1.58)  19.3009 (1.63)  19.1752 (1.62)  18.8989 (1.6)   19.3352 (1.63)
test_benchmark_implementations[dynamo_nvfuser_ofi-8x384]                   15.401 (2.04)    15.4013 (2.04)  15.3979 (2.04)  15.403 (2.04)   15.4667 (2.03)  15.1685 (2.05)  14.4705 (2.09)  15.4702 (2.04)
test_benchmark_implementations[dynamo_optimized-8x384]                     18.26 (1.72)     18.2686 (1.72)  18.1647 (1.73)  18.4166 (1.71)  18.5932 (1.69)  18.7727 (1.65)  18.5629 (1.63)  19.3952 (1.63)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-8x384]         10.5236 (2.98)   10.6009 (2.96)  10.5144 (2.98)  10.7827 (2.92)  10.5006 (2.99)  10.2082 (3.04)  9.7592 (3.09)   10.5046 (3.01)
test_benchmark_implementations[dynamo_optimizer_cuda_graphs_causal-8x384]  10.3496 (3.03)   10.3823 (3.02)  10.326 (3.04)   10.6926 (2.94)  10.3932 (3.02)  10.0772 (3.08)  9.604 (3.14)    10.4152 (3.03)
test_benchmark_implementations[onnx-8x384]                                 31.3651 (1.0)    31.3952 (1.0)   31.3559 (1.0)   31.4644 (1.0)   31.4303 (1.0)   31.0518 (1.0)   30.1471 (1.0)   31.578 (1.0)
test_benchmark_implementations[onnx_optim_fp16-8x384]                      16.2304 (1.93)   16.2316 (1.93)  16.2243 (1.93)  16.2428 (1.94)  16.0021 (1.96)  15.9447 (1.95)  15.2727 (1.98)  16.283 (1.94)
test_benchmark_implementations[onnx_optim_fp32-8x384]                      31.3774 (1.0)    31.3737 (1.0)   31.3631 (1.0)   31.3805 (1.0)   31.0449 (1.01)  30.9306 (1.0)   30.1845 (1.0)   31.5624 (1.0)

test/test_torchdynamo_bert.py::test_benchmark_implementations shape=(8, 512)
Name                                                                       Median (CUDA)    Mean (CUDA)     Min (CUDA)      Max (CUDA)      Median          Mean            Min             Max
-------------------------------------------------------------------------  ---------------  --------------  --------------  --------------  --------------  --------------  --------------  --------------
test_benchmark_implementations[baseline-8x512]                             31.0692 (1.51)   31.3934 (1.5)   30.0759 (1.56)  33.0353 (1.42)  28.848 (1.58)   28.571 (1.63)   27.5792 (1.66)  29.2857 (1.62)
test_benchmark_implementations[dynamo-8x512]                               27.7504 (1.7)    27.7494 (1.7)   27.7463 (1.7)   27.7514 (1.7)   27.8712 (1.64)  27.9122 (1.67)  27.6375 (1.65)  28.2279 (1.68)
test_benchmark_implementations[dynamo_cuda_graphs-8x512]                   27.6029 (1.7)    27.6978 (1.7)   27.5907 (1.71)  27.8999 (1.69)  27.1757 (1.68)  27.1697 (1.71)  26.5793 (1.72)  27.7542 (1.71)
test_benchmark_implementations[dynamo_no_dropout-8x512]                    27.7903 (1.69)   27.7845 (1.69)  27.7565 (1.7)   27.8067 (1.69)  28.0226 (1.63)  27.9281 (1.66)  27.7133 (1.65)  28.0485 (1.69)
test_benchmark_implementations[dynamo_nvfuser_ofi-8x512]                   20.4605 (2.3)    20.4662 (2.3)   20.4575 (2.3)   20.4759 (2.3)   20.1937 (2.26)  20.2667 (2.29)  19.715 (2.32)   20.6086 (2.3)
test_benchmark_implementations[dynamo_optimized-8x512]                     18.1924 (2.59)   18.1948 (2.59)  18.1504 (2.59)  18.2446 (2.58)  18.6487 (2.45)  18.7095 (2.48)  18.5356 (2.47)  19.1187 (2.48)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-8x512]         13.9039 (3.38)   13.9055 (3.38)  13.9028 (3.38)  13.909 (3.38)   13.9589 (3.28)  13.6052 (3.42)  12.8967 (3.54)  14.019 (3.38)
test_benchmark_implementations[dynamo_optimizer_cuda_graphs_causal-8x512]  13.7093 (3.43)   13.6897 (3.44)  13.612 (3.46)   13.7114 (3.43)  13.6507 (3.35)  13.3774 (3.47)  12.6394 (3.62)  13.7645 (3.44)
test_benchmark_implementations[onnx-8x512]                                 46.8101 (1.01)   46.9161 (1.0)   46.8101 (1.01)  47.0221 (1.0)   45.5775 (1.0)   46.4806 (1.0)   45.5775 (1.0)   47.3837 (1.0)
test_benchmark_implementations[onnx_optim_fp16-8x512]                      21.3135 (2.21)   21.314 (2.21)   21.3094 (2.21)  21.3187 (2.21)  20.3842 (2.24)  20.8492 (2.23)  20.3223 (2.25)  21.3459 (2.22)
test_benchmark_implementations[onnx_optim_fp32-8x512]                      47.0578 (1.0)    47.0604 (1.0)   47.0578 (1.0)   47.063 (1.0)    45.7169 (1.0)   46.4234 (1.0)   45.7169 (1.0)   47.1299 (1.01)



Contributor

@gaetansnl gaetansnl left a comment

Mostly minor, but there are a few things I'm not sure I understand.

implementations/attention_masked_original.py (outdated, resolved)
return out


def layer_norm(x: torch.Tensor, weight: torch.Tensor, bias: torch.Tensor, eps: float, implementation: JITFunction = _layer_norm_fwd_fused_single_pass):
Contributor

Why doesn't this one work like the others? The "output" parameter is missing.

Member Author

Maybe it's the opposite? Why does attention have an output field? IMO it should be removed; I just forgot.
For attention, we create the output outside the function and pass the tensor in. The kernel is marked to convert all provided tensors to fp16, which in mixed precision includes the output. We should move the allocation inside the function to avoid this unneeded cast.

Layernorm and the linear layer don't have this issue, because they create the output tensor with the right dtype from the beginning.

Contributor

The output is allocated outside because that lets the calling code control allocations; I'm not sure it's still useful.

return outputs


def linear_layer(x: torch.Tensor,
Contributor

Same here: why don't we have "output"?

Member Author

see layernorm answer


@pytest.mark.parametrize("batch", [1, 8, 32, 64])
@pytest.mark.parametrize("implementation", ["torch", "triton_original", "triton"])
def test_benchmark(benchmark, batch, implementation):
torch.manual_seed(0)
Contributor

This should be moved to the beginning of the function to avoid mistakes, IMO.

Member Author

I'm not sure I understand what you're referring to.

Contributor

torch.manual_seed(0)

Member Author

Moved into an annotation.
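
One way to read "an annotation" here is a decorator that reseeds the RNG before each test body runs. The sketch below is an assumption, not the code from this PR: `set_seed` is a hypothetical name, and it seeds Python's `random` module so that it runs without a GPU. A PyTorch test would call `torch.manual_seed(seed)` at the marked line instead.

```python
import functools
import random


def set_seed(seed: int = 0):
    """Hypothetical decorator: reseed the RNG before every call of the wrapped function."""

    def decorator(func):
        @functools.wraps(func)  # keeps the original name and signature visible to tooling
        def wrapper(*args, **kwargs):
            random.seed(seed)  # assumption: a torch test would use torch.manual_seed(seed)
            return func(*args, **kwargs)

        return wrapper

    return decorator


@set_seed(0)
def sample_inputs(n: int) -> list:
    """Stand-in for a test body that draws random inputs."""
    return [random.random() for _ in range(n)]


# Both calls start from the same seed, so they draw identical values.
first = sample_inputs(3)
second = sample_inputs(3)
assert first == second
```

Putting the seed in a decorator means it cannot end up below the first random draw, which seems to be the mistake the review comment wants to rule out.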

test/test_attention.py (resolved)
test/test_linear_layer.py (resolved)
@@ -61,7 +53,7 @@ def test_benchmark(benchmark, shape: Shape, bias: bool, activation: str, contigu
     else:
         raise ValueError(f"Unknown activation: {activation}")

-    torch_linear_layer = torch.nn.Linear(K, N, bias=bias, device="cuda", dtype=torch.float16)
+    torch_linear_layer = torch.nn.Linear(K, N, bias=bias, device="cuda", dtype=dtype)
     torch_linear_layer.weight.data = layer_weight

     def torch_linear_activation(x):
Contributor

Same here: you don't compare to fp32?

Member Author

Changed; the reference is now fp32.
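
To illustrate why the reference should be computed in higher precision than the kernel under test, here is a standard-library sketch. It is not the test code from this PR: `to_fp16`, `dot_reference`, and `dot_fp16` are hypothetical helpers, Python floats (fp64) stand in for the fp32 reference, and `struct`'s half-precision format emulates fp16 rounding.

```python
import math
import struct


def to_fp16(value: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", value))[0]


def dot_reference(a, b) -> float:
    """Reference dot product accumulated in full precision."""
    return math.fsum(x * y for x, y in zip(a, b))


def dot_fp16(a, b) -> float:
    """Same dot product with every intermediate value rounded to fp16."""
    acc = 0.0
    for x, y in zip(a, b):
        product = to_fp16(to_fp16(x) * to_fp16(y))
        acc = to_fp16(acc + product)
    return acc


a = [0.1] * 1000
b = [0.1] * 1000
expected = dot_reference(a, b)  # close to 10.0
actual = dot_fp16(a, b)

# The low-precision result is compared to the high-precision reference with a
# tolerance loose enough to absorb fp16 rounding error.
assert math.isclose(actual, expected, rel_tol=5e-2)
```

If the reference were itself computed in fp16, both sides would carry the same rounding error and the comparison would say little about the accuracy of the kernel.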

test/test_layer_norm.py (outdated, resolved)
test/test_torchdynamo_bert.py (resolved)
(32, 16), (32, 128), (32, 256),
], ids=lambda x: f"{x[0]}x{x[1]}")
@pytest.mark.parametrize("shape", [(bs, seq_l) for bs in [1, 8, 32] for seq_l in [16, 128, 256, 384, 512]
if bs * seq_l < 10000], ids=lambda x: f"{x[0]}x{x[1]}")
@pytest.mark.parametrize("implementation", implementations.keys())
def test_benchmark_implementations(benchmark, model_reference_fp32, shape: (int, int), implementation: str):
torch.manual_seed(0)
Contributor

Why do we need the assert below?

Member Author

@pommedeterresautee pommedeterresautee Sep 23, 2022

which assert?

Contributor

assert implementation in implementations, f"unknown implementation: {implementation}"

Member Author

it has been removed
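
The parametrization quoted in this thread can be checked on its own, which also suggests why the assert was redundant: when the test is parametrized over `implementations.keys()`, every value it receives is a key of the dict by construction. In the sketch below the shape comprehension is copied from the diff, while the two-entry `implementations` dict is a hypothetical stand-in for the real registry.

```python
# Hypothetical stand-in for the real registry of benchmarked implementations.
implementations = {
    "baseline": lambda x: x,
    "dynamo": lambda x: x,
}

# Same comprehension as in the parametrize decorator: drop shapes whose
# batch size times sequence length reaches 10000.
shapes = [
    (bs, seq_l)
    for bs in [1, 8, 32]
    for seq_l in [16, 128, 256, 384, 512]
    if bs * seq_l < 10000
]
ids = [f"{bs}x{seq_l}" for bs, seq_l in shapes]

# 13 shapes survive: all five sequence lengths for batch sizes 1 and 8,
# and three (16, 128, 256) for batch size 32.
assert len(shapes) == 13
assert (32, 384) not in shapes

# Every parametrized value is a key of the dict, so a membership assert in the
# test body can never fail.
for implementation in implementations.keys():
    assert implementation in implementations
```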

@pommedeterresautee
Member Author

FYI, tests pass

================================================================================= 1896 passed, 88 skipped, 600 warnings in 4114.37s (1:08:34) =================================================================================

return attention_forward_original(*args, **kwargs)


implementations = {
Contributor

@pommedeterresautee Maybe we could use another config style to remove this from the global scope (see https://docs.pytest.org/en/6.2.x/example/parametrize.html#paramexamples ). As it is, it makes things really hard to read IMO, and it will get worse as we add tests.

Member Author

@pommedeterresautee pommedeterresautee Sep 28, 2022

I'm not sure it's harder to read (it's local, and the same pattern is used in all but one test, batched matmul, which is not yet refactored), but I agree that we want to control the number of implementations to test (probably with at least light and full flags), and doing it through the pytest command line is certainly the best way. The same is true for the number of shapes/batch sizes to test.
For that reason, that part of the code would be moved out of the global context, but it's not clear to me what it should look like.

This makes me think it should be done in a dedicated PR. Are you OK with that? If so, I'll write the issue.
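
As a starting point for that follow-up, here is a minimal `conftest.py` sketch of the "light and full flags" idea. `pytest_addoption` and `pytest_generate_tests` are standard pytest hooks; the `--full` flag and the `LIGHT` / `FULL` groupings are assumptions made for illustration, with implementation names borrowed from the benchmark tables above.

```python
# conftest.py (sketch; the flag name and the groupings are hypothetical)

LIGHT = ["baseline", "dynamo_optimized_cuda_graphs"]
FULL = LIGHT + ["onnx", "onnx_optim_fp16"]


def pytest_addoption(parser):
    """Register a command-line flag that enables the full benchmark matrix."""
    parser.addoption(
        "--full",
        action="store_true",
        default=False,
        help="benchmark every implementation instead of the light subset",
    )


def pytest_generate_tests(metafunc):
    """Parametrize any test that requests an `implementation` argument."""
    if "implementation" in metafunc.fixturenames:
        names = FULL if metafunc.config.getoption("--full") else LIGHT
        metafunc.parametrize("implementation", names)
```

With this in place, a test would declare an `implementation` argument without its own `@pytest.mark.parametrize("implementation", ...)` decorator, and `pytest --full` would select the larger list. The same hook could select shapes and batch sizes.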

Labels
benchmark Measure, measure, measure bug Something isn't working
2 participants