Allow distinct K0/K1 values for A/B block descriptor #98

Merged: asroy merged 10 commits into develop from pr-lds-pattern on Feb 28, 2022

@rosenrodt commented Feb 24, 2022

Summary

  • GridwiseGemm_k0mk1_k0nk1_mn_xdlops_v3r1 template parameter change:
    • K0 -> KPerBlock
    • K1 -> AK1/BK1
  • Add conflict-free LDS patterns to FP16 DeviceGemmXdl_C_Shuffle kernel instances
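As a rough illustration of the parameter change, the sketch below shows how a single shared (K0, K1) pair could map onto KPerBlock with separate AK1 and BK1 values. It assumes that KPerBlock equals K0 * K1 and that each operand derives its own K0 as KPerBlock / K1; this is consistent with the instance names in the logs (K0 = 4 in DeviceGemmXdl versus KPerBlock = 32 with K1 = 8 in DeviceGemmXdl_C_Shuffle) but is not stated in the PR. The names `BlockKSplit` and `split_k` are hypothetical helpers, not identifiers from the library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BlockKSplit:
    """Per-operand split of the K tile, with k0 * k1 == KPerBlock (hypothetical helper)."""
    k0: int
    k1: int


def split_k(k_per_block: int, k1: int) -> BlockKSplit:
    # Assumption: K1 must divide KPerBlock evenly so that K0 is an integer.
    if k1 <= 0 or k_per_block % k1 != 0:
        raise ValueError(f"K1={k1} does not divide KPerBlock={k_per_block}")
    return BlockKSplit(k0=k_per_block // k1, k1=k1)


# Old scheme: one (K0, K1) pair shared by A and B, e.g. K0=4, K1=8.
old_k0, old_k1 = 4, 8
k_per_block = old_k0 * old_k1

# New scheme: KPerBlock is given directly; A and B choose K1 independently.
a_split = split_k(k_per_block, k1=8)  # AK1 = 8
b_split = split_k(k_per_block, k1=2)  # BK1 = 2

print(a_split, b_split)
```

Under these assumptions, an instance printed as `<..., 32, 8, 2>` would use a 4 x 8 split for A and a 16 x 2 split for B within the same 32-wide K tile.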

Next action

  • Add FP32 DeviceGemmXdl_C_Shuffle kernel instances

The following test script reports error: 0 for every kernel instance in all four layouts (TT, TN, NT, NN):

#!/bin/bash

echo 'test TT'
./ckProfiler gemm 1 0 1 1 0 1 256 256 256 256 256 256  | grep error -B 1
echo 'test TN'
./ckProfiler gemm 1 1 1 1 0 1 256 256 256 256 256 256  | grep error -B 1
echo 'test NT'
./ckProfiler gemm 1 2 1 1 0 1 256 256 256 256 256 256  | grep error -B 1
echo 'test NN'
./ckProfiler gemm 1 3 1 1 0 1 256 256 256 256 256 256  | grep error -B 1

Output:

test TT
Perf: 0.05184 ms, 0.647269 TFlops, 7.58519 GB/s, DeviceGemmXdl<256, 256, 128, 4>
error: 0
--
Perf: 0.05952 ms, 0.563751 TFlops, 6.60645 GB/s, DeviceGemmXdl<256, 128, 256, 4>
error: 0
--
Perf: 0.053439 ms, 0.627902 TFlops, 7.35822 GB/s, DeviceGemmXdl<128, 128, 128, 4>
error: 0
--
Perf: 0.037439 ms, 0.896243 TFlops, 10.5028 GB/s, DeviceGemmXdl<256, 128, 128, 4>
error: 0
--
Perf: 0.03424 ms, 0.979978 TFlops, 11.4841 GB/s, DeviceGemmXdl<128, 128, 64, 4>
error: 0
--
Perf: 0.03792 ms, 0.884874 TFlops, 10.3696 GB/s, DeviceGemmXdl<128, 64, 128, 4>
error: 0
--
Perf: 0.028959 ms, 1.15869 TFlops, 13.5784 GB/s, DeviceGemmXdl<256, 128, 64, 4>
error: 0
--
Perf: 0.029119 ms, 1.15232 TFlops, 13.5038 GB/s, DeviceGemmXdl<256, 64, 128, 4>
error: 0
--
Perf: 0.052159 ms, 0.64331 TFlops, 7.53879 GB/s, DeviceGemmXdl_C_Shuffle<256, 256, 128, 32, 8, 2>
error: 0
--
Perf: 0.052319 ms, 0.641343 TFlops, 7.51574 GB/s, DeviceGemmXdl_C_Shuffle<256, 256, 128, 32, 8, 8>
error: 0
--
Perf: 0.058239 ms, 0.576151 TFlops, 6.75176 GB/s, DeviceGemmXdl_C_Shuffle<256, 128, 256, 32, 8, 2>
error: 0
--
Perf: 0.0576 ms, 0.582542 TFlops, 6.82667 GB/s, DeviceGemmXdl_C_Shuffle<256, 128, 256, 32, 8, 8>
error: 0
--
Perf: 0.053119 ms, 0.631684 TFlops, 7.40255 GB/s, DeviceGemmXdl_C_Shuffle<128, 128, 128, 32, 8, 2>
error: 0
--
Perf: 0.052 ms, 0.645278 TFlops, 7.56185 GB/s, DeviceGemmXdl_C_Shuffle<128, 128, 128, 32, 8, 8>
error: 0
--
Perf: 0.0392 ms, 0.85598 TFlops, 10.031 GB/s, DeviceGemmXdl_C_Shuffle<256, 128, 128, 32, 8, 2>
error: 0
--
Perf: 0.03808 ms, 0.881156 TFlops, 10.3261 GB/s, DeviceGemmXdl_C_Shuffle<256, 128, 128, 32, 8, 8>
error: 0
--
Perf: 0.0352 ms, 0.953251 TFlops, 11.1709 GB/s, DeviceGemmXdl_C_Shuffle<128, 128, 64, 32, 8, 2>
error: 0
--
Perf: 0.0352 ms, 0.953251 TFlops, 11.1709 GB/s, DeviceGemmXdl_C_Shuffle<128, 128, 64, 32, 8, 8>
error: 0
--
Perf: 0.03744 ms, 0.896219 TFlops, 10.5026 GB/s, DeviceGemmXdl_C_Shuffle<128, 64, 128, 32, 8, 2>
error: 0
--
Perf: 0.03792 ms, 0.884874 TFlops, 10.3696 GB/s, DeviceGemmXdl_C_Shuffle<128, 64, 128, 32, 8, 8>
error: 0
--
Perf: 0.02944 ms, 1.13976 TFlops, 13.3565 GB/s, DeviceGemmXdl_C_Shuffle<256, 128, 64, 32, 8, 2>
error: 0
--
Perf: 0.02976 ms, 1.1275 TFlops, 13.2129 GB/s, DeviceGemmXdl_C_Shuffle<256, 128, 64, 32, 8, 8>
error: 0
--
Perf: 0.0296 ms, 1.1336 TFlops, 13.2843 GB/s, DeviceGemmXdl_C_Shuffle<256, 64, 128, 32, 8, 2>
error: 0
--
Perf: 0.029279 ms, 1.14602 TFlops, 13.43 GB/s, DeviceGemmXdl_C_Shuffle<256, 64, 128, 32, 8, 8>
error: 0
test TN
Perf: 0.0512 ms, 0.65536 TFlops, 7.68 GB/s, DeviceGemmXdl<256, 256, 128, 4>
error: 0
--
Perf: 0.057759 ms, 0.580939 TFlops, 6.80787 GB/s, DeviceGemmXdl<256, 128, 256, 4>
error: 0
--
Perf: 0.05216 ms, 0.643298 TFlops, 7.53865 GB/s, DeviceGemmXdl<128, 128, 128, 4>
error: 0
--
Perf: 0.03744 ms, 0.896219 TFlops, 10.5026 GB/s, DeviceGemmXdl<256, 128, 128, 4>
error: 0
--
Perf: 0.033759 ms, 0.99394 TFlops, 11.6477 GB/s, DeviceGemmXdl<128, 128, 64, 4>
error: 0
--
Perf: 0.035679 ms, 0.940453 TFlops, 11.0209 GB/s, DeviceGemmXdl<128, 64, 128, 4>
error: 0
--
Perf: 0.0328 ms, 1.023 TFlops, 11.9883 GB/s, DeviceGemmXdl<64, 64, 64, 4>
error: 0
--
Perf: 0.02768 ms, 1.21223 TFlops, 14.2058 GB/s, DeviceGemmXdl<256, 128, 64, 4>
error: 0
--
Perf: 0.0288 ms, 1.16508 TFlops, 13.6533 GB/s, DeviceGemmXdl<256, 64, 128, 4>
error: 0
--
Perf: 0.027039 ms, 1.24096 TFlops, 14.5425 GB/s, DeviceGemmXdl<128, 128, 32, 4>
error: 0
--
Perf: 0.02896 ms, 1.15865 TFlops, 13.5779 GB/s, DeviceGemmXdl<128, 32, 128, 4>
error: 0
--
Perf: 0.02608 ms, 1.2866 TFlops, 15.0773 GB/s, DeviceGemmXdl<64, 64, 32, 4>
error: 0
--
Perf: 0.02624 ms, 1.27875 TFlops, 14.9854 GB/s, DeviceGemmXdl<64, 32, 64, 4>
error: 0
--
Perf: 0.05152 ms, 0.651289 TFlops, 7.6323 GB/s, DeviceGemmXdl_C_Shuffle<256, 256, 128, 32, 8, 8>
error: 0
--
Perf: 0.056 ms, 0.599186 TFlops, 7.02171 GB/s, DeviceGemmXdl_C_Shuffle<256, 128, 256, 32, 8, 8>
error: 0
--
Perf: 0.052319 ms, 0.641343 TFlops, 7.51574 GB/s, DeviceGemmXdl_C_Shuffle<128, 128, 128, 32, 8, 8>
error: 0
--
Perf: 0.03696 ms, 0.907858 TFlops, 10.639 GB/s, DeviceGemmXdl_C_Shuffle<256, 128, 128, 32, 8, 8>
error: 0
--
Perf: 0.03216 ms, 1.04336 TFlops, 12.2269 GB/s, DeviceGemmXdl_C_Shuffle<128, 128, 64, 32, 8, 8>
error: 0
--
Perf: 0.0352 ms, 0.953251 TFlops, 11.1709 GB/s, DeviceGemmXdl_C_Shuffle<128, 64, 128, 32, 8, 8>
error: 0
--
Perf: 0.033119 ms, 1.01315 TFlops, 11.8728 GB/s, DeviceGemmXdl_C_Shuffle<64, 64, 64, 32, 8, 8>
error: 0
--
Perf: 0.0272 ms, 1.23362 TFlops, 14.4565 GB/s, DeviceGemmXdl_C_Shuffle<256, 128, 64, 32, 8, 8>
error: 0
--
Perf: 0.02896 ms, 1.15865 TFlops, 13.5779 GB/s, DeviceGemmXdl_C_Shuffle<256, 64, 128, 32, 8, 8>
error: 0
--
Perf: 0.02704 ms, 1.24092 TFlops, 14.542 GB/s, DeviceGemmXdl_C_Shuffle<128, 128, 32, 32, 8, 8>
error: 0
--
Perf: 0.029279 ms, 1.14602 TFlops, 13.43 GB/s, DeviceGemmXdl_C_Shuffle<128, 32, 128, 32, 8, 8>
error: 0
--
Perf: 0.02512 ms, 1.33577 TFlops, 15.6535 GB/s, DeviceGemmXdl_C_Shuffle<64, 64, 32, 32, 8, 8>
error: 0
--
Perf: 0.02608 ms, 1.2866 TFlops, 15.0773 GB/s, DeviceGemmXdl_C_Shuffle<64, 32, 64, 32, 8, 8>
error: 0
test NT
Perf: 0.05776 ms, 0.580929 TFlops, 6.80776 GB/s, DeviceGemmXdl<256, 256, 128, 4>
error: 0
--
Perf: 0.061279 ms, 0.547568 TFlops, 6.41681 GB/s, DeviceGemmXdl<256, 128, 256, 4>
error: 0
--
Perf: 0.0568 ms, 0.590747 TFlops, 6.92282 GB/s, DeviceGemmXdl<128, 128, 128, 4>
error: 0
--
Perf: 0.040479 ms, 0.828934 TFlops, 9.71407 GB/s, DeviceGemmXdl<256, 128, 128, 4>
error: 0
--
Perf: 0.036959 ms, 0.907883 TFlops, 10.6392 GB/s, DeviceGemmXdl<128, 128, 64, 4>
error: 0
--
Perf: 0.037919 ms, 0.884898 TFlops, 10.3699 GB/s, DeviceGemmXdl<128, 64, 128, 4>
error: 0
--
Perf: 0.03072 ms, 1.09227 TFlops, 12.8 GB/s, DeviceGemmXdl<256, 128, 64, 4>
error: 0
--
Perf: 0.031359 ms, 1.07001 TFlops, 12.5392 GB/s, DeviceGemmXdl<256, 64, 128, 4>
error: 0
--
Perf: 0.05968 ms, 0.562239 TFlops, 6.58874 GB/s, DeviceGemmXdl_C_Shuffle<256, 256, 128, 32, 2, 2>
error: 0
--
Perf: 0.056639 ms, 0.592426 TFlops, 6.9425 GB/s, DeviceGemmXdl_C_Shuffle<256, 256, 128, 32, 8, 8>
error: 0
--
Perf: 0.061119 ms, 0.549002 TFlops, 6.43361 GB/s, DeviceGemmXdl_C_Shuffle<256, 128, 256, 32, 2, 4>
error: 0
--
Perf: 0.061279 ms, 0.547568 TFlops, 6.41681 GB/s, DeviceGemmXdl_C_Shuffle<256, 128, 256, 32, 8, 8>
error: 0
--
Perf: 0.05936 ms, 0.56527 TFlops, 6.62426 GB/s, DeviceGemmXdl_C_Shuffle<128, 128, 128, 32, 2, 2>
error: 0
--
Perf: 0.05664 ms, 0.592416 TFlops, 6.94237 GB/s, DeviceGemmXdl_C_Shuffle<128, 128, 128, 32, 8, 8>
error: 0
--
Perf: 0.04112 ms, 0.816012 TFlops, 9.56265 GB/s, DeviceGemmXdl_C_Shuffle<256, 128, 128, 32, 2, 2>
error: 0
--
Perf: 0.04048 ms, 0.828914 TFlops, 9.71383 GB/s, DeviceGemmXdl_C_Shuffle<256, 128, 128, 32, 8, 8>
error: 0
--
Perf: 0.03952 ms, 0.849049 TFlops, 9.9498 GB/s, DeviceGemmXdl_C_Shuffle<128, 128, 64, 32, 2, 2>
error: 0
--
Perf: 0.03696 ms, 0.907858 TFlops, 10.639 GB/s, DeviceGemmXdl_C_Shuffle<128, 128, 64, 32, 8, 8>
error: 0
--
Perf: 0.038559 ms, 0.87021 TFlops, 10.1978 GB/s, DeviceGemmXdl_C_Shuffle<128, 64, 128, 32, 2, 2>
error: 0
--
Perf: 0.038079 ms, 0.881179 TFlops, 10.3263 GB/s, DeviceGemmXdl_C_Shuffle<128, 64, 128, 32, 8, 8>
error: 0
--
Perf: 0.03008 ms, 1.11551 TFlops, 13.0723 GB/s, DeviceGemmXdl_C_Shuffle<256, 128, 64, 32, 2, 2>
error: 0
--
Perf: 0.03088 ms, 1.08661 TFlops, 12.7337 GB/s, DeviceGemmXdl_C_Shuffle<256, 128, 64, 32, 8, 8>
error: 0
--
Perf: 0.031199 ms, 1.0755 TFlops, 12.6035 GB/s, DeviceGemmXdl_C_Shuffle<256, 64, 128, 32, 2, 2>
error: 0
--
Perf: 0.03136 ms, 1.06998 TFlops, 12.5388 GB/s, DeviceGemmXdl_C_Shuffle<256, 64, 128, 32, 8, 8>
error: 0
test NN
Perf: 0.0568 ms, 0.590747 TFlops, 6.92282 GB/s, DeviceGemmXdl<256, 256, 128, 4>
error: 0
--
Perf: 0.0608 ms, 0.551882 TFlops, 6.46737 GB/s, DeviceGemmXdl<256, 128, 256, 4>
error: 0
--
Perf: 0.05536 ms, 0.606113 TFlops, 7.10289 GB/s, DeviceGemmXdl<128, 128, 128, 4>
error: 0
--
Perf: 0.039519 ms, 0.849071 TFlops, 9.95005 GB/s, DeviceGemmXdl<256, 128, 128, 4>
error: 0
--
Perf: 0.03504 ms, 0.957604 TFlops, 11.2219 GB/s, DeviceGemmXdl<128, 128, 64, 4>
error: 0
--
Perf: 0.0368 ms, 0.911805 TFlops, 10.6852 GB/s, DeviceGemmXdl<128, 64, 128, 4>
error: 0
--
Perf: 0.02784 ms, 1.20526 TFlops, 14.1241 GB/s, DeviceGemmXdl<256, 128, 64, 4>
error: 0
--
Perf: 0.02944 ms, 1.13976 TFlops, 13.3565 GB/s, DeviceGemmXdl<256, 64, 128, 4>
error: 0
--
Perf: 0.05696 ms, 0.589088 TFlops, 6.90337 GB/s, DeviceGemmXdl_C_Shuffle<256, 256, 128, 32, 2, 8>
error: 0
--
Perf: 0.05552 ms, 0.604367 TFlops, 7.08242 GB/s, DeviceGemmXdl_C_Shuffle<256, 256, 128, 32, 8, 8>
error: 0
--
Perf: 0.061919 ms, 0.541909 TFlops, 6.35049 GB/s, DeviceGemmXdl_C_Shuffle<256, 128, 256, 32, 2, 8>
error: 0
--
Perf: 0.05792 ms, 0.579324 TFlops, 6.78895 GB/s, DeviceGemmXdl_C_Shuffle<256, 128, 256, 32, 8, 8>
error: 0
--
Perf: 0.05664 ms, 0.592416 TFlops, 6.94237 GB/s, DeviceGemmXdl_C_Shuffle<128, 128, 128, 32, 2, 8>
error: 0
--
Perf: 0.054239 ms, 0.61864 TFlops, 7.24969 GB/s, DeviceGemmXdl_C_Shuffle<128, 128, 128, 32, 8, 8>
error: 0
--
Perf: 0.03728 ms, 0.900065 TFlops, 10.5476 GB/s, DeviceGemmXdl_C_Shuffle<256, 128, 128, 32, 2, 8>
error: 0
--
Perf: 0.03888 ms, 0.863025 TFlops, 10.1136 GB/s, DeviceGemmXdl_C_Shuffle<256, 128, 128, 32, 8, 8>
error: 0
--
Perf: 0.03488 ms, 0.961996 TFlops, 11.2734 GB/s, DeviceGemmXdl_C_Shuffle<128, 128, 64, 32, 2, 8>
error: 0
--
Perf: 0.03408 ms, 0.984578 TFlops, 11.538 GB/s, DeviceGemmXdl_C_Shuffle<128, 128, 64, 32, 8, 8>
error: 0
--
Perf: 0.03648 ms, 0.919804 TFlops, 10.7789 GB/s, DeviceGemmXdl_C_Shuffle<128, 64, 128, 32, 2, 8>
error: 0
--
Perf: 0.037119 ms, 0.903969 TFlops, 10.5934 GB/s, DeviceGemmXdl_C_Shuffle<128, 64, 128, 32, 8, 8>
error: 0
--
Perf: 0.02896 ms, 1.15865 TFlops, 13.5779 GB/s, DeviceGemmXdl_C_Shuffle<256, 128, 64, 32, 2, 8>
error: 0
--
Perf: 0.0288 ms, 1.16508 TFlops, 13.6533 GB/s, DeviceGemmXdl_C_Shuffle<256, 128, 64, 32, 8, 8>
error: 0
--
Perf: 0.029599 ms, 1.13363 TFlops, 13.2848 GB/s, DeviceGemmXdl_C_Shuffle<256, 64, 128, 32, 2, 8>
error: 0
--
Perf: 0.02976 ms, 1.1275 TFlops, 13.2129 GB/s, DeviceGemmXdl_C_Shuffle<256, 64, 128, 32, 8, 8>
error: 0
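
Scanning several hundred lines by eye for a non-zero error is easy to get wrong, so the check can be automated. The following is a minimal sketch, assuming the log keeps the `Perf: ...` / `error: ...` line pairing shown above; the function name `failing_kernels` is hypothetical and not part of ckProfiler.

```python
import re

# Matches a "Perf:" line and captures the trailing kernel instance name.
PERF_RE = re.compile(r"^Perf: .*?, (?P<kernel>Device\w+<[^>]*>)\s*$")
# Matches an "error:" line and captures the reported value.
ERROR_RE = re.compile(r"^error:\s*(?P<value>\S+)\s*$")


def failing_kernels(log_text: str) -> list:
    """Return the kernel names whose reported error is not exactly zero."""
    failures = []
    kernel = None
    for line in log_text.splitlines():
        perf = PERF_RE.match(line)
        if perf:
            kernel = perf.group("kernel")
            continue
        err = ERROR_RE.match(line)
        if err and float(err.group("value")) != 0.0:
            failures.append(kernel or "<unknown kernel>")
    return failures


sample = """\
test TT
Perf: 0.05184 ms, 0.647269 TFlops, 7.58519 GB/s, DeviceGemmXdl<256, 256, 128, 4>
error: 0
--
Perf: 0.052159 ms, 0.64331 TFlops, 7.53879 GB/s, DeviceGemmXdl_C_Shuffle<256, 256, 128, 32, 8, 2>
error: 0
"""

print(failing_kernels(sample))
```

An empty list means every instance in the pasted log verified cleanly.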

Performance data on MI100 at 1087 MHz

+ echo 'bench TT'
bench TT
+ ./ckProfiler gemm 1 0 0 1 0 10 7680 8192 8192 8256 8256 8256
+ grep Perf
Perf: 8.81146 ms, 116.983 TFlops, 42.8405 GB/s, DeviceGemmXdl<256, 256, 128, 4>
Perf: 10.0394 ms, 102.675 TFlops, 37.6007 GB/s, DeviceGemmXdl<256, 128, 256, 4>
Perf: 12.595 ms, 81.8412 TFlops, 29.9711 GB/s, DeviceGemmXdl<128, 128, 128, 4>
Perf: 9.94777 ms, 103.62 TFlops, 37.9469 GB/s, DeviceGemmXdl<256, 128, 128, 4>
Perf: 13.5126 ms, 76.2837 TFlops, 27.9359 GB/s, DeviceGemmXdl<128, 128, 64, 4>
Perf: 18.7978 ms, 54.8358 TFlops, 20.0815 GB/s, DeviceGemmXdl<128, 64, 128, 4>
Perf: 13.4199 ms, 76.8106 TFlops, 28.1289 GB/s, DeviceGemmXdl<256, 128, 64, 4>
Perf: 17.7388 ms, 58.1096 TFlops, 21.2804 GB/s, DeviceGemmXdl<256, 64, 128, 4>
Perf: 8.94279 ms, 115.265 TFlops, 42.2114 GB/s, DeviceGemmXdl_C_Shuffle<256, 256, 128, 32, 8, 2>
Perf: 8.84072 ms, 116.596 TFlops, 42.6987 GB/s, DeviceGemmXdl_C_Shuffle<256, 256, 128, 32, 8, 8>
Perf: 9.76063 ms, 105.607 TFlops, 38.6745 GB/s, DeviceGemmXdl_C_Shuffle<256, 128, 256, 32, 8, 2>
Perf: 10.1106 ms, 101.951 TFlops, 37.3357 GB/s, DeviceGemmXdl_C_Shuffle<256, 128, 256, 32, 8, 8>
Perf: 12.5257 ms, 82.2944 TFlops, 30.1371 GB/s, DeviceGemmXdl_C_Shuffle<128, 128, 128, 32, 8, 2>
Perf: 12.9219 ms, 79.7708 TFlops, 29.2129 GB/s, DeviceGemmXdl_C_Shuffle<128, 128, 128, 32, 8, 8>
Perf: 9.89129 ms, 104.212 TFlops, 38.1636 GB/s, DeviceGemmXdl_C_Shuffle<256, 128, 128, 32, 8, 2>
Perf: 9.92436 ms, 103.865 TFlops, 38.0364 GB/s, DeviceGemmXdl_C_Shuffle<256, 128, 128, 32, 8, 8>
Perf: 13.4806 ms, 76.4651 TFlops, 28.0024 GB/s, DeviceGemmXdl_C_Shuffle<128, 128, 64, 32, 8, 2>
Perf: 13.5133 ms, 76.2796 TFlops, 27.9344 GB/s, DeviceGemmXdl_C_Shuffle<128, 128, 64, 32, 8, 8>
Perf: 18.86 ms, 54.6549 TFlops, 20.0152 GB/s, DeviceGemmXdl_C_Shuffle<128, 64, 128, 32, 8, 2>
Perf: 18.8581 ms, 54.6604 TFlops, 20.0172 GB/s, DeviceGemmXdl_C_Shuffle<128, 64, 128, 32, 8, 8>
Perf: 13.2743 ms, 77.653 TFlops, 28.4374 GB/s, DeviceGemmXdl_C_Shuffle<256, 128, 64, 32, 8, 2>
Perf: 13.5609 ms, 76.012 TFlops, 27.8364 GB/s, DeviceGemmXdl_C_Shuffle<256, 128, 64, 32, 8, 8>
Perf: 17.9333 ms, 57.4792 TFlops, 21.0495 GB/s, DeviceGemmXdl_C_Shuffle<256, 64, 128, 32, 8, 2>
Perf: 17.7393 ms, 58.1079 TFlops, 21.2797 GB/s, DeviceGemmXdl_C_Shuffle<256, 64, 128, 32, 8, 8>
Best Perf: 8.81146 ms, 116.983 TFlops, 42.8405 GB/s, DeviceGemmXdl<256, 256, 128, 4>
+ ./ckProfiler gemm 1 0 0 1 0 10 3840 4096 4096 8256 8256 8256
+ grep Perf
Perf: 1.16151 ms, 110.932 TFlops, 81.2493 GB/s, DeviceGemmXdl<256, 256, 128, 4>
Perf: 1.33431 ms, 96.5662 TFlops, 70.7272 GB/s, DeviceGemmXdl<256, 128, 256, 4>
Perf: 1.76221 ms, 73.118 TFlops, 53.5532 GB/s, DeviceGemmXdl<128, 128, 128, 4>
Perf: 1.22928 ms, 104.816 TFlops, 76.7697 GB/s, DeviceGemmXdl<256, 128, 128, 4>
Perf: 1.82074 ms, 70.7674 TFlops, 51.8316 GB/s, DeviceGemmXdl<128, 128, 64, 4>
Perf: 2.04128 ms, 63.1218 TFlops, 46.2318 GB/s, DeviceGemmXdl<128, 64, 128, 4>
Perf: 1.71976 ms, 74.9227 TFlops, 54.875 GB/s, DeviceGemmXdl<256, 128, 64, 4>
Perf: 1.78357 ms, 72.2423 TFlops, 52.9118 GB/s, DeviceGemmXdl<256, 64, 128, 4>
Perf: 1.18432 ms, 108.795 TFlops, 79.6841 GB/s, DeviceGemmXdl_C_Shuffle<256, 256, 128, 32, 8, 2>
Perf: 1.16983 ms, 110.143 TFlops, 80.6715 GB/s, DeviceGemmXdl_C_Shuffle<256, 256, 128, 32, 8, 8>
Perf: 1.30389 ms, 98.8188 TFlops, 72.377 GB/s, DeviceGemmXdl_C_Shuffle<256, 128, 256, 32, 8, 2>
Perf: 1.35039 ms, 95.4163 TFlops, 69.885 GB/s, DeviceGemmXdl_C_Shuffle<256, 128, 256, 32, 8, 8>
Perf: 1.7597 ms, 73.2223 TFlops, 53.6297 GB/s, DeviceGemmXdl_C_Shuffle<128, 128, 128, 32, 8, 2>
Perf: 1.80163 ms, 71.518 TFlops, 52.3814 GB/s, DeviceGemmXdl_C_Shuffle<128, 128, 128, 32, 8, 8>
Perf: 1.2324 ms, 104.551 TFlops, 76.5754 GB/s, DeviceGemmXdl_C_Shuffle<256, 128, 128, 32, 8, 2>
Perf: 1.23823 ms, 104.059 TFlops, 76.2152 GB/s, DeviceGemmXdl_C_Shuffle<256, 128, 128, 32, 8, 8>
Perf: 1.81013 ms, 71.1821 TFlops, 52.1353 GB/s, DeviceGemmXdl_C_Shuffle<128, 128, 64, 32, 8, 2>
Perf: 1.81734 ms, 70.8997 TFlops, 51.9285 GB/s, DeviceGemmXdl_C_Shuffle<128, 128, 64, 32, 8, 8>
Perf: 2.04043 ms, 63.148 TFlops, 46.251 GB/s, DeviceGemmXdl_C_Shuffle<128, 64, 128, 32, 8, 2>
Perf: 2.04367 ms, 63.0479 TFlops, 46.1777 GB/s, DeviceGemmXdl_C_Shuffle<128, 64, 128, 32, 8, 8>
Perf: 1.72176 ms, 74.8357 TFlops, 54.8113 GB/s, DeviceGemmXdl_C_Shuffle<256, 128, 64, 32, 8, 2>
Perf: 1.7192 ms, 74.9471 TFlops, 54.8929 GB/s, DeviceGemmXdl_C_Shuffle<256, 128, 64, 32, 8, 8>
Perf: 1.78614 ms, 72.1382 TFlops, 52.8356 GB/s, DeviceGemmXdl_C_Shuffle<256, 64, 128, 32, 8, 2>
Perf: 1.79425 ms, 71.8123 TFlops, 52.5969 GB/s, DeviceGemmXdl_C_Shuffle<256, 64, 128, 32, 8, 8>
Best Perf: 1.16151 ms, 110.932 TFlops, 81.2493 GB/s, DeviceGemmXdl<256, 256, 128, 4>
+ ./ckProfiler gemm 1 0 0 1 0 10 1920 2048 4096 8256 8256 8256
+ grep Perf
Perf: 0.460075 ms, 70.0152 TFlops, 85.4678 GB/s, DeviceGemmXdl<256, 128, 256, 4>
Perf: 0.407932 ms, 78.9648 TFlops, 96.3926 GB/s, DeviceGemmXdl<128, 128, 128, 4>
Perf: 0.35422 ms, 90.9385 TFlops, 111.009 GB/s, DeviceGemmXdl<256, 128, 128, 4>
Perf: 0.418396 ms, 76.9899 TFlops, 93.9819 GB/s, DeviceGemmXdl<128, 128, 64, 4>
Perf: 0.450219 ms, 71.5479 TFlops, 87.3388 GB/s, DeviceGemmXdl<128, 64, 128, 4>
Perf: 0.426492 ms, 75.5285 TFlops, 92.1978 GB/s, DeviceGemmXdl<256, 128, 64, 4>
Perf: 0.459163 ms, 70.1543 TFlops, 85.6375 GB/s, DeviceGemmXdl<256, 64, 128, 4>
Perf: 0.449979 ms, 71.5861 TFlops, 87.3854 GB/s, DeviceGemmXdl_C_Shuffle<256, 128, 256, 32, 8, 2>
Perf: 0.463275 ms, 69.5316 TFlops, 84.8774 GB/s, DeviceGemmXdl_C_Shuffle<256, 128, 256, 32, 8, 8>
Perf: 0.402364 ms, 80.0575 TFlops, 97.7265 GB/s, DeviceGemmXdl_C_Shuffle<128, 128, 128, 32, 8, 2>
Perf: 0.41726 ms, 77.1996 TFlops, 94.2377 GB/s, DeviceGemmXdl_C_Shuffle<128, 128, 128, 32, 8, 8>
Perf: 0.365948 ms, 88.0241 TFlops, 107.451 GB/s, DeviceGemmXdl_C_Shuffle<256, 128, 128, 32, 8, 2>
Perf: 0.35893 ms, 89.7452 TFlops, 109.552 GB/s, DeviceGemmXdl_C_Shuffle<256, 128, 128, 32, 8, 8>
Perf: 0.414988 ms, 77.6222 TFlops, 94.7537 GB/s, DeviceGemmXdl_C_Shuffle<128, 128, 64, 32, 8, 2>
Perf: 0.421228 ms, 76.4723 TFlops, 93.35 GB/s, DeviceGemmXdl_C_Shuffle<128, 128, 64, 32, 8, 8>
Perf: 0.444379 ms, 72.4882 TFlops, 88.4866 GB/s, DeviceGemmXdl_C_Shuffle<128, 64, 128, 32, 8, 2>
Perf: 0.452891 ms, 71.1258 TFlops, 86.8235 GB/s, DeviceGemmXdl_C_Shuffle<128, 64, 128, 32, 8, 8>
Perf: 0.421196 ms, 76.4782 TFlops, 93.3571 GB/s, DeviceGemmXdl_C_Shuffle<256, 128, 64, 32, 8, 2>
Perf: 0.433307 ms, 74.3404 TFlops, 90.7476 GB/s, DeviceGemmXdl_C_Shuffle<256, 128, 64, 32, 8, 8>
Perf: 0.450811 ms, 71.454 TFlops, 87.2241 GB/s, DeviceGemmXdl_C_Shuffle<256, 64, 128, 32, 8, 2>
Perf: 0.462187 ms, 69.6953 TFlops, 85.0772 GB/s, DeviceGemmXdl_C_Shuffle<256, 64, 128, 32, 8, 8>
Best Perf: 0.35422 ms, 90.9385 TFlops, 111.009 GB/s, DeviceGemmXdl<256, 128, 128, 4>
+ echo 'bench TN'
bench TN
+ ./ckProfiler gemm 1 1 0 1 0 10 7680 8192 8192 8256 8256 8256
+ grep Perf
Perf: 8.53975 ms, 120.705 TFlops, 44.2035 GB/s, DeviceGemmXdl<256, 256, 128, 4>
Perf: 9.39694 ms, 109.694 TFlops, 40.1713 GB/s, DeviceGemmXdl<256, 128, 256, 4>
Perf: 12.8086 ms, 80.4765 TFlops, 29.4714 GB/s, DeviceGemmXdl<128, 128, 128, 4>
Perf: 11.1946 ms, 92.079 TFlops, 33.7203 GB/s, DeviceGemmXdl<256, 128, 128, 4>
Perf: 14.4221 ms, 71.4729 TFlops, 26.1741 GB/s, DeviceGemmXdl<128, 128, 64, 4>
Perf: 22.6651 ms, 45.4792 TFlops, 16.655 GB/s, DeviceGemmXdl<128, 64, 128, 4>
Perf: 23.3338 ms, 44.176 TFlops, 16.1777 GB/s, DeviceGemmXdl<64, 64, 64, 4>
Perf: 13.8274 ms, 74.5469 TFlops, 27.2999 GB/s, DeviceGemmXdl<256, 128, 64, 4>
Perf: 22.9344 ms, 44.9452 TFlops, 16.4594 GB/s, DeviceGemmXdl<256, 64, 128, 4>
Perf: 20.6504 ms, 49.9164 TFlops, 18.2799 GB/s, DeviceGemmXdl<128, 128, 32, 4>
Perf: 40.0452 ms, 25.7407 TFlops, 9.42654 GB/s, DeviceGemmXdl<128, 32, 128, 4>
Perf: 27.9584 ms, 36.8688 TFlops, 13.5018 GB/s, DeviceGemmXdl<64, 64, 32, 4>
Perf: 43.2153 ms, 23.8525 TFlops, 8.73505 GB/s, DeviceGemmXdl<64, 32, 64, 4>
Perf: 8.53809 ms, 120.729 TFlops, 44.2122 GB/s, DeviceGemmXdl_C_Shuffle<256, 256, 128, 32, 8, 8>
Perf: 9.28386 ms, 111.031 TFlops, 40.6606 GB/s, DeviceGemmXdl_C_Shuffle<256, 128, 256, 32, 8, 8>
Perf: 12.7737 ms, 80.6965 TFlops, 29.5519 GB/s, DeviceGemmXdl_C_Shuffle<128, 128, 128, 32, 8, 8>
Perf: 10.9373 ms, 94.2453 TFlops, 34.5137 GB/s, DeviceGemmXdl_C_Shuffle<256, 128, 128, 32, 8, 8>
Perf: 14.4939 ms, 71.119 TFlops, 26.0446 GB/s, DeviceGemmXdl_C_Shuffle<128, 128, 64, 32, 8, 8>
Perf: 22.6185 ms, 45.5731 TFlops, 16.6894 GB/s, DeviceGemmXdl_C_Shuffle<128, 64, 128, 32, 8, 8>
Perf: 23.3687 ms, 44.1099 TFlops, 16.1535 GB/s, DeviceGemmXdl_C_Shuffle<64, 64, 64, 32, 8, 8>
Perf: 13.7725 ms, 74.8441 TFlops, 27.4087 GB/s, DeviceGemmXdl_C_Shuffle<256, 128, 64, 32, 8, 8>
Perf: 23.0396 ms, 44.7401 TFlops, 16.3843 GB/s, DeviceGemmXdl_C_Shuffle<256, 64, 128, 32, 8, 8>
Perf: 20.638 ms, 49.9464 TFlops, 18.2909 GB/s, DeviceGemmXdl_C_Shuffle<128, 128, 32, 32, 8, 8>
Perf: 40.0551 ms, 25.7343 TFlops, 9.4242 GB/s, DeviceGemmXdl_C_Shuffle<128, 32, 128, 32, 8, 8>
Perf: 27.8843 ms, 36.9667 TFlops, 13.5376 GB/s, DeviceGemmXdl_C_Shuffle<64, 64, 32, 32, 8, 8>
Perf: 43.2246 ms, 23.8473 TFlops, 8.73315 GB/s, DeviceGemmXdl_C_Shuffle<64, 32, 64, 32, 8, 8>
Best Perf: 8.53809 ms, 120.729 TFlops, 44.2122 GB/s, DeviceGemmXdl_C_Shuffle<256, 256, 128, 32, 8, 8>
+ ./ckProfiler gemm 1 1 0 1 0 10 3840 4096 4096 8256 8256 8256
+ grep Perf
Perf: 1.12536 ms, 114.495 TFlops, 83.8589 GB/s, DeviceGemmXdl<256, 256, 128, 4>
Perf: 1.19855 ms, 107.504 TFlops, 78.7385 GB/s, DeviceGemmXdl<256, 128, 256, 4>
Perf: 1.73579 ms, 74.2307 TFlops, 54.3682 GB/s, DeviceGemmXdl<128, 128, 128, 4>
Perf: 1.14896 ms, 112.144 TFlops, 82.1364 GB/s, DeviceGemmXdl<256, 128, 128, 4>
Perf: 1.89134 ms, 68.1256 TFlops, 49.8967 GB/s, DeviceGemmXdl<128, 128, 64, 4>
Perf: 2.25372 ms, 57.1717 TFlops, 41.8738 GB/s, DeviceGemmXdl<128, 64, 128, 4>
Perf: 2.77072 ms, 46.5037 TFlops, 34.0604 GB/s, DeviceGemmXdl<64, 64, 64, 4>
Perf: 2.01953 ms, 63.8016 TFlops, 46.7297 GB/s, DeviceGemmXdl<256, 128, 64, 4>
Perf: 2.31641 ms, 55.6245 TFlops, 40.7406 GB/s, DeviceGemmXdl<256, 64, 128, 4>
Perf: 2.77335 ms, 46.4597 TFlops, 34.0281 GB/s, DeviceGemmXdl<128, 128, 32, 4>
Perf: 4.0992 ms, 31.4327 TFlops, 23.022 GB/s, DeviceGemmXdl<128, 32, 128, 4>
Perf: 3.79188 ms, 33.9802 TFlops, 24.8879 GB/s, DeviceGemmXdl<64, 64, 32, 4>
Perf: 5.01201 ms, 25.7081 TFlops, 18.8291 GB/s, DeviceGemmXdl<64, 32, 64, 4>
Perf: 1.1285 ms, 114.177 TFlops, 83.6259 GB/s, DeviceGemmXdl_C_Shuffle<256, 256, 128, 32, 8, 8>
Perf: 1.2021 ms, 107.187 TFlops, 78.5058 GB/s, DeviceGemmXdl_C_Shuffle<256, 128, 256, 32, 8, 8>
Perf: 1.71614 ms, 75.0806 TFlops, 54.9907 GB/s, DeviceGemmXdl_C_Shuffle<128, 128, 128, 32, 8, 8>
Perf: 1.15525 ms, 111.533 TFlops, 81.6894 GB/s, DeviceGemmXdl_C_Shuffle<256, 128, 128, 32, 8, 8>
Perf: 1.87371 ms, 68.7669 TFlops, 50.3664 GB/s, DeviceGemmXdl_C_Shuffle<128, 128, 64, 32, 8, 8>
Perf: 2.22929 ms, 57.7982 TFlops, 42.3327 GB/s, DeviceGemmXdl_C_Shuffle<128, 64, 128, 32, 8, 8>
Perf: 2.75242 ms, 46.813 TFlops, 34.2869 GB/s, DeviceGemmXdl_C_Shuffle<64, 64, 64, 32, 8, 8>
Perf: 2.01457 ms, 63.9585 TFlops, 46.8446 GB/s, DeviceGemmXdl_C_Shuffle<256, 128, 64, 32, 8, 8>
Perf: 2.33225 ms, 55.2467 TFlops, 40.4639 GB/s, DeviceGemmXdl_C_Shuffle<256, 64, 128, 32, 8, 8>
Perf: 2.78493 ms, 46.2665 TFlops, 33.8866 GB/s, DeviceGemmXdl_C_Shuffle<128, 128, 32, 32, 8, 8>
Perf: 4.12289 ms, 31.2521 TFlops, 22.8897 GB/s, DeviceGemmXdl_C_Shuffle<128, 32, 128, 32, 8, 8>
Perf: 3.80907 ms, 33.8269 TFlops, 24.7756 GB/s, DeviceGemmXdl_C_Shuffle<64, 64, 32, 32, 8, 8>
Perf: 4.9585 ms, 25.9855 TFlops, 19.0323 GB/s, DeviceGemmXdl_C_Shuffle<64, 32, 64, 32, 8, 8>
Best Perf: 1.12536 ms, 114.495 TFlops, 83.8589 GB/s, DeviceGemmXdl<256, 256, 128, 4>
+ ./ckProfiler gemm 1 1 0 1 0 10 1920 2048 4096 8256 8256 8256
+ grep Perf
Perf: 0.438187 ms, 73.5125 TFlops, 89.737 GB/s, DeviceGemmXdl<256, 128, 256, 4>
Perf: 0.58641 ms, 54.9313 TFlops, 67.0548 GB/s, DeviceGemmXdl<128, 128, 128, 4>
Perf: 0.33622 ms, 95.807 TFlops, 116.952 GB/s, DeviceGemmXdl<256, 128, 128, 4>
Perf: 0.439387 ms, 73.3118 TFlops, 89.4919 GB/s, DeviceGemmXdl<128, 128, 64, 4>
Perf: 0.420348 ms, 76.6324 TFlops, 93.5455 GB/s, DeviceGemmXdl<128, 64, 128, 4>
Perf: 0.911718 ms, 35.3314 TFlops, 43.1291 GB/s, DeviceGemmXdl<64, 64, 64, 4>
Perf: 0.485995 ms, 66.2811 TFlops, 80.9095 GB/s, DeviceGemmXdl<256, 128, 64, 4>
Perf: 0.425051 ms, 75.7844 TFlops, 92.5102 GB/s, DeviceGemmXdl<256, 64, 128, 4>
Perf: 0.944038 ms, 34.1218 TFlops, 41.6526 GB/s, DeviceGemmXdl<128, 128, 32, 4>
Perf: 0.702437 ms, 45.8579 TFlops, 55.9789 GB/s, DeviceGemmXdl<128, 32, 128, 4>
Perf: 1.53158 ms, 21.032 TFlops, 25.6738 GB/s, DeviceGemmXdl<64, 64, 32, 4>
Perf: 1.27813 ms, 25.2026 TFlops, 30.7649 GB/s, DeviceGemmXdl<64, 32, 64, 4>
Perf: 0.409947 ms, 78.5765 TFlops, 95.9186 GB/s, DeviceGemmXdl_C_Shuffle<256, 128, 256, 32, 8, 8>
Perf: 0.585482 ms, 55.0184 TFlops, 67.1611 GB/s, DeviceGemmXdl_C_Shuffle<128, 128, 128, 32, 8, 8>
Perf: 0.340668 ms, 94.5561 TFlops, 115.425 GB/s, DeviceGemmXdl_C_Shuffle<256, 128, 128, 32, 8, 8>
Perf: 0.436843 ms, 73.7387 TFlops, 90.0131 GB/s, DeviceGemmXdl_C_Shuffle<128, 128, 64, 32, 8, 8>
Perf: 0.418571 ms, 76.9576 TFlops, 93.9424 GB/s, DeviceGemmXdl_C_Shuffle<128, 64, 128, 32, 8, 8>
Perf: 0.894448 ms, 36.0135 TFlops, 43.9618 GB/s, DeviceGemmXdl_C_Shuffle<64, 64, 64, 32, 8, 8>
Perf: 0.486331 ms, 66.2353 TFlops, 80.8536 GB/s, DeviceGemmXdl_C_Shuffle<256, 128, 64, 32, 8, 8>
Perf: 0.421324 ms, 76.4549 TFlops, 93.3287 GB/s, DeviceGemmXdl_C_Shuffle<256, 64, 128, 32, 8, 8>
Perf: 0.948438 ms, 33.9635 TFlops, 41.4593 GB/s, DeviceGemmXdl_C_Shuffle<128, 128, 32, 32, 8, 8>
Perf: 0.709209 ms, 45.42 TFlops, 55.4443 GB/s, DeviceGemmXdl_C_Shuffle<128, 32, 128, 32, 8, 8>
Perf: 1.56608 ms, 20.5687 TFlops, 25.1083 GB/s, DeviceGemmXdl_C_Shuffle<64, 64, 32, 32, 8, 8>
Perf: 1.29266 ms, 24.9194 TFlops, 30.4192 GB/s, DeviceGemmXdl_C_Shuffle<64, 32, 64, 32, 8, 8>
Best Perf: 0.33622 ms, 95.807 TFlops, 116.952 GB/s, DeviceGemmXdl<256, 128, 128, 4>
+ echo 'bench NT'
bench NT
+ ./ckProfiler gemm 1 2 0 1 0 10 7680 8192 8192 8256 8256 8256
+ grep Perf
Perf: 9.55829 ms, 107.843 TFlops, 39.4932 GB/s, DeviceGemmXdl<256, 256, 128, 4>
Perf: 10.1058 ms, 102 TFlops, 37.3535 GB/s, DeviceGemmXdl<256, 128, 256, 4>
Perf: 13.2599 ms, 77.7377 TFlops, 28.4684 GB/s, DeviceGemmXdl<128, 128, 128, 4>
Perf: 9.76759 ms, 105.532 TFlops, 38.6469 GB/s, DeviceGemmXdl<256, 128, 128, 4>
Perf: 13.942 ms, 73.9344 TFlops, 27.0756 GB/s, DeviceGemmXdl<128, 128, 64, 4>
Perf: 18.9461 ms, 54.4066 TFlops, 19.9243 GB/s, DeviceGemmXdl<128, 64, 128, 4>
Perf: 13.9042 ms, 74.1353 TFlops, 27.1492 GB/s, DeviceGemmXdl<256, 128, 64, 4>
Perf: 17.4749 ms, 58.9869 TFlops, 21.6017 GB/s, DeviceGemmXdl<256, 64, 128, 4>
Perf: 9.73505 ms, 105.885 TFlops, 38.7761 GB/s, DeviceGemmXdl_C_Shuffle<256, 256, 128, 32, 2, 2>
Perf: 9.55442 ms, 107.886 TFlops, 39.5092 GB/s, DeviceGemmXdl_C_Shuffle<256, 256, 128, 32, 8, 8>
Perf: 10.062 ms, 102.444 TFlops, 37.5161 GB/s, DeviceGemmXdl_C_Shuffle<256, 128, 256, 32, 2, 4>
Perf: 10.1078 ms, 101.98 TFlops, 37.3461 GB/s, DeviceGemmXdl_C_Shuffle<256, 128, 256, 32, 8, 8>
Perf: 10.5806 ms, 97.423 TFlops, 35.6774 GB/s, DeviceGemmXdl_C_Shuffle<128, 128, 128, 32, 2, 2>
Perf: 13.2864 ms, 77.5823 TFlops, 28.4115 GB/s, DeviceGemmXdl_C_Shuffle<128, 128, 128, 32, 8, 8>
Perf: 9.7785 ms, 105.414 TFlops, 38.6038 GB/s, DeviceGemmXdl_C_Shuffle<256, 128, 128, 32, 2, 2>
Perf: 9.78808 ms, 105.311 TFlops, 38.566 GB/s, DeviceGemmXdl_C_Shuffle<256, 128, 128, 32, 8, 8>
Perf: 13.8754 ms, 74.2894 TFlops, 27.2056 GB/s, DeviceGemmXdl_C_Shuffle<128, 128, 64, 32, 2, 2>
Perf: 13.9321 ms, 73.987 TFlops, 27.0948 GB/s, DeviceGemmXdl_C_Shuffle<128, 128, 64, 32, 8, 8>
Perf: 18.8833 ms, 54.5875 TFlops, 19.9905 GB/s, DeviceGemmXdl_C_Shuffle<128, 64, 128, 32, 2, 2>
Perf: 18.9072 ms, 54.5186 TFlops, 19.9653 GB/s, DeviceGemmXdl_C_Shuffle<128, 64, 128, 32, 8, 8>
Perf: 13.8154 ms, 74.6117 TFlops, 27.3236 GB/s, DeviceGemmXdl_C_Shuffle<256, 128, 64, 32, 2, 2>
Perf: 13.9523 ms, 73.8799 TFlops, 27.0556 GB/s, DeviceGemmXdl_C_Shuffle<256, 128, 64, 32, 8, 8>
Perf: 17.748 ms, 58.0794 TFlops, 21.2693 GB/s, DeviceGemmXdl_C_Shuffle<256, 64, 128, 32, 2, 2>
Perf: 17.5161 ms, 58.8482 TFlops, 21.5509 GB/s, DeviceGemmXdl_C_Shuffle<256, 64, 128, 32, 8, 8>
Best Perf: 9.55442 ms, 107.886 TFlops, 39.5092 GB/s, DeviceGemmXdl_C_Shuffle<256, 256, 128, 32, 8, 8>
+ ./ckProfiler gemm 1 2 0 1 0 10 3840 4096 4096 8256 8256 8256
+ grep Perf
Perf: 1.27371 ms, 101.16 TFlops, 74.0918 GB/s, DeviceGemmXdl<256, 256, 128, 4>
Perf: 1.34821 ms, 95.5704 TFlops, 69.9979 GB/s, DeviceGemmXdl<256, 128, 256, 4>
Perf: 1.87259 ms, 68.808 TFlops, 50.3965 GB/s, DeviceGemmXdl<128, 128, 128, 4>
Perf: 1.27488 ms, 101.067 TFlops, 74.0239 GB/s, DeviceGemmXdl<256, 128, 128, 4>
Perf: 1.88094 ms, 68.5024 TFlops, 50.1727 GB/s, DeviceGemmXdl<128, 128, 64, 4>
Perf: 2.07319 ms, 62.15 TFlops, 45.52 GB/s, DeviceGemmXdl<128, 64, 128, 4>
Perf: 1.7828 ms, 72.2735 TFlops, 52.9347 GB/s, DeviceGemmXdl<256, 128, 64, 4>
Perf: 1.78112 ms, 72.3417 TFlops, 52.9846 GB/s, DeviceGemmXdl<256, 64, 128, 4>
Perf: 1.30642 ms, 98.6277 TFlops, 72.2371 GB/s, DeviceGemmXdl_C_Shuffle<256, 256, 128, 32, 2, 2>
Perf: 1.27258 ms, 101.25 TFlops, 74.158 GB/s, DeviceGemmXdl_C_Shuffle<256, 256, 128, 32, 8, 8>
Perf: 1.34661 ms, 95.684 TFlops, 70.081 GB/s, DeviceGemmXdl_C_Shuffle<256, 128, 256, 32, 2, 4>
Perf: 1.35013 ms, 95.4345 TFlops, 69.8984 GB/s, DeviceGemmXdl_C_Shuffle<256, 128, 256, 32, 8, 8>
Perf: 1.38394 ms, 93.1032 TFlops, 68.1908 GB/s, DeviceGemmXdl_C_Shuffle<128, 128, 128, 32, 2, 2>
Perf: 1.881 ms, 68.5001 TFlops, 50.171 GB/s, DeviceGemmXdl_C_Shuffle<128, 128, 128, 32, 8, 8>
Perf: 1.26701 ms, 101.695 TFlops, 74.484 GB/s, DeviceGemmXdl_C_Shuffle<256, 128, 128, 32, 2, 2>
Perf: 1.28173 ms, 100.527 TFlops, 73.6285 GB/s, DeviceGemmXdl_C_Shuffle<256, 128, 128, 32, 8, 8>
Perf: 1.881 ms, 68.5001 TFlops, 50.171 GB/s, DeviceGemmXdl_C_Shuffle<128, 128, 64, 32, 2, 2>
Perf: 1.88123 ms, 68.492 TFlops, 50.165 GB/s, DeviceGemmXdl_C_Shuffle<128, 128, 64, 32, 8, 8>
Perf: 2.05656 ms, 62.6528 TFlops, 45.8883 GB/s, DeviceGemmXdl_C_Shuffle<128, 64, 128, 32, 2, 2>
Perf: 2.05668 ms, 62.649 TFlops, 45.8855 GB/s, DeviceGemmXdl_C_Shuffle<128, 64, 128, 32, 8, 8>
Perf: 1.77149 ms, 72.735 TFlops, 53.2727 GB/s, DeviceGemmXdl_C_Shuffle<256, 128, 64, 32, 2, 2>
Perf: 1.79574 ms, 71.7525 TFlops, 52.5531 GB/s, DeviceGemmXdl_C_Shuffle<256, 128, 64, 32, 8, 8>
Perf: 1.76808 ms, 72.8752 TFlops, 53.3754 GB/s, DeviceGemmXdl_C_Shuffle<256, 64, 128, 32, 2, 2>
Perf: 1.78369 ms, 72.2372 TFlops, 52.9081 GB/s, DeviceGemmXdl_C_Shuffle<256, 64, 128, 32, 8, 8>
Best Perf: 1.26701 ms, 101.695 TFlops, 74.484 GB/s, DeviceGemmXdl_C_Shuffle<256, 128, 128, 32, 2, 2>
+ ./ckProfiler gemm 1 2 0 1 0 10 1920 2048 4096 8256 8256 8256
+ grep Perf
Perf: 0.465451 ms, 69.2065 TFlops, 84.4806 GB/s, DeviceGemmXdl<256, 128, 256, 4>
Perf: 0.446891 ms, 72.0807 TFlops, 87.9891 GB/s, DeviceGemmXdl<128, 128, 128, 4>
Perf: 0.36894 ms, 87.3102 TFlops, 106.58 GB/s, DeviceGemmXdl<256, 128, 128, 4>
Perf: 0.461547 ms, 69.7919 TFlops, 85.1951 GB/s, DeviceGemmXdl<128, 128, 64, 4>
Perf: 0.469371 ms, 68.6285 TFlops, 83.775 GB/s, DeviceGemmXdl<128, 64, 128, 4>
Perf: 0.466811 ms, 69.0049 TFlops, 84.2344 GB/s, DeviceGemmXdl<256, 128, 64, 4>
Perf: 0.466539 ms, 69.0451 TFlops, 84.2836 GB/s, DeviceGemmXdl<256, 64, 128, 4>
Perf: 0.473323 ms, 68.0555 TFlops, 83.0756 GB/s, DeviceGemmXdl_C_Shuffle<256, 128, 256, 32, 2, 4>
Perf: 0.468811 ms, 68.7105 TFlops, 83.8751 GB/s, DeviceGemmXdl_C_Shuffle<256, 128, 256, 32, 8, 8>
Perf: 0.419756 ms, 76.7405 TFlops, 93.6773 GB/s, DeviceGemmXdl_C_Shuffle<128, 128, 128, 32, 2, 2>
Perf: 0.448843 ms, 71.7672 TFlops, 87.6065 GB/s, DeviceGemmXdl_C_Shuffle<128, 128, 128, 32, 8, 8>
Perf: 0.390412 ms, 82.5083 TFlops, 100.718 GB/s, DeviceGemmXdl_C_Shuffle<256, 128, 128, 32, 2, 2>
Perf: 0.373548 ms, 86.2332 TFlops, 105.265 GB/s, DeviceGemmXdl_C_Shuffle<256, 128, 128, 32, 8, 8>
Perf: 0.460555 ms, 69.9422 TFlops, 85.3787 GB/s, DeviceGemmXdl_C_Shuffle<128, 128, 64, 32, 2, 2>
Perf: 0.464059 ms, 69.4141 TFlops, 84.734 GB/s, DeviceGemmXdl_C_Shuffle<128, 128, 64, 32, 8, 8>
Perf: 0.458427 ms, 70.2669 TFlops, 85.775 GB/s, DeviceGemmXdl_C_Shuffle<128, 64, 128, 32, 2, 2>
Perf: 0.470411 ms, 68.4768 TFlops, 83.5898 GB/s, DeviceGemmXdl_C_Shuffle<128, 64, 128, 32, 8, 8>
Perf: 0.461307 ms, 69.8282 TFlops, 85.2395 GB/s, DeviceGemmXdl_C_Shuffle<256, 128, 64, 32, 2, 2>
Perf: 0.470843 ms, 68.414 TFlops, 83.5131 GB/s, DeviceGemmXdl_C_Shuffle<256, 128, 64, 32, 8, 8>
Perf: 0.461275 ms, 69.833 TFlops, 85.2454 GB/s, DeviceGemmXdl_C_Shuffle<256, 64, 128, 32, 2, 2>
Perf: 0.468619 ms, 68.7386 TFlops, 83.9095 GB/s, DeviceGemmXdl_C_Shuffle<256, 64, 128, 32, 8, 8>
Best Perf: 0.36894 ms, 87.3102 TFlops, 106.58 GB/s, DeviceGemmXdl<256, 128, 128, 4>
+ echo 'bench NN'
bench NN
+ ./ckProfiler gemm 1 3 0 1 0 10 7680 8192 8192 8256 8256 8256
+ grep Perf
Perf: 9.60275 ms, 107.343 TFlops, 39.3104 GB/s, DeviceGemmXdl<256, 256, 128, 4>
Perf: 10.1092 ms, 101.966 TFlops, 37.3411 GB/s, DeviceGemmXdl<256, 128, 256, 4>
Perf: 12.9045 ms, 79.8783 TFlops, 29.2523 GB/s, DeviceGemmXdl<128, 128, 128, 4>
Perf: 11.237 ms, 91.7322 TFlops, 33.5933 GB/s, DeviceGemmXdl<256, 128, 128, 4>
Perf: 14.2435 ms, 72.3691 TFlops, 26.5023 GB/s, DeviceGemmXdl<128, 128, 64, 4>
Perf: 21.2278 ms, 48.5586 TFlops, 17.7827 GB/s, DeviceGemmXdl<128, 64, 128, 4>
Perf: 14.116 ms, 73.0231 TFlops, 26.7419 GB/s, DeviceGemmXdl<256, 128, 64, 4>
Perf: 22.2111 ms, 46.4089 TFlops, 16.9954 GB/s, DeviceGemmXdl<256, 64, 128, 4>
Perf: 9.5376 ms, 108.077 TFlops, 39.5789 GB/s, DeviceGemmXdl_C_Shuffle<256, 256, 128, 32, 2, 8>
Perf: 9.58814 ms, 107.507 TFlops, 39.3703 GB/s, DeviceGemmXdl_C_Shuffle<256, 256, 128, 32, 8, 8>
Perf: 10.0153 ms, 102.922 TFlops, 37.6912 GB/s, DeviceGemmXdl_C_Shuffle<256, 128, 256, 32, 2, 8>
Perf: 10.1344 ms, 101.713 TFlops, 37.2483 GB/s, DeviceGemmXdl_C_Shuffle<256, 128, 256, 32, 8, 8>
Perf: 12.8952 ms, 79.936 TFlops, 29.2734 GB/s, DeviceGemmXdl_C_Shuffle<128, 128, 128, 32, 2, 8>
Perf: 13.0247 ms, 79.1411 TFlops, 28.9823 GB/s, DeviceGemmXdl_C_Shuffle<128, 128, 128, 32, 8, 8>
Perf: 11.2214 ms, 91.8594 TFlops, 33.6399 GB/s, DeviceGemmXdl_C_Shuffle<256, 128, 128, 32, 2, 8>
Perf: 11.1851 ms, 92.1572 TFlops, 33.749 GB/s, DeviceGemmXdl_C_Shuffle<256, 128, 128, 32, 8, 8>
Perf: 14.1968 ms, 72.6075 TFlops, 26.5897 GB/s, DeviceGemmXdl_C_Shuffle<128, 128, 64, 32, 2, 8>
Perf: 14.261 ms, 72.2806 TFlops, 26.47 GB/s, DeviceGemmXdl_C_Shuffle<128, 128, 64, 32, 8, 8>
Perf: 21.2267 ms, 48.5612 TFlops, 17.7836 GB/s, DeviceGemmXdl_C_Shuffle<128, 64, 128, 32, 2, 8>
Perf: 21.2716 ms, 48.4586 TFlops, 17.7461 GB/s, DeviceGemmXdl_C_Shuffle<128, 64, 128, 32, 8, 8>
Perf: 13.8619 ms, 74.3613 TFlops, 27.2319 GB/s, DeviceGemmXdl_C_Shuffle<256, 128, 64, 32, 2, 8>
Perf: 14.179 ms, 72.6984 TFlops, 26.6229 GB/s, DeviceGemmXdl_C_Shuffle<256, 128, 64, 32, 8, 8>
Perf: 22.1061 ms, 46.6293 TFlops, 17.0762 GB/s, DeviceGemmXdl_C_Shuffle<256, 64, 128, 32, 2, 8>
Perf: 22.7502 ms, 45.3091 TFlops, 16.5927 GB/s, DeviceGemmXdl_C_Shuffle<256, 64, 128, 32, 8, 8>
Best Perf: 9.5376 ms, 108.077 TFlops, 39.5789 GB/s, DeviceGemmXdl_C_Shuffle<256, 256, 128, 32, 2, 8>
+ ./ckProfiler gemm 1 3 0 1 0 10 3840 4096 4096 8256 8256 8256
+ grep Perf
Perf: 1.27325 ms, 101.197 TFlops, 74.1188 GB/s, DeviceGemmXdl<256, 256, 128, 4>
Perf: 1.26375 ms, 101.958 TFlops, 74.6762 GB/s, DeviceGemmXdl<256, 128, 256, 4>
Perf: 1.78397 ms, 72.2262 TFlops, 52.9 GB/s, DeviceGemmXdl<128, 128, 128, 4>
Perf: 1.30829 ms, 98.4866 TFlops, 72.1337 GB/s, DeviceGemmXdl<256, 128, 128, 4>
Perf: 1.9614 ms, 65.6923 TFlops, 48.1145 GB/s, DeviceGemmXdl<128, 128, 64, 4>
Perf: 2.15495 ms, 59.792 TFlops, 43.793 GB/s, DeviceGemmXdl<128, 64, 128, 4>
Perf: 1.80493 ms, 71.3871 TFlops, 52.2855 GB/s, DeviceGemmXdl<256, 128, 64, 4>
Perf: 1.8764 ms, 68.6684 TFlops, 50.2942 GB/s, DeviceGemmXdl<256, 64, 128, 4>
Perf: 1.27435 ms, 101.109 TFlops, 74.0546 GB/s, DeviceGemmXdl_C_Shuffle<256, 256, 128, 32, 2, 8>
Perf: 1.27322 ms, 101.199 TFlops, 74.1207 GB/s, DeviceGemmXdl_C_Shuffle<256, 256, 128, 32, 8, 8>
Perf: 1.29102 ms, 99.8039 TFlops, 73.0986 GB/s, DeviceGemmXdl_C_Shuffle<256, 128, 256, 32, 2, 8>
Perf: 1.26855 ms, 101.572 TFlops, 74.3937 GB/s, DeviceGemmXdl_C_Shuffle<256, 128, 256, 32, 8, 8>
Perf: 1.77217 ms, 72.7068 TFlops, 53.252 GB/s, DeviceGemmXdl_C_Shuffle<128, 128, 128, 32, 2, 8>
Perf: 1.81256 ms, 71.0869 TFlops, 52.0656 GB/s, DeviceGemmXdl_C_Shuffle<128, 128, 128, 32, 8, 8>
Perf: 1.30165 ms, 98.9891 TFlops, 72.5018 GB/s, DeviceGemmXdl_C_Shuffle<256, 128, 128, 32, 2, 8>
Perf: 1.30736 ms, 98.5565 TFlops, 72.1849 GB/s, DeviceGemmXdl_C_Shuffle<256, 128, 128, 32, 8, 8>
Perf: 1.94331 ms, 66.304 TFlops, 48.5625 GB/s, DeviceGemmXdl_C_Shuffle<128, 128, 64, 32, 2, 8>
Perf: 1.95305 ms, 65.9732 TFlops, 48.3202 GB/s, DeviceGemmXdl_C_Shuffle<128, 128, 64, 32, 8, 8>
Perf: 2.12928 ms, 60.513 TFlops, 44.321 GB/s, DeviceGemmXdl_C_Shuffle<128, 64, 128, 32, 2, 8>
Perf: 2.14606 ms, 60.0399 TFlops, 43.9745 GB/s, DeviceGemmXdl_C_Shuffle<128, 64, 128, 32, 8, 8>
Perf: 1.80457 ms, 71.4014 TFlops, 52.2959 GB/s, DeviceGemmXdl_C_Shuffle<256, 128, 64, 32, 2, 8>
Perf: 1.81854 ms, 70.8531 TFlops, 51.8944 GB/s, DeviceGemmXdl_C_Shuffle<256, 128, 64, 32, 8, 8>
Perf: 1.87678 ms, 68.6543 TFlops, 50.2839 GB/s, DeviceGemmXdl_C_Shuffle<256, 64, 128, 32, 2, 8>
Perf: 1.87785 ms, 68.6151 TFlops, 50.2552 GB/s, DeviceGemmXdl_C_Shuffle<256, 64, 128, 32, 8, 8>
Best Perf: 1.26375 ms, 101.958 TFlops, 74.6762 GB/s, DeviceGemmXdl<256, 128, 256, 4>
+ ./ckProfiler gemm 1 3 0 1 0 10 1920 2048 4096 8256 8256 8256
+ grep Perf
Perf: 0.440523 ms, 73.1227 TFlops, 89.2611 GB/s, DeviceGemmXdl<256, 128, 256, 4>
Perf: 0.379548 ms, 84.87 TFlops, 103.601 GB/s, DeviceGemmXdl<128, 128, 128, 4>
Perf: 0.371356 ms, 86.7422 TFlops, 105.886 GB/s, DeviceGemmXdl<256, 128, 128, 4>
Perf: 0.465579 ms, 69.1875 TFlops, 84.4574 GB/s, DeviceGemmXdl<128, 128, 64, 4>
Perf: 0.436011 ms, 73.8794 TFlops, 90.1848 GB/s, DeviceGemmXdl<128, 64, 128, 4>
Perf: 0.471227 ms, 68.3582 TFlops, 83.4451 GB/s, DeviceGemmXdl<256, 128, 64, 4>
Perf: 0.449323 ms, 71.6906 TFlops, 87.5129 GB/s, DeviceGemmXdl<256, 64, 128, 4>
Perf: 0.465163 ms, 69.2494 TFlops, 84.5329 GB/s, DeviceGemmXdl_C_Shuffle<256, 128, 256, 32, 2, 8>
Perf: 0.446203 ms, 72.1919 TFlops, 88.1248 GB/s, DeviceGemmXdl_C_Shuffle<256, 128, 256, 32, 8, 8>
Perf: 0.396668 ms, 81.2071 TFlops, 99.1298 GB/s, DeviceGemmXdl_C_Shuffle<128, 128, 128, 32, 2, 8>
Perf: 0.390876 ms, 82.4104 TFlops, 100.599 GB/s, DeviceGemmXdl_C_Shuffle<128, 128, 128, 32, 8, 8>
Perf: 0.35022 ms, 91.9771 TFlops, 112.277 GB/s, DeviceGemmXdl_C_Shuffle<256, 128, 128, 32, 2, 8>
Perf: 0.372419 ms, 86.4946 TFlops, 105.584 GB/s, DeviceGemmXdl_C_Shuffle<256, 128, 128, 32, 8, 8>
Perf: 0.462667 ms, 69.623 TFlops, 84.989 GB/s, DeviceGemmXdl_C_Shuffle<128, 128, 64, 32, 2, 8>
Perf: 0.466939 ms, 68.986 TFlops, 84.2114 GB/s, DeviceGemmXdl_C_Shuffle<128, 128, 64, 32, 8, 8>
Perf: 0.431004 ms, 74.7378 TFlops, 91.2327 GB/s, DeviceGemmXdl_C_Shuffle<128, 64, 128, 32, 2, 8>
Perf: 0.435627 ms, 73.9445 TFlops, 90.2643 GB/s, DeviceGemmXdl_C_Shuffle<128, 64, 128, 32, 8, 8>
Perf: 0.461579 ms, 69.7871 TFlops, 85.1893 GB/s, DeviceGemmXdl_C_Shuffle<256, 128, 64, 32, 2, 8>
Perf: 0.474283 ms, 67.9178 TFlops, 82.9075 GB/s, DeviceGemmXdl_C_Shuffle<256, 128, 64, 32, 8, 8>
Perf: 0.442971 ms, 72.7186 TFlops, 88.7678 GB/s, DeviceGemmXdl_C_Shuffle<256, 64, 128, 32, 2, 8>
Perf: 0.450843 ms, 71.4489 TFlops, 87.2179 GB/s, DeviceGemmXdl_C_Shuffle<256, 64, 128, 32, 8, 8>
Best Perf: 0.35022 ms, 91.9771 TFlops, 112.277 GB/s, DeviceGemmXdl_C_Shuffle<256, 128, 128, 32, 2, 8>
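
The TFlops column in the logs above appears consistent with the usual GEMM operation count of 2·M·N·K divided by the measured time. The following is a minimal sketch for cross-checking that and for reproducing the "Best Perf" pick from a set of `Perf:` lines. It is not part of ckProfiler; the helper names (`parse_perf`, `gemm_tflops`, `best`) are hypothetical, and the regular expression assumes the exact line layout shown in this log.

```python
import re

# Matches one "Perf:" line as printed in the log above.
# "Best Perf:" summary lines are intentionally NOT matched.
PERF_RE = re.compile(
    r"^Perf: (?P<ms>[\d.]+) ms, (?P<tflops>[\d.]+) TFlops, "
    r"(?P<gbps>[\d.]+) GB/s, (?P<kernel>.+)$"
)


def parse_perf(line):
    """Return a dict for a 'Perf:' line, or None for any other line."""
    m = PERF_RE.match(line.strip())
    if m is None:
        return None
    return {
        "ms": float(m["ms"]),
        "tflops": float(m["tflops"]),
        "gbps": float(m["gbps"]),
        "kernel": m["kernel"],
    }


def gemm_tflops(m, n, k, ms):
    """TFlops under the assumed count of 2*M*N*K flops (one multiply and one add per MAC)."""
    return 2.0 * m * n * k / (ms * 1e-3) / 1e12


def best(lines):
    """Pick the record with the smallest time among the parsable 'Perf:' lines."""
    records = [r for r in map(parse_perf, lines) if r is not None]
    return min(records, key=lambda r: r["ms"])


# Three lines copied from the 1920 x 2048 x 4096 run above.
sample = [
    "Perf: 0.371356 ms, 86.7422 TFlops, 105.886 GB/s, DeviceGemmXdl<256, 128, 128, 4>",
    "Perf: 0.35022 ms, 91.9771 TFlops, 112.277 GB/s, DeviceGemmXdl_C_Shuffle<256, 128, 128, 32, 2, 8>",
    "Perf: 0.372419 ms, 86.4946 TFlops, 105.584 GB/s, DeviceGemmXdl_C_Shuffle<256, 128, 128, 32, 8, 8>",
]

winner = best(sample)
print(winner["kernel"])
print(gemm_tflops(1920, 2048, 4096, winner["ms"]), "vs logged", winner["tflops"])
```

Sorting by time rather than by the TFlops column keeps the selection independent of how the profiler derives its throughput figures.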

- remove dangling header include
- modify example gemm_xdl accordingly
- infer KPack value from M/NPerXdl
- device conv2d fwd: update parameters for the underlying gridwise gemm v3r1
  (the conv2d fwd API stays the same for now, until we decide to expose individual K0s for activation and weight)
Comment thread device_operation/include/device_gemm_xdl_c_shuffle.hpp Outdated
Comment thread device_operation/include/device_gemm_xdl_c_shuffle.hpp
Comment thread .gitignore
@rosenrodt rosenrodt requested a review from asroy February 25, 2022 12:34
@asroy
Contributor

asroy commented Feb 25, 2022

Please use clang-format-10 in future PRs.

@asroy asroy closed this Feb 25, 2022
@asroy asroy reopened this Feb 25, 2022
@asroy asroy requested a review from zjing14 February 25, 2022 18:49
2 participants