
Suboptimal GROMACS performance on Vega #222

Closed

pszi1ard opened this issue Oct 6, 2017 · 30 comments

pszi1ard commented Oct 6, 2017

The main kernel (force-only, name-pattern nbnxn_kernel___F_opencl_gfxXXX) runs up to 22% slower on Vega than on Fiji. Other kernels are a mixed bag. One of the likely contributors is that most kernels end up using quite a bit more registers when compiled for gfx900, e.g.

nbnxn_kernel_ElecEw_VdwLJCombGeom_F_opencl_gfx803 registers: 81, 54
nbnxn_kernel_ElecEw_VdwLJCombGeom_F_opencl_gfx900 registers: 85, 54

nbnxn_kernel_ElecEw_VdwLJCombGeom_VF_opencl_gfx803 registers: 84, 68
nbnxn_kernel_ElecEw_VdwLJCombGeom_VF_opencl_gfx900 registers: 93, 68

For the former "F" kernels, performance across the relevant range of input sizes:

Input size	Fiji time/iteration (ms)	Vega time/iteration (ms)	Vega/Fiji
0.96	0.0727	0.0615	84.59%
1.5	0.1018	0.0799	78.49%
3	0.1165	0.0995	85.41%
6	0.1539	0.169	109.81%
12	0.2582	0.2987	115.69%
24	0.4327	0.5242	121.15%
48	0.7794	0.9545	122.47%
96	1.4489	1.7762	122.59%
192	2.7754	3.3861	122.00%
384	5.6074	6.7722	120.77%
768	11.0613	13.5129	122.16%
1536	21.6894	26.646	122.85%
3072	43.3131	53.2424	122.92%

gstoner commented Oct 13, 2017

@pszi1ard I need to follow up with you on this and the tooling.

gstoner commented Oct 16, 2017

@pszi1ard Hey, we loaded a new version of ROCm, 1.6.4, at repo.radeon.com. Can you test it?

pszi1ard (Author) commented:

Some improvements on the register usage side, but still higher than on Fiji:
nbnxn_kernel_ElecEw_VdwLJCombGeom_F_opencl_gfx803 registers: 81, 54
nbnxn_kernel_ElecEw_VdwLJCombGeom_F_opencl_gfx900 registers: 83, 54

nbnxn_kernel_ElecEw_VdwLJCombGeom_VF_opencl_gfx803 registers: 84, 68
nbnxn_kernel_ElecEw_VdwLJCombGeom_VF_opencl_gfx900 registers: 85, 68

When it comes to global memory and PCIe bandwidth, not much has changed (if anything, Vega D2H bandwidth might be slightly lower): both Vega and Fiji global memory bandwidths are unchanged (and low).

gstoner commented Oct 17, 2017

Something strange is going on, since your D2H log shows you are clipping at 2-3 GB/s.

================ Benchmark Mode Result ===================================
Data Size Avg Time(us) Avg BW(GB/s) Min Time(us) Peak BW(GB/s)
1k 12.762500 0.080235 12.222000 0.083783
2K 12.638400 0.162046 12.353000 0.165790
4K 12.867100 0.318331 12.702000 0.322469
8K 13.326700 0.614706 13.066000 0.626971
16K 14.255100 1.149343 13.995000 1.170704
32K 15.923300 2.057865 15.444000 2.121730
64K 19.457500 3.368161 19.133000 3.425286
128K 26.205300 5.001736 25.940000 5.052891
256K 39.380900 6.656628 39.010000 6.719918
512K 65.807800 7.966958 65.486000 8.006108
1M 118.311300 8.862856 117.859000 8.896868
2M 221.706500 9.459136 221.189000 9.481267
4M 429.735700 9.760194 428.219000 9.794764
8M 841.345400 9.970469 840.796000 9.976984
16M 1664.901700 10.077001 1663.707000 10.084237
32M 3311.324500 10.133236 3309.414000 10.139086
64M 6608.728400 10.154580 6606.718000 10.157670
128M 13200.614400 10.167536 13198.751000 10.168972
256M 26387.797500 10.172712 26382.996000 10.174563
512M 52766.415200 10.174481 52756.801000 10.176336

gstoner commented Oct 17, 2017

What is the transfer packet size for D2H in GROMACS?

gstoner commented Oct 17, 2017

Here is what I see on EPYC:

p47:~/MI25_Test_Kit$ ./gpu_memory_benchmark -f 0 -t 8 -b

================ Benchmark Mode Result ===================================
Data Size             Avg Time(us)         Avg BW(GB/s)          Min Time(us)         Peak BW(GB/s)
    1k                 18.520800              0.055289              13.656000               0.074985
    2K                 15.625500              0.131068              13.545000               0.151200
    4K                 15.588300              0.262761              13.646000               0.300161
    8K                 16.930800              0.483852              14.177000               0.577837
   16K                 18.839500              0.869662              14.687000               1.115544
   32K                 21.969700              1.491509              15.639000               2.095275
   64K                 22.192900              2.953017              17.984000               3.644128
  128K                 40.251900              3.256293              22.923000               5.717925
  256K                 78.562200              3.336770              33.353000               7.859683
  512K                120.168500              4.362940              53.130000               9.868022
    1M                235.025700              4.461538              93.046000              11.269437
    2M                454.138800              4.617866             172.585000              12.151415
    4M                533.782900              7.857696             333.378000              12.581226
    8M               1156.005500              7.256547             652.410000              12.857878
   16M               2025.732600              8.282049            1290.061000              13.004979
   32M               3458.726200              9.701384            2564.433000              13.084542
   64M               5930.747900             11.315413            5113.257000              13.124485
  128M              10497.422000             12.785780           10211.324000              13.144008
  256M              20910.831500             12.837149           20409.853000              13.152248
  512M              41265.533000             13.010153           40801.303000              13.158181

pszi1ard (Author) commented Oct 17, 2017

What is the transfer packet size for D2H in GROMACS?

In single-GPU runs typically 100 KB-1 MB, with one H2D and one D2H transfer per step; when strong-scaling it can be as small as 10-20 KB, with two transfers in each direction. In some cases very small (tens to hundreds of bytes) transfers are also required, so latency is also important.
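
For context, here is a rough host-side sketch of that per-step transfer pattern using the plain OpenCL API; buffer names, sizes, and the float4-per-atom layout are hypothetical, not GROMACS code:

#include <CL/cl.h>

/* Hypothetical per-MD-step transfer pattern: one H2D upload of coordinates
 * and one D2H download of forces, each typically ~100 KB - 1 MB in
 * single-GPU runs. Illustrative only. */
static void run_steps(cl_command_queue queue, cl_mem d_coords, cl_mem d_forces,
                      float *h_coords, float *h_forces, size_t natoms, int nsteps)
{
    for (int step = 0; step < nsteps; ++step) {
        /* H2D: upload updated coordinates (non-blocking). */
        clEnqueueWriteBuffer(queue, d_coords, CL_FALSE, 0,
                             natoms * 4 * sizeof(float), h_coords,
                             0, NULL, NULL);

        /* ... enqueue the nonbonded force kernel here ... */

        /* D2H: read back the forces computed in this step. */
        clEnqueueReadBuffer(queue, d_forces, CL_FALSE, 0,
                            natoms * 4 * sizeof(float), h_forces,
                            0, NULL, NULL);
        clFinish(queue);
    }
}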

pszi1ard (Author) commented:

Here are my performance numbers on an Intel x99 platform with the Vega Frontier:

./gpu_memory_benchmark -f 0      -t 2 -b
================ Benchmark Mode Result ===================================
Data Size             Avg Time(us)         Avg BW(GB/s)          Min Time(us)         Peak BW(GB/s)
    1k                 12.533200              0.081703              12.200000               0.083934
    2K                 12.543700              0.163269              12.313000               0.166328
    4K                 12.780700              0.320483              12.539000               0.326661
    8K                 13.262600              0.617677              13.024000               0.628993
   16K                 14.368300              1.140288              14.110000               1.161162
   32K                 16.036500              2.043339              15.720000               2.084478
   64K                 19.530100              3.355641              19.155000               3.421352
  128K                 26.286800              4.986229              25.974000               5.046277
  256K                 39.308900              6.668821              38.885000               6.741520
  512K                 65.604000              7.991708              65.391000               8.017739
    1M                118.365500              8.858798             117.865000               8.896415
    2M                221.759000              9.456897             221.452000               9.470007
    4M                429.789800              9.758966             428.860000               9.780124
    8M                841.641700              9.966959             841.024000               9.974279
   16M               1665.270400             10.074770            1664.851000              10.077308
   32M               3311.814100             10.131738            3310.745000              10.135009
   64M               6602.370200             10.164359            6601.135000              10.166261
  128M              13184.114800             10.180261           13180.434000              10.183104
  256M              26345.754400             10.188946           26342.481000              10.190212
  512M              52669.658300             10.193172           52666.968000              10.193693

./gpu_memory_benchmark -f 2 -t 0      -b
================ Benchmark Mode Result ===================================
Data Size             Avg Time(us)         Avg BW(GB/s)          Min Time(us)         Peak BW(GB/s)
    1k                 12.337000              0.083002              12.027000               0.085142
    2K                 12.235000              0.167389              12.080000               0.169536
    4K                 12.363600              0.331295              12.134000               0.337564
    8K                 12.670100              0.646562              12.458000               0.657569
   16K                 13.237200              1.237724              12.998000               1.260502
   32K                 14.621600              2.241068              14.278000               2.294999
   64K                 17.371300              3.772660              17.030000               3.848268
  128K                 22.923900              5.717701              22.556000               5.810959
  256K                 35.128100              7.462516              34.610000               7.574227
  512K                 56.659300              9.253344              55.932000               9.373668
    1M                106.671300              9.829973              99.839000              10.502669
    2M                273.602400              7.664962             203.775000              10.291508
    4M                586.556400              7.150726             485.646000               8.636546
    8M               1139.917300              7.358962             865.675000               9.690251
   16M               1931.029600              8.688223            1565.613000              10.716068
   32M               3443.301600              9.744843            3196.719000              10.496522
   64M               6585.798900             10.189935            6117.735000              10.969560
  128M              12281.606500             10.928353           11666.830000              11.504216
  256M              23276.077600             11.532676           22623.668000              11.865249
  512M              45279.919000             11.856711           44712.184000              12.007262

./gpu_memory_benchmark -f 2 -t 2 -b
================ Benchmark Mode Result ===================================
Data Size             Avg Time(us)         Avg BW(GB/s)          Min Time(us)         Peak BW(GB/s)
    1k                 12.405800              0.082542              12.078000               0.084782
    2K                 12.284200              0.166718              12.149000               0.168574
    4K                 12.275300              0.333678              12.141000               0.337369
    8K                 12.402000              0.660539              12.296000               0.666233
   16K                 13.049100              1.255566              12.914000               1.268701
   32K                 13.228900              2.477001              12.993000               2.521973
   64K                 13.404400              4.889141              13.257000               4.943502
  128K                 14.127300              9.277923              13.900000               9.429640
  256K                 13.822800             18.964609              13.662000              19.187820
  512K                 15.188300             34.519202              15.049000              34.838727
    1M                 17.568200             59.686024              17.378000              60.339280
    2M                 22.200300             94.465030              21.945000              95.564001
    4M                 31.961900            131.228244              31.570000             132.857270
    8M                 51.848300            161.791380              51.461000             163.009036
   16M                 91.178900            184.003273              90.701000             184.972779
   32M                170.122100            197.237349             169.271000             198.229065
   64M                327.929500            204.644181             326.227000             205.712170
  128M                643.712300            208.505769             641.730000             209.149842
  256M               1276.864600            210.230165            1271.332000             211.145048
  512M               2538.724900            211.472662            2528.659000             212.314477

pszi1ard (Author) commented:

And this is Fiji; this looks really bad:

./gpu_memory_benchmark -f 0      -t 1 -b
================ Benchmark Mode Result ===================================
Data Size             Avg Time(us)         Avg BW(GB/s)          Min Time(us)         Peak BW(GB/s)
    1k                 35.657600              0.028718              22.542000               0.045426
    2K                 35.603100              0.057523              34.473000               0.059409
    4K                 35.926800              0.114010              34.814000               0.117654
    8K                 36.372100              0.225228              35.377000               0.231563
   16K                 37.473200              0.437219              36.339000               0.450865
   32K                 40.007700              0.819042              38.431000               0.852645
   64K                 43.839000              1.494925              42.969000               1.525193
  128K                 52.991400              2.473458              52.170000               2.512402
  256K                 70.927200              3.695959              69.805000               3.755376
  512K                106.170800              4.938156              85.047000               6.164685
    1M                177.890300              5.894509             165.299000               6.343511
    2M                325.171000              6.449382             300.179000               6.986338
    4M                633.247300              6.623485             610.467000               6.870648
    8M               1430.001700              5.866152            1374.239000               6.104184
   16M               2764.360800              6.069112            2637.112000               6.361966
   32M               5872.450400              5.713872            5685.395000               5.901865
   64M              11759.369300              5.706842           11060.123000               6.067642
  128M              22980.787400              5.840432           22864.856000               5.870045
  256M              43573.965500              6.160455           43220.192000               6.210881
  512M              80551.669700              6.664926           79902.397000               6.719084

./gpu_memory_benchmark -f 1 -t 0      -b
================ Benchmark Mode Result ===================================
Data Size             Avg Time(us)         Avg BW(GB/s)          Min Time(us)         Peak BW(GB/s)
    1k                 37.078600              0.027617              34.758000               0.029461
    2K                 35.635900              0.057470              34.452000               0.059445
    4K                 35.714400              0.114688              34.891000               0.117394
    8K                 36.109200              0.226867              32.356000               0.253183
   16K                 37.521000              0.436662              36.322000               0.451076
   32K                 39.075400              0.838584              37.361000               0.877064
   64K                 43.627000              1.502189              40.944000               1.600625
  128K                 53.008500              2.472660              51.787000               2.530983
  256K                 70.481700              3.719320              69.623000               3.765193
  512K                106.752300              4.911257             105.588000               4.965413
    1M                178.156200              5.885712             156.248000               6.710972
    2M                322.662400              6.499524             315.902000               6.638616
    4M                614.859300              6.821567             610.366000               6.871785
    8M               1388.899000              6.039754            1375.367000               6.099178
   16M               2743.273100              6.115766            2412.498000               6.954292
   32M               5839.240100              5.746370            5562.187000               6.032597
   64M              11759.709300              5.706677           11701.196000               5.735214
  128M              22980.096000              5.840608           22855.037000               5.872567
  256M              43763.902900              6.133718           43497.076000               6.171345
  512M              80734.697900              6.649816           78170.282000               6.867967

./gpu_memory_benchmark -f 1 -t 1 -b
================ Benchmark Mode Result ===================================
Data Size             Avg Time(us)         Avg BW(GB/s)          Min Time(us)         Peak BW(GB/s)
    1k                 36.272400              0.028231              34.043000               0.030080
    2K                 34.982000              0.058544              33.999000               0.060237
    4K                 35.266400              0.116145              33.826000               0.121090
    8K                 36.356800              0.225322              33.925000               0.241474
   16K                 38.547900              0.425030              36.224000               0.452297
   32K                 40.147000              0.816200              37.759000               0.867820
   64K                 43.739900              1.498312              42.822000               1.530428
  128K                 52.625000              2.490679              51.638000               2.538286
  256K                 70.180400              3.735288              69.091000               3.794184
  512K                106.474700              4.924062             105.422000               4.973231
    1M                178.652800              5.869351             176.445000               5.942792
    2M                325.552000              6.441834             299.354000               7.005592
    4M                618.739100              6.778793             565.930000               7.411348
    8M               1386.523600              6.050101            1314.483000               6.381679
   16M               2709.173100              6.192744            2367.479000               7.086532
   32M               5846.720100              5.739018            5663.761000               5.924408
   64M              11680.165400              5.745541           11387.905000               5.892995
  128M              22835.302700              5.877642           22330.984000               6.010381
  256M              44134.967600              6.082149           41747.038000               6.430048
  512M              82943.254200              6.472750           78545.529000               6.835156

gstoner commented Oct 17, 2017

There was a bug in the SDMA firmware for Fiji, so they shut it off. I am chasing this down with the firmware team.

pszi1ard (Author) commented:

As a reference, here's what I get on an identical machine with an NVIDIA card:

Transfer Size (Bytes)	NVIDIA (GB/s)	Vega (GB/s)
1024			0.31	0.08
2048			0.61	0.17
4096			1.14	0.33
8192			2.10	0.63
16384			3.45	1.16
32768			5.82	2.08
65536			7.63	3.42
131072			9.32	5.05
262144			10.37	6.74
524288			11.11	8.02
1048576			11.45	8.90
2097152			11.56	9.47
4194304			11.71	9.78
16777216		11.84	10.08
33554432		11.86	10.14

pszi1ard (Author) commented:

There was a bug in the SDMA firmware for Fiji, so they shut it off. I am chasing this down with the firmware team.

@gstoner Thanks!

What about the Vega global memory bandwidth numbers, aren't those on the low side too? Regarding the register count, are there any best practices/tricks I should be applying at the kernel level to reduce register use?

boxerab commented Oct 18, 2017

@pszi1ard the best guide I've found for managing registers is this:

https://gpuopen.com/optimizing-gpu-occupancy-resource-usage-large-thread-groups/

Also, this talk:

https://www.youtube.com/watch?v=Bmy3Tt3Ottc

gstoner commented Oct 22, 2017

Here is something I still need to clean up, but it covers how to get the best out of the LC compiler.

Here is a short summary of the ways to get rid of scratch:
In most cases, scratch is not the result of the compiler spilling something and thus allocating it to scratch.

Usually, it is a private array or arrays in the kernel itself.

Private arrays and variables have to be allocated to scratch by their semantics. The compiler can optimize them out and allocate to VGPRs under certain conditions.

If not all VGPRs are exhausted but you still see scratch accesses, that means there is such a private array.

A private array can be eliminated and allocated to VGPRs if the compiler can always tell which array element is used in every instruction accessing it. For example:
float a[64];
for (int i = 0; i < 64; i++)
    use(a[i]);

In this code the offset into "a" used inside the loop is dynamic, because it differs with every loop iteration. It depends on the value of "i", and such an array cannot be eliminated as is.
It can be eliminated if we have:
use(a[0]);
use(a[1]);
use(a[2]);
...
use(a[61]);
use(a[62]);
use(a[63]);

Here all indexes are constant and we can use a register for each element of "a". Note that if a dynamic index remains anywhere in the code, the whole array cannot be optimized out.
The above transformation is basically a full unroll of the loop.
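
To make this concrete, here is a minimal OpenCL C sketch (a hypothetical kernel, not from GROMACS) in which the private array is indexed only with compile-time constants after a full unroll, so each element can live in a VGPR instead of scratch:

/* Illustrative kernel only. With constant-bound, fully unrolled loops,
 * every index into acc[] is a compile-time constant, so the compiler
 * can promote acc[] to VGPRs instead of scratch. */
__kernel void accumulate(__global const float *in, __global float *out)
{
    float acc[8];    /* private array */

    #pragma unroll
    for (int i = 0; i < 8; i++)
        acc[i] = in[get_global_id(0) * 8 + i];

    float sum = 0.0f;
    #pragma unroll
    for (int i = 0; i < 8; i++)
        sum += acc[i];

    /* A loop such as: for (int i = get_local_id(0) % 8; i < 8; i++)
     * would make the indexing dynamic and force acc[] into scratch. */
    out[get_global_id(0)] = sum;
}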

The compiler does its best to fully unroll loops where it detects private array addressing that depends on a variable changing inside the loop. That is not always possible, though. There are several main reasons why it cannot:

  1. Loop cannot be unrolled because its bounds are unknown. For example:

for (i = get_local_id(0); i < 256; i++) { … }

Here the lower bound of the loop is not known at compile time because it is not constant and not uniform. Change the loop to:

for (i = 0; i < 256; i++) {
    if (i < get_local_id(0)) continue;
    ...
}

Another example:

for (i = 0; i < min(limit, 256); i++) { … }

Here the upper bound is not constant and the loop cannot be fully unrolled. Transform:

for (i = 0; i < 256; i++) {
    if (i >= limit) continue;
    ...
}

In general, do not let any conditionals or dynamic values contribute to the loop bounds or increment.

  2. In certain situations, using a signed type for the loop induction variable makes it hard to determine the loop bounds, because signed integer overflow has to be handled.
    Use unsigned indexes where you can (a short sketch follows below).
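
An illustrative fragment of that point (not GROMACS code, following the style of the examples above): prefer a constant-bound loop with an unsigned counter so the bounds are trivially provable:

/* Illustrative only: unsigned counter, compile-time constant bounds. */
float a[16];
#pragma unroll
for (unsigned int i = 0u; i < 16u; ++i)
    use(a[i]);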

  3. Even though we do our best to unroll loops whose indexes contribute to the private addressing, there is always a limit on unrolling.
    When you have a nest of loops, we have to choose which of them to unroll so that it is beneficial and does not blow up the code size too much.

Loop nests 5 or 6 deep are common in MIOpen, which creates a difficult optimization problem. If you can reduce the loop nest depth, do so.

If you cannot, try to identify which of the loop variables contribute to the indexing and help the compiler by placing a #pragma unroll statement before the loop. A pragma without an integer after it requests a full unroll.

But in any case, make sure that the loop can be unrolled in principle (i.e. it does not have dependencies like those described in point 1).

  4. Sometimes we do not honor unroll pragmas even then. That is the case when the code size is going to increase too much.
    Probably in a 5- or 6-deep loop nest there is another loop which gets unrolled, and then unrolling an outer loop would blow up the code size beyond the threshold:

#pragma unroll
for (i = 0; i < 256; i++) {
    for (j = 0; j < 256; j++) {
        ...
    }
    use(a[i]);
}

We would like to unroll the outer loop because a[i] depends on its induction variable. But we might have already unrolled the inner loop due to other considerations and now the size of the fully unrolled outer loop would become too big after unroll.
You can disable unroll of an inner loop (#pragma nounroll) or limit it (like #pragma unroll 4):

#pragma unroll
for (i = 0; i < 256; i++) {
    #pragma nounroll
    for (j = 0; j < 256; j++) {
        ...
    }
    use(a[i]);
}
Also avoid taking the address of private variables.
If you have a private array which you can split into two, split it; it will be easier to analyze (a sketch follows below).
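
As an illustration of those last two points (hypothetical names, not a GROMACS kernel): keep unrelated private data in separate small arrays, and do not pass the address of a private variable around:

/* Illustrative only. Instead of one float buf[8] mixing two unrelated
 * data streams, keep them in two independent private arrays; each is
 * easier for the compiler to analyze and promote to VGPRs. */
__kernel void split_private(__global const float *in_a,
                            __global const float *in_b,
                            __global float *out)
{
    float coords[4];
    float forces[4];

    size_t gid = get_global_id(0);

    #pragma unroll
    for (unsigned int i = 0u; i < 4u; ++i) {
        coords[i] = in_a[gid * 4u + i];
        forces[i] = in_b[gid * 4u + i];
    }

    float sum = 0.0f;
    #pragma unroll
    for (unsigned int i = 0u; i < 4u; ++i)
        sum += coords[i] * forces[i];

    /* Avoid taking the address of private variables, e.g. passing &sum
     * to a helper function, which forces them to be addressable. */
    out[gid] = sum;
}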

pszi1ard (Author) commented Oct 25, 2017

@gstoner That's a great guide; you should consider adding it to a wiki.

The issues you list, however, do not apply in our case, I think. We only have loop nesting of depth 3; the inner two loops have compile-time constant trip counts and are manually unrolled (though I wonder, does the compiler emit warnings if it ignores a #pragma unroll?). There is a local array, but it is indexed with an unrolled loop's counter. Therefore, it is still not clear to me what else could be done in the source to reduce register use, which I suspect is too high, as it is ~18 registers more than what nvcc uses for the identically structured CUDA kernels.

gstoner commented Oct 25, 2017

I am working on our new documentation site, where we will include this: http://rocm-documentation.readthedocs.io/en/latest/index.html

pszi1ard (Author) commented:

Any update on this? We have switched into beta-release gear, and when that ends we won't be able to do any further tweaks on our side to make the performance competitive on Vega. (Before the end of the year we'll have a final 2018 release, BTW.)

pszi1ard (Author) commented:

Just noticed that I mis-posted my ROCm 1.8 Vega-specific feedback; see here: #93 (comment)

Briefly: I'd like help tracking down the difference between the ROCm compiler and the AMDGPU-PRO legacy compiler that seems to lead to a major performance penalty on ROCm. Suggestions on how to get action on this are welcome.

pszi1ard (Author) commented:

@gstoner Actually, some of our kernels seem to produce incorrect results with ROCm 1.8 on Vega, while AMDGPU-PRO seems fine. Should I file a separate issue? Can you please look into this?

gstoner commented May 29, 2018 via email

pszi1ard (Author) commented May 29, 2018

Yes, 18.20 is a different compiler than the one in ROCm.

I know. As advised, I will file a new GitHub issue. Under which project, ROCm or ROCm-OpenCL-Runtime?

gstoner commented May 29, 2018 via email

gstoner commented May 31, 2018

OK, we got to the bottom of your performance issue on Vega:

Gfx9 and later: denorms are on by default; you can turn them off with -cl-denorms-are-zero.
Gfx8 and earlier: denorms are off by default.
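
For reference, a minimal host-side sketch of passing that option when building the kernels through the standard OpenCL API (generic code; not how GROMACS wires it up):

#include <CL/cl.h>

/* Build the OpenCL program with denormals flushed to zero, which matters
 * on gfx9 where denorms are otherwise enabled by default (per the comment
 * above). Hypothetical helper, illustrative only. */
static cl_int build_with_ftz(cl_program program, cl_device_id device)
{
    const char *build_options = "-cl-denorms-are-zero";
    cl_int err = clBuildProgram(program, 1, &device, build_options, NULL, NULL);
    if (err != CL_SUCCESS) {
        /* inspect the build log via clGetProgramBuildInfo(...) on failure */
    }
    return err;
}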

pszi1ard (Author) commented:

@gstoner I filed a new issue under ROCm: #427

Thanks for checking the perf regression. Is this only hardware generation-dependent?
What about AMDGPU-PRO? Given that with AMDGPU-PRO 17.50 I measured significantly faster kernel times, denorms must be off there too?

gstoner commented May 31, 2018 via email

pszi1ard (Author) commented May 31, 2018

They are off on the hsail path

Meaning that -cl-denorms-are-zero is on for <= GFX8 with ROCm and also with the Pro stack?

gstoner commented May 31, 2018 via email

gstoner commented Jun 1, 2018

Meaning that -cl-denorms-are-zero is on for <= GFX8 with ROCm and also with the Pro stack?

No, they are off.

pszi1ard (Author) commented:

The performance issue has been solved (and addressed in the latest GROMACS release too); for the record, here is how the initially posted numbers have changed with ROCm 1.8.2 + flushing denorms:

Input size	Fiji (ms)	Vega (ms)	Fiji* (ms)	Vega* (ms)
0.96	0.0727	0.0615		0.0648	0.0504
1.5	0.1018	0.0799		0.0909	0.0675
3	0.1165	0.0995		0.1035	0.0757
6	0.1539	0.169		0.1515	0.1059
12	0.2582	0.2987		0.2459	0.1682
24	0.4327	0.5242		0.416	0.2797
48	0.7794	0.9545		0.7576	0.5051
96	1.4489	1.7762		1.398	0.9184
192	2.7754	3.3861		2.6616	1.7504
384	5.6074	6.7722		5.2908	3.5299
768	11.0613	13.5129		10.5129	7.0194
1536	21.6894	26.646		20.6894	14.4534
3072	43.3131	53.2424		41.4343	28.6263

pszi1ard (Author) commented:

Note that ROCm 1.9 introduced up to a 5-6% regression with respect to the numbers reported above. Not huge, but enough to note, IMO. I was wondering what threshold flags regressions in your internal benchmarks?
