Suboptimal GROMACS performance on Vega #222
Comments
@pszi1ard I need to follow up with you on this and the tooling. |
@pszi1ard Hey, we loaded a new version, ROCm 1.6.4, at repo.radeon.com. Can you test it? |
Some improvements on the register usage side, but still higher than on Fiji: nbnxn_kernel_ElecEw_VdwLJCombGeom_VF_opencl_gfx803 registers: 84, 68. When it comes to global memory and PCIe bandwidth, not much has changed (if anything, Vega D2H bandwidth might be a bit lower): both Vega and Fiji global memory bandwidths are unchanged (and low). |
Something strange is going on, since your D2H log shows you are capping at 2-3 GB/s. |
What is the D2H transfer packet size for GROMACS? |
Here is what I see on EPYC:
p47:~/MI25_Test_Kit$ ./gpu_memory_benchmark -f 0 -t 8 -b
(Benchmark Mode Result table not preserved in this copy.) |
In single-GPU runs typically 100 KB to 1 MB, one H2D and one D2H transfer per step; when strong-scaling it can be as small as 10-20 KB with two transfers each way. In some cases very small transfers (tens to hundreds of bytes) are also required, so latency is also important. |
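To see why latency matters for the small transfers mentioned above, here is a rough back-of-the-envelope model; the 10 µs per-transfer overhead and 12 GB/s peak PCIe bandwidth are illustrative assumptions, not numbers from this thread:

```python
# Rough model of effective PCIe transfer bandwidth vs. transfer size.
# The latency and peak-bandwidth figures are illustrative assumptions.

def effective_bandwidth(size_bytes, latency_s=10e-6, peak_bw=12e9):
    """Effective bandwidth in bytes/s for one transfer of size_bytes."""
    return size_bytes / (latency_s + size_bytes / peak_bw)

for size in (256, 20 * 1024, 100 * 1024, 1024 * 1024):
    bw = effective_bandwidth(size)
    print(f"{size:>8} B -> {bw / 1e9:.2f} GB/s")
```

With these assumed numbers, a 256 B transfer is almost entirely latency-bound, while a 1 MB transfer gets close to the assumed peak bandwidth, which is why both latency and bandwidth matter for GROMACS-style workloads.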
Here are my performance numbers on an Intel x99 platform with the Vega Frontier:
|
And this is Fiji; this looks really bad:
|
There was a bug in the SDMA firmware for Fiji, so they shut it off. I am chasing it down with the firmware team. |
As a reference, here's what I get on an identical machine with an NVIDIA card:
|
@gstoner Thanks! What about the Vega gmem bandwidth, isn't that on the low side too? Regarding the register count, are there any best practices/tricks I should be applying at the kernel level to reduce register use? |
@pszi1ard the best guide I've found for managing registers is this: https://gpuopen.com/optimizing-gpu-occupancy-resource-usage-large-thread-groups/ Also, this talk: |
Here is something I still need to clean up, but it covers how to get the best out of the LC compiler. A short summary of the ways to get rid of scratch:

Usually the culprit is a private array (or arrays) in the kernel itself. Private arrays and variables have to be allocated to scratch by their semantics, but the compiler can optimize them out and allocate them to VGPRs under certain conditions. If not all VGPRs are exhausted but you still see scratch accesses, that means there is such a private array.

A private array can be eliminated and allocated to VGPRs if the compiler can always tell which array element is used in every instruction accessing it. For example, if the offset into "a" used inside a loop is dynamic, i.e. different on every iteration and dependent on the value of "i", such an array cannot be eliminated as is. If instead all indexes are constant, the compiler can use a register for each element of "a". Note that if a dynamic index remains anywhere in the code, the whole array cannot be optimized out.

The compiler does its best to fully unroll loops where it detects private array addressing that depends on a variable changing inside the loop. That is not always possible, though. There are several main reasons why it cannot:
1. Non-constant loop bounds. For example:
for (i = get_local_id(0); i < 256; i++) { … }
Here the lower bound is not known at compile time because it is not constant and not uniform. Change the loop to use a constant bound:
for (i = 0; i < 256; i++) { … }
Another example:
for (i = 0; i < min(limit, 256); i++) { … }
Here the upper bound is not constant and the loop cannot be fully unrolled. Transform it to:
for (i = 0; i < 256; i++) { … }
In general, do not let any conditionals or dynamic values contribute to the loop bounds or increment.
2. Deep loop nests. Nests 5 or 6 loops deep are common in MIOpen, which creates a difficult optimization problem. If you can reduce the loop nest depth, do so. If you cannot, try to identify which loop variables contribute to indexing and help the compiler by placing a #pragma unroll statement before the loop. The statement without an integer after it requests a full unroll. In any case, make sure the loop can be unrolled in principle (that it has no dependencies like those described in point 1).
3. Unroll size limits. We might want to #pragma unroll an outer loop because a[i] depends on its induction variable, but if the inner loop was already unrolled for other reasons, the fully unrolled outer loop may become too big after unrolling. |
@gstoner That's a great guide, you should consider adding it to some wiki. The issues you list, however, do not apply in our case, I think. We only have a loop nesting of depth 3; the inner two have compile-time constant trip counts and are manually unrolled (though I wonder whether the compiler emits warnings if it ignores a #pragma unroll?). There is a local array, but it's indexed with an unrolled loop's counter. Therefore, it's still not clear to me what else could be done in the source to reduce register use, which I suspect is too high as it's ~18 more than what nvcc uses for the identically structured CUDA kernels. |
I am working on our new documentation site, where we will include this: http://rocm-documentation.readthedocs.io/en/latest/index.html |
Any update on this? We have switched into beta release gear and when that ends we won't be able to do any tweaks on our side to make the performance competitive on Vega. (Before the end of the year we'll have a final 2018 release BTW). |
Just noticed, I mis-posted my ROCm 1.8 Vega specific feedback; see here: #93 (comment) Briefly: I'd like to get help tracking down the difference between the ROCm and AMDGPU-PRO legacy compiler that seems to lead to major performance penalty on ROCm. Suggestions on how to get action on this are welcome. |
@gstoner Actually, some of our kernels seem to produce incorrect results with ROCm 1.8 on Vega, while AMDGPU-PRO seems fine. Should I file a separate issue? Can you please look into this? |
Yes, 18.20 is a different compiler than the one in ROCm.
|
I know. As advised, will file a new github issue. Under which project, ROCm or ROCm-OpenCL-Runtime? |
Yes, and include steps to recreate it.
|
OK, we got to the bottom of your performance issue. On Vega (gfx9) and later, denorms are on by default; you can turn them off with -cl-denorms-are-zero. |
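For reference, a minimal sketch of how that flag can be passed when building the program on the host. This assumes an already-created `program` and `device` and omits all error handling; it is a fragment, not a runnable program:

```c
/* Sketch: pass -cl-denorms-are-zero as a build option so the compiler
 * may flush denormals to zero. 'program' and 'device' are assumed to
 * exist already; error handling omitted. */
#include <CL/cl.h>

cl_int build_with_denorms_flushed(cl_program program, cl_device_id device)
{
    const char *options = "-cl-denorms-are-zero";
    return clBuildProgram(program, 1, &device, options, NULL, NULL);
}
```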
They are off on the HSAIL path.
|
Meaning that -cl-denorms-are-zero is on on <= GFX8 with ROCm and also with the Pro stack? |
LC has denorms on by default only for gfx9 GPUs. On the other GPUs (gfx8 and gfx7) denormals carry a big performance penalty, so they are off by default.
With the HSAIL compiler they are off for all GPUs.
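For context, denormals (subnormals) are the tiny values between zero and the smallest normal float; handling them in hardware is what costs on some GPUs. A quick host-side Python illustration with doubles (nothing GPU-specific; with flush-to-zero, as requested by -cl-denorms-are-zero, the subnormal below would instead become zero):

```python
import sys

tiny = sys.float_info.min   # smallest positive *normal* double
sub = tiny / 2              # halving it produces a subnormal (denormal)

print(tiny)                 # 2.2250738585072014e-308
print(sub != 0.0)           # gradual underflow keeps it nonzero
print(sub * 2 == tiny)      # this particular halving is exact
```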
|
Meaning that -cl-denorms-are-zero is on for <= gfx8 with ROCm and also with the Pro stack? No, they are off. |
The performance issue has been solved (also in the last GROMACS release); for the record, the initially posted numbers have changed the following way with 1.8.2 + flushing denorms:
|
Note that ROCm 1.9 introduced an up to 5-6% regression with respect to the numbers reported above. Not huge, but enough to note, IMO. I was wondering: what is the threshold that flags regressions in your internal benchmarks? |
The main kernel (force-only, name-pattern nbnxn_kernel___F_opencl_gfxXXX) runs up to 22% slower on Vega than on Fiji. Other kernels are a mixed bag. One of the likely contributors is that most kernels end up using quite a bit more registers when compiled for gfx900, e.g.
nbnxn_kernel_ElecEw_VdwLJCombGeom_F_opencl_gfx803 registers: 81, 54
nbnxn_kernel_ElecEw_VdwLJCombGeom_F_opencl_gfx900 registers: 85, 54
nbnxn_kernel_ElecEw_VdwLJCombGeom_VF_opencl_gfx803 registers: 84, 68
nbnxn_kernel_ElecEw_VdwLJCombGeom_VF_opencl_gfx900 registers: 93, 68
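For context on why a handful of extra VGPRs can hurt: on GCN/Vega, occupancy (waves per SIMD) is gated by register use. A rough sketch under common assumptions (256 VGPRs per SIMD, allocation granularity of 4, at most 10 waves; check the gfx9 ISA docs for the exact figures):

```python
# Rough GCN/Vega occupancy estimate from VGPR usage.
# Assumptions (verify against the gfx9 ISA docs): 256 VGPRs per SIMD,
# VGPRs allocated in blocks of 4, at most 10 waves per SIMD.

def waves_per_simd(vgprs, total=256, granularity=4, max_waves=10):
    alloc = -(-vgprs // granularity) * granularity  # round up to block
    return min(max_waves, total // alloc)

for name, vgprs in [("VF gfx803", 84), ("VF gfx900", 93)]:
    print(f"{name}: {vgprs} VGPRs -> {waves_per_simd(vgprs)} waves/SIMD")
```

Under these assumptions the VF kernel would drop from 3 waves/SIMD at 84 VGPRs to 2 at 93, which could explain a disproportionate slowdown from a seemingly small register increase.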
For the former "F" kernels, performance across the relevant range of input sizes: