OpenCL performance regression on Gromacs #93
Yes, that compiler was not yet optimized; as we said when we released the OpenCL developer preview, it was for functional testing only. What you are seeing is the compiler spilling registers and missing a few more optimizations, which we have been working on. One of the things we now have is an assembler, so we can push the performance well beyond what you are seeing on AMDGPU-Pro.
Do you have tests you would like us to run? We have been testing some of the GROMACS benchmarks.
G
On Mar 3, 2017, at 6:08 PM, Szilárd Páll wrote:
I measure up to 1.5x kernel performance regression with the ROCm 1.4 release compared to AMDGPU-PRO. The application is GROMACS version 2016.2.
Thanks for the feedback. I know the compiler as released in 1.4 was far from optimal. However, for GROMACS, after correctness, performance is the second most important functionality, so my report is technically concerning an application functionality :) It would be great if you could include some testing/benchmarks in your internal testing. We have a small number of hot kernels and quite peculiar application behavior that tends to stress the driver and cause API overhead, so those aspects would be good to get tested and improved if needed. Let me know how you would like to proceed.
I've had another look at some internal GROMACS profiler counters and there are strong indications that the runtime is using a lot of CPU resources, resulting in both increased host-side cost of enqueue and increased interference with work executed on the CPU concurrently with the GPU. Are such issues also known/expected?
The base driver team dropped in some last-minute changes in 1.4, in which we are seeing some quirkiness.
This is one of the GROMACS tests we are running. We found a core issue between 1.4 and GROMACS already, which we are working on now.
# Point GROMACS at its OpenCL kernel sources, then run the rnase_cubic benchmark:
cd /root/Desktop/ISV/Gromacs-2016/gromacs-2016
export GMX_OCL_FILE_PATH=/usr/local/gromacs/share/gromacs/opencl
cd ~/Gromacs-2016/gromacs-2016/build/bin/rnase_cubic/
../gmx grompp -f pme_verlet.mdp   # preprocess the input
../gmx mdrun                      # run the simulation
@gstoner Sounds good. The test case you are using is pretty decent, but a bit more coverage of input sizes/use cases and some command-line tweaks to run only the kernels of interest might not hurt. Briefly, this is what's of strong interest and what I'd recommend tracking (using at least a few test cases):
* (post-load-balancing) average execution time of the hottest offloaded kernel across a range of input sizes. Bad performance with very small kernels can become a showstopper for strong scaling (hence the emphasis on the range of problem sizes); e.g. see [1], where besides getting good peak performance, improving the left-hand "tail" would be great;
* OpenCL API overhead, which can be quite significant, especially at peak performance (< 1 ms/iteration). Also of concern is the behavior of the driver overhead under a high kernel issue rate and when starved (by application threads), which has been a serious issue in the past; e.g. see [2].
[1] GROMACS 5.1 / 2016 GPU kernel throughput: https://drive.google.com/file/d/0B6dQqsegA1FMZk5kNXI4SzVhbzNyT0NackpCY05FNlM1dWNv/view?usp=sharing
[2] API overhead in GROMACS runs on three fglrx versions: https://drive.google.com/open?id=0B6dQqsegA1FMLTJZb2NRWUlLbHc
Let me know if you need more input and what other feedback would be useful for you. I'd really like to see the Vega GPUs hit the ground running (hopefully with ROCm and also Mesa!), but so far software stack issues have been the one thing limiting both development and user adoption.
If I understand correctly, you're asking about the GPU algorithm that runs on the critical path? It's not GEMM but a pair-interaction algorithm (based on neighbor-list traversal), for which we use our own SIMD-tuned algorithm.
Not sure if you are referring to local or global memory in OpenCL terminology (I assume the latter, given the gigabytes)?
No, at the moment all communication is done on the CPU using MPI/shared memory.
@gstoner Forgot to answer: for typical runs we need < 200 MB, and for pretty much all relevant cases < 1 GB. This may increase a little as we're moving to offloading more, but ultimately GROMACS runs in the strong-scaling regime, where the performance practical for research is only achieved when the per-node data is very small.
Yes, that is something we should dig into.
@gstoner Could you clarify? Also, maybe my late edit above was missed, so let me reiterate it: let me know if you need more input and what other feedback would be useful for you. I'd really like to see the Vega GPUs hit the ground running (hopefully with ROCm and also Mesa!), but so far software stack issues have been the one thing limiting both development and user adoption.
What I am looking for is the critical section in your kernel.
Greg
Critical performance section
Greg
@gstoner The kernel is here: https://github.com/gromacs/gromacs/blob/master/src/gromacs/mdlib/nbnxn_ocl/nbnxn_ocl_kernel_amd.clh. It implements a pair-interaction algorithm (http://www.sciencedirect.com/science/article/pii/S2352711015000059#f000015) that's tuned for SIMD architectures (http://www.sciencedirect.com/science/article/pii/S2352711015000059#f000020). Its main characteristics:
* single precision;
* a significant amount of 32-bit integer ops;
* arithmetically intensive (~15 flops/byte);
* uses lots of registers; typically instruction latency-bound.
Thanks.
Greg
Do you have a Fiji?
Greg
PS: the kernel loves lane shuffles for reductions, which we're greatly missing on AMD hardware!
We have an R9 Nano for development. (BTW, performance compared to the green guys was in the PDF linked earlier: https://drive.google.com/file/d/0B6dQqsegA1FMZk5kNXI4SzVhbzNyT0NackpCY05FNlM1dWNv/view?usp=sharing)
Did you see this article? AMD GCN Assembly: Cross-Lane Operations, posted on August 10, 2016 by Ben Sander: http://gpuopen.com/amd-gcn-assembly-cross-lane-operations/ We can do this now inside the new compiler; we had a lot of hardware capability that was masked by the old compiler. From the article: "The basic execution unit of an AMD GCN GPU is called a wavefront, which is basically a SIMD vector. [...] At the same time, the register bandwidth is (1,050 MHz) * (64 CUs) * (64 lanes) * (12 bytes per lane) = 51.6 TB/s. That's another order of magnitude, so communication between threads is much slower than just crunching data in the thread registers. But can we do better by sharing? The answer is yes, if we further reduce our scope from a workgroup to a single wavefront."
I have seen it, but I did not think it was possible to use in-line ASM with OpenCL. Or is there some other way to implement it in OpenCL?
So how and when will you expose it in the (OpenCL) compiler?
We fixed that with the new OpenCL compiler using the native code generator. We want to give you the ability to fully express the hardware even when the compiler cannot generate the instructions. We figured everyone needed it, since we needed it for our work on MIOpen, our deep-learning solver, and Tensile. We have been working on GCN ISA assembly-optimized kernels for a number of the key convolutions and now GEMMs.
The next OpenCL drop on ROCm will be 100% open source, so you will see better into the tools.
I do not know if you saw this one on sub-dword addressing as well: http://gpuopen.com/using-sub-dword-addressing-on-amd-gpus-with-rocm/
Greg
That sounds great. How will this be exposed? Will you add OpenCL extensions for the equivalent permute/swizzle intrinsics supported in HCC? Will you allow inline ASM in OpenCL?
No, I have not. Thanks for the link. In our current kernels I doubt we can use these tricks: we need single precision for floating point, and our integer data is either 32-bit bitmasks or indices that do not lend themselves well to packing.
We will have intrinsics for you to get access to them; in addition, you will be able to do inline asm in OpenCL. HCC and OpenCL both now use the Clang front end, so we are making sure the functionality matches in both languages.
Greg
Sounds good. Is there an ETA/release schedule for 1.5 and later? Actually, do you have any ROCm roadmap/plans regarding features, support, etc. that you can share?
I have tested the ROCm 1.6 release, and my initial kernel-only performance assessment is: performance is still 30-40% lower than what I measured with the old fglrx stack (and the early versions of AMDGPU-PRO). Here's the data with GROMACS 2016, obtained using a benchmark that runs a wide range of input sizes:
It's possible to optimize performance bottlenecks by replacing the kernels with native ISA. Here's a repository with examples and how-tos: https://github.com/ROCm-Developer-Tools/LLVM-AMDGPU-Assembler-Extra
BTW, LLVM supports GCN inline assembly (the syntax is the same as https://gcc.gnu.org/onlinedocs/gcc/Using-Assembly-Language-with-C.html). A simple example can be found here: https://github.com/ROCm-Developer-Tools/LLVM-AMDGPU-Assembler-Extra/blob/master/examples/gfx8/s_memrealtime_inline.cl
@jstefanop Don't get toxic; this is the last place we need that. These are growing pains, with an inspectable environment and tangible changes from going open source, with a fully open backend fed by modern Clang. The relevance and importance of this project and of AMD has never been higher. They need to know you mean business, but you need to stay a customer too so you get your stuff performing; the crypto guys and the image-processing/dense-computation folks do too. It's clear they have a lot on their plate, and things look ready for a rapid evolution to kick-ass status, but it's hard to know what day that will be while watching and waiting as we have been. Have the patience of Buddha, remove the last line of your previous message, and contact AMD product marketing as officially as your university allows; sell the story of your research helping defeat cancer, enabled by their hardware. Tell them what you need working, in a way non-engineers would understand but with some specifics. Try to get the university's permission for a marketing blurb on LinkedIn/HPC news if you can, while imparting the importance of what you need. Try to make sure everybody wins. Then wait. Hopefully they'll either strategize differently or get more people to help get things done sooner, or things will just work at T+1. I do wish we knew what was wrong with the current state of the compiler; it's clearly a little wonky. The runtime clearly still has some bugs too, but I don't know if those are too bad... if I understood how the runtime worked (or where it is), I'd probably fix some of those issues.
The GROMACS issue is different from what @jstefanop was worried about; he is looking at currency mining. What you see in the data above is that the ROCm compiler is doing its job up to 3 ns of simulation time; it was leading up until then, and performance slows down where it falls behind. @pszi1ard We are digging into it more this month. Is the data line for 2236.1 on Fiji the AMDGPU-Pro 17.10 driver? I am now looking at what the memory utilization is at 3 ns simulation time for GROMACS. @pszi1ard Thank you for your patience. We finally have the EPYC 4-GPU server we talked about, which I can also test on. Also, ROCm 1.6.1 will be out next week; we did find some firmware issues that were affecting OpenCL. OpenCL on ROCm is currently a pre-release beta. We are working on making ROCm the best product possible; it is new and it will have issues. We are digging in to see whether it is a compiler or driver issue: we are going back and comparing the exact same compiler that we use on Windows and the AMDGPU-Pro 17.10 driver on ROCm to see whether it is a compiler issue or a base driver issue. Base driver issues take longer to triage since we sit on the AMDGPU driver, the same one the open-source driver uses, so we have to look at changes in the DRM and even in the base Linux kernel, and dig through all the firmware changes.
I think I've got that nailed down. It looks like the driver/runtime is quite resource-intensive; it wants core 0 (possibly hw thread 0?) to run on, and if it does not have it, it fails to launch GPU tasks until the core frees up. Does that make sense? I'm collecting detailed data and will post it later.
Here's more detailed data (also comparing to the CUDA runtime/driver). This essentially seems to confirm that if the first core is left empty (that is, neither of its hw threads is used), performance is good (although clFinish is somewhat expensive, it seems). Otherwise, if core 0 gets loaded, the GPU task scheduling behavior is pretty erratic. I've also noticed that with this modded 4.9 kernel (4.9.0-kfd-compute-rocm-rel-1.6-77), the per-core load does not show up in monitoring tools, so it looks like something is broken in /proc:
OpenCL has a producer/consumer thread for dispatch, and the OpenCL runtime should not be pinning it to the first core. The main developer expects the CQ thread to jump from one core to another if every core has pinned threads, to get a fair time share. He thinks there must be a heuristic preventing the thread from rescheduling on another core. He is going to try bumping the priority of the CQ thread to see if we still have that issue with oversubscription and pinning.
Not directly related, but GROMACS will by default use all hardware threads available, which has not (and it seems will not) play well with the AMD OpenCL runtime, so we'd need some heuristic to leave some resources available. How many resources do you expect this thread to require? Should, in theory, a hardware thread be enough? Is there only one CQ thread, one per device, or perhaps one per application context per device? I assume the dispatch work is NUMA-sensitive?
I did not assume that either; that's why I was careful not to claim that the runtime's thread is pinned. However, the observations seem to suggest that the thread does not move around.
Note that even if there are plenty of free cores, I still observe the peculiar behavior.
Good news: I see improvements across all performance issues; great work!
* Kernel performance is up by 25-30% and marginally better than the best ever observed (the fglrx reference);
* the register count for the particular kernel I looked at (nbnxn_kernel_ElecEw_VdwLJCombGeom_F_opencl) is down to 81, still a bit higher than before;
* the CQ thread issue seems mostly resolved (presumably with the new kernel driver), though I still see significant application slowdown when all threads are used in computation right after the GPU task enqueue (see columns 7 and 8 here: https://docs.google.com/spreadsheets/d/1GjIhiWLXsFK5SxE2n88-oz0dYZ3WXXifgIUTyKDE74M/pubchart?oid=939045217);
* measured PCIe bandwidth is improved: 6.7-9.4 GB/s peak (see https://drive.google.com/drive/folders/0B6dQqsegA1FMa3VYRHNFZmJmcXc).
Remaining issues:
* GPU queue sync (clFinish) still seems quite expensive, ending up taking 10-20% of the runtime at short iteration times;
* the peak PCIe transfer rate is still not reached.
When you test PCIe bandwidth, these are NUMA effects; we are looking into this.
Greg
I'm using a single-socket, single-NUMA-domain CPU; how can it be a NUMA effect? Can you comment on the expected driver/runtime resource needs, and whether my observed clFinish overhead is "normal" and here to stay? We need to provision resources for jitter and overheads.
We are looking at clFinish.
I will look at your Xeon, since they have an internal NUMA organization when you cross a certain number of cores.
Greg
It's an LLC single-ring i7-5960X, not a Xeon!
Thanks, it is still a Xeon, just rebranded ;) I used to work for Intel.
Greg
I have not worked at Intel, but Xeon is a marketing name rather than an architecture name, isn't it? ;)
In some of them, not yours.
Update: with ROCm 1.6.148 I see no significant changes in either of the remaining issues (PCIe transfer and task wait/launch overhead; related to the latter, see some numbers here:
Thanks, I will look this over.
Any update on this? We'll be releasing the next major version, and it would be good to know whether/what we should advise users.
Update: for now we ended up recommending the use of ROCm in the new release documentation, but I'd be more comfortable if we could test 1.7 better and hopefully have the issues solved soon.
Here's my ROCm 1.8 feedback. On our test machines, PCIe transfer speed is still rather low (about half of what it should be: ~6 GB/s in both directions, with transfers of a few MB in size); note that with AMDGPU-PRO 17.5 I get the expected 12 GB/s in both directions. Kernel performance is slightly improved, which is good. However, it seems that (as I suspected) something is off with Vega performance on ROCm. With an AMDGPU-PRO legacy install I seem to get ~30% better performance in our main kernel; the other kernels are also faster! This is a huge difference that would immediately make the Vega GPUs quite competitive, so I'd like to get to the bottom of it and hopefully improve it soon. (I'm still having plenty of trouble with rcp on ROCm, so please let me know if your team can look into this.)
Side note: what are the identical/different components in ROCm vs AMDGPU-PRO legacy/PAL? Is there thorough documentation on this somewhere?
@pszi1ard This is still on the MSI motherboard, correct?
We have two compilers. One is for OpenCL, going through Clang to LLVM to the HSAIL intermediate language; it then passes the IL to the finalizer, which calls the AMD GPU shader compiler that is also used for OpenGL and DirectX 12. That path can be supported on ORCA, PAL, and ROCr. It is a proprietary compiler, so we cannot open-source it. The compiler on the ROCm stack is 100% open source; you can find its documentation here.
Also, can you run this on 1.8 and report the numbers?
@gstoner Do you mean that the PRO stack uses an entirely different finalizer/codegen, so regardless of whether "legacy" mode is used there is no similarity with the ROCm compiler?
This is on an X99 and on a random Z97 mobo: the X99 system is clearly not getting peak performance. There is also still some discrepancy between what these HSA benchmarks show and what I measure in GROMACS/OpenCL.
@pszi1ard This benchmark tests the bandwidth at the ROCr level, removing the language runtime. So now I am going to have the team look over the OpenCL mapping to ROCr to see if there is an issue. I finally have an MSI X99 coming in; we do not see this on an ASUS X99.
Thanks! I've not looked at the implementation; does this benchmark use explicitly pinned buffers? As a side note: we've built C++ custom allocator support for page alignment and pinning for CUDA, so if there is any use in page-aligning or somehow pinning CPU buffers, we could certainly do that. Any advice you could give is welcome.
Great, thanks. In the meantime I'll try to see if we can update the firmware and get back to you.
I'm still observing PCIe BW issues. I have three cards: one in the infamous MSI X99 board and two in another ASUS X99; other than the RX 560 in the latter, neither the Fiji nor the Vega card behaves as it should wrt PCIe BW. Note that gpu_memory_benchmark also behaves weirdly: in the few-MB regime it peaks at >15-16 GB/s, which is, I think, not a reasonable measurement of PCIe BW. Used ROCm 1.8.2 (rock-dkms 1.8-192). Should I file a separate report about this?
We have now obsoleted the gpu_memory_benchmark. It was using the CPU for the timer, not the GPU, as NVIDIA does in its PCIe benchmark. You should use https://github.com/RadeonOpenCompute/rocm_bandwidth_test instead. One more thing: we just released a beta of our ROCm Validation Test Suite for validating ROCm on your hardware.
Apologies, I forgot about the new bandwidth test tool. I reran with rocm_bandwidth_test (detailed results below) and, not too unexpectedly, I get marginally higher numbers in most cases, with one exception: an RX 560 plugged into an ASUS X99 with PLX switches now performs very poorly with the rocm_bandwidth_test tool.
Detailed bench results:
Thanks @pszi1ard