OpenCL performance regression on Gromacs #93
Yes, that compiler was not yet optimized; as we said when we released the OpenCL developer preview, it was for functional testing only. What you are seeing is the compiler spilling registers and missing a few more optimizations, which we have been working on. One of the things we now have is an assembler, so we can push the performance well beyond what you are seeing on AMDGPU-Pro.
Do you have tests you would like us to run? We have been testing some of the GROMACS benchmarks.
G
On Mar 3, 2017, at 6:08 PM, Szilárd Páll wrote:
I measure up to 1.5x kernel performance regression with the ROCm 1.4 release compared to AMDGPU-PRO. The application is GROMACS version 2016.2.
Thanks for the feedback. I know the compiler as released in 1.4 was far from optimal. However, for GROMACS, after correctness, performance is the second most important functionality, so my report is technically concerning an application functionality :) It would be great if you could include some testing/benchmarks in your internal testing. We have a small number of hot kernels and quite peculiar application behavior that tends to stress the driver and cause API overhead, so those aspects would be good to get tested and improved if needed. Let me know how you would like to proceed.
I've had another look at some internal GROMACS profiler counters and there are strong indications that the runtime is using a lot of CPU resources, resulting in both increased host-side cost of enqueue and increased interference with work executed on the CPU concurrently with the GPU. Are such issues also known/expected?
The base driver team dropped in some last-minute changes in 1.4, in which we are seeing some quirkiness.
This is one of the GROMACS tests we are running. We found a core issue between 1.4 and GROMACS already, which we are working on now.
# Point GROMACS at its OpenCL kernel sources, then run the rnase_cubic benchmark:
cd /root/Desktop/ISV/Gromacs-2016/gromacs-2016
export GMX_OCL_FILE_PATH=/usr/local/gromacs/share/gromacs/opencl
cd ~/Gromacs-2016/gromacs-2016/build/bin/rnase_cubic/
../gmx grompp -f pme_verlet.mdp   # preprocess the input
../gmx mdrun                      # run the simulation
@gstoner Sounds good. The test case you are using is pretty decent, but a bit more coverage of input sizes/use cases and some command-line tweaks to run only the kernels of interest might not hurt. Briefly, this is what's of strong interest and what I'd recommend tracking (using at least a few test cases):
* (post-load-balancing) average execution time of the hottest offloaded kernel across a range of input sizes. Bad performance with very small kernels can become a showstopper for strong scaling (hence the emphasis on the range of problem sizes); e.g. see [1], where besides getting good peak performance, improving the left-hand "tail" would be great;
* OpenCL API overhead, which can be quite significant, especially at peak performance (< 1 ms/iteration). Also of concern is the behavior of the driver overhead under a high kernel issue rate and when starved (by application threads), which has been a serious issue in the past; e.g. see [2].
[1] GROMACS 5.1 / 2016 GPU kernel throughput: https://drive.google.com/file/d/0B6dQqsegA1FMZk5kNXI4SzVhbzNyT0NackpCY05FNlM1dWNv/view?usp=sharing
[2] API overhead in GROMACS runs on three fglrx versions: https://drive.google.com/open?id=0B6dQqsegA1FMLTJZb2NRWUlLbHc
Let me know if you need more input and what other feedback would be useful for you. I'd really like to see the Vega GPUs hit the ground running (hopefully with ROCm and also Mesa!), but so far software stack issues have been the one thing limiting both development and user adoption.
If I understand correctly, you're asking about the GPU algorithm that runs on the critical path? It's not GEMM but a pair-interaction algorithm (based on neighbor-list traversal), for which we use our own SIMD-tuned algorithm.
Not sure if you are referring to local or global memory in OpenCL terminology (I assume the latter, given the gigabytes)?
No, at the moment all communication is done on the CPU using MPI/shared memory.
@gstoner Forgot to answer: for typical runs we need < 200 MB, and for pretty much all relevant cases < 1 GB. This may increase a little as we're moving to offloading more, but ultimately GROMACS runs in the strong-scaling regime, where the performance practical for research is only achieved when the per-node data is very small.
Yes, that is something we should dig into.
@gstoner Could you clarify? Also, maybe my late edit above was missed, so let me reiterate it: let me know if you need more input and what other feedback would be useful for you. I'd really like to see the Vega GPUs hit the ground running (hopefully with ROCm and also Mesa!), but so far software stack issues have been the one thing limiting both development and user adoption.
What I am looking for is the critical section in your kernel.
Greg
Critical performance section
Greg
@gstoner The kernel is here: https://github.com/gromacs/gromacs/blob/master/src/gromacs/mdlib/nbnxn_ocl/nbnxn_ocl_kernel_amd.clh. It implements a pair-interaction algorithm (http://www.sciencedirect.com/science/article/pii/S2352711015000059#f000015) that's tuned for SIMD architectures (http://www.sciencedirect.com/science/article/pii/S2352711015000059#f000020). Its main characteristics:
* single precision;
* a significant amount of 32-bit integer ops;
* arithmetically intensive (~15 flops/byte);
* uses lots of registers; typically instruction latency-bound.
Thanks.
Greg
Do you have a Fiji?
Greg
PS: the kernel loves lane shuffles for reductions, which we're greatly missing on AMD hardware!
We have an R9 Nano for development. (BTW, performance compared to the green guys was in the PDF linked earlier: https://drive.google.com/file/d/0B6dQqsegA1FMZk5kNXI4SzVhbzNyT0NackpCY05FNlM1dWNv/view?usp=sharing)
Did you see this article? AMD GCN Assembly: Cross-Lane Operations, posted on August 10, 2016 by Ben Sander: http://gpuopen.com/amd-gcn-assembly-cross-lane-operations/ We can do this now inside the new compiler; we had a lot of hardware capability that was masked by the old compiler. From the article: "The basic execution unit of an AMD GCN GPU is called a wavefront, which is basically a SIMD vector. [...] At the same time, the register bandwidth is (1,050 MHz) * (64 CUs) * (64 lanes) * (12 bytes per lane) = 51.6 TB/s. That's another order of magnitude, so communication between threads is much slower than just crunching data in the thread registers. But can we do better by sharing? The answer is yes, if we further reduce our scope from a workgroup to a single wavefront."
I have seen it, but I did not think it was possible to use in-line ASM with OpenCL. Or is there some other way to implement it in OpenCL?
So how and when will you expose it in the (OpenCL) compiler?
We fixed that with the new OpenCL compiler using the native code generator. We want to give you the ability to fully express the hardware even when the compiler cannot generate the instructions. We figured everyone needed it, since we needed it for our work on MIOpen, our deep-learning solver, and Tensile. We have been working on GCN ISA assembly-optimized kernels for a number of the key convolutions and now GEMMs.
The next OpenCL drop on ROCm will be 100% open source, so you will see better into the tools.
I do not know if you saw this one on sub-dword addressing as well: http://gpuopen.com/using-sub-dword-addressing-on-amd-gpus-with-rocm/
Greg
That sounds great. How will this be exposed? Will you add OpenCL extensions for the equivalent permute/swizzle intrinsics supported in HCC? Will you allow inline ASM in OpenCL?
No, I have not. Thanks for the link. In our current kernels I doubt we can use these tricks: we need single precision for floating point, and our integer data is either 32-bit bitmasks or indices that do not lend themselves well to packing.
We will have intrinsics for you to get access to them; in addition, you will be able to do inline asm in OpenCL. HCC and OpenCL both now use the Clang front end, so we are making sure the functionality matches in both languages.
Greg
Sounds good. Is there an ETA/release schedule for 1.5 and later? Actually, do you have any ROCm roadmap/plans regarding features, support, etc. that you can share?
I have tested the ROCm 1.6 release, and my initial kernel-only performance assessment is: performance is still 30-40% lower than what I measured with the old fglrx stack (and the early versions of AMDGPU-PRO). Here's the data with GROMACS 2016, obtained using a benchmark that runs a wide range of input sizes:
It's possible to optimize performance bottlenecks by replacing the kernels with native ISA. Here's a repository with examples and how-tos: https://github.com/ROCm-Developer-Tools/LLVM-AMDGPU-Assembler-Extra
BTW, LLVM supports GCN inline assembly (the syntax is the same as https://gcc.gnu.org/onlinedocs/gcc/Using-Assembly-Language-with-C.html). A simple example can be found here: https://github.com/ROCm-Developer-Tools/LLVM-AMDGPU-Assembler-Extra/blob/master/examples/gfx8/s_memrealtime_inline.cl
@jstefanop Don't get toxic; this is the last place we need that. These are growing pains, with an inspectable environment and tangible changes from going open source, with a fully open backend fed by modern Clang. The relevance and importance of this project and of AMD has never been higher. They need to know you mean business, but you need to stay a customer too so you get your stuff performing; the crypto guys and the image-processing/dense-computation folks do too. It's clear they have a lot on their plate, and things look ready for a rapid evolution to kick-ass status, but it's hard to know what day that will be while watching and waiting as we have been. Have the patience of Buddha, remove the last line of your previous message, and contact AMD product marketing as officially as your university allows; sell the story of your research helping defeat cancer, enabled by their hardware. Tell them what you need working, in a way non-engineers would understand but with some specifics. Try to get the university's permission for a marketing blurb on LinkedIn/HPC news if you can, while imparting the importance of what you need. Try to make sure everybody wins. Then wait. Hopefully they'll either strategize differently or get more people to help get things done sooner, or things will just work at T+1. I do wish we knew what was wrong with the current state of the compiler; it's clearly a little wonky. The runtime clearly still has some bugs too, but I don't know if those are too bad... if I understood how the runtime worked (or where it is), I'd probably fix some of those issues.
The GROMACS issue is different from what @jstefanop was worried about; he is looking at currency mining. What you see in the data above is that the ROCm compiler is doing its job up to 3 ns of simulation time; it was leading up until then, and performance slows down where it falls behind. @pszi1ard We are digging into it more this month. Is the data line for 2236.1 on Fiji the AMDGPU-Pro 17.10 driver? I am now looking at what the memory utilization is at 3 ns simulation time for GROMACS. @pszi1ard Thank you for your patience. We finally have the EPYC 4-GPU server we talked about, which I can also test on. Also, ROCm 1.6.1 will be out next week; we did find some firmware issues that were affecting OpenCL. OpenCL on ROCm is currently a pre-release beta. We are working on making ROCm the best product possible; it is new and it will have issues. We are digging in to see whether it is a compiler or driver issue: we are going back and comparing the exact same compiler that we use on Windows and the AMDGPU-Pro 17.10 driver on ROCm to see whether it is a compiler issue or a base driver issue. Base driver issues take longer to triage since we sit on the AMDGPU driver, the same one the open-source driver uses, so we have to look at changes in the DRM and even in the base Linux kernel, and dig through all the firmware changes.
I think I've got that nailed down. It looks like the driver/runtime is quite resource-intensive; it wants core 0 (possibly hw thread 0?) to run on, and if it does not have it, it fails to launch GPU tasks until the core frees up. Does that make sense? I'm collecting detailed data and will post it later.
Here's more detailed data (also comparing to the CUDA runtime/driver). This essentially seems to confirm that if the first core is left empty (that is, neither of its hw threads is used), performance is good (although clFinish is somewhat expensive, it seems). Otherwise, if core 0 gets loaded, the GPU task scheduling behavior is pretty erratic. I've also noticed that with this modded 4.9 kernel (4.9.0-kfd-compute-rocm-rel-1.6-77), the per-core load does not show up in monitoring tools, so it looks like something is broken in /proc:
OpenCL has a producer/consumer thread for dispatch, and the OpenCL runtime should not be pinning it to the first core. The main developer expects the CQ thread to jump from one core to another if every core has pinned threads, to get a fair time share. He thinks there must be a heuristic preventing the thread from rescheduling on another core. He is going to try bumping the priority of the CQ thread to see if we still have that issue with oversubscription and pinning.
Not directly related, but GROMACS will by default use all hardware threads available, which has not (and it seems will not) play well with the AMD OpenCL runtime, so we'd need some heuristic to leave some resources available. How many resources do you expect this thread to require? Should, in theory, a hardware thread be enough? Is there only one CQ thread, one per device, or perhaps one per application context per device? I assume the dispatch work is NUMA-sensitive?
I did not assume that either; that's why I was careful not to claim that the runtime's thread is pinned. However, the observations seem to suggest that the thread does not move around.
Note that even if there are plenty of free cores, I still observe the peculiar behavior.
Good news: I see improvements across all performance issues; great work!
* Kernel performance is up by 25-30% and marginally better than the best ever observed (the fglrx reference);
* the register count for the particular kernel I looked at (nbnxn_kernel_ElecEw_VdwLJCombGeom_F_opencl) is down to 81, still a bit higher than before;
* the CQ thread issue seems mostly resolved (presumably with the new kernel driver), though I still see significant application slowdown when all threads are used in computation right after the GPU task enqueue (see columns 7 and 8 here: https://docs.google.com/spreadsheets/d/1GjIhiWLXsFK5SxE2n88-oz0dYZ3WXXifgIUTyKDE74M/pubchart?oid=939045217);
* measured PCIe bandwidth is improved: 6.7-9.4 GB/s peak (see https://drive.google.com/drive/folders/0B6dQqsegA1FMa3VYRHNFZmJmcXc).
Remaining issues:
* GPU queue sync (clFinish) still seems quite expensive, ending up taking 10-20% of the runtime at short iteration times;
* the peak PCIe transfer rate is still not reached.
When you test PCIe bandwidth, these are NUMA effects; we are looking into this.
Greg
I'm using a single-socket, single-NUMA-domain CPU; how can it be a NUMA effect? Can you comment on the expected driver/runtime resource needs, and whether my observed clFinish overhead is "normal" and here to stay? We need to provision resources for jitter and overheads.
We are looking at clFinish.
I will look at your Xeon, since they have an internal NUMA organization when you cross a certain number of cores.
Greg
It's an LLC single-ring i7-5960X, not a Xeon!
Thanks, it is still a Xeon, just rebranded ;) I used to work for Intel.
Greg
I have not worked at Intel, but Xeon is a marketing name rather than an architecture name, isn't it? ;)
In some of them, not yours.
Update: with ROCm 1.6.148 I see no significant changes in either of the remaining issues (PCIe transfer and task wait/launch overhead; related to the latter, see some numbers here:
Thanks, I will look this over.
Any update on this? We'll be releasing the next major version, and it would be good to know whether/what we should advise users.
Update: for now we ended up recommending the use of ROCm in the new release documentation, but I'd be more comfortable if we could test 1.7 better and hopefully have the issues solved soon.
Here's my ROCm 1.8 feedback. On our test machines, PCIe transfer speed is still rather low (about half of what it should be: ~6 GB/s in both directions, with transfers of a few MB in size); note that with AMDGPU-PRO 17.5 I get the expected 12 GB/s in both directions. Kernel performance is slightly improved, which is good. However, it seems that (as I suspected) something is off with Vega performance on ROCm. With an AMDGPU-PRO legacy install I seem to get ~30% better performance in our main kernel; the other kernels are also faster! This is a huge difference that would immediately make the Vega GPUs quite competitive, so I'd like to get to the bottom of it and hopefully improve it soon. (I'm still having plenty of trouble with rcp on ROCm, so please let me know if your team can look into this.)
Side note: what are the identical/different components in ROCm vs AMDGPU-PRO legacy/PAL? Is there thorough documentation on this somewhere?
@pszi1ard This is still on the MSI motherboard, correct?
We have two compilers. One is for OpenCL, going through Clang to LLVM to the HSAIL intermediate language; it then passes the IL to the finalizer, which calls the AMD GPU shader compiler that is also used for OpenGL and DirectX 12. That path can be supported on ORCA, PAL, and ROCr. It is a proprietary compiler, so we cannot open-source it. The compiler on the ROCm stack is 100% open source; you can find its documentation here.
Also, can you run this on 1.8 and report the numbers?
@gstoner Do you mean that the PRO stack uses an entirely different finalizer/codegen, so regardless of whether "legacy" mode is used there is no similarity with the ROCm compiler?
This is on an X99 and on a random Z97 mobo: the X99 system is clearly not getting peak performance. There is also still some discrepancy between what these HSA benchmarks show and what I measure in GROMACS/OpenCL.
@pszi1ard This benchmark tests the bandwidth at the ROCr level, removing the language runtime. So now I am going to have the team look over the OpenCL mapping to ROCr to see if there is an issue. I finally have an MSI X99 coming in; we do not see this on an ASUS X99.
Thanks! I've not looked at the implementation; does this benchmark use explicitly pinned buffers? As a side note: we've built C++ custom allocator support for page alignment and pinning for CUDA, so if there is any use in page-aligning or somehow pinning CPU buffers, we could certainly do that. Any advice you could give is welcome.
Great, thanks. In the meantime I'll try to see if we can update the firmware and get back to you.
I'm still observing PCIe BW issues. I have three cards: one in the infamous MSI X99 board and two in another ASUS X99; other than the RX 560 in the latter, neither the Fiji nor the Vega card behaves as it should wrt PCIe BW. Note that gpu_memory_benchmark also behaves weirdly: in the few-MB regime it peaks at >15-16 GB/s, which is, I think, not a reasonable measurement of PCIe BW. Used ROCm 1.8.2 (rock-dkms 1.8-192). Should I file a separate report about this?
We have now obsoleted the gpu_memory_benchmark. It was using the CPU for the timer, not the GPU, as NVIDIA does in its PCIe benchmark. You should use https://github.com/RadeonOpenCompute/rocm_bandwidth_test instead. One more thing: we just released a beta of our ROCm Validation Test Suite for validating ROCm on your hardware.
Apologies, I forgot about the new bandwidth test tool. I reran with rocm_bandwidth_test (detailed results below) and, not too unexpectedly, I get marginally higher numbers in most cases, with one exception: an RX 560 plugged into an ASUS X99 with PLX switches now performs very poorly with the rocm_bandwidth_test tool.
Detailed bench results:
Thanks @pszi1ard