
OpenCL performance regression on Gromacs #93

Closed · pszi1ard opened this issue Mar 4, 2017 · 77 comments

pszi1ard commented Mar 4, 2017

I measure up to 1.5x kernel performance regression with the ROCm 1.4 release compared to AMDGPU-PRO. The application is GROMACS version 2016.2.

gstoner commented Mar 4, 2017 via email

pszi1ard (Author) commented Mar 4, 2017

Thanks for the feedback. I know the compiler as released in 1.4 was far from optimal. However, for GROMACS, performance is the second most important functionality after correctness, so my report technically concerns application functionality :)

It would be great if you could include some GROMACS testing/benchmarks in your internal testing. We have a small number of hot kernels and quite peculiar application behavior that tends to stress the driver and cause API overhead, so those aspects would be good to get tested and improved if needed. Let me know how you would like to proceed.

pszi1ard (Author) commented Mar 4, 2017

I've had another look at some internal GROMACS profiler counters, and there are strong indications that the runtime is using a lot of CPU resources, resulting in both increased host-side cost of enqueue and increased interference with work executed on the CPU concurrently with the GPU. Are such issues also known/expected?

gstoner commented Mar 4, 2017 via email

pszi1ard (Author) commented Mar 4, 2017

@gstoner Sounds good. The test case you are using is pretty decent, but a bit more coverage of input sizes/use-cases and some command-line tweaks to run only the kernels of interest might not hurt.

Briefly, this is what's of strong interest and what I'd recommend tracking (using at least a few test cases):

  • (Post-load-balancing) average execution time of the hottest offloaded kernel across a range of input sizes. Bad performance with very small kernels can become a showstopper for strong scaling (hence the emphasis on the range of problem sizes); e.g. see [1], where besides getting good peak performance, improving the left-hand "tail" would be great;
  • OpenCL API overhead, which, especially at peak performance (< 1 ms/iteration), can be quite significant. Also of concern is the behavior of the driver overhead under a high kernel issue rate and when starved (by application threads), which has been a serious issue in the past, e.g. see [2]. (A minimal host-side timing sketch follows at the end of this comment.)

[1] GROMACS 5.1 / 2016 GPU kernel throughput https://drive.google.com/file/d/0B6dQqsegA1FMZk5kNXI4SzVhbzNyT0NackpCY05FNlM1dWNv/view?usp=sharing
[2] API overhead in GROMACS runs on three fglrx versions. https://drive.google.com/open?id=0B6dQqsegA1FMLTJZb2NRWUlLbHc

Let me know if you need more input and what other feedback would be useful for you. I'd really like to see the Vega GPUs hit the ground running (hopefully with rocm and also mesa!), but so far software stack issues have been the one thing that has been limiting both development and user adoption.
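For reference, here is the kind of per-call measurement I mean: a minimal host-side sketch in plain C against the standard OpenCL API (illustrative only, the function name is mine; it assumes the queue was created with CL_QUEUE_PROFILING_ENABLE):

/* Minimal sketch: separate the host-side cost of an enqueue call from the
 * device-side execution time of the kernel it launches. */
#include <stdio.h>
#include <time.h>
#include <CL/cl.h>

static void measure_launch(cl_command_queue q, cl_kernel k, size_t gsize)
{
    cl_event ev;
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    clEnqueueNDRangeKernel(q, k, 1, NULL, &gsize, NULL, 0, NULL, &ev);
    clock_gettime(CLOCK_MONOTONIC, &t1);   /* host-side enqueue cost */

    clWaitForEvents(1, &ev);

    cl_ulong start = 0, end = 0;           /* device-side execution time */
    clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_START,
                            sizeof(start), &start, NULL);
    clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_END,
                            sizeof(end), &end, NULL);
    clReleaseEvent(ev);

    printf("enqueue: %.1f us, kernel: %.1f us\n",
           (t1.tv_sec - t0.tv_sec) * 1e6 + (t1.tv_nsec - t0.tv_nsec) * 1e-3,
           (end - start) * 1e-3);
}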

gstoner commented Mar 4, 2017 via email

pszi1ard (Author) commented Mar 4, 2017

When you dig into the core kernel, what is the critical section running on the GPU that you see the most? Is it GEMM or some other kernel function?

If I understand correctly, you're asking about the GPU algorithm that runs on the critical path? It's not GEMM, but a pair-interaction algorithm (based on neighbor-list traversal) for which we use our own SIMD-tuned implementation.

What is the average GPU local memory needed for running real jobs: 4 GB, 8 GB, or more?

Not sure if you are referring to local or global memory in OpenCL terminology (I assume the latter given the Gigabytes)?

Also, when you do multi-GPU enablement, are you looking at a library like NCCL on the CUDA side now?

No, at the moment all communication is done on the CPU using MPI/shared memory.
We are porting more code to offload to GPUs and I have been looking into CUDA GPUDirect and NCCL, but OpenCL porting will likely lag behind CUDA, so it is unlikely to be important in the next year or so.

pszi1ard (Author) commented Mar 4, 2017

What is the average GPU local memory needed for running real jobs: 4 GB, 8 GB, or more?
Not sure if you are referring to local or global memory in OpenCL terminology (I assume the latter given the Gigabytes)?

@gstoner forgot to answer: for typical runs we need <200 MB, for pretty much all relevant cases <1 GB. This may increase a little as we move to offloading more, but ultimately GROMACS runs in the strong-scaling regime, where the performance practical for research is only achieved when the per-node data is very small.

gstoner commented Mar 4, 2017 via email

pszi1ard (Author) commented Mar 4, 2017

Yes, it should have be dig into

@gstoner Could you clarify?

Also, maybe my above late edit was missed, so let me reiterate it:

Let me know if you need more input and what other feedback would be useful for you. I'd really like to see the Vega GPUs hit the ground running (hopefully with rocm and also mesa!), but so far software stack issues have been the one thing that has been limiting both development and user adoption.

gstoner commented Mar 4, 2017 via email

gstoner commented Mar 4, 2017 via email

pszi1ard (Author) commented Mar 4, 2017

@gstoner The kernel is here. It implements a pair-interaction algorithm that's tuned for SIMD architectures.

  • Single precision
  • Significant amount of 32-bit integer ops
  • Arithmetically intensive (~15 flops/byte)
  • Uses lots of registers
  • Typically instruction-latency-bound on GPUs

gstoner commented Mar 4, 2017 via email

gstoner commented Mar 4, 2017

Do you have Fiji?

greg

pszi1ard (Author) commented Mar 4, 2017

PS: the kernel loves lane shuffle for reduction which we're greatly missing on AMD hardware!
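To illustrate what I mean, here is a sketch of the reduction pattern, assuming the clang-based compiler exposed the GCN3 cross-lane builtin __builtin_amdgcn_ds_bpermute to OpenCL C (with ds_bpermute, each lane fetches the source operand from the lane selected by a byte address, i.e. lane index * 4):

/* Sketch: 64-lane wavefront sum reduction as a butterfly of cross-lane
 * reads; assumes __builtin_amdgcn_ds_bpermute is usable from OpenCL C. */
inline float wave_reduce_sum(float v)
{
    uint lane = (uint)get_local_id(0) & 63u;   /* lane index in the wavefront */
    for (uint offset = 32u; offset > 0u; offset >>= 1u)
    {
        uint src_lane = lane ^ offset;         /* butterfly partner */
        int  partner  = __builtin_amdgcn_ds_bpermute((int)(src_lane << 2),
                                                     as_int(v));
        v += as_float(partner);
    }
    return v;                                  /* every lane holds the total */
}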

pszi1ard (Author) commented Mar 4, 2017

We have an R9 Nano for development. (BTW performance compared to the green guys is in the PDF linked earlier: https://drive.google.com/file/d/0B6dQqsegA1FMZk5kNXI4SzVhbzNyT0NackpCY05FNlM1dWNv/view?usp=sharing)

gstoner commented Mar 4, 2017

Did you see this article AMD GCN Assembly: Cross-Lane Operations http://gpuopen.com/amd-gcn-assembly-cross-lane-operations/

We can do this now inside the new compiler; there was a lot of hardware capability that was masked by the old compiler.

Posted on August 10, 2016 by Ben Sander
Boltzmann, GCN, GPU, HCC, HIP, HSA
Cross-lane operations are an efficient way to share data between wavefront lanes. This article covers in detail the cross-lane features that GCN3 offers. I’d like to thank Ilya Perminov of Luxsoft for co-authoring this blog post.

Terminology
We’ll be optimizing communication between work-items, so it is important to start with a consistent set of terminology:

The basic execution unit of an AMD GCN GPU is called a wavefront, which is basically a SIMD vector.
A wavefront comprises 64 parallel elements, called lanes, that each represent a separate work item.
A lane index is a coordinate of the work item in a wavefront, with a value ranging from 0 to 63.
Because a wavefront is the lowest level that flow control can affect, groups of 64 work items execute in lockstep. The actual GCN hardware implements 16-wide SIMD, so wavefronts decompose into groups of 16 lanes called wavefront rows that are executed on 4 consecutive cycles.
This hardware organization affects cross-lane operations – some operations work at the wavefront level and some only at the row level. We’ll discuss the details below.

Why Not Just Use LDS?
Local data share (LDS) was introduced exactly for that reason: to allow efficient communication and data sharing between threads in the same compute unit. LDS is a low-latency RAM physically located on chip in each compute unit (CU). Still, most actual compute instructions operate on data in registers. Now, let's look at the peak-performance numbers. The memory bandwidth of AMD's Radeon R9 Fury X is an amazing 512 GB/s. Its LDS implementation has a total memory bandwidth of (1.05 GHz) * (64 CUs) * (32 LDS banks) * (4 bytes per read per lane) = 8.6 TB/s. Just imagine reading all the content of a high-capacity 8 TB HDD in one second! Moreover, the LDS latency is an order of magnitude less than that of global memory, helping feed all 4,096 insatiable ALUs. LDS is only available at the workgroup level.

At the same time, the register bandwidth is (1.05 GHz) * (64 CUs) * (64 lanes) * (12 bytes per lane) = 51.6 TB/s. That's another order of magnitude, so communication between threads is much slower than just crunching data in the thread registers.

But can we do better by sharing? The answer is yes, if we further reduce our scope from a workgroup to a single wavefront.

pszi1ard (Author) commented Mar 4, 2017

Did you see this article AMD GCN Assembly: Cross-Lane Operations http://gpuopen.com/amd-gcn-assembly-cross-lane-operations/

I have seen it, but I did not think it was possible to use in-line ASM with OpenCL. Or is there some other way to implement it in OpenCL?

We can do this now inside the new compiler; there was a lot of hardware capability that was masked by the old compiler.

So how and when will you expose it in the (OpenCL) compiler?

gstoner commented Mar 4, 2017 via email

pszi1ard (Author) commented Mar 5, 2017

We fixed that with the new OpenCL compiler using the native code generator. We want to give the ability to fully express the hardware even when the compiler cannot generate the instructions. We figured everyone needed it, since we needed it for our work on MIOpen, our deep learning solver, and Tensile.

That sounds great. How will this be exposed? Will you add OpenCL extensions for the equivalent permute/swizzle intrinsics supported in HCC? Will you allow inline ASM in OpenCL?

I do not know if you saw this one on Sub Dword Addressing as well http://gpuopen.com/using-sub-dword-addressing-on-amd-gpus-with-rocm/

No, I have not. Thanks for the link. In our current kernels I doubt we can use these tricks. We need SP for floating point; integer data is either 32-bit bitmasks or indices that do not lend themselves well to packing.

gstoner commented Mar 5, 2017 via email

pszi1ard (Author) commented Mar 6, 2017

Sounds good. Is there an ETA/release schedule for 1.5 and later? Actually, do you have any ROCm roadmap/plans regarding features, support, etc. that you can share?

pszi1ard (Author) commented:

I have tested the ROCm 1.6 release and my initial kernel-only performance assessment is: performance is still 30%-40% lower than what I measured with the old fglrx stack (and the early versions of AMDGPU-PRO). Here's the data with GROMACS 2016 obtained using a benchmark that runs a wide range of input sizes:

BorisI commented Jul 13, 2017

It's possible to optimize performance bottlenecks by replacing the kernels with native ISA. Here's a repository with examples and how-to's: https://github.com/ROCm-Developer-Tools/LLVM-AMDGPU-Assembler-Extra

BorisI closed this as completed Jul 13, 2017
BorisI reopened this Jul 13, 2017
Kirpich30000 commented Jul 13, 2017

BTW,

LLVM supports GCN inline assembly (the syntax is the same as https://gcc.gnu.org/onlinedocs/gcc/Using-Assembly-Language-with-C.html).
You can try to use it. Disclaimer: it's not easy, but it works =)

A simple example can be found here: https://github.com/ROCm-Developer-Tools/LLVM-AMDGPU-Assembler-Extra/blob/master/examples/gfx8/s_memrealtime_inline.cl
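For reference, the inline assembly in the linked example looks roughly like this (GCC-style constraints; "=s" places the 64-bit result in a scalar register pair):

/* Modeled on the linked s_memrealtime example: GCN inline assembly in
 * OpenCL C under the ROCm clang front end. */
__kernel void read_realtime(__global ulong *out)
{
    ulong ticks;
    /* s_memrealtime reads a free-running 64-bit counter;
       s_waitcnt lgkmcnt(0) waits for the read to complete. */
    __asm volatile("s_memrealtime %0\n"
                   "s_waitcnt lgkmcnt(0)"
                   : "=s"(ticks));
    out[get_global_id(0)] = ticks;
}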

nevion commented Jul 14, 2017

@jstefanop don't get toxic. This is the last place we need that. These are growing pains that come with going open source: an inspectable environment and tangible changes, built on a fully open stack fed by modern clang. The relevance and importance of this project and of AMD have never been higher. They need to know you mean business, but you also need to stay a customer so you get your stuff performing; the crypto guys and the image-processing/dense-computation folks too. It's clear they have a lot on their plate and things look ready for a rapid evolution to kick-ass status, but it's hard to know what day that will be while watching and waiting as we have been. Have the patience of Buddha, remove the last line of your previous message, and contact AMD product marketing as officially as your university allows: sell the story of your research helping defeat cancer, enabled by their hardware. Tell them what you need working in a way non-engineers would understand, but with some specifics. Try to get the university's permission for a marketing blurb on LinkedIn/HPC news if you can, while imparting the importance of what you need. Try to make sure everybody wins. Then wait. Hopefully they'll either strategize differently or get more people to help get things done sooner, or things will just work at T+1.

I do wish we knew what was wrong with the current state of the compiler; it's clearly a little wonky. The runtime clearly still has some bugs too, but I don't know if those are too bad... if I understood how the runtime worked (or where its code is) I'd probably fix some of those issues.

gstoner commented Jul 14, 2017

The Gromacs issue is different from what @jstefanop worried about; he is looking at currency mining. What you see in the data above is that the ROCm compiler is doing its job up to 3 ns simulation time; it was leading up until then, and performance falls behind beyond that point. @pszi1ard we are digging into it more this month. Is the data line for 2236.1 on Fiji the AMDGPU-PRO 17.10 driver?

I am now looking at what the memory utilization is for Gromacs at 3 ns simulation time, @pszi1ard. Thank you for your patience. We finally have the EPYC 4-GPU server we talked about, which I can also test on.

Also, ROCm 1.6.1 will be out next week; we did find some firmware issues that were affecting OpenCL. OpenCL on ROCm is a pre-release beta right now. We are working on making ROCm the best product possible; it is new and it will have issues. We are digging in to see if it is a compiler or driver issue.

We are going back and comparing the exact same compiler that we use on Windows and the AMDGPU-PRO 17.10 driver against ROCm to see whether it is a compiler issue or a base driver issue. Base driver issues take longer to triage since we sit on the AMDGPU driver, the same one the open-source driver uses. So we have to look at changes in the DRM and even in the base Linux kernel, and dig through all the firmware changes.

Here is a comparison of memory bandwidth for the two compilers:
[two screenshots: memory bandwidth comparison, Jul 14, 2017]

pszi1ard (Author) commented:

I also reproducibly get no asynchronous execution behavior in some cases.

I think I've got that nailed down. It looks like the driver / runtime is quite resource-intensive and wants core 0 (possibly hw thread 0?) to run on, and if it does not have it, it fails to launch GPU tasks until the core frees up. Does that make sense?

I'm collecting detailed data and will post it later.

pszi1ard (Author) commented Jul 19, 2017

Here's more detailed data (also comparing to the CUDA runtime/driver):
https://docs.google.com/spreadsheets/d/1GjIhiWLXsFK5SxE2n88-oz0dYZ3WXXifgIUTyKDE74M/edit?usp=sharing

This essentially seems to confirm that if the first core is left empty (that is, neither of its hw threads is used), performance is good (although clFinish is somewhat expensive, it seems). Otherwise, if core 0 gets loaded, the GPU task scheduling behavior is pretty erratic.

I've also noticed that with this modded 4.9 kernel (4.9.0-kfd-compute-rocm-rel-1.6-77), the per-core load does not show up in monitoring tools, so it looks like something is broken in /proc:

$ stress -c 1 &
$ ps aux | grep $!
pszilard 18516  0.0  0.0   7332   960 pts/1    S    02:18   0:00  |           \_ stress -c 1 -t 10

gstoner commented Jul 19, 2017

OpenCL has a producer/consumer thread for dispatch. The OpenCL runtime should not be pinning to the first core; the main developer expects the CQ thread to jump from one core to another to get a fair time share if all cores have pinned threads. He thinks there must be a heuristic preventing the thread from rescheduling onto another core.

He is going to try bumping the priority of the CQ thread to see if we still have that issue with oversubscription and pinning.

pszi1ard (Author) commented:

OpenCL has a producer/consumer thread for dispatch.

Not directly related, but GROMACS will by default use all hardware threads available, which has not played well (and it seems will not play well) with the AMD OpenCL runtime, so we'd need some heuristic to leave some resources available.

How many resources do you expect this thread to require? Should a hardware thread be enough in theory? Is there only one CQ thread, or one per device (perhaps one per application context per device)? I assume the dispatch work is NUMA-sensitive?

the OpenCL runtime should not be pinning to the first core; the main developer expects the CQ thread to jump from one core to another to get a fair time share if all cores have pinned threads.

I did not assume that either; that's why I was careful not to claim that the runtime's thread is pinned. However, the observations do suggest that the thread does not move around.

He is going to try bumping the priority of the CQ thread to see if we still have that issue with oversubscription and pinning.

Note that even if there are plenty of free cores, I still observe the peculiar behavior.
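To illustrate the kind of application-side workaround this implies, here is a minimal Linux sketch that keeps the calling thread off core 0 so the runtime's dispatch thread is not starved (illustrative only; the helper name is made up, and core 0's HW-thread sibling would need the same treatment):

/* Restrict the calling thread to all online logical CPUs except 0. */
#define _GNU_SOURCE
#include <sched.h>
#include <unistd.h>

static int avoid_core0(void)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    long n = sysconf(_SC_NPROCESSORS_ONLN);
    for (long c = 1; c < n; ++c)    /* skip logical CPU 0 */
        CPU_SET(c, &set);
    return sched_setaffinity(0, sizeof(set), &set);   /* 0 = calling thread */
}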

pszi1ard (Author) commented Aug 3, 2017

Good news: I see improvements across all the performance issues -- great work!

Remaining issues:

  • GPU queue sync (clFinish) still seems quite expensive, ends up taking 10-20% of runtime at short iteration time;
  • peak PCI-E transfer rate is still not reached

gstoner commented Aug 3, 2017 via email

pszi1ard (Author) commented Aug 6, 2017

When you test PCIe bandwidth, this is NUMA effects; we are looking into this.

I'm using a single socket, single NUMA CPU, how can it be a NUMA effect?

Can you comment on the expected driver/runtime resource needs, and whether my observed clFinish overhead is "normal" and here to stay? We need to provision resources for jitter and overheads.

gstoner commented Aug 6, 2017 via email

pszi1ard (Author) commented Aug 6, 2017

I will look at your Xeon since they have internal NUMA organization when you cross a certain number of cores

It's an LLC single-ring i7 5960X, not a Xeon!

gstoner commented Aug 6, 2017 via email

pszi1ard (Author) commented Aug 6, 2017

I have not worked at Intel, but Xeon is a marketing name rather than an arch name, isn't it ;)
In any case, do you mean that there can be a NUMA effect even though it's an LLC single-ring / home-agent chip?

gstoner commented Aug 6, 2017

In some of them, not yours.

pszi1ard (Author) commented:

Update: with ROCm 1.6.148 I see no significant changes in either of the remaining issues (PCIe transfer and task wait/launch overhead -- related to the latter, see some numbers here:
https://docs.google.com/spreadsheets/d/1bKI9FwHh8AGXkK4XtK3qBOTyDJ-7gxbuSTomuLkAEKY/edit?usp=sharing
)

gstoner commented Aug 28, 2017

Thanks, I will look this over.

pszi1ard (Author) commented:

Any update on this? We'll be releasing the next major version and it would be good to know whether/what we should advise users.

pszi1ard (Author) commented:

Update: for now we ended up recommending the use of ROCm in the new release documentation, but I'd be more comfortable if we could test 1.7 better and hopefully have the issues solved soon.

pszi1ard (Author) commented:

Here's my ROCm 1.8 feedback.

On our test machines PCIe transfer speed is still rather low (about half of what it should be: ~6 GB/s in both directions, with transfers of a few MB in size); note that with AMDGPU-PRO 17.5 I get the expected 12 GB/s in both directions.

Kernel performance is slightly improved, which is good. However, it seems that (as I suspected) something is off with Vega performance on ROCm. With an AMDGPU-PRO legacy install I seem to get ~30% better performance in our main kernel; the other kernels are also faster! This is a huge difference that would immediately make the Vega GPUs quite competitive, so I'd like to get to the bottom of it and hopefully improve it soon. (I'm still having plenty of trouble with rcp on ROCm, so please let me know if your team can look into this.)

pszi1ard (Author) commented May 16, 2018

Side-note: which components are identical/different in ROCm vs AMDGPU-PRO legacy / PAL? Is there thorough documentation on this somewhere?

gstoner commented May 16, 2018

@pszi1ard this is still on the MSI motherboard, correct?

gstoner commented May 16, 2018

Side-note: which components are identical/different in ROCm vs AMDGPU-PRO legacy / PAL? Is there thorough documentation on this somewhere?

We have two compilers. One is OpenCL using clang to LLVM to the HSAIL intermediate language; it then passes the IL to the finalizer, which calls the AMD GPU shader compiler that is also used for OpenGL and DirectX 12. This can be supported on ORCA, PAL, and ROCr. It is a proprietary compiler, so we cannot open-source it.

The compiler on the ROCm stack is 100% open source; you can find its documentation here:
https://llvm.org/docs/AMDGPUUsage.html#introduction. Note this supports ROCr and PAL as well.

gstoner commented May 16, 2018

Also, can you run this on 1.8 and report the numbers
https://github.com/RadeonOpenCompute/rocm_bandwidth_test

pszi1ard (Author) commented:

@gstoner Do you mean that the PRO stack uses an entirely different finalizer/codegen, so regardless of whether "legacy" mode is used there is no similarity with the ROCm compiler?

Also, can you run this on 1.8 and report the numbers
https://github.com/RadeonOpenCompute/rocm_bandwidth_test

This is on an X99 and on a random Z97 mobo:
http://termbin.com/r9z2c
http://termbin.com/my5b

The X99 system is clearly not getting the peak performance. There is also still some discrepancy between what these HSA benchmarks show and what I measure in GROMACS / OpenCL.

gstoner commented May 17, 2018

@pszi1ard This benchmark tests the bandwidth at the ROCr level, removing the language runtime. So now I am going to get the team to look over the OpenCL mapping to ROCr to see if there is an issue.

I finally have the MSI X99 come in; we do not see this on the ASUS X99.

pszi1ard (Author) commented:

@pszi1ard This benchmark tests the bandwidth at the ROCr level, removing the language runtime. So now I am going to get the team to look over the OpenCL mapping to ROCr to see if there is an issue.

Thanks! I've not looked at the implementation; does this benchmark use explicitly pinned buffers?

As a side-note: we've built C++ custom allocator support for page alignment and pinning for CUDA, so if there is any use in page-aligning or somehow pinning CPU buffers, we could certainly do that. Any advice you can give is welcome.
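In OpenCL terms, the analogous pinned path I would expect is a CL_MEM_ALLOC_HOST_PTR buffer that is kept mapped -- a sketch under that assumption (not GROMACS code; error handling omitted):

/* Allocate a staging buffer the runtime can pin, and map it so the returned
 * pointer can serve as the host side of clEnqueueRead/WriteBuffer copies. */
#include <CL/cl.h>

void *alloc_pinned_host(cl_context ctx, cl_command_queue q,
                        size_t bytes, cl_mem *staging /* out */)
{
    cl_int err;
    *staging = clCreateBuffer(ctx, CL_MEM_ALLOC_HOST_PTR, bytes, NULL, &err);
    return clEnqueueMapBuffer(q, *staging, CL_TRUE,
                              CL_MAP_READ | CL_MAP_WRITE,
                              0, bytes, 0, NULL, NULL, &err);
}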

I finally have the MSI X99 come in; we do not see this on the ASUS X99.

Great, thanks. In the meantime I'll try to see if we can update the firmware and get back.

pszi1ard (Author) commented:

I'm still observing PCIe BW issues. I have three cards: one in the infamous MSI X99 board, two in another Asus X99. Other than the RX560 in the latter, neither the Fiji nor the Vega card behaves as it should wrt PCIe BW:
Fiji + MSI X99 http://termbin.com/ic26
Vega + Asus X99 http://termbin.com/mxpj
RX560 + Asus X99 http://termbin.com/4i1x

Note that the gpu_memory_benchmark also behaves weirdly: in the few-MB regime it peaks at >15-16 GB/s, which is, I think, not a reasonable measurement for PCIe BW.

Used ROCm 1.8.2 (rock-dkms 1.8-192).

Should I file a separate report about this?

gstoner commented Aug 30, 2018

We have now obsoleted that benchmark (gpu_memory_benchmark). It was using the CPU for the timer, not the GPU as NVIDIA does in its PCIe benchmark. You should use https://github.com/RadeonOpenCompute/rocm_bandwidth_test

One more thing: we just released a beta of our ROCm Validation Suite for validating ROCm on your hardware.
https://github.com/ROCm-Developer-Tools/ROCmValidationSuite

pszi1ard (Author) commented Sep 3, 2018

Apologies, I forgot about the new bandwidth test tool. I reran with rocm_bandwidth_test (detailed results below) and, not too unexpectedly, I get marginally higher numbers in most cases, with one exception: the RX560 plugged into an Asus X99 with PLX switches now performs very poorly with the rocm_bandwidth_test tool:

$ ./rocm_bandwidth_test -s 0 -d 1 -m 128
..

================           Benchmark Result         ================
================ Src Device Id: 0 Src Device Type: Cpu ================
================ Dst Device Id: 1 Dst Device Type: Gpu ================

Data Size      Avg Time(us)   Avg BW(GB/s)   Min Time(us)   Peak BW(GB/s)  
128 MB         9715.109667    13.815359      9705.184000    13.829488      

 $ ./rocm_bandwidth_test -s 0 -d 2 -m 128
..

================           Benchmark Result         ================
================ Src Device Id: 0 Src Device Type: Cpu ================
================ Dst Device Id: 2 Dst Device Type: Gpu ================

Data Size      Avg Time(us)   Avg BW(GB/s)   Min Time(us)   Peak BW(GB/s)  
128 MB         90470.829000   1.483547       90465.811000   1.483629       


 $ ./gpu_memory_benchmark -f 0 -t 1 -s 131072
================ User-Defined  Mode Result ===================================
  131072KB                             9.747396

$ ./gpu_memory_benchmark -f 0 -t 2 -s 131072
================ User-Defined  Mode Result ===================================
  131072KB                             13.471502

Detailed bench results:
Fiji + MSI X99 http://termbin.com/jv45
Vega + Asus X99 http://termbin.com/0cng
RX560 + Asus X99 http://termbin.com/in4c

ROCmSupport commented:

Thanks @pszi1ard.
As it's a very old issue with no updates for the last 2 years, it is going to be closed.
Please open a new ticket if you find any further issues.
Thank you.
