OpenCL "slow" performance on ethminer (ethereum) #132
Installing ROCm 1.6 further decreases performance, to 15 MH/s (vs. 22 MH/s with AMDGPU-PRO 16.40). |
OK, the positive is that it runs. We have a team looking at mining on ROCm, now that the first milestone of the deep learning work has been released. |
Thanks for the reply and for looking into the problem. The difference (ROCm 1.4 vs AMDGPU-PRO) was also benchmarked by Phoronix - the result is similar to mine (see LuxMark 3.0, OpenCL Device: GPU - Scene: Luxball HDR) [1]. [1] http://www.luxmark.info/node/4487 |
Now the AMDGPU-PRO driver for Vega10 supports the new Lightning Compiler and the ROCm stack as well.

When we started the ROCm project, we made a decision to build a fully open source solution, which meant we needed to move away from the traditional shader compiler used in our graphics stack, since it was staying proprietary. The traditional flow was a two-stage compiler: we would compile the code to an intermediate language, HSAIL, which would then be picked up, finalized, and compiled by our shader compiler - the same backend used by graphics shaders.

This journey started in earnest a little over a year ago, when we looked for the best way forward to a fully open source compiler. We began with the LLVM R600 codebase, which needed a fair bit of work to become a production-class compiler, but it was the right foundation for our goal of a fully open stack. With this transition, we knew we would have performance gaps, which we are working to close. What we need help with from the community is testing a broader set of applications, reporting the results, and potentially doing some analysis of why. One thing we have also seen: you sometimes need to code differently for the LLVM compiler than for the SC-based compiler to get the best performance out of it.

We are now active in the LLVM community, pushing upgrades into the code base to better enable GPU computing, and our changes are upstreamed into the LLVM repository. One significant change is that the compiler now generates a GCN ISA binary object directly. This makes it easier for the compiler to support inline ASM in all of our languages (OpenCL, HCC, HIP), along with native assembler and disassembler support. It is also a critical foundation for our math library and MIOpen projects. For the last year, we have spent more time focusing on Fiji and Vega10 with the deep learning frameworks, MIOpen, and GEMM solvers.
We have also been filling in the gaps in LLVM for the optimizations we need for GPU computing: improving the scheduler, register allocator, loop optimizer, and a lot more. It is a fair bit of work, as you can imagine, but we have already seen the effort pay off, since the new compiler is faster on a number of codes. We test things like the following on the compiler:
New tests recently added: Radeon Rays, SideFX Houdini Test, Blender, Radeon ProRender. On the ray tracers we are just starting the performance analysis and optimization that is specific to this class of work. What you will see over the summer is a focus on compiler optimization for currency mining and ray tracing; I just have to stage this work in with the team.

I saw you referenced the Phoronix article: for ROCm 1.5 the new compiler was faster than the LLVM/HSAIL/SC stack on Fiji for Blender, but for LuxMark we were slower. http://www.phoronix.com/scan.php?page=article&item=rocm-15-opencl&num=2

One thing I will leave you with: we built a standardized loader, linker, and object format. This allows us to do something you never could with the AMDGPU-PRO driver - upgrade the compiler before we release a new driver. So we can now address issues in OpenCL, HCC, HIP, and the base LLVM compiler foundation independently of the base driver. Hope this helps |
Dear gstoner, thank you very much for the detailed explanation of the internal workings of ROCm. I haven't found such explanations of ROCm anywhere else on the internet; I think it should be posted somewhere more public.

I used ethereum and LuxMark as benchmarks, since it's hard to find other tools for performance measurements (Blender does not include a standard benchmark). Some benchmarks I simply didn't know how to install or use, and I didn't know the other benchmarks you mentioned; I will try them in the future. I had missed the part about ROCm being faster than AMDGPU-PRO. Congratulations to the whole team - many kudos!

The decision to go fully open source with Linux as the base platform is why I chose AMD graphics for my workstation. My colleagues opted for NVIDIA, since it traditionally has better support for research (CUDA applications, and Matlab support - essential!). Our institute also recently bought a "supercomputer" with 4x Titan X GPUs, and we are encouraged to use it. I am happy that I can develop/test neural networks on my PC (RX 480) and then deploy them to the "NVIDIA-based supercomputer" (simply many times faster). For smaller networks, rocCaffe works fast enough. Being open source and standards-based (OpenCL, HIP, Mesa, OpenGL, Vulkan, FreeSync, ...) motivates me to choose AMD products over proprietary, closed-source alternatives (CUDA, G-Sync, ShadowPlay, PhysX, ...). I hope AMD's CPU side will also become more open-source friendly (the Ryzen security platform and temperature sensors). With my own budget I am also opting for AMD - I am buying a Ryzen PC in the near future. I hope ROCm will "ROCK" and become popular enough that my institute also gets an AMD supercomputer solution, but with my work budget I simply cannot afford a Vega FE-class compute GPU in the near future.

I have a question about LLVM R600 - are you also reusing the open-source OpenCL code ("Clover", OpenCL 1.1)? I am not an OpenCL developer, but I am interested in the reason for not having an OpenCL 2.0 (or 2.1) device driver? (OpenCL 2.0 was already supported by fglrx on my previous HD 5670.) Last question - I have had it for months - are you planning to support older GCN-based GPUs (e.g. R9 270)? |
You won't find many people currently. |
> I have a question about LLVM R600 - are you also reusing the open-source OpenCL code ("Clover", OpenCL 1.1)?

No, it is not based on Clover. It is based on the core OpenCL language runtime and frontend we support in the AMDGPU-PRO and Windows drivers, but we map it to the ROCr runtime API.

> I am not an OpenCL developer, but I am interested in the reason for not having an OpenCL 2.0 (or 2.1) device driver? (OpenCL 2.0 was already supported by fglrx on my previous HD 5670.)

OpenCL 2.0 was really designed for APU or SoC devices. ROCm OpenCL supports all of the 2.0 API minus pipes and device-side enqueue, both of which really need more time in spec development. We are looking at bringing OpenCL 2.1 across ROCm, AMDGPU-PRO, and the Windows driver, but it is still under evaluation. Note that the majority of OpenCL code is still OpenCL 1.1 and 1.2, so it stays compatible with NVIDIA and Intel.

> Last question - I have had it for months - are you planning to support older GCN-based GPUs (e.g. R9 270)?

That is Tonga; we were going back and forth on this one. We need special firmware and capabilities which really only Fiji and later have. We experimented with Hawaii so we could have large memory and half precision, but Vega10 takes care of the large-memory issue.

Greg
|
So I was looking at the data and put the integer performance into a roofline plot to understand when and where each stack is faster. What you see is that current miners use very low IOPS/byte. Right now the crossover point for the two stacks is 8.25 IOPS/byte, and then they merge again at about 2.25 IOPS/byte. On SGEMM the crossover is 24.25 FLOPS/byte. This shows why FFT was slower on ROCm while GEMM is doing well. We will dig into this more and get you guys an updated patch. |
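For readers unfamiliar with roofline analysis: attainable throughput is bounded by min(peak compute, arithmetic intensity × memory bandwidth), and the ridge (crossover) point is peak ÷ bandwidth. A minimal sketch of that arithmetic - the peak and bandwidth figures below are illustrative placeholders, not measured numbers for any real GPU:

```shell
#!/bin/sh
# Roofline sketch: attainable = min(peak, intensity * bandwidth).
# Units: peak in GIOPS, bandwidth in GB/s, intensity in IOPS/byte.
# All numbers here are made up for illustration.
roofline() {
    peak=$1 bw=$2 intensity=$3
    awk -v p="$peak" -v b="$bw" -v i="$intensity" 'BEGIN {
        a = i * b              # memory-bound limit
        if (a > p) a = p       # clipped by the compute ceiling
        printf "%.2f\n", a
    }'
}

# Ridge point: the intensity where the two limits meet.
ridge() {
    awk -v p="$1" -v b="$2" 'BEGIN { printf "%.2f\n", p / b }'
}

roofline 6000 250 4     # low intensity: memory bound  -> 1000.00
roofline 6000 250 100   # high intensity: compute bound -> 6000.00
ridge 6000 250          # crossover at 24.00 IOPS/byte
```

A kernel sitting left of the ridge point (like the miners described above) is limited by memory traffic, so compiler changes that affect arithmetic throughput barely move it.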
Status update. I personally have been digging through performance issues and doing code review on the entire source base of the driver. My team normally works above the Thunk layer, but we are now going down and debugging firmware, the base Linux kernel, and the AMDGPU driver to sort out where the sources of the problems are.

The issue with Ryzen was a Linux kernel issue. In the AMDGPU base driver we found a power management issue where voltages were not being set correctly, forcing the chip to run inefficiently on Vega10. We found a few other issues in the base kernel driver. Based on this, 1.6.1 is moving to Linux kernel 4.11 and the AMDGPU base driver that goes with it.

At the Thunk layer, we found a VMA alignment issue that affects GFX8 devices (Fiji and Polaris 10); the fix is now in 1.6.1.

At the ROCr runtime level, we have an internal test we use covering:

- Fiji device memory, coarse-grained
- Polaris 10 device memory, coarse-grained
- Vega10 device memory, coarse-grained

We are now working back up through the language stack to push on memory performance. On ethash, if we comment out the isolate flag and also set the compiler parameters --cl-local-work 512 --cl-global-work 10752, we see a big jump in performance, to 37 MH/s. One thing to note with the new compiler: the same flags and settings you used in the past for the HSAIL/SC compiler may not give the best performance with the LLVM-based OpenCL compiler.

```diff
--- a/libethash-cl/ethash_cl_miner_kernel.cl
+// if (isolate)
```

We are finalizing 1.6.1, which I am hoping we have out by Tuesday. Note we will have a 1.6.2 release following it; I am looking at a few other areas right now that I would like to address in the HIP and OpenCL runtimes, which will not make it into 1.6.1. |
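The flag names above are the ones from this comment; as a sanity check before launching, the global work size should be an exact multiple of the local work size, or the OpenCL runtime rejects the NDRange. A hypothetical launch wrapper (the wrapper itself is mine, not part of ethminer):

```shell
#!/bin/sh
# Hypothetical wrapper for the tuning flags quoted above.
# ethminer's --cl-global-work must be a multiple of --cl-local-work.
LOCAL=512
GLOBAL=10752

if [ $((GLOBAL % LOCAL)) -ne 0 ]; then
    echo "error: global work ($GLOBAL) is not a multiple of local work ($LOCAL)" >&2
    exit 1
fi
echo "launching: ethminer -G --cl-local-work $LOCAL --cl-global-work $GLOBAL"
# Uncomment on a machine with the ROCm OpenCL stack installed:
# exec ethminer -G --cl-local-work "$LOCAL" --cl-global-work "$GLOBAL"
```

With the values from the comment, 10752 / 512 = 21 workgroups per dispatch, so the check passes.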
Thanks for running the tests. With the latest ROCm version (1.6.115) performance goes back to ~18 MH/s. I did not realize until now that the --cl-... flags matter so much. When setting --cl-local-work 256 and --cl-global-work 8192, I get around 21 MH/s, which is the same as AMDGPU-PRO.

Also, a big notice: I noticed in the past that setting a manual performance level below the maximum helped with noise without lowering performance. Now I have found that lowering the GPU core frequency actually helps, yet the heat and fan speed are higher. Any idea why?

Also, you can use triple ` at the start and end of a code section; the code is hard to read on GitHub otherwise, since it gets interpreted as markup...
|
Update: on the 8 GB Vega10 we are measuring this with a new build.
On a second test for the 8 GB Vega10.
We are also seeing good numbers with this update for Ethereum |
We rolled out ROCm 1.6.3 with 2 MB support. I am working to release a firmware solution to get you to a symmetric 413 GB/s on one of our key memory tests. |
@gstoner what are you guys getting for Ethereum with the ROCr API? I'm assuming this is not a straightforward change to the ethminer codebase to run it with this API? |
Can I contact you? I'd like to do a firmware beta test of this second piece of the puzzle.
|
@gstoner sure...same handle on Skype. |
OK, I am flying home now. I'll ping you tonight or Monday.
|
This thread seems to be going places - glad to see that. No intent to hijack, but I was wondering if I may be so bold as to throw a few quick questions in here, perhaps directly related:

a) Are you guys aware of the DAG slowdown issue at all? ( http://1stminingrig.com/amd-working-on-ethereum-mining-hashrate-drop-fix-for-polaris-gpus/ ) - it has something to do with the pipelining. I don't want to go into detail if this is outside the scope of this problem, or being addressed elsewhere.

b) ADL does still allow you to change some of the settings, and there are other methods involving kernel "patches" that allow you to change the GPU frequency. All of them give a MUCH better TDP (half), but don't affect performance. I get 23 MH/s on ETH at 910 MHz at approx. 80 W on an RX 480 8GB. It seems that @gsedej and I are more interested in running in a smaller thermal envelope than in burning up the silicon, and AMD makes this very hard. Most people are reflashing their cards to accomplish it; the *NIX tools provided are not great. More than anything I would love it if you could expand rocm-smi and at least give us what ADL can and still sort of does/did. I have assembled everything in Linux that Windows can do, so I know it is ALL possible. I can't understand how Windows gets a pretty front end and we get a half-working Python script that moans about every little thing. I have heard the objections about a lot of this being related to upstream changes in kernel requirements, but oddly this tool https://github.com/matszpk/amdcovc can do a lot of it too - without new kernels or ROCm - and, as I mentioned, ADL can also get some of it done, even on new silicon. The fact that it is all doable with a patchwork of bits and pieces rather than one clean tool is what boggles my mind.

c) I have 3 cards that throw NMI errors under load. I have had some of them replaced by the manufacturer, and they don't believe me that the cards are "faulty". Without going into great detail, the error manifests as repeatedly thrown NMIs, after which the card just loops through VM PROTECTION FAULTS on that slot over and over again. I have isolated everything, and I am convinced there is a bug somewhere in the firmware or the driver (I have tried every driver and just about every kernel version). I am not sure what my escalation path for this is, but I am very tired of trying to convince someone that there is something wrong with these cards. They have never been flashed/modified in any way, and they exhibit this behavior without any optimisations. If there is new firmware going around, I would be more than game to try it out. I have 3x 8GB Ellesmeres (out of a fairly large population) that are currently paperweights, I do have machines that satisfy the ROCm requirements for PCIe atomics, and I do have some experience mining with the ROCm stack (not a lot, because it was slow - i.e. the reason for this ticket). I used sgminer, which I could only compile using gcc. If ethminer compiles with clang/llvm, I would be keen to try that out!

Thanks, and apologies for the diversion. Address what you can, and ignore what you can't or don't feel is relevant. |
@int03h Try adding `amdgpu.vm_fragment_size=9` to `GRUB_CMDLINE_LINUX="..."`. To see that it worked, run `dmesg | grep fragment` at the shell prompt. ROCm-SMI allows you to set frequencies on the ROCm stack. The NMI issue I have not seen on Polaris cards, but we mostly work with Radeon Instinct cards. |
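A sketch of applying that kernel parameter: the parameter name comes from the comment above, but the helper function and file handling are my own, so adapt paths and the update-grub step to your distro.

```shell
#!/bin/sh
# Append amdgpu.vm_fragment_size=9 to GRUB_CMDLINE_LINUX in a grub
# config file, idempotently. The file path is passed as $1 so the
# edit can be tried on a copy before touching /etc/default/grub.
add_vm_fragment_param() {
    f=$1
    if grep -q 'amdgpu\.vm_fragment_size=' "$f"; then
        return 0  # already configured, do nothing
    fi
    # Insert the parameter just before the closing quote of the line.
    sed -i 's/^\(GRUB_CMDLINE_LINUX="[^"]*\)"/\1 amdgpu.vm_fragment_size=9"/' "$f"
}

# Demo on a scratch file:
printf 'GRUB_CMDLINE_LINUX="quiet splash"\n' > /tmp/grub.demo
add_vm_fragment_param /tmp/grub.demo
cat /tmp/grub.demo   # GRUB_CMDLINE_LINUX="quiet splash amdgpu.vm_fragment_size=9"

# For real use: run against /etc/default/grub, then update-grub
# (or grub2-mkconfig), reboot, and verify with: dmesg | grep fragment
```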
@gstoner Thanks! Let me give that a spin! Yeah - ROCm-SMI lets you set the GPU frequency, fans, etc., but it does not allow you to change the voltages, nor does it let you change the memory frequency. My settings - 2050 MHz / 910 MHz / 818 mV - got about 21 MH/s before the slowdown, with 5x RX 480 8GB cards at 339 W at the wall TOTAL (including mobo etc.). I am sure I could get more if I fiddled with the memory straps, but I don't want to flash the cards with non-reference BIOSes. As I say, this is not done via a vBIOS flash, and the rig has been up for as long as 25 days with a very low hardware error rate. Temps are at about 45C with fans at 80% (I don't let the fans auto-range - they don't seem to do a good job of figuring this out, since they stop turning for minutes). As I say - Windows users can break their cards with a few clicks. It seems like giving a bunch of kids all the power while giving "us" nothing of substance. I even know of someone who is modifying memory straps in memory! So it is all possible; it's just a matter of will. |
@TekComm hi, can you please post some more details? thanks :) |
Hi, you have taken quite a different route - highly technical - compared to the rest of the posts I have read on increasing MH/s. Not sure if you have seen https://access.redhat.com/solutions/2144921. Would you still use your custom init? If so, why? |
OK. Thank you. To me that means your performance of 44 MH/s is not related to your init, but rather mainly to pinning the gpus to the appropriate cpus, right? Am I simplifying it too much? |
Also, I am not sure how qrng and watchdog are related? I mean what is the idea? |
In one paragraph you say, "The overclocking of the cards is set in the 256k bios directly and the voltage is set there for undervolting" and in the next paragraph you say, "I don't oc and I don't use external fans". Could you add some context to those statements so I can better understand? Can I access your beta somewhere? |
The good news is that you can use the "ROCm driver" for cpp-ethminer, which fails on amdgpu-pro 17.10 (at least on the RX 480). I am posting this here since I couldn't find anyone else using ROCm for mining.
I do not know what the differences are between the amdgpu-pro and ROCm versions of OpenCL (AMD-APP); if someone could explain, that would be nice.
The "problem" is performance. The performance on an RX 480 using amdgpu-pro should be ~22 MH/s [*], but the max I can get is ~19 MH/s. The more interesting thing is that if I manually underclock to 900 MHz [**] (level 2 in `pp_dpm_sclk`), the speed stays the same, but there is a large reduction in noise, heat, and power consumption. Is there any known reason for the slower mining speed, and for it not scaling at higher frequencies?
[*] http://www.phoronix.com/scan.php?page=article&item=ethminer-linux-gpus&num=2
[**]
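The manual underclock described above goes through the amdgpu sysfs interface; a sketch follows. The `power_dpm_force_performance_level` and `pp_dpm_sclk` filenames are the standard amdgpu ones, but the card index and level number are assumptions - list your own `pp_dpm_sclk` first to see which levels your card exposes.

```shell
#!/bin/sh
# Pin the GPU core clock to a fixed DPM level via amdgpu sysfs.
# $1 lets the device directory be overridden, so the logic can be
# exercised against a fake directory tree without a GPU present.
set_sclk_level() {
    dev=${1:-/sys/class/drm/card0/device}
    level=$2
    # Switch DPM from automatic to manual control...
    echo manual > "$dev/power_dpm_force_performance_level"
    # ...then select the desired pp_dpm_sclk level (e.g. 2 for 900 MHz
    # on the RX 480 mentioned above).
    echo "$level" > "$dev/pp_dpm_sclk"
}

# Dry run against a fake sysfs tree:
fake=$(mktemp -d)
: > "$fake/power_dpm_force_performance_level"
: > "$fake/pp_dpm_sclk"
set_sclk_level "$fake" 2
cat "$fake/pp_dpm_sclk"   # 2
```

Running it against the real device directory requires root; switching `power_dpm_force_performance_level` back to `auto` restores the default behavior.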