ROCm RX VEGA hash rates for Cryptonight (linux vs windows) #325
If I remember correctly, hitting 1900+ H/s with Cryptonight on Windows required doing a hot re-initialization of the GPUs. I remember people were initially doing this by enabling or disabling HBCC, which seemed to cause a re-initialization of the GPU. Without the re-initialization, I believe performance is pretty similar to the Linux ROCm stack. This seems to suggest that the Windows initialization procedures at boot and for hot reset are different somehow. I would love to get some clarity into what difference is causing this performance delta. A 50% performance increase running the exact same code is impressive.
@todxx It's nothing to do with HBCC; just reloading the driver with a disable/enable in Device Manager is all it takes. There's no way to do that on Linux: neither `modprobe -r` nor `rmmod`/`insmod` will allow `amdgpu` or `amdkfd` to be removed and reloaded.
Can whatever it is Windows is toggling when reinitializing the GPUs be accomplished with a BIOS mod?
@rhlug I don't think it has to do just with reloading the driver. I think the way Windows reloads the driver for a hot reset is somehow different from how it loads on boot, but I'm just guessing here. @ob7 I believe the BIOS on Vega has to be signed and is therefore not moddable. In either case, without knowing what is being changed, it will be difficult to reproduce.
Perhaps you can try disabling the module on boot and only loading it manually before running the compute program. My suspicion, though, is that on Windows the card is being used by something during the whole boot process and its resources are not released completely; the reload perhaps allows all needless resources on the card to be released. By the way, what are you using to tune card cooling and speed on Linux? It seems proper underclocking is always needed for good results.
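One way to test that suggestion: blacklist the GPU modules so they never load at boot, then load them by hand just before mining. A sketch, assuming a standard Ubuntu setup with the ROCm dkms driver installed (file path and module names are the usual ones, not taken from this thread):

```shell
# /etc/modprobe.d/blacklist-amdgpu.conf -- keep the GPU driver out of early boot
blacklist amdgpu
blacklist amdkfd

# Rebuild the initramfs so the blacklist takes effect at next boot:
#   sudo update-initramfs -u
#
# Then, just before starting the miner:
#   sudo modprobe amdgpu
#   sudo modprobe amdkfd   # only on stacks where kfd is a separate module
```

This only changes *when* the module loads; as noted above, once loaded it still cannot be cleanly removed.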
@rhlug Did you achieve 1900 H/s on Cryptonight without overclocking the memory frequency on Windows? As far as I know, overclocking the memory is required to get a hashrate like 1900 H/s. (Am I wrong?)
@lsimplify You can only achieve 1900 H/s on Windows with overclocking sure enough, but even without OC a Vega is at >1500 H/s after the disable/enable toggle. On Linux it’s less than 1200 H/s without OC, and maybe ~1300 with. @akostadinov You can simply let a Vega run compute jobs on a headless system, forgoing X completely. It doesn’t change anything, sadly.
The author of the original Vega mining guide on reddit writes that it has something to do with a power saving feature, but doesn’t give any specifics on how he’s reached that conclusion:
@949f45ac This might be what the author thinks, but it is not necessarily true, and it is also driver dependent. I for one had a very unstable setup with the plain blockchain driver. Then I updated only the driver with a newer one from the pro-series driver (leaving the rest from the blockchain driver). The setup is much more stable. The interesting thing is that I have to
This is the best I have so far. If I run the underclock utility after disable/enable, the card draws much more power for some reason. Until stable Linux drivers are upstreamed, the situation will remain poor, it seems. IMO it is still worth trying to avoid loading ROCm until just before the compute software is to be run. I can't try it though, because my mainboard is incompatible :/ I decided to wait until the Linux drivers stabilize and there are better statistics on which mainboards are supported.
@TekComm What you write is true for the RX 400/500 series. However, it seems that Vega memory overclocks just fine with

I personally believe it would be very nice if we got control over all the remaining DPM features on Vega. Looking into vega10_hwmgr.h in the kernel driver, we see this:

```c
enum {
	GNLD_DPM_PREFETCHER = 0,
	GNLD_DPM_GFXCLK,
	GNLD_DPM_UCLK,
	GNLD_DPM_SOCCLK,
	GNLD_DPM_UVD,
	GNLD_DPM_VCE,
	GNLD_ULV,
	GNLD_DPM_MP0CLK,
	GNLD_DPM_LINK,
	GNLD_DPM_DCEFCLK,
	// goes on with non-DPM features
};
```

So there's the graphics clock (GFXCLK), the memory clock (UCLK), but also apparently the SoC clock and something related to the prefetcher. When I profile a Cryptonight miner on Vega with RCP, I see

Another thing I notice is that the main loop in Cryptonight, which is iterated over 500k times doing two random 16-byte reads and two random 16-byte write-backs, actually has a
@949f45ac I think you might be onto something there. A misbehaving prefetcher could cause a perf hit like this, and your data seems to back it up. I'd be curious to see what happens if the 3 lines starting here are changed/removed to disable the prefetcher SMU feature. It would be nice to get some input from the devs on what the GNLD_DPM_PREFETCHER feature actually controls before playing Vega roulette.
@TekComm I've made the changes and have a build going. However, I'm not sure if I need to do anything to make it play nice with roc-dkms; I'm not familiar with how dkms works. I guess I'll just try a normal kernel install and hope for the best.
Disabling the feature had no effect. Cryptonight is still running around 1200 H/s.
@todxx I already tried simply disabling

What we actually want, though, is to try different manual overrides of the prefetcher power level. For this we'd need the driver extended to expose a file to sysfs like
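For comparison, amdgpu already exposes sysfs knobs for the GFX and memory DPM levels; a prefetcher override would presumably look similar. A sketch of the existing interface (assumes the Vega is `card0` and commands are run as root; state indices vary per card):

```shell
# Switch DPM to manual control
echo manual > /sys/class/drm/card0/device/power_dpm_force_performance_level

# List available sclk states; the active one is marked with '*'
cat /sys/class/drm/card0/device/pp_dpm_sclk

# Force a specific state by index (e.g. state 2); pp_dpm_mclk works the same way
echo 2 > /sys/class/drm/card0/device/pp_dpm_sclk
```

A hypothetical `pp_dpm_prefetcher` file following this pattern would let us test power-level overrides without rebuilding the kernel each time.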
This is correct; GNLD_DPM_PREFETCHER has nothing to do with the instruction cache or data cache prefetcher.
@gstoner So is it possible, in principle, to control the power state of the Vega memory prefetcher from the driver? We are talking about a compute kernel that goes through a loop 2^19 times, doing 2 reads with a write-back each in every iteration:

```c
uint a, b;
uint128_t data;
uint128_t pad[131072];   /* 2 MiB scratchpad */

for (int i = 0; i < 524288; i++) {
    data = pad[a];            /* first random read */
    data = hash1(data);
    pad[a] = data;            /* write-back to the line just read */
    b = newBfromData(data);   /* next address depends on the hash */

    data = pad[b];            /* second random read */
    data = hash2(data);
    pad[b] = data;
    a = newAfromData(data);
}
```

The memory prefetcher seems to be pointlessly reading ahead after each of these random reads, and also scheduling every write to a wrong location at first, possibly worsening performance. Do you think modifying the prefetcher's power state could help? Or do you have another idea on how to improve performance?
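The dependency structure of that loop can be sketched in runnable form. Note this is a toy model: `PAD_WORDS`, `mix()` and the index derivation are illustrative stand-ins, NOT the real CryptoNight primitives; the point is only that every address depends on data hashed an instant earlier, so a hardware prefetcher can never guess the next line.

```python
PAD_WORDS = 131072  # scratchpad lines (2 MiB / 16 B in the real algorithm)
MASK64 = 0xFFFFFFFFFFFFFFFF

def mix(x: int) -> int:
    """Cheap 64-bit mixer (splitmix64-style) standing in for hash1/hash2."""
    x = (x + 0x9E3779B97F4A7C15) & MASK64
    x = ((x ^ (x >> 30)) * 0xBF58476D1CE4E5B9) & MASK64
    x = ((x ^ (x >> 27)) * 0x94D049BB133111EB) & MASK64
    return x ^ (x >> 31)

def walk(pad: list, iters: int) -> int:
    """Run the dependent read/hash/write-back loop; return the final index."""
    a = 0
    for _ in range(iters):
        d = mix(pad[a])        # random read, then hash
        pad[a] = d             # write-back to the line just read
        b = d % PAD_WORDS      # next address is known only after hashing
        d = mix(pad[b])
        pad[b] = d
        a = d % PAD_WORDS
    return a
```

Because each address is a function of freshly hashed data, iterations form a strict serial dependency chain; any line a prefetcher pulls in speculatively after the random fetch is wasted bandwidth.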
@rhlug Can you try the beta http://repo.radeon.com/misc/archive/beta/rocm-1.7.1.beta.4.tar.bz2 ? It supports the 4.13 Linux kernel.
I tested this out since I was testing another bug with the 1.7.1 beta 4. There seems to be no difference in Cryptonight performance, though I did not profile to check fetch behaviour.
@gstoner I made a clean installation of Ubuntu 16.04.4 with ROCm 1.7.1b4. The CryptoNight algo gives me up to 1200 H/s; the same miner on Windows gives me up to 1900 H/s.
The card is a reference Vega 56.
Thanks, I am talking with the SMU team about the delta. We will soon have a new ROCm profiling foundation available with trace and perf counter support; it is a massive update on what we had in the past. We are also working on a debugger that will show the PC, VGPRs, SGPRs, and support breakpoints, run control, etc. It would be great in the short run if we can get someone to profile the base driver with perf and eBPF now that we have a DKMS-based install. Here is a rough outline of the API of the new profiler foundation:

- Returned API status
- Info API
- Context API
- Sampling API
- Intercepting API
- Returning the error string method
@gstoner Provide some instructions and I will try to do my best (-:
Is there any update on this? |
Apparently with the new Windows driver release 18.3.4 the cryptonight performance bugs on Windows are completely fixed, i.e. you don’t even have to toggle your cards off/on anymore to achieve great performance. @gstoner Maybe you could talk to the team building the Windows drivers and find out what fix they put into this release and get it done in the Linux drivers too? That’d be wonderful. Release notes for 18.3.4 simply state
Still no updates? |
What is the unrolling bug you are talking about? Do you have a link to more information? Just curious.
Is there a reproducer program? I'm seeing lock-ups here with a Frontier Edition, but it could also be the driver vs an older mobo and CPU.
@TekComm would that be why my Vega FE does this? https://i.imgur.com/xTDxroL.jpg |
Maybe some light is shed on magic Windows HBCC driver switch? |
@TekComm So the problem can be solved with a ROM modification? Can you suggest any references for finding such mods?
From my experience
@ddobreff I'm seeing errors such as:
Is this due to the driver issue you are talking about?
That's what the broken compiler from 18.30 causes; read up on what I wrote.
Cast XMR released a beta for Linux on the 5th. Although it supposedly works with amdgpu-pro, I've tried everything I know and I end up with the following on ROCm:
https://bitcointalk.org/index.php?topic=2256917.2900 If anyone can take a look, much appreciated!
I flipped the switches on my reference 56s over to the 64 BIOS. I did a like-for-like test on xmrig-hip and xmr-stak 2.5.0, and xmrig-hip is around 1% faster.
I'm pushing a 1390 MHz sclk pp_table to all of these Vegas, but I believe the drivers base the final sclk on some factor of ASIC quality? So my GPU3 (XFX reference 56 with 64 BIOS) only runs an actual sclk of 1310 @ 925 mV.
With this same setup, here are the xmr-stak dev 2.5.0 rates.
That said, I can split xmr-stak into 6 processes and not lose hashrate like I do on xmrig-hip, so I'm using xmr-stak because it makes monitoring processes and hashrates easier.
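For reference, "pushing a pp_table" as described above is usually done through the amdgpu sysfs interface. A sketch, assuming the card is `card0` and a pre-built modified table exists (the filename `vega56_mod.pp_table` is hypothetical):

```shell
# Upload a modified soft PowerPlay table (binary blob) to card0
sudo sh -c 'cat vega56_mod.pp_table > /sys/class/drm/card0/device/pp_table'

# Force a performance-level re-evaluation so the new limits are applied
sudo sh -c 'echo manual > /sys/class/drm/card0/device/power_dpm_force_performance_level'
```

As the comment above notes, the driver may still clamp the final sclk below the table's value depending on ASIC quality and voltage.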
I don't see any advantage of xmrig/xmr-stak-hip over normal OpenCL hashrate.
And clocks:
Those are 10s rates on a 1-card sample size; I'm not sure how much I can trust those numbers. For me, xmrig-hip is 1% faster when comparing the 15m hashrate averages across a 6-card rig, which is a somewhat stronger sample size. Maybe you can duplicate longer tests across more cards to get some statistically significant numbers? And what is your actual sclk at 1450 @ 900 mV? For me, the drivers seem to do whatever the fuck they want if I don't give them gobs of power.
|
For me, OpenCL is >5% faster now. Running HIP at 8x448, OpenCL 2 threads, each at intensity=1928, unroll=8, strided=2, stride=2. Part of the problem comes down to 8x448x2=7168, meaning ~700 MB of memory are unused by the HIP miner; but blocks=480 is slower, because it is not an integer multiple of 56. I get 835 H/s on RX 470 sclk=1126 (BIOS timings mod, but memclock @ 1750) though (4x480, kind of weird config), and cannot even get more than 800 with OpenCL. I think I’m missing the right settings. |
I only have one Vega64 and it doesn't really matter if I run it 10sec or 3h. |
@949f45ac You should be able to push well over 900 with OpenCL on your RX 470s. Mine with 4 GB Samsung memory push 965, and the 4 GB Hynix push 950 @ 1250 MHz core.
Ref: "the kfd rule". I haven't actually tried (tested) this udev rule, for the regex and for the env{vars} ucase/lcase aspect: `/etc/udev/rules.d/00-rocm-or-not`

> "To try ROCm with an upstream kernel, install ROCm as normal, but do not install the rock-dkms package. Also add a udev rule to control /dev/kfd permissions:"

`IMPORT{cmdline}="BOOT_IMAGE"` #581
@rhlug, would you mind sharing your Vega configuration for xmr-stak? I've just updated to the 2.5.0 for the Monero-v8 fork, made minimal edits to my config to get it to at least start up, and find my Vega FE hash rate has halved! :( Having real trouble finding info on it... |
(also, I'm on linux 4.19_rc8 if that helps...) |
Ubuntu released Cosmic Cuttlefish yesterday. Monero went to v8 today. Geniuses used the words proximately, as if by association, "AMD..asymmetry...CUDA...parallelism", but I thought hip/hcc were limited by CUDA seriality on prefetch for Monero (algo v8), as I read about xmr-stak-hip? Then there's hardware capacity for asymmetrical/parallel prefetch and such, but what, the languages aren't capable? ..."Vulkan", "SPIR" (...already in edge OpenCL). Asymmetry/parallelism and even AI in prefetch are required in languages for Monero's algo and to use this: "Vega GPU IP also includes in that HBCC/HBC/memory controller subsystem the ability to manage in its Vega GPU micro-architecture based GPU SKUs the GPU's own virtual memory paging swap space of up to 512TB of total GPU virtual memory address space. So that's 512TB addressable into any system DRAM and onto the system's memory swap space of any system NVM/SSD or hard drive storage devices attached". Geniuses need to know what veteran light bulb changers know for sure: "On CUDA, memory operations cannot be run asynchronously like that. Instead we use prefetch to achieve a similar effect."
Updated https://github.com/949f45ac/xmrig-HIP to support CN/2 aka Monerov8/9.
"Asynchronous" can be on many levels. E.g. all the GPU compute frameworks work in a manner such that you put jobs into a pipeline and then collect the results later. This is asynchronous, but not very important as far as Monero is concerned, since our most important compute task runs for about 2 seconds, meaning we don’t put a lot of jobs into the pipeline, compared to other programs. Memory transfer between RAM and GPU device can also be asynchronous on some frameworks – again we don’t care a lot, as the amount of memory a miner transfers is laughably small. |
@949f45ac have you tried using mainline kernel? I have serious performance drop using mainline vs amdgpu-pro in OpenCL. |
Can confirm mainline kernel is slow. Both HIP miner and OpenCL miner suck on it. What should the OpenCL v8 performance on amdgpu-pro be? |
I don't know expected stock performance but with these settings-> sclk: 1250, mclk: 1100, vlt: 875 my Asus Strix Vega64 does 1820H/s without any problem using xmrig-amd or xmr-stak, castxmr is a bit slower. |
@ddobreff is this hashrate measured with latest Monero v8 fork? |
@uentity Check out everything by Spudz76 on xmr-stak's GitHub. He benchmarks what he says about xmr-stak, xmrig, 18.30 and ROCm. I'm scripting my own Ubuntu downgrade release, Narnic Nun, applying a few debconf/apt/dpkg pins to xenial server. The gist of it is, the compilers aren't working for AMD 18.30 and/or ROCm and/or newer kernels. I assume that's temporary.
@BobDodds thanks, I'll take a deeper look, but at a glance:
|
OK, I might have wasted your time then, @uentity. That's such a big drop in hashrate that I would suspect errors. If your GPUs are detected by xmr-stak, that's the first sign you're OK, but such a big drop hints at possible errors. clinfo gives my GPUs a clean bill of health, and I can do tricks with pp_tables and turn fans up and down, but xmr-stak will compile yet not recognize the GPUs.
Yes, the hashrate is for CN/2, a.k.a. Monero v8, with the amdgpu-pro 18.30 dkms kernel driver and the 18.10 OpenCL compiler.
I am on Ubuntu 18 with the 4.15.0-55-generic kernel. I have the ROCm driver installed with rocm-dkms. I have xmr-stak 2.10.7 fd19a5d and the hashrate is around 1300 H/s. Can I improve the speed, or do I have to switch to Windows?
@cirolaferrara
Fixed with Debian 10, this guide, and installing the OpenCL driver.
I'm running OpenCL 19.30 on Arch and still can't reach 2000 H/s with a 56-flashed-to-64 Vega, using soft PP tables and amdmemtweak with TMR. Any suggestions? Thanks!
I am closing this issue as it's around 2 years old.
Going to start a new issue in hopes of finding a solution to the performance of Cryptonight mining on Linux under ROCm, as we continue to lag behind the Windows Aug 23rd blockchain drivers by 35%.
GPUs: RX Vega 64
Running the Aug 23 blockchain drivers on Windows, I see 1900 H/s Cryptonight and 39 MH/s ethash.
Running ROCm 1.7 on Ubuntu, I see 1250 H/s Cryptonight and 39 MH/s ethash.
The fact that I get like rates on ethash suggests the OpenCL stack is just as good as on Windows.
I gave Windows 64 GB of virtual memory and Ubuntu 64 GB of swap. Tested amdkfd.noretry 1 and 0.
Any other recommendations on things to try?
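For anyone reproducing the noretry test above: it is a module parameter, settable on the kernel command line. A sketch assuming GRUB on Ubuntu (whether the parameter lives under `amdkfd` or `amdgpu` depends on the driver version in use):

```shell
# /etc/default/grub -- pass the module parameter at boot
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash amdkfd.noretry=1"

# Apply and reboot:
#   sudo update-grub && sudo reboot
#
# Verify after reboot (path exists only if the module exposes the parameter):
#   cat /sys/module/amdkfd/parameters/noretry
```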