ROCm RX VEGA hash rates for Cryptonight (linux vs windows) #325
If I remember correctly, hitting 1900+ H/s with Cryptonight on Windows required doing a hot re-initialization of the GPUs. I remember people were initially doing this by enabling or disabling HBCC, which seemed to cause a re-initialization of the GPU. Without the re-initialization, I believe performance is pretty similar to the Linux ROCm stack. This seems to suggest that the Windows initialization procedures at boot and for hot reset are different somehow. I would love to get some clarity into what difference is causing this performance delta. A 50% performance increase running the exact same code is impressive.
@todxx It's nothing to do with HBCC; just reloading the driver with a disable/enable in Device Manager is all it takes. There's no way to do that on Linux: neither `modprobe -r` nor `rmmod`/`insmod` will allow `amdgpu` or `amdkfd` to be removed and reloaded.
Can whatever it is Windows is toggling when reinitializing the GPUs be accomplished with a BIOS mod?
@rhlug I don't think it has to do just with reloading the driver. I think the way Windows reloads the driver for a hot reset is somehow different from how it loads on boot, but I'm just guessing here. @ob7 I believe the BIOS on Vega has to be signed and is therefore not moddable. In either case, without knowing what is being changed, it will be difficult to reproduce.
Perhaps you can try disabling the module on boot and only loading it manually before running the compute program. My suspicion, though, is that on Windows the card is being used by something during the whole boot process and its resources are not released completely; the reload perhaps allows all needless resources on the card to be released. By the way, what are you using to tune card cooling and speed on Linux? It seems proper underclocking is always needed for good results.
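One way to test that suggestion: blacklist the GPU modules so they never load at boot, then load them by hand just before mining. A sketch, assuming a standard Ubuntu setup with the ROCm dkms driver installed (file path and module names are the usual ones, not taken from this thread):

```shell
# /etc/modprobe.d/blacklist-amdgpu.conf -- keep the GPU driver out of early boot
blacklist amdgpu
blacklist amdkfd

# Rebuild the initramfs so the blacklist takes effect at next boot:
#   sudo update-initramfs -u
#
# Then, just before starting the miner:
#   sudo modprobe amdgpu
#   sudo modprobe amdkfd   # only on stacks where kfd is a separate module
```

This only changes *when* the module loads; as noted above, once loaded it still cannot be cleanly removed.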
@rhlug Did you achieve 1900 H/s on Cryptonight without overclocking the memory frequency on Windows? As far as I know, overclocking the memory is required to get a hashrate like 1900 H/s. (Am I wrong?)
@lsimplify You can only achieve 1900 H/s on Windows with overclocking sure enough, but even without OC a Vega is at >1500 H/s after the disable/enable toggle. On Linux it’s less than 1200 H/s without OC, and maybe ~1300 with. @akostadinov You can simply let a Vega run compute jobs on a headless system, forgoing X completely. It doesn’t change anything, sadly.
The author of the original Vega mining guide on reddit writes that it has something to do with a power saving feature, but doesn’t give any specifics on how he’s reached that conclusion:
@949f45ac This might be what the author thinks, but it is not necessarily true, and it is also driver dependent. I for one had a very unstable setup with the plain blockchain driver. Then I updated only the driver with a newer one from the pro-series driver (leaving the rest from the blockchain driver). The setup is much more stable. The interesting thing is that I have to
This is the best I have so far. If I run the underclock utility after disable/enable, the card draws much more power for some reason. Until stable Linux drivers are upstreamed, the situation will remain poor, it seems. IMO it is still worth trying to avoid loading ROCm until just before the compute software is to be run. I can't try it though, because my mainboard is incompatible :/ I decided to wait until the Linux drivers stabilize and there are better statistics on which mainboards are supported.
@TekComm What you write is true for the RX 400/500 series. However, it seems that Vega memory overclocks just fine with

I personally believe it would be very nice if we got control over all the remaining DPM features on Vega. Looking into vega10_hwmgr.h in the kernel driver, we see this:

```c
enum {
	GNLD_DPM_PREFETCHER = 0,
	GNLD_DPM_GFXCLK,
	GNLD_DPM_UCLK,
	GNLD_DPM_SOCCLK,
	GNLD_DPM_UVD,
	GNLD_DPM_VCE,
	GNLD_ULV,
	GNLD_DPM_MP0CLK,
	GNLD_DPM_LINK,
	GNLD_DPM_DCEFCLK,
	// goes on with non-DPM features
};
```

So there's the graphics clock (GFXCLK), the memory clock (UCLK), but also apparently the SoC clock and something related to the prefetcher. When I profile a Cryptonight miner on Vega with RCP, I see

Another thing I notice is that the main loop in Cryptonight, which is iterated over 500k times doing two random 16-byte reads and two random 16-byte write-backs, actually has a
@949f45ac I think you might be onto something there. A misbehaving prefetcher could cause a perf hit like this, and your data seems to back it up. I'd be curious to see what happens if the 3 lines starting here are changed/removed to disable the prefetcher SMU feature. It would be nice to get some input from the devs on what the GNLD_DPM_PREFETCHER feature actually controls before playing Vega roulette.
@TekComm I've made the changes and have a build going. However, I'm not sure if I need to do anything to make it play nice with roc-dkms; I'm not familiar with how dkms works. I guess I'll just try a normal kernel install and hope for the best.
Disabling the feature had no effect. Cryptonight is still running around 1200 H/s.
@todxx I already tried simply disabling

What we actually want, though, is to try different manual overrides of the prefetcher power level. For this we'd need the driver extended to expose a file to sysfs like
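For comparison, amdgpu already exposes sysfs knobs for the GFX and memory DPM levels; a prefetcher override would presumably look similar. A sketch of the existing interface (assumes the Vega is `card0` and commands are run as root; state indices vary per card):

```shell
# Switch DPM to manual control
echo manual > /sys/class/drm/card0/device/power_dpm_force_performance_level

# List available sclk states; the active one is marked with '*'
cat /sys/class/drm/card0/device/pp_dpm_sclk

# Force a specific state by index (e.g. state 2); pp_dpm_mclk works the same way
echo 2 > /sys/class/drm/card0/device/pp_dpm_sclk
```

A hypothetical `pp_dpm_prefetcher` file following this pattern would let us test power-level overrides without rebuilding the kernel each time.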
This is correct; GNLD_DPM_PREFETCHER has nothing to do with the instruction cache or data cache prefetcher.
@gstoner So is it possible, in principle, to control the power state of the Vega memory prefetcher from the driver? We are talking about a compute kernel that goes through a loop 2^19 times, doing 2 reads with a write-back each in every iteration:

```c
uint a, b;
uint128_t data;
uint128_t pad[131072];   /* 2 MiB scratchpad */

for (int i = 0; i < 524288; i++) {
    data = pad[a];            /* first random read */
    data = hash1(data);
    pad[a] = data;            /* write-back to the line just read */
    b = newBfromData(data);   /* next address depends on the hash */

    data = pad[b];            /* second random read */
    data = hash2(data);
    pad[b] = data;
    a = newAfromData(data);
}
```

The memory prefetcher seems to be pointlessly reading ahead after each of these random reads, and also scheduling every write to a wrong location at first, possibly worsening performance. Do you think modifying the prefetcher's power state could help? Or do you have another idea on how to improve performance?
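The dependency structure of that loop can be sketched in runnable form. Note this is a toy model: `PAD_WORDS`, `mix()` and the index derivation are illustrative stand-ins, NOT the real CryptoNight primitives; the point is only that every address depends on data hashed an instant earlier, so a hardware prefetcher can never guess the next line.

```python
PAD_WORDS = 131072  # scratchpad lines (2 MiB / 16 B in the real algorithm)
MASK64 = 0xFFFFFFFFFFFFFFFF

def mix(x: int) -> int:
    """Cheap 64-bit mixer (splitmix64-style) standing in for hash1/hash2."""
    x = (x + 0x9E3779B97F4A7C15) & MASK64
    x = ((x ^ (x >> 30)) * 0xBF58476D1CE4E5B9) & MASK64
    x = ((x ^ (x >> 27)) * 0x94D049BB133111EB) & MASK64
    return x ^ (x >> 31)

def walk(pad: list, iters: int) -> int:
    """Run the dependent read/hash/write-back loop; return the final index."""
    a = 0
    for _ in range(iters):
        d = mix(pad[a])        # random read, then hash
        pad[a] = d             # write-back to the line just read
        b = d % PAD_WORDS      # next address is known only after hashing
        d = mix(pad[b])
        pad[b] = d
        a = d % PAD_WORDS
    return a
```

Because each address is a function of freshly hashed data, iterations form a strict serial dependency chain; any line a prefetcher pulls in speculatively after the random fetch is wasted bandwidth.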
@rhlug Can you try the beta http://repo.radeon.com/misc/archive/beta/rocm-1.7.1.beta.4.tar.bz2 ? It supports the 4.13 Linux kernel.
I tested this out since I was testing another bug with the 1.7.1 beta 4. There seems to be no difference in Cryptonight performance, though I did not profile to check fetch behaviour.
@gstoner I made a clean installation of Ubuntu 16.04.4 with ROCm 1.7.1b4. The CryptoNight algo gives me up to 1200 H/s; the same miner on Windows gives me up to 1900 H/s.
The card is a reference Vega 56.
Thanks, I am talking with the SMU team about the delta. We will soon have a new ROCm profiling foundation available with trace and perf counter support; it is a massive update on what we had in the past. We are also working on a debugger that will show the PC, VGPRs, SGPRs, and support breakpoints, run control, etc. It would be great in the short run if we can get someone to profile the base driver with perf and eBPF now that we have a DKMS-based install. Here is a rough outline of the API of the new profiler foundation:

- Returned API status
- Info API
- Context API
- Sampling API
- Intercepting API
- Returning the error string method
@gstoner Provide some instructions and I will try to do my best (-:
Is there any update on this? |
Apparently with the new Windows driver release 18.3.4 the cryptonight performance bugs on Windows are completely fixed, i.e. you don’t even have to toggle your cards off/on anymore to achieve great performance. @gstoner Maybe you could talk to the team building the Windows drivers and find out what fix they put into this release and get it done in the Linux drivers too? That’d be wonderful. Release notes for 18.3.4 simply state
Still no updates? |
What is the unrolling bug you are talking about? Do you have a link to more information? Just curious.
Is there a reproducer program? I'm seeing lock-ups here with a Frontier Edition, but it could also be the driver vs an older mobo and CPU.
@TekComm would that be why my Vega FE does this? https://i.imgur.com/xTDxroL.jpg |
Maybe some light is shed on magic Windows HBCC driver switch? |
@TekComm So the problem can be solved with a ROM modification? Can you suggest any references for finding such mods?
From my experience
@ddobreff I'm seeing errors such as:
Is this due to the driver issue you are talking about?
That's what the broken compiler from 18.30 causes; read up on what I wrote.
Cast XMR released a beta for Linux on the 5th. Although it supposedly works with amdgpu-pro, I've tried everything I know and I end up with the following on ROCm:
https://bitcointalk.org/index.php?topic=2256917.2900 If anyone can take a look, much appreciated!
I flipped the switches on my reference 56s over to the 64 BIOS. I did a like-for-like test on xmrig-hip and xmr-stak 2.5.0, and xmrig-hip is around 1% faster.
I'm pushing a 1390 MHz sclk pp_table to all of these Vegas, but I believe the drivers base the final sclk on some factor of ASIC quality? So my GPU3 (XFX reference 56 with 64 BIOS) only runs an actual sclk of 1310 @ 925 mV.
With this same setup, here are the xmr-stak dev 2.5.0 rates.
That said, I can split xmr-stak into 6 processes and not lose hashrate like I do on xmrig-hip, so I'm using xmr-stak because it makes monitoring processes and hashrates easier.
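For reference, "pushing a pp_table" as described above is usually done through the amdgpu sysfs interface. A sketch, assuming the card is `card0` and a pre-built modified table exists (the filename `vega56_mod.pp_table` is hypothetical):

```shell
# Upload a modified soft PowerPlay table (binary blob) to card0
sudo sh -c 'cat vega56_mod.pp_table > /sys/class/drm/card0/device/pp_table'

# Force a performance-level re-evaluation so the new limits are applied
sudo sh -c 'echo manual > /sys/class/drm/card0/device/power_dpm_force_performance_level'
```

As the comment above notes, the driver may still clamp the final sclk below the table's value depending on ASIC quality and voltage.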
I don't see any advantage of xmrig/xmr-stak-hip over normal OpenCL hashrate.
And clocks:
Those are 10s rates on a 1-card sample size; I'm not sure how much I can trust those numbers. For me, xmrig-hip is 1% faster when comparing the 15m hashrate averages across a 6-card rig, which is a somewhat stronger sample size. Maybe you can duplicate longer tests across more cards to get some statistically significant numbers? And what is your actual sclk at 1450 @ 900 mV? For me, the drivers seem to do whatever the fuck they want if I don't give them gobs of power.
|
For me, OpenCL is >5% faster now. Running HIP at 8x448, OpenCL 2 threads, each at intensity=1928, unroll=8, strided=2, stride=2. Part of the problem comes down to 8x448x2=7168, meaning ~700 MB of memory are unused by the HIP miner; but blocks=480 is slower, because it is not an integer multiple of 56. I get 835 H/s on RX 470 sclk=1126 (BIOS timings mod, but memclock @ 1750) though (4x480, kind of weird config), and cannot even get more than 800 with OpenCL. I think I’m missing the right settings. |
I only have one Vega64 and it doesn't really matter if I run it 10sec or 3h. |
@949f45ac You should be able to push well over 900 with OpenCL on your RX 470s. Mine with 4 GB Samsung memory push 965, and the 4 GB Hynix push 950 @ 1250 MHz core.
Ref: "the kfd rule". I haven't actually tried (tested) this udev rule, for the regex and for the env{vars} ucase/lcase aspect: `/etc/udev/rules.d/00-rocm-or-not`

> "To try ROCm with an upstream kernel, install ROCm as normal, but do not install the rock-dkms package. Also add a udev rule to control /dev/kfd permissions:"

`IMPORT{cmdline}="BOOT_IMAGE"` #581
@rhlug, would you mind sharing your Vega configuration for xmr-stak? I've just updated to the 2.5.0 for the Monero-v8 fork, made minimal edits to my config to get it to at least start up, and find my Vega FE hash rate has halved! :( Having real trouble finding info on it... |
(also, I'm on linux 4.19_rc8 if that helps...) |
Ubuntu released Cosmic Cuttlefish yesterday. Monero went to v8 today. Geniuses used the words proximately, as if by association, "AMD..asymmetry...CUDA...parallelism", but I thought hip/hcc were limited by CUDA seriality on prefetch for Monero (algo v8), as I read about xmr-stak-hip? Then there's hardware capacity for asymmetrical/parallel prefetch and such, but what, the languages aren't capable? ..."Vulkan", "SPIR" (...already in edge OpenCL). Asymmetry/parallelism and even AI in prefetch are required in languages for Monero's algo and to use this: "Vega GPU IP also includes in that HBCC/HBC/memory controller subsystem the ability to manage in its Vega GPU micro-architecture based GPU SKUs the GPU's own virtual memory paging swap space of up to 512TB of total GPU virtual memory address space. So that's 512TB addressable into any system DRAM and onto the system's memory swap space of any system NVM/SSD or hard drive storage devices attached". Geniuses need to know what veteran light bulb changers know for sure: "On CUDA, memory operations cannot be run asynchronously like that. Instead we use prefetch to achieve a similar effect."
Updated https://github.com/949f45ac/xmrig-HIP to support CN/2 aka Monerov8/9.
"Asynchronous" can be on many levels. E.g. all the GPU compute frameworks work in a manner such that you put jobs into a pipeline and then collect the results later. This is asynchronous, but not very important as far as Monero is concerned, since our most important compute task runs for about 2 seconds, meaning we don’t put a lot of jobs into the pipeline, compared to other programs. Memory transfer between RAM and GPU device can also be asynchronous on some frameworks – again we don’t care a lot, as the amount of memory a miner transfers is laughably small. |
@949f45ac have you tried using mainline kernel? I have serious performance drop using mainline vs amdgpu-pro in OpenCL. |
Can confirm mainline kernel is slow. Both HIP miner and OpenCL miner suck on it. What should the OpenCL v8 performance on amdgpu-pro be? |
I don't know expected stock performance but with these settings-> sclk: 1250, mclk: 1100, vlt: 875 my Asus Strix Vega64 does 1820H/s without any problem using xmrig-amd or xmr-stak, castxmr is a bit slower. |
@ddobreff is this hashrate measured with latest Monero v8 fork? |
@uentity Check out everything by Spudz76 on xmr-stak's GitHub. He benchmarks what he says about xmr-stak, xmrig, 18.30 and ROCm. I'm scripting my own Ubuntu downgrade release, Narnic Nun, applying a few debconf/apt/dpkg pins to xenial server. The gist of it is, the compilers aren't working for AMD 18.30 and/or ROCm and/or newer kernels. I assume that's temporary.
@BobDodds thanks, I'll take a deeper look, but at a glance:
|
OK, I might have wasted your time then, @uentity. That's such a big drop in hashrate that I would suspect errors. If your GPUs are detected by xmr-stak, that's the first sign you're OK, but such a big drop hints at possible errors. clinfo gives my GPUs a clean bill of health, and I can do tricks with pp_tables and turn fans up and down, but xmr-stak will compile yet not recognize the GPUs.
Yes, the hashrate is for CN/2, a.k.a. Monero v8, with the amdgpu-pro 18.30 dkms kernel driver and the 18.10 OpenCL compiler.
I am on Ubuntu 18 with the 4.15.0-55-generic kernel. I have the ROCm driver installed with rocm-dkms. I have xmr-stak 2.10.7 fd19a5d and the hashrate is around 1300 H/s. Can I improve the speed, or do I have to switch to Windows?
@cirolaferrara
Fixed with Debian 10, this guide, and installing the OpenCL driver.
I'm running OpenCL 19.30 on Arch and still can't reach 2000 H/s with a 56-flashed-to-64 Vega, using soft PP tables and amdmemtweak with TMR. Any suggestions? Thanks!
I am closing this issue as it's around 2 years old.
Going to start a new issue in hopes of finding a solution to the performance of Cryptonight mining on Linux under ROCm, as we continue to lag behind the Windows Aug 23rd blockchain drivers by 35%.
GPUs: RX Vega 64
Running the Aug 23 blockchain drivers on Windows, I see 1900 H/s Cryptonight and 39 MH/s ethash.
Running ROCm 1.7 on Ubuntu, I see 1250 H/s Cryptonight and 39 MH/s ethash.
The fact that I get like rates on ethash suggests the OpenCL stack is just as good as on Windows.
I gave Windows 64 GB of virtual memory and Ubuntu 64 GB of swap. Tested amdkfd.noretry 1 and 0.
Any other recommendations on things to try?
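For anyone reproducing the noretry test above: it is a module parameter, settable on the kernel command line. A sketch assuming GRUB on Ubuntu (whether the parameter lives under `amdkfd` or `amdgpu` depends on the driver version in use):

```shell
# /etc/default/grub -- pass the module parameter at boot
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash amdkfd.noretry=1"

# Apply and reboot:
#   sudo update-grub && sudo reboot
#
# Verify after reboot (path exists only if the module exposes the parameter):
#   cat /sys/module/amdkfd/parameters/noretry
```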