
ROCm RX VEGA hash rates for Cryptonight (linux vs windows) #325

Closed
rhlug opened this issue Feb 3, 2018 · 136 comments

@rhlug

rhlug commented Feb 3, 2018

Starting a new issue in hopes of finding a solution to the performance of CryptoNight mining on Linux under ROCm, as we continue to lag 35% behind the Windows Aug 23rd blockchain drivers.

GPUs - RX Vega 64

Running the Aug 23 blockchain drivers on Windows, I see 1900 H/s CryptoNight and 39 MH/s Ethash.
Running ROCm 1.7 on Ubuntu, I see 1250 H/s CryptoNight and 39 MH/s Ethash.

The fact that I get matching rates on Ethash suggests the OpenCL stack is just as good as on Windows.

I gave Windows 64 GB of virtual memory and Ubuntu 64 GB of swap. Tested amdkfd.noretry=1 and =0.

Any other recommendations on things to try?

@todxx

todxx commented Feb 4, 2018

If I remember correctly, to hit the 1900+h/s with cryptonight in windows it required doing a hot re-initialization of the GPUs. I remember people were initially doing this by enabling or disabling HBCC which seemed to cause re-initialization of the GPU.

Without the re-initialization, I believe performance is pretty similar to the linux rocm stack. This seems to suggest that the windows initialization procedures at boot and for hot reset are different somehow. I would love to get some clarity into what the difference is that is causing this performance delta.

A 50% performance increase running the exact same code is impressive.

@rhlug
Author

rhlug commented Feb 4, 2018

@todxx it's nothing to do with HBCC; just reloading the driver with a disable/enable in Device Manager is all it takes. There is no way to do that in Linux: neither modprobe -r nor rmmod/insmod will allow amdgpu or amdkfd to be removed and reloaded.
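For what it's worth, the closest Linux analogue I know of to the Windows disable/enable toggle is dropping the device off the PCI bus and rescanning, rather than unloading the module. A sketch (the PCI address is a placeholder; find yours with lspci, and note this is untested speculation, not a known fix):

```shell
# Hypothetical hot reset of a single GPU via sysfs; DEV must be adjusted.
DEV=0000:0a:00.0
if [ -e "/sys/bus/pci/devices/$DEV" ]; then
    echo 1 > "/sys/bus/pci/devices/$DEV/remove"   # detach the device
    sleep 2
    echo 1 > /sys/bus/pci/rescan                  # re-enumerate the bus
    echo "device $DEV toggled"
else
    echo "device $DEV not present; adjust DEV first"
fi
```

Whether amdgpu then re-initializes the card the way the Windows hot reset does is exactly the open question here.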

@ob7

ob7 commented Feb 5, 2018

Can whatever it is Windows toggles when reinitializing the GPUs be accomplished with a BIOS mod?

@todxx

todxx commented Feb 5, 2018

@rhlug I don't think it has to do just with reloading the driver. I think the way windows reloads the driver for a hot reset is somehow different than how it loads on boot. But I'm just guessing here.

@ob7 I believe the BIOS on Vega must be signed and is therefore not moddable. In either case, without knowing what is being changed, it will be difficult to reproduce.

@akostadinov

Perhaps you can try disabling the module on boot and only loading it manually before running the compute program. My suspicion, though, is that on Windows the card gets used by something during the boot process and its resources are not released completely; the reload perhaps allows all the needless resources on the card to be released.
Loading the Linux module only right before computing might achieve the same, e.g. it would prevent X from ever trying to use the card. If this helps, then X/Wayland may need to be configured to never touch the cards.
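A sketch of that approach on Debian/Ubuntu-style setups (paths and module names are the usual ones for the ROCm stack; untested against this particular problem):

```shell
# /etc/modprobe.d/blacklist-amdgpu.conf -- keep the GPU driver out of early boot:
#     blacklist amdgpu
#     blacklist amdkfd
# then rebuild the initramfs so the blacklist applies at boot:
sudo update-initramfs -u

# After a reboot, load the modules only right before starting the miner
# (blacklisting only blocks auto-loading; explicit modprobe still works):
sudo modprobe amdgpu
sudo modprobe amdkfd
```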

btw, what are you using to tune card cooling and clock speeds on Linux? It seems proper underclocking is always needed for good results.

@lsimplify

@rhlug Did you achieve 1900h cryptonight without overclocking the memory frequency on Windows? Because as far as I know overclocking the memory is required to get a hashrate like 1900h/s. (Am I wrong?)

@949f45ac

@lsimplify You can only achieve 1900 H/s on Windows with overclocking sure enough, but even without OC a Vega is at >1500 H/s after the disable/enable toggle. On Linux it’s less than 1200 H/s without OC, and maybe ~1300 with.

@akostadinov You can simply let a Vega run compute jobs on a headless system, forgoing X completely. It doesn’t change anything, sadly.

the card is getting used by something during the whole boot process and resources are not released completely.

The author of the original Vega mining guide on reddit writes that it has something to do with a power saving feature, but doesn’t give any specifics on how he’s reached that conclusion:

The blockchain Beta driver has some sort of a bug (i see it as a feature) that when you restart the GPU device some sort of power saving feature (i see it as a bug) doesn't get activated. Therefore, by restarting your GPU device it will hash higher.

@akostadinov

akostadinov commented Feb 19, 2018

@949f45ac, this might be what the author thinks, but it is not necessarily true, and it is driver dependent as well. I for one had a very unstable setup with the plain blockchain driver. I then updated only the driver with a newer one from the pro-series driver (leaving the rest from the blockchain driver), and the setup is much more stable. The interesting thing is that I have to

  1. run the under/overclock utility
  2. disable/enable the device

This is the best I have so far. If I run the underclock utility after the disable/enable, the card draws much more power for some reason. Until we have stable Linux drivers upstreamed, the situation will be crap, it seems. IMO it is still worth trying to avoid loading ROCm until just before the compute software is run. I can't try it, though, because my mainboard is incompatible :/ I decided to wait until the Linux drivers stabilize and there are better statistics on which mainboards are supported.

@949f45ac

@TekComm What you write is true for the RX 400 / 500 series. However, it seems that Vega memory overclocks just fine with rocm-smi alone. Hash rate goes up, and you achieve numbers mostly similar to those on Windows without the device toggle. But if you do the device toggle on Windows, you get another +30%.
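For anyone following along, the rocm-smi route mentioned here looks roughly like this (flags as in ROCm-1.x-era rocm-smi; the DPM state numbers are card-specific placeholders):

```shell
# Guarded so this is a harmless no-op on machines without rocm-smi.
if command -v rocm-smi >/dev/null 2>&1; then
    rocm-smi --setsclk 7    # pin the core clock to DPM state 7
    rocm-smi --setmclk 3    # raise the HBM2 memory clock state
    rocm-smi                # show resulting clocks, temperature, fan
else
    echo "rocm-smi not installed"
fi
```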

I personally believe it would be very nice if we got control over all the remaining DPM features on Vega. Looking into vega10_hwmgr.h in the kernel driver, we see this:

enum {
        GNLD_DPM_PREFETCHER = 0,
        GNLD_DPM_GFXCLK,
        GNLD_DPM_UCLK,
        GNLD_DPM_SOCCLK,
        GNLD_DPM_UVD,
        GNLD_DPM_VCE,
        GNLD_ULV,
        GNLD_DPM_MP0CLK,
        GNLD_DPM_LINK,
        GNLD_DPM_DCEFCLK,
// goes on with non-DPM features

So there’s the graphics clock (GFXCLK), the memory clock (UCLK), but also apparently the SoC clock and something related to the Prefetcher.

When I profile a CryptoNight miner on Vega with RCP, I see MemUnitStalled at 30% and upwards. On the normal RX series this is only a few percent. Possibly this is related to having HBM2 memory, which may have higher bandwidth but clocks much lower.

Another thing I notice is that the main loop in CryptoNight, which is iterated over 500k times doing two random 16-byte reads and two random 16-byte writes, actually has a FetchSize four times the expected amount and a WriteSize double it. I believe this might be prefetch logic reading 3 lines ahead on every read and queueing one write to the predicted location, then one to the actual one. My wild speculation is that if we could tune the prefetcher differently, this could increase performance.

@todxx

todxx commented Feb 27, 2018

@949f45ac I think you might be onto something there. A misbehaving prefetcher could cause a perf hit like this, and your data seems to back it up.

I'd be curious to see what happens if the 3 lines starting here are changed/removed to disable the prefetcher smu feature.

It would be nice to get some input from the devs on what the GNLD_DPM_PREFETCHER feature actually controls before playing Vega roulette.

@todxx

todxx commented Feb 27, 2018

@TekComm I've made the changes and have a build going. However, I'm not sure if I need to do anything to make it play nice with roc-dkms. I'm not familiar with how dkms works. I guess I'll just try a normal kernel install and hope for the best.

@todxx

todxx commented Feb 27, 2018

Disabling the feature had no effect. Cryptonight still running around 1200.

@949f45ac

949f45ac commented Feb 27, 2018

@todxx I already tried simply disabling GNLD_DPM_PREFETCHER -- then I realised that what this does is probably only to disable DPM for the prefetcher, i.e. the driver is telling the card: "I have no intention of setting power levels for the prefetcher myself."

What we actually want, though, is to try different manual overrides of the prefetcher power level. For this we'd need the driver extended to expose a sysfs file like /sys/class/drm/card0/device/pp_dpm_sclk (which is what rocm-smi uses when you call --setsclk). I thought about trying my hand at it, but I didn't really understand where the driver gets its information about the possible sclk states from, to begin with -- not straightforwardly from the card, I think. (I believe it might get some voltage information from the card and use that to calculate clocks.) So I suppose some knowledge of the hardware internals is needed.
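For reference, the existing sclk interface referred to here is used like this on amdgpu (the card index varies per system; guarded so it only reports when the files are absent):

```shell
CARD=/sys/class/drm/card0/device
if [ -f "$CARD/pp_dpm_sclk" ]; then
    cat "$CARD/pp_dpm_sclk"                                  # list DPM states; '*' marks the active one
    echo manual > "$CARD/power_dpm_force_performance_level"  # enable manual state selection
    echo 7 > "$CARD/pp_dpm_sclk"                             # force state 7
else
    echo "no amdgpu sysfs interface at $CARD"
fi
```

A hypothetical pp_dpm_prefetcher file following the same pattern is what this comment is asking for.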

@gstoner

gstoner commented Feb 27, 2018

This is correct; GNLD_DPM_PREFETCHER has nothing to do with the instruction cache or data cache prefetcher.

@949f45ac

949f45ac commented Feb 27, 2018

@gstoner So is it possible, in principle, to control power state of Vega memory prefetcher from the driver?

We are talking about a compute kernel that goes through a loop 2^19 times, doing 2 reads with a write-back each, in every iteration.
In simplified terms:

uint a, b;
uint128_t data;
uint128_t pad[131072];
for (int i = 0; i < 524288; i++) {
  data = pad[a];
  data = hash1(data);
  pad[a] = data;
  b = newBfromData(data);

  data = pad[b];
  data = hash2(data);
  pad[b] = data;
  a = newAfromData(data);
}

The memory prefetcher seems to be pointlessly reading ahead after each of these random reads, and also scheduling every write to a wrong location at first, possibly worsening performance. Do you think modifying the prefetcher's power state could help? Or do you have another idea on how to improve performance?

@gstoner

gstoner commented Mar 2, 2018

@rhlug Can you try the beta http://repo.radeon.com/misc/archive/beta/rocm-1.7.1.beta.4.tar.bz2 ? It supports the 4.13 Linux kernel.

@todxx

todxx commented Mar 3, 2018

I tested this out since I was testing another bug with the 1.7.1 beta 4. There seems to be no difference in cryptonight performance, though I did not profile to check fetch behaviour.

@grafptitsyn

grafptitsyn commented Mar 3, 2018

@gstoner I made a clean installation of Ubuntu 16.04.4 with ROCm 1.7.1b4. The CryptoNight algo gives me up to 1200 H/s. The same miner on Windows gives me up to 1900 H/s.

~> cat /etc/lsb-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=16.04
DISTRIB_CODENAME=xenial
DISTRIB_DESCRIPTION="Ubuntu 16.04.4 LTS"

~> uname -a
Linux dahlia 4.13.0-36-generic #40~16.04.1-Ubuntu SMP Fri Feb 16 23:25:58 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

~> cat /etc/default/grub
...
GRUB_CMDLINE_LINUX_DEFAULT="splash quiet amdgpu.vm_fragment_size=9"
...

The card is ref Vega 56.

@gstoner

gstoner commented Mar 3, 2018

Thanks, I am talking with the SMU team about the delta. We will soon have a new ROCm profiling foundation available with trace and perf counter support. It is a massive update over what we had in the past. We are also working on a debugger that will show the PC, VGPRs and SGPRs, and support breakpoints, run control, etc.

It would be great in the short run if we could get someone to profile the base driver with perf and eBPF, now that we have a DKMS-based install.

Here is a rough outline of the API of the New Profiler Foundation.

Returned API status:

  • hsa_status_t - HSA status codes are used from hsa.h header

Info API:

  • rocprofiler_info_kind_t - profiling info kind
  • rocprofiler_info_query_t - profiling info query
  • rocprofiler_info_data_t - profiling info data
  • rocprofiler_iterate_info - iterate over the info for a given info kind
  • rocprofiler_query_info - iterate over the info for a given info query

Context API:

  • rocprofiler_t - profiling context handle
  • rocprofiler_feature_kind_t - profiling feature kind
  • rocprofiler_feature_parameter_t - profiling feature parameter
  • rocprofiler_data_kind_t - profiling data kind
  • rocprofiler_data_t - profiling data
  • rocprofiler_feature_t - profiling feature
  • rocprofiler_mode_t - profiling modes
  • rocprofiler_properties_t - profiler properties
  • rocprofiler_open - open new profiling context
  • rocprofiler_close - close profiling context and release all allocated resources
  • rocprofiler_group_count - return profiling groups count
  • rocprofiler_get_group - return profiling group for a given index
  • rocprofiler_get_metrics - method for calculating the metrics data
  • rocprofiler_iterate_trace_data - method for iterating output trace data instances

Sampling API:

  • rocprofiler_start - start profiling
  • rocprofiler_stop - stop profiling
  • rocprofiler_read - read profiling data to the profiling features objects
  • rocprofiler_get_data - wait for profiling data
    Group versions of start/stop/read/get_data methods:
    o rocprofiler_group_start
    o rocprofiler_group_stop
    o rocprofiler_group_read
    o rocprofiler_group_get_data

Intercepting API:

  • rocprofiler_callback_t - profiling callback type
  • rocprofiler_callback_data_t - profiling callback data type
  • rocprofiler_set_dispatch_callback - adding kernel dispatch callback
  • rocprofiler_remove_dispatch_callback - removing kernel dispatch callback

Returning the error string method:

  • rocprofiler_error_string - method for returning the API error string

@grafptitsyn

@gstoner provide some instructions, and I will try to do my best (-:

@gurupras

gurupras commented Apr 3, 2018

Is there any update on this?

@949f45ac

949f45ac commented Apr 9, 2018

Apparently with the new Windows driver release 18.3.4 the cryptonight performance bugs on Windows are completely fixed, i.e. you don’t even have to toggle your cards off/on anymore to achieve great performance.

@gstoner Maybe you could talk to the team building the Windows drivers and find out what fix they put into this release and get it done in the Linux drivers too? That’d be wonderful. Release notes for 18.3.4 simply state

Fixed Issues:

  • Some blockchain workloads may experience lower performance than expected when compared to previous Radeon Software releases.

@grafptitsyn

Still no updates?

@briansp2020

What is the unrolling bug you are talking about? Do you have a link to more information? Just curious.

@akostadinov

Is there a reproducer program? I'm seeing lock-ups here with a Frontier Edition, but it could also be the driver vs. my older mobo and CPU.

@Mandrewoid

Mandrewoid commented Apr 23, 2018

@TekComm would that be why my Vega FE does this? https://i.imgur.com/xTDxroL.jpg
It works fine until I put it under 100% load; even under Windows, after about 2 minutes at full load it does that.

@uentity

uentity commented May 3, 2018

Maybe some light can be shed on the magic Windows HBCC driver switch?
I'm having the same issue as the topic starter.

@uentity

uentity commented May 3, 2018

@TekComm so, can the problem be solved with a ROM modification? Can you suggest any references for finding such mods?

@uentity

uentity commented Oct 12, 2018

Maybe it’s due to the fact that we’re running with HSA_ENABLE_SDMA=0 for Vega – something about the scheduling / work distribution.

From my experience HSA_ENABLE_SDMA=0 is no longer needed with ROCm 1.9.

@Zarkoob

Zarkoob commented Oct 12, 2018

@ddobreff I'm seeing errors such as:

kernel: [ 3214.177085] amdgpu 0000:0a:00.0: GPU fault detected: 147 0x00424802
kernel: [ 3214.177104] amdgpu 0000:0a:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00000008
kernel: [ 3214.177118] amdgpu 0000:0a:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x07048002
kernel: [ 3214.177132] amdgpu 0000:0a:00.0: VM fault (0x02, vmid 3) at page 8, write from 'TC4' (0x54433400) (72)

Is this due to the driver issue you are talking about?

@ddobreff

ddobreff commented Oct 12, 2018

That's what the broken compiler from 18.30 causes; read up on what I wrote.
P.S. HSA_ENABLE_SDMA=0 should help by disabling/bypassing the PCI atomics requirement, for Vega only.

@shimmervoid

Cast XMR released a beta for Linux on the 5th. Although it supposedly works with amdgpu-pro, I've tried everything I know and I end up with the following on ROCm:

[14:44:24] Initializing GPU, loading kernel ...
Detected OpenCL Platform: OpenCL 2.1 AMD-APP (2679.0)
Driver Version OK.
GPU0:  (gfx900) | 64 Compute Units | Memory (MB): 16368 | Intensity: 10 / 12
[14:44:27] Error clBuildProgramm -11

https://bitcointalk.org/index.php?topic=2256917.2900

If anyone can take a look. Much appreciated!

@rhlug
Author

rhlug commented Oct 15, 2018

I flipped the switches on my reference 56s over to the 64 BIOS. I ran like-for-like tests on xmrig-hip and xmr-stak 2.5.0, and xmrig-hip is around 1% faster.

[2018-10-14 19:55:22] speed 10s/60s/15m 10980.8 10978.5 n/a H/s max 11001.2 H/s
| THREAD | GPU | 10s H/s | 60s H/s | 15m H/s | NAME
|      0 |   0 |  1803.4 |  1803.2 |  1804.3 | Vega [Radeon RX Vega]
|      1 |   1 |  1868.1 |  1868.2 |  1868.0 | Vega [Radeon RX Vega]
|      2 |   2 |  1803.4 |  1803.3 |  1803.6 | Vega [Radeon RX Vega]
|      3 |   3 |  1769.2 |  1767.2 |  1768.4 | Vega [Radeon RX Vega]
|      4 |   4 |  1891.2 |  1891.6 |  1891.4 | Vega [Radeon RX Vega]
|      5 |   5 |  1844.8 |  1844.9 |  1844.7 | Vega [Radeon RX Vega]
[2018-10-14 19:55:25] speed 10s/60s/15m 10980.5 10978.5 10980.4 H/s max 11001.2 H/s

I'm pushing a 1390 MHz sclk pp_table to all of these Vegas, but I believe the drivers base the final sclk on some factor of ASIC quality? So my GPU3 (XFX reference 56 with the 64 BIOS) only runs an actual sclk of 1310 @ 925 mV.

 ID       Name  Sclk  Mclk Volts Watts  Temp   Fan
============================================================
  0   rxvega56  1331  1090   925   108    61   36%
  1   rxvega64  1325  1090   925   109    49   36%
  2   rxvega56  1333  1090   925    95    61   48%
  3   rxvega56  1310  1090   925   103    61   37%
  4   rxvega64  1343  1090   925   109    45   36%
  5   rxvega56  1362  1090   925   108    61   61%
============================================================
  6                                632

With this same setup, here are xmr-stak dev 2.5.0 rates.

xmr-stak 2.5.0

HASHRATE REPORT - AMD
| ID |    10s |    60s |    15m | ID |    10s |    60s |    15m |
|  0 |  894.3 |  845.5 |  844.2 |  1 |  900.7 |  940.4 |  942.0 |
|  2 |  958.8 |  957.4 |  949.1 |  3 |  894.8 |  893.6 |  899.8 |
|  4 |  900.5 |  900.4 |  869.2 |  5 |  900.5 |  900.4 |  923.2 |
|  6 |  876.8 |  847.7 |  834.2 |  7 |  883.1 |  909.2 |  920.0 |
|  8 |  968.9 |  969.6 |  968.0 |  9 |  904.3 |  904.9 |  907.4 |
| 10 |  914.9 |  842.4 |  855.8 | 11 |  914.9 |  967.2 |  958.5 |
Totals (AMD):  10913.0 10879.4 10871.9 H/s
-----------------------------------------------------------------
Totals (ALL):   10913.0 10879.4 10871.9 H/s
Highest:  11196.9 H/s
-----------------------------------------------------------------

That said, I can split xmr-stak into 6 processes and not lose hashrate like I do on xmrig-hip, so I'm using xmr-stak because it makes monitoring processes and hashrates easier.

@ddobreff

ddobreff commented Oct 15, 2018

I don't see any advantage of xmrig/xmr-stak-hip over normal OpenCL hashrate.

[2018-10-16 01:21:25] speed 10s/60s/15m 1946.7 n/a n/a H/s max 1946.6 H/s
| THREAD | GPU | 10s H/s | 60s H/s | 15m H/s |
| 0 | 0 | 642.6 | n/a | n/a |
| 1 | 0 | 654.7 | n/a | n/a |
| 2 | 0 | 650.6 | n/a | n/a |
[2018-10-16 01:21:30] speed 10s/60s/15m 1947.9 n/a n/a H/s max 1946.6 H/s

And clocks:

GPU0: PCI 0000:03:00, Radeon RX Vega 8.0 GB - Bios: 115-D050PIL-100
SCLK: 1450Mhz, DPM: 5 , MCLK: 1075Mhz, PWR:115.00W, VLT: 0.90v , FAN: 40%, TEMP: 52C

@rhlug
Author

rhlug commented Oct 16, 2018

Those are 10s rates on a one-card sample size. I'm not sure how much I can trust those numbers.

For me, xmrig-hip is 1% faster when comparing the 15m hashrate averages across a 6-card rig, so the sample size is a bit stronger. Maybe you can run longer tests across more cards to get some statistically significant numbers?

And your actual sclk is 1450 @ 900 mV? For me, the drivers seem to do whatever the fuck they want if I don't give them gobs of power.

# cat /sys/class/drm/card4/device/pp_dpm_sclk | grep ^7
7: 1390Mhz *

# grep -A2 "GFX Clocks" /sys/kernel/debug/dri/4/amdgpu_pm_info
GFX Clocks and Power:
	1090 MHz (MCLK)
	1339 MHz (SCLK)

@949f45ac

For me, OpenCL is >5% faster now. I'm running HIP at 8x448, and OpenCL with 2 threads, each at intensity=1928, unroll=8, strided=2, stride=2. Part of the problem comes down to 8x448x2=7168, meaning ~700 MB of memory go unused by the HIP miner; but blocks=480 is slower, because it is not an integer multiple of 56.

I get 835 H/s on an RX 470 at sclk=1126 (BIOS timings mod, but memclock @ 1750) though (4x480, kind of a weird config), and cannot even get more than 800 with OpenCL. I think I'm missing the right settings.

@ddobreff

For me, xmrig-hip is 1% faster when comparing the 15m hashrates averages across 6 card rig. Sample size is a bit stronger. Maybe you can duplicate longer tests across more cards to get some statistically significant numbers?

I only have one Vega 64, and it doesn't really matter whether I run it for 10 seconds or 3 hours.
The tricky part is the SoC clock. I am running on dpm5 and setting SoC idx5 to mclk+id; that way my SoC clock is higher than the memory clock and can handle more memory OC. Testing at 1560/1100/925 doesn't really give much more, but power is about 10% higher.
I don't have problems with Polaris GPUs; they perform as expected with OpenCL.
@949f45ac: Unfortunately I can't test your HIP port on Polaris because it requires atomics enabled, and I have to use it on just one GPU and let the others sleep. IIRC xmr-stak-hip gave about the same as the OpenCL version, but with a single thread. If AMD decides to remove the PCI atomics requirement for Polaris, I'll be able to perform more intensive tests on your port.

@rhlug
Author

rhlug commented Oct 16, 2018

@949f45ac you should be able to push well over 900 on OpenCL with your RX 470s. Mine with 4 GB Samsung memory push 965, and those with 4 GB Hynix push 950 @ 1250 MHz core.

@BobDodds

BobDodds commented Oct 17, 2018

Ref: "the kfd rule". I haven't actually tried (tested) this udev rule, with respect to the regex and the ENV{vars} uppercase/lowercase aspect (note udev only reads files ending in .rules):

/etc/udev/rules.d/00-rocm-or-not.rules

"To try ROCm with an upstream kernel, install ROCm as normal, but do not install the rock-dkms package. Also add a udev rule to control /dev/kfd permissions:"

IMPORT{cmdline}="BOOT_IMAGE"
ENV{BOOT_IMAGE}=="*4.1[234567].*", GOTO="go_rocm"
SUBSYSTEM=="kfd", KERNEL=="kfd", TAG+="uaccess", GROUP="video"
LABEL="go_rocm"

#581

@qolii

qolii commented Oct 18, 2018

@rhlug, would you mind sharing your Vega configuration for xmr-stak?

I've just updated to 2.5.0 for the Monero v8 fork and made minimal edits to my config to get it to at least start up, and I find my Vega FE hash rate has halved! :( I'm having real trouble finding info on it...

@qolii

qolii commented Oct 18, 2018

(also, I'm on Linux 4.19-rc8 if that helps...)

@BobDodds

BobDodds commented Oct 18, 2018

Ubuntu released Cosmic Cuttlefish yesterday. Monero went to v8 today.

Geniuses used the words proximately, as if by association, "AMD... asymmetry... CUDA... parallelism", but I thought hip/hcc were limited by CUDA seriality on prefetch for Monero (algo v8), as I read about xmr-stak-hip? Then there's hardware capacity for asymmetrical/parallel prefetch and such, but what, the languages aren't capable? ... "Vulkan", "SPIR" (already in edge OpenCL).

Asymmetry/parallelism and even AI in prefetch are required in languages for the Monero algo and to use this: "Vega GPU IP also includes in that HBCC/HBC/memory controller subsystem the ability to manage, in its Vega GPU micro-architecture based GPU SKUs, the GPU's own virtual memory paging swap space of up to 512TB of total GPU virtual memory address space.

So that's 512TB addressable into any system DRAM and onto the system's memory swap space of any system NVM/SSD or hard drive storage devices attached."

Geniuses need to know what veteran light bulb changers know for sure:

"On CUDA, memory operations cannot be run asynchronous like that. Instead we use prefetch to achieve a similar effect."

@949f45ac

Updated https://github.com/949f45ac/xmrig-HIP to support CN/2, a.k.a. Monero v8/v9.
It does 1600 H/s and up without errors.
Hope it's useful to some of you. :) The OpenCL miner did not work well for me on v8.


@BobDodds

"On CUDA, memory operations cannot be run asynchronous like that. Instead we use prefetch to achieve a similar effect."

"Asynchronous" can mean many levels. E.g. all the GPU compute frameworks work in a manner where you put jobs into a pipeline and then collect the results later. This is asynchronous, but not very important as far as Monero is concerned, since our most important compute task runs for about 2 seconds, meaning we don't put a lot of jobs into the pipeline compared to other programs. Memory transfer between RAM and the GPU device can also be asynchronous on some frameworks; again we don't care a lot, as the amount of memory a miner transfers is laughably small.
What we actually care about in CryptoNight is the reads/writes to the scratchpad, which in the GPU miner resides in the card's video memory. (In the CPU miner it is a much different situation.)
And here the interesting thing about GCN ISA (AMD), as opposed to PTX ISA (Nvidia), is that GCN ISA has separate instructions for requesting a memory operation and for waiting on its completion. I think most of the time it is actually not a big deal, and a prefetch instruction on PTX does the job too. But if you really get down to the ISA level, there are interesting uses.
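To illustrate that split, a sketch in GCN-style assembly (illustrative instruction selection only, not taken from any miner):

```asm
flat_load_dwordx4  v[4:7], v[0:1]   ; issue a 16-byte read; execution continues
; ... independent ALU work can overlap with the load in flight ...
s_waitcnt          vmcnt(0)         ; block only here, until the load has landed
```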

@ddobreff

@949f45ac have you tried using the mainline kernel? I see a serious performance drop using mainline vs amdgpu-pro in OpenCL.

@949f45ac

Can confirm the mainline kernel is slow; both the HIP miner and the OpenCL miner suck on it.

What should OpenCL v8 performance on amdgpu-pro be?
I think I may have some interference between amdgpu-pro and the ROCm stack (no dkms) on my machines. Probably the miners are using the wrong OpenCL implementation, but the right one seems to be missing after the ROCm installation.
I noticed this when I did a fresh install to test castxmr. I installed amdgpu-pro first, tested castxmr, and it ran fine. Then I installed some ROCm components in order to run HIP programs, and castxmr wouldn't work anymore, no matter what OpenCL platform id I told it to use.

@ddobreff

I don't know the expected stock performance, but with these settings -> sclk: 1250, mclk: 1100, vlt: 875, my Asus Strix Vega 64 does 1820 H/s without any problem using xmrig-amd or xmr-stak; castxmr is a bit slower.
You can't run castxmr on a non-amdgpu-pro OpenCL stack; the dev's kernel is compiled specifically for amdgpu-pro and cannot run on ROCm or Mesa.
I have talked to AMD devs and informed them about the performance regression in mainline/upstream kernels vs amdgpu-pro; I hope they take it seriously and fix it upstream at least.
P.S. For everyone having problems with editing powerplay table on 4.18+, here is the patch that fixes it:
https://patchwork.freedesktop.org/patch/258557/

@uentity

uentity commented Oct 28, 2018

@ddobreff is this hashrate measured with the latest Monero v8 fork?
I saw more than a twofold drop after the fork.

@BobDodds

BobDodds commented Oct 28, 2018

@uentity, check out everything by Spudz76 on xmr-stak's GitHub. He benchmarks what he says about xmr-stak, xmrig, 18.30 and ROCm.

I'm scripting my own Ubuntu downgrade release, Narnic Nun, applying a few debconf/apt/dpkg pins to a Xenial server. The gist of it is that the compilers aren't working for AMD 18.30 and/or ROCm and/or newer kernels. I assume that's temporary.

@uentity

uentity commented Oct 28, 2018

@BobDodds thanks, I'll take a deeper look, but at a glance:

  1. I don't use Ubuntu (I'm on Fedora).
  2. I don't have any problems with GPU detection, xmr-stak startup issues, etc. I just want to say that my hashrate dropped significantly after the recent Monero fork.
     I've switched to ETH for a while...

@BobDodds

OK, I might have wasted your time then, @uentity. That's such a big drop in hashrate that I would suspect errors. If your GPUs are detected by xmr-stak, that's the first sign you're OK, but such a big drop hints at possible errors. clinfo gives my GPUs a clean bill of health, and I can do tricks with pp_tables and turn fans up and down, but xmr-stak will compile yet not recognize the GPUs.

@ddobreff

Yes, the hashrate is for CNv2, a.k.a. Monero v8, with the amdgpu-pro 18.30 dkms kernel driver and the 18.10 OpenCL compiler.

@cirolaferrara

I am on Ubuntu 18 with the 4.15.0-55-generic kernel. I have the ROCm driver installed with rocm-dkms. I have xmr-stak 2.10.7 fd19a5d and the hashrate is around 1300 H/s.

Can I improve the speed, or do I have to switch to Windows?

@949f45ac

949f45ac commented Aug 9, 2019

@cirolaferrara
Try using amdgpu-pro (preferably 18.40) installed with the --opencl=pal --headless options. No rocm-dkms. This makes the open-source miners work well for cn/2, but I believe cn/r is still pretty bad; it can only be fixed by using Team Red Miner, at least as of when I last tried.

@cirolaferrara

Fixed with debian 10, this guide and installing opencl driver

@vanities

I'm running opencl 19.30 on Arch and still can't reach 2k with a 56 flashed 64 Vega. Using Soft PP tables and amdmemtweak with TMR. Any suggestions? Thanks!

@ROCmSupport

I am closing this issue as it is around 2 years old.
Please try the latest ROCm release, 4.0, and open a new issue if needed.
Thank you.
