Compute Device result rate decreases ~51% when a second Compute Device is running #242
Comments
What do you mean by "device 0 performance decreases by 105%"? Could you give an example?
I think my 105% is wrong. I should have asked: what do the two variables equal?

An Example

Each Device has a configuration directory; systemd unit files execute each instance.

Device 0
[configuration elided]

Device 1
[configuration elided]

Data

Device 1
[data elided]

Stat. Analysis

Device 0's percent error: [ (943.0 - 457.8) / 943.0 ] · 100 = ~51%. So Device 0's performance, or result rate, decreased ~51%, not 105%.

dmesg
[output elided]
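The arithmetic above can be checked with a few lines of Python (the 943.0 and 457.8 figures are the result rates quoted above):

```python
# Percent decrease in Device 0's result rate when Device 1 starts.
alone = 943.0      # Device 0's result rate running by itself
together = 457.8   # Device 0's result rate once Device 1 is also running

decrease = (alone - together) / alone * 100
print(f"Device 0's result rate decreased by ~{decrease:.1f}%")
# → Device 0's result rate decreased by ~51.5%
```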
One thing to verify is that GPU selection is working correctly. If you run just one instance, on GPU-0 and then GPU-1 in turn, do you see the correct GPU heating up/spinning up? If that works as expected, when you run the two instances at the same time, do both GPUs heat up/spin up?
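One way to make that check less eyeball-based is to read per-GPU power draw and see which card is actually loaded. A minimal parsing sketch, assuming `rocm-smi`-style tabular output; the exact column layout here is an assumption and varies between ROCm versions:

```python
# Sketch: pick the GPU under load from per-GPU power draw.
# SAMPLE mimics `rocm-smi` output; the real layout differs by version.
SAMPLE = """\
GPU  Temp  AvgPwr
0    62c   148.0W
1    38c   35.0W
"""

def busiest_gpu(smi_text):
    """Return the GPU index with the highest average power draw."""
    best_gpu, best_pwr = None, -1.0
    for line in smi_text.splitlines()[1:]:   # skip the header row
        gpu, _temp, pwr = line.split()
        watts = float(pwr.rstrip("W"))
        if watts > best_pwr:
            best_gpu, best_pwr = int(gpu), watts
    return best_gpu

print(busiest_gpu(SAMPLE))  # should match the device index you selected
```

In practice you would feed this the live tool output instead of `SAMPLE` and confirm the busy index matches the device you asked the program to use.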
Hm... I used to have a problem like this with OpenCL on fglrx ~3 years ago and have been wondering how multi-GPU would work ever since. Actually, I think the only other person who noticed it at the time was a Litecoin cryptominer or something. I had the problem on a laptop with an M290X CrossFire setup. I have no way of testing this as of now, but figured it was worth mentioning, since some bits may have been shared since the AMD APP SDK days.
When I start one instance, both GPUs go from ~38 W to ~85 W. Soon after the instance signals that the OpenCL kernels are compiled, the wattage of the correct GPU increases much more than the other GPU's. I tried this in turn for Device 0 and Device 1. In case you're wondering why I didn't use temperature:

Yes, the wattages "oscillate". I wish I could graph this to see whether the oscillations match up with the frequency of the dmesg message. I don't have access to wattage readings when using the standard kernel and tools, but I can tell you that the ~51% performance decrease did not happen with the kernel I used before!
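To test whether the wattage oscillations line up with the dmesg messages, one could timestamp both and measure how far each power dip is from the nearest logged message. A sketch with invented timestamps (real values would come from polling the power readings and from dmesg's bracketed seconds-since-boot):

```python
from bisect import bisect_left

# Hypothetical data, for illustration only: seconds-since-boot of each
# observed wattage dip and of each dmesg message.
power_dips = [10.2, 20.4, 30.1, 40.3]
dmesg_times = [10.5, 20.1, 29.9, 41.0]

def nearest_gap(t, events):
    """Distance from time t to the closest event in a sorted list."""
    i = bisect_left(events, t)
    candidates = events[max(i - 1, 0):i + 1]
    return min(abs(t - e) for e in candidates)

gaps = [nearest_gap(t, dmesg_times) for t in power_dips]
# If every dip has a message within, say, 1 second, the two likely correlate.
print(all(g < 1.0 for g in gaps))
```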
OK. Here are more examples. I followed the "Verify installation" instructions on the ROCm page. Both Examples still have [details elided].

Second Example

I executed [command elided]. dmesg logs that resulted:

[logs elided]

End of Second Example

Start of Third Example

I stopped [elided]. Upon starting [elided], I ran this [elided]. I checked dmesg and I saw an error message I have not seen before... 1000+ of them:

[messages elided]

Minutes have passed... maybe 100... and these messages are still coming.
I have 2 Vega 56s, both mining ETH; I don't see any drop on the 2nd GPU. Both are in x16 slots ATM.
I lowered the intensity a little and found a sweet spot. Also, I'm now using xmr-stak instead of xmr-stak-amd, but I hadn't thought about (or checked) whether that alone fixes the oddity my posts were regarding. With the lower intensity, starting one instance no longer causes the other card's hashrate to decrease. Does anyone have a clue why this oddity exists?

The cards' up-time is 36 hours. The cards are overclocked in firmware and with rocm-smi. I think it is remarkable that these two cards didn't have any compute errors (xmr-stak lists this information) or "<?VM.>.*.mem< ^access!fault<..FE x007820<<<" dmesg errors. That tool is so useful, and I love the program's interface. Thank you! You are helping me run my cards mad ;) but in a good way.

These numbers were copied from the webpage after an up-time of 36 hours with both cards running:

GPU1 [stats elided]

@rhlug Do you want to try increasing the intensity in the config for one or both cards, then running them and checking dmesg, etc.? I'm wondering if setting "higher" intensities reproduces the same error messages... or at least a slowdown on the other card in general.
@rhlug Not really sure; I haven't been messing with Vega on Linux because I can't undervolt them easily. When I did run xmr-stak-amd on Ubuntu 16.04 with a Vega 56 (flashed to 64), I was getting 1300 H/s @ 2016/1600 intensities. I don't have things set up to do any experiments right now. Maybe for ROCm 1.7 I will.
Even though I don't have an answer to why only one card was used at a time (I described this above), I'm closing this issue because it goes away when I decrease the intensity number, as I stated in an earlier post.
My issue is that my performance (result rate) decreases once I run OpenCL code on a second GPU.
I start the program and ask it to run on Device 0. I start the program a second time and ask it to run on Device 1. Device 1 starts and stays at half performance compared to its performance when run alone. Device 0's performance decreases ~105%.
(I set up the program to use 2 CPU threads for each Compute Device, in case that makes any difference for you.)
My two compute devices each have their own x8 PCIe 3.0 link directly to the processor.
Compute Devices: RX580 4GB (Qty 2)
./rocm_agent_enumerator -t gpu
Headless Ubuntu 17.10, AMD Ryzen 5 1600X, 32GiB system ram
uname -r
4.11.0-kfd-compute-rocm-rel-1.6-180
lspci -v -d1002
dmesg | grep -i IOMMU
dmesg | grep kfd
Here are some boot messages:
and these messages appear when starting even a single Compute Device:
I continuously get additional messages when running both Compute Devices:
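When a message floods the log continuously like that, collapsing dmesg into per-message counts makes the flood (and its rate) obvious. A sketch; the log lines below are invented stand-ins, not the actual kfd output:

```python
import re
from collections import Counter

# Invented stand-ins for a flooding kernel log; real lines would come
# from `dmesg`. The leading "[  100.1]" mimics dmesg timestamps.
LOG = """\
[  100.1] kfd: example fault on device A
[  100.2] kfd: example fault on device A
[  100.3] kfd: example fault on device B
[  100.4] kfd: example fault on device A
"""

def message_counts(log_text):
    """Count identical messages after stripping the dmesg timestamp."""
    counts = Counter()
    for line in log_text.splitlines():
        msg = re.sub(r"^\[\s*[\d.]+\]\s*", "", line)
        counts[msg] += 1
    return counts

for msg, n in message_counts(LOG).most_common():
    print(f"{n:4d}x {msg}")
```

Piping real `dmesg` text through this would show whether it is one message repeated 1000+ times or several distinct faults.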