[Bug] GRResult error occurring a couple of times a day farming a few PB of C2 compressed plots using an Nvidia P4 GPU - Bladebit #15404
Comments
Still happening with the Linux Nvidia beta GPU driver v535.43.02. As soon as the following GPU error occurred, the GRResult error was thrown in the chia debug.log:

[Tue May 30 19:49:17 2023] NVRM: Xid (PCI:0000:08:00): 31, pid=459359, name=start_harvester, Ch 00000008, intr 10000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_0 faulted @ 0x7f53_718af000. Fault is of type FAULT_PDE ACCESS_TYPE_READ

which led to the debug log error:

2023-05-30T19:49:18.260 harvester chia.harvester.harvester: ERROR Exception fetching full proof for /media/chia/hdd142/plot-k32-c02-2023-04-26-06-20-xxxxxxxxxxxxx.plot. GRResult is not GRResult_OK
Another "me too" Dell R630 server, dual E5-2620v4 CPUs, 64GB RAM, Debian 11.7, Tesla P4 with 530.30.02 drivers. |
Ref issue #15470 This isn't limited to a few times a day. I switched to a pool to test proofs and got flooded with these errors with each partial until falling back to CPU harvesting. |
Same issue here ! Ubuntu 22.04 / kernel 5.15.0-73 Plots : C7 / around 9000 GRResult error in chia log + nvidia FAULT_PDE ACCESS_TYPE_READ in kernel log Happens randomly, worse : 2hours, best : 20 hours without error. |
I am facing same GRResult issue. My details are: Win 10. The issue has happened twice in last two days. Restarting the GUI fixed the issue. |
I am able to consistently reproduce this error on a Windows-based system by using the Disable Device option in the display driver properties menu, waiting a few seconds, and enabling the device with the same button. The GRResult issue will then appear in the logs. |
I am also affected by this. Running a distinct harvester (separated from full_node and farmer) on a BTC-T37 board with a Tesla P4 GPU and a LSI 9102 (SAS2116) HBA. Both HBA and GPU are attached via PCIe 1x gen2. Ubuntu 22.04 is running on a Celeron 1037U CPU with 4GB DDR3 RAM. My harvester node is of version 2.0.0b3.dev116 bladebit alpha 4.3 obtained via the chia discord. Tried bladebit alpha 4.4 but this will not work at all. Farming 4280 C7 plots (bladebit) and some 300 non compressed NFT plots. Edit: In my opinion this should produce an error message in the logs, maybe even critical, but not stopping the harvester to work. |
This issue has not been updated in 14 days and is now flagged as stale. If this issue is still affecting you and in need of further review, please comment on it with an update to keep it from auto closing in 7 days. |
I am still periodically getting a GRResult error. Harvester only, no other activity on the server. I'm leaving this box as-is for testing this issue.
Can you try this with the release candidate? Let us know if you still see issues. Thanks.
I am running rc1 and am getting the GRR error. Debian 12, NVIDIA driver 535.86.05, with a Tesla P4 as the harvester.
Which
Same issue here with Chia 2.0.0rc1 on Ubuntu 22.04.2 and an Nvidia P4 (7000 c7 plots).
Can you try rc3? We added several architectures explicitly to the harvester and plotter.
Installed rc3. I will evaluate over the next couple of days.
Same result on RC3. The harvester stopped sending partials after the same error occurred.
After replacing the GPU (1070) with an RTX 2080 Ti, I stopped getting GRResult errors.
Same with RC6.
In these cases where the harvester drops out, do you see a message in dmesg about the NVIDIA driver, or a Windows hardware event for NVIDIA? Does the driver drop out and recover? Do you see anything else in the log about which GRR event was logged after
On my Windows Server 2019 Standard with a Tesla P4 (driver 536.25), E5-2697 v3, 64GB RAM. Here are some log entries from before and after the 3 errors earlier today:

2023-08-21T15:25:58.561 harvester chia.harvester.harvester: INFO 5 plots were eligible for farming 68e12a41d5... Found 0 proofs. Time: 0.51637 s. Total 3434 plots

Nothing in the Application or System event viewer: no errors, no warnings, nothing about NVIDIA drivers.
Does the harvester log show any
Hi, I found no messages in dmesg. Regards. Ubuntu 22.04.2 LTS (256GB memory).
Same issue. I'm also getting another silent GPU failure now and again that only shows up in dmesg. It happens approximately once per day on average; for both issues I auto-detect the error in the logs/dmesg and restart the harvester with a bash cron script.
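For readers asking below what kind of script this is, here is a minimal sketch of that sort of watchdog cron job, not the poster's actual script. The log path, the matched strings, and the restart command are assumptions to adapt to your own install:

```bash
#!/usr/bin/env bash
# Sketch of a GRResult watchdog for cron -- not the poster's actual script.
# Assumes the default Linux log location and that `chia` is on PATH.
LOG="$HOME/.chia/mainnet/log/debug.log"
OFFSET_FILE="/tmp/grr_watchdog.offset"

# Scan only the log bytes added since the previous run, so an old error
# does not trigger a restart on every cron invocation.
last=$(cat "$OFFSET_FILE" 2>/dev/null || echo 0)
size=$(stat -c %s "$LOG")
[ "$size" -lt "$last" ] && last=0              # log was rotated, start over
new=$(tail -c +"$((last + 1))" "$LOG")
echo "$size" > "$OFFSET_FILE"

# The dmesg check is cruder and may re-match old entries; a real script
# should filter by timestamp or clear the ring buffer after a restart.
if echo "$new" | grep -q "GRResult is not GRResult_OK" \
   || dmesg | tail -n 200 | grep -q "MMU Fault: ENGINE GRAPHICS"; then
    chia start harvester -r                    # '-r' restarts the running service
    echo "$(date): restarted harvester" >> "$HOME/grr_watchdog.log"
fi
```

Run it from cron every minute or so, e.g. `* * * * * /home/chia/grr_watchdog.sh` via `crontab -e`.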
Happening randomly, mostly when running chia plots check, but also during farming.
Well, it seems that c07 plots are causing this, so I replotted to level c05 and all is good.
This kind of lines up with my suspicion that something is a bit off with the calculation of the required memory at C7. It would be nice if somebody who knew what they were doing could try increasing the allocations a bit.
I am using c05 and am experiencing this issue; might it be the quantity of plots?
Hard to say. c07 doesn't seem to immediately cause this error to occur; one of my harvesters with a 1050 GTX and around 50TB of c07 plots (with 50TB of c05 mixed in) has gone multiple days without errors but can also freak out in less than a day. Also, running plot check on those plots might hit this error in one run and pass in the next, so something random seems to be happening. Removing all c07 plots in my case seems to have fixed the situation, but it could be that the odds of it happening are just much smaller... I was also wondering whether the number of proofs found could affect this, and does the decompressor_thread_count setting affect the GPU decompressor?
I don't know if this info is helpful, but I have had my C7 farm running on an old i7 for a few months now. The GPU error usually occurred 1-2 times per day, not much more, sometimes even less. What changed is that the GPU error now shows up multiple times per day, sometimes even 5 or 6 times. Interestingly, even though the error rate is now higher, I see fewer stale partials. This might be related to a chia client update, or not; I don't know. I just thought I would share this observation. Linux Mint 21.2 - full node.
Found this as well: GRResult is not GRResult_OK. It repeats three times for the same plot, then occurs on another plot some time later. Would this affect farming block wins? OS: Ubuntu 20.04.3 LTS x86_64
Same problem here (one time):
Thanks @larod for your script. OS: Ubuntu 22.04.4 LTS x86_64
Chia 2.2.0 is live. Can someone confirm that the bug is fixed? I didn't see anything about it in the patch notes.
Version 2.2.0 does not do GPU farming at all, no matter which graphics card is used. I have an RTX 2080 Super, an RTX 3060, and a GT 1030. In version 2.1.4 it works without any problems. The bug is of course still included.
2.2.0 GPU farming works fine for me; no bug occurred so far (about 3 hours). P.S. The bug is still present. Same error on 2.2.0: GRResult is not GRResult_OK, received GRResult_OutOfMemory.
The same problem.
How hard is it for devs to get a 1060 and reproduce the problem locally?
Converting c07 -> c05 didn't seem to help; one Linux machine with a 1050 4GB still occasionally has this problem. Interestingly, the Win10 machines with a 1050 and a 1030 are much more stable. I wonder if the env settings from ETH mining are affecting this (GPU_MAX_ALLOC_PERCENT etc.). I have those only on the Win machines.
@GolDenis72 what kind of script?
#15404 (comment)
Hi. I'm very new to this. Maybe you could tell me where I made a mistake?
2024-04-15 20:41:21 Starting to monitor syslog...
I have a sense that it's not working because I'm on Windows 10, but then you said that you are using Win 10 too. I really need this to work; my system needs restarting every 6 hours or so... Please help.
Hi! Use tail to monitor the log file and grep for the log message. That won't work on Windows as written, so you need to put the real path to the chia log file in the script.
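As an illustration of the tail-and-grep approach being described (the path shown is the default Linux location; on Windows the file usually lives under C:\Users\&lt;user&gt;\.chia\mainnet\log\debug.log):

```bash
# Follow the harvester log and print only the GRResult failure lines.
# The path is the default on Linux; substitute your own debug.log location.
tail -F ~/.chia/mainnet/log/debug.log | grep --line-buffered "GRResult is not GRResult_OK"
```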
@GolDenis72
your comp is right! |
Oh my god.... how did I miss that.... I'm so grateful you have pointed it out to me. Looks like it's running now, will see if it works. Thanks, again. |
@GolDenis72 Hi again. :) It looks like the script monitors the log, but when it found the error the harvester didn't restart; it says "command not found"... any ideas?
And again. Your comp is right! command not found! :-) |
@GolDenis72
Check that path twice: "C:/Program Files/Chia/resources/app.asar.unpacked/daemon/chia.exe"
There are 2 chia executables in the system (don't ask me why); you need to find the right one. Just check it with a simple chia command first (like chia -h) to be sure that you have found the right one.
The path is right... I'm thinking it must be something about the two words in "Program Files"...
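If that is the cause, the fix in a shell script is to quote the whole path. A hedged sketch (the restart subcommand is an assumption, verify the exact flags with chia start -h):

```bash
# Quote the path so the shell doesn't split it at the space in "Program Files";
# unquoted, it tries to run "C:/Program" and reports "command not found".
CHIA="C:/Program Files/Chia/resources/app.asar.unpacked/daemon/chia.exe"

"$CHIA" -h                  # sanity check that this is the right binary
"$CHIA" start harvester -r  # restart the harvester (assumed flags; check chia start -h)
```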
@GolDenis72 |
What happened?
When the system (ProLiant DL360 Gen9, dual E5-2620 v4, 32GB RAM, Nvidia P4, 75k C2 plots) hits a high IO load on the same block device as the Chia full node DB, shortly afterwards the debug.log in chia will show GRResult not ok. The number of plots, lookup times, etc. all seem fine, but the harvester stops finding proofs until it is restarted. Happens 1-2 times in a 24 hour period on Alpha 4 through Alpha 4.3.
Whenever the error occurs, block validation time and lookup time consistently increase leading up to the error being thrown.
Reproducible with Nvidia Unix GPU Driver versions 530.30.03, 530.41.03, and 535.43.02
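For anyone trying to reproduce the high-IO trigger described above, a hedged sketch of one way to generate that load with fio (the target path and sizes are placeholders, not taken from this report):

```bash
# Sustained random read/write load on the device that holds the full node DB.
# Point --filename at a scratch file on the same filesystem as ~/.chia/mainnet/db.
fio --name=grr-repro --filename=/path/on/db/device/fio.tmp \
    --size=4G --rw=randrw --bs=4k --iodepth=32 --numjobs=4 \
    --direct=1 --runtime=600 --time_based
```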
Version
2.0.0b3.dev56
What platform are you using?
Ubuntu 22.04
Linux Kernel 5.15.0-73-generic
ProLiant DL360 Gen9, dual E5-2620 v4, 32 gigs ram, Nvidia P4, 75k C2 plots
What ui mode are you using?
CLI
Relevant log output