
[Bug] GRResult error occurring a couple times a day farming a few PB of C2 compressed plots using an Nvidia P4 GPU - Bladebit #15404

chain-enterprises opened this issue May 30, 2023 · 98 comments
Labels: 2.0.0 · bug (Something isn't working) · compression (Related to compressed plotting/farming)


chain-enterprises commented May 30, 2023

What happened?

When the system (ProLiant DL360 Gen9, dual E5-2620 v4, 32 gigs ram, Nvidia P4, 75k C2 plots) hits a high IO load on the same block device as the Chia Full Node DB, shortly after the debug.log in chia will show GRResult not ok. The number of plots, lookup times, all seems fine - but the harvester stops finding proofs until the harvester is restarted. Happens 1-2 times in a 24 hour period on Alpha 4 through Alpha 4.3

Whenever the error occurs, block validation time and lookup time consistently increase leading up to the error being thrown.

Reproducible with Nvidia Unix GPU Driver versions 530.30.03, 530.41.03, and 535.43.02

Version

2.0.0b3.dev56

What platform are you using?

Ubuntu 22.04
Linux Kernel 5.15.0-73-generic
ProLiant DL360 Gen9, dual E5-2620 v4, 32 gigs ram, Nvidia P4, 75k C2 plots

What ui mode are you using?

CLI

Relevant log output

2023-05-29T20:45:32.552 full_node chia.full_node.mempool_manager: WARNING  pre_validate_spendbundle took 2.0414 seconds for xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
2023-05-29T20:45:42.620 full_node chia.full_node.mempool_manager: WARNING  add_spendbundle xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx took 10.06 seconds. Cost: 2924758101 (26.589% of max block cost)
2023-05-29T20:45:56.840 full_node chia.full_node.full_node: WARNING  Block validation time: 2.82 seconds, pre_validation time: 2.81 seconds, cost: None header_hash: xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx height: 3732042
2023-05-29T20:46:57.239 full_node chia.full_node.full_node: WARNING  Block validation time: 3.34 seconds, pre_validation time: 0.42 seconds, cost: 3165259860, percent full: 28.775% header_hash: xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx height: 3732044
2023-05-29T20:49:26.913 full_node chia.full_node.full_node: WARNING  Block validation time: 2.40 seconds, pre_validation time: 0.49 seconds, cost: 2041855544, percent full: 18.562% header_hash: 8d0ce076a3270a0c8c9c8d1f0e73c9b5b884618ee34020d2a4f3ffafa459cfd0 height: 3732055
2023-05-29T20:51:06.259 full_node full_node_server        : WARNING  Banning 89.58.33.71 for 10 seconds
2023-05-29T20:51:06.260 full_node full_node_server        : WARNING  Invalid handshake with peer. Maybe the peer is running old software.
2023-05-29T20:51:27.986 harvester chia.harvester.harvester: ERROR    Exception fetching full proof for /media/chia/hdd23/plot-k32-c02-2023-04-23-someplot.plot. GRResult is not GRResult_OK.
2023-05-29T20:51:28.025 harvester chia.harvester.harvester: ERROR    File: /media/chia/hdd23/someplot.plot Plot ID: someplotID, challenge: 7b5b6f11ec2a86a7298cb55b7db8a016a775efea221104b37905366b49f2e2bd, plot_info: PlotInfo(prover=<chiapos.DiskProver object at 0x7f3544998f30>, pool_public_key=None, pool_contract_puzzle_hash=<bytes32: contractHash>, plot_public_key=<G1Element PlotPubKey>, file_size=92374601728, time_modified=1682261996.8218756)
2023-05-29T20:51:57.482 full_node chia.full_node.full_node: WARNING  Block validation time: 10.23 seconds, pre_validation time: 0.29 seconds, cost: 959315244, percent full: 8.721% header_hash: xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx height: 3732059
2023-05-29T20:55:24.640 full_node chia.full_node.full_node: WARNING  Block validation time: 3.18 seconds, pre_validation time: 0.26 seconds, cost: 2282149756, percent full: 20.747% header_hash: xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx height: 3732067
2023-05-29T20:56:01.825 wallet wallet_server              : WARNING  Banning 95.54.100.118 for 10 seconds
2023-05-29T20:56:01.827 wallet wallet_server              : ERROR    Exception Invalid version: '1.6.2-sweet', exception Stack: Traceback (most recent call last):
  File "chia/server/server.py", line 483, in start_client
  File "chia/server/ws_connection.py", line 222, in perform_handshake
  File "packaging/version.py", line 198, in __init__
packaging.version.InvalidVersion: Invalid version: '1.6.2-sweet'
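For triage, a quick way to gauge how often a harvester is hitting this is to count the error lines in the debug log. A sketch, assuming the default mainnet log location (override `LOG` for your install):

```shell
# Count GRResult failures in a Chia harvester debug log.
# The default path below is an assumption; override LOG for your setup.
LOG="${LOG:-$HOME/.chia/mainnet/log/debug.log}"

count_grresult_errors() {
  # grep -c prints the number of matching lines; errors for a missing
  # file are suppressed and the call never fails the script.
  grep -c "GRResult is not GRResult_OK" "$1" 2>/dev/null || true
}

if [ -f "$LOG" ]; then
  count_grresult_errors "$LOG"
fi
```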
@chain-enterprises chain-enterprises added the bug Something isn't working label May 30, 2023
@shaneo257 changed the title [Bug] GRResult error occurring a couple times a day farming a few PB of C2 compressed plots using an Nvidia P4 GPU [Bug] GRResult error occurring a couple times a day farming a few PB of C2 compressed plots using an Nvidia P4 GPU - Bladebit May 30, 2023
@chain-enterprises (author)

Still happening with GPU driver Linux Nvidia beta v535.43.02

As soon as the following GPU error occurred - the GRResult error was thrown in the chia debug.log

[Tue May 30 19:49:17 2023] NVRM: Xid (PCI:0000:08:00): 31, pid=459359, name=start_harvester, Ch 00000008, intr 10000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_0 faulted @ 0x7f53_718af000. Fault is of type FAULT_PDE ACCESS_TYPE_READ

which led to the debug log error

2023-05-30T19:49:18.260 harvester chia.harvester.harvester: ERROR Exception fetching full proof for /media/chia/hdd142/plot-k32-c02-2023-04-26-06-20-xxxxxxxxxxxxx.plot. GRResult is not GRResult_OK


liyujcx commented May 31, 2023

Me too. Windows 10, GUI.
[GPU device and error log screenshots attached]
The error log shows the same message, and no proofs are found any more.


ab0tj commented May 31, 2023

Another "me too"

Dell R630 server, dual E5-2620v4 CPUs, 64GB RAM, Debian 11.7, Tesla P4 with 530.30.02 drivers.


reythia commented Jun 8, 2023

Ref issue #15470

This isn't limited to a few times a day. I switched to a pool to test proofs and got flooded with these errors with each partial until falling back to CPU harvesting.

@jinglenode

Same issue here!

Ubuntu 22.04 / kernel 5.15.0-73
Driver Version: 530.30.02 CUDA Version: 12.1
Dual E5 2680 V4 / 256Gb 2133Mhz RAM / Tesla P4

Plots : C7 / around 9000

GRResult error in chia log + nvidia FAULT_PDE ACCESS_TYPE_READ in kernel log

Happens randomly; worst case 2 hours, best case 20 hours without an error.

@prodchia

I am facing the same GRResult issue. My details are:

Win 10.
GTX 1060 with 535.98/CUDA 12.2
E5 2690V4/64GB RAM
Currently 2428 C7 plots, and increasing.
Using chia gui.

The issue has happened twice in the last two days. Restarting the GUI fixed it.

@thesemaphoreslim

I am able to consistently reproduce this error on a Windows-based system by using the Disable Device option in the display driver properties menu, waiting a few seconds, and enabling the device with the same button. The GRResult issue will then appear in the logs.


javanaut-de commented Jul 3, 2023

I am also affected by this.

Running a distinct harvester (separated from full_node and farmer) on a BTC-T37 board with a Tesla P4 GPU and a LSI 9102 (SAS2116) HBA. Both HBA and GPU are attached via PCIe 1x gen2. Ubuntu 22.04 is running on a Celeron 1037U CPU with 4GB DDR3 RAM.

My harvester node is of version 2.0.0b3.dev116 bladebit alpha 4.3 obtained via the chia discord. Tried bladebit alpha 4.4 but this will not work at all. Farming 4280 C7 plots (bladebit) and some 300 non compressed NFT plots.

Edit: In my opinion this should produce an error message in the logs, maybe even a critical one, but it should not stop the harvester from working.

@github-actions (bot)

This issue has not been updated in 14 days and is now flagged as stale. If this issue is still affecting you and in need of further review, please comment on it with an update to keep it from auto closing in 7 days.

@github-actions github-actions bot added the stale-issue flagged as stale and will be closed in 7 days if not updated label Jul 17, 2023
@robcirrus

I am still periodically getting a GRResult error:
GRResult is not GRResult_OK, received GRResult_OutOfMemory
(On alpha 4.5 it was just "GRResult is not GRResult_OK".)
No errors in Event Viewer. The harvester stops sending partials; chia start harvester -r resets it and it starts working again.
Occurs about every 1-3 days.

Harvester only, no other activity on server.
Alpha 4.6 (and had the errors on Alpha 4.5)
NVidia Tesla P4, issues with drivers: 528.89, 536.25
HP Apollo 4200
Windows Server 2019
E5-2678v3, 64GB, all locally attached SAS,SATA drives.
3,434 C7 plots

Kinda leaving this box as is for testing this issue.
Have other similar systems (>20 harvesters) with A2000 6GB GPUs and 4k-15k mostly C5 and CPU-compressed plots, with no issues on them.

@github-actions github-actions bot removed the stale-issue flagged as stale and will be closed in 7 days if not updated label Jul 18, 2023
@wjblanke
Contributor

Can you try this with the release candidate? Let us know if you still see issues. Thanks!

@ericgr3gory

I am running rc1 and am getting the GRResult error. Debian 12, NVIDIA driver 535.86.05, with a Tesla P4 as harvester.

@harold-b
Contributor

Which GRResult specifically is it showing?

@wallentx wallentx added the compression Related to compressed plotting/farming label Jul 27, 2023
@Synergy1900

Synergy1900 commented Aug 7, 2023

Same issue here with Chia 2.0.0rc1 on Ubuntu 22.04.2 and Nvidia P4. (7000 c7 plots)
Have to restart the harvester every hour to keep farming.


wallentx commented Aug 7, 2023

> Same issue here with Chia 2.0.0rc1 on Ubuntu 22.04.2 and Nvidia P4. (7000 c7 plots) Have to restart the harvester every hour to keep farming.

Can you try rc3? We added several architectures explicitly to the harvester and plotter

@Synergy1900

> Same issue here with Chia 2.0.0rc1 on Ubuntu 22.04.2 and Nvidia P4. (7000 c7 plots) Have to restart the harvester every hour to keep farming.
>
> Can you try rc3? We added several architectures explicitly to the harvester and plotter

Installed the rc3. I will evaluate for the next couple of days.
Thx!

@Synergy1900

> Installed the rc3. I will evaluate for the next couple of days. Thx!

Same result on RC3. The harvester stopped sending partials after the same error occurred.

@kinomexanik

After replacing the GPU (a 1070) with an RTX 2080 Ti, I stopped getting GRResult errors.

@Synergy1900

> Same result on RC3. Harvester stopped sending partials after the same error occurred.

Same with RC6


jmhands commented Aug 21, 2023

In these cases where the harvester drops out, do you see a message in dmesg about the NVIDIA driver, or a Windows hardware event for NVIDIA? Does the driver drop out and recover? Do you see anything else in the log about which GRResult code was logged after "GRResult is not GRResult_OK"?

@robcirrus

On my Windows Server 2019 Standard with a Tesla P4 (driver 536.25), E5-2697v3, 64GB RAM: just received the latest errors earlier today, and it gave 3 messages on the same plot together. I have seen it report multiple consecutive errors sometimes, but not usually. No log items before/after indicate other issues.

Here are some log entries before and after the 3 from earlier today:

2023-08-21T15:25:58.561 harvester chia.harvester.harvester: INFO 5 plots were eligible for farming 68e12a41d5... Found 0 proofs. Time: 0.51637 s. Total 3434 plots
2023-08-21T15:26:07.546 harvester chia.harvester.harvester: INFO 11 plots were eligible for farming 68e12a41d5... Found 0 proofs. Time: 0.59371 s. Total 3434 plots
2023-08-21T15:26:15.999 harvester chia.harvester.harvester: INFO 7 plots were eligible for farming 68e12a41d5... Found 0 proofs. Time: 0.34372 s. Total 3434 plots
2023-08-21T15:26:26.596 harvester chia.harvester.harvester: ERROR Exception fetching full proof for I:\BBF2P1C5\plot-k32-c05-2023-06-15-06-38-50dfd1f68d8ba0806aa024ba004cd8eb92598285e4cd9d5cb1dbbc0220df6cd7.plot. GRResult is not GRResult_OK, received GRResult_OutOfMemory
2023-08-21T15:26:26.596 harvester chia.harvester.harvester: ERROR File: I:\BBF2P1C5\plot-k32-c05-2023-06-15-06-38-50dfd1f68d8ba0806aa024ba004cd8eb92598285e4cd9d5cb1dbbc0220df6cd7.plot Plot ID: 50dfd1f68d8ba0806aa024ba004cd8eb92598285e4cd9d5cb1dbbc0220df6cd7, challenge: d5a056ba8dfe416ecd1a7fdd3aca84aeec2e08a93554df9b83e087d332b1b992, plot_info: PlotInfo(prover=<chiapos.DiskProver object at 0x00000240AC939630>, pool_public_key=None, pool_contract_puzzle_hash=<bytes32: 4c288e3a30931f7882607f8d0a9b3773322fb6cead8d292146103441f259c86b>, plot_public_key=, file_size=87233802240, time_modified=1686811230.8092616)
2023-08-21T15:26:26.815 harvester chia.harvester.harvester: ERROR Exception fetching full proof for I:\BBF2P1C5\plot-k32-c05-2023-06-15-06-38-50dfd1f68d8ba0806aa024ba004cd8eb92598285e4cd9d5cb1dbbc0220df6cd7.plot. GRResult is not GRResult_OK, received GRResult_OutOfMemory
2023-08-21T15:26:26.815 harvester chia.harvester.harvester: ERROR File: I:\BBF2P1C5\plot-k32-c05-2023-06-15-06-38-50dfd1f68d8ba0806aa024ba004cd8eb92598285e4cd9d5cb1dbbc0220df6cd7.plot Plot ID: 50dfd1f68d8ba0806aa024ba004cd8eb92598285e4cd9d5cb1dbbc0220df6cd7, challenge: d5a056ba8dfe416ecd1a7fdd3aca84aeec2e08a93554df9b83e087d332b1b992, plot_info: PlotInfo(prover=<chiapos.DiskProver object at 0x00000240AC939630>, pool_public_key=None, pool_contract_puzzle_hash=<bytes32: 4c288e3a30931f7882607f8d0a9b3773322fb6cead8d292146103441f259c86b>, plot_public_key=, file_size=87233802240, time_modified=1686811230.8092616)
2023-08-21T15:26:27.002 harvester chia.harvester.harvester: ERROR Exception fetching full proof for I:\BBF2P1C5\plot-k32-c05-2023-06-15-06-38-50dfd1f68d8ba0806aa024ba004cd8eb92598285e4cd9d5cb1dbbc0220df6cd7.plot. GRResult is not GRResult_OK, received GRResult_OutOfMemory
2023-08-21T15:26:27.002 harvester chia.harvester.harvester: ERROR File: I:\BBF2P1C5\plot-k32-c05-2023-06-15-06-38-50dfd1f68d8ba0806aa024ba004cd8eb92598285e4cd9d5cb1dbbc0220df6cd7.plot Plot ID: 50dfd1f68d8ba0806aa024ba004cd8eb92598285e4cd9d5cb1dbbc0220df6cd7, challenge: d5a056ba8dfe416ecd1a7fdd3aca84aeec2e08a93554df9b83e087d332b1b992, plot_info: PlotInfo(prover=<chiapos.DiskProver object at 0x00000240AC939630>, pool_public_key=None, pool_contract_puzzle_hash=<bytes32: 4c288e3a30931f7882607f8d0a9b3773322fb6cead8d292146103441f259c86b>, plot_public_key=, file_size=87233802240, time_modified=1686811230.8092616)
2023-08-21T15:26:27.002 harvester chia.harvester.harvester: INFO 6 plots were eligible for farming 68e12a41d5... Found 0 proofs. Time: 1.09760 s. Total 3434 plots
2023-08-21T15:26:36.080 harvester chia.harvester.harvester: INFO 6 plots were eligible for farming 68e12a41d5... Found 0 proofs. Time: 0.35933 s. Total 3434 plots
2023-08-21T15:26:44.877 harvester chia.harvester.harvester: INFO 9 plots were eligible for farming 68e12a41d5... Found 0 proofs. Time: 0.53123 s. Total 3434 plots

Nothing in the Application or System Event Viewer: no errors, no warnings, nothing about NVIDIA drivers.

@harold-b
Contributor

Does the harvester log show any GRResult_Failed messages at any point?

@esaung esaung added the 2.0.0 label Aug 21, 2023
@Synergy1900

> In these cases where the harvester drops out, do you see a message in dmesg about the NVIDIA driver, or a Windows hardware event for NVIDIA? Does the driver drop out and recover? Do you see anything else in the log about which GRResult code was logged after "GRResult is not GRResult_OK"?

Hi,

Found no messages in dmesg.
Once it happens, I keep getting the same "GRResult is not GRResult_OK" message until I restart the harvester (chia start harvester -r).
There are no other messages in the debug.log.
After the upgrade to RC6 it worked for about a day before the first error occurred again. Mostly it occurs randomly, multiple times a day.

Regards
S.

Ubuntu 22.04.2 LTS (256GB Memory)
Nvidia Tesla P4
Driver Version: 535.86.10
CUDA Version: 12.2

@bryankr

bryankr commented Jan 19, 2024

Same issue:
Chia 2.1.4
Ubuntu 22.04
NVIDIA P4
Driver Version: 535.146.02 CUDA Version: 12.2
Xeon E5-2620 v3 @ 2.40GHz
HP Proliant ML150

And also getting another silent GPU fail now and again that only shows up in dmesg as:
[273333.524849] NVRM: Xid (PCI:0000:03:00): 31, pid=166747, name=chia_harvester, Ch 00000008, intr 10000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_1 faulted @ 0x7f6e_1ed8b000. Fault is of type FAULT_PDE ACCESS_TYPE_READ

Happens approx. once per day on average. For both issues I auto-detect the error in the logs/dmesg and restart the harvester with a bash cron script.
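The kernel-side symptom is greppable: the driver faults show up as "NVRM: Xid" lines in dmesg. A minimal filter a watchdog script could build on might look like this sketch; the pattern is inferred from the Xid lines quoted in this thread, not an exhaustive match for every Xid format:

```shell
# Filter NVIDIA Xid fault lines (as seen in dmesg / kern.log) from stdin.
# Pattern is a sketch based on the fault lines reported in this issue.
xid_lines() {
  grep -E 'NVRM: Xid \(PCI:[0-9a-fA-F:.]+\)'
}

# Example usage: dmesg | xid_lines
```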

@spegelius

Happening randomly, mostly when running chia plots check but also during farming.
Chia 2.1.4
Ubuntu 22.02 and Windows 10
Drivers 546.17 for Windows, 525.x and 535.x for Linux
Multiple 1050 4GB cards and one 1030 2GB are affected, plus two 3060 cards that don't have any issues.
Surprisingly, there's no difference between the 1030 and 1050 with regard to this crash...

@spegelius

Well, it seems that C7 plots were causing this, so I replotted to C5 and all is good.


Jahorse commented Feb 11, 2024

> Well it seems that c07 plots are causing this so I replotted to level c05 and all is good.

This kind of lines up with my suspicion that something is a bit off with the calculation for the required memory at C7. It would be nice if somebody who knew what they were doing could try to increase the allocations a bit.


timolow commented Feb 11, 2024

I am using C5 and am experiencing this issue; might it be the quantity of plots?

@spegelius

> I am using c05 and am experiencing this issue, might be a qty of plots?

Hard to say. C7 doesn't seem to cause this error immediately: one of my harvesters with a GTX 1050 and around 50TB of C7 plots (with 50TB of C5 mixed in) has gone multiple days without errors, but can also freak out in less than a day. Running a plot check on those plots might hit this error in one run and pass in the next, so something random seems to be happening. Removing all C7 plots in my case seems to have fixed the situation, but it could be that the odds of it happening are just much smaller. I was also wondering whether the number of proofs found could affect this, and whether the decompressor_thread_count setting affects the GPU decompressor.

@4ntibala

Dunno if this info is any helpful, but I had my C7 farm running on an old i7 for a few months, and the GPU error usually occurred 1-2 times a day, sometimes less. Recently I changed to a workstation, an old D30, running the same OS and the same GPU.

What changed is that the GPU error now shows up multiple times per day, sometimes even 5 or 6 times.

Interestingly, even though the error rate is now higher, I see fewer stale partials.

This might be related to a Chia client update, or not; I don't know. I just thought I'd share this observation: same GPU, different fail rates.

Linux Mint 21.2 - full node
Lenovo D30 Workstation 256 GB RAM
NVIDIA GeForce GTX 1050 Ti - 4 GB
Driver Version: 535.86.10
CUDA Version: 12.2
Compression: C7
Plots: 4534
Chia Version: 2.1.4


jeancur commented Feb 18, 2024

Found this as well: GRResult is not GRResult_OK

It repeats three times for the same plot, then occurs on another plot some time later.
None for days from 04-Feb to 13-Feb, then a whack of them on 14-Feb, then a few the next day, then nothing for the next two days.
The system, a 1910 Threadripper, does nothing but farm.

Would this affect farming block wins?

OS: Ubuntu 20.04.3 LTS x86_64
Kernel: 5.15.0-94-generic
Memory: 5.90GiB / 16.0 GiB
Chia version: 2.1.4 farming only 10,000 C7 Plots
NVIDIA M2000 4Gb, CUDA Version: 5.2
Driver Version: 535.154.05


djerfy commented Feb 28, 2024

Same problem here (one time):

Feb 27 02:25:54 ChiaHarvester3 kernel: [32847.362629] NVRM: GPU at PCI:0000:01:00: GPU-863a3809-614b-4d4d-5f2c-3e071e56b7bb
Feb 27 02:25:54 ChiaHarvester3 kernel: [32847.362633] NVRM: GPU Board Serial Number: 0421619095918
Feb 27 02:25:54 ChiaHarvester3 kernel: [32847.362634] NVRM: Xid (PCI:0000:01:00): 31, pid=2905, name=chia_harvester, Ch 00000008, intr 10000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_1 faulted @ 0x7fe2_25685000. Fault is of type FAULT_PDE ACCESS_TYPE_READ

Thanks @larod for your script

OS: Ubuntu 22.04.4 LTS x86_64
Kernel: 5.15.0-97-generic
Memory: 9.06GiB / 16.0 GiB
Chia version: 2.1.4 farming only 2078 plots (C0 23%, C5 77%)
NVIDIA TESLA P4 8Gb, CUDA Version: 12.2
Driver Version: 535.161.07


Nuke79 commented Feb 29, 2024

Chia 2.2.0 is live. Can someone confirm that the bug is fixed? I didn't see anything about it in the patch notes.


Proace1 commented Feb 29, 2024

Version 2.2.0 does not do GPU farming at all, no matter which graphics card is used. I have tried an RTX 2080 Super, an RTX 3060, and a GT 1030. In version 2.1.4 it works without any problems. The bug is of course still included.


Nuke79 commented Feb 29, 2024

> Version 2.2.0 does not do GPU mining. No matter which graphics card is used. I have the RTX 2080 Super, RTX 3060 or a GT 1030. In version 2.1.4 it works without any problems. The BUG is of course included.

2.2.0 GPU farming works fine for me. No bug has occurred yet (about 3 hours).

P.S. The bug is still present. Same error on 2.2.0: GRResult is not GRResult_OK, received GRResult_OutOfMemory.
Nvidia GTX 1070 / 8GB vRAM.

@GolDenis72

The same problem
2024-03-22T01:42:57.016 harvester chia.harvester.harvester: ERROR Exception fetching full proof for xxxxxxxxxxxxxxxx.plot. GRResult is not GRResult_OK, received GRResult_OutOfMemory
Every 3-6 hours
~6500 plots, Win 10, Nvidia GTX 1060/8Gb vRAM

@mehditlili

How hard is it for the devs to get a 1060 and reproduce the problem locally?
Just blindly upgrading the version with unrelated fixes and asking people to test it for you is not very professional. Chia is open source, which is nice, and we are grateful, but you are paid and this is your job, so please do it.


spegelius commented Apr 1, 2024

Converting c07 -> c05 didn't seem to help; one Linux machine with a 1050 4GB still occasionally has this problem. Interestingly, the Win10 machines with a 1050 and a 1030 are much more stable. I wonder if the env settings from Eth mining (GPU_MAX_ALLOC_PERCENT etc.) are affecting this; I have those only on the Win machines.

@GolDenis72

"How hard is it for devs to get a 1060 and reproduce"
Hmm... there are about a dozen (old!) GPU brands and a few Windows editions. Who is going to cover all of that?
Just install a restart script and forget about it.
Regards.


Proace1 commented Apr 2, 2024

@GolDenis72 what kind of script?


GolDenis72 commented Apr 2, 2024

#15404 (comment)
works fine for me, even on Win 10 via Git Bash. A few corrections were made (e-mail sending, for example), but it is mostly as the original.
P.S. Change LOG_MESSAGE="Fault: ENGINE GRAPHICS GPCCLIENT" to our LOG_MESSAGE="GRResult is not GRResult_OK, received GRResult_OutOfMemory"; when that is found in the debug.log (give the real path to debug.log), it restarts ONLY the harvester. Superbly useful!
Found error, restart harvester (no sync problems, lost connections, etc.). Very nice!

@Daivis88

Daivis88 commented Apr 15, 2024

@larod @GolDenis72

> #15404 (comment) works fine for me. Even on Win 10 via Git Bash. …

Hi. I'm very new to this. Maybe you could tell me where I made a mistake?

$ #!/bin/bash

# Define the log file path in your home directory
LOG_FILE="C:\Users\2010m\.chia\mainnet\log\debug.log"

# The specific log message to look for
LOG_MESSAGE="Fault: GRResult is not GRResult_OK, received GRResult_OutOfMemory"

# Start an infinite loop to monitor the log file
while true; do
  CURRENT_TIME=$(date "+%Y-%m-%d %H:%M:%S")
  echo "$CURRENT_TIME Starting to monitor syslog..."

  # Use tail to monitor the log file and grep for the log message
  if tail -n 0 -F /var/log/syslog | grep -q "$LOG_MESSAGE"; then

    sleep 5

    # Get the current date and time
    CURRENT_TIME=$(date "+%Y-%m-%d %H:%M:%S")

    # Execute the Chia harvester command and capture its return code
    chia start harvester -r
    RETURN_CODE=$?

    # Determine the result of the restart
    if [ $RETURN_CODE -eq 0 ]; then
      RESTART_RESULT="Done"
    else
      RESTART_RESULT="Failed"
    fi

    # Log to the system syslog
    logger -t "Chia Harvester" "Restarting Chia Harvester Service... [$RESTART_RESULT]"
  fi

done

2024-04-15 20:41:21 Starting to monitor syslog...
tail: cannot open '/var/log/syslog' for reading: No such file or directory

I have a sense that it's not working because I'm on Windows 10, but you said that you are using Win 10 too. I really need this to work; my system needs restarting every 6 hours or so...

Please help.


GolDenis72 commented Apr 18, 2024

Hi!
"Define the log file path in your home directory" means the log file for the script's own output (in my case LOG_FILE="c:/Users/denis/.chia/mainnet/log/start_harvester.log").

"Use tail to monitor the log file and grep for the log message":
if tail -n 0 -F /var/log/syslog | grep -q "$LOG_MESSAGE"; then

does not work on Windows, so we need to put the real path to the chia log file:
if tail -n 0 -F c:/Users/denis/.chia/mainnet/log/debug.log | grep -q "$LOG_MESSAGE"; then
Good luck!

@Daivis88

Daivis88 commented Apr 19, 2024

> Hi! "Define the log file path in your home directory" means the log file for the script's output (in my case LOG_FILE="c:/Users/denis/.chia/mainnet/log/start_harvester.log")
>
> "Use tail to monitor the log file and grep for the log message": if tail -n 0 -F /var/log/syslog | grep -q "$LOG_MESSAGE"; then
>
> does not work on Windows, so we need to put the real path to the chia log file: if tail -n 0 -F c:/Users/denis/.chia/mainnet/log/debug.log | grep -q "$LOG_MESSAGE"; then Good luck!

@GolDenis72
Hi again.
I have tried it and it didn't work; it looks like this now. Any ideas?


2010m@ChiaRig MINGW64 ~/Desktop
$ #!/bin/bash

# Define the log file path in your home directory
LOG_FILE="C:\Users\2010m\.chia\mainnet\log\Start_harvester.log"

# The specific log message to look for
LOG_MESSAGE="Fault: GRResult is not GRResult_OK, received GRResult_OutOfMemory"

# Start an infinite loop to monitor the log file
while true; do
  CURRENT_TIME=$(date "+%Y-%m-%d %H:%M:%S")
  echo "$CURRENT_TIME Starting to monitor syslog..."

  # Use tail to monitor the log file and grep for the log message
  if tail -n 0 -F C:\Users\2010m\.chia\mainnet\log\debug.log | grep -q "$LOG_MESSAGE"; then

    sleep 5

    # Get the current date and time
    CURRENT_TIME=$(date "+%Y-%m-%d %H:%M:%S")

    # Execute the Chia harvester command and capture its return code
    chia start harvester -r
    RETURN_CODE=$?

    # Determine the result of the restart
    if [ $RETURN_CODE -eq 0 ]; then
      RESTART_RESULT="Done"
    else
      RESTART_RESULT="Failed"
    fi

    # Log to the system syslog
    logger -t "Chia Harvester" "Restarting Chia Harvester Service... [$RESTART_RESULT]"
  fi

done
2024-04-19 23:40:15 Starting to monitor syslog...
tail: cannot open 'C:Users2010m.chiamainnetlogdebug.log' for reading: No such file or directory


GolDenis72 commented Apr 20, 2024

Your computer is right!
Slash ("/"), NOT backslash ("\")!
Check your paths again!
Mine: c:/Users/denis/.chia/mainnet/log/debug.log
Yours: C:\Users\2010m\.chia\mainnet\log\debug.log
See the difference? Use "/", not "\".
Good luck!

@Daivis88

> Your comp is right! Slash ("/"), NOT back slash ("\")! Check your paths again! Mine: c:/Users/denis/.chia/mainnet/log/debug.log, yours: C:\Users\2010m\.chia\mainnet\log\debug.log. See the differences? Use "/", not "\". Good luck!

Oh my god.... how did I miss that.... I'm so grateful you have pointed it out to me. Looks like it's running now, will see if it works. Thanks, again.


Daivis88 commented Apr 21, 2024

@GolDenis72 Hi again. :)

It looks like the script monitors the log, but when it finds the error the harvester doesn't restart; it says "command not found". Any ideas?

2010m@ChiaRig MINGW64 ~/Desktop
$ #!/bin/bash

# Define the log file path in your home directory
LOG_FILE="C:/Users/2010m/.chia/mainnet/log/Start_harvester.log"

# The specific log message to look for
LOG_MESSAGE="GRResult is not GRResult_OK, received GRResult_OutOfMemory"

# Start an infinite loop to monitor the log file
while true; do
  CURRENT_TIME=$(date "+%Y-%m-%d %H:%M:%S")
  echo "$CURRENT_TIME Starting to monitor syslog..."

  # Use tail to monitor the log file and grep for the log message
  if tail -n 0 -F C:/Users/2010m/.chia/mainnet/log/debug.log | grep -q "$LOG_MESSAGE"; then

    sleep 5

    # Get the current date and time
    CURRENT_TIME=$(date "+%Y-%m-%d %H:%M:%S")

    # Execute the Chia harvester command and capture its return code
    chia start harvester -r
    RETURN_CODE=$?

    # Determine the result of the restart
    if [ $RETURN_CODE -eq 0 ]; then
      RESTART_RESULT="Done"
    else
      RESTART_RESULT="Failed"
    fi

    # Log to the system syslog
    logger -t "Chia Harvester" "Restarting Chia Harvester Service... [$RESTART_RESULT]"
  fi

done
2024-04-20 20:42:54 Starting to monitor syslog...
bash: chia: command not found
bash: logger: command not found
2024-04-21 18:01:36 Starting to monitor syslog...

@GolDenis72

And again, your computer is right: command not found. :-)
You are trying to transfer the script directly from Linux to Windows; you have to do it correctly. For example:
"chia start harvester -r" assumes the system knows WHERE chia.exe is (the path to the chia directory was put into the system PATH beforehand), OR that you started the script from the chia.exe directory.
If not, put the FULL PATH to chia.exe in the script, e.g. (in my case) c:/Users/denis/AppData/Local/Programs/ChiaFox/resources/app.asar.unpacked/daemon/chia.exe start harvester -r
Keep trying! Good luck!

@Daivis88

> And again, your computer is right: command not found. :-) You are trying to transfer the script directly from Linux to Windows; you have to do it correctly. "chia start harvester -r" assumes the system knows WHERE chia.exe is, OR that you started the script from the chia.exe directory. If not, put the FULL PATH to chia.exe in the script, e.g. (in my case) c:/Users/denis/AppData/Local/Programs/ChiaFox/resources/app.asar.unpacked/daemon/chia.exe start harvester -r Keep trying! Good luck!

@GolDenis72
I did it and it came up with an error again... Please don't judge too harshly...


2010m@ChiaRig MINGW64 ~/Desktop
$ #!/bin/bash

# Define the log file path in your home directory
LOG_FILE="C:/Users/2010m/.chia/mainnet/log/Start_harvester.log"

# The specific log message to look for
LOG_MESSAGE="GRResult is not GRResult_OK, received GRResult_OutOfMemory"

# Start an infinite loop to monitor the log file
while true; do
  CURRENT_TIME=$(date "+%Y-%m-%d %H:%M:%S")
  echo "$CURRENT_TIME Starting to monitor syslog..."

  # Use tail to monitor the log file and grep for the log message
  if tail -n 0 -F C:/Users/2010m/.chia/mainnet/log/debug.log | grep -q "$LOG_MESSAGE"; then

    sleep 5

    # Get the current date and time
    CURRENT_TIME=$(date "+%Y-%m-%d %H:%M:%S")

    # Execute the Chia harvester command and capture its return code
    C:/Program Files/Chia/resources/app.asar.unpacked/daemon/chia.exe start harvester -r
    RETURN_CODE=$?

    # Determine the result of the restart
    if [ $RETURN_CODE -eq 0 ]; then
      RESTART_RESULT="Done"
    else
      RESTART_RESULT="Failed"
    fi

    # Log to the system syslog
    logger -t "Chia Harvester" "Restarting Chia Harvester Service... [$RESTART_RESULT]"
  fi

done
2024-04-24 12:10:12 Starting to monitor syslog...
bash: C:/Program: No such file or directory
bash: logger: command not found
2024-04-25 13:33:07 Starting to monitor syslog...
bash: C:/Program: No such file or directory
bash: logger: command not found
2024-04-25 13:42:07 Starting to monitor syslog...
bash: C:/Program: No such file or directory
bash: logger: command not found
2024-04-25 13:45:39 Starting to monitor syslog...
bash: C:/Program: No such file or directory
bash: logger: command not found
2024-04-25 13:48:50 Starting to monitor syslog...
bash: C:/Program: No such file or directory
bash: logger: command not found

@GolDenis72

Check that path twice: "C:/Program Files/Chia/resources/app.asar.unpacked/daemon/chia.exe".
I think something is wrong with it.

@GolDenis72

There are 2 chia executables in the system (don't ask me why); you need to find the right one. Just check it with a simple chia command first (like chia -h) to be sure you found the right one.
After that, put the FULL path to that exe file into the script.
Hmm... not sure about logger. I think I commented that line out.


Daivis88 commented Apr 25, 2024

> Check that path twice: "C:/Program Files/Chia/resources/app.asar.unpacked/daemon/chia.exe". I think something is wrong with it.

The path is right... I'm thinking it must be because it's two words, "Program Files"...

@Daivis88

@GolDenis72
It finally works! So the issue was the space in C:/Program Files/Chia/resources/app.asar.unpacked/daemon/chia.exe.
As soon as I quoted it, "C:/Program Files/Chia/resources/app.asar.unpacked/daemon/chia.exe", it found the path.
So happy that it works. I really appreciate your patience and help. I can sleep without worry now :D
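To close the loop for anyone landing here: a consolidated version of the watchdog with the thread's fixes folded in (forward slashes, a quoted and configurable chia binary path) might look like the sketch below. The default DEBUG_LOG and CHIA_BIN paths are assumptions; adjust them for your install. The infinite loop is gated behind RUN_MONITOR=1 so the functions can be sourced and tested on their own.

```shell
#!/bin/bash
# Restart the Chia harvester when the GRResult error appears in debug.log.
# DEBUG_LOG and CHIA_BIN defaults are examples; override for your install.
# On Windows (Git Bash), use forward slashes and quote paths with spaces, e.g.
#   CHIA_BIN="C:/Program Files/Chia/resources/app.asar.unpacked/daemon/chia.exe"
DEBUG_LOG="${DEBUG_LOG:-$HOME/.chia/mainnet/log/debug.log}"
CHIA_BIN="${CHIA_BIN:-chia}"
LOG_MESSAGE="GRResult is not GRResult_OK"

# Exit 0 if the error appears in the recent tail of the log.
error_seen() {
  tail -n 200 "$DEBUG_LOG" 2>/dev/null | grep -q "$LOG_MESSAGE"
}

restart_harvester() {
  "$CHIA_BIN" start harvester -r
}

monitor() {
  while true; do
    # Block until the message scrolls past in the live log, then restart once.
    if tail -n 0 -F "$DEBUG_LOG" | grep -q "$LOG_MESSAGE"; then
      sleep 5
      if restart_harvester; then
        echo "$(date '+%F %T') harvester restarted"
      else
        echo "$(date '+%F %T') harvester restart FAILED" >&2
      fi
    fi
  done
}

# Gate the infinite loop so the file can be sourced without side effects.
if [ "${RUN_MONITOR:-0}" = "1" ]; then
  monitor
fi
```

Run it with `RUN_MONITOR=1 ./grresult_watchdog.sh` (filename hypothetical), or from cron/a service manager so it restarts after reboots.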
