v0.7.0 intermittently hangs on Windows, segfaults on Linux at startup #32

Open
JayDDee opened this issue Mar 1, 2021 · 14 comments

@JayDDee

JayDDee commented Mar 1, 2021

Windows 10, GTX-970.
Ubuntu 20.04, various Maxwell and Pascal cards.

Very often the miner hangs (Windows) or segfaults (Linux) on startup after the stratum connection but before the worksize set
message is displayed. Sometimes it starts ok. Once it's running it seems stable.

v0.6.2 doesn't have this problem; it always starts reliably (tested on Windows only).
Edit: Ubuntu 20.04 was tested with v0.6.1 with no segfaults.

Windows was tested with the precompiled binary. Ubuntu was tested using the precompiled binary and compiled from source.

Update: the problem is more persistent on Windows. I can't get v0.7.0 to work at all on my 2 Windows PCs, while v0.6.2 has no
problems. The failure on Windows includes Turing cards.

@JayDDee
Author

JayDDee commented Mar 1, 2021

I have more info about the problem. In v0.7.0 my GPUs are reported as CL devices as well as Cuda; v0.6.2 only detects them
as Cuda devices.

coin@sys27:~/miners/verthash/VerthashMiner-0.7.0/build$ ./VerthashMiner -l
[2021-03-01 12:24:14] INFO Found 1 OpenCL devices.
[2021-03-01 12:24:14] INFO Found 1 CUDA devices

Device list:

OpenCL devices:
Index: 0. Name: GeForce GTX 960
Platform index: 0
Platform name: NVIDIA Corporation
pcieId: 65:00:0

CUDA devices:
Index: 0. Name: GeForce GTX 960. pcieId: 65:00:0

coin@sys27:~/miners/verthash/VerthashMiner-0.6.1/build$ ./VerthashMiner -l
[2021-03-01 12:26:00] WARN Skipping CL platform (index: 0, NVIDIA Corporation)
[2021-03-01 12:26:00] INFO Found 0 OpenCL devices.
[2021-03-01 12:26:00] INFO Found 1 CUDA devices

Device list(raw):

OpenCL devices: None

CUDA devices:
Index: 0. Name: GeForce GTX 960

It looks like --no-restrict-cuda is the default. How do I force the miner to use Cuda?

@JayDDee
Author

JayDDee commented Mar 1, 2021

I have a workaround that solved the problem for my case with Cuda 10 but I don't think Cuda 10 is the issue because the
problem also occurs with the Windows Cuda 11 build.

I disabled the "Nvidia GPU to CUDA restrictions" check in main.cpp by commenting out the entire #ifdef HAVE_CUDA block.
This code checks for Cuda 11 & compute 3.0. Something wrong seems to be happening in that code causing the
"continue" path to be executed.

Update: The workaround may not fix the issue. In an ironic twist one system that had not seen the crash (started ok the
first and only try) crashed twice when I applied the workaround. When I reverted to the original code it started the first
try.

Now I'm just confused. The only thing that is consistent is that it never crashes on v0.6.2.

@CryptoGraphics
Owner

CryptoGraphics commented Mar 2, 2021

Hello. You can always select devices manually.
e.g select a single CUDA device (index 0).
--cu-devices 0

@JayDDee
Author

JayDDee commented Mar 3, 2021

I apologize for the confusion above; intermittent problems tend to be difficult to define.

I'm not sure anything above is of any value except that v0.7.0 crashes intermittently at startup on both Linux & Windows
with Cuda GPUs where v0.6.1 & v0.6.2 did not.

I've tried --cl-devices n and --all-cu-devices, it makes no difference.

I see this on all my rigs. Is it just me that has this problem?

@kanehbosm

kanehbosm commented Mar 4, 2021

jay, i too have encountered this. if you launch your windows "reliability monitor" you should see an entry for vertcoin crashing. it probably matches mine from three different machines https://i.imgur.com/PaN4aU7.png

@CryptoGraphics
Owner

I was unable to reproduce this problem.
However there are several changes in v0.7.1.
Miner will no longer load NVML and ADL libraries when they are not needed.
e.g. you can disable GPU monitoring, set WorkSize to 131072 and it will become pretty much like v0.6.2.

@JayDDee
Author

JayDDee commented Mar 6, 2021

Unfortunately it doesn't seem to have improved. It took 2 attempts to get 0.7.1 to start on Windows10 and after 4 tries
on Ubuntu-20.04 I reverted to the old version.

Maybe this is a problem only with Maxwell cards. My rig with Pascal also had a Maxwell, so maybe it was the Maxwell that
was the problem.

There seems to be some inconsistency in reporting available devices. Sometimes it also reports Nvidia as available for OpenCL
& Cuda, sometimes only for Cuda. This might have to do with the code changes I previously made (described above)
so it might not mean anything. I need to test more.

Update: The reporting of Nvidia as OpenCL device doesn't seem to make a difference. In the latest test on one particular
Linux system it has both crashed and started successfully using both 0.7.0 & 0.7.1. Changes in 0.7.1 don't seem to be a factor.

@JayDDee
Author

JayDDee commented Mar 10, 2021

It's looking like it's a Maxwell only problem. v0.7.1 hasn't crashed with no Maxwell cards selected.
It sometimes crashes 3 or 4 times before a successful start with Maxwell.

@CryptoGraphics
Owner

I was able to test the latest version on Maxwell GPUs and there were no issues so far.
Did you try to run in OpenCL only mode(skip/disable CUDA and turn on debugging)?
i.e: --verbose --all-cl-devices --no-restrict-cuda --log-file.

@JayDDee
Author

JayDDee commented Mar 15, 2021

GTX960 on ubuntu-20.04, compiled from source: the OpenCL-only run fails because it can't find a file, even though the file
exists in src/kernels/ (console session further below). But first, here is a gdb backtrace of a crash using Cuda. It appears
to have some useful information:

`$ gdb --args ./VerthashMiner -o stratum+tcp://mine.zergpool.com:4534 -u x -p x --verthash-data verthash.dat --all-cu-devices --verbose
GNU gdb (Ubuntu 9.1-0ubuntu1) 9.1
Copyright (C) 2020 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later http://gnu.org/licenses/gpl.html
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
http://www.gnu.org/software/gdb/bugs/.
Find the GDB manual and other documentation resources online at:
http://www.gnu.org/software/gdb/documentation/.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from ./VerthashMiner...
(gdb) run
Starting program: /home/coin/miners/verthash/VerthashMiner-0.7.1/build/VerthashMiner -o stratum+tcp://mine.zergpool.com:4534 -u 1FXaRoufZC6LyPzjNrs7wS47tpgzEpBSiw -p g27,c=btc,sd=0.1 --verthash-data verthash.dat --all-cu-devices --verbose
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[2021-03-14 20:36:25] INFO Found 0 OpenCL devices.
[New Thread 0x7ffff36cf700 (LWP 2440526)]
[2021-03-14 20:36:25] INFO Found 1 CUDA devices
[2021-03-14 20:36:27] INFO Verthash data file has been loaded succesfully!
[2021-03-14 20:36:56] INFO Verthash data file has been verified succesfully!
[2021-03-14 20:36:56] INFO Miner has been successfully configured! (Errors: 0, Warnings: 0)
[2021-03-14 20:36:56] INFO Configured 0(CL) and 1(CUDA) workers
[2021-03-14 20:36:56] DEBUG Found 1 NVML device
[New Thread 0x7ffff28a8700 (LWP 2440562)]
[New Thread 0x7ffff20a7700 (LWP 2440563)]
[2021-03-14 20:36:56] INFO Starting Stratum on stratum+tcp://mine.zergpool.com:4534
[New Thread 0x7ffff18a6700 (LWP 2440564)]
[2021-03-14 20:36:56] INFO 1 miner threads started, using Verthash algorithm.
[2021-03-14 20:36:56] DEBUG Verthash CUDA thread started
[New Thread 0x7ffff10a5700 (LWP 2440565)]
[New Thread 0x7ffff08a4700 (LWP 2440566)]
[Thread 0x7ffff10a5700 (LWP 2440565) exited]
[New Thread 0x7ffff10a5700 (LWP 2440567)]
[2021-03-14 20:36:56] DEBUG Load verthash data size: 1283457024
[2021-03-14 20:36:57] DEBUG Stratum session id: 3f41f53b4590433f112977b502f709ee

Thread 5 "VerthashMiner" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffff18a6700 (LWP 2440564)]
__strlen_avx2 () at ../sysdeps/x86_64/multiarch/strlen-avx2.S:65
65 ../sysdeps/x86_64/multiarch/strlen-avx2.S: No such file or directory.
(gdb) bt
#0 __strlen_avx2 () at ../sysdeps/x86_64/multiarch/strlen-avx2.S:65
#1 0x00007ffff797b503 in __GI___strdup (s=0x0) at strdup.c:41
#2 0x0000555555574816 in stratum_gen_work(stratum_ctx*, work*) ()
#3 0x000055555557882c in verthashCuda_thread(void*) ()
#4 0x0000555555560164 in _thrd_wrapper_function ()
#5 0x00007ffff7f8e609 in start_thread (arg=) at pthread_create.c:477
#6 0x00007ffff79fb103 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
`

./VerthashMiner -o stratum+tcp://mine.zergpool.com:4534 -u x -p x --verthash-data verthash.dat --verbose --all-cl-devices --no-restrict-cuda --log-file
[2021-03-14 20:23:39] INFO Found 1 OpenCL devices.
[2021-03-14 20:23:39] INFO Found 1 CUDA devices
[2021-03-14 20:23:42] INFO Verthash data file has been loaded succesfully!
[2021-03-14 20:24:11] INFO Verthash data file has been verified succesfully!
[2021-03-14 20:24:11] INFO Miner has been successfully configured! (Errors: 0, Warnings: 0)
[2021-03-14 20:24:11] INFO Configured 1(CL) and 0(CUDA) workers
[2021-03-14 20:24:11] DEBUG Found 1 NVML device
[2021-03-14 20:24:11] INFO Starting Stratum on stratum+tcp://mine.zergpool.com:4534
[2021-03-14 20:24:11] INFO 1 miner threads started, using Verthash algorithm.
[2021-03-14 20:24:11] DEBUG Verthash OCL thread started
[2021-03-14 20:24:11] DEBUG Load verthash data size: 1283457024
[2021-03-14 20:24:12] DEBUG Stratum session id: 3c1ba0d2078d1613512ea66360c77e3e
Failed to read file: kernels/sha3_512_precompute.cl
Failed to create an OpenCL program from source.
[2021-03-14 20:24:12] ERROR cl_device(0):Failed to create a SHA3 precompute program.
[2021-03-14 20:24:12] INFO cl_device(0):Exiting worker thread id(0)...
[2021-03-14 20:24:12] INFO Stratum difficulty set to 0.1
[2021-03-14 20:24:12] INFO Verthash block: 1525737
[2021-03-14 20:24:12] INFO All worker threads have been exited.
[2021-03-14 20:24:12] DEBUG Exit workIO thread
[2021-03-14 20:24:12] INFO WorkIO thread has been finished.
[2021-03-14 20:24:12] INFO Waiting for worker threads to exit...
[2021-03-14 20:24:12] INFO Waiting for stratum thread to exit...
[2021-03-14 20:24:12] ERROR Stratum connection interrupted
[2021-03-14 20:24:12] DEBUG Stratum thread exit
[2021-03-14 20:24:12] INFO Freeing allocated memory...
[2021-03-14 20:24:12] INFO Application has been exited gracefully.

@JayDDee
Author

JayDDee commented Mar 15, 2021

Apologies for the garbled output for the CL test, but I don't think it was useful.

It looks to me like the crash is due to data misalignment of job_id when calling the AVX2 version of strdup.
That explains the intermittent nature of the crash. It would be more difficult to reproduce if the CPU doesn't have
AVX2. Maybe it's not a Maxwell issue after all.

Upon further thought, it may be that job_id is uninitialized, considering it's the first job.

Yup confirmed it. I added a log just before the crash. When it crashes the sctx job_id pointer is null.

./VerthashMiner -o stratum+tcp://mine.zergpool.com:4534 -u x -p x --verthash-data verthash.dat --all-cu-devices
[2021-03-14 21:45:10] INFO Found 0 OpenCL devices.
[2021-03-14 21:45:10] INFO Found 1 CUDA devices
[2021-03-14 21:45:13] INFO Verthash data file has been loaded succesfully!
[2021-03-14 21:45:42] INFO Verthash data file has been verified succesfully!
[2021-03-14 21:45:42] INFO Miner has been successfully configured! (Errors: 0, Warnings: 0)
[2021-03-14 21:45:42] INFO Configured 0(CL) and 1(CUDA) workers
[2021-03-14 21:45:42] DEBUG Found 1 NVML device
[2021-03-14 21:45:42] INFO 1 miner threads started, using Verthash algorithm.
[2021-03-14 21:45:42] INFO Starting Stratum on stratum+tcp://mine.zergpool.com:4534
[2021-03-14 21:45:43] INFO sctx job id: 0
Segmentation fault (core dumped)
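
The backtrace (strdup called with s=0x0) and the log line above both point to a null job_id reaching strdup, which is undefined behaviour. A minimal sketch of a null-safe copy follows. The name dup_job_id is hypothetical and is not taken from the miner's source; it only illustrates the kind of guard that would avoid the crash, assuming the caller can cope with "no job yet".

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <cstring>

// Hypothetical helper (not from the miner's source): copies a job id string,
// but returns nullptr instead of crashing when no job id has been set yet.
// The caller owns the returned buffer and must release it with std::free().
char* dup_job_id(const char* job_id) {
    if (job_id == nullptr) {
        return nullptr;  // first job has not arrived; nothing to copy
    }
    const std::size_t len = std::strlen(job_id) + 1;  // include the terminator
    char* copy = static_cast<char*>(std::malloc(len));
    if (copy != nullptr) {
        std::memcpy(copy, job_id, len);
    }
    return copy;
}
```

A caller would still have to treat a nullptr result as "no work available yet" and skip work generation for that iteration.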

@JayDDee
Author

JayDDee commented Mar 15, 2021

I added the following to the miner thread just before entering the loop. It seems to prevent the crash.
Sometimes I get a couple of logs before it starts.

// Wait for first job
if ( have_stratum ) while ( !stratum.job.job_id )
{
   applog( LOG_INFO, "Waiting for first job...");
   sleep_ms( 1000 );
}

[2021-03-15 17:39:47] INFO Found 0 OpenCL devices.
[2021-03-15 17:39:47] INFO Found 1 CUDA devices
[2021-03-15 17:39:48] INFO Verthash data file has been loaded succesfully!
[2021-03-15 17:40:04] INFO Verthash data file has been verified succesfully!
[2021-03-15 17:40:04] INFO Miner has been successfully configured! (Errors: 0, Warnings: 0)
[2021-03-15 17:40:04] INFO Configured 0(CL) and 1(CUDA) workers
[2021-03-15 17:40:04] DEBUG Found 1 NVML device
[2021-03-15 17:40:04] INFO 1 miner threads started, using Verthash algorithm.
[2021-03-15 17:40:04] INFO Starting Stratum on stratum+tcp://mine.zergpool.com:4534
[2021-03-15 17:40:04] INFO Waiting for first job...
[2021-03-15 17:40:05] INFO Waiting for first job...
[2021-03-15 17:40:06] INFO Stratum difficulty set to 0.1
[2021-03-15 17:40:06] INFO Verthash block: 1526222
[2021-03-15 17:40:10] INFO cu_device(0): err:0, temp:33C, power:66W, fan:48%, hashrate: 164.07 kH/s
[2021-03-15 17:40:11] INFO accepted: 1/1 (100.00%), total hashrate: 164.07 kH/s

Edit: you might want to tweak this a little. A multi-gpu rig will output a log for every thread. I also didn't check if
job_id pointer is guaranteed to be initialized to zero or NULL.
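
One possible tweak for both points in the edit above, sketched under assumptions: instead of each thread polling the job_id pointer, the stratum thread sets an explicit ready flag and wakes the waiting workers. FirstJobGate, publish and wait_for_job are hypothetical names, not part of the miner; the sketch uses only the C++11 standard library.

```cpp
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <string>

// Hypothetical gate shared by the stratum thread and the worker threads.
class FirstJobGate {
public:
    // Stratum thread: call once the first job has been parsed.
    void publish(const std::string& job_id) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            job_id_ = job_id;
            ready_ = true;  // explicit flag, so no reliance on a null pointer
        }
        cv_.notify_all();  // wake every waiting worker
    }

    // Worker thread: block until a job exists or the timeout expires.
    // Returns true if a job is available, false on timeout.
    bool wait_for_job(std::chrono::milliseconds timeout) {
        std::unique_lock<std::mutex> lock(mutex_);
        return cv_.wait_for(lock, timeout, [this] { return ready_; });
    }

    // Returns a copy of the most recently published job id.
    std::string job_id() {
        std::lock_guard<std::mutex> lock(mutex_);
        return job_id_;
    }

private:
    std::mutex mutex_;
    std::condition_variable cv_;
    std::string job_id_;
    bool ready_ = false;
};
```

With this shape the "Waiting for first job..." message could be logged once by whichever code owns the gate, rather than once per worker thread.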

@JayDDee
Author

JayDDee commented Mar 17, 2021

So far no crashes with the job_id check at startup.

While investigating this problem I noticed a couple of things you may be interested in following up on.

You don't subscribe to stratum extranonce. All of the supporting code is present but never used. Other verthash miners support
it. It's not a critical feature for verthash as it's not a high revving algo, but it should be simple to add.

The other is calling stratum_gen_work from the miner thread. That's only necessary when the thread runs out of nonces
before stratum sends new work. And this only applies if extranonce is enabled. A better approach might be to call
stratum_gen_work primarily from the stratum thread whenever new work is received. It would fill g_work and signal the
threads to restart. The threads would then copy g_work into their local work. I have this implemented in cpuminer-opt.
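
The suggested flow could look roughly like the sketch below. It is not the cpuminer-opt or VerthashMiner code: SharedWork, WorkSketch, publish and refresh are hypothetical names, and a generation counter stands in for the restart signal.

```cpp
#include <cstdint>
#include <mutex>
#include <string>

// Hypothetical, reduced work item; the real struct carries far more fields.
struct WorkSketch {
    std::string job_id;
    std::uint64_t generation = 0;  // bumped every time new work is published
};

// Hypothetical holder for the global work (g_work in the text above).
class SharedWork {
public:
    // Stratum thread: called whenever the pool sends new work.
    void publish(const std::string& job_id) {
        std::lock_guard<std::mutex> lock(mutex_);
        g_work_.job_id = job_id;
        ++g_work_.generation;
    }

    // Worker thread: copies g_work into the thread-local work if it changed.
    // Returns true when local was refreshed, meaning the worker should
    // restart hashing on the new job.
    bool refresh(WorkSketch& local) {
        std::lock_guard<std::mutex> lock(mutex_);
        if (local.generation == g_work_.generation) {
            return false;  // nothing new since the last copy
        }
        local = g_work_;
        return true;
    }

private:
    std::mutex mutex_;
    WorkSketch g_work_;
};
```

Because generation starts at 0 and the first publish bumps it to 1, a worker that calls refresh before any job has arrived gets false and never sees an empty job, which also sidesteps the startup crash discussed earlier in this thread.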

It's just a couple of suggestions in case you're interested.

As for the issue of the startup crash, I consider it technically resolved and await a new release with a fix.

@Nyanraltotlapun

I have the same issues with the v0.7.2 release on Linux with a Tesla M40 (Maxwell 2.0) GPU.
v0.6.2 works fine.
