Linux (Zombie Process) - Various Miners - RBM : "failed to close within 10 seconds" / "Refused to die" #1296

Closed
saki2fifty opened this issue Jan 9, 2021 · 12 comments

@saki2fifty

saki2fifty commented Jan 9, 2021

When auto-switching from one miner to another, I've seen RBM report that it could not close the miner, after which RBM doesn't continue. It looks like RBM sends a ^C to terminate the miner but does nothing else to try to close it.

Top shows the defunct process still using 100% cpu.

  1. Can we add other methods of closing such as killing the screen / process and continuing?
  2. If the issue is unfixable, can we have the option to reboot on these types of issues?
  3. After 5 attempts/reboots, maybe temporarily disable the miner so that it doesn't keep looping and getting stuck, and log it?
  4. This happens on all my rigs, not just the below.
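
The escalation asked for in points 1 to 3 could look roughly like the shell sketch below. It is only an illustration, not RainbowMiner's actual stop logic: `state_of` and `stop_with_escalation` are hypothetical helper names, the 10-second default grace period is borrowed from the "failed to close within 10 seconds" log message, and a Linux /proc filesystem is assumed.

```shell
# state_of PID: print the kernel state letter (R, S, D, Z, ...),
# or nothing if the PID no longer exists. (Hypothetical helper.)
state_of() {
    awk '/^State:/ {print $2}' "/proc/$1/status" 2>/dev/null
}

# stop_with_escalation PID [GRACE_SECONDS]  (hypothetical helper)
# Sends INT, then TERM, then KILL, waiting up to GRACE_SECONDS after each.
# Returns 0 if the process is gone, 2 if it has exited but is still
# listed as defunct (zombie), 1 if it survived all three signals.
stop_with_escalation() {
    pid=$1
    grace=${2:-10}
    for sig in INT TERM KILL; do
        kill -s "$sig" "$pid" 2>/dev/null || true
        waited=0
        while [ "$waited" -lt "$grace" ]; do
            state=$(state_of "$pid")
            [ -z "$state" ] && return 0      # fully gone
            [ "$state" = "Z" ] && return 2   # exited, but not reaped
            sleep 1
            waited=$((waited + 1))
        done
    done
    return 1
}
```

A caller could treat 0 as a clean stop, 2 as "defunct: log it and carry on", and 1 as grounds to disable the miner or reboot, instead of blocking forever.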

Logs:
[2021-01-08 21:25:09] INFO: Send ^C to Miner Trex-GPU#00-GPU#01-GPU#02-GPU#03-GPU#04-GPU#05-GPU#06-GPU#07-GPU#08's screen sakkisminer01_gpu00_gpu01_gpu02_gpu03_gpu04_gpu05_gpu06_gpu07_gpu08
[2021-01-08 21:25:21] WARNING: Miner Trex-GPU#00-GPU#01-GPU#02-GPU#03-GPU#04-GPU#05-GPU#06-GPU#07-GPU#08 failed to close within 10 seconds
[2021-01-08 21:25:32] INFO: OCDaemon for start-stop-daemon --stop --name t-rex --pidfile /home/saki2fifty/RainbowMiner/Data/pid/sakkisminer01_gpu00_gpu01_gpu02_gpu03_gpu04_gpu05_gpu06_gpu07_gpu08_pid.txt --retry 5 reports: Program t-rex, 1 process(es), refused to die.

sudo screen -ls
There are screens on:
2135.sakkisminer01_gpu00_gpu01_gpu02_gpu03_gpu04_gpu05_gpu06_gpu07_gpu08 (01/08/2021 09:20:18 PM) (Detached)
397.RainbowMiner (01/08/2021 09:13:11 PM) (Detached)
2 Sockets in /run/screen/S-root.

saki2fifty@sakkisminer01:~/RainbowMiner$ sudo ps aux | grep t-rex
[sudo] password for saki2fifty:
root 2154 99.1 0.0 0 0 pts/0 Zl+ Jan08 635:02 [t-rex]
saki2fi+ 29291 0.0 0.0 14436 1116 pts/2 S+ 08:00 0:00 grep --color=auto t-rex

top
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
2154 root 20 0 0 0 0 Z 100.0 0.0 608:36.42 t-rex

@saki2fifty
Author

debug_2021-01-09.zip

@saki2fifty
Author

Hmm... FYI, in the ps aux | grep t-rex output above, there is a "defunct" after the t-rex process, but GitHub did not show it when I pasted it.
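
For reference, the STAT column in the ps output above reads `Zl+`. Per the ps manual, `Z` is a defunct (zombie) process, `l` means multi-threaded, and `+` means it is in the foreground process group. One possible reading, which is an assumption and not confirmed in this thread, is that the main thread has exited while other threads are still running, which would account for the CPU usage. The sketch below reads the same state information straight from /proc; `proc_state` and `thread_states` are hypothetical helper names.

```shell
# proc_state PID: print the one-letter kernel state of a process,
# or nothing if the PID does not exist. (Hypothetical helper.)
proc_state() {
    awk '/^State:/ {print $2}' "/proc/$1/status" 2>/dev/null
}

# thread_states PID: print "<tid> <state>" for every thread of the
# process, so a defunct main thread with live siblings becomes visible.
thread_states() {
    for t in /proc/"$1"/task/*; do
        [ -r "$t/status" ] || continue
        printf '%s %s\n' "${t##*/}" "$(awk '/^State:/ {print $2}' "$t/status")"
    done
}
```

Running `thread_states 2154` on the affected rig would show whether any thread of the t-rex process is in a state other than `Z`.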

@saki2fifty
Author

Just going to work with the below for now, and closing this out.

The problem was that when RBM inserts the ^C into the miner's screen, the miner sometimes doesn't stop, and RBM doesn't continue.

Instead, doing the below in OCDaemon.psm1:

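                if ($Name -match $DeviceNameMatch) {
                    # Original behavior (disabled): type ^C into the miner's screen session
                    #Invoke-Exe "screen" -ArgumentList "-S $Name -X stuff `^C" > $null
                    #Start-Sleep -Milliseconds 250
                    # Workaround: terminate the screen session outright...
                    Invoke-Exe "screen" -ArgumentList "-S $Name -X quit" > $null
                    Start-Sleep -Milliseconds 250
                    # ...then remove any dead session sockets
                    Invoke-Exe "screen" -ArgumentList "-wipe" > $null
                }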

Will probably screw something up. Will see.
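
The same quit-then-wipe sequence can be tried by hand from a shell, with a check afterwards that the session is really gone. This is a sketch only: `close_screen_session` is a hypothetical helper name, and the one-second pause stands in for the 250 ms delay in the PowerShell excerpt because POSIX `sleep` only guarantees whole seconds.

```shell
# close_screen_session NAME  (hypothetical helper)
# Quits the named screen session, wipes dead sockets, then verifies.
# Returns 0 if the session no longer appears in `screen -ls`, 1 otherwise.
close_screen_session() {
    name=$1
    screen -S "$name" -X quit >/dev/null 2>&1 || true
    sleep 1
    screen -wipe >/dev/null 2>&1 || true
    # `screen -ls` lists sessions as "<pid>.<name>"; strip the pid
    # prefix from the first field and compare the rest exactly.
    if screen -ls | awk -v n="$name" '
        { s = $1; sub(/^[0-9]+\./, "", s); if (s == n) found = 1 }
        END { exit (found ? 0 : 1) }'; then
        return 1    # still listed
    fi
    return 0
}
```

A return value of 1 means the session survived the quit, which is worth logging before starting another miner.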

@RainbowMiner
Owner

Thank you! I will improve the function. Let me know if quit and wipe manage to get rid of defunct processes.

@RainbowMiner
Owner

OK, it seems that the subsequent kill -9 of such a zombie process hangs forever. I will remove the kill for now, so that RainbowMiner doesn't stop. Most likely a reboot is the only way to get rid of those zombies.
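
A generic way to keep a caller from blocking on a stop command is to put a time limit on it. The sketch below is not what the commit referenced in this thread does (that commit removes the kill); it only illustrates the bounded-wait idea using coreutils `timeout`. `run_bounded` is a hypothetical helper name, and GNU `timeout` with its `-k` option is assumed to be available.

```shell
# run_bounded SECONDS CMD [ARGS...]  (hypothetical helper)
# Runs CMD, but gives up after SECONDS. With GNU timeout, an exit
# status of 124 means the time limit was hit.
run_bounded() {
    limit=$1
    shift
    # -k 2: if CMD ignores the initial TERM, send KILL two seconds later.
    timeout -k 2 "$limit" "$@"
}

# Example (not run here): bound the stop command seen in the log, with
# "$PIDFILE" standing in for the miner's pid file path.
#   run_bounded 15 start-stop-daemon --stop --name t-rex \
#       --pidfile "$PIDFILE" --retry 5
```

This only bounds commands that can themselves be terminated; it may not help if the wrapped command ends up in an uninterruptible state.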

RainbowMiner added a commit that referenced this issue Jan 11, 2021
- linux: remove kill -9 to avoid hang when trying to kill a zombie process (issue #1296)
@RainbowMiner
Owner

RainbowMiner commented Jan 11, 2021

Done. If you would like RainbowMiner to reboot the machine after such a zombie/defunct process has been created, just set "EnableRestartComputer": "1" in config.txt.

@saki2fifty
Author

> Thank you! I will improve the function. Let me know if quit and wipe manage to get rid of defunct processes.

OK, I'm re-benchmarking with your change while keeping the quit/wipe. Once that's done, I'll let it sit for a few days to see if it keeps hanging. I know it's not a clean exit, but...

Thanks!

@saki2fifty
Author

Still "zombies". I'm good, I'll just manually switch and just mine one coin at a time.

I read that zombies only occur when the parent process ends sooner than the child and doesn't keep track of it, and that in almost all cases it comes down to how the code handles this.

Anywho...
Thx!
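
A note on the mechanism mentioned above: on Linux, a zombie is a child that has already exited while its parent is still alive and has not yet called wait() on it. If the parent exits first, the child is re-parented to init (or a subreaper), which normally reaps it. The sketch below reproduces the first case on purpose; `make_zombie` is a hypothetical helper name and the timings are arbitrary.

```shell
# make_zombie  (hypothetical helper)
# Starts a parent that never waits on its child, then prints
# "<parent_pid> <child_pid>" once the child has exited.
make_zombie() {
    pidfile=$(mktemp)
    # The inner shell starts a 1-second child, records the child's PID,
    # then exec()s into `sleep 30`, which never reaps children.
    sh -c 'sleep 1 & echo $! > "$1"; exec sleep 30' sh "$pidfile" \
        >/dev/null 2>&1 &
    parent=$!
    sleep 2    # give the child time to exit
    echo "$parent $(cat "$pidfile")"
    rm -f "$pidfile"
}

# Usage (not run here):
#   set -- $(make_zombie)
#   grep '^State:' "/proc/$2/status"   # state of the unreaped child
#   kill "$1"                          # ending the parent lets init reap it
```

Sending signals to the zombie itself has no effect, because it has already exited; it is the parent that has to reap it or go away.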

@RainbowMiner
Owner

Sure, zombies are to be expected. But does RainbowMiner still stop? The last fix was not meant to get rid of zombies, but to keep RainbowMiner from waiting forever after it failed to kill a miner process.

@saki2fifty
Author

Ok, got it.

Almost positive it continued, but the miner itself and its screen hang. I don't think any other screens spawn after that.

I'm working from home today, so let me run through it again and clear the logs to be sure I'm reporting this accurately. Sorry, I'll let you know.

@saki2fifty
Author

[Screenshots: 1147, 1155]

RainbowMiner does not continue. The 2 screenshots are at least 8 minutes apart, but I waited much longer. Note the W10 times.

My problem might be something else. A GPU randomly stops, so I disable it in RBM and reboot; another GPU fails, so I disable it in RBM and reboot; 15 or so benchmarks go through fine, then the rest all fail one after the other, saying OpenCL not found. That is unrelated to the original issue. I'll strip it down to 1 GPU and work my way up.

@RainbowMiner
Owner

It really looks like an intensity problem with these GPUs. Try to set lower intensities or reduce the overclocking.
In the meantime, I will try to use the miner's APIs to shut them down (if supported) instead of sending a ^C. Maybe that helps, too.
But lost GPUs are always really bad, since no other miner will be able to continue.
