This issue was moved to a discussion.

GPU detection hangs when starting BOINC #5183

Closed
etdr opened this issue Apr 6, 2023 · 16 comments

Comments

@etdr

etdr commented Apr 6, 2023

Describe the bug
I infer that GPU detection is the problem from where the log output splits relative to when I shut the service down. (See screenshots.)

I might be missing something really obvious, or it might have to do with the systemd unit, but I wanted to check whether anyone knows why this is happening.

Steps To Reproduce

  1. Install the boinc package from Arch repository
  2. Enable and start the boinc-client.service unit
  3. The system journal never gets past the first picture below.

Expected behavior
Even if GPU detection fails, I expect it to come back with a yes or a no, not to hang.

Screenshots
this is what it shows up until I kill it

this is what it shows after I kill the service

System Information

  • OS: Arch Linux with Linux 6.2.9
  • BOINC Version: 7.22

Additional context
I have two GPUs in this machine, one AMD and one NVIDIA, both pretty recent. I have the most up-to-date OpenCL libraries installed for both of them.

Nothing shows up in the stderrgpudetect.txt file.

@davidpanderson
Contributor

If you run
boinc --gpu_detect
it will do GPU detection, write the results to coproc_info.xml, and exit.
I'd like to know if this crashes (and if so, where).

But the client is supposed to keep going even if the GPU detection subprocess fails.
Instead, it's being sent a SIGTERM (signal 15) and this causes it to exit.
We need to figure out who's sending this signal.

@etdr
Author

etdr commented Apr 7, 2023

The SIGTERM comes from me: I stopped the service because it was hanging, and I didn't think GPU detection should take more than 5 minutes. I should have been clearer about that.

This is what I get when I try the --gpu_detect option:
[screenshot]

There is a --no_gpus option, but isn't that for explicitly skipping the GPU check?

@BrianNixon
Contributor

Try --detect_gpus instead of --gpu_detect

@etdr
Author

etdr commented Apr 7, 2023

Just so I'm not exiting prematurely, how long should this take, maximum?

@BrianNixon
Contributor

“Not that long”, probably… (I don’t know the real answer)

But if the detection hangs, the client will also hang because it waits forever for the detection to complete. A timeout here would clearly be a good idea.
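A minimal sketch of what such a timeout could look like, assuming the detector runs as a forked child process. This is not the BOINC client's actual code: `wait_with_timeout`, `run_fake_detector`, and the 10 ms poll interval are hypothetical names and values chosen for illustration.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical helper: wait up to timeout_ms for child `pid` to finish.
 * Returns 0 and fills *status if the child was reaped in time,
 * 1 if it had to be killed after the timeout, -1 on a waitpid() error. */
int wait_with_timeout(pid_t pid, int timeout_ms, int *status)
{
    const struct timespec step = { 0, 10 * 1000 * 1000 }; /* poll every 10 ms */
    int waited_ms = 0;

    assert(status != NULL && timeout_ms >= 0);

    for (;;) {
        pid_t r = waitpid(pid, status, WNOHANG);
        if (r == pid) return 0;                   /* child finished */
        if (r == -1 && errno != EINTR) return -1; /* real error */
        if (waited_ms >= timeout_ms) break;       /* give up waiting */
        nanosleep(&step, NULL);
        waited_ms += 10;
    }

    kill(pid, SIGKILL);                           /* child is stuck: stop it */
    while (waitpid(pid, status, 0) == -1 && errno == EINTR) {
        /* retry until the killed child is reaped */
    }
    return 1;
}

/* Stand-in for the detector: a child that sleeps, then exits cleanly. */
int run_fake_detector(unsigned child_sleep_s, int timeout_ms)
{
    int status = 0;
    pid_t pid = fork();
    if (pid == -1) return -1;
    if (pid == 0) {
        sleep(child_sleep_s);
        _exit(0);
    }
    return wait_with_timeout(pid, timeout_ms, &status);
}
```

With a helper like this, the parent could log "GPU detection timed out" and continue without GPUs instead of blocking forever.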

@etdr
Author

etdr commented Apr 7, 2023

OK, it's just hanging then, both when running as the boinc user and as root. I never get the coproc_info.xml file that @davidpanderson was looking for.

I have added the boinc user to the video group, but maybe it's a permissions issue separate from that? Not sure.

@BrianNixon
Contributor

BrianNixon commented Apr 7, 2023

Not sure what else to suggest. You could try the forum; maybe somebody there has some ideas.

During detection, warning messages are stored for reporting back to the client on completion. It would be useful if those also got written to the detector process’s stderr; then they’d be saved to stderrgpudetect.txt and we might get a better picture of how far it gets before hanging.

Also, in case anybody’s wondering: the message “GPU detection failed: process was terminated by signal 114” is bogus; clearly 114 is not a valid signal number. That’s just the random garbage returned by waitpid() after it failed with EINTR when the client main process got killed.
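If that reading is right, the usual guard is to decode the status only after waitpid() has actually succeeded, and to retry when it is interrupted. The sketch below is illustrative only and is not taken from the BOINC source; `waitpid_retry` and `child_termination_signal` are hypothetical names.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Retry waitpid() when a signal interrupts it, so `*status` is only
 * trusted after a successful return. */
pid_t waitpid_retry(pid_t pid, int *status)
{
    pid_t r;
    assert(status != NULL);
    do {
        r = waitpid(pid, status, 0);
    } while (r == -1 && errno == EINTR);
    return r;
}

/* Returns the terminating signal number if the child was killed by a
 * signal, 0 if it ended any other way, and -1 if waitpid() failed
 * (in which case no status is decoded at all). */
int child_termination_signal(pid_t pid)
{
    int status = 0;
    if (waitpid_retry(pid, &status) != pid) return -1;
    if (WIFSIGNALED(status)) return WTERMSIG(status);
    return 0;
}
```

The point of the design is that WIFSIGNALED and WTERMSIG are never applied to a status value that waitpid() did not fill in.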

@AenBleidd
Member

AenBleidd commented Apr 7, 2023 via email

@davidpanderson
Contributor

boinc --detect_gpus shouldn't hang even if expected libraries are missing.

I'll implement Brian's idea of writing warnings to stderr; that may shed some light.
Otherwise maybe Eli can build a debug version from source and use gdb
(we can help with this if needed)

@BrianNixon
Contributor

warnings

And/or some tracepoints to track progress through the code. It’s not as though there’s any need to keep the noise down in that file…
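A tracepoint of this kind can be very small. The following is a sketch under assumptions, not the code in the BOINC branch: `trace_to`, the `TRACE` macro, and the `[gpu_detect]` prefix are all hypothetical. The important detail is the immediate flush, so the last line written before a hang is not lost in a buffer.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical tracepoint: writes "[gpu_detect] file:line: msg" to `out`
 * and flushes at once. Returns the number of characters written, or a
 * negative value on error. */
int trace_to(FILE *out, const char *file, int line, const char *msg)
{
    int n;
    assert(out != NULL && file != NULL && msg != NULL);
    n = fprintf(out, "[gpu_detect] %s:%d: %s\n", file, line, msg);
    fflush(out);
    return n;
}

/* Convenience wrapper that records the call site and writes to stderr. */
#define TRACE(msg) trace_to(stderr, __FILE__, __LINE__, (msg))
```

Usage would be a line such as `TRACE("before OpenCL detection");` placed ahead of each detection stage; the last tracepoint to appear in the stderr file then brackets where the process stopped.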

@davidpanderson
Contributor

I added the stderr writes in a 'dpa_gpu_detect' branch.
We can add more as needed.

@davidpanderson
Contributor

BTW it should take << 1 sec to complete

@etdr
Author

etdr commented Apr 8, 2023

@AenBleidd So I'm on Arch, which is probably not officially supported, but I have the equivalents of those libraries installed according to the Arch Wiki.

@davidpanderson I can try to clone that branch, build, and report back

@etdr
Author

etdr commented Apr 9, 2023

OK, so with that enabled (plus a few "test" fprintfs from me) here's what I get:

[screenshot]

So I guess the error happens sometime after libaticalrt.so isn't found, yes? Do I need this file with the current version of BOINC? I would rather use OpenCL, but I am probably misunderstanding something.

Update: it's something in the OpenCL detection. I'll keep digging.

Update: It's a problem with this line of code:

ciErrNum = (*p_clGetPlatformIDs)(MAX_OPENCL_PLATFORMS, platforms, &num_platforms);

Does this mean I need to get in touch with the OpenCL people? ChatGPT tells me this is calling the clGetPlatformIDs function from OpenCL.
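For context, the `(*p_clGetPlatformIDs)(...)` form is a call through a function pointer, which is how a program typically calls into a library it loads at run time. The sketch below shows that pattern under assumptions: it is not the BOINC source, `count_opencl_platforms` and `MAX_PLATFORMS` are hypothetical, and the OpenCL types are declared locally as stand-ins so no OpenCL headers are needed. Only `dlopen`, `dlsym`, `dlclose`, and the `clGetPlatformIDs` symbol name are real.

```c
#include <assert.h>
#include <dlfcn.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_PLATFORMS 16 /* illustrative cap, standing in for MAX_OPENCL_PLATFORMS */

/* Local stand-ins for the OpenCL types used by clGetPlatformIDs. */
typedef int32_t  cl_int;
typedef uint32_t cl_uint;
typedef struct _cl_platform_id *cl_platform_id;
typedef cl_int (*clGetPlatformIDs_fn)(cl_uint, cl_platform_id *, cl_uint *);

/* Loads `libname` at run time and asks it how many OpenCL platforms exist.
 * Returns the platform count, or -1 if the library or symbol cannot be
 * loaded or the call reports an error. */
int count_opencl_platforms(const char *libname)
{
    void *lib;
    clGetPlatformIDs_fn p_clGetPlatformIDs;
    int result = -1;

    assert(libname != NULL);

    lib = dlopen(libname, RTLD_NOW);
    if (lib == NULL) return -1;              /* library not installed */

    p_clGetPlatformIDs = (clGetPlatformIDs_fn)dlsym(lib, "clGetPlatformIDs");
    if (p_clGetPlatformIDs != NULL) {
        cl_platform_id platforms[MAX_PLATFORMS];
        cl_uint num_platforms = 0;
        /* The loader forwards this call to the installed vendor drivers. */
        cl_int err = (*p_clGetPlatformIDs)(MAX_PLATFORMS, platforms, &num_platforms);
        if (err == 0) result = (int)num_platforms; /* 0 is CL_SUCCESS */
    }

    dlclose(lib);
    return result;
}
```

On Linux the library name would normally be `libOpenCL.so.1`. Note that the call does not return until every installed vendor driver has answered, so a single misbehaving driver can block it, which matches the hang seen in this thread.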

@etdr
Author

etdr commented Apr 9, 2023

OK, well, by following the instructions in this thread (uninstalling some Intel stuff) I was able to get it to work. I guess the problem is with oneAPI. Sorry about all this, but thank you for the help!

@AenBleidd
Member

@etdr, thank you for testing this and sharing the solution.
Since this is not an issue on the BOINC side, I'm converting this issue to a discussion.

@BOINC BOINC locked and limited conversation to collaborators Apr 9, 2023
@AenBleidd AenBleidd converted this issue into discussion #5187 Apr 9, 2023
@AenBleidd AenBleidd moved this from Backlog to In Testing in BOINC Client/Manager May 14, 2023
@AenBleidd AenBleidd moved this from In Testing to Done in BOINC Client/Manager May 14, 2023
