Fix duplicate GPU problem #3200

Open
davidpanderson opened this issue Jun 28, 2019 · 5 comments

Comments

@davidpanderson
Contributor

The GPU detection logic sometimes decides that 1 GPU is actually 2. Apparently this can happen because of buggy drivers, or because there are multiple OpenCL "platforms" (e.g. POCL is present).

Note: gpu_opencl.cpp contains the comment
//TODO: Must we check if multiple platforms found the same GPU and merge the records?
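The merge that the TODO asks about could look roughly like the sketch below. This is not BOINC's code: `DeviceRecord`, `MergedDevice`, and `merge_duplicates` are hypothetical names, and the `bus_id` field assumes that some stable hardware identifier is available for each device, which is the very thing that may be missing in practice. Without such an identifier, vendor and name alone cannot tell two identical cards apart from one card reported twice.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <tuple>
#include <vector>

// Hypothetical, simplified record of one device as reported by one platform.
// BOINC's real structures differ.
struct DeviceRecord {
    std::string platform;  // OpenCL platform that enumerated the device
    std::string vendor;
    std::string name;
    std::string bus_id;    // ASSUMPTION: a stable hardware identifier exists
};

// One physical device plus every platform that reported it.
struct MergedDevice {
    std::string vendor;
    std::string name;
    std::string bus_id;
    std::vector<std::string> platforms;
};

// Collapse records that share (vendor, name, bus_id) into a single entry,
// preserving the order in which devices were first seen.
std::vector<MergedDevice> merge_duplicates(const std::vector<DeviceRecord>& records) {
    std::vector<MergedDevice> merged;
    std::map<std::tuple<std::string, std::string, std::string>, std::size_t> index;
    for (const DeviceRecord& r : records) {
        auto key = std::make_tuple(r.vendor, r.name, r.bus_id);
        auto it = index.find(key);
        if (it == index.end()) {
            // First sighting of this device: create a new merged entry.
            index.emplace(key, merged.size());
            merged.push_back({r.vendor, r.name, r.bus_id, {r.platform}});
        } else {
            // Same device seen through another platform: only note the platform.
            merged[it->second].platforms.push_back(r.platform);
        }
    }
    return merged;
}
```

Keeping the list of platforms on each merged entry means the client could schedule per physical device while still remembering through which platforms the device is reachable.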

@bema-aei
Contributor

bema-aei commented Jul 1, 2019

As far as I understand, a "buggy driver" (as reported on the boinc_alpha mailing list) also creates a new OpenCL platform with the same device(s), so both cases are actually the same issue.

Note that (so far) this problem is limited to OpenCL on NVidia.

Einstein@Home asked The Khronos Group to add a property to the device record that would allow a device to be uniquely identified regardless of the interface (OpenCL platforms or even CUDA). I remember that this request was originally ignored or declined, but I don't know what the current status is. @brevilo might know more.

Finally, merging the device records from different platforms would help with device scheduling on the client side, but wouldn't solve the problem that a possibly unsuitable platform is passed to the application. This is complicated by the fact that different applications may have different requirements regarding the platform. IMHO we need either

  • a way for the app to tell the client which platform to use (or not to use), or

  • a way to pass the app a list of platforms and a list of devices (one for each platform), so that the app can pick a platform according to its own criteria and then knows which device to use.
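The second option might be sketched as follows, purely as an illustration. `PlatformOffer` and `pick_platform` are hypothetical names, not part of any existing BOINC interface, and the app's "own criteria" are modelled here as a simple predicate.

```cpp
#include <functional>
#include <string>
#include <vector>

// Hypothetical description of one platform handed to the app by the client,
// together with the device indices that are valid on that platform.
struct PlatformOffer {
    std::string platform_name;
    std::vector<int> device_indices;
};

// Return the first offered platform that has at least one device and that the
// app's own predicate accepts, or nullptr if none is suitable.
const PlatformOffer* pick_platform(
        const std::vector<PlatformOffer>& offers,
        const std::function<bool(const PlatformOffer&)>& acceptable) {
    for (const PlatformOffer& offer : offers) {
        if (!offer.device_indices.empty() && acceptable(offer)) {
            return &offer;
        }
    }
    return nullptr;
}
```

An app that cannot run on a particular platform would pass a predicate that rejects that platform's name, then use `device_indices` from the returned offer to address its device. The returned pointer refers into `offers`, so it is only valid while that vector is alive and unmodified.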

@RichardHaselgrove
Contributor

@bema-aei: I'm sorry, but I don't think that's a complete analysis of the problems we're seeing. Over the first 10 years of GPU computing, we've become accustomed to referring to "the driver" as a single entity, most commonly provided by the hardware manufacturer. But in reality, the downloaded driver file is a multi-component delivery system, and as with all software installation packages, it is responsible for both installing new components, and uninstalling old components.

OpenCL is just one of the installed/uninstalled packages, and it probably wasn't originally created by the hardware manufacturer: OpenCL is supposed to be a cross-platform language, after all.

In recent years, there has been a move away from hardware manufacturers supplying driver packages direct to end users. I ran a controlled experiment on a Windows 10 machine some time ago, with the result that:

  • With an NVidia GPU installed, Microsoft supplied a driver with CUDA included, but without OpenCL
  • With an Intel iGPU active, Microsoft supplied an Intel driver with OpenCL capability
  • Reverting to the NVidia GPU, OpenCL computing was possible using the Intel OpenCL stack supplied by Microsoft

all without any direct intervention by either hardware provider.

So, in these cases, I don't think that 'buggy driver' is quite the right description: I'd perhaps call it 'poor shared component management', and we ought to be able to detect and mitigate that.

There's a helpful and informative copy of BOINC's own coproc detection output online at http://stateson.net/images/coproc_info_10_nfg.xml: this comes from a machine with 5 AMD devices (to complete the hardware set). BOINC has detected them as opencl_device_index 0 thru 4, and 0 thru 4 again. But BOINC has identified them as device_num 0 thru 9. Context and discussion at SETI message 1989298: a similar analysis was performed by @JuhaSointusalo in BOINC message 90061. In the second case (but not the first), there is evidence that two different opencl_driver_versions were installed.

I think the issues reported by Jacob Klein on the alpha mailing list are more properly described as 'buggy' (whether drivers or deployments, we wait to see), associated with the Windows insider builds he was testing.

@smoe
Contributor

smoe commented Jul 2, 2019

When adding an AMD RX 580 to an Nvidia GTX 1660 under Windows 10, the Nvidia OpenCL is disabled. Instead, from the project (Einstein for me), the ATI OpenCL workunits are retrieved while the Nvidia OpenCL platform is no longer found. It is possible to run NVidia CUDA and ATI OpenCL jobs in parallel, though, as with SETI. I admit I don't know whether this is the expected behaviour. Ping me if you want me to run anything on that machine.

Another observation of mine, under Ubuntu 18.04 on a former GPU mining rig, is that cards on the single PCI3x16 port and cards on the many PCI2x1 ports are apparently treated differently. Having only one card in the PCI3 port works fine. Adding one card to a PCI2 port still shows only a single card in BOINC, even though there are two in the system. Adding two cards to PCI2 ports shows two cards, even though there are now three in total. Anyone else with similar observations?

EDIT: I have revisited that machine a bit more systematically and identified a non-functional USB riser. That explains the "missing card" phenotype.

@RichardHaselgrove
Contributor

Sticking to the Windows 10 machine for the time being, and depending how deep you're prepared to dig, I'd be interested in seeing:

  1. The startup lines from BOINC's Event Log, showing the outcomes of GPU detection
  2. The coproc_info.xml file from BOINC's data directory
  3. The output from Oblomov's CLinfo (download from https://github.com/Oblomov/clinfo, at bottom: run at command prompt)

@bema-aei
Contributor

bema-aei commented Jul 7, 2019

@RichardHaselgrove Actually I didn't mean my analysis to be complete.

To narrow down this problem: I would guess that all the "duplicate device" problems arise from the OpenCL device handling. Is there any case of "double device" which involves only CUDA or ATI (CAL, Stream?) Apps?

@AenBleidd AenBleidd added this to Backlog in Client and Manager via automation Oct 31, 2019
@AenBleidd AenBleidd added this to To do in BOINC Client/Manager via automation Oct 31, 2019
@AenBleidd AenBleidd added this to the Client/Manager 8.0 milestone Oct 31, 2019
@AenBleidd AenBleidd removed this from Backlog in Client and Manager Aug 14, 2023