CUDA version number truncated #3893

Closed
Ageless93 opened this issue Jul 9, 2020 · 20 comments · Fixed by #3897

Comments

@Ageless93
Contributor

As reported via the BOINC forums, it looks as if the client (7.16.6, Linux version) is truncating the CUDA driver version:

Starting BOINC client version 7.16.6 for x86_64-pc-linux-gnu
CUDA: NVIDIA GPU 0: GeForce RTX 2070 SUPER (driver version 440.10, CUDA version 10.2, compute capability 7.5, 4096MB, 3968MB available, 9216 GFLOPS peak)
CUDA: NVIDIA GPU 1 (not used): GeForce GTX 1060 6GB (driver version 440.10, CUDA version 10.2, compute capability 6.1, 4096MB, 3974MB available, 4568 GFLOPS peak)
OpenCL: NVIDIA GPU 0: GeForce RTX 2070 SUPER (driver version 440.100, device version OpenCL 1.2 CUDA, 7982MB, 3968MB available, 9216 GFLOPS peak)
OpenCL: NVIDIA GPU 1 (ignored by config): GeForce GTX 1060 6GB (driver version 440.100, device version OpenCL 1.2 CUDA, 6075MB, 3974MB available, 4568 GFLOPS peak)
OS: Linux Fedora: Fedora 32 (Workstation Edition) [5.7.6-201.fc32.x86_64|libc 2.31 (GNU libc)]

440.10 is a different driver version than 440.100.
According to an earlier thread, this has been happening for a while: there, BOINC 7.9.3 turned 390.132 into 390.13.
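
The truncated values in these logs are consistent with the dotted version string being read as a decimal number and squeezed into a two-digit minor field. The sketch below only illustrates that failure mode under that assumption; `encode_as_decimal` is a hypothetical helper, not code from the BOINC client.

```python
def encode_as_decimal(version: str) -> int:
    """Hypothetical lossy encoding: read 'MMM.mmm' as a decimal number
    and keep exactly two digits after the first dot (MMMmm)."""
    major, _, rest = version.partition(".")
    # Keep only the component right after the first dot, pad it on the
    # right as a decimal fraction would be, then cut it to two digits.
    minor = (rest.split(".")[0] + "00")[:2]
    return int(major) * 100 + int(minor)


for v in ("440.100", "390.132", "440.33.01"):
    print(v, "->", encode_as_decimal(v))
```

Under this assumption 440.100 and 440.10 become indistinguishable, which matches what the log shows.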

@RichardHaselgrove
Contributor

Confirmed.

Thu 09 Jul 2020 14:17:47 BST |  | Starting BOINC client version 7.17.0 for x86_64-pc-linux-gnu
Thu 09 Jul 2020 14:17:47 BST |  | CUDA: NVIDIA GPU 0: GeForce GTX 1660 SUPER (driver version 440.10, CUDA version 10.2, compute capability 7.5, 4096MB, 3974MB available, 5153 GFLOPS peak)
Thu 09 Jul 2020 14:17:47 BST |  | OpenCL: NVIDIA GPU 0: GeForce GTX 1660 SUPER (driver version 440.100, device version OpenCL 1.2 CUDA, 5943MB, 3974MB available, 5153 GFLOPS peak)
Thu 09 Jul 2020 14:40:12 BST | Asteroids@home | Sending scheduler request: To fetch work.
Thu 09 Jul 2020 14:40:12 BST | Asteroids@home | Requesting new tasks for NVIDIA GPU
Thu 09 Jul 2020 14:40:12 BST | Asteroids@home | [sched_op] NVIDIA GPU work request: 881.05 seconds; 1.00 devices
Thu 09 Jul 2020 14:40:13 BST | Asteroids@home | Scheduler request completed: got 0 new tasks
Thu 09 Jul 2020 14:40:13 BST | Asteroids@home | [sched_op] Server version 707
Thu 09 Jul 2020 14:40:13 BST | Asteroids@home | Message from server: NVIDIA GPU: Upgrade to the latest driver to process tasks using your computer's GPU

The problem, as confirmed here, lies in the server's comparison of the reported (truncated) version number with the minimum version required by the project.
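
A minimal sketch of why such a comparison rejects a driver that is in fact new enough. The minimum used here is made up for illustration; the real plan_class requirement at Asteroids@home is not visible in this thread.

```python
# Hypothetical project minimum, standing for driver 440.33.
HYPOTHETICAL_MIN_ENCODED = 44033
HYPOTHETICAL_MIN_PARTS = (440, 33)

reported_encoded = 44010      # truncated value: 440.100 reported as 440.10
actual_parts = (440, 100)     # driver actually installed

# Comparing the truncated integer fails ...
encoded_ok = reported_encoded >= HYPOTHETICAL_MIN_ENCODED
# ... while comparing the real components succeeds.
actual_ok = actual_parts >= HYPOTHETICAL_MIN_PARTS

print(encoded_ok, actual_ok)  # False True
```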

@AenBleidd
Member

Unfortunately, it's not so easy to fix:
Two of the three platforms report the driver version as an integer. The third reports it as an integer represented as a string.
So there is basically no information on how to interpret this integer as a floating-point number.
This requires additional investigation.

@RichardHaselgrove
Contributor

Yes, I see (from sched_request...) that I'm now reporting

<drvVersion>44010</drvVersion> [cuda]
<opencl_driver_version>440.100</opencl_driver_version>

Surely we should have designed-in some form of consistency?

@AenBleidd
Member

Hm, I believe I found the reason. I can make a fix to set the proper version in the request, but the log will still show an incorrect version (two digits after the point instead of three).
@RichardHaselgrove, as far as I understand, you can reproduce this, right? If so, I can try to prepare a fix and ask you to verify it, because I don't have a Linux machine with an NVIDIA GPU.

@AenBleidd AenBleidd self-assigned this Jul 9, 2020
@RichardHaselgrove
Contributor

Yes, I have the hardware, and I'm currently running v7.17.0 from @LocutusOfBorg PPA. I also have the ability to compile (client-only) from source.

The mis-matched driver versions are only causing a problem at Asteroids. My other GPU project - GPUGrid - doesn't have a problem with work fetch, I'm assuming because it doesn't have such a stringent minimum driver version number test to satisfy. But we can't easily see what the plan_class requirements are, without the assistance of a project administrator.

@Ageless93
Contributor Author

Ageless93 commented Jul 9, 2020 via email

@AenBleidd
Member

@Ageless93, if the two differ, which one should we believe?

@Ageless93
Contributor Author

@AenBleidd, I think the higher number, especially since CUDA truncation has happened before. But the sanity check can go either way.
As soon as either CUDA or OpenCL is lower than the other, use the higher number.
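
The "use the higher number" rule could be sketched as follows, assuming both versions are available as dotted strings with purely numeric parts. `parse` and `pick_driver_version` are hypothetical names, not BOINC functions.

```python
from typing import Tuple


def parse(version: str) -> Tuple[int, ...]:
    """Split a dotted version string into a tuple of integers."""
    return tuple(int(part) for part in version.split("."))


def pick_driver_version(cuda: str, opencl: str) -> str:
    """Return whichever of the two reported versions is higher."""
    return max(cuda, opencl, key=parse)


print(pick_driver_version("440.10", "440.100"))    # 440.100
print(pick_driver_version("440.33", "440.33.01"))  # 440.33.01
```

Tuples compare element by element, so (440, 100) ranks above (440, 10), and a longer tuple ranks above its own prefix.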

@davidpanderson
Contributor

The basic problem is that BOINC encodes the driver version into an int of the form MMmm:
https://boinc.berkeley.edu/trac/wiki/AppPlanSpec#GPUapps
i.e. it assumes that the minor version # is <= 99.

The solution is to store the version as a string everywhere.
This will require server changes too.

We should do this for all version numbers, not just video drivers.
We should make no assumptions about # of digits,
or how many parts the version# has.
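
A rough sketch of what a string-based comparison with no assumptions about digit count or number of parts might look like. It assumes every part is numeric and treats missing trailing parts as zero; `compare_versions` is a hypothetical helper, not an existing BOINC function.

```python
from itertools import zip_longest


def compare_versions(a: str, b: str) -> int:
    """Return -1, 0 or 1 if version a is lower than, equal to,
    or higher than version b."""
    parts_a = [int(x) for x in a.split(".")]
    parts_b = [int(x) for x in b.split(".")]
    # Pad the shorter version with zeros so "440.33" == "440.33.0".
    for x, y in zip_longest(parts_a, parts_b, fillvalue=0):
        if x != y:
            return -1 if x < y else 1
    return 0


print(compare_versions("440.100", "440.33"))    # 1
print(compare_versions("440.33", "440.33.01"))  # -1
print(compare_versions("440.33", "440.33.0"))   # 0
```

Whether a missing part should count as zero is a design choice; the thread does not settle it.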

@RichardHaselgrove
Contributor

The client handles it properly when reporting its own version number to a server: separate fields for major, minor, revision.

We could use that, or we could use a single string with dividers between the fields. But there's far too much scope for errors over time with assumed fixed-width numeric fields.

https://boinc.berkeley.edu/trac/wiki/AppPlanSpec#GPUapps is not as you say: we already have
AMD driver versions are represented as MMmmRRRR
NVIDIA driver versions are represented as MMMmm
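
To make the two fixed-width layouts concrete, here is a small sketch that decodes them and shows what goes wrong when a three-digit minor is forced into MMMmm. The function names and the AMD sample value are made up for illustration.

```python
from typing import Tuple


def encode_nvidia(major: int, minor: int) -> int:
    # MMMmm: only two decimal digits are reserved for the minor part.
    return major * 100 + minor


def decode_nvidia(encoded: int) -> Tuple[int, int]:
    # Split MMMmm into (major, minor).
    return divmod(encoded, 100)


def decode_amd(encoded: int) -> Tuple[int, int, int]:
    # Split MMmmRRRR into (major, minor, revision).
    major_minor, revision = divmod(encoded, 10000)
    major, minor = divmod(major_minor, 100)
    return major, minor, revision


print(decode_nvidia(44010))                    # (440, 10)
print(decode_nvidia(encode_nvidia(440, 100)))  # (441, 0): minor overflows into major
print(decode_amd(1041741))                     # (1, 4, 1741)
```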

@RichardHaselgrove
Contributor

And although not documented, I'm also already reporting

<opencl_driver_version>3.0.1.10878</opencl_driver_version> [OpenCL on intel CPU]
<opencl_driver_version>10.18.10.3621</opencl_driver_version> [OpenCL on intel_gpu]

@RichardHaselgrove
Contributor

Possible suggestion for a temporary workaround, while we design a comprehensive solution:

Cap the client reporting of the CUDA minor version at 99

That would prevent a future project requirement of a minor version with three digits (but that's simply an incentive to finish doing the proper job in a timely fashion). Reporting the current NVidia driver 440.100 as 44099 in RPCs would at least allow Asteroids@Home to issue work again.
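
The proposed cap could look roughly like this. It is a sketch of the idea only, assuming the driver version is available as a dotted string; `encode_nvidia_capped` is a hypothetical name and this is not the code from the eventual patch.

```python
def encode_nvidia_capped(version: str) -> int:
    """Encode 'MMM.mmm[.rr]' as MMMmm, capping the minor part at 99."""
    parts = version.split(".")
    major = int(parts[0])
    minor = int(parts[1]) if len(parts) > 1 else 0
    # Any minor above 99 is reported as 99 so it cannot overflow or be
    # cut down to a lower-looking value; components after the minor are dropped.
    return major * 100 + min(minor, 99)


print(encode_nvidia_capped("440.100"))    # 44099
print(encode_nvidia_capped("440.33.01"))  # 44033
```

The cap loses information (440.100 and 440.132 both become 44099), which is why it is only suggested as a stopgap.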

@AenBleidd AenBleidd linked a pull request Jul 10, 2020 that will close this issue
AenBleidd added a commit to AenBleidd/boinc that referenced this issue Jul 10, 2020
Change minor version to 99 if actual minor version is > 99

This fixes BOINC#3893

Signed-off-by: Vitalii Koshura <lestat.de.lionkur@gmail.com>
@AenBleidd AenBleidd added this to To do in BOINC Client/Manager via automation Jul 10, 2020
@AenBleidd AenBleidd added this to the Client/Manager 8.0 milestone Jul 10, 2020
@AenBleidd
Member

@RichardHaselgrove, @davidpanderson, the fix is ready for review and testing.

@cwallbaum

Hi,

I just want to let you know that version numbers of NVIDIA Linux drivers can also have a "release version" component. I stumbled upon this recently when I had to configure the correct drivers for a TensorFlow project. Here's sample output from one of these boxes:

walli@p206-6:~$ nvidia-smi 
Fri Jul 10 12:09:48 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.33.01    Driver Version: 440.33.01    CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 1080    On   | 00000000:01:00.0 Off |                  N/A |
| 79%   87C    P2   141W / 180W |    193MiB /  8119MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+

Please also see the "CUDA Toolkit and Compatible Driver Versions" table here: https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html#cuda-major-component-versions (you have to scroll down a little bit because this table has no html anchor).

@AenBleidd
Member

@cwallbaum, could you please verify how BOINC reports this version on this particular machine?

@cwallbaum

It's not the latest, but at least not a very old build from Gianfranco's PPA:

Starting BOINC client version 7.16.6 for x86_64-pc-linux-gnu	
log flags: file_xfer, sched_ops, task	
Libraries: libcurl/7.58.0 OpenSSL/1.1.1 zlib/1.2.11 libidn2/2.0.4 libpsl/0.19.1 (+libidn2/2.0.4) nghttp2/1.30.0 librtmp/2.3	
Data directory: /var/lib/boinc-client	
CUDA: NVIDIA GPU 0: GeForce GTX 1080 (driver version 440.33, CUDA version 10.2, compute capability 6.1, 4096MB, 3968MB available, 8876 GFLOPS peak)	
OpenCL: NVIDIA GPU 0: GeForce GTX 1080 (driver version 440.33.01, device version OpenCL 1.2 CUDA, 8120MB, 3968MB available, 8876 GFLOPS peak)	
libc: Ubuntu GLIBC 2.27-3ubuntu1 version 2.27	
Host name: p206-6	
Processor: 12 GenuineIntel Intel(R) Xeon(R) E-2186G CPU @ 3.80GHz [Family 6 Model 158 Stepping 10]	
Processor features: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperf	
OS: Linux LinuxMint: Linux Mint 19.3 Tricia [5.3.0-26-generic|libc 2.27 (Ubuntu GLIBC 2.27-3ubuntu1)]	
...

@AenBleidd
Member

I believe that's fine. @cwallbaum, thanks for the quick test.

@AenBleidd AenBleidd removed this from To do in BOINC Client/Manager Jul 10, 2020
@AenBleidd AenBleidd added this to Backlog in Client Release 7.16 via automation Jul 10, 2020
@AenBleidd AenBleidd moved this from Backlog to Development in Client Release 7.16 Jul 10, 2020
@AenBleidd AenBleidd modified the milestones: Client/Manager 8.0, Client Release 7.16.8 Jul 10, 2020
Client Release 7.16 automation moved this from Development to Done Jul 10, 2020
@Ageless93
Contributor Author

The thing that strikes me though is that here too OpenCL reports the whole driver version, 440.33.01 vs 440.33 for CUDA.
And so I wonder if it isn't easier to use the driver version OpenCL reports, rather than the version CUDA reports. They'll always both be the same as they come in the same package, but the OpenCL one is more complete.

@AenBleidd
Member

@Ageless93, it is possible that the OpenCL driver is not installed on the system.

@AenBleidd AenBleidd removed this from Done in Client Release 7.16 Jul 16, 2020
@AenBleidd AenBleidd modified the milestones: Client Release 7.18.0, Client Release 7.16.11 Sep 8, 2020
@AenBleidd AenBleidd added this to Backlog in Client Release 7.16 via automation Sep 8, 2020
@AenBleidd AenBleidd moved this from Backlog to Done in Client Release 7.16 Sep 8, 2020