CUDA version number truncated #3893

Ageless93 · 2020-07-09T12:29:59Z

As reported via the BOINC forums, it looks as if the client (7.16.6, Linux version) is truncating the CUDA driver version:

Starting BOINC client version 7.16.6 for x86_64-pc-linux-gnu
CUDA: NVIDIA GPU 0: GeForce RTX 2070 SUPER (driver version 440.10, CUDA version 10.2, compute capability 7.5, 4096MB, 3968MB available, 9216 GFLOPS peak)
CUDA: NVIDIA GPU 1 (not used): GeForce GTX 1060 6GB (driver version 440.10, CUDA version 10.2, compute capability 6.1, 4096MB, 3974MB available, 4568 GFLOPS peak)
OpenCL: NVIDIA GPU 0: GeForce RTX 2070 SUPER (driver version 440.100, device version OpenCL 1.2 CUDA, 7982MB, 3968MB available, 9216 GFLOPS peak)
OpenCL: NVIDIA GPU 1 (ignored by config): GeForce GTX 1060 6GB (driver version 440.100, device version OpenCL 1.2 CUDA, 6075MB, 3974MB available, 4568 GFLOPS peak)
OS: Linux Fedora: Fedora 32 (Workstation Edition) [5.7.6-201.fc32.x86_64|libc 2.31 (GNU libc)]

440.10 is a different driver version than 440.100.
According to an earlier thread, it's been doing this for a while: there, BOINC 7.9.3 turned 390.132 into 390.13.

RichardHaselgrove · 2020-07-09T13:58:35Z

Confirmed.

Thu 09 Jul 2020 14:17:47 BST |  | Starting BOINC client version 7.17.0 for x86_64-pc-linux-gnu
Thu 09 Jul 2020 14:17:47 BST |  | CUDA: NVIDIA GPU 0: GeForce GTX 1660 SUPER (driver version 440.10, CUDA version 10.2, compute capability 7.5, 4096MB, 3974MB available, 5153 GFLOPS peak)
Thu 09 Jul 2020 14:17:47 BST |  | OpenCL: NVIDIA GPU 0: GeForce GTX 1660 SUPER (driver version 440.100, device version OpenCL 1.2 CUDA, 5943MB, 3974MB available, 5153 GFLOPS peak)
Thu 09 Jul 2020 14:40:12 BST | Asteroids@home | Sending scheduler request: To fetch work.
Thu 09 Jul 2020 14:40:12 BST | Asteroids@home | Requesting new tasks for NVIDIA GPU
Thu 09 Jul 2020 14:40:12 BST | Asteroids@home | [sched_op] NVIDIA GPU work request: 881.05 seconds; 1.00 devices
Thu 09 Jul 2020 14:40:13 BST | Asteroids@home | Scheduler request completed: got 0 new tasks
Thu 09 Jul 2020 14:40:13 BST | Asteroids@home | [sched_op] Server version 707
Thu 09 Jul 2020 14:40:13 BST | Asteroids@home | Message from server: NVIDIA GPU: Upgrade to the latest driver to process tasks using your computer's GPU

The problem, as confirmed here, lies in the server's comparison of the reported (truncated) version number with the minimum version required by the project.

AenBleidd · 2020-07-09T14:49:09Z

Unfortunately, it's not so easy to fix:
Two of the three platforms report the driver version as an integer. The third platform reports it as an integer represented as a string.
So there is basically no information on how to interpret this integer as a floating-point number.
This requires additional investigation.

RichardHaselgrove · 2020-07-09T15:18:53Z

Yes, I see (from sched_request...) that I'm now reporting

<drvVersion>44010</drvVersion> [cuda]
<opencl_driver_version>440.100</opencl_driver_version>

Surely we should have designed-in some form of consistency?

AenBleidd · 2020-07-09T15:24:47Z

Hm, I believe I found the reason. I can make a fix to set the proper value in the request, but the log will still show an incorrect version (two digits after the point instead of three).
@RichardHaselgrove, as far as I understand, you can reproduce this, right? If so, I can try to prepare a fix and ask you to verify it, because I don't have a Linux machine with an NVIDIA GPU.

RichardHaselgrove · 2020-07-09T15:42:47Z

Yes, I have the hardware, and I'm currently running v7.17.0 from @LocutusOfBorg PPA. I also have the ability to compile (client-only) from source.

The mis-matched driver versions are only causing a problem at Asteroids. My other GPU project - GPUGrid - doesn't have a problem with work fetch, I'm assuming because it doesn't have such a stringent minimum driver version number test to satisfy. But we can't easily see what the plan_class requirements are, without the assistance of a project administrator.

Ageless93 · 2020-07-09T18:56:42Z

I wonder: since we have the driver version from both CUDA and OpenCL, and both will be the same because you cannot have two different driver versions on your system, can't we do a sanity check with that and, if the CUDA one turns out to be truncated again, use the OpenCL version number?

AenBleidd · 2020-07-09T19:42:21Z

@Ageless93, if they differ, which one should we believe?

Ageless93 · 2020-07-09T19:55:11Z

@AenBleidd, I think the higher number, especially since CUDA truncation has happened before. But the sanity check can go either way.
As soon as either CUDA or OpenCL is lower than the other, use the higher number.
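
A minimal Python sketch of that "use the higher number" check. The function names are made up for illustration and this is not BOINC code; it assumes both values arrive as dotted, purely numeric strings and that either one may be missing.

```python
from typing import Optional, Tuple


def version_key(version: str) -> Tuple[int, ...]:
    """Turn a dotted version string into a tuple of ints for ordering."""
    return tuple(int(part) for part in version.split("."))


def pick_driver_version(cuda: Optional[str], opencl: Optional[str]) -> Optional[str]:
    """Return the higher of the two reported versions.

    A missing value (None or empty string) is ignored; if both are
    missing, None is returned.
    """
    candidates = [v for v in (cuda, opencl) if v]
    if not candidates:
        return None
    return max(candidates, key=version_key)


# Values taken from the logs in this thread.
print(pick_driver_version("440.10", "440.100"))
print(pick_driver_version("440.33", "440.33.01"))
```

With the values from the logs, the OpenCL string wins in both cases, because 100 > 10 and because a longer tuple with an equal prefix sorts higher.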

davidpanderson · 2020-07-09T21:29:31Z

The basic problem is that BOINC encodes the driver version into an int of the form MMmm:
https://boinc.berkeley.edu/trac/wiki/AppPlanSpec#GPUapps
i.e. it assumes that the minor version # is <= 99.

The solution is to store the version as a string everywhere.
This will require server changes too.

We should do this for all version numbers, not just video drivers.
We should make no assumptions about # of digits,
or how many parts the version# has.
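
As a rough sketch of what a string-based comparison could look like (Python, hypothetical helper names, not the actual client or scheduler code), assuming every part of the version is a plain decimal number separated by dots:

```python
from typing import Tuple


def version_key(version: str) -> Tuple[int, ...]:
    """Split a dotted version string into a tuple of ints.

    Makes no assumption about the number of parts or digits per part,
    but does assume each part is purely numeric.
    """
    return tuple(int(part) for part in version.split("."))


def at_least(reported: str, minimum: str) -> bool:
    """True if the reported version is not older than the minimum."""
    return version_key(reported) >= version_key(minimum)


# Two-part, three-part and four-part versions all compare part by part.
print(at_least("440.100", "440.33"))
print(at_least("440.10", "440.33"))
print(at_least("10.18.10.3621", "10.18.10.3000"))
```

Comparing tuples of integers rather than raw strings avoids the lexicographic trap where "440.100" would sort before "440.33".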

RichardHaselgrove · 2020-07-09T21:54:15Z

The client handles it properly when reporting its own version number to a server: separate fields for major, minor, revision.

We could use that, or we could use a single string with dividers between the fields. But there's far too much scope for errors over time with assumed fixed-width numeric fields.

https://boinc.berkeley.edu/trac/wiki/AppPlanSpec#GPUapps is not as you say: we already have
AMD driver versions are represented as MMmmRRRR
NVIDIA driver versions are represented as MMMmm
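
To make the limit of the MMMmm form concrete, here is a small Python sketch. The function names and the minimum value are invented for illustration; only the numbers 440.100 and 44010 come from this thread.

```python
from typing import Tuple


def pack_nvidia(major: int, minor: int) -> int:
    """Pack a version as MMMmm: two decimal digits reserved for the minor part."""
    return major * 100 + minor


def unpack_nvidia(packed: int) -> Tuple[int, int]:
    """Inverse of pack_nvidia; only meaningful if the minor part was <= 99."""
    return divmod(packed, 100)


# A two-digit minor version round-trips cleanly.
assert unpack_nvidia(pack_nvidia(440, 33)) == (440, 33)

# A three-digit minor version spills into the major field.
assert pack_nvidia(440, 100) == 44100
assert unpack_nvidia(44100) == (441, 0)

# The value seen in sched_request (44010) reads back as 440.10, which is
# below a hypothetical project minimum of 440.33 even though the real
# driver, 440.100, is newer.
HYPOTHETICAL_MIN = pack_nvidia(440, 33)
assert unpack_nvidia(44010) == (440, 10)
assert 44010 < HYPOTHETICAL_MIN
```

Either way, a fixed two-digit minor field has no correct slot for 440.100.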

RichardHaselgrove · 2020-07-09T22:01:21Z

And although not documented, I'm also already reporting

<opencl_driver_version>3.0.1.10878</opencl_driver_version> [OpenCL on intel CPU]
<opencl_driver_version>10.18.10.3621</opencl_driver_version> [OpenCL on intel_gpu]

RichardHaselgrove · 2020-07-09T22:37:42Z

Possible suggestion for a temporary workaround, while we design a comprehensive solution:

Cap the client reporting of the CUDA minor version at 99

That would prevent a future project requirement of a minor version with three digits (but that's simply an incentive to finish doing the proper job in a timely fashion). Reporting the current NVidia driver 440.100 as 44099 in RPCs would at least allow Asteroids@Home to issue work again.
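
In sketch form (Python, hypothetical function name, not the actual patch), the cap would amount to something like:

```python
def pack_nvidia_capped(major: int, minor: int) -> int:
    """Pack a version as MMMmm, clamping the minor part to 99 so it fits."""
    return major * 100 + min(minor, 99)


# 440.100 is reported as 44099 instead of a value that reads as 440.10.
print(pack_nvidia_capped(440, 100))
# Versions that already fit are unchanged.
print(pack_nvidia_capped(440, 33))
```

The trade-off is that every minor version of 99 or above maps to the same value, so 440.99 and 440.100 can no longer be told apart in the packed form.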

AenBleidd · 2020-07-10T00:52:46Z

@RichardHaselgrove, @davidpanderson, fix is ready for review and testing

cwallbaum · 2020-07-10T10:22:47Z

Hi,

I just want to let you know that version numbers of NVIDIA Linux drivers can also have a third "release version" component. I stumbled upon this lately when I had to configure the correct drivers for a TensorFlow project. Here's sample output from one of these boxes:

walli@p206-6:~$ nvidia-smi 
Fri Jul 10 12:09:48 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.33.01    Driver Version: 440.33.01    CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 1080    On   | 00000000:01:00.0 Off |                  N/A |
| 79%   87C    P2   141W / 180W |    193MiB /  8119MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+

Please also see the "CUDA Toolkit and Compatible Driver Versions" table here: https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html#cuda-major-component-versions (you have to scroll down a little bit because this table has no html anchor).

AenBleidd · 2020-07-10T13:40:05Z

https://www.youtube.com/watch?v=_36yNWw_07g

AenBleidd · 2020-07-10T13:40:45Z

@cwallbaum, could you please verify how BOINC reports this version on this particular machine?

cwallbaum · 2020-07-10T14:04:37Z

It's not the latest, but at least not a very old build from Gianfranco's PPA:

Starting BOINC client version 7.16.6 for x86_64-pc-linux-gnu	
log flags: file_xfer, sched_ops, task	
Libraries: libcurl/7.58.0 OpenSSL/1.1.1 zlib/1.2.11 libidn2/2.0.4 libpsl/0.19.1 (+libidn2/2.0.4) nghttp2/1.30.0 librtmp/2.3	
Data directory: /var/lib/boinc-client	
CUDA: NVIDIA GPU 0: GeForce GTX 1080 (driver version 440.33, CUDA version 10.2, compute capability 6.1, 4096MB, 3968MB available, 8876 GFLOPS peak)	
OpenCL: NVIDIA GPU 0: GeForce GTX 1080 (driver version 440.33.01, device version OpenCL 1.2 CUDA, 8120MB, 3968MB available, 8876 GFLOPS peak)	
libc: Ubuntu GLIBC 2.27-3ubuntu1 version 2.27	
Host name: p206-6	
Processor: 12 GenuineIntel Intel(R) Xeon(R) E-2186G CPU @ 3.80GHz [Family 6 Model 158 Stepping 10]	
Processor features: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperf	
OS: Linux LinuxMint: Linux Mint 19.3 Tricia [5.3.0-26-generic|libc 2.27 (Ubuntu GLIBC 2.27-3ubuntu1)]	
...

AenBleidd · 2020-07-10T14:08:04Z

I believe it's fine enough. @cwallbaum, thanks for the quick test.

Ageless93 · 2020-07-11T22:14:41Z

The thing that strikes me though is that here too OpenCL reports the whole driver version, 440.33.01 vs 440.33 for CUDA.
And so I wonder if it isn't easier to use the driver version OpenCL reports, rather than the version CUDA reports. They'll always both be the same as they come in the same package, but the OpenCL one is more complete.

AenBleidd · 2020-07-11T22:22:42Z

@Ageless93, it's possible that the OpenCL driver is not installed on the system.

AenBleidd self-assigned this Jul 9, 2020

AenBleidd linked a pull request Jul 10, 2020 that will close this issue

[linux][cuda] Fix CUDA version number truncated #3895

Closed

AenBleidd mentioned this issue Jul 10, 2020

[linux][cuda] Fix CUDA version number truncated #3895

Closed

AenBleidd added a commit to AenBleidd/boinc that referenced this issue Jul 10, 2020

[linux][cuda] Fix CUDA version number truncated

9ab7942

Change minor version to 99 if actual minor version is > 99 This fixes BOINC#3893 Signed-off-by: Vitalii Koshura <lestat.de.lionkur@gmail.com>

AenBleidd added C: Client - Daemon P: Major R: fixed T: Defect labels Jul 10, 2020

AenBleidd added this to To do in BOINC Client/Manager via automation Jul 10, 2020

AenBleidd added this to the Client/Manager 8.0 milestone Jul 10, 2020

AenBleidd mentioned this issue Jul 10, 2020

When parsing NVIDIA driver version, max minor version with 99. #3897

Merged

AenBleidd assigned davidpanderson Jul 10, 2020

AenBleidd removed this from To do in BOINC Client/Manager Jul 10, 2020

AenBleidd added this to Backlog in Client Release 7.16 via automation Jul 10, 2020

AenBleidd moved this from Backlog to Development in Client Release 7.16 Jul 10, 2020

AenBleidd modified the milestones: Client/Manager 8.0, Client Release 7.16.8 Jul 10, 2020

AenBleidd closed this as completed in #3897 Jul 10, 2020

Client Release 7.16 automation moved this from Development to Done Jul 10, 2020

AenBleidd removed this from Done in Client Release 7.16 Jul 16, 2020

AenBleidd modified the milestones: Client Release 7.18.0, Client Release 7.16.11 Sep 8, 2020

AenBleidd added this to Backlog in Client Release 7.16 via automation Sep 8, 2020

AenBleidd moved this from Backlog to Done in Client Release 7.16 Sep 8, 2020
