Blacklist bad versions of the cuda drivers by abergeron · Pull Request #352 · Theano/libgpuarray

abergeron · 2017-02-10T22:04:36Z

Version 375 of the cuda drivers has some problems with the JIT that lead to bad code.

Since this is basically impossible to detect otherwise, this adds a way to get the actual version of the cuda driver and blacklists bad versions. The blacklist will need to be updated once a fixed version is released.

The last known good version is 373.06 and there are no working versions above that as far as I know, which is why the current blacklist has an open range.
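To illustrate what an open-range blacklist means here, the following is a minimal sketch, not the code from this PR. It assumes the driver version is available as a "major.minor" string and compares the two fields as integers rather than as a floating-point number. The names `drv_version`, `parse_version` and `is_blacklisted` are made up for this example.

```c
#include <stdio.h>

/* Hypothetical helper type: a driver release split into its two fields. */
struct drv_version { int major; int minor; };

/* Last known good release according to the description above. */
const struct drv_version LAST_GOOD = { 373, 6 };

/* Parse a string such as "375.26". Returns 0 on success, -1 if the
 * string is not exactly "<number>.<number>". */
int parse_version(const char *s, struct drv_version *out) {
  int major, minor;
  char trailing;
  /* The extra %c only converts if something follows the minor field. */
  if (sscanf(s, "%d.%d%c", &major, &minor, &trailing) != 2) return -1;
  if (major < 0 || minor < 0) return -1;
  out->major = major;
  out->minor = minor;
  return 0;
}

/* Open-ended blacklist: every release newer than LAST_GOOD is rejected. */
int is_blacklisted(struct drv_version v) {
  if (v.major != LAST_GOOD.major) return v.major > LAST_GOOD.major;
  return v.minor > LAST_GOOD.minor;
}
```

Comparing integer fields avoids relying on how a value like 373.06 is represented as a double. Once a fixed driver ships, the open range would presumably become a closed one with an upper bound.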

abergeron · 2017-02-10T22:08:30Z

I should also mention that the Windows code is completely untested and that the dladdr code will not work on Mac (it should just trigger the warning about not being able to get the version).
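For readers unfamiliar with the dladdr approach: on Linux, `dladdr()` can report the path of the shared object that contains a given symbol, and the resolved file name of the driver library typically ends in the release number (for example `libcuda.so.375.26`). The sketch below covers only the parsing half and assumes that naming convention holds. `version_from_path` is a hypothetical helper, not a function from this PR.

```c
#include <stdio.h>
#include <string.h>

/*
 * Extract "major.minor" from a library path such as
 * "/usr/lib/libcuda.so.375.26". Returns 0 on success, -1 if the path
 * carries no such suffix (for example "libcuda.so.1"). The outputs are
 * left untouched on failure.
 */
int version_from_path(const char *path, int *major, int *minor) {
  const char *suffix = NULL;
  const char *p = path;
  int maj, min;

  /* Use the last ".so." in the path in case a directory name has one. */
  while ((p = strstr(p, ".so.")) != NULL) {
    suffix = p;
    p += 4;
  }
  if (suffix == NULL) return -1;
  if (sscanf(suffix + 4, "%d.%d", &maj, &min) != 2) return -1;
  *major = maj;
  *minor = min;
  return 0;
}
```

When the path does not follow this pattern, the caller can only fall back to the "could not determine cuda driver version" warning.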

Fixing those issues will come later, but in light of that and the numpy breakage for array, I am inclined to fix those issues and cut a new release (perhaps including the disk_cache changes, which could be made to expose the binary once again).

nouiz

From your previous answer on this issue, I was thinking you would go with a detection in Theano. Why did you do it this way? The problem is that we need to manually detect which drivers are bad and which are good. Doing a test at init time in Theano would make sure that we don't miss a broken driver.

Anyway, for now we can go with this.

nouiz · 2017-02-13T15:34:42Z

+    fprintf(stderr, "WARNING: could not determine cuda driver version.  Some versions return bad results, make sure your version is fine\n");
+
+  if (v > 373.06) {
+    fprintf(stderr, "ERROR: refusing to load cuda driver library because the version is blacklisted\n");


Give more information so the user knows how to fix the error. Can you add to the error message a few driver versions that are known to be good?

Or better, give the link to the issue. This will give them the up-to-date status. We don't want users who see this error to ask us questions; they should be able to work around the problem without contacting us.

I can give some pointers to versions that we know are not broken. This will have to be updated at some point when NVIDIA fixes the problem.

abergeron · 2017-02-13T16:33:26Z

We kind of have to manually detect anyway since a driver being broken can manifest in any number of ways and we can't test for everything.

In this specific instance we could do a test but unless it's rather involved, the test would be brittle (like for example the sum issue that was reported goes away if you use a debug build of libgpuarray, but there are still other broken things).

Also, I didn't want to leave other potential consumers of libgpuarray out to dry and make them have to figure out the problem on their own.

So in that sense a version check is the safest we can do for now, I think.

nouiz · 2017-02-13T16:37:53Z

Makes sense.

abergeron · 2017-02-13T16:47:52Z

I've added the maximum version that we know is ok and added a way to bypass the check without recompiling the library.
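The comment does not say how the bypass works. One common way to allow an override without recompiling is an environment variable, and the sketch below assumes that mechanism. The variable name `EXAMPLE_FORCE_DRIVER_LOAD` and the function `driver_load_allowed` are hypothetical, chosen only for this illustration. The message also names the last known good version, as requested in the review.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical names, used only in this sketch. */
#define BYPASS_ENV "EXAMPLE_FORCE_DRIVER_LOAD"
#define LAST_GOOD "373.06"

/*
 * Decide whether to load the driver. `blacklisted` is the result of the
 * version check and `version` is the detected version string.
 * Returns 1 if loading may proceed, 0 if it should be refused.
 */
int driver_load_allowed(int blacklisted, const char *version) {
  if (!blacklisted) return 1;
  if (getenv(BYPASS_ENV) != NULL) {
    fprintf(stderr,
            "WARNING: cuda driver %s is blacklisted, loading anyway "
            "because %s is set\n", version, BYPASS_ENV);
    return 1;
  }
  fprintf(stderr,
          "ERROR: refusing to load cuda driver %s: the last known good "
          "version is %s. Set %s to bypass this check.\n",
          version, LAST_GOOD, BYPASS_ENV);
  return 0;
}
```

Telling the user both the working version and the override in the message itself covers the concern that nobody should have to ask the maintainers how to get unstuck.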

notoraptor · 2017-02-13T17:55:58Z

-  if (v > 373.06) {
-    fprintf(stderr, "ERROR: refusing to load cuda driver library because the version is blacklisted\n");
-    return GA_LOAD_ERROR;
+  if (v > 373.06)


It seems an opening bracket { is missing at the end of this line!

notoraptor · 2017-02-13T19:09:36Z

Currently, it does not work on Windows. Version computed is 6.14, but my driver version is 378.49. I am looking for a solution, but it seems really weird.

If this may help, these are the details about the nvcuda.dll on my computer:

It seems the version number we want is at the end of the "file version" (6.14.13.7849: at the end we have 3.7849), but it would be a very strange way for NVIDIA to encode the version number...

I have also found this function: cudaDriverGetVersion(int* driverVersion): http://horacio9573.no-ip.org/cuda/group__CUDART____VERSION_g0e7ca3e5a5997d4eaef36ee22caddd01.html . Could it be a solution?

abergeron · 2017-02-13T19:15:23Z

@notoraptor cuDriverGetVersion() returns something like 7050 or 8000, which is useless.
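To spell out why that number does not help: the value encodes the CUDA API version as 1000 * major + 10 * minor, so 7050 means CUDA 7.5 and 8000 means CUDA 8.0. Several driver releases report the same value, which makes it too coarse for blacklisting. A small decoding sketch follows; `decode_cuda_version` is a name invented for this example.

```c
/* Split the integer reported by cuDriverGetVersion() into the CUDA
 * major and minor version it encodes (1000 * major + 10 * minor). */
void decode_cuda_version(int encoded, int *major, int *minor) {
  *major = encoded / 1000;
  *minor = (encoded % 1000) / 10;
}
```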

I guess we will have to get at the first string "File description" and parse it. Why is this so hard?

notoraptor · 2017-02-13T20:12:02Z

It seems the file version is related to some Windows specifications which forces the first digits to some certain values (about DirectX and other infos): https://msdn.microsoft.com/en-us/library/windows/hardware/ff570155.aspx

And on some forums people also noticed that the last digits of the file version seem to match the official NVIDIA driver version, e.g.: https://forums.geforce.com/default/topic/765903/off-topic/what-is-the-significance-of-nvidia-graphic-driver-naming-/post/4277236/#4277236

Now I'm just looking for an official NVIDIA page that confirms these details, but I can't find it for the moment ...
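The observation above can be written down as a small decoder. This is only a sketch of the heuristic discussed in this thread, which is not confirmed by NVIDIA: take the last digit of the third field and the four digits of the fourth field, then read the resulting five digits as "XXX.YY". `driver_from_file_version` is a hypothetical name.

```c
#include <stdio.h>

/*
 * Derive the driver release from a Windows file version string such as
 * "6.14.13.7849" (which would give 378.49). Returns 0 on success, -1 if
 * the string does not have four numeric fields in the expected ranges.
 */
int driver_from_file_version(const char *file_version, int *major, int *minor) {
  int a, b, c, d;
  long digits;

  if (sscanf(file_version, "%d.%d.%d.%d", &a, &b, &c, &d) != 4) return -1;
  if (c < 0 || d < 0 || d > 9999) return -1;

  /* Last digit of the third field followed by the fourth field,
   * e.g. 3 and 7849 give 37849. */
  digits = (long)(c % 10) * 10000 + d;
  *major = (int)(digits / 100);
  *minor = (int)(digits % 100);
  return 0;
}
```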

obilaniu · 2017-02-15T05:28:58Z

@notoraptor @abergeron @nouiz Assuming you actually need that particular number and not something more closely related to the JIT version, the proper way to get that 375.39 or whatever is NVML, the library that nvidia-smi calls into and which is the appropriate way of making programmatic nvidia-smi-like queries. That includes such goodies as PCI, fans, temperatures, LEDs and other lovely random stuff.

#include <nvml.h>

[...]

nvmlInit();

[...]

/**
 * Retrieves the version of the system's graphics driver.
 * 
 * For all products.
 *
 * The version identifier is an alphanumeric string.  It will not exceed 80 characters in length
 * (including the NULL terminator).  See \ref nvmlConstants::NVML_SYSTEM_DRIVER_VERSION_BUFFER_SIZE.
 *
 * @param version                              Reference in which to return the version identifier
 * @param length                               The maximum allowed length of the string returned in \a version
 *
 * @return 
 *         - \ref NVML_SUCCESS                 if \a version has been set
 *         - \ref NVML_ERROR_UNINITIALIZED     if the library has not been successfully initialized
 *         - \ref NVML_ERROR_INVALID_ARGUMENT  if \a version is NULL
 *         - \ref NVML_ERROR_INSUFFICIENT_SIZE if \a length is too small 
 */
nvmlReturn_t DECLDIR nvmlSystemGetDriverVersion(char *version, unsigned int length);

and one links using

-lnvidia-ml

nouiz · 2017-02-15T14:12:23Z

@abergeron knows this already. I think he didn't want to have one more library linked, as we need to dynamically link them. @abergeron, can you confirm? Do you know if NVML is available on the 3 OSes we support?

abergeron · 2017-02-15T15:35:20Z

Yes, I already know that, but I wanted to avoid an extra dependency. Also, NVML is hard to link with on Windows for some reason and unavailable on Mac.

Since we are not doing blacklisting on Mac anyway, we could eventually switch to using NVML if this sort of hijinks is required for some time.

obilaniu · 2017-02-15T18:08:35Z

On Linux, ldd `which nvidia-smi` tells me that nvidia-smi links against libdl, and strings `which nvidia-smi` | grep "\.so" reveals to me the presence of numerous paths containing the string libnvidia-ml.so.

I've no idea what the equivalents are on Windows, but you should be able to hunt it down too.

Apparently nvidia-smi doesn't exist on Mac OS X, and neither do the APIs it relies upon. Odd. Can you run strings libcuda.dylib | grep 375.26 or similar to find whether a string corresponding to the library version is embedded within the library somewhere? Perhaps an API listed by nm -D accesses that string.

abergeron · 2017-02-15T21:12:08Z

I don't really want to run strings from the C library. That would be even more fragile than what is currently there. Not to mention that the broken drivers are a range of versions, not a single one. Also, I don't want to build in a long list of possibly changing paths that is partially reverse-engineered in my code for now.

In any case none of the code is under a public interface, so I can change it later if it turns out to be required.

obilaniu · 2017-02-15T21:22:29Z

@abergeron Oh, I was most definitely not suggesting executing strings out of C code. We know well that the name of the library on Linux is libnvidia-ml.so, and it's installed in the PATH. We don't know the same for Windows yet.
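For reference, loading NVML at runtime instead of linking against it could look roughly like the sketch below on Linux. It is not part of this PR. It assumes the library can be found under the name `libnvidia-ml.so.1` or `libnvidia-ml.so`, and it treats `nvmlReturn_t` as a plain `int` with 0 meaning `NVML_SUCCESS` so that `nvml.h` is not needed. `query_driver_version` is a hypothetical helper.

```c
#include <dlfcn.h>
#include <stddef.h>

/* Signatures mirror the NVML declarations quoted earlier. Treating
 * nvmlReturn_t as int is an assumption of this sketch. */
typedef int (*nvml_simple_fn)(void);
typedef int (*nvml_version_fn)(char *version, unsigned int length);

/* Write the driver version string into buf. Returns 0 on success and
 * -1 if NVML cannot be loaded or any call fails. */
int query_driver_version(char *buf, unsigned int len) {
  int rc = -1;
  void *lib = dlopen("libnvidia-ml.so.1", RTLD_NOW | RTLD_LOCAL);
  if (lib == NULL) lib = dlopen("libnvidia-ml.so", RTLD_NOW | RTLD_LOCAL);
  if (lib == NULL) return -1;

  nvml_simple_fn init = (nvml_simple_fn)dlsym(lib, "nvmlInit");
  nvml_simple_fn fini = (nvml_simple_fn)dlsym(lib, "nvmlShutdown");
  nvml_version_fn get_version =
      (nvml_version_fn)dlsym(lib, "nvmlSystemGetDriverVersion");

  if (init != NULL && fini != NULL && get_version != NULL && init() == 0) {
    if (get_version(buf, len) == 0) rc = 0;
    fini();
  }
  dlclose(lib);
  return rc;
}
```

A caller would pass a buffer of at least 80 bytes, matching the limit in the header comment quoted above. On older glibc versions the program needs to be linked with -ldl. On a machine without the NVIDIA driver the function simply returns -1.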

abergeron added 2 commits February 10, 2017 16:50

Stop 'make rel' and 'make debug' always rebuilding everything.

af35721

Add a function to get the library version and use it to blacklist bad…

0abade9

… drivers.

abergeron mentioned this pull request Feb 10, 2017

Detect bad nvidia driver in the new back-end Theano/Theano#5530

Closed

nouiz reviewed Feb 13, 2017

Add a way to bypass the error and be explict about working versions.

a0d1035

notoraptor reviewed Feb 13, 2017

Fix code.

12ceff2

Fix the message.

64ad544

notoraptor and others added 2 commits February 13, 2017 18:33

Detect driver version on Windows. (#3)

85e2a50

Fixes for macOS (we don't blacklist there).

309e87b

abergeron merged commit 4da85f6 into Theano:master Feb 14, 2017

abergeron deleted the check_version branch March 20, 2017 22:25
