Skip to content

Blacklist bad versions of the cuda drivers#352

Merged
abergeron merged 7 commits intoTheano:masterfrom
abergeron:check_version
Feb 14, 2017
Merged

Blacklist bad versions of the cuda drivers#352
abergeron merged 7 commits intoTheano:masterfrom
abergeron:check_version

Conversation

@abergeron
Copy link
Copy Markdown
Member

The cuda drivers version 375 have some problems with the JIT that lead to bad code.

Since this is basically impossible to detect otherwise, this add a way to get the actual version of the cuda driver and blacklists bad versions. The blacklist will need to be updated once a fixed version is released.

The last known good version is 373.06 and there are no working versions above that as far as I know, which is why the current blacklist has an open range.

@abergeron
Copy link
Copy Markdown
Member Author

I should also mention that the windows code is completely untested and that the dladdr code will not work on mac (it should just trigger the warning about not being able to get the version).

Fixing those issues will come later, but in light of that and the the numpy breakage for array, I am inclined to fix those issues and cut a new release (perhaps including the disk_cache changes, which could be made to expose the binary once again).

Copy link
Copy Markdown
Member

@nouiz nouiz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From your previous answer on this issue, I was thinking you would go with a detection in Theano. Why you did that way? The problem is that we need to manually detect which driver is bad and is good. Doing a test at init time in Theano would make sure that we don't miss broken driver.

Anyway, for now we can go with this.

Comment thread src/loaders/libcuda.c Outdated
fprintf(stderr, "WARNING: could not determine cuda driver version. Some versions return bad results, make sure your version is fine\n");

if (v > 373.06) {
fprintf(stderr, "ERROR: refusing to load cuda driver library because the version is blacklisted\n");
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Give more information so the user know how to fix the error. Can you add in the error msg a few driver version that are know to be good.

Or better, give the link to the issue. This will give them the up to date status. We don't want the user that see this error to ask us questions, they should be able to work around the problem without contacting us.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can give some pointer to version that we know are not broken. This will have to be updated at some point when NVIDIA fixes the problem.

@abergeron
Copy link
Copy Markdown
Member Author

We kind of have to manually detect anyway since a driver being broken can manifest in any number of ways and we can't test for everything.

In this specific instance we could do a test but unless it's rather involved, the test would be brittle (like for example the sum issue that was reported goes away if you use a debug build of libgpuarray, but there are still other broken things).

Also, I didn't want to leave other potential consumers of libgpuarray out to dry and make them have to figure out the problem on their own.

So in that sense a version check is the safest we can do for now, I think.

@nouiz
Copy link
Copy Markdown
Member

nouiz commented Feb 13, 2017 via email

@abergeron
Copy link
Copy Markdown
Member Author

I've added the maximum version that we know is ok and added a way to bypass the check without recompiling the library.

Comment thread src/loaders/libcuda.c Outdated
if (v > 373.06) {
fprintf(stderr, "ERROR: refusing to load cuda driver library because the version is blacklisted\n");
return GA_LOAD_ERROR;
if (v > 373.06)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems an opening bracket { is missing at the end of this line !

@notoraptor
Copy link
Copy Markdown
Contributor

Currently, it does not work on Windows. Version computed is 6.14, but my driver version is 378.49. I am looking for a solution, but it seems really weird.

If this may help, these are the details about the nvcuda.dll on my computer:

image

It seems the version number we want is at the end of the "file version" (6.14.13.7849: at the end we have 3.7849), but it would be a very strange way for NVIDIA to encode the version number...

I have also found this function: cudaDriverGetVersion(int* driverVersion): http://horacio9573.no-ip.org/cuda/group__CUDART____VERSION_g0e7ca3e5a5997d4eaef36ee22caddd01.html . Could it be a solution ?

@abergeron
Copy link
Copy Markdown
Member Author

@notoraptor cuDriverGetVersion() returns something like 7050 or 8000, which is useless.

I guess we will have to get at the first string "File description" and parse it. Why is this so hard?

@notoraptor
Copy link
Copy Markdown
Contributor

It seems the file version is related to some Windows specifications which forces the first digits to some certain values (about DirectX and other infos): https://msdn.microsoft.com/en-us/library/windows/hardware/ff570155.aspx

And on some forums people also noticed that last digits of the file version seems to match the official NVIDIA driver version: e,g: https://forums.geforce.com/default/topic/765903/off-topic/what-is-the-significance-of-nvidia-graphic-driver-naming-/post/4277236/#4277236

Now I'm just looking for an official NVIDIA page that confirms these details, but I can't find it for the moment ...

@abergeron abergeron merged commit 4da85f6 into Theano:master Feb 14, 2017
@obilaniu
Copy link
Copy Markdown
Contributor

@notoraptor @abergeron @nouiz Assuming you actually need that particular number and not something more closely related to the JIT version, the proper way to get that 375.39 or whatever is NVML, the library that nvidia-smi calls into and which is the appropriate way of making programmatic nvidia-smi-like queries. That includes such goodies as PCI, fans, temperatures, LEDs and other lovely random stuff.

#include <nvml.h>

[...]

nvmlInit();

[...]

/**
 * Retrieves the version of the system's graphics driver.
 * 
 * For all products.
 *
 * The version identifier is an alphanumeric string.  It will not exceed 80 characters in length
 * (including the NULL terminator).  See \ref nvmlConstants::NVML_SYSTEM_DRIVER_VERSION_BUFFER_SIZE.
 *
 * @param version                              Reference in which to return the version identifier
 * @param length                               The maximum allowed length of the string returned in \a version
 *
 * @return 
 *         - \ref NVML_SUCCESS                 if \a version has been set
 *         - \ref NVML_ERROR_UNINITIALIZED     if the library has not been successfully initialized
 *         - \ref NVML_ERROR_INVALID_ARGUMENT  if \a version is NULL
 *         - \ref NVML_ERROR_INSUFFICIENT_SIZE if \a length is too small 
 */
nvmlReturn_t DECLDIR nvmlSystemGetDriverVersion(char *version, unsigned int length);

and one links using

-lnvidia-ml

@nouiz
Copy link
Copy Markdown
Member

nouiz commented Feb 15, 2017 via email

@abergeron
Copy link
Copy Markdown
Member Author

Yes, I already know that, but I wanted to avoid an extra dependency. Also, NVML is hard to link with on Windows for some reason and unavailable on Mac.

Since we are not doing blacklisting on Mac anyway, we could eventually switch to using NVML if this sort of hijinks is required for some time.

@obilaniu
Copy link
Copy Markdown
Contributor

On Linux, ldd `which nvidia-smi` tells me that nvidia-smi links against libdl, and strings `which nvidia-smi` | grep "\.so" reveals to me the presence of numerous paths containing the string libnvidia-ml.so.

I've no idea what are the equivalents on Windows, but you should be able to hunt it down too.

Apparently nvidia-smi doesn't exist on Mac OS X, and neither do the APIs it relies upon. Odd. Can you string libcuda.dylib | grep 375.26 or similar to find whether a string corresponding to the library version is embedded within the library somewhere? Perhaps an API listed by nm -D accesses that string.

@abergeron
Copy link
Copy Markdown
Member Author

I don't really want to run strings from the C library. That would be even more fragile than what is currently there. Not accounting for the fact that the broken drivers is a range of versions not a single one. Also, I don't want to build in a long list of possibly changing paths that is partially reverse-engineered in my code for now.

In any case none of the code is under a public interface, so I can change it later if it turns out to be required.

@obilaniu
Copy link
Copy Markdown
Contributor

@abergeron Oh, I was most definitely not suggesting executing strings out of C code. We know well that the name of the library on Linux is libnvidia-ml.so, and it's installed in the PATH; We don't know the same for Windows yet.

@abergeron abergeron deleted the check_version branch March 20, 2017 22:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants