Blacklist bad versions of the cuda drivers#352
Conversation
|
I should also mention that the windows code is completely untested and that the dladdr code will not work on mac (it should just trigger the warning about not being able to get the version). Fixing those issues will come later, but in light of that and the the numpy breakage for array, I am inclined to fix those issues and cut a new release (perhaps including the disk_cache changes, which could be made to expose the binary once again). |
nouiz
left a comment
There was a problem hiding this comment.
From your previous answer on this issue, I was thinking you would go with a detection in Theano. Why you did that way? The problem is that we need to manually detect which driver is bad and is good. Doing a test at init time in Theano would make sure that we don't miss broken driver.
Anyway, for now we can go with this.
| fprintf(stderr, "WARNING: could not determine cuda driver version. Some versions return bad results, make sure your version is fine\n"); | ||
|
|
||
| if (v > 373.06) { | ||
| fprintf(stderr, "ERROR: refusing to load cuda driver library because the version is blacklisted\n"); |
There was a problem hiding this comment.
Give more information so the user know how to fix the error. Can you add in the error msg a few driver version that are know to be good.
Or better, give the link to the issue. This will give them the up to date status. We don't want the user that see this error to ask us questions, they should be able to work around the problem without contacting us.
There was a problem hiding this comment.
I can give some pointer to version that we know are not broken. This will have to be updated at some point when NVIDIA fixes the problem.
|
We kind of have to manually detect anyway since a driver being broken can manifest in any number of ways and we can't test for everything. In this specific instance we could do a test but unless it's rather involved, the test would be brittle (like for example the sum issue that was reported goes away if you use a debug build of libgpuarray, but there are still other broken things). Also, I didn't want to leave other potential consumers of libgpuarray out to dry and make them have to figure out the problem on their own. So in that sense a version check is the safest we can do for now, I think. |
|
make sense.
…On Mon, Feb 13, 2017 at 11:33 AM abergeron ***@***.***> wrote:
We kind of have to manually detect anyway since a driver being broken can
manifest in any number of ways and we can't test for everything.
In this specific instance we could do a test but unless it's rather
involved, the test would be brittle (like for example the sum issue that
was reported goes away if you use a debug build of libgpuarray, but there
are still other broken things).
Also, I didn't want to leave other potential consumers of libgpuarray out
to dry and make them have to figure out the problem on their own.
So in that sense a version check is the safest we can do for now, I think.
—
You are receiving this because you commented.
Reply to this email directly, view it on GitHub
<#352 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AALC-4_YaI4tQPMHrpGTvZp3gbDgEmo6ks5rcIXWgaJpZM4L955G>
.
|
|
I've added the maximum version that we know is ok and added a way to bypass the check without recompiling the library. |
| if (v > 373.06) { | ||
| fprintf(stderr, "ERROR: refusing to load cuda driver library because the version is blacklisted\n"); | ||
| return GA_LOAD_ERROR; | ||
| if (v > 373.06) |
There was a problem hiding this comment.
It seems an opening bracket { is missing at the end of this line !
|
Currently, it does not work on Windows. Version computed is If this may help, these are the details about the It seems the version number we want is at the end of the "file version" ( I have also found this function: |
|
@notoraptor cuDriverGetVersion() returns something like 7050 or 8000, which is useless. I guess we will have to get at the first string "File description" and parse it. Why is this so hard? |
|
It seems the file version is related to some Windows specifications which forces the first digits to some certain values (about DirectX and other infos): https://msdn.microsoft.com/en-us/library/windows/hardware/ff570155.aspx And on some forums people also noticed that last digits of the file version seems to match the official NVIDIA driver version: e,g: https://forums.geforce.com/default/topic/765903/off-topic/what-is-the-significance-of-nvidia-graphic-driver-naming-/post/4277236/#4277236 Now I'm just looking for an official NVIDIA page that confirms these details, but I can't find it for the moment ... |
|
@notoraptor @abergeron @nouiz Assuming you actually need that particular number and not something more closely related to the JIT version, the proper way to get that and one links using |
|
@abergeron known this already. I think he didn't wanted to have one more
library linked as we need to dinamilally link them. @abergeron, can you
confirm?
Do you know if NVML is available on the 3 OS we support?
…On Wed, Feb 15, 2017 at 12:28 AM Olexa Bilaniuk ***@***.***> wrote:
@notoraptor <https://github.com/notoraptor> @abergeron
<https://github.com/abergeron> @nouiz <https://github.com/nouiz> Assuming
you actually need that particular number and not something more closely
related to the JIT version, the proper way to get that 375.39 or whatever
is NVML, the library that nvidia-smi calls into and which is the
appropriate way of making programmatic nvidia-smi-like queries. That
includes such goodies as PCI, fans, temperatures, LEDs and other lovely
random stuff.
#include <nvml.h>
[...]
nvmlInit();
[...]
/**
* Retrieves the version of the system's graphics driver.
*
* For all products.
*
* The version identifier is an alphanumeric string. It will not exceed 80 characters in length
* (including the NULL terminator). See \ref nvmlConstants::NVML_SYSTEM_DRIVER_VERSION_BUFFER_SIZE.
*
* @param version Reference in which to return the version identifier
* @param length The maximum allowed length of the string returned in \a version
*
* @return
* - \ref NVML_SUCCESS if \a version has been set
* - \ref NVML_ERROR_UNINITIALIZED if the library has not been successfully initialized
* - \ref NVML_ERROR_INVALID_ARGUMENT if \a version is NULL
* - \ref NVML_ERROR_INSUFFICIENT_SIZE if \a length is too small
*/
nvmlReturn_t DECLDIR nvmlSystemGetDriverVersion(char *version, unsigned int length);
and one links using
-lnvidia-ml
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#352 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AALC-0KvhFjkG5iRBUp5J3N-9vLm6mA0ks5rco0bgaJpZM4L955G>
.
|
|
Yes, I already know that, but I wanted to avoid an extra dependency. Also, NVML is hard to link with on Windows for some reason and unavailable on Mac. Since we are not doing blacklisting on Mac anyway, we could eventually switch to using NVML if this sort of hijinks is required for some time. |
|
On Linux, I've no idea what are the equivalents on Windows, but you should be able to hunt it down too. Apparently |
|
I don't really want to run In any case none of the code is under a public interface, so I can change it later if it turns out to be required. |
|
@abergeron Oh, I was most definitely not suggesting executing |

The cuda drivers version 375 have some problems with the JIT that lead to bad code.
Since this is basically impossible to detect otherwise, this add a way to get the actual version of the cuda driver and blacklists bad versions. The blacklist will need to be updated once a fixed version is released.
The last known good version is 373.06 and there are no working versions above that as far as I know, which is why the current blacklist has an open range.