Hi,
I'm getting a strange error as below:
File "/home/adam/GitRepos/Theano/theano/gpuarray/type.py", line 640, in set_value
self.container.value = value
File "/home/adam/GitRepos/Theano/theano/gof/link.py", line 477, in __set__
**kwargs)
File "/home/adam/GitRepos/Theano/theano/gpuarray/type.py", line 299, in filter_inplace
data = pygpu.array(data, context=self.context)
File "pygpu/gpuarray.pyx", line 915, in pygpu.gpuarray.array (pygpu/gpuarray.c:12227)
File "pygpu/gpuarray.pyx", line 970, in pygpu.gpuarray.carray (pygpu/gpuarray.c:13109)
File "pygpu/gpuarray.pyx", line 664, in pygpu.gpuarray.pygpu_fromhostdata (pygpu/gpuarray.c:9851)
File "pygpu/gpuarray.pyx", line 301, in pygpu.gpuarray.array_copy_from_host (pygpu/gpuarray.c:5817)
pygpu.gpuarray.GpuArrayException: (b'an illegal memory access was encountered', 'Container name "obs_sub"')
A couple of the GPUs shed their memory allocation (I guess these are the ones giving the error), while some others are stuck at 100% utilization, even though they should just finish and then hit a barrier they can't pass. This only happens when more than 2 GPUs are used.
Each GPU has its own Python process, and each is attempting to set its own copy of the same variable with data of the same shape, which can definitely fit.
It happens exactly the same way every run: it always fails on the same variable being written, even though a similar, larger variable has already had its value set successfully (and again, the data definitely fits, because it works with two GPUs, no matter which two). It does not change with data length or the preallocation setting.
I just pulled and reinstalled libgpuarray along with Theano (and cuDNN v6) tonight. The failing variable is one that is fed into a dnn_conv net.
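Roughly, each per-GPU process does something like this (a minimal sketch; the shapes and the kernel variable are placeholders, obs_sub is the container name from the traceback):

import numpy as np
import theano
from theano.gpuarray.dnn import dnn_conv

# one shared variable per process, later fed into the conv net
obs_sub = theano.shared(np.zeros((8, 3, 32, 32), dtype='float32'), name='obs_sub')
kerns = theano.shared(np.zeros((16, 3, 5, 5), dtype='float32'))
out = dnn_conv(obs_sub, kerns)

# ...and this is the call that raises the GpuArrayException above
obs_sub.set_value(np.random.rand(8, 3, 32, 32).astype('float32'))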
Certainly still a chance I'm doing something wrong....
Thoughts??
The illegal memory access is probably not from the set_value() itself but from something before it. To identify what exactly caused it, try running with either CUDA_LAUNCH_BLOCKING=1 (somewhat slow) or with THEANO_FLAGS=gpuarray.preallocate=-1 under cuda-memcheck (very slow).
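For example, with train.py standing in for whatever script triggers the error:

CUDA_LAUNCH_BLOCKING=1 python train.py
THEANO_FLAGS=gpuarray.preallocate=-1 cuda-memcheck python train.py

The first makes every kernel launch synchronous, so the traceback points at the op that actually faulted; the second disables Theano's GPU memory preallocation so cuda-memcheck can track each allocation individually.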
Yes! Thank you for that excellent advice! The problem was entirely on my end, and CUDA_LAUNCH_BLOCKING=1 showed me where.
It traced the fault to cuda src/reduce.cu, which is broken in the NVLink version I'm using. Inadvertently, I still had the workers calling reduce while the master was calling all_reduce. Interesting that the code ran on 2 GPUs (the reduce bug does not affect a 2-GPU call)...
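In case it helps anyone else, the shape of the bug was a mismatched collective across ranks. A minimal sketch of the same mistake using mpi4py (not the collectives library from this issue, just an API where the mismatch is easy to show):

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
send = np.ones(4)
recv = np.empty(4)

if comm.rank == 0:
    # the master joins an all-reduce...
    comm.Allreduce(send, recv, op=MPI.SUM)
else:
    # ...while the workers join a plain reduce rooted at the master
    comm.Reduce(send, recv, op=MPI.SUM, root=0)

Every rank has to call the same collective; mixing reduce and all_reduce leaves some ranks waiting on buffers the others never agreed to, which matches the stuck-at-100%-utilization symptom above.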