
illegal memory access during shared_var.set_value() with more than 2 GPUs #404

Closed
astooke opened this issue Apr 12, 2017 · 3 comments

astooke commented Apr 12, 2017

Hi,

I'm getting a strange error, with the traceback below:

File "/home/adam/GitRepos/Theano/theano/gpuarray/type.py", line 640, in set_value
    self.container.value = value
  File "/home/adam/GitRepos/Theano/theano/gof/link.py", line 477, in __set__
    **kwargs)
  File "/home/adam/GitRepos/Theano/theano/gpuarray/type.py", line 299, in filter_inplace
    data = pygpu.array(data, context=self.context)
  File "pygpu/gpuarray.pyx", line 915, in pygpu.gpuarray.array (pygpu/gpuarray.c:12227)
  File "pygpu/gpuarray.pyx", line 970, in pygpu.gpuarray.carray (pygpu/gpuarray.c:13109)
  File "pygpu/gpuarray.pyx", line 664, in pygpu.gpuarray.pygpu_fromhostdata (pygpu/gpuarray.c:9851)
  File "pygpu/gpuarray.pyx", line 301, in pygpu.gpuarray.array_copy_from_host (pygpu/gpuarray.c:5817)
pygpu.gpuarray.GpuArrayException: (b'an illegal memory access was encountered', 'Container name "obs_sub"')

A couple of the GPUs shed their memory allocation (I guess these are the ones raising the error), while others get stuck at 100% utilization, even though they should simply finish and then wait at a barrier they can never pass. This only happens when more than 2 GPUs are used.

Each GPU has its own Python process, and each process is attempting to set its own copy of the same variable, with identically shaped data that definitely fits in memory.

It happens exactly the same way every run: it is always the same variable being written, even though a similar, larger variable has already had its value set successfully (and again, the data definitely fits, because it works with any two GPUs). The behavior does not change with data length or with the preallocation setting.
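For reference, here is a minimal sketch of that per-process pattern. The variable name comes from the traceback above; the shape, dtype, and helper function are illustrative, not taken from the actual code:

import numpy as np
import theano

# Each worker process holds its own copy of the shared variable on its own GPU.
obs_sub = theano.shared(
    np.zeros((256, 4, 84, 84), dtype=theano.config.floatX), name="obs_sub")

def feed_minibatch(batch):
    # Every process runs this independently with identically shaped data;
    # this is the set_value() call that raises the GpuArrayException above.
    obs_sub.set_value(batch)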

I just pulled and reinstalled libgpuarray along with Theano (and cuDNN v6) tonight. The variable in question is fed into a dnn_conv network.

There is certainly still a chance that I'm doing something wrong...

Thoughts?


astooke commented Apr 12, 2017

Might be related: illegal memory (from theano-users)

abergeron (Member) commented

The illegal memory access is probably not from the set_value() itself but from something that ran before it. To identify what exactly caused it, try running with either CUDA_LAUNCH_BLOCKING=1 (somewhat slow) or with THEANO_FLAGS=gpuarray.preallocate=-1 and cuda-memcheck (very slow).
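For completeness, a rough sketch of applying both settings from inside the script rather than on the command line (both environment variables have to be set before Theano initializes the GPU):

import os

# Synchronous kernel launches: the failing kernel is reported at its real
# call site instead of at a later transfer such as set_value().
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"
# Same effect as passing THEANO_FLAGS=gpuarray.preallocate=-1 on the command
# line, which makes individual allocations visible to cuda-memcheck.
os.environ["THEANO_FLAGS"] = "gpuarray.preallocate=-1"

import theano  # must come after the environment variables are set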


astooke commented Apr 12, 2017

Yes! Thank you for that excellent advice! The problem was entirely on my end, and CUDA_LAUNCH_BLOCKING=1 showed me where.

It revealed that the problem was in the CUDA source file src/reduce.cu, which is broken in the NVLink version I'm using. Inadvertently, I still had the workers calling reduce while the master was calling all_reduce. Interesting that the code ran with 2 GPUs (the reduce bug does not affect a 2-GPU call)...
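In rough pseudocode, the mismatch looked like this (a sketch assuming a pygpu.collectives.GpuComm-style communicator; the function and argument names are illustrative, not the original code):

def sync_gradients(comm, grad, is_master):
    # Buggy pattern: master and workers issued different collectives, so the
    # calls stop matching up once more than 2 GPUs participate.
    # if is_master:
    #     comm.all_reduce(grad, op="sum", dest=grad)
    # else:
    #     comm.reduce(grad, op="sum", root=0)

    # Fixed pattern: every rank issues the same collective.
    comm.all_reduce(grad, op="sum", dest=grad)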

astooke closed this as completed Apr 12, 2017