
illegal memory access during shared_var.set_value() with more than 2 GPUs #404

Closed
astooke opened this issue Apr 12, 2017 · 3 comments

astooke commented Apr 12, 2017

Hi,

I'm getting a strange error, with the traceback below:

File "/home/adam/GitRepos/Theano/theano/gpuarray/type.py", line 640, in set_value
    self.container.value = value
  File "/home/adam/GitRepos/Theano/theano/gof/link.py", line 477, in __set__
    **kwargs)
  File "/home/adam/GitRepos/Theano/theano/gpuarray/type.py", line 299, in filter_inplace
    data = pygpu.array(data, context=self.context)
  File "pygpu/gpuarray.pyx", line 915, in pygpu.gpuarray.array (pygpu/gpuarray.c:12227)
  File "pygpu/gpuarray.pyx", line 970, in pygpu.gpuarray.carray (pygpu/gpuarray.c:13109)
  File "pygpu/gpuarray.pyx", line 664, in pygpu.gpuarray.pygpu_fromhostdata (pygpu/gpuarray.c:9851)
  File "pygpu/gpuarray.pyx", line 301, in pygpu.gpuarray.array_copy_from_host (pygpu/gpuarray.c:5817)
pygpu.gpuarray.GpuArrayException: (b'an illegal memory access was encountered', 'Container name "obs_sub"')

A couple of the GPUs shed their memory allocation (I guess these are the ones raising the error), while others get stuck at 100% utilization, even though they should simply finish and then wait at a barrier they can never pass. This only happens when more than 2 GPUs are used.

Each GPU has its own Python process, and each process is attempting to set its own copy of the same variable, with identically shaped data that definitely fits in memory.

It happens exactly the same way every run: it is always the same variable being written, even though a similar, larger variable has already had its value set successfully (and again, the data definitely fits, because it works with any two GPUs). The behavior does not change with data length or with the preallocation setting.
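For reference, here is a minimal sketch of that per-process pattern. The variable name comes from the traceback above; the shape, dtype, and helper function are illustrative, not taken from the actual code:

import numpy as np
import theano

# Each worker process holds its own copy of the shared variable on its own GPU.
obs_sub = theano.shared(
    np.zeros((256, 4, 84, 84), dtype=theano.config.floatX), name="obs_sub")

def feed_minibatch(batch):
    # Every process runs this independently with identically shaped data;
    # this is the set_value() call that raises the GpuArrayException above.
    obs_sub.set_value(batch)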

I just pulled and reinstalled libgpuarray along with Theano (and cuDNN v6) tonight. The variable in question is fed into a dnn_conv network.

There is certainly still a chance that I'm doing something wrong...

Thoughts?


astooke commented Apr 12, 2017

Might be related: illegal memory (from theano-users)

abergeron (Member) commented

The illegal memory access is probably not from the set_value() itself but from something that ran before it. To identify what exactly caused it, try running with either CUDA_LAUNCH_BLOCKING=1 (somewhat slow) or with THEANO_FLAGS=gpuarray.preallocate=-1 and cuda-memcheck (very slow).
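For completeness, a rough sketch of applying both settings from inside the script rather than on the command line (both environment variables have to be set before Theano initializes the GPU):

import os

# Synchronous kernel launches: the failing kernel is reported at its real
# call site instead of at a later transfer such as set_value().
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"
# Same effect as passing THEANO_FLAGS=gpuarray.preallocate=-1 on the command
# line, which makes individual allocations visible to cuda-memcheck.
os.environ["THEANO_FLAGS"] = "gpuarray.preallocate=-1"

import theano  # must come after the environment variables are set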


astooke commented Apr 12, 2017

Yes! Thank you for that excellent advice! The problem was entirely on my end, and CUDA_LAUNCH_BLOCKING=1 showed me where.

It revealed that the problem was in the CUDA source file src/reduce.cu, which is broken in the NVLink version I'm using. Inadvertently, I still had the workers calling reduce while the master was calling all_reduce. Interesting that the code ran with 2 GPUs (the reduce bug does not affect a 2-GPU call)...
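In rough pseudocode, the mismatch looked like this (a sketch assuming a pygpu.collectives.GpuComm-style communicator; the function and argument names are illustrative, not the original code):

def sync_gradients(comm, grad, is_master):
    # Buggy pattern: master and workers issued different collectives, so the
    # calls stop matching up once more than 2 GPUs participate.
    # if is_master:
    #     comm.all_reduce(grad, op="sum", dest=grad)
    # else:
    #     comm.reduce(grad, op="sum", root=0)

    # Fixed pattern: every rank issues the same collective.
    comm.all_reduce(grad, op="sum", dest=grad)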

astooke closed this as completed Apr 12, 2017