
RuntimeError: an illegal memory access was encountered at src/convolution.cu:259 device number specification and assertion #40

Closed
GaoQiyu opened this issue Oct 25, 2019 · 11 comments


@GaoQiyu

GaoQiyu commented Oct 25, 2019

I can run my code on a GTX 1060,
but it doesn't work with a Tesla P100 on the server.
The output:

Traceback (most recent call last):
File "train.py", line 135, in
trainer.train(epoch)

File "train.py", line 56, in train
output_sparse = self.model(point)

File "/home/gaoqiyu/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 547, in call
result = self.forward(*input, **kwargs)

File "/home/gaoqiyu/PointCloudSeg_Minkowski/model/minkunet.py", line 122, in forward
out = self.conv0p1s1(x)

File "/home/gaoqiyu/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 547, in call
result = self.forward(*input, **kwargs)

File "/home/gaoqiyu/PointCloudSeg_Minkowski/MinkowskiEngine/MinkowskiConvolution.py", line 269, in forward
out_coords_key, input.coords_man)

File "/home/gaoqiyu/PointCloudSeg_Minkowski/MinkowskiEngine/MinkowskiConvolution.py", line 91, in forward
ctx.coords_man.CPPCoordsManager)
RuntimeError: an illegal memory access was encountered at src/convolution.cu:259

@chrischoy
Contributor

Hmm, could you post the engine version, CUDA version, and nvcc version?

I confirmed it worked on most Pascal architecture chips, but not the P100. I’ll check.

@GaoQiyu
Author

GaoQiyu commented Oct 25, 2019

System: Host: C249-SYS-7048GR-TR Kernel: 4.15.0-54-generic x86_64 bits: 64 Console: tty 3
Distro: Ubuntu 18.04.2 LTS

Machine: Device: kvm System: Supermicro product: SYS-7048GR-TR v: 0123456789 serial: N/A
Mobo: Supermicro model: X10DRG-Q v: 1.10 serial: N/A
UEFI [Legacy]: American Megatrends v: 2.0b date: 07/26/2017

CPU(s): 2 8 core Intel Xeon E5-2620 v4s (-MT-MCP-SMP-) cache: 40960 KB
clock speeds: max: 3000 MHz 1: 1200 MHz 2: 1206 MHz 3: 1347 MHz 4: 1209 MHz 5: 1741 MHz 6: 1201 MHz
7: 1200 MHz 8: 1271 MHz 9: 1292 MHz 10: 1200 MHz 11: 1202 MHz 12: 1204 MHz 13: 1200 MHz 14: 1204 MHz
15: 1200 MHz 16: 1200 MHz 17: 1200 MHz 18: 1206 MHz 19: 1218 MHz 20: 1246 MHz 21: 1295 MHz
22: 1202 MHz 23: 1200 MHz 24: 1843 MHz 25: 1825 MHz 26: 1200 MHz 27: 1200 MHz 28: 1202 MHz
29: 1200 MHz 30: 1211 MHz 31: 1200 MHz 32: 1200 MHz

Graphics: Card-1: NVIDIA GM206GL [Quadro M2000]
Card-2: NVIDIA GP100GL [Tesla P100 PCIe 16GB]
Card-3: ASPEED ASPEED Graphics Family
Card-4: NVIDIA GP100GL [Tesla P100 PCIe 16GB]
Display Server: N/A drivers: nvidia (unloaded: fbdev,vesa) FAILED: modesetting,nouveau
tty size: 160x49 Advanced Data: N/A out of X

CUDA Version 10.0.130

nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Wed_Apr_24_19:10:27_PDT_2019
Cuda compilation tools, release 10.1, V10.1.168

@GaoQiyu
Author

GaoQiyu commented Oct 25, 2019

Have you tested on the Maxwell architecture?
I also have a Titan X server.
Maybe it can run on the Titan?

@chrischoy
Contributor

I checked that it works on the K40, K80, GTX 1060, GTX 1080 Ti, RTX 2080 Ti, Titan X, Titan Xp, Titan RTX, and V100.

@GaoQiyu
Author

GaoQiyu commented Oct 25, 2019

OK, I fixed it.
I just set
self.device = torch.device("cuda:0")
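For reference, the fix amounts to pinning everything to one device. A minimal sketch (the `Linear` model and tensor shapes here are illustrative stand-ins, not the original MinkUNet code): pick the device once, then move both the model and every input tensor to it, so no op silently runs on cuda:0.

```python
import torch

# Pin everything to one device; fall back to CPU when no GPU is present.
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

model = torch.nn.Linear(4, 2).to(device)  # stand-in for the real model
x = torch.randn(3, 4, device=device)      # create inputs directly on the device

out = model(x)
assert out.device.type == device.type     # nothing ran on a different device
```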

@chrischoy
Contributor

chrischoy commented Oct 25, 2019

Hmm interesting. I’ve always specified this before so I didn’t encounter this, but adding some assertion checks to prevent this would be a good idea.

@chrischoy chrischoy added the enhancement New feature or request label Oct 25, 2019
@GaoQiyu
Author

GaoQiyu commented Oct 26, 2019

I found that when I use the first GPU it always works, but when I specify another GPU, the program seems to use the first GPU as well, in addition to the one I specified. So this is the real error. I don't know whether it is a mistake in my code (very likely); I will check it.
Sorry to bother you.

@chrischoy
Contributor

I always use export CUDA_VISIBLE_DEVICES=2; python main.py ..... to run programs on a server. I haven't checked the multi-GPU setup with newer pytorch versions, so it is possible that this is not working well with newer pytorch.

@chrischoy chrischoy changed the title RuntimeError: an illegal memory access was encountered at src/convolution.cu:259 RuntimeError: an illegal memory access was encountered at src/convolution.cu:259 device number specification and assertion Oct 30, 2019
@chrischoy chrischoy mentioned this issue Nov 6, 2019
@chrischoy chrischoy added bug Something isn't working and removed bug Something isn't working labels Nov 15, 2019
@chrischoy
Contributor

chrischoy commented Nov 21, 2019

For pytorch v1.3.1, we found out that pytorch's asynchronous memory allocation doesn't always emit the error in place, and instead propagates it back to the engine.

To verify that the error is not in the engine, we placed CUDA_CHECK(cudaDeviceSynchronize()) before every CUDA call inside convolution.cu.

We found that when an OOM occurs, we get this illegal memory access error on the cudaDeviceSynchronize() call. I suspect this is linked to an asynchronous device malloc failure that leads to accessing an invalid pointer on the device, and thus an illegal memory access. However, the problem stays hidden until we force the device synchronization.

This requires further investigation inside pytorch as well, but for now I can assure you that there is no memory leak or invalid pointer access in the Minkowski Engine.
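The same "surface the error where it happened" trick is available from the Python side. A hedged sketch (CUDA_LAUNCH_BLOCKING and torch.cuda.synchronize() are standard PyTorch/CUDA debugging knobs; the checked_forward wrapper name is made up for illustration):

```python
import os

# Must be set before CUDA initializes: makes every kernel launch synchronous,
# so an illegal access is reported at the offending call, not downstream.
os.environ.setdefault("CUDA_LAUNCH_BLOCKING", "1")

import torch

def checked_forward(model, x):
    out = model(x)
    if torch.cuda.is_available():
        # Python-side analogue of CUDA_CHECK(cudaDeviceSynchronize()):
        # raises here if an earlier asynchronous kernel failed.
        torch.cuda.synchronize()
    return out
```

With this in place, the traceback points at the layer whose kernel actually failed rather than at whatever later call happened to synchronize first.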

@chrischoy chrischoy removed the enhancement New feature or request label Nov 21, 2019
@asurada404

When I use export CUDA_VISIBLE_DEVICES=1, I can't set a Tensor's device with the .to method.
For example:

$ export CUDA_VISIBLE_DEVICES=1
$ python

But I got an error:

>>> import torch
>>> a = torch.Tensor(1).to("cuda:0")
>>> a = torch.Tensor(1).to("cuda:1")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
RuntimeError: CUDA error: invalid device ordinal

@GaoQiyu
Author

GaoQiyu commented Jul 24, 2020

> When I use export CUDA_VISIBLE_DEVICES=1, I can't set a Tensor's device with the .to method. [...]
> RuntimeError: CUDA error: invalid device ordinal

When you set CUDA_VISIBLE_DEVICES=1, the only GPU your code can see is the second GPU in your system, and it is exposed as cuda:0. So a = torch.Tensor(1).to("cuda:0") will work,
but a = torch.Tensor(1).to("cuda:1") fails because the code can't see another GPU.
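That remapping can be illustrated with a small pure-Python sketch (no CUDA needed; the helper names are made up for illustration): CUDA_VISIBLE_DEVICES lists physical GPU indices, and the logical ordinals cuda:0, cuda:1, ... simply index into that list.

```python
import os

def visible_devices(env=None):
    """Physical GPU indices exposed to the process, in logical order."""
    env = os.environ if env is None else env
    raw = env.get("CUDA_VISIBLE_DEVICES")
    if raw is None:
        return None  # unset: all GPUs visible, logical index == physical index
    return [int(tok) for tok in raw.split(",") if tok.strip()]

def logical_to_physical(ordinal, env=None):
    vis = visible_devices(env)
    if vis is None:
        return ordinal
    if ordinal >= len(vis):
        # What the driver reports for cuda:1 when only one GPU is exposed.
        raise RuntimeError("CUDA error: invalid device ordinal")
    return vis[ordinal]

# With CUDA_VISIBLE_DEVICES=1, logical cuda:0 maps to physical GPU 1,
# and cuda:1 raises "invalid device ordinal".
assert logical_to_physical(0, {"CUDA_VISIBLE_DEVICES": "1"}) == 1
```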
