
RuntimeError: an illegal memory access was encountered at src/convolution.cu:259 device number specification and assertion #40

Closed
GaoQiyu opened this issue Oct 25, 2019 · 11 comments


@GaoQiyu

GaoQiyu commented Oct 25, 2019

I can run my code on a GTX 1060,
but it doesn't work with a Tesla P100 on the server.
The output:

Traceback (most recent call last):
File "train.py", line 135, in
trainer.train(epoch)

File "train.py", line 56, in train
output_sparse = self.model(point)

File "/home/gaoqiyu/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 547, in call
result = self.forward(*input, **kwargs)

File "/home/gaoqiyu/PointCloudSeg_Minkowski/model/minkunet.py", line 122, in forward
out = self.conv0p1s1(x)

File "/home/gaoqiyu/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 547, in call
result = self.forward(*input, **kwargs)

File "/home/gaoqiyu/PointCloudSeg_Minkowski/MinkowskiEngine/MinkowskiConvolution.py", line 269, in forward
out_coords_key, input.coords_man)

File "/home/gaoqiyu/PointCloudSeg_Minkowski/MinkowskiEngine/MinkowskiConvolution.py", line 91, in forward
ctx.coords_man.CPPCoordsManager)
RuntimeError: an illegal memory access was encountered at src/convolution.cu:259

@chrischoy
Contributor

Hmm, could you post the engine version, CUDA version, and nvcc version?

I confirmed it worked on most Pascal architecture chips, but not the P100. I’ll check.

@GaoQiyu
Author

GaoQiyu commented Oct 25, 2019

System: Host: C249-SYS-7048GR-TR Kernel: 4.15.0-54-generic x86_64 bits: 64 Console: tty 3
Distro: Ubuntu 18.04.2 LTS

Machine: Device: kvm System: Supermicro product: SYS-7048GR-TR v: 0123456789 serial: N/A
Mobo: Supermicro model: X10DRG-Q v: 1.10 serial: N/A
UEFI [Legacy]: American Megatrends v: 2.0b date: 07/26/2017

CPU(s): 2 8 core Intel Xeon E5-2620 v4s (-MT-MCP-SMP-) cache: 40960 KB
clock speeds: max: 3000 MHz 1: 1200 MHz 2: 1206 MHz 3: 1347 MHz 4: 1209 MHz 5: 1741 MHz 6: 1201 MHz
7: 1200 MHz 8: 1271 MHz 9: 1292 MHz 10: 1200 MHz 11: 1202 MHz 12: 1204 MHz 13: 1200 MHz 14: 1204 MHz
15: 1200 MHz 16: 1200 MHz 17: 1200 MHz 18: 1206 MHz 19: 1218 MHz 20: 1246 MHz 21: 1295 MHz
22: 1202 MHz 23: 1200 MHz 24: 1843 MHz 25: 1825 MHz 26: 1200 MHz 27: 1200 MHz 28: 1202 MHz
29: 1200 MHz 30: 1211 MHz 31: 1200 MHz 32: 1200 MHz

Graphics: Card-1: NVIDIA GM206GL [Quadro M2000]
Card-2: NVIDIA GP100GL [Tesla P100 PCIe 16GB]
Card-3: ASPEED ASPEED Graphics Family
Card-4: NVIDIA GP100GL [Tesla P100 PCIe 16GB]
Display Server: N/A drivers: nvidia (unloaded: fbdev,vesa) FAILED: modesetting,nouveau
tty size: 160x49 Advanced Data: N/A out of X

CUDA Version 10.0.130

nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Wed_Apr_24_19:10:27_PDT_2019
Cuda compilation tools, release 10.1, V10.1.168

@GaoQiyu
Author

GaoQiyu commented Oct 25, 2019

Have you tested on the Maxwell architecture?
I also have a Titan X server.
Maybe it can run on the Titan?

@chrischoy
Contributor

I checked that it works on the K40, K80, GTX 1060, GTX 1080 Ti, RTX 2080 Ti, Titan X, Titan Xp, Titan RTX, and V100.

@GaoQiyu
Author

GaoQiyu commented Oct 25, 2019

OK, I fixed it.
I just set
self.device = torch.device("cuda:0")
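For reference, the fix amounts to pinning everything to one device. A minimal sketch (the `Linear` model and tensor shapes here are illustrative stand-ins, not the original MinkUNet code): pick the device once, then move both the model and every input tensor to it, so no op silently runs on cuda:0.

```python
import torch

# Pin everything to one device; fall back to CPU when no GPU is present.
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

model = torch.nn.Linear(4, 2).to(device)  # stand-in for the real model
x = torch.randn(3, 4, device=device)      # create inputs directly on the device

out = model(x)
assert out.device.type == device.type     # nothing ran on a different device
```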

@chrischoy
Contributor

chrischoy commented Oct 25, 2019

Hmm interesting. I’ve always specified this before so I didn’t encounter this, but adding some assertion checks to prevent this would be a good idea.

@chrischoy chrischoy added the enhancement New feature or request label Oct 25, 2019
@GaoQiyu
Author

GaoQiyu commented Oct 26, 2019

I found that when I use the first GPU it always works, but when I specify another GPU, the program seems to use the first GPU as well, in addition to the one I specified. So this is the real error. I don't know whether it is a mistake in my code (very likely); I will check it.
Sorry to bother you.

@chrischoy
Contributor

I always use export CUDA_VISIBLE_DEVICES=2; python main.py ..... to run programs on a server. I haven't checked the multi-GPU setup with newer pytorch versions, so it is possible that this is not working well with newer pytorch.

@chrischoy chrischoy changed the title RuntimeError: an illegal memory access was encountered at src/convolution.cu:259 RuntimeError: an illegal memory access was encountered at src/convolution.cu:259 device number specification and assertion Oct 30, 2019
@chrischoy chrischoy mentioned this issue Nov 6, 2019
@chrischoy chrischoy added bug Something isn't working and removed bug Something isn't working labels Nov 15, 2019
@chrischoy
Contributor

chrischoy commented Nov 21, 2019

For pytorch v1.3.1, we found out that pytorch's asynchronous memory allocation doesn't always emit the error in place, and instead propagates it back to the engine.

To verify that the error is not in the engine, we placed CUDA_CHECK(cudaDeviceSynchronize()) before every CUDA call inside convolution.cu.

We found that when an OOM occurs, we get this illegal memory access error on the cudaDeviceSynchronize() call. I suspect this is linked to an asynchronous device malloc failure that leads to accessing an invalid pointer on the device, and thus an illegal memory access. However, the problem stays hidden until we force the device synchronization.

This requires further investigation inside pytorch as well, but for now I can assure you that there is no memory leak or invalid pointer access in the Minkowski Engine.
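The same "surface the error where it happened" trick is available from the Python side. A hedged sketch (CUDA_LAUNCH_BLOCKING and torch.cuda.synchronize() are standard PyTorch/CUDA debugging knobs; the checked_forward wrapper name is made up for illustration):

```python
import os

# Must be set before CUDA initializes: makes every kernel launch synchronous,
# so an illegal access is reported at the offending call, not downstream.
os.environ.setdefault("CUDA_LAUNCH_BLOCKING", "1")

import torch

def checked_forward(model, x):
    out = model(x)
    if torch.cuda.is_available():
        # Python-side analogue of CUDA_CHECK(cudaDeviceSynchronize()):
        # raises here if an earlier asynchronous kernel failed.
        torch.cuda.synchronize()
    return out
```

With this in place, the traceback points at the layer whose kernel actually failed rather than at whatever later call happened to synchronize first.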

@chrischoy chrischoy removed the enhancement New feature or request label Nov 21, 2019
@asurada404

When I use export CUDA_VISIBLE_DEVICES=1, I can't set a Tensor's device with the .to method.
For example:

$ export CUDA_VISIBLE_DEVICES=1
$ python

But I got an error:

>>> import torch
>>> a = torch.Tensor(1).to("cuda:0")
>>> a = torch.Tensor(1).to("cuda:1")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
RuntimeError: CUDA error: invalid device ordinal

@GaoQiyu
Author

GaoQiyu commented Jul 24, 2020

> When I use export CUDA_VISIBLE_DEVICES=1, I can't set a Tensor's device with the .to method. [...]
> RuntimeError: CUDA error: invalid device ordinal

When you set CUDA_VISIBLE_DEVICES=1, the only GPU your code can see is the second GPU in your system, and it is exposed as cuda:0. So a = torch.Tensor(1).to("cuda:0") will work,
but a = torch.Tensor(1).to("cuda:1") fails because the code can't see another GPU.
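That remapping can be illustrated with a small pure-Python sketch (no CUDA needed; the helper names are made up for illustration): CUDA_VISIBLE_DEVICES lists physical GPU indices, and the logical ordinals cuda:0, cuda:1, ... simply index into that list.

```python
import os

def visible_devices(env=None):
    """Physical GPU indices exposed to the process, in logical order."""
    env = os.environ if env is None else env
    raw = env.get("CUDA_VISIBLE_DEVICES")
    if raw is None:
        return None  # unset: all GPUs visible, logical index == physical index
    return [int(tok) for tok in raw.split(",") if tok.strip()]

def logical_to_physical(ordinal, env=None):
    vis = visible_devices(env)
    if vis is None:
        return ordinal
    if ordinal >= len(vis):
        # What the driver reports for cuda:1 when only one GPU is exposed.
        raise RuntimeError("CUDA error: invalid device ordinal")
    return vis[ordinal]

# With CUDA_VISIBLE_DEVICES=1, logical cuda:0 maps to physical GPU 1,
# and cuda:1 raises "invalid device ordinal".
assert logical_to_physical(0, {"CUDA_VISIBLE_DEVICES": "1"}) == 1
```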
