Cuda 11.1 - Coordinate manager #330

zgojcic · 2021-03-13T22:28:34Z

Hi Chris,

I ran into the following problem when using ME 0.5.1 or 0.5.2 with CUDA 11.1:

  File "/home/zgojcic/anaconda3/envs/rigid_3dsf/lib/python3.7/site-packages/MinkowskiEngine-0.5.1-py3.7-linux-x86_64.egg/MinkowskiEngine/MinkowskiConvolution.py", line 84, in forward
    coordinate_manager._manager,
RuntimeError: /home/zgojcic/Documents/Rigid3DSceneFlow/MinkowskiEngine/src/convolution_gpu.cu:85, assertion (in_feat.size(0) == p_map_manager->size(in_key)) failed. Invalid in_feat size 0 != 5296

Note that the same code works perfectly fine with CUDA 10.2. I am sorry that I do not have a compact working example, but the error occurs when running the code available at https://github.com/zgojcic/Rigid3DSceneFlow, for example when running the following evaluation:

python eval.py ./configs/eval/eval_lidar_kitti.yaml

If you actually want to run the code, you also have to download the dataset, but it is very small (see the repo). If I can help in any way or should provide more information, please let me know.

Best,
Zan

Diagnostic from one of the computers that I have used (I have observed the same error on three computers running either ME 0.5.1 or 0.5.2):

==========System==========
Linux-5.4.0-66-generic-x86_64-with-debian-buster-sid
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=18.04
DISTRIB_CODENAME=bionic
DISTRIB_DESCRIPTION="Ubuntu 18.04.5 LTS"
3.7.10 (default, Feb 26 2021, 18:47:35)
[GCC 7.3.0]
==========Pytorch==========
1.8.0
torch.cuda.is_available(): True
==========NVIDIA-SMI==========
/usr/bin/nvidia-smi
Driver Version 455.32.00
CUDA Version 11.1
VBIOS Version 88.00.41.00.18
Image Version G001.0000.01.04
==========NVCC==========
sh: 1: nvcc: not found
==========CC==========
CC=g++-7
/usr/bin/g++-7
g++-7 (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
Copyright (C) 2017 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

==========MinkowskiEngine==========
0.5.1
MinkowskiEngine compiled with CUDA Support: True
NVCC version MinkowskiEngine is compiled: 11010
CUDART version MinkowskiEngine is compiled: 11010
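
The `11010` values in the last two lines use CUDA's integer version encoding (1000 × major + 10 × minor), so they correspond to CUDA 11.1. A minimal sketch for decoding such values, assuming that encoding; the helper name `decode_cuda_version` is hypothetical and not part of MinkowskiEngine:

```python
def decode_cuda_version(encoded: int) -> str:
    """Convert CUDA's integer version (1000 * major + 10 * minor) to 'major.minor'."""
    major = encoded // 1000
    minor = (encoded % 1000) // 10
    return f"{major}.{minor}"


print(decode_cuda_version(11010))  # 11.1
print(decode_cuda_version(11000))  # 11.0
```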


chrischoy · 2021-03-14T00:47:19Z

The error says that you fed a zero-length feature matrix. You might want to put a breakpoint (import ipdb; ipdb.set_trace()) before the line that raised the error and make sure the inputs are what you expect.
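
As an alternative to a breakpoint, a small guard can be called right before the sparse tensor or convolution is built. This is only a sketch: `check_sparse_inputs` is a hypothetical helper, not part of MinkowskiEngine, and it relies only on the inputs exposing a `.shape` tuple (as torch tensors and numpy arrays do). The stand-in objects below merely imitate such shapes so the example runs without a GPU.

```python
from types import SimpleNamespace


def check_sparse_inputs(features, coordinates):
    """Raise early if features are empty or do not match the coordinates row for row.

    Works with anything exposing a .shape tuple (torch tensors, numpy arrays).
    """
    n_feats, n_coords = features.shape[0], coordinates.shape[0]
    if n_feats == 0:
        raise ValueError(f"empty feature matrix; coordinates have {n_coords} rows")
    if n_feats != n_coords:
        raise ValueError(f"row mismatch: {n_feats} features vs {n_coords} coordinates")
    return n_feats


# Stand-ins that only carry a shape, mimicking the sizes from the error message.
good_feats = SimpleNamespace(shape=(5296, 3))
coords = SimpleNamespace(shape=(5296, 4))
print(check_sparse_inputs(good_feats, coords))

empty_feats = SimpleNamespace(shape=(0, 3))
try:
    check_sparse_inputs(empty_feats, coords)
except ValueError as err:
    print("caught:", err)
```

Failing in Python with an explicit message makes it easier to see whether the features were already empty before they reached the CUDA kernel.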

zgojcic · 2021-03-14T10:50:46Z

Exactly the same code runs on the same computer if I use CUDA 10.2 with the same ME version, so I assume the problem is the combination of CUDA 11.1 and ME.

If I debug the code step by step, the error actually happens earlier, when I construct the sparse tensor:

        sinput1 = ME.SparseTensor(features=input_dict['sinput_s_F'].to(self.device),
            coordinates=input_dict['sinput_s_C'].to(self.device))

the error message is:

 terminate called after throwing an instance of 'thrust::system::system_error'
  what():  CUDA free failed: cudaErrorIllegalAddress: an illegal memory access was encountered

The inputs are generated with

coords_batch1, feats_batch1 = ME.utils.sparse_collate(coords=coords1, feats=feats1)

but the batch size is one and a single worker is used in the data loader. The input shapes are [5296, 3] and [5296, 4], respectively.

I have tried to create a minimal working example, but if I just pass random values to a sparse tensor it works without an error.

chrischoy · 2021-03-14T11:06:47Z

It would be great if you could prepare self-contained code for debugging.

zgojcic · 2021-03-14T18:20:29Z

The following example should show the problem. On my machine (the diagnostics are in the first post) it returns:

tensor(8188)
torch.Size([8188, 3])
torch.Size([8188, 3])

tensor(8188, device='cuda:0')
torch.Size([8188, 3])
torch.Size([0, 3])

I think that something goes wrong when the features are passed to ME.SparseTensor, as I also cannot, for example, call print(sinput1.F); Python just hangs in this case. I hope this helps.

For reference, the same code with CUDA 10.2 returns

tensor(8188)
torch.Size([8188, 3])
torch.Size([8188, 3])

tensor(8188, device='cuda:0')
torch.Size([8188, 3])
torch.Size([8188, 3])

Thank you in advance for your help

import torch
import MinkowskiEngine as ME
import numpy as np

pc_1 = np.random.rand(8192,3) * 20
voxel_size = 0.1

# Voxelization
_, sel1 = ME.utils.sparse_quantize(pc_1 / voxel_size, return_index=True)

# Select the voxelized points
pc_1 = pc_1[sel1,:]

# Get sparse indices
coords1 = np.floor(pc_1 / voxel_size)

# Use absolute features as input
feats1 = coords1.copy()

coords_batch1, feats_batch1 = ME.utils.sparse_collate(coords=[coords1], feats=[feats1])

sinput1 = ME.SparseTensor(features=feats_batch1,
            coordinates=coords_batch1)
sinput1_cuda = ME.SparseTensor(features=feats_batch1.to('cuda'),
            coordinates=coords_batch1.to('cuda'))

for b_idx in range(len(sinput1.decomposed_coordinates)):
    feat_s = sinput1.F[sinput1.C[:,0] == b_idx]

    print(sum(sinput1.C[:,0] == b_idx))
    print(sinput1.F.shape)
    print(feat_s.shape)

for b_idx in range(len(sinput1_cuda.decomposed_coordinates)):
    feat_s_cuda = sinput1_cuda.F[sinput1_cuda.C[:,0] == b_idx]

    print(sum(sinput1_cuda.C[:,0] == b_idx))
    print(sinput1_cuda.F.shape)
    print(feat_s_cuda.shape)

eldar · 2021-03-22T13:58:25Z

Hi! Any chance this issue could be looked at? I am using an NVIDIA 3000 series GPU which only runs CUDA 11 and therefore I cannot use Minkowski Engine.

ShengyuH · 2021-03-25T18:17:09Z

Just FYI, the code snippet works on my machine: MinkowskiEngine==0.5.0, CUDA 11.2, GeForce RTX 3090.

eldar · 2021-03-26T14:30:16Z

I ran the snippet on CUDA 11.2 (ME==0.5.1, RTX 3090) and am still getting the same error. ME==0.5.0 wouldn't compile.

victoryc · 2021-03-30T01:00:11Z

Is what I ran into in #338 perhaps the same issue as this?

chrischoy · 2021-04-06T22:36:51Z

Hmm, I can't replicate the error on the latest master, with either CUDA 11.0 or CUDA 11.1.

My environments are

python -c "import MinkowskiEngine; MinkowskiEngine.print_diagnostics()"

==========System==========
Linux-5.4.0-67-generic-x86_64-with-glibc2.10
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=20.04
DISTRIB_CODENAME=focal
DISTRIB_DESCRIPTION="Ubuntu 20.04.1 LTS"
3.8.8 (default, Feb 24 2021, 21:46:12)
[GCC 7.3.0]
==========Pytorch==========
1.8.1
torch.cuda.is_available(): True
==========NVIDIA-SMI==========
/usr/bin/nvidia-smi
Driver Version 460.39
CUDA Version 11.2
VBIOS Version 90.02.2E.00.0C
Image Version G001.0000.02.04
==========NVCC==========
/usr/local/cuda/bin/nvcc
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2020 NVIDIA Corporation
Built on Wed_Jul_22_19:09:09_PDT_2020
Cuda compilation tools, release 11.0, V11.0.221
Build cuda_11.0_bu.TC445_37.28845127_0
==========CC==========
CC=g++
/usr/bin/g++
g++ (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
Copyright (C) 2019 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

==========MinkowskiEngine==========
0.5.2
MinkowskiEngine compiled with CUDA Support: True
NVCC version MinkowskiEngine is compiled: 11000
CUDART version MinkowskiEngine is compiled: 11000

and

==========System==========
Linux-5.4.0-67-generic-x86_64-with-glibc2.10
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=20.04
DISTRIB_CODENAME=focal
DISTRIB_DESCRIPTION="Ubuntu 20.04.1 LTS"
3.8.8 (default, Feb 24 2021, 21:46:12)
[GCC 7.3.0]
==========Pytorch==========
1.8.1
torch.cuda.is_available(): True
==========NVIDIA-SMI==========
/usr/bin/nvidia-smi
Driver Version 460.39
CUDA Version 11.2
VBIOS Version 90.02.2E.00.0C
Image Version G001.0000.02.04
==========NVCC==========
/usr/local/cuda/bin/nvcc
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2020 NVIDIA Corporation
Built on Mon_Oct_12_20:09:46_PDT_2020
Cuda compilation tools, release 11.1, V11.1.105
Build cuda_11.1.TC455_06.29190527_0
==========CC==========
CC=g++
/usr/bin/g++
g++ (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
Copyright (C) 2019 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

==========MinkowskiEngine==========
0.5.2
MinkowskiEngine compiled with CUDA Support: True
NVCC version MinkowskiEngine is compiled: 11010
CUDART version MinkowskiEngine is compiled: 11010
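
Comparing several of these diagnostic dumps by eye is error-prone. A minimal sketch of how two dumps could be split into sections and diffed, assuming the `==========Name==========` header format shown above; `parse_diagnostics` and `diff_diagnostics` are hypothetical helpers, not part of MinkowskiEngine, and the sample text is abbreviated for illustration:

```python
import re

# Section headers look like "==========Pytorch==========".
SECTION_RE = re.compile(r"^=+(?P<name>[^=]+)=+$")


def parse_diagnostics(text: str) -> dict:
    """Split print_diagnostics()-style output into {section name: [non-empty lines]}."""
    sections, current = {}, None
    for line in text.splitlines():
        line = line.strip()
        match = SECTION_RE.match(line)
        if match:
            current = match.group("name")
            sections[current] = []
        elif current is not None and line:
            sections[current].append(line)
    return sections


def diff_diagnostics(a: str, b: str) -> dict:
    """Return {section: (lines_a, lines_b)} for every section whose content differs."""
    pa, pb = parse_diagnostics(a), parse_diagnostics(b)
    return {
        name: (pa.get(name), pb.get(name))
        for name in sorted(set(pa) | set(pb))
        if pa.get(name) != pb.get(name)
    }


# Abbreviated sample dumps; only the compiled CUDA version differs.
env_a = """==========Pytorch==========
1.8.1
==========MinkowskiEngine==========
0.5.2
NVCC version MinkowskiEngine is compiled: 11000
"""
env_b = env_a.replace("11000", "11010")

for name, (left, right) in diff_diagnostics(env_a, env_b).items():
    print(name, left, right)
```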

zmlshiwo · 2021-04-07T08:47:25Z

Hey Chris (@chrischoy), I can still reproduce this error on a 3090 GPU using the latest master, with the following environment:

==========System==========
Linux-5.8.0-44-generic-x86_64-with-glibc2.10
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=20.04
DISTRIB_CODENAME=focal
DISTRIB_DESCRIPTION="Ubuntu 20.04.2 LTS"
3.8.8 (default, Feb 24 2021, 21:46:12)
[GCC 7.3.0]
==========Pytorch==========
1.8.1
torch.cuda.is_available(): True
==========NVIDIA-SMI==========
/usr/bin/nvidia-smi
Driver Version 460.39
CUDA Version 11.2
VBIOS Version 94.02.26.88.3C
Image Version G001.0000.03.03
==========NVCC==========
/usr/local/cuda-11.1/bin/nvcc
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2020 NVIDIA Corporation
Built on Tue_Sep_15_19:10:02_PDT_2020
Cuda compilation tools, release 11.1, V11.1.74
Build cuda_11.1.TC455_06.29069683_0
==========CC==========
CC=g++-7
/usr/bin/g++-7
g++-7 (Ubuntu 7.5.0-6ubuntu2) 7.5.0
Copyright (C) 2017 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

==========MinkowskiEngine==========
0.5.2
MinkowskiEngine compiled with CUDA Support: True
NVCC version MinkowskiEngine is compiled: 11010
CUDART version MinkowskiEngine is compiled: 11010

I first removed the previous conda environment and created a new one using conda.
Then I ran the following commands:

conda install openblas-devel -c anaconda

conda install pytorch=1.8.1 torchvision cudatoolkit=11.1 -c pytorch -c conda-forge

pip install -U git+https://github.com/NVIDIA/MinkowskiEngine -v --no-deps --install-option="--blas_include_dirs=${CONDA_PREFIX}/include" --install-option="--blas=openblas"

python test.py
/home/ps/anaconda3/envs/rigid_3dsf/lib/python3.8/site-packages/MinkowskiEngine-0.5.2-py3.8-linux-x86_64.egg/MinkowskiEngine/__init__.py:36: UserWarning: The environment variable OMP_NUM_THREADS not set. MinkowskiEngine will automatically set OMP_NUM_THREADS=16. If you want to set OMP_NUM_THREADS manually, please export it on the command line before running a python script. e.g. export OMP_NUM_THREADS=12; python your_program.py. It is recommended to set it below 24.
warnings.warn(
tensor(8188)
torch.Size([8188, 3])
torch.Size([8188, 3])
tensor(8188, device='cuda:0')
torch.Size([8188, 3])
torch.Size([0, 3])

chrischoy · 2021-04-08T00:12:47Z

Sorry, I misread the issue. I assumed the cudaIllegalMemoryAccess was the problem. Yes, I was able to reproduce this error. Let me get back to you ASAP.

chrischoy · 2021-04-08T00:54:09Z

TLDR: This is an error in pytorch (v1.8.X + CUDA11.X) which affects many other custom C extension libraries.

On pytorch 1.8.1 + cuda 11.1

import MinkowskiEngine as ME
import torch

coordinates = torch.rand(8192,3) * 200
bcoords, bfeats = coordinates.cuda(), coordinates.cuda()
print(bcoords, bfeats)  # without print, it works fine... print seems to be triggering something
ME.SparseTensor(bfeats, bcoords)

The full log for the above script with ME debug installation is

...

/home/chrischoy/projects/MinkowskiEngine/src/coordinate_map_gpu.cu:225 nm_threads 8192
/home/chrischoy/projects/MinkowskiEngine/src/coordinate_map_gpu.cu:227 nm_blocks 64
/home/chrischoy/projects/MinkowskiEngine/src/coordinate_map_gpu.cu:229 unused_key 4294967295
CUDA error 101 [/usr/local/cuda-11.1/include/cub/block/../iterator/../util_device.cuh, 471]: invalid device ordinal
CUDA error 101 [/usr/local/cuda-11.1/include/cub/device/dispatch/dispatch_reduce.cuh, 653]: invalid device ordinal
/home/chrischoy/projects/MinkowskiEngine/src/storage.cuh:80 Deallocating 2 gpu storage at 0x7fd1c62e0000
/home/chrischoy/projects/MinkowskiEngine/src/storage.cuh:80 Deallocating 0 gpu storage at 0
/home/chrischoy/projects/MinkowskiEngine/src/storage.cuh:80 Deallocating 8192 gpu storage at 0x7fd1c62e8200
/home/chrischoy/projects/MinkowskiEngine/src/storage.cuh:80 Deallocating 8192 gpu storage at 0x7fd1c62e0200
Traceback (most recent call last):
  File "test330.py", line 7, in <module>
    ME.SparseTensor(bfeats, bcoords)
  File "/home/chrischoy/projects/MinkowskiEngine/MinkowskiEngine/MinkowskiSparseTensor.py", line 269, in __init__
    coordinates, features, coordinate_map_key = self.initialize_coordinates(
  File "/home/chrischoy/projects/MinkowskiEngine/MinkowskiEngine/MinkowskiSparseTensor.py", line 294, in initialize_coordinates
    ) = self._manager.insert_and_map(coordinates, *coordinate_map_key.get_key())
  File "/home/chrischoy/projects/MinkowskiEngine/MinkowskiEngine/MinkowskiCoordinateManager.py", line 179, in insert_and_map
    return self._manager.insert_and_map(coordinates, tensor_stride, string_id)
RuntimeError: after reduction step 1: cudaErrorInvalidDevice: invalid device ordinal

The invalid device ordinal error should not be triggered.

A related issue happens also on these libraries with pytorch 1.8.x + CUDA 11.X

This is a pytorch error which probably will be fixed in the next update. In the meantime, I'll update the readme and recommend

pytorch 1.8.1 + CUDA 10.2
pytorch 1.7.1 + CUDA 11.X

but not

pytorch 1.8.1 + CUDA 11.X
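
A minimal sketch of how a script could warn about the combination listed above before constructing any sparse tensors. `is_affected_combo` is a hypothetical helper, not part of MinkowskiEngine or pytorch. It encodes only what this thread reports (pytorch 1.8.x together with CUDA 11.x); in a real script the two strings would typically come from `torch.__version__` and `torch.version.cuda`.

```python
def is_affected_combo(torch_version: str, cuda_version: str) -> bool:
    """True for the combination reported as broken here: pytorch 1.8.x with CUDA 11.x."""
    # Drop local build suffixes such as "+cu111" before parsing.
    release = torch_version.split("+")[0]
    torch_major, torch_minor = (int(part) for part in release.split(".")[:2])
    cuda_major = int(cuda_version.split(".")[0])
    return (torch_major, torch_minor) == (1, 8) and cuda_major == 11


for torch_v, cuda_v in [("1.8.1", "10.2"), ("1.7.1", "11.0"), ("1.8.1", "11.1")]:
    status = "affected" if is_affected_combo(torch_v, cuda_v) else "ok"
    print(f"pytorch {torch_v} + CUDA {cuda_v}: {status}")
```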

zhaopku · 2021-05-30T12:03:20Z

Are there any updates on this? I have an RTX 3090, which is only compatible with CUDA 11.1+.

snuffle-PX · 2021-07-15T08:01:09Z

Does pytorch 1.9 + CUDA 11.1 fix this problem? Thanks.

san-santra · 2021-07-27T16:52:32Z

I have tried running the code given in the previous posts, and it runs fine. So I guess this is fixed.

==========System==========
Linux-5.8.0-49-generic-x86_64-with-glibc2.17
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=20.04
DISTRIB_CODENAME=focal
DISTRIB_DESCRIPTION="Ubuntu 20.04.2 LTS"
3.8.10 (default, Jun  4 2021, 15:09:15) 
[GCC 7.5.0]
==========Pytorch==========
1.9.0
torch.cuda.is_available(): True
==========NVIDIA-SMI==========
/usr/bin/nvidia-smi
Driver Version 460.56
CUDA Version 11.2
VBIOS Version 88.00.4F.00.04
Image Version G500.0200.00.03
==========NVCC==========
/var/tmp/cuda-11.1/bin/nvcc
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2020 NVIDIA Corporation
Built on Mon_Oct_12_20:09:46_PDT_2020
Cuda compilation tools, release 11.1, V11.1.105
Build cuda_11.1.TC455_06.29190527_0
==========CC==========
/usr/bin/c++
c++ (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
Copyright (C) 2019 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

==========MinkowskiEngine==========
0.5.4
MinkowskiEngine compiled with CUDA Support: True
NVCC version MinkowskiEngine is compiled: 11010
CUDART version MinkowskiEngine is compiled: 11010

chrischoy · 2021-07-27T17:24:31Z

Great! I wasn't sure it was solved. I'll close the ticket now that I have confirmation that it's been resolved.

zgojcic mentioned this issue Mar 14, 2021

Invalid in_feat_size 0 with Cuda 11 zgojcic/Rigid3DSceneFlow#1

Closed

chrischoy added the Bug in other libraries label Apr 8, 2021

chrischoy added a commit that referenced this issue Apr 8, 2021

installation instr updates for issue #330

81b0115

chrischoy added a commit that referenced this issue Apr 8, 2021

installation instr updates for issue #330

73fe866

chrischoy mentioned this issue Apr 8, 2021

Cuda error while initializing SparseTensor #338

Closed

taochenshh mentioned this issue May 8, 2021

copy_if failed to synchronize: cudaErrorIllegalAddress: an illegal memory access was encountered #350

Closed

chrischoy closed this as completed Jul 27, 2021

wangfudong mentioned this issue Feb 22, 2022

Dose ME 0.5.4 support A100 ? #445

Closed

Tanazzah pushed a commit to Tanazzah/MinkowskiEngine that referenced this issue Feb 9, 2024

installation instr updates for issue NVIDIA#330

3b987ca
