Cuda 11.1 - Coordinate manager #330

Closed
zgojcic opened this issue Mar 13, 2021 · 16 comments

@zgojcic

zgojcic commented Mar 13, 2021

Hi Chris,

I have stumbled onto the following problem when using ME 0.5.1 or 0.5.2 with Cuda 11.1:

  File "/home/zgojcic/anaconda3/envs/rigid_3dsf/lib/python3.7/site-packages/MinkowskiEngine-0.5.1-py3.7-linux-x86_64.egg/MinkowskiEngine/MinkowskiConvolution.py", line 84, in forward
    coordinate_manager._manager,
RuntimeError: /home/zgojcic/Documents/Rigid3DSceneFlow/MinkowskiEngine/src/convolution_gpu.cu:85, assertion (in_feat.size(0) == p_map_manager->size(in_key)) failed. Invalid in_feat size 0 != 5296

Note that the same code works perfectly fine with Cuda 10.2. I am sorry that I do not have a very compact working example, but the error occurs when running the code available in https://github.com/zgojcic/Rigid3DSceneFlow. For example when running the following evaluation:

python eval.py ./configs/eval/eval_lidar_kitti.yaml

If you actually want to run the code you also have to download the dataset, but it is very small (see the repo). If I can help you somehow or should provide more information, please let me know.

Best,
Zan

Diagnostic from one of the computers that I have used (I have observed the same error on three computers running either ME 0.5.1 or 0.5.2):

==========System==========
Linux-5.4.0-66-generic-x86_64-with-debian-buster-sid
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=18.04
DISTRIB_CODENAME=bionic
DISTRIB_DESCRIPTION="Ubuntu 18.04.5 LTS"
3.7.10 (default, Feb 26 2021, 18:47:35)
[GCC 7.3.0]
==========Pytorch==========
1.8.0
torch.cuda.is_available(): True
==========NVIDIA-SMI==========
/usr/bin/nvidia-smi
Driver Version 455.32.00
CUDA Version 11.1
VBIOS Version 88.00.41.00.18
Image Version G001.0000.01.04
==========NVCC==========
sh: 1: nvcc: not found
==========CC==========
CC=g++-7
/usr/bin/g++-7
g++-7 (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
Copyright (C) 2017 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

==========MinkowskiEngine==========
0.5.1
MinkowskiEngine compiled with CUDA Support: True
NVCC version MinkowskiEngine is compiled: 11010
CUDART version MinkowskiEngine is compiled: 11010
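
A side note on reading the last two lines of the diagnostic: the value 11010 looks like CUDA's integer version encoding (major * 1000 + minor * 10), which would make it CUDA 11.1. The sketch below decodes that format. decode_cuda_version is a hypothetical helper name, not part of MinkowskiEngine, and it assumes the diagnostic prints the value in that encoding.

```python
def decode_cuda_version(encoded: int) -> tuple:
    """Split a CUDA integer version (major * 1000 + minor * 10) into (major, minor).

    Hypothetical helper for reading the diagnostic output; not a MinkowskiEngine API.
    """
    major, rest = divmod(encoded, 1000)
    minor = rest // 10
    return major, minor


print(decode_cuda_version(11010))  # (11, 1)
print(decode_cuda_version(11000))  # (11, 0)
```

Read this way, a build reporting 11010 was compiled against CUDA 11.1, and one reporting 11000 against CUDA 11.0.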

@chrischoy
Contributor

The error says that you fed a zero-length feature matrix. You might want to put a breakpoint (import ipdb; ipdb.set_trace()) before the line that raised the error and make sure the inputs are what you expect.
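
As a lighter alternative to a breakpoint, a small check placed right before the failing call can surface an empty or mismatched feature matrix early. This is only a sketch: check_sparse_inputs is a hypothetical helper, not part of MinkowskiEngine, and it assumes the inputs expose a .shape tuple the way torch tensors and numpy arrays do. The demo uses stand-in objects with the sizes from the traceback so that it runs without a GPU.

```python
from types import SimpleNamespace


def check_sparse_inputs(features, coordinates):
    """Raise early if the feature matrix is empty or its row count differs
    from the coordinate matrix. Returns the shared row count otherwise.

    Hypothetical helper; only reads the .shape attribute of its arguments.
    """
    n_feat = features.shape[0]
    n_coord = coordinates.shape[0]
    if n_feat == 0:
        raise ValueError(f"empty feature matrix; coordinates have {n_coord} rows")
    if n_feat != n_coord:
        raise ValueError(f"row mismatch: {n_feat} features vs {n_coord} coordinates")
    return n_feat


# Stand-ins that only carry a shape, mirroring "Invalid in_feat size 0 != 5296".
feats = SimpleNamespace(shape=(0, 3))
coords = SimpleNamespace(shape=(5296, 4))
try:
    check_sparse_inputs(feats, coords)
except ValueError as err:
    print(err)
```

With real data the same call would take the feature and coordinate tensors that are about to be passed to ME.SparseTensor.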

@zgojcic
Author

zgojcic commented Mar 14, 2021

Exactly the same code runs on the same computer if I use Cuda 10.2 with the same ME version, so I assume it is a combination of the Cuda 11.1 with ME.

If I debug the code step by step, the error actually happens earlier, when I construct the sparse tensor:

        sinput1 = ME.SparseTensor(features=input_dict['sinput_s_F'].to(self.device),
            coordinates=input_dict['sinput_s_C'].to(self.device))

the error message is:

 terminate called after throwing an instance of 'thrust::system::system_error'
  what():  CUDA free failed: cudaErrorIllegalAddress: an illegal memory access was encountered

The inputs are generated with

coords_batch1, feats_batch1 = ME.utils.sparse_collate(coords=coords1, feats=feats1)

but the batch size is one and a single worker is used in the data loader. The dimension of the inputs is [5296,3] and [5296,4] respectively.

I have tried to create a minimal working example, but if I just pass random values to the sparse tensor it works without an error.

@chrischoy
Contributor

It would be great if you can prepare a self-contained code for debugging.

@zgojcic
Author

zgojcic commented Mar 14, 2021

So the following example should show the problem. On my machine (the diagnostic is in the first post) it returns:

tensor(8188)
torch.Size([8188, 3])
torch.Size([8188, 3])

tensor(8188, device='cuda:0')
torch.Size([8188, 3])
torch.Size([0, 3])

I think something goes wrong when the features are passed to ME.SparseTensor: for example, I also cannot call print(sinput1.F), because Python just hangs in that case. I hope this helps.

For reference, the same code with Cuda 10.2 returns

tensor(8188)
torch.Size([8188, 3])
torch.Size([8188, 3])

tensor(8188, device='cuda:0')
torch.Size([8188, 3])
torch.Size([8188, 3])

Thank you in advance for your help

import torch
import MinkowskiEngine as ME
import numpy as np

pc_1 = np.random.rand(8192,3) * 20
voxel_size = 0.1

# Voxelization
_, sel1 = ME.utils.sparse_quantize(pc_1 / voxel_size, return_index=True)

# Select the voxelized points
pc_1 = pc_1[sel1,:]

# Get sparse indices
coords1 = np.floor(pc_1 / voxel_size)

# Use absolute features as input
feats1 = coords1.copy()

coords_batch1, feats_batch1 = ME.utils.sparse_collate(coords=[coords1], feats=[feats1])

sinput1 = ME.SparseTensor(features=feats_batch1,
            coordinates=coords_batch1)
sinput1_cuda = ME.SparseTensor(features=feats_batch1.to('cuda'),
            coordinates=coords_batch1.to('cuda'))

for b_idx in range(len(sinput1.decomposed_coordinates)):
    feat_s = sinput1.F[sinput1.C[:,0] == b_idx]

    print(sum(sinput1.C[:,0] == b_idx))
    print(sinput1.F.shape)
    print(feat_s.shape)

for b_idx in range(len(sinput1_cuda.decomposed_coordinates)):
    feat_s_cuda = sinput1_cuda.F[sinput1_cuda.C[:,0] == b_idx]

    print(sum(sinput1_cuda.C[:,0] == b_idx))
    print(sinput1_cuda.F.shape)
    print(feat_s_cuda.shape)

@eldar

eldar commented Mar 22, 2021

Hi! Any chance this issue could be looked at? I am using an NVIDIA 3000 series GPU which only runs CUDA 11 and therefore I cannot use Minkowski Engine.

@ShengyuH

Just FYI, the code snippet works on my Machine: MinkowskiEngine==0.5.0, Cuda 11.2, GeForce RTX 3090.

@eldar

eldar commented Mar 26, 2021

I ran the snippet on CUDA 11.2 (ME==0.5.1, RTX 3090) and I still get the same error. ME==0.5.0 wouldn't compile.

@victoryc

What I ran into in #338 is perhaps the same issue as this?

@chrischoy
Contributor

chrischoy commented Apr 6, 2021

Hmm, can't replicate the error on the latest master. Both with CUDA 11.0 and CUDA 11.1.

My environments are

python -c "import MinkowskiEngine; MinkowskiEngine.print_diagnostics()"
==========System==========
Linux-5.4.0-67-generic-x86_64-with-glibc2.10
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=20.04
DISTRIB_CODENAME=focal
DISTRIB_DESCRIPTION="Ubuntu 20.04.1 LTS"
3.8.8 (default, Feb 24 2021, 21:46:12)
[GCC 7.3.0]
==========Pytorch==========
1.8.1
torch.cuda.is_available(): True
==========NVIDIA-SMI==========
/usr/bin/nvidia-smi
Driver Version 460.39
CUDA Version 11.2
VBIOS Version 90.02.2E.00.0C
Image Version G001.0000.02.04
==========NVCC==========
/usr/local/cuda/bin/nvcc
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2020 NVIDIA Corporation
Built on Wed_Jul_22_19:09:09_PDT_2020
Cuda compilation tools, release 11.0, V11.0.221
Build cuda_11.0_bu.TC445_37.28845127_0
==========CC==========
CC=g++
/usr/bin/g++
g++ (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
Copyright (C) 2019 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

==========MinkowskiEngine==========
0.5.2
MinkowskiEngine compiled with CUDA Support: True
NVCC version MinkowskiEngine is compiled: 11000
CUDART version MinkowskiEngine is compiled: 11000

and

==========System==========
Linux-5.4.0-67-generic-x86_64-with-glibc2.10
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=20.04
DISTRIB_CODENAME=focal
DISTRIB_DESCRIPTION="Ubuntu 20.04.1 LTS"
3.8.8 (default, Feb 24 2021, 21:46:12)
[GCC 7.3.0]
==========Pytorch==========
1.8.1
torch.cuda.is_available(): True
==========NVIDIA-SMI==========
/usr/bin/nvidia-smi
Driver Version 460.39
CUDA Version 11.2
VBIOS Version 90.02.2E.00.0C
Image Version G001.0000.02.04
==========NVCC==========
/usr/local/cuda/bin/nvcc
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2020 NVIDIA Corporation
Built on Mon_Oct_12_20:09:46_PDT_2020
Cuda compilation tools, release 11.1, V11.1.105
Build cuda_11.1.TC455_06.29190527_0
==========CC==========
CC=g++
/usr/bin/g++
g++ (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
Copyright (C) 2019 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

==========MinkowskiEngine==========
0.5.2
MinkowskiEngine compiled with CUDA Support: True
NVCC version MinkowskiEngine is compiled: 11010
CUDART version MinkowskiEngine is compiled: 11010

@zmlshiwo

zmlshiwo commented Apr 7, 2021

Hey Chris (@chrischoy), I can still reproduce this error on a 3090 GPU using the latest master, with the following environment:

==========System==========
Linux-5.8.0-44-generic-x86_64-with-glibc2.10
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=20.04
DISTRIB_CODENAME=focal
DISTRIB_DESCRIPTION="Ubuntu 20.04.2 LTS"
3.8.8 (default, Feb 24 2021, 21:46:12)
[GCC 7.3.0]
==========Pytorch==========
1.8.1
torch.cuda.is_available(): True
==========NVIDIA-SMI==========
/usr/bin/nvidia-smi
Driver Version 460.39
CUDA Version 11.2
VBIOS Version 94.02.26.88.3C
Image Version G001.0000.03.03
==========NVCC==========
/usr/local/cuda-11.1/bin/nvcc
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2020 NVIDIA Corporation
Built on Tue_Sep_15_19:10:02_PDT_2020
Cuda compilation tools, release 11.1, V11.1.74
Build cuda_11.1.TC455_06.29069683_0
==========CC==========
CC=g++-7
/usr/bin/g++-7
g++-7 (Ubuntu 7.5.0-6ubuntu2) 7.5.0
Copyright (C) 2017 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

==========MinkowskiEngine==========
0.5.2
MinkowskiEngine compiled with CUDA Support: True
NVCC version MinkowskiEngine is compiled: 11010
CUDART version MinkowskiEngine is compiled: 11010

I first removed the previous conda environment and created a new one with conda.
Then I ran the following commands:

conda install openblas-devel -c anaconda

conda install pytorch=1.8.1 torchvision cudatoolkit=11.1 -c pytorch -c conda-forge

pip install -U git+https://github.com/NVIDIA/MinkowskiEngine -v --no-deps --install-option="--blas_include_dirs=${CONDA_PREFIX}/include" --install-option="--blas=openblas"

python test.py
/home/ps/anaconda3/envs/rigid_3dsf/lib/python3.8/site-packages/MinkowskiEngine-0.5.2-py3.8-linux-x86_64.egg/MinkowskiEngine/__init__.py:36: UserWarning: The environment variable OMP_NUM_THREADS not set. MinkowskiEngine will automatically set OMP_NUM_THREADS=16. If you want to set OMP_NUM_THREADS manually, please export it on the command line before running a python script. e.g. export OMP_NUM_THREADS=12; python your_program.py. It is recommended to set it below 24.
warnings.warn(
tensor(8188)
torch.Size([8188, 3])
torch.Size([8188, 3])
tensor(8188, device='cuda:0')
torch.Size([8188, 3])
torch.Size([0, 3])

@chrischoy
Contributor

chrischoy commented Apr 8, 2021

Sorry, I misread the issue. I assumed the cudaIllegalMemoryAccess was the problem. Yes, I was able to reproduce this error. Let me get back to you ASAP.

@chrischoy
Contributor

chrischoy commented Apr 8, 2021

TLDR: This is an error in pytorch (v1.8.X + CUDA11.X) which affects many other custom C extension libraries.

On pytorch 1.8.1 + cuda 11.1

import MinkowskiEngine as ME
import torch

coordinates = torch.rand(8192,3) * 200
bcoords, bfeats = coordinates.cuda(), coordinates.cuda()
print(bcoords, bfeats)  # without print, it works fine... print seems to be triggering something
ME.SparseTensor(bfeats, bcoords)

The full log for the above script with ME debug installation is

...

/home/chrischoy/projects/MinkowskiEngine/src/coordinate_map_gpu.cu:225 nm_threads 8192
/home/chrischoy/projects/MinkowskiEngine/src/coordinate_map_gpu.cu:227 nm_blocks 64
/home/chrischoy/projects/MinkowskiEngine/src/coordinate_map_gpu.cu:229 unused_key 4294967295
CUDA error 101 [/usr/local/cuda-11.1/include/cub/block/../iterator/../util_device.cuh, 471]: invalid device ordinal
CUDA error 101 [/usr/local/cuda-11.1/include/cub/device/dispatch/dispatch_reduce.cuh, 653]: invalid device ordinal
/home/chrischoy/projects/MinkowskiEngine/src/storage.cuh:80 Deallocating 2 gpu storage at 0x7fd1c62e0000
/home/chrischoy/projects/MinkowskiEngine/src/storage.cuh:80 Deallocating 0 gpu storage at 0
/home/chrischoy/projects/MinkowskiEngine/src/storage.cuh:80 Deallocating 8192 gpu storage at 0x7fd1c62e8200
/home/chrischoy/projects/MinkowskiEngine/src/storage.cuh:80 Deallocating 8192 gpu storage at 0x7fd1c62e0200
Traceback (most recent call last):
  File "test330.py", line 7, in <module>
    ME.SparseTensor(bfeats, bcoords)
  File "/home/chrischoy/projects/MinkowskiEngine/MinkowskiEngine/MinkowskiSparseTensor.py", line 269, in __init__
    coordinates, features, coordinate_map_key = self.initialize_coordinates(
  File "/home/chrischoy/projects/MinkowskiEngine/MinkowskiEngine/MinkowskiSparseTensor.py", line 294, in initialize_coordinates
    ) = self._manager.insert_and_map(coordinates, *coordinate_map_key.get_key())
  File "/home/chrischoy/projects/MinkowskiEngine/MinkowskiEngine/MinkowskiCoordinateManager.py", line 179, in insert_and_map
    return self._manager.insert_and_map(coordinates, tensor_stride, string_id)
RuntimeError: after reduction step 1: cudaErrorInvalidDevice: invalid device ordinal

The invalid device ordinal should not be triggered.

A related issue happens also on these libraries with pytorch 1.8.x + CUDA 11.X

This is a pytorch error which probably will be fixed in the next update. In the meantime, I'll update the readme and recommend

  • pytorch 1.8.1 + CUDA 10.2
  • pytorch 1.7.1 + CUDA 11.X

but not

  • pytorch 1.8.1 + CUDA 11.X
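
The recommendation above can be turned into a quick guard for scripts. The following is a minimal sketch under the assumption that only the pairing named in this comment (PyTorch 1.8.x with CUDA 11.x) should be flagged; is_affected_combo is a hypothetical name and the function takes plain version strings so that it runs without PyTorch installed.

```python
from typing import Optional


def is_affected_combo(torch_version: str, cuda_version: Optional[str]) -> bool:
    """Return True for the pairing advised against here: PyTorch 1.8.x + CUDA 11.x.

    Hypothetical helper. Version strings may carry a local suffix such as
    "1.8.1+cu111"; cuda_version may be None for a CPU-only build.
    """
    if cuda_version is None:
        return False
    # Keep only "major.minor" of the PyTorch version, dropping any "+cu111" suffix.
    torch_major_minor = tuple(int(p) for p in torch_version.split("+")[0].split(".")[:2])
    cuda_major = int(cuda_version.split(".")[0])
    return torch_major_minor == (1, 8) and cuda_major == 11


print(is_affected_combo("1.8.1", "11.1"))  # True
print(is_affected_combo("1.8.1", "10.2"))  # False
print(is_affected_combo("1.7.1", "11.0"))  # False
```

In a real environment the two strings could be taken from torch.__version__ and torch.version.cuda.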

@zhaopku

zhaopku commented May 30, 2021

Are there any updates on this? I have an RTX 3090, which is only compatible with CUDA 11.1+.

@snuffle-PX

Does pytorch 1.9 + CUDA 11.1 fix this problem? Thanks.

@san-santra

san-santra commented Jul 27, 2021

I tried running the code given in the previous posts, and it runs fine. So I guess this is fixed.

==========System==========
Linux-5.8.0-49-generic-x86_64-with-glibc2.17
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=20.04
DISTRIB_CODENAME=focal
DISTRIB_DESCRIPTION="Ubuntu 20.04.2 LTS"
3.8.10 (default, Jun  4 2021, 15:09:15) 
[GCC 7.5.0]
==========Pytorch==========
1.9.0
torch.cuda.is_available(): True
==========NVIDIA-SMI==========
/usr/bin/nvidia-smi
Driver Version 460.56
CUDA Version 11.2
VBIOS Version 88.00.4F.00.04
Image Version G500.0200.00.03
==========NVCC==========
/var/tmp/cuda-11.1/bin/nvcc
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2020 NVIDIA Corporation
Built on Mon_Oct_12_20:09:46_PDT_2020
Cuda compilation tools, release 11.1, V11.1.105
Build cuda_11.1.TC455_06.29190527_0
==========CC==========
/usr/bin/c++
c++ (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
Copyright (C) 2019 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

==========MinkowskiEngine==========
0.5.4
MinkowskiEngine compiled with CUDA Support: True
NVCC version MinkowskiEngine is compiled: 11010
CUDART version MinkowskiEngine is compiled: 11010

@chrischoy
Contributor

Great! I wasn't sure it was solved. So I'll close the ticket since I got the confirmation that it's been resolved.

Tanazzah pushed a commit to Tanazzah/MinkowskiEngine that referenced this issue Feb 9, 2024