PTX JIT compiler failed #314

williamFalcon · 2019-05-16T17:25:43Z

I installed as such:

pytorch 1.1.
torchvision 2.1
(both from conda -c pytorch).
built apex.

And now I'm getting this error:

  File "/home/waf/miniconda3/envs/fisherman/lib/python3.7/site-packages/apex/multi_tensor_apply/multi_tensor_apply.py", line 30, in __call__
    *args)
RuntimeError: CUDA error: a PTX JIT compilation failed (multi_tensor_apply at csrc/multi_tensor_apply.cuh:101)
frame #0: std::function<std::string ()>::operator()() const + 0x11 (0x7f119a22c441 in /home/waf/miniconda3/envs/fisherman/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x2a (0x7f119a22bd7a in /home/waf/miniconda3/envs/fisherman/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #2: void multi_tensor_apply<2, ScaleFunctor<c10::Half, float>, float>(int, int, at::Tensor const&, std::vector<std::vector<at::Tensor, std::allocator<at::Tensor> >, std::allocator<std::vector<at::Tensor, std::allocator<at::Tensor> > > > const&, ScaleFunctor<c10::Half, float>, float) + 0x2c71 (0x7f112bd57a01 in /home/waf/miniconda3/envs/fisherman/lib/python3.7/site-packages/amp_C.cpython-37m-x86_64-linux-gnu.so)
frame #3: multi_tensor_scale_cuda(int, at::Tensor, std::vector<std::vector<at::Tensor, std::allocator<at::Tensor> >, std::allocator<std::vector<at::Tensor, std::allocator<at::Tensor> > > >, float) + 0x3a8 (0x7f112bd4d9a8 in /home/waf/miniconda3/envs/fisherman/lib/python3.7/site-packages/amp_C.cpython-37m-x86_64-linux-gnu.so)
frame #4: <unknown function> + 0x15bac (0x7f112bd4cbac in /home/waf/miniconda3/envs/fisherman/lib/python3.7/site-packages/amp_C.cpython-37m-x86_64-linux-gnu.so)
frame #5: <unknown function> + 0x15c6e (0x7f112bd4cc6e in /home/waf/miniconda3/envs/fisherman/lib/python3.7/site-packages/amp_C.cpython-37m-x86_64-linux-gnu.so)
frame #6: <unknown function> + 0x12277 (0x7f112bd49277 in /home/waf/miniconda3/envs/fisherman/lib/python3.7/site-packages/amp_C.cpython-37m-x86_64-linux-gnu.so)
<omitting python frames>
frame #45: __libc_start_main + 0xf1 (0x7f119ea3c2e1 in /lib/x86_64-linux-gnu/libc.so.6)

ngimel · 2019-05-16T17:37:34Z

What is your environment, i.e. cuda version, nvidia driver version?

williamFalcon · 2019-05-16T17:41:10Z

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Fri_Feb__8_19:08:17_PST_2019
Cuda compilation tools, release 10.1, V10.1.105

conda install pytorch torchvision cudatoolkit=10.0 -c pytorch

ngimel · 2019-05-16T18:04:38Z

And driver version? Does pytorch w/o apex work? E.g. python -c "import torch; a=torch.randn(3).cuda()"

williamFalcon · 2019-05-16T18:05:59Z

yes, it works without it.

| NVIDIA-SMI 410.72       Driver Version: 410.72       CUDA Version: 10.0     |

ngimel · 2019-05-16T18:30:29Z

Your pytorch is compiled for cuda 10.0 and your driver (410.72) supports up to cuda 10.0, yet the compiler you used to build apex is 10.1. Code built with the 10.1 toolkit needs a newer driver, so your driver version is insufficient for this.
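
To make the mismatch concrete, here is a minimal sketch of the driver check described above. The minimum driver numbers are assumed from NVIDIA's CUDA release notes for Linux and cover only the two toolkit versions in this thread; `driver_supports_toolkit` is a hypothetical helper, not part of apex or PyTorch.

```python
# Minimum Linux driver version per CUDA toolkit release.
# Assumption: values taken from NVIDIA's CUDA release notes; verify for your platform.
MIN_DRIVER = {
    "10.0": (410, 48),
    "10.1": (418, 39),
}


def driver_supports_toolkit(driver_version, toolkit_release):
    """Return True if the driver is new enough for the given toolkit release."""
    # Compare only the first two components, e.g. "410.72" -> (410, 72).
    driver = tuple(int(part) for part in driver_version.split(".")[:2])
    return driver >= MIN_DRIVER[toolkit_release]


if __name__ == "__main__":
    # Driver reported by nvidia-smi in this thread.
    print(driver_supports_toolkit("410.72", "10.0"))
    print(driver_supports_toolkit("410.72", "10.1"))
```

With the 410.72 driver from this thread, the first check passes and the second fails, which is consistent with the PTX JIT error appearing only in the apex extension built with 10.1.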

williamFalcon · 2019-05-16T18:32:20Z

can I compile apex with 10?
do I need to compile pytorch with 10.1 ?

ngimel · 2019-05-16T19:17:16Z

Please compile apex with cuda 10.

williamFalcon · 2019-05-16T19:20:15Z

ok, how would I do that? is there a different flag?

ngimel · 2019-05-16T19:22:33Z

You have to have cuda toolkit 10.0 installed somewhere on your system. nvcc --version should return 10.0.130 or something like this.
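
A small sketch of that check, assuming `nvcc` prints a line containing `release X.Y` as in the output posted earlier in this thread. The function names are hypothetical and only the standard library is used.

```python
import re
import subprocess


def parse_nvcc_release(nvcc_output):
    """Extract 'X.Y' from the 'release X.Y' text in `nvcc --version` output, or None."""
    match = re.search(r"release (\d+\.\d+)", nvcc_output)
    return match.group(1) if match else None


def installed_nvcc_release(nvcc="nvcc"):
    """Run `nvcc --version` and return its release, or None if nvcc cannot be run."""
    try:
        result = subprocess.run(
            [nvcc, "--version"],
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            universal_newlines=True,
        )
    except OSError:
        # nvcc is not on PATH or is not executable.
        return None
    return parse_nvcc_release(result.stdout)


if __name__ == "__main__":
    print("nvcc release:", installed_nvcc_release())
```

Passing a full path such as `/usr/local/cuda/bin/nvcc` as the `nvcc` argument checks one specific toolkit instead of whichever is first on PATH.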

williamFalcon · 2019-05-16T19:27:26Z

compile with this?

CUDA_HOME=<path to the cuda toolkit matching the version installed by conda> python setup.py install --cuda_ext --cpp_ext

But which one do I use here:

(fisherman) waf@gpu-8:~/Developer/apex$ conda list|grep cuda
cudatoolkit               10.0.130                      0  
pytorch                   1.1.0           py3.7_cuda10.0.130_cudnn7.5.1_0    pytorch

or

(fisherman) waf@gpu-8:~/Developer/fisherman/fisherman/apex$ which nvcc
/usr/local/cuda/bin/nvcc

williamFalcon · 2019-05-16T19:42:47Z

solved... I found the cuda version by running:

find / -type d -name cuda 2>/dev/null

Which gave me a bunch of options. I picked this one:

/home/waf/miniconda3/pkgs/pytorch-1.1.0-py3.5_cuda10.0.130_cudnn7.5.1_0/lib/python3.5/site-packages/torch/cuda

then finally compiled apex

# clean up old apex install
pip uninstall apex
pip uninstall apex
cd apex/
rm -rf apex.egg-info
rm -rf build

# compile new
# use the super long path you found above
CUDA_HOME=/super/long/path/from/previous/step python setup.py install --cuda_ext --cpp_ext

ldhai · 2019-06-01T13:47:44Z

I used this command:

CUDA_HOME=/home/wys/anaconda3/pkgs/pytorch-1.1.0-py3.7_cuda9.0.176_cudnn7.5.1_0/lib/python3.7/site-packages/torch/cuda python setup.py install --cuda_ext --cpp_ext

But it fails with an error.

I hope someone can help me solve this problem.

ptrblck · 2019-06-03T10:28:01Z

@ldhai
PyTorch does not ship with the cuda compiler (nvcc), so you need to install the matching CUDA version on your system manually. Have a look at this install guide.
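
One way to tell whether a directory is usable as CUDA_HOME is to look for the compiler inside it. This is a minimal sketch that assumes the usual toolkit layout with `nvcc` under `bin/`; `has_nvcc` is a hypothetical helper.

```python
import os


def has_nvcc(cuda_home):
    """Return True if cuda_home/bin/nvcc exists and is executable."""
    nvcc = os.path.join(cuda_home, "bin", "nvcc")
    return os.path.isfile(nvcc) and os.access(nvcc, os.X_OK)


if __name__ == "__main__":
    # Candidate locations: the default system path and whatever CUDA_HOME is set to.
    for candidate in ("/usr/local/cuda", os.environ.get("CUDA_HOME", "")):
        if candidate:
            print(candidate, has_nvcc(candidate))
```

A directory for which this returns False contains no `bin/nvcc`, so pointing CUDA_HOME at it will not supply a compiler for the extension build.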

ldhai · 2019-06-04T05:45:44Z

OK, thanks.

mcarilli · 2019-06-04T15:03:40Z

You should install a version of the cuda toolkit that matches the version used to compile your pytorch binaries. You can query the latter via torch.version.cuda. For example, if torch.version.cuda is 10.0.xyz, install a cuda 10.0 toolkit.
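
The comparison can be sketched as follows, assuming that agreement on major and minor version is what matters (as in the 10.0.xyz example above). The helper names are hypothetical, and the `torch` import is guarded so the snippet also runs where PyTorch is not installed.

```python
def cuda_major_minor(version):
    """Reduce a version string such as '10.0.130' to the tuple (10, 0)."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def toolkit_matches_torch(torch_cuda, toolkit_version):
    """True when both versions agree on major and minor."""
    return cuda_major_minor(torch_cuda) == cuda_major_minor(toolkit_version)


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        torch = None
    # torch.version.cuda is None for CPU-only builds.
    if torch is not None and torch.version.cuda is not None:
        print("torch built with CUDA", torch.version.cuda)
        print("matches a 10.1 toolkit:", toolkit_matches_torch(torch.version.cuda, "10.1"))
```

For the setup in this thread, `toolkit_matches_torch("10.0.130", "10.1")` is False, which is the mismatch that has to be resolved before building apex.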

beibuwandeluori · 2020-05-27T07:27:09Z

You can update your torch version to torch-1.5.0! I solved this problem by updating my version from 1.3.0 to 1.5.0.

ngimel closed this as completed May 16, 2019

mcarilli mentioned this issue May 22, 2019

Hard error on mismatch between torch.version.cuda and the Cuda toolkit version being used to compile Apex #323

Merged
