PTX JIT compiler failed #314

Closed
williamFalcon opened this issue May 16, 2019 · 16 comments

@williamFalcon

williamFalcon commented May 16, 2019

I installed as such:

  1. pytorch 1.1.

  2. torchvision 2.1
    (both from conda -c pytorch).

  3. built apex.

And now I'm getting this error:

  File "/home/waf/miniconda3/envs/fisherman/lib/python3.7/site-packages/apex/multi_tensor_apply/multi_tensor_apply.py", line 30, in __call__
    *args)
RuntimeError: CUDA error: a PTX JIT compilation failed (multi_tensor_apply at csrc/multi_tensor_apply.cuh:101)
frame #0: std::function<std::string ()>::operator()() const + 0x11 (0x7f119a22c441 in /home/waf/miniconda3/envs/fisherman/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x2a (0x7f119a22bd7a in /home/waf/miniconda3/envs/fisherman/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #2: void multi_tensor_apply<2, ScaleFunctor<c10::Half, float>, float>(int, int, at::Tensor const&, std::vector<std::vector<at::Tensor, std::allocator<at::Tensor> >, std::allocator<std::vector<at::Tensor, std::allocator<at::Tensor> > > > const&, ScaleFunctor<c10::Half, float>, float) + 0x2c71 (0x7f112bd57a01 in /home/waf/miniconda3/envs/fisherman/lib/python3.7/site-packages/amp_C.cpython-37m-x86_64-linux-gnu.so)
frame #3: multi_tensor_scale_cuda(int, at::Tensor, std::vector<std::vector<at::Tensor, std::allocator<at::Tensor> >, std::allocator<std::vector<at::Tensor, std::allocator<at::Tensor> > > >, float) + 0x3a8 (0x7f112bd4d9a8 in /home/waf/miniconda3/envs/fisherman/lib/python3.7/site-packages/amp_C.cpython-37m-x86_64-linux-gnu.so)
frame #4: <unknown function> + 0x15bac (0x7f112bd4cbac in /home/waf/miniconda3/envs/fisherman/lib/python3.7/site-packages/amp_C.cpython-37m-x86_64-linux-gnu.so)
frame #5: <unknown function> + 0x15c6e (0x7f112bd4cc6e in /home/waf/miniconda3/envs/fisherman/lib/python3.7/site-packages/amp_C.cpython-37m-x86_64-linux-gnu.so)
frame #6: <unknown function> + 0x12277 (0x7f112bd49277 in /home/waf/miniconda3/envs/fisherman/lib/python3.7/site-packages/amp_C.cpython-37m-x86_64-linux-gnu.so)
<omitting python frames>
frame #45: __libc_start_main + 0xf1 (0x7f119ea3c2e1 in /lib/x86_64-linux-gnu/libc.so.6)

@ngimel
Contributor

ngimel commented May 16, 2019

What is your environment, i.e. cuda version, nvidia driver version?

@williamFalcon
Author

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Fri_Feb__8_19:08:17_PST_2019
Cuda compilation tools, release 10.1, V10.1.105
conda install pytorch torchvision cudatoolkit=10.0 -c pytorch

@ngimel
Contributor

ngimel commented May 16, 2019

And driver version? Does pytorch w/o apex work? E.g. python -c "import torch; a=torch.randn(3).cuda()"

@williamFalcon
Author

yes, it works without it.

| NVIDIA-SMI 410.72       Driver Version: 410.72       CUDA Version: 10.0     |

@ngimel
Contributor

ngimel commented May 16, 2019

Your pytorch is compiled for cuda 10.0 and your driver supports up to cuda 10.0, yet the compiler you used to build apex is 10.1. Your driver version is too old to run code compiled with 10.1.
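
The mismatch described here can be sketched as a plain version comparison. This is only an illustration under a simplified rule (the driver's maximum supported CUDA version must be at least the compiler's major.minor version); the helper names `version_tuple` and `driver_can_run` are hypothetical and not part of apex, PyTorch, or CUDA.

```python
def version_tuple(version):
    """Turn a string such as '10.1.105' into (10, 1); only major.minor is compared."""
    parts = version.split(".")
    return int(parts[0]), int(parts[1])


def driver_can_run(driver_cuda, compiler_cuda):
    """Simplified assumption: the driver's max CUDA version must be >= the compiler's."""
    return version_tuple(driver_cuda) >= version_tuple(compiler_cuda)


# Versions reported in this thread: the driver advertises CUDA 10.0.
print(driver_can_run("10.0", "10.1.105"))  # False: apex built with nvcc 10.1
print(driver_can_run("10.0", "10.0.130"))  # True: apex built with a 10.0 toolkit
```

With the numbers from this issue, the first call fails the check, which is consistent with the PTX JIT error only appearing once apex's extensions are used.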

@williamFalcon
Author

can I compile apex with 10?
do I need to compile pytorch with 10.1 ?

@ngimel
Contributor

ngimel commented May 16, 2019

Please compile apex with cuda 10.

@williamFalcon
Author

ok, how would I do that? is there a different flag?

@ngimel
Contributor

ngimel commented May 16, 2019

You have to have cuda toolkit 10.0 installed somewhere on your system. nvcc --version should return 10.0.130 or something like this.
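
A minimal sketch of reading the toolkit version out of `nvcc --version` output, in case it helps to script the check. The function name `parse_nvcc_release` is hypothetical, and the regular expression assumes the "release X.Y, VX.Y.Z" line format seen in the output posted earlier in this thread.

```python
import re


def parse_nvcc_release(output):
    """Return (release, full_version) from `nvcc --version` text, or None if not found."""
    match = re.search(r"release (\d+)\.(\d+), V(\d+\.\d+\.\d+)", output)
    if match is None:
        return None
    return "{}.{}".format(match.group(1), match.group(2)), match.group(3)


# Output copied from the comment above (a 10.1 toolkit).
sample = """nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Fri_Feb__8_19:08:17_PST_2019
Cuda compilation tools, release 10.1, V10.1.105"""

print(parse_nvcc_release(sample))  # ('10.1', '10.1.105')
```

On a real machine the text could come from `subprocess.run(["nvcc", "--version"], capture_output=True, text=True).stdout`, assuming `nvcc` is on the `PATH`.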

@williamFalcon
Author

williamFalcon commented May 16, 2019

compile with this?

CUDA_HOME=<path to the cuda toolkit with the same version as the one installed by conda> python setup.py install --cuda_ext --cpp_ext .

But which one do I use here:

(fisherman) waf@gpu-8:~/Developer/apex$ conda list|grep cuda
cudatoolkit               10.0.130                      0  
pytorch                   1.1.0           py3.7_cuda10.0.130_cudnn7.5.1_0    pytorch

or

(fisherman) waf@gpu-8:~/Developer/fisherman/fisherman/apex$ which nvcc
/usr/local/cuda/bin/nvcc

@williamFalcon
Author

solved... I found the cuda version by running:

find / -type d -name cuda 2>/dev/null

Which gave me a bunch of options. I picked this one:

/home/waf/miniconda3/pkgs/pytorch-1.1.0-py3.5_cuda10.0.130_cudnn7.5.1_0/lib/python3.5/site-packages/torch/cuda

then finally compiled apex

# clean up old apex install
pip uninstall apex
pip uninstall apex
cd apex/
rm -rf apex.egg-info
rm -rf build

# compile new
# use the super long path you found above
CUDA_HOME=/super/long/path/from/previous/step python setup.py install --cuda_ext --cpp_ext .

@ldhai

ldhai commented Jun 1, 2019

I used this command: CUDA_HOME=/home/wys/anaconda3/pkgs/pytorch-1.1.0-py3.7_cuda9.0.176_cudnn7.5.1_0/lib/python3.7/site-packages/torch/cuda python setup.py install --cuda_ext --cpp_ext
But it fails with an error:
[image]
I hope someone can help me solve this problem.

@ptrblck
Contributor

ptrblck commented Jun 3, 2019

@ldhai
PyTorch does not ship with the cuda compiler (nvcc), so you need to install the matching CUDA version on your system manually. Have a look at this install guide.
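
Since a usable `CUDA_HOME` has to contain the compiler, one rough way to screen candidate directories (for example the results of a `find / -type d -name cuda` search) is to check for `bin/nvcc`. This is a sketch under that assumption for a Linux layout; `looks_like_cuda_toolkit` and `filter_toolkits` are hypothetical helper names, and the directories in the demo are throwaway stand-ins, not real installs.

```python
import os
import tempfile


def looks_like_cuda_toolkit(path):
    """Heuristic: a usable CUDA_HOME contains the compiler at bin/nvcc."""
    return os.path.isfile(os.path.join(path, "bin", "nvcc"))


def filter_toolkits(candidates):
    """Keep only the candidate directories that contain bin/nvcc."""
    return [path for path in candidates if looks_like_cuda_toolkit(path)]


# Demo on a temporary directory tree (hypothetical layout).
with tempfile.TemporaryDirectory() as root:
    # Stand-in for a real toolkit install: has bin/nvcc.
    toolkit_dir = os.path.join(root, "cuda-10.0")
    os.makedirs(os.path.join(toolkit_dir, "bin"))
    open(os.path.join(toolkit_dir, "bin", "nvcc"), "w").close()

    # Stand-in for the torch/cuda Python package directory: no compiler inside.
    python_pkg_dir = os.path.join(root, "site-packages", "torch", "cuda")
    os.makedirs(python_pkg_dir)

    kept = filter_toolkits([toolkit_dir, python_pkg_dir])
    kept_names = [os.path.basename(path) for path in kept]

print(kept_names)  # ['cuda-10.0']
```

A directory that passes this check still needs to be the right version, so it is worth running its `bin/nvcc --version` before building.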

@ldhai

ldhai commented Jun 4, 2019

OK, thanks.

@mcarilli
Contributor

mcarilli commented Jun 4, 2019

You should install a version of the cuda toolkit that matches the version used to compile your pytorch binaries. You can query the latter via torch.version.cuda. For example, if torch.version.cuda is 10.0.xyz, install a cuda 10.0 toolkit.
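
The matching rule above could be checked before building with something like the following sketch. The helper names `major_minor` and `toolkit_matches_torch` are hypothetical; the version strings are passed in as plain arguments so the example runs without PyTorch installed, and in a real environment the first argument would be `torch.version.cuda`.

```python
def major_minor(version):
    """Reduce '10.0.130' to '10.0'; the patch component is ignored."""
    return ".".join(version.split(".")[:2])


def toolkit_matches_torch(torch_cuda, nvcc_version):
    """True if the toolkit's major.minor equals the one PyTorch was built with."""
    if torch_cuda is None:
        # torch.version.cuda is None for CPU-only builds, so nothing can match.
        return False
    return major_minor(torch_cuda) == major_minor(nvcc_version)


# Versions from this thread: PyTorch built for 10.0.130, system nvcc at 10.1.105.
print(toolkit_matches_torch("10.0.130", "10.1.105"))  # False
print(toolkit_matches_torch("10.0.130", "10.0.130"))  # True
```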

@beibuwandeluori

You can update your torch version to torch-1.5.0! I solved this problem by updating my version from 1.3.0 to 1.5.0.
