tch::Cuda::is_available() returns false using local libtorch 1.8 for CUDA 11.1 #329

Closed
sethmnielsen opened this issue Mar 7, 2021 · 9 comments

@sethmnielsen

I am on Arch Linux with Rust/cargo 1.50, and have set both LIBTORCH and LD_LIBRARY_PATH according to the README. tch-rs is on the latest commit to master, 25ac21d. I am running the example with cargo run --example basics. Both tch::Cuda::is_available() and tch::Cuda::cudnn_is_available() return false.
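
For reference, the check boils down to something like the following minimal sketch (not the exact basics example code, just the two calls in question):

    // Minimal check: print whether the libtorch that got linked in sees
    // CUDA and cuDNN at all.
    fn main() {
        println!("cuda available:  {}", tch::Cuda::is_available());
        println!("cudnn available: {}", tch::Cuda::cudnn_is_available());
    }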

I have triple-checked that the LIBTORCH path is correct, and it must be, since everything builds fine and libtorch was not downloaded inside of tch-rs. I also know that CUDA itself is installed correctly, because with the python-pytorch-cuda package from the Arch community repo (which is on PyTorch 1.8) I can use CUDA tensors just fine and torch.cuda.is_available() returns True.

I should note that my version of CUDA is 11.2, though I haven't seen that cause any issues with PyTorch 1.8 in Python (which apparently was built for CUDA 11.1).

Any suggestions? Are others having this issue? I saw #291, but I am not building in release, so I think this is a different problem.

@sethmnielsen
Author

sethmnielsen commented Mar 7, 2021

Looks like if I unset LIBTORCH, run cargo clean and then cargo run --example basics, it downloads libtorch (the target directory is now 1.4 GB) and builds successfully, but I still get false from both tch::Cuda::is_available() and tch::Cuda::cudnn_is_available(). So it is definitely not an issue with setting the environment variables correctly.

@LaurentMazare
Owner

Thanks for reporting this issue, I just pushed a (hacky) fix that should hopefully help with this.
The culprit here is that the C++ library is split into a CPU and a CUDA version, and the CUDA version often does not get included by the linker because there is no "hard" dependency on it. We have a hack in place to get around this by forcing the dependency, but it broke with the 1.8 release: the CUDA library was split into multiple sub-libraries and one of them (cuda_cu) was removed by the linker. I tweaked the hack to force this one to be included as well.
Longer term, this will be tackled by passing -Wl,--no-as-needed via Cargo's extra-link-args, but that is only available since cargo 1.50 in nightly mode, so we'll wait for it to reach stable before pushing that fix.
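
For context, the "force the dependency" idea amounts to something like the build-script sketch below (the file and symbol names here are illustrative, not the exact torch-sys code):

    // build.rs sketch: compile a tiny C++ shim that calls a symbol defined in
    // libtorch_cuda. Because the shim is reachable from the Rust crate, the
    // linker sees a hard reference and keeps the library even under
    // -Wl,--as-needed. Assumes a `cc` build-dependency and an (illustrative)
    // libtch/fake_cuda_dependency.cpp shim file.
    fn main() {
        // Ask rustc to link the CUDA library explicitly...
        println!("cargo:rustc-link-lib=torch_cuda");
        // ...and build the shim providing the hard reference to it; the Rust
        // side declares `extern "C" { fn fake_cuda_dependency(); }` and calls
        // it once so the whole chain counts as "needed".
        cc::Build::new()
            .cpp(true)
            .file("libtch/fake_cuda_dependency.cpp")
            .compile("fake_cuda_dependency");
    }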

@sethmnielsen
Author

Thanks for the quick reply and fix! Ah, I see - let's hope that makes it to cargo stable soon.

That fix seemed to do the trick! I am now getting true for both function calls. 😄 Thanks for your help!

@danieldk
Contributor

danieldk commented Mar 8, 2021

With the latest change, linking fails against a libtorch compiled for CUDA 10.2:

  = note: /nix/store/cp1sa3xxvl71cypiinw2c62i5s33chlr-binutils-2.35.1/bin/ld: cannot find -ltorch_cuda_cu
          /nix/store/cp1sa3xxvl71cypiinw2c62i5s33chlr-binutils-2.35.1/bin/ld: cannot find -ltorch_cuda_cpp
          collect2: error: ld returned 1 exit status

because that build does not ship these libraries:

❯ unzip -l libtorch-cxx11-abi-shared-with-deps-1.8.0.zip | grep libtorch_
352214112  02-27-2021 00:02   libtorch/lib/libtorch_cpu.so
1158264872  02-27-2021 00:02   libtorch/lib/libtorch_cuda.so
    12640  02-27-2021 00:02   libtorch/lib/libtorch_global_deps.so
 24837016  02-27-2021 00:02   libtorch/lib/libtorch_python.so

Maybe I should switch to libtorch with CUDA 11.1. Hopefully it doesn't have the same regressions for convolutions as PyTorch 1.7.1 with CUDA 11 had.

@LaurentMazare
Owner

Ah, it's a bummer that this depends on the CUDA version. Anyway, I just pushed a small tweak that only links these libs if the corresponding files are present; hopefully this gets 10.2 working.
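
In spirit the tweak boils down to a check like the following (a sketch only, not the exact torch-sys build logic):

    // build.rs sketch: only emit link directives for the split CUDA libraries
    // when the .so files actually exist, so a CUDA 10.2 libtorch (which ships
    // a single libtorch_cuda.so) still links cleanly.
    use std::path::PathBuf;

    fn main() {
        // LIBTORCH points at the extracted libtorch directory, as in the README.
        let lib_dir =
            PathBuf::from(std::env::var("LIBTORCH").expect("LIBTORCH not set")).join("lib");
        for lib in &["torch_cuda", "torch_cuda_cu", "torch_cuda_cpp"] {
            if lib_dir.join(format!("lib{}.so", lib)).exists() {
                println!("cargo:rustc-link-lib={}", lib);
            }
        }
    }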

@danieldk
Contributor

danieldk commented Mar 8, 2021

Works like a charm, thanks!

@sethmnielsen
Author

sethmnielsen commented Mar 13, 2021

So now I am not entirely sure that this is working. Running the basics example works just fine, but if I try to run the reinforcement-learning example, I get a lot of linker errors.
BTW: I am still on the commit where you first made the fix; I haven't pulled in the commit(s) after that.

➜  cargo run --example reinforcement-learning  --features=python a2c2 > log.txt

   Compiling tch v0.4.0 (/home/seth/school/adv_dl/project2/tch-rs)
error: linking with `cc` failed: exit code: 1
  |
  = note: "cc" "-Wl,--as-needed" "-Wl,-z,noexecstack" "-m64" "-Wl,--eh-frame-hdr" "-L" "/usr/lib64/rustlib/x86_64-unknown-linux-gnu/lib" 
  ...

(then there are lots and lots of linker flags; I'll just share the first and last few of them)

    "-Wl,-Bdynamic" "-lstdc++" "-ltorch_cuda" "-ltorch_cuda_cu" "-ltorch_cuda_cpp" "-ltorch" "-ltorch_cpu" "-lc10" "-lgomp" "-lbz2" "-lpython3.9" "-lgcc_s" "-lutil" "-lrt" "-lpthread" "-lm" "-ldl" "-lc"
  = note: /usr/bin/ld: /home/seth/packages/libtorch/lib/libtorch_cuda_cpp.so: undefined reference to `FLAGS_caffe2_rnn_executor'
          /usr/bin/ld: /home/seth/packages/libtorch/lib/libtorch_cuda_cpp.so: undefined reference to `caffe2::TensorShape::TensorShape()'
          /usr/bin/ld: /home/seth/packages/libtorch/lib/libtorch_cuda_cpp.so: undefined reference to `c10::C10FlagsRegistry[abi:cxx11]()'
          /usr/bin/ld: /home/seth/packages/libtorch/lib/libtorch_cuda_cpp.so: undefined reference to `bool c10::C10FlagParser::Parse<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*)'
          /usr/bin/ld: /home/seth/packages/libtorch/lib/libtorch_cuda_cpp.so: undefined reference to `FLAGS_caffe2_force_shared_col_buffer'
          /usr/bin/ld: /home/seth/packages/libtorch/lib/libtorch_cuda_cpp.so: undefined reference to `caffe2::OperatorDef::OperatorDef()'
          /usr/bin/ld: /home/seth/packages/libtorch/lib/libtorch_cuda_cpp.so: undefined reference to `FLAGS_caffe2_operator_throw_if_fp_exceptions'
          /usr/bin/ld: /home/seth/packages/libtorch/lib/libtorch_cuda_cpp.so: undefined reference to `caffe2::BlobProto::BlobProto()'
          /usr/bin/ld: /home/seth/packages/libtorch/lib/libtorch_cuda_cpp.so: undefined reference to `FLAGS_caffe2_print_blob_sizes_at_exit'
          /usr/bin/ld: /home/seth/packages/libtorch/lib/libtorch_cuda_cu.so: undefined reference to `c10::MessageLogger::~MessageLogger()'
          /usr/bin/ld: /home/seth/packages/libtorch/lib/libtorch_cuda_cpp.so: undefined reference to `bool c10::C10FlagParser::Parse<int>(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, int*)'
          /usr/bin/ld: /home/seth/packages/libtorch/lib/libtorch_cuda_cpp.so: undefined reference to `FLAGS_caffe2_max_keep_on_shrink_memory'
          /usr/bin/ld: /home/seth/packages/libtorch/lib/libtorch_cuda_cpp.so: undefined reference to `FLAGS_caffe2_operator_throw_on_first_occurrence_if_fp_exceptions'
          /usr/bin/ld: /home/seth/packages/libtorch/lib/libtorch_cuda_cpp.so: undefined reference to `caffe2::TensorProtos::TensorProtos()'
          /usr/bin/ld: /home/seth/packages/libtorch/lib/libtorch_cuda_cpp.so: undefined reference to `caffe2::DeviceOption::DeviceOption()'
          /usr/bin/ld: /home/seth/packages/libtorch/lib/libtorch_cuda_cpp.so: undefined reference to `caffe2::NetDef::NetDef()'
          /usr/bin/ld: /home/seth/packages/libtorch/lib/libtorch_cuda_cpp.so: undefined reference to `FLAGS_caffe2_operator_throw_if_fp_overflow_exceptions'
          /usr/bin/ld: /home/seth/packages/libtorch/lib/libtorch_cuda_cu.so: undefined reference to `c10::MessageLogger::MessageLogger(char const*, int, int)'
          /usr/bin/ld: /home/seth/packages/libtorch/lib/libtorch_cuda_cpp.so: undefined reference to `FLAGS_caffe2_keep_on_shrink'
          /usr/bin/ld: /home/seth/packages/libtorch/lib/libtorch_cuda_cpp.so: undefined reference to `FLAGS_caffe2_workspace_stack_debug'
          /usr/bin/ld: /home/seth/packages/libtorch/lib/libtorch_cuda_cpp.so: undefined reference to `bool c10::C10FlagParser::Parse<bool>(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, bool*)'
          collect2: error: ld returned 1 exit status


error: aborting due to previous error

error: could not compile `tch`

To learn more, run the command again with --verbose.

EDIT: Should I make a separate issue? It seems to be correctly trying to link to libtorch_cuda_cpp.so, but is having issues doing so.

@sethmnielsen
Author

I fixed it. There must have been a conflict between /usr/lib/libtorch_cuda.so (installed by the python-pytorch-cuda Arch package) and the locally downloaded libtorch, as uninstalling the python-pytorch-cuda package resulted in a successful build and run of the example program. It's also using Cuda(0) as the device, so everything looks good!
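
For anyone checking the same thing, a quick sanity test along these lines confirms which device tch picks up (a sketch based on my understanding of the tch API):

    // Confirm the device tch selects and that a tensor can be allocated on it.
    use tch::{Device, Kind, Tensor};

    fn main() {
        let device = Device::cuda_if_available();
        println!("device: {:?}", device); // prints Cuda(0) here
        let t = Tensor::zeros(&[2, 3], (Kind::Float, device));
        println!("tensor is on: {:?}", t.device());
    }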

Sorry for the false alarm!

@LaurentMazare
Owner

Glad that you got it to work. Closing this issue for now, but feel free to re-open if you notice any more issues.
