
Special torch version breaks GPU install of ML backends #121

Closed
aulwes opened this issue Jan 3, 2022 · 11 comments
Labels
area: build (Issues related to builds, makefiles, installs, etc), area: third-party (Issues related to dependencies and third-party package integrations), user issue (Issue posted by user)

Comments

aulwes commented Jan 3, 2022

Hi, I'm trying to configure SmartSim for an Intel + NVIDIA Volta node on our local cluster. I was able to get the conda environment set up and successfully ran 'pip install smartsim'. However, when I tried the next step, 'smart --device gpu', I got this error:

(SmartSim-cime) [rta@cn135 SmartSim]$ smart --device gpu

Backends Requested

PyTorch: True
TensorFlow: True
ONNX: False

Running SmartSim build process...
Traceback (most recent call last):
File "/home/rta/.conda/envs/SmartSim-cime/bin/smart", line 424, in
cli()
File "/home/rta/.conda/envs/SmartSim-cime/bin/smart", line 421, in cli
builder.run_build(args.device, pt, tf, onnx)
File "/home/rta/.conda/envs/SmartSim-cime/bin/smart", line 89, in run_build
self.install_torch(device=device)
File "/home/rta/.conda/envs/SmartSim-cime/bin/smart", line 138, in install_torch
if not self.check_installed("torch", self.torch_version):
File "/home/rta/.conda/envs/SmartSim-cime/bin/smart", line 121, in check_installed
installed_major, installed_minor, _ = installed_version.split(".")
ValueError: too many values to unpack (expected 3)

Spartee (Contributor) commented Jan 3, 2022

Hi @aulwes! A few questions for you.

  • What version of SmartSim are you using?
  • Are you using the default torch version, or did you try configuring one yourself?
  • Did you have a previously installed version of torch? If so, what is the output from pip list when your conda env is active?

My guess is that you have a conda-installed torch version that doesn't follow plain three-part semantic versioning (e.g. a post-release like torch==1.7.1.post2 rather than torch==1.7.1). So I assume that if you uninstall that torch version and allow the smart tool to configure and install one for you, it should work.
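
For reference, here is a minimal sketch (based only on what the traceback above shows, not the full smart code) of why a strict three-way unpack breaks on a version string with extra components, plus a more tolerant parse as a purely hypothetical fix:

# The traceback shows check_installed doing a strict three-way unpack.
# A version like "1.7.1.post2" splits into four parts, so it raises ValueError.
installed_version = "1.7.1.post2"

try:
    installed_major, installed_minor, _ = installed_version.split(".")
except ValueError as err:
    print(err)  # too many values to unpack (expected 3)

# Hypothetical, more tolerant parse: keep only the first two components.
installed_major, installed_minor, *_ = installed_version.split(".")
print(installed_major, installed_minor)  # 1 7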

Luckily, the CLI is getting many enhancements in our upcoming release so we will make sure to address this issue.

Spartee added the area: build, area: third-party, and user issue labels on Jan 3, 2022
aulwes (Author) commented Jan 3, 2022

Hi Sam,

When I do 'pip list', the smartsim version is 0.3.2 and torch is 1.7.1.post2. How do I uninstall this version of torch and use smart to install torch instead?

Thanks, Rob

Spartee (Contributor) commented Jan 3, 2022

Yup, that's my fault. I didn't account for that in the CLI version parsing.

To fix it, you should only have to do:

conda activate smartsim-cime
pip uninstall torch torchvision # (or conda uninstall torch torchvision if you installed them with conda)
# hit y a couple times
smart --device gpu

Just FYI too: for the GPU build you will need to set your CUDA/cuDNN information prior to running smart --device gpu.

Specifically, the env vars:

export CUDNN_LIBRARY=/path/to/cuda/lib64/
export CUDNN_INCLUDE_DIR=/path/to/cuda/include/

# usually only needed on Crays because they use CRAY_LD_LIBRARY_PATH
export LD_LIBRARY_PATH=$CUDNN_LIBRARY:$LD_LIBRARY_PATH

You will see the CLI complain if you don't have these set.

Lastly, as stated (and I'm including this here for the public record), we will be making some sweeping enhancements to the CLI itself in the upcoming release. I won't close this issue until we get those in, though, so we can track this. Feel free to post back here if you continue to have issues.

Spartee changed the title from "Configuring for GPU device" to "Special torch version breaks GPU install of ML backends" on Jan 3, 2022
aulwes (Author) commented Jan 3, 2022

Thanks, Sam! Getting closer, but now running into build issues. Copying the rather long error output:

Running SmartSim build process...
TensorFlow installed in Python environment
Traceback (most recent call last):
File "/home/rta/.conda/envs/SmartSim-cime/bin/smart", line 424, in
cli()
File "/home/rta/.conda/envs/SmartSim-cime/bin/smart", line 421, in cli
builder.run_build(args.device, pt, tf, onnx)
File "/home/rta/.conda/envs/SmartSim-cime/bin/smart", line 112, in run_build
raise SetupError(error)
main.SetupError: SmartSim setup failed with exitcode 1
Cloning into '/vast/home/rta/.conda/envs/SmartSim-cime/lib/python3.8/site-packages/smartsim/bin/scripts/../../.third-party/RedisAI'...
Note: switching to '3f192ebd2bc874fb21cfdb3aff3bb0647df9b6ea'.

You are in 'detached HEAD' state. You can look around, make experimental
changes and commit them, and you can discard any commits you make in this
state without impacting any branches by switching back to a branch.

If you want to create a new branch to retain commits you create, you may
do so (now or later) by using -c with the switch command. Example:

git switch -c <new-branch-name>

Or undo this operation with:

git switch -

Turn off this advice by setting config variable advice.detachedHead to false

Updating files: 100% (285/285), done.
Submodule 'opt/googletest' (https://github.com/google/googletest.git) registered for path 'opt/googletest'
Submodule 'opt/readies' (https://github.com/RedisLabsModules/readies.git) registered for path 'opt/readies'
Cloning into '/vast/home/rta/.conda/envs/SmartSim-cime/lib/python3.8/site-packages/smartsim/.third-party/RedisAI/opt/googletest'...
Cloning into '/vast/home/rta/.conda/envs/SmartSim-cime/lib/python3.8/site-packages/smartsim/.third-party/RedisAI/opt/readies'...
From https://github.com/RedisLabsModules/readies

  • branch 75459c6142ac01ff82fa7b4646d9d574d177fa3d -> FETCH_HEAD
cp: cannot stat '/vast/home/rta/.conda/envs/SmartSim-cime/lib/python3.8/site-packages/smartsim/bin/scripts/../../../modules/FindTensorFlow.cmake': No such file or directory
fatal: not a git repository (or any parent up to mount point /vast)
Stopping at filesystem boundary (GIT_DISCOVERY_ACROSS_FILESYSTEM not set).
Cloning into 'dlpack'...
Note: switching to 'a07f962d446b577adf4baef2b347a0f3a2a20617'.

You are in 'detached HEAD' state. You can look around, make experimental
changes and commit them, and you can discard any commits you make in this
state without impacting any branches by switching back to a branch.

If you want to create a new branch to retain commits you create, you may
do so (now or later) by using -c with the switch command. Example:

git switch -c <new-branch-name>

Or undo this operation with:

git switch -

Turn off this advice by setting config variable advice.detachedHead to false

make[1]: *** ../bin/linux-x64-release/src: No such file or directory. Stop.
make: [Makefile:197: clean] Error 2 (ignored)
CMake Warning (dev) at /vast/home/rta/.conda/envs/SmartSim-cime/lib/python3.8/site-packages/torch/share/cmake/Caffe2/public/cuda.cmake:29 (find_package):
Policy CMP0074 is not set: find_package uses _ROOT variables.
Run "cmake --help-policy CMP0074" for policy details. Use the cmake_policy
command to set the policy and suppress this warning.

Environment variable CUDA_ROOT is set to:

/projects/darwin-nv/centos8/x86_64/packages/cuda/11.4.2

For compatibility, CMake is ignoring the variable.
Call Stack (most recent call first):
/vast/home/rta/.conda/envs/SmartSim-cime/lib/python3.8/site-packages/torch/share/cmake/Caffe2/Caffe2Config.cmake:88 (include)
/vast/home/rta/.conda/envs/SmartSim-cime/lib/python3.8/site-packages/torch/share/cmake/Torch/TorchConfig.cmake:40 (find_package)
CMakeLists.txt:197 (FIND_PACKAGE)
This warning is for project developers. Use -Wno-dev to suppress it.

CMake Warning at /vast/home/rta/.conda/envs/SmartSim-cime/lib/python3.8/site-packages/torch/share/cmake/Caffe2/public/cuda.cmake:109 (message):
Caffe2: Cannot find cuDNN library. Turning the option off
Call Stack (most recent call first):
/vast/home/rta/.conda/envs/SmartSim-cime/lib/python3.8/site-packages/torch/share/cmake/Caffe2/Caffe2Config.cmake:88 (include)
/vast/home/rta/.conda/envs/SmartSim-cime/lib/python3.8/site-packages/torch/share/cmake/Torch/TorchConfig.cmake:40 (find_package)
CMakeLists.txt:197 (FIND_PACKAGE)

CMake Error at /vast/home/rta/.conda/envs/SmartSim-cime/lib/python3.8/site-packages/torch/share/cmake/Caffe2/Caffe2Config.cmake:96 (message):
Your installed Caffe2 version uses cuDNN but I cannot find the cuDNN
libraries. Please set the proper cuDNN prefixes and / or install cuDNN.
Call Stack (most recent call first):
/vast/home/rta/.conda/envs/SmartSim-cime/lib/python3.8/site-packages/torch/share/cmake/Torch/TorchConfig.cmake:40 (find_package)
CMakeLists.txt:197 (FIND_PACKAGE)

make: *** [../opt/readies/mk/cmake.rules:6: ../bin/linux-x64-release/src/Makefile] Error 1

Spartee (Contributor) commented Jan 3, 2022

OK, I think this is the pip-installed cmake not playing nicely. I'm guessing you installed cudnn through conda?

Please try

pip uninstall cmake
conda install cmake
# and then try again
smart --clean
smart --device gpu

This should at least alleviate the cmake errors.
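
If it helps, here is a quick way to confirm which cmake binary the active environment actually resolves; this snippet is only illustrative and not part of the smart CLI:

# Sketch: report which cmake the active environment picks up.
# Both pip and conda can put a `cmake` executable on PATH; the first match wins.
import shutil
import subprocess

cmake_path = shutil.which("cmake")
print("cmake found at:", cmake_path)
if cmake_path:
    version_line = subprocess.run(
        [cmake_path, "--version"], capture_output=True, text=True
    ).stdout.splitlines()[0]
    print(version_line)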

aulwes (Author) commented Jan 4, 2022 via email

Spartee (Contributor) commented Jan 4, 2022

Sorry, CUDA + CMake + Torch/TF can be a real pain sometimes. We will eventually be able to use Singularity containers for everything :)

So it can't find the cuDNN libraries. A few things to confirm:

  1. You are using torch 1.7.1, correct?
  2. Is cmake installed through conda and not pip?
  3. Is the cudnn version right for the cuda version you are using? (see https://docs.nvidia.com/deeplearning/cudnn/support-matrix/index.html)
  4. Is cudnn installed through conda, and are the 3 environment variables set correctly?
export CUDNN_LIBRARY=/path/to/cudnn/lib64/
export CUDNN_INCLUDE_DIR=/path/to/cudnn/include/
export LD_LIBRARY_PATH=$CUDNN_LIBRARY:$LD_LIBRARY_PATH
  5. Have you run smart --clean with your environment active between each try?
  6. Do the contents of the cudnn dirs look right? (A quick check sketch follows the listings below.)

What cudnn libs look like on an example system

libcudnn_adv_infer.so        libcudnn_adv_train.so        libcudnn_cnn_infer.so        libcudnn.so
libcudnn_adv_infer.so.8      libcudnn_adv_train.so.8      libcudnn_cnn_infer.so.8      libcudnn.so.8
libcudnn_adv_infer.so.8.2.0  libcudnn_adv_train.so.8.2.0  libcudnn_cnn_infer.so.8.2.0  libcudnn.so.8.2.0

cudnn headers

cudnn_adv_infer.h     cudnn_backend.h       cudnn_cnn_train.h     cudnn_ops_infer_v8.h  cudnn_version.h
cudnn_adv_infer_v8.h  cudnn_backend_v8.h    cudnn_cnn_train_v8.h  cudnn_ops_train.h     cudnn_version_v8.h
cudnn_adv_train.h     cudnn_cnn_infer.h     cudnn.h               cudnn_ops_train_v8.h
cudnn_adv_train_v8.h  cudnn_cnn_infer_v8.h  cudnn_ops_infer.h     cudnn_v8.h
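
As a quick sanity check of the cuDNN environment variables before running smart --device gpu (illustrative only, not part of the smart CLI):

# Sketch: verify the cuDNN env vars point at directories containing the expected files.
import os

checks = [
    ("CUDNN_LIBRARY", "libcudnn.so"),
    ("CUDNN_INCLUDE_DIR", "cudnn.h"),
]
for var, expected_file in checks:
    path = os.environ.get(var)
    if not path:
        print(f"{var} is not set")
    elif not os.path.isfile(os.path.join(path, expected_file)):
        print(f"{var}={path} does not contain {expected_file}")
    else:
        print(f"{var} looks OK: {path}")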

Spartee (Contributor) commented Jan 5, 2022

So another thing to try would be the Caffe environment variables for cuDNN:

export CUDNN_LIBRARY_PATH=$CUDNN_LIBRARY
export CUDNN_INCLUDE_PATH=$CUDNN_INCLUDE_DIR

Torch should override these, but the error is coming from Caffe, which is a bit odd.

aulwes (Author) commented Jan 5, 2022 via email

Spartee (Contributor) commented Jan 7, 2022

@aulwes Awesome!

Just BTW, I'm not sure what group you're working for, but we have a number of people writing a CIME interface for SmartSim right now, and they are all in our Slack channel.

Some people to point out are @jedwards4b and @ashao.

Spartee (Contributor) commented Jan 7, 2022

Going to close this. One action item is that these env vars will be added to the warning list in the CLI build in #122.

Spartee closed this as completed on Jan 7, 2022