
Special torch version breaks GPU install of ML backends #121

Closed
aulwes opened this issue Jan 3, 2022 · 11 comments
Labels
area: build (Issues related to builds, makefiles, installs, etc), area: third-party (Issues related to dependencies and third-party package integrations), user issue (Issue posted by user)

Comments

aulwes commented Jan 3, 2022

Hi, I'm trying to configure SmartSim for an Intel + NVIDIA Volta node on our local cluster. I was able to get the conda environment set up and successfully ran 'pip install smartsim'. However, when I tried the next step, 'smart --device gpu', I got this error:

(SmartSim-cime) [rta@cn135 SmartSim]$ smart --device gpu

Backends Requested

PyTorch: True
TensorFlow: True
ONNX: False

Running SmartSim build process...
Traceback (most recent call last):
File "/home/rta/.conda/envs/SmartSim-cime/bin/smart", line 424, in
cli()
File "/home/rta/.conda/envs/SmartSim-cime/bin/smart", line 421, in cli
builder.run_build(args.device, pt, tf, onnx)
File "/home/rta/.conda/envs/SmartSim-cime/bin/smart", line 89, in run_build
self.install_torch(device=device)
File "/home/rta/.conda/envs/SmartSim-cime/bin/smart", line 138, in install_torch
if not self.check_installed("torch", self.torch_version):
File "/home/rta/.conda/envs/SmartSim-cime/bin/smart", line 121, in check_installed
installed_major, installed_minor, _ = installed_version.split(".")
ValueError: too many values to unpack (expected 3)

Spartee (Contributor) commented Jan 3, 2022

Hi @aulwes! A few questions for you.

  • What version of SmartSim are you using?
  • Are you using the default torch version, or did you try configuring one yourself?
  • Did you have a previously installed version of torch? If so, what is the output from pip list when your conda env is active?

My guess is that you have a conda-installed torch version that doesn't follow plain three-part semantic versioning (e.g. a post-release like torch==1.7.1.post2 rather than torch==1.7.1). So I assume that if you uninstall that torch version and allow the smart tool to configure and install one for you, it should work.
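
For reference, here is a minimal sketch (based only on what the traceback above shows, not the full smart code) of why a strict three-way unpack breaks on a version string with extra components, plus a more tolerant parse as a purely hypothetical fix:

# The traceback shows check_installed doing a strict three-way unpack.
# A version like "1.7.1.post2" splits into four parts, so it raises ValueError.
installed_version = "1.7.1.post2"

try:
    installed_major, installed_minor, _ = installed_version.split(".")
except ValueError as err:
    print(err)  # too many values to unpack (expected 3)

# Hypothetical, more tolerant parse: keep only the first two components.
installed_major, installed_minor, *_ = installed_version.split(".")
print(installed_major, installed_minor)  # 1 7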

Luckily, the CLI is getting many enhancements in our upcoming release so we will make sure to address this issue.

Spartee added the area: build, area: third-party, and user issue labels on Jan 3, 2022
aulwes (Author) commented Jan 3, 2022

Hi Sam,

When I do 'pip list', the smartsim version is 0.3.2 and torch is 1.7.1.post2. How do I uninstall this version of torch and use smart to install torch instead?

Thanks, Rob

Spartee (Contributor) commented Jan 3, 2022

Yup, that's my fault. I didn't account for that in the CLI version parsing.

To fix it, you should only have to do:

conda activate smartsim-cime
pip uninstall torch torchvision # (or conda uninstall torch torchvision if you installed them with conda)
# hit y a couple times
smart --device gpu

Just FYI too: for the GPU build you will need to set your CUDA/cuDNN information prior to running smart --device gpu.

Specifically, the env vars:

export CUDNN_LIBRARY=/path/to/cuda/lib64/
export CUDNN_INCLUDE_DIR=/path/to/cuda/include/

# usually only needed on Crays because they use CRAY_LD_LIBRARY_PATH
export LD_LIBRARY_PATH=$CUDNN_LIBRARY:$LD_LIBRARY_PATH

You will see the CLI complain if you don't have these set.

Lastly, as stated (and I'm including this here for the public record), we will be making some sweeping enhancements to the CLI itself in the upcoming release. I won't close this issue until we get those in, though, so we can track this. Feel free to post back here if you continue to have issues.

Spartee changed the title from "Configuring for GPU device" to "Special torch version breaks GPU install of ML backends" on Jan 3, 2022
aulwes (Author) commented Jan 3, 2022

Thanks, Sam! Getting closer, but now running into build issues. Copying the rather long error output:

Running SmartSim build process...
TensorFlow installed in Python environment
Traceback (most recent call last):
File "/home/rta/.conda/envs/SmartSim-cime/bin/smart", line 424, in
cli()
File "/home/rta/.conda/envs/SmartSim-cime/bin/smart", line 421, in cli
builder.run_build(args.device, pt, tf, onnx)
File "/home/rta/.conda/envs/SmartSim-cime/bin/smart", line 112, in run_build
raise SetupError(error)
main.SetupError: SmartSim setup failed with exitcode 1
Cloning into '/vast/home/rta/.conda/envs/SmartSim-cime/lib/python3.8/site-packages/smartsim/bin/scripts/../../.third-party/RedisAI'...
Note: switching to '3f192ebd2bc874fb21cfdb3aff3bb0647df9b6ea'.

You are in 'detached HEAD' state. You can look around, make experimental
changes and commit them, and you can discard any commits you make in this
state without impacting any branches by switching back to a branch.

If you want to create a new branch to retain commits you create, you may
do so (now or later) by using -c with the switch command. Example:

git switch -c <new-branch-name>

Or undo this operation with:

git switch -

Turn off this advice by setting config variable advice.detachedHead to false

Updating files: 100% (285/285), done.
Submodule 'opt/googletest' (https://github.com/google/googletest.git) registered for path 'opt/googletest'
Submodule 'opt/readies' (https://github.com/RedisLabsModules/readies.git) registered for path 'opt/readies'
Cloning into '/vast/home/rta/.conda/envs/SmartSim-cime/lib/python3.8/site-packages/smartsim/.third-party/RedisAI/opt/googletest'...
Cloning into '/vast/home/rta/.conda/envs/SmartSim-cime/lib/python3.8/site-packages/smartsim/.third-party/RedisAI/opt/readies'...
From https://github.com/RedisLabsModules/readies

  • branch 75459c6142ac01ff82fa7b4646d9d574d177fa3d -> FETCH_HEAD
cp: cannot stat '/vast/home/rta/.conda/envs/SmartSim-cime/lib/python3.8/site-packages/smartsim/bin/scripts/../../../modules/FindTensorFlow.cmake': No such file or directory
fatal: not a git repository (or any parent up to mount point /vast)
Stopping at filesystem boundary (GIT_DISCOVERY_ACROSS_FILESYSTEM not set).
Cloning into 'dlpack'...
Note: switching to 'a07f962d446b577adf4baef2b347a0f3a2a20617'.

You are in 'detached HEAD' state. You can look around, make experimental
changes and commit them, and you can discard any commits you make in this
state without impacting any branches by switching back to a branch.

If you want to create a new branch to retain commits you create, you may
do so (now or later) by using -c with the switch command. Example:

git switch -c <new-branch-name>

Or undo this operation with:

git switch -

Turn off this advice by setting config variable advice.detachedHead to false

make[1]: *** ../bin/linux-x64-release/src: No such file or directory. Stop.
make: [Makefile:197: clean] Error 2 (ignored)
CMake Warning (dev) at /vast/home/rta/.conda/envs/SmartSim-cime/lib/python3.8/site-packages/torch/share/cmake/Caffe2/public/cuda.cmake:29 (find_package):
Policy CMP0074 is not set: find_package uses _ROOT variables.
Run "cmake --help-policy CMP0074" for policy details. Use the cmake_policy
command to set the policy and suppress this warning.

Environment variable CUDA_ROOT is set to:

/projects/darwin-nv/centos8/x86_64/packages/cuda/11.4.2

For compatibility, CMake is ignoring the variable.
Call Stack (most recent call first):
/vast/home/rta/.conda/envs/SmartSim-cime/lib/python3.8/site-packages/torch/share/cmake/Caffe2/Caffe2Config.cmake:88 (include)
/vast/home/rta/.conda/envs/SmartSim-cime/lib/python3.8/site-packages/torch/share/cmake/Torch/TorchConfig.cmake:40 (find_package)
CMakeLists.txt:197 (FIND_PACKAGE)
This warning is for project developers. Use -Wno-dev to suppress it.

CMake Warning at /vast/home/rta/.conda/envs/SmartSim-cime/lib/python3.8/site-packages/torch/share/cmake/Caffe2/public/cuda.cmake:109 (message):
Caffe2: Cannot find cuDNN library. Turning the option off
Call Stack (most recent call first):
/vast/home/rta/.conda/envs/SmartSim-cime/lib/python3.8/site-packages/torch/share/cmake/Caffe2/Caffe2Config.cmake:88 (include)
/vast/home/rta/.conda/envs/SmartSim-cime/lib/python3.8/site-packages/torch/share/cmake/Torch/TorchConfig.cmake:40 (find_package)
CMakeLists.txt:197 (FIND_PACKAGE)

CMake Error at /vast/home/rta/.conda/envs/SmartSim-cime/lib/python3.8/site-packages/torch/share/cmake/Caffe2/Caffe2Config.cmake:96 (message):
Your installed Caffe2 version uses cuDNN but I cannot find the cuDNN
libraries. Please set the proper cuDNN prefixes and / or install cuDNN.
Call Stack (most recent call first):
/vast/home/rta/.conda/envs/SmartSim-cime/lib/python3.8/site-packages/torch/share/cmake/Torch/TorchConfig.cmake:40 (find_package)
CMakeLists.txt:197 (FIND_PACKAGE)

make: *** [../opt/readies/mk/cmake.rules:6: ../bin/linux-x64-release/src/Makefile] Error 1

Spartee (Contributor) commented Jan 3, 2022

OK, I think this is the pip-installed cmake not playing nicely. I'm guessing you installed cudnn through conda?

Please try

pip uninstall cmake
conda install cmake
# and then try again
smart --clean
smart --device gpu

This should at least alleviate the cmake errors.
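
If it helps, here is a quick way to confirm which cmake binary the active environment actually resolves; this snippet is only illustrative and not part of the smart CLI:

# Sketch: report which cmake the active environment picks up.
# Both pip and conda can put a `cmake` executable on PATH; the first match wins.
import shutil
import subprocess

cmake_path = shutil.which("cmake")
print("cmake found at:", cmake_path)
if cmake_path:
    version_line = subprocess.run(
        [cmake_path, "--version"], capture_output=True, text=True
    ).stdout.splitlines()[0]
    print(version_line)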

aulwes (Author) commented Jan 4, 2022 via email

Spartee (Contributor) commented Jan 4, 2022

Sorry, CUDA + CMake + Torch/TF can be a real pain sometimes. We will eventually be able to use Singularity containers for everything :)

So it can't find the cuDNN libraries. A few things to confirm:

  1. You are using torch 1.7.1, correct?
  2. Is cmake installed through conda and not pip?
  3. Is the cudnn version right for the cuda version you are using? (see https://docs.nvidia.com/deeplearning/cudnn/support-matrix/index.html)
  4. Is cudnn installed through conda, and are the 3 environment variables set correctly?
export CUDNN_LIBRARY=/path/to/cudnn/lib64/
export CUDNN_INCLUDE_DIR=/path/to/cudnn/include/
export LD_LIBRARY_PATH=$CUDNN_LIBRARY:$LD_LIBRARY_PATH
  5. Have you run smart --clean with your environment active between each try?
  6. Do the contents of the cudnn dirs look right? (A quick check sketch follows the listings below.)

What cudnn libs look like on an example system

libcudnn_adv_infer.so        libcudnn_adv_train.so        libcudnn_cnn_infer.so        libcudnn.so
libcudnn_adv_infer.so.8      libcudnn_adv_train.so.8      libcudnn_cnn_infer.so.8      libcudnn.so.8
libcudnn_adv_infer.so.8.2.0  libcudnn_adv_train.so.8.2.0  libcudnn_cnn_infer.so.8.2.0  libcudnn.so.8.2.0

cudnn headers

cudnn_adv_infer.h     cudnn_backend.h       cudnn_cnn_train.h     cudnn_ops_infer_v8.h  cudnn_version.h
cudnn_adv_infer_v8.h  cudnn_backend_v8.h    cudnn_cnn_train_v8.h  cudnn_ops_train.h     cudnn_version_v8.h
cudnn_adv_train.h     cudnn_cnn_infer.h     cudnn.h               cudnn_ops_train_v8.h
cudnn_adv_train_v8.h  cudnn_cnn_infer_v8.h  cudnn_ops_infer.h     cudnn_v8.h
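
As a quick sanity check of the cuDNN environment variables before running smart --device gpu (illustrative only, not part of the smart CLI):

# Sketch: verify the cuDNN env vars point at directories containing the expected files.
import os

checks = [
    ("CUDNN_LIBRARY", "libcudnn.so"),
    ("CUDNN_INCLUDE_DIR", "cudnn.h"),
]
for var, expected_file in checks:
    path = os.environ.get(var)
    if not path:
        print(f"{var} is not set")
    elif not os.path.isfile(os.path.join(path, expected_file)):
        print(f"{var}={path} does not contain {expected_file}")
    else:
        print(f"{var} looks OK: {path}")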

Spartee (Contributor) commented Jan 5, 2022

So another thing to try would be the Caffe environment variables for cuDNN:

export CUDNN_LIBRARY_PATH=$CUDNN_LIBRARY
export CUDNN_INCLUDE_PATH=$CUDNN_INCLUDE_DIR

Torch should override these, but the error is coming from Caffe, which is a bit odd.

aulwes (Author) commented Jan 5, 2022 via email

Spartee (Contributor) commented Jan 7, 2022

@aulwes Awesome!

Just BTW, I'm not sure what group you're working for, but we have a number of people writing a CIME interface for SmartSim right now, and they are all in our Slack channel.

Some people to point out are @jedwards4b and @ashao.

Spartee (Contributor) commented Jan 7, 2022

Going to close this. One action item is that these env vars will be added to the warning list in the CLI build in #122.

Spartee closed this as completed on Jan 7, 2022