Special torch version breaks GPU install of ML backends #121
Hi @aulwes! A few questions for you.
My guess is that you have a conda-installed torch version that doesn't perfectly follow semantic versioning (e.g. `1.7.1.post2`). Luckily, the CLI is getting many enhancements in our upcoming release, so we will make sure to address this issue.
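The failure mode is easy to reproduce. This is a minimal sketch of the parsing problem, not the CLI's actual code; variable names are illustrative:

```python
installed_version = "1.7.1.post2"  # conda's torch carries a .post2 suffix

# A naive parse assumes exactly three dot-separated fields and raises
# ValueError on a four-field post-release version string:
try:
    major, minor, patch = installed_version.split(".")
except ValueError as exc:
    print(f"naive parse fails: {exc}")

# A more tolerant parse keeps only the leading three fields:
major, minor, patch = installed_version.split(".")[:3]
print(major, minor, patch)  # 1 7 1
```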
Hi Sam, when I do `pip list`, the smartsim version is 0.3.2 and torch is 1.7.1.post2. How do I uninstall this version of torch and use smart to install torch instead? Thanks, Rob
Yup, that's my fault; I didn't account for that in the CLI version parsing. To fix it, you should only have to do:

```
conda activate smartsim-cime
pip uninstall torch torchvision  # (or conda uninstall torch torchvision if you installed them with conda)
# hit y a couple times
smart --device gpu
```

Just FYI too, for the GPU build you will need to set your CUDA/cuDNN information prior to running. Specifically, the env vars:

```
export CUDNN_LIBRARY=/path/to/cuda/lib64/
export CUDNN_INCLUDE_DIR=/path/to/cuda/include/
# usually only needed on Crays because they use CRAY_LD_LIBRARY_PATH
export LD_LIBRARY_PATH=$CUDNN_LIBRARY:$LD_LIBRARY_PATH
```

You will see the CLI complain if you don't have these set. Lastly, as stated (and I'm including it here for the public record), we will be making some sweeping enhancements to the CLI itself in the upcoming release. I won't close this issue until we get those in, though, so we can track this. Feel free to post back here if you continue to have issues.
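As a sketch of the kind of pre-flight check the CLI warning amounts to (a hypothetical helper, not SmartSim's actual code), something like this catches a missing or mistyped variable before the build starts:

```python
import os
from pathlib import Path


def check_cudnn_env(env=os.environ):
    """Return a list of problems with the cuDNN build variables."""
    problems = []
    for var in ("CUDNN_LIBRARY", "CUDNN_INCLUDE_DIR"):
        value = env.get(var)
        if value is None:
            problems.append(f"{var} is not set")
        elif not Path(value).is_dir():
            problems.append(f"{var}={value} is not a directory")
    return problems


# An empty environment reports both variables missing.
print(check_cudnn_env({}))
# ['CUDNN_LIBRARY is not set', 'CUDNN_INCLUDE_DIR is not set']
```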
Thanks Sam! Getting closer, but running into build issues. Copying the rather long error output:

```
Running SmartSim build process...
You are in 'detached HEAD' state. ...
Updating files: 100% (285/285), done.
make[1]: *** ../bin/linux-x64-release/src: No such file or directory.  Stop.
Environment variable CUDA_ROOT is set to: ...
For compatibility, CMake is ignoring the variable.
CMake Warning at /vast/home/rta/.conda/envs/SmartSim-cime/lib/python3.8/site-packages/torch/share/cmake/Caffe2/public/cuda.cmake:109 (message): ...
CMake Error at /vast/home/rta/.conda/envs/SmartSim-cime/lib/python3.8/site-packages/torch/share/cmake/Caffe2/Caffe2Config.cmake:96 (message): ...
make: *** [../opt/readies/mk/cmake.rules:6: ../bin/linux-x64-release/src/Makefile] Error 1
```
Ok, I think this is the pip-installed cmake not playing nicely. I'm guessing you installed cudnn through conda? Please try:

```
pip uninstall cmake
conda install cmake
# and then try again
smart --clean
smart --device gpu
```

This should at least alleviate the cmake errors.
Sam,

Did the steps above. Tried `smart --device gpu` again, but ran into an error about cuDNN. Then I ran `conda install cudnn`, but I'm still getting the same error:

```
CMake Error at /vast/home/rta/.conda/envs/SmartSim-cime/lib/python3.8/site-packages/torch/share/cmake/Caffe2/Caffe2Config.cmake:96 (message):
  Your installed Caffe2 version uses cuDNN but I cannot find the cuDNN
  libraries.  Please set the proper cuDNN prefixes and / or install cuDNN.
Call Stack (most recent call first):
  /vast/home/rta/.conda/envs/SmartSim-cime/lib/python3.8/site-packages/torch/share/cmake/Torch/TorchConfig.cmake:40 (find_package)
  CMakeLists.txt:197 (FIND_PACKAGE)
```

Do I need to set additional environment vars?
Sorry, CUDA + CMake + Torch/TF can be a real pain sometimes. We will eventually be able to use Singularity containers for everything :)

So it can't find the cuDNN libraries; a couple things to confirm:

1. You are using torch 1.7.1, correct?
2. cmake installed through conda and not pip?
3. cuDNN version right for the CUDA version you are using? (see https://docs.nvidia.com/deeplearning/cudnn/support-matrix/index.html)
4. cuDNN installed through conda, and the 3 environment variables set correctly?

```
export CUDNN_LIBRARY=/path/to/cudnn/lib64/
export CUDNN_INCLUDE_DIR=/path/to/cudnn/include/
export LD_LIBRARY_PATH=$CUDNN_LIBRARY:$LD_LIBRARY_PATH
```

5. Have you run `smart --clean` with your environment active between each try?
6. Contents of the cuDNN dirs look right? What the cuDNN libs look like on an example system:

```
libcudnn_adv_infer.so        libcudnn_adv_train.so        libcudnn_cnn_infer.so        libcudnn.so
libcudnn_adv_infer.so.8      libcudnn_adv_train.so.8      libcudnn_cnn_infer.so.8      libcudnn.so.8
libcudnn_adv_infer.so.8.2.0  libcudnn_adv_train.so.8.2.0  libcudnn_cnn_infer.so.8.2.0  libcudnn.so.8.2.0
```

cuDNN headers:

```
cudnn_adv_infer.h     cudnn_backend.h       cudnn_cnn_train.h     cudnn_ops_infer_v8.h  cudnn_version.h
cudnn_adv_infer_v8.h  cudnn_backend_v8.h    cudnn_cnn_train_v8.h  cudnn_ops_train.h     cudnn_version_v8.h
cudnn_adv_train.h     cudnn_cnn_infer.h     cudnn.h               cudnn_ops_train_v8.h
cudnn_adv_train_v8.h  cudnn_cnn_infer_v8.h  cudnn_ops_infer.h     cudnn_v8.h
```
So another thing to try would be the Caffe environment variables for cuDNN:

```
export CUDNN_LIBRARY_PATH=$CUDNN_LIBRARY
export CUDNN_INCLUDE_PATH=$CUDNN_INCLUDE_DIR
```

Torch should override these, but the error is coming from Caffe, which is a bit odd.
Success, thanks Sam! Could you now point me to how I can test the installation and run an example on GPU?
@aulwes Awesome! Just BTW, not sure what group you're working for, but we have a number of people writing a CIME interface for SmartSim right now and they are all in our Slack channel. Some to point out are @jedwards4b @ashao
Going to close this. One action item is that these env vars will be added to the warning list in the CLI build in #122
Hi, I'm trying to configure SmartSim for an Intel + NVIDIA Volta node on our local cluster. I was able to get the conda environment set up and successfully executed `pip install smartsim`. However, when I tried the next step, `smart --device gpu`, I get this error:

```
(SmartSim-cime) [rta@cn135 SmartSim]$ smart --device gpu
Backends Requested
Running SmartSim build process...
Traceback (most recent call last):
  File "/home/rta/.conda/envs/SmartSim-cime/bin/smart", line 424, in <module>
    cli()
  File "/home/rta/.conda/envs/SmartSim-cime/bin/smart", line 421, in cli
    builder.run_build(args.device, pt, tf, onnx)
  File "/home/rta/.conda/envs/SmartSim-cime/bin/smart", line 89, in run_build
    self.install_torch(device=device)
  File "/home/rta/.conda/envs/SmartSim-cime/bin/smart", line 138, in install_torch
    if not self.check_installed("torch", self.torch_version):
  File "/home/rta/.conda/envs/SmartSim-cime/bin/smart", line 121, in check_installed
    installed_major, installed_minor, _ = installed_version.split(".")
ValueError: too many values to unpack (expected 3)
```