
Failed to run in A100 #10

Closed
BearBiscuit05 opened this issue Mar 9, 2023 · 7 comments

Comments

@BearBiscuit05

During our experiments we followed the Docker installation method. When we ran the command python samgraph/multi_gpu/train_gcn.py --dataset products --num-train-worker 1 --num-sample-worker 1 --pipeline --cache-policy pre_sample --cache-percentage 0.1 --num-epoch 10 --batch-size 8000, the system reported an error:

NVIDIA A100-PCIE-40GB with CUDA capability sm_80 is not compatible with the current PyTorch installation. The current PyTorch install supports CUDA capabilities sm_37 sm_50 sm_60 sm_70 sm_75. If you want to use the NVIDIA A100-PCIE-40GB GPU with PyTorch, please check the instructions at Start Locally | PyTorch

This looks like a version mismatch problem. How should we deal with it in order to run GNNLab on an A100?

@molamooo
Contributor

molamooo commented Mar 9, 2023

According to link, pip-installed PyTorch cannot run on an A100 (sm_80) because 1) no PTX is shipped in the wheel, and 2) the cuDNN 7.x that is compatible with CUDA 10.2 does not support the A100.

So the only solution is to modify the Dockerfile to build a Docker image based on CUDA 11. To achieve this, there are several compatibility issues you may need to fix manually:

  • cub: CUDA 11 ships with cub, so the cub submodule in GNNLab is no longer needed and should be removed from setup.py.
  • DGL: the DGL version must be updated (0.8.0 is tested) to support CUDA 11 and the newer cub. However, we apply a custom patch to DGL when installing it. GNNLab does not depend on DGL at compile time, so if you do not run the DGL baseline in our codebase, you can directly install a newer version following the DGL website. If you do need to run our DGL baseline, you may have to build DGL from source with our patch applied.
  • PyTorch: install a newer version of PyTorch following its website by updating our Dockerfile. Since GNNLab depends on PyTorch at compile time, there may be compatibility issues when building GNNLab. Feel free to ask for help in this issue.
  • GNNLab: add sm_80 to setup.py.
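To sketch the last bullet: in a typical CUDAExtension-style setup.py, targeting the A100 means adding an sm_80 entry to the nvcc gencode flags. This is a hedged illustration; `nvcc_flags` is a placeholder name and the actual flag list in GNNLab's setup.py may differ:

```python
# Hypothetical sketch: extend the nvcc gencode flags in setup.py so the
# extension is compiled for compute capability 8.0 (A100) as well.
nvcc_flags = [
    "-gencode=arch=compute_70,code=sm_70",  # V100
    "-gencode=arch=compute_75,code=sm_75",  # Turing
]
# Added for A100 (sm_80):
nvcc_flags.append("-gencode=arch=compute_80,code=sm_80")
print(nvcc_flags[-1])
```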

@BearBiscuit05
Author

BearBiscuit05 commented Mar 15, 2023

Thank you very much for your reply. Following your tips, I still ran into some problems when migrating to the A100, and at the moment I cannot tell whether they are caused by my own setup. So I switched to a V100 cluster for later experiments, and it ran successfully. If I manage to get it working on the A100, I will post the solution here as well. In addition, I would like to know whether 256 GB of memory and 4x V100 are enough to run the Papers100M experiments?

@molamooo
Contributor

> I would like to know whether 256 GB of memory and 4x V100 are enough to run the Papers100M experiments?

I think it's sufficient.

@BearBiscuit05
Author

BearBiscuit05 commented Mar 16, 2023

Hello, we re-ran your program using Docker in a 4x V100 environment, but hit a problem. Our previous test was conducted on Alibaba Cloud with 2x V100 on the ogbn-products dataset, and Docker worked fine there. This time we get the following error when running the program:

(fgnn_env) root@fd0b9b8c719d:/app/source/fgnn/example# python samgraph/multi_gpu/train_gcn.py --dataset papers100M --num-train-worker 1 --num-sample-worker 1 --pipeline --cache-policy pre_sample --cache-percentage 0.1 --num-epoch 10 --batch-size 8000
Using backend: pytorch
Illegal instruction (core dumped)

It seems to be related to the operating system or the CPU. I am not sure whether GNNLab has special requirements on the CPU.
Our CPU is: Intel(R) Xeon(R) CPU E5-2698 v4 @ 2.20GHz
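"Illegal instruction (core dumped)" usually means the binary contains instructions the local CPU does not support (for example, AVX-512 code running on a Broadwell-era Xeon such as the E5-2698 v4, which only has AVX2). As a generic Linux diagnostic, not a GNNLab-specific requirement, one can list which SIMD extensions the CPU advertises:

```shell
# Print the common SIMD flags this CPU supports; a binary compiled for a
# flag absent from this list (e.g. avx512f) crashes with SIGILL.
grep -m1 '^flags' /proc/cpuinfo | tr ' ' '\n' \
  | grep -E '^(sse4_2|avx|avx2|avx512f)$' | sort -u \
  || echo "none of the common SIMD flags found"
```

If the extension module was built on (or for) a different machine, re-compiling it locally makes the generated code match the host CPU.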

@lixiaobai09
Contributor

Does this problem still exist after re-compiling the project with "python setup.py clean; python setup.py install" inside Docker? The build environment is already prepared in the Docker image, so you can re-compile it there.

@BearBiscuit05
Author

> Does this problem still exist after re-compiling the project with "python setup.py clean; python setup.py install" inside Docker? The build environment is already prepared in the Docker image, so you can re-compile it there.

Thank you very much, this suggestion was right; I can run GNNLab successfully now.

@SwiftCrown

> According to link, pip-installed PyTorch cannot run on an A100 (sm_80) because 1) no PTX is shipped in the wheel, and 2) the cuDNN 7.x that is compatible with CUDA 10.2 does not support the A100.
>
> So the only solution is to modify the Dockerfile to build a Docker image based on CUDA 11. To achieve this, there are several compatibility issues you may need to fix manually:
>
> • cub: CUDA 11 ships with cub, so the cub submodule in GNNLab is no longer needed and should be removed from setup.py.
> • DGL: the DGL version must be updated (0.8.0 is tested) to support CUDA 11 and the newer cub. However, we apply a custom patch to DGL when installing it. GNNLab does not depend on DGL at compile time, so if you do not run the DGL baseline in our codebase, you can directly install a newer version following the DGL website. If you do need to run our DGL baseline, you may have to build DGL from source with our patch applied.
> • PyTorch: install a newer version of PyTorch following its website by updating our Dockerfile. Since GNNLab depends on PyTorch at compile time, there may be compatibility issues when building GNNLab. Feel free to ask for help in this issue.
> • GNNLab: add sm_80 to setup.py.

Thanks for the method; I tried it on a machine with multiple RTX 3090s and it works fine.

@molamooo molamooo pinned this issue May 30, 2023

4 participants