
Failed to run in A100 #10

Closed
BearBiscuit05 opened this issue Mar 9, 2023 · 7 comments

Comments

@BearBiscuit05

During our experiments we followed the Docker installation method. When we ran the command python samgraph/multi_gpu/train_gcn.py --dataset products --num-train-worker 1 --num-sample-worker 1 --pipeline --cache-policy pre_sample --cache-percentage 0.1 --num-epoch 10 --batch-size 8000, the system reported an error:

NVIDIA A100-PCIE-40GB with CUDA capability sm_80 is not compatible with the current PyTorch installation. The current PyTorch install supports CUDA capabilities sm_37 sm_50 sm_60 sm_70 sm_75. If you want to use the NVIDIA A100-PCIE-40GB GPU with PyTorch, please check the instructions at Start Locally | PyTorch

This looks like a version mismatch problem. How should we deal with it in order to run GNNLab on an A100?

@molamooo
Contributor

molamooo commented Mar 9, 2023

According to link, pip-installed PyTorch cannot run on an A100 (sm_80) because 1) no PTX is shipped in the wheel, and 2) the cuDNN 7.x that is compatible with CUDA 10.2 does not support the A100.

So the only solution is to modify the Dockerfile to build a Docker image based on CUDA 11. To achieve this, there are several compatibility issues you may need to fix manually:

  • cub: CUDA 11 ships with cub, so the cub submodule in GNNLab is no longer needed and should be removed from setup.py.
  • DGL: the DGL version must be updated (0.8.0 is tested) to support CUDA 11 and the newer cub. However, we apply a custom patch to DGL when installing it. GNNLab does not depend on DGL at compile time, so if you do not run the DGL baseline in our codebase, you can directly install a newer version following the DGL website. If you do need to run our DGL baseline, you may have to build DGL from source with our patch applied.
  • PyTorch: install a newer version of PyTorch following its website by updating our Dockerfile. Since GNNLab depends on PyTorch at compile time, there may be compatibility issues when building GNNLab. Feel free to ask for help in this issue.
  • GNNLab: add sm_80 to setup.py.
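To sketch the last bullet: in a typical CUDAExtension-style setup.py, targeting the A100 means adding an sm_80 entry to the nvcc gencode flags. This is a hedged illustration; `nvcc_flags` is a placeholder name and the actual flag list in GNNLab's setup.py may differ:

```python
# Hypothetical sketch: extend the nvcc gencode flags in setup.py so the
# extension is compiled for compute capability 8.0 (A100) as well.
nvcc_flags = [
    "-gencode=arch=compute_70,code=sm_70",  # V100
    "-gencode=arch=compute_75,code=sm_75",  # Turing
]
# Added for A100 (sm_80):
nvcc_flags.append("-gencode=arch=compute_80,code=sm_80")
print(nvcc_flags[-1])
```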

@BearBiscuit05
Author

BearBiscuit05 commented Mar 15, 2023

Thank you very much for your reply. Following your tips, I still ran into some problems when migrating to the A100, and at the moment I cannot tell whether they are caused by my own setup. So I switched to a V100 cluster for later experiments, and it ran successfully. If I manage to get it working on the A100, I will post the solution here as well. In addition, I would like to know whether 256 GB of memory and 4x V100 are enough to run the Papers100M experiments?

@molamooo
Contributor

> I would like to know whether 256 GB of memory and 4x V100 are enough to run the Papers100M experiments?

I think it's sufficient.

@BearBiscuit05
Author

BearBiscuit05 commented Mar 16, 2023

Hello, we re-ran your program using Docker in a 4x V100 environment, but hit a problem. Our previous test was conducted on Alibaba Cloud with 2x V100 on the ogbn-products dataset, and Docker worked fine there. This time we get the following error when running the program:

(fgnn_env) root@fd0b9b8c719d:/app/source/fgnn/example# python samgraph/multi_gpu/train_gcn.py --dataset papers100M --num-train-worker 1 --num-sample-worker 1 --pipeline --cache-policy pre_sample --cache-percentage 0.1 --num-epoch 10 --batch-size 8000
Using backend: pytorch
Illegal instruction (core dumped)

It seems to be related to the operating system or the CPU. I am not sure whether GNNLab has special requirements on the CPU.
Our CPU is: Intel(R) Xeon(R) CPU E5-2698 v4 @ 2.20GHz
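"Illegal instruction (core dumped)" usually means the binary contains instructions the local CPU does not support (for example, AVX-512 code running on a Broadwell-era Xeon such as the E5-2698 v4, which only has AVX2). As a generic Linux diagnostic, not a GNNLab-specific requirement, one can list which SIMD extensions the CPU advertises:

```shell
# Print the common SIMD flags this CPU supports; a binary compiled for a
# flag absent from this list (e.g. avx512f) crashes with SIGILL.
grep -m1 '^flags' /proc/cpuinfo | tr ' ' '\n' \
  | grep -E '^(sse4_2|avx|avx2|avx512f)$' | sort -u \
  || echo "none of the common SIMD flags found"
```

If the extension module was built on (or for) a different machine, re-compiling it locally makes the generated code match the host CPU.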

@lixiaobai09
Contributor

Does this problem still exist after re-compiling the project with "python setup.py clean; python setup.py install" inside Docker? The build environment is already prepared in the Docker image, so you can re-compile it there.

@BearBiscuit05
Author

> Does this problem still exist after re-compiling the project with "python setup.py clean; python setup.py install" inside Docker? The build environment is already prepared in the Docker image, so you can re-compile it there.

Thank you very much, this suggestion was right; I can run GNNLab successfully now.

@SwiftCrown

> According to link, pip-installed PyTorch cannot run on an A100 (sm_80) because 1) no PTX is shipped in the wheel, and 2) the cuDNN 7.x that is compatible with CUDA 10.2 does not support the A100.
>
> So the only solution is to modify the Dockerfile to build a Docker image based on CUDA 11. To achieve this, there are several compatibility issues you may need to fix manually:
>
> • cub: CUDA 11 ships with cub, so the cub submodule in GNNLab is no longer needed and should be removed from setup.py.
> • DGL: the DGL version must be updated (0.8.0 is tested) to support CUDA 11 and the newer cub. However, we apply a custom patch to DGL when installing it. GNNLab does not depend on DGL at compile time, so if you do not run the DGL baseline in our codebase, you can directly install a newer version following the DGL website. If you do need to run our DGL baseline, you may have to build DGL from source with our patch applied.
> • PyTorch: install a newer version of PyTorch following its website by updating our Dockerfile. Since GNNLab depends on PyTorch at compile time, there may be compatibility issues when building GNNLab. Feel free to ask for help in this issue.
> • GNNLab: add sm_80 to setup.py.

Thanks for the method; I tried it on a machine with multiple RTX 3090s and it works fine.

@molamooo molamooo pinned this issue May 30, 2023

4 participants