Failed to run in A100 #10
Comments
According to link, pip-installed PyTorch cannot run on the A100 (sm_80) for two reasons: 1) the pip wheels ship no PTX; 2) the cuDNN 7.x releases compatible with CUDA 10.2 do not support the A100. The only solution is therefore to modify the Dockerfile to build a Docker image based on CUDA 11. Doing so may raise several compatibility issues that you will need to fix manually:
Thank you very much for your reply. Following your tips, I still ran into some problems when migrating to the A100, but for now I can't tell whether they are caused by my own setup. I therefore went back to a V100 cluster for the experiments, and it ran successfully. If I get it working on the A100 later, I will post the solution below. In addition, I would like to know whether 256 GB of memory and 4×V100 are enough to run the Papers100M experiments?
I think it's sufficient.
Hello, we re-ran your program using Docker in a 4×V100 environment, but ran into a problem. Our previous test was on Alibaba Cloud with 2×V100 and the ogbn-products dataset, and Docker seemed to work without problems then. This time we get the following error when running the program:
It seems to be related to the system or the CPU. I am not sure whether gnnlab has special requirements for the CPU.
Does the problem persist after re-compiling the project in Docker with the "python setup.py clean; python setup.py install" command? The build environment is already prepared in Docker, so you can re-compile it there.
Thank you very much, this suggestion was right. I can now run gnnlab successfully.
Thanks for the method. I tried it on a machine with multiple RTX 3090s and it works fine.
We followed the Docker installation method. When we ran the command
python samgraph/multi_gpu/train_gcn.py --dataset products --num-train-worker 1 --num-sample-worker 1 --pipeline --cache-policy pre_sample --cache-percentage 0.1 --num-epoch 10 --batch-size 8000
the system reported the error: NVIDIA A100-PCIE-40GB with CUDA capability sm_80 is not compatible with the current PyTorch installation. The current PyTorch install supports CUDA capabilities sm_37 sm_50 sm_60 sm_70 sm_75. If you want to use the NVIDIA A100-PCIE-40GB GPU with PyTorch, please check the instructions at Start Locally | PyTorch
This looks like a version mismatch. How should I deal with it so that gnnlab can run on the A100?
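For anyone hitting the same message, here is a minimal sketch of the compatibility rule behind it, not PyTorch's actual check. It assumes the usual CUDA rules: an `sm_XY` binary runs on devices with the same major version and an equal or higher minor version, while a `compute_XY` PTX entry can be JIT-compiled for any equal or newer device. The helper names `parse_arch` and `build_supports_device` are made up for illustration. The architecture list is the one from the error above. If PyTorch is installed, `torch.cuda.get_arch_list()` and `torch.cuda.get_device_capability()` supply the real values.

```python
import re


def parse_arch(arch):
    """Split an entry such as 'sm_75' or 'compute_80' into (kind, major, minor)."""
    # A trailing letter (e.g. 'sm_90a') is treated like the base architecture,
    # which is a simplification made for this sketch.
    match = re.fullmatch(r"(sm|compute)_(\d+)[a-z]?", arch)
    if match is None:
        raise ValueError(f"unrecognized architecture string: {arch!r}")
    major, minor = divmod(int(match.group(2)), 10)
    return match.group(1), major, minor


def build_supports_device(arch_list, capability):
    """Return True if a build with `arch_list` can run on a device with `capability`."""
    dev_major, dev_minor = capability
    for arch in arch_list:
        kind, major, minor = parse_arch(arch)
        # Compiled binaries: same major version, minor not newer than the device.
        if kind == "sm" and major == dev_major and minor <= dev_minor:
            return True
        # PTX: can be JIT-compiled for any equal or newer compute capability.
        if kind == "compute" and (major, minor) <= (dev_major, dev_minor):
            return True
    return False


if __name__ == "__main__":
    # Architectures listed in the error message reported in this issue.
    reported = ["sm_37", "sm_50", "sm_60", "sm_70", "sm_75"]
    print(build_supports_device(reported, (8, 0)))  # A100 (sm_80): False
    print(build_supports_device(reported, (7, 0)))  # V100 (sm_70): True

    # Optional: check the local installation when PyTorch and a GPU are present.
    try:
        import torch
    except ImportError:
        torch = None
    if torch is not None and torch.cuda.is_available():
        print(build_supports_device(torch.cuda.get_arch_list(),
                                    torch.cuda.get_device_capability(0)))
```

Under these assumptions the reported build fails on the A100 because it has no `sm_8x` binary and no PTX entry at all. This matches the explanation given in the comments, and a build that includes `sm_80` would also cover the RTX 3090 (sm_86).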