NCCL version #12
Comments
One more piece of information: I only use one GPU.
Indeed, we ran into a similar situation during our experiments on some machines (we used many GPUs on different kinds of HPC clusters). I remember we fixed that issue by installing the PyTorch build that matches your CUDA version.
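For example, the cu111 build referenced in the README can be installed with `pip install torch==1.8.1+cu111 -f https://download.pytorch.org/whl/torch_stable.html` (assuming the standard PyTorch wheel index). The wheel bundles its own CUDA runtime and NCCL, so it is the GPU driver, rather than the system CUDA toolkit, that has to be new enough.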
I will try it, thanks.
Sorry to bother you again. I still cannot run it through. If I do not select a GPU it seems to work fine (not sure, since it eventually runs out of memory), but if I select one GPU I get an error. My torch version is 1.8.1+cu111, the same as the environment in the README.
Another question: does it support torch 1.11? That also raised an error. Thanks.
I think it is a PyTorch version issue. In my personal experience, removing the following lines makes it work with other PyTorch versions.
This may sacrifice reproducibility, if that is not your main concern.
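The exact lines being referred to were not preserved in this thread. As a rough illustration only, the reproducibility settings that typically break across PyTorch versions look like this:

```python
# Hypothetical reconstruction -- the original three lines are not preserved
# here. Determinism settings like these are version-sensitive in PyTorch:
import torch

torch.backends.cudnn.deterministic = True  # force deterministic cuDNN kernels
torch.backends.cudnn.benchmark = False     # disable cuDNN autotuning
torch.use_deterministic_algorithms(True)   # added in 1.8; named torch.set_deterministic in 1.7
```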
Thanks for your reply, but it still does not work.
Oh, the above answer was for your second question. After removing the three lines, does torch 1.11+cu113 work?
It does not work. This is the error log for a single GPU:
[error log screenshot not preserved]
Here is the log when I do not select a GPU; it seems to work fine:
[log screenshot not preserved]
Here is a minimal example for distributed training:
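(The original snippet was not preserved in this thread; the following is a hedged sketch of standard single-node PyTorch DDP setup with the NCCL backend, not necessarily this repo's code.)

```python
# Minimal single-node DDP sketch using the NCCL backend.
# Assumes at least one CUDA GPU is available.
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP

def worker(rank, world_size):
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"  # pick a free port to avoid collisions
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    model = DDP(torch.nn.Linear(10, 10).to(rank), device_ids=[rank])
    opt = torch.optim.SGD(model.parameters(), lr=0.1)

    x = torch.randn(8, 10, device=rank)
    loss = model(x).sum()   # dummy forward/backward step
    loss.backward()
    opt.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = torch.cuda.device_count()
    mp.spawn(worker, args=(world_size,), nprocs=world_size)
```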
Resetting the PORT and RDZV_ID works for me. I think multiple runs with the same parameters collide? I'm not really sure here.
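For example (a sketch, assuming the project is launched with torchrun; `train.py` is a placeholder name): a second concurrent run can be given its own rendezvous ID and port with `torchrun --nproc_per_node=1 --rdzv_id=run2 --master_port=29501 train.py`, so it does not collide with an earlier run still holding the default port.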
Hi,
I have installed the environment from the yaml file and installed torch 1.8 following the settings in the README.
My CUDA version is 11.4; it seems there is a version conflict between NCCL, PyTorch, and CUDA.
Is my CUDA version too high?
I also tried torch 1.11+cu113 and got another error.
Looking forward to your reply.
Thank you.
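A quick way to check which versions are actually in play (a hedged suggestion: these report what the installed wheel bundles, which is what PyTorch uses at runtime rather than the system-wide toolkit):

```python
# Print the CUDA / NCCL versions bundled with the installed PyTorch wheel.
import torch

print(torch.__version__)          # e.g. 1.8.1+cu111
print(torch.version.cuda)         # CUDA runtime the wheel was built against
print(torch.cuda.nccl.version())  # bundled NCCL version (format varies by release)
print(torch.cuda.is_available())  # basic driver/runtime compatibility check
```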