RuntimeError: NCCL communicator was aborted on rank 1 #8

Open
lilyswang opened this issue Jan 18, 2022 · 4 comments

Comments

@lilyswang

Thanks for your error report and we appreciate it a lot.

Checklist

  1. I have searched related issues but cannot get the expected help.
  2. I have read the FAQ documentation but cannot get the expected help.
  3. The bug has not been fixed in the latest version.

Describe the bug
A clear and concise description of what the bug is.

Reproduction

  1. What command or script did you run?
     ./run_detection_train.sh
  2. Did you make any modifications on the code or config? Did you understand what you have modified?
     No.
  3. What dataset did you use?
     My own dataset (similar to bdd100k), with about 112,000 (11.2W) images in the training set.

Thanks for your nice work. We have run into a problem and need your help. I started training with my own dataset. When training reaches the end of one epoch, the following error is reported (see the attached log for details):

image

20220112_010819.log

We look forward to your reply! Thanks a lot!

@mathmanu
Collaborator

I am not an expert in CUDA / NCCL, but please search a bit and see whether you can find a solution. For example, I think these threads may be useful:

https://stackoverflow.com/questions/69693950/error-some-nccl-operations-have-failed-or-timed-out
https://discuss.pytorch.org/t/runtimeerror-nccl-communicator-was-aborted/136630/2
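
If the abort turns out to be a timeout, as the title of the first thread above suggests it might be, two things are commonly tried: turning on NCCL's own logging with the `NCCL_DEBUG` environment variable, and passing a longer `timeout` to `torch.distributed.init_process_group`. The sketch below only prepares those two values. The helper name `nccl_debug_settings` and the 60-minute value are illustrative assumptions, not something taken from this repository or confirmed as the fix for this issue.

```python
import datetime
import os


def nccl_debug_settings(timeout_minutes=60):
    """Hypothetical helper: build a debug env var and a longer collective timeout."""
    env = {"NCCL_DEBUG": "INFO"}  # real NCCL variable; makes NCCL log its activity
    # Assumed value; tune it to how long the slowest rank may keep the others waiting.
    timeout = datetime.timedelta(minutes=timeout_minutes)
    return env, timeout


env, timeout = nccl_debug_settings()
os.environ.update(env)  # set this before the process group is created

# In the training script (requires PyTorch built with NCCL, so not run here):
# import torch.distributed as dist
# dist.init_process_group(backend="nccl", timeout=timeout)
```

Where the process group is created depends on the training script, so the commented call may need to be adapted. The extra NCCL logging should at least help show which rank stalls at the end of the epoch.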

@malianghui

@lilyswang Hello, I have the same error. Have you solved the problem?

@malianghui

@mathmanu I tried the approach in your link, but it did not work. So sad!

@weiyx16

weiyx16 commented Aug 31, 2022

Facing exactly the same problem...
