NCCL Error 1: unhandled cuda error #9

Open
ShuJackson opened this issue Jun 10, 2021 · 3 comments

Comments

@ShuJackson

When I run the training script with ./run.sh, the process aborts with an instance of 'std::runtime_error':
what(): NCCL Error 1: unhandled cuda error

This happens every time during the evaluation step of the train.py script: the 'convert squad examples to features' step completes successfully, and the crash comes right after 'Evaluating: 0%' is printed.

I have confirmed that torch can see CUDA:

print(torch.cuda.is_available())
True
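
For anyone hitting the same error: `torch.cuda.is_available()` only says that at least one device is visible, so a slightly broader check may help narrow things down. The following is a minimal sketch, not part of the original report. The function name `cuda_report` is made up for illustration, and touching each GPU with a tiny allocation is an assumption about what is worth testing, not a known fix.

```python
import importlib.util


def cuda_report():
    """Collect basic CUDA/NCCL facts as a dict (hypothetical helper)."""
    # Degrade gracefully if torch is not installed in this environment.
    if importlib.util.find_spec("torch") is None:
        return {"torch_installed": False}

    import torch
    import torch.distributed as dist

    report = {
        "torch_installed": True,
        "torch_version": torch.__version__,
        "cuda_build": torch.version.cuda,  # None on CPU-only builds
        "cuda_available": torch.cuda.is_available(),
        "device_count": torch.cuda.device_count(),
        "nccl_available": dist.is_available() and dist.is_nccl_available(),
    }

    # Try a tiny allocation on every visible GPU: a device can be listed
    # and still fail as soon as it is actually used.
    devices = []
    for i in range(report["device_count"]):
        entry = {"index": i}
        try:
            entry["name"] = torch.cuda.get_device_name(i)
            torch.zeros(1, device=f"cuda:{i}")
            torch.cuda.synchronize(i)
            entry["usable"] = True
        except RuntimeError as exc:
            entry["usable"] = False
            entry["error"] = str(exc)
        devices.append(entry)
    report["devices"] = devices
    return report


if __name__ == "__main__":
    for key, value in cuda_report().items():
        print(f"{key}: {value}")
```

If one of the devices reports `usable: False`, that GPU (or its driver state) is a more likely culprit than the training script itself.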

@ShuJackson

@TheAtticusProject

@hendrycks

This is a very low-level issue, and unfortunately "NCCL Error 1: unhandled cuda error" means that even CUDA does not know what it is. I could only suggest updating drivers or seeing if there is a more detailed error log, but even then this would be a CUDA or hardware issue.
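
One way to look for a more detailed error log is to rerun the script with NCCL's `NCCL_DEBUG=INFO` and CUDA's `CUDA_LAUNCH_BLOCKING=1` environment variables set. The sketch below shows this under stated assumptions: the helper names `build_debug_env` and `run_with_debug_env` are hypothetical, and the optional `single_gpu` switch (which hides all but GPU 0 via `CUDA_VISIBLE_DEVICES`) is an extra idea that was not suggested in this thread.

```python
import os
import subprocess


def build_debug_env(base=None, single_gpu=False):
    """Return a copy of the environment with verbose NCCL/CUDA settings."""
    env = dict(os.environ if base is None else base)
    env["NCCL_DEBUG"] = "INFO"        # NCCL prints its setup and failure details
    env["CUDA_LAUNCH_BLOCKING"] = "1"  # CUDA errors surface at the failing call
    if single_gpu:
        # Assumption: if the error vanishes on one GPU, multi-GPU
        # communication is the likely trigger.
        env["CUDA_VISIBLE_DEVICES"] = "0"
    return env


def run_with_debug_env(cmd, single_gpu=False):
    """Run a command (for example ["./run.sh"]) with the debug environment."""
    completed = subprocess.run(cmd, env=build_debug_env(single_gpu=single_gpu))
    return completed.returncode


# Example (not executed here):
# run_with_debug_env(["./run.sh"])
```

The extra NCCL output goes to the script's normal console output, so it should appear just before the `std::runtime_error` message.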

@Mei0211

Mei0211 commented Feb 16, 2022

How do I run the script? Could you give me some pointers on which files need to be modified and how to execute the code?

3 participants