
8 GPUs got an error, but 6 GPUs is good. #22

Closed
swzhang5 opened this issue Dec 1, 2020 · 5 comments

Comments

swzhang5 commented Dec 1, 2020

Thanks for sharing your great work.
But when I tried to retrain your model with 8 GPUs, I got an error: CUDA error: device-side assert triggered.
Training with 6 GPUs runs, but the mAP is not good. I don't know the reason. Do you know how to make it work with 8 GPUs? Or could you give me a config for 6 GPUs? Thank you.

jason718 (Contributor) commented Dec 1, 2020

We never tested the 6-GPU setting, sorry.

Could you try re-running the program with 8 GPUs? I also run into the same issue sometimes, though not often. See known issue 1.

bradezard131 (Contributor) commented:
My guess is the BCE Epsilon is too small for the WSDDN loss. I find in my own code that I need 1e-8 on the clamp to avoid device-side asserts. You could try running with the environment variable CUDA_LAUNCH_BLOCKING=1 set, which may slow down training but will pin down the error to an exact line of code.
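The clamp-before-log idea can be sketched in plain Python. This is a minimal illustration of the failure mode, not the repository's actual loss code; the function name, the scalar form, and the default epsilon are my assumptions:

```python
import math

def clamped_bce(score, label, eps=1e-8):
    """Binary cross-entropy on a single image-level score.

    Clamping the score into [eps, 1 - eps] keeps log() finite.
    Without the clamp, a score that saturates to exactly 0 or 1
    produces log(0) = -inf; on GPU this can surface as the opaque
    "device-side assert triggered" error rather than a clean NaN.
    """
    s = min(max(score, eps), 1.0 - eps)
    return -(label * math.log(s) + (1.0 - label) * math.log(1.0 - s))

# A saturated score no longer blows up:
loss = clamped_bce(1.0, 1.0)  # finite and close to 0
```

To locate the failing kernel, you can run training with `CUDA_LAUNCH_BLOCKING=1` in the environment, which forces kernels to launch synchronously so the Python traceback points at the actual failing call rather than a later, unrelated line.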

jason718 (Contributor) commented Dec 2, 2020

> My guess is the BCE Epsilon is too small for the WSDDN loss. I find in my own code that I need 1e-8 on the clamp to avoid device-side asserts. You could try running with the environment variable CUDA_LAUNCH_BLOCKING=1 set, which may slow down training but will pin down the error to an exact line of code.

Thanks for sharing, very interesting. I took that clamping threshold from OICR and never changed it. I'll try it here as well.

swzhang5 (Author) commented Dec 2, 2020

> My guess is the BCE Epsilon is too small for the WSDDN loss. I find in my own code that I need 1e-8 on the clamp to avoid device-side asserts. You could try running with the environment variable CUDA_LAUNCH_BLOCKING=1 set, which may slow down training but will pin down the error to an exact line of code.

Thanks! The program is working well with 8 GPUs now.

jason718 (Contributor) commented Dec 2, 2020

Awesome. Closing this now, and feel free to re-open.
