
8 GPUs got an error, but 6 GPUs is good. #22

Closed
swzhang5 opened this issue Dec 1, 2020 · 5 comments

Comments

swzhang5 commented Dec 1, 2020

Thanks for sharing your great work.
But when I tried to retrain your model with 8 GPUs, I got an error: CUDA error: device-side assert triggered.
Training with 6 GPUs runs, but the mAP is not good. I don't know the reason. Do you know how to make it work with 8 GPUs? Or could you give me a config for 6 GPUs? Thank you.

jason718 (Contributor) commented Dec 1, 2020

We never tested the 6-GPU setting, sorry.

Could you try re-running the program with 8 GPUs? I also run into the same issue sometimes, though not often. See known issue 1.

bradezard131 (Contributor) commented:
My guess is the BCE Epsilon is too small for the WSDDN loss. I find in my own code that I need 1e-8 on the clamp to avoid device-side asserts. You could try running with the environment variable CUDA_LAUNCH_BLOCKING=1 set, which may slow down training but will pin down the error to an exact line of code.
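The clamp-before-log idea can be sketched in plain Python. This is a minimal illustration of the failure mode, not the repository's actual loss code; the function name, the scalar form, and the default epsilon are my assumptions:

```python
import math

def clamped_bce(score, label, eps=1e-8):
    """Binary cross-entropy on a single image-level score.

    Clamping the score into [eps, 1 - eps] keeps log() finite.
    Without the clamp, a score that saturates to exactly 0 or 1
    produces log(0) = -inf; on GPU this can surface as the opaque
    "device-side assert triggered" error rather than a clean NaN.
    """
    s = min(max(score, eps), 1.0 - eps)
    return -(label * math.log(s) + (1.0 - label) * math.log(1.0 - s))

# A saturated score no longer blows up:
loss = clamped_bce(1.0, 1.0)  # finite and close to 0
```

To locate the failing kernel, you can run training with `CUDA_LAUNCH_BLOCKING=1` in the environment, which forces kernels to launch synchronously so the Python traceback points at the actual failing call rather than a later, unrelated line.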

jason718 (Contributor) commented Dec 2, 2020

> My guess is the BCE Epsilon is too small for the WSDDN loss. I find in my own code that I need 1e-8 on the clamp to avoid device-side asserts. You could try running with the environment variable CUDA_LAUNCH_BLOCKING=1 set, which may slow down training but will pin down the error to an exact line of code.

Thanks for sharing, very interesting. I took that clamping threshold from OICR and never changed it. I'll try it here as well.

swzhang5 (Author) commented Dec 2, 2020

> My guess is the BCE Epsilon is too small for the WSDDN loss. I find in my own code that I need 1e-8 on the clamp to avoid device-side asserts. You could try running with the environment variable CUDA_LAUNCH_BLOCKING=1 set, which may slow down training but will pin down the error to an exact line of code.

Thanks! The program is working well with 8 GPUs now.

jason718 (Contributor) commented Dec 2, 2020

Awesome. Closing this now, and feel free to re-open.
