CUDA device-side assert: image scores passed to BCE loss are NaN during early training #24
Comments
I find lowering the …
Honestly, we don't recommend single-GPU training. First, it's unstable, as you have spotted. Second, the performance has not been verified, especially after the learning-rate drop. As you mentioned, this happened in early training, so try a longer warm-up? Note that your log shows the issue happened at iteration 180, and the default …
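To make the warm-up suggestion concrete, here is a minimal sketch of a Detectron2-style linear warm-up schedule. The function and parameter names (`warmup_lr`, `warmup_iters`, `warmup_factor`) are illustrative, not this repo's exact config keys, and the default values are assumptions for the example:

```python
def warmup_lr(iteration, base_lr=0.01, warmup_iters=200, warmup_factor=1.0 / 1000):
    """Linear warm-up: ramp the LR from base_lr * warmup_factor up to base_lr
    over the first warmup_iters iterations, then hold it at base_lr."""
    if iteration >= warmup_iters:
        return base_lr
    alpha = iteration / warmup_iters  # fraction of warm-up completed
    return base_lr * (warmup_factor * (1 - alpha) + alpha)

# Early iterations use a much smaller LR than base_lr; lengthening
# warmup_iters keeps updates small for longer, past the iteration where
# the NaN appeared (~180 in the reported log).
lr_start = warmup_lr(0)    # base_lr / 1000
lr_done = warmup_lr(200)   # back to base_lr
```

If the instability shows up around iteration 180 and the default warm-up ends before that, raising `warmup_iters` well past 200 (and/or lowering `warmup_factor`) keeps the effective LR small through the fragile phase.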
Thanks for your help. I just tested a wide range of values for …
Probably not. …
@xiaaoo Does the warmup factor / BASE_LR actually solve this problem? I'm also not used to the ITER_SIZE argument, but adding ITER_SIZE triggers this error in my environment. My conclusion is to increase IMS_PER_BATCH instead of using ITER_SIZE, e.g. BATCH: 4 rather than BATCH: 1 with ITER_SIZE: 4. The setup BATCH 4, BASE_LR 0.01, STEPS (0, 25000) reproduces mAP 42.22 (the OICR baseline) at the last checkpoint. But my conclusion is still not definitive. Any ideas or shared configs from other users would be helpful.
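For anyone comparing the two setups above: the arithmetic below (hypothetical helper functions, not this repo's code) shows why BATCH 4 and BATCH 1 + ITER_SIZE 4 see the same number of images per optimizer update, yet still differ in how many data-loader iterations and backward passes occur per update:

```python
def images_per_update(ims_per_batch, iter_size=1):
    # Effective batch per weight update = per-step batch * accumulation steps.
    return ims_per_batch * iter_size

# Both configurations see 4 images per optimizer update:
assert images_per_update(4) == images_per_update(1, iter_size=4) == 4

def loader_iterations(num_updates, iter_size=1):
    # STEPS / MAX_ITER are counted in updates, so with ITER_SIZE the loader
    # must supply iter_size times as many batches to reach the same update count.
    return num_updates * iter_size

assert loader_iterations(25000, iter_size=4) == 100000
```

The equivalence also holds only gradient-wise; anything computed per forward pass (e.g. batch statistics) still sees one image at a time with BATCH 1 + ITER_SIZE 4, which may be part of why the two setups behave differently here.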
@tohoaa I am still waiting for the results after setting a lower …

@bradezard131 commented:
I have not had the issues you are describing; for me it works fine. If I get time to look at it again, I will. It appears that adding iter-size has hurt more people than it has helped, which was not my intention :(
@bradezard131 Your idea has helped me a lot. Without your iter-size PR, it wouldn't even be possible to run this on 1 GPU while maintaining the effective batch size (…
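For readers unfamiliar with what an iter-size PR does, here is a minimal plain-Python sketch of gradient accumulation (the names `compute_grad` and `accumulate_and_step` are hypothetical stand-ins for `loss.backward()` and `optimizer.step()`):

```python
def compute_grad(batch):
    # Stand-in for a backward pass; returns this mini-batch's mean "gradient".
    return sum(batch) / len(batch)

def accumulate_and_step(batches, iter_size):
    """Average per-batch gradients over iter_size batches, then 'step' once.
    Returns the effective gradient used for each optimizer step."""
    steps, grad_sum = [], 0.0
    for i, batch in enumerate(batches, start=1):
        grad_sum += compute_grad(batch)          # backward() accumulates grads
        if i % iter_size == 0:
            steps.append(grad_sum / iter_size)   # step() once per iter_size
            grad_sum = 0.0
    return steps

# Four single-image batches with iter_size=4 yield one update, with the same
# gradient a true batch of all four images would give:
four_singles = [[1.0], [2.0], [3.0], [4.0]]
print(accumulate_and_step(four_singles, iter_size=4))  # [2.5]
```

This is why 1 GPU with batch 1 and iter-size 4 can approximate a batch of 4 that would otherwise not fit in memory.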
Thanks for your amazing work!
I got

RuntimeError: CUDA error: device-side assert triggered

at around ~200 steps during training. This error always occurs, even after rerunning the program multiple times or, following #22, setting a higher epsilon (I've tried `1e-8` and `1e-6`). This is the command I use for training. I have tried PyTorch 1.6 and 1.7 with CUDA 10.1.
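For context on why an epsilon is suggested in #22: BCE takes a log of the predicted score, so a score of exactly 0 or 1 (or a NaN arriving from upstream) blows up, and on GPU that surfaces as a device-side assert. A minimal sketch of the clamping idea, using a hypothetical scalar helper rather than this repo's actual loss code:

```python
import math

def safe_bce(p, y, eps=1e-6):
    """Binary cross-entropy with the predicted score clamped to [eps, 1-eps],
    so log() never sees 0. If the score is already NaN before the loss,
    clamping inside the loss cannot fix it."""
    if math.isnan(p):
        raise ValueError("score is NaN before the loss; the bug is upstream")
    p = min(max(p, eps), 1.0 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

print(safe_bce(1.0, 1))  # finite (~1e-6) instead of -log(0) = inf
```

Note the caveat in the docstring: if, as the title says, the scores are already NaN when they reach the loss, a larger epsilon won't help. Also, CUDA device-side asserts are reported asynchronously, so the Python traceback can point at the wrong line; rerunning with `CUDA_LAUNCH_BLOCKING=1` makes the traceback point at the kernel that actually failed.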
Here is the log: