Segfault in relion_refine_mpi with --firstiter_cc and --gpu #7
Comments
Original comment by Bjoern Forsberg (Bitbucket: bforsbe, GitHub: bforsbe): I hope it WAS a bug. We noticed a recently introduced bug with --firstiter_cc, which should be fixed in v2.0.b1, pushed no more than 30 minutes ago. Try pulling the new code and running again. If the problem persists, I'll dig deeper. Thanks again for reporting!
Original comment by Bjoern Forsberg (Bitbucket: bforsbe, GitHub: bforsbe): Does it crash immediately? If so, is it possible to create a minimal example with input data that shows this error, like just a few particles? If so, I can have a look at it. If the files are "too" large, I could receive them in some other way than through here. I'll try to reproduce it here on separate data in the meantime.
Original comment by Bjoern Forsberg (Bitbucket: bforsbe, GitHub: bforsbe): I believe that the error Craig observed is the one we did in fact fix in v2.0.b1. Dimitry appears to have found a wholly separate issue. Luckily, I seem to have been able to reproduce that issue here now, so hopefully there will be a fix for it later today.
Original comment by Bjoern Forsberg (Bitbucket: bforsbe, GitHub: bforsbe): I believe I know what the issue is now. Since cross-correlation is so infrequently used and is not a bottleneck, these functions have not been adapted to the most recent version of the difference-kernel layout. As a result, they still use a layout that is limited by hardware capacity: it can request more shared memory than is available on the device. We could fix this in a number of ways, the easiest being to decrease the block size if the memory limit is exceeded. However, this suffers from the same weakness, just at a much later stage. I think the more reasonable thing is to update the cc-kernels to the new layout, which may take a day or two. For now, however, you can circumvent the issue by doing one of two things (I had to do both...):
Let me know if any of these measures help at all!
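To illustrate why shrinking the block size only postpones the problem, here is a minimal host-side C++ sketch. It is not RELION code: the functions `sharedBytesNeeded` and `fitBlockSize`, and the assumption that the request grows as block size times number of translations times `sizeof(float)`, are hypothetical and chosen only to mirror the description above.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical model (an assumption, not RELION's actual layout):
// the shared-memory request grows with both the block size and the
// number of translations.
std::size_t sharedBytesNeeded(std::size_t blockSize, std::size_t translations) {
    return blockSize * translations * sizeof(float);
}

// Halve the block size until the request fits within limitBytes.
// Returns 0 if even the smallest allowed block size does not fit.
std::size_t fitBlockSize(std::size_t blockSize, std::size_t minBlockSize,
                         std::size_t translations, std::size_t limitBytes) {
    assert(minBlockSize > 0 && blockSize >= minBlockSize);
    while (blockSize >= minBlockSize) {
        if (sharedBytesNeeded(blockSize, translations) <= limitBytes) {
            return blockSize;
        }
        blockSize /= 2;  // try a smaller block
    }
    return 0;  // no admissible block size: the request cannot be satisfied
}
```

With an example limit of 48 KiB (49152 bytes), `fitBlockSize(128, 32, 64, 49152)` keeps the block size at 128 and `fitBlockSize(128, 32, 200, 49152)` falls back to 32, but `fitBlockSize(128, 32, 500, 49152)` returns 0. Under this model the fallback only moves the failure to a larger number of translations, which is the weakness noted above.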
Original comment by Bjoern Forsberg (Bitbucket: bforsbe, GitHub: bforsbe): I just pushed a possible fix (v2.0.b2) by creating a new cross-correlation kernel which does not have shared-memory usage proportional to the number of translations. This should also do the trick. If not, let me know and I'll continue hacking away at it.
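One generic way to keep scratch memory independent of the number of translations is to process them in fixed-size tiles. The host-side C++ sketch below shows that idea on a toy 1D circular cross-correlation. It is an assumption-laden illustration, not the kernel that was pushed: `kTile`, `ccScore`, and `bestShift` are hypothetical names, and the fixed-size `scratch` array merely stands in for a shared-memory buffer.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

constexpr std::size_t kTile = 4;  // fixed scratch size (hypothetical value)

// Toy 1D circular cross-correlation score of img against ref at one shift.
float ccScore(const std::vector<float>& ref, const std::vector<float>& img,
              std::size_t shift) {
    float sum = 0.0f;
    for (std::size_t i = 0; i < ref.size(); ++i) {
        sum += ref[i] * img[(i + shift) % img.size()];
    }
    return sum;
}

// Find the best-scoring shift while only ever holding kTile scores at once,
// so the scratch buffer does not grow with numShifts.
std::size_t bestShift(const std::vector<float>& ref,
                      const std::vector<float>& img, std::size_t numShifts) {
    assert(!ref.empty() && ref.size() == img.size() && numShifts > 0);
    std::array<float, kTile> scratch{};
    std::size_t best = 0;
    float bestScore = ccScore(ref, img, 0);
    for (std::size_t start = 0; start < numShifts; start += kTile) {
        const std::size_t count = std::min(kTile, numShifts - start);
        for (std::size_t k = 0; k < count; ++k) {
            scratch[k] = ccScore(ref, img, start + k);  // fill one tile
        }
        for (std::size_t k = 0; k < count; ++k) {
            if (scratch[k] > bestScore) {  // reduce the tile
                bestScore = scratch[k];
                best = start + k;
            }
        }
    }
    return best;
}
```

The trade-off in this sketch is more passes over the data in exchange for a memory footprint that stays constant however many translations are sampled.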
Original comment by Dimitry Tegunov (Bitbucket: DTegunov, GitHub: DTegunov): It appears fixed in 0be3990, thanks! On a side note: compiling with sm_52 won't solve issues with dynamic shared memory allocation. The hardware will already allocate everything it physically can, regardless of the compiler target.
Originally reported by: Dimitry Tegunov (Bitbucket: DTegunov, GitHub: DTegunov)
I hope it's an actual bug this time ;-)
I'm running 3D refinement using
(template created in GUI, names modified and launched in terminal), and it crashes saying
It doesn't crash on the GPU if I remove --firstiter_cc, and the CPU version runs fine with --firstiter_cc. Not sure if I can provide my test data due to its size, but maybe there are some debug flags I can set that will give you more information to work with?