iterateInParallel causes GPU crash if collection size is too large #228
Comments
How much RAM does your GPU have? Could you post the output of nvidia-smi? Thanks!
[nvidia-smi table output was posted here; only the table borders survived extraction]
Wrote a bit of code to throttle the parallelism if CUDA is running out of memory... it still segfaults occasionally. Probably an issue with the driver? https://github.com/import-io/deeplearning4j/commit/a38a5883e512cf2ffb6e36c2b2d6198ee72baf72
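The throttling idea mentioned above can be sketched with a bounded semaphore: only a fixed number of work items are allowed in flight at once, so device memory demand stays capped regardless of collection size. This is a hypothetical illustration (class and method names are invented), not the code from the linked commit:

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

// Hypothetical throttle: a semaphore with maxConcurrent permits bounds how
// many items are being processed at any moment, instead of submitting the
// whole collection to the device at once.
public class ThrottledParallel {
    public static <T> void iterateThrottled(List<T> items, int maxConcurrent, Consumer<T> work) {
        ExecutorService pool = Executors.newFixedThreadPool(maxConcurrent);
        Semaphore slots = new Semaphore(maxConcurrent);
        try {
            for (T item : items) {
                slots.acquire();            // block until a slot frees up
                pool.execute(() -> {
                    try {
                        work.accept(item);  // e.g. train on one sentence batch
                    } finally {
                        slots.release();    // free the slot for the next item
                    }
                });
            }
            pool.shutdown();
            pool.awaitTermination(1, TimeUnit.MINUTES);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

With `maxConcurrent` tuned to what the GPU's memory can hold, peak allocation no longer grows with the number of sentences.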
Yeah. This is related to the parallel synchronization.
This is what I'm seeing running GPUs on Spark. Driver stacktrace:

A fatal error has been detected by the Java Runtime Environment:
SIGSEGV (0xb) at pc=0x00007f68711d5234, pid=7856, tid=140081833699072
JRE version: OpenJDK Runtime Environment (8.0_45-b13) (build 1.8.0_45-b13)
Java VM: OpenJDK 64-Bit Server VM (25.45-b02 mixed mode linux-amd64 compressed oops)
Problematic frame:
C  [libcublas.so.7.0+0x21234]
Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
An error report file with more information is saved as:
/home/buildbot/hs_err_pid7856.log
If you would like to submit a bug report, please visit:
http://bugreport.java.com/bugreport/crash.jsp
The crash happened outside the Java Virtual Machine in native code.
See problematic frame for where to report the bug.
Aborted
We've stabilized this. Closing. |
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs. |
In functions that use the iterateInParallel functionality, such as word2vec, there is no control over how many threads are used or how much memory is consumed.
iterateInParallel in word2vec creates n word lists (where n is the number of sentences to be trained on).
As an example, I have 200000 sentences of 10 words each; iterateInParallel tries to compute all of them at the same time on the GPU, which either segfaults the program or throws CUDA_ERROR_OUT_OF_MEMORY.
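One way to avoid submitting all n sentences at once is to split the collection into fixed-size chunks and hand the device one chunk at a time. A minimal sketch (the class and method names here are hypothetical, not DL4J API):

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: partition a list into chunks of at most chunkSize
// elements, so a caller can process 200000 sentences a few thousand at a
// time instead of all at once.
public class ChunkedIterate {
    public static <T> List<List<T>> chunk(List<T> items, int chunkSize) {
        List<List<T>> chunks = new ArrayList<>();
        for (int i = 0; i < items.size(); i += chunkSize) {
            // subList is a view; copy it if the source list will be mutated
            chunks.add(new ArrayList<>(items.subList(i, Math.min(i + chunkSize, items.size()))));
        }
        return chunks;
    }
}
```

Processing the chunks sequentially (or with a small bounded pool) keeps the number of word lists resident on the GPU proportional to the chunk size rather than to the corpus size.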