Class2D stalling without error or dying with "KERNEL ERROR: unspecified launch failure" #189
Comments
Original comment by james_krieger (Bitbucket: james_krieger, GitHub: Unknown): Actually I don't think this is related to how many GPUs are used, which admittedly would have been surprising. I have just tried again to run with all three GPUs, and it also died with an unspecified launch failure during run 2.
Original comment by james_krieger (Bitbucket: james_krieger, GitHub: Unknown): I can't continue from run_it012_optimiser.star as I lost it when I updated my machine to access the LMB shared filesystems. I am now running again using a single thread without mpi and it is at 1.88/5.49 hrs of Expectation iteration 2 without any problems. The relion version is now the one in relion-devel-lmb, which has binaries dating from 3rd January. I will update you tomorrow on what happens.
Original comment by james_krieger (Bitbucket: james_krieger, GitHub: Unknown): OK, I got the error during iteration 2 (Expectation iteration 2 of 25). I am now trying again with --continue class2d_it001_optimiser.star
Original comment by james_krieger (Bitbucket: james_krieger, GitHub: Unknown): I don't know. I don't suppose the following can help you work it out?

/lmb/home/public/EM/RELION/relion-devel-lmb lg-krieger_jkrieger> ls -ltrah
total 211K
-rw-r--r--  1 public public  496 Mar  9  2016 AUTHORS
-rw-r--r--  1 public public  18K Mar  9  2016 COPYING
-rwxr-xr-x  1 public public 2.2K Mar  9  2016 INSTALL.sh
-rw-r--r--  1 public public  11K Mar  9  2016 INSTALL
-rw-r--r--  1 public public    0 Mar  9  2016 ChangeLog
-rw-r--r--  1 public public  719 Mar  9  2016 README
-rw-r--r--  1 public public    0 Mar  9  2016 NEWS
-rw-r--r--  1 public public 1.6K Mar  9  2016 relion.h
drwxr-xr-x  2 public public    3 Mar  9  2016 tests
drwxr-xr-x  4 public public    4 Mar  9  2016 external
-rw-r--r--  1 public public   55 Mar 24  2016 .gitignore
drwxr-xr-x  2 public public    9 Jul  4  2016 scripts
drwxr-xr-x  2 public public    8 Jul  7  2016 cmake
drwxr-xr-x  9 public public   19 Jan  3 15:03 .
-rw-r--r--  1 public public  11K Jan  3 15:03 CMakeLists.txt
drwxr-xr-x  5 public public  120 Jan  3 15:03 src
drwxr-xr-x  8 public public   15 Jan  3 15:03 .git
drwxr-xr-x  7 public public   11 Jan  3 15:05 build2
drwxr-xr-x 15 public public   36 Jan 24 15:09 ..
Original comment by james_krieger (Bitbucket: james_krieger, GitHub: Unknown): I did, however, have the same error on a version that I pulled from git and installed myself as well, which I would expect to be the latest one.
Original comment by Bjoern Forsberg (Bitbucket: bforsbe, GitHub: bforsbe): I'd guess you are on v2.0.b7, but you'll know for sure if you run
and compare the date-stamp of the relion binary to the list of commits. You should update, or ask somebody to update the binary, if it's not very recent. Since you have access to the code here you can always clone the code into a local repo and build the latest version yourself; the installation instructions make it really simple.
EDIT: If you did in fact build it yourself, then
in the repo directory will tell you which commit you are on. Also, make sure you then use your newly cloned repo to run, and not /lmb/home/public/EM/RELION/relion-devel-lmb
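The commands referred to above did not survive the import. As a rough sketch of the kind of check being suggested (assuming a cloned repo and a relion_refine binary on the PATH):

# date-stamp of the installed binary, to compare against the commit history
ls -l $(which relion_refine)
# inside the cloned repo: show which commit the checkout is on
git log -1 --format="%h %ad %s"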
Original comment by Bjoern Forsberg (Bitbucket: bforsbe, GitHub: bforsbe): There is a version of the code which could report the error on line 288, but it isn't the beta code; rather, it is the in-house lmb-development version. That would make sense with respect to the path you state as "relion-devel-lmb". To debug this I'd need to know the exact commit hash you are on, since there are no version numbers. Unless you are an advanced user, I would recommend sticking to the beta versions, unless somebody explicitly advised you to use development versions or features. If at all possible, it would help us to know
Original comment by james_krieger (Bitbucket: james_krieger, GitHub: Unknown): I am at the LMB and my computer is now on the shared file systems to avoid copying data. As Sjors has broken the pre-release version (it can't find the fftw libraries), I am using the lmb-development version for now. I am trying to get my compilers working so I can get v2.0.2 of the beta working as well.
Original comment by james_krieger (Bitbucket: james_krieger, GitHub: Unknown): $ ls -lrth
Original comment by Bjoern Forsberg (Bitbucket: bforsbe, GitHub: bforsbe): Ok, that makes sense. To summarize briefly, we need to find out if this is a reproducible error or if it is completely sporadic. The former is much easier to work with, and the strategies for debugging them are ENTIRELY different. When you ran with multiple threads, all threads reported an error at whatever location in the code they happened to be when only one of them actually malfunctioned. That's why relion was reporting an error in lots of locations, which makes it near impossible to say where the error actually happened. Running a single thread eliminates this possibility. So what we want to do is run at least twice with one thread to see if the same location is triggered. If that is the case, we need to know which commit or version that was, and we just about need it to be a very recent one. You probably figured this out already, but for documentation I thought I'd inject this clarification. The best case is that the same particle and point in the execution fails. It is also good to add the flags
when trying to establish a verifiable test-case; I neglected to mention this before, which is my bad. There's no reason for a particular seed, as long as the same one is used in repeat runs.
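The specific flags did not survive the import; judging from the later comments about the random seed and skipping "the 0 perturb", the suggestion was presumably along these lines (a sketch, not the verbatim advice; all other options are placeholders):

# fix the seed and disable the random angular perturbation so repeat runs are comparable
# (flag names inferred from the later comments; check relion_refine --help)
relion_refine --i particles.star --o Class2D/test/run --K 50 --iter 25 --gpu 0 \
    --random_seed 0 --perturb 0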
Original comment by james_krieger (Bitbucket: james_krieger, GitHub: Unknown): My original bug report was based on v2.0.2 - I have just run
Original comment by james_krieger (Bitbucket: james_krieger, GitHub: Unknown): However, this version looks for libraries that have moved, and I am therefore trying to install another copy.
Original comment by james_krieger (Bitbucket: james_krieger, GitHub: Unknown): The in-house lmb-development version shows this from
Original comment by Bjoern Forsberg (Bitbucket: bforsbe, GitHub: bforsbe): Ok. As the original issue report used multiple threads, it's impossible to say which code location is the true failure, as explained in my last post. We can take that as an indication that the problem exists in the beta code as well, which is good to know, but it sadly doesn't help us find the cause or determine the nature of the issue. Thanks for your patience and effort, though; these are the tricky ones that take some fiddling to find.
Original comment by james_krieger (Bitbucket: james_krieger, GitHub: Unknown): Ah ok.
Original comment by james_krieger (Bitbucket: james_krieger, GitHub: Unknown): I just saw your post with the random seed (you sent it about the same time I was checking versions). If I use --continue, it complains about the use of a number of flags, including the random seed, so I assume it reads those values from the optimiser.star file. The initial run was done with random seed 0.
Original comment by james_krieger (Bitbucket: james_krieger, GitHub: Unknown): I now have v2.0.2 installed after installing isl-0.15 and gcc-4.9.4.
Original comment by james_krieger (Bitbucket: james_krieger, GitHub: Unknown): Here is the warning that I can't specify --random_seed with --continue. I have just put it in and run a continuation with v2.0.2 anyway, and as you said, I have skipped the 0 perturb for now.
Original comment by Bjoern Forsberg (Bitbucket: bforsbe, GitHub: bforsbe): With --continue there are certain options that can't be set because you are not allowed to change them from the original run. It will just warn you and ignore those, running with the original settings, which is fine. Again, it's something that adds to the confidence of the diagnosis, but it isn't necessary to assess reproducibility of the problem.
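As an illustration of this behaviour, a continuation of the kind being discussed might look like the sketch below (the optimiser file name is taken from the earlier comment; the output path is a placeholder):

# settings such as the random seed are read back from the optimiser.star file;
# conflicting flags given on the command line are warned about and then ignored
relion_refine --continue class2d_it001_optimiser.star --o Class2D/continue/run --gpu 0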
Original comment by james_krieger (Bitbucket: james_krieger, GitHub: Unknown): Now I'm failing to reproduce the error. The internal lmb-development version has reached iteration 5 and beta v2.0.2 has reached iteration 10. I guess the difference in efficiency between the versions is that I installed v2.0.2 myself on this machine, so the compiler would have optimised it accordingly. Actually, I have just realised that I am now not using an SSD as scratch (I stopped because someone else wanted to run some tests as well, so I left some GPUs and the SSD to them), which could be why I am not getting the error. If these runs finish fine then I will try again with --scratch_dir /ssd and see if that brings up the error again.
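A sketch of the retry described above, with /ssd taken from the comment and the remaining options as placeholders:

# the same kind of Class2D run, but with particle images pre-copied to local SSD scratch
relion_refine --i particles.star --o Class2D/ssd_test/run --K 50 --iter 25 \
    --gpu 0 --scratch_dir /ssd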
As an update, I'd like to say that the error does not come every time I run relion, or anyone else runs relion, on my computer, which explains why it was not easy to reproduce. Nevertheless, we have seen it a number of other times recently with other data sets. Again today we tried with 1 MPI rank, 1 thread and 1 GPU, and saw the following error, similar to before:

KERNEL_ERROR: unspecified launch failure in /net/nfs1/public/EM/RELION/relion-2.0/src/gpu_utils/cuda_ml_optimiser.cu at line 288 (error-code 4)

This was using v2.0.3 as installed by Sjors for the LMB generally. I don't know if it's relevant, but I have also had gromacs crashing with an unspecified launch failure error more than once, e.g.:

Program: gmx mdrun, version 2016-beta2
Fatal error:
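For reference, the minimal relion run described above (1 MPI rank, 1 thread, 1 GPU) corresponds to something like the following sketch, with placeholder file names and class/iteration counts:

# a single process, one thread (--j 1) and one GPU (device 0)
relion_refine --i particles.star --o Class2D/minimal/run --K 50 --iter 25 --j 1 --gpu 0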
Old version.
Originally reported by: james_krieger (Bitbucket: james_krieger, GitHub: Unknown)
My system has an i7-6900K CPU @ 3.20GHz with 16 logical cores, an ASUS X99-E WS motherboard and 3 Titan X Maxwell GPUs. The benchmark Class2D works with all 3 GPUs but fails at the end of an expectation iteration when I only ask for two of them. I have tried multiple times with two GPUs and I consistently get one of the following results:
1. Relion stalls with one GPU in full use and the other showing memory usage but not processing anything, as seen in the nvidia-smi output below. This may be related to a previously fixed issue, and I have not seen it since updating to the latest relion-beta2.
2. I get an error related to an unspecified launch failure:
Expectation iteration 2 of 25
23.92/36.37 min .......................................~~(,_,">KERNEL_ERROR: unspecified launch failure in /data/bin/EM/RELION/relion-devel-lmb_copy_newPC/src/gpu_utils/cuda_helper_functions.cu at line 386 (error-code 4)
KERNEL_ERROR: unspecified launch failure in /data/bin/EM/RELION/relion-devel-lmb_copy_newPC/src/gpu_utils/cuda_helper_functions.cu at line 386 (error-code 4)
KERNEL_ERROR: unspecified launch failure in /data/bin/EM/RELION/relion-devel-lmb_copy_newPC/src/gpu_utils/cuda_helper_functions.cu at line 386 (error-code 4)
KERNEL_ERROR: unspecified launch failure in /data/bin/EM/RELION/relion-devel-lmb_copy_newPC/src/gpu_utils/cuda_helper_functions.cu at line 386 (error-code 4)
mpirun noticed that process rank 2 with PID 16494 on node ig-pc-10 exited on signal 11 (Segmentation fault).
This error also occurred after updating to the latest relion2-beta but not until iteration 14:
Expectation iteration 14 of 25
0.62/18.02 min ..~~(,_,">KERNEL_ERROR: unspecified launch failure in /data/bin/relion2-beta_copy_newPC/src/gpu_utils/cuda_helper_functions.cu at line 419 (error-code 4)
KERNEL_ERROR: unspecified launch failure in /data/bin/relion2-beta_copy_newPC/src/gpu_utils/cuda_ml_optimiser.cu at line 2071 (error-code 4)
mpirun noticed that process rank 1 with PID 8750 on node ig-pc-10 exited on signal 11 (Segmentation fault).
I attach output files for the job that stopped with unspecified launch failure at iteration 14. Please let me know if you would like anything else.
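For context, the two-GPU MPI runs that produced the errors above would have been launched with something along these lines (a sketch only; the actual job parameters are not given in the report):

# 3 MPI ranks = 1 master + 2 worker processes, one Titan X each; other options are placeholders
mpirun -n 3 relion_refine_mpi --i particles.star --o Class2D/two_gpu/run \
    --K 50 --iter 25 --j 4 --gpu 0:1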