
Class2D stalling without error or dying with "KERNEL ERROR: unspecified launch failure" #189

Closed
bforsbe opened this issue Jan 23, 2017 · 26 comments


bforsbe commented Jan 23, 2017

Originally reported by: james_krieger (Bitbucket: james_krieger, GitHub: Unknown)


My system has an i7-6900K CPU @ 3.20GHz with 16 logical cores, an ASUS X99-E WS motherboard and 3 Titan X Maxwell GPUs. The benchmark Class2D works with all 3 GPUs but fails at the end of an expectation iteration when I only ask for two of them. I have tried multiple times with two GPUs and I consistently get one of the following results:

  1. Relion stalls with one GPU in full use and the other showing memory usage but not processing anything, as seen in the nvidia-smi output below. This may be related to a previously fixed issue, and I have not seen it since updating to the latest relion2-beta.

  2. I get an error related to unspecified launch failure:
    Expectation iteration 2 of 25
    23.92/36.37 min .......................................~~(,_,">KERNEL_ERROR: unspecified launch failure in /data/bin/EM/RELION/relion-devel-lmb_copy_newPC/src/gpu_utils/cuda_helper_functions.cu at line 386 (error-code 4)
    KERNEL_ERROR: unspecified launch failure in /data/bin/EM/RELION/relion-devel-lmb_copy_newPC/src/gpu_utils/cuda_helper_functions.cu at line 386 (error-code 4)
    KERNEL_ERROR: unspecified launch failure in /data/bin/EM/RELION/relion-devel-lmb_copy_newPC/src/gpu_utils/cuda_helper_functions.cu at line 386 (error-code 4)
    KERNEL_ERROR: unspecified launch failure in /data/bin/EM/RELION/relion-devel-lmb_copy_newPC/src/gpu_utils/cuda_helper_functions.cu at line 386 (error-code 4)


mpirun noticed that process rank 2 with PID 16494 on node ig-pc-10 exited on signal 11 (Segmentation fault).

This error also occurred after updating to the latest relion2-beta but not until iteration 14:
Expectation iteration 14 of 25
0.62/18.02 min ..~~(,_,">KERNEL_ERROR: unspecified launch failure in /data/bin/relion2-beta_copy_newPC/src/gpu_utils/cuda_helper_functions.cu at line 419 (error-code 4)
KERNEL_ERROR: unspecified launch failure in /data/bin/relion2-beta_copy_newPC/src/gpu_utils/cuda_ml_optimiser.cu at line 2071 (error-code 4)

mpirun noticed that process rank 1 with PID 8750 on node ig-pc-10 exited on signal 11 (Segmentation fault).

  3. The computer sometimes locks up completely, which I think is also related to running relion2.

I attach output files for the job that stopped with unspecified launch failure at iteration 14. Please let me know if you would like anything else.


bforsbe commented Jan 23, 2017

Original comment by james_krieger (Bitbucket: james_krieger, GitHub: Unknown):


Actually I don't think this is related to how many GPUs are used, which admittedly would have been surprising. I have just tried again to run with all three GPUs and it also died with an unspecified launch failure during run 2.

bforsbe commented Jan 23, 2017

Original comment by Bjoern Forsberg (Bitbucket: bforsbe, GitHub: bforsbe):


Could you run a single thread without mpi? You can also add

--continue run_it012_optimiser.star

If you run that a couple of times and the same line reports the error each time, that would be a useful result.
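For concreteness, a single-process, single-thread repeat run might look like the sketch below. Only --continue run_it012_optimiser.star comes from this thread; the binary name, the thread/GPU flags, and the log names are assumptions for illustration:

```shell
# Run RELION directly (no mpirun), one thread, one GPU, keeping the
# full log so any KERNEL_ERROR line can be compared across repeats.
# RELION_CMD is a placeholder for the actual relion_refine binary.
RELION_CMD="${RELION_CMD:-relion_refine}"

"$RELION_CMD" --continue run_it012_optimiser.star --j 1 --gpu 0 \
    2>&1 | tee run_single.log

# Collect any reported kernel-error locations for later comparison.
grep 'KERNEL_ERROR' run_single.log > errors_single.txt || true
```

If the same file and line show up in errors_single.txt on consecutive runs, the failure is likely reproducible rather than sporadic.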

bforsbe commented Jan 24, 2017

Original comment by james_krieger (Bitbucket: james_krieger, GitHub: Unknown):


I can't continue from run_it012_optimiser.star as I lost it when I updated my machine to access the LMB shared filesystems. I am now running again using a single thread without mpi and it is on 1.88/5.49 hrs for Expectation iteration 2 without any problems. The relion version is now the one in relion-devel-lmb, which has binaries dating from 3rd January. I will update you tomorrow on what happens.

bforsbe commented Jan 25, 2017

Original comment by james_krieger (Bitbucket: james_krieger, GitHub: Unknown):


OK, I got the error during iteration 2.

Expectation iteration 2 of 25
2.73/5.50 hrs .............................~~(,_,">KERNEL_ERROR: unspecified launch failure in /lmb/home/public/EM/RELION/relion-devel-lmb/src/gpu_utils/cuda_ml_optimiser.cu at line 288 (error-code 4)
Segmentation fault

I am now trying again with --continue class2d_it001_optimiser.star

bforsbe commented Jan 25, 2017

Original comment by Bjoern Forsberg (Bitbucket: bforsbe, GitHub: bforsbe):


v2.0.2 of the relion2-beta cannot have reported this issue, which version are you using?

bforsbe commented Jan 25, 2017

Original comment by james_krieger (Bitbucket: james_krieger, GitHub: Unknown):


I don't know. I don't suppose the following can help you work it out?

/lmb/home/public/EM/RELION/relion-devel-lmb

lg-krieger_jkrieger> ls -ltrah
total 211K
-rw-r--r--  1 public public  496 Mar  9  2016 AUTHORS
-rw-r--r--  1 public public  18K Mar  9  2016 COPYING
-rwxr-xr-x  1 public public 2.2K Mar  9  2016 INSTALL.sh
-rw-r--r--  1 public public  11K Mar  9  2016 INSTALL
-rw-r--r--  1 public public    0 Mar  9  2016 ChangeLog
-rw-r--r--  1 public public  719 Mar  9  2016 README
-rw-r--r--  1 public public    0 Mar  9  2016 NEWS
-rw-r--r--  1 public public 1.6K Mar  9  2016 relion.h
drwxr-xr-x  2 public public    3 Mar  9  2016 tests
drwxr-xr-x  4 public public    4 Mar  9  2016 external
-rw-r--r--  1 public public   55 Mar 24  2016 .gitignore
drwxr-xr-x  2 public public    9 Jul  4  2016 scripts
drwxr-xr-x  2 public public    8 Jul  7  2016 cmake
drwxr-xr-x  9 public public   19 Jan  3 15:03 .
-rw-r--r--  1 public public  11K Jan  3 15:03 CMakeLists.txt
drwxr-xr-x  5 public public  120 Jan  3 15:03 src
drwxr-xr-x  8 public public   15 Jan  3 15:03 .git
drwxr-xr-x  7 public public   11 Jan  3 15:05 build2
drwxr-xr-x 15 public public   36 Jan 24 15:09 ..

bforsbe commented Jan 25, 2017

Original comment by james_krieger (Bitbucket: james_krieger, GitHub: Unknown):


I did however have the same error on a version that I pulled from git and installed myself as well, which I would expect to be the latest one.

bforsbe commented Jan 25, 2017

Original comment by Bjoern Forsberg (Bitbucket: bforsbe, GitHub: bforsbe):


I'd guess you are on v2.0.b7, but you'll know for sure if you run

$ ls -lrth `which relion`

and compare the date-stamp of the relion-binary to the list of commits

You should update, or ask somebody to update the binary, if it's not very recent. Since you have access to the code here, you can always clone it into a local repo and build the latest version yourself; the installation instructions make it really simple:

#!bash

$ git clone https://username@bitbucket.org/tcblab/relion2-beta.git
$ cd relion2-beta
$ mkdir build
$ cd build 
$ cmake .. 
$ make -j8 
$ cd ..
$ build/bin/relion_refine --i file.star <lots of other options>

EDIT: If you did in fact build it yourself then


git log

in the repo directory will tell you which commit you are on. Also, make sure you then run from your newly cloned repo and not /lmb/home/public/EM/RELION/relion-devel-lmb
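The two checks above (the binary's date-stamp and the repo's current commit) can be combined into a short snippet; the REPO default and the git log format string are my additions, not from the thread:

```shell
# Show which relion binary is on PATH and when it was built.
ls -lrth "$(which relion)" 2>/dev/null || echo "no relion on PATH"

# Inside the repo that binary was built from, print the exact commit
# hash, which is what identifies a development version with no
# version numbers.
REPO="${REPO:-.}"
git -C "$REPO" log -1 --format='commit %H (%ci) %s' 2>/dev/null \
    || echo "not a git repository: $REPO"
```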

bforsbe commented Jan 25, 2017

Original comment by Bjoern Forsberg (Bitbucket: bforsbe, GitHub: bforsbe):


There is a version of the code that could report the error on line 288, but it isn't the beta code; it's the in-house LMB development version. That would make sense given the path you quote, "relion-devel-lmb". To debug this I'd need to know the exact commit hash you are on, since there are no version numbers. Unless you are an advanced user, I would recommend sticking to the beta versions, unless somebody explicitly advised you to use development versions or features. If at all possible, it would help us to know

  • which commit you are on on devel-lmb
  • if the error shows up on v2.0.2 of the beta as well

bforsbe commented Jan 25, 2017

Original comment by james_krieger (Bitbucket: james_krieger, GitHub: Unknown):


I am at the LMB and my computer is now on the shared file systems to avoid copying of data. As Sjors has broken the pre-release version (it can't find the fftw libraries), I am using the lmb-development version for now. I am trying to get my compilers working so I can get v2.0.2 of the beta working as well.

bforsbe commented Jan 25, 2017

Original comment by james_krieger (Bitbucket: james_krieger, GitHub: Unknown):


$ ls -lrth `which relion`
gives me
-rwxr-xr-x 1 public public 39K Jan 3 15:06 /public/EM/RELION/relion-devel-lmb/build2/bin/relion

bforsbe commented Jan 25, 2017

Original comment by Bjoern Forsberg (Bitbucket: bforsbe, GitHub: bforsbe):


Ok, that makes sense. To summarize briefly: we need to find out whether this is a reproducible error or a completely sporadic one. The former is much easier to work with, and the strategies for debugging the two are entirely different. When you ran with multiple threads, every thread reported an error at whatever location in the code it happened to be in when only one of them actually malfunctioned. That's why relion reported errors in lots of locations, which makes it near impossible to say where the error actually happened. Running a single thread eliminates this ambiguity. So what we want to do is run at least twice with one thread and see if the same location is triggered. If it is, we need to know which commit or version that was, and it more or less needs to be a very recent one. You probably figured this out already, but I thought I'd add this clarification for documentation.

The best case is that the same particle and the same point in the execution fail. It is also good to add the flags

--random_seed 1993 --perturb 0

when trying to establish a verifiable test case. I neglected to mention this before, which is my bad. There's no reason for that particular seed, as long as the same one is used in repeat runs.
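A sketch of that repeat protocol: the --random_seed 1993 and --perturb 0 flags are the ones above, while the binary name, the input/output names, and the GPU/thread flags are placeholders of mine:

```shell
# Two identical fresh runs with a fixed seed and no perturbation,
# then compare any reported error locations between them.
# RELION_CMD, particles.star, and the output prefix are placeholders.
RELION_CMD="${RELION_CMD:-relion_refine}"

for i in 1 2; do
    "$RELION_CMD" --i particles.star --o "Class2D/seedtest_${i}" \
        --random_seed 1993 --perturb 0 --j 1 --gpu 0 \
        2>&1 | tee "run_${i}.log"
    grep 'KERNEL_ERROR' "run_${i}.log" > "errors_${i}.txt" || true
done

# A reproducible failure reports the same file/line in both runs.
if diff -q errors_1.txt errors_2.txt >/dev/null; then
    echo "error locations (if any) match between runs"
fi
```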

bforsbe commented Jan 25, 2017

Original comment by james_krieger (Bitbucket: james_krieger, GitHub: Unknown):


My original bug report was based on v2.0.2 - I have just run git log on that and the output is as follows.
git_log_screenshot_cropped.png

bforsbe commented Jan 25, 2017

Original comment by james_krieger (Bitbucket: james_krieger, GitHub: Unknown):


However, this version looks for libraries that have moved and I am therefore trying to install another copy.

bforsbe commented Jan 25, 2017

Original comment by james_krieger (Bitbucket: james_krieger, GitHub: Unknown):


The in-house lmb-development version shows this from git log.
devel-lmb_git_log_Screenshot_cropped.png

bforsbe commented Jan 25, 2017

Original comment by Bjoern Forsberg (Bitbucket: bforsbe, GitHub: bforsbe):


Ok. As the original issue report used multiple threads, it's impossible to say which code location is the true failure, as explained in my last post. We can take it as an indication that the problem exists in the beta code as well, which is good to know, but it sadly doesn't help us find the cause or determine the nature of the issue. Thanks for your patience and effort though; these are the tricky ones that take some fiddling to find.

bforsbe commented Jan 25, 2017

Original comment by james_krieger (Bitbucket: james_krieger, GitHub: Unknown):


Ah ok.

bforsbe commented Jan 25, 2017

Original comment by Bjoern Forsberg (Bitbucket: bforsbe, GitHub: bforsbe):


Use whatever code you can get working, just run single-threaded twice with the same version and tell us which version that is.

bforsbe commented Jan 25, 2017

Original comment by james_krieger (Bitbucket: james_krieger, GitHub: Unknown):


I just saw your post with the random seed (you sent it about the same time I was checking versions). If I use --continue it complains about the use of a number of flags including the random seed so I assume it reads those values from the optimiser.star file. The initial run was done with random seed 0.

bforsbe commented Jan 25, 2017

Original comment by Bjoern Forsberg (Bitbucket: bforsbe, GitHub: bforsbe):


You can use that in your repeat. If you didn't have --perturb 0 then just skip it for now; we should still get the essential info. It's a desirable extra condition, but nonessential for this.

bforsbe commented Jan 25, 2017

Original comment by james_krieger (Bitbucket: james_krieger, GitHub: Unknown):


I now have v2.0.2 installed after installing isl-0.15 and gcc-4.9.4.
Would you like me to try to reproduce the error there as well? I could start again from the beginning with a fixed random seed and --perturb 0.

bforsbe commented Jan 25, 2017

Original comment by james_krieger (Bitbucket: james_krieger, GitHub: Unknown):


Here is the warning that I can't specify --random_seed with --continue.
WARNING: Option --random_seed is not a valid RELION argument

I have just put it in and run a continuation with v2.0.2 anyway and as you said I have skipped the 0 perturb for now.

bforsbe commented Jan 25, 2017

Original comment by Bjoern Forsberg (Bitbucket: bforsbe, GitHub: bforsbe):


With --continue there are certain options that can't be set, because you are not allowed to change them from the original run. It will just warn you, ignore those, and run with the original settings, which is fine. Again, it's something that adds to the confidence of the diagnosis, but it isn't necessary to assess reproducibility of the problem.

bforsbe commented Jan 26, 2017

Original comment by james_krieger (Bitbucket: james_krieger, GitHub: Unknown):


Now I'm failing to reproduce the error. The internal lmb-development version has reached iteration 5 and beta v2.0.2 has reached iteration 10. I guess the difference in speed between the versions is that I installed v2.0.2 myself on this machine, so the compiler would have optimised it accordingly.

Actually, I have just realised that I am now not using an SSD as scratch (I stopped because someone else wanted to run some tests as well, so I left some GPUs and the SSD to them), which could be why I am not getting the error. If these runs finish fine then I will try again with --scratch_dir /ssd and see if that brings the error back.

jamesmkrieger commented

As an update, the error does not occur every time I or anyone else runs relion on my computer, which explains why it was not easy to reproduce. Nevertheless, we have seen it a number of times recently with other data sets. Again today we tried with 1 MPI rank, 1 thread and 1 GPU and saw the following error, similar to before:

KERNEL_ERROR: unspecified launch failure in /net/nfs1/public/EM/RELION/relion-2.0/src/gpu_utils/cuda_ml_optimiser.cu at line 288 (error-code 4)

This was using v2.0.3 as installed by Sjors for the LMB generally.

I don't know if it's relevant but I have also had gromacs crashing with an unspecified launch failure error more than once e.g.:

Program: gmx mdrun, version 2016-beta2
Source file: src/gromacs/mdlib/nbnxn_cuda/nbnxn_cuda.cu (line 633)
MPI rank: 0 (out of 2)

Fatal error:
cudaStreamSynchronize failed in cu_blockwait_nb: unspecified launch failure

biochem-fan (Member) commented

Old version.
