
Class2D stalling without error or dying with "KERNEL ERROR: unspecified launch failure" #189

Closed
bforsbe opened this issue Jan 23, 2017 · 26 comments


bforsbe commented Jan 23, 2017

Originally reported by: james_krieger (Bitbucket: james_krieger, GitHub: Unknown)


My system has an i7-6900K CPU @ 3.20GHz with 16 logical cores, an ASUS X99-E WS motherboard and 3 Titan X Maxwell GPUs. The benchmark Class2D works with all 3 GPUs but fails at the end of an expectation iteration when I only ask for two of them. I have tried multiple times with two GPUs and I consistently get one of the following results:

  1. Relion stalls with one GPU in full use and the other showing memory usage but not processing anything, as seen in the nvidia-smi output below. This may be related to a previously fixed issue, and I have not seen it since updating to the latest relion2-beta.

  2. I get an error related to unspecified launch failure:
    Expectation iteration 2 of 25
    23.92/36.37 min .......................................~~(,_,">KERNEL_ERROR: unspecified launch failure in /data/bin/EM/RELION/relion-devel-lmb_copy_newPC/src/gpu_utils/cuda_helper_functions.cu at line 386 (error-code 4)
    KERNEL_ERROR: unspecified launch failure in /data/bin/EM/RELION/relion-devel-lmb_copy_newPC/src/gpu_utils/cuda_helper_functions.cu at line 386 (error-code 4)
    KERNEL_ERROR: unspecified launch failure in /data/bin/EM/RELION/relion-devel-lmb_copy_newPC/src/gpu_utils/cuda_helper_functions.cu at line 386 (error-code 4)
    KERNEL_ERROR: unspecified launch failure in /data/bin/EM/RELION/relion-devel-lmb_copy_newPC/src/gpu_utils/cuda_helper_functions.cu at line 386 (error-code 4)


mpirun noticed that process rank 2 with PID 16494 on node ig-pc-10 exited on signal 11 (Segmentation fault).

This error also occurred after updating to the latest relion2-beta but not until iteration 14:
Expectation iteration 14 of 25
0.62/18.02 min ..~~(,_,">KERNEL_ERROR: unspecified launch failure in /data/bin/relion2-beta_copy_newPC/src/gpu_utils/cuda_helper_functions.cu at line 419 (error-code 4)
KERNEL_ERROR: unspecified launch failure in /data/bin/relion2-beta_copy_newPC/src/gpu_utils/cuda_ml_optimiser.cu at line 2071 (error-code 4)

mpirun noticed that process rank 1 with PID 8750 on node ig-pc-10 exited on signal 11 (Segmentation fault).

  3. The computer sometimes locks up completely, which I think is also related to running relion2.

I attach output files for the job that stopped with unspecified launch failure at iteration 14. Please let me know if you would like anything else.


bforsbe commented Jan 23, 2017

Original comment by james_krieger (Bitbucket: james_krieger, GitHub: Unknown):


Actually I don't think this is related to how many GPUs are used, which admittedly would have been surprising. I have just tried again to run with all three GPUs and it also died with an unspecified launch failure during run 2.

bforsbe commented Jan 23, 2017

Original comment by Bjoern Forsberg (Bitbucket: bforsbe, GitHub: bforsbe):


Could you run a single thread without mpi? You can also add

--continue run_it012_optimiser.star

If you run that a couple of times and the same line reports the error each time, that would be a useful result.
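For concreteness, a single-process, single-thread repeat run might look like the sketch below. Only --continue run_it012_optimiser.star comes from this thread; the binary name, the thread/GPU flags, and the log names are assumptions for illustration:

```shell
# Run RELION directly (no mpirun), one thread, one GPU, keeping the
# full log so any KERNEL_ERROR line can be compared across repeats.
# RELION_CMD is a placeholder for the actual relion_refine binary.
RELION_CMD="${RELION_CMD:-relion_refine}"

"$RELION_CMD" --continue run_it012_optimiser.star --j 1 --gpu 0 \
    2>&1 | tee run_single.log

# Collect any reported kernel-error locations for later comparison.
grep 'KERNEL_ERROR' run_single.log > errors_single.txt || true
```

If the same file and line show up in errors_single.txt on consecutive runs, the failure is likely reproducible rather than sporadic.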

bforsbe commented Jan 24, 2017

Original comment by james_krieger (Bitbucket: james_krieger, GitHub: Unknown):


I can't continue from run_it012_optimiser.star as I lost it when I updated my machine to access the LMB shared filesystems. I am now running again using a single thread without mpi and it is on 1.88/5.49 hrs for Expectation iteration 2 without any problems. The relion version is now the one in relion-devel-lmb, which has binaries dating from 3rd January. I will update you tomorrow on what happens.

bforsbe commented Jan 25, 2017

Original comment by james_krieger (Bitbucket: james_krieger, GitHub: Unknown):


OK, I got the error during iteration 2.

Expectation iteration 2 of 25
2.73/5.50 hrs .............................~~(,_,">KERNEL_ERROR: unspecified launch failure in /lmb/home/public/EM/RELION/relion-devel-lmb/src/gpu_utils/cuda_ml_optimiser.cu at line 288 (error-code 4)
Segmentation fault

I am now trying again with --continue class2d_it001_optimiser.star

bforsbe commented Jan 25, 2017

Original comment by Bjoern Forsberg (Bitbucket: bforsbe, GitHub: bforsbe):


v2.0.2 of the relion2-beta cannot have reported this issue, which version are you using?

bforsbe commented Jan 25, 2017

Original comment by james_krieger (Bitbucket: james_krieger, GitHub: Unknown):


I don't know. I don't suppose the following can help you work it out?

/lmb/home/public/EM/RELION/relion-devel-lmb

lg-krieger_jkrieger> ls -ltrah
total 211K
-rw-r--r--  1 public public  496 Mar  9  2016 AUTHORS
-rw-r--r--  1 public public  18K Mar  9  2016 COPYING
-rwxr-xr-x  1 public public 2.2K Mar  9  2016 INSTALL.sh
-rw-r--r--  1 public public  11K Mar  9  2016 INSTALL
-rw-r--r--  1 public public    0 Mar  9  2016 ChangeLog
-rw-r--r--  1 public public  719 Mar  9  2016 README
-rw-r--r--  1 public public    0 Mar  9  2016 NEWS
-rw-r--r--  1 public public 1.6K Mar  9  2016 relion.h
drwxr-xr-x  2 public public    3 Mar  9  2016 tests
drwxr-xr-x  4 public public    4 Mar  9  2016 external
-rw-r--r--  1 public public   55 Mar 24  2016 .gitignore
drwxr-xr-x  2 public public    9 Jul  4  2016 scripts
drwxr-xr-x  2 public public    8 Jul  7  2016 cmake
drwxr-xr-x  9 public public   19 Jan  3 15:03 .
-rw-r--r--  1 public public  11K Jan  3 15:03 CMakeLists.txt
drwxr-xr-x  5 public public  120 Jan  3 15:03 src
drwxr-xr-x  8 public public   15 Jan  3 15:03 .git
drwxr-xr-x  7 public public   11 Jan  3 15:05 build2
drwxr-xr-x 15 public public   36 Jan 24 15:09 ..

bforsbe commented Jan 25, 2017

Original comment by james_krieger (Bitbucket: james_krieger, GitHub: Unknown):


I did however have the same error on a version that I pulled from git and installed myself as well, which I would expect to be the latest one.

bforsbe commented Jan 25, 2017

Original comment by Bjoern Forsberg (Bitbucket: bforsbe, GitHub: bforsbe):


I'd guess you are on v2.0.b7, but you'll know for sure if you run

$ ls -lrth `which relion`

and compare the date-stamp of the relion-binary to the list of commits

You should update, or ask somebody to update the binary, if it's not very recent. Since you have access to the code here, you can always clone it into a local repo and build the latest version yourself; the installation instructions make it really simple:

#!bash

$ git clone https://username@bitbucket.org/tcblab/relion2-beta.git
$ cd relion2-beta
$ mkdir build
$ cd build 
$ cmake .. 
$ make -j8 
$ cd ..
$ build/bin/relion_refine --i file.star <lots of other options>

EDIT: If you did in fact build it yourself then


git log

in the repo directory will tell you which commit you are on. Also, make sure you then run from your newly cloned repo and not /lmb/home/public/EM/RELION/relion-devel-lmb
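The two checks above (the binary's date-stamp and the repo's current commit) can be combined into a short snippet; the REPO default and the git log format string are my additions, not from the thread:

```shell
# Show which relion binary is on PATH and when it was built.
ls -lrth "$(which relion)" 2>/dev/null || echo "no relion on PATH"

# Inside the repo that binary was built from, print the exact commit
# hash, which is what identifies a development version with no
# version numbers.
REPO="${REPO:-.}"
git -C "$REPO" log -1 --format='commit %H (%ci) %s' 2>/dev/null \
    || echo "not a git repository: $REPO"
```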

bforsbe commented Jan 25, 2017

Original comment by Bjoern Forsberg (Bitbucket: bforsbe, GitHub: bforsbe):


There is a version of the code that could report the error on line 288, but it isn't the beta code; it's the in-house LMB development version. That would make sense given the path you quote, "relion-devel-lmb". To debug this I'd need to know the exact commit hash you are on, since there are no version numbers. Unless you are an advanced user, I would recommend sticking to the beta versions, unless somebody explicitly advised you to use development versions or features. If at all possible, it would help us to know

  • which commit you are on on devel-lmb
  • if the error shows up on v2.0.2 of the beta as well

bforsbe commented Jan 25, 2017

Original comment by james_krieger (Bitbucket: james_krieger, GitHub: Unknown):


I am at the LMB and my computer is now on the shared file systems to avoid copying of data. As Sjors has broken the pre-release version (it can't find the fftw libraries), I am using the lmb-development version for now. I am trying to get my compilers working so I can get v2.0.2 of the beta working as well.

bforsbe commented Jan 25, 2017

Original comment by james_krieger (Bitbucket: james_krieger, GitHub: Unknown):


$ ls -lrth `which relion`
gives me
-rwxr-xr-x 1 public public 39K Jan 3 15:06 /public/EM/RELION/relion-devel-lmb/build2/bin/relion

bforsbe commented Jan 25, 2017

Original comment by Bjoern Forsberg (Bitbucket: bforsbe, GitHub: bforsbe):


Ok, that makes sense. To summarize briefly: we need to find out whether this is a reproducible error or a completely sporadic one. The former is much easier to work with, and the strategies for debugging the two are entirely different. When you ran with multiple threads, every thread reported an error at whatever location in the code it happened to be in when only one of them actually malfunctioned. That's why relion reported errors in lots of locations, which makes it near impossible to say where the error actually happened. Running a single thread eliminates this ambiguity. So what we want to do is run at least twice with one thread and see if the same location is triggered. If it is, we need to know which commit or version that was, and it more or less needs to be a very recent one. You probably figured this out already, but I thought I'd add this clarification for documentation.

The best case is that the same particle and the same point in the execution fail. It is also good to add the flags

--random_seed 1993 --perturb 0

when trying to establish a verifiable test case. I neglected to mention this before, which is my bad. There's no reason for that particular seed, as long as the same one is used in repeat runs.
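A sketch of that repeat protocol: the --random_seed 1993 and --perturb 0 flags are the ones above, while the binary name, the input/output names, and the GPU/thread flags are placeholders of mine:

```shell
# Two identical fresh runs with a fixed seed and no perturbation,
# then compare any reported error locations between them.
# RELION_CMD, particles.star, and the output prefix are placeholders.
RELION_CMD="${RELION_CMD:-relion_refine}"

for i in 1 2; do
    "$RELION_CMD" --i particles.star --o "Class2D/seedtest_${i}" \
        --random_seed 1993 --perturb 0 --j 1 --gpu 0 \
        2>&1 | tee "run_${i}.log"
    grep 'KERNEL_ERROR' "run_${i}.log" > "errors_${i}.txt" || true
done

# A reproducible failure reports the same file/line in both runs.
if diff -q errors_1.txt errors_2.txt >/dev/null; then
    echo "error locations (if any) match between runs"
fi
```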

bforsbe commented Jan 25, 2017

Original comment by james_krieger (Bitbucket: james_krieger, GitHub: Unknown):


My original bug report was based on v2.0.2 - I have just run git log on that and the output is as follows.
git_log_screenshot_cropped.png

bforsbe commented Jan 25, 2017

Original comment by james_krieger (Bitbucket: james_krieger, GitHub: Unknown):


However, this version looks for libraries that have moved and I am therefore trying to install another copy.

bforsbe commented Jan 25, 2017

Original comment by james_krieger (Bitbucket: james_krieger, GitHub: Unknown):


The in-house lmb-development version shows this from git log.
devel-lmb_git_log_Screenshot_cropped.png

bforsbe commented Jan 25, 2017

Original comment by Bjoern Forsberg (Bitbucket: bforsbe, GitHub: bforsbe):


Ok. As the original issue report used multiple threads, it's impossible to say which code location is the true failure, as explained in my last post. We can take it as an indication that the problem exists in the beta code as well, which is good to know, but it sadly doesn't help us find the cause or determine the nature of the issue. Thanks for your patience and effort though; these are the tricky ones that take some fiddling to find.

bforsbe commented Jan 25, 2017

Original comment by james_krieger (Bitbucket: james_krieger, GitHub: Unknown):


Ah ok.

bforsbe commented Jan 25, 2017

Original comment by Bjoern Forsberg (Bitbucket: bforsbe, GitHub: bforsbe):


Use whatever code you can get working, just run single-threaded twice with the same version and tell us which version that is.

bforsbe commented Jan 25, 2017

Original comment by james_krieger (Bitbucket: james_krieger, GitHub: Unknown):


I just saw your post with the random seed (you sent it about the same time I was checking versions). If I use --continue it complains about the use of a number of flags including the random seed so I assume it reads those values from the optimiser.star file. The initial run was done with random seed 0.

bforsbe commented Jan 25, 2017

Original comment by Bjoern Forsberg (Bitbucket: bforsbe, GitHub: bforsbe):


You can use that in your repeat. If you didn't have --perturb 0 then just skip it for now; we should still get the essential info. It's a desirable extra condition, but nonessential for this.

bforsbe commented Jan 25, 2017

Original comment by james_krieger (Bitbucket: james_krieger, GitHub: Unknown):


I now have v2.0.2 installed after installing isl-0.15 and gcc-4.9.4.
Would you like me to try to reproduce the error there as well? I could start again from the beginning with a fixed random seed and --perturb 0.

bforsbe commented Jan 25, 2017

Original comment by james_krieger (Bitbucket: james_krieger, GitHub: Unknown):


Here is the warning that I can't specify --random_seed with --continue.
WARNING: Option --random_seed is not a valid RELION argument

I have just put it in and run a continuation with v2.0.2 anyway and as you said I have skipped the 0 perturb for now.

bforsbe commented Jan 25, 2017

Original comment by Bjoern Forsberg (Bitbucket: bforsbe, GitHub: bforsbe):


With --continue there are certain options that can't be set, because you are not allowed to change them from the original run. It will just warn you, ignore those, and run with the original settings, which is fine. Again, it's something that adds to the confidence of the diagnosis, but it isn't necessary to assess reproducibility of the problem.

bforsbe commented Jan 26, 2017

Original comment by james_krieger (Bitbucket: james_krieger, GitHub: Unknown):


Now I'm failing to reproduce the error. The internal lmb-development version has reached iteration 5 and beta v2.0.2 has reached iteration 10. I guess the difference in speed between the versions is that I installed v2.0.2 myself on this machine, so the compiler would have optimised it accordingly.

Actually, I have just realised that I am now not using an SSD as scratch (I stopped because someone else wanted to run some tests as well, so I left some GPUs and the SSD to them), which could be why I am not getting the error. If these runs finish fine then I will try again with --scratch_dir /ssd and see if that brings the error back.

jamesmkrieger commented

As an update, the error does not occur every time I or anyone else runs relion on my computer, which explains why it was not easy to reproduce. Nevertheless, we have seen it a number of times recently with other data sets. Again today we tried with 1 MPI rank, 1 thread and 1 GPU and saw the following error, similar to before:

KERNEL_ERROR: unspecified launch failure in /net/nfs1/public/EM/RELION/relion-2.0/src/gpu_utils/cuda_ml_optimiser.cu at line 288 (error-code 4)

This was using v2.0.3 as installed by Sjors for the LMB generally.

I don't know if it's relevant but I have also had gromacs crashing with an unspecified launch failure error more than once e.g.:

Program: gmx mdrun, version 2016-beta2
Source file: src/gromacs/mdlib/nbnxn_cuda/nbnxn_cuda.cu (line 633)
MPI rank: 0 (out of 2)

Fatal error:
cudaStreamSynchronize failed in cu_blockwait_nb: unspecified launch failure

biochem-fan (Member) commented

Old version.
