
Segfault in relion_refine_mpi with --firstiter_cc and --gpu #7

Closed
bforsbe opened this issue Jun 21, 2016 · 9 comments


bforsbe commented Jun 21, 2016

Originally reported by: Dimitry Tegunov (Bitbucket: DTegunov, GitHub: DTegunov)


I hope it's an actual bug this time ;-)

I'm running 3D refinement using

```bash
mpirun -n 3 `which relion_refine_mpi` --o RefineInitial/run1 --auto_refine --split_random_halves --i particles.star --ref emd_2984_280.mrc --firstiter_cc --ini_high 30 --dont_combine_weights_via_disc --pool 3 --ctf --ctf_corrected_ref --particle_diameter 200 --flatten_solvent --zero_mask --oversampling 1 --healpix_order 2 --auto_local_healpix_order 4 --offset_range 10 --offset_step 2 --sym D2 --low_resol_join_halves 40 --norm --scale  --j 1 --gpu
```

(the template was created in the GUI, the names were modified, and it was launched from a terminal), and it crashes saying

```
KERNEL_ERROR: invalid argument in /home/dtegunov/Desktop/relion2beta/src/gpu_utils/cuda_helper_functions.cu at line 598 (error-code 11)
[dtegunov:09959] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x10d10)[0x7f72177d9d10]
[dtegunov:09959] [ 1] /lib/x86_64-linux-gnu/libpthread.so.0(raise+0x29)[0x7f72177d9bd9]
[dtegunov:09959] [ 2] /home/dtegunov/Desktop/relion2beta/build/lib/librelion_gpu_util.so(_Z20runDiff2KernelCoarseR19CudaProjectorKernelPfS1_S1_S1_S1_S1_S1_R21OptimisationParamtersP11MlOptimisermiiiiiP11CUstream_stb+0x9ce)[0x7f7216a4e5ae]
[dtegunov:09959] [ 3] /home/dtegunov/Desktop/relion2beta/build/lib/librelion_gpu_util.so(_Z30getAllSquaredDifferencesCoarsejR21OptimisationParamtersR18SamplingParametersP11MlOptimiserP15MlOptimiserCudaR13CudaGlobalPtrIfLb1EE+0x13d0)[0x7f7216a583e0]
[dtegunov:09959] [ 4] /home/dtegunov/Desktop/relion2beta/build/lib/librelion_gpu_util.so(_ZN15MlOptimiserCuda32doThreadExpectationSomeParticlesEi+0x2ea9)[0x7f7216a68779]
[dtegunov:09959] [ 5] /home/dtegunov/Desktop/relion2beta/build/lib/librelion_lib.so(_Z11_threadMainPv+0x1d)[0x7f721867639d]
[dtegunov:09959] [ 6] /lib/x86_64-linux-gnu/libpthread.so.0(+0x76aa)[0x7f72177d06aa]
[dtegunov:09959] [ 7] /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d)[0x7f7217505e9d]
```

It doesn't crash on the GPU if I remove --firstiter_cc, and the CPU version runs fine with --firstiter_cc. Not sure if I can provide my test data due to its size, but maybe there are some debug flags I can set that will give you more information to work with?



bforsbe commented Jun 21, 2016

Original comment by Bjoern Forsberg (Bitbucket: bforsbe, GitHub: bforsbe):


I hope it WAS a bug. We noticed a recently introduced bug with --firstiter_cc, which should be fixed in v2.0.b1, pushed no more than 30 minutes ago. Try pulling the new code and running again. If the problem persists, I'll dig deeper.

Thanks again for reporting!


bforsbe commented Jun 21, 2016

Original comment by Dimitry Tegunov (Bitbucket: DTegunov, GitHub: DTegunov):


Nope, still crashes with the same message.


bforsbe commented Jun 21, 2016

Original comment by Bjoern Forsberg (Bitbucket: bforsbe, GitHub: bforsbe):


Does it crash immediately? If so, is it possible to create a minimal example with input data that shows this error, like just a few particles? If so, I can have a look at it. If the files are "too" large, I could receive them some other way than through here. I'll try to reproduce it here on separate data in the meantime.


bforsbe commented Jun 22, 2016

Original comment by craigyk (Bitbucket: craigyk, GitHub: craigyk):


I just ran into this problem. I pulled the latest changes, reran, and everything looks OK, so the problem appears fixed for me with the latest bits.


bforsbe commented Jun 22, 2016

Original comment by Bjoern Forsberg (Bitbucket: bforsbe, GitHub: bforsbe):


I believe the error Craig observed is the one we did in fact fix in v2.0.b1. Dimitry appears to have found a wholly separate issue. Luckily, I seem to have been able to reproduce it here now, so hopefully there will be a fix for it later today.


bforsbe commented Jun 22, 2016

Original comment by Bjoern Forsberg (Bitbucket: bforsbe, GitHub: bforsbe):


I believe I know what the issue is now. Since cross-correlation is so infrequently used and is not a bottleneck, those functions were never adapted to the most recent difference-kernel layout. Consequently, they still use a layout that is limited by hardware capacity: it requests shared memory that can exceed what is available on the device. We could fix this in a number of ways, the easiest being to decrease the block size whenever the memory limit is exceeded, but that suffers from the same weakness, just at a later stage. I think the more reasonable fix is to update the cc-kernels to the new layout, which may take a day or two.
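
To make the failure mode concrete, here is a minimal host-side sketch (hypothetical numbers, not RELION's actual sizing) of how a shared-memory request that scales with the number of translations can outgrow the per-block limit the device reports:

```cuda
// Illustrative arithmetic only -- block_size and n_trans are made-up values,
// and the "one float per thread per translation" layout is hypothetical.
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    const int block_size = 128;   // threads per block (cf. BLOCK_SIZE)
    const int n_trans    = 200;   // hypothetical number of translations

    // Hypothetical layout: one float per thread per translation in shared memory.
    size_t requested = (size_t)block_size * n_trans * sizeof(float);

    printf("requested %zu bytes, device allows %zu bytes per block\n",
           requested, prop.sharedMemPerBlock);
    if (requested > prop.sharedMemPerBlock)
        printf("a kernel launch with this request would be rejected\n");
    return 0;
}
```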

For now, however, you can work around the issue in one of two ways (I had to do both...):

  • Compile for sm_52 (which has a higher shared-memory capacity). The RELION default is sm_35, the minimum supported architecture. Since I noticed in issue #3 (Problem with linux workstation install) that you have a TITAN X, your device is sm_52. To compile for sm_52, modify your cmake configuration command to

    ```bash
    cmake -DCUDA_ARCH=52 ..
    ```

  • Change the precompile variable BLOCK_SIZE in src/gpu_utils/cuda_settings.h to a lower value (see the sketch just after this list). In general WE DO NOT RECOMMEND CHANGING THESE VALUES, but as a temporary fix until I get an updated version pushed it should work. Always set it to a multiple of 32! It is currently 128, so reducing it to 96, 64 or 32 will reduce the shared memory currently required.
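
For reference, the second workaround is a one-line edit. This is only a sketch, assuming BLOCK_SIZE is a plain #define in src/gpu_utils/cuda_settings.h (the rest of that header is not reproduced here):

```cuda
// src/gpu_utils/cuda_settings.h (sketch; surrounding contents omitted)
// Temporary workaround only: keep the value a multiple of 32 and restore the
// default of 128 once the updated cc-kernels are available.
#define BLOCK_SIZE 64   // was 128; 96 or 32 would also reduce the shared-memory request
```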

Let me know if any of these measures help at all!


bforsbe commented Jun 22, 2016

Original comment by Bjoern Forsberg (Bitbucket: bforsbe, GitHub: bforsbe):


I just pushed a possible fix (v2.0.b2) by creating a new cross-correlation kernel that does not have shared-memory usage proportional to the number of translations. This should also do the trick. If not, let me know and I'll continue hacking away at it.
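
For anyone curious about the general shape of such a kernel, the sketch below is NOT the RELION code (names, arguments, launch configuration and the unnormalized correlation are invented for illustration); it only shows the idea: shared memory is a fixed BLOCK_SIZE-sized scratch buffer, and translations are covered by a grid-stride loop, so the shared-memory request no longer grows with the number of translations.

```cuda
// Illustrative sketch only -- not the RELION kernel. Normalization, CTFs and
// weighting are omitted; the point is the fixed-size shared-memory buffer.
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

#define BLOCK_SIZE 128   // threads per block; must match the launch configuration

__global__ void cc_kernel_sketch(const float *ref,   // projected reference, image_size floats
                                 const float *imgs,  // n_trans translated copies of the image
                                 float *cc,          // one (unnormalized) cc value per translation
                                 int image_size,
                                 int n_trans)
{
    __shared__ float s_sum[BLOCK_SIZE];              // fixed size, independent of n_trans

    for (int t = blockIdx.x; t < n_trans; t += gridDim.x)
    {
        float acc = 0.f;
        for (int i = threadIdx.x; i < image_size; i += BLOCK_SIZE)
            acc += ref[i] * imgs[(size_t)t * image_size + i];

        s_sum[threadIdx.x] = acc;
        __syncthreads();

        // Standard in-block tree reduction (BLOCK_SIZE is a power of two).
        for (int stride = BLOCK_SIZE / 2; stride > 0; stride >>= 1)
        {
            if (threadIdx.x < stride)
                s_sum[threadIdx.x] += s_sum[threadIdx.x + stride];
            __syncthreads();
        }

        if (threadIdx.x == 0)
            cc[t] = s_sum[0];
        __syncthreads();                             // protect s_sum before the next translation
    }
}

int main()
{
    const int image_size = 1024, n_trans = 50;
    std::vector<float> h_ref(image_size, 1.f), h_imgs((size_t)n_trans * image_size, 2.f);

    float *d_ref, *d_imgs, *d_cc;
    cudaMalloc(&d_ref,  image_size * sizeof(float));
    cudaMalloc(&d_imgs, h_imgs.size() * sizeof(float));
    cudaMalloc(&d_cc,   n_trans * sizeof(float));
    cudaMemcpy(d_ref,  h_ref.data(),  image_size * sizeof(float),    cudaMemcpyHostToDevice);
    cudaMemcpy(d_imgs, h_imgs.data(), h_imgs.size() * sizeof(float), cudaMemcpyHostToDevice);

    cc_kernel_sketch<<<32, BLOCK_SIZE>>>(d_ref, d_imgs, d_cc, image_size, n_trans);

    std::vector<float> h_cc(n_trans);
    cudaMemcpy(h_cc.data(), d_cc, n_trans * sizeof(float), cudaMemcpyDeviceToHost);
    printf("cc[0] = %f (expected %f)\n", h_cc[0], 2.f * image_size);

    cudaFree(d_ref); cudaFree(d_imgs); cudaFree(d_cc);
    return 0;
}
```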


bforsbe commented Jun 22, 2016

Original comment by Dimitry Tegunov (Bitbucket: DTegunov, GitHub: DTegunov):


It appears fixed in 0be3990, thanks! On a side note: compiling with sm_52 won't solve issues with dynamic shared memory allocation. The hardware will already allocate everything it physically can, regardless of the compiler target.
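
A minimal sketch of that point (again, not RELION code): the dynamic shared-memory request is a launch-time parameter that the runtime checks against the device's per-block limit, independently of which architecture the code was compiled for.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void dummy() {}

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("per-block shared-memory limit: %zu bytes\n", prop.sharedMemPerBlock);

    // Ask for one byte more dynamic shared memory than the device allows;
    // the launch is rejected at run time regardless of the -arch used to build.
    size_t too_much = prop.sharedMemPerBlock + 1;
    dummy<<<1, 32, too_much>>>();
    printf("launch status: %s\n", cudaGetErrorString(cudaGetLastError()));
    return 0;
}
```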


bforsbe commented Jun 22, 2016

Original comment by Bjoern Forsberg (Bitbucket: bforsbe, GitHub: bforsbe):


Good to know! That's probably why I had to also adjust the block-size to get it working. Thanks!

bforsbe closed this as completed Jan 26, 2017
biochem-fan pushed a commit that referenced this issue Feb 13, 2019