tckgen termination after 'MR::Exception' #294

davesl · 2015-07-17T11:06:14Z

Hi,

When running the tckgen command on a cluster I sometimes see the following error:

terminate called after throwing an instance of 'MR::Exception'
/data/lren/program/mrtrix3/bin/tckgen: line 4: 162801 Aborted "$PREFIX"/lib/ld-.so "$PREFIX"/exe/"$COMMAND" "$@"

An example command is as follows:

tckgen -seed_image aligned_mask.nii -mask aligned_mask.nii -algorithm iFOD2 -number 50000000 -maxlength 400 -minlength 2 -downsample 2 -act 5TT.nii -crop_at_gmwmi -backtrack -quiet -force aligned_HARDI_fod.nii /scratch/aligned_HARDI_fod.tck

I am running a standalone version of mrtrix3 on a cluster but will try to change to the static build in future.

I suspect this is due to available disk space. I write the .tck files to /scratch, which is a local disk available to each independent node. Each of ~16 cores can then write to these local drives in parallel. Perhaps this issue occurs when the /scratch disk reaches capacity.

As an example, here are the storage details of node12 (which I believe has a total storage capacity of 457 GB; 423 GB of that is accounted for by the .tck files):

--------- node12---------
/scratch/:
total 423G
drwxrwxrwt. 2 root root 4.0K Jul 14 18:10 .
dr-xr-xr-x. 27 root root 4.0K May 5 2014 ..
-rw-rw-r-- 1 dslater domain users 32G Jul 17 06:02 B0736-70-4_b3000_aligned_HARDI_fod.tck
-rw-rw-r-- 1 dslater domain users 33G Jul 17 06:02 B0744-70-4_b3000_aligned_HARDI_fod.tck
-rw-rw-r-- 1 dslater domain users 30G Jul 17 06:02 B0814-70-4_b3000_aligned_HARDI_fod.tck
-rw-rw-r-- 1 dslater domain users 31G Jul 17 06:01 B0833-01-4_b3000_aligned_HARDI_fod.tck
-rw-rw-r-- 1 dslater domain users 30G Jul 17 06:01 B0833-71-4_b3000_aligned_HARDI_fod.tck
-rw-rw-r-- 1 dslater domain users 30G Jul 17 06:01 B0834-71-4_b3000_aligned_HARDI_fod.tck
-rw-rw-r-- 1 dslater domain users 33G Jul 17 06:00 B0846-01-4_b3000_aligned_HARDI_fod.tck
-rw-rw-r-- 1 dslater domain users 31G Jul 17 06:01 B0846-70-4_b3000_aligned_HARDI_fod.tck
-rw-rw-r-- 1 dslater domain users 31G Jul 17 06:01 B0855-01-4_b3000_aligned_HARDI_fod.tck
-rw-rw-r-- 1 dslater domain users 27G Jul 17 05:52 B0870-01-4_b3000_aligned_HARDI_fod.tck
-rw-rw-r-- 1 dslater domain users 31G Jul 17 06:01 B0887-72-4_b3000_aligned_HARDI_fod.tck
-rw-rw-r-- 1 dslater domain users 30G Jul 17 06:01 B0933-01-3_b3000_aligned_HARDI_fod.tck
-rw-rw-r-- 1 dslater domain users 30G Jul 17 06:01 B0938-71-3_b3000_aligned_HARDI_fod.tck
-rw-rw-r-- 1 dslater domain users 30G Jul 17 06:00 B0939-01-3_b3000_aligned_HARDI_fod.tck
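
To put numbers on the suspicion above, here is a minimal Python sketch of a pre-flight check that a job script could run before launching tckgen. It is not part of MRtrix3; the helper name `has_room` is hypothetical, and the only figures used are the 457 and 423 quoted above for node12.

```python
import shutil

GB = 1024 ** 3

def has_room(path, required_bytes):
    """Return True if the filesystem holding `path` has at least
    `required_bytes` free. Hypothetical helper, not an MRtrix3 function."""
    return shutil.disk_usage(path).free >= required_bytes

# Figures quoted in this post for node12 (in GB).
capacity_gb = 457
used_gb = 423

# Space left for ALL jobs still writing to /scratch on this node.
headroom_gb = capacity_gb - used_gb
print(f"headroom on node12: {headroom_gb} GB")
```

With 14 files still growing on the same disk, that headroom is shared between all of them, so a wrapper script might call something like `has_room("/scratch", expected_size_in_gb * GB)` and refuse to start when it returns False. The expected size is something you would have to estimate yourself from earlier runs.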

jdtournier · 2015-07-17T11:32:57Z

OK, that would explain the problem. The track writer will most likely throw an Exception if it can't write - so at least the symptoms are consistent with that...

However, the program shouldn't crash out like this - at worst it should hang (not great either mind you, but at least it would be consistent with what I'd expect to happen). For completeness, we had a discussion about how to handle exceptions being thrown in a multi-threading context (see #167), and I pushed some changes to handle this on 12 March, with pull request #180. So I'm surprised to see this happen at all - unless you haven't updated your installation since then...?
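
For context on the symptom: in C++, an exception that is never caught (including one that escapes a thread's entry function) ends in a call to std::terminate, and the "terminate called after throwing an instance of ..." message is consistent with that. The general remedy is to catch the exception inside the thread and hand it back to the launching thread. The Python sketch below illustrates that pattern only; it is not the code from #180, and `ThreadWithError`, `join_and_reraise` and `failing_write` are names made up for this illustration.

```python
import threading

class ThreadWithError(threading.Thread):
    """Thread that stores any exception raised by its target so the
    caller can re-raise it after join(). Illustrative only."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.error = None

    def run(self):
        try:
            super().run()
        except BaseException as exc:
            # Keep the exception instead of letting it escape the thread.
            self.error = exc

    def join_and_reraise(self):
        self.join()
        if self.error is not None:
            raise self.error

def failing_write():
    # Stand-in for a writer that runs out of disk space.
    raise OSError("No space left on device (simulated)")
```

The launching thread then sees an ordinary exception at `join_and_reraise()` and can report it and exit cleanly, rather than the process aborting.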

davesl · 2015-07-17T11:37:19Z

Hmm... with it being the standalone version it may well be a few months out of date.

I'll update the cluster code hopefully using the 'static' approach. Then if I ever come across this problem or another variant I'll report back.

jdtournier · 2015-07-17T11:41:07Z

Thanks, that would be great. That said, you might want to hang on a bit - I think I need to make further changes, as in your case the current approach would probably cause tckgen to hang...

jdtournier · 2015-07-17T11:45:27Z

Actually, having thought about it a little, I think it might be fine as-is. I was thinking the worker threads (which generate the tracks) would hang waiting for the writer thread to clear its backlog (which it clearly can't, having just thrown an exception and hence terminated). Thankfully, the queue backend that handles feeding streamlines from the worker threads to the writer thread will shut everything down if no-one's listening... It should work fine, give it a shot.
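
The behaviour described above (producers stop instead of blocking forever once the consumer has gone) can be sketched in Python as follows. This is an illustration of the general pattern, not MRtrix3's queue implementation: the `closed` event, the function names and the simulated disk capacity are all assumptions made for the example.

```python
import queue
import threading

def put_or_abort(q, closed, item, poll=0.05):
    """Put `item` on `q`; give up and return False once `closed` is set."""
    while not closed.is_set():
        try:
            q.put(item, timeout=poll)
            return True
        except queue.Full:
            pass  # still full: loop and re-check whether the consumer is gone
    return False

def writer(q, closed, disk_capacity):
    """Consume items until a None sentinel arrives. Simulates a full disk
    after `disk_capacity` successful writes."""
    written = 0
    try:
        while True:
            item = q.get()
            if item is None:
                return
            if written >= disk_capacity:
                raise OSError("No space left on device (simulated)")
            written += 1
    except OSError:
        closed.set()  # signal the producers that nobody is listening

def worker(q, closed, n_items, counts, idx):
    """Produce `n_items` items, stopping early if the writer has gone."""
    for i in range(n_items):
        if not put_or_abort(q, closed, i):
            break
        counts[idx] += 1

def run(n_workers=4, n_items=1000, disk_capacity=10):
    q = queue.Queue(maxsize=8)  # bounded, so producers can back up
    closed = threading.Event()
    counts = [0] * n_workers
    w = threading.Thread(target=writer, args=(q, closed, disk_capacity))
    workers = [
        threading.Thread(target=worker, args=(q, closed, n_items, counts, i))
        for i in range(n_workers)
    ]
    w.start()
    for t in workers:
        t.start()
    for t in workers:
        t.join()
    put_or_abort(q, closed, None)  # sentinel; skipped if the writer died
    w.join()
    return closed.is_set(), sum(counts)
```

Calling `run()` with the defaults makes the simulated writer fail early; the workers notice within one polling interval and return, so the program exits instead of hanging on a full queue.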

davesl · 2015-07-17T12:06:01Z

Great, I'll try it then.

davesl · 2015-07-20T11:29:25Z

A quick update.

I installed the static version and everything runs well. I'm pretty certain this was an out-of-disk-space error, and I also think that if I switch to SIFT2 I won't encounter it in future (no need for the 40-60 GB .tck files I was generating previously).

I'll close this.

D

jdtournier · 2015-07-20T11:31:04Z

Good to hear. Thanks for reporting back.

jdtournier added the bug label Jul 17, 2015

jdtournier added this to the MRtrix3 3.0 release for ISMRM milestone Jul 17, 2015

jdtournier self-assigned this Jul 17, 2015

davesl closed this as completed Jul 20, 2015
