tckgen termination after 'MR::Exception' #294

Closed
davesl opened this issue Jul 17, 2015 · 7 comments
@davesl davesl commented Jul 17, 2015

Hi,

When running the tckgen command on a cluster I sometimes see the following error:

terminate called after throwing an instance of 'MR::Exception'
/data/lren/program/mrtrix3/bin/tckgen: line 4: 162801 Aborted "$PREFIX"/lib/ld-.so "$PREFIX"/exe/"$COMMAND" "$@"

An example command is as follows:

tckgen -seed_image aligned_mask.nii -mask aligned_mask.nii -algorithm iFOD2 -number 50000000 -maxlength 400 -minlength 2 -downsample 2 -act 5TT.nii -crop_at_gmwmi -backtrack -quiet -force aligned_HARDI_fod.nii /scratch/aligned_HARDI_fod.tck

I am running a standalone version of mrtrix3 on a cluster but will try to change to the static build in future.

I suspect this is due to available disk space. I write the .tck files to /scratch, which is a local disk available to each independent node. Each of ~16 cores can then write to that local drive in parallel. Perhaps the issue occurs when the /scratch disk reaches capacity.

As an example, here are the storage details of node12 (which I believe has a total storage capacity of 457 GB; 423 GB of that is accounted for by the .tck files):

--------- node12---------
/scratch/:
total 423G
drwxrwxrwt. 2 root root 4.0K Jul 14 18:10 .
dr-xr-xr-x. 27 root root 4.0K May 5 2014 ..
-rw-rw-r-- 1 dslater domain users 32G Jul 17 06:02 B0736-70-4_b3000_aligned_HARDI_fod.tck
-rw-rw-r-- 1 dslater domain users 33G Jul 17 06:02 B0744-70-4_b3000_aligned_HARDI_fod.tck
-rw-rw-r-- 1 dslater domain users 30G Jul 17 06:02 B0814-70-4_b3000_aligned_HARDI_fod.tck
-rw-rw-r-- 1 dslater domain users 31G Jul 17 06:01 B0833-01-4_b3000_aligned_HARDI_fod.tck
-rw-rw-r-- 1 dslater domain users 30G Jul 17 06:01 B0833-71-4_b3000_aligned_HARDI_fod.tck
-rw-rw-r-- 1 dslater domain users 30G Jul 17 06:01 B0834-71-4_b3000_aligned_HARDI_fod.tck
-rw-rw-r-- 1 dslater domain users 33G Jul 17 06:00 B0846-01-4_b3000_aligned_HARDI_fod.tck
-rw-rw-r-- 1 dslater domain users 31G Jul 17 06:01 B0846-70-4_b3000_aligned_HARDI_fod.tck
-rw-rw-r-- 1 dslater domain users 31G Jul 17 06:01 B0855-01-4_b3000_aligned_HARDI_fod.tck
-rw-rw-r-- 1 dslater domain users 27G Jul 17 05:52 B0870-01-4_b3000_aligned_HARDI_fod.tck
-rw-rw-r-- 1 dslater domain users 31G Jul 17 06:01 B0887-72-4_b3000_aligned_HARDI_fod.tck
-rw-rw-r-- 1 dslater domain users 30G Jul 17 06:01 B0933-01-3_b3000_aligned_HARDI_fod.tck
-rw-rw-r-- 1 dslater domain users 30G Jul 17 06:01 B0938-71-3_b3000_aligned_HARDI_fod.tck
-rw-rw-r-- 1 dslater domain users 30G Jul 17 06:00 B0939-01-3_b3000_aligned_HARDI_fod.tck
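
Going by the figures above, roughly 34 GB (457 GB minus 423 GB) would remain on node12, which is about the size of one more of these .tck files. A pre-flight check before launching each job could catch this early. The following is only a minimal sketch in Python, not part of MRtrix3: the helper name `has_room` and the `reserve_bytes` parameter are hypothetical, and the required size has to be estimated by the user.

```python
import shutil

GIB = 1024 ** 3  # bytes per gibibyte


def has_room(path, required_bytes, reserve_bytes=0):
    """Return True if the filesystem holding `path` has enough free space.

    `required_bytes` is the caller's estimate of the output size;
    `reserve_bytes` is an optional safety margin (hypothetical parameter).
    """
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    return usage.free >= required_bytes + reserve_bytes
```

A job script could call, for example, `has_room("/scratch", 35 * GIB)` and skip or defer the tckgen run when it returns False. Note that this does not account for other jobs writing to the same disk at the same time.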

@jdtournier jdtournier commented Jul 17, 2015

OK, that would explain the problem. The track writer will most likely throw an Exception if it can't write - so at least the symptoms are consistent with that...

However, the program shouldn't crash out like this - at worst it should hang (not great either mind you, but at least it would be consistent with what I'd expect to happen). For completeness, we had a discussion about how to handle exceptions being thrown in a multi-threading context (see #167), and I pushed some changes to handle this on 12 March, with pull request #180. So I'm surprised to see this happen at all - unless you haven't updated your installation since then...?
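
For context, the "terminate called after throwing an instance of ..." message is what a C++ program prints when an exception goes uncaught, for example when it escapes a thread's entry function. The general remedy is to capture the exception inside the thread and report it from the main thread. The sketch below illustrates that pattern in Python; it is not the MRtrix3 code from #180, and `write_tracks`, `run_writer` and the simulated "disk full" condition are all hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def write_tracks(n_tracks, free_slots):
    """Pretend to write tracks; fail once the (simulated) disk is full."""
    written = 0
    for _ in range(n_tracks):
        if written >= free_slots:
            raise OSError("no space left on device (simulated)")
        written += 1
    return written


def run_writer(n_tracks, free_slots):
    """Run the writer in a thread and handle its failure in the main thread."""
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(write_tracks, n_tracks, free_slots)
        try:
            # result() re-raises any exception raised inside the thread
            return f"wrote {future.result()} tracks"
        except OSError as exc:
            return f"writer failed: {exc}"
```

Here the failure surfaces as an ordinary error message rather than an abort, which is the behaviour one would want from the writer thread.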

@davesl davesl commented Jul 17, 2015

Hmm... with it being the standalone version it may well be a few months out of date.

I'll update the cluster code hopefully using the 'static' approach. Then if I ever come across this problem or another variant I'll report back.

@jdtournier jdtournier commented Jul 17, 2015

Thanks, that would be great. That said, you might want to hang on a bit - I think I need to make further changes, as in your case the current approach would probably cause tckgen to hang...

@jdtournier jdtournier added the bug label Jul 17, 2015
@jdtournier jdtournier added this to the MRtrix3 3.0 release for ISMRM milestone Jul 17, 2015
@jdtournier jdtournier self-assigned this Jul 17, 2015

@jdtournier jdtournier commented Jul 17, 2015

Actually, having thought about it a little, I think it might be fine as-is. I was thinking the worker threads (which generate the tracks) would hang waiting for the writer thread to clear its backlog (which it clearly can't, having just thrown an exception and hence terminated). Thankfully, the queue backend that handles feeding streamlines from the worker threads to the writer thread will shut everything down if no-one's listening... It should work fine, give it a shot.
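
The shutdown behaviour described above can be sketched as follows. This is an illustrative Python model only, not the MRtrix3 queue implementation: `ClosableQueue`, `writer`, `worker`, `run` and the byte-count "capacity" are hypothetical names and simplifications. The point is that producers blocked on a full queue notice that the consumer has gone away and stop, instead of hanging.

```python
import queue
import threading


class ClosableQueue:
    """Bounded queue whose producers give up once the consumer has closed it."""

    def __init__(self, maxsize):
        self._q = queue.Queue(maxsize=maxsize)
        self._closed = threading.Event()

    def close(self):
        self._closed.set()

    def put(self, item, poll=0.01):
        """Return True if the item was queued, False if nobody is listening."""
        while not self._closed.is_set():
            try:
                self._q.put(item, timeout=poll)
                return True
            except queue.Full:
                pass  # queue still full: re-check whether the consumer is alive
        return False

    def get(self):
        return self._q.get()


def writer(q, capacity, errors):
    """Consume items until a None sentinel arrives or the simulated disk fills."""
    written = 0
    try:
        while True:
            item = q.get()
            if item is None:
                return
            if written + len(item) > capacity:
                raise OSError("no space left on device (simulated)")
            written += len(item)
    except OSError as exc:
        errors.append(exc)  # record the failure for the main thread
    finally:
        q.close()  # tell the producers that no one is listening any more


def worker(q, n_items, produced):
    """Generate items; stop as soon as the queue reports it is closed."""
    count = 0
    for _ in range(n_items):
        if not q.put(b"x" * 10):
            break
        count += 1
    produced.append(count)


def run(capacity, n_workers=4, n_items=1000):
    q = ClosableQueue(maxsize=8)
    errors, produced = [], []
    w = threading.Thread(target=writer, args=(q, capacity, errors))
    workers = [threading.Thread(target=worker, args=(q, n_items, produced))
               for _ in range(n_workers)]
    w.start()
    for t in workers:
        t.start()
    for t in workers:
        t.join()
    q.put(None)  # sentinel; returns False at once if the writer already failed
    w.join()
    return errors, sum(produced)
```

Calling `run(capacity=100)` returns with one recorded error and far fewer items produced than requested, rather than blocking forever on the full queue.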

@davesl davesl commented Jul 17, 2015

Great, I'll try it then.

@davesl davesl commented Jul 20, 2015

A quick update.

I installed the static version and everything runs well. I'm pretty certain this was an out-of-disk-space error, and I also think that if I switch to SIFT2 I won't encounter it in future (no need for the 40-60 GB .tck files I was generating previously).

I'll close this.

D

@davesl davesl closed this Jul 20, 2015

@jdtournier jdtournier commented Jul 20, 2015

Good to hear. Thanks for reporting back.
