Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] eddy_cuda fails if ran with --estimate_move_by_susceptibility #180

Closed
fredrmag opened this issue Sep 18, 2020 · 21 comments
Closed

[BUG] eddy_cuda fails if ran with --estimate_move_by_susceptibility #180

fredrmag opened this issue Sep 18, 2020 · 21 comments

Comments

@fredrmag
Copy link
Contributor

fredrmag commented Sep 18, 2020

Hi,

This bug might be more related to FSL. But I will post it here anyway.

When running qsiprep v.0.11.0 using eddy_cuda with the flag --estimate_move_by_susceptibility. Our group get the following error:

image (1)

@fredrmag
Copy link
Contributor Author

@ethanknights
Copy link

Full log output is here in case useful:
slurm.qsiprep.sub-CC110037.ses-01-gpu.log

@mattcieslak
Copy link
Collaborator

Good find @fredrmag, can you confirm there's a NaN in the pixdim field in the header?

@fredrmag
Copy link
Contributor Author

fredrmag commented Sep 18, 2020

We checked the pixdim4 header, and it does not contain a NaN.

Edit: Added evidence:
image

@mattcieslak
Copy link
Collaborator

Are there NaNs in any of the inputs to eddy? If you're using fieldmaps there will likely be other image inputs. Have you had any successful runs with estimate_move_by_susceptibility? We're trying out this feature now too

@fredrmag
Copy link
Contributor Author

This participant do not have a fieldmap. Only other nifty input is the mask. Does not have NaN in the header. But, this NaN header should be fixed by Jesper in October 2019. So, if qsiprep is using the newest FSL, it should'nt be a problem?

@fredrmag
Copy link
Contributor Author

We have succesfully ran eddy_cuda with the --estimate_move_by_susceptibility parameter. But, then everything was ran outside of qsiprep. I am checking if the same FSL version was used.

@fredrmag
Copy link
Contributor Author

When eddy ran successfully with the --estimate_move_by_susceptibility flag, FSL 6.0.2 with eddy_cuda9.1 was used. We have not tested FSL 6.0.3 outside qsiprep.

@fredrmag
Copy link
Contributor Author

fredrmag commented Oct 2, 2020

@mattcieslak, we created a singularity image of of the fix proposed by PR #181 using commit: 32462e9. @ethanknights tested this on a dataset, and we still get the same error message. I will attach the logs.

Untitled
slurm.qsiprep.sub-CC110037.ses-01-gpu.log
gpu-crashLog-estimateMoveFSL6Patch.txt
gpu-log-estimateMoveFSL6Patch.txt

@mattcieslak
Copy link
Collaborator

Very interesting. That PR had fsl 6.0.4, which is supposed to contain some big improvements for eddy. It looks like the GPU is correctly identified and a number of iterations are run successfully before this crash happens. Maybe the GPU is running out of memory? The error is definitely happening in the movement-by-susceptibility estimation when trying to allocate a new image. Is there a way to request more GPU memory on your cluster?

@fredrmag
Copy link
Contributor Author

fredrmag commented Oct 5, 2020

I checked if the GPU ran out of memory by running nvidia-smi every 30s while eddy_cuda was running with the estimate_move_by_susceptibility flag. Our GPU has a total of 32GB memory, and eddy_cuda was only using 550MB before it failed.

image

@mattcieslak
Copy link
Collaborator

This must be an FSL bug then. What do you think would be the best approach? Maybe downgrade to 6.0.2 for now until FSL fixes it?

@fredrmag
Copy link
Contributor Author

fredrmag commented Oct 6, 2020

Hmm, I tried to run qsiprep with FSL 6.0.2, and I still get the same error. I find it strange why FSL 6.0.4 should not work when they have made a lot of improvements with eddy especially using the --estimate-move-by-susceptibility flag.

To summarize:
We have tested qsiprep on two different clusters, in Berlin and Oslo. And both gives the error above. In Berlin, we successfully ran FSL 6.0.2 with eddy_cuda9.1 using the --estimate-move-by-susceptibility flag. But, not using eddy_cuda9.1 within qsiprep.

Could it be something inside the qsiprep container that makes things fail? Are we missing some dependencies?

We are going to do the following tests the next days:

  1. Use the temporary files created by qsiprep as an input to the working FSL in Berlin.
  2. If 1. is successful. Test for the latest FSL version.
  3. Build FSL the same way inside the qsiprep container as was done in Berlin. Then, test this build running the full pipeline using eddy with --estimate-move-by-susceptibility flag.

@mattcieslak
Copy link
Collaborator

Are you using singularity or docker? Maybe there's something strange with the singularity or docker configuration? If you're willing to send an example subject I can test locally too, this is an important thing to have working.

@fredrmag
Copy link
Contributor Author

fredrmag commented Oct 7, 2020

We are using singularity with the --nv option for cuda. I am thinking that it might have to do something with us running CUDA 9.1 on CUDA 10.2. But, this needs to be tested. I read yesterday that we could compile FSL with any version of CUDA we want. Source

I am not sure if we are allowed to share any data. I will test it on sub-01 from this dataset. For us, multiple datasets have failed, so, it doesn't seems to be related to the dataset.

Have you managed to run qsiprep succesfully using singularity and eddy_cuda with the --estimate_move_by_susceptibility option before? If you have, then this might be an issue on our side.

Btw, thanks for your help on this!

@fredrmag
Copy link
Contributor Author

fredrmag commented Oct 8, 2020

So, the 1. test from the previous post is done:

  1. Use the temporary files created by qsiprep as an input to the working FSL in Berlin.
  2. If 1. is successful. Test for the latest FSL version.
  3. Build FSL the same way inside the qsiprep container as was done in Berlin. Then, test this build running the full pipeline using eddy with --estimate-move-by-susceptibility flag.

Maike from our group says the following:

I get the same error when inputting the qsiprepped data into eddy_cuda from a different fsl container (i.e. not contained in qsiprep but the one I used before where --estimate_move_by_susceptibilty ran perfectly fine).

We are in the process of double checking that the fsl container used works on separate data.

@fredrmag
Copy link
Contributor Author

Hi @mattcieslak,

I managed to reproduce this issue outside of qsiprep. Suggesting that this is not an issue related to qsiprep, but related to the FSL binaries. I also found this report from the FSL mailing list, where a user reported the same problem.

I have tried to compile FSL with CUDA 10.2 and run it outside of a singularity container but this resulted in the same error.

I have reported it to the FSL mailing list, and Jesper suggested it might be related to 3 things:

  1. Issue with binaries compiled with CUDA 10.2
  2. Hardware issue related our GPU card, which is a NVIDIA Tesla V100 PCIe 32 GB
  3. FSL bug

He will run the exact same data (ds003047_sub-01_ses-1) and commands I used to check if it is a FSL bug. My bet is the hardware issue. As we get the same error for the binaries compiled by FSL, and that they have improved upon the eddy_cuda with the --estimate_move_by_susceptibility quite a lot in the recent releases, so they should have detected this.

Jesper suggests that recompiling the eddy_cuda binary with the flags -gencode=arch=compute_70,code=sm_70 might solve the issue, so this is something I will try to do.

I will keep you updated,
Fredrik

@mattcieslak
Copy link
Collaborator

Thanks for looking into this!! I will also try on a machine with a GPU and an exact matching version of cuda (9.1)

@fredrmag
Copy link
Contributor Author

Jesper from FSL just confirmed that this is a bug in eddy: https://www.jiscmail.ac.uk/cgi-bin/wa-jisc.exe?A2=ind2010&L=FSL&O=D&X=70686BB980F21516B4&Y=fredrik.magnussen%40psykologi.uio.no&P=190319

I also tried to compile eddy_cuda with the -gencode=arch=compute_70,code=sm_70 flag to nvcc, and it didn't solve the problem.

@fredrmag
Copy link
Contributor Author

Jesper from FSL has fixed the bug. His response:

I have identified and fixed the problem. It was indeed an eddy bug that is triggered by doing --estimate_move_by_susceptibilty when not supplying a static field with the --topup parameter. This is something that should be possible, and with the bug fix it is. The fix is checked in and will be part of the next release, which I think is not too far away.

So, this will be fixed in the next FSL release, that is to come in the end of the year.

@fredrmag
Copy link
Contributor Author

I am closing this issue, as it is a bug in FSL, not qsiprep.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants