Out-of-memory errors after the first few iterations of template generation #13

Closed
rohanbanerjee opened this issue Sep 26, 2023 · 6 comments

Comments

@rohanbanerjee

Hello,
I am opening this issue to discuss an out-of-memory problem that I have observed when nist_mni_pipelines is used for template generation.

For context, we are using this script to generate a spinal cord template on DRAC clusters. After the first few iterations, it threw the error: slurmstepd: error: Detected 7 oom-kill event(s) in StepId=41187759.batch. Some of your processes may have been killed by the cgroup out-of-memory handler. When we re-ran it with the same memory configuration, the pipeline ran smoothly and has kept running without problems so far. We have observed the same behaviour on our local machines (Macs) too. Would the maintainers have any insight into why this might happen?

Please let me know if there are any details required from my side. Thanks!
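
For anyone trying to narrow down a report like this, it can help to record how much memory a pipeline step actually used before comparing it with the job's limit. The following is a minimal sketch, not part of nist_mni_pipelines: it runs a command and reads the peak resident set size of finished child processes from Python's standard `resource` module. The helper name `run_and_report_peak` and the demo command are hypothetical.

```python
import resource
import subprocess
import sys


def run_and_report_peak(cmd):
    """Run cmd and return (return code, peak RSS in MiB).

    The peak is the largest resident set size of any child process
    that has been waited for so far, not the sum over all children.
    """
    proc = subprocess.run(cmd)
    usage = resource.getrusage(resource.RUSAGE_CHILDREN)
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return proc.returncode, usage.ru_maxrss / divisor


if __name__ == "__main__":
    # Hypothetical stand-in for a pipeline step: allocate about 50 MiB.
    demo = [sys.executable, "-c", "data = bytearray(50 * 1024 * 1024)"]
    code, peak = run_and_report_peak(demo)
    print(f"exit code {code}, peak child RSS about {peak:.0f} MiB")
```

Because the value is a maximum over single processes, several steps running at the same time can together exceed the limit even when each reported peak looks safe.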

@vfonov
Member

vfonov commented Sep 26, 2023

So, what changed between when it ran smoothly and now?

Also, what is the "same behavior" on the Mac - is it getting an out-of-memory error too?

@rohanbanerjee
Author

rohanbanerjee commented Sep 26, 2023

So, what changed between when it ran smoothly and now?

That is the part I am not sure about. As I mentioned, there was no change in the memory configuration.

What is the "same behavior" on the mac - is it getting an out of memory error too?

Yes, that is correct. Another behaviour we observed is that the script stops abruptly, reporting that <filename>.xfm was not found, even though that file is generally present. When the script is re-run (the same command, without any changes), it runs without any error.
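
One way to tell a transform that was never written from one that is merely reported missing is to check the expected outputs right after a stage finishes. This is a minimal sketch under assumptions: the helper `missing_or_empty` and the example paths are hypothetical and do not come from the pipeline itself.

```python
from pathlib import Path


def missing_or_empty(paths):
    """Return the paths that are not regular files or are zero bytes long."""
    bad = []
    for p in map(Path, paths):
        if not p.is_file() or p.stat().st_size == 0:
            bad.append(p)
    return bad


if __name__ == "__main__":
    # Hypothetical output names; substitute the files a stage should produce.
    expected = ["work/iter_01/subject_01.xfm", "work/iter_01/subject_02.xfm"]
    for path in missing_or_empty(expected):
        print(f"missing or empty: {path}")
```

If a file shows up here immediately after a step whose process was killed, the "not found" message is likely a downstream symptom of the earlier failure rather than a separate bug, though that is only a guess without the logs.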

@vfonov
Member

vfonov commented Sep 26, 2023

So, what's the Mac configuration, and what's the data that you are using this on?

@rohanbanerjee
Author

rohanbanerjee commented Sep 26, 2023

what's the Mac configuration

My machine is an M1 MacBook Air with 8 GB of memory. Since the pipeline is very slow on the Mac (15 iterations take approximately 3 days), we moved to DRAC clusters, where we launched a job with 2 CPUs (8 GB each).

what's the data that you are using this on?

I am using this to create a spinal cord template of dogs. The data is only on our cluster. Nadia, who is working on paediatric human spinal cord data, faced this problem too.

@vfonov
Member

vfonov commented Sep 26, 2023 via email

@rohanbanerjee
Author

How many jobs do you run in parallel?

I'm running only one job.

Also, what's the resolution of the scans on which OOM happens?

They are all 0.5 x 0.5 x 0.5.
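
The resolution matters because voxel count grows with the cube of the inverse spacing. As a rough sketch of that arithmetic, the function below estimates the size of a single volume; the field of view used in the demo is hypothetical (the thread does not state it), and the 4 bytes per voxel assumes float32 storage.

```python
import math


def volume_memory_mib(extent, spacing, bytes_per_voxel=4):
    """Estimate (voxel count, size in MiB) for one 3D volume.

    extent and spacing are (x, y, z) tuples in the same length unit.
    """
    dims = [math.ceil(e / s) for e, s in zip(extent, spacing)]
    voxels = dims[0] * dims[1] * dims[2]
    return voxels, voxels * bytes_per_voxel / 2**20


if __name__ == "__main__":
    # Hypothetical field of view, in the same unit as the spacing.
    extent = (100, 100, 200)
    for step in (1.0, 0.5):
        voxels, mib = volume_memory_mib(extent, (step, step, step))
        print(f"spacing {step}: {voxels} voxels, {mib:.1f} MiB per volume")
```

Halving the spacing in all three directions multiplies the voxel count by eight. A registration step usually holds several volumes and deformation fields at once, so the per-volume figure is only a lower bound on what a process needs.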

Closing this issue for now as this doesn't happen anymore. Will re-open if I observe this issue again.
