
Out of Memory Bug #93

Open
jiruijin opened this issue Mar 21, 2023 · 14 comments
Comments

@jiruijin

I trained the model with 28,000 CIF files, but every time I ran it, I got an error:
"slurmstepd: error: Detected 1 oom-kill event(s) in StepId=59934605.batch. Some of your processes may have been killed by the cgroup out-of-memory handler."
I already allocated 500 GB of CPU memory; why is it still running out of memory?
[screenshot of the error]

@bdecost
Collaborator

bdecost commented Mar 21, 2023

Hi @jiruijin, this looks like an OOM crash during the precomputation of the graphs and line graphs. We've been storing those in memory, but if you have a large dataset or very large supercells, the memory cost is too high. In fact, if you have large supercells, it's likely going to be a problem for GPU memory utilization too.

I think the long-term fix is to switch to this vectorized graph construction and stop storing the fully-featurized graphs in memory. This has been on the back burner for a bit; I need to make some time to get it over the line.

@jiruijin
Author

Thanks a lot for answering so quickly @bdecost, but I think when you trained the model on the Materials Project dataset, it should have been a larger dataset than mine. I also tried a smaller dataset of about 4,000 CIF files, and that worked. It seems that for now I can only decrease the size of the dataset; I hope you can fix it soon.

@bdecost
Collaborator

bdecost commented Mar 21, 2023

How many atoms are in a typical supercell for your dataset? The other week a collaborator was having a similar issue on a dataset with ~800-atom supercells, so I do need to prioritize this soon.

@jiruijin
Author

The dataset I am using is from OQMD; half of the structures (12,000) only have 5 atoms. In the rest of the dataset, the number of atoms ranges from 10 to 40.

@knc6
Collaborator

knc6 commented Mar 21, 2023

You can lower the batch size to something like 2 or 5, which should resolve the issue.
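
For anyone following along, a minimal sketch of that change, assuming the training run reads a JSON config file with a `batch_size` field (the file name and key here are assumptions about the setup, not taken from this thread):

```python
# Sketch only: lower the batch size in the training config before relaunching.
import json

path = "config.json"  # example path to the training configuration
with open(path) as f:
    config = json.load(f)

config["batch_size"] = 2  # try 2 or 5 instead of the default

with open(path, "w") as f:
    json.dump(config, f, indent=2)
```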

@jiruijin
Author

Thank you @knc6, I will try it now!

@jiruijin
Author

I got the exact same error when I lowered the batch size to 5 or 8.

@bdecost
Collaborator

bdecost commented Mar 22, 2023

Can you clarify exactly when the resource manager is killing your job? From your screenshot it looks like it's during the dataloader setup, before any data is sent to the GPU?

> The dataset I am using is from OQMD; half of the structures (12,000) only have 5 atoms. In the rest of the dataset, the number of atoms ranges from 10 to 40.

I'm a little surprised this dataset is hitting your memory limit, but I will investigate. Do you mind sharing the slurm configuration you're using and maybe the limits of the partition you're running on?

@bdecost
Collaborator

bdecost commented Mar 22, 2023

To clarify a bit: if the GPU memory limit were the issue, you'd get a CUDA out-of-memory error, and slurm would kill your job because the training script crashed. I think you are running into a memory-vs-compute tradeoff that we made because our graph construction code was slow. I'm working on fixing that, but there are a couple of other fixes I'm trying to do at the same time...

@jiruijin
Author

[screenshot: slurm configuration]
Thank you for the information. Above is the slurm configuration I am using, and the problem is:
[screenshot: out-of-memory error]

@bdecost
Collaborator

bdecost commented Mar 22, 2023

Right, OK. How many dataloader workers are you using? Each worker process apparently makes a full copy of the dataset, so if you're using multiple workers you can reduce that for now as a band-aid.

That's not a real solution, but I'll try to get a fix out this week or next. There are a few things to do, most importantly computing the graphs during minibatch construction instead of caching them all in memory (see the sketch below).
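
The rough idea is something like this. This is a sketch of the general pattern, not the actual ALIGNN code; the graph-construction callable passed in (here named `build_graph`) stands in for the repo's CIF-to-graph/line-graph routine:

```python
from torch.utils.data import Dataset

class LazyGraphDataset(Dataset):
    """Build each graph on demand instead of caching every graph in memory."""

    def __init__(self, cif_paths, targets, build_graph):
        # Keep only lightweight references in memory: file paths, labels,
        # and the graph-construction callable.
        self.cif_paths = cif_paths
        self.targets = targets
        self.build_graph = build_graph

    def __len__(self):
        return len(self.cif_paths)

    def __getitem__(self, idx):
        # The graph is constructed here, one sample at a time, so peak memory
        # scales with the batch size rather than the full dataset.
        return self.build_graph(self.cif_paths[idx]), self.targets[idx]
```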

@jiruijin
Author

I am using the default setting: num_workers: int = 4. I will decrease it and retry. Thank you!

@jiruijin
Author

I still have the same problem. I will wait for the fix.

@jiruijin
Author

I used the get_primitive_structure function in Pymatgen on all my CIF files and set num_workers = 10 in the config file, and now the problem is solved!
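
For anyone else hitting this, a minimal sketch of that preprocessing step with pymatgen (the directory names are examples, not from the original post):

```python
# Reduce each structure to its primitive cell before training, so the
# precomputed graphs are smaller.
from pathlib import Path
from pymatgen.core import Structure

src_dir = Path("cif_original")    # example: directory with the original CIFs
dst_dir = Path("cif_primitive")   # example: output directory
dst_dir.mkdir(exist_ok=True)

for cif in src_dir.glob("*.cif"):
    structure = Structure.from_file(str(cif))
    primitive = structure.get_primitive_structure()
    # Output format is inferred from the .cif extension.
    primitive.to(filename=str(dst_dir / cif.name))
```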
