
Out of Memory Bug #93

Open
jiruijin opened this issue Mar 21, 2023 · 14 comments
Comments

@jiruijin

I trained the model with 28,000 CIF files, but every time I ran it, I got an error:
"slurmstepd: error: Detected 1 oom-kill event(s) in StepId=59934605.batch. Some of your processes may have been killed by the cgroup out-of-memory handler."
I already allocated 500 GB of CPU memory; why is it still running out of memory?
[screenshot of the error]

@bdecost
Collaborator

bdecost commented Mar 21, 2023

Hi @jiruijin, this looks like an OOM crash during the precomputation of the graphs and line graphs. We've been storing those in memory, but if you have a large dataset or very large supercells, the memory cost is too high. In fact, if you have large supercells, it's likely going to be a problem for GPU memory utilization too.

I think the long-term fix is to switch to this vectorized graph construction and stop storing the fully-featurized graphs in memory. This has been on the back burner for a bit; I need to make some time to get it over the line.

@jiruijin
Author

Thanks a lot for answering so quickly @bdecost, but I think when you trained the model on the Materials Project dataset, it should have been a larger dataset than mine. I also tried a smaller dataset of about 4,000 CIF files, and that worked. It seems that for now I can only decrease the size of the dataset; I hope you can fix it soon.

@bdecost
Collaborator

bdecost commented Mar 21, 2023

How many atoms are in a typical supercell for your dataset? The other week a collaborator was having a similar issue on a dataset with ~800-atom supercells, so I do need to prioritize this soon.

@jiruijin
Author

The dataset I am using is from OQMD; half of the structures (12,000) only have 5 atoms. In the rest of the dataset, the number of atoms ranges from 10 to 40.

@knc6
Collaborator

knc6 commented Mar 21, 2023

You can lower the batch size to something like 2 or 5, which should resolve the issue.
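
For anyone following along, a minimal sketch of that change, assuming the training run reads a JSON config file with a `batch_size` field (the file name and key here are assumptions about the setup, not taken from this thread):

```python
# Sketch only: lower the batch size in the training config before relaunching.
import json

path = "config.json"  # example path to the training configuration
with open(path) as f:
    config = json.load(f)

config["batch_size"] = 2  # try 2 or 5 instead of the default

with open(path, "w") as f:
    json.dump(config, f, indent=2)
```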

@jiruijin
Author

Thank you @knc6, I will try it now!

@jiruijin
Author

I got the exact same error when I lowered the batch size to 5 or 8.

@bdecost
Collaborator

bdecost commented Mar 22, 2023

Can you clarify exactly when the resource manager is killing your job? From your screenshot it looks like it's during the dataloader setup, before any data is sent to the GPU?

> The dataset I am using is from OQMD; half of the structures (12,000) only have 5 atoms. In the rest of the dataset, the number of atoms ranges from 10 to 40.

I'm a little surprised this dataset is hitting your memory limit, but I will investigate. Do you mind sharing the slurm configuration you're using and maybe the limits of the partition you're running on?

@bdecost
Collaborator

bdecost commented Mar 22, 2023

To clarify a bit: if the GPU memory limit were the issue, you'd get a CUDA out-of-memory error, and slurm would kill your job because the training script crashed. I think you are running into a memory-vs-compute tradeoff that we made because our graph construction code was slow. I'm working on fixing that, but there are a couple of other fixes I'm trying to do at the same time...

@jiruijin
Author

[screenshot: slurm configuration]
Thank you for the information. Above is the slurm configuration I am using, and the problem is:
[screenshot: out-of-memory error]

@bdecost
Collaborator

bdecost commented Mar 22, 2023

Right, OK. How many dataloader workers are you using? Each worker process apparently makes a full copy of the dataset, so if you're using multiple workers you can reduce that for now as a band-aid.

That's not a real solution, but I'll try to get a fix out this week or next. There are a few things to do, most importantly computing the graphs during minibatch construction instead of caching them all in memory (see the sketch below).
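
The rough idea is something like this. This is a sketch of the general pattern, not the actual ALIGNN code; the graph-construction callable passed in (here named `build_graph`) stands in for the repo's CIF-to-graph/line-graph routine:

```python
from torch.utils.data import Dataset

class LazyGraphDataset(Dataset):
    """Build each graph on demand instead of caching every graph in memory."""

    def __init__(self, cif_paths, targets, build_graph):
        # Keep only lightweight references in memory: file paths, labels,
        # and the graph-construction callable.
        self.cif_paths = cif_paths
        self.targets = targets
        self.build_graph = build_graph

    def __len__(self):
        return len(self.cif_paths)

    def __getitem__(self, idx):
        # The graph is constructed here, one sample at a time, so peak memory
        # scales with the batch size rather than the full dataset.
        return self.build_graph(self.cif_paths[idx]), self.targets[idx]
```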

@jiruijin
Author

I am using the default setting: num_workers: int = 4. I will decrease it and retry. Thank you!

@jiruijin
Author

I still have the same problem. I will wait for the fix.

@jiruijin
Author

I used the get_primitive_structure function in Pymatgen on all my CIF files and set num_workers = 10 in the config file, and now the problem is solved!
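
For anyone else hitting this, a minimal sketch of that preprocessing step with pymatgen (the directory names are examples, not from the original post):

```python
# Reduce each structure to its primitive cell before training, so the
# precomputed graphs are smaller.
from pathlib import Path
from pymatgen.core import Structure

src_dir = Path("cif_original")    # example: directory with the original CIFs
dst_dir = Path("cif_primitive")   # example: output directory
dst_dir.mkdir(exist_ok=True)

for cif in src_dir.glob("*.cif"):
    structure = Structure.from_file(str(cif))
    primitive = structure.get_primitive_structure()
    # Output format is inferred from the .cif extension.
    primitive.to(filename=str(dst_dir / cif.name))
```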
