It seems that there are some memory issues when trying to process a large dataset; below is an example of such an error.
92%|█████████▏| 63995/69882 [27:32<08:01, 12.22it/s]/cm/local/apps/slurm/var/spool/job530579/slurm_script: line 26: 1295955 Killed seqvec -i $in_file -o $results_dir/embeddings.npz --protein True --id -1
slurmstepd: error: Detected 1 oom-kill event(s) in step 530579.batch cgroup. Some of your processes may have been killed by the cgroup out-of-memory handler.
Adjusting the batch size appears to make it worse (--batch-size=10 makes this fail at 40%). Splitting up the files into 10k chunks helps, but doesn't quite resolve this (the error above was generated doing that).
There might be too many sequences in the FASTA file, so the node on which this is being computed runs out of main RAM (not GPU RAM). Solution: chop up the FASTA file into smaller chunks.
There might be sequences in your FASTA file which are too long to be processed in GPU memory. For these, falling back to CPU might be a solution, with the inherent limitation that it will be much slower (it can take up to days for a single, very long sequence). Alternatively, you can chop up long sequences into smaller parts, but this might introduce other unwanted effects.
To "solve" both issues, we currently plan on implementing a data preparation step which will re-sort the input FASTA from short to long sequences, then chop the computation in chunks of 5k sequences & outsource computation of sequences above 15k AA to CPU.
I think there might also be a misunderstanding about the --batch-size parameter: it gives the number of residues which are accumulated in a single batch before getting embedded. As we sort sequences by length, this means that we create larger batches at the beginning (shortest sequences) and smaller batches towards the end of your dataset.
That being said, setting --batch-size=10 should lead to single-sequence processing, as your proteins should be longer than 10 residues (illustrated below). If you still run out of memory with this setting, you can proceed as Chris pointed out: remove long sequences (e.g. >15k residues) from your set for the moment and embed them separately, and/or create even smaller chunks of e.g. 5k proteins.
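To make the residue-based batching concrete, here is a simplified illustration; the actual implementation in seqvec may differ in details:

```python
# Group sequence lengths into batches whose total residue count stays within
# a budget (a batch may still hold one sequence longer than the budget).
def make_batches(seq_lengths, max_residues):
    batches, current, residues = [], [], 0
    for length in sorted(seq_lengths):
        if current and residues + length > max_residues:
            batches.append(current)
            current, residues = [], 0
        current.append(length)
        residues += length
    if current:
        batches.append(current)
    return batches

# With max_residues=10 and typical protein lengths (>10 residues),
# every batch holds exactly one sequence, i.e. single-sequence processing.
print(make_batches([120, 85, 300, 45], max_residues=10))
# -> [[45], [85], [120], [300]]
```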