Out of memory error for >50k sequences #10

Closed
mortonjt opened this issue Apr 3, 2020 · 2 comments

mortonjt commented Apr 3, 2020

It seems there are some memory issues when processing large datasets; below is an example of such an error.

 92%|█████████▏| 63995/69882 [27:32<08:01, 12.22it/s]/cm/local/apps/slurm/var/spool/job530579/slurm_script: line 26: 1295955 Killed                  seqvec -i $in_file -o $results_dir/embeddings.npz --protein True --id -1
slurmstepd: error: Detected 1 oom-kill event(s) in step 530579.batch cgroup. Some of your processes may have been killed by the cgroup out-of-memory handler.

Adjusting the batch size appears to make it worse (--batch-size=10 makes this fail at 40%). Splitting the input into 10k-sequence chunks helps, but doesn't fully resolve the problem (the error above was generated while doing exactly that).

sacdallago added the enhancement (New feature or request) label on Apr 8, 2020
sacdallago commented

This issue is twofold:

  1. There might be too many sequences in the FASTA file, and the node on which this is being computed runs out of main RAM (not GPU RAM). Solution: chop up the FASTA file into smaller chunks (a minimal splitting sketch follows this list).
  2. There might be sequences in your FASTA file which are too long to be processed in GPU memory. For these, falling back to CPU might be a solution, with the inherent limitation that it will be much slower (it can take up to days for a single, very long sequence). Alternatively, you can chop up long sequences into smaller parts, but this might introduce other unwanted effects.
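
For point 1, a splitting helper could look roughly like the sketch below. This is not part of seqvec; it assumes Biopython is available, and the function name, chunk size, and file names are placeholders.

```python
# Hypothetical helper: split a large FASTA file into smaller chunks so each
# chunk can be embedded in a separate seqvec run.
from Bio import SeqIO

def split_fasta(in_path, out_prefix, chunk_size=10_000):
    records = list(SeqIO.parse(in_path, "fasta"))
    for i in range(0, len(records), chunk_size):
        out_path = f"{out_prefix}_{i // chunk_size:03d}.fasta"
        SeqIO.write(records[i:i + chunk_size], out_path, "fasta")

split_fasta("sequences.fasta", "chunk")
```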

To "solve" both issues, we currently plan on implementing a data preparation step which will re-sort the input FASTA from short to long sequences, then chop the computation in chunks of 5k sequences & outsource computation of sequences above 15k AA to CPU.

This will most likely be implemented in the "pipeline" https://github.com/sacdallago/bio_embeddings, rather than in this codebase, which is kept a bit more flexible.

mheinzinger commented

I think there might also be a misunderstanding about the --batch-size parameter: it gives the number of residues which are accumulated in a single batch before getting embedded. As we sort sequences by length, this means that we create larger batches at the beginning (shortest sequences) and smaller batches towards the end of your dataset.
That being said, setting --batch-size=10 should lead to single-sequence processing, as your proteins should be longer than 10 residues. If you still run out of memory with this setting, you can proceed as Chris pointed out: remove long sequences (e.g. >15k residues) from your set for the moment and embed them separately, and/or create even smaller chunks of e.g. 5k proteins.
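
To illustrate those semantics, here is a small sketch of residue-based batching (not the actual seqvec implementation, just one possible reading of the behaviour described above):

```python
# Illustrative sketch: sequences are sorted by length and accumulated into a
# batch until adding the next sequence would exceed the residue limit, so
# short sequences yield large batches and long sequences end up alone.
def batch_by_residues(sequences, max_residues):
    batch, n_residues = [], 0
    for seq in sorted(sequences, key=len):
        if batch and n_residues + len(seq) > max_residues:
            yield batch
            batch, n_residues = [], 0
        batch.append(seq)
        n_residues += len(seq)
    if batch:
        yield batch

# With max_residues=10 and proteins longer than 10 residues, every batch
# contains exactly one sequence, i.e. single-sequence processing.
```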
