It seems that there are some memory issues when trying to process a large dataset; below is an example of such an error.
92%|█████████▏| 63995/69882 [27:32<08:01, 12.22it/s]/cm/local/apps/slurm/var/spool/job530579/slurm_script: line 26: 1295955 Killed seqvec -i $in_file -o $results_dir/embeddings.npz --protein True --id -1
slurmstepd: error: Detected 1 oom-kill event(s) in step 530579.batch cgroup. Some of your processes may have been killed by the cgroup out-of-memory handler.
Adjusting the batch size appears to make it worse (--batch-size=10 makes this fail at 40%). Splitting up the files into 10k chunks helps, but doesn't quite resolve this (the error above was generated doing that).
There might be too many sequences in the FASTA file, so the node on which this is being computed runs out of main RAM (not GPU RAM). Solution: chop up the FASTA file into smaller chunks.
There might be sequences in your FASTA file which are too long to be processed in GPU memory. For these, falling back to CPU might be a solution, with the inherent limitation that it will be much slower (it can take up to days for a single, very long sequence). Alternatively, you can chop up long sequences into smaller parts, but this might introduce other unwanted effects.
To "solve" both issues, we currently plan on implementing a data preparation step which will re-sort the input FASTA from short to long sequences, then chop the computation in chunks of 5k sequences & outsource computation of sequences above 15k AA to CPU.
I think there might also be a misunderstanding about the --batch-size parameter: it gives the number of residues which are accumulated in a single batch before getting embedded. As we sort sequences by length, this means that we create larger batches at the beginning (shortest sequences) and smaller batches towards the end of your dataset.
That being said, setting --batch-size=10 should lead to single-sequence processing, as your proteins should be longer than 10 residues (illustrated below). If you still run out of memory with this setting, you can proceed as Chris pointed out: remove long sequences (e.g. >15k residues) from your set for the moment and embed them separately, and/or create even smaller chunks of e.g. 5k proteins.
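To make the residue-based batching concrete, here is a simplified illustration; the actual implementation in seqvec may differ in details:

```python
# Group sequence lengths into batches whose total residue count stays within
# a budget (a batch may still hold one sequence longer than the budget).
def make_batches(seq_lengths, max_residues):
    batches, current, residues = [], [], 0
    for length in sorted(seq_lengths):
        if current and residues + length > max_residues:
            batches.append(current)
            current, residues = [], 0
        current.append(length)
        residues += length
    if current:
        batches.append(current)
    return batches

# With max_residues=10 and typical protein lengths (>10 residues),
# every batch holds exactly one sequence, i.e. single-sequence processing.
print(make_batches([120, 85, 300, 45], max_residues=10))
# -> [[45], [85], [120], [300]]
```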