Issue when reading 4000+ pdb files in the same Universe #4590
Comments
@samuelmurail any chance the exact number it fails at is 4096 files? That would be the default file open limit in modern Linux if I remember correctly.
Off the top of my head, options here could be to increase your hard limit using ulimit, or to concatenate chunks of your PDBs into a multi-frame trajectory file.
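As a reference point, the soft open-file limit can also be inspected and raised from within Python with the standard resource module (POSIX only); a minimal sketch, assuming the hard limit allows a value such as 8192:

```python
import resource

# current (soft, hard) limits on open file descriptors
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(soft, hard)

# raise the soft limit; the new value must not exceed the hard limit,
# and only a privileged process may raise the hard limit itself
resource.setrlimit(resource.RLIMIT_NOFILE, (8192, hard))
```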
No, it is more around 4200 structures.
I have used another library to concatenate all the PDBs into a single file, which has solved my issue; however, it is very slow. Any idea how to increase the hard limit? I have tried:

```python
import resource
resource.setrlimit(resource.RLIMIT_DATA, (1024, 1024))
```

But now I have this error:
Sorry, my capacity is a bit limited this week - maybe someone from @MDAnalysis/coredevs has some insights here?
I can decrease my open file limit with RLIMIT_NOFILE, `resource.setrlimit(resource.RLIMIT_NOFILE, (16, 16))`, and this will make the ChainReader stop at even fewer than 16 files, just because there are already a bunch of Python modules open.

```python
import MDAnalysis as mda; import MDAnalysisTests.datafiles as data
import resource

resource.setrlimit(resource.RLIMIT_NOFILE, (16, 16))
u = mda.Universe(data.PDB, 15 * [data.PDB])
```

fails with `OSError: [Errno 24] Too many open files`.

Note that on my macOS the default hard limit for open files is RLIM_INFINITY:

```python
In [1]: import resource
In [2]: resource.getrlimit(resource.RLIMIT_NOFILE)
Out[2]: (256, 9223372036854775807)
```

When trying to load the same PDB 5000 times

```python
u = mda.Universe(data.PDB, 5000 * [data.PDB])
```

I also get `OSError: [Errno 24] Too many open files`. Trying with an increased soft limit

```python
import MDAnalysis as mda; import MDAnalysisTests.datafiles as data
import resource

resource.setrlimit(resource.RLIMIT_NOFILE, (6000, resource.RLIM_INFINITY))
u = mda.Universe(data.PDB, 5000 * [data.PDB])
```

... that took a while but worked. I don't have the time right now to run the mem profiler to see if the ChainReader + PDB takes up memory for all files ... it might, and then that is probably the reason for the MemoryError. Something to check.
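For anyone who wants to follow up on the memory question, Python's built-in tracemalloc gives a quick first check; a rough sketch (untested here, and 500 copies is chosen only to stay under the default descriptor limit):

```python
import tracemalloc

import MDAnalysis as mda
import MDAnalysisTests.datafiles as data

tracemalloc.start()

# ChainReader over many copies of the same single-frame PDB
u = mda.Universe(data.PDB, 500 * [data.PDB])

current, peak = tracemalloc.get_traced_memory()
print(f"current: {current / 1e6:.1f} MB, peak: {peak / 1e6:.1f} MB")
tracemalloc.stop()
```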
The issue with open file descriptors is likely related to #239.
@samuelmurail, could you try to convert it into a temporary trajectory format first, e.g.

```python
import MDAnalysis as mda
from MDAnalysisTests.datafiles import PDB

files = [PDB] * 10

u = mda.Universe(PDB)
# load_new() swaps the trajectory reader, so only one PDB is open at a time
with mda.Writer('test.xtc', u.atoms.n_atoms) as W:
    for file in files:
        u.load_new(file)
        W.write(u.atoms)

u = mda.Universe(PDB, 'test.xtc')
```
Thanks @yuxuanzhuang, makes sense. I don't have a good idea how one could make the ChainReader work with arbitrary numbers of files while avoiding the limitation of having so many readers open at the same time — this would likely require a redesign.
Thanks for the feedback, here is my fix:

```python
import MDAnalysis as mda
from MDAnalysis.coordinates.memory import MemoryReader
import numpy as np


def read_numerous_pdb(pdb_files, batch_size=1000):
    """
    Read a large number of PDB files in batches and combine them into a
    single MDAnalysis Universe.

    Parameters
    ----------
    pdb_files : list of str
        List of file paths to the PDB files to be read.
    batch_size : int, optional
        Number of PDB files to read in each batch. Default is 1000.

    Returns
    -------
    MDAnalysis.Universe
        A single MDAnalysis Universe containing the combined frames from
        all the PDB files.

    Notes
    -----
    - This function reads PDB files in batches to stay below the open
      file limit.
    - Each batch of PDB files is loaded into a temporary Universe, and the
      positions of each frame are stored in a list.
    - The list of frames is then converted into a numpy array and used to
      create a new Universe with a MemoryReader, combining all the frames.

    Example
    -------
    >>> pdb_files = ['file1.pdb', 'file2.pdb', ...]
    >>> combined_universe = read_numerous_pdb(pdb_files, batch_size=1000)
    >>> print(combined_universe)
    """
    all_frames = []

    for i in range(0, len(pdb_files), batch_size):
        # print(f"Reading frames {i:5} to {i+batch_size:5}, total : {len(pdb_files[i:i+batch_size])} frames")
        local_u = mda.Universe(pdb_files[0], pdb_files[i:i + batch_size])
        for ts in local_u.trajectory:
            all_frames.append(ts.positions.copy())
        del local_u

    # print("Convert to numpy")
    frames_array = np.array(all_frames)
    del all_frames
    # print(frames_array.shape)

    return mda.Universe(pdb_files[0], frames_array, format=MemoryReader, order='fac')
```

Using this function, I could read 20,000+ files, probably more. Cheers,
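One caveat worth adding: the MemoryReader keeps every frame in RAM as float32 positions, so the footprint grows linearly with the number of files (and the frame list plus the array briefly coexist inside the function, roughly doubling the peak). A back-of-the-envelope estimate with hypothetical sizes:

```python
import numpy as np

n_frames = 20_000  # e.g. number of PDB files read
n_atoms = 5_000    # hypothetical system size

# 3 float32 coordinates per atom per frame
size_gib = n_frames * n_atoms * 3 * np.dtype(np.float32).itemsize / 1024**3
print(f"~{size_gib:.1f} GiB")  # ~1.1 GiB for these numbers
```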
@orbeckst I think redesign is maybe too strong a word (implying API breaks?). I think you could refactor (implying no API breaks) ChainReader to open file handles just-in-time (some pattern of actively using ...
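A loose illustration of that just-in-time idea (hypothetical code, not MDAnalysis's actual ChainReader API; the class name is made up): keep only the file paths and reuse a single Universe via load_new(), so at most one coordinate file is open at any time, in the same spirit as the XTC-conversion snippet above.

```python
import MDAnalysis as mda


class LazyPDBFrames:
    """Hypothetical sketch of a just-in-time reader pattern (not part of MDAnalysis)."""

    def __init__(self, topology, paths):
        self.topology = topology
        self.paths = list(paths)
        # one Universe is reused; its trajectory reader is swapped per file
        self._universe = mda.Universe(topology)

    def __iter__(self):
        for path in self.paths:
            # load_new() replaces the current reader, releasing the previous
            # file before the next one is opened
            self._universe.load_new(path)
            for ts in self._universe.trajectory:
                # copy positions if they need to outlive this iteration
                yield ts


# hypothetical usage:
# for ts in LazyPDBFrames(pdb_files[0], pdb_files):
#     ...
```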
Expected behavior
Hello,
I am using MDAnalysis to compute PCA on a big set of PDB files.
Actual behavior
When I have more than roughly 4000 structures, I get this error message:
Code to reproduce the behavior
To reproduce the behavior, I just need a list of 4000+ PDB files (a ColabFold prediction with 1000+ seeds).
It works with up to 4000 structures, but a few more gives the previous error.
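For completeness, a minimal sketch of the kind of call that triggers the failure (the glob pattern and directory are hypothetical; in the discussion above the error shows up as OSError: [Errno 24] Too many open files):

```python
import glob

import MDAnalysis as mda

# hypothetical directory of ColabFold predictions
pdb_files = sorted(glob.glob("predictions/*.pdb"))

# with roughly 4000+ files this fails while building the ChainReader
u = mda.Universe(pdb_files[0], pdb_files)
```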
Current version of MDAnalysis
2.7.0
python -V
Python 3.10.13
Ubuntu 22.0