File descriptors that are opened and closed as needed for Universes #239

Open · dotsdl opened this issue Apr 6, 2015 · 21 comments
@dotsdl (Member) commented Apr 6, 2015

In an effort to make MDAnalysis scale well, it would be useful to have file descriptors opened and closed only as needed by a Universe. This would make it possible to have hundreds of Universes present in a single Python session without triggering an IOError: [Errno 24] Too many open files exception.

For trajectories this would involve wrapping all methods that directly access trajectory details with a decorator that opens the file, seeks to the frame, and closes the file on return. This wouldn't incur much of a performance hit if we have readers iterate by chunking (Issue #238).

This issue especially affects the ChainReader, since any number of trajectory files can be loaded, and each keeps an open file descriptor for as long as the Universe is present.

@richardjgowers (Member):

So what's the limit on open file handles in Python?

I guess an easy way to do this would be to keep a record of the file position at the end of each operation, then seek there the next time a file operation happens, i.e.:

with open(self.filename, 'r') as f:
    f.seek(self._lastpos)
    # do file stuff
    self._lastpos = f.tell()

I'd just want to make sure that there's no performance hit for this, and if there is, make this behaviour optional.

I'm not familiar with the ChainReader, but it shouldn't really require all the files to be open at once anyway?

@orbeckst (Member) commented Apr 7, 2015

I suppose one could specifically rewrite the ChainReader to open and close underlying trajectory chunks on demand, but I think all of this will require a proper rethinking of the way trajectory objects are currently handled. The current pattern is to open the file descriptor in __init__() and close it in close(). This, of course, also has the disadvantage that Readers cannot be serialized (Issue #173).

@richardjgowers (Member):

If we made all file access go through context managers, it would solve a lot of #173; __setstate__ would then just load the last-accessed frame and everything would be as it was.

A bigger-hammer job would be to have Readers keep track of file positions between frames and build an array of offsets for each frame, i.e. #208 but format-independent and built as frames are accessed. This would make reading the nth frame of a large file much quicker the second time around, even in a new Python session.
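
For illustration, a minimal sketch of that lazy offset-building idea on a toy one-frame-per-line format (OffsetTrackingReader, _offsets and read_frame are hypothetical names, not the MDAnalysis API):

class OffsetTrackingReader:
    """Toy reader for a text 'trajectory' with one frame per line.

    Byte offsets of frame starts are recorded lazily as frames are
    read, so re-reading frame n later is a single seek.
    """

    def __init__(self, filename):
        self.filename = filename
        self._offsets = [0]  # frame i starts at self._offsets[i]

    def read_frame(self, n):
        with open(self.filename, 'rb') as f:
            # Resume from the furthest frame start we already know.
            start = min(n, len(self._offsets) - 1)
            f.seek(self._offsets[start])
            for i in range(start, n):
                f.readline()  # skip frame i
                if i + 1 == len(self._offsets):
                    self._offsets.append(f.tell())  # record frame i+1
            return f.readline().rstrip(b'\n')

Since no open handle is part of the object's state, pickling such a Reader would carry the accumulated offsets along for free.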

@richardjgowers richardjgowers self-assigned this Jun 14, 2015
@richardjgowers richardjgowers added this to the 0.11 milestone Jun 14, 2015
richardjgowers added a commit that referenced this issue Jun 14, 2015
Uses decorator to handle file handles

Addresses Issue #239
@richardjgowers (Member):

So I've had a play with this, and it looks very possible: I've got Readers to be picklable!

In [1]: import MDAnalysis as mda

In [2]: import pickle as pkl

In [3]: u = mda.Universe('HISTORY_minimal', format='HISTORY')

In [4]: pkl.dump(u.trajectory, open('hist.pkl', 'wb'))

In [5]: t2 = pkl.load(open('hist.pkl', 'rb'))

In [6]: t2
Out[6]: < HistoryReader 'HISTORY_minimal' with 3 frames of 3 atoms (0 fixed) >

In [7]: len(t2)
Out[7]: 3

The current problem is that __iter__ will open/close the file n times to read n frames, which isn't going to perform well... but that should be easy to fix.

@orbeckst (Member):

Looks like progress, and picklable Readers would be of interest for all the other issues surrounding persistence.
For __iter__ we might be able to wrap the whole loop that contains the yield in a

with self.open():
    for ts in self:
        yield ts

and have _read_next_timestep() (or the equivalent) check whether the file is already open, and only open (and later close) it if it is not. There's obviously some overhead involved, but it is probably tiny compared to everything else that's going on.
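
A minimal sketch of that check, assuming a _file attribute holds the handle (ReaderSketch and _open_if_needed are illustrative names, not the actual Reader internals):

from contextlib import contextmanager

class ReaderSketch:
    def __init__(self, filename):
        self.filename = filename
        self._file = None  # no handle is kept between reads

    @contextmanager
    def _open_if_needed(self):
        # Open only if nothing is open yet; close only what we opened.
        opened_here = self._file is None
        if opened_here:
            self._file = open(self.filename, 'rb')
        try:
            yield self._file
        finally:
            if opened_here:
                self._file.close()
                self._file = None

    def _read_next_timestep(self):
        with self._open_if_needed() as f:
            return f.readline()  # stand-in for real frame parsing

An outer loop that already holds _open_if_needed() then pays only an is-None check per frame.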

I assume that your changes heavily rely on the fact that you can do random access to frames? If this is the case then we should

  1. Generate the frame offset indices for all trajectories that do not support out-of-the-box random access.
  2. Think about how to handle streams (which probably includes compressed files). We might need a workaround that keeps the old functionality in place for streams.

@richardjgowers (Member):

The way it currently works is that the next() method (in the base class) is wrapped in a decorator which opens the file at the previous position, calls _read_next_timestep() (in the subclass of base), stores the current file handle position, closes the file, and returns the ts. I just need to check this approach doesn't kill performance!
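
A rough sketch of what such a decorator could look like (a guess at the shape, not the actual commit; _lastpos, _file and filename are assumed attribute names):

import functools

def with_file_at_last_position(method):
    """Open the file at the stored offset, run the wrapped method,
    store the new offset, then close the file again."""
    @functools.wraps(method)
    def wrapper(self, *args, **kwargs):
        self._file = open(self.filename, 'rb')
        try:
            self._file.seek(getattr(self, '_lastpos', 0))
            result = method(self, *args, **kwargs)  # e.g. _read_next_timestep
            self._lastpos = self._file.tell()
            return result
        finally:
            self._file.close()
            self._file = None
    return wrapper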

Streams and compressed files are indeed headaches to solve.

@richardjgowers (Member):

0414ca1#diff-99815fce73651366394eb2dd8fe2f872R1211

So all the magic happens in the base class; subclasses just write their _read_next assuming they have a file handle positioned where they last left it.

richardjgowers added a commit that referenced this issue Jun 24, 2015
Uses decorator to handle file handles

Addresses Issue #239

Iterating DLPoly reader only opens a single file handle

Migrated TRZReader to MultiframeReader

Still TODO: Test on 2gb+ file (where seeking involves long ints)

XYZReader migrated to new style
@orbeckst (Member):

What's the status on this one? Do we want to postpone it to a later milestone?

It's not one of the big API-breaking ones, is it?

@richardjgowers (Member):

It won't be ready for a while, but I think it is doable. It shouldn't break anything when it does arrive though, so we can leave it out of 0.11.

It will actually dovetail very nicely with #314, because this requires knowledge of offsets to make it work cleanly.

@orbeckst orbeckst modified the milestones: 0.12, 0.11 Jul 20, 2015
@orbeckst (Member):

Ok, postponed.

@orbeckst (Member):

Possibly move to 0.14?

@richardjgowers richardjgowers removed this from the 0.13 milestone Dec 27, 2015
@kain88-de (Member):

It has come up in #850 that opening and closing file descriptors for every read is a problem when running several (tens to thousands of) processes on an HPC filesystem like GPFS. Rapid opening and closing of files on these filesystems has a negative impact on their performance.

Seems like we need a tradeoff: the number of open Universe objects vs. running a lot of MDAnalysis processes in parallel. My use case is that I usually have only 2-3 open Universes per process, but a lot of processes during analysis.

@richardjgowers (Member):

So are we (@kain88-de) sure that opening files a lot is a bad idea? If so, we can abandon this idea.

@dotsdl (Member, Author) commented Jun 6, 2016

Can we keep the file descriptor open when using the trajectory as an iterator? That is, doing:

for ts in u.trajectory:
    # do stuff, and keep file descriptor open
    # using trajectory as iterator doesn't have to re-seek, open/close fds
    pass

vs. doing

for i in range(1000):
    # trajectory indexing opens file, seeks to frame, reads, then closes file descriptor
    u.trajectory[i]

Is this a nice compromise? Having a file descriptor open for each trajectory is especially a problem when working with the ChainReader and many Universe objects, which is common for me when using MDSynthesis.

@mnmelo (Member) commented Jun 6, 2016

Maybe this warrants a Reader kwarg, such as keep_handle=False or similar?
Is there a way to implement this centrally in ProtoReader? (Otherwise it'll be a hassle to have this for every Reader...)
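
One hypothetical way this could look centrally in a ProtoReader-like base class (keep_handle and _acquire are invented names for this sketch, not existing API):

class ProtoReaderSketch:
    def __init__(self, filename, keep_handle=False):
        self.filename = filename
        self.keep_handle = keep_handle
        # With keep_handle=True we behave as today: one long-lived fd.
        self._file = open(filename, 'rb') if keep_handle else None

    def _acquire(self):
        # Returns (handle, must_close): a transient handle is used only
        # when no persistent one is being kept.
        if self._file is not None:
            return self._file, False
        return open(self.filename, 'rb'), True

    def next(self):
        f, transient = self._acquire()
        try:
            return self._read_next_timestep(f)  # format-level hook
        finally:
            if transient:
                f.close()

    def _read_next_timestep(self, f):
        raise NotImplementedError  # implemented by each format Reader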

@richardjgowers (Member):

Ah, maybe what @dotsdl suggests is actually possible... I forgot that you might want a crazy number of Universes.

@mnmelo we should be able to do this at the Reader level by making sure that a handle called e.g. _file is open when it passes responsibility to the format-level implementation.

@dotsdl (Member, Author) commented Jun 6, 2016

@richardjgowers is this as simple as just making the __getitem__ behavior for a Reader different from its __iter__ behavior? I don't dig around the Readers often, so I don't know if this can be abstracted out in ProtoReader or if it needs to be part of our spec for individual Readers.

@richardjgowers (Member):

@dotsdl off the top of my head... I suspect each Reader implementation uses a different name for its file handle, so that will need squashing out. DCD and other C Readers might expect a name which is not obvious from the Python side. I think every "front door" to the base.Reader will require a with statement to open a handle before continuing on to the implementation; these are next, __iter__ and __getitem__ AFAIK.

@kain88-de (Member):

@dotsdl your suggestion would be possible to implement. It actually sounds like a nice compromise, if it also works for sliced iterators. But since you and I each reach different filesystem limits, and others might reach them too, we should definitely document this behavior, maybe in a doc section "MDAnalysis on HPC clusters: do's and don'ts".

@richardjgowers I don't think ProtoReader needs to know the file handle names; _reopen and the seek functions should be enough. We need to make sure that ProtoReader behaves correctly and that all Readers inheriting from it follow it exactly and don't overwrite its functions.

@dotsdl (Member, Author) commented Jun 6, 2016

@kain88-de yeah, and in addition to responding to slicing in this way, we should make it so that this also works for fancy indexing. Basically, anything fed into __getitem__ that isn't an integer should return a generator; using this generator will only close the fd after it has finished being iterated over.
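
A minimal sketch of that dispatch, again on a toy one-frame-per-line format (GetitemSketch and _frame_at are made-up names):

import numbers

class GetitemSketch:
    def __init__(self, filename, n_frames):
        self.filename = filename
        self.n_frames = n_frames

    def __getitem__(self, item):
        if isinstance(item, numbers.Integral):
            # Single random access: open, seek/read, close immediately.
            with open(self.filename, 'rb') as f:
                return self._frame_at(f, item)
        if isinstance(item, slice):
            item = range(*item.indices(self.n_frames))
        # Slices and fancy indices return a generator that holds one
        # fd open until it is exhausted (or garbage collected).
        return self._frame_generator(item)

    def _frame_generator(self, indices):
        with open(self.filename, 'rb') as f:
            for i in indices:
                yield self._frame_at(f, i)

    @staticmethod
    def _frame_at(f, i):
        f.seek(0)  # naive: rescan from the start for each frame
        for _ in range(i):
            f.readline()
        return f.readline()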

@richardjgowers (Member):

You'll need to make sure that the "front doors" are only entered once too; for example, currently __getitem__ can call __iter__, which calls next. So there needs to be a clear flow where only a single file handle is opened, which probably just means creating some private iterator functions that __getitem__ calls.

richardjgowers added a commit that referenced this issue Aug 3, 2016
richardjgowers added a commit that referenced this issue Aug 4, 2016
richardjgowers added a commit that referenced this issue Dec 1, 2016
richardjgowers added a commit that referenced this issue Dec 1, 2016
Added NewReader which doesn't maintain an open filehandle

XYZReader changed to use NewReader
@richardjgowers richardjgowers removed their assignment Sep 25, 2018
@orbeckst orbeckst added the Format-ChainReader ChainReader virtual trajectory reader label Aug 25, 2019