File descriptors that are opened and closed as needed for Universes #239

Open · dotsdl opened this issue Apr 6, 2015 · 21 comments
@dotsdl (Member) commented Apr 6, 2015

In an effort to make MDAnalysis scale well, it would be useful to have file descriptors opened and closed only as needed by a Universe. This would make it possible to have hundreds of Universes present in a single Python session without triggering an IOError: [Errno 24] Too many open files exception.

For trajectories this would involve wrapping all methods that directly access trajectory details with a decorator that opens the file, seeks to the frame, and closes the file on return. This wouldn't incur much of a performance hit if we have readers iterate by chunking (Issue #238).

This issue especially affects the ChainReader, since any number of trajectory files can be loaded, and each keeps an open file descriptor for as long as the Universe is present.

@richardjgowers (Member):

So what's the limit on open file handles in Python?

I guess an easy way to do this would be to keep a record of the file position at the end of each operation, then seek there the next time a file operation happens, i.e.:

with open(self.filename, 'r') as f:
    f.seek(self._lastpos)
    # do file stuff
    self._lastpos = f.tell()

I'd just want to make sure that there's no performance hit for this, and if there is, make this behaviour optional.

I'm not familiar with the ChainReader, but it shouldn't really require all the files to be open at once anyway?

@orbeckst (Member) commented Apr 7, 2015

I suppose one could specifically rewrite the ChainReader to open and close underlying trajectory chunks on demand, but I think all of this will require a proper rethinking of the way trajectory objects are currently handled. The current pattern is to open the file descriptor in __init__() and close it in close(). This, of course, also has the disadvantage that Readers cannot be serialized (Issue #173).

@richardjgowers (Member):

If we made all file access go through context managers, it would solve a lot of #173; __setstate__ would then just load the last-accessed frame and everything would be as it was.

A bigger-hammer job would be to have Readers keep track of file positions between frames and build an array of offsets for each frame, i.e. #208 but format-independent and built as frames are accessed. This would make reading the nth frame of a large file much quicker the second time around, even in a new Python session.
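
For illustration, a minimal sketch of that lazy offset-building idea on a toy one-frame-per-line format (OffsetTrackingReader, _offsets and read_frame are hypothetical names, not the MDAnalysis API):

class OffsetTrackingReader:
    """Toy reader for a text 'trajectory' with one frame per line.

    Byte offsets of frame starts are recorded lazily as frames are
    read, so re-reading frame n later is a single seek.
    """

    def __init__(self, filename):
        self.filename = filename
        self._offsets = [0]  # frame i starts at self._offsets[i]

    def read_frame(self, n):
        with open(self.filename, 'rb') as f:
            # Resume from the furthest frame start we already know.
            start = min(n, len(self._offsets) - 1)
            f.seek(self._offsets[start])
            for i in range(start, n):
                f.readline()  # skip frame i
                if i + 1 == len(self._offsets):
                    self._offsets.append(f.tell())  # record frame i+1
            return f.readline().rstrip(b'\n')

Since no open handle is part of the object's state, pickling such a Reader would carry the accumulated offsets along for free.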

@richardjgowers richardjgowers self-assigned this Jun 14, 2015
@richardjgowers richardjgowers added this to the 0.11 milestone Jun 14, 2015
richardjgowers added a commit that referenced this issue Jun 14, 2015
Uses decorator to handle file handles

Addresses Issue #239
@richardjgowers (Member):

So I've had a play with this, and it looks very possible: I've got Readers to be picklable!

In [1]: import MDAnalysis as mda

In [2]: import pickle as pkl

In [3]: u = mda.Universe('HISTORY_minimal', format='HISTORY')

In [4]: pkl.dump(u.trajectory, open('hist.pkl', 'wb'))

In [5]: t2 = pkl.load(open('hist.pkl', 'rb'))

In [6]: t2
Out[6]: < HistoryReader 'HISTORY_minimal' with 3 frames of 3 atoms (0 fixed) >

In [7]: len(t2)
Out[7]: 3

The current problem is that __iter__ will open/close the file n times to read n frames, which isn't going to perform well... but that should be easy to fix.

@orbeckst (Member):

Looks like progress, and picklable Readers would be of interest for all the other issues surrounding persistence.
For __iter__ we might be able to wrap the whole loop that contains the yield in a

with self.open():
    for ts in self:
        yield ts

and have _read_next_timestep() (or the equivalent) check whether the file is already open, and only open (and later close) it if it is not. There's obviously some overhead involved, but it is probably tiny compared to everything else that's going on.
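
A minimal sketch of that check, assuming a _file attribute holds the handle (ReaderSketch and _open_if_needed are illustrative names, not the actual Reader internals):

from contextlib import contextmanager

class ReaderSketch:
    def __init__(self, filename):
        self.filename = filename
        self._file = None  # no handle is kept between reads

    @contextmanager
    def _open_if_needed(self):
        # Open only if nothing is open yet; close only what we opened.
        opened_here = self._file is None
        if opened_here:
            self._file = open(self.filename, 'rb')
        try:
            yield self._file
        finally:
            if opened_here:
                self._file.close()
                self._file = None

    def _read_next_timestep(self):
        with self._open_if_needed() as f:
            return f.readline()  # stand-in for real frame parsing

An outer loop that already holds _open_if_needed() then pays only an is-None check per frame.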

I assume that your changes heavily rely on the fact that you can do random access to frames? If this is the case then we should

  1. Generate the frame offset indices for all trajectories that do not support out-of-the-box random access.
  2. Think about how to handle streams (which probably includes compressed files). We might need a workaround that keeps the old functionality in place for streams.

@richardjgowers (Member):

The way it currently works is that the next() method (in the base class) is wrapped in a decorator which opens the file at the previous position, calls _read_next_timestep() (in the subclass of base), stores the current file handle position, closes the file, and returns the ts. I just need to check this approach doesn't kill performance!
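
A rough sketch of what such a decorator could look like (a guess at the shape, not the actual commit; _lastpos, _file and filename are assumed attribute names):

import functools

def with_file_at_last_position(method):
    """Open the file at the stored offset, run the wrapped method,
    store the new offset, then close the file again."""
    @functools.wraps(method)
    def wrapper(self, *args, **kwargs):
        self._file = open(self.filename, 'rb')
        try:
            self._file.seek(getattr(self, '_lastpos', 0))
            result = method(self, *args, **kwargs)  # e.g. _read_next_timestep
            self._lastpos = self._file.tell()
            return result
        finally:
            self._file.close()
            self._file = None
    return wrapper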

Streams and compressed files are indeed headaches to solve.

@richardjgowers (Member):

0414ca1#diff-99815fce73651366394eb2dd8fe2f872R1211

So all the magic happens in the base class; subclasses just write their _read_next assuming they have a file handle positioned where they last left it.

richardjgowers added a commit that referenced this issue Jun 24, 2015
Uses decorator to handle file handles

Addresses Issue #239

Iterating DLPoly reader only opens a single file handle

Migrated TRZReader to MultiframeReader

Still TODO: Test on 2gb+ file (where seeking involves long ints)

XYZReader migrated to new style
@orbeckst (Member):

What's the status on this one? Do we want to postpone it to a later milestone?

It's not one of the big API-breaking ones, is it?

@richardjgowers (Member):

It won't be ready for a while, but I think it is doable. It shouldn't break anything when it does arrive though, so we can leave it out of 0.11.

It will actually dovetail very nicely with #314, because this requires knowledge of offsets to make it work cleanly.

@orbeckst orbeckst modified the milestones: 0.12, 0.11 Jul 20, 2015
@orbeckst (Member):

Ok, postponed.

@orbeckst (Member):

Possibly move to 0.14?

@richardjgowers richardjgowers removed this from the 0.13 milestone Dec 27, 2015
@kain88-de (Member):

It has come up in #850 that opening and closing file descriptors for every read is a problem when running several (tens to thousands of) processes on an HPC filesystem like GPFS. Rapid opening and closing of files on these filesystems has a negative impact on their performance.

Seems like we need a tradeoff: the number of open Universe objects vs. running a lot of MDAnalysis processes in parallel. My use case is that I usually have only 2-3 open Universes per process, but a lot of processes during analysis.

@richardjgowers (Member):

So are we (@kain88-de) sure that opening files a lot is a bad idea? If so, we can abandon this idea.

@dotsdl (Member, Author) commented Jun 6, 2016

Can we keep the file descriptor open when using the trajectory as an iterator? That is, doing:

for ts in u.trajectory:
    # do stuff, and keep file descriptor open
    # using trajectory as iterator doesn't have to re-seek, open/close fds
    pass

vs. doing

for i in range(1000):
    # trajectory indexing opens file, seeks to frame, reads, then closes file descriptor
    u.trajectory[i]

Is this a nice compromise? Having a file descriptor open for each trajectory is especially a problem when working with the ChainReader and many Universe objects, which is common for me when using MDSynthesis.

@mnmelo (Member) commented Jun 6, 2016

Maybe this warrants a Reader kwarg, such as keep_handle=False or similar?
Is there a way to implement this centrally in ProtoReader? (Otherwise it'll be a hassle to have this for every Reader...)
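
One hypothetical way this could look centrally in a ProtoReader-like base class (keep_handle and _acquire are invented names for this sketch, not existing API):

class ProtoReaderSketch:
    def __init__(self, filename, keep_handle=False):
        self.filename = filename
        self.keep_handle = keep_handle
        # With keep_handle=True we behave as today: one long-lived fd.
        self._file = open(filename, 'rb') if keep_handle else None

    def _acquire(self):
        # Returns (handle, must_close): a transient handle is used only
        # when no persistent one is being kept.
        if self._file is not None:
            return self._file, False
        return open(self.filename, 'rb'), True

    def next(self):
        f, transient = self._acquire()
        try:
            return self._read_next_timestep(f)  # format-level hook
        finally:
            if transient:
                f.close()

    def _read_next_timestep(self, f):
        raise NotImplementedError  # implemented by each format Reader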

@richardjgowers (Member):

Ah, maybe what @dotsdl suggests is actually possible... I forgot that you might want a crazy number of Universes.

@mnmelo we should be able to do this at the Reader level by making sure that a handle called e.g. _file is open when it passes responsibility to the format-level implementation.

@dotsdl (Member, Author) commented Jun 6, 2016

@richardjgowers is this as simple as just making the __getitem__ behavior for a Reader different from its __iter__ behavior? I don't dig around the Readers often, so I don't know if this can be abstracted out in ProtoReader or if it needs to be part of our spec for individual Readers.

@richardjgowers (Member):

@dotsdl off the top of my head... I suspect each Reader implementation uses a different name for its file handle, so that will need squashing out. DCD and other C Readers might expect a name which is not obvious from the Python side. I think every "front door" to the base.Reader will require a with statement to open a handle before continuing on to the implementation; these are next, __iter__ and __getitem__ AFAIK.

@kain88-de (Member):

@dotsdl your suggestion would be possible to implement. It actually sounds like a nice compromise, if it also works for sliced iterators. But since you and I each reach different filesystem limits, and others might reach them too, we should definitely document this behavior, maybe in a doc section "MDAnalysis on HPC clusters: do's and don'ts".

@richardjgowers I don't think ProtoReader needs to know the file handle names; _reopen and the seek functions should be enough. We need to make sure that ProtoReader behaves correctly and that all Readers inheriting from it follow it exactly and don't overwrite its functions.

@dotsdl (Member, Author) commented Jun 6, 2016

@kain88-de yeah, and in addition to responding to slicing in this way, we should make it so that this also works for fancy indexing. Basically, anything fed into __getitem__ that isn't an integer should return a generator; using this generator will only close the fd after it has finished being iterated over.
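
A minimal sketch of that dispatch, again on a toy one-frame-per-line format (GetitemSketch and _frame_at are made-up names):

import numbers

class GetitemSketch:
    def __init__(self, filename, n_frames):
        self.filename = filename
        self.n_frames = n_frames

    def __getitem__(self, item):
        if isinstance(item, numbers.Integral):
            # Single random access: open, seek/read, close immediately.
            with open(self.filename, 'rb') as f:
                return self._frame_at(f, item)
        if isinstance(item, slice):
            item = range(*item.indices(self.n_frames))
        # Slices and fancy indices return a generator that holds one
        # fd open until it is exhausted (or garbage collected).
        return self._frame_generator(item)

    def _frame_generator(self, indices):
        with open(self.filename, 'rb') as f:
            for i in indices:
                yield self._frame_at(f, i)

    @staticmethod
    def _frame_at(f, i):
        f.seek(0)  # naive: rescan from the start for each frame
        for _ in range(i):
            f.readline()
        return f.readline()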

@richardjgowers (Member):

You'll need to make sure that the "front doors" are only entered once too; for example, currently __getitem__ can call __iter__, which calls next. So there needs to be a clear flow where only a single file handle is opened, which probably just means creating some private iterator functions that __getitem__ calls.

richardjgowers added a commit that referenced this issue Aug 3, 2016
richardjgowers added a commit that referenced this issue Aug 4, 2016
richardjgowers added a commit that referenced this issue Dec 1, 2016
richardjgowers added a commit that referenced this issue Dec 1, 2016
Added NewReader which doesn't maintain an open filehandle

XYZReader changed to use NewReader
@richardjgowers richardjgowers removed their assignment Sep 25, 2018
@orbeckst orbeckst added the Format-ChainReader ChainReader virtual trajectory reader label Aug 25, 2019