
Attempt to simplify filecache #48

Merged: 3 commits merged from simplify-filecache into master on May 15, 2020

Conversation

@takluyver (Member) commented Apr 8, 2020

Hi @egorsobolev . I was concerned that the file cache was getting complicated to the point where I was struggling to be sure that it would do the right thing, so I've spent a while trying to work out an alternative model for it. I think this is somewhat clearer, but it's easy to think that when I've just written it, so I'd welcome a second view.

The idea is to split the functionality of FileCache into two pieces:

  1. file_access_registry tracks a single FileAccess object (open or closed) for each path. It uses weakrefs, so it should only contain objects that are referenced elsewhere. Instantiating FileAccess uses an existing instance from the registry where possible.
  2. OpenFilesLimiter maintains a queue of paths corresponding to possibly open files. When it hits the limit, it looks up the oldest path in the registry and 'closes' its FileAccess object. This is a 'weak close', which will only close the file once there are no open objects from it. It should recover OK after FileAccess objects are torn down, so we can do without __del__ methods to track that.

I've moved things around and renamed things a bit, but if you want to see a diff of the functional changes, look at the first commit.
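
As a rough illustration only (not the actual implementation in this PR: the names file_access_registry and OpenFilesLimiter come from the description above, but the method names, the maxfiles default and other details below are assumptions):

import weakref
from collections import OrderedDict

import h5py

# path -> FileAccess (open or closed); weakrefs mean an entry vanishes once
# nothing else references the FileAccess object
file_access_registry = weakref.WeakValueDictionary()

class FileAccess:
    def __new__(cls, path):
        # Reuse an existing instance from the registry where possible
        inst = file_access_registry.get(path)
        if inst is None:
            inst = super().__new__(cls)
            inst.path = path
            inst._file = None
            file_access_registry[path] = inst
        return inst

    @property
    def file(self):
        if self._file is None:
            open_files_limiter.touch(self.path)
            self._file = h5py.File(self.path, 'r')
        return self._file

    def close(self):
        # 'Weak close': drop our handle; HDF5 keeps the file open as long as
        # datasets from it are still referenced elsewhere
        self._file = None

class OpenFilesLimiter:
    def __init__(self, maxfiles=128):
        self.maxfiles = maxfiles
        self._paths = OrderedDict()  # queue of possibly-open paths, oldest first

    def touch(self, path):
        self._paths.pop(path, None)
        self._paths[path] = True
        while len(self._paths) > self.maxfiles:
            oldest, _ = self._paths.popitem(last=False)
            fa = file_access_registry.get(oldest)
            if fa is not None:
                fa.close()  # weak close of the least recently used file

open_files_limiter = OpenFilesLimiter()

With this shape, FileAccess('/data/example.h5') returns the same object for repeated calls as long as something still references it, and the limiter never needs to know when FileAccess objects are garbage collected.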

@takluyver (Member, Author)

I think there are two major differences in our approaches:

  1. 'Weak closing' - dropping the reference to a file but allowing it to remain open if there are still open objects belonging to it (sketched below). This should avoid any problems where a dataset object becomes invalid, but it's not guaranteed to release file descriptors. As the limit we're trying to maintain is 50% of the process limit, I think it's preferable to let e.g. datasets cached in an iterator keep files open over our own limit.
  2. I'm keen to keep anything handling object destruction - __del__ methods or weakrefs - as self-contained as possible, because my experience is that they make code much harder to reason about.
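
To illustrate point 1, a minimal usage sketch (the file path and dataset name are hypothetical), assuming FileAccess.close() simply drops its h5py.File handle as in the sketch above, so HDF5's own reference counting decides when the descriptor is really released:

fa = FileAccess('/data/example.h5')  # hypothetical path
ds = fa.file['some/dataset']         # hypothetical dataset name

fa.close()        # weak close: FileAccess forgets its h5py.File object
values = ds[:10]  # still works - the live dataset keeps the HDF5 file open
del ds            # once the last object from the file is gone, the file really closes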

@egorsobolev (Contributor) commented Apr 9, 2020

I'm lost. Could you explain?

If I remember correctly, I had a kind of soft closing of files in the original version, and it was you who asked me to close files immediately after exiting a with block. I also decided then that it would be strange to have a hard closing of files after a with block but not after DataCache instance destruction. That results in the __del__ method that you dislike so much, and in the use of weakref, which is now very concise and clear but still, from my point of view, unnecessary.

Do you remember that I was strongly opposed and you strongly insisted? And now you are presenting weak closing as a feature. This feature results in completely uncontrollable closing during garbage collection, which may be delayed, as we saw.

So which of your beliefs is correct? How could you push me in one direction and then turn the opposite way yourself?

@takluyver (Member, Author)

I'm sorry about that. What I was objecting to was a cache which could keep a file open even when nothing else had a reference to the file or to any object belonging to the file. So you would have to interact with the cache to ensure a file was closed.

What I'm now proposing doesn't do that (or if it does, it's a bug). What I mean by 'weak closing' is that a reference to a dataset from a file can keep that file open, even if it's closed from the perspective of the FileAccess object. But if there are no references to any objects from the file, it will be closed straight away.

I don't think I've stated this clearly before, and I'm sorry for that. I hadn't really thought carefully about how closing a file works when there might be objects belonging to the file. The with functionality is also a complication; maybe we should remove that and just work with objects going out of scope to make it simpler. I don't think many people use it, based on the examples I've seen.

@egorsobolev (Contributor) commented Apr 9, 2020

You describe interaction with the file cache as a problem; I see only advantages in it. In most cases you don't need to interact with the cache at all, because you can rely on soft closing. But if you really need to close a file with a 100% guarantee, you have that option in my implementation.

The problem is that you don't have such a guarantee with weak closing in your implementation, and you cannot implement forced closing even in principle.

The only reason I added __del__ to DataCollection is that garbage collection doesn't destroy the underlying FileAccess straight away after an explicit del run in my first test. It keeps one of the 17-18 files open for no reason:

        del run, trains_iter
>       assert len(fc._cache) == 0
E       assert 1 == 0
E         -1
E         +0

This leftover file disappears completely if I add a __del__ to DataCollection which explicitly resets the reference to the h5py.File handle in FileAccess. Only that reliably triggers the weakref, so the file is popped from the file cache.

It means you can rely on weak closing only if you explicitly reset all references to the file handle. In your implementation, explicit or implicit destruction of a DataCollection implicitly resets references to FileAccess instances, which implicitly triggers the weakref, which should actually destroy the FileAccess instance, and that destruction should implicitly drop the reference to the h5py.File. Only God knows when the file descriptor will actually be closed.

I don't know how to reproduce it reliably. You can try removing the __del__ of DataCollection in my implementation and running the tests.

Even worse, file handles leak from FileAccess to other objects (datasets cached in iterators), which, in contrast to FileAccess, can have multiple instances. That makes garbage collection completely unpredictable.

If I ask you to fix these two disadvantages, your implementation will immediately morph into mine.

So what will we do?

@takluyver (Member, Author)

HDF5 keeps track of all its object references, so if we have to really ensure all HDF5 files are closed, it should be possible with something like this:

import h5py

for id_ in h5py.h5f.get_obj_ids():  # IDs of open HDF5 objects in this process
    while id_.valid:
        h5py.h5i.dec_ref(id_)       # drop references until the object is really closed

That's not very pretty, but it's all documented APIs, and it's similar to what h5py's file.close() does. So I think we can avoid keeping track of file objects to forcibly close them. That tracking also seems like it could lead to confusing bugs in two directions:

  • Forcibly closing files when there are still references to their objects in use elsewhere in the code puts that code into an unexpected state (the kind of bug we've already seen in the tests).
  • If any file object falls through a gap in the tracking, and gets automatically weak-closed, the 'force close all' function can miss it.

I'm not confident that I'm smart enough to avoid introducing bugs like these as we continue to develop EXtra-data. Maybe you are! But I'm obviously against a design that reduces my ability to work on this code, and I'd like to avoid other contributors having to think about it.

The remaining open file you mention looks like a garbage collection delay. I saw something similar while experimenting, and calling gc.collect() before the assertion fixed it. I agree that's not ideal, but it's a limitation of the Python runtime in general, not just of our code, and not something I think we should be trying to solve.
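
For example, adapting the failing test quoted earlier (a sketch only; fc._cache is from the quoted test, not from this PR):

import gc

del run, trains_iter
gc.collect()                # collect reference cycles that may keep a FileAccess alive
assert len(fc._cache) == 0  # the lingering entry is gone once collection has run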

@egorsobolev (Contributor) commented Apr 21, 2020

These bugs cannot emerge at all if you follow two requirements:

  • always get the reference to the file via cache interface
  • always check the file state before reading

That is the central idea (the principles) of my design:

  • there is no possibility for any file object to 'fall through a gap in the tracking', because all open and close operations pass through a single object.
  • if the file cache closes any file, that is also a message to any place in the code that the file reference has to be re-requested from the cache.

In the latest implementation the first requirement is actually even weaker, due to the use of weakref: the file will still be tracked even if someone just copies the file reference instead of requesting it from the cache.

I did not think these requirements would be difficult for anyone in the future, especially considering that all the read operations already seem to be implemented in the code.

But if you think this may cause problems in the future, I would suggest writing wrappers for the reading operations instead of spoiling the cache implementation.
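
A wrapper following those two requirements might look roughly like this (get(), is_open() and reopen() are made-up names for illustration, not the actual FileCache API):

def read_dataset(file_cache, path, dataset_name, region=slice(None)):
    fa = file_cache.get(path)          # requirement 1: always go through the cache
    if not fa.is_open():               # requirement 2: check the file state before reading
        fa = file_cache.reopen(path)   # re-request the reference if the cache closed it
    return fa.file[dataset_name][region]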

The bug that we saw happened in quite a nonstandard situation: a combination of an exception and the immediate replacement of the cache instance. Garbage collection after the exception happened, by chance, to interrupt the code between the file state check and the read operation, and closed the file in another cache instance. That is an internal issue of the FileCache implementation, which was solved.

@philsmt (Contributor) commented May 5, 2020

Hi @egorsobolev and @takluyver,
I've been trying to wrap my head around your solutions and discussion. I'll reply here both to this PR as well as #46.

You both make valid points, and it appears to me we're sitting between two very different concepts: being really smart and explicit about handling resources in Egor's approach, and keeping it implicit and flexible in Thomas' version. As I wasn't here for the earlier PR introducing the file caching layer, I will pretend it didn't happen.

I understand the problem arising from hitting the (hard) file limit. However, I'd like to question the cost of trying to keep detailed track of these resources, both from a maintenance perspective for us and a usability perspective for users. Aggressively closing files we deem no longer in use can break things in all kinds of places. Using FileAccess or even h5py.Dataset directly may not be part of our public API, but restricting it should still come at a reasonable trade-off. And Python doesn't really lend itself well to this kind of behaviour, as you've both discussed with regard to the quirks of __del__. A general solution can be very tricky, and it's dangerous to assume all corners are covered. So it comes down to whether the problem we're tackling is worth it. To me, this file limit doesn't seem urgent enough to justify these expenses. The majority of users are at a novice to intermediate level and are prone to test the limits of our intended interface; from their point of view, it can be hard to grasp why a given Dataset is suddenly no longer usable. Please note that I don't have hard data on this, so correct me if I'm wrong.

That being said, we want to provide solutions for cutting-edge experiments, which may well involve (several) long runs with large-scale detectors spanning lots and lots of files. Analyzing data at this scale requires a greater degree of expertise, and we may well expect it from those users. One approach to tackle this, which is actually advertised by EXtra-data, is dask. While I'm not an expert, I suspect both your solutions may fail for a very large DataCollection - depending on its exact configuration, dask may attempt to open too many files with the flexible solution, or access Datasets we invalidated by force-closing their underlying file to stay under said limit. Please correct me if I'm wrong here (again), but read on for the actual point I'm trying to make.

I would like to propose a compromise: keep it implicit and flexible (what made Python great) in the majority of cases, but provide an advanced solution for power users, who actually have the ability to make use of it. There are several ways to realize this. In the "cutting edge scenario" I've envisioned above, most of the non-dask methods of obtaining data would most likely fail anyway for memory reasons. Lazy computation via dask is the only API we provide in this case, so we can potentially focus our solutions there. By implementing our own array_like object instead of passing h5py's, we could finely tune file access by dask, either implicitly depending on the size of the DataCollection and/or via "power user" flags. If you think it should be provided across other methods, a solution like Egor's could be made available via some form of "strict mode" in EXtra-data with all its known trade-offs - but not as the default for every situation.
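
One possible shape for such an array_like object, as a sketch only (the class name and the strategy of reopening the file on every read are assumptions, not an agreed design):

import h5py
import numpy as np

class DatasetRef:
    """Array-like stand-in for an h5py.Dataset that dask could hold safely.

    Only the file path and dataset name are stored, so no file handle stays
    open between reads; each access reopens the file (or goes via the cache).
    """
    def __init__(self, path, name, shape, dtype):
        self.path, self.name = path, name
        self.shape, self.dtype, self.ndim = shape, np.dtype(dtype), len(shape)

    def __getitem__(self, item):
        with h5py.File(self.path, 'r') as f:
            return np.asarray(f[self.name][item])

# dask.array.from_array(DatasetRef(...), chunks=...) would then read lazily.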

To actually comment on some code in the end: I think this PR is a good starting point compared to master for the general version, with __new__ nicely limiting FileAccess to unique instances and trusting h5py's weak closing machinery. As a matter of fact, I'd probably move OpenFilesLimiter to the class level of FileAccess to emphasize its private but global nature. On the other hand, the proposed array_like interface may make use of it as well.

@takluyver (Member, Author)

Thanks Philipp. I've had a bit of a look at the code of dask.array, and I think you're right that neither approach will work for creating a Dask array from a sufficiently large number of files. Dask stores the dataset object we pass in, so either we break it under Dask (strong closing), or we run into the file limit (weak closing).

I'll open a new issue about this - as you suggest, we'll probably need to provide our own array-like object to it. Then I'll merge this PR (as the 'implicit & flexible' option). This isn't to say we're finished with this topic, but to move on and continue from there.

@takluyver (Member, Author)

#57 is the issue for Dask arrays with many files.

@takluyver takluyver merged commit 06047b2 into master May 15, 2020
@takluyver takluyver deleted the simplify-filecache branch May 15, 2020 11:16
@takluyver takluyver added this to the 1.2 milestone Jun 3, 2020