
Attempt to simplify filecache #48

Merged: 3 commits merged from simplify-filecache into master on May 15, 2020

Conversation

@takluyver (Member) commented Apr 8, 2020

Hi @egorsobolev . I was concerned that the file cache was getting complicated to the point where I was struggling to be sure that it would do the right thing, so I've spent a while trying to work out an alternative model for it. I think this is somewhat clearer, but it's easy to think that when I've just written it, so I'd welcome a second view.

The idea is to split the functionality of FileCache into two pieces:

  1. file_access_registry tracks a single FileAccess object (open or closed) for each path. It uses weakrefs, so it should only contain objects that are referenced elsewhere. Instantiating FileAccess uses an existing instance from the registry where possible.
  2. OpenFilesLimiter maintains a queue of paths corresponding to possibly open files. When it hits the limit, it looks up the oldest path in the registry and 'closes' its FileAccess object. This is a 'weak close', which will only close the file once there are no open objects from it. It should recover OK after FileAccess objects are torn down, so we can do without __del__ methods to track that.

I've moved things around and renamed things a bit, but if you want to see a diff of the functional changes, look at the first commit.
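
As a rough illustration only (not the actual implementation in this PR: the names file_access_registry and OpenFilesLimiter come from the description above, but the method names, the maxfiles default and other details below are assumptions):

import weakref
from collections import OrderedDict

import h5py

# path -> FileAccess (open or closed); weakrefs mean an entry vanishes once
# nothing else references the FileAccess object
file_access_registry = weakref.WeakValueDictionary()

class FileAccess:
    def __new__(cls, path):
        # Reuse an existing instance from the registry where possible
        inst = file_access_registry.get(path)
        if inst is None:
            inst = super().__new__(cls)
            inst.path = path
            inst._file = None
            file_access_registry[path] = inst
        return inst

    @property
    def file(self):
        if self._file is None:
            open_files_limiter.touch(self.path)
            self._file = h5py.File(self.path, 'r')
        return self._file

    def close(self):
        # 'Weak close': drop our handle; HDF5 keeps the file open as long as
        # datasets from it are still referenced elsewhere
        self._file = None

class OpenFilesLimiter:
    def __init__(self, maxfiles=128):
        self.maxfiles = maxfiles
        self._paths = OrderedDict()  # queue of possibly-open paths, oldest first

    def touch(self, path):
        self._paths.pop(path, None)
        self._paths[path] = True
        while len(self._paths) > self.maxfiles:
            oldest, _ = self._paths.popitem(last=False)
            fa = file_access_registry.get(oldest)
            if fa is not None:
                fa.close()  # weak close of the least recently used file

open_files_limiter = OpenFilesLimiter()

With this shape, FileAccess('/data/example.h5') returns the same object for repeated calls as long as something still references it, and the limiter never needs to know when FileAccess objects are garbage collected.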

@takluyver (Member, Author)

I think there are two major differences in our approaches:

  1. 'Weak closing' - dropping the reference to a file but allowing it to remain open if there are still open objects belonging to it (sketched below). This should avoid any problems where a dataset object becomes invalid, but it's not guaranteed to release file descriptors. As the limit we're trying to maintain is 50% of the process limit, I think it's preferable to let e.g. datasets cached in an iterator keep files open over our own limit.
  2. I'm keen to keep anything handling object destruction - __del__ methods or weakrefs - as self-contained as possible, because my experience is that they make code much harder to reason about.
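
To illustrate point 1, a minimal usage sketch (the file path and dataset name are hypothetical), assuming FileAccess.close() simply drops its h5py.File handle as in the sketch above, so HDF5's own reference counting decides when the descriptor is really released:

fa = FileAccess('/data/example.h5')  # hypothetical path
ds = fa.file['some/dataset']         # hypothetical dataset name

fa.close()        # weak close: FileAccess forgets its h5py.File object
values = ds[:10]  # still works - the live dataset keeps the HDF5 file open
del ds            # once the last object from the file is gone, the file really closes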

@egorsobolev (Contributor) commented Apr 9, 2020

I'm lost. Could you explain?

If I remember correctly, I had a kind of soft closing of files in the original version, and it was you who asked me to close files immediately after exiting a with block. I also decided then that it would be strange to have a hard closing of files after a with block but not after DataCache instance destruction. That results in the __del__ method that you dislike so much, and in the use of weakref, which is now very concise and clear but still, from my point of view, unnecessary.

Do you remember that I was strongly opposed and you strongly insisted? And now you are presenting weak closing as a feature. This feature results in completely uncontrollable closing during garbage collection, which may be delayed, as we saw.

So which of your beliefs is correct? How could you push me in one direction and then turn the opposite way yourself?

@takluyver (Member, Author)

I'm sorry about that. What I was objecting to was a cache which could keep a file open even when nothing else had a reference to the file or to any object belonging to the file. So you would have to interact with the cache to ensure a file was closed.

What I'm now proposing doesn't do that (or if it does, it's a bug). What I mean by 'weak closing' is that a reference to a dataset from a file can keep that file open, even if it's closed from the perspective of the FileAccess object. But if there are no references to any objects from the file, it will be closed straight away.

I don't think I've stated this clearly before, and I'm sorry for that. I hadn't really thought carefully about how closing a file works when there might be objects belonging to the file. The with functionality is also a complication; maybe we should remove that and just work with objects going out of scope to make it simpler. I don't think many people use it, based on the examples I've seen.

@egorsobolev (Contributor) commented Apr 9, 2020

You describe interaction with the file cache as a problem; I see only advantages in it. In most cases you don't need to interact with the cache at all, because you can rely on soft closing. But if you really need to close a file with a 100% guarantee, you have that option in my implementation.

The problem is that you don't have such a guarantee with weak closing in your implementation, and you cannot implement forced closing even in principle.

The only reason I added __del__ to DataCollection is that garbage collection doesn't destroy the underlying FileAccess straight away after an explicit del run in my first test. It keeps one of the 17-18 files open for no reason:

        del run, trains_iter
>       assert len(fc._cache) == 0
E       assert 1 == 0
E         -1
E         +0

This leftover file disappears completely if I add a __del__ to DataCollection which explicitly resets the reference to the h5py.File handle in FileAccess. Only that reliably triggers the weakref, so the file is popped from the file cache.

It means you can rely on weak closing only if you explicitly reset all references to the file handle. In your implementation, explicit or implicit destruction of a DataCollection implicitly resets references to FileAccess instances, which implicitly triggers the weakref, which should actually destroy the FileAccess instance, and that destruction should implicitly drop the reference to the h5py.File. Only God knows when the file descriptor will actually be closed.

I don't know how to reproduce it reliably. You can try removing the __del__ of DataCollection in my implementation and running the tests.

Even worse, file handles leak from FileAccess to other objects (datasets cached in iterators), which, in contrast to FileAccess, can have multiple instances. That makes garbage collection completely unpredictable.

If I ask you to fix these two disadvantages, your implementation will immediately morph into mine.

So what will we do?

@takluyver (Member, Author)

HDF5 keeps track of all its object references, so if we have to really ensure all HDF5 files are closed, it should be possible with something like this:

import h5py

for id_ in h5py.h5f.get_obj_ids():  # IDs of open HDF5 objects in this process
    while id_.valid:
        h5py.h5i.dec_ref(id_)       # drop references until the object is really closed

That's not very pretty, but it's all documented APIs, and it's similar to what h5py's file.close() does. So I think we can avoid keeping track of file objects to forcibly close them. That tracking also seems like it could lead to confusing bugs in two directions:

  • Forcibly closing files when there are still references to their objects in use elsewhere in the code puts that code into an unexpected state (the kind of bug we've already seen in the tests).
  • If any file object falls through a gap in the tracking, and gets automatically weak-closed, the 'force close all' function can miss it.

I'm not confident that I'm smart enough to avoid introducing bugs like these as we continue to develop EXtra-data. Maybe you are! But I'm obviously against a design that reduces my ability to work on this code, and I'd like to avoid other contributors having to think about it.

The remaining open file you mention looks like a garbage collection delay. I saw something similar while experimenting, and calling gc.collect() before the assertion fixed it. I agree that's not ideal, but it's a limitation of the Python runtime in general, not just of our code, and not something I think we should be trying to solve.
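
For example, adapting the failing test quoted earlier (a sketch only; fc._cache is from the quoted test, not from this PR):

import gc

del run, trains_iter
gc.collect()                # collect reference cycles that may keep a FileAccess alive
assert len(fc._cache) == 0  # the lingering entry is gone once collection has run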

@egorsobolev (Contributor) commented Apr 21, 2020

These bugs cannot emerge at all if you follow two requirements:

  • always get the reference to the file via cache interface
  • always check the file state before reading

That is the central idea (the principles) of my design:

  • there is no possibility for any file object to 'fall through a gap in the tracking', because all open and close operations pass through a single object.
  • if the file cache closes any file, that is also a message to any place in the code that the file reference has to be re-requested from the cache.

In the latest implementation the first requirement is actually even weaker, due to the use of weakref: the file will still be tracked even if someone just copies the file reference instead of requesting it from the cache.

I did not think these requirements would be difficult for anyone in the future, especially considering that all the read operations already seem to be implemented in the code.

But if you think this may cause problems in the future, I would suggest writing wrappers for the reading operations instead of spoiling the cache implementation.
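
A wrapper following those two requirements might look roughly like this (get(), is_open() and reopen() are made-up names for illustration, not the actual FileCache API):

def read_dataset(file_cache, path, dataset_name, region=slice(None)):
    fa = file_cache.get(path)          # requirement 1: always go through the cache
    if not fa.is_open():               # requirement 2: check the file state before reading
        fa = file_cache.reopen(path)   # re-request the reference if the cache closed it
    return fa.file[dataset_name][region]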

The bug that we saw happened in quite a nonstandard situation: a combination of an exception and the immediate replacement of the cache instance. Garbage collection after the exception happened, by chance, to interrupt the code between the file state check and the read operation, and closed the file in another cache instance. That is an internal issue of the FileCache implementation, which was solved.

@philsmt (Contributor) commented May 5, 2020

Hi @egorsobolev and @takluyver,
I've been trying to wrap my head around your solutions and discussion. I'll reply here both to this PR as well as #46.

You both make valid points, and it appears to me we're sitting between two very different concepts: being really smart and explicit about handling resources in Egor's approach, and keeping it implicit and flexible in Thomas' version. As I wasn't here for the earlier PR introducing the file caching layer, I will pretend it didn't happen.

I understand the problem arising from hitting the (hard) file limit. However, I'd like to question the cost of trying to keep detailed track of these resources, both from a maintenance perspective for us and a usability perspective for users. Aggressively closing files we deem no longer in use can break things in all kinds of places. Using FileAccess or even h5py.Dataset directly may not be part of our public API, but restricting it should still come at a reasonable trade-off. And Python doesn't really lend itself well to this kind of behaviour, as you've both discussed with regard to the quirks of __del__. A general solution can be very tricky, and it's dangerous to assume all corners are covered. So it comes down to whether the problem we're tackling is worth it. To me, this file limit doesn't seem urgent enough to justify these expenses. The majority of users are at a novice to intermediate level and are prone to test the limits of our intended interface; from their point of view, it can be hard to grasp why a given Dataset is suddenly no longer usable. Please note that I don't have hard data on this, so correct me if I'm wrong.

That being said, we want to provide solutions for cutting-edge experiments, which may well involve (several) long runs with large-scale detectors spanning lots and lots of files. Analyzing data at this scale requires a greater degree of expertise, and we may well expect it from those users. One approach to tackle this, which is actually advertised by EXtra-data, is dask. While I'm not an expert, I suspect both your solutions may fail for a very large DataCollection - depending on its exact configuration, dask may attempt to open too many files with the flexible solution, or access Datasets we invalidated by force-closing their underlying file to stay under said limit. Please correct me if I'm wrong here (again), but read on for the actual point I'm trying to make.

I would like to propose a compromise: keep it implicit and flexible (what made Python great) in the majority of cases, but provide an advanced solution for power users, who actually have the ability to make use of it. There are several ways to realize this. In the "cutting edge scenario" I've envisioned above, most of the non-dask methods of obtaining data would most likely fail anyway for memory reasons. Lazy computation via dask is the only API we provide in this case, so we can potentially focus our solutions there. By implementing our own array_like object instead of passing h5py's, we could finely tune file access by dask, either implicitly depending on the size of the DataCollection and/or via "power user" flags. If you think it should be provided across other methods, a solution like Egor's could be made available via some form of "strict mode" in EXtra-data with all its known trade-offs - but not as the default for every situation.
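
One possible shape for such an array_like object, as a sketch only (the class name and the strategy of reopening the file on every read are assumptions, not an agreed design):

import h5py
import numpy as np

class DatasetRef:
    """Array-like stand-in for an h5py.Dataset that dask could hold safely.

    Only the file path and dataset name are stored, so no file handle stays
    open between reads; each access reopens the file (or goes via the cache).
    """
    def __init__(self, path, name, shape, dtype):
        self.path, self.name = path, name
        self.shape, self.dtype, self.ndim = shape, np.dtype(dtype), len(shape)

    def __getitem__(self, item):
        with h5py.File(self.path, 'r') as f:
            return np.asarray(f[self.name][item])

# dask.array.from_array(DatasetRef(...), chunks=...) would then read lazily.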

To actually comment on some code in the end: I think this PR is a good starting point compared to master for the general version, with __new__ nicely limiting FileAccess to unique instances and trusting h5py's weak closing machinery. As a matter of fact, I'd probably move OpenFilesLimiter to the class level of FileAccess to emphasize its private but global nature. On the other hand, the proposed array_like interface may make use of it as well.

@takluyver (Member, Author)

Thanks Philipp. I've had a bit of a look at the code of dask.array, and I think you're right that neither approach will work for creating a Dask array from a sufficiently large number of files. Dask stores the dataset object we pass in, so either we break it under Dask (strong closing), or we run into the file limit (weak closing).

I'll open a new issue about this - as you suggest, we'll probably need to provide our own array-like object to it. Then I'll merge this PR (as the 'implicit & flexible' option). This isn't to say we're finished with this topic, but to move on and continue from there.

@takluyver (Member, Author)

#57 is the issue for Dask arrays with many files.

@takluyver takluyver merged commit 06047b2 into master May 15, 2020
@takluyver takluyver deleted the simplify-filecache branch May 15, 2020 11:16
@takluyver takluyver added this to the 1.2 milestone Jun 3, 2020