Attempt to simplify filecache #48
Conversation
I think there are two major differences in our approaches:
I'm lost. Could you explain? If I remember correctly, I had a kind of soft closing of files in the original version, and it was you who asked me to close files immediately after exiting a … Do you remember that I was strongly opposed and you strongly insisted? And now you are presenting weak closing as a feature. This feature results in completely uncontrollable closing during garbage collection, which may be delayed, as we saw. So which of your beliefs is correct? How was it possible to push me in one direction while going the opposite way yourself?
I'm sorry about that. What I was objecting to was a cache which could keep a file open even when nothing else had a reference to the file or to any object belonging to the file. So you would have to interact with the cache to ensure a file was closed. What I'm now proposing doesn't do that (or if it does, it's a bug). What I mean by 'weak closing' is that a reference to a dataset from a file can keep that file open, even if it's closed from the perspective of the … I don't think I've stated this clearly before, and I'm sorry for that. I hadn't really thought carefully about how closing a file works when there might be objects belonging to the file. The …
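(To illustrate the 'weak closing' behaviour described above, here is a minimal sketch; the file path is made up. A dataset reference keeps the underlying HDF5 file readable even after the File object itself is dropped.)

```python
import numpy as np
import h5py

# Prepare a throwaway file with a small dataset (path is illustrative).
with h5py.File('/tmp/weak_close_demo.h5', 'w') as f:
    f['data'] = np.arange(10)

f = h5py.File('/tmp/weak_close_demo.h5', 'r')
ds = f['data']   # dataset object belonging to the file
del f            # drop the only File reference: a 'weak close'

# HDF5 keeps the underlying file open because `ds` still refers to it,
# so this read works even though no File object exists any more.
print(ds[:])
```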
You say that interaction with the file cache is a problem. I see only advantages in this. I mean, you don't need to interact with the cache in most cases, because you may rely on soft closing. But if you really need to close the file with a 100% guarantee, you have that option in my implementation. The problem is that you don't have such a guarantee with weak closing in your implementation, and you cannot implement forced closing in principle. The only reason why I added …
This remaining file completely disappears if I add … It means you can rely on weak closing only if you explicitly reset all references to the file handler. In your implementation, explicit or implicit destruction of … I don't know how to reproduce it. You can try removing … And even worse, file handlers leak from … If I ask you to fix these two disadvantages, your implementation will immediately morph into mine. So what will we do?
HDF5 keeps track of all its object references, so if we really have to ensure all HDF5 files are closed, it should be possible with something like this:

```python
import h5py
from h5py import h5i

for id_ in h5py.h5f.get_obj_ids():
    while id_.valid:
        h5i.dec_ref(id_)
```

That's not very pretty, but it's all documented APIs, and it's similar to what h5py's …
I'm not confident that I'm smart enough to avoid introducing bugs like these as we continue to develop EXtra-data. Maybe you are! But I'm obviously against a design that reduces my ability to work on this code, and I'd like to avoid other contributors having to think about it. The remaining open files you mention look like a garbage collection delay. I saw something similar while experimenting, and calling …
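(A minimal sketch, assuming the leftover handle is only kept alive by objects waiting to be garbage collected: forcing a collection and then listing HDF5's open identifiers is one way to check this.)

```python
import gc
import h5py

gc.collect()   # force collection of unreachable objects (e.g. reference cycles)

# If the open files were only a garbage collection delay, nothing should be
# left open now:
print(h5py.h5f.get_obj_ids())   # expect an empty list
```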
These bugs cannot emerge at all if you follow two requirements:
That is the central idea of my design (its principles):
In the last implementation, the first requirement is actually even weaker due to the use of … I did not think that these requirements would be difficult for anyone in the future, especially considering that all read operations seem to be already implemented in the code. But if you think they may cause problems in future, I would suggest writing wrappers for the reading operations instead of spoiling the cache implementation. The bug that we've seen happened in a quite nonstandard situation: a combination of an exception and immediately replacing the cache instance. The garbage collection after the exception happened, by chance, to interrupt the code between the file state check and the reading operation, and closed the file in another cache instance. That is an internal issue of …
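(A hypothetical sketch of the kind of read wrapper suggested here; the class and method names are illustrative, not the actual EXtra-data code. The point is that the file-state check and the read happen inside one call, so callers never split them.)

```python
import h5py

class CachedFileReader:
    """Toy wrapper: reopen the file if needed, immediately before each read."""

    def __init__(self, path):
        self.path = path
        self._file = None

    def _ensure_open(self):
        # Reopen if we have never opened the file, or if it was closed elsewhere.
        if self._file is None or not self._file.id.valid:
            self._file = h5py.File(self.path, 'r')
        return self._file

    def read(self, key):
        # Check-and-read in a single call, so user code cannot be interrupted
        # between the state check and the actual read.
        return self._ensure_open()[key][:]
```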
Hi @egorsobolev and @takluyver,

You both make valid points, and it appears to me we're sitting between two very different concepts: trying to be really smart and explicit in handling resources in Egor's approach, and keeping it implicit and flexible in Thomas' version. As I wasn't here for the earlier PR introducing the file caching layer, I will pretend it didn't happen.

I understand the problem arising from hitting the (hard) file limit. However, I'd like to question the cost associated with trying to keep detailed track of these resources, both from a maintenance perspective for us and from a usability perspective for users. Aggressively closing files we deem no longer in use can break in all kinds of places. Using …

That being said, we want to provide solutions for cutting-edge experiments, which may well involve (several) long runs with large-scale detectors spanning lots and lots of files. Analyzing data at this scale requires a greater deal of expertise, and we may well expect it from those users. One approach to tackle this, which is actually advertised by EXtra-data, is dask. While I'm not an expert, I suspect both your solutions may fail for a very large …

I would like to propose a compromise: keep it implicit and flexible (what made Python great) in the majority of cases, but provide an advanced solution for power users, and only then with the ability to make use of it. There are several ways to realize this. In the "cutting edge scenario" I've envisioned above, most of the non-dask methods of obtaining data would most likely fail anyway for memory reasons. Lazy computation via dask is the only API we provide in this case, so we can potentially focus our solutions here. By implementing our own …

To actually comment on some code in the end: I think this PR is a good starting point compared to master for the general version, with …
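(One possible shape for the 'advanced/dask' route sketched above, purely as an illustration: build the dask array from delayed per-file reads, so each task opens and closes its own file and no handle needs to stay open between computations. The file names, dataset key and chunk shape are made up.)

```python
import dask
import dask.array as da
import h5py
import numpy as np

def read_chunk(path, key):
    # The file is opened and closed inside the task, so no long-lived handle
    # is needed while the graph is merely being built.
    with h5py.File(path, 'r') as f:
        return f[key][:]

paths = [f'run_file_{i:04d}.h5' for i in range(500)]   # made-up file names
chunks = [
    da.from_delayed(dask.delayed(read_chunk)(p, 'data'),
                    shape=(1000,), dtype=np.float64)    # assumed chunk shape
    for p in paths
]
arr = da.concatenate(chunks)   # lazy; nothing is read until .compute()
```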
Thanks Philipp. I've had a bit of a look at the code of … I'll open a new issue about this - as you suggest, we'll probably need to provide our own array-like object to it. Then I'll merge this PR (as the 'implicit & flexible' option). This isn't to say we're finished with this topic, but to move on and continue from there.
#57 is the issue for Dask arrays with many files.
Hi @egorsobolev. I was concerned that the file cache was getting complicated to the point where I was struggling to be sure that it would do the right thing, so I've spent a while trying to work out an alternative model for it. I think this is somewhat clearer, but it's easy to think that when I've just written it, so I'd welcome a second view.
The idea is to split the functionality of FileCache into two pieces:
- file_access_registry tracks a single FileAccess object (open or closed) for each path. It uses weakrefs, so it should only contain objects that are referenced elsewhere. Instantiating FileAccess uses an existing instance from the registry where possible.
- OpenFilesLimiter maintains a queue of paths corresponding to possibly open files. When it hits the limit, it looks up the oldest path in the registry and 'closes' its FileAccess object. This is a 'weak close', which will only close the file once there are no open objects from it. It should recover OK after FileAccess objects are torn down, so we can do without __del__ methods to track that.

I've moved things around and renamed things a bit, but if you want to see a diff of the functional changes, look at the first commit.
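(A minimal sketch of the two pieces described above, with simplified, illustrative names; the real EXtra-data classes differ in their details.)

```python
import weakref
from collections import OrderedDict

# Registry: at most one FileAccess-like object (open or closed) per path,
# held weakly so the entry disappears once nothing else references it.
file_access_registry = weakref.WeakValueDictionary()

class OpenFilesLimiter:
    """Track paths of possibly-open files; weakly close the oldest on overflow."""

    def __init__(self, maxfiles=128):
        self.maxfiles = maxfiles
        self._queue = OrderedDict()   # path -> None, in order of last use

    def touch(self, path):
        """Record that `path` was just used, closing old files if over the limit."""
        if path in self._queue:
            self._queue.move_to_end(path)
            return
        while len(self._queue) >= self.maxfiles:
            oldest, _ = self._queue.popitem(last=False)
            file_access = file_access_registry.get(oldest)
            if file_access is not None:
                # 'Weak close': HDF5 only really closes the file once no
                # open objects from it remain.
                file_access.close()
        self._queue[path] = None
```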