About directory operations on storage #3751
Comments
Well, I don't think we're using any of this functionality in GridPP, so it sounds reasonable to me... |
The ValidateOutputDataAgent is certainly used (by LHCb), and it's part of the transformations' validation machinery. For several years we have planned to change it, as the way it works is quite silly, but we never had time for it. |
Then it might be a good opportunity... this agent will not work against Echo for sure. |
What's used, in fact, is this:
The goal here is simply to check that files that are supposed to exist on an SE do indeed exist. With ECHO, IIUC, this makes no sense any more (you do these checks with full dumps...). So we may drop it; let's discuss it. |
I don't think we rely on those "log update" and "ValidateOutputDataAgent", but which are the "directory operations" you aim to drop? |
This is a rather popular user operation: recursively downloading or uploading a whole directory. This is normally done file by file, but can be supported at the storage level if that is more efficient. In any case, this is an example of a case where the DataManager can use a storage-side facility if it exists, or do a recursive operation if not. Removing empty directories is certainly a valid case and can also be either supported or not. For "normal" storage it should be possible to detect whether a directory is empty, i.e. by making a listing. If a storage does not support it, it can either be mimicked (by analyzing file metadata) or the DataManager can use some other workaround. But just dropping it does not sound like a good idea. In the end, in the minds of users the data IS stored in directories. |
Do you mean removing empty directories can be done by the site admins? Yes, it can, but it would be painful, since not every site may be willing to do it, and scanning the whole storage/filesystem to find empty directories would be quite a load. Or do you mean it can be done outside of DIRAC as an independent operation, by the VO rather than the sites? Well, that is what I did when I was in ATLAS, running cleaning outside of DQ2, so it is possible but consumes some manpower. The best occasion to identify an empty directory is when we intentionally remove files under the directory, i.e. it is better done within the DIRAC framework. Following what you said at the meeting, I agree with you and don't think we need the createDirectory operation, as long as the GFAL2+backend automatically creates the necessary directory structures. I don't know who needs putDirectory, though I guess @atsareg is saying getDirectory is used? I am not sure how useful getDirectory is for a storage, since files under a "logical" directory may be spread over different storage elements. Finally, isDirectory and listDirectory would be useful/necessary before doing removeDirectory. I have never used them so far, so I might misunderstand their usage... Please correct me if there is anything wrong. |
I am not quite sure I follow you here. What can the DataManager do? Again, this is just at the storage level. And a listing of a directory also does not make sense. The question just becomes: do we want to support directory operations sometimes? I think that this is a bad idea, because you won't be able to rely on it, so I would just not rely on it at all. I just checked the accounting of LHCb, and in the last month there have been literally zero directory operations, besides removal. Can you check yours? @iueda: regarding removal of operations, you are right, probably not all the sites would do it. Checking whether we can remove the directory or not when removing a file is probably very costly. I would rather favour an approach where removing an empty directory is part of the consistency checks that we can put in place, provided that sites also include the directories in the dump. What do you think? |
@chaen : By mimicking directory operations I mean that DataManager can use Storage directory method, e.g. for downloading, if available or do per file recursive downloads. In the DataManager API this will be seen as just a directory method. |
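The fallback described above could be sketched roughly as follows. This is a minimal illustration, not DIRAC code: `download_directory`, `list_files`, `get_file` and `get_directory` are hypothetical names, and the callables stand in for whatever the catalog and storage layers actually provide.

```python
from typing import Callable, Dict, Iterable, Optional


def download_directory(
    directory: str,
    list_files: Callable[[str], Iterable[str]],
    get_file: Callable[[str], bool],
    get_directory: Optional[Callable[[str], Dict[str, bool]]] = None,
) -> Dict[str, bool]:
    """Download everything under `directory` and return {path: success}.

    All three callables are hypothetical stand-ins:
      list_files(directory)    -> paths of the files under the directory
      get_file(path)           -> True if the single-file download succeeded
      get_directory(directory) -> native recursive download, if the storage has one
    """
    if get_directory is not None:
        # The storage offers a directory-level operation: delegate to it.
        return get_directory(directory)
    # No native support: mimic the directory operation file by file.
    return {path: get_file(path) for path in list_files(directory)}
```

The caller sees a single directory-level call in both cases; only the presence of `get_directory` decides which path is taken.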
The DataManager should not implement that kind of thing; it should be storage and catalog agnostic. The logic has to be in the plugins. For S3, you can't afford to query the metadata of all your files when doing such an operation. Did you check the directory operations in the accounting? |
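To make the cost concern concrete: in a flat key namespace a "directory" is only a shared key prefix, so a listing has to be derived from the keys themselves. The sketch below is an assumption-laden illustration over an in-memory list of keys; `list_directory` is a hypothetical helper, and any server-side prefix listing a real object store may offer is not modelled here.

```python
from typing import Iterable, List


def list_directory(keys: Iterable[str], prefix: str) -> List[str]:
    """Emulate a one-level directory listing over a flat key namespace."""
    if prefix and not prefix.endswith("/"):
        prefix += "/"
    children = set()
    for key in keys:  # every key is inspected: cost grows with the namespace
        if key.startswith(prefix):
            rest = key[len(prefix):]
            if rest:
                # Keep only the first path component below the prefix.
                children.add(rest.split("/", 1)[0])
    return sorted(children)
```

An "is this directory empty?" check built on top of this has the same cost profile, which is why doing it on every file removal is unattractive.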
Looking up the namespace is certainly delicate. I do not know how versatile it is, e.g. whether it allows searching by a subset of metadata. But at some point we will need the possibility to get a "dump" of the namespace from the storage for consistency checks. At least this should be possible, not as a regular operation, but for occasional use. |
And do not forget that for the DIRAC users the data is the "file system" defined by the catalog namespace. This is the basic paradigm that users will rely on. So, we cannot just get rid of the "file system on top of the object store". We will have a mixture of "file system" and "object store" type storages for a long time, if not forever. But we should expose only one paradigm. |
'But keeping the possibility to "download" a directory is a good point.' I do not think it is really a good point that we keep it, if it is not used... and even less a good point to force a technology to behave as such if it is not meant for it. As for the storage dump, yes, this is the direction in which storages are going. You can get a full dump of your storage, but that's basically it. Also, this is the direction taken by the "standard" storages: a unified dump. So hopefully soon, we should be able to have consistency checks at the high level. |
"even less a good point to force a technology to behave as such if it is not meant for it" - true. That is why it should be moved to the DataManager logic and not kept at the level of the storage plugin. |
I strongly disagree on that, but this is an implementation detail that we can discuss somewhere else. |
That is right, it would not be good to check whether the directory is empty or not every time we remove a file. That is why I wrote "The best occasion to identify an empty directory is when we intentionally remove files under the directory." However, there seems to be no such workflow implemented currently, so I cannot say much about whether it should be implemented this way...
That would be a reasonable solution, if the dump includes empty directories..., and probably it would be needed anyway. As for the "directory download", I guess users (client tools) would not call se.getDirectory() directly? Then where is it used? |
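If the dump does list directories, finding the empty ones offline is cheap. A minimal sketch, assuming the dump can be split into a list of directory paths and a list of file paths (the dump format and the name `empty_directories` are assumptions, not an existing DIRAC or site tool):

```python
import posixpath
from typing import Iterable, Set


def empty_directories(dump_dirs: Iterable[str], dump_files: Iterable[str]) -> Set[str]:
    """Return the directories from the dump that hold no file at any depth."""
    non_empty: Set[str] = set()
    for path in dump_files:
        # Mark every ancestor of the file as non-empty; stop early once an
        # ancestor is already marked, since its own ancestors are too.
        parent = posixpath.dirname(path)
        while parent and parent not in non_empty:
            non_empty.add(parent)
            if parent == "/":
                break
            parent = posixpath.dirname(parent)
    # Normalise trailing slashes so "/a/b/" and "/a/b" compare equal.
    return {d.rstrip("/") or "/" for d in dump_dirs} - non_empty
```

A directory containing only empty subdirectories is reported as empty too, so the result should be removed deepest path first.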
FYI, I've asked the accounting task force whether the dump will include empty directories |
Getting back to this after a long while... since ECHO is coming alive. We'll discuss this again at the next BiLD. |
Does this task need to be kept open? |
I think we agree in general. The call will remain available for now, but will not be used anywhere in core DIRAC; it can still be used in scripts, or the logic moved there. |
I am wondering if it makes sense to maintain directory operations in the storage plugins. The types of storage that we are going to face in the future (S3, Echo, Ceph, etc.) do not have a concept of directory anyway. In practice, supporting both involves either branches in the code or a multiplication of plugins.
In practice, the only places where this seems to be used are
So we could possibly drop support for directory operations altogether at the level of the storage, and keep it only at the FC level.
Thoughts @phicharp @fstagni @atsareg @andresailer @petricm @sfayer @iueda?