About directory operations on storage #3751

Closed
chaen opened this issue Jul 13, 2018 · 21 comments

Comments

@chaen
Contributor

chaen commented Jul 13, 2018

I am wondering whether it makes sense to maintain directory operations in the storage plugins. The types of storage that we are going to face in the future (S3, Echo, Ceph, etc.) do not have a concept of directory anyway. Supporting directories there means either branches in the code or a proliferation of plugins.

In practice, the only places where this seems to be used are

  • log upload (done as a directory under some conditions, not often I guess, and could be dropped)
  • ValidateOutputDataAgent (list the directory content. Not sure this is used).

So we could possibly drop support for directory operations altogether at the level of the storage, and keep it only at the FC level.

Thoughts @phicharp @fstagni @atsareg @andresailer @petricm @sfayer @iueda ?

@sfayer
Member

sfayer commented Jul 16, 2018

Well, I don't think we're using any of this functionality in GridPP, so it sounds reasonable to me...

@fstagni
Contributor

fstagni commented Jul 16, 2018

The ValidateOutputDataAgent is certainly used (by LHCb), and it's part of the transformations' validation machinery. We have been planning to change it for several years, as the way it works is quite silly, but we never had time for it.

@chaen
Contributor Author

chaen commented Jul 16, 2018

Then it might be a good opportunity... this agent will not work against Echo for sure.

@fstagni
Contributor

fstagni commented Jul 16, 2018

What's used, in fact, is this:

def storageDirectoryToCatalog(self, lfnDir, storageElement):

The goal here is simply to check that files that are supposed to exist on an SE do indeed exist. With ECHO, IIUC, this no longer makes sense (you do these checks with full dumps...). So we may drop it; let's discuss it.
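To illustrate the dump-based check mentioned above, here is a minimal sketch. It assumes the catalog content and the storage dump are both available as plain lists of LFNs; the function name and the inputs are hypothetical and are not part of DIRAC.

```python
def compare_catalog_to_dump(catalog_lfns, dump_lfns):
    """Compare what the catalog expects with what a full storage dump reports.

    Both arguments are iterables of LFN strings (an assumption for this sketch).
    Returns two sorted lists: (missing_on_storage, not_in_catalog).
    """
    catalog = set(catalog_lfns)
    dump = set(dump_lfns)
    # Registered in the catalog but absent from the dump
    missing_on_storage = sorted(catalog - dump)
    # Present in the dump but not registered in the catalog
    not_in_catalog = sorted(dump - catalog)
    return missing_on_storage, not_in_catalog
```

No directory listing is needed here: the whole comparison is two set differences over the dump.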

@iueda
Contributor

iueda commented Jul 19, 2018

I don't think we rely on "log upload" or the "ValidateOutputDataAgent", but which "directory operations" do you aim to drop?
We are not really doing much in our data management, but eventually I would like to add functionality to remove empty directories. If dropping these operations made that impossible, it would be a pity...

@chaen
Contributor Author

chaen commented Jul 19, 2018

@fstagni we can check if a file exists, not really a directory.

@iueda I did not think of that use case! :-) Removing empty directories is indeed a perfectly valid use case. I am wondering though if this is something that we should deal with, or if it is more a storage/filesystem cleanup operation?

@atsareg
Contributor

atsareg commented Jul 19, 2018

This is a rather popular user operation: recursively downloading or uploading a whole directory. This is normally done file by file, but can be supported at the storage level if that is more efficient. In any case, this is an example of a case where the DataManager can use a storage-side facility if it exists, or do a recursive operation if not. Removing empty directories is certainly a valid case, and it can likewise be either supported or not. For "normal" storage it should be possible to detect whether a directory is empty, i.e. by making a listing. If a storage does not support it, it can either be mimicked (by analyzing file metadata) or the DataManager can apply some other workaround. But just dropping it does not sound like a good idea. In the end, in the minds of users, the data IS stored in directories.

@iueda
Contributor

iueda commented Jul 19, 2018

> I am wondering though if this is something that we should deal with, or if it is more a storage/filesystem cleanup operation?

Do you mean removing empty directories can be done by the site admins? Yes, it can, but it would be painful: not every site may be willing to do it, and scanning the whole storage/filesystem to find empty directories would be quite a load. Or do you mean it can be done outside of DIRAC as an independent operation, by the VO rather than the sites? Well, that is what I did when I was in ATLAS, running cleaning outside of DQ2, so it is possible but consumes some manpower. The best occasion to identify an empty directory is when we intentionally remove files under the directory, i.e. it is better done within the DIRAC framework.

Following what you said at the meeting, I agree with you and don't think we need the createDirectory operation, as long as GFAL2+backend automatically creates the necessary directory structures. I don't know who needs putDirectory, though I guess @atsareg is saying that getDirectory is used? I am not sure how useful getDirectory is for a storage, since files under a "logical" directory may be spread over different storage elements. Finally, isDirectory and listDirectory would be useful/necessary before doing removeDirectory. I have never used them so far, so I might misunderstand their usage... Please correct me if anything is wrong.

@chaen
Contributor Author

chaen commented Jul 19, 2018

I am not quite sure I follow you here. What can the DataManager do? Again, this is just at the storage level. And a listing of a directory also does not make sense.
So clearly, there is no workaround or mimicking that can be done for an object store.

The question just becomes: do we want to support directory operations only sometimes? I think that this is a bad idea, because you won't be able to rely on them, so I would just not rely on them at all.

I just checked the accounting of LHCb, and in the last month there have been literally zero directory operations besides removal. Can you check yours?

@iueda: regarding removal of empty directories, you are right, probably not all the sites would do it. Checking whether we can remove the directory or not when removing a file is probably very costly. I would rather favour an approach where removing an empty directory is part of the consistency checks that we can put in place, providing that sites also dump the directories in the dump. What do you think?

@atsareg
Contributor

atsareg commented Jul 19, 2018

@chaen: By mimicking directory operations I mean that the DataManager can use the Storage directory method, e.g. for downloading, if it is available, or else do per-file recursive downloads. In the DataManager API this will be seen as just a directory method.
As for the S3-like storage namespace lookup, I guess that LFNs will be stored as file metadata on the storage side anyway. Looking up and analyzing this metadata can give you information equivalent to a directory listing.
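As a rough illustration of deriving a listing from file-level information, the sketch below builds a one-level "directory" view from a flat list of object keys. It assumes each key is a full LFN-like path; the function name and the input shape are hypothetical, and nothing here is an actual S3 or DIRAC call.

```python
def list_directory(keys, directory):
    """Mimic a one-level directory listing from a flat list of object keys.

    Returns (subdirectories, files) directly under `directory`, both sorted.
    """
    prefix = directory.rstrip("/") + "/"
    subdirs, files = set(), set()
    for key in keys:
        if not key.startswith(prefix):
            continue  # key lives outside the requested "directory"
        rest = key[len(prefix):]
        if not rest:
            continue  # the key is the prefix itself, nothing to list
        head, sep, _tail = rest.partition("/")
        if sep:
            subdirs.add(prefix + head)  # deeper path: report the first level only
        else:
            files.add(key)  # object directly under the prefix
    return sorted(subdirs), sorted(files)
```

Note that this sketch walks every key it is given, so its cost grows with the size of the namespace rather than with the size of the directory.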

@chaen
Contributor Author

chaen commented Jul 19, 2018

The DataManager should not implement that kind of thing; it should be storage and catalog agnostic. The logic has to be in the plugins.
In practice, downloading recursively is already what is done now in the plugins, file by file.

For S3, you can't afford to query the metadata of all your files when doing such an operation.
And anyway, I think it is just completely wrong to try to force a technology to behave like something it is not meant to be (putting a file system on top of an object store, for example, like CephFS).

Did you check the directory operations in the accounting?

@atsareg
Contributor

atsareg commented Jul 19, 2018

Looking up the namespace is certainly delicate. I do not know how versatile it is, e.g. whether it allows searching by a subset of metadata. But at some point we will need the possibility to get a "dump" of the namespace from the storage for consistency checks. At least this should be possible, not as a regular operation, but for occasional use.
I do not think that directory operations are used much now, if at all. This is a standard question during tutorials, for example. But keeping the possibility to "download" a directory is a good point. A storage plugin that does not handle directory operations should be given a list of files to download instead. That is why this list must be evaluated in the DataManager, because the storage plugin will not look it up in the file catalog. The DataManager will also take care of keeping the right directory structure for locally downloaded files. A storage plugin has a list of supported methods, which the DataManager can use to choose the right way to proceed; this is a generic mechanism.
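The dispatch described here could look roughly like the following sketch. The plugin class, its `supported_methods` attribute, and the `download_directory` helper are hypothetical stand-ins for illustration and do not reproduce the real DIRAC DataManager or storage plugin interfaces.

```python
import posixpath


class RecordingPlugin:
    """Hypothetical storage plugin that only records the calls it receives."""

    def __init__(self, supported_methods):
        self.supported_methods = set(supported_methods)
        self.calls = []

    def getDirectory(self, lfn_dir, local_dir):
        self.calls.append(("getDirectory", lfn_dir, local_dir))

    def getFile(self, lfn, local_path):
        self.calls.append(("getFile", lfn, local_path))


def download_directory(plugin, lfn_dir, local_dir, catalog_lfns):
    """Use the plugin's directory method if advertised, else go file by file.

    `catalog_lfns` stands for the file list resolved from the catalog
    (an assumption: the caller has already looked it up).
    """
    if "getDirectory" in plugin.supported_methods:
        plugin.getDirectory(lfn_dir, local_dir)
        return
    prefix = lfn_dir.rstrip("/") + "/"
    for lfn in catalog_lfns:
        if lfn.startswith(prefix):
            # Keep the relative path so the local tree mirrors the namespace
            relative = lfn[len(prefix):]
            plugin.getFile(lfn, posixpath.join(local_dir, relative))
```

With a plugin that advertises only `getFile`, the helper issues one call per catalog entry under the directory and preserves the subdirectory layout locally.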

@atsareg
Contributor

atsareg commented Jul 19, 2018

And do not forget that for DIRAC users the data is the "file system" defined by the catalog namespace. This is the basic paradigm that users will rely on. So we cannot just get rid of the "file system on top of the object store". We will have a mixture of "file system" and "object store" type storages for a long time, if not forever. But we should expose only one paradigm.

@chaen
Contributor Author

chaen commented Jul 19, 2018

'But keeping the possibility to "download" a directory is a good point.' I do not think it is really a good point that we keep it, if it is not used... and even less a good point to force a technology to behave as such if it is not meant for.

As for the storage dump, yes, this is the direction in which storages are going. You can get a full dump of your storage, but that's basically it. This is also the direction taken by the "standard" storages: a unified dump. So hopefully we should soon be able to have consistency checks at a high level.

@atsareg
Contributor

atsareg commented Jul 19, 2018

"even less a good point to force a technology to behave as such if it is not meant for" - true. That is why it should be moved to the DataManager logic rather than kept at the level of the storage plugin.

@chaen
Contributor Author

chaen commented Jul 19, 2018

I strongly disagree with that, but this is an implementation detail that we can discuss somewhere else.
But, just thinking out loud, if the only concern is users, then we can just solve that in a script...
And once again: if it is not used, then it is not worth keeping.

@iueda
Contributor

iueda commented Jul 24, 2018

> Checking whether we can remove the directory or not when removing a file is probably very costly.

That is right, it would not be good to check whether the directory is empty or not every time we remove a file. That is why I wrote "The best occasion to identify an empty directory is when we intentionally remove files under the directory." However, there seems to be no such workflow implemented currently, so I cannot say much about whether it should be implemented this way...

> I would rather favour an approach where removing an empty directory is part of the consistency checks that we can put in place, providing that sites also dump the directories in the dump. What do you think?

That would be a reasonable solution, if the dump includes empty directories... and that would probably be needed anyway.
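For illustration, assuming a dump that lists directories and files separately (a hypothetical format, since the actual dump content is still an open question here), empty directories could be identified along these lines:

```python
def find_empty_directories(dump_dirs, dump_files):
    """Return directories from the dump that contain no file at any depth.

    `dump_dirs` and `dump_files` are iterables of absolute paths
    (an assumed dump layout, not a real site dump format).
    """
    non_empty = set()
    for path in dump_files:
        parent = path.rsplit("/", 1)[0]
        # Mark every ancestor as non-empty; stop early once an ancestor
        # is already marked, since its own ancestors were marked with it.
        while parent and parent not in non_empty:
            non_empty.add(parent)
            parent = parent.rsplit("/", 1)[0]
    candidates = (d.rstrip("/") for d in dump_dirs)
    return sorted(d for d in candidates if d and d not in non_empty)
```

A directory holding only empty subdirectories is reported too, so an actual cleanup would have to remove the deepest paths first.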

As for the "directory download", I guess users (client tools) would not call se.getDirectory() directly? Then where is it used?

@chaen
Contributor Author

chaen commented Jul 24, 2018

FYI, I've asked the accounting task force whether the dump will include empty directories

@fstagni
Contributor

fstagni commented Oct 29, 2018

Getting back to this after a long while... since ECHO is coming alive. We'll discuss this again at the next BiLD.

@fstagni
Contributor

fstagni commented Apr 25, 2019

Does this task need to be kept open?

@chaen
Contributor Author

chaen commented Apr 25, 2019

I think we agree in general. The call will remain available for now, but will not be used anywhere in core DIRAC; it can still be used in scripts, or the logic can be moved there.
For me, you can close this. These changes will come eventually, at a slow pace.

fstagni closed this as completed Apr 25, 2019
5 participants