Need an option for rsync to delete folders in destination that are not in the source. #510

Open
damnMeddlingKid opened this issue Mar 28, 2018 · 3 comments

Comments

@damnMeddlingKid

damnMeddlingKid commented Mar 28, 2018

We are seeing behaviour when using gsutil version 4.29 where rsync will not delete folders in the destination that are no longer in the source.

It deletes all the files within the folder but does not delete the folder itself.

Here is the command I'm using:

gsutil -m rsync -r -d <source> <destination>

@houglum
Collaborator

houglum commented Mar 29, 2018

IIRC, this comes from trying to treat object names and directory names similarly. However, for an object, "a/b/c.txt" is the whole name of the object, while for a file, "a/b/c.txt" is a path: the directory "a" contains the directory "b", which contains the file "c.txt". Deleting the object "a/b/c.txt" would rid you of that entire "a/b/" prefix (assuming no other objects shared that prefix), while deleting the file "a/b/c.txt" would leave you with the directory "a", containing the empty directory "b".
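To make the difference concrete, here is a minimal sketch in plain Python. It does not call gsutil or GCS: the flat namespace is modelled as a set of name strings (an assumption made purely for illustration), and the file system side uses a temporary directory.

```python
import os
import tempfile

# Flat namespace (illustrative model only): an object name is just a string key.
objects = {"a/b/c.txt"}
objects.discard("a/b/c.txt")
# No remaining key starts with "a/", so the "a/b/" prefix is gone as well.
prefix_still_exists = any(name.startswith("a/") for name in objects)

# File system: directories are real entries that outlive the files in them.
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "a", "b"))
file_path = os.path.join(root, "a", "b", "c.txt")
with open(file_path, "w") as f:
  f.write("data")
os.remove(file_path)
dir_still_exists = os.path.isdir(os.path.join(root, "a", "b"))

print(prefix_still_exists)  # False
print(dir_still_exists)     # True
```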

I don't know off the top of my head how much work this would require to change in gsutil. As a workaround, while it's not as elegant as having a flag built in to gsutil, you could run a command afterward to recursively delete empty directories under the destination directory in a bottom-up fashion. If you're only running on Linux systems, this can be done with the find command:

$ find <destination_directory_path> -mindepth 1 -type d -empty -delete

Or, if you need something cross-platform, a naive Python script to do the same thing (note that I haven't tested this beyond a trivial test case) might look something like:

import os

def RemoveEmptyLeafDirsBottomUp(dir_path):
  # topdown=False yields each directory only after all of its subdirectories.
  for current_dir, _, _ in os.walk(dir_path, topdown=False):
    # When recursing back up to a parent dir, the os.walk tuple may
    # still contain entries for its child directories we just deleted.
    # So, re-list the current dir's contents and see if we deleted all
    # of its children -- if it's now empty, delete it.
    if not os.listdir(current_dir):
      os.rmdir(current_dir)
      print('Removed %s' % current_dir)

# Call RemoveEmptyLeafDirsBottomUp on the destination directory's path

@damnMeddlingKid
Author

damnMeddlingKid commented Mar 29, 2018

I should have clarified: I'm seeing empty folders on GCS after they've been removed from the source file system.

@houglum
Collaborator

houglum commented Mar 29, 2018

The only case I can think of this happening is if there are intermediate directory placeholder objects -- if those were created, it's probably from a separate tool (like the Cloud Console web UI). In the past, these have had some special suffix like _$folder$, but if you create a "folder" in the web UI today, the method they use is creating a 0-byte object with a forward slash at the end of its name. Gsutil doesn't clean these up, as we generally disallow creating such objects in the first place. GCS is not a file system, and different tools attempt emulating directory support in different ways (which may make sense in their own context, like within a web browser), but it leads to conflicts like this.

On the bright side, we're consistent: because we don't copy directory placeholders, if you do a gsutil rsync or gsutil cp in the other direction -- from bucket to file system -- it won't copy the empty placeholders back to your file system.
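If you want to spot leftover placeholders yourself, something like the following sketch might help. It is not part of gsutil and makes no API calls: it assumes you already have a listing of the bucket as (object name, size in bytes) pairs, obtained however you like. The function name find_orphaned_placeholders and the sample listing are hypothetical, and treating only 0-byte objects as placeholders (including the _$folder$ kind) is an assumption.

```python
FOLDER_SUFFIX = "_$folder$"  # legacy placeholder suffix mentioned above


def find_orphaned_placeholders(listing):
  """Return placeholder object names that have no real objects under them.

  listing: iterable of (object_name, size_in_bytes) pairs (assumed input).
  """
  placeholders = {}  # placeholder name -> the prefix it stands for
  real_names = []
  for name, size in listing:
    if size == 0 and name.endswith("/"):
      placeholders[name] = name
    elif size == 0 and name.endswith(FOLDER_SUFFIX):
      placeholders[name] = name[:-len(FOLDER_SUFFIX)] + "/"
    else:
      real_names.append(name)
  # A placeholder is orphaned when no real object lives under its prefix.
  return [
      name for name, prefix in placeholders.items()
      if not any(real.startswith(prefix) for real in real_names)
  ]


# Hypothetical listing, for illustration only.
sample_listing = [
    ("keep/", 0),
    ("keep/data.csv", 12),
    ("old/", 0),
    ("old/sub/", 0),
    ("legacy_$folder$", 0),
    ("empty.txt", 0),
]
print(find_orphaned_placeholders(sample_listing))
# ['old/', 'old/sub/', 'legacy_$folder$']
```

The result is only a list of candidates; review it before deleting anything, since a 0-byte object with a trailing slash could also have been created on purpose.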
