how to exclude multiple directories when gsutil rsync? #771

Closed
zffocussss opened this issue May 6, 2019 · 12 comments
@zffocussss

I have subdirectories a, b, and c under directory "d".
How can I exclude all of them at once?

@catleeball
Contributor

catleeball commented May 6, 2019

Hi @zffocussss !

You can use the -x flag to exclude many directories or files using a regex pattern. There's some more info in this doc: https://cloud.google.com/storage/docs/gsutil/commands/rsync

Here are more examples from the doc linked above:

-x pattern

Causes files/objects matching pattern to be excluded, i.e., any matching files/objects will not be copied or deleted. Note that the pattern is a Python regular expression, not a wildcard (so, matching any string ending in "abc" would be specified using ".*abc$" rather than "*abc"). Note also that the exclude path is always relative (similar to Unix rsync or tar exclude options). For example, if you run the command:

    gsutil rsync -x "data./.*\.txt$" dir gs://my-bucket

it will skip the file dir/data1/a.txt.

You can use regex alternation to specify multiple exclusions, for example:

    gsutil rsync -x ".*\.txt$|.*\.jpg$" dir gs://my-bucket

NOTE: When using this on the Windows command line, use ^ as an escape character instead of \ and escape the | character.
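The difference between a wildcard and a Python regex can be checked locally before running a sync. The following is a minimal sketch, not gsutil's own code: the paths are hypothetical, and it assumes the pattern is applied with match-from-start semantics (`re.match`), as in the REPL session later in this thread.

```python
import re

# A shell-style wildcard is not a valid Python regex:
# a leading "*" has nothing to repeat, so compiling it fails.
try:
    re.compile("*abc")
    wildcard_is_valid = True
except re.error:
    wildcard_is_valid = False

# Regex alternation combines several exclusions into one pattern.
exclude = re.compile(r".*\.txt$|.*\.jpg$")

# Hypothetical paths, written relative to the source directory.
paths = ["notes.txt", "photos/cat.jpg", "photos/cat.png", "data1/a.txt.bak"]
excluded = [p for p in paths if exclude.match(p)]

print(wildcard_is_valid)
print(excluded)
```

Only the paths that end in `.txt` or `.jpg` are reported as excluded; `data1/a.txt.bak` is kept because of the `$` anchor.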

Please let me know if that helps or if you have any other questions!

@catleeball
Contributor

Updated the comment above with a few more details specific to your question. :)

@zffocussss
Author

Hi @catleeball, I tried it:
gsutil -d -x "a/|b/|c/" -r d gs://my-bucket
but it does not work. I checked my bucket in the GCP console, and a, b, and c are still there.
I think -x can only exclude files, not directories.

@catleeball
Contributor

catleeball commented May 8, 2019

Hi @zffocussss ! It looks like the issue might be with your regex. Here's an example I just tested:

Given this local directory structure rsync-test

cball@cball:~$ tree rsync-test/
rsync-test/
├── dirA
│   └── bar.txt
├── dirB
│   └── baz.txt
├── dirC
│   ├── baq.txt
│   └── dirCA
│       └── bat.txt
└── foo.txt

Let's say we want to upload everything except dirA and dirCA. We can do that by writing a regex to say "check the path string for substring 'dirA' or substring 'dirCA'". Here's one way to do that:

cball@cball:~$ gsutil rsync -r -x '^.*dirA.*$|^.*dirCA.*$' rsync-test gs://rsync-test-cball
Building synchronization state...
Starting synchronization...
Copying file://rsync-test/dirB/baz.txt [Content-Type=text/plain]...
Copying file://rsync-test/dirC/baq.txt [Content-Type=text/plain]...
Copying file://rsync-test/foo.txt [Content-Type=text/plain]...
/ [3 files][    0.0 B/    0.0 B]
Operation completed over 3 objects.

Now let's check and make sure the bucket looks like we want it to:

cball@cball:~$ gsutil ls gs://rsync-test-cball
gs://rsync-test-cball/foo.txt
gs://rsync-test-cball/dirB/
gs://rsync-test-cball/dirC/
cball@cball:~$ gsutil ls gs://rsync-test-cball/dirB/
gs://rsync-test-cball/dirB/baz.txt
cball@cball:~$ gsutil ls gs://rsync-test-cball/dirC
gs://rsync-test-cball/dirC/baq.txt

If it's helpful to you in writing your regex, I've found https://regex101.com/ to be a handy website for testing regexes. You can mouse over each part of the regex and it tells you what it does. 🙂

@zffocussss
Author

Oh my god, thanks for your help. Now I know it is a Python regex. I was using PCRE and shell regex syntax.
You are right. I need to check my regex in gsutil.

@zffocussss
Author

By the way, how do you test this regex, given that these are Linux paths rather than plain strings?

@catleeball
Contributor

catleeball commented May 8, 2019

Hi @zffocussss ! When gsutil rsync runs, it walks the directory tree of the source directory. If you include an exclusion pattern, each file / directory gets matched against your provided regex:

https://github.com/GoogleCloudPlatform/gsutil/blob/master/gslib/commands/rsync.py#L745

If you open the Python REPL, you can test your regex with something like this:

cball@cball:~$ python
Python 3.7.3 (default, Apr 25 2019, 13:07:15) 
[GCC 7.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import re
>>> r = re.compile('^.*dirA.*$|^.*dirCA.*$')
>>> dirs = ['rsync-test/dirA', 'rsync-test/dirB', 'rsync-test/dirC', 'rsync-test/dirC/dirCA']
>>> for d in dirs:
...   if r.match(d):
...     print('Regex matches: ' + d)
...   else:
...     print('Regex does not match: ' + d)
... 
Regex matches: rsync-test/dirA
Regex does not match: rsync-test/dirB
Regex does not match: rsync-test/dirC
Regex matches: rsync-test/dirC/dirCA

Or if you're using the online regex tester, you can plug in different directories and see which ones match or don't. 🙂
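To preview a pattern against a real directory rather than a hand-typed list, a small script can walk the tree and apply the regex to each relative path. This is only an approximation of the behavior described above, not gsutil's actual implementation; `preview_exclusions` is a hypothetical helper name, and the sketch checks files only and assumes paths are matched with `re.match` using forward slashes.

```python
import os
import re
import tempfile


def preview_exclusions(source_dir, pattern):
    """Return (excluded, kept) lists of file paths relative to source_dir."""
    regex = re.compile(pattern)
    excluded, kept = [], []
    for root, _dirs, files in os.walk(source_dir):
        for name in files:
            rel = os.path.relpath(os.path.join(root, name), source_dir)
            rel = rel.replace(os.sep, "/")  # normalize separators for matching
            (excluded if regex.match(rel) else kept).append(rel)
    return sorted(excluded), sorted(kept)


# Recreate the rsync-test layout from this thread in a temporary directory.
layout = ["dirA/bar.txt", "dirB/baz.txt", "dirC/baq.txt",
          "dirC/dirCA/bat.txt", "foo.txt"]
with tempfile.TemporaryDirectory() as tmp:
    for rel in layout:
        path = os.path.join(tmp, *rel.split("/"))
        os.makedirs(os.path.dirname(path), exist_ok=True)
        open(path, "w").close()
    excluded, kept = preview_exclusions(tmp, r"^.*dirA.*$|^.*dirCA.*$")

print("excluded:", excluded)
print("kept:", kept)
```

With the pattern from the earlier example, the kept files are the same three that the sync copied: `dirB/baz.txt`, `dirC/baq.txt`, and `foo.txt`.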

I hope that helps! Please let me know if you have any other questions @zffocussss !

@catleeball catleeball self-assigned this May 8, 2019
@houglum
Collaborator

houglum commented May 9, 2019

It's also worth making use of the rsync command's -n flag to run in dry-run mode. This will let you see if you would have copied files you didn't intend to.

@catleeball
Contributor

Smart thinking, @houglum ! 💡

@zffocussss
Author

Okay, I see. Thanks.

@zffocussss
Author

Such nice advice. I can use this to see what will happen.

@zffocussss
Author

r = re.compile('^.*/dirA/.*$|^.*/dirA$|^dirA')
dirs = ['rsync-test/dirA', 'rsync-test/dirB', 'rsync-test/dirC', 'rsync-test/dirC/dirCA', 'a/dirAk/b', 'a/dirA/b', 'dirA/A/B/C']
In [18]: for d in dirs:
    ...:     if r.match(d):
    ...:         print('Regex matches: ' + d)
    ...:     else:
    ...:         print('Regex does not match: ' + d)
    ...:

Regex matches: rsync-test/dirA
Regex does not match: rsync-test/dirB
Regex does not match: rsync-test/dirC
Regex does not match: rsync-test/dirC/dirCA
Regex does not match: a/dirAk/b
Regex matches: a/dirA/b
Regex matches: dirA/A/B/C

I think I have found what I want. I need to account for "/", since these are subdirectories.
I also suggest that the GCP gsutil team provide more examples of using regexes, as they are a little complex but commonly needed.
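Taking "/" into account can be generalized to several directory names at once, which was the original question. The sketch below is one possible approach, not an official gsutil recipe: `build_dir_exclude_pattern` is a hypothetical helper, the paths are made up, and it assumes forward-slash relative paths matched with `re.match`.

```python
import re


def build_dir_exclude_pattern(dir_names):
    """Hypothetical helper: build a regex that matches any relative path
    containing a whole path segment equal to one of dir_names."""
    names = "|".join(re.escape(n) for n in dir_names)
    # Optional leading directories, then the exact name, then either the
    # end of the path or a "/" followed by anything.
    return rf"^(?:.*/)?(?:{names})(?:/.*)?$"


pattern = build_dir_exclude_pattern(["a", "b", "c"])
regex = re.compile(pattern)

# Hypothetical paths relative to the source directory "d".
paths = ["a/x.txt", "sub/b/y.txt", "c", "ab/x.txt", "sub/abc.txt", "keep/z.txt"]
matched = [p for p in paths if regex.match(p)]

print(pattern)
print(matched)
```

Because the name must fill a whole path segment, `ab/x.txt` and `sub/abc.txt` are not matched. Note that a plain file named `a`, `b`, or `c` would also match. The resulting pattern string could be quoted and passed to -x, ideally together with -n first to confirm what would be skipped.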
