Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Validates that input and output GCS paths specify a bucket #2602

Closed
wants to merge 1 commit into from

Conversation

jkff
Copy link
Contributor

@jkff jkff commented Apr 19, 2017

Context: http://stackoverflow.com/questions/43505776/google-dataflow-workflow-error

To be backported into Dataflow SDK as well.

R: @dhalperi

@dhalperi
Copy link
Contributor

gs://bucket seems like a valid output prefix. We should recognize that it's a directory, append the / , and then create files like gs://bucket/-0000-of-0001.txt or whatever.

Am I missing something fundamental?

@jkff
Copy link
Contributor Author

jkff commented Apr 19, 2017

If user specifies gs://something, it could either be that 1) they forgot to specify the bucket, or that 2) they really want to write to bucket gs://something. [assumption X:] I think it's unlikely that they really want files named like "-0000-of-0001.txt", so in case 2 I'd assume that they forgot to specify the basename.

With the current PR's approach, they'll get an error "please specify a bucket" and:

  • In case 1, they will specify it.
  • In case 2, they will specify an output prefix like gs://something/basename.

With your suggested approach, they'll get no error and:

  • In case 1, they'll most likely get an error "bucket gs://something does not exist" which is a little confusing but understandable.
  • In case 2, the pipeline will succeed but they'll get wrong-named files, notice it only after the fact, and will have to rerun the pipeline.

Assumption X is the critical one; if it's valid, then my approach seems preferable; if it's invalid, then yours.

@jkff
Copy link
Contributor Author

jkff commented Apr 20, 2017

retest this please

@coveralls
Copy link

Coverage Status

Coverage increased (+0.2%) to 70.319% when pulling 529cbd7 on jkff:gcs-bucket into 3ef614c on apache:master.

@jkff
Copy link
Contributor Author

jkff commented Apr 21, 2017

Dan is swamped with stuff. R: @lukecwik instead.

@lukecwik
Copy link
Member

Looking at GcsUtil.expand, we do not support gs://some-bucket as read everything in this bucket, we expect the object () to be specified like gs://some-bucket/
Reading from gs://some-bucket is really read from gs://some-bucket/ where the object is the empty string.

For writing out a user could technically say they want gs://some-bucket and as Dan pointed out we could write files to gs://some-bucket/-0001-of-0004.txt

Looking at GcsPath, it seems as though if we parse gs://some-bucket and then turn it back into a string we get gs://some-bucket/ so I'm thinking that the user error posted on SO should not have happened as it should have been specified as gs://some-bucket/-0001-of-0004.txt

Looking at gsutil:
gsutil cat gs://some-bucket (dumps all objects underneath gs://some-bucket)
gsutil cp gs://some-bucket (asks whether I forgot -r)
gsutil cp -r gs://some-bucket (copies everything recursively)

I'm with @jkff with what he has proposed where users are likely always wanting to have a non-empty object for input (for glob expansion) and for output (to protect people from output names being strange).

@lukecwik
Copy link
Member

LGTM

@asfgit asfgit closed this in 0527f6b Apr 21, 2017
@jkff jkff deleted the gcs-bucket branch April 21, 2017 20:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants