Validates that input and output GCS paths specify a bucket #2602

jkff · 2017-04-19T23:11:41Z

Context: http://stackoverflow.com/questions/43505776/google-dataflow-workflow-error

To be backported into Dataflow SDK as well.

dhalperi · 2017-04-19T23:39:03Z

gs://bucket seems like a valid output prefix. We should recognize that it's a directory, append the /, and then create files like gs://bucket/-0000-of-0001.txt or whatever.
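To make the suggestion concrete, here is a minimal sketch of how a bucket-only prefix would turn into shard file names. It is illustrative Python, not the SDK's code: `as_directory`, `shard_name`, the four-digit shard template, and the `.txt` suffix are all assumptions chosen to match the example name above.

```python
def as_directory(path: str) -> str:
    """Append a trailing slash if the path does not already end with one."""
    return path if path.endswith("/") else path + "/"


def shard_name(prefix: str, shard: int, num_shards: int, suffix: str = ".txt") -> str:
    """Build a shard file name from a prefix (hypothetical naming template)."""
    return f"{prefix}-{shard:04d}-of-{num_shards:04d}{suffix}"


# A bucket-only prefix is treated as a directory, so the basename is empty
# and the file name starts with the shard marker.
print(shard_name(as_directory("gs://bucket"), 0, 1))

# A prefix that already carries a basename keeps it in front of the marker.
print(shard_name("gs://bucket/results", 0, 1))
```

The first call yields a name that begins with "-", which is the oddly named file the rest of the thread argues about.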

Am I missing something fundamental?

jkff · 2017-04-19T23:45:32Z

If a user specifies gs://something, it could be either that 1) they forgot to specify the bucket, or that 2) they really want to write to bucket gs://something. [assumption X:] I think it's unlikely that they really want files named like "-0000-of-0001.txt", so in case 2 I'd assume that they forgot to specify the basename.

With the current PR's approach, they'll get an error "please specify a bucket" and:

In case 1, they will specify it.
In case 2, they will specify an output prefix like gs://something/basename.

With your suggested approach, they'll get no error and:

In case 1, they'll most likely get an error "bucket gs://something does not exist" which is a little confusing but understandable.
In case 2, the pipeline will succeed but they'll get wrong-named files, notice it only after the fact, and will have to rerun the pipeline.

Assumption X is the critical one; if it's valid, then my approach seems preferable; if it's invalid, then yours.
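The fail-fast behaviour described above can be sketched roughly as follows. This is illustrative Python, not the code in this PR: the function names, the error wording, and the rule "reject any path whose bucket or object part is empty" are assumptions made for the example.

```python
from typing import Tuple

GCS_SCHEME = "gs://"


def parse_gcs_path(path: str) -> Tuple[str, str]:
    """Split 'gs://bucket/object' into (bucket, object); object may be empty."""
    if not path.startswith(GCS_SCHEME):
        raise ValueError(f"Not a GCS path: {path!r}")
    bucket, _, obj = path[len(GCS_SCHEME):].partition("/")
    return bucket, obj


def validate_gcs_path(path: str) -> Tuple[str, str]:
    """Reject paths that lack a bucket or an object component (assumed rule)."""
    bucket, obj = parse_gcs_path(path)
    if not bucket or not obj:
        raise ValueError(
            f"Missing bucket or object in {path!r}; "
            "expected something like gs://some-bucket/basename"
        )
    return bucket, obj


for candidate in ("gs://something", "gs://something/", "gs://something/basename"):
    try:
        print(candidate, "->", validate_gcs_path(candidate))
    except ValueError as err:
        print(candidate, "-> rejected:", err)
```

With a check like this, both case 1 and case 2 surface at pipeline construction time instead of after a run that produced wrongly named files.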

jkff · 2017-04-20T22:08:19Z

retest this please

coveralls · 2017-04-20T23:12:11Z

Coverage increased (+0.2%) to 70.319% when pulling 529cbd7 on jkff:gcs-bucket into 3ef614c on apache:master.

jkff · 2017-04-21T18:20:32Z

Dan is swamped with stuff. R: @lukecwik instead.

lukecwik · 2017-04-21T20:04:01Z

Looking at GcsUtil.expand, we do not support gs://some-bucket as "read everything in this bucket"; we expect the object to be specified, as in gs://some-bucket/
Reading from gs://some-bucket is really reading from gs://some-bucket/, where the object is the empty string.

For writing, a user could technically say they want gs://some-bucket and, as Dan pointed out, we could write files to gs://some-bucket/-0001-of-0004.txt

Looking at GcsPath, it seems that if we parse gs://some-bucket and then turn it back into a string, we get gs://some-bucket/. So I'm thinking that the user error posted on SO should not have happened, as the output should have been specified as gs://some-bucket/-0001-of-0004.txt
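The round trip described here can be pictured with a small sketch. It is plain Python that mimics the described behaviour and is not the GcsPath class itself; `parse_gcs_path`, `format_gcs_path`, and `normalize` are hypothetical helper names.

```python
from typing import Tuple


def parse_gcs_path(path: str) -> Tuple[str, str]:
    """Split a gs:// path into (bucket, object); a missing object becomes ''."""
    if not path.startswith("gs://"):
        raise ValueError(f"Not a GCS path: {path!r}")
    bucket, _, obj = path[len("gs://"):].partition("/")
    return bucket, obj


def format_gcs_path(bucket: str, obj: str) -> str:
    """Render (bucket, object) back to a string, always emitting the slash."""
    return f"gs://{bucket}/{obj}"


def normalize(path: str) -> str:
    """Parse and re-render a path, as in the round trip described above."""
    return format_gcs_path(*parse_gcs_path(path))


# Both spellings parse to the same (bucket, empty object) pair ...
print(parse_gcs_path("gs://some-bucket"))
print(parse_gcs_path("gs://some-bucket/"))
# ... so the bucket-only form comes back with a trailing slash.
print(normalize("gs://some-bucket"))
```

Under this model gs://some-bucket and gs://some-bucket/ are indistinguishable after parsing, which is why the object being empty is the condition worth checking.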

Looking at gsutil:
gsutil cat gs://some-bucket (dumps all objects underneath gs://some-bucket)
gsutil cp gs://some-bucket (asks whether I forgot -r)
gsutil cp -r gs://some-bucket (copies everything recursively)

I agree with what @jkff has proposed: users almost always want a non-empty object, both for input (for glob expansion) and for output (to protect people from strangely named output files).

lukecwik · 2017-04-21T20:07:07Z

LGTM

Validates that input and output GCS paths specify a bucket

529cbd7

jkff force-pushed the gcs-bucket branch from 3723579 to 529cbd7 Compare April 19, 2017 23:50

asfgit closed this in 0527f6b Apr 21, 2017

jkff deleted the gcs-bucket branch April 21, 2017 20:41

vikkyrk mentioned this pull request May 5, 2017

[BEAM-2143]: Fix default temp location for DataflowRunner #2907

Closed

