Figure out story for object content encodings #131

Closed
jacobsa opened this issue Sep 23, 2015 · 5 comments

Comments

@jacobsa
Contributor

jacobsa commented Sep 23, 2015

GCS objects have a contentEncoding property, sort of but not really documented here. That page implies that maybe it is always echoed as Content-Encoding when serving a read for the object, but it's not clear. This page says that it's intended to work with a value of gzip, and sort of implies by omission that it's not intended to work with other encodings. This page has slightly more detail about motivations and behavior.

Throw into the mix the fact that Go's http.Transport automatically sets Accept-Encoding: gzip on requests if no other Accept-Encoding is set (cf. Transport.DisableCompression), then transparently decompresses if it gets Content-Encoding: gzip back, and this starts to get confusing.

To do:

  1. Figure out what our current behavior is for objects with and without contentEncoding set, for values gzip and otherwise.
  2. Don't forget to test reading sub-ranges of such objects. What happens?
  3. Figure out what our behavior should be and document it in semantics.md.
  4. Add integration tests and make sure behavior matches the documentation.

(Thanks to Jurek Papiorek for raising this issue.)

@jacobsa
Contributor Author

jacobsa commented Sep 23, 2015

Don't forget:

  • Integration tests for the behavior of object composition.
  • Integration tests involving storage of actual .gz files.

@jacobsa
Contributor Author

jacobsa commented Sep 24, 2015

This is made more difficult by Google-internal bug 24347854 (which I just discovered): if you upload invalid gzip content and then go to read it back, you always get HTTP 503 no matter what you set for Accept-Encoding.

@jacobsa
Contributor Author

jacobsa commented Sep 24, 2015

Filed Google-internal bug 24347482 for the underspecified documentation on what GCS is expected to do in a bunch of cases.

@jacobsa
Contributor Author

jacobsa commented Oct 6, 2015

I've come to the conclusion that contentEncoding shouldn't/can't be supported by gcsfuse in any specific way. Rather, we should treat this like versioned buckets and explicitly say the behavior is undefined when you use such objects with gcsfuse, and advise against doing so.

Brain dump about how the contentEncoding feature is problematic:

  1. It is bug-prone: if you claim that content is gzip when it is not, GCS will serve an HTTP 503 when you go to read it. (See Google-internal bug 24347854.) I found this in my first five minutes with the feature, which makes me think there may be numerous other bugs lurking.
  2. The previous point is made worse by the fact that GCS treats some valid gzip content as invalid, serving a 503. (See Google bug 24693623.)
  3. The feature interacts poorly with the rest of the GCS API. It appears to be intended to support what I'll call "the CDN case": serving media to browsers that will take the gzip-encoded content and decode it to what the user wants to see. That works fine, but when you're using GCS carefully as a storage API it's not as good. For example, there's no way to see the length of the pre-gzip content, and you can only meaningfully compose two objects if either they are both gzipped or neither is gzipped.
  4. The feature pretends to be general—you can set contentEncoding to any string you want—but the documentation only specifies what will happen for gzip. In Google bug 24347482 it was clarified to me that other encodings are simply ignored. But this is hardly confidence-inspiring—who's to say that GCS won't suddenly start supporting bzip2, changing the behavior of a whole class of requests? Even if that never happens, you may be behind an intermediate proxy that groks bzip2.
  5. Because you can't see the pre-gzip length of the data, gcsfuse would have no choice but to surface the post-gzip data as the content of files, so that the file metadata matched the contents. Okay, that's fine, we would just read that data and return it to the user. Except the documentation doesn't make it clear that there is any reliable way to opt out of GCS's magic behavior around encodings.

     If I set Accept-Encoding: gzip on my read requests, it appears to return the original content. But given the usual use of this header, I worry that it's possible that some internal system will decode the content and some other will later re-encode it, yielding different bytes. Worse, I worry that this will cause objects without any contentEncoding property set to be gzip-encoded before being sent to me, in the mistaken belief that I'm setting this header to save bandwidth rather than to opt out of the feature. The documentation is less than helpful in making me confident this won't happen.
  6. More generally, even if GCS is totally religious about fixing the point above, there may always be an intermediate HTTP proxy that decides to screw with the content returned by GCS when it sees Accept-Encoding: gzip, especially for a read of an object that is not already encoded. Again, this feature appears to be intended only for the "user staring at content in a browser" case; otherwise the designers of the GCS API made a mistake by overloading Accept-Encoding and Content-Encoding for this feature.
  7. The feature appears to cause GCS to ignore Range headers in requests in several cases (see Google bug 24347482), which means we can't efficiently read only a portion of a very large object.

@jacobsa
Contributor Author

jacobsa commented Oct 8, 2015

For posterity: jacobsa/gcloud@ca4fb08 is a patch that starts to add contentEncoding-related tests.
