
Create separate compression-specific layer to enable writing gzipped files #91

Closed · tmylk opened this issue Oct 5, 2016 · 16 comments

tmylk (Contributor) commented Oct 5, 2016

Implement the solution described by mpenkov in #82 (comment)

tmylk changed the title Create separate compression-specific layer → Create separate compression-specific layer to enable writing gzipped files on Oct 5, 2016
mpenkov (Collaborator) commented Oct 5, 2016

Is anybody working on this? If not, I'll have a look at it.

tmylk (Contributor, Author) commented Oct 5, 2016

@mpenkov You will be very welcome!

mpenkov (Collaborator) commented Oct 7, 2016

@tmylk @piskvorky I've started working on this on a separate branch.

Basically, I rewrote the S3 subsystem using a hierarchy of classes based on the native io library. The S3 subsystem now returns file-like objects that can be passed to other decoders, etc. The existing tests, as well as ones I added, pass; things seem to work well. Almost.
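
To sketch the layering (names are illustrative, not the actual branch code): a raw reader subclasses io.RawIOBase, and the standard buffered wrappers turn it into an ordinary file-like object. The in-memory stream below stands in for the real network fetch.

```python
import io

class S3RawReader(io.RawIOBase):
    """Toy stand-in for the raw S3 layer: serves bytes from memory."""

    def __init__(self, data):
        self._stream = io.BytesIO(data)  # placeholder for the S3 download

    def readable(self):
        return True

    def readinto(self, buffer):
        chunk = self._stream.read(len(buffer))
        buffer[:len(chunk)] = chunk
        return len(chunk)

# The buffered wrapper is what callers see: a normal file-like object
# that can be handed to gzip, csv, or any other decoder.
reader = io.BufferedReader(S3RawReader(b'example payload'))
print(reader.read())  # b'example payload'
```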

gzip (using the native library) doesn't work with Python 2. The 2.x implementation tries to seek around the file, and AFAIK that just isn't possible with S3 (no random access). This is a real shame, since plugging the gzip library in is a one-liner.

The alternative is to write a separate decoder using the lower-level zlib, which we could use for Python 2.x only, keeping the native gzip for Python 3 since it is much more powerful. Please let me know what you think.
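
To make that concrete, here is a minimal sketch of what such a zlib-based decoder could look like (illustrative only, not branch code; `fileobj` stands for any sequential file-like object, e.g. the S3 reader above). The `16 + zlib.MAX_WBITS` trick makes zlib parse the gzip header itself, and it works on both 2.x and 3.x:

```python
import zlib

class GzipStreamDecoder(object):
    """Sequential gzip decoder built on zlib; never calls seek() or tell()."""

    def __init__(self, fileobj, chunk_size=16 * 1024):
        self._fileobj = fileobj        # raw compressed stream, e.g. from S3
        self._chunk_size = chunk_size
        self._buffer = b''
        # 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
        self._decompressor = zlib.decompressobj(16 + zlib.MAX_WBITS)

    def read(self, size):
        while len(self._buffer) < size:
            chunk = self._fileobj.read(self._chunk_size)
            if not chunk:  # underlying stream exhausted
                self._buffer += self._decompressor.flush()
                break
            self._buffer += self._decompressor.decompress(chunk)
        result, self._buffer = self._buffer[:size], self._buffer[size:]
        return result
```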

bgreen-litl commented

👍

This issue is tagged as "easy". Is that true? Is @mpenkov's branch the best place to start on this?

mpenkov (Collaborator) commented Mar 27, 2017

It's reasonably easy. I don't know how far behind master my branch is, but I imagine it'd be a good place to start.

Working around the gzip issue is pretty much the only thing that's left, if I recall correctly @bgreen-litl

cariaso commented Jul 18, 2017

Just a nudge: I was bitten by this. In my case it was trivial to modify the filename so that .gz wasn't at the end, which is good enough for my needs, but this remains an unresolved wart on the smart_open API.

val314159 commented Jul 18, 2017

I wonder if this could be used as a backend to the FUSE filesystem library, so that you could "just" mount a smart_open drive as an actual filesystem (given you have permission, of course).

Bonus: This would allow any C++ lib to use it as well.

(btw, I've implemented multi-user encrypted filesystems w/ FUSE, so I know it can be done :)

mpenkov (Collaborator) commented Jul 20, 2017

@menshikh-iv Are you actively working on this? I've scheduled some time off in August and may have time to look into it.

menshikh-iv (Contributor) commented

@mpenkov you are welcome, feel free to contribute

mpenkov (Collaborator) commented Aug 3, 2017

@menshikh-iv I've had a look at it. The problem can be summarized as:

  • Python 2's gzip.GzipFile uses seek and tell to detect EOF. More specifically, it seeks to the end of the file to detect EOF. Obviously, this isn't something we can do when streaming from S3.
  • Python 3's gzip.GzipFile is a bit more clever. It uses zlib's decompressobj.eof flag to detect EOF, so no seeking/telling is necessary (see the sketch at the end of this comment).
  • It appears that we can't easily backport gzip.GzipFile from Python 3 and bundle it with smart_open, because it depends on a newer version of the zlib module, which in turn is implemented in C.

The above means that we should continue to use GzipStreamFile instead of gzip.GzipFile. I'm not sure how well this fits into the design that I proposed at the start of this issue - I need to think about it.

In the meantime, can someone please comment on the logic of the above? Is it really impossible to backport Python 3's gzip and bundle it with smart_open?
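
For reference, a minimal Python 3 illustration of the eof-based detection from the second bullet above (not smart_open code, just the underlying mechanism; zlib's Decompress.eof is available since 3.3):

```python
import gzip
import io
import zlib

payload = gzip.compress(b'hello smart_open')

# Feed the compressed bytes to zlib in deliberately tiny pieces;
# Decompress.eof flips to True once the end of the gzip stream is
# reached, so no seek()/tell() is ever needed.
decompressor = zlib.decompressobj(16 + zlib.MAX_WBITS)
stream = io.BytesIO(payload)
output = b''
while not decompressor.eof:
    chunk = stream.read(4)
    if not chunk:
        break
    output += decompressor.decompress(chunk)

assert output == b'hello smart_open'
```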

mpenkov (Collaborator) commented Aug 10, 2017

The other way forward would be to use boto3 (sample solution in #42), because that's seekable, but we're currently blocked from doing that (#41).
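
For illustration, one way the seeking could work (a hedged sketch, not the #42 code: bucket/key are placeholders, and buffering and error handling are omitted). Each seek() simply restarts the download from the requested offset with an HTTP Range request:

```python
import boto3

class SeekableS3Reader(object):
    """Sketch: seek() restarts the download with an HTTP Range request."""

    def __init__(self, bucket, key):
        self._client = boto3.client('s3')
        self._bucket = bucket
        self._key = key
        self._position = 0
        self._body = self._open(0)

    def _open(self, offset):
        response = self._client.get_object(
            Bucket=self._bucket,
            Key=self._key,
            Range='bytes=%d-' % offset,
        )
        return response['Body']  # botocore StreamingBody

    def seek(self, offset):
        self._body.close()
        self._body = self._open(offset)
        self._position = offset
        return self._position

    def read(self, size=None):
        data = self._body.read(size)
        self._position += len(data)
        return data

    def tell(self):
        return self._position
```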

menshikh-iv (Contributor) commented

Hi @mpenkov, sorry for the late answer (I missed the notification).

In my opinion, boto3 is the best option (moreover, we should migrate to boto3 anyway); backporting gzip from Python 3 looks difficult. About #41: how does it block the boto3 migration?

mpenkov (Collaborator) commented Aug 14, 2017

@menshikh-iv Sorry, I linked the wrong issue; the blocker is #43.

AFAICT, the reason for blocking is that boto3 may not be entirely backwards-compatible with boto, although that was brought up a while ago and the situation may have changed already.

cariaso commented Aug 14, 2017 via email

mpenkov (Collaborator) commented Aug 14, 2017

@menshikh-iv OK. I will go ahead and bring in boto3 to implement S3 seeking.

menshikh-iv (Contributor) commented

Good luck @mpenkov 👍
